diff --git a/docs/mindformers/docs/source_en/feature/images/sliding_window.png b/docs/mindformers/docs/source_en/feature/images/sliding_window.png
new file mode 100644
index 0000000000000000000000000000000000000000..a7f218e487add3ee210ee772637a2aa718b26d2f
Binary files /dev/null and b/docs/mindformers/docs/source_en/feature/images/sliding_window.png differ
diff --git a/docs/mindformers/docs/source_en/feature/other_training_features.md b/docs/mindformers/docs/source_en/feature/other_training_features.md
index 9de6489abe8af571cc1022e804815bf9bba9dc2c..4f073d82ccc04f252610e8a73923081d63108e1e 100644
--- a/docs/mindformers/docs/source_en/feature/other_training_features.md
+++ b/docs/mindformers/docs/source_en/feature/other_training_features.md
@@ -229,3 +229,62 @@ context:
 | device_id | The id of the device to be configured | Replace the letter `id` with effective number. |
 | affinity_cpu_list | Manually specifies the CPU affinity range for the process. Format: `["cpuidX-cpuidY"]` (e.g. `["0-3", "8-11"]`) | (list, optional) - Default: `None`. |
 | module_to_cpu_dict | Customizes core binding for specific modules. Valid keys (module names) are`main`, `runtime`, `pynative`, `minddata`. Valid value is a list of int indices representing CPU cores (e.g. `{"main": [0,1], "minddata": [6,7]}`) | (dict, optional) - Default: `None`. |
+
+## SlidingWindowAttention
+
+### Overview
+
+SlidingWindowAttention is a sparse attention mechanism that restricts each token to attending only to other tokens within a local window, which avoids the quadratic growth of computational complexity with sequence length in standard Transformer models. The core idea is to narrow the attention range from global to a fixed window size.
+
+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+When using the SlidingWindowAttention module, you need to configure the `window_size` and `window_attn_skip_freq` items under the `model_config` item in the configuration file.
+
+The type of `window_size` is `Tuple[int, int]`, where `window_size[0]` represents `pre_tokens` and `window_size[1]` represents `next_tokens`. Both are integers not less than -1, where -1 is a special value meaning "infinite window size". By default the window is anchored at the bottom-right corner, as shown in the following figure:
+
+![sliding_window](./images/sliding_window.png)
+
+The type of `window_attn_skip_freq` is `Union[int, List[int]]`, which specifies how often a full-attention layer is interleaved among the sliding window attention (SWA) layers. It accepts either of the following:
+
+- An integer N: an (N-1):1 ratio, i.e., one full-attention layer after every (N-1) SWA layers.
+- A list defining a custom pattern, such as [1,1,1,1,0,0,0], where 1 denotes an SWA layer and 0 denotes a full-attention layer.
+
+Example:
+
+```yaml
+model_config:
+  ...
+  window_size: (10, 0)
+  window_attn_skip_freq: 2
+  ...
+```
+
+## SharedKVCrossAttention
+
+### Overview
+
+SharedKVCrossAttention is an attention mechanism that keeps only a single KV cache: the shared KV cache generated by the self-decoder (encoder) layers is reused by the subsequent decoder layers through cross-attention.
+
+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+When using the SharedKVCrossAttention module, you need to configure the `model_architecture`, `num_encoder_layers`, `num_decoder_layers`, and `num_layers` items under the `model_config` item in the configuration file.
+
+`model_architecture` specifies the model architecture; the available options are `decoder_only` and `yoco`. Currently, only `yoco` supports SharedKVCrossAttention.
+
+`num_encoder_layers` represents the number of encoder layers, and `num_decoder_layers` represents the number of decoder layers. The sum of the two must equal `num_layers`. When `model_architecture` is set to `yoco`, SharedKVCrossAttention is enabled after the encoder layers end, that is, starting from the first decoder layer. For example, if `num_encoder_layers` is set to 1 and `num_decoder_layers` is set to 1, SharedKVCrossAttention is enabled from the second layer onwards. If only `num_encoder_layers` is configured, SharedKVCrossAttention is not enabled. If only `num_decoder_layers` is configured, SharedKVCrossAttention is enabled from the first layer onwards.
+
+Example:
+
+```yaml
+model_config:
+  ...
+  model_architecture: "yoco"
+  num_layers: 2
+  num_encoder_layers: 1
+  num_decoder_layers: 1
+  ...
+```
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/feature/images/sliding_window.png b/docs/mindformers/docs/source_zh_cn/feature/images/sliding_window.png
new file mode 100644
index 0000000000000000000000000000000000000000..a7f218e487add3ee210ee772637a2aa718b26d2f
Binary files /dev/null and b/docs/mindformers/docs/source_zh_cn/feature/images/sliding_window.png differ
diff --git a/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md b/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
index af6190013974da08e9cf2a97a51edf73ff844f48..4753b1a82681c9ad3d453e9bf17b85428ef9ce4f 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
@@ -229,3 +229,62 @@ context:
 | device_X | 需要配置的设备`id` | 将`X`替换为有效数字 |
 | affinity_cpu_list | 自定义指定本进程的绑核CPU范围。传入列表需要为`["cpuidX-cpuidY"]` 格式,例如 `["0-3", "8-11"]` | (list, 可选) - 默认值:`None`。 |
 | module_to_cpu_dict | 自定义指定的绑核策略。传入字典的key需要为模块名称字符串,目前支持传入`main` 、 `runtime` 、 `pynative` 、 `minddata`;value需要为包含 `int` 元素的列表,表示绑核CPU范围中的索引,例如 `{"main": [0,1], "minddata": [6,7]}` | (dict, 可选) - 默认值:`None`。 |
+
+## SlidingWindowAttention
+
+### 概述
+
+SlidingWindowAttention是一种稀疏注意力机制,通过限制每个token仅关注局部窗口内的其他token,解决标准Transformer模型计算复杂度随序列长度二次增长的问题。其核心思想是将注意力范围从全局缩小到固定窗口大小。
+
+### 配置与使用
+
+#### YAML 参数配置
+
+用户在使用SlidingWindowAttention模块时,需要在配置文件中的 `model_config` 项下配置 `window_size` 项和 `window_attn_skip_freq` 项。
+
+`window_size`类型为`Tuple[int, int]`,其中`window_size[0]`代表`pre_tokens`,`window_size[1]`代表`next_tokens`。二者均为不小于-1的整数,-1是特殊值,表示“无限窗口大小”。窗口默认以右下角为起点,如下图所示:
+
+![sliding_window](./images/sliding_window.png)
+
+`window_attn_skip_freq`类型为`Union[int, List[int]]`,表示在滑动窗口注意力(SWA)层中插入全注意力层的频率。接受以下任一配置:
+
+- 整数N:表示(N-1):1的比例,即每(N-1)个SWA层之后接1个全注意力层。
+- 自定义模式列表,例如:[1,1,1,1,0,0,0],其中1表示SWA层,0表示全注意力层。
+
+配置示例:
+
+```yaml
+model_config:
+  ...
+  window_size: (10, 0)
+  window_attn_skip_freq: 2
+  ...
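+  # 说明:window_attn_skip_freq 也可以配置为自定义模式列表(示意写法,取值请按实际模型调整),
+  # 例如 window_attn_skip_freq: [1, 1, 1, 0],表示每 3 个 SWA 层之后接 1 个全注意力层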
+```
+
+## SharedKVCrossAttention
+
+### 概述
+
+SharedKVCrossAttention是一种仅需保存一份KV缓存的注意力机制:自解码器(编码器)各层生成的共享KV缓存,会被后续解码器各层通过交叉注意力复用。
+
+### 配置与使用
+
+#### YAML 参数配置
+
+用户在使用SharedKVCrossAttention模块时,需要在配置文件中的 `model_config` 项下配置 `model_architecture`、`num_encoder_layers`、`num_decoder_layers` 和 `num_layers` 项。
+
+`model_architecture`表示模型结构,可选值为`decoder_only`和`yoco`,目前只有`yoco`支持SharedKVCrossAttention。
+
+`num_encoder_layers`表示编码器层数,`num_decoder_layers`表示解码器层数,二者之和须等于`num_layers`。当`model_architecture`设定为`yoco`时,SharedKVCrossAttention会在编码器层结束后使能,即从解码器层开始使能。例如,设定`num_encoder_layers`为1、`num_decoder_layers`为1,则SharedKVCrossAttention从第二层开始使能。如果只配置了`num_encoder_layers`,SharedKVCrossAttention不会使能;如果只配置了`num_decoder_layers`,则SharedKVCrossAttention从第一层开始使能。
+
+配置示例:
+
+```yaml
+model_config:
+  ...
+  model_architecture: "yoco"
+  num_layers: 2
+  num_encoder_layers: 1
+  num_decoder_layers: 1
+  ...
+```
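+
+下面是一个层数更多时的参考写法(示意配置,字段取值仅用于说明,实际请按模型规模调整)。按照上述规则,编码器层之后的解码器层才会使能SharedKVCrossAttention,因此该配置下SharedKVCrossAttention从第3层开始使能:
+
+```yaml
+model_config:
+  ...
+  model_architecture: "yoco"
+  num_layers: 4
+  num_encoder_layers: 2
+  num_decoder_layers: 2
+  ...
+```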