From c42313fb46736943a9d4a0c4069d7d60db883b4f Mon Sep 17 00:00:00 2001
From: zhangyihuiben
Date: Wed, 3 Dec 2025 16:36:20 +0800
Subject: [PATCH] Fix inconsistencies in formatting, capitalization, and typos in the documentation.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../accuracy_comparison.md                    | 142 +++++++++---------
 .../inference_precision_comparison.md         |   2 +-
 .../precision_optimization.md                 |  30 ++--
 .../docs/source_en/feature/resume_training.md |  18 +--
 .../accuracy_comparison.md                    |   4 +-
 .../inference_precision_comparison.md         |   2 +-
 .../performance_optimization.md               |   2 +-
 .../precision_optimization.md                 |  30 ++--
 .../training_template_instruction.md          |   2 +-
 .../example/distilled/distilled.md            |   4 +-
 .../source_zh_cn/feature/resume_training.md   |  18 +--
 .../docs/source_zh_cn/guide/deployment.md     |  16 +-
 docs/mindformers/docs/source_zh_cn/index.rst  |   2 +-
 13 files changed, 136 insertions(+), 136 deletions(-)

diff --git a/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md b/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
index 818ccc6a28..4c688eb4e8 100644
--- a/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
+++ b/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
@@ -53,77 +53,77 @@ The following tables describe the configuration comparison with Megatron-LM.

 - Model configurations

-  This document supports only the precision comparison of the mcore model. Therefore, `use-mcore-model` must be configured for Megatron-LM, and `use_legacy: False` must be configured for MindSpore Transformers.
-
-  | Megatron-LM | Description | MindSpore Transformers | Description |
-  |--------------------------------------------|---------------------------------------------|--------------------------------------------|---------------------------------------------------------------------|
-  | `use-legacy-model` and `use-mcore-model` | Specifies whether to use the mcore model. | `use_legacy` | Specifies whether to use the mcore model. |
-  | `num-layers` | Number of network layers, that is, number of transformer layers. | `num_layers` | Number of network layers, that is, number of transformer layers. |
-  | `encoder-num-layers` | Number of encoder layers. | Not supported. | |
-  | `decoder-num-layers` | Number of decoder layers. | Not supported. | |
-  | `hidden-size` | Size of the hidden layer, which is the dimension in the hidden state. | `hidden_size` | Size of the hidden layer, which is the dimension in the hidden state. |
-  | `ffn-hidden-size` | Size of the hidden layer in the feedforward network. | `intermediate_size` | Size of the hidden layer in the feedforward network. |
-  | `num-attention-heads` | Number of attention heads. | `num_heads` | Number of attention heads. |
-  | `kv-channels` | Number of key/value tensor channels. | `head_dim` | Number of key/value tensor channels. |
-  | `group-query-attention` | Specifies whether to enable group query attention. | `use_gqa` | Specifies whether to enable group query attention. |
-  | `num-query-groups` | Number of query groups. | `n_kv_heads` | Number of query groups. <br>|
-  | `max-position-embeddings` | Maximum position encoding length. | `max_position_embeddings` | Maximum position encoding length. |
-  | `position-embedding-type` | Position encoding type, such as learned_absolute and rope. | `position_embedding_type` | Position encoding type, such as learned_absolute and rope. |
-  | `use-rotary-position-embeddings` | Specifies whether to use rotary position embedding (RoPE). | Specified by `position_embedding_type`==`rope` | Specifies whether to use RoPE. |
-  | `rotary-base` | Rotary base used for RoPE. | `rotary_base` | Rotary base used for RoPE. |
-  | `rotary-percent` | RoPE usage ratio. | `rotary_percent` | RoPE usage ratio. |
-  | `rotary-interleaved` | Specifies whether to use interleaved RoPE. | `rotary_interleaved` | Specifies whether to use interleaved RoPE. |
-  | `rotary-seq-len-interpolation-factor` | Rotary sequence length interpolation factor. | `rotary_seq_len_interpolation_factor` | Rotary sequence length interpolation factor. |
-  | `use-rope-scaling` | Specifies whether to enable RoPE scaling. | `use_rope_scaling` | Specifies whether to enable RoPE scaling. |
-  | `rope-scaling-factor` | RoPE scaling factor. | `scaling_factor` | RoPE scaling factor. |
-  | `no-position-embedding` | Specifies whether to disable location encoding. | `no-position-embedding` | Specifies whether to disable location encoding. |
-  | `disable-bias-linear` | Disables bias in linear layers. | `add_bias_linear` | Enables bias in linear layers. |
-  | `mrope-section` | Information of multiple RoPE sections. | Not supported. | |
-  | `make-vocab-size-divisible-by` | Divides the size of the word table by a specified number. | Not supported. | By default, the dictionary size is not changed. |
-  | `init-method-std` | Standard deviation of the normal distribution used during model parameter initialization. | `init_method_std` | Standard deviation of the normal distribution used during model parameter initialization. |
-  | `attention-dropout` | Dropout probability applied in the multi-head self-attention mechanism. | `attention_dropout` | Dropout probability applied in the multi-head self-attention mechanism. |
-  | `hidden-dropout` | Dropout probability in the hidden layer. | `hidden_dropout` | Dropout probability in the hidden layer. |
-  | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. |
-  | `norm-epsilon` | Normalized stability factor (epsilon). | `rms_norm_eps` | RMSNorm stability factor. |
-  | `apply-layernorm-1p` | Specifies whether to add 1 after LayerNorm. | Not supported. | |
-  | `apply-residual-connection-post-layernorm` | Specifies whether the residual connection is applied after LayerNorm. | `apply_residual_connection_post_layernorm` | Specifies whether the residual connection is applied after LayerNorm. |
-  | `openai-gelu` | Specifies whether to use the GELU activation function of the OpenAI version. | Not supported. | |
-  | `squared-relu` | Specifies whether to use the square ReLU activation function. | Not supported. | |
-  | Specified by `swiglu`, `openai-gelu`, and `squared-relu` | The default value is **torch.nn.functional.gelu**. | `hidden_act` | Activation function type. |
-  | `gated_linear_unit` | Specifies whether to use gate linear unit in multi-layer perceptron (MLP). | `gated_linear_unit` | Specifies whether to use gate linear unit in MLP. |
-  | `swiglu` | Specifies whether to use the SwiGLU activation function. <br>| `hidden_act`==`silu` and `gated_linear_unit`| Specifies whether to use the SwiGLU activation function. |
-  | `no-persist-layer-norm` | Disables persistence layer normalization. | Not supported. | |
-  | `untie-embeddings-and-output-weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | `untie_embeddings_and_output_weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. |
-  | Specified by `fp16` and `bf16` | Tensor compute precision during training. | `compute_dtype` | Tensor compute precision during training. |
-  | `grad-reduce-in-bf16` | Gradient reduction using BFloat16. | Not supported. | |
-  | Not supported. | By default, the initialization tensor is generated in BFloat16 format. | `param_init_type` | Initial precision of the weight tensor. The default value is **Float32**, which ensures that the backward gradient is updated in Float32. |
-  | Not supported. | By default, layer normalization is calculated in Float32. | `layernorm_compute_type` | Layer normalization tensor calculation precision. |
-  | `attention-softmax-in-fp32` | Executes **attention softmax** in Float32. | `softmax_compute_type` | Softmax tensor calculation precision. |
-  | Not supported. | | `rotary_dtype` | Position encoding tensor calculation precision. |
-  | `loss-scale` | Overall loss scaling factor. | `loss_scale_value` | Overall loss scaling factor, which is configured in **runner_wrapper**. If `compute_dtype` is set to **BFloat16**, the value is usually set to **1.0**.|
-  | `initial-loss-scale` | Initial loss scaling factor. | Not supported. | |
-  | `min-loss-scale` | Minimum loss scaling factor. | Not supported. | |
-  | `loss-scale-window` | Dynamic window size scaling. | `loss_scale_window` | Dynamic window size scaling. |
-  | `hysteresis` | Loss scale hysteresis parameter. | Not supported. | |
-  | `fp32-residual-connection` | Uses Float32 for residual connection. | Not supported. | |
-  | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32. | Not supported. | Accumulates and reduces gradients using Float32 by default. |
-  | `fp16-lm-cross-entropy` | Uses Float16 to execute the cross entropy of the LLM. | Not supported. | Uses Float32 to execute the cross entropy of the LLM by default. |
-  | `q-lora-rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. |
-  | `kv-lora-rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | `kv_lora_rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. |
-  | `qk-head-dim` | Number of dimensions per Q/K head. | `qk_nope_head_dim` | Number of dimensions per Q/K head. |
-  | `qk-pos-emb-head-dim` | Number of relative position embedding dimensions per Q/K head. | `qk_rope_head_dim` | Number of relative position embedding dimensions per Q/K head. |
-  | `v-head-dim` | Number of dimensions per value projection (V head). | `v_head_dim` | Number of dimensions per value projection (V head). |
-  | `rotary-scaling-factor` | RoPE scaling coefficient.| `scaling_factor` | RoPE scaling coefficient. |
-  | `use-precision-aware-optimizer` | Enables the optimizer with precision awareness to automatically manage parameter updates of different data types. | Not supported. | |
-  | `main-grads-dtype` | Data type of the main gradient. | Not supported. <br>| By default, Float32 is used as the data type of the main gradient. |
-  | `main-params-dtype` | Data type of the main parameter. | Not supported. | By default, Float32 is used as the data type of the main parameter. |
-  | `exp-avg-dtype` | Data type of the exponential moving average (EMA). | Not supported. | |
-  | `exp-avg-sq-dtype` | Data type of the EMA square item. | Not supported. | |
-  | `first-last-layers-bf16` | Specifies whether to forcibly use BFloat16 at the first and last layers. | Not supported. | |
-  | `num-layers-at-start-in-bf16` | Number of layers that start with BFloat16. | Not supported. | |
-  | `num-layers-at-end-in-bf16` | Number of layers that end with BFloat16. | Not supported. | |
-  | `multi-latent-attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | `multi_latent_attention` | Specifies whether to enable the multi-hidden variable attention mechanism. |
-  | `qk-layernorm` | Enables query/key layer normalization. | `qk-layernorm` | Enables query/key layer normalization. |
+  This document supports only the precision comparison of the mcore model. Therefore, `--use-mcore-model` must be configured for Megatron-LM, and `use_legacy: False` must be configured for MindSpore Transformers.
+
+  | Megatron-LM | Description | MindSpore Transformers | Description |
+  |--------------------------------------------|---------------------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
+  | `use-legacy-model` and `use-mcore-model` | Specifies whether to use the mcore model. | `use_legacy` | Specifies whether to use the mcore model. `use_legacy: False` is equivalent to `--use-mcore-model`. |
+  | `num-layers` | Number of network layers, that is, number of transformer layers. | `num_layers` | Number of network layers, that is, number of transformer layers. |
+  | `encoder-num-layers` | Number of encoder layers. | Not supported. | |
+  | `decoder-num-layers` | Number of decoder layers. | Not supported. | |
+  | `hidden-size` | Size of the hidden layer, which is the dimension in the hidden state. | `hidden_size` | Size of the hidden layer, which is the dimension in the hidden state. |
+  | `ffn-hidden-size` | Size of the hidden layer in the feedforward network. | `intermediate_size` | Size of the hidden layer in the feedforward network. |
+  | `num-attention-heads` | Number of attention heads. | `num_heads` | Number of attention heads. |
+  | `kv-channels` | Number of key/value tensor channels. | `head_dim` | Number of key/value tensor channels. |
+  | `group-query-attention` | Specifies whether to enable group query attention. | `use_gqa` | Specifies whether to enable group query attention. |
+  | `num-query-groups` | Number of query groups. | `n_kv_heads` | Number of query groups. |
+  | `max-position-embeddings` | Maximum position encoding length. | `max_position_embeddings` | Maximum position encoding length. |
+  | `position-embedding-type` | Position encoding type, such as learned_absolute and rope. | `position_embedding_type` | Position encoding type, such as learned_absolute and rope. |
+  | `use-rotary-position-embeddings` | Specifies whether to use rotary position embedding (RoPE). | Specified by `position_embedding_type`==`rope` | Specifies whether to use RoPE. |
+  | `rotary-base` | Rotary base used for RoPE. | `rotary_base` | Rotary base used for RoPE. |
+  | `rotary-percent` | RoPE usage ratio. | `rotary_percent` | RoPE usage ratio. |
+  | `rotary-interleaved` | Specifies whether to use interleaved RoPE. | `rotary_interleaved` | Specifies whether to use interleaved RoPE. |
+  | `rotary-seq-len-interpolation-factor` | Rotary sequence length interpolation factor. | `rotary_seq_len_interpolation_factor` | Rotary sequence length interpolation factor. |
+  | `use-rope-scaling` | Specifies whether to enable RoPE scaling. | `use_rope_scaling` | Specifies whether to enable RoPE scaling. |
+  | `rope-scaling-factor` | RoPE scaling factor. | `scaling_factor` | RoPE scaling factor. |
+  | `no-position-embedding` | Specifies whether to disable position encoding. | `no-position-embedding` | Specifies whether to disable position encoding. |
+  | `disable-bias-linear` | Disables bias in linear layers. | `add_bias_linear` | Enables bias in linear layers. |
+  | `mrope-section` | Information of multiple RoPE sections. | Not supported. | |
+  | `make-vocab-size-divisible-by` | Pads the vocabulary size to be divisible by a specified number. | Not supported. | By default, the vocabulary size is not changed. |
+  | `init-method-std` | Standard deviation of the normal distribution used during model parameter initialization. | `init_method_std` | Standard deviation of the normal distribution used during model parameter initialization. |
+  | `attention-dropout` | Dropout probability applied in the multi-head self-attention mechanism. | `attention_dropout` | Dropout probability applied in the multi-head self-attention mechanism. |
+  | `hidden-dropout` | Dropout probability in the hidden layer. | `hidden_dropout` | Dropout probability in the hidden layer. |
+  | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. |
+  | `norm-epsilon` | Normalized stability factor (epsilon). | `rms_norm_eps` | RMSNorm stability factor. |
+  | `apply-layernorm-1p` | Specifies whether to add 1 after LayerNorm. | Not supported. | |
+  | `apply-residual-connection-post-layernorm` | Specifies whether the residual connection is applied after LayerNorm. | `apply_residual_connection_post_layernorm` | Specifies whether the residual connection is applied after LayerNorm. |
+  | `openai-gelu` | Specifies whether to use the GELU activation function of the OpenAI version. | Not supported. | |
+  | `squared-relu` | Specifies whether to use the square ReLU activation function. | Not supported. | |
+  | Specified by `swiglu`, `openai-gelu`, and `squared-relu` | The default value is **torch.nn.functional.gelu**. | `hidden_act` | Activation function type. |
+  | `gated_linear_unit` | Specifies whether to use the gated linear unit in the multi-layer perceptron (MLP). | `gated_linear_unit` | Specifies whether to use the gated linear unit in the MLP. |
+  | `swiglu` | Specifies whether to use the SwiGLU activation function. | `hidden_act`==`silu` and `gated_linear_unit`| Specifies whether to use the SwiGLU activation function. |
+  | `no-persist-layer-norm` | Disables persistence layer normalization. | Not supported. | |
+  | `untie-embeddings-and-output-weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | `untie_embeddings_and_output_weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. |
+  | Specified by `fp16` and `bf16` | Tensor compute precision during training. | `compute_dtype` | Tensor compute precision during training. |
+  | `grad-reduce-in-bf16` | Gradient reduction using BFloat16. | Not supported. | |
+  | Not supported. | By default, the initialization tensor is generated in BFloat16 format. | `param_init_type` | Initial precision of the weight tensor. The default value is **Float32**, which ensures that the backward gradient is updated in Float32. |
+  | Not supported. | By default, layer normalization is calculated in Float32. | `layernorm_compute_type` | Layer normalization tensor calculation precision. |
+  | `attention-softmax-in-fp32` | Executes **attention softmax** in Float32. | `softmax_compute_type` | Softmax tensor calculation precision. |
+  | Not supported. | | `rotary_dtype` | Position encoding tensor calculation precision. |
+  | `loss-scale` | Overall loss scaling factor. | `loss_scale_value` | Overall loss scaling factor, which is configured in **runner_wrapper**. If `compute_dtype` is set to **BFloat16**, the value is usually set to **1.0**. |
+  | `initial-loss-scale` | Initial loss scaling factor. | Not supported. | |
+  | `min-loss-scale` | Minimum loss scaling factor. | Not supported. | |
+  | `loss-scale-window` | Window size for dynamic loss scaling. | `loss_scale_window` | Window size for dynamic loss scaling. |
+  | `hysteresis` | Loss scale hysteresis parameter. | Not supported. | |
+  | `fp32-residual-connection` | Uses Float32 for residual connection. | Not supported. | |
+  | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32. | Not supported. | Accumulates and reduces gradients using Float32 by default. |
+  | `fp16-lm-cross-entropy` | Computes the language model cross entropy in Float16. | Not supported. | Computes the language model cross entropy in Float32 by default. |
+  | `q-lora-rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. |
+  | `kv-lora-rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | `kv_lora_rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. |
+  | `qk-head-dim` | Number of dimensions per Q/K head. | `qk_nope_head_dim` | Number of dimensions per Q/K head. |
+  | `qk-pos-emb-head-dim` | Number of relative position embedding dimensions per Q/K head. | `qk_rope_head_dim` | Number of relative position embedding dimensions per Q/K head. |
+  | `v-head-dim` | Number of dimensions per value projection (V head). | `v_head_dim` | Number of dimensions per value projection (V head). |
+  | `rotary-scaling-factor` | RoPE scaling coefficient.| `scaling_factor` | RoPE scaling coefficient. |
+  | `use-precision-aware-optimizer` | Enables the optimizer with precision awareness to automatically manage parameter updates of different data types. | Not supported. | |
+  | `main-grads-dtype` | Data type of the main gradient. | Not supported. | By default, Float32 is used as the data type of the main gradient. |
+  | `main-params-dtype` | Data type of the main parameter. | Not supported. | By default, Float32 is used as the data type of the main parameter. |
+  | `exp-avg-dtype` | Data type of the exponential moving average (EMA). | Not supported. | |
+  | `exp-avg-sq-dtype` | Data type of the squared EMA term. | Not supported. | |
+  | `first-last-layers-bf16` | Specifies whether to forcibly use BFloat16 at the first and last layers. | Not supported. | |
+  | `num-layers-at-start-in-bf16` | Number of layers that start with BFloat16. | Not supported. | |
+  | `num-layers-at-end-in-bf16` | Number of layers that end with BFloat16. | Not supported. | |
+  | `multi-latent-attention` | Specifies whether to enable multi-latent attention (MLA). | `multi_latent_attention` | Specifies whether to enable multi-latent attention (MLA). |
+  | `qk-layernorm` | Enables query/key layer normalization. | `qk-layernorm` | Enables query/key layer normalization. |
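  For readers cross-checking the two stacks, the sketch below pairs a few of the MindSpore Transformers keys above with their Megatron-LM command-line equivalents. It is a minimal illustrative sketch, not part of any shipped template: the values are placeholders, and the exact nesting (`model.model_config`) should be verified against the configuration template actually in use.

  ```yaml
  # Hedged sketch: MindSpore Transformers keys, with the Megatron-LM
  # flag each one corresponds to noted per line. Values are placeholders.
  use_legacy: False                  # Megatron-LM: --use-mcore-model
  model:
    model_config:
      num_layers: 32                 # --num-layers 32
      hidden_size: 4096              # --hidden-size 4096
      intermediate_size: 11008       # --ffn-hidden-size 11008
      num_heads: 32                  # --num-attention-heads 32
      use_gqa: True                  # --group-query-attention
      n_kv_heads: 8                  # --num-query-groups 8
      position_embedding_type: rope  # --position-embedding-type rope
      rotary_base: 10000             # --rotary-base 10000
      normalization: RMSNorm         # --normalization RMSNorm
      rms_norm_eps: 1.0e-5           # --norm-epsilon 1e-5
      add_bias_linear: False         # --disable-bias-linear
      compute_dtype: bfloat16        # --bf16
      param_init_type: float32       # no Megatron flag; keeps gradient updates in Float32
  ```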

 - Optimizer and learning rate scheduling configurations

diff --git a/docs/mindformers/docs/source_en/advanced_development/inference_precision_comparison.md b/docs/mindformers/docs/source_en/advanced_development/inference_precision_comparison.md
index 888ac80844..b08b2395da 100644
--- a/docs/mindformers/docs/source_en/advanced_development/inference_precision_comparison.md
+++ b/docs/mindformers/docs/source_en/advanced_development/inference_precision_comparison.md
@@ -68,7 +68,7 @@ When adapting a new model with a similar structure, it is generally done by dire

 Possible problems and solutions:

-- Problem: The reasoning output remains unchanged for different problems.
+- Problem: The inference output remains unchanged even when the inputs differ.
 - Possible reasons: The MLP module, MoE module, and the linear module involved in the Attention module do not require bias, but they impose bias, and there are Nans in the input and output, etc.
 - Positioning method: You can directly print the input and output of each module and observe whether the printing result is normal.
 - Solution: After confirming that a certain module has a problem, compare it with the benchmark to determine whether bias is needed for that module. If bias is not needed, simply set the configuration item of bias to False.
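The bias check in the last bullet usually ends in a one-line configuration change. A minimal sketch, assuming the model's YAML template exposes the `add_bias_linear` switch listed in the model-configuration table (where the key sits in your template may differ):

```yaml
# Illustrative only: disable bias on linear layers when the benchmark
# model's MLP/MoE/Attention linear modules carry no bias.
model:
  model_config:
    add_bias_linear: False
```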
diff --git a/docs/mindformers/docs/source_en/advanced_development/precision_optimization.md b/docs/mindformers/docs/source_en/advanced_development/precision_optimization.md
index 7cb4b2fab4..47a3359eee 100644
--- a/docs/mindformers/docs/source_en/advanced_development/precision_optimization.md
+++ b/docs/mindformers/docs/source_en/advanced_development/precision_optimization.md
@@ -24,17 +24,17 @@ Before locating the operator precision problem, we should first eliminate the in

 #### Generalized structure

-| **Key parameters** | **Descriptions** | **CheckList** |
-| ----------------- | ------------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| num_layers | Number of transformer layers | Verify that the parameters are consistent with the baseline. |
-| num_heads | Number of attention heads in transformer | Verify that the parameters are consistent with the baseline. |
-| hidden_size | Transformer hidden layer size | Verify that the parameters are consistent with the baseline. |
-| intermediate_size | Feed-Forward Network hidden layer size | Verify that the parameters are consistent with the baseline. |
-| n_kv_heads | Number of kv groups | Verify that the parameters are consistent with the baseline. |
-| Regularization function | Regularization functions, common structures are LayerNorm, RMSNorm | The specified regularization function is used in MindSpore Transformers and cannot be modified by configuration in the Legacy Model. |
-| rms_norm_eps | Regularized epsilon parameters | Verify that the parameters are consistent with the baseline. <br>|
-| dropout | dropout in the network | Currently, when MindSpore enables dropout, recalculation cannot be enabled; if precision comparison is carried out, it is recommended that both sides be closed to reduce the random factor. |
-| Fusion computation | Common fusion operators include FA, ROPE, Norm, SwigLU; some users will fuse Wq, Wk, Wv for computation | 1. For precision comparison under the same hardware, if fusion algorithms are used, they should be consistent.<br> 2. When comparing precision on different hardware, focus on checking whether there is any difference in the calculation of the fusion calculation part. |
+| **Key parameters** | **Descriptions** | **CheckList** |
+| ----------------- |---------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| num_layers | Number of transformer layers | Verify that the parameters are consistent with the baseline. |
+| num_heads | Number of attention heads in transformer | Verify that the parameters are consistent with the baseline. |
+| hidden_size | Transformer hidden layer size | Verify that the parameters are consistent with the baseline. |
+| intermediate_size | Feed-Forward Network hidden layer size | Verify that the parameters are consistent with the baseline. |
+| n_kv_heads | Number of kv groups | Verify that the parameters are consistent with the baseline. |
+| Normalization function | Normalization functions; common structures are LayerNorm and RMSNorm | The specified normalization function is used in MindSpore Transformers and cannot be modified by configuration in the Legacy model. |
+| rms_norm_eps | Normalization epsilon parameter | Verify that the parameters are consistent with the baseline. |
+| dropout | dropout in the network | Currently, recomputation cannot be enabled when MindSpore enables dropout; for precision comparison, it is recommended to disable dropout on both sides to reduce randomness. |
+| Fusion computation | Common fusion operators include FA, ROPE, Norm, SwiGLU; some users will fuse Wq, Wk, Wv for computation | 1. For precision comparison under the same hardware, if fusion algorithms are used, they should be consistent.<br> 2. When comparing precision on different hardware, focus on checking whether the fused computations differ between the platforms. |

 #### MOE Structure

@@ -65,10 +65,10 @@ Before locating the operator precision problem, we should first eliminate the in

 ### Weight CheckList

-| **Key parameters** | **Descriptions** | **CheckList** |
-| ----------------- | ------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------------------|
-| param_init_type | Weight initialization type | MindSpore Transformers usually sets the param_init_dtype type to FP32. This is because the gradient communication type needs to be the same as the weight type, controlling the communication type to be FP32. Megatron gradient communication type defaults to FP32 and is not tied to the weight type. |
-| init-method-std | Distribution of weights randomly initialized | If weighted random initialization is used, parameters such as mean/std in the random distribution need to be checked for consistency. |
+| **Key parameters** | **Descriptions** | **CheckList** |
+|--------------------| ------------------------------------------------------------ |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| params_dtype | Weight initialization type | MindSpore Transformers usually sets params_dtype to FP32. This is because the gradient communication type needs to match the weight type, which keeps the communication type FP32. The Megatron gradient communication type defaults to FP32 and is not tied to the weight type. |
+| init-method-std | Distribution of weights randomly initialized | If weighted random initialization is used, parameters such as mean/std in the random distribution need to be checked for consistency. |

 ### Mixed-precision CheckList

diff --git a/docs/mindformers/docs/source_en/feature/resume_training.md b/docs/mindformers/docs/source_en/feature/resume_training.md
index 0551fe4ffe..83e1434412 100644
--- a/docs/mindformers/docs/source_en/feature/resume_training.md
+++ b/docs/mindformers/docs/source_en/feature/resume_training.md
@@ -48,16 +48,16 @@ For more information about weights, refer to [Ckpt Weights](https://www.mindspor

 ## YAML Parameter Configuration Description

-| Parameter | Description |
-| ------------------------ | ------------------------------------------------------------ |
-| load_checkpoint | Path to the weight file or folder, **required for resuming training**, default is an empty string.<br>If the configured path is an empty directory, it will fall back to using randomly initialized weights for pre-training.<br>For single-card weights, configure the path to the weight file, ensuring the parent directory does not start with "rank_". |
-| src_strategy_path_or_dir | Path to the strategy file or folder, required when **`auto_trans_ckpt=True` and load_checkpoint is a distributed weight**, default is an empty string.<br>If the weights configured in load_checkpoint do not have pipeline parallel sharding, configure any strategy file path; otherwise, configure the strategy folder path. |
-| auto_trans_ckpt | Switch for automatic weight conversion, needs to be enabled when the **weights configured in load_checkpoint do not match the distributed strategy of the current task**, default is False. |
+| Parameter | Description |
+| ------------------------ |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| load_checkpoint | Path to the weight file or folder, **required for resuming training**, default is an empty string.<br>If the configured path is an empty directory, it will fall back to using randomly initialized weights for pre-training.<br>For single-card weights, configure the path to the weight file, ensuring the parent directory does not start with "rank_". |
+| src_strategy_path_or_dir | Path to the strategy file or folder, required when **`auto_trans_ckpt=True` and load_checkpoint is a distributed weight**, default is an empty string.<br>If the weights configured in load_checkpoint do not have pipeline parallel sharding, configure any strategy file path; otherwise, configure the strategy folder path. |
+| auto_trans_ckpt | Switch for automatic weight conversion, needs to be enabled when the **weights configured in load_checkpoint do not match the distributed strategy of the current task**, default is `False`. |
 | transform_process_num | Number of processes used for automatic weight conversion, **only applicable to automatic conversion of ckpt format weights**, which can accelerate weight conversion. Default is `None` (disabled).<br>The set value must be divisible by the total number of cluster cards. A larger value increases host memory usage; reduce the number of processes if host memory is insufficient. |
-| resume_training | Switch for resuming training, can be set to `True` or the weight file name in any rank sub-folder. Default is `False`.<br>When set to `True`, it **loads the last fully saved weights** for resumption.<br>When set to a weight file name, it **loads the weights from the specified step** for resumption. |
-| load_ckpt_format | Format of the weights configured in load_checkpoint, can be set to `safetensors` or `ckpt`, default is `ckpt`. |
-| remove_redundancy | Switch for loading without redundancy, needs to be enabled when the weights configured in load_checkpoint are **safetensors format weights saved without redundancy**, default is False. |
-| load_ckpt_async | Whether to execute weight loading in parallel with model compilation. This configuration **only applies to asynchronous loading scenarios with ckpt format weights and unchanged distributed strategy**. Default is `False`. |
+| resume_training | Switch for resuming training, can be set to `True` or the weight file name in any rank sub-folder. Default is `False`.<br>When set to `True`, it **loads the last fully saved weights** for resumption.<br>When set to a weight file name, it **loads the weights from the specified step** for resumption. |
+| load_ckpt_format | Format of the weights configured in load_checkpoint, can be set to `safetensors` or `ckpt`, default is `ckpt`. |
+| remove_redundancy | Switch for loading without redundancy, needs to be enabled when the weights configured in load_checkpoint are **safetensors format weights saved without redundancy**, default is `False`. |
+| load_ckpt_async | Whether to execute weight loading in parallel with model compilation. This configuration **only applies to asynchronous loading scenarios with ckpt format weights and unchanged distributed strategy**. Default is `False`. |
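Taken together, a typical resume-training block might look like the following sketch. It is illustrative only: the paths are placeholders, and whether `remove_redundancy` or `src_strategy_path_or_dir` is needed depends on how the checkpoints were saved.

```yaml
# Hedged sketch of a resume-training configuration (paths are placeholders).
load_checkpoint: "/path/to/output/checkpoint"   # folder containing rank_x sub-folders
load_ckpt_format: "safetensors"                 # or "ckpt", matching the saved weights
resume_training: True                           # or a weight file name to resume from that step
remove_redundancy: False                        # True only for redundancy-free safetensors weights
auto_trans_ckpt: False                          # True if the distributed strategy changed
# src_strategy_path_or_dir: "/path/to/output/strategy"  # required with auto_trans_ckpt=True and distributed ckpt weights
```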

 ## Introduction to Resume Training Scenarios

diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md b/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
index 58b9bbec73..a6ab5b0aff 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
@@ -53,11 +53,11 @@ Megatron-LM 是一个面向大规模训练任务的成熟框架,具备高度

 - 模型配置

-  本文档仅支持 mcore 模型的精度比对,故 Megatron-LM 必须配置 `use-mcore-model`,MindSpore Transformers 必须配置`use_legacy: False`
+  本文档仅支持 mcore 模型的精度比对,故 Megatron-LM 必须配置 `--use-mcore-model`,MindSpore Transformers 必须配置 `use_legacy: False`。

 | Megatron-LM | 含义 | MindSpore Transformers | 含义 |
 |--------------------------------------------|---------------------------------------------|--------------------------------------------|---------------------------------------------------------------------|
-| `use-legacy-model`和`use-mcore-model`组合 | 是否使用 mcore 模型 | `use_legacy` | 是否使用 mcore 模型 |
+| `use-legacy-model`和`use-mcore-model`组合 | 是否使用 mcore 模型 | `use_legacy` | 是否使用 mcore 模型,`use_legacy: False`等价于`--use-mcore-model` |
 | `num-layers` | 网络层数,Transformer层的数量 | `num_layers` | 网络层数,Transformer层的数量 |
 | `encoder-num-layers` | 编码器(Encoder)层数 | 不支持配置 | |
 | `decoder-num-layers` | 解码器(Decoder)层数 | 不支持配置 | |
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/inference_precision_comparison.md b/docs/mindformers/docs/source_zh_cn/advanced_development/inference_precision_comparison.md
index 897ef7aa49..2ee97f9331 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/inference_precision_comparison.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/inference_precision_comparison.md
@@ -67,7 +67,7 @@

 可能出现的问题和解决方法:

-- 问题:不同的问题推理输出依旧不变。
+- 问题:不同输入的推理输出依旧不变。
 - 可能的原因:MLP模块、MoE模块以及Attention模块涉及的linear模块不需要bias,但是强加了bias,输入输出存在nan等。
 - 定位方法:可以直接打印各个模块的输入输出,观察打印结果是否正常。
 - 解决方法:确定某个模块有问题之后,对比标杆确定该模块是否需要bias。如果不需要bias,将bias的配置项设置成False即可。
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md b/docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md
index 1c5485ff08..6e4d10feab 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md
@@ -10,7 +10,7 @@

 * 数据加载时间:指的是模型加载训练数据和权重的时间,包括将数据从硬件存储设备读取到CPU、在CPU中进行数据的预处理、以及CPU数据传输到NPU的过程。对于需要切分到若干张NPU上的模型,数据加载时间还包括从一张NPU广播到其他NPU上的时间。

-* 模型正向计算(Forward)反向计算(Backward)时间,包含前向的数据计算和反向的数据微分求导。
+* 模型正向计算(Forward)和反向计算(Backward)时间,包含前向的数据计算和反向的数据微分求导。

 * 优化器时间:指的是模型参数更新时间。
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md b/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
index b18633024e..ad1592fac5 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
@@ -24,17 +24,17 @@

 #### 通用结构

-| **关键参数**                   | **说明** | **检查项** |
-| ----------------- | ------------------------------------------------------------ |-------------------------------------------------------------------------------------------|
-| num_layers | transformer层数 | 检查标杆的对应参数是否一致。 |
-| num_heads | transformer中attention heads数量 | 检查标杆的对应参数是否一致。 |
-| hidden_size | transformer隐藏层大小 | 检查标杆的对应参数是否一致。 |
-| intermediate_size | Feed-Forward Network的隐藏层大小 | 检查标杆的对应参数是否一致。 |
-| n_kv_heads | kv分组数 | 检查标杆的对应参数是否一致。 |
-| 正则化函数 | 正则化函数,常见结构有LayerNorm、RMSNorm | MindSpore Transformers中使用指定的正则化函数,Legacy模型无法通过配置修改。 |
-| rms_norm_eps | 正则化的epsilon参数 | 检查标杆的对应参数是否一致。 |
-| dropout | 网络中的dropout | 当前MindSpore开启dropout时,不能开重计算;若进行精度比对,建议两边都关闭,减少随机因素。 |
-| 融合计算 | 常见的融合算子包括FA、ROPE、Norm、SwigLU;部分用户会将Wq、Wk、Wv进行融合计算 | 1. 同硬件下进行精度比对时,若有使用融合算子,需要保持一致。<br> 2. 不同硬件下进行精度比对时,重点检查融合计算部分是否有计算差异。 |
+| **关键参数**                   | **说明** | **检查项** |
+| ----------------- |---------------------------------------------------|-------------------------------------------------------------------------------------------|
+| num_layers | transformer层数 | 检查标杆的对应参数是否一致。 |
+| num_heads | transformer中attention heads数量 | 检查标杆的对应参数是否一致。 |
+| hidden_size | transformer隐藏层大小 | 检查标杆的对应参数是否一致。 |
+| intermediate_size | Feed-Forward Network的隐藏层大小 | 检查标杆的对应参数是否一致。 |
+| n_kv_heads | kv分组数 | 检查标杆的对应参数是否一致。 |
+| 正则化函数 | 正则化函数,常见结构有LayerNorm、RMSNorm | MindSpore Transformers中使用指定的正则化函数,Legacy模型无法通过配置修改。 |
+| rms_norm_eps | 正则化的epsilon参数 | 检查标杆的对应参数是否一致。 |
+| dropout | 网络中的dropout | 当前MindSpore开启dropout时,不能开重计算;若进行精度比对,建议两边都关闭,减少随机因素。 |
+| 融合计算 | 常见的融合算子包括FA、ROPE、Norm、SwiGLU;部分用户会将Wq、Wk、Wv进行融合计算 | 1. 同硬件下进行精度比对时,若有使用融合算子,需要保持一致。<br> 2. 不同硬件下进行精度比对时,重点检查融合计算部分是否有计算差异。 |

 #### MOE结构

@@ -65,10 +65,10 @@

 ### 权重CheckList

-| **关键参数** | **说明** | **检查项** |
-| --------------- | -------------------- | ------------------------------------------------------------ |
-| param_init_type | 权重初始化类型 | MindSpore Transformers通常会设置param_init_dtype类型为FP32,这是因为梯度通信类型是跟权重类型一致,控制通信类型为FP32。而Megatron的梯度通信类型默认为FP32,不与权重类型绑定。 |
-| init-method-std | 权重随机初始化的分布 | 若使用权重随机初始化,需要检查随机分布中的mean/std等参数是否一致。 |
+| **关键参数** | **说明** | **检查项** |
+|-----------------| -------------------- |-----------------------------------------------------------------------------------------------------------------|
+| params_dtype | 权重初始化类型 | MindSpore Transformers通常会设置params_dtype类型为FP32,这是因为梯度通信类型是跟权重类型一致,控制通信类型为FP32。而Megatron的梯度通信类型默认为FP32,不与权重类型绑定。 |
+| init-method-std | 权重随机初始化的分布 | 若使用权重随机初始化,需要检查随机分布中的mean/std等参数是否一致。 |

 ### 混合精度CheckList

diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md b/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md
index e7b90ccaca..01636912f3 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md
@@ -39,7 +39,7 @@ MindSpore Transformers对于不同训练场景提供了对应的配置模板,

 ## 基本配置修改

-使用配置模版进行训练时,修改以下基础配置即可快速启动。
+使用配置模板进行训练时,修改以下基础配置即可快速启动。

 配置模板默认使用8卡。

diff --git a/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md b/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md
index 535782dbd8..48cda534dd 100644
--- a/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md
+++ b/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md
@@ -44,7 +44,7 @@ mindformers

 - **从零开始生成数据集**:适合希望自定义数据集或深入了解数据生成流程的用户。包括从种子数据集生成CoT数据和拒绝采样。请从[1.3.1 从零开始生成数据集](#131-从零开始生成数据集)开始。
 - **使用OpenR1-Math-220K数据集**:
-    - **选项1: 使用原始数据离线处理**:适合需要自定义数据处理或学习处理流程的用户。包括预处理和Packing。请从[选项1: 使用原始数据离线处理](#选项-1-使用原始数据离线处理)开始。
+    - **选项1: 使用原始数据离线处理**:适合需要自定义数据处理或学习处理流程的用户。包括预处理和packing。请从[选项1: 使用原始数据离线处理](#选项-1-使用原始数据离线处理)开始。
     - **选项2: 使用已处理好的数据**:适合希望快速开始训练的用户。案例提供预处理好的OpenR1-Math-220K数据集。请从[选项2: 使用已处理好的数据](#选项-2-使用完成转换的数据)开始。

 #### 1.3.1 从零开始生成数据集
@@ -196,7 +196,7 @@ mindformers
 - **`--save_path`**:转换后数据集的保存文件夹路径。
 - **`--register_path`**:注册路径,为当前目录下的`distilled/`文件夹。

-步骤二、**数据集Packing**
+步骤二、**数据集packing**

 MindSpore Transformers已经支持数据集packing机制,减少微调所需要的时间。
 数据集packing的配置文件放在/dataset/packing目录下。其中,需要将`path`修改成`handled_data`的路径,
diff --git a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
index 267a2ae23b..56b1359210 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
@@ -48,16 +48,16 @@ output/strategy

 ## YAML参数配置说明

-| 参数 | 描述 |
-| ------------------------ | ------------------------------------------------------------ |
-| load_checkpoint | 权重文件或文件夹路径,**断点续训时必填**,默认为空字符串。<br>当配置的路径为空目录时,会退化为使用随机初始化权重进行预训练。<br>若为单卡权重,可配置为权重文件路径,需要确保文件父目录不以"rank_"开头。 |
+| 参数 | 描述 |
+| ------------------------ |------------------------------------------------------------------------------------------------------------------------------------------|
+| load_checkpoint | 权重文件或文件夹路径,**断点续训时必填**,默认为空字符串。<br>当配置的路径为空目录时,会退化为使用随机初始化权重进行预训练。<br>若为单卡权重,可配置为权重文件路径,需要确保文件父目录不以"rank_"开头。 |
 | src_strategy_path_or_dir | 策略文件或文件夹路径,**`auto_trans_ckpt=True`且load_checkpoint为分布式权重**时需要配置,默认为空字符串。<br>若load_checkpoint配置的权重不带流水线并行切分,则可配置为任一策略文件路径,否则配置为策略文件夹路径。 |
-| auto_trans_ckpt | 权重自动转换开关,load_checkpoint配置的**权重和当前任务的分布式策略不匹配**时需要开启,默认为False。 |
-| transform_process_num | 权重自动转换使用进程数,**仅适用于ckpt格式权重的自动转换**,可加速权重转换。默认为`None`不开启。<br>设置值需要能够整除集群总卡数,设置值越大,host内存占用越高,若host内存不足,需要减少进程数。 |
-| resume_training | 断点续训开关,可设置为`True`或任一rank子文件夹下的权重文件名。默认为`False`。<br>为`True`时,**加载最后保存完整的权重**续训。<br>为权重文件名时,**加载指定step的权重**续训。 |
-| load_ckpt_format | load_checkpoint配置的权重格式,可配置为`safetensors`或`ckpt`,默认为`ckpt`。 |
-| remove_redundancy | 去冗余加载开关,load_checkpoint配置的权重为**去冗余保存的safetensors格式权重**时需要开启,默认为`False`。 |
-| load_ckpt_async | 是否将加载权重与模型编译的操作并行执行。该配置**仅适用于ckpt格式权重且分布式策略不变**的异步加载场景。默认为`False`。 |
+| auto_trans_ckpt | 权重自动转换开关,load_checkpoint配置的**权重和当前任务的分布式策略不匹配**时需要开启,默认为`False`。 |
+| transform_process_num | 权重自动转换使用进程数,**仅适用于ckpt格式权重的自动转换**,可加速权重转换。默认为`None`不开启。<br>设置值需要能够整除集群总卡数,设置值越大,host内存占用越高,若host内存不足,需要减少进程数。 |
+| resume_training | 断点续训开关,可设置为`True`或任一rank子文件夹下的权重文件名。默认为`False`。<br>为`True`时,**加载最后保存完整的权重**续训。<br>为权重文件名时,**加载指定step的权重**续训。 |
+| load_ckpt_format | load_checkpoint配置的权重格式,可配置为`safetensors`或`ckpt`,默认为`ckpt`。 |
+| remove_redundancy | 去冗余加载开关,load_checkpoint配置的权重为**去冗余保存的safetensors格式权重**时需要开启,默认为`False`。 |
+| load_ckpt_async | 是否将加载权重与模型编译的操作并行执行。该配置**仅适用于ckpt格式权重且分布式策略不变**的异步加载场景。默认为`False`。 |

 ## 断点续训使用场景介绍

diff --git a/docs/mindformers/docs/source_zh_cn/guide/deployment.md b/docs/mindformers/docs/source_zh_cn/guide/deployment.md
index de1d14a2eb..1f1bde0121 100644
--- a/docs/mindformers/docs/source_zh_cn/guide/deployment.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/deployment.md
@@ -73,14 +73,14 @@ MindSpore Transformers模型注册表中,注册模型配置类和模型类等

 #### 模型支持列表

-|模型|Mcore新架构|状态|下载链接|
-|-|-|-|-|
-|Qwen3-32B|是|已支持|[Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B)|
-|Qwen3-235B-A22B|是|已支持|[Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B)|
-|Qwen3|是|测试中|[Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)、 [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)、 [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)、 [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)、 [Qwen3-14B](https://modelers.cn/models/MindSpore-Lab/Qwen3-14B)|
-|Qwen3-MOE|是|测试中|[Qwen3-30B-A3](https://modelers.cn/models/MindSpore-Lab/Qwen3-30B-A3B-Instruct-2507)|
-|deepSeek-V3|是|测试中|[deepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3)|
-|Qwen2.5|否|已支持|[Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)、 [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)、 [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)、 [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)、 [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)、 [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)、 [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct)|
+| 模型 |Mcore新架构|状态| 下载链接 |
+|-----------------|-|-|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Qwen3-32B |是|已支持| [Qwen3-32B](https://modelers.cn/models/MindSpore-Lab/Qwen3-32B) |
+| Qwen3-235B-A22B |是|已支持| [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |
+| Qwen3 |是|测试中| [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)、 [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)、 [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)、 [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)、 [Qwen3-14B](https://modelers.cn/models/MindSpore-Lab/Qwen3-14B) |
+| Qwen3-MOE |是|测试中| [Qwen3-30B-A3B](https://modelers.cn/models/MindSpore-Lab/Qwen3-30B-A3B-Instruct-2507) |
+| DeepSeek-V3 |是|测试中| [DeepSeek-V3](https://modelers.cn/models/MindSpore-Lab/DeepSeek-V3) |
+| Qwen2.5 |否|已支持| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)、 [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)、 [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)、 [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)、 [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)、 [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)、 [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |

 ## MindIE服务化部署

diff --git a/docs/mindformers/docs/source_zh_cn/index.rst b/docs/mindformers/docs/source_zh_cn/index.rst
index 34d01181db..ce5f88ad93 100644
--- a/docs/mindformers/docs/source_zh_cn/index.rst
+++ b/docs/mindformers/docs/source_zh_cn/index.rst
@@ -62,7 +62,7 @@ MindSpore Transformers功能特性说明

   - `使用Tokenizer `_

-     Tokenizer相关介绍,支持在Hugging Face Tokenizer在推理、数据集中使用。
+     Tokenizer相关介绍,支持在推理、数据集中使用Hugging Face Tokenizer。

 - 训练功能:
--
Gitee