diff --git a/docs/mindformers/docs/source_en/usage/dev_migration.md b/docs/mindformers/docs/source_en/advanced_deployment/dev_migration.md
similarity index 91%
rename from docs/mindformers/docs/source_en/usage/dev_migration.md
rename to docs/mindformers/docs/source_en/advanced_deployment/dev_migration.md
index 429be46ea731e2d97102e38fca9fdc2697deea84..d1de16c56ec5e79b2ffbc2e4f05ef76e50efcddd 100644
--- a/docs/mindformers/docs/source_en/usage/dev_migration.md
+++ b/docs/mindformers/docs/source_en/advanced_deployment/dev_migration.md
@@ -1,6 +1,6 @@
# Development Migration
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/dev_migration.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/dev_migration.md)
This document describes how to develop and build foundation models based on MindSpore Transformers and complete basic adaptation to start the training and inference processes.
@@ -46,9 +46,9 @@ All tokenizer classes must be inherited from the PretrainedTokenizer or Pretrain
### Preparing a Weight and a Dataset
-If a PyTorch-based model weight already exists, you can convert the weight to that in the MindSpore format by referring to [Weight Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html).
+If a PyTorch-based model weight already exists, you can convert it to the MindSpore format by referring to [Weight Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html).
-For details about how to prepare a dataset, see [Dataset](https://www.mindspore.cn/mindformers/docs/en/dev/function/dataset.html) or the model document, for example, [Llama2 Description Document > Dataset Preparation](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87).
+For details about how to prepare a dataset, see [Dataset](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html) or the model document, for example, [Llama2 Description Document > Dataset Preparation](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87).
### Preparing a `YAML` Configuration File
@@ -93,13 +93,13 @@ python run_mindformer.py --config research/llama3_1/predict_llama3_1_8b.yaml --l
`register_path` is set to `research/llama3_1` (path of the directory where the external code is located). For details about how to prepare the model weight, see [Llama3.1 Description Document > Model Weight Download](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD).
-For details about the configuration file and configurable items, see [Configuration File Descriptions](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html). When compiling a configuration file, you can refer to an existing configuration file in the library, for example, [Llama2-7B fine-tuning configuration file](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/finetune_llama2_7b.yaml).
+For details about the configuration file and configurable items, see [Configuration File Descriptions](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html). When writing a configuration file, you can refer to an existing configuration file in the repository, for example, the [Llama2-7B fine-tuning configuration file](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/finetune_llama2_7b.yaml).
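As a rough orientation before diving into the linked documents, a training configuration typically combines a few top-level sections; the outline below is a hypothetical sketch with placeholder values, not a complete working file:

```yaml
# Hypothetical minimal outline of a training yaml (placeholder values):
run_mode: 'finetune'              # train / finetune / predict
model:
  model_config:
    type: LlamaConfig             # model configuration class registered in MindSpore Transformers
  arch:
    type: LlamaForCausalLM        # model class
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "/path/to/dataset"   # placeholder path
```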
-After all the preceding basic elements are prepared, you can refer to other documents in the MindSpore Transformers tutorial to perform model training, fine-tuning, and inference. For details about subsequent model debugging and optimization, see [Large Model Accuracy Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/acc_optimize/acc_optimize.html) and [Large Model Performance Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/perf_optimize/perf_optimize.html).
+After all the preceding basic elements are prepared, you can refer to other documents in the MindSpore Transformers tutorial to perform model training, fine-tuning, and inference. For details about subsequent model debugging and optimization, see the [Large Model Precision Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/advanced_development/precision_optimization.html) and the [Large Model Performance Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/advanced_development/performance_optimization.html).
### Contributing Models to the MindSpore Transformers Open Source Repository
-You can contribute models to the MindSpore Transformers open source repository for developers to research and use. For details, see [MindSpore Transformers Contribution Guidelines](https://www.mindspore.cn/mindformers/docs/en/dev/faq/mindformers_contribution.html).
+You can contribute models to the MindSpore Transformers open source repository for developers to research and use. For details, see [MindSpore Transformers Contribution Guidelines](https://www.mindspore.cn/mindformers/docs/en/dev/contribution/mindformers_contribution.html).
## MindSpore Transformers Model Migration Practice
@@ -111,7 +111,7 @@ Llama3-8B and Llama2-7B have the same model structure but different model parame
The following compares the model configurations between Llama2-7B and Llama3-8B.
-
+
The differences are as follows:
diff --git a/docs/mindformers/docs/source_en/perf_optimize/images/cast.png b/docs/mindformers/docs/source_en/advanced_deployment/image/cast.png
similarity index 100%
rename from docs/mindformers/docs/source_en/perf_optimize/images/cast.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/cast.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/general_process.png b/docs/mindformers/docs/source_en/advanced_deployment/image/general_process.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/general_process.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/general_process.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/local_norm.png b/docs/mindformers/docs/source_en/advanced_deployment/image/local_norm.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/local_norm.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/local_norm.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/loss1.png b/docs/mindformers/docs/source_en/advanced_deployment/image/loss1.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/loss1.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/loss1.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/loss2.png b/docs/mindformers/docs/source_en/advanced_deployment/image/loss2.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/loss2.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/loss2.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/loss3.png b/docs/mindformers/docs/source_en/advanced_deployment/image/loss3.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/loss3.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/loss3.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/loss4.png b/docs/mindformers/docs/source_en/advanced_deployment/image/loss4.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/loss4.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/loss4.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/loss5.png b/docs/mindformers/docs/source_en/advanced_deployment/image/loss5.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/loss5.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/loss5.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/loss6.png b/docs/mindformers/docs/source_en/advanced_deployment/image/loss6.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/loss6.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/loss6.png
diff --git a/docs/mindformers/docs/source_en/acc_optimize/image/loss7.png b/docs/mindformers/docs/source_en/advanced_deployment/image/loss7.png
similarity index 100%
rename from docs/mindformers/docs/source_en/acc_optimize/image/loss7.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/loss7.png
diff --git a/docs/mindformers/docs/source_en/perf_optimize/images/mstx.png b/docs/mindformers/docs/source_en/advanced_deployment/image/mstx.png
similarity index 100%
rename from docs/mindformers/docs/source_en/perf_optimize/images/mstx.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/mstx.png
diff --git a/docs/mindformers/docs/source_en/perf_optimize/images/reshape.png b/docs/mindformers/docs/source_en/advanced_deployment/image/reshape.png
similarity index 100%
rename from docs/mindformers/docs/source_en/perf_optimize/images/reshape.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/reshape.png
diff --git a/docs/mindformers/docs/source_en/perf_optimize/images/silu_mul.png b/docs/mindformers/docs/source_en/advanced_deployment/image/silu_mul.png
similarity index 100%
rename from docs/mindformers/docs/source_en/perf_optimize/images/silu_mul.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/silu_mul.png
diff --git a/docs/mindformers/docs/source_en/perf_optimize/images/studio.png b/docs/mindformers/docs/source_en/advanced_deployment/image/studio.png
similarity index 100%
rename from docs/mindformers/docs/source_en/perf_optimize/images/studio.png
rename to docs/mindformers/docs/source_en/advanced_deployment/image/studio.png
diff --git a/docs/mindformers/docs/source_en/usage/multi_modal.md b/docs/mindformers/docs/source_en/advanced_deployment/multi_modal_dev.md
similarity index 99%
rename from docs/mindformers/docs/source_en/usage/multi_modal.md
rename to docs/mindformers/docs/source_en/advanced_deployment/multi_modal_dev.md
index e3bee6fd68eb8182ac2032e410967c6dc33fe2de..23e1c889eaa8c427a0860410a6bf21c58cff9132 100644
--- a/docs/mindformers/docs/source_en/usage/multi_modal.md
+++ b/docs/mindformers/docs/source_en/advanced_deployment/multi_modal_dev.md
@@ -1,6 +1,6 @@
# Multimodal Model Development
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/multi_modal.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/multi_modal_dev.md)
Multimodal models refer to artificial intelligence models capable of processing and combining information from different modalities (such as text, images, audio, video, etc.) for learning and inference. Traditional single-modality models typically focus on a single type of data, such as text classification models handling only text data or image recognition models handling only image data. In contrast, multimodal models integrate data from different sources to accomplish more complex tasks, enabling them to understand and generate richer and more comprehensive content.
@@ -66,7 +66,7 @@ During the training and inference of multimodal models, the data processing modu
Below is a flowchart of the multimodal data processing. The custom modules in the diagram need to be implemented by the user according to their specific requirements, while other modules can be directly invoked.
-
+
Then, using the [CogVLM2-Video model data preprocessing module](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/cogvlm2/cogvlm2_processor.py) as an example, we will introduce the functionality of the components of the multimodal data processing module.
@@ -322,7 +322,7 @@ Parameter Explanation:
After implementing the multimodal dataset, data processing modules, and multimodal model construction, you can start model pre-training, fine-tuning, inference, and other tasks by using the model configuration file. This requires creating the corresponding model configuration file.
-For specific model configuration files, refer to [predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml) and [finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml), which correspond to model inference and fine-tuning, respectively. For the meaning of specific parameters, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html).
+For specific model configuration files, refer to [predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml) and [finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml), which correspond to model inference and fine-tuning, respectively. For the meaning of specific parameters, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
In the user-defined configuration file, sections such as `model`, `processor`, and `train_dataset` need to correspond to the user's custom **dataset**, **data processing module**, and **multimodal model**.
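A minimal, hypothetical skeleton of this correspondence is sketched below; the `My*` class names are placeholders for user-registered classes, and the exact nesting should follow the CogVLM2 configuration files linked above:

```yaml
# Hypothetical sketch only -- replace the placeholder My* classes with
# the user-registered dataset, processor, and model classes.
model:
  model_config:
    type: MyMultiModalConfig       # configuration of the custom multimodal model
  arch:
    type: MyMultiModalModel        # the custom multimodal model itself
processor:
  type: MyMultiModalProcessor      # custom data processing module
train_dataset:
  data_loader:
    type: MyMultiModalDataLoader   # custom dataset
```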
diff --git a/docs/mindformers/docs/source_en/perf_optimize/perf_optimize.md b/docs/mindformers/docs/source_en/advanced_deployment/performance_optimization.md
similarity index 98%
rename from docs/mindformers/docs/source_en/perf_optimize/perf_optimize.md
rename to docs/mindformers/docs/source_en/advanced_deployment/performance_optimization.md
index 28be670f2cbc1ef75e4e7909cf0c8e0e039dda3b..96b262ee10fffd67f68ed1594d532f32d262152b 100644
--- a/docs/mindformers/docs/source_en/perf_optimize/perf_optimize.md
+++ b/docs/mindformers/docs/source_en/advanced_deployment/performance_optimization.md
@@ -1,6 +1,6 @@
# Large Model Performance Optimization Guide
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/perf_optimize/perf_optimize.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/performance_optimization.md)
## Overview
@@ -64,7 +64,7 @@ Parallelism strategies are usually classified into various parallel modes:
In practice, multiple parallel strategies and multiple optimizations, such as using optimizer parallelism and recomputation, are usually employed to reduce the model's use of memory and improve training efficiency. Parallel strategy design is closely related to the efficiency of the model, and it is crucial to identify one or more sets of better parallel strategies before model tuning.
-For details, refer to [Parallel Strategy Guide](https://www.mindspore.cn/mindformers/docs/en/dev/function/distributed_parallel.html).
+For details, refer to [Parallel Strategy Guide](https://www.mindspore.cn/mindformers/docs/en/dev/feature/parallel_training.html).
For models with different parameter count specifications, the following parallel strategy can be selected:
@@ -277,7 +277,7 @@ Click anywhere on the timeline page tree or graphical pane can be performed usin
#### IR Graph
-In the [MindSpore Transformers configuration file](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html), just turn on save_graphs, and the runtime will output some intermediate files ending with the .ir suffix generated during the graph compilation process, which we call IR files. By default, a directory of graphs will be generated in the current task execution directory, and all IR graphs will be saved in this. It is a relatively intuitive and easy to understand document describing the structure of the model in text format, which can be viewed directly with text editing software. Refer to [Config Configuration Description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html) for the meaning of the configuration items, and the configuration method is as follows:
+In the [MindSpore Transformers configuration file](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html), simply turn on `save_graphs`, and at runtime the intermediate files with the `.ir` suffix generated during graph compilation, which we call IR files, will be output. By default, a directory for the graphs is generated in the current task execution directory, and all IR graphs are saved in it. An IR file is a relatively intuitive and easy-to-understand text description of the model structure that can be viewed directly with a text editor. Refer to the [Config Configuration Description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html) for the meaning of the configuration items; the configuration method is as follows:
```yaml
context:
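  # A sketch of the remaining fields (assumed values; `save_graphs` and
  # `save_graphs_path` are standard MindSpore context options -- adjust the path as needed):
  save_graphs: True
  save_graphs_path: "./graph"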
diff --git a/docs/mindformers/docs/source_en/acc_optimize/acc_optimize.md b/docs/mindformers/docs/source_en/advanced_deployment/precision_optimization.md
similarity index 80%
rename from docs/mindformers/docs/source_en/acc_optimize/acc_optimize.md
rename to docs/mindformers/docs/source_en/advanced_deployment/precision_optimization.md
index c1f02eb42438f2a653fbc75589795d5f5a245daf..56095ba191f70535200b5bf526575355ca0aeb7b 100644
--- a/docs/mindformers/docs/source_en/acc_optimize/acc_optimize.md
+++ b/docs/mindformers/docs/source_en/advanced_deployment/precision_optimization.md
@@ -1,24 +1,24 @@
-# Large Model Accuracy Optimization Guide
+# Large Model Precision Optimization Guide
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/acc_optimize/acc_optimize.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/precision_optimization.md)
-## Overview and Scenarios of Accuracy Issues
+## Overview and Scenarios of Precision Issues
### Descriptions
-As the Ascend AI processor (hereinafter referred to as NPU) is widely used in deep learning, the MindSpore framework, which is developed natively based on the Ascend NPU, shows better performance advantages. During large-scale cluster training, the performance improvement will greatly save users the cost of large model development. Therefore, more and more users are gradually migrating their original training models to MindSpore. However, due to the differences in hardware and framework usage, users may encounter accuracy problems after completing the model migration.
+As the Ascend AI processor (hereinafter referred to as NPU) is widely used in deep learning, the MindSpore framework, which is developed natively for the Ascend NPU, shows clear performance advantages. During large-scale cluster training, this performance improvement greatly reduces the cost of large model development. Therefore, more and more users are gradually migrating their original training models to MindSpore. However, due to differences in hardware and framework usage, users may encounter precision problems after completing the model migration.
-This paper summarizes the common accuracy problems in the training process of large models and general accuracy problem localization methods, and seeks to help users quickly troubleshoot accuracy problems and shorten the time for model accuracy problem localization. When starting the work on large model accuracy optimization, you should have the basic knowledge of large model. To avoid dispersion, this document will not explain the basic concepts related to large models and focus on the introduction of accuracy optimization.
+This document summarizes common precision problems in large model training and general methods for localizing them, helping users quickly troubleshoot precision problems and shorten the time needed to localize them. Before starting large model precision optimization, you should have basic knowledge of large models. To stay focused, this document does not explain basic concepts related to large models and concentrates on precision optimization.
### Categorized Summary of Common Problems
-Various accuracy problems often occur in large model training, and the common problems include that the loss fails to converge, the loss converges poorly, the loss fails to converge at the late stage of training, the accuracy overflows, and the loss can not be fitted to the benchmark in the process of descending. There can be a variety of reasons for these accuracy problems, including the structure of the model, the dataset, the hyperparameters, the precision of the forward and reverse computation, the calculation of the optimizer, the floating-point computational accuracy, and randomness.
+Various precision problems often occur in large model training. Common ones include the loss failing to converge, the loss converging poorly, the loss failing to converge at the late stage of training, precision overflow, and the loss curve deviating from the benchmark during descent. These precision problems can have a variety of causes, including the model structure, the dataset, the hyperparameters, the precision of forward and backward computation, the optimizer calculation, floating-point computational precision, and randomness.
-When accuracy problems occur, the problem can be analyzed from the reasons for these accuracy problems. A quick troubleshooting based on CheckList is performed first, followed by parameter and weight alignment, fixed randomness and turning on deterministic calculations. Then the base problem is troubleshooted, and finally the anomalous step is troubleshooted by long stable training. At the current stage, this paper mainly introduces the general method of accuracy localization for the scenarios with accuracy benchmarks, and the content of accuracy problem localization without accuracy benchmarks will be added successively.
+When a precision problem occurs, it can be analyzed against these possible causes. First perform a quick check based on the CheckList, then align parameters and weights, fix randomness, and turn on deterministic computation. Next, troubleshoot the basic problems, and finally locate the anomalous step through long stable training. At the current stage, this document mainly introduces the general precision localization method for scenarios with a precision benchmark; content on precision localization without a benchmark will be added later.
-## Accuracy Problems Location CheckList
+## Precision Problems Location CheckList
-Before locating the operator accuracy problem, we should first eliminate the interference of other non-operator factors. Combined with the previous precision positioning cases, the CheckList before precision positioning is summarized. In order to easier locate the problems, users can first carry out quick troubleshooting according to the CheckList.
+Before locating an operator precision problem, we should first eliminate the interference of other non-operator factors. Drawing on previous precision localization cases, the CheckList below summarizes the checks to perform before precision localization. To locate problems more easily, users can first carry out quick troubleshooting according to the CheckList.
### Network Structure CheckList
@@ -34,7 +34,7 @@ Before locating the operator accuracy problem, we should first eliminate the int
| Regularization function | Regularization functions, common structures are LayerNorm, RMSNorm | The specified regularization function is used in MindSpore Transformers and cannot be modified by configuration. The configuration can be customized in Megatron by normalization to check for consistency. |
| rms_norm_eps | Regularized epsilon parameters | Correspond to the Megatron layernorm_epsilon parameter and check for consistency. |
| dropout | dropout in the network | Currently, when MindSpore enables dropout, recalculation cannot be enabled; if precision comparison is carried out, it is recommended that both sides be closed to reduce the random factor.|
-| Fusion computation | Common fusion operators include FA, ROPE, Norm, SwigLU; some users will fuse Wq, Wk, Wv for computation | 1. For accuracy comparison under the same hardware, if fusion algorithms are used, they should be consistent. 2. When comparing accuracy on different hardware, focus on checking whether there is any difference in the calculation of the fusion calculation part. |
+| Fusion computation | Common fusion operators include FA, ROPE, Norm, SwigLU; some users fuse Wq, Wk, and Wv for computation | 1. For precision comparison on the same hardware, if fusion operators are used, they should be consistent. 2. When comparing precision across different hardware, focus on checking whether the fused computation differs. |
#### MOE Structure
@@ -74,14 +74,14 @@ Before locating the operator accuracy problem, we should first eliminate the int
| **Key parameters** | **Descriptions** | **CheckList** |
| ----------------- | ----------------------------------------- |---------------------------------------|
-| compute_dtype | Compute accuracy | Megatron set `-bf16: true` to BF16, otherwise FP16. |
+| compute_dtype | Compute precision | Megatron sets BF16 with `-bf16: true`; otherwise FP16 is used. |
| layernorm_compute_type | LayerNorm/RMSNorm compute precision | Megatron is not configurable, need to check that implementations are consistent. |
| softmax_compute_type | When MindSpore uses FA, the internal Softmax fix is calculated with FA. Type of calculation is configurable only for small arithmetic splicing implementations | Megatron is not configurable, needs to check if the implementation is consistent. |
-| rotary_dtype | Calculation accuracy of rotary position encoding | Megatron is not configurable, needs to check if the implementation is consistent. |
-| Calculation of weights | accuracy calculation for each weight such as, Embedding, lm_head | Since MindSpore Transformers weight initialization needs to be set to FP32, and the usual calculation precision is BF16/FP16, it is necessary to check whether the weight data type is converted to BF16/FP16 before weight calculation.|
-| bias add | bias in the linear layer | If bias is present, Linear layer checks consistency in the computational accuracy of add. |
-| residual add | sum of residuals | Check that the accuracy of the calculation of the residuals is consistent with the benchmarks |
-| loss | Loss Calculation Module | Check that the accuracy of the calculation in the entire loss module is consistent with the benchmarks |
+| rotary_dtype | Calculation precision of rotary position encoding | Not configurable in Megatron; check that the implementation is consistent. |
+| Calculation of weights | Computation precision of each weight, such as Embedding and lm_head | Since MindSpore Transformers weight initialization needs to be set to FP32 while the usual computation precision is BF16/FP16, check whether the weight data type is converted to BF16/FP16 before the weight is used in computation.|
+| bias add | bias in the linear layer | If bias is present, check that the Linear layer's add is computed with consistent precision. |
+| residual add | sum of residuals | Check that the residual computation precision is consistent with the benchmark. |
+| loss | Loss calculation module | Check that the computation precision of the entire loss module is consistent with the benchmark. |
| Operator High Precision Mode | Ascend Calculator supports high precision mode | Method: `context.set_context(ascend_config= {"ge_options":{ "global":{ "ge.opSelectImplmode":"high_precision" } } })` |
### Parallel Strategy CheckList
@@ -108,9 +108,9 @@ Before locating the operator accuracy problem, we should first eliminate the int
| Version Check | Check whether the versions of MindSpore, MindSpore Transformers and CANN are compatible, it is recommended to use the latest compatible version. |
| Differences with Open Source | MindSpore Transformers has supported the mainstream open source LLM models, and has been more fully tested. If you are developing based on the open source models in MindSpore Transformers, you can focus on checking the differences with the open source models in MindSpore Transformers. |
-## Introduction to Accuracy Debugging Tools
+## Introduction to Precision Debugging Tools
-In accuracy localization, MindSpore's Dump tool is mainly used. For details, please refer to [Dump Function Debugging](https://www.mindspore.cn/tutorials/en/master/debug/dump.html).
+In precision localization, MindSpore's Dump tool is mainly used. For details, please refer to [Dump Function Debugging](https://www.mindspore.cn/tutorials/en/master/debug/dump.html).
MindSpore's Dump tool is enabled by configuring a JSON file, which Dumps out all the operator data in the network, saving the tensor and statistics in the statistic.csv table. The following gives a JSON example of full operator Dump:
@@ -146,16 +146,16 @@ After setting the environment variables, start the program training to get the c
### Other Introductions
-In addition to the full amount of operator Dump introduced above, the tool also supports partial data Dump, overflow Dump, specified-condition Dump and so on. Limited to space, interested users can refer to [Dump function debugging](https://www.mindspore.cn/tutorials/en/master/debug/dump.html) for configuration and use. In addition, the msprobe precision debugging tool is provided. msprobe is a tool package under the precision debugging component of the MindStudio Training Tools suite. It mainly includes functions such as precision pre-check, overflow detection, and precision comparison. For more information, refer to [msprobe User Guide](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/msprobe).
+In addition to the full operator Dump introduced above, the tool also supports partial data Dump, overflow Dump, specified-condition Dump, and so on. Due to space limitations, interested users can refer to [Dump function debugging](https://www.mindspore.cn/tutorials/en/master/debug/dump.html) for configuration and usage. In addition, the msprobe precision debugging tool is provided. msprobe is a tool package under the precision debugging component of the MindStudio Training Tools suite. It mainly includes functions such as precision pre-check, overflow detection, and precision comparison. For more information, refer to the [msprobe User Guide](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/msprobe).
-## Generalized Processes for Accuracy Positioning
+## Generalized Processes for Precision Positioning
-Quickly troubleshoot the problem by using the [Accuracy Problems Location CheckList](#accuracy-problems-location-checklist) section. If the accuracy problem still exists after completing the CheckList and there is no obvious direction, you can narrow down the scope of the problem by using the accuracy location generic process in this section for further troubleshooting. The current generalized process is mainly for benchmarked scenarios, and the following section will take the scenario of comparing the accuracy of GPU+PyTorch and Ascend+MindSpore as an example to introduce the accuracy localization process.
+Quickly troubleshoot the problem by using the [Precision Problems Location CheckList](#precision-problems-location-checklist) section. If the precision problem still exists after completing the CheckList and there is no obvious direction, you can narrow down the scope of the problem by using the generic precision localization process in this section for further troubleshooting. The current generalized process mainly targets benchmarked scenarios, and the following section takes comparing the precision of GPU+PyTorch and Ascend+MindSpore as an example to introduce the precision localization process.
There are two main ideas for problem positioning:
* Simplified training scenarios based on single card/standalone, small-scale model replication problems.
-* Fix the random factor and compare the loss difference with the benchmark during training to locate the cause of the accuracy difference.
+* Fix the random factor and compare the loss difference with the benchmark during training to locate the cause of the precision difference.
The training process of the model can be decomposed into the following processes: data input, forward computation, loss, backward computation, gradient, optimizer weight update, and next step. The following will describe how to rank each stage of the training in conjunction with the flow of the following figure.
@@ -163,7 +163,7 @@ The training process of the model can be decomposed into the following processes
### Stage 1: Pre-training Preparation
-Conducting accuracy comparison between GPU+PyTorch and Ascend+MindSpore requires simplifying the scenario and fixing the randomness before reproducing the problem. There are three main parts as follows:
+Conducting precision comparison between GPU+PyTorch and Ascend+MindSpore requires simplifying the scenario and fixing the randomness before reproducing the problem. There are three main parts as follows:
* Aligning parameters, downsizing models, single-card/stand-alone reproduction problems;
@@ -187,7 +187,7 @@ Since features such as model parallelism, flow parallelism, sequence parallelism
#### Weight Conversion
-During training, MindSpore is loaded with the same weights as PyTorch. In case of pre-training scenarios, you can use PyTorch to save an initialized weight and then convert it to MindSpore weights. Because MindSpore weight names differ from PyTorch, the essence of weight conversion is to change the names in the PyTorch weight dict to MindSpore weight names to support MindSpore loading. Refer to [weight conversion guide](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html) for weight conversion.
+During training, MindSpore loads the same weights as PyTorch. In pre-training scenarios, you can use PyTorch to save an initialized weight and then convert it to MindSpore weights. Because MindSpore weight names differ from PyTorch's, the essence of weight conversion is to change the names in the PyTorch weight dict to MindSpore weight names so that MindSpore can load them. Refer to the [weight conversion guide](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html) for weight conversion.
Both MindSpore and PyTorch support `bin` format data, loading the same dataset for training ensures consistency from step to step.
@@ -254,7 +254,7 @@ By comparing the loss and local norm of the first step (step1) and the second st
#### Comparison of Step1 Losses
-After fixing the weights, dataset, and randomness, the difference in the loss value of the first step of training is compared. The loss value of the first step is obtained from the forward computation of the network. If the difference with the benchmark loss is large, it can be determined that there is an accuracy difference in the forward computation, which may be due to the model structure is not aligned, and the accuracy of the operator is abnormal. The tensor values of each layer of MindSpore and PyTorch can be obtained by printing or Dump tool. Currently, the tool does not have automatic comparison function, users need to manually identify the correspondence for comparison. For the introduction of MindSpore Dump tool, please refer to [Introduction of Accuracy Debugging Tools](#introduction-to-accuracy-debugging-tools), and for the use of PyTorch Dump tool, please refer to [Function Explanation of Accuracy Tools](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md).
+After fixing the weights, dataset, and randomness, the difference in the loss value of the first training step is compared. The loss value of the first step is obtained from the forward computation of the network. If the difference from the benchmark loss is large, it can be determined that there is a precision difference in the forward computation, which may be because the model structure is not aligned or an operator's precision is abnormal. The tensor values of each layer of MindSpore and PyTorch can be obtained by printing or with the Dump tool. Currently, the tool does not provide automatic comparison; users need to manually identify the correspondence between layers for comparison. For an introduction to the MindSpore Dump tool, please refer to the [Introduction to Precision Debugging Tools](#introduction-to-precision-debugging-tools), and for the use of the PyTorch Dump tool, please refer to the [Function Explanation of Accuracy Tools](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md).
Find the correspondence of layers through PyTorch api_stack_dump.pkl file, and MindSpore statistic.csv file, and initially determine the degree of difference between input and output through max, min, and L2Norm. If you need further comparison, you can load the corresponding npy data for detailed comparison.
@@ -301,7 +301,7 @@ Below is an example of a local norm comparison, comparing the local norm values

-It can be found that in the scenario shown in this figure, the local norm value of model.tok_embeddings.embedding_weight has a large difference, which can be focused on troubleshooting the implementation of the Embedding and the calculation accuracy, etc.
+It can be found that in the scenario shown in this figure, the local norm value of model.tok_embeddings.embedding_weight differs significantly, so troubleshooting can focus on the Embedding implementation and its computation precision.
The local norm value only serves as a preliminary judgment of whether the reverse computation is correct, if we want to compare the reverse computation in depth, we need to compare the MindSpore and PyTorch reverse computation values layer by layer by using the Dump tool.
@@ -397,44 +397,44 @@ In this scenario, troubleshooting can be done from the following perspectives:
* Examine whether the parameters are aligned: focus on examining the parameters related to the optimizer, such as the optimizer type, learning rate, weight decay. We can compare whether the change of learning rate during training is consistent by drawing diagrams, and we also need to confirm whether the weight of weight decay is consistent with the benchmark.
-* Mixed accuracy checking: through the Dump tool, carefully check whether the mixed accuracy is consistent with the benchmark in the calculation process;
+* Mixed precision check: use the Dump tool to carefully check whether the mixed precision used during computation is consistent with the benchmark;
-* If there is a difference in the loss at convergence, but the difference is small, such as less than 1%, the accuracy acceptance can be performed by evaluating the downstream tasks.
+* If there is a difference in the loss at convergence but the difference is small, such as less than 1%, precision acceptance can be performed by evaluating downstream tasks.
#### Scenario Expansion
-After completing the single-card alignment, gradually expand from single-card to multi-card testing and cluster testing; model size and related features such as model parallelism, flow parallelism, optimizer parallelism are added as appropriate. Gradually expand from simple scenarios to actual training scenarios, so as to troubleshoot the impact of the added features on the accuracy.
+After completing the single-card alignment, gradually expand from single-card to multi-card testing and cluster testing; model size and related features such as model parallelism, pipeline parallelism, and optimizer parallelism are added as appropriate. Gradually expand from simple scenarios to the actual training scenario, so as to troubleshoot the impact of each added feature on precision.
-### Large Model Migration Accuracy Standard
+### Large Model Migration Precision Standard
-Accuracy standard for large model migration refers to the accuracy standard set for key indicators to ensure that the model accuracy before and after migration is basically the same after migrating the models trained by other third-party hardware or frameworks to MindSpore and Ascend Hardware. It is summarized based on the actual migration scenarios of MindSpore's large models for developers' reference. Since the accuracy of large models is strongly related to the application domain, model structure, number of parameters, and hyperparameters, and is not fully interpretable, there is no complete and unified mandatory standard. Therefore, this standard is only used as a reference standard to help users make a basic judgment on the accuracy of model migration.
+The precision standard for large model migration refers to the precision standard set for key indicators to ensure that, after models trained on third-party hardware or frameworks are migrated to MindSpore and Ascend hardware, the model precision before and after migration is basically the same. It is summarized from actual MindSpore large model migration scenarios for developers' reference. Since the precision of large models is strongly related to the application domain, model structure, number of parameters, and hyperparameters, and is not fully interpretable, there is no complete and unified mandatory standard. Therefore, this standard serves only as a reference to help users make a basic judgment on the precision of model migration.
-#### Accuracy Standard Specifications
+#### Precision Standard Specifications
1. Relative discrepancy is uniformly described as a percentage (x.x%) and absolute discrepancy is uniformly described as a decimal (0.xx);
-2. If the accuracy fluctuations of the third-party model training no longer meet this accuracy standard, the original model should be adequately tested and the standard should be relaxed in accordance with the fluctuations of the original model;
+2. If the precision fluctuations of the third-party model training no longer meet this precision standard, the original model should be adequately tested and the standard should be relaxed in accordance with the fluctuations of the original model;
#### Default Configuration
| Classes | Default Values | Descriptions |
|--------------------|------|-------------------------------|
| Dataset | [pretrain] wikitext-103 [sft] alpaca | |
-| Accuracy mode | BF16 | Mixed-accuracy configurations are consistent, and distinguish between actual FP32/FP16/BF16 configurations for each API in the network. |
+| Precision mode | BF16 | Keep mixed-precision configurations consistent, and distinguish the actual FP32/FP16/BF16 configuration of each API in the network. |
| Parallel method | Data parallel | The parallelism can be adjusted according to the computational resources. |
| Cluster size | Stand-alone 8 cards | Can be adjusted according to the computational resources. |
-| checkpoint | [pretrain] Script initialization by default [sft]Loading pre-training weights | ckpt has a large impact on the accuracy metrics, prioritizing weights with small fluctuations in loss and a clear downward trend in overall loss.|
-|determinism|Turn on|The accuracy indicator determination phase can turn off determinism. The comparison phase needs to turn on determinism in order to minimize random error interference.|
+| checkpoint | [pretrain] Script initialization by default [sft] Loading pre-training weights | The checkpoint has a large impact on the precision metrics; prioritize weights whose loss fluctuates little and shows a clear overall downward trend.|
+| determinism | Turn on | Determinism can be turned off while determining the precision indicators; it must be turned on during the comparison phase to minimize random error interference.|
-#### Accuracy Standard Indicator
+#### Precision Standard Indicator
* Test Standard
1. Without user's special designation, the default continuous observation is 5000 steps or 12 hours, the number of steps can be reduced according to the resource situation, but it is not recommended to be less than 1000 steps.
2. Load the same weights, keep all hyperparameters configured the same, and turn off all randomness.
- 3. The fluctuation of indicators such as loss is greatly influenced by the model, weights, and hyperparameters, and the combination with smooth loss fluctuation is preferred as a benchmark to reduce the judgment of random fluctuation on the accuracy results.
- 4. The randomness of the third-party model was adequately tested by repeating the experiment at least 2 times with determinism turned off and observing the range of fluctuations in the accuracy metrics.
+    3. The fluctuation of indicators such as loss is greatly influenced by the model, weights, and hyperparameters; prefer a combination with smooth loss fluctuation as the benchmark to reduce the impact of random fluctuation on judging the precision results.
+    4. The randomness of the third-party model should be adequately tested by repeating the experiment at least twice with determinism turned off and observing the fluctuation range of the precision metrics.
-* loss Accuracy Standard
+* loss Precision Standard
1. The absolute error of first loss is less than 0.005, or the relative error is less than 0.5%.
2. The average absolute error is less than 0.01, or the average relative error is less than 1%.
@@ -445,7 +445,7 @@ Accuracy standard for large model migration refers to the accuracy standard set
### Case Details
-This section will introduce the completion of accuracy ranking based on the above accuracy localization process with practical examples.
+This section uses a practical example to walk through precision troubleshooting based on the above precision localization process.
#### Problem Phenomenon
@@ -465,7 +465,7 @@ First the loss alignment of step1 is confirmed to be OK. Comparing the local nor
The reason for this is that MindSpore Transformers uses FP32 for weight initialization, and FP32 precision is used for both forward and backward Embedding calculations, while PyTorch forward and backward calculations are BF16, which leads to differences in the calculated local norm values.
-Once the computational accuracy is aligned, the exhaustive optimizer computation is also fine, and the long stable training alignment starts.
+Once the computational precision is aligned and the optimizer computation has also been checked and found to be correct, the long stable training alignment starts.
The long stable training exhaustion will be extended from single card experiments to multi-card experiments by first setting the LEARNING RATE=0, i.e., the weights are not updated. Forward computation of the loss difference of each step is around 0.001, and the forward computation error is as expected. The difference of global norm of each step is about 0.05, and the difference of reverse calculation is not significant. It is initially judged that the model migration code is correct, the model structure is consistent, and the difference of forward and reverse calculation is not significant.
@@ -477,11 +477,11 @@ Re-weight update, single card training, set learning rate=1e-5, train 1k steps.
Perform problem troubleshooting. Identify the following problems:
-* Identify inconsistencies in computational accuracy during training through Dump file exclusion, and harmonize inconsistencies.
+* By examining the Dump files, identify inconsistencies in computational precision during training, and make them consistent.
* Weight decay implementation is inconsistent, weight decay is performed on all weights in user PyTorch network. bias weights and one-dimensional weights in MindSpore Transformers do not have weight decay by default.
-After fixing the problem, experiment again, train 10,000 steps, the loss difference fluctuates around the 0 axis and is less than 0.03, the accuracy meets the expectation, and the single-card accuracy is aligned.
+After fixing the problems, run the experiment again and train for 10,000 steps: the loss difference fluctuates around the 0 axis and stays below 0.03, the precision meets expectations, and the single-card precision is aligned.
After completing the single-card training, start the multi-card training test: set the learning rate to 1e-5 and train for 1,000 steps. Convergence is consistent in the late stage of training, but there is a stable 0.05 error in the middle stage of training.
diff --git a/docs/mindformers/docs/source_en/usage/pretrain_gpt.md b/docs/mindformers/docs/source_en/advanced_deployment/pretrain_gpt.md
similarity index 99%
rename from docs/mindformers/docs/source_en/usage/pretrain_gpt.md
rename to docs/mindformers/docs/source_en/advanced_deployment/pretrain_gpt.md
index ff52ba7dcb2149248241f9b5e6da458253607a56..1c24ba53a06c651a4f1317f199683ba3ca884a41 100644
--- a/docs/mindformers/docs/source_en/usage/pretrain_gpt.md
+++ b/docs/mindformers/docs/source_en/advanced_deployment/pretrain_gpt.md
@@ -1,6 +1,6 @@
# Dynamic Graph Parallelism
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/pretrain_gpt.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/pretrain_gpt.md)
## Overview
diff --git a/docs/mindformers/docs/source_en/faq/mindformers_contribution.md b/docs/mindformers/docs/source_en/contribution/mindformers_contribution.md
similarity index 98%
rename from docs/mindformers/docs/source_en/faq/mindformers_contribution.md
rename to docs/mindformers/docs/source_en/contribution/mindformers_contribution.md
index 670eb56fd291cf20c00ab8e292bd85a6af7f8cc9..24ca3cf2088ba5147ea380e1450c0bc694a6b074 100644
--- a/docs/mindformers/docs/source_en/faq/mindformers_contribution.md
+++ b/docs/mindformers/docs/source_en/contribution/mindformers_contribution.md
@@ -1,6 +1,6 @@
# MindSpore Transformers Contribution Guidelines
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/faq/mindformers_contribution.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/contribution/mindformers_contribution.md)
## Contributing Code to MindSpore Transformers
diff --git a/docs/mindformers/docs/source_en/faq/modelers_contribution.md b/docs/mindformers/docs/source_en/contribution/modelers_contribution.md
similarity index 98%
rename from docs/mindformers/docs/source_en/faq/modelers_contribution.md
rename to docs/mindformers/docs/source_en/contribution/modelers_contribution.md
index e94dfab412aa63f200982353e472802812f07b50..bf7b6585a578ce08be3301972c2ec4b24f32472d 100644
--- a/docs/mindformers/docs/source_en/faq/modelers_contribution.md
+++ b/docs/mindformers/docs/source_en/contribution/modelers_contribution.md
@@ -1,6 +1,6 @@
# Modelers Contribution Guidelines
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/faq/modelers_contribution.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/contribution/modelers_contribution.md)
## Upload a Model to the Modelers Community
diff --git a/docs/mindformers/docs/source_en/appendix/env_variables.md b/docs/mindformers/docs/source_en/env_variables.md
similarity index 97%
rename from docs/mindformers/docs/source_en/appendix/env_variables.md
rename to docs/mindformers/docs/source_en/env_variables.md
index b34c37e0c9c89134b7d22f341c78c1c5caefc2fd..ce7ff8613b21131a11be568deaed221571544142 100644
--- a/docs/mindformers/docs/source_en/appendix/env_variables.md
+++ b/docs/mindformers/docs/source_en/env_variables.md
@@ -1,6 +1,6 @@
# Environment Variable Descriptions
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/appendix/env_variables.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/env_variables.md)
The following environment variables are supported by MindSpore Transformers.
@@ -14,7 +14,7 @@ The following environment variables are supported by MindSpore Transformers.
| **ASCEND_LAUNCH_BLOCKING** | 0 | training or online inference scenarios, this environment variable can be used to control whether synchronization mode is activated during operator execution. | `1`: synchronized mode is mandatory; `0`: synchronized mode is optional. | Since the default operator executes asynchronously during NPU model training, when an error is reported during operator execution, the error stack information printed is not the actual call stack information. When set to `1`, synchronized mode is mandatory, which prints the correct call stack information and makes it easier to debug and locate problems in the code. Setting it to `1` provides more efficient arithmetic. |
| **TE_PARALLEL_COMPILER** | 8 | The number of threads on which the operator is compiled in parallel. Enables parallel compilation when greater than 1. | Takes a positive integer;Maximum number of cpu cores\*80%/number of Ascend AI processors, value range 1~32, default value is 8. | When the network model is large, parallel compilation of the operator can be turned on by configuring this environment variable; setting it to `1` for single-threaded compilation simplifies the difficulty when debugging. |
| **CPU_AFFINITY** | 0 | Turn on the CPU affinity switch, thus ensuring that each process or thread is bound to a single CPU core to improve performance. | `1`: turn on the CPU affinity switch; `0`: turn off the CPU affinity switch. | CPU affinity is turned off by default for **optimized resource utilization** and **energy saving**. |
-| **MS_MEMORY_STATISTIC** | 0 | Memory Statistics. | `1`: turn on memory statistics; `0`: turn off memory statistics. | During memory analysis, basic memory usage can be counted. You can refer to [Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/perf_optimize/perf_optimize.html) for details. |
+| **MS_MEMORY_STATISTIC** | 0 | Memory Statistics. | `1`: turn on memory statistics; `0`: turn off memory statistics. | During memory analysis, basic memory usage can be counted. You can refer to [Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/advanced_development/performance_optimization.html) for details. |
| **MINDSPORE_DUMP_CONFIG** | NA | Specify the path to the configuration file that the [cloud-side Dump function](https://www.mindspore.cn/tutorials/en/master/debug/dump.html) or [end-side Dump function](https://www.mindspore.cn/lite/docs/en/master/tools/benchmark_tool.html#dump) depends on. | File path, support relative path and absolute path. |
| **GLOG_v** | 3 | Controls the level of MindSpore logs. | `0`: DEBUG `1`: INFO `2`: WARNING `3`: ERROR: indicates that an error has been reported in the execution of the program, an error log is output, and the program may not be terminated; `4`: CRITICAL, indicates that an exception has occurred in the execution of the program, and the execution of the program will be terminated. |
| **ASCEND_GLOBAL_LOG_LEVEL** | 3 | Controls the logging level of CANN. | `0`: DEBUG `1`: INFO `2`: WARNING `3`: ERROR `4`: NULL, no log is output. |
@@ -40,4 +40,4 @@ The following environment variables are supported by MindSpore Transformers.
| **MS_ENABLE_FA_FLATTEN** | on | Controls whether support FlashAttention flatten optimization. | `on`: Enable FlashAttention flatten optimization; `off`: Disable FlashAttention flatten optimization. | Provide a fallback mechanism for models that have not yet been adapted to FlashAttention flatten optimization. |
| **EXPERIMENTAL_KERNEL_LAUNCH_GROUP** | NA | Control whether to support the batch parallel submission of operators. If supported, enable the parallel submission and configure the number of parallel submissions. | `thread_num`: The number of concurrent threads is not recommended to be increased. The default value is 2; `kernel_group_num`: Total number of operator groups, 'kernel_group_num/thread_num' groups per thread, default is' 8 '. | This feature will continue to evolve in the future, and the subsequent behavior may change. Currently, only the `deepseek` reasoning scenario is supported, with certain performance optimization, but other models using this feature may deteriorate, and users need to use it with caution, as follows:`export EXPERIMENTAL_KERNEL_LAUNCH_GROUP="thread_num:2,kernel_group_num:8"`. |
| **FORCE_EAGER** | False | Controls whether to disable jit mode. | `False`: enable jit mode; `True`: do not enable jit mode. | Jit compiles functions into a callable MindSpore graph. Setting FORCE_EAGER to False enables jit mode, which can bring performance benefits. Currently, only inference mode is supported. |
-| **MS_ENABLE_TFT** | NA | Enable [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) feature. Turn on TTP, UCE or ARF feature. | The value of the environment variable can be:"{TTP:1,UCE:1,ARF:1}", when using a certain feature, the corresponding field can be configured as "1". | Usage can refer to [High Availability](https://www.mindspore.cn/mindformers/docs/en/dev/function/high_availability.html). |
\ No newline at end of file
+| **MS_ENABLE_TFT** | NA | Enable the [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) feature, i.e., turn on the TTP, UCE, or ARF feature. | The value of the environment variable can be `"{TTP:1,UCE:1,ARF:1}"`; to use a feature, set the corresponding field to `1`. | For usage, refer to [High Availability](https://www.mindspore.cn/mindformers/docs/en/dev/feature/high_availability.html). |
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/faq/func_related.md b/docs/mindformers/docs/source_en/faq/feature_related.md
similarity index 88%
rename from docs/mindformers/docs/source_en/faq/func_related.md
rename to docs/mindformers/docs/source_en/faq/feature_related.md
index 7c61f181338b9d303b65143fe71cafdf78adcb98..cdf536564efa1325434f4d3efddcc9e004e42d5e 100644
--- a/docs/mindformers/docs/source_en/faq/func_related.md
+++ b/docs/mindformers/docs/source_en/faq/feature_related.md
@@ -1,6 +1,6 @@
-# Function-Related
+# Feature-Related FAQ
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/faq/func_related.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/faq/feature_related.md)
## Q: The WikiText dataset download link is not available.
@@ -10,7 +10,7 @@ A: The official download link is not available, please follow the community Issu
## Q: How Do I Generate a Model Sharding Strategy File?
-A: The model sharding strategy file documents the sharding strategy for model weights in distributed scenarios and is generally used when slicing weights offline. Configure `only_save_strategy: True` in the network `yaml` file, and then start the distributed task normally, then the distributed strategy file can be generated in the `output/strategy/` directory. For details, please refer to the [Tutorial on Slicing and Merging Distributed Weights](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html).
+A: The model sharding strategy file records the sharding strategy of model weights in distributed scenarios and is generally used when slicing weights offline. Configure `only_save_strategy: True` in the network `yaml` file and then start the distributed task normally; the distributed strategy files will be generated in the `output/strategy/` directory. For details, please refer to the [Tutorial on Slicing and Merging Distributed Weights](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html).
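+
+A minimal sketch of the relevant `yaml` fragment (the rest of the training configuration is assumed to already be in place):
+
+```yaml
+only_save_strategy: True  # generate the sharding strategy files under output/strategy/
+```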
diff --git a/docs/mindformers/docs/source_en/faq/model_related.md b/docs/mindformers/docs/source_en/faq/model_related.md
index 157196c3327fd485d84d8b37c87a4ad29d4e112c..07bc072e30283ffdd98dad4a4a61c3d70986bdce 100644
--- a/docs/mindformers/docs/source_en/faq/model_related.md
+++ b/docs/mindformers/docs/source_en/faq/model_related.md
@@ -1,4 +1,4 @@
-# Model-Related
+# Model-Related FAQ
[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/faq/model_related.md)
diff --git a/docs/mindformers/docs/source_en/appendix/conf_files.md b/docs/mindformers/docs/source_en/feature/configuration.md
similarity index 98%
rename from docs/mindformers/docs/source_en/appendix/conf_files.md
rename to docs/mindformers/docs/source_en/feature/configuration.md
index e8b7abba33c3abe0c355a9411fe9914173b9b9ad..6774effc48bf8261abaf731e93d9803d1f9a75f1 100644
--- a/docs/mindformers/docs/source_en/appendix/conf_files.md
+++ b/docs/mindformers/docs/source_en/feature/configuration.md
@@ -1,6 +1,6 @@
# Configuration File Descriptions
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/appendix/conf_files.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/configuration.md)
## Overview
@@ -19,9 +19,9 @@ The basic configuration is mainly used to specify MindSpore random seeds and rel
| seed | Set the global seed. For details, refer to [mindspore.set_seed](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_seed.html). | int |
| run_mode | Set the running mode of the model: `train`, `finetune`, `eval` or `predict`. | str |
| output_dir | Set the path where log, checkpoint, strategy, etc. files are saved. | str |
-| load_checkpoint | File or folder paths for loading weights. Currently there are 3 application scenarios 1. Support for passing in full weight file paths. 2. Support for passing in offline sliced weight folder paths. 3. Support for passing in folder paths containing lora weights and base weights Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html) for the ways of obtaining various weights. | str |
-| auto_trans_ckpt | Enable distributed weight auto slicing and merging. Refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html). | bool |
-| resume_training | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/function/resume_training.html#resumable-training). | bool |
+| load_checkpoint | File or folder path for loading weights. There are currently 3 application scenarios: 1. passing in a complete weight file path; 2. passing in a folder path of offline sliced weights; 3. passing in a folder path containing both LoRA weights and base weights. Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html) for how to obtain the various weights. | str |
+| auto_trans_ckpt | Enable distributed weight auto slicing and merging. Refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html). | bool |
+| resume_training | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html#resumable-training). | bool |
| load_ckpt_format | The format of loading checkpoint, either `ckpt` or `safetensors`. | str |
| remove_redundancy | Whether redundancy has been removed from the checkpoint being loaded. The default value is `False`. | bool |
| train_precision_sync | Switching on or off deterministic computation of the training process. The default value is `None`. | Optional[bool] |
@@ -140,7 +140,7 @@ When starting model training, in addition to model-related parameters, you also
### Parallel Configuration
-In order to improve the performance of the model, it is usually necessary to configure the parallelism strategy for the model in large-scale cluster usage scenarios. For details, please refer to [Distributed Parallelism](https://www.mindspore.cn/mindformers/docs/en/dev/function/distributed_parallel.html), the parallel configuration in MindSpore Transformers is as follows.
+In order to improve model performance, it is usually necessary to configure a parallelism strategy for the model in large-scale cluster scenarios. For details, please refer to [Distributed Parallelism](https://www.mindspore.cn/mindformers/docs/en/dev/feature/parallel_training.html). The parallel configuration in MindSpore Transformers is as follows.
| Parameters | Descriptions | Types |
|-----------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
@@ -174,7 +174,7 @@ In order to improve the performance of the model, it is usually necessary to con
### Model Optimization Configuration
-1. MindSpore Transformers provides recomputation-related configurations to reduce the memory footprint of the model during training, see [Recomputation](https://www.mindspore.cn/mindformers/docs/en/dev/perf_optimize/perf_optimize.html#recomputation) for details.
+1. MindSpore Transformers provides recomputation-related configurations to reduce the memory footprint of the model during training, see [Recomputation](https://www.mindspore.cn/mindformers/docs/en/dev/advanced_development/performance_optimization.html#recomputation) for details.
| Parameters | Descriptions | Types |
|----------------------------------------------------|---------------------------------------------------------------------------------------------------------|-----------------|
@@ -186,7 +186,7 @@ In order to improve the performance of the model, it is usually necessary to con
| recompute_config.select_recompute_exclude | Disable recomputation for the specified operator, valid only for the Primitive operators. | bool/list |
| recompute_config.select_comm_recompute_exclude | Disable communication recomputation for the specified operator, valid only for the Primitive operators. | bool/list |
-2. MindSpore Transformers provides fine-grained activations SWAP-related configurations to reduce the memory footprint of the model during training, see [Fine-Grained Activations SWAP](https://www.mindspore.cn/mindformers/docs/en/dev/function/fine_grained_activations_swap.html) for details.
+2. MindSpore Transformers provides fine-grained activations SWAP-related configurations to reduce the memory footprint of the model during training, see [Fine-Grained Activations SWAP](https://www.mindspore.cn/mindformers/docs/en/dev/feature/fine_grained_activations_swap.html) for details.
| Parameters | Descriptions | Types |
|----------------------------------------------------|---------------------------------------------------------------------------------------------------------|-----------------|
@@ -280,7 +280,7 @@ MindSpore Transformers provides model evaluation function, and also supports mod
### Profile Configuration
-MindSpore Transformers provides Profile as the main tool for model performance tuning, please refer to [Performance Tuning Guide](https://www.mindspore.cn/mindformers/docs/en/dev/perf_optimize/perf_optimize.html) for more details. The following is the Profile related configuration.
+MindSpore Transformers provides Profile as the main tool for model performance tuning; please refer to the [Performance Tuning Guide](https://www.mindspore.cn/mindformers/docs/en/dev/advanced_development/performance_optimization.html) for more details. The following is the Profile-related configuration.
| Parameters | Descriptions | Types |
|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
@@ -300,7 +300,7 @@ MindSpore Transformers provides Profile as the main tool for model performance t
### Metric Monitoring Configuration
-The metric monitoring configuration is primarily used to configure methods to record metrics during training, please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/dev/function/monitor.html) for more details.Below is a description of the common metric monitoring configuration options in MindSpore Transformers:
+The metric monitoring configuration is primarily used to configure methods for recording metrics during training; please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/dev/feature/monitor.html) for more details. Below is a description of the common metric monitoring configuration options in MindSpore Transformers:
| Parameters | Descriptions | Types |
|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
@@ -320,7 +320,7 @@ The metric monitoring configuration is primarily used to configure methods to re
### TensorBoard Configuration
-The TensorBoard configuration is primarily used to configure parameters related to TensorBoard during training, allowing for real-time monitoring and visualization of training metrics, please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/dev/function/monitor.html) for more details. Below is a description of the common TensorBoard configuration options in MindSpore Transformers:
+The TensorBoard configuration is primarily used to configure parameters related to TensorBoard during training, allowing for real-time monitoring and visualization of training metrics, please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/dev/feature/monitor.html) for more details. Below is a description of the common TensorBoard configuration options in MindSpore Transformers:
| Parameters | Descriptions | Types |
|---------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|
diff --git a/docs/mindformers/docs/source_en/function/dataset.md b/docs/mindformers/docs/source_en/feature/dataset.md
similarity index 99%
rename from docs/mindformers/docs/source_en/function/dataset.md
rename to docs/mindformers/docs/source_en/feature/dataset.md
index 9a4ac14ed082c26128558d9bf5e167db0097e5fb..e6664dccc591268e2b10e3d425acaeccdad5dd4b 100644
--- a/docs/mindformers/docs/source_en/function/dataset.md
+++ b/docs/mindformers/docs/source_en/feature/dataset.md
@@ -1,6 +1,6 @@
# Dataset
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/dataset.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/dataset.md)
MindSpore Transformers currently supports multiple types of dataset loading methods, covering common open-source and custom scenarios. Specifically, it includes:
@@ -186,7 +186,7 @@ The following explains how to configure and use Megatron datasets in the configu
| eod | Token ID of the EOD token in the dataset |
| pad | Token ID of the pad token in the dataset |
- In addition, the Megatron dataset also depends on configurations such as `input_columns`, `construct_args_key`, and `full_batch`. For more details, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html).
+ In addition, the Megatron dataset also depends on configurations such as `input_columns`, `construct_args_key`, and `full_batch`. For more details, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html).
Here, we only explain how to configure them in different scenarios:
@@ -279,7 +279,7 @@ HuggingFace datasets support online and offline loading of datasets from both th
#### Dataset Loading Process
-
+
The online dataset loading and processing functionality is primarily implemented through `CommonDataLoader`. The data loading part can be customized via configuration files, with detailed configuration instructions available in the [dataloader parameter description](#dataloader-parameter-description). The online loading module requires users to implement customizations for different datasets. For example, the `AlpacaInstructDataHandler` class can be used to preprocess the `alpaca` dataset. For more information, please refer to [Custom Data Handler](#custom-data-handler).
@@ -393,7 +393,7 @@ When packing is configured, the dataset returns an `actual_seq_len` column. For
prefetch_size: 1
```
- 1. For parameter descriptions in `train_dataset`, please refer to the [documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html).
+ 1. For parameter descriptions in `train_dataset`, please refer to the [documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html).
2. `AlpacaInstructDataHandler` is an online processing script developed for the `alpaca` dataset. If using a different dataset, you need to implement a custom data handler by referring to the [Custom Data Handler](#custom-data-handler) guide.
@@ -510,7 +510,7 @@ Users can define custom data handlers to apply various preprocessing logic to th
prefetch_size: 1
```
- The rest of the parameters can be described in "model training configuration" and "model evaluation configuration" in [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html).
+ The rest of the parameters can be described in "model training configuration" and "model evaluation configuration" in [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
Custom data handler:
@@ -626,7 +626,7 @@ Users can define custom data handlers to apply various preprocessing logic to th
seed: 0
```
- The rest of the parameters can be described in "model training configuration" and "model evaluation configuration" in [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html).
+ The rest of the parameters can be described in "model training configuration" and "model evaluation configuration" in [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
Custom adgen_handler:
diff --git a/docs/mindformers/docs/source_en/usage/evaluation.md b/docs/mindformers/docs/source_en/feature/evaluation.md
similarity index 98%
rename from docs/mindformers/docs/source_en/usage/evaluation.md
rename to docs/mindformers/docs/source_en/feature/evaluation.md
index 909c34cc63cc2e873959297df507e4c276d8650a..d201ce5b75d37593848846ab4eb2cd036c7cfe21 100644
--- a/docs/mindformers/docs/source_en/usage/evaluation.md
+++ b/docs/mindformers/docs/source_en/feature/evaluation.md
@@ -1,6 +1,6 @@
# Evaluation
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/evaluation.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/evaluation.md)
## Harness Evaluation
@@ -43,7 +43,7 @@ pip install -e .
#### Preparations Before Evaluation
1. Create a new directory with e.g. the name `model_dir` for storing the model yaml files.
- 2. Place the model inference yaml configuration file (predict_xxx_.yaml) in the directory created in the previous step. The directory location of the reasoning yaml configuration file for different models refers to [model library](../start/models.md).
+   2. Place the model inference yaml configuration file (predict_xxx.yaml) in the directory created in the previous step. For the directory location of the inference yaml configuration file of each model, refer to the [model library](../introduction/models.md).
3. Configure the yaml file. If the model class, model Config class, and model Tokenizer class in the yaml use external code, that is, the code files are in the [research](https://gitee.com/mindspore/mindformers/tree/dev/research) directory or other external directories, you need to modify the yaml file: under the `type` field of the corresponding class, add the `auto_register` field in the format `module.class` (`module` is the file name of the script where the class is located, and `class` is the class name; if it already exists, no modification is needed).
   Using the [predict_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml) configuration as an example, modify some of the configuration items as follows:
@@ -58,7 +58,7 @@ pip install -e .
auto_register: llama3_tokenizer.Llama3Tokenizer
```
- For detailed instructions on each configuration item, please refer to the [configuration description](../appendix/conf_files.md).
+ For detailed instructions on each configuration item, please refer to the [configuration description](../feature/configuration.md).
4. If you use the `ceval-valid`, `mmlu`, `cmmlu`, `race`, and `lambada` datasets for evaluation, you need to set `use_flash_attention` to `False`. Using `predict_llama3_1_8b.yaml` as an example, modify the yaml as follows:
```yaml
@@ -362,7 +362,7 @@ For OpenEuler systems follow the steps below to install:
#### Preparations Before Evaluation
1. Create a new directory, for example named `model_dir`, to store the model yaml file;
-2. Place the model inference yaml configuration file (predict_xxx_. yaml) in the directory created in the previous step. For details, Please refer to the inference content of description documents for each model in the [model library](../start/models.md);
+2. Place the model inference yaml configuration file (predict_xxx.yaml) in the directory created in the previous step. For details, please refer to the inference section of each model's description document in the [model library](../introduction/models.md);
3. Configure the yaml file.
Using [predict_cogvlm2_image_llama3_chat_19b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml) configuration as an example:
@@ -378,7 +378,7 @@ For OpenEuler systems follow the steps below to install:
vocab_file: "/{path}/tokenizer.model" # Specify the tokenizer file path
```
- Configure the yaml file. Refer to [configuration description](../appendix/conf_files.md).
+ Configure the yaml file. Refer to [configuration description](../feature/configuration.md).
4. The MMBench-Video dataset evaluation requires the use of the GPT-4 Turbo model for evaluation and scoring. Please prepare the corresponding API Key in advance and put it in the VLMEvalKit/.env file as follows:
```text
diff --git a/docs/mindformers/docs/source_en/function/high_availability.md b/docs/mindformers/docs/source_en/feature/high_availability.md
similarity index 99%
rename from docs/mindformers/docs/source_en/function/high_availability.md
rename to docs/mindformers/docs/source_en/feature/high_availability.md
index 72b5cc64b48e10964c9d55bada11a17da1d78928..7428778943bc09f242c764a5c67602018dbf0b5a 100644
--- a/docs/mindformers/docs/source_en/function/high_availability.md
+++ b/docs/mindformers/docs/source_en/feature/high_availability.md
@@ -1,6 +1,6 @@
# High Availability
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/high_availability.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/high_availability.md)
## Overview
diff --git a/docs/mindformers/docs/source_en/function/logs.md b/docs/mindformers/docs/source_en/feature/logging.md
similarity index 93%
rename from docs/mindformers/docs/source_en/function/logs.md
rename to docs/mindformers/docs/source_en/feature/logging.md
index 4e11068b14df3d797ae0618453a42c5300f3ee01..b1ca34abece385a24aa9334359f71d8f46405f7f 100644
--- a/docs/mindformers/docs/source_en/function/logs.md
+++ b/docs/mindformers/docs/source_en/feature/logging.md
@@ -1,6 +1,6 @@
# Logs
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/logs.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/logging.md)
## Logs Saving
@@ -54,13 +54,13 @@ output_dir: './output' # path to save logs/checkpoint/strategy
#### Specifying Output Directory for Single-Card Tasks
-In addition to specifying the yaml file configuration, MindSpore TransFormer also supports [run_mindformer In the one-click start script](https://www.mindspore.cn/mindformers/docs/en/dev/function/start_tasks.html#run-mindformer-one-click-start-script),
+In addition to the yaml file configuration, MindSpore Transformers also supports the [run_mindformer one-click start script](https://www.mindspore.cn/mindformers/docs/en/dev/feature/start_tasks.html#run-mindformer-one-click-start-script),
where the `--output_dir` option of the start command can be used to specify the log output path.
> If the output path is configured here, it will overwrite the configuration in the yaml file!
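+
+A hedged sketch of a single-card launch (the config path is illustrative):
+
+```bash
+python run_mindformer.py \
+  --config configs/llama2/finetune_llama2_7b.yaml \
+  --run_mode train \
+  --output_dir ./my_output
+```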
#### Distributed Task Specifies the Output Directory
-If the model training requires multiple servers, use the [distributed task launch script](https://www.mindspore.cn/mindformers/docs/en/dev/function/start_tasks.html#distributed-task-pull-up-script) to start the distributed training task.
+If the model training requires multiple servers, use the [distributed task launch script](https://www.mindspore.cn/mindformers/docs/en/dev/feature/start_tasks.html#distributed-task-pull-up-script) to start the distributed training task.
If shared storage is available, you can also specify the `LOG_DIR` input parameter of the startup script to set the log output path of the Worker and the Scheduler, so that the logs of all machine nodes are written to one path for unified observation.
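+
+A hedged sketch of pointing the logs at shared storage; the paths are illustrative, and the positional arguments mirror the `msrun_launcher.sh` launch examples used elsewhere in this documentation, where the argument after the node rank is the log directory:
+
+```bash
+LOG_DIR=/shared_storage/llama2_logs  # illustrative shared-storage path visible to all nodes
+bash ./scripts/msrun_launcher.sh "run_mindformer.py \
+ --config configs/llama2/finetune_llama2_7b.yaml \
+ --run_mode train \
+ --use_parallel True" \
+ 8 8 8118 0 ${LOG_DIR} False 300
+```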
diff --git a/docs/mindformers/docs/source_en/function/other_features.md b/docs/mindformers/docs/source_en/feature/memory_optimization.md
similarity index 50%
rename from docs/mindformers/docs/source_en/function/other_features.md
rename to docs/mindformers/docs/source_en/feature/memory_optimization.md
index 12100a2e31d0f0636ab8310d2bd90099fa1c2e9a..42d8d55257ad5e6b4c510dc02f0cca90cd71c060 100644
--- a/docs/mindformers/docs/source_en/function/other_features.md
+++ b/docs/mindformers/docs/source_en/feature/memory_optimization.md
@@ -1,10 +1,6 @@
-# Other features
+# Memory Optimization Features
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/other_features.md)
-
-During the large-scale training of deep learning models, challenges such as memory limitations, effective utilization of computational resources, and synchronization issues in distributed training are encountered. To address these challenges, training optimization algorithms are employed to enhance training efficiency, accelerate convergence, and improve the final model performance.
-
-MindSpore Transformer provides optimization algorithms like Recomputation, Gradient Accumulation, and Gradient Clipping for use during training.
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/memory_optimization.md)
## Recomputation
@@ -72,70 +68,273 @@ The main parameters for recomputation configuration are listed in the following
| mp_comm_recompute | Model parallel communication recomputation, whether to recompute communication operators in model parallelism. | (bool, optional) - After turning on, in automatic parallelism or semi-automatic parallelism mode, specify whether to recompute the communication operations introduced by model parallelism in the cell. Default value: `True`. |
| recompute_slice_activation | Slice recomputation, whether to slice the cell output that will be kept in memory. | (bool, optional) - Default value: `False`. |
-## Gradient Accumulation
+## Fine-Grained Activations SWAP
### Overview
-MindSpore supported the gradient accumulation implementation interface `mindspore.nn.wrap.cell_wrapper.GradAccumulationCell` in versions after 2.1.1, which provides the gradient accumulation capability by splitting MiniBatch. MindSpore Transformer encapsulates it into a unified training process and enables it through yaml configuration. For the principle of gradient accumulation and the ability of framework measurement, please refer to [MindSpore Document: Gradient Accumulation](https://www.mindspore.cn/tutorials/en/master/parallel/distributed_gradient_accumulation.html).
+In traditional large-scale model training tasks, the memory resources of computing cards often become a bottleneck. Although adopting larger-scale model parallel (mp) and pipeline parallel (pp) can alleviate the memory pressure on individual computing cards to some extent, it requires larger-scale cluster resources, and excessive communication can significantly reduce the model's Model FLOPs Utilization (MFU). Under limited cluster resources, recomputation is another effective method to mitigate memory pressure. It reduces the memory footprint of activations by discarding the storage of activation values during the forward propagation phase and recomputing the required activation values during gradient backpropagation. However, since recomputation introduces additional computational overhead, this method also significantly decreases the MFU of model training.
-### Configuration and Usage
+Against this backdrop, fine-grained activations SWAP can provide a third effective approach to reduce memory usage while offering greater end-to-end performance advantages. Specifically, SWAP offloads activations that need to be stored long-term to the host side during the forward propagation phase and prefetches them back to the device side in advance when they are needed during backpropagation. In terms of resource utilization, fine-grained activations SWAP leverages D2H/H2D bandwidth, which can overlap with computation tasks and D2D communication tasks during training, thereby masking the overhead of memory transfers.
-#### YAML Parameter Configuration
+The fine-grained activations SWAP technology offers high flexibility in usage. During the forward propagation phase of large model training, multiple activations of varying data sizes are generated, and users can selectively swap specific activations at operator granularity. When the model type or configuration changes, users can flexibly adjust the corresponding SWAP strategy to minimize memory overhead and achieve optimal performance.
+
+### Instructions for Use
+
+#### Constraint Scenarios
+
+- Only the static graph O0/O1 modes are supported
+- Compatible with Llama-family dense models; MoE sparse models will be supported in future updates
+- Somas does not support heterogeneity, so the following needs to be set in the configuration file:
+
+ ```yaml
+ context:
+      memory_optimize_level: "O0"
+ ```
+
+- When pipeline parallelism is disabled, the lazy_inline scenario must be enabled by setting the environment variable:
+
+ ```bash
+ ENABLE_LAZY_INLINE_NO_PIPELINE=1
+ ```
+
+- Only the Ascend backend is supported
+
+#### API Description
+
+Fine-grained activations SWAP is enabled through the `swap_config` field in YAML configuration, which includes four functional interfaces: `swap`, `default_prefetch`, `layer_swap`, and `op_swap`. These interfaces allow users to flexibly enable SWAP for specific layers or specific operators within layers.
+
+> MindSpore framework currently decouples memory offloading and memory release. When activations are offloaded from the device side to the host side, the memory space occupied on the device side is not immediately released even after all data has been transferred. An explicit release operation is required instead. Before triggering the memory release, the system checks whether the activation offloading is complete. If not, the process will wait in place until the offloading finishes.
+
+| Configuration Item | Type | Description |
+|:--:|:--:|:---|
+| swap | bool | Default False. When set to False, all four functional interfaces are disabled. When set to True, activations SWAP is enabled, and the system checks whether layer_swap and op_swap are None. If both are None, the default SWAP strategy is applied, which enables SWAP for the flash_attention operator across all layers. If either layer_swap or op_swap has a non-None value, the default policy is overridden, and SWAP is enabled according to the configurations in layer_swap and op_swap. |
+| default_prefetch | int | Default 1 and only takes effect when swap=True, layer_swap=None, and op_swap=None. It controls the timing of releasing memory in forward phase and starting prefetch in backward phase of the default SWAP strategy. A larger `default_prefetch` delays memory release during the forward phase, keeping device memory occupied by activations locked for an extended period after offloading, preventing reuse by other data blocks. It also starts earlier prefetching from host to device during the backward phase, applying memory pressure prematurely. A smaller `default_prefetch` releases memory earlier in the forward phase but may introduce idle waiting for copy operations to complete. Additionally, delayed prefetch in the backward phase may cause computation stalls if prefetching isn't finished before activation usage, impacting end-to-end performance. This interface allows users to fine-tune memory release and prefetch timing for optimal memory efficiency and performance.|
+| layer_swap | list | Default None. When set to None, this interface is inactive. When the type is List, this interface contains several elements of the Dict type. Each Dict element contains two keys, `backward_prefetch` and `layers`, which provide the backward prefetch timing and the indices of the layers for which SWAP is enabled. |
+| op_swap | list | Default None. When set to None, this interface is inactive. When the type is List, this interface contains several elements of the Dict type. Each Dict element contains three keys, `op_name`, `backward_prefetch`, and `layers`, which provide the operator name, the backward prefetch timing, and the indices of the layers for which SWAP is enabled for that operator. |
+
+#### Using SWAP Together with Recomputation
+
+Fine-Grained Activations SWAP and Recomputation have coupling effects:
-To enable gradient accumulation, users only need to configure the `gradient_accumulation_steps` item under the `runner_config` item in the configuration file and set it to the required number of gradient accumulation steps:
+1. If any operator has both recomputation and SWAP enabled simultaneously, recomputation will take effect while SWAP will not.
+2. For any operator with SWAP enabled, if its output is used by an operator with recomputation enabled, then SWAP for that operator will not take effect.
+3. The YAML configuration interface for recomputation only supports enabling recomputation for a specific number of layers sequentially from front to back, rather than selecting specific layers or specific operators within layers. This means when using both SWAP and recomputation together, SWAP can only be enabled for later layers or operators within later layers, preventing full utilization of SWAP's benefits. Therefore, when and only when `swap=True`, the recomputation interface functionality will be adjusted as shown in the table below.
+
+| Interface Name | Original Functionality | Functionality When Enabling SWAP |
+|:--:|:---|:---|
+| recompute | Determine the number of layers with recomputation enabled in each pipeline stage. | Pipeline stage-agnostic, only accepts bool/list type inputs. When bool type: enables recomputation for all layers; when list type: uses layer indices to enable recomputation for specific layers. |
+| select_recompute | Determine the number of layers with recomputation enabled for specific operators in each pipeline stage. | Pipeline stage-agnostic, for each operator's key-value pair, only accepts bool/list type inputs. When bool type: enables recomputation for all layers; when list type: uses layer indices to enable recomputation for specific layers. |
+| select_comm_recompute | Determine the number of layers with recomputation enabled for communication operators in each pipeline stage. | Pipeline stage-agnostic, only accepts bool/list type inputs. When bool type: enables recomputation for all layers; when list type: uses layer indices to enable recomputation for specific layers. |
+
+### Cases of Fine-Grained Activations SWAP
+
+This section demonstrates the usage of fine-grained activations SWAP using Llama2-7B training as an example.
+
+#### Environment Preparation
+
+Download MindSpore Transformers and prepare a pre-training dataset, such as wikitext.
+
+#### Case 1: Default SWAP Strategy
+
+Modify and supplement the recomputation and SWAP configurations in YAML as follows:
```yaml
-# runner config
-runner_config:
-...
-gradient_accumulation_steps: 4
-...
+context:
+ memory_optimize_level: "O0"
+model:
+ model_config:
+ num_layers: 4
+recompute_config:
+ recompute: False
+ select_recompute: False
+ select_comm_recompute: False
+swap_config:
+ swap: True
+ default_prefetch: 10
```
-#### Key Parameters Introduction
+Execute the following script from the repository root directory to launch single-node 8-NPU training. The user needs to specify the YAML file path (machine_ip needs to be set to the local environment IP address):
-| Parameter | Description | Value Description |
-|-----------------------------|----------------------------------------------------------------------------------------------|---------------------------------------|
-| gradient_accumulation_steps | The number of steps to accumulate gradients before performing backpropagation. Default: `1`. | (int, required) - Default value: `1`. |
+```bash
+export GLOG_v=1
+export MS_MEMORY_STATISTIC=1
+export ENABLE_LAZY_INLINE_NO_PIPELINE=1
+YAML_FILE=$1 # User specifies the YAML file path.
+ROOT_PATH=`pwd`
-#### Other Ways to Use Gradient Accumulation
+bash ./scripts/msrun_launcher.sh "run_mindformer.py \
+ --config ${ROOT_PATH}/${YAML_FILE} \
+ --run_mode train \
+ --use_parallel True" \
+ 8 8 8118 0 output/msrun False 300
+```
-In addition to the configuration file, when launching the `run_mindformer.py` script, you can specify the `--gradient_accumulation_steps` argument to use the gradient accumulation feature.
+After training completes, execute the command `cat output/msrun/worker_0.log | grep 'attention.flash_attention'` to check the execution status of the default SWAP strategy:
-#### Usage Restrictions of Gradient Accumulation
+```text
+-INFO - Set op_swap at layer 0: attention.flash_attention, value=10
+-INFO - Set op_swap at layer 1: attention.flash_attention, value=10
+-INFO - Set op_swap at layer 2: attention.flash_attention, value=10
+-INFO - Set op_swap at layer 3: attention.flash_attention, value=10
+```
-> Enabling gradient accumulation will increase memory overhead. Please pay attention to memory management to prevent Out Of Memory.
+The default SWAP strategy is executed successfully.
-1. Since the implementation of `GradAccumulationCell` relies on parallel features, gradient accumulation is currently only supported in **semi-automatic parallel mode**;
-2. In addition, in the pipeline parallel scenario, the meaning of gradient accumulation is the same as micro_batch and will not take effect. Please configure the `micro_batch_num` item to increase the training batch_size.
+#### Case 2: Select Specific Layers to Enable SWAP
-## Gradient Clipping
+Modify and supplement the recomputation and SWAP configurations in YAML as follows:
-### Overview
+```yaml
+context:
+ memory_optimize_level: "O0"
+model:
+ model_config:
+ num_layers: 4
+recompute_config:
+ recompute: False
+ select_recompute: False
+ select_comm_recompute: False
+swap_config:
+ swap: True
+ layer_swap:
+ - backward_prefetch: 20
+ layers: [0,3]
+```
-The gradient clipping algorithm can avoid the situation where the reverse gradient is too large and the optimal solution is skipped.
+Execute the following script from the repository root directory to launch single-node 8-NPU training. The user needs to specify the YAML file path (machine_ip needs to be set to the local environment IP address):
-### Configuration and Usage
+```bash
+export GLOG_v=1
+export MS_MEMORY_STATISTIC=1
+export ENABLE_LAZY_INLINE_NO_PIPELINE=1
+YAML_FILE=$1 # User specifies the YAML file path.
+ROOT_PATH=`pwd`
-#### YAML Parameter Configuration
+bash ./scripts/msrun_launcher.sh "run_mindformer.py \
+ --config ${ROOT_PATH}/${YAML_FILE} \
+ --run_mode train \
+ --use_parallel True" \
+ 8 8 8118 0 output/msrun False 300
+```
+
+After training completes, execute the command `cat output/msrun/worker_0.log | grep 'Set layer swap at'` to check the execution status of the SWAP strategy:
+
+```text
+-INFO - Set layer swap at layer 0 and value is: 20
+-INFO - Set layer swap at layer 3 and value is: 20
+```
+
+The strategy of enabling SWAP for specific layers is executed successfully.
-In MindSpore TransFormers, the default training process `MFTrainOneStepCell` integrates gradient clipping logic.
+#### Case 3: Select Specific Operators within Layers to Enable SWAP
-You can use the following example to enable gradient clipping:
+Modify and supplement the recomputation and SWAP configurations in YAML as follows:
```yaml
-# wrapper cell config
-runner_wrapper:
-type: MFTrainOneStepCell
-...
-use_clip_grad: True
-max_grad_norm: 1.0
-...
+context:
+ memory_optimize_level: "O0"
+model:
+ model_config:
+ num_layers: 4
+recompute_config:
+ recompute: False
+ select_recompute: False
+ select_comm_recompute: False
+swap_config:
+ swap: True
+ op_swap:
+ - op_name: 'attention'
+ backward_prefetch: 20
+ layers: [0,1,2]
+ - op_name: 'attention'
+ backward_prefetch: 10
+ layers: [3]
+ - op_name: 'feed_forward'
+ backward_prefetch: 15
+ layers: [1,2]
```
-#### Key Parameters Introduction
+Execute the following script from the repository root directory to launch single-node 8-NPU training. The user needs to specify the YAML file path (machine_ip needs to be set to the local environment IP address):
+
+```bash
+export GLOG_v=1
+export MS_MEMORY_STATISTIC=1
+export ENABLE_LAZY_INLINE_NO_PIPELINE=1
+YAML_FILE=$1 # User specifies the YAML file path.
+ROOT_PATH=`pwd`
+
+bash ./scripts/msrun_launcher.sh "run_mindformer.py \
+ --config ${ROOT_PATH}/${YAML_FILE} \
+ --run_mode train \
+ --use_parallel True" \
+ 8 8 8118 0 output/msrun False 300
+```
+
+After training completes, execute the command `cat output/msrun/worker_0.log | grep 'Set op_swap at layer'` to check the execution status of the SWAP strategy:
+
+```text
+-INFO - Set op_swap at layer 0: .attention, value=20
+-INFO - Set op_swap at layer 1: .attention, value=20, .feed_forward, value=15
+-INFO - Set op_swap at layer 2: .attention, value=20, .feed_forward, value=15
+-INFO - Set op_swap at layer 3: .attention, value=10
+```
+
+The strategy of enabling SWAP for specific operators within layers is executed successfully.
+
+#### Case 4: Use Fine-Grained Activations SWAP together with Recomputation
+
+Modify and supplement the recomputation and SWAP configurations in YAML as follows:
+
+```yaml
+context:
+ memory_optimize_level: "O0"
+model:
+ model_config:
+ num_layers: 4
+recompute_config:
+ recompute: False
+ select_recompute:
+ 'feed_forward': [0,3]
+ select_comm_recompute: False
+swap_config:
+ swap: True
+ op_swap:
+ - op_name: 'attention'
+ backward_prefetch: 20
+ layers: [0,1,2]
+ - op_name: 'attention'
+ backward_prefetch: 10
+ layers: [3]
+ - op_name: 'feed_forward'
+ backward_prefetch: 15
+ layers: [1,2]
+```
+
+Execute the following script from the repository root directory to launch single-node 8-NPU training. The user needs to specify the YAML file path (machine_ip needs to be set to the local environment IP address):
+
+```bash
+export GLOG_v=1
+export MS_MEMORY_STATISTIC=1
+export ENABLE_LAZY_INLINE_NO_PIPELINE=1
+YAML_FILE=$1 # User specifies the YAML file path.
+ROOT_PATH=`pwd`
+
+bash ./scripts/msrun_launcher.sh "run_mindformer.py \
+ --config ${ROOT_PATH}/${YAML_FILE} \
+ --run_mode train \
+ --use_parallel True" \
+ 8 8 8118 0 output/msrun False 300
+```
+
+After training completes, execute the command `cat output/msrun/worker_0.log | grep 'Set op_swap at layer' -C 1` to check the execution status of the SWAP and recomputation strategies:
+
+```text
+-INFO - Set select recompute at layer 0: feed_forward
+-INFO - Set op_swap at layer 0: .attention, value=20
+-INFO - Set op_swap at layer 1: .attention, value=20, .feed_forward, value=15
+-INFO - Set op_swap at layer 2: .attention, value=20, .feed_forward, value=15
+-INFO - Set select recompute at layer 3: feed_forward
+-INFO - Set op_swap at layer 3: .attention, value=10
+```
-| Parameter | Description | Value Description |
-|---------------|----------------------------------------------------------------------------------------|-------------------------------------------|
-| use_clip_grad | Controls whether gradient clipping is enabled during training, default value: `False`. | (bool, optional) - Default: `False`. |
-| max_grad_norm | Controls the maximum norm value of gradient clipping, default value: `1.0`. | (float, optional) - Default: `1.0`. |
\ No newline at end of file
+The strategy of enabling fine-grained activations SWAP together with recomputation is executed successfully.
diff --git a/docs/mindformers/docs/source_en/function/monitor.md b/docs/mindformers/docs/source_en/feature/monitor.md
similarity index 97%
rename from docs/mindformers/docs/source_en/function/monitor.md
rename to docs/mindformers/docs/source_en/feature/monitor.md
index dbeeed4d6f5284d4a751c5203998a3918a054ba2..f46742e33582d6f7dd0c8f492b21a7e4ca3da2f4 100644
--- a/docs/mindformers/docs/source_en/function/monitor.md
+++ b/docs/mindformers/docs/source_en/feature/monitor.md
@@ -1,6 +1,6 @@
# Training Metrics Monitoring
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/monitor.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/monitor.md)
MindSpore Transformers supports TensorBoard as a visualization tool for monitoring and analyzing various metrics and information during training. TensorBoard is a standalone visualization library that requires the user to manually install it, and it provides an interactive way to view loss, precision, learning rate, gradient distribution, and a variety of other things in training. After the user configures TensorBoard in the training `yaml` file, the event file is generated and updated in real time during the training of the large model, and the training data can be viewed via commands.
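+
+For example, a hedged sketch of viewing the generated event files with the standalone TensorBoard command line; the directory is illustrative and should match the event-file path configured in the training `yaml`:
+
+```bash
+tensorboard --logdir ./output/tensorboard --host 0.0.0.0 --port 6006
+```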
@@ -117,7 +117,7 @@ The names and descriptions of the metrics monitored by `MFLossMonitor` are liste
On the TensorBoard SCALARS page, the above metrics (assumed to be named `scalar_name`), except for the last two, have drop-down tabs for `scalar_name` and `scalar_name-vs-samples`. A line plot of this scalar versus the number of training iterations is shown under `scalar_name`, and a line plot of this scalar versus the number of samples is shown under `scalar_name-vs-samples`. An example plot of the learning rate `learning-rate` is shown below:
-
+
#### TrainingStateMonitor Monitoring Metrics
@@ -138,23 +138,23 @@ Depending on the specific settings, the above metrics will be displayed in the T
**Example of logging effect**
-
+
**Example of tensorboard visualization**
adam_m_norm
-
+
local_loss and local_norm
-
+
### Description of Text Data Visualization
On the TEXT page, a tab exists for each training configuration where the values for that configuration are recorded. This is shown in the following figure:
-
+
All configuration names and descriptions are listed below:
@@ -223,4 +223,4 @@ All configuration names and descriptions are listed below:
> 2. Configuration parameters set by the user in the training configuration file `yaml`;
> 3. Default configuration parameters during training.
>
-> Refer to [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html) for all configurable parameters.
\ No newline at end of file
+> Refer to [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html) for all configurable parameters.
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/feature/other_training_features.md b/docs/mindformers/docs/source_en/feature/other_training_features.md
new file mode 100644
index 0000000000000000000000000000000000000000..e8e6feefe14bc5f8978f46a708dcdc77d4620953
--- /dev/null
+++ b/docs/mindformers/docs/source_en/feature/other_training_features.md
@@ -0,0 +1,75 @@
+# Other Training Features
+
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/other_training_features.md)
+
+During the large-scale training of deep learning models, challenges such as memory limitations, effective utilization of computational resources, and synchronization issues in distributed training are encountered. To address these challenges, training optimization algorithms are employed to enhance training efficiency, accelerate convergence, and improve the final model performance.
+
+MindSpore Transformers provides optimization algorithms like Recomputation, Gradient Accumulation, and Gradient Clipping for use during training.
+
+## Gradient Accumulation
+
+### Overview
+
+MindSpore versions later than 2.1.1 provide the gradient accumulation interface `mindspore.nn.wrap.cell_wrapper.GradAccumulationCell`, which provides the gradient accumulation capability by splitting a batch into MiniBatches. MindSpore Transformers encapsulates it into the unified training process and enables it through yaml configuration. For the principle of gradient accumulation and the framework-side capability, please refer to [MindSpore Document: Gradient Accumulation](https://www.mindspore.cn/tutorials/en/master/parallel/distributed_gradient_accumulation.html).
+
+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+To enable gradient accumulation, users only need to configure the `gradient_accumulation_steps` item under the `runner_config` item in the configuration file and set it to the required number of gradient accumulation steps:
+
+```yaml
+# runner config
+runner_config:
+  ...
+  gradient_accumulation_steps: 4
+  ...
+```
+
+#### Key Parameters Introduction
+
+| Parameter | Description | Value Description |
+|-----------------------------|----------------------------------------------------------------------------------------------|---------------------------------------|
+| gradient_accumulation_steps | The number of steps to accumulate gradients before performing backpropagation. Default: `1`. | (int, required) - Default value: `1`. |
+
+#### Other Ways to Use Gradient Accumulation
+
+In addition to the configuration file, when launching the `run_mindformer.py` script, you can specify the `--gradient_accumulation_steps` argument to use the gradient accumulation feature.
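+
+A minimal launch sketch (the config path is illustrative; the argument mirrors the yaml item above):
+
+```bash
+python run_mindformer.py \
+  --config configs/llama2/finetune_llama2_7b.yaml \
+  --run_mode train \
+  --gradient_accumulation_steps 4
+```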
+
+#### Usage Restrictions of Gradient Accumulation
+
+> Enabling gradient accumulation will increase memory overhead. Please pay attention to memory management to prevent Out Of Memory.
+
+1. Since the implementation of `GradAccumulationCell` relies on parallel features, gradient accumulation is currently only supported in **semi-automatic parallel mode**;
+2. In addition, in the pipeline parallel scenario, gradient accumulation has the same meaning as micro_batch and will not take effect. Please configure the `micro_batch_num` item instead to increase the training batch_size, as sketched below.
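+
+A hedged sketch of increasing the global batch size through `micro_batch_num` in the pipeline parallel scenario, assuming `micro_batch_num` sits under `parallel_config` as in typical MindSpore Transformers configurations; the values are illustrative:
+
+```yaml
+parallel_config:
+  pipeline_stage: 2
+  micro_batch_num: 4  # plays the role of gradient accumulation when pipeline parallelism is enabled
+```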
+
+## Gradient Clipping
+
+### Overview
+
+The gradient clipping algorithm avoids the situation where the backward gradient becomes too large and the optimal solution is skipped.
+
+### Configuration and Usage
+
+#### YAML Parameter Configuration
+
+In MindSpore Transformers, the default training process `MFTrainOneStepCell` integrates gradient clipping logic.
+
+You can use the following example to enable gradient clipping:
+
+```yaml
+# wrapper cell config
+runner_wrapper:
+  type: MFTrainOneStepCell
+  ...
+  use_clip_grad: True
+  max_grad_norm: 1.0
+  ...
+```
+
+#### Key Parameters Introduction
+
+| Parameter | Description | Value Description |
+|---------------|----------------------------------------------------------------------------------------|-------------------------------------------|
+| use_clip_grad | Controls whether gradient clipping is enabled during training, default value: `False`. | (bool, optional) - Default: `False`. |
+| max_grad_norm | Controls the maximum norm value of gradient clipping, default value: `1.0`. | (float, optional) - Default: `1.0`. |
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/function/distributed_parallel.md b/docs/mindformers/docs/source_en/feature/parallel_training.md
similarity index 96%
rename from docs/mindformers/docs/source_en/function/distributed_parallel.md
rename to docs/mindformers/docs/source_en/feature/parallel_training.md
index ec76fd1ab85c6c3f18ce2ba3551edd6cf7a3c72f..ade427c95d54573fe98e98e9f35479b162e309d2 100644
--- a/docs/mindformers/docs/source_en/function/distributed_parallel.md
+++ b/docs/mindformers/docs/source_en/feature/parallel_training.md
@@ -1,6 +1,6 @@
-# Distributed Parallelism
+# Distributed Parallelism Training
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/distributed_parallel.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/parallel_training.md)
## Parallel Modes and Application Scenarios
@@ -33,7 +33,7 @@ MindSpore Transformers supports multiple parallelism features. You can use these
| **[Long sequence parallelism](#long-sequence-parallelism)** | Slices all inputs and output activations by sequence to further reduce the GPU memory usage of the model for processing long sequence inputs.|
| **[Multi-copy parallelism](https://www.mindspore.cn/docs/en/master/features/parallel/pipeline_parallel.html#mindspore-interleaved-pipeline-scheduler)** | Implements fine-grained parallel control among multiple copies to optimize performance and resource utilization. This mode is suitable for efficient training of models with large specifications. |
-For details about how to configure distributed parallel parameters, see [MindSpore Transformers Configuration Description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html).
+For details about how to configure distributed parallel parameters, see [MindSpore Transformers Configuration Description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
## Introduction to Parallel Characterization
@@ -64,7 +64,7 @@ Parameter Descriptions:
- use_ring_attention: Whether to enable Ring Attention, default is False.
- context_parallel: The number of sequence parallel slices, default is 1, configure according to user requirements.
-For configuration method of distributed parallel parameters, refer to the contents of the Parallel Configuration section in [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html).
+For configuration method of distributed parallel parameters, refer to the contents of the Parallel Configuration section in [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
#### Ulysses Sequence Parallelism
@@ -95,7 +95,7 @@ Parameter Descriptions:
- enable_alltoall: Generate the alltoall communication operator; default is False. When this parameter is not enabled, it will be replaced by a combination of other operators such as allgather. See the MindSpore `set_auto_parallel_context` [interface documentation](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_auto_parallel_context.html). We expect the all_to_all communication operators to be generated directly in the Ulysses scenario, so this configuration item is turned on.
- context_parallel_algo: Set to `ulysses_cp` to enable Ulysses sequence parallelism.
-For configuration method of distributed parallel parameters, refer to the contents of the Parallel Configuration section in [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html).
+For configuration method of distributed parallel parameters, refer to the contents of the Parallel Configuration section in [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
#### Hybrid Sequence Parallelism
@@ -121,7 +121,7 @@ Parameter Descriptions:
- context_parallel_algo: hybrid sequence parallelism is turned on when set to `hybrid_cp`.
- ulysses_degree_in_cp: the number of parallel slices of the Ulysses sequence.
-For configuration method of distributed parallel parameters, refer to the contents of the Parallel Configuration section in [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/appendix/conf_files.html).
+For the configuration method of distributed parallel parameters, refer to the Parallel Configuration section in the [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/en/dev/feature/configuration.html).
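A corresponding sketch for hybrid sequence parallelism is shown below; it assumes that `ulysses_degree_in_cp` sits next to `context_parallel_algo` under `parallel_config` and that the Ulysses degree divides the total number of sequence-parallel slices, with the remainder handled by Ring Attention. The values are illustrative only.

```yaml
# Illustrative fragment only: hybrid sequence parallelism
parallel_config:
  data_parallel: 1
  model_parallel: 2
  context_parallel: 8               # total number of sequence-parallel slices
  context_parallel_algo: hybrid_cp  # combine Ulysses and Ring Attention
  ulysses_degree_in_cp: 2           # Ulysses slices within each sequence-parallel group
```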
### Pipeline Parallelism
@@ -153,7 +153,7 @@ Notes:
- Currently, only Llama and DeepSeek series models are supported.
- Using Megatron's multi-source datasets for training is not yet supported.
-For more information on configuring distributed parallel parameters, see the [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html), specifically the section on parallel configuration.
+For more information on configuring distributed parallel parameters, see the [MindSpore Transformers configuration description](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html), specifically the section on parallel configuration.
## MindSpore Transformers Distributed Parallel Application Practices
diff --git a/docs/mindformers/docs/source_en/usage/quantization.md b/docs/mindformers/docs/source_en/feature/quantization.md
similarity index 97%
rename from docs/mindformers/docs/source_en/usage/quantization.md
rename to docs/mindformers/docs/source_en/feature/quantization.md
index 809bbe31ed7b634ce0e6d0188a1852f8a3a1239c..0cd91eeed5d0899616178d1a80ab68e835c2c11d 100644
--- a/docs/mindformers/docs/source_en/usage/quantization.md
+++ b/docs/mindformers/docs/source_en/feature/quantization.md
@@ -1,6 +1,6 @@
# Quantization
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/quantization.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/quantization.md)
## Overview
diff --git a/docs/mindformers/docs/source_en/function/resume_training.md b/docs/mindformers/docs/source_en/feature/resume_training.md
similarity index 99%
rename from docs/mindformers/docs/source_en/function/resume_training.md
rename to docs/mindformers/docs/source_en/feature/resume_training.md
index 6c9778ea8e69ebb84f476909ec148a7f8893ce0e..ec61b7934d5c2212ef1434069e7a20dc7ba316d2 100644
--- a/docs/mindformers/docs/source_en/function/resume_training.md
+++ b/docs/mindformers/docs/source_en/feature/resume_training.md
@@ -1,6 +1,6 @@
# Resumable Training After Breakpoint
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/resume_training.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/resume_training.md)
## Resumable Training
diff --git a/docs/mindformers/docs/source_en/function/safetensors.md b/docs/mindformers/docs/source_en/feature/safetensors.md
similarity index 97%
rename from docs/mindformers/docs/source_en/function/safetensors.md
rename to docs/mindformers/docs/source_en/feature/safetensors.md
index 3fdba20e49e22cbdc19eae185754b60613e4b8a4..69d9ddf182ebafabffc307baeeb07a3331abaef1 100644
--- a/docs/mindformers/docs/source_en/function/safetensors.md
+++ b/docs/mindformers/docs/source_en/feature/safetensors.md
@@ -1,6 +1,6 @@
# Safetensors Weights
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/safetensors.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/safetensors.md)
## Overview
@@ -15,7 +15,7 @@ There are two main types of Safetensors files: complete weights files and distri
Safetensors complete weights can be obtained in two ways:
1. Download directly from Huggingface.
-2. After MindSpore Transformers distributed training, the weights are generated by [merge script](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html#safetensors-weight-merging).
+2. After MindSpore Transformers distributed training, generate them with the [merge script](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html#safetensors-weight-merging).
An example Huggingface Safetensors directory structure is as follows:
@@ -106,7 +106,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
After the task is executed, a checkpoint folder is generated in the mindformers/output directory, and the model files are saved in that folder.
-For more details, please refer to: [Introduction to Pre-training](https://www.mindspore.cn/mindformers/docs/en/dev/usage/pre_training.html).
+For more details, please refer to: [Introduction to Pre-training](https://www.mindspore.cn/mindformers/docs/en/dev/guide/pre_training.html).
### Examples of Fine-tuning Tasks
@@ -154,7 +154,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
After the task is executed, a checkpoint folder is generated in the mindformers/output directory, and the model files are saved in that folder.
-For more details, please refer to [Introduction to SFT fine-tuning](https://www.mindspore.cn/mindformers/docs/en/dev/usage/sft_tuning.html)
+For more details, please refer to [Introduction to SFT fine-tuning](https://www.mindspore.cn/mindformers/docs/en/dev/guide/supervised_fine_tuning.html)
### Example of an Inference Task
@@ -201,7 +201,7 @@ The results of executing the above single-card inference and multi-card inferenc
'text_generation_text': [I love Beijing, because it is a city with a long history and culture.......]
```
-For more details, please refer to: [Introduction to Inference](https://www.mindspore.cn/mindformers/docs/en/dev/usage/inference.html)
+For more details, please refer to: [Introduction to Inference](https://www.mindspore.cn/mindformers/docs/en/dev/guide/inference.html)
### Examples of Resumable Training after Breakpoint Tasks
@@ -237,9 +237,9 @@ callbacks:
checkpoint_format: safetensors # Save weights file format
```
-In large cluster scale scenarios, to avoid the online merging process taking too long to occupy the training resources, it is recommended to [merge the complete weights](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html#safetensors-weight-merging) with the original distributed weights file offline, and then pass it in. There is no need to pass in the path of the source slicing strategy file.
+In large-scale cluster scenarios, to prevent the online merging process from occupying training resources for too long, it is recommended to [merge the original distributed weight files into complete weights](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html#safetensors-weight-merging) offline and then pass them in. In that case, there is no need to pass in the path of the source slicing strategy file.
-For more details, please refer to: [Resumable Training](https://www.mindspore.cn/mindformers/docs/en/dev/function/resume_training.html).
+For more details, please refer to: [Resumable Training](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html).
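For reference only, a resumable-training configuration that loads offline-merged complete weights might look like the sketch below. It assumes the top-level YAML keys mirror the command-line options described elsewhere in this documentation (`load_checkpoint`, `resume_training`, `src_strategy_path_or_dir`), and the path is a placeholder.

```yaml
# Illustrative fragment only: resume from complete weights merged offline
load_checkpoint: "/path/to/merged_complete_weights"  # offline-merged complete weights (placeholder path)
resume_training: True                                # enable resumable training
src_strategy_path_or_dir: ""                         # left empty: complete weights need no source slicing strategy
```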
## Weight Saving
@@ -247,7 +247,7 @@ For more details, please refer to: [Resumable Training](https://www.mindspore.cn
In the training process of deep learning models, saving the model weights is a crucial step. The weight saving function allows us to store the model parameters at any stage of training, so that users can restore, continue training, evaluate or deploy after training is interrupted or completed. At the same time, by saving weights, experimental results can be reproduced in different environments.
-Currently, MindSpore TransFormer supports reading and saving weight files in the [safetensors](https://www.mindspore.cn/mindformers/docs/en/dev/function/safetensors.html) format.
+Currently, MindSpore Transformers supports reading and saving weight files in the [safetensors](https://www.mindspore.cn/mindformers/docs/en/dev/feature/safetensors.html) format.
### Directory Structure
diff --git a/docs/mindformers/docs/source_en/function/start_tasks.md b/docs/mindformers/docs/source_en/feature/start_tasks.md
similarity index 96%
rename from docs/mindformers/docs/source_en/function/start_tasks.md
rename to docs/mindformers/docs/source_en/feature/start_tasks.md
index d5dfa3cb9782c3d8c9f9c2e66a12b97dd65141ba..e77498056b5976940fd0811d0cc0dc3b34fcb99f 100644
--- a/docs/mindformers/docs/source_en/function/start_tasks.md
+++ b/docs/mindformers/docs/source_en/feature/start_tasks.md
@@ -1,6 +1,6 @@
# Start Tasks
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/start_tasks.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/start_tasks.md)
## Overview
@@ -22,7 +22,7 @@ In the root directory of the MindSpore Transformers code, execute the `run_mindf
| `--device_id` | Set the execution device ID. The value must be within the range of available devices. | int, optional | pre-train/finetune/predict |
| `--device_target` | Set the backend execution device. MindSpore Transformers is only supported on `Ascend` devices. | str, optional | pre-train/finetune/predict |
| `--run_mode` | Set the running mode of the model: `train`, `finetune` or `predict`. | str, optional | pre-train/finetune/predict |
-| `--load_checkpoint` | File or folder paths for loading weights. For detailed usage, please refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html) | str, optional | pre-train/finetune/predict |
+| `--load_checkpoint` | File or folder paths for loading weights. For detailed usage, please refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html) | str, optional | pre-train/finetune/predict |
| `--use_parallel` | Whether to use parallel mode. | bool, optional | pre-train/finetune/predict |
| `--output_dir` | Set the path where log, checkpoint, strategy, etc. files are saved. | str, optional | pre-train/finetune/predict |
| `--register_path` | The absolute path of the directory where the external code is located. For example, the model directory under the research directory. | str, optional | pre-train/finetune/predict |
@@ -33,7 +33,7 @@ In the root directory of the MindSpore Transformers code, execute the `run_mindf
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|:----------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|-----------------------------|
| `--src_strategy_path_or_dir` | The path of the slicing strategy file or directory corresponding to `load_checkpoint`. | str, optional | pre-train/finetune/predict |
-| `--auto_trans_ckpt` | Enable online weight automatic conversion. Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html). | bool, optional | pre-train/finetune/predict |
+| `--auto_trans_ckpt` | Enable online weight automatic conversion. Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html). | bool, optional | pre-train/finetune/predict |
| `--transform_process_num` | The number of processes responsible for checkpoint transform. | int, optional | pre-train/finetune/predict |
| `--only_save_strategy` | Whether to only save the strategy files. | bool, optional, when it is `true`, the task exits directly after saving the strategy file. | pre-train/finetune/predict |
@@ -42,7 +42,7 @@ In the root directory of the MindSpore Transformers code, execute the `run_mindf
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|:--------------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|----------------------|
| `--train_dataset_dir` | Dataset directory of data loader to pre-train/finetune. | str, optional | pre-train/finetune |
-| `--resume_training` | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/function/resume_training.html#resumable-training). | bool, optional | pre-train/finetune |
+| `--resume_training` | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html#resumable-training). | bool, optional | pre-train/finetune |
| `--epochs` | Train epochs. | int, optional | pre-train/finetune |
| `--batch_size` | The sample size of the batch data. | int, optional | pre-train/finetune |
| `--gradient_accumulation_steps` | The number of gradient accumulation steps. | int, optional | pre-train/finetune |
diff --git a/docs/mindformers/docs/source_en/function/training_hyperparameters.md b/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
similarity index 99%
rename from docs/mindformers/docs/source_en/function/training_hyperparameters.md
rename to docs/mindformers/docs/source_en/feature/training_hyperparameters.md
index 05c4d093772b397042ceab2d71b6c5292bdfd7cf..d09a4c6cb13dc1a25bb92d390f66522cc85753b3 100644
--- a/docs/mindformers/docs/source_en/function/training_hyperparameters.md
+++ b/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
@@ -1,6 +1,6 @@
# Model Training Hyperparameters Configuration
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/training_hyperparameters.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/training_hyperparameters.md)
## Overview
diff --git a/docs/mindformers/docs/source_en/function/transform_weight.md b/docs/mindformers/docs/source_en/feature/transform_weight.md
similarity index 99%
rename from docs/mindformers/docs/source_en/function/transform_weight.md
rename to docs/mindformers/docs/source_en/feature/transform_weight.md
index ea60d771ebe050f198a19135204a7d6be7245cdc..f03949135ac6b4abcb8063614a3c2ab429cc9835 100644
--- a/docs/mindformers/docs/source_en/function/transform_weight.md
+++ b/docs/mindformers/docs/source_en/feature/transform_weight.md
@@ -1,6 +1,6 @@
# Distributed Weight Slicing and Merging
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/transform_weight.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/transform_weight.md)
## Overview
diff --git a/docs/mindformers/docs/source_en/function/weight_conversion.md b/docs/mindformers/docs/source_en/feature/weight_conversion.md
similarity index 99%
rename from docs/mindformers/docs/source_en/function/weight_conversion.md
rename to docs/mindformers/docs/source_en/feature/weight_conversion.md
index caaf18f1adc0acf4a266825c35be703566137652..c61bcbe880ded0adcbb54ed248838df60e0bcff7 100644
--- a/docs/mindformers/docs/source_en/function/weight_conversion.md
+++ b/docs/mindformers/docs/source_en/feature/weight_conversion.md
@@ -1,6 +1,6 @@
# Weight Format Conversion
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/weight_conversion.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/weight_conversion.md)
## Overview
diff --git a/docs/mindformers/docs/source_en/function/fine_grained_activations_swap.md b/docs/mindformers/docs/source_en/function/fine_grained_activations_swap.md
deleted file mode 100644
index 51a400555b5f9738ac6b8a95361ff36ac6f2bd47..0000000000000000000000000000000000000000
--- a/docs/mindformers/docs/source_en/function/fine_grained_activations_swap.md
+++ /dev/null
@@ -1,272 +0,0 @@
-# Fine-Grained Activations SWAP
-
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/function/fine_grained_activations_swap.md)
-
-## Overview
-
-In traditional large-scale model training tasks, the memory resources of computing cards often become a bottleneck. Although adopting larger-scale model parallel (mp) and pipeline parallel (pp) can alleviate the memory pressure on individual computing cards to some extent, it requires larger-scale cluster resources, and excessive communication can significantly reduce the model's Model FLOPs Utilization (MFU). Under limited cluster resources, recomputation is another effective method to mitigate memory pressure. It reduces the memory footprint of activations by discarding the storage of activation values during the forward propagation phase and recomputing the required activation values during gradient backpropagation. However, since recomputation introduces additional computational overhead, this method also significantly decreases the MFU of model training.
-
-Against this backdrop, fine-grained activations SWAP can provide a third effective approach to reduce memory usage while offering greater end-to-end performance advantages. Specifically, SWAP offloads activations that need to be stored long-term to the host side during the forward propagation phase and prefetches them back to the device side in advance when they are needed during backpropagation. In terms of resource utilization, fine-grained activations SWAP leverages D2H/H2D bandwidth, which can overlap with computation tasks and D2D communication tasks during training, thereby masking the overhead of memory transfers.
-
-The fine-grained activations SWAP technology offers high flexibility in usage. During the forward propagation phase of large model training, multiple activations of varying data sizes are generated, allowing users to swap specific activations at the granularity of the operator selectively. When the model type or configuration changes, users can flexibly adjust the corresponding SWAP strategy to minimize memory overhead and achieve optimal performance.
-
-## Instructions for Use
-
-### Constraint Scenarios
-
- Only supports static graph O0/O1 mode
- Compatible with Llama-family dense models; MoE sparse models will be supported in future updates
- Somas does not support heterogeneous memory, so the following needs to be set in the configuration file:
-
- ```yaml
- context:
- memory_optimize_level: O0
- ```
-
-- When pipeline parallelism is disabled, the lazy_inline scenario must be enabled by setting the environment variable:
-
- ```bash
- ENABLE_LAZY_INLINE_NO_PIPELINE=1
- ```
-
-- Only supports the Ascend backend
-
-### Instruction for API
-
-Fine-grained activations SWAP is enabled through the `swap_config` field in YAML configuration, which includes four functional interfaces: `swap`, `default_prefetch`, `layer_swap`, and `op_swap`. These interfaces allow users to flexibly enable SWAP for specific layers or specific operators within layers.
-
-> MindSpore framework currently decouples memory offloading and memory release. When activations are offloaded from the device side to the host side, the memory space occupied on the device side is not immediately released even after all data has been transferred. An explicit release operation is required instead. Before triggering the memory release, the system checks whether the activation offloading is complete. If not, the process will wait in place until the offloading finishes.
-
-| Configuration Item | Type | Description |
-|:--:|:--:|:---|
-| swap | bool | Default False. When set to False, all four functional interfaces are disabled. When set to True, activations SWAP is enabled, and the system checks whether layer_swap and op_swap are None. If both are None, the default SWAP strategy is applied, which enables SWAP for the flash_attention operator across all layers. If either layer_swap or op_swap has a non-None value, the default policy is overridden, and SWAP is enabled according to the configurations in layer_swap and op_swap. |
-| default_prefetch | int | Default 1 and only takes effect when swap=True, layer_swap=None, and op_swap=None. It controls the timing of releasing memory in forward phase and starting prefetch in backward phase of the default SWAP strategy. A larger `default_prefetch` delays memory release during the forward phase, keeping device memory occupied by activations locked for an extended period after offloading, preventing reuse by other data blocks. It also starts earlier prefetching from host to device during the backward phase, applying memory pressure prematurely. A smaller `default_prefetch` releases memory earlier in the forward phase but may introduce idle waiting for copy operations to complete. Additionally, delayed prefetch in the backward phase may cause computation stalls if prefetching isn't finished before activation usage, impacting end-to-end performance. This interface allows users to fine-tune memory release and prefetch timing for optimal memory efficiency and performance.|
-| layer_swap | list | Default None. When set to None, this interface is inactive. When the type is List, this interface contains several list elements of the Dict type. Each Dict element contains two keys, `backward_prefetch` and `layers`, which specify the prefetch timing and the indices of the layers for which SWAP is enabled. |
-| op_swap | list | Default None. When set to None, this interface is inactive. When the type is List, this interface contains several list elements of the Dict type. Each Dict element contains three keys, `op_name`, `backward_prefetch`, and `layers`, which specify the operator name, the prefetch timing, and the indices of the layers for which SWAP is enabled. |
-
-### Used together with Recomputation
-
-Fine-Grained Activations SWAP and Recomputation have coupling effects:
-
-1. If any operator has both recomputation and SWAP enabled simultaneously, recomputation will take effect while SWAP will not.
-2. For any operator with SWAP enabled, if its output is used by an operator with recomputation enabled, then SWAP for that operator will not take effect.
-3. The YAML configuration interface for recomputation only supports enabling recomputation for a specific number of layers sequentially from front to back, rather than selecting specific layers or specific operators within layers. This means when using both SWAP and recomputation together, SWAP can only be enabled for later layers or operators within later layers, preventing full utilization of SWAP's benefits. Therefore, when and only when `swap=True`, the recomputation interface functionality will be adjusted as shown in the table below.
-
-| Interface Name | Original Functionality | Functionality When Enabling SWAP |
-|:--:|:---|:---|
-| recompute | Determine the number of layers with recomputation enabled in each pipeline stage. | Pipeline stage-agnostic, only accepts bool/list type inputs. When bool type: enables recomputation for all layers; when list type: uses layer indices to enable recomputation for specific layers. |
-| select_recompute | Determine the number of layers with recomputation enabled for specific operators in each pipeline stage. | Pipeline stage-agnostic, for each operator's key-value pair, only accepts bool/list type inputs. When bool type: enables recomputation for all layers; when list type: uses layer indices to enable recomputation for specific layers. |
-| select_comm_recompute | Determine the number of layers with recomputation enabled for communication operators in each pipeline stage. | Pipeline stage-agnostic, only accepts bool/list type inputs. When bool type: enables recomputation for all layers; when list type: uses layer indices to enable recomputation for specific layers. |
-
-## Cases of Fine-Grained Activations SWAP
-
-This section demonstrates the usage of fine-grained activations SWAP using Llama2-7B training as an example.
-
-### Environmental Preparation
-
-Download MindSpore Transformers and prepare a pre-training dataset, such as wikitext.
-
-### Case 1: Default SWAP Strategy
-
-Modify and supplement the recomputation and SWAP configurations in YAML as follows:
-
-```yaml
-context:
- memory_optimize_level: "O0"
-model:
- model_config:
- num_layers: 4
-recompute_config:
- recompute: False
- select_recompute: False
- select_comm_recompute: False
-swap_config:
- swap: True
- default_prefetch: 10
-```
-
-Execute the following script to launch single-node 8-NPU training. Run the script from the root directory and pass in the YAML file path as an argument (machine_ip needs to be set to the local environment IP address):
-
-```bash
-export GLOG_v=1
-export MS_MEMORY_STATISTIC=1
-export ENABLE_LAZY_INLINE_NO_PIPELINE=1
-YAML_FILE=$1 # User specifies the YAML file path.
-ROOT_PATH=`pwd`
-
-bash ./scripts/msrun_launcher.sh "run_mindformer.py \
- --config ${ROOT_PATH}/${YAML_FILE} \
- --run_mode train \
- --use_parallel True" \
- 8 8 8118 0 output/msrun False 300
-```
-
-After training completes, execute the command `cat output/msrun/worker_0.log | grep 'attention.flash_attention'` to check the execution status of the default SWAP strategy:
-
-```text
--INFO - Set op_swap at layer 0: attention.flash_attention, value=10
--INFO - Set op_swap at layer 1: attention.flash_attention, value=10
--INFO - Set op_swap at layer 2: attention.flash_attention, value=10
--INFO - Set op_swap at layer 3: attention.flash_attention, value=10
-```
-
-The default SWAP strategy is executed successfully.
-
-### Case 2: Select Specific Layers to Enable SWAP
-
-Modify and supplement the recomputation and SWAP configurations in YAML as follows:
-
-```yaml
-context:
- memory_optimize_level: "O0"
-model:
- model_config:
- num_layers: 4
-recompute_config:
- recompute: False
- select_recompute: False
- select_comm_recompute: False
-swap_config:
- swap: True
- layer_swap:
- - backward_prefetch: 20
- layers: [0,3]
-```
-
-Execute the following script to launch single-node 8-NPU training. Run the script from the root directory and pass in the YAML file path as an argument (machine_ip needs to be set to the local environment IP address):
-
-```bash
-export GLOG_v=1
-export MS_MEMORY_STATISTIC=1
-export ENABLE_LAZY_INLINE_NO_PIPELINE=1
-YAML_FILE=$1 # User specifies the YAML file path.
-ROOT_PATH=`pwd`
-
-bash ./scripts/msrun_launcher.sh "run_mindformer.py \
- --config ${ROOT_PATH}/${YAML_FILE} \
- --run_mode train \
- --use_parallel True" \
- 8 8 8118 0 output/msrun False 300
-```
-
-After training completes, execute the command `cat output/msrun/worker_0.log | grep 'Set layer swap at'` to check the execution status of the layer-level SWAP strategy:
-
-```text
--INFO - Set layer swap at layer 0 and value is: 20
--INFO - Set layer swap at layer 3 and value is: 20
-```
-
-The strategy of enabling SWAP for specific layers is executed successfully.
-
-### Case 3: Select Specific Operators within Layers to Enable SWAP
-
-Modify and supplement the recomputation and SWAP configurations in YAML as follows:
-
-```yaml
-context:
- memory_optimize_level: "O0"
-model:
- model_config:
- num_layers: 4
-recompute_config:
- recompute: False
- select_recompute: False
- select_comm_recompute: False
-swap_config:
- swap: True
- op_swap:
- - op_name: 'attention'
- backward_prefetch: 20
- layers: [0,1,2]
- - op_name: 'attention'
- backward_prefetch: 10
- layers: [3]
- - op_name: 'feed_forward'
- backward_prefetch: 15
- layers: [1,2]
-```
-
-Execute the following script to launch single-node 8-NPU training. Run the script from the root directory and pass in the YAML file path as an argument (machine_ip needs to be set to the local environment IP address):
-
-```bash
-export GLOG_v=1
-export MS_MEMORY_STATISTIC=1
-export ENABLE_LAZY_INLINE_NO_PIPELINE=1
-YAML_FILE=$1 # User specifies the YAML file path.
-ROOT_PATH=`pwd`
-
-bash ./scripts/msrun_launcher.sh "run_mindformer.py \
- --config ${ROOT_PATH}/${YAML_FILE} \
- --run_mode train \
- --use_parallel True" \
- 8 8 8118 0 output/msrun False 300
-```
-
-After training completes, execute the command `cat output/msrun/worker_0.log | grep 'Set op_swap at layer'` to check the execution status of the operator-level SWAP strategy:
-
-```text
--INFO - Set op_swap at layer 0: .attention, value=20
--INFO - Set op_swap at layer 1: .attention, value=20, .feed_forward, value=15
--INFO - Set op_swap at layer 2: .attention, value=20, .feed_forward, value=15
--INFO - Set op_swap at layer 3: .attention, value=10
-```
-
-The strategy of enabling SWAP for specific operators within layers is executed successfully.
-
-### Case 4: Use Fine-Grained Activations SWAP together with Recomputation
-
-Modify and supplement the recomputation and SWAP configurations in YAML as follows:
-
-```yaml
-context:
- memory_optimize_level: "O0"
-model:
- model_config:
- num_layers: 4
-recompute_config:
- recompute: False
- select_recompute:
- 'feed_forward': [0,3]
- select_comm_recompute: False
-swap_config:
- swap: True
- op_swap:
- - op_name: 'attention'
- backward_prefetch: 20
- layers: [0,1,2]
- - op_name: 'attention'
- backward_prefetch: 10
- layers: [3]
- - op_name: 'feed_forward'
- backward_prefetch: 15
- layers: [1,2]
-```
-
-Execute the following script to launch single-node 8-NPU training. Run the script from the root directory and pass in the YAML file path as an argument (machine_ip needs to be set to the local environment IP address):
-
-```bash
-export GLOG_v=1
-export MS_MEMORY_STATISTIC=1
-export ENABLE_LAZY_INLINE_NO_PIPELINE=1
-YAML_FILE=$1 # User specifies the YAML file path.
-ROOT_PATH=`pwd`
-
-bash ./scripts/msrun_launcher.sh "run_mindformer.py \
- --config ${ROOT_PATH}/${YAML_FILE} \
- --run_mode train \
- --use_parallel True" \
- 8 8 8118 0 output/msrun False 300
-```
-
-After training completes, execute the command `cat output/msrun/worker_0.log | grep 'Set op_swap at layer' -C 1` to check the execution status of the combined SWAP and recomputation strategy:
-
-```text
--INFO - Set select recompute at layer 0: feed_forward
--INFO - Set op_swap at layer 0: .attention, value=20
--INFO - Set op_swap at layer 1: .attention, value=20, .feed_forward, value=15
--INFO - Set op_swap at layer 2: .attention, value=20, .feed_forward, value=15
--INFO - Set select recompute at layer 3: feed_forward
--INFO - Set op_swap at layer 3: .attention, value=10
-```
-
-The strategy of enabling fine-grained activations SWAP together with recomputation is executed successfully.
diff --git a/docs/mindformers/docs/source_en/usage/mindie_deployment.md b/docs/mindformers/docs/source_en/guide/deployment.md
similarity index 97%
rename from docs/mindformers/docs/source_en/usage/mindie_deployment.md
rename to docs/mindformers/docs/source_en/guide/deployment.md
index d7a477c997db1993f0d7d98912628ac1635d0884..223f7f3630d0b57fde6793bdf800533a9349cb84 100644
--- a/docs/mindformers/docs/source_en/usage/mindie_deployment.md
+++ b/docs/mindformers/docs/source_en/guide/deployment.md
@@ -1,6 +1,6 @@
# Service Deployment
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/mindie_deployment.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/guide/deployment.md)
## Introduction
@@ -8,7 +8,7 @@ MindIE, full name Mind Inference Engine, is a high-performance inference framewo
MindSpore Transformers is hosted in the model application layer MindIE LLM, and large models in MindSpore Transformers can be deployed through MindIE Service.
-The model support for MindIE inference can be found in [model repository](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html).
+The model support for MindIE inference can be found in [model repository](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html).
## Environment Setup
@@ -16,7 +16,7 @@ The model support for MindIE inference can be found in [model repository](https:
1. Install MindSpore Transformers
- Refer to [MindSpore Transformers Official Installation Guide](https://www.mindspore.cn/mindformers/docs/en/dev/quick_start/install.html) for installation.
+ Refer to [MindSpore Transformers Official Installation Guide](https://www.mindspore.cn/mindformers/docs/en/dev/installation.html) for installation.
2. Install MindIE
@@ -86,9 +86,9 @@ processor:
merges_file: "/path/to/mf_model/qwen1_5_72b/merges.txt" # merges file absolute path
```
-For model weight downloading and conversions, refer to the [Weight Format Conversion Guide](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html).
+For model weight downloading and conversions, refer to the [Weight Format Conversion Guide](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html).
-Required files and configurations may vary from model to model. Refer to the model-specific inference sections in [Model Repository](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html) for details.
+Required files and configurations may vary from model to model. Refer to the model-specific inference sections in [Model Repository](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html) for details.
### Starting MindIE
@@ -346,4 +346,4 @@ The validation is successful with the following returned inference result:
## Model List
-Examples of MindIE inference for other models can be found in the introduction documentation for each model in [Model Library](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html).
\ No newline at end of file
+Examples of MindIE inference for other models can be found in the introduction documentation for each model in [Model Library](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html).
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/usage/inference.md b/docs/mindformers/docs/source_en/guide/inference.md
similarity index 98%
rename from docs/mindformers/docs/source_en/usage/inference.md
rename to docs/mindformers/docs/source_en/guide/inference.md
index 7dbb41f52187ee933e6a1c6f0606468f43ed84de..94a8790f3b51222d1d81f8d31969d2fd27783160 100644
--- a/docs/mindformers/docs/source_en/usage/inference.md
+++ b/docs/mindformers/docs/source_en/guide/inference.md
@@ -1,6 +1,6 @@
# Inference
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/inference.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/guide/inference.md)
## Overview
@@ -22,8 +22,8 @@ Model weights can be categorized into two types: complete weights and distribute
Complete weights can be obtained in two ways:
-1. After downloading the open source weights of the corresponding model from the HuggingFace model library, refer to [Weight Format Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html) to convert them to the ckpt format.
-2. Pre-trained or fine-tuned distributed weights are used to generate a complete weight by [merging](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html).
+1. After downloading the open source weights of the corresponding model from the HuggingFace model library, refer to [Weight Format Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html) to convert them to the ckpt format.
+2. Generate a complete weight by [merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html) pre-trained or fine-tuned distributed weights.
#### 2.2 Distributed Weights
@@ -35,7 +35,7 @@ If the inference uses a weight slicing that is different from the model slicing
2. Weights from eight-card training are used for inference on two cards;
3. Distributed weights that have already been sliced are used for inference on a single card, and so on.
-The command samples in the following contents are all used in the way of online autoslicing. It is recommended to use online autoslicing by setting the command parameters `--auto_trans_ckpt` to `-True` and `-src_strategy_path_or_dir` to the weighted slicing strategy file or directory path (which is saved by default after training under `./output/strategy`) are automatically sliced in the inference task. Details can be found in [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html).
+The command samples in the following contents all use online automatic slicing. It is recommended to enable it by setting the command parameter `--auto_trans_ckpt` to `True` and `--src_strategy_path_or_dir` to the path of the weight slicing strategy file or directory (saved by default under `./output/strategy` after training), so that the weights are automatically sliced during the inference task. Details can be found in [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html).
> Since both the training and inference tasks use `./output` as the default output path, when using the strategy file output by the training task as the source weight strategy file for the inference task, you need to move the strategy file directory under the default output path to another location to avoid it being emptied by the process of the inference task, for example:
>
@@ -358,4 +358,4 @@ Thanks, sir.
## More Information
-For more inference examples of different models, see [the models supported by MindSpore Transformers](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html).
+For more inference examples of different models, see [the models supported by MindSpore Transformers](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html).
diff --git a/docs/mindformers/docs/source_en/usage/pre_training.md b/docs/mindformers/docs/source_en/guide/pre_training.md
similarity index 96%
rename from docs/mindformers/docs/source_en/usage/pre_training.md
rename to docs/mindformers/docs/source_en/guide/pre_training.md
index 55d538ff89bec54df22adc27d47870238ca118bd..a0a3a3ba35586db30f0d801606189cdd84e13194 100644
--- a/docs/mindformers/docs/source_en/usage/pre_training.md
+++ b/docs/mindformers/docs/source_en/guide/pre_training.md
@@ -1,6 +1,6 @@
# Pretraining
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/pre_training.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/guide/pre_training.md)
## Overview
@@ -82,8 +82,8 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
run_mode: running mode. The value can be train, finetune, or predict (inference).
```
-**Note**: During multi-node distributed training, some performance problems may occur. To ensure the efficiency and stability of the training process, you are advised to optimize and adjust the performance by referring to [Large Model Performance Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/perf_optimize/perf_optimize.html).
+**Note**: During multi-node distributed training, some performance problems may occur. To ensure the efficiency and stability of the training process, you are advised to optimize and adjust the performance by referring to [Large Model Performance Optimization Guide](https://www.mindspore.cn/mindformers/docs/en/dev/advanced_development/performance_optimization.html).
## More Information
-For more training examples of different models, see [the models supported by MindFormers](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html).
+For more training examples of different models, see [the models supported by MindFormers](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html).
diff --git a/docs/mindformers/docs/source_en/usage/sft_tuning.md b/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md
similarity index 98%
rename from docs/mindformers/docs/source_en/usage/sft_tuning.md
rename to docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md
index 0f8d2f58b87b3207e0312f484fd96f1e940ecc3f..f51952873dd2bf668686a3b7f1dc4125d6fee9dc 100644
--- a/docs/mindformers/docs/source_en/usage/sft_tuning.md
+++ b/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md
@@ -1,6 +1,6 @@
# Supervised Fine-Tuning (SFT)
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/usage/sft_tuning.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md)
## Overview
@@ -179,7 +179,7 @@ After the task is executed, the **checkpoint** folder is generated in the **mind
#### Multi-Node Training
-The multi-node multi-device fine-tuning task is similar to the pretrained task. You can refer to the [multi-node multi-device pretraining command](https://www.mindspore.cn/mindformers/docs/en/dev/usage/pre_training.html#multi-node-training) and modify the command as follows:
+The multi-node multi-device fine-tuning task is similar to the pretraining task. You can refer to the [multi-node multi-device pretraining command](https://www.mindspore.cn/mindformers/docs/en/dev/guide/pre_training.html#multi-node-training) and modify the command as follows:
1. Add the input parameter `--load_checkpoint /{path}/llama2_7b.ckpt` to the startup script to load the pretrained weights.
2. Set `--train_dataset_dir /{path}/alpaca-fastchat4096.mindrecord` in the startup script to load the fine-tuning dataset.
@@ -243,7 +243,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
--run_mode finetune" 8
```
-When the distributed strategy of the weights does not match the distributed strategy of the model, the weights need to be transformed. The load weight path should be set to the upper path of the directory named with `rank_0`, and the weight auto transformation function should be enabled by setting `--auto_trans_ckpt True` . For a more detailed description of the scenarios and usage of distributed weight transformation, please refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html).
+When the distributed strategy of the weights does not match the distributed strategy of the model, the weights need to be transformed. Set the weight loading path to the parent directory of the `rank_0` directory, and enable automatic weight transformation by setting `--auto_trans_ckpt True`. For a more detailed description of the scenarios and usage of distributed weight transformation, please refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html).
```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
diff --git a/docs/mindformers/docs/source_en/index.rst b/docs/mindformers/docs/source_en/index.rst
index bc039e3e82ee318dc96c5ce772ddcf478856c742..47b969487f09d717c906a6d5081f8147a96ce4e3 100644
--- a/docs/mindformers/docs/source_en/index.rst
+++ b/docs/mindformers/docs/source_en/index.rst
@@ -1,171 +1,198 @@
MindSpore Transformers Documentation
=====================================
-MindSpore Transformers (also known as MindFormers) is a MindSpore-native foundation model suite designed to provide full-flow development capabilities for foundation model training, fine-tuning, evaluating, inference and deploying, providing the industry mainstream Transformer class of pre-trained models and SOTA downstream task applications, and covering a rich range of parallel features, with the expectation of helping users to easily realize large model training and innovative research and development.
+The goal of the MindSpore Transformers suite is to build a full-process development suite for large model pre-training, fine-tuning, inference, and deployment. It provides mainstream Transformer-based Large Language Models (LLMs) and Multimodal Models (MMs), and is expected to help users easily realize the full process of large model development.
-Users can refer to `Overall Architecture `_ and `Model Library `_ to get a quick overview of the MindSpore Transformers system architecture, and the list of supported functional features and foundation models. Further, refer to the `Installation `_ and `Quick Start `_ to get started with MindSpore Transformers.
+Based on MindSpore's built-in parallel technology and component-based design, the MindSpore Transformers suite has the following features:
+
+- Provides one-click initiation of single-card or multi-card pre-training, fine-tuning, inference, and deployment processes for large models;
+- Provides rich multi-dimensional hybrid parallel capabilities for flexible and easy-to-use personalized configuration;
+- Provides system-level deep optimization of large model training and inference, with native support for efficient training and inference on ultra-large-scale clusters and rapid fault recovery;
+- Supports configurable development of task components: any module, including the model network, optimizer, and learning rate policy, can be enabled through unified configuration;
+- Provides real-time visualization of training accuracy/performance monitoring indicators.
+
+Users can refer to `Overall Architecture `_ and `Model Library `_ to get a quick overview of the MindSpore Transformers system architecture, and the list of supported foundation models.
If you have any suggestions for MindSpore Transformers, please contact us via `issue `_ and we will handle them promptly.
-MindSpore Transformers supports one-click start of single/multi-card training, fine-tuning, evaluation, and inference processes for any task, which makes the execution of deep learning tasks more efficient and user-friendly by simplifying the operation, providing flexibility, and automating the process. Users can learn from the following explanatory documents:
+Full-Process Development with MindSpore Transformers
+-------------------------------------------------------------------------------------------
+
+MindSpore Transformers supports one-click start of single/multi-card training, fine-tuning, and inference processes for any task, which makes the execution of deep learning tasks more efficient and user-friendly by simplifying the operation, providing flexibility, and automating the process. Users can learn from the following explanatory documents:
-- `Development Migration `_
-- `Pretraining `_
-- `SFT Tuning `_
-- `Evaluation `_
-- `Inference `_
-- `Quantization `_
-- `Service Deployment `_
-- `Multimodal Model Development `_
+- `Pretraining `_
+- `Supervised Fine-Tuning `_
+- `Inference `_
+- `Service Deployment `_
Code repository address:
-Flexible and Easy-to-Use Personalized Configuration with MindSpore Transformers
+Feature Descriptions of MindSpore Transformers
-------------------------------------------------------------------------------------------
-With its powerful feature set, MindSpore Transformers provides users with flexible and easy-to-use personalized configuration options. Specifically, it comes with the following key features:
+MindSpore Transformers provides a wealth of features throughout the full process of large model development. Users can learn about these features via the following links:
-1. `Start Tasks `_
+- General Features:
+
+ - `Start Tasks `_
One-click start for single-device, single-node and multi-node tasks.
-2. `Weight Format Conversion `_
+ - `Weight Format Conversion `_
+
+ Provides a unified weight conversion tool that converts model weights between the formats used by HuggingFace and MindSpore Transformers.
+
+ - `Distributed Weight Slicing and Merging `_
+
+ Weights in different distributed scenarios are flexibly sliced and merged.
+
+ - `Safetensors Weights `_
+
+ Supports saving and loading weight files in safetensors format.
+
+ - `Configuration File `_
+
+ Supports the use of `YAML` files to centrally manage and adjust configurable items in tasks.
+
+ - `Logging `_
+
+ Introduces logs, including log structure, log saving, and so on.
+
+- Training Features:
+
+ - `Dataset `_
- Provides a unified weight conversion tool that converts model weights between the formats used by HuggingFace and MindSpore Transformers.
+ Supports multiple types and formats of datasets.
-3. `Distributed Weight Slicing and Merging `_
+ - `Model Training Hyperparameters `_
- Weights in different distributed scenarios are flexibly sliced and merged.
+ Flexibly configure hyperparameter settings for large model training.
-4. `Distributed Parallel `_
+ - `Training Metrics Monitoring `_
- One-click configuration of multi-dimensional hybrid distributed parallel allows models to run efficiently in clusters up to 10,000 cards.
+ Provides visualization services for the training phase of large models for monitoring and analyzing various indicators and information during the training process.
-5. `Dataset `_
+ - `Resumable Training After Breakpoint `_
- Support multiple types and formats of datasets.
+ Supports step-level resumable training after breakpoint, effectively reducing the waste of time and resources caused by unexpected interruptions during large-scale training.
-6. `Model Training Hyperparameters Configuration `_
+ - `Training High Availability (Beta) `_
- Provides an introduction and examples of hyperparameter configuration for large model training.
+ Provides high-availability capabilities for the training phase of large models, including end-of-life CKPT preservation, UCE fault-tolerant recovery, and process-level rescheduling recovery (Beta feature).
-7. `Other features `_
+ - `Parallel Training `_
- Introduce features such as gradient accumulation and gradient clipping.
+ One-click configuration of multi-dimensional hybrid distributed parallelism allows models to run efficiently in clusters of up to 10,000 cards.
-8. `Logs `_
+ - `Training Memory Optimization `_
- Introduction of logs, including log structure, log saving, and so on.
+ Supports fine-grained recomputation and activations SWAP to reduce peak memory overhead during model training.
-9. `Resumable Training After Breakpoint `_
+ - `Other Training Features `_
- Supports step-level resumable training after breakpoint, effectively reducing the waste of time and resources caused by unexpected interruptions during large-scale training.
+ Supports gradient accumulation and gradient clipping, etc.
-10. `Training Metrics Monitoring `_
+- Inference Features:
- Provides visualization services for the training phase of large models for monitoring and analyzing various indicators and information during the training process.
+ - `Evaluation `_
-11. `Training High Availability `_
+ Supports the use of third-party open-source evaluation frameworks and datasets for large-scale model ranking evaluations.
- Provide high-availability capabilities for the training phase of large models, including end-of-life CKPT preservation, UCE fault-tolerant recovery, and process-level rescheduling recovery.
+ - `Quantization `_
-12. `Safetensors Weights `_
+ Integrates the MindSpore Golden Stick toolkit to provide a unified quantization inference process.
- Support the function of saving and loading weight files in safetensors format.
+Advanced Development with MindSpore Transformers
+-------------------------------------------------
-13. `Fine-Grained Activations SWAP `_
+- Diagnostics and Optimization
- Support fine-grained selection of specific activations to enable SWAP and reduce peak memory overhead during model training.
+ - `Precision Optimization `_
+ - `Performance Optimization `_
-Deep Optimizing with MindSpore Transformers
----------------------------------------------
+- Model Development
-- `Precision Optimizing `_
-- `Performance Optimizing `_
+ - `Development Migration `_
+ - `Multimodal Model Development `_
-Appendix
+Environment Variables
------------------------------------
-- `Environment Variables Descriptions `_
-- `Configuration File Descriptions `_
+- `Environment Variables Description `_
-FAQ
+Contribution Guide
------------------------------------
-- `Model-Related `_
-- `Function-Related `_
-- `MindSpore Transformers Contribution Guide `_
-- `Modelers Contribution Guide `_
+- `MindSpore Transformers Contribution Guide `_
+- `Modelers Contribution Guide `_
-.. toctree::
- :glob:
- :maxdepth: 1
- :caption: Start
- :hidden:
+FAQ
+------------------------------------
- start/overview
- start/models
+- `Model-Related `_
+- `Function-Related `_
.. toctree::
:glob:
:maxdepth: 1
- :caption: Quick Start
+ :caption: Introduction
:hidden:
- quick_start/install
- quick_start/source_code_start
+ introduction/overview
+ introduction/models
.. toctree::
:glob:
:maxdepth: 1
- :caption: Usage Tutorials
+ :caption: Installation
:hidden:
- usage/dev_migration
- usage/multi_modal
- usage/pre_training
- usage/sft_tuning
- usage/evaluation
- usage/inference
- usage/quantization
- usage/mindie_deployment
- usage/pretrain_gpt
+ installation
.. toctree::
:glob:
:maxdepth: 1
- :caption: Function Description
+ :caption: Full-process Guide to Large Models
:hidden:
- function/start_tasks
- function/weight_conversion
- function/transform_weight
- function/distributed_parallel
- function/dataset
- function/training_hyperparameters
- function/other_features
- function/logs
- function/resume_training
- function/monitor
- function/high_availability
- function/safetensors
- function/fine_grained_activations_swap
+ guide/pre_training
+ guide/supervised_fine_tuning
+ guide/inference
+ guide/deployment
.. toctree::
:glob:
:maxdepth: 1
- :caption: Precision Optimization
+ :caption: Features
:hidden:
- acc_optimize/acc_optimize
+ feature/start_tasks
+ feature/weight_conversion
+ feature/transform_weight
+ feature/safetensors
+ feature/configuration
+ feature/logging
+ feature/dataset
+ feature/training_hyperparameters
+ feature/monitor
+ feature/resume_training
+ feature/parallel_training
+ feature/high_availability
+ feature/memory_optimization
+ feature/other_training_features
+ feature/evaluation
+ feature/quantization
.. toctree::
:glob:
:maxdepth: 1
- :caption: Performance Optimization
+ :caption: Advanced Development
:hidden:
- perf_optimize/perf_optimize
+ advanced_development/precision_optimization
+ advanced_development/performance_optimization
+ advanced_development/dev_migration
+ advanced_development/multi_modal_dev
.. toctree::
:maxdepth: 1
@@ -186,11 +213,19 @@ FAQ
.. toctree::
:glob:
:maxdepth: 1
- :caption: Appendix
+ :caption: Environment Variables
+ :hidden:
+
+ env_variables
+
+.. toctree::
+ :glob:
+ :maxdepth: 1
+ :caption: Contribution Guide
:hidden:
- appendix/env_variables
- appendix/conf_files
+ contribution/mindformers_contribution
+ contribution/modelers_contribution
.. toctree::
:glob:
@@ -199,6 +234,4 @@ FAQ
:hidden:
faq/model_related
- faq/func_related
- faq/mindformers_contribution
- faq/modelers_contribution
\ No newline at end of file
+ faq/feature_related
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/quick_start/install.md b/docs/mindformers/docs/source_en/installation.md
similarity index 92%
rename from docs/mindformers/docs/source_en/quick_start/install.md
rename to docs/mindformers/docs/source_en/installation.md
index 6f0ccb6b1615a78f13b4386c27b76bf29e143ea2..68ea9d2c4319f738acad38eebbceea010438cb33 100644
--- a/docs/mindformers/docs/source_en/quick_start/install.md
+++ b/docs/mindformers/docs/source_en/installation.md
@@ -1,6 +1,6 @@
# Installation
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/quick_start/install.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/installation.md)
## Confirming Version Matching Relationship
@@ -23,7 +23,7 @@ Historical version matching relationship:
## Installing Dependent Software
-1. Install Firmware and Driver: Download the firmware and driver package through the [Confirming Version Matching Relationship](https://www.mindspore.cn/mindformers/docs/en/dev/quick_start/install.html#confirming-version-matching-relationship) to download the installation package, and refer to the [Ascend official tutorial](https://www.hiascend.com/en/document) for installation.
+1. Install Firmware and Driver: Download the firmware and driver installation package according to the [Confirming Version Matching Relationship](https://www.mindspore.cn/mindformers/docs/en/dev/installation.html#confirming-version-matching-relationship) section, and refer to the [Ascend official tutorial](https://www.hiascend.com/en/document) for installation.
2. Install CANN and MindSpore: Use the officially provided Docker image (CANN, MindSpore are already included in the image, no need to install them manually) or follow the [Manual Installation](https://www.mindspore.cn/install/en) section on the MindSpore website for installation.
diff --git a/docs/mindformers/docs/source_en/start/models.md b/docs/mindformers/docs/source_en/introduction/models.md
similarity index 99%
rename from docs/mindformers/docs/source_en/start/models.md
rename to docs/mindformers/docs/source_en/introduction/models.md
index 3f49dc31ef50ba23179044b438de6c2e11e87c10..48468134cf399c8a63b7d51177dc149fdb62af90 100644
--- a/docs/mindformers/docs/source_en/start/models.md
+++ b/docs/mindformers/docs/source_en/introduction/models.md
@@ -1,6 +1,6 @@
# Models
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/start/models.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/introduction/models.md)
The following table lists models supported by MindFormers.
diff --git a/docs/mindformers/docs/source_en/start/overview.md b/docs/mindformers/docs/source_en/introduction/overview.md
similarity index 52%
rename from docs/mindformers/docs/source_en/start/overview.md
rename to docs/mindformers/docs/source_en/introduction/overview.md
index 4a4ec0fda939f58d62ce61ffddfed0508f46b0ef..30e98a2cc29c48591b22f3c7caccbe74248565a0 100644
--- a/docs/mindformers/docs/source_en/start/overview.md
+++ b/docs/mindformers/docs/source_en/introduction/overview.md
@@ -1,13 +1,13 @@
# Overall Structure
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/start/overview.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/introduction/overview.md)
The overall architecture formed by MindSpore Transformers and the end-to-end AI hardware and software ecosystem of MindSpore and Ascend is as follows:
1. At the hardware level, MindSpore Transformers supports users running large models on Ascend servers;
2. At the software level, MindSpore Transformers implements the big model-related code through the Python interface provided by MindSpore and performs data computation by the operator libraries provided by the supporting software package of the Ascend AI processor;
3. The basic functionality features currently supported by MindSpore Transformers are listed below:
- 1. Supports tasks such as running training and inference for large models [distributed parallelism](https://www.mindspore.cn/mindformers/docs/en/dev/function/distributed_parallel.html), with parallel capabilities including data parallelism, model parallelism, ultra-long sequence parallelism;
- 2. Supports [model weight conversion](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html), [distributed weight splitting and combination](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html), and different format of [dataset loading](https://www.mindspore.cn/mindformers/docs/en/dev/function/dataset.html) and [resumable training after breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/function/resume_training.html);
- 3. Support 25+ large models [pretraining](https://www.mindspore.cn/mindformers/docs/en/dev/usage/pre_training.html), [fine-tuning](https://www.mindspore.cn/mindformers/docs/en/dev/usage/sft_tuning.html), [inference](https://www.mindspore.cn/mindformers/docs/en/dev/usage/inference.html) and [evaluation] (https://www.mindspore.cn/mindformers/docs/en/dev/usage/evaluation.html). Meanwhile, it also supports [quantization](https://www.mindspore.cn/mindformers/docs/en/dev/usage/quantization.html), and the list of supported models can be found in [Model Library](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html);
-4. MindSpore Transformers supports users to carry out model service deployment function through [MindIE](https://www.mindspore.cn/mindformers/docs/en/dev/usage/mindie_deployment.html), and also supports the use of [MindX]( https://www.hiascend.com/software/mindx-dl) to realize large-scale cluster scheduling; more third-party platforms will be supported in the future, please look forward to it.
+    1. Supports [distributed parallelism](https://www.mindspore.cn/mindformers/docs/en/dev/feature/parallel_training.html) for large-model training and inference tasks, with parallel capabilities including data parallelism, model parallelism, and ultra-long sequence parallelism;
+    2. Supports [model weight conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html), [distributed weight splitting and combination](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html), [dataset loading](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html) in different formats, and [resumable training after breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html);
+    3. Supports [pretraining](https://www.mindspore.cn/mindformers/docs/en/dev/guide/pre_training.html), [fine-tuning](https://www.mindspore.cn/mindformers/docs/en/dev/guide/supervised_fine_tuning.html), [inference](https://www.mindspore.cn/mindformers/docs/en/dev/guide/inference.html), and [evaluation](https://www.mindspore.cn/mindformers/docs/en/dev/guide/evaluation.html) for 25+ large models, as well as [quantization](https://www.mindspore.cn/mindformers/docs/en/dev/feature/quantization.html); the list of supported models can be found in the [Model Library](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html);
+4. MindSpore Transformers supports model service deployment through [MindIE](https://www.mindspore.cn/mindformers/docs/en/dev/guide/mindie_deployment.html) and large-scale cluster scheduling through [MindX](https://www.hiascend.com/software/mindx-dl); more third-party platforms will be supported in the future.
diff --git a/docs/mindformers/docs/source_en/quick_start/source_code_start.md b/docs/mindformers/docs/source_en/quick_start/source_code_start.md
deleted file mode 100644
index 0f6145e4f1035ba1a0d68a6c9e9d847c2783780e..0000000000000000000000000000000000000000
--- a/docs/mindformers/docs/source_en/quick_start/source_code_start.md
+++ /dev/null
@@ -1,110 +0,0 @@
-# Calling Source Code to Start
-
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/quick_start/source_code_start.md)
-
-This section shows how to use MindSpore Transformers to quickly pull up a LoRA low-parameter fine-tuning task based on the Llama2-7B model. To use other models and tasks via MindSpore Transformers, please read the corresponding [model documentation](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html).
-
-## Preparing Weights File
-
-MindSpore Transformers provides pre-trained weights and word list files that have been converted for pre-training, fine-tuning and inference. Users can also download the official HuggingFace weights and use them after converting the model weights. For convenience, this file won't go into too much detail about converting the original weights here, but you can refer to the [Llama2 documentation](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md) and [weight conversion](https://www.mindspore.cn/mindformers/docs/en/dev/function/weight_conversion.html) for more details. Please download the `MindSpore` weights, the converted `.ckpt` file, and the `tokenizer.model` file for subsequent processing.
-
-| Model Name | MindSpore Weights | HuggingFace Weights |
-| ------ | ------ | ------ |
-| Llama2-7B | [llama2_7b.ckpt](https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/llama2_7b.ckpt) | [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
-
-Word list download link: [tokenizer.model](https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/tokenizer.model)
-
-## Preparing Dataset
-
-1. The dataset file alpaca_data.json used in the fine-tuning process can be obtained at [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca).
-
-2. Data Preprocessing
-
- The following command needs to be executed in the MindSpore Transformers code root directory, and replaces {path} below with the local path where the dataset files are stored.
-
- 1. Execute [mindformers/tools/dataset_preprocess/llama/alpaca_converter.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/alpaca_converter.py), and add prompt templates to convert the raw dataset into a multi-round conversation format.
-
- ```shell
- python mindformers/tools/dataset_preprocess/llama/alpaca_converter.py \
- --data_path /{path}/alpaca_data.json \
- --output_path /{path}/alpaca-data-conversation.json
- ```
-
- **Parameter descriptions**
-
- - data_path: Input the path to the downloaded file.
- - output_path: Save path of the output file.
-
- 2. Execute [mindformers/tools/dataset_preprocess/llama/llama_preprocess.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py), and generate MindRecord data and convert data with prompt templates to MindRecord format.
-
- ```shell
- python mindformers/tools/dataset_preprocess/llama/llama_preprocess.py \
- --dataset_type qa \
- --input_glob /{path}/alpaca-data-conversation.json \
- --model_file /{path}/tokenizer.model \
- --seq_length 4096 \
- --output_file /{path}/alpaca-fastchat4096.mindrecord
- ```
-
- **Parameter descriptions**
-
- - dataset_type: Preprocessed data types. The options include "wiki" and "qa."
- - "wiki" is used to process the Wikitext2 dataset, which is suitable for the pre-training and evaluation stages.
- - "qa" is used to process the Alpaca dataset, converting it into a question-answer format, which is suitable for the fine-tuning stage.
- For other dataset conversion scripts, please refer to the corresponding [model documentation](https://www.mindspore.cn/mindformers/docs/en/dev/start/models.html).
- - input_glob: Path to the converted alpaca file.
- - model_file: Path to the model tokenizer.model file.
- - seq_length: Sequence length of the output data.
- - output_file: Save path of the output file.
-
- 3. The console outputs the following, proving that the format conversion was successful.
-
- ```shell
- # Console outputs
- Transformed 52002 records.
- Transform finished, output files refer: {path}/alpaca-fastchat4096.mindrecord
- ```
-
-## Initiating Fine-tuning
-
-In the MindSpore Transformers code root directory, execute the following command to launch the fine-tuning task:
-
-```shell
-bash scripts/msrun_launcher.sh "run_mindformer.py \
- --config configs/llama2/lora_llama2_7b.yaml \
- --train_dataset_dir /{path}/alpaca-fastchat4096.mindrecord \
- --load_checkpoint /{path}/llama2_7b.ckpt \
- --auto_trans_ckpt True \
- --use_parallel True \
- --run_mode finetune" 8
-```
-
-**Command Explanation:**
-
-- `scripts/msrun_launcher.sh`: Script for launching distributed tasks.
-- `"run_mindformer.py ..."`: Parameter string for the Python task executed on each card, including:
- - `run_mindformer.py`: One-click startup script.
- - `--config`: Specifies the task configuration file path, e.g., `configs/llama2/lora_llama2_7b.yaml`.
- - `--train_dataset_dir`: Specifies the dataset path, e.g., `/{path}/alpaca-fastchat4096.mindrecord`.
- - `--load_checkpoint`: Specifies the checkpoint file path, e.g., `/{path}/llama2_7b.ckpt`.
- - `--auto_trans_ckpt True`: Enables automatic checkpoint partitioning.
- - `--use_parallel True`: Enables distributed task execution.
- - `--run_mode finetune`: Sets the run mode to fine-tuning.
-- `8`: Sets the task to runs on 8 NPUs.
-
-When the following log appears on the console:
-
-```shell
-Start worker process with rank id:0, log file:output/msrun_log/worker_0.log. Environment variable [RANK_ID=0] is exported.
-Start worker process with rank id:1, log file:output/msrun_log/worker_1.log. Environment variable [RANK_ID=1] is exported.
-Start worker process with rank id:2, log file:output/msrun_log/worker_2.log. Environment variable [RANK_ID=2] is exported.
-Start worker process with rank id:3, log file:output/msrun_log/worker_3.log. Environment variable [RANK_ID=3] is exported.
-Start worker process with rank id:4, log file:output/msrun_log/worker_4.log. Environment variable [RANK_ID=4] is exported.
-Start worker process with rank id:5, log file:output/msrun_log/worker_5.log. Environment variable [RANK_ID=5] is exported.
-Start worker process with rank id:6, log file:output/msrun_log/worker_6.log. Environment variable [RANK_ID=6] is exported.
-Start worker process with rank id:7, log file:output/msrun_log/worker_7.log. Environment variable [RANK_ID=7] is exported.
-```
-
-It indicates that the fine-tuning task is started, the progress can be monitored in the `output/msrun_log/` directory.
-
-For more details on Llama2, and more startup approaches, please refer specifically to the `Llama2` [README](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#llama-2) documentation for more support.
diff --git a/docs/mindformers/docs/source_zh_cn/usage/dev_migration.md b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
similarity index 89%
rename from docs/mindformers/docs/source_zh_cn/usage/dev_migration.md
rename to docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
index b25bd538337f569c5643c0f6128b1425b460db94..3367c081785915b2ba977f922d2122ca338907d2 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/dev_migration.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
@@ -1,6 +1,6 @@
# 开发迁移
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/dev_migration.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md)
本文档将指导用户如何基于MindSpore Transformers开发构建一个大模型,并完成最基本的适配,以拉起训练和推理流程。
@@ -46,9 +46,9 @@ MindSpore Transformers提供了[PretrainedTokenizer](https://www.mindspore.cn/mi
### 准备权重和数据集
-如已有基于PyTorch的模型权重,可以参考[权重转换文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html)将权重转换为MindSpore格式的权重。
+如已有基于PyTorch的模型权重,可以参考[权重转换文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)将权重转换为MindSpore格式的权重。
-数据集的准备可以参考[数据集文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/dataset.html),或参考模型文档,如[Llama2说明文档——数据集准备](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87)。
+数据集的准备可以参考[数据集文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html),或参考模型文档,如[Llama2说明文档——数据集准备](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87)。
### 准备`YAML`配置文件
@@ -93,13 +93,13 @@ python run_mindformer.py --config research/llama3_1/predict_llama3_1_8b.yaml --l
其中设置了`register_path`为外挂代码所在目录的路径`research/llama3_1`,模型权重的准备参考[Llama3.1说明文档——模型权重下载](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD)。
-配置文件的详细内容及可配置项可以参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html)。在实际编写配置文件时,也可以参考库内已有的配置文件,例如[Llama2-7B微调的配置文件](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/finetune_llama2_7b.yaml)。
+配置文件的详细内容及可配置项可以参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html)。在实际编写配置文件时,也可以参考库内已有的配置文件,例如[Llama2-7B微调的配置文件](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/finetune_llama2_7b.yaml)。
-在准备完上述所有基本要素之后,可以参考MindSpore Transformers使用教程中的其余文档进行模型训练、微调、推理等流程的实践。后续模型调试调优可以参考[大模型精度调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/acc_optimize/acc_optimize.html)和[大模型性能调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/perf_optimize/perf_optimize.html)。
+在准备完上述所有基本要素之后,可以参考MindSpore Transformers使用教程中的其余文档进行模型训练、微调、推理等流程的实践。后续模型调试调优可以参考[大模型精度调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/advanced_development/precision_optimization.html)和[大模型性能调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/advanced_development/performance_optimization.html)。
### 将模型贡献给MindSpore Transformers开源仓库
-可以参考[MindSpore Transformers贡献指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/faq/mindformers_contribution.html),将模型贡献到MindSpore Transformers的开源仓库,供广大开发者研究和使用。
+可以参考[MindSpore Transformers贡献指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/contribution/mindformers_contribution.html),将模型贡献到MindSpore Transformers的开源仓库,供广大开发者研究和使用。
## MindSpore Transformers大模型迁移实践
diff --git a/docs/mindformers/docs/source_zh_cn/perf_optimize/images/cast.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/cast.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/perf_optimize/images/cast.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/cast.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/general_process.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/general_process.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/general_process.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/general_process.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/local_norm.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/local_norm.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/local_norm.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/local_norm.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss1.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/loss1.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss1.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/loss1.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss2.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/loss2.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss2.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/loss2.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss3.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/loss3.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss3.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/loss3.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss4.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/loss4.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss4.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/loss4.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss5.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/loss5.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss5.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/loss5.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss6.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/loss6.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss6.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/loss6.png
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss7.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/loss7.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/image/loss7.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/loss7.png
diff --git a/docs/mindformers/docs/source_zh_cn/usage/image/model_config_comparison.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/model_config_comparison.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/usage/image/model_config_comparison.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/model_config_comparison.png
diff --git a/docs/mindformers/docs/source_zh_cn/perf_optimize/images/mstx.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/mstx.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/perf_optimize/images/mstx.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/mstx.png
diff --git a/docs/mindformers/docs/source_zh_cn/usage/image/multi_modal.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/multi_modal.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/usage/image/multi_modal.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/multi_modal.png
diff --git a/docs/mindformers/docs/source_zh_cn/perf_optimize/images/reshape.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/reshape.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/perf_optimize/images/reshape.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/reshape.png
diff --git a/docs/mindformers/docs/source_zh_cn/perf_optimize/images/silu_mul.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/silu_mul.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/perf_optimize/images/silu_mul.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/silu_mul.png
diff --git a/docs/mindformers/docs/source_zh_cn/perf_optimize/images/studio.png b/docs/mindformers/docs/source_zh_cn/advanced_development/image/studio.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/perf_optimize/images/studio.png
rename to docs/mindformers/docs/source_zh_cn/advanced_development/image/studio.png
diff --git a/docs/mindformers/docs/source_zh_cn/usage/multi_modal.md b/docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md
similarity index 99%
rename from docs/mindformers/docs/source_zh_cn/usage/multi_modal.md
rename to docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md
index 89a61f83bbdea2a0e701131e9b02ddbafed1fc34..bd2aa713a023f2c5b4ee21d7a4c1d3c813b008fd 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/multi_modal.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md
@@ -1,6 +1,6 @@
# 多模态理解模型开发
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/multi_modal.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/multi_modal_dev.md)
多模态理解模型(Multimodal Model)是指能够处理并结合来自不同模态(如文字、图像、音频、视频等)的信息进行学习和推理的人工智能模型。
传统的单一模态模型通常只关注单一数据类型,如文本分类模型只处理文本数据,图像识别模型只处理图像数据。而多模态理解模型则通过融合不同来源的数据来完成更复杂的任务,从而能够理解和生成更加丰富、全面的内容。
@@ -324,7 +324,7 @@ class MultiModalForCausalLM(BaseXModalToTextModel):
在实现多模态数据集、数据处理模块以及多模态理解模型构建之后,就可以通过模型配置文件启动模型预训练、微调、推理等任务,为此需要构建对应的模型配置文件。
-具体模型配置文件可参考[predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml)和[finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml)分别对应模型推理和微调,其中参数具体含义可查阅[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html)。
+具体模型配置文件可参考[predict_cogvlm2_video_llama3_chat_13b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_video_llama3_chat_13b.yaml)和[finetune_cogvlm2_video_llama3_chat_13b_lora.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/finetune_cogvlm2_video_llama3_chat_13b_lora.yaml)分别对应模型推理和微调,其中参数具体含义可查阅[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html)。
在用户自定义的配置文件中`model`、`processor`、`train_dataset`等部分内容需要对应用户自定义的**数据集**、**数据处理模块**以及**多模态理解模型**进行设置。
diff --git a/docs/mindformers/docs/source_zh_cn/perf_optimize/perf_optimize.md b/docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/perf_optimize/perf_optimize.md
rename to docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md
index c51b5d980892860f7b7b8df9ed82f562b818d309..eea05bf4b514cfe237f368afd3892bf870dbb5ef 100644
--- a/docs/mindformers/docs/source_zh_cn/perf_optimize/perf_optimize.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md
@@ -1,6 +1,6 @@
# 大模型性能调优指南
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/perf_optimize/perf_optimize.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/performance_optimization.md)
## 概述
@@ -64,7 +64,7 @@ $$
在实际应用中,通常会采用多种并行策略和优化手段,例如使用优化器并行和重计算等方式,以减少模型对内存的使用并提高训练效率。并行策略设计与模型的效率密切相关,因此在模型调优之前先确定一组或多组较优的并行策略,是至关重要的。
-详细介绍参考文档[并行策略指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/distributed_parallel.html)。
+详细介绍参考文档[并行策略指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/parallel_training.html)。
对于不同的参数量规格的模型,可参考以下并行策略选择方向:
@@ -277,7 +277,7 @@ MindStudio Insight工具以时间线(Timeline)的形式呈现全流程在线
#### IR 图
-在[MindSpore Transformers配置文件](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html)中,只需要开启save_graphs,运行时会输出一些图编译过程中生成的.ir后缀的中间文件,这些被称为IR文件。默认情况下,这些文件会保存在当前执行目录下的graph目录中。IR文件是一种比较直观易懂的文本格式文件,用于描述模型结构的文件,可以直接用文本编辑软件查看。配置项含义参考[Config配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html),配置方法如下:
+在[MindSpore Transformers配置文件](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html)中,只需要开启save_graphs,运行时会输出一些图编译过程中生成的.ir后缀的中间文件,这些被称为IR文件。默认情况下,这些文件会保存在当前执行目录下的graph目录中。IR文件是一种比较直观易懂、用于描述模型结构的文本格式文件,可以直接用文本编辑软件查看。配置项含义参考[Config配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html),配置方法如下:
```yaml
context:
diff --git a/docs/mindformers/docs/source_zh_cn/acc_optimize/acc_optimize.md b/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
similarity index 99%
rename from docs/mindformers/docs/source_zh_cn/acc_optimize/acc_optimize.md
rename to docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
index dcf6f800f55b025e5998b0fc17d50154c4f32a63..f45ce54cef951dc49c2b633c04ea266100b7e7cf 100644
--- a/docs/mindformers/docs/source_zh_cn/acc_optimize/acc_optimize.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
@@ -1,6 +1,6 @@
# 大模型精度调优指南
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/acc_optimize/acc_optimize.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md)
## 精度问题概述和场景
@@ -187,7 +187,7 @@ export MINDSPORE_DUMP_CONFIG=${JSON_PATH}
#### 权重转换
-训练过程中,MindSpore与PyTorch加载同一份权重。若是预训练场景,可以使用PyTorch保存一个初始化权重后,转换为MindSpore权重。因为MindSpore的权重名称与PyTorch有差异,权重转换的本质是将PyTorch权重dict中的名字改为MindSpore权重名字,以支持MindSpore加载。权重转换参考[权重转换指导](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html)。
+训练过程中,MindSpore与PyTorch加载同一份权重。若是预训练场景,可以使用PyTorch保存一个初始化权重后,转换为MindSpore权重。因为MindSpore的权重名称与PyTorch有差异,权重转换的本质是将PyTorch权重dict中的名字改为MindSpore权重名字,以支持MindSpore加载。权重转换参考[权重转换指导](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。
MindSpore与PyTorch均支持`bin`格式数据,加载相同的数据集进行训练,保证每个step一致。
diff --git a/docs/mindformers/docs/source_zh_cn/usage/pretrain_gpt.md b/docs/mindformers/docs/source_zh_cn/advanced_development/pretrain_gpt.md
similarity index 97%
rename from docs/mindformers/docs/source_zh_cn/usage/pretrain_gpt.md
rename to docs/mindformers/docs/source_zh_cn/advanced_development/pretrain_gpt.md
index d6b4d966c0ce76560e6fdedadde0ab24444c15b6..079deedec16b452e5f15a34675129275de7d9e09 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/pretrain_gpt.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/pretrain_gpt.md
@@ -1,505 +1,505 @@
-# 动态图并行
-
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/pretrain_gpt.md)
-
-## 概述
-
-本教程演示如何使用MindSpore Transformers动态图并行框架训练GPT模型,此框架支持张量并行、流水线并行、序列并行等并行场景,还有支持使用分布式优化器动态学习率等场景,帮助开发者快速、便捷地构建和训练基于动态图并行框架的GPT预训练模型。
-
-## 操作实践
-
-下面基于Ascend平台,进行GPT模型训练。
-
-### 样例代码参考
-
-目录结构如下:
-
-```text
-└─ gpt
- ├─ pretrain_gpt.py
- ├─ pretrain_gpt.sh
- └─ pretrain_gpt_7B.yaml
- ...
-```
-
-其中,`pretrain_gpt.py`是环境配置、模型对象创建及训练的脚本。`pretrain_gpt.sh`是启动执行脚本。`pretrain_gpt_7B.yaml`是配置项。
-
-### 模型结构
-
-GPT以`Transformer`模型为主要架构,网络结构主要围绕`Transformer`的基本构建块构建。
-
-在模型中,初始化五个参数,`config`是模型配置项(在yaml文件的`model_config`中),`num_tokentypes`指定embedding的类型,`parallel_output`用来确认是否输出每一个并行Tensor的输出,`pre_process`和`post_process`分别指定是否为第一阶段和最后一阶段。
-
-调用的`get_language_model`是一个基于`Transformer`模型的接口,详情请看`get_language_model`的api文档。
-
-注意:数据集返回值要与模型定义的前向过程所需要的参数相对应。
-
-```python
-from mindformers.experimental.parallel_core.pynative.transformer.module import Module
-from mindformers.experimental.parallel_core.pynative.transformer.language_model import get_language_model
-from mindformers.experimental.parallel_core.pynative.transformer import ParallelLMLogits
-from mindformers.experimental.parallel_core.pynative.training.loss_func import VocabParallelCrossEntropy
-
-
-class AttnMaskType(enum.Enum):
- padding = 1
- causal = 2
- no_mask = 3
- padding_causal = 4
-
-
-attn_mask_type_mapping = {
- "padding": AttnMaskType.padding,
- "causal": AttnMaskType.causal,
-}
-
-
-class GPTModel(Module):
- def __init__(self,
- config,
- num_tokentypes=0,
- parallel_output=True,
- pre_process=True,
- post_process=True):
- super().__init__(config=config,\
- share_embeddings_and_output_weights=not config.untie_embeddings_and_output_weights)
-
- self.parallel_output = parallel_output
- self.pre_process = pre_process
- self.post_process = post_process
- self.untie_embeddings_and_output_weights = config.untie_embeddings_and_output_weights
- self.fp16_lm_cross_entropy = config.fp16_lm_cross_entropy
-
- self.set_model_key()
- encoder_attn_mask_type = None
- if config.encoder_attn_mask_type is not None:
- encoder_attn_mask_type = attn_mask_type_mapping.get(config.encoder_attn_mask_type)
- if encoder_attn_mask_type is None:
- raise ValueError(f"encoder_attn_mask_type must be one of {attn_mask_type_mapping.keys()}, but got"
- f"{config.encoder_attn_mask_type}")
-
- self.language_model, self._language_model_key = get_language_model(
- config=config,
- num_tokentypes=num_tokentypes,
- add_pooler=False,
- encoder_attn_mask_type=encoder_attn_mask_type,
- pre_process=self.pre_process,
- post_process=self.post_process)
-
- if self.post_process:
- self.parallel_lm_logits = ParallelLMLogits(config=config,
- bias=False,
- compute_dtype=config.compute_dtype)
- self.loss = VocabParallelCrossEntropy()
-
- if not config.untie_embeddings_and_output_weights:
- self.initialize_word_embeddings()
-
- def set_input_tensor(self, input_tensor):
- """ set input_tensor to model """
- self.language_model.set_input_tensor(input_tensor)
-
- def set_model_key(self):
- """ set model key for differentiate PipelineCell process """
- self.model_key = "gpt3"
-
- def construct(self, input_ids, position_ids, attention_mask, loss_mask,
- retriever_input_ids=None,
- retriever_position_ids=None,
- retriever_attn_mask=None,
- labels=None, tokentype_ids=None, inference_params=None):
- """ gpt model forward """
- # use RoPE
- position_ids = None
- retriever_input_ids = None
- retriever_position_ids = None
- retriever_attn_mask = None
- lm_output = self.language_model(
- input_ids,
- position_ids,
- attention_mask,
- retriever_input_ids=retriever_input_ids,
- retriever_position_ids=retriever_position_ids,
- retriever_attn_mask=retriever_attn_mask,
- inference_params=inference_params)
- if self.post_process:
- return post_language_model_processing(
- self.parallel_lm_logits, self.loss,
- lm_output, labels,
- self.language_model.output_layer.weight if\
- self.untie_embeddings_and_output_weights else self.shared_embedding_or_output_weight(),
- self.parallel_output,
- self.fp16_lm_cross_entropy,
- loss_mask)
- else:
- return lm_output
-```
-
-当`post_process`为`True`时,需要对语言模型的输出`lm_output`进行后处理,输出损失和预测结果。
-
-```python
-import mindspore.common.dtype as mstype
-
-def post_language_model_processing(parallel_lm_logits, loss_fn, lm_output, labels, logit_weights,
- parallel_output, fp16_lm_cross_entropy, loss_mask):
- """ gpt model post process forward """
- output = parallel_lm_logits(lm_output, logit_weights, parallel_output)
-
- if labels is None:
- return output
-
- labels = labels
- loss_mask = loss_mask.reshape(-1)
-
- if fp16_lm_cross_entropy:
- if output.dtype != mstype.float16:
- raise ValueError(f"When fp16_lm_cross_entropy=True, output should be float16, but got {output.dtype}")
- loss = loss_fn(output, labels, loss_mask)
- else:
- loss = loss_fn(output.astype(mstype.float32), labels)
- token_nums = loss_mask.sum()
- loss_mask = loss_mask.astype(mstype.float32)
- loss = ops.sum(loss * loss_mask.float()) / loss_mask.sum()
- return loss, output, token_nums
-```
-
-### 动态图并行训练配置
-
-动态图并行的配置项通过yaml文件来读取,并分为不同种类,包括训练配置、并行配置、模型配置等,接下来简单介绍一下大模型训练需要的基本配置。
-
-#### 配置训练参数(training_config)
-
-```yaml
-training_config:
- seed: 42 # 固定随机性用的种子
- output_dir: './output' # 输出目录,用于储存checkpoints和日志等
- training_iters: 10 # 训练迭代次数
- log_interval: 1 # 日志打印的频率
- save_interval: null # 储存checkpoints的频率
- loss_scale: 4096 # loss scale的初始值
- grad_clip_kwargs:
- grad_clip_type: "ClipGlobalNorm" # 梯度裁剪的方法,可选:"ClipGlobalNorm"或者"GradClipByValue"
- clip_value: 1.0
- loss_reduction: "mean" # loss reduction的方法,可选:"mean"或者"sum"
- loss_func_kwargs:
- loss_func_type: "VocabParallelCrossEntropy" # 损失函数,可选: "VocabParallelCrossEntropy"或者"CrossEntropyLoss"
- use_distributed_optimizer: True # 是否使用分布式优化器
-```
-
-#### 配置并行模式(parallel_config)
-
-```yaml
-parallel_config:
- tensor_model_parallel_size: 1 # 张量并行
- pipeline_model_parallel_size: 1 # 流水线并行
- expert_model_parallel_size: 1 # 专家并行
- virtual_pipeline_model_parallel_size: null # 虚拟流水线并行
- sequence_parallel: False # 序列并行
-```
-
-#### 配置模型参数(gpt_config)
-
-```yaml
-model_config:
- params_dtype: "float32" # 参数初始化类型
- compute_dtype: "bfloat16" # 计算时使用的类型
- position_embedding_type: 'rope' # 位置编码的类型,可选:"rope"或者"absolute"
- untie_embeddings_and_output_weights: True # embedding层和head层是否不共享权重
- # 配置GPT 7B模型
- num_layers: 6 # Transformer层数
- hidden_size: 4096 # 隐藏层的大小
- ffn_hidden_size: 11008 # 前馈神经网络隐藏层大小
- num_attention_heads: 32 # 注意力头的数量
-```
-
-GPT模型当前有三种不同规格的配置:7B、13B和70B。
-
-```yaml
-7B:
- num_layers: 32
- hidden_size: 4096
- ffn_hidden_size: 11008
- num_attention_heads: 32
-13B:
- num_layers: 40
- hidden_size: 5120
- ffn_hidden_size: 13824
- num_attention_heads: 40
-70B:
- num_layers: 80
- hidden_size: 8192
- ffn_hidden_size: 28672
- num_attention_heads: 64
- group_query_attention: True
- num_query_groups: 8
-```
-
-#### 数据集配置(dataset_config)
-
-```yaml
-dataset_config:
- batch_size: 1 # 一次迭代从数据集中取出的数据大小
- micro_batch_num: 2 # 微批次个数
- dataset_dir: './dataset' # 数据集所在目录
- shuffle: False # 是否打乱顺序
-```
-
-#### 优化器配置(optimizer_config)
-
-```yaml
-optimizer_config:
- optimizer_type: "AdamW" # 优化器类型,可选:"AdamW", "Adam", "SGD", "Came", "mint.AdamW"及"SpeedAdamW"
- betas: # 优化器输入参数
- - 0.9
- - 0.95
- eps: 1.e-8
- learning_rate: 1.25e-6 # 初始学习率
- weight_decay: 1.e-1 # 权重衰减系数
- learning_rate_scheduler_kwargs: # 学习率调整策略
- warmup_steps: 200
- decay_steps: 2000
- use_cosine: True
- end_learning_rate: 1.25e-7
-```
-
-### 模型训练配置解析
-
-在pretrain_gpt.py里对传入的yaml配置文件进行解析,可以得到训练配置、模型配置、优化器配置、并行策略配置以及数据集配置。
-
-```python
-import argparse
-from mindformers.experimental.parallel_core.pynative.config import (
- init_configs_from_yaml
-)
-
-def get_arg_parser():
- """get argument parser"""
- parser = argparse.ArgumentParser(description="Train gpt model")
- parser.add_argument("--config_path", type=str, default="pretrain_gpt.yaml", help="The path to the config file.")
- parser.add_argument("--run_cmd", type=str, default="", help="running cmd.")
- parser.add_argument("--model_type", type=str, default="gpt_config", help="Input model config.")
- return parser
-parser = get_arg_parser()
-args = parser.parse_args()
-
-all_config = init_configs_from_yaml(args.config_path)
-
-training_config = all_config.training_config
-model_config = all_config.model_config
-optimizer_config = all_config.optimizer_config
-parallel_config = all_config.parallel_config
-dataset_config = all_config.dataset_config
-```
-
-### 通信配置
-
-通过set_context接口可以指定运行模式、运行设备、运行卡号等。并行脚本还需指定并行模式`parallel_mode`为数据并行模式,并通过init根据不同的设备需求初始化HCCL、NCCL或者MCCL通信。指定平台:设置`device_target`为`Ascend`。调试阶段可以使用`set_context(pynative_synchronize=True)`开启同步模式,更准确地定位报错位置。
-
-```python
-import mindspore as ms
-
-
-def set_parallel_context(parallel_config):
- init()
- initialize_model_parallel(
- tensor_model_parallel_size=parallel_config.tensor_model_parallel_size,
- pipeline_model_parallel_size=parallel_config.pipeline_model_parallel_size,
- virtual_pipeline_model_parallel_size=parallel_config.virtual_pipeline_model_parallel_size,
- )
- logger.info(
- f"dp {get_data_parallel_world_size()} | "
- f"pp {parallel_config.pipeline_model_parallel_size} | "
- f"tp {parallel_config.tensor_model_parallel_size} | "
- f"sp {parallel_config.sequence_parallel} | "
- f"vpp {parallel_config.virtual_pipeline_model_parallel_size}"
- )
-
-
-def set_seed(seed):
- # set global seed, np seed, and dataset seed
- ms.set_seed(seed)
- # set rng seed
- ms.manual_seed(seed)
-
-
-ms.set_context(mode=ms.PYNATIVE_MODE)
-ms.set_device(device_target="Ascend")
-set_parallel_context(parallel_config)
-set_seed(training_config.seed)
-```
-
-### 创建网络对象
-
-从模型库获取GPT模型,根据配置文件创建网络模型对象。通过`set_weight_decay`来为不同参数设置不同的权重衰减系数,这个函数会将参数分为两组,一组应用特定的权重衰减值,另一组权重衰减为`0`,然后返回一个包含参数分组信息的列表,赋值给`group_params`变量。调用`get_optimizer`函数,传入`optimizer_config`(优化器配置)、`training_config`(训练配置)、`group_params`(前面得到的参数分组信息)、`network_with_loss`(包含模型和损失的对象)以及一个梯度归约操作(从`training_config.loss_reduction`获取),返回一个优化器对象,并赋值给`optimizer`变量。
-创建一个`TrainOneStepCell`对象,它通常用于在训练过程中执行一步优化。传入`network_with_loss`、`optimizer`及配置作为参数,并将其赋值给train_one_step_cell变量。
-
-完整的创建网络对象代码:
-
-```python
-from mindformers.experimental.parallel_core.pynative.optimizer import get_optimizer
-from mindformers.experimental.parallel_core.pynative.training import get_model
-from mindformers.experimental.parallel_core.pynative.training import TrainOneStepCell
-from mindformers.experimental.parallel_core.models import GPTModel
-
-
-def decay_filter(x):
- return "norm" not in x.name.lower() and "bias" not in x.name.lower()
-
-
-def set_weight_decay(params, weight_decay=1e-1):
- decay_params = list(filter(decay_filter, params))
- other_params = list(filter(lambda x: not decay_filter(x), params))
- group_params = []
- if decay_params:
- group_params.append({"params": decay_params, "weight_decay": weight_decay})
- if other_params:
- group_params.append({"params": other_params, "weight_decay": 0.0})
- return group_params
-
-
-def model_provider_func(pre_process=True, post_process=True):
- network_with_loss = GPTModel(
- model_config, pre_process=pre_process, post_process=post_process
- )
- return network_with_loss
-
-network_with_loss = get_model(model_provider_func, training_config)
-
-group_params = set_weight_decay(network_with_loss.trainable_params(), optimizer_config.weight_decay)
-optimizer = get_optimizer(
- optimizer_config,
- training_config,
- group_params,
- network_with_loss,
- grad_allreduce_op=training_config.loss_reduction
-)
-
-train_one_step_cell = TrainOneStepCell(network_with_loss, optimizer, None, training_config, model_config)
-```
-
-### 加载数据集及执行训练
-
-```python
-from dataset import get_dataset
-from mindformers.experimental.parallel_core.pynative.training import train
-
-train_dataset_iterator, val_dataset_iterator = get_dataset(dataset_config)
-train(
- train_one_step_cell,
- train_dataset_iterator,
- training_config,
- val_dataset_iterator,
- metrics,
- evaluation,
-)
-```
-
-### 运行训练脚本
-
-```bash
-bash pretrain_gpt.sh xx.yaml
-```
-
-若不指定xx.yaml,则默认为pretrain_gpt_7B.yaml。
-
-训练脚本`pretrain_gpt.sh`详细解析如下:
-
-#### 设置环境变量
-
-`HCCL_BUFFSIZE=200`设置两个NPU之间共享数据的缓存区大小为200M;`HCCL_EXEC_TIMEOUT=600`设置设备间执行时同步的等待时间为10分钟。`ASCEND_RT_VISIBLE_DEVICES`指定了可见的设备编号,这里设置为设备`0`号卡。
-
-```bash
-export HCCL_BUFFSIZE=200
-export HCCL_EXEC_TIMEOUT=600
-export ASCEND_RT_VISIBLE_DEVICES='0'
-```
-
-#### 设置端口号
-
-```bash
-port=8828
-```
-
-如果之前的配置异常退出,可以使用如下代码进行清理。
-
-```bash
-PIDS=$(sudo lsof -i :$port | awk 'NR>1 {print $2}')
-if [ -n "$PIDS" ]; then
- for pid in $PIDS; do
- kill -9 $pid
- echo "Killed process $pid"
- done
-else
- echo "No processes found listening on port $port."
-fi
-```
-
-#### 设置日志存储路径
-
-获取当前脚本所在的目录路径并存储在`project_dir`变量中,同时设置日志路径变量`log_path="msrun_log"`。先删除名为`msrun_log`的目录(如果存在),然后重新创建这个目录。
-
-```bash
-project_dir=$(cd "$(dirname "$0")" || exit; pwd)
-log_path="msrun_log"
-
-rm -rf "${log_path}"
-mkdir "${log_path}"
-```
-
-#### 设置可用设备数量
-
-```bash
-# 计算设备数量
-IFS=',' read -r -a devices <<< "$ASCEND_RT_VISIBLE_DEVICES"
-work_num=${#devices[@]}
-```
-
-#### 获取配置文件
-
-尝试从命令行参数中获取配置文件路径,如果没有提供命令行参数,则使用默认的配置文件 "pretrain_gpt_7B.yaml"。
-
-```bash
-config_path=$1
-if [ -z "$config_path" ]; then
- config_path="pretrain_gpt_7B.yaml"
-fi
-```
-
-#### 以msrun模式执行训练脚本
-
-```bash
-msrun --worker_num "$work_num" --local_worker_num="$work_num" --master_port=$port --log_dir="$log_path" --join=True --cluster_time_out=300 pretrain_gpt.py --config_path="${config_path}"
-```
-
-#### 运行结果
-
-接下来通过命令调用对应的脚本。
-
-```bash
-bash pretrain_gpt.sh
-```
-
-执行完后,日志文件保存到`output`目录下,其中部分文件目录结构如下:
-
-```text
-└─ output
- └─ log
- ├─ rank_0
- | ├─ info.log
- | └─ error.log
- ├─ rank_1
- | ├─ info.log
- | └─ error.log
- ...
-```
-
-关于Loss部分结果保存在`output/log/rank_*/info.log`中,示例如下:
-
-```text
-train: Epoch:0, Step:5, Loss: 10.341485, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1403.24 ms
-train: Epoch:0, Step:6, Loss: 10.38118, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1378.19 ms
-train: Epoch:0, Step:7, Loss: 10.165115, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1370.32 ms
-train: Epoch:0, Step:8, Loss: 10.039211, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1386.89 ms
-train: Epoch:0, Step:9, Loss: 10.040031, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1475.95 ms
-...
-```
+# 动态图并行
+
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/pretrain_gpt.md)
+
+## 概述
+
+本教程演示如何使用MindSpore Transformers动态图并行框架训练GPT模型。该框架支持张量并行、流水线并行、序列并行等并行场景,还支持分布式优化器、动态学习率等特性,帮助开发者快速、便捷地构建和训练基于动态图并行框架的GPT预训练模型。
+
+## 操作实践
+
+下面基于Ascend平台,进行GPT模型训练。
+
+### 样例代码参考
+
+目录结构如下:
+
+```text
+└─ gpt
+ ├─ pretrain_gpt.py
+ ├─ pretrain_gpt.sh
+ └─ pretrain_gpt_7B.yaml
+ ...
+```
+
+其中,`pretrain_gpt.py`是环境配置、模型对象创建及训练的脚本。`pretrain_gpt.sh`是启动执行脚本。`pretrain_gpt_7B.yaml`是配置项。
+
+### 模型结构
+
+GPT以`Transformer`模型为主要架构,网络结构主要围绕`Transformer`的基本构建块构建。
+
+在模型中,初始化五个参数,`config`是模型配置项(在yaml文件的`model_config`中),`num_tokentypes`指定embedding的类型,`parallel_output`用来确认是否输出每一个并行Tensor的输出,`pre_process`和`post_process`分别指定是否为第一阶段和最后一阶段。
+
+调用的`get_language_model`是一个基于`Transformer`模型的接口,详情请看`get_language_model`的api文档。
+
+注意:数据集返回值要与模型定义的前向过程所需要的参数相对应。
+
+```python
+import enum
+
+from mindformers.experimental.parallel_core.pynative.transformer.module import Module
+from mindformers.experimental.parallel_core.pynative.transformer.language_model import get_language_model
+from mindformers.experimental.parallel_core.pynative.transformer import ParallelLMLogits
+from mindformers.experimental.parallel_core.pynative.training.loss_func import VocabParallelCrossEntropy
+
+
+class AttnMaskType(enum.Enum):
+ padding = 1
+ causal = 2
+ no_mask = 3
+ padding_causal = 4
+
+
+attn_mask_type_mapping = {
+ "padding": AttnMaskType.padding,
+ "causal": AttnMaskType.causal,
+}
+
+
+class GPTModel(Module):
+ def __init__(self,
+ config,
+ num_tokentypes=0,
+ parallel_output=True,
+ pre_process=True,
+ post_process=True):
+ super().__init__(config=config,\
+ share_embeddings_and_output_weights=not config.untie_embeddings_and_output_weights)
+
+ self.parallel_output = parallel_output
+ self.pre_process = pre_process
+ self.post_process = post_process
+ self.untie_embeddings_and_output_weights = config.untie_embeddings_and_output_weights
+ self.fp16_lm_cross_entropy = config.fp16_lm_cross_entropy
+
+ self.set_model_key()
+ encoder_attn_mask_type = None
+ if config.encoder_attn_mask_type is not None:
+ encoder_attn_mask_type = attn_mask_type_mapping.get(config.encoder_attn_mask_type)
+ if encoder_attn_mask_type is None:
+ raise ValueError(f"encoder_attn_mask_type must be one of {attn_mask_type_mapping.keys()}, but got"
+ f"{config.encoder_attn_mask_type}")
+
+ self.language_model, self._language_model_key = get_language_model(
+ config=config,
+ num_tokentypes=num_tokentypes,
+ add_pooler=False,
+ encoder_attn_mask_type=encoder_attn_mask_type,
+ pre_process=self.pre_process,
+ post_process=self.post_process)
+
+ if self.post_process:
+ self.parallel_lm_logits = ParallelLMLogits(config=config,
+ bias=False,
+ compute_dtype=config.compute_dtype)
+ self.loss = VocabParallelCrossEntropy()
+
+ if not config.untie_embeddings_and_output_weights:
+ self.initialize_word_embeddings()
+
+ def set_input_tensor(self, input_tensor):
+ """ set input_tensor to model """
+ self.language_model.set_input_tensor(input_tensor)
+
+ def set_model_key(self):
+ """ set model key for differentiate PipelineCell process """
+ self.model_key = "gpt3"
+
+ def construct(self, input_ids, position_ids, attention_mask, loss_mask,
+ retriever_input_ids=None,
+ retriever_position_ids=None,
+ retriever_attn_mask=None,
+ labels=None, tokentype_ids=None, inference_params=None):
+ """ gpt model forward """
+ # use RoPE
+ position_ids = None
+ retriever_input_ids = None
+ retriever_position_ids = None
+ retriever_attn_mask = None
+ lm_output = self.language_model(
+ input_ids,
+ position_ids,
+ attention_mask,
+ retriever_input_ids=retriever_input_ids,
+ retriever_position_ids=retriever_position_ids,
+ retriever_attn_mask=retriever_attn_mask,
+ inference_params=inference_params)
+ if self.post_process:
+ return post_language_model_processing(
+ self.parallel_lm_logits, self.loss,
+ lm_output, labels,
+ self.language_model.output_layer.weight if\
+ self.untie_embeddings_and_output_weights else self.shared_embedding_or_output_weight(),
+ self.parallel_output,
+ self.fp16_lm_cross_entropy,
+ loss_mask)
+ else:
+ return lm_output
+```
+
+当`post_process`为`True`时,需要对语言模型的输出`lm_output`进行后处理,输出损失和预测结果。
+
+```python
+import mindspore.common.dtype as mstype
+from mindspore import ops
+
+def post_language_model_processing(parallel_lm_logits, loss_fn, lm_output, labels, logit_weights,
+ parallel_output, fp16_lm_cross_entropy, loss_mask):
+ """ gpt model post process forward """
+ output = parallel_lm_logits(lm_output, logit_weights, parallel_output)
+
+ if labels is None:
+ return output
+
+ loss_mask = loss_mask.reshape(-1)
+
+ if fp16_lm_cross_entropy:
+ if output.dtype != mstype.float16:
+ raise ValueError(f"When fp16_lm_cross_entropy=True, output should be float16, but got {output.dtype}")
+ loss = loss_fn(output, labels, loss_mask)
+ else:
+ loss = loss_fn(output.astype(mstype.float32), labels)
+ token_nums = loss_mask.sum()
+ loss_mask = loss_mask.astype(mstype.float32)
+ loss = ops.sum(loss * loss_mask.float()) / loss_mask.sum()
+ return loss, output, token_nums
+```
+
+### 动态图并行训练配置
+
+动态图并行的配置项通过yaml文件来读取,并分为不同种类,包括训练配置、并行配置、模型配置等,接下来简单介绍一下大模型训练需要的基本配置。
+
+#### 配置训练参数(training_config)
+
+```yaml
+training_config:
+ seed: 42 # 固定随机性用的种子
+ output_dir: './output' # 输出目录,用于储存checkpoints和日志等
+ training_iters: 10 # 训练迭代次数
+ log_interval: 1 # 日志打印的频率
+ save_interval: null # 储存checkpoints的频率
+ loss_scale: 4096 # loss scale的初始值
+ grad_clip_kwargs:
+ grad_clip_type: "ClipGlobalNorm" # 梯度裁剪的方法,可选:"ClipGlobalNorm"或者"GradClipByValue"
+ clip_value: 1.0
+ loss_reduction: "mean" # loss reduction的方法,可选:"mean"或者"sum"
+ loss_func_kwargs:
+ loss_func_type: "VocabParallelCrossEntropy" # 损失函数,可选: "VocabParallelCrossEntropy"或者"CrossEntropyLoss"
+ use_distributed_optimizer: True # 是否使用分布式优化器
+```
+
+#### 配置并行模式(parallel_config)
+
+```yaml
+parallel_config:
+ tensor_model_parallel_size: 1 # 张量并行
+ pipeline_model_parallel_size: 1 # 流水线并行
+ expert_model_parallel_size: 1 # 专家并行
+ virtual_pipeline_model_parallel_size: null # 虚拟流水线并行
+ sequence_parallel: False # 序列并行
+```
+
+#### 配置模型参数(gpt_config)
+
+```yaml
+model_config:
+ params_dtype: "float32" # 参数初始化类型
+ compute_dtype: "bfloat16" # 计算时使用的类型
+ position_embedding_type: 'rope' # 位置编码的类型,可选:"rope"或者"absolute"
+ untie_embeddings_and_output_weights: True # embedding层和head层是否不共享权重
+ # 配置GPT 7B模型
+ num_layers: 6 # Transformer层数
+ hidden_size: 4096 # 隐藏层的大小
+ ffn_hidden_size: 11008 # 前馈神经网络隐藏层大小
+ num_attention_heads: 32 # 注意力头的数量
+```
+
+GPT模型当前有三种不同规格的配置:7B、13B和70B。
+
+```yaml
+7B:
+ num_layers: 32
+ hidden_size: 4096
+ ffn_hidden_size: 11008
+ num_attention_heads: 32
+13B:
+ num_layers: 40
+ hidden_size: 5120
+ ffn_hidden_size: 13824
+ num_attention_heads: 40
+70B:
+ num_layers: 80
+ hidden_size: 8192
+ ffn_hidden_size: 28672
+ num_attention_heads: 64
+ group_query_attention: True
+ num_query_groups: 8
+```
+
+#### 数据集配置(dataset_config)
+
+```yaml
+dataset_config:
+ batch_size: 1 # 一次迭代从数据集中取出的数据大小
+ micro_batch_num: 2 # 微批次个数
+ dataset_dir: './dataset' # 数据集所在目录
+ shuffle: False # 是否打乱顺序
+```
+
+#### 优化器配置(optimizer_config)
+
+```yaml
+optimizer_config:
+ optimizer_type: "AdamW" # 优化器类型,可选:"AdamW", "Adam", "SGD", "Came", "mint.AdamW"及"SpeedAdamW"
+ betas: # 优化器输入参数
+ - 0.9
+ - 0.95
+ eps: 1.e-8
+ learning_rate: 1.25e-6 # 初始学习率
+ weight_decay: 1.e-1 # 权重衰减系数
+ learning_rate_scheduler_kwargs: # 学习率调整策略
+ warmup_steps: 200
+ decay_steps: 2000
+ use_cosine: True
+ end_learning_rate: 1.25e-7
+```
+
+### 模型训练配置解析
+
+在pretrain_gpt.py里对传入的yaml配置文件进行解析,可以得到训练配置、模型配置、优化器配置、并行策略配置以及数据集配置。
+
+```python
+import argparse
+from mindformers.experimental.parallel_core.pynative.config import (
+ init_configs_from_yaml
+)
+
+def get_arg_parser():
+ """get argument parser"""
+ parser = argparse.ArgumentParser(description="Train gpt model")
+ parser.add_argument("--config_path", type=str, default="pretrain_gpt.yaml", help="The path to the config file.")
+ parser.add_argument("--run_cmd", type=str, default="", help="running cmd.")
+ parser.add_argument("--model_type", type=str, default="gpt_config", help="Input model config.")
+ return parser
+parser = get_arg_parser()
+args = parser.parse_args()
+
+all_config = init_configs_from_yaml(args.config_path)
+
+training_config = all_config.training_config
+model_config = all_config.model_config
+optimizer_config = all_config.optimizer_config
+parallel_config = all_config.parallel_config
+dataset_config = all_config.dataset_config
+```
+
+### 通信配置
+
+通过set_context接口可以指定运行模式、运行设备、运行卡号等。并行脚本还需指定并行模式`parallel_mode`为数据并行模式,并通过init根据不同的设备需求初始化HCCL、NCCL或者MCCL通信。指定平台:设置`device_target`为`Ascend`。调试阶段可以使用`set_context(pynative_synchronize=True)`开启同步模式,更准确地定位报错位置。
+
+```python
+import mindspore as ms
+
+
+def set_parallel_context(parallel_config):
+ init()
+ initialize_model_parallel(
+ tensor_model_parallel_size=parallel_config.tensor_model_parallel_size,
+ pipeline_model_parallel_size=parallel_config.pipeline_model_parallel_size,
+ virtual_pipeline_model_parallel_size=parallel_config.virtual_pipeline_model_parallel_size,
+ )
+ logger.info(
+ f"dp {get_data_parallel_world_size()} | "
+ f"pp {parallel_config.pipeline_model_parallel_size} | "
+ f"tp {parallel_config.tensor_model_parallel_size} | "
+ f"sp {parallel_config.sequence_parallel} | "
+ f"vpp {parallel_config.virtual_pipeline_model_parallel_size}"
+ )
+
+
+def set_seed(seed):
+ # set global seed, np seed, and dataset seed
+ ms.set_seed(seed)
+ # set rng seed
+ ms.manual_seed(seed)
+
+
+ms.set_context(mode=ms.PYNATIVE_MODE)
+ms.set_device(device_target="Ascend")
+set_parallel_context(parallel_config)
+set_seed(training_config.seed)
+```
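+
+以下是一个调试场景下的最小示例(仅作示意),对应上文提到的`pynative_synchronize`选项:在动态图模式下开启同步执行,便于在报错时获得准确的调用栈。
+
+```python
+import mindspore as ms
+
+# 调试时开启同步模式:算子同步执行,报错堆栈可准确定位到出错位置
+ms.set_context(mode=ms.PYNATIVE_MODE, pynative_synchronize=True)
+```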
+
+### 创建网络对象
+
+从模型库获取GPT模型,根据配置文件创建网络模型对象。通过`set_weight_decay`来为不同参数设置不同的权重衰减系数,这个函数会将参数分为两组,一组应用特定的权重衰减值,另一组权重衰减为`0`,然后返回一个包含参数分组信息的列表,赋值给`group_params`变量。调用`get_optimizer`函数,传入`optimizer_config`(优化器配置)、`training_config`(训练配置)、`group_params`(前面得到的参数分组信息)、`network_with_loss`(包含模型和损失的对象)以及一个梯度归约操作(从`training_config.loss_reduction`获取),返回一个优化器对象,并赋值给`optimizer`变量。
+创建一个`TrainOneStepCell`对象,它通常用于在训练过程中执行一步优化。传入`network_with_loss`、`optimizer`及配置作为参数,并将其赋值给`train_one_step_cell`变量。
+
+完整的创建网络对象代码:
+
+```python
+from mindformers.experimental.parallel_core.pynative.optimizer import get_optimizer
+from mindformers.experimental.parallel_core.pynative.training import get_model
+from mindformers.experimental.parallel_core.pynative.training import TrainOneStepCell
+from mindformers.experimental.parallel_core.models import GPTModel
+
+
+def decay_filter(x):
+ return "norm" not in x.name.lower() and "bias" not in x.name.lower()
+
+
+def set_weight_decay(params, weight_decay=1e-1):
+ decay_params = list(filter(decay_filter, params))
+ other_params = list(filter(lambda x: not decay_filter(x), params))
+ group_params = []
+ if decay_params:
+ group_params.append({"params": decay_params, "weight_decay": weight_decay})
+ if other_params:
+ group_params.append({"params": other_params, "weight_decay": 0.0})
+ return group_params
+
+
+def model_provider_func(pre_process=True, post_process=True):
+ network_with_loss = GPTModel(
+ model_config, pre_process=pre_process, post_process=post_process
+ )
+ return network_with_loss
+
+network_with_loss = get_model(model_provider_func, training_config)
+
+group_params = set_weight_decay(network_with_loss.trainable_params(), optimizer_config.weight_decay)
+optimizer = get_optimizer(
+ optimizer_config,
+ training_config,
+ group_params,
+ network_with_loss,
+ grad_allreduce_op=training_config.loss_reduction
+)
+
+train_one_step_cell = TrainOneStepCell(network_with_loss, optimizer, None, training_config, model_config)
+```
+
+### 加载数据集及执行训练
+
+```python
+from dataset import get_dataset
+from mindformers.experimental.parallel_core.pynative.training import train
+
+train_dataset_iterator, val_dataset_iterator = get_dataset(dataset_config)
+train(
+ train_one_step_cell,
+ train_dataset_iterator,
+ training_config,
+ val_dataset_iterator,
+ metrics,
+ evaluation,
+)
+```
+
+### 运行训练脚本
+
+```bash
+bash pretrain_gpt.sh xx.yaml
+```
+
+若不指定xx.yaml,则默认为pretrain_gpt_7B.yaml。
+
+训练脚本`pretrain_gpt.sh`详细解析如下:
+
+#### 设置环境变量
+
+`HCCL_BUFFSIZE=200`设置两个NPU之间共享数据的缓存区大小为200M;`HCCL_EXEC_TIMEOUT=600`设置设备间执行时同步的等待时间为10分钟。`ASCEND_RT_VISIBLE_DEVICES`指定了可见的设备编号,这里设置为设备`0`号卡。
+
+```bash
+export HCCL_BUFFSIZE=200
+export HCCL_EXEC_TIMEOUT=600
+export ASCEND_RT_VISIBLE_DEVICES='0'
+```
+
+#### 设置端口号
+
+```bash
+port=8828
+```
+
+如果之前的配置异常退出,可以使用如下代码进行清理。
+
+```bash
+PIDS=$(sudo lsof -i :$port | awk 'NR>1 {print $2}')
+if [ -n "$PIDS" ]; then
+ for pid in $PIDS; do
+ kill -9 $pid
+ echo "Killed process $pid"
+ done
+else
+ echo "No processes found listening on port $port."
+fi
+```
+
+#### 设置日志存储路径
+
+获取当前脚本所在的目录路径并存储在`project_dir`变量中,同时设置日志路径变量`log_path="msrun_log"`。先删除名为`msrun_log`的目录(如果存在),然后重新创建这个目录。
+
+```bash
+project_dir=$(cd "$(dirname "$0")" || exit; pwd)
+log_path="msrun_log"
+
+rm -rf "${log_path}"
+mkdir "${log_path}"
+```
+
+#### 设置可用设备数量
+
+```bash
+# 计算设备数量
+IFS=',' read -r -a devices <<< "$ASCEND_RT_VISIBLE_DEVICES"
+work_num=${#devices[@]}
+```
+
+#### 获取配置文件
+
+尝试从命令行参数中获取配置文件路径,如果没有提供命令行参数,则使用默认的配置文件 "pretrain_gpt_7B.yaml"。
+
+```bash
+config_path=$1
+if [ -z "$config_path" ]; then
+ config_path="pretrain_gpt_7B.yaml"
+fi
+```
+
+#### 以msrun模式执行训练脚本
+
+```bash
+msrun --worker_num "$work_num" --local_worker_num="$work_num" --master_port=$port --log_dir="$log_path" --join=True --cluster_time_out=300 pretrain_gpt.py --config_path="${config_path}"
+```
+
+#### 运行结果
+
+接下来通过命令调用对应的脚本。
+
+```bash
+bash pretrain_gpt.sh
+```
+
+执行完后,日志文件保存到`output`目录下,其中部分文件目录结构如下:
+
+```text
+└─ output
+ └─ log
+ ├─ rank_0
+ | ├─ info.log
+ | └─ error.log
+ ├─ rank_1
+ | ├─ info.log
+ | └─ error.log
+ ...
+```
+
+关于Loss部分结果保存在`output/log/rank_*/info.log`中,示例如下:
+
+```text
+train: Epoch:0, Step:5, Loss: 10.341485, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1403.24 ms
+train: Epoch:0, Step:6, Loss: 10.38118, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1378.19 ms
+train: Epoch:0, Step:7, Loss: 10.165115, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1370.32 ms
+train: Epoch:0, Step:8, Loss: 10.039211, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1386.89 ms
+train: Epoch:0, Step:9, Loss: 10.040031, Finite_grads: True, Loss_scale: 4096.0, Learning_rate: (1.250000e-06,1.250000e-06,), Time: 1475.95 ms
+...
+```
diff --git a/docs/mindformers/docs/source_zh_cn/conf.py b/docs/mindformers/docs/source_zh_cn/conf.py
index 07db1388c60b45566ee22fec403a5304143ad455..2750c952ea59c8acd59758a5807e74312634c025 100644
--- a/docs/mindformers/docs/source_zh_cn/conf.py
+++ b/docs/mindformers/docs/source_zh_cn/conf.py
@@ -227,8 +227,8 @@ if os.path.exists('./mindformers.experimental.rst'):
if os.path.exists('./experimental'):
shutil.rmtree('./experimental')
-if os.path.exists('./usage/pretrain_gpt.md'):
- os.remove('./usage/pretrain_gpt.md')
+if os.path.exists('advanced_development/pretrain_gpt.md'):
+ os.remove('advanced_development/pretrain_gpt.md')
with open('./index.rst', 'r+', encoding='utf-8') as f:
ind_content = f.read()
diff --git a/docs/mindformers/docs/source_zh_cn/faq/mindformers_contribution.md b/docs/mindformers/docs/source_zh_cn/contribution/mindformers_contribution.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/faq/mindformers_contribution.md
rename to docs/mindformers/docs/source_zh_cn/contribution/mindformers_contribution.md
index ec45c60534ed4d4c9060f8aabf12044739a7ccc1..471635e079156329a1bf90c1b0fb9d8ed5ab882d 100644
--- a/docs/mindformers/docs/source_zh_cn/faq/mindformers_contribution.md
+++ b/docs/mindformers/docs/source_zh_cn/contribution/mindformers_contribution.md
@@ -1,6 +1,6 @@
# MindSpore Transformers贡献指南
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/faq/mindformers_contribution.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/contribution/mindformers_contribution.md)
## 贡献代码至MindSpore Transformers
diff --git a/docs/mindformers/docs/source_zh_cn/faq/modelers_contribution.md b/docs/mindformers/docs/source_zh_cn/contribution/modelers_contribution.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/faq/modelers_contribution.md
rename to docs/mindformers/docs/source_zh_cn/contribution/modelers_contribution.md
index 52a65a934767db9c1f41336aea880a4c294b25c7..923b46ad679b72f8103e6b7ddd24dd0c19e2bf08 100644
--- a/docs/mindformers/docs/source_zh_cn/faq/modelers_contribution.md
+++ b/docs/mindformers/docs/source_zh_cn/contribution/modelers_contribution.md
@@ -1,6 +1,6 @@
# 魔乐社区贡献指南
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/faq/modelers_contribution.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/contribution/modelers_contribution.md)
## 上传模型至魔乐社区
diff --git a/docs/mindformers/docs/source_zh_cn/appendix/env_variables.md b/docs/mindformers/docs/source_zh_cn/env_variables.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/appendix/env_variables.md
rename to docs/mindformers/docs/source_zh_cn/env_variables.md
index b6c507dc8d200598e065308e10ee401db26b542a..470d08de46eb63cdd5455050235bec240e4b1674 100644
--- a/docs/mindformers/docs/source_zh_cn/appendix/env_variables.md
+++ b/docs/mindformers/docs/source_zh_cn/env_variables.md
@@ -1,6 +1,6 @@
# 环境变量说明
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/appendix/env_variables.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/env_variables.md)
以下是 MindSpore Transformers 支持的环境变量。
@@ -14,7 +14,7 @@
| **ASCEND_LAUNCH_BLOCKING** | 0 | 训练或在线推理场景,可通过此环境变量控制算子执行时是否启动同步模式。 | `1`:强制算子采用同步模式运行; `0`:不强制算子采用同步模式运行。 | 由于 NPU 模型训练时默认算子异步执行,导致算子执行过程中出现报错时,打印的报错堆栈信息并不是实际的调用栈信息。当设置为`1`时,强制算子采用同步模式运行,这样能够打印正确的调用栈信息,从而更容易地调试和定位代码中的问题。设置为`1`时有更高的运算效率。 |
| **TE_PARALLEL_COMPILER** | 8 | 算子最大并行编译进程数,当大于 1 时开启并行编译。 | 取值为正整数;最大不超过 cpu 核数\*80%/昇腾 AI 处理器个数,取值范围 1~32,默认值是 8。 | 网络模型较大时,可通过配置此环境变量开启算子的并行编译功能; 设置为`1`时为单线程编译,在调试时,可以简化难度。 |
| **CPU_AFFINITY** | 0 | 启动 CPU 亲和性开关,启动该选项可以确保每个进程或线程绑定到一个 CPU 核心上,以提高性能。 | `1`:开启 CPU 亲和性开关; `0`:关闭 CPU 亲和性开关。 | 出于**优化资源利用** 以及**节能** 的考虑,CPU 亲和性默认关闭。 |
-| **MS_MEMORY_STATISTIC** | 0 | 内存统计。 | `1`:开启内存统计功能; `0`:关闭内存统计功能。 | 在内存分析时,可以统计内存的基本使用情况。具体可以参考[调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/perf_optimize/perf_optimize.html)。 |
+| **MS_MEMORY_STATISTIC** | 0 | 内存统计。 | `1`:开启内存统计功能; `0`:关闭内存统计功能。 | 在内存分析时,可以统计内存的基本使用情况。具体可以参考[调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/advanced_development/performance_optimization.html)。 |
| **MINDSPORE_DUMP_CONFIG** | | 指定 [云侧 Dump 功能](https://www.mindspore.cn/tutorials/zh-CN/master/debug/dump.html) 或 [端侧 Dump 功能](https://www.mindspore.cn/lite/docs/zh-CN/master/tools/benchmark_tool.html#dump功能) 所依赖的配置文件的路径 | 文件路径,支持相对路径与绝对路径。 | |
| **GLOG_v** | 3 | 控制 MindSpore 日志的级别。 | `0`:DEBUG; `1`:INFO; `2`:WARNING; `3`:ERROR:表示程序执行出现报错,输出错误日志,程序可能不会终止; `4`:CRITICAL,表示程序执行出现异常,将会终止执行程序。 | |
| **ASCEND_GLOBAL_LOG_LEVEL** | 3 | 控制 CANN 的日志级别。 | `0`:DEBUG; `1`:INFO; `2`:WARNING; `3`:ERROR; `4`:NULL,不输出日志。 | |
@@ -40,4 +40,4 @@
| **MS_ENABLE_FA_FLATTEN** | on | 控制 是否支持 FlashAttention flatten 优化。 | `on`:启用 FlashAttention flatten 优化; `off`: 禁用 FlashAttention flatten 优化。 | 对于还未适配FlashAttention flatten 优化的模型提供回退机制。 |
| **EXPERIMENTAL_KERNEL_LAUNCH_GROUP** | NA | 控制是否支持算子批量并行下发,支持开启并行下发,并配置并行数 | `thread_num`: 并发线程数,一般不建议增加,默认值为`2`; `kernel_group_num`: 算子分组总数量,每线程`kernel_group_num/thread_num`个组,默认值为`8`。 | 该特性后续还会继续演进,后续行为可能会有变更,当前仅支持`deepseek`推理场景,有一定的性能优化,但是其他模型使用该特性可能会有劣化,用户需要谨慎使用,使用方法如下:`export EXPERIMENTAL_KERNEL_LAUNCH_GROUP="thread_num:2,kernel_group_num:8"`。 |
| **FORCE_EAGER** | False | 控制是否**不开启**jit模式。 | `False`: 开启jit模式; `True`: 不开启jit模式。 | Jit将函数编译成一张可调用的MindSpore图,设置FORCE_EAGER为False开启jit模式,可以获取性能收益,当前仅支持推理模式。 |
-| **MS_ENABLE_TFT** | NA | 使能 [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) 特性,表示启用 TTP、UCE 或 ARF 功能。 | 取值为"{TTP:1,UCE:1,ARF:1}",使用某一功能时,可将对应字段配置为"1"。 | 使用方式可以参考[高可用特性](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/high_availability.html)。 |
\ No newline at end of file
+| **MS_ENABLE_TFT** | NA | 使能 [MindIO TFT](https://www.hiascend.com/document/detail/zh/mindx-dl/600/clusterscheduling/ref/mindiottp/mindiotft001.html) 特性,表示启用 TTP、UCE 或 ARF 功能。 | 取值为"{TTP:1,UCE:1,ARF:1}",使用某一功能时,可将对应字段配置为"1"。 | 使用方式可以参考[高可用特性](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/high_availability.html)。 |
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md b/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md
index f2f6ff3a48110ded8a774d27b841375f70be5dc5..3eed1c965d5163ea30e13ecb5fc9dc421a7cdcb9 100644
--- a/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md
+++ b/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md
@@ -1,5 +1,7 @@
# 使用DeepSeek-R1进行模型蒸馏的实践案例
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/distilled/distilled.md)
+
本案例参考OpenR1-Qwen-7B,旨在指导用户基于MindSpore框架和MindSpore Transformers大模型套件,使用DeepSeek-R1对Qwen2.5-Math-7B模型进行知识蒸馏和微调,以提升其在数学推理任务上的性能。案例涵盖了从环境配置、数据生成、预处理到模型微调和推理测试的完整流程。通过以下步骤,您可以了解如何利用DeepSeek-R1生成推理数据、过滤错误数据、处理数据集,并最终对模型进行微调以解决复杂的数学问题。
蒸馏流程:
@@ -12,7 +14,7 @@
### 1.1 环境
-安装方式请参考[MindSpore Transformers安装指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/quick_start/install.html)。
+安装方式请参考[MindSpore Transformers安装指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/installation.html)。
并将本案例的[distilled](https://gitee.com/mindspore/docs/tree/master/docs/mindformers/docs/source_zh_cn/example/distilled/distilled)文件夹,复制到MindSpore Transformers源码根目录下。
@@ -233,7 +235,7 @@ python toolkit/data_preprocess/huggingface/datasets_preprocess.py \
最后在`packed_data`中可以找到处理后的数据集,格式为arrow。
-更多数据集处理的教程请参考[MindSpore Transformers官方文档-数据集](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/dataset.html#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AEhandler)。
+更多数据集处理的教程请参考[MindSpore Transformers官方文档-数据集](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AEhandler)。
##### 选项 2:使用完成转换的数据
@@ -283,7 +285,7 @@ train_dataset: &train_dataset
......
```
-其余参数配置的解释可以参考[MindSpore Transformers官方文档-SFT微调](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/usage/sft_tuning.html)。
+其余参数配置的解释可以参考[MindSpore Transformers官方文档-SFT微调](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/supervised_fine_tuning.html)。
## 2. 启动微调
@@ -303,7 +305,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py --config distilled/finetune_qw
日志记录在`output/msrun_log`目录下,例如可以通过`tail -f output/msrun_log/worker_7.log`指令查看worker 7的日志信息。
微调完成后,输出的`safetensors`权重文件在`output/checkpoint`目录下。
-更多safetensors权重的内容请参考[MindSpore Transformers官方文档-Safetensors权重](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/safetensors.html)。
+更多safetensors权重的内容请参考[MindSpore Transformers官方文档-Safetensors权重](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/safetensors.html)。
## 3. 执行推理
diff --git a/docs/mindformers/docs/source_zh_cn/faq/func_related.md b/docs/mindformers/docs/source_zh_cn/faq/feature_related.md
similarity index 85%
rename from docs/mindformers/docs/source_zh_cn/faq/func_related.md
rename to docs/mindformers/docs/source_zh_cn/faq/feature_related.md
index 556d17948d0fd8b5dc823fcf0af0ee267bc29c09..01ce3ec5915b6905e0ac3da540ac646ef981edbf 100644
--- a/docs/mindformers/docs/source_zh_cn/faq/func_related.md
+++ b/docs/mindformers/docs/source_zh_cn/faq/feature_related.md
@@ -1,6 +1,6 @@
-# 功能相关
+# 功能相关 FAQ
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/faq/func_related.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/faq/feature_related.md)
## Q: WikiText数据集下载链接失效。
@@ -10,7 +10,7 @@ A: 官方下载链接失效,请关注社区Issue [#IBV35D](https://gitee.com/m
## Q: 如何生成模型切分策略文件?
-A: 模型切分策略文件记录了模型权重在分布式场景下的切分策略,一般在离线权重切分时使用。在网络`yaml`文件中配置`only_save_strategy: True`,然后正常启动分布式任务,便可在`output/strategy/`目录下生成分布式策略文件,详细介绍请参阅[分布式权重切分与合并教程](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html#%E7%A6%BB%E7%BA%BF%E8%BD%AC%E6%8D%A2%E9%85%8D%E7%BD%AE%E8%AF%B4%E6%98%8E)。
+A: 模型切分策略文件记录了模型权重在分布式场景下的切分策略,一般在离线权重切分时使用。在网络`yaml`文件中配置`only_save_strategy: True`,然后正常启动分布式任务,便可在`output/strategy/`目录下生成分布式策略文件,详细介绍请参阅[分布式权重切分与合并教程](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html#%E7%A6%BB%E7%BA%BF%E8%BD%AC%E6%8D%A2%E9%85%8D%E7%BD%AE%E8%AF%B4%E6%98%8E)。
diff --git a/docs/mindformers/docs/source_zh_cn/faq/model_related.md b/docs/mindformers/docs/source_zh_cn/faq/model_related.md
index e0e376cfef5612d0f08a70994f67a5a8ac94425a..d4ce7e1361cd823e4da45ac81a8a6236187a150c 100644
--- a/docs/mindformers/docs/source_zh_cn/faq/model_related.md
+++ b/docs/mindformers/docs/source_zh_cn/faq/model_related.md
@@ -1,4 +1,4 @@
-# 模型相关
+# 模型相关 FAQ
[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/faq/model_related.md)
diff --git a/docs/mindformers/docs/source_zh_cn/appendix/conf_files.md b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
similarity index 97%
rename from docs/mindformers/docs/source_zh_cn/appendix/conf_files.md
rename to docs/mindformers/docs/source_zh_cn/feature/configuration.md
index 96dcd64e21d364544666419aee87967de8c91596..fe6338f5cbd49fff900019f5978e7ee0e0c69d0f 100644
--- a/docs/mindformers/docs/source_zh_cn/appendix/conf_files.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
@@ -1,6 +1,6 @@
# 配置文件说明
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/appendix/conf_files.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/configuration.md)
## 概述
@@ -19,9 +19,9 @@ MindSpore Transformers提供的`YAML`文件中包含对于不同功能的配置
| seed | 设置全局种子,详情可参考[mindspore.set_seed](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_seed.html)。 | int |
| run_mode | 设置模型的运行模式,可选`train`、`finetune`、`eval`或`predict`。 | str |
| output_dir | 设置保存log、checkpoint、strategy等文件的路径。 | str |
-| load_checkpoint | 加载权重的文件或文件夹路径,目前有3个应用场景: 1. 支持传入完整权重文件路径。 2. 支持传入离线切分后的权重文件夹路径。 3. 支持传入包含lora权重和base权重的文件夹路径。 各种权重的获取途径可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html)。 | str |
-| auto_trans_ckpt | 是否开启分布式权重自动切分与合并功能,详情可参考[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html)。 | bool |
-| resume_training | 是否开启断点续训功能,详情可参考[断点续训功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/resume_training.html#%E6%96%AD%E7%82%B9%E7%BB%AD%E8%AE%AD)。 | bool |
+| load_checkpoint | 加载权重的文件或文件夹路径,目前有3个应用场景: 1. 支持传入完整权重文件路径。 2. 支持传入离线切分后的权重文件夹路径。 3. 支持传入包含lora权重和base权重的文件夹路径。 各种权重的获取途径可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。 | str |
+| auto_trans_ckpt | 是否开启分布式权重自动切分与合并功能,详情可参考[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html)。 | bool |
+| resume_training | 是否开启断点续训功能,详情可参考[断点续训功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html#%E6%96%AD%E7%82%B9%E7%BB%AD%E8%AE%AD)。 | bool |
| load_ckpt_format | 加载的模型权重的格式,可选`ckpt`、`safetensors`。 | str |
| remove_redundancy | 加载的模型权重是否去除了冗余。默认值为`False`。 | bool |
| train_precision_sync | 训练确定性计算开关。默认值为`None` 。 | Optional[bool] |
@@ -140,7 +140,7 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/
### 并行配置
-为了提升模型的性能,在大规模集群的使用场景中通常需要为模型配置并行策略,详情可参考[分布式并行](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/distributed_parallel.html),MindSpore Transformers中的并行配置如下。
+为了提升模型的性能,在大规模集群的使用场景中通常需要为模型配置并行策略,详情可参考[分布式并行](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/parallel_training.html),MindSpore Transformers中的并行配置如下。
| 参数 | 说明 | 类型 |
|-----------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------|
@@ -174,7 +174,7 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/
### 模型优化配置
-1. MindSpore Transformers提供重计算相关配置,以降低模型在训练时的内存占用,详情可参考[重计算](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/perf_optimize/perf_optimize.html#重计算)。
+1. MindSpore Transformers提供重计算相关配置,以降低模型在训练时的内存占用,详情可参考[重计算](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/advanced_development/performance_optimization.html#重计算)。
| 参数 | 说明 | 类型 |
|----------------------------------------------------|-------------------------------|-----------------|
@@ -186,7 +186,7 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/
| recompute_config.select_recompute_exclude | 关闭指定算子的重计算,只对Primitive算子有效。 | bool/list |
| recompute_config.select_comm_recompute_exclude | 关闭指定算子的通讯重计算,只对Primitive算子有效。 | bool/list |
-2. MindSpore Transformers提供细粒度激活值SWAP相关配置,以降低模型在训练时的内存占用,详情可参考[细粒度激活值SWAP](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/fine_grained_activations_swap.html)。
+2. MindSpore Transformers提供细粒度激活值SWAP相关配置,以降低模型在训练时的内存占用,详情可参考[细粒度激活值SWAP](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/fine_grained_activations_swap.html)。
| 参数 | 说明 | 类型 |
|------|-----|-----|
@@ -280,7 +280,7 @@ MindSpore Transformers提供模型评估功能,同时支持模型边训练边
### Profile配置
-MindSpore Transformers提供Profile作为模型性能调优的主要工具,详情可参考[性能调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/perf_optimize/perf_optimize.html)。以下是Profile相关配置。
+MindSpore Transformers提供Profile作为模型性能调优的主要工具,详情可参考[性能调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/advanced_development/performance_optimization.html)。以下是Profile相关配置。
| 参数 | 说明 | 类型 |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------|------|
@@ -300,7 +300,7 @@ MindSpore Transformers提供Profile作为模型性能调优的主要工具,详
### 指标监控配置
-指标监控配置主要用于配置训练过程中各指标的记录方式,详情可参考[训练指标监控](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/monitor.html)。以下是MindSpore Transformers中通用的指标监控配置项说明:
+指标监控配置主要用于配置训练过程中各指标的记录方式,详情可参考[训练指标监控](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/monitor.html)。以下是MindSpore Transformers中通用的指标监控配置项说明:
| 参数名称 | 说明 | 类型 |
|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------------|---------------|
@@ -320,7 +320,7 @@ MindSpore Transformers提供Profile作为模型性能调优的主要工具,详
### TensorBoard配置
-TensorBoard配置主要用于配置训练过程中与TensorBoard相关的参数,便于在训练过程中实时查看和监控训练信息,详情可参考[训练指标监控](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/monitor.html)。以下是MindSpore Transformers中通用的TensorBoard配置项说明:
+TensorBoard配置主要用于配置训练过程中与TensorBoard相关的参数,便于在训练过程中实时查看和监控训练信息,详情可参考[训练指标监控](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/monitor.html)。以下是MindSpore Transformers中通用的TensorBoard配置项说明:
| 参数名称 | 说明 | 类型 |
|-------------------------------------------|---------------------------------------------------------|------|
diff --git a/docs/mindformers/docs/source_zh_cn/function/dataset.md b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/function/dataset.md
rename to docs/mindformers/docs/source_zh_cn/feature/dataset.md
index 1b81eebd99018196cc19595fa768b5989f33c7f3..a370c22f516b73ce344f308f2d85e12c538fd687 100644
--- a/docs/mindformers/docs/source_zh_cn/function/dataset.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
@@ -1,6 +1,6 @@
# 数据集
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/dataset.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/dataset.md)
MindSpore Transformers目前支持多种类型的数据集加载方式,涵盖常用开源与自定义场景。具体包括:
@@ -179,7 +179,7 @@ MindSpore Transformers推荐用户使用Megatron数据集进行模型预训练
| eod | 数据集中eod的token id |
| pad | 数据集中pad的token id |
- 此外,Megatron数据集还依赖`input_columns`、`construct_args_key`、`full_batch`等配置,具体可参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html),这里仅说明在不同场景如何配置:
+ 此外,Megatron数据集还依赖`input_columns`、`construct_args_key`、`full_batch`等配置,具体可参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html),这里仅说明在不同场景如何配置:
- 当`create_compressed_eod_mask=True`时:
@@ -380,7 +380,7 @@ train_dataset: &train_dataset
prefetch_size: 1
```
- 1. `train_dataset`中参数说明可参考[文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html);
+ 1. `train_dataset`中参数说明可参考[文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html);
2. `AlpacaInstructDataHandler`是针对`alpaca`数据集开发的在线处理脚本,如果使用其他数据集,用户需要参考[自定义数据handler](#自定义数据handler)完成自定义数据处理的功能实现。
@@ -496,7 +496,7 @@ export MS_DEV_RUNTIME_CONF="aclnn_cache_queue_length:64"
prefetch_size: 1
```
- 其余参数介绍可以参考 [配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html) 的 “模型训练配置” 和 “模型评估配置”。
+ 其余参数介绍可以参考 [配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html) 的 “模型训练配置” 和 “模型评估配置”。
自定义数据 handler:
@@ -611,7 +611,7 @@ export MS_DEV_RUNTIME_CONF="aclnn_cache_queue_length:64"
seed: 0
```
- 其余参数介绍可以参考 [配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html) 的 “模型训练配置” 和 “模型评估配置”。
+ 其余参数介绍可以参考 [配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html) 的 “模型训练配置” 和 “模型评估配置”。
自定义 adgen_handler:
diff --git a/docs/mindformers/docs/source_zh_cn/usage/evaluation.md b/docs/mindformers/docs/source_zh_cn/feature/evaluation.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/usage/evaluation.md
rename to docs/mindformers/docs/source_zh_cn/feature/evaluation.md
index 84e8edd5ec801550e3af17439dbda49d2a831ef7..cbd2f65c57f3aeae19804de0a7e3707862afc5bc 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/evaluation.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/evaluation.md
@@ -1,6 +1,6 @@
# 评测
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/evaluation.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/evaluation.md)
## Harness评测
@@ -43,7 +43,7 @@ pip install -e .
#### 评测前准备
1. 创建一个新目录,例如名称为`model_dir`,用于存储模型yaml文件。
- 2. 在上个步骤创建的目录中,放置模型推理yaml配置文件(predict_xxx_.yaml)。不同模型的推理yaml配置文件所在目录位置,请参考[模型库](../start/models.md)。
+ 2. 在上个步骤创建的目录中,放置模型推理yaml配置文件(predict_xxx_.yaml)。不同模型的推理yaml配置文件所在目录位置,请参考[模型库](../introduction/models.md)。
3. 配置yaml文件。如果yaml中模型类、模型Config类、模型Tokenzier类使用了外挂代码,即代码文件在[research](https://gitee.com/mindspore/mindformers/tree/dev/research)目录或其他外部目录下,需要修改yaml文件:在相应类的`type`字段下,添加`auto_register`字段,格式为“module.class”(其中“module”为类所在脚本的文件名,“class”为类名。如果已存在,则不需要修改)。
以[predict_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml)配置为例,对其中的部分配置项进行如下修改:
@@ -58,7 +58,7 @@ pip install -e .
auto_register: llama3_tokenizer.Llama3Tokenizer
```
- 关于每个配置项的详细说明请参考[配置文件说明](../appendix/conf_files.md)。
+ 关于每个配置项的详细说明请参考[配置文件说明](../feature/configuration.md)。
4. 如果使用`ceval-valid`、`mmlu`、`cmmlu`、`race`、`lambada`数据集进行评测,需要将`use_flash_attention`设置为`False`,以`predict_llama3_1_8b.yaml`为例,修改yaml如下:
```yaml
@@ -362,7 +362,7 @@ OpenEuler系统按照如下步骤安装:
#### 评测前准备
1. 创建一个新目录,例如名称为`model_dir`,用于存储模型yaml文件;
-2. 在上个步骤创建的目录中放置模型推理yaml配置文件(predict_xxx_.yaml),不同模型的推理yaml配置文件的目录位置参考[模型库](../start/models.md)各模型说明文档中的模型文件树;
+2. 在上个步骤创建的目录中放置模型推理yaml配置文件(predict_xxx_.yaml),不同模型的推理yaml配置文件的目录位置参考[模型库](../introduction/models.md)各模型说明文档中的模型文件树;
3. 配置yaml配置文件。
以[predict_cogvlm2_image_llama3_chat_19b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/cogvlm2/predict_cogvlm2_image_llama3_chat_19b.yaml)配置为例:
@@ -378,7 +378,7 @@ OpenEuler系统按照如下步骤安装:
vocab_file: "/{path}/tokenizer.model" # 指定tokenizer文件路径
```
- 配置yaml文件,参考[配置文件说明](../appendix/conf_files.md)。
+ 配置yaml文件,参考[配置文件说明](../feature/configuration.md)。
4. MMbench-Video数据集评测需要使用GPT-4 Turbo模型进行评测打分,请提前准备好相应的API Key,并放在VLMEvalKit/.env文件中,内容如下所示:
```text
diff --git a/docs/mindformers/docs/source_zh_cn/function/high_availability.md b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
similarity index 99%
rename from docs/mindformers/docs/source_zh_cn/function/high_availability.md
rename to docs/mindformers/docs/source_zh_cn/feature/high_availability.md
index bf6f64b70a39ad9f04fa8fdff4d53662e275b61b..eaab6b7307882a5ed6c8ef1d549743461b55f530 100644
--- a/docs/mindformers/docs/source_zh_cn/function/high_availability.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
@@ -1,6 +1,6 @@
# 高可用特性
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/high_availability.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/high_availability.md)
## 概述
diff --git a/docs/mindformers/docs/source_zh_cn/function/image/TrainingStateMonitor_log.png b/docs/mindformers/docs/source_zh_cn/feature/image/TrainingStateMonitor_log.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/function/image/TrainingStateMonitor_log.png
rename to docs/mindformers/docs/source_zh_cn/feature/image/TrainingStateMonitor_log.png
diff --git a/docs/mindformers/docs/source_zh_cn/function/image/adam_m_norm.png b/docs/mindformers/docs/source_zh_cn/feature/image/adam_m_norm.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/function/image/adam_m_norm.png
rename to docs/mindformers/docs/source_zh_cn/feature/image/adam_m_norm.png
diff --git a/docs/mindformers/docs/source_zh_cn/function/image/commondataloader.png b/docs/mindformers/docs/source_zh_cn/feature/image/commondataloader.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/function/image/commondataloader.png
rename to docs/mindformers/docs/source_zh_cn/feature/image/commondataloader.png
diff --git a/docs/mindformers/docs/source_zh_cn/function/image/local_loss&local_norm.png b/docs/mindformers/docs/source_zh_cn/feature/image/local_loss&local_norm.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/function/image/local_loss&local_norm.png
rename to docs/mindformers/docs/source_zh_cn/feature/image/local_loss&local_norm.png
diff --git a/docs/mindformers/docs/source_zh_cn/function/image/tensorboard_scalar.png b/docs/mindformers/docs/source_zh_cn/feature/image/tensorboard_scalar.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/function/image/tensorboard_scalar.png
rename to docs/mindformers/docs/source_zh_cn/feature/image/tensorboard_scalar.png
diff --git a/docs/mindformers/docs/source_zh_cn/function/image/tensorboard_text.png b/docs/mindformers/docs/source_zh_cn/feature/image/tensorboard_text.png
similarity index 100%
rename from docs/mindformers/docs/source_zh_cn/function/image/tensorboard_text.png
rename to docs/mindformers/docs/source_zh_cn/feature/image/tensorboard_text.png
diff --git a/docs/mindformers/docs/source_zh_cn/function/logs.md b/docs/mindformers/docs/source_zh_cn/feature/logging.md
similarity index 88%
rename from docs/mindformers/docs/source_zh_cn/function/logs.md
rename to docs/mindformers/docs/source_zh_cn/feature/logging.md
index 97ecfd02cc2a2b182c8d0d833752747b95f4ab3a..3a2c29ea5d38847c1dae450489e94b065fbc88b0 100644
--- a/docs/mindformers/docs/source_zh_cn/function/logs.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/logging.md
@@ -1,6 +1,6 @@
# 日志
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/logs.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/logging.md)
## 日志保存
@@ -54,12 +54,12 @@ output_dir: './output' # path to save logs/checkpoint/strategy
#### 单卡任务指定输出目录
-除了 yaml 文件配置来指定,MindSpore TransFormer 还支持在 [run_mindformer 一键启动脚本](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/start_tasks.html?highlight=%E6%97%A5%E5%BF%97#run-mindformer%E4%B8%80%E9%94%AE%E5%90%AF%E5%8A%A8%E8%84%9A%E6%9C%AC) 中,使用 `--output_dir` 启动命令对日志输出路径做指定。
+除了 yaml 文件配置来指定,MindSpore TransFormer 还支持在 [run_mindformer 一键启动脚本](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/start_tasks.html?highlight=%E6%97%A5%E5%BF%97#run-mindformer%E4%B8%80%E9%94%AE%E5%90%AF%E5%8A%A8%E8%84%9A%E6%9C%AC) 中,使用 `--output_dir` 启动命令对日志输出路径做指定。
> 如果在这里配置了输出路径,将会覆盖 yaml 文件中的配置!
#### 分布式任务指定输出目录
-如果模型训练需要用到多台服务器,使用[分布式任务拉起脚本 msrun_launcher.sh](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/start_tasks.html?highlight=%E6%97%A5%E5%BF%97#%E5%88%86%E5%B8%83%E5%BC%8F%E4%BB%BB%E5%8A%A1%E6%8B%89%E8%B5%B7%E8%84%9A%E6%9C%AC) 来启动分布式训练任务。
+如果模型训练需要用到多台服务器,使用[分布式任务拉起脚本 msrun_launcher.sh](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/start_tasks.html?highlight=%E6%97%A5%E5%BF%97#%E5%88%86%E5%B8%83%E5%BC%8F%E4%BB%BB%E5%8A%A1%E6%8B%89%E8%B5%B7%E8%84%9A%E6%9C%AC) 来启动分布式训练任务。
在设置了共享存储的情况下,还可以在启动脚本中指定入参 `LOG_DIR` 来指定 Worker 以及 Scheduler 的日志输出路径,将所有机器节点的日志都输出到一个路径下,方便统一观察。
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/function/fine_grained_activations_swap.md b/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
similarity index 58%
rename from docs/mindformers/docs/source_zh_cn/function/fine_grained_activations_swap.md
rename to docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
index 82892826659bd9f57e6577ca81905ce688b7e1a6..79f9ecd79fd2855ab14662186dffb1c6c77af17d 100644
--- a/docs/mindformers/docs/source_zh_cn/function/fine_grained_activations_swap.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
@@ -1,8 +1,76 @@
-# 细粒度激活值SWAP
+# 训练内存优化特性
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/fine_grained_activations_swap.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md)
-## 概述
+## 重计算
+
+### 概述
+
+重计算可以显著降低训练时的激活内存,但会额外增加一些计算。关于重计算的原理和框架侧能力可参考 [MindSpore 教程文档:重计算](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/recompute.html)。
+
+### 配置与使用
+
+#### YAML 参数配置
+
+用户可通过在模型训练的 yaml 配置文件中新增 `recompute_config` 模块来使用重计算。
+
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L113) 为例,可做如下配置:
+
+```yaml
+# recompute config
+recompute_config:
+ recompute: [3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 2, 0]
+ select_recompute: False
+ parallel_optimizer_comm_recompute: True
+ mp_comm_recompute: True
+ recompute_slice_activation: True
+```
+
+如果需要将选择重计算配置到某几个特定层,可以使用 tuple 的方式进行配置。
+
+例如:一个网络有 48 层,`pp_interleave_num` 为 `2`,`pipeline_stage` 为 `5`,`offset` 设为 `[[0,1,1,1,1],[1,1,1,1,0]]`,重计算配置如下:
+
+```yaml
+# recompute config
+recompute_config:
+ recompute: [[2,1,0,0,0],[1,0,0,0,0]]
+ select_recompute:
+ 'feed_forward\.w1\.activation\.silu': True
+ 'feed_forward\.mul': True
+ 'feed_forward\.w1\.matmul': [[1,0,0,0,0],[2,1,0,0,0]]
+ 'feed_forward\.w3\.matmul': [2,1,0,0,0]
+ select_comm_recompute: ['ffn_norm\.norm','attention_norm\.norm']
+```
+
+在日志中会打印将输入格式规范化后的重计算策略信息:
+
+```log
+INFO - Formative layer_recompute: [[2, 1, 0, 0, 0], [1, 0, 0, 0, 0]]
+INFO - Formative select_recompute: {'feed_forward\.w1\.activation\.silu': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]], 'feed_forward\.mul': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]], 'feed_forward\.w1\.matmul': [[1, 0, 0, 0, 0], [2, 1, 0, 0, 0]], 'feed_forward\.w3\.matmul': [[1, 1, 0, 0, 0], [1, 0, 0, 0, 0]]}
+INFO - Formative select_comm_recompute: {'ffn_norm\.norm': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]], 'attention_norm\.norm': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]]}
+```
+
+随后会打印每一层重计算的配置方式。
+
+> 1. 如果某一层同时配置了完全重计算与选择重计算,则按完全重计算生效。
+> 2. 在一维整数型 list 或 tuple 中的整数可以替换为 True 或 False,代表对所有层启用或关闭重计算。
+
+#### 主要配置参数介绍
+
+有关重计算配置的主要参数如下表所列:
+
+| 参数 | 描述 | 取值说明 |
+|-----------------------------------|----------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| recompute | (按层)完全重计算。 | 可配置为 bool,整数型的 list 或 tuple,或二维 list 或 tuple。 配置为 bool 类型时,对所有层开启或关闭完全重计算; 配置为整数型 list 或 tuple 时,代表每个 `pipline_stage` 中有多少层开启完全重计算, `pp_interleave_num > 1` 时开启的重计算层数会均匀分配到各 interleave 中; 配置为整数型二维 list 或 tuple 时,代表每个 mini stage 中有多少层开启完全重计算。 |
+| select_recompute | (按算子)选择重计算。 | 可配置为 bool,整数型的 list 或 tuple,或二维 list 或 tuple,字符串的 list 或 tuple,以及 dict。 默认选择重计算算子为 `['feed_forward\\.mul', 'feed_forward\\.w1\\.activation\\.silu']` 。 配置为 bool 类型时,对所有层开启或关闭默认算子的选择重计算; 配置为整数型 list 或 tuple 时,代表每个 `pipline_stage` 中有多少层开启默认算子的选择重计算, `pp_interleave_num > 1` 时开启的选择重计算层数会均匀分配到各 interleave 中; 配置为整数型二维 list 或 tuple 时,代表每个 mini stage 中有多少层开启默认算子的选择重计算。 配置为字符串 list 或 tuple 时,代表对哪些算子开启选择重计算,算子名通过正则表达式匹配,层级关系通过 `'\\.'` 分割; 配置为 dict 时,key 值对应算子名,value 值对应选择重计算的配置方式,这种配法可以对每个算子精细配置重计算策略。 |
+| select_comm_recompute | (按算子)选择通信重计算。 | 配置方式与 **select_recompute** 相同,默认选择通信重计算算子为 `['.*\\.norm']` 。一般仅对 layer_norm 或类似层进行配置。 |
+| parallel_optimizer_comm_recompute | 优化器并行通信重计算。在优化器并行下,是否重计算 AllGather 通信。 | (bool, 可选) - 开启后在自动并行或半自动并行模式下,指定 Cell 内部由优化器并行引入的 AllGather 通信是否重计算。 默认值: `False` 。 |
+| mp_comm_recompute | 模型并行通信重计算,在模型并行下,是否重计算通信算子。 | (bool, 可选) - 开启后在自动并行或半自动并行模式下,指定 Cell 内部由模型并行引入的通信操作是否重计算。默认值: `True` 。 |
+| recompute_slice_activation | 切片重计算,是否对将保留在内存中的 Cell 输出进行切片。 | (bool, 可选) - 默认值: `False` 。 |
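+
+作为对照,下面给出最简单的布尔型配置写法(对所有层开启完全重计算,其余选项保持默认;该示例仅供参考,实际取值请结合模型规模与显存情况调整):
+
+```yaml
+# recompute config(最简示例)
+recompute_config:
+  recompute: True
+```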
+
+## 细粒度激活值SWAP
+
+### 概述
在传统大模型训练任务中,计算卡的显存资源常常成为训练瓶颈,采用更大规模的模型并行(model parallel, mp)和流水线并行(pipeline parallel, pp)切分策略,虽然能一定程度上缓解单张计算卡的显存压力,但需要更大规模的集群资源,且引入过多的通信会极大地降低模型的MFU。在集群资源有限的情况下,重计算是另一个缓解内存压力的有效手段,其通过放弃存储正向传播阶段的激活值,并在梯度反向回传时重新计算所需激活值,来降低激活值的显存占用,由于重计算需引入额外的计算开销,因此该方法同样会显著降低模型训练的MFU(Model FLOPs Utilization)。
@@ -10,9 +78,9 @@
细粒度激活值SWAP技术具备较高的使用灵活度。大模型训练的正向传播阶段,将产生数据量大小不同的若干激活值,用户可按需选择特定的激活值进行SWAP,且选择激活值的粒度为算子级。当模型类型或规格改变时,用户可灵活调整对应的SWAP策略,以追求最低的内存开销和最优的性能。
-## 使用说明
+### 使用说明
-### 约束场景
+#### 约束场景
- 仅支持静态图O0/O1模式
- 支持Llama系稠密模型,后续演进支持MoE稀疏模型
@@ -31,7 +99,7 @@
- 仅支持Ascend后端
-### 接口说明
+#### 接口说明
细粒度激活值SWAP特性通过YAML配置`swap_config`字段使能,包括`swap`、`default_prefetch`、`layer_swap`、`op_swap`四个功能接口,用户可通过此接口灵活选择特定层或特定层的特定算子使能激活值SWAP功能。
@@ -44,7 +112,7 @@
| layer_swap | List | 默认值None。当为None时,本接口不生效;当为List类型时,本接口包含若干Dict类型的列表元素,每个Dict类型元素包含`backward_prefetch`与`layers`两个键,提供使能SWAP的预取时机(即开始搬回操作的时机)和对应的层索引。 |
| op_swap | List | 默认值None。当为None时,本接口不生效;当为List类型时,本接口包含若干Dict类型的列表元素,每个Dict类型元素包含`op_name`、`backward_prefetch`与`layers`三个键,提供使能SWAP的预取时机和对应的算子名、层索引。 |
-### 混合重计算
+#### 混合重计算
细粒度激活值SWAP与重计算存在耦合:
@@ -58,15 +126,15 @@
| select_recompute | 确定各pipeline stage中特定算子使能重计算的层数 | 不感知pipeline stage,对于每个算子的键值对,仅接受bool/list类型入参。当为bool类型时,所有层使能重计算;当为list类型时,列表元素为层索引,按索引选择特定层使能重计算 |
| select_comm_recompute | 确定各pipeline stage中通信算子使能重计算的层数 | 不感知pipeline stage,仅接受bool/list类型入参。当为bool类型时,所有层使能重计算;当为list类型时,列表元素为层索引,按索引选择特定层使能重计算 |
-## 使用示例
+### 使用示例
本章节以 Llama2-7B 训练为例,演示细粒度激活值SWAP特性的使用。
-### 环境准备
+#### 环境准备
下载 MindSpore Transformers,并准备预训练数据集,如wikitext等。
-### 示例一:默认SWAP策略
+#### 示例一:默认SWAP策略
在YAML中修改补充重计算与SWAP配置,主要配置参数如下:
@@ -112,7 +180,7 @@ bash ./scripts/msrun_launcher.sh "run_mindformer.py \
默认SWAP策略执行成功。
-### 示例二:选择特定层使能SWAP
+#### 示例二:选择特定层使能SWAP
在YAML中修改补充重计算与SWAP配置,主要配置参数如下:
@@ -158,7 +226,7 @@ bash ./scripts/msrun_launcher.sh "run_mindformer.py \
选择特定层使能SWAP的策略执行成功。
-### 示例三:选择特定层的特定算子使能SWAP
+#### 示例三:选择特定层的特定算子使能SWAP
在YAML中修改补充重计算与SWAP配置,主要配置参数如下:
@@ -213,7 +281,7 @@ bash ./scripts/msrun_launcher.sh "run_mindformer.py \
选择特定层的特定算子使能SWAP成功。
-### 示例四:细粒度激活值SWAP与重计算混用
+#### 示例四:细粒度激活值SWAP与重计算混用
在YAML中修改补充重计算与SWAP配置,主要配置参数如下:
diff --git a/docs/mindformers/docs/source_zh_cn/function/monitor.md b/docs/mindformers/docs/source_zh_cn/feature/monitor.md
similarity index 99%
rename from docs/mindformers/docs/source_zh_cn/function/monitor.md
rename to docs/mindformers/docs/source_zh_cn/feature/monitor.md
index 57b478433142ce067166ce9e45c03f6d8745b71c..9cdf61a2a113e1d15998bc40a0ff4b21bb0a59eb 100644
--- a/docs/mindformers/docs/source_zh_cn/function/monitor.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/monitor.md
@@ -1,6 +1,6 @@
# 训练指标监控
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/monitor.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/monitor.md)
MindSpore Transformers 支持 TensorBoard 作为可视化工具,用于监控和分析训练过程中的各种指标和信息。TensorBoard 是一个独立的可视化库,需要用户手动安装,它提供了一种交互式的方式来查看训练中的损失、精度、学习率、梯度分布等多种内容。用户在训练`yaml`文件中配置 TensorBoard 后,在大模型训练过程中会实时生成并更新事件文件,可以通过命令查看训练数据。
@@ -223,4 +223,4 @@ local_loss与local_norm
> 2. 用户在训练配置文件 `yaml` 中设置的配置参数;
> 3. 训练默认的配置参数。
>
-> 可配置的所有参数请参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html)。
\ No newline at end of file
+> 可配置的所有参数请参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html)。
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md b/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
new file mode 100644
index 0000000000000000000000000000000000000000..c1017c784f045e7330028ed154e1fb694776dfa0
--- /dev/null
+++ b/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md
@@ -0,0 +1,75 @@
+# 其它训练特性
+
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/other_training_features.md)
+
+在大规模深度学习模型训练中,会遇到内存限制、计算资源的有效利用、分布式训练中的同步等挑战,需要使用训练优化算法来提高训练效率、加速收敛速度并改善最终模型性能。
+
+MindSpore Transformers 提供了梯度累积、梯度裁剪等训练优化算法,可供开发者在训练时使用。
+
+## 梯度累积
+
+### 概述
+
+MindSpore 在 2.1.1 之后的版本中增加了 `mindspore.nn.wrap.cell_wrapper.GradAccumulationCell` 这一梯度累积实现接口,通过拆分 MiniBatch 的形式提供了梯度累加的能力,MindSpore Transformers 将其封装进了统一的训练流程,通过 yaml 配置进行使能。关于梯度累积的原理和框架侧的能力可以参考 [MindSpore 文档:梯度累加](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/distributed_gradient_accumulation.html)。
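+
+梯度累积的基本思想可以用如下概念性示意来理解(仅为示意代码,并非 `GradAccumulationCell` 的实际实现,其中 `dataloader`、`grad_fn`、`optimizer` 均为假设已存在的对象):
+
+```python
+# 将一个 global batch 拆成 accum_steps 个 micro batch,
+# 梯度累加 accum_steps 次后再做一次参数更新
+accum_steps = 4
+accum_grads = None
+
+for step, (inputs, labels) in enumerate(dataloader):
+    grads = grad_fn(inputs, labels)  # 计算当前 micro batch 的梯度
+    if accum_grads is None:
+        accum_grads = [g / accum_steps for g in grads]
+    else:
+        accum_grads = [a + g / accum_steps for a, g in zip(accum_grads, grads)]
+    if (step + 1) % accum_steps == 0:  # 累积到设定步数后统一更新参数
+        optimizer(accum_grads)
+        accum_grads = None
+```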
+
+### 配置与使用
+
+#### YAML 参数配置
+
+用户在需要开启梯度累积的场景下,只需在配置文件中的 `runner_config` 项下配置 `gradient_accumulation_steps` 项,设置为所需的梯度累积步数即可:
+
+```yaml
+# runner config
+runner_config:
+ ...
+ gradient_accumulation_steps: 4
+ ...
+```
+
+#### 主要配置参数介绍
+
+| 参数 | 描述 | 取值说明 |
+|-----------------------------|---------------------------------|------------------------|
+| gradient_accumulation_steps | 在执行反向传播前,累积梯度的步数。 | (int, 必选) - 默认值: `1` 。 |
+
+#### 其他方式使用梯度累积
+
+除配置文件外,当采用 `run_mindformer.py` 脚本启动时,可指定 `--gradient_accumulation_steps` 入参来使用梯度累积功能。
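+
+例如(示例命令,配置文件路径请替换为实际使用的 yaml 文件):
+
+```bash
+# 示例:启动训练任务并开启 4 步梯度累积
+python run_mindformer.py --config /path/to/config.yaml --run_mode train --gradient_accumulation_steps 4
+```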
+
+#### 梯度累积使用限制
+
+> 开启梯度累积会增大内存开销,请注意内存管理,防止发生内存溢出(OOM)。
+
+1. 由于 `GradAccumulationCell` 的实现依赖并行特性,梯度累积当前仅支持在**半自动并行模式**下使用;
+2. 此外,在 pipeline 并行场景下,梯度累积的含义与 micro_batch 相同,该配置将不会生效,请改为配置 `micro_batch_num` 项以增大训练 batch_size。
+
+## 梯度裁剪
+
+### 概述
+
+梯度裁剪算法可以避免反向梯度过大而跳过最优解的情况。
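+
+梯度裁剪通常按范数进行:当梯度的(全局)范数超过 `max_grad_norm` 时,按比例缩小所有梯度。下面给出一个按全局范数裁剪的概念性示意(仅为示意代码,并非 `MFTrainOneStepCell` 的实际实现,假设 `grads` 为 numpy 数组列表):
+
+```python
+import numpy as np
+
+def clip_by_global_norm(grads, max_grad_norm=1.0):
+    """当梯度的全局 L2 范数超过 max_grad_norm 时,按比例缩放所有梯度。"""
+    global_norm = float(np.sqrt(sum(np.sum(np.square(g)) for g in grads)))
+    if global_norm > max_grad_norm:
+        scale = max_grad_norm / (global_norm + 1e-6)
+        grads = [g * scale for g in grads]
+    return grads
+```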
+
+### 配置与使用
+
+#### YAML 参数配置
+
+在 MindSpore Transformers 中,默认的训练流程 `MFTrainOneStepCell` 集成了梯度裁剪逻辑。
+
+可使用如下示例,以开启梯度裁剪:
+
+```yaml
+# wrapper cell config
+runner_wrapper:
+ type: MFTrainOneStepCell
+ ...
+ use_clip_grad: True
+ max_grad_norm: 1.0
+ ...
+```
+
+#### 主要配置参数介绍
+
+| 参数 | 描述 | 取值说明 |
+|---------------|-------------------|----------------------------|
+| use_clip_grad | 控制在训练过程中是否开启梯度裁剪。 | (bool, 可选) - 默认值: `False` 。 |
+| max_grad_norm | 控制梯度裁剪的最大 norm 值。 | (float, 可选) - 默认值: `1.0` 。 |
diff --git a/docs/mindformers/docs/source_zh_cn/function/distributed_parallel.md b/docs/mindformers/docs/source_zh_cn/feature/parallel_training.md
similarity index 95%
rename from docs/mindformers/docs/source_zh_cn/function/distributed_parallel.md
rename to docs/mindformers/docs/source_zh_cn/feature/parallel_training.md
index f3d679f1364eff25a820011be344c05a918b7704..22ba674a9911b7fad9d3549d54f1c904fc00536d 100644
--- a/docs/mindformers/docs/source_zh_cn/function/distributed_parallel.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/parallel_training.md
@@ -1,6 +1,6 @@
-# 分布式并行
+# 分布式并行训练
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/distributed_parallel.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/parallel_training.md)
## 并行模式与应用场景
@@ -33,7 +33,7 @@ MindSpore Transformers 支持多种并行特性,开发者可以利用这些特
| **[长序列并行](#长序列并行)** | 设计用于处理长序列输入的模型,对所有的input输入和所有的输出activation在sequence维度上进行切分,对于超长序列输入场景进一步减少显存占用。 |
| **[多副本并行](https://www.mindspore.cn/docs/zh-CN/master/features/parallel/pipeline_parallel.html#mindspore%E4%B8%AD%E7%9A%84interleaved-pipeline%E8%B0%83%E5%BA%A6)** | 用于在多个副本之间实现精细的并行控制,优化性能和资源利用率,适合大规格模型的高效训练。 |
-关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html) 中的并行配置章节下的具体内容。
+关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html) 中的并行配置章节下的具体内容。
## 并行特性介绍
@@ -64,7 +64,7 @@ parallel_config:
- use_ring_attention:是否开启Ring Attention,默认为False。
- context_parallel:序列并行切分数量,默认为1,根据用户需求配置。
-关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html) 中的并行配置章节下的具体内容。
+关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html) 中的并行配置章节下的具体内容。
#### Ulysses序列并行
@@ -95,7 +95,7 @@ parallel_config:
- enable_alltoall:生成alltoall通信算子,默认为False,不启用时将会由allgather等其他算子组合完成等价替代,可参考MindSpore `set_auto_parallel_context`[接口文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_auto_parallel_context.html);启用Ulysses方案时我们期望能够直接插入alltoall通信算子,因此将该配置项打开。
- context_parallel_algo:设置为`ulysses_cp`开启Ulysses序列并行。
-关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html) 中的并行配置章节下的具体内容。
+关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html) 中的并行配置章节下的具体内容。
#### 混合序列并行
@@ -121,7 +121,7 @@ parallel_config:
- context_parallel_algo:设置为`hybrid_cp`时开启混合序列并行。
- ulysses_degree_in_cp:Ulysses序列并行切分数量。
-关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html) 中的并行配置章节下的具体内容。
+关于分布式并行参数的配置方法,参见 [MindSpore Transformers 配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html) 中的并行配置章节下的具体内容。
### 流水线并行
@@ -153,7 +153,7 @@ parallel_config:
- 目前仅支持Llama和DeepSeek系列模型。
- 目前暂不支持使用Megatron的多源数据集进行训练的场景。
-关于分布式并行参数的配置方法,参见 [MindSpore Transformers配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/appendix/conf_files.html) 中的并行配置章节下的具体内容。
+关于分布式并行参数的配置方法,参见 [MindSpore Transformers配置说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html) 中的并行配置章节下的具体内容。
## MindSpore Transformers 分布式并行应用实践
diff --git a/docs/mindformers/docs/source_zh_cn/usage/quantization.md b/docs/mindformers/docs/source_zh_cn/feature/quantization.md
similarity index 97%
rename from docs/mindformers/docs/source_zh_cn/usage/quantization.md
rename to docs/mindformers/docs/source_zh_cn/feature/quantization.md
index c439651c502c144809483f8eeaefb606a8f73a6b..76d2c1a3b3a5dc84868835733ce69fffdf7ecad9 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/quantization.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/quantization.md
@@ -1,6 +1,6 @@
# 量化
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/quantization.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/quantization.md)
## 概述
diff --git a/docs/mindformers/docs/source_zh_cn/function/resume_training.md b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
similarity index 99%
rename from docs/mindformers/docs/source_zh_cn/function/resume_training.md
rename to docs/mindformers/docs/source_zh_cn/feature/resume_training.md
index f91c66a89932a4e4a76ede2e10fb490a8e27fecc..f9762887e3e6174842eab9859ccccaffecc81a59 100644
--- a/docs/mindformers/docs/source_zh_cn/function/resume_training.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
@@ -1,6 +1,6 @@
# 模型断点续训
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/resume_training.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/resume_training.md)
## 断点续训
diff --git a/docs/mindformers/docs/source_zh_cn/function/safetensors.md b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
similarity index 96%
rename from docs/mindformers/docs/source_zh_cn/function/safetensors.md
rename to docs/mindformers/docs/source_zh_cn/feature/safetensors.md
index f4d3a76795215a1b90a8328d55432f88244aff68..6e33d1143ddfb88a87e38306258ec1ba639de253 100644
--- a/docs/mindformers/docs/source_zh_cn/function/safetensors.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
@@ -1,6 +1,6 @@
# Safetensors权重
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/safetensors.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/safetensors.md)
## 概述
@@ -15,7 +15,7 @@ Safetensors文件主要分为两种类型:完整权重文件和分布式权重
Safetensors完整权重可通过以下两种方式获取:
1. 直接从Huggingface上下载。
-2. 通过MindSpore Transformers分布式训练后,通过[合并脚本](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html#safetensors%E6%9D%83%E9%87%8D%E7%A6%BB%E7%BA%BF%E5%90%88%E5%B9%B6)生成完整权重。
+2. 通过MindSpore Transformers分布式训练后,通过[合并脚本](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html#safetensors%E6%9D%83%E9%87%8D%E7%A6%BB%E7%BA%BF%E5%90%88%E5%B9%B6)生成完整权重。
Huggingface Safetensors示例目录结构:
@@ -106,7 +106,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
任务执行完成后,在mindformers/output目录下,会生成checkpoint文件夹,同时模型文件会保存在该文件夹下。
-更多详情请参考:[预训练介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/usage/pre_training.html)
+更多详情请参考:[预训练介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/pre_training.html)
### 微调任务示例
@@ -154,7 +154,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
任务执行完成后,在mindformers/output目录下,会生成checkpoint文件夹,同时模型文件会保存在该文件夹下。
-更多详情请参考:[SFT微调介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/usage/sft_tuning.html)
+更多详情请参考:[SFT微调介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/supervised_fine_tuning.html)
### 推理任务示例
@@ -201,7 +201,7 @@ bash scripts/msrun_launcher.sh "python run_mindformer.py \
'text_generation_text': [I love Beijing, because it is a city with a long history and culture.......]
```
-更多详情请参考:[推理介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/usage/inference.html)
+更多详情请参考:[推理介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/inference.html)
### 断点续训任务示例
@@ -237,9 +237,9 @@ callbacks:
checkpoint_format: safetensors # 保存权重文件格式
```
-大集群规模场景下,避免在线合并过程耗时过长占用训练资源,推荐将原分布式权重文件离线[合并完整权重](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html#safetensors%E6%9D%83%E9%87%8D%E7%A6%BB%E7%BA%BF%E5%90%88%E5%B9%B6)后传入,无需传入源切分策略文件路径。
+大集群规模场景下,避免在线合并过程耗时过长占用训练资源,推荐将原分布式权重文件离线[合并完整权重](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html#safetensors%E6%9D%83%E9%87%8D%E7%A6%BB%E7%BA%BF%E5%90%88%E5%B9%B6)后传入,无需传入源切分策略文件路径。
-更多详情请参考:[断点续训介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/resume_training.html)。
+更多详情请参考:[断点续训介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html)。
## 权重保存
@@ -247,7 +247,7 @@ callbacks:
在深度学习模型的训练过程中,保存模型的权重是至关重要的一步。权重保存功能使得我们能够在训练的任意阶段存储模型的参数,以便用户在训练中断或完成后进行恢复、继续训练、评估或部署。同时还可以通过保存权重的方式,在不同环境下复现实验结果。
-目前,MindSpore TransFormer 支持 [safetensors](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/safetensors.html) 格式的权重文件读取和保存。
+目前,MindSpore TransFormer 支持 [safetensors](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/safetensors.html) 格式的权重文件读取和保存。
### 目录结构
diff --git a/docs/mindformers/docs/source_zh_cn/function/start_tasks.md b/docs/mindformers/docs/source_zh_cn/feature/start_tasks.md
similarity index 96%
rename from docs/mindformers/docs/source_zh_cn/function/start_tasks.md
rename to docs/mindformers/docs/source_zh_cn/feature/start_tasks.md
index bef2b6993416d0077949237243084f775cde92a3..f4e983a0676a7894d86a9b4002cfefab13b1f624 100644
--- a/docs/mindformers/docs/source_zh_cn/function/start_tasks.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/start_tasks.md
@@ -1,6 +1,6 @@
# 启动任务
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/start_tasks.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/start_tasks.md)
## 概述
@@ -22,7 +22,7 @@ MindSpore Transformers提供了一键启动脚本`run_mindformer.py`和分布式
| `--device_id` | 设置执行设备ID,其值必须在可用设备范围内。 | int,可选 | 预训练/微调/推理 |
| `--device_target` | 设置后端执行设备,MindSpore Transformers仅支持在`Ascend`设备上运行。 | str,可选 | 预训练/微调/推理 |
| `--run_mode` | 设置模型的运行模式,可选`train`、`finetune`或`predict`。 | str,可选 | 预训练/微调/推理 |
-| `--load_checkpoint` | 加载的权重文件或文件夹路径,详细使用方式参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html)。 | str,可选 | 预训练/微调/推理 |
+| `--load_checkpoint` | 加载的权重文件或文件夹路径,详细使用方式参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。 | str,可选 | 预训练/微调/推理 |
| `--use_parallel` | 是否开启并行模式。 | bool,可选 | 预训练/微调/推理 |
| `--output_dir` | 设置保存日志、权重、切分策略等文件的路径。 | str,可选 | 预训练/微调/推理 |
| `--register_path` | 外挂代码所在目录的绝对路径。比如research目录下的模型目录。 | str,可选 | 预训练/微调/推理 |
@@ -33,7 +33,7 @@ MindSpore Transformers提供了一键启动脚本`run_mindformer.py`和分布式
| 参数 | 参数说明 | 取值说明 | 适用场景 |
|:----------------------------:|:-------------------------------------------------------------------------------------------------------------------|--------------------------------|-----------|
| `--src_strategy_path_or_dir` | 权重的策略文件路径。 | str,可选 | 预训练/微调/推理 |
-| `--auto_trans_ckpt` | 是否开启在线权重自动转换功能,详情可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html)。 | bool,可选 | 预训练/微调/推理 |
+| `--auto_trans_ckpt` | 是否开启在线权重自动转换功能,详情可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。 | bool,可选 | 预训练/微调/推理 |
| `--transform_process_num` | 负责权重转换的进程数。 | int,可选 | 预训练/微调/推理 |
| `--only_save_strategy` | 是否仅保存切分策略文件。 | bool,可选,为`true`时任务在保存策略文件后直接退出 | 预训练/微调/推理 |
@@ -42,7 +42,7 @@ MindSpore Transformers提供了一键启动脚本`run_mindformer.py`和分布式
| 参数 | 参数说明 | 取值说明 | 适用场景 |
|:-------------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------|---------|--------|
| `--train_dataset_dir` | 预训练/微调的数据集目录。 | str,可选 | 预训练/微调 |
-| `--resume_training` | 是否开启断点续训功能,详情可参考[断点续训功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/resume_training.html#%E6%96%AD%E7%82%B9%E7%BB%AD%E8%AE%AD)。 | bool,可选 | 预训练/微调 |
+| `--resume_training` | 是否开启断点续训功能,详情可参考[断点续训功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html#%E6%96%AD%E7%82%B9%E7%BB%AD%E8%AE%AD)。 | bool,可选 | 预训练/微调 |
| `--epochs` | 训练轮次。 | int,可选 | 预训练/微调 |
| `--gradient_accumulation_steps` | 梯度累积步数。 | int,可选 | 预训练/微调 |
| `--batch_size` | 批处理数据的样本数。 | int,可选 | 预训练/微调 |
diff --git a/docs/mindformers/docs/source_zh_cn/function/training_hyperparameters.md b/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/function/training_hyperparameters.md
rename to docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
index 00fb536df9b897573d959068eb6d29c4172d979d..d4297c37690d6344f1350040d345e38533f76cdb 100644
--- a/docs/mindformers/docs/source_zh_cn/function/training_hyperparameters.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
@@ -1,6 +1,6 @@
# 模型训练超参数配置
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/training_hyperparameters.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md)
超参数对模型的性能有着重要影响,不同的超参数设置可能导致模型表现的巨大差异。参数的选择会影响到模型的训练速度、收敛性、容量和泛化能力等方面。且它们并非通过训练数据直接学习得到的,而是由开发者根据经验、实验或调优过程来确定的。
diff --git a/docs/mindformers/docs/source_zh_cn/function/transform_weight.md b/docs/mindformers/docs/source_zh_cn/feature/transform_weight.md
similarity index 99%
rename from docs/mindformers/docs/source_zh_cn/function/transform_weight.md
rename to docs/mindformers/docs/source_zh_cn/feature/transform_weight.md
index 0081faf9b901485924d022ca01aed7fac4ddf7fc..cc03b0c4b4b7ba12725bb5b124234fe799b402db 100644
--- a/docs/mindformers/docs/source_zh_cn/function/transform_weight.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/transform_weight.md
@@ -1,6 +1,6 @@
# 分布式权重切分与合并
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/transform_weight.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/transform_weight.md)
## 概述
diff --git a/docs/mindformers/docs/source_zh_cn/function/weight_conversion.md b/docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md
similarity index 99%
rename from docs/mindformers/docs/source_zh_cn/function/weight_conversion.md
rename to docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md
index 13e5e81e3539df2263088e5d109f50031307b018..19eacbc26074a5fbeae250922c06fb3325aa6feb 100644
--- a/docs/mindformers/docs/source_zh_cn/function/weight_conversion.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md
@@ -1,6 +1,6 @@
# 权重格式转换
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/weight_conversion.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md)
## 概述
diff --git a/docs/mindformers/docs/source_zh_cn/full-process_1.png b/docs/mindformers/docs/source_zh_cn/full-process_1.png
index dbb6a24333a105f779396fc342b049c72938e5c8..27e14e5bb14815a03be6ab6fffe290dd995a8c5f 100644
Binary files a/docs/mindformers/docs/source_zh_cn/full-process_1.png and b/docs/mindformers/docs/source_zh_cn/full-process_1.png differ
diff --git a/docs/mindformers/docs/source_zh_cn/full-process_2.png b/docs/mindformers/docs/source_zh_cn/full-process_2.png
index 27e14e5bb14815a03be6ab6fffe290dd995a8c5f..f422ae1f15ee0285eb9d37da52f096835bd98f93 100644
Binary files a/docs/mindformers/docs/source_zh_cn/full-process_2.png and b/docs/mindformers/docs/source_zh_cn/full-process_2.png differ
diff --git a/docs/mindformers/docs/source_zh_cn/full-process_3.png b/docs/mindformers/docs/source_zh_cn/full-process_3.png
index f422ae1f15ee0285eb9d37da52f096835bd98f93..4356392871de33da27839693b25238e103097f64 100644
Binary files a/docs/mindformers/docs/source_zh_cn/full-process_3.png and b/docs/mindformers/docs/source_zh_cn/full-process_3.png differ
diff --git a/docs/mindformers/docs/source_zh_cn/full-process_4.png b/docs/mindformers/docs/source_zh_cn/full-process_4.png
deleted file mode 100644
index d438149f0718f823a8da83b7fec5f679281b2b8c..0000000000000000000000000000000000000000
Binary files a/docs/mindformers/docs/source_zh_cn/full-process_4.png and /dev/null differ
diff --git a/docs/mindformers/docs/source_zh_cn/full-process_5.png b/docs/mindformers/docs/source_zh_cn/full-process_5.png
deleted file mode 100644
index 4356392871de33da27839693b25238e103097f64..0000000000000000000000000000000000000000
Binary files a/docs/mindformers/docs/source_zh_cn/full-process_5.png and /dev/null differ
diff --git a/docs/mindformers/docs/source_zh_cn/function/other_features.md b/docs/mindformers/docs/source_zh_cn/function/other_features.md
deleted file mode 100644
index 63dac73faf5852d244ae849862a59070c287fae1..0000000000000000000000000000000000000000
--- a/docs/mindformers/docs/source_zh_cn/function/other_features.md
+++ /dev/null
@@ -1,141 +0,0 @@
-# 其它特性
-
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/function/other_features.md)
-
-在大规模的深度学习模型训练中,会遇到诸如:内存限制、计算资源的有效利用、分布式训练中的同步问题等挑战,需要使用训练优化算法来提高训练效率、加速收敛速度以及改善最终模型性能。
-
-MindSpore TransFormer 提供了重计算、梯度累积、梯度裁剪等训练优化算法,可供开发者进行训练时使用。
-
-## 重计算
-
-### 概述
-
-重计算可以显著降低训练时的激活内存,但会额外增加一些计算。关于重计算的原理和框架测能力可参考 [MindSpore 教程文档:重计算](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/recompute.html)。
-
-### 配置与使用
-
-#### YAML 参数配置
-
-用户可通过在模型训练的 yaml 配置文件中新增 `recompute_config` 模块来使用重计算。
-
-以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L113) 为例,可做如下配置:
-
-```yaml
-# recompute config
-recompute_config:
- recompute: [3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 2, 0]
- select_recompute: False
- parallel_optimizer_comm_recompute: True
- mp_comm_recompute: True
- recompute_slice_activation: True
-```
-
-如果需要对选择重计算配置到某几个特定层进行,可以使用 tuple 的方式进行配置。
-
-例如:一个网络有48层, `pp_interleave_num` 为 `2` , `pipeline_stage` 为 `5` ,offset设为 `[[0,1,1,1,1],[1,1,1,1,0]]` ,重计算配置如下:
-
-```yaml
-# recompute config
-recompute_config:
- recompute: [[2,1,0,0,0],[1,0,0,0,0]]
- select_recompute:
- 'feed_forward\.w1\.activation\.silu': True
- 'feed_forward\.mul': True
- 'feed_forward\.w1\.matmul': [[1,0,0,0,0],[2,1,0,0,0]]
- 'feed_forward\.w3\.matmul': [2,1,0,0,0]
- select_comm_recompute: ['ffn_norm\.norm','attention_norm\.norm']
-```
-
-在日志中会打印将输入格式规范化后的重计算策略信息:
-
-```log
-INFO - Formative layer_recompute: [[2, 1, 0, 0, 0], [1, 0, 0, 0, 0]]
-INFO - Formative select_recompute: {'feed_forward\.w1\.activation\.silu': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]], 'feed_forward\.mul': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]], 'feed_forward\.w1\.matmul': [[1, 0, 0, 0, 0], [2, 1, 0, 0, 0]], 'feed_forward\.w3\.matmul': [[1, 1, 0, 0, 0], [1, 0, 0, 0, 0]]}
-INFO - Formative select_comm_recompute: {'ffn_norm\.norm': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]], 'attention_norm\.norm': [[4, 5, 5, 5, 5], [5, 5, 5, 5, 4]]}
-```
-
-随后会打印每一层重计算的配置方式。
-
-> 1. 如果某一层同时配置了完全重计算与选择重计算,则按完全重计算生效。
-> 2. 在一维整数型 list 或 tuple 中的整数可以替换为 True 或 False,代表对所有层启用或关闭重计算。
-
-#### 主要配置参数介绍
-
-有关重计算配置的主要参数如下表所列:
-
-| Parameter                         | Description                                                                                          | Value Description |
-|-----------------------------------|------------------------------------------------------------------------------------------------------|-------------------|
-| recompute                         | (Per-layer) full recomputation.                                                                        | Can be configured as a bool, an integer list or tuple, or a two-dimensional list or tuple. When set to a bool, full recomputation is enabled or disabled for all layers. When set to an integer list or tuple, it specifies how many layers in each `pipeline_stage` enable full recomputation; when `pp_interleave_num > 1`, the enabled recomputation layers are evenly distributed across the interleaves. When set to a two-dimensional integer list or tuple, it specifies how many layers in each mini stage enable full recomputation. |
-| select_recompute                  | (Per-operator) selective recomputation.                                                                | Can be configured as a bool, an integer list or tuple, a two-dimensional list or tuple, a string list or tuple, or a dict. The default operators for selective recomputation are `['feed_forward\\.mul', 'feed_forward\\.w1\\.activation\\.silu']`. When set to a bool, selective recomputation of the default operators is enabled or disabled for all layers. When set to an integer list or tuple, it specifies how many layers in each `pipeline_stage` enable selective recomputation of the default operators; when `pp_interleave_num > 1`, the enabled layers are evenly distributed across the interleaves. When set to a two-dimensional integer list or tuple, it specifies how many layers in each mini stage enable selective recomputation of the default operators. When set to a string list or tuple, it specifies the operators for which selective recomputation is enabled; operator names are matched by regular expression, with hierarchy levels separated by `'\\.'`. When set to a dict, the keys are operator names and the values are the selective recomputation configuration, which allows a fine-grained recomputation strategy for each operator. |
-| select_comm_recompute             | (Per-operator) selective recomputation of communication.                                               | Configured in the same way as **select_recompute**. The default communication operators for selective recomputation are `['.*\\.norm']`. Generally configured only for layer_norm or similar layers. |
-| parallel_optimizer_comm_recompute | Optimizer-parallel communication recomputation. Whether to recompute the AllGather communication under optimizer parallelism. | (bool, optional) - When enabled, specifies whether the AllGather communication introduced by optimizer parallelism inside the Cell is recomputed in auto-parallel or semi-auto-parallel mode. Default: `False`. |
-| mp_comm_recompute                 | Model-parallel communication recomputation. Whether to recompute the communication operators under model parallelism. | (bool, optional) - When enabled, specifies whether the communication operations introduced by model parallelism inside the Cell are recomputed in auto-parallel or semi-auto-parallel mode. Default: `True`. |
-| recompute_slice_activation        | Slice recomputation. Whether to slice the Cell outputs that will be kept in memory.                    | (bool, optional) - Default: `False`. |
-
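-As a minimal illustration of the bool-form configuration described in the table above (an illustrative sketch, not taken from any model configuration in the repository), full recomputation can be enabled for every layer as follows:
-
-```yaml
-# recompute config (illustrative sketch: bool form enables full recomputation for all layers)
-recompute_config:
-  recompute: True
-  select_recompute: False
-  recompute_slice_activation: False
-```
-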
-## Gradient Accumulation
-
-### Overview
-
-In versions later than 2.1.1, MindSpore provides the gradient accumulation interface `mindspore.nn.wrap.cell_wrapper.GradAccumulationCell`, which implements gradient accumulation by splitting MiniBatches. MindSpore Transformers encapsulates it into the unified training process and enables it through the YAML configuration. For the principle of gradient accumulation and the framework-side capability, see [MindSpore Documentation: Gradient Accumulation](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/distributed_gradient_accumulation.html).
-
-### Configuration and Usage
-
-#### YAML Parameter Configuration
-
-To enable gradient accumulation, users only need to set the `gradient_accumulation_steps` item under `runner_config` in the configuration file to the required number of gradient accumulation steps:
-
-```yaml
-# runner config
-runner_config:
- ...
- gradient_accumulation_steps: 4
- ...
-```
-
-#### Main Configuration Parameters
-
-| Parameter                   | Description                                                                    | Value Description               |
-|-----------------------------|--------------------------------------------------------------------------------|---------------------------------|
-| gradient_accumulation_steps | Number of steps to accumulate gradients before back propagation is performed.   | (int, required) - Default: `1`. |
-
-#### Other Ways to Use Gradient Accumulation
-
-In addition to the configuration file, when launching with the `run_mindformer.py` script, the `--gradient_accumulation_steps` argument can be specified to use gradient accumulation, as shown in the sketch below.
-
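-A minimal launch sketch follows; the configuration file path is a placeholder and should be replaced with the actual training YAML:
-
-```shell
-# Illustrative sketch: pass the gradient accumulation step count on the command line
-python run_mindformer.py --config path/to/train_config.yaml --run_mode train --gradient_accumulation_steps 4
-```
-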
-#### Usage Restrictions of Gradient Accumulation
-
-> Enabling gradient accumulation increases the memory overhead. Pay attention to memory management to prevent out-of-memory (OOM) errors.
-
-1. Since the implementation of `GradAccumulationCell` relies on parallel features, gradient accumulation is currently supported only in **semi-auto parallel mode**.
-2. In addition, in the pipeline parallel scenario, gradient accumulation has the same meaning as micro_batch and will not take effect; configure the `micro_batch_num` item to increase the training batch_size instead (see the sketch after this list).
-
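-For the pipeline parallel case in item 2, the effective accumulation is controlled by `micro_batch_num` instead. A minimal sketch, assuming the usual `parallel_config` section of the training YAML, looks like this:
-
-```yaml
-# parallel config (illustrative sketch: increase micro batches instead of using gradient accumulation)
-parallel_config:
-  ...
-  pipeline_stage: 2
-  micro_batch_num: 4
-  ...
-```
-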
-## Gradient Clipping
-
-### Overview
-
-The gradient clipping algorithm prevents excessively large backward gradients from causing the optimal solution to be skipped.
-
-### Configuration and Usage
-
-#### YAML Parameter Configuration
-
-In MindSpore Transformers, the default training process `MFTrainOneStepCell` integrates the gradient clipping logic.
-
-The following example can be used to enable gradient clipping:
-
-```yaml
-# wrapper cell config
-runner_wrapper:
- type: MFTrainOneStepCell
- ...
- use_clip_grad: True
- max_grad_norm: 1.0
- ...
-```
-
-#### Main Configuration Parameters
-
-| Parameter     | Description                                             | Value Description                    |
-|---------------|---------------------------------------------------------|--------------------------------------|
-| use_clip_grad | Whether gradient clipping is enabled during training.   | (bool, optional) - Default: `False`. |
-| max_grad_norm | Maximum norm value for gradient clipping.               | (float, optional) - Default: `1.0`.  |
diff --git a/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md b/docs/mindformers/docs/source_zh_cn/guide/deployment.md
similarity index 97%
rename from docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md
rename to docs/mindformers/docs/source_zh_cn/guide/deployment.md
index b6fe3c78832ee03636bba7570f475bc80ecbdc9c..4e381adaa7c42a42aff3d6ea77936478f5352880 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/deployment.md
@@ -1,6 +1,6 @@
# Service Deployment
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/mindie_deployment.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/guide/deployment.md)
## MindIE Introduction
@@ -8,7 +8,7 @@ MindIE, full name Mind Inference Engine, is Ascend hardware-based high-performance inference
MindSpore Transformers is carried in the model application layer MindIE LLM, and the large models in MindSpore Transformers can be deployed through MindIE Service.
-For the models supported by MindIE inference, see the [Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/start/models.html).
+For the models supported by MindIE inference, see the [Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html).
## Environment Setup
@@ -16,7 +16,7 @@ For the models supported by MindIE inference, see the [Model Library](https://www.mindspore.cn/mind
1. Install MindSpore Transformers
-   Install it by referring to the [MindSpore Transformers Official Installation Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/quick_start/install.html).
+   Install it by referring to the [MindSpore Transformers Official Installation Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/installation.html).
2. Install MindIE
@@ -86,9 +86,9 @@ processor:
merges_file: "/path/to/mf_model/qwen1_5_72b/merges.txt" # absolute path of the merges file
```
-For model weight download and conversion, see the [Weight Format Conversion Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html).
+For model weight download and conversion, see the [Weight Format Conversion Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html).
-The required files and configurations may differ between models. For details, see the inference section of the specific model in the [Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/start/models.html).
+The required files and configurations may differ between models. For details, see the inference section of the specific model in the [Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html).
### Starting MindIE
@@ -346,4 +346,4 @@ curl -w "\ntime_total=%{time_total}\n" -H "Accept: application/json" -H "Content
## Model List
-For MindIE inference examples of other models, see the introduction documents of each model in the [Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/start/models.html).
\ No newline at end of file
+For MindIE inference examples of other models, see the introduction documents of each model in the [Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html).
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/usage/inference.md b/docs/mindformers/docs/source_zh_cn/guide/inference.md
similarity index 97%
rename from docs/mindformers/docs/source_zh_cn/usage/inference.md
rename to docs/mindformers/docs/source_zh_cn/guide/inference.md
index da923bcf70a6a464c186f2fd0bc1ebb1a1881a2f..5b8bd72da11d131055c473a406d69ff549cf2b33 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/inference.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/inference.md
@@ -1,6 +1,6 @@
# Inference
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/inference.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/guide/inference.md)
## Overview
@@ -22,8 +22,8 @@ MindSpore Transformers provides large model inference capability. Users can execute `run_
Complete weights can be obtained in the following two ways:
-1. Download the open-source weights of the corresponding model from the HuggingFace model hub, and then convert them to the ckpt format by referring to [Weight Format Conversion](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/weight_conversion.html).
-2. Merge the distributed weights obtained after pre-training or fine-tuning into a complete weight by referring to [Merging](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html).
+1. Download the open-source weights of the corresponding model from the HuggingFace model hub, and then convert them to the ckpt format by referring to [Weight Format Conversion](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html).
+2. Merge the distributed weights obtained after pre-training or fine-tuning into a complete weight by referring to [Merging](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html).
#### 2.2 Distributed Weights
@@ -35,7 +35,7 @@ MindSpore Transformers provides large model inference capability. Users can execute `run_
2. Weights trained on 8 devices are used for inference on 2 devices;
3. Distributed weights that have already been sliced are used for inference on a single device, etc.
-The following command examples all use online automatic slicing. By setting the parameter `--auto_trans_ckpt` to `True` and `--src_strategy_path_or_dir` to the path of the weight slicing strategy file or directory (saved under `./output/strategy` by default after pre-training or fine-tuning), the slicing is completed automatically in the inference task. For more usage, see [Merging and Slicing of Distributed Weights](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html).
+The following command examples all use online automatic slicing. By setting the parameter `--auto_trans_ckpt` to `True` and `--src_strategy_path_or_dir` to the path of the weight slicing strategy file or directory (saved under `./output/strategy` by default after pre-training or fine-tuning), the slicing is completed automatically in the inference task. For more usage, see [Merging and Slicing of Distributed Weights](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html).
> Since both training and inference tasks use `./output` as the default output path, when the strategy file output by a training task is used as the source weight strategy file of an inference task, the strategy file directory under the default output path needs to be moved to another location to avoid being cleared by the inference task's process, for example:
>
@@ -358,4 +358,4 @@ Thanks, sir.
## More Information
-For more inference examples of different models, visit the [MindSpore Transformers Supported Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/start/models.html).
\ No newline at end of file
+For more inference examples of different models, visit the [MindSpore Transformers Supported Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html).
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/usage/pre_training.md b/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
similarity index 96%
rename from docs/mindformers/docs/source_zh_cn/usage/pre_training.md
rename to docs/mindformers/docs/source_zh_cn/guide/pre_training.md
index 6fc6dafb39eb9a65454a4ecd92f6be4407b21205..21cd0945fd565eca57c7152effe576618795a8a8 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/pre_training.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
@@ -1,6 +1,6 @@
# Pre-training
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/pre_training.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/guide/pre_training.md)
## Overview
@@ -82,8 +82,8 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
run_mode: running mode; train: training, finetune: fine-tuning, predict: inference
```
-**Note**: During multi-node distributed training, some performance issues may be encountered. To ensure the efficiency and stability of the training process, it is recommended to perform the necessary performance optimization and tuning by referring to the [Large Model Performance Tuning Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/perf_optimize/perf_optimize.html).
+**Note**: During multi-node distributed training, some performance issues may be encountered. To ensure the efficiency and stability of the training process, it is recommended to perform the necessary performance optimization and tuning by referring to the [Large Model Performance Tuning Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/advanced_development/performance_optimization.html).
## More Information
-For more training examples of different models, visit the [MindSpore Transformers Supported Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/start/models.html).
\ No newline at end of file
+For more training examples of different models, visit the [MindSpore Transformers Supported Model Library](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html).
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md b/docs/mindformers/docs/source_zh_cn/guide/sft_tuning.md
similarity index 98%
rename from docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md
rename to docs/mindformers/docs/source_zh_cn/guide/sft_tuning.md
index 6ce5dc27a4ecfbd8db718730a0754e18b401647e..5dcdb51211aa016f0a54c644f9aff8d4a1344272 100644
--- a/docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/sft_tuning.md
@@ -1,6 +1,6 @@
# SFT Fine-Tuning
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/guide/supervised_fine_tuning.md)
## Overview
@@ -179,7 +179,7 @@ run_mode: running mode; train: training, finetune: fine-tuning, predict
#### Multi-Node Training
-Multi-node multi-device fine-tuning tasks are similar to launching pre-training. You can refer to the [multi-node multi-device pre-training command](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/usage/pre_training.html#%E5%A4%9A%E6%9C%BA%E8%AE%AD%E7%BB%83) and modify the command as follows:
+Multi-node multi-device fine-tuning tasks are similar to launching pre-training. You can refer to the [multi-node multi-device pre-training command](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/pre_training.html#%E5%A4%9A%E6%9C%BA%E8%AE%AD%E7%BB%83) and modify the command as follows:
1. Add the startup script argument `--load_checkpoint /{path}/llama2_7b.ckpt` to load the pre-trained weights.
2. Set `--train_dataset_dir /{path}/alpaca-fastchat4096.mindrecord` in the startup script to load the fine-tuning dataset.
@@ -243,7 +243,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
--run_mode finetune" 8
```
-When the distributed strategy of the weights is inconsistent with the distributed strategy of the model, the weights need to be sliced and converted. The weight loading path should be set to the directory one level above the directory named `rank_0`, and automatic weight slicing and conversion should be enabled with `--auto_trans_ckpt True`. For more information about the scenarios and usage of distributed weight slicing and conversion, see [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html).
+When the distributed strategy of the weights is inconsistent with the distributed strategy of the model, the weights need to be sliced and converted. The weight loading path should be set to the directory one level above the directory named `rank_0`, and automatic weight slicing and conversion should be enabled with `--auto_trans_ckpt True`. For more information about the scenarios and usage of distributed weight slicing and conversion, see [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html).
```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
diff --git a/docs/mindformers/docs/source_zh_cn/index.rst b/docs/mindformers/docs/source_zh_cn/index.rst
index 463d5dff8a34c8dda574949584cfa9aa2818f37b..75bbb4a5f3136d0f41c2909410c9bed17b09af5c 100644
--- a/docs/mindformers/docs/source_zh_cn/index.rst
+++ b/docs/mindformers/docs/source_zh_cn/index.rst
@@ -1,13 +1,24 @@
MindSpore Transformers Documentation
=========================================
-MindSpore Transformers (also known as MindFormers) is a MindSpore-native large model suite that aims to provide full-process development capabilities for large model training, fine-tuning, evaluation, inference, and deployment. It provides the industry's mainstream Transformer-based pre-trained models and SOTA downstream task applications, covers rich parallel features, and is intended to help users easily implement large model training and innovative R&D.
+The goal of the MindSpore Transformers suite is to build a full-process development suite for large model pre-training, fine-tuning, inference, and deployment, providing the industry's mainstream Transformer-based large language models (LLMs) and multimodal understanding models (MMs). It is intended to help users easily implement full-process large model development.
-Users can refer to `Overall Architecture `_ and `Model Library `_ to quickly understand the system architecture of MindSpore Transformers, as well as the supported features and list of large models. Further, the `Installation `_ and `Quick Start `_ sections can be consulted to get started with MindSpore Transformers.
+Based on MindSpore's built-in multi-dimensional hybrid parallelism and componentized design, the MindSpore Transformers suite has the following features:
+
+- One-click launch of single-device or multi-device pre-training, fine-tuning, inference, and deployment processes;
+- Rich multi-dimensional hybrid parallel capabilities for flexible and easy-to-use personalized configuration;
+- System-level deep optimization for large model training and inference, with native support for efficient training and inference on ultra-large-scale clusters and fast fault recovery;
+- Configuration-based development of task components, where any module, including the model network, optimizer, and learning rate schedule, can be enabled through a unified configuration;
+- Real-time visualization of training accuracy/performance monitoring metrics, and more.
+
+Users can refer to `Overall Architecture `_ and `Model Library `_ to quickly understand the system architecture of MindSpore Transformers and the list of supported large models.
If you have any suggestions for MindSpore Transformers, please contact us via `issue `_ and we will address them promptly.
-MindSpore Transformers supports one-click launching of single-device/multi-device training, fine-tuning, evaluation, and inference processes for any task. By simplifying operations, providing flexibility, and automating processes, it makes the execution of deep learning tasks more efficient and user-friendly. Users can learn from the following documents:
+Full-Process Large Model Development with MindSpore Transformers
+-----------------------------------------------------------------
+
+MindSpore Transformers provides a unified one-click launch script that supports launching single-device/multi-device training, fine-tuning, and inference processes for any task with one click. By simplifying operations, providing flexibility, and automating processes, it makes the execution of deep learning tasks more efficient and user-friendly. Users can learn from the following documents:
.. raw:: html
@@ -22,40 +33,22 @@ MindSpore Transformers supports one-click launching of single-device/multi-device training and fine-