diff --git a/docs/mindformers/docs/source_en/faq/feature_related.md b/docs/mindformers/docs/source_en/faq/feature_related.md
index cdf536564efa1325434f4d3efddcc9e004e42d5e..0ac4f0672630c420faa4371b10258b94487ac96b 100644
--- a/docs/mindformers/docs/source_en/faq/feature_related.md
+++ b/docs/mindformers/docs/source_en/faq/feature_related.md
@@ -10,7 +10,7 @@ A: The official download link is not available, please follow the community Issu
## Q: How Do I Generate a Model Sharding Strategy File?
-A: The model sharding strategy file documents the sharding strategy for model weights in distributed scenarios and is generally used when slicing weights offline. Configure `only_save_strategy: True` in the network `yaml` file, and then start the distributed task normally, then the distributed strategy file can be generated in the `output/strategy/` directory. For details, please refer to the [Tutorial on Slicing and Merging Distributed Weights](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html).
+A: The model sharding strategy file records the sharding strategy of model weights in distributed scenarios and is generally used when slicing weights offline. Configure `only_save_strategy: True` in the network `yaml` file and start the distributed task normally; the distributed strategy files are then generated in the `output/strategy/` directory. For details, please refer to the [Tutorial on Slicing and Merging Distributed Weights](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html).
diff --git a/docs/mindformers/docs/source_en/feature/transform_weight.md b/docs/mindformers/docs/source_en/feature/ckpt.md
similarity index 75%
rename from docs/mindformers/docs/source_en/feature/transform_weight.md
rename to docs/mindformers/docs/source_en/feature/ckpt.md
index e075dba564d59270177752df70299a095f671d49..5729e3b3d8dfff3554030859f0410b5b42aefddb 100644
--- a/docs/mindformers/docs/source_en/feature/transform_weight.md
+++ b/docs/mindformers/docs/source_en/feature/ckpt.md
@@ -1,18 +1,149 @@
-# Distributed Weight Slicing and Merging
+# Ckpt Weights
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/transform_weight.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/ckpt.md)
## Overview
+Ckpt is a common file format used in deep learning frameworks to save model training state. It contains model parameters, optimizer state, and training progress, and is mainly used to resume training or to fine-tune a model. This document describes how MindSpore Transformers supports converting, slicing, and merging weights in this format.
+
+> The ckpt format is planned to be phased out, and the safetensors format is recommended for weights. Safetensors is a reliable and portable machine learning model storage format from Huggingface for storing tensors securely, with fast (zero-copy) storage. For details, see [Safetensors Weights](https://www.mindspore.cn/mindformers/docs/en/dev/feature/safetensors.html).
+
+## Weight Format Conversion
+
+### Overview
+
+MindSpore Transformers provides a unified weight conversion tool that converts model weights between the HuggingFace format and the MindSpore Transformers format. This helps you:
+
+- Convert a HuggingFace weight to a MindSpore Transformers one for fine-tuning, evaluation, or inference on MindSpore Transformers.
+- Convert weights trained or fine-tuned with MindSpore Transformers to HuggingFace weights and use them in other frameworks.
+
+### Conversion Procedure
+
+To perform weight conversion, clone the complete HuggingFace repository of the model to be converted locally, and execute the `mindformers/convert_weight.py` script. This script automatically converts the HuggingFace model weight file into a weight file applicable to MindSpore Transformers. If you want to convert a MindSpore Transformers weight to a HuggingFace one, set `reversed` to `True`.
+
+```shell
+python convert_weight.py [-h] --model MODEL [--reversed] --input_path INPUT_PATH --output_path OUTPUT_PATH [--dtype DTYPE] [--n_head N_HEAD] [--hidden_size HIDDEN_SIZE] [--layers LAYERS] [--is_pretrain IS_PRETRAIN] [--telechat_type TELECHAT_TYPE]
+```
+
+#### Parameters
+
+- model: model name.
+- reversed: converts a MindSpore Transformers weight to the HuggingFace one.
+- input_path: path of the HuggingFace weight folder, which points to the downloaded weight file.
+- output_path: path for storing the MindSpore Transformers weight file after conversion.
+- dtype: weight data type after conversion.
+- n_head: takes effect only for the BLOOM model. Set this parameter to `16` when `bloom_560m` is used and to `32` when `bloom_7.1b` is used.
+- hidden_size: takes effect only for the BLOOM model. Set this parameter to `1024` when `bloom_560m` is used and to `4096` when `bloom_7.1b` is used.
+- layers: number of layers to be converted. This parameter takes effect only for the GPT2 and WizardCoder models (see the example after this list).
+- is_pretrain: converts the pre-trained weight. This parameter takes effect only for the Swin model.
+- telechat_type: version of the TeleChat model. This parameter takes effect only for the TeleChat model.
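+
+The model-specific parameters above apply only to the models they name. As a hedged illustration of `layers` (the input and output paths are placeholders), the following converts only the first 12 layers of a GPT2 checkpoint:
+
+```shell
+# --layers takes effect only for GPT2 and WizardCoder
+python convert_weight.py --model gpt --layers 12 \
+  --input_path /path/to/torch_weights --output_path /path/to/ms_weights/gpt2.ckpt
+```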
+
+### Conversion Example
+
+Assume that you have downloaded the [Llama2 model weight](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD) and saved it in the `/home/user/torch_weights` path. To convert it to the MindSpore Transformers weight and save it in the `/home/user/ms_weights` path, run the following command:
+
+```bash
+python convert_weight.py --model llama --input_path /home/user/torch_weights --output_path /home/user/ms_weights/llama.ckpt
+```
+
+After the preceding steps are performed, the HuggingFace weight is successfully converted to a MindSpore Transformers weight, facilitating model training or inference on MindSpore Transformers.
+
+### Supported Models
+
+| Parameter Value | Supported models |
+|-----------|---------------------------------------------|
+| llama | Llama2, Llama3, Llama3.1, CodeLlama |
+| baichuan2 | Baichuan2 |
+| glm-n | GLM2, GLM3, GLM3-32K, GLM4 |
+| cogvlm2 | CogVLM2-Video, CogVLM2-Image |
+| qwen | Qwen, Qwen1.5, Qwen2 |
+| qwenvl | QwenVL |
+| internlm | InternLM |
+| internlm2 | InternLM2 |
+| yi | Yi |
+| mixtral | Mixtral |
+| deepseek | DeepSeekCoder, DeepSeekCoder1.5, DeepSeekV2 |
+| gpt | GPT2 |
+| whisper | Whisper |
+
+### Developing Weight Conversion for Unsupported Models
+
+1. Add the `convert_weight.py` and `convert_reversed.py` files to the extended model directory.
+2. Write the `convert_pt_to_ms` and `convert_ms_to_pt` weight conversion functions in those files. The function parameters are `input_path`, `output_path`, `dtype`, and the extra keyword arguments `**kwargs`.
+3. Add the extended model name and conversion function import paths to the `convert_map` and `reversed_convert_map` dictionaries in the `convert_weight.py` file in the MindSpore Transformers code root directory.
+4. Call the `parser.add_argument()` method in the `main` function to add the additional parameter.
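+
+Once registered, the extended model is converted through the same unified entry script as the built-in models. A minimal sketch, assuming the new converter was registered under the hypothetical name `mymodel`:
+
+```shell
+# dispatches to the convert_pt_to_ms registered for "mymodel" in convert_map
+python convert_weight.py --model mymodel --input_path /path/to/hf_weights --output_path /path/to/ms_weights/mymodel.ckpt
+```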
+
+### Example of Developing Model Weight Conversion
+
+Llama is used as an example. To convert a HuggingFace weight to a MindSpore Transformers one, define the `convert_pt_to_ms` function in [convert_weight.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_weight.py).
+
+```python
+# requires: import os; import mindspore as ms; name_replace and pt2ms are helpers defined in this file
+def convert_pt_to_ms(input_path, output_path, dtype=None, **kwargs):
+    """convert hf weight to ms."""
+    print(f"Trying to convert huggingface checkpoint in '{input_path}'.", flush=True)
+    try:
+        from transformers import LlamaForCausalLM
+    except ImportError as e:
+        raise ImportError("Failed to load huggingface checkpoint. Please make sure transformers is available.") from e
+
+    try:
+        model_hf = LlamaForCausalLM.from_pretrained(os.path.dirname(input_path))
+    except Exception as e:
+        print(f"Could not find huggingface checkpoint in '{os.path.dirname(input_path)}', error: {e}.", flush=True)
+        return False
+    ckpt_list = []
+    for name, value in model_hf.state_dict().items():
+        # map HuggingFace parameter names to MindSpore Transformers names
+        name = name_replace(name)
+        if name == 'norm.weight':
+            name = 'norm_out.weight'
+        if name[:7] == 'layers.':
+            name = name[7:]
+
+        print(f'\rprocessing parameter: {name} {value.shape}     ', end='', flush=True)
+        ckpt_list.append({'name': name, 'data': pt2ms(value, dtype)})
+
+    ms.save_checkpoint(ckpt_list, output_path)
+    print(f"\rConvert huggingface checkpoint finished, the mindspore checkpoint is saved in '{output_path}'.",
+          flush=True)
+    return True
+```
+
+To convert a MindSpore Transformers weight to a HuggingFace one, define the `convert_ms_to_pt` function in [convert_reversed.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_reversed.py).
+
+```python
+# requires: import torch; import mindspore as ms; name_replace, is_lora_param, and ms2pt are helpers defined in this file
+def convert_ms_to_pt(input_path, output_path, dtype=None, **kwargs):
+    """convert ms weight to hf."""
+    print(f"Trying to convert mindspore checkpoint in '{input_path}'.", flush=True)
+    model_ms = ms.load_checkpoint(input_path)
+
+    state_dict = {}
+    for name, value in model_ms.items():
+        # map MindSpore Transformers parameter names back to HuggingFace names
+        name = name_replace(name)
+        print(f'\rprocessing parameter: {name} {value.shape}     ', end='', flush=True)
+        if is_lora_param(name):
+            name = name.replace('.tk_delta_lora_a', '.lora_A.weight')
+            name = name.replace('.tk_delta_lora_b', '.lora_B.weight')
+        state_dict[name] = ms2pt(value, dtype)
+
+    torch.save(state_dict, output_path)
+    print(f"\rConvert mindspore checkpoint finished, the huggingface checkpoint is saved in '{output_path}'.",
+          flush=True)
+    return True
+```
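+
+With both functions in place, the reverse direction is driven by the same entry script via the `--reversed` flag. A hedged example that exports the Llama weight converted earlier back to a HuggingFace-style checkpoint (the output file name is a placeholder):
+
+```shell
+# calls convert_ms_to_pt through reversed_convert_map
+python convert_weight.py --model llama --reversed \
+  --input_path /home/user/ms_weights/llama.ckpt --output_path /home/user/torch_weights/llama_hf.bin
+```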
+
+## Distributed Weight Slicing and Merging
+
+### Overview
+
In a current distributed training and inference environment, if a pre-trained weight does not match a distributed strategy, the pre-trained weight needs to be converted to adapt to the corresponding distributed strategy. MindSpore Transformers provides a set of weight conversion tools to meet the requirements in different scenarios. This tool can be used to slice a single-device weight into multi-device weights, convert between multi-device weights, and merge multi-device weights into a single-device weight. You can select [Automatic Conversion](#automatic-conversion) or [Offline Conversion](#offline-conversion) as required so that a model can quickly switch between different distributed scenarios.
In addition, MindSpore Transformers supports [LoRA Weight Merging](#lora-weight-merging) to facilitate the deployment of models fine-tuned using LoRA.
-## Automatic Conversion
+### Automatic Conversion
When a model loads a weight, it automatically checks whether the weight matches the distributed slicing strategy of the current model. If they do not match, the weight is automatically converted.
-### Parameters
+#### Parameters
Parameters in the `yaml` file related to **automatic weight conversion** are described as follows:
@@ -24,9 +155,9 @@ Parameters in the `yaml` file related to **automatic weight conversion** are des
| transform_process_num | Number of processes used for automatic weight conversion. The default value is 1.<br>- If transform_process_num is set to 1, only rank_0 performs the conversion, and the other processes wait until it ends.<br>- If transform_process_num is larger than 1, **multiple processes perform the conversion**. For example, for an 8-device task with transform_process_num set to 2, rank_0 converts the weights of slices rank_0, rank_1, rank_2, and rank_3, rank_4 converts the weights of slices rank_4, rank_5, rank_6, and rank_7, and the other processes wait until rank_0 and rank_4 finish.<br>**Note**:<br>1. A larger value of transform_process_num means a shorter conversion time and **more host memory occupied by the conversion**. If the host memory is insufficient, decrease the value of transform_process_num.<br>2. The value of transform_process_num must exactly divide the number of NPUs and cannot exceed it. |
| transform_by_rank | Specifies whether to use the mindspore.transform_checkpoint_by_rank API for weight conversion.<br>- If transform_process_num is larger than 1, the value is automatically set to `True`.<br>- If transform_process_num is set to 1 and the target weight is a distributed weight, the mindspore.transform_checkpoint_by_rank API is called cyclically to convert each rank slice in serial mode.<br>- If transform_process_num is set to 1 and the target weight is a complete weight, the value is automatically set to `False`, and the mindspore.transform_checkpoints API is called for the conversion. |
-### YAML Configurations in Different Scenarios
+#### YAML Configurations in Different Scenarios
-#### Slicing a Single-Device Weight into Multi-Device Weights
+**Slicing a Single-Device Weight into Multi-Device Weights**
```yaml
# load_checkpoint: specifies path of the pre-trained weight file.
@@ -36,7 +167,7 @@ load_checkpoint: "/worker/llama3_8b/llama3_8b.ckpt"
auto_trans_ckpt: True
```
-#### Conversion Between Multi-Device Weights
+**Conversion Between Multi-Device Weights**
```yaml
# load_checkpoint: specifies the path of the multi-device weight folder.
@@ -49,7 +180,7 @@ src_strategy_path_or_dir: "/worker/checkpoint/llama3-8b-2layer-dp2mp2pp2/strateg
auto_trans_ckpt: True
```
-#### Merging Multi-Device Weights into a Single-Device Weight
+**Merging Multi-Device Weights into a Single-Device Weight**
```yaml
# load_checkpoint: specifies the path of the multi-device weight folder.
@@ -65,14 +196,14 @@ auto_trans_ckpt: True
use_parallel: False
```
-#### Enabling Multi-Process Conversion (Optional)
+**Enabling Multi-Process Conversion (Optional)**
```yaml
# transform_process_num: specifies the number of processes involved in the conversion.
transform_process_num: 2
```
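
The same switches are also available as command-line options of `run_mindformer.py`. A sketch for an 8-device task, assuming a placeholder config path:

```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/llama3/predict_llama3_8b.yaml \
 --auto_trans_ckpt True \
 --transform_process_num 2" 8
```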
-### Precautions
+#### Precautions
- **Multi-process conversion**: Set the `transform_process_num` parameter to enable multi-process conversion. Pay attention to the memory usage. If a memory overflow occurs, you are advised to reduce the number of processes.
@@ -80,13 +211,13 @@ transform_process_num: 2
- **Distributed strategy file saving**: The distributed strategy file is saved in the `output/strategy` folder. If **pipeline parallelism** is enabled, the system automatically merges all `ckpt_strategy_rank_x.ckpt` files into a single `merged_ckpt_strategy.ckpt` file. If pipeline parallelism is not enabled, no merging is performed.
-## Offline Conversion
+### Offline Conversion
The offline conversion function is designed to meet your requirements for manually converting weights. With offline conversion, you can convert model weights in an independent environment. Offline conversion supports multiple weight conversion scenarios, including slicing a single-device weight into multi-device weights, converting between multi-device weights, and merging multi-device weights into a single-device weight.
When using offline conversion, you can manually configure conversion parameters as required to ensure that the conversion process is flexible and controllable. This function is especially suitable for model deployment and optimization in a strictly controlled computing environment.
-### Parameters
+#### Parameters
Parameters in the `yaml` file related to **offline weight conversion** are described as follows:
@@ -100,15 +231,15 @@ Parameters in the `yaml` file related to **offline weight conversion** are descr
| world_size | Total number of slices of the target weight. Generally, the value is dp \* mp \* pp. |
| process_num | Number of processes used for offline weight conversion. The default value is 1.<br>- If process_num is set to 1, **a single process performs the conversion**.<br>- If process_num is larger than 1, **multi-process conversion** is used. For example, if the target weight is a distributed weight for eight NPUs and process_num is set to 2, two processes are started to convert the weights of slices rank_0, rank_1, rank_2, and rank_3 and of slices rank_4, rank_5, rank_6, and rank_7, respectively. |
-### Offline Conversion Configuration
+#### Offline Conversion Configuration
-#### Generating Distributed Strategy
+**Generating Distributed Strategy**
MindSpore generates distributed strategy files (ckpt format) corresponding to the number of devices in the `output/strategy` folder after running a distributed task; these files can be used for offline weight conversion.

If there is no distributed strategy file yet, it can be generated quickly by setting `only_save_strategy: True` in the yaml configuration file of the original distributed training/inference task. With this setting, the task stops immediately after generating the distributed strategy files, without actually running training or inference.
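
For example, the strategy files can be produced with the `--only_save_strategy` option of `run_mindformer.py`; the task exits right after the strategy files are written. A sketch for an 8-device task, assuming a placeholder config path:

```shell
# writes output/strategy/ckpt_strategy_rank_x.ckpt files and exits without training
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/llama3/finetune_llama3_8b.yaml \
 --only_save_strategy True" 8
```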
-#### Single-Process Conversion
+**Single-Process Conversion**
Use [mindformers/tools/ckpt_transform/transform_checkpoint.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/ckpt_transform/transform_checkpoint.py) to perform single-process conversion on the loaded weight.
@@ -121,7 +252,7 @@ python transform_checkpoint.py \
--dst_strategy /worker/mindformers/output/strategy/
```
-#### Multi-Process Conversion
+**Multi-Process Conversion**
Use [mindformers/tools/ckpt_transform/transform_checkpoint.sh](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/ckpt_transform/transform_checkpoint.sh) to perform multi-process conversion on the loaded weight.
@@ -140,13 +271,13 @@ bash transform_checkpoint.sh \
- When the [transform_checkpoint.sh](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/ckpt_transform/transform_checkpoint.sh) script is used, `8` indicates the number of target devices, and `2` indicates that two processes are used for conversion.
-## Special Scenarios
+### Special Scenarios
-### Multi-Node Multi-Device Training on Physical Machines
+#### Multi-Node Multi-Device Training on Physical Machines
Training a large-scale model usually requires a cluster of servers. In the multi-node multi-device scenario, if there is a shared disk between servers, the automatic conversion function can be used; otherwise, only offline conversion can be used. The following example describes a training task that uses two servers with 16 NPUs in total.
-#### Scenario 1: A shared disk exists between servers.
+**Scenario 1: A shared disk exists between servers.**
If there is a shared disk between servers, you can use MindSpore Transformers to automatically convert a weight before multi-node multi-device training. Assume that `/data` is the shared disk between the servers and the MindSpore Transformers project code is stored in the `/data/mindformers` directory.
@@ -209,7 +340,7 @@ If there is a shared disk between servers, you can use MindSpore Transformers to
16 8 ${ip} ${port} 1 output/msrun_log False 300
```
-#### Scenario 2: No shared disk exists between servers.
+**Scenario 2: No shared disk exists between servers.**
If there is no shared disk between servers, you need to use the offline weight conversion tool to convert the weight. The following steps describe how to perform offline weight conversion and start a multi-node multi-device training task.
@@ -282,15 +413,15 @@ If there is no shared disk between servers, you need to use the offline weight c
only_save_strategy: False
```
-### ModelArts Training
+#### ModelArts Training
Training in ModelArts is similar to multi-node multi-device training on physical machines. Automatic weight conversion can also be enabled. You can set `auto_trans_ckpt=True` in the hyperparameters of a training task to enable automatic weight conversion and set `transform_process_num > 1` to enable multi-process conversion.
**Note**: If the number of NPUs on the server node in the ModelArts resource pool is not 8, you need to set `npu_num_per_node = the number of NPUs on the node`. For example, if each node is configured with 16 NPUs, `npu_num_per_node=16` should be set.
-## LoRA Weight Merging
+### LoRA Weight Merging
-### Overview
+#### Overview
The basic principle of low-rank adaptation (LoRA) is to parameterize the original model with low-rank weights. The core process of merging LoRA weights is to calculate the parameters of the LoRA branches and add them to the corresponding model parameters, which makes the parameter list of the final weight file the same as that of the original model and excludes additional LoRA parameters. This operation does not affect the inference result. Therefore, the model after merging still has the same performance as the original model during inference.
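
In notation, for a base weight $W$ and LoRA factors $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r$ and scaling factor $\alpha$ (the standard formulation from the paper below), merging computes:

$$W' = W + \frac{\alpha}{r} B A$$

The factor $\alpha / r$ is exactly the combination coefficient `lora_scaling = lora_alpha/lora_rank` used by the merging script.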
For details about the principles and implementation of LoRA, see the following resources:
@@ -298,7 +429,7 @@ For details about the principles and implementation of LoRA, see the following r
- Paper: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- GitHub: [https://github.com/microsoft/LoRA](https://github.com/microsoft/LoRA)
-### Instructions
+#### Instructions
Use the [LoRA weight merging script](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/transform_ckpt_lora.py) provided by MindSpore Transformers to merge LoRA weights as follows:
@@ -311,7 +442,7 @@ python mindformers/tools/transform_ckpt_lora.py \
--lora_scaling lora_alpha/lora_rank
```
-#### Parameters
+**Parameters**
- **src_ckpt_strategy**: specifies the path of the distributed strategy file corresponding to the source weight. The file is stored in the `output/strategy/` directory by default after the training task is started. If the source is a complete set of weights, you do not need to set this parameter. If the source contains distributed weights, set this parameter based on the following conditions:
- **Pipeline parallelism enabled for the source weights**: Weight conversion is based on the merging strategy file. Set the parameter to the path of the distributed strategy folder. The script automatically merges all `ckpt_strategy_rank_x.ckpt` files in the folder into `merged_ckpt_strategy.ckpt` in the folder. If `merged_ckpt_strategy.ckpt` already exists, set the parameter to the path of the file.
@@ -323,9 +454,9 @@ python mindformers/tools/transform_ckpt_lora.py \
- **prefix**: name prefix of the target weight file. The default value is "checkpoint_", indicating that the target weight is saved in the `model_dir/rank_x/checkpoint_x.ckpt` format.
- **lora_scaling**: combination coefficient of the LoRA weights. The default value is `lora_alpha/lora_rank`; these two values come from the LoRA model configuration, and the coefficient must be computed from them (for example, pass `--lora_scaling 2` when `lora_alpha=16` and `lora_rank=8`).
-### Examples
+#### Examples
-#### Scenario 1: There is a complete set of weights for LoRA parameters.
+**Scenario 1: There is a complete set of weights for LoRA parameters.**
If the weight file before merging is a complete one, you can set the parameters as follows (directly enter the path of the complete set of weights):
@@ -337,7 +468,7 @@ python mindformers/tools/transform_ckpt_lora.py \
--lora_scaling lora_alpha/lora_rank
```
-#### Scenario 2: There are distributed weights for LoRA parameters.
+**Scenario 2: There are distributed weights for LoRA parameters.**
If the weight file before merging contains distributed weights, you can set the parameters as follows (enter the path of the distributed weight folder and the path of the distributed strategy folder). The obtained weights are automatically merged into a complete weight file.
@@ -349,73 +480,3 @@ python mindformers/tools/transform_ckpt_lora.py \
--prefix "checkpoint_" \
--lora_scaling lora_alpha/lora_rank
```
-
-## Safetensors Weight Merging
-
-### Instructions
-
-Use the [safetensors weight merging script](https://gitee.com/mindspore/mindformers/blob/dev/toolkit/safetensors/unified_safetensors.py) provided by MindSpore Transformers to perform safetensors weight merging.
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --has_redundancy has_redundancy
-```
-
-#### Parameters
-
-- **src_strategy_dirs**: specifies the path of the distributed strategy file corresponding to the source weight. The file is stored in the `output/strategy/` directory by default after the training task is started. Set the distributed weight based on the following conditions:
- - **Pipeline parallelism enabled for the source weights**: Weight conversion is based on the merging strategy file. Set the parameter to the path of the distributed strategy folder. The script automatically merges all `ckpt_strategy_rank_x.ckpt` files in the folder into `merged_ckpt_strategy.ckpt` in the folder. If `merged_ckpt_strategy.ckpt` already exists, set the parameter to the path of the file.
- - **Pipeline parallelism not enabled for the source weights**: Weight conversion can be based on any strategy file. Set the parameter to the path of any `ckpt_strategy_rank_x.ckpt` file.
-
- **Note**: If a `merged_ckpt_strategy.ckpt` already exists in the strategy folder and is still transferred to the folder path, the script deletes the old `merged_ckpt_strategy.ckpt` and then merges files into a new `merged_ckpt_strategy.ckpt` for weight conversion. Therefore, ensure that the folder has enough write permission. Otherwise, an error will be reported.
-- **mindspore_ckpt_dir**: The path of distributed weight, please fill in the path of the folder where the source weight is located, the source weights should be stored as `model_dir/rank_x/xxx.safetensors`, and fill in the folder path as `model_dir`.
-- **output_dir**: Path for saving target weights, default value is "/new_llm_data/******/ckpt/nbg3_31b/tmp", target weights will be saved in `/new_llm_data/******/ckpt/nbg3_31b/tmp`.
-- **file_suffix**: Naming suffix of target weight file, default value is "1_1", The target weight will be searched in the format of `*1_1.safetensors`.
-- **has_redundancy**: Is the merged weights which has redundancy, default value is `True`.
-- **filter_out_param_prefix**: Customize the parameters to be filtered out when merging weights, and the filtering rules are based on prefix name matching. For example, optimizer parameter "adam_".
-- **max_process_num**: Maximum number of processes to merge. Default value: 64.
-
-### Examples
-
-#### Scenario 1: Safetensors weights removed redundancy
-
-If merging the safetensors weights which have removed redundancy, you can set the parameters as follows:
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --has_redundancy True
-```
-
-#### Scenario 2: Safetensors weights did not remove redundancy
-
-If merging the safetensors weights which did not remove redundancy, you can set the parameters as follows:
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --has_redundancy False
-```
-
-#### Scenario 3: Safetensors weights of Adam optimizer are filtered
-
-If merge the filtered safetensors weights of Adam optimizer, you can fill in the parameters as follows:
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --filter_out_param_prefix "adam_"
-```
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/feature/configuration.md b/docs/mindformers/docs/source_en/feature/configuration.md
index 9b46c7089a39b3b4bff9e48953f9d6aa735ded63..114a52658d72544a208b489abe1690e1b88d82f8 100644
--- a/docs/mindformers/docs/source_en/feature/configuration.md
+++ b/docs/mindformers/docs/source_en/feature/configuration.md
@@ -14,18 +14,18 @@ The `YAML` file provided by MindSpore Transformers contains configuration items
The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights.
-| Parameters | Descriptions | Types |
-|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
-| seed | Set the global seed. For details, refer to [mindspore.set_seed](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_seed.html). | int |
-| run_mode | Set the running mode of the model: `train`, `finetune`, `eval` or `predict`. | str |
-| output_dir | Set the path where log, checkpoint, strategy, etc. files are saved. | str |
-| load_checkpoint | File or folder paths for loading weights. Currently there are 3 application scenarios:<br>1. Support for passing in full weight file paths.<br>2. Support for passing in offline sliced weight folder paths.<br>3. Support for passing in folder paths containing lora weights and base weights.<br>Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html) for the ways of obtaining various weights. | str |
-| auto_trans_ckpt | Enable distributed weight auto slicing and merging. Refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html). | bool |
+| Parameters | Descriptions | Types |
+|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
+| seed | Set the global seed. For details, refer to [mindspore.set_seed](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_seed.html). | int |
+| run_mode | Set the running mode of the model: `train`, `finetune`, `eval` or `predict`. | str |
+| output_dir | Set the path where log, checkpoint, strategy, etc. files are saved. | str |
+| load_checkpoint | File or folder paths for loading weights. Currently there are 3 application scenarios:<br>1. Support for passing in full weight file paths.<br>2. Support for passing in offline sliced weight folder paths.<br>3. Support for passing in folder paths containing lora weights and base weights.<br>Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html) for the ways of obtaining various weights. | str |
+| auto_trans_ckpt | Enable distributed weight auto slicing and merging. Refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html). | bool |
| resume_training | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html#resumable-training). | bool |
-| load_ckpt_format | The format of loading checkpoint, either `ckpt` or `safetensors`. | str |
-| remove_redundancy | Whether the checkpoint has removed redundancy while loading checkpoint. The default value is `False`. | bool |
-| train_precision_sync | Switching on or off deterministic computation of the training process. The default value is `None`. | Optional[bool] |
-| infer_precision_sync | Switching on or off deterministic computation of the inference process. The default value is `None`. | Optional[bool] |
+| load_ckpt_format | The format of loading checkpoint, either `ckpt` or `safetensors`. | str |
+| remove_redundancy | Whether the checkpoint has removed redundancy while loading checkpoint. The default value is `False`. | bool |
+| train_precision_sync | Switching on or off deterministic computation of the training process. The default value is `None`. | Optional[bool] |
+| infer_precision_sync | Switching on or off deterministic computation of the inference process. The default value is `None`. | Optional[bool] |
### Context Configuration
diff --git a/docs/mindformers/docs/source_en/feature/start_tasks.md b/docs/mindformers/docs/source_en/feature/start_tasks.md
index e77498056b5976940fd0811d0cc0dc3b34fcb99f..36a23f4aa9a12d7aa2fe0e4b122580b85a9168d5 100644
--- a/docs/mindformers/docs/source_en/feature/start_tasks.md
+++ b/docs/mindformers/docs/source_en/feature/start_tasks.md
@@ -22,7 +22,7 @@ In the root directory of the MindSpore Transformers code, execute the `run_mindf
| `--device_id` | Set the execution device ID. The value must be within the range of available devices. | int, optional | pre-train/finetune/predict |
| `--device_target` | Set the backend execution device. MindSpore Transformers is only supported on `Ascend` devices. | str, optional | pre-train/finetune/predict |
| `--run_mode` | Set the running mode of the model: `train`, `finetune` or `predict`. | str, optional | pre-train/finetune/predict |
-| `--load_checkpoint` | File or folder paths for loading weights. For detailed usage, please refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html) | str, optional | pre-train/finetune/predict |
+| `--load_checkpoint` | File or folder paths for loading weights. For detailed usage, please refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html) | str, optional | pre-train/finetune/predict |
| `--use_parallel` | Whether use parallel mode. | bool, optional | pre-train/finetune/predict |
| `--output_dir` | Set the path where log, checkpoint, strategy, etc. files are saved. | str, optional | pre-train/finetune/predict |
| `--register_path` | The absolute path of the directory where the external code is located. For example, the model directory under the research directory. | str, optional | pre-train/finetune/predict |
@@ -33,7 +33,7 @@ In the root directory of the MindSpore Transformers code, execute the `run_mindf
| Parameters | Parameter Descriptions | Value Description | Applicable Scenarios |
|:----------------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|-----------------------------|
| `--src_strategy_path_or_dir` | The strategy of load_checkpoint. | str, optional | pre-train/finetune/predict |
-| `--auto_trans_ckpt` | Enable online weight automatic conversion. Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html). | bool, optional | pre-train/finetune/predict |
+| `--auto_trans_ckpt` | Enable online weight automatic conversion. Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html). | bool, optional | pre-train/finetune/predict |
| `--transform_process_num` | The number of processes responsible for checkpoint transform. | int, optional | pre-train/finetune/predict |
| `--only_save_strategy` | Whether to only save the strategy files. | bool, optional, when it is `true`, the task exits directly after saving the strategy file. | pre-train/finetune/predict |
diff --git a/docs/mindformers/docs/source_en/feature/weight_conversion.md b/docs/mindformers/docs/source_en/feature/weight_conversion.md
deleted file mode 100644
index c61bcbe880ded0adcbb54ed248838df60e0bcff7..0000000000000000000000000000000000000000
--- a/docs/mindformers/docs/source_en/feature/weight_conversion.md
+++ /dev/null
@@ -1,124 +0,0 @@
-# Weight Format Conversion
-
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/weight_conversion.md)
-
-## Overview
-
-MindSpore Transformers provides a unified weight conversion tool that allows model weights to convert between the HuggingFace and MindSpore Transformers formats. This helps you:
-
-- Convert a HuggingFace weight to a MindSpore Transformers one for fine-tuning, evaluation, or inference on MindSpore Transformers.
-- Convert the weights trained or fine-tuned using MindSpore Transformers to HuggingFace weights and uses them on other frameworks.
-
-## Conversion Procedure
-
-To perform weight conversion, clone the complete HuggingFace repository of the model to be converted locally, and execute the `mindformers/convert_weight.py` script. This script automatically converts the HuggingFace model weight file into a weight file applicable to MindSpore Transformers. If you want to convert a MindSpore Transformers weight to a HuggingFace one, set `reversed` to `True`.
-
-```shell
-python convert_weight.py [-h] --model MODEL [--reversed] --input_path INPUT_PATH --output_path OUTPUT_PATH [--dtype DTYPE] [--n_head N_HEAD] [--hidden_size HIDDEN_SIZE] [--layers LAYERS] [--is_pretrain IS_PRETRAIN] [--telechat_type TELECHAT_TYPE]
-```
-
-### Parameters
-
-- model: model name.
-- reversed: converts a MindSpore Transformers weight to the HuggingFace one.
-- input_path: path of the HuggingFace weight folder, which points to the downloaded weight file.
-- output_path: path for storing the MindSpore Transformers weight file after conversion.
-- dtype: weight data type after conversion.
-- n_head: takes effect only for the BLOOM model. Set this parameter to `16` when `bloom_560m` is used and to `32` when `bloom_7.1b` is used.
-- hidden_size: takes effect only for the BLOOM model. Set this parameter to `1024` when `bloom_560m` is used and to `4096` when `bloom_7.1b` is used.
-- layers: number of layers to be converted. This parameter takes effect only for the GPT2 and WizardCoder models.
-- is_pretrain: converts the pre-trained weight. This parameter takes effect only for the Swin model.
-- telechat_type: version of the TeleChat model. This parameter takes effect only for the TeleChat model.
-
-## Conversion Example
-
-Assume that you have downloaded the [Llama2 model weight](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD) and saved it in the `/home/user/torch_weights` path. To convert it to the MindSpore Transformers weight and save it in the `/home/user/ms_weights` path, run the following command:
-
-```bash
-python convert_weight.py --model llama --input_path /home/user/torch_weights --output_path /home/user/ms_weights/llama.ckpt
-```
-
-After the preceding steps are performed, the HuggingFace weight is successfully converted to a MindSpore Transformers weight, facilitating model training or inference on MindSpore Transformers.
-
-## Supported Models
-
-| Parameter Value | Supported models |
-|-----------|---------------------------------------------|
-| llama | Llama2, Llama3, Llama3.1, CodeLlama |
-| baichuan2 | Baichuan2 |
-| glm-n | GLM2, GLM3, GLM3-32K, GLM4 |
-| cogvlm2 | CogVLM2-Video, CogVLM2-Image |
-| qwen | Qwen, Qwen1.5, Qwen2 |
-| qwenvl | QwenVL |
-| internlm | InternLM |
-| internlm2 | InternLM2 |
-| yi | Yi |
-| mixtral | Mixtral |
-| deepseek | DeepSeekCoder, DeepSeekCoder1.5, DeepSeekV2 |
-| gpt | GPT2 |
-| whisper | Whisper |
-
-## Developing Weight Conversion for Unsupported Models
-
-1. Add the `convert_weight.py` and `convert_reversed.py` files to the extended model directory.
-2. Compile the `convert_pt_to_ms` and `convert_ms_to_pt` weight conversion functions in the files. The function parameters are `input_path`, `output_path`, `dtype`, and an additional parameter `**kwargs`.
-3. Add the extended model name and conversion function import paths to the `convert_map` and `reversed_convert_map` dictionaries in the `convert_weight.py` file in the MindSpore Transformers code root directory.
-4. Call the `parser.add_argument()` method in the `main` function to add the additional parameter.
-
-## Example of Developing Model Weight Conversion
-
-Llama is used as an example. To convert a HuggingFace weight to a MindSpore Transformers one, define the `convert_pt_to_ms` function in [convert_weight.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_weight.py).
-
-```python
-def convert_pt_to_ms(input_path, output_path, dtype=None, **kwargs):
- """convert hf weight to ms."""
- print(f"Trying to convert huggingface checkpoint in '{input_path}'.", flush=True)
- try:
- from transformers import LlamaForCausalLM
- except:
- raise ImportError(f"Failed to load huggingface checkpoint. Please make sure transformers is available.")
-
- try:
- model_hf = LlamaForCausalLM.from_pretrained(os.path.dirname(input_path))
- except Exception as e:
- print(f"Do not find huggingface checkpoint in '{os.path.dirname(input_path)}', Error {e.message}.", flush=True)
- return False
- ckpt_list = []
- for name, value in model_hf.state_dict().items():
- name = name_replace(name)
- if name == 'norm.weight':
- name = 'norm_out.weight'
- if name[:7] == 'layers.':
- name = name[7:]
-
- print(f'\rprocessing parameter: {name} {value.shape} ', end='', flush=True)
- ckpt_list.append({'name': name, 'data': pt2ms(value, dtype)})
-
- ms.save_checkpoint(ckpt_list, output_path)
- print(f"\rConvert huggingface checkpoint finished, the mindspore checkpoint is saved in '{output_path}'.",
- flush=True)
- return True
-```
-
-To convert a MindSpore Transformers weight to a HuggingFace one, define the `convert_ms_to_pt` function in [convert_reversed.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_reversed.py).
-
-```python
-def convert_ms_to_pt(input_path, output_path, dtype=None, **kwargs):
- """convert ms weight to hf."""
- print(f"Trying to convert mindspore checkpoint in '{input_path}'.", flush=True)
- model_ms = ms.load_checkpoint(input_path)
-
- state_dict = {}
- for name, value in model_ms.items():
- name = name_replace(name)
- print(f'\rprocessing parameter: {name} {value.shape} ', end='', flush=True)
- if is_lora_param(name):
- name = name.replace('.tk_delta_lora_a', '.lora_A.weight')
- name = name.replace('.tk_delta_lora_b', 'lora_B.weight')
- state_dict[name] = ms2pt(value, dtype)
-
- torch.save(state_dict, output_path)
- print(f"\rConvert mindspore checkpoint finished, the huggingface checkpoint is saved in '{output_path}'.",
- flush=True)
- return True
-```
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/guide/deployment.md b/docs/mindformers/docs/source_en/guide/deployment.md
index 223f7f3630d0b57fde6793bdf800533a9349cb84..4200fecd3f92fa867887f41b6bc2f63b1b70cec1 100644
--- a/docs/mindformers/docs/source_en/guide/deployment.md
+++ b/docs/mindformers/docs/source_en/guide/deployment.md
@@ -86,7 +86,7 @@ processor:
merges_file: "/path/to/mf_model/qwen1_5_72b/merges.txt" # merges file absolute path
```
-For model weight downloading and conversions, refer to the [Weight Format Conversion Guide](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html).
+For model weight downloading and conversions, refer to the [Weight Format Conversion Guide](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html).
Required files and configurations may vary from model to model. Refer to the model-specific inference sections in [Model Repository](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html) for details.
diff --git a/docs/mindformers/docs/source_en/guide/inference.md b/docs/mindformers/docs/source_en/guide/inference.md
index aaacee7c152773e2b43ed5e04d88784ce10ae41c..071cea651fdba1ce56a81f630db5a89557885d67 100644
--- a/docs/mindformers/docs/source_en/guide/inference.md
+++ b/docs/mindformers/docs/source_en/guide/inference.md
@@ -22,8 +22,8 @@ Model weights can be categorized into two types: complete weights and distribute
Complete weights can be obtained in two ways:
-1. After downloading the open source weights of the corresponding model from the HuggingFace model library, refer to [Weight Format Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html) to convert them to the ckpt format.
-2. Pre-trained or fine-tuned distributed weights are used to generate a complete weight by [merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html).
+1. After downloading the open source weights of the corresponding model from the HuggingFace model library, refer to [Weight Format Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html) to convert them to the ckpt format.
+2. Pre-trained or fine-tuned distributed weights are used to generate a complete weight by [merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html).
#### 2.2 Distributed Weights
@@ -35,7 +35,7 @@ If the inference uses a weight slicing that is different from the model slicing
2. Weights from eight-card training are used for inference on two cards;
3. Distributed weights that have already been sliced are used for inference on a single card, and so on.
-The command samples in the following contents are all used in the way of online autoslicing. It is recommended to use online autoslicing by setting the command parameters `--auto_trans_ckpt` to `-True` and `-src_strategy_path_or_dir` to the weighted slicing strategy file or directory path (which is saved by default after training under `./output/strategy`) are automatically sliced in the inference task. Details can be found in [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html).
+The command samples below all use online auto-slicing. It is recommended to enable it by setting `--auto_trans_ckpt` to `True` and `--src_strategy_path_or_dir` to the path of the weight slicing strategy file or directory (saved by default under `./output/strategy` after training), so that the weights are sliced automatically during the inference task. Details can be found in [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html).
> Since both the training and inference tasks use `./output` as the default output path, when using the strategy file output by the training task as the source weight strategy file for the inference task, you need to move the strategy file directory under the default output path to another location to avoid it being emptied by the process of the inference task, for example:
>
diff --git a/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md b/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md
index f51952873dd2bf668686a3b7f1dc4125d6fee9dc..ae22b5788134a761e78246053dc50f8db54527a7 100644
--- a/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md
+++ b/docs/mindformers/docs/source_en/guide/supervised_fine_tuning.md
@@ -243,7 +243,7 @@ bash scripts/msrun_launcher.sh "run_mindformer.py \
--run_mode finetune" 8
```
-When the distributed strategy of the weights does not match the distributed strategy of the model, the weights need to be transformed. The load weight path should be set to the upper path of the directory named with `rank_0`, and the weight auto transformation function should be enabled by setting `--auto_trans_ckpt True` . For a more detailed description of the scenarios and usage of distributed weight transformation, please refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html).
+When the distributed strategy of the weights does not match the distributed strategy of the model, the weights need to be transformed. Set the load weight path to the parent directory of the `rank_0` directory, and enable automatic weight transformation by setting `--auto_trans_ckpt True`. For a more detailed description of the scenarios and usage of distributed weight transformation, please refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html).
```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
diff --git a/docs/mindformers/docs/source_en/index.rst b/docs/mindformers/docs/source_en/index.rst
index 92b3f33a9534f8b5f9f1c154c914f604bd5e7f10..589419b5f18b48141c511a69e97047d7af2d80fd 100644
--- a/docs/mindformers/docs/source_en/index.rst
+++ b/docs/mindformers/docs/source_en/index.rst
@@ -38,13 +38,9 @@ MindSpore Transformers provides a wealth of features throughout the full-process
One-click start for single-device, single-node and multi-node tasks.
- - `Weight Format Conversion `_
+ - `Ckpt Weights `_
- Provides a unified weight conversion tool that converts model weights between the formats used by HuggingFace and MindSpore Transformers.
-
- - `Distributed Weight Slicing and Merging `_
-
- Weights in different distributed scenarios are flexibly sliced and merged.
+  Supports conversion, slicing, and merging of weight files in ckpt format.
- `Safetensors Weights `_
@@ -167,8 +163,7 @@ FAQ
:hidden:
feature/start_tasks
- feature/weight_conversion
- feature/transform_weight
+ feature/ckpt
feature/safetensors
feature/configuration
feature/logging
diff --git a/docs/mindformers/docs/source_en/introduction/overview.md b/docs/mindformers/docs/source_en/introduction/overview.md
index 322b841f5d97a392216e580cc5ad6f5cb24c22d5..5100ae43b1b18c99d90f1235f3b8f76860d90a4f 100644
--- a/docs/mindformers/docs/source_en/introduction/overview.md
+++ b/docs/mindformers/docs/source_en/introduction/overview.md
@@ -8,6 +8,6 @@ The overall architecture formed by MindSpore Transformers and the end-to-end AI
2. At the software level, MindSpore Transformers implements the large-model code through the Python interfaces provided by MindSpore and performs data computation through the operator libraries shipped with the supporting software package of the Ascend AI processor;
3. The basic functionality features currently supported by MindSpore Transformers are listed below:
    1. Supports [distributed parallelism](https://www.mindspore.cn/mindformers/docs/en/dev/feature/parallel_training.html) for large-model training and inference tasks, with parallel capabilities including data parallelism, model parallelism, and ultra-long sequence parallelism;
- 2. Supports [model weight conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/weight_conversion.html), [distributed weight splitting and combination](https://www.mindspore.cn/mindformers/docs/en/dev/feature/transform_weight.html), and different format of [dataset loading](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html) and [resumable training after breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html);
+ 2. Supports [model weight conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html), [distributed weight splitting and combination](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html), and different format of [dataset loading](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html) and [resumable training after breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html);
    3. Supports [pretraining](https://www.mindspore.cn/mindformers/docs/en/dev/guide/pre_training.html), [fine-tuning](https://www.mindspore.cn/mindformers/docs/en/dev/guide/supervised_fine_tuning.html), [inference](https://www.mindspore.cn/mindformers/docs/en/dev/guide/inference.html), and [evaluation](https://www.mindspore.cn/mindformers/docs/en/dev/feature/evaluation.html) for 25+ large models. It also supports [quantization](https://www.mindspore.cn/mindformers/docs/en/dev/feature/quantization.html); the list of supported models can be found in the [Model Library](https://www.mindspore.cn/mindformers/docs/en/dev/introduction/models.html);
    4. MindSpore Transformers supports model service deployment through [MindIE](https://www.mindspore.cn/mindformers/docs/en/dev/guide/deployment.html) and large-scale cluster scheduling through [MindX](https://www.hiascend.com/software/mindx-dl); more third-party platforms will be supported in the future.
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
index 12df6e202715ed62a373bf606a80db75dfe41797..4efe088164f8dd2006251d231306416e737d4d04 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
@@ -46,7 +46,7 @@ MindSpore Transformers提供了[PretrainedTokenizer](https://www.mindspore.cn/mi
### Preparing Weights and Datasets
-If you already have PyTorch-based model weights, you can refer to the [Weight Conversion Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html) to convert them to MindSpore-format weights.
+If you already have PyTorch-based model weights, you can refer to the [Weight Conversion Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html) to convert them to MindSpore-format weights.
For dataset preparation, refer to the [dataset documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html), or to the model documentation, such as [Llama2 Documentation - Dataset Preparation](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87).
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md b/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
index 9e7c34524d171cba72567f4b6bb652ccca604066..914af367450d93b3cc86a3610a1d8751ef767b85 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/precision_optimization.md
@@ -187,7 +187,7 @@ export MINDSPORE_DUMP_CONFIG=${JSON_PATH}
#### Weight Conversion
-During training, MindSpore and PyTorch load the same weights. In a pre-training scenario, you can save an initialized weight with PyTorch and convert it to a MindSpore weight. Because MindSpore weight names differ from PyTorch's, weight conversion essentially renames the entries in the PyTorch weight dict to MindSpore weight names so that MindSpore can load them. For weight conversion, see the [Weight Conversion Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html).
+During training, MindSpore and PyTorch load the same weights. In a pre-training scenario, you can save an initialized weight with PyTorch and convert it to a MindSpore weight. Because MindSpore weight names differ from PyTorch's, weight conversion essentially renames the entries in the PyTorch weight dict to MindSpore weight names so that MindSpore can load them. For weight conversion, see the [Weight Conversion Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html).
Both MindSpore and PyTorch support data in `bin` format; load the same dataset for training to keep every step consistent.
diff --git a/docs/mindformers/docs/source_zh_cn/faq/feature_related.md b/docs/mindformers/docs/source_zh_cn/faq/feature_related.md
index 01ce3ec5915b6905e0ac3da540ac646ef981edbf..97a7738a7c7e8f5c2ad331af06ddb244bcd668fe 100644
--- a/docs/mindformers/docs/source_zh_cn/faq/feature_related.md
+++ b/docs/mindformers/docs/source_zh_cn/faq/feature_related.md
@@ -10,7 +10,7 @@ A: 官方下载链接失效,请关注社区Issue [#IBV35D](https://gitee.com/m
## Q: How Do I Generate a Model Sharding Strategy File?
-A: The model sharding strategy file records the sharding strategy of model weights in distributed scenarios and is generally used when slicing weights offline. Configure `only_save_strategy: True` in the network `yaml` file and start the distributed task normally; the distributed strategy files are then generated in the `output/strategy/` directory. For details, please refer to the [Tutorial on Slicing and Merging Distributed Weights](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html#%E7%A6%BB%E7%BA%BF%E8%BD%AC%E6%8D%A2%E9%85%8D%E7%BD%AE%E8%AF%B4%E6%98%8E).
+A: The model sharding strategy file records the sharding strategy of model weights in distributed scenarios and is generally used when slicing weights offline. Configure `only_save_strategy: True` in the network `yaml` file and start the distributed task normally; the distributed strategy files are then generated in the `output/strategy/` directory. For details, please refer to the [Tutorial on Slicing and Merging Distributed Weights](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html).
diff --git a/docs/mindformers/docs/source_zh_cn/feature/transform_weight.md b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md
similarity index 71%
rename from docs/mindformers/docs/source_zh_cn/feature/transform_weight.md
rename to docs/mindformers/docs/source_zh_cn/feature/ckpt.md
index cc03b0c4b4b7ba12725bb5b124234fe799b402db..256e05b964048afdfd03776a3ba835a5a5987e10 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/transform_weight.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md
@@ -1,18 +1,149 @@
-# Distributed Weight Slicing and Merging
+# Ckpt Weights
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/transform_weight.md)
+[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/ckpt.md)
## Overview
+Ckpt is a common file format used in deep learning frameworks to save model training state. It contains model parameters, optimizer state, and training progress, and is mainly used to resume training or to fine-tune a model. This document describes how MindSpore Transformers supports converting, slicing, and merging weights in this format.
+
+> The ckpt format is planned to be phased out, and the safetensors format is recommended for weights. Safetensors is a reliable and portable machine learning model storage format from Huggingface for storing tensors securely, with fast storage speed. For details, see [Safetensors Weights](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/safetensors.html).
+
+## 权重格式转换
+
+### 概述
+
+MindSpore Transformers提供了统一的权重转换工具,能够将模型权重在HuggingFace所使用的格式与MindSpore Transformers所使用的格式之间相互转换。这可以帮助用户:
+
+- 将HuggingFace权重转换为MindSpore Transformers权重,在MindSpore Transformers上进行微调、测评或推理。
+- 把使用MindSpore Transformers训练或微调得到的权重转换为HuggingFace权重,并在其他框架上使用。
+
+### 转换步骤
+
+要进行权重转换,首先请将待转换模型的HuggingFace仓库完整克隆到本地,然后执行`mindformers/convert_weight.py`脚本。该脚本能够自动将HuggingFace的模型权重文件转换为适用于MindSpore Transformers的权重文件。如若希望将MindSpore Transformers权重转为HuggingFace权重,请将`reversed`设置为`True`。
+
+```shell
+python convert_weight.py [-h] --model MODEL [--reversed] --input_path INPUT_PATH --output_path OUTPUT_PATH [--dtype DTYPE] [--n_head N_HEAD] [--hidden_size HIDDEN_SIZE] [--layers LAYERS] [--is_pretrain IS_PRETRAIN] [--telechat_type TELECHAT_TYPE]
+```
+
+#### 参数说明
+
+- model:模型名称。
+- reversed:将MindSpore Transformers权重转换为HuggingFace权重。
+- input_path:HuggingFace权重文件夹的路径,指向已下载的权重文件。
+- output_path:转换后MindSpore Transformers权重文件的保存路径。
+- dtype:转换后的权重数据类型。
+- n_head:只对BLOOM模型生效,使用`bloom_560m`时请设为`16`,使用`bloom_7.1b`时请设为`32`。
+- hidden_size:只对BLOOM模型生效,使用`bloom_560m`时请设为`1024`,使用`bloom_7.1b`时请设为`4096`。
+- layers:只对GPT2和WizardCoder模型生效,模型被转换的层数。
+- is_pretrain:只对Swin模型生效,转换预训练权重。
+- telechat_type:只对TeleChat模型生效,TeleChat模型的版本。
+
+### 转换示例
+
+假设用户已经下载了[Llama2模型的权重](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD),并保存在路径`/home/user/torch_weights`中,用户希望将其转换为MindSpore Transformers权重并保存在路径`/home/user/ms_weights`中,可以使用以下命令:
+
+```bash
+python convert_weight.py --model llama --input_path /home/user/torch_weights --output_path /home/user/ms_weights/llama.ckpt
+```
+
+通过以上步骤,可将HuggingFace权重成功转换为MindSpore Transformers权重,方便在MindSpore Transformers中继续模型训练或推理。
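+
+如需反向转换,在命令中加入`--reversed`参数即可。以下为一条示意命令(输入、输出路径均为假设,请替换为实际路径),将MindSpore Transformers权重转回HuggingFace权重:
+
+```bash
+python convert_weight.py --model llama --reversed --input_path /home/user/ms_weights/llama.ckpt --output_path /home/user/torch_weights/llama_hf.bin
+```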
+
+### 已支持模型
+
+| 参数取值 | 支持模型 |
+|-----------|-------------------------------------------|
+| llama | Llama2、Llama3、Llama3.1、CodeLlama |
+| baichuan2 | Baichuan2 |
+| glm-n | GLM2、GLM3、GLM3-32K、GLM4 |
+| cogvlm2 | CogVLM2-Video、CogVLM2-Image |
+| qwen | Qwen、Qwen1.5、Qwen2 |
+| qwenvl | QwenVL |
+| internlm | InternLM |
+| internlm2 | InternLM2 |
+| yi | Yi |
+| mixtral | Mixtral |
+| deepseek | DeepSeekCoder、DeepSeekCoder1.5、DeepSeekV2 |
+| gpt | GPT2 |
+| whisper | Whisper |
+
+### 未支持模型权重转换开发
+
+1. 在扩展模型目录下新增`convert_weight.py`及`convert_reversed.py`文件。
+2. 在文件中分别编写`convert_pt_to_ms`及`convert_ms_to_pt`权重转换函数,函数参数为`input_path`、`output_path`、`dtype`及额外参数`**kwargs`。
+3. 在MindSpore Transformers代码根目录下`convert_weight.py`文件中的`convert_map`和`reversed_convert_map`字典中加入扩展模型名称及转换函数引入路径。
+4. 在`main`函数中通过调用`parser.add_argument()`方法新增额外参数。
+
+### 模型权重转换开发示例
+
+此处以Llama为例。如若希望转换HuggingFace权重至MindSpore Transformers权重,需在[convert_weight.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_weight.py)内定义`convert_pt_to_ms`函数:
+
+```python
+import os
+
+import mindspore as ms
+
+# 注:name_replace、pt2ms 为转换脚本及 mindformers 工具模块中已有的辅助函数
+def convert_pt_to_ms(input_path, output_path, dtype=None, **kwargs):
+    """convert hf weight to ms."""
+    print(f"Trying to convert huggingface checkpoint in '{input_path}'.", flush=True)
+    try:
+        from transformers import LlamaForCausalLM
+    except ImportError as e:
+        raise ImportError("Failed to load huggingface checkpoint. "
+                          "Please make sure transformers is available.") from e
+
+    try:
+        model_hf = LlamaForCausalLM.from_pretrained(os.path.dirname(input_path))
+    except Exception as e:
+        print(f"Do not find huggingface checkpoint in '{os.path.dirname(input_path)}', Error: {e}.", flush=True)
+        return False
+    ckpt_list = []
+    for name, value in model_hf.state_dict().items():
+        name = name_replace(name)
+        if name == 'norm.weight':
+            name = 'norm_out.weight'
+        if name[:7] == 'layers.':
+            name = name[7:]
+
+        print(f'\rprocessing parameter: {name} {value.shape}     ', end='', flush=True)
+        ckpt_list.append({'name': name, 'data': pt2ms(value, dtype)})
+
+    ms.save_checkpoint(ckpt_list, output_path)
+    print(f"\rConvert huggingface checkpoint finished, the mindspore checkpoint is saved in '{output_path}'.",
+          flush=True)
+    return True
+```
+
+而若是希望转换MindSpore Transformers权重至HuggingFace权重,则需在[convert_reversed.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_reversed.py)内定义`convert_ms_to_pt`函数:
+
+```python
+import mindspore as ms
+import torch
+
+# 注:name_replace、ms2pt、is_lora_param 为转换脚本及 mindformers 工具模块中已有的辅助函数
+def convert_ms_to_pt(input_path, output_path, dtype=None, **kwargs):
+    """convert ms weight to hf."""
+    print(f"Trying to convert mindspore checkpoint in '{input_path}'.", flush=True)
+    model_ms = ms.load_checkpoint(input_path)
+
+    state_dict = {}
+    for name, value in model_ms.items():
+        name = name_replace(name)
+        print(f'\rprocessing parameter: {name} {value.shape}     ', end='', flush=True)
+        if is_lora_param(name):
+            name = name.replace('.tk_delta_lora_a', '.lora_A.weight')
+            name = name.replace('.tk_delta_lora_b', '.lora_B.weight')
+        state_dict[name] = ms2pt(value, dtype)
+
+    torch.save(state_dict, output_path)
+    print(f"\rConvert mindspore checkpoint finished, the huggingface checkpoint is saved in '{output_path}'.",
+          flush=True)
+    return True
+```
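+
+函数编写完成后,还需按照前文「未支持模型权重转换开发」的第3步,在代码根目录的`convert_weight.py`中注册模型名称与转换函数的引入路径。以下为注册方式的示意(字典条目为示例,具体以实际代码为准):
+
+```python
+# mindformers/convert_weight.py 中的映射表示意
+convert_map = {
+    'llama': 'mindformers.models.llama.convert_weight.convert_pt_to_ms',
+    # 新增模型时在此处加入:'my_model': 'research.my_model.convert_weight.convert_pt_to_ms',
+}
+reversed_convert_map = {
+    'llama': 'mindformers.models.llama.convert_reversed.convert_ms_to_pt',
+    # 新增模型时在此处加入:'my_model': 'research.my_model.convert_reversed.convert_ms_to_pt',
+}
+```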
+
+## 权重切分与合并
+
+### 概述
+
在当前的分布式训练和推理环境中,当预训练权重与分布式策略不匹配时,需要对预训练权重进行转换,以适应相应的分布式策略。为满足不同场景下的权重转换需求,MindSpore Transformers提供了一套权重转换工具。该工具支持单卡权重切分为多卡权重、多卡权重之间的转换、多卡权重合并为单卡权重。用户可根据具体需求选择[自动转换](#自动转换)或[离线转换](#离线转换),帮助模型在不同分布式场景之间快速切换。
此外,MindSpore Transformers还支持[LoRA权重的合并](#lora权重合并),方便用户部署使用LoRA微调后的模型。
-## 自动转换
+### 自动转换
模型加载权重时,自动转换功能可以自动检测权重与当前模型分布式切分策略之间的匹配情况,如果不匹配,自动进行权重转换,无需用户手动干预。
-### 参数说明
+#### 参数说明
**自动权重转换**相关`yaml`文件参数说明如下:
@@ -24,9 +155,9 @@
| transform_process_num | 权重自动转换使用的进程数,默认为1。<br>- 如果transform_process_num = 1,使用**单进程转换**,转换时只有rank_0负责权重转换,其他进程等待rank_0转换结束;<br>- 如果transform_process_num > 1,使用**多进程转换**,比如8卡任务,transform_process_num=2时,转换时rank_0负责rank_0/1/2/3切片权重的转换,rank_4负责rank_4/5/6/7切片权重的转换,其他进程等待rank_0/4转换结束;<br>**注意**:<br>1. transform_process_num越大,转换时间越短,**转换所占用的host内存越大**;当出现host侧内存不足时,需要减少transform_process_num。<br>2. transform_process_num必须能够整除NPU卡数,且最大不得超过NPU卡数。 |
| transform_by_rank | 是否使用mindspore.transform_checkpoint_by_rank接口做权重转换。<br>- transform_process_num > 1时,自动设置为`True`;<br>- transform_process_num = 1时,如果目标权重为分布式权重,则循环调用mindspore.transform_checkpoint_by_rank串行转换每一个rank切片权重。<br>- transform_process_num = 1时,如果目标权重为完整权重,则自动设置为`False`,使用mindspore.transform_checkpoints接口做权重转换; |
-### 不同场景下yaml配置说明
+#### 不同场景下yaml配置说明
-#### 单卡权重切分为多卡权重
+**单卡权重切分为多卡权重**
```yaml
# load_checkpoint: 设置为预训练权重文件路径
@@ -36,7 +167,7 @@ load_checkpoint: "/worker/llama3_8b/llama3_8b.ckpt"
auto_trans_ckpt: True
```
-#### 多卡权重之间的转换
+**多卡权重之间的转换**
```yaml
# load_checkpoint: 设置为多卡权重文件夹路径
@@ -49,7 +180,7 @@ src_strategy_path_or_dir: "/worker/checkpoint/llama3-8b-2layer-dp2mp2pp2/strateg
auto_trans_ckpt: True
```
-#### 多卡权重合并为单卡权重
+**多卡权重合并为单卡权重**
```yaml
# load_checkpoint: 设置为多卡权重文件夹路径
@@ -65,14 +196,14 @@ auto_trans_ckpt: True
use_parallel: False
```
-#### 开启多进程转换(可选)
+**开启多进程转换(可选)**
```yaml
# transform_process_num: 设置参与转换的进程数量
transform_process_num: 2
```
-### 注意事项
+#### 注意事项
- **多进程转换**:配置`transform_process_num`参数以开启多进程转换,但需注意内存占用。如果发生内存溢出,建议降低进程数量。
@@ -80,13 +211,13 @@ transform_process_num: 2
- **分布式策略文件保存**:分布式策略文件将保存在`output/strategy`文件夹下。如果开启了**流水线并行**,系统会自动合并所有的`ckpt_strategy_rank_x.ckpt`文件,生成`merged_ckpt_strategy.ckpt`。如果未开启流水线并行,则不会进行合并操作。
-## 离线转换
+### 离线转换
离线转换功能旨在满足用户手动转换权重的需求。通过离线转换,用户可以在独立的环境中进行模型权重的转换操作。离线转换支持多种权重转换场景,包括单卡权重切分为多卡权重、多卡权重之间的转换、多卡权重合并为单卡权重。
用户在使用离线转换时,可以根据具体需求手动配置转换参数,确保转换过程灵活且可控,尤其适用于在严格控制的计算环境中进行模型部署和优化的场景。
-### 参数说明
+#### 参数说明
**离线权重转换**相关`yaml`参数说明如下:
@@ -100,15 +231,15 @@ transform_process_num: 2
| world_size | 目标权重的切片总数,一般等于dp \* mp \* pp。 |
| process_num | 离线权重转换使用的进程数,默认为1。<br>- 如果process_num = 1,使用**单进程转换**;<br>- 如果process_num > 1,使用**多进程转换**,比如转换的目标权重为8卡分布式权重,process_num=2时,会启动两个进程分别负责rank_0/1/2/3和rank_4/5/6/7切片权重的转换; |
-### 离线转换配置说明
+#### 离线转换配置说明
-#### 生成分布式策略
+**生成分布式策略**
MindSpore每次运行分布式任务后都会在`output/strategy`文件夹下生成对应卡数的分布式策略文件(ckpt格式),可以在离线权重转换中使用。
如果当前没有分布式策略文件,可以通过这种方式快速生成:在原有分布式训练/推理任务的基础上,在yaml配置文件中设置`only_save_strategy:True`来生成策略文件。设置之后任务会在生成分布式策略文件后立即停止,而不会实际执行训练或推理。
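例如,在原有任务的`yaml`配置中增加如下设置(示意):

```yaml
# 仅生成分布式策略文件,任务在保存策略文件后立即退出,不执行训练或推理
only_save_strategy: True
```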
-#### 单进程转换
+**单进程转换**
使用[mindformers/tools/ckpt_transform/transform_checkpoint.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/ckpt_transform/transform_checkpoint.py)对载入权重进行单进程转换。
@@ -121,7 +252,7 @@ python transform_checkpoint.py \
--dst_strategy /worker/mindformers/output/strategy/
```
-#### 多进程转换
+**多进程转换**
使用[mindformers/tools/ckpt_transform/transform_checkpoint.sh](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/ckpt_transform/transform_checkpoint.sh)对载入权重进行多进程转换。
@@ -140,13 +271,13 @@ bash transform_checkpoint.sh \
- 使用[transform_checkpoint.sh](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/ckpt_transform/transform_checkpoint.sh)脚本时,参数`8`表示目标设备数,参数`2`表示使用2个进程进行转换。
-## 特殊场景
+### 特殊场景
-### 物理机多机多卡训练
+#### 物理机多机多卡训练
大规模模型通常需要通过多台服务器组成的集群进行训练。在这种多机多卡的场景下,如果服务器之间存在共享盘,则可以使用自动转换功能,否则只能使用离线转换。下面以两台服务器、16卡训练为例进行说明。
-#### 场景一:服务器之间有共享盘
+**场景一:服务器之间有共享盘**
在服务器之间有共享盘的场景下,可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。假设 `/data` 为服务器的共享盘,且 MindSpore Transformers 的工程代码位于 `/data/mindformers` 路径下。
@@ -209,7 +340,7 @@ bash transform_checkpoint.sh \
16 8 ${ip} ${port} 1 output/msrun_log False 300
```
-#### 场景二:服务器之间无共享盘
+**场景二:服务器之间无共享盘**
在服务器之间无共享盘的情况下,需要使用离线权重转换工具进行权重转换。以下步骤描述了如何进行离线权重转换,并启动多机多卡训练任务。
@@ -282,15 +413,15 @@ bash transform_checkpoint.sh \
only_save_strategy: False
```
-### ModelArts 训练
+#### ModelArts 训练
在 ModelArts 环境中进行训练与物理机上的多机多卡训练类似,同样支持开启权重自动转换。用户可以通过在训练作业的超参数中配置`auto_trans_ckpt=True`来启用自动权重转换,并通过设置`transform_process_num > 1`来开启多进程转换。
**注意**:如果 ModelArts 资源池中的服务器节点NPU卡数不是8,则需要额外配置`npu_num_per_node=节点NPU卡数`。例如,如果每个节点配有16个NPU,则应设置`npu_num_per_node=16`。
-## LoRA权重合并
+### LoRA权重合并
-### 概述
+#### 概述
LoRA(Low-Rank Adaptation)的基本原理是对原始模型的参数进行低秩重参数化。合并LoRA权重的核心过程是将 LoRA 分支的参数进行计算,并叠加到对应的模型参数中,使最终得到的权重文件的参数列表与原始模型一致,不包含额外的LoRA参数。这一操作不会对推理结果产生任何影响,因此合并后的模型在推理时依然能够保持与原始模型一致的性能。
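其核心计算可以用下面的示意代码表达(张量形状与数值均为假设,仅演示合并公式,并非实际脚本实现):

```python
import numpy as np

# 假设:原始权重 W 形状为 (out, in),LoRA 低秩分支 A 形状为 (r, in)、B 形状为 (out, r)
W = np.random.randn(1024, 1024).astype(np.float32)
A = np.random.randn(8, 1024).astype(np.float32)
B = np.random.randn(1024, 8).astype(np.float32)
lora_alpha, lora_rank = 16, 8

# 合并:W' = W + (lora_alpha / lora_rank) * B @ A
# 合并后参数列表与原始模型一致,不再包含额外的 LoRA 参数
W_merged = W + (lora_alpha / lora_rank) * (B @ A)
```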
有关 LoRA 的详细原理和实现,请参阅以下资源:
@@ -298,7 +429,7 @@ LoRA(Low-Rank Adaptation)的基本原理是对原始模型的参数进行低
- 论文: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- GitHub: [https://github.com/microsoft/LoRA](https://github.com/microsoft/LoRA)
-### 使用说明
+#### 使用说明
使用MindSpore Transformers提供的[LoRA权重合并脚本](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/transform_ckpt_lora.py),按照如下方式进行LoRA权重合并。
@@ -311,7 +442,7 @@ python mindformers/tools/transform_ckpt_lora.py \
--lora_scaling lora_alpha/lora_rank
```
-#### 参数说明
+**参数说明**
- **src_ckpt_strategy**:源权重对应的分布式策略文件路径,通常在启动训练任务后默认保存在 `output/strategy/` 目录下。如果源权重为完整权重,则无需填写此参数;如果为分布式权重,需根据以下情况填写:
- **源权重开启了流水线并行**:权重转换基于合并的策略文件,填写分布式策略文件夹路径。脚本会自动将文件夹内的所有 `ckpt_strategy_rank_x.ckpt` 文件合并,并在文件夹下生成 `merged_ckpt_strategy.ckpt`。如果已经存在 `merged_ckpt_strategy.ckpt`,可以直接填写该文件的路径。
@@ -323,9 +454,9 @@ python mindformers/tools/transform_ckpt_lora.py \
- **prefix**:目标权重文件的命名前缀,默认值为 "checkpoint_",即目标权重将按照 `model_dir/rank_x/checkpoint_x.ckpt` 格式保存。
- **lora_scaling**:LoRA 权重的合并系数,默认为 `lora_alpha/lora_rank`,这两个参数即为 LoRA 模型配置时的参数,需自行计算。
-### 示例
+#### 示例
-#### 场景一:包含 LoRA 参数的完整权重
+**场景一:包含 LoRA 参数的完整权重**
如果合并前的权重是完整的权重文件,可以按照以下方式填写参数(直接输入完整权重的路径):
@@ -337,7 +468,7 @@ python mindformers/tools/transform_ckpt_lora.py \
--lora_scaling lora_alpha/lora_rank
```
-#### 场景二:包含 LoRA 参数的分布式权重
+**场景二:包含 LoRA 参数的分布式权重**
如果合并前的权重是分布式的权重文件,可以按照以下方式填写参数(需输入分布式权重文件夹路径和分布式策略文件夹路径),最后得到的权重会自动合并为完整的权重文件:
@@ -349,73 +480,3 @@ python mindformers/tools/transform_ckpt_lora.py \
--prefix "checkpoint_" \
--lora_scaling lora_alpha/lora_rank
```
-
-## Safetensors权重离线合并
-
-### 使用说明
-
-使用MindSpore Transformers提供的[safetensors权重合并脚本](https://gitee.com/mindspore/mindformers/blob/dev/toolkit/safetensors/unified_safetensors.py),按照如下方式进行safetensors权重合并。
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --has_redundancy has_redundancy
-```
-
-#### 参数说明
-
-- **src_strategy_dirs**:源权重对应的分布式策略文件路径,通常在启动训练任务后默认保存在 `output/strategy/` 目录下。分布式权重需根据以下情况填写:
- - **源权重开启了流水线并行**:权重转换基于合并的策略文件,填写分布式策略文件夹路径。脚本会自动将文件夹内的所有 `ckpt_strategy_rank_x.ckpt` 文件合并,并在文件夹下生成 `merged_ckpt_strategy.ckpt`。如果已经存在 `merged_ckpt_strategy.ckpt`,可以直接填写该文件的路径。
- - **源权重未开启流水线并行**:权重转换可基于任一策略文件,填写任意一个 `ckpt_strategy_rank_x.ckpt` 文件的路径即可。
-
- **注意**:如果策略文件夹下已存在 `merged_ckpt_strategy.ckpt` 且仍传入文件夹路径,脚本会首先删除旧的 `merged_ckpt_strategy.ckpt`,再合并生成新的 `merged_ckpt_strategy.ckpt` 以用于权重转换。因此,请确保该文件夹具有足够的写入权限,否则操作将报错。
-- **mindspore_ckpt_dir**:分布式权重路径,请填写源权重所在文件夹的路径,源权重应按 `model_dir/rank_x/xxx.safetensors` 格式存放,并将文件夹路径填写为 `model_dir`。
-- **output_dir**:目标权重的保存路径,默认值为 "/new_llm_data/******/ckpt/nbg3_31b/tmp",即目标权重将放置在 `/new_llm_data/******/ckpt/nbg3_31b/tmp` 目录下。
-- **file_suffix**:目标权重文件的命名后缀,默认值为 "1_1",即目标权重将按照 `*1_1.safetensors` 格式查找。
-- **has_redundancy**:合并的源权重是否是冗余的权重,默认为 `True`。
-- **filter_out_param_prefix**:合并权重时可自定义过滤掉部分参数,过滤规则以前缀名匹配。如优化器参数"adam_"。
-- **max_process_num**:合并最大进程数。默认值:64。
-
-### 示例
-
-#### 场景一:去除冗余的safetensors权重
-
-如果合并去除冗余的safetensors权重,可以按照以下方式填写参数:
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --has_redundancy True
-```
-
-#### 场景二:不去除冗余的safetensors权重
-
-如果合并非去除冗余的safetensors权重,可以按照以下方式填写参数:
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --has_redundancy False
-```
-
-#### 场景三:过滤Adam优化器的safetensors权重
-
-如果合并过滤Adam优化器的safetensors权重,可以按照以下方式填写参数:
-
-```shell
-python toolkit/safetensors/unified_safetensors.py \
- --src_strategy_dirs src_strategy_path_or_dir \
- --mindspore_ckpt_dir mindspore_ckpt_dir\
- --output_dir output_dir \
- --file_suffix "1_1" \
- --filter_out_param_prefix "adam_"
-```
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/feature/configuration.md b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
index b027cb38b36bb2952a529cf59f0ae923f03e9670..b4341099138536440acff0f6ff042114ea9ee9f1 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/configuration.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/configuration.md
@@ -14,18 +14,18 @@ MindSpore Transformers提供的`YAML`文件中包含对于不同功能的配置
基础配置主要用于指定MindSpore随机种子以及加载权重的相关设置。
-| 参数 | 说明 | 类型 |
-|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
-| seed | 设置全局种子,详情可参考[mindspore.set_seed](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_seed.html)。 | int |
-| run_mode | 设置模型的运行模式,可选`train`、`finetune`、`eval`或`predict`。 | str |
-| output_dir | 设置保存log、checkpoint、strategy等文件的路径。 | str |
-| load_checkpoint | 加载权重的文件或文件夹路径,目前有3个应用场景:
1. 支持传入完整权重文件路径。
2. 支持传入离线切分后的权重文件夹路径。
3. 支持传入包含lora权重和base权重的文件夹路径。
各种权重的获取途径可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。 | str |
-| auto_trans_ckpt | 是否开启分布式权重自动切分与合并功能,详情可参考[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html)。 | bool |
+| 参数 | 说明 | 类型 |
+|----------------------|------------------------------------------------------------|----------------|
+| seed | 设置全局种子,详情可参考[mindspore.set_seed](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_seed.html)。 | int |
+| run_mode | 设置模型的运行模式,可选`train`、`finetune`、`eval`或`predict`。 | str |
+| output_dir | 设置保存log、checkpoint、strategy等文件的路径。 | str |
+| load_checkpoint | 加载权重的文件或文件夹路径,目前有3个应用场景:<br>1. 支持传入完整权重文件路径。<br>2. 支持传入离线切分后的权重文件夹路径。<br>3. 支持传入包含lora权重和base权重的文件夹路径。<br>各种权重的获取途径可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | str |
+| auto_trans_ckpt | 是否开启分布式权重自动切分与合并功能,详情可参考[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | bool |
| resume_training | 是否开启断点续训功能,详情可参考[断点续训功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html#%E6%96%AD%E7%82%B9%E7%BB%AD%E8%AE%AD)。 | bool |
-| load_ckpt_format | 加载的模型权重的格式,可选`ckpt`、`safetensors`。 | str |
-| remove_redundancy | 加载的模型权重是否去除了冗余。默认值为`False`。 | bool |
-| train_precision_sync | 训练确定性计算开关。默认值为`None` 。 | Optional[bool] |
-| infer_precision_sync | 推理确定性计算开关。默认值为`None`。 | Optional[bool] |
+| load_ckpt_format | 加载的模型权重的格式,可选`ckpt`、`safetensors`。 | str |
+| remove_redundancy | 加载的模型权重是否去除了冗余。默认值为`False`。 | bool |
+| train_precision_sync | 训练确定性计算开关。默认值为`None` 。 | Optional[bool] |
+| infer_precision_sync | 推理确定性计算开关。默认值为`None`。 | Optional[bool] |
### Context配置
diff --git a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
index 6e33d1143ddfb88a87e38306258ec1ba639de253..08199f187e62c4e18502059d2b6ec796c0400d33 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
@@ -4,9 +4,10 @@
## 概述
-Safetensors 是 Huggingface 推出的一种可靠、易移植的机器学习模型存储格式,用于安全地存储Tensor,而且存储速度较快(零拷贝)。本文主要介绍MindSpore Transformers如何支持该文件格式的保存与加载,帮助用户更好更快地使用权重。
+Safetensors 是 Huggingface 推出的一种可靠、易移植的机器学习模型存储格式,用于安全地存储Tensor,而且存储速度较快(零拷贝)。
+本文主要介绍safetensors权重的几种格式类型,以及MindSpore Transformers对该格式权重的保存与加载、去冗余等权重特性、分布式切分与合并、权重格式转换等能力的支持,帮助用户更好更快地使用权重。
-## Safetensors权重示例
+## 权重示例
Safetensors文件主要分为两种类型:完整权重文件和分布式权重文件。以下是它们的获取方式及对应的文件示例。
@@ -15,7 +16,7 @@ Safetensors文件主要分为两种类型:完整权重文件和分布式权重
Safetensors完整权重可通过以下两种方式获取:
1. 直接从Huggingface上下载。
-2. 通过MindSpore Transformers分布式训练后,通过[合并脚本](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html#safetensors%E6%9D%83%E9%87%8D%E7%A6%BB%E7%BA%BF%E5%90%88%E5%B9%B6)生成完整权重。
+2. 通过MindSpore Transformers分布式训练后,通过[合并脚本](#权重合并)生成完整权重。
Huggingface Safetensors示例目录结构:
@@ -63,25 +64,477 @@ qwen2_7b
└── qwen2_7b_rank_x.safetensors
```
-## 配置说明
+## 权重保存
+
+### 概述
+
+在深度学习模型的训练过程中,保存模型的权重是至关重要的一步。权重保存功能使得我们能够在训练的任意阶段存储模型的参数,以便用户在训练中断或完成后进行恢复、继续训练、评估或部署。同时还可以通过保存权重的方式,在不同环境下复现实验结果。
+
+目前,MindSpore Transformers 支持 [safetensors](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/safetensors.html) 格式的权重文件读取和保存。
+
+### 目录结构
+
+在训练过程中,MindSpore Transformers 默认会在输出目录(同训练日志,默认为 `./output` )中生成权重保存文件夹: `checkpoint` 。
+
+在 yaml 中设置配置项 `save_network_params: True` 后,会额外生成权重保存文件夹 `checkpoint_network` 。
+
+| 文件夹 | 描述 |
+| ------------------ | ------------------------------------------------------------ |
+| checkpoint | 保存模型权重、优化器状态、step 和 epoch 于 safetensors 文件中,可用于**断点恢复训练**。 |
+| checkpoint_network | 仅保存模型权重参数于 safetensors 文件中,适用于后续进行微调、推理、评测,不支持断点续训。 |
+
+#### checkpoint目录结构
+
+以一个 8 卡任务为例,`output` 文件夹中的权重文件按如下格式保存:
+
+```text
+output
+ ├── checkpoint
+ ├── rank_0
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+ ...
+ └── rank_7
+ ├── meta.json
+ └── {prefix}-{epoch}_{step}.safetensors
+ └──checkpoint_network
+ ├── rank_0
+ └── {prefix}-{epoch}_{step}.safetensors
+ ...
+ └── rank_7
+ └── {prefix}-{epoch}_{step}.safetensors
+```
+
+权重相关文件说明
+
+| 文件 | 描述 |
+| ----------------------------------- | ------------------------------------------------------------ |
+| meta.json | 记录最后保存的权重的 `epoch` 、 `step` 和权重名,每个 rank 进程独立维护一个 `meta.json` 文件。 |
+| {prefix}-{epoch}_{step}.safetensors | 保存的权重文件, `prefix` 包含 rank_id 信息,格式为 `{prefix}-{epoch}_{step}.safetensors` 。如果前缀相同的文件已经存在,系统会自动递增后缀。<br>开启数据下沉时, `epoch` 位置计算方式为 $\frac{CurrentTotalStepNumber}{SinkSize} = \frac{((CurrentEpoch-1)*StepsPerEpoch+CurrentStepInEpoch)}{SinkSize}$,`step` 固定为 `sink_size` 。 |
+
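+举例说明(数值与文件名前缀均为假设):
+
+```python
+# 开启数据下沉时,权重文件名中 epoch、step 字段的计算示意
+sink_size = 2
+steps_per_epoch = 100
+current_epoch, current_step_in_epoch = 2, 50
+
+current_total_step = (current_epoch - 1) * steps_per_epoch + current_step_in_epoch  # 150
+epoch_field = current_total_step // sink_size  # 75
+step_field = sink_size                         # step 固定为 sink_size
+print(f"llm-{epoch_field}_{step_field}.safetensors")  # llm-75_2.safetensors
+```
+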
+### 配置与使用
+
+#### YAML参数配置
-加载相关配置:
+用户可通过修改配置文件来控制权重保存的行为。以下是主要参数:
-| 参数名称 | 说明 |
-| ------------------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| load_checkpoint | 预加载权重的文件夹路径。
- 如果是完整权重,填写切片/单个权重文件所在文件夹路径。
注:支持Huggingface safetensor权重加载(当前仅支持Llama系列模型)。在线加载过程中,会保存一份转换后的MindSpore safetensor权重文件至`/output/ms_safetensors下`。
- 如果是分布式权重,需按照`model_dir/rank_x/xxx.safetensor`格式存放,文件夹路径填写为`model_dir`。 |
-| load_ckpt_format | 加载的模型权重的格式,可选`ckpt`、`safetensors`,默认为`ckpt`。
加载权重为`safetensors`格式时,需配套修改此配置为`safetensors`。 |
-| auto_trans_ckpt | 是否开启在线切分功能。
- 如果加载权重是完整权重:
a. `use_parallel: True`时,判断为分布式加载,需同步设置`auto_trans_ckpt: True`,开启在线切分功能。
b. `use_parallel: False`时,判断为单卡加载,需同步设置`auto_trans_ckpt: False`,关闭在线切分功能。
- 如果加载权重是分布式权重:
a. 不改变原有切分策略,需设置`auto_trans_ckpt: False`,直接按原先切分策略直接加载。
b. 改变原有切分策略,需设置`auto_trans_ckpt: True` 并配置`src_strategy_path_or_dir`为原有切分策略文件路径。
任务拉起时,会将权重在线合并为完整权重,并依据配置文件中设定的并行策略进行切分与加载。在线合并的完整权重会保存在当前目录`/output/unified_checkpoint`文件下。 |
-| remove_redundancy | 加载的权重是否为去冗余后的权重,默认为`False`。 |
+用户可修改 `yaml` 配置文件中 `CheckpointMonitor` 下的字段来控制权重保存行为。
-保存相关配置:
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L206) 为例,可做如下配置:
-| 参数名称 | 说明 |
-| :-------------------------- | ------------------------------------------------------------ |
-| callbacks.checkpoint_format | 保存的模型权重的格式,默认值为`ckpt`。可选`ckpt`,`safetensors`。 |
-| callbacks.remove_redundancy | 保存权重时是否开启去冗余保存功能,默认为`False`。仅支持`safetensors格式`。 |
+```yaml
+# callbacks
+callbacks:
+ ...
+ - type: CheckpointMonitor
+ prefix: "deepseekv3"
+ save_checkpoint_steps: 1000
+ integrated_save: False
+ async_save: False
+ checkpoint_format: "safetensors"
+ ...
+```
+
+该配置的含义为:每隔 1000 步保存一次 safetensors 权重,最多同时存储 5 个权重(`keep_checkpoint_max` 默认值),并行场景下不合并保存拆分的 Tensor,且不使用异步方式保存权重文件。
+
+有关保存权重配置的主要参数如下表所列:
+
+| 参数 | 描述 | 取值说明 |
+| --------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| prefix | 模型权重文件的前缀名,可用于指代模型名字。 | (str, 可选) - 默认值: `"CKP"` 。 |
+| save_checkpoint_steps | 每训练多少步保存一次权重。 | (int, 可选) - 默认值: `1` ,不设置时不保存模型权重。 |
+| keep_checkpoint_max | 最多同时保存多少个权重文件,达到上限后会在保存权重时删除最旧的权重文件。 | (int, 可选) - 默认值: `5` ,不设置时不对文件夹下权重数量进行监控和删除。 |
+| integrated_save | 在并行场景下是否合并保存拆分的 Tensor。合并保存功能仅支持在自动并行场景中使用,在手动并行场景中不支持。 | (bool, 可选) - 默认值: `False` |
+| async_save | 是否使用异步方式保存 safetensors 文件。 | (bool, 可选) - `True` 时默认使用异步线程,默认值: `False` 。 |
+| checkpoint_format | 输出文件的格式,需要配置为 `safetensors` 。 | (str, 可选) - 模型权重保存的格式。支持 `"ckpt"` 、 `"safetensors"` 。默认值: `ckpt` 。(注意: ckpt 格式将在后续版本中日落,推荐使用 safetensors 格式。) |
+| remove_redundancy | 保存模型权重时是否去除冗余。 | (bool, 可选) - 默认值: `False` 。 |
+| save_network_params | 是否仅额外保存网络参数。 | (bool, 可选) - 是否仅额外保存网络参数。默认值: `False` 。 |
+
+如果您想了解更多有关 CheckpointMonitor 的知识,可以参考 [CheckpointMonitor API 文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/core/mindformers.core.CheckpointMonitor.html)。
+
+## 权重加载
+
+### 概述
+
+MindSpore Transformers支持训练、推理、续训任务在单卡/多卡全场景下的权重加载,包括完整权重和分布式权重。可参考以下说明,针对相应场景调整配置。
+
+### 配置说明
+
+| 参数名称 | 说明 |
+| ---------------- | ------------------------------------------------------------ |
+| load_checkpoint | 预加载权重的文件夹路径。<br>- 如果是完整权重,填写切片/单个权重文件所在文件夹路径。<br>注:支持Huggingface safetensors权重加载(当前仅支持Llama系列模型)。在线加载过程中,会保存一份转换后的MindSpore safetensors权重文件至`/output/ms_safetensors`下。<br>- 如果是分布式权重,需按照`model_dir/rank_x/xxx.safetensors`格式存放,文件夹路径填写为`model_dir`。 |
+| load_ckpt_format | 加载的模型权重的格式,可选`ckpt`、`safetensors`,默认为`ckpt`。<br>加载权重为`safetensors`格式时,需配套修改此配置为`safetensors`。 |
+| use_parallel | 是否并行加载。 |
+| auto_trans_ckpt | 是否开启在线切分功能。<br>- 如果加载权重是完整权重:<br>a. `use_parallel: True`时,判断为分布式加载,需同步设置`auto_trans_ckpt: True`,开启在线切分功能。<br>b. `use_parallel: False`时,判断为单卡加载,需同步设置`auto_trans_ckpt: False`,关闭在线切分功能。<br>- 如果加载权重是分布式权重:<br>a. 不改变原有切分策略,需设置`auto_trans_ckpt: False`,按原有切分策略直接加载。<br>b. 改变原有切分策略,需设置`auto_trans_ckpt: True`,并配置`src_strategy_path_or_dir`为原有切分策略文件路径。<br>任务拉起时,会将权重在线合并为完整权重,并依据配置文件中设定的并行策略进行切分与加载。在线合并的完整权重会保存在当前目录的`/output/unified_checkpoint`文件夹下。 |
+
+### 完整权重加载
+
+#### 单卡加载
+
+```yaml
+# 配置文件
+load_checkpoint: '/qwen2_7b/unified_safetensors' # 加载完整权重文件路径
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+auto_trans_ckpt: False # 完整权重+单卡加载时需关闭此配置项
+use_parallel: False # 单卡加载
+parallel_config: # 配置目标分布式策略
+ data_parallel: 1
+ model_parallel: 1
+ pipeline_stage: 1
+```
+
+#### 多卡加载
+
+```yaml
+# 配置文件
+load_checkpoint: '/qwen2_7b/unified_safetensors' # 加载完整权重文件路径
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+auto_trans_ckpt: True # 完整权重+分布式加载时需打开此配置项,开启在线切分功能
+use_parallel: True # 多卡加载
+parallel_config: # 配置目标分布式策略
+ data_parallel: 1
+ model_parallel: 4
+ pipeline_stage: 1
+```
+
+### 分布式权重加载
+
+#### 多卡加载-原有切分策略
+
+```yaml
+# 配置文件
+load_checkpoint: '/output/distributed_safetensors' # 加载源分布式权重文件路径
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+auto_trans_ckpt: False # 关闭在线切分功能
+parallel_config: # 配置目标分布式策略
+ data_parallel: 2
+ model_parallel: 4
+ pipeline_stage: 1
+```
+
+#### 多卡加载-改变切分策略
+
+```yaml
+# 配置文件
+load_checkpoint: '/output/distributed_safetensors' # 加载源分布式权重文件路径
+src_strategy_path_or_dir: '/output/src_strategy' # 加载源策略文件,用于合并源分布式权重为完整权重
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+auto_trans_ckpt: True # 开启在线切分功能
+parallel_config: # 配置目标分布式策略
+ data_parallel: 4
+ model_parallel: 2
+ pipeline_stage: 1
+```
+
+大集群规模场景下,为避免在线合并过程耗时过长、占用训练资源,推荐先将原分布式权重文件离线[合并为完整权重](#权重合并)后再传入,此时无需传入源切分策略文件路径。
+
+### 特殊场景
+
+#### 物理机多机多卡训练
+
+大规模模型通常需要通过多台服务器组成的集群进行训练。权重切分转换需要依赖编译完成后的目标切分策略文件。在这种多机多卡的场景下,如果服务器之间存在共享盘,生成的策略文件位于同一目录下,则可以使用自动转换功能;如果服务器之间无共享盘,则需要手动复制策略文件后再进行转换。下面以两台服务器、16卡训练为例进行说明。
+
+**场景一:服务器之间有共享盘**
+
+在服务器之间有共享盘的场景下,可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。假设 `/data` 为服务器的共享盘,且 MindSpore Transformers 的工程代码位于 `/data/mindformers` 路径下。
+
+**参数配置:**
+
+```yaml
+output_dir: './output' # 策略文件会生成在./output/strategy下,用于权重在线切分
+load_checkpoint: '/qwen2_7b/unified_safetensors' # 加载完整权重文件路径
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+auto_trans_ckpt: True # 完整权重+分布式加载时需打开此配置项,开启在线切分功能
+train_dataset: &train_dataset
+ data_loader:
+ type: MindDataset
+ dataset_dir: "/worker/dataset/wiki103/"
+ shuffle: True
+parallel_config: # 配置16卡分布式策略(仅供参考)
+ data_parallel: 2
+ model_parallel: 4
+ pipeline_stage: 2
+ micro_batch_num: 2
+ vocab_emb_dp: True
+ gradient_aggregation_group: 4
+ micro_batch_interleave_num: 1
+```
+
+**启动任务**:
+
+使用[mindformers/scripts/msrun_launcher.sh](https://gitee.com/mindspore/mindformers/blob/dev/scripts/msrun_launcher.sh)进行任务启动。
+
+ ```shell
+ # 第一台服务器(主节点)
+ bash scripts/msrun_launcher.sh "run_mindformer.py \
+ --config {CONFIG_PATH} \
+ --run_mode train" \
+ 16 8 ${ip} ${port} 0 output/msrun_log False 300
+ # 第二台服务器(子节点)
+ bash scripts/msrun_launcher.sh "run_mindformer.py \
+ --config {CONFIG_PATH} \
+ --run_mode train" \
+ 16 8 ${ip} ${port} 1 output/msrun_log False 300
+ ```
+
+**场景二:服务器之间无共享盘**
+
+在服务器之间无共享盘的情况下,需要对生成的策略文件进行离线合并和分发操作后,再使能在线切分功能。以下步骤描述了如何进行该操作,并启动多机多卡训练任务。
+
+**1.获取分布式策略**
+
+在进行离线权重转换前,首先需要获取各节点的分布式策略文件。
+
+```yaml
+ # 设置 only_save_strategy 为 True 以获取分布式策略文件,生成后任务自动退出
+ only_save_strategy: True
+
+ # 配置数据集路径
+ train_dataset: &train_dataset
+ data_loader:
+ type: MindDataset
+ dataset_dir: "/worker/dataset/wikitext_2048/"
+ shuffle: True
+
+ # 配置16卡分布式策略(仅供参考)
+ parallel_config:
+ data_parallel: 2
+ model_parallel: 4
+ pipeline_stage: 2
+ micro_batch_num: 2
+ vocab_emb_dp: True
+ gradient_aggregation_group: 4
+ micro_batch_interleave_num: 1
+```
+
+各节点的策略文件将分别保存在各自的`output/strategy`目录中。例如,节点0仅保存`ckpt_strategy_rank_0-7.ckpt`文件,节点1仅保存`ckpt_strategy_rank_8-15.ckpt`文件。随后,需将所有节点的策略文件集中到同一台服务器上,以便进行后续操作,集中后的目录及文件如下。
+
+```text
+output
+ ├── strategy
+ ├── ckpt_strategy_rank_0.ckpt
+ ...
+ ├── ckpt_strategy_rank_7.ckpt
+ ├── ckpt_strategy_rank_8.ckpt
+ ...
+ └── ckpt_strategy_rank_15.ckpt
+```
+
+**2.合并分布式策略**
+
+调用MindSpore提供的[策略合并接口](https://www.mindspore.cn/docs/zh-CN/master/api_python/parallel/mindspore.parallel.merge_pipeline_strategys.html)将集中后的所有策略文件合并成一个文件,用于后续权重切分。
+
+```python
+import mindspore as ms
+ms.parallel.merge_pipeline_strategys("/output/strategy", "/output/merged_strategy/dst_strategy.ckpt")
+```
+
+**3.权重切分加载**
+
+**分发策略文件+在线切分(推荐):**
+
+将合并后的策略文件`dst_strategy.ckpt`分发到各个节点下的`./output/merged_strategy/`目录下,打开自动切分功能,重新拉起训练任务。每个节点的配置文件均需要修改。
+
+```yaml
+output_dir: './output' # 确保每个节点下./output/merged_strategy/都有合并完后的策略文件
+load_checkpoint: '/qwen2_7b/unified_safetensors' # 加载完整权重文件路径
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+auto_trans_ckpt: True # 完整权重+分布式加载时需打开此配置项,开启在线切分功能
+```
-## 使用示例
+**离线切分+分发分布式权重:**
+
+根据[权重切分](#权重切分)指南,先将完整权重离线切分成分布式权重文件,再分发到各台机器,关闭自动切分功能,配置`load_checkpoint`为分布式权重路径。每个节点的配置文件均需要修改。
+
+因为分布式权重文件一般比策略文件大,分发操作更耗时,更推荐第一种方式。
+
+```yaml
+load_checkpoint: '/output/distributed_safetensors' # 加载分布式权重文件路径
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+auto_trans_ckpt: False # 分布式权重加载,关闭在线切分功能
+```
+
+**4.启动任务**:
+
+使用[mindformers/scripts/msrun_launcher.sh](https://gitee.com/mindspore/mindformers/blob/dev/scripts/msrun_launcher.sh)进行任务启动。
+
+ ```shell
+ # 第一台服务器(主节点)
+ bash scripts/msrun_launcher.sh "run_mindformer.py \
+ --config {CONFIG_PATH} \
+ --run_mode train" \
+ 16 8 ${ip} ${port} 0 output/msrun_log False 300
+ # 第二台服务器(子节点)
+ bash scripts/msrun_launcher.sh "run_mindformer.py \
+ --config {CONFIG_PATH} \
+ --run_mode train" \
+ 16 8 ${ip} ${port} 1 output/msrun_log False 300
+ ```
+
+## 权重特性
+
+### 去冗余保存及加载
+
+当前MindSpore Transformers保存权重时,默认会在dp/opt域重复保存多份一致的权重文件,带来额外的存储开销和负担。可通过以下配置和使用方法,实现dp/opt去冗余保存和加载,有效降低千卡及以上大规模集群下的存储压力。此特性仅在分布式权重下生效,完整权重不涉及去冗余。
+
+保存时打开以下配置:
+
+```yaml
+callbacks:
+ - type: CheckpointMonitor
+ checkpoint_format: safetensors # 保存权重文件格式
+ remove_redundancy: True # 保存权重时开启去冗余
+```
+
+保存后各 rank 的权重文件大小不再相同,权重文件总大小小于去冗余功能开启前:
+
+```text
+output
+ ├── checkpoint
+ ├── rank_0
+ └── example-1_1.safetensors # 文件大小:5.2G
+ ├── rank_1
+ └── example-1_1.safetensors # 文件大小:5.2G
+ ...
+ ├── rank_6
+ └── example-1_1.safetensors # 文件大小:4.1G
+ └── rank_7
+ └── example-1_1.safetensors # 文件大小:4.1G
+```
+
+加载时打开以下配置:
+
+```yaml
+load_ckpt_format: 'safetensors' # 加载权重文件格式
+remove_redundancy: True # 加载权重时开启去冗余
+```
+
+> MindSpore Transformers 1.5.0及以下版本中,当去冗余保存和加载的配置项不一致时,可能导致精度异常,请确保两者配置一致。1.5.0以上版本会根据传入的权重自动识别是否去冗余并加载,无需关注加载配置。
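+
+如需粗略确认一份分布式权重是否为去冗余保存,可统计各 rank 权重文件中的参数数量(示意代码,目录结构与路径均为假设):
+
+```python
+import os
+from safetensors.numpy import load_file
+
+ckpt_dir = "./output/checkpoint"  # 假设:权重按 rank_x/xxx.safetensors 格式存放
+for rank_dir in sorted(os.listdir(ckpt_dir)):
+    rank_path = os.path.join(ckpt_dir, rank_dir)
+    for file_name in os.listdir(rank_path):
+        if file_name.endswith(".safetensors"):
+            params = load_file(os.path.join(rank_path, file_name))
+            # 去冗余保存时,各 rank 的参数数量与文件大小通常不同
+            print(rank_dir, file_name, len(params))
+```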
+
+## 权重切分与合并
+
+### 概述
+
+在当前的分布式训练和推理环境中,当用户需要改变分布式策略时,需要先将已有的分布式权重合并成完整权重,再通过在线切分/离线切分的方式完成权重加载。为满足不同场景下的权重转换需求,可以参考下面的脚本和接口,实现多卡权重合并为单卡权重,以及单卡权重切分为多卡权重的功能。
+
+### 权重合并
+
+#### 使用说明
+
+使用MindSpore Transformers提供的[safetensors权重合并脚本](https://gitee.com/mindspore/mindformers/blob/dev/toolkit/safetensors/unified_safetensors.py),按照如下方式进行safetensors权重合并。合并后的权重格式为[完整权重](#完整权重)。
+
+```shell
+python toolkit/safetensors/unified_safetensors.py \
+ --src_strategy_dirs src_strategy_path_or_dir \
+ --mindspore_ckpt_dir mindspore_ckpt_dir \
+ --output_dir output_dir \
+ --file_suffix "1_1" \
+ --has_redundancy has_redundancy
+```
+
+#### 参数说明
+
+- **src_strategy_dirs**:源权重对应的分布式策略文件路径,通常在启动训练任务后默认保存在 `output/strategy/` 目录下。分布式权重需根据以下情况填写:
+
+ - **源权重开启了流水线并行**:权重转换基于合并的策略文件,填写分布式策略文件夹路径。脚本会自动将文件夹内的所有 `ckpt_strategy_rank_x.ckpt` 文件合并,并在文件夹下生成 `merged_ckpt_strategy.ckpt`。如果已经存在 `merged_ckpt_strategy.ckpt`,可以直接填写该文件的路径。
+ - **源权重未开启流水线并行**:权重转换可基于任一策略文件,填写任意一个 `ckpt_strategy_rank_x.ckpt` 文件的路径即可。
+
+ **注意**:如果策略文件夹下已存在 `merged_ckpt_strategy.ckpt` 且仍传入文件夹路径,脚本会首先删除旧的 `merged_ckpt_strategy.ckpt`,再合并生成新的 `merged_ckpt_strategy.ckpt` 以用于权重转换。因此,请确保该文件夹具有足够的写入权限,否则操作将报错。
+- **mindspore_ckpt_dir**:分布式权重路径,请填写源权重所在文件夹的路径,源权重应按 `model_dir/rank_x/xxx.safetensors` 格式存放,并将文件夹路径填写为 `model_dir`。
+- **output_dir**:目标权重的保存路径,默认值为 "/new_llm_data/******/ckpt/nbg3_31b/tmp",即目标权重将放置在 `/new_llm_data/******/ckpt/nbg3_31b/tmp` 目录下。
+- **file_suffix**:目标权重文件的命名后缀,默认值为 "1_1",即目标权重将按照 `*1_1.safetensors` 格式查找。
+- **has_redundancy**:合并的源权重是否是冗余的权重,默认为 `True`。
+- **filter_out_param_prefix**:合并权重时可自定义过滤掉部分参数,过滤规则以前缀名匹配。如优化器参数"adam_"。
+- **max_process_num**:合并最大进程数。默认值:64。
+
+#### 示例
+
+场景一:
+
+如果待合并的safetensors权重为去冗余保存的权重,可以按照以下方式填写参数:
+
+```shell
+python toolkit/safetensors/unified_safetensors.py \
+ --src_strategy_dirs src_strategy_path_or_dir \
+ --mindspore_ckpt_dir mindspore_ckpt_dir \
+ --output_dir output_dir \
+ --file_suffix "1_1" \
+ --has_redundancy False
+```
+
+场景二:
+
+如果希望在合并权重时过滤掉Adam优化器相关参数,可以按照以下方式填写参数:
+
+```shell
+python toolkit/safetensors/unified_safetensors.py \
+ --src_strategy_dirs src_strategy_path_or_dir \
+ --mindspore_ckpt_dir mindspore_ckpt_dir \
+ --output_dir output_dir \
+ --file_suffix "1_1" \
+ --filter_out_param_prefix "adam_"
+```
+
+### 权重切分
+
+#### 使用说明
+
+使用MindSpore提供的[策略合并接口](https://www.mindspore.cn/docs/zh-CN/master/api_python/parallel/mindspore.parallel.merge_pipeline_strategys.html)和[切分保存接口](https://www.mindspore.cn/docs/zh-CN/master/api_python/parallel/mindspore.parallel.load_distributed_checkpoint.html),按照如下方式进行safetensors权重离线切分保存。切分后的权重格式为[分布式权重](#分布式权重)。
+
+```python
+import mindspore as ms
+# step1:合并目标切分策略文件
+ms.parallel.merge_pipeline_strategys("/output/strategy", "/output/merged_strategy/dst_strategy.ckpt")
+# step2:根据合并后的目标切分策略以及完整权重,将权重切分并保存成分布式权重
+ms.load_distributed_checkpoint(
+ network=None,
+ predict_strategy='/output/merged_strategy/dst_strategy.ckpt',
+ unified_safetensors_dir='/path/unified_safetensors',
+ dst_safetensors_dir='/path/distributed_safetensors',
+ format='safetensors',
+ max_process_num=64
+ )
+```
+
+#### 参数说明
+
+- **network** (Cell) - 分布式预测网络,format为 safetensors 时,network传递为None,此时接口执行保存模式。
+- **predict_strategy** (Union[dict, str]) - 目标切分策略文件。默认值: `None` 。
+- **unified_safetensors_dir** (str) - 完整权重文件目录。默认值: `None` 。
+- **dst_safetensors_dir** (str) - 保存模式场景下,权重的保存目录。
+- **max_process_num** (int) - 最大进程数。默认值:64。
+
+## 权重格式转换
+
+### Ckpt转换Safetensors
+
+MindSpore Transformers存量权重文件为ckpt格式,可以通过以下两种方式将其转换为safetensors格式文件。
+
+#### 接口调用
+
+直接调用[MindSpore格式转换接口](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.ckpt_to_safetensors.html)实现。
+
+```python
+import mindspore as ms
+ms.ckpt_to_safetensors("./ckpt_save_path/rank0/checkpoint_0.ckpt", "./output/safetensors_path/")
+#参数说明
+#file_path (str) - 包含 checkpoint 文件的目录路径或单个 checkpoint 文件 (.ckpt) 的路径
+#save_path (str, 可选) - 保存 safetensors 文件的目录路径。默认值:None
+```
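+
+如需批量转换按 rank 目录存放的分布式权重,可基于该接口做简单的循环处理(示意代码,路径为假设):
+
+```python
+import os
+import mindspore as ms
+
+src_root = "./ckpt_save_path"           # 假设:ckpt权重根目录,包含 rank_0、rank_1 等子目录
+dst_root = "./output/safetensors_path"  # 假设:safetensors输出根目录
+for rank_dir in sorted(os.listdir(src_root)):
+    # ckpt_to_safetensors 支持传入包含 checkpoint 文件的目录路径
+    ms.ckpt_to_safetensors(os.path.join(src_root, rank_dir),
+                           os.path.join(dst_root, rank_dir))
+```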
+
+#### 训练任务
+
+调整配置文件后启动MindSpore Transformers训练任务,通过以ckpt格式加载、以safetensors格式保存的方式实现转换。
+
+```yaml
+load_checkpoint: 'output/checkpoint/' # 加载权重文件路径
+load_ckpt_format: 'ckpt' # 加载权重文件格式为ckpt
+callbacks:
+ - type: CheckpointMonitor
+ checkpoint_format: 'safetensors' # 保存权重文件格式为safetensors
+```
+
+## 任务示例
### 预训练任务示例
@@ -237,96 +690,5 @@ callbacks:
checkpoint_format: safetensors # 保存权重文件格式
```
-大集群规模场景下,避免在线合并过程耗时过长占用训练资源,推荐将原分布式权重文件离线[合并完整权重](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html#safetensors%E6%9D%83%E9%87%8D%E7%A6%BB%E7%BA%BF%E5%90%88%E5%B9%B6)后传入,无需传入源切分策略文件路径。
-
更多详情请参考:[断点续训介绍](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html)。
-## 权重保存
-
-### 概述
-
-在深度学习模型的训练过程中,保存模型的权重是至关重要的一步。权重保存功能使得我们能够在训练的任意阶段存储模型的参数,以便用户在训练中断或完成后进行恢复、继续训练、评估或部署。同时还可以通过保存权重的方式,在不同环境下复现实验结果。
-
-目前,MindSpore TransFormer 支持 [safetensors](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/safetensors.html) 格式的权重文件读取和保存。
-
-### 目录结构
-
-在训练过程中,MindSpore Transformers 默认会在输出目录(同训练日志,默认为 `./output` )中生成权重保存文件夹: `checkpoint` 。
-
-如果在 yaml 中设置了配置项 `save_network_params: True` 后,会额外生成权重保存文件夹 `checkpoint_network` 。
-
-| 文件夹 | 描述 |
-|--------------------|-----------------------------------------------------------|
-| checkpoint | 保存模型权重、优化器状态、step 和 epoch 于 safetensors 文件中,可用于**断点恢复训练**。 |
-| checkpoint_network | 仅保存模型权重参数于 safetensors 文件中,适用于后续进行微调、推理、评测,不支持断点续训。 |
-
-#### `checkpoint`目录结构
-
-以一个 8 卡任务为例,`output` 文件夹中的权重文件按如下格式保存:
-
-```text
-output
- ├── checkpoint
- ├── rank_0
- ├── meta.json
- └── {prefix}-{epoch}_{step}.ckpt
- ...
- └── rank_7
- ├── meta.json
- └── {prefix}-{epoch}_{step}.ckpt
- └──checkpoint_network
- ├── rank_0
- └── {prefix}-{epoch}_{step}.safetensors
- ...
- └── rank_7
- └── {prefix}-{epoch}_{step}.safetensors
-```
-
-##### 权重相关文件说明
-
-| 文件 | 描述 |
-|-------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| meta.json | 记录最后保存的权重的 `epoch` 、 `step` 和权重名,每个 rank 进程独立维护一个 `meta.json` 文件。 |
-| {prefix}-{epoch}_{step}.safetensors | 保存的权重文件, `prefix` 包含 rank_id 信息,格式为 `{prefix}-{epoch}_{step}.safetensors` 。如果前缀相同的文件已经存在,系统会自动递增后缀。
开启数据下沉时, `epoch` 位置计算方式为 $\frac{CurrentTotalStepNumber}{SinkSize} = \frac{((CurrentEpoch-1)*StepsPerEpoch+CurrentStepInEpoch)}{SinkSize}$,`step` 固定为 `sink_size` 。 |
-
-### 配置与使用
-
-#### YAML参数配置
-
-用户可通过修改配置文件来控制权重保存的行为。以下是主要参数:
-
-用户可修改 `yaml` 配置文件中 `CheckpointMonitor` 下的字段来控制权重保存行为。
-
-以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L206) 为例,可做如下配置:
-
-```yaml
-# callbacks
-callbacks:
- ...
- - type: CheckpointMonitor
- prefix: "deepseekv3"
- save_checkpoint_steps: 1000
- integrated_save: False
- async_save: False
- checkpoint_format: "safetensors"
- ...
-```
-
-该配置的含义为:每隔 1000 步保存一次 safetensors 权重、最多同时存储 5 个权重、并行场景下不合并保存拆分的 Tensor、且不使用异步方式保存权重文件。
-
-##### 主要配置参数介绍
-
-有关保存权重配置的主要参数如下表所列:
-
-| 参数 | 描述 | 取值说明 |
-|-----------------------|---------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
-| prefix | 模型权重文件的前缀名,可用于指代模型名字。 | (str, 可选) - 默认值: `"CKP"` 。 |
-| save_checkpoint_steps | 每训练多少步保存一次权重。 | (int, 可选) - 默认值: `1` ,不设置时不保存模型权重。 |
-| keep_checkpoint_max | 最多同时保存多少个权重文件,达到上限后会在保存权重时删除最旧的权重文件。 | (int, 可选) - 默认值: `5` ,不设置时不对文件夹下权重数量进行监控和删除。 |
-| integrated_save | 在并行场景下是否合并保存拆分的 Tensor。合并保存功能仅支持在自动并行场景中使用,在手动并行场景中不支持。 | (bool, 可选) - 默认值: `False` |
-| async_save | 是否使用异步方式保存 safetensors 文件。 | (bool, 可选) - `True` 时默认使用异步线程,默认值: `False` 。 |
-| checkpoint_format | 输出文件的格式,需要配置为 `safetensors` 。 | (str, 可选) - 模型权重保存的格式。支持 `"ckpt"` 、 `"safetensors"` 。默认值: `ckpt` 。(注意: ckpt 格式将在后续版本中日落,推荐使用 safetensors 格式。) |
-| remove_redundancy | 保存模型权重时是否去除冗余。 | (bool, 可选) - 默认值: `False` 。 |
-| save_network_params | 是否仅额外保存网络参数。 | (bool, 可选) - 是否仅额外保存网络参数。默认值: `False` 。 |
-
-如果您想了解更多有关 CheckpointMonitor 的知识,可以关注 [CheckpointMonitor API 文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/core/mindformers.core.CheckpointMonitor.html)。
diff --git a/docs/mindformers/docs/source_zh_cn/feature/start_tasks.md b/docs/mindformers/docs/source_zh_cn/feature/start_tasks.md
index f4e983a0676a7894d86a9b4002cfefab13b1f624..2c5b8f1d579c9f5989ce4974721bf5a6d5d5c6b5 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/start_tasks.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/start_tasks.md
@@ -22,7 +22,7 @@ MindSpore Transformers提供了一键启动脚本`run_mindformer.py`和分布式
| `--device_id` | 设置执行设备ID,其值必须在可用设备范围内。 | int,可选 | 预训练/微调/推理 |
| `--device_target` | 设置后端执行设备,MindSpore Transformers仅支持在`Ascend`设备上运行。 | str,可选 | 预训练/微调/推理 |
| `--run_mode` | 设置模型的运行模式,可选`train`、`finetune`或`predict`。 | str,可选 | 预训练/微调/推理 |
-| `--load_checkpoint` | 加载的权重文件或文件夹路径,详细使用方式参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。 | str,可选 | 预训练/微调/推理 |
+| `--load_checkpoint` | 加载的权重文件或文件夹路径,详细使用方式参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | str,可选 | 预训练/微调/推理 |
| `--use_parallel` | 是否开启并行模式。 | bool,可选 | 预训练/微调/推理 |
| `--output_dir` | 设置保存日志、权重、切分策略等文件的路径。 | str,可选 | 预训练/微调/推理 |
| `--register_path` | 外挂代码所在目录的绝对路径。比如research目录下的模型目录。 | str,可选 | 预训练/微调/推理 |
@@ -33,7 +33,7 @@ MindSpore Transformers提供了一键启动脚本`run_mindformer.py`和分布式
| 参数 | 参数说明 | 取值说明 | 适用场景 |
|:----------------------------:|:-------------------------------------------------------------------------------------------------------------------|--------------------------------|-----------|
| `--src_strategy_path_or_dir` | 权重的策略文件路径。 | str,可选 | 预训练/微调/推理 |
-| `--auto_trans_ckpt` | 是否开启在线权重自动转换功能,详情可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。 | bool,可选 | 预训练/微调/推理 |
+| `--auto_trans_ckpt` | 是否开启在线权重自动转换功能,详情可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | bool,可选 | 预训练/微调/推理 |
| `--transform_process_num` | 负责权重转换的进程数。 | int,可选 | 预训练/微调/推理 |
| `--only_save_strategy` | 是否仅保存切分策略文件。 | bool,可选,为`true`时任务在保存策略文件后直接退出 | 预训练/微调/推理 |
diff --git a/docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md b/docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md
deleted file mode 100644
index 19eacbc26074a5fbeae250922c06fb3325aa6feb..0000000000000000000000000000000000000000
--- a/docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md
+++ /dev/null
@@ -1,124 +0,0 @@
-# 权重格式转换
-
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/weight_conversion.md)
-
-## 概述
-
-MindSpore Transformers提供了统一的权重转换工具,能够将模型权重在HuggingFace所使用的格式与MindSpore Transformers所使用的格式之间相互转换。这可以帮助用户:
-
-- 将HuggingFace权重转换为MindSpore Transformers权重,在MindSpore Transformers上进行微调、测评或推理。
-- 把使用MindSpore Transformers训练或微调得到的权重转换为HuggingFace权重,并在其他框架上使用。
-
-## 转换步骤
-
-要进行权重转换,首先请将待转换模型的HuggingFace仓库完整克隆到本地,然后执行`mindformers/convert_weight.py`脚本。该脚本能够自动将HuggingFace的模型权重文件转换为适用于MindSpore Transformers的权重文件。如若希望将MindSpore Transformers权重转为HuggingFace权重,请将`reversed`设置为`True`。
-
-```shell
-python convert_weight.py [-h] --model MODEL [--reversed] --input_path INPUT_PATH --output_path OUTPUT_PATH [--dtype DTYPE] [--n_head N_HEAD] [--hidden_size HIDDEN_SIZE] [--layers LAYERS] [--is_pretrain IS_PRETRAIN] [--telechat_type TELECHAT_TYPE]
-```
-
-### 参数说明
-
-- model:模型名称。
-- reversed:将MindSpore Transformers权重转换为HuggingFace权重。
-- input_path:HuggingFace权重文件夹的路径,指向已下载的权重文件。
-- output_path:转换后MindSpore Transformers权重文件的保存路径。
-- dtype:转换后的权重数据类型。
-- n_head:只对BLOOM模型生效,使用`bloom_560m`时请设为`16`,使用`bloom_7.1b`时请设为`32`。
-- hidden_size:只对BLOOM模型生效,使用`bloom_560m`时请设为`1024`,使用`bloom_7.1b`时请设为`4096`。
-- layers:只对GPT2和WizardCoder模型生效,模型被转换的层数。
-- is_pretrain:只对Swin模型生效,转换预训练权重。
-- telechat_type:只对TeleChat模型生效,TeleChat模型的版本。
-
-## 转换示例
-
-假设用户已经下载了[Llama2模型的权重](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD),并保存在路径`/home/user/torch_weights`中,用户希望将其转换为MindSpore Transformers权重并保存在路径`/home/user/ms_weights`中,可以使用以下命令:
-
-```bash
-python convert_weight.py --model llama --input_path /home/user/torch_weights --output_path /home/user/ms_weights/llama.ckpt
-```
-
-通过以上步骤,可将HuggingFace权重成功转换为MindSpore Transformers权重,方便在MindSpore Transformers中继续模型训练或推理。
-
-## 已支持模型
-
-| 参数取值 | 支持模型 |
-|-----------|-------------------------------------------|
-| llama | Llama2、Llama3、Llama3.1、CodeLlama |
-| baichuan2 | Baichuan2 |
-| glm-n | GLM2、GLM3、GLM3-32K、GLM4 |
-| cogvlm2 | CogVLM2-Video、CogVLM2-Image |
-| qwen | Qwen、Qwen1.5、Qwen2 |
-| qwenvl | QwenVL |
-| internlm | InternLM |
-| internlm2 | InternLM2 |
-| yi | Yi |
-| mixtral | Mixtral |
-| deepseek | DeepSeekCoder、DeepSeekCoder1.5、DeepSeekV2 |
-| gpt | GPT2 |
-| whisper | Whisper |
-
-## 未支持模型权重转换开发
-
-1. 在扩展模型目录下新增`convert_weight.py`及`convert_reversed.py`文件。
-2. 在文件中分别编写`convert_pt_to_ms`及`convert_ms_to_pt`权重转换函数,函数参数为`input_path`、`output_path`、`dtype`及额外参数`**kwargs`。
-3. 在MindSpore Transformers代码根目录下`convert_weight.py`文件中的`convert_map`和`reversed_convert_map`字典中加入扩展模型名称及转换函数引入路径。
-4. 在`main`函数中通过调用`parser.add_argument()`方法新增额外参数。
-
-## 模型权重转换开发示例
-
-此处以Llama为例。如若希望转换HuggingFace权重至MindSpore Transformers权重,需在[convert_weight.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_weight.py)内定义`convert_pt_to_ms`函数:
-
-```python
-def convert_pt_to_ms(input_path, output_path, dtype=None, **kwargs):
- """convert hf weight to ms."""
- print(f"Trying to convert huggingface checkpoint in '{input_path}'.", flush=True)
- try:
- from transformers import LlamaForCausalLM
- except:
- raise ImportError(f"Failed to load huggingface checkpoint. Please make sure transformers is available.")
-
- try:
- model_hf = LlamaForCausalLM.from_pretrained(os.path.dirname(input_path))
- except Exception as e:
- print(f"Do not find huggingface checkpoint in '{os.path.dirname(input_path)}', Error {e.message}.", flush=True)
- return False
- ckpt_list = []
- for name, value in model_hf.state_dict().items():
- name = name_replace(name)
- if name == 'norm.weight':
- name = 'norm_out.weight'
- if name[:7] == 'layers.':
- name = name[7:]
-
- print(f'\rprocessing parameter: {name} {value.shape} ', end='', flush=True)
- ckpt_list.append({'name': name, 'data': pt2ms(value, dtype)})
-
- ms.save_checkpoint(ckpt_list, output_path)
- print(f"\rConvert huggingface checkpoint finished, the mindspore checkpoint is saved in '{output_path}'.",
- flush=True)
- return True
-```
-
-而若是希望转换MindSpore Transformers权重至HuggingFace权重,则需在[convert_reversed.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/llama/convert_reversed.py)内定义`convert_ms_to_pt`函数:
-
-```python
-def convert_ms_to_pt(input_path, output_path, dtype=None, **kwargs):
- """convert ms weight to hf."""
- print(f"Trying to convert mindspore checkpoint in '{input_path}'.", flush=True)
- model_ms = ms.load_checkpoint(input_path)
-
- state_dict = {}
- for name, value in model_ms.items():
- name = name_replace(name)
- print(f'\rprocessing parameter: {name} {value.shape} ', end='', flush=True)
- if is_lora_param(name):
- name = name.replace('.tk_delta_lora_a', '.lora_A.weight')
- name = name.replace('.tk_delta_lora_b', 'lora_B.weight')
- state_dict[name] = ms2pt(value, dtype)
-
- torch.save(state_dict, output_path)
- print(f"\rConvert mindspore checkpoint finished, the huggingface checkpoint is saved in '{output_path}'.",
- flush=True)
- return True
-```
diff --git a/docs/mindformers/docs/source_zh_cn/guide/deployment.md b/docs/mindformers/docs/source_zh_cn/guide/deployment.md
index 4e381adaa7c42a42aff3d6ea77936478f5352880..4fa23ebfefe62d428217a76ced6171f2654df3b0 100644
--- a/docs/mindformers/docs/source_zh_cn/guide/deployment.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/deployment.md
@@ -86,7 +86,7 @@ processor:
merges_file: "/path/to/mf_model/qwen1_5_72b/merges.txt" # merges文件绝对路径
```
-模型权重下载和转换可参考 [权重格式转换指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)。
+模型权重下载和转换可参考 [权重格式转换指南](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。
不同模型的所需文件和配置可能会有差异,详情参考[模型库](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html)中具体模型的推理章节。
diff --git a/docs/mindformers/docs/source_zh_cn/guide/inference.md b/docs/mindformers/docs/source_zh_cn/guide/inference.md
index 1336b1d2b033a9f986fc566500d0cf4771c575a5..2423ce69f5f162faf7cdcdcc2712ea597801bc27 100644
--- a/docs/mindformers/docs/source_zh_cn/guide/inference.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/inference.md
@@ -22,8 +22,8 @@ MindSpore Transformers 提供了大模型推理能力,用户可以执行 `run_
完整权重可以通过以下两种方式获得:
-1. 从HuggingFace模型库中下载相应模型的开源权重后,参考[权重格式转换](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)将其转换为ckpt格式。
-2. 预训练或者微调后的分布式权重,通过[合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html)生成一个完整权重。
+1. 从HuggingFace模型库中下载相应模型的开源权重后,参考[权重格式转换](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)将其转换为ckpt格式。
+2. 预训练或者微调后的分布式权重,通过[合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)生成一个完整权重。
#### 2.2 分布式权重
@@ -35,7 +35,7 @@ MindSpore Transformers 提供了大模型推理能力,用户可以执行 `run_
2. 8卡训练的权重在2卡上推理;
3. 已经切分好的分布式权重在单卡上推理等。
-下文的命令示例均采用了在线自动切分的方式,通过设置参数 `--auto_trans_ckpt` 为 `True` 和 `--src_strategy_path_or_dir` 为权重的切分策略文件或目录路径(预训练或者微调后,默认保存在`./output/strategy`下)在推理任务中自动完成切分。更多用法可参考[分布式权重的合并和切分](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html)。
+下文的命令示例均采用了在线自动切分的方式,通过设置参数 `--auto_trans_ckpt` 为 `True` 和 `--src_strategy_path_or_dir` 为权重的切分策略文件或目录路径(预训练或者微调后,默认保存在`./output/strategy`下)在推理任务中自动完成切分。更多用法可参考[分布式权重的合并和切分](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。
> 由于训练和推理任务都使用 `./output` 作为默认输出路径,当使用训练任务所输出的策略文件,作为推理任务的源权重策略文件时,需要将默认输出路径下的策略文件目录移动到其他位置,避免被推理任务的进程清空,如:
>
diff --git a/docs/mindformers/docs/source_zh_cn/index.rst b/docs/mindformers/docs/source_zh_cn/index.rst
index 8ee12ed2646f1055a494403ac2c6f8cace61c16a..dd142e8086884ce1c21583c297a79eb6e6394fca 100644
--- a/docs/mindformers/docs/source_zh_cn/index.rst
+++ b/docs/mindformers/docs/source_zh_cn/index.rst
@@ -68,13 +68,9 @@ MindSpore Transformers功能特性说明
单卡、单机和多机任务一键启动。
- - `权重格式转换 `_
+ - `Ckpt权重 `_
- 提供统一的权重转换工具,能够将模型权重在HuggingFace所使用的格式与MindSpore Transformers所使用的格式之间相互转换。
-
- - `分布式权重切分与合并 `_
-
- 不同分布式场景下的权重灵活地进行切分与合并。
+ 支持ckpt格式权重文件的转换、切分与合并功能。
- `Safetensors权重 `_
@@ -197,8 +193,7 @@ FAQ
:hidden:
feature/start_tasks
- feature/weight_conversion
- feature/transform_weight
+ feature/ckpt
feature/safetensors
feature/configuration
feature/logging
diff --git a/docs/mindformers/docs/source_zh_cn/introduction/overview.md b/docs/mindformers/docs/source_zh_cn/introduction/overview.md
index b1e03fb2d0d674e1808caeb6196452064a0a3e45..a9e6ba63ca42f3802a7fd19cab7519c651e1d18d 100644
--- a/docs/mindformers/docs/source_zh_cn/introduction/overview.md
+++ b/docs/mindformers/docs/source_zh_cn/introduction/overview.md
@@ -8,7 +8,7 @@ MindSpore Transformers与昇思MindSpore、昇腾Ascend的端到端AI软硬件
2. 在软件层面,MindSpore Transformers通过MindSpore提供的Python接口实现大模型相关代码,并由昇腾AI处理器配套软件包提供的算子库进行数据运算;
3. MindSpore Transformers目前支持的基础功能特性如下:
1. 支持大模型[分布式并行](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/parallel_training.html)运行训练和推理等任务,并行能力包括数据并行、模型并行、超长序列并行等;
- 2. 支持[模型权重转换](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/weight_conversion.html)、[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/transform_weight.html)、不同格式[数据集加载](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html)以及[断点续训](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html)等功能;
+ 2. 支持[模型权重转换](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)、[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)、不同格式[数据集加载](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html)以及[断点续训](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html)等功能;
3. 支持25+大模型[预训练](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/pre_training.html)、[微调](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/supervised_fine_tuning.html)、[推理](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/inference.html)和[评测](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/evaluation.html)等功能,同时支持对模型参数进行[量化](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/quantization.html),具体支持模型列表可参考[模型库](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/introduction/models.html);
4. MindSpore Transformers支持用户通过[MindIE](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/guide/deployment.html)进行模型服务化部署功能,同时支持使用[MindX](https://www.hiascend.com/software/mindx-dl)实现大规模集群调度;后续将支持更多第三方平台,敬请期待。
diff --git a/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md b/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md
index 02cc479e39dc1d93e16e0d54f6b15fdcdaa315b0..49baacc1c51290eca66f8999c7109cc82b5fccb5 100644
--- a/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md
+++ b/tutorials/source_en/model_infer/ms_infer/llm_inference_overview.md
@@ -209,7 +209,7 @@ In addition to utilizing the capabilities provided by the MindFormers model suit
For large language models with many model parameters, such as Llama2-70B and Qwen2-72B, the parameter scale usually exceeds the memory capacity of a GPU or NPU. Therefore, multi-device parallel inference is required. MindSpore large language model inference can shard the original large language model into N parallel models so that they can be executed on multiple devices in parallel. This not only enables inference for super-large models but also enhances performance by leveraging more resources from the multiple devices. The model scripts provided by the MindFormers model suite can be used to shard a model into multi-device models for execution. You can perform the following steps to deploy the model on multiple devices.
-- **Weight sharding**: Because the original weight files are too large, when executing on multiple devices, the overall weight needs to be sharded into multiple weights for each device and passed to the model process corresponding to each device. You can use the script in the MindFormers model suite to perform weight sharding. For details, see [Weight Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/function/transform_weight.html).
+- **Weight sharding**: Because the original weight files are too large, when executing on multiple devices, the overall weight needs to be sharded into multiple weights for each device and passed to the model process corresponding to each device. You can use the script in the MindFormers model suite to perform weight sharding. For details, see [Weight Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html).
Here is an example of how to shard the Llama2-7B model for parallel execution on two devices.
diff --git a/tutorials/source_zh_cn/model_infer/ms_infer/llm_inference_overview.md b/tutorials/source_zh_cn/model_infer/ms_infer/llm_inference_overview.md
index ca9bcc84ddd7a8d8fb92d8f7eceb9b3a5e5e580d..309783672e5847ba80a9a06016f9a3aeef746c53 100644
--- a/tutorials/source_zh_cn/model_infer/ms_infer/llm_inference_overview.md
+++ b/tutorials/source_zh_cn/model_infer/ms_infer/llm_inference_overview.md
@@ -209,7 +209,7 @@ model = AutoModel.from_config(config)
对于模型参数比较多的大语言模型,如Llama2-70B、Qwen2-72B,由于其参数规模通常会超过一张GPU或者NPU的内存容量,因此需要采用多卡并行推理,MindSpore大语言模型推理支持将原始大语言模型切分成N份可并行的子模型,使其能够分别在多卡上并行执行,在实现超大模型推理同时,也利用多卡中更多的资源提升性能。MindFormers模型套件提供的模型脚本天然支持将模型切分成多卡模型执行,用户可以通过以下步骤在多卡上部署模型。
-- **权重切分**:由于原来的权重文件太大,多卡执行时,需要将整体权重切分成每张卡上的多份权重,分别传给每张卡对应的模型进程。用户可以使用MindFormers模型套件中的脚本来进行权重切分。具体可以参考[权重转换](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/function/transform_weight.html)。
+- **权重切分**:由于原来的权重文件太大,多卡执行时,需要将整体权重切分成每张卡上的多份权重,分别传给每张卡对应的模型进程。用户可以使用MindFormers模型套件中的脚本来进行权重切分。具体可以参考[权重转换](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。
下面以Llama2-7B大语言模型为例,简单描述一下将模型切分为2卡并行的操作: