From f1a0aa632ce6b729737c7d20702c941cfddf2c13 Mon Sep 17 00:00:00 2001 From: SaiYao Date: Tue, 8 Jul 2025 11:39:30 +0800 Subject: [PATCH] =?UTF-8?q?=E3=80=90bugfix=E3=80=91=E6=96=87=E6=A1=A3?= =?UTF-8?q?=EF=BC=9A=E4=BF=AE=E5=A4=8D=E5=A4=B1=E6=95=88=E9=93=BE=E6=8E=A5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../docs/source_en/feature/ckpt.md | 56 +++++++--------- .../docs/source_en/feature/quantization.md | 1 - .../docs/source_en/feature/resume_training.md | 64 +++++++++---------- .../docs/source_zh_cn/feature/ckpt.md | 24 +++---- .../docs/source_zh_cn/feature/quantization.md | 1 - .../source_zh_cn/feature/resume_training.md | 35 +++++----- 6 files changed, 81 insertions(+), 100 deletions(-) diff --git a/docs/mindformers/docs/source_en/feature/ckpt.md b/docs/mindformers/docs/source_en/feature/ckpt.md index e9a91256b0..98c3acbf99 100644 --- a/docs/mindformers/docs/source_en/feature/ckpt.md +++ b/docs/mindformers/docs/source_en/feature/ckpt.md @@ -40,7 +40,7 @@ python convert_weight.py [-h] --model MODEL [--reversed] --input_path INPUT_PATH ### Conversion Example -Assume that you have downloaded the [Llama2 model weight](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD) and saved it in the `/home/user/torch_weights` path, to convert it to the MindSpore Transformers weight and save it in the `/home/user/ms_weights` path, run the following command: +Assume that you have downloaded the [Llama3.1 model weight](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD) and saved it in the `/home/user/torch_weights` path, to convert it to the MindSpore Transformers weight and save it in the `/home/user/ms_weights` path, run the following command: ```bash python convert_weight.py --model llama --input_path /home/user/torch_weights --output_path /home/user/ms_weights/llama.ckpt @@ -50,21 +50,13 @@ After the preceding steps are performed, the HuggingFace weight is successfully ### Supported Models -| Parameter Value | Supported models | -|-----------|---------------------------------------------| -| llama | Llama2, Llama3, Llama3.1, CodeLlama | -| baichuan2 | Baichuan2 | -| glm-n | GLM2, GLM3, GLM3-32K, GLM4 | -| cogvlm2 | CogVLM2-Video, CogVLM2-Image | -| qwen | Qwen, Qwen1.5, Qwen2 | -| qwenvl | QwenVL | -| internlm | InternLM | -| internlm2 | InternLM2 | -| yi | Yi | -| mixtral | Mixtral | -| deepseek | DeepSeekCoder, DeepSeekCoder1.5, DeepSeekV2 | -| gpt | GPT2 | -| whisper | Whisper | +| Parameter Value | Supported models | +|-----------------|------------------------------| +| llama | Llama3.1 | +| glm-n | GLM4 | +| qwen | Qwen2.5 | +| mixtral | Mixtral | +| deepseek | DeepSeekV3 | ### Developing Weight Conversion for Unsupported Models @@ -147,13 +139,13 @@ When a model loads a weight, it automatically checks whether the weight is match Parameters in the `yaml` file related to **automatic weight conversion** are described as follows: -| Parameter | Description | -| ------------------- 
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| load_checkpoint | Absolute path or folder path of the pre-loaded weights.
- For a complete set of weights, set this parameter to an absolute path.
- For a distributed weight, set this parameter to the folder path. The distributed weight must be stored in the `model_dir/rank_x/xxx.ckpt` format. The folder path is `model_dir`.
**If there are multiple CKPT files in the rank_x folder, the last CKPT file in the file name sequence is used for conversion by default.** | -| src_strategy_path_or_dir | Path of [the distributed strategy file](#offline-conversion-configuration) corresponding to the pre-loaded weights.
- If the pre-loaded weights are a complete set of weights, leave this parameter **blank**.
- If the pre-loaded weights are distributed and pipeline parallelism is used when the pre-loaded weights are saved, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- If the pre-loaded weights are distributed and pipeline parallelism is not used when the pre-load weights are saved, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | -| auto_trans_ckpt | Specifies whether to enable automatic weight conversion. The value True indicates that it is enabled. The default value is False. | -| transform_process_num | Number of processes used for automatic weight conversion. The default value is 1.
- If transform_process_num is set to 1, only rank_0 is used for weight conversion. Other processes wait until the conversion ends.
- If transform_process_num is larger than 1, **multiple processes conduct conversion**. For example, for an 8-device task, if transform_process_num is set to 2, rank_0 is used for converting the weights of slices rank_0, rank_1, rank_2, and rank_3, and rank_4 is used for converting the weights of slices rank_4, rank_5, rank_6, and rank_7, and other processes wait until rank_0 and rank_4 complete the conversion.
**Note**:
1. A larger value of transform_process_num indicates a shorter conversion time and **a larger host memory occupied by the conversion**. If the host memory is insufficient, decrease the value of transform_process_num.
2. The value of transform_process_num must be a number that can be exactly divided by and cannot exceed that of NPUs. | -| transform_by_rank | Specifies whether to use the mindspore.transform_checkpoint_by_rank API for weight conversion.
- If transform_process_num is larger than 1, the value is automatically set to `True`.
- If transform_process_num is set to 1, if the target weight is a distributed weight, the mindspore.transform_checkpoint_by_rank API is cyclically called to convert the weight of each rank slice in serial mode.
- If transform_process_num is set to 1, if the target weight is a complete weight, the value is automatically set to `False`, and the mindspore.transform_checkpoints API is called for weight conversion. | +| Parameter | Description | +|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| load_checkpoint | Absolute path or folder path of the pre-loaded weights.
- For a complete set of weights, set this parameter to an absolute path.
- For a distributed weight, set this parameter to the folder path. The distributed weight must be stored in the `model_dir/rank_x/xxx.ckpt` format. The folder path is `model_dir`.
**If there are multiple CKPT files in the rank_x folder, the last CKPT file in the file name sequence is used for conversion by default.** | +| src_strategy_path_or_dir | Path of [the distributed strategy file](#offline-conversion-configuration) corresponding to the pre-loaded weights.
- If the pre-loaded weights are a complete set of weights, leave this parameter **blank**.
- If the pre-loaded weights are distributed and pipeline parallelism is used when the pre-loaded weights are saved, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- If the pre-loaded weights are distributed and pipeline parallelism is not used when the pre-loaded weights are saved, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | +| auto_trans_ckpt | Specifies whether to enable automatic weight conversion. The value True indicates that it is enabled. The default value is False. | +| transform_process_num | Number of processes used for automatic weight conversion. The default value is 1.<br>
- If transform_process_num is set to 1, only rank_0 is used for weight conversion. Other processes wait until the conversion ends.
- If transform_process_num is larger than 1, **multiple processes conduct conversion**. For example, in an 8-device task with transform_process_num set to 2, rank_0 converts the weight slices of rank_0 to rank_3, rank_4 converts the weight slices of rank_4 to rank_7, and the other processes wait until rank_0 and rank_4 complete the conversion.<br>
**Note**:
1. A larger value of transform_process_num shortens the conversion time but **increases the host memory occupied by the conversion**. If the host memory is insufficient, decrease the value of transform_process_num.<br>
2. The value of transform_process_num must exactly divide the number of NPUs and cannot exceed it. | +| transform_by_rank | Specifies whether to use the mindspore.transform_checkpoint_by_rank API for weight conversion.<br>
- If transform_process_num is larger than 1, the value is automatically set to `True`.
- If transform_process_num is set to 1, if the target weight is a distributed weight, the mindspore.transform_checkpoint_by_rank API is cyclically called to convert the weight of each rank slice in serial mode.
- If transform_process_num is set to 1, if the target weight is a complete weight, the value is automatically set to `False`, and the mindspore.transform_checkpoints API is called for weight conversion. | #### YAML Configurations in Different Scenarios @@ -221,15 +213,15 @@ When using offline conversion, you can manually configure conversion parameters Parameters in the `yaml` file related to **offline weight conversion** are described as follows: -| Parameter | Description | -| ----------------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| src_checkpoint | Absolute path or folder path of the source weight.
- For **a complete set of weights**, set this parameter to an **absolute path**.
- For **distributed weights**, set this parameter to the **folder path**. The distributed weights must be stored in the `model_dir/rank_x/xxx.ckpt` format. The folder path is `model_dir`.
**If there are multiple CKPT files in the rank_x folder, the last CKPT file in the file name sequence is used for conversion by default.** | -| src_strategy_path_or_dir | Path of the distributed strategy file corresponding to the source weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | -| dst_checkpoint | Path of the folder that stores the target weight. | -| dst_strategy | Path of the distributed strategy file corresponding to the target weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path.| -| prefix | Prefix name of the saved target weight. The weight is saved as {prefix}rank_x.ckpt. The default value is checkpoint_. | -| world_size | Total number of slices of the target weight. Generally, the value is dp \* mp \* pp. | -| process_num | Number of processes used for offline weight conversion. The default value is 1.
- If process_num is set to 1, **a single process is used for conversion**.
- If process_num is larger than 1, **multi-process conversion** is used. For example, if the target weight for conversion is the distributed weight of eight GPUs and process_num is set to 2, two processes are started to convert the weights of slices rank_0, rank_1, rank_2, and rank_3 and slices rank_4, rank_5, rank_6, and rank_7, respectively. | +| Parameter | Description | +|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| src_checkpoint | Absolute path or folder path of the source weight.
- For **a complete set of weights**, set this parameter to an **absolute path**.
- For **distributed weights**, set this parameter to the **folder path**. The distributed weights must be stored in the `model_dir/rank_x/xxx.ckpt` format. The folder path is `model_dir`.
**If there are multiple CKPT files in the rank_x folder, the last CKPT file in the file name sequence is used for conversion by default.** | +| src_strategy_path_or_dir | Path of the distributed strategy file corresponding to the source weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | +| dst_checkpoint | Path of the folder that stores the target weight. | +| dst_strategy | Path of the distributed strategy file corresponding to the target weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | +| prefix | Prefix name of the saved target weight. The weight is saved as {prefix}rank_x.ckpt. The default value is checkpoint_. | +| world_size | Total number of slices of the target weight. Generally, the value is dp \* mp \* pp. | +| process_num | Number of processes used for offline weight conversion. The default value is 1.
- If process_num is set to 1, **a single process is used for conversion**.
- If process_num is larger than 1, **multi-process conversion** is used. For example, if the target weight for conversion is the distributed weight of eight GPUs and process_num is set to 2, two processes are started to convert the weights of slices rank_0, rank_1, rank_2, and rank_3 and slices rank_4, rank_5, rank_6, and rank_7, respectively. | #### Offline Conversion Configuration diff --git a/docs/mindformers/docs/source_en/feature/quantization.md b/docs/mindformers/docs/source_en/feature/quantization.md index 0cd91eeed5..4755352753 100644 --- a/docs/mindformers/docs/source_en/feature/quantization.md +++ b/docs/mindformers/docs/source_en/feature/quantization.md @@ -16,4 +16,3 @@ Currently, only the following models are supported, and the supported models are |-----------------------------------------------------------------------------------------------------------------------------------| | [DeepSeek-V3](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/predict_deepseek3_671b.yaml) | | [DeepSeek-R1](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b.yaml) | -| [Llama2](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/predict_llama2_13b_ptq.yaml) | \ No newline at end of file diff --git a/docs/mindformers/docs/source_en/feature/resume_training.md b/docs/mindformers/docs/source_en/feature/resume_training.md index ec61b7934d..51726a0643 100644 --- a/docs/mindformers/docs/source_en/feature/resume_training.md +++ b/docs/mindformers/docs/source_en/feature/resume_training.md @@ -14,27 +14,27 @@ MindSpore Transformers supports **step-level resumable training**, which allows You can modify the configuration file to control resumable training. The main parameters are as follows. For details about other parameters, see the description of CheckpointMonitor. -| Parameter | Description | -|------------------|---------------------------------------------------------------------| -| load_checkpoint | Weight path loaded during resumable training. The path can be a folder path (used to load distributed weights) or a specific weight file path. The default value is an empty string, indicating that no weight is loaded (required for resumable training). | -| resume_training | Specifies whether to enable resumable training. You can set it to `True` or specify a weight file name. If the value is `True`, the system automatically resumes the training from the last interruption. The default value is `False`. | -| load_ckpt_async | Determines whether to load model weights and compile in parallel (this configuration does not take effect when auto_trans_ckpt is set to true). The default value is False (serial execution).
When it is `True`, the parallel capability of loading ckpt weights and building model is enabled to reduce the overall time resume training. | +| Parameter | Description | +|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| load_checkpoint | Weight path loaded during resumable training. The path can be a folder path (used to load distributed weights) or a specific weight file path. The default value is an empty string, indicating that no weight is loaded (required for resumable training). | +| resume_training | Specifies whether to enable resumable training. You can set it to `True` or specify a weight file name. If the value is `True`, the system automatically resumes the training from the last interruption. The default value is `False`. | +| load_ckpt_async | Determines whether to load model weights and compile in parallel (this configuration does not take effect when auto_trans_ckpt is set to true). The default value is False (serial execution).
When it is `True`, the parallel capability of loading ckpt weights and building model is enabled to reduce the overall time of resuming training. | Based on the input parameters, there are four cases. -| load_checkpoint | resume_training | Description | Recommended or Not| -|-----------------|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------| -| Weight file path | True | Resumes a training based on the weights specified by load_checkpoint. | √ | -| Weight file path | Weight file name | The file name specified by resume_training is invalid. A training is resumed based on the weights specified by load_checkpoint. | × | -| Weight folder path | True | **Scenario 1: Single-node system, multi-node system+shared directory, or ModelArts**<br>
1. Resumes the training based on the weights recorded in meta.json files and supports fault recovery.
2. Resumes the training based on the latest weight of all ranks if the meta.json file of any rank is missing.
**Scenario 2: Multi-node+non-shared directory**
Resumes the training based on the latest weight of all ranks.| √ | -| Weight folder path | Weight file name | Resumes the training based on the weights specified by resume_training. | √ | +| load_checkpoint | resume_training | Description | Recommended or Not | +|---------------------|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------| +| Weight file path | True | Resumes a training based on the weights specified by load_checkpoint. | √ | +| Weight file path | Weight file name | The file name specified by resume_training is invalid. A training is resumed based on the weights specified by load_checkpoint. | × | +| Weight folder path | True | **Scenario 1: Single-node system, multi-node system+shared directory, or ModelArts**
1. Resumes the training based on the weights recorded in meta.json files and supports fault recovery.
2. Resumes the training based on the latest weight of all ranks if the meta.json file of any rank is missing.
**Scenario 2: Multi-node+non-shared directory**
Resumes the training based on the latest weight of all ranks. | √ | +| Weight folder path | Weight file name | Resumes the training based on the weights specified by resume_training. | √ | In addition, you can modify the following parameters in the configuration file to use related functions. -| Parameter | Description | -|------------------|-------------------------------------------------------------------------------------------------------------| -| ignore_data_skip | Specifies whether to ignore the mechanism of skipping data during resumable training and read the dataset from the beginning instead. This parameter is used when the dataset is changed during resumable training. If this parameter is set to `True`, no data is skipped. The default value is `False`. | -| data_skip_steps | Number of steps skipped for the dataset. This parameter is used when the training is interrupted again after being resumed because the dataset or `global batch size` is changed. You need to manually set this parameter to configure the number of steps skipped for the new dataset. If the `global batch size` is changed, you need to divide and round down its value by the scaling coefficient and then specify the result as the value of this parameter.| +| Parameter | Description | +|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| ignore_data_skip | Specifies whether to ignore the mechanism of skipping data during resumable training and read the dataset from the beginning instead. This parameter is used when the dataset is changed during resumable training. If this parameter is set to `True`, no data is skipped. The default value is `False`. | +| data_skip_steps | Number of steps skipped for the dataset. This parameter is used when the training is interrupted again after being resumed because the dataset or `global batch size` is changed. You need to manually set this parameter to configure the number of steps skipped for the new dataset. If the `global batch size` is changed, you need to divide and round down its value by the scaling coefficient and then specify the result as the value of this parameter. | #### Fault Recovery Mechanism @@ -44,12 +44,12 @@ If `resume_training` is set to `True`, the system automatically resumes training ### Example of Distributed Training -The following example shows how to enable resumable training in single-device and multi-device environments. The example is based on the `llama2_7b` model. -For related configuration files, see [configs/llama2/pretrain_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/pretrain_llama2_7b.yaml). +The following example shows how to enable resumable training in single-device and multi-device environments. The example is based on the `llama3.1 8b` model. +For related configuration files, see [research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml). #### Complete Training -1. Modify `configs/llama2/pretrain_llama2_7b.yaml`. +1. Modify `research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml`. 
Configure the parallelism as required. @@ -67,7 +67,7 @@ For related configuration files, see [configs/llama2/pretrain_llama2_7b.yaml](ht callbacks: ... - type: CheckpointMonitor - prefix: "llama2_7b" + prefix: "llama3_1_8b" save_checkpoint_steps: 10 keep_checkpoint_max: 3 integrated_save: False @@ -75,12 +75,12 @@ For related configuration files, see [configs/llama2/pretrain_llama2_7b.yaml](ht ... ``` -2. Prepare a dataset. The following uses [wikitext2](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87) as an example to describe how to start four-device distributed training. +2. Prepare a dataset. The following uses [alpaca datasets](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md#%E6%95%B0%E6%8D%AE%E9%9B%86%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87) as an example to describe how to start four-device distributed training. ```shell bash scripts/msrun_launcher.sh "run_mindformer.py \ - --config configs/llama2/pretrain_llama2_7b.yaml \ - --train_dataset /path/to/wikitext2-llama2.mindrecord \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_dataset /path/to/alpaca-fastchat8192.mindrecord \ --run_mode train \ --use_parallel True" 4 ``` @@ -89,9 +89,9 @@ For related configuration files, see [configs/llama2/pretrain_llama2_7b.yaml](ht ```text checkpoint/rank_0 - ├── llama2_7b_rank_0-10_2.ckpt - ├── llama2_7b_rank_0-15_2.ckpt - ├── llama2_7b_rank_0-20_2.ckpt + ├── llama3_1_8b_rank_0-10_2.ckpt + ├── llama3_1_8b_rank_0-15_2.ckpt + ├── llama3_1_8b_rank_0-20_2.ckpt └── meta.json ``` @@ -108,8 +108,8 @@ For related configuration files, see [configs/llama2/pretrain_llama2_7b.yaml](ht ```shell bash scripts/msrun_launcher.sh "run_mindformer.py \ - --config configs/llama2/pretrain_llama2_7b.yaml \ - --train_dataset /path/to/wikitext2-llama2.mindrecord \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_dataset /path/to/alpaca-fastchat8192.mindrecord \ --run_mode train \ --use_parallel True" 4 ``` @@ -168,12 +168,12 @@ If `global batch size` is changed (for example, doubled) when a training is resu If some weight files are missing, the system automatically restores the files based on the latest available weight. -1. Delete the `llama2_7b_rank_0-20_2.ckpt` file from the `rank_3` directory. The folder structure after the deletion is as follows: +1. Delete the `llama3_1_8b_rank_0-20_2.ckpt` file from the `rank_3` directory. 
The folder structure after the deletion is as follows: ```text checkpoint/rank_3 - ├── llama2_7b_rank_0-10_2.ckpt - ├── llama2_7b_rank_0-15_2.ckpt + ├── llama3_1_8b_rank_0-10_2.ckpt + ├── llama3_1_8b_rank_0-15_2.ckpt └── meta.json ``` @@ -188,8 +188,8 @@ If some weight files are missing, the system automatically restores the files ba ```shell bash scripts/msrun_launcher.sh "run_mindformer.py \ - --config configs/llama2/pretrain_llama2_7b.yaml \ - --train_dataset /path/to/wikitext2-llama2.mindrecord \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_dataset /path/to/alpaca-fastchat8192.mindrecord \ --run_mode train \ --use_parallel True" 4 ``` diff --git a/docs/mindformers/docs/source_zh_cn/feature/ckpt.md b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md index 29bda97b56..ea73cea3d7 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/ckpt.md +++ b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md @@ -40,7 +40,7 @@ python convert_weight.py [-h] --model MODEL [--reversed] --input_path INPUT_PATH ### 转换示例 -假设用户已经下载了[Llama2模型的权重](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD),并保存在路径`/home/user/torch_weights`中,用户希望将其转换为MindSpore Transformers权重并保存在路径`/home/user/ms_weights`中,可以使用以下命令: +假设用户已经下载了[Llama3.1模型的权重](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md#%E6%A8%A1%E5%9E%8B%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD),并保存在路径`/home/user/torch_weights`中,用户希望将其转换为MindSpore Transformers权重并保存在路径`/home/user/ms_weights`中,可以使用以下命令: ```bash python convert_weight.py --model llama --input_path /home/user/torch_weights --output_path /home/user/ms_weights/llama.ckpt @@ -50,21 +50,13 @@ python convert_weight.py --model llama --input_path /home/user/torch_weights --o ### 已支持模型 -| 参数取值 | 支持模型 | -|-----------|-------------------------------------------| -| llama | Llama2、Llama3、Llama3.1、CodeLlama | -| baichuan2 | Baichuan2 | -| glm-n | GLM2、GLM3、GLM3-32K、GLM4 | -| cogvlm2 | CogVLM2-Video、CogVLM2-Image | -| qwen | Qwen、Qwen1.5、Qwen2 | -| qwenvl | QwenVL | -| internlm | InternLM | -| internlm2 | InternLM2 | -| yi | Yi | -| mixtral | Mixtral | -| deepseek | DeepSeekCoder、DeepSeekCoder1.5、DeepSeekV2 | -| gpt | GPT2 | -| whisper | Whisper | +| 参数取值 | 支持模型 | +|----------|------------------------------| +| llama | Llama3.1 | +| glm-n | GLM4 | +| qwen | Qwen2.5 | +| mixtral | Mixtral | +| deepseek | DeepSeekV3 | ### 未支持模型权重转换开发 diff --git a/docs/mindformers/docs/source_zh_cn/feature/quantization.md b/docs/mindformers/docs/source_zh_cn/feature/quantization.md index 76d2c1a3b3..6f4b96a883 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/quantization.md +++ b/docs/mindformers/docs/source_zh_cn/feature/quantization.md @@ -16,4 +16,3 @@ MindSpore Transformers 集成 MindSpore Golden Stick 工具组件,提供统一 |-----------------------------------------------------------------------------------------------------------------------------------| | [DeepSeek-V3](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/predict_deepseek3_671b.yaml) | | [DeepSeek-R1](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b.yaml) | -| [Llama2](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/predict_llama2_13b_ptq.yaml) | \ No newline at end of file diff --git a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md index 
f9762887e3..3ef3704f4e 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md +++ b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md @@ -44,12 +44,11 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中 ### 分布式训练示例 -以下示例演示了如何在单卡和多卡环境中启动断点续训。示例基于`llama2_7b` -模型,相关配置文件[configs/llama2/pretrain_llama2_7b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/configs/llama2/pretrain_llama2_7b.yaml)。 +以下示例演示了如何在单卡和多卡环境中启动断点续训。示例基于 `llama3.1 8b` 模型,相关配置文件[research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)。 #### 完整训练 -1. 修改`configs/llama2/pretrain_llama2_7b.yaml`: +1. 修改`research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml`: 根据需要设置并行配置: @@ -67,7 +66,7 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中 callbacks: ... - type: CheckpointMonitor - prefix: "llama2_7b" + prefix: "llama3_1_8b" save_checkpoint_steps: 10 keep_checkpoint_max: 3 integrated_save: False @@ -75,23 +74,23 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中 ... ``` -2. 准备数据集,此处以[wikitext2](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87)为例,启动4卡分布式训练: +2. 准备数据集,此处以 [alpaca 数据集](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md#%E6%95%B0%E6%8D%AE%E9%9B%86%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87)为例,启动4卡分布式训练: ```shell bash scripts/msrun_launcher.sh "run_mindformer.py \ - --config configs/llama2/pretrain_llama2_7b.yaml \ - --train_dataset /path/to/wikitext2-llama2.mindrecord \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_dataset /path/to/alpaca-fastchat8192.mindrecord \ --run_mode train \ --use_parallel True" 4 ``` - 在第四次保存完毕后,结束进程,此时`checkpoint`下的`rank_0`文件夹结构为: + 在第四次保存完毕后,结束进程,此时 `checkpoint` 下的 `rank_0` 文件夹结构为: ```text checkpoint/rank_0 - ├── llama2_7b_rank_0-10_2.ckpt - ├── llama2_7b_rank_0-15_2.ckpt - ├── llama2_7b_rank_0-20_2.ckpt + ├── llama3_1_8b_rank_0-10_2.ckpt + ├── llama3_1_8b_rank_0-15_2.ckpt + ├── llama3_1_8b_rank_0-20_2.ckpt └── meta.json ``` @@ -108,8 +107,8 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中 ```shell bash scripts/msrun_launcher.sh "run_mindformer.py \ - --config configs/llama2/pretrain_llama2_7b.yaml \ - --train_dataset /path/to/wikitext2-llama2.mindrecord \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_dataset /path/to/alpaca-fastchat8192.mindrecord \ --run_mode train \ --use_parallel True" 4 ``` @@ -168,12 +167,12 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中 当部分权重文件缺失时,系统会自动基于上一个可用的权重进行恢复。 -1. 删除`rank_3`下的`llama2_7b_rank_0-20_2.ckpt`文件。删除后文件夹结构应为: +1. 删除`rank_3`下的`llama3_1_8b_rank_0-20_2.ckpt`文件。删除后文件夹结构应为: ```text checkpoint/rank_3 - ├── llama2_7b_rank_0-10_2.ckpt - ├── llama2_7b_rank_0-15_2.ckpt + ├── llama3_1_8b_rank_0-10_2.ckpt + ├── llama3_1_8b_rank_0-15_2.ckpt └── meta.json ``` @@ -188,8 +187,8 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中 ```shell bash scripts/msrun_launcher.sh "run_mindformer.py \ - --config configs/llama2/pretrain_llama2_7b.yaml \ - --train_dataset /path/to/wikitext2-llama2.mindrecord \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_dataset /path/to/alpaca-fastchat8192.mindrecord \ --run_mode train \ --use_parallel True" 4 ``` -- Gitee
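
For quick reference, the resumable-training options documented in `resume_training.md` above can be combined as in the following minimal sketch. The checkpoint path is a placeholder, and the callback block mirrors the `finetune_llama3_1_8b.yaml` excerpt shown in the examples; exact keys may vary across MindSpore Transformers versions.

```yaml
# Minimal resumable-training sketch (illustrative; the checkpoint path is a placeholder).
load_checkpoint: '/path/to/checkpoint'   # folder containing rank_x/ subdirectories, or a single .ckpt file
resume_training: True                    # or the name of a specific weight file under rank_x/
load_ckpt_async: True                    # load weights and compile in parallel; ignored when auto_trans_ckpt is true

callbacks:
  - type: CheckpointMonitor
    prefix: "llama3_1_8b"
    save_checkpoint_steps: 10
    keep_checkpoint_max: 3
    integrated_save: False
    async_save: False
```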