From 47238d1a13d4ecce65e2a331a411648e200c9369 Mon Sep 17 00:00:00 2001 From: senzhen Date: Wed, 20 Aug 2025 09:29:56 +0800 Subject: [PATCH] =?UTF-8?q?=E4=BF=AE=E6=94=B9=E7=A6=BB=E7=BA=BF=E6=9D=83?= =?UTF-8?q?=E9=87=8D=E8=BD=AC=E6=8D=A2=E8=AF=B4=E6=98=8E?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../docs/source_en/feature/ckpt.md | 44 +++++++++++-------- .../docs/source_zh_cn/feature/ckpt.md | 42 ++++++++++-------- 2 files changed, 49 insertions(+), 37 deletions(-) diff --git a/docs/mindformers/docs/source_en/feature/ckpt.md b/docs/mindformers/docs/source_en/feature/ckpt.md index 7c44f2e22d..01ce7cb36c 100644 --- a/docs/mindformers/docs/source_en/feature/ckpt.md +++ b/docs/mindformers/docs/source_en/feature/ckpt.md @@ -209,20 +209,6 @@ The offline conversion function is designed to meet your requirements for manual When using offline conversion, you can manually configure conversion parameters as required to ensure that the conversion process is flexible and controllable. This function is especially suitable for model deployment and optimization in a strictly controlled computing environment. -#### Parameters - -Parameters in the `yaml` file related to **offline weight conversion** are described as follows: - -| Parameter | Description | -|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| src_checkpoint | Absolute path or folder path of the source weight.
- For **a complete set of weights**, set this parameter to an **absolute path**.
- For **distributed weights**, set this parameter to the **folder path**. The distributed weights must be stored in the `model_dir/rank_x/xxx.ckpt` format. The folder path is `model_dir`.
**If there are multiple CKPT files in the rank_x folder, the last CKPT file in the file name sequence is used for conversion by default.** | -| src_strategy_path_or_dir | Path of the distributed strategy file corresponding to the source weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | -| dst_checkpoint | Path of the folder that stores the target weight. | -| dst_strategy | Path of the distributed strategy file corresponding to the target weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | -| prefix | Prefix name of the saved target weight. The weight is saved as {prefix}rank_x.ckpt. The default value is checkpoint_. | -| world_size | Total number of slices of the target weight. Generally, the value is dp \* mp \* pp. | -| process_num | Number of processes used for offline weight conversion. The default value is 1.
- If process_num is set to 1, **a single process is used for conversion**.
- If process_num is larger than 1, **multi-process conversion** is used. For example, if the target weight for conversion is the distributed weight of eight GPUs and process_num is set to 2, two processes are started to convert the weights of slices rank_0, rank_1, rank_2, and rank_3 and slices rank_4, rank_5, rank_6, and rank_7, respectively. | - #### Offline Conversion Configuration **Generating Distributed Strategy** @@ -241,7 +227,8 @@ Use [mindformers/tools/ckpt_transform/transform_checkpoint.py](https://gitee.com python transform_checkpoint.py \ --src_checkpoint /worker/checkpoint/llama3-8b-2layer/rank_0/llama3_8b.ckpt \ --dst_checkpoint /worker/transform_ckpt/llama3_8b_1to8/ \ - --dst_strategy /worker/mindformers/output/strategy/ + --dst_strategy /worker/mindformers/output/strategy/ \ + --prefix "checkpoint_" ``` **Multi-Process Conversion** @@ -256,12 +243,33 @@ bash transform_checkpoint.sh \ None \ /worker/transform_ckpt/llama3_8b_1to8/ \ /worker/mindformers/output/strategy/ \ - 8 2 + 8 2 "checkpoint_" ``` -**Precautions**: +> Note: The order of parameters is src_checkpoint, src_strategy, dst_checkpoint_dir, dst_strategy, world_size, transform_process_num, prefix. + +#### Parameters + +Parameters related to **offline weight conversion** are described as follows: + +- Parameters for single-process conversion + +| Parameter | Description | +|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| src_checkpoint | Absolute path or folder path of the source weight.
- For **a complete set of weights**, set this parameter to an **absolute path**.
- For **distributed weights**, set this parameter to the **folder path**. The distributed weights must be stored in the `model_dir/rank_x/xxx.ckpt` format. The folder path is `model_dir`.
**If there are multiple CKPT files in the rank_x folder, the last CKPT file in the file name sequence is used for conversion by default.** | +| src_strategy | Path of the distributed strategy file corresponding to the source weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | +| dst_checkpoint_dir | Path of the folder that stores the target weight. | +| dst_strategy | Path of the distributed strategy file corresponding to the target weight.
- For a complete set of weights, leave it **blank**.
- For distributed weights, if pipeline parallelism is used, set this parameter to the **merged strategy file path** or **distributed strategy folder path**.
- For distributed weights, if pipeline parallelism is not used, set this parameter to any **ckpt_strategy_rank_x.ckpt** path. | +| prefix | Prefix name of the saved target weight. The weight is saved as {prefix}rank_x.ckpt. The default value is checkpoint_. | +| world_size | Total number of slices of the target weight. Generally, the value is dp \* mp \* pp. | +| process_num | Number of processes used for offline weight conversion. The default value is 1.
- If process_num is set to 1, **a single process is used for conversion**.
- If process_num is larger than 1, **multi-process conversion** is used. For example, if the target weight for conversion is the distributed weight of eight GPUs and process_num is set to 2, two processes are started to convert the weights of slices rank_0, rank_1, rank_2, and rank_3 and slices rank_4, rank_5, rank_6, and rank_7, respectively. | + +- Additional parameters used for multi-process conversion -- When the [transform_checkpoint.sh](https://gitee.com/mindspore/mindformers/blob/master/mindformers/tools/ckpt_transform/transform_checkpoint.sh) script is used, `8` indicates the number of target devices, and `2` indicates that two processes are used for conversion. +| Parameter | Description | +| --------------------- | ------------------------------------------------------------ | +| world_size | Total number of slices of the target weight. Generally, the value is dp \* mp \* pp. | +| transform_process_num | Number of processes used for offline weight conversion. The default value is 1.
- If transform_process_num is set to 1, **a single process is used for conversion**.
- If transform_process_num is larger than 1, **multi-process conversion** is used. For example, if the target weight for conversion is the distributed weight of eight GPUs and transform_process_num is set to 2, two processes are started: one converts slices rank_0, rank_1, rank_2, and rank_3, and the other converts slices rank_4, rank_5, rank_6, and rank_7. | ### Special Scenarios diff --git a/docs/mindformers/docs/source_zh_cn/feature/ckpt.md b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md index 36e0d3d867..886a3ec1c1 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/ckpt.md +++ b/docs/mindformers/docs/source_zh_cn/feature/ckpt.md @@ -209,20 +209,6 @@ transform_process_num: 2 用户在使用离线转换时,可以根据具体需求手动配置转换参数,确保转换过程灵活且可控,尤其适用于在严格控制的计算环境中进行模型部署和优化的场景。 -#### 参数说明 - -**离线权重转换**相关`yaml`参数说明如下: - -| 参数名称 | 说明 | -| ----------------- |-----------------------------| -| src_checkpoint | 源权重的绝对路径或文件夹路径。
- 如果是**完整权重**,则填写**绝对路径**;
- 如果是**分布式权重**,则填写**文件夹路径**,分布式权重须按照`model_dir/rank_x/xxx.ckpt`格式存放,文件夹路径填写为`model_dir`。
**如果rank_x文件夹下存在多个ckpt,将会使用文件名默认排序最后的ckpt文件用于转换。** | -| src_strategy_path_or_dir | 源权重对应的分布式策略文件路径。
- 如果是完整权重,则**不填写**;
- 如果是分布式权重,且使用了流水线并行,则填写**合并的策略文件路径**或**分布式策略文件夹路径**;
- 如果是分布式权重,且未使用流水线并行,则填写任一**ckpt_strategy_rank_x.ckpt**路径; | -| dst_checkpoint | 保存目标权重的文件夹路径。 | -| dst_strategy | 目标权重对应的分布式策略文件路径。
- 如果是完整权重,则**不填写**;
- 如果是分布式权重,且使用了流水线并行,则填写**合并的策略文件路径**或**分布式策略文件夹路径**;
- 如果是分布式权重,且未使用流水线并行,则填写任一**ckpt_strategy_rank_x.ckpt**路径; | -| prefix | 目标权重保存的前缀名,权重保存为”{prefix}rank_x.ckpt”,默认”checkpoint_”。 | -| world_size | 目标权重的切片总数,一般等于dp \* mp \* pp。 | -| process_num | 离线权重转换使用的进程数,默认为1。
- 如果process_num = 1,使用**单进程转换**;
- 如果process_num > 1,使用**多进程转换**,比如转换的目标权重为8卡分布式权重,process_num=2时,会启动两个进程分别负责rank_0/1/2/3和rank_4/5/6/7切片权重的转换; | - #### 离线转换配置说明 **生成分布式策略** @@ -240,8 +226,9 @@ MindSpore每次运行分布式任务后都会在`output/strategy`文件夹下生 ```shell python transform_checkpoint.py \ --src_checkpoint /worker/checkpoint/llama3-8b-2layer/rank_0/llama3_8b.ckpt \ - --dst_checkpoint /worker/transform_ckpt/llama3_8b_1to8/ \ - --dst_strategy /worker/mindformers/output/strategy/ + --dst_checkpoint_dir /worker/transform_ckpt/llama3_8b_1to8/ \ + --dst_strategy /worker/mindformers/output/strategy/ \ + --prefix "checkpoint_" ``` **多进程转换** @@ -256,12 +243,29 @@ bash transform_checkpoint.sh \ None \ /worker/transform_ckpt/llama3_8b_1to8/ \ /worker/mindformers/output/strategy/ \ - 8 2 + 8 2 "checkpoint_" ``` -**注意事项**: +> 注:参数顺序为src_checkpoint、src_strategy、dst_checkpoint_dir、dst_strategy、world_size、transform_process_num、prefix。 + +**参数说明** + +- 单进程转换使用参数 + +| 参数名称 | 说明 | +| ----------------- |-----------------------------| +| src_checkpoint | 源权重的绝对路径或文件夹路径。
- 如果是**完整权重**,则填写**绝对路径**;
- 如果是**分布式权重**,则填写**文件夹路径**,分布式权重须按照`model_dir/rank_x/xxx.ckpt`格式存放,文件夹路径填写为`model_dir`。
**如果rank_x文件夹下存在多个ckpt,将会使用文件名默认排序最后的ckpt文件用于转换。** | +| src_strategy | 源权重对应的分布式策略文件路径。
- 如果是完整权重,则**不填写**;
- 如果是分布式权重,且使用了流水线并行,则填写**合并的策略文件路径**或**分布式策略文件夹路径**;
- 如果是分布式权重,且未使用流水线并行,则填写任一**ckpt_strategy_rank_x.ckpt**路径; | +| dst_checkpoint_dir | 保存目标权重的文件夹路径。 | +| dst_strategy | 目标权重对应的分布式策略文件路径。
- 如果是完整权重,则**不填写**;
- 如果是分布式权重,且使用了流水线并行,则填写**合并的策略文件路径**或**分布式策略文件夹路径**;
- 如果是分布式权重,且未使用流水线并行,则填写任一**ckpt_strategy_rank_x.ckpt**路径; | +| prefix | 目标权重保存的前缀名,权重保存为”{prefix}rank_x.ckpt”,默认”checkpoint_”。 | + +- 多进程转换额外使用参数 -- 使用[transform_checkpoint.sh](https://gitee.com/mindspore/mindformers/blob/master/mindformers/tools/ckpt_transform/transform_checkpoint.sh)脚本时,参数`8`表示目标设备数,参数`2`表示使用2个进程进行转换。 +| 参数名称 | 说明 | +| --------------------- | ------------------------------------------------------------ | +| world_size | 目标权重的切片总数,一般等于dp \* mp \* pp。 | +| transform_process_num | 离线权重转换使用的进程数,默认为1。
- 如果transform_process_num = 1,使用**单进程转换**;
- 如果transform_process_num > 1,使用**多进程转换**,比如转换的目标权重为8卡分布式权重,transform_process_num=2时,会启动两个进程分别负责rank_0/1/2/3和rank_4/5/6/7切片权重的转换; | ### 特殊场景 -- Gitee
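
The parameter tables in this patch describe three behaviors in prose: `world_size` is usually dp \* mp \* pp, the target weights are saved as `{prefix}rank_x.ckpt`, and multi-process conversion divides the target ranks among the processes. The sketch below illustrates that arithmetic only. It assumes a contiguous split, which matches the documented example of eight slices and two processes, but the real `transform_checkpoint.py` may partition ranks differently. `split_ranks` and `target_name` are hypothetical helper names and are not part of MindSpore Transformers.

```python
from typing import List


def split_ranks(world_size: int, process_num: int) -> List[List[int]]:
    """Assign target ranks 0..world_size-1 to processes in contiguous chunks.

    Assumption: contiguous chunks, as in the documented 8-slice / 2-process
    example. This is an illustration, not the script's actual logic.
    """
    if world_size < 1 or process_num < 1:
        raise ValueError("world_size and process_num must be positive")
    # Never plan more processes than there are slices to convert.
    process_num = min(process_num, world_size)
    base, extra = divmod(world_size, process_num)
    chunks: List[List[int]] = []
    start = 0
    for i in range(process_num):
        # The first `extra` processes take one additional rank each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def target_name(prefix: str, rank: int) -> str:
    """Build the saved file name following the `{prefix}rank_x.ckpt` pattern."""
    return f"{prefix}rank_{rank}.ckpt"


# Hypothetical parallel configuration: dp=2, mp=2, pp=2 gives 8 target slices.
dp, mp, pp = 2, 2, 2
world_size = dp * mp * pp

for pid, ranks in enumerate(split_ranks(world_size, process_num=2)):
    print(pid, [target_name("checkpoint_", r) for r in ranks])
```

With `world_size=8` and two processes, the first process is assigned rank_0 to rank_3 and the second rank_4 to rank_7, which corresponds to the `8 2 "checkpoint_"` arguments in the `transform_checkpoint.sh` example.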