From 27ca8c4d169f8c6c4420b3096666203376e8ce4d Mon Sep 17 00:00:00 2001 From: yiyison Date: Fri, 15 Aug 2025 09:56:43 +0800 Subject: [PATCH] =?UTF-8?q?=E5=A2=9E=E5=8A=A0=20resume=20training=20?= =?UTF-8?q?=E4=BD=BF=E7=94=A8=E9=80=BB=E8=BE=91=E5=8F=98=E6=9B=B4=E7=9A=84?= =?UTF-8?q?=E8=AF=B4=E6=98=8E?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../docs/source_en/feature/resume_training.md | 33 ++++++++++++------- .../source_zh_cn/feature/resume_training.md | 33 ++++++++++++------- 2 files changed, 42 insertions(+), 24 deletions(-) diff --git a/docs/mindformers/docs/source_en/feature/resume_training.md b/docs/mindformers/docs/source_en/feature/resume_training.md index 1dbd2632a8..984efd3310 100644 --- a/docs/mindformers/docs/source_en/feature/resume_training.md +++ b/docs/mindformers/docs/source_en/feature/resume_training.md @@ -12,20 +12,20 @@ MindSpore Transformers supports **step-level resumable training**, which allows You can modify the configuration file to control resumable training. The main parameters are as follows. For details about other parameters, see the description of CheckpointMonitor. -| Parameter | Description | -|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| load_checkpoint | Weight path loaded during resumable training. The path can be a folder path (used to load distributed weights) or a specific weight file path. The default value is an empty string, indicating that no weight is loaded (required for resumable training). | -| resume_training | Specifies whether to enable resumable training. You can set it to `True` or specify a weight file name. If the value is `True`, the system automatically resumes the training from the last interruption. The default value is `False`. | -| load_ckpt_async | Determines whether to load model weights and compile in parallel (this configuration does not take effect when auto_trans_ckpt is set to true). The default value is False (serial execution).
When it is `True`, the parallel capability of loading ckpt weights and building model is enabled to reduce the overall time resume training. | +| Parameter | Description | +|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| load_checkpoint | Weight path loaded during resumable training. The path can be a folder path (used to load distributed weights) or a specific weight file path. The default value is an empty string, indicating that no weight is loaded (required for resumable training). When the configured path is an empty directory, the system will fall back to pretraining with randomly initialized weights. | +| resume_training | Specifies whether to enable resumable training. You can set it to `True` or specify a weight file name. If the value is `True`, the system automatically resumes the training from the last interruption. The default value is `False`. | +| load_ckpt_async | Determines whether to load model weights and compile in parallel (this configuration does not take effect when auto_trans_ckpt is set to true). The default value is False (serial execution).
When it is `True`, loading checkpoint weights and building the model run in parallel, reducing the overall time needed to resume training. |

Based on the input parameters, there are four cases.

-| load_checkpoint | resume_training | Description | Recommended or Not |
|---------------------|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|
| Weight file path | True | Resumes a training based on the weights specified by load_checkpoint. | √ |
| Weight file path | Weight file name | The file name specified by resume_training is invalid. A training is resumed based on the weights specified by load_checkpoint. | × |
| Weight folder path | True | **Scenario 1: Single-node system, multi-node system+shared directory, or ModelArts**<br>
1. Resumes the training based on the weights recorded in meta.json files and supports fault recovery.
2. Resumes the training based on the latest weight of all ranks if the meta.json file of any rank is missing.
**Scenario 2: Multi-node+non-shared directory**
Resumes the training based on the latest weight of all ranks. | √ |
| Weight folder path | Weight file name | Resumes the training based on the weights specified by resume_training. | √ |
+| load_checkpoint | resume_training | Description | Recommended or Not |
|---------------------|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|
| Weight file path | True | Resumes training based on the weights specified by load_checkpoint. | √ |
| Weight file path | Weight file name | The file name specified by resume_training is invalid; training is resumed based on the weights specified by load_checkpoint. | × |
| Weight folder path | True | **Scenario 1: Single-node system, multi-node system+shared directory, or ModelArts**<br>
1. Resumes the training based on the weights recorded in meta.json files and supports fault recovery.
2. Resumes the training based on the latest weight of all ranks if the meta.json file of any rank is missing.
**Scenario 2: Multi-node+non-shared directory**
Resumes the training based on the latest weight of all ranks.
**Scenario 3: Automatic training resumption**<br>
To make automatic resumption easy to use, set `load_checkpoint` to the directory where checkpoints are saved, so that this setting does not need to be changed manually when training is resumed. If the directory is empty during the initial run, weights are randomly initialized as usual; when training is resumed, it recovers from the checkpoints saved in this directory. | √ |
+| Weight folder path | Weight file name | Resumes the training based on the weights specified by resume_training. | √ |

 In addition, you can modify the following parameters in the configuration file to use related functions.

@@ -49,6 +49,15 @@ For related configuration files, see [research/llama3_1/llama3_1_8b/finetune_lla

 1. Modify `research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml`.

+   To start initial training from randomly initialized weights and later resume it without changing the configuration file, set `resume_training` to `True` and point `load_checkpoint` at the directory where checkpoints will be saved:
+
+   ```yaml
+   load_checkpoint: './output/checkpoint'
+   resume_training: True
+   ```
+
+   > An empty `load_checkpoint` directory triggers random weight initialization, so use an empty directory only if checkpoints will be saved there; otherwise, the next run will start from scratch instead of resuming.
+
    Configure the parallelism as required.

    ```yaml
@@ -95,7 +104,7 @@ For related configuration files, see [research/llama3_1/llama3_1_8b/finetune_lla

 ### Resumable Training

-1. Modify the configuration and specify the resumable training weight file.
+1. If `resume_training` was set to `False` in the preceding training configuration, update the configuration to specify the resumable training weight file.

    ```yaml
    load_checkpoint: './output/checkpoint'
diff --git a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
index 32445bab4f..33655c5b17 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md
@@ -12,20 +12,20 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中

 用户可通过修改配置文件来控制断点续训的行为。以下是主要参数,其他参数可参考CheckpointMonitor介绍:

-| 参数 | 描述 |
| --------------- |--------------------------------------------------------------------------------------------------------------|
| load_checkpoint | 断点续训时加载的权重路径。路径可以是文件夹路径(用于加载分布式权重),也可以是具体权重文件的路径。默认为空字符串,即不加载权重(断点续训时必填) |
| resume_training | 断点续训开关,可设置为`True`或指定特定的权重文件名。为`True`时,系统会自动从上次中断处恢复训练。默认为`False` |
| load_ckpt_async | 是否将加载权重与模型编译的操作并行执行,不支持在线自动切分权重场景(auto_trans_ckpt=True),该场景下不生效。默认为False串行执行。<br>
为`True`时,并行执行,减少总体拉起续训的耗时 | +| 参数 | 描述 | +| --------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------| +| load_checkpoint | 断点续训时加载的权重路径。路径可以是文件夹路径(用于加载分布式权重),也可以是具体权重文件的路径。默认为空字符串,即不加载权重(断点续训时必填)。当配置的路径为空目录时,会退化为使用随机初始化权重进行预训练。| +| resume_training | 断点续训开关,可设置为`True`或指定特定的权重文件名。为`True`时,系统会自动从上次中断处恢复训练。默认为`False`。 | +| load_ckpt_async | 是否将加载权重与模型编译的操作并行执行,不支持在线自动切分权重场景(auto_trans_ckpt=True),该场景下不生效。默认为False串行执行。
为`True`时,并行执行,减少总体拉起续训的耗时。 | 根据传入参数不同,可分为如下四种情况: -| load_checkpoint | resume_training | 功能描述 | 是否为推荐使用方式 | -|-----------------|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------| -| 权重文件路径 | True | 基于load_checkpoint指代的权重续训 | √ | -| 权重文件路径 | 权重文件名 | resume_training指代的文件名无效,基于load_checkpoint指代的权重续训 | × | -| 权重文件夹路径 | True | **场景1:"单机"或"多机+共享目录"或"ModelArts"**
① 基于meta.json记录的权重续训,支持故障恢复。
② 若任一rank文件夹下缺少meta.json,所有rank基于最后时间戳的权重续训。
**场景2:"多机+非共享目录"**
所有rank基于最后时间戳的权重续训。 | √ | -| 权重文件夹路径 | 权重文件名 | 基于resume_training指代的权重续训 | √ | +| load_checkpoint | resume_training | 功能描述 | 是否为推荐使用方式 | +|-----------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------| +| 权重文件路径 | True | 基于load_checkpoint指代的权重续训 | √ | +| 权重文件路径 | 权重文件名 | resume_training指代的文件名无效,基于load_checkpoint指代的权重续训 | × | +| 权重文件夹路径 | True | **场景1:"单机"或"多机+共享目录"或"ModelArts"**
① 基于meta.json记录的权重续训,支持故障恢复。
② 若任一rank文件夹下缺少meta.json,所有rank基于最后时间戳的权重续训。
**场景2:"多机+非共享目录"**
所有rank基于最后时间戳的权重续训。
**场景3:"自动恢复训练"**
为方便自动恢复训练功能的使用,可以将load_checkpoint配置为权重checkpoint的保存路径,这样在续训时不需要对配置项load_checkpoint做手动修改。首次开始训练时,该目录为空,会正常随机初始化权重;续训时,会从该目录下保存的checkpoint恢复训练。 | √ |
+| 权重文件夹路径 | 权重文件名 | 基于resume_training指代的权重续训 | √ |

 此外,用户还可通过增改配置文件的如下参数来使用相关功能。

@@ -48,6 +48,15 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中

 1. 修改`research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml`:

+   如果想首次运行随机初始化训练,并且后续断点续训不改配置文件,可在此时将`resume_training`设置为`True`,并将`load_checkpoint`设为即将保存权重的目录:
+
+   ```yaml
+   load_checkpoint: './output/checkpoint'
+   resume_training: True
+   ```
+
+   > 一旦该目录为空,模型权重即会自动随机初始化。因此,如果误设了一个并非即将保存权重的空目录,会导致第二次拉起任务时训练从头开始。
+
    根据需要设置并行配置:

    ```yaml
@@ -94,7 +103,7 @@ MindSpore Transformers支持**step级断点续训**功能,允许在训练中

 ### 断点续训

-1. 修改配置,指定断点续训权重文件:
+1. 如果前置训练的配置中`resume_training`为`False`,则需修改配置,指定断点续训权重文件:

    ```yaml
    load_checkpoint: './output/checkpoint'
--
Gitee
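A minimal YAML sketch consolidating the options this patch documents, assuming the `./output/checkpoint` save directory used in the examples above; the values are illustrative rather than defaults, and `load_ckpt_async`/`auto_trans_ckpt` are included only to show the interaction noted in the parameter table:

```yaml
# Resumable-training excerpt (illustrative; not a complete finetune_llama3_1_8b.yaml)
load_checkpoint: './output/checkpoint'  # checkpoint save directory: empty on the first run,
                                        # so weights initialize randomly; later runs resume
                                        # from the checkpoints saved here
resume_training: True                   # automatically resume from the last interruption
load_ckpt_async: True                   # load weights in parallel with model compilation
auto_trans_ckpt: False                  # when True, load_ckpt_async does not take effect
```

With this combination, the same configuration file serves both the first launch and every subsequent resume, which is the Scenario 3 behavior described in the tables above.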