From 7f5612e553e844e3c88d2999c3473db889239ba2 Mon Sep 17 00:00:00 2001 From: senzhen Date: Tue, 9 Dec 2025 16:20:12 +0800 Subject: [PATCH] =?UTF-8?q?=E6=96=B0=E5=A2=9Echeckpoint2.0=E4=BF=9D?= =?UTF-8?q?=E5=AD=98=E5=8A=A0=E8=BD=BD=E5=92=8C=E6=96=AD=E7=82=B9=E7=BB=AD?= =?UTF-8?q?=E8=AE=AD=E6=96=87=E6=A1=A3?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../feature/checkpoint_saving_and_loading.md | 114 +++++++++++++++ .../docs/source_en/feature/resume_training.md | 6 + .../source_en/feature/resume_training2.0.md | 135 ++++++++++++++++++ .../docs/source_en/feature/safetensors.md | 8 ++ .../source_en/feature/training_function.rst | 2 + docs/mindformers/docs/source_en/index.rst | 14 +- .../feature/checkpoint_saving_and_loading.md | 114 +++++++++++++++ .../source_zh_cn/feature/resume_training.md | 6 + .../feature/resume_training2.0.md | 135 ++++++++++++++++++ .../docs/source_zh_cn/feature/safetensors.md | 8 ++ .../feature/training_function.rst | 2 + docs/mindformers/docs/source_zh_cn/index.rst | 14 +- 12 files changed, 552 insertions(+), 6 deletions(-) create mode 100644 docs/mindformers/docs/source_en/feature/checkpoint_saving_and_loading.md create mode 100644 docs/mindformers/docs/source_en/feature/resume_training2.0.md create mode 100644 docs/mindformers/docs/source_zh_cn/feature/checkpoint_saving_and_loading.md create mode 100644 docs/mindformers/docs/source_zh_cn/feature/resume_training2.0.md diff --git a/docs/mindformers/docs/source_en/feature/checkpoint_saving_and_loading.md b/docs/mindformers/docs/source_en/feature/checkpoint_saving_and_loading.md new file mode 100644 index 0000000000..25df854146 --- /dev/null +++ b/docs/mindformers/docs/source_en/feature/checkpoint_saving_and_loading.md @@ -0,0 +1,114 @@ +# Checkpoint Saving and Loading + +[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/checkpoint_saving_and_loading.md) + +## Overview + +MindSpore Transformers supports saving intermediate checkpoints during training. Checkpoints include components such as **model weights**, **optimizer weights**, **training context information**, and **distributed strategy meta-information**. Their core functions are to **resume training after interruption**, **prevent progress loss due to training failures**, and support **subsequent fine-tuning**, **inference**, or **model iteration**. + +MindSpore Transformers has launched **Checkpoint 2.0**, which achieves comprehensive improvements in usability and loading efficiency by reconstructing the checkpoint saving strategy and loading process. 
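In configuration terms, the upgrade can be sketched as follows (a minimal sketch: the paths are placeholders, and the fields shown are only the ones described elsewhere in this document). With Checkpoint 1.0, adapting weights to a new distributed strategy has to be requested explicitly:

```yaml
# Checkpoint 1.0 (legacy): weight strategy conversion is configured manually
load_checkpoint: /path/to/checkpoint
auto_trans_ckpt: True
src_strategy_path_or_dir: /path/to/src_strategy
```

With Checkpoint 2.0, the online Reshard mechanism performs the conversion automatically, so the legacy conversion fields are no longer needed:

```yaml
# Checkpoint 2.0: online Reshard adapts weights to the current distributed strategy
use_legacy_format: False
load_ckpt_format: 'safetensors'
load_checkpoint: /path/to/checkpoint
```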
+ +Compared with Checkpoint 1.0, the core updates are as follows: + +- **New checkpoint saving [directory structure](#directory-structure)**: The checkpoint directory contains independent files for **model weights**, **optimizer weights**, **training context information**, **distributed strategy meta-information**, etc.; +- **Added online Reshard loading mechanism**: If the distributed strategy meta-information of the checkpoint to be loaded is inconsistent with the current task, Reshard conversion will be **automatically performed on the weight parameters** during loading to generate parameters adapted to the current distributed strategy; +- **Simplified loading configuration**: Relying on the online Reshard mechanism, users **do not need to manually configure parameters such as `auto_trans_ckpt` and `src_strategy_path_or_dir`** to trigger weight strategy conversion, which significantly improves usability. + +MindSpore Transformers currently uses Checkpoint 1.0 by default. Users need to add the following parameters to the YAML configuration file to enable the saving and loading functions of Checkpoint 2.0. + +```yaml +use_legacy_format: False +``` + +> This document is only for users to experience Checkpoint 2.0. If using Checkpoint 1.0, please refer to the [Safetensors Document](https://www.mindspore.cn/mindformers/docs/en/master/feature/safetensors.html) or [Ckpt Document](https://www.mindspore.cn/mindformers/docs/en/master/feature/ckpt.html). + +## Checkpoint Saving + +### Directory Structure + +The training checkpoints of MindSpore Transformers are stored in the `output/checkpoint` directory by default, and each checkpoint is independently saved as a subfolder named after `iteration`. Taking the checkpoint generated in the first step of an 8-card task as an example, its saving format is as follows: + +```text +output + ├── checkpoint + ├── iteration_00000001 + ├── metadata.json + ├── common.json + ├── {prefix}-model-0000000-0000008.safetensors + ... + ├── {prefix}-model-0000007-0000008.safetensors + ├── {prefix}-opt-0000000-0000008.safetensors + ... + └── {prefix}-opt-0000007-0000008.safetensors + ... + └── latest_checkpointed_iteration.txt +``` + +Description of weight-related files + +| File | Description | +| ------------------------------------------ | ------------------------------------------------------------ | +| metadata.json | Records the distributed strategy meta-information and storage information of each parameter, providing necessary metadata support for automatically performing Reshard conversion when loading weights later, ensuring that the conversion is accurately adapted to the current task. | +| common.json | Records the training information of the current iteration, providing data support for resuming training from a breakpoint. | +| {prefix}-model-0000000-0000008.safetensors | Model weight storage file. Naming rule description: `prefix` is a custom file name prefix, `model` identifies the file type as model weights, `0000000` is the file sequence number, and `0000008` represents the total number of files. | +| {prefix}-opt-0000000-0000008.safetensors | Optimizer weight storage file. Naming rule description: `prefix` is a custom file name prefix, `opt` identifies the file type as optimizer weights, `0000000` is the file sequence number, and `0000008` represents the total number of files. | +| latest_checkpointed_iteration.txt | Records the iteration step corresponding to the last successfully saved checkpoint in the `output/checkpoint` directory. 
| + +### Configuration Instructions + +Users can control the weight saving behavior by modifying the relevant fields under `CheckpointMonitor` in the YAML configuration file. The specific parameter descriptions are as follows: + +| Parameter Name | Description | Value Description | +| --------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | +| prefix | Custom prefix for weight file names. It is recommended to fill in the model name to distinguish checkpoints of different models. | (str, optional) - Default value: `"CKP"`. | +| directory | The path where checkpoints are saved. If not configured, they are stored in `./output/checkpoint` by default. | (str, optional) - Default value: `None`. | +| save_checkpoint_steps | Set the training interval steps for saving checkpoints (i.e., save a checkpoint every specified number of training steps). | (int, optional) - Default value: `1`. If not set, model weights will not be saved. | +| keep_checkpoint_max | Set the maximum number of checkpoints to keep. When the limit is reached, the oldest checkpoint will be automatically deleted when a new checkpoint is saved. | (int, optional) - Default value: `5`. | +| async_save | Switch for the asynchronous checkpoint saving function (controls whether to enable the asynchronous saving mechanism). | (bool, optional) - When `True`, an asynchronous thread will be used to save checkpoints. Default value: `False`. | +| checkpoint_format | The saving format of checkpoint weights. Checkpoint 2.0 only supports `'safetensors'`; if `use_legacy_format: False` is configured, this field will be automatically converted to `'safetensors'`. | (str, optional) - Default value: `'safetensors'`. | +| remove_redundancy | Switch for the checkpoint redundancy removal function (controls whether to enable the redundancy removal saving mechanism). | (bool, optional) - Default value: `False`. | +| save_optimizer | Switch for the optimizer weight saving function (controls whether to save optimizer weight information). | (bool, optional) - Default value: `True`. | + +Configuration example is as follows: + +```yaml +callbacks: + ... + - type: CheckpointMonitor + prefix: "qwen3" + save_checkpoint_steps: 1000 + keep_checkpoint_max: 5 + async_save: False + checkpoint_format: "safetensors" + save_optimizer: True + ... +``` + +> The above configuration specifies that the training task uses "qwen3" as the prefix for safetensors file names, adopts the synchronous saving mode, saves checkpoints containing model weights and optimizer weights every 1000 steps, and retains at most the latest 5 checkpoints throughout the training process. + +If you want to learn more about CheckpointMonitor, you can refer to the [CheckpointMonitor API Document](https://www.mindspore.cn/mindformers/docs/en/master/core/mindformers.core.CheckpointMonitor.html). + +## Checkpoint Loading + +MindSpore Transformers provides flexible checkpoint loading capabilities, covering all scenarios of single-card and multi-card, with the following core features: + +1. Adaptability upgrade for Checkpoint 2.0: Relying on the online Reshard mechanism, weights can be automatically adapted to any distributed strategy task during loading without manual adjustment, reducing the cost of multi-scenario deployment; +2. Cross-platform weight compatibility: Through a dedicated conversion interface, it supports loading weight files released by the HuggingFace community. 
Currently, it has achieved compatible adaptation for the Qwen3 model training scenario, making it easy for users to reuse community resources.

### Configuration Instructions

Users can control the weight loading behavior by modifying the relevant fields in the YAML configuration file.

| Parameter Name | Description | Value Description |
| -------------------- | ------------------------------------------------------------ | ----------------------------------------- |
| load_checkpoint | The path to the checkpoint folder, supporting **the `output/checkpoint` folder path** or **a specific `iteration` subfolder path**; if the former is filled in, the checkpoint in the corresponding `iteration` subfolder is loaded according to the step recorded in `latest_checkpointed_iteration.txt`. | (str, optional) - Default value: `""` |
| pretrained_model_dir | Specifies the folder path of HuggingFace community weights; if `load_checkpoint` is also configured, this field is automatically ignored. | (str, optional) - Default value: `""` |
| balanced_load | Switch for the balanced weight loading function, **only supported in distributed tasks**; when set to `True`, each rank loads weights according to a balanced parameter allocation strategy and then obtains the final weights through parameter broadcasting. | (bool, optional) - Default value: `False` |
| use_legacy_format | Switch for Checkpoint 1.0, which needs to be set to `False` to enable Checkpoint 2.0. | (bool, optional) - Default value: `True` |
| load_ckpt_format | Specifies the format of the loaded weights, which needs to be set to `'safetensors'` (to adapt to Checkpoint 2.0). | (str, optional) - Default value: `'ckpt'` |

When `load_checkpoint` is configured as the path of the `output/checkpoint` folder, users can modify the step recorded in `latest_checkpointed_iteration.txt` to load the weights of a specified `iteration`.

## Constraint Description

- In multi-machine scenarios, all files need to be stored in the **same shared directory**, and users need to configure the **shared path in the environment variable `SHARED_PATHS`**. It is recommended to configure the uppermost shared directory path. Example: if the shared directory is `/data01` (with the project directory under it), execute `export SHARED_PATHS=/data01`.
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/feature/resume_training.md b/docs/mindformers/docs/source_en/feature/resume_training.md
index 83e1434412..484d350593 100644
--- a/docs/mindformers/docs/source_en/feature/resume_training.md
+++ b/docs/mindformers/docs/source_en/feature/resume_training.md
@@ -2,6 +2,12 @@
 
 [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/resume_training.md)
 
+This document is the user guide for the checkpoint resume training feature of **Checkpoint 1.0** under the MindSpore Transformers framework.
+
+## Important Note
+
+Currently, MindSpore Transformers has officially launched **[Checkpoint 2.0](https://www.mindspore.cn/mindformers/docs/en/master/feature/checkpoint_saving_and_loading.html)**, along with the official documentation for checkpoint [resume training adapted to the new version](https://www.mindspore.cn/mindformers/docs/en/master/feature/resume_training2.0.html). To keep feature usage compatible and up to date, this Checkpoint 1.0 document will be gradually discontinued (sunset); users are advised to refer to the new documentation first for development and usage.
+
 ## Overview
 
 MindSpore Transformers supports **step-level resume training** functionality, enabling the loading of saved checkpoints to resume previous training states. This feature is particularly important for handling large-scale training tasks, as it effectively reduces time and resource waste caused by unexpected interruptions.
diff --git a/docs/mindformers/docs/source_en/feature/resume_training2.0.md b/docs/mindformers/docs/source_en/feature/resume_training2.0.md
new file mode 100644
index 0000000000..f7af3cea61
--- /dev/null
+++ b/docs/mindformers/docs/source_en/feature/resume_training2.0.md
@@ -0,0 +1,135 @@
# Resume Training 2.0

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/resume_training2.0.md)

## Overview

MindSpore Transformers has complete resume training capabilities. The core functions and applicable scenarios are as follows:

1. **Core Functions**: Supports loading saved checkpoints to quickly resume training progress without starting from scratch;
2. **Multi-scenario Adaptation**: Covers four mainstream resume training scenarios:
    - **Interruption Resume Training**: After an abnormal interruption of a normal training task (such as a device failure or network fluctuation), resume the training process based on the saved checkpoint;
    - **Scaling Resume Training**: Adjust the number of cards (expansion / reduction) during training and continue training based on the saved checkpoint;
    - **Incremental Resume Training**: On the basis of existing training results, supplement the training dataset and continue training based on the saved checkpoint;
    - **Automatic Recovery Resume Training**: Supports platforms in automatically relaunching resume training without manual intervention.

For large-scale training tasks (long training cycles, large resource investment), this avoids progress loss caused by unexpected interruptions and significantly reduces wasted time and computing resources.

> This document only applies to scenarios where [Checkpoint 2.0](https://www.mindspore.cn/mindformers/docs/en/master/feature/checkpoint_saving_and_loading.html) is used for resume training; if you use Checkpoint 1.0, please refer to the old version of the [resume training document](https://www.mindspore.cn/mindformers/docs/en/master/feature/resume_training.html).

## Checkpoint Introduction

The training checkpoints of MindSpore Transformers are stored in the `output/checkpoint` directory by default, and each checkpoint is independently saved as a subfolder named after `iteration`. Taking the checkpoint generated in the first step of an 8-card task as an example, its saving format is as follows:

```text
output
 ├── checkpoint
   ├── iteration_0000001
     ├── metadata.json
     ├── common.json
     ├── {prefix}-model-0000000-0000008.safetensors
     ...
     ├── {prefix}-model-0000007-0000008.safetensors
     ├── {prefix}-opt-0000000-0000008.safetensors
     ...
     └── {prefix}-opt-0000007-0000008.safetensors
   ...
+ └── latest_checkpointed_iteration.txt +``` + +You can refer to [Checkpoint Saving and Loading](https://www.mindspore.cn/mindformers/docs/en/master/feature/checkpoint_saving_and_loading.html) for more information about checkpoints. + +## Configuration Description + +| Parameter Name | Description | Value Description | +| --------------- | ------------------------------------------------------------ | ----------------------------------------- | +| load_checkpoint | The path to the checkpoint folder. It can **be filled with the path of the `output/checkpoint` folder or the path of the `iteration` subfolder**.
If it is the path of the `checkpoint` folder, the checkpoint in the corresponding `iteration` subfolder will be loaded according to the number of iterations recorded in `latest_checkpointed_iteration.txt`. | (str, optional) - Default value: `""` | +| resume_training | The switch for the resume training function. When set to `True`, training will restore from the number of iterations corresponding to the checkpoint to be loaded. | (bool, optional) - Default value: `False` | + +## Scenario Introduction + +### Interruption Resume Training + +**Overview**: After an abnormal interruption of a normal training task, resume the training process based on the saved checkpoint without changing the distributed strategy. + +MindSpore Transformers provides two ways to start resuming training: + +- Resume training based on the number of iterations recorded in `latest_checkpointed_iteration.txt` + + ```yaml + load_checkpoint: /path/to/checkpoint + resume_training: True + ``` + +- Resume training based on the specified number of iterations + + ```yaml + load_checkpoint: /path/to/checkpoint/iteration_{x} + resume_training: True + ``` + + > x represents the training iteration step corresponding to the checkpoint. For example, "0000001" indicates the checkpoint corresponding to the 1st training step. + +### Scaling Resume Training + +**Overview**: When it is necessary to **expand/reduce the cluster scale** or **modify the distributed strategy** to continue the training task, the configuration is the same as [Interruption Resume Training](#interruption-resume-training). Relying on the online Reshard mechanism, MindSpore Transformers can ensure that the checkpoint weights automatically adapt to any distributed strategy, ensuring smooth resume training. + +- Resume training based on the number of iterations recorded in `latest_checkpointed_iteration.txt` + + ```yaml + load_checkpoint: /path/to/checkpoint + resume_training: True + ``` + +- Resume training based on the specified number of iterations + + ```yaml + load_checkpoint: /path/to/checkpoint/iteration_{x} + resume_training: True + ``` + + > x represents the training iteration step corresponding to the checkpoint. For example, "0000001" indicates the checkpoint corresponding to the 1st training step. + +### Incremental Resume Training + +**Overview**: The training dataset needs to be **produced and trained simultaneously**. After the current dataset is trained, add the newly produced dataset to continue training until all datasets are trained. This scenario requires users to preset the total steps of the learning rate curve in advance according to the total amount of data for training. + +Assume that a total of 10T tokens of data are trained, each produced dataset contains only 1T tokens of data, and the entire training process is completed in 10 epochs, which takes a total of 100000 steps. + +- Step 1: Preset the total training steps to fix the learning rate curve of the entire training process + + ```yaml + lr_schedule: + total_steps: 100000 + ``` + +- Step 2: Set a sufficiently large epoch value to ensure that all datasets can be trained + + ```yaml + runner_config: + epochs: 15 + ``` + + > The learning rate curve of the entire training process has been fixed, and the epoch value setting will not affect the learning rate. A larger value can be set to ensure that 10 datasets can be trained. + +- Step 3: After training one epoch of the dataset, you can replace the dataset to resume training. 
The following example resumes training based on the number of iterations recorded in `latest_checkpointed_iteration.txt`; for other resume training methods, refer to [Interruption Resume Training](#interruption-resume-training) or [Scaling Resume Training](#scaling-resume-training).

  ```yaml
  load_checkpoint: /path/to/checkpoint
  resume_training: True
  ```

  > When replacing the dataset for resume training, the displayed epoch and per-batch step may change because each dataset contains a different number of samples, but the total number of training steps remains unchanged. This is normal.

### Automatic Recovery Resume Training

**Overview**: To allow the platform to automatically relaunch resume training without manual intervention, `load_checkpoint` can be configured as the checkpoint saving directory path: when training for the first time, the directory is empty and the model initializes parameters randomly; when resuming, training recovers from the last complete checkpoint saved in the directory.

```yaml
load_checkpoint: /path/to/output/checkpoint
resume_training: True
```

## Constraint Description

- In multi-machine scenarios, all checkpoint files need to be stored in the same shared directory for resume training, and users need to configure the shared path in the environment variable `SHARED_PATHS`. It is recommended to configure the uppermost shared directory. Example: when the shared directory is `/data01`, execute `export SHARED_PATHS=/data01`.
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/feature/safetensors.md b/docs/mindformers/docs/source_en/feature/safetensors.md
index de08ae6a63..8b948a3cd2 100644
--- a/docs/mindformers/docs/source_en/feature/safetensors.md
+++ b/docs/mindformers/docs/source_en/feature/safetensors.md
@@ -2,6 +2,14 @@
 
 [![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/safetensors.md)
 
+This document describes the usage of **Safetensors format weights for Checkpoint 1.0** under the MindSpore Transformers framework.
+
+## Important Note
+
+MindSpore Transformers now officially supports **Checkpoint 2.0**. To ensure user experience and functional compatibility, this Checkpoint 1.0 document will be gradually **sunset (discontinued from maintenance and updates)**.
+
+It is recommended that users prioritize migrating to [Checkpoint 2.0](https://www.mindspore.cn/mindformers/docs/en/master/feature/checkpoint_saving_and_loading.html) for related operations. Subsequent feature iterations and technical support will focus on the new version. Thank you for your understanding and support.
+
 ## Overview
 
 Safetensors is a reliable and portable machine learning model storage format from Huggingface for storing Tensors securely and with fast storage (zero copies). This article focuses on how MindSpore Transformers supports saving and loading of this file format to help users use weights better and faster.
diff --git a/docs/mindformers/docs/source_en/feature/training_function.rst b/docs/mindformers/docs/source_en/feature/training_function.rst
index 22179aeb3f..8e2a0ef28b 100644
--- a/docs/mindformers/docs/source_en/feature/training_function.rst
+++ b/docs/mindformers/docs/source_en/feature/training_function.rst
@@ -9,6 +9,8 @@ Training Function
    training_hyperparameters
    monitor
    resume_training
+   checkpoint_saving_and_loading
+   resume_training2.0
    parallel_training
    high_availability
    memory_optimization
diff --git a/docs/mindformers/docs/source_en/index.rst b/docs/mindformers/docs/source_en/index.rst
index 8d81f2a30e..d59c6b2dd3 100644
--- a/docs/mindformers/docs/source_en/index.rst
+++ b/docs/mindformers/docs/source_en/index.rst
@@ -43,11 +43,11 @@ MindSpore Transformers provides a wealth of features throughout the full-process
 
   - `Ckpt Weights `_
 
-    Supports conversion, slice and merge weight files in ckpt format.
+    [Checkpoint 1.0] Supports conversion, slice and merge weight files in ckpt format.
 
   - `Safetensors Weights `_
 
-    Supports saving and loading weight files in safetensors format.
+    [Checkpoint 1.0] Supports saving and loading weight files in safetensors format.
 
   - `Configuration File Descriptions `_
 
@@ -81,7 +81,15 @@
 
   - `Resumable Training After Breakpoint `_
 
-    Supports step-level resumable training after breakpoint, effectively reducing the waste of time and resources caused by unexpected interruptions during large-scale training.
+    [Checkpoint 1.0] Supports step-level resumable training after breakpoint, effectively reducing the waste of time and resources caused by unexpected interruptions during large-scale training.
+
+  - `Checkpoint Saving and Loading `_
+
+    [Checkpoint 2.0] Supports checkpoint saving and loading.
+
+  - `Resumable Training After Breakpoint 2.0 `_
+
+    [Checkpoint 2.0] Supports step-level resumable training after breakpoint, effectively reducing the waste of time and resources caused by unexpected interruptions during large-scale training.
- `Training High-Availability (Beta) `_ diff --git a/docs/mindformers/docs/source_zh_cn/feature/checkpoint_saving_and_loading.md b/docs/mindformers/docs/source_zh_cn/feature/checkpoint_saving_and_loading.md new file mode 100644 index 0000000000..8b4d12fd44 --- /dev/null +++ b/docs/mindformers/docs/source_zh_cn/feature/checkpoint_saving_and_loading.md @@ -0,0 +1,114 @@ +# checkpoint保存和加载 + +[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/checkpoint_saving_and_loading.md) + +## 概述 + +MindSpore Transformers 支持训练过程中保存checkpoint。checkpoint包括**模型权重**、**优化器权重**、**训练上下文信息**和**分布式策略元信息**等组件,核心作用是**中断后恢复训练**、**防止训练失败丢失进度**,同时支持**后续微调**、**推理**或**模型迭代**。 + +MindSpore Transformers 推出**Checkpoint 2.0 版本**,通过重构checkpoint保存策略与加载流程,实现易用性与加载效率的综合提升。 + +相较于Checkpoint 1.0 版本,核心更新如下: + +- **全新checkpoint保存[目录结构](#目录结构)**:目录包含**模型权重**、**优化器权重**、**训练上下文信息**、**分布式策略元信息**等独立文件; +- **新增在线 Reshard 加载机制**:若待加载checkpoint的分布式策略元信息与当前任务不一致,加载时将**自动对权重参数执行 Reshard 转换**,生成适配当前分布式策略的参数; +- **简化加载配置**:依托在线 Reshard 机制,用户**无需手动配置`auto_trans_ckpt`、`src_strategy_path_or_dir`等参数**触发权重策略转换,易用性显著提升。 + +MindSpore Transformers 目前默认采用Checkpoint 1.0 版本,用户需在 YAML 配置文件中添加如下参数,即可启用Checkpoint 2.0 版本的保存与加载功能。 + +```yaml +use_legacy_format: False +``` + +> 该文档仅针对用户使用体验Checkpoint 2.0 版本,若使用Checkpoint 1.0 版本,请参考[Safetensors文档](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/safetensors.html)或[Ckpt文档](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/ckpt.html)。 + +## checkpoint保存 + +### 目录结构 + +MindSpore Transformers 的训练checkpoint默认存储于 `output/checkpoint` 目录,每个checkpoint独立保存为以 `iteration` 命名的子文件夹。以 8 卡任务第 1 步生成的checkpoint为例,其保存格式如下: + +```text +output + ├── checkpoint + ├── iteration_00000001 + ├── metadata.json + ├── common.json + ├── {prefix}-model-0000000-0000008.safetensors + ... + ├── {prefix}-model-0000007-0000008.safetensors + ├── {prefix}-opt-0000000-0000008.safetensors + ... + └── {prefix}-opt-0000007-0000008.safetensors + ... 
+ └── latest_checkpointed_iteration.txt +``` + +权重相关文件说明 + +| 文件 | 描述 | +| ------------------------------------------ | ------------------------------------------------------------ | +| metadata.json | 记录各参数的分布式策略元信息与存储信息,为后续加载权重时自动执行 Reshard 转换提供必要的元数据支持,确保转换精准适配当前任务。 | +| common.json | 记录当前迭代(iteration)的训练信息,为断点续训提供数据支持。 | +| {prefix}-model-0000000-0000008.safetensors | 模型权重存储文件。命名规则说明:`prefix` 为自定义文件名前缀,`model` 标识文件类型为模型权重,`0000000` 是文件序号,`0000008` 代表总文件个数。 | +| {prefix}-opt-0000000-0000008.safetensors | 优化器权重存储文件。命名规则说明:`prefix` 为自定义文件名前缀,`opt` 标识文件类型为优化器权重,`0000000` 是文件序号,`0000008` 代表总文件个数。 | +| latest_checkpointed_iteration.txt | 记录 `output/checkpoint` 目录下最后一个成功保存的checkpoint对应的迭代步数。 | + +### 配置说明 + +用户可通过修改 YAML 配置文件中 `CheckpointMonitor` 下的相关字段,控制权重保存行为,具体参数说明如下: + +| 参数名称 | 描述 | 取值说明 | +| --------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | +| prefix | 权重文件名自定义前缀,建议填写模型名称以区分不同模型的checkpoint。 | (str, 可选) - 默认值: `"CKP"` 。 | +| directory | checkpoint保存路径,未配置时默认存储于 `./output/checkpoint`。 | (str, 可选) - 默认值: `None` 。 | +| save_checkpoint_steps | 设置保存checkpoint的训练间隔步数(即每训练指定步数保存一次checkpoint)。 | (int, 可选) - 默认值: `1` ,不设置时不保存模型权重。 | +| keep_checkpoint_max | 设置checkpoint最大保留数量,达到上限后,保存新checkpoint时会自动删除最旧的checkpoint。 | (int, 可选) - 默认值: `5` 。 | +| async_save | checkpoint异步保存功能开关(控制是否启用异步保存机制)。 | (bool, 可选) - `True` 时将使用异步线程保存checkpoint。默认值: `False` 。 | +| checkpoint_format | checkpoint权重保存格式,Checkpoint 2.0 版本仅支持 `'safetensors'`;若已配置 `use_legacy_format: False`,该字段将自动转换为 `'safetensors'`。 | (str, 可选) - 默认值: `'safetensors'` 。 | +| remove_redundancy | checkpoint去冗余保存功能开关(控制是否启用去冗余保存机制)。 | (bool, 可选) - 默认值: `False` 。 | +| save_optimizer | 优化器权重保存功能开关(控制是否保存优化器权重信息)。 | (bool, 可选) - 默认值: `True` 。 | + +配置示例如下: + +```yaml +callbacks: + ... + - type: CheckpointMonitor + prefix: "qwen3" + save_checkpoint_steps: 1000 + keep_checkpoint_max: 5 + async_save: False + checkpoint_format: "safetensors" + save_optimizer: True + ... +``` + +> 上述配置指定训练任务以 "qwen3" 作为 safetensors 文件名前缀,采用同步保存模式,每 1000 步保存一次包含模型权重与优化器权重的checkpoint,且训练全程最多保留最新的 5 个checkpoint。 + +如果您想了解更多有关 CheckpointMonitor 的知识,可以参考 [CheckpointMonitor API 文档](https://www.mindspore.cn/mindformers/docs/zh-CN/master/core/mindformers.core.CheckpointMonitor.html)。 + +## checkpoint加载 + +MindSpore Transformers 提供灵活的checkpoint加载能力,覆盖单卡与多卡全场景,核心特性如下: + +1. Checkpoint 2.0 版本适配性升级:依托在线 Reshard 机制,加载时权重可自动适配任意分布式策略任务,无需手动调整,降低多场景部署成本; +2. 跨平台权重兼容:通过专用转换接口,支持加载 HuggingFace 社区发布的权重文件,当前已实现 Qwen3 模型训练场景的兼容适配,方便用户复用社区资源。 + +### 配置说明 + +用户可通过修改 YAML 配置文件中的相关字段,控制权重加载行为。 + +| 参数名称 | 描述 | 取值说明 | +| -------------------- | ------------------------------------------------------------ | ------------------------------ | +| load_checkpoint | checkpoint文件夹路径,可**填写`output/checkpoint`文件夹路径或`iteration`子文件夹路径**。
若为`checkpoint`文件夹路径,按照`latest_checkpointed_iteration.txt`中记录的步数加载对应`iteration`子文件夹checkpoint。 | (str,可选) - 默认值:`""` | +| pretrained_model_dir | 指定 HuggingFace 社区权重的文件夹路径;若同时配置了 `load_checkpoint`,该字段将自动失效。 | (str,可选) - 默认值:`""` | +| balanced_load | 权重均衡加载功能开关,**仅支持在分布式任务中开启**;设为 `True` 时,各 rank 按参数均衡分配策略加载权重,再通过参数广播获取最终权重。 | (bool,可选) - 默认值:`False` | +| use_legacy_format | Checkpoint 1.0 版本启用开关,需设置为 `False`(即默认使用Checkpoint 2.0 版本)。 | (bool,可选) - 默认值:`True` | +| load_ckpt_format | 指定加载权重的格式,需设置为 `'safetensors'`(适配Checkpoint 2.0 版本)。 | (bool,可选) - 默认值:`ckpt` | + +当 `load_checkpoint` 配置为 `output/checkpoint` 文件夹路径时,用户可通过修改 `latest_checkpointed_iteration.txt` 中记录的步数,实现指定 `iteration` 权重的加载。 + +## 约束说明 + +- 多机场景下,所有文件需存储于**同一共享目录**,用户需将该**共享路径配置至环境变量 `SHARED_PATHS`**。建议优先配置为最上层共享目录路径,示例:若共享目录为 `/data01`(工程目录位于其下),可执行 `export SHARED_PATHS=/data01`。 diff --git a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md index 56b1359210..85a0b0fe98 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/resume_training.md +++ b/docs/mindformers/docs/source_zh_cn/feature/resume_training.md @@ -2,6 +2,12 @@ [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/resume_training.md) +本文档为 **MindSpore Transformers** 框架下 Checkpoint 1.0 版本的断点续训功能使用介绍。 + +## 重要说明 + +目前 MindSpore Transformers 已正式推出 **[Checkpoint 2.0 版本](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/checkpoint_saving_and_loading.html)**,并同步发布了适配新版本的[断点续训](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/resume_training2.0.html)官方文档。为保证功能使用的兼容性与先进性,本 Checkpoint 1.0 版本相关文档后续将逐步停止维护(日落),建议用户优先参考新版本文档进行开发与使用。 + ## 概述 MindSpore Transformers支持**step级断点续训**功能,支持加载已保存的checkpoint来恢复之前的状态继续训练。这一特性在处理大规模训练任务时尤为重要,能够有效减少因意外中断导致的时间和资源浪费。 diff --git a/docs/mindformers/docs/source_zh_cn/feature/resume_training2.0.md b/docs/mindformers/docs/source_zh_cn/feature/resume_training2.0.md new file mode 100644 index 0000000000..d4d2b7f597 --- /dev/null +++ b/docs/mindformers/docs/source_zh_cn/feature/resume_training2.0.md @@ -0,0 +1,135 @@ +# 断点续训2.0 + +[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/resume_training2.0.md) + +## 概述 + +MindSpore Transformers 具备完备的断点续训能力,核心功能与适用场景如下: + +1. **核心功能**:支持加载已保存的checkpoint,快速恢复训练进度,无需从零开始; +2. **多场景适配**:覆盖四大主流续训场景 + - **中断续训**:正常训练任务异常中断(如设备故障、网络波动)后,基于已保存的checkpoint重新恢复训练流程; + - **扩缩容续训**:训练过程中调整卡数(扩容 / 缩容),基于已保存的checkpoint继续训练; + - **增量续训**:在已有训练成果基础上,补充训练数据集,基于已保存的checkpoint继续训练; + - **自动恢复续训**:支持平台无需人工干预自动拉起断点续训; + +对于大规模训练任务(训练周期长、资源投入大),可避免意外中断导致的进度丢失,显著减少时间与计算资源浪费。 + +> 本文档仅适用于使用 [Checkpoint 2.0 版本](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/checkpoint_saving_and_loading.html)进行续训的场景;若用户使用Checkpoint 1.0 版本,需参考旧版[断点续训文档](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/resume_training.html)。 + +## checkpoint介绍 + +MindSpore Transformers 的训练checkpoint默认存储于 `output/checkpoint` 目录,每个checkpoint独立保存为以 `iteration` 命名的子文件夹。以 8 卡任务第 1 步生成的checkpoint为例,其保存格式如下: + +```text +output + ├── checkpoint + ├── iteration_0000001 + ├── metadata.json + ├── common.json + ├── {prefix}-model-0000000-0000008.safetensor + ... 
+ ├── {prefix}-model-0000007-0000008.safetensor + ├── {prefix}-opt-0000000-0000008.safetensor + ... + └── {prefix}-opt-0000007-0000008.safetensor + ... + └── latest_checkpointed_iteration.txt +``` + +可参考[checkpoint保存和加载](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/checkpoint_saving_and_loading.html),获取更多checkpoint相关信息。 + +## 配置说明 + +| 参数名称 | 描述 | 取值说明 | +| --------------- | ------------------------------------------------------------ | ------------------------------ | +| load_checkpoint | checkpoint文件夹路径,可**填写`output/checkpoint`文件夹路径或`iteration`子文件夹路径**。
若为`checkpoint`文件夹路径,将会按照`latest_checkpointed_iteration.txt`中记录的迭代步数,加载对应`iteration`子文件夹checkpoint。 | (str,可选) - 默认值:`""` |
| resume_training | 断点续训功能开关,设置为 `True` 时,将从待加载checkpoint对应的迭代步数继续训练。 | (bool,可选) - 默认值:`False` |

## 场景介绍

### 中断续训

**概述**:正常训练任务异常中断后,在不改变分布式策略的前提下,基于已保存的checkpoint重新恢复训练流程。

MindSpore Transformers 支持用户使用以下两种方式启动断点续训:

- 基于`latest_checkpointed_iteration.txt`中记录的迭代步数续训

  ```yaml
  load_checkpoint: /path/to/checkpoint
  resume_training: True
  ```

- 基于指定迭代步数续训

  ```yaml
  load_checkpoint: /path/to/checkpoint/iteration_{x}
  resume_training: True
  ```

  > x 代表checkpoint对应的训练迭代步数,例如 "0000001" 即表示第 1 步训练对应的checkpoint。

### 扩缩容续训

**概述**:需要**扩大/缩小集群规模**或**修改分布式策略**继续训练任务,配置方式和[中断续训](#中断续训)一致。MindSpore Transformers 依托在线 Reshard 机制,可确保checkpoint权重自动适配任意分布式策略,保障续训顺畅。

- 基于`latest_checkpointed_iteration.txt`中记录的迭代步数续训

  ```yaml
  load_checkpoint: /path/to/checkpoint
  resume_training: True
  ```

- 基于指定迭代步数续训

  ```yaml
  load_checkpoint: /path/to/checkpoint/iteration_{x}
  resume_training: True
  ```

  > x 代表checkpoint对应的训练迭代步数,例如 "0000001" 即表示第 1 步训练对应的checkpoint。

### 增量续训

**概述**:训练数据集需要**边生产边训练**,当前数据集训练结束后,加入新生产的数据集继续训练,直到所有数据集训练完毕。该场景需要用户根据训练的总数据量,提前预设学习率曲线的总步数。

假设一共训练10T tokens数据,每次生产的数据集只包含1T tokens数据,整个训练过程分10个epoch训完,一共需要花费 100000 steps。

- 步骤1:预设总训练步数,固定整个训练流程的学习率曲线

  ```yaml
  lr_schedule:
    total_steps: 100000
  ```

- 步骤2:设置足够大的epoch值,确保能够训完所有数据集

  ```yaml
  runner_config:
    epochs: 15
  ```

  > 整个训练过程的学习率曲线已固定,epochs值设置不会影响学习率,可以设置较大值,确保能训完10个数据集。

- 步骤3:数据集训完1个epoch后,可以更换数据集续训,如下为基于`latest_checkpointed_iteration.txt`中记录的迭代步数续训,其他续训方式请参考[中断续训](#中断续训)或[扩缩容续训](#扩缩容续训)。

  ```yaml
  load_checkpoint: /path/to/checkpoint
  resume_training: True
  ```

  > 更换数据集续训时,因各数据集样本数量不同,显示的 epoch 和单批次 step 可能变化,但训练总 step 数保持不变,这属于正常现象。

### 自动恢复续训

**概述**:为支持平台无人工干预自动拉起断点续训,可将 `load_checkpoint` 配置为checkpoint保存目录路径:首次训练时目录为空,模型随机初始化参数;续训时则基于该目录下最后保存的完整checkpoint恢复训练。

```yaml
load_checkpoint: /path/to/output/checkpoint
resume_training: True
```

## 约束说明

- 多机场景下,断点续训需将所有checkpoint文件存放于同一共享目录,用户需将该共享路径配置至环境变量 `SHARED_PATHS`;建议优先配置最上层共享目录,示例:共享目录为 `/data01` 时,执行 `export SHARED_PATHS=/data01` 即可。
diff --git a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
index 95bc09dc1d..057c66eb8c 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
@@ -2,6 +2,14 @@
 
 [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/safetensors.md)
 
+本文档为 **MindSpore Transformers 框架下 Checkpoint 1.0 版本** 的 Safetensors 格式权重使用介绍。
+
+## 重要说明
+
+当前 MindSpore Transformers 已正式支持 **Checkpoint 2.0 版本**,为保障用户使用体验与功能兼容性,本 Checkpoint 1.0 版本相关文档将逐步 **日落(停止维护与更新)**。
+
+建议用户优先迁移至 [Checkpoint 2.0 版本](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/checkpoint_saving_and_loading.html)进行相关操作,后续功能迭代与技术支持将聚焦于新版本,感谢你的理解与支持。
+
 ## 概述
 
 Safetensors 是 Huggingface 推出的一种可靠、易移植的机器学习模型存储格式,用于安全地存储Tensor,而且存储速度较快(零拷贝)。
diff --git a/docs/mindformers/docs/source_zh_cn/feature/training_function.rst b/docs/mindformers/docs/source_zh_cn/feature/training_function.rst
index 15b0e53de8..b10a766ac3 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/training_function.rst
+++ b/docs/mindformers/docs/source_zh_cn/feature/training_function.rst
@@ -9,6 +9,8 @@
training_hyperparameters monitor resume_training + checkpoint_saving_and_loading + resume_training2.0 parallel_training high_availability memory_optimization diff --git a/docs/mindformers/docs/source_zh_cn/index.rst b/docs/mindformers/docs/source_zh_cn/index.rst index ce5f88ad93..12d8db9a1e 100644 --- a/docs/mindformers/docs/source_zh_cn/index.rst +++ b/docs/mindformers/docs/source_zh_cn/index.rst @@ -42,11 +42,11 @@ MindSpore Transformers功能特性说明 - `Ckpt权重 `_ - 支持ckpt格式的权重文件转换及切分功能。 + [Checkpoint 1.0 版本] 支持ckpt格式的权重文件转换及切分功能。 - `Safetensors权重 `_ - 支持safetensors格式的权重文件保存及加载功能。 + [Checkpoint 1.0 版本] 支持safetensors格式的权重文件保存及加载功能。 - `配置文件说明 `_ @@ -80,7 +80,15 @@ MindSpore Transformers功能特性说明 - `断点续训 `_ - 支持step级断点续训,有效减少大规模训练时意外中断造成的时间和资源浪费。 + [Checkpoint 1.0 版本] 支持step级断点续训,有效减少大规模训练时意外中断造成的时间和资源浪费。 + + - `checkpoint保存和加载 `_ + + [Checkpoint 2.0 版本] 支持checkpoint保存和加载功能。 + + - `断点续训2.0 `_ + + [Checkpoint 2.0 版本] 支持step级断点续训,有效减少大规模训练时意外中断造成的时间和资源浪费。 - `训练高可用(Beta) `_ -- Gitee