From 2ec2cbf3f82764eb550559707b3b0226ffb4ead1 Mon Sep 17 00:00:00 2001
From: lanxiang
Date: Tue, 29 Jul 2025 10:40:37 +0800
Subject: [PATCH] =?UTF-8?q?PMA=E8=9E=8D=E5=90=88=E6=9D=83=E9=87=8D?=
 =?UTF-8?q?=E4=BD=BF=E7=94=A8=E8=AF=B4=E6=98=8E?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
 .../docs/source_en/feature/configuration.md   |  6 +-
 .../source_en/feature/pma_fused_checkpoint.md | 80 +++++++++++++++++
 .../source_en/feature/training_function.rst   |  1 +
 .../source_zh_cn/feature/configuration.md     | 88 ++++++++++---------
 .../feature/pma_fused_checkpoint.md           | 80 +++++++++++++++++
 .../feature/training_function.rst             |  3 +-
 6 files changed, 214 insertions(+), 44 deletions(-)
 create mode 100644 docs/mindformers/docs/source_en/feature/pma_fused_checkpoint.md
 create mode 100644 docs/mindformers/docs/source_zh_cn/feature/pma_fused_checkpoint.md

diff --git a/docs/mindformers/docs/source_en/feature/configuration.md b/docs/mindformers/docs/source_en/feature/configuration.md
index 9c8ea5de41..d267a91d06 100644
--- a/docs/mindformers/docs/source_en/feature/configuration.md
+++ b/docs/mindformers/docs/source_en/feature/configuration.md
@@ -126,10 +126,14 @@ When starting model training, in addition to model-related parameters, you also
 | layer_decay | Set the layer attenuation factor. | float |
 | optimizer.type | Set the optimizer class, the optimizer is mainly used to calculate the gradient for model training. | str |
 | optimizer.weight_decay | Set the optimizer weight decay factor. | float |
+| optimizer.fused_num | Set the number of weights (`fused_num`) to fuse; the fused weight is updated into the network parameters according to the fusion algorithm. Defaults to `10`. | int |
+| optimizer.interleave_step | Set the step interval for selecting weights to fuse; every `interleave_step` steps, the current weight is taken as a fusion candidate. Defaults to `1000`. | int |
+| optimizer.fused_algo | Set the fusion algorithm; `ema` and `sma` are supported. Defaults to `ema`. | string |
+| optimizer.ema_alpha | Set the fusion coefficient; it takes effect only when `fused_algo` is set to `ema`. Defaults to `0.2`. | float |
 | train_dataset.batch_size | The description is same as that of `runner_config.batch_size`. | int |
 | train_dataset.input_columns | Set the input data columns for the training dataset. | list |
 | train_dataset.output_columns | Set the output data columns for the training dataset. | list |
-| train_dataset.construct_args_key | Set the dataset part `keys` of the model `construct` input to the model in lexicographical order, used when the parameter passing order of the model does not match the order of the dataset input. | list |
+| train_dataset.construct_args_key | Set the dataset part `keys` of the model `construct` input to the model in lexicographical order, used when the parameter passing order of the model does not match the order of the dataset input. | list |
 | train_dataset.column_order | Set the order of the output data columns of the training dataset. | list |
 | train_dataset.num_parallel_workers | Set the number of processes that read the training dataset. | int |
 | train_dataset.python_multiprocessing | Enabling Python multi-process mode to improve data processing performance.
| bool |
diff --git a/docs/mindformers/docs/source_en/feature/pma_fused_checkpoint.md b/docs/mindformers/docs/source_en/feature/pma_fused_checkpoint.md
new file mode 100644
index 0000000000..8ae0ee691d
--- /dev/null
+++ b/docs/mindformers/docs/source_en/feature/pma_fused_checkpoint.md
@@ -0,0 +1,80 @@
+# Pre-trained Model Average Weight Merging
+
+[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/pma_fused_checkpoint.md)
+
+## Overview
+
+Pre-trained Model Average (PMA) weight merging fuses model weights during training with either the Exponential Moving Average (EMA) or the Simple Moving Average (SMA) algorithm, in order to improve the effectiveness of model training.
+
+MindSpore Transformers provides the `EMA` and `SMA` algorithms for weight fusion. The merging formulas are as follows:
+
+EMA algorithm formula: $PMA_n = (1 - \alpha) \times PMA_{n-1} + \alpha \times W_n$
+
+> The EMA algorithm assigns weights in an exponentially decreasing manner, so it is more sensitive to the most recent model weights and can respond quickly to changes in the model during the later stages of training.
+
+SMA algorithm formula: $PMA_n = (W_1 + \dots + W_n) / n$
+
+> The SMA algorithm distributes weight evenly across all model weights and treats each one equally.
+
+| Parameter | Description |
+|-------------|-----------------------------------------------------------------------------|
+| $PMA_n$ | The fused weight at step n |
+| $PMA_{n-1}$ | The fused weight at step n-1 |
+| $W_1$ | The original weight at step 1 |
+| $W_n$ | The original weight at step n |
+| $\alpha$ | The fusion coefficient; it takes effect only when the EMA algorithm is selected |
+| $n$ | The number of weights averaged |
+
+> During training, the model selects a weight every fixed number of steps, applies the formula, and stores the result as the intermediate value `pma_weight` in the checkpoint; this does not affect the parameter values of the original weights.
+> When the number of selected weights reaches the configured count, the intermediate value `pma_weight` is written over the original parameter values and is then reset to zero, and training enters the next cycle of weight merging.
+
+The reference is as follows:
+
+```text
+@misc{modelmerging,
+    title={Model Merging in Pre-training of Large Language Models},
+    author={Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu,
+    Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo,
+    Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma,
+    Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu},
+    year={2025},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2505.12082}
+}
+```
+
+## Usage
+
+**Note**: The parameter values shown in the following examples are experimental data only; set them according to your actual training.
+
+This feature is enabled through the YAML configuration file:
+
+```yaml
+optimizer:
+  type: PmaAdamW
+  betas: [0.9, 0.999]
+  eps: 1.e-6
+  weight_decay: 0.0
+  fused_num: 10
+  interleave_step: 1000
+  fused_algo: 'ema'
+  ema_alpha: 0.2
+```
+
+**Parameters:**
+
+| Parameter | Description | Type | Optional | Value Range |
+|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|------------|----------------|
+| type | Optimizer type; to enable the PMA feature, set it to `PmaAdamW`. Defaults to `AdamW`. | String | Optional | |
+| betas | The exponential decay rates of `moment1` and `moment2`. Each value must be in the range (0.0, 1.0). Defaults to ``(0.9, 0.999)``. | Union[list(float), tuple(float)] | Optional | (0.0, 1.0) |
+| eps | Added to the denominator to improve numerical stability. Must be greater than 0. Defaults to ``1e-6``. | float | Optional | Positive number |
+| weight_decay | Set the optimizer weight decay coefficient. Defaults to `0.0`. | float | Optional | |
+| fused_num | Set the number of weights (`fused_num`) to fuse; the fused weight is updated into the network parameters according to the fusion algorithm. Defaults to `10`. | int | Optional | Positive integer |
+| interleave_step | Set the step interval for selecting weights to fuse; every `interleave_step` steps, the current weight is taken as a fusion candidate. Defaults to `1000`. | int | Optional | Positive integer |
+| fused_algo | Fusion algorithm; `ema` and `sma` are supported. Defaults to `ema`. | string | Optional | [`ema`, `sma`] |
+| ema_alpha | Fusion coefficient; it takes effect only when `fused_algo` is set to `ema`. Defaults to `0.2`. | float | Optional | (0, 1) |
+
+### PmaAdamW Optimizer Configuration Introduction
+
+For information on configuring the PmaAdamW optimizer, refer to the [MindSpore Transformers PmaAdamW source code](https://gitee.com/mindspore/mindformers/blob/master/mindformers/core/optim/pma_adamw.py).
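+
+### Illustrative Fusion Cycle Sketch
+
+The following standalone Python/NumPy sketch is provided only as an illustration of the `interleave_step` / `fused_num` / `fused_algo` cycle described above. It is not the MindSpore Transformers implementation; the `PmaFuser` class and its `maybe_fuse` method are hypothetical names invented for this example.
+
+```python
+import numpy as np
+
+
+class PmaFuser:
+    """Toy model of the PMA cycle: collect a candidate weight every
+    `interleave_step` steps, fuse it into `pma_weight`, and after
+    `fused_num` candidates write the fused value back to the parameters."""
+
+    def __init__(self, fused_num=10, interleave_step=1000, fused_algo="ema", ema_alpha=0.2):
+        self.fused_num = fused_num
+        self.interleave_step = interleave_step
+        self.fused_algo = fused_algo
+        self.ema_alpha = ema_alpha
+        self.pma_weight = None  # intermediate fused value; the live weights are untouched
+        self.num_fused = 0      # candidates collected in the current cycle
+
+    def maybe_fuse(self, step, weight):
+        """Call once per training step; returns the (possibly overwritten) weight."""
+        if step % self.interleave_step != 0:
+            return weight  # not a candidate step, parameters unchanged
+        w = np.asarray(weight, dtype=np.float64)
+        if self.pma_weight is None:
+            self.pma_weight = w.copy()
+        elif self.fused_algo == "ema":
+            # PMA_n = (1 - alpha) * PMA_{n-1} + alpha * W_n
+            self.pma_weight = (1.0 - self.ema_alpha) * self.pma_weight + self.ema_alpha * w
+        else:
+            # sma: running mean PMA_n = (W_1 + ... + W_n) / n
+            self.pma_weight += (w - self.pma_weight) / (self.num_fused + 1)
+        self.num_fused += 1
+        if self.num_fused >= self.fused_num:
+            # End of a cycle: the fused value overwrites the parameters and the buffer is reset.
+            fused = self.pma_weight.copy()
+            self.pma_weight, self.num_fused = None, 0
+            return fused
+        return weight
+
+
+if __name__ == "__main__":
+    fuser = PmaFuser(fused_num=3, interleave_step=2, fused_algo="ema", ema_alpha=0.2)
+    weight = np.zeros(4)
+    for step in range(1, 9):
+        weight = weight + 0.1              # stand-in for a normal optimizer update
+        weight = fuser.maybe_fuse(step, weight)
+    print(weight)
+```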
diff --git a/docs/mindformers/docs/source_en/feature/training_function.rst b/docs/mindformers/docs/source_en/feature/training_function.rst index ad4717484e..149c3e1f2d 100644 --- a/docs/mindformers/docs/source_en/feature/training_function.rst +++ b/docs/mindformers/docs/source_en/feature/training_function.rst @@ -14,3 +14,4 @@ Training Function memory_optimization other_training_features skip_data_and_ckpt_health_monitor + pma_fused_checkpoint diff --git a/docs/mindformers/docs/source_zh_cn/feature/configuration.md b/docs/mindformers/docs/source_zh_cn/feature/configuration.md index 5ac20911a9..42cb2ed062 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/configuration.md +++ b/docs/mindformers/docs/source_zh_cn/feature/configuration.md @@ -104,48 +104,52 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/ 启动模型训练时,除了模型相关参数,还需要设置trainer、runner_config、学习率以及优化器等训练所需模块的参数,MindSpore Transformers提供了如下配置项。 -| 参数 | 说明 | 类型 | -|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| -| trainer.type | 设置trainer类,通常不同应用场景的模型会设置不同的trainer类。 | str | -| trainer.model_name | 设置模型名称,格式为'{name}_xxb',表示模型的某一规格。 | str | -| runner_config.epochs | 设置模型训练的轮数。 | int | -| runner_config.batch_size | 设置批处理数据的样本数,该配置会覆盖数据集配置中的`batch_size`。 | int | -| runner_config.sink_mode | 是否开启数据下沉模式。 | bool | -| runner_config.sink_size | 设置每次从Host下发到Device的迭代数量,仅`sink_mode=True`时生效,此参数将在后续版本中废弃。 | int | -| runner_config.gradient_accumulation_steps | 设置梯度累积步数,默认值为1,表示不开启梯度累积。 | int | -| runner_wrapper.type | 设置wrapper类,一般设置'MFTrainOneStepCell'即可。 | str | -| runner_wrapper.local_norm | 设置打印单卡上各参数的梯度范数。 | bool | -| runner_wrapper.scale_sense.type | 设置梯度缩放类,一般设置'DynamicLossScaleUpdateCell'即可。 | str | -| runner_wrapper.scale_sense.use_clip_grad | 是否开启梯度剪裁,开启可避免反向梯度过大导致训练无法收敛的情况。 | bool | -| runner_wrapper.scale_sense.loss_scale_value | 设置loss动态尺度系数,模型loss可以根据该参数配置动态变化。 | int | -| lr_schedule.type | 设置lr_schedule类,lr_schedule主要用于调整模型训练中的学习率。 | str | -| lr_schedule.learning_rate | 设置初始化学习率大小。 | float | -| lr_scale | 是否开启学习率缩放。 | bool | -| lr_scale_factor | 设置学习率缩放系数。 | int | -| layer_scale | 是否开启层衰减。 | bool | -| layer_decay | 设置层衰减系数。 | float | -| optimizer.type | 设置优化器类,优化器主要用于计算模型训练的梯度。 | str | -| optimizer.weight_decay | 设置优化器权重衰减系数。 | float | -| train_dataset.batch_size | 同`runner_config.batch_size`。 | int | -| train_dataset.input_columns | 设置训练数据集输入的数据列。 | list | -| train_dataset.output_columns | 设置训练数据集输出的数据列。 | list | -| train_dataset.construct_args_key | 设置模型`construct`输入的数据集部分`keys`, 按照字典序传入模型,当模型的传参顺序和数据集输入的顺序不一致时使用该功能。 | list | -| train_dataset.column_order | 设置训练数据集输出数据列的顺序。 | list | -| train_dataset.num_parallel_workers | 设置读取训练数据集的进程数。 | int | -| train_dataset.python_multiprocessing | 是否开启Python多进程模式提升数据处理性能。 | bool | -| train_dataset.drop_remainder | 是否在最后一个批处理数据包含样本数小于batch_size时,丢弃该批处理数据。 | bool | -| train_dataset.repeat | 设置数据集重复数据次数。 | int | -| train_dataset.numa_enable | 设置NUMA的默认状态为数据读取启动状态。 | bool | -| train_dataset.prefetch_size | 设置预读取数据量。 | int | -| train_dataset.data_loader.type | 设置数据加载类。 | str | -| train_dataset.data_loader.dataset_dir | 设置加载数据的路径。 | str | -| train_dataset.data_loader.shuffle | 是否在读取数据集时对数据进行随机排序。 | bool | -| train_dataset.transforms | 设置数据增强相关选项。 | - | -| train_dataset_task.type | 设置dataset类,该类用于对数据加载类以及其他相关配置进行封装。 | str | -| train_dataset_task.dataset_config | 
通常设置为`train_dataset`的引用,包含`train_dataset`的所有配置项。 | - | -| auto_tune | 是否开启数据处理参数自动调优,详情可参考[set_enable_autotune](https://www.mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.config.set_enable_autotune.html)。 | bool | -| filepath_prefix | 设置数据优化后的参数配置的保存路径。 | str | -| autotune_per_step | 设置自动数据加速的配置调整step间隔,详情可参考[set_autotune_interval](https://www.mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.config.set_autotune_interval.html)。 | int | +| 参数 | 说明 | 类型 | +|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------| +| trainer.type | 设置trainer类,通常不同应用场景的模型会设置不同的trainer类。 | str | +| trainer.model_name | 设置模型名称,格式为'{name}_xxb',表示模型的某一规格。 | str | +| runner_config.epochs | 设置模型训练的轮数。 | int | +| runner_config.batch_size | 设置批处理数据的样本数,该配置会覆盖数据集配置中的`batch_size`。 | int | +| runner_config.sink_mode | 是否开启数据下沉模式。 | bool | +| runner_config.sink_size | 设置每次从Host下发到Device的迭代数量,仅`sink_mode=True`时生效,此参数将在后续版本中废弃。 | int | +| runner_config.gradient_accumulation_steps | 设置梯度累积步数,默认值为1,表示不开启梯度累积。 | int | +| runner_wrapper.type | 设置wrapper类,一般设置'MFTrainOneStepCell'即可。 | str | +| runner_wrapper.local_norm | 设置打印单卡上各参数的梯度范数。 | bool | +| runner_wrapper.scale_sense.type | 设置梯度缩放类,一般设置'DynamicLossScaleUpdateCell'即可。 | str | +| runner_wrapper.scale_sense.use_clip_grad | 是否开启梯度剪裁,开启可避免反向梯度过大导致训练无法收敛的情况。 | bool | +| runner_wrapper.scale_sense.loss_scale_value | 设置loss动态尺度系数,模型loss可以根据该参数配置动态变化。 | int | +| lr_schedule.type | 设置lr_schedule类,lr_schedule主要用于调整模型训练中的学习率。 | str | +| lr_schedule.learning_rate | 设置初始化学习率大小。 | float | +| lr_scale | 是否开启学习率缩放。 | bool | +| lr_scale_factor | 设置学习率缩放系数。 | int | +| layer_scale | 是否开启层衰减。 | bool | +| layer_decay | 设置层衰减系数。 | float | +| optimizer.type | 设置优化器类,优化器主要用于计算模型训练的梯度。 | str | +| optimizer.weight_decay | 设置优化器权重衰减系数。 | float | +| optimizer.fused_num | 设置`fused_num`个权重进行融合,根据融合算法将融合后的权重更新到网络参数中。默认值为`10`。 | int | +| optimizer.interleave_step | 设置选取待融合权重的step间隔数,每`interleave_step`个step取一次权重作为候选权重进行融合。默认值为`1000`。 | int | +| optimizer.fused_algo | 设置融合算法,支持`ema`和`sma`。默认值为`ema`。 | string | +| optimizer.ema_alpha | 设置融合系数,仅在`fused_algo`=`ema`时生效。默认值为`0.2`。 | float | +| train_dataset.batch_size | 同`runner_config.batch_size`。 | int | +| train_dataset.input_columns | 设置训练数据集输入的数据列。 | list | +| train_dataset.output_columns | 设置训练数据集输出的数据列。 | list | +| train_dataset.construct_args_key | 设置模型`construct`输入的数据集部分`keys`, 按照字典序传入模型,当模型的传参顺序和数据集输入的顺序不一致时使用该功能。 | list | +| train_dataset.column_order | 设置训练数据集输出数据列的顺序。 | list | +| train_dataset.num_parallel_workers | 设置读取训练数据集的进程数。 | int | +| train_dataset.python_multiprocessing | 是否开启Python多进程模式提升数据处理性能。 | bool | +| train_dataset.drop_remainder | 是否在最后一个批处理数据包含样本数小于batch_size时,丢弃该批处理数据。 | bool | +| train_dataset.repeat | 设置数据集重复数据次数。 | int | +| train_dataset.numa_enable | 设置NUMA的默认状态为数据读取启动状态。 | bool | +| train_dataset.prefetch_size | 设置预读取数据量。 | int | +| train_dataset.data_loader.type | 设置数据加载类。 | str | +| train_dataset.data_loader.dataset_dir | 设置加载数据的路径。 | str | +| train_dataset.data_loader.shuffle | 是否在读取数据集时对数据进行随机排序。 | bool | +| train_dataset.transforms | 设置数据增强相关选项。 | - | +| train_dataset_task.type | 设置dataset类,该类用于对数据加载类以及其他相关配置进行封装。 | str | +| train_dataset_task.dataset_config | 通常设置为`train_dataset`的引用,包含`train_dataset`的所有配置项。 | - | +| auto_tune | 
是否开启数据处理参数自动调优,详情可参考[set_enable_autotune](https://www.mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.config.set_enable_autotune.html)。 | bool |
+| filepath_prefix | 设置数据优化后的参数配置的保存路径。 | str |
+| autotune_per_step | 设置自动数据加速的配置调整step间隔,详情可参考[set_autotune_interval](https://www.mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.config.set_autotune_interval.html)。 | int |

 ### 并行配置

diff --git a/docs/mindformers/docs/source_zh_cn/feature/pma_fused_checkpoint.md b/docs/mindformers/docs/source_zh_cn/feature/pma_fused_checkpoint.md
new file mode 100644
index 0000000000..0b079f7b9a
--- /dev/null
+++ b/docs/mindformers/docs/source_zh_cn/feature/pma_fused_checkpoint.md
@@ -0,0 +1,80 @@
+# Pre-trained Model Average 权重合并
+
+[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/pma_fused_checkpoint.md)
+
+## 概述
+
+Pre-trained Model Average(PMA)权重合并是指在训练过程中,根据选择 Exponential Moving Average(EMA)算法或 Simple Moving Average(SMA)算法对权重进行融合合并,从而提升模型训练的效果。
+
+MindSpore Transformers提供了`EMA`算法和`SMA`算法对权重进行融合合并,合并公式如下:
+
+EMA算法公式:$PMA_n = (1 - \alpha) \times PMA_{n-1} + \alpha \times W_n$
+
+> EMA算法通过指数递减的方式分配权重,对最近的模型权重更为敏感,能够快速响应模型在训练后期的变化。
+
+SMA算法公式:$PMA_n = (W_1 + \dots + W_n) / n$
+
+> SMA算法在所有模型权重上均匀分配权重,对待每个权重都一视同仁。
+
+| 参数名称 | 参数说明 |
+|-------------|----------------------|
+| $PMA_n$ | 第n步的合并权重 |
+| $PMA_{n-1}$ | 第n-1步的合并权重 |
+| $W_1$ | 第1步的原始权重 |
+| $W_n$ | 第n步的原始权重 |
+| $\alpha$ | 融合系数,只有当算法选择EMA时才会生效 |
+| $n$ | 表示n个权重取平均值 |
+
+> - 模型在训练时,会每隔固定步数选取一个权重进行公式计算,并作为中间值`pma_weight`保存在权重中,此时并不会影响原来权重的参数取值。
+> - 当选取的权重数量达到设定的数量时,权重中间值`pma_weight`写入并覆盖原参数取值后置零,训练进入下一个周期的权重合并。
+
+参考文献如下:
+
+```text
+@misc{modelmerging,
+    title={Model Merging in Pre-training of Large Language Models},
+    author={Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu,
+    Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo,
+    Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma,
+    Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu},
+    year={2025},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/2505.12082}
+}
+```
+
+## 使用方法
+
+**注意**:以下示例所展示的参数数值仅作为实验数据,请以真实训练数据为准。
+
+本功能通过YAML配置文件使能:
+
+```yaml
+optimizer:
+  type: PmaAdamW
+  betas: [0.9, 0.999]
+  eps: 1.e-6
+  weight_decay: 0.0
+  fused_num: 10
+  interleave_step: 1000
+  fused_algo: 'ema'
+  ema_alpha: 0.2
+```
+
+**参数说明:**
+
+| 参数名称 | 描述 | 类型 | 是否可选 | 取值范围 |
+|-----------------|---------------------------------------------------------------------|---------------------------------|------------|----------------|
+| type | 优化器类型,启用PMA特性需要设定为`PmaAdamW`。默认值为`AdamW`。 | String | 可选 | |
+| betas | `moment1`、`moment2` 的指数衰减率。每一个参数范围(0.0,1.0)。默认值为``(0.9, 0.999)``。 | Union[list(float), tuple(float)] | 可选 | (0.0,1.0) |
+| eps | 将添加到分母中,以提高数值稳定性。必须大于0。默认值为``1e-6``。 | float | 可选 | 正数 |
+| weight_decay | 设定优化器权重衰减系数。默认值为`0.0`。 | float | 可选 | |
+| fused_num | 设定`fused_num`个权重进行融合,根据融合算法将融合后的权重更新到网络参数中。默认值为`10`。 | int | 可选 | 正整数 |
+| interleave_step | 选取待融合权重的step间隔数,每`interleave_step`个step取一次权重作为候选权重进行融合。默认值为`1000`。 | int | 可选 | 正整数 |
+| fused_algo | 融合算法,支持`ema`和`sma`。默认值为`ema`。 | string | 可选 | [`ema`, `sma`] |
+| ema_alpha | 融合系数,仅在`fused_algo`=`ema`时生效。默认值为`0.2`。 | float | 可选 | (0, 1) |
+
+### PmaAdamW优化器配置介绍
+
+有关PmaAdamW优化器配置相关内容,可参见 [MindSpore Transformers PmaAdamW
源码](https://gitee.com/mindspore/mindformers/blob/master/mindformers/core/optim/pma_adamw.py)。
diff --git a/docs/mindformers/docs/source_zh_cn/feature/training_function.rst b/docs/mindformers/docs/source_zh_cn/feature/training_function.rst
index d5935bd437..40eb5962e0 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/training_function.rst
+++ b/docs/mindformers/docs/source_zh_cn/feature/training_function.rst
@@ -13,4 +13,5 @@
    high_availability
    memory_optimization
    other_training_features
-   skip_data_and_ckpt_health_monitor
\ No newline at end of file
+   skip_data_and_ckpt_health_monitor
+   pma_fused_checkpoint
\ No newline at end of file
--
Gitee