From a77dd7f722cfb1dc3c2928d4656569a52508e568 Mon Sep 17 00:00:00 2001
From: lanxiang
Date: Wed, 27 Aug 2025 17:19:59 +0800
Subject: [PATCH] =?UTF-8?q?=E8=AE=AD=E7=BB=83=E9=85=8D=E7=BD=AE=E6=A8=A1?=
 =?UTF-8?q?=E5=9E=8B=E8=AF=B4=E6=98=8E?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../training_template_instruction.md         | 82 +++++++++++++++++++
 docs/mindformers/docs/source_en/index.rst    |  1 +
 .../training_template_instruction.md         | 82 +++++++++++++++++++
 docs/mindformers/docs/source_zh_cn/index.rst |  1 +
 4 files changed, 166 insertions(+)
 create mode 100644 docs/mindformers/docs/source_en/advanced_development/training_template_instruction.md
 create mode 100644 docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md

diff --git a/docs/mindformers/docs/source_en/advanced_development/training_template_instruction.md b/docs/mindformers/docs/source_en/advanced_development/training_template_instruction.md
new file mode 100644
index 0000000000..7320106365
--- /dev/null
+++ b/docs/mindformers/docs/source_en/advanced_development/training_template_instruction.md
@@ -0,0 +1,82 @@
+# Training Configuration Template Instruction
+
+[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/training_template_instruction.md)
+
+## Overview
+
+MindSpore Transformers uses template YAML files as configuration files to launch model training tasks, covering two main scenarios: pre-training and fine-tuning.
+
+Users can run the corresponding training task by modifying the parameters of the matching template.
+
+For model pre-training, use llm_pretrain_template.yaml (see the Template Examples section).
+
+For model fine-tuning, use llm_finetune_template.yaml (see the Template Examples section).
+
+## Instructions for Use
+
+### Module Description
+
+The template covers the configuration of the following nine functional modules. For detailed parameter descriptions, refer to the [Configuration File Description](https://docs.qq.com/sheet/DTmxXTnRLbHB2RG1S).
+
+| Module Name | Module Usage |
+|-------------|--------------|
+| Basic Configuration | The basic configuration is mainly used to specify the MindSpore random seed and settings related to loading weights. |
+| Dataset Configuration | The dataset configuration is mainly used for dataset-related settings during MindSpore model training. For details, refer to [Dataset](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/dataset.html). |
+| Model Configuration | Configuration parameters differ between models; the parameters in the template are general-purpose configurations. |
+| Model Optimization Configuration | MindSpore Transformers provides recomputation-related configuration to reduce the memory usage of the model during training. For details, refer to [Recomputation](https://www.mindspore.cn/mindformers/docs/zh-CN/master/advanced_development/performance_optimization.html#%E9%87%8D%E8%AE%A1%E7%AE%97). |
+| Model Training Configuration | The configuration module for the parameters used when launching model training. The template mainly includes the parameters of the required training modules such as trainer, runner_config, runner_wrapper, learning rate (lr_schedule), and optimizer. |
+| Parallel Configuration | To improve model performance, it is usually necessary to configure a parallel strategy for the model in large-scale cluster scenarios. For details, refer to [Distributed Parallelism](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/parallel_training.html). |
+| Callback Function Configuration | MindSpore Transformers provides encapsulated callback function classes, which mainly implement operations such as reporting and outputting the training state of the model and saving model weight files during training. The following callback function classes are currently supported.<br>1. MFLossMonitor<br>
This callback function class is mainly used to print information such as training progress, model loss, and learning rate during training.<br>2. SummaryMonitor<br>This callback function class is mainly used to collect Summary data. For details, refer to [mindspore.SummaryCollector](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.SummaryCollector.html).<br>3. CheckpointMonitor<br>This callback function class is mainly used to save model weight files during model training. |
+| Context Configuration | The context configuration is mainly used to specify the relevant parameters in [mindspore.set_context](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_context.html). |
+| Performance Analysis Tool Configuration | MindSpore Transformers provides Profile as the main tool for model performance tuning. For details, refer to the [Performance Tuning Guide](https://www.mindspore.cn/mindformers/docs/zh-CN/master/advanced_development/performance_optimization.html). |
+
+## Basic Configuration Modification
+
+When training with a configuration template, modify the following basic configurations for a quick start.
+
+The configuration template uses 8 devices by default.
+
+### Dataset Configuration Modification
+
+1. The pre-training scenario uses the Megatron dataset. For details, refer to [Megatron Dataset](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/dataset.html#megatron%E6%95%B0%E6%8D%AE%E9%9B%86).
+2. The fine-tuning scenario uses a HuggingFace dataset. For details, refer to [HuggingFace Dataset](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/dataset.html#huggingface%E6%95%B0%E6%8D%AE%E9%9B%86).
+
+### Model Configuration Modification
+
+1. When modifying the model configuration, you can download a Hugging Face model and set pretrained_model_dir in the YAML configuration to read the model configuration directly (this feature does not currently support pre-training). The tokenizer and model_config are generated automatically during training. Supported models:
+
+| Model Name |
+|------------|
+| Deepseek3  |
+| Qwen3      |
+| Qwen2_5    |
+
+2. The generated model configuration gives priority to the YAML configuration; parameters not configured in YAML take their values from config.json under the pretrained_model_dir path. To customize the model configuration, simply add the relevant settings under model_config, as shown in the sketch below.
+3. For general configuration details, refer to [Model Configuration](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/configuration.html#%E6%A8%A1%E5%9E%8B%E9%85%8D%E7%BD%AE).
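+
+A minimal sketch of this behavior follows. The local model path, the overridden field, and the exact placement of pretrained_model_dir are illustrative assumptions, not values taken from the template:
+
+```yaml
+# Hypothetical local directory with a downloaded Hugging Face model
+# (config.json, tokenizer files, weights). Not supported for pre-training.
+pretrained_model_dir: "./local_models/Qwen3-8B"
+
+model:
+  model_config:
+    # A field set here takes precedence over the value in config.json;
+    # fields left unset fall back to config.json under pretrained_model_dir.
+    num_layers: 4
+```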
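+
+A minimal sketch of these two settings (the path value is an illustrative assumption):
+
+```yaml
+# Weight format: "safetensors" (recommended) or "ckpt".
+load_ckpt_format: "safetensors"
+# Logs, weights, and strategy files generated during training are written here.
+output_dir: "./output"
+```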
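+
+The sketch below groups the settings named above in the template's style; the type names and numeric values are illustrative assumptions, not tuned defaults:
+
+```yaml
+# Trade extra computation for lower memory usage.
+recompute_config:
+  recompute: True
+
+# Learning rate schedule; affects training accuracy.
+lr_schedule:
+  type: CosineWithWarmUpLR
+  learning_rate: 1.e-5
+  warmup_ratio: 0.01
+
+# Optimizer; changes how gradients are applied.
+optimizer:
+  type: AdamW
+  weight_decay: 0.01
+
+# Parallelism mainly affects performance rather than accuracy.
+use_parallel: True
+parallel_config:
+  data_parallel: 4
+  model_parallel: 2
+```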
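+
+A minimal sketch of a callbacks section using the two parameters above (the numeric values are illustrative assumptions):
+
+```yaml
+callbacks:
+  - type: MFLossMonitor
+  - type: CheckpointMonitor
+    # Save weights every 500 steps.
+    save_checkpoint_steps: 500
+    # Keep at most 2 weight files to bound disk usage.
+    keep_checkpoint_max: 2
+```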
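+
+A minimal sketch of the two settings involved (the path assumes the previous run used the default output_dir):
+
+```yaml
+# Checkpoint directory under the previous run's output_dir.
+load_checkpoint: "./output/checkpoint"
+resume_training: True
+```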
diff --git a/docs/mindformers/docs/source_en/index.rst b/docs/mindformers/docs/source_en/index.rst
index c3c883a2dc..37f724d550 100644
--- a/docs/mindformers/docs/source_en/index.rst
+++ b/docs/mindformers/docs/source_en/index.rst
@@ -194,6 +194,7 @@ FAQ
    advanced_development/dev_migration
    advanced_development/accuracy_comparison
    advanced_development/api
+   advanced_development/training_template_instruction
 
 .. toctree::
    :glob:
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md b/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md
new file mode 100644
index 0000000000..cf2a957973
--- /dev/null
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md
@@ -0,0 +1,82 @@
+# 训练配置模板使用说明
+
+[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/training_template_instruction.md)
+
+## 概述
+
+MindSpore Transformers使用模板YAML文件作为配置文件启动模型训练任务,主要涵盖两个场景:预训练、微调。
+
+用户可通过修改对应模板的参数来启动对应的训练任务。
+
+进行模型预训练时,请使用llm_pretrain_template.yaml(见模板示例章节)。
+
+进行模型微调训练时,请使用llm_finetune_template.yaml(见模板示例章节)。
+
+## 使用说明
+
+### 模块说明
+
+模板主要涵盖以下九个功能模块的配置,详细参数配置说明可参考[配置文件说明](https://docs.qq.com/sheet/DTmxXTnRLbHB2RG1S)。
+
+| 模块名称 | 模块用途 |
+|----------|----------|
+| 基础配置 | 基础配置主要用于指定MindSpore随机种子以及加载权重的相关设置。 |
+| 数据集配置 | 数据集配置主要用于MindSpore模型训练时的数据集相关设置。详情可参考[数据集](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/dataset.html)。 |
+| 模型配置 | 不同模型的配置参数存在差异,模板中的参数为通用配置。 |
+| 模型优化配置 | MindSpore Transformers提供重计算相关配置,以降低模型在训练时的内存占用,详情可参考[重计算](https://www.mindspore.cn/mindformers/docs/zh-CN/master/advanced_development/performance_optimization.html#%E9%87%8D%E8%AE%A1%E7%AE%97)。 |
+| 模型训练配置 | 启动模型训练时相关参数的配置模块,模板中主要包含trainer、runner_config、runner_wrapper、学习率(lr_schedule)以及优化器(optimizer)等训练所需模块的参数。 |
+| 并行配置 | 为了提升模型的性能,在大规模集群的使用场景中通常需要为模型配置并行策略,详情可参考[分布式并行](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/parallel_training.html)。 |
+| 回调函数配置 | MindSpore Transformers提供封装后的Callbacks函数类,主要实现在模型训练过程中返回模型的训练状态并输出、保存模型权重文件等操作,目前支持以下几个Callbacks函数类。<br>1. MFLossMonitor<br>
该回调函数类主要用于在训练过程中打印训练进度、模型Loss、学习率等信息。<br>2. SummaryMonitor<br>该回调函数类主要用于收集Summary数据,详情可参考[mindspore.SummaryCollector](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.SummaryCollector.html)。<br>3. CheckpointMonitor<br>该回调函数类主要用于在模型训练过程中保存模型权重文件。 |
+| context配置 | context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_context.html)中的相关参数。 |
+| 性能分析工具配置 | MindSpore Transformers提供Profile作为模型性能调优的主要工具,详情可参考[性能调优指南](https://www.mindspore.cn/mindformers/docs/zh-CN/master/advanced_development/performance_optimization.html)。 |
+
+## 基本配置修改
+
+使用配置模板进行训练时,修改以下基础配置即可快速启动。
+
+配置模板默认使用8卡。
+
+### 数据集配置修改
+
+1. 预训练场景使用Megatron数据集,详情请参考[Megatron数据集](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/dataset.html#megatron%E6%95%B0%E6%8D%AE%E9%9B%86)。
+2. 微调场景使用HuggingFace数据集,详情请参考[HuggingFace数据集](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/dataset.html#huggingface%E6%95%B0%E6%8D%AE%E9%9B%86)。
+
+### 模型配置修改
+
+1. 修改模型配置时,可以选择下载Hugging Face模型后,直接修改YAML配置中的pretrained_model_dir来读取模型配置(该功能暂不支持预训练),模型训练时会自动生成tokenizer和model_config。支持的模型列表:
+
+| 模型名称 |
+|----------|
+| Deepseek3 |
+| Qwen3 |
+| Qwen2_5 |
+
+2. 生成的模型配置优先以YAML配置为准,未配置的参数则取pretrained_model_dir路径下config.json中的参数值。如需定制模型配置,只需在model_config中添加相关配置即可,可参考下方示例。
+3. 通用配置详情请参考[模型配置](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/configuration.html#%E6%A8%A1%E5%9E%8B%E9%85%8D%E7%BD%AE)。
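+
+下面给出该行为的最小示例,其中本地模型路径、被覆盖的字段以及pretrained_model_dir的具体层级均为示意性假设,并非模板中的取值:
+
+```yaml
+# 假设的本地Hugging Face模型目录(包含config.json、tokenizer文件及权重),
+# 该功能暂不支持预训练。
+pretrained_model_dir: "./local_models/Qwen3-8B"
+
+model:
+  model_config:
+    # 此处配置的字段优先于config.json中的取值;
+    # 未配置的字段回退到pretrained_model_dir下的config.json。
+    num_layers: 4
+```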
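+
+下面给出这两项设置的最小示例(路径取值为示意性假设):
+
+```yaml
+# 权重格式:推荐"safetensors",也支持"ckpt"。
+load_ckpt_format: "safetensors"
+# 训练过程中生成的日志、权重和策略文件写入该目录。
+output_dir: "./output"
+```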
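+
+下面按模板的写法将上述各项设置组合成一个最小示例,其中type名称与数值均为示意性假设,并非调优后的默认值:
+
+```yaml
+# 以额外计算换取更低的内存占用。
+recompute_config:
+  recompute: True
+
+# 学习率配置,影响训练精度。
+lr_schedule:
+  type: CosineWithWarmUpLR
+  learning_rate: 1.e-5
+  warmup_ratio: 0.01
+
+# 优化器配置,改变梯度的计算与更新方式。
+optimizer:
+  type: AdamW
+  weight_decay: 0.01
+
+# 并行配置主要影响性能而非精度。
+use_parallel: True
+parallel_config:
+  data_parallel: 4
+  model_parallel: 2
+```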
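+
+下面给出使用上述两个参数的callbacks配置最小示例(数值为示意性假设):
+
+```yaml
+callbacks:
+  - type: MFLossMonitor
+  - type: CheckpointMonitor
+    # 每500步保存一次权重。
+    save_checkpoint_steps: 500
+    # 最多保留2份权重文件,以控制磁盘占用。
+    keep_checkpoint_max: 2
+```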
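+
+下面给出所涉及两项设置的最小示例(路径假设上次训练使用了默认的output_dir):
+
+```yaml
+# 上次训练output_dir目录下的checkpoint目录。
+load_checkpoint: "./output/checkpoint"
+resume_training: True
+```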
diff --git a/docs/mindformers/docs/source_zh_cn/index.rst b/docs/mindformers/docs/source_zh_cn/index.rst
index a615d23055..72edf1de9e 100644
--- a/docs/mindformers/docs/source_zh_cn/index.rst
+++ b/docs/mindformers/docs/source_zh_cn/index.rst
@@ -221,6 +221,7 @@ FAQ
    advanced_development/dev_migration
    advanced_development/accuracy_comparison
    advanced_development/api
+   advanced_development/training_template_instruction
 
 .. toctree::
    :glob:
--
Gitee