From 3538d3834516c4d44153e9a53e1d7a892ed808c7 Mon Sep 17 00:00:00 2001 From: lanxiang <1277800895@qq.com> Date: Mon, 23 Jun 2025 16:26:26 +0800 Subject: [PATCH] =?UTF-8?q?=E6=95=B0=E6=8D=AE=E8=B7=B3=E8=BF=87=E5=92=8C?= =?UTF-8?q?=E5=81=A5=E5=BA=B7=E7=9B=91=E6=B5=8B=E6=96=87=E6=A1=A3=E4=BD=BF?= =?UTF-8?q?=E7=94=A8=E8=AF=B4=E6=98=8E?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../docs/source_en/feature/configuration.md | 89 ++++---- .../skip_data_and_ckpt_health_monitor.md | 196 ++++++++++++++++++ .../source_en/feature/training_function.rst | 1 + .../source_zh_cn/feature/configuration.md | 97 +++++---- .../skip_data_and_ckpt_health_monitor.md | 196 ++++++++++++++++++ .../feature/training_function.rst | 1 + 6 files changed, 494 insertions(+), 86 deletions(-) create mode 100644 docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md create mode 100644 docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md diff --git a/docs/mindformers/docs/source_en/feature/configuration.md b/docs/mindformers/docs/source_en/feature/configuration.md index 677250010d..f77a64bb1a 100644 --- a/docs/mindformers/docs/source_en/feature/configuration.md +++ b/docs/mindformers/docs/source_en/feature/configuration.md @@ -14,18 +14,20 @@ The `YAML` file provided by MindSpore Transformers contains configuration items The basic configuration is mainly used to specify MindSpore random seeds and related settings for loading weights. -| Parameters | Descriptions | Types | -|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------| -| seed | Set the global seed. For details, refer to [mindspore.set_seed](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_seed.html). | int | -| run_mode | Set the running mode of the model: `train`, `finetune`, `eval` or `predict`. | str | -| output_dir | Set the path where log, checkpoint, strategy, etc. files are saved. | str | -| load_checkpoint | File or folder paths for loading weights. Currently there are 3 application scenarios
1. Support for passing in full weight file paths.
2. Support for passing in offline sliced weight folder paths.
3. Support for passing in folder paths containing LoRA weights and base weights.<br>
Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html) for the ways of obtaining various weights. | str | -| auto_trans_ckpt | Enable distributed weight auto slicing and merging. Refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html). | bool | -| resume_training | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html#resumable-training). | bool | -| load_ckpt_format | The format of loading checkpoint, either `ckpt` or `safetensors`. | str | -| remove_redundancy | Whether the checkpoint has removed redundancy while loading checkpoint. The default value is `False`. | bool | -| train_precision_sync | Switching on or off deterministic computation of the training process. The default value is `None`. | Optional[bool] | -| infer_precision_sync | Switching on or off deterministic computation of the inference process. The default value is `None`. | Optional[bool] | +| Parameters | Descriptions | Types | +|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------| +| seed | Set the global seed. For details, refer to [mindspore.set_seed](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_seed.html). | int | +| run_mode | Set the running mode of the model: `train`, `finetune`, `eval` or `predict`. | str | +| output_dir | Set the path where log, checkpoint, strategy, etc. files are saved. | str | +| load_checkpoint | File or folder paths for loading weights. Currently there are 3 application scenarios
1. Support for passing in full weight file paths.
2. Support for passing in offline sliced weight folder paths.
3. Support for passing in folder paths containing LoRA weights and base weights.<br>
Refer to [Weight Conversion Function](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html) for the ways of obtaining various weights. | str | +| auto_trans_ckpt | Enable distributed weight auto slicing and merging. Refer to [Distributed Weight Slicing and Merging](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html). | bool | +| resume_training | Enable resumable training after breakpoint. For details, refer to [Resumable Training After Breakpoint](https://www.mindspore.cn/mindformers/docs/en/dev/feature/resume_training.html#resumable-training). | bool | +| load_ckpt_format | The format of loading checkpoint, either `ckpt` or `safetensors`. | str | +| remove_redundancy | Whether the checkpoint has removed redundancy while loading checkpoint. The default value is `False`. | bool | +| train_precision_sync | Switching on or off deterministic computation of the training process. The default value is `None`. | Optional[bool] | +| infer_precision_sync | Switching on or off deterministic computation of the inference process. The default value is `None`. | Optional[bool] | +| use_skip_data_by_global_norm | Enable data Skip Function. The default value is `False`. | | +| use_checkpoint_health_monitor | Enable health monitoring function. The default value is `False`. | | ### Context Configuration @@ -102,16 +104,17 @@ In addition to the basic configuration of the model above, the MoE model needs t When starting model training, in addition to model-related parameters, you also need to set the parameters of trainer, runner_config, learning rate, and optimizer and other modules required for training, MindSpore Transformers provides the following configuration items. -| Parameters | Descriptions | Types | -|---------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| +| Parameters | Descriptions | Types | +|---------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| | trainer.type | Set the trainer class, usually different models for different application scenarios will set different trainer classes. | str | | trainer.model_name | Set the model name in the format '{name}_xxb', indicating a certain specification of the model. | str | | runner_config.epochs | Set the number of rounds for model training. | int | -| runner_config.batch_size | Set the sample size of the batch data, which overrides the `batch_size` in the dataset configuration. | int | +| runner_config.batch_size | Set the sample size of the batch data, which overrides the `batch_size` in the dataset configuration. | int | | runner_config.sink_mode | Enable data sink mode. | bool | -| runner_config.sink_size | Set the number of iterations to be sent down from Host to Device per iteration, effective only when `sink_mode=True`. This argument will be deprecated in a future release. | int | -| runner_config.gradient_accumulation_steps | Set the number of gradient accumulation steps, the default value is 1, which means that gradient accumulation is not enabled. 
| int | +| runner_config.sink_size | Set the number of iterations to be sent down from Host to Device per iteration, effective only when `sink_mode=True`. This argument will be deprecated in a future release. | int | +| runner_config.gradient_accumulation_steps | Set the number of gradient accumulation steps, the default value is 1, which means that gradient accumulation is not enabled. | int | | runner_wrapper.type | Set the wrapper class, generally set 'MFTrainOneStepCell'. | str | +| runner_wrapper.local_norm | Set the gradient norm of each parameter on the printing card. | bool | | runner_wrapper.scale_sense.type | Set the gradient scaling class, generally just set 'DynamicLossScaleUpdateCell'. | str | | runner_wrapper.scale_sense.use_clip_grad | Turn on gradient clipping. Turning on to avoid cases where the inverse gradient is too large and training fails to converge. | bool | | runner_wrapper.scale_sense.loss_scale_value | Set the loss dynamic scale factor, the model loss can change dynamically according to the configuration of this parameter. | int | @@ -126,11 +129,11 @@ When starting model training, in addition to model-related parameters, you also | train_dataset.batch_size | The description is same as that of `runner_config.batch_size`. | int | | train_dataset.input_columns | Set the input data columns for the training dataset. | list | | train_dataset.output_columns | Set the output data columns for the training dataset. | list | -| train_dataset.construct_args_key | Set the dataset part `keys` of the model `construct` input to the model in lexicographical order, used when the parameter passing order of the model does not match the order of the dataset input. | list | +| train_dataset.construct_args_key | Set the dataset part `keys` of the model `construct` input to the model in lexicographical order, used when the parameter passing order of the model does not match the order of the dataset input. | list | | train_dataset.column_order | Set the order of the output data columns of the training dataset. | list | | train_dataset.num_parallel_workers | Set the number of processes that read the training dataset. | int | | train_dataset.python_multiprocessing | Enabling Python multi-process mode to improve data processing performance. | bool | -| train_dataset.drop_remainder | Whether to discard the last batch of data if it contains fewer samples than batch_size. | bool | +| train_dataset.drop_remainder | Whether to discard the last batch of data if it contains fewer samples than batch_size. | bool | | train_dataset.repeat | Set the number of dataset duplicates. | int | | train_dataset.numa_enable | Set the default state of NUMA to data read startup state. | bool | | train_dataset.prefetch_size | Set the amount of pre-read data. | int | @@ -139,7 +142,7 @@ When starting model training, in addition to model-related parameters, you also | train_dataset.data_loader.shuffle | Whether to randomly sort the data when reading the dataset. | bool | | train_dataset.transforms | Set options related to data enhancement. | - | | train_dataset_task.type | Set up the dataset class, which is used to encapsulate the data loading class and other related configurations. | str | -| train_dataset_task.dataset_config | Typically set as a reference to `train_dataset`, containing all configuration entries for `train_dataset`. | - | +| train_dataset_task.dataset_config | Typically set as a reference to `train_dataset`, containing all configuration entries for `train_dataset`. 
| - | | auto_tune | Enable auto-tuning of data processing parameters, see [set_enable_autotune](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.config.set_enable_autotune.html) for details. | bool | | filepath_prefix | Set the save path for parameter configurations after data optimization. | str | | autotune_per_step | Set the configuration tuning step interval for automatic data acceleration, for details see [set_autotune_interval](https://www.mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.config.set_autotune_interval.html). | int | @@ -233,18 +236,19 @@ MindSpore Transformers provides encapsulated Callbacks function class, mainly to | Parameters | Descriptions | Types | |------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| - | prefix | Set the prefix for saving file names. | str | - | directory | Set the directory for saving file names. | str | - | save_checkpoint_seconds | Set the number of seconds between saving model weights. | int | - | save_checkpoint_steps | Set the number of interval steps for saving model weights. | int | + | prefix | Set the prefix for saving file names. | str | + | directory | Set the directory for saving file names. | str | + | save_checkpoint_seconds | Set the number of seconds between saving model weights. | int | + | save_checkpoint_steps | Set the number of interval steps for saving model weights. | int | | keep_checkpoint_max | Set the maximum number of model weight files to be saved, if there are more model weight files in the save path, they will be deleted starting from the earliest file created to ensure that the total number of files does not exceed `keep_checkpoint_max`. | int | - | keep_checkpoint_per_n_minutes | Set the number of minutes between saving model weights. | int | + | keep_checkpoint_per_n_minutes | Set the number of minutes between saving model weights. | int | | integrated_save | Turn on aggregation to save the weights file.
1. When set to True, the weights of all devices are aggregated when the weight file is saved, i.e., all devices save the same weights.<br>
2. When set to False, each device saves only its own weights.<br>
When using semi-automatic parallel mode, it is usually necessary to set it to False to avoid memory problems when saving the weights file. | bool | | save_network_params | Set to save only model weights, default value is `False`. | bool | | save_trainable_params | Set the additional saving of trainable parameter weights, i.e. the parameter weights of the model when partially fine-tuned, default to `False`. | bool | | async_save | Set an asynchronous execution to save the model weights file. | bool | | remove_redundancy | Whether to remove the redundancy for the checkpoint, default value is `False`. | bool | - | checkpoint_format | The format of the checkpoint while saving the checkpoint, default value is `ckpt`. Either `ckpt` or `safetensors`. | str | + | checkpoint_format | The format of the checkpoint while saving the checkpoint, default value is `ckpt`. Either `ckpt` or `safetensors`. | str | + | embedding_local_norm_threshold | Set the threshold for embedding norm in health monitoring,default value is `1.0`. | float | Multiple Callbacks function classes can be configured at the same time under the `callbacks` field. The following is an example of `callbacks` configuration. @@ -308,21 +312,24 @@ MindSpore Transformers provides Profile as the main tool for model performance t The metric monitoring configuration is primarily used to configure methods to record metrics during training, please refer to [Training Metrics Monitoring](https://www.mindspore.cn/mindformers/docs/en/dev/feature/monitor.html) for more details.Below is a description of the common metric monitoring configuration options in MindSpore Transformers: -| Parameters | Descriptions | Types | -|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| -| monitor_config.monitor_on | Set whether to enable monitoring. The default is `False`, which will disable all parameters below. | bool | -| monitor_config.dump_path | Set the save path for metric files of `local_norm`, `device_local_norm` and `local_loss` during training. Defaults to './dump' when not set or set to `null`. | str | -| monitor_config.target | Set the (partial) name of target parameters monitored by metric `optimizer state` and `local_norm`, can be regular expression.Defaults to ['.*'] when not set or set to `null`, that is, specify all parameters. | list[str] | -| monitor_config.invert | Set whether to invert the targets specified in `monitor_config.target`, defaults to `False`. | bool | -| monitor_config.step_interval | Set the frequency for metric recording. The default value is `1`, that is, the metrics are recorded every step. | int | -| monitor_config.local_loss_format | Set the format to record metric `local_loss`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | -| monitor_config.device_local_loss_format | Set the format to record metric `device_local_loss`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. 
| str/list[str] | -| monitor_config.local_norm_format | Set the format to record metric `local_norm`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | -| monitor_config.device_local_norm_format | Set the format to record metric `device_local_norm`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | -| monitor_config.optimizer_state_format | Set the format to record metric `optimizer state`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | -| monitor_config.weight_state_format | Set the format to record metric `weight L2-norm`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | -| monitor_config.throughput_baseline | Set the baseline of metric `throughput linearity`, must be positive number. Defaults to `null`, that is, do not monitor this metric. | int/float | -| monitor_config.print_struct | Set whether to print all trainable parameters' name of model. If set to `True`, print all trainable parameters' name at the beginning of the first step, and exit training process after step end. Defaults to `False`. | bool | +| Parameters | Descriptions | Types | +|--------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| monitor_config.monitor_on | Set whether to enable monitoring. The default is `False`, which will disable all parameters below. | bool | +| monitor_config.dump_path | Set the save path for metric files of `local_norm`, `device_local_norm` and `local_loss` during training. Defaults to './dump' when not set or set to `null`. | str | +| monitor_config.target | Set the (partial) name of target parameters monitored by metric `optimizer state` and `local_norm`, can be regular expression.Defaults to ['.*'] when not set or set to `null`, that is, specify all parameters. | list[str] | +| monitor_config.invert | Set whether to invert the targets specified in `monitor_config.target`, defaults to `False`. | bool | +| monitor_config.step_interval | Set the frequency for metric recording. The default value is `1`, that is, the metrics are recorded every step. | int | +| monitor_config.local_loss_format | Set the format to record metric `local_loss`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | +| monitor_config.device_local_loss_format | Set the format to record metric `device_local_loss`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. 
| str/list[str] | +| monitor_config.local_norm_format | Set the format to record metric `local_norm`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | +| monitor_config.device_local_norm_format | Set the format to record metric `device_local_norm`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | +| monitor_config.optimizer_state_format | Set the format to record metric `optimizer state`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | +| monitor_config.weight_state_format | Set the format to record metric `weight L2-norm`, can be string 'tensorboard' and 'log' (represent write to Tensorboard and write to log respectively), or list composed of them, or `null`. Defaults to `null`, that is, do not monitor this metric. | str/list[str] | +| monitor_config.throughput_baseline | Set the baseline of metric `throughput linearity`, must be positive number. Defaults to `null`, that is, do not monitor this metric. | int/float | +| monitor_config.print_struct | Set whether to print all trainable parameters' name of model. If set to `True`, print all trainable parameters' name at the beginning of the first step, and exit training process after step end. Defaults to `False`. | bool | +| monitor_config.check_for_global_norm | Set whether to enable process level fault recovery function. Defaults to `False`. | bool | +| monitor_config.global_norm_spike_threshold | Set the threshold for global norm, triggering data skipping when the global norm is exceeded. Defaults to `3.0`. | float | +| monitor_config.global_norm_spike_count_threshold | Set the cumulative number of consecutive global norm anomalies, and when the threshold is reached, trigger an exception interrupt to terminate the training. Defaults to `10`. | int | ### TensorBoard Configuration diff --git a/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md b/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md new file mode 100644 index 0000000000..2ce323a3ec --- /dev/null +++ b/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md @@ -0,0 +1,196 @@ +# Data Skip And Checkpoint Health Monitor + +[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md) + +## Overview + +The data skipping function refers to the process where, during the training process, when the parameter global norm exceeds the set threshold, it accumulates the number of out of bounds and skips the training data for the current step, and proceeds to retraining in the next step; When the cumulative number of violations reaches the threshold, an abnormal interrupt will be triggered to terminate the training. 
The health monitoring function monitors the health status of weights as they are saved, generates a file that records the health status of each weight, and uses this file to select the latest healthy weight for the next training session.

Please refer to [Checkpoint Health Monitor](#checkpoint-health-monitor) for how the health status of a weight is determined.

> - Used together, the data skipping and health monitoring functions can effectively mitigate the data anomalies caused by abnormal global norms during training. Before enabling them, train normally for a period of time to determine suitable values for the global norm threshold, the consecutive-anomaly count threshold, and the embedding norm threshold.
> - Note that training is interrupted only when anomalies occur consecutively. If the global norm returns to normal even once, the accumulated count is cleared, so set the thresholds carefully.
> - The data skipping function cannot be used together with the quick fault recovery function. Refer to the process-level rescheduling recovery function in the [high availability feature](https://www.mindspore.cn/mindformers/docs/en/dev/feature/high_availability.html).

## Skipping Data

### Overview

MindSpore Transformers provides a data skipping function that skips the current training data when a global norm anomaly occurs, and triggers an exception interrupt when the number of consecutive anomalies reaches the set threshold.

This feature has the following three behaviors (an illustrative sketch follows the parameter table below):

- When an out-of-bounds global norm occurs, the consecutive-anomaly count is incremented by 1, the training data of the current step is skipped, and a log message is printed.
- When the global norm returns to normal, the consecutive-anomaly count is cleared.
- When the consecutive-anomaly count reaches the set threshold, an exception interrupt is triggered and training is terminated.

#### Usage

**Note**: The parameter values shown in the following examples are experimental data only; refer to real training data when setting them.

This feature is enabled through the YAML configuration file:

```yaml
use_skip_data_by_global_norm: True

monitor_config:
  monitor_on: True
  check_for_global_norm: False
  global_norm_spike_threshold: 3.0
  global_norm_spike_count_threshold: 2
```

**Parameters:**

| Parameter                         | Description                                                                                                                                                                | Type  | Optional | Value Range      |
|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|----------|------------------|
| use_skip_data_by_global_norm      | Data skipping function switch. Defaults to `False`.                                                                                                                        | Bool  | Optional |                  |
| monitor_config                    | Training metric monitoring configuration. Defaults to `None`.                                                                                                              |       | Optional |                  |
| monitor_on                        | Whether to enable training metric monitoring. Defaults to `False`.                                                                                                         | Bool  | Optional |                  |
| check_for_global_norm             | Whether to enable the fault recovery function, which is mutually exclusive with the data skipping function. Defaults to `False`.                                           | Bool  | Optional |                  |
| global_norm_spike_threshold       | The global norm threshold; data skipping is triggered when the global norm exceeds it. Defaults to `3.0`.                                                                  | Float | Optional | Greater than 0   |
| global_norm_spike_count_threshold | The number of consecutive abnormal global norms; when the count reaches this threshold, an exception interrupt is triggered and training is terminated. Defaults to `10`. | Int   | Optional | Positive integer |
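
The interaction between the two thresholds can be illustrated with a minimal, runnable sketch. This is illustrative pseudologic only, not the MindSpore Transformers implementation, and the per-step global norm values are made-up toy numbers:

```python
# Illustrative sketch of the skip/interrupt behavior; not the actual
# MindSpore Transformers implementation.
GLOBAL_NORM_SPIKE_THRESHOLD = 3.0
GLOBAL_NORM_SPIKE_COUNT_THRESHOLD = 2

step_global_norms = [1.2, 44.313248, 47.329006]  # toy per-step values
consecutive_spikes = 0

for step, global_norm in enumerate(step_global_norms, start=1):
    if global_norm > GLOBAL_NORM_SPIKE_THRESHOLD:
        # Anomaly: count it and skip this step's training data.
        consecutive_spikes += 1
        print(f"global norm {global_norm} of step {step} has been "
              f"{consecutive_spikes} consecutive times greater than threshold")
        if consecutive_spikes >= GLOBAL_NORM_SPIKE_COUNT_THRESHOLD:
            raise ValueError(f"step {step}: {consecutive_spikes} consecutive "
                             f"global norm spikes, stop training")
        continue
    # Normal step: the optimizer update would run here, and the count is cleared.
    consecutive_spikes = 0
```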
### Usage Example

Taking Llama3.1-8B as an example, add the parameters from [Usage](#usage) above to [finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml); for the remaining steps, refer to the [Llama3.1-8B Document](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md). Start training:

```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path research/llama3_1 \
 --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
 --train_data /{path}/wiki4096.mindrecord \
 --run_mode train \
 --use_parallel True" 8
```

Once training starts, if the global norm exceeds the set threshold, a log such as the following is printed, indicating that an abnormal global norm has occurred n consecutive times and that the training data of the current step has been skipped.

```log
- INFO - { Epoch:[ 1/ 2], step:[ 1/ 6500], loss: 0.000, per_step_time: 166756ms, lr: 0.0, overflow cond: False, loss_scale: 1.0, global_norm: [44.313248], train_throughput_per_npu: 2.849T
- INFO - 0.0% | | 0.00600 samples/s/p 25 days, 2:07:47 }
- INFO - opt_global_step: 0, skip_data_grad_norm_threshold: 3.0, is_skip: [ True]
- INFO - Current global norm [44.313248] of step 1 has been 1 consecutive times greater than threshold: 3.0
```

When the number of consecutive anomalies reaches the set threshold, an error log is printed and training is terminated.

```log
- INFO - { Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 0.000, per_step_time: 7637ms, lr: 0.0, overflow cond: False, loss_scale: 1.0, global_norm: [47.329006], train_throughput_per_npu: 62.211T
- INFO - 0.0% | | 0.00600 samples/s/p 25 days, 2:07:47 }
- INFO - opt_global_step: 0, skip_data_grad_norm_threshold: 3.0, is_skip: [ True]
ValueError: Current global norm [47.329006] of step 2 has been 2 consecutive times greater than threshold 3.0, stop training...
```
## Checkpoint Health Monitor

### Overview

The health monitoring function provided by MindSpore Transformers determines the health status of saved weights by monitoring the embedding norm in stage 0. The health status of every weight saved during training is recorded in the file health_ckpts.json, and the latest healthy weight is found automatically through this file for further training.

This feature covers the following three steps:

1. Turn on the health monitoring switch and determine the embedding norm threshold to set through a period of normal training.
2. Restart training with the threshold set. When a weight is saved, its health status is recorded as unhealthy if the embedding norm exceeds the threshold and as healthy otherwise, where 1 indicates unhealthy and 0 indicates healthy.
3. When resuming training, the latest healthy weight recorded in the health_ckpts.json file generated by the previous run is used automatically for continuation.

**Note**:

- The embedding norm in stage 0 is only meaningful when the pipeline stage is greater than 1.
- Only the weights of cards in stage 0 have a corresponding health status. The record file stores the aggregated status of all card weights: if the weight of any card is unhealthy, the weight corresponding to that step is recorded as unhealthy; only when the weights of all cards in stage 0 are healthy is the weight for that step recorded as healthy.
- When the record file contains no healthy weights, the user is prompted to retrain until a healthy weight exists. If training fails to produce any healthy weight, consider whether the embedding norm threshold is set reasonably.
- If a weight is explicitly specified for resuming training, the specified weight takes priority, regardless of its health status.
- This feature does not support full batch scenarios.
- Enabling this feature may pose a risk of insufficient communication memory.

#### Usage

**Note**: The parameter values shown in the following examples are experimental data only; refer to real training data when setting them.

This feature is enabled through the YAML configuration file:

```yaml
use_checkpoint_health_monitor: True

monitor_config:
  monitor_on: True

runner_wrapper:
  local_norm: True

callbacks:
  - type: CheckpointMonitor
    save_checkpoint_steps: 1
    embedding_local_norm_threshold: 270.0

parallel:
  full_batch: False
  dataset_strategy: [[4, 1], [4, 1]]

parallel_config:
  data_parallel: 4
  pipeline_stage: 2
  micro_batch_num: 2
```

**Parameters:**

| Parameter                      | Description | Type  | Optional          | Value Range      |
|--------------------------------|-------------|-------|-------------------|------------------|
| use_checkpoint_health_monitor  | Checkpoint health monitoring function switch. Defaults to `False`. | Bool  | Optional          |                  |
| monitor_config                 | Training metric monitoring configuration. Defaults to `None`. |       | Optional          |                  |
| monitor_on                     | Whether to enable training metric monitoring; the embedding local norm metric can only be observed when this is enabled. Defaults to `False`. | Bool  | Optional          |                  |
| runner_wrapper                 | Wrapper configuration. |       | Required          |                  |
| local_norm                     | The gradient norm of each parameter on a single card. Defaults to `False`. | Bool  | Optional          |                  |
| callbacks                      | Callbacks configuration. |       | Required          |                  |
| save_checkpoint_steps          | The step interval for saving weights. | Int   | Required          | Positive integer |
| embedding_local_norm_threshold | The embedding norm threshold for health monitoring. Defaults to `1.0`. | Float | Optional          | Greater than 0   |
| parallel                       | Parallel strategy configuration. |       | Required          |                  |
| full_batch                     | Whether to load the full batch of data from the dataset in parallel mode. Setting it to `True` means all ranks load the full batch of data; setting it to `False` means each rank loads only its corresponding batch. When set to `False`, a matching `dataset_strategy` must be configured. This feature only supports `False`. | Bool  | Required, `False` |                  |
| dataset_strategy               | Only supports the `List of List` type and takes effect only when `full_batch=False`. The number of sublists in the list must equal the length of `train_dataset.input_columns`, and each sublist must have the same shape as the data returned by the dataset. Data parallel splitting is generally done along the first dimension, so the first element of each sublist should be configured to match `data_parallel` and the other elements set to `1`. For a detailed explanation, refer to [Dataset Splitting](https://www.mindspore.cn/tutorials/en/master/parallel/dataset_slice.html). | List  | Required          |                  |
| parallel_config                | Parallel parameter configuration. |       | Required          |                  |
| data_parallel                  | Set the number of data parallel replicas. | Int   | Required          | Positive integer |
| pipeline_stage                 | Set the number of pipeline stages. | Int   | Required          | Positive integer |
| micro_batch_num                | Set the pipeline-parallel micro batch number; when `parallel_config.pipeline_stage` is greater than 1, `parallel_config.micro_batch_num` >= `parallel_config.pipeline_stage` should be satisfied. | Int   | Required          | Positive integer |
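
Under the aggregation rule described in the notes above, the health flag recorded for a saved weight can be summarized with a short sketch. This is a conceptual illustration with hypothetical per-card values, not the framework's actual code:

```python
# Conceptual sketch of the health determination rule: a saved weight is
# recorded as unhealthy (1) if the embedding local norm of any stage-0 card
# exceeds the threshold, and healthy (0) only if all cards are within it.
# The values are hypothetical; this is not MindSpore Transformers' code.
EMBEDDING_LOCAL_NORM_THRESHOLD = 270.0

stage0_embedding_local_norms = [251.79117, 291.3603]  # hypothetical per-card values
is_health = int(any(norm > EMBEDDING_LOCAL_NORM_THRESHOLD
                    for norm in stage0_embedding_local_norms))
print(is_health)  # 1 -> unhealthy, 0 -> healthy
```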
### Usage Example

Taking Llama3.1-8B as an example, add and modify the parameters from [Usage](#usage-1) above in [finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml); for the remaining steps, refer to the [Llama3.1-8B Document](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md). Start training:

```shell
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --register_path research/llama3_1 \
 --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
 --train_data /{path}/wiki4096.mindrecord \
 --run_mode train \
 --use_parallel True" 8
```

Once training starts, the log prints the embedding local norm of each step, making it easier for users to set the threshold after statistical observation.

```log
- INFO - { Epoch:[ 1/ 2], step:[ 1/ 6500], loss: 0.000, per_step_time: 157149ms, lr: 0.0, overflow cond: False, loss_scale: 1.0, global_norm: [44.31202], train_throughput_per_npu: 3.023T
- INFO - 0.0% | | 0.00636 samples/s/p 23 days, 15:26:22 }
- INFO - embedding_local_norm: 251.79117

- INFO - { Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 0.000, per_step_time: 8266ms, lr: 2.5641025e-08, overflow cond: False, loss_scale: 1.0, global_norm: [47.328575], train_throughput_per_npu: 57.471T
- INFO - 0.0% | | 0.12096 samples/s/p 1 day, 5:50:52 }
- INFO - embedding_local_norm: 291.3603
```

In health_ckpts.json, ckpt_name records the weight file name and is_health records the health status of the corresponding weight, where 1 means unhealthy and 0 means healthy.
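
For illustration, the following minimal sketch shows how the latest healthy weight could be selected from such a record when resuming. It assumes the entries appear in save order, as in the example below, and is not MindSpore Transformers' actual resume logic:

```python
# Minimal sketch: pick the newest checkpoint whose is_health flag is 0
# (healthy). Assumes health_ckpts.json entries are appended in save order;
# illustrative only, not the framework's implementation.
import json

with open("health_ckpts.json", "r", encoding="utf-8") as f:
    records = json.load(f)

healthy = [r["ckpt_name"] for r in records if r["is_health"] == 0]
if healthy:
    print("resume from:", healthy[-1])  # latest healthy weight
else:
    print("no healthy checkpoint recorded; retrain or adjust the threshold")
```

An example of the recorded data in health_ckpts.json: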
+ +```json +[ + { + "is_health": 0, + "ckpt_name": "llama3_1_8b_rank_0-1_1.safetensors" + }, + { + "is_health": 1, + "ckpt_name": "llama3_1_8b_rank_0-2_1.safetensors" + } +] +``` \ No newline at end of file diff --git a/docs/mindformers/docs/source_en/feature/training_function.rst b/docs/mindformers/docs/source_en/feature/training_function.rst index 4a27467211..ad4717484e 100644 --- a/docs/mindformers/docs/source_en/feature/training_function.rst +++ b/docs/mindformers/docs/source_en/feature/training_function.rst @@ -13,3 +13,4 @@ Training Function high_availability memory_optimization other_training_features + skip_data_and_ckpt_health_monitor diff --git a/docs/mindformers/docs/source_zh_cn/feature/configuration.md b/docs/mindformers/docs/source_zh_cn/feature/configuration.md index 53c010538c..a8497ba545 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/configuration.md +++ b/docs/mindformers/docs/source_zh_cn/feature/configuration.md @@ -14,18 +14,20 @@ MindSpore Transformers提供的`YAML`文件中包含对于不同功能的配置 基础配置主要用于指定MindSpore随机种子以及加载权重的相关设置。 -| 参数 | 说明 | 类型 | -|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------| -| seed | 设置全局种子,详情可参考[mindspore.set_seed](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_seed.html)。 | int | -| run_mode | 设置模型的运行模式,可选`train`、`finetune`、`eval`或`predict`。 | str | -| output_dir | 设置保存log、checkpoint、strategy等文件的路径。 | str | -| load_checkpoint | 加载权重的文件或文件夹路径,目前有3个应用场景:
1. 支持传入完整权重文件路径。
2. 支持传入离线切分后的权重文件夹路径。
3. 支持传入包含lora权重和base权重的文件夹路径。
各种权重的获取途径可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | str | -| auto_trans_ckpt | 是否开启分布式权重自动切分与合并功能,详情可参考[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | bool | -| resume_training | 是否开启断点续训功能,详情可参考[断点续训功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html#%E6%96%AD%E7%82%B9%E7%BB%AD%E8%AE%AD)。 | bool | -| load_ckpt_format | 加载的模型权重的格式,可选`ckpt`、`safetensors`。 | str | -| remove_redundancy | 加载的模型权重是否去除了冗余。默认值为`False`。 | bool | -| train_precision_sync | 训练确定性计算开关。默认值为`None` 。 | Optional[bool] | -| infer_precision_sync | 推理确定性计算开关。默认值为`None`。 | Optional[bool] | +| 参数 | 说明 | 类型 | +|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------| +| seed | 设置全局种子,详情可参考[mindspore.set_seed](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_seed.html)。 | int | +| run_mode | 设置模型的运行模式,可选`train`、`finetune`、`eval`或`predict`。 | str | +| output_dir | 设置保存log、checkpoint、strategy等文件的路径。 | str | +| load_checkpoint | 加载权重的文件或文件夹路径,目前有3个应用场景:
1. 支持传入完整权重文件路径。
2. 支持传入离线切分后的权重文件夹路径。
3. 支持传入包含lora权重和base权重的文件夹路径。
各种权重的获取途径可参考[权重转换功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | str | +| auto_trans_ckpt | 是否开启分布式权重自动切分与合并功能,详情可参考[分布式权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html)。 | bool | +| resume_training | 是否开启断点续训功能,详情可参考[断点续训功能](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/resume_training.html#%E6%96%AD%E7%82%B9%E7%BB%AD%E8%AE%AD)。 | bool | +| load_ckpt_format | 加载的模型权重的格式,可选`ckpt`、`safetensors`。 | str | +| remove_redundancy | 加载的模型权重是否去除了冗余。默认值为`False`。 | bool | +| train_precision_sync | 训练确定性计算开关。默认值为`None` 。 | Optional[bool] | +| infer_precision_sync | 推理确定性计算开关。默认值为`None`。 | Optional[bool] | +| use_skip_data_by_global_norm | 数据跳过功能开关。默认值为`False`。 | | +| use_checkpoint_health_monitor | 健康监测功能开关。默认值为`False`。 | | ### Context配置 @@ -102,8 +104,8 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/ 启动模型训练时,除了模型相关参数,还需要设置trainer、runner_config、学习率以及优化器等训练所需模块的参数,MindSpore Transformers提供了如下配置项。 -| 参数 | 说明 | 类型 | -|---------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| +| 参数 | 说明 | 类型 | +|---------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------| | trainer.type | 设置trainer类,通常不同应用场景的模型会设置不同的trainer类。 | str | | trainer.model_name | 设置模型名称,格式为'{name}_xxb',表示模型的某一规格。 | str | | runner_config.epochs | 设置模型训练的轮数。 | int | @@ -112,6 +114,7 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/ | runner_config.sink_size | 设置每次从Host下发到Device的迭代数量,仅`sink_mode=True`时生效,此参数将在后续版本中废弃。 | int | | runner_config.gradient_accumulation_steps | 设置梯度累积步数,默认值为1,表示不开启梯度累积。 | int | | runner_wrapper.type | 设置wrapper类,一般设置'MFTrainOneStepCell'即可。 | str | +| runner_wrapper.local_norm | 设置打印单卡上各参数的梯度范数。 | bool | | runner_wrapper.scale_sense.type | 设置梯度缩放类,一般设置'DynamicLossScaleUpdateCell'即可。 | str | | runner_wrapper.scale_sense.use_clip_grad | 是否开启梯度剪裁,开启可避免反向梯度过大导致训练无法收敛的情况。 | bool | | runner_wrapper.scale_sense.loss_scale_value | 设置loss动态尺度系数,模型loss可以根据该参数配置动态变化。 | int | @@ -126,7 +129,7 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/ | train_dataset.batch_size | 同`runner_config.batch_size`。 | int | | train_dataset.input_columns | 设置训练数据集输入的数据列。 | list | | train_dataset.output_columns | 设置训练数据集输出的数据列。 | list | -| train_dataset.construct_args_key | 设置模型`construct`输入的数据集部分`keys`, 按照字典序传入模型,当模型的传参顺序和数据集输入的顺序不一致时使用该功能。 | list | +| train_dataset.construct_args_key | 设置模型`construct`输入的数据集部分`keys`, 按照字典序传入模型,当模型的传参顺序和数据集输入的顺序不一致时使用该功能。 | list | | train_dataset.column_order | 设置训练数据集输出数据列的顺序。 | list | | train_dataset.num_parallel_workers | 设置读取训练数据集的进程数。 | int | | train_dataset.python_multiprocessing | 是否开启Python多进程模式提升数据处理性能。 | bool | @@ -134,7 +137,7 @@ Context配置主要用于指定[mindspore.set_context](https://www.mindspore.cn/ | train_dataset.repeat | 设置数据集重复数据次数。 | int | | train_dataset.numa_enable | 设置NUMA的默认状态为数据读取启动状态。 | bool | | train_dataset.prefetch_size | 设置预读取数据量。 | int | -| train_dataset.data_loader.type | 设置数据加载类。 | str | +| train_dataset.data_loader.type | 设置数据加载类。 | str | | train_dataset.data_loader.dataset_dir | 设置加载数据的路径。 | str | | train_dataset.data_loader.shuffle | 是否在读取数据集时对数据进行随机排序。 | bool | | train_dataset.transforms | 设置数据增强相关选项。 | - | @@ -231,20 +234,21 @@ MindSpore 
Transformers提供封装后的Callbacks函数类,主要实现在模 该回调函数类主要用于在模型训练过程中保存模型权重文件,有如下几个可配置项: - | 参数 | 说明 | 类型 | - |-------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|------| - | prefix | 设置保存文件名称的前缀。 | str | - | directory | 设置保存文件名称的目录。 | str | - | save_checkpoint_seconds | 设置保存模型权重的间隔秒数。 | int | - | save_checkpoint_steps | 设置保存模型权重的间隔steps数。 | int | - | keep_checkpoint_max | 设置保存模型权重文件的最大数量,如果保存路径内存在超出数量的模型权重文件,会从创建时间最早的文件开始删除,以保证文件总数不超过`keep_checkpoint_max`。 | int | - | keep_checkpoint_per_n_minutes | 设置保存模型权重的间隔分钟数。 | int | - | integrated_save | 开启聚合保存权重文件。
1. 设为True时表示在保存权重文件时聚合所有device的权重,即所有device权重一致。
2. 设为False时表示所有device各自保存自己的权重。
使用半自动并行模式时通常需要设置为False,以避免保存权重文件时出现内存问题。 | bool | - | save_network_params | 是否仅保存模型权重,默认值为`False`。 | bool | - | save_trainable_params | 是否额外保存可训练的参数权重,即部分微调时模型的参数权重,默认为`False`。 | bool | - | async_save | 是否异步执行保存模型权重文件。 | bool | - | remove_redundancy | 是否去除模型权重的冗余,默认值为`False`。 | bool | | | - | checkpoint_format | 保存的模型权重的格式,默认值为`ckpt`。可选`ckpt`,`safetensors`。 | str | + | 参数 | 说明 | 类型 | + |--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|-------| + | prefix | 设置保存文件名称的前缀。 | str | + | directory | 设置保存文件名称的目录。 | str | + | save_checkpoint_seconds | 设置保存模型权重的间隔秒数。 | int | + | save_checkpoint_steps | 设置保存模型权重的间隔steps数。 | int | + | keep_checkpoint_max | 设置保存模型权重文件的最大数量,如果保存路径内存在超出数量的模型权重文件,会从创建时间最早的文件开始删除,以保证文件总数不超过`keep_checkpoint_max`。 | int | + | keep_checkpoint_per_n_minutes | 设置保存模型权重的间隔分钟数。 | int | + | integrated_save | 开启聚合保存权重文件。
1. 设为True时表示在保存权重文件时聚合所有device的权重,即所有device权重一致。
2. 设为False时表示所有device各自保存自己的权重。
使用半自动并行模式时通常需要设置为False,以避免保存权重文件时出现内存问题。 | bool | + | save_network_params | 是否仅保存模型权重,默认值为`False`。 | bool | + | save_trainable_params | 是否额外保存可训练的参数权重,即部分微调时模型的参数权重,默认为`False`。 | bool | + | async_save | 是否异步执行保存模型权重文件。 | bool | + | remove_redundancy | 是否去除模型权重的冗余,默认值为`False`。 | bool | | | + | checkpoint_format | 保存的模型权重的格式,默认值为`ckpt`。可选`ckpt`,`safetensors`。 | str | + | embedding_local_norm_threshold | 设置健康监测的embedding norm的阈值,默认值为`1.0`。 | float | 在`callbacks`字段下可同时配置多个Callbacks函数类,以下是`callbacks`配置示例。 @@ -308,21 +312,24 @@ MindSpore Transformers提供Profile作为模型性能调优的主要工具,详 指标监控配置主要用于配置训练过程中各指标的记录方式,详情可参考[训练指标监控](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/monitor.html)。以下是MindSpore Transformers中通用的指标监控配置项说明: -| 参数名称 | 说明 | 类型 | -|-----------------------------------------|----------------------------------------------------------------------------------------------------------------------------|---------------| -| monitor_config.monitor_on | 设置是否开启监控。默认为`False`,此时以下所有参数不生效。 | bool | -| monitor_config.dump_path | 设置训练过程中`local_norm`、`device_local_norm`、`local_loss`指标文件的保存路径。未设置或设置为`null`时取默认值'./dump'。 | str | -| monitor_config.target | 设置指标`优化器状态`和`local_norm`所监控的的目标参数的名称(片段),可为正则表达式。未设置或设置为`null`时取默认值['.*'],即指定所有参数。 | list[str] | -| monitor_config.invert | 设置反选`monitor_config.target`所指定的参数,默认为`False`。 | bool | -| monitor_config.step_interval | 设置记录指标的频率。默认为1,即每个step记录一次。 | int | -| monitor_config.local_loss_format | 设置指标`local_loss`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | -| monitor_config.device_local_loss_format | 设置指标`device_local_loss`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | -| monitor_config.local_norm_format | 设置指标`local_norm`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | -| monitor_config.device_local_norm_format | 设置指标`device_local_norm`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | -| monitor_config.optimizer_state_format | 设置指标`优化器状态`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | -| monitor_config.weight_state_format | 设置指标`权重L2-norm`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | -| monitor_config.throughput_baseline | 设置指标`吞吐量线性度`的基线值,需要为正数。未设置时默认为`null`,表示不监控该指标。 | int或float | -| monitor_config.print_struct | 设置是否打印模型的全部可训练参数名。若为`True`,则会在第一个step开始时打印所有可训练参数的名称,并在step结束后退出训练。默认为`False`。 | bool | +| 参数名称 | 说明 | 类型 | +|--------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|---------------| +| monitor_config.monitor_on | 设置是否开启监控。默认为`False`,此时以下所有参数不生效。 | bool | +| monitor_config.dump_path | 设置训练过程中`local_norm`、`device_local_norm`、`local_loss`指标文件的保存路径。未设置或设置为`null`时取默认值'./dump'。 | str | +| monitor_config.target | 设置指标`优化器状态`和`local_norm`所监控的的目标参数的名称(片段),可为正则表达式。未设置或设置为`null`时取默认值['.*'],即指定所有参数。 | list[str] | +| monitor_config.invert | 设置反选`monitor_config.target`所指定的参数,默认为`False`。 | bool | +| monitor_config.step_interval | 设置记录指标的频率。默认为1,即每个step记录一次。 | int | +| monitor_config.local_loss_format | 设置指标`local_loss`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | 
str或list[str] | +| monitor_config.device_local_loss_format | 设置指标`device_local_loss`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | +| monitor_config.local_norm_format | 设置指标`local_norm`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | +| monitor_config.device_local_norm_format | 设置指标`device_local_norm`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | +| monitor_config.optimizer_state_format | 设置指标`优化器状态`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | +| monitor_config.weight_state_format | 设置指标`权重L2-norm`的记录形式,可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时默认为`null`,表示不监控该指标。 | str或list[str] | +| monitor_config.throughput_baseline | 设置指标`吞吐量线性度`的基线值,需要为正数。未设置时默认为`null`,表示不监控该指标。 | int或float | +| monitor_config.print_struct | 设置是否打印模型的全部可训练参数名。若为`True`,则会在第一个step开始时打印所有可训练参数的名称,并在step结束后退出训练。默认为`False`。 | bool | +| monitor_config.check_for_global_norm | 设置是否开启进程级故障快恢功能。默认为`False`。 | bool | +| monitor_config.global_norm_spike_threshold | 设置global norm的阈值,当global norm超过时触发数据跳过。默认值为`3.0`。 | float | +| monitor_config.global_norm_spike_count_threshold | 设置连续异常global norm累计的次数,当次数达到该阈值则触发异常中断,终止训练。默认值为`10`。 | int | ### TensorBoard配置 diff --git a/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md b/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md new file mode 100644 index 0000000000..11fec99412 --- /dev/null +++ b/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md @@ -0,0 +1,196 @@ +# 数据跳过和健康监测 + +[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md) + +## 概述 + +数据跳过功能是指当训练过程中,遇到某个step的global norm超过设定的阈值时,会跳过当前步数训练数据;当连续累计的越界次数达到阈值时,便会触发异常中断,终止训练。而健康监测功能是指在保存权重时,对保存的权重的健康状况进行监测,生成一个文件记录权重的健康状况,并在下次续训时通过该文件来选择最新的健康的权重进行续训。 + +权重的健康状况判定请参考[权重健康监测](#权重健康监测)。 + +> - 数据跳过功能和健康监测功能二者结合,能有效解决训练过程中异常 global norm 带来的数据异常问题。使用前请先正常训练一段时间,从而确定需要设定的 global norm 的阈值、连续异常次数的阈值以及 embedding norm 的阈值。 +> - 只有连续出现异常时才会中断训练,如果中途出现一次恢复正常,则会清空累计次数,所以请把控阈值的设定。 +> - 数据跳过功能不能与故障快速恢复功能同时使用。参考[高可用特性](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/high_availability.html)中的进程级重调度恢复功能。 + +## 数据跳过 + +### 概述 + +MindSpore Transformers提供了跳过数据的功能,能够在global norm异常时跳过当前训练的数据,并当连续异常次数达到设定阈值时触发异常中断。 + +本功能一共有以下三种行为: + +- 出现越界global norm,异常连续累计次数+1,跳过当前步数训练数据,打印日志信息。 +- global norm恢复正常,异常连续累计次数清空。 +- 异常连续累计次数达到设定阈值,触发异常中断,终止训练。 + +#### 使用方法 + +**注意**:以下示例所展示的参数数值仅作为实验数据,请以真实训练数据为准。 + +本功能通过YAML配置文件使能: + +```yaml +use_skip_data_by_global_norm: True + +monitor_config: + monitor_on: True + check_for_global_norm: False + global_norm_spike_threshold: 3.0 + global_norm_spike_count_threshold: 2 +``` + +**参数说明:** + +| 参数名称 | 描述 | 类型 | 是否可选 | 取值范围 | +|-----------------------------------|-----------------------------------------------------|-------|------|------| +| use_skip_data_by_global_norm | 数据跳过功能开关。默认值为`False`。 | bool | 可选 | | +| monitor_config | 训练指标监控配置。默认值为`None`。 | | 可选 | | +| monitor_on | 是否开启训练指标监控配置。默认值为`False`。 | bool | 可选 | | +| check_for_global_norm | 是否开启故障快速恢复功能,和数据跳过功能互斥。默认值为`False`。 | bool | 可选 | | +| 
global_norm_spike_threshold | global norm的阈值,当global norm超过时触发数据跳过。默认值为`3.0`。 | float | 可选 | 大于0 | +| global_norm_spike_count_threshold | 连续异常global norm累计的次数,当次数达到该阈值则触发异常中断,终止训练。默认值为`10`。 | int | 可选 | 正整数 | + +### 使用示例 + +假设以Llama3.1-8B为例子,使用的[finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)按照上述[配置](#使用方法)添加参数,其余步骤请参考[Llama3.1-8B文档](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md)。开启训练: + +```shell +bash scripts/msrun_launcher.sh "run_mindformer.py \ + --register_path research/llama3_1 \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_data /{path}/wiki4096.mindrecord \ + --run_mode train \ + --use_parallel True" 8 +``` + +模型正式开始训练时,global norm大于设定阈值,则会打印如下日志,提示用户当前已经连续n次出现异常global norm,并跳过当前步数的训练数据。 + +```log +- INFO - { Epoch:[ 1/ 2], step:[ 1/ 6500], loss: 0.000, per_step_time: 166756ms, lr: 0.0, overflow cond: False, loss_scale: 1.0, global_norm: [44.313248], train_throughput_per_npu: 2.849T +- INFO - 0.0% | | 0.00600 samples/s/p 25 days, 2:07:47 } +- INFO - opt_global_step: 0, skip_data_grad_norm_threshold: 3.0, is_skip: [ True] +- INFO - Current global norm [44.313248] of step 1 has been 1 consecutive times greater than threshold: 3.0 +``` + +当连续异常次数达到设定的阈值时,打印错误日志,终止训练。 + +```log +- INFO - { Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 0.000, per_step_time: 7637ms, lr: 0.0, overflow cond: False, loss_scale: 1.0, global_norm: [47.329006], train_throughput_per_npu: 62.211T +- INFO - 0.0% | | 0.00600 samples/s/p 25 days, 2:07:47 } +- INFO - opt_global_step: 0, skip_data_grad_norm_threshold: 3.0, is_skip: [ True] +ValueError: Current global norm [47.329006] of step 2 has been 2 consecutive times greater than threshold 3.0, stop training... +``` + +## 权重健康监测 + +### 概述 + +MindSpore Transformers提供的健康监测功能,能够通过监测stage0下的embedding local norm,来判定保存的权重的健康情况,通过文件health_ckpts.json,来记录训练过程中所有保存的权重的健康状况,续训时通过该文件自动寻找最新的健康的权重进行续训。 + +本功能涵盖以下三个步骤: + +1. 打开健康监测开关,通过一段时间的正常训练来确定需要设定的embedding local norm的阈值。 +2. 设定阈值后重新开启训练,当保存权重时,embedding local norm超过阈值,则记录权重健康状况为不健康,反之则记录为健康,记录中1表示不健康,0表示健康。 +3. 
续训时,自动根据上次训练生成的health_ckpts.josn文件中记录的最新的健康权重进行续训。 + +**注意**: + +- 只有当pipeline stage>1时的stage0下的embedding norm才有意义。 +- 只有stage0下的卡的权重才有对应的健康状况,记录文件记录的是所有卡权重汇总后的结果,即只要有一张卡的权重的健康状况为不健康,那么该步数对应的权重的健康状况则为不健康。当stage0下所有卡的权重均为健康时,文件才会记录该步数下对应的权重的健康状况为健康。 +- 当记录文件中不存在健康的权重时,则会提示用户重新训练直到存在健康的权重,如若训练一直无法产生健康的权重,则应当考虑设定的embedding local norm的阈值是否合理。 +- 如果指定权重进行续训,则优先以指定的权重进行续训,不考虑权重的健康状况。 +- 该功能不支持full batch的场景。 +- 开启该功能可能会存在通信内存不足的风险。 + +#### 使用方法 + +**注意**:以下示例所展示的参数数值仅作为实验数据,请以真实训练数据为准。 + +本功能通过YAML配置文件使能: + +```yaml +use_checkpoint_health_monitor : True + +monitor_config: + monitor_on: True + +runner_wrapper: + local_norm: True + +callbacks: + - type: CheckpointMonitor + save_checkpoint_steps: 1 + embedding_local_norm_threshold: 270.0 + +parallel: + full_batch: False + dataset_strategy: [[4, 1], [4, 1]] + +parallel_config: + data_parallel: 4 + pipeline_stage: 2 + micro_batch_num: 2 +``` + +**参数说明:** + +| 参数名称 | 描述 | 类型 | 是否可选 | 取值范围 | +|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|------------|-----| +| use_checkpoint_health_monitor | 健康监测功能开关。默认值为`False`。 | bool | 可选 | | +| monitor_config | 训练指标监控配置。默认值为`None`。 | | 可选 | | +| monitor_on | 是否开启训练指标监控配置,开启后才能观测embedding local norm的数据指标。默认值为`False`。 | bool | 可选 | | +| runner_wrapper | wrapper配置。 | | 必选 | | +| local_norm | 单卡上各参数的梯度范数。默认值为`False`。 | bool | 可选 | | +| callbacks | callbacks配置。 | | 必选 | | +| save_checkpoint_steps | 保存权重的步数间隔。 | int | 必选 | 正整数 | +| embedding_local_norm_threshold | 健康监测的embedding norm的阈值。默认值为`1.0`。 | float | 可选 | 大于0 | +| parallel | 并行策略配置。 | | 必选 | | +| full_batch | 是否在并行模式下从数据集中读取加载完整的批数据,设置为`True`表示所有rank都读取完整的批数据,设置为`False`表示每个rank仅加载对应的批数据,设置为`False`时必须设置对应的`dataset_strategy`。此功能仅支持`False`。 | bool | 必选 `False` | | +| dataset_strategy | 仅支持`List of List`类型且仅在`full_batch=False`时生效,列表中子列表的个数需要等于`train_dataset.input_columns`的长度,并且列表中的每个子列表需要和数据集返回的数据的shape保持一致。一般在数据的第1维进行数据并行切分,所以子列表的第1位数配置与`data_parallel`相同,其他位配置为`1`。具体原理可以参考[数据集切分](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/dataset_slice.html)。 | list | 必选 | | +| parallel_config | 并行参数配置。 | | 必选 | | +| data_parallel | 设置数据并行数。 | int | 必选 | 正整数 | +| pipeline_stage | 设置流水线并行数。 | int | 必选 | 正整数 | +| micro_batch_num | 设置流水线并行的微批次大小,在`parallel_config.pipeline_stage`大于1时,应满足`parallel_config.micro_batch_num` >= `parallel_config.pipeline_stage`。 | int | 必选 | 正整数 | + +### 使用示例 + +假设以Llama3.1-8B为例子,使用的[finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)按照上述[配置](#使用方法-1)添加参数和修改,其余步骤请参考[Llama3.1-8B文档](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md)。开启训练: + +```shell +bash scripts/msrun_launcher.sh "run_mindformer.py \ + --register_path research/llama3_1 \ + --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ + --train_data /{path}/wiki4096.mindrecord \ + --run_mode train \ + --use_parallel True" 8 +``` + +模型正式开始训练时,日志会打印当前步数的embedding local norm,便于用户统计观测后设定阈值。 + +```log +- INFO - { Epoch:[ 1/ 2], step:[ 1/ 6500], loss: 0.000, per_step_time: 157149ms, lr: 0.0, overflow cond: False, loss_scale: 1.0, global_norm: [44.31202], train_throughput_per_npu: 3.023T +- INFO - 0.0% | | 0.00636 samples/s/p 23 days, 15:26:22 } +- INFO - embedding_local_norm: 251.79117 + +- INFO - { 
Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 0.000, per_step_time: 8266ms, lr: 2.5641025e-08, overflow cond: False, loss_scale: 1.0, global_norm: [47.328575], train_throughput_per_npu: 57.471T +- INFO - 0.0% | | 0.12096 samples/s/p 1 day, 5:50:52 } +- INFO - embedding_local_norm: 291.3603 +``` + +health_ckpts.json记录数据如下: + +ckpt_name记录的为权重文件名,is_health记录的是对应权重的健康状况。记录中1表示不健康,0表示健康。 + +```json +[ + { + "is_health": 0, + "ckpt_name": "llama3_1_8b_rank_0-1_1.safetensors" + }, + { + "is_health": 1, + "ckpt_name": "llama3_1_8b_rank_0-2_1.safetensors" + } +] +``` \ No newline at end of file diff --git a/docs/mindformers/docs/source_zh_cn/feature/training_function.rst b/docs/mindformers/docs/source_zh_cn/feature/training_function.rst index 63c9692fce..d5935bd437 100644 --- a/docs/mindformers/docs/source_zh_cn/feature/training_function.rst +++ b/docs/mindformers/docs/source_zh_cn/feature/training_function.rst @@ -13,3 +13,4 @@ high_availability memory_optimization other_training_features + skip_data_and_ckpt_health_monitor \ No newline at end of file -- Gitee