diff --git a/docs/mindformers/docs/source_en/feature/high_availability.md b/docs/mindformers/docs/source_en/feature/high_availability.md
index cf0dc69296e738d726cd8de4a5408622136b5794..20662fe8bc6316858bcd9c9c0e87c6cd6d014042 100644
--- a/docs/mindformers/docs/source_en/feature/high_availability.md
+++ b/docs/mindformers/docs/source_en/feature/high_availability.md
@@ -27,6 +27,15 @@ The replica relationship between cards is used to make sure when one of the card
 When End-of-life CKPT, UCE and ARF functions are turned on in combination, the order in which they take effect is: UCE -> ARF -> End-of-Life CKPT, and if one of the functions can be recovered, the next function will not be executed. The end-of-life CKPT function serves as a final safeguard, and the entire training process exits upon completion of this function, so it will be turned on by default when the UCE or ARF functions are turned on.
 
+Fast fault recovery combines the ARF and TRE functions, which take effect in the order TRE -> ARF. TRE monitors the global norm for outliers and raises an exception when one is detected; ARF catches the TRE exception and relaunches the cluster to repair training, without interrupting the whole process.
+
+Notes on using fast fault recovery:
+
+> - The process-level fast recovery feature effectively reduces the downtime between hitting an abnormal global norm during training and relaunching training.
+> - Train normally for a period of time before enabling it, so as to determine an appropriate global norm threshold.
+> - Once a global norm exceeding the set threshold is encountered, an exception is thrown immediately and the fast recovery phase begins.
+> - The data skipping feature cannot be used together with fast fault recovery. See the [Data Skip](https://www.mindspore.cn/mindformers/docs/en/dev/feature/skip_data_and_ckpt_health_monitor.html#skipping-data) feature.
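+
+Conceptually, the TRE check is a simple threshold comparison on each step's global norm. The following Python sketch illustrates the mechanism only; the names (`GlobalNormError`, `tre_check`) are hypothetical, and the real logic is internal to MindSpore and MindSpore Transformers:
+
+```python
+class GlobalNormError(RuntimeError):
+    """Hypothetical exception standing in for a TRE global-norm anomaly."""
+
+def tre_check(global_norm: float, threshold: float, step: int) -> None:
+    # TRE raises as soon as the global norm reaches the threshold;
+    # ARF then catches the failure and relaunches the cluster to resume training.
+    if global_norm >= threshold:
+        raise GlobalNormError(
+            f"global norm {global_norm} >= threshold {threshold} at step {step}"
+        )
+```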
+
 
 ## Instructions for Use
 
 The high availability feature switch is enabled by an environment variable, and the switch is not set separately in the YAML configuration file. For high availability functions which depend on the replica relationship between cards, the YAML file needs to be able to configure the weights and optimizer states to be the same for both cards, as detailed in the [Replica Relationships Configuration](#replica-relationships-configuration) section of this document.
@@ -132,7 +141,9 @@ The key to the end-of-life CheckPoint, UCE and ARF functions of high availabilit
       pipeline_stage: 1
   ```
 
-#### End-of-life CheckPoint Examples
+## Example Usage
+
+### End-of-life CheckPoint
 
 This section demonstrates the use of the end-of-life CKPT using Llama2-13B training as an example.
 
@@ -245,4 +256,83 @@ This section demonstrates the use of the end-of-life CKPT using Llama2-13B train
    - The rank 0 and rank 4 weights have a replica relationship, and the end-of-life checkpoint is stored in rank 0.
    - The rank 3 and rank 7 weights have a replica relationship, and the end-of-life checkpoint is stored in rank 3.
    - The rank 2 and rank 6 weights have a replica relationship, and the end-of-life checkpoint is stored in rank 2.
-   - There is a replica relationship between rank 1 and rank 5 weights, and since worker 1 terminates, the final checkpoint is stored in rank 5.
\ No newline at end of file
+   - There is a replica relationship between rank 1 and rank 5 weights, and since worker 1 terminates, the final checkpoint is stored in rank 5.
+
+### Fast Fault Recovery
+
+This section demonstrates the use of fast fault recovery using Llama3.1-8B training as an example.
+
+> The parameter values shown in the following examples are experimental data only; refer to your real training data.
+
+1. Install [MindSpore](https://www.mindspore.cn/install/en) first.
+2. Download MindSpore Transformers, and add or modify parameters in [finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml) according to the configuration below:
+
+    ```yaml
+    output_dir: './output'
+
+    monitor_config:
+      monitor_on: True
+      check_for_global_norm: True
+      global_norm_spike_threshold: 44.0
+
+    callbacks:
+      - type: CheckpointMonitor
+        save_checkpoint_steps: 1
+    ```
+
+    **Parameters:**
+
+    | Parameters                  | Description                                                                                                                               | Type  | Optional        |
+    |-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------|
+    | output_dir                  | Path to save checkpoints/strategies. Defaults to `./output`.                                                                                | str   | Optional        |
+    | monitor_config              | Training metric monitoring configuration. Defaults to `None`.                                                                               | dict  | Optional        |
+    | monitor_on                  | Whether to enable training metric monitoring. Abnormal global norms can be monitored, and the TRE function enabled, only when this is on.   | bool  | Required `True` |
+    | check_for_global_norm       | Whether to enable process-level fast fault recovery; mutually exclusive with the data skipping feature. Defaults to `False`.                | bool  | Optional        |
+    | global_norm_spike_threshold | The threshold for the global norm; an exception is raised when the global norm exceeds it. Defaults to `3.0`.                               | float | Optional        |
+    | callbacks                   | The callbacks configuration.                                                                                                                | list  | Required        |
+    | save_checkpoint_steps       | The step interval for saving weights.                                                                                                       | int   | Required        |
+
+3. Configure environment variables:
+
+    ```shell
+    export MS_ENABLE_TFT="TRE:1"
+    ```
+
+4. Run the following command to start training:
+
+    ```shell
+    cd mindformers
+
+    bash scripts/msrun_launcher.sh "run_mindformer.py \
+    --register_path research/llama3_1 \
+    --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
+    --train_data /{path}/wiki4096.mindrecord \
+    --run_mode train \
+    --use_parallel True" 8
+    ```
+
+5. When training formally starts and a global norm greater than the set threshold is encountered, the following log is printed to notify the user of the abnormal global norm; the corresponding global step and global norm are recorded in abnormal_global_norm.json, an error is raised, and training enters the fast recovery phase.
+
+    ```log
+    - INFO - { Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 11.905, per_step_time: 2775ms, lr: 2.5641025e-08, overflow cond: False, loss_scale: 1.0, global_norm: [45.702465], train_throughput_per_npu: 171.176T
+    - INFO - 0.0% | | 0.36029 samples/s/p 10:01:16 }
+    - INFO - Current global norm [45.702465] is greater equal than threshold 44.0, stop training...
+    ```
+
+6. After training is relaunched, it resumes from the step at the previous breakpoint.
+    If the global norm at the same global step is still greater than or equal to the set threshold, only the new global norm is recorded and no error is raised, because the corresponding global step has already been recorded in abnormal_global_norm.json under the output_dir set in the YAML file.
+
+    ```log
+    - INFO - { Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 11.905, per_step_time: 3504ms, lr: 2.5641025e-08, overflow cond: False, loss_scale: 1.0, global_norm: [45.706497], train_throughput_per_npu: 135.552T
+    - INFO - 0.0% | | 0.28531 samples/s/p 12:39:17 }
+    - INFO - The global norm [45.706497] of step 2 is still greater or equal than threshold 44.0, continue training.
+    ```
+
+    The data recorded in abnormal_global_norm.json is as follows:
+
+    ```json
+    {
+        "2": [45.70246505737305, 45.70649719238281]
+    }
+    ```
+
+    The key "2" is the global step at which the anomaly occurred, and the list records the global norm values observed before and after recovery.
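+
+    Since the record is plain JSON, it can also be inspected offline. Below is a minimal Python sketch that prints the recorded anomalies; the path assumes the default `output_dir` of `./output`, and the snippet is illustrative only:
+
+    ```python
+    import json
+
+    # Path assumes the default output_dir './output'; adjust to your YAML setting.
+    with open("./output/abnormal_global_norm.json", "r") as f:
+        records = json.load(f)
+
+    for step, norms in records.items():
+        # Each key is a global step; the list holds the global norms
+        # observed at that step before and after recovery.
+        print(f"step {step}: global norms {norms}")
+    ```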
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_en/feature/monitor.md b/docs/mindformers/docs/source_en/feature/monitor.md
index b39f26dd2d3e5efd9a89d703ad238646aa210682..addb1f1a91d566492097134add88bf2de06b1a60 100644
--- a/docs/mindformers/docs/source_en/feature/monitor.md
+++ b/docs/mindformers/docs/source_en/feature/monitor.md
@@ -29,6 +29,9 @@ monitor_config:
   weight_state_format: null
   throughput_baseline: null
   print_struct: False
+  check_for_global_norm: False
+  global_norm_spike_threshold: 1.0
+  global_norm_spike_count_threshold: 10
 
 tensorboard:
   tensorboard_dir: 'worker/tensorboard'
@@ -41,21 +44,24 @@ callbacks:
     per_print_times: 1
 ```
 
-| monitor_config field parameter name | Descriptions | Types |
-|-----------------------------------------|------------------------------------------------------------------------------------------|---------------|
-| monitor_config.monitor_on | Sets whether monitoring is enabled. The default is `False`, when all the following parameters do not take effect | bool |
-| monitor_config.dump_path | Sets the path where the `local_norm`, `device_local_norm`, `local_loss`, and `device_local_loss` metrics files are saved during training. When not set or set to `null` take the default value '. /dump' | str |
-| monitor_config.target | Sets the name (fragment) of the target parameter monitored by the indicator `optimizer_state` and `local_norm`, which can be a regular expression. When not set or set to `null` take the default value ['. *'], i.e. specify all parameters | list[str] |
-| monitor_config.invert | Sets the parameter specified by counterselecting `monitor_config.target`. Defaults to `False`. | bool |
-| monitor_config.step_interval | Sets the frequency of logging the indicator. Default is 1, i.e., record once per step | int |
-| monitor_config.local_loss_format | Sets the logging form of the indicator `local_loss` | str or list[str] |
-| monitor_config.device_local_loss_format | Sets the logging form of the indicator `device_local_loss` | str or list[str] |
-| monitor_config.local_norm_format | Sets the logging form of the indicator `local_norm` | str or list[str] |
-| monitor_config.device_local_norm_format | Sets the logging form of the indicator `device_local_norm` | str or list[str] |
-| monitor_config.optimizer_state_format | Sets the logging form of the indicator `optimizer_state` | str or list[str] |
-| monitor_config.weight_state_format | Sets the logging form of the indicator `weight L2-norm` | str or list[str] |
-| monitor_config.throughput_baseline | Sets the baseline value for the metric `throughput linearity`, which needs to be positive. It will be written to both Tensorboard and logs. Defaults to `null` when not set, indicating that the metric is not monitored | int or float |
-| monitor_config.print_struct | Sets whether to print all trainable parameter names for the model. If `True`, it will print the names of all trainable parameters at the start of the first step and exit training at the end of the step. Default is `False`. | bool |
+| monitor_config field parameter name | Descriptions | Types |
+|--------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| monitor_config.monitor_on | Sets whether monitoring is enabled. Defaults to `False`, in which case none of the following parameters take effect | bool |
+| monitor_config.dump_path | Sets the path where the `local_norm`, `device_local_norm`, `local_loss`, and `device_local_loss` metric files are saved during training. When not set or set to `null`, takes the default value `./dump` | str |
+| monitor_config.target | Sets the names (fragments) of the target parameters monitored by the indicators `optimizer_state` and `local_norm`; regular expressions are supported. When not set or set to `null`, takes the default value `['.*']`, i.e. all parameters | list[str] |
+| monitor_config.invert | Sets whether to invert the selection specified by `monitor_config.target`. Defaults to `False` | bool |
+| monitor_config.step_interval | Sets the frequency of logging the indicators. Defaults to `1`, i.e. recording once per step | int |
+| monitor_config.local_loss_format | Sets the logging form of the indicator `local_loss` | str or list[str] |
+| monitor_config.device_local_loss_format | Sets the logging form of the indicator `device_local_loss` | str or list[str] |
+| monitor_config.local_norm_format | Sets the logging form of the indicator `local_norm` | str or list[str] |
+| monitor_config.device_local_norm_format | Sets the logging form of the indicator `device_local_norm` | str or list[str] |
+| monitor_config.optimizer_state_format | Sets the logging form of the indicator `optimizer_state` | str or list[str] |
+| monitor_config.weight_state_format | Sets the logging form of the indicator `weight L2-norm` | str or list[str] |
+| monitor_config.throughput_baseline | Sets the baseline value for the metric `throughput linearity`, which needs to be positive. It is written to both Tensorboard and the logs. Defaults to `null` when not set, indicating that the metric is not monitored | int or float |
+| monitor_config.print_struct | Sets whether to print all trainable parameter names of the model. If `True`, the names of all trainable parameters are printed at the start of the first step, and training exits at the end of that step. Defaults to `False` | bool |
+| monitor_config.check_for_global_norm | Sets whether to enable anomaly detection for the indicator `global norm`. Defaults to `False` | bool |
+| monitor_config.global_norm_spike_threshold | Sets a relative threshold for the indicator `global norm`; a value exceeding it is considered abnormal. Defaults to `3.0` | float |
+| monitor_config.global_norm_spike_count_threshold | Sets the allowed number of consecutive abnormal `global norm` values; when the count reaches this threshold, an exception is raised and training is terminated. Defaults to `10` | int |
 
 The optional values for the parameters of the form xxx_format above are the strings 'tensorboard' and 'log' (for writing to the Tensorboard and writing to the log, respectively), or a list of both, or `null`. All default to `null` when not set, indicating that the corresponding metrics are not monitored.
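+
+The interaction between `global_norm_spike_threshold` and `global_norm_spike_count_threshold` can be pictured as a counter over consecutive steps. The following Python sketch illustrates that behavior as described in the table above, not the actual implementation; the class name and the reset-on-normal-step detail are assumptions:
+
+```python
+class ConsecutiveSpikeMonitor:
+    """Hypothetical sketch of the consecutive global-norm anomaly counter."""
+
+    def __init__(self, spike_threshold: float = 3.0, count_threshold: int = 10):
+        self.spike_threshold = spike_threshold
+        self.count_threshold = count_threshold
+        self.consecutive_spikes = 0
+
+    def on_step(self, global_norm: float) -> None:
+        if global_norm > self.spike_threshold:
+            self.consecutive_spikes += 1
+            # Terminate once the anomaly has persisted for count_threshold steps.
+            if self.consecutive_spikes >= self.count_threshold:
+                raise RuntimeError(
+                    f"global norm abnormal for {self.consecutive_spikes} consecutive steps"
+                )
+        else:
+            # Assumed: a normal step resets the consecutive counter.
+            self.consecutive_spikes = 0
+```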
diff --git a/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md b/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md
index 2ce323a3ecb47024addf358c256c68401abf147d..e70edbcce435670ee78e2876ea3886a853ce2e9b 100644
--- a/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md
+++ b/docs/mindformers/docs/source_en/feature/skip_data_and_ckpt_health_monitor.md
@@ -53,7 +53,7 @@ monitor_config:
 
 ### Conversion Example
 
-Assuming Llama3.1-8B is taken as an example, use [finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml) to add parameters according to the above [Configuration](#usage), please refer to the [Llama3.1-8B Document](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md) for the remaining steps. Start training:
+Taking Llama3.1-8B as an example, use [finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml) to add parameters according to the above [Configuration](#usage); please refer to the [Llama3.1-8B Document](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md) for the remaining steps. Start training:
 
 ```shell
 bash scripts/msrun_launcher.sh "run_mindformer.py \
@@ -155,7 +155,7 @@ parallel_config:
 
 ### Conversion Example
 
-Assuming Llama3.1-8B is taken as an example, use [finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml) to add parameters and modify according to the above [Configuration](#usage-1), please refer to the [Llama3.1-8B Document](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md) for the remaining steps. Start training:
+Taking Llama3.1-8B as an example, use [finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml) to add and modify parameters according to the above [Configuration](#usage-1); please refer to the [Llama3.1-8B Document](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md) for the remaining steps. Start training:
 
 ```shell
 bash scripts/msrun_launcher.sh "run_mindformer.py \
diff --git a/docs/mindformers/docs/source_zh_cn/feature/high_availability.md b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
index 02b84c1c10f5124a6d884fa9421e99cb2968aaf0..511b7a5803f93cf20fd2352318631ddd108f8769 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/high_availability.md
@@ -27,6 +27,15 @@ MindSpore Transformers 高可用特性提供了如下六个功能:
 临终 CKPT、UCE 和 ARF 组合开启这三个功能时,依次生效的顺序是:UCE -> ARF -> 临终 CKPT ,如果其中一个功能可以恢复,就不会执行下一个功能。临终 CKPT 功能作为最后的保障,完成该功能后整个训练进程会退出,所以在 UCE 或 ARF 功能开启时,会默认开启临终 CKPT。
 
+故障快速恢复由ARF和TRE两个功能组合而成,生效顺序为:TRE -> ARF。TRE负责监测global norm的异常值并抛出异常,ARF负责在捕获TRE异常后重新拉起整个集群修复训练,整个过程不中断训练。
+
+故障快速恢复使用须知:
+
+> - 进程级快速恢复功能,能有效减少训练过程中遇到异常 global norm 而导致中断训练直至重新拉起的时间。
+> - 使用前请先正常训练一段时间,从而确定需要设定的 global norm 阈值。
+> - 一旦遇到超过设定阈值的 global norm,便会立即抛出异常,进入快速恢复阶段。
+> - 数据跳过功能不能与故障快速恢复功能同时使用。参考[数据跳过](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/skip_data_and_ckpt_health_monitor.html#数据跳过)功能。
+
 ## 使用说明
 
 高可用特性开关由环境变量使能,YAML 配置文件中不单独设置开关。但对于要求卡间存在副本关系的高可用特性,YAML 文件需要能配置出两张卡的权重和优化器状态一致,详见本文档中的[副本关系配置](#副本关系配置)章节。
@@ -53,6 +62,7 @@ export MS_TFT_PORT=30051
   - 开启 UCE 或者 ARF 功能时,默认开启 TTP 功能
   - 目前 TRE 功能不可以与 UCE 或 ARF 功能同时使用
   - TRE 功能不依赖 MindIO 组件,若只使能TRE特性,无需配置 MindIO 相关的环境变量 MINDIO_FOR_MINDSPORE、MS_TFT_IP 和 MS_TFT_PORT
+  - `MS_TFT_IP` 和 `MS_TFT_PORT` 分别表示 TFT Controller 的 IP 和端口号,无默认值,需要用户指定。如果由 MindSpore Transformers 启动 Controller,则配置用户集群中 rank0 节点的 IP 和端口号。如果用户自行启动 Controller,则配置 Controller 的 IP 和端口号。
 
 ### YAML 配置
 
@@ -131,7 +141,9 @@ YAML配置包含两部分:临终 CKPT 的保存及恢复配置和卡间副本
       pipeline_stage: 1
   ```
 
-#### 临终 CKPT 使用示例
+## 使用示例
+
+### 临终 CKPT
 
 本章节以 Llama2-13B 训练为例演示临终 CKPT 的使用。
 
@@ -201,8 +213,7 @@ YAML配置包含两部分:临终 CKPT 的保存及恢复配置和卡间副本
    注意:需要将 `/YourDataSetPath` 换成实际数据集的路径。
 4. 待训练执行若干个 step 之后,终止 worker 进程,触发临终 CKPT 保存
-   注意:通过上述启动方式, MindIO Controller 附着在 worker 0 进程上,此种情况下不能终止 worker 0,否则导致 MindIO Controller 退出,
-   无法触发临终 CKPT。但是通过 taskd 方式启动训练时,MindIO Controller 是个单独的进程,可以终止 worker 0 进程。
+   注意:通过上述启动方式, MindIO Controller 附着在 worker 0 进程上,此种情况下不能终止 worker 0,否则导致 MindIO Controller 退出,无法触发临终 CKPT。但是通过 taskd 方式启动训练时,MindIO Controller 是个单独的进程,可以终止 worker 0 进程。
 5. 确认临终的 CheckPoint 生成
    在整个训练进程结束后,通过日志确认最终生成的 CheckPoint 文件的合理性,具体操作如下:
@@ -246,3 +257,82 @@ YAML配置包含两部分:临终 CKPT 的保存及恢复配置和卡间副本
    - rank 3 和 rank 7 权重存在副本关系,临终的 Checkpoint 保存在 rank 3
    - rank 2 和 rank 6 权重存在副本关系,临终的 Checkpoint 保存在 rank 2
    - rank 1 和 rank 5 权重存在副本关系,由于 worker 1 终止,临终的 Checkpoint 保存在 rank 5
+
+### 故障快速恢复
+
+本章节以 Llama3.1-8B 训练为例演示故障快速恢复的使用。
+
+> 以下示例所展示的参数数值仅作为实验数据,请以真实训练数据为准。
+
+1. 先安装 [MindSpore](https://www.mindspore.cn/install)。
+2. 下载 MindSpore Transformers,使用[finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)按照如下配置添加和修改参数:
+
+    ```yaml
+    output_dir: './output'
+
+    monitor_config:
+      monitor_on: True
+      check_for_global_norm: True
+      global_norm_spike_threshold: 44.0
+
+    callbacks:
+      - type: CheckpointMonitor
+        save_checkpoint_steps: 1
+    ```
+
+    **参数说明:**
+
+    | 参数名称 | 描述 | 类型 | 是否可选 |
+    |-----------------------------|-------------------------------------------------|-------|----------|
+    | output_dir | 保存权重和切分策略的文件路径。默认值为`./output`。 | str | 可选 |
+    | monitor_config | 训练指标监控配置。默认值为`None`。 | dict | 可选 |
+    | monitor_on | 是否开启训练指标监控。只有开启时才能监测异常的global norm和使能TRE功能。 | bool | 必选`True` |
+    | check_for_global_norm | 是否开启进程级故障快速恢复功能,和数据跳过功能互斥。默认值为`False`。 | bool | 可选 |
+    | global_norm_spike_threshold | global norm的阈值,当global norm超过该值时触发报错,进入快速恢复。默认值为`3.0`。 | float | 可选 |
+    | callbacks | callbacks配置。 | list | 必选 |
+    | save_checkpoint_steps | 保存权重的步数间隔。 | int | 必选 |
+
+3. 配置环境变量:
+
+    ```shell
+    export MS_ENABLE_TFT="TRE:1"
+    ```
+
+4. 运行以下命令,开启训练:
+
+    ```shell
+    cd mindformers
+
+    bash scripts/msrun_launcher.sh "run_mindformer.py \
+    --register_path research/llama3_1 \
+    --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \
+    --train_data /{path}/wiki4096.mindrecord \
+    --run_mode train \
+    --use_parallel True" 8
+    ```
+
+5. 模型正式开始训练时,若遇到global norm大于设定阈值,则会打印如下日志,提示用户当前遇到异常global norm,并将对应的global step和global norm记录到abnormal_global_norm.json中,触发报错,进入快速恢复阶段。
+
+    ```log
+    - INFO - { Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 11.905, per_step_time: 2775ms, lr: 2.5641025e-08, overflow cond: False, loss_scale: 1.0, global_norm: [45.702465], train_throughput_per_npu: 171.176T
+    - INFO - 0.0% | | 0.36029 samples/s/p 10:01:16 }
+    - INFO - Current global norm [45.702465] is greater equal than threshold 44.0, stop training...
+    ```
+
+6. 重新拉起训练后,从之前断点的步数开始续训。如果训练至相同的global step时global norm仍然大于设定的阈值,由于此前已经将对应的global step记录到YAML设置的output_dir下的abnormal_global_norm.json中,此处只会记录相应的global norm,并不会抛出异常。
+
+    ```log
+    - INFO - { Epoch:[ 1/ 2], step:[ 2/ 6500], loss: 11.905, per_step_time: 3504ms, lr: 2.5641025e-08, overflow cond: False, loss_scale: 1.0, global_norm: [45.706497], train_throughput_per_npu: 135.552T
+    - INFO - 0.0% | | 0.28531 samples/s/p 12:39:17 }
+    - INFO - The global norm [45.706497] of step 2 is still greater or equal than threshold 44.0, continue training.
+    ```
+
+    abnormal_global_norm.json记录数据如下:
+
+    ```json
+    {
+        "2": [45.70246505737305, 45.70649719238281]
+    }
+    ```
+
+    "2"表示异常发生时对应的global step,后面的列表记录的是恢复前后训练的global norm。
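+
+    abnormal_global_norm.json 是普通的 JSON 文件,也可以离线查看其中记录的异常。以下为一个最小的 Python 读取示意,文件路径按上文默认的 output_dir `./output` 假设,仅作说明用:
+
+    ```python
+    import json
+
+    # 路径假设使用默认的 output_dir './output',请按 YAML 实际配置调整
+    with open("./output/abnormal_global_norm.json", "r") as f:
+        records = json.load(f)
+
+    for step, norms in records.items():
+        # 键为发生异常的 global step,列表记录恢复前后的 global norm
+        print(f"step {step}: global norms {norms}")
+    ```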
\ No newline at end of file
diff --git a/docs/mindformers/docs/source_zh_cn/feature/monitor.md b/docs/mindformers/docs/source_zh_cn/feature/monitor.md
index fb1c95840f3f63883a5ba578ddb72fdf5bce6da7..ff878e30c4865974e16aa6cf40bf2d309475584b 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/monitor.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/monitor.md
@@ -29,6 +29,9 @@ monitor_config:
   weight_state_format: null
   throughput_baseline: null
   print_struct: False
+  check_for_global_norm: False
+  global_norm_spike_threshold: 1.0
+  global_norm_spike_count_threshold: 10
 
 tensorboard:
   tensorboard_dir: 'worker/tensorboard'
@@ -41,21 +44,24 @@ callbacks:
     per_print_times: 1
 ```
 
-| monitor_config字段参数名称 | 说明 | 类型 |
-|-----------------------------------------|------------------------------------------------------------------------------------------|---------------|
-| monitor_config.monitor_on | 设置是否开启监控。默认为`False`,此时以下所有参数不生效 | bool |
-| monitor_config.dump_path | 设置训练过程中`local_norm`、`device_local_norm`、`local_loss`、`device_local_loss`指标文件的保存路径。未设置或设置为`null`时取默认值'./dump' | str |
-| monitor_config.target | 设置指标`优化器状态`和`local_norm`所监控的的目标参数的名称(片段),可为正则表达式。未设置或设置为`null`时取默认值['.*'],即指定所有参数 | list[str] |
-| monitor_config.invert | 设置反选`monitor_config.target`所指定的参数。默认为`False` | bool |
-| monitor_config.step_interval | 设置记录指标的频率。默认为1,即每个step记录一次 | int |
-| monitor_config.local_loss_format | 设置指标`local_loss`的记录形式 | str或list[str] |
-| monitor_config.device_local_loss_format | 设置指标`device_local_loss`的记录形式 | str或list[str] |
-| monitor_config.local_norm_format | 设置指标`local_norm`的记录形式 | str或list[str] |
-| monitor_config.device_local_norm_format | 设置指标`device_local_norm`的记录形式 | str或list[str] |
-| monitor_config.optimizer_state_format | 设置指标`优化器状态`的记录形式 | str或list[str] |
-| monitor_config.weight_state_format | 设置指标`权重L2-norm`的记录形式 | str或list[str] |
-| monitor_config.throughput_baseline | 设置指标`吞吐量线性度`的基线值,需要为正数。会同时写入到 Tensorboard 和日志。未设置时默认为`null`,表示不监控该指标 | int或float |
-| monitor_config.print_struct | 设置是否打印模型的全部可训练参数名。若为`True`,则会在第一个step开始时打印所有可训练参数的名称,并在step结束后退出训练。默认为`False` | bool |
+| monitor_config字段参数名称 | 说明 | 类型 |
+|--------------------------------------------------|-----------------------------------------------------------------------------------------------------------|---------------|
+| monitor_config.monitor_on | 设置是否开启监控。默认为`False`,此时以下所有参数不生效 | bool |
+| monitor_config.dump_path | 设置训练过程中`local_norm`、`device_local_norm`、`local_loss`、`device_local_loss`指标文件的保存路径。未设置或设置为`null`时取默认值'./dump' | str |
+| monitor_config.target | 设置指标`优化器状态`和`local_norm`所监控的目标参数的名称(片段),可为正则表达式。未设置或设置为`null`时取默认值['.*'],即指定所有参数 | list[str] |
+| monitor_config.invert | 设置反选`monitor_config.target`所指定的参数。默认为`False` | bool |
+| monitor_config.step_interval | 设置记录指标的频率。默认为1,即每个step记录一次 | int |
+| monitor_config.local_loss_format | 设置指标`local_loss`的记录形式 | str或list[str] |
+| monitor_config.device_local_loss_format | 设置指标`device_local_loss`的记录形式 | str或list[str] |
+| monitor_config.local_norm_format | 设置指标`local_norm`的记录形式 | str或list[str] |
+| monitor_config.device_local_norm_format | 设置指标`device_local_norm`的记录形式 | str或list[str] |
+| monitor_config.optimizer_state_format | 设置指标`optimizer_state`的记录形式 | str或list[str] |
+| monitor_config.weight_state_format | 设置指标`权重L2-norm`的记录形式 | str或list[str] |
+| monitor_config.throughput_baseline | 设置指标`吞吐量线性度`的基线值,需要为正数。会同时写入到 Tensorboard 和日志。未设置时默认为`null`,表示不监控该指标 | int或float |
+| monitor_config.print_struct | 设置是否打印模型的全部可训练参数名。若为`True`,则会在第一个step开始时打印所有可训练参数的名称,并在step结束后退出训练。默认为`False` | bool |
+| monitor_config.check_for_global_norm | 设置是否开启指标`global norm`的异常监测。默认为`False` | bool |
+| monitor_config.global_norm_spike_threshold | 设置指标`global norm`的相对阈值,大于该值即判定为异常。默认值为`3.0` | float |
+| monitor_config.global_norm_spike_count_threshold | 设置指标`global norm`连续异常的累计次数,当次数达到该阈值时触发异常中断,终止训练。默认值为`10` | int |
 
 上述 xxx_format 形式的参数的可选值为字符串'tensorboard'和'log'(分别表示写入 Tensorboard 和写入日志),或由两者组成的列表,或`null`。未设置时均默认为`null`,表示不监控对应指标。
diff --git a/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md b/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md
index 11fec99412144c3448602b97c03b02c163f02c9b..e0043513b53851a78e9c18d0f4a79faac32c9842 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/skip_data_and_ckpt_health_monitor.md
@@ -53,7 +53,7 @@ monitor_config:
 
 ### 使用示例
 
-假设以Llama3.1-8B为例子,使用的[finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)按照上述[配置](#使用方法)添加参数,其余步骤请参考[Llama3.1-8B文档](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md)。开启训练:
+以Llama3.1-8B为例,使用[finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)按照上述[配置](#使用方法)添加参数,其余步骤请参考[Llama3.1-8B文档](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md)。开启训练:
 
 ```shell
 bash scripts/msrun_launcher.sh "run_mindformer.py \
@@ -155,7 +155,7 @@ parallel_config:
 
 ### 使用示例
 
-假设以Llama3.1-8B为例子,使用的[finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)按照上述[配置](#使用方法-1)添加参数和修改,其余步骤请参考[Llama3.1-8B文档](https://gitee.com/lanshaozuishuai/mindformers/blob/dev/research/llama3_1/README.md)。开启训练:
+以Llama3.1-8B为例,使用[finetune_llama3_1_8b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml)按照上述[配置](#使用方法-1)添加和修改参数,其余步骤请参考[Llama3.1-8B文档](https://gitee.com/mindspore/mindformers/blob/dev/research/llama3_1/README.md)。开启训练:
 
 ```shell
 bash scripts/msrun_launcher.sh "run_mindformer.py \