From f98bdc55d8cb86778fef75211874e21ff0024267 Mon Sep 17 00:00:00 2001
From: shen_haochen
Date: Thu, 27 Mar 2025 16:30:17 +0800
Subject: [PATCH] fix doc for dryrun and MS_NODE_TIMEOUT

---
 .../source_en/model_train/parallel/dynamic_cluster.md | 2 +-
 .../source_en/model_train/parallel/msrun_launcher.md | 6 +++---
 .../source_zh_cn/model_train/parallel/dynamic_cluster.md | 2 +-
 .../source_zh_cn/model_train/parallel/msrun_launcher.md | 8 ++++----
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md b/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md
index 1d6373908e..490e416c64 100644
--- a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md
+++ b/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md
@@ -118,7 +118,7 @@ The relevant environment variables:
  MS_NODE_TIMEOUT
  Node heartbeat timeout in seconds。
  Integer
- The default is 300 seconds.
+ The default is 30 seconds.
  This value represents the heartbeat timeout time between the scheduler and the worker. If there are no heartbeat messages within this time window, the cluster will exit abnormally.
diff --git a/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md b/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md
index 65c6264e9a..5a6673912a 100644
--- a/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md
+++ b/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md
@@ -84,16 +84,16 @@ A parameters list of command line:
  --sim_level
- Set single card simulated compilation level.
+ Set simulated compilation level.
  Integer
  Default: -1. Disable simulated compilation.
- If this parameter is set, msrun starts only a single process for simulated compilation and does not execute operators. This feature is commonly used to debug large-scale distributed training parallel strategies, and to detect memory and strategy issues in advance. If set to 0, only compile the frontend graph; If set to 1, further compile backend graph compilation and exit during the execution phase
+ If this parameter is set, msrun starts only the processes for simulated compilation and does not execute operators. This feature is commonly used to debug large-scale distributed training parallel strategies, and to detect memory and strategy issues in advance. If set to 0, only compile the frontend graph; If set to 1, further compile backend graph compilation and exit during the execution phase
  --sim_rank_id
  rank_id of the simulated process.
  Integer
- Default: 0.
+ Default: -1. Disable single process simulated compilation.
  Set rank id of the simulated process.
diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md b/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md
index 62cc358afb..0b2791f7b1 100644
--- a/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md
+++ b/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md
@@ -118,7 +118,7 @@ MindSpore**动态组网**特性通过**复用Parameter Server模式训练架构*
  MS_NODE_TIMEOUT
  节点心跳超时时间,单位:秒。
  Integer
- 默认为300秒
+ 默认为30秒
  此数值代表Scheduler以及Worker间心跳超时时间,若此时间窗口内没有心跳消息,则集群异常退出。
diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md b/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md
index bed2a70e9d..1fbc91afa6 100644
--- a/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md
+++ b/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md
@@ -84,16 +84,16 @@
  --sim_level
- 设置单卡模拟编译等级。
+ 设置模拟编译等级。
  Integer
- 默认为-1,即关闭单卡模拟编译功能。
- 若用户配置此参数,msrun只会拉起单进程模拟编译,不做算子执行。此功能通常用于调试大规模分布式训练并行策略,在编译阶段提前发现内存和策略问题。 若设置为0,只做前端图编译;若设置为1,进一步执行后端图编译,在执行图阶段退出。
+ 默认为-1,即关闭模拟编译功能。
+ 若用户配置此参数,msrun只会拉起进程的模拟编译,不做算子执行。此功能通常用于调试大规模分布式训练并行策略,在编译阶段提前发现内存和策略问题。 若设置为0,只做前端图编译;若设置为1,进一步执行后端图编译,在执行图阶段退出。
  --sim_rank_id
  单卡模拟编译的rank_id。
  Integer
- 默认为0。
+ 默认为-1,即关闭单卡模拟编译功能。
  设置单卡模拟编译进程的rank_id。
--
Gitee