diff --git a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md b/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md
deleted file mode 100644
index 1d6373908e17b971c7d43d975d21cca5974e6d5a..0000000000000000000000000000000000000000
--- a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md
+++ /dev/null
@@ -1,439 +0,0 @@
-# Dynamic Cluster Startup
-
-[](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md)
-
-## Overview
-
-To meet reliability requirements during training, MindSpore provides the **dynamic cluster** feature, which lets users start Ascend/GPU/CPU distributed training tasks without relying on any third-party library (OpenMPI) and without modifying the training script. We recommend this startup method as the preferred option.
-
-The MindSpore **Dynamic Cluster** feature replaces the OpenMPI capability by **reusing the Parameter Server mode training architecture**, which is described in the [Parameter Server Mode](https://mindspore.cn/docs/en/master/model_train/parallel/parameter_server_training.html) training tutorial.
-
-The **Dynamic Cluster** feature starts multiple MindSpore training processes as `Workers` and starts an additional `Scheduler` for cluster networking and disaster recovery, so distributed training can be achieved without OpenMPI's message passing mechanism. The user only needs to make a few changes to the startup script to perform distributed training.
-
-> Dynamic cluster supports Ascend, GPU and CPU, so the dynamic cluster startup script can be quickly migrated between multiple hardware platforms without additional modifications.
-
-The relevant environment variables:
-
-
Environment Variables | -Function | -Type | -Value | -Description | -
---|---|---|---|---|
MS_ROLE | -Specifies the role of this process. | -String | -
-
|
- The Worker and Parameter Server processes register with the Scheduler process to complete the networking. | -
MS_SCHED_HOST | -Specifies the IP address of the Scheduler. | -String | -Valid IP address. | -In the current version, IPv6 addresses are supported only on the `Ascend` platform. | -
MS_SCHED_PORT | -Specifies the Scheduler binding port number. | -Integer | -Port number in the range of 1024 to 65535. | -- |
MS_NODE_ID | -Specifies the ID of this process, unique within the cluster. | -String | -Represents the unique ID of this process, which is automatically generated by MindSpore by default. | -
- MS_NODE_ID needs to be set in the following cases. Normally it does not need to be set and is automatically generated by MindSpore:
-
|
-
MS_WORKER_NUM | -Specifies the number of processes with the role MS_WORKER. | -Integer | -Integer greater than 0. | -- The number of Worker processes started by the user should be equal to the value of this environment variable. If it is less than this value, the networking fails. If it is greater than this value, the Scheduler process will complete the networking according to the order of Worker registration, and the redundant Worker processes will fail to start. - | -
MS_SERVER_NUM | -Specifies the number of processes with the role MS_PSERVER. | -Integer | -Integer greater than 0. | -Only set in Parameter Server training mode. | -
MS_WORKER_IP | -Specifies the IP address this process uses for communication and networking with other processes. | -String | -Valid IP address. | -Setting this environment variable is recommended when networking over IPv6. However, when MS_SCHED_HOST is set to ::1 (the IPv6 local loopback address), MS_WORKER_IP does not need to be set, because MindSpore uses the local loopback interface for communication by default. | -
MS_ENABLE_RECOVERY | -Turn on disaster recovery. | -Integer | -1 for on, 0 for off. The default is 0. | -- |
MS_RECOVERY_PATH | -Folder used for persistence. | -String | -Valid user directory. | -The Worker and Scheduler processes persist necessary data to files during execution, such as the node information used to restore the networking and the intermediate state of the training job. | -
MS_ENABLE_LCCL | -Whether to use LCCL as the communication library. | -Integer | -1 for yes, other values for no. The default is no. | -The LCCL communication library currently supports only single-machine multi-card scenarios and must be run with the graph compilation level set to O0. | -
MS_TOPO_TIMEOUT | -Timeout of the cluster networking phase, in seconds. | -Integer | -The default is 30 minutes (1800 seconds). | -All nodes must register with the Scheduler within this time window. Registration attempts after the window fail, and if the number of registered nodes does not meet the requirement, cluster networking fails. Configuring this environment variable is recommended for large-scale clusters. | -
MS_NODE_TIMEOUT | -Node heartbeat timeout in seconds. | -Integer | -The default is 300 seconds. | -This value represents the heartbeat timeout between the Scheduler and the Worker. If no heartbeat message arrives within this time window, the cluster exits abnormally. | -
MS_RECEIVE_MSG_TIMEOUT | -Node timeout for receiving messages in seconds. | -Integer | -The default is 300 seconds. | -This value represents the timeout window for the node to receive messages from the other end. If there is no message response within the time window, an empty message is returned. | -
MS_RETRY_INTERVAL_LOWER | -Lower limit of message retry interval between nodes in seconds. | -Integer | -The default is 3 seconds. | -This value represents the lower limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects the value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval time. This variable is used to control the message concurrency of the Scheduler. |
-
MS_RETRY_INTERVAL_UPPER | -Upper limit of message retry interval between nodes in seconds. | -Integer | -The default is 5 seconds. | -This value represents the upper limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects the value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval time. This variable is used to control the message concurrency of the Scheduler. |
-
MS_DISABLE_HEARTBEAT | -Disables the heartbeat feature between nodes in the cluster. | -Integer | -The heartbeat feature is enabled by default. | -If set to 1, the heartbeat between cluster nodes is disabled. In this scenario, the Scheduler will not detect Worker exceptions and will not force the cluster to exit. This variable can reduce the message concurrency of the Scheduler. Setting this environment variable is recommended when debugging with the `gdb attach` command. |
-
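The environment variables above can be assembled per process before launching. The following is a minimal sketch, not MindSpore's own tooling: the helper name `build_cluster_envs` is hypothetical, the defaults `127.0.0.1` and `8118` are illustrative, and the role strings `MS_SCHED` and `MS_WORKER` are assumed from MindSpore's documented role names, since the role values are not listed in the table as it appears here.

```python
import os


def build_cluster_envs(worker_num, sched_host="127.0.0.1", sched_port=8118):
    """Return one environment dict per process: the Scheduler first, then each Worker.

    Hypothetical helper for illustration only; it does not start any process.
    """
    if worker_num <= 0:
        raise ValueError("MS_WORKER_NUM must be an integer greater than 0")
    if not 1024 <= sched_port <= 65535:
        raise ValueError("MS_SCHED_PORT must be in the range 1024 to 65535")

    # Settings shared by every process in the cluster.
    shared = {
        "MS_WORKER_NUM": str(worker_num),
        "MS_SCHED_HOST": sched_host,
        "MS_SCHED_PORT": str(sched_port),
    }
    # Role names are an assumption (see the lead-in): one Scheduler plus worker_num Workers.
    roles = ["MS_SCHED"] + ["MS_WORKER"] * worker_num
    # Start from the current environment so the dicts can be passed to subprocess.Popen(env=...).
    return [{**os.environ, **shared, "MS_ROLE": role} for role in roles]


envs = build_cluster_envs(worker_num=4)
print([env["MS_ROLE"] for env in envs])
```

Each returned dict could then be passed as the `env` argument of `subprocess.Popen` together with the training command, one process per dict.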
Parameters | -Functions | -Types | -Values | -Instructions | -
---|---|---|---|---|
--worker_num | -The total number of Worker processes participating in the distributed task. | -Integer | -An integer greater than 0. The default value is 8. | -The total number of Workers started on all nodes should equal this parameter: if the total is greater, the extra Worker processes fail to register; if the total is smaller, the cluster waits for a timeout period, then reports that the task failed to launch and exits. The size of the timeout window can be configured with the parameter cluster_time_out . |
-
--local_worker_num | -The number of Worker processes launched on the current node. | -Integer | -An integer greater than 0. The default value is 8. | -When this parameter equals worker_num , all Worker processes run locally, and the node_rank value is ignored. |
-
--master_addr | -Specifies the IP address or hostname of the Scheduler. | -String | -Valid IP address or hostname. The default is the IP address 127.0.0.1. | -msrun automatically detects on which node to launch the Scheduler process, so users do not need to handle this. If the corresponding IP address cannot be found or the hostname cannot be resolved by DNS, the training task fails to launch. IPv6 addresses are not supported in the current version. If a hostname is passed, msrun automatically resolves it to an IP address, which requires DNS service in the user's environment. |
-
--master_port | -Specifies the Scheduler binding port number. | -Integer | -Port number in the range 1024 to 65535. The default is 8118. | -- |
--node_rank | -The index of the current node. | -Integer | -An integer greater than or equal to 0 can be passed in. In case no value is passed, the default value is -1. | -This parameter is ignored in single-machine multi-card scenario. In multi-machine and multi-card scenarios, if this parameter is not set, the rank_id of the Worker process will be assigned automatically; if it is set, the rank_id will be assigned to the Worker process on each node according to the index. If the number of Worker processes per node is different, it is recommended that this parameter not be configured to automatically assign the rank_id. |
-
--log_dir | -Log output path for the Workers and the Scheduler. | -String | -Folder path. Defaults to the current directory. | -If the path does not exist, msrun creates the folder recursively. Logs are named as follows: the Scheduler log is scheduler.log ; each Worker log is worker_[rank].log , where the rank suffix matches the rank_id assigned to the Worker, although the two may differ in multi-machine multi-card scenarios where node_rank is not set. Running grep -rn "Global rank id" is recommended to view the rank_id of each Worker. |
-
--join | -Whether msrun waits for the Workers and the Scheduler to exit. | -Bool | -True or False. Default: False. | -If set to False, msrun exits immediately after launching the processes, and users should check the logs to confirm that the distributed task is running properly. If set to True, msrun waits for all processes to exit, collects the exception logs, and then exits. |
-
--cluster_time_out | -Cluster networking timeout in seconds. | -Integer | -Default: 600 seconds. | -This parameter is the waiting time for cluster networking. If fewer than worker_num Workers have registered successfully when this time window ends, the task fails to launch. |
-
--bind_core | -Enable processes binding CPU cores. | -Bool | -True or False. Default: False. | -If set to True, msrun will evenly allocate CPU cores and bind them to the spawned distributed processes. | -
--sim_level | -Sets the single-card simulated compilation level. | -Integer | -Default: -1, which disables simulated compilation. | -If this parameter is set, msrun starts only a single process for simulated compilation and does not execute operators. This feature is commonly used to debug parallel strategies for large-scale distributed training and to detect memory and strategy issues in advance, at the compilation stage. If set to 0, only the frontend graph is compiled; if set to 1, the backend graph is also compiled and the process exits at the graph execution phase. |
-
--sim_rank_id | -rank_id of the simulated process. | -Integer | -Default: 0. | -Set rank id of the simulated process. | -
--rank_table_file | -rank_table configuration. Only valid on Ascend platform. | -String | -File path of rank_table configuration. Default: empty string. | -This parameter represents the rank_table configuration file on Ascend platform, describing current distributed cluster. Since the rank_table configuration file reflects distributed cluster information at the physical level, when using this configuration, make sure that the Devices visible to the current process are consistent with the rank_table configuration. The Device visible to the current process can be set via the environment variable ASCEND_RT_VISIBLE_DEVICES . |
-
--worker_log_name | -Specifies the worker log name. | -String | -File name of worker log. Default: worker_[rank].log . |
- This parameter lets users configure the worker log name. The placeholders {ip} and {hostname} can be used to include the IP address and hostname in the worker log name, respectively. By default, the suffix of the worker log name is rank . |
-
--tail_worker_log | -Outputs worker logs to the console. | -String | -One or more integers associated with the worker process rank_id. Default: -1. | -When --join=True , all worker logs of the current node are output to the console by default, and users can specify one or more worker logs to output to the console. The values should be in [0, local_worker_num]. |
-
task_script | -User Python script. | -String | -Valid script path. | -Normally, this parameter is the path of a Python script, and msrun launches the process as python task_script task_script_args by default. msrun also supports setting this parameter to pytest. In that scenario, the task script and task parameters are passed in the parameter task_script_args . |
-
task_script_args | -Parameters for the user Python script. | -- | Parameter list. | -For example, msrun --worker_num=8 --local_worker_num=8 train.py --device_target=Ascend --dataset_path=/path/to/dataset |
-
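The parameters above combine into a single `msrun` command line. The sketch below only builds the argument list from flags named in the table; the helper `build_msrun_command` is hypothetical, it covers just a subset of the flags, and the check that `local_worker_num` does not exceed `worker_num` is an assumption inferred from the rule that the Workers on all nodes must add up to `worker_num`.

```python
import shlex


def build_msrun_command(task_script, task_args=(), *, worker_num=8, local_worker_num=8,
                        master_addr="127.0.0.1", master_port=8118,
                        node_rank=None, log_dir=None, join=False):
    """Build an msrun argument list (hypothetical helper; nothing is executed here)."""
    if worker_num <= 0 or local_worker_num <= 0:
        raise ValueError("worker_num and local_worker_num must be greater than 0")
    if local_worker_num > worker_num:
        # Assumption: a single node cannot host more Workers than the whole cluster has.
        raise ValueError("local_worker_num must not exceed worker_num")
    if not 1024 <= master_port <= 65535:
        raise ValueError("master_port must be in the range 1024 to 65535")

    cmd = [
        "msrun",
        f"--worker_num={worker_num}",
        f"--local_worker_num={local_worker_num}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]
    # Optional flags are added only when the caller sets them.
    if node_rank is not None:
        cmd.append(f"--node_rank={node_rank}")
    if log_dir is not None:
        cmd.append(f"--log_dir={log_dir}")
    if join:
        cmd.append("--join=True")
    # The task script and its own arguments always come last.
    cmd.append(task_script)
    cmd.extend(task_args)
    return cmd


cmd = build_msrun_command(
    "train.py",
    ["--device_target=Ascend", "--dataset_path=/path/to/dataset"],
    worker_num=8, local_worker_num=8, join=True,
)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result directly usable with `subprocess.run(cmd)` and avoids shell quoting issues in script arguments.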
Environment Variables | -Functions | -Values | -
---|---|---|
MS_ROLE | -This process role. | -
- The current version of msrun exports the following two values:
-
|
-
MS_SCHED_HOST | -The IP address of the user-specified Scheduler. | -Same as parameter --master_addr . |
-
MS_SCHED_PORT | -User-specified Scheduler binding port number. | -Same as parameter --master_port . |
-
MS_WORKER_NUM | -The total number of Worker processes specified by the user. | -Same as parameter --worker_num . |
-
MS_TOPO_TIMEOUT | -Cluster networking timeout. | -Same as parameter --cluster_time_out . |
-
RANK_SIZE | -The total number of Worker processes specified by the user. | -Same as parameter --worker_num . |
-
RANK_ID | -The rank_id assigned to the Worker process. | -In a multi-machine multi-card scenario, if the parameter --node_rank is not set, RANK_ID is exported only after the cluster is initialized. To use this environment variable, it is therefore recommended to set the --node_rank parameter correctly. |
-
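A training script can read the variables that msrun exports. The following is a minimal sketch under stated assumptions: the helper `read_msrun_env` and the keys of the dictionary it returns are hypothetical, and `RANK_ID` is treated as optional because, as noted above, it may not be exported before cluster initialization when `--node_rank` is not set.

```python
import os


def _int_or_none(value):
    """Convert an environment string to int, keeping None for unset variables."""
    return int(value) if value is not None else None


def read_msrun_env(environ=os.environ):
    """Collect the msrun-exported variables into a plain dict (hypothetical helper)."""
    return {
        "role": environ.get("MS_ROLE"),
        "sched_host": environ.get("MS_SCHED_HOST"),
        "sched_port": _int_or_none(environ.get("MS_SCHED_PORT")),
        "worker_num": _int_or_none(environ.get("MS_WORKER_NUM")),
        "rank_size": _int_or_none(environ.get("RANK_SIZE")),
        # May be None when --node_rank is not set and the cluster is not yet initialized.
        "rank_id": _int_or_none(environ.get("RANK_ID")),
    }


info = read_msrun_env()
if info["rank_id"] is None:
    print("RANK_ID is not exported yet")
else:
    print(f"rank {info['rank_id']} of {info['rank_size']}")
```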
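Environment Variable | -Function | -Type | -Value | -Description | -
---|---|---|---|---|
MS_ROLE | -Specifies the role of this process. | -String | -
-
|
- The Worker and Parameter Server processes register with the Scheduler process to complete the networking. | -
MS_SCHED_HOST | -Specifies the IP address of the Scheduler. | -String | -Valid IP address. | -The current version also supports IPv6 addresses on the Ascend platform. | -
MS_SCHED_PORT | -Specifies the Scheduler binding port number. | -Integer | -Port number in the range of 1024 to 65535. | -- |
MS_NODE_ID | -Specifies the ID of this process, unique within the cluster. | -String | -Represents the unique ID of this process, which is generated automatically by MindSpore by default. | -
- MS_NODE_ID needs to be set in the following cases. Normally it does not need to be set and is generated automatically by MindSpore:
-
|
-
MS_WORKER_NUM | -Specifies the number of processes with the role MS_WORKER. | -Integer | -Integer greater than 0. | -- The number of Worker processes started by the user should equal the value of this environment variable. If it is smaller, the networking fails. If it is greater, the Scheduler process completes the networking in the order of Worker registration, and the extra Worker processes fail to start. - | -
MS_SERVER_NUM | -Specifies the number of processes with the role MS_PSERVER. | -Integer | -Integer greater than 0. | -Needs to be set only in Parameter Server training mode. | -
MS_WORKER_IP | -Specifies the IP address the current process uses for communication and networking with other processes. | -String | -Valid IP address. | -Setting this environment variable is recommended when networking over IPv6 addresses. However, when MS_SCHED_HOST is set to ::1 (the IPv6 local loopback address), this environment variable does not need to be set, because MindSpore uses the local loopback address for communication by default. | -
MS_ENABLE_RECOVERY | -Enables disaster recovery. | -Integer | -1 for on, 0 for off. The default is 0. | -- |
MS_RECOVERY_PATH | -Folder used for persistence. | -String | -Valid user directory. | -The Worker and Scheduler processes persist necessary data to files during execution, such as the node information used to restore the networking and the intermediate state of the training job. | -
MS_ENABLE_LCCL | -Whether to use the LCCL communication library. | -Integer | -1 for on, 0 for off. The default is 0. | -The LCCL communication library currently supports only single-machine multi-card scenarios and must be run with the graph compilation level set to O0. | -
MS_TOPO_TIMEOUT | -Timeout of the cluster networking phase, in seconds. | -Integer | -The default is 30 minutes. | -All nodes must register with the Scheduler within this time window. Registration attempts after the window fail, and if the number of nodes does not meet the requirement, cluster networking fails. Configuring this environment variable is recommended for large-scale clusters. | -
MS_NODE_TIMEOUT | -Node heartbeat timeout in seconds. | -Integer | -The default is 300 seconds. | -This value represents the heartbeat timeout between the Scheduler and the Worker. If no heartbeat message arrives within this time window, the cluster exits abnormally. | -
MS_RECEIVE_MSG_TIMEOUT | -Node timeout for receiving messages in seconds. | -Integer | -The default is 300 seconds. | -This value represents the timeout for a node to receive messages from its peer. If there is no message response within the time window, an empty message is returned. | -
MS_RETRY_INTERVAL_LOWER | -Lower limit of message retry interval between nodes in seconds. | -Integer | -The default is 3 seconds. | -This value represents the lower limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects a value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval. This variable can be used to control the message concurrency of the Scheduler node. |
-
MS_RETRY_INTERVAL_UPPER | -Upper limit of message retry interval between nodes in seconds. | -Integer | -The default is 5 seconds. | -This value represents the upper limit of the time interval between each retry of sending a message by a node. MindSpore randomly selects a value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval. This variable can be used to control the message concurrency of the Scheduler node. |
-
MS_DISABLE_HEARTBEAT | -Disables the heartbeat feature between nodes in the cluster. | -Integer | -The heartbeat feature is enabled by default. | -If set to 1, the heartbeat between cluster nodes is disabled. In this scenario, the Scheduler will not detect Worker exceptions, and the cluster will not be forced to exit by the Scheduler. This variable can reduce the message concurrency of the Scheduler node. Setting this environment variable is recommended when debugging with the `gdb attach` command. |
-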
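Parameter | -Function | -Type | -Value | -Description | -
---|---|---|---|---|
--worker_num | -Total number of Worker processes participating in the distributed task. | -Integer | -An integer greater than 0. The default value is 8. | -The total number of Workers started on all nodes should equal this parameter: if the total is greater, the extra Worker processes fail to register; if the total is smaller, the cluster waits for a timeout period, then reports that the task failed to launch and exits. The size of the timeout window can be configured with the parameter cluster_time_out . |
-
--local_worker_num | -Number of Worker processes launched on the current node. | -Integer | -An integer greater than 0. The default value is 8. | -When this parameter equals worker_num , all Worker processes run locally, and the node_rank value is ignored. |
-
--master_addr | -Specifies the IP address or hostname of the Scheduler. | -String | -Valid IP address or hostname. The default is the IP address 127.0.0.1. | -msrun automatically detects on which node to launch the Scheduler process, so users do not need to handle this. If the corresponding address cannot be found or the hostname cannot be resolved by DNS, the training task fails to launch. IPv6 addresses are not supported in the current version. If a hostname is passed, msrun automatically resolves it to an IP address, which requires DNS service in the user's environment. |
-
--master_port | -Specifies the Scheduler binding port number. | -Integer | -Port number in the range 1024 to 65535. The default is 8118. | -- |
--node_rank | -Index of the current node. | -Integer | -An integer greater than or equal to 0 can be passed in. If no value is passed, the default value is -1. | -This parameter is ignored in single-machine multi-card scenarios. In multi-machine multi-card scenarios, if this parameter is not set, the rank_id of each Worker process is assigned automatically; if it is set, rank_id values are assigned to the Worker processes on each node according to the index. If the number of Worker processes differs between nodes, it is recommended not to configure this parameter so that rank_id is assigned automatically. |
-
--log_dir | -Log output path for the Workers and the Scheduler. | -String | -Folder path. Defaults to the current directory. | -If the path does not exist, msrun creates the folder recursively. Logs are named as follows: the Scheduler log is scheduler.log ; each Worker log is worker_[rank].log , where the rank suffix matches the rank_id assigned to the Worker, although the two may differ in multi-machine multi-card scenarios where node_rank is not set. Running grep -rn "Global rank id" is recommended to view the rank_id of each Worker. |
-
--join | -Whether msrun waits for the Workers and the Scheduler to exit. | -Bool | -True or False. Default: False. | -If set to False, msrun exits immediately after launching the processes, and users should check the logs to confirm that the distributed task is running properly. If set to True, msrun waits for all processes to exit, collects the exception logs, and then exits. |
-
--cluster_time_out | -Cluster networking timeout in seconds. | -Integer | -Default: 600 seconds. | -This parameter is the waiting time for cluster networking. If fewer than worker_num Workers have registered successfully when this time window ends, the task fails to launch. |
-
--bind_core | -Enables binding processes to CPU cores. | -Bool | -True or False. Default: False. | -If the user sets this parameter, msrun evenly allocates CPU cores and binds them to the launched distributed processes. | -
--sim_level | -Sets the single-card simulated compilation level. | -Integer | -Default: -1, which disables single-card simulated compilation. | -If the user sets this parameter, msrun launches only a single process for simulated compilation and does not execute operators. This feature is commonly used to debug parallel strategies for large-scale distributed training and to find memory and strategy issues in advance, at the compilation stage. If set to 0, only the frontend graph is compiled; if set to 1, the backend graph is also compiled and the process exits at the graph execution phase. |
-
--sim_rank_id | -rank_id used for single-card simulated compilation. | -Integer | -Default: 0. | -Sets the rank_id of the single-card simulated compilation process. | -
--rank_table_file | -rank_table configuration file. Only valid on the Ascend platform. | -String | -File path of the rank_table configuration. Default: empty. | -This parameter is the rank_table configuration file on the Ascend platform, which describes the current distributed cluster. Because the rank_table configuration file reflects distributed cluster information at the physical level, make sure when using it that the Devices visible to the current process are consistent with the rank_table configuration. The Devices visible to the current process can be set via the environment variable ASCEND_RT_VISIBLE_DEVICES . |
-
--worker_log_name | -Sets the worker log name. | -String | -File name of the worker log. Default: worker_[rank].log . |
- This parameter lets users configure the worker log name. The placeholders {ip} and {hostname} can be used to include the IP address and hostname in the worker log name, respectively. By default, the suffix of the worker log name is rank . |
-
--tail_worker_log | -Outputs worker logs to the console. | -String | -One or more integers associated with the worker process rank_id. Default: -1. | -When --join=True , all worker logs of the current node are output by default, and users can specify the worker logs of one or more cards to output to the console. The values should be in [0, local_worker_num]. |
-
task_script | -User Python script. | -String | -Valid script path. | -Normally, this parameter is the path of a Python script, and msrun launches the process as python task_script task_script_args by default. msrun also supports setting this parameter to pytest. In that scenario, the task script and task parameters are passed in the parameter task_script_args . |
-
task_script_args | -Parameters for the user Python script. | -- | Parameter list. | -For example: msrun --worker_num=8 --local_worker_num=8 train.py --device_target=Ascend --dataset_path=/path/to/dataset |
-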
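Environment Variable | -Function | -Value | -
---|---|---|
MS_ROLE | -Role of this process. | -
- The current version of msrun exports the following two values:
-
|
-
MS_SCHED_HOST | -The IP address of the user-specified Scheduler. | -Same as parameter --master_addr . |
-
MS_SCHED_PORT | -User-specified Scheduler binding port number. | -Same as parameter --master_port . |
-
MS_WORKER_NUM | -The total number of Worker processes specified by the user. | -Same as parameter --worker_num . |
-
MS_TOPO_TIMEOUT | -Cluster networking timeout. | -Same as parameter --cluster_time_out . |
-
RANK_SIZE | -The total number of Worker processes specified by the user. | -Same as parameter --worker_num . |
-
RANK_ID | -The rank_id assigned to the Worker process. | -In a multi-machine multi-card scenario, if the parameter --node_rank is not set, RANK_ID is exported only after the cluster is initialized. To use this environment variable, it is therefore recommended to set the --node_rank parameter correctly. |
-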