diff --git a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md b/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md deleted file mode 100644 index 4e29b082cf72a0f761542262fca72fa1be1bb2ae..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md +++ /dev/null @@ -1,439 +0,0 @@

# Dynamic Cluster Startup

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md)

## Overview

To meet reliability requirements during training, MindSpore provides the **dynamic cluster** feature, which lets users start Ascend/GPU/CPU distributed training tasks without relying on any third-party library (OpenMPI) and without modifying the training script. This startup method is recommended.

The MindSpore **Dynamic Cluster** feature replaces the OpenMPI capability by **reusing the Parameter Server mode training architecture**, which is described in the [Parameter Server Mode](https://mindspore.cn/docs/en/master/model_train/parallel/parameter_server_training.html) training tutorial.

The **Dynamic Cluster** feature starts multiple MindSpore training processes as `Workers`, plus one additional `Scheduler` responsible for cluster networking and disaster recovery, so that distributed training can be achieved without OpenMPI's message-passing mechanism. The user only needs to make a few changes to the startup script to perform distributed training.

> Dynamic cluster supports Ascend, GPU and CPU, so a dynamic cluster startup script can be quickly migrated across hardware platforms without additional modification.

The relevant environment variables are:
| Environment Variable | Function | Type | Value | Description |
|---|---|---|---|---|
| `MS_ROLE` | Specifies the role of this process. | String | • `MS_SCHED`: the Scheduler process. A training task starts only one Scheduler, which is responsible for networking, disaster recovery, etc., and does not execute training code.<br>• `MS_WORKER`: a Worker process, which is the role that generally runs the distributed training code.<br>• `MS_PSERVER`: a Parameter Server process. This role is effective only in Parameter Server mode; please refer to Parameter Server Mode. | The Worker and Parameter Server processes register with the Scheduler process to complete the networking. |
| `MS_SCHED_HOST` | Specifies the IP address of the Scheduler. | String | Legal IP address. | IPv6 addresses are only supported on the `Ascend` platform in the current version. |
| `MS_SCHED_PORT` | Specifies the port number the Scheduler binds to. | Integer | Port number in the range of 1024 to 65535. | |
| `MS_NODE_ID` | Specifies the ID of this process, unique within the cluster. | String | The unique ID of this process, automatically generated by MindSpore by default. | MS_NODE_ID needs to be set in the following cases; normally it does not need to be set and is generated automatically:<br>• Disaster recovery scenario: disaster recovery requires obtaining the current process ID so that the process can re-register with the Scheduler.<br>• GLOG log redirection scenario: to ensure that the logs of each training process are saved independently, the process ID is used as the suffix of the log saving path.<br>• Specified rank id scenario: users can specify the rank id of this process by setting MS_NODE_ID to an integer. |
| `MS_WORKER_NUM` | Specifies the number of processes with the role MS_WORKER. | Integer | Integer greater than 0. | The number of Worker processes started by the user should be equal to the value of this environment variable. If it is less than this value, the networking fails; if it is greater, the Scheduler process completes the networking according to the order of Worker registration, and the redundant Worker processes fail to start. |
| `MS_SERVER_NUM` | Specifies the number of processes with the role MS_PSERVER. | Integer | Integer greater than 0. | Only set in Parameter Server training mode. |
| `MS_WORKER_IP` | Specifies the IP address used for communication and networking between processes. | String | Legal IP address. | Setting this environment variable is suggested when using IPv6. However, when MS_SCHED_HOST is set to `::1` (the IPv6 local loopback interface), there is no need to set MS_WORKER_IP, because MindSpore uses the local loopback interface to communicate by default. |
| `MS_ENABLE_RECOVERY` | Turns on disaster recovery. | Integer | 1 for on, 0 for off. The default is 0. | |
| `MS_RECOVERY_PATH` | Persistent path folder. | String | Legal user directory. | The Worker and Scheduler processes perform the necessary persistence during execution, such as node information for restoring the networking and the intermediate state of the training service, which are saved via files. |
| `MS_ENABLE_LCCL` | Whether to use LCCL as the communication library. | Integer | 1 for yes, other values for no. The default is no. | The LCCL communication library currently only supports the single-machine multi-card scenario and must be executed with graph compilation level O0. |
| `MS_TOPO_TIMEOUT` | Cluster networking phase timeout, in seconds. | Integer | The default is 30 minutes. | This value is the time window within which all nodes must register with the Scheduler. If the window is exceeded, registration fails, and if the number of registered nodes does not meet the requirement, cluster networking fails. We suggest configuring this environment variable for large-scale clusters. |
| `MS_NODE_TIMEOUT` | Node heartbeat timeout, in seconds. | Integer | The default is 300 seconds. | This value represents the heartbeat timeout between the Scheduler and the Workers. If there are no heartbeat messages within this time window, the cluster exits abnormally. |
| `MS_RECEIVE_MSG_TIMEOUT` | Node timeout for receiving messages, in seconds. | Integer | The default is 300 seconds. | This value represents the timeout window for a node to receive a message from the other end. If there is no message response within the window, an empty message is returned. |
| `MS_RETRY_INTERVAL_LOWER` | Lower limit of the message retry interval between nodes, in seconds. | Integer | The default is 3 seconds. | This value is the lower limit of the interval between message retries of a node. MindSpore randomly selects a value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval. This variable is used to control the message concurrency of the Scheduler. |
| `MS_RETRY_INTERVAL_UPPER` | Upper limit of the message retry interval between nodes, in seconds. | Integer | The default is 5 seconds. | This value is the upper limit of the interval between message retries of a node. MindSpore randomly selects a value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval. This variable is used to control the message concurrency of the Scheduler. |
| `MS_DISABLE_HEARTBEAT` | Disables the heartbeat feature between nodes in the cluster. | Integer | The heartbeat feature is enabled by default. | If set to 1, the heartbeat between cluster nodes is disabled. In this scenario, the Scheduler does not detect Worker exceptions and does not control the cluster to exit. This variable can reduce the message concurrency of the Scheduler.<br>It is recommended to set this environment variable when debugging with the `gdb attach` command. |
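As a quick illustration of how these variables fit together, the following is a minimal sketch of a local launch with one Scheduler and two Workers (the complete, runnable startup scripts are given in the Operation Practice section below; `net.py` stands for the training script used there):

```bash
# Shared configuration: must be identical for every process in the cluster.
export MS_WORKER_NUM=2          # total number of Worker processes
export MS_SCHED_HOST=127.0.0.1  # Scheduler IP address (local loopback here)
export MS_SCHED_PORT=8118       # Scheduler port

# Start the Scheduler (same script as the Workers, but it does not execute training code).
MS_ROLE=MS_SCHED python ./net.py > scheduler.log 2>&1 &

# Start the Workers.
for i in 0 1; do
    MS_ROLE=MS_WORKER MS_NODE_ID=$i python ./net.py > worker_$i.log 2>&1 &
done
```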
- -> The environment variables `MS_SCHED_HOST`, `MS_SCHED_PORT`, and `MS_WORKER_NUM` need to be consistent in their contents, or else the networking will fail due to the inconsistency in the configurations of the processes. - -## Operation Practice - -Dynamic cluster startup scripts are consistent across hardware platforms. The following is an example of how to write a startup script for Ascend: - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── run_dynamic_cluster.sh - ├── run_dynamic_cluster_1.sh - ├── run_dynamic_cluster_2.sh - ... -``` - -`net.py` is defining the network structure and training process, and `run_dynamic_cluster.sh`, `run_dynamic_cluster_1.sh` and `run_dynamic_cluster_2.sh` are executing scripts. - -### 1. Preparing Python Training Scripts - -Here, as an example of data parallel, a recognition network is trained for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL, NCCL or MCCL communication via `init()`. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. 
Preparing the Startup Script - -#### Single-Machine Multi-Card - -The content of the single-machine multi-card startup script [run_dynamic_cluster.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster.sh) is as follows. Taking the single-machine 8-card as an example: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# Start 8 Worker training processes in a loop -for((i=0;i<8;i++)); -do - export MS_WORKER_NUM=8 # Set the number of Worker processes in the cluster to 8 - export MS_SCHED_HOST=127.0.0.1 # Set the Scheduler IP address to the local loop address - export MS_SCHED_PORT=8118 # Set Scheduler port - export MS_ROLE=MS_WORKER # Set the started process to the MS_WORKER role - export MS_NODE_ID=$i # Set process id, optional - python ./net.py > device/worker_$i.log 2>&1 & # Start training script -done - -# Start 1 Scheduler process -export MS_WORKER_NUM=8 # Set the number of Worker processes in the cluster to 8 -export MS_SCHED_HOST=127.0.0.1 # Set the Scheduler IP address to the local loop address -export MS_SCHED_PORT=8118 # Set Scheduler port -export MS_ROLE=MS_SCHED # Set the started process to the MS_SCHED role -python ./net.py > device/scheduler.log 2>&1 & # Start training script -``` - -> The training scripts for the Scheduler and Worker processes are identical in content and startup method, because the internal processes of the two roles are handled differently in MindSpore. Users simply pull up the process in the normal training manner, without modifying the Python code by role. This is one of the reasons why dynamic cluster startup scripts can be consistent across multiple hardware platforms. - -A single-machine 8-card distributed training can be executed by executing the following command: - -```bash -bash run_dynamic_cluster.sh -``` - -The script will run in the background, the log file will be saved to the device directory and the result will be saved in the worker_*.log and is as follows: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -#### Multi-Machine Multi-Card - -The startup script needs to be split in the multi-machine training scenario. The following is an example of performing 2-machine 8-card training, with each machine executing the startup 4 Worker: - -The script [run_dynamic_cluster_1.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster_1.sh) starts 1 `Scheduler` process and 4 `Worker` processes on node 1: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! 
-f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# Start Worker1 to Worker4, 4 Worker training processes in a loop -for((i=0;i<4;i++)); -do - export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) - export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address - export MS_SCHED_PORT=8118 # Set the Scheduler port - export MS_ROLE=MS_WORKER # Set the startup process to the MS_WORKER role - export MS_NODE_ID=$i # Set process id, optional - python ./net.py > device/worker_$i.log 2>&1 & # Start training script -done - -# Start 1 Scheduler process on node 1 -export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) -export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address -export MS_SCHED_PORT=8118 # Set the Scheduler port -export MS_ROLE=MS_SCHED # Set the startup process to the MS_SCHED role -python ./net.py > device/scheduler.log 2>&1 & # Start training script -``` - -The script [run_dynamic_cluster_2.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster_2.sh) starts `Worker5` to `Worker8` on node 2 (without executing Scheduler): - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# Start Worker5 to Worker8, 4 Worker training processes in a loop -for((i=4;i<8;i++)); -do - export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) - export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address - export MS_SCHED_PORT=8118 # Set the Scheduler port - export MS_ROLE=MS_WORKER # Set the startup process to the MS_WORKER role - export MS_NODE_ID=$i # Set process id, optional - python ./net.py > device/worker_$i.log 2>&1 & # Start training script -done -``` - -> In a multi-machine task, it is necessary to set a different hostname for each host node, otherwise an error is reported that device id out of range. Refer to [FAQ](https://www.mindspore.cn/docs/en/master/faq/distributed_parallel.html#q-When-starting-distributed-framework-using-dynamic-cluster-or-msrun-in-multi-machine-scenario-an-error-is-reported-that-device-id-is-out-of-range-how-can-we-solve-it). -> -> In a multi-machine task, `MS_WORKER_NUM` should be the total number of Worker nodes in the cluster. -> -> To keep the inter-node network connected, use the `telnet ` command to test whether this node is connected to the started Scheduler node. - -Execute on Node 1: - -```bash -bash run_dynamic_cluster_1.sh -``` - -Execute on Node 2: - -```bash -bash run_dynamic_cluster_2.sh -``` - -That is, you can perform 2-machine 8-card distributed training tasks. - -## Disaster Recovery - -Dynamic cluster supports disaster recovery under data parallel. 
In a parallel training scenario with multi-card data, if a process quits abnormally, the training can be continued after pulling up the corresponding script of the corresponding process again, and the accuracy convergence will not be affected. Disaster recovery configuration and samples can be found in the [Disaster Recovery in Dynamic Cluster Scenarios](https://www.mindspore.cn/tutorials/en/master/train_availability/disaster_recover.html) tutorial. - -## Security Authentication - -Dynamic cluster also supports the **Secure Encrypted Channel** feature, which supports the `TLS/SSL` protocol to satisfy users security needs. By default, the secure encrypted channel is turned off. If you need to turn it on, call init() only after configuring the secure encrypted channel correctly via `set_ps_context`, otherwise the initialization of the networking will fail. If you want to use the secure Encrypted channel, please configure it: - -`set_ps_context(config_file_path="/path/to/config_file.json", enable_ssl=True, client_password="123456", server_password="123456")` - -The `config.json` configuration file specified by `config_file_path` needs to add the following fields: - -```json -{ - "server_cert_path": "server.p12", - "crl_path": "", - "client_cert_path": "client.p12", - "ca_cert_path": "ca.crt", - "cipher_list": "ECDHE-R SA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-DSS-AES256-GCM-SHA384:DHE-PSK-AES128-GCM-SHA256:DHE-PSK-AES256-GCM-SHA384:DHE-PSK-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-PSK-CHACHA20-POLY1305:DHE-RSA-AES128-CCM:DHE-RSA-AES256-CCM:DHE-RSA-CHACHA20-POLY1305:DHE-PSK-AES128-CCM:DHE-PSK-AES256-CCM:ECDHE-ECDSA-AES128-CCM:ECDHE-ECDSA-AES256-CCM:ECDHE-ECDSA-CHACHA20-POLY1305", - "cert_expire_warning_time_in_day": 90 -} -``` - -- `server_cert_path`: The path to the p12 file (SSL-specific certificate file) that contains the cipher text of the certificate and the secret key on the server side. -- `crl_path`: The file path to the revocation list (used to distinguish invalid untrusted certificates from valid trusted certificates). -- `client_cert_path`: The client contains the path to the p12 file (SSL-specific certificate file) with the cipher text of the certificate and secret key. -- `ca_cert_path`: The path to root certificate -- `cipher_list`: Cipher suite (list of supported SSL encrypted types) -- `cert_expire_warning_time_in_da`: The warning time of certificate expiration. - -The secret key in the p12 file is stored in cipher text, and the password needs to be passed in when starting. Please refer to the Python API [mindspore.set_ps_context](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_ps_context.html#mindspore.set_ps_context) for the `client_password` and `server_password` fields. 
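For reference, a minimal sketch of enabling the secure encrypted channel before initializing the networking might look as follows (the configuration file path and passwords are placeholders to be replaced with the user's own values):

```python
import mindspore as ms
from mindspore.communication import init

# Configure the secure encrypted channel first; calling init() without it would fail.
ms.set_ps_context(config_file_path="/path/to/config_file.json",
                  enable_ssl=True,
                  client_password="123456",
                  server_password="123456")

# Initialize the communication and cluster networking over TLS/SSL.
init()
```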
diff --git a/docs/mindspore/source_en/model_train/parallel/mpirun.md b/docs/mindspore/source_en/model_train/parallel/mpirun.md deleted file mode 100644 index 2f42aa3965dec84d73b611b0deb6904fff98f818..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/mpirun.md +++ /dev/null @@ -1,219 +0,0 @@ -# mpirun Startup - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/mpirun.md) - -## Overview - -Open Message Passing Interface (OpenMPI) is an open source, high-performance message-passing programming library for parallel computing and distributed memory computing, which realizes parallel computing by passing messages between different processes for many scientific computing and machine learning tasks. Parallel training with OpenMPI is a generalized approach to accelerate the training process by utilizing parallel computing resources on computing clusters or multi-core machines. OpenMPI serves the function of synchronizing data on the Host side as well as inter-process networking in distributed training scenarios. - -Unlike rank table startup, the user does not need to configure the `RANK_TABLE_FILE` environment variable to run the script via OpenMPI `mpirun` command on the Ascend hardware platform. - -> The `mpirun` startup supports Ascend and GPU, in addition to both PyNative mode and Graph mode. - -Related commands: - -1. The `mpirun` startup command is as follows, where `DEVICE_NUM` is the number of GPUs on the machine: - - ```bash - mpirun -n DEVICE_NUM python net.py - ``` - -2. `mpirun` can also be configured with the following parameters. For more configuration, see [mpirun documentation](https://www.open-mpi.org/doc/current/man1/mpirun.1.php): - - - `--output-filename log_output`: Save the log information of all processes to the `log_output` directory, and the logs on different cards will be saved in the corresponding files under the `log_output/1/` path by `rank_id`. - - `--merge-stderr-to-stdout`: Merge stderr to the output message of stdout. - - `--allow-run-as-root`: This parameter is required if the script is executed through the root user. - - `-mca orte_abort_on_non_zero_status 0`: When a child process exits abnormally, OpenMPI will abort all child processes by default. If you don't want to abort child processes automatically, you can add this parameter. - - `-bind-to none`: OpenMPI will specify the number of available CPU cores for the child process to be pulled up by default. If you don't want to limit the number of cores used by the process, you can add this parameter. - -> OpenMPI starts up with a number of environment variables named `OPMI_*`, and users should avoid manually modifying these environment variables in scripts. - -## Operation Practice - -The `mpirun` startup script is consistent across Ascend and GPU hardware platforms. Below is a demonstration of how to write a startup script using Ascend as an example: - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── hostfile - ├── run_mpirun_1.sh - ├── run_mpirun_2.sh - ... -``` - -`net.py` is to define the network structure and training process. 
`run_mpirun_1.sh` and `run_mpirun_2.sh` are the execution scripts, and `hostfile` is the file to configure the multi-machine and multi-card files. - -### 1. Installing OpenMPI - -Download the OpenMPI-4.1.4 source code [openmpi-4.1.4.tar.gz] (https://www.open-mpi.org/software/ompi/v4.1/). Refer to [OpenMPI official website tutorial](https://www.open-mpi.org/faq/?category=building#easy-build) for installation. - -### 2. Preparing Python Training Scripts - -Here, as an example of data parallel, a recognition network is trained for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL or NCCL communication via init. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 3. Preparing the Startup Script - -#### Single-Machine Multi-Card - -First download the [MNIST](http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip) dataset and extract it to the current folder. 
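The download and extraction step can be scripted as well; here is a sketch mirroring the dataset preparation used by the other startup scripts in this sample:

```bash
# Download and extract the MNIST dataset into the current folder.
wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
unzip MNIST_Data.zip
```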
- -Then execute the single-machine multi-card boot script, using the single-machine 8-card example: - -```bash -export DATA_PATH=./MNIST_Data/train/ -mpirun -n 8 --output-filename log_output --merge-stderr-to-stdout python net.py -``` - -The log file will be saved to the `log_output` directory and the result will be saved in the `log_output/1/rank.*/stdout` and is as follows: - -```text -epoch: 0, step: 0, loss is 2.3413472 -epoch: 0, step: 10, loss is 1.6298866 -epoch: 0, step: 20, loss is 1.3729795 -epoch: 0, step: 30, loss is 1.2199347 -epoch: 0, step: 40, loss is 0.85778403 -epoch: 0, step: 50, loss is 1.0849445 -epoch: 0, step: 60, loss is 0.9102987 -epoch: 0, step: 70, loss is 0.7571399 -epoch: 0, step: 80, loss is 0.7989929 -epoch: 0, step: 90, loss is 1.0189024 -epoch: 0, step: 100, loss is 0.6298542 -... -``` - -#### Multi-Machine Multi-Card - -Before running multi-machine multi-card training, you first need to follow the following configuration: - -1. Ensure that the same versions of OpenMPI, NCCL, Python, and MindSpore are available on each node. - -2. To configure host-to-host password-free login, you can refer to the following steps to configure it: - - Identify the same user as the login user for each host (root is not recommended); - - Execute `ssh-keygen -t rsa -P ""` to generate the key; - - Execute `ssh-copy-id DEVICE-IP` to set the IP of the machine that needs password-free login; - - Execute `ssh DEVICE-IP`. If you can log in without entering a password, the above configuration is successful; - - Execute the above command on all machines to ensure two-by-two interoperability. - -After the configuration is successful, you can start the multi-machine task with the `mpirun` command, and there are currently two ways to start a multi-machine training task: - -- By means of `mpirun -H`. The startup script is as follows: - - ```bash - export DATA_PATH=./MNIST_Data/train/ - mpirun -n 16 -H DEVICE1_IP:8,DEVICE2_IP:8 --output-filename log_output --merge-stderr-to-stdout python net.py - ``` - - indicates that 8 processes are started to run the program on the machines with ip DEVICE1_IP and DEVICE2_IP respectively. Execute on one of the nodes: - - ```bash - bash run_mpirun_1.sh - ``` - -- By means of the `mpirun --hostfile` method. For debugging purposes, this method is recommended for executing multi-machine multi-card scripts. First you need to construct the hostfile file as follows: - - ```text - DEVICE1 slots=8 - 192.168.0.1 slots=8 - ``` - - The format of each line is `[hostname] slots=[slotnum]`, and hostname can be either ip or hostname. The above example indicates that there are 8 cards on DEVICE1, and there are also 8 cards on the machine with ip 192.168.0.1. - - The execution script for the 2-machine 16-card is as follows, and you need to pass in the variable `HOSTFILE`, which indicates the path to the hostfile file: - - ```bash - export DATA_PATH=./MNIST_Data/train/ - HOSTFILE=$1 - mpirun -n 16 --hostfile $HOSTFILE --output-filename log_output --merge-stderr-to-stdout python net.py - ``` - - Execute on one of the nodes: - - ```bash - bash run_mpirun_2.sh ./hostfile - ``` - -After execution, the log file is saved to the log_output directory and the result is saved in log_output/1/rank.*/stdout. 
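To follow the training output of a particular rank while the job is running, something like the following can be used (assuming the default `--output-filename log_output` layout described above):

```bash
# Follow the stdout of rank 0 under the default OpenMPI output layout.
tail -f log_output/1/rank.0/stdout
```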
diff --git a/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md b/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md deleted file mode 100644 index 65c6264e9a58006efd291be4081fb17ece79215e..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md +++ /dev/null @@ -1,503 +0,0 @@

# msrun Launching

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md)

## Overview

`msrun` is an encapsulation of the [Dynamic Cluster](https://www.mindspore.cn/docs/en/master/model_train/parallel/dynamic_cluster.html) startup method. It allows users to pull up a multi-process distributed task on each node with a single command, without manually setting the [dynamic networking environment variables](https://www.mindspore.cn/docs/en/master/model_train/parallel/dynamic_cluster.html). `msrun` supports the `Ascend`, `GPU` and `CPU` backends. As with the `Dynamic Cluster` startup, `msrun` has no dependency on third-party libraries or configuration files.

> - `msrun` is available after MindSpore is installed, and the command `msrun --help` can be used to view the supported parameters.
> - `msrun` supports `graph mode` as well as `PyNative mode`.

The command line parameters are listed below:
| Parameters | Functions | Types | Values | Instructions |
|---|---|---|---|---|
| `--worker_num` | The total number of Worker processes participating in the distributed task. | Integer | An integer greater than 0. The default value is 8. | The total number of Workers started on all nodes should be equal to this parameter. If the total number is greater, the extra Worker processes fail to register; if it is less, the cluster waits for the timeout period, then reports that the task failed to start and exits. The size of the timeout window can be configured with the parameter `cluster_time_out`. |
| `--local_worker_num` | The number of Worker processes pulled up on the current node. | Integer | An integer greater than 0. The default value is 8. | When this parameter equals `worker_num`, all Worker processes are executed locally; the `node_rank` value is ignored in this scenario. |
| `--master_addr` | Specifies the IP address or hostname of the Scheduler. | String | Legal IP address or hostname. The default is the IP address 127.0.0.1. | msrun automatically detects on which node to pull up the Scheduler process; users do not need to care. If the corresponding IP address cannot be found or the hostname cannot be resolved by DNS, the training task fails to start. IPv6 addresses are not supported in the current version. If a hostname is passed, msrun automatically resolves it to an IP address, which requires a DNS service in the user's environment. |
| `--master_port` | Specifies the port number the Scheduler binds to. | Integer | Port number in the range 1024 to 65535. The default is 8118. | |
| `--node_rank` | The index of the current node. | Integer | An integer greater than or equal to 0. If no value is passed, the default value is -1. | This parameter is ignored in the single-machine multi-card scenario. In multi-machine multi-card scenarios, if this parameter is not set, the rank_id of each Worker process is assigned automatically; if it is set, rank_ids are assigned to the Worker processes on each node according to this index. If the number of Worker processes differs between nodes, it is recommended not to configure this parameter, so that rank_id is assigned automatically. |
| `--log_dir` | Worker and Scheduler log output path. | String | Folder path. Defaults to the current directory. | If the path does not exist, msrun creates the folder recursively. The logs are named as follows: the Scheduler log is `scheduler.log`; Worker logs are `worker_[rank].log`, where the rank suffix normally matches the rank_id assigned to the Worker but may be inconsistent in multi-machine multi-card scenarios where `node_rank` is not set. It is recommended to run `grep -rn "Global rank id"` to check the rank_id of each Worker. |
| `--join` | Whether msrun waits for the Workers and the Scheduler to exit. | Bool | True or False. Default: False. | If set to False, msrun exits immediately after pulling up the processes; check the logs to confirm that the distributed task is executing properly. If set to True, msrun waits for all processes to exit, collects the exception logs and then exits. |
| `--cluster_time_out` | Cluster networking timeout, in seconds. | Integer | Default: 600 seconds. | This parameter represents the waiting time for cluster networking. If fewer than `worker_num` Workers register successfully within this time window, the task fails to start. |
| `--bind_core` | Bind the spawned processes to CPU cores. | Bool | True or False. Default: False. | If set to True, msrun evenly allocates CPU cores and binds them to the spawned distributed processes. |
| `--sim_level` | Set the single-card simulated compilation level. | Integer | Default: -1, i.e. simulated compilation is disabled. | If this parameter is set, msrun starts only a single process for simulated compilation and does not execute operators. This feature is commonly used to debug parallel strategies for large-scale distributed training and to detect memory and strategy issues in advance. If set to 0, only the frontend graph is compiled; if set to 1, the backend graph is compiled as well and the process exits during the execution phase. |
| `--sim_rank_id` | rank_id of the simulated process. | Integer | Default: 0. | Sets the rank id of the simulated process. |
| `--rank_table_file` | rank_table configuration file. Only valid on the Ascend platform. | String | File path of the rank_table configuration. Default: empty string. | This parameter represents the rank_table configuration file on the Ascend platform, which describes the current distributed cluster. Since the rank_table configuration reflects distributed cluster information at the physical level, make sure that the Devices visible to the current process are consistent with the rank_table configuration when using it. The Devices visible to the current process can be set via the environment variable `ASCEND_RT_VISIBLE_DEVICES`. |
| `--worker_log_name` | Specifies the worker log name. | String | File name of the worker log. Default: `worker_[rank].log`. | This parameter lets users configure the worker log name, and the IP address and hostname can be inserted into the log name via `{ip}` and `{hostname}` respectively. The suffix of the worker log name is the rank by default. |
| `--tail_worker_log` | Output worker logs to the console. | String | One or more integers corresponding to worker process rank_ids. Default: -1. | By default all worker logs of the current node are output to the console; when `--join=True`, this parameter lets users specify one or more worker logs to be output to the console. The values should be within [0, local_worker_num]. |
| `task_script` | The user Python script. | String | Legal script path. | Normally this parameter is the Python script path, and msrun pulls up the process as `python task_script task_script_args` by default. msrun also supports `pytest` as this parameter; in that scenario the task script and task parameters are passed in `task_script_args`. |
| `task_script_args` | Arguments for the user Python script. | | Argument list. | For example, `msrun --worker_num=8 --local_worker_num=8 train.py --device_target=Ascend --dataset_path=/path/to/dataset`. |
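Putting several of these parameters together, a single-machine 8-card launch might look like the following sketch (`train.py` and its arguments are placeholders for the user's own script):

```bash
msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True train.py --device_target=Ascend --dataset_path=/path/to/dataset
```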
## Environment Variables

The following table lists the environment variables that can be used in user scripts; they are set by `msrun`:
| Environment Variables | Functions | Values |
|---|---|---|
| MS_ROLE | The role of this process. | The current version of msrun exports the following two values:<br>• MS_SCHED: represents the Scheduler process.<br>• MS_WORKER: represents a Worker process. |
| MS_SCHED_HOST | The IP address of the user-specified Scheduler. | Same as the parameter `--master_addr`. |
| MS_SCHED_PORT | The user-specified port number the Scheduler binds to. | Same as the parameter `--master_port`. |
| MS_WORKER_NUM | The total number of Worker processes specified by the user. | Same as the parameter `--worker_num`. |
| MS_TOPO_TIMEOUT | Cluster networking timeout. | Same as the parameter `--cluster_time_out`. |
| RANK_SIZE | The total number of Worker processes specified by the user. | Same as the parameter `--worker_num`. |
| RANK_ID | The rank_id assigned to the Worker process. | In a multi-machine multi-card scenario, if the parameter `--node_rank` is not set, RANK_ID is only exported after the cluster is initialized, so to use this environment variable it is recommended to set `--node_rank` correctly. |
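These variables can be read directly inside the user script, for example (a sketch; as noted in the table, `RANK_ID` is only reliably available before cluster initialization when `--node_rank` is set):

```python
import os

# Environment variables exported by msrun for each spawned process.
role = os.getenv("MS_ROLE")                   # MS_SCHED or MS_WORKER
rank_size = int(os.getenv("RANK_SIZE", "1"))  # total number of Worker processes
rank_id = os.getenv("RANK_ID")                # may be unset if --node_rank is not given
print(f"role={role}, rank_size={rank_size}, rank_id={rank_id}")
```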
- -msrun is used as an encapsulation of the Dynamic Cluster startup method, and all user-configurable environment variables can be found in [dynamic networking environment variables](https://www.mindspore.cn/docs/en/master/model_train/parallel/dynamic_cluster.html). - -## Starting Distributed Tasks - -The startup script is consistent across hardware platforms. The following is an example of how to write a startup script for Ascend: - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── msrun_1.sh - ├── msrun_2.sh - ├── msrun_single.sh - ├── net.py - ... -``` - -`net.py` defines the network structure and the training process. `msrun_single.sh` is a single-machine multi-card execution script that starts with `msrun`. `msrun_1.sh` and `msrun_2.sh` are multi-machine, multi-card execution scripts started with `msrun` and executed on separate nodes. - -### 1. Preparing Python Training Scripts - -Here is an example of data parallelism to train a recognition network for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize the HCCL, NCCL or MCCL communication domain with `init()`. If `device_target` is not set here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - 
print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. Preparing the Startup Script - -> For msrun, single-machine multi-card and multi-machine multi-card execution commands are similar, single-machine multi-card only needs to keep the parameters `worker_num` and `local_worker_num` the same, and single-machine multi-card scenarios do not need to set the `master_addr`, which defaults to `127.0.0.1`. - -#### Single-machine Multi-card - -The following is an example of performing a single-machine 8-card training session: - -The script [msrun_single.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_single.sh) uses the msrun command to pull up 1 `Scheduler` process as well as 8 `Worker` processes on the current node (no need to set `master_addr`, defaults to `127.0.0.1`; no need to set `node_rank` for single-machine): - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -Execute the command: - -```bash -bash msrun_single.sh -``` - -The single-machine 8-card distributed training task can be executed. The log file is saved to the `. /msrun_log` directory and the results are saved in `. /msrun_log/worker_*.log`. The Loss results are as follows: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -#### Multi-machine Multi-card - -The following is an example of executing 2-machine, 8-card training, with each machine executing the startup of 4 Workers: - -The script [msrun_1.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_1.sh) is executed on node 1 and uses the msrun command to pull up 1 `Scheduler` process and 4 `Worker` processes, configures `master_addr` as the IP address of node 1 (msrun automatically detects that the current node ip matches the `master_addr` and pulls up the `Scheduler` process). Set the current node to node 0 with `node_rank`: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=4 --master_addr= --master_port=8118 --node_rank=0 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -The script [msrun_2.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_2.sh) is executed on node 2 and uses the msrun command to pull up 4 `Worker` processes, configures `master_addr` as the IP address of node 1. Set the current node to node 0 with `node_rank`: - -```bash -EXEC_PATH=$(pwd) -if [ ! 
-d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=4 --master_addr= --master_port=8118 --node_rank=1 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -> The difference between the instructions for node 2 and node 1 is that `node_rank` is different. - -Executed at node 1: - -```bash -bash msrun_1.sh -``` - -Executed at node 2: - -```bash -bash msrun_2.sh -``` - -The 2-machine, 8-card distributed training task can be executed, and the log files are saved to the `. /msrun_log` directory and the results are saved in `. /msrun_log/worker_*.log`. The Loss results are as follows: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -## Multi-Card Parallel Debugging - -Multi-card parallel debugging can be done in distributed environments using Python's built-in debugger (pdb), by breaking and synchronizing operations on all or a particular rank. After pulling up a worker process with the `msrun` parameter set to `--join=True`, the standard input of all worker processes is inherited from the `msrun` master process, and the standard output is output to a shell window via the `msrun` log redirection feature. Details of how to use pdb in a distributed environment are given below: - -### 1. Starting the pdb Debugger - -Users can start the pdb debugger in various ways, such as inserting `import pdb; pdb.set_trace()` or `breakpoint()` in the python training script to perform breakpoint operations. - -#### Python Training Script - -```python -import pdb -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -pdb.set_trace() -ms.set_seed(1) -``` - -#### Startup Script - -In the startup script, the `msrun` parameter needs to be set to `--join=True` to ensure that pdb commands are passed through stdin and debugging is displayed through stdout. - -```bash -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -### 2. Debugging Particular Ranks - -In a distributed environment, users may need to debug for a particular rank, which can be accomplished by placing breakpoints in the training script for that particular rank. 
For example, in the standalone eight-card task, debugging is done only for rank 7: - -```python -import pdb -import mindspore as ms -from mindspore.communication import init, get_rank - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -if get_rank() == 7: - pdb.set_trace() -ms.set_seed(1) -``` - -> The `mindspore.communication.get_rank()` interface needs to be called after the `mindspore.communication.init()` interface has completed its distributed initialization to get the rank information properly, otherwise `get_rank()` will return 0 by default. - -After a breakpoint operation on a rank, it will cause the execution of that rank process to stop at the breakpoint and wait for subsequent interactions, while other rank processes will continue to run, which may lead to different processes to be faster or slower. So users can use the `mindspore.communication.comm_func.barrier()` operator and the `mindspore.communication.api._pynative_executor.sync()` to synchronize the running of all ranks, ensuring that other ranks block and wait, and that the stops of other ranks are released once the debugging rank continues to run. For example, in a standalone eight-card task, debugging breakpoints only for rank 7 and blocking all other ranks: - -```python -import pdb -import mindspore as ms -from mindspore.communication import init, get_rank -from mindspore.communication.comm_func import barrier -from mindspore.common.api import _pynative_executor - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -if get_rank() == 7: - pdb.set_trace() -barrier() -_pynative_executor.sync() -ms.set_seed(1) -``` - -### 3. Stdin and Stdout to Shell Terminial - -`msrun` supports outputting specific worker logs to the shell's stdout via `--tail_worker_log`. To make the standard output more observable, it is recommended to use this parameter to specify the rank to be debugged. For example, in the standalone eight-card task, only breakpoints for rank 7 are debugged: - -```bash -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 --tail_worker_log=7 net.py -``` - -> - `msrun`'s default behavior without the `--tail_worker_log` parameter will output the logs of all workers on this node to the shell's stdout. -> - When debugging multiple ranks at the same time, a single pdb command will be passed to a single rank in turn via stdin. - -### 4. Common pdb Debugging Commands - -- `n` (next): Execute the current line of code and jump to the next one. -- `s` (step): Enter the function called in the current line of code, step by step. -- `c` (continue): Continue executing the program until the next breakpoint. -- `q` (quit): Exit the debugger and terminate program execution. -- `p` (print): Print the value of a variable. For example, `p variable` displays the current value of the variable `variable`. -- `l` (list): Display the context of the current code. -- `b` (break): Set a breakpoint, either by line number or function name. -- `h` (help): Display help information, list all available commands. 
diff --git a/docs/mindspore/source_en/model_train/parallel/rank_table.md b/docs/mindspore/source_en/model_train/parallel/rank_table.md deleted file mode 100644 index 2fe9ab3a2fbe47ad31a4b6d9ccdac8f760fc15e4..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/rank_table.md +++ /dev/null @@ -1,456 +0,0 @@ -# rank table Startup - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/rank_table.md) - -## Overview - -`rank table` startup is a startup method unique to the Ascend hardware platform. This method does not rely on third-party libraries and runs as a single process on a single card, requiring the user to create a process in the script that matches the number of cards in use. This method is consistent across nodes in multiple machines and facilitates rapid batch deployment. - -Related Configurations: - -`rank table` mainly need to configure the rank_table file, taking the 2-card environment configuration file `rank_table_2pcs.json` as an example: - -```json -{ - "version": "1.0", - "server_count": "1", - "server_list": [ - { - "server_id": "10.*.*.*", - "device": [ - {"device_id": "0","device_ip": "192.1.*.6","rank_id": "0"}, - {"device_id": "1","device_ip": "192.2.*.6","rank_id": "1"}], - "host_nic_ip": "reserve" - } - ], - "status": "completed" -} -``` - -The parameter items that need to be modified according to the actual training environment are: - -- `server_count` represents the number of machines involved in training. -- `server_id` represents the IP address of the current machine. -- `device_id` represents the physical serial number of the card, i.e., the actual serial number in the machine where the card is located. -- `device_ip` represents the IP address of the integrated NIC. You can execute the command `cat /etc/hccn.conf` on the current machine, and the key value of `address_x` is the IP address of the NIC. -- `rank_id` represents the card logical serial number, fixed numbering from 0. - -## Operation Practice - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── rank_table_8pcs.json - ├── rank_table_16pcs.json - ├── rank_table_cross_cluster_16pcs.json - ├── run_rank_table.sh - ├── run_rank_table_cluster.sh - ├── run_rank_table_cross_cluster.sh - ... -``` - -`net.py` defines the network structure and the training process, `run_rank_table.sh`, `run_rank_table_cluster.sh` and `run_rank_table_cross_cluster.sh` are executing the scripts. `rank_table_8pcs.json`, `rank_table_16pcs.json` and `rank_table_cross_cluster_16pcs.json` are 8 cards, 16 cards and cross cluster 16 cards rank_table config file. - -### 1. Preparing Python Training Scripts - -Here, as an example of data parallel, a recognition network is trained for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL or NCCL communication via init. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. 
- -```python -import os -import mindspore as ms -from mindspore.communication import init - -device_id = int(os.getenv('DEVICE_ID')) -ms.set_device(device_id=device_id) -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. Preparing the Startup Script - -#### Single-Machine Multi-Card - -The `rank table` method uses a single-card single-process operation, i.e., 1 process runs on each card, with the same number of processes as the number of cards in use. Each process creates a directory to store log information and operator compilation information. Below is an example of how to run a distributed training script using 8 cards: - -```bash -RANK_SIZE=8 -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json -export RANK_SIZE=$RANK_SIZE - -for((i=1;i<${RANK_SIZE};i++)) -do - rm -rf device$i - mkdir device$i - cp ./net.py ./device$i - cd ./device$i - export DEVICE_ID=$i - export RANK_ID=$i - echo "start training for device $i" - env > env$i.log - python ./net.py > train$i.log 2>&1 & - cd ../ -done -``` - -Distributed-related environment variables are: - -- `RANK_TABLE_FILE`: Path to the networking information file. -- `DEVICE_ID`: The actual serial number of the current card on the machine. -- `RANK_ID`: The logical serial number of the current card. 
- -After configuring `rank_table_8pcs.json` in the current path, execute the following command: - -```bash -bash run_rank_table.sh -``` - -After running, the log files are saved in `device0`, `device1` and other directories, `env*.log` records information about environment variables, and the output is saved in `train*.log`, as shown in the example below: - -```text -epoch: 0, step: 0, loss is 2.3391366 -epoch: 0, step: 10, loss is 1.8047495 -epoch: 0, step: 20, loss is 1.2186875 -epoch: 0, step: 30, loss is 1.3065228 -epoch: 0, step: 40, loss is 1.0825632 -epoch: 0, step: 50, loss is 1.0281029 -epoch: 0, step: 60, loss is 0.8405618 -epoch: 0, step: 70, loss is 0.7346531 -epoch: 0, step: 80, loss is 0.688364 -epoch: 0, step: 90, loss is 0.51331174 -epoch: 0, step: 100, loss is 0.53782797 -... -``` - -#### Multi-Machine Multi-Card - -In the Ascend environment, the communication of NPU units across machines is the same as the communication of individual NPU units within a single machine, still through the HCCL. The difference is that the NPU units within a single machine are naturally interoperable, while the cross-machine ones need to ensure that the networks of the two machines are interoperable. The method of confirmation is as follows: - -Executing the following command on server 1 will configure each device with the `device ip` of the corresponding device on server 2. For example, configure the destination IP of card 0 on server 1 as the ip of card 0 on server 2. Configuration commands require the `hccn_tool` tool. The [`hccn_tool`](https://support.huawei.com/enterprise/zh/ascend-computing/a300t-9000-pid-250702906?category=developer-documents) is an HCCL tool that comes with the CANN package. - -```bash -hccn_tool -i 0 -netdetect -s address 192.*.92.131 -hccn_tool -i 1 -netdetect -s address 192.*.93.131 -hccn_tool -i 2 -netdetect -s address 192.*.94.131 -hccn_tool -i 3 -netdetect -s address 192.*.95.131 -hccn_tool -i 4 -netdetect -s address 192.*.92.141 -hccn_tool -i 5 -netdetect -s address 192.*.93.141 -hccn_tool -i 6 -netdetect -s address 192.*.94.141 -hccn_tool -i 7 -netdetect -s address 192.*.95.141 -``` - -`-i 0` specifies the device ID. `-netdetect` specifies the network detection object IP attribute. `-s address` indicates that the attribute is set to an IP address. `192.*.92.131` indicates the ip address of device 0 on server 2. The interface command can be referenced [here](https://support.huawei.com/enterprise/zh/doc/EDOC1100251947/8eff627f). - -After executing the above command on server 1, start checking the network link status with the following command. Another function of `hccn_tool` is used here, the meaning of which can be found [here](https://support.huawei.com/enterprise/zh/doc/EDOC1100251947/7d059b59). - -```bash -hccn_tool -i 0 -net_health -g -hccn_tool -i 1 -net_health -g -hccn_tool -i 2 -net_health -g -hccn_tool -i 3 -net_health -g -hccn_tool -i 4 -net_health -g -hccn_tool -i 5 -net_health -g -hccn_tool -i 6 -net_health -g -hccn_tool -i 7 -net_health -g -``` - -If the connection is successful, the corresponding output is as follows: - -```bash -net health status: Success -``` - -If the connection fails, the corresponding output is as follows: - -```bash -net health status: Fault -``` - -After confirming that the network of the NPU units between the machines is smooth, configure the json configuration file of the multi-machine, this document takes the configuration file of the 16 cards as an example. 
For a detailed description of the configuration file fields, refer to the single-machine multi-card section of this document. Note that in the multi-machine json configuration, the order of `rank_id` must be consistent with the dictionary order of `server_id`.
-
-```json
-{
-    "version": "1.0",
-    "server_count": "2",
-    "server_list": [
-        {
-            "server_id": "10.*.*.*",
-            "device": [
-                {"device_id": "0","device_ip": "192.1.*.6","rank_id": "0"},
-                {"device_id": "1","device_ip": "192.2.*.6","rank_id": "1"},
-                {"device_id": "2","device_ip": "192.3.*.6","rank_id": "2"},
-                {"device_id": "3","device_ip": "192.4.*.6","rank_id": "3"},
-                {"device_id": "4","device_ip": "192.1.*.7","rank_id": "4"},
-                {"device_id": "5","device_ip": "192.2.*.7","rank_id": "5"},
-                {"device_id": "6","device_ip": "192.3.*.7","rank_id": "6"},
-                {"device_id": "7","device_ip": "192.4.*.7","rank_id": "7"}],
-            "host_nic_ip": "reserve"
-        },
-        {
-            "server_id": "10.*.*.*",
-            "device": [
-                {"device_id": "0","device_ip": "192.1.*.8","rank_id": "8"},
-                {"device_id": "1","device_ip": "192.2.*.8","rank_id": "9"},
-                {"device_id": "2","device_ip": "192.3.*.8","rank_id": "10"},
-                {"device_id": "3","device_ip": "192.4.*.8","rank_id": "11"},
-                {"device_id": "4","device_ip": "192.1.*.9","rank_id": "12"},
-                {"device_id": "5","device_ip": "192.2.*.9","rank_id": "13"},
-                {"device_id": "6","device_ip": "192.3.*.9","rank_id": "14"},
-                {"device_id": "7","device_ip": "192.4.*.9","rank_id": "15"}],
-            "host_nic_ip": "reserve"
-        }
-    ],
-    "status": "completed"
-}
-```
-
-After preparing the configuration file, you can organize the distributed multi-machine training scripts. For the 2-machine 16-card case, the script on each machine is similar to the single-machine multi-card script; the difference is that each machine is given a different starting rank_id.
-
-```bash
-RANK_SIZE=16
-EXEC_PATH=$(pwd)
-if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then
-    if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then
-        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
-    fi
-    unzip MNIST_Data.zip
-fi
-export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/
-
-export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_16pcs.json
-export RANK_SIZE=$RANK_SIZE
-
-# The starting rank_id of this machine (0 or 8) is passed as the first argument.
-RANK_START=$1
-DEVICE_START=0
-
-for((i=0;i<=7;i++));
-do
-    export RANK_ID=$[i+RANK_START]
-    export DEVICE_ID=$[i+DEVICE_START]
-    rm -rf ./device_$RANK_ID
-    mkdir ./device_$RANK_ID
-    cp ./net.py ./device_$RANK_ID
-    cd ./device_$RANK_ID
-    env > env$i.log
-    python ./net.py >train$RANK_ID.log 2>&1 &
-    cd ../
-done
-```
-
-During execution, the following commands are executed on the two machines, where `rank_table_16pcs.json` is configured according to the 16-card json example shown in this section.
-
-```bash
-# server0
-bash run_rank_table_cluster.sh 0
-# server1
-bash run_rank_table_cluster.sh 8
-```
-
-After running, the log files are saved in the directories `device_0`, `device_1`, and so on. The information about the environment variables is recorded in `env*.log`, and the output is saved in `train*.log`.
-
-#### Cross Cluster
-
-For today's large-scale models, using compute clusters for training has become the norm. However, as model sizes continue to grow, the resources of a single cluster can no longer meet the memory requirements for model training. Therefore, support for cross-cluster communication has become a prerequisite for training ultra-large-scale models.
Currently, the HCCL communication library of Ascend hardware does not support cross-cluster communication. To address this issue, MindSpore provides a cross-cluster communication library that enables efficient communication between NPUs in different clusters. With this library, users can overcome the memory limitations of a single cluster and achieve cross-cluster parallel training for ultra-large-scale models. - -Currently, the MindSpore framework enables this feature simply by adding the `cluster_list` configuration item for cross-cluster communication in the multi-node, multi-card JSON configuration file. This document uses a 2-node, 16-card setup (assuming the two machines are not in the same cluster) as an example to illustrate how to write the relevant configuration items for cross-cluster scenarios. For detailed information about the configuration file, please refer to the single-node, multi-card section in this document. - -```json -{ - "version": "1.0", - "server_count": "2", - "server_list": [ - { - "server_id": "server_0_10.*.*.*", - "server_ip": "10.*.*.*", - "device": [ - {"device_id": "0", "device_ip": "192.1.*.6", "rank_id": "0", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "1", "device_ip": "192.2.*.6", "rank_id": "1", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "2", "device_ip": "192.3.*.6", "rank_id": "2", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "3", "device_ip": "192.4.*.6", "rank_id": "3", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "4", "device_ip": "192.1.*.7", "rank_id": "4", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "5", "device_ip": "192.2.*.7", "rank_id": "5", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "6", "device_ip": "192.3.*.7", "rank_id": "6", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "7", "device_ip": "192.4.*.7", "rank_id": "7", "dpu_ip": "8.2.17.60", "numa_id": ""}], - "host_nic_ip": "reserve", - "pod_ip": "127.0.0.1" - }, - { - "server_id": "server_1_10.*.*.*", - "server_ip": "10.*.*.*", - "device": [ - {"device_id": "0", "device_ip": "192.1.*.8", "rank_id": "8", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "1", "device_ip": "192.2.*.8", "rank_id": "9", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "2", "device_ip": "192.3.*.8", "rank_id": "10", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "3", "device_ip": "192.4.*.8", "rank_id": "11", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "4", "device_ip": "192.1.*.9", "rank_id": "12", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "5", "device_ip": "192.2.*.9", "rank_id": "13", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "6", "device_ip": "192.3.*.9", "rank_id": "14", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "7", "device_ip": "192.4.*.9", "rank_id": "15", "dpu_ip": "8.2.17.60", "numa_id": ""}], - "host_nic_ip": "reserve", - "pod_ip": "127.0.0.1" - } - ], - "cluster_list": [ - { - "cluster_id": "cluster_0", - "network_type": "ROCE", - "az_id": "az_0", - "region_id": "region_0", - "server_list": [ - { - "server_id": "server_0_10.*.*.*" - } - ] - }, - { - "cluster_id": "cluster_1", - "network_type": "ROCE", - "az_id": "az_1", - "region_id": "region_1", - "server_list": [ - { - "server_id": "server_1_10.*.*.*" - } - ] - } - ], - "status": "completed" -} -``` - -For cross-cluster scenarios, the parameters that need to be added or modified based on the actual training environment are as follows: - -- `server_id` represents the globally unique identifier of the 
current machine.
-- `server_ip` represents the IP address of the current machine.
-- `dpu_ip` represents the virtual IP address of the card within the tenant VPC, used for cross-cluster communication.
-- `numa_id` represents the NUMA-affined CPU core ID of the card on the current machine.
-- `cluster_id` represents the globally unique identifier of the cluster.
-- `network_type` represents the type of network between machines within the cluster, currently set to "ROCE".
-- `az_id` represents the AZ (Availability Zone) ID where the cluster is located.
-- `region_id` represents the ID of the region where the cluster is located.
-- `server_list` represents the list of machines included in the current cluster.
-
-Once the configuration file is prepared, the cross-cluster distributed training script is the same as the multi-machine multi-card script described in this document. Using a 2-cluster 16-card setup as an example, the scripts on the two machines in the two clusters are identical; the only difference is the starting rank_id passed to each machine.
-
-```bash
-RANK_SIZE=16
-EXEC_PATH=$(pwd)
-if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then
-    if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then
-        wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip
-    fi
-    unzip MNIST_Data.zip
-fi
-export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/
-
-export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_cross_cluster_16pcs.json
-export RANK_SIZE=$RANK_SIZE
-
-# The starting rank_id of this machine (0 or 8) is passed as the first argument.
-RANK_START=$1
-DEVICE_START=0
-
-for((i=0;i<=7;i++));
-do
-    export RANK_ID=$[i+RANK_START]
-    export DEVICE_ID=$[i+DEVICE_START]
-    rm -rf ./device_$RANK_ID
-    mkdir ./device_$RANK_ID
-    cp ./net.py ./device_$RANK_ID
-    cd ./device_$RANK_ID
-    env > env$i.log
-    python ./net.py >train$RANK_ID.log 2>&1 &
-    cd ../
-done
-```
-
-During execution, the two machines in the two clusters run the following commands, respectively. The `rank_table_cross_cluster_16pcs.json` file is configured according to the 2-cluster 16-card cross-cluster json example shown in this section, and the configuration used on every machine in both clusters must remain consistent.
-
-```bash
-# server0
-bash run_rank_table_cross_cluster.sh 0
-# server1
-bash run_rank_table_cross_cluster.sh 8
-```
-
-After execution, log files are saved in the `device_0`, `device_1`, and other corresponding directories on each machine in the clusters. The `env*.log` files record information about the environment variables, while the output results are stored in the `train*.log` files.
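-
-Before launching, the rank_table file used on each machine can be sanity-checked. The sketch below (a hypothetical `check_rank_table.py`, using only the Python standard library) verifies the basic rules described in this document for any of the `rank_table_8pcs.json`, `rank_table_16pcs.json` or `rank_table_cross_cluster_16pcs.json` files: `server_count` must match the number of entries in `server_list`, and `rank_id` must increase consecutively from 0 in the order in which servers and devices are listed:
-
-```python
-import json
-import sys
-
-def check_rank_table(path):
-    with open(path) as f:
-        table = json.load(f)
-    assert table.get("status") == "completed", "status must be 'completed'"
-    servers = table["server_list"]
-    assert int(table["server_count"]) == len(servers), \
-        "server_count does not match the length of server_list"
-    # Collect rank_id values in the order servers and devices are listed.
-    rank_ids = [int(dev["rank_id"]) for server in servers for dev in server["device"]]
-    assert rank_ids == list(range(len(rank_ids))), \
-        "rank_id must increase consecutively from 0 across server_list"
-    print("%s: %d ranks on %d server(s), basic checks passed" % (path, len(rank_ids), len(servers)))
-
-if __name__ == "__main__":
-    check_rank_table(sys.argv[1])
-```
-
-Running the same check on every machine (for example, `python check_rank_table.py rank_table_cross_cluster_16pcs.json`) catches obvious configuration mistakes before the training job is started.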
diff --git a/docs/mindspore/source_en/model_train/parallel/startup_method.rst b/docs/mindspore/source_en/model_train/parallel/startup_method.rst
deleted file mode 100644
index a4016765adb4b032f49b60b6e898a9678992c9d5..0000000000000000000000000000000000000000
--- a/docs/mindspore/source_en/model_train/parallel/startup_method.rst
+++ /dev/null
@@ -1,42 +0,0 @@
-Distributed Parallel Startup Methods
-====================================
-
-.. image:: https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg
-   :target: https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/startup_method.rst
-   :alt: View Source on Gitee
-
-.. toctree::
-   :maxdepth: 1
-   :hidden:
-
-   msrun_launcher
-   dynamic_cluster
-   mpirun
-   rank_table
-
-Startup Method
----------------
-
-Currently, GPU, Ascend and CPU each support multiple startup methods. The four methods are \ ``msrun``\ , dynamic cluster, \ ``mpirun``\  and \ ``rank table``\ :
-
-- `msrun `_: `msrun` is an encapsulation of dynamic cluster. It allows users to launch distributed jobs with a single command on each node and can be used as soon as MindSpore is installed. This method does not rely on third-party libraries or configuration files, provides disaster recovery, offers good security, and supports all three hardware platforms. It is recommended that users prefer this startup method.
-- `Dynamic cluster `_: dynamic cluster requires users to spawn multiple processes and export environment variables themselves; it is the underlying implementation of `msrun`. Use this method for the `Parameter Server` training mode; for other distributed jobs, `msrun` is recommended.
-- `mpirun `_: this method relies on the open-source library OpenMPI and has a simple startup command. For multi-machine scenarios, password-free login must be ensured between every pair of machines. It is recommended for users who already have experience with OpenMPI.
-- `rank table `_: this method requires the Ascend hardware platform and does not rely on third-party libraries. After manually configuring the rank_table file, the parallel program can be started via a script; the script is consistent across machines, which facilitates batch deployment.
-
-.. warning::
-    The `rank_table` startup method will be deprecated in MindSpore 2.4.
-
-The hardware support for the four startup methods is shown in the table below:
-
-+-------------------------+--------------+-----------------+-------------+
-|                         | GPU          | Ascend          | CPU         |
-+=========================+==============+=================+=============+
-| \ ``msrun``\            | Support      | Support         | Support     |
-+-------------------------+--------------+-----------------+-------------+
-| Dynamic cluster         | Support      | Support         | Support     |
-+-------------------------+--------------+-----------------+-------------+
-| \ ``mpirun``\           | Support      | Support         | Not support |
-+-------------------------+--------------+-----------------+-------------+
-| \ ``rank table``\       | Not support  | Support         | Not support |
-+-------------------------+--------------+-----------------+-------------+
\ No newline at end of file
diff --git a/tutorials/source_zh_cn/index.rst b/tutorials/source_zh_cn/index.rst
index deb1815ab99aa3de047efbcff6b7eeb47dd1dc3b..3cfefd1e18b5e1ca45ba465b0fac480d7efdb9b6 100644
--- a/tutorials/source_zh_cn/index.rst
+++ b/tutorials/source_zh_cn/index.rst
@@ -59,7 +59,8 @@ MindSpore教程
     parallel/distributed_case
     parallel/optimize_technique
-
+    parallel/startup_methods
+
 ..
toctree:: :glob: :maxdepth: 1 diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md b/tutorials/source_zh_cn/parallel/dynamic_cluster.md similarity index 100% rename from docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md rename to tutorials/source_zh_cn/parallel/dynamic_cluster.md diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/mpirun.md b/tutorials/source_zh_cn/parallel/mpirun.md similarity index 100% rename from docs/mindspore/source_zh_cn/model_train/parallel/mpirun.md rename to tutorials/source_zh_cn/parallel/mpirun.md diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md b/tutorials/source_zh_cn/parallel/msrun_launcher.md similarity index 100% rename from docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md rename to tutorials/source_zh_cn/parallel/msrun_launcher.md diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/rank_table.md b/tutorials/source_zh_cn/parallel/rank_table.md similarity index 100% rename from docs/mindspore/source_zh_cn/model_train/parallel/rank_table.md rename to tutorials/source_zh_cn/parallel/rank_table.md diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst b/tutorials/source_zh_cn/parallel/startup_method.rst similarity index 48% rename from docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst rename to tutorials/source_zh_cn/parallel/startup_method.rst index de40fd42a5f5aa969f564f0ea4e9689aaac11088..4ea9586694be8eb9a02f262c799463cf768acdb2 100644 --- a/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst +++ b/tutorials/source_zh_cn/parallel/startup_method.rst @@ -2,7 +2,7 @@ ============================ .. image:: https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg - :target: https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst + :target: https://gitee.com/mindspore/docs/blob/master/tutorials/source_zh_cn/parallel/startup_method.rst :alt: 查看源文件 .. toctree:: @@ -19,13 +19,11 @@ 目前GPU、Ascend和CPU分别支持多种启动方式。主要有\ ``msrun``\、动态组网、\ ``mpirun``\和\ ``rank table``\四种方式: -- `msrun `_: `msrun` 是动态组网的封装,允许用户使用单命令行指令在各节点拉起分布式任务,安装MindSpore后即可使用。此方式不依赖第三方库以及配置文件,具有容灾恢复功能,安全性较好,支持三种硬件平台。建议用户优先使用此种启动方式。 -- `动态组网 `_:动态组网需要用户手动拉起多进程以及导出环境变量,是 `msrun` 的具体实现,Parameter Server训练模式建议使用此方式,其余分布式场景建议使用 `msrun` 。 -- `mpirun `_:此方式依赖开源库OpenMPI,启动命令简单,多机需要保证两两之间免密登录,推荐有OpenMPI使用经验的用户使用此种启动方式。 -- `rank table `_:此方式需要在Ascend硬件平台使用,不依赖第三方库。手动配置rank_table文件后,就可以通过脚本启动并行程序,多机脚本一致,方便批量部署。 +- `msrun `_: `msrun` 是动态组网的封装,允许用户使用单命令行指令在各节点拉起分布式任务,安装MindSpore后即可使用。此方式不依赖第三方库以及配置文件,具有容灾恢复功能,安全性较好,支持三种硬件平台。建议用户优先使用此种启动方式。 +- `动态组网 `_:动态组网需要用户手动拉起多进程以及导出环境变量,是 `msrun` 的具体实现,Parameter Server训练模式建议使用此方式,其余分布式场景建议使用 `msrun` 。 +- `mpirun `_:此方式依赖开源库OpenMPI,启动命令简单,多机需要保证两两之间免密登录,推荐有OpenMPI使用经验的用户使用此种启动方式。 +- `rank table `_:此方式需要在Ascend硬件平台使用,不依赖第三方库。手动配置rank_table文件后,就可以通过脚本启动并行程序,多机脚本一致,方便批量部署。 -.. warning:: - `rank_table` 启动方式将在MindSpore 2.4版本废弃。 四种启动方式的硬件支持情况如下表: