diff --git a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md b/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md deleted file mode 100644 index 1d6373908e17b971c7d43d975d21cca5974e6d5a..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md +++ /dev/null @@ -1,439 +0,0 @@ -# Dynamic Cluster Startup - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/dynamic_cluster.md) - -## Overview - -For reliability requirements during training, MindSpore provides **dynamic cluster** features that enable users to start Ascend/GPU/CPU distributed training tasks without relying on any third-party library (OpenMPI) and without any modification to the training script. We recommend users to use this startup method in preference. - -The MindSpore **Dynamic Cluster** feature replaces the OpenMPI capability by **reusing the Parameter Server mode training architecture**, which can be found in the [Parameter Server Mode](https://mindspore.cn/docs/en/master/model_train/parallel/parameter_server_training.html) training tutorial. - -The **Dynamic Cluster** feature starts multiple MindSpore training processes as `Workers`, and starts an additional `Scheduler` for cluster and disaster recovery, thus, distributed training can be achieved without the need for OpenMPI's message passing mechanism. The user only needs to make a few changes to the startup script to perform distributed training. - -> Dynamic cluster supports Ascend, GPU and CPU, so the dynamic cluster startup script can be quickly migrated between multiple hardware platforms without additional modifications. - -The relevant environment variables: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Environment Variables | Function | Type | Value | Description |
| --- | --- | --- | --- | --- |
| MS_ROLE | Specifies the role of this process. | String | MS_SCHED: the Scheduler process. A training task starts only one Scheduler, which is responsible for cluster networking, disaster recovery, etc., and does not execute training code.<br>MS_WORKER: the Worker process, which generally runs the distributed training code.<br>MS_PSERVER: the Parameter Server process. This role is effective only in Parameter Server mode; please refer to Parameter Server Mode. | The Worker and Parameter Server processes register with the Scheduler process to complete the networking. |
| MS_SCHED_HOST | Specifies the IP address of the Scheduler. | String | Legal IP address. | IPv6 addresses are only supported on the `Ascend` platform in the current version. |
| MS_SCHED_PORT | Specifies the port number bound by the Scheduler. | Integer | Port number in the range of 1024 to 65535. | |
| MS_NODE_ID | Specifies the ID of this process, unique within the cluster. | String | The unique ID of this process, generated automatically by MindSpore by default. | MS_NODE_ID needs to be set in the following cases; normally it does not need to be set and is generated automatically:<br>• Disaster recovery scenario: disaster recovery requires obtaining the current process ID so that the process can re-register with the Scheduler.<br>• GLOG log redirection scenario: to ensure that the logs of each training process are saved independently, the process ID is used as the suffix of the log saving path.<br>• Specified rank id scenario: users can specify the rank id of this process by setting MS_NODE_ID to an integer. |
| MS_WORKER_NUM | Specifies the number of processes with the role MS_WORKER. | Integer | Integer greater than 0. | The number of Worker processes started by the user should be equal to the value of this environment variable. If it is less than this value, the networking fails; if it is greater, the Scheduler completes the networking in the order of Worker registration and the redundant Worker processes fail to start. |
| MS_SERVER_NUM | Specifies the number of processes with the role MS_PSERVER. | Integer | Integer greater than 0. | Set only in Parameter Server training mode. |
| MS_WORKER_IP | Specifies the IP address used for communication and networking between processes. | String | Legal IP address. | It is recommended to set this environment variable when using IPv6. However, when MS_SCHED_HOST is set to `::1` (the local loopback interface in IPv6), MS_WORKER_IP does not need to be set, because MindSpore uses the local loopback interface by default. |
| MS_ENABLE_RECOVERY | Turns on disaster recovery. | Integer | 1 for on, 0 for off. The default is 0. | |
| MS_RECOVERY_PATH | Persistence path folder. | String | Legal user directory. | The Worker and Scheduler processes persist the necessary data to files during execution, such as the node information used to restore the networking and the intermediate state of the training service. |
| MS_ENABLE_LCCL | Whether to use LCCL as the communication library. | Integer | 1 for yes, other values for no. The default is no. | The LCCL communication library currently only supports the single-machine multi-card scenario and must be executed with graph compilation level O0. |
| MS_TOPO_TIMEOUT | Cluster networking phase timeout, in seconds. | Integer | The default is 30 minutes. | All nodes must register with the Scheduler within this time window. If the window is exceeded, registration fails, and if the number of registered nodes does not meet the requirement, cluster networking fails. We suggest configuring this environment variable for large-scale clusters. |
| MS_NODE_TIMEOUT | Node heartbeat timeout, in seconds. | Integer | The default is 300 seconds. | The heartbeat timeout between the Scheduler and the Workers. If there are no heartbeat messages within this time window, the cluster exits abnormally. |
| MS_RECEIVE_MSG_TIMEOUT | Node timeout for receiving messages, in seconds. | Integer | The default is 300 seconds. | The timeout window for a node to receive a message from the other end. If there is no message response within the window, an empty message is returned. |
| MS_RETRY_INTERVAL_LOWER | Lower limit of the message retry interval between nodes, in seconds. | Integer | The default is 3 seconds. | The lower limit of the interval between retries when a node sends a message. MindSpore randomly selects a value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval. This variable is used to control the message concurrency of the Scheduler. |
| MS_RETRY_INTERVAL_UPPER | Upper limit of the message retry interval between nodes, in seconds. | Integer | The default is 5 seconds. | The upper limit of the interval between retries when a node sends a message. MindSpore randomly selects a value between MS_RETRY_INTERVAL_LOWER and MS_RETRY_INTERVAL_UPPER as the interval. This variable is used to control the message concurrency of the Scheduler. |
| MS_DISABLE_HEARTBEAT | Disables the heartbeat feature between nodes in the cluster. | Integer | The heartbeat feature is enabled by default. | If set to 1, the heartbeat between cluster nodes is disabled. In this scenario, the Scheduler will not detect Worker exceptions and will not control the cluster to exit. This variable can reduce the message concurrency of the Scheduler. It is recommended to set it when debugging with the `gdb attach` command. |
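As an illustration of how these variables fit together, the following is a minimal, hedged sketch of a Python launcher that exports them and spawns one Scheduler plus several Workers with `subprocess`. It assumes the `net.py` training script from the sample code below and a free local port; the bash scripts in the Operation Practice section are the recommended way to launch.

```python
import os
import subprocess

WORKER_NUM = 8                       # must match MS_WORKER_NUM in every process
base_env = dict(os.environ)
base_env.update({
    "MS_WORKER_NUM": str(WORKER_NUM),
    "MS_SCHED_HOST": "127.0.0.1",    # identical for the Scheduler and all Workers
    "MS_SCHED_PORT": "8118",
})

processes = []
# One Scheduler: it only performs cluster networking and does not run training code.
processes.append(subprocess.Popen(["python", "net.py"],
                                  env=dict(base_env, MS_ROLE="MS_SCHED")))
# WORKER_NUM Workers: these execute the actual training script.
for rank in range(WORKER_NUM):
    worker_env = dict(base_env, MS_ROLE="MS_WORKER", MS_NODE_ID=str(rank))  # MS_NODE_ID is optional
    processes.append(subprocess.Popen(["python", "net.py"], env=worker_env))

for p in processes:
    p.wait()
```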
- -> The environment variables `MS_SCHED_HOST`, `MS_SCHED_PORT`, and `MS_WORKER_NUM` need to be consistent in their contents, or else the networking will fail due to the inconsistency in the configurations of the processes. - -## Operation Practice - -Dynamic cluster startup scripts are consistent across hardware platforms. The following is an example of how to write a startup script for Ascend: - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── run_dynamic_cluster.sh - ├── run_dynamic_cluster_1.sh - ├── run_dynamic_cluster_2.sh - ... -``` - -`net.py` is defining the network structure and training process, and `run_dynamic_cluster.sh`, `run_dynamic_cluster_1.sh` and `run_dynamic_cluster_2.sh` are executing scripts. - -### 1. Preparing Python Training Scripts - -Here, as an example of data parallel, a recognition network is trained for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL, NCCL or MCCL communication via `init()`. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. 
Preparing the Startup Script - -#### Single-Machine Multi-Card - -The content of the single-machine multi-card startup script [run_dynamic_cluster.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster.sh) is as follows. Taking the single-machine 8-card as an example: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# Start 8 Worker training processes in a loop -for((i=0;i<8;i++)); -do - export MS_WORKER_NUM=8 # Set the number of Worker processes in the cluster to 8 - export MS_SCHED_HOST=127.0.0.1 # Set the Scheduler IP address to the local loop address - export MS_SCHED_PORT=8118 # Set Scheduler port - export MS_ROLE=MS_WORKER # Set the started process to the MS_WORKER role - export MS_NODE_ID=$i # Set process id, optional - python ./net.py > device/worker_$i.log 2>&1 & # Start training script -done - -# Start 1 Scheduler process -export MS_WORKER_NUM=8 # Set the number of Worker processes in the cluster to 8 -export MS_SCHED_HOST=127.0.0.1 # Set the Scheduler IP address to the local loop address -export MS_SCHED_PORT=8118 # Set Scheduler port -export MS_ROLE=MS_SCHED # Set the started process to the MS_SCHED role -python ./net.py > device/scheduler.log 2>&1 & # Start training script -``` - -> The training scripts for the Scheduler and Worker processes are identical in content and startup method, because the internal processes of the two roles are handled differently in MindSpore. Users simply pull up the process in the normal training manner, without modifying the Python code by role. This is one of the reasons why dynamic cluster startup scripts can be consistent across multiple hardware platforms. - -A single-machine 8-card distributed training can be executed by executing the following command: - -```bash -bash run_dynamic_cluster.sh -``` - -The script will run in the background, the log file will be saved to the device directory and the result will be saved in the worker_*.log and is as follows: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -#### Multi-Machine Multi-Card - -The startup script needs to be split in the multi-machine training scenario. The following is an example of performing 2-machine 8-card training, with each machine executing the startup 4 Worker: - -The script [run_dynamic_cluster_1.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster_1.sh) starts 1 `Scheduler` process and 4 `Worker` processes on node 1: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! 
-f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# Start Worker1 to Worker4, 4 Worker training processes in a loop -for((i=0;i<4;i++)); -do - export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) - export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address - export MS_SCHED_PORT=8118 # Set the Scheduler port - export MS_ROLE=MS_WORKER # Set the startup process to the MS_WORKER role - export MS_NODE_ID=$i # Set process id, optional - python ./net.py > device/worker_$i.log 2>&1 & # Start training script -done - -# Start 1 Scheduler process on node 1 -export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) -export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address -export MS_SCHED_PORT=8118 # Set the Scheduler port -export MS_ROLE=MS_SCHED # Set the startup process to the MS_SCHED role -python ./net.py > device/scheduler.log 2>&1 & # Start training script -``` - -The script [run_dynamic_cluster_2.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster_2.sh) starts `Worker5` to `Worker8` on node 2 (without executing Scheduler): - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# Start Worker5 to Worker8, 4 Worker training processes in a loop -for((i=4;i<8;i++)); -do - export MS_WORKER_NUM=8 # Set the total number of Worker processes in the cluster to 8 (including other node processes) - export MS_SCHED_HOST= # Set the Scheduler IP address to the Node 1 IP address - export MS_SCHED_PORT=8118 # Set the Scheduler port - export MS_ROLE=MS_WORKER # Set the startup process to the MS_WORKER role - export MS_NODE_ID=$i # Set process id, optional - python ./net.py > device/worker_$i.log 2>&1 & # Start training script -done -``` - -> In a multi-machine task, it is necessary to set a different hostname for each host node, otherwise an error is reported that device id out of range. Refer to [FAQ](https://www.mindspore.cn/docs/en/master/faq/distributed_parallel.html#q-When-starting-distributed-framework-using-dynamic-cluster-or-msrun-in-multi-machine-scenario-an-error-is-reported-that-device-id-is-out-of-range-how-can-we-solve-it). -> -> In a multi-machine task, `MS_WORKER_NUM` should be the total number of Worker nodes in the cluster. -> -> To keep the inter-node network connected, use the `telnet ` command to test whether this node is connected to the started Scheduler node. - -Execute on Node 1: - -```bash -bash run_dynamic_cluster_1.sh -``` - -Execute on Node 2: - -```bash -bash run_dynamic_cluster_2.sh -``` - -That is, you can perform 2-machine 8-card distributed training tasks. - -## Disaster Recovery - -Dynamic cluster supports disaster recovery under data parallel. 
In a parallel training scenario with multi-card data, if a process quits abnormally, the training can be continued after pulling up the corresponding script of the corresponding process again, and the accuracy convergence will not be affected. Disaster recovery configuration and samples can be found in the [Disaster Recovery in Dynamic Cluster Scenarios](https://www.mindspore.cn/docs/en/master/model_train/parallel/disaster_recover.html) tutorial. - -## Security Authentication - -Dynamic cluster also supports the **Secure Encrypted Channel** feature, which supports the `TLS/SSL` protocol to satisfy users security needs. By default, the secure encrypted channel is turned off. If you need to turn it on, call init() only after configuring the secure encrypted channel correctly via `set_ps_context`, otherwise the initialization of the networking will fail. If you want to use the secure Encrypted channel, please configure it: - -`set_ps_context(config_file_path="/path/to/config_file.json", enable_ssl=True, client_password="123456", server_password="123456")` - -The `config.json` configuration file specified by `config_file_path` needs to add the following fields: - -```json -{ - "server_cert_path": "server.p12", - "crl_path": "", - "client_cert_path": "client.p12", - "ca_cert_path": "ca.crt", - "cipher_list": "ECDHE-R SA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-DSS-AES256-GCM-SHA384:DHE-PSK-AES128-GCM-SHA256:DHE-PSK-AES256-GCM-SHA384:DHE-PSK-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-PSK-CHACHA20-POLY1305:DHE-RSA-AES128-CCM:DHE-RSA-AES256-CCM:DHE-RSA-CHACHA20-POLY1305:DHE-PSK-AES128-CCM:DHE-PSK-AES256-CCM:ECDHE-ECDSA-AES128-CCM:ECDHE-ECDSA-AES256-CCM:ECDHE-ECDSA-CHACHA20-POLY1305", - "cert_expire_warning_time_in_day": 90 -} -``` - -- `server_cert_path`: The path to the p12 file (SSL-specific certificate file) that contains the cipher text of the certificate and the secret key on the server side. -- `crl_path`: The file path to the revocation list (used to distinguish invalid untrusted certificates from valid trusted certificates). -- `client_cert_path`: The client contains the path to the p12 file (SSL-specific certificate file) with the cipher text of the certificate and secret key. -- `ca_cert_path`: The path to root certificate -- `cipher_list`: Cipher suite (list of supported SSL encrypted types) -- `cert_expire_warning_time_in_da`: The warning time of certificate expiration. - -The secret key in the p12 file is stored in cipher text, and the password needs to be passed in when starting. Please refer to the Python API [mindspore.set_ps_context](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.set_ps_context.html#mindspore.set_ps_context) for the `client_password` and `server_password` fields. 
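Putting the security configuration together, a minimal sketch of the beginning of a training script with the secure channel enabled could look as follows; the certificate path and passwords are placeholders and must match the `config.json` described above.

```python
import mindspore as ms
from mindspore.communication import init

# Configure the secure encrypted channel BEFORE init(); otherwise networking initialization fails.
ms.set_ps_context(config_file_path="/path/to/config_file.json",  # config.json with the fields listed above
                  enable_ssl=True,
                  client_password="123456",   # password of the client p12 file
                  server_password="123456")   # password of the server p12 file

ms.set_context(mode=ms.GRAPH_MODE)
init()
```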
diff --git a/docs/mindspore/source_en/model_train/parallel/mpirun.md b/docs/mindspore/source_en/model_train/parallel/mpirun.md deleted file mode 100644 index 2f42aa3965dec84d73b611b0deb6904fff98f818..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/mpirun.md +++ /dev/null @@ -1,219 +0,0 @@ -# mpirun Startup - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/mpirun.md) - -## Overview - -Open Message Passing Interface (OpenMPI) is an open source, high-performance message-passing programming library for parallel computing and distributed memory computing, which realizes parallel computing by passing messages between different processes for many scientific computing and machine learning tasks. Parallel training with OpenMPI is a generalized approach to accelerate the training process by utilizing parallel computing resources on computing clusters or multi-core machines. OpenMPI serves the function of synchronizing data on the Host side as well as inter-process networking in distributed training scenarios. - -Unlike rank table startup, the user does not need to configure the `RANK_TABLE_FILE` environment variable to run the script via OpenMPI `mpirun` command on the Ascend hardware platform. - -> The `mpirun` startup supports Ascend and GPU, in addition to both PyNative mode and Graph mode. - -Related commands: - -1. The `mpirun` startup command is as follows, where `DEVICE_NUM` is the number of GPUs on the machine: - - ```bash - mpirun -n DEVICE_NUM python net.py - ``` - -2. `mpirun` can also be configured with the following parameters. For more configuration, see [mpirun documentation](https://www.open-mpi.org/doc/current/man1/mpirun.1.php): - - - `--output-filename log_output`: Save the log information of all processes to the `log_output` directory, and the logs on different cards will be saved in the corresponding files under the `log_output/1/` path by `rank_id`. - - `--merge-stderr-to-stdout`: Merge stderr to the output message of stdout. - - `--allow-run-as-root`: This parameter is required if the script is executed through the root user. - - `-mca orte_abort_on_non_zero_status 0`: When a child process exits abnormally, OpenMPI will abort all child processes by default. If you don't want to abort child processes automatically, you can add this parameter. - - `-bind-to none`: OpenMPI will specify the number of available CPU cores for the child process to be pulled up by default. If you don't want to limit the number of cores used by the process, you can add this parameter. - -> OpenMPI starts up with a number of environment variables named `OPMI_*`, and users should avoid manually modifying these environment variables in scripts. - -## Operation Practice - -The `mpirun` startup script is consistent across Ascend and GPU hardware platforms. Below is a demonstration of how to write a startup script using Ascend as an example: - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── hostfile - ├── run_mpirun_1.sh - ├── run_mpirun_2.sh - ... -``` - -`net.py` is to define the network structure and training process. 
`run_mpirun_1.sh` and `run_mpirun_2.sh` are the execution scripts, and `hostfile` is the file to configure the multi-machine and multi-card files. - -### 1. Installing OpenMPI - -Download the OpenMPI-4.1.4 source code [openmpi-4.1.4.tar.gz] (https://www.open-mpi.org/software/ompi/v4.1/). Refer to [OpenMPI official website tutorial](https://www.open-mpi.org/faq/?category=building#easy-build) for installation. - -### 2. Preparing Python Training Scripts - -Here, as an example of data parallel, a recognition network is trained for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL or NCCL communication via init. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 3. Preparing the Startup Script - -#### Single-Machine Multi-Card - -First download the [MNIST](http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip) dataset and extract it to the current folder. 
- -Then execute the single-machine multi-card boot script, using the single-machine 8-card example: - -```bash -export DATA_PATH=./MNIST_Data/train/ -mpirun -n 8 --output-filename log_output --merge-stderr-to-stdout python net.py -``` - -The log file will be saved to the `log_output` directory and the result will be saved in the `log_output/1/rank.*/stdout` and is as follows: - -```text -epoch: 0, step: 0, loss is 2.3413472 -epoch: 0, step: 10, loss is 1.6298866 -epoch: 0, step: 20, loss is 1.3729795 -epoch: 0, step: 30, loss is 1.2199347 -epoch: 0, step: 40, loss is 0.85778403 -epoch: 0, step: 50, loss is 1.0849445 -epoch: 0, step: 60, loss is 0.9102987 -epoch: 0, step: 70, loss is 0.7571399 -epoch: 0, step: 80, loss is 0.7989929 -epoch: 0, step: 90, loss is 1.0189024 -epoch: 0, step: 100, loss is 0.6298542 -... -``` - -#### Multi-Machine Multi-Card - -Before running multi-machine multi-card training, you first need to follow the following configuration: - -1. Ensure that the same versions of OpenMPI, NCCL, Python, and MindSpore are available on each node. - -2. To configure host-to-host password-free login, you can refer to the following steps to configure it: - - Identify the same user as the login user for each host (root is not recommended); - - Execute `ssh-keygen -t rsa -P ""` to generate the key; - - Execute `ssh-copy-id DEVICE-IP` to set the IP of the machine that needs password-free login; - - Execute `ssh DEVICE-IP`. If you can log in without entering a password, the above configuration is successful; - - Execute the above command on all machines to ensure two-by-two interoperability. - -After the configuration is successful, you can start the multi-machine task with the `mpirun` command, and there are currently two ways to start a multi-machine training task: - -- By means of `mpirun -H`. The startup script is as follows: - - ```bash - export DATA_PATH=./MNIST_Data/train/ - mpirun -n 16 -H DEVICE1_IP:8,DEVICE2_IP:8 --output-filename log_output --merge-stderr-to-stdout python net.py - ``` - - indicates that 8 processes are started to run the program on the machines with ip DEVICE1_IP and DEVICE2_IP respectively. Execute on one of the nodes: - - ```bash - bash run_mpirun_1.sh - ``` - -- By means of the `mpirun --hostfile` method. For debugging purposes, this method is recommended for executing multi-machine multi-card scripts. First you need to construct the hostfile file as follows: - - ```text - DEVICE1 slots=8 - 192.168.0.1 slots=8 - ``` - - The format of each line is `[hostname] slots=[slotnum]`, and hostname can be either ip or hostname. The above example indicates that there are 8 cards on DEVICE1, and there are also 8 cards on the machine with ip 192.168.0.1. - - The execution script for the 2-machine 16-card is as follows, and you need to pass in the variable `HOSTFILE`, which indicates the path to the hostfile file: - - ```bash - export DATA_PATH=./MNIST_Data/train/ - HOSTFILE=$1 - mpirun -n 16 --hostfile $HOSTFILE --output-filename log_output --merge-stderr-to-stdout python net.py - ``` - - Execute on one of the nodes: - - ```bash - bash run_mpirun_2.sh ./hostfile - ``` - -After execution, the log file is saved to the log_output directory and the result is saved in log_output/1/rank.*/stdout. 
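As a quick sanity check of the mpirun networking, the rank assignment exported by OpenMPI can be compared with what MindSpore reports after `init()`. This is only an illustrative sketch reading the standard OpenMPI variables `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE`; they should normally agree with `get_rank()` and `get_group_size()`.

```python
import os
import mindspore as ms
from mindspore.communication import init, get_rank, get_group_size

ms.set_context(mode=ms.GRAPH_MODE)
init()  # communication initialization for the processes pulled up by mpirun

# Rank and world size as seen by OpenMPI itself.
ompi_rank = int(os.getenv("OMPI_COMM_WORLD_RANK", "-1"))
ompi_size = int(os.getenv("OMPI_COMM_WORLD_SIZE", "-1"))

print(f"mpirun rank/size: {ompi_rank}/{ompi_size}, "
      f"MindSpore rank/size: {get_rank()}/{get_group_size()}")
```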
diff --git a/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md b/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md deleted file mode 100644 index 65c6264e9a58006efd291be4081fb17ece79215e..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md +++ /dev/null @@ -1,503 +0,0 @@ -# msrun Launching - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/msrun_launcher.md) - -## Overview - -`msrun` is an encapsulation of the [Dynamic Cluster](https://www.mindspore.cn/docs/en/master/model_train/parallel/dynamic_cluster.html) startup method. Users can use `msrun` to pull multi-process distributed tasks across nodes with a single command line instruction. Users can use `msrun` to pull up multi-process distributed tasks on each node with a single command line command, and there is no need to manually set [dynamic networking environment variables](https://www.mindspore.cn/docs/en/master/model_train/parallel/dynamic_cluster.html). `msrun` supports both `Ascend`, `GPU` and `CPU` backends. As with the `Dynamic Cluster` startup, `msrun` has no dependencies on third-party libraries and configuration files. - -> - `msrun` is available after the user installs MindSpore, and the command `msrun --help` can be used to view the supported parameters. -> - `msrun` supports `graph mode` as well as `PyNative mode`. - -A parameters list of command line: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Parameters | Functions | Types | Values | Instructions |
| --- | --- | --- | --- | --- |
| --worker_num | The total number of Worker processes participating in the distributed task. | Integer | An integer greater than 0. The default value is 8. | The total number of Workers started on all nodes should be equal to this parameter. If the total number is greater, the extra Worker processes fail to register; if it is less, the cluster waits for a timeout period, reports that the task failed to start, and exits. The size of the timeout window can be configured with the parameter cluster_time_out. |
| --local_worker_num | The number of Worker processes pulled up on the current node. | Integer | An integer greater than 0. The default value is 8. | When this parameter is equal to worker_num, all Worker processes are executed locally. The node_rank value is ignored in this scenario. |
| --master_addr | Specifies the IP address or hostname of the Scheduler. | String | Legal IP address or hostname. The default is the IP address 127.0.0.1. | msrun automatically detects the node on which to pull up the Scheduler process; users do not need to care about it. If the corresponding IP address cannot be found or the hostname cannot be resolved by DNS, the training task fails to start. IPv6 addresses are not supported in the current version. If a hostname is passed in, msrun automatically resolves it to an IP address, which requires the user's environment to support DNS. |
| --master_port | Specifies the port number bound by the Scheduler. | Integer | Port number in the range 1024 to 65535. The default is 8118. | |
| --node_rank | The index of the current node. | Integer | An integer greater than or equal to 0. If no value is passed, the default value is -1. | This parameter is ignored in the single-machine multi-card scenario. In multi-machine multi-card scenarios, if this parameter is not set, the rank_id of each Worker process is assigned automatically; if it is set, rank_ids are assigned to the Worker processes on each node according to this index. If the number of Worker processes per node differs, it is recommended not to configure this parameter so that rank_ids are assigned automatically. |
| --log_dir | Worker and Scheduler log output path. | String | Folder path. Defaults to the current directory. | If the path does not exist, msrun creates the folder recursively. The Scheduler log is named scheduler.log; Worker logs are named worker_[rank].log, where the rank suffix matches the rank_id assigned to the Worker, although they may be inconsistent in multi-machine multi-card scenarios where node_rank is not set. It is recommended to run grep -rn "Global rank id" to view the rank_id of each Worker. |
| --join | Whether msrun waits for the Workers and the Scheduler to exit. | Bool | True or False. Default: False. | If set to False, msrun exits immediately after pulling up the processes; check the logs to confirm that the distributed task is executing properly. If set to True, msrun waits for all processes to exit, collects the exception logs, and exits. |
| --cluster_time_out | Cluster networking timeout, in seconds. | Integer | Default: 600 seconds. | The waiting time for cluster networking. If fewer than worker_num Workers register within this time window, the task fails to start. |
| --bind_core | Enable binding processes to CPU cores. | Bool | True or False. Default: False. | If set to True, msrun evenly allocates CPU cores and binds them to the spawned distributed processes. |
| --sim_level | Set the single-card simulated compilation level. | Integer | Default: -1, which disables simulated compilation. | If this parameter is set, msrun starts only a single process for simulated compilation and does not execute operators. This feature is commonly used to debug the parallel strategies of large-scale distributed training and to detect memory and strategy issues in advance. If set to 0, only the frontend graph is compiled; if set to 1, the backend graph is also compiled and the process exits before the execution phase. |
| --sim_rank_id | rank_id of the simulated process. | Integer | Default: 0. | Sets the rank id of the simulated process. |
| --rank_table_file | rank_table configuration file. Only valid on the Ascend platform. | String | File path of the rank_table configuration. Default: empty string. | The rank_table configuration file on the Ascend platform, describing the current distributed cluster. Since the rank_table configuration reflects the distributed cluster at the physical level, make sure that the Devices visible to the current process are consistent with the rank_table configuration when using it. The Devices visible to the current process can be set via the environment variable ASCEND_RT_VISIBLE_DEVICES. |
| --worker_log_name | Specifies the worker log name. | String | File name of the worker log. Default: worker_[rank].log. | Allows users to configure the worker log name; the IP and hostname can be inserted into the log name via {ip} and {hostname} respectively. The suffix of the worker log name is the rank by default. |
| --tail_worker_log | Output worker logs to the console. | String | One or more integers associated with worker process rank_ids. Default: -1. | By default, all worker logs of the current node are output to the console; when --join=True, users can specify one or more worker logs to output to the console. The values should be in [0, local_worker_num]. |
| task_script | User Python script. | String | Legal script path. | Normally this parameter is the Python script path, and msrun pulls up each process as python task_script task_script_args by default. msrun also supports passing pytest as this parameter; in that scenario, the task script and task arguments are passed in task_script_args. |
| task_script_args | Arguments for the user Python script. | | Parameter list. | For example, msrun --worker_num=8 --local_worker_num=8 train.py --device_target=Ascend --dataset_path=/path/to/dataset. |
- -## Environment Variables - -The following table shows the environment variables can be used in user scripts, which are set by `msrun`: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Environment Variables | Functions | Values |
| --- | --- | --- |
| MS_ROLE | The role of this process. | The current version of msrun exports the following two values:<br>• MS_SCHED: represents the Scheduler process.<br>• MS_WORKER: represents the Worker process. |
| MS_SCHED_HOST | The IP address of the user-specified Scheduler. | Same as the parameter --master_addr. |
| MS_SCHED_PORT | The user-specified port number bound by the Scheduler. | Same as the parameter --master_port. |
| MS_WORKER_NUM | The total number of Worker processes specified by the user. | Same as the parameter --worker_num. |
| MS_TOPO_TIMEOUT | Cluster networking timeout. | Same as the parameter --cluster_time_out. |
| RANK_SIZE | The total number of Worker processes specified by the user. | Same as the parameter --worker_num. |
| RANK_ID | The rank_id assigned to the Worker process. | In a multi-machine multi-card scenario, if the parameter --node_rank is not set, RANK_ID is only exported after the cluster is initialized, so to use this environment variable it is recommended to set --node_rank correctly. |
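Inside the training script these variables can be read like any other environment variable. Below is a small hedged sketch; it assumes the script was launched by `msrun` with `--node_rank` set so that `RANK_ID` is already exported.

```python
import os

# Variables exported by msrun before the user script starts.
role = os.getenv("MS_ROLE")                      # "MS_WORKER" or "MS_SCHED"
worker_num = int(os.getenv("MS_WORKER_NUM", "1"))
rank_id = os.getenv("RANK_ID")                   # may be unset if --node_rank was not given

if role == "MS_WORKER":
    print(f"worker {rank_id}/{worker_num} started, scheduler at "
          f"{os.getenv('MS_SCHED_HOST')}:{os.getenv('MS_SCHED_PORT')}")
```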
- -msrun is used as an encapsulation of the Dynamic Cluster startup method, and all user-configurable environment variables can be found in [dynamic networking environment variables](https://www.mindspore.cn/docs/en/master/model_train/parallel/dynamic_cluster.html). - -## Starting Distributed Tasks - -The startup script is consistent across hardware platforms. The following is an example of how to write a startup script for Ascend: - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── msrun_1.sh - ├── msrun_2.sh - ├── msrun_single.sh - ├── net.py - ... -``` - -`net.py` defines the network structure and the training process. `msrun_single.sh` is a single-machine multi-card execution script that starts with `msrun`. `msrun_1.sh` and `msrun_2.sh` are multi-machine, multi-card execution scripts started with `msrun` and executed on separate nodes. - -### 1. Preparing Python Training Scripts - -Here is an example of data parallelism to train a recognition network for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize the HCCL, NCCL or MCCL communication domain with `init()`. If `device_target` is not set here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - 
print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. Preparing the Startup Script - -> For msrun, single-machine multi-card and multi-machine multi-card execution commands are similar, single-machine multi-card only needs to keep the parameters `worker_num` and `local_worker_num` the same, and single-machine multi-card scenarios do not need to set the `master_addr`, which defaults to `127.0.0.1`. - -#### Single-machine Multi-card - -The following is an example of performing a single-machine 8-card training session: - -The script [msrun_single.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_single.sh) uses the msrun command to pull up 1 `Scheduler` process as well as 8 `Worker` processes on the current node (no need to set `master_addr`, defaults to `127.0.0.1`; no need to set `node_rank` for single-machine): - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -Execute the command: - -```bash -bash msrun_single.sh -``` - -The single-machine 8-card distributed training task can be executed. The log file is saved to the `. /msrun_log` directory and the results are saved in `. /msrun_log/worker_*.log`. The Loss results are as follows: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -#### Multi-machine Multi-card - -The following is an example of executing 2-machine, 8-card training, with each machine executing the startup of 4 Workers: - -The script [msrun_1.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_1.sh) is executed on node 1 and uses the msrun command to pull up 1 `Scheduler` process and 4 `Worker` processes, configures `master_addr` as the IP address of node 1 (msrun automatically detects that the current node ip matches the `master_addr` and pulls up the `Scheduler` process). Set the current node to node 0 with `node_rank`: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=4 --master_addr= --master_port=8118 --node_rank=0 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -The script [msrun_2.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_2.sh) is executed on node 2 and uses the msrun command to pull up 4 `Worker` processes, configures `master_addr` as the IP address of node 1. Set the current node to node 0 with `node_rank`: - -```bash -EXEC_PATH=$(pwd) -if [ ! 
-d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=4 --master_addr= --master_port=8118 --node_rank=1 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -> The difference between the instructions for node 2 and node 1 is that `node_rank` is different. - -Executed at node 1: - -```bash -bash msrun_1.sh -``` - -Executed at node 2: - -```bash -bash msrun_2.sh -``` - -The 2-machine, 8-card distributed training task can be executed, and the log files are saved to the `. /msrun_log` directory and the results are saved in `. /msrun_log/worker_*.log`. The Loss results are as follows: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -## Multi-Card Parallel Debugging - -Multi-card parallel debugging can be done in distributed environments using Python's built-in debugger (pdb), by breaking and synchronizing operations on all or a particular rank. After pulling up a worker process with the `msrun` parameter set to `--join=True`, the standard input of all worker processes is inherited from the `msrun` master process, and the standard output is output to a shell window via the `msrun` log redirection feature. Details of how to use pdb in a distributed environment are given below: - -### 1. Starting the pdb Debugger - -Users can start the pdb debugger in various ways, such as inserting `import pdb; pdb.set_trace()` or `breakpoint()` in the python training script to perform breakpoint operations. - -#### Python Training Script - -```python -import pdb -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -pdb.set_trace() -ms.set_seed(1) -``` - -#### Startup Script - -In the startup script, the `msrun` parameter needs to be set to `--join=True` to ensure that pdb commands are passed through stdin and debugging is displayed through stdout. - -```bash -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -### 2. Debugging Particular Ranks - -In a distributed environment, users may need to debug for a particular rank, which can be accomplished by placing breakpoints in the training script for that particular rank. 
For example, in the standalone eight-card task, debugging is done only for rank 7: - -```python -import pdb -import mindspore as ms -from mindspore.communication import init, get_rank - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -if get_rank() == 7: - pdb.set_trace() -ms.set_seed(1) -``` - -> The `mindspore.communication.get_rank()` interface needs to be called after the `mindspore.communication.init()` interface has completed its distributed initialization to get the rank information properly, otherwise `get_rank()` will return 0 by default. - -After a breakpoint operation on a rank, it will cause the execution of that rank process to stop at the breakpoint and wait for subsequent interactions, while other rank processes will continue to run, which may lead to different processes to be faster or slower. So users can use the `mindspore.communication.comm_func.barrier()` operator and the `mindspore.communication.api._pynative_executor.sync()` to synchronize the running of all ranks, ensuring that other ranks block and wait, and that the stops of other ranks are released once the debugging rank continues to run. For example, in a standalone eight-card task, debugging breakpoints only for rank 7 and blocking all other ranks: - -```python -import pdb -import mindspore as ms -from mindspore.communication import init, get_rank -from mindspore.communication.comm_func import barrier -from mindspore.common.api import _pynative_executor - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -if get_rank() == 7: - pdb.set_trace() -barrier() -_pynative_executor.sync() -ms.set_seed(1) -``` - -### 3. Stdin and Stdout to Shell Terminial - -`msrun` supports outputting specific worker logs to the shell's stdout via `--tail_worker_log`. To make the standard output more observable, it is recommended to use this parameter to specify the rank to be debugged. For example, in the standalone eight-card task, only breakpoints for rank 7 are debugged: - -```bash -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 --tail_worker_log=7 net.py -``` - -> - `msrun`'s default behavior without the `--tail_worker_log` parameter will output the logs of all workers on this node to the shell's stdout. -> - When debugging multiple ranks at the same time, a single pdb command will be passed to a single rank in turn via stdin. - -### 4. Common pdb Debugging Commands - -- `n` (next): Execute the current line of code and jump to the next one. -- `s` (step): Enter the function called in the current line of code, step by step. -- `c` (continue): Continue executing the program until the next breakpoint. -- `q` (quit): Exit the debugger and terminate program execution. -- `p` (print): Print the value of a variable. For example, `p variable` displays the current value of the variable `variable`. -- `l` (list): Display the context of the current code. -- `b` (break): Set a breakpoint, either by line number or function name. -- `h` (help): Display help information, list all available commands. 
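The conditional-breakpoint pattern shown above can be wrapped into a small helper for reuse. This is only a sketch built from the interfaces already used in this section (`get_rank`, `barrier`, `_pynative_executor.sync`) and must be called after `init()` has completed.

```python
import pdb
from mindspore.communication import get_rank
from mindspore.communication.comm_func import barrier
from mindspore.common.api import _pynative_executor

def break_on_ranks(ranks):
    """Stop in pdb only on the given ranks; all other ranks block on the barrier."""
    if get_rank() in ranks:
        pdb.set_trace()
    barrier()
    _pynative_executor.sync()

# Example: debug ranks 0 and 7 in a single-machine 8-card task.
# break_on_ranks({0, 7})
```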
diff --git a/docs/mindspore/source_en/model_train/parallel/rank_table.md b/docs/mindspore/source_en/model_train/parallel/rank_table.md deleted file mode 100644 index 2fe9ab3a2fbe47ad31a4b6d9ccdac8f760fc15e4..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/rank_table.md +++ /dev/null @@ -1,456 +0,0 @@ -# rank table Startup - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/rank_table.md) - -## Overview - -`rank table` startup is a startup method unique to the Ascend hardware platform. This method does not rely on third-party libraries and runs as a single process on a single card, requiring the user to create a process in the script that matches the number of cards in use. This method is consistent across nodes in multiple machines and facilitates rapid batch deployment. - -Related Configurations: - -`rank table` mainly need to configure the rank_table file, taking the 2-card environment configuration file `rank_table_2pcs.json` as an example: - -```json -{ - "version": "1.0", - "server_count": "1", - "server_list": [ - { - "server_id": "10.*.*.*", - "device": [ - {"device_id": "0","device_ip": "192.1.*.6","rank_id": "0"}, - {"device_id": "1","device_ip": "192.2.*.6","rank_id": "1"}], - "host_nic_ip": "reserve" - } - ], - "status": "completed" -} -``` - -The parameter items that need to be modified according to the actual training environment are: - -- `server_count` represents the number of machines involved in training. -- `server_id` represents the IP address of the current machine. -- `device_id` represents the physical serial number of the card, i.e., the actual serial number in the machine where the card is located. -- `device_ip` represents the IP address of the integrated NIC. You can execute the command `cat /etc/hccn.conf` on the current machine, and the key value of `address_x` is the IP address of the NIC. -- `rank_id` represents the card logical serial number, fixed numbering from 0. - -## Operation Practice - -> You can download the full sample code here: [startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method). - -The directory structure is as follows: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── rank_table_8pcs.json - ├── rank_table_16pcs.json - ├── rank_table_cross_cluster_16pcs.json - ├── run_rank_table.sh - ├── run_rank_table_cluster.sh - ├── run_rank_table_cross_cluster.sh - ... -``` - -`net.py` defines the network structure and the training process, `run_rank_table.sh`, `run_rank_table_cluster.sh` and `run_rank_table_cross_cluster.sh` are executing the scripts. `rank_table_8pcs.json`, `rank_table_16pcs.json` and `rank_table_cross_cluster_16pcs.json` are 8 cards, 16 cards and cross cluster 16 cards rank_table config file. - -### 1. Preparing Python Training Scripts - -Here, as an example of data parallel, a recognition network is trained for the MNIST dataset. - -First specify the operation mode, hardware device, etc. Unlike single card scripts, parallel scripts also need to specify configuration items such as parallel mode and initialize HCCL or NCCL communication via init. If you don't set `device_target` here, it will be automatically specified as the backend hardware device corresponding to the MindSpore package. 
- -```python -import os -import mindspore as ms -from mindspore.communication import init - -device_id = int(os.getenv('DEVICE_ID')) -ms.set_device(device_id=device_id) -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -Then build the following network: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -Finally, the dataset is processed and the training process is defined: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. Preparing the Startup Script - -#### Single-Machine Multi-Card - -The `rank table` method uses a single-card single-process operation, i.e., 1 process runs on each card, with the same number of processes as the number of cards in use. Each process creates a directory to store log information and operator compilation information. Below is an example of how to run a distributed training script using 8 cards: - -```bash -RANK_SIZE=8 -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json -export RANK_SIZE=$RANK_SIZE - -for((i=1;i<${RANK_SIZE};i++)) -do - rm -rf device$i - mkdir device$i - cp ./net.py ./device$i - cd ./device$i - export DEVICE_ID=$i - export RANK_ID=$i - echo "start training for device $i" - env > env$i.log - python ./net.py > train$i.log 2>&1 & - cd ../ -done -``` - -Distributed-related environment variables are: - -- `RANK_TABLE_FILE`: Path to the networking information file. -- `DEVICE_ID`: The actual serial number of the current card on the machine. -- `RANK_ID`: The logical serial number of the current card. 
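Before launching, it can help to sanity-check the rank_table file against these variables. The following is a minimal hedged sketch (the file name `rank_table_8pcs.json` is the one used in this example) that parses the file and prints the device_id-to-rank_id mapping.

```python
import json

# Parse the rank_table file referenced by RANK_TABLE_FILE in the launch script.
with open("rank_table_8pcs.json") as f:
    table = json.load(f)

print("server_count:", table["server_count"], "status:", table["status"])
for server in table["server_list"]:
    for dev in server["device"]:
        print(f"server {server['server_id']}: device_id={dev['device_id']} "
              f"device_ip={dev['device_ip']} rank_id={dev['rank_id']}")
```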
- -After configuring `rank_table_8pcs.json` in the current path, execute the following command: - -```bash -bash run_rank_table.sh -``` - -After running, the log files are saved in `device0`, `device1` and other directories, `env*.log` records information about environment variables, and the output is saved in `train*.log`, as shown in the example below: - -```text -epoch: 0, step: 0, loss is 2.3391366 -epoch: 0, step: 10, loss is 1.8047495 -epoch: 0, step: 20, loss is 1.2186875 -epoch: 0, step: 30, loss is 1.3065228 -epoch: 0, step: 40, loss is 1.0825632 -epoch: 0, step: 50, loss is 1.0281029 -epoch: 0, step: 60, loss is 0.8405618 -epoch: 0, step: 70, loss is 0.7346531 -epoch: 0, step: 80, loss is 0.688364 -epoch: 0, step: 90, loss is 0.51331174 -epoch: 0, step: 100, loss is 0.53782797 -... -``` - -#### Multi-Machine Multi-Card - -In the Ascend environment, the communication of NPU units across machines is the same as the communication of individual NPU units within a single machine, still through the HCCL. The difference is that the NPU units within a single machine are naturally interoperable, while the cross-machine ones need to ensure that the networks of the two machines are interoperable. The method of confirmation is as follows: - -Executing the following command on server 1 will configure each device with the `device ip` of the corresponding device on server 2. For example, configure the destination IP of card 0 on server 1 as the ip of card 0 on server 2. Configuration commands require the `hccn_tool` tool. The [`hccn_tool`](https://support.huawei.com/enterprise/zh/ascend-computing/a300t-9000-pid-250702906?category=developer-documents) is an HCCL tool that comes with the CANN package. - -```bash -hccn_tool -i 0 -netdetect -s address 192.*.92.131 -hccn_tool -i 1 -netdetect -s address 192.*.93.131 -hccn_tool -i 2 -netdetect -s address 192.*.94.131 -hccn_tool -i 3 -netdetect -s address 192.*.95.131 -hccn_tool -i 4 -netdetect -s address 192.*.92.141 -hccn_tool -i 5 -netdetect -s address 192.*.93.141 -hccn_tool -i 6 -netdetect -s address 192.*.94.141 -hccn_tool -i 7 -netdetect -s address 192.*.95.141 -``` - -`-i 0` specifies the device ID. `-netdetect` specifies the network detection object IP attribute. `-s address` indicates that the attribute is set to an IP address. `192.*.92.131` indicates the ip address of device 0 on server 2. The interface command can be referenced [here](https://support.huawei.com/enterprise/zh/doc/EDOC1100251947/8eff627f). - -After executing the above command on server 1, start checking the network link status with the following command. Another function of `hccn_tool` is used here, the meaning of which can be found [here](https://support.huawei.com/enterprise/zh/doc/EDOC1100251947/7d059b59). - -```bash -hccn_tool -i 0 -net_health -g -hccn_tool -i 1 -net_health -g -hccn_tool -i 2 -net_health -g -hccn_tool -i 3 -net_health -g -hccn_tool -i 4 -net_health -g -hccn_tool -i 5 -net_health -g -hccn_tool -i 6 -net_health -g -hccn_tool -i 7 -net_health -g -``` - -If the connection is successful, the corresponding output is as follows: - -```bash -net health status: Success -``` - -If the connection fails, the corresponding output is as follows: - -```bash -net health status: Fault -``` - -After confirming that the network of the NPU units between the machines is smooth, configure the json configuration file of the multi-machine, this document takes the configuration file of the 16 cards as an example. 
The detailed description of the configuration file can be referred to the introduction of single-machine multi-card part in this document. It should be noted that in the configuration of the multi-machine json file, it is required that the order of rank_id is consistent with the dictionary order of server_id. - -```json -{ - "version": "1.0", - "server_count": "2", - "server_list": [ - { - "server_id": "10.*.*.*", - "device": [ - {"device_id": "0","device_ip": "192.1.*.6","rank_id": "0"}, - {"device_id": "1","device_ip": "192.2.*.6","rank_id": "1"}, - {"device_id": "2","device_ip": "192.3.*.6","rank_id": "2"}, - {"device_id": "3","device_ip": "192.4.*.6","rank_id": "3"}, - {"device_id": "4","device_ip": "192.1.*.7","rank_id": "4"}, - {"device_id": "5","device_ip": "192.2.*.7","rank_id": "5"}, - {"device_id": "6","device_ip": "192.3.*.7","rank_id": "6"}, - {"device_id": "7","device_ip": "192.4.*.7","rank_id": "7"}], - "host_nic_ip": "reserve" - }, - { - "server_id": "10.*.*.*", - "device": [ - {"device_id": "0","device_ip": "192.1.*.8","rank_id": "8"}, - {"device_id": "1","device_ip": "192.2.*.8","rank_id": "9"}, - {"device_id": "2","device_ip": "192.3.*.8","rank_id": "10"}, - {"device_id": "3","device_ip": "192.4.*.8","rank_id": "11"}, - {"device_id": "4","device_ip": "192.1.*.9","rank_id": "12"}, - {"device_id": "5","device_ip": "192.2.*.9","rank_id": "13"}, - {"device_id": "6","device_ip": "192.3.*.9","rank_id": "14"}, - {"device_id": "7","device_ip": "192.4.*.9","rank_id": "15"}], - "host_nic_ip": "reserve" - } - ], - "status": "completed" -} -``` - -After preparing the configuration file, you can carry out the organization of distributed multi-machine training scripts. In the case of 2-machine 16-card, the scripts on the two machines are similar to the scripts run on single-machine multi-card, the difference being the specification of different rank_id variables. - -```bash -RANK_SIZE=16 -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_16pcs.json -export RANK_SIZE=$RANK_SIZE - -RANK_START=$1 -DEVICE_START=0 - -for((i=0;i<=7;i++)); -do - export RANK_ID=$[i+RANK_START] - export DEVICE_ID=$[i+DEVICE_START] - rm -rf ./device_$RANK_ID - mkdir ./device_$RANK_ID - cp ./net.py ./device_$RANK_ID - cd ./device_$RANK_ID - env > env$i.log - python ./net.py >train$RANK_ID.log 2>&1 & -done -``` - -During execution, the following commands are executed on the two machines, where rank_table.json is configured according to the 16-card distributed json file reference shown in this section. - -```bash -# server0 -bash run_rank_table_cluster.sh 0 -# server1 -bash run_rank_table_cluster.sh 8 -``` - -After running, the log files are saved in the directories `device_0`, `device_1`. The information about the environment variables is recorded in `env*.log`, and the output is saved in `train*.log`. - -#### Cross Cluster - -For today's large-scale models, using compute clusters for training has become the norm. However, as model sizes continue to grow, the resources of a single cluster can no longer meet the memory requirements for model training. Therefore, support for cross-cluster communication has become a prerequisite for training ultra-large-scale models. 
Currently, the HCCL communication library of Ascend hardware does not support cross-cluster communication. To address this issue, MindSpore provides a cross-cluster communication library that enables efficient communication between NPUs in different clusters. With this library, users can overcome the memory limitations of a single cluster and achieve cross-cluster parallel training for ultra-large-scale models. - -Currently, the MindSpore framework enables this feature simply by adding the `cluster_list` configuration item for cross-cluster communication in the multi-node, multi-card JSON configuration file. This document uses a 2-node, 16-card setup (assuming the two machines are not in the same cluster) as an example to illustrate how to write the relevant configuration items for cross-cluster scenarios. For detailed information about the configuration file, please refer to the single-node, multi-card section in this document. - -```json -{ - "version": "1.0", - "server_count": "2", - "server_list": [ - { - "server_id": "server_0_10.*.*.*", - "server_ip": "10.*.*.*", - "device": [ - {"device_id": "0", "device_ip": "192.1.*.6", "rank_id": "0", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "1", "device_ip": "192.2.*.6", "rank_id": "1", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "2", "device_ip": "192.3.*.6", "rank_id": "2", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "3", "device_ip": "192.4.*.6", "rank_id": "3", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "4", "device_ip": "192.1.*.7", "rank_id": "4", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "5", "device_ip": "192.2.*.7", "rank_id": "5", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "6", "device_ip": "192.3.*.7", "rank_id": "6", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "7", "device_ip": "192.4.*.7", "rank_id": "7", "dpu_ip": "8.2.17.60", "numa_id": ""}], - "host_nic_ip": "reserve", - "pod_ip": "127.0.0.1" - }, - { - "server_id": "server_1_10.*.*.*", - "server_ip": "10.*.*.*", - "device": [ - {"device_id": "0", "device_ip": "192.1.*.8", "rank_id": "8", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "1", "device_ip": "192.2.*.8", "rank_id": "9", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "2", "device_ip": "192.3.*.8", "rank_id": "10", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "3", "device_ip": "192.4.*.8", "rank_id": "11", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "4", "device_ip": "192.1.*.9", "rank_id": "12", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "5", "device_ip": "192.2.*.9", "rank_id": "13", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "6", "device_ip": "192.3.*.9", "rank_id": "14", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "7", "device_ip": "192.4.*.9", "rank_id": "15", "dpu_ip": "8.2.17.60", "numa_id": ""}], - "host_nic_ip": "reserve", - "pod_ip": "127.0.0.1" - } - ], - "cluster_list": [ - { - "cluster_id": "cluster_0", - "network_type": "ROCE", - "az_id": "az_0", - "region_id": "region_0", - "server_list": [ - { - "server_id": "server_0_10.*.*.*" - } - ] - }, - { - "cluster_id": "cluster_1", - "network_type": "ROCE", - "az_id": "az_1", - "region_id": "region_1", - "server_list": [ - { - "server_id": "server_1_10.*.*.*" - } - ] - } - ], - "status": "completed" -} -``` - -For cross-cluster scenarios, the parameters that need to be added or modified based on the actual training environment are as follows: - -- `server_id` represents the globally unique identifier of the 
current machine. -- `server_ip` represents the IP address of the current machine. -- `dpu_ip` represents the virtual IP address of the card within the tenant VPC, used for cross-cluster communication. -- `numa_id` represents the NUMA-affined CPU core ID of the card on the current machine. -- `cluster_id` represents the globally unique identifier of the cluster. -- `network_type` represents the type of network between machines within the cluster, currently set to "ROCE." -- `az_id` represents the AZ (Availability Zone) ID where the cluster is located. -- `server_list` represents the list of machines included in the current cluster. - -Once the configuration file is prepared, the distributed training script for cross-cluster scenarios remains consistent with the distributed training script for multi-node, multi-card setups described in this document. Using a 2-cluster, 16-card setup as an example, the scripts on the two machines in the two clusters are the same as those used in multi-node, multi-card scenarios. The only difference lies in specifying different rank_id variables. - -```bash -RANK_SIZE=16 -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_2_cluster_16pcs.json -export RANK_SIZE=$RANK_SIZE - -RANK_START=$1 -DEVICE_START=0 - -for((i=0;i<=7;i++)); -do - export RANK_ID=$[i+RANK_START] - export DEVICE_ID=$[i+DEVICE_START] - rm -rf ./device_$RANK_ID - mkdir ./device_$RANK_ID - cp ./net.py ./device_$RANK_ID - cd ./device_$RANK_ID - env > env$i.log - python ./net.py >train$RANK_ID.log 2>&1 & -done -``` - -During execution, the two machines in the two clusters run the following commands, respectively. The `rank_table_cross_cluster_16pcs.json` file is configured based on the 2-cluster, 16-card cross-cluster distributed JSON file example shown in this section. The `rank_table_cross_cluster_16pcs.json` configuration used on each machine in both clusters must remain consistent. - -```bash -# server0 -bash run_rank_table_cross_cluster.sh 0 -# server1 -bash run_rank_table_cross_cluster.sh 8 -``` - -After execution, log files are saved in the `device_0`, `device_1`, and other corresponding directories on each machine in the clusters. The `env*.log` files record information about the environment variables, while the output results are stored in the `train*.log` files. diff --git a/docs/mindspore/source_en/model_train/parallel/startup_method.rst b/docs/mindspore/source_en/model_train/parallel/startup_method.rst deleted file mode 100644 index a4016765adb4b032f49b60b6e898a9678992c9d5..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/model_train/parallel/startup_method.rst +++ /dev/null @@ -1,42 +0,0 @@ -Distributed Parallel Startup Methods -==================================== - -.. image:: https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg - :target: https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/model_train/parallel/startup_method.rst - :alt: View Source on Gitee - -.. 
toctree::
   :maxdepth: 1
   :hidden:

   msrun_launcher
   dynamic_cluster
   mpirun
   rank_table

Startup Method
---------------

Currently, GPU, Ascend and CPU each support multiple startup methods, namely \ ``msrun``, dynamic cluster, \ ``mpirun`` and \ ``rank table``:

- `msrun `_: `msrun` is the encapsulation of dynamic cluster. It allows users to launch distributed jobs with a single command on each node, and it can be used as soon as MindSpore is installed. This method does not rely on third-party libraries or configuration files, provides disaster recovery, offers good security, and supports all three hardware platforms. It is recommended that users prioritize this startup method.
- `Dynamic cluster `_: dynamic cluster requires users to spawn multiple processes and export environment variables manually. It is the underlying implementation of `msrun`. Use this method when running the `Parameter Server` training mode; for other distributed jobs, `msrun` is recommended.
- `mpirun `_: this method relies on the open-source library OpenMPI, and the startup command is simple. For multi-machine training, password-free login must be configured between every pair of machines. This startup method is recommended for users who are experienced with OpenMPI.
- `rank table `_: this method requires the Ascend hardware platform and does not rely on third-party libraries. After manually configuring the rank_table file, you can start the parallel program via a script; the script is consistent across machines, which makes batch deployment easy.

.. warning::
    The `rank_table` startup method will be deprecated in MindSpore 2.4.

The hardware support for the four startup methods is shown in the table below:

+-------------------------+--------------+-----------------+-------------+
|                         | GPU          | Ascend          | CPU         |
+=========================+==============+=================+=============+
| \ ``msrun``\            | Support      | Support         | Support     |
+-------------------------+--------------+-----------------+-------------+
| Dynamic cluster         | Support      | Support         | Support     |
+-------------------------+--------------+-----------------+-------------+
| \ ``mpirun``\           | Support      | Support         | Not support |
+-------------------------+--------------+-----------------+-------------+
| \ ``rank table``\       | Not support  | Support         | Not support |
+-------------------------+--------------+-----------------+-------------+
\ No newline at end of file
diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md b/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md deleted file mode 100644 index 62cc358afbfb8077742b938a3fa4ba09f7a75d8c..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md +++ /dev/null @@ -1,439 +0,0 @@

# 动态组网启动

[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/model_train/parallel/dynamic_cluster.md)

## 概述

出于训练时的可靠性要求,MindSpore提供了**动态组网**特性,用户能够不依赖任何第三方库(OpenMPI)来启动Ascend/GPU/CPU分布式训练任务,并且训练脚本无需做任何修改。我们建议用户优先使用此种启动方式。

MindSpore**动态组网**特性通过**复用Parameter Server模式训练架构**,取代了OpenMPI能力,可参考[Parameter Server模式](https://mindspore.cn/docs/zh-CN/master/model_train/parallel/parameter_server_training.html)训练教程。

**动态组网**特性将多个MindSpore训练进程作为`Worker`启动,并且额外启动一个`Scheduler`负责组网和容灾恢复。因此无需借助OpenMPI的消息传递机制即可实现分布式训练。用户只需对启动脚本做少量修改,便可执行分布式训练。

> 动态组网支持Ascend、GPU和CPU,因此动态组网启动脚本能在多种硬件平台间快速迁移,无需对其进行额外修改。
相关环境变量:

| 环境变量 | 功能 | 类型 | 取值 | 说明 |
| --- | --- | --- | --- | --- |
| MS_ROLE | 指定本进程角色。 | String | MS_SCHED:代表Scheduler进程,一个训练任务只启动一个Scheduler,负责组网,容灾恢复等,不会执行训练代码;MS_WORKER:代表Worker进程,一般设置分布式训练进程为此角色;MS_PSERVER:代表Parameter Server进程,只有在Parameter Server模式下此角色生效,具体请参考Parameter Server模式。 | Worker和Parameter Server进程会向Scheduler进程注册从而完成组网。 |
| MS_SCHED_HOST | 指定Scheduler的IP地址。 | String | 合法的IP地址。 | 当前版本还支持Ascend平台下的IPv6地址。 |
| MS_SCHED_PORT | 指定Scheduler绑定端口号。 | Integer | 1024~65535范围内的端口号。 | |
| MS_NODE_ID | 指定本进程的ID,集群内唯一。 | String | 代表本进程的唯一ID,默认由MindSpore自动生成。 | MS_NODE_ID在以下情况需要设置,一般情况下无需设置,由MindSpore自动生成:开启容灾场景:容灾恢复时需要获取当前进程ID,从而向Scheduler重新注册;开启GLOG日志重定向场景:为了保证各训练进程日志独立保存,需设置进程ID,作为日志保存路径后缀;指定进程rank id场景:用户可通过设置MS_NODE_ID为某个整数,来指定本进程的rank id。 |
| MS_WORKER_NUM | 指定角色为MS_WORKER的进程数量。 | Integer | 大于0的整数。 | 用户启动的Worker进程数量应当与此环境变量值相等。若小于此数值,组网失败;若大于此数值,Scheduler进程会根据Worker注册先后顺序完成组网,多余的Worker进程会启动失败。 |
| MS_SERVER_NUM | 指定角色为MS_PSERVER的进程数量。 | Integer | 大于0的整数。 | 只在Parameter Server训练模式下需要设置。 |
| MS_WORKER_IP | 指定当前进程和其他进程进行通信和组网使用的IP地址。 | String | 合法的IP地址。 | 在使用IPv6地址进行组网时,建议设置此环境变量。但当用户设置MS_SCHED_HOST为::1时(代表IPv6的本地回环地址),无需设置此环境变量,这是因为MindSpore会默认使用本地回环地址进行通信。 |
| MS_ENABLE_RECOVERY | 开启容灾。 | Integer | 1代表开启,0代表关闭。默认为0。 | |
| MS_RECOVERY_PATH | 持久化路径文件夹。 | String | 合法的用户目录。 | Worker和Scheduler进程在执行过程中会进行必要的持久化,如用于恢复组网的节点信息以及训练业务中间状态等,并通过文件保存。 |
| MS_ENABLE_LCCL | 是否使用LCCL通信库。 | Integer | 1代表开启,0代表关闭。默认为0。 | LCCL通信库暂只支持单机多卡,并且必须在图编译等级为O0时执行。 |
| MS_TOPO_TIMEOUT | 集群组网阶段超时时间,单位:秒。 | Integer | 默认为30分钟。 | 此数值代表所有节点在这个时间窗口内均可向Scheduler进行注册,超出此时间窗口则注册失败,若节点数量不满足要求,则集群组网失败。建议用户在集群规模较大时配置此环境变量。 |
| MS_NODE_TIMEOUT | 节点心跳超时时间,单位:秒。 | Integer | 默认为300秒。 | 此数值代表Scheduler以及Worker间心跳超时时间,若此时间窗口内没有心跳消息,则集群异常退出。 |
| MS_RECEIVE_MSG_TIMEOUT | 节点接收消息超时时间,单位:秒。 | Integer | 默认为300秒。 | 此数值代表节点接收对端消息超时时间,若时间窗口内无消息响应,则返回空消息。 |
| MS_RETRY_INTERVAL_LOWER | 节点间消息重试间隔下限,单位:秒。 | Integer | 默认为3秒。 | 此数值代表节点每次重试发送消息的时间间隔下限,MindSpore会随机选择MS_RETRY_INTERVAL_LOWER和MS_RETRY_INTERVAL_UPPER之间的值作为间隔时间。此变量可以控制Scheduler节点的消息并发量。 |
| MS_RETRY_INTERVAL_UPPER | 节点间消息重试间隔上限,单位:秒。 | Integer | 默认为5秒。 | 此数值代表节点每次重试发送消息的时间间隔上限,MindSpore会随机选择MS_RETRY_INTERVAL_LOWER和MS_RETRY_INTERVAL_UPPER之间的值作为间隔时间。此变量可以控制Scheduler节点的消息并发量。 |
| MS_DISABLE_HEARTBEAT | 关闭集群中节点间心跳业务。 | Integer | 默认开启心跳业务。 | 若设置为1,则关闭集群节点间心跳,此场景下Scheduler不会检测到Worker异常,集群不会被Scheduler控制退出。此变量可以降低Scheduler节点消息并发量。在使用`gdb attach`指令调试时,建议开启此环境变量。 |
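在排查组网失败问题时,可以在训练脚本中、调用init()之前,先检查上表中的关键环境变量是否按预期设置。下面是一段仅作示意的调试代码(非MindSpore自带功能,变量名以上表为准):

```python
import os

# 仅作调试示意:在调用init()之前检查动态组网所需的关键环境变量
required = ["MS_ROLE", "MS_SCHED_HOST", "MS_SCHED_PORT", "MS_WORKER_NUM"]
for name in required:
    value = os.getenv(name)
    print(f"{name} = {value}")
    if value is None:
        raise RuntimeError(f"环境变量 {name} 未设置,可能导致组网失败")

# MS_ROLE只能取上表列出的三种角色之一
assert os.getenv("MS_ROLE") in ("MS_SCHED", "MS_WORKER", "MS_PSERVER")
```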
- -> 环境变量`MS_SCHED_HOST`、`MS_SCHED_PORT`、`MS_WORKER_NUM`内容需保持一致,否则会由于各进程配置不一致导致组网失败。 - -## 操作实践 - -动态组网启动脚本在各硬件平台下一致,下面以Ascend为例演示如何编写启动脚本: - -> 您可以在这里下载完整的样例代码:[startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method)。 - -目录结构如下: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── run_dynamic_cluster.sh - ├── run_dynamic_cluster_1.sh - ├── run_dynamic_cluster_2.sh - ... -``` - -其中,`net.py`是定义网络结构和训练过程,`run_dynamic_cluster.sh`、`run_dynamic_cluster_1.sh`和`run_dynamic_cluster_2.sh`是执行脚本。 - -### 1. 准备Python训练脚本 - -这里以数据并行为例,训练一个MNIST数据集的识别网络。 - -首先指定运行模式、硬件设备等。与单卡脚本不同,并行脚本还需指定并行模式等配置项,并通过`init()`初始化HCCL、NCCL或MCCL通信域。此处未设置`device_target`,会自动指定为MindSpore包对应的后端硬件设备。 - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -然后构建如下网络: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -最后是数据集处理和定义训练过程: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. 准备启动脚本 - -#### 单机多卡 - -单机多卡启动脚本内容[run_dynamic_cluster.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster.sh)如下,以单机8卡为例: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! 
-f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# 循环启动8个Worker训练进程 -for((i=0;i<8;i++)); -do - export MS_WORKER_NUM=8 # 设置集群中Worker进程数量为8 - export MS_SCHED_HOST=127.0.0.1 # 设置Scheduler IP地址为本地环路地址 - export MS_SCHED_PORT=8118 # 设置Scheduler端口 - export MS_ROLE=MS_WORKER # 设置启动的进程为MS_WORKER角色 - export MS_NODE_ID=$i # 设置进程id,可选 - python ./net.py > device/worker_$i.log 2>&1 & # 启动训练脚本 -done - -# 启动1个Scheduler进程 -export MS_WORKER_NUM=8 # 设置集群中Worker进程数量为8 -export MS_SCHED_HOST=127.0.0.1 # 设置Scheduler IP地址为本地环路地址 -export MS_SCHED_PORT=8118 # 设置Scheduler端口 -export MS_ROLE=MS_SCHED # 设置启动的进程为MS_SCHED角色 -python ./net.py > device/scheduler.log 2>&1 & # 启动训练脚本 -``` - -> Scheduler和Worker进程的训练脚本内容和启动方式完全一致,这是因为在MindSpore已经差异化处理了两种角色内部流程。用户只需按照普通的训练方式拉起进程即可,无需按照角色修改Python代码。这是动态组网启动脚本在多硬件平台能够保持一致的原因之一。 - -执行如下指令,即可启动单机8卡训练网络: - -```bash -bash run_dynamic_cluster.sh -``` - -脚本会在后台运行,日志文件会保存到device目录下,结果保存在worker_*.log中,Loss结果如下: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -#### 多机多卡 - -多机训练场景下,需拆分启动脚本。下面以执行双机8卡训练为例,每台机器执行启动4个Worker: - -脚本[run_dynamic_cluster_1.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster_1.sh)在节点1上启动1个`Scheduler`进程以及4个`Worker`进程: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# 循环启动Worker1到Worker4,4个Worker训练进程 -for((i=0;i<4;i++)); -do - export MS_WORKER_NUM=8 # 设置集群中Worker进程数量为8 - export MS_SCHED_HOST= # 设置Scheduler IP地址为节点1 IP地址 - export MS_SCHED_PORT=8118 # 设置Scheduler端口 - export MS_ROLE=MS_WORKER # 设置启动的进程为MS_WORKER角色 - export MS_NODE_ID=$i # 设置进程id,可选 - python ./net.py > device/worker_$i.log 2>&1 & # 启动训练脚本 -done - -# 在节点1启动1个Scheduler进程 -export MS_WORKER_NUM=8 # 设置集群中Worker进程总数为8(包括其他节点进程) -export MS_SCHED_HOST= # 设置Scheduler IP地址为节点1 IP地址 -export MS_SCHED_PORT=8118 # 设置Scheduler端口 -export MS_ROLE=MS_SCHED # 设置启动的进程为MS_SCHED角色 -python ./net.py > device/scheduler.log 2>&1 & # 启动训练脚本 -``` - -脚本[run_dynamic_cluster_2.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/run_dynamic_cluster_2.sh)在节点2上启动`Worker5`到`Worker8`(无需执行Scheduler): - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! 
-f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf device -mkdir device -echo "start training" - -# 循环启动Worker5到Worker8,4个Worker训练进程 -for((i=4;i<8;i++)); -do - export MS_WORKER_NUM=8 # 设置集群中Worker进程总数为8(包括其他节点进程) - export MS_SCHED_HOST= # 设置Scheduler IP地址为节点1 IP地址 - export MS_SCHED_PORT=8118 # 设置Scheduler端口 - export MS_ROLE=MS_WORKER # 设置启动的进程为MS_WORKER角色 - export MS_NODE_ID=$i # 设置进程id,可选 - python ./net.py > device/worker_$i.log 2>&1 & # 启动训练脚本 -done -``` - -> 在多机器任务中,需要为每个主机节点设置不同的主机名,否则会出现报错`deivce id`越界。可参考[FAQ](https://www.mindspore.cn/docs/zh-CN/master/faq/distributed_parallel.html#q-多机场景使用动态组网或msrun启动分布式任务时报错device-id越界如何解决)。 -> -> 在多机任务中,`MS_WORKER_NUM`应当为集群中Worker节点总数。 -> -> 节点间网络需保持连通,可使用`telnet `指令测试本节点是否和已启动的Scheduler节点连通。 - -在节点1执行: - -```bash -bash run_dynamic_cluster_1.sh -``` - -在节点2执行: - -```bash -bash run_dynamic_cluster_2.sh -``` - -即可执行双机8卡分布式训练任务,结果应与单机多卡结果一致。 - -## 容灾恢复 - -动态组网支持数据并行下容灾恢复。在多卡数据并行训练场景下,发生进程异常退出,重新拉起对应进程对应的脚本后训练可继续,并且不影响精度收敛。容灾恢复配置和样例可参考[动态组网场景下故障恢复](https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/disaster_recover.html)教程。 - -## 安全认证 - -动态组网还支持**安全加密通道**特性,支持`TLS/SSL`协议,满足用户的安全性需求。默认情况下,安全加密通道是关闭的,若需要开启,则通过`set_ps_context`正确配置安全加密通道后,才能调用init(),否则初始化组网会失败。若想使用安全加密通道,请配置: - -`set_ps_context(config_file_path="/path/to/config_file.json", enable_ssl=True, client_password="123456", server_password="123456")` - -`config_file_path`指定的`config.json`配置文件需要添加如下字段: - -```json -{ - "server_cert_path": "server.p12", - "crl_path": "", - "client_cert_path": "client.p12", - "ca_cert_path": "ca.crt", - "cipher_list": "ECDHE-R SA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:DHE-DSS-AES256-GCM-SHA384:DHE-PSK-AES128-GCM-SHA256:DHE-PSK-AES256-GCM-SHA384:DHE-PSK-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-PSK-CHACHA20-POLY1305:DHE-RSA-AES128-CCM:DHE-RSA-AES256-CCM:DHE-RSA-CHACHA20-POLY1305:DHE-PSK-AES128-CCM:DHE-PSK-AES256-CCM:ECDHE-ECDSA-AES128-CCM:ECDHE-ECDSA-AES256-CCM:ECDHE-ECDSA-CHACHA20-POLY1305", - "cert_expire_warning_time_in_day": 90 -} -``` - -- `server_cert_path`:服务端包含了证书和秘钥的密文的p12文件(SSL专用证书文件)路径。 -- `crl_path`:吊销列表(用于区分无效不可信证书和有效可信证书)的文件路径。 -- `client_cert_path`:客户端包含了证书和秘钥的密文的p12文件(SSL专用证书文件)路径。 -- `ca_cert_path`:根证书路径。 -- `cipher_list`:密码套件(支持的SSL加密类型列表)。 -- `cert_expire_warning_time_in_day`:证书过期的告警时间。 - -p12文件中的秘钥为密文存储,在启动时需要传入密码,具体参数请参考Python API [mindspore.set_ps_context](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.set_ps_context.html#mindspore.set_ps_context)中的`client_password`以及`server_password`字段。 diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/mpirun.md b/docs/mindspore/source_zh_cn/model_train/parallel/mpirun.md deleted file mode 100644 index 64a66dad731e25060cd59fbeef8f51885a6f2013..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_zh_cn/model_train/parallel/mpirun.md +++ /dev/null @@ -1,219 +0,0 @@ -# mpirun启动 - -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/model_train/parallel/mpirun.md) - -## 概述 - -OpenMPI(Open Message Passing 
Interface)是一个开源的、高性能的消息传递编程库,用于并行计算和分布式内存计算。它通过在不同进程之间传递消息来实现并行计算,适用于许多科学计算和机器学习任务。使用OpenMPI进行并行训练,是一种通用的加速训练过程的方法,通过在计算集群或多核机器上充分利用并行计算资源来实现。OpenMPI在分布式训练的场景中,起到在Host侧同步数据以及进程间组网的功能。 - -与rank table启动不同的是,在Ascend硬件平台上通过OpenMPI的`mpirun`命令运行脚本,用户不需要配置`RANK_TABLE_FILE`环境变量。 - -> `mpirun`启动支持Ascend和GPU,此外还同时支持PyNative模式和Graph模式。 - -相关命令: - -1. `mpirun`启动命令如下,其中`DEVICE_NUM`是所在机器的GPU数量: - - ```bash - mpirun -n DEVICE_NUM python net.py - ``` - -2. `mpirun`还可以配置以下参数,更多配置可以参考[mpirun文档](https://www.open-mpi.org/doc/current/man1/mpirun.1.php): - - - `--output-filename log_output`:将所有进程的日志信息保存到`log_output`目录下,不同卡上的日志会按`rank_id`分别保存在`log_output/1/`路径下对应的文件中。 - - `--merge-stderr-to-stdout`:合并stderr到stdout的输出信息中。 - - `--allow-run-as-root`:如果通过root用户执行脚本,则需要加上此参数。 - - `-mca orte_abort_on_non_zero_status 0`:当一个子进程异常退出时,OpenMPI会默认abort所有的子进程,如果不想自动abort子进程,可以加上此参数。 - - `-bind-to none`:OpenMPI会默认给拉起的子进程指定可用的CPU核数,如果不想限制进程使用的核数,可以加上此参数。 - -> OpenMPI启动时会设置若干名为`OPMI_*`的环境变量,用户应避免在脚本中手动修改这些环境变量。 - -## 操作实践 - -`mpirun`启动脚本在Ascend和GPU硬件平台下一致,下面以Ascend为例演示如何编写启动脚本: - -> 您可以在这里下载完整的样例代码:[startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method)。 - -目录结构如下: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── hostfile - ├── run_mpirun_1.sh - ├── run_mpirun_2.sh - ... -``` - -其中,`net.py`是定义网络结构和训练过程,`run_mpirun_1.sh`、`run_mpirun_2.sh`是执行脚本,`hostfile`是配置多机多卡的文件。 - -### 1. 安装OpenMPI - -下载OpenMPI-4.1.4源码[openmpi-4.1.4.tar.gz](https://www.open-mpi.org/software/ompi/v4.1/)。参考[OpenMPI官网教程](https://www.open-mpi.org/faq/?category=building#easy-build)安装。 - -### 2. 准备Python训练脚本 - -这里以数据并行为例,训练一个MNIST数据集的识别网络。 - -首先指定运行模式、硬件设备等,与单卡脚本不同,并行脚本还需指定并行模式等配置项,并通过init初始化HCCL或NCCL通信。此处未设置`device_target`,会自动指定为MindSpore包对应的后端硬件设备。 - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -然后构建如下网络: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -最后是数据集处理和定义训练过程: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in 
data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 3. 准备启动脚本 - -#### 单机多卡 - -首先下载[MNIST](http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip)数据集,并解压到当前文件夹。 - -然后执行单机多卡启动脚本,以单机8卡为例: - -```bash -export DATA_PATH=./MNIST_Data/train/ -mpirun -n 8 --output-filename log_output --merge-stderr-to-stdout python net.py -``` - -日志文件会保存到`log_output`目录下,结果保存在`log_output/1/rank.*/stdout`中,结果如下: - -```text -epoch: 0, step: 0, loss is 2.3413472 -epoch: 0, step: 10, loss is 1.6298866 -epoch: 0, step: 20, loss is 1.3729795 -epoch: 0, step: 30, loss is 1.2199347 -epoch: 0, step: 40, loss is 0.85778403 -epoch: 0, step: 50, loss is 1.0849445 -epoch: 0, step: 60, loss is 0.9102987 -epoch: 0, step: 70, loss is 0.7571399 -epoch: 0, step: 80, loss is 0.7989929 -epoch: 0, step: 90, loss is 1.0189024 -epoch: 0, step: 100, loss is 0.6298542 -... -``` - -#### 多机多卡 - -在运行多机多卡训练前,首先需要按照如下配置: - -1. 保证每个节点上的OpenMPI、NCCL、Python以及MindSpore版本都相同。 - -2. 配置主机间免密登陆,可参考以下步骤进行配置: - - 每台主机确定同一个用户作为登陆用户(不推荐root); - - 执行`ssh-keygen -t rsa -P ""`生成密钥; - - 执行`ssh-copy-id DEVICE-IP`设置需要免密登陆的机器IP; - - 执行`ssh DEVICE-IP`,若不需要输入密码即可登录,则说明以上配置成功; - - 在所有机器上执行以上命令,确保两两互通。 - -配置成功后,就可以通过`mpirun`指令启动多机任务,目前有两种方式启动多机训练任务: - -- 通过`mpirun -H`方式。启动脚本如下: - - ```bash - export DATA_PATH=./MNIST_Data/train/ - mpirun -n 16 -H DEVICE1_IP:8,DEVICE2_IP:8 --output-filename log_output --merge-stderr-to-stdout python net.py - ``` - - 表示在ip为DEVICE1_IP和DEVICE2_IP的机器上分别起8个进程运行程序。在其中一个节点执行: - - ```bash - bash run_mpirun_1.sh - ``` - -- 通过`mpirun --hostfile`方式。为方便调试,建议用这种方法来执行多机多卡脚本。首先需要构造hostfile文件如下: - - ```text - DEVICE1 slots=8 - 192.168.0.1 slots=8 - ``` - - 每一行格式为`[hostname] slots=[slotnum]`,hostname可以是ip或者主机名。上例表示在DEVICE1上有8张卡;ip为192.168.0.1的机器上也有8张卡。 - - 双机16卡的执行脚本如下,需要传入变量`HOSTFILE`,表示hostfile文件的路径: - - ```bash - export DATA_PATH=./MNIST_Data/train/ - HOSTFILE=$1 - mpirun -n 16 --hostfile $HOSTFILE --output-filename log_output --merge-stderr-to-stdout python net.py - ``` - - 在其中一个节点执行: - - ```bash - bash run_mpirun_2.sh ./hostfile - ``` - -执行完毕后,日志文件会保存到log_output目录下,结果保存在log_output/1/rank.*/stdout中。 \ No newline at end of file diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md b/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md deleted file mode 100644 index bed2a70e9d699e1d3655dcfccd4c4091213b7f53..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md +++ /dev/null @@ -1,503 +0,0 @@ -# msrun启动 - -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/model_train/parallel/msrun_launcher.md) - -## 概述 - -`msrun`是[动态组网](https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/dynamic_cluster.html)启动方式的封装,用户可使用`msrun`,以单个命令行指令的方式在各节点拉起多进程分布式任务,并且无需手动设置[动态组网环境变量](https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/dynamic_cluster.html)。`msrun`同时支持`Ascend`,`GPU`和`CPU`后端。与`动态组网`启动方式一样,`msrun`无需依赖第三方库以及配置文件。 - -> - `msrun`在用户安装MindSpore后即可使用,可使用指令`msrun --help`查看支持参数。 -> - `msrun`支持`图模式`以及`PyNative模式`。 - -命令行参数列表: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
| 参数 | 功能 | 类型 | 取值 | 说明 |
| --- | --- | --- | --- | --- |
| --worker_num | 参与分布式任务的Worker进程总数。 | Integer | 大于0的整数。默认值为8。 | 所有节点上启动的Worker总数应当等于此参数:若总数大于此参数,多余的Worker进程会注册失败;若总数小于此参数,集群会在等待一段超时时间后,提示任务拉起失败并退出,超时时间窗大小可通过参数cluster_time_out配置。 |
| --local_worker_num | 当前节点上拉起的Worker进程数。 | Integer | 大于0的整数。默认值为8。 | 当此参数与worker_num保持一致时,代表所有Worker进程在本地执行,此场景下node_rank值会被忽略。 |
| --master_addr | 指定Scheduler的IP地址或者主机名。 | String | 合法的IP地址或者主机名。默认为IP地址127.0.0.1。 | msrun会自动检测在哪个节点拉起Scheduler进程,用户无需关心。若无法查找到对应的地址或主机名无法被DNS解析,训练任务会拉起失败。当前版本暂不支持IPv6地址。若传入主机名时,msrun会自动将其解析为IP地址,需要用户环境支持DNS服务。 |
| --master_port | 指定Scheduler绑定端口号。 | Integer | 1024~65535范围内的端口号。默认为8118。 | |
| --node_rank | 当前节点的索引。 | Integer | 可传入大于等于0的整数。在不传入值的情况下,默认值为-1。 | 单机多卡场景下,此参数会被忽略。多机多卡场景下,若不设置此参数,Worker进程的rank_id会被自动分配;若设置,则会按照索引为各节点上的Worker进程分配rank_id。若每个节点Worker进程数量不同,建议不配置此参数,以自动分配rank_id。 |
| --log_dir | Worker以及Scheduler日志输出路径。 | String | 文件夹路径。默认为当前目录。 | 若路径不存在,msrun会递归创建文件夹。日志格式如下:对于Scheduler进程,日志名为scheduler.log;对于Worker进程,日志名为worker_[rank].log,其中rank后缀与分配给Worker的rank_id一致,但在未设置node_rank且多机多卡场景下,它们可能不一致。建议执行grep -rn "Global rank id"指令查看各Worker的rank_id。 |
| --join | msrun是否等待Worker以及Scheduler退出。 | Bool | True或者False。默认为False。 | 若设置为False,msrun在拉起进程后会立刻退出,查看日志确认分布式任务是否正常执行。若设置为True,msrun会等待所有进程退出后,收集异常日志并退出。 |
| --cluster_time_out | 集群组网超时时间,单位为秒。 | Integer | 默认为600秒。 | 此参数代表在集群组网的等待时间。若超出此时间窗口依然没有worker_num数量的Worker注册成功,则任务拉起失败。 |
| --bind_core | 开启进程绑核。 | Bool | True或者False。默认为False。 | 若用户配置此参数,msrun会平均分配CPU核,将其绑定到拉起的分布式进程上。 |
| --sim_level | 设置单卡模拟编译等级。 | Integer | 默认为-1,即关闭单卡模拟编译功能。 | 若用户配置此参数,msrun只会拉起单进程模拟编译,不做算子执行。此功能通常用于调试大规模分布式训练并行策略,在编译阶段提前发现内存和策略问题。若设置为0,只做前端图编译;若设置为1,进一步执行后端图编译,在执行图阶段退出。 |
| --sim_rank_id | 单卡模拟编译的rank_id。 | Integer | 默认为0。 | 设置单卡模拟编译进程的rank_id。 |
| --rank_table_file | rank_table配置文件,只在昇腾平台下有效。 | String | rank_table配置文件路径,默认为空。 | 此参数代表昇腾平台下的rank_table配置文件,描述当前分布式集群。由于rank_table配置文件反映的是物理层面分布式集群信息,在使用该配置时,请确保对于当前进程可见的Device与rank_table配置保持一致。可通过环境变量ASCEND_RT_VISIBLE_DEVICES设置对于当前进程可见的Device。 |
| --worker_log_name | 设置worker日志名。 | String | worker日志文件名,默认为worker_[rank].log。 | 此参数代表支持用户配置worker日志名,并且支持分别通过{ip}和{hostname}在worker日志名中配置ip和hostname。worker日志名的后缀默认为rank。 |
| --tail_worker_log | 输出worker日志到控制台。 | String | 一个或多个与worker进程rank_id关联的整数。默认为-1。 | 此参数代表--join=True情况下,默认输出当前节点所有worker日志,并且支持用户指定一个或多个卡的worker日志输出到控制台。这个参数需要在[0, local_worker_num]范围内。 |
| task_script | 用户Python脚本。 | String | 合法的脚本路径。 | 一般情况下,此参数为python脚本路径,msrun会默认以python task_script task_script_args方式拉起进程。msrun还支持此参数为pytest,此场景下任务脚本及任务参数在参数task_script_args传递。 |
| task_script_args | 用户Python脚本的参数。 | | | 参数列表。例如:msrun --worker_num=8 --local_worker_num=8 train.py --device_target=Ascend --dataset_path=/path/to/dataset |
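上表`task_script`与`task_script_args`两行说明msrun会以`python task_script task_script_args`的方式拉起用户脚本,脚本参数原样透传。下面给出一个仅作示意的解析片段(脚本名train.py与参数名沿用表中示例,并非msrun规定的固定接口):

```python
# train.py:解析msrun透传的task_script_args(仅作示意)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--device_target", type=str, default="Ascend")
parser.add_argument("--dataset_path", type=str, default="/path/to/dataset")
# 使用parse_known_args,避免其他透传参数导致解析报错
args, _ = parser.parse_known_args()
print("device_target:", args.device_target)
print("dataset_path:", args.dataset_path)
```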
## 环境变量

下表是用户脚本中能够使用的环境变量,它们由`msrun`设置:

| 环境变量 | 功能 | 取值 |
| --- | --- | --- |
| MS_ROLE | 本进程角色。 | 当前版本msrun导出下面两个值:MS_SCHED,代表Scheduler进程;MS_WORKER,代表Worker进程。 |
| MS_SCHED_HOST | 用户指定的Scheduler的IP地址。 | 与参数--master_addr相同。 |
| MS_SCHED_PORT | 用户指定的Scheduler绑定端口号。 | 与参数--master_port相同。 |
| MS_WORKER_NUM | 用户指定的Worker进程总数。 | 与参数--worker_num相同。 |
| MS_TOPO_TIMEOUT | 集群组网超时时间。 | 与参数--cluster_time_out相同。 |
| RANK_SIZE | 用户指定的Worker进程总数。 | 与参数--worker_num相同。 |
| RANK_ID | 为Worker进程分配的rank_id。 | 多机多卡场景下,若没有设置--node_rank参数,RANK_ID只会在集群初始化后被导出。因此要使用此环境变量,建议正确设置--node_rank参数。 |
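结合上表,用户脚本可以直接读取msrun导出的RANK_ID与RANK_SIZE做简单的自检(注意:如表中所述,多机多卡且未设置--node_rank时,RANK_ID可能不会被提前导出;正式脚本中建议在init()之后通过get_rank()和get_group_size()获取)。以下为示意片段:

```python
import os

# 仅作示意:读取msrun导出的环境变量,核对当前进程在集群中的位置
rank_id = int(os.getenv("RANK_ID", "0"))
rank_size = int(os.getenv("RANK_SIZE", "1"))
print(f"this worker: rank {rank_id} / {rank_size}")

# 例如按rank切分一个长度为160的样本索引区间,与数据集分片(num_shards/shard_id)相对应
total_samples = 160
per_rank = total_samples // rank_size
start, stop = rank_id * per_rank, (rank_id + 1) * per_rank
print(f"sample index range handled by this rank: [{start}, {stop})")
```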
- -msrun作为动态组网启动方式的封装,所有用户可自定义配置的环境变量可参考[动态组网环境变量](https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/dynamic_cluster.html)。 - -## 启动分布式任务 - -启动脚本在各硬件平台下一致,下面以Ascend为例演示如何编写启动脚本: - -> 您可以在这里下载完整的样例代码:[startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method)。 - -目录结构如下: - -```text -└─ sample_code - ├─ startup_method - ├── msrun_1.sh - ├── msrun_2.sh - ├── msrun_single.sh - ├── net.py - ... -``` - -其中,`net.py`是定义网络结构和训练过程,`msrun_single.sh`是以`msrun`启动的单机多卡执行脚本;`msrun_1.sh`和`msrun_2.sh`是以`msrun`启动的多机多卡执行脚本,分别在不同节点上执行。 - -### 1. 准备Python训练脚本 - -这里以数据并行为例,训练一个MNIST数据集的识别网络。 - -首先指定运行模式、硬件设备等,与单卡脚本不同,并行脚本还需指定并行模式等配置项,并通过`init()`初始化HCCL、NCCL或MCCL通信域。此处未设置`device_target`,会自动指定为MindSpore包对应的后端硬件设备。 - -```python -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -然后构建如下网络: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -最后是数据集处理和定义训练过程: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. 准备启动脚本 - -> 对于msrun来说单机多卡和多机多卡执行指令类似,单机多卡只需将参数`worker_num`和`local_worker_num`保持相同即可,且单机多卡场景下无需设置`master_addr`,默认为`127.0.0.1`。 - -#### 单机多卡 - -下面以执行单机8卡训练为例: - -脚本[msrun_single.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_single.sh)使用msrun指令在当前节点拉起1个`Scheduler`进程以及8个`Worker`进程(无需设置`master_addr`,默认为`127.0.0.1`;单机无需设置`node_rank`): - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! 
-f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -执行指令: - -```bash -bash msrun_single.sh -``` - -即可执行单机8卡分布式训练任务,日志文件会保存到`./msrun_log`目录下,结果保存在`./msrun_log/worker_*.log`中,Loss结果如下: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -#### 多机多卡 - -下面以执行2机8卡训练,每台机器执行启动4个Worker为例: - -脚本[msrun_1.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_1.sh)在节点1上执行,使用msrun指令拉起1个`Scheduler`进程以及4个`Worker`进程,配置`master_addr`为节点1的IP地址(msrun会自动检测到当前节点IP与`master_addr`匹配而拉起`Scheduler`进程),通过`node_rank`设置当前节点为0号节点: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=4 --master_addr= --master_port=8118 --node_rank=0 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -脚本[msrun_2.sh](https://gitee.com/mindspore/docs/blob/master/docs/sample_code/startup_method/msrun_2.sh)在节点2上执行,使用msrun指令拉起4个`Worker`进程,配置`master_addr`为节点1的IP地址,通过`node_rank`设置当前节点为1号节点: - -```bash -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -rm -rf msrun_log -mkdir msrun_log -echo "start training" - -msrun --worker_num=8 --local_worker_num=4 --master_addr= --master_port=8118 --node_rank=1 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -> 节点2和节点1的指令差别在于`node_rank`不同。 - -在节点1执行: - -```bash -bash msrun_1.sh -``` - -在节点2执行: - -```bash -bash msrun_2.sh -``` - -即可执行2机8卡分布式训练任务,日志文件会保存到`./msrun_log`目录下,结果保存在`./msrun_log/worker_*.log`中,Loss结果如下: - -```text -epoch: 0, step: 0, loss is 2.3499548 -epoch: 0, step: 10, loss is 1.6682479 -epoch: 0, step: 20, loss is 1.4237018 -epoch: 0, step: 30, loss is 1.0437132 -epoch: 0, step: 40, loss is 1.0643986 -epoch: 0, step: 50, loss is 1.1021575 -epoch: 0, step: 60, loss is 0.8510884 -epoch: 0, step: 70, loss is 1.0581372 -epoch: 0, step: 80, loss is 1.0076828 -epoch: 0, step: 90, loss is 0.88950706 -... -``` - -## 多卡并行调试 - -在分布式环境中可以使用Python内置的调试器(pdb)来进行多卡并行的调试,通过对所有或者某一rank进行断点和同步操作来实现。在`msrun`参数设置为`--join=True`拉起worker进程后,所有worker进程的标准输入从`msrun`主进程继承,且标准输出通过`msrun`日志重定向功能输出到shell窗口。以下会给出如何在分布式环境下使用pdb的操作细节: - -### 1. 
启动pdb调试器 - -用户可以通过多种方式来启动pdb调试器,比如在Python训练脚本中插入`import pdb; pdb.set_trace()`或者`breakpoint()`来进行断点操作。 - -#### Python训练脚本 - -```python -import pdb -import mindspore as ms -from mindspore.communication import init - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -pdb.set_trace() -ms.set_seed(1) -``` - -#### 启动脚本 - -在启动脚本中,`msrun`的参数需要设置为`--join=True`来保证通过标准输入传递pdb命令,且通过标准输出显示调试情况。 - -```bash -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 net.py -``` - -### 2. 针对rank进行调试 - -在分布式环境中,用户可能需要针对某一rank进行调试,这可以通过在训练脚本中对特定的rank进行断点操作实现。比如在单机八卡任务中,仅针对rank 7进行断点调试: - -```python -import pdb -import mindspore as ms -from mindspore.communication import init, get_rank - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -if get_rank() == 7: - pdb.set_trace() -ms.set_seed(1) -``` - -> `mindspore.communication.get_rank()`接口需要在调用`mindspore.communication.init()`接口完成分布式初始化后才能正常获取rank信息,否则`get_rank()`默认返回0。 - -在对某一rank进行断点操作之后,会导致该rank进程执行停止在断点处等待后续交互操作,而其他未断点rank进程会继续运行,这样可能会导致快慢卡的情况,所以可以使用`mindspore.communication.comm_func.barrier()`算子和`mindspore.common.api._pynative_executor.sync()`来同步所有rank的运行,确保其他rank阻塞等待,且一旦调试的rank继续运行则其他rank的停止会被释放。比如在单机八卡任务中,仅针对rank 7进行断点调试且阻塞所有其他rank: - -```python -import pdb -import mindspore as ms -from mindspore.communication import init, get_rank -from mindspore.communication.comm_func import barrier -from mindspore.common.api import _pynative_executor - -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -if get_rank() == 7: - pdb.set_trace() -barrier() -_pynative_executor.sync() -ms.set_seed(1) -``` - -### 3. shell终端的标准输入和标准输出 - -`msrun`支持通过`--tail_worker_log`将特定的worker日志输出到shell的标准输出,为了使标准输出更利于观察,推荐使用此参数来指定输出需要断点调试的rank。比如在单机八卡任务中,仅针对rank 7进行断点调试: - -```bash -msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --log_dir=msrun_log --join=True --cluster_time_out=300 --tail_worker_log=7 net.py -``` - -> - `msrun`不使用`--tail_worker_log`参数的默认行为会把本节点所有worker的日志输出到shell的标准输出。 -> - 在同时调试多个rank时,一个pdb的指令会依次通过标准输入传递到一个rank上。 - -### 4. 
常用pdb调试命令 - -- `n` (next):执行当前行代码,跳到下一行代码。 -- `s` (step):进入当前行代码调用的函数,逐步调试。 -- `c` (continue):继续执行程序,直到下一个断点。 -- `q` (quit):退出调试器并终止程序执行。 -- `p` (print):打印变量的值。例如,`p variable`会显示变量`variable`的当前值。 -- `l` (list):显示当前代码的上下文。 -- `b` (break):设置断点,可以指定行号或函数名。 -- `h` (help):显示帮助信息,列出所有可用命令。 diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/rank_table.md b/docs/mindspore/source_zh_cn/model_train/parallel/rank_table.md deleted file mode 100644 index e9a62e8f03ac88bcc2f172882961fe6c4fdf176d..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_zh_cn/model_train/parallel/rank_table.md +++ /dev/null @@ -1,464 +0,0 @@ -# rank table启动 - -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/model_train/parallel/rank_table.md) - -## 概述 - -`rank table`启动是Ascend硬件平台独有的启动方式。该方式不依赖第三方库,采用单卡单进程运行方式,需要用户在脚本中创建与使用的卡数量一致的进程。该方法在多机下各节点的脚本一致,方便快速批量部署。 - -相关配置: - -`rank table`主要需要配置rank_table文件,以2卡环境配置文件`rank_table_2pcs.json`为例: - -```json -{ - "version": "1.0", - "server_count": "1", - "server_list": [ - { - "server_id": "10.*.*.*", - "device": [ - {"device_id": "0","device_ip": "192.1.*.6","rank_id": "0"}, - {"device_id": "1","device_ip": "192.2.*.6","rank_id": "1"}], - "host_nic_ip": "reserve" - } - ], - "status": "completed" -} -``` - -其中,需要根据实际训练环境修改的参数项有: - -- `server_count`表示参与训练的机器数量。 -- `server_id`表示当前机器的IP地址。 -- `device_id`表示卡的物理序号,即卡所在机器中的实际序号。 -- `device_ip`表示集成网卡的IP地址,可以在当前机器执行指令`cat /etc/hccn.conf`,`address_x`的键值就是网卡IP地址。 -- `rank_id`表示卡的逻辑序号,固定从0开始编号。 - -## 操作实践 - -> 您可以在这里下载完整的样例代码:[startup_method](https://gitee.com/mindspore/docs/tree/master/docs/sample_code/startup_method)。 - -目录结构如下: - -```text -└─ sample_code - ├─ startup_method - ├── net.py - ├── rank_table_8pcs.json - ├── rank_table_16pcs.json - ├── rank_table_cross_cluster_16pcs.json - ├── run_rank_table.sh - ├── run_rank_table_cluster.sh - ├── run_rank_table_cross_cluster.sh - ... -``` - -其中,`net.py`是定义网络结构和训练过程,`run_rank_table.sh`、`run_rank_table_cluster.sh`、`run_rank_table_cross_cluster.sh`是执行脚本。`rank_table_8pcs.json`、`rank_table_16pcs.json`、`rank_table_cross_cluster_16pcs.json`分别是8卡、16卡和跨集群16卡的rank_table配置文件。 - -### 1. 
准备Python训练脚本 - -这里以数据并行为例,训练一个MNIST数据集的识别网络。 - -首先指定运行模式、设备ID、硬件设备等。与单卡脚本不同,并行脚本还需指定并行模式等配置项,并通过init初始化HCCL通信。此处若不设置`device_target`,则会自动指定为MindSpore包对应的后端硬件设备。 - -```python -import os -import mindspore as ms -from mindspore.communication import init - -device_id = int(os.getenv('DEVICE_ID')) -ms.set_device(device_id=device_id) -ms.set_context(mode=ms.GRAPH_MODE) -ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True) -init() -ms.set_seed(1) -``` - -然后构建如下网络: - -```python -from mindspore import nn - -class Network(nn.Cell): - def __init__(self): - super().__init__() - self.flatten = nn.Flatten() - self.fc = nn.Dense(28*28, 10, weight_init="normal", bias_init="zeros") - self.relu = nn.ReLU() - - def construct(self, x): - x = self.flatten(x) - logits = self.relu(self.fc(x)) - return logits -net = Network() -``` - -最后是数据集处理和定义训练过程: - -```python -import os -from mindspore import nn -import mindspore as ms -import mindspore.dataset as ds -from mindspore.communication import get_rank, get_group_size - -def create_dataset(batch_size): - dataset_path = os.getenv("DATA_PATH") - rank_id = get_rank() - rank_size = get_group_size() - dataset = ds.MnistDataset(dataset_path, num_shards=rank_size, shard_id=rank_id) - image_transforms = [ - ds.vision.Rescale(1.0 / 255.0, 0), - ds.vision.Normalize(mean=(0.1307,), std=(0.3081,)), - ds.vision.HWC2CHW() - ] - label_transform = ds.transforms.TypeCast(ms.int32) - dataset = dataset.map(image_transforms, 'image') - dataset = dataset.map(label_transform, 'label') - dataset = dataset.batch(batch_size) - return dataset - -data_set = create_dataset(32) -loss_fn = nn.CrossEntropyLoss() -optimizer = nn.SGD(net.trainable_params(), 1e-2) - -def forward_fn(data, label): - logits = net(data) - loss = loss_fn(logits, label) - return loss, logits - -grad_fn = ms.value_and_grad(forward_fn, None, net.trainable_params(), has_aux=True) -grad_reducer = nn.DistributedGradReducer(optimizer.parameters) - -for epoch in range(10): - i = 0 - for data, label in data_set: - (loss, _), grads = grad_fn(data, label) - grads = grad_reducer(grads) - optimizer(grads) - if i % 10 == 0: - print("epoch: %s, step: %s, loss is %s" % (epoch, i, loss)) - i += 1 -``` - -### 2. 准备启动脚本 - -#### 单机多卡 - -`rank table`方式采用单卡单进程运行方式,即每张卡上运行1个进程,进程数量与使用的卡的数量一致。每个进程创建1个目录,用来保存日志信息以及算子编译信息。下面以使用8张卡的分布式训练脚本为例,演示如何运行脚本: - -```bash -RANK_SIZE=8 -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! 
-f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json -export RANK_SIZE=$RANK_SIZE - -for((i=1;i<${RANK_SIZE};i++)) -do - rm -rf device$i - mkdir device$i - cp ./net.py ./device$i - cd ./device$i - export DEVICE_ID=$i - export RANK_ID=$i - echo "start training for device $i" - env > env$i.log - python ./net.py > train$i.log 2>&1 & - cd ../ -done -``` - -分布式相关的环境变量有: - -- `RANK_TABLE_FILE`:组网信息文件的路径。 -- `DEVICE_ID`:当前卡在机器上的实际序号。 -- `RANK_ID`:当前卡的逻辑序号。 - -在当前路径配置好`rank_table_8pcs.json`后,执行以下指令: - -```bash -bash run_rank_table.sh -``` - -运行结束后,日志文件保存在`device0`、 `device1`等目录下,`env*.log`中记录了环境变量的相关信息,输出结果保存在`train*.log`中,示例如下: - -```text -epoch: 0, step: 0, loss is 2.3391366 -epoch: 0, step: 10, loss is 1.8047495 -epoch: 0, step: 20, loss is 1.2186875 -epoch: 0, step: 30, loss is 1.3065228 -epoch: 0, step: 40, loss is 1.0825632 -epoch: 0, step: 50, loss is 1.0281029 -epoch: 0, step: 60, loss is 0.8405618 -epoch: 0, step: 70, loss is 0.7346531 -epoch: 0, step: 80, loss is 0.688364 -epoch: 0, step: 90, loss is 0.51331174 -epoch: 0, step: 100, loss is 0.53782797 -... -``` - -#### 多机多卡 - -在Ascend环境下,跨机器的NPU单元的通信与单机内各个NPU单元的通信一样,依旧是通过HCCL进行通信,区别在于:单机内的NPU单元天然是互通的,而跨机器的则需要保证两台机器的网络是互通的。确认的方法如下: - -在1号服务器执行下述命令,会为每个设备配置2号服务器对应设备的`device ip`。例如,将1号服务器卡0的目标IP配置为2号服务器的卡0的IP。配置命令需要使用`hccn_tool`工具。[`hccn_tool`](https://support.huawei.com/enterprise/zh/ascend-computing/a300t-9000-pid-250702906?category=developer-documents)是一个HCCL的工具,由CANN包自带。 - -```bash -hccn_tool -i 0 -netdetect -s address 192.*.92.131 -hccn_tool -i 1 -netdetect -s address 192.*.93.131 -hccn_tool -i 2 -netdetect -s address 192.*.94.131 -hccn_tool -i 3 -netdetect -s address 192.*.95.131 -hccn_tool -i 4 -netdetect -s address 192.*.92.141 -hccn_tool -i 5 -netdetect -s address 192.*.93.141 -hccn_tool -i 6 -netdetect -s address 192.*.94.141 -hccn_tool -i 7 -netdetect -s address 192.*.95.141 -``` - -其中,`-i 0`指定设备ID;`-netdetect`指定网络检测对象IP属性;`-s address`表示设置属性为IP地址;`192.*.92.131`表示2号服务器的设备0的IP地址。接口命令可以[参考此处](https://support.huawei.com/enterprise/zh/doc/EDOC1100251947/8eff627f)。 - -在1号服务器上面执行完上述命令后,通过下述命令开始检测网络链接状态。在此使用`hccn_tool`的另一个功能,此功能的含义可以[参考此处](https://support.huawei.com/enterprise/zh/doc/EDOC1100251947/7d059b59)。 - -```bash -hccn_tool -i 0 -net_health -g -hccn_tool -i 1 -net_health -g -hccn_tool -i 2 -net_health -g -hccn_tool -i 3 -net_health -g -hccn_tool -i 4 -net_health -g -hccn_tool -i 5 -net_health -g -hccn_tool -i 6 -net_health -g -hccn_tool -i 7 -net_health -g -``` - -如果连接正常,对应的输出如下: - -```bash -net health status: Success -``` - -如果连接失败,对应的输出如下: - -```bash -net health status: Fault -``` - -在确认了机器之间的NPU单元的网络是通畅后,配置多机的json配置文件。本文档以16卡的配置文件为例进行介绍,详细的配置文件说明可参照本文档单机多卡部分的相关内容。 - -需要注意的是,在多机的json文件配置中,要求rank_id的排序,与server_id的字典序一致。 - -```json -{ - "version": "1.0", - "server_count": "2", - "server_list": [ - { - "server_id": "10.*.*.*", - "device": [ - {"device_id": "0","device_ip": "192.1.*.6","rank_id": "0"}, - {"device_id": "1","device_ip": "192.2.*.6","rank_id": "1"}, - {"device_id": "2","device_ip": "192.3.*.6","rank_id": "2"}, - {"device_id": "3","device_ip": "192.4.*.6","rank_id": "3"}, - {"device_id": "4","device_ip": "192.1.*.7","rank_id": "4"}, - {"device_id": "5","device_ip": "192.2.*.7","rank_id": "5"}, - {"device_id": "6","device_ip": "192.3.*.7","rank_id": "6"}, - {"device_id": "7","device_ip": 
"192.4.*.7","rank_id": "7"}], - "host_nic_ip": "reserve" - }, - { - "server_id": "10.*.*.*", - "device": [ - {"device_id": "0","device_ip": "192.1.*.8","rank_id": "8"}, - {"device_id": "1","device_ip": "192.2.*.8","rank_id": "9"}, - {"device_id": "2","device_ip": "192.3.*.8","rank_id": "10"}, - {"device_id": "3","device_ip": "192.4.*.8","rank_id": "11"}, - {"device_id": "4","device_ip": "192.1.*.9","rank_id": "12"}, - {"device_id": "5","device_ip": "192.2.*.9","rank_id": "13"}, - {"device_id": "6","device_ip": "192.3.*.9","rank_id": "14"}, - {"device_id": "7","device_ip": "192.4.*.9","rank_id": "15"}], - "host_nic_ip": "reserve" - } - ], - "status": "completed" -} -``` - -准备好配置文件后,可以进行分布式多机训练脚本的组织,以2机16卡为例,两台机器上编写的脚本与单机多卡的运行脚本类似,区别在于指定不同的rank_id变量。 - -```bash -RANK_SIZE=16 -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_16pcs.json -export RANK_SIZE=$RANK_SIZE - -RANK_START=$1 -DEVICE_START=0 - -for((i=0;i<=7;i++)); -do - export RANK_ID=$[i+RANK_START] - export DEVICE_ID=$[i+DEVICE_START] - rm -rf ./device_$RANK_ID - mkdir ./device_$RANK_ID - cp ./net.py ./device_$RANK_ID - cd ./device_$RANK_ID - env > env$i.log - python ./net.py >train$RANK_ID.log 2>&1 & -done -``` - -执行时,两台机器分别执行如下命令: - -```bash -# server0 -bash run_rank_table_cluster.sh 0 -# server1 -bash run_rank_table_cluster.sh 8 -``` - -其中,rank_table.json按照本章节展示的16卡的分布式json文件参考配置。 - -运行结束后,日志文件保存在`device_0`、 `device_1`等目录下,`env*.log`中记录了环境变量的相关信息,输出结果保存在`train*.log`中。 - -#### 跨集群 - -对于如今的大模型而言,使用计算集群进行训练已经成为一种常态。然而,随着模型规模的不断提升,单一集群的资源难以满足模型训练所需的显存要求。因此,支持跨集群通信成为了训练超大规模模型的前提。 - -目前,昇腾硬件的HCCL通信库暂不支持跨集群通信。因此,MindSpore框架提供了一套跨集群通信库,使得不同集群的NPU之间能够实现高效通信。借助这一通信库,用户可以突破单一集群的显存限制,实现超大规模模型的跨集群并行训练。 - -目前,MindSpore框架仅需在多机多卡的json配置文件中添加跨集群的`cluster_list`配置项即可开启这一功能,本文档同样以2机16卡(假设两个机器不在同一集群)配置文件为例,介绍跨集群相关配置项的编写方法,详细的配置文件说明可以参照本文档单机多卡部分的介绍。 - -```json -{ - "version": "1.0", - "server_count": "2", - "server_list": [ - { - "server_id": "server_0_10.*.*.*", - "server_ip": "10.*.*.*", - "device": [ - {"device_id": "0", "device_ip": "192.1.*.6", "rank_id": "0", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "1", "device_ip": "192.2.*.6", "rank_id": "1", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "2", "device_ip": "192.3.*.6", "rank_id": "2", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "3", "device_ip": "192.4.*.6", "rank_id": "3", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "4", "device_ip": "192.1.*.7", "rank_id": "4", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "5", "device_ip": "192.2.*.7", "rank_id": "5", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "6", "device_ip": "192.3.*.7", "rank_id": "6", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "7", "device_ip": "192.4.*.7", "rank_id": "7", "dpu_ip": "8.2.17.60", "numa_id": ""}], - "host_nic_ip": "reserve", - "pod_ip": "127.0.0.1" - }, - { - "server_id": "server_1_10.*.*.*", - "server_ip": "10.*.*.*", - "device": [ - {"device_id": "0", "device_ip": "192.1.*.8", "rank_id": "8", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "1", "device_ip": "192.2.*.8", "rank_id": "9", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "2", "device_ip": "192.3.*.8", "rank_id": "10", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "3", 
"device_ip": "192.4.*.8", "rank_id": "11", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "4", "device_ip": "192.1.*.9", "rank_id": "12", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "5", "device_ip": "192.2.*.9", "rank_id": "13", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "6", "device_ip": "192.3.*.9", "rank_id": "14", "dpu_ip": "8.2.17.60", "numa_id": ""}, - {"device_id": "7", "device_ip": "192.4.*.9", "rank_id": "15", "dpu_ip": "8.2.17.60", "numa_id": ""}], - "host_nic_ip": "reserve", - "pod_ip": "127.0.0.1" - } - ], - "cluster_list": [ - { - "cluster_id": "cluster_0", - "network_type": "ROCE", - "az_id": "az_0", - "region_id": "region_0", - "server_list": [ - { - "server_id": "server_0_10.*.*.*" - } - ] - }, - { - "cluster_id": "cluster_1", - "network_type": "ROCE", - "az_id": "az_1", - "region_id": "region_1", - "server_list": [ - { - "server_id": "server_1_10.*.*.*" - } - ] - } - ], - "status": "completed" -} -``` - -其中,跨集群需要根据实际训练环境添加和修改的参数项有: - -- `server_id`表示当前机器的全局唯一标识。 -- `server_ip`表示当前机器的IP地址。 -- `dpu_ip`表示卡在租户VPC内的虚拟IP地址,用于跨集群通信。 -- `numa_id`表示卡在当前机器上NUMA亲和的CPU核序号。 -- `cluster_id`表示集群的全局唯一标识。 -- `network_type`表示集群内的机器间的网络类型,目前都是"ROCE"。 -- `az_id`表示集群所在的AZ id。 -- `server_list`表示当前集群包含的机器列表。 - -准备好配置文件后,跨集群的分布式训练脚本与本文档多机多卡的分布式训练脚本保持一致,以2集群16卡为例,两个集群的两台机器上编写的脚本与多机多卡的运行脚本相同,区别在于指定不同的rank_id变量。 - -```bash -RANK_SIZE=16 -EXEC_PATH=$(pwd) -if [ ! -d "${EXEC_PATH}/MNIST_Data" ]; then - if [ ! -f "${EXEC_PATH}/MNIST_Data.zip" ]; then - wget http://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/MNIST_Data.zip - fi - unzip MNIST_Data.zip -fi -export DATA_PATH=${EXEC_PATH}/MNIST_Data/train/ - -export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_2_cluster_16pcs.json -export RANK_SIZE=$RANK_SIZE - -RANK_START=$1 -DEVICE_START=0 - -for((i=0;i<=7;i++)); -do - export RANK_ID=$[i+RANK_START] - export DEVICE_ID=$[i+DEVICE_START] - rm -rf ./device_$RANK_ID - mkdir ./device_$RANK_ID - cp ./net.py ./device_$RANK_ID - cd ./device_$RANK_ID - env > env$i.log - python ./net.py >train$RANK_ID.log 2>&1 & -done -``` - -执行时,两个集群中的两台机器分别执行如下命令: - -```bash -# server0 -bash run_rank_table_cross_cluster.sh 0 -# server1 -bash run_rank_table_cross_cluster.sh 8 -``` - -其中,`rank_table_cross_cluster_16pcs.json`按照本章节展示的2集群16卡的跨集群分布式json文件参考配置,每个集群的每台机器上使用的`rank_table_cross_cluster_16pcs.json`配置需要保持一致。 - -运行结束后,日志文件保存在各个集群中每台机器的`device_0`、 `device_1`等目录下,`env*.log`中记录了环境变量的相关信息,输出结果保存在`train*.log`中。 diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst b/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst deleted file mode 100644 index de40fd42a5f5aa969f564f0ea4e9689aaac11088..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst +++ /dev/null @@ -1,42 +0,0 @@ -分布式并行启动方式 -============================ - -.. image:: https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg - :target: https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst - :alt: 查看源文件 - -.. 
diff --git a/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst b/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst
deleted file mode 100644
index de40fd42a5f5aa969f564f0ea4e9689aaac11088..0000000000000000000000000000000000000000
--- a/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst
+++ /dev/null
@@ -1,42 +0,0 @@
-Distributed Parallel Startup Methods
-====================================
-
-.. image:: https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg
-    :target: https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/model_train/parallel/startup_method.rst
-    :alt: View Source
-
-.. toctree::
-    :maxdepth: 1
-    :hidden:
-
-    msrun_launcher
-    dynamic_cluster
-    mpirun
-    rank_table
-
-Startup Methods
-----------------------
-
-Currently, GPU, Ascend and CPU each support multiple startup methods. There are four main ones: ``msrun``, dynamic cluster, ``mpirun`` and ``rank table``:
-
-- `msrun <https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/msrun_launcher.html>`_: ``msrun`` is an encapsulation of dynamic cluster. It allows users to launch a distributed task on each node with a single command line and can be used as soon as MindSpore is installed (a minimal launch example is sketched at the end of this page). This method does not depend on third-party libraries or configuration files, provides disaster recovery, offers good security, and supports all three hardware platforms. It is the recommended startup method.
-- `Dynamic cluster <https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/dynamic_cluster.html>`_: dynamic cluster requires the user to start multiple processes manually and export environment variables; it is the implementation underlying ``msrun``. It is recommended for the Parameter Server training mode; for other distributed scenarios ``msrun`` is recommended.
-- `mpirun <https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/mpirun.html>`_: this method depends on the open-source library OpenMPI. The launch command is simple, but multi-machine training requires passwordless SSH login between every pair of machines. It is recommended for users with OpenMPI experience.
-- `rank table <https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/rank_table.html>`_: this method is used on the Ascend hardware platform and does not depend on third-party libraries. After the rank_table file is configured manually, the parallel program can be started through a script; the script is identical across machines, which is convenient for batch deployment.
-
-.. warning::
-    The `rank_table` startup method will be deprecated in MindSpore 2.4.
-
-The hardware support for the four startup methods is shown in the following table:
-
-+--------------------+---------------+---------------+---------------+
-|                    | GPU           | Ascend        | CPU           |
-+====================+===============+===============+===============+
-| ``msrun``          | Supported     | Supported     | Supported     |
-+--------------------+---------------+---------------+---------------+
-| Dynamic cluster    | Supported     | Supported     | Supported     |
-+--------------------+---------------+---------------+---------------+
-| ``mpirun``         | Supported     | Supported     | Not supported |
-+--------------------+---------------+---------------+---------------+
-| ``rank table``     | Not supported | Supported     | Not supported |
-+--------------------+---------------+---------------+---------------+
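-
-As a quick orientation for the recommended method, the following is a minimal sketch of an ``msrun`` launch of a training script (assumed here to be ``net.py``) on a single machine with 8 devices. It is not part of the original page; the flag spellings follow the ``msrun`` launcher documentation and should be verified against the installed MindSpore version.
-
-.. code-block:: bash
-
-    # Start 8 workers on this node; msrun spawns the Scheduler and Worker processes itself.
-    msrun --worker_num=8 --local_worker_num=8 --master_addr=127.0.0.1 --master_port=8118 --join=True net.py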