From ba0fef7aab3814064908a77b128f5c239b135209 Mon Sep 17 00:00:00 2001
From: Jack20
Date: Tue, 9 Feb 2021 12:25:58 +0800
Subject: [PATCH] Update use_on_the_cloud.md(en)

---
 .../advanced_use/use_on_the_cloud.md | 297 +++++++++++++++++-
 1 file changed, 295 insertions(+), 2 deletions(-)

diff --git a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
index 0ae46ab51b..13c7abbdce 100644
--- a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
+++ b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
@@ -1,5 +1,298 @@
 # Use MindSpore on the Cloud
-No English version available right now, welcome to contribute.
-

[TOC]

## Overview

ModelArts is a one-stop AI development platform for developers provided by Huawei Cloud. It integrates the Ascend AI Processor resource pool, on which users can experience MindSpore.

Taking ResNet-50 as an example, this tutorial briefly describes how to use MindSpore on ModelArts to complete a training task.

## Preparation Work

### Preparations for Using ModelArts

Refer to the "Preparation Work" part of the ModelArts tutorial to complete account registration, ModelArts configuration, and bucket creation.
> The ModelArts tutorial page provides abundant ModelArts tutorials. Refer to its "Preparation Work" section to complete the ModelArts preparation.

### Obtain Ascend AI Processor Resources on the Cloud

Make sure that your account has been granted the open beta qualification for the ModelArts Ascend cluster service on Huawei Cloud. If not, you can submit an application on [ModelArts Huawei Cloud](https://console.huaweicloud.com/modelarts/?region=cn-north-4#/dashboard/applyModelArtsAscend910Beta).

### Data Preparation

ModelArts uses the Object Storage Service (OBS) to store data, so the dataset needs to be uploaded to OBS before the training task starts. This example uses the CIFAR-10 dataset in binary format.

1. Download the CIFAR-10 dataset and decompress it.

    > The CIFAR-10 download page provides three download links; this example uses the CIFAR-10 binary version.

2. Create an OBS bucket (for example, `ms-dataset`), create a data directory in the bucket (for example, `cifar-10`), and upload the CIFAR-10 data to the data directory according to the following structure. You can upload the files through the OBS console or, as sketched after this list, with a MoXing call.

    ```text
    └─obs/ms-dataset/cifar-10
        ├─train
        │      data_batch_1.bin
        │      data_batch_2.bin
        │      data_batch_3.bin
        │      data_batch_4.bin
        │      data_batch_5.bin
        │
        └─eval
               test_batch.bin
    ```
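If the MoXing library is available in your environment (for example, in a ModelArts notebook), the upload can also be done with the same `copy_parallel` interface that is used later in this tutorial. This is only a sketch: the local path is a placeholder, and the bucket and directory names are the examples used above.

```python
# Sketch only: assumes MoXing is installed and the decompressed CIFAR-10
# binaries are laid out locally as shown in the directory tree above.
import moxing as mox

# Recursively copy the local directory tree to the OBS data directory.
mox.file.copy_parallel(src_url='/path/to/cifar-10',
                       dst_url='s3://ms-dataset/cifar-10')
```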
### Prepare the Execution Script

Create a new OBS bucket (for example, `resnet50-train`), create a code directory in the bucket (for example, `resnet50_cifar10_train`), and upload all the scripts used in this example to the code directory:
> The scripts train a ResNet-50 network on the CIFAR-10 dataset and verify the accuracy after training. They can run training tasks on either `1*Ascend` or `8*Ascend` resources in ModelArts.
>
> Note that the script version must be consistent with the MindSpore version selected in the "Create a training task" step. For example, if you use the scripts provided in the MindSpore 1.1 tutorial, you need to select the MindSpore 1.1 engine when creating the training job.

To facilitate the subsequent creation of training jobs, first create a training output directory and a log output directory. The directory structure created in this example is as follows:

```text
└─obs/resnet50-train
    ├─resnet50_cifar10_train
    │      dataset.py
    │      resnet50_train.py
    │
    ├─output
    └─log
```

## Run MindSpore scripts in ModelArts through simple adaptation

The scripts provided in the section "Prepare the Execution Script" can be run in ModelArts directly. If you just want to quickly experience training ResNet-50 on the CIFAR-10 dataset, you can skip this chapter. If you need to run custom MindSpore scripts or other MindSpore sample code in ModelArts, refer to this chapter to make simple adaptations to the MindSpore code.

### Adapt Script Parameters

1. Scripts running in ModelArts must define the `data_url` and `train_url` parameters, which correspond to the data storage path (an OBS path) and the training output path (an OBS path) respectively.

    ```python
    import argparse

    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    ```

2. The ModelArts interface also supports passing values to other parameters defined in the script, which is described in detail in the chapter "Create a training task". A combined parsing sketch is given after this list.

    ```python
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    ```
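Putting the two points above together, a minimal argument-parsing sketch could look as follows. Note that the sample scripts later in this tutorial call `parse_known_args()` rather than `parse_args()`; this is standard `argparse` behaviour that tolerates any arguments not declared below instead of raising an error.

```python
import argparse

parser = argparse.ArgumentParser(description='ResNet-50 train.')
# OBS paths filled in from the "Data storage location" and "Training output location" fields
parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
# a user-defined parameter passed via "Add operation parameter"
parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')

# parse_known_args() returns (known_args, unknown_args); undeclared arguments are ignored
args_opt, unknown = parser.parse_known_args()
print(args_opt.data_url, args_opt.train_url, args_opt.epoch_size)
```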
### Adapt to OBS data

MindSpore does not provide an interface for accessing OBS data directly, so the script needs to interact with OBS through the APIs provided by MoXing. ModelArts training scripts are executed in a container, and the `/cache` directory is usually used as the container's local data path.
> Huawei Cloud MoXing provides a rich set of APIs. In this example, only the `copy_parallel` interface is needed.

1. Download the data stored in OBS to the execution container.

    ```python
    import moxing as mox
    mox.file.copy_parallel(src_url='s3://dataset_url/', dst_url='/cache/data_path')
    ```

2. Upload the training output from the container to OBS.

    ```python
    import moxing as mox
    mox.file.copy_parallel(src_url='/cache/output_path', dst_url='s3://output_url/')
    ```

### Adapt to the 8-card training task

If the script needs to run in an `8*Ascend` environment, the dataset creation code and the local data path need to be adapted, and the distributed strategy needs to be configured. By reading the two environment variables `DEVICE_ID` and `RANK_SIZE`, users can build a training script that works for both the `1*Ascend` and `8*Ascend` specifications.

1. Local path adaptation.

    ```python
    import os

    device_num = int(os.getenv('RANK_SIZE'))
    device_id = int(os.getenv('DEVICE_ID'))
    # define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        # define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))
    ```

2. Dataset adaptation.

    ```python
    import os
    import mindspore.dataset.engine as de

    device_id = int(os.getenv('DEVICE_ID'))
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        # create train data for the 1*Ascend situation
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        # create train data for the 8*Ascend situation: shard the data across devices
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    ```

3. Configure the distributed strategy.

    ```python
    import os
    from mindspore import context
    from mindspore.context import ParallelMode

    device_num = int(os.getenv('RANK_SIZE'))
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    ```

### Sample Code

Combining the three points above, we make simple adaptations to the MindSpore script, taking the following pseudo-code as an example:

Original MindSpore script:

```python
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    train_dataset = create_dataset(args.local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--local_data_path', required=True, default=None, help='Location of data.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')

    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)
```

Adapted MindSpore script:

```python
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

# adapt to cloud: used for downloading data from OBS
import moxing as mox

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    # adapt to cloud: define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        # adapt to cloud: define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))

    # adapt to cloud: download data from OBS to the local location
    print('Download data.')
    mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data_path)

    train_dataset = create_dataset(local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    # adapt to cloud: get OBS data path
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    # adapt to cloud: get OBS output path
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)
```
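The adapted pseudo-code above only shows downloading the input data. In a complete script, the training outputs (for example, checkpoint files) are usually written to a local directory under `/cache` and copied back to the OBS path given by `train_url` after training finishes. The following is only a sketch of that final step; the local output directory name is a hypothetical example.

```python
import moxing as mox

def upload_outputs(train_url, local_output_path='/cache/train_output'):
    # adapt to cloud: copy training results from the container back to the OBS output path
    mox.file.copy_parallel(src_url=local_output_path, dst_url=train_url)

# typically called at the end of resnet50_train(), for example:
# upload_outputs(args.train_url)
```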
## Create a training task

After the data and the execution script are prepared, you need to create a training job to actually run the MindSpore script. Users who use ModelArts for the first time can learn how to create a training job on ModelArts from this chapter.

### Enter the ModelArts console

Open the Huawei Cloud ModelArts home page and click "Enter console" on the page.

### Create a training job using a common framework

The ModelArts introductory tutorial shows how to create a training job using a common framework.

### Create a training job using MindSpore as the common framework

Taking the training scripts and data used in this tutorial as an example, the following describes in detail how to configure each field on the training job creation page:

1. `Algorithm source`: select `Common framework`, then set `AI Engine` to `Ascend-Powered-Engine` and the required MindSpore version (the screenshots in this example use `Mindspore-0.5-python3.7-aarch64`; make sure the scripts match the selected version).

2. `Code directory`: select the code directory created in the OBS bucket in advance. For `Startup file`, select the startup script under the code directory.

3. `Source of data`: select `Data storage location` and fill in the OBS location of the CIFAR-10 dataset.

4. `Operating parameters`: `Data storage location` and `Training output location` correspond to the running parameters `data_url` and `train_url`. Select `Add operation parameter` to pass values to other parameters defined in the script, such as `epoch_size`.

5. `Resource Pool`: select `Public Resource Pool > Ascend`.

6. `Resource Pool > Specifications`: select `Ascend: 1 * Ascend 910 CPU:24 core 96GiB` or `Ascend: 8 * Ascend 910 CPU:192 core 768GiB`, which represent the single-node single-card and single-node eight-card specifications respectively.

The following figures show a training job created with MindSpore as the common framework:

![cloud_train_job1](../../source_zh_cn/advanced_use/images/cloud_train_job1.png)

![cloud_train_job2](../../source_zh_cn/advanced_use/images/cloud_train_job2.png)

## Check the running results

1. You can view the running logs on the training job page.

    When the ResNet-50 training task runs with the `8*Ascend` specification, the total number of epochs is 92, the accuracy is about 92%, and about 12,000 images are trained per second. The logs are shown in the following figures.

    ![train_log_8_Ascend_clu](../../source_zh_cn/advanced_use/images/train_log_8_Ascend_clu.png)

    ![train_log_8_Ascend](../../source_zh_cn/advanced_use/images/train_log_8_Ascend.png)

    When the ResNet-50 training task runs with the `1*Ascend` specification, the total number of epochs is 92, the accuracy is about 95%, and about 1,800 images are trained per second. The log is shown in the following figure.

    ![train_log_1_Ascend](../../source_zh_cn/advanced_use/images/train_log_1_Ascend.png)

2. If a log path was specified when the training job was created, you can download the log files from OBS and view them.
--
Gitee