From 54abadcdb34df13578d9019656094501c2a0b332 Mon Sep 17 00:00:00 2001
From: sgyt20
Date: Mon, 8 Feb 2021 20:41:07 +0800
Subject: [PATCH 1/4] update tutorials/training/source_en/advanced_use/use_on_the_cloud.md.

---
 .../advanced_use/use_on_the_cloud.md          | 318 +++++++++++++++++-
 1 file changed, 317 insertions(+), 1 deletion(-)

diff --git a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
index 0ae46ab51b..7a908475d4 100644
--- a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
+++ b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
@@ -1,5 +1,321 @@
 # Use MindSpore on the Cloud

-No English version available right now, welcome to contribute.

- [Use MindSpore on the Cloud](#use-mindspore-on-the-cloud)
  - [Summary](#summary)
  - [Preparatory work](#preparatory-work)
    - [Preparation for using ModelArts](#preparation-for-using-modelarts)
    - [Obtaining Ascend AI processor resources on the cloud](#obtaining-ascend-ai-processor-resources-on-the-cloud)
    - [Data preparation](#data-preparation)
    - [Execution script preparation](#execution-script-preparation)
  - [Running a MindSpore script on ModelArts with simple adaptation](#running-a-mindspore-script-on-modelarts-with-simple-adaptation)
    - [Adapting script parameters](#adapting-script-parameters)
    - [Adapting to OBS data](#adapting-to-obs-data)
    - [Adapting to the 8-device training task](#adapting-to-the-8-device-training-task)
    - [Sample code](#sample-code)
  - [Creating a training job](#creating-a-training-job)
    - [Accessing the ModelArts console](#accessing-the-modelarts-console)
    - [Creating a training job with a frequently-used framework](#creating-a-training-job-with-a-frequently-used-framework)
    - [Creating a training job with MindSpore as the framework](#creating-a-training-job-with-mindspore-as-the-framework)
  - [Viewing the running result](#viewing-the-running-result)

## Summary

ModelArts is a one-stop AI development platform for developers provided by HUAWEI CLOUD. It integrates an Ascend AI processor resource pool, on which users can experience MindSpore.

Taking ResNet-50 as an example, this tutorial briefly shows how to complete a training task with MindSpore on ModelArts.

## Preparatory work

### Preparation for using ModelArts

Follow the "Preparations" part of the ModelArts tutorials to register an account, configure ModelArts, and create an OBS bucket.
> The ModelArts documentation provides a rich set of tutorials; refer to its "Preparations" part to complete this setup.

### Obtaining Ascend AI processor resources on the cloud

Make sure that your account is qualified for the open beta of the ModelArts Ascend cluster service on HUAWEI CLOUD. You can submit an application at [ModelArts HUAWEI CLOUD](https://console.huaweicloud.com/modelarts/?region=cn-north-4#/dashboard/applyModelArtsAscend910Beta).

### Data preparation

ModelArts uses the Object Storage Service (OBS) for data storage, so the data must be uploaded to OBS before the training task starts. This example uses the CIFAR-10 dataset in binary format.

1. Download the CIFAR-10 dataset and decompress it.

    > The CIFAR-10 download page provides three download links; this example uses the CIFAR-10 binary version.

2. Create an OBS bucket of your own (for example, `ms-dataset`), create a data directory in it (for example, `cifar-10`), and upload the CIFAR-10 data to the directory with the following structure.

    ```text
    └─obs/ms-dataset/cifar-10
        ├─train
        │      data_batch_1.bin
        │      data_batch_2.bin
        │      data_batch_3.bin
        │      data_batch_4.bin
        │      data_batch_5.bin
        │
        └─eval
               test_batch.bin
    ```

### Execution script preparation

Create another OBS bucket of your own (for example, `resnet50-train`), create a code directory in it (for example, `resnet50_cifar10_train`), and upload all of the sample scripts to the code directory:
> The scripts train ResNet-50 on the CIFAR-10 dataset and validate the accuracy after training finishes. The scripts can run the training task on ModelArts with either the `1*Ascend` or the `8*Ascend` specification.
>
> Note that the version of the scripts must match the MindSpore version selected in the "Creating a training job" step. For example, if you use the scripts provided with the MindSpore 1.1 tutorial, select the 1.1 MindSpore engine when creating the training job.

To simplify creating the training job later, also create a training output directory and a log output directory first. The directory structure used in this example is as follows:

```text
└─obs/resnet50-train
    ├─resnet50_cifar10_train
    │      dataset.py
    │      resnet50_train.py
    │
    ├─output
    └─log
```

## Running a MindSpore script on ModelArts with simple adaptation

The scripts provided in "Execution script preparation" can run on ModelArts as they are; skip this chapter if you only want to experience ResNet-50 training on CIFAR-10 quickly. To run a custom MindSpore script, or other MindSpore sample code, on ModelArts, adapt the MindSpore code as described in this chapter.

### Adapting script parameters

1. A script running on ModelArts must define the parameters `data_url` and `train_url`, which correspond to the data storage path (an OBS path) and the training output path (an OBS path), respectively.

    ```python
    import argparse

    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    ```

2. The ModelArts UI can also pass values to any other parameter of the script, which the chapter "Creating a training job" describes in detail; a combined parsing sketch follows this list.

    ```python
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    ```
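Putting the two pieces together, the argument handling can be wrapped in a small helper. The sketch below is illustrative (the `parse_args` name is not part of the sample scripts); it uses `parse_known_args`, as the sample code later in this chapter does, so that any extra arguments appended by the platform do not make the script fail.

```python
import argparse

def parse_args():
    """Parse the arguments that ModelArts passes to the boot file."""
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    # parse_known_args tolerates extra arguments injected by the platform
    args, unknown = parser.parse_known_args()
    return args
```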
### Adapting to OBS data

MindSpore does not yet provide an interface for reading OBS data directly, so scripts interact with OBS through the API provided by MoXing. A ModelArts training script executes inside a container, and the `/cache` directory is usually chosen as the container's local data path. A sketch combining both transfer directions follows the two steps below.
> HUAWEI CLOUD MoXing provides a rich set of APIs; this example only needs the `copy_parallel` interface.

1. Download the data stored in OBS to the execution container.

    ```python
    import moxing as mox
    mox.file.copy_parallel(src_url='s3://dataset_url/', dst_url='/cache/data_path')
    ```

2. Upload the training output from the container to OBS.

    ```python
    import moxing as mox
    mox.file.copy_parallel(src_url='/cache/output_path', dst_url='s3://output_url/')
    ```
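Combining the two directions gives a minimal run wrapper, sketched below. The names `run_with_obs` and `train_fn` are illustrative and not part of the sample scripts; the `finally` clause ensures that whatever output exists is still copied back to OBS even if training raises an error.

```python
import moxing as mox

def run_with_obs(args, train_fn):
    """Stage data in from OBS, run training, and always sync outputs back.

    `train_fn(local_data, local_output)` stands for the user's training entry.
    """
    local_data, local_output = '/cache/data', '/cache/output'
    mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data)
    try:
        train_fn(local_data, local_output)
    finally:
        # upload results (checkpoints, logs) even if training failed halfway
        mox.file.copy_parallel(src_url=local_output, dst_url=args.train_url)
```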
### Adapting to the 8-device training task

To run a script on the `8*Ascend` specification, you need to adapt the dataset-creation code and the local data path, and configure a distributed strategy. By reading the two environment variables `DEVICE_ID` and `RANK_SIZE`, you can build a single training script that works with both the `1*Ascend` and the `8*Ascend` specification. The three adaptations are combined into one helper in the sketch after this list.

1. Adapt the local data path.

    ```python
    import os

    device_num = int(os.getenv('RANK_SIZE'))
    device_id = int(os.getenv('DEVICE_ID'))
    # define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        # define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))
    ```

2. Adapt the dataset creation.

    ```python
    import os
    import mindspore.dataset.engine as de

    device_id = int(os.getenv('DEVICE_ID'))
    device_num = int(os.getenv('RANK_SIZE'))
    if device_num == 1:
        # create train data for the 1*Ascend case
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        # create train data for the 8*Ascend case: shard the dataset across devices
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    ```

3. Configure the distributed strategy.

    ```python
    import os
    from mindspore import context
    from mindspore.context import ParallelMode

    device_num = int(os.getenv('RANK_SIZE'))
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    ```
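The sketch below folds the environment handling and the distributed setup into a single helper. It is only a sketch under the assumptions above — the `init_device` name is illustrative, and the `'0'`/`'1'` defaults let the same script run where the variables are absent (for example, local debugging).

```python
import os
from mindspore import context
from mindspore.context import ParallelMode

def init_device():
    """Read the ModelArts-provided env vars and configure the (distributed) context."""
    device_id = int(os.getenv('DEVICE_ID', '0'))
    device_num = int(os.getenv('RANK_SIZE', '1'))
    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=device_id)
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    return device_id, device_num
```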
### Sample code

The following pseudocode shows the three adaptations above applied to a MindSpore script.

The original MindSpore script:

```python
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    train_dataset = create_dataset(args.local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    parser.add_argument('--local_data_path', required=True, default=None, help='Location of data.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')

    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)
```

The adapted MindSpore script:

```python
import os
import argparse
from mindspore import context
from mindspore.context import ParallelMode
import mindspore.dataset.engine as de

# adapt to cloud: used for downloading data
import moxing as mox

device_id = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))

def create_dataset(dataset_path):
    if device_num == 1:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
    else:
        ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
                               num_shards=device_num, shard_id=device_id)
    return ds

def resnet50_train(args):
    # adapt to cloud: define local data path
    local_data_path = '/cache/data'

    if device_num > 1:
        context.set_auto_parallel_context(device_num=device_num,
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
        # adapt to cloud: define distributed local data path
        local_data_path = os.path.join(local_data_path, str(device_id))

    # adapt to cloud: download data from OBS to the local location
    print('Download data.')
    mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data_path)

    train_dataset = create_dataset(local_data_path)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='ResNet-50 train.')
    # adapt to cloud: get the OBS data path
    parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
    # adapt to cloud: get the OBS output path
    parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
    parser.add_argument('--epoch_size', type=int, default=90, help='Train epoch size.')
    args_opt, unknown = parser.parse_known_args()

    resnet50_train(args_opt)
```
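The adapted pseudocode stages data in but does not yet write anything to `train_url`. A matching upload step might look like the sketch below; it assumes outputs such as checkpoints are written to a local `/cache/output` directory, which is an assumption of this sketch rather than something the pseudocode above already does.

```python
import moxing as mox

def upload_outputs(args, local_output_path='/cache/output'):
    """Copy locally produced checkpoints and logs back to the OBS output path."""
    mox.file.copy_parallel(src_url=local_output_path, dst_url=args.train_url)
```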
## Creating a training job

After the data and the execution script are ready, create a training job to actually run the MindSpore script. Users new to ModelArts can learn the job-creation flow from this chapter.

### Accessing the ModelArts console

Open the HUAWEI CLOUD ModelArts home page and click "Console" on that page.

### Creating a training job with a frequently-used framework

The ModelArts tutorials show how to create a training job with a frequently-used framework.

### Creating a training job with MindSpore as the framework

Taking the training script and data used in this tutorial as an example, configure the training job creation page as follows:

1. `Algorithm Source`: select `Frequently-used framework`, then set `AI Engine` to `Ascend-Powered-Engine` and the required MindSpore version (the screenshot in this example shows `Mindspore-0.5-python3.7-aarch64`; make sure to use the scripts that match the selected version).

2. `Code Directory`: select the code directory created in the OBS bucket beforehand; `Boot File`: select the startup script under that code directory.

3. `Data Source`: select `Data Storage Location` and fill in the OBS location of the CIFAR-10 dataset.

4. `Running Parameters`: `Data Storage Location` and `Training Output Location` map to the running parameters `data_url` and `train_url`. Use `Add Running Parameter` to pass values to the script's other parameters, such as `epoch_size` (see the illustrative command after the figures below).

5. `Resource Pool`: select `Public Resource Pool > Ascend`.

6. `Resource Pool > Specifications`: select `Ascend: 1 * Ascend 910 CPU: 24 cores 96GiB` or `Ascend: 8 * Ascend 910 CPU: 192 cores 768GiB`, which are the single-node single-device and single-node 8-device specifications, respectively.

The configuration of a training job that uses MindSpore as the framework is shown below.

![Training job parameters](./images/cloud_train_job1.png)

![Training job specifications](./images/cloud_train_job2.png)
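Conceptually, this configuration makes ModelArts launch the boot file with the OBS locations passed through the parameters defined earlier. The command below is only a hypothetical illustration of that mapping, using the example bucket names from this tutorial — the actual invocation is assembled by the platform and may differ:

```text
python resnet50_train.py --data_url=s3://ms-dataset/cifar-10/ --train_url=s3://resnet50-train/output/ --epoch_size=90
```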
## Viewing the running result

1. The run log can be viewed on the training job page.

    With the `8*Ascend` specification, the ResNet-50 training task runs 92 epochs in total, reaches an accuracy of about 92%, and trains about 12,000 images per second. The log is shown in the following figures.

    ![8*Ascend training result](./images/train_log_8_Ascend_clu.png)

    ![8*Ascend training result](./images/train_log_8_Ascend.png)

    With the `1*Ascend` specification, the ResNet-50 training task also runs 92 epochs, reaches an accuracy of about 95%, and trains about 1,800 images per second. The log is shown in the following figure.

    ![1*Ascend training result](./images/train_log_1_Ascend.png)

2. If a log path was specified when the training job was created, the log files can also be downloaded from OBS and inspected locally, as sketched below.
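For example, a minimal sketch of pulling the logs down with MoXing — assuming the log path chosen at job creation was the `log` directory of the example bucket layout above:

```python
import moxing as mox

# the OBS source path matches the example bucket layout created earlier
mox.file.copy_parallel(src_url='s3://resnet50-train/log/', dst_url='./local_log')
```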
-- Gitee

From debe8bd1837901711b84c7060b5a3d46c9cc05fc Mon Sep 17 00:00:00 2001
From: sgyt20
Date: Mon, 8 Feb 2021 20:43:51 +0800
Subject: [PATCH 2/4] update tutorials/training/source_en/advanced_use/use_on_the_cloud.md.

---
 .../training/source_en/advanced_use/use_on_the_cloud.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
index 7a908475d4..1ceab35602 100644
--- a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
+++ b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
@@ -3,9 +3,9 @@

 - [Use MindSpore on the Cloud](#use-mindspore-on-the-cloud)
-  - [Summary](#summary)
-  - [Preparatory work](#preparatory-work)
-    - [Preparation for using ModelArts](#preparation-for-using-modelarts)
+  - [Overview](#overview)
+  - [Preparations](#preparations)
+    - [Preparing to use ModelArts](#preparing-to-use-modelarts)
     - [Obtaining Ascend AI processor resources on the cloud](#obtaining-ascend-ai-processor-resources-on-the-cloud)

-- Gitee

From 1a3e3f4e0918e3ca89b4fa98bfaa6f0c540e0a8b Mon Sep 17 00:00:00 2001
From: sgyt20
Date: Mon, 8 Feb 2021 20:45:12 +0800
Subject: [PATCH 3/4] update tutorials/training/source_en/advanced_use/use_on_the_cloud.md.

---
 .../training/source_en/advanced_use/use_on_the_cloud.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
index 1ceab35602..d9a3c68bad 100644
--- a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
+++ b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
@@ -1,5 +1,7 @@
 # Use MindSpore on the Cloud

+`Linux` `Ascend` `Whole Process` `Beginner` `Intermediate` `Expert`
+
 - [Use MindSpore on the Cloud](#use-mindspore-on-the-cloud)
-- Gitee

From 7f7083821f7e48cfa05e7b91eef6bdba2b6516e1 Mon Sep 17 00:00:00 2001
From: Jack20
Date: Mon, 8 Feb 2021 21:46:10 +0800
Subject: [PATCH 4/4] Update

---
 .../advanced_use/use_on_the_cloud.md          | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
index d9a3c68bad..cab26134ba 100644
--- a/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
+++ b/tutorials/training/source_en/advanced_use/use_on_the_cloud.md
@@ -24,7 +24,7 @@

-## Summary
+## Overview

 ModelArts is a one-stop AI development platform for developers provided by HUAWEI CLOUD. It integrates an Ascend AI processor resource pool, on which users can experience MindSpore.
@@ -30,9 +30,9 @@

-## Preparatory work
+## Preparations

-### Preparation for using ModelArts
+### Preparing to use ModelArts

 Follow the "Preparations" part of the ModelArts tutorials to register an account, configure ModelArts, and create an OBS bucket.
@@ -262,5 +262,5 @@

-![Training job parameters](./images/cloud_train_job1.png)
+![Training job parameters](../../source_zh_cn/advanced_use/images/cloud_train_job1.png)

-![Training job specifications](./images/cloud_train_job2.png)
+![Training job specifications](../../source_zh_cn/advanced_use/images/cloud_train_job2.png)
@@ -272,11 +272,11 @@

-    ![8*Ascend training result](./images/train_log_8_Ascend_clu.png)
+    ![8*Ascend training result](../../source_zh_cn/advanced_use/images/train_log_8_Ascend_clu.png)

-    ![8*Ascend training result](./images/train_log_8_Ascend.png)
+    ![8*Ascend training result](../../source_zh_cn/advanced_use/images/train_log_8_Ascend.png)

-    ![1*Ascend training result](./images/train_log_1_Ascend.png)
+    ![1*Ascend training result](../../source_zh_cn/advanced_use/images/train_log_1_Ascend.png)

-- Gitee