diff --git a/tutorials/source_en/parallel/distributed_case.rst b/tutorials/source_en/parallel/distributed_case.rst
index 9d70505ee3c8f7aabbd8fcdb6f6662b262ab2111..b055fbd86cbc5855501792b98e358c9630a2950c
--- a/tutorials/source_en/parallel/distributed_case.rst
+++ b/tutorials/source_en/parallel/distributed_case.rst
@@ -8,5 +8,4 @@ Distributed High-Level Configuration Case
 .. toctree::
    :maxdepth: 1
 
-   multiple_mixed
-   ms_operator
\ No newline at end of file
+   multiple_mixed
\ No newline at end of file
diff --git a/tutorials/source_en/parallel/ms_operator.md b/tutorials/source_en/parallel/ms_operator.md
deleted file mode 100644
index c73f3c2f777fc106c18d4a53c8e34ab1ee93d795..0000000000000000000000000000000000000000
--- a/tutorials/source_en/parallel/ms_operator.md
+++ /dev/null
@@ -1,139 +0,0 @@
-# Performing Distributed Training on K8S Clusters
-
-[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/tutorials/source_en/parallel/ms_operator.md)
-
-MindSpore Operator is a plugin that follows Kubernetes' Operator pattern (based on the CRD, or Custom Resource Definition, feature) and implements distributed training on Kubernetes. MindSpore Operator defines three roles in the CRD: Scheduler, PS, and Worker. Users can easily run distributed MindSpore training on K8S through a simple YAML file configuration. The MindSpore Operator code repository is available at [ms-operator](https://gitee.com/mindspore/ms-operator/).
-
-## Installation
-
-There are three installation methods:
-
-1. Install directly by using YAML
-
-    ```shell
-    kubectl apply -f deploy/v1/ms-operator.yaml
-    ```
-
-    After installation:
-
-    Run `kubectl get pods --all-namespaces` to see the deployment running in the ms-operator-system namespace.
-
-    Run `kubectl describe pod ms-operator-controller-manager-xxx-xxx -n ms-operator-system` to view the pod details.
-
-2. Install by using make deploy
-
-    ```shell
-    make deploy IMG=swr.cn-south-1.myhuaweicloud.com/mindspore/ms-operator:latest
-    ```
-
-3. Local debugging environment
-
-    ```shell
-    make run
-    ```
-
-## Sample
-
-The current ms-operator supports ordinary single-Worker training, single-Worker training in PS mode, and Scheduler and Worker startup for automatic parallelism (such as data parallelism and model parallelism).
-
-Runnable examples are provided in [config/samples/](https://gitee.com/mindspore/ms-operator/tree/master/config/samples). Taking the data-parallel Scheduler and Worker startup as an example (the dataset and network script need to be prepared in advance):
-
-```shell
-kubectl apply -f config/samples/ms_wide_deep_dataparallel.yaml
-```
-
-Run `kubectl get all -o wide` to see the Scheduler and Workers launched in the cluster, as well as the Service corresponding to the Scheduler.
-
-## Development Guide
-
-### Core Code
-
-`pkg/apis/v1/msjob_types.go` contains the CRD definition for MSJob.
-
-`pkg/controllers/v1/msjob_controller.go` contains the core logic of the MSJob controller.
-
-### Image Creation and Uploading
-
-To modify the ms-operator code and then build and upload an image, refer to the following commands:
-
-```shell
-make docker-build IMG={image_name}:{tag}
-docker push {image_name}:{tag}
-```
-
-### YAML File Configuration Instructions
-
-Taking data parallelism on a self-developed network as an example, this section introduces the YAML configuration of an MSJob, such as `runPolicy`, `successPolicy`, the number of each role, the MindSpore image, and file mounting. Users need to configure these items according to their actual needs.
-
-```yaml
-apiVersion: mindspore.gitee.com/v1
-kind: MSJob # Custom CRD type defined by ms-operator: MSJob
-metadata:
-  name: ms-widedeep-dataparallel # Task name
-spec:
-  runPolicy: # RunPolicy encapsulates various runtime policies of the distributed training job, such as how to clean up resources and how long the job can remain active.
-    cleanPodPolicy: None # All/Running/None
-  successPolicy: AllWorkers # The condition that marks the MSJob as success; defaults to blank, which means the default rule is used (the job succeeds once a single Worker finishes)
-  msReplicaSpecs:
-    Scheduler:
-      replicas: 1 # Number of Schedulers
-      restartPolicy: Never # Restart policy: Always, OnFailure, Never
-      template:
-        spec:
-          volumes: # File mounts, such as datasets and network scripts
-          - name: script-data
-            hostPath:
-              path: /absolute_path
-          containers:
-          - name: mindspore # Each role must have exactly one container named mindspore; configure containerPort to change the default port number (2222), and set the port name to msjob-port
-            image: mindspore-image-name:tag # MindSpore image
-            imagePullPolicy: IfNotPresent
-            command: # Command executed after the container starts
-            - /bin/bash
-            - -c
-            - python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000
-            volumeMounts:
-            - mountPath: /absolute_path
-              name: script-data
-            env: # Configurable environment variables
-            - name: GLOG_v
-              value: "1"
-    Worker:
-      replicas: 4 # Number of Workers
-      restartPolicy: Never
-      template:
-        spec:
-          volumes:
-          - name: script-data
-            hostPath:
-              path: /absolute_path
-          containers:
-          - name: mindspore
-            image: mindspore-image-name:tag # MindSpore image
-            imagePullPolicy: IfNotPresent
-            command:
-            - /bin/bash
-            - -c
-            - python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000
-            volumeMounts:
-            - mountPath: /absolute_path
-              name: script-data
-            env:
-            - name: GLOG_v
-              value: "1"
-            resources: # Resource limit configuration
-              limits:
-                nvidia.com/gpu: 1
-```
-
-### FAQs
-
-- If gcr.io/distroless/static cannot be pulled during image building, see this [issue](https://github.com/anjia0532/gcr.io_mirror/issues/169).
-- If gcr.io/kubebuilder/kube-rbac-proxy cannot be pulled during installation and deployment, see this [issue](https://github.com/anjia0532/gcr.io_mirror/issues/153).
-- When launching tasks on GPUs through K8S and NVIDIA cards are required, the K8S device plugin, nvidia-docker2, and related components must be installed.
-- Do not use underscores in YAML file configuration items.
-- When K8S is blocked but the cause cannot be determined from the pod log, view the log of the pod creation process via `kubectl logs $(kubectl get statefulset,pods -o wide --all-namespaces | grep ms-operator-system | awk -F" " '{print $2}') -n ms-operator-system`.
-- Tasks executed through a pod run in the root directory of the launched container, and the generated files are stored in the root directory by default. If the mapped path is only a subdirectory of the root directory, the generated files will not be mapped back to the host. It is recommended to change to the mapped directory before starting the task so that the files generated during execution are preserved.
-- In disaster recovery scenarios, if a bindIP failed error occurs, check whether the persistence files generated by the previous training run have been cleaned up.
-- Redirecting log files directly in the YAML is not recommended. If redirection is required, use distinct log file names for different pods.
-- When residual or other processes remain on the device, the pod may stay in the Pending state because it cannot obtain all the required resources. It is recommended to set a timeout policy to avoid being blocked indefinitely.
diff --git a/tutorials/source_en/parallel/overview.md b/tutorials/source_en/parallel/overview.md
index 7a3362393351c37b9fe80657e82939b805e160a5..79b7eb86dd37b0538e303b59d4154b05a8d719b8
--- a/tutorials/source_en/parallel/overview.md
+++ b/tutorials/source_en/parallel/overview.md
@@ -66,4 +66,3 @@ If there is a requirement for performance, throughput, or scale, or if you don't
 ## Distributed High-Level Configuration Examples
 
 - **Multi-dimensional Hybrid Parallel Case Based on Double Recursive Search**: Multi-dimensional hybrid parallel based on double recursive search means that the user can configure optimization methods such as recomputation, optimizer parallel, pipeline parallel. Based on the user configurations, the operator-level strategy is automatically searched by the double recursive strategy search algorithm, which generates the optimal parallel strategy. For details, please refer to the [Multi-dimensional Hybrid Parallel Case Based on Double Recursive Search](https://www.mindspore.cn/tutorials/en/master/parallel/multiple_mixed.html).
-- **Performing Distributed Training on K8S Clusters**: MindSpore Operator is a plugin that follows Kubernetes' Operator pattern (based on the CRD, or Custom Resource Definition, feature) and implements distributed training on Kubernetes. MindSpore Operator defines three roles in the CRD: Scheduler, PS, and Worker. Users can easily run distributed MindSpore training on K8S through a simple YAML file configuration. The MindSpore Operator code repository is available at [ms-operator](https://gitee.com/mindspore/ms-operator/). For details, please refer to [Performing Distributed Training on K8S Clusters](https://www.mindspore.cn/tutorials/en/master/parallel/ms_operator.html).
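The deleted English page shows how to submit the WideDeep sample but not how to inspect the resulting job. A minimal sketch of such checks, assuming the operator registers the custom resource under the `msjob` name, that pods are named after the `ms-widedeep-dataparallel` job, and that the controller runs as a `ms-operator-controller-manager` deployment (none of which is confirmed by the page), could look like this:

```shell
# List MSJob custom resources; the "msjob" resource name is assumed from "kind: MSJob".
kubectl get msjob

# Show the Scheduler/Worker pods and the Scheduler Service created for the sample job
# (job name taken from the sample YAML's metadata.name).
kubectl get all -o wide | grep ms-widedeep-dataparallel

# Follow the controller logs while the job is being reconciled
# (deployment name assumed from the ms-operator-controller-manager pods mentioned above).
kubectl logs -f deployment/ms-operator-controller-manager -n ms-operator-system
```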
diff --git a/tutorials/source_zh_cn/parallel/distributed_case.rst b/tutorials/source_zh_cn/parallel/distributed_case.rst
index 43872e680f4156e41657a75653604df600547fa4..42542720e251ecf4a7669bbdadc13fa99d1ab71b
--- a/tutorials/source_zh_cn/parallel/distributed_case.rst
+++ b/tutorials/source_zh_cn/parallel/distributed_case.rst
@@ -9,4 +9,3 @@
    :maxdepth: 1
 
    multiple_mixed
-   ms_operator
diff --git a/tutorials/source_zh_cn/parallel/ms_operator.md b/tutorials/source_zh_cn/parallel/ms_operator.md
deleted file mode 100644
index 556bb30eeb55a7143be45095f0e63dc64d143cbe..0000000000000000000000000000000000000000
--- a/tutorials/source_zh_cn/parallel/ms_operator.md
+++ /dev/null
@@ -1,139 +0,0 @@
-# Performing Distributed Training on K8S Clusters
-
-[![View Source File](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/tutorials/source_zh_cn/parallel/ms_operator.md)
-
-MindSpore Operator is a plugin that follows Kubernetes' Operator pattern (based on the CRD, or Custom Resource Definition, feature) and implements distributed training on Kubernetes. MindSpore Operator defines three roles in the CRD: Scheduler, PS, and Worker. Users can easily run distributed MindSpore training on K8S through a simple YAML file configuration. The MindSpore Operator code repository is available at [ms-operator](https://gitee.com/mindspore/ms-operator/).
-
-## Installation
-
-There are three installation methods:
-
-1. Install directly by using YAML
-
-    ```shell
-    kubectl apply -f deploy/v1/ms-operator.yaml
-    ```
-
-    After installation:
-
-    Run `kubectl get pods --all-namespaces` to see the deployment running in the ms-operator-system namespace.
-
-    Run `kubectl describe pod ms-operator-controller-manager-xxx-xxx -n ms-operator-system` to view the pod details.
-
-2. Install by using make deploy
-
-    ```shell
-    make deploy IMG=swr.cn-south-1.myhuaweicloud.com/mindspore/ms-operator:latest
-    ```
-
-3. Local debugging environment
-
-    ```shell
-    make run
-    ```
-
-## Sample
-
-The current ms-operator supports ordinary single-Worker training, single-Worker training in PS mode, and Scheduler and Worker startup for automatic parallelism (such as data parallelism and model parallelism).
-
-Runnable examples are provided in [config/samples/](https://gitee.com/mindspore/ms-operator/tree/master/config/samples). Taking the data-parallel Scheduler and Worker startup as an example (the dataset and network script need to be prepared in advance):
-
-```shell
-kubectl apply -f config/samples/ms_wide_deep_dataparallel.yaml
-```
-
-Run `kubectl get all -o wide` to see the Scheduler and Workers launched in the cluster, as well as the Service corresponding to the Scheduler.
-
-## Development Guide
-
-### Core Code
-
-`pkg/apis/v1/msjob_types.go` contains the CRD definition for MSJob.
-
-`pkg/controllers/v1/msjob_controller.go` contains the core logic of the MSJob controller.
-
-### Image Creation and Uploading
-
-To modify the ms-operator code and then build and upload an image, refer to the following commands:
-
-```shell
-make docker-build IMG={image_name}:{tag}
-docker push {image_name}:{tag}
-```
-
-### YAML File Configuration Instructions
-
-Taking data parallelism on a self-developed network as an example, this section introduces the YAML configuration of an MSJob, such as `runPolicy`, `successPolicy`, the number of each role, the MindSpore image, and file mounting. Users need to configure these items according to their actual needs. Do not use underscores in the YAML configuration items.
-
-```yaml
-apiVersion: mindspore.gitee.com/v1
-kind: MSJob # Custom CRD type defined by ms-operator: MSJob
-metadata:
-  name: ms-widedeep-dataparallel # Task name
-spec:
-  runPolicy: # RunPolicy encapsulates various runtime policies of the distributed training job, such as how to clean up resources and how long the job can remain active.
-    cleanPodPolicy: None # All/Running/None
-  successPolicy: AllWorkers # The condition that marks the MSJob as success; defaults to blank, which means the default rule is used (the job succeeds once a single Worker finishes)
-  msReplicaSpecs:
-    Scheduler:
-      replicas: 1 # Number of Schedulers
-      restartPolicy: Never # Restart policy: Always, OnFailure, Never
-      template:
-        spec:
-          volumes: # File mounts, such as datasets and network scripts
-          - name: script-data
-            hostPath:
-              path: /absolute_path
-          containers:
-          - name: mindspore # Each role must have exactly one container named mindspore; configure containerPort to change the default port number (2222), and set the port name to msjob-port
-            image: mindspore-image-name:tag # MindSpore image
-            imagePullPolicy: IfNotPresent
-            command: # Command executed after the container starts
-            - /bin/bash
-            - -c
-            - python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000
-            volumeMounts:
-            - mountPath: /absolute_path
-              name: script-data
-            env: # Configurable environment variables
-            - name: GLOG_v
-              value: "1"
-    Worker:
-      replicas: 4 # Number of Workers
-      restartPolicy: Never
-      template:
-        spec:
-          volumes:
-          - name: script-data
-            hostPath:
-              path: /absolute_path
-          containers:
-          - name: mindspore
-            image: mindspore-image-name:tag # MindSpore image
-            imagePullPolicy: IfNotPresent
-            command:
-            - /bin/bash
-            - -c
-            - python -s /absolute_path/train_and_eval_distribute.py --device_target="GPU" --epochs=1 --data_path=/absolute_path/criteo_mindrecord --batch_size=16000
-            volumeMounts:
-            - mountPath: /absolute_path
-              name: script-data
-            env:
-            - name: GLOG_v
-              value: "1"
-            resources: # Resource limit configuration
-              limits:
-                nvidia.com/gpu: 1
-```
-
-### FAQs
-
-- If gcr.io/distroless/static cannot be pulled during image building, see this [issue](https://github.com/anjia0532/gcr.io_mirror/issues/169).
-- If gcr.io/kubebuilder/kube-rbac-proxy cannot be pulled during installation and deployment, see this [issue](https://github.com/anjia0532/gcr.io_mirror/issues/153).
-- When launching tasks on GPUs through K8S and NVIDIA cards are required, the K8S device plugin, nvidia-docker2, and related components must be installed.
-- Do not use underscores in YAML file configuration items.
-- When K8S is blocked but the cause cannot be determined from the pod log, view the log of the pod creation process via `kubectl logs $(kubectl get statefulset,pods -o wide --all-namespaces | grep ms-operator-system | awk -F" " '{print $2}') -n ms-operator-system`.
-- Tasks executed through a pod run in the root directory of the launched container, and the generated files are stored in the root directory by default. If the mapped path is only a subdirectory of the root directory, the generated files will not be mapped back to the host. It is recommended to change to the mapped directory before starting the task so that the files generated during execution are preserved.
-- In disaster recovery scenarios, if a bindIP failed error occurs, it is recommended to clean up the persistence files generated by the previous training.
-- Redirecting log files directly in the YAML is not recommended. If redirection is required, use distinct log file names for different pods.
-- When residual or other processes remain on the device, the pod may stay in the Pending state because it cannot obtain all the required resources. It is recommended to set a timeout policy to avoid being blocked indefinitely.
\ No newline at end of file
diff --git a/tutorials/source_zh_cn/parallel/overview.md b/tutorials/source_zh_cn/parallel/overview.md
index a7de68cd2518ee1bbd19217cad6dc8ff3ce27cab..3da1bc241819a4971bf00d48d8899053f28daf8f
--- a/tutorials/source_zh_cn/parallel/overview.md
+++ b/tutorials/source_zh_cn/parallel/overview.md
@@ -65,5 +65,4 @@ MindSpore provides two granularities of operator-level parallel capability: operator-level parallelism and higher-order
 ## Distributed High-Level Configuration Examples
 
-- **Multi-dimensional Hybrid Parallel Case Based on Double Recursive Search**: Multi-dimensional hybrid parallel based on double recursive search means that the user can configure optimization methods such as recomputation, optimizer parallel, and pipeline parallel. Based on the user configurations, the operator-level strategy is automatically searched by the double recursive strategy search algorithm, which generates the optimal parallel strategy. For details, please refer to the [Multi-dimensional Hybrid Parallel Case Based on Double Recursive Search](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/multiple_mixed.html).
-- **Performing Distributed Training on K8S Clusters**: MindSpore Operator is a plugin that follows Kubernetes' Operator pattern (based on the CRD, or Custom Resource Definition, feature) and implements distributed training on Kubernetes. MindSpore Operator defines three roles in the CRD: Scheduler, PS, and Worker. Users can easily run distributed MindSpore training on K8S through a simple YAML file configuration. The MindSpore Operator code repository is available at [ms-operator](https://gitee.com/mindspore/ms-operator/). For details, please refer to [Performing Distributed Training on K8S Clusters](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/ms_operator.html).
+- **Multi-dimensional Hybrid Parallel Case Based on Double Recursive Search**: Multi-dimensional hybrid parallel based on double recursive search means that the user can configure optimization methods such as recomputation, optimizer parallel, and pipeline parallel. Based on the user configurations, the operator-level strategy is automatically searched by the double recursive strategy search algorithm, which generates the optimal parallel strategy. For details, please refer to the [Multi-dimensional Hybrid Parallel Case Based on Double Recursive Search](https://www.mindspore.cn/tutorials/zh-CN/master/parallel/multiple_mixed.html).
\ No newline at end of file
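Both FAQ lists above rely on a dense one-liner to locate the operator pod for log collection. A more readable equivalent, sketched under the same assumption that the controller pod name starts with ms-operator-controller-manager and lives in the ms-operator-system namespace, might be:

```shell
# Locate the ms-operator controller pod (namespace and name prefix assumed from the FAQ entry).
POD=$(kubectl get pods -n ms-operator-system -o name | grep ms-operator-controller-manager)

# Print the logs produced while the pod was being created and started.
kubectl logs "$POD" -n ms-operator-system
```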