diff --git a/docs/mindspore/source_en/design/multi_level_compilation.md b/docs/mindspore/source_en/design/multi_level_compilation.md
index ba9a1428bf7bccf3564cf3ea02d94ea632523dfd..e1ac6d6c5d2defa0bb06331c08656a906c1e4d0a 100644
--- a/docs/mindspore/source_en/design/multi_level_compilation.md
+++ b/docs/mindspore/source_en/design/multi_level_compilation.md
@@ -4,7 +4,7 @@
 
 ## Background
 
-With the arrival of the era of deep learning large models, the bigger the network size is, the bigger the challenge of graph compilation performance, execution performance and debugging and tuning efficiency is. For this reason, MindSpore proposes a multilevel compilation architecture that provides an O(n) multilevel compilation execution model, which are different from each other in terms of graph optimization, operator fusion, memory management, and execution modes, and is designed to provide a diversity of graph mode. Users can choose the most suitable compilation and execution mode according to their own network characteristics and needs:
+With the arrival of the era of large deep learning models, the larger the network, the greater the challenges in graph compilation performance, execution performance, and debugging and tuning efficiency. For this reason, MindSpore proposes a multi-level compilation architecture that provides an O(n) multi-level compilation execution model: the levels differ from each other in graph optimization, operator fusion, memory management, and execution mode, and are designed to provide a diversity of graph modes. Users can choose the most suitable compilation and execution mode according to their own network characteristics and needs:
 
 1. O0 mode: this is a basic compilation and execution mode, where all optimizations are turned off except those necessary to affect the functionality, and a single-calculus execution is used for execution. Therefore, the execution performance may not be optimal, but it can guarantee the original structure of the graph, which is convenient for users to debug and understand, and the compilation performance is also better. Add and Mul single operator execution is shown in the following figure.
 2. O1 mode: this mode performs some basic optimizations, such as common graph optimization and automatic operator fusion optimization, and uses single operator execution for execution. Compared with O0, because of enabling the fusion optimization, the execution performance of O1 can be improved, but it may affect the original structure of the graph, so the compilation performance and debugging and tuning efficiency is lost. In the following figure, Add and Mul are fused into a single fused_op execution.
@@ -16,7 +16,7 @@ With the arrival of the era of deep learning large models, the bigger the networ
 
 ![jit_level_framework](./images/multi_level_compilation/jit_level_framework.png)
 
-1. Multi-level compilation external interface: configure multi-level compilation level through [mindspore.jit(jit_level=“O0/O1”)](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/mindspore/mindspore.jit.html#mindspore.jit), jit_level defaults to O0. We usually recommend that users use O0 mode for network debugging tuning. After debugging is ready, for better performance you can turn on O1 to run the network.
+1. Multi-level compilation external interface: configure the multi-level compilation level through [mindspore.jit(jit_level="O0/O1")](https://www.mindspore.cn/docs/en/r2.7.0rc1/api_python/mindspore/mindspore.jit.html#mindspore.jit); jit_level defaults to O0. We usually recommend that users use O0 mode for network debugging and tuning; once debugging is done, you can turn on O1 to run the network for better performance. A minimal usage sketch is shown after this list.
 2. Backend graph compilation: According to the configured multi-level compilation level, different compilation modes are selected. O0 is the most basic native composition and compilation, and O1 adds automatic operator fusion function on the basis of O0, with the main functions of graph optimization, graph-operator fusion, operator selection, and execution sequence scheduling, of which graph-operator fusion is a unique function in O1 mode.
 3. Backend graph execution: The O0 and O1 modes are the same at the execution level, and both use a single operator way of scheduling execution, with the main functions of multi-stream concurrency, multi-level streaming, HAL management, and memory management.
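+
+As a reference, the following is a minimal sketch of switching the compilation level with the `jit` decorator. It follows the `mindspore.jit(jit_level=...)` interface linked above; the toy function and decorator form are illustrative, so check that API page for the exact options available in your version:
+
+```python
+import numpy as np
+import mindspore as ms
+from mindspore import Tensor, ops
+
+# jit_level="O0" (the default) keeps the original graph structure for easier debugging;
+# switching to jit_level="O1" enables fusion optimizations such as fusing Add and Mul.
+@ms.jit(jit_level="O1")
+def add_mul(x, y):
+    return ops.mul(ops.add(x, y), y)
+
+x = Tensor(np.ones((2, 2), dtype=np.float32))
+y = Tensor(np.ones((2, 2), dtype=np.float32))
+print(add_mul(x, y))
+```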
 
@@ -30,7 +30,7 @@ There are fewer graph optimizations for the O0 mode, and the basic optimizations
 
 - **Back-end LazyInline**
 
-  **LazyInline**: The main idea is to postpone the overhead of the function call to the actual need to call , so that you can reduce the compilation overhead, improve compilation efficiency.LazyInline is the same sub-graph structure reuse in the graph compilation phase, do not unfolding placed in the graph , to avoid the graph size is large resulting in the impact of the compilation performance.
+  **LazyInline**: The main idea is to postpone the overhead of a function call until the call is actually needed, which reduces compilation overhead and improves compilation efficiency. LazyInline reuses the same sub-graph structure during the graph compilation phase instead of expanding it inside the graph, which avoids the compilation performance impact caused by a large graph size.
 
   ![jit_level_lazyinline](./images/multi_level_compilation/jit_level_lazyinline.png)
 
@@ -76,11 +76,11 @@ Execution order scheduling is a complex problem of solving optimal operator conc
 
 O1 is mainly targeted at implementing general-purpose, generalizable AI compilation optimizations on top of O0 to support better execution performance requirements for most general-purpose training and inference scenarios.
 
-In the current phase, O1 mainly supports graph-kernel fusion optimization. The main idea is to automatically identify neighboring fusable nodes in the computational graph during the static graph compilation phase, and then fuse them into executable operators with larger granularity. Through graph-kernel fusion, optimization effects such as increasing the computational locality of operators and reducing the overall global memory access bandwidth overhead are achieved. As verified by real-world tests on 15+ networks, O1 is able to achieve an average of 15% performance acceleration compared to O0. Especially for access-intensive networks, the optimization effect of O1 is more significant.
+In the current phase, O1 mainly supports graph-kernel fusion optimization. The main idea is to automatically identify neighboring fusable nodes in the computational graph during the static graph compilation phase, and then fuse them into executable operators with larger granularity. Through graph-kernel fusion, optimization effects such as increasing the computational locality of operators and reducing the overall global memory access bandwidth overhead are achieved. As verified by real-world tests on more than 15 networks, O1 is able to achieve an average of 15% performance acceleration compared to O0. Especially for memory-access-intensive networks, the optimization effect of O1 is more significant.
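+
+As a rough illustration of what graph-kernel fusion works on, the sketch below builds a chain of neighboring element-wise operators and compiles it at O1. The toy function is illustrative only, and whether a given pattern is actually fused depends on the backend:
+
+```python
+import numpy as np
+import mindspore as ms
+from mindspore import Tensor, ops
+
+# A chain of adjacent element-wise operations is a typical fusion candidate:
+# fusing them improves computational locality and reduces global memory traffic.
+@ms.jit(jit_level="O1")
+def fused_candidate(x, y):
+    t = ops.add(x, y)
+    t = ops.mul(t, x)
+    return ops.exp(t)
+
+x = Tensor(np.random.rand(4, 4).astype(np.float32))
+y = Tensor(np.random.rand(4, 4).astype(np.float32))
+print(fused_candidate(x, y))
+```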
 
 ### Graph-Kernel Fusion
 
-Mainstream AI computing frameworks such as MindSpore provide operators to users that is usually defined in terms of understandable and easy use for user. Each operator carries a different amount of computation and varies in computational complexity. However, from the hardware execution point of view, this natural, user perspective-based division of operator computation volume is not efficient and does not fully utilize the computational power of hardware resources, which is mainly reflected in the following aspects:
+Mainstream AI computing frameworks such as MindSpore provide operators to users that are usually defined for understandability and ease of use. Each operator carries a different amount of computation and varies in computational complexity. However, from the hardware execution point of view, this natural, user-perspective-based division of operator computation is not efficient and does not fully utilize the computational power of hardware resources, which is mainly reflected in the following aspects:
 
 1. Computationally overloaded and overly complex operators, which usually makes it difficult to generate well-cut high-performance operator, thereby reducing equipment utilization.
 2. Operators that are too small in computation may also cause latency in computation and thus reduce equipment utilization, as the computation cannot effectively hide the data moving overhead.
diff --git a/docs/mindspore/source_en/design/pluggable_device.md b/docs/mindspore/source_en/design/pluggable_device.md
index 570dc4e8e11d22c1f339b209715172058dd34ee0..e5480e9103a534bafdf54a06cae9be46898fa8a2 100644
--- a/docs/mindspore/source_en/design/pluggable_device.md
+++ b/docs/mindspore/source_en/design/pluggable_device.md
@@ -22,7 +22,7 @@ The overall MindSpore architecture consists of the following major components, w
 
 The process of third-party chip interconnection to MindSpore mainly involves the back-end of MindSpore, which is also divided into several components. The overall components are divided into two main categories:
 
-- A category of hardware-independent components, commonly used data structures such as MemoryManager, MemoryPool, DeviceAddres and related algorithms as well as components including GraphCompiler, GraphSchdeduler that can schedule the entire process and have initial processing and scheduling capabilities for graphs or single operators.
+- One category is hardware-independent components: commonly used data structures such as MemoryManager, MemoryPool, and DeviceAddress with their related algorithms, as well as components such as GraphCompiler and GraphScheduler that can schedule the entire process and have initial processing and scheduling capabilities for graphs or single operators.
 - The other category is hardware-related components. This part provides several interfaces through the abstraction of hardware, and the third-party chips can choose interconnection according to the situation to realize the logic of operator, graph optimization, memory allocation, stream allocation, etc. unique to the hardware platform, and encapsulate them into dynamic libraries, which are loaded as plug-ins when the program runs. Third-party chips can refer to default built-in CPU/GPU/Ascend plug-ins of MindSpore when interconnection.
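+
+For context, which backend handles execution, and therefore which backend plug-in is loaded at runtime, is selected from Python via the device target. A minimal sketch using the long-standing context interface is shown below; the chosen target name must correspond to a plug-in that is actually installed:
+
+```python
+import mindspore as ms
+
+# Select the execution backend; the corresponding backend plug-in is loaded at runtime.
+# "CPU", "GPU" and "Ascend" correspond to the built-in plug-ins mentioned above.
+ms.set_context(device_target="CPU")
+```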
 
 To facilitate third-party hardware interconnection, a hardware abstraction layer is provided in MindSpore, which defines a standardized hardware interconnection interface. The abstraction layer is called by two modules, GraphCompiler and GraphScheduler, in the upper unified runtime:
@@ -32,7 +32,7 @@ To facilitate third-party hardware interconnection, a hardware abstraction layer
 
 Also public data structures and algorithms are provided in the framework, such as debug tools, default memory pool implementation, hundreds of common operations on Anf IR, and efficient memory reuse algorithm SOMAS developed by MindSpore.
 
-The hardware abstraction layer provides Graph mode (GraphExecutor) and Kernel mode (KernelExecutor) for two interconnection methods, respectively, for DSA architecture (such as NPU, XPU) and general architecture chips (such as GPU, CPU) to provide a classified interconnection interface. Chip vendors can inherit one or two abstract classes and implement them. Depending on the interconnection method, if you interconnect to Kernel mode, you also need to implement DeviceResMananger, KernelMod, DeviceAddress and other interfaces.
+The hardware abstraction layer provides two interconnection methods, Graph mode (GraphExecutor) and Kernel mode (KernelExecutor), which offer classified interconnection interfaces for DSA-architecture chips (such as NPU and XPU) and general-architecture chips (such as GPU and CPU), respectively. Chip vendors can inherit and implement one or both of the abstract classes. Depending on the interconnection method, interconnecting in Kernel mode also requires implementing DeviceResManager, KernelMod, DeviceAddress, and other interfaces.
 
 ## Kernel Mode Interconnection
 
@@ -43,7 +43,7 @@ The generic architecture Kernel mode requires the following aspects to be implem
 
 - Custom graph optimization, which allows splitting and fusion of certain operators according to the features of the hardware, and other custom modifications to the graph.
 - Operator selection and operator compilation.
-- Memory management. DeviceAddres is the abstraction of memory, and third-party chip vendors need to implement the function of copying between Host and Device. It also needs to provide memory request and destruction functions. To facilitate third-party chip vendors, MindSpore provides a set of memory pool implementations and an efficient memory reuse algorithm, SOMAS, in the Common component.
+- Memory management. DeviceAddress is the abstraction of memory; third-party chip vendors need to implement copying between Host and Device, as well as memory allocation and destruction. To facilitate third-party chip vendors, MindSpore provides a set of memory pool implementations and an efficient memory reuse algorithm, SOMAS, in the Common component. A minimal interconnection sketch is shown after this list.
 - Stream management. If the chip to be docked has the concept of stream, it needs to provide the function of creation and destruction. and If not, it will run in single stream mode.
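+
+As a rough, non-authoritative sketch of where these pieces plug in, the outline below names the kinds of hooks a vendor backend provides. Only the class names mentioned above (KernelExecutor, DeviceResManager, DeviceAddress, KernelMod) come from MindSpore; every method name and signature here is an illustrative placeholder, and the real interfaces should be taken from the hardware abstraction layer headers:
+
+```cpp
+// Illustrative outline of a Kernel mode plug-in; placeholder declarations only.
+#include <cstddef>
+
+class MyDeviceResManager {         // would derive from MindSpore's DeviceResManager
+ public:
+  void *AllocMemory(size_t size);  // memory request (may delegate to the provided memory pool / SOMAS)
+  void FreeMemory(void *ptr);      // memory destruction
+  void *CreateStream();            // stream creation; omit if the chip has no stream concept (single-stream mode)
+  void DestroyStream(void *stream);
+};
+
+class MyDeviceAddress {            // would derive from MindSpore's DeviceAddress
+ public:
+  bool CopyHostToDevice(const void *host_ptr, size_t size);  // Host -> Device copy
+  bool CopyDeviceToHost(void *host_ptr, size_t size);        // Device -> Host copy
+};
+
+class MyKernelExecutor {           // would derive from MindSpore's KernelExecutor
+ public:
+  void OptimizeGraph();            // custom graph optimization: split/fuse operators for this hardware
+  void CreateKernels();            // operator selection and compilation, producing KernelMod objects
+};
+```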
 
 ![image](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0rc1/docs/mindspore/source_zh_cn/design/images/pluggable_device_kernel.png)
diff --git a/docs/mindspore/source_zh_cn/design/pluggable_device.md b/docs/mindspore/source_zh_cn/design/pluggable_device.md
index 19f4ba73d2776f0ff75b95741c216d63f9225a67..9739b506d81a35671887908c2f8fb2d3724d5060 100644
--- a/docs/mindspore/source_zh_cn/design/pluggable_device.md
+++ b/docs/mindspore/source_zh_cn/design/pluggable_device.md
@@ -22,7 +22,7 @@ MindSpore整体架构包括如下几个主要组件,它们之间存在相互
 
 第三方芯片对接MindSpore的过程主要涉及MindSpore的后端,后端也分为多个组件,整体上分为两大类:
 
-- 一类与硬件无关,如MemoryManager、MemoryPool、DeviceAddres等常用数据结构及相关算法以及包括GraphCompiler、GraphSchdeduler在内的能够调度整个流程、具有对图或单算子的初步处理和调度能力的组件;
+- 一类与硬件无关,如MemoryManager、MemoryPool、DeviceAddress等常用数据结构及相关算法以及包括GraphCompiler、GraphScheduler在内的能够调度整个流程、具有对图或单算子的初步处理和调度能力的组件;
 - 另一类与硬件相关,这部分通过对硬件的抽象,提供了多个接口,第三方芯片可以根据情况选择对接,实现硬件平台上特有的算子、图优化、内存分配、流分配等逻辑,并封装成动态库,程序运行时作为插件加载。第三方芯片对接时可以参考MindSpore默认内置的CPU/GPU/Ascend插件。
 
 为了方便第三方硬件对接,在MindSpore中提供了硬件抽象层,定义了标准化的硬件对接接口,抽象层被上层统一运行时中的GraphCompiler和GraphScheduler两个模块调用:
@@ -32,7 +32,7 @@ MindSpore整体架构包括如下几个主要组件,它们之间存在相互
 
 同时,在框架中也提供了公共数据结构与算法,如debug工具、默认的内存池实现、数百个对Anf IR的常见操作、由MindSpore研发高效内存复用算法SOMAS等。
 
-硬件抽象层提供了Graph模式(GraphExecutor)和Kernel模式(KernelExecutor)用于两种对接方式,分别面向DSA架构(如NPU、XPU等)和通用架构的芯片(如GPU、CPU等)提供分类的对接接口。芯片厂商可以继承某种或两种抽象类并实现,根据对接方式的不同,如果对接Kernel模式还需实现DeviceResMananger、KernelMod、DeviceAddress等接口。
+硬件抽象层提供了Graph模式(GraphExecutor)和Kernel模式(KernelExecutor)用于两种对接方式,分别面向DSA架构(如NPU、XPU等)和通用架构的芯片(如GPU、CPU等)提供分类的对接接口。芯片厂商可以继承某种或两种抽象类并实现,根据对接方式的不同,如果对接Kernel模式还需实现DeviceResManager、KernelMod、DeviceAddress等接口。
 
 ## Kernel模式对接
 
@@ -43,7 +43,7 @@ MindSpore整体架构包括如下几个主要组件,它们之间存在相互
 
 - 自定义图优化,可以根据硬件的特性对某些算子进行拆分与融合,以及其他自定义的对图的修改;
 - 算子选择和算子编译;
-- 内存管理,DeviceAddres是对内存的抽象,第三方芯片厂商需要实现Host与Device之间拷贝的功能。还需要提供内存申请、销毁的功能。为了方便第三方芯片厂商,MindSpore在Common组件中提供了一套内存池的实现和高效内存复用算法SOMAS;
+- 内存管理,DeviceAddress是对内存的抽象,第三方芯片厂商需要实现Host与Device之间拷贝的功能。还需要提供内存申请、销毁的功能。为了方便第三方芯片厂商,MindSpore在Common组件中提供了一套内存池的实现和高效内存复用算法SOMAS;
 - 流管理,如果待对接的芯片有流的概念,需要提供创建与销毁的功能,如果没有,则将会以单流模式运行。
 
 ![image](./images/pluggable_device_kernel.png)