From 6f1d18eb95b338ce5e4a9787aec28e3fd55007e6 Mon Sep 17 00:00:00 2001
From: huan <3174348550@qq.com>
Date: Wed, 6 Aug 2025 14:29:01 +0800
Subject: [PATCH] modify compile file

---
 .../features/compile/compilation_guide.md     | 102 +++++++++---------
 docs/mindspore/source_en/features/index.rst   |   4 +-
 ...ation_guide_zh.md => compilation_guide.md} |  63 +++++------
 .../mindspore/source_zh_cn/features/index.rst |   4 +-
 .../source_zh_cn/features/overview.md         |  10 +-
 5 files changed, 88 insertions(+), 95 deletions(-)
 rename docs/mindspore/source_zh_cn/features/compile/{compilation_guide_zh.md => compilation_guide.md} (82%)

diff --git a/docs/mindspore/source_en/features/compile/compilation_guide.md b/docs/mindspore/source_en/features/compile/compilation_guide.md
index e77411ba7f..d4ee2a1d6e 100644
--- a/docs/mindspore/source_en/features/compile/compilation_guide.md
+++ b/docs/mindspore/source_en/features/compile/compilation_guide.md
@@ -1,32 +1,33 @@
 # mindspore.jit Multi-Level Compilation Optimization
 
-[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/features/compile/compilation_guide.md)
+[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_en/features/compile/compilation_guide.md)
 
 ## MindSpore Compilation Architecture
 
-MindSpore utilizes jit (just-in-time) for performance optimization. The jit mode converts Python code to intermediate representation graphs (IR, Intermediate Representation) through AST tree parsing, Python bytecode parsing, or code execution tracing. We name it MindIR. The compiler optimizes this IR graph to achieve code optimization and improve runtime performance. In contrast to dynamic graph mode, this JIT compilation mode is called graph mode.
+MindSpore utilizes jit (just-in-time) compilation for performance optimization. The jit mode converts Python code into an intermediate representation (IR) graph, which we call MindIR, through AST tree parsing, Python bytecode parsing, or code execution tracing. The compiler optimizes this IR graph to improve runtime performance. In contrast to PyNative Mode, this JIT compilation mode is called Graph Mode.
 
-Python code written by developers runs in dynamic graph mode by default. Functions can be decorated with the @mindspore.jit decorator to specify execution in graph mode. For documentation on the @mindspore.jit decorator, please refer to the [jit documentation](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.jit.html).
+Python code written by developers runs in PyNative Mode by default. Functions can be decorated with the @mindspore.jit decorator to specify execution in Graph Mode; a minimal sketch follows the stage list below. For documentation on the @mindspore.jit decorator, please refer to the [jit documentation](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.jit.html).
 
-Graph mode is roughly divided into 3 stages:
- - Graph Capture (Graph Construction): python code -> MindIR.
- - Graph Optimization (Frontend): Hardware-independent optimization of MindIR, algebraic simplification, function inlining, redundancy elimination, etc.
- - Graph Optimization (Backend): Hardware-dependent optimization of MindIR, LazyInline, operator selection, graph-operator fusion, etc.
+Graph Mode is roughly divided into 3 stages:
+
+- Graph Capture (Graph Construction): Python code -> MindIR.
+- Graph Optimization (Frontend): Hardware-independent optimization of MindIR, such as algebraic simplification, function inlining, and redundancy elimination.
+- Graph Optimization (Backend): Hardware-dependent optimization of MindIR, such as LazyInline, operator selection, and graph-operator fusion.
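+
+As a minimal illustration of the default PyNative execution versus jit decoration described above (the function body and tensor shape are illustrative assumptions):
+
+```python
+import mindspore
+
+def scale(x):
+    # Runs in PyNative Mode by default: interpreted line by line.
+    return x * 2 + 1
+
+@mindspore.jit
+def scale_jit(x):
+    # Same computation, captured as a MindIR graph and compiled on the first call.
+    return x * 2 + 1
+
+x = mindspore.ops.randn(2, 4)
+print(scale(x))
+print(scale_jit(x))
+```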
 
 ## Graph Capture (Graph Construction)
 
 MindSpore provides three capture methods as follows:
 
- - AST: Converts executed functions to IR graphs through AST tree parsing
- - bytecode (experimental): Parses Python bytecode to construct IR graphs as much as possible. Parts that cannot be converted to IR graphs will be executed according to dynamic graph
- - trace (experimental): Constructs IR graphs by tracing the execution trajectory of Python code
+- AST: Converts executed functions to IR graphs through AST tree parsing
+- bytecode (experimental): Parses Python bytecode to construct IR graphs as much as possible. Parts that cannot be converted to IR graphs will be executed in PyNative Mode
+- trace (experimental): Constructs IR graphs by tracing the execution trajectory of Python code
 
- Taking ast as an example: developers can choose `@mindspore.jit:(capture_mode="ast")` decorator to modify functions. Functions modified with ast mode have certain syntax restrictions. We provide two modes for developers to choose from.
+Taking ast as an example: developers can use the `@mindspore.jit(capture_mode="ast")` decorator to decorate functions. Functions decorated in ast mode are subject to certain syntax restrictions, and we provide two modes for developers to choose from.
 
- - strict mode: The goal of this mode is to construct a single graph. If the developer's Python code cannot construct a graph, choosing this mode will cause an error when running the program, requiring the developer to modify the code to use graphable syntax. This is suitable for developers pursuing performance.
- - lax mode: The goal of this mode is to make the developer's program runnable as much as possible. The idea is to perform Python fallback for code that cannot construct graphs in strict mode, that is, return to the Python layer for execution.
+- strict mode: The goal of this mode is to construct a single graph. If the developer's Python code cannot construct a graph, this mode raises an error when the program runs, requiring the developer to rewrite the code with graph-compatible syntax. It is suitable for developers pursuing performance.
+- lax mode: The goal of this mode is to make the developer's program runnable as much as possible. The idea is to perform Python fallback for code that cannot construct graphs in strict mode, that is, to return to the Python layer for execution.
 
-For graph mode constraints, please refer to [Syntax Constraints](https://www.mindspore.cn/tutorials/en/master/compile/static_graph.html). Here's an example of how ast parses Python code and constructs graphs:
+For Graph Mode constraints, please refer to [Syntax Constraints](https://www.mindspore.cn/tutorials/en/master/compile/static_graph.html). Here's an example of how ast parses Python code and constructs a graph:
 
 ```python
 @mindspore.jit
@@ -56,20 +57,21 @@ subgraph @foo() {
 
 **Advantages of ast**:
 
- - Using ast mode gives users stronger programming autonomy and more precise performance optimization. They can tune network performance to optimal based on function characteristics and usage experience.
+- Using ast mode gives users stronger programming autonomy and more precise control over performance optimization: they can tune network performance to its optimum based on the characteristics of the function and their own usage experience.
 
 **Limitations of ast**:
 
- - Functions decorated with ast must strictly follow static graph syntax for internal programming.
+- Functions decorated with ast must strictly follow static graph syntax for internal programming.
 
 **recommendations for ast mode**:
 
- - Compared to dynamic graph execution, functions decorated with `@mindspore.jit` need to consume certain time for compilation on the first call. In subsequent calls to this function, if the original compilation result can be reused, the original compilation result will be used directly for execution. Therefore, using the `@mindspore.jit` decorator to modify functions that will be executed multiple times usually obtains more performance benefits.
- - The runtime efficiency advantage of graph mode is reflected in its global compilation optimization of functions decorated with `@mindspore.jit`. The more operations contained in the function, the greater the optimization space. Therefore, functions decorated with `@mindspore.jit` are best large code blocks containing many operations, rather than many fragmented functions containing only a few operations separately marked with jit tags. Otherwise, it may lead to no performance benefits or even degradation.
+- Compared to PyNative Mode execution, a function decorated with `@mindspore.jit` consumes a certain amount of time to compile on its first call. On subsequent calls, if the existing compilation result can be reused, it is executed directly. Therefore, decorating functions that will be executed multiple times with `@mindspore.jit` usually yields more performance benefit.
+
+- The runtime efficiency advantage of Graph Mode lies in its global compilation optimization of functions decorated with `@mindspore.jit`: the more operations a function contains, the greater the optimization space. Therefore, functions decorated with `@mindspore.jit` should ideally be large code blocks containing many operations, rather than many fragmented functions that each contain only a few operations and are tagged with jit separately. Otherwise, there may be no performance benefit, or even degradation.
 
- - Most calculations and optimizations are based on optimization of Tensor calculations. It is recommended that decorated functions should be used for real data calculation functions, rather than simple scalar calculations or data structure transformations.
+- Most calculations and optimizations are based on optimizing Tensor calculations. It is recommended to decorate functions that perform real data calculations, rather than simple scalar calculations or data structure transformations.
 
- - For functions decorated with `@mindspore.jit`, if their inputs contain constants, changes in input values each time will cause recompilation. For the concept of variable constants, please refer to [Constants and Variables in Just-in-Time Compilation](https://www.mindspore.cn/tutorials/en/master/compile/static_graph.html). Therefore, it is recommended that decorated functions take Tensors or data modified by Mutable as input to avoid additional performance loss caused by multiple compilations.
+- For functions decorated with `@mindspore.jit`, if their inputs contain constants, each change in input values will trigger recompilation. For the concepts of variables and constants, please refer to [Constants and Variables in Just-in-Time Compilation](https://www.mindspore.cn/tutorials/en/master/compile/static_graph.html). Therefore, it is recommended that decorated functions take Tensors or data wrapped by Mutable as input, to avoid the extra performance cost of repeated compilation. A sketch follows this list.
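+
+The sketch below illustrates the first and last recommendations. It assumes MindSpore 2.x APIs (`mindspore.ops.randn`, `mindspore.ops.relu`); the shapes, function body, and timing printout are illustrative only:
+
+```python
+import time
+import mindspore
+
+@mindspore.jit
+def dense_compute(x, w):
+    # Real Tensor computation: a good candidate for graph compilation.
+    return mindspore.ops.relu(x @ w) * 0.5
+
+x = mindspore.ops.randn(32, 64)
+w = mindspore.ops.randn(64, 64)
+
+start = time.time()
+dense_compute(x, w)   # First call pays a one-off compilation cost.
+print("first call:", time.time() - start)
+
+start = time.time()
+dense_compute(x, w)   # Same Tensor inputs: the compiled result is reused directly.
+print("second call:", time.time() - start)
+
+# If a parameter were a Python scalar (a compile-time constant), changing its
+# value between calls would trigger recompilation; prefer Tensor or mutable inputs.
+```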
 
 ## Graph Optimization (Frontend)
 
@@ -80,6 +82,7 @@ There are many frontend compilation optimization techniques, such as: algebraic
 ### 1 Algebraic Simplification
 
 In traditional compilers, algebraic simplification is a compiler optimization technique aimed at simplifying algebraic expressions in source code, eliminating redundant calculations, improving program execution efficiency, and reducing memory usage.
+
 For example, in the following code snippet:
 
 ```cpp
@@ -114,7 +117,6 @@ out = func(m)
 
 The MindSpore graph compiler will convert the Python program to a computation graph, which consists of multiple subgraphs. Algebraic operations in the source program are converted to operator calls within subgraphs. You can see that the PrimFunc_Add operator is called once.
 
-
 ```text
 
 %para1_x: 
 
 subgraph @1_func_14() {
   %0 = PrimFunc_Add(%para1_x, Tensor(shape=[], dtype=Float32, value=0))
       : (, ) -> ()
   Return(%0) cnode, flags:{output_no_recompute: 1}
       : ()
 }
 ```
 
-Through algebraic simplification, the PrimFunc_Add operator can be directly deleted, simplifying the computation graph structure, and simplifying x + 0 to x.
-
+Through algebraic simplification, the PrimFunc_Add operator can be deleted directly, simplifying `x + 0` to `x` and streamlining the computation graph structure.
 
 ```text
 
 %para1_x: 
 
 subgraph @1_func_14() {
   Return(%para1_x) cnode, flags:{output_no_recompute: 1}
       : ()
 }
 ```
 
 Algebraic simplification can involve more modifications to the computation graph, and it is usually combined with other compilation optimization techniques to obtain more optimization opportunities.
 
 ### 2 Function Inlining
 
-In traditional compilers, inlining is an optimization technique that can directly replace the code of called functions at the location where the function is called, improving program execution efficiency. Suppose we have a C++ function add for summing two numbers:
+In traditional compilers, inlining is an optimization technique that directly replaces a function call with the body of the called function at the call site, improving program execution efficiency. Suppose we have a C++ function `add` for summing two numbers:
 
 ```cpp
 int add(int a, int b) {
@@ -157,8 +158,7 @@ int main() {
 }
 ```
 
-
-The compiler inlines the function body directly to the call site, which eliminates the overhead of function calls and creates conditions for subsequent optimizations (such as eliminating redundant calculations 3 + 5, directly evaluating and replacing at compile time). This idea of replacing calls with code is the core of inlining.
+The compiler inlines the function body directly at the call site, which eliminates the overhead of function calls and creates conditions for subsequent optimizations (such as eliminating the redundant calculation `3 + 5` by evaluating and replacing it at compile time). This idea of replacing calls with code is the core of inlining.
 
 ```cpp
 int main() {
@@ -189,10 +189,8 @@ c = mindspore.ops.randn(2, 4)
 out = f1(a, b, c)
 ```
 
-
 First, MindSpore's computation graph compiler will convert the Python program to a computation graph. Function calls in the Python program will be converted to calls between computation graphs, resulting in an original computation graph similar to the following. Among them, the main graph f1 calls the subgraph f2 twice.
- ```text # Params: %para1_a: @@ -236,7 +234,7 @@ subgraph @f1() { } ``` -Before inlining expands the subgraph, the compiler may not be able to identify the repeated operations in the two calls to subgraph f2 (at this time the subgraph is usually treated as a black box). After inlining expands the subgraph, the compiler can clearly see that x * 0.5 is calculated twice, which can trigger further optimization by the compiler: Common Subexpression Elimination (CSE), thus reducing the amount of calculation. +Before inlining expands the subgraph, the compiler may not be able to identify the repeated operations in the two calls to subgraph f2 (at this time the subgraph is usually treated as a black box). After inlining expands the subgraph, the compiler can clearly see that `x * 0.5` is calculated twice, which can trigger further optimization by the compiler: Common Subexpression Elimination (CSE), thus reducing the amount of calculation. ```text subgraph @f1() { @@ -268,7 +266,7 @@ The purpose and techniques used in MindSpore redundancy elimination are similar ```python import mindspore - + @mindspore.jit def func(x, y): a = x + y @@ -282,7 +280,7 @@ The purpose and techniques used in MindSpore redundancy elimination are similar out = func(x, y) ``` -The MindSpore graph compiler will convert Python code decorated with `@mindspore.jit` to MindIR representation through static analysis and eliminate the redundant calculation of c = x * y. The final generated MindIR is as follows: + The MindSpore graph compiler will convert Python code decorated with `@mindspore.jit` to MindIR representation through static analysis and eliminate the redundant calculation of c = x * y. The final generated MindIR is as follows: ```text # Params: @@ -300,10 +298,11 @@ The MindSpore graph compiler will convert Python code decorated with `@mindspore : () } ``` + 2. **Unreachable Code Elimination** Suppose there is Python code with unreachable paths as follows: - + ```python import mindspore @@ -322,7 +321,7 @@ The MindSpore graph compiler will convert Python code decorated with `@mindspore out = func(x, y) ``` -The MindSpore graph compiler will convert Python code decorated with `@mindspore.jit` to MindIR representation through static analysis and eliminate the redundant control flow branch code of `1 < 0`. The final generated MindIR is as follows: + The MindSpore graph compiler will convert Python code decorated with `@mindspore.jit` to MindIR representation through static analysis and eliminate the redundant control flow branch code of `1 < 0`. The final generated MindIR is as follows: ```text # Params: @@ -344,21 +343,23 @@ The MindSpore graph compiler will convert Python code decorated with `@mindspore Redundancy elimination plays an important role in compilation optimization. Without changing the original semantics of the program, it can significantly improve program execution efficiency and save computational resources by reducing unnecessary runtime calculations. Redundancy elimination is usually combined with other compilation optimization techniques to obtain more opportunities for eliminating redundant code. ## Graph Optimization (Backend) + After the MindIR graph completes frontend optimization, it needs further optimization (including target hardware). The optimization modes are divided into O0 and O1, represented by the parameter jit_level: - - **jit_level=O0**: Only performs basic graph segmentation optimization and operator selection (hardware-related). 
The advantage is that it can guarantee the original structure of the IR graph and has faster compilation speed.
- - **jit_level=O1**: Adds graph optimization and automatic operator fusion. Compilation performance is somewhat lost, but after the model starts training, efficiency is higher.
-After this round of optimization, MindIR will be executed by the runtime module, involving multi-level pipeline concurrency and other technologies. For reference, see [Multi-Level Pipeline].
+- **jit_level=O0**: Only performs basic graph segmentation optimization and operator selection (hardware-related). The advantage is that it preserves the original structure of the IR graph and compiles quickly.
+- **jit_level=O1**: Adds graph optimization and automatic operator fusion. Some compilation speed is sacrificed, but execution is more efficient once the model starts training.
+
+After this round of optimization, MindIR will be executed by the runtime module, which involves technologies such as multi-level pipeline concurrency. For reference, see [Multi-Level Pipeline](https://www.mindspore.cn/docs/en/master/features/runtime/multilevel_pipeline.html).
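+
+As a sketch of how a level is selected (this assumes the `jit_level` parameter of `@mindspore.jit` in MindSpore 2.x; the function bodies are illustrative):
+
+```python
+import mindspore
+
+# O0: keeps the IR structure intact and compiles quickly.
+@mindspore.jit(jit_level="O0")
+def scale(x):
+    return x * 0.5
+
+# O1: additionally enables graph optimization and automatic operator fusion;
+# compilation takes longer, but steady-state execution is faster.
+@mindspore.jit(jit_level="O1")
+def fused(x):
+    return mindspore.ops.sqrt(x * 0.5) + 1.0
+
+x = mindspore.ops.randn(2, 4)
+print(scale(x))
+print(fused(x))
+```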
 
 ### jit_level=O0 Mode
 
 O0 mode has fewer optimizations. The basic optimizations are mainly backend LazyInline and No-task node execution optimization.
 
- - ***LazyInline**: The main idea is to postpone the overhead of function calls to when they are actually needed, which can reduce compilation overhead and improve compilation efficiency. LazyInline reuses the same subgraph structure during the graph compilation phase without expanding it in the graph, avoiding large graph scale affecting compilation performance.
+- **LazyInline**: The main idea is to postpone the overhead of function calls until they are actually needed, which reduces compilation overhead and improves compilation efficiency. LazyInline reuses the same subgraph structure during the graph compilation phase instead of expanding it in the graph, preventing a large graph scale from hurting compilation performance.
 
 ![jit_level_lazyinline](./images/multi_level_compilation/jit_level_lazyinline.png)
 
- - **No-task node Execution Optimization**: No-task nodes refer to operators such as Reshape, ExpandDims, Squeeze, Flatten, FlattenGrad, Reformat, etc. These operators have no computational logic, do not modify memory layout, and only modify shape, format and other information. At the end of graph compilation, No-task nodes are converted to ref nodes, where the output has the same address as the input, and kernel launch is skipped during execution to achieve execution performance optimization.
+- **No-task node Execution Optimization**: No-task nodes refer to operators such as Reshape, ExpandDims, Squeeze, Flatten, FlattenGrad, Reformat, etc. These operators have no computational logic, do not modify memory layout, and only modify shape, format and other information. At the end of graph compilation, No-task nodes are converted to ref nodes, where the output shares the same address as the input, and kernel launch is skipped during execution to optimize execution performance.
 
 ![jit_level_no_task](./images/multi_level_compilation/jit_level_no_task.png)
 
@@ -381,16 +382,15 @@ Different graph traversal algorithms produce execution orders with large differ
 
 ![jit_level_exec_order](./images/multi_level_compilation/jit_level_exec_order.png)
 
- - **Execution order obtained by BFS**: kernel1-> kernel2-> kernel4-> kernel5-> kernel3-> kernel6, memory peaks at 5G (kernel3 can release kernel1 and kernel2 after execution, then reuse them when it's kernel6's turn to execute, so kernel6 doesn't need to request extra memory).
- - **Execution order obtained by DFS**: kernel1-> kernel2-> kernel3-> kernel4-> kernel5-> kernel6, memory peaks at 4G (kernel3 can release kernel1 and kernel2 after execution, then reuse them when it's kernel4 and kernel5's turn to execute, so kernel4 and kernel5 don't need to request extra memory).
+- **Execution order obtained by BFS**: kernel1 -> kernel2 -> kernel4 -> kernel5 -> kernel3 -> kernel6, memory peaks at 5G (kernel3 can release kernel1 and kernel2 after execution, and their memory is reused when kernel6 executes, so kernel6 does not need to request extra memory).
+- **Execution order obtained by DFS**: kernel1 -> kernel2 -> kernel3 -> kernel4 -> kernel5 -> kernel6, memory peaks at 4G (kernel3 can release kernel1 and kernel2 after execution, and their memory is reused when kernel4 and kernel5 execute, so kernel4 and kernel5 do not need to request extra memory).
 
 Execution order scheduling is a complex problem of solving optimal operator concurrency under certain memory constraints. It not only requires identifying and exploiting concurrency opportunities in the computational graph to improve computational efficiency, but also must consider multiple constraints simultaneously to ensure system stability and efficiency.
 
- - First, the optimization module needs to address the complexity of solving for optimal operator concurrency. Due to the large number of operators in the computational graph and their interdependencies, finding an execution order that maximizes concurrency while maintaining the logical correctness of the computational graph is a challenging task.
+- First, the optimization module needs to address the complexity of solving for optimal operator concurrency. Due to the large number of operators in the computational graph and their interdependencies, finding an execution order that maximizes concurrency while maintaining the logical correctness of the computational graph is a challenging task.
 
- - Second, memory constraints are a critical factor that cannot be ignored in execution order optimization. Increasing concurrency, while improving computational efficiency, tends to significantly increase peak memory requirements, which may lead to Out of Memory (OOM) errors, especially in resource-constrained environments. Therefore, the optimization module must weigh the relationship between concurrency and memory usage to ensure that concurrency is increased without exceeding the memory capacity of the system.
-
-  - MindSpore's execution order adjustment module combines rule-based and heuristic-based strategies to provide both bfs/dfs execution order orchestration algorithms [mindspore.jit(option={"exec_order":"bfs/dfs"})](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.jit.html) to achieve fine-grained adjustment of the execution order of the computation graph, thus effectively dealing with multiple challenges such as memory constraints and system stability while ensuring computational efficiency.
+- Second, memory constraints are a critical factor that cannot be ignored in execution order optimization. Increasing concurrency, while improving computational efficiency, tends to significantly increase peak memory requirements, which may lead to Out of Memory (OOM) errors, especially in resource-constrained environments. Therefore, the optimization module must weigh the relationship between concurrency and memory usage to ensure that concurrency is increased without exceeding the memory capacity of the system.
+- MindSpore's execution order adjustment module combines rule-based and heuristic-based strategies to provide both bfs and dfs execution order orchestration algorithms, [mindspore.jit(option={"exec_order":"bfs/dfs"})](https://www.mindspore.cn/docs/en/master/api_python/mindspore/mindspore.jit.html), to achieve fine-grained adjustment of the execution order of the computation graph, thus effectively dealing with challenges such as memory constraints and system stability while ensuring computational efficiency. A sketch follows this list.
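+
+The sketch below follows the `option={"exec_order": ...}` form referenced above; treat the option name and its accepted values as assumptions to verify against your MindSpore version:
+
+```python
+import mindspore
+
+# Ask the backend to orchestrate kernels depth-first, trading some operator
+# concurrency for a lower peak memory footprint (see the BFS/DFS comparison above).
+@mindspore.jit(option={"exec_order": "dfs"})
+def block(x, w):
+    h = mindspore.ops.relu(x @ w)
+    return h * 0.5
+
+x = mindspore.ops.randn(2, 4)
+w = mindspore.ops.randn(4, 4)
+out = block(x, w)
+```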
 
 ### jit_level=O1 Mode
 
@@ -401,9 +401,7 @@ Currently O1 mainly supports graph-operator fusion optimization. The main idea i
 
 Mainstream AI computing frameworks such as MindSpore provide operators to users that are usually defined from the perspective of user understanding and ease of use. Each operator carries different amounts of computation and varies in computational complexity. However, from the hardware execution perspective, this natural, user perspective-based division of operator computation volume is not efficient and cannot fully utilize the computational power of hardware resources. This is mainly reflected in:
 
 1. Operators with too much computation and overly complex operators usually make it difficult to generate well-split high-performance operators, thereby reducing device utilization;
-
 2. Operators with too little computation may also cause computational latency and thus reduce device utilization, as the computation cannot effectively hide data movement overhead;
-
 3. Hardware devices are usually multi-core, many-core architectures. When operator shapes are small or other reasons cause insufficient computational parallelism, it may cause some cores to be idle, thus reducing device utilization. Especially chips based on Domain Specific Architecture (DSA for short) are more sensitive to these factors. How to maximize hardware computational performance while making operators easy to use has always been a big challenge. In terms of AI framework design, the current industry mainstream adopts a layered implementation approach of graph layer and operator layer. The graph layer is responsible for fusing or regrouping the computational graph, and the operator layer is responsible for compiling the fused or regrouped operators into high-performance executable operators. The graph layer usually uses Tensor-based High-Level IR for processing and optimization, while the operator layer uses computation instruction-based Low-Level IR for analysis and optimization. This artificial layered processing significantly increases the difficulty of collaborative optimization between the graph and computation layers.
 
@@ -427,6 +425,7 @@ The optimized computational graph is passed to MindSpore AKG as subgraphs for fu
 
 ![graphkernel](./images/graphkernel.png)
 
 Through the above steps, we can obtain two aspects of performance gains:
+
 1. Cross-boundary performance optimization gains between different operators;
 2. 
Through reorganization and splitting of the entire computational graph, the optimal granularity of fusion operators is obtained. @@ -438,22 +437,17 @@ Automatic generation technology of fusion operators can solve the problem of hig Therefore, **MindSpore AKG accelerates optimization and automatic generation of fusion operators based on Polyhedral Compilation Technology (Polyhedral Model)**, which can help fusion operators optimized by MindSpore's graph-operator fusion module to automatically generate high-performance kernels on **heterogeneous hardware platforms**(GPU/Ascend) and improve MindSpore training performance. -The architecture and overall process are as follows: - -![graphkernel_akg_overview](./images/graphkernel_akg_overview.png) - -The overall framework of MindSpore AKG is shown in the figure above: - - IR Normalization +- IR Normalization - The input of MindSpore AKG is the fusion subgraph optimized by MindSpore's graph-operator fusion module. The operators in the subgraph are expressed through various description methods such as TVM's Compute/IR Builder/Hybrid. Then the DSL is converted to [Halide](https://halide-lang.org/) IR (Halide, a common language used for developing high-performance image processing and array computation, which can be used as an intermediate representation to decouple algorithms and optimization) and IR normalization; - After initial simplification and optimization is completed, the Halide IR is transformed into the scheduling tree required by the Poly module; - - Poly Module Scheduling Optimization +- Poly Module Scheduling Optimization - Using the Pluto scheduling algorithm in polyhedral technology to achieve automatic loop fusion, automatic rearrangement and other transformations, automatically generating initial scheduling that satisfies parallelism and data locality for fusion operators; - To quickly adapt to different hardware backends, the optimization passes in the Poly module are divided into hardware-independent generic optimizations and hardware-related specific optimizations, which are stitched and combined according to hardware features at compilation time to achieve fast adaptation of heterogeneous hardware backends. Auto-slicing, auto-mapping, and auto-memory boosting passes will give different optimization methods according to the nature of different hardware architectures; - - Backend Optimization +- Backend Optimization - To further improve operator performance, we developed corresponding optimization passes for different hardware backends, such as data alignment and instruction mapping in Ascend backend, vectorized access and insertion of synchronization instructions in GPU backend, and finally generate corresponding platform code. Summary: MindSpore compilation optimizes AI model code from various dimensions such as graph capture mode, IR optimization, graph-operator fusion, etc. Many features also face certain challenges in the trade-off between usability and performance. We also plan to further layer and decouple the entire process to avoid black-box operation and increase the threshold for developer understanding. 
\ No newline at end of file diff --git a/docs/mindspore/source_en/features/index.rst b/docs/mindspore/source_en/features/index.rst index 6552f5d328..5d6f169112 100644 --- a/docs/mindspore/source_en/features/index.rst +++ b/docs/mindspore/source_en/features/index.rst @@ -11,9 +11,7 @@ Developer Notes parallel/optimizer_parallel parallel/pipeline_parallel parallel/auto_parallel - compile/multi_level_compilation - compile/graph_construction - compile/graph_optimization + compile/compilation_guide runtime/memory_manager runtime/multilevel_pipeline runtime/multistream_concurrency diff --git a/docs/mindspore/source_zh_cn/features/compile/compilation_guide_zh.md b/docs/mindspore/source_zh_cn/features/compile/compilation_guide.md similarity index 82% rename from docs/mindspore/source_zh_cn/features/compile/compilation_guide_zh.md rename to docs/mindspore/source_zh_cn/features/compile/compilation_guide.md index fa0411c9fd..be1dcb442b 100644 --- a/docs/mindspore/source_zh_cn/features/compile/compilation_guide_zh.md +++ b/docs/mindspore/source_zh_cn/features/compile/compilation_guide.md @@ -1,31 +1,33 @@ # mindspore.jit 多级编译优化 -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/features/compile/compilation_guide_zh.md) - +[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindspore/source_zh_cn/features/compile/compilation_guide.md) ## MindSpore编译架构 - -MindSpore利用jit(just-in-time)来进行性能优化。jit模式会通过AST树解析、Python字节码解析或追踪代码执行的方式,将python代码转换为中间表示图(IR,Intermediate Representation)。我们给它命名MindIR。编译器通过对该IR图的优化,来达到对代码的优化,提高运行性能。与动态图模式相对应,这种JIT的编译模式被称为graph mode。 -开发者写的python代码默认以动态图模式运行,可以通过`@mindspore.jit`装饰器修饰函数,来指定其按照graph mode执行。有关`@mindspore.jit`装饰器的相关文档请见[jit 文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.jit.html#mindspore.jit)。 +MindSpore利用jit(just-in-time)来进行性能优化。jit模式会通过AST树解析、Python字节码解析或追踪代码执行的方式,将Python代码转换为中间表示图(IR,Intermediate Representation)。我们给它命名MindIR。编译器通过对该IR图的优化,来达到对代码的优化,提高运行性能。与PyNative Mode相对应,这种JIT的编译模式被称为Graph Mode。 + +开发者写的Python代码默认以PyNative Mode运行,可以通过`@mindspore.jit`装饰器修饰函数,来指定其按照Graph Mode执行。有关`@mindspore.jit`装饰器的相关文档请见[jit 文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.jit.html#mindspore.jit)。 -graph mode大致分为3个阶段: - - 图捕获(构图): python代码 -> MindIR。 - - 图优化(前端): 对MindIR进行硬件无关优化,代数化简、函数inline(内联)、冗余消除等。 - - 图优化(后端): 对MindIR进行硬件相关优化,LazyInline,算子选择,图算融合等。 +Graph Mode大致分为3个阶段: + +- 图捕获(构图):Python代码 -> MindIR。 +- 图优化(前端):对MindIR进行硬件无关优化,代数化简、函数inline(内联)、冗余消除等。 +- 图优化(后端):对MindIR进行硬件相关优化,LazyInline、算子选择、图算融合等。 ## 图捕获(构图) -MindSpore提供三种捕获方式,如下 - - AST: 通过AST树解析的方式将执行的函数转换成IR图 - - bytecode(实验性): 对Python字节码的解析,尽可能的构建IR图,无法转换为IR图的部分则会按照动态图进行执行 - - trace(实验性): 通过追踪Python代码执行的轨迹来构建IR图 +MindSpore提供三种捕获方式,如下: + +- AST:通过AST树解析的方式将执行的函数转换成IR图 +- bytecode(实验性):对Python字节码的解析,尽可能的构建IR图,无法转换为IR图的部分则会按照PyNative Mode进行执行 +- trace(实验性):通过追踪Python代码执行的轨迹来构建IR图 -这三种模式在mindspore.jit中使用capture_mode来选择,以ast举例: 开发者可用`@mindspore.jit(capture_mode="ast")`装饰器修饰函,用ast方式修饰的函数,其语法有一定限制,我们提供两种模式供开发者选择。 -- strict模式:此模式目标是构成一张图,开发者的python代码如果无法构图,选择此模式运行程序时会报错,需要开发者进行代码修改,变为可构图的语法,适合追求性能的开发者。 -- lax模式:此模式目标是尽可能的让开发者程序可运行,思路是针对无法在strict模式构图的代码进行python fallback,即返回python层运行。 
+这三种模式在mindspore.jit中使用capture_mode来选择,以ast举例:开发者可用`@mindspore.jit(capture_mode="ast")`装饰器修饰函数。用ast方式修饰的函数,其语法有一定限制,我们提供两种模式供开发者选择。
+
+- strict模式:此模式目标是构成一张图,开发者的Python代码如果无法构图,选择此模式运行程序时会报错,需要开发者进行代码修改,变为可构图的语法,适合追求性能的开发者。
+- lax模式:此模式目标是尽可能的让开发者程序可运行,思路是针对无法在strict模式构图的代码进行Python fallback,即返回Python层运行。
+
+Graph Mode约束请参考[语法约束](https://www.mindspore.cn/tutorials/zh-CN/master/compile/static_graph.html)。ast如何将Python代码解析并构图,举例如下:
 
 ```python
 @mindspore.jit
@@ -63,9 +65,9 @@ subgraph @foo() {
 
 **ast模式的使用建议**:
 
-- 相比于动态图执行,被`@mindspore.jit`修饰的函数,在第一次调用时需要先消耗一定的时间进行编译。在该函数的后续调用时,若原有的编译结果可以复用,则会直接使用原有的编译结果进行执行。因此,使用@mindspore.jit装饰器修饰会多次执行的函数通常会获得更多的性能收益。
+- 相比于PyNative Mode执行,被`@mindspore.jit`修饰的函数,在第一次调用时需要先消耗一定的时间进行编译。在该函数的后续调用时,若原有的编译结果可以复用,则会直接使用原有的编译结果进行执行。因此,使用@mindspore.jit装饰器修饰会多次执行的函数通常会获得更多的性能收益。
 
-- graph mode的运行效率优势体现在其会将被@mindspore.jit修饰函数进行全局上的编译优化,函数内含有的操作越多,优化的空间越大。因此`@mindspore.jit`装饰器修饰的函数最好是内含操作很多的大代码块,而不应将很多细碎的、仅含有少量操作的函数分别打上jit标签。否则,则可能会导致性能没有收益甚至劣化。
+- Graph Mode的运行效率优势体现在其会将被@mindspore.jit修饰的函数进行全局上的编译优化,函数内含有的操作越多,优化的空间越大。因此`@mindspore.jit`装饰器修饰的函数最好是内含操作很多的大代码块,而不应将很多细碎的、仅含有少量操作的函数分别打上jit标签。否则,可能会导致性能没有收益甚至劣化。
 
 - 绝大部分计算以及优化都是基于对Tensor计算的优化,建议被修饰的函数应该是用来进行真正的数据计算的函数,而不是一些简单的标量计算或者数据结构的变换。
 
@@ -75,7 +77,7 @@ subgraph @foo() {
 
 与传统编译优化技术类似,MindSpore 中的编译优化也是通过一个个 Pass 来完成的。将每个 Pass 的上一个 Pass 所产生的 MindIR 作为输入,经过本 Pass 优化之后,产生新的 MindIR 表示作为输出。一个大的 Pass 可以包含多个小的 Pass,每个小的 Pass 只负责单点的编译优化,如:代数化简、函数内联(inline)、冗余消除等。一个 Pass 产生的优化结果,可能会为其它的 Pass 带来优化机会,故可以循环运行这些 Pass,直到产生的 MindIR 不再发生变化为止。
 
-前端编译优化技术有很多,如: 代数化简、函数inline(内联)、冗余消除等。这里仅介绍具有代表性的编译优化技术。
+前端编译优化技术有很多,如:代数化简、函数inline(内联)、冗余消除等。这里仅介绍具有代表性的编译优化技术。
 
 ### 1 代数化简
 
@@ -97,7 +99,7 @@
 b = x;
 c = y;
 ```
 
-在 MindSpore编译器中,代数化简原理不同于传统编译器,进行处理的是计算图而非传统控制流图,通过调整计算图中算子的执行顺序,或者删除不必要的算子,以保持计算图的简洁性和提高计算效率。
+在MindSpore编译器中,代数化简原理不同于传统编译器,进行处理的是计算图而非传统控制流图,通过调整计算图中算子的执行顺序,或者删除不必要的算子,以保持计算图的简洁性和提高计算效率。
 
 例如,在如下Python代码片段中:
 
@@ -264,7 +266,7 @@ MindSpore冗余消除的目的及使用的技术与传统编译器类似。不
 
 ```python
 import mindspore
- 
+
 @mindspore.jit
 def func(x, y):
@@ -342,21 +344,22 @@ MindSpore冗余消除的目的及使用的技术与传统编译器类似。不
 
 ## 图优化(后端)
 
-当MindIR图经过前端优化完成后,需要进行进一步优化(包含目标硬件)。优化模式我们分为O0,O1,用参数jit_level表示
- - **jit_level=O0**: 只做基本的图切分优化,以及算子选择(硬件相关),优点是可以保证IR图的原始结构,编译速度较快。
- - **jit_level=O1**: 增加图优化和自动算子融合,编译性能有所损失,但模型开始训练后,效率较高
+当MindIR图经过前端优化完成后,需要进行进一步优化(包含目标硬件)。优化模式我们分为O0、O1,用参数jit_level表示:
+
+- **jit_level=O0**:只做基本的图切分优化,以及算子选择(硬件相关),优点是可以保证IR图的原始结构,编译速度较快。
+- **jit_level=O1**:增加图优化和自动算子融合,编译性能有所损失,但模型开始训练后,效率较高。
 
-MindIR经过本轮优化后,会由runtime模块进行执行,涉及多级流水并发等技术,可参考[多级流水]
+MindIR经过本轮优化后,会由runtime模块进行执行,涉及多级流水并发等技术,可参考[多级流水](https://www.mindspore.cn/docs/zh-CN/master/features/runtime/multilevel_pipeline.html)。下面给出一个选择编译级别的简单示例。
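+
+以下为编译级别选择的示意用法(假设使用MindSpore 2.x中`@mindspore.jit`的`jit_level`参数,函数体与形状仅作演示):
+
+```python
+import mindspore
+
+# O0:保持IR图原始结构,编译速度较快。
+@mindspore.jit(jit_level="O0")
+def scale(x):
+    return x * 0.5
+
+# O1:增加图优化和自动算子融合,编译耗时增加,但执行效率更高。
+@mindspore.jit(jit_level="O1")
+def fused(x):
+    return mindspore.ops.sqrt(x * 0.5) + 1.0
+
+x = mindspore.ops.randn(2, 4)
+print(scale(x))
+print(fused(x))
+```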
 
 ### jit_level=O0 模式
 
 O0模式的优化较少,基础的优化主要为后端LazyInline和No-task node执行优化。
 
-- **LazyInline**: 主要思想是将函数调用的开销推迟到实际需要调用的时候,这样可以减少编译时的开销,提高编译效率。LazyInline在图编译阶段是将相同的子图结构复用,不展开放在图中,避免图规模较大导致影响编译性能。
+- **LazyInline**:主要思想是将函数调用的开销推迟到实际需要调用的时候,这样可以减少编译时的开销,提高编译效率。LazyInline在图编译阶段是将相同的子图结构复用,不展开放在图中,避免图规模较大导致影响编译性能。
 
 ![jit_level_lazyinline](./images/multi_level_compilation/jit_level_lazyinline.png)
 
-- **No-task node执行优化**: No-task node指的是Reshape、ExpandDims、Squeeze、Flatten、FlattenGrad、Reformat等诸类算子没有计算逻辑,不修改内存排布,仅修改shape、format等信息。在图编译结束后,将No-task node转换成ref node,输出跟输入同地址,执行过程中跳过kernel launch,从而达到执行性能优化目的。
+- **No-task node执行优化**:No-task node指的是Reshape、ExpandDims、Squeeze、Flatten、FlattenGrad、Reformat等诸类算子,它们没有计算逻辑,不修改内存排布,仅修改shape、format等信息。在图编译结束后,将No-task node转换成ref node,输出跟输入同地址,执行过程中跳过kernel launch,从而达到执行性能优化目的。
 
 ![jit_level_no_task](./images/multi_level_compilation/jit_level_no_task.png)
 
@@ -389,7 +392,7 @@ MindSpore 在Ascend硬件的算子类型有aclnn kernel/aclop kernel/hccl kernel
 
 ### jit_level=O1 模式
 
- 当前O1主要支持了图算融合优化。其主要思路是:在编译阶段,自动识别计算图中相邻的可融合节点,然后将其融合为更大粒度的可执行算子。通过图算融合,实现增加算子计算局部性、减少整体全局内存访存带宽开销等优化效果。通过对主流SOTA模型的实测验证,O1能够实现相比O0平均15%的性能加速。特别是对于访存密集型网络,O1优化效果更加显著。
+当前O1主要支持了图算融合优化。其主要思路是:在编译阶段,自动识别计算图中相邻的可融合节点,然后将其融合为更大粒度的可执行算子。通过图算融合,实现增加算子计算局部性、减少整体全局内存访存带宽开销等优化效果。通过对主流SOTA模型的实测验证,O1能够实现相比O0平均15%的性能加速。特别是对于访存密集型网络,O1优化效果更加显著。
 
 #### 图算融合
 
@@ -444,4 +447,4 @@ MindSpore AKG的整体框架如上图所示:
 - 后端优化
   - 为了进一步提升算子的性能,我们针对不同硬件后端开发了相应的优化pass,如Ascend后端中实现数据对齐、指令映射,GPU后端中实现向量化存取,插入同步指令等,最终生成相应平台代码。
 
-总结: MindSpore编译从图捕获模式,IR优化图算融合等各维度对AI模型代码进行优化,很多特性在易用性和性能方面的取舍也有一定挑战。我们也规划进一步分层解耦整个流程,避免黑盒运行,增加开发者理解的门槛。
\ No newline at end of file
+总结:MindSpore编译从图捕获模式、IR优化、图算融合等各维度对AI模型代码进行优化,很多特性在易用性和性能方面的取舍也有一定挑战。我们也规划进一步分层解耦整个流程,避免黑盒运行,降低开发者理解的门槛。
\ No newline at end of file
diff --git a/docs/mindspore/source_zh_cn/features/index.rst b/docs/mindspore/source_zh_cn/features/index.rst
index 657f6d0bca..3484e5a43c 100644
--- a/docs/mindspore/source_zh_cn/features/index.rst
+++ b/docs/mindspore/source_zh_cn/features/index.rst
@@ -11,9 +11,7 @@ Developer Notes
    parallel/optimizer_parallel
    parallel/pipeline_parallel
    parallel/auto_parallel
-   compile/multi_level_compilation
-   compile/graph_construction
-   compile/graph_optimization
+   compile/compilation_guide
    runtime/memory_manager
    runtime/multilevel_pipeline
    runtime/multistream_concurrency
diff --git a/docs/mindspore/source_zh_cn/features/overview.md b/docs/mindspore/source_zh_cn/features/overview.md
index 95f47265c6..4f773ad7b2 100644
--- a/docs/mindspore/source_zh_cn/features/overview.md
+++ b/docs/mindspore/source_zh_cn/features/overview.md
@@ -45,13 +45,13 @@ MindSpore实现了函数式微分编程,对可被微分求导的函数对象
 
 ### 编程范式(动静结合)
 
-传统AI框架主要有两种编程执行形态,静态图模式(graph mode)和动态图模式(pynative mode)。动态图模式又称eager mode。
+传统AI框架主要有两种编程执行形态,静态图模式(Graph Mode)和动态图模式(PyNative Mode)。动态图模式又称Eager Mode。
 
-graph mode会在编译时生成神经网络的模型计算的图结构,然后再执行计算图。
+Graph Mode会在编译时生成神经网络的模型计算的图结构,然后再执行计算图。
 
-pynative mode,由于程序是按照代码的编写顺序执行,符合python解释执行方式,易开发和调试。因为不做图编译优化,性能优化空间较少,特别是面向DSA等专有硬件的优化具有较大挑战。
+PyNative Mode下,程序按照代码的编写顺序执行,符合Python解释执行方式,易于开发和调试。因为不做图编译优化,性能优化空间较少,特别是面向DSA等专有硬件的优化具有较大挑战。
 
 MindSpore基于Python构建神经网络的图结构,相比于传统的Graph Mode,能有更易用、更灵活的表达能力。MindSpore创新性的构建源码转换能力,基于Python语句提取AST进行计算图构建,因此可以支持开发者使用的Python原生语法(条件/循环等)和其他操作,如元组(Tuple)、列表(List)以及Lambda表达来构建计算图,并对计算图进行自动微分。所以MindSpore能更好地兼容动态图和静态图的编程接口,在代码层面保持一致,如控制流写法等。
 
 原生Python表达可基于Python控制流关键字,直接使能静态图模式的执行,使得动静态图的编程统一性更高。同时开发者基于MindSpore的接口,可以灵活的对Python代码片段进行动静态图模式控制。即可以将程序局部函数以静态图模式执行([mindspore.jit](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.jit.html))而其他函数按照动态图模式执行。从而使得在与常用Python库、自定义Python函数进行穿插执行使用时,开发者可以灵活指定函数片段进行静态图优化加速,而不牺牲穿插执行的编程易用性。
 
@@ -69,7 +69,7 @@ MindSpore在并行化策略搜索中引入了张量重排布技术(Tensor Redi
 
 MindSpore基于编译技术,提供了丰富的硬件无关优化,如IR融合、代数化简、常数折叠、公共子表达式消除等。同时针对NPU、GPU等不同硬件,也提供各种硬件优化能力,从而更好的发挥硬件的大规模计算加速能力。
 
-#### [多级编译架构](https://www.mindspore.cn/docs/zh-CN/master/features/compile/compilation_guide_zh.html#图算融合)
+#### [多级编译架构](https://www.mindspore.cn/docs/zh-CN/master/features/compile/compilation_guide.html#图算融合)
MindSpore等主流AI计算框架对开发者提供的算子通常是从开发者可理解、易使用的角度进行定义。每个算子承载的计算量不等,计算复杂度也各不相同。但从硬件执行角度看,这种天然的、基于开发者角度的算子计算量划分并不高效,也无法充分发挥硬件资源的计算能力。主要体现在: