diff --git a/docs/mindspore/source_en/features/compile/compilation_guide.md b/docs/mindspore/source_en/features/compile/compilation_guide.md
new file mode 100644
index 0000000000000000000000000000000000000000..7ba214b8b4f06811b32fb8d69d110b8cf0fc0f40
--- /dev/null
+++ b/docs/mindspore/source_en/features/compile/compilation_guide.md
@@ -0,0 +1,453 @@
# mindspore.jit Multi-Level Compilation Optimization

[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_en/features/compile/compilation_guide.md)

## MindSpore Compilation Architecture

MindSpore uses JIT (just-in-time) compilation for performance optimization. JIT converts Python code into an intermediate representation (IR) graph, named MindIR, through AST parsing, Python bytecode parsing, or code execution tracing. The compiler optimizes this IR graph to improve runtime performance. In contrast to PyNative Mode, this JIT compilation mode is called Graph Mode.

Python code written by developers runs in PyNative Mode by default. Functions can be decorated with the @mindspore.jit decorator to specify execution in Graph Mode. For documentation on the @mindspore.jit decorator, please refer to the [jit documentation](https://www.mindspore.cn/docs/en/r2.7.0/api_python/mindspore/mindspore.jit.html).

Graph Mode is roughly divided into 3 stages:

- Graph Capture (Graph Construction): Python code -> MindIR.
- Graph Optimization (Frontend): hardware-independent optimization of MindIR, such as algebraic simplification, function inlining, and redundancy elimination.
- Graph Optimization (Backend): hardware-dependent optimization of MindIR, such as LazyInline, operator selection, and graph-operator fusion.

## Graph Capture (Graph Construction)

MindSpore provides three capture methods:

- ast: converts the decorated function to an IR graph through AST parsing.
- bytecode (experimental): parses Python bytecode to construct IR graphs as far as possible; parts that cannot be converted to IR graphs are executed as a dynamic graph.
- trace (experimental): constructs IR graphs by tracing the execution trajectory of the Python code.

Taking ast as an example: developers can use the `@mindspore.jit(capture_mode="ast")` decorator to modify functions. Functions modified with ast mode must follow certain syntax restrictions, and we provide two modes for developers to choose from:

- strict mode: the goal of this mode is to construct a single graph. If the developer's Python code cannot be converted to a graph, running the program raises an error, and the developer has to rewrite the code with graph-compatible syntax. This mode suits developers who pursue performance.
- lax mode: the goal of this mode is to make the developer's program runnable as far as possible. The idea is to perform a Python fallback for code that cannot be captured in strict mode, that is, to return to the Python layer for execution.

For Graph Mode constraints, please refer to [Syntax Constraints](https://www.mindspore.cn/tutorials/en/r2.7.0/compile/static_graph.html).
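All three capture methods are selected through the `capture_mode` argument of `mindspore.jit`. The sketch below only illustrates how the decorator is applied; the function body and the `ops.randn`/`ops.matmul` calls are illustrative choices rather than a prescribed pattern, and `bytecode` and `trace` remain experimental:

```python
import mindspore
from mindspore import ops

@mindspore.jit(capture_mode="ast")       # AST-based capture
def f_ast(x, y):
    return ops.matmul(x, y)

@mindspore.jit(capture_mode="bytecode")  # experimental: bytecode-based capture with Python fallback
def f_bytecode(x, y):
    return ops.matmul(x, y)

@mindspore.jit(capture_mode="trace")     # experimental: capture by tracing the first execution
def f_trace(x, y):
    return ops.matmul(x, y)

x = ops.randn(2, 3)
y = ops.randn(3, 4)
out_ast, out_bytecode, out_trace = f_ast(x, y), f_bytecode(x, y), f_trace(x, y)
```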
Here's an example of how ast parses Python code and constructs a graph:

```python
@mindspore.jit
def foo(x, y):
    z = x + y
    return z
```

The corresponding abstract syntax tree is as follows:

![Abstract Syntax Tree](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/docs/mindspore/source_zh_cn/features/compile/images/ast.png)

By parsing the above abstract syntax tree, we obtain the following IR:

```text
%para1_x:
%para2_y:

subgraph instance: foo
subgraph @foo() {
  %0(CNode_17) = PrimFunc_Add(%para1_x, %para2_y)
      : (, ) -> ()
  Return(%0)
      : ()
}
```

**Advantages of ast**:

- ast mode gives users stronger programming autonomy and more precise performance optimization: they can tune network performance to the optimal level based on function characteristics and usage experience.

**Limitations of ast**:

- Functions decorated in ast mode must be written strictly in static graph syntax.

**Recommendations for ast mode**:

- Compared with dynamic graph execution, a function decorated with `@mindspore.jit` consumes some compilation time on its first call. On subsequent calls, if the existing compilation result can be reused, it is executed directly. Therefore, decorating functions that will be executed many times with `@mindspore.jit` usually yields more performance benefit.

- The runtime efficiency advantage of Graph Mode comes from global compilation optimization of the functions decorated with `@mindspore.jit`: the more operations a function contains, the larger the optimization space. Decorated functions should therefore be large code blocks containing many operations, rather than many small fragmented functions, each containing only a few operations and carrying its own jit tag. Otherwise, there may be no performance gain or even degradation.

- Most computation and optimization is based on optimizing Tensor calculations. It is recommended to decorate functions that perform real data computation, rather than simple scalar calculations or data structure transformations.

- For functions decorated with `@mindspore.jit` whose inputs contain constants, every change of the input value causes recompilation. For the concepts of constants and variables, please refer to [Constants and Variables in Just-in-Time Compilation](https://www.mindspore.cn/tutorials/en/r2.7.0/compile/static_graph.html). It is therefore recommended that decorated functions take Tensors, or data wrapped by mutable, as input to avoid the extra performance loss caused by repeated compilation.

## Graph Optimization (Frontend)

Similar to traditional compilation optimization techniques, compilation optimization in MindSpore is completed through individual Passes. Each Pass takes the MindIR produced by the previous Pass as input and, after its optimization, produces a new MindIR representation as output. A large Pass can contain multiple small Passes, each responsible for a single-point optimization such as algebraic simplification, function inlining, or redundancy elimination. The optimization result produced by one Pass may create optimization opportunities for other Passes, so these Passes can be run in cycles until the produced MindIR no longer changes.
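The fixed-point iteration described above can be pictured with a small conceptual sketch. This is not MindSpore's internal API; the `run_passes` helper and the pass interface are hypothetical and only illustrate the idea of re-running Passes until the IR stops changing:

```python
# Conceptual sketch only: a "pass" is modeled as a function that takes an IR
# and returns (new_ir, changed). MindSpore's real pass manager is implemented
# differently; this just mirrors the fixed-point behavior described above.
def run_passes(ir, passes):
    changed = True
    while changed:                      # iterate until no pass changes the IR
        changed = False
        for optimization_pass in passes:
            ir, modified = optimization_pass(ir)
            changed = changed or modified
    return ir                           # MindIR that is stable under every pass
```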
+ +There are many frontend compilation optimization techniques, such as: algebraic simplification, function inlining, redundancy elimination, etc. Here we only introduce representative compilation optimization techniques. + +### 1 Algebraic Simplification + +In traditional compilers, algebraic simplification is a compiler optimization technique aimed at simplifying algebraic expressions in source code, eliminating redundant calculations, improving program execution efficiency, and reducing memory usage. + +For example, in the following code snippet: + +```cpp +int a = x * 1; +int b = x + 0; +int c = x * 0 + y * 1; +``` + +Traditional compilers perform equivalent replacement of identified expressions according to algebraic rules and identities. Common algebraic rules include associative law, commutative law, and distributive law, etc. The compiler tries to replace expressions with simpler forms as much as possible. Optimization is performed through analysis of AST (Abstract Syntax Tree) or SSA (Static Single Assignment), identifying and simplifying code to: + +```cpp +a = x; +b = x; +c = y; +``` + +In the MindSpore compiler, the principle of algebraic simplification is different from traditional compilers. It processes computation graphs rather than traditional control flow graphs, by adjusting the execution order of operators in the computation graph, or deleting unnecessary operators, to maintain the simplicity of the computation graph and improve computational efficiency. + +For example, in the following Python code snippet: + +```python +import numpy as np +import mindspore + +@mindspore.jit +def func(x): + return x + 0 + +m = mindspore.tensor(np.array([[1, 2, 3], [4, 5, 6]]).astype(np.int32)) +out = func(m) +``` + +The MindSpore graph compiler will convert the Python program to a computation graph, which consists of multiple subgraphs. Algebraic operations in the source program are converted to operator calls within subgraphs. You can see that the PrimFunc_Add operator is called once. + +```text +%para1_x: + +subgraph @1_func_14() { + %0(CNode_7) = PrimFunc_Add(%para1_x, Tensor(shape=[], dtype=Int32, value=0)) + : (, ) -> () + + Return(%0) + : () +} +``` + +Through algebraic simplification, the PrimFunc_Add operator can be directly deleted, simplifying the computation graph structure, and simplifying `x + 0` to `x`. + +```text +%para1_x: + +subgraph @1_func_14() { + Return(%para1_x) + : () +} +``` + +Algebraic simplification can involve more modifications to the computation graph structure. It is usually combined with other compiler optimization techniques (such as constant folding, constant propagation, etc.) to jointly improve program performance. + +### 2 Function Inlining + +In traditional compilers, inlining is an optimization technique that can directly replace the code of called functions at the location where the function is called, improving program execution efficiency. Suppose we have a C++ function `add` for summing two numbers: + +```cpp +int add(int a, int b) { + return a + b; +} + +int main() { + int x = add(3, 5); + int y = add(x, 10); + return y; +} +``` + +The compiler inlines the function body directly to the call site, which eliminates the overhead of function calls and creates conditions for subsequent optimizations (such as eliminating redundant calculations `3 + 5`, directly evaluating and replacing at compile time). This idea of replacing calls with code is the core of inlining. 
```cpp
int main() {
    int x = 3 + 5;   // Replace first call
    int y = x + 10;  // Replace second call
    return y;
}
```

In AI framework computation graph compilers, the goal of inlining is similar, but the object being operated on changes from "functions" to "subgraphs". Suppose we have the following Python program:

```python
import mindspore

def f2(x: mindspore.Tensor, y: mindspore.Tensor):
    return x * 0.5 + y

@mindspore.jit
def f1(a: mindspore.Tensor, b: mindspore.Tensor, c: mindspore.Tensor):
    x = f2(a, b)
    y = f2(a, c)
    return x + y

# Create 3 random-valued Tensors with shape=(2, 4)
a = mindspore.ops.randn(2, 4)
b = mindspore.ops.randn(2, 4)
c = mindspore.ops.randn(2, 4)
out = f1(a, b, c)
```

First, MindSpore's computation graph compiler converts the Python program to a computation graph. Function calls in the Python program are converted to calls between computation graphs, resulting in an original computation graph similar to the following, in which the main graph f1 calls the subgraph f2 twice.

```text
# Params:
%para1_a:
%para2_b:
%para3_c:

subgraph @f2(%para1_x, %para2_y) {
    %0 = PrimFunc_Mul(%para1_x, Float32(0.5))

    %1 = PrimFunc_Add(%0, %para2_y)

    Return(%1)
}

subgraph @f1() {
    %0(x) = call @f2(%para1_a, %para2_b)  # Call subgraph f2

    %1(y) = call @f2(%para1_a, %para3_c)  # Call subgraph f2

    %2 = PrimFunc_Add(%0, %1)

    Return(%2)
}
```

Through inlining, the subgraph f2 can be expanded and merged into the main graph f1.

```text
subgraph @f1() {
    # First subgraph inlining
    %0 = PrimFunc_Mul(%para1_a, Float32(0.5))  # Repeated calculation step
    %1 = PrimFunc_Add(%0, %para2_b)

    # Second subgraph inlining
    %2 = PrimFunc_Mul(%para1_a, Float32(0.5))  # Repeated calculation step
    %3 = PrimFunc_Add(%2, %para3_c)

    %4 = PrimFunc_Add(%1, %3)

    Return(%4)
}
```

Before inlining expands the subgraph, the compiler may not be able to identify the repeated operations across the two calls to subgraph f2 (at this point the subgraph is usually treated as a black box). After inlining, the compiler can clearly see that `x * 0.5` is calculated twice, which can trigger a further compiler optimization, Common Subexpression Elimination (CSE), and thus reduce the amount of calculation.

```text
subgraph @f1() {
    %0 = PrimFunc_Mul(%para1_a, Float32(0.5))  # CSE merges repeated calculations

    %1 = PrimFunc_Add(%0, %para2_b)

    %2 = PrimFunc_Add(%0, %para3_c)  # Directly reuse %0

    %3 = PrimFunc_Add(%1, %2)

    Return(%3)
}
```

By expanding subgraphs through inlining, the compiler can identify cross-subgraph optimization opportunities more clearly. In addition to Common Subexpression Elimination (CSE), inlining can trigger many other optimizations such as operator fusion and memory management. It is therefore an important optimization mechanism in computation graph compilers and the foundation for many cross-graph optimizations.

### 3 Redundancy Elimination

In traditional compilers, redundancy elimination covers various compilation optimization techniques aimed at identifying redundant parts of the code during compilation and eliminating them, reducing unnecessary calculations and improving program execution efficiency.

Redundant code may be written intentionally by users for readability, or it may simply be an unintentional by-product of the coding process.
In addition, intermediate results produced by the compilation optimization process itself through other optimization techniques (such as: algebraic simplification, inlining, common subexpression elimination, etc.) may also bring opportunities for redundancy elimination. + +The purpose and techniques used in MindSpore redundancy elimination are similar to traditional compilers. The difference is that these redundancy optimizations are completed on MindIR. For example: + +1. **Dead Code Elimination** + + Suppose there is Python code with redundant calculations as follows: + + ```python + import mindspore + + @mindspore.jit + def func(x, y): + a = x + y + b = x - y + c = x * y # Dead code + d = a / b + return d + + x = mindspore.tensor(20, mindspore.float32) + y = mindspore.tensor(10, mindspore.float32) + out = func(x, y) + ``` + + The MindSpore graph compiler will convert Python code decorated with `@mindspore.jit` to MindIR representation through static analysis and eliminate the redundant calculation of c = x * y. The final generated MindIR is as follows: + + ```text + # Params: + %para1_x: + %para2_y: + + subgraph @func_1() { + %0(a) = PrimFunc_Add(%para1_x, %para2_y) + : (, ) -> () + %1(b) = PrimFunc_Sub(%para1_x, %para2_y) + : (, ) -> () + %2(d) = PrimFunc_Div(%0, %1) + : (, ) -> () + Return(%2) + : () + } + ``` + +2. **Unreachable Code Elimination** + + Suppose there is Python code with unreachable paths as follows: + + ```python + import mindspore + + @mindspore.jit + def func(x, y): + a = x + y + if 1 < 0: # Unreachable branch + b = x + y + else: + b = x - y + d = a / b + return d + + x = mindspore.tensor(20, mindspore.float32) + y = mindspore.tensor(10, mindspore.float32) + out = func(x, y) + ``` + + The MindSpore graph compiler will convert Python code decorated with `@mindspore.jit` to MindIR representation through static analysis and eliminate the redundant control flow branch code of `1 < 0`. The final generated MindIR is as follows: + + ```text + # Params: + %para1_x: + %para2_y: + + subgraph @func_1() { + %0(a) = PrimFunc_Add(%para1_x, %para2_y) + : (, ) -> () + %1(b) = PrimFunc_Sub(%para1_x, %para2_y) + : (, ) -> () + %2(d) = PrimFunc_Div(%0, %1) + : (, ) -> () + Return(%2) cnode_attrs: {checkpoint: Bool(1)} + : () + } + ``` + +Redundancy elimination plays an important role in compilation optimization. Without changing the original semantics of the program, it can significantly improve program execution efficiency and save computational resources by reducing unnecessary runtime calculations. Redundancy elimination is usually combined with other compilation optimization techniques to obtain more opportunities for eliminating redundant code. + +## Graph Optimization (Backend) + +After the MindIR graph completes frontend optimization, it needs further optimization (including target hardware). The optimization modes are divided into O0 and O1, represented by the parameter jit_level: + +- **jit_level=O0**: Only performs basic graph segmentation optimization and operator selection (hardware-related). The advantage is that it can guarantee the original structure of the IR graph and has faster compilation speed. +- **jit_level=O1**: Adds graph optimization and automatic operator fusion. Compilation performance is somewhat lost, but after the model starts training, efficiency is higher. + +After this round of optimization, MindIR will be executed by the runtime module, involving multi-level pipeline concurrency and other technologies. 
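In practice, the optimization level is chosen per function through the `jit_level` argument of `mindspore.jit`, as in the sketch below; the function body and tensor shapes are arbitrary illustrations:

```python
import mindspore
from mindspore import ops

# Minimal usage sketch: jit_level="O0" keeps the original graph structure and
# compiles fastest; jit_level="O1" additionally enables automatic operator
# fusion, as described above.
@mindspore.jit(jit_level="O1")
def dense_add(x, w, b):
    return ops.matmul(x, w) + b

x = ops.randn(2, 3)
w = ops.randn(3, 4)
b = ops.randn(2, 4)
out = dense_add(x, w, b)
```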
For background on the runtime's multi-level pipelining mentioned above, see [Multi-Level Pipeline](https://www.mindspore.cn/docs/en/r2.7.0/features/runtime/multilevel_pipeline.html).

### jit_level=O0 Mode

O0 mode performs only a few optimizations. The basic ones are backend LazyInline and No-task node execution optimization.

- **LazyInline**: The main idea is to postpone the overhead of a function call until it is actually needed, which reduces compilation overhead and improves compilation efficiency. LazyInline reuses the same subgraph structure during the graph compilation phase instead of expanding it in the graph, avoiding the impact of a large graph scale on compilation performance.

  ![jit_level_lazyinline](./images/multi_level_compilation/jit_level_lazyinline.png)

- **No-task node Execution Optimization**: No-task nodes are operators such as Reshape, ExpandDims, Squeeze, Flatten, FlattenGrad, and Reformat. These operators have no computational logic and do not modify the memory layout; they only modify information such as shape and format. At the end of graph compilation, No-task nodes are converted to ref nodes, whose output shares the same address as the input, and kernel launch is skipped during execution to optimize execution performance.

  ![jit_level_no_task](./images/multi_level_compilation/jit_level_no_task.png)

#### Operator Selection

Operators are the basic execution units in deep learning frameworks. They are responsible for specific computational tasks such as matrix multiplication, convolution, and pooling. Operator selection must comprehensively consider factors such as operator type, data type, hardware platform, and operator optimization in order to select the optimal operator and achieve the highest model runtime efficiency.

MindSpore's operator types on Ascend hardware are aclnn kernel, aclop kernel, hccl kernel, and cpu kernel. The operator selection process is shown in the following figure:

![jit_level_kernelselect](./images/multi_level_compilation/jit_level_kernelselect.png)

1. Operator type: first, based on the operator type, decide whether it is a computational operator or a communication operator.
2. Hardware platform: if there is a corresponding operator on the hardware, the hardware operator is preferred; otherwise the CPU operator is chosen (heterogeneous execution). For example, shape-related computational operators may only be suitable for the CPU and have no corresponding hardware operator.
3. Operator efficiency: because aclnn operators perform better on Ascend hardware, computational operators prefer the aclnn kernel if one exists; otherwise the aclop kernel is chosen.
4. If no operator is selected in the above 3 steps, the operator is unsupported and operator selection fails with an error.

#### Execution Order Scheduling

Different graph traversal algorithms produce execution orders that differ greatly in execution performance and memory usage, as shown in the figure:

![jit_level_exec_order](./images/multi_level_compilation/jit_level_exec_order.png)

- **Execution order obtained by BFS**: kernel1-> kernel2-> kernel4-> kernel5-> kernel3-> kernel6, memory peaks at 5G (kernel3 can release kernel1 and kernel2 after execution, then reuse them when it's kernel6's turn to execute, so kernel6 doesn't need to request extra memory).
+- **Execution order obtained by DFS**: kernel1-> kernel2-> kernel3-> kernel4-> kernel5-> kernel6, memory peaks at 4G (kernel3 can release kernel1 and kernel2 after execution, then reuse them when it's kernel4 and kernel5's turn to execute, so kernel4 and kernel5 don't need to request extra memory). + +Execution order scheduling is a complex problem of solving optimal operator concurrency under certain memory constraints. It not only requires identifying and exploiting concurrency opportunities in the computational graph to improve computational efficiency, but also must consider multiple constraints simultaneously to ensure system stability and efficiency. + +- First, the optimization module needs to address the complexity of solving for optimal operator concurrency. Due to the large number of operators in the computational graph and their interdependencies, finding an execution order that maximizes concurrency while maintaining the logical correctness of the computational graph is a challenging task. + +- Second, memory constraints are a critical factor that cannot be ignored in execution order optimization. Increasing concurrency, while improving computational efficiency, tends to significantly increase peak memory requirements, which may lead to Out of Memory (OOM) errors, especially in resource-constrained environments. Therefore, the optimization module must weigh the relationship between concurrency and memory usage to ensure that concurrency is increased without exceeding the memory capacity of the system. +- MindSpore's execution order adjustment module combines rule-based and heuristic-based strategies to provide both bfs/dfs execution order orchestration algorithms [mindspore.jit(option={"exec_order":"bfs/dfs"})](https://www.mindspore.cn/docs/en/r2.7.0/api_python/mindspore/mindspore.jit.html) to achieve fine-grained adjustment of the execution order of the computation graph, thus effectively dealing with multiple challenges such as memory constraints and system stability while ensuring computational efficiency. + +### jit_level=O1 Mode + +Currently O1 mainly supports graph-operator fusion optimization. The main idea is: during the compilation phase, automatically identify neighboring fusable nodes in the computational graph, then fuse them into executable operators with larger granularity. Through graph-operator fusion, optimization effects such as increasing operator computational locality and reducing overall global memory access bandwidth overhead are achieved. Through real-world testing verification on mainstream SOTA models, O1 can achieve an average 15% performance acceleration compared to O0. Especially for memory access-intensive networks, the optimization effect of O1 is more significant. + +#### Graph-Kernel Fusion + +Mainstream AI computing frameworks such as MindSpore provide operators to users that are usually defined from the perspective of user understanding and ease of use. Each operator carries different amounts of computation and varies in computational complexity. However, from the hardware execution perspective, this natural, user perspective-based division of operator computation volume is not efficient and cannot fully utilize the computational power of hardware resources. This is mainly reflected in: + +1. Operators with too much computation and overly complex operators usually make it difficult to generate well-split high-performance operators, thereby reducing device utilization; +2. 
Operators with too little computation may also cause computational latency and thus reduce device utilization, as the computation cannot effectively hide data movement overhead; +3. Hardware devices are usually multi-core, many-core architectures. When operator shapes are small or other reasons cause insufficient computational parallelism, it may cause some cores to be idle, thus reducing device utilization. Especially chips based on Domain Specific Architecture (DSA for short) are more sensitive to these factors. How to maximize hardware computational performance while making operators easy to use has always been a big challenge. + +In terms of AI framework design, the current industry mainstream adopts a layered implementation approach of graph layer and operator layer. The graph layer is responsible for fusing or regrouping the computational graph, and the operator layer is responsible for compiling the fused or regrouped operators into high-performance executable operators. The graph layer usually uses Tensor-based High-Level IR for processing and optimization, while the operator layer uses computation instruction-based Low-Level IR for analysis and optimization. This artificial layered processing significantly increases the difficulty of collaborative optimization between the graph and computation layers. + +MindSpore has adopted the technique of graph-operator fusion to better solve this problem in the past few years of technical practice. Typical networks in different categories such as NLP and recommendation show significant gains in training speed after enabling graph-operator fusion. One of the main reasons is the presence of a large number of small operator combinations in these networks, which have more opportunities for fusion optimization. + +#### Graph-Kernel Fusion Architecture and Overall Process + +The overall architecture of graph-operator fusion is shown in the figure below. The main idea in the graph layer is to expand composite operators, then perform cross-boundary aggregation and optimization, and finally perform kernel operator splitting. The main steps include: + +1. Composite Expansion: Expand composite operators into basic operators and form composite subgraphs to facilitate subsequent cross-boundary optimization and operator splitting; + +2. Cross-OP Aggregation: Aggregate adjacent basic operators or composite subgraphs to form larger aggregated subgraphs for subsequent cross-boundary optimization and operator splitting; + +3. High-Level Optimization: Based on the aggregated subgraphs obtained in the above two steps, we can perform a large number of cross-boundary optimizations, such as algebraic simplification, common subexpression extraction (CSE), etc.; + +4. Kernel Partition: Based on computational features and fusion operator performance, perform operator splitting on the aggregated computational subgraph. + +The optimized computational graph is passed to MindSpore AKG as subgraphs for further backend optimization and target code generation. + +![graphkernel](./images/graphkernel.png) + +Through the above steps, we can obtain two aspects of performance gains: + +1. Cross-boundary performance optimization gains between different operators; +2. Through reorganization and splitting of the entire computational graph, the optimal granularity of fusion operators is obtained. 
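As a concrete illustration of the kind of code that benefits, a chain of small element-wise operators is a typical fusion candidate: under O1, neighboring nodes like these may be fused into one larger kernel. The function body below is an illustrative sketch, not taken from a real network, and the fusion outcome depends on the backend:

```python
import mindspore
from mindspore import ops

# Illustrative fusion candidate: three neighboring element-wise operators.
# With jit_level="O1", graph-kernel fusion may combine them into a single
# larger kernel, reducing global-memory round trips between the small ops.
@mindspore.jit(jit_level="O1")
def elementwise_chain(x, y, z):
    t = x * y            # element-wise Mul
    t = t + z            # element-wise Add
    return ops.relu(t)   # element-wise activation

x = ops.randn(1024, 1024)
y = ops.randn(1024, 1024)
z = ops.randn(1024, 1024)
out = elementwise_chain(x, y, z)
```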
+ +#### Fusion Operator Acceleration Optimization (MindSpore AKG) + +As mentioned earlier, in scenarios such as HPC and deep neural network training, graph-operator fusion optimization can bring exponential performance improvements. However, with the increasing capability of graph-operator fusion, the development of fusion operators has become a bottleneck point for continuing to improve graph-operator fusion capability. + +Automatic generation technology of fusion operators can solve the problem of high programming threshold for developing fusion operators based on DSA, allowing programmers to focus on operator implementation logic during operator development without focusing on backend optimization, greatly improving their development efficiency. Especially for scenarios with complex backend hardware architectures and the presence of complex operators and fusion operators, automatic operator generation technology is more critical. + +Therefore, **MindSpore AKG accelerates optimization and automatic generation of fusion operators based on Polyhedral Compilation Technology (Polyhedral Model)**, which can help fusion operators optimized by MindSpore's graph-operator fusion module to automatically generate high-performance kernels on **heterogeneous hardware platforms**(GPU/Ascend) and improve MindSpore training performance. + +- IR Normalization + - The input of MindSpore AKG is the fusion subgraph optimized by MindSpore's graph-operator fusion module. The operators in the subgraph are expressed through various description methods such as TVM's Compute/IR Builder/Hybrid. Then the DSL is converted to [Halide](https://halide-lang.org/) IR (Halide, a common language used for developing high-performance image processing and array computation, which can be used as an intermediate representation to decouple algorithms and optimization) and IR normalization; + + - After initial simplification and optimization is completed, the Halide IR is transformed into the scheduling tree required by the Poly module; + +- Poly Module Scheduling Optimization + - Using the Pluto scheduling algorithm in polyhedral technology to achieve automatic loop fusion, automatic rearrangement and other transformations, automatically generating initial scheduling that satisfies parallelism and data locality for fusion operators; + + - To quickly adapt to different hardware backends, the optimization passes in the Poly module are divided into hardware-independent generic optimizations and hardware-related specific optimizations, which are stitched and combined according to hardware features at compilation time to achieve fast adaptation of heterogeneous hardware backends. Auto-slicing, auto-mapping, and auto-memory boosting passes will give different optimization methods according to the nature of different hardware architectures; + +- Backend Optimization + - To further improve operator performance, we developed corresponding optimization passes for different hardware backends, such as data alignment and instruction mapping in Ascend backend, vectorized access and insertion of synchronization instructions in GPU backend, and finally generate corresponding platform code. + +Summary: MindSpore compilation optimizes AI model code from various dimensions such as graph capture mode, IR optimization, graph-operator fusion, etc. Many features also face certain challenges in the trade-off between usability and performance. 
We also plan to further layer and decouple the entire process to avoid black-box operation and increase the threshold for developer understanding. \ No newline at end of file diff --git a/docs/mindspore/source_en/features/compile/graph_construction.md b/docs/mindspore/source_en/features/compile/graph_construction.md deleted file mode 100644 index 566d9d7b81d014a1f5e42926021db69076a5efce..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/features/compile/graph_construction.md +++ /dev/null @@ -1,181 +0,0 @@ -# Graph Construction (Compilation) - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_en/features/compile/graph_construction.md) - -MindSpore provides JIT (just-in-time) technology to optimize the performance. The JIT mode parses the code into an intermediate representation (IR) graph by means of AST tree parsing, Python bytecode parsing or code execution tracing, which serves as a unique representation of the code, and the compiler optimizes the code by optimizing the IR graph to improve the runtime performance. In contrast to the dynamic graph model, this JIT compilation model is called the static graph model. - -Based on JIT technology, MindSpore provides a dynamic-static combination approach to improve the operational efficiency of the user's network. The combination of dynamic and static, that is, in the overall run as a dynamic graph, specifies certain code blocks to run as a static graph. Code blocks that run as static graphs are compiled first and then executed, and global optimizations are performed during the compilation period to obtain performance gains during the execution period. Users can modify functions with the `@jit` decorator to specify that they execute according to the pattern of a static graph. For the documentation on the `@jit` decorator, refer to [jit API documentation](https://www.mindspore.cn/docs/en/r2.7.0/api_python/mindspore/mindspore.jit.html#mindspore.jit). - -MindSpore provides three JIT compilation methods, namely, ast, bytecode and trace. The ast converts the functions that are identified by the users manually and need to be executed in accordance with the ast into a static graph through the AST tree parsing. The bytecode is through the Python bytecode parsing, in the dynamic graph as much as possible to build a static graph. The part that can not be converted to a static graph will be in accordance with the dynamic graph for the purpose of combining static and dynamic. The trace constructs a static graph by tracing the execution path of Python code and is currently an experimental feature. Subsequent introduction will explain in detail the difference among the three principles and their respective characteristics. - -## Ast - -In dynamic graph mode, the user can modify a function to execute in ast mode by using the `@jit(capture_mode=“ast”)` decorator. The syntax and data structures used inside the functions which decorated by ast mode need to strictly follow the [Static Graph Syntax Specification](https://www.mindspore.cn/tutorials/en/r2.7.0/compile/static_graph.html). The ast approach compiles Python code via a source-to-source method, which first parses the Python source code of model definitions into an Abstract Syntax Tree (AST), then converts the AST into MindIR. 
For example, the following Python code: - -```python -@jit -def foo(x, y): - z = x + y - return z -``` - -The corresponding AST is as follows: - -![image](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/docs/mindspore/source_zh_cn/features/compile/images/ast.png) - -By parsing the above AST, we obtain the following MindIR: - -```text -%para1_x: -%para2_y: - -subgraph instance: foo -subgraph @foo() { - %0(CNode_17) = PrimFunc_Add(%para1_x, %para2_y) - : (, ) -> () - Return(%0) - : () -} -``` - -**ast Usage** - -The user can specify that the function is to be executed as a static graph via the `@jit` decorator, for example: - -```python -import numpy as np -import mindspore as ms -from mindspore import ops -from mindspore import jit -from mindspore import Tensor - -@jit -def tensor_cal(x, y, z): - return ops.matmul(x, y) + z - -x = Tensor(np.ones(shape=[2, 3]), ms.float32) -y = Tensor(np.ones(shape=[3, 4]), ms.float32) -z = Tensor(np.ones(shape=[2, 4]), ms.float32) -ret = tensor_cal(x, y, z) -print(ret) -``` - -```text -[[4. 4. 4. 4.] - [4. 4. 4. 4.]] -``` - -In the above use case, the tensor_cal function is modified by the @jit decorator, and the function follows the pattern of the static graph when it is called in order to capture the performance gains during the execution period of the function. - -**Advantages** - -- With the ast model, users have more programming autonomy and more precise performance optimization, allowing them to tune the performance of the network to the optimal level based on function characteristics and usage experience. - -**Limitations** - -- Functions modified by ast must be programmed with an internal syntax that strictly adheres to the static graph. - -**Recommendations for the Use of the ast Model** - -- In contrast to dynamic graph execution, a function modified by `@jit` consumes some time to compile a static graph the first time it is called. On subsequent calls to the function, if the original compilation result can be reused, the original compilation result will be used for execution. As a result, functions that are executed multiple times using @jit decorator usually gain more performance benefits. - -- The operational efficiency advantage of the static graph pattern is that it optimizes the compilation of @jit-modified functions globally. The more operations a function contains, the higher the upper limit of optimization. Therefore, functions modified by the `@jit` decorator should ideally be large chunks of code with a lot of operations, rather than many small, fragmented functions with only a few operations tagged with a separate jit tag. Otherwise, there may be no performance gain or even degradation. - -- The vast majority of calculations and optimizations for MindSpore static graphs are based on optimizations for Tensor calculations, so we recommend that the functions that are modified should be the kind of functions that are used to perform real data calculations, rather than simple scalar calculations or transformations of data structures. - -- Functions modified by `@jit` that have constants in their inputs will result in a recompile each time that the function input value changes. See [Constants and Variables Within JIT](https://www.mindspore.cn/tutorials/en/r2.7.0/compile/static_graph.html#constants-and-variables-within-jit) for the concept of variable constants. Therefore, it is recommended that the modified function takes as input Tensor or data modified by Mutable. 
Avoid additional performance loss due to multiple compilations. - -## Bytecode - -In addition to ast, MindSpore provides another static acceleration mechanism, bytecode, which allows the user to modify a function to execute in bytecode mode via the `@jit(capture_mode=“bytecode”)` decorator. When bytecode recognizes that the syntax for entering a static graph is not supported, it will fall back to Python for execution instead of compiling directly and reporting errors. This feature combines performance and ease of use to reduce the occurrence of compilation errors. It is based on the analysis of Python bytecode, graph capture of Python execution flow, allowing subgraphs that can be run as static graphs to be run as static graphs, and allowing subgraphs that are not supported by Python syntax to be run as dynamic graphs, as well as linking the dynamic-static graphs by modifying and adjusting the bytecode, so as to achieve a mixed execution of dynamic and static. While meeting the premise of ease of use, to improve performance as much as possible. - -**bytecode Operating Principle** - -1. Capture the execution of Python functions based on Python VM_PyInterpreterState_SetEvalFrameFunc, which captures the execution of all Python functions in the execution area using context management. -2. Analyze the function bytecode in conjunction with the current runtime input parameters to construct a control flow graph (CFG) and a data flow graph (DFG). -3. Simulate in-stack and out-stack operations, trace bytecode by bytecode, and derive the output based on the stack inputs. Python 3.7 to Python 3.11 has a corresponding simulation implementation for each bytecode, noting that the type size of the outputs is derived, not the actual execution of the values, unless the constants are collapsed. -4. During the simulated execution of the bytecode, translate the derivation results and operations into MindIR, and finally, optimize the static graph by constant folding, UD analysis (removing useless input and output parameters), etc. -5. Before executing the equivalent static graph, compare the input parameters with the caretaker Guard conditions generated during the optimization process, and based on the runtime information, select the matching static graph for execution. -6. Dynamically manage the matching relationship between Guard and static graph buffer, recycle the unused static graph buffer, and optimize the static graph buffer through Symbolic Shape and Dynamic Shape. - -The compilation process of bytecode is illustrated in the following diagram: - -![image](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/docs/mindspore/source_zh_cn/features/compile/images/bytecode.png) - -**bytecode Usage** - -Setting the capture_mode parameter of jit to bytecode switches the mode of operation of the modifier function to bytecode, for example: - -```python -import numpy as np -import mindspore as ms -from mindspore import ops -from mindspore import jit -from mindspore import Tensor - -@jit(capture_mode="bytecode") -def tensor_cal(x, y, z): - return ops.matmul(x, y) + z - -x = Tensor(np.ones(shape=[2, 3]), ms.float32) -y = Tensor(np.ones(shape=[3, 4]), ms.float32) -z = Tensor(np.ones(shape=[2, 4]), ms.float32) -ret = tensor_cal(x, y, z) -print(ret) -``` - -```text -[[4. 4. 4. 4.] - [4. 4. 4. 
4.]] -``` - -**Advantages** - -- Good user experience, no human intervention, user-written web code always runs properly, and code that can't be executed by static graphs will automatically run using dynamic graphs. -- bytecode can make more statements into the static graph by transforming the byte code. Users do not need to perceive or modify the code. - -**Limitations** - -- Users can't explicitly do performance acceleration for certain code, and for scenarios with more cracked graphs, the performance acceleration may not be obvious. - -## Trace - -MindSpore also offers another static acceleration mechanism called trace. Users can decorate a function with the `@jit(capture_mode=“trace”)` decorator to execute the function in trace mode. In this mode, the code first runs in pynative mode, during which the operators executed at runtime are recorded and captured into the computation graph. Subsequent executions of the decorated code will directly execute the computation graph constructed during the first execution. This mechanism does not parse syntax but only captures the operators called during runtime, thus avoiding syntax-related errors. It captures the operators invoked during the execution of the pynative mode, captures the Python execution flow into a graph, and compiles the captured operators into the computation graph. Operations without corresponding operators will have their return values recorded as constants in the computation graph. The generated computation graph runs in the manner of static graph execution. - -**trace Usage** - -Setting the capture_mode parameter of jit to trace switches the mode of operation of the modifier function to trace, for example: - -```python -import numpy as np -import mindspore as ms -from mindspore import ops -from mindspore import jit -from mindspore import Tensor - -@jit(capture_mode="trace") -def tensor_cal(x, y, z): - return ops.matmul(x, y) + z - -x = Tensor(np.ones(shape=[2, 3]), ms.float32) -y = Tensor(np.ones(shape=[3, 4]), ms.float32) -z = Tensor(np.ones(shape=[2, 4]), ms.float32) -ret = tensor_cal(x, y, z) -print(ret) -``` - -```text -[[4. 4. 4. 4.] - [4. 4. 4. 4.]] -``` - -**Advantages of trace** - -- The graph construction capability is robust; as long as the code has corresponding operators, they can be captured into the graph without the need for additional adaptation. There will be no syntax-related errors when building the static graph. -- Good user experience, no human intervention, user-written web code always runs properly. - -**Limitations of trace** - -- It is unable to detect the control flow within the code, and correctness cannot be ensured in scenarios where different branches of the control flow are entered during multiple executions. -- Operations in the code that are not defined as operators, such as calls to third-party libraries, are fixed as constants in the computation graph, and correctness cannot be guaranteed across multiple runs. 
- diff --git a/docs/mindspore/source_en/features/compile/graph_optimization.md b/docs/mindspore/source_en/features/compile/graph_optimization.md deleted file mode 100644 index 751d9345c3e26268298f458b8247f6ce95d3fa0f..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/features/compile/graph_optimization.md +++ /dev/null @@ -1,318 +0,0 @@ -# Graph Optimization (Compilation) - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_en/features/compile/graph_optimization.md) - -Similar to traditional compilers, MindSpore also performs compilation optimization after graph construction. The main purpose of compilation optimization is to analyze and transform MindSpore's intermediate representation MindIR by static analysis techniques to achieve goals such as reducing the size of the target code, improving execution efficiency, lowering runtime resource consumption, or enhancing other performance metrics. Compilation optimization is a crucial part of the graph compilation system and plays an extremely important role in improving the performance and resource utilization of the entire neural network model. Compared with the original code that has not been optimized, compilation optimization can bring several times or even tens of times performance improvement. - -This section mainly introduces front-end compilation optimization techniques that are independent of specific hardware. Hardware-specific back-end compilation optimization techniques are not within the scope of this discussion. - -## Principles of Front-End Compilation Optimization Techniques - -Similar to traditional compilation optimization techniques, compilation optimization in MindSpore is also carried out through a series of Passes. Each Pass takes the MindIR produced by the previous Pass as input and generates a new MindIR representation as output after optimization. A large Pass can include multiple smaller Passes, each of which is only responsible for a single point of compilation optimization, such as arithmetic simplify, inline, redundancy elimination and etc. The optimization results produced by one Pass may create optimization opportunities for other Passes, so these Passes can be run in a loop until the MindIR no longer changes. - -The selection of which Passes to run and how to arrange the execution order of these Passes has a very important impact on the final compilation result. Depending on the actual situation, the optimization actions to be performed can be adjusted by setting compilation optimization strategies (such as optimization levels, number of iterations, etc.). - -## Common Front-End Compilation Optimization Techniques - -There are many front-end compilation optimization techniques, such as arithmetic simplify, inline, and redundancy elimination. This section will introduce some representative compilation optimization techniques. - -### Arithmetic Simplify - -In traditional compilers, arithmetic simplify is a compiler optimization technique aimed at simplifying algebraic expressions in source code, eliminating redundant calculations, improving program execution efficiency, and reducing memory usage. - -For example, in the following code snippet: - -```cpp -int a = x * 1; -int b = x + 0; -int c = x * 0 + y * 1; -``` - -Traditional compilers perform equivalent substitution on recognized expressions based on algebraic rules and identities. 
Common algebraic rules include laws of union, commutative, and distributive, and compilers will try to replace expressions with simpler forms as much as possible. By analyzing AST or SSA analysis is used for optimization, identifying and simplifying code as follows: - -```cpp -a = x; -b = x; -c = y; -``` - -In the MindSpore compiler, the principle of arithmetic simplify is different from traditional compilers. It processes computational graphs rather than traditional control flow graphs. By adjusting the execution order of operators in the computational graph or deleting unnecessary operators, it maintains the simplicity of the graph and improves computational efficiency. - -For example, in the following Python code snippet: - -```python -import numpy as np -from mindspore.common import Tensor, jit - -@jit -def func(x): - return x + 0 - -m = Tensor(np.array([[1, 2, 3], [4, 5, 6]]).astype(np.int32)) -out = func(m) -``` - -The MindSpore graph compiler converts Python programs into computational graphs, which consist of multiple subgraphs. The algebraic operations in the source code are converted into operator calls within the subgraph, and it can be seen that the PrimFunc_Add operator is called once. - -```text -%para1_x: - -subgraph @1_func_14() { - %0(CNode_7) = PrimFunc_Add(%para1_x, Tensor(shape=[], dtype=Int32, value=0)) - : (, ) -> () - - Return(%0) - : () -} -``` - -By arithmetic simplify, the PrimFunc_Add operator can be directly removed to simplify the computational graph structure, reducing `x + 0` to `x`. - -```text -%para1_x: - -subgraph @1_func_14() { - Return(%para1_x) - : () -} -``` - -Arithmetic simplify can involve more modifications to the structure of computational graphs, and it is often combined with other compiler optimization techniques such as constant folding and constant propagation to improve program performance. - -### Inline - -In traditional compilers, inline is an optimization technique that replaces function calls with the actual code of the called function, improving program performance. For example, consider a C++ `add` function that sums two numbers: - -```cpp -int add(int a, int b) { - return a + b; -} - -int main() { - int x = add(3, 5); - int y = add(x, 10); - return y; -} -``` - -The compiler uses inline to directly insert the function body at the call site. This eliminates function call overhead and enables follow-up optimizations (e.g., replacing `3 + 5` with its result at compile time). **Replacing calls with code** is the core idea of inline. - -```cpp -int main() { - int x = 3 + 5; // Replace the first call. - int y = x + 10; // Replace the second call. - return y; -} -``` - -In AI frameworks' computational graph compilers, inline serves a similar purpose but operates on "subgraphs" instead of functions. For example, consider a Python program: - -```python -from mindspore import Tensor, jit, ops - -def f2(x: Tensor, y: Tensor): - return x * 0.5 + y - -@jit -def f1(a: Tensor, b: Tensor, c: Tensor): - x = f2(a, b) - y = f2(a, c) - return x + y - -# Create three Tensors with random values, each having a shape of (2, 4). -a = ops.randn(2, 4) -b = ops.randn(2, 4) -c = ops.randn(2, 4) -out = f1(a, b, c) -``` - -First, MindSpore's graph compiler converts the Python program into a computational graph. The function calls in the Python program are converted into calls between calculation graphs, and the original calculation graph is similar to the following. The main graph `f1` calls the subgraph `f2` twice. 
- -```text -# Params: -%para1_a: -%para2_b: -%para3_c: - -subgraph @f2(%para1_x, %para2_y) { - %0 = PrimFunc_Mul(%para1_x, Float32(0.5)) - - %1 = PrimFunc_Add(%0, %para2_y) - - Return(%1) -} - -subgraph @f1() { - %0(x) = call @f2(%para1_a, %para2_b) # Call subgraph f2 - - %1(y) = call @f2(%para1_a, %para3_c) # Call subgraph f2 - - %2 = PrimFunc_Add(%0, %1) - - Return(%2) -} -``` - -With inlining, the subgraph `f2` can be expanded and merged into the main graph `f1`. - -```text -subgraph @f1() { - # First-time subgraph inlining - %0 = PrimFunc_Mul(%para1_a, Float32(0.5)) # Repeated computation - %1 = PrimFunc_Add(%0, %para2_b) - - # Second-time subgraph inlining - %2 = PrimFunc_Mul(%para1_a, Float32(0.5)) # Repeated computation - %3 = PrimFunc_Add(%2, %para3_c) - - %4 = PrimFunc_Add(%1, %3) - - Return(%4) -} -``` - -Before inlining, the compiler might not detect repeated operations in the two calls to subgraph `f2` (as subgraphs are often treated as black boxes). After inlining, the compiler clearly sees `x * 0.5` calculated twice, enabling optimizations like **CSE** (Common Subexpression Elimination) to reduce redundant computations. - -```text -subgraph @f1() { - %0 = PrimFunc_Mul(%para1_a, Float32(0.5)) # CSE merges redundant computations - - %1 = PrimFunc_Add(%0, %para2_b) - - %2 = PrimFunc_Add(%0, %para3_c) # Directly reuse %0 - - %3 = PrimFunc_Add(%1, %2) - - Return(%3) -} -``` - -With inlining, compilers better identify cross-subgraph optimization opportunities. In addition to CSE, it enables operator fusion, memory management optimizations, and many other optimizations. Thus, inline is a critical optimization mechanism in computational graph compilers and a foundation for many cross-subgraph optimizations. - -### Redundancy Elimination - -In traditional compilers, redundancy elimination encompasses various compiler optimization techniques aimed at identifying and removing redundant parts of the code during compilation. This process is designed to reduce unnecessary computations and improve the execution efficiency of programs. - -Redundant code may be intentionally written by developers for readability purposes or may simply be an unintentional result of the coding process. Additionally, intermediate results generated by other optimization techniques during the compilation process (such as arithmetic simplify, inline and common subexpression elimination) may also create opportunities for redundancy elimination. - -There are many techniques for redundancy elimination. This section selects and introduces some of the common ones, including dead code elimination and unreachable code elimination. - -1. **Dead code elimination** - - Removing code whose results are not used. For example, in the following C++ code, the variable `c` is not used by any other code. Compilers can use data flow analysis techniques from the field of static analysis to eliminate the computation of code: `int c = x * y`. - - ```cpp - int func(x, y) { - int a = x + y; - int b = x - y; - int c = x * y; // Dead code - int d = a / b; - return d; - } - ``` - -2. **Unreachable code elimination** - - Removing code that is not included in any valid control flow path. For example, in the following C++ code, compilers can use control flow analysis techniques from the field of static analysis to analyze the control flow graph. They can identify that the expression `1 < 0` is always false, and thus the code within this control flow path will never be executed during actual runtime. 
Therefore, the code in this branch can be eliminated. - - ```cpp - int func(x, y) { - int a = x + y; - - int b; - if 1 < 0 { // Unreachable branch - b = x + y; - } else { - b = x - y; - } - - int d = a / b; - return d; - } - ``` - -In MindSpore's graph mode, the purpose and techniques of redundancy elimination are similar to those in traditional compilers. However, unlike traditional compilers, these redundancy optimization techniques are performed on MindIR. Similarly, common redundancy elimination techniques in MindSpore include: - -1. **Dead code elimination** - - For example, consider the following Python code with redundant computations: - - ```python - import mindspore as ms - from mindspore.common import Tensor, jit - - @jit - def func(x, y): - a = x + y - b = x - y - c = x * y # Dead code - d = a / b - return d - - x = Tensor(20, ms.float32) - y = Tensor(10, ms.float32) - out = func(x, y) - ``` - - The MindSpore graph compiler will convert the Python code decorated with `@jit` into the MindIR representation through static analysis and eliminate the redundant computation `c = x * y`. The resulting MindIR is as follows: - - ```text - # Params: - %para1_x: - %para2_y: - - subgraph @func_1() { - %0(a) = PrimFunc_Add(%para1_x, %para2_y) - : (, ) -> () - %1(b) = PrimFunc_Sub(%para1_x, %para2_y) - : (, ) -> () - %2(d) = PrimFunc_Div(%0, %1) - : (, ) -> () - Return(%2) - : () - } - ``` - -2. **Unreachable code elimination** - - For example, consider the following Python code with an unreachable path: - - ```python - import mindspore as ms - from mindspore.common import Tensor, jit - - @jit - def func(x, y): - a = x + y - if 1 < 0: # Unreachable branch - b = x + y - else: - b = x - y - d = a / b - return d - - x = Tensor(20, ms.float32) - y = Tensor(10, ms.float32) - out = func(x, y) - ``` - - The MindSpore graph compiler will convert the Python code decorated with `@jit` into the MindIR representation through static analysis and eliminate the redundant control flow branch `1 < 0`. The resulting MindIR is as follows: - - ```text - # Params: - %para1_x: - %para2_y: - - subgraph @func_1() { - %0(a) = PrimFunc_Add(%para1_x, %para2_y) - : (, ) -> () - %1(b) = PrimFunc_Sub(%para1_x, %para2_y) - : (, ) -> () - %2(d) = PrimFunc_Div(%0, %1) - : (, ) -> () - Return(%2) cnode_attrs: {checkpoint: Bool(1)} - : () - } - ``` - -Redundancy elimination plays a crucial role in compiler optimization. Without changing the original semantics of the program, it can significantly improve execution efficiency by reducing unnecessary runtime computations and saving computing resources. Redundancy elimination is often combined with other compiler optimization techniques to create more opportunities for eliminating redundant code. 
diff --git a/docs/mindspore/source_en/features/compile/multi_level_compilation.md b/docs/mindspore/source_en/features/compile/multi_level_compilation.md deleted file mode 100644 index a7fc203a692a69fb04ef1103f270ee1007fb8131..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_en/features/compile/multi_level_compilation.md +++ /dev/null @@ -1,137 +0,0 @@ -# Multi-Level Compilation Introduction (Compilation) - -[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_en/features/compile/multi_level_compilation.md) - -## Background - -With the arrival of the era of deep learning large models, the larger the network, the greater the challenges in graph compilation performance, execution performance, and debugging and tuning efficiency. For this reason, MindSpore proposes a multi-level compilation architecture that provides an O(n) multi-level compilation and execution model. The levels differ in graph optimization, operator fusion, memory management, and execution mode, and are designed to offer a diversity of graph-mode choices. Users can choose the most suitable compilation and execution mode according to their own network characteristics and needs: - -1. O0 mode: this is a basic compilation and execution mode in which all optimizations are turned off except those necessary for functionality, and execution is scheduled operator by operator. The execution performance may therefore not be optimal, but the original structure of the graph is preserved, which makes the network easier to debug and understand, and the compilation performance is also better. Add and Mul single-operator execution is shown in the following figure. -2. O1 mode: this mode performs some basic optimizations, such as common graph optimizations and automatic operator fusion, and also executes operator by operator. Compared with O0, enabling fusion optimization improves the execution performance of O1, but it may change the original structure of the graph, so compilation performance and debugging and tuning efficiency are reduced. In the following figure, Add and Mul are fused into a single fused_op for execution. -3. O2 mode: this is a more advanced optimization mode. It is not implemented yet; deeper optimizations may use this mode later. - -![jit_level_example](./images/multi_level_compilation/jit_level_example.png) - -## Overview of Multi-Level Compilation Architecture - -![jit_level_framework](./images/multi_level_compilation/jit_level_framework.png) - -1. Multi-level compilation external interface: the compilation level is configured through [mindspore.jit(jit_level="O0/O1")](https://www.mindspore.cn/docs/en/r2.7.0/api_python/mindspore/mindspore.jit.html#mindspore.jit); jit_level defaults to O0. We usually recommend that users use O0 mode for network debugging and tuning. Once debugging is done, O1 can be turned on to run the network with better performance. -2. Backend graph compilation: according to the configured compilation level, different compilation modes are selected. 
O0 is the most basic native graph construction and compilation, while O1 adds automatic operator fusion on top of O0. The main functions are graph optimization, graph-kernel fusion, operator selection, and execution order scheduling, of which graph-kernel fusion is unique to O1 mode. -3. Backend graph execution: the O0 and O1 modes are identical at the execution level; both schedule execution operator by operator. The main functions are multi-stream concurrency, multi-level pipelining, HAL management, and memory management. - -## Introduction to the O0 Mode - -O0 is the basic graph compilation and execution mode. Except for optimizations necessary for functionality, all other optimizations are turned off, and the native graph structure is used for compilation and execution, which makes debugging and tuning convenient and keeps compilation performance good. The following mainly introduces the functions related to backend graph compilation; the functions related to backend graph execution are detailed in [runtime](https://www.mindspore.cn/docs/en/r2.7.0/features/runtime/memory_manager.html). - -### Graph Optimization - -The O0 mode performs only a few graph optimizations; the basic ones are the backend LazyInline and No-task node execution optimizations. - -- **Back-end LazyInline** - - **LazyInline**: The main idea is to postpone the overhead of a function call until the call is actually needed, which reduces compilation overhead and improves compilation efficiency. During graph compilation, LazyInline reuses identical subgraph structures instead of expanding them in the calling graph, avoiding the compilation-performance impact of a very large graph. - - ![jit_level_lazyinline](./images/multi_level_compilation/jit_level_lazyinline.png) - - **Pipeline Parallelism**: Slices the operators of the neural network into multiple stages and maps the stages to different devices, so that different devices compute different parts of the neural network. To improve efficiency, pipeline parallelism further slices the MiniBatch into finer-grained MicroBatches and applies pipelined scheduling to them. - - **Back-end LazyInline**: MicroBatch slicing in pipeline parallelism expands the entire computational graph by the number of MicroBatches, which leads to a huge model size and long compilation times (possibly at the hour level), even though these Micro subgraphs all have the same structure. The LazyInline technique fits this compilation-performance problem well; however, LazyInline prevents the runtime from using the optimal strategy for memory reuse and stream allocation and blocks cross-graph optimizations (memory optimization, communication fusion, operator fusion, etc.). For this reason, at the end of graph compilation and before graph execution, these Micro subgraphs are actually inlined to form a complete global graph, and memory optimization, communication optimization, and redundant-computation elimination are applied after the graph inline, so that compilation performance, execution performance, and execution memory are all taken into account. 
- -- **No-task node Execution Optimization** - - ![jit_level_no_task](./images/multi_level_compilation/jit_level_no_task.png) - - No-task nodes are operators such as Reshape, ExpandDims, Squeeze, Flatten, FlattenGrad, and Reformat. These operators contain no computational logic and do not modify the memory layout; they only modify shape or format information. At the end of graph compilation, No-task nodes are converted to ref nodes whose output shares the same address as the input, and the kernel launch is skipped during execution, thus optimizing execution performance. - -### Operator Selection - -Operators are the basic execution units in deep learning frameworks; they are responsible for performing specific computational tasks, such as matrix multiplication, convolution, and pooling. Operator selection needs to comprehensively consider factors such as operator type, data type, hardware platform, and operator optimization in order to select the optimal operator for the deep learning task. - -The operator types in the MindSpore Ascend backend are Aclnn kernel/Aclop kernel/Hccl kernel/Cpu kernel, and the operator selection process is shown below: - -![jit_level_kernelselect](./images/multi_level_compilation/jit_level_kernelselect.png) - -1. Operator type: first, according to the operator type, determine whether it is a computational operator or a communication operator. -2. Hardware platform: if the hardware has a corresponding operator, the hardware operator is preferred; otherwise the heterogeneous operator on CPU is chosen. For example, shape-related computational operators may only be supported on CPU, with no corresponding hardware operator. -3. Operator efficiency: because Aclnn operators perform better on Ascend, a computational operator prefers the Aclnn kernel if one exists; otherwise the Aclop kernel is chosen. -4. If no operator is selected in any of the above three steps, the operator is unsupported and operator selection fails with an error. - -### Execution Order Scheduling - -![jit_level_exec_order](./images/multi_level_compilation/jit_level_exec_order.png) - -Different graph traversal algorithms produce execution orders with large differences in execution performance and memory, as shown in the figure above: - -- **Execution order obtained by BFS**: kernel1-> kernel2-> kernel4-> kernel5-> kernel3-> kernel6. Memory peaks at 5G (after kernel3 executes, the memory of kernel1 and kernel2 can be released and reused when kernel6 executes, so kernel6 does not need to request extra memory). -- **Execution order obtained by DFS**: kernel1-> kernel2-> kernel3-> kernel4-> kernel5-> kernel6. Memory peaks at 4G (after kernel3 executes, the memory of kernel1 and kernel2 can be released and reused when kernel4 and kernel5 execute, so kernel4 and kernel5 do not need to request extra memory). - -Execution order scheduling is a complex problem of finding optimal operator concurrency under given memory constraints. It not only requires identifying and exploiting concurrency opportunities in the computational graph to improve computational efficiency, but must also satisfy multiple constraints at the same time to ensure the stability and efficiency of the system. - -- First, the optimization module needs to address the complexity of solving for optimal operator concurrency. 
Due to the large number of operators in the computational graph and their interdependencies, finding an execution order that maximizes concurrency while maintaining the logical correctness of the computational graph is a challenging task. -- Second, memory constraints are a critical factor that cannot be ignored in execution order optimization. Increasing concurrency improves computational efficiency but tends to significantly increase peak memory requirements, which may lead to Out of Memory (OOM) errors, especially in resource-constrained environments. Therefore, the optimization module must balance concurrency against memory usage to ensure that concurrency is increased without exceeding the memory capacity of the system. -- MindSpore's execution order adjustment module combines rule-based and heuristic strategies and provides both bfs and dfs execution order scheduling algorithms [mindspore.jit(option={"exec_order": "bfs/dfs"})](https://www.mindspore.cn/docs/en/r2.7.0/api_python/mindspore/mindspore.jit.html#mindspore.jit) to achieve fine-grained adjustment of the execution order of the computational graph, so as to effectively deal with challenges such as memory constraints and system stability while ensuring computational efficiency. - -## Introduction to the O1 Mode - -O1 mainly targets general-purpose, generalizable AI compilation optimizations on top of O0, to deliver better execution performance for most general training and inference scenarios. - -In the current phase, O1 mainly supports graph-kernel fusion optimization. The main idea is to automatically identify neighboring fusable nodes in the computational graph during static graph compilation and fuse them into executable operators of larger granularity. Graph-kernel fusion increases the computational locality of operators and reduces the overall global memory access bandwidth overhead. As verified by real-world tests on more than 15 networks, O1 achieves an average 15% performance speedup compared to O0. For memory-access-intensive networks in particular, the effect of O1 is more significant. - -### Graph-Kernel Fusion - -Mainstream AI computing frameworks such as MindSpore define the operators they provide from the perspective of user understandability and ease of use. Each operator carries a different amount of computation and varies in computational complexity. However, from the hardware execution point of view, this natural, user-perspective division of operator computation is not efficient and does not fully utilize the computational power of the hardware, which is mainly reflected in the following aspects: - -1. Operators that are computationally too heavy or too complex usually make it difficult to generate well-tiled high-performance kernels, thereby reducing device utilization. -2. Operators that are computationally too small may also cause latency in computation and thus reduce device utilization, because the computation cannot effectively hide the data-movement overhead. -3. Hardware devices are usually multi-core or many-core architectures. When the operator shape is small, or computational parallelism is insufficient for other reasons, some cores may be idle, thus reducing device utilization. Chips based on Domain Specific Architecture (DSA) are particularly sensitive to these factors. 
It has been a big challenge to maximize the performance of hardware operator while making the operator easy to use. - -In terms of AI framework design, the current industry mainstream adopts a separate layer implementation approach of graph and operator layers. The graph layer is responsible for fusing or regrouping the computational graph, and the operator layer is responsible for compiling the fused or regrouped operators into high-performance executable operators. The graph layer is usually processed and optimized by using Tensor-based High-Level IR, while the operator layer is analyzed and optimized by using computational instruction-based Low-Level IR. This artificial separate-layer process significantly increases the difficulty of performing collaborative optimization in both graph and computational layers. - -MindSpore has adopted the technique of graph-kernel fusion to better solve this problem in the past few years. Typical networks in different categories such as NLP and recommendation show significant gains in training speed after enabling graph-kernel fusion. One of the main reasons is the presence of a large number of small operator combinations in these networks, which have more opportunities for fusion optimization. - -#### Graph-Kernel Fusion Architecture and Overall Process - -The overall architecture of graph-kernel fusion is shown in the figure below. The main idea in the graph layer is to turn on the composite operator, then perform cross-boundary aggregation and optimization, and finally perform Kernel operator splitting. The main steps include: - -1. Composite Expansion: Expand the composite operator into the basic operator and form the Composite subgraph to facilitate subsequent cross-boundary optimization and operator splitting. -2. Cross-OP Aggregation: Aggregate adjacent elementary operators or Composite subgraphs to form larger aggregated subgraphs for subsequent cross-boundary optimization and operator splitting. -3. High-Level Optimization: Based on the aggregated subgraphs obtained in the above two steps, we can perform a large number of cross-boundary optimizations, such as algebraic simplification, common subexpression extraction (CSE). -4. Kernel Partition: Based on the computational features and the performance of the fusion operator, the operator splitting is performed on the aggregated computational subgraph. - -The optimized computational graph is passed to MindSpore AKG as a subgraph for further back-end optimization and target code generation. - -![graphkernel](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/docs/mindspore/source_zh_cn/features/images/graphkernel.png) - -By following these steps, we can obtain two aspects of performance gains: - -1. Cross-boundary performance optimization gains between different operators. -2. The optimal granularity of the fusion operator is obtained by reorganizing and splitting the entire computational graph. - -#### Fusion Operator Acceleration Optimization (MindSpore AKG) - -As mentioned earlier, in scenarios such as HPC and deep neural network training, graph-kernel fusion optimization can bring exponential performance improvements. However, with the increasing capability of graph-kernel fusion, the development of fusion operator becomes a bottleneck point to continue to improve the graph-kernel fusion capability. 
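As a usage-level illustration of the graph-kernel fusion discussion above, the following hedged sketch enables the O1 level on a small elementwise chain, the kind of pattern the fusion pass targets. The exact set of operators that actually gets fused depends on the backend and release, so treat this as indicative rather than definitive.

```python
import numpy as np
import mindspore
from mindspore import ops

# Illustrative sketch: jit_level="O1" enables the graph optimizations plus
# automatic operator fusion described above; the elementwise mul/add/relu
# chain is a typical fusion candidate.
@mindspore.jit(jit_level="O1")
def fused_block(x, y):
    z = x * 0.5 + y
    return ops.relu(z)

x = mindspore.tensor(np.random.randn(2, 4).astype(np.float32))
y = mindspore.tensor(np.random.randn(2, 4).astype(np.float32))
print(fused_block(x, y))
```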
- -The automatic generation technology of fusion operators can solve the problem of high programming threshold for developing fusion operators based on DSA, allowing programmers to focus on the implementation logic of operators during operator development without focusing on back-end optimization, which greatly improves their development efficiency. Especially for scenarios with complex back-end hardware architectures and the presence of complex operators and fusion operators, automatic operator generation techniques are more critical. - -Therefore, **MindSpore AKG accelerates optimization and automatic generation of fusion operator based on Polyhedral Compilation Technology (Polyhedral Model)**, can help fused operators optimized by MindSpore graph-kernel fusion module to automatically generate high-performance kernel on **heterogeneous hardware platforms** (GPU/Ascend) and improve MindSpore training performance. - -Architecture and Overall Process are as follows: - -The overall framework of MindSpore AKG is shown in the figure above: - -- IR Normalization - - The input of MindSpore AKG is the fused subgraph optimized by MindSpore graph-kernel fusion module, and the operator in the subgraph is expressed by various descriptions such as TVM's Compute/IR Builder/Hybrid. The DSL is then converted to Halide IR ([Halide](https://halide-lang.org/), a common language used to develop high-performance image processing and Array computation, which can be used as an intermediate expression for decoupling algorithms and optimization) and IR normalization. - - After the initial simplification and optimization is completed, the Halide IR is transformed into the scheduling tree required by the Poly module. -- Poly module scheduling optimization - - Using the Pluto scheduling algorithm in Polyhedral technology to achieve automatic fusion of loops, automatic rearrangement and other transformations to automatically generate an initial schedule that satisfies parallelism and data locality for the fusion operator. - - To quickly adapt to different hardware backends, the optimization pass in the Poly module is divided into hardware-independent generic optimizations and hardware-related specific optimizations, which are stitched and combined according to hardware features at compilation time, to achieve fast adaptation of heterogeneous hardware backends. The pass such as Auto-slicing, auto-mapping and auto-memory boosting will give different optimizations depending on the nature of the hardware architecture. -- Backends optimization - - In order to further improve the performance of the operator, we developed corresponding optimization passes for different hardware backends, such as data alignment and instruction mapping in Ascend backend, vectorized access and insertion of synchronization instructions in GPU backend, and finally generated the corresponding platform code. - -### Other Graph Optimization Techniques - -In addition to graph-kernel fusion, O1 may be gradually extended to add some other graph optimization techniques in subsequent releases. For example: - -1. KernelPacket: automatic fusion and optimization of shape computations in dynamic shape scenarios; -2. Communicative-kernel fusion: fusion of communication operators with computational operators. 
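To connect the O0 options above to code, here is a hedged sketch that selects the compilation level and requests a dfs execution order through the interfaces referenced in this document (`jit_level` and `option={"exec_order": ...}`). The exact option key spelling should be verified against the mindspore.jit API documentation for your release.

```python
import numpy as np
import mindspore
from mindspore import ops

# Illustrative only: O0 keeps the native graph structure (debug-friendly),
# and the exec_order option picks the bfs or dfs scheduling algorithm
# discussed in the execution order section above. Check the mindspore.jit
# API docs for the exact parameter names in your release.
@mindspore.jit(jit_level="O0", option={"exec_order": "dfs"})
def step(x, y):
    return ops.matmul(x, y) + 1.0

x = mindspore.tensor(np.ones((2, 3), dtype=np.float32))
y = mindspore.tensor(np.ones((3, 4), dtype=np.float32))
print(step(x, y))
```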
diff --git a/docs/mindspore/source_en/features/index.rst b/docs/mindspore/source_en/features/index.rst index 6552f5d328fc67c4f3c50c56efab580cec4607d0..5d6f1691129f06c9ea290c6b538e3ea93b1cf354 100644 --- a/docs/mindspore/source_en/features/index.rst +++ b/docs/mindspore/source_en/features/index.rst @@ -11,9 +11,7 @@ Developer Notes parallel/optimizer_parallel parallel/pipeline_parallel parallel/auto_parallel - compile/multi_level_compilation - compile/graph_construction - compile/graph_optimization + compile/compilation_guide runtime/memory_manager runtime/multilevel_pipeline runtime/multistream_concurrency diff --git a/docs/mindspore/source_zh_cn/features/compile/multi_level_compilation.md b/docs/mindspore/source_zh_cn/features/compile/compilation_guide.md similarity index 33% rename from docs/mindspore/source_zh_cn/features/compile/multi_level_compilation.md rename to docs/mindspore/source_zh_cn/features/compile/compilation_guide.md index 129948bcb42ed27d76f355a64bb5ad45e75f2b38..48e9f4861023f3b22db496d718f0bb98aac9916e 100644 --- a/docs/mindspore/source_zh_cn/features/compile/multi_level_compilation.md +++ b/docs/mindspore/source_zh_cn/features/compile/compilation_guide.md @@ -1,68 +1,386 @@ -# 多级编译介绍(编译) +# mindspore.jit 多级编译优化 -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_zh_cn/features/compile/multi_level_compilation.md) +[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_zh_cn/features/compile/compilation_guide.md) -## 背景 +## MindSpore编译架构 -随着深度学习大模型时代的到来,网络规模越来越大,对图编译性能、执行性能和调试调优效率的挑战也越来越大。为此,MindSpore提出多级编译架构,提供O(n)多级编译执行模式,它们在图优化、算子融合、内存管理以及执行模式等方面有所不同,旨在提供图模式的多样性选择,用户可以根据自己的网络特点和需求,选择最适合的编译执行模式: +MindSpore利用jit(just-in-time)来进行性能优化。jit模式会通过AST树解析、Python字节码解析或追踪代码执行的方式,将Python代码转换为中间表示图(IR,Intermediate Representation)。我们给它命名MindIR。编译器通过对该IR图的优化,来达到对代码的优化,提高运行性能。与PyNative Mode相对应,这种JIT的编译模式被称为Graph Mode。 -1. O0模式:这是一个基础的编译执行模式,除必要影响功能的优化外,其他优化均关闭,使用单算子执行的执行方式。因此执行性能可能不是最优,但它的优点是可以保证图的原始结构,方便用户进行调试和理解,编译性能也较好。如下图中的Add和Mul单算子执行。 -2. O1模式:这个模式会进行一些基础的优化,比如常用图优化和自动算子融合优化,使用单算子执行的执行方式。相比O0,由于使能了融合优化,可以提高执行性能,但可能会影响到图的原始结构,因此编译性能和调试调优效率有所损失。如下图中的Add跟Mul融合成一个fused_op执行。 -3. O2模式:这是一个更高级的优化模式,目前没有实现,后续较为深层次的优化可使用该模式。 +开发者写的Python代码默认以PyNative Mode运行,可以通过`@mindspore.jit`装饰器修饰函数,来指定其按照Graph Mode执行。有关`@mindspore.jit`装饰器的相关文档请见[jit 文档](https://www.mindspore.cn/docs/zh-CN/r2.7.0/api_python/mindspore/mindspore.jit.html#mindspore.jit)。 -![jit_level_example](./images/multi_level_compilation/jit_level_example.png) +Graph Mode大致分为3个阶段: -## 多级编译架构概述 +- 图捕获(构图):Python代码 -> MindIR。 +- 图优化(前端):对MindIR进行硬件无关优化,代数化简、函数inline(内联)、冗余消除等。 +- 图优化(后端):对MindIR进行硬件相关优化,LazyInline、算子选择、图算融合等。 -![jit_level_framework](./images/multi_level_compilation/jit_level_framework.png) +## 图捕获(构图) -1. 多级编译对外接口:通过[mindspore.jit(jit_level="O0/O1")](https://www.mindspore.cn/docs/zh-CN/r2.7.0/api_python/mindspore/mindspore.jit.html#mindspore.jit)来配置多级编译级别,jit_level默认为O0,通常我们建议用户使用O0模式进行网络调试调优,调试就绪后,为了更好的性能可以一键开启O1运行网络。 -2. 后端图编译:根据配置的多级编译级别,选择不同的编译模式,O0为最基础的原生构图与编译,O1在O0基础增加了自动算子融合功能,主要功能有图优化、图算融合、算子选择、执行序编排,其中图算融合为O1模式下独有功能。 -3. 
后端图执行:O0跟O1模式执行层面是一样的,均使用单算子方式调度执行,主要功能有多流并发、多级流水、HAL管理、内存管理。 +MindSpore提供三种捕获方式,如下: -## O0模式介绍 +- AST:通过AST树解析的方式将执行的函数转换成IR图 +- bytecode(实验性):对Python字节码的解析,尽可能的构建IR图,无法转换为IR图的部分则会按照PyNative Mode进行执行 +- trace(实验性):通过追踪Python代码执行的轨迹来构建IR图 -O0为基础的图编译执行模式,除必要影响功能的优化外,其他优化均关闭,使用原生的图结构进行编译和执行,方便调试调优,具备较好的编译性能。下面主要介绍后端图编译相关功能,后端图执行相关功能详见[运行时](https://www.mindspore.cn/docs/zh-CN/r2.7.0/features/runtime/memory_manager.html)。 +这三种模式在mindspore.jit中使用capture_mode来选择,以ast举例:开发者可用`@mindspore.jit(capture_mode="ast")`装饰器修饰函数。用ast方式修饰的函数,其语法有一定限制,我们提供两种模式供开发者选择。 -### 图优化 +- strict模式:此模式目标是构成一张图,开发者的Python代码如果无法构图,选择此模式运行程序时会报错,需要开发者进行代码修改,变为可构图的语法,适合追求性能的开发者。 +- lax模式:此模式目标是尽可能的让开发者程序可运行,思路是针对无法在strict模式构图的代码进行Python fallback,即返回Python层运行。 -O0模式的图优化较少,基础的优化主要为后端LazyInline和No-task node执行优化。 +Graph Mode模式约束请参考[语法约束](https://www.mindspore.cn/tutorials/zh-CN/r2.7.0/compile/static_graph.html)。ast如何将Python代码解析并构图,举例如下: -- **后端LazyInline** +```python +@mindspore.jit +def foo(x, y): + z = x + y + return z +``` - **LazyInline**:主要思想是将函数调用的开销推迟到实际需要调用的时候,这样可以减少编译时的开销,提高编译效率。LazyInline在图编译阶段是将相同的子图结构复用,不展开放在图中,避免图规模较大导致影响编译性能。 +它对应的抽象语法树如下: - ![jit_level_lazyinline](./images/multi_level_compilation/jit_level_lazyinline.png) +![抽象语法树](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/docs/mindspore/source_zh_cn/features/compile/images/ast.png) - **流水线(Pipeline)并行**:将神经网络中的算子切分成多个Stage,再把Stage映射到不同的设备上,使得不同设备去计算神经网络的不同部分。为了提升效率,流水线并行进一步将小批次(MiniBatch)切分成更细粒度的微批次(MicroBatch),在微批次中采用流水线式的调度,从而达到提升效率的目的。 +通过解析上面的抽象语法树,我们得到下面的IR: - **后端LazyInline**:由于Pipeline并行的MicroBatch切分会导致整个计算图扩张到MicroBatch的数量倍,从而导致模型规模巨大,编译性能时间较长(可能小时级别)。而这些Micro子图结构都是一样的,为了解决编译性能问题,LazyInline技术则非常契合,不过LazyInline带来的问题就是运行时无法采用最优的方式进行内存复用和流分配、无法做跨图的优化(内存优化、通信融合、算子融合等)等问题。为此,在图编译结束后,在图执行之前,将这些Micro子图做实际的节点Inline,以形成完整的全局整图,再通过图Inline后的内存优化、通信优化、冗余计算消除等方式,从而实现在编译性能、执行性能、执行内存方面都兼顾的目标。 +```text +%para1_x: +%para2_y: -- **No-task node执行优化** +subgraph instance: foo +subgraph @foo() { + %0(CNode_17) = PrimFunc_Add(%para1_x, %para2_y) + : (, ) -> () + Return(%0) + : () +} +``` - ![jit_level_no_task](./images/multi_level_compilation/jit_level_no_task.png) +**ast的优点**: + +- 使用ast模式,用户的编程自主性更强,性能优化更精准,可以根据函数特征以及使用经验将网络的性能调至最优。 + +**ast的限制**: + +- ast修饰的函数,其内部的语法必须严格遵守静态图语法来进行编程。 + +**ast模式的使用建议**: + +- 相比于PyNative Mode执行,被`@mindspore.jit`修饰的函数,在第一次调用时需要先消耗一定的时间进行编译。在该函数的后续调用时,若原有的编译结果可以复用,则会直接使用原有的编译结果进行执行。因此,使用@mindspore.jit装饰器修饰会多次执行的函数通常会获得更多的性能收益。 + +- Graph Mode的运行效率优势体现在其会将被@mindspore.jit修饰函数进行全局上的编译优化,函数内含有的操作越多,优化的空间越大。因此`@mindspore.jit`装饰器修饰的函数最好是内含操作很多的大代码块,而不应将很多细碎的、仅含有少量操作的函数分别打上jit标签。否则,则可能会导致性能没有收益甚至劣化。 + +- 绝大部分计算以及优化都是基于对Tensor计算的优化,建议被修饰的函数应该是用来进行真正的数据计算的函数,而不是一些简单的标量计算或者数据结构的变换。 + +- 被`@mindspore.jit`修饰的函数,若其输入存在常量,那么该函数每次输入值的变化都会导致重新编译,关于变量常量的概念请见[即时编译下的常量与变量](https://www.mindspore.cn/tutorials/zh-CN/r2.7.0/compile/static_graph)。因此,建议被修饰的函数以Tensor或者被Mutable修饰的数据作为输入。避免因多次编译导致的额外性能损耗。 + +## 图优化(前端) + +与传统编译优化技术类似,MindSpore 中的编译优化也是通过一个个 Pass 来完成的。将每个 Pass 的上一个 Pass 所产生的 MindIR 作为输入,经过本 Pass 优化之后,产生新的 MindIR 表示作为输出。一个大的 Pass 可以包含多个小的 Pass,每个小的 Pass 只负责单点的编译优化,如:代数化简、函数内联(inline)、冗余消除等。一个 Pass 产生的优化结果,可能会为其它的 Pass 带来优化机会,故可以循环运行这些 Pass,直到产生的 MindIR 不再发生变化为止。 + +前端编译优化技术有很多,如:代数化简、函数inline(内联)、冗余消除等。这里仅介绍具有代表性的编译优化技术。 + +### 1 代数化简 + +在传统编译器中,代数化简是一种编译器优化技术,旨在简化源代码中的代数表达式,消除多余计算,提高程序执行效率、减少内存占用等。 + +例如,在以下代码片段中: + +```cpp +int a = x * 1; +int b = x + 0; +int c = x * 0 + y * 1; +``` + +传统编译器根据代数规则和恒等式对识别出的表达式进行等价替换。常见代数规则包括结合律、交换律和分配律等,编译器尽可能将表达式替换成更为简单的形式。通过对 AST(抽象语法树)或 SSA(静态单赋值形式)的分析来进行优化,识别并简化代码为: + +```cpp +a = 
x; +b = x; +c = y; +``` + +在MindSpore编译器中,代数化简原理不同于传统编译器,进行处理的是计算图而非传统控制流图,通过调整计算图中算子的执行顺序,或者删除不必要的算子,以保持计算图的简洁性和提高计算效率。 + +例如,在如下Python代码片段中: + +```python +import numpy as np +import mindspore + +@mindspore.jit +def func(x): + return x + 0 + +m = mindspore.tensor(np.array([[1, 2, 3], [4, 5, 6]]).astype(np.int32)) +out = func(m) +``` + +MindSpore图编译器会把 Python 程序转换为计算图,计算图由多个子图构成。源程序中的代数运算,转换为子图内部的算子调用,可以看到 PrimFunc_Add 算子调用了一次。 + +```text +%para1_x: + +subgraph @1_func_14() { + %0(CNode_7) = PrimFunc_Add(%para1_x, Tensor(shape=[], dtype=Int32, value=0)) + : (, ) -> () + + Return(%0) + : () +} +``` + +通过代数化简,可以直接删除 PrimFunc_Add 算子,简化计算图结构,将 `x + 0` 简化成 `x`。 + +```text +%para1_x: + +subgraph @1_func_14() { + Return(%para1_x) + : () +} +``` + +代数化简能更多地涉及对计算图结构的修改,它通常还与其他编译器优化技术(如常量折叠、常量传播等)结合使用,共同提高程序性能。 + +### 2 函数inline + +在传统编译器中,inline(内联)是一种优化技术,可以把被调用函数的代码直接替换到调用该函数的位置,提高程序运行效率。假设我们有一个 C++ 函数`add`,用于对两个数求和: + +```cpp +int add(int a, int b) { + return a + b; +} + +int main() { + int x = add(3, 5); + int y = add(x, 10); + return y; +} +``` + +编译器通过 inline 将函数体直接替换到调用处,这消除了函数调用的开销,同时为后续优化(如消除冗余计算`3 + 5`,直接在编译期求值替换)创造了条件。这种**用代码替换调用**的思想,正是 inline 的核心。 + +```cpp +int main() { + int x = 3 + 5; // 替换第一次调用 + int y = x + 10; // 替换第二次调用 + return y; +} +``` + +在 AI 框架的计算图编译器中,inline 的目标类似,但操作对象从“函数”变成了“子图”(subgraph)。假设我们有一个 Python 程序: + +```python +from mindspore + +def f2(x: mindspore.Tensor, y: mindspore.Tensor): + return x * 0.5 + y + +@mindspore.jit +def f1(a: mindspore.Tensor, b: mindspore.Tensor, c: mindspore.Tensor): + x = f2(a, b) + y = f2(a, c) + return x + y + +# 创建3个shape=(2, 4)的随机值Tensor +a = mindspore.ops.randn(2, 4) +b = mindspore.ops.randn(2, 4) +c = mindspore.ops.randn(2, 4) +out = f1(a, b, c) +``` + +首先,MindSpore 的计算图编译器会把 Python 程序转换为计算图。而 Python 程序中的函数调用,会转换为计算图之间的调用,得到类似于下面的原始计算图。其中,主图 f1 调用了 2 次子图 f2。 + +```text +# Params: +%para1_a: +%para2_b: +%para3_c: + +subgraph @f2(%para1_x, %para2_y) { + %0 = PrimFunc_Mul(%para1_x, Float32(0.5)) + + %1 = PrimFunc_Add(%0, %para2_y) + + Return(%1) +} + +subgraph @f1() { + %0(x) = call @f2(%para1_a, %para2_b) # 调用子图f2 + + %1(y) = call @f2(%para1_a, %para3_c) # 调用子图f2 - No-task node指的是Reshape、ExpandDims、Squeeze、Flatten、FlattenGrad、Reformat等诸类算子没有计算逻辑,不修改内存排布,仅修改shape、format等信息。在图编译结束后,将No-task node转换成ref node,输出跟输入同地址,执行过程中跳过kernel launch,从而达到执行性能优化目的。 + %2 = PrimFunc_Add(%0, %1) -### 算子选择 + Return(%2) +} +``` -算子是深度学习框架中的基本执行单元,它们负责执行特定的计算任务,如矩阵乘法、卷积、池化等。算子选择需要综合考虑算子类型、数据类型、硬件平台和算子优化等因素,以选择最优的算子来实现深度学习任务。 +通过 inline,可以将子图 f2 展开,合并到主图 f1。 -MindSpore Ascend后端的算子类型有Aclnn kernel/Aclop kernel/Hccl kernel/Cpu kernel,算子选择流程如下图所示: +```text +subgraph @f1() { + # 第一次子图inline + %0 = PrimFunc_Mul(%para1_a, Float32(0.5)) # 重复计算步骤 + %1 = PrimFunc_Add(%0, %para2_b) + + # 第二次子图inline + %2 = PrimFunc_Mul(%para1_a, Float32(0.5)) # 重复计算步骤 + %3 = PrimFunc_Add(%2, %para3_c) + + %4 = PrimFunc_Add(%1, %3) + + Return(%4) +} +``` + +在 inline 将子图展开之前,编译器可能无法识别到两次调用子图 f2 中的重复操作(此时子图通常被当作黑盒处理)。而通过 inline 将子图展开后,此时编译器可以清晰看到`x * 0.5`被计算了两次,就可以触发编译器进一步的优化:**公共子表达式消除** (CSE, Common Subexpression Elimination),这样就降低了计算量。 + +```text +subgraph @f1() { + %0 = PrimFunc_Mul(%para1_a, Float32(0.5)) # CSE合并重复计算 + + %1 = PrimFunc_Add(%0, %para2_b) + + %2 = PrimFunc_Add(%0, %para3_c) # 直接复用%0 + + %3 = PrimFunc_Add(%1, %2) + + Return(%3) +} +``` + +通过 inline 将子图展开,编译器能够更清晰地识别跨子图的优化机会,除了公共子表达式消除 (CSE),还能够触发算子融合、内存管理等许多优化措施。因此 inline 是计算图编译器的一项重要优化机制,也是许多跨图优化的基础。 + +### 3 冗余消除 + +在传统编译器中,冗余消除包含了多种编译优化技术,旨在通过在编译期间识别出代码中存在冗余的部分并进行消除,达到减少不必要的计算,提高程序的执行效率的目的。 + 
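As a conceptual follow-up to the inline and CSE walkthrough above: the sketch below is an illustrative, framework-agnostic common subexpression elimination pass over hypothetical `(result, op, operands)` triples, not MindSpore's actual implementation. It shows why inlining the two f2 calls lets the duplicated `a * 0.5` be merged.

```python
# Illustrative sketch of CSE: reuse the first result of an identical
# (op, operands) pair and rename later references to it.
def eliminate_common_subexpressions(ops):
    seen = {}     # (op, operands) -> first result name
    rename = {}   # duplicated result -> canonical result
    kept = []
    for result, op_name, operands in ops:
        operands = tuple(rename.get(a, a) for a in operands)
        key = (op_name, operands)
        if key in seen:
            rename[result] = seen[key]   # duplicate computation, drop it
        else:
            seen[key] = result
            kept.append((result, op_name, operands))
    return kept

# Mirrors the inlined f1 graph: both calls computed a * 0.5.
ops = [
    ("t0", "Mul", ("a", "0.5")),
    ("t1", "Add", ("t0", "b")),
    ("t2", "Mul", ("a", "0.5")),  # duplicate of t0
    ("t3", "Add", ("t2", "c")),
    ("t4", "Add", ("t1", "t3")),
]
print(eliminate_common_subexpressions(ops))
```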
+通常冗余代码可能是用户出于可读性等目的有意编写的,也可能仅仅是编码过程中的无心之举。此外,编译优化过程本身通过其它优化技术(如:代数化简、inline、公共子表达式消除等)产生的中间结果,也可能带来冗余消除的机会。 + +MindSpore冗余消除的目的及使用的技术与传统编译器类似。不同的是这些冗余优化是在 MindIR 上完成的。例如: + +1. **无用代码消除** + + 假设有如下存在冗余计算的Python代码: + + ```python + import mindspore + + @mindspore.jit + def func(x, y): + a = x + y + b = x - y + c = x * y # 无用代码 + d = a / b + return d + + x = mindspore.tensor(20, mindspore.float32) + y = mindspore.tensor(10, mindspore.float32) + out = func(x, y) + ``` + + MindSpore 图编译器会通过静态分析将 `@mindspore.jit` 修饰的 Python 代码转换为 MindIR 的表示形式并消除其中冗余的 `c = x * y` 的计算,最终生成的 MindIR 如下: + + ```text + # Params: + %para1_x: + %para2_y: + + subgraph @func_1() { + %0(a) = PrimFunc_Add(%para1_x, %para2_y) + : (, ) -> () + %1(b) = PrimFunc_Sub(%para1_x, %para2_y) + : (, ) -> () + %2(d) = PrimFunc_Div(%0, %1) + : (, ) -> () + Return(%2) + : () + } + ``` + +2. **不可达代码消除** + + 假设有如下存在不可达路径的Python代码: + + ```python + import mindspore + + @mindspore.jit + def func(x, y): + a = x + y + if 1 < 0: # 不可达分支 + b = x + y + else: + b = x - y + d = a / b + return d + + x = mindspore.tensor(20, mindspore.float32) + y = mindspore.tensor(10, mindspore.float32) + out = func(x, y) + ``` + + MindSpore图编译器会通过静态分析将 `@mindspore.jit` 修饰的 Python 代码转换为 MindIR 的表示形式并消除其中冗余的控制流分支 `1 < 0` 的代码,最终生成的 MindIR 如下: + + ```text + # Params: + %para1_x: + %para2_y: + + subgraph @func_1() { + %0(a) = PrimFunc_Add(%para1_x, %para2_y) + : (, ) -> () + %1(b) = PrimFunc_Sub(%para1_x, %para2_y) + : (, ) -> () + %2(d) = PrimFunc_Div(%0, %1) + : (, ) -> () + Return(%2) cnode_attrs: {checkpoint: Bool(1)} + : () + } + ``` + +冗余消除在编译优化中扮演着重要的角色,在不改变程序原语义的前提下,能够显著提高程序的执行效率,通过减少不必要的运行时计算节省计算资源。冗余消除通常还与其它编译优化技术结合使用以获得更多消除冗余代码的机会。 + +## 图优化(后端) + +当MindIR图经过前端优化完成后,需要进行进一步优化(包含目标硬件)。优化模式我们分为O0,O1,用参数jit_level表示: + +- **jit_level=O0**:只做基本的图切分优化,以及算子选择(硬件相关),优点是可以保证IR图的原始结构,编译速度较快。 +- **jit_level=O1**:增加图优化和自动算子融合,编译性能有所损失,但模型开始训练后,效率较高。 + +MindIR经过本轮优化后,会由runtime模块进行执行,涉及多级流水并发等技术,可参考[多级流水](https://www.mindspore.cn/docs/zh-CN/r2.7.0/features/runtime/multilevel_pipeline.html)。 + +### jit_level=O0 模式 + +O0模式的优化较少,基础的优化主要为后端LazyInline和No-task node执行优化。 + +- **LazyInline**:主要思想是将函数调用的开销推迟到实际需要调用的时候,这样可以减少编译时的开销,提高编译效率。LazyInline在图编译阶段是将相同的子图结构复用,不展开放在图中,避免图规模较大导致影响编译性能。 + + ![jit_level_lazyinline](./images/multi_level_compilation/jit_level_lazyinline.png) + +- **No-task node执行优化**:No-task node指的是Reshape、ExpandDims、Squeeze、Flatten、FlattenGrad、Reformat等诸类算子没有计算逻辑,不修改内存排布,仅修改shape、format等信息。在图编译结束后,将No-task node转换成ref node,输出跟输入同地址,执行过程中跳过kernel launch,从而达到执行性能优化目的。 + + ![jit_level_no_task](./images/multi_level_compilation/jit_level_no_task.png) + +#### 算子选择 + +算子是深度学习框架中的基本执行单元,它们负责执行特定的计算任务,如矩阵乘法、卷积、池化等。算子选择需要综合考虑算子类型、数据类型、硬件平台和算子优化等因素,以选择最优的算子来实现模型运行效率最高。 + +MindSpore 在Ascend硬件的算子类型有aclnn kernel/aclop kernel/hccl kernel/cpu kernel,算子选择流程如下图所示: ![jit_level_kernelselect](./images/multi_level_compilation/jit_level_kernelselect.png) 1. 算子类型:首先根据算子类型选择为计算算子还是通信算子。 -2. 硬件平台:如果硬件上有对应算子,则优先选择硬件上的算子,否则选择CPU上的异构算子,例如shape相关的计算算子可能只适合在CPU上支持,没有对应的硬件算子。 -3. 算子效率:Ascend上由于Aclnn算子较好的性能,因此计算类型算子如果有对应Aclnn kernel,则优先选择Aclnn kernel,否则就选择Aclop kernel。 -4. 如果上述3步都未选择到算子,则为不支持的算子,算子选择失败退出。 +2. 硬件平台:如果硬件上有对应算子,则优先选择硬件上的算子,否则选择CPU上的算子(异构),例如shape相关的计算算子可能只适合在CPU上支持,没有对应的硬件算子。 +3. 算子效率:ascend硬件由于aclnn算子较好的性能,因此计算类型算子如果有对应aclnn kernel,则优先选择aclnn kernel,否则就选择aclop kernel。 +4. 
如果上述3步都未选择到算子,则为不支持的算子,算子选择失败报错。 -### 执行序编排 +#### 执行序编排 +不同图遍历算法产生的执行序在执行性能跟内存上会有较大的差异,如图所示: ![jit_level_exec_order](./images/multi_level_compilation/jit_level_exec_order.png) -不同图遍历算法产生的执行序在执行性能跟内存上会有较大的差异,如上图所示: - - **BFS得到的执行序**:kernel1-> kernel2-> kernel4-> kernel5-> kernel3-> kernel6,内存峰值为5G(kernel3执行后可以把kernel1和kernel2的释放掉,则轮到kernel6执行的时候则能复用,因此kernel6 不用额外申请多的内存)。 - **DFS得到的执行序**:kernel1-> kernel2-> kernel3-> kernel4-> kernel5-> kernel6,内存峰值为4G(kernel3执行后可以把kernel1和kernel2的释放掉,则轮到kernel4和kernel5执行的时候则能复用,因此kernel4和kernel5不用额外申请多的内存)。 @@ -70,15 +388,13 @@ MindSpore Ascend后端的算子类型有Aclnn kernel/Aclop kernel/Hccl kernel/Cp - 首先,优化模块需要解决求解最优算子并发的复杂性问题。由于计算图中的算子数量庞大且相互依赖,找到一个既能最大化并发又能保持计算图逻辑正确性的执行顺序是一个极具挑战性的任务。 - 其次,内存限制是执行序优化中不可忽视的关键因素。增大并发虽然可以提升计算效率,但往往会显著增加峰值内存需求,从而可能导致内存溢出(OOM)错误,尤其是在资源受限的环境中。因此,优化模块必须权衡并发与内存使用之间的关系,确保在提升并发的同时,不会超出系统的内存容量。 -- MindSpore的执行序调整模块结合了基于规则和基于启发式策略的方式,提供bfs/dfs两种执行序编排算法[mindspore.jit(option={"exec_order":"bfs/dfs"})](https://www.mindspore.cn/docs/zh-CN/r2.7.0/api_python/mindspore/mindspore.jit.html#mindspore.jit),以实现对计算图执行顺序的精细调整,从而在保证计算效率的同时,有效应对内存限制和系统稳定性等多重挑战。 +- MindSpore的执行序调整模块结合了基于规则和基于启发式策略的方式,提供bfs/dfs两种执行序编排算法[mindspore.jit(option={"exec_order":"bfs/dfs"})](https://www.mindspore.cn/docs/zh-CN/r2.7.0/api_python/mindspore/mindspore.jit.html),以实现对计算图执行顺序的精细调整,从而在保证计算效率的同时,有效应对内存限制和系统稳定性等多重挑战。 -## O1模式介绍 +### jit_level=O1 模式 -O1主要定位于在O0基础上实现通用、可泛化的AI编译优化,以支持大部分通用训练、推理场景的更好执行性能需求。 +当前O1主要支持了图算融合优化。其主要思路是:在编译阶段,自动识别计算图中相邻的可融合节点,然后将其融合为更大粒度的可执行算子。通过图算融合,实现增加算子计算局部性、减少整体全局内存访存带宽开销等优化效果。通过对主流SOTA模型的实测验证,O1能够实现相比O0平均15%的性能加速。特别是对于访存密集型网络,O1优化效果更加显著。 -在当前阶段,O1主要支持了图算融合优化。其主要思路是:在静态图编译阶段,自动识别计算图中相邻的可融合节点,然后将其融合为更大粒度的可执行算子。通过图算融合,实现增加算子计算局部性、减少整体全局内存访存带宽开销等优化效果。通过对15+网络的实测验证,O1能够实现相比O0平均15%的性能加速。特别是对于访存密集型网络,O1优化效果更加显著。 - -### 图算融合 +#### 图算融合 MindSpore等主流AI计算框架对用户提供的算子通常是从用户可理解、易使用角度进行定义。每个算子承载的计算量不等,计算复杂度也各不相同。但从硬件执行角度看,这种天然的、基于用户角度的算子计算量划分,并不高效,也无法充分发挥硬件资源计算能力。主要体现在: @@ -131,9 +447,4 @@ MindSpore AKG的整体框架如上图所示: - 后端优化 - 为了进一步提升算子的性能,我们针对不同硬件后端开发了相应的优化pass,如Ascend后端中实现数据对齐、指令映射,GPU后端中实现向量化存取,插入同步指令等,最终生成相应平台代码。 -### 其它图优化技术 - -除了图算融合之外,在后续版本中,O1可能会逐步扩展增加一些其它图优化技术。比如: - -1. KernelPacket:用于在动态shape场景对shape计算进行自动融合和优化; -2. 
通算融合:将通信算子与计算算子进行融合。 +总结:MindSpore编译从图捕获模式,IR优化图算融合等各维度对AI模型代码进行优化,很多特性在易用性和性能方面的取舍也有一定挑战。我们也规划进一步分层解耦整个流程,避免黑盒运行,增加开发者理解的门槛。 \ No newline at end of file diff --git a/docs/mindspore/source_zh_cn/features/compile/graph_construction.ipynb b/docs/mindspore/source_zh_cn/features/compile/graph_construction.ipynb deleted file mode 100644 index c4fd24bfe91603e47dec9ea639e24b436e8d8077..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_zh_cn/features/compile/graph_construction.ipynb +++ /dev/null @@ -1,273 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 构图(编译)\n", - "\n", - "[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_notebook.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.7.0/zh_cn/features/compile/mindspore_graph_construction.ipynb) [![下载样例代码](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_download_code.svg)](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/r2.7.0/zh_cn/features/compile/mindspore_graph_construction.py) [![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_zh_cn/features/compile/graph_construction.ipynb)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "MindSpore提供JIT(just-in-time)技术来进行性能优化。JIT模式会通过AST树解析、Python字节码解析或追踪代码执行的方式,将代码解析为一张中间表示图(IR,intermediate representation)。IR图作为该代码的唯一表示,编译器通过对该IR图的优化,来达到对代码的优化,提高运行性能。与动态图模式相对应,这种JIT的编译模式被称为静态图模式。\n", - "\n", - "基于JIT技术,MindSpore提供了动静结合的方法来提高用户的网络的运行效率。动静结合,即在整体运行为动态图的情况下,指定某些代码块以静态图的方式运行。按照静态图方式运行的代码块会采取先编译后执行的运行模式,在编译期对代码进行全局优化,来获取执行期的性能收益。用户可以通过`@jit`装饰器修饰函数,来指定其按照静态图的模式执行。有关`@jit`装饰器的相关文档请见[jit API文档](https://www.mindspore.cn/docs/zh-CN/r2.7.0/api_python/mindspore/mindspore.jit.html#mindspore.jit)。\n", - "\n", - "MindSpore提供了三种JIT编译方式,分别通过ast、bytecode和trace的方式来构图。ast是通过AST树解析的方式,将用户手工标识需要按照ast方式执行的函数转换成静态图。bytecode则是通过对Python字节码的解析,在动态图中尽可能的构建静态图,无法转换为静态图的部分则会按照动态图进行执行,来达到动静结合的目的。trace是通过追踪Python代码执行的轨迹来构建静态图,当前属于实验性质的特性。后续介绍会详细说明三者原理的不同以及各自的特点。\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Ast\n", - "\n", - "在动态图模式下,用户可以通过`@jit(capture_mode=\"ast\")`装饰器修饰函数来让该函数以ast方式来执行。用ast方式修饰的函数,其内部使用的语法以及数据结构需要遵守静态图语法规范[静态图语法规范](https://www.mindspore.cn/tutorials/zh-CN/r2.7.0/compile/static_graph.html)。ast方式通过源到源的方式来编译Python代码,先把模型定义的Python源码解析成抽象语法树,然后把抽象语法树解析为MindIR。例如下面的Python代码:\n", - "\n", - "```python\n", - "@jit\n", - "def foo(x, y):\n", - " z = x + y\n", - " return z\n", - "```\n", - "\n", - "它对应的抽象语法树如下:\n", - "\n", - "![抽象语法树](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/docs/mindspore/source_zh_cn/features/compile/images/ast.png)\n", - "\n", - "通过解析上面的抽象语法树,我们得到下面的MindIR:\n", - "\n", - "```text\n", - "%para1_x: \n", - "%para2_y: \n", - "\n", - "subgraph instance: foo\n", - "subgraph @foo() {\n", - " %0(CNode_17) = PrimFunc_Add(%para1_x, %para2_y)\n", - " : (, ) -> ()\n", - " Return(%0)\n", - " : ()\n", - "}\n", - "```\n", - "\n", - "**ast的使用方法**:\n", - "\n", - "用户可以通过`@jit`装饰器来指定函数以静态图的方式来执行,例如:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[[4. 4. 4. 4.]\n", - " [4. 4. 4. 
4.]]" - ] - } - ], - "source": [ - "import numpy as np\n", - "import mindspore as ms\n", - "from mindspore import ops\n", - "from mindspore import jit\n", - "from mindspore import Tensor\n", - "\n", - "@jit\n", - "def tensor_cal(x, y, z):\n", - " return ops.matmul(x, y) + z\n", - "\n", - "x = Tensor(np.ones(shape=[2, 3]), ms.float32)\n", - "y = Tensor(np.ones(shape=[3, 4]), ms.float32)\n", - "z = Tensor(np.ones(shape=[2, 4]), ms.float32)\n", - "ret = tensor_cal(x, y, z)\n", - "print(ret)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "上述用例中,tensor_cal函数被@jit装饰器修饰,该函数被调用时就会按照静态图的模式进行执行,以获取该函数执行期的性能收益。\n", - "\n", - "**ast的优点**:\n", - "\n", - "- 使用ast模式,用户的编程自主性更强,性能优化更精准,可以根据函数特征以及使用经验将网络的性能调至最优。\n", - "\n", - "**ast的限制**:\n", - "\n", - "- ast修饰的函数,其内部的语法必须严格遵守静态图语法来进行编程。\n", - "\n", - "**ast模式的使用建议**:\n", - "\n", - "- 相比于动态图执行,被`@jit`修饰的函数,在第一次调用时需要先消耗一定的时间进行静态图的编译。在该函数的后续调用时,若原有的编译结果可以复用,则会直接使用原有的编译结果进行执行。因此,使用@jit装饰器修饰会多次执行的函数通常会获得更多的性能收益。\n", - "\n", - "- 静态图模式的运行效率优势体现在其会将被@jit修饰函数进行全局上的编译优化,函数内含有的操作越多,优化的上限也就越高。因此`@jit`装饰器修饰的函数最好是内含操作很多的大代码块,而不应将很多细碎的、仅含有少量操作的函数分别打上jit标签。否则,则可能会导致性能没有收益甚至劣化。\n", - "\n", - "- MindSpore静态图绝大部分计算以及优化都是基于对Tensor计算的优化,因此我们建议被修饰的函数应该是那种用来进行真正的数据计算的函数,而不是一些简单的标量计算或者数据结构的变换。\n", - "\n", - "- 被`@jit`修饰的函数,若其输入存在常量,那么该函数每次输入值的变化都会导致重新编译,关于变量常量的概念请见[即时编译下的常量与变量](https://www.mindspore.cn/tutorials/zh-CN/r2.7.0/compile/static_graph.html#%E5%8D%B3%E6%97%B6%E7%BC%96%E8%AF%91%E4%B8%8B%E7%9A%84%E5%B8%B8%E9%87%8F%E4%B8%8E%E5%8F%98%E9%87%8F)。因此,建议被修饰的函数以Tensor或者被Mutable修饰的数据作为输入。避免因多次编译导致的额外性能损耗。" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Bytecode\n", - "\n", - "除了ast,MindSpore提供另外一种静态化加速机制bytecode,用户可以通过`@jit(capture_mode=\"bytecode\")`装饰器修饰函数来让该函数以bytecode模式来执行。当bytecode识别到不支持进入静态图的语法时,会回退到Python执行而非直接编译报错。该功能同时兼顾性能和易用性,减少编译报错的发生。它基于Python字节码的分析,对Python的执行流进行图捕获,让可以以静态图方式运行的子图以静态图方式运行,并让Python语法不支持的子图以动态图方式运行,同时通过修改调整字节码的方式链接动静态图,达到动静混合执行。在满足易用性的前提下,尽可能地提高性能。\n", - "\n", - "**bytecode的运行原理**:\n", - "\n", - "1. 基于Python虚拟机_PyInterpreterState_SetEvalFrameFunc捕获Python函数的执行,采用上下文管理的方式捕获执行区域内的所有Python函数执行。\n", - "2. 按照当前的运行时输入参数结合函数字节码进行分析,构造控制流图(CFG)以及数据流图(DFG)。\n", - "3. 模拟进栈出栈操作,跟踪逐个字节码,根据栈输入,推导输出。Python3.7~Python3.11每条字节码都有对应的模拟实现,注意是推导输出的类型尺寸,而不是真正执行得到值,除非常量折叠。\n", - "4. 在模拟执行字节码的过程中,将推导结果和操作翻译成MindIR,最后,通过常量折叠,UD分析(删除无用的输入输出参数)等方式,优化静态图。\n", - "5. 在执行等效的静态图之前,对输入参数和优化过程中产生的看护Guard条件进行比对,根据运行时信息,选择匹配的静态图执行。\n", - "6. 动态管理看护Guard和静态图缓冲的匹配关系,对不常用的静态图缓冲进行回收,通过Symbolic Shape和Dynamic Shape优化静态图缓冲。\n", - "\n", - "bytecode的编译流程如下图所示\n", - "\n", - "![bytecode的编译流程](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/docs/mindspore/source_zh_cn/features/compile/images/bytecode.png)\n", - "\n", - "**bytecode的使用方式**:\n", - "\n", - "将jit的capture_mode参数设置为bytecode,即可将修饰函数的运行模式切换为bytecode,例如:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[[4. 4. 4. 4.]\n", - " [4. 4. 4. 
4.]]" - ] - } - ], - "source": [ - "import numpy as np\n", - "import mindspore as ms\n", - "from mindspore import ops\n", - "from mindspore import jit\n", - "from mindspore import Tensor\n", - "\n", - "@jit(capture_mode=\"bytecode\")\n", - "def tensor_cal(x, y, z):\n", - " return ops.matmul(x, y) + z\n", - "\n", - "x = Tensor(np.ones(shape=[2, 3]), ms.float32)\n", - "y = Tensor(np.ones(shape=[3, 4]), ms.float32)\n", - "z = Tensor(np.ones(shape=[2, 4]), ms.float32)\n", - "ret = tensor_cal(x, y, z)\n", - "print(ret)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**bytecode的优点**:\n", - "\n", - "- 用户体验好,无需人工介入,用户编写的网络代码总是能够正常运行,静态图不能执行的代码会自动采用动态图运行。\n", - "- bytecode可以通过对字节码的变换,使得更多的语句进入静态图。用户无需感知或修改代码。\n", - "\n", - "**bytecode的限制**:\n", - "\n", - "- 用户无法明确对某些代码做性能加速,对于裂图较多的场景,性能加速的效果可能会不明显。" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Trace\n", - "\n", - "MindSpore也提供另外一种静态化加速机制trace,用户可以通过`@jit(capture_mode=\"trace\")`装饰器修饰函数来让该函数以trace模式来执行。在该模式下,代码会先以PyNative模式运行,在运行时调用的算子会被记录,并被捕获到计算图中。在后续执行该装饰器修饰的代码时,会直接执行第一次执行所构造出的计算图。该功能不会解析语法,只会捕获运行时调用的算子,因此不会有语法不支持报错的发生。它基于捕获运行PyNative模式时调用的算子,对Python的执行流进行图捕获,将捕获到的算子编入计算图中。没有对应算子的操作将无法生成节点,trace流程将只捕获该操作的返回值,在计算图中作为常量。生成的计算图以静态图的运行方式运行。\n", - "\n", - "**trace的使用方式**:\n", - "\n", - "将jit的capture_mode参数设置为trace,即可将修饰函数的运行模式切换为trace,例如:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[[4. 4. 4. 4.]\n", - " [4. 4. 4. 4.]]" - ] - } - ], - "source": [ - "import numpy as np\n", - "import mindspore as ms\n", - "from mindspore import ops\n", - "from mindspore import jit\n", - "from mindspore import Tensor\n", - "\n", - "@jit(capture_mode=\"trace\")\n", - "def tensor_cal(x, y, z):\n", - " return ops.matmul(x, y) + z\n", - "\n", - "x = Tensor(np.ones(shape=[2, 3]), ms.float32)\n", - "y = Tensor(np.ones(shape=[3, 4]), ms.float32)\n", - "z = Tensor(np.ones(shape=[2, 4]), ms.float32)\n", - "ret = tensor_cal(x, y, z)\n", - "print(ret)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**trace的优点**:\n", - "\n", - "- 构图能力强,只要代码有对应算子就能够入图,不需要额外适配。构建静态图时不会有语法不支持报错。\n", - "- 用户体验好,无需人工介入,用户编写的网络代码总是能够正常运行。\n", - "\n", - "**trace的限制**:\n", - "\n", - "- 无法感知控制流,多次运行时控制流会进入不同分支的场景无法保证正确性。\n", - "- 没有定义为算子的操作,如第三方库会在计算图中被固定为常量,多次运行无法保证正确性。" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python [conda env:base] *", - "language": "python", - "name": "conda-base-py" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.7" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/mindspore/source_zh_cn/features/compile/graph_optimization.md b/docs/mindspore/source_zh_cn/features/compile/graph_optimization.md deleted file mode 100644 index 788be180d90866d6360479c425856d650003ddc3..0000000000000000000000000000000000000000 --- a/docs/mindspore/source_zh_cn/features/compile/graph_optimization.md +++ /dev/null @@ -1,318 +0,0 @@ -# 图优化(编译) - -[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0/resource/_static/logo_source.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0/docs/mindspore/source_zh_cn/features/compile/graph_optimization.md) - -与传统编译器类似,MindSpore 在进行完构图之后,也会进行编译优化。编译优化的主要目的是通过静态分析技术对 MindSpore 的中间表示 MindIR 
进行分析和转换,以达成减小目标代码大小、提升代码执行效率、降低运行时资源开销或者提升其它性能指标的目的。编译优化是图编译系统中的重要一环,对提升整个神经网络模型的性能和资源利用率有着极其重要的意义,相较于未经过编译优化的原始代码,编译优化可能带来数倍甚至数十倍的性能提升。 - -本节主要介绍独立于特定硬件的前端编译优化技术,特定于硬件的后端编译优化技术不在本节的讨论范围之内。 - -## 前端编译优化技术原理 - -与传统编译优化技术类似,MindSpore 中的编译优化也是通过一个个 Pass 来完成的。将每个 Pass 的上一个 Pass 所产生的 MindIR 作为输入,经过本 Pass 优化之后,产生新的 MindIR 表示作为输出。一个大的 Pass 可以包含多个小的 Pass,每个小的 Pass 只负责单点的编译优化,如:代数化简、函数内联(inline)、冗余消除等。一个 Pass 产生的优化结果,可能会为其它的 Pass 带来优化机会,故可以循环运行这些 Pass,直到产生的 MindIR 不再发生变化为止。 - -编译优化过程中,选择运行哪些 Pass,如何安排这些 Pass 的执行顺序对生成的最终的编译结果有着非常重要的影响。可以按照实际情况,通过设定编译优化策略(如优化级别、次数等)来对即将执行的优化动作进行调整。 - -## 常见前端编译优化技术 - -前端编译优化技术有很多,如:代数化简、函数inline(内联)、冗余消除等。本节将介绍部分具有代表性的编译优化技术。 - -### 代数化简 - -在传统编译器中,代数化简是一种编译器优化技术,旨在简化源代码中的代数表达式,消除多余计算,提高程序执行效率、减少内存占用等。 - -例如,在以下代码片段中: - -```cpp -int a = x * 1; -int b = x + 0; -int c = x * 0 + y * 1; -``` - -传统编译器根据代数规则和恒等式对识别出的表达式进行等价替换。常见代数规则包括结合律、交换律和分配律等,编译器尽可能将表达式替换成更为简单的形式。通过对 AST(抽象语法树)或 SSA(静态单赋值形式)的分析来进行优化,识别并简化代码为: - -```cpp -a = x; -b = x; -c = y; -``` - -在 MindSpore编译器中,代数化简原理不同于传统编译器,进行处理的是计算图而非传统控制流图,通过调整计算图中算子的执行顺序,或者删除不必要的算子,以保持计算图的简洁性和提高计算效率。 - -例如,在如下Python代码片段中: - -```python -import numpy as np -from mindspore.common import Tensor, jit - -@jit -def func(x): - return x + 0 - -m = Tensor(np.array([[1, 2, 3], [4, 5, 6]]).astype(np.int32)) -out = func(m) -``` - -MindSpore图编译器会把 Python 程序转换为计算图,计算图由多个子图构成。源程序中的代数运算,转换为子图内部的算子调用,可以看到 PrimFunc_Add 算子调用了一次。 - -```text -%para1_x: - -subgraph @1_func_14() { - %0(CNode_7) = PrimFunc_Add(%para1_x, Tensor(shape=[], dtype=Int32, value=0)) - : (, ) -> () - - Return(%0) - : () -} -``` - -通过代数化简,可以直接删除 PrimFunc_Add 算子,简化计算图结构,将 `x + 0` 简化成 `x`。 - -```text -%para1_x: - -subgraph @1_func_14() { - Return(%para1_x) - : () -} -``` - -代数化简能更多地涉及对计算图结构的修改,它通常还与其他编译器优化技术(如常量折叠、常量传播等)结合使用,共同提高程序性能。 - -### 函数inline - -在传统编译器中,inline(内联)是一种优化技术,可以把被调用函数的代码直接替换到调用该函数的位置,提高程序运行效率。假设我们有一个 C++ 函数`add`,用于对两个数求和: - -```cpp -int add(int a, int b) { - return a + b; -} - -int main() { - int x = add(3, 5); - int y = add(x, 10); - return y; -} -``` - -编译器通过 inline 将函数体直接替换到调用处,这消除了函数调用的开销,同时为后续优化(如消除冗余计算`3 + 5`,直接在编译期求值替换)创造了条件。这种**用代码替换调用**的思想,正是 inline 的核心。 - -```cpp -int main() { - int x = 3 + 5; // 替换第一次调用 - int y = x + 10; // 替换第二次调用 - return y; -} -``` - -在 AI 框架的计算图编译器中,inline 的目标类似,但操作对象从“函数”变成了“子图”(subgraph)。假设我们有一个 Python 程序: - -```python -from mindspore import Tensor, jit, ops - -def f2(x: Tensor, y: Tensor): - return x * 0.5 + y - -@jit -def f1(a: Tensor, b: Tensor, c: Tensor): - x = f2(a, b) - y = f2(a, c) - return x + y - -# 创建3个shape=(2, 4)的随机值Tensor -a = ops.randn(2, 4) -b = ops.randn(2, 4) -c = ops.randn(2, 4) -out = f1(a, b, c) -``` - -首先,MindSpore 的计算图编译器会把 Python 程序转换为计算图。而 Python 程序中的函数调用,会转换为计算图之间的调用,得到类似于下面的原始计算图。其中,主图 f1 调用了 2 次子图 f2。 - -```text -# Params: -%para1_a: -%para2_b: -%para3_c: - -subgraph @f2(%para1_x, %para2_y) { - %0 = PrimFunc_Mul(%para1_x, Float32(0.5)) - - %1 = PrimFunc_Add(%0, %para2_y) - - Return(%1) -} - -subgraph @f1() { - %0(x) = call @f2(%para1_a, %para2_b) # 调用子图f2 - - %1(y) = call @f2(%para1_a, %para3_c) # 调用子图f2 - - %2 = PrimFunc_Add(%0, %1) - - Return(%2) -} -``` - -通过 inline,可以将子图 f2 展开,合并到主图 f1。 - -```text -subgraph @f1() { - # 第一次子图inline - %0 = PrimFunc_Mul(%para1_a, Float32(0.5)) # 重复计算步骤 - %1 = PrimFunc_Add(%0, %para2_b) - - # 第二次子图inline - %2 = PrimFunc_Mul(%para1_a, Float32(0.5)) # 重复计算步骤 - %3 = PrimFunc_Add(%2, %para3_c) - - %4 = PrimFunc_Add(%1, %3) - - Return(%4) -} -``` - -在 inline 将子图展开之前,编译器可能无法识别到两次调用子图 f2 中的重复操作(此时子图通常被当作黑盒处理)。而通过 inline 将子图展开后,此时编译器可以清晰看到`x * 
0.5`被计算了两次,就可以触发编译器进一步的优化:**公共子表达式消除** (CSE, Common Subexpression Elimination),这样就降低了计算量。 - -```text -subgraph @f1() { - %0 = PrimFunc_Mul(%para1_a, Float32(0.5)) # CSE合并重复计算 - - %1 = PrimFunc_Add(%0, %para2_b) - - %2 = PrimFunc_Add(%0, %para3_c) # 直接复用%0 - - %3 = PrimFunc_Add(%1, %2) - - Return(%3) -} -``` - -通过 inline 将子图展开,编译器能够更清晰地识别跨子图的优化机会,除了公共子表达式消除 (CSE),还能够触发算子融合、内存管理等许多优化措施。因此 inline 是计算图编译器的一项重要优化机制,也是许多跨图优化的基础。 - -### 冗余消除 - -在传统编译器中,冗余消除包含了多种编译优化技术,旨在通过在编译期间识别出代码中存在冗余的部分并进行消除,达到减少不必要的计算,提高程序的执行效率的目的。 - -通常冗余代码可能是用户出于可读性等目的有意编写的,也可能仅仅是编码过程中的无心之举。此外,编译优化过程本身通过其它优化技术(如:代数化简、inline、公共子表达式消除等)产生的中间结果,也可能带来冗余消除的机会。 - -冗余消除的技术有很多,本节挑选了其中常见的无用代码消除、不可达代码消除进行介绍。 - -1. **无用代码消除** - - 消除计算结果未被使用的代码。例如:下面的 C++ 代码中,变量 `c` 未被任何其它代码使用,编译器可以通过静态分析领域的数据流分析等技术,将计算 `int c = x * y` 的这行代码消除。 - - ```cpp - int func(x, y) { - int a = x + y; - int b = x - y; - int c = x * y; // 无用代码 - int d = a / b; - return d; - } - ``` - -2. **不可达代码消除** - - 消除未被有效控制流路径包含的代码。例如:下面的 C++ 代码中,编译器可以通过静态分析领域的控制流分析技术,分析代码的控制流图,识别到表达式 `1 < 0` 恒不成立,从而控制流 `1 < 0` 包含的代码在实际运行期间必定不会被执行,故可将该分支的代码消除。 - - ```cpp - int func(x, y) { - int a = x + y; - - int b; - if 1 < 0 { // 不可达分支 - b = x + y; - } else { - b = x - y; - } - - int d = a / b; - return d; - } - ``` - -MindSpore 图模式下冗余消除的目的及使用的技术也类似。与传统编译器不同的是,这些冗余优化技术是在 MindIR 上完成的。类似的,MindSpore 中常见的冗余消除技术有: - -1. **无用代码消除** - - 假设有如下存在冗余计算的Python代码: - - ```python - import mindspore as ms - from mindspore.common import Tensor, jit - - @jit - def func(x, y): - a = x + y - b = x - y - c = x * y # 无用代码 - d = a / b - return d - - x = Tensor(20, ms.float32) - y = Tensor(10, ms.float32) - out = func(x, y) - ``` - - MindSpore 图编译器会通过静态分析将 `@jit` 修饰的 Python 代码转换为 MindIR 的表示形式并消除其中冗余的 `c = x * y` 的计算,最终生成的 MindIR 如下: - - ```text - # Params: - %para1_x: - %para2_y: - - subgraph @func_1() { - %0(a) = PrimFunc_Add(%para1_x, %para2_y) - : (, ) -> () - %1(b) = PrimFunc_Sub(%para1_x, %para2_y) - : (, ) -> () - %2(d) = PrimFunc_Div(%0, %1) - : (, ) -> () - Return(%2) - : () - } - ``` - -2. 
**不可达代码消除** - - 假设有如下存在不可达路径的Python代码: - - ```python - import mindspore as ms - from mindspore.common import Tensor, jit - - @jit - def func(x, y): - a = x + y - if 1 < 0: # 不可达分支 - b = x + y - else: - b = x - y - d = a / b - return d - - x = Tensor(20, ms.float32) - y = Tensor(10, ms.float32) - out = func(x, y) - ``` - - MindSpore 图编译器会通过静态分析将 `@jit` 修饰的 Python 代码转换为 MindIR 的表示形式并消除其中冗余的控制流分支 `1 < 0` 的代码,最终生成的 MindIR 如下: - - ```text - # Params: - %para1_x: - %para2_y: - - subgraph @func_1() { - %0(a) = PrimFunc_Add(%para1_x, %para2_y) - : (, ) -> () - %1(b) = PrimFunc_Sub(%para1_x, %para2_y) - : (, ) -> () - %2(d) = PrimFunc_Div(%0, %1) - : (, ) -> () - Return(%2) cnode_attrs: {checkpoint: Bool(1)} - : () - } - ``` - -冗余消除在编译优化中扮演着重要的角色,在不改变程序原语义的前提下,能够显著提高程序的执行效率,通过减少不必要的运行时计算节省计算资源。冗余消除通常还与其它编译优化技术结合使用以获得更多消除冗余代码的机会。 diff --git a/docs/mindspore/source_zh_cn/features/index.rst b/docs/mindspore/source_zh_cn/features/index.rst index 657f6d0bcaa47fdf3e45b42c2334ca6af9140757..3484e5a43c8f8857d3e81c78c3615c1e0bccded6 100644 --- a/docs/mindspore/source_zh_cn/features/index.rst +++ b/docs/mindspore/source_zh_cn/features/index.rst @@ -11,9 +11,7 @@ Developer Notes parallel/optimizer_parallel parallel/pipeline_parallel parallel/auto_parallel - compile/multi_level_compilation - compile/graph_construction - compile/graph_optimization + compile/compilation_guide runtime/memory_manager runtime/multilevel_pipeline runtime/multistream_concurrency diff --git a/docs/mindspore/source_zh_cn/features/overview.md b/docs/mindspore/source_zh_cn/features/overview.md index 33cacc3826b7698d5074a215e5aa9b71baea9ea6..4d909607ea256ac15b72c3d6ed2591c6b13f00ed 100644 --- a/docs/mindspore/source_zh_cn/features/overview.md +++ b/docs/mindspore/source_zh_cn/features/overview.md @@ -45,13 +45,13 @@ MindSpore实现了函数式微分编程,对可被微分求导的函数对象 ### 动静统一的编程体验 -传统AI框架主要有两种编程执行形态,静态图模式和动态图模式。 +传统AI框架主要有两种编程执行形态,静态图模式(Graph Mode)和动态图模式(PyNative Mode)。动态图模式又称Eager Mode。 -静态图模式会基于开发者调用的接口,在编译时生成神经网络的图结构,然后再执行图中涉及的计算操作。 +Graph Mode会在编译时生成神经网络的模型计算的图结构,然后再执行计算图。 -动态图模式,能有效解决静态图的编程门槛高问题,由于程序是按照代码的编写顺序执行,不做整图编译优化,相对性能优化空间较少,特别是面向DSA等专有硬件的优化具有较大挑战。 +PyNative Mode,由于程序是按照代码的编写顺序执行,符合python解释执行方式,易开发和调试。因为不做图编译优化,性能优化空间较少,特别是面向DSA等专有硬件的优化具有较大挑战。 -静态图模式,能有效感知神经网络各层算子间的关系,基于编译技术进行有效的编译优化以提升性能。但传统静态图需要开发者感知构图接口,组建或调试网络比较复杂,且难于与常用Python库、自定义Python函数进行穿插使用。 +MindSpore基于Python构建神经网络的图结构,相比于传统的Graph Mode,能有更易用、更灵活的表达能力。MindSpore创新性的构建源码转换能力,基于Python语句提取AST进行计算图构建,因此可以支持开发者使用的Python原生语法(条件/循环等)和其他操作,如元组(Tuple)、列表(List)以及Lambda表达来构建计算图,并对计算图进行自动微分。所以MindSpore能更好地兼容动态图和静态图的编程接口,在代码层面保持一致,如控制流写法等。 MindSpore基于Python构建神经网络的图结构,相比于传统的静态图模式,能有更易用、更灵活的表达能力。MindSpore创新性的构建源码转换能力,基于Python语句提取AST进行计算图构建,因此可以支持开发者使用的Python原生语法(条件/循环等)和其他操作,如元组(Tuple)、列表(List)以及Lambda表达来构建计算图,并对计算图进行自动微分。所以MindSpore能更好地兼容动态图和静态图的编程接口,在代码层面保持一致,如控制流写法等。 @@ -71,7 +71,7 @@ MindSpore在并行化策略搜索中引入了张量重排布技术(Tensor Redi MindSpore基于编译技术,提供了丰富的硬件无关优化,如IR融合、代数化简、常数折叠、公共子表达式消除等。同时针对NPU、GPU等不同硬件,也提供各种硬件优化能力,从而更好的发挥硬件的大规模计算加速能力。 -#### [图算融合](https://www.mindspore.cn/docs/zh-CN/r2.7.0/features/compile/multi_level_compilation.html#图算融合) +#### [多级编译架构](https://www.mindspore.cn/docs/zh-CN/r2.7.0/features/compile/compilation_guide.html#图算融合) MindSpore等主流AI计算框架对开发者提供的算子通常是从开发中可理解、易使用角度进行定义。每个算子承载的计算量不等,计算复杂度也各不相同。但从硬件执行角度看,这种天然的、基于用开发者角度的算子计算量划分,并不高效,也无法充分发挥硬件资源计算能力。主要体现在: