diff --git a/atvc/README.md b/atvc/README.md index 854dd6e79268b6a29ee4b18cbb36937bd9dfa8b8..93037269b547bca59c105af9fc743ca5881e8bc3 100644 --- a/atvc/README.md +++ b/atvc/README.md @@ -1,35 +1,35 @@ # ATVC -ATVC(AscendC Template for Vector Compute)是一个用AscendC API搭建的C++模板头文件集合,旨在帮助用户快速开发AscendC典型Vector算子。它将AscendC Vector算子开发流程中的计算实现解耦成可自定义的模块, 内部封装实现了kernel数据搬入搬出等底层通用操作及通用tiling计算,实现了高效的算子开发模式。 -相比传统AscendC算子开发方式,利用ATVC搭建的Vector算子可做到开发效率提升3-5倍。用户只需选择匹配的模板并完成核心计算逻辑就完成算子kernel侧开发,atvc还内置了每个模板库对应的通用tiling计算实现,可省去用户手写tiling的开发量就能达到不错的性能表现,极大提升算子开发效率。 +ATVC(Ascend C Template for Vector Compute)是一个用Ascend C API搭建的C++模板头文件集合,旨在帮助用户快速开发Ascend C典型Vector算子。它将Ascend C Vector算子开发流程中的计算实现解耦成可自定义的模块,内部封装实现了kernel数据搬入搬出等底层通用操作及通用tiling计算,实现了高效的算子开发模式。 +相比传统Ascend C算子开发方式,利用ATVC搭建的Vector算子可做到开发效率提升3-5倍。用户只需选择匹配的模板并完成核心计算逻辑,即可完成算子kernel侧开发;ATVC还内置了每个模板库对应的通用tiling计算实现,无需手写tiling即可达到不错的性能表现,极大提升算子开发效率。 -![atvc_user_case.png](./docs/data/atvc_user_case.png)
+![atvc_user_case.png](./docs/images/atvc_user_case.png)
-请参阅[快速入门](./docs/1_quick_start.md)以快速了解ATVC的Add算子搭建流程。 -请参阅[开发者文档](./docs/2_developer_guide.md)以获取ATVC框架各模板与API的使用细节,完成自定义EleWise类算子以及Reduce类算子开发。 +请参阅[快速入门](./docs/01_quick_start.md)以快速了解ATVC的Add算子搭建流程。 +请参阅[开发者文档](./docs/02_developer_guide.md)以获取ATVC框架各模板与API的使用细节,完成自定义Elementwise类算子以及Reduce类算子开发。 # 环境要求 - 硬件型号支持 -Atlas 800I A2推理服务器/Atlas A2训练服务器 +Atlas 800I A2推理服务器 - 配套软件 CANN开发套件包Ascend-cann-toolkit\_\\_linux\-\.run,并设置相关环境变量 cmake >= 3.16.0 # 工程目录 -ATVC工程结构可归纳成以下主要组件,更详细的文件结构介绍请参阅[Code Organization](./docs/code_organization.md): +ATVC工程结构可归纳成以下主要组件,更详细的文件结构介绍请参阅[Code Organization](./docs/03_code_organization.md): ``` ├── docs // 文档介绍 -├── examples // ATVC使用样例 -├── include // ATVC提供的头文件集合,用户使用前需将其置入其他工程的包含路径下 +├── examples // ATVC使用样例 +├── include // ATVC提供的头文件集合,用户使用前需将其置入其他工程的包含路径下 ├── tests // 测试模块相关代码 -└── ReadME.md // overview +└── README.md // 综述 ``` # 快速上手 -快速执行example用例,更详细的流程请参阅[快速入门](./docs/1_quick_start.md)。 +快速执行example用例,更详细的流程请参阅[快速入门](./docs/01_quick_start.md)。 - 下载ATVC代码 ```bash @@ -68,21 +68,22 @@ Accuracy verification passed. 
# 已支持的模版 | Vector模版类型 | | ------------------------------------------------------------ | -| Ele-wise模板 | +| Elementwise模板 | | Reduce模板 | | Broadcast模板 | # 样例介绍 | 样例名 | 描述 | 类型| | ------------------------------------------------------------ | ------------------------------------------------------------ |------------------------------------------------------------ | -| [add](./examples/add/add.cpp) | 使用ATVC的Ele-wise模板实现Add算子以及调用样例 | 直调算子 | -| [sinh_custom](./examples/sinh_custom/sinh_custom.cpp) | 临时Tensor参与计算的自定义Ele-wise类算子以及调用样例 | 直调算子 | -| [add_with_scalar](./examples/add_with_scalar/add_with_scalar.cpp) | 输入带标量的自定义Ele-wise类算子以及调用样例 | 直调算子 | +| [add](./examples/add/add.cpp) | 使用ATVC的Elementwise模板实现Add算子以及调用样例 | 直调算子 | +| [sinh_custom](./examples/sinh_custom/sinh_custom.cpp) | 临时Tensor参与计算的自定义Elementwise类算子以及调用样例 | 直调算子 | +| [add_with_scalar](./examples/add_with_scalar/add_with_scalar.cpp) | 输入带标量的自定义Elementwise类算子以及调用样例 | 直调算子 | | [reduce_sum](./examples/reduce_sum/reduce_sum.cpp) | 使用ATVC的Reduce模板实现自定义ReduceSum算子以及调用样例 | 直调算子 | | [broadcast_to](./examples/broadcast_to/broadcast_to.cpp) | 使用ATVC的Broadcast模板实现自定义BroadcastTo算子以及调用样例 | 直调算子 | -| [ops_aclnn](./examples/aclnn) | 使用ATVC基于自定义工程算子的实现以及调用样例 | 自定义工程算子 | -| [ops_pytorch](./examples/pytorch) | 使用ATVC基于[pytorch](https://gitee.com/ascend/pytorch)算子的实现以及调用样例 | pytorch算子 | +| [tanh_grad](./examples/tanh_grad/tanh_grad.cpp) | 使用Tiling超参进行算子性能调优的Elementwise类算子调用样例 | 直调算子 | +| [ops_aclnn](./examples/ops_aclnn) | 使用ATVC基于自定义工程算子的实现以及调用样例 | 自定义工程算子 | +| [ops_pytorch](./examples/ops_pytorch) | 使用ATVC基于[pytorch](https://gitee.com/ascend/pytorch)算子的实现以及调用样例 | pytorch算子 | -更多算子类型介绍和如何选取模板参见参阅[快速入门](./docs/1_quick_start.md),其中add、sinh_custom、add_with_scalar、reduce_sum、broadcast_to是ATVC的直调样例,ops_aclnn为基于ATVC对接aclnn工程的算子目录,ops_pytorch为基于ATVC对接pytorch工程的算子目录。其中,ops_aclnn和ops_pytorch样例需要进入到example路径下按照README.md描述执行。 
+更多算子类型介绍和如何选取模板请参阅[快速入门](./docs/01_quick_start.md),其中add、sinh_custom、add_with_scalar、reduce_sum、broadcast_to、tanh_grad是ATVC的直调样例,ops_aclnn为基于ATVC对接aclnn工程的算子目录,ops_pytorch为基于ATVC对接pytorch工程的算子目录。其中,ops_aclnn和ops_pytorch样例需要进入到example路径下按照README.md描述执行。 @@ -90,7 +91,7 @@ Accuracy verification passed. | 算子模板 | 数据类型 | | ------------------------------------------------------------ | ------------------------------------------------------------ | -| Ele-wise | int32_t、float | +| Elementwise | int32_t、float | | Reduce | int32_t、float | | Broadcast | int32_t、float | diff --git a/atvc/docs/1_quick_start.md b/atvc/docs/01_quick_start.md similarity index 66% rename from atvc/docs/1_quick_start.md rename to atvc/docs/01_quick_start.md index 3c5c469b49f1b6048ff53f5b6d4f9637a2273b22..51c24a860f7cf3a2edd3113d526045d6843be990 100644 --- a/atvc/docs/1_quick_start.md +++ b/atvc/docs/01_quick_start.md @@ -1,9 +1,8 @@ # 快速入门 这篇文档帮助你体验ATVC开发Add算子的整个流程。
-快速上手的完整代码请参阅[atvc/examples/add](../examples/add/add.cpp)。 # 环境要求 -ATVC 对软硬件运行环境有如下要求: +ATVC对软硬件运行环境有如下要求: - 硬件型号 Atlas 800I A2推理服务器 - 软件要求 @@ -33,17 +32,16 @@ git clone https://gitee.com/ascend/ascendc-api-adv.git # 使用ATVC开发Add算子 -本示例将展示如何基于ATVC提供的模板以及接口快速搭建Add算子,示例内展示了ATVC框架下区别于传统AscendC Add的实现代码。
-
+本示例将展示如何基于ATVC提供的模板以及接口快速搭建Add算子,示例内展示了ATVC框架下区别于传统Ascend C Add的实现代码。
## 定义算子描述 首先通过ATVC提供的[ATVC::OpTraits](../include/common/atvc_opdef.h)模板结构体来描述Add算子的输入输出信息,定义如下: ```cpp // Add算子中有两个输入,一个输出。类型均为float -using ADD_OPTRAITS = ATVC::OpTraits, ATVC::OpOutputs>; +using ADD_OPTRAITS = ATVC::OpTraits, ATVC::OpOutputs>; ``` ## 实现算子计算逻辑 -用户需要通过AscendC API来搭建Add算子的核心计算逻辑,在ATVC框架中,这类算子的核心计算逻辑是通过定义一个结构体的仿函数来实现。它需要`ATVC::OpTraits`作为固定模板参数,并重载`operator()`来被提供的Kernel层算子模板类调用。
+用户需要通过Ascend C API来搭建Add算子的核心计算逻辑。在ATVC框架中,这类计算逻辑通过定义一个结构体仿函数来实现:它以`ATVC::OpTraits`作为固定模板参数,并重载`operator()`供框架提供的Kernel层算子模板类调用。
Add算子的计算仿函数定义如下: ```cpp #include "atvc.h" // 包含所有atvc模板api的总入口头文件 @@ -61,7 +59,7 @@ struct AddComputeFunc { template // 重载operator,提供给算子模板类调用 __aicore__ inline void operator()(AscendC::LocalTensor a, AscendC::LocalTensor b, AscendC::LocalTensor c) { - AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC Api自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 + AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC API自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 } }; ``` @@ -80,13 +78,13 @@ template * a Device上的gm地址,指向Add算子第一个输入 * b Device上的gm地址,指向Add算子第二个输入 * c Device上的gm地址,指向Add算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * param 指向运行态ATVC::EleWiseParam数据 */ -__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR param) +__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); auto op = ATVC::Kernel::EleWiseOpTemplate>(); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 - op.Run(a, b, c, param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 + op.Run(a, b, c, ¶m); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 } ``` @@ -99,49 +97,48 @@ ATVC::ElewiseParam param; // totalCnt描述EleWise单输入的元素个数 int32_t totalCnt = 1024; // ADD_OPTRAITS为ADD算子描述原型,根据算子输入输出个数和实际元素数量计算出Tiling数据后填入param中 -if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { +if (!ATVC::Host::CalcEleWiseTiling(totalCnt, param)) { printf("Elewise tiling error.\n"); return -1; }; -// -// 调用aclrtMemcpy接口拷贝Host侧数据到Device侧 -// aclrtMemcpy(paramDevice, sizeof(param), reinterpret_cast(¶m), sizeof(param), ACL_MEMCPY_HOST_TO_DEVICE) -// uint32_t blockNum = param.tilingData.blockNum; // 调用核函数 +// aDevice Device上的gm地址,指向Add算子第一个输入 +// bDevice Device上的gm地址,指向Add算子第二个输入 +// cDevice Device上的gm地址,指向Add算子第一个输出 +// param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 AddCustom<<>>(aDevice, bDevice, cDevice, paramDevice); ``` ## 算子编译&执行 完成算子代码编写后,调用以下命令编译代码并执行: ```bash -$ cd ./atvc/tests/ -$ bash run_test.sh add -... 
-Generate golden data successfully. -... -Accuracy verification passed. +cd ./atvc/tests/ +bash run_test.sh add ``` 其他样例执行命令如下: ```bash -$ bash run_test.sh sinh_custom # 执行sinh_custom样例 -$ bash run_test.sh reduce_sum # 执行reduce_sum样例 -$ bash run_test.sh add_with_scalar # 执行add_with_scalar样例 -$ bash run_test.sh broadcast_to # 执行broadcast样例 +bash run_test.sh sinh_custom # 执行sinh_custom样例 +bash run_test.sh reduce_sum # 执行reduce_sum样例 +bash run_test.sh add_with_scalar # 执行add_with_scalar样例 +bash run_test.sh broadcast_to # 执行broadcast样例 ``` ## 完整样例
完整代码样例请参照[examples/add/add.cpp](../examples/add/add.cpp) # 模板选择 -用户根据待开发的Vector算子定义特征,选择匹配的模板及其配套的tiling算法,若自定义算子不在当前模板库的范围内,建议使用基本AscnedC API 手写算子。 -## Ele-wise类算子 -Ele-wise类算子通常是指对张量进行元素级别的操作的函数或方法,包括但不限于加、减、乘、除及指数、对数、三角函数等数学函数。这类算子的特点是会逐元素进行计算操作,而不会改变输入数据的形状。常见的ELe-wise算子有Add、Sub、Exp、Log、Sin、Sqrt等。 +用户根据待开发的Vector算子定义特征,选择以下三种匹配的模板及其配套的tiling算法,若自定义算子不在当前模板库的范围内,建议使用基本Ascend C API 手写算子。 +## Elementwise类算子 +Elementwise类算子通常是指对张量进行元素级别的操作的函数或方法,包括但不限于加、减、乘、除及指数、对数、三角函数等数学函数。这类算子的特点是会逐元素进行计算操作,而不会改变输入数据的形状。常见的Elementwise算子有Add、Sub、Exp、Log、Sin、Sqrt等。 ## Reduce类算子 Reduce类算子通常是指对张量中的元素进行归约操作的算子,通常用来求和、求平均值等操作,可指定某几个维度进行归约计算,也可以将所有元素归约计算为一个标量。常见的Reduce类算子有ReduceSum(求和)、ReduceMean(求平均值)、ReduceProdcut(累乘)、ReduceMax(求最大值)、ReduceMin(求最小值)、ReduceAny(or操作)、ReduceAll(and操作)。 ## Broadcast -Broadcast算子是指完成广播操作,用于处理不同形状的张量间的运算。 -例如tensorA的shape是(1,5),tensorB的shape是(3,5),若要实现tensorC = tensorA + tensorB,实际上需要将TensorA广播为shape(3,5),再进行tensorA和tensorB的相加操作。广播的过程,实际上就是将原数据在某个维度上进行复制扩展。 +Broadcast 算子用于在张量形状不一致时实现张量间的逐元素运算。 +设张量 A 的 shape 为 (1, 5),张量 B 的 shape 为 (3, 5)。为完成 C = A + B,首先需依据广播规则将 A 由 (1, 5) 扩展至 (3, 5)。该过程通过在长度为 1 的维度上复制数据,使两个张量的形状对齐,从而支持逐元素相加运算。 + +# Copyright +Copyright (c) 2025 Huawei Technologies Co., Ltd. \ No newline at end of file diff --git a/atvc/docs/2_developer_guide.md b/atvc/docs/02_developer_guide.md similarity index 72% rename from atvc/docs/2_developer_guide.md rename to atvc/docs/02_developer_guide.md index c236bd991ca57a6d0eeb9c3dbf3ac16bb9aa81c3..e494ebad5e79cb59f229104ded1df5ace0d121da 100644 --- a/atvc/docs/2_developer_guide.md +++ b/atvc/docs/02_developer_guide.md @@ -1,20 +1,20 @@ # 1 架构设计 ATVC将Vector算子开发流程中的可定制化模块抽象出了Host层和Kernel层,它们的基本概念如下:
-- Host层:在CPU Host侧执行,提供Tiling计算&策略分派的API,它根据实际数据场景帮助用户计算出较优的数据搬运等运行态参数 -- Kernel层:它是利用AscendC API搭建出的一系列Vector算子模板类,屏蔽了算子开发中用户无需感知的数据搬入搬出以及资源申请等固定模块,并将核心计算模块开放给用户定义。 +- Host层:在CPU Host侧执行,提供Tiling计算&策略分派的API,它根据实际数据场景帮助用户计算出较优的数据搬运等运行态参数。 +- Kernel层:它是利用Ascend C API搭建出的一系列Vector算子模板类,屏蔽了算子开发中用户无需感知的数据搬入搬出以及资源申请等固定模块,并将核心计算模块开放给用户定义。 -![architecture.png](./data/architecture.png)
+![architecture.png](./images/architecture.png)
基于如上分层结构,ATVC中一个核函数的实现与调用的关系如下图所示(ATVC框架提供的模板及接口用黄色表示;支持开发自定义的模块用蓝色表示): - +
## 1.1 公共数据结构 -我们将对ATVC核函数定义以及调用涉及的三个公共数据概念:算子原型的编译态参数`OpTraits`, Tiling计算的运行态参数`Param`, 模板策略的编译态参数`Policy` 分别进行介绍: +我们将对ATVC核函数定义以及调用涉及的三个公共数据概念:算子原型的编译态参数`OpTraits`, Tiling计算的运行态参数`Param`, 模板策略的编译态参数`Policy` 分别进行介绍。 ### 1.1.1 OpTraits -ATVC框架参考C++模板元编程的`stl::type_list`实现,推出了`ATVC::OpInputs`、`ATVC::OpOutputs`、`ATVC::OpTemps`的模板结构体分别用于描述算子的计算输入、计算输出、计算过程的临时资源,支持C++基础类型作为不定长模板参数传入。它们三者组成了覆盖了整个ATVC框架编译态参数`OpTraits`。`ATVC::OpTraits`的完整数据定义如下
+ATVC框架参考C++模板元编程中的`type_list`实现,提供了`ATVC::OpInputs`、`ATVC::OpOutputs`、`ATVC::OpTemps`三个模板结构体,分别用于描述算子的计算输入、计算输出和计算过程中的临时资源,支持以C++基础类型作为不定长模板参数传入。三者共同组成了ATVC框架的编译态参数`OpTraits`。`ATVC::OpTraits`的完整数据定义如下:
```cpp // atvc_opdef.h namespace ATVC { @@ -25,23 +25,23 @@ enum class ParamType { TEMP, // 临时计算资源 }; -template +template struct ParamTypes{ using types = ATVC::TypeList; static constexpr ParamType usage = paramType_; }; -template +template using OpInputs = ParamTypes; -template +template using OpOutputs = ParamTypes; -template +template using OpTemps = ParamTypes; // OpTraits的结构体定义,TempTypeList默认为空 -template> +template > struct OpTraits { using In = InTypeList; using Out = OutTypeList; @@ -64,7 +64,7 @@ using AddOpTraits = ATVC::OpTraits; // Add算 ### 1.1.2 Param ATVC框架提供了`ATVC::EleWiseParam`、`ATVC::ReduceParam`、`ATVC::BroadcastParam` 三个结构体来描述算子内部调度的Tiling数据和其他资源变量。Param 作为Host侧Tiling API的输出,它将传入ATVC框架的Kernel层算子模板,并在运行时指导算子内部模块完成数据的循环搬运与调度计算。
-以下为ElementWise类算子的`ATVC::EleWiseParam`参与计算的伪代码,详细使用流程请参考本文档的 2.1.5 Host层API: +以下为ElementWise类算子的`ATVC::EleWiseParam`参与计算的伪代码,详细使用流程请参考本文档的 [2.1.5 Host层API](#215-host层api): ```cpp // 声明运行态参数param ATVC::ElewiseParam param; @@ -93,48 +93,48 @@ ATVC::Host::CalcReduceTiling(..., &policy, ¶m); // ReduceAdapter根据policy具体值将动态参数转为静态模板参数,并传入核函数 if (policy.patternId == 1 && policy.loopCnt == 2 && policy.loopInnerCnt == 3) { - constexpr ATVC::ReducePolicy selectedPolicy = {1, 2, 3}; + constexpr ATVC::ReducePolicy selectedPolicy = {1, 2, 3}; ReduceKernel<<<...>>>(...); } // 自定义的ReduceKernel核函数内部调用了ReduceOpTemplate算子模板类, 该模板类内部实现了Policy对应的各种计算场景 -template -__global__ __aicore__ ReduceKernel(GM_ADDR x, GM_ADDR y, GM_ADDR paramDevice) { +template +__global__ __aicore__ ReduceKernel(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam param) { auto op = ATVC::Kernel::ReduceOpTemplate(); // 实例化算子Kernel模板, Policy作为模板参数传入 - op.Run(x, y, paramDevice); // param作为运行态参数传入 + op.Run(x, y, ¶m); // param作为运行态参数传入 } ``` # 2 利用ATVC完成算子开发 -## 2.1 element-wise算子开发 -ATVC框架提供的element-wise算子模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): -![elewise_dataflow.png](data/elewise_dataflow.png) +## 2.1 Elementwise算子开发 +ATVC框架提供的Elementwise算子模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): +![elewise_dataflow.png](images/elewise_dataflow.png) -不同计算原理的element-wise算子在Kernel内部的数据搬运模块并无区别,因此EleWise的数据交互不涉及Policy的不同Kernel模板实现。 +不同计算原理的Elementwise算子在Kernel内部的数据搬运模块并无区别,因此Elementwise的数据交互不涉及Policy的不同Kernel模板实现。 ### 2.1.1 Components -根据element-wise算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义AscendC的ElementWise算子: -![elewise_components.png](data/elewise_components.png) -自定义element-wise算子需按照以下顺序完成模块之间的组装: +根据Elementwise算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义Ascend C的ElementWise算子: +![elewise_components.png](images/elewise_components.png) +自定义Elementwise算子需按照以下顺序完成模块之间的组装: 1. 定义计算模板 -2. 将计算模板类传入`Kernel`层算子模板完成核函数功能实现; +2. 将计算模板类传入Kernel层算子模板完成核函数功能实现; 3. 
定义Kernel层算子入口API,内部实例化计算模板类; -下面将以Sinh算子 $y = \frac{\exp(x) - \exp(-x)}{2}$ 的实现为样例,按照组成Kernel的顺序介绍element-wise算子开发的流程。 +下面将以Sinh算子 $y = \frac{\exp(x) - \exp(-x)}{2}$ 的实现为样例,按照组成Kernel的顺序介绍Elementwise算子开发的流程。 ### 2.1.2 计算模板 -计算模板是用户必须在element-wise 算子实现过程中完成的一类特殊模板类的定义。模板类无需关注数据如何从GM搬运到UB,只需重载`operator()`的公有接口,并在该仿函数内部实现AscendC::LocalTensor之间的计算逻辑。在Kernel层的组装阶段,计算模板将作为模板参数传入`ATVC::Kernel::EleWiseOpTemplate`,并在数据计算阶段被调用。下方为计算模板实现Sinh计算逻辑的代码样例: +计算模板是用户必须在Elementwise 算子实现过程中完成的一类特殊模板类的定义。模板类无需关注数据如何从GM搬运到UB,只需重载`operator()`的公有接口,并在该仿函数内部实现`AscendC::LocalTensor`之间的计算逻辑。在Kernel层的组装阶段,计算模板将作为模板参数传入`ATVC::Kernel::EleWiseOpTemplate`,并在数据计算阶段被调用。下方为计算模板实现Sinh计算逻辑的代码样例: ```cpp #include "atvc.h" // 包含所有模板及API的总入口头文件 // 传入编译态参数ATVC::OpTraits -template +template // 开发自定义函数名/类名 struct SinhComputeFunc { // DataType模板参数,根据实际数据类型个数填写 - template + template // 重载operator公有接口,提供给Kernel::EleWiseOpTemplate调用 __aicore__ inline void operator()(AscendC::LocalTensor x, AscendC::LocalTensor y, AscendC::LocalTensor tempBuffer1, AscendC::LocalTensor tempBuffer2) { // 开发调用AscendC API自行实现计算仿函数 @@ -153,10 +153,10 @@ struct SinhComputeFunc { 3. 开发定义的`operator()`仿函数的输入参数类型支持`AscendC::LocalTensor`以及C++其他基础数据类型。形式参数需按照`ATVC::OpInputs<>`,`ATVC::OpOutputs<>`, `ATVC::OpTemps<>`声明的顺序填入,其他标量参数放在最后,根据用户计算场景按需传入。 -### 2.1.3 内置EleWise算子模板 +### 2.1.3 内置Elementwise算子模板 `ATVC::Kernel::EleWiseOpTemplate`为ATVC框架提供的内置ElementWise基本算子类,它实现了一套算子数据的搬运搬出、资源分配和释放的算子流程。它需要计算模板类作为模板参数传入来完成实例化。核函数通过调用它完成整套计算逻辑:1. 
资源初始化; 2.将数据从GM搬运至UB; 3.按`OpTraits`的输入、输出、临时资源描述、其他标量的顺序传入计算模板类的仿函数完成数据的基块计算; 4.将结果从UB搬出至GM。 -下方为`ATVC::Kernel::EleWiseOpTemplate`模板类的外部接口介绍,完整模板类定义请参考`atvc/include/elewise/elewise_op_template.h`。 +下方为`ATVC::Kernel::EleWiseOpTemplate`模板类的外部接口介绍,完整模板类定义请参考[`atvc/include/elewise/elewise_op_template.h`](../include/elewise/elewise_op_template.h)。 ```cpp // 开发定义的计算模板类 template @@ -166,13 +166,13 @@ public: // 按照输入、输出、EleWiseParam、其他标量的顺序传入 // 内部根据EleWiseParam进行数据调度并调用EleWiseOpTemplate完成计算后搬出到GM - template + template __aicore__ inline void Run(Args&&... args) { // // 完成变长参数的解析和数据调度计算 // } -} +}; ``` @@ -187,20 +187,20 @@ using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs> elewiseTemplate; // 调用EleWiseOpTemplate的Run接口传入输入x, 输出y,Host::CalcEleWiseTiling API的输出param -elewiseTemplate.Run(x, y, param); +elewiseTemplate.Run(x, y, ¶m); } ``` ### 2.1.4 核函数定义 -在element-wise开发流程中,用户需要自行定义核函数接口。核函数内部可依赖`ATVC::Kernel::EleWiseOpTemplate`完成功能实现。 +在Elementwise开发流程中,用户需要自行定义核函数接口。核函数内部可依赖`ATVC::Kernel::EleWiseOpTemplate`完成功能实现。 基于`2.1.2`和`2.1.3`的样例代码,Kernel层的自定义核函数代码样例如下: ```cpp #include "atvc.h" // 2.1.2 章节中的SinhComputeFunc 定义 -// template +// template // struct SinhComputeFunc { // ... // } @@ -208,21 +208,21 @@ using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs -__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +template +__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核 auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 + op.Run(x, y, ¶m); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 } ```
-利用ATVC框架开发element-wise算子的过程中,Kernel层的核函数定义必须遵从以下约束: +利用ATVC框架开发Elementwise算子的过程中,Kernel层的核函数定义必须遵从以下约束: 1. 核函数必须预留一个GM_ADDR类型的形参用于传入`ATVC::EleWiseParam`运行态参数; 2. 核函数内部必须加入`KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);`这段代码标注算子执行时只启动Vector核; @@ -231,24 +231,24 @@ __global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) ```cpp using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; -extern "C" __global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +extern "C" __global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核 auto op = ATVC::Kernel::EleWiseOpTemplate>(); // 模板参数传入固定的SinhOpTraits - op.Run(x, y, param); + op.Run(x, y, ¶m); } ``` ### 2.1.5 Host层API -ATVC的Host层提供了element-wise算子的通用Tiling算法API `ATVC::Host::CalcEleWiseTiling`,它根据算子计算原型`ATVC::OpTraits`以及数据大小计算出包含`ATVC::EleWiseTilingData`的运行态参数`ATVC::EleWiseParam`。`ATVC::EleWiseParam`在运行时将参与模板算子数据搬运从而实现较优计算。
`ATVC::EleWiseTilingData`和`ATVC::EleWiseParam`的数据结构定义如下: +ATVC的Host层提供了Elementwise算子的通用Tiling算法API `ATVC::Host::CalcEleWiseTiling`,它根据算子计算原型`ATVC::OpTraits`以及数据大小计算出包含`ATVC::EleWiseTilingData`的运行态参数`ATVC::EleWiseParam`。`ATVC::EleWiseParam`在运行时将参与模板算子数据搬运从而实现较优计算。
`ATVC::EleWiseTilingData`和`ATVC::EleWiseParam`的数据结构定义如下: ```cpp namespace ATVC{ struct EleWiseTilingData { uint32_t tailBlockCnt; // 需要额外执行一次循环的核的数量 - uint32_t tailElemCnt; // 尾块元素数量 - uint32_t numPerBlock; // 每个核需计算的总元素数量 - uint32_t tiledCnt; // 每次搬入搬出的元素个数 - uint32_t blockNum; // 执行核数 + uint32_t tailElemCnt; // 尾块元素数量 + uint32_t numPerBlock; // 每个核需计算的总元素数量 + uint32_t tiledCnt; // 每次搬入搬出的元素个数 + uint32_t blockNum; // 执行核数 }; struct EleWiseParam { @@ -300,23 +300,15 @@ if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { 通过ATVC框架实现的完整SinhCustom算子定义&调用[代码](../examples/sinh_custom/sinh_custom.cpp)如下: ```cpp -/** - * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. - * - * This file is a part of the CANN Open Software. - * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. 
- */ - #include #include #include +#include +#include #include #include "acl/acl.h" -#include "atvc.h" +#include "elewise/elewise_host.h" +#include "elewise/elewise_device.h" #define CHECK_ACL(x) \ do { \ @@ -331,12 +323,12 @@ static constexpr float REL_TOL = 1e-3f; static constexpr float ABS_TOL = 1e-5f; // 判断两个浮点数是否足够接近 -bool IsClose(float a, float b) { +bool IsClose(float a, float b) +{ const float eps = 1e-40f; // 防止分母为零 float diff = std::abs(a - b); return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -} // 描述算子的输入输出以及临时计算资源 using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; @@ -348,30 +340,76 @@ struct SinhComputeFunc { // DataType模板参数,根据实际数据类型个数填写 template // 重载operator公有接口,提供给`ATVC::Kernel::EleWiseOpTemplate`调用 - __aicore__ inline void operator()(AscendC::LocalTensor x, AscendC::LocalTensor y, AscendC::LocalTensor tempBuffer1, AscendC::LocalTensor tempBuffer2) { + __aicore__ inline void operator()(AscendC::LocalTensor x, + AscendC::LocalTensor y, + AscendC::LocalTensor tempBuffer1, + AscendC::LocalTensor tempBuffer2) + { // 开发调用AscendC Api自行实现计算仿函数 uint32_t tiledCnt = y.GetSize(); // 进行单次基块计算的元素个数 AscendC::Muls(tempBuffer1, x, static_cast(-1), tiledCnt); // tempBuffer1 = -1 * x AscendC::Exp(tempBuffer1, tempBuffer1, tiledCnt); // tempbuffer1 = exp(-x) AscendC::Exp(tempBuffer2, x, tiledCnt); // tempbuffer2 = exp(x) AscendC::Sub(y, tempBuffer2, tempBuffer1, tiledCnt); // y = exp(x) - exp(-x) - AscendC::Muls(y, y, static_cast(0.5), tiledCnt); // y = (e^(x) - e^(-x)) / 2 + AscendC::Muls(y, y, static_cast(0.5), tiledCnt); // y = (e^(x) - e^(-x)) / 2 } }; +void InitializeData(int32_t eleNum, std::vector &inputX, std::vector &golden) +{ + std::random_device rd; + std::mt19937 gen(rd()); + std::uniform_real_distribution dis(1.0f, 10.0f); + + for (int i = 0; i < eleNum; ++i) { + inputX[i] = dis(gen); + golden[i] = std::sinh(inputX[i]); + } +} + +bool VerifyResults(const std::vector &golden, const std::vector 
&output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} + +void InitializeACL(aclrtContext &context, aclrtStream &stream, int32_t deviceId) +{ + CHECK_ACL(aclInit(nullptr)); + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + CHECK_ACL(aclrtCreateStream(&stream)); +} + +void CleanACL(aclrtStream &stream, int32_t deviceId) +{ + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); +} +} /* * 该函数为SinhCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * x Device上的gm地址,指向SinhCustom算子第一个输入 + * y Device上的gm地址,指向SinhCustom算子第一个输出 + * param 指向运行态ATVC::EleWiseParam数据 */ template -__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核 auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 + op.Run(x, y, ¶m); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 } int main() @@ -380,18 +418,11 @@ int main() int32_t eleNum = 8 * 2048; size_t inputByteSize = static_cast(eleNum) * sizeof(float); size_t outputByteSize = static_cast(eleNum) * sizeof(float); - - std::random_device rd; - std::mt19937 gen(rd()); - std::uniform_real_distribution dis(1.0f, 10.0f); std::vector inputX(eleNum); std::vector golden(eleNum); + InitializeData(eleNum, inputX, golden); - for (int i = 0; i < eleNum; ++i) { - inputX[i] = (dis(gen)); - golden[i] = std::sinh(inputX[i]); - } 
ATVC::EleWiseParam param; // 计算输入为8*2048个float元素的sinh算子的运行态参数param @@ -400,19 +431,15 @@ int main() return -1; }; // 初始化Acl资源与数据 - CHECK_ACL(aclInit(nullptr)); aclrtContext context; - int32_t deviceId = 0; - CHECK_ACL(aclrtSetDevice(deviceId)); - CHECK_ACL(aclrtCreateContext(&context, deviceId)); aclrtStream stream = nullptr; - CHECK_ACL(aclrtCreateStream(&stream)); + int32_t deviceId = 0; + InitializeACL(context, stream, deviceId); uint8_t *yHost; uint8_t *xDevice; uint8_t *yDevice; uint8_t *paramDevice; - CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); @@ -422,8 +449,10 @@ int main() // 将tiling计算的运行时参数EleWiseParam param传到Device侧 auto elementParamSize = sizeof(param); CHECK_ACL(aclrtMalloc((void**)¶mDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, reinterpret_cast(¶m), elementParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); - + CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, + reinterpret_cast(¶m), elementParamSize, + ACL_MEMCPY_HOST_TO_DEVICE)); + // 调用自定义的Kernel API, <<<>>>的BlockNum参数可通过param的TilingData获取 SinhCustom<<>>(xDevice, yDevice, paramDevice); @@ -437,15 +466,10 @@ int main() CHECK_ACL(aclrtFree(paramDevice)); CHECK_ACL(aclrtFreeHost(yHost)); - CHECK_ACL(aclrtDestroyStream(stream)); - CHECK_ACL(aclrtResetDevice(deviceId)); - CHECK_ACL(aclFinalize()); + CleanACL(stream, deviceId); - for (int32_t i = 0; i < eleNum; i++) { - if (!IsClose(golden[i], outputY[i])) { - printf("Accuracy verification failed! 
The expected value of element in index [%d] is %f, but actual value is %f.\n", i, golden[i], outputY[i]); - return -1; - } + if (!VerifyResults(golden, outputY)) { + return -1; } printf("Accuracy verification passed.\n"); return 0; @@ -454,12 +478,12 @@ int main() ## 2.2 Reduce算子开发 ATVC框架提供的Reduce算子模板类的模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): -![reduce_dataflow.png](data/reduce_dataflow.png) +![reduce_dataflow.png](images/reduce_dataflow.png) Reduce模板算子内部根据计算的数据大小、shape、Reduce axis轴完成了不同计算调度的代码实现,ATVC将各种计算调度场景抽象为`ATVC::ReducePolicy`。在算子调用阶段,分派策略API可根据Tiling API计算出的`ATVC::ReducePolicy`转化为编译态参数,结合计算模板来实例化`ATVC::Kernel::ReduceOpTemplate`算子模板类。 ### 2.2.1 Components 根据Reduce算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义Reduce算子: -![reduce_components.png](data/reduce_components.png) +![reduce_components.png](images/reduce_components.png) 自定义Reduce算子需按照以下顺序完成模块之间的组装: 1. 自定义计算模板/使用框架内置计算模板 2. 将计算模板传入`Kernel`层模板算子完成核函数功能实现; @@ -478,7 +502,7 @@ Reduce类的计算模板涉及多核之间的数据结果同步以及核内分 #include "common/kernel_utils.h" #include "reduce/common/patterns.h" -#include "reduce/reduce_utils/reduce_block_aux_util.h" +#include "reduce/utils/reduce_block_aux_util.h" namespace ATVC { @@ -520,17 +544,17 @@ public: ``` Reduce计算模板类将在数据计算阶段被`ReduceOpTemplate`算子模板调用,因此Reduce计算模板类的实现必须遵从以下约束: -- 该模板类在实例化时固定传入ATVC::OpTraits类型的结构体作为模板参数,如` ATVC::OpTraits,ATVC::OpOutputs`。 +- 该模板类在实例化时固定传入ATVC::OpTraits类型的结构体作为模板参数,如`ATVC::OpTraits,ATVC::OpOutputs`。 - 开发必须完成以下公有API的内部实现: - 1. 计算单数据基块的Reduce结果 \_\_aicore\_\_ inline void Compute(...) - 2. 计算单UB内不同数据基块的计算结果 \_\_aicore\_\_ inline void UpdateCache(...); - 3. 计算多核之间&同一核内的多次UB结果 \_\_aicore\_\_ inline void ReduceBetweenUB(...) - 4. 返回非对齐场景不参与计算的尾部数据的填充值 \_\_aicore\_\_ inline U GetPaddingValue() + 1. 计算单数据基块的Reduce结果 `\_\_aicore\_\_ inline void Compute(...)` + 2. 计算单UB内不同数据基块的计算结果 `\_\_aicore\_\_ inline void UpdateCache(...)` + 3. 计算多核之间&同一核内的多次UB结果 `\_\_aicore\_\_ inline void ReduceBetweenUB(...)` + 4. 
返回非对齐场景不参与计算的尾部数据的填充值 `\_\_aicore\_\_ inline U GetPaddingValue()` ### 2.2.3 内置Reduce算子模板 `ATVC::Kernel::ReduceOpTemplate`是一套基本的Reduce算子类,它实现了一套算子数据的搬运搬出、资源分配和释放的流程。Kernel层的算子模板需要计算模板类作为模板参数传入来完成实例化。在调用阶段,算子类将按照固定参数顺序调用计算模板类的对应接口,完成数据的计算。 -相比EleWise算子模板不同的是,ReduceOpTemplate内置了不同场景的Reduce实现,并在编译时通过`ATVC::ReducePolicy`类型的结构体来实现实例化。ReduceOpTemplate内部将根据模板参数决定数据将由哪类具体的模板实例计算。`ATVC::ReducePolicy`的数据定义如下: +相比Elementwise算子模板不同的是,ReduceOpTemplate内置了不同场景的Reduce实现,并在编译时通过`ATVC::ReducePolicy`类型的结构体来实现实例化。ReduceOpTemplate内部将根据模板参数决定数据将由哪类具体的模板实例计算。`ATVC::ReducePolicy`的数据定义如下: ```cpp struct ReducePolicy { @@ -542,21 +566,20 @@ struct ReducePolicy { 下方为`ATVC::Kernel::ReduceOpTemplate`模板类的外部接口介绍,完整模板类定义请参考`atvc/include/reduce/reduce_op_template.h`。 ```cpp -template +template class ReduceOpTemplate { public: __aicore__ inline ReduceOpTemplate(){}; // 按照输入、输出、EleWiseParam、其他标量的顺序传入 // 内部根据EleWiseParam进行数据调度并调用ReduceOpTemplate完成计算后搬出到GM - template - __aicore__ inline void Run(GM_ADDR x, GM_ADDR y, GM_ADDR param) { + template + __aicore__ inline void Run(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam* param) { // // Reduce类算子Run接口按输入、输出、运行态参数param顺序传入 // } -} +}; ``` @@ -566,13 +589,13 @@ public: // ReduceSum算子的描述:一个输入,一个输出,类型均为float using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; -template -__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, GM_ADDR reduceParam) +template +__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入, Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, Policy>(); - op.Run(x, y, reduceParam); + op.Run(x, y, &reduceParam); } ``` @@ -587,16 +610,16 @@ using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs -__global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +template +__global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam param) { 
KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV); // 使用了多核控制指令,设置算子执行时只启动Vector核 auto op = ATVC::Kernel::ReduceOpTemplate, SelectPolicy>(); - op.Run(x, y, param); + op.Run(x, y, ¶m); } ```
@@ -604,7 +627,7 @@ __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) Reduce算子开发场景下,核函数定义必须遵从以下约束: 1. 核函数须预留一个GM_ADDR类型的形参用于传入`ATVC::ReduceParam`运行态参数; 2. 核函数须加入`KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0);`这段代码显示标注算子类型; -3. 核函数须实例化`ATVC::Kernel::ReduceOpTemplate`变量并调用它的`Run(GM_ADDR x, GM_ADDR y, GM_ADDR param)`接口来实现数据的调度运算; +3. 核函数须实例化`ATVC::Kernel::ReduceOpTemplate`变量并调用它的`Run(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam* param)`接口来实现数据的调度运算; ### 2.2.5 Host层API @@ -693,33 +716,28 @@ static constexpr ATVC::ReducePolicy REDUCE_POLICY3 { ATVC::AR_PATTERN::AR, ATVC: // ReduceOpAdapter函数定义 // 负责Reduce类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 -template +template void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam ¶m, ATVC::ReducePolicy &policy, aclrtStream& stream) { // 申请临时空间workspace,并将其与ReduceTilingData一同传到Device侧 - uint8_t *paramDevice; uint8_t *workspaceDevice; CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); param.workspaceAddr = reinterpret_cast(workspaceDevice); - auto reduceParamSize = sizeof(param); - CHECK_ACL(aclrtMalloc((void**)¶mDevice, reduceParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, reinterpret_cast(¶m), reduceParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的ReducePolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::REDUCE_POLICY0) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); }else if (policy == ATVC::REDUCE_POLICY1) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY2) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY3) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else { printf("[ERROR] Cannot find any matched policy.\n"); } // 流同步后释放申请的param内存 CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtFree(workspaceDevice)); - 
CHECK_ACL(aclrtFree(paramDevice)); } using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; @@ -747,25 +765,17 @@ using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs #include #include +#include +#include #include #include "acl/acl.h" -#include "atvc.h" +#include "reduce/reduce_host.h" +#include "reduce/reduce_device.h" #define CHECK_ACL(x) \ do { \ @@ -780,29 +790,45 @@ static constexpr float REL_TOL = 1e-3f; static constexpr float ABS_TOL = 1e-5f; // 判断两个浮点数是否足够接近 -bool IsClose(float a, float b) { +bool IsClose(float a, float b) +{ const float eps = 1e-40f; // 防止分母为零 float diff = std::abs(a - b); return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -} // ReduceSum算子的描述:一个输入,一个输出,类型均为float using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} +} + /* * 该函数为ReduceCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * reduceParam Device上的gm地址,指向运行态ATVC::ReduceParam数据 + * x Device上的gm地址,指向ReduceCustom算子第一个输入 + * y Device上的gm地址,指向ReduceCustom算子第一个输出 + * reduceParam 指向运行态ATVC::ReduceParam数据 */ template -__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, GM_ADDR reduceParam) +__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, Policy>(); - op.Run(x, y, reduceParam); + op.Run(x, y, &reduceParam); } // 负责Reduce类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 @@ -816,11 +842,13 @@ void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam ¶m, ATVC::Red param.workspaceAddr = reinterpret_cast(workspaceDevice); auto reduceParamSize = sizeof(param); CHECK_ACL(aclrtMalloc((void**)¶mDevice, reduceParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, reinterpret_cast(¶m), reduceParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, + reinterpret_cast(¶m), reduceParamSize, + ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的ReducePolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::REDUCE_POLICY0) { ReduceCustom<<>>(x, y, paramDevice); - }else if (policy == ATVC::REDUCE_POLICY1) { + } else if (policy == ATVC::REDUCE_POLICY1) { ReduceCustom<<>>(x, y, paramDevice); } else if (policy == ATVC::REDUCE_POLICY2) { ReduceCustom<<>>(x, y, paramDevice); @@ -862,6 +890,8 @@ void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam ¶m, ATVC::Red ReduceCustom<<>>(x, y, paramDevice); } else if (policy == ATVC::REDUCE_POLICY21) { ReduceCustom<<>>(x, y, paramDevice); + } 
else if (policy == ATVC::REDUCE_POLICY22) { + ReduceCustom<<>>(x, y, paramDevice); } else { printf("[ERROR] Cannot find any matched policy.\n"); } @@ -881,7 +911,7 @@ int32_t main(int32_t argc, char* argv[]) std::vector shape{8, 1024}; // 测试输入shape std::vector inputX(eleNum, 1.0f); std::vector golden(outEleNum, 8.0f); - + printf("Generate golden data successfully.\n"); // 初始化Acl资源 CHECK_ACL(aclInit(nullptr)); aclrtContext context; @@ -904,7 +934,7 @@ int32_t main(int32_t argc, char* argv[]) ATVC::ReducePolicy policy = {-1, -1, -1}; // Reduce运行态参数,负责映射最适合的Reduce模板实现 // Host侧调用Tiling API完成相关运行态参数的运算 if (!ATVC::Host::CalcReduceTiling(shape, dim, &policy, ¶m)) { - printf("Reduce tiling error."); + printf("Reduce tiling error.\n"); return -1; }; @@ -924,12 +954,10 @@ int32_t main(int32_t argc, char* argv[]) CHECK_ACL(aclrtResetDevice(deviceId)); CHECK_ACL(aclFinalize()); - for (int32_t i = 0; i < outEleNum; i++) { - if (!IsClose(golden[i], outputY[i])) { - printf("Accuracy verification failed! The expected value of element in index [%d] is %f, but actual value is %f.\n", i, golden[i], outputY[i]); - return -1; - } + if (!VerifyResults(golden, outputY)) { + return -1; } + printf("Accuracy verification passed.\n"); return 0; } @@ -937,11 +965,11 @@ int32_t main(int32_t argc, char* argv[]) ## 2.3 Broadcast算子开发 ATVC框架提供的Broadcast算子模板类的模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): -![broadcast_dataflow.png](data/broadcast_dataflow.png) +![broadcast_dataflow.png](images/broadcast_dataflow.png) Broadcast模板算子内部根据数据类型、输入/输出shape完成某个轴上数据扩充的功能,ATVC将各种计算调度场景抽象为`ATVC::BroadcastPolicy`。在算子调用阶段,分派策略API可根据Tiling API计算出的`ATVC::BroadcastPolicy`转化为编译态参数,结合计算模板来实例化`ATVC::Kernel::BroadcastOpTemplate`算子模板类。 ### 2.3.1 Components 根据Broadcast算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义Broadcast算子: -![broadcast_components.png](data/broadcast_components.png) +![broadcast_components.png](images/broadcast_components.png) 自定义Broadcast算子需按照以下顺序完成模块之间的组装: 1. 自定义计算模板/使用框架内置计算模板 2. 
将计算模板传入`Kernel`层模板算子完成核函数功能实现; @@ -959,7 +987,7 @@ Broadcast计算模板是指Broadcast类算子在UB上实现将A轴的数据复 #include "broadcast/common/broadcast_common.h" namespace ATVC { -template +template class BroadcastCompute { public: using inputDTypeList = typename OpTraits::In::types; @@ -988,7 +1016,7 @@ Broadcast计算模板类将在数据计算阶段被`BroadcastOpTemplate`算子 * AB场景的计算:输入`src`是一个shape为(dimA, 1)的Tensor,需要将数据扩充到`dst`上,dst的shape是(dimA, dimB); * BA场景的计算:输入`src`是一个shape为(1, dimA)的Tensor,需要将src数据扩充到`dst`上,dst的shape是(dimB, dimA); -- 该模板类在实例化时固定传入ATVC::OpTraits类型的结构体作为模板参数,如` ATVC::OpTraits,ATVC::OpOutputs`。 +- 该模板类在实例化时固定传入`ATVC::OpTraits`类型的结构体作为模板参数,如` ATVC::OpTraits,ATVC::OpOutputs`。 ### 2.3.3 内置Broadcast算子模板 `ATVC::Kernel::BroadcastOpTemplate`是一套基本的Broadcast算子类,它实现了一套算子数据的搬运搬出、资源分配和释放的流程。Kernel层的算子模板需要计算模板类作为模板参数传入来完成实例化。在调用阶段,Broadcast算子模板将按照固定参数顺序调用计算模板类的`Compute`接口,完成数据的计算。 @@ -1003,12 +1031,12 @@ struct BroadcastPolicy { }; ``` -下方为`ATVC::Kernel::BroadcastOpTemplate`模板类的外部接口介绍,完整模板类定义请参考`atvc/include/broadcast/broadcast_op_template.h`。 +下方为`ATVC::Kernel::BroadcastOpTemplate`模板类的外部接口介绍,完整模板类定义请参考[`atvc/include/broadcast/broadcast_op_template.h`](../include/broadcast/broadcast_op_template.h)。 ```cpp #ifndef ATVC_BROADCAST_OP_TEMPLATE_H #define ATVC_BROADCAST_OP_TEMPLATE_H #include "common/const_def.h" -#include "broadcast/broadcast_utils/broadcast_buf_pool.h" +#include "broadcast/utils/broadcast_buf_pool.h" namespace ATVC { namespace Kernel { template @@ -1023,14 +1051,14 @@ public: @param dst: 输出数据的gm指针 @broadcastParam: broadcast的动态参数,包含tiling data, workspace等 */ - __aicore__ inline void Run(GM_ADDR src, GM_ADDR dst, GM_ADDR broadcastParam) + __aicore__ inline void Run(GM_ADDR src, GM_ADDR dst, ATVC::BroadcastParam* broadcastParam) { this->Init(src, dst, broadcastParam); this->Process(); } - AscendC::GlobalTensor srcGlobal; - AscendC::GlobalTensor dstGlobal; + AscendC::GlobalTensor srcGlobal_; + AscendC::GlobalTensor dstGlobal_; BroadcastCompute compute_; __gm__ BroadcastParam *param_; }; @@ 
-1046,14 +1074,13 @@ public: // BroadcastTo算子的描述:一个输入,一个输出,类型均为float using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; -template -__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam) +template +__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { - AscendC::printf("BroadcastCustom\n"); KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + op.Run(x, y, &broadcastParam); } ``` @@ -1069,18 +1096,17 @@ using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs /* * 该函数为BroadcastCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * broadcastParam Device上的gm地址,指向运行态ATVC::BroadcastParam数据 + * x Device上的gm地址,指向BroadcastTo算子第一个输入 + * y Device上的gm地址,指向BroadcastTo算子第一个输出 + * broadcastParam 指向运行态ATVC::BroadcastParam数据 */ -template -__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam) +template +__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { - AscendC::printf("BroadcastCustom\n"); KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + op.Run(x, y, &broadcastParam); } ```
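上文`BroadcastCompute`计算模板描述的AB/BA两种数据扩充语义,可以用下面这段自包含的Host侧示意代码直观理解(仅为帮助理解的草图:`BroadcastAB`、`BroadcastBA`为说明用的假设函数,并非ATVC提供的接口):

```cpp
#include <cstddef>
#include <vector>

// 示意代码:AB场景将shape为(dimA, 1)的src扩充为(dimA, dimB)的dst
std::vector<float> BroadcastAB(const std::vector<float> &src, int dimA, int dimB)
{
    std::vector<float> dst(static_cast<size_t>(dimA) * dimB);
    for (int a = 0; a < dimA; ++a) {
        for (int b = 0; b < dimB; ++b) {
            dst[a * dimB + b] = src[a]; // A轴上的每个元素沿B轴复制dimB次
        }
    }
    return dst;
}

// 示意代码:BA场景将shape为(1, dimA)的src扩充为(dimB, dimA)的dst
std::vector<float> BroadcastBA(const std::vector<float> &src, int dimA, int dimB)
{
    std::vector<float> dst(static_cast<size_t>(dimB) * dimA);
    for (int b = 0; b < dimB; ++b) {
        for (int a = 0; a < dimA; ++a) {
            dst[b * dimA + a] = src[a]; // 整行数据沿B轴复制dimB次
        }
    }
    return dst;
}
```

Kernel侧的计算模板在UB上完成的就是与之等价的数据扩充,只是以分块、分核的方式进行。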
@@ -1088,7 +1114,7 @@ __global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadca
Broadcast算子开发场景下,核函数定义必须遵从以下约束:
1. 核函数须预留一个GM_ADDR类型的形参用于传入`ATVC::BroadcastParam`运行态参数;
2. 核函数须加入`KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);`这段代码显式标注算子类型;
-3. 核函数须实例化`ATVC::Kernel::BroadcastOpTemplate`变量,实例化时需传入对应的计算实现模板类`ATVC::BroadcastCompute`,并调用它的`Run(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam)`接口来实现数据的调度运算;
+3. 核函数须实例化`ATVC::Kernel::BroadcastOpTemplate`变量,实例化时需传入对应的计算实现模板类`ATVC::BroadcastCompute`,并调用它的`Run(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam* broadcastParam)`接口来实现数据的调度运算;

### 2.3.5 Host层API
#### 2.3.5.1 CalcBroadcastTiling
@@ -1128,7 +1154,7 @@ struct BroadcastParam {
```cpp
namespace ATVC {
namespace Host {
-template
+template
bool CalcBroadcastTiling(std::vector shapeIn, std::vector shapeOut, BroadcastPolicy* policy, BroadcastParam* param)
{
    using inputDTypeList = typename OpTraits::In::types;
@@ -1170,7 +1196,7 @@ int32_t main(int32_t argc, char* argv[])
```cpp
// 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数
-template
+template
void BroadcastOpAdapter(uint8_t* x, uint8_t* y, ATVC::BroadcastParam &param, ATVC::BroadcastPolicy &policy, aclrtStream& stream)
{
    // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧
@@ -1216,7 +1242,7 @@ int32_t main(int32_t argc, char* argv[])

### 2.3.6 完整样例

-通过ATVC框架实现的完整BroadcastCustom算子[样例代码](../examples/broadcast_to/broadcast_to.cpp)如下:
+通过ATVC框架实现的完整BroadcastCustom算子[样例代码](../examples/broadcast_to/broadcast_to.cpp)
```cpp
#include
#include
@@ -1252,23 +1278,22 @@ using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs
/*
 * 该函数为BroadcastCustom算子核函数入口
- * x Device上的gm地址,指向Add算子第一个输入
- * y Device上的gm地址,指向Add算子第一个输出
- * broadcastParam Device上的gm地址,指向运行态ATVC::BroadcastParam数据
+ * x Device上的gm地址,指向BroadcastTo算子第一个输入
+ * y Device上的gm地址,指向BroadcastTo算子第一个输出
+ * broadcastParam 指向运行态ATVC::BroadcastParam数据
 */
-template
-__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam)
+template
+__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { - AscendC::printf("BroadcastCustom\n"); KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + op.Run(x, y, &broadcastParam); } // 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 -template +template void BroadcastOpAdapter(uint8_t* x, uint8_t* y, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy, aclrtStream& stream) { // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 @@ -1355,4 +1380,210 @@ int32_t main(int32_t argc, char* argv[]) printf("Accuracy verification passed.\n"); return 0; } -``` \ No newline at end of file +``` + +# 3 ATVC的调试调优功能 +为了用户在使用ATVC进行算子开发时能快速进行精度调试和性能调优,ATVC支持多种调试调优能力。 +## 3.1 OpTraits校验接口 +用户可通过`DebugCheck()`接口校验不同模板的OpTraits功能, 接口在Host侧调用,无需额外的开关限制,接口定义如下: +```cpp +namespace ATVC { +namespace Host { +template +bool DebugCheck() +} +} +``` +其中,模板参数`OpTraits`是用户定义的待校验的输入输出描述信息, 模板参数`templateType`是校验规则分类的标识, 定义如下: +```cpp +enum class TemplateType { + ELE_WISE, // ElementWise模板的校验类型 + REDUCE, // Reduce模板的校验类型 + BROADCAST, // Broadcast模板的校验类型 +}; +``` +接口使用示例: +```cpp +using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; + +ATVC::Host::DebugCheck(); +``` +完整的DebugCheck调用接口样例可参考tanh_grad算子[样例代码](../examples/tanh_grad/tanh_grad.cpp)。 +## 3.2 使用调试调优模式运行算子 +样例执行脚本run_test.sh支持可选入参`--run-mode`进行不同调试调优运行模式的选择。 +当前支持`debug_print`和`profiling`两种模式。 +- `--run-mode=debug_print`:DFX信息打印模式,打开kernel侧的模板内置关键节点的信息打印和异常退出时的打印功能。 +- `--run-mode=profiling`:Profiling性能采集模式,运行时打开profiling性能数据采集功能。 +- 未设置`--run-mode`:默认模式,正常上板,无kernel侧的dfx信息打印, 未开启profiling性能采集功能。 +## 3.2.1 DFX信息打印模式 +通过运行run_test.sh脚本时加上可选参数`--run-mode=debug_print`打开本功能。 +DFX信息打印格式按照 [日志级别(`ERROR`/`INFO`)]:[`ATVC`][`Module`](可选:[`CopyIn`/`CopyOut`等])的标准进行打印。 +- 日志级别: ERROR是异常打印信息,INFO是模板内部重要信息打印 +- `ATVC`: 标识是ATVC模板库内置的DFX信息打印 +- `Module`: 
标识是哪个模块的信息打印,例如:`EleWise`、`Reduce`、`Broadcast`、`Common`等模块。
+- 可选子模块: 用于部分`Module`涉及多个子模块的情况,可选择增加子模块信息,细化DFX信息。
+模板内部提供的DFX信息打印接口定义及使用样例如下所示。普通算子开发用户无需关注该接口,只有在需要修改或扩展模板功能的场景下才会用到。
+```cpp
+// 接口定义
+namespace ATVC {
+namespace Kernel {
+template
+__aicore__ inline void DebugPrintf(__gm__ const char* fmt, Args&&... args);
+}
+}
+// 调用示例
+ATVC::Kernel::DebugPrintf("[ERROR]: [ATVC][EleWise] Input Count can not be 0!\n");
+ATVC::Kernel::DebugPrintf("[INFO]: [ATVC][EleWise][CopyIn] Offset is %u, copy count is %u.\n", curCoreStartCnt_ + offsetCnt_, calcCnt_);
+```
+
+## 3.2.2 开启Profiling性能调优功能
+通过运行run_test.sh脚本时加上可选参数`--run-mode=profiling`打开本功能。
+为了增强Profiling采集性能数据的稳定性,建议用户在开启profiling时重复多次调用kernel,这样可以一次性采集多次上板的性能数据,消除抖动。
+```cpp
+TanhGrad<<>>(dyDevice, yDevice, zDevice, paramDevice);
+#if ATVC_DEBUG_MODE == 2 // ATVC_DEBUG_MODE == 2: open profiling
+    for (int32_t i = 0; i < 19; i++) { // 19: run kernel 1 + 19 times for profiling
+        TanhGrad<<>>(dyDevice, yDevice, zDevice, paramDevice);
+    }
+#endif
+```
+其中`ATVC_DEBUG_MODE`是`run-mode`在不同模式下的内部宏定义映射。`ATVC_DEBUG_MODE == 2`是`--run-mode=profiling`的内部映射,用户无需关注。
+
+## 3.3 Tiling超参调优
+### 3.3.1 ElementWise模板算子Tiling超参调优
+#### 3.3.1.1 ElementWise模板通用Tiling算法
+- 计算blockNum:blockNum = 总的元素量(totalCnt) / 单核数据量基线(singleCoreBaseLine),blockNum最小值为1,最大值为平台提供的最大vectorCore值。
+- 计算达到UB上限的单核单输入元素个数值ubLimitCnt:UB上限内存大小 / 所有输入输出及temp单个元素的内存之和。
+- 计算tiledCnt:
+  - 计算每个核需要处理的数据元素量avgElePerBlock = totalCnt / blockNum;
+  - 根据avgElePerBlock所处的splitDataShape数据段,按照对应的切分系数切分基本块:tiledCnt = avgElePerBlock / dataSplitFactor;
+  - tiledCnt调整:不超过上限ubLimitCnt,不小于下限32,且最后的tiledCnt要做32元素对齐。
+#### 3.3.1.2 ElementWise TilingData定义
+ElementWise模板通用Tiling切分的数据结构为EleWiseTilingData,描述了核间切分和单核内切分的策略,其定义如下:
+```cpp
+namespace ATVC {
+struct EleWiseTilingData {
+    uint32_t tailBlockCnt; // 需要额外执行一次循环的核的数量
+    uint32_t tailElemCnt;  // 尾块元素数量
+    uint32_t numPerBlock;  // 每个核需计算的基本块个数
+    uint32_t tiledCnt;     // 基本块元素个数
+    uint32_t blockNum;     // 执行核数
+};
+}
+```
+
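上述通用Tiling算法可以用如下自包含的示意代码表达(仅为帮助理解的草图:`CalcBlockNum`、`CalcTiledCnt`等函数名、`SketchHyperParam`结构及其中的常量均为示意假设,并非ATVC内部实现):

```cpp
#include <algorithm>
#include <cstdint>

// 示意用的超参集合,默认值取自上文EleWiseTilingHyperParam的说明
struct SketchHyperParam {
    uint32_t singleCoreBaseLine = 512;
    uint32_t splitDataShape[3] = {1024, 32 * 1024, 64 * 1024};
    uint32_t dataSplitFactor[4] = {4, 4, 8, 6};
};

// blockNum = totalCnt / singleCoreBaseLine,并限制在[1, maxVectorCore]范围内
uint32_t CalcBlockNum(uint32_t totalCnt, uint32_t baseLine, uint32_t maxVectorCore)
{
    uint32_t blockNum = totalCnt / baseLine;
    return std::max<uint32_t>(1, std::min(blockNum, maxVectorCore));
}

// 按avgElePerBlock所处数据段的切分系数计算tiledCnt,并做上下限与32元素对齐处理
uint32_t CalcTiledCnt(uint32_t totalCnt, uint32_t blockNum, uint32_t ubLimitCnt,
                      const SketchHyperParam &hp)
{
    uint32_t avgElePerBlock = totalCnt / blockNum;
    uint32_t idx = 0; // 确定avgElePerBlock落在哪个数据段
    while (idx < 3 && avgElePerBlock > hp.splitDataShape[idx]) {
        ++idx;
    }
    uint32_t tiledCnt = avgElePerBlock / hp.dataSplitFactor[idx];
    tiledCnt = std::min(tiledCnt, ubLimitCnt);   // 不超过UB上限
    tiledCnt = std::max<uint32_t>(tiledCnt, 32); // 不小于下限32
    return tiledCnt / 32 * 32;                   // 32元素对齐
}
```

例如totalCnt = 8 * 1024、singleCoreBaseLine = 512时,blockNum为16,每核avgElePerBlock为512,落在第一个数据段,按切分系数4得到tiledCnt = 128。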
+#### 3.3.1.3 ElementWise Tiling超参调优 +当前提供的ElementWise模板内置通用Tiling可调超参如下所示: +| Tiling超参名 | 数据类型 | 参数说明 | 调节范围 | 默认值 | +| ----------- | -------------- | ----------- | ----------- |---| +| singleCoreBaseLine | uint32_t | 单核数据量基线 | [256, 128 * 1024] | 512| +| ubSizeLimitThreshold | float | UB内存使用上限,决定了basicBlock最大值 | [0.5, 0.96] | 0.95 | +| splitDataShape | uint32_t[3]| 单核内数据量的3个分段节点,表示数据量分为4段| {node_0, node_1, node_2} | {1024, 32*1024, 64*1024}| +| dataSplitFactor | uint32_t[4]| 单核内4个数据段的切分系数, 决定不同数据段的切分基本块的大小| {factor_0, factor_1, factor_2, factor_3} 均大于0 小于32| {4, 4, 8, 6}| +| rsvLiveCnt | uint32_t| 预留的空间大小为rsvLiveCnt * (inputBuffer + outputBuffer)|[0, 1]| 0| + +对应的超参`EleWiseTilingHyperParam`数据结构定义如下: +```cpp +namespace ATVC { +namespace Host { +struct EleWiseTilingHyperParam { + uint32_t singleCoreBaseLine = 512; // 数据量基线:核内数据量超过基线就分多核直至满核, 设置范围: [256, 128 * 1024] + float ubSizeLimitThreshold = 0.95f; // UB内存使用上限,决定了basicBlock最大值 + uint32_t nBufferNum = 2; // multi buffer 设置值为: [1, 2] + uint32_t splitDataShape[MAX_SHAPE_NODE] = {1024, 32 * 1024, 64 * 1024}; // 数据分段节点 + uint32_t dataSplitFactor[MAX_SHAPE_NODE + 1] = {4, 4, 8, 6}; // 对应数据分段内的数据量的切分系数 + uint32_t rsvLiveCnt = 0; // 额外的存活节点数,表示内部需要申请空间个数,可设置的范围为[0, 1] +}; +} +} +``` +计算接口`CalcEleWiseTiling()`定义如下所示: +```cpp +namespace ATVC { +namespace Host { +template +bool CalcEleWiseTiling(int32_t totalCnt, ATVC::EleWiseParam ¶m, + EleWiseTilingHyperParam hyperParam = EleWiseTilingHyperParam()); +} +} +``` +其中,可选参数`hyperParam`在未传入用户自定义超参时,使用`EleWiseTilingHyperParam`的默认值。 +若用户需要修改某个超参,调用示例如下所示: +```cpp + // Add算子中有两个输入,一个输出。类型均为float + using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; + // totalCnt描述EleWise单输入的元素个数 + int32_t eleNum = 8 * 1024; + // 声明运行态参数param + ATVC::EleWiseParam param; + ATVC::Host::EleWiseTilingHyperParam hyperParam; + hyperParam.singleCoreBaseLine = 1024; + if (!ATVC::Host::CalcEleWiseTiling(eleNum, param, hyperParam=hyperParam)) { + printf("[ERROR]: Calculate eleWise tiling failed.\n"); + 
return -1; + }; +``` +### 3.3.2 Reduce模板算子Tiling超参调优 +#### 3.3.2.1 Reduce Tiling通用算法 +Reduce Tiling计算流程较为复杂,简化后的主要流程如下: +![](images/reduce_tiling.png) +#### 3.3.2.2 Reduce TilingData定义 +ATVC host 和kernel侧都会使用到的`ReduceTilingData`是Reduce的核间AR轴切分、单核内AR轴切分的策略,其定义如下: +```cpp +namespace ATVC { +struct ReduceTilingData { + uint64_t factorACntPerCore; // 在每个核上不参与计算的非Reduce轴实际维度 + uint64_t factorATotalCnt; // 不参与计算的非Reduce轴总维度 + uint64_t ubFactorA; // 单UB内非Reduce轴的数据量 + uint64_t factorRCntPerCore; // 在每个核上参与计算的Reduce轴实际维度 + uint64_t factorRTotalCnt; // 参与计算的Reduce轴总维度 + uint64_t ubFactorR; // 单UB内参与计算的Reduce轴维度 + uint64_t groupR; // 切轴为R轴,该轴上切点外的R的相对数据量 + uint64_t outSize; // 切轴外的AR数据总量 + uint64_t basicBlock; // 基础数据块大小 + int32_t coreNum; // 执行核数 + float meanVar; // 预留信息,暂不使用 + uint64_t shape[MAX_DIM]; // shape信息 + uint64_t stride[MAX_DIM]; // 输入数据搬运步长 + uint64_t dstStride[MAX_DIM]; // 输出数据搬运步长 +}; +} +``` +#### 3.3.2.3 Reduce 超参调优 +可调参数如下所示: +| Tiling超参名 | 数据类型 | 参数说明 | 调节范围 | 默认值 | +| ----------- | -------------- | ----------- | ----------- |---| +| basicBlock | uint32_t | Reduce 基本块内存大小 | 不能超过UB内存的1/3, 192K内存 建议在48K-54K之间设置 | 54 * 1024| +| maxInnerA | uint32_t |AR切轴内A轴的最大数据量 | [128, 4096] | 128 | +| balanceThreshHold | double| 多核均衡的阈值水平, 阈值越高,切分后每个核处理的数据量越均衡 | [0.85, 0.95]| 0.95 | + +对应的超参`ReduceTilingHyperParam`结构定义如下: +```cpp +namespace ATVC { +namespace Host { +struct ReduceTilingHyperParam { + uint32_t basicBlock = 48 * 1024; // 设置Reduce基本块内存大小, 一般不能超过内存的1/3, 建议在[48-54]之间设置 + uint32_t maxInnerA = 128; // [128, 256] + double balanceThreshHold = 0.95; // 多核均衡的阈值水平 [0.8-0.95] +}; +} +} +``` +Reduce Tiling的计算接口`CalcReduceTiling()`定义如下: +```cpp +namespace ATVC { +namespace Host { +template +bool CalcReduceTiling(std::vector inputShape, + std::vector reduceDim, + ReducePolicy* policy, + ReduceParam* param, + ReduceTilingHyperParam hyperParam = ReduceTilingHyperParam()); +} +} +``` 
+其中,可选参数`hyperParam`在未传入用户自定义超参时,使用`ReduceTilingHyperParam`的默认值,若用户需要修改某个超参,可自定义`ReduceTilingHyperParam`后传入。 + +# Copyright + +Copyright (c) 2025 Huawei Technologies Co., Ltd. \ No newline at end of file diff --git a/atvc/docs/code_organization.md b/atvc/docs/03_code_organization.md similarity index 42% rename from atvc/docs/code_organization.md rename to atvc/docs/03_code_organization.md index 58bd34d4de241e131263081eb5bcc0d85dab218c..47641b0cb98d098b19a918a1145b5d546ade92a2 100644 --- a/atvc/docs/code_organization.md +++ b/atvc/docs/03_code_organization.md @@ -8,40 +8,25 @@ include目录下的头文件是按照如下的文件层级进行组织。 ``` include/ ├── atvc.h // Vector模板编程入口头文件 -├── common // 不同模板公用api和C++基本类的拓展模板目录 -│ ├── atvc_opdef.h // ATVC模板算子数据类型描述 -│ ├── compile_info.h // 平台相关编译信息 -│ ├── const_def.h // 常量定义 -│ └── ... -├── elewise // element_wise模板目录 -│ ├── common -│ │ └── elewise_common.h // element_wise的公共数据定义 -│ ├── elewise_op_template.h // element_wise算子模板类 -│ └── elewise_host.h // element_wise算子host侧API -├── broadcast // broadcast模板目录 -│ ├── common // broadcast模板各层公用文件目录 -│ │ ├── ... -│ │ └── broadcast_common.h // broadcast的公共数据定义 -│ ├── broadcast_utils // broadcast模板辅助工具目录 -│ │ ├── ... -│ │ └── broadcast_buf_pool.h // broadcast内存管理类 -│ ├── tiling // broadcast模板host层目录 -│ │ └── broadcast_tiling.h // broadcast tiling算法 -│ ├── broadcast_host.h // broadcast算子host侧API -│ ├── broadcast_op_template.h // broadcast算子模板类 -│ └── broadcast_compute.h // broadcast计算模板 -└── reduce // reduce模板目录 - ├── common // reduce模板各层公用文件目录 - │ ├── ... - │ └── reduce_common.h // reduce的公共数据定义 - ├── reduce_utils // reduce模板辅助工具目录 - │ ├── ... 
- │ └── reduce_util.h // reduce计算辅助工具函数 - ├── tiling // reduce模板host层目录 - │ └── reduce_tiling.h // reduce tiling算法 - ├── reduce_host.h // reduce算子host侧API - ├── reduce_op_template.h // reduce算子模板类 - └── reduce_sum.h // reduceSum计算模板 +├── common // 不同模板公用API和C++基本类的拓展模板目录 +├── elewise // Elementwise模板目录 +│ ├── common // Elementwise的公共数据定义 +│ ├── elewise_op_template.h // Elementwise算子模板类 +│ └── elewise_host.h // Elementwise算子host侧API +├── broadcast // Broadcast模板目录 +│ ├── common // Broadcast模板各层公用文件目录 +│ ├── utils // Broadcast模板辅助工具目录 +│ ├── tiling // Broadcast模板host层目录 +│ ├── broadcast_host.h // Broadcast算子host侧API +│ ├── broadcast_op_template.h // Broadcast算子模板类 +│ └── broadcast_compute.h // Broadcast计算模板 +└── reduce // Reduce模板目录 + ├── common // Reduce模板各层公用文件目录 + ├── utils // Reduce模板辅助工具目录 + ├── tiling // Reduce模板host层目录 + ├── reduce_host.h // Reduce算子host侧API + ├── reduce_op_template.h // Reduce算子模板类 + └── reduce_sum.h // ReduceSum计算模板 ``` ## 2. examples examples文件夹下提供了算子代码样例,包含算子实现的源码文件和测试用例配置和执行脚本。 @@ -50,7 +35,7 @@ examples ├── add // EleWise Add算子样例 │ ├── README.md │ └── add.cpp -├── add_with_scalar // EleWise + Scalar场景样例 +├── add_with_scalar // EleWise + Scalar算子样例 │ ├── README.md │ └── add_with_scalar.cpp ├── broadcast_to // BroadcastTo算子样例 @@ -62,14 +47,21 @@ examples └── sinh_custom // SinhCustom算子样例 ├── README.md └── sinh_custom.cpp +└── tanh_grad // Tah_Grad算子样例 + ├── README.md + └── tanh_grad.cpp ``` ## 3. docs docs文件夹下包含项目的所有文档。 ``` doc/ -├── 1_quick_start.md // atvc快速上手指南 -├── 2_developer_guide.md // 开发指南 -├── code_organization.md // 目录结构说明 -└── data // 图片 -``` \ No newline at end of file +├── 01_quick_start.md // ATVC快速上手指南 +├── 02_developer_guide.md // 开发指南 +├── 03_code_organization.md // 目录结构说明 +└── images // 图片 +``` + +# Copyright + +Copyright (c) 2025 Huawei Technologies Co., Ltd. 
\ No newline at end of file diff --git a/atvc/docs/data/architecture.png b/atvc/docs/images/architecture.png similarity index 100% rename from atvc/docs/data/architecture.png rename to atvc/docs/images/architecture.png diff --git a/atvc/docs/data/atvc_user_case.png b/atvc/docs/images/atvc_user_case.png similarity index 100% rename from atvc/docs/data/atvc_user_case.png rename to atvc/docs/images/atvc_user_case.png diff --git a/atvc/docs/data/broadcast_components.png b/atvc/docs/images/broadcast_components.png similarity index 100% rename from atvc/docs/data/broadcast_components.png rename to atvc/docs/images/broadcast_components.png diff --git a/atvc/docs/data/broadcast_dataflow.png b/atvc/docs/images/broadcast_dataflow.png similarity index 100% rename from atvc/docs/data/broadcast_dataflow.png rename to atvc/docs/images/broadcast_dataflow.png diff --git a/atvc/docs/data/elewise_components.png b/atvc/docs/images/elewise_components.png similarity index 100% rename from atvc/docs/data/elewise_components.png rename to atvc/docs/images/elewise_components.png diff --git a/atvc/docs/data/elewise_dataflow.png b/atvc/docs/images/elewise_dataflow.png similarity index 100% rename from atvc/docs/data/elewise_dataflow.png rename to atvc/docs/images/elewise_dataflow.png diff --git a/atvc/docs/data/framework.png b/atvc/docs/images/framework.png similarity index 100% rename from atvc/docs/data/framework.png rename to atvc/docs/images/framework.png diff --git a/atvc/docs/data/reduce_components.png b/atvc/docs/images/reduce_components.png similarity index 100% rename from atvc/docs/data/reduce_components.png rename to atvc/docs/images/reduce_components.png diff --git a/atvc/docs/data/reduce_dataflow.png b/atvc/docs/images/reduce_dataflow.png similarity index 100% rename from atvc/docs/data/reduce_dataflow.png rename to atvc/docs/images/reduce_dataflow.png diff --git a/atvc/docs/images/reduce_tiling.png b/atvc/docs/images/reduce_tiling.png new file mode 100644 index 
0000000000000000000000000000000000000000..628694c0b699281016a49adadfe0f52bc296fe20 Binary files /dev/null and b/atvc/docs/images/reduce_tiling.png differ diff --git a/atvc/examples/add/add.cpp b/atvc/examples/add/add.cpp index f5a410a114299bbff4b620dd15bce57201812688..a3f1e65443f1c0bd3551f573f1cd49df4063fd9b 100644 --- a/atvc/examples/add/add.cpp +++ b/atvc/examples/add/add.cpp @@ -38,26 +38,6 @@ bool IsClose(float a, float b) return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -// Add算子中有两个输入,一个输出。类型均为float -using ADD_OPTRAITS = ATVC::OpTraits, ATVC::OpOutputs>; - -// 传入编译态参数ATVC::OpTraits -template -struct AddComputeFunc { - /* - 函数说明: c = a + b - 参数说明: - a : 参与运算的输入 - b : 参与运算的输入 - c : 参与运算的输出 - */ - template - // 重载operator,提供给算子模板类调用 - __aicore__ inline void operator()(AscendC::LocalTensor a, AscendC::LocalTensor b, AscendC::LocalTensor c) { - AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC Api自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 - } -}; - void InitializeData(int32_t eleNum, std::vector &inputX, std::vector &inputY, std::vector &golden) { std::random_device rd; @@ -75,7 +55,7 @@ bool VerifyResults(const std::vector &golden, const std::vector &o { for (int32_t i = 0; i < golden.size(); i++) { if (!IsClose(golden[i], output[i])) { - printf("Accuracy verification failed! The expected value of element " + printf("[ERROR]: Accuracy verification failed! 
The expected value of element " "in index [%d] is %f, but actual value is %f.\n", i, golden[i], @@ -94,21 +74,41 @@ void InitializeACL(aclrtContext &context, aclrtStream &stream, int32_t deviceId) CHECK_ACL(aclrtCreateStream(&stream)); } -void CleanACL(aclrtStream &stream, int32_t deviceId) +void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) { CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); CHECK_ACL(aclrtResetDevice(deviceId)); CHECK_ACL(aclFinalize()); } -void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice, uint8_t *¶mDevice) +void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice) { CHECK_ACL(aclrtFree(xDevice)); CHECK_ACL(aclrtFree(yDevice)); CHECK_ACL(aclrtFree(zDevice)); - CHECK_ACL(aclrtFree(paramDevice)); CHECK_ACL(aclrtFreeHost(zHost)); } + +// Add算子中有两个输入,一个输出。类型均为float +using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; + +// 传入编译态参数ATVC::OpTraits +template +struct AddComputeFunc { + /* + 函数说明: c = a + b + 参数说明: + a : 参与运算的输入 + b : 参与运算的输入 + c : 参与运算的输出 + */ + template + // 重载operator,提供给算子模板类调用 + __aicore__ inline void operator()(AscendC::LocalTensor a, AscendC::LocalTensor b, AscendC::LocalTensor c) { + AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC Api自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 + } +}; } template @@ -117,18 +117,22 @@ template * a Device上的gm地址,指向Add算子第一个输入 * b Device上的gm地址,指向Add算子第二个输入 * c Device上的gm地址,指向Add算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * param 指向运行态ATVC::EleWiseParam数据 */ -__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR param) +__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(a, b, c, param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 + op.Run(a, b, c, 
¶m); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 } int main() { + if (!ATVC::Host::DebugCheck()) { + printf("[ERROR]: Element wise OpTraits check failed.\n"); + return -1; + } // totalCnt描述EleWise单输入的元素个数 int32_t eleNum = 8 * 1024; size_t inputByteSize = static_cast(eleNum) * sizeof(float); @@ -145,8 +149,8 @@ int main() InitializeACL(context, stream, deviceId); ATVC::EleWiseParam param; - if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { - printf("Elewise tiling error.\n"); + if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { + printf("[ERROR]: Calculate Element wise tiling Failed.\n"); return -1; }; auto elementParamSize = sizeof(param); @@ -155,33 +159,29 @@ int main() uint8_t *xDevice; uint8_t *yDevice; uint8_t *zDevice; - uint8_t *paramDevice; CHECK_ACL(aclrtMallocHost((void **)&zHost, outputByteSize)); CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void **)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void **)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMalloc((void **)¶mDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); - CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, inputY.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); - CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, reinterpret_cast(¶m), - elementParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, inputY.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); uint32_t blockNum = param.tilingData.blockNum; - // 调用核函数 - AddCustom<<>>(xDevice, yDevice, zDevice, paramDevice); + // 调用核函数 + AddCustom<<>>(xDevice, yDevice, zDevice, param); CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); std::vector outputZ(reinterpret_cast(zHost), 
reinterpret_cast(zHost) + eleNum); - CleanUp(zHost, xDevice, yDevice, zDevice, paramDevice); - CleanACL(stream, deviceId); + CleanUp(zHost, xDevice, yDevice, zDevice); + CleanACL(stream, context, deviceId); if (!VerifyResults(golden, outputZ)) { return -1; } - printf("Accuracy verification passed.\n"); + printf("[INFO]: Accuracy verification passed.\n"); return 0; } \ No newline at end of file diff --git a/atvc/examples/add_with_broadcast/README.md b/atvc/examples/add_with_broadcast/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f49500d66f3d7b495549cdb9d09bb3530860e097 --- /dev/null +++ b/atvc/examples/add_with_broadcast/README.md @@ -0,0 +1,47 @@ + + +## 概述 + +本样例介绍了利用ATVC实现带广播的Add单算子并完成功能验证 + + +## 样例支持产品型号: +- Atlas A2训练系列产品 + +## 目录结构 + +| 文件名 | 描述 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [add_with_broadcast.cpp](./add_with_broadcast.cpp) | Add算子代码实现以及调用样例 | + +## 算子描述 + +Add算子数学计算公式:$z = x + y$ + +Add算子规格: + + + + + + + + + + + + + + +
+| 算子类型(OpType) | Add |
+| ---------------- | --- |
+
+算子输入:
+
+| name | shape | data type | format |
+| ---- | ----- | --------- | ------ |
+| x | 1 * 2048 | float | ND |
+| y | 8 * 2048 | float | ND |
+
+算子输出:
+
+| name | shape | data type | format |
+| ---- | ----- | --------- | ------ |
+| z | 8 * 2048 | float | ND |
+
+核函数名:AddWithBroadcastCustom
+ +## 算子运行 +在ascendc-api-adv代码仓目录下执行: +```bash +$ cd ./atvc/tests/ +$ bash run_test.sh add_with_broadcast +... +Generate golden data successfully. +... +Accuracy verification passed. +``` \ No newline at end of file diff --git a/atvc/examples/add_with_broadcast/add_with_broadcast.cpp b/atvc/examples/add_with_broadcast/add_with_broadcast.cpp new file mode 100644 index 0000000000000000000000000000000000000000..2a58d49b93040a9091e931da49242e21ad4cbb77 --- /dev/null +++ b/atvc/examples/add_with_broadcast/add_with_broadcast.cpp @@ -0,0 +1,160 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include "acl/acl.h" +#include "broadcast/broadcast_host.h" +#include "add_with_broadcast.h" + +#define CHECK_ACL(x) \ + do { \ + aclError __ret = x; \ + if (__ret != ACL_ERROR_NONE) { \ + std::cerr << __FILE__ << ":" << __LINE__ << " aclError:" << __ret << std::endl; \ + } \ + } while (0) + +namespace { +static constexpr float REL_TOL = 1e-3f; +static constexpr float ABS_TOL = 1e-5f; + +// 判断两个浮点数是否足够接近 +bool IsClose(float a, float b) { + const float eps = 1e-40f; // 防止分母为零 + float diff = std::abs(a - b); + return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); +} + +// AddWithBroadcast算子的描述:两个输入,一个输出,类型均为float +using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; + +// 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 +template +void BroadcastOpAdapter(uint8_t* x, uint8_t* y, uint8_t* z, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy) +{ + aclrtStream stream = nullptr; + CHECK_ACL(aclrtCreateStream(&stream)); + // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 + uint8_t *workspaceDevice; + CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); + param.workspaceAddr = reinterpret_cast(workspaceDevice); + // 将tiling api计算出的BroadcastPolicy转化为编译态参数并实例化相应的核函数 + if (policy == ATVC::BROADCAST_POLICY0) { + AddWithBroadcastCustom<<>>(x, y, z, param); + }else if (policy == ATVC::BROADCAST_POLICY1) { + AddWithBroadcastCustom<<>>(x, y, z, param); + } else { + printf("[ERROR] Cannot find any matched policy.\n"); + } + // 流同步后释放申请的param内存 + CHECK_ACL(aclrtSynchronizeStream(stream)); + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtFree(workspaceDevice)); +} + +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} +} + + +int32_t main(int32_t argc, char* argv[]) +{ + int32_t eleNum = 1 * 2048; + int32_t outEleNum = 8 * 2048; + std::vector shapeIn{1, 2048}; // 测试输入shape + std::vector shapeOut{8, 2048}; // 测试输入shape + + size_t inputByteSize = static_cast(eleNum) * sizeof(float); + size_t outputByteSize = static_cast(outEleNum) * sizeof(float); + std::random_device rd; + std::mt19937 gen(rd()); + std::uniform_real_distribution disX(1.0f, 9.0f); + std::uniform_real_distribution disY(1.0f, 9.0f); + + std::vector inputX(eleNum); + std::vector inputY(outEleNum); + std::vector golden(outEleNum); + for (int i = 0; i < eleNum; ++i) { + inputX[i] = (disX(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + inputY[i] = (disY(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + golden[i] = (inputX[i % eleNum]) + (inputY[i]); + } + printf("Generate golden data successfully.\n"); + // 初始化Acl资源 + CHECK_ACL(aclInit(nullptr)); + aclrtContext context; + int32_t deviceId = 0; + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + uint8_t *zHost; + uint8_t *xDevice; + uint8_t *yDevice; + uint8_t *zDevice; + + CHECK_ACL(aclrtMallocHost((void **)(&zHost), outputByteSize)); + CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&yDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + + CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(yDevice, outputByteSize, inputY.data(), outputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + + ATVC::BroadcastParam param; // Broadcast运行态参数,包含TilingData以及临时空间的相关信息 + ATVC::BroadcastPolicy policy = {-1, -1, -1}; // Broadcast运行态参数,负责映射最适合的Broadcast模板实现 + 
// Host侧调用Tiling API完成相关运行态参数的运算 + param.nBufferNum = 1; + if (!ATVC::Host::CalcBroadcastTiling(shapeIn, shapeOut, &policy, &param)) { + printf("Broadcast tiling error.\n"); + return -1; + } + // 调用Adapter调度接口,完成核函数的模板调用 + BroadcastOpAdapter(xDevice, yDevice, zDevice, param, policy); + + CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); + std::vector outputZ(reinterpret_cast(zHost), reinterpret_cast(zHost) + outEleNum); + + // 释放Acl资源 + CHECK_ACL(aclrtFree(xDevice)); + CHECK_ACL(aclrtFree(yDevice)); + CHECK_ACL(aclrtFree(zDevice)); + CHECK_ACL(aclrtFreeHost(zHost)); + + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); + + if (!VerifyResults(golden, outputZ)) { + return -1; + } + printf("Accuracy verification passed.\n"); + return 0; +} diff --git a/atvc/examples/add_with_broadcast/add_with_broadcast.h b/atvc/examples/add_with_broadcast/add_with_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..a1abb007e4ba07de1c32e3f7321e58eff89ddcf2 --- /dev/null +++ b/atvc/examples/add_with_broadcast/add_with_broadcast.h @@ -0,0 +1,51 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ +#ifndef ATVC_ADD_WITH_BROADCAST_H +#define ATVC_ADD_WITH_BROADCAST_H +#include "post_compute_add_of_broadcast.h" + +/* ! 
+ * \brief z = x + y, the shape of x must be able to be broadcasted to the shape of y + * \param [in] x, input global memory of x + * \param [in] y, input global memory of y + * \param [out] z, output global memory + * \param [in] broadcastParam, params of broadcast + */ +template +__global__ __aicore__ void AddWithBroadcastCustom(GM_ADDR x, + GM_ADDR y, + GM_ADDR z, + ATVC::BroadcastParam broadcastParam) +{ + KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); + + // 1. get input and output for kernel op from host Traits + using KernelOpIn = typename Traits::In::types; + using KernelOpOut = typename Traits::Out::types; + using KernelOpTemp = typename Traits::Temp::types; + + // 2. define input and output for broadcast + using BroadcastOpInput = ATVC::OpInputs::Type>; + using BroadcastOpOutput = ATVC::OpOutputs::Type>; + using BroadcastOpTraits = ATVC::OpTraits; + + // 3. define input and output for post compute + using AddOpInput = ATVC::OpInputs::Type, typename ATVC::TypeListGet::Type>; + using AddOpOutput = ATVC::OpOutputs::Type>; + using AddOpTraits = ATVC::OpTraits; + using PostCompute = PostComputeAddOfBroadcast; + + // 4. call op run + auto op = ATVC::Kernel::BroadcastOpTemplate, Policy, void, PostCompute>(); + ATVC::BroadcastParam *param = &broadcastParam; + op.Run(x, y, z, param); +} +#endif diff --git a/atvc/examples/add_with_broadcast/post_compute_add_of_broadcast.h b/atvc/examples/add_with_broadcast/post_compute_add_of_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..a02bb149d6c9b7adf0ec6e6ce920a359471483a8 --- /dev/null +++ b/atvc/examples/add_with_broadcast/post_compute_add_of_broadcast.h @@ -0,0 +1,102 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. 
You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ +#ifndef ATVC_POST_COMPUTE_ADD_OF_BROADCAST_H +#define ATVC_POST_COMPUTE_ADD_OF_BROADCAST_H + +#include "broadcast/broadcast_device.h" + +template +struct PostComputeAddOfBroadcast { + using inputDTypeList = typename Traits::In::types; + using DataType = typename ATVC::TypeListGet::Type; + /* ! + * \brief set scalar param for compute function + * \param [in] args, args are mutable parameters, and are passed transparently from the parameters of + * global kernel functions, which are the parameters after broadcastParam + */ + template + __aicore__ inline void SetParam(Args... args) {} + + /* ! + * \brief set tensor param for compute function + * \param [in] args, args are mutable parameters, and are passed transparently from the parameters of + * global kernel functions, which are the parameters before broadcastParam, the num of args is + * decided by Traits + */ + template + __aicore__ inline void SetArgs(Args... args) + { + InitArgs(args...); + } + + /* ! 
+ * \brief process function of compute struct + * \param [in] y, local tensor of y + * \param [in] z, local tensor of z + * \param [in] x, local tensor of x, x is the output of broadcast, must be the last local tensor + * \param [in] copyOutOffset, copy out offset for DataCopy + * \param [in] copyOutParams, copy out params for DataCopy + */ + template + __aicore__ inline void operator()(AscendC::LocalTensor y, AscendC::LocalTensor z, AscendC::LocalTensor x, + uint32_t copyOutOffset, AscendC::DataCopyExtParams &copyOutParams) + { + size_t size = copyOutParams.blockCount * (copyOutParams.blockLen + copyOutParams.srcStride * 32) / sizeof(DataType); + ATVC::SyncDataQueue(); + + CopyIn(y, copyOutOffset, copyOutParams); + AscendC::PipeBarrier(); // wait broadcast finished + ATVC::SyncDataQueue(); + + Compute(y, z, x, copyOutOffset, size); + ATVC::SyncDataQueue(); + + CopyOut(z, copyOutOffset, copyOutParams); + AscendC::PipeBarrier(); + } + +private: + template + __aicore__ inline void InitArgs(GM_ADDR src, GM_ADDR dst) + { + srcGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(src)); + dstGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(dst)); + } + + template + __aicore__ inline void CopyIn(AscendC::LocalTensor y, uint32_t copyOutOffset, AscendC::DataCopyExtParams &copyOutParams) + { + AscendC::DataCopyPadExtParams padParams{false, 0, 0, 0}; + uint32_t tmp = copyOutParams.srcStride; + copyOutParams.srcStride = copyOutParams.dstStride; + copyOutParams.dstStride = tmp; + AscendC::DataCopyPad(y, srcGlobal_[copyOutOffset], copyOutParams, padParams); + copyOutParams.dstStride = copyOutParams.srcStride; + copyOutParams.srcStride = tmp; + } + + template + __aicore__ inline void Compute(AscendC::LocalTensor y, AscendC::LocalTensor z, AscendC::LocalTensor x, + uint32_t copyOutOffset, uint32_t size) + { + AscendC::Add(z, x, y, size); + } + + template + __aicore__ inline void CopyOut(AscendC::LocalTensor z, uint32_t copyOutOffset, AscendC::DataCopyExtParams &copyOutParams) + 
{ + AscendC::DataCopyPad(dstGlobal_[copyOutOffset], z, copyOutParams); + } + + AscendC::GlobalTensor srcGlobal_; + AscendC::GlobalTensor dstGlobal_; +}; +#endif diff --git a/atvc/examples/add_with_scalar/add_with_scalar.cpp b/atvc/examples/add_with_scalar/add_with_scalar.cpp index a2f60e3943e6918a333ad3d6cc6b8bfac91999d9..24c4707032a33538bc727257ea340ec574cb51a3 100644 --- a/atvc/examples/add_with_scalar/add_with_scalar.cpp +++ b/atvc/examples/add_with_scalar/add_with_scalar.cpp @@ -103,12 +103,11 @@ void CleanACL(aclrtStream &stream, int32_t deviceId) CHECK_ACL(aclFinalize()); } -void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice, uint8_t *¶mDevice) +void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice) { CHECK_ACL(aclrtFree(xDevice)); CHECK_ACL(aclrtFree(yDevice)); CHECK_ACL(aclrtFree(zDevice)); - CHECK_ACL(aclrtFree(paramDevice)); CHECK_ACL(aclrtFreeHost(zHost)); } } @@ -118,15 +117,15 @@ void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zD * a Device上的gm地址,指向算子第一个输入 * b Device上的gm地址,指向算子第二个输入 * c Device上的gm地址,指向算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * param 指向运行态ATVC::EleWiseParam数据 * conditionVal 标量,控制算子的计算逻辑 */ template -__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR param, bool conditionVal) +__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param, bool conditionVal) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); auto op = ATVC::Kernel::EleWiseOpTemplate>(); // 传入计算仿函数, 实例化算子 - op.Run(a, b, c, param, conditionVal); // 调用Run函数, 执行算子 + op.Run(a, b, c, ¶m, conditionVal); // 调用Run函数, 执行算子 } @@ -159,28 +158,24 @@ int main() uint8_t *xDevice; uint8_t *yDevice; uint8_t *zDevice; - uint8_t *paramDevice; CHECK_ACL(aclrtMallocHost((void **)&zHost, outputByteSize)); CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void 
**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void **)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMalloc((void **)¶mDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, inputY.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); - CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, reinterpret_cast(¶m), - elementParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); AddCustom - <<>>(xDevice, yDevice, zDevice, paramDevice, conditionVal); + <<>>(xDevice, yDevice, zDevice, param, conditionVal); CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); std::vector outputZ(reinterpret_cast(zHost), reinterpret_cast(zHost) + eleNum); - CleanUp(zHost, xDevice, yDevice, zDevice, paramDevice); + CleanUp(zHost, xDevice, yDevice, zDevice); CleanACL(stream, deviceId); if (!VerifyResults(golden, outputZ)) { diff --git a/atvc/examples/addcmul/README.md b/atvc/examples/addcmul/README.md new file mode 100644 index 0000000000000000000000000000000000000000..43e9b2b656d028e33247164bc289ed004b8dbd4e --- /dev/null +++ b/atvc/examples/addcmul/README.md @@ -0,0 +1,48 @@ + + +## 概述 + +本样例介绍了利用ATVC实现Addcmul单算子并完成功能验证 + + +## 样例支持产品型号: +- Atlas A2训练系列产品 + +## 目录结构 + +| 文件名 | 描述 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [addcmul.cpp](./addcmul.cpp) | Addcmul算子代码实现以及调用样例 | + +## 算子描述 + +Add算子数学计算公式:$output_i = input_i + value * tensor1_i * tensor2_i$ + +Add算子规格: + + + + + + + + + + + + + + + +
算子类型(OpType):Addcmul

算子输入:

| name | shape | data type | format |
| ------- | -------- | --------- | ------ |
| input | 8 * 2048 | float | ND |
| tensor1 | 1 * 2048 | float | ND |
| tensor2 | 1 * 2048 | float | ND |

算子输出:

| name | shape | data type | format |
| ------ | -------- | --------- | ------ |
| output | 8 * 2048 | float | ND |

核函数名:AddcmulCustom
+ +## 算子运行 +在ascendc-api-adv代码仓目录下执行: +```bash +$ cd ./atvc/tests/ +$ bash run_test.sh addcmul +... +Generate golden data successfully. +... +Accuracy verification passed. +``` \ No newline at end of file diff --git a/atvc/examples/addcmul/addcmul.cpp b/atvc/examples/addcmul/addcmul.cpp new file mode 100644 index 0000000000000000000000000000000000000000..2ccd77a0d121ea7511b9d2417bde985d59a949da --- /dev/null +++ b/atvc/examples/addcmul/addcmul.cpp @@ -0,0 +1,166 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include "acl/acl.h" +#include "broadcast/broadcast_host.h" +#include "addcmul.h" + +#define CHECK_ACL(x) \ + do { \ + aclError __ret = x; \ + if (__ret != ACL_ERROR_NONE) { \ + std::cerr << __FILE__ << ":" << __LINE__ << " aclError:" << __ret << std::endl; \ + } \ + } while (0) + +namespace { +static constexpr float REL_TOL = 1e-3f; +static constexpr float ABS_TOL = 1e-5f; + +// 判断两个浮点数是否足够接近 +bool IsClose(float a, float b) { + const float eps = 1e-40f; // 防止分母为零 + float diff = std::abs(a - b); + return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); +} + +// AddWithBroadcast算子的描述:两个输入,一个输出,类型均为float +using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; + +// 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 +template +void BroadcastOpAdapter(uint8_t* tensor1, uint8_t* tensor2, float value, uint8_t* input, uint8_t* output, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy, aclrtStream& stream) +{ + // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 + uint8_t *workspaceDevice; + CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); + param.workspaceAddr = reinterpret_cast(workspaceDevice); + // 将tiling api计算出的BroadcastPolicy转化为编译态参数并实例化相应的核函数 + if (policy == ATVC::BROADCAST_POLICY0) { + AddcmulCustom<<>>(tensor1, tensor2, input, output, param, value); + }else if (policy == ATVC::BROADCAST_POLICY1) { + AddcmulCustom<<>>(tensor1, tensor2, input, output, param, value); + } else { + printf("[ERROR] Cannot find any matched policy.\n"); + } + // 流同步后释放申请的param内存 + CHECK_ACL(aclrtSynchronizeStream(stream)); + CHECK_ACL(aclrtFree(workspaceDevice)); +} + +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} +} + +int32_t main(int32_t argc, char* argv[]) +{ + int32_t eleNum = 1 * 8; + int32_t outEleNum = 8 * 8; + std::vector shapeIn{1, 8}; // 测试输入shape + std::vector shapeOut{8, 8}; // 测试输入shape + + size_t inputByteSize = static_cast(eleNum) * sizeof(float); + size_t outputByteSize = static_cast(outEleNum) * sizeof(float); + std::random_device rd; + std::mt19937 gen(rd()); + std::uniform_real_distribution disX(1.0f, 9.0f); + std::uniform_real_distribution disY(1.0f, 9.0f); + + std::vector tensor1(eleNum); + std::vector tensor2(eleNum); + float value = 4; + std::vector input(outEleNum); + std::vector golden(outEleNum); + for (int i = 0; i < eleNum; ++i) { + tensor1[i] = (disX(gen)); + tensor2[i] = (disX(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + input[i] = (disY(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + golden[i] = input[i] + (tensor1[i % eleNum] * tensor2[i % eleNum] * value); + } + printf("Generate golden data successfully.\n"); + // 初始化Acl资源 + CHECK_ACL(aclInit(nullptr)); + aclrtContext context; + int32_t deviceId = 0; + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + aclrtStream stream = nullptr; + CHECK_ACL(aclrtCreateStream(&stream)); + uint8_t *outputHost; + uint8_t *tensor1Device; + uint8_t *tensor2Device; + uint8_t *inputDevice; + uint8_t *outputDevice; + + CHECK_ACL(aclrtMallocHost((void **)(&outputHost), outputByteSize)); + CHECK_ACL(aclrtMalloc((void **)&tensor1Device, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&tensor2Device, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&inputDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&outputDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + + CHECK_ACL(aclrtMemcpy(tensor1Device, inputByteSize, 
tensor1.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(tensor2Device, inputByteSize, tensor2.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(inputDevice, outputByteSize, input.data(), outputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + + ATVC::BroadcastParam param; // Broadcast运行态参数,包含TilingData以及临时空间的相关信息 + ATVC::BroadcastPolicy policy = {-1, -1, -1}; // Broadcast运行态参数,负责映射最适合的Broadcast模板实现 + // Host侧调用Tiling API完成相关运行态参数的运算 + param.nBufferNum = 1; + if (!ATVC::Host::CalcBroadcastTiling(shapeIn, shapeOut, &policy, ¶m)) { + printf("Broadcast tiling error.\n"); + return -1; + }; + // 调用Adapter调度接口,完成核函数的模板调用 + BroadcastOpAdapter(tensor1Device, tensor2Device, value, inputDevice, outputDevice, param, policy, stream); + + CHECK_ACL(aclrtMemcpy(outputHost, outputByteSize, outputDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); + std::vector output(reinterpret_cast(outputHost), reinterpret_cast(outputHost) + outEleNum); + + // 释放Acl资源 + CHECK_ACL(aclrtFree(tensor1Device)); + CHECK_ACL(aclrtFree(tensor2Device)); + CHECK_ACL(aclrtFree(inputDevice)); + CHECK_ACL(aclrtFree(outputDevice)); + CHECK_ACL(aclrtFreeHost(outputHost)); + + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); + + if (!VerifyResults(golden, output)) { + return -1; + } + printf("Accuracy verification passed.\n"); + return 0; +} diff --git a/atvc/examples/addcmul/addcmul.h b/atvc/examples/addcmul/addcmul.h new file mode 100644 index 0000000000000000000000000000000000000000..a63479d4c21e4605304072c4a0b2c98f544d095d --- /dev/null +++ b/atvc/examples/addcmul/addcmul.h @@ -0,0 +1,63 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. 
You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ +#ifndef ATVC_ADDCMUL_H +#define ATVC_ADDCMUL_H +#include "pre_compute_mul_of_broadcast.h" +#include "post_compute_add_of_broadcast.h" + +/* ! + * \brief Addcmul(i) = input(i) + value * tensor1(i) * tensor2(i) + * \param [in] tensor1, input global memory of tensor1 + * \param [in] tensor2, input global memory of tensor2 + * \param [in] input, input global memory of input + * \param [out] output, output global memory + * \param [in] broadcastParam, params of broadcast + * \param [in] value, input value + */ +template::Type> +__global__ __aicore__ void AddcmulCustom(GM_ADDR tensor1, + GM_ADDR tensor2, + GM_ADDR input, + GM_ADDR output, + ATVC::BroadcastParam broadcastParam, + DataType value) +{ + KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); + + // 1. get input and output for kernel op from host Traits + using KernelOpIn = typename Traits::In::types; + using KernelOpOut = typename Traits::Out::types; + using KernelOpTemp = typename Traits::Temp::types; + + // 2. define input and output for pre compute + using PreComputeInput = ATVC::OpInputs::Type, typename ATVC::TypeListGet::Type>; + using PreComputeOutput = ATVC::OpOutputs::Type>; + using PreComputeTemp = ATVC::OpOutputs::Type>; + using PreComputeOpTraits = ATVC::OpTraits; + using PreCompute = PreComputeMulOfBroadcast; + + // 3. define input and output for broadcast + using BroadcastOpInput = ATVC::OpInputs::Type>; + using BroadcastOpOutput = ATVC::OpOutputs::Type>; + using BroadcastOpTraits = ATVC::OpTraits; + + // 4. 
define input and output for post compute + using PostComputeInput = ATVC::OpInputs::Type, typename ATVC::TypeListGet::Type>; + using PostComputeOutput = ATVC::OpOutputs::Type>; + using PostComputeOpTraits = ATVC::OpTraits; + using PostCompute = PostComputeAddOfBroadcast; + + // 5. call op run + auto op = ATVC::Kernel::BroadcastOpTemplate, Policy, PreCompute, PostCompute>(); + ATVC::BroadcastParam *param = &broadcastParam; + op.Run(tensor1, tensor2, input, output, param, value); +} +#endif diff --git a/atvc/examples/addcmul/post_compute_add_of_broadcast.h b/atvc/examples/addcmul/post_compute_add_of_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..9059186cb36583d1208382e5e3085fa6c8ddc2bc --- /dev/null +++ b/atvc/examples/addcmul/post_compute_add_of_broadcast.h @@ -0,0 +1 @@ +../add_with_broadcast/post_compute_add_of_broadcast.h \ No newline at end of file diff --git a/atvc/examples/addcmul/pre_compute_mul_of_broadcast.h b/atvc/examples/addcmul/pre_compute_mul_of_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..6df76098319d82cffc55eb6d601bca9d93a73e52 --- /dev/null +++ b/atvc/examples/addcmul/pre_compute_mul_of_broadcast.h @@ -0,0 +1,73 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ +#ifndef ATVC_PRE_COMPUTE_MUL_OF_BROADCAST_H +#define ATVC_PRE_COMPUTE_MUL_OF_BROADCAST_H + +#include "broadcast/broadcast_device.h" + +template +struct PreComputeMulOfBroadcast { + using Inputs = typename Traits::In::types; + using Outputs = typename Traits::Out::types; + using DataType = typename ATVC::TypeListGet::Type; + static constexpr size_t INPUT_SIZE = ATVC::TypeListSize::VALUE; + + template + __aicore__ inline void SetParam(DataType value) + { + value_ = value; + } + + template + __aicore__ inline void SetArgs(Args... args) + { + InitArgsInput<0>(args...); + } + + template + __aicore__ inline void operator()(AscendC::LocalTensor tensor1, AscendC::LocalTensor tensor2, AscendC::LocalTensor temp1, AscendC::LocalTensor temp2, + uint32_t copyInOffset, AscendC::DataCopyExtParams ©InParams) + { + size_t size = copyInParams.blockCount * (copyInParams.blockLen + copyInParams.srcStride * 32) / sizeof(DataType); + ATVC::SyncDataQueue(); + + CopyIn(tensor1, tensor2, copyInOffset, copyInParams); + + AscendC::PipeBarrier(); // wait broadcast finished + ATVC::SyncDataQueue(); + + AscendC::Mul(temp1, tensor1, tensor2, size); + AscendC::Muls(temp2, temp1, value_, size); + } + +private: + template + __aicore__ inline void InitArgsInput(GM_ADDR x, Args... 
args) + { + input_[start].SetGlobalBuffer((__gm__ DataType*)x); + if constexpr (start + 1 < INPUT_SIZE) { + InitArgsInput(args...); + } + } + + template + __aicore__ inline void CopyIn(AscendC::LocalTensor tensor1, AscendC::LocalTensor tensor2, uint32_t copyInOffset, AscendC::DataCopyExtParams &copyInParams) + { + AscendC::DataCopyPadExtParams padParams{false, 0, 0, 0}; + AscendC::DataCopyPad(tensor1, input_[0][copyInOffset], copyInParams, padParams); + AscendC::DataCopyPad(tensor2, input_[1][copyInOffset], copyInParams, padParams); + } + + AscendC::GlobalTensor input_[INPUT_SIZE]; + DataType value_; +}; + +#endif diff --git a/atvc/examples/broadcast_to/broadcast_to.cpp index c3a0794413ded63fe04e34d679afeae1d02c9619..c7a4b16baeac5dfc80ced66e199ea5cb52c5c56b 100644 --- a/atvc/examples/broadcast_to/broadcast_to.cpp +++ b/atvc/examples/broadcast_to/broadcast_to.cpp @@ -38,53 +38,91 @@ bool IsClose(float a, float b) { return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -// BroadcastTo算子的描述:一个输入,一个输出,类型均为float -using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("[ERROR]: Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} + +void InitializeACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) +{ + CHECK_ACL(aclInit(nullptr)); + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + CHECK_ACL(aclrtCreateStream(&stream)); } +void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) +{ + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); +} + +void CleanUp(uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&yHost) +{ + CHECK_ACL(aclrtFree(xDevice)); + CHECK_ACL(aclrtFree(yDevice)); + CHECK_ACL(aclrtFreeHost(yHost)); +} + +// BroadcastTo算子的描述:一个输入,一个输出,类型均为float +using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; +} /* * 该函数为BroadcastCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * broadcastParam Device上的gm地址,指向运行态ATVC::BroadcastParam数据 + * x Device上的gm地址,指向BroadcastCustom算子第一个输入 + * y Device上的gm地址,指向BroadcastCustom算子第一个输出 + * broadcastParam 指向运行态ATVC::BroadcastParam数据 */ template -__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam) +__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + ATVC::BroadcastParam *param = &broadcastParam; + op.Run(x, y, param); } - +namespace { // 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 template void BroadcastOpAdapter(uint8_t* x, uint8_t* y, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy, aclrtStream& stream) { - // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 - uint8_t *paramDevice; + // 申请临时空间workspace uint8_t 
*workspaceDevice; CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); param.workspaceAddr = reinterpret_cast(workspaceDevice); - auto broadcastParamSize = sizeof(param); - CHECK_ACL(aclrtMalloc((void**)¶mDevice, broadcastParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, broadcastParamSize, reinterpret_cast(¶m), broadcastParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的BroadcastPolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::BROADCAST_POLICY0) { - BroadcastCustom<<>>(x, y, paramDevice); + BroadcastCustom<<>>(x, y, param); }else if (policy == ATVC::BROADCAST_POLICY1) { - BroadcastCustom<<>>(x, y, paramDevice); + BroadcastCustom<<>>(x, y, param); } else { - printf("[ERROR] Cannot find any matched policy.\n"); + printf("[ERROR]: Cannot find any matched policy.\n"); } // 流同步后释放申请的param内存 CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtFree(workspaceDevice)); - CHECK_ACL(aclrtFree(paramDevice)); +} } int32_t main(int32_t argc, char* argv[]) { + if (!ATVC::Host::DebugCheck()) { + printf("[ERROR]: Broadcast opTraits check failed.\n"); + return -1; + } int32_t eleNum = 1 * 1024; int32_t outEleNum = 8 * 1024; size_t inputByteSize = static_cast(eleNum) * sizeof(float); @@ -95,13 +133,10 @@ int32_t main(int32_t argc, char* argv[]) std::vector golden(outEleNum, 1.0f); printf("Generate golden data successfully.\n"); // 初始化Acl资源 - CHECK_ACL(aclInit(nullptr)); aclrtContext context; int32_t deviceId = 0; - CHECK_ACL(aclrtSetDevice(deviceId)); - CHECK_ACL(aclrtCreateContext(&context, deviceId)); aclrtStream stream = nullptr; - CHECK_ACL(aclrtCreateStream(&stream)); + InitializeACL(stream, context, deviceId); uint8_t *yHost; uint8_t *xDevice; uint8_t *yDevice; @@ -127,14 +162,8 @@ int32_t main(int32_t argc, char* argv[]) std::vector outputY(reinterpret_cast(yHost), reinterpret_cast(yHost) + outEleNum); // 释放Acl资源 - CHECK_ACL(aclrtFree(xDevice)); - CHECK_ACL(aclrtFree(yDevice)); - 
CHECK_ACL(aclrtFreeHost(yHost)); - - CHECK_ACL(aclrtDestroyStream(stream)); - CHECK_ACL(aclrtDestroyContext(context)); - CHECK_ACL(aclrtResetDevice(deviceId)); - CHECK_ACL(aclFinalize()); + CleanUp(xDevice, yDevice, yHost); + CleanACL(stream, context, deviceId); for (int32_t i = 0; i < outEleNum; i++) { if (!IsClose(golden[i], outputY[i])) { @@ -142,6 +171,9 @@ int32_t main(int32_t argc, char* argv[]) return -1; } } + if (!VerifyResults(golden, outputY)) { + return -1; + } printf("Accuracy verification passed.\n"); return 0; } diff --git a/atvc/examples/aclnn/README.md b/atvc/examples/ops_aclnn/README.md similarity index 93% rename from atvc/examples/aclnn/README.md rename to atvc/examples/ops_aclnn/README.md index 3bfe4acfd50a90daf8b3c0abc7aa46e018c7ccf6..802586c91f6cb52d69430f79563c2e3463c46a4e 100644 --- a/atvc/examples/aclnn/README.md +++ b/atvc/examples/ops_aclnn/README.md @@ -14,7 +14,7 @@ 快速执行example用例,更详细的流程请参阅[add算子](../add/README.md)。 -- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/1_quick_start.md)。 +- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/01_quick_start.md)。 - 导入ATVC环境变量 ```bash @@ -26,7 +26,7 @@ ```bash # 基于ATVC编译自定义Add算子 - $ cd ./atvc/examples/aclnn/add + $ cd ./atvc/examples/ops_aclnn/add # 以910B1为例,运行命令如下: $ bash install.sh -v Ascend910B1 # 安装custom包 @@ -71,10 +71,10 @@ - 2.1 复制需要的配置文件 - 将[func.cmake](./add/AddCustom/cmake/func.cmake)、host侧的[CMakeLists.txt](./add/AddCustom/op_host/CMakeLists.txt)和kernel侧的[CMakeLists.txt](./add/AddCustom/op_kernel/CMakeLists.txt)分别复制到`步骤1`生成的工程文件的对应目录下。 + 将[func.cmake](./add/AddCustom/cmake/func.cmake)、[intf.cmake](./add/AddCustom/cmake/intf.cmake)、host侧的[CMakeLists.txt](./add/AddCustom/op_host/CMakeLists.txt)和kernel侧的[CMakeLists.txt](./add/AddCustom/op_kernel/CMakeLists.txt)分别复制到`步骤1`生成的工程文件的对应目录下。 - 2.2 修改对应的host文件 - - 引入对应的头文件,修改对应TilingFunc函数中tiling的生成,根据算子类型调用不同的tiling生成策略,更多ATVC的用法可参考atvc的[开发指南](../../docs/2_developer_guide.md)。 + - 
引入对应的头文件,修改对应TilingFunc函数中tiling的生成,根据算子类型调用不同的tiling生成策略,更多ATVC的用法可参考atvc的[开发指南](../../docs/02_developer_guide.md)。 elewise类,参考[add_custom.cpp](./add/AddCustom/op_host/add_custom.cpp) ```cpp @@ -182,7 +182,7 @@ ``` - 2.3 修改对应的kernel文件 - 用户需要通过AscendC API来搭建Add算子的核心计算逻辑,在ATVC框架中,这类算子的核心计算逻辑是通过定义一个结构体的仿函数来实现。它需要`ATVC::OpTraits`作为固定模板参数,并重载`operator()`来被提供的Kernel层算子模板类调用,更多ATVC的用法可参考atvc的[开发指南](../../docs/2_developer_guide.md)。 + 用户需要通过AscendC API来搭建Add算子的核心计算逻辑,在ATVC框架中,这类算子的核心计算逻辑是通过定义一个结构体的仿函数来实现。它需要`ATVC::OpTraits`作为固定模板参数,并重载`operator()`来被提供的Kernel层算子模板类调用,更多ATVC的用法可参考atvc的[开发指南](../../docs/02_developer_guide.md)。 elewise类[add_custom.cpp](./add/AddCustom/op_kernel/add_custom.cpp) ```cpp @@ -214,8 +214,9 @@ // 修改核函数文件的实现 KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); REGISTER_TILING_DEFAULT(ATVC::EleWiseParam); + GET_TILING_DATA(param, tiling); auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, z, tiling); + op.Run(x, y, z, &param); ``` broadcast类 ```cpp @@ -228,10 +229,12 @@ // 修改核函数文件 KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); REGISTER_TILING_DEFAULT(ATVC::BroadcastParam); + GET_TILING_DATA(tilingData, tiling); // broadcast有不同的policy,在host与tilingkey进行绑定,此处的调用使用TILING_KEY_IS进行判断,和host的文件SetTilingKey相对应 if (TILING_KEY_IS(0)) { auto op = ATVC::Kernel::BroadcastOpTemplate, ATVC::BROADCAST_POLICY0>(); - op.Run(x, y, tiling); + ATVC::BroadcastParam param = &tilingData; + op.Run(x, y, &param); }else{ ... } @@ -250,7 +253,7 @@ GET_TILING_DATA(param, tiling); if (param.policyId == ATVC::REDUCE_POLICY0.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY0>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else { // 根据不同的tiling.policyId进行判断不同ReduceOpTemplate初始化 ... @@ -293,4 +296,4 @@ ### 步骤5.
调用执行算子工程 - 算子文件编写完成,参考[aclnn调用AddCustom算子工程(代码简化)](https://gitee.com/ascend/samples/blob/master/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocationNaive/README.md)进行编译验证 \ No newline at end of file + 算子文件编写完成,参考[aclnn调用AddCustom算子工程(代码简化)](https://gitee.com/ascend/samples/blob/master/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocationNaive/README.md)进行编译验证 \ No newline at end of file diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/CMakeLists.txt diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/README.md b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/README.md similarity index 97% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/README.md rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/README.md index 65acdddea5ed3db4d375d13a1dc2c66d1bd3f856..de9e6772fbde260270c8d215aa2fb516120693b7 100644 --- a/atvc/examples/aclnn/add/AclNNInvocationNaive/README.md +++ b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/README.md @@ -26,7 +26,7 @@ - 进入到样例目录 以命令行方式下载样例代码,master分支为例。 ```bash - cd atvc/examples/aclnn/add/AclNNInvocationNaive + cd atvc/examples/ops_aclnn/add/AclNNInvocationNaive ``` - 样例编译文件修改 diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/main.cpp b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/main.cpp similarity index 100% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/main.cpp rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/main.cpp diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/run.sh b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/run.sh similarity index 100% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/run.sh rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/run.sh diff --git 
a/atvc/examples/aclnn/add/AddCustom.json b/atvc/examples/ops_aclnn/add/AddCustom.json similarity index 100% rename from atvc/examples/aclnn/add/AddCustom.json rename to atvc/examples/ops_aclnn/add/AddCustom.json diff --git a/atvc/examples/aclnn/add/AddCustom/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AddCustom/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AddCustom/CMakeLists.txt diff --git a/atvc/examples/aclnn/add/AddCustom/cmake/func.cmake b/atvc/examples/ops_aclnn/add/AddCustom/cmake/func.cmake similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/cmake/func.cmake rename to atvc/examples/ops_aclnn/add/AddCustom/cmake/func.cmake diff --git a/atvc/examples/ops_aclnn/add/AddCustom/cmake/intf.cmake b/atvc/examples/ops_aclnn/add/AddCustom/cmake/intf.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4c8719a4f6ea941eb7a43d60359b4fdb4517c6b6 --- /dev/null +++ b/atvc/examples/ops_aclnn/add/AddCustom/cmake/intf.cmake @@ -0,0 +1,29 @@ + +add_library(intf_pub INTERFACE) +target_compile_options(intf_pub INTERFACE + -fPIC + -fvisibility=hidden + -fvisibility-inlines-hidden + $<$:-O2> + $<$:-O0 -g> + $<$:-std=c++17> + $<$,$>:-ftrapv -fstack-check> + $<$:-pthread -Wfloat-equal -Wshadow -Wformat=2 -Wno-deprecated -Wextra> + $,-fstack-protector-strong,-fstack-protector-all> +) +target_compile_definitions(intf_pub INTERFACE + _GLIBCXX_USE_CXX17_ABI=0 + $<$:_FORTIFY_SOURCE=2> +) +target_include_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/include + ${CMAKE_CURRENT_SOURCE_DIR}/op_kernel + $ENV{ATVC_PATH} +) +target_link_options(intf_pub INTERFACE + $<$,EXECUTABLE>:-pie> + $<$:-s> + -Wl,-z,relro + -Wl,-z,now + -Wl,-z,noexecstack +) +target_link_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/lib64) diff --git a/atvc/examples/aclnn/add/AddCustom/op_host/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AddCustom/op_host/CMakeLists.txt 
similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/op_host/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AddCustom/op_host/CMakeLists.txt diff --git a/atvc/examples/aclnn/add/AddCustom/op_host/add_custom.cpp b/atvc/examples/ops_aclnn/add/AddCustom/op_host/add_custom.cpp similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/op_host/add_custom.cpp rename to atvc/examples/ops_aclnn/add/AddCustom/op_host/add_custom.cpp diff --git a/atvc/examples/aclnn/add/AddCustom/op_kernel/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/CMakeLists.txt similarity index 92% rename from atvc/examples/aclnn/add/AddCustom/op_kernel/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AddCustom/op_kernel/CMakeLists.txt index 16f4f8e4aa572b35d908cb42f2c4ffbdb91b2c5d..178359e4b78e1dcd20f66ceef5c7a0ec86ab8d73 100644 --- a/atvc/examples/aclnn/add/AddCustom/op_kernel/CMakeLists.txt +++ b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/CMakeLists.txt @@ -8,5 +8,5 @@ if ("${CMAKE_BUILD_TYPE}x" STREQUAL "Debugx") add_ops_compile_options(ALL OPTIONS -g -O0) endif() -add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -I ${ATVC_PATH}) +add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -w -I ${ATVC_PATH}) add_kernels_compile() \ No newline at end of file diff --git a/atvc/examples/aclnn/add/AddCustom/op_kernel/add_custom.cpp b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/add_custom.cpp similarity index 91% rename from atvc/examples/aclnn/add/AddCustom/op_kernel/add_custom.cpp rename to atvc/examples/ops_aclnn/add/AddCustom/op_kernel/add_custom.cpp index d28d2c6c93b669480b2f0570b9301cd57a9105fa..73c14dac2e43c762c3cb01d7d2fbf56ac753ef6d 100644 --- a/atvc/examples/aclnn/add/AddCustom/op_kernel/add_custom.cpp +++ b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/add_custom.cpp @@ -25,7 +25,8 @@ struct AddComputeFunc { extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, 
GM_ADDR workspace, GM_ADDR tiling) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); - REGISTER_TILING_DEFAULT(ATVC::EleWiseParam); + REGISTER_TILING_DEFAULT(ATVC::EleWiseParam); + GET_TILING_DATA(param, tiling); auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, z, tiling); + op.Run(x, y, z, ¶m); } diff --git a/atvc/examples/aclnn/add/README.md b/atvc/examples/ops_aclnn/add/README.md similarity index 99% rename from atvc/examples/aclnn/add/README.md rename to atvc/examples/ops_aclnn/add/README.md index 358d7b113437f5663b7e67eae32568dec07b4ec4..42020ecffbbf614de35adac72434339bc0d7492b 100644 --- a/atvc/examples/aclnn/add/README.md +++ b/atvc/examples/ops_aclnn/add/README.md @@ -64,7 +64,7 @@ CANN软件包中提供了工程创建工具msOpGen,AddCustom算子工程可通 - 切换到msOpGen脚本install.sh所在目录 ```bash # 若开发者以git命令行方式clone了master分支代码,并切换目录 - cd atvc/examples/aclnn/add + cd atvc/examples/ops_aclnn/add ``` - 调用脚本,生成自定义算子工程,复制host和kernel实现并编译算子 diff --git a/atvc/examples/aclnn/add/install.sh b/atvc/examples/ops_aclnn/add/install.sh similarity index 100% rename from atvc/examples/aclnn/add/install.sh rename to atvc/examples/ops_aclnn/add/install.sh diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/README.md b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/README.md similarity index 97% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/README.md rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/README.md index 3fd20f3cac4dce4d504240835e5352e7a7015c8f..239ea433571ebe3f43e00a333dee6b8fe876f99c 100644 --- a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/README.md +++ 
b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/README.md @@ -26,7 +26,7 @@ - 进入到样例目录 以命令行方式下载样例代码,master分支为例。 ```bash - cd atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive + cd atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive ``` - 样例编译文件修改 diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/main.cpp b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/main.cpp similarity index 100% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/main.cpp rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/main.cpp diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/run.sh b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/run.sh similarity index 100% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/run.sh rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/run.sh diff --git a/atvc/examples/aclnn/reduce_sum/README.md b/atvc/examples/ops_aclnn/reduce_sum/README.md similarity index 99% rename from atvc/examples/aclnn/reduce_sum/README.md rename to atvc/examples/ops_aclnn/reduce_sum/README.md index 3503c46e6991fe907459745f11f8905230cf6ccc..0ae318021bc6f2a8f55d60ec27f640b480f52b47 100644 --- a/atvc/examples/aclnn/reduce_sum/README.md +++ b/atvc/examples/ops_aclnn/reduce_sum/README.md @@ -61,7 +61,7 @@ CANN软件包中提供了工程创建工具msOpGen,ReduceSumCustom算子工程 - 切换到msOpGen脚本install.sh所在目录 ```bash # 若开发者以git命令行方式clone了master分支代码,并切换目录 - cd ./atvc/examples/aclnn/reduce_sum + cd ./atvc/examples/ops_aclnn/reduce_sum ``` - 调用脚本,生成自定义算子工程,复制host和kernel实现并编译算子 diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom.json b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom.json similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom.json rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom.json diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt similarity index 100% 
rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake diff --git a/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/intf.cmake b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/intf.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4c8719a4f6ea941eb7a43d60359b4fdb4517c6b6 --- /dev/null +++ b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/intf.cmake @@ -0,0 +1,29 @@ + +add_library(intf_pub INTERFACE) +target_compile_options(intf_pub INTERFACE + -fPIC + -fvisibility=hidden + -fvisibility-inlines-hidden + $<$:-O2> + $<$:-O0 -g> + $<$:-std=c++17> + $<$,$>:-ftrapv -fstack-check> + $<$:-pthread -Wfloat-equal -Wshadow -Wformat=2 -Wno-deprecated -Wextra> + $,-fstack-protector-strong,-fstack-protector-all> +) +target_compile_definitions(intf_pub INTERFACE + _GLIBCXX_USE_CXX17_ABI=0 + $<$:_FORTIFY_SOURCE=2> +) +target_include_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/include + ${CMAKE_CURRENT_SOURCE_DIR}/op_kernel + $ENV{ATVC_PATH} +) +target_link_options(intf_pub INTERFACE + $<$,EXECUTABLE>:-pie> + $<$:-s> + -Wl,-z,relro + -Wl,-z,now + -Wl,-z,noexecstack +) +target_link_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/lib64) diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt diff --git 
a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt similarity index 92% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt index 1a719a60cf480bf01d15b9d1905038536d5e5ccb..72549d35d19a37bb438d58356ec026bf81d8f0c4 100644 --- a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt +++ b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt @@ -8,5 +8,5 @@ if ("${CMAKE_BUILD_TYPE}x" STREQUAL "Debugx") add_ops_compile_options(ALL OPTIONS -g -O0) endif() -add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -I ${ATVC_PATH}) +add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -w -I ${ATVC_PATH}) add_kernels_compile() \ No newline at end of file diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp similarity index 87% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp index 50a5c73f975c464282ae32cc46da52666ca19529..871f587d3c029cb95646187a2ec180fa38211423 100644 --- a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp +++ b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp @@ -20,72 +20,72 @@ 
extern "C" __global__ __aicore__ void reduce_sum_custom(GM_ADDR x, GM_ADDR y, GM GET_TILING_DATA(param, tiling); if (param.policyId == ATVC::REDUCE_POLICY0.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY0>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY1.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY1>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY2.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY2>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY3.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY3>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY4.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY4>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY5.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY5>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY6.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY6>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY7.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY7>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY8.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY8>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY9.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY9>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY10.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY10>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY11.ID)
{ auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY11>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY12.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY12>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY13.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY13>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY14.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY14>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY15.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY15>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY16.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY16>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY17.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY17>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY18.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY18>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY19.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY19>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY20.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY20>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY21.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY21>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY22.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY22>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } } diff --git
a/atvc/examples/aclnn/reduce_sum/install.sh b/atvc/examples/ops_aclnn/reduce_sum/install.sh similarity index 100% rename from atvc/examples/aclnn/reduce_sum/install.sh rename to atvc/examples/ops_aclnn/reduce_sum/install.sh diff --git a/atvc/examples/pytorch/README.md b/atvc/examples/ops_pytorch/README.md similarity index 85% rename from atvc/examples/pytorch/README.md rename to atvc/examples/ops_pytorch/README.md index 945054b730a2b53ca50fc14102e4bf89d65c174b..5f85b795bcd0b3014831f618424ceef9ca28cc22 100644 --- a/atvc/examples/pytorch/README.md +++ b/atvc/examples/ops_pytorch/README.md @@ -14,7 +14,7 @@ 快速执行example用例,更详细的流程请参阅[add算子](./add/README.md)。 -- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/1_quick_start.md)。 +- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/01_quick_start.md)。 - 导入ATVC环境变量 ```bash @@ -25,7 +25,7 @@ - 执行add用例 ```bash # 基于ATVC编译pytorch Add算子 - $ cd ./atvc/examples/pytorch/add + $ cd ./atvc/examples/ops_pytorch/add $ bash run.sh ... OK @@ -69,7 +69,7 @@ ``` ### 步骤3. 实现核函数,参考[add_custom_impl.h](./add/add_custom_impl.h) - ATVC提供的`ATVC::Kernel::EleWiseOpTemplate`算子模板类实现了核内的数据搬运、资源申请和计算调度功能。它将计算仿函数作为模板参数传入来完成构造实例化,用户可通过调用`ATVC::Kernel::EleWiseOpTemplate`算子模板类的`Run(Args&&... args)`接口完成算子的功能计算,完成完整核函数的实现。 在examples/add用例中,算子核函数的形式参数除了输入输出之外,还需额外传入`ATVC::EleWiseParam param`的形参,需要用户在初始化`ATVC::Kernel::EleWiseOpTemplate`时指定`ATVC::EleWiseParam*`的类型。该参数包含算子模板类进行数据搬运数据的必要参数,由`ATVC::Host::CalcEleWiseTiling` API计算得出。 + ATVC提供的`ATVC::Kernel::EleWiseOpTemplate`算子模板类实现了核内的数据搬运、资源申请和计算调度功能。它将计算仿函数作为模板参数传入来完成构造实例化,用户可通过调用`ATVC::Kernel::EleWiseOpTemplate`算子模板类的`Run(Args&&... 
args)`接口完成算子的功能计算,完成完整核函数的实现。 ```cpp /* @@ -83,15 +83,15 @@ __global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); + auto op = ATVC::Kernel::EleWiseOpTemplate, ATVC::EleWiseParam*>(); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类,并指明param的数据类型 - op.Run(x, y, z, &param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 + op.Run(x, y, z, &param); // 按照输入、输出、param的顺序传入Run函数 } ``` - 备注:reduce类算子在初始化的时候,` ATVC::Kernel::ReduceOpTemplate, Policy, void, void, ATVC::ReduceParam*>`时指定`ATVC::ReduceParam*`的类型;broadcast类算子在初始化的时候,` ATVC::Kernel::ATVC::Kernel::BroadcastOpTemplate, Policy, ATVC::BroadcastParam*>`时指定`ATVC::BroadcastParam*`的类型。 ### 步骤4. 编写torch入口[pytorch_ascendc_extension.cpp](./add/pytorch_ascendc_extension.cpp) - 不同的算子类型可参考[快速入门](../../docs/1_quick_start.md)中的模版选择模版进行选择,用户在此处通过`<<<>>>`的方式调用核函数,更多ATVC的用法可参考atvc的[开发指南](../../docs/2_developer_guide.md)。 + 不同的算子类型可参考[快速入门](../../docs/01_quick_start.md)中的模版选择模版进行选择,用户在此处通过`<<<>>>`的方式调用核函数,更多ATVC的用法可参考atvc的[开发指南](../../docs/02_developer_guide.md)。 ``` cpp // 头文件引入 @@ -185,7 +185,7 @@ ### 步骤7. 算子编译&执行 ```bash # 基于ATVC编译pytorch Add算子 - $ cd ./atvc/examples/pytorch/add + $ cd ./atvc/examples/ops_pytorch/add $ bash run.sh ...
OK diff --git a/atvc/examples/pytorch/add/README.md b/atvc/examples/ops_pytorch/add/README.md similarity index 97% rename from atvc/examples/pytorch/add/README.md rename to atvc/examples/ops_pytorch/add/README.md index bbd61f3088f05ca4526e7a0aafba85703575a8a2..39366061aee1a8aa7cc0495815ab77278adad311 100644 --- a/atvc/examples/pytorch/add/README.md +++ b/atvc/examples/ops_pytorch/add/README.md @@ -34,7 +34,7 @@ z = x + y - Atlas A2训练系列产品 ## 编译运行样例算子 -针对自定义算子工程,编译运行包含如下步骤: +针对pytorch算子,编译运行包含如下步骤: - 完成算子pytorch入口和impl文件的实现; - 编译pytorch算子的二进制文件; - 调用执行pytorch算子; @@ -80,7 +80,7 @@ z = x + y { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 - auto op = ATVC::Kernel::EleWiseOpTemplate, ATVC::EleWiseParam*>(); + auto op = ATVC::Kernel::EleWiseOpTemplate>(); op.Run(x, y, z, &param); } ``` @@ -129,7 +129,7 @@ z = x + y ### 4. 基于ATVC编写pytorch算子的调用验证 - 调用脚本,生成pytorch算子,并运行测试用例 ```bash - $ cd ./atvc/examples/pytorch/add + $ cd ./atvc/examples/ops_pytorch/add $ bash run.sh ...
OK diff --git a/atvc/examples/pytorch/add/add_custom_impl.h b/atvc/examples/ops_pytorch/add/add_custom_impl.h similarity index 98% rename from atvc/examples/pytorch/add/add_custom_impl.h rename to atvc/examples/ops_pytorch/add/add_custom_impl.h index 86a6912caaaa3264324a68740fa98a70fcb48566..34b7de96b6c5e26592a22f60575884d642347fe6 100644 --- a/atvc/examples/pytorch/add/add_custom_impl.h +++ b/atvc/examples/ops_pytorch/add/add_custom_impl.h @@ -46,7 +46,7 @@ __global__ __aicore__ void AddCustom(GM_ADDR x, GM_ADDR y, GM_ADDR z, ATVC::EleW { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 - auto op = ATVC::Kernel::EleWiseOpTemplate, ATVC::EleWiseParam*>(); + auto op = ATVC::Kernel::EleWiseOpTemplate>(); op.Run(x, y, z, &param); } #endif diff --git a/atvc/examples/pytorch/add/pytorch_ascendc_extension.cpp b/atvc/examples/ops_pytorch/add/pytorch_ascendc_extension.cpp similarity index 100% rename from atvc/examples/pytorch/add/pytorch_ascendc_extension.cpp rename to atvc/examples/ops_pytorch/add/pytorch_ascendc_extension.cpp diff --git a/atvc/examples/pytorch/add/run.sh b/atvc/examples/ops_pytorch/add/run.sh similarity index 100% rename from atvc/examples/pytorch/add/run.sh rename to atvc/examples/ops_pytorch/add/run.sh diff --git a/atvc/examples/pytorch/add/run_op.py b/atvc/examples/ops_pytorch/add/run_op.py similarity index 100% rename from atvc/examples/pytorch/add/run_op.py rename to atvc/examples/ops_pytorch/add/run_op.py diff --git a/atvc/examples/pytorch/reduce_sum/README.md b/atvc/examples/ops_pytorch/reduce_sum/README.md similarity index 96% rename from atvc/examples/pytorch/reduce_sum/README.md rename to atvc/examples/ops_pytorch/reduce_sum/README.md index b951656b9c174399dcae3e8825294cbadc64a5a1..23175eb3de04e4f1a31ecab53173f9c59999663a 100644 --- a/atvc/examples/pytorch/reduce_sum/README.md +++ b/atvc/examples/ops_pytorch/reduce_sum/README.md @@ -33,7 +33,7 @@ z = x + y - Atlas A2训练系列产品 ## 编译运行样例算子
-针对自定义算子工程,编译运行包含如下步骤: +针对pytorch算子,编译运行包含如下步骤: - 完成算子pytorch入口和impl文件的实现; - 编译pytorch算子的二进制文件; - 调用执行pytorch算子; @@ -62,16 +62,13 @@ z = x + y * reduceParam ATVC::ReduceParam */ template - __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR z, ATVC::ReduceParam param) + __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR z, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 // ATVC::ReduceParam* 为tiling的类型 auto op = ATVC::Kernel::ReduceOpTemplate, - Policy, - void, - void, - ATVC::ReduceParam*>(); + Policy>(); op.Run(x, y, &reduceParam); } ``` @@ -153,7 +150,7 @@ z = x + y ### 4. 基于ATVC编写pytorch算子的调用验证 - 调用脚本,生成pytorch算子,并运行测试用例 ```bash - $ cd ./atvc/examples/pytorch/reduce_sum + $ cd ./atvc/examples/ops_pytorch/reduce_sum $ bash run.sh ... OK diff --git a/atvc/examples/pytorch/reduce_sum/pytorch_ascendc_extension.cpp b/atvc/examples/ops_pytorch/reduce_sum/pytorch_ascendc_extension.cpp similarity index 100% rename from atvc/examples/pytorch/reduce_sum/pytorch_ascendc_extension.cpp rename to atvc/examples/ops_pytorch/reduce_sum/pytorch_ascendc_extension.cpp diff --git a/atvc/examples/pytorch/reduce_sum/reduce_sum_impl.h b/atvc/examples/ops_pytorch/reduce_sum/reduce_sum_impl.h similarity index 95% rename from atvc/examples/pytorch/reduce_sum/reduce_sum_impl.h rename to atvc/examples/ops_pytorch/reduce_sum/reduce_sum_impl.h index 670215c7ba865e50c966ae76ee1a26364138d0a4..dcf9fbbd8cfa34ae6897628c8cedd335b881ccba 100644 --- a/atvc/examples/pytorch/reduce_sum/reduce_sum_impl.h +++ b/atvc/examples/ops_pytorch/reduce_sum/reduce_sum_impl.h @@ -29,10 +29,7 @@ __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, ATVC::ReducePar KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, - Policy, - void, - 
void, - ATVC::ReduceParam*>(); + Policy>(); op.Run(x, y, &reduceParam); } diff --git a/atvc/examples/pytorch/reduce_sum/run.sh b/atvc/examples/ops_pytorch/reduce_sum/run.sh similarity index 100% rename from atvc/examples/pytorch/reduce_sum/run.sh rename to atvc/examples/ops_pytorch/reduce_sum/run.sh diff --git a/atvc/examples/pytorch/reduce_sum/run_op.py b/atvc/examples/ops_pytorch/reduce_sum/run_op.py similarity index 100% rename from atvc/examples/pytorch/reduce_sum/run_op.py rename to atvc/examples/ops_pytorch/reduce_sum/run_op.py diff --git a/atvc/examples/reduce_sum/reduce_sum.cpp b/atvc/examples/reduce_sum/reduce_sum.cpp index 43b393022690ea2be7ec24ec5361963bc569c8b3..cf1fe4cf5203a5a4130f73a4bb7163735ac378c2 100644 --- a/atvc/examples/reduce_sum/reduce_sum.cpp +++ b/atvc/examples/reduce_sum/reduce_sum.cpp @@ -56,93 +56,108 @@ bool VerifyResults(const std::vector &golden, const std::vector &o } return true; } + +void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) +{ + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); +} + +void CleanUp(uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&yHost) +{ + CHECK_ACL(aclrtFree(xDevice)); + CHECK_ACL(aclrtFree(yDevice)); + CHECK_ACL(aclrtFreeHost(yHost)); +} } /* * 该函数为ReduceCustom算子核函数入口 * x Device上的gm地址,指向Add算子第一个输入 * y Device上的gm地址,指向Add算子第一个输出 - * reduceParam Device上的gm地址,指向运行态ATVC::ReduceParam数据 + * reduceParam 指向运行态ATVC::ReduceParam数据 */ template -__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, GM_ADDR reduceParam) +__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, Policy>(); - op.Run(x, y, reduceParam); + op.Run(x, y, &reduceParam); } - +namespace { // 
负责Reduce类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 template void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam &param, ATVC::ReducePolicy &policy, aclrtStream& stream) { - // 申请临时空间workspace,并将其与ReduceTilingData一同传到Device侧 - uint8_t *paramDevice; + // 申请临时空间workspace uint8_t *workspaceDevice; CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); param.workspaceAddr = reinterpret_cast(workspaceDevice); - auto reduceParamSize = sizeof(param); - CHECK_ACL(aclrtMalloc((void**)&paramDevice, reduceParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, - reinterpret_cast(&param), reduceParamSize, - ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的ReducePolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::REDUCE_POLICY0) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY1) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY2) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY3) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY4) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY5) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY6) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY7) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY8) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY9) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY10) { - ReduceCustom<<>>(x, y, paramDevice);
+ ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY11) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY12) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY13) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY14) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY15) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY16) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY17) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY18) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY19) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY20) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY21) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); + } else if (policy == ATVC::REDUCE_POLICY22) { + ReduceCustom<<>>(x, y, param); } else { - printf("[ERROR] Cannot find any matched policy.\n"); + printf("[ERROR]: Cannot find any matched policy.\n"); } // 流同步后释放申请的param内存 CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtFree(workspaceDevice)); - CHECK_ACL(aclrtFree(paramDevice)); +} } int32_t main(int32_t argc, char* argv[]) { + if (!ATVC::Host::DebugCheck()) { + printf("[ERROR]: Reduce OpTraits check failed.\n"); + return -1; + } int32_t eleNum = 8 * 1024; int32_t outEleNum = 1 * 1024; size_t inputByteSize = static_cast(eleNum) * sizeof(float); @@ -172,8 +187,10 @@ int32_t main(int32_t argc, char* 
argv[])
     ATVC::ReduceParam param; // Reduce运行态参数,包含TilingData以及临时空间的相关信息
     ATVC::ReducePolicy policy = {-1, -1, -1}; // Reduce运行态参数,负责映射最适合的Reduce模板实现
+    ATVC::Host::ReduceTilingHyperParam hyperParam;
+    hyperParam.maxInnerA = 256; // 设置maxInnerA为256
     // Host侧调用Tiling API完成相关运行态参数的运算
-    if (!ATVC::Host::CalcReduceTiling(shape, dim, &policy, &param)) {
+    if (!ATVC::Host::CalcReduceTiling(shape, dim, &policy, &param, hyperParam)) {
         printf("Reduce tiling error.\n");
         return -1;
     };
@@ -185,14 +202,8 @@ int32_t main(int32_t argc, char* argv[])
     std::vector outputY(reinterpret_cast(yHost), reinterpret_cast(yHost) + outEleNum);
     // 释放Acl资源
-    CHECK_ACL(aclrtFree(xDevice));
-    CHECK_ACL(aclrtFree(yDevice));
-    CHECK_ACL(aclrtFreeHost(yHost));
-
-    CHECK_ACL(aclrtDestroyStream(stream));
-    CHECK_ACL(aclrtDestroyContext(context));
-    CHECK_ACL(aclrtResetDevice(deviceId));
-    CHECK_ACL(aclFinalize());
+    CleanUp(xDevice, yDevice, yHost);
+    CleanACL(stream, context, deviceId);

     if (!VerifyResults(golden, outputY)) {
         return -1;
diff --git a/atvc/examples/sinh_custom/sinh_custom.cpp b/atvc/examples/sinh_custom/sinh_custom.cpp
index e8e277b76e7117137579b4ab6880453d182c1e7d..b8a18a08b849b7805574b5d5eabc27d7c6fa6c42 100644
--- a/atvc/examples/sinh_custom/sinh_custom.cpp
+++ b/atvc/examples/sinh_custom/sinh_custom.cpp
@@ -109,16 +109,16 @@ void CleanACL(aclrtStream &stream, int32_t deviceId)
 /*
  * 该函数为SinhCustom算子核函数入口
- * x Device上的gm地址,指向Add算子第一个输入
- * y Device上的gm地址,指向Add算子第一个输出
+ * x Device上的gm地址,指向SinhCustom算子第一个输入
+ * y Device上的gm地址,指向SinhCustom算子第一个输出
- * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据
+ * param 运行态ATVC::EleWiseParam参数,按值传入核函数
  */
 template
-__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param)
+__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param)
 {
     KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核
     auto op = ATVC::Kernel::EleWiseOpTemplate>();
-    op.Run(x, y, param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注
+
op.Run(x, y, &param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注
 }

 int main()
@@ -148,22 +148,14 @@ int main()
     uint8_t *yHost;
     uint8_t *xDevice;
     uint8_t *yDevice;
-    uint8_t *paramDevice;
     CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
     CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
     CHECK_ACL(aclrtMallocHost((void **)(&yHost), outputByteSize));
     CHECK_ACL(aclrtMalloc((void **)&yDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
-    // 将tiling计算的运行时参数EleWiseParam param传到Device侧
-    auto elementParamSize = sizeof(param);
-    CHECK_ACL(aclrtMalloc((void**)&paramDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST));
-    CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize,
-                          reinterpret_cast(&param), elementParamSize,
-                          ACL_MEMCPY_HOST_TO_DEVICE));
-    // 调用自定义的Kernel API, <<<>>>的BlockNum参数可通过param的TilingData获取
-    SinhCustom<<>>(xDevice, yDevice, paramDevice);
+    SinhCustom<<>>(xDevice, yDevice, param);
     CHECK_ACL(aclrtSynchronizeStream(stream));
     CHECK_ACL(aclrtMemcpy(yHost, outputByteSize, yDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
@@ -172,7 +164,6 @@ int main()
     // 释放资源
     CHECK_ACL(aclrtFree(xDevice));
     CHECK_ACL(aclrtFree(yDevice));
-    CHECK_ACL(aclrtFree(paramDevice));
     CHECK_ACL(aclrtFreeHost(yHost));
     CleanACL(stream, deviceId);
diff --git a/atvc/examples/tanh_grad/README.md b/atvc/examples/tanh_grad/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5136b68b090b646c08912bbc2d3f24e91d3f9674
--- /dev/null
+++ b/atvc/examples/tanh_grad/README.md
@@ -0,0 +1,76 @@
+
+## 概述
+
+本样例介绍了利用ATVC实现TanhGrad单算子,并演示了相关的调试调优功能。
+
+## 目录结构
+
+| 文件名 | 描述 |
+| ------------------------------------------------------------ | ------------------------------------------------------------ |
+| [tanh_grad.cpp](./tanh_grad.cpp) | TanhGrad算子代码实现以及调用样例 |
+
+## 算子描述
+
+TanhGrad算子数学计算公式:$z = dy * (1 - y ^ 2)$
+
+TanhGrad算子规格:
+
+| 算子类型(OpType) | TanhGrad |
+| ---------------- | -------- |
+
+算子输入:
+
+| name | shape | data type | format |
+| ---- | -------- | --------- | ------ |
+| dy | 8 * 1024 | float | ND |
+| y | 8 * 1024 | float | ND |
+
+算子输出:
+
+| name | shape | data type | format |
+| ---- | -------- | --------- | ------ |
+| z | 8 * 1024 | float | ND |
+
+核函数名:TanhGrad
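The TanhGrad formula above is the chain rule applied to the forward pass: with $y = \tanh(x)$ and $\mathrm{d}\tanh(x)/\mathrm{d}x = 1 - \tanh^2(x) = 1 - y^2$, the gradient flowing back is $z = dy \cdot (1 - y^2)$. As a minimal sketch (the helper name `TanhGradGolden` is hypothetical, not part of the ATVC API), a host-side golden reference for this formula could look like:

```cpp
#include <cstddef>
#include <vector>

// Golden reference for TanhGrad: z[i] = dy[i] * (1 - y[i]^2),
// where y is the forward tanh output and dy the upstream gradient.
std::vector<float> TanhGradGolden(const std::vector<float> &dy, const std::vector<float> &y)
{
    std::vector<float> z(dy.size());
    for (std::size_t i = 0; i < dy.size(); ++i) {
        z[i] = dy[i] * (1.0f - y[i] * y[i]);
    }
    return z;
}
```

The sample's accuracy verification compares the kernel output element-wise against a CPU reference of exactly this shape.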
+
+## 算子基本功能验证
+在代码仓目录下执行:
+```bash
+$ cd ./atvc/tests/
+$ bash run_test.sh tanh_grad
+...
+Accuracy verification passed.
+```
+
+## 算子调试调优
+样例提供的主要调试调优方式如下:
+- 使用`ATVC::Host::EleWiseTilingHyperParam`构建超参,对`ATVC::Host::CalcEleWiseTiling()`接口实现Tiling调优
+- 使用`--run-mode=debug_print`进行DFX信息打印:
+在代码仓目录下执行:
+```bash
+$ cd ./atvc/tests/
+$ bash run_test.sh tanh_grad --run-mode=debug_print
+...
+[INFO]:[ATVC][EleWise]Start to run Template Fuction.
+...
+[INFO]:[ATVC][EleWise] Tiling data: blockNum = 8
+...
+[INFO]:[ATVC][EleWise][CopyIn]: Offset is 7168, copy count is 256.
+...
+[INFO]:[ATVC][EleWise]End to run Template Fuction.
+...
+Accuracy verification passed.
+```
+
+- 使用`--run-mode=profiling`开启Profiling,获取性能数据:
+在代码仓目录下执行:
+```bash
+$ cd ./atvc/tests/
+$ bash run_test.sh tanh_grad --run-mode=profiling
+...
+[INFO] Start Profiling ...
+...
+[INFO] Process profiling data complete, Data is saved in /xxx_path
+Accuracy verification passed.
+```
+
+更多详细的调试调优介绍参考[ATVC开发指南](../../docs/02_developer_guide.md)的`ATVC的调试调优功能`章节。
\ No newline at end of file
diff --git a/atvc/examples/tanh_grad/tanh_grad.cpp b/atvc/examples/tanh_grad/tanh_grad.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..7f8693d6050f1355ef641e09bbcbaa7120a8548e
--- /dev/null
+++ b/atvc/examples/tanh_grad/tanh_grad.cpp
@@ -0,0 +1,204 @@
+/**
+ * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+ *
+ * This file is a part of the CANN Open Software.
+ * Licensed under CANN Open Software License Agreement Version 1.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include "acl/acl.h"
+#include "elewise/elewise_host.h"
+#include "elewise/elewise_device.h"
+
+#define CHECK_ACL(x) \
+    do { \
+        aclError __ret = x; \
+        if (__ret != ACL_ERROR_NONE) { \
+            std::cerr << __FILE__ << ":" << __LINE__ << " aclError:" << __ret << std::endl; \
+        } \
+    } while (0)
+
+namespace {
+static constexpr float REL_TOL = 1e-3f;
+static constexpr float ABS_TOL = 1e-5f;
+
+struct MemoryPtrs {
+    uint8_t *zHost;
+    uint8_t *dyDevice;
+    uint8_t *yDevice;
+    uint8_t *zDevice;
+};
+
+// 判断两个浮点数是否足够接近
+bool IsClose(float a, float b)
+{
+    const float eps = 1e-40f; // 防止分母为零
+    float diff = std::abs(a - b);
+    return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps));
+}
+
+void InitializeData(int32_t eleNum, std::vector &inputDy, std::vector &inputY, std::vector &golden)
+{
+    std::random_device rd;
+    std::mt19937 gen(rd());
+    std::uniform_real_distribution dis(1.0f, 100.0f);
+
+    for (int i = 0; i < eleNum; ++i) {
+        inputDy[i] = (dis(gen));
+        inputY[i] = (dis(gen));
+    }
+    for (int i = 0; i < eleNum; ++i) {
+        // dy * (1 - y ^ 2)
+        golden[i] = (inputDy[i]) * (1 - inputY[i] * inputY[i]);
+    }
+}
+
+bool VerifyResults(const std::vector &golden, const std::vector &output)
+{
+    for (int32_t i = 0; i < golden.size(); i++) {
+        if (!IsClose(golden[i], output[i])) {
+            printf("[ERROR]: Accuracy verification failed!
The expected value of element "
+                "in index [%d] is %f, but actual value is %f.\n",
+                i,
+                golden[i],
+                output[i]);
+            return false;
+        }
+    }
+    return true;
+}
+
+void InitializeACL(aclrtContext &context, aclrtStream &stream, int32_t deviceId)
+{
+    CHECK_ACL(aclInit(nullptr));
+    CHECK_ACL(aclrtSetDevice(deviceId));
+    CHECK_ACL(aclrtCreateContext(&context, deviceId));
+    CHECK_ACL(aclrtCreateStream(&stream));
+}
+
+void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId)
+{
+    CHECK_ACL(aclrtDestroyStream(stream));
+    CHECK_ACL(aclrtResetDevice(deviceId));
+    CHECK_ACL(aclFinalize());
+}
+
+void CleanUp(uint8_t *&zHost, uint8_t *&dyDevice, uint8_t *&yDevice, uint8_t *&zDevice)
+{
+    CHECK_ACL(aclrtFree(dyDevice));
+    CHECK_ACL(aclrtFree(yDevice));
+    CHECK_ACL(aclrtFree(zDevice));
+    CHECK_ACL(aclrtFreeHost(zHost));
+}
+
+void MallocHostDeviceMemory(MemoryPtrs& memoryPtrs, size_t byteSize, std::vector& inputDy, std::vector& inputY)
+{
+    CHECK_ACL(aclrtMallocHost((void **)(&memoryPtrs.zHost), byteSize));
+    CHECK_ACL(aclrtMalloc((void **)&memoryPtrs.dyDevice, byteSize, ACL_MEM_MALLOC_HUGE_FIRST));
+    CHECK_ACL(aclrtMalloc((void **)&memoryPtrs.yDevice, byteSize, ACL_MEM_MALLOC_HUGE_FIRST));
+    CHECK_ACL(aclrtMalloc((void **)&memoryPtrs.zDevice, byteSize, ACL_MEM_MALLOC_HUGE_FIRST));
+
+    CHECK_ACL(aclrtMemcpy(memoryPtrs.dyDevice, byteSize, inputDy.data(), byteSize, ACL_MEMCPY_HOST_TO_DEVICE));
+    CHECK_ACL(aclrtMemcpy(memoryPtrs.yDevice, byteSize, inputY.data(), byteSize, ACL_MEMCPY_HOST_TO_DEVICE));
+}
+
+// TanhGrad算子中有两个输入,一个输出。类型均为float
+using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>;
+
+// 传入编译态参数ATVC::OpTraits
+template
+struct TanhGradComputeFunc {
+    /**
+     * \brief: Compute function of TanhGrad: z = dy * (1 - y ^ 2)
+     * \param [in] dy, input local tensor
+     * \param [in] y, input local tensor
+     * \param [out] z, output local tensor
+     * \return void
+     */
+    template
+    __aicore__ inline void operator()(
+        AscendC::LocalTensor dy, AscendC::LocalTensor y,
AscendC::LocalTensor z)
+    {
+        auto length = y.GetSize();
+        AscendC::Mul(y, y, y, length);
+        AscendC::Mul(y, dy, y, length);
+        AscendC::Sub(z, dy, y, length);
+    }
+};
+}
+
+template
+/**
+* \brief: Kernel entry of TanhGrad, kernel func is: z = dy * (1 - y ^ 2)
+* \param [in] dy, input global tensor
+* \param [in] y, input global tensor
+* \param [out] z, output global tensor
+* \return void
+*/
+__global__ __aicore__ void TanhGrad(GM_ADDR dy, GM_ADDR y, GM_ADDR z, ATVC::EleWiseParam param)
+{
+    KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);
+    auto op = ATVC::Kernel::EleWiseOpTemplate<
+        TanhGradComputeFunc>(); // 将TanhGradComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类
+    op.Run(dy, y, z, &param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算
+}
+
+int main()
+{
+    if (!ATVC::Host::DebugCheck()) {
+        printf("[ERROR]: ElementWise OpTraits check failed.\n");
+        return -1;
+    }
+    // eleNum描述EleWise单输入的元素个数
+    int32_t eleNum = 8 * 1024;
+    size_t byteSize = static_cast(eleNum) * sizeof(float);
+
+    std::vector inputDy(eleNum);
+    std::vector inputY(eleNum);
+    std::vector golden(eleNum);
+
+    // 生成输入数据
+    InitializeData(eleNum, inputDy, inputY, golden);
+    // 声明运行态参数param
+    ATVC::EleWiseParam param;
+    ATVC::Host::EleWiseTilingHyperParam hyperParam;
+    hyperParam.singleCoreBaseLine = 1024; // set base count for a single core to 1024
+    if (!ATVC::Host::CalcEleWiseTiling(eleNum, param, hyperParam)) {
+        printf("[ERROR]: Calculate eleWise tiling failed.\n");
+        return -1;
+    }
+    aclrtContext context;
+    int32_t deviceId = 0;
+    aclrtStream stream = nullptr;
+    InitializeACL(context, stream, deviceId);
+    MemoryPtrs memoryPtrs;
+    MallocHostDeviceMemory(memoryPtrs, byteSize, inputDy, inputY);
+    uint32_t blockNum = param.tilingData.blockNum;
+    TanhGrad<<>>(memoryPtrs.dyDevice, memoryPtrs.yDevice, memoryPtrs.zDevice, param);
+#if ATVC_DEBUG_MODE == 2 // 2: open profiling
+    for (int32_t i = 0; i < 19; i++) { // 19: run kernel 1 + 19 times for profiling
+        TanhGrad<<>>(memoryPtrs.dyDevice, memoryPtrs.yDevice, memoryPtrs.zDevice, param);
+    }
+#endif
+    CHECK_ACL(aclrtSynchronizeStream(stream));
+    CHECK_ACL(aclrtMemcpy(memoryPtrs.zHost, byteSize, memoryPtrs.zDevice, byteSize, ACL_MEMCPY_DEVICE_TO_HOST));
+    std::vector outputZ(reinterpret_cast(memoryPtrs.zHost), reinterpret_cast(memoryPtrs.zHost) + eleNum);
+
+    CleanUp(memoryPtrs.zHost, memoryPtrs.dyDevice, memoryPtrs.yDevice, memoryPtrs.zDevice);
+    CleanACL(stream, context, deviceId);
+
+    if (!VerifyResults(golden, outputZ)) {
+        return -1;
+    }
+    printf("Accuracy verification passed.\n");
+    return 0;
+}
diff --git a/atvc/include/atvc.h b/atvc/include/atvc.h
index 967bf442a91dec4c69f751dd7d1463d903a4686c..37b9c55517bcb650d52308001ccc9755922c4b04 100644
--- a/atvc/include/atvc.h
+++ b/atvc/include/atvc.h
@@ -12,6 +12,7 @@
 #define ATVC_ATVC_H

 #include "common/atvc_opdef.h"
+#include "common/atvc_op_check.h"
 #include "common/const_def.h"
 #include "elewise/common/elewise_common.h"
 #include "elewise/elewise_host.h"
diff --git a/atvc/include/broadcast/broadcast_compute.h b/atvc/include/broadcast/broadcast_compute.h
index 90388f328b0f7be2405d66ad5caf64705d7959df..dbd86ecdc415e0f3f67bd6c403431d5b1db14b32 100644
--- a/atvc/include/broadcast/broadcast_compute.h
+++ b/atvc/include/broadcast/broadcast_compute.h
@@ -16,20 +16,17 @@
 #include
"broadcast/common/broadcast_common.h" namespace ATVC { -template +template class BroadcastCompute { public: using inputDTypeList = typename OpTraits::In::types; using DataType = typename ATVC::TypeListGet::Type; - template - __aicore__ inline void Compute(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + template + __aicore__ inline void Compute(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { - if (patternID == ATVC::AB_PATTERN::ABA) { + if constexpr (PatternID == ATVC::AB_PATTERN::ABA) { ComputeBA(src, inputOffset, dst, dimA, dimB); } else { ComputeAB(src, inputOffset, dst, dimA, dimB); @@ -37,27 +34,21 @@ public: } private: - __aicore__ inline void ComputeBAByDataCopy(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeBAByDataCopy(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { AscendC::DataCopy(dst, src[inputOffset], dimA); uint32_t i = 1; - uint32_t cnt = 1; - while(i < dimB) { + uint32_t cnt; + while (i < dimB) { cnt = i > (dimB - i) ? 
(dimB - i) : i; AscendC::DataCopy(dst[dimA * i], dst, dimA * cnt); i += cnt; } } - __aicore__ inline void ComputeBA(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeBA(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { /* X1 X2 X3 X4 @@ -68,14 +59,11 @@ private: ComputeBAByDataCopy(src, inputOffset, dst, dimA, dimB); } - __aicore__ inline void ComputeABByBrcbCopy(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeABByBrcbCopy(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { - uint32_t brcbProcCnt = 8; // 一次brcb 处理8个元素 - uint32_t dSize = sizeof(DataType); + constexpr uint32_t brcbProcCnt = 8; // 一次brcb 处理8个元素 + constexpr uint32_t dSize = sizeof(DataType); AscendC::BrcbRepeatParams repeatParam(dimB * dSize / ATVC::UB_ALIGN_32, brcbProcCnt * dimB * dSize / ATVC::UB_ALIGN_32); AscendC::Brcb(dst, src[inputOffset], dimA / brcbProcCnt, repeatParam); @@ -84,28 +72,24 @@ private: while (i < dimB) { step = i * 2 > dimB ? 
(dimB - i) : i; // 2: 每次循环 将已拷贝长度为i的元素拷贝到下一个dst,要保证不超出dimB step = step * dSize / ATVC::UB_ALIGN_32; - uint16_t stride = (uint16_t)(dimB * dSize / ATVC::UB_ALIGN_32 - step); - AscendC::DataCopyParams repeatParam = { - (uint16_t)dimA, // blockCount [1, 4095] - step, // 单位为32B - stride, // 取值范围不能超uint16_t - stride}; // 取值范围不能超uint16_t + uint16_t stride = static_cast(dimB * dSize / ATVC::UB_ALIGN_32 - step); + AscendC::DataCopyParams repeatParam = {static_cast(dimA), // blockCount [1, 4095] + step, // 单位为32B + stride, // 取值范围不能超uint16_t + stride}; // 取值范围不能超uint16_t AscendC::DataCopy(dst[i], dst, repeatParam); i = i + step * ATVC::UB_ALIGN_32 / dSize; AscendC::PipeBarrier(); } } - __aicore__ inline void ComputeAB(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeAB(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { /* - X1 - X2 - X3 + X1 + X2 + X3 X4 -> X1 X1 @@ -116,5 +100,5 @@ private: ComputeABByBrcbCopy(src, inputOffset, dst, dimA, dimB); } }; -} +} // namespace ATVC #endif // ATVC_BROADCAST_COMPUTE_H \ No newline at end of file diff --git a/atvc/include/broadcast/broadcast_host.h b/atvc/include/broadcast/broadcast_host.h index 80fba92814ae79e8be737276c88b7266f6d95ee2..da1a6adfc1863f64887ca62d7c5942ccff6c8d28 100644 --- a/atvc/include/broadcast/broadcast_host.h +++ b/atvc/include/broadcast/broadcast_host.h @@ -17,61 +17,33 @@ #define ATVC_BROADCAST_HOST_H #include #include "common/atvc_opdef.h" +#include "common/atvc_op_check.h" #include "common/const_def.h" -#include "common/dtype_utils.h" #include "broadcast/common/broadcast_common.h" #include "broadcast/tiling/broadcast_tiling.h" namespace ATVC { namespace Host { -void PrintParam(BroadcastPolicy* policy, BroadcastParam* param) -{ - printf("[Broadcast] Tiling result: A0 = %lu\n", param->tilingData.A0); - printf("[Broadcast] Tiling result: A11 = %lu\n", 
param->tilingData.A11); - printf("[Broadcast] Tiling result: A12 = %lu\n", param->tilingData.A12); - printf("[Broadcast] Tiling result: A2 = %lu\n", param->tilingData.A2); - printf("[Broadcast] Tiling result: B0 = %lu\n", param->tilingData.B0); - printf("[Broadcast] Tiling result: B1 = %lu\n", param->tilingData.B1); - printf("[Broadcast] Tiling result: B2 = %lu\n", param->tilingData.B2); - printf("[Broadcast] Tiling result: coreNum = %d\n", param->tilingData.coreNum); - printf("[Broadcast] Tiling result: basicBlock = %lu\n", param->tilingData.basicBlock); - printf("[Broadcast] Tiling result: factorACntPerCore = %lu\n", param->tilingData.factorACntPerCore); - printf("[Broadcast] Tiling result: factorATotalCnt = %lu\n", param->tilingData.factorATotalCnt); - printf("[Broadcast] Tiling result: factorBCntPerCore = %lu\n", param->tilingData.factorBCntPerCore); - printf("[Broadcast] Tiling result: factorBTotalCnt = %lu\n", param->tilingData.factorBTotalCnt); - for (int32_t i = 0; i < ATVC::CONST2; i++) { - printf("[Broadcast] Tiling result: shape[%d] = %lu\n", i, param->tilingData.shape[i]); - printf("[Broadcast] Tiling result: dstShape[%d] = %lu\n", i, param->tilingData.dstShape[i]); - } - printf("[Broadcast] Tiling result: policy.patternID = %d\n", policy->patternID); - printf("[Broadcast] Tiling result: workspaceSize = %u\n", param->workspaceSize); - return; -} - -template +template bool CalcBroadcastTiling(std::vector shapeIn, std::vector shapeOut, BroadcastPolicy* policy, BroadcastParam* param) { - if(policy == nullptr || param == nullptr) { - printf("[ERROR] Invalid input: policy or param is null pointer!\n"); + if (policy == nullptr || param == nullptr) { + printf("[ERROR]: [ATVC][Broadcast] Invalid input: policy or param is null pointer!\n"); return false; } - struct BroadcastTilingHyperParam { - int32_t basicBlock = 16 * 1024; // 最大为UB的总大小的1/3 - int nBufferNum = 2; - }; + using inputDTypeList = typename OpTraits::In::types; using DataType = typename 
ATVC::TypeListGet::Type; auto inputDtype = GetOriInputType(); BroadcastTilingInputParam opInput = {shapeIn, shapeOut, inputDtype}; - OpTiling::BroadcastOpTiling tiling(opInput, policy, param); + OpTiling::BroadcastOpTiling tiling(opInput, policy, param); if (!tiling.Run()) { - printf("[ERROR] Tiling Error\n"); + printf("[ERROR]: [ATVC][Broadcast] Run tiling failed!\n"); return false; } - PrintParam(policy, param); return true; }; } // Host diff --git a/atvc/include/broadcast/broadcast_op_template.h b/atvc/include/broadcast/broadcast_op_template.h index 99cdc7d9834829374883145bc66b71be552f6e2b..8aa2755f5e13c14939e7966fdf4fe4ea66736fd6 100644 --- a/atvc/include/broadcast/broadcast_op_template.h +++ b/atvc/include/broadcast/broadcast_op_template.h @@ -12,9 +12,15 @@ #ifndef ATVC_BROADCAST_OP_TEMPLATE_H #define ATVC_BROADCAST_OP_TEMPLATE_H +#include +#include +#include "kernel_operator.h" +#include "common/atvc_opdef.h" #include "common/const_def.h" +#include "common/kernel_check_debug.h" #include "common/ops_utils_device.h" -#include "broadcast/broadcast_utils/broadcast_buf_pool.h" +#include "broadcast/utils/broadcast_buf_pool.h" +#include "broadcast/utils/broadcast_util.h" #include "broadcast/broadcast_compute.h" namespace ATVC { struct BroadcastDataView { @@ -32,56 +38,208 @@ struct BroadcastDataView { }; namespace Kernel { -template +template class BroadcastOpTemplate { public: + // for v-v fusion + static constexpr bool HAS_PRE_COMPUTE = !AscendC::Std::is_same_v; + static constexpr bool HAS_POST_COMPUTE = !AscendC::Std::is_same_v; + using PreComputeTraits = AscendC::Std::conditional_t::ComputeTraits, VoidComputeTraits>; + using PostComputeTraits = AscendC::Std::conditional_t::ComputeTraits, VoidComputeTraits>; + using PreInputs = typename PreComputeTraits::In::types; + using PreOutputs = typename PreComputeTraits::Out::types; + using PreTemp = typename PreComputeTraits::Temp::types; + using PostInputs = typename PostComputeTraits::In::types; + using PostOutputs = 
typename PostComputeTraits::Out::types; + using PostTemp = typename PostComputeTraits::Temp::types; using DataType = typename BroadcastCompute::DataType; + + static constexpr size_t PreInputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PreOutputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PreTempCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PostInputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PostOutputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PostTempCount = ATVC::TypeListSize::VALUE; + static constexpr size_t BroadcastInputCount = 1; + static constexpr size_t BroadcastOutputCount = 1; + static constexpr uint32_t DATA_SIZE = sizeof(DataType); + static constexpr uint32_t UB_ALIGN_COUNT = ATVC::UB_ALIGN_32 / DATA_SIZE; + __aicore__ inline BroadcastOpTemplate() {} + /* BroadcastOpTemplate对外运行接口,主要完成资源初始化、数据搬入、计算调度、数据搬出操作 @param src: 输入数据的gm指针 @param dst: 输出数据的gm指针 @broadcastParam: broadcast的动态参数,包含tiling data, workspace等 */ - template - __aicore__ inline void Run(GM_ADDR src, GM_ADDR dst, T1 broadcastParam) + template + __aicore__ inline void Run(Args&&... args) { - this->Init(src, dst, broadcastParam); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Start to run template function.\n"); + constexpr size_t PRE_ARGS_COUNT = HAS_PRE_COMPUTE ? PreInputCount + PreOutputCount - BroadcastInputCount : 0; + constexpr size_t BROADCAST_ARGS_COUNT = BroadcastInputCount + BroadcastOutputCount - HAS_PRE_COMPUTE - HAS_POST_COMPUTE; + constexpr size_t POST_ARGS_COUNT = HAS_POST_COMPUTE ? 
PostInputCount + PostOutputCount - BroadcastOutputCount : 0; + auto tuple = AscendC::Std::forward_as_tuple(AscendC::Std::forward(args)...); + SplitAndCall(tuple, + AscendC::Std::make_index_sequence{}, + AscendC::Std::make_index_sequence{}, + AscendC::Std::make_index_sequence{}, + AscendC::Std::make_index_sequence{} + ); + // Check param + ATVC::KernelUtils::PrintParam(param_); + if (!ATVC::KernelUtils::CheckParam(param_)) { + return; + } this->Process(); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] End to run template function.\n"); + } + + template + __aicore__ inline void SetParam(BroadcastParam *param, Args... args) + { + param_ = param; + tilingData_ = ¶m_->tilingData; + static constexpr int size = sizeof...(args); + if constexpr (size == 0) { + return; + } + if constexpr (HAS_PRE_COMPUTE) { + preCompute_.SetParam(args...); + } + if constexpr (HAS_POST_COMPUTE) { + postCompute_.SetParam(args...); + } } private: - template - __aicore__ inline void Init(GM_ADDR src, GM_ADDR dst, T1 broadcastParam) - { - param_ = reinterpret_cast(broadcastParam); - uint32_t srcDataSize = this->param_->tilingData.basicBlock; - uint32_t dstDataSize = this->param_->tilingData.basicBlock; - srcGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(src), srcDataSize); - dstGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(dst), dstDataSize); + template + __aicore__ inline void SplitAndCall(Tuple&& t, AscendC::Std::index_sequence, AscendC::Std::index_sequence, + AscendC::Std::index_sequence, AscendC::Std::index_sequence) + { + if constexpr (HAS_PRE_COMPUTE) { + InitPreArgs(AscendC::Std::get(AscendC::Std::forward(t))...); + } + InitBroadcastArgs(AscendC::Std::get(AscendC::Std::forward(t))...); + if constexpr (HAS_POST_COMPUTE) { + InitPostArgs(AscendC::Std::get(AscendC::Std::forward(t))...); + } + InitArgsParams(AscendC::Std::get(AscendC::Std::forward(t))...); + } + + template + __aicore__ inline void InitArgsParams(Args... 
args) + { + SetParam(args...); + if constexpr (!HAS_PRE_COMPUTE) { + uint32_t srcDataSize = tilingData_->basicBlock; + srcGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(src_), srcDataSize); + } + if constexpr (!HAS_POST_COMPUTE) { + uint32_t dstDataSize = tilingData_->basicBlock; + dstGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(dst_), dstDataSize); + } + inputCount_ = BroadcastInputCount; + outputCount_ = BroadcastOutputCount; + if (HAS_PRE_COMPUTE) { + inputCount_ = PreInputCount + PreTempCount + PreOutputCount; + } + if (HAS_POST_COMPUTE) { + outputCount_ = PostInputCount + PostTempCount + PostOutputCount; + } bufPool_.template Init(GetTPipePtr(), - ATVC::CONST2, // doublebuff需要的输入个数 - ATVC::CONST2, // 计算结果的个数,一般与inputNum保持一致 - this->param_->tilingData.A2 * this->param_->tilingData.A12 * DATA_SIZE, // 输入Tensor大小 - this->param_->tilingData.A2 * this->param_->tilingData.B2 * DATA_SIZE); // 输出Tensor大小 + inputCount_, // doublebuff需要的输入个数 + outputCount_, // 计算结果的个数,一般与inputNum保持一致 + tilingData_->A2 * tilingData_->A12 * DATA_SIZE, // 输入Tensor大小 + tilingData_->A2 * tilingData_->B2 * DATA_SIZE); // 输出Tensor大小 + } + + template + __aicore__ inline void InitArgsOutput(void) {} + + template + __aicore__ inline void InitArgsInput(void) {} + + template + __aicore__ inline void InitArgsOutput(GM_ADDR dst, Args... args) + { + dst_ = dst; + } + + template + __aicore__ inline void InitArgsInput(GM_ADDR src, Args... args) + { + src_ = src; + InitArgsOutput(args...); + } + + template + __aicore__ inline void InitBroadcastArgs(Args... args) + { + InitArgsInput(args...); + } + + template + __aicore__ inline void InitPreArgs(Args... args) + { + preCompute_.SetArgs(args...); + } + + template + __aicore__ inline void InitPostArgs(Args... args) + { + postCompute_.SetArgs(args...); + } + + template + __aicore__ inline void AllocLocalTensors(Args... 
args) + { + if constexpr (idx < num) { + AscendC::LocalTensor tensor; + bufPool_.template AllocTensor(tensor); + AllocLocalTensors(tensor, args...); + bufPool_.template FreeTensor(tensor); + } else { + if constexpr (isPreCompute) { + preCompute_(args...); + } else { + postCompute_(args...); + } + } + } + + template + __aicore__ inline void ProcessPreCompute(Args... args) + { + // first arg of args is the first input of postCompute, so only need to alloc PostInputCount - 1 tensors + constexpr int32_t localTensorCount = PreInputCount - 1 + PreOutputCount + PreTempCount; + AllocLocalTensors<0, localTensorCount, true>(args...); + } + + template + __aicore__ inline void ProcessPostCompute(Args... args) + { + // first arg of args is the first input of postCompute, so only need to alloc PostInputCount - 1 tensors + constexpr int32_t localTensorCount = PostInputCount - 1 + PostOutputCount + PostTempCount; + AllocLocalTensors<0, localTensorCount, false>(args...); } __aicore__ inline void CopyOutBatch(BroadcastDataView &view, uint32_t dimACount, AscendC::LocalTensor &output) { - uint32_t dimBCount = 0; - SyncDataQueue(); + uint32_t dimBCount = 0; for (int i = 0; i < view.B1; i++) { uint32_t copyOutOffset; if (SelectBroadcastPolicy.patternID == AB_PATTERN::ABA) { - copyOutOffset = dimBCount * view.dimASize + dimACount * this->param_->tilingData.A2; + copyOutOffset = dimBCount * view.dimASize + dimACount * tilingData_->A2; } else { - copyOutOffset = dimACount * this->param_->tilingData.A2 * view.dimBSize + dimBCount; + copyOutOffset = dimACount * tilingData_->A2 * view.dimBSize + dimBCount; } CopyOut(output, copyOutOffset + view.copyOutBaseOffset, view); - dimBCount += this->param_->tilingData.B2; - AscendC::PipeBarrier(); + dimBCount += tilingData_->B2; } } @@ -94,9 +252,9 @@ private: AscendC::LocalTensor input; for (int i = 0; i < view.A11; i++) { inputOffset = 0; - bufPool_.AllocTensor(input); - uint32_t copyInOffset = i * view.A12 * this->param_->tilingData.A2; - if 
(this->param_->tilingData.A0 != 1) { + bufPool_.template AllocTensor(input); + uint32_t copyInOffset = i * view.A12 * tilingData_->A2; + if (tilingData_->A0 != 1) { copyInOffset += view.dimAOffset; } if (copyInOffset >= view.dimASize) { @@ -105,28 +263,28 @@ private: if (copyInOffset + view.copyInSize > view.dimASize) { // 剩下的数据不够一次完整计算, 根据实际数据重新计算 view.copyInSize = view.dimASize - copyInOffset; - view.A12 = OpsUtils::CeilDiv(view.copyInSize, this->param_->tilingData.A2); + view.A12 = OpsUtils::CeilDiv(view.copyInSize, tilingData_->A2); } + CopyIn(input, copyInOffset, view); - bufPool_.SetVecSync(input); - bufPool_.WaitVecSync(input); + bufPool_.template SetVecSync(input); + bufPool_.template WaitVecSync(input); for (int j = 0; j < view.A12; j ++) { AscendC::LocalTensor output; - bufPool_.AllocTensor(output); - SyncDataQueue(); + bufPool_.template AllocTensor(output); + compute_.template Compute(input, inputOffset, output, - OpsUtils::CeilAlign(this->param_->tilingData.A2, UB_ALIGN_COUNT), - OpsUtils::CeilAlign(this->param_->tilingData.B2, UB_ALIGN_COUNT)); - bufPool_.SetCopyOutSync(output); - bufPool_.WaitCopyOutSync(output); + OpsUtils::CeilAlign(tilingData_->A2, UB_ALIGN_COUNT), + OpsUtils::CeilAlign(tilingData_->B2, UB_ALIGN_COUNT)); CopyOutBatch(view, dimACount, output); - bufPool_.FreeTensor(output); + bufPool_.template FreeTensor(output); dimACount++; - inputOffset += this->param_->tilingData.A2; - SyncDataQueue(); + inputOffset += tilingData_->A2; + bufPool_.template SetCopyOutSync(output); + bufPool_.template WaitCopyOutSync(output); } - SyncDataQueue(); - bufPool_.FreeTensor(input); + + bufPool_.template FreeTensor(input); } bufPool_.ResetEvent(); } @@ -136,17 +294,17 @@ private: uint32_t copyOutBaseOffset = 0; // 计算拷出偏移基址 if (SelectBroadcastPolicy.patternID == AB_PATTERN::ABA) { - if (this->param_->tilingData.A0 != 1) { // 核间A切分, 取部分A + if (tilingData_->A0 != 1) { // 核间A切分, 取部分A copyOutBaseOffset += view.dimAOffset; } - if (this->param_->tilingData.B0 != 
1) { // 核间B切分,取部分B + if (tilingData_->B0 != 1) { // 核间B切分,取部分B copyOutBaseOffset += view.dimBOffset * view.dimASize; } } else { - if (this->param_->tilingData.A0 != 1) { // 核间A切分, 取部分A + if (tilingData_->A0 != 1) { // 核间A切分, 取部分A copyOutBaseOffset += view.dimAOffset * view.dimBSize; } - if (this->param_->tilingData.B0 != 1) { // 核间B切分,取部分B + if (tilingData_->B0 != 1) { // 核间B切分,取部分B copyOutBaseOffset += view.dimBOffset; } } @@ -156,44 +314,44 @@ private: __aicore__ inline void CalcView(BroadcastDataView &view) { if (SelectBroadcastPolicy.patternID == AB_PATTERN::ABA) { - view.dimASize = this->param_->tilingData.dstShape[1]; - view.dimBSize = this->param_->tilingData.dstShape[0]; + view.dimASize = tilingData_->dstShape[1]; + view.dimBSize = tilingData_->dstShape[0]; view.inShape[0] = 1; - view.inShape[1] = this->param_->tilingData.A2; - view.outShape[0] = this->param_->tilingData.B2; - view.outShape[1] = this->param_->tilingData.A2; + view.inShape[1] = tilingData_->A2; + view.outShape[0] = tilingData_->B2; + view.outShape[1] = tilingData_->A2; } else { - view.dimASize = this->param_->tilingData.dstShape[0]; - view.dimBSize = this->param_->tilingData.dstShape[1]; - view.inShape[0] = this->param_->tilingData.A2; + view.dimASize = tilingData_->dstShape[0]; + view.dimBSize = tilingData_->dstShape[1]; + view.inShape[0] = tilingData_->A2; view.inShape[1] = 1; - view.outShape[0] = this->param_->tilingData.A2; - view.outShape[1] = this->param_->tilingData.B2; + view.outShape[0] = tilingData_->A2; + view.outShape[1] = tilingData_->B2; } - view.A11 = this->param_->tilingData.A11; - view.A12 = this->param_->tilingData.A12; - view.B1 = this->param_->tilingData.B1; + view.A11 = tilingData_->A11; + view.A12 = tilingData_->A12; + view.B1 = tilingData_->B1; uint32_t blockId = AscendC::GetBlockIdx(); - uint32_t dimAIdx = blockId / this->param_->tilingData.B0; - uint32_t dimBIdx = blockId % this->param_->tilingData.factorBTotalCnt; - view.dimAOffset = dimAIdx * 
this->param_->tilingData.factorACntPerCore; - view.dimBOffset = dimBIdx * this->param_->tilingData.factorBCntPerCore; + uint32_t dimAIdx = blockId / tilingData_->B0; + uint32_t dimBIdx = blockId % tilingData_->factorBTotalCnt; + view.dimAOffset = dimAIdx * tilingData_->factorACntPerCore; + view.dimBOffset = dimBIdx * tilingData_->factorBCntPerCore; // 计算一次计算的输入数据大小 - view.copyInSize = view.A12 * this->param_->tilingData.A2; // 一次拷贝A12份数据, for循环计算A12次 - if (view.dimAOffset + this->param_->tilingData.factorACntPerCore > view.dimASize) { + view.copyInSize = view.A12 * tilingData_->A2; // 一次拷贝A12份数据, for循环计算A12次 + if (view.dimAOffset + tilingData_->factorACntPerCore > view.dimASize) { // 剩下的A维度的数据不够每个核分到的A数目,重新计算实际的A维度切分 uint32_t realShape = view.dimASize - view.dimAOffset; - uint32_t A1 = OpsUtils::CeilDiv(realShape, this->param_->tilingData.A2); - if (A1 < view.A12) { + uint32_t dimA1 = OpsUtils::CeilDiv(realShape, tilingData_->A2); + if (dimA1 < view.A12) { view.A11 = 1; - view.A12 = A1; + view.A12 = dimA1; } else { - view.A11 = OpsUtils::CeilDiv(A1, view.A12); + view.A11 = OpsUtils::CeilDiv(dimA1, view.A12); } } - if (view.dimBOffset + this->param_->tilingData.factorBCntPerCore > view.dimBSize) { + if (view.dimBOffset + tilingData_->factorBCntPerCore > view.dimBSize) { uint32_t realShape = view.dimBSize - view.dimBOffset; - view.B1 = OpsUtils::CeilDiv(realShape, this->param_->tilingData.B2); + view.B1 = OpsUtils::CeilDiv(realShape, tilingData_->B2); } view.copyOutBaseOffset = CalcCopyOutBaseOffset(view); } @@ -206,7 +364,13 @@ private: copyInParams.blockLen = view.copyInSize * DATA_SIZE; copyInParams.srcStride = 0; copyInParams.dstStride = 0; + if constexpr(HAS_PRE_COMPUTE) { + ProcessPreCompute(input, copyInOffset, copyInParams); + return; + } AscendC::DataCopyPad(input, srcGlobal_[copyInOffset], copyInParams, padParams); + ATVC::Kernel::DebugPrintf("[INFO]: [ATVC][Broadcast][CopyIn] Offset is %u, block len is %u " + "block count is %u.\n", copyInOffset, 
copyInParams.blockLen, copyInParams.blockCount); } __aicore__ inline void CopyOutNonAligned(AscendC::LocalTensor &output, @@ -214,23 +378,32 @@ private: { uint32_t blockId = AscendC::GetBlockIdx(); uint32_t dstDataSize = view.outShape[0] * view.outShape[1]; - uint64_t dstShape = this->param_->tilingData.dstShape[1]; + uint64_t dstShape = tilingData_->dstShape[1]; AscendC::DataCopyExtParams copyOutParams; copyOutParams.blockLen = view.outShape[1] * DATA_SIZE; copyOutParams.blockCount = dstDataSize * DATA_SIZE / copyOutParams.blockLen; copyOutParams.srcStride = 0; + if (view.outShape[1] + copyOutOffset % dstShape > dstShape) { // 列非对齐, 按实际数据拷贝 copyOutParams.srcStride = OpsUtils::CeilAlign(view.outShape[1], UB_ALIGN_COUNT) * DATA_SIZE; copyOutParams.blockLen = (dstShape - copyOutOffset % dstShape) * DATA_SIZE; copyOutParams.srcStride = (copyOutParams.srcStride - copyOutParams.blockLen) / ATVC::UB_ALIGN_32; } - if (view.outShape[0] + copyOutOffset / dstShape > this->param_->tilingData.dstShape[0]) { + if (view.outShape[0] + copyOutOffset / dstShape > tilingData_->dstShape[0]) { // 行非对齐, 按实际数据拷贝 - copyOutParams.blockCount = (this->param_->tilingData.dstShape[0] - copyOutOffset / dstShape); + copyOutParams.blockCount = (tilingData_->dstShape[0] - copyOutOffset / dstShape); } copyOutParams.dstStride = dstShape * DATA_SIZE - copyOutParams.blockLen; + bufPool_.template SetCopyOutSync(output); + bufPool_.template WaitCopyOutSync(output); + if constexpr(HAS_POST_COMPUTE) { + ProcessPostCompute(output, copyOutOffset, copyOutParams); + return; + } AscendC::DataCopyPad(dstGlobal_[copyOutOffset], output, copyOutParams); + ATVC::Kernel::DebugPrintf("[INFO]: [ATVC][Broadcast][CopyOut] Offset is %u, block len is %u block count is %u.\n", + copyOutOffset, copyOutParams.blockLen, copyOutParams.blockCount); } __aicore__ inline void CopyOut(AscendC::LocalTensor &output, uint32_t copyOutOffset, BroadcastDataView &view) @@ -238,14 +411,19 @@ private: CopyOutNonAligned(output, 
copyOutOffset, view); } + GM_ADDR src_; + GM_ADDR dst_; AscendC::GlobalTensor srcGlobal_; AscendC::GlobalTensor dstGlobal_; BroadcastCompute compute_; - ParamPtr param_; - KernelUtils::BroadcastBufPool bufPool_; - constexpr static uint32_t DATA_SIZE = sizeof(DataType); - constexpr static uint32_t UB_ALIGN_COUNT = ATVC::UB_ALIGN_32 / DATA_SIZE; + AscendC::Std::conditional_t preCompute_; + AscendC::Std::conditional_t postCompute_; + const BroadcastParam *param_; + const BroadcastOpTilingData *tilingData_; + KernelUtils::BroadcastBufPool bufPool_; + uint32_t inputCount_; + uint32_t outputCount_; }; } // namespace Kernel } // namespace ATVC -#endif // ATVC_BROADCAST_OP_TEMPLATE_H \ No newline at end of file +#endif // ATVC_BROADCAST_OP_TEMPLATE_H diff --git a/atvc/include/broadcast/common/broadcast_common.h b/atvc/include/broadcast/common/broadcast_common.h index fcdd0c0ff194596374ea5f58ba196f364fba1105..0652209625073b24b43154629a31264a5c05b42f 100644 --- a/atvc/include/broadcast/common/broadcast_common.h +++ b/atvc/include/broadcast/common/broadcast_common.h @@ -20,12 +20,12 @@ namespace ATVC { namespace AB_PATTERN { - static constexpr uint32_t A = 100; - static constexpr uint32_t AB = 11; - static constexpr uint32_t ABA = 20; - static constexpr uint32_t ABAB = 31; - static constexpr uint32_t ABABA = 40; -}; +static constexpr uint32_t A = 100; +static constexpr uint32_t AB = 11; +static constexpr uint32_t ABA = 20; +static constexpr uint32_t ABAB = 31; +static constexpr uint32_t ABABA = 40; +}; // namespace AB_PATTERN struct BroadcastPolicy { public: @@ -36,8 +36,8 @@ public: patternID(patternID_), loopABCount(loopABCount_),loopInnerABCount(loopInnerABCount_){} bool operator==(const BroadcastPolicy& rhs) const { - return this->patternID == rhs.patternID && this->loopABCount == rhs.loopABCount &&\ - this->loopInnerABCount == rhs.loopInnerABCount; + return this->patternID == rhs.patternID && this->loopABCount == rhs.loopABCount && + this->loopInnerABCount == 
rhs.loopInnerABCount; } }; @@ -68,8 +68,8 @@ struct BroadcastParam { int32_t nBufferNum = 2; }; -static constexpr BroadcastPolicy BROADCAST_POLICY0 { ATVC::AB_PATTERN::AB, 10, 1 }; -static constexpr BroadcastPolicy BROADCAST_POLICY1 { ATVC::AB_PATTERN::ABA, 10, 1 }; -}; +static constexpr BroadcastPolicy BROADCAST_POLICY0{ATVC::AB_PATTERN::AB, 10, 1}; +static constexpr BroadcastPolicy BROADCAST_POLICY1{ATVC::AB_PATTERN::ABA, 10, 1}; +}; // namespace ATVC -#endif // ATVC_BROADCAST_COMMON_H \ No newline at end of file +#endif // ATVC_BROADCAST_COMMON_H \ No newline at end of file diff --git a/atvc/include/broadcast/tiling/broadcast_tiling.h b/atvc/include/broadcast/tiling/broadcast_tiling.h index 10dfc8fe7f8b333923e02407904fc145e099bc91..09bf4f55d8433fcbde6f9effc245bf7dbe32adc6 100644 --- a/atvc/include/broadcast/tiling/broadcast_tiling.h +++ b/atvc/include/broadcast/tiling/broadcast_tiling.h @@ -38,31 +38,44 @@ struct BroadcastTilingInputParam { } namespace OpTiling { -constexpr static int32_t BRC_BASIC_NUM = 4; // broadcast输入输出内存基本块分配个数 + +template class BroadcastOpTiling { -public: +public: + using OpInputs = typename OpTraits::In::types; + using OpOutputs = typename OpTraits::Out::types; + using OpTemp = typename OpTraits::Temp::types; + static constexpr size_t OP_INPUT_COUNT = ATVC::TypeListSize::VALUE; + static constexpr size_t OP_OUTPUT_COUNT = ATVC::TypeListSize::VALUE; + static constexpr size_t OP_TEMP_COUNT = ATVC::TypeListSize::VALUE; + static constexpr size_t BROADCAST_UB_NUM = OP_INPUT_COUNT + OP_OUTPUT_COUNT + OP_TEMP_COUNT; + BroadcastOpTiling(ATVC::BroadcastTilingInputParam& inputParam, ATVC::BroadcastPolicy* policy, ATVC::BroadcastParam* param) : opInput_(inputParam), param_(param), policy_(policy) { compileInfo_ = ATVC::GetOpCompileInfo(); + /* Built-in tiling only support allocate unified memory evenly, so all we need to know is the definition of + the complete operator. 
If you want to allocate UB memory unevenly, you need to know the definitions of + broadcast, pre-compute, post-compute separately, and extend tiling to support non-uniform distribution. */ + broadcast_basic_num_ = BROADCAST_UB_NUM * param->nBufferNum; } bool Run() { if (!IsAxesValid(opInput_.shapeIn, opInput_.shapeOut)) { - printf("[ERROR]Shape checkout failed!\n"); + printf("[ERROR]: [ATVC][Broadcast] Shape check failed!\n"); return false; } std::vector newShapeIn; std::vector newShapeOut; if (!EliminateOne(opInput_.shapeIn, opInput_.shapeOut, newShapeIn, newShapeOut)) { - printf("[ERROR]Failed to eliminate shape!\n"); + printf("[ERROR]: [ATVC][Broadcast] Failed to eliminate shape!\n"); return false; } if (!DoTiling(newShapeIn, newShapeOut)) { - printf("[ERROR]Failed to Calculate Tiling param!\n"); + printf("[ERROR]: [ATVC][Broadcast] Failed to calculate tiling param!\n"); return false; } CalcWorkSpace(); @@ -71,7 +84,7 @@ public: private: template - void ComputeStride(std::vector& shapeIn, std::vector& shapeOut) + void ComputeStride(const std::vector& shapeIn, const std::vector& shapeOut) { // shape if (shapeIn[0] == 1 && shapeOut[ATVC::DIM0] == 1) { @@ -95,10 +108,10 @@ private: param_->tilingData.dstStride[ATVC::DIM0] = dimB * dimA; param_->tilingData.dstStride[ATVC::DIM1] = dimB; } else { - param_->tilingData.stride[ATVC::DIM0] = dimA * 1; + param_->tilingData.stride[ATVC::DIM0] = dimA; param_->tilingData.stride[ATVC::DIM1] = dimA; param_->tilingData.stride[ATVC::DIM2] = 1; - param_->tilingData.dstStride[ATVC::DIM0] = dimA * dimB * 1; + param_->tilingData.dstStride[ATVC::DIM0] = dimA * dimB; param_->tilingData.dstStride[ATVC::DIM1] = dimB; param_->tilingData.dstStride[ATVC::DIM2] = 1; } @@ -123,7 +136,7 @@ private: size_t sizeIn = shapeIn.size(); size_t sizeOut = shapeOut.size(); if (sizeOut != sizeIn) { - printf("input dim in is not equel to output dim! 
\n"); + printf("input dim in is not equal to output dim!\n"); return false; }; @@ -133,16 +146,14 @@ private: return false; } else if (shapeIn[i] <= 0) { printf("Input and output shape should be more than 0\n"); - return false; + return false; } } return true; } - bool EliminateOne(std::vector &oriShapeIn, - std::vector &oriShapeOut, - std::vector &shapeIn, - std::vector &shapeOut) + bool EliminateOne(std::vector& oriShapeIn, std::vector& oriShapeOut, + std::vector& shapeIn, std::vector& shapeOut) { bool isCurB = false; bool haveA = false; @@ -151,7 +162,7 @@ private: for (size_t i = 0; i < oriShapeIn.size(); i++) { if (oriShapeIn[i] == 1 && oriShapeOut[i] != oriShapeIn[i]) { // B轴 if (!isCurB && haveB) { - printf("[ERROR]Only support AB/BA!\n"); + printf("[ERROR]: [ATVC][Broadcast] Only support AB/BA!\n"); return false; } if (!haveB) { @@ -165,7 +176,7 @@ private: haveB = true; } else { // A轴 if (isCurB && haveA) { - printf("[ERROR]Only support AB/BA!\n"); + printf("[ERROR]: [ATVC][Broadcast] Only support AB/BA!\n"); return false; } if (!haveA) { @@ -179,8 +190,8 @@ private: haveA = true; } } - if (shapeIn.size() !=2U && shapeOut.size() != 2U) { - printf("[ERROR] Shape after eliminate is not 2 dim!\n"); + if (shapeIn.size() != 2U && shapeOut.size() != 2U) { + printf("[ERROR]: [ATVC][Broadcast] Shape after elimination is not 2-dim!\n"); return false; } if (shapeIn[0] != shapeOut[0]) { @@ -193,9 +204,6 @@ private: bool DoTiling(std::vector& shapeIn, std::vector& shapeOut) { int32_t shapeSize = shapeIn.size(); - for (int32_t i = 0; i < shapeSize; i++) { - printf("DoTiling shapeSize[%d]: shape[%d] %lu\n", shapeSize, i, shapeIn[i]); - } switch (shapeSize) { case ATVC::CONST1: return ComputeTiling(shapeIn, shapeOut); @@ -204,7 +212,8 @@ private: case ATVC::CONST3: return ComputeTiling(shapeIn, shapeOut); default: - printf("[ERROR] Compute tiling error because of invalid input shape size[%d]\n", shapeSize); + printf("[ERROR]: [ATVC][Broadcast] Compute tiling error because 
of invalid input shape size[%d]\n", + shapeSize); return false; } return false; @@ -214,7 +223,7 @@ private: bool ComputeTiling(std::vector& shapeIn, std::vector& shapeOut) { if (!CalcSplitParam(shapeOut)) { - printf("[ERROR] Calculate tiling param failed!\n"); + printf("[ERROR]: [ATVC][Broadcast] Calculate tiling param failed!\n"); return false; } ComputeStride(shapeIn, shapeOut); @@ -224,7 +233,7 @@ private: uint64_t CalcBasicBlock() { - uint64_t basicBlock = OpsUtils::FloorAlign(compileInfo_.ubSize / BRC_BASIC_NUM, ATVC::UB_ALIGN_32); + uint64_t basicBlock = OpsUtils::FloorAlign(compileInfo_.ubSize / broadcast_basic_num_, ATVC::UB_ALIGN_32); if (basicBlock > ATVC::BLOCK_SIZE_64K) { basicBlock = ATVC::BLOCK_SIZE_64K; } else if (basicBlock > ATVC::BLOCK_SIZE_48K) { @@ -251,29 +260,31 @@ private: ATVC::BroadcastOpTilingData& tilingData = param_->tilingData; uint64_t dSize = ge::GetSizeByDataType(opInput_.inputDtype); if (dSize == 0) { - printf("[ERROR] Data size is invalid, please check input data type!\n"); + printf("[ERROR]: [ATVC][Broadcast] Data size is invalid, please check input data type!\n"); return false; } if (tilingData.coreNum > compileInfo_.vectorCoreNum) { - printf("[ERROR] Check tiling failed, coreNum(%u) > vector Real Core count(%lu)\n", - tilingData.coreNum, compileInfo_.vectorCoreNum); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, coreNum(%u) " + "must be smaller than vector total core number(%lu)\n", + tilingData.coreNum, compileInfo_.vectorCoreNum); return false; } if (tilingData.A2 * tilingData.A12 * tilingData.A11 * tilingData.A0 < dimA) { - printf("[ERROR] Check tiling failed, A2 * A12 * A11 * A0 < dimA(%u)\n", dimA); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, A2 * A12 * A11 * A0 < dimA(%u)\n", dimA); return false; } if (tilingData.B2 * tilingData.B1 * tilingData.B0 < dimB) { - printf("[ERROR] Check tiling failed, B2 * B1 * B0 < dimB(%u)\n", dimB); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, B2 * B1 * 
B0 < dimB(%u)\n", dimB); return false; } if (tilingData.B2 * dSize % ATVC::UB_ALIGN_32 != 0) { - printf("[ERROR] Check tiling failed, B2(%lu) is not aligined with 32B\n", tilingData.B2); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, B2(%lu) is not aligned with 32B\n", + tilingData.B2); return false; } if (tilingData.A2 * dSize % ATVC::UB_ALIGN_32 != 0) { - printf("[ERROR] Check tiling failed, A2(%lu) is not aligined with 32B\n", tilingData.A2); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, A2(%lu) is not aligned with 32B\n", tilingData.A2); return false; } return true; @@ -293,18 +304,18 @@ private: uint64_t basicBlock = CalcBasicBlock(); uint64_t dSize = ge::GetSizeByDataType(opInput_.inputDtype); if (dSize == 0) { - printf("[ERROR] Data size is invalid, please check input data type!\n"); + printf("[ERROR]: [ATVC][Broadcast] Data size is invalid, please check input data type!\n"); return false; } uint64_t dUint = ATVC::UB_ALIGN_32 / dSize; uint64_t cacheSize = OpsUtils::FloorDiv(basicBlock, dSize); - uint32_t dimA = Pattern::TailA ? Pattern::Dim - 1 : Pattern::Dim - 2; // A - uint32_t dimB = Pattern::TailA ? Pattern::Dim - 2 : Pattern::Dim - 1; // B + uint32_t dimA = Pattern::TailA ? Pattern::Dim - 1 : Pattern::Dim - 2; // A + uint32_t dimB = Pattern::TailA ? Pattern::Dim - 2 : Pattern::Dim - 1; // B uint64_t i = OpsUtils::FloorAlign(shape[dimA], dUint); // 32B对齐 uint64_t j = OpsUtils::FloorAlign(shape[dimB], dUint); // 32B对齐 ATVC::BroadcastOpTilingData& tilingData = param_->tilingData; - if (Pattern::TailA) {// 优先A轴打满 + if constexpr (Pattern::TailA) { // 优先A轴打满 tilingData.B2 = dUint; // B2最小值 tilingData.A2 = i > OpsUtils::FloorDiv(cacheSize, dUint) ? 
OpsUtils::FloorDiv(cacheSize, dUint) : i; tilingData.B2 = OpsUtils::FloorAlign(OpsUtils::FloorDiv(cacheSize, tilingData.A2), dUint); @@ -321,8 +332,8 @@ private: } // 1.优先多核 A0 B0打满核后再计算核内循环 - tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2); - tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2); + tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2); + tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2); // A0*B0为实际的block num 必须小于vectorCoreNum while (tilingData.A0 * tilingData.B0 > compileInfo_.vectorCoreNum) { if (tilingData.B0 > 1) { // 优先A0切轴 @@ -335,14 +346,15 @@ private: // 2.核内循环优先A12,因为A12只需要copyIn 1次 tilingData.A12 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2 * tilingData.A0); if (tilingData.A12 * tilingData.A2 > cacheSize) { - tilingData.A12 = OpsUtils::FloorDiv(cacheSize , tilingData.A2); + tilingData.A12 = OpsUtils::FloorDiv(cacheSize, tilingData.A2); } - tilingData.A11 = OpsUtils::CeilDiv(shape[dimA], (tilingData.A0 * tilingData.A2 * tilingData.A12)); // 计算精确A11 - tilingData.B1= OpsUtils::CeilDiv(shape[dimB], (tilingData.B0 * tilingData.B2)); + tilingData.A11 = + OpsUtils::CeilDiv(shape[dimA], (tilingData.A0 * tilingData.A2 * tilingData.A12)); // 计算精确A11 + tilingData.B1 = OpsUtils::CeilDiv(shape[dimB], (tilingData.B0 * tilingData.B2)); // 3.最后重新计算A0 B0避免空核 - tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2 * tilingData.A11 * tilingData.A12); - tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2 * tilingData.B1); + tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2 * tilingData.A11 * tilingData.A12); + tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2 * tilingData.B1); // 4.写Tiling结果 ExpandTilingParam(basicBlock); @@ -354,6 +366,7 @@ private: ATVC::BroadcastParam* param_ {nullptr}; ATVC::BroadcastPolicy* policy_ {nullptr}; ATVC::OpCompileInfo compileInfo_ = {0, 0, 0, 0}; + uint32_t broadcast_basic_num_; }; } // namespace OpTiling #endif // ATVC_BROADCAST_TILING_H \ No 
newline at end of file diff --git a/atvc/include/broadcast/broadcast_utils/broadcast_buf_pool.h b/atvc/include/broadcast/utils/broadcast_buf_pool.h similarity index 75% rename from atvc/include/broadcast/broadcast_utils/broadcast_buf_pool.h rename to atvc/include/broadcast/utils/broadcast_buf_pool.h index 0888c084dc0f58f2f43a9152eaec66b212bd1a38..db1a36200260b12daecf260f580e03b67bcae003 100644 --- a/atvc/include/broadcast/broadcast_utils/broadcast_buf_pool.h +++ b/atvc/include/broadcast/utils/broadcast_buf_pool.h @@ -28,25 +28,32 @@ struct BrcPoolManagerUnit { int32_t offset = 0; }; +template class BroadcastBufPool { -constexpr static int32_t MAX_INPUT_SIZE = 10; + constexpr static int32_t MAX_INPUT_SIZE = 10; public: - __aicore__ inline BroadcastBufPool(){}; + __aicore__ inline BroadcastBufPool() {}; - template + template __aicore__ inline void Init(AscendC::TPipe* pipeIn, - int32_t inputNum, // doublebuff需要的输入个数 - int32_t computeNum, // 计算结果的个数,一般与inputNum保持一致 - int32_t inBlockLen, // 一次计算的输入基本块大小 - int32_t outBlockLen) { // 一次计算的输出大小 + int32_t inputNum, // doublebuff需要的输入个数 + int32_t computeNum, // 计算结果的个数,一般与inputNum保持一致 + int32_t inBlockLen, // 一次计算的输入基本块大小 + int32_t outBlockLen) + { + // 一次计算的输出大小 /* _______________________________________________________________________________________________________ | inputTensor 0 | inputTensor 1 | outputTensor 0 | outputTensor 0 | |___________________|___________________|_______________________________|_______________________________| */ + if (EnableDb) { + inputNum *= ATVC::CONST2; + computeNum *= ATVC::CONST2; + } pipe_ = pipeIn; - int32_t eleSize = sizeof(DataType); + constexpr int32_t eleSize = static_cast(sizeof(T)); inputNum_ = inBlockLen / eleSize; outputNum_ = outBlockLen / eleSize; int32_t poolSize = inBlockLen * inputNum + outBlockLen * computeNum; @@ -61,7 +68,8 @@ public: } template - __aicore__ inline const void AllocTensor(AscendC::LocalTensor& tensor) { + __aicore__ inline const void 
AllocTensor(AscendC::LocalTensor& tensor) + { if constexpr (IsInput) { int32_t idx = GetInputTensorId(); tensor = qQue_.GetWithOffset(inputNum_, inputUnit_.offset + idx * inputNum_ * sizeof(T)); @@ -72,42 +80,46 @@ public: } template - __aicore__ inline const void FreeTensor(AscendC::LocalTensor& tensor) { + __aicore__ inline const void FreeTensor(AscendC::LocalTensor& tensor) + { if constexpr (!IsInput) { - uint32_t idx = GetOutputTensorIdx(tensor); + uint32_t idx = GetOutputTensorIdx(tensor); isBusyOut_[idx] = false; // 恢复isBusy_状态 } } - template - __aicore__ inline const void SetVecSync(AscendC::LocalTensor& tensor) { - uint32_t idx = GetInputTensorIdx(tensor); - event_t eventId = static_cast(pipe_->AllocEventID()); + template + __aicore__ inline const void SetVecSync(AscendC::LocalTensor& tensor) + { + uint32_t idx = GetInputTensorIdx(tensor); + event_t eventId = static_cast(pipe_->AllocEventID()); vecEventId_[idx] = eventId; - AscendC::SetFlag(eventId); + AscendC::SetFlag(eventId); } - template - __aicore__ inline const void WaitVecSync(AscendC::LocalTensor& tensor) { - uint32_t idx = GetInputTensorIdx(tensor); - AscendC::WaitFlag(vecEventId_[idx]); - pipe_->ReleaseEventID(vecEventId_[idx]); + template + __aicore__ inline const void WaitVecSync(AscendC::LocalTensor& tensor) + { + uint32_t idx = GetInputTensorIdx(tensor); + AscendC::WaitFlag(vecEventId_[idx]); + pipe_->ReleaseEventID(vecEventId_[idx]); } - template - __aicore__ inline const void SetCopyOutSync(AscendC::LocalTensor& tensor) { - uint32_t idx = GetOutputTensorIdx(tensor); - event_t eventId = static_cast(pipe_->AllocEventID()); + template + __aicore__ inline const void SetCopyOutSync(AscendC::LocalTensor& tensor) + { + uint32_t idx = GetOutputTensorIdx(tensor); + event_t eventId = static_cast(pipe_->AllocEventID()); outEventId_[idx] = eventId; - AscendC::SetFlag(eventId); + AscendC::SetFlag(eventId); } - template + template __aicore__ inline const void WaitCopyOutSync(AscendC::LocalTensor& tensor) 
{ - uint32_t idx = GetOutputTensorIdx(tensor); - AscendC::WaitFlag(outEventId_[idx]); - pipe_->ReleaseEventID(outEventId_[idx]); + uint32_t idx = GetOutputTensorIdx(tensor); + AscendC::WaitFlag(outEventId_[idx]); + pipe_->ReleaseEventID(outEventId_[idx]); } template diff --git a/atvc/include/broadcast/utils/broadcast_util.h b/atvc/include/broadcast/utils/broadcast_util.h new file mode 100644 index 0000000000000000000000000000000000000000..24cd745a08269380a9d63b8eab4b63267569f7ca --- /dev/null +++ b/atvc/include/broadcast/utils/broadcast_util.h @@ -0,0 +1,70 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ + +/*! 
+ * \file broadcast_util.h + * \brief broadcast util interface + */ + +#ifndef ATVC_BROADCAST_UTIL_H +#define ATVC_BROADCAST_UTIL_H +#include "common/const_def.h" +#include "common/kernel_check_debug.h" + +namespace ATVC { +namespace KernelUtils { +template +__aicore__ inline void PrintParam(const T* param) +{ + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A0 = %lu\n", param->tilingData.A0); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A11 = %lu\n", param->tilingData.A11); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A12 = %lu\n", param->tilingData.A12); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A2 = %lu\n", param->tilingData.A2); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: B0 = %lu\n", param->tilingData.B0); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: B1 = %lu\n", param->tilingData.B1); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: B2 = %lu\n", param->tilingData.B2); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: coreNum = %d\n", param->tilingData.coreNum); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: basicBlock = %lu\n", param->tilingData.basicBlock); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorACntPerCore = %lu\n", + param->tilingData.factorACntPerCore); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorATotalCnt = %lu\n", + param->tilingData.factorATotalCnt); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorBCntPerCore = %lu\n", + param->tilingData.factorBCntPerCore); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorBTotalCnt = %lu\n", + param->tilingData.factorBTotalCnt); + for (int32_t i = 0; i < ATVC::CONST2; i++) { + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: shape[%d] = %lu\n", + i, param->tilingData.shape[i]); + 
ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: dstShape[%d] = %lu\n", + i, param->tilingData.dstShape[i]); + } + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: policy.patternID = %d\n", + SelectBroadcastPolicy.patternID); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: workspaceSize = %u\n", param->workspaceSize); + return; +} + +template +__aicore__ inline bool CheckParam(const T* param) +{ + auto *tilingData = &param->tilingData; + if (tilingData->coreNum < AscendC::GetBlockIdx() + 1) { + ATVC::Kernel::DebugPrintf("[ERROR]: [ATVC][Broadcast] Tiling data[coreNum = %d] is invalid, " + "it must be larger than the current block index.\n", tilingData->coreNum); + return false; + } + return true; +} + +} // namespace KernelUtils +} // namespace ATVC +#endif \ No newline at end of file diff --git a/atvc/include/common/atvc_op_check.h b/atvc/include/common/atvc_op_check.h new file mode 100644 index 0000000000000000000000000000000000000000..06ae33e11bb27c4e9566c40ca6f0c07158dc27d4 --- /dev/null +++ b/atvc/include/common/atvc_op_check.h @@ -0,0 +1,75 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ + +#ifndef ATVC_COMMON_OP_CHECK_H +#define ATVC_COMMON_OP_CHECK_H + +#include +#include "atvc_opdef.h" + +namespace ATVC { +namespace Host { + +template +bool CheckSameDtype_() { + if (GetOriInputType::Type>() != + GetOriInputType::Type>()) { + return false; + } + return true; +} +template +// Validate Traits +bool DebugCheck() { + using Inputs = typename OpTraits::In::types; + using Outputs = typename OpTraits::Out::types; + if constexpr (templateType == ATVC::TemplateType::ELE_WISE) { + if constexpr (TypeListSize::VALUE == 0) { // 0: input is empty + printf("[ERROR]: [ATVC][OpTraits] Input cannot be empty in Ele-wise template.\n"); + return false; + } + if constexpr (TypeListSize::VALUE == 0) { // 0: output is empty + printf("[ERROR]: [ATVC][OpTraits] Output cannot be empty in Ele-wise template.\n"); + return false; + } + } else if constexpr (templateType == ATVC::TemplateType::REDUCE) { + if constexpr (TypeListSize::VALUE != 1) { // 1: input number must be 1. + printf("[ERROR]: [ATVC][OpTraits] Input number must be 1 in Reduce template.\n"); + return false; + } + if constexpr (TypeListSize::VALUE != 1) { // 1: output number must be 1. + printf("[ERROR]: [ATVC][OpTraits] Output number must be 1 in Reduce template.\n"); + return false; + } + } else if constexpr (templateType == ATVC::TemplateType::BROADCAST) { + if constexpr (TypeListSize::VALUE != 1) { // 1: input number must be 1. + printf("[ERROR]: [ATVC][OpTraits] Input number must be 1 in Broadcast template.\n"); + return false; + } + if constexpr (TypeListSize::VALUE != 1) { // 1: output number must be 1. 
+ printf("[ERROR]: [ATVC][OpTraits] Output number must be 1 in Broadcast template.\n"); + return false; + } + } + + if constexpr (templateType == ATVC::TemplateType::REDUCE || + templateType == ATVC::TemplateType::BROADCAST) { + if (!CheckSameDtype_()) { + printf("[ERROR]: Different input/output data types are not supported in Reduce or Broadcast template.\n"); + return false; + } + } + return true; +} +} // namespace Host +} // namespace ATVC +#endif // ATVC_COMMON_OP_CHECK_H + diff --git a/atvc/include/common/atvc_opdef.h b/atvc/include/common/atvc_opdef.h index bac3dfc7ad99145e44e9f72bc70c6ed8e393b838..83ee2e0fbf0a0896dd53493931207b9527f70ddf 100644 --- a/atvc/include/common/atvc_opdef.h +++ b/atvc/include/common/atvc_opdef.h @@ -13,6 +13,7 @@ #define ATVC_COMMON_OPDEF_H #include "type_list.h" +#include "common/dtype_utils.h" namespace ATVC { enum class ParamType { INPUT, @@ -20,37 +21,48 @@ enum class ParamType { TEMP, }; -template +enum class TemplateType { + ELE_WISE, + REDUCE, + BROADCAST, +}; + +template struct ParamTypes { using types = ATVC::TypeList; static constexpr ParamType usage = paramType_; }; -template +template using OpInputs = ParamTypes; -template +template using OpOutputs = ParamTypes; -template +template using OpTemps = ParamTypes; - -template> +template > struct OpTraits { using In = InTypeList; using Out = OutTypeList; using Temp = TempTypeList; }; +struct VoidComputeTraits { + using In = OpInputs<>; + using Out = OpOutputs<>; + using Temp = OpTemps<>; +}; + template -struct GetFunctionTraits {}; +struct GetFunctionTraits { + using ComputeTraits = VoidComputeTraits; +}; -// 专门化处理函数A的模板实例 -template class TileCompute, typename Traits> +template