diff --git a/atvc/README.md b/atvc/README.md index 854dd6e79268b6a29ee4b18cbb36937bd9dfa8b8..93037269b547bca59c105af9fc743ca5881e8bc3 100644 --- a/atvc/README.md +++ b/atvc/README.md @@ -1,35 +1,35 @@ # ATVC -ATVC(AscendC Template for Vector Compute)是一个用AscendC API搭建的C++模板头文件集合,旨在帮助用户快速开发AscendC典型Vector算子。它将AscendC Vector算子开发流程中的计算实现解耦成可自定义的模块, 内部封装实现了kernel数据搬入搬出等底层通用操作及通用tiling计算,实现了高效的算子开发模式。 -相比传统AscendC算子开发方式,利用ATVC搭建的Vector算子可做到开发效率提升3-5倍。用户只需选择匹配的模板并完成核心计算逻辑就完成算子kernel侧开发,atvc还内置了每个模板库对应的通用tiling计算实现,可省去用户手写tiling的开发量就能达到不错的性能表现,极大提升算子开发效率。 +ATVC(Ascend C Template for Vector Compute)是一个用Ascend C API搭建的C++模板头文件集合,旨在帮助用户快速开发Ascend C典型Vector算子。它将Ascend C Vector算子开发流程中的计算实现解耦成可自定义的模块,内部封装实现了kernel数据搬入搬出等底层通用操作及通用tiling计算,实现了高效的算子开发模式。 +相比传统Ascend C算子开发方式,利用ATVC搭建的Vector算子可做到开发效率提升3-5倍。用户只需选择匹配的模板并完成核心计算逻辑,即可完成算子kernel侧开发;ATVC还内置了每个模板库对应的通用tiling计算实现,无需手写tiling即可达到不错的性能表现,极大提升算子开发效率。 -![atvc_user_case.png](./docs/data/atvc_user_case.png)
+![atvc_user_case.png](./docs/images/atvc_user_case.png)
-请参阅[快速入门](./docs/1_quick_start.md)以快速了解ATVC的Add算子搭建流程。 -请参阅[开发者文档](./docs/2_developer_guide.md)以获取ATVC框架各模板与API的使用细节,完成自定义EleWise类算子以及Reduce类算子开发。 +请参阅[快速入门](./docs/01_quick_start.md)以快速了解ATVC的Add算子搭建流程。 +请参阅[开发者文档](./docs/02_developer_guide.md)以获取ATVC框架各模板与API的使用细节,完成自定义Elementwise类算子以及Reduce类算子开发。 # 环境要求 - 硬件型号支持 -Atlas 800I A2推理服务器/Atlas A2训练服务器 +Atlas 800I A2推理服务器 - 配套软件 CANN开发套件包Ascend-cann-toolkit\_\\_linux\-\.run,并设置相关环境变量 cmake >= 3.16.0 # 工程目录 -ATVC工程结构可归纳成以下主要组件,更详细的文件结构介绍请参阅[Code Organization](./docs/code_organization.md): +ATVC工程结构可归纳成以下主要组件,更详细的文件结构介绍请参阅[Code Organization](./docs/03_code_organization.md): ``` ├── docs // 文档介绍 -├── examples // ATVC使用样例 -├── include // ATVC提供的头文件集合,用户使用前需将其置入其他工程的包含路径下 +├── examples // ATVC使用样例 +├── include // ATVC提供的头文件集合,用户使用前需将其置入其他工程的包含路径下 ├── tests // 测试模块相关代码 -└── ReadME.md // overview +└── README.md // 综述 ``` # 快速上手 -快速执行example用例,更详细的流程请参阅[快速入门](./docs/1_quick_start.md)。 +快速执行example用例,更详细的流程请参阅[快速入门](./docs/01_quick_start.md)。 - 下载ATVC代码 ```bash @@ -68,21 +68,22 @@ Accuracy verification passed. 
# 已支持的模版 | Vector模版类型 | | ------------------------------------------------------------ | -| Ele-wise模板 | +| Elementwise模板 | | Reduce模板 | | Broadcast模板 | # 样例介绍 | 样例名 | 描述 | 类型| | ------------------------------------------------------------ | ------------------------------------------------------------ |------------------------------------------------------------ | -| [add](./examples/add/add.cpp) | 使用ATVC的Ele-wise模板实现Add算子以及调用样例 | 直调算子 | -| [sinh_custom](./examples/sinh_custom/sinh_custom.cpp) | 临时Tensor参与计算的自定义Ele-wise类算子以及调用样例 | 直调算子 | -| [add_with_scalar](./examples/add_with_scalar/add_with_scalar.cpp) | 输入带标量的自定义Ele-wise类算子以及调用样例 | 直调算子 | +| [add](./examples/add/add.cpp) | 使用ATVC的Elementwise模板实现Add算子以及调用样例 | 直调算子 | +| [sinh_custom](./examples/sinh_custom/sinh_custom.cpp) | 临时Tensor参与计算的自定义Elementwise类算子以及调用样例 | 直调算子 | +| [add_with_scalar](./examples/add_with_scalar/add_with_scalar.cpp) | 输入带标量的自定义Elementwise类算子以及调用样例 | 直调算子 | | [reduce_sum](./examples/reduce_sum/reduce_sum.cpp) | 使用ATVC的Reduce模板实现自定义ReduceSum算子以及调用样例 | 直调算子 | | [broadcast_to](./examples/broadcast_to/broadcast_to.cpp) | 使用ATVC的Broadcast模板实现自定义BroadcastTo算子以及调用样例 | 直调算子 | -| [ops_aclnn](./examples/aclnn) | 使用ATVC基于自定义工程算子的实现以及调用样例 | 自定义工程算子 | -| [ops_pytorch](./examples/pytorch) | 使用ATVC基于[pytorch](https://gitee.com/ascend/pytorch)算子的实现以及调用样例 | pytorch算子 | +| [tanh_grad](./examples/tanh_grad/tanh_grad.cpp) | 使用Tiling超参进行算子性能调优的Elementwise类算子调用样例 | 直调算子 | +| [ops_aclnn](./examples/ops_aclnn) | 使用ATVC基于自定义工程算子的实现以及调用样例 | 自定义工程算子 | +| [ops_pytorch](./examples/ops_pytorch) | 使用ATVC基于[pytorch](https://gitee.com/ascend/pytorch)算子的实现以及调用样例 | pytorch算子 | -更多算子类型介绍和如何选取模板参见参阅[快速入门](./docs/1_quick_start.md),其中add、sinh_custom、add_with_scalar、reduce_sum、broadcast_to是ATVC的直调样例,ops_aclnn为基于ATVC对接aclnn工程的算子目录,ops_pytorch为基于ATVC对接pytorch工程的算子目录。其中,ops_aclnn和ops_pytorch样例需要进入到example路径下按照README.md描述执行。 
+更多算子类型介绍和如何选取模板请参阅[快速入门](./docs/01_quick_start.md),其中add、sinh_custom、add_with_scalar、reduce_sum、broadcast_to、tanh_grad是ATVC的直调样例,ops_aclnn为基于ATVC对接aclnn工程的算子目录,ops_pytorch为基于ATVC对接pytorch工程的算子目录。其中,ops_aclnn和ops_pytorch样例需要进入到example路径下按照README.md描述执行。 @@ -90,7 +91,7 @@ Accuracy verification passed. | 算子模板 | 数据类型 | | ------------------------------------------------------------ | ------------------------------------------------------------ | -| Ele-wise | int32_t、float | +| Elementwise | int32_t、float | | Reduce | int32_t、float | | Broadcast | int32_t、float | diff --git a/atvc/docs/1_quick_start.md b/atvc/docs/01_quick_start.md similarity index 66% rename from atvc/docs/1_quick_start.md rename to atvc/docs/01_quick_start.md index 3c5c469b49f1b6048ff53f5b6d4f9637a2273b22..51c24a860f7cf3a2edd3113d526045d6843be990 100644 --- a/atvc/docs/1_quick_start.md +++ b/atvc/docs/01_quick_start.md @@ -1,9 +1,8 @@ # 快速入门 这篇文档帮助你体验ATVC开发Add算子的整个流程。
-快速上手的完整代码请参阅[atvc/examples/add](../examples/add/add.cpp)。 # 环境要求 -ATVC 对软硬件运行环境有如下要求: +ATVC对软硬件运行环境有如下要求: - 硬件型号 Atlas 800I A2推理服务器 - 软件要求 @@ -33,17 +32,16 @@ git clone https://gitee.com/ascend/ascendc-api-adv.git # 使用ATVC开发Add算子 -本示例将展示如何基于ATVC提供的模板以及接口快速搭建Add算子,示例内展示了ATVC框架下区别于传统AscendC Add的实现代码。
-
+本示例将展示如何基于ATVC提供的模板以及接口快速搭建Add算子,示例内展示了ATVC框架下区别于传统Ascend C Add的实现代码。
## 定义算子描述 首先通过ATVC提供的[ATVC::OpTraits](../include/common/atvc_opdef.h)模板结构体来描述Add算子的输入输出信息,定义如下: ```cpp // Add算子中有两个输入,一个输出。类型均为float -using ADD_OPTRAITS = ATVC::OpTraits, ATVC::OpOutputs>; +using ADD_OPTRAITS = ATVC::OpTraits, ATVC::OpOutputs>; ``` ## 实现算子计算逻辑 -用户需要通过AscendC API来搭建Add算子的核心计算逻辑,在ATVC框架中,这类算子的核心计算逻辑是通过定义一个结构体的仿函数来实现。它需要`ATVC::OpTraits`作为固定模板参数,并重载`operator()`来被提供的Kernel层算子模板类调用。
+用户需要通过Ascend C API来搭建Add算子的核心计算逻辑。在ATVC框架中,这类计算逻辑通过定义一个结构体仿函数来实现:它以`ATVC::OpTraits`作为固定模板参数,并重载`operator()`供框架提供的Kernel层算子模板类调用。
Add算子的计算仿函数定义如下: ```cpp #include "atvc.h" // 包含所有atvc模板api的总入口头文件 @@ -61,7 +59,7 @@ struct AddComputeFunc { template // 重载operator,提供给算子模板类调用 __aicore__ inline void operator()(AscendC::LocalTensor a, AscendC::LocalTensor b, AscendC::LocalTensor c) { - AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC Api自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 + AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC API自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 } }; ``` @@ -80,13 +78,13 @@ template * a Device上的gm地址,指向Add算子第一个输入 * b Device上的gm地址,指向Add算子第二个输入 * c Device上的gm地址,指向Add算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * param 指向运行态ATVC::EleWiseParam数据 */ -__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR param) +__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); auto op = ATVC::Kernel::EleWiseOpTemplate>(); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 - op.Run(a, b, c, param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 + op.Run(a, b, c, ¶m); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 } ``` @@ -99,49 +97,48 @@ ATVC::ElewiseParam param; // totalCnt描述EleWise单输入的元素个数 int32_t totalCnt = 1024; // ADD_OPTRAITS为ADD算子描述原型,根据算子输入输出个数和实际元素数量计算出Tiling数据后填入param中 -if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { +if (!ATVC::Host::CalcEleWiseTiling(totalCnt, param)) { printf("Elewise tiling error.\n"); return -1; }; -// -// 调用aclrtMemcpy接口拷贝Host侧数据到Device侧 -// aclrtMemcpy(paramDevice, sizeof(param), reinterpret_cast(¶m), sizeof(param), ACL_MEMCPY_HOST_TO_DEVICE) -// uint32_t blockNum = param.tilingData.blockNum; // 调用核函数 +// aDevice Device上的gm地址,指向Add算子第一个输入 +// bDevice Device上的gm地址,指向Add算子第二个输入 +// cDevice Device上的gm地址,指向Add算子第一个输出 +// param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 AddCustom<<>>(aDevice, bDevice, cDevice, paramDevice); ``` ## 算子编译&执行 完成算子代码编写后,调用以下命令编译代码并执行: ```bash -$ cd ./atvc/tests/ -$ bash run_test.sh add -... 
-Generate golden data successfully. -... -Accuracy verification passed. +cd ./atvc/tests/ +bash run_test.sh add ``` 其他样例执行命令如下: ```bash -$ bash run_test.sh sinh_custom # 执行sinh_custom样例 -$ bash run_test.sh reduce_sum # 执行reduce_sum样例 -$ bash run_test.sh add_with_scalar # 执行add_with_scalar样例 -$ bash run_test.sh broadcast_to # 执行broadcast样例 +bash run_test.sh sinh_custom # 执行sinh_custom样例 +bash run_test.sh reduce_sum # 执行reduce_sum样例 +bash run_test.sh add_with_scalar # 执行add_with_scalar样例 +bash run_test.sh broadcast_to # 执行broadcast样例 ``` ## 完整样例
完整代码样例请参照[examples/add/add.cpp](../examples/add/add.cpp) # 模板选择 -用户根据待开发的Vector算子定义特征,选择匹配的模板及其配套的tiling算法,若自定义算子不在当前模板库的范围内,建议使用基本AscnedC API 手写算子。 -## Ele-wise类算子 -Ele-wise类算子通常是指对张量进行元素级别的操作的函数或方法,包括但不限于加、减、乘、除及指数、对数、三角函数等数学函数。这类算子的特点是会逐元素进行计算操作,而不会改变输入数据的形状。常见的ELe-wise算子有Add、Sub、Exp、Log、Sin、Sqrt等。 +用户根据待开发的Vector算子定义特征,选择以下三种匹配的模板及其配套的tiling算法,若自定义算子不在当前模板库的范围内,建议使用基本Ascend C API 手写算子。 +## Elementwise类算子 +Elementwise类算子通常是指对张量进行元素级别的操作的函数或方法,包括但不限于加、减、乘、除及指数、对数、三角函数等数学函数。这类算子的特点是会逐元素进行计算操作,而不会改变输入数据的形状。常见的Elementwise算子有Add、Sub、Exp、Log、Sin、Sqrt等。 ## Reduce类算子 Reduce类算子通常是指对张量中的元素进行归约操作的算子,通常用来求和、求平均值等操作,可指定某几个维度进行归约计算,也可以将所有元素归约计算为一个标量。常见的Reduce类算子有ReduceSum(求和)、ReduceMean(求平均值)、ReduceProdcut(累乘)、ReduceMax(求最大值)、ReduceMin(求最小值)、ReduceAny(or操作)、ReduceAll(and操作)。 ## Broadcast -Broadcast算子是指完成广播操作,用于处理不同形状的张量间的运算。 -例如tensorA的shape是(1,5),tensorB的shape是(3,5),若要实现tensorC = tensorA + tensorB,实际上需要将TensorA广播为shape(3,5),再进行tensorA和tensorB的相加操作。广播的过程,实际上就是将原数据在某个维度上进行复制扩展。 +Broadcast 算子用于在张量形状不一致时实现张量间的逐元素运算。 +设张量 A 的 shape 为 (1, 5),张量 B 的 shape 为 (3, 5)。为完成 C = A + B,首先需依据广播规则将 A 由 (1, 5) 扩展至 (3, 5)。该过程通过在长度为 1 的维度上复制数据,使两个张量的形状对齐,从而支持逐元素相加运算。 + +# Copyright +Copyright (c) 2025 Huawei Technologies Co., Ltd. \ No newline at end of file diff --git a/atvc/docs/2_developer_guide.md b/atvc/docs/02_developer_guide.md similarity index 72% rename from atvc/docs/2_developer_guide.md rename to atvc/docs/02_developer_guide.md index c236bd991ca57a6d0eeb9c3dbf3ac16bb9aa81c3..e494ebad5e79cb59f229104ded1df5ace0d121da 100644 --- a/atvc/docs/2_developer_guide.md +++ b/atvc/docs/02_developer_guide.md @@ -1,20 +1,20 @@ # 1 架构设计 ATVC将Vector算子开发流程中的可定制化模块抽象出了Host层和Kernel层,它们的基本概念如下:
-- Host层:在CPU Host侧执行,提供Tiling计算&策略分派的API,它根据实际数据场景帮助用户计算出较优的数据搬运等运行态参数 -- Kernel层:它是利用AscendC API搭建出的一系列Vector算子模板类,屏蔽了算子开发中用户无需感知的数据搬入搬出以及资源申请等固定模块,并将核心计算模块开放给用户定义。 +- Host层:在CPU Host侧执行,提供Tiling计算&策略分派的API,它根据实际数据场景帮助用户计算出较优的数据搬运等运行态参数。 +- Kernel层:它是利用Ascend C API搭建出的一系列Vector算子模板类,屏蔽了算子开发中用户无需感知的数据搬入搬出以及资源申请等固定模块,并将核心计算模块开放给用户定义。 -![architecture.png](./data/architecture.png)
+![architecture.png](./images/architecture.png)
基于如上分层结构,ATVC中一个核函数的实现与调用的关系如下图所示(ATVC框架提供的模板及接口用黄色表示;支持开发自定义的模块用蓝色表示): - +
## 1.1 公共数据结构 -我们将对ATVC核函数定义以及调用涉及的三个公共数据概念:算子原型的编译态参数`OpTraits`, Tiling计算的运行态参数`Param`, 模板策略的编译态参数`Policy` 分别进行介绍: +我们将对ATVC核函数定义以及调用涉及的三个公共数据概念:算子原型的编译态参数`OpTraits`, Tiling计算的运行态参数`Param`, 模板策略的编译态参数`Policy` 分别进行介绍。 ### 1.1.1 OpTraits -ATVC框架参考C++模板元编程的`stl::type_list`实现,推出了`ATVC::OpInputs`、`ATVC::OpOutputs`、`ATVC::OpTemps`的模板结构体分别用于描述算子的计算输入、计算输出、计算过程的临时资源,支持C++基础类型作为不定长模板参数传入。它们三者组成了覆盖了整个ATVC框架编译态参数`OpTraits`。`ATVC::OpTraits`的完整数据定义如下
+ATVC框架参考C++模板元编程中的`type_list`实现,提供了`ATVC::OpInputs`、`ATVC::OpOutputs`、`ATVC::OpTemps`三个模板结构体,分别用于描述算子的计算输入、计算输出和计算过程中的临时资源,支持以C++基础类型作为不定长模板参数传入。三者共同组成了ATVC框架的编译态参数`OpTraits`。`ATVC::OpTraits`的完整数据定义如下:
```cpp // atvc_opdef.h namespace ATVC { @@ -25,23 +25,23 @@ enum class ParamType { TEMP, // 临时计算资源 }; -template +template struct ParamTypes{ using types = ATVC::TypeList; static constexpr ParamType usage = paramType_; }; -template +template using OpInputs = ParamTypes; -template +template using OpOutputs = ParamTypes; -template +template using OpTemps = ParamTypes; // OpTraits的结构体定义,TempTypeList默认为空 -template> +template > struct OpTraits { using In = InTypeList; using Out = OutTypeList; @@ -64,7 +64,7 @@ using AddOpTraits = ATVC::OpTraits; // Add算 ### 1.1.2 Param ATVC框架提供了`ATVC::EleWiseParam`、`ATVC::ReduceParam`、`ATVC::BroadcastParam` 三个结构体来描述算子内部调度的Tiling数据和其他资源变量。Param 作为Host侧Tiling API的输出,它将传入ATVC框架的Kernel层算子模板,并在运行时指导算子内部模块完成数据的循环搬运与调度计算。
-以下为ElementWise类算子的`ATVC::EleWiseParam`参与计算的伪代码,详细使用流程请参考本文档的 2.1.5 Host层API: +以下为ElementWise类算子的`ATVC::EleWiseParam`参与计算的伪代码,详细使用流程请参考本文档的 [2.1.5 Host层API](#215-host层api): ```cpp // 声明运行态参数param ATVC::ElewiseParam param; @@ -93,48 +93,48 @@ ATVC::Host::CalcReduceTiling(..., &policy, ¶m); // ReduceAdapter根据policy具体值将动态参数转为静态模板参数,并传入核函数 if (policy.patternId == 1 && policy.loopCnt == 2 && policy.loopInnerCnt == 3) { - constexpr ATVC::ReducePolicy selectedPolicy = {1, 2, 3}; + constexpr ATVC::ReducePolicy selectedPolicy = {1, 2, 3}; ReduceKernel<<<...>>>(...); } // 自定义的ReduceKernel核函数内部调用了ReduceOpTemplate算子模板类, 该模板类内部实现了Policy对应的各种计算场景 -template -__global__ __aicore__ ReduceKernel(GM_ADDR x, GM_ADDR y, GM_ADDR paramDevice) { +template +__global__ __aicore__ ReduceKernel(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam param) { auto op = ATVC::Kernel::ReduceOpTemplate(); // 实例化算子Kernel模板, Policy作为模板参数传入 - op.Run(x, y, paramDevice); // param作为运行态参数传入 + op.Run(x, y, ¶m); // param作为运行态参数传入 } ``` # 2 利用ATVC完成算子开发 -## 2.1 element-wise算子开发 -ATVC框架提供的element-wise算子模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): -![elewise_dataflow.png](data/elewise_dataflow.png) +## 2.1 Elementwise算子开发 +ATVC框架提供的Elementwise算子模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): +![elewise_dataflow.png](images/elewise_dataflow.png) -不同计算原理的element-wise算子在Kernel内部的数据搬运模块并无区别,因此EleWise的数据交互不涉及Policy的不同Kernel模板实现。 +不同计算原理的Elementwise算子在Kernel内部的数据搬运模块并无区别,因此Elementwise的数据交互不涉及Policy的不同Kernel模板实现。 ### 2.1.1 Components -根据element-wise算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义AscendC的ElementWise算子: -![elewise_components.png](data/elewise_components.png) -自定义element-wise算子需按照以下顺序完成模块之间的组装: +根据Elementwise算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义Ascend C的ElementWise算子: +![elewise_components.png](images/elewise_components.png) +自定义Elementwise算子需按照以下顺序完成模块之间的组装: 1. 定义计算模板 -2. 将计算模板类传入`Kernel`层算子模板完成核函数功能实现; +2. 将计算模板类传入Kernel层算子模板完成核函数功能实现; 3. 
定义Kernel层算子入口API,内部实例化计算模板类; -下面将以Sinh算子 $y = \frac{\exp(x) - \exp(-x)}{2}$ 的实现为样例,按照组成Kernel的顺序介绍element-wise算子开发的流程。 +下面将以Sinh算子 $y = \frac{\exp(x) - \exp(-x)}{2}$ 的实现为样例,按照组成Kernel的顺序介绍Elementwise算子开发的流程。 ### 2.1.2 计算模板 -计算模板是用户必须在element-wise 算子实现过程中完成的一类特殊模板类的定义。模板类无需关注数据如何从GM搬运到UB,只需重载`operator()`的公有接口,并在该仿函数内部实现AscendC::LocalTensor之间的计算逻辑。在Kernel层的组装阶段,计算模板将作为模板参数传入`ATVC::Kernel::EleWiseOpTemplate`,并在数据计算阶段被调用。下方为计算模板实现Sinh计算逻辑的代码样例: +计算模板是用户必须在Elementwise 算子实现过程中完成的一类特殊模板类的定义。模板类无需关注数据如何从GM搬运到UB,只需重载`operator()`的公有接口,并在该仿函数内部实现`AscendC::LocalTensor`之间的计算逻辑。在Kernel层的组装阶段,计算模板将作为模板参数传入`ATVC::Kernel::EleWiseOpTemplate`,并在数据计算阶段被调用。下方为计算模板实现Sinh计算逻辑的代码样例: ```cpp #include "atvc.h" // 包含所有模板及API的总入口头文件 // 传入编译态参数ATVC::OpTraits -template +template // 开发自定义函数名/类名 struct SinhComputeFunc { // DataType模板参数,根据实际数据类型个数填写 - template + template // 重载operator公有接口,提供给Kernel::EleWiseOpTemplate调用 __aicore__ inline void operator()(AscendC::LocalTensor x, AscendC::LocalTensor y, AscendC::LocalTensor tempBuffer1, AscendC::LocalTensor tempBuffer2) { // 开发调用AscendC API自行实现计算仿函数 @@ -153,10 +153,10 @@ struct SinhComputeFunc { 3. 开发定义的`operator()`仿函数的输入参数类型支持`AscendC::LocalTensor`以及C++其他基础数据类型。形式参数需按照`ATVC::OpInputs<>`,`ATVC::OpOutputs<>`, `ATVC::OpTemps<>`声明的顺序填入,其他标量参数放在最后,根据用户计算场景按需传入。 -### 2.1.3 内置EleWise算子模板 +### 2.1.3 内置Elementwise算子模板 `ATVC::Kernel::EleWiseOpTemplate`为ATVC框架提供的内置ElementWise基本算子类,它实现了一套算子数据的搬运搬出、资源分配和释放的算子流程。它需要计算模板类作为模板参数传入来完成实例化。核函数通过调用它完成整套计算逻辑:1. 
资源初始化; 2.将数据从GM搬运至UB; 3.按`OpTraits`的输入、输出、临时资源描述、其他标量的顺序传入计算模板类的仿函数完成数据的基块计算; 4.将结果从UB搬出至GM。 -下方为`ATVC::Kernel::EleWiseOpTemplate`模板类的外部接口介绍,完整模板类定义请参考`atvc/include/elewise/elewise_op_template.h`。 +下方为`ATVC::Kernel::EleWiseOpTemplate`模板类的外部接口介绍,完整模板类定义请参考[`atvc/include/elewise/elewise_op_template.h`](../include/elewise/elewise_op_template.h)。 ```cpp // 开发定义的计算模板类 template @@ -166,13 +166,13 @@ public: // 按照输入、输出、EleWiseParam、其他标量的顺序传入 // 内部根据EleWiseParam进行数据调度并调用EleWiseOpTemplate完成计算后搬出到GM - template + template __aicore__ inline void Run(Args&&... args) { // // 完成变长参数的解析和数据调度计算 // } -} +}; ``` @@ -187,20 +187,20 @@ using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs> elewiseTemplate; // 调用EleWiseOpTemplate的Run接口传入输入x, 输出y,Host::CalcEleWiseTiling API的输出param -elewiseTemplate.Run(x, y, param); +elewiseTemplate.Run(x, y, ¶m); } ``` ### 2.1.4 核函数定义 -在element-wise开发流程中,用户需要自行定义核函数接口。核函数内部可依赖`ATVC::Kernel::EleWiseOpTemplate`完成功能实现。 +在Elementwise开发流程中,用户需要自行定义核函数接口。核函数内部可依赖`ATVC::Kernel::EleWiseOpTemplate`完成功能实现。 基于`2.1.2`和`2.1.3`的样例代码,Kernel层的自定义核函数代码样例如下: ```cpp #include "atvc.h" // 2.1.2 章节中的SinhComputeFunc 定义 -// template +// template // struct SinhComputeFunc { // ... // } @@ -208,21 +208,21 @@ using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs -__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +template +__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核 auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 + op.Run(x, y, ¶m); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 } ```
-利用ATVC框架开发element-wise算子的过程中,Kernel层的核函数定义必须遵从以下约束: +利用ATVC框架开发Elementwise算子的过程中,Kernel层的核函数定义必须遵从以下约束: 1. 核函数必须预留一个GM_ADDR类型的形参用于传入`ATVC::EleWiseParam`运行态参数; 2. 核函数内部必须加入`KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);`这段代码标注算子执行时只启动Vector核; @@ -231,24 +231,24 @@ __global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) ```cpp using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; -extern "C" __global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +extern "C" __global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核 auto op = ATVC::Kernel::EleWiseOpTemplate>(); // 模板参数传入固定的SinhOpTraits - op.Run(x, y, param); + op.Run(x, y, ¶m); } ``` ### 2.1.5 Host层API -ATVC的Host层提供了element-wise算子的通用Tiling算法API `ATVC::Host::CalcEleWiseTiling`,它根据算子计算原型`ATVC::OpTraits`以及数据大小计算出包含`ATVC::EleWiseTilingData`的运行态参数`ATVC::EleWiseParam`。`ATVC::EleWiseParam`在运行时将参与模板算子数据搬运从而实现较优计算。
`ATVC::EleWiseTilingData`和`ATVC::EleWiseParam`的数据结构定义如下: +ATVC的Host层提供了Elementwise算子的通用Tiling算法API `ATVC::Host::CalcEleWiseTiling`,它根据算子计算原型`ATVC::OpTraits`以及数据大小计算出包含`ATVC::EleWiseTilingData`的运行态参数`ATVC::EleWiseParam`。`ATVC::EleWiseParam`在运行时将参与模板算子数据搬运从而实现较优计算。
`ATVC::EleWiseTilingData`和`ATVC::EleWiseParam`的数据结构定义如下: ```cpp namespace ATVC{ struct EleWiseTilingData { uint32_t tailBlockCnt; // 需要额外执行一次循环的核的数量 - uint32_t tailElemCnt; // 尾块元素数量 - uint32_t numPerBlock; // 每个核需计算的总元素数量 - uint32_t tiledCnt; // 每次搬入搬出的元素个数 - uint32_t blockNum; // 执行核数 + uint32_t tailElemCnt; // 尾块元素数量 + uint32_t numPerBlock; // 每个核需计算的总元素数量 + uint32_t tiledCnt; // 每次搬入搬出的元素个数 + uint32_t blockNum; // 执行核数 }; struct EleWiseParam { @@ -300,23 +300,15 @@ if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { 通过ATVC框架实现的完整SinhCustom算子定义&调用[代码](../examples/sinh_custom/sinh_custom.cpp)如下: ```cpp -/** - * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. - * - * This file is a part of the CANN Open Software. - * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. 
- */ - #include #include #include +#include +#include #include #include "acl/acl.h" -#include "atvc.h" +#include "elewise/elewise_host.h" +#include "elewise/elewise_device.h" #define CHECK_ACL(x) \ do { \ @@ -331,12 +323,12 @@ static constexpr float REL_TOL = 1e-3f; static constexpr float ABS_TOL = 1e-5f; // 判断两个浮点数是否足够接近 -bool IsClose(float a, float b) { +bool IsClose(float a, float b) +{ const float eps = 1e-40f; // 防止分母为零 float diff = std::abs(a - b); return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -} // 描述算子的输入输出以及临时计算资源 using SinhOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; @@ -348,30 +340,76 @@ struct SinhComputeFunc { // DataType模板参数,根据实际数据类型个数填写 template // 重载operator公有接口,提供给`ATVC::Kernel::EleWiseOpTemplate`调用 - __aicore__ inline void operator()(AscendC::LocalTensor x, AscendC::LocalTensor y, AscendC::LocalTensor tempBuffer1, AscendC::LocalTensor tempBuffer2) { + __aicore__ inline void operator()(AscendC::LocalTensor x, + AscendC::LocalTensor y, + AscendC::LocalTensor tempBuffer1, + AscendC::LocalTensor tempBuffer2) + { // 开发调用AscendC Api自行实现计算仿函数 uint32_t tiledCnt = y.GetSize(); // 进行单次基块计算的元素个数 AscendC::Muls(tempBuffer1, x, static_cast(-1), tiledCnt); // tempBuffer1 = -1 * x AscendC::Exp(tempBuffer1, tempBuffer1, tiledCnt); // tempbuffer1 = exp(-x) AscendC::Exp(tempBuffer2, x, tiledCnt); // tempbuffer2 = exp(x) AscendC::Sub(y, tempBuffer2, tempBuffer1, tiledCnt); // y = exp(x) - exp(-x) - AscendC::Muls(y, y, static_cast(0.5), tiledCnt); // y = (e^(x) - e^(-x)) / 2 + AscendC::Muls(y, y, static_cast(0.5), tiledCnt); // y = (e^(x) - e^(-x)) / 2 } }; +void InitializeData(int32_t eleNum, std::vector &inputX, std::vector &golden) +{ + std::random_device rd; + std::mt19937 gen(rd()); + std::uniform_real_distribution dis(1.0f, 10.0f); + + for (int i = 0; i < eleNum; ++i) { + inputX[i] = dis(gen); + golden[i] = std::sinh(inputX[i]); + } +} + +bool VerifyResults(const std::vector &golden, const std::vector 
&output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} + +void InitializeACL(aclrtContext &context, aclrtStream &stream, int32_t deviceId) +{ + CHECK_ACL(aclInit(nullptr)); + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + CHECK_ACL(aclrtCreateStream(&stream)); +} + +void CleanACL(aclrtStream &stream, int32_t deviceId) +{ + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); +} +} /* * 该函数为SinhCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * x Device上的gm地址,指向SinhCustom算子第一个输入 + * y Device上的gm地址,指向SinhCustom算子第一个输出 + * param 指向运行态ATVC::EleWiseParam数据 */ template -__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核 auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 + op.Run(x, y, ¶m); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注 } int main() @@ -380,18 +418,11 @@ int main() int32_t eleNum = 8 * 2048; size_t inputByteSize = static_cast(eleNum) * sizeof(float); size_t outputByteSize = static_cast(eleNum) * sizeof(float); - - std::random_device rd; - std::mt19937 gen(rd()); - std::uniform_real_distribution dis(1.0f, 10.0f); std::vector inputX(eleNum); std::vector golden(eleNum); + InitializeData(eleNum, inputX, golden); - for (int i = 0; i < eleNum; ++i) { - inputX[i] = (dis(gen)); - golden[i] = std::sinh(inputX[i]); - } 
ATVC::EleWiseParam param; // 计算输入为8*2048个float元素的sinh算子的运行态参数param @@ -400,19 +431,15 @@ int main() return -1; }; // 初始化Acl资源与数据 - CHECK_ACL(aclInit(nullptr)); aclrtContext context; - int32_t deviceId = 0; - CHECK_ACL(aclrtSetDevice(deviceId)); - CHECK_ACL(aclrtCreateContext(&context, deviceId)); aclrtStream stream = nullptr; - CHECK_ACL(aclrtCreateStream(&stream)); + int32_t deviceId = 0; + InitializeACL(context, stream, deviceId); uint8_t *yHost; uint8_t *xDevice; uint8_t *yDevice; uint8_t *paramDevice; - CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); @@ -422,8 +449,10 @@ int main() // 将tiling计算的运行时参数EleWiseParam param传到Device侧 auto elementParamSize = sizeof(param); CHECK_ACL(aclrtMalloc((void**)¶mDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, reinterpret_cast(¶m), elementParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); - + CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, + reinterpret_cast(¶m), elementParamSize, + ACL_MEMCPY_HOST_TO_DEVICE)); + // 调用自定义的Kernel API, <<<>>>的BlockNum参数可通过param的TilingData获取 SinhCustom<<>>(xDevice, yDevice, paramDevice); @@ -437,15 +466,10 @@ int main() CHECK_ACL(aclrtFree(paramDevice)); CHECK_ACL(aclrtFreeHost(yHost)); - CHECK_ACL(aclrtDestroyStream(stream)); - CHECK_ACL(aclrtResetDevice(deviceId)); - CHECK_ACL(aclFinalize()); + CleanACL(stream, deviceId); - for (int32_t i = 0; i < eleNum; i++) { - if (!IsClose(golden[i], outputY[i])) { - printf("Accuracy verification failed! 
The expected value of element in index [%d] is %f, but actual value is %f.\n", i, golden[i], outputY[i]); - return -1; - } + if (!VerifyResults(golden, outputY)) { + return -1; } printf("Accuracy verification passed.\n"); return 0; @@ -454,12 +478,12 @@ int main() ## 2.2 Reduce算子开发 ATVC框架提供的Reduce算子模板类的模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): -![reduce_dataflow.png](data/reduce_dataflow.png) +![reduce_dataflow.png](images/reduce_dataflow.png) Reduce模板算子内部根据计算的数据大小、shape、Reduce axis轴完成了不同计算调度的代码实现,ATVC将各种计算调度场景抽象为`ATVC::ReducePolicy`。在算子调用阶段,分派策略API可根据Tiling API计算出的`ATVC::ReducePolicy`转化为编译态参数,结合计算模板来实例化`ATVC::Kernel::ReduceOpTemplate`算子模板类。 ### 2.2.1 Components 根据Reduce算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义Reduce算子: -![reduce_components.png](data/reduce_components.png) +![reduce_components.png](images/reduce_components.png) 自定义Reduce算子需按照以下顺序完成模块之间的组装: 1. 自定义计算模板/使用框架内置计算模板 2. 将计算模板传入`Kernel`层模板算子完成核函数功能实现; @@ -478,7 +502,7 @@ Reduce类的计算模板涉及多核之间的数据结果同步以及核内分 #include "common/kernel_utils.h" #include "reduce/common/patterns.h" -#include "reduce/reduce_utils/reduce_block_aux_util.h" +#include "reduce/utils/reduce_block_aux_util.h" namespace ATVC { @@ -520,17 +544,17 @@ public: ``` Reduce计算模板类将在数据计算阶段被`ReduceOpTemplate`算子模板调用,因此Reduce计算模板类的实现必须遵从以下约束: -- 该模板类在实例化时固定传入ATVC::OpTraits类型的结构体作为模板参数,如` ATVC::OpTraits,ATVC::OpOutputs`。 +- 该模板类在实例化时固定传入ATVC::OpTraits类型的结构体作为模板参数,如`ATVC::OpTraits,ATVC::OpOutputs`。 - 开发必须完成以下公有API的内部实现: - 1. 计算单数据基块的Reduce结果 \_\_aicore\_\_ inline void Compute(...) - 2. 计算单UB内不同数据基块的计算结果 \_\_aicore\_\_ inline void UpdateCache(...); - 3. 计算多核之间&同一核内的多次UB结果 \_\_aicore\_\_ inline void ReduceBetweenUB(...) - 4. 返回非对齐场景不参与计算的尾部数据的填充值 \_\_aicore\_\_ inline U GetPaddingValue() + 1. 计算单数据基块的Reduce结果 `\_\_aicore\_\_ inline void Compute(...)` + 2. 计算单UB内不同数据基块的计算结果 `\_\_aicore\_\_ inline void UpdateCache(...)` + 3. 计算多核之间&同一核内的多次UB结果 `\_\_aicore\_\_ inline void ReduceBetweenUB(...)` + 4. 
返回非对齐场景不参与计算的尾部数据的填充值 `\_\_aicore\_\_ inline U GetPaddingValue()` ### 2.2.3 内置Reduce算子模板 `ATVC::Kernel::ReduceOpTemplate`是一套基本的Reduce算子类,它实现了一套算子数据的搬运搬出、资源分配和释放的流程。Kernel层的算子模板需要计算模板类作为模板参数传入来完成实例化。在调用阶段,算子类将按照固定参数顺序调用计算模板类的对应接口,完成数据的计算。 -相比EleWise算子模板不同的是,ReduceOpTemplate内置了不同场景的Reduce实现,并在编译时通过`ATVC::ReducePolicy`类型的结构体来实现实例化。ReduceOpTemplate内部将根据模板参数决定数据将由哪类具体的模板实例计算。`ATVC::ReducePolicy`的数据定义如下: +相比Elementwise算子模板不同的是,ReduceOpTemplate内置了不同场景的Reduce实现,并在编译时通过`ATVC::ReducePolicy`类型的结构体来实现实例化。ReduceOpTemplate内部将根据模板参数决定数据将由哪类具体的模板实例计算。`ATVC::ReducePolicy`的数据定义如下: ```cpp struct ReducePolicy { @@ -542,21 +566,20 @@ struct ReducePolicy { 下方为`ATVC::Kernel::ReduceOpTemplate`模板类的外部接口介绍,完整模板类定义请参考`atvc/include/reduce/reduce_op_template.h`。 ```cpp -template +template class ReduceOpTemplate { public: __aicore__ inline ReduceOpTemplate(){}; // 按照输入、输出、EleWiseParam、其他标量的顺序传入 // 内部根据EleWiseParam进行数据调度并调用ReduceOpTemplate完成计算后搬出到GM - template - __aicore__ inline void Run(GM_ADDR x, GM_ADDR y, GM_ADDR param) { + template + __aicore__ inline void Run(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam* param) { // // Reduce类算子Run接口按输入、输出、运行态参数param顺序传入 // } -} +}; ``` @@ -566,13 +589,13 @@ public: // ReduceSum算子的描述:一个输入,一个输出,类型均为float using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; -template -__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, GM_ADDR reduceParam) +template +__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入, Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, Policy>(); - op.Run(x, y, reduceParam); + op.Run(x, y, &reduceParam); } ``` @@ -587,16 +610,16 @@ using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs -__global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) +template +__global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam param) { 
KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV); // 使用了多核控制指令,设置算子执行时只启动Vector核 auto op = ATVC::Kernel::ReduceOpTemplate, SelectPolicy>(); - op.Run(x, y, param); + op.Run(x, y, ¶m); } ```
@@ -604,7 +627,7 @@ __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param) Reduce算子开发场景下,核函数定义必须遵从以下约束: 1. 核函数须预留一个GM_ADDR类型的形参用于传入`ATVC::ReduceParam`运行态参数; 2. 核函数须加入`KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0);`这段代码显示标注算子类型; -3. 核函数须实例化`ATVC::Kernel::ReduceOpTemplate`变量并调用它的`Run(GM_ADDR x, GM_ADDR y, GM_ADDR param)`接口来实现数据的调度运算; +3. 核函数须实例化`ATVC::Kernel::ReduceOpTemplate`变量并调用它的`Run(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam* param)`接口来实现数据的调度运算; ### 2.2.5 Host层API @@ -693,33 +716,28 @@ static constexpr ATVC::ReducePolicy REDUCE_POLICY3 { ATVC::AR_PATTERN::AR, ATVC: // ReduceOpAdapter函数定义 // 负责Reduce类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 -template +template void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam ¶m, ATVC::ReducePolicy &policy, aclrtStream& stream) { // 申请临时空间workspace,并将其与ReduceTilingData一同传到Device侧 - uint8_t *paramDevice; uint8_t *workspaceDevice; CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); param.workspaceAddr = reinterpret_cast(workspaceDevice); - auto reduceParamSize = sizeof(param); - CHECK_ACL(aclrtMalloc((void**)¶mDevice, reduceParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, reinterpret_cast(¶m), reduceParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的ReducePolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::REDUCE_POLICY0) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); }else if (policy == ATVC::REDUCE_POLICY1) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY2) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY3) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else { printf("[ERROR] Cannot find any matched policy.\n"); } // 流同步后释放申请的param内存 CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtFree(workspaceDevice)); - 
CHECK_ACL(aclrtFree(paramDevice)); } using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; @@ -747,25 +765,17 @@ using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs #include #include +#include +#include #include #include "acl/acl.h" -#include "atvc.h" +#include "reduce/reduce_host.h" +#include "reduce/reduce_device.h" #define CHECK_ACL(x) \ do { \ @@ -780,29 +790,45 @@ static constexpr float REL_TOL = 1e-3f; static constexpr float ABS_TOL = 1e-5f; // 判断两个浮点数是否足够接近 -bool IsClose(float a, float b) { +bool IsClose(float a, float b) +{ const float eps = 1e-40f; // 防止分母为零 float diff = std::abs(a - b); return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -} // ReduceSum算子的描述:一个输入,一个输出,类型均为float using ReduceOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} +} + /* * 该函数为ReduceCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * reduceParam Device上的gm地址,指向运行态ATVC::ReduceParam数据 + * x Device上的gm地址,指向ReduceCustom算子第一个输入 + * y Device上的gm地址,指向ReduceCustom算子第一个输出 + * reduceParam 指向运行态ATVC::ReduceParam数据 */ template -__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, GM_ADDR reduceParam) +__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, Policy>(); - op.Run(x, y, reduceParam); + op.Run(x, y, &reduceParam); } // 负责Reduce类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 @@ -816,11 +842,13 @@ void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam ¶m, ATVC::Red param.workspaceAddr = reinterpret_cast(workspaceDevice); auto reduceParamSize = sizeof(param); CHECK_ACL(aclrtMalloc((void**)¶mDevice, reduceParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, reinterpret_cast(¶m), reduceParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, + reinterpret_cast(¶m), reduceParamSize, + ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的ReducePolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::REDUCE_POLICY0) { ReduceCustom<<>>(x, y, paramDevice); - }else if (policy == ATVC::REDUCE_POLICY1) { + } else if (policy == ATVC::REDUCE_POLICY1) { ReduceCustom<<>>(x, y, paramDevice); } else if (policy == ATVC::REDUCE_POLICY2) { ReduceCustom<<>>(x, y, paramDevice); @@ -862,6 +890,8 @@ void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam ¶m, ATVC::Red ReduceCustom<<>>(x, y, paramDevice); } else if (policy == ATVC::REDUCE_POLICY21) { ReduceCustom<<>>(x, y, paramDevice); + } 
else if (policy == ATVC::REDUCE_POLICY22) { + ReduceCustom<<>>(x, y, paramDevice); } else { printf("[ERROR] Cannot find any matched policy.\n"); } @@ -881,7 +911,7 @@ int32_t main(int32_t argc, char* argv[]) std::vector shape{8, 1024}; // 测试输入shape std::vector inputX(eleNum, 1.0f); std::vector golden(outEleNum, 8.0f); - + printf("Generate golden data successfully.\n"); // 初始化Acl资源 CHECK_ACL(aclInit(nullptr)); aclrtContext context; @@ -904,7 +934,7 @@ int32_t main(int32_t argc, char* argv[]) ATVC::ReducePolicy policy = {-1, -1, -1}; // Reduce运行态参数,负责映射最适合的Reduce模板实现 // Host侧调用Tiling API完成相关运行态参数的运算 if (!ATVC::Host::CalcReduceTiling(shape, dim, &policy, ¶m)) { - printf("Reduce tiling error."); + printf("Reduce tiling error.\n"); return -1; }; @@ -924,12 +954,10 @@ int32_t main(int32_t argc, char* argv[]) CHECK_ACL(aclrtResetDevice(deviceId)); CHECK_ACL(aclFinalize()); - for (int32_t i = 0; i < outEleNum; i++) { - if (!IsClose(golden[i], outputY[i])) { - printf("Accuracy verification failed! The expected value of element in index [%d] is %f, but actual value is %f.\n", i, golden[i], outputY[i]); - return -1; - } + if (!VerifyResults(golden, outputY)) { + return -1; } + printf("Accuracy verification passed.\n"); return 0; } @@ -937,11 +965,11 @@ int32_t main(int32_t argc, char* argv[]) ## 2.3 Broadcast算子开发 ATVC框架提供的Broadcast算子模板类的模块之间的交互如下(ATVC框架提供的模板及接口用黄色表示;开发自定义的模块用蓝色表示): -![broadcast_dataflow.png](data/broadcast_dataflow.png) +![broadcast_dataflow.png](images/broadcast_dataflow.png) Broadcast模板算子内部根据数据类型、输入/输出shape完成某个轴上数据扩充的功能,ATVC将各种计算调度场景抽象为`ATVC::BroadcastPolicy`。在算子调用阶段,分派策略API可根据Tiling API计算出的`ATVC::BroadcastPolicy`转化为编译态参数,结合计算模板来实例化`ATVC::Kernel::BroadcastOpTemplate`算子模板类。 ### 2.3.1 Components 根据Broadcast算子在框架内部的交互场景,ATVC提供如下的接口以及模板类帮助开发搭建自定义Broadcast算子: -![broadcast_components.png](data/broadcast_components.png) +![broadcast_components.png](images/broadcast_components.png) 自定义Broadcast算子需按照以下顺序完成模块之间的组装: 1. 自定义计算模板/使用框架内置计算模板 2. 
将计算模板传入`Kernel`层模板算子完成核函数功能实现; @@ -959,7 +987,7 @@ Broadcast计算模板是指Broadcast类算子在UB上实现将A轴的数据复 #include "broadcast/common/broadcast_common.h" namespace ATVC { -template +template class BroadcastCompute { public: using inputDTypeList = typename OpTraits::In::types; @@ -988,7 +1016,7 @@ Broadcast计算模板类将在数据计算阶段被`BroadcastOpTemplate`算子 * AB场景的计算:输入`src`是一个shape为(dimA, 1)的Tensor,需要将数据扩充到`dst`上,dst的shape是(dimA, dimB); * BA场景的计算:输入`src`是一个shape为(1, dimA)的Tensor,需要将src数据扩充到`dst`上,dst的shape是(dimB, dimA); -- 该模板类在实例化时固定传入ATVC::OpTraits类型的结构体作为模板参数,如` ATVC::OpTraits,ATVC::OpOutputs`。 +- 该模板类在实例化时固定传入`ATVC::OpTraits`类型的结构体作为模板参数,如` ATVC::OpTraits,ATVC::OpOutputs`。 ### 2.3.3 内置Broadcast算子模板 `ATVC::Kernel::BroadcastOpTemplate`是一套基本的Broadcast算子类,它实现了一套算子数据的搬运搬出、资源分配和释放的流程。Kernel层的算子模板需要计算模板类作为模板参数传入来完成实例化。在调用阶段,Broadcast算子模板将按照固定参数顺序调用计算模板类的`Compute`接口,完成数据的计算。 @@ -1003,12 +1031,12 @@ struct BroadcastPolicy { }; ``` -下方为`ATVC::Kernel::BroadcastOpTemplate`模板类的外部接口介绍,完整模板类定义请参考`atvc/include/broadcast/broadcast_op_template.h`。 +下方为`ATVC::Kernel::BroadcastOpTemplate`模板类的外部接口介绍,完整模板类定义请参考[`atvc/include/broadcast/broadcast_op_template.h`](../include/broadcast/broadcast_op_template.h)。 ```cpp #ifndef ATVC_BROADCAST_OP_TEMPLATE_H #define ATVC_BROADCAST_OP_TEMPLATE_H #include "common/const_def.h" -#include "broadcast/broadcast_utils/broadcast_buf_pool.h" +#include "broadcast/utils/broadcast_buf_pool.h" namespace ATVC { namespace Kernel { template @@ -1023,14 +1051,14 @@ public: @param dst: 输出数据的gm指针 @broadcastParam: broadcast的动态参数,包含tiling data, workspace等 */ - __aicore__ inline void Run(GM_ADDR src, GM_ADDR dst, GM_ADDR broadcastParam) + __aicore__ inline void Run(GM_ADDR src, GM_ADDR dst, ATVC::BroadcastParam* broadcastParam) { this->Init(src, dst, broadcastParam); this->Process(); } - AscendC::GlobalTensor srcGlobal; - AscendC::GlobalTensor dstGlobal; + AscendC::GlobalTensor srcGlobal_; + AscendC::GlobalTensor dstGlobal_; BroadcastCompute compute_; __gm__ BroadcastParam *param_; }; @@ 
-1046,14 +1074,13 @@ public: // BroadcastTo算子的描述:一个输入,一个输出,类型均为float using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; -template -__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam) +template +__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { - AscendC::printf("BroadcastCustom\n"); KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + op.Run(x, y, &broadcastParam); } ``` @@ -1069,18 +1096,17 @@ using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs /* * 该函数为BroadcastCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * broadcastParam Device上的gm地址,指向运行态ATVC::BroadcastParam数据 + * x Device上的gm地址,指向BroadcastTo算子第一个输入 + * y Device上的gm地址,指向BroadcastTo算子第一个输出 + * broadcastParam 指向运行态ATVC::BroadcastParam数据 */ -template -__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam) +template +__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { - AscendC::printf("BroadcastCustom\n"); KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + op.Run(x, y, &broadcastParam); } ```
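上文`BroadcastCompute`计算模板描述的AB/BA两种数据扩充语义,可以用下面这段自包含的Host侧示意代码直观理解(仅为帮助理解的草图:`BroadcastAB`、`BroadcastBA`为说明用的假设函数,并非ATVC提供的接口):

```cpp
#include <cstddef>
#include <vector>

// 示意代码:AB场景将shape为(dimA, 1)的src扩充为(dimA, dimB)的dst
std::vector<float> BroadcastAB(const std::vector<float> &src, int dimA, int dimB)
{
    std::vector<float> dst(static_cast<size_t>(dimA) * dimB);
    for (int a = 0; a < dimA; ++a) {
        for (int b = 0; b < dimB; ++b) {
            dst[a * dimB + b] = src[a]; // A轴上的每个元素沿B轴复制dimB次
        }
    }
    return dst;
}

// 示意代码:BA场景将shape为(1, dimA)的src扩充为(dimB, dimA)的dst
std::vector<float> BroadcastBA(const std::vector<float> &src, int dimA, int dimB)
{
    std::vector<float> dst(static_cast<size_t>(dimB) * dimA);
    for (int b = 0; b < dimB; ++b) {
        for (int a = 0; a < dimA; ++a) {
            dst[b * dimA + a] = src[a]; // 整行数据沿B轴复制dimB次
        }
    }
    return dst;
}
```

Kernel侧的计算模板在UB上完成的就是与之等价的数据扩充,只是以分块、分核的方式进行。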
@@ -1088,7 +1114,7 @@ __global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadca
Broadcast算子开发场景下,核函数定义必须遵从以下约束:
1. 核函数须预留一个GM_ADDR类型的形参用于传入`ATVC::BroadcastParam`运行态参数;
2. 核函数须加入`KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);`这段代码显式标注算子类型;
-3. 核函数须实例化`ATVC::Kernel::BroadcastOpTemplate`变量,实例化时需传入对应的计算实现模板类`ATVC::BroadcastCompute`,并调用它的`Run(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam)`接口来实现数据的调度运算;
+3. 核函数须实例化`ATVC::Kernel::BroadcastOpTemplate`变量,实例化时需传入对应的计算实现模板类`ATVC::BroadcastCompute`,并调用它的`Run(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam* broadcastParam)`接口来实现数据的调度运算;

### 2.3.5 Host层API
#### 2.3.5.1 CalcBroadcastTiling
@@ -1128,7 +1154,7 @@ struct BroadcastParam {
```cpp
namespace ATVC {
namespace Host {
-template
+template
bool CalcBroadcastTiling(std::vector shapeIn, std::vector shapeOut, BroadcastPolicy* policy, BroadcastParam* param)
{
    using inputDTypeList = typename OpTraits::In::types;
@@ -1170,7 +1196,7 @@ int32_t main(int32_t argc, char* argv[])
```cpp
// 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数
-template
+template
void BroadcastOpAdapter(uint8_t* x, uint8_t* y, ATVC::BroadcastParam &param, ATVC::BroadcastPolicy &policy, aclrtStream& stream)
{
    // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧
@@ -1216,7 +1242,7 @@ int32_t main(int32_t argc, char* argv[])

### 2.3.6 完整样例

-通过ATVC框架实现的完整BroadcastCustom算子[样例代码](../examples/broadcast_to/broadcast_to.cpp)如下:
+通过ATVC框架实现的完整BroadcastCustom算子[样例代码](../examples/broadcast_to/broadcast_to.cpp)
```cpp
#include
#include
@@ -1252,23 +1278,22 @@ using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs
/*
 * 该函数为BroadcastCustom算子核函数入口
- * x Device上的gm地址,指向Add算子第一个输入
- * y Device上的gm地址,指向Add算子第一个输出
- * broadcastParam Device上的gm地址,指向运行态ATVC::BroadcastParam数据
+ * x Device上的gm地址,指向BroadcastTo算子第一个输入
+ * y Device上的gm地址,指向BroadcastTo算子第一个输出
+ * broadcastParam 指向运行态ATVC::BroadcastParam数据
 */
-template
-__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam)
+template
+__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { - AscendC::printf("BroadcastCustom\n"); KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + op.Run(x, y, &broadcastParam); } // 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 -template +template void BroadcastOpAdapter(uint8_t* x, uint8_t* y, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy, aclrtStream& stream) { // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 @@ -1355,4 +1380,210 @@ int32_t main(int32_t argc, char* argv[]) printf("Accuracy verification passed.\n"); return 0; } -``` \ No newline at end of file +``` + +# 3 ATVC的调试调优功能 +为了用户在使用ATVC进行算子开发时能快速进行精度调试和性能调优,ATVC支持多种调试调优能力。 +## 3.1 OpTraits校验接口 +用户可通过`DebugCheck()`接口校验不同模板的OpTraits功能, 接口在Host侧调用,无需额外的开关限制,接口定义如下: +```cpp +namespace ATVC { +namespace Host { +template +bool DebugCheck() +} +} +``` +其中,模板参数`OpTraits`是用户定义的待校验的输入输出描述信息, 模板参数`templateType`是校验规则分类的标识, 定义如下: +```cpp +enum class TemplateType { + ELE_WISE, // ElementWise模板的校验类型 + REDUCE, // Reduce模板的校验类型 + BROADCAST, // Broadcast模板的校验类型 +}; +``` +接口使用示例: +```cpp +using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; + +ATVC::Host::DebugCheck(); +``` +完整的DebugCheck调用接口样例可参考tanh_grad算子[样例代码](../examples/tanh_grad/tanh_grad.cpp)。 +## 3.2 使用调试调优模式运行算子 +样例执行脚本run_test.sh支持可选入参`--run-mode`进行不同调试调优运行模式的选择。 +当前支持`debug_print`和`profiling`两种模式。 +- `--run-mode=debug_print`:DFX信息打印模式,打开kernel侧的模板内置关键节点的信息打印和异常退出时的打印功能。 +- `--run-mode=profiling`:Profiling性能采集模式,运行时打开profiling性能数据采集功能。 +- 未设置`--run-mode`:默认模式,正常上板,无kernel侧的dfx信息打印, 未开启profiling性能采集功能。 +## 3.2.1 DFX信息打印模式 +通过运行run_test.sh脚本时加上可选参数`--run-mode=debug_print`打开本功能。 +DFX信息打印格式按照 [日志级别(`ERROR`/`INFO`)]:[`ATVC`][`Module`](可选:[`CopyIn`/`CopyOut`等])的标准进行打印。 +- 日志级别: ERROR是异常打印信息,INFO是模板内部重要信息打印 +- `ATVC`: 标识是ATVC模板库内置的DFX信息打印 +- `Module`: 
标识是哪个模块的信息打印,例如:`EleWise`、`Reduce`、`Broadcast`、`Common`等模块。
+- 可选子模块: 用于部分`Module`涉及多个子模块的情况,可选择增加子模块信息,细化DFX信息。
+模板内部提供的DFX信息打印接口定义及使用样例如下所示。普通算子开发用户无需关注该接口,只有在需要修改或扩展模板功能的场景下才会用到。
+```cpp
+// 接口定义
+namespace ATVC {
+namespace Kernel {
+template
+__aicore__ inline void DebugPrintf(__gm__ const char* fmt, Args&&... args);
+}
+}
+// 调用示例
+ATVC::Kernel::DebugPrintf("[ERROR]: [ATVC][EleWise] Input Count can not be 0!\n");
+ATVC::Kernel::DebugPrintf("[INFO]: [ATVC][EleWise][CopyIn] Offset is %u, copy count is %u.\n", curCoreStartCnt_ + offsetCnt_, calcCnt_);
+```
+
+## 3.2.2 开启Profiling性能调优功能
+通过运行run_test.sh脚本时加上可选参数`--run-mode=profiling`打开本功能。
+为了增强Profiling采集性能数据的稳定性,建议用户在开启profiling时重复多次调用kernel,这样可以一次性采集多次上板的性能数据,消除抖动。
+```cpp
+TanhGrad<<>>(dyDevice, yDevice, zDevice, paramDevice);
+#if ATVC_DEBUG_MODE == 2 // ATVC_DEBUG_MODE == 2: open profiling
+    for (int32_t i = 0; i < 19; i++) { // 19: run kernel 1 + 19 times for profiling
+        TanhGrad<<>>(dyDevice, yDevice, zDevice, paramDevice);
+    }
+#endif
+```
+其中`ATVC_DEBUG_MODE`是`run-mode`在不同模式下的内部宏定义映射。`ATVC_DEBUG_MODE == 2`是`--run-mode=profiling`的内部映射,用户无需关注。
+
+## 3.3 Tiling超参调优
+### 3.3.1 ElementWise模板算子Tiling超参调优
+#### 3.3.1.1 ElementWise模板通用Tiling算法
+- 计算blockNum:blockNum = 总的元素量(totalCnt) / 单核数据量基线(singleCoreBaseLine),blockNum最小值为1,最大值为平台提供的最大vectorCore值。
+- 计算达到UB上限的单核单输入元素个数值ubLimitCnt:UB上限内存大小 / 所有输入输出及temp单个元素的内存之和。
+- 计算tiledCnt:
+  - 计算每个核需要处理的数据元素量avgElePerBlock = totalCnt / blockNum;
+  - 根据avgElePerBlock所处的splitDataShape数据段,按照对应的切分系数切分基本块:tiledCnt = avgElePerBlock / dataSplitFactor;
+  - tiledCnt调整:不超过上限ubLimitCnt,不小于下限32,且最后的tiledCnt要做32元素对齐。
+#### 3.3.1.2 ElementWise TilingData定义
+ElementWise模板通用Tiling切分的数据结构为EleWiseTilingData,描述了核间切分和单核内切分的策略,其定义如下:
+```cpp
+namespace ATVC {
+struct EleWiseTilingData {
+    uint32_t tailBlockCnt; // 需要额外执行一次循环的核的数量
+    uint32_t tailElemCnt;  // 尾块元素数量
+    uint32_t numPerBlock;  // 每个核需计算的基本块个数
+    uint32_t tiledCnt;     // 基本块元素个数
+    uint32_t blockNum;     // 执行核数
+};
+}
+```
+
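上述通用Tiling算法可以用如下自包含的示意代码表达(仅为帮助理解的草图:`CalcBlockNum`、`CalcTiledCnt`等函数名、`SketchHyperParam`结构及其中的常量均为示意假设,并非ATVC内部实现):

```cpp
#include <algorithm>
#include <cstdint>

// 示意用的超参集合,默认值取自上文EleWiseTilingHyperParam的说明
struct SketchHyperParam {
    uint32_t singleCoreBaseLine = 512;
    uint32_t splitDataShape[3] = {1024, 32 * 1024, 64 * 1024};
    uint32_t dataSplitFactor[4] = {4, 4, 8, 6};
};

// blockNum = totalCnt / singleCoreBaseLine,并限制在[1, maxVectorCore]范围内
uint32_t CalcBlockNum(uint32_t totalCnt, uint32_t baseLine, uint32_t maxVectorCore)
{
    uint32_t blockNum = totalCnt / baseLine;
    return std::max<uint32_t>(1, std::min(blockNum, maxVectorCore));
}

// 按avgElePerBlock所处数据段的切分系数计算tiledCnt,并做上下限与32元素对齐处理
uint32_t CalcTiledCnt(uint32_t totalCnt, uint32_t blockNum, uint32_t ubLimitCnt,
                      const SketchHyperParam &hp)
{
    uint32_t avgElePerBlock = totalCnt / blockNum;
    uint32_t idx = 0; // 确定avgElePerBlock落在哪个数据段
    while (idx < 3 && avgElePerBlock > hp.splitDataShape[idx]) {
        ++idx;
    }
    uint32_t tiledCnt = avgElePerBlock / hp.dataSplitFactor[idx];
    tiledCnt = std::min(tiledCnt, ubLimitCnt);   // 不超过UB上限
    tiledCnt = std::max<uint32_t>(tiledCnt, 32); // 不小于下限32
    return tiledCnt / 32 * 32;                   // 32元素对齐
}
```

例如totalCnt = 8 * 1024、singleCoreBaseLine = 512时,blockNum为16,每核avgElePerBlock为512,落在第一个数据段,按切分系数4得到tiledCnt = 128。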
+#### 3.3.1.3 ElementWise Tiling超参调优 +当前提供的ElementWise模板内置通用Tiling可调超参如下所示: +| Tiling超参名 | 数据类型 | 参数说明 | 调节范围 | 默认值 | +| ----------- | -------------- | ----------- | ----------- |---| +| singleCoreBaseLine | uint32_t | 单核数据量基线 | [256, 128 * 1024] | 512| +| ubSizeLimitThreshold | float | UB内存使用上限,决定了basicBlock最大值 | [0.5, 0.96] | 0.95 | +| splitDataShape | uint32_t[3]| 单核内数据量的3个分段节点,表示数据量分为4段| {node_0, node_1, node_2} | {1024, 32*1024, 64*1024}| +| dataSplitFactor | uint32_t[4]| 单核内4个数据段的切分系数, 决定不同数据段的切分基本块的大小| {factor_0, factor_1, factor_2, factor_3} 均大于0 小于32| {4, 4, 8, 6}| +| rsvLiveCnt | uint32_t| 预留的空间大小为rsvLiveCnt * (inputBuffer + outputBuffer)|[0, 1]| 0| + +对应的超参`EleWiseTilingHyperParam`数据结构定义如下: +```cpp +namespace ATVC { +namespace Host { +struct EleWiseTilingHyperParam { + uint32_t singleCoreBaseLine = 512; // 数据量基线:核内数据量超过基线就分多核直至满核, 设置范围: [256, 128 * 1024] + float ubSizeLimitThreshold = 0.95f; // UB内存使用上限,决定了basicBlock最大值 + uint32_t nBufferNum = 2; // multi buffer 设置值为: [1, 2] + uint32_t splitDataShape[MAX_SHAPE_NODE] = {1024, 32 * 1024, 64 * 1024}; // 数据分段节点 + uint32_t dataSplitFactor[MAX_SHAPE_NODE + 1] = {4, 4, 8, 6}; // 对应数据分段内的数据量的切分系数 + uint32_t rsvLiveCnt = 0; // 额外的存活节点数,表示内部需要申请空间个数,可设置的范围为[0, 1] +}; +} +} +``` +计算接口`CalcEleWiseTiling()`定义如下所示: +```cpp +namespace ATVC { +namespace Host { +template +bool CalcEleWiseTiling(int32_t totalCnt, ATVC::EleWiseParam ¶m, + EleWiseTilingHyperParam hyperParam = EleWiseTilingHyperParam()); +} +} +``` +其中,可选参数`hyperParam`在未传入用户自定义超参时,使用`EleWiseTilingHyperParam`的默认值。 +若用户需要修改某个超参,调用示例如下所示: +```cpp + // Add算子中有两个输入,一个输出。类型均为float + using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; + // totalCnt描述EleWise单输入的元素个数 + int32_t eleNum = 8 * 1024; + // 声明运行态参数param + ATVC::EleWiseParam param; + ATVC::Host::EleWiseTilingHyperParam hyperParam; + hyperParam.singleCoreBaseLine = 1024; + if (!ATVC::Host::CalcEleWiseTiling(eleNum, param, hyperParam=hyperParam)) { + printf("[ERROR]: Calculate eleWise tiling failed.\n"); + 
return -1; + }; +``` +### 3.3.2 Reduce模板算子Tiling超参调优 +#### 3.3.2.1 Reduce Tiling通用算法 +Reduce Tiling计算流程较为复杂,简化后的主要流程如下: +![](images/reduce_tiling.png) +#### 3.3.2.2 Reduce TilingData定义 +ATVC host 和kernel侧都会使用到的`ReduceTilingData`是Reduce的核间AR轴切分、单核内AR轴切分的策略,其定义如下: +```cpp +namespace ATVC { +struct ReduceTilingData { + uint64_t factorACntPerCore; // 在每个核上不参与计算的非Reduce轴实际维度 + uint64_t factorATotalCnt; // 不参与计算的非Reduce轴总维度 + uint64_t ubFactorA; // 单UB内非Reduce轴的数据量 + uint64_t factorRCntPerCore; // 在每个核上参与计算的Reduce轴实际维度 + uint64_t factorRTotalCnt; // 参与计算的Reduce轴总维度 + uint64_t ubFactorR; // 单UB内参与计算的Reduce轴维度 + uint64_t groupR; // 切轴为R轴,该轴上切点外的R的相对数据量 + uint64_t outSize; // 切轴外的AR数据总量 + uint64_t basicBlock; // 基础数据块大小 + int32_t coreNum; // 执行核数 + float meanVar; // 预留信息,暂不使用 + uint64_t shape[MAX_DIM]; // shape信息 + uint64_t stride[MAX_DIM]; // 输入数据搬运步长 + uint64_t dstStride[MAX_DIM]; // 输出数据搬运步长 +}; +} +``` +#### 3.3.2.3 Reduce 超参调优 +可调参数如下所示: +| Tiling超参名 | 数据类型 | 参数说明 | 调节范围 | 默认值 | +| ----------- | -------------- | ----------- | ----------- |---| +| basicBlock | uint32_t | Reduce 基本块内存大小 | 不能超过UB内存的1/3, 192K内存 建议在48K-54K之间设置 | 54 * 1024| +| maxInnerA | uint32_t |AR切轴内A轴的最大数据量 | [128, 4096] | 128 | +| balanceThreshHold | double| 多核均衡的阈值水平, 阈值越高,切分后每个核处理的数据量越均衡 | [0.85, 0.95]| 0.95 | + +对应的超参`ReduceTilingHyperParam`结构定义如下: +```cpp +namespace ATVC { +namespace Host { +struct ReduceTilingHyperParam { + uint32_t basicBlock = 48 * 1024; // 设置Reduce基本块内存大小, 一般不能超过内存的1/3, 建议在[48-54]之间设置 + uint32_t maxInnerA = 128; // [128, 256] + double balanceThreshHold = 0.95; // 多核均衡的阈值水平 [0.8-0.95] +}; +} +} +``` +Reduce Tiling的计算接口`CalcReduceTiling()`定义如下: +```cpp +namespace ATVC { +namespace Host { +template +bool CalcReduceTiling(std::vector inputShape, + std::vector reduceDim, + ReducePolicy* policy, + ReduceParam* param, + ReduceTilingHyperParam hyperParam = ReduceTilingHyperParam()); +} +} +``` 
+其中,可选参数`hyperParam`在未传入用户自定义超参时,使用`ReduceTilingHyperParam`的默认值,若用户需要修改某个超参,可自定义`ReduceTilingHyperParam`后传入。 + +# Copyright + +Copyright (c) 2025 Huawei Technologies Co., Ltd. \ No newline at end of file diff --git a/atvc/docs/code_organization.md b/atvc/docs/03_code_organization.md similarity index 42% rename from atvc/docs/code_organization.md rename to atvc/docs/03_code_organization.md index 58bd34d4de241e131263081eb5bcc0d85dab218c..47641b0cb98d098b19a918a1145b5d546ade92a2 100644 --- a/atvc/docs/code_organization.md +++ b/atvc/docs/03_code_organization.md @@ -8,40 +8,25 @@ include目录下的头文件是按照如下的文件层级进行组织。 ``` include/ ├── atvc.h // Vector模板编程入口头文件 -├── common // 不同模板公用api和C++基本类的拓展模板目录 -│ ├── atvc_opdef.h // ATVC模板算子数据类型描述 -│ ├── compile_info.h // 平台相关编译信息 -│ ├── const_def.h // 常量定义 -│ └── ... -├── elewise // element_wise模板目录 -│ ├── common -│ │ └── elewise_common.h // element_wise的公共数据定义 -│ ├── elewise_op_template.h // element_wise算子模板类 -│ └── elewise_host.h // element_wise算子host侧API -├── broadcast // broadcast模板目录 -│ ├── common // broadcast模板各层公用文件目录 -│ │ ├── ... -│ │ └── broadcast_common.h // broadcast的公共数据定义 -│ ├── broadcast_utils // broadcast模板辅助工具目录 -│ │ ├── ... -│ │ └── broadcast_buf_pool.h // broadcast内存管理类 -│ ├── tiling // broadcast模板host层目录 -│ │ └── broadcast_tiling.h // broadcast tiling算法 -│ ├── broadcast_host.h // broadcast算子host侧API -│ ├── broadcast_op_template.h // broadcast算子模板类 -│ └── broadcast_compute.h // broadcast计算模板 -└── reduce // reduce模板目录 - ├── common // reduce模板各层公用文件目录 - │ ├── ... - │ └── reduce_common.h // reduce的公共数据定义 - ├── reduce_utils // reduce模板辅助工具目录 - │ ├── ... 
- │ └── reduce_util.h // reduce计算辅助工具函数 - ├── tiling // reduce模板host层目录 - │ └── reduce_tiling.h // reduce tiling算法 - ├── reduce_host.h // reduce算子host侧API - ├── reduce_op_template.h // reduce算子模板类 - └── reduce_sum.h // reduceSum计算模板 +├── common // 不同模板公用API和C++基本类的拓展模板目录 +├── elewise // Elementwise模板目录 +│ ├── common // Elementwise的公共数据定义 +│ ├── elewise_op_template.h // Elementwise算子模板类 +│ └── elewise_host.h // Elementwise算子host侧API +├── broadcast // Broadcast模板目录 +│ ├── common // Broadcast模板各层公用文件目录 +│ ├── utils // Broadcast模板辅助工具目录 +│ ├── tiling // Broadcast模板host层目录 +│ ├── broadcast_host.h // Broadcast算子host侧API +│ ├── broadcast_op_template.h // Broadcast算子模板类 +│ └── broadcast_compute.h // Broadcast计算模板 +└── reduce // Reduce模板目录 + ├── common // Reduce模板各层公用文件目录 + ├── utils // Reduce模板辅助工具目录 + ├── tiling // Reduce模板host层目录 + ├── reduce_host.h // Reduce算子host侧API + ├── reduce_op_template.h // Reduce算子模板类 + └── reduce_sum.h // ReduceSum计算模板 ``` ## 2. examples examples文件夹下提供了算子代码样例,包含算子实现的源码文件和测试用例配置和执行脚本。 @@ -50,7 +35,7 @@ examples ├── add // EleWise Add算子样例 │ ├── README.md │ └── add.cpp -├── add_with_scalar // EleWise + Scalar场景样例 +├── add_with_scalar // EleWise + Scalar算子样例 │ ├── README.md │ └── add_with_scalar.cpp ├── broadcast_to // BroadcastTo算子样例 @@ -62,14 +47,21 @@ examples └── sinh_custom // SinhCustom算子样例 ├── README.md └── sinh_custom.cpp +└── tanh_grad // Tah_Grad算子样例 + ├── README.md + └── tanh_grad.cpp ``` ## 3. docs docs文件夹下包含项目的所有文档。 ``` doc/ -├── 1_quick_start.md // atvc快速上手指南 -├── 2_developer_guide.md // 开发指南 -├── code_organization.md // 目录结构说明 -└── data // 图片 -``` \ No newline at end of file +├── 01_quick_start.md // ATVC快速上手指南 +├── 02_developer_guide.md // 开发指南 +├── 03_code_organization.md // 目录结构说明 +└── images // 图片 +``` + +# Copyright + +Copyright (c) 2025 Huawei Technologies Co., Ltd. 
\ No newline at end of file diff --git a/atvc/docs/data/architecture.png b/atvc/docs/images/architecture.png similarity index 100% rename from atvc/docs/data/architecture.png rename to atvc/docs/images/architecture.png diff --git a/atvc/docs/data/atvc_user_case.png b/atvc/docs/images/atvc_user_case.png similarity index 100% rename from atvc/docs/data/atvc_user_case.png rename to atvc/docs/images/atvc_user_case.png diff --git a/atvc/docs/data/broadcast_components.png b/atvc/docs/images/broadcast_components.png similarity index 100% rename from atvc/docs/data/broadcast_components.png rename to atvc/docs/images/broadcast_components.png diff --git a/atvc/docs/data/broadcast_dataflow.png b/atvc/docs/images/broadcast_dataflow.png similarity index 100% rename from atvc/docs/data/broadcast_dataflow.png rename to atvc/docs/images/broadcast_dataflow.png diff --git a/atvc/docs/data/elewise_components.png b/atvc/docs/images/elewise_components.png similarity index 100% rename from atvc/docs/data/elewise_components.png rename to atvc/docs/images/elewise_components.png diff --git a/atvc/docs/data/elewise_dataflow.png b/atvc/docs/images/elewise_dataflow.png similarity index 100% rename from atvc/docs/data/elewise_dataflow.png rename to atvc/docs/images/elewise_dataflow.png diff --git a/atvc/docs/data/framework.png b/atvc/docs/images/framework.png similarity index 100% rename from atvc/docs/data/framework.png rename to atvc/docs/images/framework.png diff --git a/atvc/docs/data/reduce_components.png b/atvc/docs/images/reduce_components.png similarity index 100% rename from atvc/docs/data/reduce_components.png rename to atvc/docs/images/reduce_components.png diff --git a/atvc/docs/data/reduce_dataflow.png b/atvc/docs/images/reduce_dataflow.png similarity index 100% rename from atvc/docs/data/reduce_dataflow.png rename to atvc/docs/images/reduce_dataflow.png diff --git a/atvc/docs/images/reduce_tiling.png b/atvc/docs/images/reduce_tiling.png new file mode 100644 index 
0000000000000000000000000000000000000000..628694c0b699281016a49adadfe0f52bc296fe20 Binary files /dev/null and b/atvc/docs/images/reduce_tiling.png differ diff --git a/atvc/examples/add/add.cpp b/atvc/examples/add/add.cpp index f5a410a114299bbff4b620dd15bce57201812688..a3f1e65443f1c0bd3551f573f1cd49df4063fd9b 100644 --- a/atvc/examples/add/add.cpp +++ b/atvc/examples/add/add.cpp @@ -38,26 +38,6 @@ bool IsClose(float a, float b) return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -// Add算子中有两个输入,一个输出。类型均为float -using ADD_OPTRAITS = ATVC::OpTraits, ATVC::OpOutputs>; - -// 传入编译态参数ATVC::OpTraits -template -struct AddComputeFunc { - /* - 函数说明: c = a + b - 参数说明: - a : 参与运算的输入 - b : 参与运算的输入 - c : 参与运算的输出 - */ - template - // 重载operator,提供给算子模板类调用 - __aicore__ inline void operator()(AscendC::LocalTensor a, AscendC::LocalTensor b, AscendC::LocalTensor c) { - AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC Api自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 - } -}; - void InitializeData(int32_t eleNum, std::vector &inputX, std::vector &inputY, std::vector &golden) { std::random_device rd; @@ -75,7 +55,7 @@ bool VerifyResults(const std::vector &golden, const std::vector &o { for (int32_t i = 0; i < golden.size(); i++) { if (!IsClose(golden[i], output[i])) { - printf("Accuracy verification failed! The expected value of element " + printf("[ERROR]: Accuracy verification failed! 
The expected value of element " "in index [%d] is %f, but actual value is %f.\n", i, golden[i], @@ -94,21 +74,41 @@ void InitializeACL(aclrtContext &context, aclrtStream &stream, int32_t deviceId) CHECK_ACL(aclrtCreateStream(&stream)); } -void CleanACL(aclrtStream &stream, int32_t deviceId) +void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) { CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); CHECK_ACL(aclrtResetDevice(deviceId)); CHECK_ACL(aclFinalize()); } -void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice, uint8_t *¶mDevice) +void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice) { CHECK_ACL(aclrtFree(xDevice)); CHECK_ACL(aclrtFree(yDevice)); CHECK_ACL(aclrtFree(zDevice)); - CHECK_ACL(aclrtFree(paramDevice)); CHECK_ACL(aclrtFreeHost(zHost)); } + +// Add算子中有两个输入,一个输出。类型均为float +using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; + +// 传入编译态参数ATVC::OpTraits +template +struct AddComputeFunc { + /* + 函数说明: c = a + b + 参数说明: + a : 参与运算的输入 + b : 参与运算的输入 + c : 参与运算的输出 + */ + template + // 重载operator,提供给算子模板类调用 + __aicore__ inline void operator()(AscendC::LocalTensor a, AscendC::LocalTensor b, AscendC::LocalTensor c) { + AscendC::Add(c, a, b, c.GetSize()); // 开发调用AscendC Api自行实现计算逻辑, 通过c.GetSize()获取单次计算的元素数量 + } +}; } template @@ -117,18 +117,22 @@ template * a Device上的gm地址,指向Add算子第一个输入 * b Device上的gm地址,指向Add算子第二个输入 * c Device上的gm地址,指向Add算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * param 指向运行态ATVC::EleWiseParam数据 */ -__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR param) +__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(a, b, c, param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 + op.Run(a, b, c, 
¶m); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 } int main() { + if (!ATVC::Host::DebugCheck()) { + printf("[ERROR]: Element wise OpTraits check failed.\n"); + return -1; + } // totalCnt描述EleWise单输入的元素个数 int32_t eleNum = 8 * 1024; size_t inputByteSize = static_cast(eleNum) * sizeof(float); @@ -145,8 +149,8 @@ int main() InitializeACL(context, stream, deviceId); ATVC::EleWiseParam param; - if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { - printf("Elewise tiling error.\n"); + if (!ATVC::Host::CalcEleWiseTiling(eleNum, param)) { + printf("[ERROR]: Calculate Element wise tiling Failed.\n"); return -1; }; auto elementParamSize = sizeof(param); @@ -155,33 +159,29 @@ int main() uint8_t *xDevice; uint8_t *yDevice; uint8_t *zDevice; - uint8_t *paramDevice; CHECK_ACL(aclrtMallocHost((void **)&zHost, outputByteSize)); CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void **)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void **)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMalloc((void **)¶mDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); - CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, inputY.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); - CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, reinterpret_cast(¶m), - elementParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, inputY.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); uint32_t blockNum = param.tilingData.blockNum; - // 调用核函数 - AddCustom<<>>(xDevice, yDevice, zDevice, paramDevice); + // 调用核函数 + AddCustom<<>>(xDevice, yDevice, zDevice, param); CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); std::vector outputZ(reinterpret_cast(zHost), 
reinterpret_cast(zHost) + eleNum); - CleanUp(zHost, xDevice, yDevice, zDevice, paramDevice); - CleanACL(stream, deviceId); + CleanUp(zHost, xDevice, yDevice, zDevice); + CleanACL(stream, context, deviceId); if (!VerifyResults(golden, outputZ)) { return -1; } - printf("Accuracy verification passed.\n"); + printf("[INFO]: Accuracy verification passed.\n"); return 0; } \ No newline at end of file diff --git a/atvc/examples/add_with_broadcast/README.md b/atvc/examples/add_with_broadcast/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f49500d66f3d7b495549cdb9d09bb3530860e097 --- /dev/null +++ b/atvc/examples/add_with_broadcast/README.md @@ -0,0 +1,47 @@ + + +## 概述 + +本样例介绍了利用ATVC实现带广播的Add单算子并完成功能验证 + + +## 样例支持产品型号: +- Atlas A2训练系列产品 + +## 目录结构 + +| 文件名 | 描述 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [add_with_broadcast.cpp](./add_with_broadcast.cpp) | Add算子代码实现以及调用样例 | + +## 算子描述 + +Add算子数学计算公式:$z = x + y$ + +Add算子规格: + + + + + + + + + + + + + + +
+| 算子类型(OpType) | Add |
+| ---------------- | --- |
+
+算子输入:
+
+| name | shape | data type | format |
+| ---- | ----- | --------- | ------ |
+| x | 1 * 2048 | float | ND |
+| y | 8 * 2048 | float | ND |
+
+算子输出:
+
+| name | shape | data type | format |
+| ---- | ----- | --------- | ------ |
+| z | 8 * 2048 | float | ND |
+
+核函数名:AddWithBroadcastCustom
+ +## 算子运行 +在ascendc-api-adv代码仓目录下执行: +```bash +$ cd ./atvc/tests/ +$ bash run_test.sh add_with_broadcast +... +Generate golden data successfully. +... +Accuracy verification passed. +``` \ No newline at end of file diff --git a/atvc/examples/add_with_broadcast/add_with_broadcast.cpp b/atvc/examples/add_with_broadcast/add_with_broadcast.cpp new file mode 100644 index 0000000000000000000000000000000000000000..2a58d49b93040a9091e931da49242e21ad4cbb77 --- /dev/null +++ b/atvc/examples/add_with_broadcast/add_with_broadcast.cpp @@ -0,0 +1,160 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include "acl/acl.h" +#include "broadcast/broadcast_host.h" +#include "add_with_broadcast.h" + +#define CHECK_ACL(x) \ + do { \ + aclError __ret = x; \ + if (__ret != ACL_ERROR_NONE) { \ + std::cerr << __FILE__ << ":" << __LINE__ << " aclError:" << __ret << std::endl; \ + } \ + } while (0) + +namespace { +static constexpr float REL_TOL = 1e-3f; +static constexpr float ABS_TOL = 1e-5f; + +// 判断两个浮点数是否足够接近 +bool IsClose(float a, float b) { + const float eps = 1e-40f; // 防止分母为零 + float diff = std::abs(a - b); + return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); +} + +// AddWithBroadcast算子的描述:两个输入,一个输出,类型均为float +using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; + +// 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 +template +void BroadcastOpAdapter(uint8_t* x, uint8_t* y, uint8_t* z, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy) +{ + aclrtStream stream = nullptr; + CHECK_ACL(aclrtCreateStream(&stream)); + // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 + uint8_t *workspaceDevice; + CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); + param.workspaceAddr = reinterpret_cast(workspaceDevice); + // 将tiling api计算出的BroadcastPolicy转化为编译态参数并实例化相应的核函数 + if (policy == ATVC::BROADCAST_POLICY0) { + AddWithBroadcastCustom<<>>(x, y, z, param); + }else if (policy == ATVC::BROADCAST_POLICY1) { + AddWithBroadcastCustom<<>>(x, y, z, param); + } else { + printf("[ERROR] Cannot find any matched policy.\n"); + } + // 流同步后释放申请的param内存 + CHECK_ACL(aclrtSynchronizeStream(stream)); + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtFree(workspaceDevice)); +} + +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} +} + + +int32_t main(int32_t argc, char* argv[]) +{ + int32_t eleNum = 1 * 2048; + int32_t outEleNum = 8 * 2048; + std::vector shapeIn{1, 2048}; // 测试输入shape + std::vector shapeOut{8, 2048}; // 测试输入shape + + size_t inputByteSize = static_cast(eleNum) * sizeof(float); + size_t outputByteSize = static_cast(outEleNum) * sizeof(float); + std::random_device rd; + std::mt19937 gen(rd()); + std::uniform_real_distribution disX(1.0f, 9.0f); + std::uniform_real_distribution disY(1.0f, 9.0f); + + std::vector inputX(eleNum); + std::vector inputY(outEleNum); + std::vector golden(outEleNum); + for (int i = 0; i < eleNum; ++i) { + inputX[i] = (disX(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + inputY[i] = (disY(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + golden[i] = (inputX[i % eleNum]) + (inputY[i]); + } + printf("Generate golden data successfully.\n"); + // 初始化Acl资源 + CHECK_ACL(aclInit(nullptr)); + aclrtContext context; + int32_t deviceId = 0; + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + uint8_t *zHost; + uint8_t *xDevice; + uint8_t *yDevice; + uint8_t *zDevice; + + CHECK_ACL(aclrtMallocHost((void **)(&zHost), outputByteSize)); + CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&yDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + + CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(yDevice, outputByteSize, inputY.data(), outputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + + ATVC::BroadcastParam param; // Broadcast运行态参数,包含TilingData以及临时空间的相关信息 + ATVC::BroadcastPolicy policy = {-1, -1, -1}; // Broadcast运行态参数,负责映射最适合的Broadcast模板实现 + 
// Host侧调用Tiling API完成相关运行态参数的运算 + param.nBufferNum = 1; + if (!ATVC::Host::CalcBroadcastTiling(shapeIn, shapeOut, &policy, &param)) { + printf("Broadcast tiling error.\n"); + return -1; + } + // 调用Adapter调度接口,完成核函数的模板调用 + BroadcastOpAdapter(xDevice, yDevice, zDevice, param, policy); + + CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); + std::vector outputZ(reinterpret_cast(zHost), reinterpret_cast(zHost) + outEleNum); + + // 释放Acl资源 + CHECK_ACL(aclrtFree(xDevice)); + CHECK_ACL(aclrtFree(yDevice)); + CHECK_ACL(aclrtFree(zDevice)); + CHECK_ACL(aclrtFreeHost(zHost)); + + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); + + if (!VerifyResults(golden, outputZ)) { + return -1; + } + printf("Accuracy verification passed.\n"); + return 0; +} diff --git a/atvc/examples/add_with_broadcast/add_with_broadcast.h b/atvc/examples/add_with_broadcast/add_with_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..a1abb007e4ba07de1c32e3f7321e58eff89ddcf2 --- /dev/null +++ b/atvc/examples/add_with_broadcast/add_with_broadcast.h @@ -0,0 +1,51 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ +#ifndef ATVC_ADD_WITH_BROADCAST_H +#define ATVC_ADD_WITH_BROADCAST_H +#include "post_compute_add_of_broadcast.h" + +/* ! 
+ * \brief z = x + y, the shape of x must be able to be broadcasted to the shape of y + * \param [in] x, input global memory of x + * \param [in] y, input global memory of y + * \param [out] z, output global memory + * \param [in] broadcastParam, params of broadcast + */ +template +__global__ __aicore__ void AddWithBroadcastCustom(GM_ADDR x, + GM_ADDR y, + GM_ADDR z, + ATVC::BroadcastParam broadcastParam) +{ + KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); + + // 1. get input and output for kernel op from host Traits + using KernelOpIn = typename Traits::In::types; + using KernelOpOut = typename Traits::Out::types; + using KernelOpTemp = typename Traits::Temp::types; + + // 2. define input and output for broadcast + using BroadcastOpInput = ATVC::OpInputs::Type>; + using BroadcastOpOutput = ATVC::OpOutputs::Type>; + using BroadcastOpTraits = ATVC::OpTraits; + + // 3. define input and output for post compute + using AddOpInput = ATVC::OpInputs::Type, typename ATVC::TypeListGet::Type>; + using AddOpOutput = ATVC::OpOutputs::Type>; + using AddOpTraits = ATVC::OpTraits; + using PostCompute = PostComputeAddOfBroadcast; + + // 4. call op run + auto op = ATVC::Kernel::BroadcastOpTemplate, Policy, void, PostCompute>(); + ATVC::BroadcastParam *param = &broadcastParam; + op.Run(x, y, z, param); +} +#endif diff --git a/atvc/examples/add_with_broadcast/post_compute_add_of_broadcast.h b/atvc/examples/add_with_broadcast/post_compute_add_of_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..a02bb149d6c9b7adf0ec6e6ce920a359471483a8 --- /dev/null +++ b/atvc/examples/add_with_broadcast/post_compute_add_of_broadcast.h @@ -0,0 +1,102 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. 
You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ +#ifndef ATVC_POST_COMPUTE_ADD_OF_BROADCAST_H +#define ATVC_POST_COMPUTE_ADD_OF_BROADCAST_H + +#include "broadcast/broadcast_device.h" + +template +struct PostComputeAddOfBroadcast { + using inputDTypeList = typename Traits::In::types; + using DataType = typename ATVC::TypeListGet::Type; + /* ! + * \brief set scalar param for compute function + * \param [in] args, args are mutable parameters, and are passed transparently from the parameters of + * global kernel functions, which are the parameters after broadcastParam + */ + template + __aicore__ inline void SetParam(Args... args) {} + + /* ! + * \brief set tensor param for compute function + * \param [in] args, args are mutable parameters, and are passed transparently from the parameters of + * global kernel functions, which are the parameters before broadcastParam, the num of args is + * decided by Traits + */ + template + __aicore__ inline void SetArgs(Args... args) + { + InitArgs(args...); + } + + /* ! 
+ * \brief process function of compute struct + * \param [in] y, local tensor of y + * \param [in] z, local tensor of z + * \param [in] x, local tensor of x, x is the output of broadcast, must be the last local tensor + * \param [in] copyOutOffset, copy out offset for DataCopy + * \param [in] copyOutParams, copy out params for DataCopy + */ + template + __aicore__ inline void operator()(AscendC::LocalTensor y, AscendC::LocalTensor z, AscendC::LocalTensor x, + uint32_t copyOutOffset, AscendC::DataCopyExtParams &copyOutParams) + { + size_t size = copyOutParams.blockCount * (copyOutParams.blockLen + copyOutParams.srcStride * 32) / sizeof(DataType); + ATVC::SyncDataQueue(); + + CopyIn(y, copyOutOffset, copyOutParams); + AscendC::PipeBarrier(); // wait broadcast finished + ATVC::SyncDataQueue(); + + Compute(y, z, x, copyOutOffset, size); + ATVC::SyncDataQueue(); + + CopyOut(z, copyOutOffset, copyOutParams); + AscendC::PipeBarrier(); + } + +private: + template + __aicore__ inline void InitArgs(GM_ADDR src, GM_ADDR dst) + { + srcGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(src)); + dstGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(dst)); + } + + template + __aicore__ inline void CopyIn(AscendC::LocalTensor y, uint32_t copyOutOffset, AscendC::DataCopyExtParams &copyOutParams) + { + AscendC::DataCopyPadExtParams padParams{false, 0, 0, 0}; + uint32_t tmp = copyOutParams.srcStride; + copyOutParams.srcStride = copyOutParams.dstStride; + copyOutParams.dstStride = tmp; + AscendC::DataCopyPad(y, srcGlobal_[copyOutOffset], copyOutParams, padParams); + copyOutParams.dstStride = copyOutParams.srcStride; + copyOutParams.srcStride = tmp; + } + + template + __aicore__ inline void Compute(AscendC::LocalTensor y, AscendC::LocalTensor z, AscendC::LocalTensor x, + uint32_t copyOutOffset, uint32_t size) + { + AscendC::Add(z, x, y, size); + } + + template + __aicore__ inline void CopyOut(AscendC::LocalTensor z, uint32_t copyOutOffset, AscendC::DataCopyExtParams &copyOutParams) + 
{ + AscendC::DataCopyPad(dstGlobal_[copyOutOffset], z, copyOutParams); + } + + AscendC::GlobalTensor srcGlobal_; + AscendC::GlobalTensor dstGlobal_; +}; +#endif diff --git a/atvc/examples/add_with_scalar/add_with_scalar.cpp b/atvc/examples/add_with_scalar/add_with_scalar.cpp index a2f60e3943e6918a333ad3d6cc6b8bfac91999d9..24c4707032a33538bc727257ea340ec574cb51a3 100644 --- a/atvc/examples/add_with_scalar/add_with_scalar.cpp +++ b/atvc/examples/add_with_scalar/add_with_scalar.cpp @@ -103,12 +103,11 @@ void CleanACL(aclrtStream &stream, int32_t deviceId) CHECK_ACL(aclFinalize()); } -void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice, uint8_t *¶mDevice) +void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zDevice) { CHECK_ACL(aclrtFree(xDevice)); CHECK_ACL(aclrtFree(yDevice)); CHECK_ACL(aclrtFree(zDevice)); - CHECK_ACL(aclrtFree(paramDevice)); CHECK_ACL(aclrtFreeHost(zHost)); } } @@ -118,15 +117,15 @@ void CleanUp(uint8_t *&zHost, uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&zD * a Device上的gm地址,指向算子第一个输入 * b Device上的gm地址,指向算子第二个输入 * c Device上的gm地址,指向算子第一个输出 - * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据 + * param 指向运行态ATVC::EleWiseParam数据 * conditionVal 标量,控制算子的计算逻辑 */ template -__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR param, bool conditionVal) +__global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param, bool conditionVal) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); auto op = ATVC::Kernel::EleWiseOpTemplate>(); // 传入计算仿函数, 实例化算子 - op.Run(a, b, c, param, conditionVal); // 调用Run函数, 执行算子 + op.Run(a, b, c, ¶m, conditionVal); // 调用Run函数, 执行算子 } @@ -159,28 +158,24 @@ int main() uint8_t *xDevice; uint8_t *yDevice; uint8_t *zDevice; - uint8_t *paramDevice; CHECK_ACL(aclrtMallocHost((void **)&zHost, outputByteSize)); CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void 
**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void **)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMalloc((void **)¶mDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, inputY.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); - CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize, reinterpret_cast(¶m), - elementParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); AddCustom - <<>>(xDevice, yDevice, zDevice, paramDevice, conditionVal); + <<>>(xDevice, yDevice, zDevice, param, conditionVal); CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); std::vector outputZ(reinterpret_cast(zHost), reinterpret_cast(zHost) + eleNum); - CleanUp(zHost, xDevice, yDevice, zDevice, paramDevice); + CleanUp(zHost, xDevice, yDevice, zDevice); CleanACL(stream, deviceId); if (!VerifyResults(golden, outputZ)) { diff --git a/atvc/examples/addcmul/README.md b/atvc/examples/addcmul/README.md new file mode 100644 index 0000000000000000000000000000000000000000..43e9b2b656d028e33247164bc289ed004b8dbd4e --- /dev/null +++ b/atvc/examples/addcmul/README.md @@ -0,0 +1,48 @@ + + +## 概述 + +本样例介绍了利用ATVC实现Addcmul单算子并完成功能验证 + + +## 样例支持产品型号: +- Atlas A2训练系列产品 + +## 目录结构 + +| 文件名 | 描述 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| [addcmul.cpp](./addcmul.cpp) | Addcmul算子代码实现以及调用样例 | + +## 算子描述 + +Add算子数学计算公式:$output_i = input_i + value * tensor1_i * tensor2_i$ + +Add算子规格: + + + + + + + + + + + + + + + +
算子类型(OpType):Addcmul

算子输入:

| name | shape | data type | format |
| ------- | -------- | --------- | ------ |
| input | 8 * 2048 | float | ND |
| tensor1 | 1 * 2048 | float | ND |
| tensor2 | 1 * 2048 | float | ND |

算子输出:

| name | shape | data type | format |
| ------ | -------- | --------- | ------ |
| output | 8 * 2048 | float | ND |

核函数名:AddcmulCustom
+ +## 算子运行 +在ascendc-api-adv代码仓目录下执行: +```bash +$ cd ./atvc/tests/ +$ bash run_test.sh addcmul +... +Generate golden data successfully. +... +Accuracy verification passed. +``` \ No newline at end of file diff --git a/atvc/examples/addcmul/addcmul.cpp b/atvc/examples/addcmul/addcmul.cpp new file mode 100644 index 0000000000000000000000000000000000000000..2ccd77a0d121ea7511b9d2417bde985d59a949da --- /dev/null +++ b/atvc/examples/addcmul/addcmul.cpp @@ -0,0 +1,166 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include "acl/acl.h" +#include "broadcast/broadcast_host.h" +#include "addcmul.h" + +#define CHECK_ACL(x) \ + do { \ + aclError __ret = x; \ + if (__ret != ACL_ERROR_NONE) { \ + std::cerr << __FILE__ << ":" << __LINE__ << " aclError:" << __ret << std::endl; \ + } \ + } while (0) + +namespace { +static constexpr float REL_TOL = 1e-3f; +static constexpr float ABS_TOL = 1e-5f; + +// 判断两个浮点数是否足够接近 +bool IsClose(float a, float b) { + const float eps = 1e-40f; // 防止分母为零 + float diff = std::abs(a - b); + return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); +} + +// AddWithBroadcast算子的描述:两个输入,一个输出,类型均为float +using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs, ATVC::OpTemps>; + +// 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 +template +void BroadcastOpAdapter(uint8_t* tensor1, uint8_t* tensor2, float value, uint8_t* input, uint8_t* output, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy, aclrtStream& stream) +{ + // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 + uint8_t *workspaceDevice; + CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); + param.workspaceAddr = reinterpret_cast(workspaceDevice); + // 将tiling api计算出的BroadcastPolicy转化为编译态参数并实例化相应的核函数 + if (policy == ATVC::BROADCAST_POLICY0) { + AddcmulCustom<<>>(tensor1, tensor2, input, output, param, value); + }else if (policy == ATVC::BROADCAST_POLICY1) { + AddcmulCustom<<>>(tensor1, tensor2, input, output, param, value); + } else { + printf("[ERROR] Cannot find any matched policy.\n"); + } + // 流同步后释放申请的param内存 + CHECK_ACL(aclrtSynchronizeStream(stream)); + CHECK_ACL(aclrtFree(workspaceDevice)); +} + +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} +} + +int32_t main(int32_t argc, char* argv[]) +{ + int32_t eleNum = 1 * 8; + int32_t outEleNum = 8 * 8; + std::vector shapeIn{1, 8}; // 测试输入shape + std::vector shapeOut{8, 8}; // 测试输入shape + + size_t inputByteSize = static_cast(eleNum) * sizeof(float); + size_t outputByteSize = static_cast(outEleNum) * sizeof(float); + std::random_device rd; + std::mt19937 gen(rd()); + std::uniform_real_distribution disX(1.0f, 9.0f); + std::uniform_real_distribution disY(1.0f, 9.0f); + + std::vector tensor1(eleNum); + std::vector tensor2(eleNum); + float value = 4; + std::vector input(outEleNum); + std::vector golden(outEleNum); + for (int i = 0; i < eleNum; ++i) { + tensor1[i] = (disX(gen)); + tensor2[i] = (disX(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + input[i] = (disY(gen)); + } + for (int i = 0; i < outEleNum; ++i) { + golden[i] = input[i] + (tensor1[i % eleNum] * tensor2[i % eleNum] * value); + } + printf("Generate golden data successfully.\n"); + // 初始化Acl资源 + CHECK_ACL(aclInit(nullptr)); + aclrtContext context; + int32_t deviceId = 0; + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + aclrtStream stream = nullptr; + CHECK_ACL(aclrtCreateStream(&stream)); + uint8_t *outputHost; + uint8_t *tensor1Device; + uint8_t *tensor2Device; + uint8_t *inputDevice; + uint8_t *outputDevice; + + CHECK_ACL(aclrtMallocHost((void **)(&outputHost), outputByteSize)); + CHECK_ACL(aclrtMalloc((void **)&tensor1Device, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&tensor2Device, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&inputDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + CHECK_ACL(aclrtMalloc((void **)&outputDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); + + CHECK_ACL(aclrtMemcpy(tensor1Device, inputByteSize, 
tensor1.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(tensor2Device, inputByteSize, tensor2.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + CHECK_ACL(aclrtMemcpy(inputDevice, outputByteSize, input.data(), outputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); + + ATVC::BroadcastParam param; // Broadcast运行态参数,包含TilingData以及临时空间的相关信息 + ATVC::BroadcastPolicy policy = {-1, -1, -1}; // Broadcast运行态参数,负责映射最适合的Broadcast模板实现 + // Host侧调用Tiling API完成相关运行态参数的运算 + param.nBufferNum = 1; + if (!ATVC::Host::CalcBroadcastTiling(shapeIn, shapeOut, &policy, ¶m)) { + printf("Broadcast tiling error.\n"); + return -1; + }; + // 调用Adapter调度接口,完成核函数的模板调用 + BroadcastOpAdapter(tensor1Device, tensor2Device, value, inputDevice, outputDevice, param, policy, stream); + + CHECK_ACL(aclrtMemcpy(outputHost, outputByteSize, outputDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); + std::vector output(reinterpret_cast(outputHost), reinterpret_cast(outputHost) + outEleNum); + + // 释放Acl资源 + CHECK_ACL(aclrtFree(tensor1Device)); + CHECK_ACL(aclrtFree(tensor2Device)); + CHECK_ACL(aclrtFree(inputDevice)); + CHECK_ACL(aclrtFree(outputDevice)); + CHECK_ACL(aclrtFreeHost(outputHost)); + + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); + + if (!VerifyResults(golden, output)) { + return -1; + } + printf("Accuracy verification passed.\n"); + return 0; +} diff --git a/atvc/examples/addcmul/addcmul.h b/atvc/examples/addcmul/addcmul.h new file mode 100644 index 0000000000000000000000000000000000000000..a63479d4c21e4605304072c4a0b2c98f544d095d --- /dev/null +++ b/atvc/examples/addcmul/addcmul.h @@ -0,0 +1,63 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. 
You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ +#ifndef ATVC_ADDCMUL_H +#define ATVC_ADDCMUL_H +#include "pre_compute_mul_of_broadcast.h" +#include "post_compute_add_of_broadcast.h" + +/* ! + * \brief Addcmul(i) = input(i) + value * tensor1(i) * tensor2(i) + * \param [in] tensor1, input global memory of tensor1 + * \param [in] tensor2, input global memory of tensor2 + * \param [in] input, input global memory of input + * \param [out] output, output global memory + * \param [in] broadcastParam, params of broadcast + * \param [in] value, input value + */ +template::Type> +__global__ __aicore__ void AddcmulCustom(GM_ADDR tensor1, + GM_ADDR tensor2, + GM_ADDR input, + GM_ADDR output, + ATVC::BroadcastParam broadcastParam, + DataType value) +{ + KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); + + // 1. get input and output for kernel op from host Traits + using KernelOpIn = typename Traits::In::types; + using KernelOpOut = typename Traits::Out::types; + using KernelOpTemp = typename Traits::Temp::types; + + // 2. define input and output for pre compute + using PreComputeInput = ATVC::OpInputs::Type, typename ATVC::TypeListGet::Type>; + using PreComputeOutput = ATVC::OpOutputs::Type>; + using PreComputeTemp = ATVC::OpOutputs::Type>; + using PreComputeOpTraits = ATVC::OpTraits; + using PreCompute = PreComputeMulOfBroadcast; + + // 3. define input and output for broadcast + using BroadcastOpInput = ATVC::OpInputs::Type>; + using BroadcastOpOutput = ATVC::OpOutputs::Type>; + using BroadcastOpTraits = ATVC::OpTraits; + + // 4. 
define input and output for post compute + using PostComputeInput = ATVC::OpInputs::Type, typename ATVC::TypeListGet::Type>; + using PostComputeOutput = ATVC::OpOutputs::Type>; + using PostComputeOpTraits = ATVC::OpTraits; + using PostCompute = PostComputeAddOfBroadcast; + + // 5. call op run + auto op = ATVC::Kernel::BroadcastOpTemplate, Policy, PreCompute, PostCompute>(); + ATVC::BroadcastParam *param = &broadcastParam; + op.Run(tensor1, tensor2, input, output, param, value); +} +#endif diff --git a/atvc/examples/addcmul/post_compute_add_of_broadcast.h b/atvc/examples/addcmul/post_compute_add_of_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..9059186cb36583d1208382e5e3085fa6c8ddc2bc --- /dev/null +++ b/atvc/examples/addcmul/post_compute_add_of_broadcast.h @@ -0,0 +1 @@ +../add_with_broadcast/post_compute_add_of_broadcast.h \ No newline at end of file diff --git a/atvc/examples/addcmul/pre_compute_mul_of_broadcast.h b/atvc/examples/addcmul/pre_compute_mul_of_broadcast.h new file mode 100644 index 0000000000000000000000000000000000000000..6df76098319d82cffc55eb6d601bca9d93a73e52 --- /dev/null +++ b/atvc/examples/addcmul/pre_compute_mul_of_broadcast.h @@ -0,0 +1,73 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ +#ifndef ATVC_PRE_COMPUTE_MUL_OF_BROADCAST_H +#define ATVC_PRE_COMPUTE_MUL_OF_BROADCAST_H + +#include "broadcast/broadcast_device.h" + +template +struct PreComputeMulOfBroadcast { + using Inputs = typename Traits::In::types; + using Outputs = typename Traits::Out::types; + using DataType = typename ATVC::TypeListGet::Type; + static constexpr size_t INPUT_SIZE = ATVC::TypeListSize::VALUE; + + template + __aicore__ inline void SetParam(DataType value) + { + value_ = value; + } + + template + __aicore__ inline void SetArgs(Args... args) + { + InitArgsInput<0>(args...); + } + + template + __aicore__ inline void operator()(AscendC::LocalTensor tensor1, AscendC::LocalTensor tensor2, AscendC::LocalTensor temp1, AscendC::LocalTensor temp2, + uint32_t copyInOffset, AscendC::DataCopyExtParams ©InParams) + { + size_t size = copyInParams.blockCount * (copyInParams.blockLen + copyInParams.srcStride * 32) / sizeof(DataType); + ATVC::SyncDataQueue(); + + CopyIn(tensor1, tensor2, copyInOffset, copyInParams); + + AscendC::PipeBarrier(); // wait broadcast finished + ATVC::SyncDataQueue(); + + AscendC::Mul(temp1, tensor1, tensor2, size); + AscendC::Muls(temp2, temp1, value_, size); + } + +private: + template + __aicore__ inline void InitArgsInput(GM_ADDR x, Args... 
args) + { + input_[start].SetGlobalBuffer((__gm__ DataType*)x); + if constexpr (start + 1 < INPUT_SIZE) { + InitArgsInput(args...); + } + } + + template + __aicore__ inline void CopyIn(AscendC::LocalTensor tensor1, AscendC::LocalTensor tensor2, uint32_t copyInOffset, AscendC::DataCopyExtParams &copyInParams) + { + AscendC::DataCopyPadExtParams padParams{false, 0, 0, 0}; + AscendC::DataCopyPad(tensor1, input_[0][copyInOffset], copyInParams, padParams); + AscendC::DataCopyPad(tensor2, input_[1][copyInOffset], copyInParams, padParams); + } + + AscendC::GlobalTensor input_[INPUT_SIZE]; + DataType value_; +}; + +#endif diff --git a/atvc/examples/broadcast_to/broadcast_to.cpp index c3a0794413ded63fe04e34d679afeae1d02c9619..c7a4b16baeac5dfc80ced66e199ea5cb52c5c56b 100644 --- a/atvc/examples/broadcast_to/broadcast_to.cpp +++ b/atvc/examples/broadcast_to/broadcast_to.cpp @@ -38,53 +38,91 @@ bool IsClose(float a, float b) { return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps)); } -// BroadcastTo算子的描述:一个输入,一个输出,类型均为float -using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; +bool VerifyResults(const std::vector &golden, const std::vector &output) +{ + for (int32_t i = 0; i < golden.size(); i++) { + if (!IsClose(golden[i], output[i])) { + printf("[ERROR]: Accuracy verification failed! 
The expected value of element " + "in index [%d] is %f, but actual value is %f.\n", + i, + golden[i], + output[i]); + return false; + } + } + return true; +} + +void InitializeACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) +{ + CHECK_ACL(aclInit(nullptr)); + CHECK_ACL(aclrtSetDevice(deviceId)); + CHECK_ACL(aclrtCreateContext(&context, deviceId)); + CHECK_ACL(aclrtCreateStream(&stream)); } +void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) +{ + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); +} + +void CleanUp(uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&yHost) +{ + CHECK_ACL(aclrtFree(xDevice)); + CHECK_ACL(aclrtFree(yDevice)); + CHECK_ACL(aclrtFreeHost(yHost)); +} + +// BroadcastTo算子的描述:一个输入,一个输出,类型均为float +using BroadcastOpTraits = ATVC::OpTraits, ATVC::OpOutputs>; +} /* * 该函数为BroadcastCustom算子核函数入口 - * x Device上的gm地址,指向Add算子第一个输入 - * y Device上的gm地址,指向Add算子第一个输出 - * broadcastParam Device上的gm地址,指向运行态ATVC::BroadcastParam数据 + * x Device上的gm地址,指向BroadcastCustom算子第一个输入 + * y Device上的gm地址,指向BroadcastCustom算子第一个输出 + * broadcastParam 指向运行态ATVC::BroadcastParam数据 */ template -__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, GM_ADDR broadcastParam) +__global__ __aicore__ void BroadcastCustom(GM_ADDR x, GM_ADDR y, ATVC::BroadcastParam broadcastParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::BroadcastOpTemplate, Policy>(); - op.Run(x, y, broadcastParam); + ATVC::BroadcastParam *param = &broadcastParam; + op.Run(x, y, param); } - +namespace { // 负责Broadcast类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 template void BroadcastOpAdapter(uint8_t* x, uint8_t* y, ATVC::BroadcastParam ¶m, ATVC::BroadcastPolicy &policy, aclrtStream& stream) { - // 申请临时空间workspace,并将其与BroadcastTilingData一同传到Device侧 - uint8_t *paramDevice; + // 申请临时空间workspace uint8_t 
*workspaceDevice; CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); param.workspaceAddr = reinterpret_cast(workspaceDevice); - auto broadcastParamSize = sizeof(param); - CHECK_ACL(aclrtMalloc((void**)¶mDevice, broadcastParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, broadcastParamSize, reinterpret_cast(¶m), broadcastParamSize, ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的BroadcastPolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::BROADCAST_POLICY0) { - BroadcastCustom<<>>(x, y, paramDevice); + BroadcastCustom<<>>(x, y, param); }else if (policy == ATVC::BROADCAST_POLICY1) { - BroadcastCustom<<>>(x, y, paramDevice); + BroadcastCustom<<>>(x, y, param); } else { - printf("[ERROR] Cannot find any matched policy.\n"); + printf("[ERROR]: Cannot find any matched policy.\n"); } // 流同步后释放申请的param内存 CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtFree(workspaceDevice)); - CHECK_ACL(aclrtFree(paramDevice)); +} } int32_t main(int32_t argc, char* argv[]) { + if (!ATVC::Host::DebugCheck()) { + printf("[ERROR]: Broadcast opTraits check failed.\n"); + return -1; + } int32_t eleNum = 1 * 1024; int32_t outEleNum = 8 * 1024; size_t inputByteSize = static_cast(eleNum) * sizeof(float); @@ -95,13 +133,10 @@ int32_t main(int32_t argc, char* argv[]) std::vector golden(outEleNum, 1.0f); printf("Generate golden data successfully.\n"); // 初始化Acl资源 - CHECK_ACL(aclInit(nullptr)); aclrtContext context; int32_t deviceId = 0; - CHECK_ACL(aclrtSetDevice(deviceId)); - CHECK_ACL(aclrtCreateContext(&context, deviceId)); aclrtStream stream = nullptr; - CHECK_ACL(aclrtCreateStream(&stream)); + InitializeACL(stream, context, deviceId); uint8_t *yHost; uint8_t *xDevice; uint8_t *yDevice; @@ -127,14 +162,8 @@ int32_t main(int32_t argc, char* argv[]) std::vector outputY(reinterpret_cast(yHost), reinterpret_cast(yHost) + outEleNum); // 释放Acl资源 - CHECK_ACL(aclrtFree(xDevice)); - CHECK_ACL(aclrtFree(yDevice)); - 
CHECK_ACL(aclrtFreeHost(yHost)); - - CHECK_ACL(aclrtDestroyStream(stream)); - CHECK_ACL(aclrtDestroyContext(context)); - CHECK_ACL(aclrtResetDevice(deviceId)); - CHECK_ACL(aclFinalize()); + CleanUp(xDevice, yDevice, yHost); + CleanACL(stream, context, deviceId); for (int32_t i = 0; i < outEleNum; i++) { if (!IsClose(golden[i], outputY[i])) { @@ -142,6 +171,9 @@ int32_t main(int32_t argc, char* argv[]) return -1; } } + if (!VerifyResults(golden, outputY)) { + return -1; + } printf("Accuracy verification passed.\n"); return 0; } diff --git a/atvc/examples/aclnn/README.md b/atvc/examples/ops_aclnn/README.md similarity index 93% rename from atvc/examples/aclnn/README.md rename to atvc/examples/ops_aclnn/README.md index 3bfe4acfd50a90daf8b3c0abc7aa46e018c7ccf6..802586c91f6cb52d69430f79563c2e3463c46a4e 100644 --- a/atvc/examples/aclnn/README.md +++ b/atvc/examples/ops_aclnn/README.md @@ -14,7 +14,7 @@ 快速执行example用例,更详细的流程请参阅[add算子](../add/README.md)。 -- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/1_quick_start.md)。 +- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/01_quick_start.md)。 - 导入ATVC环境变量 ```bash @@ -26,7 +26,7 @@ ```bash # 基于ATVC编译自定义Add算子 - $ cd ./atvc/examples/aclnn/add + $ cd ./atvc/examples/ops_aclnn/add # 以910B1为例,运行命令如下: $ bash install.sh -v Ascend910B1 # 安装custom包 @@ -71,10 +71,10 @@ - 2.1 复制需要的配置文件 - 将[func.cmake](./add/AddCustom/cmake/func.cmake)、host侧的[CMakeLists.txt](./add/AddCustom/op_host/CMakeLists.txt)和kernel侧的[CMakeLists.txt](./add/AddCustom/op_kernel/CMakeLists.txt)分别复制到`步骤1`生成的工程文件的对应目录下。 + 将[func.cmake](./add/AddCustom/cmake/func.cmake)、[intf.cmake](./add/AddCustom/cmake/intf.cmake)、host侧的[CMakeLists.txt](./add/AddCustom/op_host/CMakeLists.txt)和kernel侧的[CMakeLists.txt](./add/AddCustom/op_kernel/CMakeLists.txt)分别复制到`步骤1`生成的工程文件的对应目录下。 - 2.2 修改对应的host文件 - - 引入对应的头文件,修改对应TilingFunc函数中tiling的生成,根据算子类型调用不同的tiling生成策略,更多ATVC的用法可参考atvc的[开发指南](../../docs/2_developer_guide.md)。 + - 
引入对应的头文件,修改对应TilingFunc函数中tiling的生成,根据算子类型调用不同的tiling生成策略,更多ATVC的用法可参考atvc的[开发指南](../../docs/02_developer_guide.md)。 elewise类,参考[add_custom.cpp](./add/AddCustom/op_host/add_custom.cpp) ```cpp @@ -182,7 +182,7 @@ ``` - 2.3 修改对应的kernel文件 - 用户需要通过AscendC API来搭建Add算子的核心计算逻辑,在ATVC框架中,这类算子的核心计算逻辑是通过定义一个结构体的仿函数来实现。它需要`ATVC::OpTraits`作为固定模板参数,并重载`operator()`来被提供的Kernel层算子模板类调用,更多ATVC的用法可参考atvc的[开发指南](../../docs/2_developer_guide.md)。 + 用户需要通过AscendC API来搭建Add算子的核心计算逻辑,在ATVC框架中,这类算子的核心计算逻辑是通过定义一个结构体的仿函数来实现。它需要`ATVC::OpTraits`作为固定模板参数,并重载`operator()`来被提供的Kernel层算子模板类调用,更多ATVC的用法可参考atvc的[开发指南](../../docs/02_developer_guide.md)。 elewise类[add_custom.cpp](./add/AddCustom/op_kernel/add_custom.cpp) ```cpp @@ -214,8 +214,9 @@ // 修改核函数文件的实现 KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); REGISTER_TILING_DEFAULT(ATVC::EleWiseParam); + GET_TILING_DATA(param, tiling); auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, z, tiling); + op.Run(x, y, z, &param); ``` broadcast类 ```cpp @@ -228,10 +229,12 @@ // 修改核函数文件 KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); REGISTER_TILING_DEFAULT(ATVC::BroadcastParam); + GET_TILING_DATA(tilingData, tiling); // broadcast有不同的policy,在host与tilingkey进行绑定,此处的调用使用TILING_KEY_IS进行判断,和host的文件SetTilingKey相对应 if (TILING_KEY_IS(0)) { auto op = ATVC::Kernel::BroadcastOpTemplate, ATVC::BROADCAST_POLICY0>(); - op.Run(x, y, tiling); + ATVC::BroadcastParam param = &tilingData; + op.Run(x, y, &param); }else{ ... } @@ -250,7 +253,7 @@ GET_TILING_DATA(param, tiling); if (param.policyId == ATVC::REDUCE_POLICY0.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY0>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else { // 根据不同的tiling.policyId进行判断不同ReduceOpTemplate初始化 ... @@ -293,4 +296,4 @@ ### 步骤5.
调用执行算子工程 - 算子文件编写完成,参考[aclnn调用AddCustom算子工程(代码简化)](https://gitee.com/ascend/samples/blob/master/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocationNaive/README.md)进行编译验证 \ No newline at end of file + 算子文件编写完成,参考[aclnn调用AddCustom算子工程(代码简化)](https://gitee.com/ascend/samples/blob/master/operator/ascendc/0_introduction/1_add_frameworklaunch/AclNNInvocationNaive/README.md)进行编译验证 \ No newline at end of file diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/CMakeLists.txt diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/README.md b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/README.md similarity index 97% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/README.md rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/README.md index 65acdddea5ed3db4d375d13a1dc2c66d1bd3f856..de9e6772fbde260270c8d215aa2fb516120693b7 100644 --- a/atvc/examples/aclnn/add/AclNNInvocationNaive/README.md +++ b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/README.md @@ -26,7 +26,7 @@ - 进入到样例目录 以命令行方式下载样例代码,master分支为例。 ```bash - cd atvc/examples/aclnn/add/AclNNInvocationNaive + cd atvc/examples/ops_aclnn/add/AclNNInvocationNaive ``` - 样例编译文件修改 diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/main.cpp b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/main.cpp similarity index 100% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/main.cpp rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/main.cpp diff --git a/atvc/examples/aclnn/add/AclNNInvocationNaive/run.sh b/atvc/examples/ops_aclnn/add/AclNNInvocationNaive/run.sh similarity index 100% rename from atvc/examples/aclnn/add/AclNNInvocationNaive/run.sh rename to atvc/examples/ops_aclnn/add/AclNNInvocationNaive/run.sh diff --git 
a/atvc/examples/aclnn/add/AddCustom.json b/atvc/examples/ops_aclnn/add/AddCustom.json similarity index 100% rename from atvc/examples/aclnn/add/AddCustom.json rename to atvc/examples/ops_aclnn/add/AddCustom.json diff --git a/atvc/examples/aclnn/add/AddCustom/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AddCustom/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AddCustom/CMakeLists.txt diff --git a/atvc/examples/aclnn/add/AddCustom/cmake/func.cmake b/atvc/examples/ops_aclnn/add/AddCustom/cmake/func.cmake similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/cmake/func.cmake rename to atvc/examples/ops_aclnn/add/AddCustom/cmake/func.cmake diff --git a/atvc/examples/ops_aclnn/add/AddCustom/cmake/intf.cmake b/atvc/examples/ops_aclnn/add/AddCustom/cmake/intf.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4c8719a4f6ea941eb7a43d60359b4fdb4517c6b6 --- /dev/null +++ b/atvc/examples/ops_aclnn/add/AddCustom/cmake/intf.cmake @@ -0,0 +1,29 @@ + +add_library(intf_pub INTERFACE) +target_compile_options(intf_pub INTERFACE + -fPIC + -fvisibility=hidden + -fvisibility-inlines-hidden + $<$:-O2> + $<$:-O0 -g> + $<$:-std=c++17> + $<$,$>:-ftrapv -fstack-check> + $<$:-pthread -Wfloat-equal -Wshadow -Wformat=2 -Wno-deprecated -Wextra> + $,-fstack-protector-strong,-fstack-protector-all> +) +target_compile_definitions(intf_pub INTERFACE + _GLIBCXX_USE_CXX17_ABI=0 + $<$:_FORTIFY_SOURCE=2> +) +target_include_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/include + ${CMAKE_CURRENT_SOURCE_DIR}/op_kernel + $ENV{ATVC_PATH} +) +target_link_options(intf_pub INTERFACE + $<$,EXECUTABLE>:-pie> + $<$:-s> + -Wl,-z,relro + -Wl,-z,now + -Wl,-z,noexecstack +) +target_link_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/lib64) diff --git a/atvc/examples/aclnn/add/AddCustom/op_host/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AddCustom/op_host/CMakeLists.txt 
similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/op_host/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AddCustom/op_host/CMakeLists.txt diff --git a/atvc/examples/aclnn/add/AddCustom/op_host/add_custom.cpp b/atvc/examples/ops_aclnn/add/AddCustom/op_host/add_custom.cpp similarity index 100% rename from atvc/examples/aclnn/add/AddCustom/op_host/add_custom.cpp rename to atvc/examples/ops_aclnn/add/AddCustom/op_host/add_custom.cpp diff --git a/atvc/examples/aclnn/add/AddCustom/op_kernel/CMakeLists.txt b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/CMakeLists.txt similarity index 92% rename from atvc/examples/aclnn/add/AddCustom/op_kernel/CMakeLists.txt rename to atvc/examples/ops_aclnn/add/AddCustom/op_kernel/CMakeLists.txt index 16f4f8e4aa572b35d908cb42f2c4ffbdb91b2c5d..178359e4b78e1dcd20f66ceef5c7a0ec86ab8d73 100644 --- a/atvc/examples/aclnn/add/AddCustom/op_kernel/CMakeLists.txt +++ b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/CMakeLists.txt @@ -8,5 +8,5 @@ if ("${CMAKE_BUILD_TYPE}x" STREQUAL "Debugx") add_ops_compile_options(ALL OPTIONS -g -O0) endif() -add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -I ${ATVC_PATH}) +add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -w -I ${ATVC_PATH}) add_kernels_compile() \ No newline at end of file diff --git a/atvc/examples/aclnn/add/AddCustom/op_kernel/add_custom.cpp b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/add_custom.cpp similarity index 91% rename from atvc/examples/aclnn/add/AddCustom/op_kernel/add_custom.cpp rename to atvc/examples/ops_aclnn/add/AddCustom/op_kernel/add_custom.cpp index d28d2c6c93b669480b2f0570b9301cd57a9105fa..73c14dac2e43c762c3cb01d7d2fbf56ac753ef6d 100644 --- a/atvc/examples/aclnn/add/AddCustom/op_kernel/add_custom.cpp +++ b/atvc/examples/ops_aclnn/add/AddCustom/op_kernel/add_custom.cpp @@ -25,7 +25,8 @@ struct AddComputeFunc { extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, 
GM_ADDR workspace, GM_ADDR tiling) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); - REGISTER_TILING_DEFAULT(ATVC::EleWiseParam); + REGISTER_TILING_DEFAULT(ATVC::EleWiseParam); + GET_TILING_DATA(param, tiling); auto op = ATVC::Kernel::EleWiseOpTemplate>(); - op.Run(x, y, z, tiling); + op.Run(x, y, z, ¶m); } diff --git a/atvc/examples/aclnn/add/README.md b/atvc/examples/ops_aclnn/add/README.md similarity index 99% rename from atvc/examples/aclnn/add/README.md rename to atvc/examples/ops_aclnn/add/README.md index 358d7b113437f5663b7e67eae32568dec07b4ec4..42020ecffbbf614de35adac72434339bc0d7492b 100644 --- a/atvc/examples/aclnn/add/README.md +++ b/atvc/examples/ops_aclnn/add/README.md @@ -64,7 +64,7 @@ CANN软件包中提供了工程创建工具msOpGen,AddCustom算子工程可通 - 切换到msOpGen脚本install.sh所在目录 ```bash # 若开发者以git命令行方式clone了master分支代码,并切换目录 - cd atvc/examples/aclnn/add + cd atvc/examples/ops_aclnn/add ``` - 调用脚本,生成自定义算子工程,复制host和kernel实现并编译算子 diff --git a/atvc/examples/aclnn/add/install.sh b/atvc/examples/ops_aclnn/add/install.sh similarity index 100% rename from atvc/examples/aclnn/add/install.sh rename to atvc/examples/ops_aclnn/add/install.sh diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/CMakeLists.txt diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/README.md b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/README.md similarity index 97% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/README.md rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/README.md index 3fd20f3cac4dce4d504240835e5352e7a7015c8f..239ea433571ebe3f43e00a333dee6b8fe876f99c 100644 --- a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/README.md +++ 
b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/README.md @@ -26,7 +26,7 @@ - 进入到样例目录 以命令行方式下载样例代码,master分支为例。 ```bash - cd atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive + cd atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive ``` - 样例编译文件修改 diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/main.cpp b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/main.cpp similarity index 100% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/main.cpp rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/main.cpp diff --git a/atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/run.sh b/atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/run.sh similarity index 100% rename from atvc/examples/aclnn/reduce_sum/AclNNInvocationNaive/run.sh rename to atvc/examples/ops_aclnn/reduce_sum/AclNNInvocationNaive/run.sh diff --git a/atvc/examples/aclnn/reduce_sum/README.md b/atvc/examples/ops_aclnn/reduce_sum/README.md similarity index 99% rename from atvc/examples/aclnn/reduce_sum/README.md rename to atvc/examples/ops_aclnn/reduce_sum/README.md index 3503c46e6991fe907459745f11f8905230cf6ccc..0ae318021bc6f2a8f55d60ec27f640b480f52b47 100644 --- a/atvc/examples/aclnn/reduce_sum/README.md +++ b/atvc/examples/ops_aclnn/reduce_sum/README.md @@ -61,7 +61,7 @@ CANN软件包中提供了工程创建工具msOpGen,ReduceSumCustom算子工程 - 切换到msOpGen脚本install.sh所在目录 ```bash # 若开发者以git命令行方式clone了master分支代码,并切换目录 - cd ./atvc/examples/aclnn/reduce_sum + cd ./atvc/examples/ops_aclnn/reduce_sum ``` - 调用脚本,生成自定义算子工程,复制host和kernel实现并编译算子 diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom.json b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom.json similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom.json rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom.json diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt similarity index 100% 
rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/CMakeLists.txt diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/func.cmake diff --git a/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/intf.cmake b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/intf.cmake new file mode 100644 index 0000000000000000000000000000000000000000..4c8719a4f6ea941eb7a43d60359b4fdb4517c6b6 --- /dev/null +++ b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/cmake/intf.cmake @@ -0,0 +1,29 @@ + +add_library(intf_pub INTERFACE) +target_compile_options(intf_pub INTERFACE + -fPIC + -fvisibility=hidden + -fvisibility-inlines-hidden + $<$:-O2> + $<$:-O0 -g> + $<$:-std=c++17> + $<$,$>:-ftrapv -fstack-check> + $<$:-pthread -Wfloat-equal -Wshadow -Wformat=2 -Wno-deprecated -Wextra> + $,-fstack-protector-strong,-fstack-protector-all> +) +target_compile_definitions(intf_pub INTERFACE + _GLIBCXX_USE_CXX17_ABI=0 + $<$:_FORTIFY_SOURCE=2> +) +target_include_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/include + ${CMAKE_CURRENT_SOURCE_DIR}/op_kernel + $ENV{ATVC_PATH} +) +target_link_options(intf_pub INTERFACE + $<$,EXECUTABLE>:-pie> + $<$:-s> + -Wl,-z,relro + -Wl,-z,now + -Wl,-z,noexecstack +) +target_link_directories(intf_pub INTERFACE ${ASCEND_CANN_PACKAGE_PATH}/lib64) diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/CMakeLists.txt diff --git 
a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp similarity index 100% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_host/reduce_sum_custom.cpp diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt similarity index 92% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt index 1a719a60cf480bf01d15b9d1905038536d5e5ccb..72549d35d19a37bb438d58356ec026bf81d8f0c4 100644 --- a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt +++ b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/CMakeLists.txt @@ -8,5 +8,5 @@ if ("${CMAKE_BUILD_TYPE}x" STREQUAL "Debugx") add_ops_compile_options(ALL OPTIONS -g -O0) endif() -add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -I ${ATVC_PATH}) +add_ops_compile_options(ALL OPTIONS -g -O0 --cce-aicore-block-local-init -w -I ${ATVC_PATH}) add_kernels_compile() \ No newline at end of file diff --git a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp similarity index 87% rename from atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp rename to atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp index 50a5c73f975c464282ae32cc46da52666ca19529..871f587d3c029cb95646187a2ec180fa38211423 100644 --- a/atvc/examples/aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp +++ b/atvc/examples/ops_aclnn/reduce_sum/ReduceSumCustom/op_kernel/reduce_sum_custom.cpp @@ -20,72 +20,72 @@ 
extern "C" __global__ __aicore__ void reduce_sum_custom(GM_ADDR x, GM_ADDR y, GM GET_TILING_DATA(param, tiling); if (param.policyId == ATVC::REDUCE_POLICY0.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY0>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY1.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY1>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY2.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY2>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY3.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY3>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY4.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY4>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY5.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY5>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY6.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY6>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY7.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY7>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY8.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY8>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY9.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY9>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY10.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY10>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY11.ID)
{ auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY11>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY12.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY12>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY13.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY13>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY14.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY14>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY15.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY15>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY16.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY16>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY17.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY17>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY18.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY18>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY19.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY19>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY20.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY20>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY21.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY21>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } else if (param.policyId == ATVC::REDUCE_POLICY22.ID) { auto op = ATVC::Kernel::ReduceOpTemplate, ATVC::REDUCE_POLICY22>(); - op.Run(x, y, tiling); + op.Run(x, y, &param); } } diff --git
a/atvc/examples/aclnn/reduce_sum/install.sh b/atvc/examples/ops_aclnn/reduce_sum/install.sh similarity index 100% rename from atvc/examples/aclnn/reduce_sum/install.sh rename to atvc/examples/ops_aclnn/reduce_sum/install.sh diff --git a/atvc/examples/pytorch/README.md b/atvc/examples/ops_pytorch/README.md similarity index 85% rename from atvc/examples/pytorch/README.md rename to atvc/examples/ops_pytorch/README.md index 945054b730a2b53ca50fc14102e4bf89d65c174b..5f85b795bcd0b3014831f618424ceef9ca28cc22 100644 --- a/atvc/examples/pytorch/README.md +++ b/atvc/examples/ops_pytorch/README.md @@ -14,7 +14,7 @@ 快速执行example用例,更详细的流程请参阅[add算子](./add/README.md)。 -- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/1_quick_start.md)。 +- 下载ATVC代码及环境配置,参考[快速入门](../../../docs/01_quick_start.md)。 - 导入ATVC环境变量 ```bash @@ -25,7 +25,7 @@ - 执行add用例 ```bash # 基于ATVC编译pytorch Add算子 - $ cd ./atvc/examples/pytorch/add + $ cd ./atvc/examples/ops_pytorch/add $ bash run.sh ... OK @@ -69,7 +69,7 @@ ``` ### 步骤3. 实现核函数,参考[add_custom_impl.h](./add/add_custom_impl.h) - ATVC提供的`ATVC::Kernel::EleWiseOpTemplate`算子模板类实现了核内的数据搬运、资源申请和计算调度功能。它将计算仿函数作为模板参数传入来完成构造实例化,用户可通过调用`ATVC::Kernel::EleWiseOpTemplate`算子模板类的`Run(Args&&... args)`接口完成算子的功能计算,完成完整核函数的实现。 在examples/add用例中,算子核函数的形式参数除了输入输出之外,还需额外传入`ATVC::EleWiseParam param`的形参,需要用户在初始化`ATVC::Kernel::EleWiseOpTemplate`时指定`ATVC::EleWiseParam*`的类型。该参数包含算子模板类进行数据搬运数据的必要参数,由`ATVC::Host::CalcEleWiseTiling` API计算得出。 + ATVC提供的`ATVC::Kernel::EleWiseOpTemplate`算子模板类实现了核内的数据搬运、资源申请和计算调度功能。它将计算仿函数作为模板参数传入来完成构造实例化,用户可通过调用`ATVC::Kernel::EleWiseOpTemplate`算子模板类的`Run(Args&&... 
args)`接口完成算子的功能计算,完成完整核函数的实现。 ```cpp /* @@ -83,15 +83,15 @@ __global__ __aicore__ void AddCustom(GM_ADDR a, GM_ADDR b, GM_ADDR c, ATVC::EleWiseParam param) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); + auto op = ATVC::Kernel::EleWiseOpTemplate, ATVC::EleWiseParam*>(); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类,并指明param的数据类型 - op.Run(x, y, z, &param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算 + op.Run(x, y, z, &param); // 按照输入、输出、param的顺序传入Run函数 } ``` - 备注:reduce类算子在初始化的时候,` ATVC::Kernel::ReduceOpTemplate, Policy, void, void, ATVC::ReduceParam*>`时指定`ATVC::ReduceParam*`的类型;broadcast类算子在初始化的时候,` ATVC::Kernel::ATVC::Kernel::BroadcastOpTemplate, Policy, ATVC::BroadcastParam*>`时指定`ATVC::BroadcastParam*`的类型。 ### 步骤4. 编写torch入口[pytorch_ascendc_extension.cpp](./add/pytorch_ascendc_extension.cpp) - 不同的算子类型可参考[快速入门](../../docs/1_quick_start.md)中的模版选择模版进行选择,用户在此处通过`<<<>>>`的方式调用核函数,更多ATVC的用法可参考atvc的[开发指南](../../docs/2_developer_guide.md)。 + 不同的算子类型可参考[快速入门](../../docs/01_quick_start.md)中的模版选择模版进行选择,用户在此处通过`<<<>>>`的方式调用核函数,更多ATVC的用法可参考atvc的[开发指南](../../docs/02_developer_guide.md)。 ``` cpp // 头文件引入 @@ -185,7 +185,7 @@ ### 步骤7. 算子编译&执行 ```bash # 基于ATVC编译pytorch Add算子 - $ cd ./atvc/examples/pytorch/add + $ cd ./atvc/examples/ops_pytorch/add $ bash run.sh ...
OK diff --git a/atvc/examples/pytorch/add/README.md b/atvc/examples/ops_pytorch/add/README.md similarity index 97% rename from atvc/examples/pytorch/add/README.md rename to atvc/examples/ops_pytorch/add/README.md index bbd61f3088f05ca4526e7a0aafba85703575a8a2..39366061aee1a8aa7cc0495815ab77278adad311 100644 --- a/atvc/examples/pytorch/add/README.md +++ b/atvc/examples/ops_pytorch/add/README.md @@ -34,7 +34,7 @@ z = x + y - Atlas A2训练系列产品 ## 编译运行样例算子 -针对自定义算子工程,编译运行包含如下步骤: +针对pytorch算子,编译运行包含如下步骤: - 完成算子pytorch入口和impl文件的实现; - 编译pytorch算子的二进制文件; - 调用执行pytorch算子; @@ -80,7 +80,7 @@ z = x + y { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 - auto op = ATVC::Kernel::EleWiseOpTemplate, ATVC::EleWiseParam*>(); + auto op = ATVC::Kernel::EleWiseOpTemplate>(); op.Run(x, y, z, &param); } ``` @@ -129,7 +129,7 @@ z = x + y ### 4. 基于ATVC编写pytorch算子的调用验证 - 调用脚本,生成pytorch算子,并运行测试用例 ```bash - $ cd ./atvc/examples/pytorch/add + $ cd ./atvc/examples/ops_pytorch/add $ bash run.sh ...
OK diff --git a/atvc/examples/pytorch/add/add_custom_impl.h b/atvc/examples/ops_pytorch/add/add_custom_impl.h similarity index 98% rename from atvc/examples/pytorch/add/add_custom_impl.h rename to atvc/examples/ops_pytorch/add/add_custom_impl.h index 86a6912caaaa3264324a68740fa98a70fcb48566..34b7de96b6c5e26592a22f60575884d642347fe6 100644 --- a/atvc/examples/pytorch/add/add_custom_impl.h +++ b/atvc/examples/ops_pytorch/add/add_custom_impl.h @@ -46,7 +46,7 @@ __global__ __aicore__ void AddCustom(GM_ADDR x, GM_ADDR y, GM_ADDR z, ATVC::EleW { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 将AddComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类 - auto op = ATVC::Kernel::EleWiseOpTemplate, ATVC::EleWiseParam*>(); + auto op = ATVC::Kernel::EleWiseOpTemplate>(); op.Run(x, y, z, &param); } #endif diff --git a/atvc/examples/pytorch/add/pytorch_ascendc_extension.cpp b/atvc/examples/ops_pytorch/add/pytorch_ascendc_extension.cpp similarity index 100% rename from atvc/examples/pytorch/add/pytorch_ascendc_extension.cpp rename to atvc/examples/ops_pytorch/add/pytorch_ascendc_extension.cpp diff --git a/atvc/examples/pytorch/add/run.sh b/atvc/examples/ops_pytorch/add/run.sh similarity index 100% rename from atvc/examples/pytorch/add/run.sh rename to atvc/examples/ops_pytorch/add/run.sh diff --git a/atvc/examples/pytorch/add/run_op.py b/atvc/examples/ops_pytorch/add/run_op.py similarity index 100% rename from atvc/examples/pytorch/add/run_op.py rename to atvc/examples/ops_pytorch/add/run_op.py diff --git a/atvc/examples/pytorch/reduce_sum/README.md b/atvc/examples/ops_pytorch/reduce_sum/README.md similarity index 96% rename from atvc/examples/pytorch/reduce_sum/README.md rename to atvc/examples/ops_pytorch/reduce_sum/README.md index b951656b9c174399dcae3e8825294cbadc64a5a1..23175eb3de04e4f1a31ecab53173f9c59999663a 100644 --- a/atvc/examples/pytorch/reduce_sum/README.md +++ b/atvc/examples/ops_pytorch/reduce_sum/README.md @@ -33,7 +33,7 @@ z = x + y - Atlas A2训练系列产品 ## 编译运行样例算子
-针对自定义算子工程,编译运行包含如下步骤: +针对pytorch算子,编译运行包含如下步骤: - 完成算子pytorch入口和impl文件的实现; - 编译pytorch算子的二进制文件; - 调用执行pytorch算子; @@ -62,16 +62,13 @@ z = x + y * reduceParam ATVC::ReduceParam */ template - __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR z, ATVC::ReduceParam param) + __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, GM_ADDR z, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 // ATVC::ReduceParam* 为tiling的类型 auto op = ATVC::Kernel::ReduceOpTemplate, - Policy, - void, - void, - ATVC::ReduceParam*>(); + Policy>(); op.Run(x, y, &reduceParam); } ``` @@ -153,7 +150,7 @@ z = x + y ### 4. 基于ATVC编写pytorch算子的调用验证 - 调用脚本,生成pytorch算子,并运行测试用例 ```bash - $ cd ./atvc/examples/pytorch/reduce_sum + $ cd ./atvc/examples/ops_pytorch/reduce_sum $ bash run.sh ... OK diff --git a/atvc/examples/pytorch/reduce_sum/pytorch_ascendc_extension.cpp b/atvc/examples/ops_pytorch/reduce_sum/pytorch_ascendc_extension.cpp similarity index 100% rename from atvc/examples/pytorch/reduce_sum/pytorch_ascendc_extension.cpp rename to atvc/examples/ops_pytorch/reduce_sum/pytorch_ascendc_extension.cpp diff --git a/atvc/examples/pytorch/reduce_sum/reduce_sum_impl.h b/atvc/examples/ops_pytorch/reduce_sum/reduce_sum_impl.h similarity index 95% rename from atvc/examples/pytorch/reduce_sum/reduce_sum_impl.h rename to atvc/examples/ops_pytorch/reduce_sum/reduce_sum_impl.h index 670215c7ba865e50c966ae76ee1a26364138d0a4..dcf9fbbd8cfa34ae6897628c8cedd335b881ccba 100644 --- a/atvc/examples/pytorch/reduce_sum/reduce_sum_impl.h +++ b/atvc/examples/ops_pytorch/reduce_sum/reduce_sum_impl.h @@ -29,10 +29,7 @@ __global__ __aicore__ void ReduceSumCustom(GM_ADDR x, GM_ADDR y, ATVC::ReducePar KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, - Policy, - void, - 
void, - ATVC::ReduceParam*>(); + Policy>(); op.Run(x, y, &reduceParam); } diff --git a/atvc/examples/pytorch/reduce_sum/run.sh b/atvc/examples/ops_pytorch/reduce_sum/run.sh similarity index 100% rename from atvc/examples/pytorch/reduce_sum/run.sh rename to atvc/examples/ops_pytorch/reduce_sum/run.sh diff --git a/atvc/examples/pytorch/reduce_sum/run_op.py b/atvc/examples/ops_pytorch/reduce_sum/run_op.py similarity index 100% rename from atvc/examples/pytorch/reduce_sum/run_op.py rename to atvc/examples/ops_pytorch/reduce_sum/run_op.py diff --git a/atvc/examples/reduce_sum/reduce_sum.cpp b/atvc/examples/reduce_sum/reduce_sum.cpp index 43b393022690ea2be7ec24ec5361963bc569c8b3..cf1fe4cf5203a5a4130f73a4bb7163735ac378c2 100644 --- a/atvc/examples/reduce_sum/reduce_sum.cpp +++ b/atvc/examples/reduce_sum/reduce_sum.cpp @@ -56,93 +56,108 @@ bool VerifyResults(const std::vector &golden, const std::vector &o } return true; } + +void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId) +{ + CHECK_ACL(aclrtDestroyStream(stream)); + CHECK_ACL(aclrtDestroyContext(context)); + CHECK_ACL(aclrtResetDevice(deviceId)); + CHECK_ACL(aclFinalize()); +} + +void CleanUp(uint8_t *&xDevice, uint8_t *&yDevice, uint8_t *&yHost) +{ + CHECK_ACL(aclrtFree(xDevice)); + CHECK_ACL(aclrtFree(yDevice)); + CHECK_ACL(aclrtFreeHost(yHost)); +} } /* * 该函数为ReduceCustom算子核函数入口 * x Device上的gm地址,指向Add算子第一个输入 * y Device上的gm地址,指向Add算子第一个输出 - * reduceParam Device上的gm地址,指向运行态ATVC::ReduceParam数据 + * reduceParam 指向运行态ATVC::ReduceParam数据 */ template -__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, GM_ADDR reduceParam) +__global__ __aicore__ void ReduceCustom(GM_ADDR x, GM_ADDR y, ATVC::ReduceParam reduceParam) { KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIV_1_0); // 使用了多核控制指令,设置算子执行时只启动Vector核 // 将计算模板类模板定义作为模板参数传入,Policy由Host层的策略分派API给出 auto op = ATVC::Kernel::ReduceOpTemplate, Policy>(); - op.Run(x, y, reduceParam); + op.Run(x, y, &reduceParam); } - +namespace { // 
负责Reduce类算子的调度,选择对应的Policy最佳策略并执行Kernel函数 template void ReduceOpAdapter(uint8_t* x, uint8_t* y, ATVC::ReduceParam &param, ATVC::ReducePolicy &policy, aclrtStream& stream) { - // 申请临时空间workspace,并将其与ReduceTilingData一同传到Device侧 - uint8_t *paramDevice; + // 申请临时空间workspace uint8_t *workspaceDevice; CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, param.workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST)); param.workspaceAddr = reinterpret_cast(workspaceDevice); - auto reduceParamSize = sizeof(param); - CHECK_ACL(aclrtMalloc((void**)&paramDevice, reduceParamSize, ACL_MEM_MALLOC_HUGE_FIRST)); - CHECK_ACL(aclrtMemcpy(paramDevice, reduceParamSize, - reinterpret_cast(&param), reduceParamSize, - ACL_MEMCPY_HOST_TO_DEVICE)); // 将tiling api计算出的ReducePolicy转化为编译态参数并实例化相应的核函数 if (policy == ATVC::REDUCE_POLICY0) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY1) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY2) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY3) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY4) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY5) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY6) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY7) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY8) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY9) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY10) { - ReduceCustom<<>>(x, y, paramDevice);
+ ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY11) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY12) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY13) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY14) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY15) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY16) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY17) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY18) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY19) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY20) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); } else if (policy == ATVC::REDUCE_POLICY21) { - ReduceCustom<<>>(x, y, paramDevice); + ReduceCustom<<>>(x, y, param); + } else if (policy == ATVC::REDUCE_POLICY22) { + ReduceCustom<<>>(x, y, param); } else { - printf("[ERROR] Cannot find any matched policy.\n"); + printf("[ERROR]: Cannot find any matched policy.\n"); } // 流同步后释放申请的param内存 CHECK_ACL(aclrtSynchronizeStream(stream)); CHECK_ACL(aclrtFree(workspaceDevice)); - CHECK_ACL(aclrtFree(paramDevice)); +} } int32_t main(int32_t argc, char* argv[]) { + if (!ATVC::Host::DebugCheck()) { + printf("[ERROR]: Reduce OpTraits check failed.\n"); + return -1; + } int32_t eleNum = 8 * 1024; int32_t outEleNum = 1 * 1024; size_t inputByteSize = static_cast(eleNum) * sizeof(float); @@ -172,8 +187,10 @@ int32_t main(int32_t argc, char* 
argv[])
     ATVC::ReduceParam param; // Reduce运行态参数,包含TilingData以及临时空间的相关信息
     ATVC::ReducePolicy policy = {-1, -1, -1}; // Reduce运行态参数,负责映射最适合的Reduce模板实现
+    ATVC::Host::ReduceTilingHyperParam hyperParam;
+    hyperParam.maxInnerA = 256; // 设置maxInnerA为256
     // Host侧调用Tiling API完成相关运行态参数的运算
-    if (!ATVC::Host::CalcReduceTiling(shape, dim, &policy, &param)) {
+    if (!ATVC::Host::CalcReduceTiling(shape, dim, &policy, &param, hyperParam)) {
         printf("Reduce tiling error.\n");
         return -1;
     };
@@ -185,14 +202,8 @@ int32_t main(int32_t argc, char* argv[])
     std::vector outputY(reinterpret_cast(yHost), reinterpret_cast(yHost) + outEleNum);
     // 释放Acl资源
-    CHECK_ACL(aclrtFree(xDevice));
-    CHECK_ACL(aclrtFree(yDevice));
-    CHECK_ACL(aclrtFreeHost(yHost));
-
-    CHECK_ACL(aclrtDestroyStream(stream));
-    CHECK_ACL(aclrtDestroyContext(context));
-    CHECK_ACL(aclrtResetDevice(deviceId));
-    CHECK_ACL(aclFinalize());
+    CleanUp(xDevice, yDevice, yHost);
+    CleanACL(stream, context, deviceId);

     if (!VerifyResults(golden, outputY)) {
         return -1;
diff --git a/atvc/examples/sinh_custom/sinh_custom.cpp b/atvc/examples/sinh_custom/sinh_custom.cpp
index e8e277b76e7117137579b4ab6880453d182c1e7d..b8a18a08b849b7805574b5d5eabc27d7c6fa6c42 100644
--- a/atvc/examples/sinh_custom/sinh_custom.cpp
+++ b/atvc/examples/sinh_custom/sinh_custom.cpp
@@ -109,16 +109,16 @@ void CleanACL(aclrtStream &stream, int32_t deviceId)
 /*
  * 该函数为SinhCustom算子核函数入口
- * x Device上的gm地址,指向Add算子第一个输入
- * y Device上的gm地址,指向Add算子第一个输出
+ * x Device上的gm地址,指向SinhCustom算子第一个输入
+ * y Device上的gm地址,指向SinhCustom算子第一个输出
- * param Device上的gm地址,指向运行态ATVC::EleWiseParam数据
+ * param 运行态ATVC::EleWiseParam参数,按值传入核函数
  */
 template
-__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, GM_ADDR param)
+__global__ __aicore__ void SinhCustom(GM_ADDR x, GM_ADDR y, ATVC::EleWiseParam param)
 {
     KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY); // 控制算子执行时只启动Vector核
     auto op = ATVC::Kernel::EleWiseOpTemplate>();
-    op.Run(x, y, param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注
+
op.Run(x, y, &param); // 按照输入、输出、param的顺序传入Run函数中;OpTraits内部的ATVC::OpTemps将由EleWiseOpTemplate内部申请资源,开发无需关注
 }

 int main()
@@ -148,22 +148,14 @@ int main()
     uint8_t *yHost;
     uint8_t *xDevice;
     uint8_t *yDevice;
-    uint8_t *paramDevice;
     CHECK_ACL(aclrtMalloc((void **)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
     CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, inputX.data(), inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
     CHECK_ACL(aclrtMallocHost((void **)(&yHost), outputByteSize));
     CHECK_ACL(aclrtMalloc((void **)&yDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
-    // 将tiling计算的运行时参数EleWiseParam param传到Device侧
-    auto elementParamSize = sizeof(param);
-    CHECK_ACL(aclrtMalloc((void**)&paramDevice, elementParamSize, ACL_MEM_MALLOC_HUGE_FIRST));
-    CHECK_ACL(aclrtMemcpy(paramDevice, elementParamSize,
-                          reinterpret_cast(&param), elementParamSize,
-                          ACL_MEMCPY_HOST_TO_DEVICE));
-    // 调用自定义的Kernel API, <<<>>>的BlockNum参数可通过param的TilingData获取
-    SinhCustom<<>>(xDevice, yDevice, paramDevice);
+    SinhCustom<<>>(xDevice, yDevice, param);
     CHECK_ACL(aclrtSynchronizeStream(stream));
     CHECK_ACL(aclrtMemcpy(yHost, outputByteSize, yDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
@@ -172,7 +164,6 @@ int main()
     // 释放资源
     CHECK_ACL(aclrtFree(xDevice));
     CHECK_ACL(aclrtFree(yDevice));
-    CHECK_ACL(aclrtFree(paramDevice));
     CHECK_ACL(aclrtFreeHost(yHost));
     CleanACL(stream, deviceId);
diff --git a/atvc/examples/tanh_grad/README.md b/atvc/examples/tanh_grad/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5136b68b090b646c08912bbc2d3f24e91d3f9674
--- /dev/null
+++ b/atvc/examples/tanh_grad/README.md
@@ -0,0 +1,76 @@
+
+## 概述
+
+本样例介绍了利用ATVC实现TanhGrad单算子,并演示了相关的调试调优功能。
+
+## 目录结构
+
+| 文件名 | 描述 |
+| ------------------------------------------------------------ | ------------------------------------------------------------ |
+| [tanh_grad.cpp](./tanh_grad.cpp) | TanhGrad算子代码实现以及调用样例 |
+
+## 算子描述
+
+TanhGrad算子数学计算公式:$z = dy * (1 - y ^ 2)$
+
+TanhGrad算子规格:
+
+| 算子类型(OpType) | TanhGrad |
+| ---------------- | -------- |
+
+算子输入:
+
+| name | shape | data type | format |
+| ---- | -------- | --------- | ------ |
+| dy | 8 * 1024 | float | ND |
+| y | 8 * 1024 | float | ND |
+
+算子输出:
+
+| name | shape | data type | format |
+| ---- | -------- | --------- | ------ |
+| z | 8 * 1024 | float | ND |
+
+核函数名:TanhGrad
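The TanhGrad formula above is the chain rule applied to the forward pass: with $y = \tanh(x)$ and $\mathrm{d}\tanh(x)/\mathrm{d}x = 1 - \tanh^2(x) = 1 - y^2$, the gradient flowing back is $z = dy \cdot (1 - y^2)$. As a minimal sketch (the helper name `TanhGradGolden` is hypothetical, not part of the ATVC API), a host-side golden reference for this formula could look like:

```cpp
#include <cstddef>
#include <vector>

// Golden reference for TanhGrad: z[i] = dy[i] * (1 - y[i]^2),
// where y is the forward tanh output and dy the upstream gradient.
std::vector<float> TanhGradGolden(const std::vector<float> &dy, const std::vector<float> &y)
{
    std::vector<float> z(dy.size());
    for (std::size_t i = 0; i < dy.size(); ++i) {
        z[i] = dy[i] * (1.0f - y[i] * y[i]);
    }
    return z;
}
```

The sample's accuracy verification compares the kernel output element-wise against a CPU reference of exactly this shape.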
+
+## 算子基本功能验证
+在代码仓目录下执行:
+```bash
+$ cd ./atvc/tests/
+$ bash run_test.sh tanh_grad
+...
+Accuracy verification passed.
+```
+
+## 算子调试调优
+样例提供的主要调试调优方式如下:
+- 使用`ATVC::Host::EleWiseTilingHyperParam`构建超参,对`ATVC::Host::CalcEleWiseTiling()`接口实现Tiling调优
+- 使用`--run-mode=debug_print`进行DFX信息打印:
+在代码仓目录下执行:
+```bash
+$ cd ./atvc/tests/
+$ bash run_test.sh tanh_grad --run-mode=debug_print
+...
+[INFO]:[ATVC][EleWise]Start to run Template Fuction.
+...
+[INFO]:[ATVC][EleWise] Tiling data: blockNum = 8
+...
+[INFO]:[ATVC][EleWise][CopyIn]: Offset is 7168, copy count is 256.
+...
+[INFO]:[ATVC][EleWise]End to run Template Fuction.
+...
+Accuracy verification passed.
+```
+
+- 使用`--run-mode=profiling`开启Profiling,获取性能数据:
+在代码仓目录下执行:
+```bash
+$ cd ./atvc/tests/
+$ bash run_test.sh tanh_grad --run-mode=profiling
+...
+[INFO] Start Profiling ...
+...
+[INFO] Process profiling data complete, Data is saved in /xxx_path
+Accuracy verification passed.
+```
+
+更多详细的调试调优介绍参考[ATVC开发指南](../../docs/02_developer_guide.md)的`ATVC的调试调优功能`章节。
\ No newline at end of file
diff --git a/atvc/examples/tanh_grad/tanh_grad.cpp b/atvc/examples/tanh_grad/tanh_grad.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..7f8693d6050f1355ef641e09bbcbaa7120a8548e
--- /dev/null
+++ b/atvc/examples/tanh_grad/tanh_grad.cpp
@@ -0,0 +1,204 @@
+/**
+ * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved.
+ *
+ * This file is a part of the CANN Open Software.
+ * Licensed under CANN Open Software License Agreement Version 1.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include "acl/acl.h"
+#include "elewise/elewise_host.h"
+#include "elewise/elewise_device.h"
+
+#define CHECK_ACL(x) \
+    do { \
+        aclError __ret = x; \
+        if (__ret != ACL_ERROR_NONE) { \
+            std::cerr << __FILE__ << ":" << __LINE__ << " aclError:" << __ret << std::endl; \
+        } \
+    } while (0)
+
+namespace {
+static constexpr float REL_TOL = 1e-3f;
+static constexpr float ABS_TOL = 1e-5f;
+
+struct MemoryPtrs {
+    uint8_t *zHost;
+    uint8_t *dyDevice;
+    uint8_t *yDevice;
+    uint8_t *zDevice;
+};
+
+// 判断两个浮点数是否足够接近
+bool IsClose(float a, float b)
+{
+    const float eps = 1e-40f; // 防止分母为零
+    float diff = std::abs(a - b);
+    return (diff <= ABS_TOL) || (diff <= REL_TOL * std::max(std::abs(a), std::abs(b) + eps));
+}
+
+void InitializeData(int32_t eleNum, std::vector &inputDy, std::vector &inputY, std::vector &golden)
+{
+    std::random_device rd;
+    std::mt19937 gen(rd());
+    std::uniform_real_distribution dis(1.0f, 100.0f);
+
+    for (int i = 0; i < eleNum; ++i) {
+        inputDy[i] = (dis(gen));
+        inputY[i] = (dis(gen));
+    }
+    for (int i = 0; i < eleNum; ++i) {
+        // dy * (1 - y ^ 2)
+        golden[i] = (inputDy[i]) * (1 - inputY[i] * inputY[i]);
+    }
+}
+
+bool VerifyResults(const std::vector &golden, const std::vector &output)
+{
+    for (int32_t i = 0; i < golden.size(); i++) {
+        if (!IsClose(golden[i], output[i])) {
+            printf("[ERROR]: Accuracy verification failed!
The expected value of element "
+                "in index [%d] is %f, but actual value is %f.\n",
+                i,
+                golden[i],
+                output[i]);
+            return false;
+        }
+    }
+    return true;
+}
+
+void InitializeACL(aclrtContext &context, aclrtStream &stream, int32_t deviceId)
+{
+    CHECK_ACL(aclInit(nullptr));
+    CHECK_ACL(aclrtSetDevice(deviceId));
+    CHECK_ACL(aclrtCreateContext(&context, deviceId));
+    CHECK_ACL(aclrtCreateStream(&stream));
+}
+
+void CleanACL(aclrtStream &stream, aclrtContext &context, int32_t deviceId)
+{
+    CHECK_ACL(aclrtDestroyStream(stream));
+    CHECK_ACL(aclrtResetDevice(deviceId));
+    CHECK_ACL(aclFinalize());
+}
+
+void CleanUp(uint8_t *&zHost, uint8_t *&dyDevice, uint8_t *&yDevice, uint8_t *&zDevice)
+{
+    CHECK_ACL(aclrtFree(dyDevice));
+    CHECK_ACL(aclrtFree(yDevice));
+    CHECK_ACL(aclrtFree(zDevice));
+    CHECK_ACL(aclrtFreeHost(zHost));
+}
+
+void MallocHostDeviceMemory(MemoryPtrs& memoryPtrs, size_t byteSize, std::vector& inputDy, std::vector& inputY)
+{
+    CHECK_ACL(aclrtMallocHost((void **)(&memoryPtrs.zHost), byteSize));
+    CHECK_ACL(aclrtMalloc((void **)&memoryPtrs.dyDevice, byteSize, ACL_MEM_MALLOC_HUGE_FIRST));
+    CHECK_ACL(aclrtMalloc((void **)&memoryPtrs.yDevice, byteSize, ACL_MEM_MALLOC_HUGE_FIRST));
+    CHECK_ACL(aclrtMalloc((void **)&memoryPtrs.zDevice, byteSize, ACL_MEM_MALLOC_HUGE_FIRST));
+
+    CHECK_ACL(aclrtMemcpy(memoryPtrs.dyDevice, byteSize, inputDy.data(), byteSize, ACL_MEMCPY_HOST_TO_DEVICE));
+    CHECK_ACL(aclrtMemcpy(memoryPtrs.yDevice, byteSize, inputY.data(), byteSize, ACL_MEMCPY_HOST_TO_DEVICE));
+}
+
+// TanhGrad算子中有两个输入,一个输出。类型均为float
+using AddOpTraits = ATVC::OpTraits, ATVC::OpOutputs>;
+
+// 传入编译态参数ATVC::OpTraits
+template
+struct TanhGradComputeFunc {
+    /**
+     * \brief: Compute function of TanhGrad: z = dy * (1 - y ^ 2)
+     * \param [in] dy, input local tensor
+     * \param [in] y, input local tensor
+     * \param [out] z, output local tensor
+     * \return void
+     */
+    template
+    __aicore__ inline void operator()(
+        AscendC::LocalTensor dy, AscendC::LocalTensor y,
AscendC::LocalTensor z)
+    {
+        auto length = y.GetSize();
+        AscendC::Mul(y, y, y, length);
+        AscendC::Mul(y, dy, y, length);
+        AscendC::Sub(z, dy, y, length);
+    }
+};
+}
+
+template
+/**
+* \brief: Kernel entry of TanhGrad, kernel func is: z = dy * (1 - y ^ 2)
+* \param [in] dy, input global tensor
+* \param [in] y, input global tensor
+* \param [out] z, output global tensor
+* \return void
+*/
+__global__ __aicore__ void TanhGrad(GM_ADDR dy, GM_ADDR y, GM_ADDR z, ATVC::EleWiseParam param)
+{
+    KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);
+    auto op = ATVC::Kernel::EleWiseOpTemplate<
+        TanhGradComputeFunc>(); // 将TanhGradComputeFunc仿函数作为模板参数传入,实例化EleWiseOpTemplate模板类
+    op.Run(dy, y, z, &param); // 按照输入、输出、param的顺序传入Run函数,实现GM->GM的数据计算
+}
+
+int main()
+{
+    if (!ATVC::Host::DebugCheck()) {
+        printf("[ERROR]: ElementWise OpTraits check failed.\n");
+        return -1;
+    }
+    // eleNum描述EleWise单输入的元素个数
+    int32_t eleNum = 8 * 1024;
+    size_t byteSize = static_cast(eleNum) * sizeof(float);
+
+    std::vector inputDy(eleNum);
+    std::vector inputY(eleNum);
+    std::vector golden(eleNum);
+
+    // 生成输入数据
+    InitializeData(eleNum, inputDy, inputY, golden);
+    // 声明运行态参数param
+    ATVC::EleWiseParam param;
+    ATVC::Host::EleWiseTilingHyperParam hyperParam;
+    hyperParam.singleCoreBaseLine = 1024; // set base count for a single core to 1024
+    if (!ATVC::Host::CalcEleWiseTiling(eleNum, param, hyperParam)) {
+        printf("[ERROR]: Calculate eleWise tiling failed.\n");
+        return -1;
+    }
+    aclrtContext context;
+    int32_t deviceId = 0;
+    aclrtStream stream = nullptr;
+    InitializeACL(context, stream, deviceId);
+    MemoryPtrs memoryPtrs;
+    MallocHostDeviceMemory(memoryPtrs, byteSize, inputDy, inputY);
+    uint32_t blockNum = param.tilingData.blockNum;
+    TanhGrad<<>>(memoryPtrs.dyDevice, memoryPtrs.yDevice, memoryPtrs.zDevice, param);
+#if ATVC_DEBUG_MODE == 2 // 2: open profiling
+    for (int32_t i = 0; i < 19; i++) { // 19: run kernel 1 + 19 times for profiling
+        TanhGrad<<>>(memoryPtrs.dyDevice, memoryPtrs.yDevice, memoryPtrs.zDevice, param);
+    }
+#endif
+    CHECK_ACL(aclrtSynchronizeStream(stream));
+    CHECK_ACL(aclrtMemcpy(memoryPtrs.zHost, byteSize, memoryPtrs.zDevice, byteSize, ACL_MEMCPY_DEVICE_TO_HOST));
+    std::vector outputZ(reinterpret_cast(memoryPtrs.zHost), reinterpret_cast(memoryPtrs.zHost) + eleNum);
+
+    CleanUp(memoryPtrs.zHost, memoryPtrs.dyDevice, memoryPtrs.yDevice, memoryPtrs.zDevice);
+    CleanACL(stream, context, deviceId);
+
+    if (!VerifyResults(golden, outputZ)) {
+        return -1;
+    }
+    printf("Accuracy verification passed.\n");
+    return 0;
+}
diff --git a/atvc/include/atvc.h b/atvc/include/atvc.h
index 967bf442a91dec4c69f751dd7d1463d903a4686c..37b9c55517bcb650d52308001ccc9755922c4b04 100644
--- a/atvc/include/atvc.h
+++ b/atvc/include/atvc.h
@@ -12,6 +12,7 @@
 #define ATVC_ATVC_H

 #include "common/atvc_opdef.h"
+#include "common/atvc_op_check.h"
 #include "common/const_def.h"
 #include "elewise/common/elewise_common.h"
 #include "elewise/elewise_host.h"
diff --git a/atvc/include/broadcast/broadcast_compute.h b/atvc/include/broadcast/broadcast_compute.h
index 90388f328b0f7be2405d66ad5caf64705d7959df..dbd86ecdc415e0f3f67bd6c403431d5b1db14b32 100644
--- a/atvc/include/broadcast/broadcast_compute.h
+++ b/atvc/include/broadcast/broadcast_compute.h
@@ -16,20 +16,17 @@
 #include
"broadcast/common/broadcast_common.h" namespace ATVC { -template +template class BroadcastCompute { public: using inputDTypeList = typename OpTraits::In::types; using DataType = typename ATVC::TypeListGet::Type; - template - __aicore__ inline void Compute(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + template + __aicore__ inline void Compute(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { - if (patternID == ATVC::AB_PATTERN::ABA) { + if constexpr (PatternID == ATVC::AB_PATTERN::ABA) { ComputeBA(src, inputOffset, dst, dimA, dimB); } else { ComputeAB(src, inputOffset, dst, dimA, dimB); @@ -37,27 +34,21 @@ public: } private: - __aicore__ inline void ComputeBAByDataCopy(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeBAByDataCopy(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { AscendC::DataCopy(dst, src[inputOffset], dimA); uint32_t i = 1; - uint32_t cnt = 1; - while(i < dimB) { + uint32_t cnt; + while (i < dimB) { cnt = i > (dimB - i) ? 
(dimB - i) : i; AscendC::DataCopy(dst[dimA * i], dst, dimA * cnt); i += cnt; } } - __aicore__ inline void ComputeBA(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeBA(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { /* X1 X2 X3 X4 @@ -68,14 +59,11 @@ private: ComputeBAByDataCopy(src, inputOffset, dst, dimA, dimB); } - __aicore__ inline void ComputeABByBrcbCopy(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeABByBrcbCopy(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { - uint32_t brcbProcCnt = 8; // 一次brcb 处理8个元素 - uint32_t dSize = sizeof(DataType); + constexpr uint32_t brcbProcCnt = 8; // 一次brcb 处理8个元素 + constexpr uint32_t dSize = sizeof(DataType); AscendC::BrcbRepeatParams repeatParam(dimB * dSize / ATVC::UB_ALIGN_32, brcbProcCnt * dimB * dSize / ATVC::UB_ALIGN_32); AscendC::Brcb(dst, src[inputOffset], dimA / brcbProcCnt, repeatParam); @@ -84,28 +72,24 @@ private: while (i < dimB) { step = i * 2 > dimB ? 
(dimB - i) : i; // 2: 每次循环 将已拷贝长度为i的元素拷贝到下一个dst,要保证不超出dimB step = step * dSize / ATVC::UB_ALIGN_32; - uint16_t stride = (uint16_t)(dimB * dSize / ATVC::UB_ALIGN_32 - step); - AscendC::DataCopyParams repeatParam = { - (uint16_t)dimA, // blockCount [1, 4095] - step, // 单位为32B - stride, // 取值范围不能超uint16_t - stride}; // 取值范围不能超uint16_t + uint16_t stride = static_cast(dimB * dSize / ATVC::UB_ALIGN_32 - step); + AscendC::DataCopyParams repeatParam = {static_cast(dimA), // blockCount [1, 4095] + step, // 单位为32B + stride, // 取值范围不能超uint16_t + stride}; // 取值范围不能超uint16_t AscendC::DataCopy(dst[i], dst, repeatParam); i = i + step * ATVC::UB_ALIGN_32 / dSize; AscendC::PipeBarrier(); } } - __aicore__ inline void ComputeAB(AscendC::LocalTensor &src, - uint32_t inputOffset, - AscendC::LocalTensor &dst, - uint32_t dimA, - uint32_t dimB) + __aicore__ inline void ComputeAB(AscendC::LocalTensor& src, uint32_t inputOffset, + AscendC::LocalTensor& dst, uint32_t dimA, uint32_t dimB) { /* - X1 - X2 - X3 + X1 + X2 + X3 X4 -> X1 X1 @@ -116,5 +100,5 @@ private: ComputeABByBrcbCopy(src, inputOffset, dst, dimA, dimB); } }; -} +} // namespace ATVC #endif // ATVC_BROADCAST_COMPUTE_H \ No newline at end of file diff --git a/atvc/include/broadcast/broadcast_host.h b/atvc/include/broadcast/broadcast_host.h index 80fba92814ae79e8be737276c88b7266f6d95ee2..da1a6adfc1863f64887ca62d7c5942ccff6c8d28 100644 --- a/atvc/include/broadcast/broadcast_host.h +++ b/atvc/include/broadcast/broadcast_host.h @@ -17,61 +17,33 @@ #define ATVC_BROADCAST_HOST_H #include #include "common/atvc_opdef.h" +#include "common/atvc_op_check.h" #include "common/const_def.h" -#include "common/dtype_utils.h" #include "broadcast/common/broadcast_common.h" #include "broadcast/tiling/broadcast_tiling.h" namespace ATVC { namespace Host { -void PrintParam(BroadcastPolicy* policy, BroadcastParam* param) -{ - printf("[Broadcast] Tiling result: A0 = %lu\n", param->tilingData.A0); - printf("[Broadcast] Tiling result: A11 = %lu\n", 
param->tilingData.A11); - printf("[Broadcast] Tiling result: A12 = %lu\n", param->tilingData.A12); - printf("[Broadcast] Tiling result: A2 = %lu\n", param->tilingData.A2); - printf("[Broadcast] Tiling result: B0 = %lu\n", param->tilingData.B0); - printf("[Broadcast] Tiling result: B1 = %lu\n", param->tilingData.B1); - printf("[Broadcast] Tiling result: B2 = %lu\n", param->tilingData.B2); - printf("[Broadcast] Tiling result: coreNum = %d\n", param->tilingData.coreNum); - printf("[Broadcast] Tiling result: basicBlock = %lu\n", param->tilingData.basicBlock); - printf("[Broadcast] Tiling result: factorACntPerCore = %lu\n", param->tilingData.factorACntPerCore); - printf("[Broadcast] Tiling result: factorATotalCnt = %lu\n", param->tilingData.factorATotalCnt); - printf("[Broadcast] Tiling result: factorBCntPerCore = %lu\n", param->tilingData.factorBCntPerCore); - printf("[Broadcast] Tiling result: factorBTotalCnt = %lu\n", param->tilingData.factorBTotalCnt); - for (int32_t i = 0; i < ATVC::CONST2; i++) { - printf("[Broadcast] Tiling result: shape[%d] = %lu\n", i, param->tilingData.shape[i]); - printf("[Broadcast] Tiling result: dstShape[%d] = %lu\n", i, param->tilingData.dstShape[i]); - } - printf("[Broadcast] Tiling result: policy.patternID = %d\n", policy->patternID); - printf("[Broadcast] Tiling result: workspaceSize = %u\n", param->workspaceSize); - return; -} - -template +template bool CalcBroadcastTiling(std::vector shapeIn, std::vector shapeOut, BroadcastPolicy* policy, BroadcastParam* param) { - if(policy == nullptr || param == nullptr) { - printf("[ERROR] Invalid input: policy or param is null pointer!\n"); + if (policy == nullptr || param == nullptr) { + printf("[ERROR]: [ATVC][Broadcast] Invalid input: policy or param is null pointer!\n"); return false; } - struct BroadcastTilingHyperParam { - int32_t basicBlock = 16 * 1024; // 最大为UB的总大小的1/3 - int nBufferNum = 2; - }; + using inputDTypeList = typename OpTraits::In::types; using DataType = typename 
ATVC::TypeListGet::Type; auto inputDtype = GetOriInputType(); BroadcastTilingInputParam opInput = {shapeIn, shapeOut, inputDtype}; - OpTiling::BroadcastOpTiling tiling(opInput, policy, param); + OpTiling::BroadcastOpTiling tiling(opInput, policy, param); if (!tiling.Run()) { - printf("[ERROR] Tiling Error\n"); + printf("[ERROR]: [ATVC][Broadcast] Run tiling failed!\n"); return false; } - PrintParam(policy, param); return true; }; } // Host diff --git a/atvc/include/broadcast/broadcast_op_template.h b/atvc/include/broadcast/broadcast_op_template.h index 99cdc7d9834829374883145bc66b71be552f6e2b..8aa2755f5e13c14939e7966fdf4fe4ea66736fd6 100644 --- a/atvc/include/broadcast/broadcast_op_template.h +++ b/atvc/include/broadcast/broadcast_op_template.h @@ -12,9 +12,15 @@ #ifndef ATVC_BROADCAST_OP_TEMPLATE_H #define ATVC_BROADCAST_OP_TEMPLATE_H +#include +#include +#include "kernel_operator.h" +#include "common/atvc_opdef.h" #include "common/const_def.h" +#include "common/kernel_check_debug.h" #include "common/ops_utils_device.h" -#include "broadcast/broadcast_utils/broadcast_buf_pool.h" +#include "broadcast/utils/broadcast_buf_pool.h" +#include "broadcast/utils/broadcast_util.h" #include "broadcast/broadcast_compute.h" namespace ATVC { struct BroadcastDataView { @@ -32,56 +38,208 @@ struct BroadcastDataView { }; namespace Kernel { -template +template class BroadcastOpTemplate { public: + // for v-v fusion + static constexpr bool HAS_PRE_COMPUTE = !AscendC::Std::is_same_v; + static constexpr bool HAS_POST_COMPUTE = !AscendC::Std::is_same_v; + using PreComputeTraits = AscendC::Std::conditional_t::ComputeTraits, VoidComputeTraits>; + using PostComputeTraits = AscendC::Std::conditional_t::ComputeTraits, VoidComputeTraits>; + using PreInputs = typename PreComputeTraits::In::types; + using PreOutputs = typename PreComputeTraits::Out::types; + using PreTemp = typename PreComputeTraits::Temp::types; + using PostInputs = typename PostComputeTraits::In::types; + using PostOutputs = 
typename PostComputeTraits::Out::types; + using PostTemp = typename PostComputeTraits::Temp::types; using DataType = typename BroadcastCompute::DataType; + + static constexpr size_t PreInputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PreOutputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PreTempCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PostInputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PostOutputCount = ATVC::TypeListSize::VALUE; + static constexpr size_t PostTempCount = ATVC::TypeListSize::VALUE; + static constexpr size_t BroadcastInputCount = 1; + static constexpr size_t BroadcastOutputCount = 1; + static constexpr uint32_t DATA_SIZE = sizeof(DataType); + static constexpr uint32_t UB_ALIGN_COUNT = ATVC::UB_ALIGN_32 / DATA_SIZE; + __aicore__ inline BroadcastOpTemplate() {} + /* BroadcastOpTemplate对外运行接口,主要完成资源初始化、数据搬入、计算调度、数据搬出操作 @param src: 输入数据的gm指针 @param dst: 输出数据的gm指针 @broadcastParam: broadcast的动态参数,包含tiling data, workspace等 */ - template - __aicore__ inline void Run(GM_ADDR src, GM_ADDR dst, T1 broadcastParam) + template + __aicore__ inline void Run(Args&&... args) { - this->Init(src, dst, broadcastParam); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Start to run template function.\n"); + constexpr size_t PRE_ARGS_COUNT = HAS_PRE_COMPUTE ? PreInputCount + PreOutputCount - BroadcastInputCount : 0; + constexpr size_t BROADCAST_ARGS_COUNT = BroadcastInputCount + BroadcastOutputCount - HAS_PRE_COMPUTE - HAS_POST_COMPUTE; + constexpr size_t POST_ARGS_COUNT = HAS_POST_COMPUTE ? 
PostInputCount + PostOutputCount - BroadcastOutputCount : 0; + auto tuple = AscendC::Std::forward_as_tuple(AscendC::Std::forward(args)...); + SplitAndCall(tuple, + AscendC::Std::make_index_sequence{}, + AscendC::Std::make_index_sequence{}, + AscendC::Std::make_index_sequence{}, + AscendC::Std::make_index_sequence{} + ); + // Check param + ATVC::KernelUtils::PrintParam(param_); + if (!ATVC::KernelUtils::CheckParam(param_)) { + return; + } this->Process(); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] End to run template function.\n"); + } + + template + __aicore__ inline void SetParam(BroadcastParam *param, Args... args) + { + param_ = param; + tilingData_ = ¶m_->tilingData; + static constexpr int size = sizeof...(args); + if constexpr (size == 0) { + return; + } + if constexpr (HAS_PRE_COMPUTE) { + preCompute_.SetParam(args...); + } + if constexpr (HAS_POST_COMPUTE) { + postCompute_.SetParam(args...); + } } private: - template - __aicore__ inline void Init(GM_ADDR src, GM_ADDR dst, T1 broadcastParam) - { - param_ = reinterpret_cast(broadcastParam); - uint32_t srcDataSize = this->param_->tilingData.basicBlock; - uint32_t dstDataSize = this->param_->tilingData.basicBlock; - srcGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(src), srcDataSize); - dstGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(dst), dstDataSize); + template + __aicore__ inline void SplitAndCall(Tuple&& t, AscendC::Std::index_sequence, AscendC::Std::index_sequence, + AscendC::Std::index_sequence, AscendC::Std::index_sequence) + { + if constexpr (HAS_PRE_COMPUTE) { + InitPreArgs(AscendC::Std::get(AscendC::Std::forward(t))...); + } + InitBroadcastArgs(AscendC::Std::get(AscendC::Std::forward(t))...); + if constexpr (HAS_POST_COMPUTE) { + InitPostArgs(AscendC::Std::get(AscendC::Std::forward(t))...); + } + InitArgsParams(AscendC::Std::get(AscendC::Std::forward(t))...); + } + + template + __aicore__ inline void InitArgsParams(Args... 
args) + { + SetParam(args...); + if constexpr (!HAS_PRE_COMPUTE) { + uint32_t srcDataSize = tilingData_->basicBlock; + srcGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(src_), srcDataSize); + } + if constexpr (!HAS_POST_COMPUTE) { + uint32_t dstDataSize = tilingData_->basicBlock; + dstGlobal_.SetGlobalBuffer(reinterpret_cast<__gm__ DataType*>(dst_), dstDataSize); + } + inputCount_ = BroadcastInputCount; + outputCount_ = BroadcastOutputCount; + if (HAS_PRE_COMPUTE) { + inputCount_ = PreInputCount + PreTempCount + PreOutputCount; + } + if (HAS_POST_COMPUTE) { + outputCount_ = PostInputCount + PostTempCount + PostOutputCount; + } bufPool_.template Init(GetTPipePtr(), - ATVC::CONST2, // doublebuff需要的输入个数 - ATVC::CONST2, // 计算结果的个数,一般与inputNum保持一致 - this->param_->tilingData.A2 * this->param_->tilingData.A12 * DATA_SIZE, // 输入Tensor大小 - this->param_->tilingData.A2 * this->param_->tilingData.B2 * DATA_SIZE); // 输出Tensor大小 + inputCount_, // doublebuff需要的输入个数 + outputCount_, // 计算结果的个数,一般与inputNum保持一致 + tilingData_->A2 * tilingData_->A12 * DATA_SIZE, // 输入Tensor大小 + tilingData_->A2 * tilingData_->B2 * DATA_SIZE); // 输出Tensor大小 + } + + template + __aicore__ inline void InitArgsOutput(void) {} + + template + __aicore__ inline void InitArgsInput(void) {} + + template + __aicore__ inline void InitArgsOutput(GM_ADDR dst, Args... args) + { + dst_ = dst; + } + + template + __aicore__ inline void InitArgsInput(GM_ADDR src, Args... args) + { + src_ = src; + InitArgsOutput(args...); + } + + template + __aicore__ inline void InitBroadcastArgs(Args... args) + { + InitArgsInput(args...); + } + + template + __aicore__ inline void InitPreArgs(Args... args) + { + preCompute_.SetArgs(args...); + } + + template + __aicore__ inline void InitPostArgs(Args... args) + { + postCompute_.SetArgs(args...); + } + + template + __aicore__ inline void AllocLocalTensors(Args... 
args) + { + if constexpr (idx < num) { + AscendC::LocalTensor tensor; + bufPool_.template AllocTensor(tensor); + AllocLocalTensors(tensor, args...); + bufPool_.template FreeTensor(tensor); + } else { + if constexpr (isPreCompute) { + preCompute_(args...); + } else { + postCompute_(args...); + } + } + } + + template + __aicore__ inline void ProcessPreCompute(Args... args) + { + // first arg of args is the first input of postCompute, so only need to alloc PostInputCount - 1 tensors + constexpr int32_t localTensorCount = PreInputCount - 1 + PreOutputCount + PreTempCount; + AllocLocalTensors<0, localTensorCount, true>(args...); + } + + template + __aicore__ inline void ProcessPostCompute(Args... args) + { + // first arg of args is the first input of postCompute, so only need to alloc PostInputCount - 1 tensors + constexpr int32_t localTensorCount = PostInputCount - 1 + PostOutputCount + PostTempCount; + AllocLocalTensors<0, localTensorCount, false>(args...); } __aicore__ inline void CopyOutBatch(BroadcastDataView &view, uint32_t dimACount, AscendC::LocalTensor &output) { - uint32_t dimBCount = 0; - SyncDataQueue(); + uint32_t dimBCount = 0; for (int i = 0; i < view.B1; i++) { uint32_t copyOutOffset; if (SelectBroadcastPolicy.patternID == AB_PATTERN::ABA) { - copyOutOffset = dimBCount * view.dimASize + dimACount * this->param_->tilingData.A2; + copyOutOffset = dimBCount * view.dimASize + dimACount * tilingData_->A2; } else { - copyOutOffset = dimACount * this->param_->tilingData.A2 * view.dimBSize + dimBCount; + copyOutOffset = dimACount * tilingData_->A2 * view.dimBSize + dimBCount; } CopyOut(output, copyOutOffset + view.copyOutBaseOffset, view); - dimBCount += this->param_->tilingData.B2; - AscendC::PipeBarrier(); + dimBCount += tilingData_->B2; } } @@ -94,9 +252,9 @@ private: AscendC::LocalTensor input; for (int i = 0; i < view.A11; i++) { inputOffset = 0; - bufPool_.AllocTensor(input); - uint32_t copyInOffset = i * view.A12 * this->param_->tilingData.A2; - if 
(this->param_->tilingData.A0 != 1) { + bufPool_.template AllocTensor(input); + uint32_t copyInOffset = i * view.A12 * tilingData_->A2; + if (tilingData_->A0 != 1) { copyInOffset += view.dimAOffset; } if (copyInOffset >= view.dimASize) { @@ -105,28 +263,28 @@ private: if (copyInOffset + view.copyInSize > view.dimASize) { // 剩下的数据不够一次完整计算, 根据实际数据重新计算 view.copyInSize = view.dimASize - copyInOffset; - view.A12 = OpsUtils::CeilDiv(view.copyInSize, this->param_->tilingData.A2); + view.A12 = OpsUtils::CeilDiv(view.copyInSize, tilingData_->A2); } + CopyIn(input, copyInOffset, view); - bufPool_.SetVecSync(input); - bufPool_.WaitVecSync(input); + bufPool_.template SetVecSync(input); + bufPool_.template WaitVecSync(input); for (int j = 0; j < view.A12; j ++) { AscendC::LocalTensor output; - bufPool_.AllocTensor(output); - SyncDataQueue(); + bufPool_.template AllocTensor(output); + compute_.template Compute(input, inputOffset, output, - OpsUtils::CeilAlign(this->param_->tilingData.A2, UB_ALIGN_COUNT), - OpsUtils::CeilAlign(this->param_->tilingData.B2, UB_ALIGN_COUNT)); - bufPool_.SetCopyOutSync(output); - bufPool_.WaitCopyOutSync(output); + OpsUtils::CeilAlign(tilingData_->A2, UB_ALIGN_COUNT), + OpsUtils::CeilAlign(tilingData_->B2, UB_ALIGN_COUNT)); CopyOutBatch(view, dimACount, output); - bufPool_.FreeTensor(output); + bufPool_.template FreeTensor(output); dimACount++; - inputOffset += this->param_->tilingData.A2; - SyncDataQueue(); + inputOffset += tilingData_->A2; + bufPool_.template SetCopyOutSync(output); + bufPool_.template WaitCopyOutSync(output); } - SyncDataQueue(); - bufPool_.FreeTensor(input); + + bufPool_.template FreeTensor(input); } bufPool_.ResetEvent(); } @@ -136,17 +294,17 @@ private: uint32_t copyOutBaseOffset = 0; // 计算拷出偏移基址 if (SelectBroadcastPolicy.patternID == AB_PATTERN::ABA) { - if (this->param_->tilingData.A0 != 1) { // 核间A切分, 取部分A + if (tilingData_->A0 != 1) { // 核间A切分, 取部分A copyOutBaseOffset += view.dimAOffset; } - if (this->param_->tilingData.B0 != 
1) { // 核间B切分,取部分B + if (tilingData_->B0 != 1) { // 核间B切分,取部分B copyOutBaseOffset += view.dimBOffset * view.dimASize; } } else { - if (this->param_->tilingData.A0 != 1) { // 核间A切分, 取部分A + if (tilingData_->A0 != 1) { // 核间A切分, 取部分A copyOutBaseOffset += view.dimAOffset * view.dimBSize; } - if (this->param_->tilingData.B0 != 1) { // 核间B切分,取部分B + if (tilingData_->B0 != 1) { // 核间B切分,取部分B copyOutBaseOffset += view.dimBOffset; } } @@ -156,44 +314,44 @@ private: __aicore__ inline void CalcView(BroadcastDataView &view) { if (SelectBroadcastPolicy.patternID == AB_PATTERN::ABA) { - view.dimASize = this->param_->tilingData.dstShape[1]; - view.dimBSize = this->param_->tilingData.dstShape[0]; + view.dimASize = tilingData_->dstShape[1]; + view.dimBSize = tilingData_->dstShape[0]; view.inShape[0] = 1; - view.inShape[1] = this->param_->tilingData.A2; - view.outShape[0] = this->param_->tilingData.B2; - view.outShape[1] = this->param_->tilingData.A2; + view.inShape[1] = tilingData_->A2; + view.outShape[0] = tilingData_->B2; + view.outShape[1] = tilingData_->A2; } else { - view.dimASize = this->param_->tilingData.dstShape[0]; - view.dimBSize = this->param_->tilingData.dstShape[1]; - view.inShape[0] = this->param_->tilingData.A2; + view.dimASize = tilingData_->dstShape[0]; + view.dimBSize = tilingData_->dstShape[1]; + view.inShape[0] = tilingData_->A2; view.inShape[1] = 1; - view.outShape[0] = this->param_->tilingData.A2; - view.outShape[1] = this->param_->tilingData.B2; + view.outShape[0] = tilingData_->A2; + view.outShape[1] = tilingData_->B2; } - view.A11 = this->param_->tilingData.A11; - view.A12 = this->param_->tilingData.A12; - view.B1 = this->param_->tilingData.B1; + view.A11 = tilingData_->A11; + view.A12 = tilingData_->A12; + view.B1 = tilingData_->B1; uint32_t blockId = AscendC::GetBlockIdx(); - uint32_t dimAIdx = blockId / this->param_->tilingData.B0; - uint32_t dimBIdx = blockId % this->param_->tilingData.factorBTotalCnt; - view.dimAOffset = dimAIdx * 
this->param_->tilingData.factorACntPerCore; - view.dimBOffset = dimBIdx * this->param_->tilingData.factorBCntPerCore; + uint32_t dimAIdx = blockId / tilingData_->B0; + uint32_t dimBIdx = blockId % tilingData_->factorBTotalCnt; + view.dimAOffset = dimAIdx * tilingData_->factorACntPerCore; + view.dimBOffset = dimBIdx * tilingData_->factorBCntPerCore; // 计算一次计算的输入数据大小 - view.copyInSize = view.A12 * this->param_->tilingData.A2; // 一次拷贝A12份数据, for循环计算A12次 - if (view.dimAOffset + this->param_->tilingData.factorACntPerCore > view.dimASize) { + view.copyInSize = view.A12 * tilingData_->A2; // 一次拷贝A12份数据, for循环计算A12次 + if (view.dimAOffset + tilingData_->factorACntPerCore > view.dimASize) { // 剩下的A维度的数据不够每个核分到的A数目,重新计算实际的A维度切分 uint32_t realShape = view.dimASize - view.dimAOffset; - uint32_t A1 = OpsUtils::CeilDiv(realShape, this->param_->tilingData.A2); - if (A1 < view.A12) { + uint32_t dimA1 = OpsUtils::CeilDiv(realShape, tilingData_->A2); + if (dimA1 < view.A12) { view.A11 = 1; - view.A12 = A1; + view.A12 = dimA1; } else { - view.A11 = OpsUtils::CeilDiv(A1, view.A12); + view.A11 = OpsUtils::CeilDiv(dimA1, view.A12); } } - if (view.dimBOffset + this->param_->tilingData.factorBCntPerCore > view.dimBSize) { + if (view.dimBOffset + tilingData_->factorBCntPerCore > view.dimBSize) { uint32_t realShape = view.dimBSize - view.dimBOffset; - view.B1 = OpsUtils::CeilDiv(realShape, this->param_->tilingData.B2); + view.B1 = OpsUtils::CeilDiv(realShape, tilingData_->B2); } view.copyOutBaseOffset = CalcCopyOutBaseOffset(view); } @@ -206,7 +364,13 @@ private: copyInParams.blockLen = view.copyInSize * DATA_SIZE; copyInParams.srcStride = 0; copyInParams.dstStride = 0; + if constexpr(HAS_PRE_COMPUTE) { + ProcessPreCompute(input, copyInOffset, copyInParams); + return; + } AscendC::DataCopyPad(input, srcGlobal_[copyInOffset], copyInParams, padParams); + ATVC::Kernel::DebugPrintf("[INFO]: [ATVC][Broadcast][CopyIn] Offset is %u, block len is %u " + "block count is %u.\n", copyInOffset, 
copyInParams.blockLen, copyInParams.blockCount); } __aicore__ inline void CopyOutNonAligned(AscendC::LocalTensor &output, @@ -214,23 +378,32 @@ private: { uint32_t blockId = AscendC::GetBlockIdx(); uint32_t dstDataSize = view.outShape[0] * view.outShape[1]; - uint64_t dstShape = this->param_->tilingData.dstShape[1]; + uint64_t dstShape = tilingData_->dstShape[1]; AscendC::DataCopyExtParams copyOutParams; copyOutParams.blockLen = view.outShape[1] * DATA_SIZE; copyOutParams.blockCount = dstDataSize * DATA_SIZE / copyOutParams.blockLen; copyOutParams.srcStride = 0; + if (view.outShape[1] + copyOutOffset % dstShape > dstShape) { // 列非对齐, 按实际数据拷贝 copyOutParams.srcStride = OpsUtils::CeilAlign(view.outShape[1], UB_ALIGN_COUNT) * DATA_SIZE; copyOutParams.blockLen = (dstShape - copyOutOffset % dstShape) * DATA_SIZE; copyOutParams.srcStride = (copyOutParams.srcStride - copyOutParams.blockLen) / ATVC::UB_ALIGN_32; } - if (view.outShape[0] + copyOutOffset / dstShape > this->param_->tilingData.dstShape[0]) { + if (view.outShape[0] + copyOutOffset / dstShape > tilingData_->dstShape[0]) { // 行非对齐, 按实际数据拷贝 - copyOutParams.blockCount = (this->param_->tilingData.dstShape[0] - copyOutOffset / dstShape); + copyOutParams.blockCount = (tilingData_->dstShape[0] - copyOutOffset / dstShape); } copyOutParams.dstStride = dstShape * DATA_SIZE - copyOutParams.blockLen; + bufPool_.template SetCopyOutSync(output); + bufPool_.template WaitCopyOutSync(output); + if constexpr(HAS_POST_COMPUTE) { + ProcessPostCompute(output, copyOutOffset, copyOutParams); + return; + } AscendC::DataCopyPad(dstGlobal_[copyOutOffset], output, copyOutParams); + ATVC::Kernel::DebugPrintf("[INFO]: [ATVC][Broadcast][CopyOut] Offset is %u, block len is %u block count is %u.\n", + copyOutOffset, copyOutParams.blockLen, copyOutParams.blockCount); } __aicore__ inline void CopyOut(AscendC::LocalTensor &output, uint32_t copyOutOffset, BroadcastDataView &view) @@ -238,14 +411,19 @@ private: CopyOutNonAligned(output, 
copyOutOffset, view); } + GM_ADDR src_; + GM_ADDR dst_; AscendC::GlobalTensor srcGlobal_; AscendC::GlobalTensor dstGlobal_; BroadcastCompute compute_; - ParamPtr param_; - KernelUtils::BroadcastBufPool bufPool_; - constexpr static uint32_t DATA_SIZE = sizeof(DataType); - constexpr static uint32_t UB_ALIGN_COUNT = ATVC::UB_ALIGN_32 / DATA_SIZE; + AscendC::Std::conditional_t preCompute_; + AscendC::Std::conditional_t postCompute_; + const BroadcastParam *param_; + const BroadcastOpTilingData *tilingData_; + KernelUtils::BroadcastBufPool bufPool_; + uint32_t inputCount_; + uint32_t outputCount_; }; } // namespace Kernel } // namespace ATVC -#endif // ATVC_BROADCAST_OP_TEMPLATE_H \ No newline at end of file +#endif // ATVC_BROADCAST_OP_TEMPLATE_H diff --git a/atvc/include/broadcast/common/broadcast_common.h b/atvc/include/broadcast/common/broadcast_common.h index fcdd0c0ff194596374ea5f58ba196f364fba1105..0652209625073b24b43154629a31264a5c05b42f 100644 --- a/atvc/include/broadcast/common/broadcast_common.h +++ b/atvc/include/broadcast/common/broadcast_common.h @@ -20,12 +20,12 @@ namespace ATVC { namespace AB_PATTERN { - static constexpr uint32_t A = 100; - static constexpr uint32_t AB = 11; - static constexpr uint32_t ABA = 20; - static constexpr uint32_t ABAB = 31; - static constexpr uint32_t ABABA = 40; -}; +static constexpr uint32_t A = 100; +static constexpr uint32_t AB = 11; +static constexpr uint32_t ABA = 20; +static constexpr uint32_t ABAB = 31; +static constexpr uint32_t ABABA = 40; +}; // namespace AB_PATTERN struct BroadcastPolicy { public: @@ -36,8 +36,8 @@ public: patternID(patternID_), loopABCount(loopABCount_),loopInnerABCount(loopInnerABCount_){} bool operator==(const BroadcastPolicy& rhs) const { - return this->patternID == rhs.patternID && this->loopABCount == rhs.loopABCount &&\ - this->loopInnerABCount == rhs.loopInnerABCount; + return this->patternID == rhs.patternID && this->loopABCount == rhs.loopABCount && + this->loopInnerABCount == 
rhs.loopInnerABCount; } }; @@ -68,8 +68,8 @@ struct BroadcastParam { int32_t nBufferNum = 2; }; -static constexpr BroadcastPolicy BROADCAST_POLICY0 { ATVC::AB_PATTERN::AB, 10, 1 }; -static constexpr BroadcastPolicy BROADCAST_POLICY1 { ATVC::AB_PATTERN::ABA, 10, 1 }; -}; +static constexpr BroadcastPolicy BROADCAST_POLICY0{ATVC::AB_PATTERN::AB, 10, 1}; +static constexpr BroadcastPolicy BROADCAST_POLICY1{ATVC::AB_PATTERN::ABA, 10, 1}; +}; // namespace ATVC -#endif // ATVC_BROADCAST_COMMON_H \ No newline at end of file +#endif // ATVC_BROADCAST_COMMON_H \ No newline at end of file diff --git a/atvc/include/broadcast/tiling/broadcast_tiling.h b/atvc/include/broadcast/tiling/broadcast_tiling.h index 10dfc8fe7f8b333923e02407904fc145e099bc91..09bf4f55d8433fcbde6f9effc245bf7dbe32adc6 100644 --- a/atvc/include/broadcast/tiling/broadcast_tiling.h +++ b/atvc/include/broadcast/tiling/broadcast_tiling.h @@ -38,31 +38,44 @@ struct BroadcastTilingInputParam { } namespace OpTiling { -constexpr static int32_t BRC_BASIC_NUM = 4; // broadcast输入输出内存基本块分配个数 + +template class BroadcastOpTiling { -public: +public: + using OpInputs = typename OpTraits::In::types; + using OpOutputs = typename OpTraits::Out::types; + using OpTemp = typename OpTraits::Temp::types; + static constexpr size_t OP_INPUT_COUNT = ATVC::TypeListSize::VALUE; + static constexpr size_t OP_OUTPUT_COUNT = ATVC::TypeListSize::VALUE; + static constexpr size_t OP_TEMP_COUNT = ATVC::TypeListSize::VALUE; + static constexpr size_t BROADCAST_UB_NUM = OP_INPUT_COUNT + OP_OUTPUT_COUNT + OP_TEMP_COUNT; + BroadcastOpTiling(ATVC::BroadcastTilingInputParam& inputParam, ATVC::BroadcastPolicy* policy, ATVC::BroadcastParam* param) : opInput_(inputParam), param_(param), policy_(policy) { compileInfo_ = ATVC::GetOpCompileInfo(); + /* Built-in tiling only support allocate unified memory evenly, so all we need to know is the definition of + the complete operator. 
If you want to allocate UB memory unevenly, you need to know the definitions of + broadcast, pre-compute, post-compute separately, and extend tiling to support non-uniform distribution. */ + broadcast_basic_num_ = BROADCAST_UB_NUM * param->nBufferNum; } bool Run() { if (!IsAxesValid(opInput_.shapeIn, opInput_.shapeOut)) { - printf("[ERROR]Shape checkout failed!\n"); + printf("[ERROR]: [ATVC][Broadcast] Shape check failed!\n"); return false; } std::vector newShapeIn; std::vector newShapeOut; if (!EliminateOne(opInput_.shapeIn, opInput_.shapeOut, newShapeIn, newShapeOut)) { - printf("[ERROR]Failed to eliminate shape!\n"); + printf("[ERROR]: [ATVC][Broadcast] Failed to eliminate shape!\n"); return false; } if (!DoTiling(newShapeIn, newShapeOut)) { - printf("[ERROR]Failed to Calculate Tiling param!\n"); + printf("[ERROR]: [ATVC][Broadcast] Failed to calculate tiling param!\n"); return false; } CalcWorkSpace(); @@ -71,7 +84,7 @@ public: private: template - void ComputeStride(std::vector& shapeIn, std::vector& shapeOut) + void ComputeStride(const std::vector& shapeIn, const std::vector& shapeOut) { // shape if (shapeIn[0] == 1 && shapeOut[ATVC::DIM0] == 1) { @@ -95,10 +108,10 @@ private: param_->tilingData.dstStride[ATVC::DIM0] = dimB * dimA; param_->tilingData.dstStride[ATVC::DIM1] = dimB; } else { - param_->tilingData.stride[ATVC::DIM0] = dimA * 1; + param_->tilingData.stride[ATVC::DIM0] = dimA; param_->tilingData.stride[ATVC::DIM1] = dimA; param_->tilingData.stride[ATVC::DIM2] = 1; - param_->tilingData.dstStride[ATVC::DIM0] = dimA * dimB * 1; + param_->tilingData.dstStride[ATVC::DIM0] = dimA * dimB; param_->tilingData.dstStride[ATVC::DIM1] = dimB; param_->tilingData.dstStride[ATVC::DIM2] = 1; } @@ -123,7 +136,7 @@ private: size_t sizeIn = shapeIn.size(); size_t sizeOut = shapeOut.size(); if (sizeOut != sizeIn) { - printf("input dim in is not equel to output dim! 
\n"); + printf("input dim in is not equal to output dim!\n"); return false; }; @@ -133,16 +146,14 @@ private: return false; } else if (shapeIn[i] <= 0) { printf("Input and output shape should be more than 0\n"); - return false; + return false; } } return true; } - bool EliminateOne(std::vector &oriShapeIn, - std::vector &oriShapeOut, - std::vector &shapeIn, - std::vector &shapeOut) + bool EliminateOne(std::vector& oriShapeIn, std::vector& oriShapeOut, + std::vector& shapeIn, std::vector& shapeOut) { bool isCurB = false; bool haveA = false; @@ -151,7 +162,7 @@ private: for (size_t i = 0; i < oriShapeIn.size(); i++) { if (oriShapeIn[i] == 1 && oriShapeOut[i] != oriShapeIn[i]) { // B轴 if (!isCurB && haveB) { - printf("[ERROR]Only support AB/BA!\n"); + printf("[ERROR]: [ATVC][Broadcast] Only support AB/BA!\n"); return false; } if (!haveB) { @@ -165,7 +176,7 @@ private: haveB = true; } else { // A轴 if (isCurB && haveA) { - printf("[ERROR]Only support AB/BA!\n"); + printf("[ERROR]: [ATVC][Broadcast] Only support AB/BA!\n"); return false; } if (!haveA) { @@ -179,8 +190,8 @@ private: haveA = true; } } - if (shapeIn.size() !=2U && shapeOut.size() != 2U) { - printf("[ERROR] Shape after eliminate is not 2 dim!\n"); + if (shapeIn.size() != 2U && shapeOut.size() != 2U) { + printf("[ERROR]: [ATVC][Broadcast] Shape after elimination is not 2-dim!\n"); return false; } if (shapeIn[0] != shapeOut[0]) { @@ -193,9 +204,6 @@ private: bool DoTiling(std::vector& shapeIn, std::vector& shapeOut) { int32_t shapeSize = shapeIn.size(); - for (int32_t i = 0; i < shapeSize; i++) { - printf("DoTiling shapeSize[%d]: shape[%d] %lu\n", shapeSize, i, shapeIn[i]); - } switch (shapeSize) { case ATVC::CONST1: return ComputeTiling(shapeIn, shapeOut); @@ -204,7 +212,8 @@ private: case ATVC::CONST3: return ComputeTiling(shapeIn, shapeOut); default: - printf("[ERROR] Compute tiling error because of invalid input shape size[%d]\n", shapeSize); + printf("[ERROR]: [ATVC][Broadcast] Compute tiling error because 
of invalid input shape size[%d]\n", + shapeSize); return false; } return false; @@ -214,7 +223,7 @@ private: bool ComputeTiling(std::vector& shapeIn, std::vector& shapeOut) { if (!CalcSplitParam(shapeOut)) { - printf("[ERROR] Calculate tiling param failed!\n"); + printf("[ERROR]: [ATVC][Broadcast] Calculate tiling param failed!\n"); return false; } ComputeStride(shapeIn, shapeOut); @@ -224,7 +233,7 @@ private: uint64_t CalcBasicBlock() { - uint64_t basicBlock = OpsUtils::FloorAlign(compileInfo_.ubSize / BRC_BASIC_NUM, ATVC::UB_ALIGN_32); + uint64_t basicBlock = OpsUtils::FloorAlign(compileInfo_.ubSize / broadcast_basic_num_, ATVC::UB_ALIGN_32); if (basicBlock > ATVC::BLOCK_SIZE_64K) { basicBlock = ATVC::BLOCK_SIZE_64K; } else if (basicBlock > ATVC::BLOCK_SIZE_48K) { @@ -251,29 +260,31 @@ private: ATVC::BroadcastOpTilingData& tilingData = param_->tilingData; uint64_t dSize = ge::GetSizeByDataType(opInput_.inputDtype); if (dSize == 0) { - printf("[ERROR] Data size is invalid, please check input data type!\n"); + printf("[ERROR]: [ATVC][Broadcast] Data size is invalid, please check input data type!\n"); return false; } if (tilingData.coreNum > compileInfo_.vectorCoreNum) { - printf("[ERROR] Check tiling failed, coreNum(%u) > vector Real Core count(%lu)\n", - tilingData.coreNum, compileInfo_.vectorCoreNum); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, coreNum(%u) " + "must be smaller than vector total core number(%lu)\n", + tilingData.coreNum, compileInfo_.vectorCoreNum); return false; } if (tilingData.A2 * tilingData.A12 * tilingData.A11 * tilingData.A0 < dimA) { - printf("[ERROR] Check tiling failed, A2 * A12 * A11 * A0 < dimA(%u)\n", dimA); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, A2 * A12 * A11 * A0 < dimA(%u)\n", dimA); return false; } if (tilingData.B2 * tilingData.B1 * tilingData.B0 < dimB) { - printf("[ERROR] Check tiling failed, B2 * B1 * B0 < dimB(%u)\n", dimB); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, B2 * B1 * 
B0 < dimB(%u)\n", dimB); return false; } if (tilingData.B2 * dSize % ATVC::UB_ALIGN_32 != 0) { - printf("[ERROR] Check tiling failed, B2(%lu) is not aligined with 32B\n", tilingData.B2); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, B2(%lu) is not aligned with 32B\n", + tilingData.B2); return false; } if (tilingData.A2 * dSize % ATVC::UB_ALIGN_32 != 0) { - printf("[ERROR] Check tiling failed, A2(%lu) is not aligined with 32B\n", tilingData.A2); + printf("[ERROR]: [ATVC][Broadcast] Check tiling failed, A2(%lu) is not aligned with 32B\n", tilingData.A2); return false; } return true; @@ -293,18 +304,18 @@ private: uint64_t basicBlock = CalcBasicBlock(); uint64_t dSize = ge::GetSizeByDataType(opInput_.inputDtype); if (dSize == 0) { - printf("[ERROR] Data size is invalid, please check input data type!\n"); + printf("[ERROR]: [ATVC][Broadcast] Data size is invalid, please check input data type!\n"); return false; } uint64_t dUint = ATVC::UB_ALIGN_32 / dSize; uint64_t cacheSize = OpsUtils::FloorDiv(basicBlock, dSize); - uint32_t dimA = Pattern::TailA ? Pattern::Dim - 1 : Pattern::Dim - 2; // A - uint32_t dimB = Pattern::TailA ? Pattern::Dim - 2 : Pattern::Dim - 1; // B + uint32_t dimA = Pattern::TailA ? Pattern::Dim - 1 : Pattern::Dim - 2; // A + uint32_t dimB = Pattern::TailA ? Pattern::Dim - 2 : Pattern::Dim - 1; // B uint64_t i = OpsUtils::FloorAlign(shape[dimA], dUint); // 32B对齐 uint64_t j = OpsUtils::FloorAlign(shape[dimB], dUint); // 32B对齐 ATVC::BroadcastOpTilingData& tilingData = param_->tilingData; - if (Pattern::TailA) {// 优先A轴打满 + if constexpr (Pattern::TailA) { // 优先A轴打满 tilingData.B2 = dUint; // B2最小值 tilingData.A2 = i > OpsUtils::FloorDiv(cacheSize, dUint) ? 
OpsUtils::FloorDiv(cacheSize, dUint) : i; tilingData.B2 = OpsUtils::FloorAlign(OpsUtils::FloorDiv(cacheSize, tilingData.A2), dUint); @@ -321,8 +332,8 @@ private: } // 1.优先多核 A0 B0打满核后再计算核内循环 - tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2); - tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2); + tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2); + tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2); // A0*B0为实际的block num 必须小于vectorCoreNum while (tilingData.A0 * tilingData.B0 > compileInfo_.vectorCoreNum) { if (tilingData.B0 > 1) { // 优先A0切轴 @@ -335,14 +346,15 @@ private: // 2.核内循环优先A12,因为A12只需要copyIn 1次 tilingData.A12 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2 * tilingData.A0); if (tilingData.A12 * tilingData.A2 > cacheSize) { - tilingData.A12 = OpsUtils::FloorDiv(cacheSize , tilingData.A2); + tilingData.A12 = OpsUtils::FloorDiv(cacheSize, tilingData.A2); } - tilingData.A11 = OpsUtils::CeilDiv(shape[dimA], (tilingData.A0 * tilingData.A2 * tilingData.A12)); // 计算精确A11 - tilingData.B1= OpsUtils::CeilDiv(shape[dimB], (tilingData.B0 * tilingData.B2)); + tilingData.A11 = + OpsUtils::CeilDiv(shape[dimA], (tilingData.A0 * tilingData.A2 * tilingData.A12)); // 计算精确A11 + tilingData.B1 = OpsUtils::CeilDiv(shape[dimB], (tilingData.B0 * tilingData.B2)); // 3.最后重新计算A0 B0避免空核 - tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2 * tilingData.A11 * tilingData.A12); - tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2 * tilingData.B1); + tilingData.A0 = OpsUtils::CeilDiv(shape[dimA], tilingData.A2 * tilingData.A11 * tilingData.A12); + tilingData.B0 = OpsUtils::CeilDiv(shape[dimB], tilingData.B2 * tilingData.B1); // 4.写Tiling结果 ExpandTilingParam(basicBlock); @@ -354,6 +366,7 @@ private: ATVC::BroadcastParam* param_ {nullptr}; ATVC::BroadcastPolicy* policy_ {nullptr}; ATVC::OpCompileInfo compileInfo_ = {0, 0, 0, 0}; + uint32_t broadcast_basic_num_; }; } // namespace OpTiling #endif // ATVC_BROADCAST_TILING_H \ No 
newline at end of file diff --git a/atvc/include/broadcast/broadcast_utils/broadcast_buf_pool.h b/atvc/include/broadcast/utils/broadcast_buf_pool.h similarity index 75% rename from atvc/include/broadcast/broadcast_utils/broadcast_buf_pool.h rename to atvc/include/broadcast/utils/broadcast_buf_pool.h index 0888c084dc0f58f2f43a9152eaec66b212bd1a38..db1a36200260b12daecf260f580e03b67bcae003 100644 --- a/atvc/include/broadcast/broadcast_utils/broadcast_buf_pool.h +++ b/atvc/include/broadcast/utils/broadcast_buf_pool.h @@ -28,25 +28,32 @@ struct BrcPoolManagerUnit { int32_t offset = 0; }; +template class BroadcastBufPool { -constexpr static int32_t MAX_INPUT_SIZE = 10; + constexpr static int32_t MAX_INPUT_SIZE = 10; public: - __aicore__ inline BroadcastBufPool(){}; + __aicore__ inline BroadcastBufPool() {}; - template + template __aicore__ inline void Init(AscendC::TPipe* pipeIn, - int32_t inputNum, // doublebuff需要的输入个数 - int32_t computeNum, // 计算结果的个数,一般与inputNum保持一致 - int32_t inBlockLen, // 一次计算的输入基本块大小 - int32_t outBlockLen) { // 一次计算的输出大小 + int32_t inputNum, // doublebuff需要的输入个数 + int32_t computeNum, // 计算结果的个数,一般与inputNum保持一致 + int32_t inBlockLen, // 一次计算的输入基本块大小 + int32_t outBlockLen) + { + // 一次计算的输出大小 /* _______________________________________________________________________________________________________ | inputTensor 0 | inputTensor 1 | outputTensor 0 | outputTensor 0 | |___________________|___________________|_______________________________|_______________________________| */ + if (EnableDb) { + inputNum *= ATVC::CONST2; + computeNum *= ATVC::CONST2; + } pipe_ = pipeIn; - int32_t eleSize = sizeof(DataType); + constexpr int32_t eleSize = static_cast(sizeof(T)); inputNum_ = inBlockLen / eleSize; outputNum_ = outBlockLen / eleSize; int32_t poolSize = inBlockLen * inputNum + outBlockLen * computeNum; @@ -61,7 +68,8 @@ public: } template - __aicore__ inline const void AllocTensor(AscendC::LocalTensor& tensor) { + __aicore__ inline const void 
AllocTensor(AscendC::LocalTensor& tensor) + { if constexpr (IsInput) { int32_t idx = GetInputTensorId(); tensor = qQue_.GetWithOffset(inputNum_, inputUnit_.offset + idx * inputNum_ * sizeof(T)); @@ -72,42 +80,46 @@ public: } template - __aicore__ inline const void FreeTensor(AscendC::LocalTensor& tensor) { + __aicore__ inline const void FreeTensor(AscendC::LocalTensor& tensor) + { if constexpr (!IsInput) { - uint32_t idx = GetOutputTensorIdx(tensor); + uint32_t idx = GetOutputTensorIdx(tensor); isBusyOut_[idx] = false; // 恢复isBusy_状态 } } - template - __aicore__ inline const void SetVecSync(AscendC::LocalTensor& tensor) { - uint32_t idx = GetInputTensorIdx(tensor); - event_t eventId = static_cast(pipe_->AllocEventID()); + template + __aicore__ inline const void SetVecSync(AscendC::LocalTensor& tensor) + { + uint32_t idx = GetInputTensorIdx(tensor); + event_t eventId = static_cast(pipe_->AllocEventID()); vecEventId_[idx] = eventId; - AscendC::SetFlag(eventId); + AscendC::SetFlag(eventId); } - template - __aicore__ inline const void WaitVecSync(AscendC::LocalTensor& tensor) { - uint32_t idx = GetInputTensorIdx(tensor); - AscendC::WaitFlag(vecEventId_[idx]); - pipe_->ReleaseEventID(vecEventId_[idx]); + template + __aicore__ inline const void WaitVecSync(AscendC::LocalTensor& tensor) + { + uint32_t idx = GetInputTensorIdx(tensor); + AscendC::WaitFlag(vecEventId_[idx]); + pipe_->ReleaseEventID(vecEventId_[idx]); } - template - __aicore__ inline const void SetCopyOutSync(AscendC::LocalTensor& tensor) { - uint32_t idx = GetOutputTensorIdx(tensor); - event_t eventId = static_cast(pipe_->AllocEventID()); + template + __aicore__ inline const void SetCopyOutSync(AscendC::LocalTensor& tensor) + { + uint32_t idx = GetOutputTensorIdx(tensor); + event_t eventId = static_cast(pipe_->AllocEventID()); outEventId_[idx] = eventId; - AscendC::SetFlag(eventId); + AscendC::SetFlag(eventId); } - template + template __aicore__ inline const void WaitCopyOutSync(AscendC::LocalTensor& tensor) 
{ - uint32_t idx = GetOutputTensorIdx(tensor); - AscendC::WaitFlag(outEventId_[idx]); - pipe_->ReleaseEventID(outEventId_[idx]); + uint32_t idx = GetOutputTensorIdx(tensor); + AscendC::WaitFlag(outEventId_[idx]); + pipe_->ReleaseEventID(outEventId_[idx]); } template diff --git a/atvc/include/broadcast/utils/broadcast_util.h b/atvc/include/broadcast/utils/broadcast_util.h new file mode 100644 index 0000000000000000000000000000000000000000..24cd745a08269380a9d63b8eab4b63267569f7ca --- /dev/null +++ b/atvc/include/broadcast/utils/broadcast_util.h @@ -0,0 +1,70 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + */ + +/*! 
+ * \file broadcast_util.h + * \brief broadcast util interface + */ + +#ifndef ATVC_BROADCAST_UTIL_H +#define ATVC_BROADCAST_UTIL_H +#include "common/const_def.h" +#include "common/kernel_check_debug.h" + +namespace ATVC { +namespace KernelUtils { +template +__aicore__ inline void PrintParam(const T* param) +{ + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A0 = %lu\n", param->tilingData.A0); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A11 = %lu\n", param->tilingData.A11); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A12 = %lu\n", param->tilingData.A12); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: A2 = %lu\n", param->tilingData.A2); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: B0 = %lu\n", param->tilingData.B0); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: B1 = %lu\n", param->tilingData.B1); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: B2 = %lu\n", param->tilingData.B2); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: coreNum = %d\n", param->tilingData.coreNum); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: basicBlock = %lu\n", param->tilingData.basicBlock); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorACntPerCore = %lu\n", + param->tilingData.factorACntPerCore); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorATotalCnt = %lu\n", + param->tilingData.factorATotalCnt); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorBCntPerCore = %lu\n", + param->tilingData.factorBCntPerCore); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: factorBTotalCnt = %lu\n", + param->tilingData.factorBTotalCnt); + for (int32_t i = 0; i < ATVC::CONST2; i++) { + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: shape[%d] = %lu\n", + i, param->tilingData.shape[i]); + 
ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: dstShape[%d] = %lu\n", + i, param->tilingData.dstShape[i]); + } + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: policy.patternID = %d\n", + SelectBroadcastPolicy.patternID); + ATVC::Kernel::DebugPrintf("[INFO]:[ATVC][Broadcast] Tiling data: workspaceSize = %u\n", param->workspaceSize); + return; +} + +template +__aicore__ inline bool CheckParam(const T* param) +{ + auto *tilingData = &param->tilingData; + if (tilingData->coreNum < AscendC::GetBlockIdx() + 1) { + ATVC::Kernel::DebugPrintf("[ERROR]: [ATVC][Broadcast] Tiling data[coreNum = %d] is invalid, " + "it must be larger than the current block index.\n", tilingData->coreNum); + return false; + } + return true; +} + +} // namespace KernelUtils +} // namespace ATVC +#endif \ No newline at end of file diff --git a/atvc/include/common/atvc_op_check.h b/atvc/include/common/atvc_op_check.h new file mode 100644 index 0000000000000000000000000000000000000000..06ae33e11bb27c4e9566c40ca6f0c07158dc27d4 --- /dev/null +++ b/atvc/include/common/atvc_op_check.h @@ -0,0 +1,75 @@ +/** + * Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. + * + * This file is a part of the CANN Open Software. + * Licensed under CANN Open Software License Agreement Version 1.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ */ + +#ifndef ATVC_COMMON_OP_CHECK_H +#define ATVC_COMMON_OP_CHECK_H + +#include +#include "atvc_opdef.h" + +namespace ATVC { +namespace Host { + +template +bool CheckSameDtype_() { + if (GetOriInputType::Type>() != + GetOriInputType::Type>()) { + return false; + } + return true; +} +template +// Validate Traits +bool DebugCheck() { + using Inputs = typename OpTraits::In::types; + using Outputs = typename OpTraits::Out::types; + if constexpr (templateType == ATVC::TemplateType::ELE_WISE) { + if constexpr (TypeListSize::VALUE == 0) { // 0: input is empty + printf("[ERROR]: [ATVC][OpTraits] Input cannot be empty in Ele-wise template.\n"); + return false; + } + if constexpr (TypeListSize::VALUE == 0) { // 0: output is empty + printf("[ERROR]: [ATVC][OpTraits] Output cannot be empty in Ele-wise template.\n"); + return false; + } + } else if constexpr (templateType == ATVC::TemplateType::REDUCE) { + if constexpr (TypeListSize::VALUE != 1) { // 1: input number must be 1. + printf("[ERROR]: [ATVC][OpTraits] Input number must be 1 in Reduce template.\n"); + return false; + } + if constexpr (TypeListSize::VALUE != 1) { // 1: output number must be 1. + printf("[ERROR]: [ATVC][OpTraits] Output number must be 1 in Reduce template.\n"); + return false; + } + } else if constexpr (templateType == ATVC::TemplateType::BROADCAST) { + if constexpr (TypeListSize::VALUE != 1) { // 1: input number must be 1. + printf("[ERROR]: [ATVC][OpTraits] Input number must be 1 in Broadcast template.\n"); + return false; + } + if constexpr (TypeListSize::VALUE != 1) { // 1: output number must be 1. 
+ printf("[ERROR]: [ATVC][OpTraits] Output number must be 1 in Broadcast template.\n"); + return false; + } + } + + if constexpr (templateType == ATVC::TemplateType::REDUCE || + templateType == ATVC::TemplateType::BROADCAST) { + if (!CheckSameDtype_()) { + printf("[ERROR]: Different input/output data types are not supported in Reduce or Broadcast template.\n"); + return false; + } + } + return true; +} +} // namespace Host +} // namespace ATVC +#endif // ATVC_COMMON_OP_CHECK_H + diff --git a/atvc/include/common/atvc_opdef.h b/atvc/include/common/atvc_opdef.h index bac3dfc7ad99145e44e9f72bc70c6ed8e393b838..83ee2e0fbf0a0896dd53493931207b9527f70ddf 100644 --- a/atvc/include/common/atvc_opdef.h +++ b/atvc/include/common/atvc_opdef.h @@ -13,6 +13,7 @@ #define ATVC_COMMON_OPDEF_H #include "type_list.h" +#include "common/dtype_utils.h" namespace ATVC { enum class ParamType { INPUT, @@ -20,37 +21,48 @@ enum class ParamType { TEMP, }; -template +enum class TemplateType { + ELE_WISE, + REDUCE, + BROADCAST, +}; + +template struct ParamTypes { using types = ATVC::TypeList; static constexpr ParamType usage = paramType_; }; -template +template using OpInputs = ParamTypes; -template +template using OpOutputs = ParamTypes; -template +template using OpTemps = ParamTypes; - -template> +template > struct OpTraits { using In = InTypeList; using Out = OutTypeList; using Temp = TempTypeList; }; +struct VoidComputeTraits { + using In = OpInputs<>; + using Out = OpOutputs<>; + using Temp = OpTemps<>; +}; + template -struct GetFunctionTraits {}; +struct GetFunctionTraits { + using ComputeTraits = VoidComputeTraits; +}; -// 专门化处理函数A的模板实例 -template class TileCompute, typename Traits> +template