From 1c4a0e9f1e050a9172ae4bf4e2140091c2f91b67 Mon Sep 17 00:00:00 2001
From: user_10012209 <734267852@qq.com>
Date: Wed, 8 Nov 2023 15:18:15 +0800
Subject: [PATCH 1/5] =?UTF-8?q?[att\profiler\compare=5Ftools]=E8=B5=84?=
 =?UTF-8?q?=E6=96=99=E6=8F=8F=E8=BF=B0=E6=96=B0=E5=A2=9Eother=E5=92=8CSDMA?=
 =?UTF-8?q?=E5=AD=97=E6=AE=B5=E6=8F=8F=E8=BF=B0?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 profiler/compare_tools/README.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/profiler/compare_tools/README.md b/profiler/compare_tools/README.md
index 8ff48187f0..84366efbf5 100644
--- a/profiler/compare_tools/README.md
+++ b/profiler/compare_tools/README.md
@@ -111,7 +111,7 @@ python performance_compare.py [基准性能数据文件] [比对性能数据文
 总体性能比对结果以打屏的形式呈现。
 #### 算子耗时
 ```
-包含cube算子耗时和vector算子耗时
+包含cube算子耗时和vector算子耗时以及other（AI CPU、DSA等其他非cube vector算子）耗时
 ```
 #### 计算流耗时
 ```
@@ -138,6 +138,13 @@ npu上的内存使用可以使用npu-smi查看
 profiling信息采集时打开profile_memory=True开关，即可从json文件中读出运行稳定后的memory信息
 ```
 #### E2E总耗时
+
+```
+拷贝类任务耗时
+```
+
+#### E2E总耗时
+
 ```
 计算流端到端耗时
 ```
-- 
Gitee


From ee7c011b0f54b6a2a653465d56a43c7c687c394b Mon Sep 17 00:00:00 2001
From: user_10012209 <734267852@qq.com>
Date: Wed, 8 Nov 2023 15:20:34 +0800
Subject: [PATCH 2/5] =?UTF-8?q?[att\profiler\compare=5Ftools]=E8=B5=84?=
 =?UTF-8?q?=E6=96=99=E6=8F=8F=E8=BF=B0=E6=96=B0=E5=A2=9Eother=E5=92=8CSDMA?=
 =?UTF-8?q?=E5=AD=97=E6=AE=B5=E6=8F=8F=E8=BF=B0?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 profiler/compare_tools/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/profiler/compare_tools/README.md b/profiler/compare_tools/README.md
index 84366efbf5..9b3c71e387 100644
--- a/profiler/compare_tools/README.md
+++ b/profiler/compare_tools/README.md
@@ -137,7 +137,7 @@ npu上的内存使用可以使用npu-smi查看
 
 profiling信息采集时打开profile_memory=True开关，即可从json文件中读出运行稳定后的memory信息
 ```
-#### E2E总耗时
+#### SDMA耗时
 
 ```
 拷贝类任务耗时
-- 
Gitee


From b6d062a732c5bf75ac0e7cc75c423a1c68b4f185 Mon Sep 17 00:00:00 2001
From: user_10012209 <734267852@qq.com>
Date: Wed, 8 Nov 2023 16:23:45 +0800
Subject: [PATCH 3/5] =?UTF-8?q?[att\profiler\compare=5Ftools]=E8=B5=84?=
 =?UTF-8?q?=E6=96=99=E6=8F=8F=E8=BF=B0=E6=96=B0=E5=A2=9Eother=E5=92=8CSDMA?=
 =?UTF-8?q?=E5=AD=97=E6=AE=B5=E6=8F=8F=E8=BF=B0?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 profiler/compare_tools/README.md | 54 +++++++++-----------------------
 1 file changed, 15 insertions(+), 39 deletions(-)
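
Reviewer note (not part of the patch): the Free Time row in the table added by this patch is defined by a subtraction, and the text it replaces also defined a scheduling ratio as scheduling time divided by E2E time. A minimal sketch of that arithmetic, assuming all durations share one unit. The function names and sample values are hypothetical and are not taken from the tool:

```python
def free_time(e2e_time, operator_time, uncovered_comm_time):
    """Scheduling (free) time = E2E time - operator time - uncovered communication time."""
    return e2e_time - operator_time - uncovered_comm_time


def free_ratio(e2e_time, operator_time, uncovered_comm_time):
    """Scheduling ratio = scheduling time / E2E time."""
    return free_time(e2e_time, operator_time, uncovered_comm_time) / e2e_time


# Hypothetical sample values in milliseconds.
e2e, ops, comm = 1000.0, 700.0, 100.0
print(free_time(e2e, ops, comm))   # 200.0
print(free_ratio(e2e, ops, comm))  # 0.2
```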

diff --git a/profiler/compare_tools/README.md b/profiler/compare_tools/README.md
index 9b3c71e387..e00951921c 100644
--- a/profiler/compare_tools/README.md
+++ b/profiler/compare_tools/README.md
@@ -108,46 +108,22 @@ python performance_compare.py [基准性能数据文件] [比对性能数据文
 ## 比对结果说明
 ### 总体性能
 
-总体性能比对结果以打屏的形式呈现。
-#### 算子耗时
-```
-包含cube算子耗时和vector算子耗时以及other（AI CPU、DSA等其他非cube vector算子）耗时
-```
-#### 计算流耗时
-```
-计算流所有event耗时总和
-```
-#### 通信
-```
-通信未掩盖耗时
-```
-#### 调度耗时
-```
-调度耗时 = e2e耗时 - 算子耗时 - 通信不可掩盖耗时
-```
-#### 调度占比
-```
-调度占比 = 调度耗时/e2e耗时
-```
-#### 内存
-```
-gpu上的内存使用可以使用nvidia-smi查看
-
-npu上的内存使用可以使用npu-smi查看
-
-profiling信息采集时打开profile_memory=True开关，即可从json文件中读出运行稳定后的memory信息
-```
-#### SDMA耗时
-
-```
-拷贝类任务耗时
-```
-
-#### E2E总耗时
+总体性能比对结果以打屏的形式呈现。详细比对信息如下：
+
+| 字段                            | 说明                                                         |
+| ------------------------------- | ------------------------------------------------------------ |
+| Cube Time(Num)                  | Cube算子总耗时，Num表示计算的次数。                          |
+| Vector Time(Num)                | Vector算子总耗时，Num表示计算的次数。                        |
+| Other Time                      | AI CPU、DSA等其他非cube vector算子耗时。                     |
+| Flash Attention Time(Forward)   | Flash Attention算子前向耗时。                                |
+| Flash Attention Time(Backward)  | Flash Attention算子反向耗时。                                |
+| Computing Time                  | 计算流耗时，计算流所有event耗时总和。                        |
+| Mem Usage                       | 内存使用。gpu上的内存使用可以使用nvidia-smi查看，npu上的内存使用可以使用npu-smi查看，profiling信息采集时打开profile_memory=True开关，即可从json文件中读出运行稳定后的memory信息。 |
+| Uncovered Communication Time    | 通信未掩盖耗时。                                             |
+| SDMA Time(Num)                  | 拷贝类任务耗时，Num表示计算的次数。                          |
+| Free Time                       | 调度耗时 = E2E耗时 - 算子耗时 - 通信不可掩盖耗时。           |
+| E2E Time(Not minimal profiling) | E2E总耗时，计算流端到端耗时。当存在Not minimal profiling时，表示该时间存在性能膨胀，会影响通信和调度耗时。 |
 
-```
-计算流端到端耗时
-```
 ### 算子性能
 
 算子性能比对结果在performance_comparison_result_*.xlsl中OperatorCompare的sheet页呈现。
-- 
Gitee


From 2fbe65260732e167ebfa26d0dceeca2852091994 Mon Sep 17 00:00:00 2001
From: user_10012209 <734267852@qq.com>
Date: Thu, 9 Nov 2023 10:20:43 +0800
Subject: [PATCH 4/5] =?UTF-8?q?[ptdbg=5Fascend]=E5=B7=A5=E5=85=B7=E4=BD=BF?=
 =?UTF-8?q?=E7=94=A8=E6=89=8B=E5=86=8C=E8=A1=A5=E5=85=85overflow=5Fnums?=
 =?UTF-8?q?=E5=8F=82=E6=95=B0=E9=85=8D=E7=BD=AE-1=E7=9A=84=E8=AF=B4?=
 =?UTF-8?q?=E6=98=8E?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 profiler/compare_tools/README.md | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)
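
Reviewer note (not part of the patch): the Computing Time row in this patch states that overlapping parts of concurrent computations are counted only once. One way to obtain such a duration is to take the union of the event intervals. A minimal sketch, assuming each event is a hypothetical (start, end) pair in one time unit; it illustrates the idea only and is not the tool's actual implementation:

```python
def union_duration(events):
    """Total time covered by (start, end) intervals, counting overlaps once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(events):
        if cur_end is None or start > cur_end:
            # Gap before this event: close the previous merged interval.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps (or touches) the current merged interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Two concurrent events overlap on [5, 10]; the third is separate.
events = [(0, 10), (5, 15), (20, 25)]
print(union_duration(events))  # 20.0
```

A plain sum of the three durations would give 25, so the overlap on [5, 10] is what the union removes.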

diff --git a/profiler/compare_tools/README.md b/profiler/compare_tools/README.md
index e00951921c..76b72ffcbe 100644
--- a/profiler/compare_tools/README.md
+++ b/profiler/compare_tools/README.md
@@ -117,13 +117,28 @@ python performance_compare.py [基准性能数据文件] [比对性能数据文
 | Other Time                      | AI CPU、DSA等其他非cube vector算子耗时。                     |
 | Flash Attention Time(Forward)   | Flash Attention算子前向耗时。                                |
 | Flash Attention Time(Backward)  | Flash Attention算子反向耗时。                                |
-| Computing Time                  | 计算流耗时，计算流所有event耗时总和。                        |
-| Mem Usage                       | 内存使用。gpu上的内存使用可以使用nvidia-smi查看，npu上的内存使用可以使用npu-smi查看，profiling信息采集时打开profile_memory=True开关，即可从json文件中读出运行稳定后的memory信息。 |
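+| Computing Time                  | Computing stream time: the total duration of all events on the computing streams. When multiple computations run concurrently, overlapping parts are counted only once. |
+| Mem Usage                       | Memory usage. Use nvidia-smi to check memory usage on a GPU and npu-smi to check it on an NPU. When profile_memory=True is set during profiling, Mem Usage shows the maximum reserved value in memory_record, which is generally process-level memory. |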
 | Uncovered Communication Time    | 通信未掩盖耗时。                                             |
 | SDMA Time(Num)                  | 拷贝类任务耗时，Num表示计算的次数。                          |
-| Free Time                       | 调度耗时 = E2E耗时 - 算子耗时 - 通信不可掩盖耗时。           |
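+| Free Time                       | Scheduling time = E2E time - operator time - uncovered communication time. Free time is defined as the time during which the device is neither communicating nor computing, so it includes copy time (SDMA Time). |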
 | E2E Time(Not minimal profiling) | E2E总耗时，计算流端到端耗时。当存在Not minimal profiling时，表示该时间存在性能膨胀，会影响通信和调度耗时。 |
 
+To reduce the inflation of the E2E time, collect performance data with minimal profiling. Sample code:
+
+```python
+with torch_npu.profiler.profile(
+        activities=[torch_npu.profiler.ProfilerActivity.NPU],
+        schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=1, repeat=1, skip_first=10),
+        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
+) as prof:
+        for step in range(steps):
+            train_one_step()
+            prof.step()
+```
+
+Set activities to collect NPU data only, and do not configure experimental_config or any other optional switches.
+
 ### 算子性能
 
 算子性能比对结果在performance_comparison_result_*.xlsl中OperatorCompare的sheet页呈现。
-- 
Gitee


From 63f3099dc68046b9e123b9d7ea20bb700976c49e Mon Sep 17 00:00:00 2001
From: user_10012209 <734267852@qq.com>
Date: Fri, 10 Nov 2023 16:23:48 +0800
Subject: [PATCH 5/5] =?UTF-8?q?[att\profiler]=E8=B5=84=E6=96=99=E8=A1=A5?=
 =?UTF-8?q?=E5=85=85PyThon=E5=85=BC=E5=AE=B9=E7=89=88=E6=9C=AC=E8=AF=B4?=
 =?UTF-8?q?=E6=98=8E=E4=BB=A5=E5=8F=8Acompare=E5=B7=A5=E5=85=B7=E8=A1=A5?=
 =?UTF-8?q?=E5=85=85=E5=B7=A5=E5=85=B7=E4=BE=9D=E8=B5=96?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 profiler/README.md               |  4 +++-
 profiler/compare_tools/README.md | 11 +++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/profiler/README.md b/profiler/README.md
index 8a1ce3b5c9..a406d47b5e 100644
--- a/profiler/README.md
+++ b/profiler/README.md
@@ -5,7 +5,9 @@
 ATT仓针对训练&大模型场景，提供端到端调优工具：用户采集到profiling后，由att仓提供统计、分析以及相关的调优建议。
 
 ### profiling采集
-目前att仓工具主要支持ascend pytorch profiler采集工具，可参考https://gitee.com/ascend/att/wikis/%E6%A1%88%E4%BE%8B%E5%88%86%E4%BA%AB/%E6%80%A7%E8%83%BD%E6%A1%88%E4%BE%8B/Ascend%20PyTorch%20Profiler%E6%80%A7%E8%83%BD%E8%B0%83%E4%BC%98%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D
+The ATT repository tools currently mainly support data collected with Ascend PyTorch Profiler. For details, see [Ascend PyTorch Profiler性能调优工具介绍](https://gitee.com/ascend/att/wikis/%E6%A1%88%E4%BE%8B%E5%88%86%E4%BA%AB/%E6%80%A7%E8%83%BD%E6%A1%88%E4%BE%8B/Ascend%20PyTorch%20Profiler%E6%80%A7%E8%83%BD%E8%B0%83%E4%BC%98%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D).
+
+Ascend PyTorch Profiler supports AscendPyTorch 5.0.RC2 or later. For the supported Python and CANN software version mapping, see "[安装PyTorch](https://www.hiascend.com/document/detail/zh/canncommercial/63RC2/envdeployment/instg/instg_000041.html)" in the CANN软件安装指南 (CANN software installation guide).
 
 ### 子功能介绍
 | 工具名称     | 说明                                                         |
diff --git a/profiler/compare_tools/README.md b/profiler/compare_tools/README.md
index 76b72ffcbe..df0622da64 100644
--- a/profiler/compare_tools/README.md
+++ b/profiler/compare_tools/README.md
@@ -8,8 +8,16 @@
 
 场景二：PyTorch训练工程在NPU上，不同版本之间存在性能差距，通过工具定位具体差异。
 
-
 ## 使用指导
+
+### Environment dependencies
+
+Install openpyxl before using this tool:
+
+```bash
+pip3 install openpyxl
+```
+
 ### 性能数据采集
 
 #### GPU性能数据采集
@@ -50,7 +58,6 @@ pytorch profiler数据目录结构如下：
 
 #### NPU性能数据采集
 通过Ascend PyTorch Profiler工具采集NPU的性能数据，采集参数配置跟GPU一致，参考链接：https://www.hiascend.com/document/detail/zh/canncommercial/63RC2/modeldevpt/ptmigr/ptmigr_0066.html
-将GPU的性能数据采集代码中torch.profiler替换成torch_npu.profiler
 
 ascend pytorch profiler数据目录结构如下：
 
-- 
Gitee