diff --git a/profiler/compare_tools/README.md b/profiler/compare_tools/README.md
index 6b1da16bfa910c71a4adf7c1cd47c1b5bc0f8dcf..e307491f36686687d7a842f990193f58cff955ac 100644
--- a/profiler/compare_tools/README.md
+++ b/profiler/compare_tools/README.md
@@ -1,17 +1,57 @@
 # 性能比对工具
 
-## 介绍
-性能比对工具支持比较GPU与NPU之间、NPU与NPU之间的性能差异，帮助用户更快的定位性能瓶颈。比对的结果主要分为4个维度展示：总体性能、算子性能、通信性能、算子内存。
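+## 1. Introduction
+The performance comparison tool compares performance between GPU and NPU runs, or between two NPU runs. By comparing training time and memory usage, it pinpoints the specific operators that have regressed and makes performance tuning more efficient. The tool breaks training time down into three dimensions (operator, communication, and scheduling) and compares operators and communication at operator granularity. It also breaks the total training memory down into per-operator memory usage for comparison.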
 
-## 使用方式
-### <font color=green>最简执行命令</font>
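+## 2. Use Cases
+Scenario 1: A PyTorch training project regresses in performance after being migrated from GPU to NPU, and the tool is used to identify where the regression occurs.
+
+Scenario 2: A PyTorch training project on NPU shows a performance gap between versions, and the tool is used to locate the specific differences.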
+
+
+## 3. Usage Guide
+### <font color=green>Collecting Performance Data</font>
+#### Collecting GPU Performance Data
+Use PyTorch Profiler to collect GPU performance data. Reference:
+https://pytorch.org/docs/stable/profiler.html
+
+Sample collection code 1:
 ```
-python performance_compare.py [基准性能数据的文件路径] [比较性能数据的文件路径]
+with torch.profiler.profile(
+  profile_memory=True,  # enable memory data collection
+  record_shapes=True,   # enable collection of operator input shapes
+  schedule=torch.profiler.schedule(wait=10, warmup=0, active=1, repeat=1),
+  on_trace_ready=torch.profiler.tensorboard_trace_handler("./result_dir")
+) as prof:
+    for step in range(step_number):
+        train_one_step()
+        prof.step()
+```
+Sample collection code 2:
+```
+prof = torch.profiler.profile(
+  profile_memory=True,  # enable memory data collection
+  record_shapes=True,   # enable collection of operator input shapes
+  on_trace_ready=torch.profiler.tensorboard_trace_handler("./result_dir"))
+for step in range(step_number):
+    if step == 11:
+        prof.start()
+    train_one_step()
+    if step == 11:
+        prof.stop()
 ```
-#### 文件路径说明：
-GPU的性能数据文件路径：指定到以".pt.trace"结尾的json文件
 
-NPU的性能数据文件路径：可以指定以"_ascend_pt"结尾的文件，也可以指定到ASCEND_PROFILER_OUTPUT目录，也可以指定到trace_view.json，当指定到trace_view.json时不支持比对算子内存占用。
+The PyTorch Profiler data directory structure is as follows:
+
+```
+|- pytorch_profiling
+    |- **.pt.trace.json
+```
+
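+#### Collecting NPU Performance Data
+Use Ascend PyTorch Profiler to collect NPU performance data. The collection parameters are the same as for GPU. Reference:
+https://www.hiascend.com/document/detail/zh/canncommercial/63RC2/modeldevpt/ptmigr/ptmigr_0066.html
+In the GPU collection code, replace torch.profiler with torch_npu.profiler.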
 
 ascend pytorch profiler数据目录结构如下：
 
@@ -25,32 +65,48 @@ ascend pytorch profiler数据目录结构如下：
     |- **_ascend_pt
 ```
 
-#### 通用参数说明：
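+### <font color=green>Comparing Performance Data</font>
+#### Minimal Command
+Go to the directory where the att repository was downloaded, run cd att/profiler/compare_tools, and then execute the following command: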
+```
+python performance_compare.py [path to the baseline performance data] [path to the comparison performance data]
 ```
---disable_profiling_compare：设置该参数代表不进行总体性能比较
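+The tool breaks overall performance down into training time and memory usage, and further splits training time into three dimensions: operator, communication, and scheduling. The overall metrics are printed to the screen to help users determine the direction of the regression. The tool also generates performance_comparison_result_**.xlsx, which details how each operator compares in execution time, communication time, and memory usage. Filter for DIFF values greater than 0 to find the regressed operators.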
+
+#### File Path Description
+GPU performance data path: the JSON file whose name ends with ".pt.trace.json"
 
---disable_operator_compare：设置该参数代表不进行算子性能比较
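+NPU performance data path: several forms are supported: (1) a directory whose name ends with "_ascend_pt"; (2) the ASCEND_PROFILER_OUTPUT directory; (3) trace_view.json, in which case per-operator memory usage cannot be shown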
 
---disable_memory_compare：设置该参数代表不进行算子内存比较
+#### Common Parameters
+```
+--enable_profiling_compare: enables overall performance comparison. Example: --enable_profiling_compare
 
---disable_communication_compare：设置该参数代表不进行通信性能比较   
+--enable_operator_compare: enables operator performance comparison. Example: --enable_operator_compare
 
---output_path：性能比对结果存放的路径
+--enable_communication_compare: enables communication performance comparison. Example: --enable_communication_compare
+
+--enable_memory_compare: enables operator memory comparison. Example: --enable_memory_compare
+```
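+Note: if none of the four switches above is set, <font color=red>the tool enables all performance comparisons by default</font>. If any of them is set, only the comparisons selected by the user are performed.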
+```
+--output_path: directory where the comparison result is stored. Example: --output_path=./result_dir
 ```
 
-#### 算子性能比对特有参数说明：
+#### Parameters Specific to Operator Performance Comparison
 ```
---gpu_flow_cat：GPU trace中cpu侧算子与device kernel的连线标识，默认是async_gpu
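+--gpu_flow_cat: sets the flow-event category that links CPU-side operators to device kernels in the GPU trace. Set it when no kernels are found for any GPU operator; when it is not set, the tool tries async_gpu, async_cpu_to_gpu, and ac2g. Example: --gpu_flow_cat=async_gpu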
 
---use_input_shape：设置该参数代表算子精准匹配
+--use_input_shape: enables exact operator matching, using the input shape as matching information. Example: --use_input_shape
 
---max_kernel_num：该参数设置cpu侧算子下发执行的最大kernel数量，当超过设定值时工具会自动找下层的子算子，直至满足条件
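+--max_kernel_num: sets the maximum number of kernels launched by a single CPU-side operator. When the limit is exceeded, the tool automatically descends to child operators until the condition is met. Example: --max_kernel_num=10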
 
---op_name_map：该参数存放GPU与NPU等价的算子名称的映射关系，以字典形式存入
+--op_name_map: sets the mapping between equivalent GPU and NPU operator names, passed as a dictionary. Example: --op_name_map='{"Optimizer.step#SGD.step":"Optimizer.step#NpuFusedSGD.step"}'
 ```
 
-## 比对内容
+## 4. Comparison Results
 ### <font color=green>总体性能</font>
+Overall performance comparison results are printed to the screen.
 #### 算子耗时
 ```
 包含cube算子耗时和vector算子耗时
@@ -79,24 +135,47 @@ npu上的内存使用可以使用npu-smi查看
 
 profiling信息采集时打开profile_memory=True开关，即可从json文件中读出运行稳定后的memory信息
 ```
-#### 计算流e2e耗时
+#### Total E2E Time
 ```
 计算流端到端耗时
 ```
 ### <font color=green>算子性能</font>
-DIFF以device耗时为比对指标
-#### device耗时
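+Operator performance comparison results are presented in the OperatorCompare sheet of performance_comparison_result_**.xlsx.
+
+Rows with a light blue background: operator summary, including the operator name, Input Shape, Input Type, and total time on the device (unit: us)
+
+Rows with no background color: operator details, listing every kernel the operator launches on the device, including the kernel name, kernel information (NPU only), and time on the device (unit: us)
+
+DIFF column = (total device time of the comparison operator - total device time of the baseline operator) / total device time of the baseline operator
+
+DIFF Filter column: red indicates a regression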
+
+#### Device Duration(us)
 ```
-该算子下发到device上执行的所有kernel耗时的加总
+The sum of the durations of all kernels that the operator launches on the device
 ```
 ### <font color=green>通信性能</font>
-DIFF以同一个类型的通信算子（如：allreduce）的总耗时为比对指标
+Communication performance comparison results are presented in the CommunicationCompare sheet of performance_comparison_result_**.xlsx.
 
-通信性能比对结果以通信算子的类型为粒度，展示该类型通信算子调用的总次数、平均耗时、总耗时、耗时最大值、耗时最小值。
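+Rows with a light blue background: communication operator summary, including the communication operator name, total number of calls, total time (unit: us), average time (unit: us), maximum time (unit: us), and minimum time (unit: us)
+
+Rows with no background color: communication operator details (NPU only), listing every Task under the communication operator, including the Task name, number of calls, total time (unit: us), average time (unit: us), maximum time (unit: us), and minimum time (unit: us)
+
+DIFF column = (total time of the comparison communication operator - total time of the baseline communication operator) / total time of the baseline communication operator
+
+DIFF Filter column: red indicates a regression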
 
-NPU会下钻展示该类型通信算子下，不同通信小算子（如：Notify_Wait）的耗时占比，调用的总次数、平均耗时、总耗时、耗时最大值、耗时最小值。
 ### <font color=green>算子内存</font>
-DIFF以内存占用的大小为比对指标
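+Operator memory comparison results are presented in the MemoryCompare sheet of performance_comparison_result_**.xlsx.
+
+Rows with a light blue background: operator summary, including the operator name, Input Shape, Input Type, and total memory used by the operator (unit: KB)
+
+Rows with no background color: operator details, listing the memory usage of every operator it launches on the device, including the operator name, memory hold duration (unit: us), and memory size (unit: KB)
+
+DIFF column = (total memory used by the comparison operator - total memory used by the baseline operator) / total memory used by the baseline operator
+
+DIFF Filter column: red indicates a regression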
+
 #### 内存占用大小
 ```
 该算子占用的device内存大小，单位KB
diff --git a/profiler/compare_tools/comparator/op_comparator.py b/profiler/compare_tools/comparator/op_comparator.py
index 89bfc1a6921e2452e35fc2f491ad59e6ffdefea6..cb4b5bfa899ead63b995bacec5a85fa7d52575a3 100644
--- a/profiler/compare_tools/comparator/op_comparator.py
+++ b/profiler/compare_tools/comparator/op_comparator.py
@@ -90,11 +90,11 @@ class OpComparator:
         root_node = TreeBuilder.build_tree(torch_op_data)
 
         kernel_dict, memory_list = {}, []
-        if not self._args.disable_operator_compare:
+        if self._args.enable_operator_compare:
             kernel_dict = profiling_instance.kernel_dict
             if not kernel_dict:
                 print(f"[WARNING] Can't find any flow event in the file: {profiling_instance.json_path}")
-        if not self._args.disable_memory_compare:
+        if self._args.enable_memory_compare:
             memory_list = profiling_instance.memory_list
             if not memory_list:
                 print(f"[WARNING] Can't find any memory event in the file: {profiling_instance.file_path}")
diff --git a/profiler/compare_tools/generation/comparison_generator.py b/profiler/compare_tools/generation/comparison_generator.py
index f415262cd239cc282603520f5caaaf3c4819e2bd..fa3f9f8416ef380820321e7a477e4d8452496f1f 100644
--- a/profiler/compare_tools/generation/comparison_generator.py
+++ b/profiler/compare_tools/generation/comparison_generator.py
@@ -17,15 +17,15 @@ class ComparisonGenerator:
 
     def create_excel(self, file_path: str):
         wb = Workbook()
-        if not self._args.disable_operator_compare or not self._args.disable_memory_compare:
+        if self._args.enable_operator_compare or self._args.enable_memory_compare:
             op_compare_result = OpComparator(self._args).compare()
             if op_compare_result:
-                if not self._args.disable_operator_compare:
+                if self._args.enable_operator_compare:
                     OpComparisonGenerator(self._args, op_compare_result, Constant.OPERATOR_COMPARE).create_sheet(wb)
-                if not self._args.disable_memory_compare:
+                if self._args.enable_memory_compare:
                     OpComparisonGenerator(self._args, op_compare_result, Constant.MEMORY_COMPARE).create_sheet(wb)
 
-        if not self._args.disable_communication_compare:
+        if self._args.enable_communication_compare:
             index_compare_result = IndexComparator(self._args).compare()
             if not index_compare_result.empty:
                 CommunicationComparisonGenerator(self._args, index_compare_result).create_sheet(wb)
diff --git a/profiler/compare_tools/performance_compare.py b/profiler/compare_tools/performance_compare.py
index 885f9b44b75e02c305a027a6c201494191ea85ac..dec22a4ec28b5dfa32594e0ad9c04050b5ddea0b 100644
--- a/profiler/compare_tools/performance_compare.py
+++ b/profiler/compare_tools/performance_compare.py
@@ -12,7 +12,7 @@ from utils.constant import Constant
 
 
 def performance_compare(args):
-    if args.disable_profiling_compare:
+    if not args.enable_profiling_compare:
         return
     npu_path = ''
     gpu_path = ''
@@ -30,21 +30,17 @@ def performance_compare(args):
 def main():
     sys.path.append(os.path.dirname(__file__))
     parser = argparse.ArgumentParser(description="Compare trace of GPU and NPU")
-    parser.add_argument("base_profiling_path", type=str, default='', help="base profiling file path")
-    parser.add_argument("comparison_profiling_path", type=str, default='', help="comparison profiling file path")
-    parser.add_argument("--disable_profiling_compare", default=False, action='store_true',
-                        help="不进行GPU与NPU的性能拆解")
-    parser.add_argument("--disable_operator_compare", default=False, action='store_true',
-                        help="do not compare operator execution time")
-    parser.add_argument("--disable_memory_compare", default=False, action='store_true',
-                        help="do not compare memory usage by operator dimensions")
-    parser.add_argument("--disable_communication_compare", default=False, action='store_true',
-                        help="do not compare communication operator execution time")
+    parser.add_argument("base_profiling_path", type=str, default='', help="path to the baseline performance data")
+    parser.add_argument("comparison_profiling_path", type=str, default='', help="path to the comparison performance data")
+    parser.add_argument("--enable_profiling_compare", default=False, action='store_true', help="enable overall performance comparison")
+    parser.add_argument("--enable_operator_compare", default=False, action='store_true', help="enable operator performance comparison")
+    parser.add_argument("--enable_memory_compare", default=False, action='store_true', help="enable operator memory comparison")
+    parser.add_argument("--enable_communication_compare", default=False, action='store_true', help="enable communication performance comparison")
     parser.add_argument("--output_path", type=str, default='', help="性能数据比对结果的存放路径")
     parser.add_argument("--max_kernel_num", type=int, help="每个torch op的kernel数量限制")
     parser.add_argument("--op_name_map", type=ast.literal_eval, default={},
-                        help="配置GPU OP与NPU OP等价的名称映射关系，以字典的形式传入")
-    parser.add_argument("--use_input_shape", default=False, action='store_true', help="使用input shape作为匹配信息")
+                        help="mapping between equivalent GPU and NPU operator names, passed as a dictionary")
+    parser.add_argument("--use_input_shape", default=False, action='store_true', help="enable exact operator matching")
     parser.add_argument("--gpu_flow_cat", type=str, default='', help="gpu flow event的分类标识")
     args = parser.parse_args()
 
@@ -54,14 +50,15 @@ def main():
     except Exception:
         print("[WARNING] Profiling failed to analyze.")
 
-    print("[INFO] Start to compare performance data, please wait.")
-    dir_path = args.output_path if args.output_path else "./"
-    file_name = "performance_comparison_result_{}.xlsx".format(
-        time.strftime("%Y%m%d%H%M%S", time.localtime(time.time())))
-    result_file_path = os.path.realpath(os.path.join(dir_path, file_name))
+    if any([args.enable_operator_compare, args.enable_memory_compare, args.enable_communication_compare]):
+        print("[INFO] Start to compare performance data, please wait.")
+        dir_path = args.output_path if args.output_path else "./"
+        file_name = "performance_comparison_result_{}.xlsx".format(
+            time.strftime("%Y%m%d%H%M%S", time.localtime(time.time())))
+        result_file_path = os.path.realpath(os.path.join(dir_path, file_name))
 
-    ComparisonGenerator(args).create_excel(result_file_path)
-    print(f"[INFO] The comparison result file has been generated: {result_file_path}")
+        ComparisonGenerator(args).create_excel(result_file_path)
+        print(f"[INFO] The comparison result file has been generated: {result_file_path}")
 
 
 if __name__ == "__main__":
diff --git a/profiler/compare_tools/utils/args_manager.py b/profiler/compare_tools/utils/args_manager.py
index eba55d72e362123011048fbcaa5cdc6977c176be..88c57b2f9e437a681243c9c33a7772cf6a4a6c23 100644
--- a/profiler/compare_tools/utils/args_manager.py
+++ b/profiler/compare_tools/utils/args_manager.py
@@ -112,6 +112,12 @@ class ArgsManager:
             msg = f"Invalid param, --gpu_flow_cat exceeded the maximum value {Constant.MAX_FLOW_CAT_LEN}"
             raise RuntimeError(msg)
 
+        if not any([self._args.enable_profiling_compare, self._args.enable_operator_compare,
+                    self._args.enable_memory_compare, self._args.enable_communication_compare]):
+            self._args.enable_profiling_compare = True
+            self._args.enable_operator_compare = True
+            self._args.enable_memory_compare = True
+            self._args.enable_communication_compare = True
         base_profiling_dict = self.parse_profiling_path(self._args.base_profiling_path)
         comparison_profiling_dict = self.parse_profiling_path(self._args.comparison_profiling_path)
 
diff --git a/profiler/compare_tools/utils/compare_event.py b/profiler/compare_tools/utils/compare_event.py
index a994d8d6fc511292a349da60e529647f65434d8e..d80620a556fb276cde78b95a52f6af116b3866da 100644
--- a/profiler/compare_tools/utils/compare_event.py
+++ b/profiler/compare_tools/utils/compare_event.py
@@ -42,9 +42,12 @@ class MemoryEvent:
         return self._event.get(Constant.SIZE, 0)
 
     def get_record(self) -> list:
-        if self._event.get(Constant.RELEASE_TIME):
-            duration = float(self._event.get(Constant.RELEASE_TIME)) - self._event.get(Constant.ALLOCATION_TIME, 0)
-        else:
+        if not self._event.get(Constant.ALLOCATION_TIME):
+            duration = Constant.NA
+        elif not self._event.get(Constant.RELEASE_TIME):
             duration = Constant.NA
+        else:
+            duration = float(self._event.get(Constant.RELEASE_TIME)) - float(self._event.get(Constant.ALLOCATION_TIME))
+
         name = self._event.get(Constant.NAME, "") if self._event.get(Constant.NAME, "") else self._name
         return [name, duration, self._event.get(Constant.SIZE, 0)]
diff --git a/profiler/compare_tools/utils/constant.py b/profiler/compare_tools/utils/constant.py
index 8c4d4a76810b708f276270c349f252ed57a7d01c..360c2ab44ae8f56c1708bb2c8213c357445ffcb4 100644
--- a/profiler/compare_tools/utils/constant.py
+++ b/profiler/compare_tools/utils/constant.py
@@ -66,8 +66,8 @@ class Constant(object):
     OPERATOR_COMPARE = "OperatorCompare"
     MEMORY_COMPARE = "MemoryCompare"
 
-    DEFAULT_WIDTH = 25
-    COLUMN_WIDTH = {OP_NAME: 45, INPUT_SHAPE + " / " + MEMORY_OP_NAME: 30, INPUT_SHAPE + " / " + KERNEL_NAME: 30}
+    DEFAULT_WIDTH = 20
+    COLUMN_WIDTH = {OP_NAME: 30, INPUT_SHAPE + " / " + MEMORY_OP_NAME: 30, INPUT_SHAPE + " / " + KERNEL_NAME: 30}
 
     # communication
     COMMUNICAT_OP = "Communication OP Name"
@@ -91,5 +91,5 @@ class Constant(object):
     CMP_COMMUNICATION_HEADER = [COMMUNICAT_OP, TASK_NAME, CALLS, TOTAL_DURATION, AVG_DURATION, MAX_DURATION,
                                 MIN_DURATION]
     COLUMNS = [COMMUNICAT_OP, CALLS, TOTAL_DURATION, AVG_DURATION, MAX_DURATION, MIN_DURATION]
-    COLUMN_WIDTH_CLL = {COMMUNICAT_OP: 25, TASK_NAME: 22, CALLS: 10, TOTAL_DURATION: 20, AVG_DURATION: 20,
-                        MAX_DURATION: 20, MIN_DURATION: 20, DIFF: 20}
+    COLUMN_WIDTH_CLL = {COMMUNICAT_OP: 22, TASK_NAME: 22, CALLS: 10, TOTAL_DURATION: 16, AVG_DURATION: 16,
+                        MAX_DURATION: 16, MIN_DURATION: 16, DIFF: 16}
diff --git a/profiler/compare_tools/utils/profiling_parser.py b/profiler/compare_tools/utils/profiling_parser.py
index 8a94cb695df271c4f2d9493d948d77156a5262a0..aefcbade39ead9745741e4a95870bb1aa4da2709 100644
--- a/profiler/compare_tools/utils/profiling_parser.py
+++ b/profiler/compare_tools/utils/profiling_parser.py
@@ -1,7 +1,7 @@
 from abc import ABCMeta, abstractmethod
 from math import ceil
 
-from utils.compare_event import KernelEvent
+from utils.compare_event import KernelEvent, MemoryEvent
 from utils.constant import Constant
 from utils.file_reader import FileReader
 from utils.trace_event_data import TraceEventData
@@ -53,7 +53,7 @@ class GPUProfilingParser(ProfilingParser):
         return self._kernel_dict
 
     @property
-    def memory_list(self) -> dict:
+    def memory_list(self) -> list:
         if self._memory_list is None:
             self.get_memory_list()
         return self._memory_list
@@ -83,13 +83,13 @@ class GPUProfilingParser(ProfilingParser):
         flow_kernel_dict = {}
         json_data = FileReader.read_trace_file(self._json_path)
         total_events = json_data.get("traceEvents", [])
-        flow_cat = self._args.gpu_flow_cat if self._args.gpu_flow_cat else "async_gpu"
+        flow_cat = (self._args.gpu_flow_cat,) if self._args.gpu_flow_cat else ("async_gpu", "async_cpu_to_gpu", "ac2g")
 
         flow_start_dict, flow_end_dict, kernel_dict = {}, {}, {}
         for event in total_events:
-            if event.get("cat") == flow_cat and event.get("ph") == "s":
+            if event.get("cat") in flow_cat and event.get("ph") == "s":
                 flow_start_dict[event.get("id")] = event
-            elif event.get("cat") == flow_cat and event.get("ph") == "f":
+            elif event.get("cat") in flow_cat and event.get("ph") == "f":
                 flow_end_dict[event.get("id")] = event
             elif event.get("cat", "").capitalize() == "Kernel".capitalize():
                 kernel_dict["{}-{}-{}".format(event.get("pid"), event.get("tid"), event.get("ts"))] = event
@@ -175,7 +175,7 @@ class NPUProfilingParser(ProfilingParser):
         return self._kernel_dict
 
     @property
-    def memory_list(self) -> dict:
+    def memory_list(self) -> list:
         if self._memory_list is None:
             self.get_memory_list()
         return self._memory_list
@@ -239,6 +239,8 @@ class NPUProfilingParser(ProfilingParser):
             return
         memory_data = FileReader.read_csv_file(self._memory_data_path)
         for data in memory_data:
+            if not data.get(Constant.ALLOCATION_TIME, 0):
+                continue
             if "cann::" in data.get("Name", ""):
                 ts_time = float(data.get(Constant.ALLOCATION_TIME, 0))
                 match_dequeue_data = self._match_cann_memory_data(dequeue_data, ts_time)