diff --git a/profiler/cluster_analyse/README.md b/profiler/cluster_analyse/README.md
deleted file mode 100644
index deaebb6cde565d2c7f43c41fde252326c7d06ef5..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/README.md
+++ /dev/null
@@ -1,180 +0,0 @@
-# Cluster Analysis Tool
-cluster_analyse (the cluster analysis tool) analyzes profiling data collected in cluster scenarios. It currently focuses on communication-domain-based analysis of intra-iteration time, communication time, and the communication matrix, in order to locate slow cards, slow nodes, and slow links.
-
-## Performance Data Collection
-The cluster tuning tool currently supports cluster data collected with the Ascend PyTorch Profiler. For the collection procedure, see [Profiling data collection](https://gitee.com/ascend/mstt/tree/master/profiler); this tool only requires NPU performance data collected with the Ascend PyTorch Profiler.
-
-The data must be collected at level L1 or higher.
-```python
-experimental_config = torch_npu.profiler._ExperimentalConfig(
- profiler_level=torch_npu.profiler.ProfilerLevel.Level1
-)
-```
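-
-For reference, a minimal collection sketch showing where this config is passed (the training loop, dataloader, and output path are placeholders, not part of this tool):
-
-```python
-with torch_npu.profiler.profile(
-    activities=[torch_npu.profiler.ProfilerActivity.NPU],
-    experimental_config=experimental_config,  # config from the snippet above
-    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./profiling_data")
-) as prof:
-    for batch in dataloader:   # placeholder training loop
-        train_one_step(batch)  # placeholder train step
-        prof.step()            # mark step boundaries
-```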
-### Checking that the data is usable
-
-Open the collected data of any card (a folder whose name ends with ascend_pt); usable data contains:
-
-- ./profiler_info_x.json,
-- ./ASCEND_PROFILER_OUTPUT/step_trace_time.csv,
-- ./ASCEND_PROFILER_OUTPUT/trace_view.json,
-- ./ASCEND_PROFILER_OUTPUT/kernel_details.csv,
-- ./ASCEND_PROFILER_OUTPUT/communication.json,
-- ./ASCEND_PROFILER_OUTPUT/communication_matrix.json
-
-or contains:
-
-- analysis.db
-- ascend_pytorch_profiler_{rank_id}.db
-
-Only one of the two sets may be present (either the csv/json files or the db files); otherwise the cluster analysis tool fails to parse the data.
-
-After confirming that these files were generated, proceed with the cluster analysis below; a quick check is sketched next.
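-
-A minimal sketch (not part of the tool; the folder path is a placeholder) that checks one rank folder for the text-format deliverables:
-
-```python
-import glob
-import os
-
-# Placeholder: one rank's collection folder (name ends with ascend_pt).
-rank_dir = "./worker0_ascend_pt"
-
-required = [os.path.join("ASCEND_PROFILER_OUTPUT", name) for name in (
-    "step_trace_time.csv", "trace_view.json", "kernel_details.csv",
-    "communication.json", "communication_matrix.json",
-)]
-missing = [f for f in required if not os.path.exists(os.path.join(rank_dir, f))]
-if not glob.glob(os.path.join(rank_dir, "profiler_info_*.json")):
-    missing.append("profiler_info_x.json")
-print("missing:", missing or "none")
-```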
-
-## Data Aggregation and Parsing
-
-### Procedure
-
-1. Install the tool as described in [Performance Tools](../README.md). Installing the latest version is recommended.
-
-2. Copy and gather the data of all cards into one directory, then run one of the following commands in that directory to generate the cluster_analysis_output folder.
-
- ```bash
- msprof-analyze cluster -d {cluster profiling data path} -m {mode}
- ```
-
-    or
-
- ```bash
- python3 cluster_analysis.py -d {cluster profiling data path} -m {mode}
- ```
-
-    Parameters:
-
-    | Parameter | Description | Required |
-    | --------------------- | ------------------------------------------------------------ | -------- |
-    | --collection_path or -d | Directory into which the performance data was aggregated. After the analysis script runs, a cluster_analysis_output folder is created under this directory to hold the analysis results. | Yes |
-    | --mode or -m | Data parsing mode; see the "--mode options" table below. | No |
-    | --parallel_mode | Concurrency mode used when collecting db data from multiple cards or nodes. The only value is concurrent (concurrency implemented with a concurrent.futures process pool).<br>**Only configurable when -m is cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum.** | No |
-    | --export_type | Output format. Values are db (a .db file) and notebook (a Jupyter Notebook file); the default is db.<br>**Only configurable when -m is cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum.** | No |
-    | --rank_list | Restrict statistics to specific ranks. The default is all (statistics over all ranks). Configure integers >= 0 matching the actual rank IDs of the cards; only valid rank IDs are parsed. For example, if the environment has rank IDs 0 to 7 but training ran on cards 0 to 3, configuring 0, 3, 4 or a nonexistent value such as 10 parses only ranks 0 and 3. Example: --rank_list 0, 1, 2.<br>**Only configurable when -m is cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum.** | No |
-    | --top_num | Number of top-N most time-consuming communication operators; the default is 15. Example: --top_num 20.<br>**Only configurable when -m is hccl_sum.** | No |
-
-    --mode options:
-
-    | Option | Description | Required |
-    | -------------------- | ------------------------------------------------------------ | -------- |
-    | communication_matrix | Parses communication matrix data. | No |
-    | communication_time | Parses communication time data. | No |
-    | all | Parses both the communication matrix (communication_matrix) and communication time (communication_time); this is the default value of --mode. | No |
-    | cann_api_sum | Cluster-wide CANN API performance summary. The input performance data must be based on ascend_pytorch_profiler_{rank_id}.db files. When --export_type is db, the deliverable is cluster_analysis.db; when --export_type is notebook, the deliverable stats.ipynb is written to cluster_analysis_output/CannApiSum. | No |
-    | compute_op_sum | Cluster-wide summary of device compute operators. The input performance data must be based on ascend_pytorch_profiler_{rank_id}.db files. When --export_type is db, the deliverable is cluster_analysis.db; when --export_type is notebook, the deliverable stats.ipynb is written to cluster_analysis_output/ComputeOpSum. | No |
-    | hccl_sum | Collective communication operator time analysis. The input performance data must be based on ascend_pytorch_profiler_{rank_id}.db files. When --export_type is db, the deliverable is cluster_analysis.db; when --export_type is notebook, the deliverable stats.ipynb is written to cluster_analysis_output/HcclSum. | No |
-    | mstx_sum | Cluster-wide summary of mstx markers. The input performance data must be based on ascend_pytorch_profiler_{rank_id}.db files. When --export_type is db, the deliverable is cluster_analysis.db; when --export_type is notebook, the deliverable stats.ipynb is written to cluster_analysis_output/MstxSum. | No |
-
-    Example using --parallel_mode:
-
- ```bash
- msprof-analyze cluster -d {cluster profiling data path} -m cann_api_sum --parallel_mode concurrent
- ```
-
-    or
-
- ```bash
- python3 cluster_analysis.py -d {cluster profiling data path} -m cann_api_sum --parallel_mode concurrent
- ```
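-
-    The recipe-related options can be combined; for example (all flags as documented above, the path is a placeholder):
-
-    ```bash
-    msprof-analyze cluster -d {cluster profiling data path} -m hccl_sum --export_type notebook --rank_list 0,1,2,3 --top_num 20
-    ```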
-
-
-### Deliverables
-
-The deliverables of the cluster analysis tool are visualized with Ascend Insight; see the [MindStudio Ascend Insight User Guide](https://www.hiascend.com/document/detail/zh/mindstudio/70RC1/GUI-baseddevelopmenttool/msascendinsightug/AscendInsight_0002.html).
-
-#### cluster_step_trace_time.csv
-
-Generated when the parsing mode is communication_matrix, communication_time, or all.
-
-Column A: Step, the step number set when the performance data was collected. One step is usually enough for cluster profiling; if several steps were collected, filter to one first.
-
-Column B: Type, either rank or stage, strongly tied to the following Index column: rank denotes a single card, while stage denotes a rank group (a pipeline-parallel stage). When Type is stage, columns D-K hold the maximum over the rank group.
-
-Column C: Index, the card number, interpreted according to Type.
-
-Column D: Computing, the total computation time.
-
-Column E: Communication(Not Overlapped), communication time not overlapped by computation.
-
-Column F: Overlapped, the time during which computation and communication overlap.
-
-Column G: Communication, the total communication time.
-
-Column H: Free, idle time during which the device is neither communicating nor computing; it may be doing SDMA copies or simply waiting.
-
-Column I: Stage. Columns I, J, and K are meaningful only under pipeline parallelism; the stage time is the total time excluding receive operators.
-
-Column J: Bubble, the sum of receive operator time.
-
-Column K: Communication(Not Overlapped and Exclude Receive), the non-overlapped communication time excluding receive operators.
-
-Column L: Preparing, the time from the start of the iteration to the first compute or communication operator.
-
-**Tips**: First filter column B for type stage to check for problems between stages, then filter for type rank to check individual ranks, following these points (a screening sketch follows the list):
-
-* Use differences in Computing time to spot slow cards or load imbalance.
-
-* Use Free to check for host-bound behavior or uneven distribution.
-
-* Use Communication(Not Overlapped and Exclude Receive) to judge whether communication takes too large a share of the time.
-
-* Use the share of Bubble time together with the theoretical formula to judge whether the bubble is set reasonably and whether stages are unbalanced.
-
-In theory all of these times should be roughly flat across ranks, i.e. the maximum should exceed the minimum by less than 5%; otherwise there may be a slow card.
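-
-A minimal pandas sketch of this screening (assuming the csv headers match the column names listed above; the path is a placeholder):
-
-```python
-import pandas as pd
-
-df = pd.read_csv("cluster_analysis_output/cluster_step_trace_time.csv")
-
-# Screen stages first, then individual ranks, as suggested above.
-for scope in ("stage", "rank"):
-    part = df[df["Type"] == scope]
-    for col in ("Computing", "Free", "Bubble",
-                "Communication(Not Overlapped and Exclude Receive)"):
-        lo, hi = part[col].min(), part[col].max()
-        if lo > 0 and hi > lo * 1.05:  # spread above 5%: possible slow card
-            print(f"{scope}/{col}: spread {hi / lo - 1:.1%}, investigate")
-```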
-
-#### cluster_communication_matrix.json
-
-Generated when the parsing mode is communication_matrix or all.
-
-Open the json directly (with vscode or a json viewer) and search for "Total"; there will be several matches. The link bandwidth information generally has this structure:
-
-```
-{src_rank}-{dst_rank}: {
-    "Transport Type": "LOCAL",
-    "Transit Time(ms)": 0.02462,
-    "Transit Size(MB)": 16.777216,
-    "Bandwidth(GB/s)": 681.4466
-}
-```
-**Tips**: Use the bandwidth and link type between ranks to judge whether there is a slow link, as in the sketch below.
-
-- "LOCAL" is an on-chip copy; it is very fast and can be ignored.
-- "HCCS" or "PCIE" is an intra-node, inter-chip copy; around 18 GB/s or higher is normal.
-- "RDMA" is an inter-node copy; on 910A, around 12 GB/s or higher is normal.
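-
-A minimal sketch that walks the json and flags links below the reference bandwidths above (the file path is a placeholder, and the nesting is traversed generically since it varies by communication group):
-
-```python
-import json
-
-THRESHOLDS = {"HCCS": 18.0, "PCIE": 18.0, "RDMA": 12.0}  # GB/s, from the guidance above
-
-def walk(node, path=""):
-    if not isinstance(node, dict):
-        return
-    ttype = node.get("Transport Type")
-    bw = node.get("Bandwidth(GB/s)")
-    if ttype in THRESHOLDS and isinstance(bw, (int, float)) and bw < THRESHOLDS[ttype]:
-        print(f"slow {ttype} link at {path}: {bw} GB/s")
-    for key, child in node.items():
-        walk(child, f"{path}/{key}")
-
-with open("cluster_communication_matrix.json") as f:
-    walk(json.load(f))
-```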
-
-#### cluster_communication.json
-
-Generated when the parsing mode is communication_time or all.
-
-Contains mainly communication time data.
-
-#### cluster_analysis.db
-
-Deliverable generated by parsing analysis.db or ascend_pytorch_profiler_{rank_id}.db; the data it contains depends on the parsing mode. It can be visualized with Ascend Insight.
-
-#### stats.ipynb
-
-- Generated when the parsing mode is cann_api_sum, saved under cluster_analysis_output/CannApiSum.
-
-  It can be opened with jupyter notebook or Ascend Insight and mainly presents cluster-wide API time statistics.
-
-- Generated when the parsing mode is compute_op_sum, saved under cluster_analysis_output/ComputeOpSum.
-
-  It can be opened with jupyter notebook or Ascend Insight and mainly presents cluster-wide compute operator time analysis (all compute operators in the cluster aggregated and charted) and per-rank compute operator time analysis (the compute operators of each rank aggregated separately).
-
-- Generated when the parsing mode is hccl_sum, saved under cluster_analysis_output/HcclSum.
-
-  It can be opened with jupyter notebook or Ascend Insight and mainly presents cluster-wide communication operator time analysis (all communication operators in the cluster aggregated and charted), per-rank communication operator time analysis (the communication operators of each rank aggregated separately), and the top communication operators.
-
-- Generated when the parsing mode is mstx_sum, saved under cluster_analysis_output/MstxSum.
-
-  It can be opened with jupyter notebook or Ascend Insight and mainly presents cluster-wide mstx marker information, split into framework-side, CANN-side, and device-side markers.
-
-
-
diff --git a/profiler/cluster_analyse/__init__.py b/profiler/cluster_analyse/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/__init__.py b/profiler/cluster_analyse/analysis/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/analysis_facade.py b/profiler/cluster_analyse/analysis/analysis_facade.py
deleted file mode 100644
index 435d77b21bff423b207bf050ea660a1738f0fe5f..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/analysis_facade.py
+++ /dev/null
@@ -1,50 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from multiprocessing import Process
-
-from analysis.communication_analysis import CommunicationAnalysis
-from analysis.comm_matrix_analysis import CommMatrixAnalysis
-from analysis.step_trace_time_analysis import StepTraceTimeAnalysis
-from analysis.host_info_analysis import HostInfoAnalysis
-from common_func.context import Context
-from common_func.constant import Constant
-
-class AnalysisFacade:
- default_module = {CommunicationAnalysis, StepTraceTimeAnalysis, CommMatrixAnalysis, HostInfoAnalysis}
-
- def __init__(self, params: dict):
- self.params = params
-
- def cluster_analyze(self):
-        # Handle the multiple profiler data sets with multiprocessing: one process per analysis module.
- process_list = []
- for analysis in self.default_module:
- process = Process(target=analysis(self.params).run)
- process.start()
- process_list.append(process)
-
- for process in process_list:
- process.join()
-
- def recipe_analyze(self):
- HostInfoAnalysis(self.params).run()
- print("[INFO] Recipe analysis launched.")
- try:
- with Context.create_context(self.params.get(Constant.PARALLEL_MODE)) as context:
- with self.params.get(Constant.RECIPE_CLASS)(self.params) as recipe:
- recipe.run(context)
- except Exception as e:
-            print("[ERROR] Recipe analysis launch failed, %s." % str(e))
diff --git a/profiler/cluster_analyse/analysis/base_analysis.py b/profiler/cluster_analyse/analysis/base_analysis.py
deleted file mode 100644
index 7209e9b56f04cc6e97e4331db2ca48ba18a67ed6..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/base_analysis.py
+++ /dev/null
@@ -1,255 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import sys
-import traceback
-import shutil
-import pandas as pd
-from abc import abstractmethod
-
-from common_func.constant import Constant
-from common_func.file_manager import FileManager
-from common_func.db_manager import DBManager
-from common_func.utils import convert_unit
-from cluster_utils.data_transfer_adapter import DataTransferAdapter
-
-
-class BaseAnalysis:
- MAX_RANKS = 1000
- def __init__(self, param: dict):
- self.collection_path = param.get(Constant.COLLECTION_PATH)
- self.data_map = param.get(Constant.DATA_MAP)
- self.data_type = param.get(Constant.DATA_TYPE)
- self.communication_ops = []
- self.collective_group_dict = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.COLLECTIVE_GROUP)
- self.comm_ops_struct = {}
- self.adapter = DataTransferAdapter()
-
- @staticmethod
- def compute_ratio(dividend: float, divisor: float):
- if abs(divisor) < Constant.EPS:
- return 0
- else:
- return round(dividend / divisor, 4)
-
- @staticmethod
- def check_add_op(op_name: str):
-        """
-        Compatible with two data versions: decide whether this operator's info should be accumulated.
-        """
- stat_list = ["middle", "top", "bottom", "total"]
- total = "total"
- for stat_name in stat_list:
- if stat_name in op_name:
- if stat_name != total:
- return False
- return True
-
- @abstractmethod
- def run(self):
- pass
-
- def dump_data(self):
- if not self.comm_ops_struct:
- print("[WARNING] There is no final comm ops data generated")
- return
- if self.data_type == Constant.TEXT:
- self.dump_json()
- else:
- if len(self.data_map) >= self.MAX_RANKS:
-                print("[WARNING] The number of ranks is too large to dump to db; dumping to a json file instead.")
- self.dump_json()
- else:
- self.dump_db()
-
- @abstractmethod
- def dump_db(self):
- pass
-
- def dump_json(self):
- output_comm_data = {}
- for key in self.comm_ops_struct:
- output_comm_data[str(key)] = self.comm_ops_struct.get(key)
- FileManager.create_json_file(self.collection_path, output_comm_data, self.SAVED_JSON)
-
- def split_op_by_group(self):
- for single_op in self.communication_ops:
- if single_op.get(Constant.COMM_OP_TYPE) == Constant.P2P:
- rank_tup = Constant.P2P
- else:
- rank_tup = tuple(self.collective_group_dict.get(single_op.get(Constant.GROUP_NAME), []))
- rank_id = single_op.get(Constant.RANK_ID, 'N/A')
- step_id = single_op.get(Constant.STEP_ID, 'N/A')
- op_name = single_op.get(Constant.COMM_OP_NAME, 'N/A')
- op_info = single_op.get(Constant.COMM_OP_INFO)
- self.comm_ops_struct.setdefault(rank_tup, {}).setdefault(step_id, {}).\
- setdefault(op_name, {}).setdefault(rank_id, op_info)
-
- def combine_ops_total_info(self):
- for rank_tup, group_dict in self.comm_ops_struct.items():
- for step_id, communication_ops in group_dict.items():
- self.compute_total_info(communication_ops)
-
-
-class BaseRecipeAnalysis:
-
- UNIT = "Us"
- DB_UNIT = "Ns"
-
- RANK_LIST = "rank_list"
-
- def __init__(self, params):
- self._params = params
- self._collection_dir = params.get(Constant.COLLECTION_PATH, "")
- self._data_map = params.get(Constant.DATA_MAP, {})
- self._recipe_name = params.get(Constant.RECIPE_NAME, "")
- self._mode = params.get(Constant.PARALLEL_MODE, "")
- self._export_type = params.get(Constant.EXPORT_TYPE, "")
- self._output_dir = None
- self._rank_list = params.get(self.RANK_LIST, 'all')
-
- def __enter__(self):
- return self
-
- def __exit__(self, exc_type, exc_val, exc_tb):
- if self._params is not None and exc_type is not None:
- print(f"[ERROR] Failed to exit analysis: {exc_val}")
- traceback.print_exc(file=sys.stdout)
-
- def run(self, context):
- pass
-
- @property
- def base_dir(self):
- return os.path.basename(os.path.dirname(__file__))
-
- def _get_rank_db(self):
- invalid_rank_id = []
- if self._rank_list == 'all':
- rank_ids = list(self._data_map.keys())
- else:
- rank_ids = []
- for rank_id in self._rank_list:
- if rank_id in self._data_map.keys():
- rank_ids.append(rank_id)
- else:
- invalid_rank_id.append(str(rank_id))
- db_paths = []
- for rank_id in rank_ids:
- rank_path = self._data_map[rank_id]
- db_path = os.path.join(rank_path, Constant.SINGLE_OUTPUT, f"ascend_pytorch_profiler_{rank_id}.db")
- if os.path.exists(db_path):
- db_paths.append((rank_id, db_path))
- else:
- print(f"[WARNING] DB file not found, rank id: {rank_id}, db path: {db_path}.")
- if invalid_rank_id:
-            print(f"[WARNING] Invalid rank id: [{','.join(invalid_rank_id)}].")
- return db_paths
-
- def get_mode(self):
- return self._mode
-
- def get_recipe_name(self):
- return self._recipe_name
-
- def dump_data(self, data, file_name, table_name=None, index=True):
- output_path = os.path.join(self._collection_dir, Constant.CLUSTER_ANALYSIS_OUTPUT)
- if table_name:
- result_db = os.path.join(output_path, file_name)
- conn, cursor = DBManager.create_connect_db(result_db)
- if isinstance(data, pd.DataFrame):
- data.to_sql(table_name, conn, if_exists='replace', index=True)
- else:
- print(f"[ERROR] Unknown dump data type: {type(data)}")
- DBManager.destroy_db_connect(conn, cursor)
- else:
- result_csv = os.path.join(output_path, file_name)
- if isinstance(data, pd.DataFrame):
- data = convert_unit(data, self.DB_UNIT, self.UNIT)
- data.to_csv(result_csv, index=index)
- else:
- print(f"[ERROR] Unknown dump data type: {type(data)}")
-
- def _create_output_dir_name(self, name):
- i = 1
- while os.path.exists(f"{name}-{i}"):
- i += 1
- return f"{name}-{i}"
-
- def _create_unique_output_dir(self):
- output_dir = os.path.join(self._collection_dir, Constant.CLUSTER_ANALYSIS_OUTPUT, self._recipe_name)
-
- if os.path.exists(output_dir):
- return self._create_output_dir_name(output_dir)
- return output_dir
-
- def _get_output_dir(self):
- if self._output_dir is None:
- self._output_dir = self._create_unique_output_dir()
- os.makedirs(self._output_dir)
- return self._output_dir
-
- def create_notebook(self, filename, notebook_template_dir=None, replace_dict=None):
- if notebook_template_dir is None:
- template_path = os.path.dirname(__file__)
- else:
- template_path = notebook_template_dir
- output_path = os.path.join(self._get_output_dir(), filename)
- template_file = os.path.join(template_path, self.base_dir, filename)
- if replace_dict is None:
- shutil.copy(template_file, output_path)
- else:
- with open(template_file, 'r') as f:
- template_content = f.read()
- for key, value in replace_dict.items():
- template_content = template_content.replace(str(key), str(value))
- with open(output_path, 'w') as f:
- f.write(template_content)
- print(f"[INFO] Notebook export path is: {self._get_output_dir()}")
-
- def add_helper_file(self, helper_file):
- helper_output_path = os.path.join(self._get_output_dir(), helper_file)
- helper_file_path = os.path.join(os.path.dirname(__file__), helper_file)
-
-        if os.path.exists(helper_file_path):
- shutil.copy(helper_file_path, helper_output_path)
-
- @staticmethod
- def _filter_data(mapper_data):
- return [(rank, data) for rank, data in mapper_data if data is not None and len(data) != 0]
-
- @classmethod
- def add_parser_argument(cls, parser):
- parser.add_argument("--rank_list", type=str, help="Rank id list", default='all')
-
- @classmethod
- def parse_argument(cls, args_parsed) -> dict:
- if args_parsed.rank_list == 'all':
- return {
- cls.RANK_LIST: 'all'
- }
- else:
- rank_str_list = args_parsed.rank_list.split(",")
- rank_list = [int(rank) for rank in rank_str_list if rank.isdigit()]
- return {
- cls.RANK_LIST: rank_list
- }
-
- @classmethod
- def get_extra_argument(cls, params) -> dict:
- return {
- cls.RANK_LIST: params.get(cls.RANK_LIST, "all")
- }
diff --git a/profiler/cluster_analyse/analysis/cann_api_sum/__init__.py b/profiler/cluster_analyse/analysis/cann_api_sum/__init__.py
deleted file mode 100644
index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/cann_api_sum/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/cann_api_sum/cann_api_sum.py b/profiler/cluster_analyse/analysis/cann_api_sum/cann_api_sum.py
deleted file mode 100644
index db37b004b150eaa65b9c9cd4e12f1f5bdc0836e9..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/cann_api_sum/cann_api_sum.py
+++ /dev/null
@@ -1,108 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import pandas as pd
-
-from analysis.base_analysis import BaseRecipeAnalysis
-from common_func.constant import Constant
-from common_func.utils import stdev
-from cluster_statistics_export.cann_api_sum_export import CannApiSumExport
-
-
-class CannApiSum(BaseRecipeAnalysis):
-
- def __init__(self, params):
- super().__init__(params)
- print("[INFO] CannApiSum init.")
-
- @property
- def base_dir(self):
- return os.path.basename(os.path.dirname(__file__))
-
- @staticmethod
- def _mapper_func(data_map, analysis_class):
- df = CannApiSumExport(data_map[1], analysis_class).read_export_db()
-
- if df is None or df.empty:
- print(f"[WARNING] There is no stats data in {data_map[1]}.")
- return None
- return data_map[0], df
-
- def mapper_func(self, context):
- return context.wait(
- context.map(
- self._mapper_func,
- self._get_rank_db(),
- analysis_class=self._recipe_name
- )
- )
-
- def reducer_func(self, mapper_res):
- stats_rank_data = self._filter_data(mapper_res)
- if not stats_rank_data:
- print("[ERROR] Mapper data is None.")
- return
- stats_rank_data = [df.assign(rank=rank) for rank, df in stats_rank_data]
- stats_rank_data = pd.concat(stats_rank_data)
- stats_data = self._aggregate_stats(stats_rank_data)
- if self._export_type == "db":
- self.dump_data(stats_rank_data, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, "CannApiSumRank")
- self.dump_data(stats_data, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, "CannApiSum")
- elif self._export_type == "notebook":
- self.dump_data(stats_rank_data, os.path.join(self._get_output_dir(), "rank_stats.csv"), index=False)
- self.dump_data(stats_data, os.path.join(self._get_output_dir(), "all_stats.csv"))
- self.save_notebook()
- else:
- print("[ERROR] Unknown export type.")
-
- def run(self, context):
- mapper_res = self.mapper_func(context)
- self.reducer_func(mapper_res)
-
- @staticmethod
- def _aggregate_stats(stats_res):
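-        # Aggregate the per-rank stats into cluster-wide stats per API name;
-        # Q1/min take the minimum across ranks, Q3/max the maximum.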
- grouped = stats_res.groupby("name")
- res = {}
- total_time = grouped["totalTimeNs"].sum()
- res["timeRatio"] = total_time / total_time.sum() * 100.0
- res["totalTimeNs"] = total_time
- res["totalCount"] = grouped["totalCount"].sum()
- res["averageNs"] = res["totalTimeNs"] / res["totalCount"]
- res["Q1Ns"] = grouped["Q1Ns"].min()
- res["medNs"] = grouped["medNs"].median()
- res["Q3Ns"] = grouped["Q3Ns"].max()
- res["minNs"] = grouped["minNs"].min()
- res["maxNs"] = grouped["maxNs"].max()
- res["stdev"] = grouped.apply(lambda x: stdev(x, res))
- min_value = grouped["minNs"].min()
- res["minRank"] = grouped.apply(
- lambda x: ", ".join(
- x.loc[x["minNs"] == min_value.loc[x.name], "rank"].astype(str)
- )
- )
- max_value = grouped["maxNs"].max()
- res["maxRank"] = grouped.apply(
- lambda x: ", ".join(
- x.loc[x["maxNs"] == max_value.loc[x.name], "rank"].astype(str)
- )
- )
- res = pd.concat(res.values(), axis=1, keys=res.keys()).round(1)
- res.sort_values(by="totalTimeNs", ascending=False, inplace=True)
- return res
-
- def save_notebook(self):
- self.create_notebook("stats.ipynb")
- self.add_helper_file("cluster_display.py")
diff --git a/profiler/cluster_analyse/analysis/cann_api_sum/stats.ipynb b/profiler/cluster_analyse/analysis/cann_api_sum/stats.ipynb
deleted file mode 100644
index c97f039c5a01a6e7cce2968d569d79e137e76f8c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/cann_api_sum/stats.ipynb
+++ /dev/null
@@ -1,86 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# CANN_API_SUM"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "vscode": {
- "languageId": "plaintext"
- }
- },
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "import plotly.offline as pyo\n",
- "\n",
- "from IPython.display import display, HTML\n",
- "\n",
- "import cluster_display\n",
- "\n",
- "display(HTML(\"\"))\n",
- "pd.set_option('display.max_columns', None)\n",
- "pd.set_option('display.max_rows', None)\n",
- "pyo.init_notebook_mode()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "\n",
-    "## Cluster-wide CANN API statistics\n",
-    "This notebook presents the statistical analysis results for cluster scenarios. Note the following:\n",
-    "1. All time values are in microseconds (us);\n",
-    "2. Q1 is the 25th percentile of a single API's duration; the final value is the minimum Q1 across all cards;\n",
-    "3. Q3 is the 75th percentile of a single API's duration; the final value is the maximum Q3 across all cards;\n",
-    "4. 'minRank' shows the card on which the API's minimum duration occurred;\n",
-    "5. 'maxRank' shows the card on which the API's maximum duration occurred."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "df = pd.read_csv(\"all_stats.csv\")\n",
- "display(df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "cluster_display.display_box(df, xaxis_title=\"name\", yaxis_title=\"duration (ns)\")\n",
- "cluster_display.display_stats_scatter(df, xaxis_title=\"name\", yaxis_title=\"duration (ns)\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "per_rank_df = pd.read_csv(\"rank_stats.csv\")\n",
- "cluster_display.display_stats_per_operation(per_rank_df, xaxis_title='rank', yaxis_title='duration (ns)')"
- ]
- }
- ],
- "metadata": {
- "language_info": {
- "name": "python"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/profiler/cluster_analyse/analysis/cluster_display.py b/profiler/cluster_analyse/analysis/cluster_display.py
deleted file mode 100644
index 8fc6040ccafae2d069e2e6e394941c7aff83a452..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/cluster_display.py
+++ /dev/null
@@ -1,239 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import numpy as np
-import pandas as pd
-import plotly.graph_objects as go
-from IPython.display import display, HTML
-from ipywidgets import Dropdown, fixed, interact
-
-
-def get_stats_cols(df):
- cols = df.columns.tolist()
- q1 = "Q1(Us)" if "Q1(Us)" in cols else "Q1~"
- q3 = "Q3(Us)" if "Q3(Us)" in cols else "Q3~"
- med = "med(Us)" if "med(Us)" in cols else "med~"
- std = "stdev" if "stdev" in cols else "stdev~"
- return q1, q3, med, std
-
-
-def display_box(df, x=None, **layout_args):
- if x is None:
- x = df.columns[0]
- q1, q3, med, std = get_stats_cols(df)
- fig = go.Figure()
- fig.add_trace(
- go.Box(
- x=df[x],
- q1=df[q1],
- median=df[med],
- q3=df[q3],
- sd=df[std],
- lowerfence=df["minRank"],
- upperfence=df["maxRank"]
- )
- )
- fig.update_layout(**layout_args)
- fig.show()
-
-
-def display_stats_scatter(df, x=None, **layout_args):
- if x is None:
- x = df.columns[0]
- q1, q3, med, _ = get_stats_cols(df)
- fig = go.Figure()
- col_names = [q1, med, q3, "minRank", "maxRank"]
- for name in col_names:
- fig.add_trace(
- go.Scatter(
- x=df[x],
- y=df[name],
- name=name
- )
- )
- fig.update_layout(**layout_args)
- fig.show()
-
-
-def display_table_per_rank(df):
- if df.empty:
- display(df)
- return
-
- rank_groups = df.groupby("rank")
- def display_table(name):
- rank_df = rank_groups.get_group(name)
- rank_df = rank_df.drop(columns=["rank"])
- display(rank_df)
-
- dropdown = Dropdown(
- options=rank_groups.groups.keys(),
- description="rank:",
- disabled=False,
- )
- interact(
- display_table,
- name=dropdown
- )
-
-
-def display_stats_per_operation(df, x=None, box=True, scatter=True, table=True, **layout_args):
- if df.empty:
- display(df)
- return
-
- if x is None:
- x = df.columns[0]
-
- op_groups = df.groupby(x)
-
- def display_graphs(name):
- op_df = op_groups.get_group(name)
- if table:
- display(op_df.reset_index(drop=True).set_index("rank"))
- if box:
- display_box(op_df, x=op_df["rank"], **layout_args)
- if scatter:
- display_stats_scatter(op_df, x=op_df["rank"], **layout_args)
-
- operations = list(op_groups.groups.keys())
-
- if len(operations) > 1:
- dropdown = Dropdown(
- options=operations,
- description="Operation:",
- disabled=False,
- value=operations[1]
- )
- interact(
- display_graphs,
- name=dropdown
- )
- dropdown.value = operations[0]
- else:
- display_graphs(operations[0])
-
-
-def display_duration_boxplots(figs, stats_df: pd.DataFrame, orientation="v", title=None,
- x_title="Names", y_title="Time", legend_title="Legend"):
- mean_ds = stats_df.get("Mean(Us)", None)
- min_ds = stats_df.get("Min(Us)", None)
- max_ds = stats_df.get("Max(Us)", None)
- q1_ds = stats_df.get("Q1(Us)", None)
- median_ds = stats_df.get('Median(Us)', None)
- q3_ds = stats_df.get('Q3(Us)', None)
- return display_boxplot(figs, stats_df.index, min_ds, q1_ds, median_ds, q3_ds, max_ds, mean_ds,
- orientation=orientation, title=title, x_title=x_title, y_title=y_title,
- legend_title=legend_title)
-
-
-def display_boxplot(figs, x_axis, min_ds, q1_ds, median_ds, q3_ds, max_ds, mean_ds, orientation="v",
- title=None, x_title=None, y_title="Time", legend_title="Legend"):
- fig = go.Figure()
- fig.add_trace(
- go.Box(
- x=x_axis,
- lowerfence=min_ds,
- q1=q1_ds,
- median=median_ds,
- q3=q3_ds,
- upperfence=max_ds,
- mean=mean_ds
- )
- )
- fig.update_traces(orientation=orientation)
- fig.update_layout(
- xaxis_title=x_title, yaxis_title=y_title, legend_title=legend_title,
- title=title, height=1024
- )
- fig.show()
- if isinstance(figs, list):
- figs.append(fig)
- return fig
-
-
-def display_graph(figs, x_axis, y_axes, title=None,
- x_title=None, y_title=None, legend_title="Legend"):
- data = None
- if isinstance(y_axes, pd.DataFrame):
- data = y_axes.set_index(x_axis)
- elif isinstance(y_axes, dict):
- data = pd.DataFrame(y_axes, index=x_axis)
- elif isinstance(y_axes, pd.Series):
- data = pd.DataFrame({"": y_axes}, index=x_axis)
- elif isinstance(y_axes, np.ndarray):
- data = pd.DataFrame({"": pd.Series(y_axes)}, index=x_axis)
- else:
- return
-
- fig = data.plot.line()
- fig.update_layout(
- title=title, xaxis_title=x_title, yaxis_title=y_title, legend_title=legend_title
- )
- fig.show()
- if isinstance(figs, list):
- figs.append(fig)
- return fig
-
-
-def display_stats_per_rank_groups_combobox(rank_stats_gdf):
- names = list(rank_stats_gdf.groups.keys())
- if len(names) > 1:
- dropdown = Dropdown(
- options=names, layout={"width": "max-content"}, value=names[1]
- )
- interact(
- __display_stats_per_rank_group,
- selected=dropdown,
- rank_stats_gdf=fixed(rank_stats_gdf)
- )
- dropdown.value = names[0]
- elif len(names) == 1:
- __display_stats_per_rank_group(names[0], rank_stats_gdf)
- else:
-        print("[WARNING] cluster_display: rank_stats_gdf has no groups, nothing to display.")
-
-
-def __display_stats_per_rank_group(selected, rank_stats_gdf):
- df = rank_stats_gdf.get_group(selected)
- df = df.reset_index(drop=True)
- df = df.set_index(df["Rank"])
- display(df)
-
- figs = []
- display_duration_boxplots(figs, df, x_title="Ranks")
- display_graph(
- figs,
- df.index,
- df[["Q1(Us)", "Median(Us)", "Q3(Us)"]],
- title="50% of Distribution",
- x_title="Ranks"
- )
-
-
-def display_stats_optional_combobox(options, display_func, args, description="Option:"):
- if len(options) > 1:
- dropdown = Dropdown(
- options=options, layout={"width": "max-content"}, value=options[1],
- description=description
- )
- interact(
- display_func,
- selected=dropdown,
- args=fixed(args)
- )
- dropdown.value = options[0]
- elif len(options) == 1:
- display_func(options[0], args)
diff --git a/profiler/cluster_analyse/analysis/comm_matrix_analysis.py b/profiler/cluster_analyse/analysis/comm_matrix_analysis.py
deleted file mode 100644
index 8dc04471fe0a164fc859e51597d41028523f7a32..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/comm_matrix_analysis.py
+++ /dev/null
@@ -1,106 +0,0 @@
-import os
-from collections import defaultdict
-
-from analysis.base_analysis import BaseAnalysis
-from common_func.constant import Constant
-from common_func.db_manager import DBManager
-
-
-class CommMatrixAnalysis(BaseAnalysis):
- SAVED_JSON = "cluster_communication_matrix.json"
- COMMUNICATION_MATRIX_TABLE = "ClusterCommAnalyzerMatrix"
-
- def __init__(self, param: dict):
- super().__init__(param)
- self.communication_ops = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.MATRIX_OPS)
-
- @staticmethod
- def combine_link(link_info_dict: dict, single_link_dict: dict):
- link_info_dict[Constant.TRANSPORT_TYPE] = single_link_dict.get(Constant.TRANSPORT_TYPE)
- link_info_dict[Constant.OP_NAME] = single_link_dict.get(Constant.OP_NAME, '')
- link_info_dict[Constant.TRANSIT_TIME_MS] += single_link_dict.get(Constant.TRANSIT_TIME_MS, 0)
- link_info_dict[Constant.TRANSIT_SIZE_MB] += single_link_dict.get(Constant.TRANSIT_SIZE_MB, 0)
-
- def run(self):
- if not self.communication_ops:
- return
- self.split_op_by_group()
- self.combine_ops_total_info()
- self.dump_data()
-
- def dump_db(self):
- res_comm_matrix = self.adapter.transfer_matrix_from_json_to_db(self.comm_ops_struct)
- output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
- result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER)
- DBManager.create_tables(result_db, self.COMMUNICATION_MATRIX_TABLE)
- conn, cursor = DBManager.create_connect_db(result_db)
- if res_comm_matrix:
- res_matrix_value = [list(data.values()) for data in res_comm_matrix]
- sql = "insert into {} values ({value})".format(self.COMMUNICATION_MATRIX_TABLE,
- value="?," * (len(res_matrix_value[0]) - 1) + "?")
- DBManager.executemany_sql(conn, sql, res_matrix_value)
- DBManager.destroy_db_connect(conn, cursor)
-
- def compute_total_info(self, step_dict: dict):
- self.merge_same_links(step_dict)
- self.combine_link_info(step_dict)
-
- def merge_same_links(self, step_dict: dict):
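-        # Merge links repeated across ranks and remap local rank ids to global rank ids.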
- def process_link_key():
- for link_key in rank_dict:
- if '-' not in link_key:
- print(f"[WARNING] {op_name} has an invalid link key {link_key}!")
- break
- src_rank = link_key.split('-')[0]
- dst_rank = link_key.split('-')[1]
- if src_rank == dst_rank:
- if src_rank not in project_local_global_rank_map:
- project_local_global_rank_map[src_rank] = rank_id
- elif project_local_global_rank_map.get(src_rank) != rank_id:
-                        print(f"[WARNING] In the same communication group, the mapping from local rank "
-                              f"to global rank is not unique!")
- self.combine_link(link_info[link_key], rank_dict[link_key])
-
- def convert_local_to_global_rank():
- tmp_link = {}
- for link_key, link_dict in link_info.items():
- src_rank = link_key.split('-')[0]
- dst_rank = link_key.split('-')[1]
- src_rank = project_local_global_rank_map[src_rank] \
- if src_rank in project_local_global_rank_map else src_rank
- dst_rank = project_local_global_rank_map[dst_rank] \
- if dst_rank in project_local_global_rank_map else dst_rank
- link_dict[Constant.BANDWIDTH_GB_S] = \
- self.compute_ratio(link_dict.get(Constant.TRANSIT_SIZE_MB, 0),
- link_dict.get(Constant.TRANSIT_TIME_MS, 0))
- tmp_link[f"{src_rank}-{dst_rank}"] = link_dict
- return tmp_link
-
- project_local_global_rank_map = dict()
- for op_name, op_dict in step_dict.items():
- link_info = defaultdict(lambda: {
- Constant.TRANSPORT_TYPE: '',
- Constant.TRANSIT_TIME_MS: 0,
- Constant.TRANSIT_SIZE_MB: 0,
- Constant.OP_NAME: ''
- })
- for rank_id, rank_dict in op_dict.items():
- process_link_key()
- step_dict[op_name] = convert_local_to_global_rank()
-
- def combine_link_info(self, step_dict: dict):
- total_op_info = defaultdict(lambda: {
- Constant.TRANSPORT_TYPE: '',
- Constant.TRANSIT_TIME_MS: 0,
- Constant.TRANSIT_SIZE_MB: 0,
- Constant.OP_NAME: ''
- })
- for op_name, op_dict in step_dict.items():
- if self.check_add_op(op_name):
- for link_key, link_dict in op_dict.items():
- self.combine_link(total_op_info[link_key], link_dict)
- for link_key, link_dict in total_op_info.items():
- link_dict[Constant.BANDWIDTH_GB_S] = \
- self.compute_ratio(link_dict.get(Constant.TRANSIT_SIZE_MB, 0),
- link_dict.get(Constant.TRANSIT_TIME_MS, 0))
- step_dict[Constant.TOTAL_OP_INFO] = total_op_info
diff --git a/profiler/cluster_analyse/analysis/communication_analysis.py b/profiler/cluster_analyse/analysis/communication_analysis.py
deleted file mode 100644
index 3f0a9b417e211b124b052cb5c5534f2fdbe5302e..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/communication_analysis.py
+++ /dev/null
@@ -1,103 +0,0 @@
-import os
-from collections import defaultdict
-
-from analysis.base_analysis import BaseAnalysis
-from common_func.constant import Constant
-from common_func.db_manager import DBManager
-
-
-class CommunicationAnalysis(BaseAnalysis):
- SAVED_JSON = "cluster_communication.json"
- COMMUNICATION_BANDWIDTH_TABLE = "ClusterCommAnalyzerBandwidth"
- COMMUNICATION_TIME_TABLE = "ClusterCommAnalyzerTime"
-
- def __init__(self, param: dict):
- super().__init__(param)
- self.communication_ops = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.COMMUNICATION_OPS)
-
- @staticmethod
- def combine_size_distribution(op_dict: dict, total_dict: dict):
- for size, size_info in op_dict.items():
- total_dict[size][0] += size_info[0]
- total_dict[size][1] += size_info[1]
-
- def run(self):
- if not self.communication_ops:
- return
- self.split_op_by_group()
- self.combine_ops_total_info()
- self.dump_data()
-
- def dump_db(self):
- res_comm_time, res_comm_bandwidth = self.adapter.transfer_comm_from_json_to_db(self.comm_ops_struct)
- output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
- result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER)
- DBManager.create_tables(result_db, self.COMMUNICATION_TIME_TABLE, self.COMMUNICATION_BANDWIDTH_TABLE)
- conn, cursor = DBManager.create_connect_db(result_db)
- self.execute(conn, res_comm_time, self.COMMUNICATION_TIME_TABLE)
- self.execute(conn, res_comm_bandwidth, self.COMMUNICATION_BANDWIDTH_TABLE)
- DBManager.destroy_db_connect(conn, cursor)
-
- @staticmethod
- def execute(conn, res_data, table_name):
- if res_data:
- res_value = [list(data.values()) for data in res_data]
- sql = "insert into {} values ({value})".format(table_name, value="?," * (len(res_value[0]) - 1) + "?")
- DBManager.executemany_sql(conn, sql, res_value)
-
- def compute_total_info(self, comm_ops: dict):
- if not comm_ops:
- return
- total_rank_dict = defaultdict(lambda: {
- Constant.COMMUNICATION_TIME_INFO: defaultdict(float),
- Constant.COMMUNICATION_BANDWIDTH_INFO: {}
- })
- for communication_op, rank_dict in comm_ops.items():
- for rank_id, communication_op_info in rank_dict.items():
- for com_info, com_info_dict in communication_op_info.items():
- if com_info == Constant.COMMUNICATION_TIME_INFO:
- self.combine_time_info(com_info_dict, total_rank_dict[rank_id][com_info])
- if com_info == Constant.COMMUNICATION_BANDWIDTH_INFO:
- self.combine_bandwidth_info(com_info_dict, total_rank_dict[rank_id][com_info])
- for rank_id in total_rank_dict:
- self.compute_time_ratio(total_rank_dict[rank_id][Constant.COMMUNICATION_TIME_INFO])
- self.compute_bandwidth_ratio(total_rank_dict[rank_id][Constant.COMMUNICATION_BANDWIDTH_INFO])
- comm_ops[Constant.TOTAL_OP_INFO] = total_rank_dict
-
- def combine_time_info(self, com_info_dict: dict, total_time_info_dict: dict):
- ratio_list = [Constant.WAIT_TIME_RATIO, Constant.SYNCHRONIZATION_TIME_RATIO]
- for time_info in com_info_dict:
- if time_info not in ratio_list and time_info != Constant.START_TIMESTAMP:
- total_time_info_dict[time_info] += com_info_dict.get(time_info)
-
- def combine_bandwidth_info(self, com_info_dict: dict, total_bandwidth_info_dict: dict):
- add_list = [Constant.TRANSIT_TIME_MS, Constant.TRANSIT_SIZE_MB]
- dict_list = [Constant.SIZE_DISTRIBUTION]
- for transport_type, part_transport_dict in com_info_dict.items():
- if transport_type not in total_bandwidth_info_dict:
- total_bandwidth_info_dict[transport_type] = {
- Constant.TRANSIT_TIME_MS: 0,
- Constant.TRANSIT_SIZE_MB: 0,
- Constant.SIZE_DISTRIBUTION: defaultdict(lambda: [0, 0])
- }
- for bandwidth_msg, value in part_transport_dict.items():
- if bandwidth_msg in add_list:
- total_bandwidth_info_dict[transport_type][bandwidth_msg] += value
- if bandwidth_msg in dict_list:
- self.combine_size_distribution(value, total_bandwidth_info_dict[transport_type][bandwidth_msg])
-
- def compute_time_ratio(self, total_time_info_dict: dict):
- total_time_info_dict[Constant.WAIT_TIME_RATIO] = \
- self.compute_ratio(total_time_info_dict.get(Constant.WAIT_TIME_MS, 0),
- total_time_info_dict.get(Constant.WAIT_TIME_MS, 0) +
- total_time_info_dict.get(Constant.TRANSIT_TIME_MS, 0))
- total_time_info_dict[Constant.SYNCHRONIZATION_TIME_RATIO] = \
- self.compute_ratio(total_time_info_dict.get(Constant.SYNCHRONIZATION_TIME_MS, 0),
- total_time_info_dict.get(Constant.SYNCHRONIZATION_TIME_MS, 0) +
- total_time_info_dict.get(Constant.TRANSIT_TIME_MS, 0))
-
- def compute_bandwidth_ratio(self, total_bandwidth_info_dict: dict):
- for transport_type, bandwidth_dict in total_bandwidth_info_dict.items():
- bandwidth_dict[Constant.BANDWIDTH_GB_S] = \
- self.compute_ratio(bandwidth_dict.get(Constant.TRANSIT_SIZE_MB, 0),
- bandwidth_dict.get(Constant.TRANSIT_TIME_MS, 0))
diff --git a/profiler/cluster_analyse/analysis/compute_op_sum/__init__.py b/profiler/cluster_analyse/analysis/compute_op_sum/__init__.py
deleted file mode 100644
index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/compute_op_sum/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/compute_op_sum/compute_op_sum.py b/profiler/cluster_analyse/analysis/compute_op_sum/compute_op_sum.py
deleted file mode 100644
index e71cf868ac9e06785d030a702bf9c8182ae4e948..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/compute_op_sum/compute_op_sum.py
+++ /dev/null
@@ -1,103 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import pandas as pd
-from analysis.base_analysis import BaseRecipeAnalysis
-from common_func.constant import Constant
-from common_func.utils import describe_duration
-from cluster_statistics_export.compute_op_sum_export import ComputeOpSumExport
-
-
-class ComputeOpSum(BaseRecipeAnalysis):
-
- TABLE_ALL_RANK_STATS = "ComputeOpAllRankStats"
- TABLE_PER_RANK_STATS_BY_OPTYPE = "ComputeOpPerRankStatsByOpType"
- TABLE_PER_RANK_STATS_BY_OPNAME = "ComputeOpPerRankStatsByOpName"
-
- def __init__(self, params):
- super().__init__(params)
- print("[INFO] ComputeOpSum init.")
- self.all_rank_stats = None
- self.per_rank_stats_by_optype = None
- self.per_rank_stats_by_opname = None
-
- @property
- def base_dir(self):
- return os.path.basename(os.path.dirname(__file__))
-
- @staticmethod
- def _mapper_func(data_map, analysis_class):
- df = ComputeOpSumExport(data_map[1], analysis_class).read_export_db()
-
- if df is None or df.empty:
- print(f"[WARNING] There is no stats data in {data_map[1]}.")
- return None
-
- df["Rank"] = data_map[0]
- return df
-
- def mapper_func(self, context):
- return context.wait(
- context.map(
- self._mapper_func,
- self._get_rank_db(),
- analysis_class=self._recipe_name
- )
- )
-
- def reducer_func(self, mapper_res):
- mapper_res = list(filter(lambda df: df is not None, mapper_res))
- if not mapper_res:
- print("[ERROR] Mapper data is None.")
- return
- # get per rank stats by optype
-        self.per_rank_stats_by_optype = pd.concat(
-            describe_duration(df.groupby(["OpType", "TaskType"])["Duration"]).assign(Rank=df["Rank"][0])
-            for df in mapper_res
-        )
- self.per_rank_stats_by_optype.sort_values(by=["SumNs"], inplace=True, ascending=False)
-
- # get all rank stats by optype
- all_op_data = pd.concat(mapper_res)
- self.all_rank_stats = describe_duration(all_op_data.groupby(["OpType", "TaskType"])["Duration"])
- self.all_rank_stats.sort_values(by=["SumNs"], inplace=True, ascending=False)
-
- # get per rank stats by opname
-        self.per_rank_stats_by_opname = pd.concat(
-            describe_duration(df.groupby(["OpName", "OpType", "TaskType", "InputShapes"])["Duration"]).assign(Rank=df["Rank"][0])
-            for df in mapper_res
-        )
- self.per_rank_stats_by_opname.sort_values(by=["SumNs"], inplace=True, ascending=False)
-
- def run(self, context):
- super().run(context)
- mapper_res = self.mapper_func(context)
- self.reducer_func(mapper_res)
-
- if self._export_type == "db":
- self.save_db()
- elif self._export_type == "notebook":
- self.save_notebook()
- else:
- print("[ERROR] Unknown export type.")
-
- def save_notebook(self):
- self.dump_data(self.all_rank_stats, os.path.join(self._get_output_dir(), "all_stats.csv"))
- self.dump_data(self.per_rank_stats_by_optype, os.path.join(self._get_output_dir(), "rank_stats_by_optype.csv"))
- self.dump_data(self.per_rank_stats_by_opname, os.path.join(self._get_output_dir(), "rank_stats_by_opname.csv"))
- self.create_notebook("stats.ipynb")
- self.add_helper_file("cluster_display.py")
-
- def save_db(self):
- self.dump_data(self.all_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_ALL_RANK_STATS)
- self.dump_data(self.per_rank_stats_by_optype, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_PER_RANK_STATS_BY_OPTYPE)
- self.dump_data(self.per_rank_stats_by_opname, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_PER_RANK_STATS_BY_OPNAME)
diff --git a/profiler/cluster_analyse/analysis/compute_op_sum/stats.ipynb b/profiler/cluster_analyse/analysis/compute_op_sum/stats.ipynb
deleted file mode 100644
index c88d2684c1f8822818f62005355c444332aaa915..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/compute_op_sum/stats.ipynb
+++ /dev/null
@@ -1,164 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Compute Op Summary\n",
- "\n",
-    "Compute operator analysis for cluster scenarios\n",
-    "\n",
-    "It covers three sets of statistics:\n",
-    "1. Cluster-wide compute operator durations, grouped by operator type and task type\n",
-    "2. Per-rank compute operator durations, grouped by operator type and task type\n",
-    "3. Per-rank compute operator durations, grouped by operator name, task type, and input shapes"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Data preparation"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from IPython.display import display, HTML\n",
- "display(HTML(\"\"))\n",
- "\n",
- "import plotly.offline as pyo\n",
- "\n",
- "def is_lab_notebook():\n",
- " import re\n",
- " import psutil\n",
- " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n",
- "\n",
- "if is_lab_notebook():\n",
- " pyo.init_notebook_mode()\n",
- "\n",
- "import pandas as pd\n",
- "pd.options.plotting.backend = \"plotly\"\n",
- "pd.set_option(\"display.max_rows\", 100)\n",
- "pd.set_option(\"display.width\", 1000)\n",
- "\n",
- "import cluster_display\n",
- "\n",
- "all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n",
- "rank_stats_by_optype_df = pd.read_csv(\"rank_stats_by_optype.csv\", index_col=\"OpType\")\n",
- "rank_stats_by_opname_df = pd.read_csv(\"rank_stats_by_opname.csv\", index_col=\"OpName\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Compute operator duration analysis\n",
-    "\n",
-    "Compute operators from all ranks in the cluster are aggregated, grouped by operator type and task type, and their durations are analyzed statistically; the time unit is microseconds (us)\n",
-    "\n",
-    "The statistics include:\n",
-    "- Count: number of operators\n",
-    "- Mean: average duration\n",
-    "- Std: standard deviation\n",
-    "- Min: minimum\n",
-    "- Q1: first quartile\n",
-    "- Median: median\n",
-    "- Q3: third quartile\n",
-    "- Max: maximum\n",
-    "- Sum: total duration"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "display(all_stats_df)\n",
- "fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"OpType\")\n",
- "fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"OpType\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Per-rank compute operator duration analysis by operator type\n",
-    "Compute operators on each rank are aggregated, grouped by operator type and task type, and their durations are analyzed statistically; the time unit is microseconds (us)\n",
-    "\n",
-    "The statistics include:\n",
-    "- Count: number of operators\n",
-    "- Mean: average duration\n",
-    "- Std: standard deviation\n",
-    "- Min: minimum\n",
-    "- Q1: first quartile\n",
-    "- Median: median\n",
-    "- Q3: third quartile\n",
-    "- Max: maximum\n",
-    "- Sum: total duration"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rank_stats_gdf = rank_stats_by_optype_df.groupby(rank_stats_by_optype_df.index)\n",
- "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Per-rank compute operator duration analysis by operator name\n",
-    "\n",
-    "Compute operators on each rank are aggregated, grouped by operator name, task type, and input shapes, and their durations are analyzed statistically; the time unit is microseconds (us)\n",
-    "\n",
-    "The statistics include:\n",
-    "- Count: number of operators\n",
-    "- Mean: average duration\n",
-    "- Std: standard deviation\n",
-    "- Min: minimum\n",
-    "- Q1: first quartile\n",
-    "- Median: median\n",
-    "- Q3: third quartile\n",
-    "- Max: maximum\n",
-    "- Sum: total duration"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rank_stats_gdf = rank_stats_by_opname_df.groupby(rank_stats_by_opname_df.index)\n",
- "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "name": "python",
- "version": "3.12.1"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/profiler/cluster_analyse/analysis/hccl_sum/__init__.py b/profiler/cluster_analyse/analysis/hccl_sum/__init__.py
deleted file mode 100644
index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/hccl_sum/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/hccl_sum/hccl_sum.py b/profiler/cluster_analyse/analysis/hccl_sum/hccl_sum.py
deleted file mode 100644
index da0c575e4683f1c51c4cf38e89b9c096c484777e..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/hccl_sum/hccl_sum.py
+++ /dev/null
@@ -1,133 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import pandas as pd
-from analysis.base_analysis import BaseRecipeAnalysis
-from common_func.constant import Constant
-from common_func.utils import describe_duration
-from cluster_statistics_export.hccl_sum_export import HcclSumExport
-
-
-class HcclSum(BaseRecipeAnalysis):
-
- TABLE_ALL_RANK_STATS = "HcclAllRankStats"
- TABLE_PER_RANK_STATS = "HcclPerRankStats"
- TABLE_TOP_OP_STATS = "HcclTopOpStats"
-
- TOP_NUM = "top_num"
- DEFAULT_TOP_NUM = 15
-
- def __init__(self, params):
- super().__init__(params)
- print("[INFO] HcclSum init.")
- self.per_rank_stats = None
- self.all_rank_stats = None
- self.top_op_stats = None
- self.top_num = params.get(self.TOP_NUM, self.DEFAULT_TOP_NUM)
-
- @property
- def base_dir(self):
- return os.path.basename(os.path.dirname(__file__))
-
- @staticmethod
- def _mapper_func(data_map, analysis_class):
- df = HcclSumExport(data_map[1], analysis_class).read_export_db()
-
- if df is None or df.empty:
- print(f"[WARNING] There is no stats data in {data_map[1]}.")
- return None
-
- df["Rank"] = data_map[0]
- return df
-
- def mapper_func(self, context):
- return context.wait(
- context.map(
- self._mapper_func,
- self._get_rank_db(),
- analysis_class=self._recipe_name
- )
- )
-
- def reducer_func(self, mapper_res):
- mapper_res = list(filter(lambda df: df is not None, mapper_res))
- if not mapper_res:
- print("[ERROR] Mapper data is None.")
- return
- self.per_rank_stats = pd.concat(
- describe_duration(df.groupby("OpType")["Duration"]).assign(Rank=df["Rank"][0]) for df in mapper_res)
- self.per_rank_stats.sort_values(by=["Rank"], inplace=True)
- all_op_data = pd.concat(mapper_res)
- self.all_rank_stats = describe_duration(all_op_data.groupby("OpType")["Duration"])
- grouped_op_stats = all_op_data.groupby("OpName")
- self.top_op_stats = describe_duration(grouped_op_stats["Duration"]).nlargest(self.top_num, "MeanNs")
- min_rank = []
- max_rank = []
- for op_name in self.top_op_stats.index:
- df = grouped_op_stats.get_group(op_name)
- min_rank.append(df[df["Duration"] == df["Duration"].min()]["Rank"].values[0])
- max_rank.append(df[df["Duration"] == df["Duration"].max()]["Rank"].values[0])
- self.top_op_stats["MinRank"] = min_rank
- self.top_op_stats["MaxRank"] = max_rank
-
- def run(self, context):
- super().run(context)
- if self.top_num <= 0:
-            print(f"[WARNING] HcclSum: top_num is set to an invalid value, "
- f"it will be reset to default value({self.DEFAULT_TOP_NUM}).")
- self.top_num = self.DEFAULT_TOP_NUM
- mapper_res = self.mapper_func(context)
- self.reducer_func(mapper_res)
-
- if self._export_type == "db":
- self.save_db()
- elif self._export_type == "notebook":
- self.save_notebook()
- else:
- print("[ERROR] Unknown export type.")
-
- def save_notebook(self):
- self.dump_data(self.all_rank_stats, os.path.join(self._get_output_dir(), "all_stats.csv"))
- self.dump_data(self.per_rank_stats, os.path.join(self._get_output_dir(), "rank_stats.csv"))
- self.dump_data(self.top_op_stats, os.path.join(self._get_output_dir(), "top_op_stats.csv"))
- self.create_notebook("stats.ipynb")
- self.add_helper_file("cluster_display.py")
-
- def save_db(self):
- self.dump_data(self.all_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_ALL_RANK_STATS)
- self.dump_data(self.per_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_PER_RANK_STATS)
- self.dump_data(self.top_op_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_TOP_OP_STATS)
-
- @classmethod
- def add_parser_argument(cls, parser):
- BaseRecipeAnalysis.add_parser_argument(parser)
- parser.add_argument("--top_num", type=int, help="Duration cost top count", default=cls.DEFAULT_TOP_NUM)
-
- @classmethod
- def parse_argument(cls, args_parsed) -> dict:
- argument_dict = BaseRecipeAnalysis.parse_argument(args_parsed)
- argument_dict.update({
- cls.TOP_NUM: args_parsed.top_num
- })
- return argument_dict
-
- @classmethod
- def get_extra_argument(cls, params) -> dict:
- argument_dict = BaseRecipeAnalysis.get_extra_argument(params)
- argument_dict.update({
- cls.TOP_NUM: params.get(cls.TOP_NUM, cls.DEFAULT_TOP_NUM)
- })
- return argument_dict
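For reference, a minimal standalone sketch of the aggregation `reducer_func` performs above: group durations, compute describe-style statistics, take the top-N by mean, and attach the ranks holding the fastest and slowest instance of each operator. The `describe_duration` stand-in and the sample frame below are illustrative assumptions; the real helper lives in `common_func.utils` and may emit different column names.

```python
import pandas as pd

def describe_duration(grouped):
    # Illustrative stand-in for common_func.utils.describe_duration;
    # column names mirror the "MeanNs"/"SumNs" usage seen above.
    return grouped.agg(
        Count="count", MeanNs="mean", StdNs="std", MinNs="min",
        Q1Ns=lambda s: s.quantile(0.25), MedianNs="median",
        Q3Ns=lambda s: s.quantile(0.75), MaxNs="max", SumNs="sum",
    )

# Sample data shaped like HcclSumExport output plus the mapper's Rank column.
df = pd.DataFrame({
    "OpName": ["allreduce_1", "allreduce_1", "broadcast_1", "broadcast_1"],
    "OpType": ["AllReduce", "AllReduce", "Broadcast", "Broadcast"],
    "Duration": [1200, 1500, 300, 900],
    "Rank": [0, 1, 0, 1],
})

all_rank_stats = describe_duration(df.groupby("OpType")["Duration"])
top_op_stats = describe_duration(df.groupby("OpName")["Duration"]).nlargest(2, "MeanNs")

# Ranks holding the fastest / slowest instance of each top operator.
grouped = df.groupby("OpName")
top_op_stats["MinRank"] = [
    grouped.get_group(op).sort_values("Duration").iloc[0]["Rank"] for op in top_op_stats.index
]
top_op_stats["MaxRank"] = [
    grouped.get_group(op).sort_values("Duration").iloc[-1]["Rank"] for op in top_op_stats.index
]
print(top_op_stats)
```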
diff --git a/profiler/cluster_analyse/analysis/hccl_sum/stats.ipynb b/profiler/cluster_analyse/analysis/hccl_sum/stats.ipynb
deleted file mode 100644
index 87f8c6d736240531e2c28c0cf33df087ecfe38e8..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/hccl_sum/stats.ipynb
+++ /dev/null
@@ -1,162 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# HCCL Summary\n",
- "\n",
- "集群场景Hccl算子数据分析\n",
- "\n",
- "主要包含以下3个统计内容:\n",
- "1. 按算子类型分组的,整个集群通信算子耗时的统计情况\n",
- "2. 按算子类型分组的,每个Rank上通信算子的耗时情况\n",
- "3. 整个集群平均耗时最久的TOP通信算子"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 数据准备"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from IPython.display import display, HTML\n",
- "display(HTML(\"\"))\n",
- "\n",
- "import plotly.offline as pyo\n",
- "\n",
- "def is_lab_notebook():\n",
- " import re\n",
- " import psutil\n",
- " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n",
- "\n",
- "if is_lab_notebook():\n",
- " pyo.init_notebook_mode()\n",
- "\n",
- "import pandas as pd\n",
- "pd.options.plotting.backend = \"plotly\"\n",
- "pd.set_option(\"display.max_rows\", 100)\n",
- "pd.set_option(\"display.width\", 1000)\n",
- "\n",
- "import cluster_display\n",
- "\n",
- "all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n",
- "rank_stats_df = pd.read_csv(\"rank_stats.csv\", index_col=\"OpType\")\n",
- "top_op_stats_df = pd.read_csv(\"top_op_stats.csv\", index_col=\"OpName\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 集群通信算子耗时分析\n",
- "\n",
- "将整个集群所有Rank的通信算子进行汇总,按算子类型分类,统计分析耗时情况,时间单位为微秒(us)\n",
- "\n",
- "包含以下统计项:\n",
- "- Count:算子数量\n",
- "- Mean:平均耗时\n",
- "- Std:标准差\n",
- "- Min:最小值\n",
- "- Q1:四分之一分位数\n",
- "- Median:中位数\n",
- "- Q3:四分之三分位数\n",
- "- Max:最大值\n",
- "- Sum:总耗时"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "display(all_stats_df)\n",
- "fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"Hccl OpType\")\n",
- "fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"Hccl OpType\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 集群Rank通信算子耗时分析\n",
- "\n",
- "将集群内每个Rank的通信算子进行汇总,按算子类型分类,统计分析耗时情况,时间单位为微秒(us)\n",
- "\n",
- "包含以下统计项:\n",
- "- Count:算子数量\n",
- "- Mean:平均耗时\n",
- "- Std:标准差\n",
- "- Min:最小值\n",
- "- Q1:四分之一分位数\n",
- "- Median:中位数\n",
- "- Q3:四分之三分位数\n",
- "- Max:最大值\n",
- "- Sum:总耗时"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "rank_stats_gdf = rank_stats_df.groupby(rank_stats_df.index)\n",
- "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 集群TOP-N通信算子耗时分析\n",
- "\n",
- "统计集群内耗时最多的TOP-N通信算子,时间单位为微秒(us)\n",
- "\n",
- "包含以下统计项:\n",
- "- Count:算子数量\n",
- "- Mean:平均耗时\n",
- "- Std:标准差\n",
- "- Min:最小值\n",
- "- Q1:四分之一分位数\n",
- "- Median:中位数\n",
- "- Q3:四分之三分位数\n",
- "- Max:最大值\n",
- "- Sum:总耗时\n",
- "- MinRank:耗时最少算子所在的Rank\n",
- "- MaxRank:耗时最长算子所在的Rank"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "display(top_op_stats_df)\n",
- "fig_top_op = cluster_display.display_duration_boxplots(None, top_op_stats_df, x_title=\"Hccl OpName\")"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "name": "python",
- "version": "3.12.1"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
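The notebook above assumes `all_stats.csv`, `rank_stats.csv`, and `top_op_stats.csv` sit next to it (they are written by `HcclSum.save_notebook`). A minimal non-interactive equivalent of its per-rank grouping, with the `cluster_display` widgets left out:

```python
import pandas as pd

rank_stats_df = pd.read_csv("rank_stats.csv", index_col="OpType")
# One stats table per operator type, ordered by rank, mirroring what the
# combobox cell renders interactively.
for op_type, group in rank_stats_df.groupby(rank_stats_df.index):
    print(f"== {op_type} ==")
    print(group.sort_values("Rank"))
```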
diff --git a/profiler/cluster_analyse/analysis/host_info_analysis.py b/profiler/cluster_analyse/analysis/host_info_analysis.py
deleted file mode 100644
index 563711080ed3a20923ce73ec595b84892492e9f6..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/host_info_analysis.py
+++ /dev/null
@@ -1,96 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-from analysis.base_analysis import BaseAnalysis
-from common_func.constant import Constant
-from common_func.db_manager import DBManager
-
-
-class HostInfoAnalysis(BaseAnalysis):
-
- TABLE_HOST_INFO = "HOST_INFO"
- TABLE_RANK_DEVICE_MAP = "RANK_DEVICE_MAP"
-
- def __init__(self, param: dict):
- super().__init__(param)
- self.all_rank_host_info = {}
- self.all_rank_device_info = []
-
- def run(self):
- if self.data_type != Constant.DB:
- return
- self.analyze_host_info()
- self.dump_db()
-
- def dump_db(self):
- output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
- result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER)
- conn, curs = DBManager.create_connect_db(result_db)
- if not (conn and curs):
- print(f"[ERROR] Failed to create db {Constant.DB_CLUSTER_COMMUNICATION_ANALYZER}")
- return
- self.dump_host_info(result_db, conn)
- self.dump_rank_device_map(result_db, conn)
- DBManager.destroy_db_connect(conn, curs)
-
- def dump_host_info(self, result_db, db_conn):
- if not self.all_rank_host_info:
- print(f"[WARNING] No host info data be analyzed.")
- return
- DBManager.create_tables(result_db, Constant.TABLE_HOST_INFO)
- save_host_info = list(self.all_rank_host_info.items())
- sql = "insert into {} values ({value})".format(Constant.TABLE_HOST_INFO,
- value="?," * (len(save_host_info[0]) - 1) + "?")
- DBManager.executemany_sql(db_conn, sql, save_host_info)
-
- def dump_rank_device_map(self, result_db, db_conn):
- if not self.all_rank_device_info:
- print(f"[WARNING] No rank device map data be analyzed.")
- return
- self.all_rank_device_info.sort()
- DBManager.create_tables(result_db, Constant.TABLE_RANK_DEVICE_MAP)
- sql = "insert into {} values ({value})".format(Constant.TABLE_RANK_DEVICE_MAP,
- value="?," * (len(self.all_rank_device_info[0]) - 1) + "?")
- DBManager.executemany_sql(db_conn, sql, self.all_rank_device_info)
-
- def analyze_host_info(self):
- print_empty_host_info = ""
- for rank_id, profiling_dir in self.data_map.items():
- host_info = []
- rank_device_info = []
- db_path = os.path.join(profiling_dir, Constant.SINGLE_OUTPUT, f"ascend_pytorch_profiler_{rank_id}.db")
- if (os.path.exists(db_path) and DBManager.check_tables_in_db(db_path, self.TABLE_HOST_INFO)):
- conn, curs = DBManager.create_connect_db(db_path)
- sql = "select * from {0}".format(self.TABLE_HOST_INFO)
- host_info = DBManager.fetch_all_data(curs, sql, is_dict=False)
- DBManager.destroy_db_connect(conn, curs)
- if not (host_info and host_info[0]):
- if not print_empty_host_info:
- print_empty_host_info = f"[WARNING] No {self.TABLE_HOST_INFO} data in {self.data_type} file."
- continue
- if (os.path.exists(db_path) and DBManager.check_tables_in_db(db_path, self.TABLE_RANK_DEVICE_MAP)):
- conn, curs = DBManager.create_connect_db(db_path)
- sql = "select * from {0}".format(self.TABLE_RANK_DEVICE_MAP)
- rank_device_info = DBManager.fetch_all_data(curs, sql, is_dict=False)
- DBManager.destroy_db_connect(conn, curs)
- host_uid, host_name = host_info[0][0], host_info[0][1]
- for idx, data in enumerate(rank_device_info):
- rank_device_info[idx] = list(data) + [host_uid, ]
- self.all_rank_host_info[host_uid] = host_name
- self.all_rank_device_info.extend(rank_device_info)
- if print_empty_host_info:
- print(print_empty_host_info)
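For reference, the batched-insert pattern used by `dump_host_info`/`dump_rank_device_map` above, reduced to plain `sqlite3` (the table name and schema here are illustrative only): build one `?` placeholder per column, then insert all rows in a single `executemany` call.

```python
import sqlite3

rows = [(101, "worker-0"), (102, "worker-1")]  # e.g. (hostUid, hostName) tuples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HOST_INFO (hostUid INTEGER, hostName TEXT)")
# "?," * (n - 1) + "?" yields exactly one placeholder per column: "?,?"
placeholders = "?," * (len(rows[0]) - 1) + "?"
conn.executemany("insert into HOST_INFO values ({})".format(placeholders), rows)
conn.commit()
print(conn.execute("select * from HOST_INFO").fetchall())
conn.close()
```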
diff --git a/profiler/cluster_analyse/analysis/mstx_sum/__init__.py b/profiler/cluster_analyse/analysis/mstx_sum/__init__.py
deleted file mode 100644
index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/mstx_sum/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/mstx_sum/mstx_sum.py b/profiler/cluster_analyse/analysis/mstx_sum/mstx_sum.py
deleted file mode 100644
index 46a0e18abeee5cdd6b058d71e3a1bd2b97e7c29d..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/mstx_sum/mstx_sum.py
+++ /dev/null
@@ -1,204 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import pandas as pd
-from collections import namedtuple
-from analysis.base_analysis import BaseRecipeAnalysis
-from common_func.constant import Constant
-from common_func.utils import describe_duration
-from cluster_statistics_export.mstx_mark_export import MstxMarkExport
-from cluster_statistics_export.mstx_step_export import MstxStepExport
-
-
-MarkInfo = namedtuple("MarkInfo", ["name", "framework_duration", "cann_duration", "device_duration",
- "tid", "start_ns"])
-
-
-def format_mark_info(df: pd.DataFrame, start_idx, stop_idx, name) -> MarkInfo:
- start_series = df.iloc[start_idx]
- stop_series = df.iloc[stop_idx]
- return MarkInfo(
- name=name,
- framework_duration=float(stop_series["framework_ts"]-start_series["framework_ts"]),
- cann_duration=float(stop_series["cann_ts"]-start_series["cann_ts"]),
- device_duration=float(stop_series["device_ts"]-start_series["device_ts"]),
- tid=start_series["tid"],
- start_ns=start_series["cann_ts"]
- )
-
-
-def rename_mark_msg_name(mark_stats_df: pd.DataFrame):
- msg_idx_counter = {}
- for idx, mark_info in enumerate(mark_stats_df.itertuples(index=False)):
- msg_idx_counter.setdefault(mark_info.step_id, {}).setdefault(mark_info.name, []).append(idx)
- for msg_dict in msg_idx_counter.values():
- for msg, idx_list in msg_dict.items():
- if len(idx_list) <= 1:
- continue
- for i, idx in enumerate(idx_list):
- mark_stats_df.loc[idx, 'name'] = f"{msg}_{i}"
-
-
-def compute_step_id(mark_stat, step_stats_df: pd.DataFrame):
- for step_info in step_stats_df.itertuples(index=False):
- if step_info.start_ns <= mark_stat.start_ns <= step_info.end_ns:
- return step_info.step_id
- print(f"[WARNING] {mark_stat.name} is not in any step.")
- return 0
-
-
-def format_columns(df: pd.DataFrame):
- formatted_df = df.rename(
- {
- "framework_duration": "FrameworkDurationNs",
- "cann_duration": "CannDurationNs",
- "device_duration": "DeviceDurationNs",
- "duration": "DurationNs",
- "step_id": "StepId",
- "tid": "Tid",
- "name": "Name"
- },
- axis="columns"
- )
- cols = [col for col in formatted_df.columns if not col.endswith("_ns") and col not in {"Tid"}]
- return formatted_df[cols]
-
-
-class MstxSum(BaseRecipeAnalysis):
-
- TABLE_FRAMEWORK_STATS = "MSTXAllFrameworkStats"
- TABLE_CANN_STATS = "MSTXAllCannStats"
- TABLE_DEVICE_STATS = "MSTXAllDeviceStats"
- TABLE_MARK_STATS = "MSTXMarkStats"
-
- START_SUFFIX = "_start"
- STOP_SUFFIX = "_stop"
-
- def __init__(self, params):
- super().__init__(params)
- print("[INFO] MstxSum init.")
- self.mark_stats = None
- self.all_fwk_stats = None
- self.all_cann_stats = None
- self.all_device_stats = None
-
- @property
- def base_dir(self):
- return os.path.basename(os.path.dirname(__file__))
-
- @staticmethod
- def _mapper_func(data_map, analysis_class):
- step_df = MstxStepExport(data_map[1], analysis_class).read_export_db()
- if step_df is None or step_df.empty:
- step_df = pd.DataFrame({"start_ns": [0], "end_ns": [float("inf")], "step_id": [0]})
- mark_df = MstxMarkExport(data_map[1], analysis_class).read_export_db()
- if mark_df is None or mark_df.empty:
- print(f"[WARNING] There is no mark data in {data_map[1]}.")
- return None
- mark_df["framework_ts"] = mark_df["framework_ts"].astype("int64")
-
- mark_info = {}
- mark_res = []
- mismatch_msg = []
- for idx, row in enumerate(mark_df.itertuples(index=False)):
- if row.msg.endswith(MstxSum.START_SUFFIX):
- msg = row.msg[:-len(MstxSum.START_SUFFIX)]
- mark_info.setdefault(row.tid, {}).setdefault(msg, []).append(idx)
- elif row.msg.endswith(MstxSum.STOP_SUFFIX):
- msg = row.msg[:-len(MstxSum.STOP_SUFFIX)]
- idx_list = mark_info.get(row.tid, {}).get(msg, [])
- if not idx_list:
- mismatch_msg.append((row.msg, idx))
- continue
- start_idx = idx_list.pop()
- mark_res.append(format_mark_info(mark_df, start_idx, idx, msg))
-
- # Record the mark messages that never found a matching start/stop pair
- for msg_info in mark_info.values():
- for msg, idx_list in msg_info.items():
- if not idx_list:
- continue
- mismatch_msg.extend((msg + MstxSum.START_SUFFIX, idx) for idx in idx_list)
- if mismatch_msg:
- mismatch_msg.sort(key=lambda msg: msg[1])
- print(f"[WARNING] The following mark messages do not match anyone in "
- f"rank {data_map[0]}: {','.join(msg[0] for msg in mismatch_msg)}.")
-
- mark_stats_df = pd.DataFrame(mark_res).assign(Rank=data_map[0])
- mark_stats_df["step_id"] = mark_stats_df.apply(compute_step_id, axis=1, step_stats_df=step_df)
- rename_mark_msg_name(mark_stats_df)
- mark_stats_df = format_columns(mark_stats_df).set_index("Name", drop=True)
- return mark_stats_df
-
- def mapper_func(self, context):
- return context.wait(
- context.map(
- self._mapper_func,
- self._get_rank_db(),
- analysis_class=self._recipe_name
- )
- )
-
- def reducer_func(self, mapper_res):
- mapper_res = list(filter(lambda df: df is not None, mapper_res))
- if not mapper_res:
- print("[ERROR] Mapper data is None.")
- return
- self.mark_stats = pd.concat(mapper_res)
- all_fwk_stats = []
- all_cann_stats = []
- all_device_stats = []
- mark_step_df = self.mark_stats.groupby("StepId")
- for step_id, df in mark_step_df:
- name_gdf = df.groupby("Name")
- fwk_stats = describe_duration(name_gdf["FrameworkDurationNs"]).assign(StepId=step_id)
- fwk_stats.sort_values(by=["SumNs"], inplace=True, ascending=False)
- all_fwk_stats.append(fwk_stats)
- cann_stats = describe_duration(name_gdf["CannDurationNs"]).assign(StepId=step_id)
- cann_stats.sort_values(by=["SumNs"], inplace=True, ascending=False)
- all_cann_stats.append(cann_stats)
- device_stats = describe_duration(name_gdf["DeviceDurationNs"]).assign(StepId=step_id)
- device_stats.sort_values(by=["SumNs"], inplace=True, ascending=False)
- all_device_stats.append(device_stats)
- self.all_fwk_stats = pd.concat(all_fwk_stats)
- self.all_cann_stats = pd.concat(all_cann_stats)
- self.all_device_stats = pd.concat(all_device_stats)
-
- def run(self, context):
- super().run(context)
- mapper_res = self.mapper_func(context)
- self.reducer_func(mapper_res)
-
- if self._export_type == "db":
- self.save_db()
- elif self._export_type == "notebook":
- self.save_notebook()
- else:
- print("[ERROR] Unknown export type.")
-
- def save_notebook(self):
- self.dump_data(self.mark_stats, os.path.join(self._get_output_dir(), "mark_stats.csv"))
- self.dump_data(self.all_fwk_stats, os.path.join(self._get_output_dir(), "all_fwk_stats.csv"))
- self.dump_data(self.all_cann_stats, os.path.join(self._get_output_dir(), "all_cann_stats.csv"))
- self.dump_data(self.all_device_stats, os.path.join(self._get_output_dir(), "all_device_stats.csv"))
- self.create_notebook("stats.ipynb")
- self.add_helper_file("cluster_display.py")
-
- def save_db(self):
- self.dump_data(self.mark_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_MARK_STATS)
- self.dump_data(self.all_fwk_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_FRAMEWORK_STATS)
- self.dump_data(self.all_cann_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_CANN_STATS)
- self.dump_data(self.all_device_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_DEVICE_STATS)
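A condensed sketch of the two core ideas in `MstxSum._mapper_func` above: pairing `*_start`/`*_stop` markers per thread with a stack, and assigning each pair to the step whose `[start_ns, end_ns]` window contains its start timestamp. The events and step windows below are invented sample data.

```python
stacks = {}   # (tid, msg) -> stack of start timestamps
pairs = []    # (msg, start_ns, duration_ns)
events = [    # (tid, message, timestamp_ns)
    (1, "fwd_start", 100), (1, "fwd_stop", 250),
    (1, "bwd_start", 300), (1, "bwd_stop", 700),
]
for tid, raw_msg, ts in events:
    if raw_msg.endswith("_start"):
        stacks.setdefault((tid, raw_msg[: -len("_start")]), []).append(ts)
    elif raw_msg.endswith("_stop"):
        msg = raw_msg[: -len("_stop")]
        stack = stacks.get((tid, msg), [])
        if stack:                     # unmatched stops would be reported, as above
            start = stack.pop()
            pairs.append((msg, start, ts - start))

steps = [(0, 280, 0), (281, 1000, 1)]   # (start_ns, end_ns, step_id) windows

def compute_step(start_ns):
    for start, end, step_id in steps:
        if start <= start_ns <= end:
            return step_id
    return 0   # same fallback as compute_step_id above

print([(msg, dur, compute_step(start)) for msg, start, dur in pairs])
# -> [('fwd', 150, 0), ('bwd', 400, 1)]
```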
diff --git a/profiler/cluster_analyse/analysis/mstx_sum/stats.ipynb b/profiler/cluster_analyse/analysis/mstx_sum/stats.ipynb
deleted file mode 100644
index 84672bc72b97b02717c3a4110ab1b4dd827adafd..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/mstx_sum/stats.ipynb
+++ /dev/null
@@ -1,180 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# MSTX Summary\n",
- "\n",
- "集群场景MSTX打点数据分析\n",
- "\n",
- "主要包含以下2个统计内容:\n",
- "1. 按Step分组的,整个集群MSTX打点数据的统计情况\n",
- "2. 按Name分组的,每个Rank上MSTX打点数据的统计情况"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 数据准备"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from IPython.display import display, HTML\n",
- "display(HTML(\"\"))\n",
- "\n",
- "import plotly.offline as pyo\n",
- "\n",
- "def is_lab_notebook():\n",
- " import re\n",
- " import psutil\n",
- " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n",
- "\n",
- "if is_lab_notebook():\n",
- " pyo.init_notebook_mode()\n",
- "\n",
- "import pandas as pd\n",
- "pd.options.plotting.backend = \"plotly\"\n",
- "pd.set_option(\"display.max_rows\", 100)\n",
- "pd.set_option(\"display.width\", 1000)\n",
- "\n",
- "import cluster_display\n",
- "\n",
- "all_fwk_stats_gdf = pd.read_csv(\"all_fwk_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n",
- "all_cann_stats_gdf = pd.read_csv(\"all_cann_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n",
- "all_device_stats_gdf = pd.read_csv(\"all_device_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n",
- "mark_stats_df = pd.read_csv(\"mark_stats.csv\", index_col=\"Name\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 集群MSTX数据分析\n",
- "\n",
- "将整个集群所有Rank的MSTX数据进行汇总,按Step划分,统计分析耗时情况,时间单位为微秒(us)\n",
- "打点数据分为三种:\n",
- "1. 框架侧耗时:Framework Time\n",
- "2. Cann侧耗时:Cann Time\n",
- "3. Device侧耗时:Devcie Time\n",
- "\n",
- "3种数据都包含以下统计项:\n",
- "- Count:数量\n",
- "- Mean:平均耗时\n",
- "- Std:标准差\n",
- "- Min:最小值\n",
- "- Q1:四分之一分位数\n",
- "- Median:中位数\n",
- "- Q3:四分之三分位数\n",
- "- Max:最大值\n",
- "- Sum:总耗时"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "def display_stats_mstx_step_combobox(selected, args):\n",
- " step = selected\n",
- " fwk_stats_gdf, cann_stats_gdf, device_stats_gdf = args\n",
- " fwk_df = fwk_stats_gdf.get_group(step)\n",
- " cann_df = cann_stats_gdf.get_group(step)\n",
- " device_df = device_stats_gdf.get_group(step)\n",
- " figs = []\n",
- " display(HTML(\"
Framework Time Stats
\"))\n", - " display(fwk_df)\n", - " cluster_display.display_duration_boxplots(figs, fwk_df, title=\"Framework Time\", x_title=\"Name\", y_title=\"Time\")\n", - " display(HTML(\"Cann Time Stats
\"))\n", - " display(cann_df)\n", - " cluster_display.display_duration_boxplots(figs, cann_df, title=\"Cann Time\", x_title=\"Name\", y_title=\"Time\")\n", - " display(HTML(\"Device Time Stats
\"))\n", - " display(device_df)\n", - " cluster_display.display_duration_boxplots(figs, device_df, title=\"Device Time\", x_title=\"Name\", y_title=\"Time\")\n", - "\n", - "steps = list(all_fwk_stats_gdf.groups.keys())\n", - "if steps:\n", - " cluster_display.display_stats_optional_combobox(steps, display_stats_mstx_step_combobox, \n", - " [all_fwk_stats_gdf, all_cann_stats_gdf, all_device_stats_gdf], \"Step:\")\n", - "else:\n", - " print(\"There is no step in stats, so no need to display\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 集群Rank MSTX数据分析\n", - "\n", - "将集群内每个Rank的MSTX数据进行汇总,按打点Name分类,统计分析耗时情况,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Name:打点名称\n", - "- FrameworkDuration(Us):框架侧耗时\n", - "- CannDuration(Us):Cann侧耗时\n", - "- DeviceDuration(Us):Device侧耗时\n", - "- Rank:Rank序号\n", - "- StepId:Step序号" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def display_mstx_duration_by_rank(selected, args):\n", - " mark_stats_gdf = args\n", - " df = mark_stats_gdf.get_group(selected).sort_values(\"Rank\")\n", - " display(df)\n", - " fwk_duration = []\n", - " cann_duration = []\n", - " device_duration = []\n", - " step_ids = []\n", - " for step_id, step_df in df.groupby(\"StepId\"):\n", - " fwk_duration.append((step_id, step_df[\"FrameworkDuration(Us)\"].values))\n", - " cann_duration.append((step_id, step_df[\"CannDuration(Us)\"].values))\n", - " device_duration.append((step_id, step_df[\"DeviceDuration(Us)\"].values))\n", - " step_ids.append(step_id)\n", - " fwk_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in fwk_duration], axis=1)\n", - " cann_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in cann_duration], axis=1)\n", - " device_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in device_duration], axis=1)\n", - " figs = []\n", - " ranks = df[\"Rank\"].drop_duplicates()\n", - " cluster_display.display_graph(figs, ranks, fwk_df[step_ids],\n", - " title=\"Framework Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", - " cluster_display.display_graph(figs, ranks, cann_df[step_ids],\n", - " title=\"Cann Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", - " cluster_display.display_graph(figs, ranks, device_df[step_ids],\n", - " title=\"Device Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", - "\n", - "mark_stats_gdf = mark_stats_df.groupby(mark_stats_df.index)\n", - "names = list(mark_stats_gdf.groups.keys())\n", - "if steps:\n", - " cluster_display.display_stats_optional_combobox(names, display_mstx_duration_by_rank, mark_stats_gdf, \"Name:\")\n", - "else:\n", - " print(\"There is no mark name in stats, so no need to display\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.12.1" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/profiler/cluster_analyse/analysis/step_trace_time_analysis.py b/profiler/cluster_analyse/analysis/step_trace_time_analysis.py deleted file mode 100644 index 6a886fffa97b142e8267066117f561154d85b162..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/step_trace_time_analysis.py +++ /dev/null @@ -1,126 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os - -from common_func.db_manager import DBManager -from common_func.constant import Constant -from common_func.file_manager import FileManager -from prof_bean.step_trace_time_bean import StepTraceTimeBean - - -class StepTraceTimeAnalysis: - CLUSTER_TRACE_TIME_CSV = "cluster_step_trace_time.csv" - CLUSTER_TRACE_TIME_TABLE = "ClusterStepTraceTime" - - def __init__(self, param: dict): - self.collection_path = param.get(Constant.COLLECTION_PATH) - self.data_map = param.get(Constant.DATA_MAP) - self.communication_group = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.COMMUNICATION_GROUP) - self.step_time_dict = {} - self.step_data_list = [] - self.data_type = param.get(Constant.DATA_TYPE) - - @staticmethod - def get_max_data_row(data_group_list: list): - if not data_group_list: - return [] - ret = [] - for idx in range(len(data_group_list[0])): - max_val = 0 - for idy in range(len(data_group_list)): - max_val = max(max_val, data_group_list[idy][idx]) - ret.append(max_val) - return ret - - def run(self): - self.load_step_trace_time_data() - self.analyze_step_time() - self.dump_data() - - def dump_data(self): - if not self.step_data_list: - print("[WARNING] Can't get step time info!") - return - if self.data_type == Constant.TEXT: - headers = self.get_headers() - FileManager.create_csv_file(self.collection_path, self.step_data_list, self.CLUSTER_TRACE_TIME_CSV, headers) - else: - output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER) - DBManager.create_tables(result_db, self.CLUSTER_TRACE_TIME_TABLE) - column_len = DBManager.get_table_column_count(result_db, self.CLUSTER_TRACE_TIME_TABLE) - data_len = len(self.step_data_list[0]) - if data_len < column_len: - for data in self.step_data_list: - data.extend([0] * (column_len - data_len)) - conn, cursor = DBManager.create_connect_db(result_db) - sql = "insert into {} values ({value})".format(self.CLUSTER_TRACE_TIME_TABLE, - value="?," * (len(self.step_data_list[0]) - 1) + "?") - DBManager.executemany_sql(conn, sql, self.step_data_list) - DBManager.destroy_db_connect(conn, cursor) - - def load_step_trace_time_data(self): - for rank_id, profiling_dir_path in self.data_map.items(): - if self.data_type == Constant.TEXT: - step_time_file = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.STEP_TIME_CSV) - if os.path.exists(step_time_file): - self.step_time_dict[rank_id] = FileManager.read_csv_file(step_time_file, StepTraceTimeBean) - else: - step_time_file = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, - Constant.DB_COMMUNICATION_ANALYZER) - if (os.path.exists(step_time_file) and - DBManager.check_tables_in_db(step_time_file, Constant.TABLE_STEP_TRACE)): - conn, cursor = DBManager.create_connect_db(step_time_file) - sql = "select * from {0}".format(Constant.TABLE_STEP_TRACE) - data = DBManager.fetch_all_data(cursor, sql, is_dict=False) - 
self.step_time_dict[rank_id] = data - DBManager.destroy_db_connect(conn, cursor) - if not self.step_time_dict.get(rank_id): - print(f"[WARNING] Rank {rank_id} does not have a valid step_trace_time data in {self.data_type} file.") - - def analyze_step_time(self): - for rank_id, data_bean_list in self.step_time_dict.items(): - for data_bean in data_bean_list: - if self.data_type == Constant.TEXT: - self.step_data_list.append([data_bean.step, Constant.RANK, rank_id] + data_bean.row) - else: - self.step_data_list.append([data_bean[0], Constant.RANK, rank_id] + list(data_bean[1:])) - stage_list = self.communication_group.get(Constant.P2P) - if not stage_list: - return - step_group_dict = {} - for data_list in self.step_data_list: - stage_group = tuple() - for stage in stage_list: - if data_list[2] in stage: - stage_group = tuple(stage) - break - key = (data_list[0], stage_group) - step_group_dict.setdefault(key, []).append(data_list[3:]) - - for key, data_group_list in step_group_dict.items(): - if self.data_type == Constant.TEXT: - self.step_data_list.append([key[0], Constant.STAGE, key[1]] + self.get_max_data_row(data_group_list)) - else: - index = "(" + ",".join(str(i) for i in key[1]) + ")" - self.step_data_list.append([key[0], Constant.STAGE, index] + self.get_max_data_row(data_group_list)) - - def get_headers(self): - if self.step_time_dict: - for rank in self.step_time_dict: - if self.step_time_dict.get(rank): - return self.step_time_dict[rank][0].all_headers - return [] diff --git a/profiler/cluster_analyse/cluster_analysis.py b/profiler/cluster_analyse/cluster_analysis.py deleted file mode 100644 index a8d01dcfe348be6b47c0a71099cedab64b6b3e06..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_analysis.py +++ /dev/null @@ -1,148 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import argparse -import os - -from cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor -from cluster_data_preprocess.mindspore_data_preprocessor import MindsporeDataPreprocessor -from communication_group.communication_group_generator import CommunicationGroupGenerator -from common_func.constant import Constant -from common_func.file_manager import FileManager -from common_func.path_manager import PathManager -from common_func import analysis_loader -from analysis.analysis_facade import AnalysisFacade - -COMM_FEATURE_LIST = ['all', 'communication_time', 'communication_matrix'] -ALL_FEATURE_LIST = ['all', 'communication_time', 'communication_matrix', 'cann_api_sum', 'hccl_sum', 'compute_op_sum', - 'mstx_sum'] - - -def get_analysis_args(analysis_class, analysis_args): - parser = argparse.ArgumentParser(description="custom analysis args") - parser.add_argument("--parallel_mode", type=str, help="context mode", default="concurrent") - parser.add_argument("--export_type", type=str, help="export type", default="db") - analysis_class[1].add_parser_argument(parser) - return parser.parse_args(analysis_args) - -def parse_specific_params(analysis_name, analysis_args): - analysis_class = analysis_loader.get_class_from_name(analysis_name) - if not analysis_class: - print("[ERROR] undefined analysis.") - return None - - args_parsed = get_analysis_args(analysis_class, analysis_args) - specific_params = { - Constant.RECIPE_NAME: analysis_class[0], - Constant.RECIPE_CLASS: analysis_class[1], - Constant.PARALLEL_MODE: args_parsed.parallel_mode, - Constant.EXPORT_TYPE: args_parsed.export_type - } - specific_params.update(analysis_class[1].parse_argument(args_parsed)) - return specific_params - -class Interface: - ASCEND_PT = "ascend_pt" - ASCEND_MS = "ascend_ms" - - - def __init__(self, params: dict): - self.collection_path = PathManager.get_realpath(params.get(Constant.COLLECTION_PATH)) - self.analysis_mode = params.get(Constant.ANALYSIS_MODE) - self.data_map = {} - self.communication_group = {} - self.collective_group_dict = {} - self.communication_ops = [] - self.matrix_ops = [] - self.origin_params = params - - def allocate_prof_data(self): - ascend_pt_dirs = [] - ascend_ms_dirs = [] - for root, dirs, files in os.walk(self.collection_path): - for dir_name in dirs: - if dir_name.endswith(self.ASCEND_PT): - ascend_pt_dirs.append(os.path.join(root, dir_name)) - if dir_name.endswith(self.ASCEND_MS): - ascend_ms_dirs.append(os.path.join(root, dir_name)) - pytorch_processor = PytorchDataPreprocessor(ascend_pt_dirs) - pt_data_map = pytorch_processor.get_data_map() - data_type = pytorch_processor.get_data_type() - ms_data_map = MindsporeDataPreprocessor(ascend_ms_dirs).get_data_map() - if pt_data_map and ms_data_map: - print("[ERROR] Can not analyze pytorch and mindspore meantime.") - return [] - return (pt_data_map, data_type) if pt_data_map else (ms_data_map, Constant.TEXT) - - def run(self): - PathManager.check_input_directory_path(self.collection_path) - PathManager.check_path_owner_consistent(self.collection_path) - data_map, data_type = self.allocate_prof_data() - if not data_map: - print("[WARNING] Can not get rank info or profiling data.") - return - if data_type == Constant.INVALID: - print("[ERROR] The current folder contains both DB and other files. Please check.") - return - if self.analysis_mode not in COMM_FEATURE_LIST: - if data_type != Constant.DB: - print("[ERROR] The current analysis node only supports DB as input data. 
Please check.") - return - FileManager.create_output_dir(self.collection_path, is_overwrite=True) - params = { - Constant.COLLECTION_PATH: self.collection_path, - Constant.DATA_MAP: data_map, - Constant.DATA_TYPE: data_type, - Constant.RECIPE_NAME: self.origin_params.get(Constant.RECIPE_NAME, ""), - Constant.RECIPE_CLASS: self.origin_params.get(Constant.RECIPE_CLASS), - Constant.PARALLEL_MODE: self.origin_params.get(Constant.PARALLEL_MODE, ""), - Constant.EXPORT_TYPE: self.origin_params.get(Constant.EXPORT_TYPE, "") - } - params.update(params[Constant.RECIPE_CLASS].get_extra_argument(self.origin_params)) - AnalysisFacade(params).recipe_analyze() - else: - FileManager.create_output_dir(self.collection_path) - params = { - Constant.COLLECTION_PATH: self.collection_path, - Constant.DATA_MAP: data_map, - Constant.ANALYSIS_MODE: self.analysis_mode, - Constant.DATA_TYPE: data_type - } - comm_data_dict = CommunicationGroupGenerator(params).generate() - params[Constant.COMM_DATA_DICT] = comm_data_dict - AnalysisFacade(params).cluster_analyze() - - -def cluster_analysis_main(args=None): - parser = argparse.ArgumentParser(description="cluster analysis module") - parser.add_argument('-d', '--collection_path', type=str, required=True, help="profiling data path") - parser.add_argument('-m', '--mode', choices=ALL_FEATURE_LIST, - default='all', help="different analysis mode") - args_parsed, args_remained = parser.parse_known_args(args=args) - parameter = { - Constant.COLLECTION_PATH: args_parsed.collection_path, - Constant.ANALYSIS_MODE: args_parsed.mode - } - if args_parsed.mode in COMM_FEATURE_LIST: - if args_remained: - print(f"[ERROR] The specific argument {args_remained} is not supported for communication analysis.") - return - else: - parameter.update(parse_specific_params(args_parsed.mode, args_remained)) - Interface(parameter).run() - - -if __name__ == "__main__": - cluster_analysis_main() diff --git a/profiler/cluster_analyse/cluster_data_preprocess/__init__.py b/profiler/cluster_analyse/cluster_data_preprocess/__init__.py deleted file mode 100644 index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/cluster_data_preprocess/data_preprocessor.py b/profiler/cluster_analyse/cluster_data_preprocess/data_preprocessor.py deleted file mode 100644 index 72d65ae6571e68564e46f43463843d1f46a3a69e..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/data_preprocessor.py +++ /dev/null @@ -1,41 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import os -from abc import abstractmethod - - -class DataPreprocessor: - PROFILER_INFO_HEAD = 'profiler_info_' - PROFILER_INFO_EXTENSION = '.json' - - def __init__(self, path_list: list): - self.path_list = path_list - self.data_map = {} - - @abstractmethod - def get_data_map(self): - pass - - def get_rank_id(self, dir_name: str) -> int: - files = os.listdir(dir_name) - for file_name in files: - if file_name.startswith(self.PROFILER_INFO_HEAD) and file_name.endswith(self.PROFILER_INFO_EXTENSION): - rank_id_str = file_name[len(self.PROFILER_INFO_HEAD): -1 * len(self.PROFILER_INFO_EXTENSION)] - try: - rank_id = int(rank_id_str) - except ValueError: - rank_id = -1 - return rank_id - return -1 diff --git a/profiler/cluster_analyse/cluster_data_preprocess/mindspore_data_preprocessor.py b/profiler/cluster_analyse/cluster_data_preprocess/mindspore_data_preprocessor.py deleted file mode 100644 index a3e09983ddb54b972a9e343c1661b5c8b2cbb8c8..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/mindspore_data_preprocessor.py +++ /dev/null @@ -1,41 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from collections import defaultdict - -from cluster_data_preprocess.data_preprocessor import DataPreprocessor - - -class MindsporeDataPreprocessor(DataPreprocessor): - - def __init__(self, path_list: list): - super().__init__(path_list) - - def get_data_map(self) -> dict: - rank_id_map = defaultdict(list) - for dir_name in self.path_list: - rank_id = self.get_rank_id(dir_name) - if rank_id < 0: - print('[Error]fail to get rankid or rankid invalid.') - continue - rank_id_map[rank_id].append(dir_name) - - try: - for (rank_id, dir_list) in rank_id_map.items(): - dir_list.sort(key=lambda x: x.split('_')[-3]) - self.data_map[rank_id] = dir_list[0] - except Exception as e: - raise RuntimeError("Found invalid directory name!") from e - return self.data_map diff --git a/profiler/cluster_analyse/cluster_data_preprocess/pytorch_data_preprocessor.py b/profiler/cluster_analyse/cluster_data_preprocess/pytorch_data_preprocessor.py deleted file mode 100644 index 55c3d03958b97c427fe8fde0625e72ea4dee8997..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/pytorch_data_preprocessor.py +++ /dev/null @@ -1,56 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import glob -from collections import defaultdict -import os - -from cluster_data_preprocess.data_preprocessor import DataPreprocessor -from common_func.constant import Constant -from common_func.file_manager import FileManager - - -class PytorchDataPreprocessor(DataPreprocessor): - - def __init__(self, path_list: list): - super().__init__(path_list) - self.data_type = set() - - def get_data_map(self) -> dict: - rank_id_map = defaultdict(list) - for dir_name in self.path_list: - rank_id = self.get_rank_id(dir_name) - if rank_id < 0: - print('[Error]fail to get rankid or rankid invalid.') - continue - for file_name in os.listdir(dir_name): - if file_name.startswith(self.PROFILER_INFO_HEAD) and file_name.endswith(self.PROFILER_INFO_EXTENSION): - file_path = os.path.join(dir_name, file_name) - config = FileManager.read_json_file(file_path) - self.data_type.add(config.get(Constant.CONFIG, {}).get(Constant.EXPER_CONFIG, {}). - get(Constant.EXPORT_TYPE, Constant.TEXT)) - rank_id_map[rank_id].append(dir_name) - - try: - for (rank_id, dir_list) in rank_id_map.items(): - dir_list.sort(key=lambda x: x.split('_')[-3]) - self.data_map[rank_id] = dir_list[0] - except Exception as e: - raise RuntimeError("Found invalid directory name!") from e - return self.data_map - - def get_data_type(self): - if len(self.data_type) == 1: - return self.data_type.pop() - return Constant.INVALID diff --git a/profiler/cluster_analyse/cluster_kernels_analysis/README.md b/profiler/cluster_analyse/cluster_kernels_analysis/README.md deleted file mode 100644 index f90f99fb9b3058d5ad67728b45da1c07f03e65e5..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_kernels_analysis/README.md +++ /dev/null @@ -1,67 +0,0 @@ -# 功能介绍 -集群场景下,多卡间的算子情况,只能通过查看每张卡各自的性能数据来了解,不能直观的对比各卡之间算子的性能差异。 -cluster_op_summary_analysis.py脚本基于多卡性能数据的op_summary信息,统计并展示各卡中执行最快、最慢、均值和方差的TopN算子。 - -## 交附件 -### cluster_op_time_ analysis.csv -将算子以op_name、input_shape、input_size、output_shape进行分类,统计每一类算子,在不同节点(node)的不同卡(device)上,执行时间的最大、最小、方差、平均时间以及范围。 -### xxx_info.html - -主要是各个特性(time和ratio)的html文件,以html方式展示top_n算子的箱线图。 - -time和ratio表示AI Core和AI Vector Core算子性能指标中的耗时和占比字段。 - -以html文件展示TopN算子执行耗时和占比的箱线图。 - -有TopN个算子就会有TopN个坐标系,每个坐标系表示一个算子的特性,以total_time的平均值从左向右依次向下排序。 - -- 横坐标:node_device表示第几个node的第几张卡,从小到大排序。 -- 纵坐标:时间。 -- 坐标名:在坐标下方,以op_name-input_shape拼接展示。 - -# 操作指导 - -1. 准备性能数据 - - 拷贝所有node上的性能数据到一个环境里,性能数据必须包含在node*目录下,例如当前集群场景为2机16卡,那么就是两个node分别有八个device,拷贝性能数据目录如下: - - ```bash - ├── node0 # 可以是node0或nodeo_xxx,表示某个节点 - │ ├── PROF_XXXXX # 单个device的性能数据,须完成msprof性能数据解析 - │ ├── SUMMARY - │ ├── op_summary_XX.csv - | ...... # 一共八张卡的性能数据 - ├── node1 # 可以是node1 或者node1_xxx表示某个节点 - │ ├── PROF_XXXXX # 单个device的profiling数据 - │ ├── SUMMARY - │ ├── op_summary_XX.csv # 用来做解析的op_summary表格 - | ...... - ``` - -2. 拷贝脚本准备环境 - - 将cluster_prof_Info_analysis.py脚本拷贝到一个文件夹里,并安装对应的Python库。 - - ```bash - pip install pandas - pip install ploty - ``` - -3. 
运行脚本 - - ```bash - python3 cluster_prof_Info_analysis.py –d data_path -t type -n top_n - ``` - - - -d:集群场景性能数据目录,输入node的上一级目录。 - - -t:获取分析信息结果文件类型,可取值:html、csv、all,默认html。 - - -n:html分析独有,表示需要展示的是平均时间top_n的算子,默认10,配置超过30时需要一定时间。 - -异常情况处理: - -- -n参数必须大于0,如果输入<=0, 默认只导出一个算子的数据。 -- 配置-n参数值大于算子总数时,按等于算子数处理。 -- 部分没有op_summary的,不显示也不报错。 -- 目录下不存在op_summary时,执行报错无法找到数据文件。 -- op_summary列数据错误或读不到数据时,提示具体出错文件。 -- -t参数配置错误时,提示输入错误,并提示正确的配置。 diff --git a/profiler/cluster_analyse/cluster_kernels_analysis/__init__.py b/profiler/cluster_analyse/cluster_kernels_analysis/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/profiler/cluster_analyse/cluster_kernels_analysis/cluster_prof_Info_analysis.py b/profiler/cluster_analyse/cluster_kernels_analysis/cluster_prof_Info_analysis.py deleted file mode 100644 index 27e3c229c56d7c2a1afe6ae49d98c96b19bc55ff..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_kernels_analysis/cluster_prof_Info_analysis.py +++ /dev/null @@ -1,327 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import sys -import argparse -import re -import os -import stat -import shutil -import warnings -from pathlib import Path - -import pandas as pd -import plotly.graph_objects as go -from plotly.subplots import make_subplots -from plotly.offline import plot - -sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) - -from common_func.path_manager import PathManager - - -MAX_READ_FILE_BYTES = 64 * 1024 * 1024 - - -class FormDataProcessor: - def __init__(self, path, form_name): - self.form_name = form_name - self.files = self.get_files_with_prefix_recursive(path, form_name) - - def get_files_with_prefix_recursive(self, csv_path, match_str): - matched_ir_files = list(Path(csv_path).rglob(match_str)) - if not matched_ir_files: - msg = f"Didn't find any file in folder {csv_path} that matches {match_str}" - raise RuntimeError(msg) - return [str(item) for item in matched_ir_files] - - def readSummaryData(self, columns_to_keep): - # 存储所有合并后的数据 - all_data = pd.DataFrame() - for f in self.files: - if "mindstudio_profiler_output" in f: - continue - # 判断csv文件大小 - PathManager.check_path_readable(f) - # 读取CSV文件 - df = pd.read_csv(f) - # 保留需要的列 - try: - df = df[columns_to_keep] - except KeyError: - print(f"{f}文件没有所需的列,请确认profiling数据的正确性:\n,以下列可能不存在{columns_to_keep}\n") - continue - # 从文件名提取设备ID - try: - df['device_id'] = self.getDeviceId(f) - except Exception: - print(f"文件 \"{f}\" 的路径或者是文件夹名没有按照要求,请确保存在[device_]这一级文件夹,具体操作指导见readme\n") - continue - # 添加新列 "device_id" - try: - df['node_id'] = self.getNodeId(f) - except Exception: - print(f"文件 \"{f}\" 的路径或者是文件夹名没有按照要求,请确保存在[node*]这一级文件夹,具体操作指导见readme\n") - continue - # 将数据添加到最终的数据框中 - all_data = pd.concat([all_data, df]) - return all_data - - def getChipType(self): - file = self.files[0] - df = pd.read_csv(file) - if 'aiv_time(us)' in 
df.columns: - return "ASCEND_NEW" - return "ASCEND_OTHER" - - def getDeviceId(self, dir_path): - device_id = re.search(r'device_(\d+)', dir_path).group(1) - return device_id - - def getNodeId(self, dir_path): - node_id = re.search(r'node(\d+)', dir_path).group(1) - return int(node_id) - - def getRankNum(self): - return len(self.files) - - -# 表驱动,获取不同芯片类型不同交付件的所需的列 -class ViewInfoManager: - def __init__(self, chip_type): - self.chip_type = chip_type - self.op_summary_columns_dict = {} - self.setOpSummaryColumnsParams() - - def setOpSummaryColumnsParams(self): - # 有些数据除了用表格的列进行分组之外,还添加了其他属性对数据进行分类,这部分数据放在extend_attr_to_group里面 - self.op_summary_columns_dict = { - 'ASCEND_NEW': { - 'TimeToCsvAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - 'extend_attr_to_group': ["device_id", "node_id"], - 'columns_to_view': ["Task Duration(us)"], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - }, - 'StatisticalInfoToHtmlAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - "columns_to_view": ["Task Duration(us)", "aiv_time(us)", "aiv_vec_ratio", - "aiv_scalar_ratio", "aiv_mte2_ratio", "aiv_mte3_ratio", - "aicore_time(us)", "aic_mac_ratio", "aic_scalar_ratio", - "aic_mte1_ratio", "aic_mte2_ratio", "aic_fixpipe_ratio" - ], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - } - }, - 'ASCEND_OTHER': { - 'TimeToCsvAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - 'extend_attr_to_group': ["device_id", "node_id"], - "columns_to_view": ["Task Duration(us)"], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - }, - 'StatisticalInfoToHtmlAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - "columns_to_view": ["aicore_time(us)", "Task Duration(us)", "mac_ratio", "vec_ratio", - "scalar_ratio", "mte1_ratio", "mte2_ratio", "mte3_ratio"], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - } - } - } - - def getColumnsInfo(self, analyzer_type): - return self.op_summary_columns_dict.get(self.chip_type, {}).get(analyzer_type) - - -class OpSummaryAnalyzerBase: - def __init__(self, chip_type, analyzer_type, dir_path): - self.chip_type = chip_type - view_info = ViewInfoManager(chip_type).getColumnsInfo(analyzer_type) - self.columns_to_view = view_info['columns_to_view'] - self.calculate_fun = view_info['calculate_fun'] - self.columns_to_group = view_info['columns_to_group'] - self.attrs_to_group = self.columns_to_group.copy() - if 'extend_attr_to_group' in view_info: - extend_attr_to_group = view_info['extend_attr_to_group'] - self.attrs_to_group.extend(extend_attr_to_group) - # 创建结果文件 - self.result_dir = os.path.join(dir_path, "result") - PathManager.check_path_length(self.result_dir) - if os.path.exists(self.result_dir): - shutil.rmtree(self.result_dir, onerror=self.on_rm_error) - PathManager.check_path_writeable(dir_path) - PathManager.make_dir_safety(self.result_dir) - - def getColumnsToGroup(self): - return self.columns_to_group - - def getColumnsToView(self): - return self.columns_to_view - - def calculateViewData(self, summary_data): - # 存储所有合并后的数据 - calculate_dict = {self.columns_to_view[i]: self.calculate_fun for i in range(len(self.columns_to_view))} - view_data = summary_data.groupby(self.attrs_to_group).agg(calculate_dict).reset_index() - return view_data - - def on_rm_error(self, func, path, exc_info): - # path contains the path of the file that couldn't be removed - # let's just assume that it's read-only and 
unlink it. - os.chmod(path, stat.S_IWRITE) - os.unlink(path) - - -class TimeToCsvAnalyzer(OpSummaryAnalyzerBase): - def __init__(self, chip_type, dir_path): - super().__init__(chip_type, "TimeToCsvAnalyzer", dir_path) - - def GenerateDeliverable(self, summary_data, rank_num): - view_data = self.calculateViewData(summary_data) - # 规范化列名 - view_data.columns = [''.join(col) if col[1] == "" else '_'.join(col) for col in view_data.columns] - try: - for column in self.columns_to_view: - view_data[column + '_range'] = view_data[column + '_max'] - view_data[column + '_min'] - except Exception as e: - raise RuntimeError("Invalid view data!") from e - save_path = os.path.join(self.result_dir, "cluster_duration_time_analysis.csv") - PathManager.check_path_length(save_path) - view_data.to_csv(save_path, index=False) - # 该文件权限设置为只读权限,不允许修改 - os.chmod(save_path, stat.S_IROTH) - return view_data - - -class StatisticalInfoToHtmlAnalyzer(OpSummaryAnalyzerBase): - def __init__(self, chip_type, top_n, dir_path): - super().__init__(chip_type, "StatisticalInfoToHtmlAnalyzer", dir_path) - self.top_n = top_n - # top_n 如果不符合要求,报警告 - - def GenerateDeliverable(self, summary_data, rank_num): - view_data = self.calculateViewData(summary_data) - # 规范化列名 op_name/ --> op_name time/var 这种不变 - view_data.columns = [''.join(col) if col[1] == "" else col for col in view_data.columns] - - # 对使用到的变量进行初始设置 - self.top_n = min(max(self.top_n, 1), len(view_data)) - top_n_data = view_data.sort_values(("Task Duration(us)", 'var'), ascending=False).head(self.top_n) - - for column in self.columns_to_view: - # 分别给每一种特性画图 - self.drawPloty(column, summary_data, top_n_data, rank_num) - - def drawPloty(self, column, summary_data, top_n_data, rank_num): - col_num = self.getCalNum(rank_num) - row_num = self.top_n // col_num if self.top_n % col_num == 0 else (self.top_n + 1) // col_num - fig = make_subplots(rows=row_num, cols=col_num, vertical_spacing=0.03) - for i, (_, operation) in enumerate(top_n_data.iterrows()): - op_data = summary_data[(summary_data["Op Name"] == operation["Op Name"]) & - (summary_data["Input Shapes"] == operation["Input Shapes"]) & - (summary_data["Input Data Types"] == operation["Input Data Types"])] - op_data = op_data.sort_values(by=["node_id", "device_id"]) - node_ids = op_data['node_id'].unique() - device_ids = op_data['device_id'].unique() - - for node_id in node_ids: - for device_id in device_ids: - draw_data = op_data[(op_data['node_id'] == node_id) & (op_data['device_id'] == device_id)] - fig.add_trace(go.Box(y=draw_data[column], - name=f'{node_id}_{device_id}', - marker_color='green', showlegend=False), (i // col_num) + 1, (i % col_num) + 1) - - fig.update_xaxes(title_text=f'{operation["Op Name"]}-{operation["Input Shapes"]}', row=(i // col_num) + 1, - col=(i % col_num) + 1) - fig.update_layout(margin=dict(l=20, r=20, t=20, b=20), - height=int(500 * row_num), - width=int(rank_num * 100 * col_num), - title_text="Op Performance Comparison") - save_plot_path = os.path.join(self.result_dir, column + "_Info.html") - PathManager.check_path_length(save_plot_path) - plot(fig, filename=save_plot_path) - # 该文件权限设置为只读权限,不允许修改 - os.chmod(save_plot_path, stat.S_IROTH) - - def getCalNum(self, rank_num): - # 计算每行应该画多少个子图 - if rank_num <= 16: - return 2 - else: - return 1 - - -class DeliverableGenerator: - def __init__(self, params): - self.dirs = params.get('dir') - self.formProcess = FormDataProcessor(self.dirs, 'op_summary*.csv') - self.analyzers = [] - self.columns_to_keep = [] - self.setAnalyzers(params) - 
self.setColumnsToKeep() - - def run(self): - summary_data = self.formProcess.readSummaryData(self.columns_to_keep) - # 判断summarydata 数据是否为空,如果是空, 说明所有csv读取数据都失败了 - if summary_data.empty: - print("没有符合要求的csv表格数据,请排查您的PROFILING数据") - return - rank_num = self.formProcess.getRankNum() - for analyzer in self.analyzers: - analyzer.GenerateDeliverable(summary_data, rank_num) - - def setAnalyzers(self, params): - chip_type = self.formProcess.getChipType() - # 判断该路径是不是软链接,并修改为绝对路径 - if os.path.islink(params.get('dir')): - print(f"The file: \"{params.get('dir')}\" is link. Please check the path.") - return - prof_path = os.path.realpath(params.get('dir')) - PathManager.input_path_common_check(prof_path) - if params.get('type') == "all": - self.analyzers = [TimeToCsvAnalyzer(chip_type, prof_path), StatisticalInfoToHtmlAnalyzer(chip_type, params.get("top_n"), prof_path)] - elif params.get('type') == "html": - self.analyzers = [StatisticalInfoToHtmlAnalyzer(chip_type, params.get("top_n"), prof_path)] - elif params.get('type') == "csv": - self.analyzers = [TimeToCsvAnalyzer(chip_type, prof_path)] - else: - warnings.warn("参数错误,请输入 all html csv 这三种类型") # 发出一个警告信息 - - - def setColumnsToKeep(self): - columns_to_keep = [] - for analyzer in self.analyzers: - columns_to_keep.extend(analyzer.getColumnsToGroup()) - columns_to_keep.extend(analyzer.getColumnsToView()) - self.columns_to_keep = list(set(columns_to_keep)) - - -def main(): - # 解析命令行参数 - parser = argparse.ArgumentParser() - parser.add_argument("--dir", "-d", default=None, help="root dir of PROF_* data") - parser.add_argument("--top_n", "-n", default=10, help="how many operators to show", type=int) - parser.add_argument("--type", "-t", default='html', help="compare ratio or aicore-time", type=str) - args = parser.parse_args() - params = { - "dir": args.dir, - "top_n": args.top_n, - "type": args.type - } - - deviverable_gen = DeliverableGenerator(params) - deviverable_gen.run() - -if __name__ == "__main__": - main() diff --git a/profiler/cluster_analyse/cluster_statistics_export/__init__.py b/profiler/cluster_analyse/cluster_statistics_export/__init__.py deleted file mode 100644 index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/cluster_statistics_export/cann_api_sum_export.py b/profiler/cluster_analyse/cluster_statistics_export/cann_api_sum_export.py deleted file mode 100644 index 578ee937be57ff8615085bbe1e4ac6ccae81a4e9..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/cann_api_sum_export.py +++ /dev/null @@ -1,65 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. 
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from cluster_statistics_export.stats_export import StatsExport
-
-QUERY = """
-WITH
-    summary as (
-        SELECT
-            name,
-            sum(endNs - startNs) AS duration,
-            count (*) AS num,
-            avg(endNs - startNs) AS avg_duration,
-            min(endNs - startNs) AS min_duration,
-            median(endNs - startNs) AS med_duration,
-            max(endNs - startNs) AS max_duration,
-            stdev(endNs - startNs) AS stdev_duration,
-            lower_quartile(endNs - startNs) AS lower_quartile_duration,
-            upper_quartile(endNs - startNs) AS upper_quartile_duration
-        FROM
-            CANN_API
-        GROUP BY name
-    ),
-    totals AS (
-        SELECT sum(duration) AS total
-        FROM summary
-    )
-SELECT
-    ids.value AS "name",
-    round(summary.duration * 100.0 / (SELECT total FROM totals), 2) AS "durationRatio",
-    summary.duration AS "totalTimeNs",
-    summary.num AS "totalCount",
-    round(summary.avg_duration, 1) AS "averageNs",
-    round(summary.min_duration, 1) AS "minNs",
-    round(summary.lower_quartile_duration, 1) AS "Q1Ns",
-    round(summary.med_duration, 1) AS "medNs",
-    round(summary.upper_quartile_duration, 1) AS "Q3Ns",
-    round(summary.max_duration, 1) AS "maxNs",
-    round(summary.stdev_duration, 1) AS "stdev"
-FROM
-    summary
-LEFT JOIN
-    STRING_IDS AS ids
-    ON ids.id == summary.name
-ORDER BY 2 DESC;
-    """
-
-
-class CannApiSumExport(StatsExport):
-
-    def __init__(self, db_path, recipe_name):
-        super().__init__(db_path, recipe_name)
-        self._query = QUERY
diff --git a/profiler/cluster_analyse/cluster_statistics_export/compute_op_sum_export.py b/profiler/cluster_analyse/cluster_statistics_export/compute_op_sum_export.py
deleted file mode 100644
index d70c696100bc305f8b1e182f7b1f915cf58f274a..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/cluster_statistics_export/compute_op_sum_export.py
+++ /dev/null
@@ -1,49 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
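Editor's note: `median`, `stdev`, `lower_quartile` and `upper_quartile` in the cann_api_sum QUERY above are not SQLite built-ins; they are the custom aggregates defined in common_func/sql_extention_func.py (shown later in this diff) that DBManager registers on each analysis connection via `conn.create_aggregate`. A minimal, self-contained sketch of that mechanism; the in-memory table and sample rows are illustrative, not real profiler data:

```python
import sqlite3
import numpy as np


class Median:
    """Custom aggregate: collects values, returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        self.values.append(value)

    def finalize(self):
        return float(np.median(self.values))


conn = sqlite3.connect(":memory:")  # stand-in for ascend_pytorch_profiler_{rank_id}.db
conn.execute("CREATE TABLE CANN_API (name INTEGER, startNs INTEGER, endNs INTEGER)")
conn.executemany("INSERT INTO CANN_API VALUES (?, ?, ?)",
                 [(1, 0, 120), (1, 200, 290), (1, 400, 560)])
conn.create_aggregate("median", 1, Median)  # name and arity must match the SQL
print(conn.execute("SELECT name, median(endNs - startNs) FROM CANN_API GROUP BY name").fetchall())
# [(1, 120.0)]
```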
-
-from cluster_statistics_export.stats_export import StatsExport
-
-
-QUERY = """
-SELECT
-    NAME_IDS.value AS "OpName",
-    OPTYPE_IDS.value AS "OpType",
-    TASKTYPE_IDS.value AS "TaskType",
-    INPUTSHAPES_IDS.value AS "InputShapes",
-    round(TASK.endNs - TASK.startNs) AS "Duration"
-FROM
-    COMPUTE_TASK_INFO
-LEFT JOIN TASK
-    ON TASK.globalTaskId == COMPUTE_TASK_INFO.globalTaskId
-LEFT JOIN
-    STRING_IDS AS NAME_IDS
-    ON NAME_IDS.id == COMPUTE_TASK_INFO.name
-LEFT JOIN
-    STRING_IDS AS OPTYPE_IDS
-    ON OPTYPE_IDS.id == COMPUTE_TASK_INFO.opType
-LEFT JOIN
-    STRING_IDS AS TASKTYPE_IDS
-    ON TASKTYPE_IDS.id == COMPUTE_TASK_INFO.taskType
-LEFT JOIN
-    STRING_IDS AS INPUTSHAPES_IDS
-    ON INPUTSHAPES_IDS.id == COMPUTE_TASK_INFO.inputShapes
-    """
-
-
-class ComputeOpSumExport(StatsExport):
-
-    def __init__(self, db_path, recipe_name):
-        super().__init__(db_path, recipe_name)
-        self._query = QUERY
diff --git a/profiler/cluster_analyse/cluster_statistics_export/hccl_sum_export.py b/profiler/cluster_analyse/cluster_statistics_export/hccl_sum_export.py
deleted file mode 100644
index f695949de1a92e9a1faff593bc45e52f91582242..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/cluster_statistics_export/hccl_sum_export.py
+++ /dev/null
@@ -1,39 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from cluster_statistics_export.stats_export import StatsExport
-
-
-QUERY = """
-SELECT
-    NAME_IDS.value AS "OpName",
-    TYPE_IDS.value AS "OpType",
-    round(endNs - startNs) AS "Duration"
-FROM
-    COMMUNICATION_OP
-LEFT JOIN
-    STRING_IDS AS TYPE_IDS
-    ON TYPE_IDS.id == COMMUNICATION_OP.opType
-LEFT JOIN
-    STRING_IDS AS NAME_IDS
-    ON NAME_IDS.id == COMMUNICATION_OP.opName
-    """
-
-
-class HcclSumExport(StatsExport):
-
-    def __init__(self, db_path, recipe_name):
-        super().__init__(db_path, recipe_name)
-        self._query = QUERY
diff --git a/profiler/cluster_analyse/cluster_statistics_export/mstx_mark_export.py b/profiler/cluster_analyse/cluster_statistics_export/mstx_mark_export.py
deleted file mode 100644
index ac5355c020042d474963296242b79eb3fd6a8c38..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/cluster_statistics_export/mstx_mark_export.py
+++ /dev/null
@@ -1,57 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
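Editor's note: the export queries above share one convention of the profiler .db layout: strings are interned as integer ids, so every human-readable column is produced by a LEFT JOIN back to STRING_IDS. A toy demonstration with a hypothetical two-table mini-schema (table and column names mirror the queries, the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STRING_IDS (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE COMMUNICATION_OP (opName INTEGER, opType INTEGER, startNs INTEGER, endNs INTEGER);
INSERT INTO STRING_IDS VALUES (1, 'hcom_allReduce_1'), (2, 'allreduce');
INSERT INTO COMMUNICATION_OP VALUES (1, 2, 100, 350);
""")
rows = conn.execute("""
SELECT NAME_IDS.value, TYPE_IDS.value, endNs - startNs
FROM COMMUNICATION_OP
LEFT JOIN STRING_IDS AS TYPE_IDS ON TYPE_IDS.id = COMMUNICATION_OP.opType
LEFT JOIN STRING_IDS AS NAME_IDS ON NAME_IDS.id = COMMUNICATION_OP.opName
""").fetchall()
print(rows)  # [('hcom_allReduce_1', 'allreduce', 250)]
```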
-
-from cluster_statistics_export.stats_export import StatsExport
-
-
-QUERY = """
-WITH
-    FRAMEWORK_API AS (
-        SELECT
-            PYTORCH_API.startNs,
-            CONNECTION_IDS.connectionId
-        FROM
-            PYTORCH_API
-        LEFT JOIN
-            CONNECTION_IDS
-            ON PYTORCH_API.connectionId == CONNECTION_IDS.id
-    )
-SELECT
-    MSG_IDS.value AS "msg",
-    MSTX_EVENTS.startNs AS "cann_ts",
-    TASK.startNs AS "device_ts",
-    FRAMEWORK_API.startNs AS "framework_ts",
-    MSTX_EVENTS.globalTid AS "tid"
-FROM
-    MSTX_EVENTS
-LEFT JOIN
-    TASK
-    ON MSTX_EVENTS.connectionId == TASK.connectionId
-LEFT JOIN
-    FRAMEWORK_API
-    ON MSTX_EVENTS.connectionId == FRAMEWORK_API.connectionId
-LEFT JOIN
-    STRING_IDS AS MSG_IDS
-    ON MSTX_EVENTS.message == MSG_IDS.id
-ORDER BY
-    MSTX_EVENTS.startNs
-    """
-
-
-class MstxMarkExport(StatsExport):
-
-    def __init__(self, db_path, recipe_name):
-        super().__init__(db_path, recipe_name)
-        self._query = QUERY
diff --git a/profiler/cluster_analyse/cluster_statistics_export/mstx_step_export.py b/profiler/cluster_analyse/cluster_statistics_export/mstx_step_export.py
deleted file mode 100644
index c257ce675fe46ea0f7eff2489dd2fe13c846564f..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/cluster_statistics_export/mstx_step_export.py
+++ /dev/null
@@ -1,35 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from cluster_statistics_export.stats_export import StatsExport
-
-
-QUERY = """
-SELECT
-    id AS "step_id",
-    startNs AS "start_ns",
-    endNs AS "end_ns"
-FROM
-    STEP_TIME
-ORDER BY
-    startNs
-    """
-
-
-class MstxStepExport(StatsExport):
-
-    def __init__(self, db_path, recipe_name):
-        super().__init__(db_path, recipe_name)
-        self._query = QUERY
diff --git a/profiler/cluster_analyse/cluster_statistics_export/stats_export.py b/profiler/cluster_analyse/cluster_statistics_export/stats_export.py
deleted file mode 100644
index e6d98f48ef8c4e8032f7611dac163ead3cc5fbe0..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/cluster_statistics_export/stats_export.py
+++ /dev/null
@@ -1,40 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
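Editor's note: MstxMarkExport above lines up three clocks for every mstx mark through connectionId: the framework timestamp, the CANN-level timestamp and the device timestamp. One plausible downstream use is deriving latencies by subtraction; a hedged pandas sketch over rows shaped like the query output, with invented values:

```python
import pandas as pd

# Hypothetical rows shaped like the MstxMarkExport result columns above
df = pd.DataFrame([
    {"msg": "step start", "cann_ts": 1_000, "device_ts": 1_900, "framework_ts": 700, "tid": 42},
    {"msg": "fwd done",   "cann_ts": 5_000, "device_ts": 6_400, "framework_ts": 4_600, "tid": 42},
])
# How long after the CANN-level mark the device actually started
df["launch_latency_ns"] = df["device_ts"] - df["cann_ts"]
# Framework-side overhead before the mark reached CANN
df["framework_overhead_ns"] = df["cann_ts"] - df["framework_ts"]
print(df[["msg", "launch_latency_ns", "framework_overhead_ns"]])
```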
-
-import pandas as pd
-
-from common_func.db_manager import DBManager
-from common_func.constant import Constant
-
-
-class StatsExport:
-
-    def __init__(self, db_path, analysis_class):
-        self._db_path = db_path
-        self._analysis_class = analysis_class
-        self._query = None
-
-    def get_query(self):
-        return self._query
-
-    def read_export_db(self):
-        query = self.get_query()
-        if query is None:
-            print("[ERROR] query is None.")
-            return
-        conn, cursor = DBManager.create_connect_db(self._db_path, Constant.ANALYSIS)
-        data = pd.read_sql(query, conn)
-        DBManager.destroy_db_connect(conn, cursor)
-        return data
diff --git a/profiler/cluster_analyse/cluster_utils/__init__.py b/profiler/cluster_analyse/cluster_utils/__init__.py
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/profiler/cluster_analyse/cluster_utils/data_transfer_adapter.py b/profiler/cluster_analyse/cluster_utils/data_transfer_adapter.py
deleted file mode 100644
index 1f306415fa789ae0dab7d8751b1c240b3433de0d..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/cluster_utils/data_transfer_adapter.py
+++ /dev/null
@@ -1,142 +0,0 @@
-import copy
-
-from common_func.constant import Constant
-from common_func.table_constant import TableConstant
-
-
-class DataTransferAdapter(object):
-    COMM_TIME_TABLE_COLUMN = [TableConstant.START_TIMESTAMP, TableConstant.ELAPSED_TIME, TableConstant.TRANSIT_TIME,
-                              TableConstant.WAIT_TIME, TableConstant.SYNCHRONIZATION_TIME, TableConstant.IDLE_TIME,
-                              TableConstant.SYNCHRONIZATION_TIME_RATIO, TableConstant.WAIT_TIME_RATIO]
-    COMM_TIME_JSON_COLUMN = [Constant.START_TIMESTAMP, Constant.ELAPSE_TIME_MS, Constant.TRANSIT_TIME_MS,
-                             Constant.WAIT_TIME_MS, Constant.SYNCHRONIZATION_TIME_MS, Constant.IDLE_TIME_MS,
-                             Constant.SYNCHRONIZATION_TIME_RATIO, Constant.WAIT_TIME_RATIO]
-    MATRIX_TABLE_COLUMN = [TableConstant.TRANSIT_SIZE, TableConstant.TRANSIT_TIME, TableConstant.BANDWIDTH,
-                           TableConstant.TRANSPORT_TYPE, TableConstant.OPNAME]
-    MATRIX_JSON_COLUMN = [Constant.TRANSIT_SIZE_MB, Constant.TRANSIT_TIME_MS, Constant.BANDWIDTH_GB_S,
-                          Constant.TRANSPORT_TYPE, Constant.OP_NAME]
-    COMM_BD_TABLE_COLUMN = [TableConstant.TRANSIT_SIZE, TableConstant.TRANSIT_TIME, TableConstant.BANDWIDTH,
-                            TableConstant.LARGE_PACKET_RATIO]
-    COMM_BD_JSON_COLUMN = [Constant.TRANSIT_SIZE_MB, Constant.TRANSIT_TIME_MS, Constant.BANDWIDTH_GB_S,
-                           Constant.LARGE_PACKET_RATIO]
-
-    def __init__(self):
-        super().__init__()
-
-    def transfer_comm_from_db_to_json(self, time_info: list, bandwidth_info: list):
-        result = {}
-        if not time_info and not bandwidth_info:
-            return result
-        for time_data in time_info:
-            comm_time = dict()
-            hccl_name = time_data[TableConstant.HCCL_OP_NAME] + "@" + time_data[TableConstant.GROUP_NAME]
-            for key, value in dict(zip(self.COMM_TIME_JSON_COLUMN, self.COMM_TIME_TABLE_COLUMN)).items():
-                if not key.endswith("ratio"):
-                    comm_time[key] = time_data.get(value, 0)
-            result.setdefault(time_data[TableConstant.STEP], {}).setdefault(time_data[TableConstant.TYPE], {}). \
-                setdefault(hccl_name, {})[Constant.COMMUNICATION_TIME_INFO] = comm_time
-        hccl_set = set()
-        for bd_data in bandwidth_info:
-            hccl_name = bd_data[TableConstant.HCCL_OP_NAME] + "@" + bd_data[TableConstant.GROUP_NAME]
-            hccl_set.add(hccl_name)
-        for hccl in hccl_set:
-            comm_bd = dict()
-            for bd_data in bandwidth_info:
-                if hccl == (bd_data[TableConstant.HCCL_OP_NAME] + "@" + bd_data[TableConstant.GROUP_NAME]):
-                    temp_dict = dict()
-                    key_dict = dict(zip(self.COMM_BD_JSON_COLUMN, self.COMM_BD_TABLE_COLUMN))
-                    self.set_value_by_key(temp_dict, bd_data, key_dict)
-                    comm_bd.setdefault(bd_data[TableConstant.TRANSPORT_TYPE], temp_dict).setdefault(
-                        Constant.SIZE_DISTRIBUTION, {})[bd_data[TableConstant.PACKAGE_SIZE]] = \
-                        [bd_data[TableConstant.COUNT], bd_data[TableConstant.TOTAL_DURATION]]
-            result.setdefault(bd_data[TableConstant.STEP], {}).setdefault(bd_data[TableConstant.TYPE], {}). \
-                setdefault(hccl, {})[Constant.COMMUNICATION_BANDWIDTH_INFO] = comm_bd
-        return result
-
-    def transfer_comm_from_json_to_db(self, res_data: dict):
-        res_comm_data, res_bd_data = list(), list()
-
-        def split_comm_time():
-            for rank_id, comm_data in op_data.items():
-                time_data = comm_data.get(Constant.COMMUNICATION_TIME_INFO)
-                res_time = set_only_value(rank_id)
-                for key, value in dict(zip(self.COMM_TIME_TABLE_COLUMN, self.COMM_TIME_JSON_COLUMN)).items():
-                    res_time[key] = time_data.get(value, 0)
-                res_comm_data.append(res_time)
-                bd_data = comm_data.get(Constant.COMMUNICATION_BANDWIDTH_INFO, {})
-                for transport_type, data in bd_data.items():
-                    res_bandwidth = set_only_value(rank_id)
-                    key_dict = dict(zip(self.COMM_BD_TABLE_COLUMN, self.COMM_BD_JSON_COLUMN))
-                    res_bandwidth[TableConstant.TRANSPORT_TYPE] = transport_type
-                    self.set_value_by_key(res_bandwidth, data, key_dict)
-                    for key, value in data.get(Constant.SIZE_DISTRIBUTION, {}).items():
-                        res_bandwidth[TableConstant.PACKAGE_SIZE] = key
-                        res_bandwidth[TableConstant.COUNT] = value[0]
-                        res_bandwidth[TableConstant.TOTAL_DURATION] = value[1]
-                        temp_dict = copy.deepcopy(res_bandwidth)
-                        res_bd_data.append(temp_dict)
-
-        def set_only_value(rank_id):
-            res_dict = dict()
-            res_dict[TableConstant.RANK_SET] = str(rank_set)
-            res_dict[TableConstant.STEP] = step
-            res_dict[TableConstant.RANK_ID] = rank_id
-            res_dict[TableConstant.HCCL_OP_NAME] = op_name.split("@")[0] if "@" in op_name else op_name
-            res_dict[TableConstant.GROUP_NAME] = op_name.split("@")[1] if "@" in op_name else ""
-            return res_dict
-
-        for rank_set, step_dict in res_data.items():
-            for step, op_dict in step_dict.items():
-                for op_name, op_data in op_dict.items():
-                    split_comm_time()
-        return res_comm_data, res_bd_data
-
-    def set_value_by_key(self, src_dict, dst_dict, key_dict):
-        for key, value in key_dict.items():
-            src_dict[key] = dst_dict.get(value, 0)
-
-    def transfer_matrix_from_db_to_json(self, matrix_data: list):
-        result = {}
-        if not matrix_data:
-            return result
-        hccl_set = set()
-        for data in matrix_data:
-            hccl = data[TableConstant.HCCL_OP_NAME] + "@" + data[TableConstant.GROUP_NAME]
-            hccl_set.add(hccl)
-        for hccl in hccl_set:
-            for data in matrix_data:
-                if hccl == (data[TableConstant.HCCL_OP_NAME] + "@" + data[TableConstant.GROUP_NAME]):
-                    key = data[TableConstant.SRC_RANK] + '-' + data[TableConstant.DST_RANK]
-                    temp_dict = dict()
-                    key_dict = dict(zip(self.MATRIX_JSON_COLUMN, self.MATRIX_TABLE_COLUMN))
-                    self.set_value_by_key(temp_dict, data, key_dict)
-                    result.setdefault(data[TableConstant.STEP], {}).setdefault(data[TableConstant.TYPE], {}). \
-                        setdefault(hccl, {}).setdefault(key, temp_dict)
-        return result
-
-    def transfer_matrix_from_json_to_db(self, res_data: dict):
-        result = list()
-
-        def split_matrix_data():
-            for op_name, op_data in op_dict.items():
-                for link_key, link_data in op_data.items():
-                    if "@" in op_name:
-                        hccl_op_name, group_name = op_name.split("@")[0], op_name.split("@")[1]
-                    else:
-                        hccl_op_name, group_name = op_name, ""
-                    matrix_data = {
-                        TableConstant.RANK_SET: str(rank_set),
-                        TableConstant.STEP: step,
-                        TableConstant.HCCL_OP_NAME: hccl_op_name,
-                        TableConstant.GROUP_NAME: group_name,
-                        TableConstant.SRC_RANK: link_key.split("-")[0],
-                        TableConstant.DST_RANK: link_key.split("-")[1]
-                    }
-                    key_dict = dict(zip(self.MATRIX_TABLE_COLUMN, self.MATRIX_JSON_COLUMN))
-                    self.set_value_by_key(matrix_data, link_data, key_dict)
-                    result.append(matrix_data)
-
-        for rank_set, step_dict in res_data.items():
-            for step, op_dict in step_dict.items():
-                split_matrix_data()
-        return result
diff --git a/profiler/cluster_analyse/common_func/__init__.py b/profiler/cluster_analyse/common_func/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/common_func/analysis_loader.py b/profiler/cluster_analyse/common_func/analysis_loader.py
deleted file mode 100644
index 55e7dbc6ea930de7a47799384ffad5daa1328da2..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/analysis_loader.py
+++ /dev/null
@@ -1,38 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
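Editor's note: the adapter above reshapes flat CommAnalyzerMatrix rows into the nested layout of communication_matrix.json. A standalone sketch of the same reshaping, using the raw string keys that TableConstant maps to (see common_func/table_constant.py below); the sample row is invented:

```python
# One flat row as fetched from the CommAnalyzerMatrix table (illustrative values)
rows = [{
    "step": "step1", "type": "collective",
    "hccl_op_name": "allreduce", "group_name": "g0",
    "src_rank": "0", "dst_rank": "1",
    "transit_size": 64.0, "transit_time": 2.0, "bandwidth": 32.0,
    "transport_type": "HCCS", "op_name": "hcom_allReduce_1",
}]
result = {}
for r in rows:
    hccl = r["hccl_op_name"] + "@" + r["group_name"]
    link = r["src_rank"] + "-" + r["dst_rank"]
    # Nest step -> type -> "op@group" -> "src-dst" -> metric dict
    result.setdefault(r["step"], {}).setdefault(r["type"], {}) \
          .setdefault(hccl, {})[link] = {
        "Transit Size(MB)": r["transit_size"],
        "Transit Time(ms)": r["transit_time"],
        "Bandwidth(GB/s)": r["bandwidth"],
        "Transport Type": r["transport_type"],
        "Op Name": r["op_name"],
    }
print(result)
```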
-
-import importlib
-import inspect
-import sys
-
-from common_func.constant import Constant
-from analysis.base_analysis import BaseRecipeAnalysis
-
-def is_analysis_class(obj):
-    return inspect.isclass(obj) and issubclass(obj, BaseRecipeAnalysis) and obj != BaseRecipeAnalysis
-
-def get_class_from_name(analysis_name: str):
-    sys.path.append(Constant.ANALYSIS_PATH)
-    analysis_path = f"analysis.{analysis_name}.{analysis_name}"
-    module = None
-    try:
-        module = importlib.import_module(analysis_path)
-    except Exception as e:
-        print(f"[ERROR] {analysis_path} not found: {e}")
-
-    specific_analysis = inspect.getmembers(module, is_analysis_class)
-    if not specific_analysis:
-        print(f"[ERROR] {analysis_name} not found.")
-        return None
-    return specific_analysis[0]
diff --git a/profiler/cluster_analyse/common_func/constant.py b/profiler/cluster_analyse/common_func/constant.py
deleted file mode 100644
index 80f0374c1d1d9a37204b9583112ce5baa4cf3e95..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/constant.py
+++ /dev/null
@@ -1,118 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-class Constant(object):
-    # dir name
-    FRAMEWORK_DIR = "FRAMEWORK"
-    CLUSTER_ANALYSIS_OUTPUT = "cluster_analysis_output"
-    SINGLE_OUTPUT = "ASCEND_PROFILER_OUTPUT"
-    COMM_JSON = "communication.json"
-    COMM_MATRIX_JSON = "communication_matrix.json"
-    STEP_TIME_CSV = "step_trace_time.csv"
-    KERNEL_DETAILS_CSV = "kernel_details.csv"
-
-    # file authority
-    FILE_AUTHORITY = 0o640
-    DIR_AUTHORITY = 0o750
-    MAX_JSON_SIZE = 1024 * 1024 * 1024 * 10
-    MAX_CSV_SIZE = 1024 * 1024 * 1024 * 5
-    MAX_PATH_LENGTH = 4096
-    MAX_READ_DB_FILE_BYTES = 1024 * 1024 * 1024 * 8
-
-    # communication
-    P2P = "p2p"
-    COLLECTIVE = "collective"
-    STEP_ID = "step_id"
-    RANK_ID = "rank_id"
-    GROUP_NAME = "group_name"
-    COMM_OP_TYPE = "comm_op_type"
-    COMM_OP_NAME = "comm_op_name"
-    COMM_OP_INFO = "comm_op_info"
-    TOTAL_OP_INFO = "Total Op Info"
-    COMMUNICATION_TIME_INFO = "Communication Time Info"
-    START_TIMESTAMP = "Start Timestamp(us)"
-    COMMUNICATION_BANDWIDTH_INFO = "Communication Bandwidth Info"
-    HCOM_SEND = "hcom_send"
-    HCOM_RECEIVE = "hcom_receive"
-    SYNCHRONIZATION_TIME_RATIO = "Synchronization Time Ratio"
-    SYNCHRONIZATION_TIME_MS = "Synchronization Time(ms)"
-    WAIT_TIME_RATIO = "Wait Time Ratio"
-    TRANSIT_TIME_MS = "Transit Time(ms)"
-    TRANSIT_SIZE_MB = "Transit Size(MB)"
-    SIZE_DISTRIBUTION = "Size Distribution"
-    WAIT_TIME_MS = "Wait Time(ms)"
-    OP_NAME = "Op Name"
-    BANDWIDTH_GB_S = "Bandwidth(GB/s)"
-    COMMUNICATION = "communication.json"
-    ELAPSE_TIME_MS = "Elapse Time(ms)"
-    IDLE_TIME_MS = "Idle Time(ms)"
-    LARGE_PACKET_RATIO = "Large Packet Ratio"
-
-    # params
-    DATA_MAP = "data_map"
-    COLLECTIVE_GROUP = "collective_group"
-    COMMUNICATION_OPS = "communication_ops"
-    MATRIX_OPS = "matrix_ops"
-    COLLECTION_PATH = "collection_path"
-    COMMUNICATION_GROUP = "communication_group"
-    TRANSPORT_TYPE = "Transport Type"
-    COMM_DATA_DICT = "comm_data_dict"
-    DATA_TYPE = "data_type"
-    ANALYSIS_MODE = "analysis_mode"
-
-    # step time
-    RANK = "rank"
-    STAGE = "stage"
-
-    # epsilon
-    EPS = 1e-15
-
-    # file suffix
-    JSON_SUFFIX = ".json"
-    CSV_SUFFIX = ".csv"
-
-    # result files type
-    TEXT = "text"
-    DB = "db"
-    INVALID = "invalid"
-
-    # db name
-    DB_COMMUNICATION_ANALYZER = "analysis.db"
-    DB_CLUSTER_COMMUNICATION_ANALYZER = "cluster_analysis.db"
-
-    # db tables
-    TABLE_COMM_ANALYZER_BANDWIDTH = "CommAnalyzerBandwidth"
-    TABLE_COMM_ANALYZER_TIME = "CommAnalyzerTime"
-    TABLE_COMM_ANALYZER_MATRIX = "CommAnalyzerMatrix"
-    TABLE_STEP_TRACE = "StepTraceTime"
-    TABLE_HOST_INFO = "HostInfo"
-    TABLE_RANK_DEVICE_MAP = "RankDeviceMap"
-
-    # data config key
-    CONFIG = "config"
-    EXPER_CONFIG = "experimental_config"
-    EXPORT_TYPE = "_export_type"
-
-    # recipe config
-    ANALYSIS = "analysis"
-    RECIPE_NAME = "recipe_name"
-    RECIPE_CLASS = "recipe_class"
-    PARALLEL_MODE = "parallel_mode"
-    CLUSTER_CUSTOM_ANALYSE_PATH = os.path.abspath(os.path.dirname(__file__))
-    ANALYSIS_PATH = os.path.join(CLUSTER_CUSTOM_ANALYSE_PATH, 'analysis')
-
-    CONCURRENT_MODE = "concurrent"
\ No newline at end of file
diff --git a/profiler/cluster_analyse/common_func/context.py b/profiler/cluster_analyse/common_func/context.py
deleted file mode 100644
index 4e3d544d3769e0c1360790dc1a4c57ca484687b8..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/context.py
+++ /dev/null
@@ -1,85 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
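Editor's note: for reference, the recipe loader shown earlier (common_func/analysis_loader.py) resolves --mode names such as hccl_sum to a module path analysis.<name>.<name> and returns the first (name, class) pair found by inspect.getmembers. A hypothetical usage sketch, assuming the cluster_analyse package and the recipe module are importable:

```python
from common_func.analysis_loader import get_class_from_name

matched = get_class_from_name("hccl_sum")  # one of the --mode recipe names
if matched is not None:
    class_name, recipe_class = matched  # inspect.getmembers yields (name, class) pairs
    print(f"loaded recipe {class_name}: {recipe_class}")
```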
-
-import os
-from functools import partial
-from concurrent import futures
-from common_func.constant import Constant
-
-
-class Context(object):
-    """abstract base class"""
-
-    ctx_map = None
-
-    @classmethod
-    def create_context(cls, mode=Constant.CONCURRENT_MODE):
-        if cls.ctx_map is None:
-            keys = [Constant.CONCURRENT_MODE]
-            values = [ConcurrentContext]
-            cls.ctx_map = dict(zip(keys, values))
-
-        if mode not in cls.ctx_map:
-            raise NotImplementedError("mode must be in {}".format(list(cls.ctx_map.keys())))
-
-        return cls.ctx_map[mode]()
-
-    def __init__(self):
-        print("[INFO] context {} initialized.".format(self._mode))
-
-    def __enter__(self):
-        return self
-
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        self.close()
-        if exc_type is not None:
-            print(f"[ERROR] Failed to exit context: {exc_val}")
-
-    def launch(self, func, *args, **kwargs):
-        raise NotImplementedError
-
-    def map(self, func, *iterables, **kwargs):
-        raise NotImplementedError
-
-    def wait(self, waitable):
-        raise NotImplementedError
-
-
-class ConcurrentContext(Context):
-
-    def __init__(self, executor=None):
-        self._mode = Constant.CONCURRENT_MODE
-        super().__init__()
-        self._custom = executor is None
-        self._executor = executor or futures.ProcessPoolExecutor(max_workers=os.cpu_count())
-
-    def __enter__(self):
-        if self._executor is None:
-            raise RuntimeError("executor is None")
-        return self
-
-    def close(self):
-        if self._custom:
-            self._executor.shutdown(wait=True)
-            self._executor = None
-
-    def launch(self, func, *args, **kwargs):
-        return self._executor.submit(func, *args, **kwargs).result()
-
-    def map(self, func, *iterables, **kwargs):
-        partial_func = partial(func, **kwargs)
-        return list(self._executor.map(partial_func, *iterables))
-
-    def wait(self, waitable):
-        return waitable
\ No newline at end of file
diff --git a/profiler/cluster_analyse/common_func/db_manager.py b/profiler/cluster_analyse/common_func/db_manager.py
deleted file mode 100644
index c0d6ad89be8edd8bbb2a4ee8e0653141550b0129..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/db_manager.py
+++ /dev/null
@@ -1,233 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
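Editor's note: the Context class above is the concurrency seam used by the db recipes (--parallel_mode concurrent). A hedged usage sketch; the worker function and db file names are invented, and the module-level function plus the __main__ guard are needed because ProcessPoolExecutor pickles the callable:

```python
from common_func.context import Context


def profile_rank(db_path):
    # Placeholder per-rank work; a real recipe would query the db here
    return db_path, len(db_path)


if __name__ == "__main__":
    db_files = ["ascend_pytorch_profiler_0.db", "ascend_pytorch_profiler_1.db"]  # hypothetical
    with Context.create_context("concurrent") as ctx:
        results = ctx.map(profile_rank, db_files)  # fans out across a process pool
    print(results)
```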
-
-import os
-import sqlite3
-
-from common_func.constant import Constant
-from common_func.empty_class import EmptyClass
-from common_func.file_manager import check_db_path_valid
-from common_func.tables_config import TablesConfig
-from common_func.sql_extention_func import SqlExtentionAggregateFunc
-
-class DBManager:
-    """
-    class to manage DB operation
-    """
-    FETCH_SIZE = 10000
-    INSERT_SIZE = 10000
-    MAX_ROW_COUNT = 100000000
-
-    @staticmethod
-    def create_connect_db(db_path: str, mode=None) -> tuple:
-        """
-        create and connect database
-        """
-        if check_db_path_valid(db_path, is_create=True):
-            try:
-                conn = sqlite3.connect(db_path)
-            except sqlite3.Error as err:
-                print(f"[ERROR] {err}")
-                return EmptyClass("empty conn"), EmptyClass("empty curs")
-            try:
-                if mode == Constant.ANALYSIS:
-                    try:
-                        for func_name, params_count, class_name in SqlExtentionAggregateFunc:
-                            conn.create_aggregate(func_name, params_count, class_name)
-                    except sqlite3.Error as err:
-                        print(f"[ERROR] {err}")
-                if isinstance(conn, sqlite3.Connection):
-                    curs = conn.cursor()
-                    os.chmod(db_path, Constant.FILE_AUTHORITY)
-                    return conn, curs
-            except sqlite3.Error as err:
-                print(f"[ERROR] {err}")
-                return EmptyClass("empty conn"), EmptyClass("empty curs")
-        return EmptyClass("empty conn"), EmptyClass("empty curs")
-
-    @staticmethod
-    def destroy_db_connect(conn: any, curs: any) -> None:
-        """
-        destroy db connection
-        """
-        try:
-            if isinstance(curs, sqlite3.Cursor):
-                curs.close()
-        except sqlite3.Error as err:
-            print(f"[ERROR] {err}")
-        try:
-            if isinstance(conn, sqlite3.Connection):
-                conn.close()
-        except sqlite3.Error as err:
-            print(f"[ERROR] {err}")
-
-    @staticmethod
-    def judge_table_exists(curs: any, table_name: str) -> any:
-        """
-        judge table exists
-        """
-        if not isinstance(curs, sqlite3.Cursor):
-            return False
-        try:
-            curs.execute("select count(*) from sqlite_master where type='table' and name=?", (table_name,))
-            return curs.fetchone()[0]
-        except sqlite3.Error as err:
-            print("[ERROR] {}".format(err))
-            return False
-
-    @staticmethod
-    def sql_generate_table(table_map: str):
-        header_with_type_begin = "("
-        header_with_type_end = ")"
-        header_with_type_list = []
-        if table_map in TablesConfig.DATA:
-            items = TablesConfig.DATA[table_map]
-            for item in items:
-                if item[0] == "index":
-                    header_with_type_list.append('"' + item[0] + '" ' + item[1].split(",")[0])
-                else:
-                    header_with_type_list.append(item[0] + ' ' + item[1].split(",")[0])
-            header_with_type_begin += ",".join(header_with_type_list)
-            header_with_type_begin += header_with_type_end
-            return header_with_type_begin
-        return ""
-
-    @classmethod
-    def check_tables_in_db(cls, db_path: any, *tables: any) -> bool:
-        if check_db_path_valid(db_path):
-            conn, curs = cls.create_connect_db(db_path)
-            if not (conn and curs):
-                return False
-            res = True
-            for table in tables:
-                if not cls.judge_table_exists(curs, table):
-                    res = False
-                    break
-            cls.destroy_db_connect(conn, curs)
-            return res
-        return False
-
-    @classmethod
-    def create_tables(cls, db_path: any, *tables: any):
-        conn, curs = cls.create_connect_db(db_path)
-        if not (conn and curs):
-            return
-        for table_name in tables:
-            if cls.judge_table_exists(curs, table_name):
-                drop_sql = "drop table {0}".format(table_name)
-                cls.execute_sql(conn, drop_sql)
-            table_map = "{0}Map".format(table_name)
-            header_with_type = cls.sql_generate_table(table_map)
-            sql = "CREATE TABLE IF NOT EXISTS " + table_name + header_with_type
-            cls.execute_sql(conn, sql)
-        cls.destroy_db_connect(conn, curs)
-
-    @classmethod
-    def get_table_column_count(cls, db_path: any, table: any) -> int:
-        conn, curs = cls.create_connect_db(db_path)
-        if not (conn and curs):
-            return 0
-        sql = "SELECT COUNT(*) FROM pragma_table_info('{}')".format(table)
-        res = 0
-        try:
-            curs.execute(sql)
-            res = curs.fetchone()[0]
-        except sqlite3.Error as err:
-            print("[ERROR] {}".format(err))
-        finally:
-            cls.destroy_db_connect(conn, curs)
-        return res
-
-    @staticmethod
-    def execute_sql(conn: any, sql: str, params: any = None) -> bool:
-        """
-        execute sql
-        """
-        try:
-            if isinstance(conn, sqlite3.Connection):
-                if params:
-                    conn.cursor().execute(sql, params)
-                else:
-                    conn.cursor().execute(sql)
-                conn.commit()
-                return True
-        except sqlite3.Error as err:
-            print(f"[ERROR] {err}")
-            return False
-        print("[ERROR] conn is invalid param")
-        return False
-
-    @staticmethod
-    def executemany_sql(conn: any, sql: str, params: any) -> bool:
-        """
-        execute many sql once
-        """
-        try:
-            if isinstance(conn, sqlite3.Connection):
-                conn.cursor().executemany(sql, params)
-                conn.commit()
-                return True
-        except sqlite3.Error as err:
-            print(f"[ERROR] {err}")
-            return False
-        print("[ERROR] conn is invalid param")
-        return False
-
-    @classmethod
-    def fetch_all_data(cls: any, curs: any, sql: str, param: tuple = None, is_dict: bool = True) -> list:
-        """
-        fetch 10000 num of data from db each time to get all data
-        """
-        if not isinstance(curs, sqlite3.Cursor):
-            return []
-        data = []
-        try:
-            if param:
-                res = curs.execute(sql, param)
-            else:
-                res = curs.execute(sql)
-        except sqlite3.Error as err:
-            print(f"[ERROR] {err}")
-            curs.row_factory = None
-            return []
-        try:
-            description = res.description
-            while True:
-                res = curs.fetchmany(cls.FETCH_SIZE)
-                if is_dict:
-                    data += CustomizedDictFactory.generate_dict_from_db(res, description)
-                else:
-                    data += res
-                if len(data) > cls.MAX_ROW_COUNT:
-                    print("[WARNING] The records count in the table exceeds the limit!")
-                if len(res) < cls.FETCH_SIZE:
-                    break
-            return data
-        except sqlite3.Error as err:
-            print(f"[ERROR] {err}")
-            return []
-        finally:
-            curs.row_factory = None
-
-
-class CustomizedDictFactory:
-    @staticmethod
-    def generate_dict_from_db(data_result: any, description: any) -> any:
-        description_set = [i[0] for i in description]
-        res = []
-        for data in data_result:
-            data_dict = dict(zip(description_set, data))
-            res.append(data_dict)
-        return res
diff --git a/profiler/cluster_analyse/common_func/empty_class.py b/profiler/cluster_analyse/common_func/empty_class.py
deleted file mode 100644
index df100d156fa064cca4514260db0b2e843e217d09..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/empty_class.py
+++ /dev/null
@@ -1,20 +0,0 @@
-class EmptyClass:
-
-    def __init__(self: any, info: str = "") -> None:
-        self._info = info
-
-    @classmethod
-    def __bool__(cls: any) -> bool:
-        return False
-
-    @classmethod
-    def __str__(cls: any) -> str:
-        return ""
-
-    @property
-    def info(self: any) -> str:
-        return self._info
-
-    @staticmethod
-    def is_empty() -> bool:
-        return True
diff --git a/profiler/cluster_analyse/common_func/file_manager.py b/profiler/cluster_analyse/common_func/file_manager.py
deleted file mode 100644
index e7e2d5adca37faf5b377bcbe720fdfba84311eca..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/file_manager.py
+++ /dev/null
@@ -1,131 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import csv
-import json
-
-from common_func.constant import Constant
-from common_func.path_manager import PathManager
-
-
-class FileManager:
-    DATA_FILE_AUTHORITY = 0o640
-    DATA_DIR_AUTHORITY = 0o750
-
-    @classmethod
-    def read_csv_file(cls, file_path: str, class_bean: any) -> list:
-        PathManager.check_path_readable(file_path)
-        base_name = os.path.basename(file_path)
-        file_size = os.path.getsize(file_path)
-        if file_size <= 0:
-            return []
-        if file_size > Constant.MAX_CSV_SIZE:
-            raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.")
-        result_data = []
-        try:
-            with open(file_path, newline="") as csv_file:
-                reader = csv.DictReader(csv_file)
-                for row in reader:
-                    result_data.append(class_bean(row))
-        except Exception as e:
-            raise RuntimeError(f"Failed to read the file: {base_name}") from e
-        return result_data
-
-    @classmethod
-    def read_json_file(cls, file_path: str) -> dict:
-        PathManager.check_path_readable(file_path)
-        base_name = os.path.basename(file_path)
-        file_size = os.path.getsize(file_path)
-        if file_size <= 0:
-            return {}
-        if file_size > Constant.MAX_JSON_SIZE:
-            raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.")
-        try:
-            with open(file_path, "r") as json_file:
-                result_data = json.loads(json_file.read())
-        except Exception as e:
-            raise RuntimeError(f"Failed to read the file: {base_name}") from e
-        return result_data
-
-    @classmethod
-    def create_csv_file(cls, profiler_path: str, data: list, file_name: str, headers: list = None) -> None:
-        if not data:
-            return
-        output_path = os.path.join(
-            profiler_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
-        output_file = os.path.join(output_path, file_name)
-        base_name = os.path.basename(output_file)
-        PathManager.check_path_writeable(output_path)
-        try:
-            with os.fdopen(
-                    os.open(output_file, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY),
-                    'w', newline=""
-            ) as file:
-                writer = csv.writer(file)
-                if headers:
-                    writer.writerow(headers)
-                writer.writerows(data)
-        except Exception as e:
-            raise RuntimeError(f"Can't create file: {base_name}") from e
-
-    @classmethod
-    def create_json_file(cls, profiler_path: str, data: dict, file_name: str) -> None:
-        if not data:
-            return
-        output_path = os.path.join(profiler_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
-        output_file = os.path.join(output_path, file_name)
-        base_name = os.path.basename(output_file)
-        PathManager.check_path_writeable(output_path)
-        try:
-            with os.fdopen(
-                    os.open(output_file, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY), 'w'
-            ) as file:
-                file.write(json.dumps(data))
-        except Exception as e:
-            raise RuntimeError(f"Can't create the file: {base_name}") from e
-
-    @classmethod
-    def create_output_dir(cls, collection_path: str, is_overwrite: bool = False) -> None:
-        output_path = os.path.join(
-            collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
-        if is_overwrite:
-            if not os.path.exists(output_path):
-                PathManager.make_dir_safety(output_path)
-            return
-        PathManager.remove_path_safety(output_path)
-        PathManager.make_dir_safety(output_path)
-
-    @classmethod
-    def check_file_size(cls, file_path):
-        suffix = os.path.splitext(file_path)[1]
-        base_name = os.path.basename(file_path)
-        if suffix == Constant.CSV_SUFFIX:
-            limit_size = Constant.MAX_CSV_SIZE
-        else:
-            limit_size = Constant.MAX_JSON_SIZE
-        file_size = os.path.getsize(file_path)
-        if file_size > limit_size:
-            raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.")
-
-
-def check_db_path_valid(path: str, is_create: bool = False, max_size: int = Constant.MAX_READ_DB_FILE_BYTES) -> bool:
-    if os.path.islink(path):
-        print(f'[ERROR] The db file path: {path} is a link. Please check the path.')
-        return False
-    if not is_create and os.path.exists(path) and os.path.getsize(path) > max_size:
-        print(f'[ERROR] The db file: {path} is too large to read. Please check the file.')
-        return False
-    return True
diff --git a/profiler/cluster_analyse/common_func/path_manager.py b/profiler/cluster_analyse/common_func/path_manager.py
deleted file mode 100644
index 7ef7b4c345c024a0980c6ce2d91839b64c351740..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/path_manager.py
+++ /dev/null
@@ -1,200 +0,0 @@
-# Copyright (c) 2023 Huawei Technologies Co., Ltd
-# All rights reserved.
-#
-# Licensed under the BSD 3-Clause License (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# https://opensource.org/licenses/BSD-3-Clause
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
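Editor's note: a short sketch of how the FileManager helpers above are typically combined when emitting a deliverable: create the output directory, then write a csv under it with restricted permissions. The collection path, rows and file name are illustrative, not fixed tool outputs:

```python
from common_func.file_manager import FileManager

collection_path = "/data/cluster_prof"  # hypothetical collection dir
FileManager.create_output_dir(collection_path)
# Writes cluster_analysis_output/<file_name> with 0o640 permissions
FileManager.create_csv_file(
    collection_path,
    data=[["step1", "rank", "0", 123.4]],
    file_name="cluster_step_trace_time.csv",  # example name only
    headers=["Step", "Type", "Index", "Computing"],
)
```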
-
-import os
-import re
-import shutil
-import platform
-
-
-class PathManager:
-    MAX_PATH_LENGTH = 4096
-    MAX_FILE_NAME_LENGTH = 255
-    DATA_FILE_AUTHORITY = 0o640
-    DATA_DIR_AUTHORITY = 0o750
-    WINDOWS = "windows"
-
-    @classmethod
-    def check_input_directory_path(cls, path: str):
-        """
-        Function Description:
-            check whether the path is valid; some businesses can accept a path that does not exist,
-            so the function does not verify whether the path exists
-        Parameter:
-            path: the path to check; whether the incoming path is absolute or relative depends on the business
-        Exception Description:
-            when invalid data throw exception
-        """
-        cls.input_path_common_check(path)
-        base_name = os.path.basename(path)
-        if os.path.isfile(path):
-            msg = f"Invalid input path which is a file path: {base_name}"
-            raise RuntimeError(msg)
-
-    @classmethod
-    def check_input_file_path(cls, path: str):
-        """
-        Function Description:
-            check whether the file path is valid; some businesses can accept a path that does not exist,
-            so the function does not verify whether the path exists
-        Parameter:
-            path: the file path to check; whether the incoming path is absolute or relative depends on the business
-        Exception Description:
-            when invalid data throw exception
-        """
-        cls.input_path_common_check(path)
-        base_name = os.path.basename(path)
-        if os.path.isdir(path):
-            msg = f"Invalid input path which is a directory path: {base_name}"
-            raise RuntimeError(msg)
-
-    @classmethod
-    def check_path_length(cls, path: str):
-        if len(path) > cls.MAX_PATH_LENGTH:
-            raise RuntimeError("Length of input path exceeds the limit.")
-        path_split_list = path.split("/")
-        for sub_path in path_split_list:
-            path_list = sub_path.split("\\")
-            for name in path_list:
-                if len(name) > cls.MAX_FILE_NAME_LENGTH:
-                    raise RuntimeError("Length of file name in the input path exceeds the limit.")
-
-    @classmethod
-    def input_path_common_check(cls, path: str):
-        if len(path) > cls.MAX_PATH_LENGTH:
-            raise RuntimeError("Length of input path exceeds the limit.")
-
-        if os.path.islink(path):
-            msg = "Invalid input path which is a soft link."
-            raise RuntimeError(msg)
-
-        if platform.system().lower() == cls.WINDOWS:
-            pattern = r'(\.|:|\\|/|_|-|\s|[~0-9a-zA-Z\u4e00-\u9fa5])+'
-        else:
-            pattern = r'(\.|/|_|-|\s|[~0-9a-zA-Z])+'
-        if not re.fullmatch(pattern, path):
-            msg = "Invalid input path."
-            raise RuntimeError(msg)
-
-        path_split_list = path.split("/")
-        for sub_path in path_split_list:
-            path_list = sub_path.split("\\")
-            for name in path_list:
-                if len(name) > cls.MAX_FILE_NAME_LENGTH:
-                    raise RuntimeError("Length of file name in the input path exceeds the limit.")
-
-    @classmethod
-    def check_path_owner_consistent(cls, path: str):
-        """
-        Function Description:
-            check whether the path belongs to the process owner
-        Parameter:
-            path: the path to check
-        Exception Description:
-            when invalid path, prompt the user
-        """
-        base_name = os.path.basename(path)
-        if not os.path.exists(path):
-            msg = f"Invalid path: {base_name}"
-            raise RuntimeError(msg)
-        if platform.system().lower() == cls.WINDOWS:
-            return
-        if os.stat(path).st_uid != os.getuid():
-            check_msg = input("The path does not belong to you, do you want to continue? [y/n]")
[y/n]") - if check_msg.lower() != "y": - raise RuntimeError("The user choose not to continue.") - - @classmethod - def check_path_writeable(cls, path): - """ - Function Description: - check whether the path is writable - Parameter: - path: the path to check - Exception Description: - when invalid data throw exception - """ - cls.check_path_owner_consistent(path) - if os.path.islink(path): - msg = f"Invalid path which is a soft link." - raise RuntimeError(msg) - base_name = os.path.basename(path) - if not os.access(path, os.W_OK): - msg = f"The path permission check failed: {base_name}" - raise RuntimeError(msg) - - @classmethod - def check_path_readable(cls, path): - """ - Function Description: - check whether the path is writable - Parameter: - path: the path to check - Exception Description: - when invalid data throw exception - """ - cls.check_path_owner_consistent(path) - if os.path.islink(path): - msg = f"Invalid path which is a soft link." - raise RuntimeError(msg) - base_name = os.path.basename(path) - if not os.access(path, os.R_OK): - msg = f"The path permission check failed: {base_name}" - raise RuntimeError(msg) - - @classmethod - def remove_path_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to remove path: {base_name}" - if os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - try: - shutil.rmtree(path) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def make_dir_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to make directory: {base_name}" - if os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - return - try: - os.makedirs(path, mode=cls.DATA_DIR_AUTHORITY) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def create_file_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to create file: {base_name}" - if os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - return - try: - os.close(os.open(path, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY)) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def get_realpath(cls, path: str) -> str: - if os.path.islink(path): - msg = f"Invalid input path which is a soft link." - raise RuntimeError(msg) - return os.path.realpath(path) diff --git a/profiler/cluster_analyse/common_func/sql_extention_func.py b/profiler/cluster_analyse/common_func/sql_extention_func.py deleted file mode 100644 index 987a0d4365307704d6abf32575a48cc15c0fa33d..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/sql_extention_func.py +++ /dev/null @@ -1,73 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-
-import numpy as np
-
-
-class Median:
-
-    def __init__(self) -> None:
-        self.data = []
-
-    def step(self, value) -> None:
-        self.data.append(value)
-
-    def finalize(self):
-        return np.median(self.data)
-
-
-class LowerQuartile:
-
-    def __init__(self) -> None:
-        self.data = []
-
-    def step(self, value) -> None:
-        self.data.append(value)
-
-    def finalize(self):
-        return np.quantile(self.data, 0.25)
-
-
-class UpperQuartile:
-
-    def __init__(self) -> None:
-        self.data = []
-
-    def step(self, value) -> None:
-        self.data.append(value)
-
-    def finalize(self):
-        return np.quantile(self.data, 0.75)
-
-
-class StandardDeviation:
-
-    def __init__(self) -> None:
-        self.data = []
-
-    def step(self, value) -> None:
-        self.data.append(value)
-
-    def finalize(self):
-        return np.std(self.data)
-
-
-# func_name, params_count, class
-SqlExtentionAggregateFunc = [
-    ('median', 1, Median),
-    ('lower_quartile', 1, LowerQuartile),
-    ('upper_quartile', 1, UpperQuartile),
-    ('stdev', 1, StandardDeviation)
-]
diff --git a/profiler/cluster_analyse/common_func/table_constant.py b/profiler/cluster_analyse/common_func/table_constant.py
deleted file mode 100644
index de6d47e97e5683493905de5353a9978195e87b70..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/table_constant.py
+++ /dev/null
@@ -1,27 +0,0 @@
-class TableConstant:
-
-    RANK_SET = "rank_set"
-    STEP = "step"
-    RANK_ID = "rank_id"
-    TYPE = "type"
-    HCCL_OP_NAME = "hccl_op_name"
-    GROUP_NAME = "group_name"
-    START_TIMESTAMP = "start_timestamp"
-    ELAPSED_TIME = "elapse_time"
-    TRANSIT_TIME = "transit_time"
-    WAIT_TIME = "wait_time"
-    SYNCHRONIZATION_TIME = "synchronization_time"
-    IDLE_TIME = "idle_time"
-    SYNCHRONIZATION_TIME_RATIO = "synchronization_time_ratio"
-    WAIT_TIME_RATIO = "wait_time_ratio"
-    BAND_TYPE = "band_type"
-    TRANSIT_SIZE = "transit_size"
-    BANDWIDTH = "bandwidth"
-    LARGE_PACKET_RATIO = "large_packet_ratio"
-    PACKAGE_SIZE = "package_size"
-    COUNT = "count"
-    TOTAL_DURATION = "total_duration"
-    SRC_RANK = "src_rank"
-    DST_RANK = "dst_rank"
-    TRANSPORT_TYPE = "transport_type"
-    OPNAME = "op_name"
diff --git a/profiler/cluster_analyse/common_func/tables_config.py b/profiler/cluster_analyse/common_func/tables_config.py
deleted file mode 100644
index f010014519f864e627f83b99ad0df26af98af3f9..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/tables_config.py
+++ /dev/null
@@ -1,73 +0,0 @@
-class TablesConfig:
-    DATA = {
-        "ClusterCommAnalyzerTimeMap": [
-            ("rank_set", "TEXT, null"),
-            ("step", "TEXT, null"),
-            ("rank_id", "INTEGER, null"),
-            ("hccl_op_name", "TEXT, null"),
-            ("group_name", "TEXT, null"),
-            ("start_timestamp", "NUMERIC, null"),
-            ("elapsed_time", "NUMERIC, null"),
-            ("transit_time", "NUMERIC, null"),
-            ("wait_time", "NUMERIC, null"),
-            ("synchronization_time", "NUMERIC, null"),
-            ("idle_time", "NUMERIC, null"),
-            ("synchronization_time_ratio", "NUMERIC, null"),
-            ("wait_time_ratio", "NUMERIC, null")
-        ],
-        "CommunicationGroupMap": [
-            ("type", "TEXT, null"),
-            ("rank_set", "TEXT, null")
-        ],
-        "ClusterCommAnalyzerBandwidthMap": [
-            ("rank_set", "TEXT, null"),
-            ("step", "TEXT, null"),
-            ("rank_id", "INTEGER, null"),
-            ("hccl_op_name", "TEXT, null"),
-            ("group_name", "TEXT, null"),
-            ("band_type", "TEXT, null"),
-            ("transit_size", "NUMERIC, null"),
-            ("transit_time", "NUMERIC, null"),
-            ("bandwidth", "NUMERIC, null"),
-            ("large_packet_ratio", "NUMERIC, null"),
-            ("package_size", "NUMERIC, null"),
-            ("count", "NUMERIC, null"),
-            ("total_duration", "NUMERIC, null")
-        ],
-        "ClusterCommAnalyzerMatrixMap": [
-            ("rank_set", "TEXT, null"),
-            ("step", "TEXT, null"),
-            ("hccl_op_name", "TEXT, null"),
-            ("group_name", "TEXT, null"),
-            ("src_rank", "TEXT, null"),
-            ("dst_rank", "TEXT, null"),
-            ("transit_size", "NUMERIC, null"),
-            ("transit_time", "NUMERIC, null"),
-            ("bandwidth", "NUMERIC, null"),
-            ("transport_type", "TEXT, null"),
-            ("op_name", "TEXT, null")
-        ],
-        "ClusterStepTraceTimeMap": [
-            ("step", "TEXT, null"),
-            ("type", "TEXT, null"),
-            ("index", "TEXT, null"),
-            ("computing", "NUMERIC, null"),
-            ("communication_not_overlapped", "NUMERIC, null"),
-            ("overlapped", "NUMERIC, null"),
-            ("communication", "NUMERIC, null"),
-            ("free", "NUMERIC, null"),
-            ("stage", "NUMERIC, null"),
-            ("bubble", "NUMERIC, null"),
-            ("communication_not_overlapped_and_exclude_receive", "NUMERIC, null"),
-            ("preparing", "NUMERIC, null")
-        ],
-        "HostInfoMap": [
-            ("hostUid", "INTEGER, null"),
-            ("hostName", "TEXT, null")
-        ],
-        "RankDeviceMapMap": [
-            ("rankId", "INTEGER, null"),
-            ("deviceId", "INTEGER, null"),
-            ("hostUid", "INTEGER, null")
-        ]
-    }
diff --git a/profiler/cluster_analyse/common_func/utils.py b/profiler/cluster_analyse/common_func/utils.py
deleted file mode 100644
index 0a20a5c237f9f46e7b7425ef4b295dad4656174e..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/common_func/utils.py
+++ /dev/null
@@ -1,73 +0,0 @@
-# Copyright (c) 2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
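Editor's note: TablesConfig above drives DDL generation: DBManager.sql_generate_table looks up the "<Table>Map" entry, joins the (name, type) pairs, and quotes the reserved word index. A small sketch of the generated DDL, assuming the cluster_analyse modules are importable:

```python
from common_func.db_manager import DBManager

ddl = DBManager.sql_generate_table("CommunicationGroupMap")
print(ddl)  # (type TEXT,rank_set TEXT)
# DBManager.create_tables would then execute:
#   CREATE TABLE IF NOT EXISTS CommunicationGroup(type TEXT,rank_set TEXT)
```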
-
-import numpy as np
-import pandas as pd
-
-
-def format_columns(df: pd.DataFrame):
-    formatted_df = df.rename(
-        {
-            "25%": "Q1Ns",
-            "50%": "MedianNs",
-            "75%": "Q3Ns",
-            0.25: "Q1Ns",
-            0.5: "MedianNs",
-            0.75: "Q3Ns",
-            "Q1": "Q1Ns",
-            "Q3": "Q3Ns",
-            "min": "MinNs",
-            "max": "MaxNs",
-            "median": "MedianNs",
-            "sum": "SumNs",
-            "std": "StdNs",
-            "mean": "MeanNs",
-            "count": "Count"
-        },
-        axis="columns"
-    )
-
-    stats_cols = ["Count", "MeanNs", "StdNs", "MinNs", "Q1Ns", "MedianNs", "Q3Ns", "MaxNs", "SumNs"]
-    other_cols = [col for col in formatted_df.columns if col not in stats_cols]
-    return formatted_df[stats_cols + other_cols]
-
-
-def describe_duration(series_groupby):
-    agg_df = series_groupby.agg(["min", "max", "count", "std", "mean", "sum"])
-    quantile_df = series_groupby.quantile([0.25, 0.5, 0.75])
-
-    quantile_df = quantile_df.unstack()
-    quantile_df.columns = ["25%", "50%", "75%"]
-
-    stats_df = pd.merge(agg_df, quantile_df, left_index=True, right_index=True)
-    formatted_df = format_columns(stats_df)
-    formatted_df.index.name = stats_df.index.name
-    return formatted_df
-
-
-def stdev(df, aggregated):
-    if len(df) <= 1:
-        return df["stdevNs"].iloc[0]
-    instance = aggregated["totalCount"].loc[df.name]
-    var_sum = np.dot(df["totalCount"] - 1, df["stdev"] ** 2)
-    deviation = df["averageNs"] - aggregated["averageNs"].loc[df.name]
-    dev_sum = np.dot(df["totalCount"], deviation ** 2)
-    return np.sqrt((var_sum + dev_sum) / (instance - 1))
-
-
-def convert_unit(df: pd.DataFrame, src_unit, dst_unit):
-    # Assumes a 1000x step between src_unit and dst_unit (e.g. ns -> us)
-    df.loc[:, df.columns.str.endswith(src_unit)] = df.loc[:, df.columns.str.endswith(src_unit)].apply(lambda x: x / 1000.0)
-    df = df.rename(columns=lambda x: x.replace(src_unit, "".join(["(", dst_unit, ")"])))
-    return df
diff --git a/profiler/cluster_analyse/communication_group/__init__.py b/profiler/cluster_analyse/communication_group/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
diff --git a/profiler/cluster_analyse/communication_group/base_communication_group.py b/profiler/cluster_analyse/communication_group/base_communication_group.py
deleted file mode 100644
index 55f6801c2875698047849d39fbee3b9827c9ad28..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/base_communication_group.py
+++ /dev/null
@@ -1,228 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-from abc import abstractmethod
-from collections import defaultdict
-from copy import deepcopy
-from multiprocessing import Pool
-
-from common_func.constant import Constant
-from cluster_utils.data_transfer_adapter import DataTransferAdapter
-
-
-class BaseCommunicationGroup:
-    def __init__(self, params: dict):
-        self.collection_path = params.get(Constant.COLLECTION_PATH)
-        self.data_map = params.get(Constant.DATA_MAP)
-        self.data_type = params.get(Constant.DATA_TYPE)
-        self.analysis_mode = params.get(Constant.ANALYSIS_MODE)
-        self.rank_comm_dir_dict = {}
-        self.p2p_link = []
-        self.collective_group_dict = defaultdict(set)
-        self.p2p_comm_group = []
-        self.communication_group = {}
-        self.communication_ops = []
-        self.matrix_ops = []
-        self.adapter = DataTransferAdapter()
-
-    def load_communication_data(self):
-        comm_op_dirs = []
-        for rank_id, profiling_dir_path in self.data_map.items():
-            if self.data_type == Constant.TEXT:
-                comm_dir = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.COMM_JSON)
-                matrix_dir = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.COMM_MATRIX_JSON)
-            else:
-                comm_dir = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.DB_COMMUNICATION_ANALYZER)
-                matrix_dir = comm_dir
-            if os.path.exists(comm_dir) or os.path.exists(matrix_dir):
-                comm_op_dirs.append((rank_id, comm_dir, matrix_dir))
-            else:
-                print(
-                    f"[WARNING] Rank {rank_id} has no valid communication data or communication_matrix data.")
-        max_processes = int(os.cpu_count() / 2)
-        with Pool(processes=max_processes) as p:
-            self.rank_comm_dir_dict = p.map(self.read_communication_func, comm_op_dirs)
-
-    def set_p2p_groups(self):
-        self.p2p_link = sorted(self.p2p_link, key=lambda x: min(x))
-        while self.p2p_link:
-            union_set = deepcopy(self.p2p_link[0])
-            rm_list = [self.p2p_link[0]]
-            for idx, link_rank_set_x in enumerate(self.p2p_link[1:]):
-                if UnionFind.is_connected(link_rank_set_x, union_set):
-                    union_set = union_set.union(link_rank_set_x)
-                    rm_list.append(link_rank_set_x)
-            self.p2p_comm_group.append(union_set)
-            self.p2p_link = [element for element in self.p2p_link if element not in rm_list]
-
-    def generate_collective_communication_group(self):
-        self.communication_group[Constant.COLLECTIVE] = \
-            [list(group) for group_name, group in self.collective_group_dict.items()]
-
-    def generate_p2p_communication_group(self):
-        stage_group = {}
-        for group_name, rank_set in self.collective_group_dict.items():
-            if not self.whether_valid_comm_group(rank_set):
-                continue
-            unioned_set = set()
-            remove_key = []
-            for first_rank, stage in stage_group.items():
-                if UnionFind.is_connected(rank_set, stage):
-                    unioned_set = UnionFind.union(rank_set, stage, unioned_set)
-                    remove_key.append(first_rank)
-            if unioned_set:
-                for key in remove_key:
-                    del stage_group[key]
-                stage_group[min(unioned_set)] = unioned_set
-            else:
-                stage_group[min(rank_set)] = rank_set
-        first_rank_sort_list = sorted([first_rank for first_rank in stage_group])
-        self.communication_group[Constant.P2P] = \
-            [list(stage_group.get(first_rank, {})) for first_rank in first_rank_sort_list]
-    def whether_valid_comm_group(self, rank_set: set):
-        """
-        When deciding which communication groups can be used to infer stage info,
-        ignore any group that contains more than one rank from a single p2p group.
-        """
-        for p2p_rank_set in self.p2p_comm_group:
-            if len(rank_set.intersection(p2p_rank_set)) > 1:
-                return False
-        return True
-
-    @abstractmethod
-    def read_communication_func(self, params: tuple):
-        pass
-
-    def analyze_communication_data(self):
-        for rank_id, rank_id_comm_dict, rank_id_matrix_dict in self.rank_comm_dir_dict:
-            for step_id, step_id_dict in rank_id_comm_dict.items():
-                if not isinstance(step_id_dict, dict):
-                    print(f"[WARNING] rank{rank_id}'s communication.json has an invalid data structure.")
-                    continue
-                self.get_collective_ops_name(rank_id, step_id_dict.get(Constant.COLLECTIVE))
-                for comm_op_type, comm_op_dict in step_id_dict.items():
-                    self.add_communication_ops(rank_id, step_id, comm_op_type, comm_op_dict)
-
-            for step_id, step_id_dict in rank_id_matrix_dict.items():
-                if not isinstance(step_id_dict, dict):
-                    print(f"[WARNING] rank{rank_id}'s communication_matrix.json has an invalid data structure.")
-                    continue
-                self.set_p2p_link(rank_id, step_id, rank_id_matrix_dict)
-                self.get_collective_ops_name(rank_id, step_id_dict.get(Constant.COLLECTIVE))
-
-    @abstractmethod
-    def dump_data(self):
-        pass
-
-    def collect_comm_data(self):
-        comm_data_dict = {
-            Constant.COLLECTIVE_GROUP: self.collective_group_dict,
-            Constant.COMMUNICATION_OPS: self.communication_ops,
-            Constant.MATRIX_OPS: self.matrix_ops,
-            Constant.COMMUNICATION_GROUP: self.communication_group
-        }
-        return comm_data_dict
-
-    def generate(self):
-        self.load_communication_data()
-        self.analyze_communication_data()
-        self.set_p2p_groups()
-        self.generate_collective_communication_group()
-        self.generate_p2p_communication_group()
-        self.dump_data()
-        return self.collect_comm_data()
-
-    def set_p2p_link(self, rank_id: int, step_id: str, rank_id_matrix_dict: dict):
-        ops = rank_id_matrix_dict.get(step_id, {})
-        self.add_matrix_ops(rank_id, step_id, ops)
-        if not ops:
-            print(f"[WARNING] rank{rank_id} {step_id} does not have communication matrix ops data.")
-            return
-        p2p_ops = ops.get(Constant.P2P, {})
-        for op_name, link_dict in p2p_ops.items():
-            self.append_p2p_link(op_name, link_dict)
-
-    def append_p2p_link(self, op_name, link_dict):
-        for link in link_dict:
-            if '-' not in link:
-                print(f"[WARNING] {op_name} has an invalid link key {link}!")
-                continue
-            src_rank = int(link.split('-')[0])
-            dst_rank = int(link.split('-')[1])
-            if src_rank != dst_rank:
-                rank_set = {src_rank, dst_rank}
-                if rank_set in self.p2p_link:
-                    continue
-                self.p2p_link.append(rank_set)
-
-    def get_collective_ops_name(self, rank_id: int, comm_op_dict: dict):
-        for comm_op in comm_op_dict:
-            if comm_op.startswith('Total'):
-                continue
-            group_name = comm_op.split('@')[-1]
-            self.collective_group_dict[group_name].add(rank_id)
-
-    def add_communication_ops(self, rank_id: int, step_id: str, comm_op_type: str, comm_op_dict: dict):
-        for comm_op in comm_op_dict:
-            if comm_op.startswith('Total'):
-                continue
-            group_name = comm_op.split('@')[-1]
-            self.communication_ops.append({
-                Constant.RANK_ID: rank_id,
-                Constant.STEP_ID: step_id,
-                Constant.COMM_OP_TYPE: comm_op_type,
-                Constant.COMM_OP_NAME: comm_op,
-                Constant.GROUP_NAME: group_name,
-                Constant.COMM_OP_INFO: comm_op_dict.get(comm_op)
-            })
-
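The parsers above rely on one naming convention: every communication op key has the form `<op_name>@<group_name>`, and keys starting with `Total` are aggregate rows to skip. A small sketch with hypothetical keys:

```python
from collections import defaultdict

# Hypothetical op keys in the "<op_name>@<group_name>" convention.
comm_ops = {
    "hcom_allReduce__258_0@3914xxxx": {},
    "Total Op Info": {},              # aggregate row, skipped
}

collective_group = defaultdict(set)
rank_id = 0
for op_name in comm_ops:
    if op_name.startswith("Total"):
        continue
    group_name = op_name.split("@")[-1]
    collective_group[group_name].add(rank_id)

print(dict(collective_group))  # {'3914xxxx': {0}}
```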
-    def add_matrix_ops(self, rank_id: int, step_id: str, step_id_dict: dict):
-        for comm_op_type, comm_dict in step_id_dict.items():
-            if comm_op_type not in (Constant.COLLECTIVE, Constant.P2P):
-                print("[WARNING] Unknown communication operator type!")
-                continue
-            for op_name, op_link_info in comm_dict.items():
-                if op_name.startswith('Total'):
-                    continue
-                group_name = op_name.split('@')[-1]
-                self.matrix_ops.append({
-                    Constant.RANK_ID: rank_id,
-                    Constant.STEP_ID: step_id,
-                    Constant.COMM_OP_TYPE: comm_op_type,
-                    Constant.COMM_OP_NAME: op_name,
-                    Constant.GROUP_NAME: group_name,
-                    Constant.COMM_OP_INFO: op_link_info
-                })
-
-
-class UnionFind(object):
-    """Disjoint Set Union"""
-
-    @classmethod
-    def union(cls, first_set: set, second_set: set, third_set: set):
-        """return the union of the three sets"""
-        return first_set | second_set | third_set
-
-    @classmethod
-    def is_connected(cls, first_set: set, second_set: set):
-        """check whether first_set and second_set share at least one element"""
-        return bool(first_set & second_set)
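`UnionFind` here is a thin set-algebra helper rather than a classic parent-pointer disjoint-set structure: connectivity is set intersection, and union is a three-way `|`. A usage sketch with hypothetical ranks, assuming `UnionFind` from `base_communication_group.py` is in scope:

```python
tp_group = {0, 1, 2, 3}   # a collective group observed in the data (hypothetical)
stage = {0, 1}            # a stage accumulated so far (hypothetical)
merged = set()

if UnionFind.is_connected(tp_group, stage):
    merged = UnionFind.union(tp_group, stage, merged)

print(merged)  # {0, 1, 2, 3}
```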
diff --git a/profiler/cluster_analyse/communication_group/communication_db_group.py b/profiler/cluster_analyse/communication_group/communication_db_group.py
deleted file mode 100644
index 510dcd971357dfb4798e4d284a72fbb3f3a21859..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/communication_db_group.py
+++ /dev/null
@@ -1,57 +0,0 @@
-import os
-
-from common_func.db_manager import DBManager
-from common_func.constant import Constant
-from communication_group.base_communication_group import BaseCommunicationGroup
-
-
-class CommunicationDBGroup(BaseCommunicationGroup):
-    COMMUNICATION_GROUP_TABLE = "CommunicationGroup"
-
-    def __init__(self, params: dict):
-        super().__init__(params)
-
-    def read_communication_func(self, params: tuple):
-        if len(params) < 3:
-            return -1, {}, {}
-        rank_id = params[0]
-        db_path = params[1]
-        time_data = []
-        bandwidth_data = []
-        matrix_data = []
-        if os.path.exists(db_path):
-            conn, cursor = DBManager.create_connect_db(db_path)
-            time_info_sql = "select * from {0}".format(Constant.TABLE_COMM_ANALYZER_TIME)
-            bandwidth_info_sql = "select * from {0}".format(Constant.TABLE_COMM_ANALYZER_BANDWIDTH)
-            matrix_info_sql = "select * from {0}".format(Constant.TABLE_COMM_ANALYZER_MATRIX)
-            if (DBManager.check_tables_in_db(db_path, Constant.TABLE_COMM_ANALYZER_TIME,
-                                             Constant.TABLE_COMM_ANALYZER_BANDWIDTH)
-                    and self.analysis_mode in ["all", "communication_time"]):
-                time_data = DBManager.fetch_all_data(cursor, time_info_sql)
-                bandwidth_data = DBManager.fetch_all_data(cursor, bandwidth_info_sql)
-            if (DBManager.check_tables_in_db(db_path, Constant.TABLE_COMM_ANALYZER_MATRIX)
-                    and self.analysis_mode in ["all", "communication_matrix"]):
-                matrix_data = DBManager.fetch_all_data(cursor, matrix_info_sql)
-            DBManager.destroy_db_connect(conn, cursor)
-        comm_data = self.adapter.transfer_comm_from_db_to_json(time_data, bandwidth_data)
-        comm_matrix_data = self.adapter.transfer_matrix_from_db_to_json(matrix_data)
-        return rank_id, comm_data, comm_matrix_data
-
-    def dump_data(self):
-        output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
-        result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER)
-        res = []
-        for data_type, data_list in self.communication_group.items():
-            for data in data_list:
-                rank_set = "(" + ",".join(str(i) for i in data) + ")"
-                res.append([data_type, rank_set])
-        if res:
-            DBManager.create_tables(result_db, self.COMMUNICATION_GROUP_TABLE)
-            conn, cursor = DBManager.create_connect_db(result_db)
-            sql = "insert into {} values ({value})".format(self.COMMUNICATION_GROUP_TABLE,
-                                                           value="?," * (len(res[0]) - 1) + "?")
-            DBManager.executemany_sql(conn, sql, res)
-            DBManager.destroy_db_connect(conn, cursor)
-        else:
-            print("[WARNING] The CommunicationGroup table won't be created because no data has been calculated.")
diff --git a/profiler/cluster_analyse/communication_group/communication_group_generator.py b/profiler/cluster_analyse/communication_group/communication_group_generator.py
deleted file mode 100644
index 3dca90454b608fe3ffb1c365854c2aa3950b6cee..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/communication_group_generator.py
+++ /dev/null
@@ -1,32 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from common_func.constant import Constant
-from communication_group.communication_db_group import CommunicationDBGroup
-from communication_group.communication_json_group import CommunicationJsonGroup
-
-
-class CommunicationGroupGenerator:
-
-    GROUP_MAP = {
-        Constant.DB: CommunicationDBGroup,
-        Constant.TEXT: CommunicationJsonGroup
-    }
-
-    def __init__(self, params: dict):
-        self.processor = self.GROUP_MAP.get(params.get(Constant.DATA_TYPE))(params)
-
-    def generate(self):
-        return self.processor.generate()
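`dump_data` builds its `insert` statement from `?` placeholders sized to the row width and then binds rows with `executemany`, which avoids quoting values into the SQL string by hand. The same pattern with plain `sqlite3` (the table schema and rows are hypothetical; the tool routes this through `DBManager`):

```python
import sqlite3

rows = [["collective", "(0,1,2,3)"], ["p2p", "(0,4)"]]
placeholders = "?," * (len(rows[0]) - 1) + "?"            # -> "?,?"
sql = "insert into CommunicationGroup values ({})".format(placeholders)

conn = sqlite3.connect(":memory:")
conn.execute("create table CommunicationGroup (type text, rank_set text)")
conn.executemany(sql, rows)   # parameters are bound, not string-interpolated
print(conn.execute("select * from CommunicationGroup").fetchall())
conn.close()
```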
diff --git a/profiler/cluster_analyse/communication_group/communication_json_group.py b/profiler/cluster_analyse/communication_group/communication_json_group.py
deleted file mode 100644
index f6e01e3abfde4d8f180043a5bf9a50c6b5a4964c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/communication_json_group.py
+++ /dev/null
@@ -1,44 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-from common_func.constant import Constant
-from common_func.file_manager import FileManager
-from communication_group.base_communication_group import BaseCommunicationGroup
-
-
-class CommunicationJsonGroup(BaseCommunicationGroup):
-    COMMUNICATION_GROUP_JSON = "communication_group.json"
-
-    def __init__(self, params: dict):
-        super().__init__(params)
-
-    def dump_data(self):
-        FileManager.create_json_file(self.collection_path, self.communication_group, self.COMMUNICATION_GROUP_JSON)
-
-    def read_communication_func(self, params: tuple):
-        if len(params) < 3:
-            return -1, {}, {}
-        rank_id = params[0]
-        comm_json_path = params[1]
-        matrix_json_path = params[2]
-        comm_data = {}
-        matrix_data = {}
-        if os.path.exists(comm_json_path) and self.analysis_mode in ["all", "communication_time"]:
-            comm_data = FileManager.read_json_file(comm_json_path)
-        if os.path.exists(matrix_json_path) and self.analysis_mode in ["all", "communication_matrix"]:
-            matrix_data = FileManager.read_json_file(matrix_json_path)
-        return rank_id, comm_data, matrix_data
diff --git a/profiler/cluster_analyse/prof_bean/__init__.py b/profiler/cluster_analyse/prof_bean/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/prof_bean/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
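`CommunicationGroupGenerator.GROUP_MAP` dispatches the whole pipeline on the data type; note that `GROUP_MAP.get(...)(params)` raises a bare `TypeError` when the type is unknown. A sketch of the same dispatch with an explicit guard (`make_processor` is a hypothetical helper; the imports are the original module's):

```python
from common_func.constant import Constant
from communication_group.communication_db_group import CommunicationDBGroup
from communication_group.communication_json_group import CommunicationJsonGroup

GROUP_MAP = {
    Constant.DB: CommunicationDBGroup,
    Constant.TEXT: CommunicationJsonGroup,
}


def make_processor(params: dict):
    # Fail with a descriptive error instead of calling None on an unknown type.
    processor_cls = GROUP_MAP.get(params.get(Constant.DATA_TYPE))
    if processor_cls is None:
        raise ValueError(f"unsupported data type: {params.get(Constant.DATA_TYPE)}")
    return processor_cls(params)
```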
diff --git a/profiler/cluster_analyse/prof_bean/step_trace_time_bean.py b/profiler/cluster_analyse/prof_bean/step_trace_time_bean.py
deleted file mode 100644
index b0a3be4f5eaccea70aa912bc85e68d70dbda3bde..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/prof_bean/step_trace_time_bean.py
+++ /dev/null
@@ -1,39 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-class StepTraceTimeBean:
-    STEP = "Step"
-    COMPLEMENT_HEADER = ["Step", "Type", "Index"]
-
-    def __init__(self, data: dict):
-        self._data = data
-
-    @property
-    def row(self) -> list:
-        row = []
-        for field_name in self._data.keys():
-            if field_name == self.STEP:
-                continue
-            row.append(float(self._data.get(field_name)))
-        return row
-
-    @property
-    def step(self) -> str:
-        return self._data.get(self.STEP, '')
-
-    @property
-    def all_headers(self) -> list:
-        return self.COMPLEMENT_HEADER + list(self._data.keys())[1:]
diff --git a/profiler/cluster_analyse/resources/.keep b/profiler/cluster_analyse/resources/.keep
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
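`StepTraceTimeBean` wraps one row of `step_trace_time.csv`: `step` returns the Step column, `row` converts every other field to float, and `all_headers` prepends the Step/Type/Index complement. A usage sketch, with hypothetical duration column names:

```python
# One CSV row as a dict, Step first, as csv.DictReader would yield it.
row_dict = {"Step": "1", "Computing": "1000.5", "Communication": "200.0"}

bean = StepTraceTimeBean(row_dict)
print(bean.step)         # "1"
print(bean.row)          # [1000.5, 200.0]
print(bean.all_headers)  # ['Step', 'Type', 'Index', 'Computing', 'Communication']
```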