diff --git a/profiler/cluster_analyse/README.md b/profiler/cluster_analyse/README.md
deleted file mode 100644
index deaebb6cde565d2c7f43c41fde252326c7d06ef5..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/README.md
+++ /dev/null
@@ -1,180 +0,0 @@
# Cluster Analysis Tool

cluster_analyse (the cluster analysis tool) analyzes performance data collected in cluster scenarios. It currently focuses on communication-domain-based analysis of per-iteration time, communication time, and the communication matrix, in order to locate slow cards, slow nodes, and slow links.

## Performance Data Collection

The cluster analysis tool currently supports cluster data collected with the Ascend PyTorch Profiler. See [Profiling data collection](https://gitee.com/ascend/mstt/tree/master/profiler) for the collection procedure; this tool only needs the NPU performance data collected by the Ascend PyTorch Profiler.

The data must be collected at profiler level L1 or higher:

```python
experimental_config = torch_npu.profiler._ExperimentalConfig(
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1
)
```
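For context, here is a minimal end-to-end collection sketch. The profiler calls follow the Ascend PyTorch Profiler usage referenced above; `train_one_step()` and the `./profiling_data` output directory are placeholders.

```python
import torch_npu

experimental_config = torch_npu.profiler._ExperimentalConfig(
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1
)

with torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU,
    ],
    schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./profiling_data"),
    experimental_config=experimental_config,
) as prof:
    for _ in range(1):          # one active step is usually enough for cluster analysis
        train_one_step()        # placeholder: your training step
        prof.step()
```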
### Verifying that the data is usable

Open the data of any collected card (a folder ending in *ascend_pt). A usable collection contains:

- ./profiler_info_x.json,
- ./ASCEND_PROFILER_OUTPUT/step_trace_time.csv,
- ./ASCEND_PROFILER_OUTPUT/trace_view.json,
- ./ASCEND_PROFILER_OUTPUT/kernel_details.csv,
- ./ASCEND_PROFILER_OUTPUT/communication.json,
- ./ASCEND_PROFILER_OUTPUT/communication_matrix.json

or, alternatively:

- analysis.db
- ascend_pytorch_profiler_{rank_id}.db

Only one of the two kinds (the csv/json files or the db files) may be present; otherwise the cluster analysis tool fails to parse the data.

Once these files are confirmed to exist, continue with the cluster analysis below; a sanity-check sketch follows.
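The checklist above can be automated. A minimal sketch, assuming the db files sit somewhere under the *ascend_pt folder (their exact location may vary) and with `check_rank_dir` as a hypothetical helper name:

```python
import glob
import os

TEXT_OUTPUTS = [
    "step_trace_time.csv",
    "trace_view.json",
    "kernel_details.csv",
    "communication.json",
    "communication_matrix.json",
]


def check_rank_dir(rank_dir):
    """Return True if one *ascend_pt folder holds exactly one usable layout."""
    output_dir = os.path.join(rank_dir, "ASCEND_PROFILER_OUTPUT")
    has_text = bool(glob.glob(os.path.join(rank_dir, "profiler_info_*.json"))) and all(
        os.path.exists(os.path.join(output_dir, name)) for name in TEXT_OUTPUTS
    )
    has_db = bool(glob.glob(os.path.join(rank_dir, "**", "ascend_pytorch_profiler_*.db"),
                            recursive=True))
    # Exactly one of the two layouts may be present, otherwise parsing fails.
    return has_text != has_db
```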
## Data Aggregation and Parsing

### Steps

1. Install the tool as described in [Performance Tools](../README.md). Installing the latest version is recommended.

2. Copy the data of all cards into a single directory, then run one of the following commands in that directory to generate the cluster_analysis_output folder:

   ```bash
   msprof-analyze cluster -d {cluster profiling data path} -m {mode}
   ```

   or

   ```bash
   python3 cluster_analysis.py -d {cluster profiling data path} -m {mode}
   ```

   Parameters:

   | Parameter | Description | Required |
   | --------------------- | ------------------------------------------------------------ | -------- |
   | --collection_path or -d | Directory holding the aggregated performance data. After the analysis script runs, a cluster_analysis_output folder is created under it to store the results. | Yes |
   | --mode or -m | Data parsing mode; see the "--mode values" table below. | No |
   | --parallel_mode | Concurrency mode used when collecting db data across cards and nodes. The only supported value is concurrent (a process pool based on concurrent.futures).<br>**Only configurable when -m is cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum.** | No |
   | --export_type | Output format: db (a .db file) or notebook (a Jupyter Notebook file); the default is db.<br>**Only configurable when -m is cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum.** | No |
   | --rank_list | Restricts the statistics to specific ranks; the default all covers every rank. Configure non-negative integers matching real rank IDs; IDs absent from the collected data are skipped. For example, if the environment has ranks 0 to 7 but training ran only on cards 0 to 3, then --rank_list 0, 3, 4 (or a non-existent 10) parses only ranks 0 and 3. Example: --rank_list 0, 1, 2.<br>**Only configurable when -m is cann_api_sum, compute_op_sum, hccl_sum, or mstx_sum.** | No |
   | --top_num | Number of top-N most time-consuming communication operators; the default is 15. Example: --top_num 20.<br>**Only configurable when -m is hccl_sum.** | No |

   --mode values:

   | Value | Description | Required |
   | -------------------- | ------------------------------------------------------------ | -------- |
   | communication_matrix | Parses communication matrix data. | No |
   | communication_time | Parses communication time data. | No |
   | all | Parses both communication_matrix and communication_time data; this is the default. | No |
   | cann_api_sum | Cluster-wide CANN API statistics. The input must be ascend_pytorch_profiler_{rank_id}.db files. With --export_type db the deliverable is cluster_analysis.db; with notebook, stats.ipynb under cluster_analysis_output/CannApiSum. | No |
   | compute_op_sum | Cluster-wide summary of device compute operators. The input must be ascend_pytorch_profiler_{rank_id}.db files. With --export_type db the deliverable is cluster_analysis.db; with notebook, stats.ipynb under cluster_analysis_output/ComputeOpSum. | No |
   | hccl_sum | Collective communication operator time analysis. The input must be ascend_pytorch_profiler_{rank_id}.db files. With --export_type db the deliverable is cluster_analysis.db; with notebook, stats.ipynb under cluster_analysis_output/HcclSum. | No |
   | mstx_sum | Cluster-wide summary of mstx markers. The input must be ascend_pytorch_profiler_{rank_id}.db files. With --export_type db the deliverable is cluster_analysis.db; with notebook, stats.ipynb under cluster_analysis_output/MstxSum. | No |

   Example with --parallel_mode:

   ```bash
   msprof-analyze cluster -d {cluster profiling data path} -m cann_api_sum --parallel_mode concurrent
   ```

   or

   ```bash
   python3 cluster_analysis.py -d {cluster profiling data path} -m cann_api_sum --parallel_mode concurrent
   ```
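When several modes have to run back to back, the CLI can be scripted. A minimal sketch, assuming msprof-analyze is on PATH and ./cluster_data is a placeholder for the aggregation directory:

```python
import subprocess

# Equivalent to:
#   msprof-analyze cluster -d ./cluster_data -m hccl_sum --top_num 20 --export_type notebook
subprocess.run(
    [
        "msprof-analyze", "cluster",
        "-d", "./cluster_data",   # placeholder collection path
        "-m", "hccl_sum",
        "--top_num", "20",
        "--export_type", "notebook",
    ],
    check=True,
)
```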
### Deliverables

The deliverables of the cluster analysis tool are visualized with the Ascend Insight tool; see the [MindStudio Ascend Insight User Guide](https://www.hiascend.com/document/detail/zh/mindstudio/70RC1/GUI-baseddevelopmenttool/msascendinsightug/AscendInsight_0002.html).

#### cluster_step_trace_time.csv

Generated in each of the communication_matrix, communication_time, and all parsing modes.

Column A, Step: the step number set when the performance data was collected. One step is usually enough for cluster data; if several steps were collected, filter to one first.

Column B, Type: either rank or stage, strongly tied to the Index column: rank denotes a single card, stage a rank group (a pipeline-parallel stage). For stage rows, columns D through K hold the maximum over the rank group.

Column C, Index: the card number, interpreted according to Type.

Column D, Computing: computation time.

Column E, Communication(Not Overlapped): communication time not overlapped by computation.

Column F, Overlapped: time in which computation and communication overlap.

Column G, Communication: total communication time.

Column H, Free: idle time, i.e. time in which the device is neither communicating nor computing; it may be doing SDMA copies or simply waiting.

Column I, Stage: columns I, J, and K are only meaningful under pipeline parallelism; stage time is the iteration time excluding receive operators.

Column J, Bubble: the sum of receive operator time.

Column K, Communication(Not Overlapped and Exclude Receive): non-overlapped communication time excluding receive operators.

Column L, Preparing: time from iteration start until the first compute or communication operator runs.

**Tips**: first filter column B (Type) for stage and check whether there are problems between stages, then filter for rank and check individual cards, following the points below (a screening sketch follows this list):

* Use differences in Computing to spot slow cards or load imbalance.

* Use Free to spot host-bound behavior or uneven distribution.

* Use Communication(Not Overlapped and Exclude Receive) to judge whether communication takes up too large a share of the time.

* Use the share of Bubble time together with the theoretical formula to judge whether the bubble is set reasonably and whether the stages are unbalanced.

In theory all of these times should be level across the cluster, i.e. the maximum should exceed the minimum by less than 5%; otherwise a slow card is likely.
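A minimal screening sketch for these tips, assuming the cluster_step_trace_time.csv produced above and column header strings exactly as described (they may differ slightly in your file):

```python
import pandas as pd

df = pd.read_csv("cluster_analysis_output/cluster_step_trace_time.csv")
CHECK_COLUMNS = ["Computing", "Free", "Bubble",
                 "Communication(Not Overlapped and Exclude Receive)"]

for type_value in ("stage", "rank"):   # check stages first, then individual ranks
    subset = df[df["Type"] == type_value]
    for column in CHECK_COLUMNS:
        if column not in subset.columns or subset.empty:
            continue
        low, high = subset[column].min(), subset[column].max()
        # "Level" means the maximum exceeds the minimum by less than 5%.
        if high >= low * 1.05:
            print(f"[{type_value}] uneven {column}: min={low:.1f}, max={high:.1f}")
```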
#### cluster_communication_matrix.json

Generated when the parsing mode is communication_matrix or all.

Open the json directly (VS Code or a json viewer) and search for "Total"; there will be multiple matches. A link bandwidth entry generally has this structure:

```json
{src_rank}-{dst_rank}: {
    "Transport Type": "LOCAL",
    "Transit Time(ms)": 0.02462,
    "Transit Size(MB)": 16.777216,
    "Bandwidth(GB/s)": 681.4466
}
```

**Tips**: use the bandwidth and link type between ranks to spot slow links (a scanning sketch follows this list):

- "LOCAL" is an on-chip copy; it is very fast and needs no attention.
- "HCCS" or "PCIE" is an intra-node, inter-chip copy; around 18 GB/s or above is normal.
- "RDMA" is an inter-node copy; on 910A, around 12 GB/s or above is normal.
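A minimal scanning sketch based on the thresholds above. It assumes the caller has already drilled down through the group/step nesting to a dict of per-link entries shaped like the sample, which this README does not fully specify:

```python
# Rough per-link-type bandwidth floors (GB/s) taken from the guidance above;
# LOCAL links are intentionally skipped.
BANDWIDTH_FLOOR = {"HCCS": 18, "PCIE": 18, "RDMA": 12}


def find_slow_links(link_dict):
    """link_dict maps '{src_rank}-{dst_rank}' keys to entries like the sample above."""
    for link, info in link_dict.items():
        floor = BANDWIDTH_FLOOR.get(info.get("Transport Type"))
        if floor is not None and info.get("Bandwidth(GB/s)", 0.0) < floor:
            print(f"possible slow link {link}: "
                  f"{info['Transport Type']} at {info['Bandwidth(GB/s)']} GB/s")
```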
#### cluster_communication.json

Generated when the parsing mode is communication_time or all. Contains the communication time data.

#### cluster_analysis.db

Deliverable produced by parsing analysis.db or ascend_pytorch_profiler_{rank_id}.db; which data it contains depends on the parsing mode. It can be viewed with the Ascend Insight tool.

#### stats.ipynb

- Generated in cluster_analysis_output/CannApiSum when the parsing mode is cann_api_sum.

  Open it with Jupyter Notebook or Ascend Insight; it shows the cluster-wide CANN API time statistics.

- Generated in cluster_analysis_output/ComputeOpSum when the parsing mode is compute_op_sum.

  Open it with Jupyter Notebook or Ascend Insight; it shows the cluster-wide compute operator time analysis (all compute operators aggregated and charted) and the per-rank compute operator time analysis (aggregated per rank).

- Generated in cluster_analysis_output/HcclSum when the parsing mode is hccl_sum.

  Open it with Jupyter Notebook or Ascend Insight; it shows the cluster-wide communication operator time analysis (aggregated and charted), the per-rank communication operator time analysis, and the top communication operators.

- Generated in cluster_analysis_output/MstxSum when the parsing mode is mstx_sum.

  Open it with Jupyter Notebook or Ascend Insight; it shows the cluster-wide mstx marker information, split into framework-side, CANN-side, and device-side markers.

diff --git a/profiler/cluster_analyse/__init__.py b/profiler/cluster_analyse/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
# Copyright (c) 2023, Huawei Technologies Co., Ltd.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/__init__.py b/profiler/cluster_analyse/analysis/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
# Copyright (c) 2023, Huawei Technologies Co., Ltd.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
diff --git a/profiler/cluster_analyse/analysis/analysis_facade.py b/profiler/cluster_analyse/analysis/analysis_facade.py
deleted file mode 100644
index 435d77b21bff423b207bf050ea660a1738f0fe5f..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/analysis_facade.py
+++ /dev/null
@@ -1,50 +0,0 @@
# Copyright (c) 2024, Huawei Technologies Co., Ltd.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
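# analysis_facade.py: entry point for the analyses. cluster_analyze() runs the
# four default modules (communication, communication matrix, step trace time,
# host info) in separate processes; recipe_analyze() runs a single recipe class
# inside a Context created from the --parallel_mode setting.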
from multiprocessing import Process

from analysis.communication_analysis import CommunicationAnalysis
from analysis.comm_matrix_analysis import CommMatrixAnalysis
from analysis.step_trace_time_analysis import StepTraceTimeAnalysis
from analysis.host_info_analysis import HostInfoAnalysis
from common_func.context import Context
from common_func.constant import Constant


class AnalysisFacade:
    default_module = {CommunicationAnalysis, StepTraceTimeAnalysis, CommMatrixAnalysis, HostInfoAnalysis}

    def __init__(self, params: dict):
        self.params = params

    def cluster_analyze(self):
        # Multiple profiler data sets are handled with multiple processes,
        # one process per analysis module.
        process_list = []
        for analysis in self.default_module:
            process = Process(target=analysis(self.params).run)
            process.start()
            process_list.append(process)

        for process in process_list:
            process.join()

    def recipe_analyze(self):
        HostInfoAnalysis(self.params).run()
        print("[INFO] Recipe analysis launched.")
        try:
            with Context.create_context(self.params.get(Constant.PARALLEL_MODE)) as context:
                with self.params.get(Constant.RECIPE_CLASS)(self.params) as recipe:
                    recipe.run(context)
        except Exception as e:
            print("[ERROR] Recipe analysis failed, %s." % str(e))
diff --git a/profiler/cluster_analyse/analysis/base_analysis.py b/profiler/cluster_analyse/analysis/base_analysis.py
deleted file mode 100644
index 7209e9b56f04cc6e97e4331db2ca48ba18a67ed6..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/analysis/base_analysis.py
+++ /dev/null
@@ -1,255 +0,0 @@
# Copyright (c) 2024, Huawei Technologies Co., Ltd.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
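# base_analysis.py: the two analysis base classes. BaseAnalysis merges per-rank
# communication op data and dumps it to json (or to db, falling back to json
# above 1000 ranks); BaseRecipeAnalysis maps a recipe over each rank's
# ascend_pytorch_profiler_{rank_id}.db, dumps DataFrames to db or csv, and can
# render a stats.ipynb notebook from a template.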
- -import os -import sys -import traceback -import shutil -import pandas as pd -from abc import abstractmethod - -from common_func.constant import Constant -from common_func.file_manager import FileManager -from common_func.db_manager import DBManager -from common_func.utils import convert_unit -from cluster_utils.data_transfer_adapter import DataTransferAdapter - - -class BaseAnalysis: - MAX_RANKS = 1000 - def __init__(self, param: dict): - self.collection_path = param.get(Constant.COLLECTION_PATH) - self.data_map = param.get(Constant.DATA_MAP) - self.data_type = param.get(Constant.DATA_TYPE) - self.communication_ops = [] - self.collective_group_dict = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.COLLECTIVE_GROUP) - self.comm_ops_struct = {} - self.adapter = DataTransferAdapter() - - @staticmethod - def compute_ratio(dividend: float, divisor: float): - if abs(divisor) < Constant.EPS: - return 0 - else: - return round(dividend / divisor, 4) - - @staticmethod - def check_add_op(op_name: str): - """ - 兼容2个版本,判断是否需要将此算子信息相加 - """ - stat_list = ["middle", "top", "bottom", "total"] - total = "total" - for stat_name in stat_list: - if stat_name in op_name: - if stat_name != total: - return False - return True - - @abstractmethod - def run(self): - pass - - def dump_data(self): - if not self.comm_ops_struct: - print("[WARNING] There is no final comm ops data generated") - return - if self.data_type == Constant.TEXT: - self.dump_json() - else: - if len(self.data_map) >= self.MAX_RANKS: - print("[WARNING]The number of ranks is too large to dump to db, it will be dumped to json file.") - self.dump_json() - else: - self.dump_db() - - @abstractmethod - def dump_db(self): - pass - - def dump_json(self): - output_comm_data = {} - for key in self.comm_ops_struct: - output_comm_data[str(key)] = self.comm_ops_struct.get(key) - FileManager.create_json_file(self.collection_path, output_comm_data, self.SAVED_JSON) - - def split_op_by_group(self): - for single_op in self.communication_ops: - if single_op.get(Constant.COMM_OP_TYPE) == Constant.P2P: - rank_tup = Constant.P2P - else: - rank_tup = tuple(self.collective_group_dict.get(single_op.get(Constant.GROUP_NAME), [])) - rank_id = single_op.get(Constant.RANK_ID, 'N/A') - step_id = single_op.get(Constant.STEP_ID, 'N/A') - op_name = single_op.get(Constant.COMM_OP_NAME, 'N/A') - op_info = single_op.get(Constant.COMM_OP_INFO) - self.comm_ops_struct.setdefault(rank_tup, {}).setdefault(step_id, {}).\ - setdefault(op_name, {}).setdefault(rank_id, op_info) - - def combine_ops_total_info(self): - for rank_tup, group_dict in self.comm_ops_struct.items(): - for step_id, communication_ops in group_dict.items(): - self.compute_total_info(communication_ops) - - -class BaseRecipeAnalysis: - - UNIT = "Us" - DB_UNIT = "Ns" - - RANK_LIST = "rank_list" - - def __init__(self, params): - self._params = params - self._collection_dir = params.get(Constant.COLLECTION_PATH, "") - self._data_map = params.get(Constant.DATA_MAP, {}) - self._recipe_name = params.get(Constant.RECIPE_NAME, "") - self._mode = params.get(Constant.PARALLEL_MODE, "") - self._export_type = params.get(Constant.EXPORT_TYPE, "") - self._output_dir = None - self._rank_list = params.get(self.RANK_LIST, 'all') - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - if self._params is not None and exc_type is not None: - print(f"[ERROR] Failed to exit analysis: {exc_val}") - traceback.print_exc(file=sys.stdout) - - def run(self, context): - pass - - @property - def 
base_dir(self): - return os.path.basename(os.path.dirname(__file__)) - - def _get_rank_db(self): - invalid_rank_id = [] - if self._rank_list == 'all': - rank_ids = list(self._data_map.keys()) - else: - rank_ids = [] - for rank_id in self._rank_list: - if rank_id in self._data_map.keys(): - rank_ids.append(rank_id) - else: - invalid_rank_id.append(str(rank_id)) - db_paths = [] - for rank_id in rank_ids: - rank_path = self._data_map[rank_id] - db_path = os.path.join(rank_path, Constant.SINGLE_OUTPUT, f"ascend_pytorch_profiler_{rank_id}.db") - if os.path.exists(db_path): - db_paths.append((rank_id, db_path)) - else: - print(f"[WARNING] DB file not found, rank id: {rank_id}, db path: {db_path}.") - if invalid_rank_id: - print(f"[WARNING] Invalid Rank id : [{','.join(invalid_rank_id)}].") - return db_paths - - def get_mode(self): - return self._mode - - def get_recipe_name(self): - return self._recipe_name - - def dump_data(self, data, file_name, table_name=None, index=True): - output_path = os.path.join(self._collection_dir, Constant.CLUSTER_ANALYSIS_OUTPUT) - if table_name: - result_db = os.path.join(output_path, file_name) - conn, cursor = DBManager.create_connect_db(result_db) - if isinstance(data, pd.DataFrame): - data.to_sql(table_name, conn, if_exists='replace', index=True) - else: - print(f"[ERROR] Unknown dump data type: {type(data)}") - DBManager.destroy_db_connect(conn, cursor) - else: - result_csv = os.path.join(output_path, file_name) - if isinstance(data, pd.DataFrame): - data = convert_unit(data, self.DB_UNIT, self.UNIT) - data.to_csv(result_csv, index=index) - else: - print(f"[ERROR] Unknown dump data type: {type(data)}") - - def _create_output_dir_name(self, name): - i = 1 - while os.path.exists(f"{name}-{i}"): - i += 1 - return f"{name}-{i}" - - def _create_unique_output_dir(self): - output_dir = os.path.join(self._collection_dir, Constant.CLUSTER_ANALYSIS_OUTPUT, self._recipe_name) - - if os.path.exists(output_dir): - return self._create_output_dir_name(output_dir) - return output_dir - - def _get_output_dir(self): - if self._output_dir is None: - self._output_dir = self._create_unique_output_dir() - os.makedirs(self._output_dir) - return self._output_dir - - def create_notebook(self, filename, notebook_template_dir=None, replace_dict=None): - if notebook_template_dir is None: - template_path = os.path.dirname(__file__) - else: - template_path = notebook_template_dir - output_path = os.path.join(self._get_output_dir(), filename) - template_file = os.path.join(template_path, self.base_dir, filename) - if replace_dict is None: - shutil.copy(template_file, output_path) - else: - with open(template_file, 'r') as f: - template_content = f.read() - for key, value in replace_dict.items(): - template_content = template_content.replace(str(key), str(value)) - with open(output_path, 'w') as f: - f.write(template_content) - print(f"[INFO] Notebook export path is: {self._get_output_dir()}") - - def add_helper_file(self, helper_file): - helper_output_path = os.path.join(self._get_output_dir(), helper_file) - helper_file_path = os.path.join(os.path.dirname(__file__), helper_file) - - if helper_file_path is not None: - shutil.copy(helper_file_path, helper_output_path) - - @staticmethod - def _filter_data(mapper_data): - return [(rank, data) for rank, data in mapper_data if data is not None and len(data) != 0] - - @classmethod - def add_parser_argument(cls, parser): - parser.add_argument("--rank_list", type=str, help="Rank id list", default='all') - - @classmethod - def parse_argument(cls, 
args_parsed) -> dict: - if args_parsed.rank_list == 'all': - return { - cls.RANK_LIST: 'all' - } - else: - rank_str_list = args_parsed.rank_list.split(",") - rank_list = [int(rank) for rank in rank_str_list if rank.isdigit()] - return { - cls.RANK_LIST: rank_list - } - - @classmethod - def get_extra_argument(cls, params) -> dict: - return { - cls.RANK_LIST: params.get(cls.RANK_LIST, "all") - } diff --git a/profiler/cluster_analyse/analysis/cann_api_sum/__init__.py b/profiler/cluster_analyse/analysis/cann_api_sum/__init__.py deleted file mode 100644 index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/cann_api_sum/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/analysis/cann_api_sum/cann_api_sum.py b/profiler/cluster_analyse/analysis/cann_api_sum/cann_api_sum.py deleted file mode 100644 index db37b004b150eaa65b9c9cd4e12f1f5bdc0836e9..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/cann_api_sum/cann_api_sum.py +++ /dev/null @@ -1,108 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
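# cann_api_sum.py: the CannApiSum recipe. The mapper reads per-rank CANN API
# statistics through CannApiSumExport; the reducer concatenates them, aggregates
# per API name (total time, quartiles, min/max ranks), and exports either to
# cluster_analysis.db or to a stats.ipynb notebook, depending on --export_type.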
- -import os -import pandas as pd - -from analysis.base_analysis import BaseRecipeAnalysis -from common_func.constant import Constant -from common_func.utils import stdev -from cluster_statistics_export.cann_api_sum_export import CannApiSumExport - - -class CannApiSum(BaseRecipeAnalysis): - - def __init__(self, params): - super().__init__(params) - print("[INFO] CannApiSum init.") - - @property - def base_dir(self): - return os.path.basename(os.path.dirname(__file__)) - - @staticmethod - def _mapper_func(data_map, analysis_class): - df = CannApiSumExport(data_map[1], analysis_class).read_export_db() - - if df is None or df.empty: - print(f"[WARNING] There is no stats data in {data_map[1]}.") - return None - return data_map[0], df - - def mapper_func(self, context): - return context.wait( - context.map( - self._mapper_func, - self._get_rank_db(), - analysis_class=self._recipe_name - ) - ) - - def reducer_func(self, mapper_res): - stats_rank_data = self._filter_data(mapper_res) - if not stats_rank_data: - print("[ERROR] Mapper data is None.") - return - stats_rank_data = [df.assign(rank=rank) for rank, df in stats_rank_data] - stats_rank_data = pd.concat(stats_rank_data) - stats_data = self._aggregate_stats(stats_rank_data) - if self._export_type == "db": - self.dump_data(stats_rank_data, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, "CannApiSumRank") - self.dump_data(stats_data, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, "CannApiSum") - elif self._export_type == "notebook": - self.dump_data(stats_rank_data, os.path.join(self._get_output_dir(), "rank_stats.csv"), index=False) - self.dump_data(stats_data, os.path.join(self._get_output_dir(), "all_stats.csv")) - self.save_notebook() - else: - print("[ERROR] Unknown export type.") - - def run(self, context): - mapper_res = self.mapper_func(context) - self.reducer_func(mapper_res) - - @staticmethod - def _aggregate_stats(stats_res): - grouped = stats_res.groupby("name") - res = {} - total_time = grouped["totalTimeNs"].sum() - res["timeRatio"] = total_time / total_time.sum() * 100.0 - res["totalTimeNs"] = total_time - res["totalCount"] = grouped["totalCount"].sum() - res["averageNs"] = res["totalTimeNs"] / res["totalCount"] - res["Q1Ns"] = grouped["Q1Ns"].min() - res["medNs"] = grouped["medNs"].median() - res["Q3Ns"] = grouped["Q3Ns"].max() - res["minNs"] = grouped["minNs"].min() - res["maxNs"] = grouped["maxNs"].max() - res["stdev"] = grouped.apply(lambda x: stdev(x, res)) - min_value = grouped["minNs"].min() - res["minRank"] = grouped.apply( - lambda x: ", ".join( - x.loc[x["minNs"] == min_value.loc[x.name], "rank"].astype(str) - ) - ) - max_value = grouped["maxNs"].max() - res["maxRank"] = grouped.apply( - lambda x: ", ".join( - x.loc[x["maxNs"] == max_value.loc[x.name], "rank"].astype(str) - ) - ) - res = pd.concat(res.values(), axis=1, keys=res.keys()).round(1) - res.sort_values(by="totalTimeNs", ascending=False, inplace=True) - return res - - def save_notebook(self): - self.create_notebook("stats.ipynb") - self.add_helper_file("cluster_display.py") diff --git a/profiler/cluster_analyse/analysis/cann_api_sum/stats.ipynb b/profiler/cluster_analyse/analysis/cann_api_sum/stats.ipynb deleted file mode 100644 index c97f039c5a01a6e7cce2968d569d79e137e76f8c..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/cann_api_sum/stats.ipynb +++ /dev/null @@ -1,86 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# CANN_API_SUM" - ] - }, - { - "cell_type": "code", - "execution_count": 
null, - "metadata": { - "vscode": { - "languageId": "plaintext" - } - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import plotly.offline as pyo\n", - "\n", - "from IPython.display import display, HTML\n", - "\n", - "import cluster_display\n", - "\n", - "display(HTML(\"\"))\n", - "pd.set_option('display.max_columns', None)\n", - "pd.set_option('display.max_rows', None)\n", - "pyo.init_notebook_mode()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## 集群场景CANN层API统计分析\n", - "该分析脚本展示了集群场景的统计数据分析结果。需要注意以下几点:\n", - "1. 所有的时间信息单位是微秒(us);\n", - "2. Q1表示单个API耗时的25%分位数,最终结果取自所有卡的Q1值中最小值;\n", - "3. Q3表示单个API耗时的75%分位数,最终结果取自所有卡的Q3值中最大值;\n", - "4. 'minRank'展示了API最小耗时所在卡;\n", - "5. 'maxRank'展示了API最大耗时所在卡。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.read_csv(\"all_stats.csv\")\n", - "display(df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "cluster_display.display_box(df, xaxis_title=\"name\", yaxis_title=\"duration (ns)\")\n", - "cluster_display.display_stats_scatter(df, xaxis_title=\"name\", yaxis_title=\"duration (ns)\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "per_rank_df = pd.read_csv(\"rank_stats.csv\")\n", - "cluster_display.display_stats_per_operation(per_rank_df, xaxis_title='rank', yaxis_title='duration (ns)')" - ] - } - ], - "metadata": { - "language_info": { - "name": "python" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/profiler/cluster_analyse/analysis/cluster_display.py b/profiler/cluster_analyse/analysis/cluster_display.py deleted file mode 100644 index 8fc6040ccafae2d069e2e6e394941c7aff83a452..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/cluster_display.py +++ /dev/null @@ -1,239 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
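# cluster_display.py: plotly/ipywidgets display helpers shared by the
# stats.ipynb notebooks: box plots, scatter plots, per-rank tables, and
# dropdown-driven per-operation and per-rank-group views.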
- -import numpy as np -import pandas as pd -import plotly.graph_objects as go -from IPython.display import display, HTML -from ipywidgets import Dropdown, fixed, interact - - -def get_stats_cols(df): - cols = df.columns.tolist() - q1 = "Q1(Us)" if "Q1(Us)" in cols else "Q1~" - q3 = "Q3(Us)" if "Q3(Us)" in cols else "Q3~" - med = "med(Us)" if "med(Us)" in cols else "med~" - std = "stdev" if "stdev" in cols else "stdev~" - return q1, q3, med, std - - -def display_box(df, x=None, **layout_args): - if x is None: - x = df.columns[0] - q1, q3, med, std = get_stats_cols(df) - fig = go.Figure() - fig.add_trace( - go.Box( - x=df[x], - q1=df[q1], - median=df[med], - q3=df[q3], - sd=df[std], - lowerfence=df["minRank"], - upperfence=df["maxRank"] - ) - ) - fig.update_layout(**layout_args) - fig.show() - - -def display_stats_scatter(df, x=None, **layout_args): - if x is None: - x = df.columns[0] - q1, q3, med, _ = get_stats_cols(df) - fig = go.Figure() - col_names = [q1, med, q3, "minRank", "maxRank"] - for name in col_names: - fig.add_trace( - go.Scatter( - x=df[x], - y=df[name], - name=name - ) - ) - fig.update_layout(**layout_args) - fig.show() - - -def display_table_per_rank(df): - if df.empty: - display(df) - return - - rank_groups = df.groupby("rank") - def display_table(name): - rank_df = rank_groups.get_group(name) - rank_df = rank_df.drop(columns=["rank"]) - display(rank_df) - - dropdown = Dropdown( - options=rank_groups.groups.keys(), - description="rank:", - disabled=False, - ) - interact( - display_table, - name=dropdown - ) - - -def display_stats_per_operation(df, x=None, box=True, scatter=True, table=True, **layout_args): - if df.empty: - display(df) - return - - if x is None: - x = df.columns[0] - - op_groups = df.groupby(x) - - def display_graphs(name): - op_df = op_groups.get_group(name) - if table: - display(op_df.reset_index(drop=True).set_index("rank")) - if box: - display_box(op_df, x=op_df["rank"], **layout_args) - if scatter: - display_stats_scatter(op_df, x=op_df["rank"], **layout_args) - - operations = list(op_groups.groups.keys()) - - if len(operations) > 1: - dropdown = Dropdown( - options=operations, - description="Operation:", - disabled=False, - value=operations[1] - ) - interact( - display_graphs, - name=dropdown - ) - dropdown.value = operations[0] - else: - display_graphs(operations[0]) - - -def display_duration_boxplots(figs, stats_df: pd.DataFrame, orientation="v", title=None, - x_title="Names", y_title="Time", legend_title="Legend"): - mean_ds = stats_df.get("Mean(Us)", None) - min_ds = stats_df.get("Min(Us)", None) - max_ds = stats_df.get("Max(Us)", None) - q1_ds = stats_df.get("Q1(Us)", None) - median_ds = stats_df.get('Median(Us)', None) - q3_ds = stats_df.get('Q3(Us)', None) - return display_boxplot(figs, stats_df.index, min_ds, q1_ds, median_ds, q3_ds, max_ds, mean_ds, - orientation=orientation, title=title, x_title=x_title, y_title=y_title, - legend_title=legend_title) - - -def display_boxplot(figs, x_axis, min_ds, q1_ds, median_ds, q3_ds, max_ds, mean_ds, orientation="v", - title=None, x_title=None, y_title="Time", legend_title="Legend"): - fig = go.Figure() - fig.add_trace( - go.Box( - x=x_axis, - lowerfence=min_ds, - q1=q1_ds, - median=median_ds, - q3=q3_ds, - upperfence=max_ds, - mean=mean_ds - ) - ) - fig.update_traces(orientation=orientation) - fig.update_layout( - xaxis_title=x_title, yaxis_title=y_title, legend_title=legend_title, - title=title, height=1024 - ) - fig.show() - if isinstance(figs, list): - figs.append(fig) - return fig - - -def 
display_graph(figs, x_axis, y_axes, title=None, - x_title=None, y_title=None, legend_title="Legend"): - data = None - if isinstance(y_axes, pd.DataFrame): - data = y_axes.set_index(x_axis) - elif isinstance(y_axes, dict): - data = pd.DataFrame(y_axes, index=x_axis) - elif isinstance(y_axes, pd.Series): - data = pd.DataFrame({"": y_axes}, index=x_axis) - elif isinstance(y_axes, np.ndarray): - data = pd.DataFrame({"": pd.Series(y_axes)}, index=x_axis) - else: - return - - fig = data.plot.line() - fig.update_layout( - title=title, xaxis_title=x_title, yaxis_title=y_title, legend_title=legend_title - ) - fig.show() - if isinstance(figs, list): - figs.append(fig) - return fig - - -def display_stats_per_rank_groups_combobox(rank_stats_gdf): - names = list(rank_stats_gdf.groups.keys()) - if len(names) > 1: - dropdown = Dropdown( - options=names, layout={"width": "max-content"}, value=names[1] - ) - interact( - __display_stats_per_rank_group, - selected=dropdown, - rank_stats_gdf=fixed(rank_stats_gdf) - ) - dropdown.value = names[0] - elif len(names) == 1: - __display_stats_per_rank_group(names[0], rank_stats_gdf) - else: - print("cluster_display func:input rank_stats_gdf groups is null so no need to display") - - -def __display_stats_per_rank_group(selected, rank_stats_gdf): - df = rank_stats_gdf.get_group(selected) - df = df.reset_index(drop=True) - df = df.set_index(df["Rank"]) - display(df) - - figs = [] - display_duration_boxplots(figs, df, x_title="Ranks") - display_graph( - figs, - df.index, - df[["Q1(Us)", "Median(Us)", "Q3(Us)"]], - title="50% of Distribution", - x_title="Ranks" - ) - - -def display_stats_optional_combobox(options, display_func, args, description="Option:"): - if len(options) > 1: - dropdown = Dropdown( - options=options, layout={"width": "max-content"}, value=options[1], - description=description - ) - interact( - display_func, - selected=dropdown, - args=fixed(args) - ) - dropdown.value = options[0] - elif len(options) == 1: - display_func(options[0], args) diff --git a/profiler/cluster_analyse/analysis/comm_matrix_analysis.py b/profiler/cluster_analyse/analysis/comm_matrix_analysis.py deleted file mode 100644 index 8dc04471fe0a164fc859e51597d41028523f7a32..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/comm_matrix_analysis.py +++ /dev/null @@ -1,106 +0,0 @@ -import os -from collections import defaultdict - -from analysis.base_analysis import BaseAnalysis -from common_func.constant import Constant -from common_func.db_manager import DBManager - - -class CommMatrixAnalysis(BaseAnalysis): - SAVED_JSON = "cluster_communication_matrix.json" - COMMUNICATION_MATRIX_TABLE = "ClusterCommAnalyzerMatrix" - - def __init__(self, param: dict): - super().__init__(param) - self.communication_ops = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.MATRIX_OPS) - - @staticmethod - def combine_link(link_info_dict: dict, single_link_dict: dict): - link_info_dict[Constant.TRANSPORT_TYPE] = single_link_dict.get(Constant.TRANSPORT_TYPE) - link_info_dict[Constant.OP_NAME] = single_link_dict.get(Constant.OP_NAME, '') - link_info_dict[Constant.TRANSIT_TIME_MS] += single_link_dict.get(Constant.TRANSIT_TIME_MS, 0) - link_info_dict[Constant.TRANSIT_SIZE_MB] += single_link_dict.get(Constant.TRANSIT_SIZE_MB, 0) - - def run(self): - if not self.communication_ops: - return - self.split_op_by_group() - self.combine_ops_total_info() - self.dump_data() - - def dump_db(self): - res_comm_matrix = self.adapter.transfer_matrix_from_json_to_db(self.comm_ops_struct) - 
output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER) - DBManager.create_tables(result_db, self.COMMUNICATION_MATRIX_TABLE) - conn, cursor = DBManager.create_connect_db(result_db) - if res_comm_matrix: - res_matrix_value = [list(data.values()) for data in res_comm_matrix] - sql = "insert into {} values ({value})".format(self.COMMUNICATION_MATRIX_TABLE, - value="?," * (len(res_matrix_value[0]) - 1) + "?") - DBManager.executemany_sql(conn, sql, res_matrix_value) - DBManager.destroy_db_connect(conn, cursor) - - def compute_total_info(self, step_dict: dict): - self.merge_same_links(step_dict) - self.combine_link_info(step_dict) - - def merge_same_links(self, step_dict: dict): - def process_link_key(): - for link_key in rank_dict: - if '-' not in link_key: - print(f"[WARNING] {op_name} has an invalid link key {link_key}!") - break - src_rank = link_key.split('-')[0] - dst_rank = link_key.split('-')[1] - if src_rank == dst_rank: - if src_rank not in project_local_global_rank_map: - project_local_global_rank_map[src_rank] = rank_id - elif project_local_global_rank_map.get(src_rank) != rank_id: - print(f"[WARNING] In the same communication group, local ranks projecting to global ranks " - f"repeat!") - self.combine_link(link_info[link_key], rank_dict[link_key]) - - def convert_local_to_global_rank(): - tmp_link = {} - for link_key, link_dict in link_info.items(): - src_rank = link_key.split('-')[0] - dst_rank = link_key.split('-')[1] - src_rank = project_local_global_rank_map[src_rank] \ - if src_rank in project_local_global_rank_map else src_rank - dst_rank = project_local_global_rank_map[dst_rank] \ - if dst_rank in project_local_global_rank_map else dst_rank - link_dict[Constant.BANDWIDTH_GB_S] = \ - self.compute_ratio(link_dict.get(Constant.TRANSIT_SIZE_MB, 0), - link_dict.get(Constant.TRANSIT_TIME_MS, 0)) - tmp_link[f"{src_rank}-{dst_rank}"] = link_dict - return tmp_link - - project_local_global_rank_map = dict() - for op_name, op_dict in step_dict.items(): - link_info = defaultdict(lambda: { - Constant.TRANSPORT_TYPE: '', - Constant.TRANSIT_TIME_MS: 0, - Constant.TRANSIT_SIZE_MB: 0, - Constant.OP_NAME: '' - }) - for rank_id, rank_dict in op_dict.items(): - process_link_key() - step_dict[op_name] = convert_local_to_global_rank() - - def combine_link_info(self, step_dict: dict): - total_op_info = defaultdict(lambda: { - Constant.TRANSPORT_TYPE: '', - Constant.TRANSIT_TIME_MS: 0, - Constant.TRANSIT_SIZE_MB: 0, - Constant.OP_NAME: '' - }) - for op_name, op_dict in step_dict.items(): - if self.check_add_op(op_name): - for link_key, link_dict in op_dict.items(): - self.combine_link(total_op_info[link_key], link_dict) - for link_key, link_dict in total_op_info.items(): - link_dict[Constant.BANDWIDTH_GB_S] = \ - self.compute_ratio(link_dict.get(Constant.TRANSIT_SIZE_MB, 0), - link_dict.get(Constant.TRANSIT_TIME_MS, 0)) - step_dict[Constant.TOTAL_OP_INFO] = total_op_info diff --git a/profiler/cluster_analyse/analysis/communication_analysis.py b/profiler/cluster_analyse/analysis/communication_analysis.py deleted file mode 100644 index 3f0a9b417e211b124b052cb5c5534f2fdbe5302e..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/communication_analysis.py +++ /dev/null @@ -1,103 +0,0 @@ -import os -from collections import defaultdict - -from analysis.base_analysis import BaseAnalysis -from common_func.constant import Constant -from common_func.db_manager import 
DBManager - - -class CommunicationAnalysis(BaseAnalysis): - SAVED_JSON = "cluster_communication.json" - COMMUNICATION_BANDWIDTH_TABLE = "ClusterCommAnalyzerBandwidth" - COMMUNICATION_TIME_TABLE = "ClusterCommAnalyzerTime" - - def __init__(self, param: dict): - super().__init__(param) - self.communication_ops = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.COMMUNICATION_OPS) - - @staticmethod - def combine_size_distribution(op_dict: dict, total_dict: dict): - for size, size_info in op_dict.items(): - total_dict[size][0] += size_info[0] - total_dict[size][1] += size_info[1] - - def run(self): - if not self.communication_ops: - return - self.split_op_by_group() - self.combine_ops_total_info() - self.dump_data() - - def dump_db(self): - res_comm_time, res_comm_bandwidth = self.adapter.transfer_comm_from_json_to_db(self.comm_ops_struct) - output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER) - DBManager.create_tables(result_db, self.COMMUNICATION_TIME_TABLE, self.COMMUNICATION_BANDWIDTH_TABLE) - conn, cursor = DBManager.create_connect_db(result_db) - self.execute(conn, res_comm_time, self.COMMUNICATION_TIME_TABLE) - self.execute(conn, res_comm_bandwidth, self.COMMUNICATION_BANDWIDTH_TABLE) - DBManager.destroy_db_connect(conn, cursor) - - @staticmethod - def execute(conn, res_data, table_name): - if res_data: - res_value = [list(data.values()) for data in res_data] - sql = "insert into {} values ({value})".format(table_name, value="?," * (len(res_value[0]) - 1) + "?") - DBManager.executemany_sql(conn, sql, res_value) - - def compute_total_info(self, comm_ops: dict): - if not comm_ops: - return - total_rank_dict = defaultdict(lambda: { - Constant.COMMUNICATION_TIME_INFO: defaultdict(float), - Constant.COMMUNICATION_BANDWIDTH_INFO: {} - }) - for communication_op, rank_dict in comm_ops.items(): - for rank_id, communication_op_info in rank_dict.items(): - for com_info, com_info_dict in communication_op_info.items(): - if com_info == Constant.COMMUNICATION_TIME_INFO: - self.combine_time_info(com_info_dict, total_rank_dict[rank_id][com_info]) - if com_info == Constant.COMMUNICATION_BANDWIDTH_INFO: - self.combine_bandwidth_info(com_info_dict, total_rank_dict[rank_id][com_info]) - for rank_id in total_rank_dict: - self.compute_time_ratio(total_rank_dict[rank_id][Constant.COMMUNICATION_TIME_INFO]) - self.compute_bandwidth_ratio(total_rank_dict[rank_id][Constant.COMMUNICATION_BANDWIDTH_INFO]) - comm_ops[Constant.TOTAL_OP_INFO] = total_rank_dict - - def combine_time_info(self, com_info_dict: dict, total_time_info_dict: dict): - ratio_list = [Constant.WAIT_TIME_RATIO, Constant.SYNCHRONIZATION_TIME_RATIO] - for time_info in com_info_dict: - if time_info not in ratio_list and time_info != Constant.START_TIMESTAMP: - total_time_info_dict[time_info] += com_info_dict.get(time_info) - - def combine_bandwidth_info(self, com_info_dict: dict, total_bandwidth_info_dict: dict): - add_list = [Constant.TRANSIT_TIME_MS, Constant.TRANSIT_SIZE_MB] - dict_list = [Constant.SIZE_DISTRIBUTION] - for transport_type, part_transport_dict in com_info_dict.items(): - if transport_type not in total_bandwidth_info_dict: - total_bandwidth_info_dict[transport_type] = { - Constant.TRANSIT_TIME_MS: 0, - Constant.TRANSIT_SIZE_MB: 0, - Constant.SIZE_DISTRIBUTION: defaultdict(lambda: [0, 0]) - } - for bandwidth_msg, value in part_transport_dict.items(): - if bandwidth_msg in add_list: - 
total_bandwidth_info_dict[transport_type][bandwidth_msg] += value - if bandwidth_msg in dict_list: - self.combine_size_distribution(value, total_bandwidth_info_dict[transport_type][bandwidth_msg]) - - def compute_time_ratio(self, total_time_info_dict: dict): - total_time_info_dict[Constant.WAIT_TIME_RATIO] = \ - self.compute_ratio(total_time_info_dict.get(Constant.WAIT_TIME_MS, 0), - total_time_info_dict.get(Constant.WAIT_TIME_MS, 0) + - total_time_info_dict.get(Constant.TRANSIT_TIME_MS, 0)) - total_time_info_dict[Constant.SYNCHRONIZATION_TIME_RATIO] = \ - self.compute_ratio(total_time_info_dict.get(Constant.SYNCHRONIZATION_TIME_MS, 0), - total_time_info_dict.get(Constant.SYNCHRONIZATION_TIME_MS, 0) + - total_time_info_dict.get(Constant.TRANSIT_TIME_MS, 0)) - - def compute_bandwidth_ratio(self, total_bandwidth_info_dict: dict): - for transport_type, bandwidth_dict in total_bandwidth_info_dict.items(): - bandwidth_dict[Constant.BANDWIDTH_GB_S] = \ - self.compute_ratio(bandwidth_dict.get(Constant.TRANSIT_SIZE_MB, 0), - bandwidth_dict.get(Constant.TRANSIT_TIME_MS, 0)) diff --git a/profiler/cluster_analyse/analysis/compute_op_sum/__init__.py b/profiler/cluster_analyse/analysis/compute_op_sum/__init__.py deleted file mode 100644 index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/compute_op_sum/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/analysis/compute_op_sum/compute_op_sum.py b/profiler/cluster_analyse/analysis/compute_op_sum/compute_op_sum.py deleted file mode 100644 index e71cf868ac9e06785d030a702bf9c8182ae4e948..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/compute_op_sum/compute_op_sum.py +++ /dev/null @@ -1,103 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
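# compute_op_sum.py: the ComputeOpSum recipe. It aggregates device compute
# operator durations cluster-wide by (OpType, TaskType), per rank by the same
# key, and per rank by (OpName, OpType, TaskType, InputShapes).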
- -import os -import pandas as pd -from analysis.base_analysis import BaseRecipeAnalysis -from common_func.constant import Constant -from common_func.utils import describe_duration -from cluster_statistics_export.compute_op_sum_export import ComputeOpSumExport - - -class ComputeOpSum(BaseRecipeAnalysis): - - TABLE_ALL_RANK_STATS = "ComputeOpAllRankStats" - TABLE_PER_RANK_STATS_BY_OPTYPE = "ComputeOpPerRankStatsByOpType" - TABLE_PER_RANK_STATS_BY_OPNAME = "ComputeOpPerRankStatsByOpName" - - def __init__(self, params): - super().__init__(params) - print("[INFO] ComputeOpSum init.") - self.all_rank_stats = None - self.per_rank_stats_by_optype = None - self.per_rank_stats_by_opname = None - - @property - def base_dir(self): - return os.path.basename(os.path.dirname(__file__)) - - @staticmethod - def _mapper_func(data_map, analysis_class): - df = ComputeOpSumExport(data_map[1], analysis_class).read_export_db() - - if df is None or df.empty: - print(f"[WARNING] There is no stats data in {data_map[1]}.") - return None - - df["Rank"] = data_map[0] - return df - - def mapper_func(self, context): - return context.wait( - context.map( - self._mapper_func, - self._get_rank_db(), - analysis_class=self._recipe_name - ) - ) - - def reducer_func(self, mapper_res): - mapper_res = list(filter(lambda df: df is not None, mapper_res)) - if not mapper_res: - print("[ERROR] Mapper data is None.") - return - # get per rank stats by optype - self.per_rank_stats_by_optype = pd.concat( - describe_duration(df.groupby(["OpType", "TaskType"])["Duration"]).assign(Rank=df["Rank"][0]) for df in mapper_res) - self.per_rank_stats_by_optype.sort_values(by=["SumNs"], inplace=True, ascending=False) - - # get all rank stats by optype - all_op_data = pd.concat(mapper_res) - self.all_rank_stats = describe_duration(all_op_data.groupby(["OpType", "TaskType"])["Duration"]) - self.all_rank_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) - - # get per rank stats by opname - self.per_rank_stats_by_opname = pd.concat( - describe_duration(df.groupby(["OpName", "OpType", "TaskType", "InputShapes"])["Duration"]).assign(Rank=df["Rank"][0]) for df in mapper_res) - self.per_rank_stats_by_opname.sort_values(by=["SumNs"], inplace=True, ascending=False) - - def run(self, context): - super().run(context) - mapper_res = self.mapper_func(context) - self.reducer_func(mapper_res) - - if self._export_type == "db": - self.save_db() - elif self._export_type == "notebook": - self.save_notebook() - else: - print("[ERROR] Unknown export type.") - - def save_notebook(self): - self.dump_data(self.all_rank_stats, os.path.join(self._get_output_dir(), "all_stats.csv")) - self.dump_data(self.per_rank_stats_by_optype, os.path.join(self._get_output_dir(), "rank_stats_by_optype.csv")) - self.dump_data(self.per_rank_stats_by_opname, os.path.join(self._get_output_dir(), "rank_stats_by_opname.csv")) - self.create_notebook("stats.ipynb") - self.add_helper_file("cluster_display.py") - - def save_db(self): - self.dump_data(self.all_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_ALL_RANK_STATS) - self.dump_data(self.per_rank_stats_by_optype, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_PER_RANK_STATS_BY_OPTYPE) - self.dump_data(self.per_rank_stats_by_opname, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_PER_RANK_STATS_BY_OPNAME) diff --git a/profiler/cluster_analyse/analysis/compute_op_sum/stats.ipynb b/profiler/cluster_analyse/analysis/compute_op_sum/stats.ipynb deleted file mode 100644 index 
c88d2684c1f8822818f62005355c444332aaa915..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/compute_op_sum/stats.ipynb +++ /dev/null @@ -1,164 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Compute Op Summary\n", - "\n", - "集群场景计算类算子数据分析\n", - "\n", - "主要包含以下3个统计内容:\n", - "1. 按算子类型和任务类型分组的,整个集群通信算子耗时的统计情况\n", - "2. 按算子类型和任务类型分组的,每个Rank上计算类算子的耗时情况\n", - "3. 按算子名称、任务类型、输入shape分组的,每个Rank上的计算类算子的耗时情况" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 数据准备" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, HTML\n", - "display(HTML(\"\"))\n", - "\n", - "import plotly.offline as pyo\n", - "\n", - "def is_lab_notebook():\n", - " import re\n", - " import psutil\n", - " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n", - "\n", - "if is_lab_notebook():\n", - " pyo.init_notebook_mode()\n", - "\n", - "import pandas as pd\n", - "pd.options.plotting.backend = \"plotly\"\n", - "pd.set_option(\"display.max_rows\", 100)\n", - "pd.set_option(\"display.width\", 1000)\n", - "\n", - "import cluster_display\n", - "\n", - "all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n", - "rank_stats_by_optype_df = pd.read_csv(\"rank_stats_by_optype.csv\", index_col=\"OpType\")\n", - "rank_stats_by_opname_df = pd.read_csv(\"rank_stats_by_opname.csv\", index_col=\"OpName\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 计算类算子耗时分析\n", - "\n", - "将整个集群所有Rank的计算类算子进行汇总,按算子类型和任务类型分类,统计分析耗时情况,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Count:算子数量\n", - "- Mean:平均耗时\n", - "- Std:标准差\n", - "- Min:最小值\n", - "- Q1:四分之一分位数\n", - "- Median:中位数\n", - "- Q3:四分之三分位数\n", - "- Max:最大值\n", - "- Sum:总耗时" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "display(all_stats_df)\n", - "fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"OpType\")\n", - "fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"OpType\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 单个Rank的计算类算子基于算子类型的耗时分析\n", - "将集群内每个Rank的计算类算子进行汇总,按算子类型和任务类型分类,统计分析耗时情况,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Count:算子数量\n", - "- Mean:平均耗时\n", - "- Std:标准差\n", - "- Min:最小值\n", - "- Q1:四分之一分位数\n", - "- Median:中位数\n", - "- Q3:四分之三分位数\n", - "- Max:最大值\n", - "- Sum:总耗时" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "rank_stats_gdf = rank_stats_by_optype_df.groupby(rank_stats_by_optype_df.index)\n", - "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 单个Rank的计算类算子基于算子名的耗时分析\n", - "\n", - "将集群内每个Rank的计算类算子进行汇总,按算子名称、任务类型、输入shape分类,统计分析耗时情况,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Count:算子数量\n", - "- Mean:平均耗时\n", - "- Std:标准差\n", - "- Min:最小值\n", - "- Q1:四分之一分位数\n", - "- Median:中位数\n", - "- Q3:四分之三分位数\n", - "- Max:最大值\n", - "- Sum:总耗时" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "rank_stats_gdf = 
rank_stats_by_opname_df.groupby(rank_stats_by_opname_df.index)\n", - "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.12.1" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/profiler/cluster_analyse/analysis/hccl_sum/__init__.py b/profiler/cluster_analyse/analysis/hccl_sum/__init__.py deleted file mode 100644 index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/hccl_sum/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/analysis/hccl_sum/hccl_sum.py b/profiler/cluster_analyse/analysis/hccl_sum/hccl_sum.py deleted file mode 100644 index da0c575e4683f1c51c4cf38e89b9c096c484777e..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/hccl_sum/hccl_sum.py +++ /dev/null @@ -1,133 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
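# hccl_sum.py: the HcclSum recipe. It aggregates communication operator
# durations by OpType across the cluster and per rank, and reports the top_num
# operators by mean duration together with the ranks holding their min and max.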
- -import os -import pandas as pd -from analysis.base_analysis import BaseRecipeAnalysis -from common_func.constant import Constant -from common_func.utils import describe_duration -from cluster_statistics_export.hccl_sum_export import HcclSumExport - - -class HcclSum(BaseRecipeAnalysis): - - TABLE_ALL_RANK_STATS = "HcclAllRankStats" - TABLE_PER_RANK_STATS = "HcclPerRankStats" - TABLE_TOP_OP_STATS = "HcclTopOpStats" - - TOP_NUM = "top_num" - DEFAULT_TOP_NUM = 15 - - def __init__(self, params): - super().__init__(params) - print("[INFO] HcclSum init.") - self.per_rank_stats = None - self.all_rank_stats = None - self.top_op_stats = None - self.top_num = params.get(self.TOP_NUM, self.DEFAULT_TOP_NUM) - - @property - def base_dir(self): - return os.path.basename(os.path.dirname(__file__)) - - @staticmethod - def _mapper_func(data_map, analysis_class): - df = HcclSumExport(data_map[1], analysis_class).read_export_db() - - if df is None or df.empty: - print(f"[WARNING] There is no stats data in {data_map[1]}.") - return None - - df["Rank"] = data_map[0] - return df - - def mapper_func(self, context): - return context.wait( - context.map( - self._mapper_func, - self._get_rank_db(), - analysis_class=self._recipe_name - ) - ) - - def reducer_func(self, mapper_res): - mapper_res = list(filter(lambda df: df is not None, mapper_res)) - if not mapper_res: - print("[ERROR] Mapper data is None.") - return - self.per_rank_stats = pd.concat( - describe_duration(df.groupby("OpType")["Duration"]).assign(Rank=df["Rank"][0]) for df in mapper_res) - self.per_rank_stats.sort_values(by=["Rank"], inplace=True) - all_op_data = pd.concat(mapper_res) - self.all_rank_stats = describe_duration(all_op_data.groupby("OpType")["Duration"]) - grouped_op_stats = all_op_data.groupby("OpName") - self.top_op_stats = describe_duration(grouped_op_stats["Duration"]).nlargest(self.top_num, "MeanNs") - min_rank = [] - max_rank = [] - for op_name in self.top_op_stats.index: - df = grouped_op_stats.get_group(op_name) - min_rank.append(df[df["Duration"] == df["Duration"].min()]["Rank"].values[0]) - max_rank.append(df[df["Duration"] == df["Duration"].max()]["Rank"].values[0]) - self.top_op_stats["MinRank"] = min_rank - self.top_op_stats["MaxRank"] = max_rank - - def run(self, context): - super().run(context) - if self.top_num <= 0: - print(f"[WARNING] HcclSum: top_num is set to a invalid value, " - f"it will be reset to default value({self.DEFAULT_TOP_NUM}).") - self.top_num = self.DEFAULT_TOP_NUM - mapper_res = self.mapper_func(context) - self.reducer_func(mapper_res) - - if self._export_type == "db": - self.save_db() - elif self._export_type == "notebook": - self.save_notebook() - else: - print("[ERROR] Unknown export type.") - - def save_notebook(self): - self.dump_data(self.all_rank_stats, os.path.join(self._get_output_dir(), "all_stats.csv")) - self.dump_data(self.per_rank_stats, os.path.join(self._get_output_dir(), "rank_stats.csv")) - self.dump_data(self.top_op_stats, os.path.join(self._get_output_dir(), "top_op_stats.csv")) - self.create_notebook("stats.ipynb") - self.add_helper_file("cluster_display.py") - - def save_db(self): - self.dump_data(self.all_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_ALL_RANK_STATS) - self.dump_data(self.per_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_PER_RANK_STATS) - self.dump_data(self.top_op_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_TOP_OP_STATS) - - @classmethod - def add_parser_argument(cls, parser): - 
BaseRecipeAnalysis.add_parser_argument(parser) - parser.add_argument("--top_num", type=int, help="Duration cost top count", default=cls.DEFAULT_TOP_NUM) - - @classmethod - def parse_argument(cls, args_parsed) -> dict: - argument_dict = BaseRecipeAnalysis.parse_argument(args_parsed) - argument_dict.update({ - cls.TOP_NUM: args_parsed.top_num - }) - return argument_dict - - @classmethod - def get_extra_argument(cls, params) -> dict: - argument_dict = BaseRecipeAnalysis.get_extra_argument(params) - argument_dict.update({ - cls.TOP_NUM: params.get(cls.TOP_NUM, cls.DEFAULT_TOP_NUM) - }) - return argument_dict diff --git a/profiler/cluster_analyse/analysis/hccl_sum/stats.ipynb b/profiler/cluster_analyse/analysis/hccl_sum/stats.ipynb deleted file mode 100644 index 87f8c6d736240531e2c28c0cf33df087ecfe38e8..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/hccl_sum/stats.ipynb +++ /dev/null @@ -1,162 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# HCCL Summary\n", - "\n", - "集群场景Hccl算子数据分析\n", - "\n", - "主要包含以下3个统计内容:\n", - "1. 按算子类型分组的,整个集群通信算子耗时的统计情况\n", - "2. 按算子类型分组的,每个Rank上通信算子的耗时情况\n", - "3. 整个集群平均耗时最久的TOP通信算子" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 数据准备" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, HTML\n", - "display(HTML(\"\"))\n", - "\n", - "import plotly.offline as pyo\n", - "\n", - "def is_lab_notebook():\n", - " import re\n", - " import psutil\n", - " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n", - "\n", - "if is_lab_notebook():\n", - " pyo.init_notebook_mode()\n", - "\n", - "import pandas as pd\n", - "pd.options.plotting.backend = \"plotly\"\n", - "pd.set_option(\"display.max_rows\", 100)\n", - "pd.set_option(\"display.width\", 1000)\n", - "\n", - "import cluster_display\n", - "\n", - "all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n", - "rank_stats_df = pd.read_csv(\"rank_stats.csv\", index_col=\"OpType\")\n", - "top_op_stats_df = pd.read_csv(\"top_op_stats.csv\", index_col=\"OpName\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 集群通信算子耗时分析\n", - "\n", - "将整个集群所有Rank的通信算子进行汇总,按算子类型分类,统计分析耗时情况,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Count:算子数量\n", - "- Mean:平均耗时\n", - "- Std:标准差\n", - "- Min:最小值\n", - "- Q1:四分之一分位数\n", - "- Median:中位数\n", - "- Q3:四分之三分位数\n", - "- Max:最大值\n", - "- Sum:总耗时" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "display(all_stats_df)\n", - "fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"Hccl OpType\")\n", - "fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"Hccl OpType\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 集群Rank通信算子耗时分析\n", - "\n", - "将集群内每个Rank的通信算子进行汇总,按算子类型分类,统计分析耗时情况,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Count:算子数量\n", - "- Mean:平均耗时\n", - "- Std:标准差\n", - "- Min:最小值\n", - "- Q1:四分之一分位数\n", - "- Median:中位数\n", - "- Q3:四分之三分位数\n", - "- Max:最大值\n", - "- Sum:总耗时" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "rank_stats_gdf = rank_stats_df.groupby(rank_stats_df.index)\n", - 
"cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 集群TOP-N通信算子耗时分析\n", - "\n", - "统计集群内耗时最多的TOP-N通信算子,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Count:算子数量\n", - "- Mean:平均耗时\n", - "- Std:标准差\n", - "- Min:最小值\n", - "- Q1:四分之一分位数\n", - "- Median:中位数\n", - "- Q3:四分之三分位数\n", - "- Max:最大值\n", - "- Sum:总耗时\n", - "- MinRank:耗时最少算子所在的Rank\n", - "- MaxRank:耗时最长算子所在的Rank" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "display(top_op_stats_df)\n", - "fig_top_op = cluster_display.display_duration_boxplots(None, top_op_stats_df, x_title=\"Hccl OpName\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.12.1" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/profiler/cluster_analyse/analysis/host_info_analysis.py b/profiler/cluster_analyse/analysis/host_info_analysis.py deleted file mode 100644 index 563711080ed3a20923ce73ec595b84892492e9f6..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/host_info_analysis.py +++ /dev/null @@ -1,96 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import os - -from analysis.base_analysis import BaseAnalysis -from common_func.constant import Constant -from common_func.db_manager import DBManager - - -class HostInfoAnalysis(BaseAnalysis): - - TABLE_HOST_INFO = "HOST_INFO" - TABLE_RANK_DEVICE_MAP = "RANK_DEVICE_MAP" - - def __init__(self, param: dict): - super().__init__(param) - self.all_rank_host_info = {} - self.all_rank_device_info = [] - - def run(self): - if self.data_type != Constant.DB: - return - self.analyze_host_info() - self.dump_db() - - def dump_db(self): - output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER) - conn, curs = DBManager.create_connect_db(result_db) - if not (conn and curs): - print(f"[ERROR] Failed to create db {Constant.DB_CLUSTER_COMMUNICATION_ANALYZER}") - return - self.dump_host_info(result_db, conn) - self.dump_rank_device_map(result_db, conn) - DBManager.destroy_db_connect(conn, curs) - - def dump_host_info(self, result_db, db_conn): - if not self.all_rank_host_info: - print(f"[WARNING] No host info data be analyzed.") - return - DBManager.create_tables(result_db, Constant.TABLE_HOST_INFO) - save_host_info = list(self.all_rank_host_info.items()) - sql = "insert into {} values ({value})".format(Constant.TABLE_HOST_INFO, - value="?," * (len(save_host_info[0]) - 1) + "?") - DBManager.executemany_sql(db_conn, sql, save_host_info) - - def dump_rank_device_map(self, result_db, db_conn): - if not self.all_rank_device_info: - print(f"[WARNING] No rank device map data be analyzed.") - return - self.all_rank_device_info.sort() - DBManager.create_tables(result_db, Constant.TABLE_RANK_DEVICE_MAP) - sql = "insert into {} values ({value})".format(Constant.TABLE_RANK_DEVICE_MAP, - value="?," * (len(self.all_rank_device_info[0]) - 1) + "?") - DBManager.executemany_sql(db_conn, sql, self.all_rank_device_info) - - def analyze_host_info(self): - print_empty_host_info = "" - for rank_id, profiling_dir in self.data_map.items(): - host_info = [] - rank_device_info = [] - db_path = os.path.join(profiling_dir, Constant.SINGLE_OUTPUT, f"ascend_pytorch_profiler_{rank_id}.db") - if (os.path.exists(db_path) and DBManager.check_tables_in_db(db_path, self.TABLE_HOST_INFO)): - conn, curs = DBManager.create_connect_db(db_path) - sql = "select * from {0}".format(self.TABLE_HOST_INFO) - host_info = DBManager.fetch_all_data(curs, sql, is_dict=False) - DBManager.destroy_db_connect(conn, curs) - if not (host_info and host_info[0]): - if not print_empty_host_info: - print_empty_host_info = f"[WARNING] No {self.TABLE_HOST_INFO} data in {self.data_type} file." 
- continue - if (os.path.exists(db_path) and DBManager.check_tables_in_db(db_path, self.TABLE_RANK_DEVICE_MAP)): - conn, curs = DBManager.create_connect_db(db_path) - sql = "select * from {0}".format(self.TABLE_RANK_DEVICE_MAP) - rank_device_info = DBManager.fetch_all_data(curs, sql, is_dict=False) - DBManager.destroy_db_connect(conn, curs) - host_uid, host_name = host_info[0][0], host_info[0][1] - for idx, data in enumerate(rank_device_info): - rank_device_info[idx] = list(data) + [host_uid, ] - self.all_rank_host_info[host_uid] = host_name - self.all_rank_device_info.extend(rank_device_info) - if print_empty_host_info: - print(print_empty_host_info) diff --git a/profiler/cluster_analyse/analysis/mstx_sum/__init__.py b/profiler/cluster_analyse/analysis/mstx_sum/__init__.py deleted file mode 100644 index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/mstx_sum/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/analysis/mstx_sum/mstx_sum.py b/profiler/cluster_analyse/analysis/mstx_sum/mstx_sum.py deleted file mode 100644 index 46a0e18abeee5cdd6b058d71e3a1bd2b97e7c29d..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/mstx_sum/mstx_sum.py +++ /dev/null @@ -1,204 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
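`mstx_sum.py` below pairs `*_start`/`*_stop` mark messages per thread by stacking start indices and popping one per matching stop; anything left unmatched is reported. A stripped-down, standalone sketch of that matching step (the mark records are invented):

```python
# Simplified model of the _start/_stop pairing in MstxSum._mapper_func;
# the (tid, msg) records here are hypothetical.
START_SUFFIX, STOP_SUFFIX = "_start", "_stop"

marks = [(1, "fwd_start"), (1, "bwd_start"), (1, "bwd_stop"), (1, "fwd_stop")]
open_marks, pairs, mismatched = {}, [], []

for idx, (tid, msg) in enumerate(marks):
    if msg.endswith(START_SUFFIX):
        name = msg[:-len(START_SUFFIX)]
        open_marks.setdefault(tid, {}).setdefault(name, []).append(idx)
    elif msg.endswith(STOP_SUFFIX):
        name = msg[:-len(STOP_SUFFIX)]
        stack = open_marks.get(tid, {}).get(name, [])
        if stack:
            pairs.append((stack.pop(), idx))  # latest start wins (LIFO)
        else:
            mismatched.append(idx)

print(pairs)       # [(1, 2), (0, 3)]
print(mismatched)  # []
```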
- -import os -import pandas as pd -from collections import namedtuple -from analysis.base_analysis import BaseRecipeAnalysis -from common_func.constant import Constant -from common_func.utils import describe_duration -from cluster_statistics_export.mstx_mark_export import MstxMarkExport -from cluster_statistics_export.mstx_step_export import MstxStepExport - - -MarkInfo = namedtuple("MarkInfo", ["name", "framework_duration", "cann_duration", "device_duration", - "tid", "start_ns"]) - - -def format_mark_info(df: pd.DataFrame, start_idx, stop_idx, name) -> MarkInfo: - start_series = df.iloc[start_idx] - stop_series = df.iloc[stop_idx] - return MarkInfo( - name=name, - framework_duration=float(stop_series["framework_ts"]-start_series["framework_ts"]), - cann_duration=float(stop_series["cann_ts"]-start_series["cann_ts"]), - device_duration=float(stop_series["device_ts"]-start_series["device_ts"]), - tid=start_series["tid"], - start_ns=start_series["cann_ts"] - ) - - -def rename_mark_msg_name(mark_stats_df: pd.DataFrame): - msg_idx_counter = {} - for idx, mark_info in enumerate(mark_stats_df.itertuples(index=False)): - msg_idx_counter.setdefault(mark_info.step_id, {}).setdefault(mark_info.name, []).append(idx) - for msg_dict in msg_idx_counter.values(): - for msg, idx_list in msg_dict.items(): - if len(idx_list) <= 1: - continue - for i, idx in enumerate(idx_list): - mark_stats_df.loc[idx, 'name'] = f"{msg}_{i}" - - -def compute_step_id(mark_stat, step_stats_df: pd.DataFrame): - for step_info in step_stats_df.itertuples(index=False): - if step_info.start_ns <= mark_stat.start_ns <= step_info.end_ns: - return step_info.step_id - print(f"[WARNING] {mark_stat.name} is not in any step.") - return 0 - - -def format_columns(df: pd.DataFrame): - formatted_df = df.rename( - { - "framework_duration": "FrameworkDurationNs", - "cann_duration": "CannDurationNs", - "device_duration": "DeviceDurationNs", - "duration": "DurationNs", - "step_id": "StepId", - "tid": "Tid", - "name": "Name" - }, - axis="columns" - ) - cols = [col for col in formatted_df.columns if not col.endswith("_ns") and col not in {"Tid"}] - return formatted_df[cols] - - -class MstxSum(BaseRecipeAnalysis): - - TABLE_FRAMEWORK_STATS = "MSTXAllFrameworkStats" - TABLE_CANN_STATS = "MSTXAllCannStats" - TABLE_DEVICE_STATS = "MSTXAllDeviceStats" - TABLE_MARK_STATS = "MSTXMarkStats" - - START_SUFFIX = "_start" - STOP_SUFFIX = "_stop" - - def __init__(self, params): - super().__init__(params) - print("[INFO] MstxSum init.") - self.mark_stats = None - self.all_fwk_stats = None - self.all_cann_stats = None - self.all_device_stats = None - - @property - def base_dir(self): - return os.path.basename(os.path.dirname(__file__)) - - @staticmethod - def _mapper_func(data_map, analysis_class): - step_df = MstxStepExport(data_map[1], analysis_class).read_export_db() - if step_df is None or step_df.empty: - step_df = pd.DataFrame({"start_ns": [0], "end_ns": [float("inf")], "step_id": [0]}) - mark_df = MstxMarkExport(data_map[1], analysis_class).read_export_db() - if mark_df is None or mark_df.empty: - print(f"[WARNING] There is no mark data in {data_map[1]}.") - return None - mark_df["framework_ts"] = mark_df["framework_ts"].astype("int64") - - mark_info = {} - mark_res = [] - mismatch_msg = [] - for idx, row in enumerate(mark_df.itertuples(index=False)): - if row.msg.endswith(MstxSum.START_SUFFIX): - msg = row.msg[:-len(MstxSum.START_SUFFIX)] - mark_info.setdefault(row.tid, {}).setdefault(msg, []).append(idx) - elif row.msg.endswith(MstxSum.STOP_SUFFIX): - 
msg = row.msg[:-len(MstxSum.STOP_SUFFIX)] - idx_list = mark_info.get(row.tid, {}).get(msg, []) - if not idx_list: - mismatch_msg.append((row.msg, idx)) - continue - start_idx = idx_list.pop() - mark_res.append(format_mark_info(mark_df, start_idx, idx, msg)) - - # 统计未匹配上的mark信息 - for msg_info in mark_info.values(): - for msg, idx_list in msg_info.items(): - if not idx_list: - continue - mismatch_msg.extend((msg + MstxSum.START_SUFFIX, idx) for idx in idx_list) - if mismatch_msg: - mismatch_msg.sort(key=lambda msg: msg[1]) - print(f"[WARNING] The following mark messages do not match anyone in " - f"rank {data_map[0]}: {','.join(msg[0] for msg in mismatch_msg)}.") - - mark_stats_df = pd.DataFrame(mark_res).assign(Rank=data_map[0]) - mark_stats_df["step_id"] = mark_stats_df.apply(compute_step_id, axis=1, step_stats_df=step_df) - rename_mark_msg_name(mark_stats_df) - mark_stats_df = format_columns(mark_stats_df).set_index("Name", drop=True) - return mark_stats_df - - def mapper_func(self, context): - return context.wait( - context.map( - self._mapper_func, - self._get_rank_db(), - analysis_class=self._recipe_name - ) - ) - - def reducer_func(self, mapper_res): - mapper_res = list(filter(lambda df: df is not None, mapper_res)) - if not mapper_res: - print("[ERROR] Mapper data is None.") - return - self.mark_stats = pd.concat(mapper_res) - all_fwk_stats = [] - all_cann_stats = [] - all_device_stats = [] - mark_step_df = self.mark_stats.groupby("StepId") - for step_id, df in mark_step_df: - name_gdf = df.groupby("Name") - fwk_stats = describe_duration(name_gdf["FrameworkDurationNs"]).assign(StepId=step_id) - fwk_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) - all_fwk_stats.append(fwk_stats) - cann_stats = describe_duration(name_gdf["CannDurationNs"]).assign(StepId=step_id) - cann_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) - all_cann_stats.append(cann_stats) - device_stats = describe_duration(name_gdf["DeviceDurationNs"]).assign(StepId=step_id) - device_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) - all_device_stats.append(device_stats) - self.all_fwk_stats = pd.concat(all_fwk_stats) - self.all_cann_stats = pd.concat(all_cann_stats) - self.all_device_stats = pd.concat(all_device_stats) - - def run(self, context): - super().run(context) - mapper_res = self.mapper_func(context) - self.reducer_func(mapper_res) - - if self._export_type == "db": - self.save_db() - elif self._export_type == "notebook": - self.save_notebook() - else: - print("[ERROR] Unknown export type.") - - def save_notebook(self): - self.dump_data(self.mark_stats, os.path.join(self._get_output_dir(), "mark_stats.csv")) - self.dump_data(self.all_fwk_stats, os.path.join(self._get_output_dir(), "all_fwk_stats.csv")) - self.dump_data(self.all_cann_stats, os.path.join(self._get_output_dir(), "all_cann_stats.csv")) - self.dump_data(self.all_device_stats, os.path.join(self._get_output_dir(), "all_device_stats.csv")) - self.create_notebook("stats.ipynb") - self.add_helper_file("cluster_display.py") - - def save_db(self): - self.dump_data(self.mark_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_MARK_STATS) - self.dump_data(self.all_fwk_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_FRAMEWORK_STATS) - self.dump_data(self.all_cann_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_CANN_STATS) - self.dump_data(self.all_device_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_DEVICE_STATS) diff --git 
a/profiler/cluster_analyse/analysis/mstx_sum/stats.ipynb b/profiler/cluster_analyse/analysis/mstx_sum/stats.ipynb deleted file mode 100644 index 84672bc72b97b02717c3a4110ab1b4dd827adafd..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/mstx_sum/stats.ipynb +++ /dev/null @@ -1,180 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# MSTX Summary\n", - "\n", - "集群场景MSTX打点数据分析\n", - "\n", - "主要包含以下2个统计内容:\n", - "1. 按Step分组的,整个集群MSTX打点数据的统计情况\n", - "2. 按Name分组的,每个Rank上MSTX打点数据的统计情况" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 数据准备" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display, HTML\n", - "display(HTML(\"\"))\n", - "\n", - "import plotly.offline as pyo\n", - "\n", - "def is_lab_notebook():\n", - " import re\n", - " import psutil\n", - " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n", - "\n", - "if is_lab_notebook():\n", - " pyo.init_notebook_mode()\n", - "\n", - "import pandas as pd\n", - "pd.options.plotting.backend = \"plotly\"\n", - "pd.set_option(\"display.max_rows\", 100)\n", - "pd.set_option(\"display.width\", 1000)\n", - "\n", - "import cluster_display\n", - "\n", - "all_fwk_stats_gdf = pd.read_csv(\"all_fwk_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n", - "all_cann_stats_gdf = pd.read_csv(\"all_cann_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n", - "all_device_stats_gdf = pd.read_csv(\"all_device_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n", - "mark_stats_df = pd.read_csv(\"mark_stats.csv\", index_col=\"Name\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 集群MSTX数据分析\n", - "\n", - "将整个集群所有Rank的MSTX数据进行汇总,按Step划分,统计分析耗时情况,时间单位为微秒(us)\n", - "打点数据分为三种:\n", - "1. 框架侧耗时:Framework Time\n", - "2. Cann侧耗时:Cann Time\n", - "3. Device侧耗时:Devcie Time\n", - "\n", - "3种数据都包含以下统计项:\n", - "- Count:数量\n", - "- Mean:平均耗时\n", - "- Std:标准差\n", - "- Min:最小值\n", - "- Q1:四分之一分位数\n", - "- Median:中位数\n", - "- Q3:四分之三分位数\n", - "- Max:最大值\n", - "- Sum:总耗时" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def display_stats_mstx_step_combobox(selected, args):\n", - " step = selected\n", - " fwk_stats_gdf, cann_stats_gdf, device_stats_gdf = args\n", - " fwk_df = fwk_stats_gdf.get_group(step)\n", - " cann_df = cann_stats_gdf.get_group(step)\n", - " device_df = device_stats_gdf.get_group(step)\n", - " figs = []\n", - " display(HTML(\"
Framework Time Stats
\"))\n", - " display(fwk_df)\n", - " cluster_display.display_duration_boxplots(figs, fwk_df, title=\"Framework Time\", x_title=\"Name\", y_title=\"Time\")\n", - " display(HTML(\"
Cann Time Stats
\"))\n", - " display(cann_df)\n", - " cluster_display.display_duration_boxplots(figs, cann_df, title=\"Cann Time\", x_title=\"Name\", y_title=\"Time\")\n", - " display(HTML(\"
Device Time Stats
\"))\n", - " display(device_df)\n", - " cluster_display.display_duration_boxplots(figs, device_df, title=\"Device Time\", x_title=\"Name\", y_title=\"Time\")\n", - "\n", - "steps = list(all_fwk_stats_gdf.groups.keys())\n", - "if steps:\n", - " cluster_display.display_stats_optional_combobox(steps, display_stats_mstx_step_combobox, \n", - " [all_fwk_stats_gdf, all_cann_stats_gdf, all_device_stats_gdf], \"Step:\")\n", - "else:\n", - " print(\"There is no step in stats, so no need to display\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 集群Rank MSTX数据分析\n", - "\n", - "将集群内每个Rank的MSTX数据进行汇总,按打点Name分类,统计分析耗时情况,时间单位为微秒(us)\n", - "\n", - "包含以下统计项:\n", - "- Name:打点名称\n", - "- FrameworkDuration(Us):框架侧耗时\n", - "- CannDuration(Us):Cann侧耗时\n", - "- DeviceDuration(Us):Device侧耗时\n", - "- Rank:Rank序号\n", - "- StepId:Step序号" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def display_mstx_duration_by_rank(selected, args):\n", - " mark_stats_gdf = args\n", - " df = mark_stats_gdf.get_group(selected).sort_values(\"Rank\")\n", - " display(df)\n", - " fwk_duration = []\n", - " cann_duration = []\n", - " device_duration = []\n", - " step_ids = []\n", - " for step_id, step_df in df.groupby(\"StepId\"):\n", - " fwk_duration.append((step_id, step_df[\"FrameworkDuration(Us)\"].values))\n", - " cann_duration.append((step_id, step_df[\"CannDuration(Us)\"].values))\n", - " device_duration.append((step_id, step_df[\"DeviceDuration(Us)\"].values))\n", - " step_ids.append(step_id)\n", - " fwk_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in fwk_duration], axis=1)\n", - " cann_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in cann_duration], axis=1)\n", - " device_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in device_duration], axis=1)\n", - " figs = []\n", - " ranks = df[\"Rank\"].drop_duplicates()\n", - " cluster_display.display_graph(figs, ranks, fwk_df[step_ids],\n", - " title=\"Framework Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", - " cluster_display.display_graph(figs, ranks, cann_df[step_ids],\n", - " title=\"Cann Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", - " cluster_display.display_graph(figs, ranks, device_df[step_ids],\n", - " title=\"Device Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", - "\n", - "mark_stats_gdf = mark_stats_df.groupby(mark_stats_df.index)\n", - "names = list(mark_stats_gdf.groups.keys())\n", - "if steps:\n", - " cluster_display.display_stats_optional_combobox(names, display_mstx_duration_by_rank, mark_stats_gdf, \"Name:\")\n", - "else:\n", - " print(\"There is no mark name in stats, so no need to display\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.12.1" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/profiler/cluster_analyse/analysis/step_trace_time_analysis.py b/profiler/cluster_analyse/analysis/step_trace_time_analysis.py deleted file mode 100644 index 6a886fffa97b142e8267066117f561154d85b162..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/analysis/step_trace_time_analysis.py +++ /dev/null @@ -1,126 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os - -from common_func.db_manager import DBManager -from common_func.constant import Constant -from common_func.file_manager import FileManager -from prof_bean.step_trace_time_bean import StepTraceTimeBean - - -class StepTraceTimeAnalysis: - CLUSTER_TRACE_TIME_CSV = "cluster_step_trace_time.csv" - CLUSTER_TRACE_TIME_TABLE = "ClusterStepTraceTime" - - def __init__(self, param: dict): - self.collection_path = param.get(Constant.COLLECTION_PATH) - self.data_map = param.get(Constant.DATA_MAP) - self.communication_group = param.get(Constant.COMM_DATA_DICT, {}).get(Constant.COMMUNICATION_GROUP) - self.step_time_dict = {} - self.step_data_list = [] - self.data_type = param.get(Constant.DATA_TYPE) - - @staticmethod - def get_max_data_row(data_group_list: list): - if not data_group_list: - return [] - ret = [] - for idx in range(len(data_group_list[0])): - max_val = 0 - for idy in range(len(data_group_list)): - max_val = max(max_val, data_group_list[idy][idx]) - ret.append(max_val) - return ret - - def run(self): - self.load_step_trace_time_data() - self.analyze_step_time() - self.dump_data() - - def dump_data(self): - if not self.step_data_list: - print("[WARNING] Can't get step time info!") - return - if self.data_type == Constant.TEXT: - headers = self.get_headers() - FileManager.create_csv_file(self.collection_path, self.step_data_list, self.CLUSTER_TRACE_TIME_CSV, headers) - else: - output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER) - DBManager.create_tables(result_db, self.CLUSTER_TRACE_TIME_TABLE) - column_len = DBManager.get_table_column_count(result_db, self.CLUSTER_TRACE_TIME_TABLE) - data_len = len(self.step_data_list[0]) - if data_len < column_len: - for data in self.step_data_list: - data.extend([0] * (column_len - data_len)) - conn, cursor = DBManager.create_connect_db(result_db) - sql = "insert into {} values ({value})".format(self.CLUSTER_TRACE_TIME_TABLE, - value="?," * (len(self.step_data_list[0]) - 1) + "?") - DBManager.executemany_sql(conn, sql, self.step_data_list) - DBManager.destroy_db_connect(conn, cursor) - - def load_step_trace_time_data(self): - for rank_id, profiling_dir_path in self.data_map.items(): - if self.data_type == Constant.TEXT: - step_time_file = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.STEP_TIME_CSV) - if os.path.exists(step_time_file): - self.step_time_dict[rank_id] = FileManager.read_csv_file(step_time_file, StepTraceTimeBean) - else: - step_time_file = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, - Constant.DB_COMMUNICATION_ANALYZER) - if (os.path.exists(step_time_file) and - DBManager.check_tables_in_db(step_time_file, Constant.TABLE_STEP_TRACE)): - conn, cursor = DBManager.create_connect_db(step_time_file) - sql = "select * from {0}".format(Constant.TABLE_STEP_TRACE) - data = DBManager.fetch_all_data(cursor, sql, is_dict=False) - 
self.step_time_dict[rank_id] = data - DBManager.destroy_db_connect(conn, cursor) - if not self.step_time_dict.get(rank_id): - print(f"[WARNING] Rank {rank_id} does not have a valid step_trace_time data in {self.data_type} file.") - - def analyze_step_time(self): - for rank_id, data_bean_list in self.step_time_dict.items(): - for data_bean in data_bean_list: - if self.data_type == Constant.TEXT: - self.step_data_list.append([data_bean.step, Constant.RANK, rank_id] + data_bean.row) - else: - self.step_data_list.append([data_bean[0], Constant.RANK, rank_id] + list(data_bean[1:])) - stage_list = self.communication_group.get(Constant.P2P) - if not stage_list: - return - step_group_dict = {} - for data_list in self.step_data_list: - stage_group = tuple() - for stage in stage_list: - if data_list[2] in stage: - stage_group = tuple(stage) - break - key = (data_list[0], stage_group) - step_group_dict.setdefault(key, []).append(data_list[3:]) - - for key, data_group_list in step_group_dict.items(): - if self.data_type == Constant.TEXT: - self.step_data_list.append([key[0], Constant.STAGE, key[1]] + self.get_max_data_row(data_group_list)) - else: - index = "(" + ",".join(str(i) for i in key[1]) + ")" - self.step_data_list.append([key[0], Constant.STAGE, index] + self.get_max_data_row(data_group_list)) - - def get_headers(self): - if self.step_time_dict: - for rank in self.step_time_dict: - if self.step_time_dict.get(rank): - return self.step_time_dict[rank][0].all_headers - return [] diff --git a/profiler/cluster_analyse/cluster_analysis.py b/profiler/cluster_analyse/cluster_analysis.py deleted file mode 100644 index a8d01dcfe348be6b47c0a71099cedab64b6b3e06..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_analysis.py +++ /dev/null @@ -1,148 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
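`cluster_analysis.py` below parses the command line in two stages: the main parser takes `-d`/`-m` via `parse_known_args`, and whatever remains is handed to the recipe's own parser (see `get_analysis_args`). A minimal reproduction of the mechanism, using the `--top_num` flag that `HcclSum` registers (the sample argv is invented):

```python
# Two-stage parsing sketch mirroring cluster_analysis_main/get_analysis_args.
import argparse

main_parser = argparse.ArgumentParser(description="cluster analysis module")
main_parser.add_argument("-d", "--collection_path", required=True)
main_parser.add_argument("-m", "--mode", default="all")
args, remainder = main_parser.parse_known_args(
    ["-d", "/data/cluster", "-m", "hccl_sum", "--top_num", "20"])

recipe_parser = argparse.ArgumentParser(description="custom analysis args")
recipe_parser.add_argument("--parallel_mode", type=str, default="concurrent")
recipe_parser.add_argument("--export_type", type=str, default="db")
recipe_parser.add_argument("--top_num", type=int, default=15)
recipe_args = recipe_parser.parse_args(remainder)

print(args.mode, recipe_args.top_num)  # hccl_sum 20
```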
- -import argparse -import os - -from cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor -from cluster_data_preprocess.mindspore_data_preprocessor import MindsporeDataPreprocessor -from communication_group.communication_group_generator import CommunicationGroupGenerator -from common_func.constant import Constant -from common_func.file_manager import FileManager -from common_func.path_manager import PathManager -from common_func import analysis_loader -from analysis.analysis_facade import AnalysisFacade - -COMM_FEATURE_LIST = ['all', 'communication_time', 'communication_matrix'] -ALL_FEATURE_LIST = ['all', 'communication_time', 'communication_matrix', 'cann_api_sum', 'hccl_sum', 'compute_op_sum', - 'mstx_sum'] - - -def get_analysis_args(analysis_class, analysis_args): - parser = argparse.ArgumentParser(description="custom analysis args") - parser.add_argument("--parallel_mode", type=str, help="context mode", default="concurrent") - parser.add_argument("--export_type", type=str, help="export type", default="db") - analysis_class[1].add_parser_argument(parser) - return parser.parse_args(analysis_args) - -def parse_specific_params(analysis_name, analysis_args): - analysis_class = analysis_loader.get_class_from_name(analysis_name) - if not analysis_class: - print("[ERROR] undefined analysis.") - return None - - args_parsed = get_analysis_args(analysis_class, analysis_args) - specific_params = { - Constant.RECIPE_NAME: analysis_class[0], - Constant.RECIPE_CLASS: analysis_class[1], - Constant.PARALLEL_MODE: args_parsed.parallel_mode, - Constant.EXPORT_TYPE: args_parsed.export_type - } - specific_params.update(analysis_class[1].parse_argument(args_parsed)) - return specific_params - -class Interface: - ASCEND_PT = "ascend_pt" - ASCEND_MS = "ascend_ms" - - - def __init__(self, params: dict): - self.collection_path = PathManager.get_realpath(params.get(Constant.COLLECTION_PATH)) - self.analysis_mode = params.get(Constant.ANALYSIS_MODE) - self.data_map = {} - self.communication_group = {} - self.collective_group_dict = {} - self.communication_ops = [] - self.matrix_ops = [] - self.origin_params = params - - def allocate_prof_data(self): - ascend_pt_dirs = [] - ascend_ms_dirs = [] - for root, dirs, files in os.walk(self.collection_path): - for dir_name in dirs: - if dir_name.endswith(self.ASCEND_PT): - ascend_pt_dirs.append(os.path.join(root, dir_name)) - if dir_name.endswith(self.ASCEND_MS): - ascend_ms_dirs.append(os.path.join(root, dir_name)) - pytorch_processor = PytorchDataPreprocessor(ascend_pt_dirs) - pt_data_map = pytorch_processor.get_data_map() - data_type = pytorch_processor.get_data_type() - ms_data_map = MindsporeDataPreprocessor(ascend_ms_dirs).get_data_map() - if pt_data_map and ms_data_map: - print("[ERROR] Can not analyze pytorch and mindspore meantime.") - return [] - return (pt_data_map, data_type) if pt_data_map else (ms_data_map, Constant.TEXT) - - def run(self): - PathManager.check_input_directory_path(self.collection_path) - PathManager.check_path_owner_consistent(self.collection_path) - data_map, data_type = self.allocate_prof_data() - if not data_map: - print("[WARNING] Can not get rank info or profiling data.") - return - if data_type == Constant.INVALID: - print("[ERROR] The current folder contains both DB and other files. Please check.") - return - if self.analysis_mode not in COMM_FEATURE_LIST: - if data_type != Constant.DB: - print("[ERROR] The current analysis node only supports DB as input data. 
Please check.") - return - FileManager.create_output_dir(self.collection_path, is_overwrite=True) - params = { - Constant.COLLECTION_PATH: self.collection_path, - Constant.DATA_MAP: data_map, - Constant.DATA_TYPE: data_type, - Constant.RECIPE_NAME: self.origin_params.get(Constant.RECIPE_NAME, ""), - Constant.RECIPE_CLASS: self.origin_params.get(Constant.RECIPE_CLASS), - Constant.PARALLEL_MODE: self.origin_params.get(Constant.PARALLEL_MODE, ""), - Constant.EXPORT_TYPE: self.origin_params.get(Constant.EXPORT_TYPE, "") - } - params.update(params[Constant.RECIPE_CLASS].get_extra_argument(self.origin_params)) - AnalysisFacade(params).recipe_analyze() - else: - FileManager.create_output_dir(self.collection_path) - params = { - Constant.COLLECTION_PATH: self.collection_path, - Constant.DATA_MAP: data_map, - Constant.ANALYSIS_MODE: self.analysis_mode, - Constant.DATA_TYPE: data_type - } - comm_data_dict = CommunicationGroupGenerator(params).generate() - params[Constant.COMM_DATA_DICT] = comm_data_dict - AnalysisFacade(params).cluster_analyze() - - -def cluster_analysis_main(args=None): - parser = argparse.ArgumentParser(description="cluster analysis module") - parser.add_argument('-d', '--collection_path', type=str, required=True, help="profiling data path") - parser.add_argument('-m', '--mode', choices=ALL_FEATURE_LIST, - default='all', help="different analysis mode") - args_parsed, args_remained = parser.parse_known_args(args=args) - parameter = { - Constant.COLLECTION_PATH: args_parsed.collection_path, - Constant.ANALYSIS_MODE: args_parsed.mode - } - if args_parsed.mode in COMM_FEATURE_LIST: - if args_remained: - print(f"[ERROR] The specific argument {args_remained} is not supported for communication analysis.") - return - else: - parameter.update(parse_specific_params(args_parsed.mode, args_remained)) - Interface(parameter).run() - - -if __name__ == "__main__": - cluster_analysis_main() diff --git a/profiler/cluster_analyse/cluster_data_preprocess/__init__.py b/profiler/cluster_analyse/cluster_data_preprocess/__init__.py deleted file mode 100644 index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/cluster_data_preprocess/data_preprocessor.py b/profiler/cluster_analyse/cluster_data_preprocess/data_preprocessor.py deleted file mode 100644 index 72d65ae6571e68564e46f43463843d1f46a3a69e..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/data_preprocessor.py +++ /dev/null @@ -1,41 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import os -from abc import abstractmethod - - -class DataPreprocessor: - PROFILER_INFO_HEAD = 'profiler_info_' - PROFILER_INFO_EXTENSION = '.json' - - def __init__(self, path_list: list): - self.path_list = path_list - self.data_map = {} - - @abstractmethod - def get_data_map(self): - pass - - def get_rank_id(self, dir_name: str) -> int: - files = os.listdir(dir_name) - for file_name in files: - if file_name.startswith(self.PROFILER_INFO_HEAD) and file_name.endswith(self.PROFILER_INFO_EXTENSION): - rank_id_str = file_name[len(self.PROFILER_INFO_HEAD): -1 * len(self.PROFILER_INFO_EXTENSION)] - try: - rank_id = int(rank_id_str) - except ValueError: - rank_id = -1 - return rank_id - return -1 diff --git a/profiler/cluster_analyse/cluster_data_preprocess/mindspore_data_preprocessor.py b/profiler/cluster_analyse/cluster_data_preprocess/mindspore_data_preprocessor.py deleted file mode 100644 index a3e09983ddb54b972a9e343c1661b5c8b2cbb8c8..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/mindspore_data_preprocessor.py +++ /dev/null @@ -1,41 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from collections import defaultdict - -from cluster_data_preprocess.data_preprocessor import DataPreprocessor - - -class MindsporeDataPreprocessor(DataPreprocessor): - - def __init__(self, path_list: list): - super().__init__(path_list) - - def get_data_map(self) -> dict: - rank_id_map = defaultdict(list) - for dir_name in self.path_list: - rank_id = self.get_rank_id(dir_name) - if rank_id < 0: - print('[Error]fail to get rankid or rankid invalid.') - continue - rank_id_map[rank_id].append(dir_name) - - try: - for (rank_id, dir_list) in rank_id_map.items(): - dir_list.sort(key=lambda x: x.split('_')[-3]) - self.data_map[rank_id] = dir_list[0] - except Exception as e: - raise RuntimeError("Found invalid directory name!") from e - return self.data_map diff --git a/profiler/cluster_analyse/cluster_data_preprocess/pytorch_data_preprocessor.py b/profiler/cluster_analyse/cluster_data_preprocess/pytorch_data_preprocessor.py deleted file mode 100644 index 55c3d03958b97c427fe8fde0625e72ea4dee8997..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_data_preprocess/pytorch_data_preprocessor.py +++ /dev/null @@ -1,56 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import glob -from collections import defaultdict -import os - -from cluster_data_preprocess.data_preprocessor import DataPreprocessor -from common_func.constant import Constant -from common_func.file_manager import FileManager - - -class PytorchDataPreprocessor(DataPreprocessor): - - def __init__(self, path_list: list): - super().__init__(path_list) - self.data_type = set() - - def get_data_map(self) -> dict: - rank_id_map = defaultdict(list) - for dir_name in self.path_list: - rank_id = self.get_rank_id(dir_name) - if rank_id < 0: - print('[Error]fail to get rankid or rankid invalid.') - continue - for file_name in os.listdir(dir_name): - if file_name.startswith(self.PROFILER_INFO_HEAD) and file_name.endswith(self.PROFILER_INFO_EXTENSION): - file_path = os.path.join(dir_name, file_name) - config = FileManager.read_json_file(file_path) - self.data_type.add(config.get(Constant.CONFIG, {}).get(Constant.EXPER_CONFIG, {}). - get(Constant.EXPORT_TYPE, Constant.TEXT)) - rank_id_map[rank_id].append(dir_name) - - try: - for (rank_id, dir_list) in rank_id_map.items(): - dir_list.sort(key=lambda x: x.split('_')[-3]) - self.data_map[rank_id] = dir_list[0] - except Exception as e: - raise RuntimeError("Found invalid directory name!") from e - return self.data_map - - def get_data_type(self): - if len(self.data_type) == 1: - return self.data_type.pop() - return Constant.INVALID diff --git a/profiler/cluster_analyse/cluster_kernels_analysis/README.md b/profiler/cluster_analyse/cluster_kernels_analysis/README.md deleted file mode 100644 index f90f99fb9b3058d5ad67728b45da1c07f03e65e5..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_kernels_analysis/README.md +++ /dev/null @@ -1,67 +0,0 @@ -# 功能介绍 -集群场景下,多卡间的算子情况,只能通过查看每张卡各自的性能数据来了解,不能直观的对比各卡之间算子的性能差异。 -cluster_op_summary_analysis.py脚本基于多卡性能数据的op_summary信息,统计并展示各卡中执行最快、最慢、均值和方差的TopN算子。 - -## 交附件 -### cluster_op_time_ analysis.csv -将算子以op_name、input_shape、input_size、output_shape进行分类,统计每一类算子,在不同节点(node)的不同卡(device)上,执行时间的最大、最小、方差、平均时间以及范围。 -### xxx_info.html - -主要是各个特性(time和ratio)的html文件,以html方式展示top_n算子的箱线图。 - -time和ratio表示AI Core和AI Vector Core算子性能指标中的耗时和占比字段。 - -以html文件展示TopN算子执行耗时和占比的箱线图。 - -有TopN个算子就会有TopN个坐标系,每个坐标系表示一个算子的特性,以total_time的平均值从左向右依次向下排序。 - -- 横坐标:node_device表示第几个node的第几张卡,从小到大排序。 -- 纵坐标:时间。 -- 坐标名:在坐标下方,以op_name-input_shape拼接展示。 - -# 操作指导 - -1. 准备性能数据 - - 拷贝所有node上的性能数据到一个环境里,性能数据必须包含在node*目录下,例如当前集群场景为2机16卡,那么就是两个node分别有八个device,拷贝性能数据目录如下: - - ```bash - ├── node0 # 可以是node0或nodeo_xxx,表示某个节点 - │ ├── PROF_XXXXX # 单个device的性能数据,须完成msprof性能数据解析 - │ ├── SUMMARY - │ ├── op_summary_XX.csv - | ...... # 一共八张卡的性能数据 - ├── node1 # 可以是node1 或者node1_xxx表示某个节点 - │ ├── PROF_XXXXX # 单个device的profiling数据 - │ ├── SUMMARY - │ ├── op_summary_XX.csv # 用来做解析的op_summary表格 - | ...... - ``` - -2. 拷贝脚本准备环境 - - 将cluster_prof_Info_analysis.py脚本拷贝到一个文件夹里,并安装对应的Python库。 - - ```bash - pip install pandas - pip install ploty - ``` - -3. 
运行脚本 - - ```bash - python3 cluster_prof_Info_analysis.py –d data_path -t type -n top_n - ``` - - - -d:集群场景性能数据目录,输入node的上一级目录。 - - -t:获取分析信息结果文件类型,可取值:html、csv、all,默认html。 - - -n:html分析独有,表示需要展示的是平均时间top_n的算子,默认10,配置超过30时需要一定时间。 - -异常情况处理: - -- -n参数必须大于0,如果输入<=0, 默认只导出一个算子的数据。 -- 配置-n参数值大于算子总数时,按等于算子数处理。 -- 部分没有op_summary的,不显示也不报错。 -- 目录下不存在op_summary时,执行报错无法找到数据文件。 -- op_summary列数据错误或读不到数据时,提示具体出错文件。 -- -t参数配置错误时,提示输入错误,并提示正确的配置。 diff --git a/profiler/cluster_analyse/cluster_kernels_analysis/__init__.py b/profiler/cluster_analyse/cluster_kernels_analysis/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/profiler/cluster_analyse/cluster_kernels_analysis/cluster_prof_Info_analysis.py b/profiler/cluster_analyse/cluster_kernels_analysis/cluster_prof_Info_analysis.py deleted file mode 100644 index 27e3c229c56d7c2a1afe6ae49d98c96b19bc55ff..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_kernels_analysis/cluster_prof_Info_analysis.py +++ /dev/null @@ -1,327 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import sys -import argparse -import re -import os -import stat -import shutil -import warnings -from pathlib import Path - -import pandas as pd -import plotly.graph_objects as go -from plotly.subplots import make_subplots -from plotly.offline import plot - -sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) - -from common_func.path_manager import PathManager - - -MAX_READ_FILE_BYTES = 64 * 1024 * 1024 - - -class FormDataProcessor: - def __init__(self, path, form_name): - self.form_name = form_name - self.files = self.get_files_with_prefix_recursive(path, form_name) - - def get_files_with_prefix_recursive(self, csv_path, match_str): - matched_ir_files = list(Path(csv_path).rglob(match_str)) - if not matched_ir_files: - msg = f"Didn't find any file in folder {csv_path} that matches {match_str}" - raise RuntimeError(msg) - return [str(item) for item in matched_ir_files] - - def readSummaryData(self, columns_to_keep): - # 存储所有合并后的数据 - all_data = pd.DataFrame() - for f in self.files: - if "mindstudio_profiler_output" in f: - continue - # 判断csv文件大小 - PathManager.check_path_readable(f) - # 读取CSV文件 - df = pd.read_csv(f) - # 保留需要的列 - try: - df = df[columns_to_keep] - except KeyError: - print(f"{f}文件没有所需的列,请确认profiling数据的正确性:\n,以下列可能不存在{columns_to_keep}\n") - continue - # 从文件名提取设备ID - try: - df['device_id'] = self.getDeviceId(f) - except Exception: - print(f"文件 \"{f}\" 的路径或者是文件夹名没有按照要求,请确保存在[device_]这一级文件夹,具体操作指导见readme\n") - continue - # 添加新列 "device_id" - try: - df['node_id'] = self.getNodeId(f) - except Exception: - print(f"文件 \"{f}\" 的路径或者是文件夹名没有按照要求,请确保存在[node*]这一级文件夹,具体操作指导见readme\n") - continue - # 将数据添加到最终的数据框中 - all_data = pd.concat([all_data, df]) - return all_data - - def getChipType(self): - file = self.files[0] - df = pd.read_csv(file) - if 'aiv_time(us)' in 
df.columns: - return "ASCEND_NEW" - return "ASCEND_OTHER" - - def getDeviceId(self, dir_path): - device_id = re.search(r'device_(\d+)', dir_path).group(1) - return device_id - - def getNodeId(self, dir_path): - node_id = re.search(r'node(\d+)', dir_path).group(1) - return int(node_id) - - def getRankNum(self): - return len(self.files) - - -# 表驱动,获取不同芯片类型不同交付件的所需的列 -class ViewInfoManager: - def __init__(self, chip_type): - self.chip_type = chip_type - self.op_summary_columns_dict = {} - self.setOpSummaryColumnsParams() - - def setOpSummaryColumnsParams(self): - # 有些数据除了用表格的列进行分组之外,还添加了其他属性对数据进行分类,这部分数据放在extend_attr_to_group里面 - self.op_summary_columns_dict = { - 'ASCEND_NEW': { - 'TimeToCsvAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - 'extend_attr_to_group': ["device_id", "node_id"], - 'columns_to_view': ["Task Duration(us)"], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - }, - 'StatisticalInfoToHtmlAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - "columns_to_view": ["Task Duration(us)", "aiv_time(us)", "aiv_vec_ratio", - "aiv_scalar_ratio", "aiv_mte2_ratio", "aiv_mte3_ratio", - "aicore_time(us)", "aic_mac_ratio", "aic_scalar_ratio", - "aic_mte1_ratio", "aic_mte2_ratio", "aic_fixpipe_ratio" - ], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - } - }, - 'ASCEND_OTHER': { - 'TimeToCsvAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - 'extend_attr_to_group': ["device_id", "node_id"], - "columns_to_view": ["Task Duration(us)"], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - }, - 'StatisticalInfoToHtmlAnalyzer': - {'columns_to_group': ["Op Name", "Input Shapes", "Input Data Types", "Output Shapes"], - "columns_to_view": ["aicore_time(us)", "Task Duration(us)", "mac_ratio", "vec_ratio", - "scalar_ratio", "mte1_ratio", "mte2_ratio", "mte3_ratio"], - 'calculate_fun': ['mean', 'var', 'max', 'min'] - } - } - } - - def getColumnsInfo(self, analyzer_type): - return self.op_summary_columns_dict.get(self.chip_type, {}).get(analyzer_type) - - -class OpSummaryAnalyzerBase: - def __init__(self, chip_type, analyzer_type, dir_path): - self.chip_type = chip_type - view_info = ViewInfoManager(chip_type).getColumnsInfo(analyzer_type) - self.columns_to_view = view_info['columns_to_view'] - self.calculate_fun = view_info['calculate_fun'] - self.columns_to_group = view_info['columns_to_group'] - self.attrs_to_group = self.columns_to_group.copy() - if 'extend_attr_to_group' in view_info: - extend_attr_to_group = view_info['extend_attr_to_group'] - self.attrs_to_group.extend(extend_attr_to_group) - # 创建结果文件 - self.result_dir = os.path.join(dir_path, "result") - PathManager.check_path_length(self.result_dir) - if os.path.exists(self.result_dir): - shutil.rmtree(self.result_dir, onerror=self.on_rm_error) - PathManager.check_path_writeable(dir_path) - PathManager.make_dir_safety(self.result_dir) - - def getColumnsToGroup(self): - return self.columns_to_group - - def getColumnsToView(self): - return self.columns_to_view - - def calculateViewData(self, summary_data): - # 存储所有合并后的数据 - calculate_dict = {self.columns_to_view[i]: self.calculate_fun for i in range(len(self.columns_to_view))} - view_data = summary_data.groupby(self.attrs_to_group).agg(calculate_dict).reset_index() - return view_data - - def on_rm_error(self, func, path, exc_info): - # path contains the path of the file that couldn't be removed - # let's just assume that it's read-only and 
unlink it. - os.chmod(path, stat.S_IWRITE) - os.unlink(path) - - -class TimeToCsvAnalyzer(OpSummaryAnalyzerBase): - def __init__(self, chip_type, dir_path): - super().__init__(chip_type, "TimeToCsvAnalyzer", dir_path) - - def GenerateDeliverable(self, summary_data, rank_num): - view_data = self.calculateViewData(summary_data) - # 规范化列名 - view_data.columns = [''.join(col) if col[1] == "" else '_'.join(col) for col in view_data.columns] - try: - for column in self.columns_to_view: - view_data[column + '_range'] = view_data[column + '_max'] - view_data[column + '_min'] - except Exception as e: - raise RuntimeError("Invalid view data!") from e - save_path = os.path.join(self.result_dir, "cluster_duration_time_analysis.csv") - PathManager.check_path_length(save_path) - view_data.to_csv(save_path, index=False) - # 该文件权限设置为只读权限,不允许修改 - os.chmod(save_path, stat.S_IROTH) - return view_data - - -class StatisticalInfoToHtmlAnalyzer(OpSummaryAnalyzerBase): - def __init__(self, chip_type, top_n, dir_path): - super().__init__(chip_type, "StatisticalInfoToHtmlAnalyzer", dir_path) - self.top_n = top_n - # top_n 如果不符合要求,报警告 - - def GenerateDeliverable(self, summary_data, rank_num): - view_data = self.calculateViewData(summary_data) - # 规范化列名 op_name/ --> op_name time/var 这种不变 - view_data.columns = [''.join(col) if col[1] == "" else col for col in view_data.columns] - - # 对使用到的变量进行初始设置 - self.top_n = min(max(self.top_n, 1), len(view_data)) - top_n_data = view_data.sort_values(("Task Duration(us)", 'var'), ascending=False).head(self.top_n) - - for column in self.columns_to_view: - # 分别给每一种特性画图 - self.drawPloty(column, summary_data, top_n_data, rank_num) - - def drawPloty(self, column, summary_data, top_n_data, rank_num): - col_num = self.getCalNum(rank_num) - row_num = self.top_n // col_num if self.top_n % col_num == 0 else (self.top_n + 1) // col_num - fig = make_subplots(rows=row_num, cols=col_num, vertical_spacing=0.03) - for i, (_, operation) in enumerate(top_n_data.iterrows()): - op_data = summary_data[(summary_data["Op Name"] == operation["Op Name"]) & - (summary_data["Input Shapes"] == operation["Input Shapes"]) & - (summary_data["Input Data Types"] == operation["Input Data Types"])] - op_data = op_data.sort_values(by=["node_id", "device_id"]) - node_ids = op_data['node_id'].unique() - device_ids = op_data['device_id'].unique() - - for node_id in node_ids: - for device_id in device_ids: - draw_data = op_data[(op_data['node_id'] == node_id) & (op_data['device_id'] == device_id)] - fig.add_trace(go.Box(y=draw_data[column], - name=f'{node_id}_{device_id}', - marker_color='green', showlegend=False), (i // col_num) + 1, (i % col_num) + 1) - - fig.update_xaxes(title_text=f'{operation["Op Name"]}-{operation["Input Shapes"]}', row=(i // col_num) + 1, - col=(i % col_num) + 1) - fig.update_layout(margin=dict(l=20, r=20, t=20, b=20), - height=int(500 * row_num), - width=int(rank_num * 100 * col_num), - title_text="Op Performance Comparison") - save_plot_path = os.path.join(self.result_dir, column + "_Info.html") - PathManager.check_path_length(save_plot_path) - plot(fig, filename=save_plot_path) - # 该文件权限设置为只读权限,不允许修改 - os.chmod(save_plot_path, stat.S_IROTH) - - def getCalNum(self, rank_num): - # 计算每行应该画多少个子图 - if rank_num <= 16: - return 2 - else: - return 1 - - -class DeliverableGenerator: - def __init__(self, params): - self.dirs = params.get('dir') - self.formProcess = FormDataProcessor(self.dirs, 'op_summary*.csv') - self.analyzers = [] - self.columns_to_keep = [] - self.setAnalyzers(params) - 
self.setColumnsToKeep() - - def run(self): - summary_data = self.formProcess.readSummaryData(self.columns_to_keep) - # 判断summarydata 数据是否为空,如果是空, 说明所有csv读取数据都失败了 - if summary_data.empty: - print("没有符合要求的csv表格数据,请排查您的PROFILING数据") - return - rank_num = self.formProcess.getRankNum() - for analyzer in self.analyzers: - analyzer.GenerateDeliverable(summary_data, rank_num) - - def setAnalyzers(self, params): - chip_type = self.formProcess.getChipType() - # 判断该路径是不是软链接,并修改为绝对路径 - if os.path.islink(params.get('dir')): - print(f"The file: \"{params.get('dir')}\" is link. Please check the path.") - return - prof_path = os.path.realpath(params.get('dir')) - PathManager.input_path_common_check(prof_path) - if params.get('type') == "all": - self.analyzers = [TimeToCsvAnalyzer(chip_type, prof_path), StatisticalInfoToHtmlAnalyzer(chip_type, params.get("top_n"), prof_path)] - elif params.get('type') == "html": - self.analyzers = [StatisticalInfoToHtmlAnalyzer(chip_type, params.get("top_n"), prof_path)] - elif params.get('type') == "csv": - self.analyzers = [TimeToCsvAnalyzer(chip_type, prof_path)] - else: - warnings.warn("参数错误,请输入 all html csv 这三种类型") # 发出一个警告信息 - - - def setColumnsToKeep(self): - columns_to_keep = [] - for analyzer in self.analyzers: - columns_to_keep.extend(analyzer.getColumnsToGroup()) - columns_to_keep.extend(analyzer.getColumnsToView()) - self.columns_to_keep = list(set(columns_to_keep)) - - -def main(): - # 解析命令行参数 - parser = argparse.ArgumentParser() - parser.add_argument("--dir", "-d", default=None, help="root dir of PROF_* data") - parser.add_argument("--top_n", "-n", default=10, help="how many operators to show", type=int) - parser.add_argument("--type", "-t", default='html', help="compare ratio or aicore-time", type=str) - args = parser.parse_args() - params = { - "dir": args.dir, - "top_n": args.top_n, - "type": args.type - } - - deviverable_gen = DeliverableGenerator(params) - deviverable_gen.run() - -if __name__ == "__main__": - main() diff --git a/profiler/cluster_analyse/cluster_statistics_export/__init__.py b/profiler/cluster_analyse/cluster_statistics_export/__init__.py deleted file mode 100644 index 7101187a2c2619f3b1c20dded14b433950b4c662..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/cluster_statistics_export/cann_api_sum_export.py b/profiler/cluster_analyse/cluster_statistics_export/cann_api_sum_export.py deleted file mode 100644 index 578ee937be57ff8615085bbe1e4ac6ccae81a4e9..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/cann_api_sum_export.py +++ /dev/null @@ -1,65 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from cluster_statistics_export.stats_export import StatsExport - -QUERY = """ -WITH - summary as ( - SELECT - name, - sum(endNs - startNs) AS duration, - count (*) AS num, - avg(endNs - startNs) AS avg_duration, - min(endNs - startNs) AS min_duration, - median(endNs - startNs) AS med_duration, - max(endNs - startNs) AS max_duration, - stdev(endNs - startNs) AS stdev_duration, - lower_quartile(endNs - startNs) AS lower_quartile_duration, - upper_quartile(endNs - startNs) AS upper_quartile_duration - FROM - CANN_API - GROUP BY name - ), - totals AS ( - SELECT sum(duration) AS total - FROM summary - ) -SELECT - ids.value AS "name", - round(summary.duration * 100.0 / (SELECT total FROM totals), 2) AS "durationRatio", - summary.duration AS "totalTimeNs", - summary.num AS "totalCount", - round(summary.avg_duration, 1) AS "averageNs", - round(summary.min_duration, 1) AS "minNs", - round(summary.lower_quartile_duration, 1) AS "Q1Ns", - round(summary.med_duration, 1) AS "medNs", - round(summary.upper_quartile_duration, 1) AS "Q3Ns", - round(summary.max_duration, 1) AS "maxNs", - round(summary.stdev_duration, 1) AS "stdev" -FROM - summary -LEFT JOIN - STRING_IDS AS ids - ON ids.id == summary.name -ORDER BY 2 DESC; - """ - - -class CannApiSumExport(StatsExport): - - def __init__(self, db_path, recipe_name): - super().__init__(db_path, recipe_name) - self._query = QUERY diff --git a/profiler/cluster_analyse/cluster_statistics_export/compute_op_sum_export.py b/profiler/cluster_analyse/cluster_statistics_export/compute_op_sum_export.py deleted file mode 100644 index d70c696100bc305f8b1e182f7b1f915cf58f274a..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/compute_op_sum_export.py +++ /dev/null @@ -1,49 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
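The cann_api_sum query above leans on aggregates that stock SQLite does not provide (median, stdev, lower_quartile, upper_quartile); they only work because DBManager registers them on the connection in ANALYSIS mode. A minimal, self-contained sketch of that mechanism, using an in-memory database and a toy CANN_API table (both assumptions, not project data):

```python
import sqlite3
import statistics


class Median:
    """Buffer values in step(), reduce in finalize(), as SQLite aggregates do."""

    def __init__(self):
        self.values = []

    def step(self, value):
        self.values.append(value)

    def finalize(self):
        return statistics.median(self.values)


conn = sqlite3.connect(":memory:")
conn.create_aggregate("median", 1, Median)  # same call DBManager makes per function
conn.execute("CREATE TABLE CANN_API (startNs INTEGER, endNs INTEGER)")
conn.executemany("INSERT INTO CANN_API VALUES (?, ?)", [(0, 10), (0, 20), (0, 50)])
print(conn.execute("SELECT median(endNs - startNs) FROM CANN_API").fetchone())  # (20,)
conn.close()
```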
- -from cluster_statistics_export.stats_export import StatsExport - - -QUERY = """ -SELECT - NAME_IDS.value AS "OpName", - OPTYPE_IDS.value AS "OpType", - TASKTYPE_IDS.value AS "TaskType", - INPUTSHAPES_IDS.value AS "InputShapes", - round(TASK.endNs - TASK.startNs) AS "Duration" -FROM - COMPUTE_TASK_INFO -LEFT JOIN TASK - ON TASK.globalTaskId == COMPUTE_TASK_INFO.globalTaskId -LEFT JOIN - STRING_IDS AS NAME_IDS - ON NAME_IDS.id == COMPUTE_TASK_INFO.name -LEFT JOIN - STRING_IDS AS OPTYPE_IDS - ON OPTYPE_IDS.id == COMPUTE_TASK_INFO.opType -LEFT JOIN - STRING_IDS AS TASKTYPE_IDS - ON TASKTYPE_IDS.id == COMPUTE_TASK_INFO.taskType -LEFT JOIN - STRING_IDS AS INPUTSHAPES_IDS - ON INPUTSHAPES_IDS.id == COMPUTE_TASK_INFO.inputShapes - """ - - -class ComputeOpSumExport(StatsExport): - - def __init__(self, db_path, recipe_name): - super().__init__(db_path, recipe_name) - self._query = QUERY diff --git a/profiler/cluster_analyse/cluster_statistics_export/hccl_sum_export.py b/profiler/cluster_analyse/cluster_statistics_export/hccl_sum_export.py deleted file mode 100644 index f695949de1a92e9a1faff593bc45e52f91582242..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/hccl_sum_export.py +++ /dev/null @@ -1,39 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from cluster_statistics_export.stats_export import StatsExport - - -QUERY = """ -SELECT - NAME_IDS.value AS "OpName", - TYPE_IDS.value AS "OpType", - round(endNs - startNs) AS "Duration" -FROM - COMMUNICATION_OP -LEFT JOIN - STRING_IDS AS TYPE_IDS - ON TYPE_IDS.id == COMMUNICATION_OP.opType -LEFT JOIN - STRING_IDS AS NAME_IDS - ON NAME_IDS.id == COMMUNICATION_OP.opName - """ - - -class HcclSumExport(StatsExport): - - def __init__(self, db_path, recipe_name): - super().__init__(db_path, recipe_name) - self._query = QUERY diff --git a/profiler/cluster_analyse/cluster_statistics_export/mstx_mark_export.py b/profiler/cluster_analyse/cluster_statistics_export/mstx_mark_export.py deleted file mode 100644 index ac5355c020042d474963296242b79eb3fd6a8c38..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/mstx_mark_export.py +++ /dev/null @@ -1,57 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from cluster_statistics_export.stats_export import StatsExport - - -QUERY = """ -WITH - FRAMEWORK_API AS ( - SELECT - PYTORCH_API.startNs, - CONNECTION_IDS.connectionId - FROM - PYTORCH_API - LEFT JOIN - CONNECTION_IDS - ON PYTORCH_API.connectionId == CONNECTION_IDS.id - ) -SELECT - MSG_IDS.value AS "msg", - MSTX_EVENTS.startNs AS "cann_ts", - TASK.startNs AS "device_ts", - FRAMEWORK_API.startNs AS "framework_ts", - MSTX_EVENTS.globalTid AS "tid" -FROM - MSTX_EVENTS -LEFT JOIN - TASK - ON MSTX_EVENTS.connectionId == TASK.connectionId -LEFT JOIN - FRAMEWORK_API - ON MSTX_EVENTS.connectionId == FRAMEWORK_API.connectionId -LEFT JOIN - STRING_IDS AS MSG_IDS - ON MSTX_EVENTS.message == MSG_IDS.id -ORDER BY - MSTX_EVENTS.startNs - """ - - -class MstxMarkExport(StatsExport): - - def __init__(self, db_path, recipe_name): - super().__init__(db_path, recipe_name) - self._query = QUERY diff --git a/profiler/cluster_analyse/cluster_statistics_export/mstx_step_export.py b/profiler/cluster_analyse/cluster_statistics_export/mstx_step_export.py deleted file mode 100644 index c257ce675fe46ea0f7eff2489dd2fe13c846564f..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/mstx_step_export.py +++ /dev/null @@ -1,35 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from cluster_statistics_export.stats_export import StatsExport - - -QUERY = """ -SELECT - id AS "step_id", - startNs AS "start_ns", - endNs AS "end_ns" -FROM - STEP_TIME -ORDER BY - startNs - """ - - -class MstxStepExport(StatsExport): - - def __init__(self, db_path, recipe_name): - super().__init__(db_path, recipe_name) - self._query = QUERY diff --git a/profiler/cluster_analyse/cluster_statistics_export/stats_export.py b/profiler/cluster_analyse/cluster_statistics_export/stats_export.py deleted file mode 100644 index e6d98f48ef8c4e8032f7611dac163ead3cc5fbe0..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_statistics_export/stats_export.py +++ /dev/null @@ -1,40 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
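The export classes in this package all follow one pattern: a subclass sets self._query and the shared read_export_db turns it into a DataFrame. A minimal sketch of that pattern against a throwaway in-memory database (the DummyExport name and its query are hypothetical):

```python
import sqlite3

import pandas as pd


class DummyExport:
    """Hypothetical stand-in for a StatsExport subclass."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._query = "SELECT 1 AS one, 2 AS two"  # each subclass sets its own SQL

    def read_export_db(self):
        conn = sqlite3.connect(self._db_path)
        try:
            return pd.read_sql(self._query, conn)  # materialize the query as a DataFrame
        finally:
            conn.close()


print(DummyExport(":memory:").read_export_db())
```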
- -import pandas as pd - -from common_func.db_manager import DBManager -from common_func.constant import Constant - - -class StatsExport: - - def __init__(self, db_path, analysis_class): - self._db_path = db_path - self._analysis_class = analysis_class - self._query = None - - def get_query(self): - return self._query - - def read_export_db(self): - query = self.get_query() - if query is None: - print(f"[ERROR] query is None.") - return - conn, cursor = DBManager.create_connect_db(self._db_path, Constant.ANALYSIS) - data = pd.read_sql(query, conn) - DBManager.destroy_db_connect(conn, cursor) - return data diff --git a/profiler/cluster_analyse/cluster_utils/__init__.py b/profiler/cluster_analyse/cluster_utils/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/profiler/cluster_analyse/cluster_utils/data_transfer_adapter.py b/profiler/cluster_analyse/cluster_utils/data_transfer_adapter.py deleted file mode 100644 index 1f306415fa789ae0dab7d8751b1c240b3433de0d..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/cluster_utils/data_transfer_adapter.py +++ /dev/null @@ -1,142 +0,0 @@ -import copy - -from common_func.constant import Constant -from common_func.table_constant import TableConstant - - -class DataTransferAdapter(object): - COMM_TIME_TABLE_COLUMN = [TableConstant.START_TIMESTAMP, TableConstant.ELAPSED_TIME, TableConstant.TRANSIT_TIME, - TableConstant.WAIT_TIME, TableConstant.SYNCHRONIZATION_TIME, TableConstant.IDLE_TIME, - TableConstant.SYNCHRONIZATION_TIME_RATIO, TableConstant.WAIT_TIME_RATIO] - COMM_TIME_JSON_COLUMN = [Constant.START_TIMESTAMP, Constant.ELAPSE_TIME_MS, Constant.TRANSIT_TIME_MS, - Constant.WAIT_TIME_MS, Constant.SYNCHRONIZATION_TIME_MS, Constant.IDLE_TIME_MS, - Constant.SYNCHRONIZATION_TIME_RATIO, Constant.WAIT_TIME_RATIO] - MATRIX_TABLE_COLUMN = [TableConstant.TRANSIT_SIZE, TableConstant.TRANSIT_TIME, TableConstant.BANDWIDTH, - TableConstant.TRANSPORT_TYPE, TableConstant.OPNAME] - MATRIX_JSON_COLUMN = [Constant.TRANSIT_SIZE_MB, Constant.TRANSIT_TIME_MS, Constant.BANDWIDTH_GB_S, - Constant.TRANSPORT_TYPE, Constant.OP_NAME] - COMM_BD_TABLE_COLUMN = [TableConstant.TRANSIT_SIZE, TableConstant.TRANSIT_TIME, TableConstant.BANDWIDTH, - TableConstant.LARGE_PACKET_RATIO] - COMM_BD_JSON_COLUMN = [Constant.TRANSIT_SIZE_MB, Constant.TRANSIT_TIME_MS, Constant.BANDWIDTH_GB_S, - Constant.LARGE_PACKET_RATIO] - - def __init__(self): - super().__init__() - - def transfer_comm_from_db_to_json(self, time_info: list, bandwidth_info: list): - result = {} - if not time_info and not bandwidth_info: - return result - for time_data in time_info: - comm_time = dict() - hccl_name = time_data[TableConstant.HCCL_OP_NAME] + "@" + time_data[TableConstant.GROUP_NAME] - for key, value in dict(zip(self.COMM_TIME_JSON_COLUMN, self.COMM_TIME_TABLE_COLUMN)).items(): - if not key.endswith("ratio"): - comm_time[key] = time_data.get(value, 0) - result.setdefault(time_data[TableConstant.STEP], {}).setdefault(time_data[TableConstant.TYPE], {}). 
\ - setdefault(hccl_name, {})[Constant.COMMUNICATION_TIME_INFO] = comm_time - hccl_set = set() - for bd_data in bandwidth_info: - hccl_name = bd_data[TableConstant.HCCL_OP_NAME] + "@" + bd_data[TableConstant.GROUP_NAME] - hccl_set.add(hccl_name) - for hccl in hccl_set: - comm_bd = dict() - for bd_data in bandwidth_info: - if hccl == (bd_data[TableConstant.HCCL_OP_NAME] + "@" + bd_data[TableConstant.GROUP_NAME]): - temp_dict = dict() - key_dict = dict(zip(self.COMM_BD_JSON_COLUMN, self.COMM_BD_TABLE_COLUMN)) - self.set_value_by_key(temp_dict, bd_data, key_dict) - comm_bd.setdefault(bd_data[TableConstant.TRANSPORT_TYPE], temp_dict).setdefault( - Constant.SIZE_DISTRIBUTION, {})[bd_data[TableConstant.PACKAGE_SIZE]] = \ - [bd_data[TableConstant.COUNT], bd_data[TableConstant.TOTAL_DURATION]] - result.setdefault(bd_data[TableConstant.STEP], {}).setdefault(bd_data[TableConstant.TYPE], {}). \ - setdefault(hccl, {})[Constant.COMMUNICATION_BANDWIDTH_INFO] = comm_bd - return result - - def transfer_comm_from_json_to_db(self, res_data: dict): - res_comm_data, res_bd_data = list(), list() - - def split_comm_time(): - for rank_id, comm_data in op_data.items(): - time_data = comm_data.get(Constant.COMMUNICATION_TIME_INFO) - res_time = set_only_value(rank_id) - for key, value in dict(zip(self.COMM_TIME_TABLE_COLUMN, self.COMM_TIME_JSON_COLUMN)).items(): - res_time[key] = time_data.get(value, 0) - res_comm_data.append(res_time) - bd_data = comm_data.get(Constant.COMMUNICATION_BANDWIDTH_INFO, {}) - for transport_type, data in bd_data.items(): - res_bandwidth = set_only_value(rank_id) - key_dict = dict(zip(self.COMM_BD_TABLE_COLUMN, self.COMM_BD_JSON_COLUMN)) - res_bandwidth[TableConstant.TRANSPORT_TYPE] = transport_type - self.set_value_by_key(res_bandwidth, data, key_dict) - for key, value in data.get(Constant.SIZE_DISTRIBUTION, {}).items(): - res_bandwidth[TableConstant.PACKAGE_SIZE] = key - res_bandwidth[TableConstant.COUNT] = value[0] - res_bandwidth[TableConstant.TOTAL_DURATION] = value[1] - temp_dict = copy.deepcopy(res_bandwidth) - res_bd_data.append(temp_dict) - - def set_only_value(rank_id): - res_dict = dict() - res_dict[TableConstant.RANK_SET] = str(rank_set) - res_dict[TableConstant.STEP] = step - res_dict[TableConstant.RANK_ID] = rank_id - res_dict[TableConstant.HCCL_OP_NAME] = op_name.split("@")[0] if "@" in op_name else op_name - res_dict[TableConstant.GROUP_NAME] = op_name.split("@")[1] if "@" in op_name else "" - return res_dict - - for rank_set, step_dict in res_data.items(): - for step, op_dict in step_dict.items(): - for op_name, op_data in op_dict.items(): - split_comm_time() - return res_comm_data, res_bd_data - - def set_value_by_key(self, src_dict, dst_dict, key_dict): - for key, value in key_dict.items(): - src_dict[key] = dst_dict.get(value, 0) - - def transfer_matrix_from_db_to_json(self, matrix_data: list): - result = {} - if not matrix_data: - return result - hccl_set = set() - for data in matrix_data: - hccl = data[TableConstant.HCCL_OP_NAME] + "@" + data[TableConstant.GROUP_NAME] - hccl_set.add(hccl) - for hccl in hccl_set: - for data in matrix_data: - if hccl == (data[TableConstant.HCCL_OP_NAME] + "@" + data[TableConstant.GROUP_NAME]): - key = data[TableConstant.SRC_RANK] + '-' + data[TableConstant.DST_RANK] - temp_dict = dict() - key_dict = dict(zip(self.MATRIX_JSON_COLUMN, self.MATRIX_TABLE_COLUMN)) - self.set_value_by_key(temp_dict, data, key_dict) - result.setdefault(data[TableConstant.STEP], {}).setdefault(data[TableConstant.TYPE], {}). 
\ - setdefault(hccl, {}).setdefault(key, temp_dict) - return result - - def transfer_matrix_from_json_to_db(self, res_data: dict): - result = list() - - def split_matrix_data(): - for op_name, op_data in op_dict.items(): - for link_key, link_data in op_data.items(): - if "@" in op_name: - hccl_op_name, group_name = op_name.split("@")[0], op_name.split("@")[1] - else: - hccl_op_name, group_name = op_name, "" - matrix_data = { - TableConstant.RANK_SET: str(rank_set), - TableConstant.STEP: step, - TableConstant.HCCL_OP_NAME: hccl_op_name, - TableConstant.GROUP_NAME: group_name, - TableConstant.SRC_RANK: link_key.split("-")[0], - TableConstant.DST_RANK: link_key.split("-")[1] - } - key_dict = dict(zip(self.MATRIX_TABLE_COLUMN, self.MATRIX_JSON_COLUMN)) - self.set_value_by_key(matrix_data, link_data, key_dict) - result.append(matrix_data) - - for rank_set, step_dict in res_data.items(): - for step, op_dict in step_dict.items(): - split_matrix_data() - return result diff --git a/profiler/cluster_analyse/common_func/__init__.py b/profiler/cluster_analyse/common_func/__init__.py deleted file mode 100644 index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/common_func/analysis_loader.py b/profiler/cluster_analyse/common_func/analysis_loader.py deleted file mode 100644 index 55e7dbc6ea930de7a47799384ffad5daa1328da2..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/analysis_loader.py +++ /dev/null @@ -1,38 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
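The loader below resolves a recipe name to a class at runtime with importlib plus an inspect predicate. A standalone sketch of the same idea, using a stdlib module as a stand-in for the analysis.{name}.{name} layout:

```python
import importlib
import inspect

# Import a module by its dotted name, then pick out classes with a predicate;
# json.decoder is only an illustration, not a recipe module.
module = importlib.import_module("json.decoder")
classes = inspect.getmembers(module, inspect.isclass)
print([name for name, _ in classes])
```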
- -import importlib -import inspect -import sys - -from common_func.constant import Constant -from analysis.base_analysis import BaseRecipeAnalysis - -def is_analysis_class(obj): - return inspect.isclass(obj) and issubclass(obj, BaseRecipeAnalysis) and obj != BaseRecipeAnalysis - -def get_class_from_name(analysis_name : str): - sys.path.append(Constant.ANALYSIS_PATH) - analysis_path = f"analysis.{analysis_name}.{analysis_name}" - module = None - try: - module = importlib.import_module(analysis_path) - except Exception as e: - print(f"[ERROR] {analysis_path} not found: {e}") - - specific_analysis = inspect.getmembers(module, is_analysis_class) - if not specific_analysis: - print(f"[ERROR] {analysis_name} not found.") - return None - return specific_analysis[0] diff --git a/profiler/cluster_analyse/common_func/constant.py b/profiler/cluster_analyse/common_func/constant.py deleted file mode 100644 index 80f0374c1d1d9a37204b9583112ce5baa4cf3e95..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/constant.py +++ /dev/null @@ -1,118 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os - -class Constant(object): - # dir name - FRAMEWORK_DIR = "FRAMEWORK" - CLUSTER_ANALYSIS_OUTPUT = "cluster_analysis_output" - SINGLE_OUTPUT = "ASCEND_PROFILER_OUTPUT" - COMM_JSON = "communication.json" - COMM_MATRIX_JSON = "communication_matrix.json" - STEP_TIME_CSV = "step_trace_time.csv" - KERNEL_DETAILS_CSV = "kernel_details.csv" - - # file authority - FILE_AUTHORITY = 0o640 - DIR_AUTHORITY = 0o750 - MAX_JSON_SIZE = 1024 * 1024 * 1024 * 10 - MAX_CSV_SIZE = 1024 * 1024 * 1024 * 5 - MAX_PATH_LENGTH = 4096 - MAX_READ_DB_FILE_BYTES = 1024 * 1024 * 1024 * 8 - - # communication - P2P = "p2p" - COLLECTIVE = "collective" - STEP_ID = "step_id" - RANK_ID = "rank_id" - GROUP_NAME = "group_name" - COMM_OP_TYPE = "comm_op_type" - COMM_OP_NAME = "comm_op_name" - COMM_OP_INFO = "comm_op_info" - TOTAL_OP_INFO = "Total Op Info" - COMMUNICATION_TIME_INFO = "Communication Time Info" - START_TIMESTAMP = "Start Timestamp(us)" - COMMUNICATION_BANDWIDTH_INFO = "Communication Bandwidth Info" - HCOM_SEND = "hcom_send" - HCOM_RECEIVE = "hcom_receive" - SYNCHRONIZATION_TIME_RATIO = "Synchronization Time Ratio" - SYNCHRONIZATION_TIME_MS = "Synchronization Time(ms)" - WAIT_TIME_RATIO = "Wait Time Ratio" - TRANSIT_TIME_MS = "Transit Time(ms)" - TRANSIT_SIZE_MB = "Transit Size(MB)" - SIZE_DISTRIBUTION = "Size Distribution" - WAIT_TIME_MS = "Wait Time(ms)" - OP_NAME = "Op Name" - BANDWIDTH_GB_S = "Bandwidth(GB/s)" - COMMUNICATION = "communication.json" - ELAPSE_TIME_MS = "Elapse Time(ms)" - IDLE_TIME_MS = "Idle Time(ms)" - LARGE_PACKET_RATIO = "Large Packet Ratio" - - # params - DATA_MAP = "data_map" - COLLECTIVE_GROUP = "collective_group" - COMMUNICATION_OPS = "communication_ops" - MATRIX_OPS = "matrix_ops" - COLLECTION_PATH = "collection_path" - COMMUNICATION_GROUP = "communication_group" - TRANSPORT_TYPE = "Transport Type" - 
COMM_DATA_DICT = "comm_data_dict" - DATA_TYPE = "data_type" - ANALYSIS_MODE = "analysis_mode" - - # step time - RANK = "rank" - STAGE = "stage" - - # epsilon - EPS = 1e-15 - - # file suffix - JSON_SUFFIX = ".json" - CSV_SUFFIX = ".csv" - - # result files type - TEXT = "text" - DB = "db" - INVALID = "invalid" - - # db name - DB_COMMUNICATION_ANALYZER = "analysis.db" - DB_CLUSTER_COMMUNICATION_ANALYZER = "cluster_analysis.db" - - # db tables - TABLE_COMM_ANALYZER_BANDWIDTH = "CommAnalyzerBandwidth" - TABLE_COMM_ANALYZER_TIME = "CommAnalyzerTime" - TABLE_COMM_ANALYZER_MATRIX = "CommAnalyzerMatrix" - TABLE_STEP_TRACE = "StepTraceTime" - TABLE_HOST_INFO = "HostInfo" - TABLE_RANK_DEVICE_MAP = "RankDeviceMap" - - # data config key - CONFIG = "config" - EXPER_CONFIG = "experimental_config" - EXPORT_TYPE = "_export_type" - - # recipe config - ANALYSIS = "analysis" - RECIPE_NAME = "recipe_name" - RECIPE_CLASS = "recipe_class" - PARALLEL_MODE = "parallel_mode" - CLUSTER_CUSTOM_ANALYSE_PATH = os.path.abspath(os.path.dirname(__file__)) - ANALYSIS_PATH = os.path.join(CLUSTER_CUSTOM_ANALYSE_PATH, 'analysis') - - CONCURRENT_MODE = "concurrent" \ No newline at end of file diff --git a/profiler/cluster_analyse/common_func/context.py b/profiler/cluster_analyse/common_func/context.py deleted file mode 100644 index 4e3d544d3769e0c1360790dc1a4c57ca484687b8..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/context.py +++ /dev/null @@ -1,85 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
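ConcurrentContext below wraps a concurrent.futures process pool behind launch/map/wait. The core pattern it builds on, stripped of the project classes:

```python
from concurrent import futures


def square(x):
    return x * x


if __name__ == "__main__":
    # ProcessPoolExecutor.map fans calls out across worker processes,
    # which is what ConcurrentContext.map does via functools.partial.
    with futures.ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(square, range(5))))  # [0, 1, 4, 9, 16]
```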
- -import os -from functools import partial -from concurrent import futures -from common_func.constant import Constant - - -class Context(object): - """abstract base class""" - - ctx_map = None - - @classmethod - def create_context(cls, mode=Constant.CONCURRENT_MODE): - if cls.ctx_map is None: - keys = [Constant.CONCURRENT_MODE] - values = [ConcurrentContext] - cls.ctx_map = dict(zip(keys, values)) - - if mode not in cls.ctx_map: - raise NotImplementedError("mode must be in {}".format(list(cls.ctx_map.keys()))) - - return cls.ctx_map[mode]() - - def __init__(self): - print("[INFO] context {} initialized.".format(self._mode)) - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - self.close() - if exc_type is not None: - print(f"[ERROR] Failed to exit context: {exc_val}") - - def launch(self, func, *args, **kwargs): - raise NotImplementedError - - def map(self, func, *iterables, **kwargs): - raise NotImplementedError - - def wait(self, waitable): - raise NotImplementedError - -class ConcurrentContext(Context): - - def __init__(self, executor=None): - self._mode = Constant.CONCURRENT_MODE - super().__init__() - self._custom = executor is None - self._executor = executor or futures.ProcessPoolExecutor(max_workers=os.cpu_count()) - - def __enter__(self): - if self._executor is None: - raise RuntimeError("executor is None") - return self - - def close(self): - if self._custom: - self._executor.shutdown(wait=True) - self._executor = None - - def launch(self, func, *args, **kwargs): - return self._executor.submit(func, *args, **kwargs).result() - - def map(self, func, *iterables, **kwargs): - partial_func = partial(func, **kwargs) - return list(self._executor.map(partial_func, *iterables)) - - def wait(self, waitable): - return waitable \ No newline at end of file diff --git a/profiler/cluster_analyse/common_func/db_manager.py b/profiler/cluster_analyse/common_func/db_manager.py deleted file mode 100644 index c0d6ad89be8edd8bbb2a4ee8e0653141550b0129..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/db_manager.py +++ /dev/null @@ -1,233 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License.
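DBManager below guards every sqlite3 call and fetches rows in fixed-size batches rather than one fetchall(). A minimal sketch of that batching loop on an in-memory table (fetch size shrunk to 4 for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

curs = conn.execute("SELECT v FROM t")
fetch_size = 4  # stand-in for DBManager.FETCH_SIZE (10000)
data = []
while True:
    batch = curs.fetchmany(fetch_size)
    data += batch
    if len(batch) < fetch_size:  # a short batch means the cursor is drained
        break
print(len(data))  # 10
conn.close()
```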
- -import os -import sqlite3 - -from common_func.constant import Constant -from common_func.empty_class import EmptyClass -from common_func.file_manager import check_db_path_valid -from common_func.tables_config import TablesConfig -from common_func.sql_extention_func import SqlExtentionAggregateFunc - -class DBManager: - """ - class to manage DB operation - """ - FETCH_SIZE = 10000 - INSERT_SIZE = 10000 - MAX_ROW_COUNT = 100000000 - - @staticmethod - def create_connect_db(db_path: str, mode=None) -> tuple: - """ - create and connect database - """ - if check_db_path_valid(db_path, is_create=True): - try: - conn = sqlite3.connect(db_path) - except sqlite3.Error as err: - print(f"[ERROR] {err}") - return EmptyClass("empty conn"), EmptyClass("empty curs") - try: - if mode == Constant.ANALYSIS: - try: - for func_name, params_count, class_name in SqlExtentionAggregateFunc: - conn.create_aggregate(func_name, params_count, class_name) - except sqlite3.Error as err: - print(f"[ERROR] {err}") - if isinstance(conn, sqlite3.Connection): - curs = conn.cursor() - os.chmod(db_path, Constant.FILE_AUTHORITY) - return conn, curs - except sqlite3.Error as err: - print(f"[ERROR] {err}") - return EmptyClass("empty conn"), EmptyClass("empty curs") - return EmptyClass("empty conn"), EmptyClass("empty curs") - - @staticmethod - def destroy_db_connect(conn: any, curs: any) -> None: - """ - destroy db connection - """ - try: - if isinstance(curs, sqlite3.Cursor): - curs.close() - except sqlite3.Error as err: - print(f"[ERROR] {err}") - try: - if isinstance(conn, sqlite3.Connection): - conn.close() - except sqlite3.Error as err: - print(f"[ERROR] {err}") - - @staticmethod - def judge_table_exists(curs: any, table_name: str) -> any: - """ - judge table exists - """ - if not isinstance(curs, sqlite3.Cursor): - return False - try: - curs.execute("select count(*) from sqlite_master where type='table' and name=?", (table_name,)) - return curs.fetchone()[0] - except sqlite3.Error as err: - print("[ERROR] {}".format(err)) - return False - - @staticmethod - def sql_generate_table(table_map: str): - header_with_type_begin = "(" - header_with_type_end = ")" - header_with_type_list = [] - if table_map in TablesConfig.DATA: - items = TablesConfig.DATA[table_map] - for item in items: - if item[0] == "index": - header_with_type_list.append('"' + item[0] + '" ' + item[1].split(",")[0]) - else: - header_with_type_list.append(item[0] + ' ' + item[1].split(",")[0]) - header_with_type_begin += ",".join(header_with_type_list) - header_with_type_begin += header_with_type_end - return header_with_type_begin - return "" - - @classmethod - def check_tables_in_db(cls, db_path: any, *tables: any) -> bool: - if check_db_path_valid(db_path): - conn, curs = cls.create_connect_db(db_path) - if not (conn and curs): - return False - res = True - for table in tables: - if not cls.judge_table_exists(curs, table): - res = False - break - cls.destroy_db_connect(conn, curs) - return res - return False - - @classmethod - def create_tables(cls, db_path: any, *tables: any): - conn, curs = cls.create_connect_db(db_path) - if not (conn and curs): - return - for table_name in tables: - if cls.judge_table_exists(curs, table_name): - drop_sql = "drop table {0}".format(table_name) - cls.execute_sql(conn, drop_sql) - table_map = "{0}Map".format(table_name) - header_with_type = cls.sql_generate_table(table_map) - sql = "CREATE TABLE IF NOT EXISTS " + table_name + header_with_type - cls.execute_sql(conn, sql) - cls.destroy_db_connect(conn, curs) - - @classmethod 
- def get_table_column_count(cls, db_path: any, table: any) -> int: - conn, curs = cls.create_connect_db(db_path) - if not (conn and curs): - return 0 - sql = "SELECT COUNT(*) FROM pragma_table_info('{}')".format(table) - res = 0 - try: - curs.execute(sql) - res = curs.fetchone()[0] - except sqlite3.Error as err: - print("[ERROR] {}".format(err)) - finally: - cls.destroy_db_connect(conn, curs) - return res - - @staticmethod - def execute_sql(conn: any, sql: str, params: any = None) -> bool: - """ - execute sql - """ - try: - if isinstance(conn, sqlite3.Connection): - if params: - conn.cursor().execute(sql, params) - else: - conn.cursor().execute(sql) - conn.commit() - return True - except sqlite3.Error as err: - print(f"[ERROR] {err}") - return False - print("[ERROR] conn is invalid param") - return False - - @staticmethod - def executemany_sql(conn: any, sql: str, params: any) -> bool: - """ - execute many sql once - """ - try: - if isinstance(conn, sqlite3.Connection): - conn.cursor().executemany(sql, params) - conn.commit() - return True - except sqlite3.Error as err: - print(f"[ERROR] {err}") - return False - print("[ERROR] conn is invalid param") - return False - - @classmethod - def fetch_all_data(cls: any, curs: any, sql: str, param: tuple = None, is_dict: bool = True) -> list: - """ - fetch 10000 num of data from db each time to get all data - """ - if not isinstance(curs, sqlite3.Cursor): - return [] - data = [] - try: - if param: - res = curs.execute(sql, param) - else: - res = curs.execute(sql) - except sqlite3.Error as err: - print(f"[ERROR] {err}") - curs.row_factory = None - return [] - try: - description = res.description - while True: - res = curs.fetchmany(cls.FETCH_SIZE) - if is_dict: - data += CustomizedDictFactory.generate_dict_from_db(res, description) - else: - data += res - if len(data) > cls.MAX_ROW_COUNT: - print("[WARNING] The record count in the table exceeds the limit!") - if len(res) < cls.FETCH_SIZE: - break - return data - except sqlite3.Error as err: - print(f"[ERROR] {err}") - return [] - finally: - curs.row_factory = None - - -class CustomizedDictFactory: - @staticmethod - def generate_dict_from_db(data_result: any, description: any) -> any: - description_set = [i[0] for i in description] - res = [] - for data in data_result: - data_dict = dict(zip(description_set, data)) - res.append(data_dict) - return res diff --git a/profiler/cluster_analyse/common_func/empty_class.py b/profiler/cluster_analyse/common_func/empty_class.py deleted file mode 100644 index df100d156fa064cca4514260db0b2e843e217d09..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/empty_class.py +++ /dev/null @@ -1,20 +0,0 @@ -class EmptyClass: - - def __init__(self: any, info: str = "") -> None: - self._info = info - - @classmethod - def __bool__(cls: any) -> bool: - return False - - @classmethod - def __str__(cls: any) -> str: - return "" - - @property - def info(self: any) -> str: - return self._info - - @staticmethod - def is_empty() -> bool: - return True diff --git a/profiler/cluster_analyse/common_func/file_manager.py b/profiler/cluster_analyse/common_func/file_manager.py deleted file mode 100644 index e7e2d5adca37faf5b377bcbe720fdfba84311eca..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/file_manager.py +++ /dev/null @@ -1,131 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import csv -import json - -from common_func.constant import Constant -from common_func.path_manager import PathManager - - -class FileManager: - DATA_FILE_AUTHORITY = 0o640 - DATA_DIR_AUTHORITY = 0o750 - - @classmethod - def read_csv_file(cls, file_path: str, class_bean: any) -> list: - PathManager.check_path_readable(file_path) - base_name = os.path.basename(file_path) - file_size = os.path.getsize(file_path) - if file_size <= 0: - return [] - if file_size > Constant.MAX_CSV_SIZE: - raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.") - result_data = [] - try: - with open(file_path, newline="") as csv_file: - reader = csv.DictReader(csv_file) - for row in reader: - result_data.append(class_bean(row)) - except Exception as e: - raise RuntimeError(f"Failed to read the file: {base_name}") from e - return result_data - - @classmethod - def read_json_file(cls, file_path: str) -> dict: - PathManager.check_path_readable(file_path) - base_name = os.path.basename(file_path) - file_size = os.path.getsize(file_path) - if file_size <= 0: - return {} - if file_size > Constant.MAX_JSON_SIZE: - raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.") - try: - with open(file_path, "r") as json_file: - result_data = json.loads(json_file.read()) - except Exception as e: - raise RuntimeError(f"Failed to read the file: {base_name}") from e - return result_data - - @classmethod - def create_csv_file(cls, profiler_path: str, data: list, file_name: str, headers: list = None) -> None: - if not data: - return - output_path = os.path.join( - profiler_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - output_file = os.path.join(output_path, file_name) - base_name = os.path.basename(output_file) - PathManager.check_path_writeable(output_path) - try: - with os.fdopen( - os.open(output_file, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY), - 'w', newline="" - ) as file: - writer = csv.writer(file) - if headers: - writer.writerow(headers) - writer.writerows(data) - except Exception as e: - raise RuntimeError(f"Can't create file: {base_name}") from e - - @classmethod - def create_json_file(cls, profiler_path: str, data: dict, file_name: str) -> None: - if not data: - return - output_path = os.path.join(profiler_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - output_file = os.path.join(output_path, file_name) - base_name = os.path.basename(output_file) - PathManager.check_path_writeable(output_path) - try: - with os.fdopen( - os.open(output_file, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY), 'w' - ) as file: - file.write(json.dumps(data)) - except Exception as e: - raise RuntimeError(f"Can't create the file: {base_name}") from e - - @classmethod - def create_output_dir(cls, collection_path: str, is_overwrite: bool = False) -> None: - output_path = os.path.join( - collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT) - if is_overwrite: - if not os.path.exists(output_path): - PathManager.make_dir_safety(output_path) - return 
- PathManager.remove_path_safety(output_path) - PathManager.make_dir_safety(output_path) - - @classmethod - def check_file_size(cls, file_path): - suffix = os.path.splitext(file_path)[1] - base_name = os.path.basename(file_path) - if suffix == Constant.CSV_SUFFIX: - limit_size = Constant.MAX_CSV_SIZE - else: - limit_size = Constant.MAX_JSON_SIZE - file_size = os.path.getsize(file_path) - if file_size > limit_size: - raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.") - - -def check_db_path_valid(path: str, is_create: bool = False, max_size: int = Constant.MAX_READ_DB_FILE_BYTES) -> bool: - if os.path.islink(path): - print(f'[ERROR] The db file path: {path} is a soft link. Please check the path') - return False - if not is_create and os.path.exists(path) and os.path.getsize(path) > max_size: - print(f'[ERROR] The db file: {path} is too large to read. Please check the file') - return False - return True diff --git a/profiler/cluster_analyse/common_func/path_manager.py b/profiler/cluster_analyse/common_func/path_manager.py deleted file mode 100644 index 7ef7b4c345c024a0980c6ce2d91839b64c351740..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/path_manager.py +++ /dev/null @@ -1,200 +0,0 @@ -# Copyright (c) 2023 Huawei Technologies Co., Ltd -# All rights reserved. -# -# Licensed under the BSD 3-Clause License (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://opensource.org/licenses/BSD-3-Clause -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License.
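PathManager below stacks several defensive checks (length limits, soft-link rejection, ownership, permissions) in front of every filesystem touch. The smallest version of that idea, on a hypothetical input path:

```python
import os


def safe_realpath(path):
    # Reject soft links before resolving, mirroring PathManager.get_realpath.
    if os.path.islink(path):
        raise RuntimeError("Invalid input path which is a soft link.")
    return os.path.realpath(path)


print(safe_realpath("."))  # current directory, resolved to an absolute path
```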
- -import os -import re -import shutil -import platform - - -class PathManager: - MAX_PATH_LENGTH = 4096 - MAX_FILE_NAME_LENGTH = 255 - DATA_FILE_AUTHORITY = 0o640 - DATA_DIR_AUTHORITY = 0o750 - WINDOWS = "windows" - - @classmethod - def check_input_directory_path(cls, path: str): - """ - Function Description: - check whether the path is valid, some businesses can accept a path that does not exist, - so the function do not verify whether the path exists - Parameter: - path: the path to check, whether the incoming path is absolute or relative depends on the business - Exception Description: - when invalid data throw exception - """ - cls.input_path_common_check(path) - base_name = os.path.basename(path) - if os.path.isfile(path): - msg = f"Invalid input path which is a file path: {base_name}" - raise RuntimeError(msg) - - @classmethod - def check_input_file_path(cls, path: str): - """ - Function Description: - check whether the file path is valid, some businesses can accept a path that does not exist, - so the function do not verify whether the path exists - Parameter: - path: the file path to check, whether the incoming path is absolute or relative depends on the business - Exception Description: - when invalid data throw exception - """ - cls.input_path_common_check(path) - base_name = os.path.basename(path) - if os.path.isdir(path): - msg = f"Invalid input path which is a directory path: {base_name}" - raise RuntimeError(msg) - - @classmethod - def check_path_length(cls, path: str): - if len(path) > cls.MAX_PATH_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - path_split_list = path.split("/") - for path in path_split_list: - path_list = path.split("\\") - for name in path_list: - if len(name) > cls.MAX_FILE_NAME_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - - @classmethod - def input_path_common_check(cls, path: str): - if len(path) > cls.MAX_PATH_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - - if os.path.islink(path): - msg = f"Invalid input path which is a soft link." - raise RuntimeError(msg) - - if platform.system().lower() == cls.WINDOWS: - pattern = r'(\.|:|\\|/|_|-|\s|[~0-9a-zA-Z\u4e00-\u9fa5])+' - else: - pattern = r'(\.|/|_|-|\s|[~0-9a-zA-Z])+' - if not re.fullmatch(pattern, path): - msg = f"Invalid input path." - raise RuntimeError(msg) - - path_split_list = path.split("/") - for path in path_split_list: - path_list = path.split("\\") - for name in path_list: - if len(name) > cls.MAX_FILE_NAME_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - - @classmethod - def check_path_owner_consistent(cls, path: str): - """ - Function Description: - check whether the path belong to process owner - Parameter: - path: the path to check - Exception Description: - when invalid path, prompt the user - """ - base_name = os.path.basename(path) - if not os.path.exists(path): - msg = f"Invalid path: {base_name}" - raise RuntimeError(msg) - if platform.system().lower() == cls.WINDOWS: - return - if os.stat(path).st_uid != os.getuid(): - check_msg = input("The path does not belong to you, do you want to continue? 
[y/n]") - if check_msg.lower() != "y": - raise RuntimeError("The user choose not to continue.") - - @classmethod - def check_path_writeable(cls, path): - """ - Function Description: - check whether the path is writable - Parameter: - path: the path to check - Exception Description: - when invalid data throw exception - """ - cls.check_path_owner_consistent(path) - if os.path.islink(path): - msg = f"Invalid path which is a soft link." - raise RuntimeError(msg) - base_name = os.path.basename(path) - if not os.access(path, os.W_OK): - msg = f"The path permission check failed: {base_name}" - raise RuntimeError(msg) - - @classmethod - def check_path_readable(cls, path): - """ - Function Description: - check whether the path is writable - Parameter: - path: the path to check - Exception Description: - when invalid data throw exception - """ - cls.check_path_owner_consistent(path) - if os.path.islink(path): - msg = f"Invalid path which is a soft link." - raise RuntimeError(msg) - base_name = os.path.basename(path) - if not os.access(path, os.R_OK): - msg = f"The path permission check failed: {base_name}" - raise RuntimeError(msg) - - @classmethod - def remove_path_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to remove path: {base_name}" - if os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - try: - shutil.rmtree(path) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def make_dir_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to make directory: {base_name}" - if os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - return - try: - os.makedirs(path, mode=cls.DATA_DIR_AUTHORITY) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def create_file_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to create file: {base_name}" - if os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - return - try: - os.close(os.open(path, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY)) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def get_realpath(cls, path: str) -> str: - if os.path.islink(path): - msg = f"Invalid input path which is a soft link." - raise RuntimeError(msg) - return os.path.realpath(path) diff --git a/profiler/cluster_analyse/common_func/sql_extention_func.py b/profiler/cluster_analyse/common_func/sql_extention_func.py deleted file mode 100644 index 987a0d4365307704d6abf32575a48cc15c0fa33d..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/sql_extention_func.py +++ /dev/null @@ -1,73 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import numpy as np - - -class Median: - - def __init__(self) -> None: - self.data = [] - - def step(self, value) -> None: - self.data.append(value) - - def finalize(self): - return np.median(self.data) - - -class LowerQuartile: - - def __init__(self) -> None: - self.data = [] - - def step(self, value) -> None: - self.data.append(value) - - def finalize(self): - return np.quantile(self.data, 0.25) - - -class UpperQuartile: - - def __init__(self) -> None: - self.data = [] - - def step(self, value) -> None: - self.data.append(value) - - def finalize(self): - return np.quantile(self.data, 0.75) - - -class StandardDeviation: - - def __init__(self) -> None: - self.data = [] - - def step(self, value) -> None: - self.data.append(value) - - def finalize(self): - return np.std(self.data) - - -# func_name, params_count, class -SqlExtentionAggregateFunc = [ - ('median', 1, Median), - ('lower_quartile', 1, LowerQuartile), - ('upper_quartile', 1, UpperQuartile), - ('stdev', 1, StandardDeviation) -] diff --git a/profiler/cluster_analyse/common_func/table_constant.py b/profiler/cluster_analyse/common_func/table_constant.py deleted file mode 100644 index de6d47e97e5683493905de5353a9978195e87b70..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/table_constant.py +++ /dev/null @@ -1,27 +0,0 @@ -class TableConstant: - - RANK_SET = "rank_set" - STEP = "step" - RANK_ID = "rank_id" - TYPE = "type" - HCCL_OP_NAME = "hccl_op_name" - GROUP_NAME = "group_name" - START_TIMESTAMP = "start_timestamp" - ELAPSED_TIME = "elapse_time" - TRANSIT_TIME = "transit_time" - WAIT_TIME = "wait_time" - SYNCHRONIZATION_TIME = "synchronization_time" - IDLE_TIME = "idle_time" - SYNCHRONIZATION_TIME_RATIO = "synchronization_time_ratio" - WAIT_TIME_RATIO = "wait_time_ratio" - BAND_TYPE = "band_type" - TRANSIT_SIZE = "transit_size" - BANDWIDTH = "bandwidth" - LARGE_PACKET_RATIO = "large_packet_ratio" - PACKAGE_SIZE = "package_size" - COUNT = "count" - TOTAL_DURATION = "total_duration" - SRC_RANK = "src_rank" - DST_RANK = "dst_rank" - TRANSPORT_TYPE = "transport_type" - OPNAME = "op_name" diff --git a/profiler/cluster_analyse/common_func/tables_config.py b/profiler/cluster_analyse/common_func/tables_config.py deleted file mode 100644 index f010014519f864e627f83b99ad0df26af98af3f9..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/tables_config.py +++ /dev/null @@ -1,73 +0,0 @@ -class TablesConfig: - DATA = { - "ClusterCommAnalyzerTimeMap": [ - ("rank_set", "TEXT, null"), - ("step", "TEXT, null"), - ("rank_id", "INTEGER, null"), - ("hccl_op_name", "TEXT, null"), - ("group_name", "TEXT, null"), - ("start_timestamp", "NUMERIC, null"), - ("elapsed_time", "NUMERIC, null"), - ("transit_time", "NUMERIC, null"), - ("wait_time", "NUMERIC, null"), - ("synchronization_time", "NUMERIC, null"), - ("idle_time", "NUMERIC, null"), - ("synchronization_time_ratio", "NUMERIC, null"), - ("wait_time_ratio", "NUMERIC, null") - ], - "CommunicationGroupMap": [ - ("type", "TEXT, null"), - ("rank_set", "TEXT, null") - ], - "ClusterCommAnalyzerBandwidthMap": [ - ("rank_set", "TEXT, null"), - ("step", "TEXT, null"), - ("rank_id", "INTEGER, null"), - ("hccl_op_name", "TEXT, null"), - ("group_name", "TEXT, null"), - ("band_type", "TEXT, null"), - ("transit_size", "NUMERIC, null"), - ("transit_time", "NUMERIC, null"), - ("bandwidth", "NUMERIC, null"), - ("large_packet_ratio", "NUMERIC, null"), - ("package_size", "NUMERIC, null"), - ("count", "NUMERIC, null"), - ("total_duration", "NUMERIC, null") - 
], - "ClusterCommAnalyzerMatrixMap": [ - ("rank_set", "TEXT, null"), - ("step", "TEXT, null"), - ("hccl_op_name", "TEXT, null"), - ("group_name", "TEXT, null"), - ("src_rank", "TEXT, null"), - ("dst_rank", "TEXT, null"), - ("transit_size", "NUMERIC, null"), - ("transit_time", "NUMERIC, null"), - ("bandwidth", "NUMERIC, null"), - ("transport_type", "TEXT, null"), - ("op_name", "TEXT, null") - ], - "ClusterStepTraceTimeMap": [ - ("step", "TEXT, null"), - ("type", "TEXT, null"), - ("index", "TEXT, null"), - ("computing", "NUMERIC, null"), - ("communication_not_overlapped", "NUMERIC, null"), - ("overlapped", "NUMERIC, null"), - ("communication", "NUMERIC, null"), - ("free", "NUMERIC, null"), - ("stage", "NUMERIC, null"), - ("bubble", "NUMERIC, null"), - ("communication_not_overlapped_and_exclude_receive", "NUMERIC, null"), - ("preparing", "NUMERIC, null") - ], - "HostInfoMap": [ - ("hostUid", "INTEGER, null"), - ("hostName", "TEXT, null") - ], - "RankDeviceMapMap": [ - ("rankId", "INTEGER, null"), - ("deviceId", "INTEGER, null"), - ("hostUid", "INTEGER, null") - ] - } diff --git a/profiler/cluster_analyse/common_func/utils.py b/profiler/cluster_analyse/common_func/utils.py deleted file mode 100644 index 0a20a5c237f9f46e7b7425ef4b295dad4656174e..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/utils.py +++ /dev/null @@ -1,73 +0,0 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import numpy as np -import pandas as pd - - -def format_columns(df: pd.DataFrame): - formatted_df = df.rename( - { - "25%": "Q1Ns", - "50%": "MedianNs", - "75%": "Q3Ns", - 0.25: "Q1Ns", - 0.5: "MedianNs", - 0.75: "Q3Ns", - "Q1": "Q1Ns", - "Q3": "Q3Ns", - "min": "MinNs", - "max": "MaxNs", - "median": "MedianNs", - "sum": "SumNs", - "std": "StdNs", - "mean": "MeanNs", - "count": "Count" - }, - axis="columns" - ) - - stats_cols = ["Count", "MeanNs", "StdNs", "MinNs", "Q1Ns", "MedianNs", "Q3Ns", "MaxNs", "SumNs"] - other_cols = [col for col in formatted_df.columns if col not in stats_cols] - return formatted_df[stats_cols + other_cols] - - -def describe_duration(series_groupby): - agg_df = series_groupby.agg(["min", "max", "count", "std", "mean", "sum"]) - quantile_df = series_groupby.quantile([0.25, 0.5, 0.75]) - - quantile_df = quantile_df.unstack() - quantile_df.columns = ["25%", "50%", "75%"] - - stats_df = pd.merge(agg_df, quantile_df, left_index=True, right_index=True) - formatted_df = format_columns(stats_df) - formatted_df.index.name = stats_df.index.name - return formatted_df - - -def stdev(df, aggregated): - # Pooled standard deviation: combine per-group variances with the deviation of per-group means from the aggregated mean - if len(df) <= 1: - return df["stdevNs"].iloc[0] - instance = aggregated["totalCount"].loc[df.name] - var_sum = np.dot(df["totalCount"] - 1, df["stdev"] ** 2) - deviation = df["averageNs"] - aggregated["averageNs"].loc[df.name] - dev_sum = np.dot(df["totalCount"], deviation ** 2) - return np.sqrt((var_sum + dev_sum) / (instance - 1)) - - -def convert_unit(df: pd.DataFrame, src_unit, dst_unit): - # Assumes a factor-of-1000 conversion between src_unit and dst_unit (e.g. Ns -> us) - df.loc[:, df.columns.str.endswith(src_unit)] = df.loc[:, df.columns.str.endswith(src_unit)].apply(lambda x: x / 1000.0) - df = df.rename(columns=lambda x: x.replace(src_unit, "".join(["(", dst_unit, ")"]))) - return df diff --git a/profiler/cluster_analyse/communication_group/__init__.py b/profiler/cluster_analyse/communication_group/__init__.py deleted file mode 100644 index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/communication_group/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. diff --git a/profiler/cluster_analyse/communication_group/base_communication_group.py b/profiler/cluster_analyse/communication_group/base_communication_group.py deleted file mode 100644 index 55f6801c2875698047849d39fbee3b9827c9ad28..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/communication_group/base_communication_group.py +++ /dev/null @@ -1,228 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
diff --git a/profiler/cluster_analyse/communication_group/base_communication_group.py b/profiler/cluster_analyse/communication_group/base_communication_group.py
deleted file mode 100644
index 55f6801c2875698047849d39fbee3b9827c9ad28..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/base_communication_group.py
+++ /dev/null
@@ -1,228 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-from abc import abstractmethod
-from collections import defaultdict
-from copy import deepcopy
-from multiprocessing import Pool
-
-from common_func.constant import Constant
-from cluster_utils.data_transfer_adapter import DataTransferAdapter
-
-
-class BaseCommunicationGroup:
-    def __init__(self, params: dict):
-        self.collection_path = params.get(Constant.COLLECTION_PATH)
-        self.data_map = params.get(Constant.DATA_MAP)
-        self.data_type = params.get(Constant.DATA_TYPE)
-        self.analysis_mode = params.get(Constant.ANALYSIS_MODE)
-        self.rank_comm_dir_dict = {}
-        self.p2p_link = []
-        self.collective_group_dict = defaultdict(set)
-        self.p2p_comm_group = []
-        self.communication_group = {}
-        self.communication_ops = []
-        self.matrix_ops = []
-        self.adapter = DataTransferAdapter()
-
-    def load_communication_data(self):
-        comm_op_dirs = []
-        for rank_id, profiling_dir_path in self.data_map.items():
-            if self.data_type == Constant.TEXT:
-                comm_dir = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.COMM_JSON)
-                matrix_dir = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.COMM_MATRIX_JSON)
-            else:
-                comm_dir = os.path.join(profiling_dir_path, Constant.SINGLE_OUTPUT, Constant.DB_COMMUNICATION_ANALYZER)
-                matrix_dir = comm_dir
-            if os.path.exists(comm_dir) or os.path.exists(matrix_dir):
-                comm_op_dirs.append((rank_id, comm_dir, matrix_dir))
-            else:
-                print(
-                    f"[WARNING] Rank {rank_id} does not have valid communication data and communication_matrix data.")
-        # os.cpu_count() may return None or 1; keep at least one worker process.
-        max_processes = max(1, (os.cpu_count() or 2) // 2)
-        with Pool(processes=max_processes) as p:
-            self.rank_comm_dir_dict = p.map(self.read_communication_func, comm_op_dirs)
-
-    def set_p2p_groups(self):
-        self.p2p_link = sorted(self.p2p_link, key=lambda x: min(x))
-        while self.p2p_link:
-            union_set = deepcopy(self.p2p_link[0])
-            rm_list = [self.p2p_link[0]]
-            for link_rank_set_x in self.p2p_link[1:]:
-                if UnionFind.is_connected(link_rank_set_x, union_set):
-                    union_set = union_set.union(link_rank_set_x)
-                    rm_list.append(link_rank_set_x)
-            self.p2p_comm_group.append(union_set)
-            self.p2p_link = [element for element in self.p2p_link if element not in rm_list]
-
-    def generate_collective_communication_group(self):
-        self.communication_group[Constant.COLLECTIVE] = \
-            [list(group) for group in self.collective_group_dict.values()]
-
-    def generate_p2p_communication_group(self):
-        stage_group = {}
-        for group_name, rank_set in self.collective_group_dict.items():
-            if not self.whether_valid_comm_group(rank_set):
-                continue
-            unioned_set = set()
-            remove_key = []
-            for first_rank, stage in stage_group.items():
-                if UnionFind.is_connected(rank_set, stage):
-                    unioned_set = UnionFind.union(rank_set, stage, unioned_set)
-                    remove_key.append(first_rank)
-            if unioned_set:
-                for key in remove_key:
-                    del stage_group[key]
-                stage_group[min(unioned_set)] = unioned_set
-            else:
-                stage_group[min(rank_set)] = rank_set
-        first_rank_sort_list = sorted(stage_group.keys())
-        self.communication_group[Constant.P2P] = \
-            [list(stage_group.get(first_rank, {})) for first_rank in first_rank_sort_list]
-
-    def whether_valid_comm_group(self, rank_set: set):
-        """
-        When deciding which communication groups can be used to infer stage info,
-        ignore any group that shares more than one rank with a single p2p group.
-        """
-        for p2p_rank_set in self.p2p_comm_group:
-            if len(rank_set.intersection(p2p_rank_set)) > 1:
-                return False
-        return True
-
-    @abstractmethod
-    def read_communication_func(self, params: tuple):
-        pass
-
-    def analyze_communication_data(self):
-        for rank_id, rank_id_comm_dict, rank_id_matrix_dict in self.rank_comm_dir_dict:
-            for step_id, step_id_dict in rank_id_comm_dict.items():
-                if not isinstance(step_id_dict, dict):
-                    print(f"[WARNING] rank{rank_id}'s communication.json has an invalid data structure.")
-                    continue
-                self.get_collective_ops_name(rank_id, step_id_dict.get(Constant.COLLECTIVE, {}))
-                for comm_op_type, comm_op_dict in step_id_dict.items():
-                    self.add_communication_ops(rank_id, step_id, comm_op_type, comm_op_dict)
-
-            for step_id, step_id_dict in rank_id_matrix_dict.items():
-                if not isinstance(step_id_dict, dict):
-                    print(f"[WARNING] rank{rank_id}'s communication_matrix.json has an invalid data structure.")
-                    continue
-                self.set_p2p_link(rank_id, step_id, rank_id_matrix_dict)
-                self.get_collective_ops_name(rank_id, step_id_dict.get(Constant.COLLECTIVE, {}))
-
-    @abstractmethod
-    def dump_data(self):
-        pass
-
-    def collect_comm_data(self):
-        comm_data_dict = {
-            Constant.COLLECTIVE_GROUP: self.collective_group_dict,
-            Constant.COMMUNICATION_OPS: self.communication_ops,
-            Constant.MATRIX_OPS: self.matrix_ops,
-            Constant.COMMUNICATION_GROUP: self.communication_group
-        }
-        return comm_data_dict
-
-    def generate(self):
-        self.load_communication_data()
-        self.analyze_communication_data()
-        self.set_p2p_groups()
-        self.generate_collective_communication_group()
-        self.generate_p2p_communication_group()
-        self.dump_data()
-        return self.collect_comm_data()
-
-    def set_p2p_link(self, rank_id: int, step_id: str, rank_id_matrix_dict: dict):
-        ops = rank_id_matrix_dict.get(step_id, {})
-        self.add_matrix_ops(rank_id, step_id, ops)
-        if not ops:
-            print(f"[WARNING] rank{rank_id} {step_id} does not have communication matrix ops data.")
-            return
-        p2p_ops = ops.get(Constant.P2P, {})
-        for op_name, link_dict in p2p_ops.items():
-            self.append_p2p_link(op_name, link_dict)
-
-    def append_p2p_link(self, op_name, link_dict):
-        for link in link_dict:
-            if '-' not in link:
-                # skip malformed keys but keep scanning the remaining links
-                print(f"[WARNING] {op_name} has an invalid link key {link}!")
-                continue
-            src_rank = int(link.split('-')[0])
-            dst_rank = int(link.split('-')[1])
-            if src_rank != dst_rank:
-                rank_set = {src_rank, dst_rank}
-                if rank_set in self.p2p_link:
-                    continue
-                self.p2p_link.append(rank_set)
-
-    def get_collective_ops_name(self, rank_id: int, comm_op_dict: dict):
-        for comm_op in comm_op_dict:
-            if comm_op.startswith('Total'):
-                continue
-            group_name = comm_op.split('@')[-1]
-            self.collective_group_dict[group_name].add(rank_id)
-
-    def add_communication_ops(self, rank_id: int, step_id: str, comm_op_type: str, comm_op_dict: dict):
-        for comm_op in comm_op_dict:
-            if comm_op.startswith('Total'):
-                continue
-            group_name = comm_op.split('@')[-1]
-            self.communication_ops.append({
-                Constant.RANK_ID: rank_id,
-                Constant.STEP_ID: step_id,
-                Constant.COMM_OP_TYPE: comm_op_type,
-                Constant.COMM_OP_NAME: comm_op,
-                Constant.GROUP_NAME: group_name,
-                Constant.COMM_OP_INFO: comm_op_dict.get(comm_op)
-            })
-
-    def add_matrix_ops(self, rank_id: int, step_id: str, step_id_dict: dict):
-        for comm_op_type, comm_dict in step_id_dict.items():
-            if comm_op_type != Constant.COLLECTIVE and comm_op_type != Constant.P2P:
-                print(f"[WARNING] Unknown communication operator type: {comm_op_type}!")
-                continue
-            for op_name, op_link_info in comm_dict.items():
-                if op_name.startswith('Total'):
-                    continue
-                group_name = op_name.split('@')[-1]
-                self.matrix_ops.append({
-                    Constant.RANK_ID: rank_id,
-                    Constant.STEP_ID: step_id,
-                    Constant.COMM_OP_TYPE: comm_op_type,
-                    Constant.COMM_OP_NAME: op_name,
-                    Constant.GROUP_NAME: group_name,
-                    Constant.COMM_OP_INFO: op_link_info
-                })
-
-
-class UnionFind(object):
-    """Disjoint Set Union"""
-
-    @classmethod
-    def union(cls, first_set: set, second_set: set, third_set: set):
-        """merge the given sets into one"""
-        return first_set | second_set | third_set
-
-    @classmethod
-    def is_connected(cls, first_set: set, second_set: set):
-        """check whether two sets share at least one element"""
-        return bool(first_set & second_set)
diff --git a/profiler/cluster_analyse/communication_group/communication_db_group.py b/profiler/cluster_analyse/communication_group/communication_db_group.py
deleted file mode 100644
index 510dcd971357dfb4798e4d284a72fbb3f3a21859..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/communication_db_group.py
+++ /dev/null
@@ -1,57 +0,0 @@
-import os
-
-from common_func.db_manager import DBManager
-from common_func.constant import Constant
-from communication_group.base_communication_group import BaseCommunicationGroup
-
-
-class CommunicationDBGroup(BaseCommunicationGroup):
-    COMMUNICATION_GROUP_TABLE = "CommunicationGroup"
-
-    def __init__(self, params: dict):
-        super().__init__(params)
-
-    def read_communication_func(self, params: tuple):
-        if len(params) < 3:
-            # keep the (rank_id, comm_data, matrix_data) shape expected by
-            # analyze_communication_data
-            return -1, {}, {}
-        rank_id = params[0]
-        db_path = params[1]
-        time_data = []
-        bandwidth_data = []
-        matrix_data = []
-        if os.path.exists(db_path):
-            conn, cursor = DBManager.create_connect_db(db_path)
-            time_info_sql = "select * from {0}".format(Constant.TABLE_COMM_ANALYZER_TIME)
-            bandwidth_info_sql = "select * from {0}".format(Constant.TABLE_COMM_ANALYZER_BANDWIDTH)
-            matrix_info_sql = "select * from {0}".format(Constant.TABLE_COMM_ANALYZER_MATRIX)
-            if (DBManager.check_tables_in_db(db_path, Constant.TABLE_COMM_ANALYZER_TIME,
-                                             Constant.TABLE_COMM_ANALYZER_BANDWIDTH)
-                    and self.analysis_mode in ["all", "communication_time"]):
-                time_data = DBManager.fetch_all_data(cursor, time_info_sql)
-                bandwidth_data = DBManager.fetch_all_data(cursor, bandwidth_info_sql)
-            if (DBManager.check_tables_in_db(db_path, Constant.TABLE_COMM_ANALYZER_MATRIX)
-                    and self.analysis_mode in ["all", "communication_matrix"]):
-                matrix_data = DBManager.fetch_all_data(cursor, matrix_info_sql)
-            DBManager.destroy_db_connect(conn, cursor)
-        comm_data = self.adapter.transfer_comm_from_db_to_json(time_data, bandwidth_data)
-        comm_matrix_data = self.adapter.transfer_matrix_from_db_to_json(matrix_data)
-        return rank_id, comm_data, comm_matrix_data
-
-    def dump_data(self):
-        output_path = os.path.join(self.collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT)
-        result_db = os.path.join(output_path, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER)
-        res = []
-        for data_type, data_list in self.communication_group.items():
-            for group in data_list:
-                rank_set = "(" + ",".join(str(i) for i in group) + ")"
-                res.append([data_type, rank_set])
-        if res:
-            DBManager.create_tables(result_db, self.COMMUNICATION_GROUP_TABLE)
-            conn, cursor = DBManager.create_connect_db(result_db)
-            sql = "insert into {} values ({value})".format(self.COMMUNICATION_GROUP_TABLE,
-                                                           value="?," * (len(res[0]) - 1) + "?")
-            DBManager.executemany_sql(conn, sql, res)
-            DBManager.destroy_db_connect(conn, cursor)
-        else:
-            print("[WARNING] The CommunicationGroup table won't be created because no data has been calculated.")
diff --git a/profiler/cluster_analyse/communication_group/communication_group_generator.py b/profiler/cluster_analyse/communication_group/communication_group_generator.py
deleted file mode 100644
index 3dca90454b608fe3ffb1c365854c2aa3950b6cee..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/communication_group_generator.py
+++ /dev/null
@@ -1,32 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from common_func.constant import Constant
-from communication_group.communication_db_group import CommunicationDBGroup
-from communication_group.communication_json_group import CommunicationJsonGroup
-
-
-class CommunicationGroupGenerator:
-
-    GROUP_MAP = {
-        Constant.DB: CommunicationDBGroup,
-        Constant.TEXT: CommunicationJsonGroup
-    }
-
-    def __init__(self, params: dict):
-        self.processor = self.GROUP_MAP.get(params.get(Constant.DATA_TYPE))(params)
-
-    def generate(self):
-        return self.processor.generate()
diff --git a/profiler/cluster_analyse/communication_group/communication_json_group.py b/profiler/cluster_analyse/communication_group/communication_json_group.py
deleted file mode 100644
index f6e01e3abfde4d8f180043a5bf9a50c6b5a4964c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/communication_group/communication_json_group.py
+++ /dev/null
@@ -1,44 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-from common_func.constant import Constant
-from common_func.file_manager import FileManager
-from communication_group.base_communication_group import BaseCommunicationGroup
-
-
-class CommunicationJsonGroup(BaseCommunicationGroup):
-    COMMUNICATION_GROUP_JSON = "communication_group.json"
-
-    def __init__(self, params: dict):
-        super().__init__(params)
-
-    def dump_data(self):
-        FileManager.create_json_file(self.collection_path, self.communication_group, self.COMMUNICATION_GROUP_JSON)
-
-    def read_communication_func(self, params: tuple):
-        if len(params) < 3:
-            return -1, {}, {}
-        rank_id = params[0]
-        comm_json_path = params[1]
-        matrix_json_path = params[2]
-        comm_data = {}
-        matrix_data = {}
-        if os.path.exists(comm_json_path) and self.analysis_mode in ["all", "communication_time"]:
-            comm_data = FileManager.read_json_file(comm_json_path)
-        if os.path.exists(matrix_json_path) and self.analysis_mode in ["all", "communication_matrix"]:
-            matrix_data = FileManager.read_json_file(matrix_json_path)
-        return rank_id, comm_data, matrix_data
diff --git a/profiler/cluster_analyse/prof_bean/__init__.py b/profiler/cluster_analyse/prof_bean/__init__.py
deleted file mode 100644
index 8400fd5ecd1246eaee795cebfccfacc80a94f08c..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/prof_bean/__init__.py
+++ /dev/null
@@ -1,14 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
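Both group readers above return one `(rank_id, comm_data, matrix_data)` tuple per rank, and `BaseCommunicationGroup` then infers group membership purely from operator names: everything after the last `@` is the communication group. A standalone sketch of that extraction; the `"collective"` key and the op names are invented stand-ins for the real `Constant.COLLECTIVE` value and profiler output:

```python
from collections import defaultdict

# Invented communication.json fragment for a single step on rank 0.
step_dict = {
    "collective": {
        "hcom_allReduce__068_0@group_a": {"Communication Time Info": {}},
        "Total Op Info": {},  # summary entry: no @group suffix, skipped below
    },
}

rank_id = 0
collective_group_dict = defaultdict(set)
for comm_op in step_dict["collective"]:
    if comm_op.startswith("Total"):
        continue
    group_name = comm_op.split("@")[-1]
    collective_group_dict[group_name].add(rank_id)

print(dict(collective_group_dict))  # {'group_a': {0}}
```

Repeating this over every rank's file is what populates `collective_group_dict`, from which the collective and p2p (stage) groups are derived.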
diff --git a/profiler/cluster_analyse/prof_bean/step_trace_time_bean.py b/profiler/cluster_analyse/prof_bean/step_trace_time_bean.py
deleted file mode 100644
index b0a3be4f5eaccea70aa912bc85e68d70dbda3bde..0000000000000000000000000000000000000000
--- a/profiler/cluster_analyse/prof_bean/step_trace_time_bean.py
+++ /dev/null
@@ -1,39 +0,0 @@
-# Copyright (c) 2023, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-class StepTraceTimeBean:
-    STEP = "Step"
-    COMPLEMENT_HEADER = ["Step", "Type", "Index"]
-
-    def __init__(self, data: dict):
-        self._data = data
-
-    @property
-    def row(self) -> list:
-        row = []
-        for field_name in self._data.keys():
-            if field_name == self.STEP:
-                continue
-            # missing fields default to 0.0 so float() never receives None
-            row.append(float(self._data.get(field_name, 0.0)))
-        return row
-
-    @property
-    def step(self) -> str:
-        return self._data.get(self.STEP, '')
-
-    @property
-    def all_headers(self) -> list:
-        return self.COMPLEMENT_HEADER + list(self._data.keys())[1:]
-
diff --git a/profiler/cluster_analyse/resources/.keep b/profiler/cluster_analyse/resources/.keep
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
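For orientation, a usage sketch for the bean above: it wraps one `csv.DictReader` row from `step_trace_time.csv`. This assumes `StepTraceTimeBean` as defined in the deleted file and an invented two-column fragment of the real header set:

```python
import csv
import io

# Invented fragment of step_trace_time.csv; real files carry the full column set.
raw = "Step,Computing,Communication(Not Overlapped)\n1,2867.5,312.0\n"

for row_dict in csv.DictReader(io.StringIO(raw)):
    bean = StepTraceTimeBean(row_dict)  # DictReader yields dicts keyed by header
    print(bean.step)         # '1'
    print(bean.row)          # [2867.5, 312.0], every column except Step, as float
    print(bean.all_headers)  # ['Step', 'Type', 'Index', 'Computing', 'Communication(Not Overlapped)']
```

`all_headers` prepends the `Type`/`Index` columns that the cluster aggregation fills in when it writes `cluster_step_trace_time.csv`.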