diff --git a/debug/accuracy_tools/MANIFEST.in b/debug/accuracy_tools/MANIFEST.in index 5776b37e9ebd360e8bd367f8802ca76008cad9d7..7242c0c95627b56620b63c650dbbffbf8aaa2896 100644 --- a/debug/accuracy_tools/MANIFEST.in +++ b/debug/accuracy_tools/MANIFEST.in @@ -1,6 +1,2 @@ -recursive-include ptdbg_ascend/src/python/ptdbg_ascend/ *.py -recursive-include ptdbg_ascend/src/python/ptdbg_ascend/ *.yaml -recursive-include ptdbg_ascend/src/python/ptdbg_ascend/ *.template recursive-include atat/ * -recursive-exclude api_accuracy_checker/test * recursive-exclude atat/test * \ No newline at end of file diff --git a/debug/accuracy_tools/api_accuracy_checker/README.md b/debug/accuracy_tools/api_accuracy_checker/README.md index acabe4a8c84aafd4a6235a3ef2010ce5b58a3ea9..b03f665b4072c307d75fed7541c297d3eee336dc 100644 --- a/debug/accuracy_tools/api_accuracy_checker/README.md +++ b/debug/accuracy_tools/api_accuracy_checker/README.md @@ -37,11 +37,9 @@ Ascend模型精度预检工具能在昇腾NPU上扫描用户训练模型中所 export PYTHONPATH=$PYTHONPATH:$MSTT_HOME/debug/accuracy_tools/ ``` -2. 安装依赖。 +2. 使用pip命令安装einops、numpy、pandas、PyYAML、rich、torch、tqdm、Twisted依赖。 - ```bash - pip3 install tqdm rich pyyaml pandas einops - ``` + 若环境中已安装部分依赖,不需要重复安装。 ## 预检操作 @@ -433,11 +431,9 @@ Forward Test Success和Backward Test Success是否通过测试是由`api_precisi export PYTHONPATH=$PYTHONPATH:$MSTT_HOME/debug/accuracy_tools/ ``` - 安装依赖: + 使用pip命令安装einops、numpy、pandas、PyYAML、rich、torch、tqdm、Twisted依赖。 - ```bash - pip3 install tqdm rich pyyaml pandas einops - ``` + 若环境中已安装部分依赖,不需要重复安装。 2. 执行溢出API解析操作 diff --git a/debug/accuracy_tools/atat/README.md b/debug/accuracy_tools/atat/README.md index e5f493db8774943fa9c0d2156a5eb4e6b858f9a7..85a7e3d24e6014cb144dc5d64042c5cdf673e0d8 100644 --- a/debug/accuracy_tools/atat/README.md +++ b/debug/accuracy_tools/atat/README.md @@ -6,7 +6,11 @@ MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat, 精度工具合一软件包名称:`ascend_training_accuracy_tools-{version}-py3-none-any.whl` -1. whl包获取。 +1. 使用pip命令安装numpy、openpyxl、pandas、PyYAML、rich、torch、tqdm依赖。 + + 若环境中已安装部分依赖,不需要重复安装。 + +2. whl包获取。 请通过下表链接下载工具whl包。 @@ -17,7 +21,7 @@ MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat, | 0.0.2 | 2024-05-23 | 2.0/2.1/2.2 | [ascend_training_accuracy_tools-0.0.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.2-py3-none-any.whl) | 2e35809bde559e9c4d2f16a02ccde779ed9e436bb65fded0b7ebaf6ac2c88d93 | | 0.0.1 | 2024-03-15 | 2.0/2.1 | [ascend_training_accuracy_tools-0.0.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.1-py3-none-any.whl) | 5801510d4e827e4859bc9a5aca021e4d30c2ea42d60a4c8ad0c2baab1b7782c9 | -2. whl包校验。 +3. whl包校验。 1. 根据以上下载链接下载whl包到Linux安装环境。 @@ -36,7 +40,7 @@ MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat, 5801510d4e827e4859bc9a5aca021e4d30c2ea42d60a4c8ad0c2baab1b7782c9 *ascend_training_accuracy_tools-0.0.1-py3-none-any.whl ``` -3. 执行如下命令进行安装。 +4. 
执行如下命令进行安装。 ```bash pip3 install ./ascend_training_accuracy_tools-{version}-py3-none-any.whl @@ -105,16 +109,36 @@ MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat, 上述流程中的工具均为atat工具的子工具,使用相同的命令行,格式如下: +精度预检工具 + +```bash +atat -f run_ut [-h] +``` + +```bash +atat -f multi_run_ut [-h] +``` + +```bash +atat -f api_precision_compare [-h] +``` + +溢出解析工具 + ```bash -atat [-h] -f parse run_ut multi_run_ut api_precision_compare run_overflow_check +atat -f run_overflow_check [-h] ``` -| 参数 | 说明 | -| ---- | ---------------------------------------- | -| -f | 框架,当前支持配置为pytorch和mindspore。 | -| -h | 帮助信息。 | +数据解析工具 + +```bash +atat -f parse [-h] +``` -其他参数在上述对应的工具手册中详细介绍。 +| 参数 | 说明 | +| ---- | ------------------------------------------------------ | +| -f | 框架,请按所使用框架配置,当前支持pytorch或mindspore。 | +| -h | 帮助信息。 | ## 贡献 diff --git a/debug/accuracy_tools/atat/atat.py b/debug/accuracy_tools/atat/atat.py index 12c4042bee906f060520262c2a6719a79e01ee0f..90f8215b102d4f120b879773797c4b4864b25d6b 100644 --- a/debug/accuracy_tools/atat/atat.py +++ b/debug/accuracy_tools/atat/atat.py @@ -16,7 +16,7 @@ import argparse import sys from atat.pytorch.api_accuracy_checker.run_ut.run_ut import _run_ut_parser, run_ut_command -from ptdbg_ascend.src.python.ptdbg_ascend.parse_tool.cli import parse as cli_parse +from atat.pytorch.parse_tool.cli import parse as cli_parse from atat.pytorch.api_accuracy_checker.run_ut.multi_run_ut import prepare_config, run_parallel_ut from atat.pytorch.api_accuracy_checker.compare.api_precision_compare import _api_precision_compare_parser, \ _api_precision_compare_command diff --git a/debug/accuracy_tools/atat/config/README.md b/debug/accuracy_tools/atat/config/README.md index 66429b54fc5e716bec8c70c932232546bae1b55e..98963c6f4e23e2577914867ba21a5e82e9996f3e 100644 --- a/debug/accuracy_tools/atat/config/README.md +++ b/debug/accuracy_tools/atat/config/README.md @@ -12,7 +12,8 @@ | dump_path | 设置dump数据目录路径,str类型。配置示例:"dump_path": "./dump_path"。MindSpore场景仅支持绝对路径。 | 是 | | rank | 指定对某张卡上的数据进行dump,list[int]类型,默认未配置(表示dump所有卡的数据),应配置为大于等于0的整数,且须配置实际可用的Rank ID。配置示例:"rank": [1]。
对于PyTorch场景,Rank ID从0开始计数,最大取值为所有节点可用卡总数-1,若所配置的值大于实际训练所运行的卡的Rank ID,则dump数据为空,比如当前环境Rank ID为0到7,实际训练仅运行0到3卡,此时若将Rank ID配置为4或不存在的10等其他值,dump数据为空。
对于MindSpore场景,所有节点的Rank ID均从0开始计数,最大取值为每个节点可用卡总数-1,config.json配置一次rank参数对所有节点同时生效。 | 否 | | step | 指定dump某个step的数据,list[int]类型。默认未配置,表示dump所有step数据。dump特定step时,须指定为训练脚本中存在的step。step为list格式,可配置逐个step,例如:"step": [0,1,2]。 | 否 | -| level | dump级别,str类型,根据不同级别dump不同数据。可取值"L0"(dump module模块级精度数据,仅PyTorch场景支持,使用背景详见“**模块级精度数据dump说明**”)、"L1"(dump API级精度数据,默认值)、"L2"(dump kernel级精度数据)、"mix"(dump module模块级和API级精度数据,即"L0"+"L1",仅PyTorch场景支持)。配置示例:"level": "L1"。 | 否 | +| level | dump级别,str类型,根据不同级别dump不同数据。可取值"L0"(dump module模块级精度数据,仅PyTorch场景支持,使用背景详见“**模块级精度数据dump说明**”)、"L1"(dump API级精度数据,默认值)、"L2"(dump kernel级精度数据,须配置acl_config参数)、"mix"(dump module模块级和API级精度数据,即"L0"+"L1",仅PyTorch场景支持)。配置示例:"level": "L1"。 | 否 | +| acl_config | kernel dump的配置文件,str类型。level取"L2"时,该参数必选;level为其他值时,该参数不选。参数示例:acl_config='./acl_config.json'。acl_config.json配置文件详细介绍请参见“**acl_config.json配置文件说明**”。 | 否 | | seed | 随机种子数,int类型,默认值为:1234,仅PyTorch场景支持。通过固定随机数保证模型的输入或输出一致,可固定的随机数详见“**固定随机数范围**”。配置示例:"seed": 1234。 | 否 | | is_deterministic | 确定性计算模式,bool类型,仅PyTorch场景支持。可取值true(开启)或false(关闭),默认关闭。配置示例:"is_deterministic": true。
即使在相同的硬件和输入下,API多次执行的结果也可能不同;开启确定性计算,可保证在相同的硬件和输入下,API多次执行的结果一致。
确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启。
rnn类算子、ReduceSum、ReduceMean等算子可能与确定性计算存在冲突,若开启确定性计算后多次执行的结果仍不相同,则可能是模型中使用了这些算子。 | 否 |
off:关闭单API模型dump,默认值。
on:开启单API模型dump。 | + +**dump目录说明** + +配置register_hook的dump_config后,采集的dump数据会在{dump_path}/{time}/{deviceid}/{model_id}目录下生成,例如“/home/HwHiAiUser/output/20200808163566/0/0” + +``` +├── 20230131172437 +│ └── 1 +│ ├── 0 +│ │ ├── Add.Add.45.0.1675157077183551 +│ │ ├── Cast.trans_Cast_0.31.0.1675157077159449 +│ │ ├── Cast.trans_Cast_5.43.0.1675157077180129 +│ │ ├── MatMul.MatMul.39.0.1675157077172961 +│ │ ├── Mul.Mul.29.0.1675157077155731 +│ │ ├── NPUAllocFloatStatus.NPUAllocFloatStatus.24.0.1675157077145262 +│ │ ├── TransData.trans_TransData_1.33.0.1675157077162791 +│ │ └── TransData.trans_TransData_4.41.0.1675157077176648 +│ ├── 1701737061 +│ │ └── Cast.trans_Cast_2.35.0.1675157077166214 +│ ├── 25 +│ │ └── NPUClearFloatStatus.NPUClearFloatStatus.26.0.1675157077150342 +│ └── 68 +│ └── TransData.trans_TransData_3.37.0.1675157077169473 +``` + ### 固定随机数范围 仅PyTorch场景支持。 diff --git a/debug/accuracy_tools/atat/pytorch/common/exceptions.py b/debug/accuracy_tools/atat/core/common/exceptions.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/common/exceptions.py rename to debug/accuracy_tools/atat/core/common/exceptions.py diff --git a/debug/accuracy_tools/atat/pytorch/common/file_check.py b/debug/accuracy_tools/atat/core/common/file_check.py similarity index 87% rename from debug/accuracy_tools/atat/pytorch/common/file_check.py rename to debug/accuracy_tools/atat/core/common/file_check.py index 3204652583b9bce5ac874b5a178fb83926856660..c5a79dab8a8ec8bc1faa847e7b5b502761d5aba1 100644 --- a/debug/accuracy_tools/atat/pytorch/common/file_check.py +++ b/debug/accuracy_tools/atat/core/common/file_check.py @@ -17,9 +17,8 @@ import os import re -from .log import print_error_log, print_warn_log -from .exceptions import FileCheckException -from .utils import Const +from atat.core.common.log import logger +from atat.core.common.exceptions import FileCheckException class FileCheckConst: @@ -32,6 +31,7 @@ class FileCheckConst: DIRECTORY_LENGTH = 4096 FILE_NAME_LENGTH = 255 FILE_VALID_PATTERN = r"^[a-zA-Z0-9_.:/-]+$" + FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' PKL_SUFFIX = ".pkl" NUMPY_SUFFIX = ".npy" JSON_SUFFIX = ".json" @@ -78,7 +78,7 @@ class FileChecker: @staticmethod def _check_path_type(path_type): if path_type not in [FileCheckConst.DIR, FileCheckConst.FILE]: - print_error_log(f'The path_type must be {FileCheckConst.DIR} or {FileCheckConst.FILE}.') + logger.error(f'The path_type must be {FileCheckConst.DIR} or {FileCheckConst.FILE}.') raise FileCheckException(FileCheckException.ILLEGAL_PARAM_ERROR) return path_type @@ -144,7 +144,7 @@ class FileOpen: def check_file_path(self): support_mode = self.SUPPORT_READ_MODE + self.SUPPORT_WRITE_MODE + self.SUPPORT_READ_WRITE_MODE if self.mode not in support_mode: - print_error_log("File open not support %s mode" % self.mode) + logger.error("File open not support %s mode" % self.mode) raise FileCheckException(FileCheckException.ILLEGAL_PARAM_ERROR) check_link(self.file_path) self.file_path = os.path.realpath(self.file_path) @@ -171,7 +171,7 @@ class FileOpen: def check_link(path): abs_path = os.path.abspath(path) if os.path.islink(abs_path): - print_error_log('The file path {} is a soft link.'.format(path)) + logger.error('The file path {} is a soft link.'.format(path)) raise FileCheckException(FileCheckException.SOFT_LINK_ERROR) @@ -179,58 +179,58 @@ def check_path_length(path, name_length=None): file_max_name_length = name_length if name_length else FileCheckConst.FILE_NAME_LENGTH if len(path) > FileCheckConst.DIRECTORY_LENGTH or \ 
len(os.path.basename(path)) > file_max_name_length: - print_error_log('The file path length exceeds limit.') + logger.error('The file path length exceeds limit.') raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) def check_path_exists(path): if not os.path.exists(path): - print_error_log('The file path %s does not exist.' % path) + logger.error('The file path %s does not exist.' % path) raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) def check_path_readability(path): if not os.access(path, os.R_OK): - print_error_log('The file path %s is not readable.' % path) + logger.error('The file path %s is not readable.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_writability(path): if not os.access(path, os.W_OK): - print_error_log('The file path %s is not writable.' % path) + logger.error('The file path %s is not writable.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_executable(path): if not os.access(path, os.X_OK): - print_error_log('The file path %s is not executable.' % path) + logger.error('The file path %s is not executable.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_other_user_writable(path): st = os.stat(path) if st.st_mode & 0o002: - print_error_log('The file path %s may be insecure because other users have write permissions. ' % path) + logger.error('The file path %s may be insecure because other users have write permissions. ' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_owner_consistent(path): file_owner = os.stat(path).st_uid if file_owner != os.getuid(): - print_error_log('The file path %s may be insecure because is does not belong to you.' % path) + logger.error('The file path %s may be insecure because is does not belong to you.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_pattern_vaild(path): if not re.match(FileCheckConst.FILE_VALID_PATTERN, path): - print_error_log('The file path {} contains special characters.'.format(path)) + logger.error('The file path {} contains special characters.' 
.format(path))
+ return self.on_rank_0(self.error)(msg) + + def warning_on_rank_0(self, msg): + return self.on_rank_0(self.warning)(msg) + + def error_log_with_exp(self, exp, msg): + self.error(msg) + raise exp + + +logger = BaseLogger() \ No newline at end of file diff --git a/debug/accuracy_tools/atat/core/utils.py b/debug/accuracy_tools/atat/core/common/utils.py similarity index 86% rename from debug/accuracy_tools/atat/core/utils.py rename to debug/accuracy_tools/atat/core/common/utils.py index 47e54ffa6daa3fac0254af93d3e7d5aef4b47297..0c74bf038d29b19d68d20b6ffc398b78aee30abe 100644 --- a/debug/accuracy_tools/atat/core/utils.py +++ b/debug/accuracy_tools/atat/core/common/utils.py @@ -26,8 +26,8 @@ from datetime import datetime, timezone from pathlib import Path import numpy as np -from .file_check_util import FileOpen, FileChecker, FileCheckConst -from .log import print_info_log, print_warn_log, print_error_log +from atat.core.common.file_check import FileOpen, FileChecker, FileCheckConst +from atat.core.common.log import logger device = collections.namedtuple('device', ['type', 'index']) @@ -75,6 +75,7 @@ class Const: WRITE_FLAGS = os.O_WRONLY | os.O_CREAT WRITE_MODES = stat.S_IWUSR | stat.S_IRUSR + OVERWRITE_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_TRUNC PKL_SUFFIX = ".pkl" NUMPY_SUFFIX = ".npy" @@ -90,6 +91,15 @@ class Const: ASCEND_WORK_PATH = "ASCEND_WORK_PATH" DUMP_DIR = "dump_data" + KWARGS = 'kwargs' + INPUT = 'input' + OUTPUT = 'output' + INPUT_ARGS = 'input_args' + INPUT_KWARGS = 'input_kwargs' + GRAD_INPUT = 'grad_input' + GRAD_OUTPUT = 'grad_output' + START = "start" + STOP = "stop" ENV_ENABLE = "1" ENV_DISABLE = "0" @@ -104,6 +114,21 @@ class Const: TENSOR = "tensor" OVERFLOW_CHECK = "overflow_check" FREE_BENCHMARK = "free_benchmark" + KERNEL_DUMP = "kernel_dump" + DATA = "data" + PT_FRAMEWORK = "pytorch" + MS_FRAMEWORK = "mindspore" + DIRECTORY_LENGTH = 4096 + FILE_NAME_LENGTH = 255 + FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' + FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] + BOOL_TYPE = [bool, np.uint8] + INT_TYPE = [np.int32, np.int64] + NPU = 'NPU' + DISTRIBUTED = 'Distributed' + INPLACE_LIST = ["broadcast", "all_reduce", "reduce", "all_gather", "gather", "scatter", "reduce_scatter", + "_reduce_scatter_base", "_all_gather_base", "all_to_all_single"] + class CompareConst: """ @@ -235,7 +260,6 @@ class CompareException(Exception): INVALID_SUMMARY_MODE = 19 INVALID_TASK_ERROR = 20 - def __init__(self, code, error_info: str = ""): super(CompareException, self).__init__() self.code = code @@ -263,12 +287,12 @@ def make_dump_path_if_not_exists(dump_path): try: Path(dump_path).mkdir(mode=0o750, exist_ok=True, parents=True) except OSError as ex: - print_error_log( + logger.error( 'Failed to create {}.Please check the path permission or disk space .{}'.format(dump_path, str(ex))) raise CompareException(CompareException.INVALID_PATH_ERROR) from ex else: if not os.path.isdir(dump_path): - print_error_log('{} already exists and is not a directory.'.format(dump_path)) + logger.error('{} already exists and is not a directory.'.format(dump_path)) def check_mode_valid(mode, scope=None, api_list=None): @@ -300,13 +324,13 @@ def check_mode_valid(mode, scope=None, api_list=None): def check_switch_valid(switch): if switch not in ["ON", "OFF"]: - print_error_log("Please set switch with 'ON' or 'OFF'.") + logger.error("Please set switch with 'ON' or 'OFF'.") raise CompareException(CompareException.INVALID_PARAM_ERROR) def check_dump_mode_valid(dump_mode): 
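+    # Each *_on_rank_0 variant wraps the plain method with on_rank_0, so in a
+    # distributed job only rank 0 (or a process whose rank cannot be determined) prints.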
if not isinstance(dump_mode, list): - print_warn_log("Please set dump_mode as a list.") + logger.warning("Please set dump_mode as a list.") dump_mode = [dump_mode] if not all(mode in ["all", "forward", "backward", "input", "output"] for mode in dump_mode): raise ValueError("Please set dump_mode as a list containing one or more of the following: 'all', 'forward', 'backward', 'input', 'output'.") @@ -327,14 +351,14 @@ def check_summary_mode_valid(summary_mode): def check_summary_only_valid(summary_only): if not isinstance(summary_only, bool): - print_error_log("Params summary_only only support True or False.") + logger.error("Params summary_only only support True or False.") raise CompareException(CompareException.INVALID_PARAM_ERROR) return summary_only def check_compare_param(input_parma, output_path, stack_mode=False, summary_compare=False, md5_compare=False): if not (isinstance(input_parma, dict) and isinstance(output_path, str)): - print_error_log("Invalid input parameters") + logger.error("Invalid input parameters") raise CompareException(CompareException.INVALID_PARAM_ERROR) check_file_or_directory_path(input_parma.get("npu_json_path"), False) check_file_or_directory_path(input_parma.get("bench_json_path"), False) @@ -351,7 +375,7 @@ def check_compare_param(input_parma, output_path, stack_mode=False, summary_comp def check_configuration_param(stack_mode=False, auto_analyze=True, fuzzy_match=False): if not (isinstance(stack_mode, bool) and isinstance(auto_analyze, bool) and isinstance(fuzzy_match, bool)): - print_error_log("Invalid input parameters which should be only bool type.") + logger.error("Invalid input parameters which should be only bool type.") raise CompareException(CompareException.INVALID_PARAM_ERROR) @@ -379,7 +403,7 @@ def is_starts_with(string, prefix_list): def _check_json(json_file_handle, file_name): tensor_line = json_file_handle.readline() if not tensor_line: - print_error_log("dump file {} have empty line!".format(file_name)) + logger.error("dump file {} have empty line!".format(file_name)) raise CompareException(CompareException.INVALID_DUMP_FILE) json_file_handle.seek(0, 0) @@ -394,10 +418,10 @@ def check_file_size(input_file, max_size): try: file_size = os.path.getsize(input_file) except OSError as os_error: - print_error_log('Failed to open "%s". %s' % (input_file, str(os_error))) + logger.error('Failed to open "%s". %s' % (input_file, str(os_error))) raise CompareException(CompareException.INVALID_FILE_ERROR) from os_error if file_size > max_size: - print_error_log('The size (%d) of %s exceeds (%d) bytes, tools not support.' + logger.error('The size (%d) of %s exceeds (%d) bytes, tools not support.' % (file_size, input_file, max_size)) raise CompareException(CompareException.INVALID_FILE_ERROR) @@ -437,7 +461,7 @@ def remove_path(path): else: shutil.rmtree(path) except PermissionError as err: - print_error_log("Failed to delete {}. Please check the permission.".format(path)) + logger.error("Failed to delete {}. 
Please check the permission.".format(path)) raise CompareException(CompareException.INVALID_PATH_ERROR) from err @@ -463,14 +487,6 @@ def get_dump_data_path(dump_dir): return dump_data_path, file_is_exist -def modify_dump_path(dump_path, mode): - if mode == Const.ALL: - return dump_path - file_name = os.path.split(dump_path) - mode_file_name = mode + "_" + file_name[-1] - return os.path.join(file_name[0], mode_file_name) - - def create_directory(dir_path): """ Function Description: @@ -484,7 +500,7 @@ def create_directory(dir_path): try: os.makedirs(dir_path, mode=0o700) except OSError as ex: - print_error_log( + logger.error( 'Failed to create {}.Please check the path permission or disk space .{}'.format(dir_path, str(ex))) raise CompareException(CompareException.INVALID_PATH_ERROR) from ex @@ -498,7 +514,7 @@ def execute_command(cmd): Exception Description: when invalid command throw exception """ - print_info_log('Execute command:%s' % cmd) + logger.info('Execute command:%s' % cmd) process = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) while process.poll() is None: line = process.stdout.readline() @@ -506,7 +522,7 @@ def execute_command(cmd): if line: print(line) if process.returncode != 0: - print_error_log('Failed to execute command:%s' % " ".join(cmd)) + logger.error('Failed to execute command:%s' % " ".join(cmd)) raise CompareException(CompareException.INVALID_DATA_ERROR) @@ -530,7 +546,7 @@ def parse_value_by_comma(value): if value_str.isdigit() or value_str == '-1': value_list.append(int(value_str)) else: - print_error_log("please check your input shape.") + logger.error("please check your input shape.") raise CompareException(CompareException.INVALID_PARAM_ERROR) return value_list @@ -539,7 +555,7 @@ def get_data_len_by_shape(shape): data_len = 1 for item in shape: if item == -1: - print_error_log("please check your input shape, one dim in shape is -1.") + logger.error("please check your input shape, one dim in shape is -1.") return -1 data_len = data_len * item return data_len @@ -564,25 +580,25 @@ def format_value(value): def check_seed_all(seed, mode): if isinstance(seed, int): if seed < 0 or seed > Const.MAX_SEED_VALUE: - print_error_log(f"Seed must be between 0 and {Const.MAX_SEED_VALUE}.") + logger.error(f"Seed must be between 0 and {Const.MAX_SEED_VALUE}.") raise CompareException(CompareException.INVALID_PARAM_ERROR) else: - print_error_log(f"Seed must be integer.") + logger.error(f"Seed must be integer.") raise CompareException(CompareException.INVALID_PARAM_ERROR) if not isinstance(mode, bool): - print_error_log(f"seed_all mode must be bool.") + logger.error(f"seed_all mode must be bool.") raise CompareException(CompareException.INVALID_PARAM_ERROR) def get_process_rank(model): - print_info_log("Rank id is not provided. Trying to get the rank id of the model.") + logger.info("Rank id is not provided. Trying to get the rank id of the model.") try: local_device = next(model.parameters()).device except StopIteration: - print_warn_log('There is no parameter in the model. Fail to get rank id.') + logger.warning('There is no parameter in the model. Fail to get rank id.') return 0, False if local_device.type == 'cpu': - print_warn_log("Warning: the debugger is unable to get the rank id. " + logger.warning("Warning: the debugger is unable to get the rank id. " "This may cause the dumpped data to be corrupted in the " "case of distributed training. (You may ignore this if you are using only one card.) 
" "Transfer the model to npu or gpu before register_hook() to avoid this warning.") @@ -603,43 +619,43 @@ def generate_compare_script(dump_path, pkl_file_path, dump_switch_mode): code_temp = ftemp.read() fout.write(code_temp % (pkl_file_path, dump_path, is_api_stack)) except OSError: - print_error_log(f"Failed to open file. Please check file {template_path} or path {pkl_dir}.") + logger.error(f"Failed to open file. Please check file {template_path} or path {pkl_dir}.") - print_info_log(f"Generate compare script successfully which is {compare_script_path}.") + logger.info(f"Generate compare script successfully which is {compare_script_path}.") def check_file_valid(file_path): if os.path.islink(file_path): - print_error_log('The file path {} is a soft link.'.format(file_path)) + logger.error('The file path {} is a soft link.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) if len(os.path.realpath(file_path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(file_path)) > \ Const.FILE_NAME_LENGTH: - print_error_log('The file path length exceeds limit.') + logger.error('The file path length exceeds limit.') raise CompareException(CompareException.INVALID_PATH_ERROR) if not re.match(Const.FILE_PATTERN, os.path.realpath(file_path)): - print_error_log('The file path {} contains special characters.'.format(file_path)) + logger.error('The file path {} contains special characters.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) if os.path.isfile(file_path): file_size = os.path.getsize(file_path) if file_path.endswith(Const.PKL_SUFFIX) and file_size > Const.ONE_GB: - print_error_log('The file {} size is greater than 1GB.'.format(file_path)) + logger.error('The file {} size is greater than 1GB.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) if file_path.endswith(Const.NUMPY_SUFFIX) and file_size > Const.TEN_GB: - print_error_log('The file {} size is greater than 10GB.'.format(file_path)) + logger.error('The file {} size is greater than 10GB.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) def check_path_before_create(path): if len(os.path.realpath(path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(path)) > \ Const.FILE_NAME_LENGTH: - print_error_log('The file path length exceeds limit.') + logger.error('The file path length exceeds limit.') raise CompareException(CompareException.INVALID_PATH_ERROR) if not re.match(Const.FILE_PATTERN, os.path.realpath(path)): - print_error_log('The file path {} contains special characters.'.format(path)) + logger.error('The file path {} contains special characters.'.format(path)) raise CompareException(CompareException.INVALID_PATH_ERROR) @@ -667,14 +683,14 @@ def task_dumppath_get(input_param): npu_json_path = input_param.get("npu_json_path", None) bench_json_path = input_param.get("bench_json_path", None) if not npu_json_path or not bench_json_path: - print_error_log(f"Please check the json path is valid.") + logger.error(f"Please check the json path is valid.") raise CompareException(CompareException.INVALID_PATH_ERROR) with FileOpen(npu_json_path, 'r') as npu_f: npu_json_data = json.load(npu_f) with FileOpen(bench_json_path, 'r') as bench_f: bench_json_data = json.load(bench_f) if npu_json_data['task'] != bench_json_data['task']: - print_error_log(f"Please check the dump task is consistent.") + logger.error(f"Please check the dump task is consistent.") raise CompareException(CompareException.INVALID_TASK_ERROR) if 
npu_json_data['task'] == Const.TENSOR: summary_compare = False @@ -686,7 +702,7 @@ def task_dumppath_get(input_param): else: summary_compare = True else: - print_error_log(f"Compare is not required for overflow_check or free_benchmark.") + logger.error(f"Compare is not required for overflow_check or free_benchmark.") raise CompareException(CompareException.INVALID_TASK_ERROR) input_param['npu_dump_data_dir'] = npu_json_data['dump_data_dir'] input_param['bench_dump_data_dir'] = bench_json_data['dump_data_dir'] @@ -699,6 +715,10 @@ def get_header_index(header_name, summary_compare=False): else: header = CompareConst.COMPARE_RESULT_HEADER[:] if header_name not in header: - print_error_log(f"{header_name} not in data name") + logger.error(f"{header_name} not in data name") raise CompareException(CompareException.INVALID_PARAM_ERROR) return header.index(header_name) + + +def convert_tuple(data): + return data if isinstance(data, tuple) else (data, ) diff --git a/debug/accuracy_tools/atat/core/common_config.py b/debug/accuracy_tools/atat/core/common_config.py index ee045d3c520f9418191daaedec2830e8f9248435..ad440ef6514454b3e2cc51c44e7d19ab30ac6845 100644 --- a/debug/accuracy_tools/atat/core/common_config.py +++ b/debug/accuracy_tools/atat/core/common_config.py @@ -1,7 +1,8 @@ -from .utils import Const +from atat.core.common.utils import Const +from atat.core.common.log import logger +from atat.core.common.exceptions import MsaccException -# 公共配置类 class CommonConfig: def __init__(self, json_config): self.task = json_config.get('task') @@ -17,22 +18,25 @@ class CommonConfig: def _check_config(self): if self.task and self.task not in Const.TASK_LIST: - raise Exception("task is invalid") + logger.error_log_with_exp( + "task is invalid, it should be one of {}".format(Const.TASK_LIST), MsaccException.INVALID_PARAM_ERROR) if self.rank is not None and not isinstance(self.rank, list): - raise Exception("rank is invalid") + logger.error_log_with_exp("rank is invalid, it should be a list", MsaccException.INVALID_PARAM_ERROR) if self.step is not None and not isinstance(self.step, list): - raise Exception("step is invalid") + logger.error_log_with_exp("step is invalid, it should be a list", MsaccException.INVALID_PARAM_ERROR) if self.level and self.level not in Const.LEVEL_LIST: - raise Exception("level is invalid") + logger.error_log_with_exp( + "level is invalid, it should be one of {}".format(Const.LEVEL_LIST), MsaccException.INVALID_PARAM_ERROR) if self.seed is not None and not isinstance(self.seed, int): - raise Exception("seed is invalid") + logger.error_log_with_exp("seed is invalid, it should be an integer", MsaccException.INVALID_PARAM_ERROR) if not isinstance(self.is_deterministic, bool): - raise Exception("is_deterministic is invalid") + logger.error_log_with_exp( + "is_deterministic is invalid, it should be a boolean", MsaccException.INVALID_PARAM_ERROR) if not isinstance(self.enable_dataloader, bool): - raise Exception("enable_dataloader is invalid") + logger.error_log_with_exp( + "enable_dataloader is invalid, it should be a boolean", MsaccException.INVALID_PARAM_ERROR) -# 基础配置类 class BaseConfig: def __init__(self, json_config): self.scope = json_config.get('scope') @@ -46,9 +50,9 @@ class BaseConfig: def check_config(self): if self.scope is not None and not isinstance(self.scope, list): - raise Exception("scope is invalid") + logger.error_log_with_exp("scope is invalid, it should be a list", MsaccException.INVALID_PARAM_ERROR) if self.list is not None and not isinstance(self.list, list): - 
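        # tensor-task dumps contain full data, so detailed compare is used rather than summary compare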
raise Exception("list is invalid") + logger.error_log_with_exp("list is invalid, it should be a list", MsaccException.INVALID_PARAM_ERROR) if self.data_mode is not None and not isinstance(self.data_mode, list): - raise Exception("data_mode is invalid") + logger.error_log_with_exp("data_mode is invalid, it should be a list", MsaccException.INVALID_PARAM_ERROR) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/functional/data_collector.py b/debug/accuracy_tools/atat/core/data_dump/data_collector.py similarity index 55% rename from debug/accuracy_tools/atat/pytorch/functional/data_collector.py rename to debug/accuracy_tools/atat/core/data_dump/data_collector.py index 8e4011a054a85736d005063ca122b8bc0885c9fb..2a0bc34ba8d8f4849d8056437f2640c11e91b340 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/data_collector.py +++ b/debug/accuracy_tools/atat/core/data_dump/data_collector.py @@ -1,20 +1,11 @@ -import os - -import torch -from .data_processor import build_data_processor, DataProcessor -from .json_writer import DataWriter -from .scope import build_scope, ListScope -from ..common.log import print_info_log, print_warn_log -from ..common.utils import Const -from ..module_processer import ModuleProcesser - -try: - import torch_npu -except ImportError: - pass +import os -forward_init_status = False +from atat.core.data_dump.scope import build_scope, ListScope +from atat.core.data_dump.json_writer import DataWriter +from atat.core.common.log import logger +from atat.core.common.utils import Const +from atat.core.data_dump.data_processor.factory import DataProcessorFactory def build_data_collector(config): @@ -22,19 +13,17 @@ def build_data_collector(config): class DataCollector: - overflow_task = "overflow_check" - tensor_task = "tensor" - freebenchmark_task = "free_benchmark" multi_output_apis = ["_sort_", "npu_flash_attention"] - tasks_need_tensor_data = [overflow_task, tensor_task, freebenchmark_task] + tasks_need_tensor_data = [Const.OVERFLOW_CHECK, Const.TENSOR, Const.FREE_BENCHMARK] level_without_construct = ["L1", "L2"] def __init__(self, config): self.config = config self.data_writer = DataWriter() - self.data_processor = build_data_processor(config, self.data_writer) + self.data_processor = DataProcessorFactory.create_processor(self.config, self.data_writer) + self.module_processor = DataProcessorFactory.get_module_processor(self.config.framework) if self.config.framework == Const.PT_FRAMEWORK else None self.module_count = {} - if config.task == DataCollector.freebenchmark_task: + if self.config.task == Const.FREE_BENCHMARK: self.scope = build_scope(ListScope, self.config.scope, self.config.list) else: self.scope = build_scope(None, self.config.scope, self.config.list) @@ -46,7 +35,7 @@ class DataCollector: @property def dump_file_path(self): return self.data_writer.dump_file_path - + @staticmethod def check_scope_and_pid(scope, name, pid): return (not scope or scope.check(name)) and pid == os.getpid() @@ -54,10 +43,10 @@ class DataCollector: @staticmethod def is_inplace(module): return getattr(module, "op_is_inplace", False) - + def if_return_forward_new_output(self): return self.data_processor.if_return_forward_new_output() - + def get_forward_new_output(self): return self.data_processor.get_forward_new_output() @@ -68,7 +57,7 @@ class DataCollector: self.data_writer.write_json() def update_data(self, data_info, msg=''): - if self.config.task == DataProcessor.overflow: + if self.config.task == Const.OVERFLOW_CHECK: if self.data_processor.has_overflow: 
self.data_writer.update_data(data_info) msg += "Overflow detected." @@ -79,12 +68,12 @@ class DataCollector: return msg def pre_forward_data_collect(self, name, module, pid, module_input_output): - backward_name = name.replace("forward", "backward") + backward_name = name.replace(Const.FORWARD, Const.BACKWARD) if self.check_scope_and_pid(self.scope, backward_name, pid): self.data_processor.analyze_pre_forward(backward_name, module, module_input_output) if not self.is_inplace(module): return - print_info_log(f"API {name} is inplace.") + logger.info(f"API {name} is inplace.") if self.check_scope_and_pid(self.scope, name, pid): data_info = self.data_processor.analyze_pre_forward_inplace(name, module_input_output) self.update_data(data_info) @@ -94,14 +83,12 @@ class DataCollector: if not self.check_scope_and_pid(self.scope, name, pid): return - if self.config.level == "L2": - self.acl_dump(module, module_input_output, name) - return - if not self.is_inplace(module): data_info = self.data_processor.analyze_forward(name, module, module_input_output) else: data_info = self.data_processor.analyze_forward_inplace(name, module_input_output) + if self.config.level == "L2": + return self.data_writer.update_stack(self.data_processor.analyze_api_call_stack(name)) self.handle_data(name, data_info) @@ -115,14 +102,14 @@ class DataCollector: def update_construct(self, name): if self.config.level not in DataCollector.level_without_construct: - self.data_writer.update_construct({name: ModuleProcesser.api_parent_node}) - self.data_writer.update_construct(ModuleProcesser.module_node) + self.data_writer.update_construct({name: self.module_processor.api_parent_node}) + self.data_writer.update_construct(self.module_processor.module_node) def handle_data(self, name, data_info): msg = f"msProbe is collecting data on {name}. " if data_info: msg = self.update_data(data_info, msg) - print_info_log(msg) + logger.info(msg) self.data_writer.flush_data_when_buffer_is_full() def module_count_func(self, name, name_template): @@ -148,65 +135,6 @@ class DataCollector: def update_dump_paths(self, *args): self.data_writer.update_dump_paths(*args) self.data_writer.initialize_json_file(task=self.config.task, level=self.config.level) - + def update_iter(self, current_iter): self.data_processor.update_iter(current_iter) - - def acl_dump(self, module, module_input_output, module_name): - if self.config.is_forward_acl_dump: - self.forward_acl_dump(module, module_input_output, module_name) - else: - self.dump_mode_backward_acl_dump(module, module_input_output, module_name) - - def op_need_trigger(self, module_name): - if 'Tensor___getitem___' in module_name: - return True - return False - - def forward_acl_dump(self, module, module_input_output, module_name): - global forward_init_status - if not forward_init_status: - forward_init_status = True - torch_npu.npu.synchronize() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(self.config.acl_config) - torch_npu.npu.synchronize() - if self.op_need_trigger(module_name): - module.forward(*module_input_output.args, **module_input_output.kwargs).cpu() - else: - module.forward(*module_input_output.args, **module_input_output.kwargs) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - torch_npu.npu.synchronize() - forward_init_status = False - print_info_log("Dump %s op file." 
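                # for the overflow_check task, data is recorded only when an overflow is detected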
% module_name) - - def acl_backward_dump_status(self, output, grad, module_name): - if isinstance(output, torch.Tensor): - output.backward(grad, retain_graph=True) - return True - - for api_name in DataCollector.multi_output_apis: - if api_name in module_name: - output[0].backward(grad, retain_graph=True) - return True - return False - - def dump_mode_backward_acl_dump(self, module, module_input_output, module_name): - global forward_init_status - grad_path = self.config.backward_input.get(module_name) - if not forward_init_status: - forward_init_status = True - output = module.forward(*module_input_output.args, **module_input_output.kwargs) - grad = torch.load(grad_path).to("npu").requires_grad_() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(self.config.acl_config) - torch_npu.npu.synchronize() - if not self.acl_backward_dump_status(output, grad, module_name): - print_warn_log("The output of {} is not of tensor type and cannot be automatically derived. " - "you can manually construct a single API backward case for ACL dump.".format( - module_name)) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - forward_init_status = False - print_info_log("Dump %s op file." % module_name) diff --git a/debug/accuracy_tools/atat/core/data_dump/data_processor/base.py b/debug/accuracy_tools/atat/core/data_dump/data_processor/base.py new file mode 100644 index 0000000000000000000000000000000000000000..1ee3314b368f3c3382bcb1a221ca846b1beb90f1 --- /dev/null +++ b/debug/accuracy_tools/atat/core/data_dump/data_processor/base.py @@ -0,0 +1,242 @@ +import os +import inspect +from dataclasses import dataclass +from typing import Tuple, Dict, Optional, Any +import numpy as np +from atat.core.common.log import logger +from atat.core.common.utils import Const, convert_tuple + + +@dataclass +class ModuleForwardInputsOutputs: + args: Optional[Tuple] + kwargs: Optional[Dict] + output: Any + + @property + def args_tuple(self): + return convert_tuple(self.args) + + @property + def output_tuple(self): + return convert_tuple(self.output) + + def concat_args_and_kwargs(self): + args = self.args + tuple(self.kwargs.values()) + return args + + +@dataclass +class ModuleBackwardInputsOutputs: + grad_output: Optional[Tuple] + grad_input: Optional[Tuple] + + @property + def grad_input_tuple(self): + return convert_tuple(self.grad_input) + + @property + def grad_output_tuple(self): + return convert_tuple(self.grad_output) + + +class TensorStatInfo: + def __init__(self, max_val=None, min_val=None, mean_val=None, norm_val=None): + self.max = max_val + self.min = min_val + self.mean = mean_val + self.norm = norm_val + + +class BaseDataProcessor: + _recursive_key_stack = [] + special_type = (np.integer, np.floating, np.bool_, np.complexfloating, np.str_, np.byte, np.unicode_, + bool, int, float, str, slice) + + def __init__(self, config, data_writer): + self.data_writer = data_writer + self.config = config + self.api_info_struct = {} + self.stack_info_struct = {} + self.current_api_or_module_name = None + self.api_data_category = None + self.has_overflow = False + self.current_iter = 0 + self._return_forward_new_output = False + self._forward_new_output = None + + @property + def data_path(self): + return self.data_writer.dump_tensor_data_dir + + @staticmethod + def analyze_api_call_stack(name): + stack_str = [] + for (_, path, line, func, code, _) in inspect.stack()[5:]: + if not code: + continue + stack_line = " ".join([ + "File", ", ".join([ + path, + " ".join(["line", str(line)]), + " ".join(["in", func]), + " 
".join(["\n", code[0].strip()]) + ]) + ]) + stack_str.append(stack_line) + stack_info_struct = {name: stack_str} + return stack_info_struct + + @staticmethod + def _convert_numpy_to_builtin(arg): + type_mapping = { + np.integer: int, + np.floating: float, + np.bool_: bool, + np.complexfloating: complex, + np.str_: str, + np.byte: bytes, + np.unicode_: str + } + for numpy_type, builtin_type in type_mapping.items(): + if isinstance(arg, numpy_type): + return builtin_type(arg), type(arg).__name__ + return arg, '' + + @staticmethod + def _analyze_numpy(value, numpy_type): + return {"type": numpy_type, "value": value} + + @staticmethod + def _analyze_builtin(arg): + single_arg = {} + if isinstance(arg, slice): + single_arg.update({"type": "slice"}) + single_arg.update({"value": [arg.start, arg.stop, arg.step]}) + else: + single_arg.update({"type": type(arg).__name__}) + single_arg.update({"value": arg}) + return single_arg + + @classmethod + def get_special_types(cls): + return cls.special_type + + @classmethod + def recursive_apply_transform(cls, args, transform): + if isinstance(args, cls.get_special_types()): + arg_transform = transform(args, cls._recursive_key_stack) + return arg_transform + elif isinstance(args, (list, tuple)): + result_list = [] + for i, arg in enumerate(args): + cls._recursive_key_stack.append(str(i)) + result_list.append(cls.recursive_apply_transform(arg, transform)) + cls._recursive_key_stack.pop() + return type(args)(result_list) + elif isinstance(args, dict): + resutl_dict = {} + for k, arg in args.items(): + cls._recursive_key_stack.append(str(k)) + resutl_dict[k] = cls.recursive_apply_transform(arg, transform) + cls._recursive_key_stack.pop() + return resutl_dict + else: + logger.warning(f"Data type {type(args)} is not supported.") + return None + + def if_return_forward_new_output(self): + return self._return_forward_new_output + + def get_forward_new_output(self): + self._return_forward_new_output = False + return self._forward_new_output + + def update_iter(self, current_iter): + self.current_iter = current_iter + + def visit_and_clear_overflow_status(self, api_or_module_name): + if self.current_api_or_module_name != api_or_module_name: + self.current_api_or_module_name = api_or_module_name + self.has_overflow = False + + def is_dump_for_data_mode(self, forward_backward, input_output): + """ + Compare the parameters with data_mode to determine whether to dump. + + Args: + forward_backward(str): The forward or backward mode to check. + input_output(str): The input or output mode to check. + + Return: + bool: True if the parameters are in data_mode or data_mode is all, False otherwise. 
+ """ + return (Const.ALL in self.config.data_mode or + forward_backward in self.config.data_mode or + input_output in self.config.data_mode) + + def analyze_pre_forward(self, name, module,module_input_output: ModuleForwardInputsOutputs): + pass + + def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + api_info_struct = {} + if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): # check whether data_mode contains forward or input + api_info_struct[name] = {} + self.api_data_category = Const.INPUT + args_info_list = self.analyze_element(module_input_output.args_tuple) + api_info_struct[name][Const.INPUT_ARGS] = args_info_list + self.api_data_category = Const.KWARGS + kwargs_info_list = self.analyze_element(module_input_output.kwargs) + api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list + + if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): # check whether data_mode contains forward or output + api_info_struct[name] = api_info_struct.get(name, {}) + self.api_data_category = Const.OUTPUT + output_info_list = self.analyze_element(module_input_output.output_tuple) + api_info_struct[name][Const.OUTPUT] = output_info_list + return api_info_struct + + def analyze_pre_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): + api_info_struct = {} + if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): + api_info_struct[name] = {} + self.api_data_category = Const.INPUT + args_info_list = self.analyze_element(module_input_output.args_tuple) + api_info_struct[name][Const.INPUT_ARGS] = args_info_list + self.api_data_category = Const.KWARGS + kwargs_info_list = self.analyze_element(module_input_output.kwargs) + api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list + return api_info_struct + + def analyze_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): + concat_args = module_input_output.concat_args_and_kwargs() + api_info_struct = {} + if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): + api_info_struct[name] = {} + self.api_data_category = Const.OUTPUT + output_info_list = self.analyze_element(concat_args) + api_info_struct[name][Const.OUTPUT] = output_info_list + return api_info_struct + + def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): + api_info_struct = {} + if self.is_dump_for_data_mode(Const.BACKWARD, Const.OUTPUT): + api_info_struct[name] = {} + self.api_data_category = Const.OUTPUT + input_info_list = self.analyze_element(module_input_output.grad_input_tuple) + api_info_struct[name][Const.GRAD_INPUT] = input_info_list + + if self.is_dump_for_data_mode(Const.BACKWARD, Const.INPUT): + api_info_struct[name] = api_info_struct.get(name, {}) + self.api_data_category = Const.INPUT + output_info_list = self.analyze_element(module_input_output.grad_output_tuple) + api_info_struct[name][Const.GRAD_OUTPUT] = output_info_list + + return api_info_struct + + def get_save_file_path(self, suffix): + file_format = "pt" if self.config.framework == Const.PT_FRAMEWORK else "npy" + dump_data_name = (self.current_api_or_module_name + Const.SEP + self.api_data_category + Const.SEP + + suffix + Const.SEP + file_format) + file_path = os.path.join(self.data_writer.dump_tensor_data_dir, dump_data_name) + return dump_data_name, file_path \ No newline at end of file diff --git a/debug/accuracy_tools/atat/core/data_dump/data_processor/factory.py b/debug/accuracy_tools/atat/core/data_dump/data_processor/factory.py new file mode 100644 index 
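+        # e.g. data_mode=["all"] dumps everything, while data_mode=["forward"] keeps
+        # forward-pass data only and backward hooks produce an empty result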
0000000000000000000000000000000000000000..00f2f72e7a8a04fa0fad12cd2935958d50171b4a --- /dev/null +++ b/debug/accuracy_tools/atat/core/data_dump/data_processor/factory.py @@ -0,0 +1,61 @@ +from atat.core.common.utils import Const + + +class DataProcessorFactory: + _data_processor = {} + _module_processor = {} + + @classmethod + def register_processor(cls, framework, task, processor_class): + key = (framework, task) + cls._data_processor[key] = processor_class + + @classmethod + def register_module_processor(cls, framework, processor_class): + cls._module_processor[framework] = processor_class + + @classmethod + def get_module_processor(cls, framework): + processor_class = cls._module_processor.get(framework) + if not processor_class: + raise ValueError(f"ModuleProcesser not found for framework: {framework}") + return processor_class + + @classmethod + def create_processor(cls, config, data_writer): + cls.register_processors(config.framework) + task = Const.KERNEL_DUMP if config.level == "L2" else config.task + key = (config.framework, task) + processor_class = cls._data_processor.get(key) + if not processor_class: + raise ValueError(f"Processor not found for framework: {config.framework}, task: {config.task}") + return processor_class(config, data_writer) + + @classmethod + def register_processors(cls, framework): + if framework == Const.PT_FRAMEWORK: + from .pytorch_processor import ( + StatisticsDataProcessor as PytorchStatisticsDataProcessor, + TensorDataProcessor as PytorchTensorDataProcessor, + OverflowCheckDataProcessor as PytorchOverflowCheckDataProcessor, + FreeBenchmarkDataProcessor as PytorchFreeBenchmarkDataProcessor, + KernelDumpDataProcessor as PytorchKernelDumpDataProcessor + ) + from ....pytorch.module_processer import ModuleProcesser + cls.register_processor(Const.PT_FRAMEWORK, Const.STATISTICS, PytorchStatisticsDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.TENSOR, PytorchTensorDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.OVERFLOW_CHECK, PytorchOverflowCheckDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.FREE_BENCHMARK, PytorchFreeBenchmarkDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.KERNEL_DUMP, PytorchKernelDumpDataProcessor) + cls.register_module_processor(Const.PT_FRAMEWORK, ModuleProcesser) + elif framework == Const.MS_FRAMEWORK: + from .mindspore_processor import ( + StatisticsDataProcessor as MindsporeStatisticsDataProcessor, + TensorDataProcessor as MindsporeTensorDataProcessor, + OverflowCheckDataProcessor as MindsporeOverflowCheckDataProcessor, + FreeBenchmarkDataProcessor as MindsporeFreeBenchmarkDataProcessor + ) + cls.register_processor(Const.MS_FRAMEWORK, Const.STATISTICS, MindsporeStatisticsDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.TENSOR, MindsporeTensorDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.OVERFLOW_CHECK, MindsporeOverflowCheckDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.FREE_BENCHMARK, MindsporeFreeBenchmarkDataProcessor) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/core/data_dump/data_processor/pytorch_processor.py b/debug/accuracy_tools/atat/core/data_dump/data_processor/pytorch_processor.py new file mode 100644 index 0000000000000000000000000000000000000000..d726bc8af2a11c9cc28b955822d9e166993e7f42 --- /dev/null +++ b/debug/accuracy_tools/atat/core/data_dump/data_processor/pytorch_processor.py @@ -0,0 +1,332 @@ +import os +import zlib +from typing import List +from dataclasses 
import asdict +import torch +import numpy as np +from atat.pytorch.free_benchmark import FreeBenchmarkCheck, UnequalRow +from atat.core.common.utils import Const +from atat.core.common.file_check import path_len_exceeds_limit, change_mode, FileCheckConst +from atat.core.common.log import logger +from atat.core.common.exceptions import MsaccException +from atat.core.data_dump.data_processor.base import BaseDataProcessor, ModuleBackwardInputsOutputs, ModuleForwardInputsOutputs, TensorStatInfo + +try: + import torch_npu +except ImportError: + pass + + +class PytorchDataProcessor(BaseDataProcessor): + pytorch_special_type = (torch.device, torch.dtype, torch.Size, torch.Tensor) + + def __init__(self, config, data_writer): + super().__init__(config, data_writer) + self.torch_object_key = { + "device": self.analyze_device_in_kwargs, + "dtype": self.analyze_dtype_in_kwargs + } + + @staticmethod + def get_md5_for_tensor(x): + if x.dtype == torch.bfloat16: + x = x.float() + tensor_bytes = x.cpu().detach().numpy().tobytes() + crc32_hash = zlib.crc32(tensor_bytes) + return f"{crc32_hash:08x}" + + @staticmethod + def analyze_device_in_kwargs(element): + single_arg = {} + single_arg.update({'type': "torch.device"}) + if not isinstance(element, str): + if hasattr(element, "index"): + device_value = element.type + ":" + str(element.index) + else: + device_value = element.type + single_arg.update({"value": device_value}) + else: + single_arg.update({"value": element}) + return single_arg + + @staticmethod + def analyze_dtype_in_kwargs(element): + return {"type": "torch.dtype", "value": str(element)} + + @staticmethod + def get_stat_info(data): + tensor_stat = TensorStatInfo() + if data.is_meta: + return tensor_stat + data_clone = data.detach() + if data_clone.numel() == 0: + return tensor_stat + elif data_clone.dtype == torch.bool: + tensor_stat.max = True in data_clone + tensor_stat.min = False not in data_clone + elif not data_clone.shape: + tensor_stat.max = tensor_stat.min = tensor_stat.mean = tensor_stat.norm = data_clone.item() + else: + if not data_clone.is_floating_point() or data_clone.dtype == torch.float64: + data_clone = data_clone.float() + tensor_stat.max = torch._C._VariableFunctionsClass.max(data_clone).item() + tensor_stat.min = torch._C._VariableFunctionsClass.min(data_clone).item() + tensor_stat.mean = torch._C._VariableFunctionsClass.mean(data_clone).item() + tensor_stat.norm = torch._C._VariableFunctionsClass.norm(data_clone).item() + return tensor_stat + + @classmethod + def get_special_types(cls): + return super().get_special_types() + cls.pytorch_special_type + + def analyze_single_element(self, element, suffix_stack): + if suffix_stack and suffix_stack[-1] in self.torch_object_key: + return self.torch_object_key[suffix_stack[-1]](element) + if isinstance(element, torch.Size): + return self._analyze_torch_size(element) + converted_numpy, numpy_type = self._convert_numpy_to_builtin(element) + if converted_numpy is not element: + return self._analyze_numpy(converted_numpy, numpy_type) + if isinstance(element, torch.Tensor): + return self._analyze_tensor(element, Const.SEP.join(suffix_stack)) + if isinstance(element, (bool, int, float, str, slice)): + return self._analyze_builtin(element) + return None + + def analyze_element(self, element): + return self.recursive_apply_transform(element, self.analyze_single_element) + + def _analyze_torch_size(arg): + return {"type": "torch.Size", "value": list(arg)} + + def _analyze_tensor(self, tensor, suffix): + tensor_stat = 
self.get_stat_info(tensor) + tensor_json = {} + tensor_json.update({'type': 'torch.Tensor'}) + tensor_json.update({'dtype': str(tensor.dtype)}) + tensor_json.update({"shape": tensor.shape}) + tensor_json.update({"Max": tensor_stat.max}) + tensor_json.update({"Min": tensor_stat.min}) + tensor_json.update({"Mean": tensor_stat.mean}) + tensor_json.update({"Norm": tensor_stat.norm}) + tensor_json.update({"requires_grad": tensor.requires_grad}) + if self.config.summary_mode == "md5": + tensor_md5 = self.get_md5_for_tensor(tensor) + tensor_json.update({"md5": tensor_md5}) + return tensor_json + + +class StatisticsDataProcessor(PytorchDataProcessor): + pass + + +class TensorDataProcessor(PytorchDataProcessor): + def _analyze_tensor(self, tensor, suffix): + dump_data_name, file_path = self.get_save_file_path(suffix) + if not path_len_exceeds_limit(file_path): + torch.save(tensor, file_path) + change_mode(file_path, FileCheckConst.DATA_FILE_AUTHORITY) + else: + logger.warning(f'The file path {file_path} length exceeds limit.') + single_arg = super()._analyze_tensor(tensor, suffix) + single_arg.update({"data_name": dump_data_name}) + return single_arg + + +class OverflowCheckDataProcessor(PytorchDataProcessor): + __slots__ = ["cached_tensors_and_file_paths"] + + def __init__(self, config, data_writer): + super().__init__(config, data_writer) + self.cached_tensors_and_file_paths = {} + self.real_overflow_dump_times = 0 + self.overflow_nums = config.overflow_num + self.bits_for_overflow = 8 + + @staticmethod + def overflow_debug_mode_enable(): + overflow_mode = os.getenv(Const.OVERFLOW_DEBUG_MODE_ENABLE, Const.ENV_DISABLE) + return overflow_mode == Const.ENV_ENABLE + + @staticmethod + def handle_tensor_extremum_nan_inf(data_clone, operator): + data_nan = torch._C._VariableFunctionsClass.isnan(data_clone) + if int(torch._C._VariableFunctionsClass.sum(data_nan)) == data_clone.numel(): + return float('nan') + finite_mask = torch._C._VariableFunctionsClass.isfinite(data_clone) + if int(torch._C._VariableFunctionsClass.sum(finite_mask)) > 0: + finite_values = data_clone[finite_mask] + return torch._C._VariableFunctionsClass.max(finite_values).item() if operator == 'max' else \ + torch._C._VariableFunctionsClass.min(finite_values).item() + else: + data_no_nan = data_clone[~data_nan] + return torch._C._VariableFunctionsClass.max(data_no_nan).item() if operator == 'max' else \ + torch._C._VariableFunctionsClass.min(data_no_nan).item() + + def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + self.has_overflow = False + api_info_struct = super().analyze_forward(name, module, module_input_output) + self.maybe_save_overflow_data_and_check_overflow_times() + return api_info_struct if self.has_overflow else None + + def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): + self.has_overflow = False + api_info_struct = super().analyze_backward(name, module, module_input_output) + self.maybe_save_overflow_data_and_check_overflow_times() + return api_info_struct if self.has_overflow else None + + def maybe_save_overflow_data_and_check_overflow_times(self): + if self.has_overflow: + for file_path, tensor in self.cached_tensors_and_file_paths.items(): + torch.save(tensor, file_path) + change_mode(file_path, FileCheckConst.DATA_FILE_AUTHORITY) + self.inc_and_check_overflow_times() + self.cached_tensors_and_file_paths = {} + + def inc_and_check_overflow_times(self): + self.real_overflow_dump_times += 1 + if self.overflow_nums == -1: + return + if 
self.real_overflow_dump_times >= self.overflow_nums: + raise MsaccException(MsaccException.OVERFLOW_NUMS_ERROR, str(self.real_overflow_dump_times)) + + def clear_overflow_npu(self): + if self.overflow_debug_mode_enable(): + float_status = torch.zeros(self.bits_for_overflow).npu() + torch_npu.npu_clear_float_status(float_status, Const.OVERFLOW_DEBUG_MODE) + else: + torch_npu._C._clear_overflow_npu() + + def _analyze_maybe_overflow_tensor(self, tensor_json, tensor): + data_clone = tensor.detach() + if hasattr(torch_npu._C, '_npu_is_support_inf_nan') and torch_npu._C._npu_is_support_inf_nan(): + if tensor_json['Max'] is None: + return + if np.isinf(tensor_json['Max']) or np.isnan(tensor_json['Max']): + tensor_json['Max_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "max") + self.has_overflow = True + if np.isinf(tensor_json['Min']) or np.isnan(tensor_json['Min']): + tensor_json['Min_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "min") + self.has_overflow = True + else: + self.has_overflow = self.check_overflow_npu() + if self.has_overflow: + self.clear_overflow_npu() + + def _analyze_tensor(self, tensor, suffix): + dump_data_name, file_path = self.get_save_file_path(suffix) + if not path_len_exceeds_limit(file_path): + self.cached_tensors_and_file_paths.update({file_path: tensor}) + else: + logger.warning(f'The file path {file_path} length exceeds limit.') + single_arg = super()._analyze_tensor(tensor, suffix) + self._analyze_maybe_overflow_tensor(single_arg, tensor) + single_arg.update({"data_name": dump_data_name}) + return single_arg + + +class FreeBenchmarkDataProcessor(PytorchDataProcessor): + + def __init__(self, config, data_writer): + super().__init__(config, data_writer) + self.checker = FreeBenchmarkCheck(config=config) + self._return_forward_new_output = None + self._forward_new_output = None + + def update_iter(self, current_iter): + super().update_iter(current_iter) + self.checker.update_iter(current_iter) + + def update_unequal_rows(self, unequal_rows: List[UnequalRow]): + if not unequal_rows: + return + for row in unequal_rows: + data_dict = asdict(row) + self.data_writer.write_data_to_csv( + data_dict.values(), + data_dict.keys(), + self.data_writer.free_benchmark_file_path + ) + return + + def analyze_pre_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + self.checker.pre_forward(name, module, self, module_input_output.args, module_input_output.kwargs) + + def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + new_output, unequal_rows = self.checker.forward( + name, + module, + module_input_output.args, + module_input_output.kwargs, + module_input_output.output, + ) + self.update_unequal_rows(unequal_rows) + if self.checker.if_fix(): + self._return_forward_new_output = True + self._forward_new_output = new_output + + def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): + self.checker.backward(name, module, module_input_output.grad_output) + + +class KernelDumpDataProcessor(PytorchDataProcessor): + forward_init_status = False + multi_output_apis = ["_sort_", "npu_flash_attention"] + + def __init__(self, config, data_writer): + super().__init__(config, data_writer) + + def analyze_forward(self, name, module, module_input_output): + if self.config.is_forward_acl_dump: + self.forward_acl_dump(name, module, module_input_output) + else: + self.dump_mode_backward_acl_dump(name, module, module_input_output) + + def forward_acl_dump(self, 
name, module, module_input_output): + if not KernelDumpDataProcessor.forward_init_status: + KernelDumpDataProcessor.forward_init_status = True + torch_npu.npu.synchronize() + torch_npu.npu.init_dump() + torch_npu.npu.set_dump(self.config.acl_config) + torch_npu.npu.synchronize() + if self.op_need_trigger(name): + module.forward(*module_input_output.args, **module_input_output.kwargs).cpu() + else: + module.forward(*module_input_output.args, **module_input_output.kwargs) + torch_npu.npu.synchronize() + torch_npu.npu.finalize_dump() + torch_npu.npu.synchronize() + KernelDumpDataProcessor.forward_init_status = False + logger.info("Dump %s op file." % name) + + def acl_backward_dump_status(self, output, grad, module_name): + if isinstance(output, torch.Tensor): + output.backward(grad, retain_graph=True) + return True + + for api_name in KernelDumpDataProcessor.multi_output_apis: + if api_name in module_name: + output[0].backward(grad, retain_graph=True) + return True + return False + + def dump_mode_backward_acl_dump(self, name, module, module_input_output): + grad_path = self.config.backward_input.get(name) + if not KernelDumpDataProcessor.forward_init_status: + KernelDumpDataProcessor.forward_init_status = True + output = module.forward(*module_input_output.args, **module_input_output.kwargs) + grad = torch.load(grad_path).to("npu").requires_grad_() + torch_npu.npu.init_dump() + torch_npu.npu.set_dump(self.config.acl_config) + torch_npu.npu.synchronize() + if not self.acl_backward_dump_status(output, grad, name): + logger.warning("The output of {} is not of tensor type and cannot be automatically derived. " + "you can manually construct a single API backward case for ACL dump.".format( + name)) + torch_npu.npu.synchronize() + torch_npu.npu.finalize_dump() + KernelDumpDataProcessor.forward_init_status = False + logger.info("Dump %s op file." % name) + + def op_need_trigger(self, module_name): + return 'Tensor.__getitem__.' 
in module_name \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/functional/json_writer.py b/debug/accuracy_tools/atat/core/data_dump/json_writer.py similarity index 86% rename from debug/accuracy_tools/atat/pytorch/functional/json_writer.py rename to debug/accuracy_tools/atat/core/data_dump/json_writer.py index 216d24882e44b98cc606e4cab5e417602015118c..dd0d2f9c7b539704ecd02a7f8aa2cc006c50b81b 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/json_writer.py +++ b/debug/accuracy_tools/atat/core/data_dump/json_writer.py @@ -1,14 +1,15 @@ +import os import csv +import fcntl import json -import os from pathlib import Path -from ..common.file_check import FileCheckConst, change_mode -from ..common.log import print_info_log_rank_0 -from ..common.utils import Const +from atat.core.common.file_check import FileCheckConst, change_mode +from atat.core.common.log import logger +from atat.core.common.utils import Const -class DataWriter: # TODO: UT +class DataWriter: def __init__(self, init_json=None) -> None: self.dump_count = 0 @@ -25,22 +26,22 @@ class DataWriter: # TODO: UT @staticmethod def write_data_to_csv(result: list, result_header: tuple, file_path: str): - if len(result) == 0: + if not result: return is_exists = os.path.exists(file_path) append = "a+" if is_exists else "w+" with os.fdopen( - os.open(file_path, Const.WRITE_FLAGS, FileCheckConst.DATA_FILE_AUTHORITY), append, newline="" + os.open(file_path, Const.WRITE_FLAGS, FileCheckConst.DATA_FILE_AUTHORITY), append, newline="" ) as csv_file: spawn_writer = csv.writer(csv_file) if not is_exists: spawn_writer.writerow(result_header) - spawn_writer.writerows([result, ]) + spawn_writer.writerows([result,]) def initialize_json_file(self, **kwargs): kwargs.update({"dump_data_dir": self.dump_tensor_data_dir, Const.DATA: {}}) with os.fdopen( - os.open(self.dump_file_path, Const.OVERWRITE_FLAGS, FileCheckConst.DATA_FILE_AUTHORITY), 'w' + os.open(self.dump_file_path, Const.OVERWRITE_FLAGS, FileCheckConst.DATA_FILE_AUTHORITY), 'w' ) as f: json.dump(kwargs, f) @@ -54,7 +55,7 @@ class DataWriter: # TODO: UT Path(self.construct_file_path).touch() change_mode(self.construct_file_path, FileCheckConst.DATA_FILE_AUTHORITY) - def update_dump_paths(self, dump_file_path, stack_file_path, construct_file_path, dump_data_dir, + def update_dump_paths(self, dump_file_path, stack_file_path, construct_file_path, dump_data_dir, free_benchmark_file_path): self.dump_file_path = dump_file_path self.stack_file_path = stack_file_path @@ -80,8 +81,7 @@ class DataWriter: # TODO: UT self.cache_construct.update(new_data) def write_data_json(self, file_path): - import fcntl - print_info_log_rank_0(f"dump.json is at {os.path.dirname(os.path.dirname(file_path))}. ") + logger.info(f"dump.json is at {os.path.dirname(os.path.dirname(file_path))}. 
") if Path(file_path).exists() and os.path.getsize(file_path) > 0: with open(file_path, "r+") as f: fcntl.flock(f, fcntl.LOCK_EX) @@ -99,14 +99,12 @@ class DataWriter: # TODO: UT self.cache_data[Const.DATA].clear() def write_stack_info_json(self, file_path): - import fcntl with open(file_path, 'w+') as f: fcntl.flock(f, fcntl.LOCK_EX) json.dump(self.cache_stack, f, indent=1) fcntl.flock(f, fcntl.LOCK_UN) def write_construct_info_json(self, file_path): - import fcntl with open(file_path, 'w+') as f: fcntl.flock(f, fcntl.LOCK_EX) json.dump(self.cache_construct, f, indent=1) diff --git a/debug/accuracy_tools/atat/pytorch/functional/scope.py b/debug/accuracy_tools/atat/core/data_dump/scope.py similarity index 98% rename from debug/accuracy_tools/atat/pytorch/functional/scope.py rename to debug/accuracy_tools/atat/core/data_dump/scope.py index 735c6d9c180d197c640ee2027e36e56e23be69c2..dc473d7e1460e977d8f3e08d690ad554415239d5 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/scope.py +++ b/debug/accuracy_tools/atat/core/data_dump/scope.py @@ -1,6 +1,6 @@ from abc import ABC, abstractmethod -from ..common.exceptions import ScopeException -from ..common.utils import Const +from atat.core.common.exceptions import ScopeException +from atat.core.common.utils import Const def build_scope(scope_class, scope=None, api_list=None): @@ -10,7 +10,6 @@ def build_scope(scope_class, scope=None, api_list=None): scope = [] if api_list is None: api_list = [] - if scope_class: return scope_class(scope, api_list) return build_range_scope_according_to_scope_name(scope, api_list) @@ -73,6 +72,7 @@ class BaseScope(ABC): return True return False + class ListScope(BaseScope): @staticmethod def rectify_args(scope, api_list): @@ -94,6 +94,7 @@ class RangeScope(BaseScope, ABC): self.in_scope = False self.is_valid = self.check_scope_is_valid() + @staticmethod def rectify_args(scope, api_list): scope, api_list = super(RangeScope, RangeScope).rectify_args(scope, api_list) diff --git a/debug/accuracy_tools/atat/core/file_check_util.py b/debug/accuracy_tools/atat/core/file_check_util.py deleted file mode 100644 index b10cdd61049ad9a87e91d910e89b121557a58a7f..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/core/file_check_util.py +++ /dev/null @@ -1,319 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" -import os -import re - -from .log import print_warn_log, print_error_log - - -class FileCheckConst: - """ - Class for file check const - """ - READ_ABLE = "read" - WRITE_ABLE = "write" - READ_WRITE_ABLE = "read and write" - DIRECTORY_LENGTH = 4096 - FILE_NAME_LENGTH = 255 - FILE_VALID_PATTERN = r"^[a-zA-Z0-9_.:/-]+$" - PKL_SUFFIX = ".pkl" - NUMPY_SUFFIX = ".npy" - JSON_SUFFIX = ".json" - PT_SUFFIX = ".pt" - CSV_SUFFIX = ".csv" - YAML_SUFFIX = ".yaml" - MAX_PKL_SIZE = 1 * 1024 * 1024 * 1024 - MAX_NUMPY_SIZE = 10 * 1024 * 1024 * 1024 - MAX_JSON_SIZE = 1 * 1024 * 1024 * 1024 - MAX_PT_SIZE = 10 * 1024 * 1024 * 1024 - MAX_CSV_SIZE = 1 * 1024 * 1024 * 1024 - MAX_YAML_SIZE = 10 * 1024 * 1024 - DIR = "dir" - FILE = "file" - DATA_DIR_AUTHORITY = 0o750 - DATA_FILE_AUTHORITY = 0o640 - FILE_SIZE_DICT = { - PKL_SUFFIX: MAX_PKL_SIZE, - NUMPY_SUFFIX: MAX_NUMPY_SIZE, - JSON_SUFFIX: MAX_JSON_SIZE, - PT_SUFFIX: MAX_PT_SIZE, - CSV_SUFFIX: MAX_CSV_SIZE, - YAML_SUFFIX: MAX_YAML_SIZE - } - - -class FileCheckException(Exception): - """ - Class for File Check Exception - """ - NONE_ERROR = 0 - INVALID_PATH_ERROR = 1 - INVALID_FILE_TYPE_ERROR = 2 - INVALID_PARAM_ERROR = 3 - INVALID_PERMISSION_ERROR = 3 - - def __init__(self, code, error_info: str = ""): - super(FileCheckException, self).__init__() - self.code = code - self.error_info = error_info - - def __str__(self): - return self.error_info - - -class FileChecker: - """ - The class for check file. - - Attributes: - file_path: The file or dictionary path to be verified. - path_type: file or dictionary - ability(str): FileCheckConst.WRITE_ABLE or FileCheckConst.READ_ABLE to set file has writability or readability - file_type(str): The correct file type for file - """ - def __init__(self, file_path, path_type, ability=None, file_type=None, is_script=True): - self.file_path = file_path - self.path_type = self._check_path_type(path_type) - self.ability = ability - self.file_type = file_type - self.is_script = is_script - - @staticmethod - def _check_path_type(path_type): - if path_type not in [FileCheckConst.DIR, FileCheckConst.FILE]: - print_error_log(f'The path_type must be {FileCheckConst.DIR} or {FileCheckConst.FILE}.') - raise FileCheckException(FileCheckException.INVALID_PARAM_ERROR) - return path_type - - def common_check(self): - """ - 功能:用户校验基本文件权限:软连接、文件长度、是否存在、读写权限、文件属组、文件特殊字符 - 注意:文件后缀的合法性,非通用操作,可使用其他独立接口实现 - """ - check_path_exists(self.file_path) - check_link(self.file_path) - self.file_path = os.path.realpath(self.file_path) - check_path_length(self.file_path) - check_path_type(self.file_path, self.path_type) - self.check_path_ability() - if self.is_script: - check_path_owner_consistent(self.file_path) - check_path_pattern_vaild(self.file_path) - check_common_file_size(self.file_path) - check_file_suffix(self.file_path, self.file_type) - return self.file_path - - def check_path_ability(self): - if self.ability == FileCheckConst.WRITE_ABLE: - check_path_writability(self.file_path) - if self.ability == FileCheckConst.READ_ABLE: - check_path_readability(self.file_path) - if self.ability == FileCheckConst.READ_WRITE_ABLE: - check_path_readability(self.file_path) - check_path_writability(self.file_path) - - -class FileOpen: - """ - The class for open file by a safe way. - - Attributes: - file_path: The file or dictionary path to be opened. 
- mode(str): The file open mode - """ - SUPPORT_READ_MODE = ["r", "rb"] - SUPPORT_WRITE_MODE = ["w", "wb", "a", "ab"] - SUPPORT_READ_WRITE_MODE = ["r+", "rb+", "w+", "wb+", "a+", "ab+"] - - def __init__(self, file_path, mode, encoding='utf-8'): - self.file_path = file_path - self.mode = mode - self.encoding = encoding - self._handle = None - - def __enter__(self): - self.check_file_path() - binary_mode = "b" - if binary_mode not in self.mode: - self._handle = open(self.file_path, self.mode, encoding=self.encoding) - else: - self._handle = open(self.file_path, self.mode) - return self._handle - - def __exit__(self, exc_type, exc_val, exc_tb): - if self._handle: - self._handle.close() - - def check_file_path(self): - support_mode = self.SUPPORT_READ_MODE + self.SUPPORT_WRITE_MODE + self.SUPPORT_READ_WRITE_MODE - if self.mode not in support_mode: - print_error_log("File open not support %s mode" % self.mode) - raise FileCheckException(FileCheckException.INVALID_PARAM_ERROR) - check_link(self.file_path) - self.file_path = os.path.realpath(self.file_path) - check_path_length(self.file_path) - self.check_ability_and_owner() - check_path_pattern_vaild(self.file_path) - if os.path.exists(self.file_path): - check_common_file_size(self.file_path) - - def check_ability_and_owner(self): - if self.mode in self.SUPPORT_READ_MODE: - check_path_exists(self.file_path) - check_path_readability(self.file_path) - check_path_owner_consistent(self.file_path) - if self.mode in self.SUPPORT_WRITE_MODE and os.path.exists(self.file_path): - check_path_writability(self.file_path) - check_path_owner_consistent(self.file_path) - if self.mode in self.SUPPORT_READ_WRITE_MODE and os.path.exists(self.file_path): - check_path_readability(self.file_path) - check_path_writability(self.file_path) - check_path_owner_consistent(self.file_path) - - -def check_link(path): - abs_path = os.path.abspath(path) - if os.path.islink(abs_path): - print_error_log('The file path {} is a soft link.'.format(path)) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_path_length(path, name_length=None): - file_max_name_length = name_length if name_length else FileCheckConst.FILE_NAME_LENGTH - if len(path) > FileCheckConst.DIRECTORY_LENGTH or \ - len(os.path.basename(path)) > file_max_name_length: - print_error_log('The file path length exceeds limit.') - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_path_exists(path): - if not os.path.exists(path): - print_error_log('The file path %s does not exist.' % path) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_path_readability(path): - if not os.access(path, os.R_OK): - print_error_log('The file path %s is not readable.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_path_writability(path): - if not os.access(path, os.W_OK): - print_error_log('The file path %s is not writable.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_path_executable(path): - if not os.access(path, os.X_OK): - print_error_log('The file path %s is not executable.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_other_user_writable(path): - st = os.stat(path) - if st.st_mode & 0o002: - _user_interactive_confirm( - 'The file path %s may be insecure because other users have write permissions. ' - 'Do you want to continue?' 
% path) - - -def _user_interactive_confirm(message): - while True: - check_message = input(message + " Enter 'c' to continue or enter 'e' to exit: ") - if check_message == "c": - break - elif check_message == "e": - print_warn_log("User canceled.") - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - else: - print("Input is error, please enter 'c' or 'e'.") - - -def check_path_owner_consistent(path): - file_owner = os.stat(path).st_uid - if file_owner != os.getuid(): - print_error_log('The file path %s may be insecure because is does not belong to you.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_path_pattern_vaild(path): - if not re.match(FileCheckConst.FILE_VALID_PATTERN, path): - print_error_log('The file path {} contains special characters.'.format(path)) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_file_size(file_path, max_size): - file_size = os.path.getsize(file_path) - if file_size >= max_size: - _user_interactive_confirm(f'The size of file path {file_path} exceeds {max_size} bytes.' - f'Do you want to continue?') - - -def check_common_file_size(file_path): - if os.path.isfile(file_path): - for suffix, max_size in FileCheckConst.FILE_SIZE_DICT.items(): - if file_path.endswith(suffix): - check_file_size(file_path, max_size) - break - - -def check_file_suffix(file_path, file_suffix): - if file_suffix: - if not file_path.endswith(file_suffix): - print_error_log(f"The {file_path} should be a {file_suffix} file!") - raise FileCheckException(FileCheckException.INVALID_FILE_TYPE_ERROR) - - -def check_path_type(file_path, file_type): - if file_type == FileCheckConst.FILE: - if not os.path.isfile(file_path): - print_error_log(f"The {file_path} should be a file!") - raise FileCheckException(FileCheckException.INVALID_FILE_TYPE_ERROR) - if file_type == FileCheckConst.DIR: - if not os.path.isdir(file_path): - print_error_log(f"The {file_path} should be a dictionary!") - raise FileCheckException(FileCheckException.INVALID_FILE_TYPE_ERROR) - - -def create_directory(dir_path): - """ - Function Description: - creating a directory with specified permissions - Parameter: - dir_path: directory path - Exception Description: - when invalid data throw exception - """ - dir_path = os.path.realpath(dir_path) - try: - os.makedirs(dir_path, mode=FileCheckConst.DATA_DIR_AUTHORITY, exist_ok=True) - except OSError as ex: - print_error_log( - 'Failed to create {}.Please check the path permission or disk space .{}'.format(dir_path, str(ex))) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) from ex - - -def change_mode(path, mode): - if not os.path.exists(path) or os.path.islink(path): - return - try: - os.chmod(path, mode) - except PermissionError as ex: - print_error_log('Failed to change {} authority. 
{}'.format(path, str(ex))) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) from ex - diff --git a/debug/accuracy_tools/atat/mindspore/dump/api_kbk_dump.py b/debug/accuracy_tools/atat/mindspore/dump/api_kbk_dump.py index b0f80f40e553a8b136144f515015d0f94c635f5d..a53841189f5a52de74900b9ba4382e0746dfee3a 100644 --- a/debug/accuracy_tools/atat/mindspore/dump/api_kbk_dump.py +++ b/debug/accuracy_tools/atat/mindspore/dump/api_kbk_dump.py @@ -1,9 +1,9 @@ import os import json -from atat.core.utils import make_dump_path_if_not_exists +from atat.core.common.utils import make_dump_path_if_not_exists from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.core.log import print_info_log -from atat.core.file_check_util import FileOpen +from atat.core.common.log import logger +from atat.core.common.file_check import FileOpen class ApiKbkDump: @@ -48,7 +48,7 @@ class ApiKbkDump: json_path = os.path.join(json_path, "api_kbk_dump.json") with FileOpen(json_path, 'w') as f: json.dump(self.dump_json, f) - print_info_log(json_path + " has been created.") + logger.info(json_path + " has been created.") os.environ["GRAPH_OP_RUN"] = "1" os.environ["MINDSPORE_DUMP_CONFIG"] = json_path if "MS_ACL_DUMP_CFG_PATH" in os.environ: diff --git a/debug/accuracy_tools/atat/mindspore/dump/kernel_graph_dump.py b/debug/accuracy_tools/atat/mindspore/dump/kernel_graph_dump.py index f8a10ec1b1f690931871895a47014d44594ac80a..190e6bc4d5591f9ed5aa466c2be631fb224fc89b 100644 --- a/debug/accuracy_tools/atat/mindspore/dump/kernel_graph_dump.py +++ b/debug/accuracy_tools/atat/mindspore/dump/kernel_graph_dump.py @@ -1,9 +1,9 @@ import os import json -from atat.core.utils import make_dump_path_if_not_exists +from atat.core.common.utils import make_dump_path_if_not_exists from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.core.log import print_info_log -from atat.core.file_check_util import FileOpen +from atat.core.common.log import logger +from atat.core.common.file_check import FileOpen class KernelGraphDump: @@ -49,7 +49,7 @@ class KernelGraphDump: json_path = os.path.join(json_path, "kernel_graph_dump.json") with FileOpen(json_path, 'w') as f: json.dump(self.dump_json, f) - print_info_log(json_path + " has been created.") + logger.info(json_path + " has been created.") os.environ["MINDSPORE_DUMP_CONFIG"] = json_path if self.dump_json["common_dump_settings"]["dump_mode"] == 0: if self.dump_json["common_dump_settings"]["iteration"] != "all" or \ diff --git a/debug/accuracy_tools/atat/mindspore/ms_config.py b/debug/accuracy_tools/atat/mindspore/ms_config.py index 0d846c4771caca64443e170d580268ffbbdeff8e..02cead32f1f5fc2b00c47d75ac9d9950a3cd258d 100644 --- a/debug/accuracy_tools/atat/mindspore/ms_config.py +++ b/debug/accuracy_tools/atat/mindspore/ms_config.py @@ -1,6 +1,6 @@ import json from atat.core.common_config import CommonConfig, BaseConfig -from atat.core.file_check_util import FileOpen +from atat.core.common.file_check import FileOpen class TensorConfig(BaseConfig): diff --git a/debug/accuracy_tools/atat/mindspore/overflow_check/kernel_graph_overflow_check.py b/debug/accuracy_tools/atat/mindspore/overflow_check/kernel_graph_overflow_check.py index 5ef005e59e8839e19f9af600c168343251580936..7a677eb3c70c583e745785d3b8b988cbbe93e7dd 100644 --- a/debug/accuracy_tools/atat/mindspore/overflow_check/kernel_graph_overflow_check.py +++ b/debug/accuracy_tools/atat/mindspore/overflow_check/kernel_graph_overflow_check.py @@ -1,9 +1,9 @@ import os import json -from 
atat.core.utils import make_dump_path_if_not_exists +from atat.core.common.utils import make_dump_path_if_not_exists from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.core.log import print_warn_log, print_info_log -from atat.core.file_check_util import FileOpen +from atat.core.common.log import logger +from atat.core.common.file_check import FileOpen class KernelGraphOverflowCheck: @@ -23,7 +23,7 @@ class KernelGraphOverflowCheck: self.dump_json["common_dump_settings"]["path"] = config.dump_path if len(config.step) > 0: - print_warn_log("Step would change to all in this task.") + logger.warning("Step would change to all in this task.") if len(config.rank) > 0: self.dump_json["common_dump_settings"]["support_device"] = config.rank if config.check_mode == "aicore": @@ -39,7 +39,7 @@ class KernelGraphOverflowCheck: json_path = os.path.join(json_path, "kernel_graph_overflow_check.json") with FileOpen(json_path, 'w') as f: json.dump(self.dump_json, f) - print_info_log(json_path + " has been created.") + logger.info(json_path + " has been created.") os.environ["MINDSPORE_DUMP_CONFIG"] = json_path if "MS_ACL_DUMP_CFG_PATH" in os.environ: del os.environ["MS_ACL_DUMP_CFG_PATH"] diff --git a/debug/accuracy_tools/atat/pytorch/advisor/advisor.py b/debug/accuracy_tools/atat/pytorch/advisor/advisor.py index db193dcd833af1e87c377e721b72b36022601437..f4cb441f5e6f74a52282db1789bb2a29cf97ea79 100644 --- a/debug/accuracy_tools/atat/pytorch/advisor/advisor.py +++ b/debug/accuracy_tools/atat/pytorch/advisor/advisor.py @@ -16,12 +16,12 @@ """ import os -import pandas as pd -from .advisor_result import AdvisorResult -from .advisor_const import AdvisorConst -from ...core.utils import CompareException, CompareConst, Const, print_info_log, print_warn_log, print_error_log -from ...core.file_check_util import FileChecker, FileCheckConst +from atat.pytorch.advisor.advisor_result import AdvisorResult +from atat.pytorch.advisor.advisor_const import AdvisorConst +from atat.pytorch.common.log import logger +from atat.core.common.utils import CompareException, CompareConst, Const +from atat.core.common.file_check import FileChecker, FileCheckConst class Advisor: @@ -49,15 +49,16 @@ class Advisor: def analyze_unmatched(self, analyze_data): if self.file_type == Const.ALL: - accuracy_unmatched = analyze_data[analyze_data[CompareConst.ACCURACY] == CompareConst.ACCURACY_CHECK_UNMATCH] + accuracy_unmatched = analyze_data[ + analyze_data[CompareConst.ACCURACY] == CompareConst.ACCURACY_CHECK_UNMATCH] else: - accuracy_unmatched = analyze_data[(analyze_data[CompareConst.NPU_SHAPE] == CompareConst.NAN) | + accuracy_unmatched = analyze_data[(analyze_data[CompareConst.NPU_SHAPE] == CompareConst.NAN) | (analyze_data[CompareConst.BENCH_SHAPE] == CompareConst.NAN)] num_unmatch = len(accuracy_unmatched) if num_unmatch != 0: for i in range(len(accuracy_unmatched)): item = accuracy_unmatched.iloc[i] - print_warn_log("The tensor name matches but the shape or dtype does not match: {}" + logger.warning("The tensor name matches but the shape or dtype does not match: {}" .format(item[CompareConst.NPU_NAME])) def gen_advisor_result(self, pd_data): @@ -65,7 +66,7 @@ class Advisor: node_name = first_failing_data[CompareConst.NPU_NAME] index = first_failing_data['index'] message = self.gen_advisor_message(node_name) - print_warn_log("Find %s accuracy not reached, the line is %s" % (node_name, index)) + logger.warning("Find %s accuracy not reached, the line is %s" % (node_name, index)) result = AdvisorResult(node_name, 
index, message) return result @@ -88,7 +89,7 @@ class Advisor: def analysis(self): self._check_path_vaild() analyze_data = self._parse_input_data() - print_info_log("Start analyzing the comparison result: %s" % self.file_type) + logger.info("Start analyzing the comparison result: %s" % self.file_type) self.analyze_unmatched(analyze_data) if self.file_type == Const.ALL: failing_data = analyze_data[analyze_data[CompareConst.ACCURACY] == CompareConst.ACCURACY_CHECK_NO] @@ -97,7 +98,7 @@ class Advisor: elif self.file_type == Const.SUMMARY: failing_data = analyze_data[analyze_data[CompareConst.RESULT] == CompareConst.WARNING] if failing_data.empty: - print_info_log("All data from api input/output accuracy reached") + logger.info("All data from api input/output accuracy reached") result = AdvisorResult(AdvisorConst.NO_ERROR_API, AdvisorConst.NO_ERROR_API, AdvisorConst.NO_ERR_SUGGEST) else: result = self.gen_advisor_result(failing_data) @@ -113,7 +114,7 @@ class Advisor: elif {CompareConst.MAX_DIFF, CompareConst.RESULT}.issubset(data_columns): self.file_type = Const.SUMMARY else: - print_error_log('Compare result does not meet the required conditions.') + logger.error('Compare result does not meet the required conditions.') raise CompareException(CompareException.INVALID_DATA_ERROR) df = self.input_data.reset_index() return df diff --git a/debug/accuracy_tools/atat/pytorch/advisor/advisor_result.py b/debug/accuracy_tools/atat/pytorch/advisor/advisor_result.py index f8a16d2a7067d7ef2fa0746e32258f9da17624df..59845a75415823246a1fadeba588d5289c3eb272 100644 --- a/debug/accuracy_tools/atat/pytorch/advisor/advisor_result.py +++ b/debug/accuracy_tools/atat/pytorch/advisor/advisor_result.py @@ -17,9 +17,10 @@ import os import time -from .advisor_const import AdvisorConst -from ...core.utils import Const, print_info_log, print_error_log -from ...core.file_check_util import FileCheckConst, change_mode +from atat.pytorch.advisor.advisor_const import AdvisorConst +from atat.pytorch.common.log import logger +from atat.core.common.utils import Const +from atat.core.common.file_check import FileCheckConst, change_mode class AdvisorResult: @@ -43,15 +44,15 @@ class AdvisorResult: output_file.writelines(message_list) change_mode(result_file, FileCheckConst.DATA_FILE_AUTHORITY) except IOError as io_error: - print_error_log("Failed to save %s, the reason is %s." % (result_file, io_error)) + logger.error("Failed to save %s, the reason is %s." 
% (result_file, io_error))
         else:
-            print_info_log("The advisor summary is saved in: %s" % result_file)
+            logger.info("The advisor summary is saved in: %s" % result_file)
 
     def print_advisor_log(self):
-        print_info_log("The summary of the expert advice is as follows: ")
+        logger.info("The summary of the expert advice is as follows: ")
         message_list = [AdvisorConst.LINE + AdvisorConst.COLON + str(self.line),
                         AdvisorConst.SUSPECT_NODES + AdvisorConst.COLON + self.suspect_node,
                         AdvisorConst.ADVISOR_SUGGEST + AdvisorConst.COLON + self.advisor_message]
         for message in message_list:
-            print_info_log(message)
+            logger.info(message)
         return message_list
diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/config.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/config.py
index db6db968bf3124e85c211991fa9b48c1aa1b7bba..0aceb691b2530bd8117396ef34c9cd4154c76716 100644
--- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/config.py
+++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/config.py
@@ -1,8 +1,8 @@
 import os
 import yaml
-from ..common.utils import check_file_or_directory_path
-from ...hook_module.utils import WrapFunctionalOps, WrapTensorOps, WrapTorchOps
-from ...common.file_check import FileOpen
+from atat.pytorch.api_accuracy_checker.common.utils import check_file_or_directory_path
+from atat.pytorch.hook_module.utils import WrapFunctionalOps, WrapTensorOps, WrapTorchOps
+from atat.core.common.file_check import FileOpen
 
 WrapApi = set(WrapFunctionalOps) | set(WrapTensorOps) | set(WrapTorchOps)
diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/utils.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/utils.py
index 51d7c556ed51180c5d00897b9a7af11bebeaa5da..022edbfcf308a252cf9ef477e62ee49b4d7873ed 100644
--- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/utils.py
+++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/utils.py
@@ -28,10 +28,10 @@ except ImportError:
 else:
     IS_GPU = False
 
-from atat.pytorch.common.log import print_warn_log, print_error_log
-from atat.pytorch.common.file_check import FileCheckConst, FileChecker, FileOpen, change_mode, create_directory
+from atat.pytorch.common.log import logger
+from atat.core.common.file_check import FileCheckConst, FileChecker, FileOpen, change_mode, create_directory
 from atat.pytorch.common.utils import Const
-from atat.core.utils import CompareException
+from atat.core.common.utils import CompareException
 
 
 class DumpException(CompareException):
@@ -55,7 +55,7 @@ def check_object_type(check_object, allow_type):
         when invalid data throw exception
     """
     if not isinstance(check_object, allow_type):
-        print_error_log(f"{check_object} not of {allow_type} type")
+        logger.error(f"{check_object} not of {allow_type} type")
         raise CompareException(CompareException.INVALID_DATA_ERROR)
 
 
@@ -71,24 +71,24 @@ def check_file_or_directory_path(path, isdir=False):
     """
     if isdir:
         if not os.path.exists(path):
-            print_error_log('The path {} is not exist.'.format(path))
+            logger.error('The path {} does not exist.'.format(path))
             raise CompareException(CompareException.INVALID_PATH_ERROR)
 
         if not os.path.isdir(path):
-            print_error_log('The path {} is not a directory.'.format(path))
+            logger.error('The path {} is not a directory.'.format(path))
             raise CompareException(CompareException.INVALID_PATH_ERROR)
 
         if not os.access(path, os.W_OK):
-            print_error_log(
+            logger.error(
                 'The path {} does not have permission to write. 
Please check the path permission'.format(path)) raise CompareException(CompareException.INVALID_PATH_ERROR) else: if not os.path.isfile(path): - print_error_log('{} is an invalid file or non-exist.'.format(path)) + logger.error('{} is an invalid file or non-exist.'.format(path)) raise CompareException(CompareException.INVALID_PATH_ERROR) if not os.access(path, os.R_OK): - print_error_log( + logger.error( 'The path {} does not have permission to read. Please check the path permission'.format(path)) raise CompareException(CompareException.INVALID_PATH_ERROR) @@ -98,10 +98,10 @@ def get_json_contents(file_path): try: json_obj = json.loads(ops) except ValueError as error: - print_error_log('Failed to load "%s". %s' % (file_path, str(error))) + logger.error('Failed to load "%s". %s' % (file_path, str(error))) raise CompareException(CompareException.INVALID_FILE_ERROR) from error if not isinstance(json_obj, dict): - print_error_log('Json file %s, content is not a dictionary!' % file_path) + logger.error('Json file %s, content is not a dictionary!' % file_path) raise CompareException(CompareException.INVALID_FILE_ERROR) return json_obj @@ -161,7 +161,7 @@ def cross_entropy_process(api_info_dict): def initialize_save_path(save_path, dir_name): data_path = os.path.join(save_path, dir_name) if os.path.exists(data_path): - print_warn_log(f"{data_path} already exists, it will be overwritten") + logger.warning(f"{data_path} already exists, it will be overwritten") else: os.mkdir(data_path, mode=FileCheckConst.DATA_DIR_AUTHORITY) data_path_checker = FileChecker(data_path, FileCheckConst.DIR) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/algorithm.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/algorithm.py index 7983709f14bcca72a0cb29c453198396561681b1..a450edb929161dd14f2d1476509ff5ce5b7a9d1c 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/algorithm.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/algorithm.py @@ -1,7 +1,7 @@ # 定义比对算法及比对标准 import torch import numpy as np -from .compare_utils import CompareConst +from atat.pytorch.api_accuracy_checker.compare.compare_utils import CompareConst #cos diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_compare.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_compare.py index fcc4ba3d5d0caed136040c5fe5dd7b0146a6a1f5..7e0617eb3ae1a91062fa25c5cb478e5e674ffb0a 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_compare.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_compare.py @@ -14,9 +14,9 @@ from atat.pytorch.api_accuracy_checker.compare.compare_utils import CompareConst convert_str_to_float, CompareMessage from atat.pytorch.api_accuracy_checker.compare.compare_column import ApiPrecisionOutputColumn from atat.pytorch.api_accuracy_checker.run_ut.run_ut import get_validated_result_csv_path -from atat.pytorch.common.file_check import FileCheckConst, FileChecker, change_mode, check_path_before_create, create_directory -from atat.pytorch.common.log import print_info_log, print_warn_log, print_error_log -from atat.core.utils import CompareException +from atat.core.common.file_check import FileCheckConst, FileChecker, change_mode, check_path_before_create, create_directory +from atat.pytorch.common.log import logger +from atat.core.common.utils import CompareException CompareConfig = namedtuple('CompareConfig', 
['npu_csv_path', 'gpu_csv_path', 'result_csv_path', 'details_csv_path']) unsupported_message = 'This data type does not support benchmark compare.' @@ -152,18 +152,18 @@ def write_detail_csv(content, save_path): def api_precision_compare(config): - print_info_log("Start compare task") - print_info_log(f"Compare task result will be saved in {config.result_csv_path}") - print_info_log(f"Compare task detail will be saved in {config.details_csv_path}") + logger.info("Start compare task") + logger.info(f"Compare task result will be saved in {config.result_csv_path}") + logger.info(f"Compare task detail will be saved in {config.details_csv_path}") try: npu_data = pd.read_csv(config.npu_csv_path) except Exception as err: - print_error_log(f"Open npu csv Error: %s" % str(err)) + logger.error(f"Open npu csv Error: %s" % str(err)) check_csv_columns(npu_data.columns, "npu_csv") try: gpu_data = pd.read_csv(config.gpu_csv_path) except Exception as err: - print_error_log(f"Open gpu csv Error: %s" % str(err)) + logger.error(f"Open gpu csv Error: %s" % str(err)) check_csv_columns(gpu_data.columns, "gpu_csv") detail_csv_title = [ApiPrecisionCompareColumn.get_detail_csv_title()] result_csv_title = [ApiPrecisionCompareColumn.get_result_csv_title()] @@ -172,7 +172,7 @@ def api_precision_compare(config): try: analyse_csv(npu_data, gpu_data, config) except Exception as err: - print_error_log(f"Analyse csv Error: %s" % str(err)) + logger.error(f"Analyse csv Error: %s" % str(err)) change_mode(config.result_csv_path, FileCheckConst.DATA_FILE_AUTHORITY) change_mode(config.details_csv_path, FileCheckConst.DATA_FILE_AUTHORITY) @@ -187,7 +187,7 @@ def analyse_csv(npu_data, gpu_data, config): row_gpu = gpu_data[gpu_data[ApiPrecisionCompareColumn.API_NAME] == full_api_name_with_direction_status] _, api_name, _, direction_status, _, _ = full_api_name_with_direction_status.split(".") if row_gpu.empty: - print_warn_log(f'This API : {full_api_name_with_direction_status} does not exist in the GPU data.') + logger.warning(f'This API : {full_api_name_with_direction_status} does not exist in the GPU data.') continue if len(row_gpu) > 1: msg = f'This API : {full_api_name_with_direction_status} has multiple records in the GPU data.' 
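(The `analyse_csv` routine above pairs each NPU result row with its GPU counterpart by the full API string and recovers the API name and direction by splitting on `.`. A minimal, self-contained sketch of that matching step, assuming toy DataFrames and a hypothetical `API Name` column title — the real title comes from `ApiPrecisionCompareColumn.API_NAME`:

```python
import pandas as pd

API_NAME = "API Name"  # hypothetical column title for this sketch

npu_data = pd.DataFrame({API_NAME: ["Functional.conv2d.0.forward.output.0",
                                    "Functional.conv2d.0.backward.output.0"]})
gpu_data = pd.DataFrame({API_NAME: ["Functional.conv2d.0.forward.output.0"]})

for full_name in npu_data[API_NAME]:
    # Look up the matching GPU row by the full dotted name.
    row_gpu = gpu_data[gpu_data[API_NAME] == full_name]
    # "<wrapper>.<api>.<call index>.<direction>.<io>.<position>" -> six fields
    _, api_name, _, direction_status, _, _ = full_name.split(".")
    if row_gpu.empty:
        print(f"{full_name} does not exist in the GPU data")  # mirrors the warning branch
    else:
        print(f"compare {api_name} ({direction_status})")
```

Rows missing on the GPU side are skipped with a warning rather than failing the whole compare, which matches the branch above.)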
@@ -234,7 +234,7 @@ def analyse_csv(npu_data, gpu_data, config): elif direction_status == 'backward': backward_status.append(new_status) else: - print_error_log(f"Invalid direction status: {direction_status}") + logger.error(f"Invalid direction status: {direction_status}") if last_api_name is not None: if last_api_dtype in API_PRECISION_COMPARE_UNSUPPORT_LIST: @@ -389,4 +389,4 @@ def _api_precision_compare_parser(parser): if __name__ == '__main__': _api_precision_compare() - print_info_log("Compare task completed.") + logger.info("Compare task completed.") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare.py index 350a407747514a33d4f925a4ea40ef21ec13f6ad..fbba1dca002663df9e3c6df983fe4ee3546be13f 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare.py @@ -5,16 +5,18 @@ import torch import numpy as np from rich.table import Table from rich.console import Console -from ..common.utils import get_json_contents, write_csv, print_warn_log, Const -from ..compare.compare_utils import CompareConst, check_dtype_comparable, DETAIL_TEST_ROWS, \ - precision_configs, BENCHMARK_COMPARE_SUPPORT_LIST, AbsoluteStandardApi, BinaryStandardApi, apis_threshold -from ..compare.compare_column import CompareColumn -from ..compare.algorithm import get_rmse, get_error_balance, get_max_rel_err, get_mean_rel_err, \ - get_rel_err, get_abs_err, get_max_abs_err, get_rel_err_ratio, cosine_sim, get_rel_err_origin, \ +from atat.pytorch.common.log import logger +from atat.pytorch.api_accuracy_checker.common.utils import get_json_contents, write_csv, Const +from atat.pytorch.api_accuracy_checker.compare.compare_utils import CompareConst, check_dtype_comparable, \ + DETAIL_TEST_ROWS, precision_configs, BENCHMARK_COMPARE_SUPPORT_LIST, AbsoluteStandardApi, BinaryStandardApi, \ + apis_threshold +from atat.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn +from atat.pytorch.api_accuracy_checker.compare.algorithm import get_rmse, get_error_balance, get_max_rel_err, \ + get_mean_rel_err, get_rel_err, get_abs_err, get_max_abs_err, get_rel_err_ratio, cosine_sim, get_rel_err_origin, \ get_small_value_err_ratio, get_finite_and_infinite_mask, get_small_value_mask, check_inf_nan_value, \ check_small_value, check_norm_value, get_abs_bench_with_eps -from ..common.config import msCheckerConfig -from ...common.file_check import FileOpen +from atat.pytorch.api_accuracy_checker.common.config import msCheckerConfig +from atat.core.common.file_check import FileOpen class Comparator: @@ -40,7 +42,7 @@ class Comparator: } @staticmethod - def _compare_dropout(api_name, bench_output, device_output): + def _compare_dropout(bench_output, device_output): tensor_num = bench_output.numel() if tensor_num >= 100: if abs((bench_output == 0).sum() - (device_output == 0).cpu().sum()) / tensor_num < 0.1: @@ -67,7 +69,7 @@ class Comparator: error_rate = float(error_nums / bench_output.size) result = CompareConst.PASS if error_rate == 0 else CompareConst.ERROR return error_rate, result, "" - + @staticmethod def _get_absolute_threshold_attribute(api_name, dtype): small_value_threshold = apis_threshold.get(api_name).get(dtype).get('small_value') @@ -83,7 +85,7 @@ class Comparator: else: passing_rate = "0%" - print_warn_log("The follwing tables will be deprecated in the future." 
+        logger.warning("The following tables will be deprecated in the future."
                        "The following results are for reference only.")
         console = Console()
         table_total = Table(
@@ -104,13 +106,15 @@ class Comparator:
         table_detail.add_column("Statistics")
         table_detail.add_row("Forward Error", str(self.test_result_cnt.get("forward_fail_num", 0)))
         table_detail.add_row("Backward Error", str(self.test_result_cnt.get("backward_fail_num", 0)))
-        table_detail.add_row("Both Forward & Backward Error", str(self.test_result_cnt.get("forward_and_backward_fail_num", 0)))
+        table_detail.add_row("Both Forward & Backward Error",
+                             str(self.test_result_cnt.get("forward_and_backward_fail_num", 0)))
         console.print(table_total)
         console.print(table_detail)
 
     def get_statistics_from_result_csv(self):
-        checklist = [CompareConst.PASS, CompareConst.ERROR, CompareConst.WARNING, CompareConst.SPACE, CompareConst.SKIP, "skip"]
+        checklist = [CompareConst.PASS, CompareConst.ERROR, CompareConst.WARNING, CompareConst.SPACE, CompareConst.SKIP,
+                     "skip"]
         self.test_result_cnt = {
             "success_num": 0, "warning_num": 0, "error_num": 0,
             "forward_fail_num": 0, "backward_fail_num": 0, "forward_and_backward_fail_num": 0,
@@ -148,7 +152,7 @@ class Comparator:
             self.test_result_cnt['warning_num'] += 1
 
     def write_csv_title(self):
-        summary_test_rows = [[self.COLUMN_API_NAME, self.COLUMN_FORWARD_SUCCESS, 
+        summary_test_rows = [[self.COLUMN_API_NAME, self.COLUMN_FORWARD_SUCCESS,
                               self.COLUMN_BACKWARD_SUCCESS, "Message"]]
         if not os.path.exists(self.save_path):
             write_csv(summary_test_rows, self.save_path)
@@ -179,36 +183,41 @@ class Comparator:
         if isinstance(fwd_result, list):
             for i, test_subject in enumerate(fwd_result):
                 subject = subject_prefix + ".forward.output." + str(i)
-                test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision) 
+                test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision)
                                 if isinstance(item, float) else item for item in test_subject]
                 test_rows.append([subject] + list(test_subject))
         if isinstance(bwd_result, list):
             for i, test_subject in enumerate(bwd_result):
                 subject = subject_prefix + ".backward.output." 
+ str(i) - test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision) + test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision) if isinstance(item, float) else item for item in test_subject] test_rows.append([subject] + list(test_subject)) write_csv(test_rows, self.detail_save_path) - def record_results(self, *args): + def record_results(self, args): self.write_summary_csv(args) self.write_detail_csv(args) def compare_output(self, full_api_name, bench_output, device_output, bench_grad=None, npu_grad=None): _, api_name, _ = full_api_name.split(Const.SEP) - compare_func = self._compare_dropout if "dropout" in full_api_name else self._compare_core_wrapper - fwd_success_status, fwd_compare_alg_results = compare_func(api_name, bench_output, device_output) + if "dropout" in full_api_name: + fwd_success_status, fwd_compare_alg_results = self._compare_dropout(bench_output, device_output) + else: + fwd_success_status, fwd_compare_alg_results = self._compare_core_wrapper(api_name, bench_output, + device_output) if not (bench_grad and npu_grad): bwd_success_status, bwd_compare_alg_results = (CompareConst.SPACE, []) else: if "dropout" in full_api_name: - bwd_success_status, bwd_compare_alg_results = compare_func(api_name, bench_grad[0], npu_grad[0]) + bwd_success_status, bwd_compare_alg_results = self._compare_dropout(bench_grad[0], npu_grad[0]) else: - bwd_success_status, bwd_compare_alg_results = compare_func(api_name, bench_grad, npu_grad) - self.record_results(full_api_name, fwd_success_status, bwd_success_status if bwd_compare_alg_results is not None else CompareConst.SPACE, fwd_compare_alg_results, bwd_compare_alg_results) + bwd_success_status, bwd_compare_alg_results = self._compare_core_wrapper(api_name, bench_grad, npu_grad) + self.record_results((full_api_name, fwd_success_status, + bwd_success_status if bwd_compare_alg_results is not None else CompareConst.SPACE, + fwd_compare_alg_results, bwd_compare_alg_results)) return fwd_success_status == CompareConst.PASS, bwd_success_status == CompareConst.PASS \ - or bwd_success_status == CompareConst.SPACE + or bwd_success_status == CompareConst.SPACE def _compare_core_wrapper(self, api_name, bench_output, device_output): detailed_result_total = [] @@ -250,7 +259,7 @@ class Comparator: if b_keys != n_keys: return CompareConst.ERROR, compare_column, "bench and npu output dict keys are different." else: - status, compare_result, message = self._compare_core(api_name, list(bench_output.values()), + status, compare_result, message = self._compare_core(api_name, list(bench_output.values()), list(device_output.values())) elif isinstance(bench_output, torch.Tensor): copy_bench_out = bench_output.detach().clone() @@ -259,7 +268,7 @@ class Comparator: compare_column.npu_type = str(copy_device_output.dtype) compare_column.shape = tuple(device_output.shape) status, compare_result, message = self._compare_torch_tensor(api_name, copy_bench_out, copy_device_output, - compare_column) + compare_column) elif isinstance(bench_output, (bool, int, float, str)): compare_column.bench_type = str(type(bench_output)) compare_column.npu_type = str(type(device_output)) @@ -267,7 +276,7 @@ class Comparator: elif bench_output is None: return CompareConst.SKIP, compare_column, "Bench output is None, skip this test." 
else: - return CompareConst.PASS, compare_column, + return CompareConst.PASS, compare_column, "Unexpected output type in compare_core: {}".format(type(bench_output)) return status, compare_result, message @@ -283,24 +292,24 @@ class Comparator: device_output = device_output.cpu().numpy() if cpu_shape != npu_shape: return CompareConst.ERROR, compare_column, f"The shape of bench{str(cpu_shape)} " \ - f"and npu{str(npu_shape)} not equal." + f"and npu{str(npu_shape)} not equal." if not check_dtype_comparable(bench_output, device_output): return CompareConst.ERROR, compare_column, f"Bench out dtype is {bench_output.dtype} but " \ - f"npu output dtype is {device_output.dtype}, cannot compare." + f"npu output dtype is {device_output.dtype}, cannot compare." message = "" - if bench_output.dtype in [bool, np.uint8, np.int8, np.int16, np.uint16, np.uint32, np.int32, + if bench_output.dtype in [bool, np.uint8, np.int8, np.int16, np.uint16, np.uint32, np.int32, np.int64, np.uint64]: message += f"Compare algorithm is not supported for {bench_output.dtype} data. " \ - f"Only judged by Error Rate." + f"Only judged by Error Rate." err_rate, status, msg = self._compare_bool_tensor(bench_output, device_output) message += msg + "\n" compare_column.error_rate = err_rate return status, compare_column, message else: - status, compare_column, message = self._compare_float_tensor(api_name, bench_output, device_output, + status, compare_column, message = self._compare_float_tensor(api_name, bench_output, device_output, compare_column, npu_dtype) return status, compare_column, message - + def _compare_float_tensor(self, api_name, bench_output, device_output, compare_column, dtype): message = "" abs_bench, abs_bench_with_eps = get_abs_bench_with_eps(bench_output, dtype) @@ -316,11 +325,12 @@ class Comparator: rel_err = abs_err / abs_bench_with_eps small_value_mask = get_small_value_mask(abs_bench, both_finite_mask, small_value_threshold) normal_value_mask = np.logical_and(both_finite_mask, np.logical_not(small_value_mask)) - compare_column.inf_nan_error_ratio = check_inf_nan_value(inf_nan_mask, bench_output, device_output, dtype, rtol) + compare_column.inf_nan_error_ratio = check_inf_nan_value(inf_nan_mask, bench_output, device_output, + dtype, rtol) compare_column.rel_err_ratio = check_norm_value(normal_value_mask, rel_err, rtol) compare_column.abs_err_ratio = check_small_value(abs_err, small_value_mask, small_value_atol) else: - dtype_config = precision_configs.get(dtype) + dtype_config = precision_configs.get(dtype) small_value_mask = get_small_value_mask(abs_bench, both_finite_mask, dtype_config['small_value'][0]) abs_err_greater_mask = np.greater(abs_err, dtype_config['small_value_atol'][0]) compare_column.small_value_err_ratio = get_small_value_err_ratio(small_value_mask, abs_err_greater_mask) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_column.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_column.py index 97cf8226bd1ea6c9a668abd91719fd2662b5183b..bd88d6742f3a26f082f735d514023d74a49c7541 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_column.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_column.py @@ -1,4 +1,4 @@ -from .compare_utils import CompareConst +from atat.pytorch.api_accuracy_checker.compare.compare_utils import CompareConst class CompareColumn: diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_utils.py 
b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_utils.py index 5511da724446187e2dd886448bf6b26ea7b7b369..fe841eb06397443f7ef7d89edc6852d05f7579f5 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_utils.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_utils.py @@ -3,8 +3,9 @@ import os import numpy as np import torch import yaml -from ..common.utils import Const, print_warn_log, CompareException -from ...common.file_check import FileOpen +from atat.core.common.utils import Const, CompareException +from atat.pytorch.common.log import logger +from atat.core.common.file_check import FileOpen current_time = time.strftime("%Y%m%d%H%M%S") @@ -170,7 +171,7 @@ def check_dtype_comparable(x, y): if y.dtype in Const.INT_TYPE: return True return False - print_warn_log(f"Compare: Unexpected dtype {x.dtype}, {y.dtype}") + logger.warning(f"Compare: Unexpected dtype {x.dtype}, {y.dtype}") return False diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/data_generate.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/data_generate.py index 723fb8ec6680ab65770007c5ab90b5f8428db2ac..e983413bf00ea32aaea4d6ee7e215097e9f12cff 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/data_generate.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/data_generate.py @@ -20,8 +20,8 @@ import math import torch import numpy -from ..common.utils import Const, check_file_or_directory_path, check_object_type, print_warn_log, \ - print_error_log, get_full_data_path, CompareException +from atat.pytorch.api_accuracy_checker.common.utils import Const, check_file_or_directory_path, check_object_type, get_full_data_path, CompareException +from atat.pytorch.common.log import logger TORCH_TYPE = ["torch.device", "torch.dtype"] TENSOR_DATA_LIST = ["torch.Tensor", "torch.nn.parameter.Parameter"] @@ -62,7 +62,7 @@ def gen_data(info, need_grad, convert_type, real_data_path=None): try: data = eval(data_type)(data) except Exception as err: - print_error_log("Failed to convert the type to numpy: %s" % str(err)) + logger.error("Failed to convert the type to numpy: %s" % str(err)) elif data_type == "torch.Size": data = torch.Size(info.get("value")) else: @@ -170,7 +170,7 @@ def gen_common_tensor(low_info, high_info, shape, data_dtype, convert_type): low, high = int(low), int(high) tensor = torch.randint(low, high + 1, shape, dtype=eval(data_dtype)) else: - print_error_log('Dtype is not supported: ' + data_dtype) + logger.error('Dtype is not supported: ' + data_dtype) raise NotImplementedError() if tensor.nelement() == 0: return tensor @@ -231,7 +231,7 @@ def gen_args(args_info, need_grad=True, convert_type=None, real_data_path=None): elif arg is None: data = None else: - print_warn_log(f'Warning: {arg} is not supported') + logger.warning(f'Warning: {arg} is not supported') raise NotImplementedError() args_result.append(data) return args_result @@ -304,6 +304,6 @@ def gen_api_params(api_info, need_grad=True, convert_type=None, real_data_path=N if api_info.get("input_args"): args_params = gen_args(api_info.get("input_args"), need_grad, convert_type, real_data_path) else: - print_warn_log(f'Warning: No args in {api_info} ') + logger.warning(f'Warning: No args in {api_info} ') args_params = [] return args_params, kwargs_params diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py 
b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py index b4aa2ddeaa7aac1dca9e1579344e3738e4c73a6c..b9d1a4fd1f3da5c4f115bb8542c9f9150c4d5f35 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py @@ -13,9 +13,9 @@ from atat.pytorch.api_accuracy_checker.run_ut.run_ut import _run_ut_parser, get_ get_validated_details_csv_path, preprocess_forward_content from atat.pytorch.api_accuracy_checker.compare.compare import Comparator from atat.pytorch.common import parse_json_info_forward_backward -from atat.pytorch.common.file_check import FileCheckConst, FileChecker, check_file_suffix, check_link, FileOpen, \ +from atat.core.common.file_check import FileCheckConst, FileChecker, check_file_suffix, check_link, FileOpen, \ check_path_before_create, create_directory -from atat.pytorch.common.log import print_error_log, print_warn_log, print_info_log +from atat.pytorch.common.log import logger def split_json_file(input_file, num_splits, filter_api): @@ -57,7 +57,7 @@ def split_json_file(input_file, num_splits, filter_api): def signal_handler(signum, frame): - print_warn_log(f'Signal handler called with signal {signum}') + logger.warning(f'Signal handler called with signal {signum}') raise KeyboardInterrupt() @@ -74,8 +74,8 @@ def run_parallel_ut(config): processes = [] device_id_cycle = cycle(config.device_id) if config.save_error_data_flag: - print_info_log("UT task error datas will be saved") - print_info_log(f"Starting parallel UT with {config.num_splits} processes") + logger.info("UT task error datas will be saved") + logger.info(f"Starting parallel UT with {config.num_splits} processes") progress_bar = tqdm(total=config.total_items, desc="Total items", unit="items") def create_cmd(api_info, dev_id): @@ -105,7 +105,7 @@ def run_parallel_ut(config): print(output, end='') sys.stdout.flush() except ValueError as e: - print_warn_log(f"An error occurred while reading subprocess output: {e}") + logger.warning(f"An error occurred while reading subprocess output: {e}") def update_progress_bar(progress_bar, result_csv_path): while any(process.poll() is None for process in processes): @@ -114,9 +114,9 @@ def run_parallel_ut(config): completed_items = len(result_file.readlines()) - 1 progress_bar.update(completed_items - progress_bar.n) except FileNotFoundError: - print_warn_log(f"Result CSV file not found: {result_csv_path}.") + logger.warning(f"Result CSV file not found: {result_csv_path}.") except Exception as e: - print_error_log(f"An unexpected error occurred while reading result CSV: {e}") + logger.error(f"An unexpected error occurred while reading result CSV: {e}") time.sleep(1) for api_info in config.api_files: @@ -141,27 +141,27 @@ def run_parallel_ut(config): try: os.remove(file) except FileNotFoundError: - print_warn_log(f"File not found and could not be deleted: {file}") + logger.warning(f"File not found and could not be deleted: {file}") try: for process in processes: process.communicate(timeout=None) except KeyboardInterrupt: - print_warn_log("Interrupted by user, terminating processes and cleaning up...") + logger.warning("Interrupted by user, terminating processes and cleaning up...") except Exception as e: - print_error_log(f"An unexpected error occurred: {e}") + logger.error(f"An unexpected error occurred: {e}") finally: if progress_bar.n < config.total_items: - print_warn_log("The UT task has not been completed. 
The parameter '-csv_path' along with the path to the result CSV file will be utilized to resume the UT task.") + logger.warning("The UT task has not been completed. The parameter '-csv_path' along with the path to the result CSV file will be utilized to resume the UT task.") clean_up() progress_bar_thread.join() try: comparator = Comparator(config.result_csv_path, config.result_csv_path, False) comparator.print_pretest_result() except FileNotFoundError as e: - print_error_log(f"Error: {e}") + logger.error(f"Error: {e}") except Exception as e: - print_error_log(f"An unexpected error occurred: {e}") + logger.error(f"An unexpected error occurred: {e}") def prepare_config(args): @@ -182,8 +182,8 @@ def prepare_config(args): else: result_csv_path = get_validated_result_csv_path(args.result_csv_path, 'result') details_csv_path = get_validated_details_csv_path(result_csv_path) - print_info_log(f"UT task result will be saved in {result_csv_path}") - print_info_log(f"UT task details will be saved in {details_csv_path}") + logger.info(f"UT task result will be saved in {result_csv_path}") + logger.info(f"UT task details will be saved in {details_csv_path}") return ParallelUTConfig(split_files, out_path, args.num_splits, args.save_error_data, args.jit_compile, args.device_id, result_csv_path, total_items, args.real_data_path) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py index be25e24b372b7fe86339561b138cde75a7df7dea..498379030cf3ba314b0e7ad7249b33fc46066201 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py @@ -4,10 +4,10 @@ import sys import torch_npu import torch from tqdm import tqdm +from atat.pytorch.api_accuracy_checker.run_ut.run_ut import exec_api, generate_device_params, get_api_info from atat.pytorch.api_accuracy_checker.common.utils import get_json_contents -from atat.pytorch.common.file_check import check_link -from atat.pytorch.common.log import print_info_log, print_warn_log, print_error_log - +from atat.core.common.file_check import check_link +from atat.pytorch.common.log import logger def check_tensor_overflow(x): if isinstance(x, torch.Tensor) and x.numel() != 0 and x.dtype != torch.bool: @@ -45,7 +45,7 @@ def check_data_overflow(x): def run_overflow_check(forward_file): - print_info_log("start UT test") + logger.info("start UT test") forward_content = get_json_contents(forward_file) for api_full_name, api_info_dict in tqdm(forward_content.items()): try: @@ -53,13 +53,13 @@ def run_overflow_check(forward_file): except Exception as err: api_name = api_full_name.split("_", 1)[1].rsplit("_", 2)[0] if "not implemented for 'Half'" in str(err): - print_warn_log(f"API {api_name} not support half tensor in CPU, please add {api_name} to CONVERT_API " + logger.warning(f"API {api_name} not support half tensor in CPU, please add {api_name} to CONVERT_API " f"'fp16_to_fp32' list in accuracy_tools/api_accuracy_check/common/utils.py file.") elif "expected scalar type Long" in str(err): - print_warn_log(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " + logger.warning(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " f"'int32_to_int64' list in accuracy_tools/api_accuracy_check/common/utils.py file.") else: - print_error_log(f"Run {api_full_name} UT 
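The multi_run_ut.py changes above keep its overall flow: split the api_info JSON into num_splits files, start one run_ut subprocess per split on a round-robin device id, and poll the shared result CSV for progress. A minimal sketch of that split-and-spawn pattern follows; the file naming and the worker command line are placeholders rather than the tool's actual CLI.

```python
import json
import subprocess
import sys
from itertools import cycle


def split_api_info(input_file, num_splits):
    """Write num_splits smaller JSON files and return their paths.
    Assumes a flat {api_full_name: api_info} layout like the dump above."""
    with open(input_file) as f:
        items = list(json.load(f).items())
    chunk = (len(items) + num_splits - 1) // num_splits
    paths = []
    for i in range(num_splits):
        path = f"{input_file}.part{i}.json"  # placeholder naming scheme
        with open(path, "w") as f:
            json.dump(dict(items[i * chunk:(i + 1) * chunk]), f)
        paths.append(path)
    return paths


def run_parallel(paths, device_ids, result_csv):
    """One worker per split file, device ids handed out round-robin;
    the command line is a placeholder for the real run_ut invocation."""
    procs = [subprocess.Popen([sys.executable, "run_ut.py",
                               "-api_info", path, "-d", str(dev),
                               "-csv_path", result_csv])
             for path, dev in zip(paths, cycle(device_ids))]
    for p in procs:
        p.communicate()  # block until every split finishes
```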
Error: %s" % str(err)) + logger.error(f"Run {api_full_name} UT Error: %s" % str(err)) def run_torch_api(api_full_name, api_info_dict): @@ -68,7 +68,7 @@ def run_torch_api(api_full_name, api_info_dict): api_name = api_full_name.split(".", 1)[1].rsplit(".", 2)[0] args, kwargs, need_grad = get_api_info(api_info_dict, api_name, real_data_path='') if not need_grad: - print_warn_log("%s function with out=... arguments don't support automatic differentiation, skip backward." + logger.warning("%s function with out=... arguments don't support automatic differentiation, skip backward." % api_full_name) npu_args, npu_kwargs = generate_device_params(args, kwargs, False, api_name) if kwargs.get("device"): @@ -78,9 +78,9 @@ def run_torch_api(api_full_name, api_info_dict): cpu_overflow = check_data_overflow(out) npu_overflow = torch_npu.npu.utils.npu_check_overflow(npu_out) if cpu_overflow == npu_overflow: - print_warn_log("The %s overflow is a normal overflow." % api_full_name) + logger.warning("The %s overflow is a normal overflow." % api_full_name) else: - print_warn_log("The %s overflow is an abnormal overflow." % api_full_name) + logger.warning("The %s overflow is an abnormal overflow." % api_full_name) return @@ -111,11 +111,11 @@ def _run_overflow_check_command(args): try: torch.npu.set_device(npu_device) except Exception as error: - print_error_log(f"Set NPU device id failed. device id is: {args.device_id}") + logger.error(f"Set NPU device id failed. device id is: {args.device_id}") raise NotImplementedError from error run_overflow_check(api_info) if __name__ == '__main__': _run_overflow_check() - print_info_log("UT task completed.") + logger.info("UT task completed.") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_ut.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_ut.py index 59ccea4bc6687b289608a6c26aea6938a070d80b..77f3bf714a37eede95de04fdf1240ff655451e3a 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_ut.py +++ b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_ut.py @@ -27,9 +27,9 @@ from atat.pytorch.hook_module.wrap_functional import FunctionalOPTemplate from atat.pytorch.hook_module.wrap_torch import TorchOPTemplate from atat.pytorch.api_accuracy_checker.common.config import msCheckerConfig from atat.pytorch.common.parse_json import parse_json_info_forward_backward -from atat.pytorch.common.file_check import FileOpen, FileCheckConst, FileChecker, \ +from atat.core.common.file_check import FileOpen, FileCheckConst, FileChecker, \ change_mode, check_file_suffix, check_link, check_path_before_create, create_directory -from atat.pytorch.common.log import print_info_log, print_warn_log, print_error_log +from atat.pytorch.common.log import logger from atat.pytorch.common.utils import Const current_time = time.strftime("%Y%m%d%H%M%S") @@ -150,12 +150,12 @@ def generate_cpu_params(input_args, input_kwargs, need_backward, api_name): def run_ut(config): - print_info_log("start UT test") - print_info_log(f"UT task result will be saved in {config.result_csv_path}") - print_info_log(f"UT task details will be saved in {config.details_csv_path}") + logger.info("start UT test") + logger.info(f"UT task result will be saved in {config.result_csv_path}") + logger.info(f"UT task details will be saved in {config.details_csv_path}") if config.save_error_data: error_data_path = os.path.abspath(os.path.join(msCheckerConfig.error_data_path, UT_ERROR_DATA_DIR)) - print_info_log(f"UT task error_datas will be saved 
in {error_data_path}") + logger.info(f"UT task error_datas will be saved in {error_data_path}") compare = Comparator(config.result_csv_path, config.details_csv_path, config.is_continue_run_ut) with FileOpen(config.result_csv_path, 'r') as file: csv_reader = csv.reader(file) @@ -182,10 +182,10 @@ def run_ut(config): except Exception as err: [_, api_name, _] = api_full_name.split(Const.SEP) if "expected scalar type Long" in str(err): - print_warn_log(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " + logger.warning(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " f"'int32_to_int64' list in accuracy_tools/api_accuracy_check/common/utils.py file.") else: - print_error_log(f"Run {api_full_name} UT Error: %s" % str(err)) + logger.error(f"Run {api_full_name} UT Error: %s" % str(err)) compare.write_summary_csv((api_full_name, "SKIP", "SKIP", str(err))) finally: if is_gpu: @@ -202,7 +202,7 @@ def is_unsupported_api(api_name): split_name = api_name.split(Const.SEP)[0] flag = split_name in [Const.NPU, Const.DISTRIBUTED] if flag: - print_info_log(f"{split_name} api is not supported for run ut. SKIP.") + logger.info(f"{split_name} api is not supported for run ut. SKIP.") return flag @@ -226,11 +226,11 @@ def run_torch_api(api_full_name, real_data_path, backward_content, api_info_dict in_fwd_data_list.append(kwargs) need_backward = api_full_name in backward_content if not need_grad: - print_warn_log("%s function with out=... arguments don't support automatic differentiation, skip backward." + logger.warning("%s function with out=... arguments don't support automatic differentiation, skip backward." % api_full_name) if api_name in not_backward_list: need_grad = False - print_warn_log( + logger.warning( "%s function backward result is None, skip backward." % api_full_name) need_backward = need_backward and need_grad if kwargs.get("device"): @@ -377,7 +377,7 @@ def preprocess_forward_content(forward_content): existing_kwargs = processed_content[variant].get('kwargs', {}) filtered_existing_args = [{k: v for k, v in arg.items() if k not in ['Max', 'Min']} for arg in existing_args if isinstance(arg, dict)] except KeyError as e: - print_error_log(f"KeyError: {e} when processing {key}") + logger.error(f"KeyError: {e} when processing {key}") if filtered_existing_args == filtered_new_args and existing_kwargs == new_kwargs: is_duplicate = True break @@ -408,7 +408,7 @@ def run_ut_command(args): else: torch.npu.set_device(used_device) except Exception as error: - print_error_log(f"Set device id failed. device id is: {args.device_id}") + logger.error(f"Set device id failed. 
device id is: {args.device_id}") raise NotImplementedError from error check_link(args.api_info_file) api_info = os.path.realpath(args.api_info_file) @@ -451,4 +451,4 @@ class UtDataInfo: if __name__ == '__main__': _run_ut() - print_info_log("UT task completed.") + logger.info("UT task completed.") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_data_generate.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_data_generate.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/debug/accuracy_tools/atat/pytorch/common/__init__.py b/debug/accuracy_tools/atat/pytorch/common/__init__.py index b391e103115498a2c2cf8b78f48168822517be73..8283aa502195cf2cf3cdac1260414b2dbc36a6dd 100644 --- a/debug/accuracy_tools/atat/pytorch/common/__init__.py +++ b/debug/accuracy_tools/atat/pytorch/common/__init__.py @@ -1,4 +1,2 @@ -from .recursive import recursive_apply_transform -from .log import print_error_log_rank_0, print_info_log_rank_0, print_warn_log_rank_0 from .parse_json import parse_json_info_forward_backward from .utils import seed_all diff --git a/debug/accuracy_tools/atat/pytorch/common/log.py b/debug/accuracy_tools/atat/pytorch/common/log.py index dddbdbee3e15d210c26df504d3b8a641622ceecc..e496e9b72ad449c24dd3f2a76a9a149d0f2eff1e 100644 --- a/debug/accuracy_tools/atat/pytorch/common/log.py +++ b/debug/accuracy_tools/atat/pytorch/common/log.py @@ -1,68 +1,32 @@ import os import time import sys +from atat.pytorch.common.utils import get_rank_if_initialized +from atat.core.common.log import BaseLogger +from atat.core.common.exceptions import DistributedNotInitializedError -from .utils import get_rank_if_initialized -from .exceptions import DistributedNotInitializedError +class PyTorchLogger(BaseLogger): + def __init__(self): + super().__init__() -def on_rank_0(func): - def func_rank_0(*args, **kwargs): + def get_rank(self): try: current_rank = get_rank_if_initialized() except DistributedNotInitializedError: current_rank = None + return current_rank - if current_rank is None or current_rank == 0: - return func(*args, **kwargs) - - return func_rank_0 + def _print_log(self, level, msg, end='\n'): + current_rank = self.get_rank() + current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + pid = os.getpid() + if current_rank is not None: + full_msg = f"{current_time} ({pid}) [rank {current_rank}] [{level}] {msg}" + else: + full_msg = f"{current_time} ({pid}) [{level}] {msg}" + print(full_msg, end=end) + sys.stdout.flush() -def _print_log(level, msg, end='\n'): - current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(time.time()))) - pid = os.getpid() - full_msg = current_time + "(" + str(pid) + ")-[" + level + "]" + msg - try: - current_rank = get_rank_if_initialized() - except DistributedNotInitializedError: - current_rank = None - if current_rank is not None: - full_msg = f"[rank {current_rank}]-" + full_msg - print(full_msg, end=end) - sys.stdout.flush() - - -def print_info_log(info_msg, end='\n'): - """ - Function Description: - print info log. - Parameter: - info_msg: the info message. - """ - _print_log("INFO", info_msg, end=end) - - -def print_error_log(error_msg): - """ - Function Description: - print error log. - Parameter: - error_msg: the error message. - """ - _print_log("ERROR", error_msg) - - -def print_warn_log(warn_msg): - """ - Function Description: - print warn log. - Parameter: - warn_msg: the warning message. 
- """ - _print_log("WARNING", warn_msg) - - -print_info_log_rank_0 = on_rank_0(print_info_log) -print_warn_log_rank_0 = on_rank_0(print_warn_log) -print_error_log_rank_0 = on_rank_0(print_error_log) +logger = PyTorchLogger() \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/common/parse_json.py b/debug/accuracy_tools/atat/pytorch/common/parse_json.py index dc594c4cf818ed0bedd8d3997e2848d9fe123a17..a938f5f0da9ea465923157ad4131ce72bb92962f 100644 --- a/debug/accuracy_tools/atat/pytorch/common/parse_json.py +++ b/debug/accuracy_tools/atat/pytorch/common/parse_json.py @@ -1,5 +1,5 @@ import json -from .exceptions import ParseJsonException +from atat.core.common.exceptions import ParseJsonException def parse_json_info_forward_backward(json_path): diff --git a/debug/accuracy_tools/atat/pytorch/common/recursive.py b/debug/accuracy_tools/atat/pytorch/common/recursive.py deleted file mode 100644 index 9b222f5f52126d9f4c933a9dd9c2ed7b7f665fc8..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/common/recursive.py +++ /dev/null @@ -1,31 +0,0 @@ -import numpy as np -import torch - -from .log import print_warn_log - -_recursive_key_stack = [] -special_type = (torch.device, torch.dtype, torch.Size, torch.Tensor, np.integer, np.floating, np.bool_, np.complexfloating, \ - np.str_, np.byte, np.unicode_, bool, int, float, str, slice) - - -def recursive_apply_transform(args, transform): - global _recursive_key_stack - if isinstance(args, special_type): - arg_transform = transform(args, _recursive_key_stack) - return arg_transform - elif isinstance(args, (list, tuple)): - transform_result = [] - for i, arg in enumerate(args): - _recursive_key_stack.append(str(i)) - transform_result.append(recursive_apply_transform(arg, transform)) - _recursive_key_stack.pop() - return type(args)(transform_result) - elif isinstance(args, dict): - transform_dict = {} - for k, arg in args.items(): - _recursive_key_stack.append(str(k)) - transform_dict[k] = recursive_apply_transform(arg, transform) - _recursive_key_stack.pop() - return transform_dict - elif args is not None: - print_warn_log(f"Data type {type(args)} is not supported.") diff --git a/debug/accuracy_tools/atat/pytorch/common/utils.py b/debug/accuracy_tools/atat/pytorch/common/utils.py index aa4a3214071121ead083d9195752ac7db8e3ef07..d0591c0ff8dbccfbed34b69c9f7c30b27033afca 100644 --- a/debug/accuracy_tools/atat/pytorch/common/utils.py +++ b/debug/accuracy_tools/atat/pytorch/common/utils.py @@ -15,14 +15,12 @@ # limitations under the License. 
""" import os -import re import random import stat import torch import numpy as np from functools import wraps - -from .exceptions import DistributedNotInitializedError +from atat.core.common.exceptions import DistributedNotInitializedError try: import torch_npu diff --git a/debug/accuracy_tools/atat/pytorch/compare/acc_compare.py b/debug/accuracy_tools/atat/pytorch/compare/acc_compare.py index 12f790fbeb77c8d04c1bdf2da26f59c9e8ce89c6..328a88ad55fa5621ee7956b3d9ed3da6abf9993d 100644 --- a/debug/accuracy_tools/atat/pytorch/compare/acc_compare.py +++ b/debug/accuracy_tools/atat/pytorch/compare/acc_compare.py @@ -18,11 +18,8 @@ import json import multiprocessing import os.path -import stat import sys -import math import torch - import numpy as np import pandas as pd import openpyxl @@ -30,14 +27,14 @@ from openpyxl.styles import PatternFill from collections import namedtuple from dataclasses import dataclass -from .match import graph_mapping -from .highlight import HighlightRules, get_header_index -from .npy_compare import compare_ops_apply, get_error_type, reshape_value, get_relative_err, get_error_message -from ..advisor.advisor import Advisor -from ...core.utils import check_compare_param, add_time_with_xlsx, CompareException, CompareConst, \ - format_value, check_file_not_exists, check_configuration_param, task_dumppath_get, print_info_log, \ - print_warn_log, print_error_log, Const -from ...core.file_check_util import FileChecker, FileCheckConst, change_mode, FileOpen, create_directory +from atat.pytorch.compare.match import graph_mapping +from atat.pytorch.compare.highlight import HighlightRules, get_header_index +from atat.pytorch.compare.npy_compare import compare_ops_apply, get_error_type, reshape_value, get_relative_err, get_error_message +from atat.pytorch.advisor.advisor import Advisor +from atat.pytorch.common.log import logger +from atat.core.common.utils import check_compare_param, add_time_with_xlsx, CompareException, CompareConst, \ + format_value, check_file_not_exists, check_configuration_param, task_dumppath_get, Const +from atat.core.common.file_check import FileChecker, FileCheckConst, change_mode, FileOpen, create_directory def check_graph_mode(a_op_name, b_op_name): @@ -61,7 +58,7 @@ def check_op(npu_dict, bench_dict, fuzzy_match): try: is_match = fuzzy_check_op(a_op_name, b_op_name) except Exception as err: - print_warn_log("%s and %s can not fuzzy match." % (a_op_name, b_op_name)) + logger.warning("%s and %s can not fuzzy match." 
% (a_op_name, b_op_name)) is_match = False return is_match and struct_match @@ -309,7 +306,7 @@ def _do_multi_process(input_parma, result_df): result_df = _handle_multi_process(compare_ops, input_parma, result_df, multiprocessing.Manager().RLock()) return result_df except ValueError as e: - print_error_log('result dataframe is not found.') + logger.error('result dataframe is not found.') raise CompareException(CompareException.INVALID_DATA_ERROR) from e @@ -324,10 +321,10 @@ def read_dump_data(result_df): op_name_mapping_dict[npu_dump_name] = [npu_dump_tensor, npu_dump_tensor] return op_name_mapping_dict except ValueError as e: - print_error_log('result dataframe is not found.') + logger.error('result dataframe is not found.') raise CompareException(CompareException.INVALID_DATA_ERROR) from e except IndexError as e: - print_error_log('result dataframe elements can not be access.') + logger.error('result dataframe elements can not be access.') raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from e @@ -345,11 +342,11 @@ def _handle_multi_process(func, input_parma, result_df, lock): pool = multiprocessing.Pool(process_num) def err_call(args): - print_error_log('multiprocess compare failed! Reason: {}'.format(args)) + logger.error('multiprocess compare failed! Reason: {}'.format(args)) try: pool.terminate() except OSError as e: - print_error_log("pool terminate failed") + logger.error("pool terminate failed") for process_idx, df_chunk in enumerate(df_chunks): idx = df_chunk_size * process_idx @@ -374,11 +371,11 @@ def compare_ops(idx, dump_path_dict, result_df, lock, input_parma): for i in range(len(result_df)): op_name = result_df.iloc[i, 0] if is_print_compare_log: - print_info_log("start compare: {}".format(op_name)) + logger.info("start compare: {}".format(op_name)) cos_sim, max_abs_err, max_relative_err, one_thousand_err_ratio, five_thousand_err_ratio, err_msg = compare_by_op( op_name, dump_path_dict, input_parma) if is_print_compare_log: - print_info_log( + logger.info( "[{}] Compare result: cosine {}, max_abs_err {}, max_relative_err {}, {}, one_thousand_err_ratio {}, " "five_thousand_err_ratio {}".format(op_name, cos_sim, max_abs_err, max_relative_err, err_msg, one_thousand_err_ratio, five_thousand_err_ratio)) @@ -437,10 +434,10 @@ def _save_cmp_result(offset, result: ComparisonResult, result_df, lock): result_df.loc[process_index, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO] = result.five_thousand_err_ratio_result[i] return result_df except ValueError as e: - print_error_log('result dataframe is not found.') + logger.error('result dataframe is not found.') raise CompareException(CompareException.INVALID_DATA_ERROR) from e except IndexError as e: - print_error_log('result dataframe elements can not be access.') + logger.error('result dataframe elements can not be access.') raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from e finally: lock.release() @@ -456,7 +453,7 @@ def check_accuracy(cos, max_abs_err): try: cos, max_abs_err = float(cos), float(max_abs_err) except ValueError: - print_warn_log("Cosine or MaxAbsErr can not get float value.") + logger.warning("Cosine or MaxAbsErr can not get float value.") return CompareConst.NONE if cos < CompareConst.COS_THRESHOLD and max_abs_err > CompareConst.MAX_ABS_ERR_THRESHOLD: return CompareConst.ACCURACY_CHECK_NO @@ -615,7 +612,7 @@ def find_compare_result_error_rows(result_df, highlight_dict, summary_compare): def highlight_rows_xlsx(result_df, highlight_dict, file_path): """Write and highlight results in 
Excel""" - print_info_log('Compare result is %s' % file_path) + logger.info('Compare result is %s' % file_path) wb = openpyxl.Workbook() ws = wb.active @@ -648,7 +645,7 @@ def compare(input_parma, output_path, stack_mode=False, auto_analyze=True, create_directory(output_path) check_compare_param(input_parma, output_path, stack_mode, summary_compare, md5_compare) except CompareException as error: - print_error_log('Compare failed. Please check the arguments and do it again!') + logger.error('Compare failed. Please check the arguments and do it again!') sys.exit(error.code) compare_core(input_parma, output_path, stack_mode=stack_mode, auto_analyze=auto_analyze, fuzzy_match=fuzzy_match, summary_compare=summary_compare, @@ -681,7 +678,7 @@ def compare_core(input_parma, output_path, **kwargs): summary_compare = kwargs.get('summary_compare', False) md5_compare = kwargs.get('md5_compare', False) - print_info_log("Please check whether the input data belongs to you. If not, there may be security risks.") + logger.info("Please check whether the input data belongs to you. If not, there may be security risks.") file_name = add_time_with_xlsx("compare_result" + suffix) file_path = os.path.join(os.path.realpath(output_path), file_name) check_file_not_exists(file_path) @@ -704,7 +701,7 @@ def compare_core(input_parma, output_path, **kwargs): def parse(pkl_file, module_name_prefix): if not isinstance(module_name_prefix, str): - print_error_log("The parameter:module_name_prefix is not a string.") + logger.error("The parameter:module_name_prefix is not a string.") raise CompareException(CompareException.INVALID_PARAM_ERROR) with FileOpen(pkl_file, "r") as f: done = False @@ -723,18 +720,18 @@ def parse(pkl_file, module_name_prefix): continue if info_prefix.find("stack_info") != -1: - print_info_log("\nTrace back({}):".format(msg[0])) + logger.info("\nTrace back({}):".format(msg[0])) for item in reversed(msg[1]): - print_info_log(" File \"{}\", line {}, in {}".format(item[0], item[1], item[2])) - print_info_log(" {}".format(item[3])) + logger.info(" File \"{}\", line {}, in {}".format(item[0], item[1], item[2])) + logger.info(" {}".format(item[3])) continue if len(msg) > 5: summary_info = " [{}][dtype: {}][shape: {}][max: {}][min: {}][mean: {}]" \ .format(msg[0], msg[3], msg[4], msg[5][0], msg[5][1], msg[5][2]) if not title_printed: - print_info_log("\nStatistic Info:") + logger.info("\nStatistic Info:") title_printed = True - print_info_log(summary_info) + logger.info(summary_info) def op_item_parse(item, op_name, index, item_list=[], top_bool=True): @@ -880,7 +877,7 @@ def compare_process(file_handles, stack_mode, fuzzy_match, summary_compare=False stack_json_data = json.load(stack_json_handle) if fuzzy_match: - print_warn_log("This task uses fuzzy matching, which may affect the accuracy of the comparison.") + logger.warning("This task uses fuzzy matching, which may affect the accuracy of the comparison.") npu_ops_queue = [] bench_ops_queue = [] diff --git a/debug/accuracy_tools/atat/pytorch/compare/distributed_compare.py b/debug/accuracy_tools/atat/pytorch/compare/distributed_compare.py index 09d40b214d5bc2ae67480d78c9255d9e0326567a..b89adc1581e8b0cf76ca28c41dbf0e86738ebece 100644 --- a/debug/accuracy_tools/atat/pytorch/compare/distributed_compare.py +++ b/debug/accuracy_tools/atat/pytorch/compare/distributed_compare.py @@ -17,10 +17,11 @@ import os import sys import re -from ...core.utils import print_error_log, CompareException, check_compare_param, \ +from atat.core.common.utils import 
CompareException, check_compare_param, \ check_configuration_param, task_dumppath_get, check_file_or_directory_path, check_regex_prefix_format_valid -from .acc_compare import compare_core -from ...core.file_check_util import create_directory +from atat.pytorch.compare.acc_compare import compare_core +from atat.core.common.file_check import create_directory +from atat.pytorch.common.log import logger def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): @@ -46,7 +47,7 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): pattern = re.compile(rf'^{prefix}(?:0|[0-9][1-9]*)?$') for name in contents: if not pattern.match(name): - print_error_log( + logger.error( f"dump_dir contains '{name}'. Expected '{prefix}'. This name is not in the format of dump " f"output. Please check and delete irrelevant files in {dump_dir} and try again." ) @@ -66,12 +67,12 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): # Provide robustness on invalid directory inputs if not json_path: - print_error_log(f'No file is found in dump dir {dirname}. ') + logger.error(f'No file is found in dump dir {dirname}. ') raise CompareException(CompareException.NO_DUMP_FILE_ERROR) return json_path if kwargs.get('suffix'): - print_error_log("Argument 'suffix' is not supported for compare_distributed.") + logger.error("Argument 'suffix' is not supported for compare_distributed.") raise CompareException(CompareException.INVALID_PARAM_ERROR) stack_mode = kwargs.get('stack_mode', False) auto_analyze = kwargs.get('auto_analyze', True) @@ -80,7 +81,7 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): npu_ranks = sorted(check_and_return_dir_contents(npu_dump_dir, 'rank')) bench_ranks = sorted(check_and_return_dir_contents(bench_dump_dir, 'rank')) if len(npu_ranks) != len(bench_ranks): - print_error_log('The number of ranks in the two runs are different. ' + logger.error('The number of ranks in the two runs are different. ' 'Unable to match the ranks. Please use another folder to compare ' 'or use compare() api and manually match the ranks.') raise CompareException(CompareException.INVALID_PATH_ERROR) @@ -104,7 +105,7 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): create_directory(output_path) check_compare_param(dump_result_param, output_path, stack_mode=stack_mode, summary_compare=summary_compare) except CompareException as error: - print_error_log('Compare failed. Please check the arguments and do it again!') + logger.error('Compare failed. 
Please check the arguments and do it again!')
         sys.exit(error.code)
     compare_core(dump_result_param, output_path, suffix=f'_{nr}-{br}', summary_compare=summary_compare,
                  md5_compare=md5_compare, **kwargs)
diff --git a/debug/accuracy_tools/atat/pytorch/compare/highlight.py b/debug/accuracy_tools/atat/pytorch/compare/highlight.py
index fdc11303003d20d2d89e2667355b2193a3a9db48..d94e86b013b9a89c1fe0c412db612ccef3e51991 100644
--- a/debug/accuracy_tools/atat/pytorch/compare/highlight.py
+++ b/debug/accuracy_tools/atat/pytorch/compare/highlight.py
@@ -1,7 +1,7 @@
 import math
 import abc
 import numpy as np
-from ...core.utils import CompareConst, get_header_index
+from atat.core.common.utils import CompareConst, get_header_index
 
 
 class HighlightCheck(abc.ABC):
diff --git a/debug/accuracy_tools/atat/pytorch/compare/match.py b/debug/accuracy_tools/atat/pytorch/compare/match.py
index 51fb2fb6666756d39db9003b87ef7c3a71b4080b..148fbb7d640b3fde5e3292508c29404e277cde84 100644
--- a/debug/accuracy_tools/atat/pytorch/compare/match.py
+++ b/debug/accuracy_tools/atat/pytorch/compare/match.py
@@ -1,7 +1,7 @@
 import os
 import yaml
-from ...core.file_check_util import FileOpen
-from ...core.utils import CompareException
+from atat.core.common.file_check import FileOpen
+from atat.core.common.utils import CompareException
 
 
 class AtenIrMapping():
diff --git a/debug/accuracy_tools/atat/pytorch/compare/npy_compare.py b/debug/accuracy_tools/atat/pytorch/compare/npy_compare.py
index b94a83f1349652ae1b5656dc4e293ea823f2ae68..2e1f22ab3f5243fa4bb6b2e9bb6e7094680081b2 100644
--- a/debug/accuracy_tools/atat/pytorch/compare/npy_compare.py
+++ b/debug/accuracy_tools/atat/pytorch/compare/npy_compare.py
@@ -1,6 +1,7 @@
 import abc
 import numpy as np
-from ...core.utils import CompareConst, Const, print_warn_log, format_value
+from atat.core.common.utils import CompareConst, Const, format_value
+from atat.pytorch.common.log import logger
 
 
 def handle_inf_nan(n_value, b_value):
@@ -69,7 +70,7 @@ def get_error_message(n_value, b_value, op_name, error_flag, error_file=None):
         if not n_value.shape:
             return "This is type of scalar data, can not compare."
         if n_value.dtype != b_value.dtype:
-            print_warn_log("Dtype of NPU and bench Tensor do not match: {}".format(op_name))
+            logger.warning("Dtype of NPU and bench Tensor do not match: {}".format(op_name))
             return "Dtype of NPU and bench Tensor do not match."
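get_error_message above reports NPU/bench dtype mismatches before any numeric compare, and check_dtype_comparable earlier in this patch applies the same family rule to decide whether a compare makes sense at all. A standalone equivalent using NumPy dtype kind codes; the original enumerates Const.FLOAT_TYPE and Const.INT_TYPE lists, so the kind codes here are a compact substitute, not the tool's implementation:

```python
import numpy as np


def dtypes_comparable(x, y):
    """Floats compare with floats, bools/ints with bools/ints;
    any other pairing is rejected, as in check_dtype_comparable."""
    float_kinds = {"f"}
    int_kinds = {"b", "i", "u"}  # bool, signed int, unsigned int
    if x.dtype.kind in float_kinds:
        return y.dtype.kind in float_kinds
    if x.dtype.kind in int_kinds:
        return y.dtype.kind in int_kinds
    return False


print(dtypes_comparable(np.zeros(1, np.float16), np.zeros(1, np.float32)))  # True
print(dtypes_comparable(np.zeros(1, np.int8), np.zeros(1, np.float32)))     # False
```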
return "" diff --git a/debug/accuracy_tools/atat/pytorch/debugger/debugger_config.py b/debug/accuracy_tools/atat/pytorch/debugger/debugger_config.py index 451410dc96a3860a55952fb41175f85f7af63abc..6f2bfe8551062e82d552661f186130195b41f4dd 100644 --- a/debug/accuracy_tools/atat/pytorch/debugger/debugger_config.py +++ b/debug/accuracy_tools/atat/pytorch/debugger/debugger_config.py @@ -1,5 +1,7 @@ -from ..common import print_warn_log_rank_0, seed_all -from ...core.utils import Const +from atat.pytorch.common import seed_all +from atat.pytorch.common.log import logger +from atat.core.common.utils import Const + class DebuggerConfig: def __init__(self, common_config, task_config, task, dump_path, level): @@ -20,12 +22,9 @@ class DebuggerConfig: self.is_forward_acl_dump = True self.summary_mode = task_config.summary_mode if task_config.summary_mode else Const.STATISTICS self.overflow_num = task_config.overflow_num if task_config.overflow_num else 1 - self.repair_scope = None - self.repair_api_str = None - self.on_step_end = None - self.repair_type = None - - if self.task == "free_benchmark": + self.framework = Const.PT_FRAMEWORK + + if self.task == Const.FREE_BENCHMARK: self.fuzz_device = task_config.fuzz_device if task_config.fuzz_device else 'npu' self.handler_type = task_config.handler_type if task_config.handler_type else 'check' self.pert_mode = task_config.pert_mode if task_config.pert_mode else 'improve_precision' @@ -79,11 +78,10 @@ class DebuggerConfig: if not isinstance(rank_id, int) or rank_id < 0: raise ValueError(f"rank {self.rank} must be an integer and greater than or equal to 0.") else: - print_warn_log_rank_0(f"Rank argument is provided. Only rank {self.rank} data will be dumpped.") + logger.warning_on_rank_0(f"Rank argument is provided. Only rank {self.rank} data will be dumpped.") def _check_step(self): if self.step: for s in self.step: if not isinstance(s, int) or s < 0: raise ValueError(f"step element {s} must be an integer and greater than or equal to 0.") - diff --git a/debug/accuracy_tools/atat/pytorch/debugger/precision_debugger.py b/debug/accuracy_tools/atat/pytorch/debugger/precision_debugger.py index 8d67ae9ba6e2e5da6b115bad238d966611223ea8..140d829bedc6fb1243820c0d0ccc84af42f01424 100644 --- a/debug/accuracy_tools/atat/pytorch/debugger/precision_debugger.py +++ b/debug/accuracy_tools/atat/pytorch/debugger/precision_debugger.py @@ -1,10 +1,10 @@ import torch from torch.utils.data import dataloader -from .debugger_config import DebuggerConfig -from ..service import Service -from ..common import print_warn_log_rank_0 -from ..pt_config import parse_json_config -from ..common.exceptions import MsaccException +from atat.pytorch.debugger.debugger_config import DebuggerConfig +from atat.pytorch.service import Service +from atat.pytorch.common.log import logger +from atat.pytorch.pt_config import parse_json_config +from atat.core.common.exceptions import MsaccException class PrecisionDebugger: @@ -39,7 +39,7 @@ class PrecisionDebugger: self.service = Service(self.config) self.enable_dataloader = self.config.enable_dataloader if self.enable_dataloader: - print_warn_log_rank_0("The enable_dataloader feature will be deprecated in the future.") + logger.warning_on_rank_0("The enable_dataloader feature will be deprecated in the future.") dataloader._BaseDataLoaderIter.__next__ = iter_tracer(dataloader._BaseDataLoaderIter.__next__) @property @@ -52,7 +52,7 @@ class PrecisionDebugger: if not instance: raise Exception("No instance of PrecisionDebugger found.") if 
instance.enable_dataloader: - print_warn_log_rank_0("DataLoader is enabled, start() skipped.") + logger.warning_on_rank_0("DataLoader is enabled, start() skipped.") else: instance.service.start(instance.model) @@ -62,7 +62,7 @@ class PrecisionDebugger: if not instance: raise Exception("PrecisionDebugger instance is not created.") if instance.enable_dataloader: - print_warn_log_rank_0("DataLoader is enabled, stop() skipped.") + logger.warning_on_rank_0("DataLoader is enabled, stop() skipped.") else: instance.service.stop() diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl.png b/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl.png index c64e9380c6d9c01bb2ad18c81e430ead0800bb7d..863708bf6daf46985328f0dc42d48f0a5b849af5 100644 Binary files a/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl.png and b/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl.png differ diff --git a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_compare.md b/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_compare.md index 9beda3b02f2d72383a2bcaa4c20bcd9c5b8ba971..e3537594c4f8c9e277ca867172875e3e28c23113 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_compare.md +++ b/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_compare.md @@ -36,7 +36,7 @@ compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) | -------------- | ------------------------------------------------------------ | -------- | | npu_dump_dir | 配置NPU环境下的dump目录。dump数据目录须指定到step级。参数示例:'./npu_dump/step0'。数据类型:str。 | 是 | | bench_dump_dir | 配置CPU、GPU或NPU环境下的dump目录。参数示例:'./gpu_dump/step0'。数据类型:str。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.csv`。数据类型:str。 | 是 | +| output_path | 配置比对结果文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.xlsx`。数据类型:str。 | 是 | | **kwargs | 支持compare的所有可选参数。 | 否 | **函数示例** @@ -67,7 +67,7 @@ compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_mat | 参数名 | 说明 | 是否必选 | | ------------ | ------------------------------------------------------------ | -------- | | input_param | 配置dump数据文件及目录。数据类型:dict。配置参数包括:
"npu_json_path":指定NPU dump目录下的dump.json文件。参数示例:"npu_json_path": "./npu_dump/dump.json"。必选。
"bench_json_path":指定CPU、GPU或NPU dump目录下的dump.json文件。参数示例:"bench_json_path": "./gpu_dump/dump.json"。必选。
"stack_json_path":指定NPU dump目录下的stack.json文件。参数示例:"stack_json_path": "./npu_dump/stack.json"。可选。
"is_print_compare_log":配置是否开启日志打屏。可取值True或False。可选。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。数据类型:str。 | 是 | +| output_path | 配置比对结果文件存盘目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.xlsx`。数据类型:str。 | 是 | | stack_mode | 配置stack_mode的开关。仅当配置"stack_json_path"需要开启。可取值True或False,参数示例:stack_mode=True,默认为False。数据类型:bool。 | 否 | | auto_analyze | 自动精度分析,开启后工具自动针对比对结果进行分析,识别到第一个精度不达标节点(在比对结果文件中的“Accuracy Reached or Not”列显示为No),并给出问题可能产生的原因(打屏展示并生成advisor_{timestamp}.txt文件)。可取值True或False,参数示例:auto_analyze=False,默认为True。数据类型:bool。 | 否 | | fuzzy_match | 模糊匹配。开启后,对于网络中同一层级且命名仅调用次数不同的API,可匹配并进行比对。可取值True或False,参数示例:fuzzy_match=True,默认为False。数据类型:bool。 | 否 | @@ -108,7 +108,7 @@ compare(dump_result_param, output_path="./output", stack_mode=True) **比对结果** -数据量比对同样生成`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.csv`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.csv`主要有如下两种情况: +数据量比对同样生成`compare_result_{timestamp}.xlsx`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.xlsx`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.xlsx`主要有如下两种情况: - "summary_mode": "statistics"时比对dump.json文件: @@ -122,13 +122,28 @@ compare(dump_result_param, output_path="./output", stack_mode=True) 上图是对dump.json文件中NPU及标杆API的MD5信息进行比对,判断API数据的完整性,文件中记录NPU及标杆API的基本信息和MD5信息,其中需要关注Result列,包含结果:Pass(表示NPU与标杆的MD5值一致,即API数据完整);Different(表示NPU与标杆的MD5值不一致,即API数据不完全一致,可以通过NPU_Stack_Info列API调用栈查询该API的详细信息);Nan(表示MD5信息数据没有匹配上)。 -## 计算精度评价指标 +## 比对结果分析 -通过计算精度评价指标可以直接从精度比对结果文件中找出不符合精度标准的算子。 +PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度评价指标判断API在运行时是否存在精度问题。 -PyTorch精度比对是以CPU或GPU的计算结果为标杆,计算Cosine(余弦相似度)、MaxAbsErr(最大绝对误差)和MaxRelativeErr(最大相对误差),根据这两个结果判断API在运行时是否存在精度问题。 +- `advisor_{timestamp}.txt`文件中给出了可能存在精度问题的API的专家建议,可直接打开查看。 -计算精度评价指标: +- `compare_result_{timestamp}.xlsx`文件列出了所有执行精度比对的API详细信息和比对结果,如下示例: + + ![compare_result](https://gitee.com/cai-weiwei1989/att_ptdbg/raw/master/debug/accuracy_tools/ptdbg_ascend/doc/img/compare_result.png) + + 可以从该结果文件中进行“**判断计算精度达标情况**”、“**计算精度评价指标分析**”以及“**异常信息识别**”等分析动作。 + +### **判断计算精度达标情况** + +精度比对结果`compare_result_{timestamp}.xlsx`文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: + +1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 +2. Cosine < 0.9,精度不达标,标记为“No”。 +3. MaxAbsError > 1,精度不达标,标记为“No”。 +4. 其余情况下记为精度达标,标记为“Yes”。 + +### **计算精度评价指标分析** 1. Cosine:通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。在计算中可能会存在nan,主要由于可能会出现其中一个向量为0。 @@ -140,12 +155,20 @@ PyTorch精度比对是以CPU或GPU的计算结果为标杆,计算Cosine(余 4. One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指NPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 -精度比对结果csv文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: +### **异常信息识别** -1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 -2. Cosine < 0.9,精度不达标,标记为“No”。 -3. MaxAbsError > 1,精度不达标,标记为“No”。 -4. 
其余情况下记为精度达标,标记为“Yes”。 +精度比对结果`compare_result_{timestamp}.xlsx`文件中对于存在异常信息的API会进行高亮处理: + +- 红色可能出现的情况有: + - NPU max或NPU min信息中存在nan/inf + - Max diff存在大于1e+10的值 + - 统计数据中output的Max diff除以max(0.01, Bench max) > 0.5 + - 真实数据中One Thousandth Err Ratio的input > 0.9同时output < 0.6 +- 黄色可能出现的情况有: + - Max diff的input与output都大于1,同时output比input大一个数量级以上 + - 统计数据Max diff除以max(0.01, Bench max)的output > 0.1同时input < 0.01 + - 真实数据One Thousandth Err Ratio的input - output > 0.1 + - 真实数据Cosine的input - output > 0.1 # FAQ diff --git a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_quickstart.md b/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_quickstart.md index ae6e3b0b4bbad4796b0332ee8a41b3ae14e5f94e..c05302055687fdf6071befd7ff8ad77c9e32f2df 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_quickstart.md +++ b/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_quickstart.md @@ -98,7 +98,7 @@ PyTorch训练场景的精度问题分析建议参考以下思路进行精度比 python3 compare.py ``` - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` + 在output目录下生成结果文件,包括:`compare_result_{timestamp}.xlsx`和`advisor_{timestamp}.txt` 4. 找出存在问题的API。 @@ -106,7 +106,7 @@ python3 compare.py ![auto_analyze_log](img/auto_analyze_log.png) - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 + 2. 根据第2步结果文件`compare_result_{timestamp}.xlsx`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 5. (可选)重新比对。 diff --git "a/debug/accuracy_tools/atat/pytorch/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" "b/debug/accuracy_tools/atat/pytorch/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" new file mode 100644 index 0000000000000000000000000000000000000000..b2e373feb6cf7cb0c63dcff592939567b52738b4 --- /dev/null +++ "b/debug/accuracy_tools/atat/pytorch/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" @@ -0,0 +1,90 @@ +# **PyTorch NPU在线精度比对工具使用指南** + +PyTorch NPU在线精度比对是ptdbg_ascend工具实现在PyTorch训练过程中直接完成精度比对并输出比对结果的功能。 + +在线精度比对实现的是NPU与CPU之间的精度比对。 + +## PyTorch NPU在线精度比对总体流程 + +1. 准备NPU训练工程。 + +2. 在NPU环境下安装ptdbg_ascend工具,参见《[PyTorch精度工具](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 + +3. 在训练脚本内插入ptdbg_ascend工具在线精度比对接口。 + +4. 执行训练并获取在线精度比对NPU和CPU分别执行后的精度比对结果。 + +5. 比对结果分析。 + +## PyTorch NPU在线精度比对 +### 总体说明 +- 本节主要介绍NPU精度比对所需要的函数以及示例。 +- 在线精度比对工具通过截获PyTorch框架中部分Aten Ir及其输入输出,并将输入数据转到CPU执行,最后将NPU和CPU的执行结果进行精度比对得到比对结果。 + +### 约束 + +- Pytorch 只支持2.0及其以上版本。 +- 只支持Aten Ir级在线精度比对,所有Aten Ir可以通过dir(torch.ops.aten)查看,其中部分IR不支持在线比对:Aten Ir无对应CPU实现、NPU和CPU同AtenIR实现逻辑不一致,导致同输入不同输出。 +- 正反向不支持同时在线精度比对,不支持跨step在线精度比对。 + + +### 场景示例 +1. 在NPU训练脚本中添加在线精度比对接口,示例如下: + + ```python + from atat.pytorch.common.utils import seed_all + from atat.pytorch.online_dispatch import PtdbgDispatch + + # 在main函数开始前固定随机数 + seed_all() + + + ... + + # 在需要调试精度的正向或反向代码前设置 + # 正向示例 + with PtdbgDispatch(dump_mode="auto", dump_path="/home/dump"): + output = model_cpu(inputs) + # 反向示例 + with PtdbgDispatch(dump_mode="auto", dump_path="/home/dump"): + loss.backward() + ``` + +2. 执行训练。 + +3. 找出精度不达标的Aten IR。 + + 执行过程中会打屏Failed,Failed在比对结果csv中的Accuracy Reached or Not列标记为No,并在Dump目录下存盘精度不达标Aten IR的输入输出。 + ![图片说明](http://image.huawei.com/tiny-lts/v1/images/d83d564e337e80c7cfb557ca3600d0d4_1689x178.png@900-0-90-f.png) + +### 计算精度评价指标 + +1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标; +2. Cosine < 0.9,精度不达标; +3. 
MaxAbsError > 1,精度不达标。 + +### 在线精度比对参数设置说明 + +| 参数名称 | 说明 | 是否必选 | +| -------- |-------------------------------------------------------------------------------------------------| -------- | +| dump_mode| dump模式,可取值"all"、"list"、"auto"、"OFF",默认值为OFF(表示不Dump数据)。 | 否 | +| api_list | dump范围,dump_mode="list"时设置,需要Dump Aten Ir API名称,默认为None,Aten Ir API名称可以通过dir(torch.ops.aten)查看。 | 否 | +| dump_path| dump文件生成的路径。 | 是 | +| tag | 传入tag字符串,成为dump文件夹名一部分,默认为None。 | 否 | +| process_num | 多进程并发数,默认为0。 | 否 | +| debug | debug信息打印,默认为False。 | 否 | +### dump数据存盘说明 +dump数据存盘目录名格式:`atat_tag_rankid_{timestamp}`。 + +子目录下包含1个比对结果csv文件、cpu和npudump数据目录,npu目录下包含Aten IR在NPU上的输入输出的dump数据,由于CPU的输入是直接使用NPU的输入执行,因此cpu目录下只包含执行输出的dump数据。 + +```bash +atat_rank4_20230911170521 +├── compare_result_rank4_20230911170521.csv +├── cpu +│   ├── native_batch_norm_backward_10_output.0.npy +│ ............ +└── npu + ├── native_batch_norm_backward_10_input.0.npy + ............ +``` diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/__init__.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/__init__.py index 3ffe161cba432405e2dc8d98f9be89053b58849d..f86fc41d557b1801318303ce18a934b2306c223e 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/__init__.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/__init__.py @@ -1,5 +1,5 @@ -from atat.pytorch.common import print_warn_log_rank_0, print_info_log_rank_0 -from atat.pytorch.common.exceptions import FreeBenchmarkException +from atat.core.common.log import logger +from atat.core.common.exceptions import FreeBenchmarkException from atat.pytorch.common.utils import Const from .main import FreeBenchmarkCheck diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/params.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/common/params.py index c5dfefb43f856383af93068840e9e48e1590c431..440348d78c28d3f7cc816932ff12e83aa71915bc 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/params.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/common/params.py @@ -1,9 +1,8 @@ -from abc import ABC from dataclasses import dataclass from typing import Any, Callable, Dict, List, Optional, Tuple import torch -from atat.pytorch.free_benchmark import Const, print_warn_log_rank_0 +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.enums import ( DeviceType, FuzzLevel, @@ -78,7 +77,7 @@ def data_pre_deal(name, func, args, kwargs): index = check_args_type(args) data_params.valid_input_index = index if index == -1: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: 无标杆工具不支持当前算子的输入类型 {name}." 
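Both comparison guides above state the same pass rule for the Accuracy Reached or Not column. Written out directly in code, with the thresholds 0.99, 0.9, 0.001 and 1 taken verbatim from the text (the function name is ours):

```python
def accuracy_reached(cosine, max_abs_err):
    """Pass rule from the docs above: fail when Cosine < 0.99 with
    MaxAbsError > 0.001, when Cosine < 0.9, or when MaxAbsError > 1."""
    if cosine < 0.99 and max_abs_err > 0.001:
        return "No"
    if cosine < 0.9:
        return "No"
    if max_abs_err > 1:
        return "No"
    return "Yes"


print(accuracy_reached(0.995, 0.01))  # Yes: the 0.99 threshold is not crossed
print(accuracy_reached(0.98, 0.01))   # No: rule 1 fires
print(accuracy_reached(0.995, 2.0))   # No: rule 3 fires
```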
) return data_params diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/grad_saver.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/grad_saver.py index 5094da3e2a41b3e96a01cf357d434401e5ed75cc..89ef9e4c9b4500953b0edc26f28f5b14e401ca50 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/grad_saver.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/grad_saver.py @@ -1,6 +1,6 @@ import torch -from atat.pytorch.common.exceptions import FreeBenchmarkException -from atat.pytorch.free_benchmark import print_warn_log_rank_0 +from atat.core.common.exceptions import FreeBenchmarkException +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.constant import CommonField from atat.pytorch.free_benchmark.common.params import DataParams, HandlerParams from atat.pytorch.free_benchmark.perturbed_layers.layer_factory import LayerFactory @@ -40,18 +40,18 @@ class GradSaver: ) data_processor.update_unequal_rows(handler.get_unequal_rows()) except IndexError: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: grad index out of range. api:{self.handler_params.api_name}." f"index:{new_grad_index}, perturbation grad len {len(self.perturbed_grad_input)}" ) return grad except FreeBenchmarkException as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: grad input check error: {e}" ) return grad except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: grad compare error: {e}" ) return grad @@ -76,10 +76,13 @@ class GradSaver: self.data_params.original_result = self.origin_grad_input handler.handle(self.data_params) except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: compare two vjp failed: api:{self.handler_params.api_name}." 
f"{e}" ) + # 在扰动前后输出对比后释放输出的引用 + self.data_params.perturbed_result = None + self.data_params.original_result = None def check_grad_input(self, origin_grad, new_grad_index): if self.perturbed_grad_input is None: @@ -171,6 +174,10 @@ class GradSaver: self.handler_params.pert_mode, ) layer.handle(self.data_params) - self.perturbed_grad_input = tuple( - [x.cpu() for x in self.data_params.perturbed_result] - ) + # 在计算扰动输出之后,释放输入的引用 + self.data_params.args = None + # 确定扰动成功后,才会暂存 + if self.data_params.perturbed_result: + self.perturbed_grad_input = tuple( + [x.cpu() for x in self.data_params.perturbed_result] + ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/single_benchmark.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/single_benchmark.py index 80c526be91f90f02603d49d535a7054be99d920c..85aa68f13b969e996407f5f64353de43b916e00f 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/single_benchmark.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/single_benchmark.py @@ -1,7 +1,7 @@ import math import torch -from atat.pytorch.free_benchmark import print_warn_log_rank_0 +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.constant import ThresholdConfig from atat.pytorch.free_benchmark.common.utils import TorchC @@ -61,7 +61,7 @@ class SingleCompare: actual.dtype, ThresholdConfig.BENCHMARK_THD_DICT.get(torch.float32) ) if self.filter_overflow(golden) > 0: - print_warn_log_rank_0("[atat] Free Benchmark: inf and nan" + logger.warning_on_rank_0("[atat] Free Benchmark: inf and nan" "in golden tensor is not supported.") return True actual = self.replace_inf_or_nan(actual) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/main.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/main.py index c2e0005181d967ed8437e3047f9d967b1370d4e3..ba3e9a6b2561fdf12a27065b4837a54757010d52 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/main.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/main.py @@ -1,9 +1,7 @@ -import importlib from abc import ABC import torch -from atat.pytorch.free_benchmark import Const, print_warn_log_rank_0 - +from atat.pytorch.free_benchmark import Const, logger from atat.pytorch.free_benchmark.common.params import data_pre_deal, make_handler_params from atat.pytorch.free_benchmark.common.enums import ( PerturbationMode, @@ -80,7 +78,7 @@ class FreeBenchmarkCheck(ABC): try: grad_saver = getattr(module, "grad_saver") except AttributeError: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: get grad saver failed. api_name:{name}" ) return @@ -96,7 +94,7 @@ class FreeBenchmarkCheck(ABC): _new_grad_output, need_grad_tensors, _inner_args ) except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: grad vjp calculate failed. 
api_name:{name} error: {e}" ) return diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py index d5ba63c6a943ba105e157c6871d1bcdc937620b5..af8a93f7d4b9b06623b70c22e7fb5065305e84a0 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py @@ -1,8 +1,5 @@ import torch -from atat.pytorch.free_benchmark import ( - print_info_log_rank_0, - print_warn_log_rank_0, -) +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.constant import ThresholdConfig from atat.pytorch.free_benchmark.common.enums import PerturbationMode from atat.pytorch.free_benchmark.common.params import DataParams @@ -39,7 +36,7 @@ class AddNoiseLayer(NpuBaseLayer): """ 对输入添加扰动并返回 """ - print_info_log_rank_0( + logger.info_on_rank_0( f"[atat] Free benchmark: Perturbation is " f"{PerturbationMode.ADD_NOISE} of {self.api_name}." ) @@ -62,13 +59,13 @@ class AddNoiseLayer(NpuBaseLayer): 判断是否需要添加扰动 """ if not self.perturbed_value: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.api_name}, " f"dtype unsupported. Cancel perturbation." ) return False if tensor_obj.numel() == 0: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: For {self.api_name}, tensor shape must > 0." f" Cancel adding noise." ) @@ -79,13 +76,13 @@ class AddNoiseLayer(NpuBaseLayer): try: max_val = TorchC.max(TorchC.abs(tensor_obj)).item() except Exception: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.api_name}, " f"when calculate maximun value, tensor is changed to float32." ) max_val = TorchC.max(TorchC.abs(tensor_obj.to(torch.float32))).item() if max_val < abs_tol: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.api_name}, " f"Maximun value is less than the minimun threshold. Cancel add noise." ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py index 2c1ed9a3e1ceaee11021d5aacbacfd103b9dd9d7..40b99acf41105fa61792ef52e27cc7f2e6686ba7 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py @@ -1,8 +1,5 @@ import torch -from atat.pytorch.free_benchmark import ( - print_info_log_rank_0, - print_warn_log_rank_0, -) +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.constant import ThresholdConfig from atat.pytorch.free_benchmark.common.enums import PerturbationMode from atat.pytorch.free_benchmark.common.params import DataParams @@ -55,7 +52,7 @@ class BitNoiseLayer(NpuBaseLayer): """ 对输入添加扰动并返回 """ - print_info_log_rank_0( + logger.info_on_rank_0( f"[atat] Free benchmark: Perturbation is " f"{PerturbationMode.BIT_NOISE} of {self.api_name}." ) @@ -67,13 +64,13 @@ class BitNoiseLayer(NpuBaseLayer): 判断是否需要添加扰动, bit翻转 """ if not self.bit_type: - print_warn_log_rank_0( + logger.info_on_rank_0( f"[atat] Free Benchmark: For {self.api_name}, " f"dtype unsupported. Cancel perturbation." 
) return False if tensor_obj.numel() == 0: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free benchmark: For {self.api_name}, tensor shape must > 0" f" Cancel adding noise." ) @@ -84,13 +81,13 @@ class BitNoiseLayer(NpuBaseLayer): try: max_val = TorchC.max(TorchC.abs(tensor_obj)).item() except Exception: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.api_name}, " f"when calculate maximun value, tensor is changed to float32." ) max_val = TorchC.max(TorchC.abs(tensor_obj.to(torch.float32))).item() if max_val < abs_tol: - print_warn_log_rank_0( + logger.info_on_rank_0( f"[atat] Free Benchmark: For {self.api_name}, " f"Maximun value is less than the minimun threshold. Cancel add noise." ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/change_value.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/change_value.py index b4ee67384164cb73abd4e3f3cbaab77b1ffac293..b7a967e18b91ecc2d36c22afce49f72677bef565 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/change_value.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/change_value.py @@ -1,5 +1,5 @@ import torch -from atat.pytorch.free_benchmark import print_warn_log_rank_0, print_info_log_rank_0 +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.enums import PerturbationMode from atat.pytorch.free_benchmark.common.params import DataParams from atat.pytorch.free_benchmark.common.utils import TorchC @@ -43,7 +43,7 @@ class ChangeValueLayer(NpuBaseLayer): """ 对输入添加扰动并返回 """ - print_info_log_rank_0( + logger.info_on_rank_0( f"[atat] Free benchmark: Perturbation is " f"{PerturbationMode.CHANGE_VALUE} of {self.api_name}." ) @@ -55,7 +55,7 @@ class ChangeValueLayer(NpuBaseLayer): 判断是否需要添加扰动, 首尾值交换 """ if tensor_obj.size(0) < 2: - print_warn_log_rank_0( + logger.info_on_rank_0( f"[atat] Free Benchmark: For {self.api_name}, " f"size 0 must greater than 1. Cancel change value." ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py index e18b303a60fd0bdfd91fcb2a0f2abd9402b55141..2df26afc1beecc21a8a77bddbe76de6142d68862 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py @@ -1,5 +1,5 @@ import torch -from atat.pytorch.free_benchmark import Const, print_info_log_rank_0 +from atat.pytorch.free_benchmark import Const, logger from atat.pytorch.free_benchmark.common.constant import CommonField from atat.pytorch.free_benchmark.common.enums import PerturbationMode from atat.pytorch.free_benchmark.common.params import DataParams @@ -18,6 +18,7 @@ class ImprovePrecisionLayer(NpuBaseLayer): ): self._set_improve_valus(tensor_obj) tensor_obj = self._change_dtype(tensor_obj) + self.is_added = True return tensor_obj if isinstance(tensor_obj, dict): return { @@ -31,7 +32,7 @@ class ImprovePrecisionLayer(NpuBaseLayer): return tensor_obj def handle(self, params: DataParams) -> torch.Any: - print_info_log_rank_0( + logger.info_on_rank_0( f"[atat] Free benchmark: Perturbation is " f"{PerturbationMode.IMPROVE_PRECISION} of {self.api_name}." 
) @@ -40,6 +41,9 @@ class ImprovePrecisionLayer(NpuBaseLayer): new_kwargs = {} else: new_kwargs = self.improve_tensor_precision(params.kwargs) + # If every input is already high precision, skip the second execution to avoid holding extra device memory references + if not self.is_added: + return params.perturbed_result if "inplace" in new_kwargs: new_kwargs["inplace"] = False params.perturbed_result = params.origin_func(*new_args, **new_kwargs) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/no_change.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/no_change.py index 204e649d805a86a83509475f67c2cf477028f356..bb065385c690f937c702cadac5707b787489aee5 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/no_change.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/no_change.py @@ -1,5 +1,5 @@ import torch -from atat.pytorch.free_benchmark import print_info_log_rank_0 +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.enums import PerturbationMode from atat.pytorch.free_benchmark.common.params import DataParams from atat.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( @@ -20,7 +20,7 @@ class NoChangeLayer(NpuBaseLayer): """ 对输入添加扰动并返回 """ - print_info_log_rank_0( + logger.info_on_rank_0( f"[atat] Free benchmark: Perturbation is " f"{PerturbationMode.NO_CHANGE} of {self.api_name}." ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/run_cpu.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/run_cpu.py index 387f9447fd29276e3c43bcdabf0e8a3a05b8ecec..024958ffbe126b89ec15fa10b277d90af4ed3e45 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/run_cpu.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/run_cpu.py @@ -1,5 +1,5 @@ import torch -from atat.pytorch.free_benchmark import print_info_log_rank_0 +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.params import DataParams from atat.pytorch.free_benchmark.common.utils import Tools from atat.pytorch.free_benchmark.common.enums import DeviceType @@ -10,7 +10,7 @@ class CpuLayer(BaseLayer): def handle(self, params: DataParams) -> torch.Any: - print_info_log_rank_0( + logger.info_on_rank_0( f"[atat] Free benchmark: Perturbation is to_cpu of {self.api_name}." 
) new_args = Tools.convert_device_and_dtype(params.args, DeviceType.CPU, change_dtype=True) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/base_handler.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/base_handler.py index 0d1f6ec5d000784b49a12c6d22430d3030fbf389..1f1f8e1cba1b8c1a9646c30b71a4b785fd77d6ed 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/base_handler.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/base_handler.py @@ -5,7 +5,7 @@ from typing import Any, Optional, Tuple import torch from atat.pytorch.free_benchmark import ( Const, - print_warn_log_rank_0, + logger, ) from atat.pytorch.free_benchmark.common.constant import ThresholdConfig from atat.pytorch.free_benchmark.common.enums import ( @@ -101,7 +101,7 @@ class FuzzHandler(ABC): origin_output, perturbed_output ) except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.params.api_name}, " f"when computing ratio," f" y1 or y2 dtype is not supported {e}" @@ -130,7 +130,7 @@ class FuzzHandler(ABC): origin_output / perturbed_output, ) elif not isinstance(perturbed_output, torch.Tensor): - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.params.api_name} " f"The compare for output type {type(perturbed_output)} is not supported" ) @@ -182,7 +182,7 @@ class FuzzHandler(ABC): ) ) except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.params.api_name}, " f"when campare the result exception raise {e}" ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/check_handler.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/check_handler.py index 2f590855f1b96e0a6475c87c9b3dfdafd0288332..7444c855eb42bdff0303c30ac83206249d2f02a4 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/check_handler.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/check_handler.py @@ -1,7 +1,6 @@ from typing import Any -import torch -from atat.pytorch.free_benchmark import print_warn_log_rank_0 +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.enums import DeviceType from atat.pytorch.free_benchmark.compare.single_benchmark import SingleCompare from atat.pytorch.free_benchmark.common.params import DataParams, make_unequal_row @@ -34,7 +33,7 @@ class CheckerHandler(FuzzHandler): else: self.other_compare(data_params) except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.params.api_name}, " f"when campare the result exception raise {e}" ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/fix_handler.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/fix_handler.py index 789e2653aa0eafc3619fbe3bd192b49dee643a1d..fa5c6f37495323693175117040c9c2f7fa3c01c6 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/fix_handler.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/fix_handler.py @@ -3,7 +3,7 @@ from typing import Any from atat.pytorch.free_benchmark.common.params import DataParams from atat.pytorch.free_benchmark.common.utils import Tools from atat.pytorch.free_benchmark.result_handlers.base_handler import FuzzHandler -from atat.pytorch.free_benchmark import print_warn_log_rank_0 +from atat.pytorch.free_benchmark import logger class 
FixHandler(FuzzHandler): @@ -17,7 +17,7 @@ class FixHandler(FuzzHandler): data_params.original_result, data_params.perturbed_result ) except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.params.api_name} " f"Fix output failed. " ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/handler_factory.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/handler_factory.py index 50f791d81eeb25f8a50a6b4044dbc8e6e09e6a1e..cff629854d9b2bd1413b273171ad4ce73493bbb0 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/handler_factory.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/handler_factory.py @@ -1,6 +1,5 @@ from atat.pytorch.free_benchmark import FreeBenchmarkException from atat.pytorch.free_benchmark.common.constant import PreheatConfig -from atat.pytorch.free_benchmark.common.utils import Tools from atat.pytorch.free_benchmark.common.enums import HandlerType from atat.pytorch.free_benchmark.common.params import HandlerParams from atat.pytorch.free_benchmark.result_handlers.check_handler import CheckerHandler diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/preheat_handler.py b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/preheat_handler.py index 1e70067b93d3031d5ac640e5e442087bca9a63aa..ee2ee11a79a753a028a6687cebdd66f6ea220a66 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/preheat_handler.py +++ b/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/preheat_handler.py @@ -1,7 +1,7 @@ import math from typing import Any -from atat.pytorch.free_benchmark import print_info_log_rank_0, print_warn_log_rank_0 +from atat.pytorch.free_benchmark import logger from atat.pytorch.free_benchmark.common.constant import ThresholdConfig from atat.pytorch.free_benchmark.common.counter import preheat_counter from atat.pytorch.free_benchmark.common.enums import DeviceType @@ -74,14 +74,14 @@ class PreheatHandler(FuzzHandler): try: cpu_consistent = self.compare_npu_and_cpu(data_params) except Exception as e: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.params.api_name}, " f"when campare to cpu exception raise {e}" ) try: first_dtype = Tools.get_first_tensor_dtype(data_params.perturbed_result) except RuntimeError: - print_warn_log_rank_0( + logger.warning_on_rank_0( f"[atat] Free Benchmark: For {self.params.api_name}, " f"the output sequence does not contain tensors." 
) @@ -96,7 +96,7 @@ class PreheatHandler(FuzzHandler): res = curr_called_seq in need_sample_set if res: total_count = preheat_counter.get_one_step_used_api(self.pure_name) - print_info_log_rank_0( + logger.info_on_rank_0( f"[atat] Free benchmark: preheat sample in step{self.params.step}" f"api_name {self.params.api_name}, " f"curr_called_seq: {curr_called_seq}/{total_count}" diff --git a/debug/accuracy_tools/atat/pytorch/functional/__init__.py b/debug/accuracy_tools/atat/pytorch/functional/__init__.py index 12e530d4c950f6bab9d6fe48861954ca0061e33d..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/__init__.py +++ b/debug/accuracy_tools/atat/pytorch/functional/__init__.py @@ -1,4 +0,0 @@ -from .repair import build_repair -from .scope import build_scope -from .step_post_process import build_step_post_process -from .data_collector import build_data_collector diff --git a/debug/accuracy_tools/atat/pytorch/functional/data_processor.py b/debug/accuracy_tools/atat/pytorch/functional/data_processor.py index 116301725b886eef975688fd4a8f63761ad54828..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/data_processor.py +++ b/debug/accuracy_tools/atat/pytorch/functional/data_processor.py @@ -1,551 +0,0 @@ -import inspect -import os -import zlib -from dataclasses import dataclass, asdict -from typing import Tuple, List, Dict, Optional, Union - -import numpy as np -import torch -import torch_npu - -from ..common import recursive_apply_transform -from ..common.exceptions import MsaccException -from ..common.file_check import path_len_exceeds_limit, change_mode, FileCheckConst -from ..common.log import print_warn_log -from ..common.utils import Const -from ..free_benchmark import FreeBenchmarkCheck, UnequalRow - -bits_for_overflow = 8 - - -def build_data_processor(config, data_writer): - if config.task == DataProcessor.full: - return FullTensorDataProcessor(config, data_writer) - elif config.task == DataProcessor.summary: - return DataProcessor(config, data_writer) - elif config.task == DataProcessor.overflow: - return OverflowTensorDataProcessor(config, data_writer) - elif config.task == DataProcessor.free_benchmark: - return FreeBenchmarkDataProcessor(config, data_writer) - else: - raise MsaccException(MsaccException.INVALID_PARAM_ERROR, - "task should be in [{}, {}, {}, {}]".format( - DataProcessor.full, - DataProcessor.summary, - DataProcessor.overflow, - DataProcessor.free_benchmark - )) - - -@dataclass -class ModuleForwardInputsOutputs: - args: Optional[Tuple] - kwargs: Optional[Dict] - output: Union[Tuple, torch.Tensor] - - @property - def args_tuple(self): - if not isinstance(self.args, tuple): - return (self.args,) - else: - return self.args - - @property - def output_tuple(self): - if not isinstance(self.output, tuple): - return (self.output,) - else: - return self.output - - def concat_args_and_kwargs(self): - args = self.args + tuple(self.kwargs.values()) - return args - - -@dataclass -class ModuleBackwardInputsOutputs: - grad_output: Optional[Tuple] - grad_input: Optional[Tuple] - - @property - def grad_input_tuple(self): - if not isinstance(self.grad_input, tuple): - return (self.grad_input,) - else: - return self.grad_input - - @property - def grad_output_tuple(self): - if not isinstance(self.grad_output, tuple): - return (self.grad_output,) - else: - return self.grad_output - - -class TensorStatInfo: - def __init__(self, max_val=None, min_val=None, mean_val=None, norm_val=None): - self.max = 
max_val - self.min = min_val - self.mean = mean_val - self.norm = norm_val - - -class DataProcessor: - full = "tensor" - summary = "statistics" - overflow = "overflow_check" - free_benchmark = "free_benchmark" - - def __init__(self, config, data_writer): - self.data_writer = data_writer - self.api_info_struct = {} - self.stack_info_struct = {} - self.torch_object_key = { - "device": self.analyze_device_in_kwargs, - "dtype": self.analyze_dtype_in_kwargs - } - self.current_api_or_module_name = None - self.config = config - self.api_data_category = None - self.has_overflow = False - self.current_iter = 0 - - # 需要对forward的output进行更改 - self._return_forward_new_output = False - self._forward_new_output = None - - @staticmethod - def get_md5_for_tensor(x): - if x.dtype == torch.bfloat16: - x = x.float() - tensor_bytes = x.cpu().detach().numpy().tobytes() - crc32_hash = zlib.crc32(tensor_bytes) - return f"{crc32_hash:08x}" - - @staticmethod - def analyze_device_in_kwargs(element): - single_arg = {} - single_arg.update({'type': "torch.device"}) - if not isinstance(element, str): - if hasattr(element, "index"): - device_value = element.type + ":" + str(element.index) - else: - device_value = element.type - single_arg.update({"value": device_value}) - else: - single_arg.update({"value": element}) - return single_arg - - @staticmethod - def analyze_dtype_in_kwargs(element): - single_arg = {} - single_arg.update({"type": "torch.dtype"}) - single_arg.update({"value": str(element)}) - return single_arg - - @staticmethod - def _convert_numpy_to_builtin(arg): - type_mapping = { - np.integer: int, - np.floating: float, - np.bool_: bool, - np.complexfloating: complex, - np.str_: str, - np.byte: bytes, - np.unicode_: str - } - for numpy_type, builtin_type in type_mapping.items(): - if isinstance(arg, numpy_type): - return builtin_type(arg), type(arg).__name__ - return arg, '' - - @staticmethod - def handle_tensor_extremum_nan_inf(data_clone, operator): - data_nan = torch._C._VariableFunctionsClass.isnan(data_clone) - if int(torch._C._VariableFunctionsClass.sum(data_nan)) == data_clone.numel(): - return float('nan') - finite_mask = torch._C._VariableFunctionsClass.isfinite(data_clone) - if int(torch._C._VariableFunctionsClass.sum(finite_mask)) > 0: - finite_values = data_clone[finite_mask] - return torch._C._VariableFunctionsClass.max(finite_values).item() if operator == 'max' else \ - torch._C._VariableFunctionsClass.min(finite_values).item() - else: - data_no_nan = data_clone[~data_nan] - return torch._C._VariableFunctionsClass.max(data_no_nan).item() if operator == 'max' else \ - torch._C._VariableFunctionsClass.min(data_no_nan).item() - - @staticmethod - def analyze_api_call_stack(name): - stack_str = [] - for (_, path, line, func, code, _) in inspect.stack()[5:]: - if not code: - continue - stack_line = " ".join([ - "File", ", ".join([ - path, - " ".join(["line", str(line)]), - " ".join(["in", func]), - " ".join(["\n", code[0].strip()]) - ]) - ]) - stack_str.append(stack_line) - stack_info_struct = {name: stack_str} - return stack_info_struct - - def get_stat_info(self, data): - tensor_stat = TensorStatInfo() - if data.is_meta: - return tensor_stat - data_clone = data.detach() - if data_clone.numel() == 0: - return tensor_stat - elif data_clone.dtype == torch.bool: - tensor_stat.max = True in data_clone - tensor_stat.min = False not in data_clone - tensor_stat.mean = None - tensor_stat.norm = None - elif not data_clone.shape: - tensor_stat.max = data_clone.item() - tensor_stat.min = tensor_stat.max - 
tensor_stat.mean = tensor_stat.max - tensor_stat.norm = tensor_stat.max - else: - if not data_clone.is_floating_point(): - data_clone = data_clone.float() - tensor_stat.max = torch._C._VariableFunctionsClass.max(data_clone).item() - tensor_stat.min = torch._C._VariableFunctionsClass.min(data_clone).item() - tensor_stat.mean = torch._C._VariableFunctionsClass.mean(data_clone).item() - tensor_stat.norm = torch._C._VariableFunctionsClass.norm(data_clone).item() - - return tensor_stat - - def if_return_forward_new_output(self): - return self._return_forward_new_output - - def get_forward_new_output(self): - self._return_forward_new_output = False - return self._forward_new_output - - def update_iter(self, current_iter): - self.current_iter = current_iter - - def visit_and_clear_overflow_status(self, api_or_module_name): - if self.current_api_or_module_name != api_or_module_name: - self.current_api_or_module_name = api_or_module_name - self.has_overflow = False - - def is_dump_for_data_mode(self, forward_backward, input_output): - """ - Compare the parameters with data_mode to determine whether to dump. - - Args: - forward_backward(str): The forward or backward mode to check. - input_output(str): The input or output mode to check. - - Return: - bool: True if the parameters are in data_mode or data_mode is all, False otherwise. - """ - return (Const.ALL in self.config.data_mode or - forward_backward in self.config.data_mode or - input_output in self.config.data_mode) - - def analyze_single_element(self, element, suffix_stack): - if suffix_stack and suffix_stack[-1] in self.torch_object_key: - return self.torch_object_key[suffix_stack[-1]](element) - - if isinstance(element, torch.Size): - return self._analyze_torch_size(element) - - converted_numpy, numpy_type = self._convert_numpy_to_builtin(element) - if converted_numpy is not element: - return self._analyze_numpy(converted_numpy, numpy_type) - - if isinstance(element, torch.Tensor): - return self._analyze_tensor(element, Const.SEP.join(suffix_stack)) - - if isinstance(element, (bool, int, float, str, slice)): - return self._analyze_builtin(element) - return {} - - def analyze_element(self, element): - return recursive_apply_transform(element, self.analyze_single_element) - - def analyze_pre_forward(self, name, module, - module_input_output: ModuleForwardInputsOutputs): - pass - - def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): - api_info_struct = {} - if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): - api_info_struct[name] = {} - self.api_data_category = Const.INPUT - args_info_list = self.analyze_element(module_input_output.args_tuple) - api_info_struct[name][Const.INPUT_ARGS] = args_info_list - - self.api_data_category = Const.KWARGS - kwargs_info_list = self.analyze_element(module_input_output.kwargs) - api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list - - if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): - api_info_struct[name] = api_info_struct.get(name, {}) - self.api_data_category = Const.OUTPUT - output_info_list = self.analyze_element(module_input_output.output_tuple) - api_info_struct[name][Const.OUTPUT] = output_info_list - - return api_info_struct - - def analyze_pre_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): - api_info_struct = {} - if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): - api_info_struct[name] = {} - self.api_data_category = Const.INPUT - args_info_list = 
self.analyze_element(module_input_output.args_tuple) - api_info_struct[name][Const.INPUT_ARGS] = args_info_list - - self.api_data_category = Const.KWARGS - kwargs_info_list = self.analyze_element(module_input_output.kwargs) - api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list - - return api_info_struct - - def analyze_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): - concat_args = module_input_output.concat_args_and_kwargs() - api_info_struct = {} - if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): - api_info_struct[name] = {} - self.api_data_category = Const.OUTPUT - output_info_list = self.analyze_element(concat_args) - api_info_struct[name][Const.OUTPUT] = output_info_list - - return api_info_struct - - def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): - api_info_struct = {} - if self.is_dump_for_data_mode(Const.BACKWARD, Const.OUTPUT): - api_info_struct[name] = {} - self.api_data_category = Const.OUTPUT - input_info_list = self.analyze_element(module_input_output.grad_input_tuple) - api_info_struct[name][Const.GRAD_INPUT] = input_info_list - - if self.is_dump_for_data_mode(Const.BACKWARD, Const.INPUT): - api_info_struct[name] = api_info_struct.get(name, {}) - self.api_data_category = Const.INPUT - output_info_list = self.analyze_element(module_input_output.grad_output_tuple) - api_info_struct[name][Const.GRAD_OUTPUT] = output_info_list - - return api_info_struct - - def _analyze_numpy(self, value, numpy_type): - single_arg = {} - single_arg.update({"type": numpy_type}) - single_arg.update({"value": value}) - return single_arg - - def _analyze_builtin(self, arg): - single_arg = {} - if isinstance(arg, slice): - single_arg.update({"type": "slice"}) - # slice参数中可能存在tensor类型,json序列化,需要转换为python数值类型 - values = [ - value if not isinstance(value, torch.Tensor) else value.item() - for value in [arg.start, arg.stop, arg.step] - ] - single_arg.update({"value": values}) - else: - single_arg.update({"type": type(arg).__name__}) - single_arg.update({"value": arg}) - return single_arg - - def _analyze_torch_size(self, arg): - single_arg = {} - single_arg.update({"type": "torch.Size"}) - single_arg.update({"value": list(arg)}) - return single_arg - - def _analyze_maybe_overflow_tensor(self, tensor_json, tensor): - data_clone = tensor.detach() - if hasattr(torch_npu._C, '_npu_is_support_inf_nan') and torch_npu._C._npu_is_support_inf_nan(): - if tensor_json[Const.MAX] is None: - return - if np.isinf(tensor_json[Const.MAX]) or np.isnan(tensor_json[Const.MAX]): - tensor_json['Max_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "max") - self.has_overflow = True - if np.isinf(tensor_json[Const.MIN]) or np.isnan(tensor_json[Const.MIN]): - tensor_json['Min_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "min") - self.has_overflow = True - else: - self.has_overflow = check_overflow_npu() - if self.has_overflow: - clear_overflow_npu() - - def _analyze_tensor(self, tensor, suffix): - tensor_stat = self.get_stat_info(tensor) - - tensor_json = {} - tensor_json.update({'type': 'torch.Tensor'}) - tensor_json.update({'dtype': str(tensor.dtype)}) - tensor_json.update({"shape": tensor.shape}) - tensor_json.update({"Max": tensor_stat.max}) - tensor_json.update({"Min": tensor_stat.min}) - self._analyze_maybe_overflow_tensor(tensor_json, tensor) - tensor_json.update({"Mean": tensor_stat.mean}) - tensor_json.update({"Norm": tensor_stat.norm}) - tensor_json.update({"requires_grad": 
tensor.requires_grad}) - if self.config.summary_mode == "md5": - tensor_md5 = self.get_md5_for_tensor(tensor) - tensor_json.update({"md5": tensor_md5}) - - return tensor_json - - -class FullTensorDataProcessor(DataProcessor): - - def __init__(self, config, data_writer): - super().__init__(config, data_writer) - self.data_path = self.data_writer.dump_tensor_data_dir - - def _analyze_tensor(self, tensor, suffix): - dump_data_name = (self.current_api_or_module_name + Const.SEP + self.api_data_category + Const.SEP + - suffix + ".pt") - file_path = os.path.join(self.data_writer.dump_tensor_data_dir, dump_data_name) - if not path_len_exceeds_limit(file_path): - torch.save(tensor, file_path) - change_mode(file_path, FileCheckConst.DATA_FILE_AUTHORITY) - else: - print_warn_log(f'The file path {file_path} length exceeds limit.') - single_arg = super()._analyze_tensor(tensor, suffix) - single_arg.update({"data_name": dump_data_name}) - return single_arg - - -class OverflowTensorDataProcessor(DataProcessor): - __slots__ = ["cached_tensors_and_file_paths"] - - def __init__(self, config, data_writer): - super().__init__(config, data_writer) - self.cached_tensors_and_file_paths = {} - self.real_overflow_dump_times = 0 - self.overflow_nums = config.overflow_num - - def _analyze_tensor(self, tensor, suffix): - dump_data_name = (self.current_api_or_module_name + Const.SEP + self.api_data_category + Const.SEP + - suffix + ".pt") - file_path = os.path.join(self.data_writer.dump_tensor_data_dir, dump_data_name) - if not path_len_exceeds_limit(file_path): - self.cached_tensors_and_file_paths.update({file_path: tensor}) - else: - print_warn_log(f'The file path {file_path} length exceeds limit.') - single_arg = super()._analyze_tensor(tensor, suffix) - single_arg.update({"data_name": dump_data_name}) - return single_arg - - def analyze_forward(self, name, module, - module_input_output: ModuleForwardInputsOutputs): - self.has_overflow = False - api_info_struct = super().analyze_forward(name, module, module_input_output) - self.maybe_save_overflow_data_and_check_overflow_times() - return api_info_struct if self.has_overflow else None - - def analyze_backward(self, name, module, - module_input_output: ModuleBackwardInputsOutputs): - self.has_overflow = False - api_info_struct = super().analyze_backward(name, module, module_input_output) - self.maybe_save_overflow_data_and_check_overflow_times() - return api_info_struct if self.has_overflow else None - - def maybe_save_overflow_data_and_check_overflow_times(self): - if self.has_overflow: - for file_path, tensor in self.cached_tensors_and_file_paths.items(): - torch.save(tensor, file_path) - change_mode(file_path, FileCheckConst.DATA_FILE_AUTHORITY) - self.inc_and_check_overflow_times() - self.cached_tensors_and_file_paths = {} - - def inc_and_check_overflow_times(self): - self.real_overflow_dump_times += 1 - if self.overflow_nums == -1: - return - if self.real_overflow_dump_times >= self.overflow_nums: - raise MsaccException(MsaccException.OVERFLOW_NUMS_ERROR, - str(self.real_overflow_dump_times)) - - -class FreeBenchmarkDataProcessor(DataProcessor): - - def __init__(self, config, data_writer): - super().__init__(config, data_writer) - self.checker = FreeBenchmarkCheck(config=config) - - def update_iter(self, current_iter): - self.current_iter = current_iter - self.checker.update_iter(current_iter) - - def update_unequal_rows(self, unequal_rows: List[UnequalRow]): - if len(unequal_rows) == 0: - return - for row in unequal_rows: - data_dict = asdict(row) - 
self.data_writer.write_data_to_csv( - data_dict.values(), - data_dict.keys(), - self.data_writer.free_benchmark_file_path - ) - return - - def analyze_pre_forward(self, name, module, - module_input_output: ModuleForwardInputsOutputs): - args = module_input_output.args - kwargs = module_input_output.kwargs - self.checker.pre_forward(name, module, self, args, kwargs) - - def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): - new_output, unequal_rows = self.checker.forward( - name, - module, - module_input_output.args, - module_input_output.kwargs, - module_input_output.output, - ) - self.update_unequal_rows(unequal_rows) - if self.checker.if_fix(): - self._return_forward_new_output = True - self._forward_new_output = new_output - return None - - def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): - self.checker.backward(name, module, module_input_output.grad_output) - return None - - -def overflow_debug_mode_enable(): - overflow_mode = os.getenv(OverflowConst.OVERFLOW_DEBUG_MODE_ENABLE, Const.ENV_DISABLE) - return overflow_mode == Const.ENV_ENABLE - - -def check_overflow_npu(): - if overflow_debug_mode_enable(): - float_status = torch.zeros(bits_for_overflow).npu() - result = torch_npu.npu_get_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) - if (result.cpu()[0] != 0): - return True - else: - return False - else: - return torch_npu._C._check_overflow_npu() - - -def clear_overflow_npu(): - if overflow_debug_mode_enable(): - float_status = torch.zeros(bits_for_overflow).npu() - torch_npu.npu_clear_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) - else: - torch_npu._C._clear_overflow_npu() - - -class OverflowConst: - """ - Class for Overflow - """ - OVERFLOW_DEBUG_MODE_ENABLE = "OVERFLOW_DEBUG_MODE_ENABLE" - OVERFLOW_ORIGINAL_MODE = 0 - OVERFLOW_DEBUG_MODE = 1 diff --git a/debug/accuracy_tools/atat/pytorch/functional/dump_module.py b/debug/accuracy_tools/atat/pytorch/functional/dump_module.py index fed73ad5374178fac01180bb905468b9e7c747fa..8652f13f9bcaa46d13fc19bfe74c796ac45cdadf 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/dump_module.py +++ b/debug/accuracy_tools/atat/pytorch/functional/dump_module.py @@ -1,20 +1,21 @@ import torch.nn as nn -from atat.core.utils import print_error_log, DumpException -from .scope import BaseScope -from ..common.utils import Const -from ..hook_module.api_registry import api_register -from ..debugger.precision_debugger import PrecisionDebugger +from atat.pytorch.common.log import logger +from atat.core.common.utils import Const +from atat.pytorch.hook_module.api_registry import api_register +from atat.pytorch.debugger.precision_debugger import PrecisionDebugger +from atat.core.common.exceptions import MsaccException +from atat.core.data_dump.scope import BaseScope module_count = {} def module_dump(module, dump_name): if not isinstance(module, nn.Module): - print_error_log("The parameter:module in module_dump is not a Module subclass.") - raise DumpException(DumpException.INVALID_PARAM_ERROR) + logger.error("The parameter:module in module_dump is not a Module subclass.") + raise MsaccException(MsaccException.INVALID_PARAM_ERROR) if not isinstance(dump_name, str): - print_error_log("The parameter:dump_name in module_dump is not a str type.") - raise DumpException(DumpException.INVALID_PARAM_ERROR) + logger.error("The parameter:dump_name in module_dump is not a str type.") + raise MsaccException(MsaccException.INVALID_PARAM_ERROR) 
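(Aside: the `module_dump` hunk above now fails fast with `logger.error` plus `MsaccException` when its arguments are malformed. A minimal usage sketch, assuming the `atat.pytorch.functional.dump_module` import path implied by this diff; the module and dump name are hypothetical, not a verified example:)

```python
# Hypothetical sketch: exercising module_dump's input validation.
import torch.nn as nn
from atat.pytorch.functional.dump_module import module_dump

layer = nn.Linear(4, 4)
module_dump(layer, "Linear_demo")       # accepted: nn.Module plus a str dump name

try:
    module_dump("not_a_module", "bad")  # rejected: first argument is not an nn.Module
except Exception as err:                # raised as MsaccException per this patch
    print(err)
```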
api_register.api_originality() if dump_name not in module_count: module_count[dump_name] = 0 diff --git a/debug/accuracy_tools/atat/pytorch/functional/repair.py b/debug/accuracy_tools/atat/pytorch/functional/repair.py deleted file mode 100644 index aed8326424f5e171a9a71d21cfeb48db6fb26fb3..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/functional/repair.py +++ /dev/null @@ -1,90 +0,0 @@ -from abc import ABC, abstractmethod - -import torch - -from .scope import build_scope, ListScope, BaseScope -from ..common.exceptions import RepairException -from ..common import recursive_apply_transform, print_info_log_rank_0 - - -def build_repair(config): - if config.repair_type is None: - return None - elif config.repair_type == RepairAPI.ToCPU: - return RepairAPI_toCPU(config) - elif config.repair_type == RepairAPI.RaisePrecision: - return RepairAPI_raise(config) - else: - raise RepairException(RepairException.InvalidRepairType, f"精度修复类型" - f"须配置为'{RepairAPI.ToCPU}'或'{RepairAPI.RaisePrecision}," - f"实际配置为{config.repair_type}") - - -class RepairAPI(ABC): - ToCPU = "cpu" - RaisePrecision = "raise" - - def __init__(self, config): - self.config = config - self.scope = build_scope(ListScope, config.repair_scope, config.repair_api_str) - self.saved, self.towards = "None", "None" - - def check_name_and_module_type(self, name, module_type): - if module_type == BaseScope.Module_Type_Module: - return False - if not self.scope.check(name): - return False - return True - - def convert(self, name, module_type, args, kwargs): - is_target = self.check_name_and_module_type(name, module_type) - if is_target: - args = recursive_apply_transform(args, self.fx) - kwargs = recursive_apply_transform(kwargs, self.fx) - print_info_log_rank_0(f"[msProbe] convert inputs of {name} to " - f"{self.towards}.") - return args, kwargs - - def invert(self, name, module_type, out_feat): - is_target = self.check_name_and_module_type(name, module_type) - if is_target: - out_feat = recursive_apply_transform(out_feat, self.inv_fx) - print_info_log_rank_0(f"[msProbe] convert outputs of {name} back to "\ - f"{self.saved}.") - return out_feat - - -class RepairAPI_toCPU(RepairAPI): - def fx(self, arg, _): - if isinstance(arg, torch.Tensor): - self.saved = arg.device - self.towards = torch.device("cpu") - return arg.cpu() - return arg - - def inv_fx(self, arg, _): - if isinstance(arg, torch.Tensor): - return arg.to(self.saved) - return arg - - -class RepairAPI_raise(RepairAPI): - raise_dtype_map = { - torch.bfloat16: torch.float32, - torch.float16: torch.float32 - } - - def fx(self, arg, _): - if isinstance(arg, torch.Tensor): - self.saved = arg.dtype - self.towards = RepairAPI_raise.raise_dtype_map.get(self.saved) - # bug: nested input may be of various dtypes. which to save and invert? 
- return arg.to(self.towards) - return arg - - def inv_fx(self, arg, _): - if isinstance(arg, torch.Tensor): - return arg.to(self.saved) - return arg - - diff --git a/debug/accuracy_tools/atat/pytorch/functional/step_post_process.py b/debug/accuracy_tools/atat/pytorch/functional/step_post_process.py deleted file mode 100644 index 7f0d3459326f04691a0041c120bf4efc676f8bc1..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/functional/step_post_process.py +++ /dev/null @@ -1,43 +0,0 @@ -from abc import ABC, abstractmethod -from ..common.exceptions import StepException - - -def run_parallel_ut(config): - pass - - -def compare_distrbuted(config): - pass - - -def build_step_post_process(config): - if not config.on_step_end: - return None - if config.on_step_end == StepPostProcess.SingleAPICheck: - return SingleAPICheck(config) - elif config.on_step_end == StepPostProcess.Compare: - return AutoCompare(config) - else: - raise StepException(StepException.InvalidPostProcess, f"step后处理须配置为" - f"'{StepPostProcess.SingleAPICheck}'或'{StepPostProcess.Compare}'," - f"实际配置为{config.on_step_end}") - - -class StepPostProcess(ABC): - SingleAPICheck = 'single_api_check' - Compare = 'compare' - - -class SingleAPICheck: - def __init__(self, config): - self.config = config - - def run(self): - run_parallel_ut(self.config) - -class AutoCompare: - def __init__(self, config): - self.config = config - - def run(self): - compare_distrbuted(self.config.bench_dump_path, self.config.dump_path) diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/api_registry.py b/debug/accuracy_tools/atat/pytorch/hook_module/api_registry.py index 2b4b6a8579a958862f279e015005029ca0c51d2b..6910276f94462018e09cbd3ae865cecff0d0cc1f 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/api_registry.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/api_registry.py @@ -18,14 +18,14 @@ import torch import torch.distributed as dist -from . 
import wrap_torch, wrap_functional, wrap_tensor, wrap_vf, wrap_distributed, wrap_aten -from .wrap_aten import get_aten_ops -from .wrap_distributed import get_distributed_ops -from .wrap_functional import get_functional_ops -from .wrap_tensor import get_tensor_ops -from .wrap_torch import get_torch_ops -from .wrap_vf import get_vf_ops -from ..common.utils import torch_without_guard_version, npu_distributed_api, is_gpu, Const +from atat.pytorch.hook_module import wrap_torch, wrap_functional, wrap_tensor, wrap_vf, wrap_distributed, wrap_aten +from atat.pytorch.hook_module.wrap_aten import get_aten_ops +from atat.pytorch.hook_module.wrap_distributed import get_distributed_ops +from atat.pytorch.hook_module.wrap_functional import get_functional_ops +from atat.pytorch.hook_module.wrap_tensor import get_tensor_ops +from atat.pytorch.hook_module.wrap_torch import get_torch_ops +from atat.pytorch.hook_module.wrap_vf import get_vf_ops +from atat.pytorch.common.utils import torch_without_guard_version, npu_distributed_api, is_gpu, Const torch_version_above_2 = torch.__version__.split('+')[0] > '2.0' diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/hook_module.py b/debug/accuracy_tools/atat/pytorch/hook_module/hook_module.py index ae4a7abdab12e46fcf25e7594b9016ca347599bc..d45a951d479cd235aa6c29e45443d6b22e56dbc9 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/hook_module.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/hook_module.py @@ -20,7 +20,8 @@ import threading import torch import torch.nn as nn import torch.utils.hooks as full_hooks -from ..common.utils import Const +from atat.core.common.utils import Const + class HOOKModule(nn.Module): module_count = {} diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/utils.py b/debug/accuracy_tools/atat/pytorch/hook_module/utils.py index 96883072ebffe815def98237379a14ca60f7f8c5..e4ed157af6dcafc826eb74fcc40898bfdc835eac 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/utils.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/utils.py @@ -18,7 +18,7 @@ import os import yaml -from ..common.file_check import FileOpen +from atat.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_aten.py b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_aten.py index 8666287095bbe12f7e9d5f314cff1db75d74a108..c247a27082edf9af738e48706c6308d4916fc586 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_aten.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_aten.py @@ -20,9 +20,9 @@ import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.file_check import FileOpen +from atat.pytorch.hook_module.hook_module import HOOKModule +from atat.pytorch.common.utils import torch_device_guard, Const +from atat.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_distributed.py b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_distributed.py index 68ce83c16b8414f43e61b1a667f8cb7c27899a10..1059bf748843ae381e48a1c7103811ad71af83c2 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_distributed.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_distributed.py @@ -20,9 +20,9 @@ from functools import wraps import torch.distributed as dist import yaml -from 
.hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.file_check import FileOpen +from atat.pytorch.hook_module.hook_module import HOOKModule +from atat.pytorch.common.utils import torch_device_guard, Const +from atat.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_functional.py b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_functional.py index f5dde41b12e6939fdd6bb274df2123cb52ffd429..8c829904cbe848b7e8d57abb1f5f3a2c0bc6d494 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_functional.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_functional.py @@ -20,15 +20,15 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.log import print_info_log_rank_0 -from ..common.file_check import FileOpen +from atat.pytorch.hook_module.hook_module import HOOKModule +from atat.pytorch.common.utils import torch_device_guard, Const +from atat.pytorch.common.log import logger +from atat.core.common.file_check import FileOpen def remove_dropout(): if torch.__version__ > "1.8": - print_info_log_rank_0("For precision comparison, the probability p in the dropout method is set to 0.") + logger.info_on_rank_0("For precision comparison, the probability p in the dropout method is set to 0.") import torch.nn.functional as F from torch import _VF from torch.overrides import has_torch_function_unary, handle_torch_function diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_npu_custom.py b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_npu_custom.py index e910e609c8379e0c66239755c3ec2a44953ef1ec..90ad9cb9c4f3865a7dd110dcd5701acdcaf5ce64 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_npu_custom.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_npu_custom.py @@ -20,9 +20,9 @@ import torch import torch_npu import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, torch_without_guard_version, Const -from ..common.file_check import FileOpen +from atat.pytorch.hook_module.hook_module import HOOKModule +from atat.pytorch.common.utils import torch_device_guard, torch_without_guard_version, Const +from atat.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_tensor.py b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_tensor.py index bf4a889524de922857b72d5dd387f4e33cb1b8e0..d53291b78faa594b69ac686100a3f48eccce4dc0 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_tensor.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_tensor.py @@ -20,9 +20,9 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, parameter_adapter, Const -from ..common.file_check import FileOpen +from atat.pytorch.hook_module.hook_module import HOOKModule +from atat.pytorch.common.utils import torch_device_guard, parameter_adapter, Const +from atat.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_torch.py b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_torch.py index 
09fa67166c3e2a6eb06be9aaae03d030e925daf3..3cdece23065a7ab830c6125929c7cd4e86bab711 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_torch.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_torch.py @@ -20,9 +20,9 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.file_check import FileOpen +from atat.pytorch.hook_module.hook_module import HOOKModule +from atat.pytorch.common.utils import torch_device_guard, Const +from atat.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_vf.py b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_vf.py index 9303aec6e04c01e0aa969141e33f54c75ebb8ca4..c5f3cb7ee0624757b617b049935b6aabc593ec8c 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_vf.py +++ b/debug/accuracy_tools/atat/pytorch/hook_module/wrap_vf.py @@ -20,9 +20,9 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.file_check import FileOpen -from ..common.utils import torch_device_guard, Const +from atat.pytorch.hook_module.hook_module import HOOKModule +from atat.core.common.file_check import FileOpen +from atat.pytorch.common.utils import torch_device_guard, Const cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") diff --git a/debug/accuracy_tools/atat/pytorch/module_processer.py b/debug/accuracy_tools/atat/pytorch/module_processer.py index fda3d37bc92360fc104d761e78b13cfc793995bc..f56513907c262fbc43d1ce76aace271e04caf944 100644 --- a/debug/accuracy_tools/atat/pytorch/module_processer.py +++ b/debug/accuracy_tools/atat/pytorch/module_processer.py @@ -1,8 +1,8 @@ from functools import wraps import torch from torch.utils.hooks import BackwardHook -from .functional.scope import ModuleRangeScope -from .common.utils import Const +from atat.core.common.utils import Const +from atat.core.data_dump.scope import ModuleRangeScope class ModuleProcesser: diff --git a/debug/accuracy_tools/atat/pytorch/online_dispatch/__init__.py b/debug/accuracy_tools/atat/pytorch/online_dispatch/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..54c15d2dfc6d2ab80ef082d2d0c653c5e2625f59 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/online_dispatch/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2024-2024 Huawei Technologies Co., Ltd. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
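(For context on the signal-handling lines that follow in the new `online_dispatch/__init__.py`: resetting SIGPIPE to `SIG_DFL` makes the process terminate quietly when its stdout pipe closes, e.g. under `... | head`, instead of raising `BrokenPipeError`. A minimal standalone sketch of the same pattern; POSIX-only, since `SIGPIPE` does not exist on Windows:)

```python
# Sketch of the SIGPIPE reset pattern used in the new __init__.py (POSIX-only assumption).
from signal import SIG_DFL, SIGPIPE, signal

signal(SIGPIPE, SIG_DFL)  # restore the default "exit silently" disposition

for i in range(1_000_000):
    print(i)  # piping into `head -n 5` now exits cleanly, without a traceback
```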
+ +from signal import signal, SIGPIPE, SIG_DFL +from .dispatch import PtdbgDispatch +signal(SIGPIPE, SIG_DFL) + + +__all__ = ["PtdbgDispatch"] diff --git a/debug/accuracy_tools/atat/pytorch/online_dispatch/compare.py b/debug/accuracy_tools/atat/pytorch/online_dispatch/compare.py new file mode 100644 index 0000000000000000000000000000000000000000..512849a2a28ae15c6e49a0d9e919052a9a4425d7 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/online_dispatch/compare.py @@ -0,0 +1,231 @@ +# Run the comparison and present the results +import os +import sys +import csv +import json +from rich.table import Table +from rich.console import Console +from .single_compare import single_benchmark_compare_wrap +from .utils import DispatchException, CompareConst +from atat.core.common.file_check import FileOpen +from atat.pytorch.common.log import logger + +ELEMENT_NUM_THRESHOLD = 100 +ZERO_NUM_THRESHOLD = 0.1 +FLOAT_PRECISION = 14 + + +def get_file_content_bytes(file): + with FileOpen(file, 'rb') as file_handle: + return file_handle.read() + + +def get_json_contents(file_path): + ops = get_file_content_bytes(file_path) + try: + json_obj = json.loads(ops) + except ValueError as error: + logger.error('Failed to load "%s". %s' % (file_path, str(error))) + raise DispatchException(1, "failed to load json.") from error + if not isinstance(json_obj, dict): + logger.error('Json file %s, content is not a dictionary!' % file_path) + raise DispatchException(2, "json is empty.") + return json_obj + + +def write_csv(data, filepath): + with FileOpen(filepath, 'a', encoding='utf-8-sig') as f: + writer = csv.writer(f) + writer.writerows(data) + + +class Saver: + # consts for result csv + COLUMN_API_NAME = "API name" + COLUMN_FORWARD_SUCCESS = "Forward Test Success" + COLUMN_BACKWARD_SUCCESS = "Backward Test Success" + COLUMN_STACK_INFO = "Traceback callstack info" + + def __init__(self, save_path, detail_save_path, stack_info): + self.save_path = save_path + self.detail_save_path = detail_save_path + self.stack_info = stack_info + + self.test_result_cnt = { + "forward_fail_num": 0, "backward_fail_num": 0, "forward_and_backward_fail_num": 0, "success_num": 0, + "total_num": 0, "forward_or_backward_fail_num": 0 + } + + def write_csv_title(self): + summary_test_rows = [[self.COLUMN_API_NAME, self.COLUMN_FORWARD_SUCCESS, self.COLUMN_BACKWARD_SUCCESS, "Message"]] + write_csv(summary_test_rows, self.save_path) + + detail_test_rows = [[ + "Npu Name", "Bench Dtype", "NPU Dtype", "Shape", + "error_balance", "max_abs_diff", "max_abs_idx", + "max_rel_diff", "max_rel_idx", "eb_thd", + "error_thd", "Status", "Message" + ]] + write_csv(detail_test_rows, self.detail_save_path) + + def print_pretest_result(self): + self.get_statistics_from_result_csv() + if self.test_result_cnt.get("total_num") != 0: + passing_rate = str(self.test_result_cnt.get("success_num") / + (self.test_result_cnt.get("total_num") + sys.float_info.epsilon)) + else: + passing_rate = "0" + + console = Console() + table_total = Table( + show_header=True, title="Overall Statistics", show_lines=True, width=75 + ) + table_total.add_column("Result") + table_total.add_column("Statistics") + table_total.add_row("[green]Pass[/green]", str(self.test_result_cnt.get("success_num"))) + table_total.add_row("[red]Fail[/red]", str(self.test_result_cnt.get("forward_and_backward_fail_num") + + self.test_result_cnt.get("forward_or_backward_fail_num"))) + table_total.add_row("Passing Rate", passing_rate) + + table_detail = Table( + show_header=True, title="Detail Statistics", show_lines=True, width=75 + ) + 
table_detail.add_column("Result") + table_detail.add_column("Statistics") + table_detail.add_row("Only Forward Fail", str(self.test_result_cnt.get("forward_fail_num"))) + table_detail.add_row("Only Backward Fail", str(self.test_result_cnt.get("backward_fail_num"))) + table_detail.add_row( + "Both Forward & Backward Fail", str(self.test_result_cnt.get("forward_and_backward_fail_num"))) + + console.print(table_total) + console.print(table_detail) + + def get_statistics_from_result_csv(self): + checklist = [CompareConst.TRUE, CompareConst.FALSE, CompareConst.NA, CompareConst.SKIP] + with FileOpen(self.save_path, 'r') as file: + reader = csv.reader(file) + result_csv_rows = [row for row in reader] + result_csv_name = os.path.basename(self.save_path) + for item in result_csv_rows[1:]: + if not isinstance(item, list) or len(item) < 3: + raise ValueError("The number of columns in %s is incorrect" % result_csv_name) + if not all(item[i] and item[i].upper() in checklist for i in (1, 2)): + raise ValueError( + "The value in the 2nd or 3rd column of %s is wrong, it must be TRUE, FALSE, SKIP or N/A" + % result_csv_name) + column1 = item[1].upper() + column2 = item[2].upper() + if column1 == CompareConst.SKIP: + continue + self.test_result_cnt["total_num"] += 1 + if column1 == CompareConst.TRUE and column2 in [CompareConst.TRUE, 'N/A']: + self.test_result_cnt['success_num'] += 1 + elif column1 == CompareConst.FALSE and column2 == CompareConst.FALSE: + self.test_result_cnt['forward_and_backward_fail_num'] += 1 + elif column1 == CompareConst.FALSE: + self.test_result_cnt['forward_fail_num'] += 1 + self.test_result_cnt['forward_or_backward_fail_num'] += 1 + else: + self.test_result_cnt['backward_fail_num'] += 1 + self.test_result_cnt['forward_or_backward_fail_num'] += 1 + + def write_summary_csv(self, test_result): + test_rows = [] + if self.stack_info: + test_rows[0].append(self.COLUMN_STACK_INFO) + + name = test_result[0] + df_row = list(test_result[:3]) + if test_result[1] == "SKIP" or test_result[2] == "SKIP": + df_row.append(test_result[3]) + if self.stack_info: + stack_info = "\n".join(self.stack_info[name]) + df_row.append(stack_info) + test_rows.append(df_row) + write_csv(test_rows, self.save_path) + + def write_detail_csv(self, test_result): + test_rows = [] + + subject_prefix = test_result[0] + fwd_result = test_result[3] + bwd_result = test_result[4] + if isinstance(fwd_result, list): + for i, test_subject in enumerate(fwd_result): + subject = subject_prefix + ".forward.output." + str(i) + test_subject = ["{:.{}f}".format(item, FLOAT_PRECISION) if isinstance(item, float) else item for item in test_subject] + test_rows.append([subject] + list(test_subject)) + if isinstance(bwd_result, list): + for i, test_subject in enumerate(bwd_result): + subject = subject_prefix + ".backward.output." 
+ str(i) + test_subject = ["{:.{}f}".format(item, FLOAT_PRECISION) if isinstance(item, float) else item for item in test_subject] + test_rows.append([subject] + list(test_subject)) + + write_csv(test_rows, self.detail_save_path) + + def record_results(self, *args): + self.write_summary_csv(args) + self.write_detail_csv(args) + + +class Comparator: + + def __init__(self, result_csv_path, details_csv_path, is_continue_run_ut, stack_info_json_path=None): + self.save_path = result_csv_path + self.detail_save_path = details_csv_path + if stack_info_json_path: + self.stack_info = get_json_contents(stack_info_json_path) + else: + self.stack_info = None + self.saver = Saver(result_csv_path, details_csv_path, self.stack_info) + + is_meet_some_condition = (is_continue_run_ut and not os.path.exists(self.save_path) + and not os.path.exists(self.detail_save_path)) + if is_meet_some_condition: + self.saver.write_csv_title() + + @staticmethod + def _compare_core_wrapper(bench_out, npu_out): + detailed_result_total = [] + test_final_success = True + status, details = single_benchmark_compare_wrap(npu_out, bench_out) + if not isinstance(status, list): + detailed_result_total.append(details) + test_final_success = status + else: + for item, item_status in enumerate(status): + detailed_result_total.append(details.get(item, 'key does not exist')) + if not item_status: + test_final_success = False + return test_final_success, detailed_result_total + + @staticmethod + def _compare_dropout(bench_out, npu_out): + tensor_num = bench_out.numel() + if tensor_num >= ELEMENT_NUM_THRESHOLD: + if abs((bench_out == 0).sum() - (npu_out == 0).cpu().sum()) / tensor_num < ZERO_NUM_THRESHOLD: + return True, 1 + else: + return False, 0 + else: + return True, 1 + + def compare_output(self, api_name, bench_out, npu_out, bench_grad=None, npu_grad=None): + if "dropout" in api_name: + is_fwd_success, fwd_compare_alg_results = self._compare_dropout(bench_out, npu_out) + else: + is_fwd_success, fwd_compare_alg_results = self._compare_core_wrapper(bench_out, npu_out) + if bench_grad and npu_grad: + if "dropout" in api_name: + is_bwd_success, bwd_compare_alg_results = self._compare_dropout(bench_grad[0], npu_grad[0]) + else: + is_bwd_success, bwd_compare_alg_results = self._compare_core_wrapper(bench_grad, npu_grad) + else: + is_bwd_success, bwd_compare_alg_results = True, None + if is_bwd_success and bwd_compare_alg_results is None: + self.saver.record_results(api_name, is_fwd_success, CompareConst.NA, fwd_compare_alg_results, + bwd_compare_alg_results) + else: + self.saver.record_results(api_name, is_fwd_success, is_bwd_success, fwd_compare_alg_results, + bwd_compare_alg_results) + return is_fwd_success, is_bwd_success diff --git a/debug/accuracy_tools/atat/pytorch/online_dispatch/dispatch.py b/debug/accuracy_tools/atat/pytorch/online_dispatch/dispatch.py new file mode 100644 index 0000000000000000000000000000000000000000..71566f5adce54b0a2e95a1cf881156d8bd8cf3b0 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/online_dispatch/dispatch.py @@ -0,0 +1,277 @@ +import os +import time +import json +from pathlib import Path +from multiprocessing import Manager, Pool + +import yaml +import torch + +from torch.utils._python_dispatch import TorchDispatchMode + +try: + import torch_npu +except ImportError: + is_npu = False +else: + is_npu = True + +from .dump_compare import dispatch_workflow, dispatch_multiprocess, error_call, TimeStatistics, \ + DispatchRunParam, DisPatchDataInfo +from .utils import get_callstack, data_to_cpu, 
logger_debug, logger_error, logger_warn, logger_logo, get_sys_info, \ + DispatchException +from .compare import Comparator +from atat.core.common.file_check import FileOpen +from atat.pytorch.common.utils import Const +from atat.core.common.utils import CompareConst, check_file_or_directory_path, check_path_before_create + +current_time = time.strftime("%Y%m%d%H%M%S") +RESULT_FILE_NAME = "accuracy_checking_result_" + current_time + ".csv" +DETAILS_FILE_NAME = "accuracy_checking_details_" + current_time + ".csv" + + +class PtdbgDispatch(TorchDispatchMode): + def __init__(self, dump_mode=Const.OFF, api_list=None, debug=False, dump_path=None, tag=None, process_num=0): + super(PtdbgDispatch, self).__init__() + logger_logo() + if not is_npu: + logger_error("Please confirm you run environment installed torch_npu!") + return + + if dump_path is None: + logger_error("Please set dump_path when dump_mode is config!") + check_file_or_directory_path(dump_path, True) + + self.device_id = torch_npu._C._npu_getDevice() + self.dump_mode = dump_mode + self.dump_api_list = api_list + self.debug_flag = debug + self.api_index = 0 + self.single_api_index_dict = {} + self.device_dump_path_cpu = None + self.device_dump_path_npu = None + self.all_summery = [] + self.call_stack_list = [] + self.process_num = process_num + self.filter_dump_api() + self.check_param() + # guarantee file uniqueness + time.sleep(1) + time_now = time.strftime("%Y%m%d%H%M%S", time.localtime(time.time())) + if tag is None or not isinstance(tag, str): + logger_warn('There is not tag or the type of tag is not string.') + dir_name = f'atat_rank{self.device_id}_{time_now}' + else: + dir_name = f'atat_{tag}_rank{self.device_id}_{time_now}' + self.root_path = os.path.join(os.path.realpath(dump_path), dir_name) + self.root_cpu_path = os.path.join(self.root_path, f'cpu') + self.root_npu_path = os.path.join(self.root_path, f'npu') + check_path_before_create(self.root_cpu_path) + check_path_before_create(self.root_npu_path) + Path(self.root_cpu_path).mkdir(mode=0o750, parents=True, exist_ok=True) + Path(self.root_npu_path).mkdir(mode=0o750, parents=True, exist_ok=True) + + self.result_csv_path = os.path.join(self.root_path, RESULT_FILE_NAME) + self.detail_csv_path = os.path.join(self.root_path, DETAILS_FILE_NAME) + self.comparator = Comparator(self.result_csv_path, self.detail_csv_path, False) + + self.aten_ops_blacklist = [] + self.npu_adjust_autogard = [] + yaml_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "torch_ops_config.yaml") + self.load_yaml_file(yaml_path) + + self.lock = None + if process_num > 0: + self.pool = Pool(process_num) + self.lock = Manager().Lock() + self.log_debug_init(debug, process_num) + + def __exit__(self, exc_type, exc_val, exc_tb): + super().__exit__(exc_type, exc_val, exc_tb) + + if not is_npu: + return + logger_debug(f'start write compare csv: Rank[{self.device_id}], Pid[{os.getpid()}') + + if self.process_num > 0: + self.pool.close() + self.pool.join() + summery_path = os.path.join(self.root_cpu_path, f'summery.json') + if not os.path.exists(summery_path): + logger_error("Please check train log, An exception may have occurred!") + return + check_file_or_directory_path(summery_path, False) + fp_handle = open(summery_path, "r") + while True: + json_line_data = fp_handle.readline() + if json_line_data == '\n': + continue + if len(json_line_data) == 0: + break + msg = json.loads(json_line_data) + self.all_summery[msg[0]] = msg[1] + fp_handle.close() + + if self.debug_flag: + input_num = 0 + output_num 
= 0 + total_num = 0 + + for list_data in self.all_summery: + for data in list_data: + logger_debug(f'summery: Device[{self.device_id}], Pid[{os.getpid()}], Data[{data}]') + if "_input" in data[CompareConst.NPU_NAME]: + input_num = input_num + 1 + if "_output" in data[CompareConst.NPU_NAME]: + output_num = output_num + 1 + total_num = total_num + 1 + logger_debug(f'Dispatch exit: Device[{self.device_id}], Pid[{os.getpid()} Input[{input_num}] ' + f'Output[{output_num}] Total[{total_num}] API_Total[{self.api_index}]]') + + def __torch_dispatch__(self, func, types, args=(), kwargs=None): + if not is_npu: + logger_error("Please confirm you run environment installed torch_npu!") + return func(*args, **kwargs) + + func_name_split_list = func.__name__.split(".") + aten_api = func_name_split_list[0] + try: + aten_api_overload_name = func_name_split_list[1] + except IndexError: + logger_error(f"Please check the func name {func.__name__}!") + return func(*args, **kwargs) + + self.enable_autogard(aten_api) + if aten_api in self.aten_ops_blacklist: + npu_out = func(*args, **kwargs) + return npu_out + + call_stack = get_callstack() + self.call_stack_list.append(call_stack) + self.api_index += 1 + if aten_api not in self.single_api_index_dict: + self.single_api_index_dict[aten_api] = 1 + else: + self.single_api_index_dict[aten_api] += 1 + + run_param = self.get_run_param(aten_api, func.__name__, aten_api_overload_name) + + if self.debug_flag: + logger_debug(f'Dispatch Info: Rank[{self.device_id}], Pid[{os.getpid()}], Func[{func.__name__}], ' + f'Name[{run_param.aten_api}_{run_param.single_api_index}], ' + f'Count[{self.api_index}], Sys[{get_sys_info()}]') + + cpu_args = [] + cpu_kwargs = [] + data_to_cpu(args, 0, cpu_args) + data_to_cpu(kwargs, 0, cpu_kwargs) + cpu_args = cpu_args[0] + cpu_kwargs = cpu_kwargs[0] + + with TimeStatistics("NPU RUN", run_param): + npu_out = func(*args, **kwargs) + npu_out_cpu = [] + data_to_cpu(npu_out, 0, npu_out_cpu) + npu_out_cpu = npu_out_cpu[0] + + with TimeStatistics("CPU RUN", run_param): + cpu_out = func(*cpu_args, **cpu_kwargs) + + if isinstance(cpu_out, torch.Tensor) and cpu_out.dtype in [torch.bfloat16, torch.float16, torch.half]: + cpu_out = cpu_out.float() + + if self.process_num == 0: + self.all_summery.append([]) + dispatch_data_info = DisPatchDataInfo(cpu_args, cpu_kwargs, self.all_summery, func, npu_out_cpu, + cpu_out, self.lock) + dispatch_workflow(run_param, dispatch_data_info) + else: + self.lock.acquire() + self.all_summery.append([]) + self.lock.release() + run_param.process_flag = True + if self.check_fun(func, run_param): + dispatch_data_info = DisPatchDataInfo(cpu_args, cpu_kwargs, self.all_summery, None, npu_out_cpu, + cpu_out, self.lock) + self.pool.apply_async(func=dispatch_multiprocess, + args=(run_param, dispatch_data_info), + error_callback=error_call) + else: + logger_error("can not get correct function please set process_num=0") + return npu_out + + @staticmethod + def check_fun(func, run_param): + if hasattr(torch.ops.aten, run_param.aten_api): + aten_func = getattr(torch.ops.aten, run_param.aten_api) + if hasattr(aten_func, run_param.aten_api_overload_name): + aten_overload_func = getattr(aten_func, run_param.aten_api_overload_name) + if id(aten_overload_func) == id(func): + run_param.func_namespace = "aten" + return True + return False + + def load_yaml_file(self, file_path): + with FileOpen(file_path, 'r') as f: + yaml_file = yaml.safe_load(f) + self.aten_ops_blacklist = yaml_file.get('aten_ops_blacklist') + self.npu_adjust_autogard = 
yaml_file.get('npu_adjust_autogard') + + def log_debug_init(self, debug, process_num): + if debug: + logger_debug(f'Main pid:{os.getpid()} device:{self.device_id} dump_list:{self.dump_api_list} ' + f'dump_mode:{self.dump_mode} cpu_path[{self.root_cpu_path}], npu_path[{self.root_npu_path}], ' + f'process[{process_num}]') + + def filter_dump_api(self): + if self.dump_mode != Const.LIST or not self.dump_api_list: + self.dump_api_list = [] + return + aten_api_list = dir(torch.ops.aten) + dump_api_list = [] + for aten_api in self.dump_api_list: + if aten_api in aten_api_list: + dump_api_list.append(aten_api) + else: + logger_warn(f'{aten_api} is not aten api will not dump, please refer to torch.ops.aten') + self.dump_api_list = dump_api_list + + def get_run_param(self, aten_api, func_name, aten_api_overload_name): + run_param = DispatchRunParam(self.debug_flag, self.device_id, self.root_npu_path, self.root_cpu_path, + self.process_num, self.comparator) + run_param.dump_flag, run_param.auto_dump_flag = self.get_dump_flag(aten_api) + run_param.func_name = func_name + run_param.aten_api = aten_api + run_param.aten_api_overload_name = aten_api_overload_name + run_param.single_api_index = self.single_api_index_dict[aten_api] + run_param.api_index = self.api_index + return run_param + + def get_dump_flag(self, aten_api): + dump_flag = False + auto_dump_flag = False + if self.dump_mode == Const.ALL: + dump_flag = True + if self.dump_mode == Const.LIST and aten_api in self.dump_api_list: + dump_flag = True + if self.dump_mode == Const.AUTO: + auto_dump_flag = True + return dump_flag, auto_dump_flag + + def check_param(self): + if self.dump_mode not in Const.ONLINE_DUMP_MODE: + logger_error('The parameter "dump mode" can only be one of {}.'.format(Const.ONLINE_DUMP_MODE)) + raise DispatchException(DispatchException.INVALID_PARAMETER) + if not isinstance(self.dump_api_list, list): + logger_error('The type of parameter "api_list" can only be list.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + if not isinstance(self.debug_flag, bool): + logger_error('The type of parameter "debug" can only be bool.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + if not isinstance(self.process_num, int) or self.process_num < 0: + logger_error('The type of parameter "process_num" can only be int and it should not be less than 0.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + + def enable_autogard(self, aten_api): + if aten_api in self.npu_adjust_autogard: + torch._C._dispatch_tls_set_dispatch_key_excluded(torch._C.DispatchKey.AutogradFunctionality, False) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/online_dispatch/dump_compare.py b/debug/accuracy_tools/atat/pytorch/online_dispatch/dump_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..38b62e95ee2fc80389b5e1972bcd000c667fa9e8 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/online_dispatch/dump_compare.py @@ -0,0 +1,187 @@ +import os +import json +import copy +from datetime import datetime, timezone + +import pandas as pd +import torch +from atat.pytorch.common.utils import Const +from .utils import np_save_data, logger_debug, logger_error, logger_warn, logger_user, COLOR_RED, COLOR_GREEN, \ + COLOR_RESET, CSV_COLUMN_NAME +from atat.core.common.file_check import FileOpen, change_mode, FileCheckConst +from atat.core.common.utils import CompareConst +from atat.core.common.log import logger + +class DispatchRunParam: + def __init__(self, debug_flag, 
device_id, root_npu_path, root_cpu_path, process_num, comparator): + # static parameters are initialized by constructors, and dynamic parameters are constructed at run time + self.debug_flag = debug_flag + self.device_id = device_id + self.root_npu_path = root_npu_path + self.root_cpu_path = root_cpu_path + self.process_num = process_num + self.process_flag = False + self.func_name = None + self.func_namespace = None + self.aten_api = None + self.aten_api_overload_name = None + self.single_api_index = None + self.api_index = None + self.dump_flag = None + self.auto_dump_flag = None + self.comparator = comparator + + +class DisPatchDataInfo: + def __init__(self, cpu_args, cpu_kwargs, all_summery, func, npu_out_cpu, cpu_out, lock): + self.cpu_args = cpu_args + self.cpu_kwargs = cpu_kwargs + self.all_summery = all_summery + self.func = func + self.npu_out_cpu = npu_out_cpu + self.cpu_out = cpu_out + self.lock = lock + + +class TimeStatistics: + def __init__(self, name_tag, run_param, timeout=5): + self.debug = run_param.debug_flag + if self.debug: + self.fun = run_param.func_name + self.device = run_param.device_id + self.process = run_param.process_num + self.index = run_param.single_api_index + self.tag = name_tag + self.timeout = timeout + self.time = None + + def __enter__(self): + if self.debug: + self.time = datetime.now(tz=timezone.utc) + logger_debug(f'Time[{self.tag}]-ENTER: Dev[{self.device}], Pid[{os.getpid()}], Fun[{self.fun}], ' \ + f'Id[{self.index}]') + + def __exit__(self, exc_type, exc_val, exc_tb): + if self.debug: + cost_time = datetime.now(tz=timezone.utc) - self.time + time_cost = f'Time[{self.tag}]-EXIT: Dev[{self.device}], Pid[{os.getpid()}], Fun[{self.fun}], ' \ + f'Id[{self.index}], time[{cost_time}]' + hot_time_cost = "Hotspot " + time_cost + + if cost_time.total_seconds() > self.timeout: + logger_debug(hot_time_cost) + else: + logger_debug(time_cost) + + +def support_basic_type(data): + if isinstance(data, (bool, int, float, torch.Tensor)): + return True + return False + + +def dump_data(data, prefix, dump_path): + if isinstance(data, (tuple, list)) and data: + for i, item in enumerate(data): + dump_data(item, "{}.{}".format(prefix, i), dump_path) + return + elif support_basic_type(data): + if isinstance(data, torch.Tensor) and data.is_meta: + return + # dump data may greater than summery_list collect + np_save_data(data, prefix, dump_path) + + +def save_temp_summery(api_index, single_api_summery, path, lock): + summery_path = os.path.join(path, f'summery.json') + lock.acquire() + with FileOpen(summery_path, "a") as f: + json.dump([api_index, single_api_summery], f) + f.write('\n') + lock.release() + + +def dispatch_workflow(run_param: DispatchRunParam, data_info: DisPatchDataInfo): + cpu_args, cpu_kwargs = data_info.cpu_args, data_info.cpu_kwargs + all_summery, func = data_info.all_summery, data_info.func + npu_out_cpu, cpu_out, lock = data_info.npu_out_cpu, data_info.cpu_out, data_info.lock + single_api_summery = [] + + prefix_input = f'{run_param.aten_api}_{run_param.single_api_index}_input' + prefix_output = f'{run_param.aten_api}_{run_param.single_api_index}_output' + + accuracy_reached = False + with TimeStatistics("COMPARE OUTPUT", run_param): + run_param.comparator.compare_output(prefix_output, cpu_out, npu_out_cpu, None, None) + + # user set dump or auto mode will dump + if run_param.dump_flag or (run_param.auto_dump_flag and not accuracy_reached): + with TimeStatistics("DUMP INPUT", run_param): + dump_data(cpu_args, prefix_input, run_param.root_npu_path) + 
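+            # inputs are dumped once, under the npu directory: the cpu rerun consumes the
+            # same cpu-side tensors, so a separate copy under the cpu directory is not kept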
if len(cpu_kwargs) > 0: + for k, v in cpu_kwargs.items(): + kwargs_prefix_name = prefix_input + f'_{k}' + dump_data(v, kwargs_prefix_name, run_param.root_npu_path) + + with TimeStatistics("DUMP OUTPUT", run_param): + dump_data(cpu_out, prefix_output, run_param.root_cpu_path) + dump_data(npu_out_cpu, prefix_output, run_param.root_npu_path) + + if run_param.process_num == 0: + all_summery[run_param.api_index - 1] = copy.deepcopy(single_api_summery) + else: + save_temp_summery(run_param.api_index - 1, single_api_summery, run_param.root_cpu_path, lock) + + +def get_torch_func(run_param): + if hasattr(torch.ops, run_param.func_namespace): + ops_func = getattr(torch.ops, run_param.func_namespace) + if hasattr(ops_func, run_param.aten_api): + ops_aten_func = getattr(ops_func, run_param.aten_api) + if hasattr(ops_aten_func, run_param.aten_api_overload_name): + ops_aten_overlaod_func = getattr(ops_aten_func, run_param.aten_api_overload_name) + return ops_aten_overlaod_func + return None + + +def dispatch_multiprocess(run_param, dispatch_data_info): + torch_func = get_torch_func(run_param) + if torch_func is None: + logger.error(f'can not find suitable call api:{run_param.aten_api}') + else: + dispatch_data_info.func = torch_func + dispatch_workflow(run_param, dispatch_data_info) + + +def error_call(err): + logger.error(f'multiprocess {err}') + + +def save_csv(all_summery, call_stack_list, csv_path): + df = pd.DataFrame(columns=CSV_COLUMN_NAME) + + for index, list_data in enumerate(all_summery): + for data in list_data: + csv_row_data = {CompareConst.NPU_NAME: data[CompareConst.NPU_NAME], + CompareConst.BENCH_NAME: data[CompareConst.BENCH_NAME], + CompareConst.NPU_DTYPE: data[CompareConst.NPU_DTYPE], + CompareConst.BENCH_DTYPE: data[CompareConst.BENCH_DTYPE], + CompareConst.NPU_SHAPE: data[CompareConst.NPU_SHAPE], + CompareConst.BENCH_SHAPE: data[CompareConst.BENCH_SHAPE], + CompareConst.NPU_MAX: data[CompareConst.NPU_MAX], + CompareConst.NPU_MIN: data[CompareConst.NPU_MIN], + CompareConst.NPU_MEAN: data[CompareConst.NPU_MEAN], + CompareConst.BENCH_MAX: data[CompareConst.BENCH_MAX], + CompareConst.BENCH_MIN: data[CompareConst.BENCH_MIN], + CompareConst.BENCH_MEAN: data[CompareConst.BENCH_MEAN], + CompareConst.COSINE: data[CompareConst.COSINE], + CompareConst.MAX_ABS_ERR: data[CompareConst.MAX_ABS_ERR], + CompareConst.MAX_RELATIVE_ERR: data[CompareConst.MAX_RELATIVE_ERR], + CompareConst.ACCURACY: data[CompareConst.ACCURACY], + CompareConst.STACK: call_stack_list[index], + CompareConst.ERROR_MESSAGE: data[CompareConst.ERROR_MESSAGE]} + row_df = pd.DataFrame.from_dict(csv_row_data, orient='index').T + df = pd.concat([df, row_df]) + + df.to_csv(csv_path, index=False) + change_mode(csv_path, FileCheckConst.DATA_FILE_AUTHORITY) diff --git a/debug/accuracy_tools/atat/pytorch/online_dispatch/single_compare.py b/debug/accuracy_tools/atat/pytorch/online_dispatch/single_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..04b29a97b11daae3b720568b1647deac95800d53 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/online_dispatch/single_compare.py @@ -0,0 +1,375 @@ +import logging +from functools import wraps +import torch +from prettytable import PrettyTable +from collections import namedtuple +from .utils import logger_user + +def func_log_wrapper(): + def _out_wrapper(func): + @wraps(func) + def _in_wrapper(*kargs, **kwargs): + logging.debug(f"start to run: {func.__name__}") + x = func(*kargs, **kwargs) + logging.debug(f"end to run: {func.__name__}") + return x + + return 
_in_wrapper
+
+    return _out_wrapper
+
+
+class SingleBenchmarkCompareStandard:
+    def __init__(self, high_precision=True):
+        self.high_precision = high_precision
+        self.small_value = 1.0
+        self.error_thd = {torch.float16: [2 ** -11, 2 ** -7],
+                          torch.bfloat16: [2 ** -8, 2 ** -6],
+                          torch.float32: [2 ** -14, 2 ** -11],
+                          torch.float64: [2 ** -14, 2 ** -11]}
+        self.eb_thd = {torch.float16: 2 ** -10,
+                       torch.bfloat16: 2 ** -7,
+                       torch.float32: 2 ** -14,
+                       torch.float64: 2 ** -14}
+
+    def get_error_thd(self, dtype):
+        if dtype in self.error_thd.keys():
+            if dtype == torch.float64:
+                logging.warning("the output data of fp64 uses the same standard as fp32.")
+            return self.error_thd.get(dtype)[0] if self.high_precision else self.error_thd.get(dtype)[1]
+        logging.error(
+            "Single benchmark compare only supports floating point "
+            "in fp16, bf16, fp32. "
+        )
+        return None
+
+    def get_eb_thd(self, dtype):
+        if dtype in self.eb_thd.keys():
+            return self.eb_thd.get(dtype)
+        return None
+
+
+class SingleBenchmarkAccuracyResult:
+    def __init__(
+        self,
+        result=True,
+        error_balance=None,
+        max_abs_diff=None,
+        max_abs_idx=None,
+        max_rel_diff=None,
+        max_rel_idx=None
+    ):
+        self.result = result
+        self.error_balance = error_balance
+        self.max_abs_diff = max_abs_diff
+        self.max_abs_idx = max_abs_idx
+        self.max_rel_diff = max_rel_diff
+        self.max_rel_idx = max_rel_idx
+
+    def get_result(self, eb_thd, error_thd):
+        if (
+            self.error_balance > eb_thd
+            or self.max_abs_diff > error_thd
+            or self.max_rel_diff > error_thd
+        ):
+            self.result = False
+        else:
+            self.result = True
+
+
+class SingleBenchmarkAccuracyCompare:
+    @classmethod
+    @func_log_wrapper()
+    def check_output_size(cls, npu_out, bench_out):
+        acc_result = None
+        if npu_out.numel() == 0 and bench_out.numel() == 0:
+            info = (
+                "The npu_output is [], and it is same as benchmark_output, "
+                "the result of data_compare is Pass"
+            )
+            logging.debug(info)
+            acc_result = SingleBenchmarkAccuracyResult(result=True)
+
+        if npu_out.size() != bench_out.size():
+            error_info = (
+                f"the size of npu output[{npu_out.size()}] and "
+                f"benchmark[{bench_out.size()}] is not equal"
+            )
+
+            logging.error(error_info)
+            acc_result = SingleBenchmarkAccuracyResult(result=False)
+        return acc_result
+
+    @classmethod
+    @func_log_wrapper()
+    def check_output_invalid_value(cls, output):
+        has_nan = torch.isnan(output).any()
+        has_inf = torch.isinf(output).any()
+        return has_nan or has_inf
+
+    @classmethod
+    @func_log_wrapper()
+    def precision_compare_for_case(cls, npu_out, bench_out, benchmark_standard: SingleBenchmarkCompareStandard):
+        error_thd = None
+        eb_thd = None
+        acc_result = cls.check_output_size(npu_out, bench_out)
+        CompareResultInfo = namedtuple("CompareResultInfo",
+                                       ['accuracy_result', 'error_threshold', 'eb_threshold', 'failed_information'])
+
+        if acc_result:
+            failed_info = "比对数据的shape不一致"
+            compare_result_info = CompareResultInfo(acc_result, error_thd, eb_thd, failed_info)
+            return compare_result_info
+
+        if cls.check_output_invalid_value(bench_out):
+            logging.info("The benchmark result contains nan/inf value. ")
+            failed_info = "标杆结果存在nan值或inf值, 依照单标杆标准该用例通过"
+            acc_result = SingleBenchmarkAccuracyResult(result=True)
+            compare_result_info = CompareResultInfo(acc_result, error_thd, eb_thd, failed_info)
+            return compare_result_info
+
+        if cls.check_output_invalid_value(npu_out):
+            logging.info("The NPU result contains nan/inf value. 
") + failed_info = "NPU结果存在nan值或inf值, 依照单标杆标准该用例不通过" + acc_result = SingleBenchmarkAccuracyResult(result=False) + compare_result_info = CompareResultInfo(acc_result, error_thd, eb_thd, failed_info) + return compare_result_info + + data_type = npu_out.dtype + if data_type not in [torch.float16, torch.float32, torch.float64, torch.bfloat16]: + acc_result = cls.compute_binary_diff(npu_out, bench_out) + else: + error_thd = benchmark_standard.get_error_thd(data_type) + eb_thd = benchmark_standard.get_eb_thd(data_type) + if error_thd is None: + logging.error( + "single benchmark not support the comparison of %s", str(data_type) + ) + acc_result = SingleBenchmarkAccuracyResult(result=False) + else: + if npu_out.dtype in [torch.float16, torch.bfloat16] and bench_out.dtype in [torch.float32]: + npu_out = npu_out.to(torch.float32) + error_balance = cls.compute_error_balance(npu_out, bench_out, benchmark_standard) + max_abs_diff, max_abs_idx = cls.compute_abs_diff(npu_out, bench_out, error_thd, benchmark_standard) + max_rel_diff, max_rel_idx = cls.compute_rel_diff(npu_out, bench_out, error_thd, benchmark_standard) + acc_result = SingleBenchmarkAccuracyResult( + error_balance=error_balance, + max_abs_diff=max_abs_diff, + max_abs_idx=max_abs_idx, + max_rel_diff=max_rel_diff, + max_rel_idx=max_rel_idx + ) + acc_result.get_result(eb_thd, error_thd) + compare_result_info = CompareResultInfo(acc_result, error_thd, eb_thd, None) + return compare_result_info + return None + + @classmethod + @func_log_wrapper() + def compute_binary_diff(cls, npu_out, bench_out): + result = torch.equal(npu_out, bench_out) + if result: + logger_user("二进制精度比对通过, 无需单标杆比对法验证") + return SingleBenchmarkAccuracyResult(result=result, max_abs_diff=0, max_rel_diff=0, error_balance=0) + + @classmethod + @func_log_wrapper() + def compute_error_balance(cls, npu_out, bench_out, benchmark_standard: SingleBenchmarkCompareStandard): + ones = torch.ones_like(npu_out) + zeros = torch.zeros_like(npu_out) + abs_mask_idx = torch.where(torch.abs(bench_out) < benchmark_standard.small_value, ones, zeros) + abs_mask_idx = abs_mask_idx.type(torch.bool) + diff_value = torch.subtract(npu_out, bench_out) + diff_value_rel = diff_value / (torch.abs(bench_out) + torch.finfo(torch.float).eps ) + rel_and_abs = torch.where(abs_mask_idx, diff_value, diff_value_rel) + eb_float = float(torch.mean(rel_and_abs)) + return eb_float + + @classmethod + @func_log_wrapper() + def compute_abs_diff(cls, npu_out, bench_out, error_thd, benchmark_standard: SingleBenchmarkCompareStandard): + max_abs_diff = 0 + max_abs_idx = None + + ones = torch.ones_like(npu_out) + zeros = torch.zeros_like(npu_out) + diff_value = torch.subtract(npu_out, bench_out) + diff_abs = torch.abs(diff_value) + abs_mask_idx = torch.where(torch.abs(bench_out) < benchmark_standard.small_value, ones, zeros) + abs_err_idx = torch.where(diff_abs > error_thd, ones, zeros) + abs_err_idx = abs_err_idx * abs_mask_idx + abs_err = diff_abs[torch.where(abs_err_idx == 1)] + + if len(abs_err) > 0: + err_for_max = torch.where(abs_err_idx == 1, diff_abs, zeros) + logging.debug("err_for_max for abs %s", err_for_max) + max_abs_idx = torch.argmax(err_for_max) + max_abs_diff = diff_abs[max_abs_idx] + elif torch.sum(abs_mask_idx) > 0: + err_for_max = torch.where(abs_mask_idx == 1, diff_abs, zeros) + logging.debug("error_for_max for abs %s", err_for_max) + max_abs_idx = torch.argmax(err_for_max) + if err_for_max.max() != 0: + max_abs_diff = diff_abs[max_abs_idx] + return (float(max_abs_diff), int(max_abs_idx) if 
torch.is_tensor(max_abs_idx) else max_abs_idx)
+
+    @classmethod
+    @func_log_wrapper()
+    def compute_rel_diff(cls, npu_out, bench_out, error_thd, benchmark_standard: SingleBenchmarkCompareStandard):
+        max_rel_diff = 0
+        max_rel_idx = None
+
+        ones = torch.ones_like(npu_out)
+        zeros = torch.zeros_like(npu_out)
+        diff_value = torch.subtract(npu_out, bench_out)
+        diff_abs = torch.abs(diff_value)
+
+        rel_mask_idx = torch.where(torch.abs(bench_out) >= benchmark_standard.small_value, ones, zeros)
+        rel_err = diff_abs / (torch.abs(bench_out) + torch.finfo(torch.float).eps)
+        diff_rel = rel_err
+        rel_err_idx = torch.where(rel_err > error_thd, ones, zeros)
+        rel_err_idx = rel_err_idx * rel_mask_idx
+        rel_err = rel_err[torch.where(rel_err_idx == 1)]
+        if len(rel_err) > 0:
+            err_for_max = torch.where(rel_err_idx == 1, diff_rel, zeros)
+            logging.debug("err_for_max for rel %s", err_for_max)
+            max_rel_idx = torch.argmax(err_for_max)
+            max_rel_diff = diff_rel[max_rel_idx]
+        elif torch.sum(rel_mask_idx) > 0:
+            err_for_max = torch.where(rel_mask_idx == 1, diff_rel, zeros)
+            logging.debug("err_for_max for rel %s", err_for_max)
+            max_rel_idx = torch.argmax(err_for_max)
+            if torch.sum(err_for_max) != 0:
+                max_rel_diff = diff_rel[max_rel_idx]
+        return (float(max_rel_diff), int(max_rel_idx) if torch.is_tensor(max_rel_idx) else max_rel_idx)
+
+
+class SingleBenchSummary:
+    def __init__(self, precision_result: SingleBenchmarkAccuracyResult, npu_dtype=None,
+                 bench_dtype=None, shape=None, error_thd=None, eb_thd=None, failed_info=None):
+        self.npu_dtype = npu_dtype
+        self.bench_dtype = bench_dtype
+        self.shape = shape
+        self.result = precision_result.result
+        self.error_balance = precision_result.error_balance
+        self.max_abs_diff = precision_result.max_abs_diff
+        self.max_abs_idx = precision_result.max_abs_idx
+        self.max_rel_diff = precision_result.max_rel_diff
+        self.max_rel_idx = precision_result.max_rel_idx
+        self.eb_thd = eb_thd
+        self.error_thd = error_thd
+        self.failed_info = failed_info
+
+    def get_check_result(self):
+        if self.result:
+            return "PASS"
+        else:
+            return "FAILED"
+
+    def get_result_msg(self):
+        result_str = ""
+        if self.failed_info:
+            return self.failed_info
+
+        if self.result:
+            result_str += "误差均衡性EB: %s <= 阈值%s\n" % (self.error_balance, self.eb_thd)
+            result_str += "最大绝对误差: %s <= 阈值%s\n" % (self.max_abs_diff, self.error_thd)
+            result_str += "最大相对误差: %s <= 阈值%s\n" % (self.max_rel_diff, self.error_thd)
+        else:
+            if self.error_balance > self.eb_thd:
+                result_str += "误差均衡性EB超过阈值%s: EB = %s\n" % (
+                    self.eb_thd,
+                    self.error_balance,
+                )
+            if self.max_abs_diff > self.error_thd:
+                result_str += "小值域最大绝对误差超过阈值%s: idx = %s, 绝对误差 = %s\n" % (
+                    self.error_thd,
+                    self.max_abs_idx,
+                    self.max_abs_diff
+                )
+            if self.max_rel_diff > self.error_thd:
+                result_str += "大值域最大相对误差超过阈值%s: idx = %s, 相对误差 = %s\n" % (
+                    self.error_thd,
+                    self.max_rel_idx,
+                    self.max_rel_diff,
+                )
+        return result_str
+
+    def print_detail_table(self):
+        table = PrettyTable()
+        table.title = "Single Benchmark Metrics Info"
+        table.field_names = ["Index", "Result", "Threshold"]
+        table.add_row(["error_balance", self.error_balance, self.eb_thd])
+        table.add_row(["max_abs_diff", self.max_abs_diff, self.error_thd])
+        table.add_row(["max_abs_idx", self.max_abs_idx, "-"])
+        table.add_row(["max_rel_diff", self.max_rel_diff, self.error_thd])
+        table.add_row(["max_rel_idx", self.max_rel_idx, "-"])
+
+        logger_user(table)
+
+    def to_column_value(self):
+        return [self.bench_dtype, self.npu_dtype, self.shape, self.error_balance, 
self.max_abs_diff, self.max_abs_idx, self.max_rel_diff, self.max_rel_idx,
+                self.eb_thd, self.error_thd, self.result, self.failed_info]
+
+
+def single_benchmark_compare(npu_out: torch.Tensor, bench_out: torch.Tensor, high_precision: bool = True):
+    benchmark_standard = SingleBenchmarkCompareStandard(high_precision)
+    npu_out = npu_out.flatten()
+    bench_out = bench_out.flatten()
+
+    compare_results = SingleBenchmarkAccuracyCompare.precision_compare_for_case(npu_out, bench_out, benchmark_standard)
+    (
+        precision_result,
+        error_thd,
+        eb_thd,
+        failed_info
+    ) = (compare_results.accuracy_result, compare_results.error_threshold,
+         compare_results.eb_threshold, compare_results.failed_information)
+
+    summary = SingleBenchSummary(precision_result, str(npu_out.dtype), str(bench_out.dtype), tuple(npu_out.shape), error_thd, eb_thd, failed_info)
+    result = summary.result
+    details = summary.to_column_value()
+    return result, details
+
+
+def single_benchmark_compare_wrap(npu_out: torch.Tensor, bench_out: torch.Tensor, high_precision=True):
+    result = SingleBenchmarkAccuracyResult(result=True)
+    summary = SingleBenchSummary(result)
+    if isinstance(bench_out, (list, tuple)):
+        status, details = [], []
+        if len(bench_out) != len(npu_out):
+            summary.result = False
+            summary.failed_info = "bench and npu output structure is different."
+            return False, summary.to_column_value()
+        for b_out_i, n_out_i in zip(bench_out, npu_out):
+            status_i, details_i = single_benchmark_compare_wrap(n_out_i, b_out_i, high_precision)
+            status.append(status_i)
+            details.append(details_i)
+    elif isinstance(bench_out, dict):
+        b_keys, n_keys = set(bench_out.keys()), set(npu_out.keys())
+        if b_keys != n_keys:
+            summary.result = False
+            summary.failed_info = "bench and npu output dict keys are different."
+            return False, summary.to_column_value()
+        else:
+            status, details = single_benchmark_compare_wrap(list(npu_out.values()), list(bench_out.values()),
+                                                            high_precision)
+    elif isinstance(bench_out, torch.Tensor):
+        status, details = single_benchmark_compare(npu_out, bench_out, high_precision)
+    elif isinstance(bench_out, (bool, int, float, str)):
+        summary.bench_dtype = str(type(bench_out))
+        summary.npu_dtype = str(type(npu_out))
+        status = bench_out == npu_out
+        summary.result = status
+        return status, summary.to_column_value()
+    elif bench_out is None:
+        summary.result = True
+        summary.failed_info = "Output is None."
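+        # nothing to compare when the api returns no output, so this counts as a pass;
+        # the reason is kept in failed_info for the detail csv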
+ return True, summary.to_column_value() + else: + summary.result = True + summary.failed_info = "Unexpected output type: {}".format(type(bench_out)) + return True, summary.to_column_value() + return status, details + + diff --git a/debug/accuracy_tools/atat/pytorch/online_dispatch/torch_ops_config.yaml b/debug/accuracy_tools/atat/pytorch/online_dispatch/torch_ops_config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..789ae2a7a7b8a3bc05ea6a073cbbf2875de2bc59 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/online_dispatch/torch_ops_config.yaml @@ -0,0 +1,50 @@ +aten_ops_blacklist: + - _cudnn_rnn + - _local_scalar_dense + - _pin_memory + - _to_copy + - _unsafe_view + - clone + - contiguous + - copy_ + - cudnn_batch_norm + - cudnn_batch_norm_backward + - detach + - empty + - index_put_ + - lift_fresh + - max_pool2d_with_indices_backward # shape unmatch + - native_batch_norm_backward + - new_empty + - new_empty_strided + - new_full + - new_ones + - new_zeros + - ones + - ones_like + - permute + - rand + - rand_like + - randint + - randint_like + - randn + - randn_like + - randperm + - scalar_tensor + - select + - to + - transpose + - unbind + - view + - zero + - zero_ + - zeros + - zeros_like + +npu_adjust_autogard: + - adaptive_avg_pool2d + - batch_norm + - log_softmax + - nll_loss + - to + \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/online_dispatch/utils.py b/debug/accuracy_tools/atat/pytorch/online_dispatch/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..1f9c2e916c187615160bfb1be64a262b2cd6bd95 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/online_dispatch/utils.py @@ -0,0 +1,198 @@ +import os +import inspect +import logging +import psutil +import torch +import numpy as np + +try: + import torch_npu +except ImportError: + pta_cpu_device = None +else: + pta_cpu_device = torch.device("cpu") + +from atat.core.common.utils import CompareConst +from atat.core.common.file_check import change_mode, FileCheckConst + +cpu_device = torch._C.device("cpu") +COLOR_RED = '\033[31m' +COLOR_GREEN = '\033[32m' +COLOR_YELLOW = '\033[33m' +COLOR_BLUE = '\033[34m' +COLOR_PURPLE = '\033[35m' +COLOR_CYAN = '\033[36m' +COLOR_GRAY = '\033[37m' +COLOR_RESET = '\033[0m' + +COMPARE_LOGO = ''' + _ _ + ___ _ __ | (_)_ __ ___ ___ ___ _ __ ___ _ __ __ _ _ __ ___ + / _ \\| '_ \\| | | '_ \\ / _ \\ / __/ _ \\| '_ ` _ \\| '_ \\ / _` | '__/ _ \\ +| (_) | | | | | | | | | __/ | (_| (_) | | | | | | |_) | (_| | | | __/ + \\___/|_| |_|_|_|_| |_|\\___| \\___\\___/|_| |_| |_| .__/ \\__,_|_| \\___| + |_| +''' + +CSV_COLUMN_NAME = [CompareConst.NPU_NAME, + CompareConst.BENCH_NAME, + CompareConst.NPU_DTYPE, + CompareConst.BENCH_DTYPE, + CompareConst.NPU_SHAPE, + CompareConst.BENCH_SHAPE, + CompareConst.NPU_MAX, + CompareConst.NPU_MIN, + CompareConst.NPU_MEAN, + CompareConst.BENCH_MAX, + CompareConst.BENCH_MIN, + CompareConst.BENCH_MEAN, + CompareConst.COSINE, + CompareConst.MAX_ABS_ERR, + CompareConst.MAX_RELATIVE_ERR, + CompareConst.ACCURACY, + CompareConst.STACK, + CompareConst.ERROR_MESSAGE] + +FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] +BOOL_TYPE = [bool, np.uint8] +INT_TYPE = [np.int32, np.int64] + + +class CompareConst: + NAN = np.nan + NA = "N/A" + PASS = 'pass' + WARNING = 'warning' + ERROR = 'error' + SKIP = 'SKIP' + TRUE = 'TRUE' + FALSE = 'FALSE' + + +def get_callstack(): + callstack = [] + for (_, path, line, func, code, _) in inspect.stack()[2:]: + if code: + 
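+            # inspect.stack() returns the source context as a list of lines; keep just
+            # the first line, stripped, so each frame stays a single row of the call stack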
stack_line = [path, str(line), func, code[0].strip() if code else code] + else: + stack_line = [path, str(line), func, code] + callstack.append(stack_line) + return callstack + + +def np_save_data(data, file_name, data_path): + try: + if hasattr(data, "numpy"): + data = data.numpy() + dump_path = os.path.join(data_path, f'{file_name}.npy') + np.save(dump_path, data) + change_mode(dump_path, FileCheckConst.DATA_FILE_AUTHORITY) + except Exception as e: + logger_error("save numpy failed, error: {}".format(e)) + finally: + pass + + +def data_to_cpu(data, deep, data_cpu): + global cpu_device + list_cpu = [] + if isinstance(data, torch.Tensor): + if data.device == cpu_device or data.device == pta_cpu_device: + tensor_copy = data.clone().detach() + else: + tensor_copy = data.cpu().detach() + if tensor_copy.dtype in [torch.float16, torch.half, torch.bfloat16]: + tensor_copy = tensor_copy.float() + + if deep == 0: + data_cpu.append(tensor_copy) + return tensor_copy + elif isinstance(data, list): + for v in data: + list_cpu.append(data_to_cpu(v, deep + 1, data_cpu)) + if deep == 0: + data_cpu.append(list_cpu) + return list_cpu + elif isinstance(data, tuple): + for v in data: + list_cpu.append(data_to_cpu(v, deep + 1, data_cpu)) + tuple_cpu = tuple(list_cpu) + if deep == 0: + data_cpu.append(tuple_cpu) + return tuple_cpu + elif isinstance(data, dict): + dict_cpu = {} + for k, v in data.items(): + dict_cpu[k] = data_to_cpu(v, deep + 1, data_cpu) + if deep == 0: + data_cpu.append(dict_cpu) + return dict_cpu + elif isinstance(data, torch._C.device): + return cpu_device + else: + if deep == 0: + data_cpu.append(data) + return data + + +def get_mp_logger(): + logger = logging.getLogger(__name__) + if not logger.handlers: + logger.setLevel(logging.INFO) + handler = logging.StreamHandler() + formatter = logging.Formatter('%(asctime)s %(message)s') + logger.propagate = True + handler.setFormatter(formatter) + logger.addHandler(handler) + return logger.info + + +def logger_debug(mesg): + logger = get_mp_logger() + logger(f'DEBUG ' + mesg) + + +def logger_info(mesg): + logger = get_mp_logger() + logger(f'INFO ' + mesg) + + +def logger_warn(mesg): + logger = get_mp_logger() + logger(f'{COLOR_YELLOW}WARNING {mesg} {COLOR_RESET}') + + +def logger_error(mesg): + logger = get_mp_logger() + logger(f'{COLOR_RED}ERROR {mesg} {COLOR_RESET}') + + +def logger_user(mesg): + logger = get_mp_logger() + logger(mesg) + + +def logger_logo(): + logger_user(f'{COLOR_CYAN}{COMPARE_LOGO} {COLOR_RESET}') + + +def get_sys_info(): + mem = psutil.virtual_memory() + cpu_percent = psutil.cpu_percent(interval=1) + sys_info = f'Total: {mem.total / 1024 / 1024:.2f}MB ' \ + f'Free: {mem.available / 1024 / 1024:.2f} MB ' \ + f'Used: {mem.used / 1024 / 1024:.2f} MB ' \ + f'CPU: {cpu_percent}% ' + return sys_info + + +class DispatchException(Exception): + INVALID_PARAMETER = 0 + + def __init__(self, err_code, err_msg=""): + super(DispatchException, self).__init__() + self.err_code = err_code + self.err_msg = err_msg + + def __str__(self): + return self.err_msg diff --git a/debug/accuracy_tools/atat/pytorch/parse.py b/debug/accuracy_tools/atat/pytorch/parse.py new file mode 100644 index 0000000000000000000000000000000000000000..40792d0e0297a9b034f186e255193d6201517764 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/parse.py @@ -0,0 +1,4 @@ +from atat.pytorch.parse_tool import cli + +if __name__ == '__main__': + cli.parse() diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/api_info.py 
b/debug/accuracy_tools/atat/pytorch/parse_tool/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/api_info.py rename to debug/accuracy_tools/atat/pytorch/parse_tool/__init__.py diff --git a/debug/accuracy_tools/atat/core/log.py b/debug/accuracy_tools/atat/pytorch/parse_tool/cli.py similarity index 39% rename from debug/accuracy_tools/atat/core/log.py rename to debug/accuracy_tools/atat/pytorch/parse_tool/cli.py index b9ac8f5edfb18286aff317b5440bb99a92dd2486..f59fbf13a8d3e2785611022cd7b5b9a2926ea008 100644 --- a/debug/accuracy_tools/atat/core/log.py +++ b/debug/accuracy_tools/atat/pytorch/parse_tool/cli.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ -# Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -14,43 +14,19 @@ # See the License for the specific language governing permissions and # limitations under the License. """ -import os -import time -import sys +from atat.pytorch.parse_tool.lib.interactive_cli import InteractiveCli +from atat.pytorch.common.log import logger -def _print_log(level, msg, end='\n'): - current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(time.time()))) - pid = os.getgid() - print(current_time + "(" + str(pid) + ")-[" + level + "]" + msg, end=end) - sys.stdout.flush() +def _run_interactive_cli(cli=None): + logger.info("Interactive command mode") + if not cli: + cli = InteractiveCli() + try: + cli.cmdloop(intro="Start Parsing........") + except KeyboardInterrupt: + logger.info("Exit parsing.......") -def print_info_log(info_msg, end='\n'): - """ - Function Description: - print info log. - Parameter: - info_msg: the info message. - """ - _print_log("INFO", info_msg, end=end) - - -def print_error_log(error_msg): - """ - Function Description: - print error log. - Parameter: - error_msg: the error message. - """ - _print_log("ERROR", error_msg) - - -def print_warn_log(warn_msg): - """ - Function Description: - print warn log. - Parameter: - warn_msg: the warning message. - """ - _print_log("WARNING", warn_msg) \ No newline at end of file +def parse(): + _run_interactive_cli() diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_ut.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_ut.py rename to debug/accuracy_tools/atat/pytorch/parse_tool/lib/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/parse_tool/lib/compare.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/compare.py new file mode 100644 index 0000000000000000000000000000000000000000..dfc4529414cbe00307b36fc58e5d62d64c6fdf32 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/compare.py @@ -0,0 +1,259 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+
+import os
+import time
+import numpy as np
+from collections import namedtuple
+from atat.pytorch.parse_tool.lib.utils import Util
+from atat.pytorch.parse_tool.lib.config import Const
+from atat.pytorch.parse_tool.lib.parse_exception import ParseException
+
+
+class Compare:
+    def __init__(self):
+        self.util = Util()
+        self.log = self.util.log
+        self.vector_compare_result = {}
+
+    def npu_vs_npu_compare(self, my_dump_path, golden_dump_path, result_dir, msaccucmp_path):
+        self.log.info("Start Compare ...............")
+        self.compare_vector(my_dump_path, golden_dump_path, result_dir, msaccucmp_path)
+        self.log.info("Compare finished!!")
+
+    def compare_vector(self, my_dump_path, golden_dump_path, result_dir, msaccucmp_path):
+        self.util.create_dir(result_dir)
+        self.util.check_path_valid(result_dir)
+        call_msaccucmp = self.util.check_msaccucmp(msaccucmp_path)
+        cmd = '%s %s compare -m %s -g %s -out %s' % (
+            self.util.python, call_msaccucmp, my_dump_path, golden_dump_path, result_dir
+        )
+        return self.util.execute_command(cmd)
+
+    def convert_dump_to_npy(self, dump_file, data_format, output, msaccucmp_path):
+        dump_file = self.util.path_strip(dump_file)
+        file_name = ""
+        if os.path.isfile(dump_file):
+            self.log.info("Convert file is: %s", dump_file)
+            file_name = os.path.basename(dump_file)
+        elif os.path.isdir(dump_file):
+            self.log.info("Convert all files in path: %s", dump_file)
+            file_name = ""
+        output = output if output else Const.DUMP_CONVERT_DIR
+        convert = self.convert(dump_file, data_format, output, msaccucmp_path)
+        if convert == 0:
+            convert_files = self.util.list_convert_files(output, file_name)
+
+            summary_txt = ["SrcFile: %s" % dump_file]
+            for convert_file in convert_files.values():
+                summary_txt.append(" - %s" % convert_file.file_name)
+            self.log.info("Transfer result is saved in : %s", os.path.realpath(output))
+            self.util.print_panel("\n".join(summary_txt))
+
+    def convert(self, dump_file, data_format, output, msaccucmp_path):
+        self.util.create_dir(output)
+        self.util.check_path_valid(output)
+        call_msaccucmp = self.util.check_msaccucmp(msaccucmp_path)
+        if data_format:
+            cmd = '%s %s convert -d %s -out %s -f %s' % (
+                self.util.python, call_msaccucmp, dump_file, output, data_format
+            )
+        else:
+            cmd = '%s %s convert -d %s -out %s' % (
+                self.util.python, call_msaccucmp, dump_file, output
+            )
+        return self.util.execute_command(cmd)
+
+    def compare_data(self, args):
+        """Compare data"""
+        (left, right, save_txt, rl, al, diff_count) = args
+        if left is None or right is None:
+            raise ParseException(ParseException.PARSE_INVALID_PARAM_ERROR, "invalid input or output")
+        try:
+            left_data = np.load(left)
+            right_data = np.load(right)
+        except UnicodeError as e:
+            self.log.error("%s %s" % ("UnicodeError", str(e)))
+            self.log.warning("Please check the npy file")
+            raise ParseException(ParseException.PARSE_UNICODE_ERROR) from e
+        except IOError as e:
+            self.log.error("Failed to load npy %s or %s." 
% (left, right)) + raise ParseException(ParseException.PARSE_LOAD_NPY_ERROR) from e + + # save to txt + if save_txt: + self.util.save_npy_to_txt(left_data, left + ".txt") + self.util.save_npy_to_txt(right_data, right + ".txt") + # compare data + (total_cnt, all_close, cos_sim, err_percent) = self.do_compare_data(left_data, right_data, rl, al, diff_count) + content = ['Left:', ' ├─ NpyFile: %s' % left] + if save_txt: + content.append(' ├─ TxtFile: [green]%s.txt[/green]' % left) + content.append(' └─ NpySpec: [yellow]%s[/yellow]' % self.util.gen_npy_info_txt(left_data)) + content.append('Right:') + content.append(' ├─ NpyFile: %s' % right) + if save_txt: + content.append(' ├─ TxtFile: [green]%s.txt[/green]' % right) + content.append(' └─ NpySpec: [yellow]%s[/yellow]' % self.util.gen_npy_info_txt(right_data)) + content.append('NumCnt: %s' % total_cnt) + content.append('AllClose: %s' % all_close) + content.append('CosSim: %s' % cos_sim) + content.append('ErrorPer: %s (rl= %s, al= %s)' % (err_percent, rl, al)) + self.util.print_panel("\n".join(content)) + + def do_compare_data(self, left, right, rl=0.001, al=0.001, diff_count=20): + data_left = left.astype(np.float32) + data_right = right.astype(np.float32) + shape_left = data_left.shape + shape_right = data_right.shape + if shape_left != shape_right: + self.log.warning("Data shape not equal: %s vs %s", data_left.shape, data_right.shape) + data_left = data_left.reshape(-1) + data_right = data_right.reshape(-1) + if data_left.shape[0] != data_right.shape[0]: + self.log.warning("Data size not equal: %s vs %s", data_left.shape, data_right.shape) + if data_left.shape[0] < data_right.shape[0]: + data_left = np.pad(data_left, (0, data_right.shape[0] - data_left.shape[0]), 'constant') + else: + data_right = np.pad(data_right, (0, data_left.shape[0] - data_right.shape[0]), 'constant') + all_close = np.allclose(data_left, data_right, atol=al, rtol=rl) + np.seterr(divide='raise') + cos_sim = np.dot(data_left, data_right) / ( + np.sqrt(np.dot(data_left, data_left)) * np.sqrt(np.dot(data_right, data_right))) + err_cnt = 0 + total_cnt = data_left.shape[0] + diff_table_columns = ['Index', 'Left', 'Right', 'Diff'] + err_table = self.util.create_table("Error Item Table", diff_table_columns) + top_table = self.util.create_table("Top Item Table", diff_table_columns) + for i in range(total_cnt): + abs_diff = abs(data_left[i] - data_right[i]) + if i < diff_count: + top_table.add_row(str(i), str(data_left[i]), str(data_right[i]), str(abs_diff)) + if abs_diff > (al + rl * abs(data_right[i])): + if err_cnt < diff_count: + err_table.add_row(str(i), str(data_left[i]), str(data_right[i]), str(abs_diff)) + err_cnt += 1 + if total_cnt == 0: + err_percent = float(0) + else: + err_percent = float(err_cnt / total_cnt) + self.util.print(self.util.create_columns([err_table, top_table])) + do_compare_data_result = namedtuple('do_compare_data_result', ['cnt', 'close', 'cos', 'err']) + res = do_compare_data_result(total_cnt, all_close, cos_sim, err_percent) + return res + + def compare_npy(self, file, bench_file, output_path): + data = np.load(file) + bench_data = np.load(bench_file) + shape, dtype = data.shape, data.dtype + bench_shape, bench_dtype = bench_data.shape, bench_data.dtype + filename = os.path.basename(file) + bench_filename = os.path.basename(bench_file) + if shape != bench_shape or dtype != bench_dtype: + self.log.error( + "Shape or dtype between two npy files is inconsistent. Please check the two files." 
+ "File 1: %s, file 2: %s", file, bench_file) + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + md5_consistency = False + if self.util.get_md5_for_numpy(data) == self.util.get_md5_for_numpy(bench_data): + md5_consistency = True + data_mean = np.mean(data) + bench_data_mean = np.mean(bench_data) + abs_error = np.abs(data - bench_data) + bench_data = self.util.deal_with_value_if_has_zero(bench_data) + rel_error = np.abs(abs_error / bench_data) + abs_diff_max = abs_error.max() + rel_diff_max = np.max(rel_error) + compare_result = [[filename, bench_filename, data_mean, bench_data_mean, md5_consistency, abs_diff_max, + rel_diff_max]] + self.util.write_csv(compare_result, output_path) + + def compare_all_file_in_directory(self, my_dump_dir, golden_dump_dir, output_path): + if not (self.util.is_subdir_count_equal(my_dump_dir, golden_dump_dir) + and self.util.check_npy_files_valid_in_dir(my_dump_dir) + and self.util.check_npy_files_valid_in_dir(golden_dump_dir)): + self.log.error( + "Top level(Npy files level) directory structure is inconsistent. Please check the two directory.") + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + my_npy_files = self.util.get_sorted_files_names(my_dump_dir) + golden_npy_files = self.util.get_sorted_files_names(golden_dump_dir) + for my_npy_file_name, golden_npy_file_name in zip(my_npy_files, golden_npy_files): + my_npy_path = os.path.join(my_dump_dir, my_npy_file_name) + golden_npy_path = os.path.join(golden_dump_dir, golden_npy_file_name) + self.compare_npy(my_npy_path, golden_npy_path, output_path) + + def compare_timestamp_directory(self, my_dump_dir, golden_dump_dir, output_path): + if not self.util.is_subdir_count_equal(my_dump_dir, golden_dump_dir): + self.log.error( + "Second level(Timestamp level) directory structure is inconsistent. Please check the two directory.") + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + my_ordered_subdirs = self.util.get_sorted_subdirectories_names(my_dump_dir) + golden_ordered_subdirs = self.util.get_sorted_subdirectories_names(golden_dump_dir) + for my_subdir_name, golden_subdir_name in zip(my_ordered_subdirs, golden_ordered_subdirs): + my_subdir_path = os.path.join(my_dump_dir, my_subdir_name) + golden_subdir_path = os.path.join(golden_dump_dir, golden_subdir_name) + self.compare_all_file_in_directory(my_subdir_path, golden_subdir_path, output_path) + + def compare_converted_dir(self, my_dump_dir, golden_dump_dir, output_dir): + if not self.util.is_subdir_count_equal(my_dump_dir, golden_dump_dir): + self.log.error( + "Top level(Opname level) directory structure is inconsistent. Please check the two directory.") + return + timestamp = int(time.time()) + output_file_name = f"batch_compare_{timestamp}.csv" + output_path = os.path.join(output_dir, output_file_name) + title_rows = [[ + "NPU File Name", + "Bench File Name", + "Mean", + "Bench Mean", + "Md5 Consistency", + "Max Abs Error", + "Max Relative Error" + ]] + self.util.write_csv(title_rows, output_path) + + my_ordered_subdirs = self.util.get_sorted_subdirectories_names(my_dump_dir) + golden_ordered_subdirs = self.util.get_sorted_subdirectories_names(golden_dump_dir) + for my_subdir_name, golden_subdir_name in zip(my_ordered_subdirs, golden_ordered_subdirs): + if not my_subdir_name == golden_subdir_name: + self.log.error( + "Top level(Opname level) directory structure is inconsistent. 
Please check the two directory.") + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + my_subdir_path = os.path.join(my_dump_dir, my_subdir_name) + golden_subdir_path = os.path.join(golden_dump_dir, golden_subdir_name) + self.compare_timestamp_directory(my_subdir_path, golden_subdir_path, output_path) + self.util.change_filemode_safe(output_path) + self.log.info("Compare result is saved in : %s", output_path) + + def convert_api_dir_to_npy(self, dump_dir, param, output_dir, msaccucmp_path): + dump_dir = self.util.path_strip(dump_dir) + for root, _, files in os.walk(dump_dir): + for file in files: + file_path = os.path.join(root, file) + file_name = os.path.basename(file_path) + parts = file_name.split(".") + if len(parts) < 5: + continue + op_name = parts[1] + timestamp = parts[-1] + output_path = os.path.join(output_dir, op_name, timestamp) + self.convert_dump_to_npy(file_path, param, output_path, msaccucmp_path) diff --git a/debug/accuracy_tools/atat/pytorch/parse_tool/lib/config.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/config.py new file mode 100644 index 0000000000000000000000000000000000000000..a745ff46f08a28c39c989a5d8dce4ff5cf475ee5 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/config.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import os +import numpy as np + + +class Const: + + MS_ACCU_CMP_PATH = '/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py' + MS_ACCU_CMP_FILE_NAME = 'msaccucmp.py' + ROOT_DIR = "" + LOG_LEVEL = "NOTSET" + DATA_ROOT_DIR = os.path.join(ROOT_DIR, 'parse_data') + DUMP_CONVERT_DIR = os.path.join(DATA_ROOT_DIR, 'dump_convert') + COMPARE_DIR = os.path.join(DATA_ROOT_DIR, 'compare_result') + BATCH_DUMP_CONVERT_DIR = os.path.join(DATA_ROOT_DIR, 'batch_dump_convert') + BATCH_COMPARE_DIR = os.path.join(DATA_ROOT_DIR, 'batch_compare_result') + OFFLINE_DUMP_CONVERT_PATTERN = \ + r"^([A-Za-z0-9_-]+)\.([A-Za-z0-9_-]+)\.([0-9]+)(\.[0-9]+)?\.([0-9]{1,255})" \ + r"\.([a-z]+)\.([0-9]{1,255})(\.[x0-9]+)?\.npy$" + NUMPY_PATTERN = r".*\.npy$" + NPY_SUFFIX = ".npy" + PKL_SUFFIX = ".pkl" + DIRECTORY_LENGTH = 4096 + FILE_NAME_LENGTH = 255 + FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' + ONE_GB = 1 * 1024 * 1024 * 1024 + TEN_GB = 10 * 1024 * 1024 * 1024 + FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] + HEADER = r""" ____ + / __ \____ ______________ + / /_/ / __ `/ ___/ ___/ _ \ + / ____/ /_/ / / (__ ) __/ + /_/ \__,_/_/ /____/\___/ + + """ diff --git a/debug/accuracy_tools/atat/pytorch/parse_tool/lib/file_desc.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/file_desc.py new file mode 100644 index 0000000000000000000000000000000000000000..14ba27277168bc110b38287afbba957b69f8cdff --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/file_desc.py @@ -0,0 +1,31 @@ +# coding=utf-8 +import os + + +class FileDesc(object): + def __init__(self, file_name, dir_path, timestamp=-1): + self.file_name = file_name + self.dir_path = dir_path + self.path = os.path.join(dir_path, file_name) + self.timestamp = timestamp + self.idx = 0 + if self.timestamp == -1: + self.timestamp = os.path.getmtime(self.path) + + +class NpuDumpFileDesc(FileDesc): + def __init__(self, file_name, dir_path, timestamp, op_name, op_type, task_id, stream_id=0): + super(NpuDumpFileDesc, self).__init__(file_name, dir_path, timestamp) + self.op_name = op_name + self.op_type = op_type + self.task_id = task_id + stream_id = 0 if stream_id is None else int(stream_id) + self.stream_id = stream_id + self.idx = dir_path.split(os.sep)[-1] + + +class DumpDecodeFileDesc(NpuDumpFileDesc): + def __init__(self, file_name, dir_path, timestamp, op_name, op_type, task_id, anchor_type, anchor_idx): + super(DumpDecodeFileDesc, self).__init__(file_name, dir_path, timestamp, op_name, op_type, task_id) + self.type = anchor_type + self.idx = anchor_idx diff --git a/debug/accuracy_tools/atat/pytorch/parse_tool/lib/interactive_cli.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/interactive_cli.py new file mode 100644 index 0000000000000000000000000000000000000000..12b07183fbc2e1c2ea630f05ac44deda744d4d01 --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/interactive_cli.py @@ -0,0 +1,102 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +import cmd +import argparse +from atat.pytorch.parse_tool.lib.parse_tool import ParseTool +from atat.pytorch.parse_tool.lib.utils import Util +from atat.pytorch.parse_tool.lib.config import Const +from atat.pytorch.parse_tool.lib.parse_exception import catch_exception + + +class InteractiveCli(cmd.Cmd): + def __init__(self): + super().__init__() + self.prompt = "Parse >>> " + self.parse_tool = ParseTool() + self.util = Util() + self.util.print_panel(Const.HEADER) + self.prepare() + + @staticmethod + def _parse_argv(line, insert=None): + argv = line.split() if line != "" else [] + if "-h" in argv: + return argv + if insert is not None and len(argv) and argv[0] != insert: + argv.insert(0, insert) + return argv + + def prepare(self): + self.parse_tool.prepare() + + @catch_exception + def default(self, line=""): + self.util.execute_command(line) + return False + + @catch_exception + def do_run(self, line=""): + self.util.execute_command(line) + + @catch_exception + def do_vc(self, line=""): + parser = argparse.ArgumentParser() + parser.add_argument( + "-m", "--my_dump_path", dest="my_dump_path", default=None, + help=" my dump path, the data compared with golden data", + required=True + ) + parser.add_argument( + "-g", "--golden_dump_path", dest="golden_dump_path", default=None, + help=" the golden dump data path", + required=True + ) + parser.add_argument( + "-out", "--output_path", dest="output_path", default=None, + help=" the output path", + required=False + ) + parser.add_argument( + "-cmp_path", "--msaccucmp_path", dest="msaccucmp_path", default=None, + help=" the msaccucmp.py file path", + required=False + ) + args = parser.parse_args(self._parse_argv(line)) + self.util.check_path_valid(args.my_dump_path) + self.util.check_path_valid(args.golden_dump_path) + self.util.check_files_in_path(args.my_dump_path) + self.util.check_files_in_path(args.golden_dump_path) + if self.util.dir_contains_only(args.my_dump_path, ".npy") and \ + self.util.dir_contains_only(args.golden_dump_path, ".npy"): + self.parse_tool.do_compare_converted_dir(args) + else: + self.parse_tool.do_vector_compare(args) + + def do_dc(self, line=""): + self.parse_tool.do_convert_dump(self._parse_argv(line)) + + def do_pt(self, line=""): + self.parse_tool.do_print_data(self._parse_argv(line)) + + def do_pk(self, line=""): + self.parse_tool.do_parse_pkl(self._parse_argv(line)) + + def do_cn(self, line=''): + self.parse_tool.do_compare_data(self._parse_argv(line)) + + def do_cad(self, line=''): + self.parse_tool.do_convert_api_dir(self._parse_argv(line)) diff --git a/debug/accuracy_tools/atat/pytorch/parse_tool/lib/parse_exception.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/parse_exception.py new file mode 100644 index 0000000000000000000000000000000000000000..1177c51985dc82fba632898590762e38387603ab --- /dev/null +++ b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/parse_exception.py @@ -0,0 +1,54 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. 
diff --git a/debug/accuracy_tools/atat/pytorch/parse_tool/lib/parse_exception.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/parse_exception.py
new file mode 100644
index 0000000000000000000000000000000000000000..1177c51985dc82fba632898590762e38387603ab
--- /dev/null
+++ b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/parse_exception.py
@@ -0,0 +1,54 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+import logging
+from atat.core.common.exceptions import FileCheckException
+
+
+class ParseException(Exception):
+
+    PARSE_INVALID_PATH_ERROR = 0
+    PARSE_NO_FILE_ERROR = 1
+    PARSE_NO_MODULE_ERROR = 2
+    PARSE_INVALID_DATA_ERROR = 3
+    PARSE_INVALID_FILE_FORMAT_ERROR = 4
+    PARSE_UNICODE_ERROR = 5
+    PARSE_JSONDECODE_ERROR = 6
+    PARSE_MSACCUCMP_ERROR = 7
+    PARSE_LOAD_NPY_ERROR = 8
+    PARSE_INVALID_PARAM_ERROR = 9
+
+    def __init__(self, code, error_info=""):
+        super(ParseException, self).__init__()
+        self.error_info = error_info
+        self.code = code
+
+
+def catch_exception(func):
+    def inner(*args, **kwargs):
+        log = logging.getLogger()
+        line = args[-1] if len(args) == 2 else ""
+        result = None
+        try:
+            result = func(*args, **kwargs)
+        except OSError:
+            log.error("%s: command not found", line)
+        except ParseException as e:
+            log.error("Command execution failed (parse error code: %s)", e.code)
+        except FileCheckException:
+            log.error("Command execution failed (file check error)")
+        return result
+    return inner
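The decorator swallows the listed exceptions and returns `None`, which is what keeps one failing command from killing the interactive loop; callers must therefore tolerate a `None` result. A standalone sketch of that contract, with invented names:

```python
# Minimal reproduction of the catch_exception behavior (hypothetical names).
import logging

class DemoParseError(Exception):
    pass

def catch_demo(func):
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except DemoParseError:
            logging.getLogger().error("Command execution failed")
            return None  # swallow the error so the CLI loop keeps running
    return inner

@catch_demo
def risky_command(path):
    raise DemoParseError(path)

assert risky_command("./missing.npy") is None
```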
+""" +import argparse +import os +from collections import namedtuple + +from atat.pytorch.parse_tool.lib.config import Const +from atat.pytorch.parse_tool.lib.utils import Util +from atat.pytorch.parse_tool.lib.compare import Compare +from atat.pytorch.parse_tool.lib.visualization import Visualization +from atat.pytorch.parse_tool.lib.parse_exception import catch_exception, ParseException + + +class ParseTool: + def __init__(self): + self.util = Util() + self.compare = Compare() + self.visual = Visualization() + + @catch_exception + def prepare(self): + self.util.create_dir(Const.DATA_ROOT_DIR) + + @catch_exception + def do_vector_compare(self, args): + if not args.output_path: + result_dir = os.path.join(Const.COMPARE_DIR) + else: + result_dir = args.output_path + my_dump_path = args.my_dump_path + golden_dump_path = args.golden_dump_path + if not os.path.isdir(my_dump_path) or not os.path.isdir(golden_dump_path): + self.util.log.error("Please enter a directory not a file") + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + msaccucmp_path = self.util.path_strip(args.msaccucmp_path) if args.msaccucmp_path else Const.MS_ACCU_CMP_PATH + self.util.check_path_valid(msaccucmp_path) + self.util.check_executable_file(msaccucmp_path) + self.compare.npu_vs_npu_compare(my_dump_path, golden_dump_path, result_dir, msaccucmp_path) + + @catch_exception + def do_convert_dump(self, argv=None): + parser = argparse.ArgumentParser() + parser.add_argument( + '-n', '--name', dest='path', default=None, required=True, help='dump file or dump file directory') + parser.add_argument( + '-f', '--format', dest='format', default=None, required=False, help='target format') + parser.add_argument( + '-out', '--output_path', dest='output_path', required=False, default=None, help='output path') + parser.add_argument( + "-cmp_path", "--msaccucmp_path", dest="msaccucmp_path", default=None, + help=" the msaccucmp.py file path", required=False) + args = parser.parse_args(argv) + self.util.check_path_valid(args.path) + self.util.check_files_in_path(args.path) + msaccucmp_path = self.util.path_strip(args.msaccucmp_path) if args.msaccucmp_path else Const.MS_ACCU_CMP_PATH + self.util.check_path_valid(msaccucmp_path) + self.util.check_executable_file(msaccucmp_path) + if args.format: + self.util.check_str_param(args.format) + self.compare.convert_dump_to_npy(args.path, args.format, args.output_path, msaccucmp_path) + + @catch_exception + def do_print_data(self, argv=None): + """print tensor data""" + parser = argparse.ArgumentParser() + parser.add_argument('-n', '--name', dest='path', default=None, required=True, help='File name') + args = parser.parse_args(argv) + self.visual.print_npy_data(args.path) + + @catch_exception + def do_parse_pkl(self, argv=None): + parser = argparse.ArgumentParser() + parser.add_argument( + '-f', '--file', dest='file_name', default=None, required=True, help='PKL file path') + parser.add_argument( + '-n', '--name', dest='api_name', default=None, required=True, help='API name') + args = parser.parse_args(argv) + self.visual.parse_pkl(args.file_name, args.api_name) + + @catch_exception + def do_compare_data(self, argv): + """compare two tensor""" + parser = argparse.ArgumentParser() + parser.add_argument( + "-m", "--my_dump_path", dest="my_dump_path", default=None, + help=" my dump path, the data compared with golden data", + required=True + ) + parser.add_argument( + "-g", "--golden_dump_path", dest="golden_dump_path", default=None, + help=" the golden dump data path", + required=True + ) 
+        parser.add_argument('-p', '--print', dest='count', default=20, type=int,
+                            help='number of error data items to print')
+        parser.add_argument('-s', '--save', dest='save', action='store_true', help='save data in txt format')
+        parser.add_argument('-al', '--atol', dest='atol', default=0.001, type=float,
+                            help='set the absolute tolerance (atol)')
+        parser.add_argument('-rl', '--rtol', dest='rtol', default=0.001, type=float,
+                            help='set the relative tolerance (rtol)')
+        args = parser.parse_args(argv)
+        self.util.check_path_valid(args.my_dump_path)
+        self.util.check_path_valid(args.golden_dump_path)
+        self.util.check_path_format(args.my_dump_path, Const.NPY_SUFFIX)
+        self.util.check_path_format(args.golden_dump_path, Const.NPY_SUFFIX)
+        compare_data_args = namedtuple('compare_data_args',
+                                       ['my_dump_path', 'golden_dump_path', 'save', 'rtol', 'atol', 'count'])
+        compare_data_args.__new__.__defaults__ = (False, 0.001, 0.001, 20)
+        res = compare_data_args(args.my_dump_path, args.golden_dump_path, args.save, args.rtol, args.atol, args.count)
+        self.compare.compare_data(res)
+
+    @catch_exception
+    def do_compare_converted_dir(self, args):
+        """compare two directories of converted npy files"""
+        my_dump_dir = self.util.path_strip(args.my_dump_path)
+        golden_dump_dir = self.util.path_strip(args.golden_dump_path)
+        if my_dump_dir == golden_dump_dir:
+            self.util.log.error("My directory path and golden directory path are the same. Please check parameters"
+                                " '-m' and '-g'.")
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        output_path = self.util.path_strip(args.output_path) if args.output_path else Const.BATCH_COMPARE_DIR
+        if not os.path.isdir(output_path):
+            os.makedirs(output_path, mode=0o750)
+        self.compare.compare_converted_dir(my_dump_dir, golden_dump_dir, output_path)
+
+    @catch_exception
+    def do_convert_api_dir(self, argv=None):
+        parser = argparse.ArgumentParser()
+        parser.add_argument(
+            "-m", "--my_dump_path", dest="my_dump_path", default=None,
+            help="my dump path, the data to convert to npy files",
+            required=True
+        )
+        parser.add_argument(
+            '-out', '--output_path', dest='output_path', required=False, default=None, help='output path')
+        parser.add_argument(
+            "-asc", "--msaccucmp_path", dest="msaccucmp_path", default=None,
+            help="the msaccucmp.py file path", required=False)
+        args = parser.parse_args(argv)
+        self.util.check_path_valid(args.my_dump_path)
+        self.util.check_files_in_path(args.my_dump_path)
+        output_path = self.util.path_strip(args.output_path) if args.output_path else \
+            os.path.join(Const.BATCH_DUMP_CONVERT_DIR, self.util.localtime_str())
+        msaccucmp_path = self.util.path_strip(
+            args.msaccucmp_path) if args.msaccucmp_path else Const.MS_ACCU_CMP_PATH
+        self.util.check_path_valid(msaccucmp_path)
+        self.util.check_executable_file(msaccucmp_path)
+        self.compare.convert_api_dir_to_npy(args.my_dump_path, None, output_path, msaccucmp_path)
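For reference, the `do_compare_data` flow above (the interactive `cn` command) can also be exercised programmatically; a hedged sketch, where both `.npy` paths are invented:

```python
# Illustrative only: drive do_compare_data the way the interactive
# "cn" command would. Both .npy paths below are made-up examples.
from atat.pytorch.parse_tool.lib.parse_tool import ParseTool

tool = ParseTool()
argv = [
    "-m", "./dump_convert/my_api.output.0.npy",      # tensor under test
    "-g", "./dump_convert/golden_api.output.0.npy",  # golden reference
    "-p", "10",       # print at most 10 mismatching elements
    "-al", "0.001",   # absolute tolerance
    "-rl", "0.001",   # relative tolerance
    "-s",             # also save the tensors in txt format
]
tool.do_compare_data(argv)
```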
diff --git a/debug/accuracy_tools/atat/pytorch/parse_tool/lib/utils.py b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..aeb1d7f6d2f7d3a0488822bfd7859633dfd70366
--- /dev/null
+++ b/debug/accuracy_tools/atat/pytorch/parse_tool/lib/utils.py
@@ -0,0 +1,367 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+import logging
+import os
+import io
+import re
+import sys
+import subprocess
+import hashlib
+import csv
+import time
+import numpy as np
+from collections import namedtuple
+from atat.pytorch.parse_tool.lib.config import Const
+from atat.pytorch.parse_tool.lib.file_desc import DumpDecodeFileDesc, FileDesc
+from atat.pytorch.parse_tool.lib.parse_exception import ParseException
+from atat.core.common.file_check import change_mode, check_other_user_writable,\
+    check_path_executable, check_path_owner_consistent
+from atat.core.common.file_check import FileCheckConst
+from atat.core.common.file_check import FileOpen
+from atat.core.common.utils import check_file_or_directory_path
+from atat.pytorch.common.log import logger
+
+
+try:
+    from rich.traceback import install
+    from rich.panel import Panel
+    from rich.table import Table
+    from rich import print as rich_print
+    from rich.columns import Columns
+
+    install()
+except ImportError as err:
+    install = None
+    Panel = None
+    Table = None
+    Columns = None
+    rich_print = None
+    logger.warning(
+        "Failed to import rich, some features may not be available. Please run 'pip install rich' to fix it.")
+
+
+class Util:
+    def __init__(self):
+        self.ms_accu_cmp = None
+        logging.basicConfig(
+            level=Const.LOG_LEVEL,
+            format="%(asctime)s (%(process)d) -[%(levelname)s]%(message)s",
+            datefmt="%Y-%m-%d %H:%M:%S"
+        )
+        self.log = logging.getLogger()
+        self.python = sys.executable
+
+    @staticmethod
+    def print(content):
+        # fall back to the builtin print when rich is unavailable
+        if rich_print:
+            rich_print(content)
+        else:
+            print(content)
+
+    @staticmethod
+    def path_strip(path):
+        return path.strip("'").strip('"')
+
+    @staticmethod
+    def _gen_npu_dump_convert_file_info(name, match, dir_path):
+        return DumpDecodeFileDesc(name, dir_path, int(match.groups()[-4]), op_name=match.group(2),
+                                  op_type=match.group(1), task_id=int(match.group(3)), anchor_type=match.groups()[-3],
+                                  anchor_idx=int(match.groups()[-2]))
+
+    @staticmethod
+    def _gen_numpy_file_info(name, match, dir_path):
+        return FileDesc(name, dir_path)
+
+    @staticmethod
+    def check_executable_file(path):
+        check_path_owner_consistent(path)
+        check_other_user_writable(path)
+        check_path_executable(path)
+
+    @staticmethod
+    def get_subdir_count(directory):
+        subdir_count = 0
+        for _, dirs, _ in os.walk(directory):
+            subdir_count += len(dirs)
+            break
+        return subdir_count
+
+    @staticmethod
+    def get_subfiles_count(directory):
+        file_count = 0
+        for _, _, files in os.walk(directory):
+            file_count += len(files)
+        return file_count
+
+    @staticmethod
+    def get_sorted_subdirectories_names(directory):
+        subdirectories = []
+        for item in os.listdir(directory):
+            item_path = os.path.join(directory, item)
+            if os.path.isdir(item_path):
+                subdirectories.append(item)
+        return sorted(subdirectories)
+
+    @staticmethod
+    def get_sorted_files_names(directory):
+        files = []
+        for item in os.listdir(directory):
+            item_path = os.path.join(directory, item)
+            if os.path.isfile(item_path):
+                files.append(item)
+        return sorted(files)
+    @staticmethod
+    def check_npy_files_valid_in_dir(dir_path):
+        for file_name in os.listdir(dir_path):
+            file_path = os.path.join(dir_path, file_name)
+            check_file_or_directory_path(file_path)
+            _, file_extension = os.path.splitext(file_path)
+            if not file_extension == '.npy':
+                return False
+        return True
+
+    @staticmethod
+    def get_md5_for_numpy(obj):
+        np_bytes = obj.tobytes()
+        md5_hash = hashlib.md5(np_bytes)
+        return md5_hash.hexdigest()
+
+    @staticmethod
+    def write_csv(data, filepath):
+        need_change_mode = False
+        if not os.path.exists(filepath):
+            need_change_mode = True
+        with FileOpen(filepath, 'a') as f:
+            writer = csv.writer(f)
+            writer.writerows(data)
+        if need_change_mode:
+            change_mode(filepath, FileCheckConst.DATA_FILE_AUTHORITY)
+
+    @staticmethod
+    def deal_with_dir_or_file_inconsistency(output_path):
+        if os.path.exists(output_path):
+            os.remove(output_path)
+        raise ParseException(ParseException.PARSE_INVALID_FILE_FORMAT_ERROR,
+                             "Inconsistent directory structure or file.")
+
+    @staticmethod
+    def deal_with_value_if_has_zero(data):
+        if data.dtype in Const.FLOAT_TYPE:
+            zero_mask = (data == 0)
+            # add eps where the value is 0 to avoid division by zero
+            data[zero_mask] += np.finfo(data.dtype).eps
+        else:
+            # adding a float eps to an int array raises, so cast to float first
+            data = data.astype(float)
+            zero_mask = (data == 0)
+            data[zero_mask] += np.finfo(float).eps
+        return data
+
+    @staticmethod
+    def dir_contains_only(path, endfix):
+        for _, _, files in os.walk(path):
+            for file in files:
+                if not file.endswith(endfix):
+                    return False
+        return True
+
+    @staticmethod
+    def localtime_str():
+        return time.strftime("%Y%m%d%H%M%S", time.localtime())
+
+    @staticmethod
+    def change_filemode_safe(path):
+        change_mode(path, FileCheckConst.DATA_FILE_AUTHORITY)
+
+    def execute_command(self, cmd):
+        if not cmd:
+            self.log.error("Command is None")
+            return -1
+        self.log.debug("[RUN CMD]: %s", cmd)
+        cmd = cmd.split(" ")
+        complete_process = subprocess.run(cmd, shell=False)
+        return complete_process.returncode
+
+    def print_panel(self, content, title='', fit=True):
+        if not Panel:
+            self.print(content)
+            return
+        if fit:
+            self.print(Panel.fit(content, title=title))
+        else:
+            self.print(Panel(content, title=title))
+    def check_msaccucmp(self, target_file):
+        if os.path.split(target_file)[-1] != Const.MS_ACCU_CMP_FILE_NAME:
+            self.log.error(
+                "Check msaccucmp failed in dir %s. This is not a valid msaccucmp file" % target_file)
+            raise ParseException(ParseException.PARSE_MSACCUCMP_ERROR)
+        result = subprocess.run(
+            [self.python, target_file, "--help"], stdout=subprocess.PIPE)
+        if result.returncode == 0:
+            self.log.info("Check [%s] success.", target_file)
+        else:
+            self.log.error("Check msaccucmp failed in dir %s" % target_file)
+            self.log.error("Please specify a valid msaccucmp.py path or install the cann package")
+            raise ParseException(ParseException.PARSE_MSACCUCMP_ERROR)
+        return target_file
+
+    def create_dir(self, path):
+        path = self.path_strip(path)
+        if os.path.exists(path):
+            return
+        self.check_path_name(path)
+        try:
+            os.makedirs(path, mode=FileCheckConst.DATA_DIR_AUTHORITY)
+        except OSError as e:
+            self.log.error("Failed to create %s.", path)
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) from e
+
+    def gen_npy_info_txt(self, source_data):
+        (shape, dtype, max_data, min_data, mean) = \
+            self.npy_info(source_data)
+        return \
+            '[Shape: %s] [Dtype: %s] [Max: %s] [Min: %s] [Mean: %s]' % (shape, dtype, max_data, min_data, mean)
+
+    def save_npy_to_txt(self, data, dst_file='', align=0):
+        if os.path.exists(dst_file):
+            self.log.info("Dst file %s exists, will not save new one.", dst_file)
+            return
+        shape = data.shape
+        data = data.flatten()
+        if align == 0:
+            align = 1 if len(shape) == 0 else shape[-1]
+        elif data.size % align != 0:
+            pad_array = np.zeros((align - data.size % align,))
+            data = np.append(data, pad_array)
+        np.savetxt(dst_file, data.reshape((-1, align)), delimiter=' ', fmt='%g')
+        change_mode(dst_file, FileCheckConst.DATA_FILE_AUTHORITY)
+
+    def list_convert_files(self, path, external_pattern=""):
+        return self.list_file_with_pattern(
+            path, Const.OFFLINE_DUMP_CONVERT_PATTERN, external_pattern, self._gen_npu_dump_convert_file_info
+        )
+
+    def list_numpy_files(self, path, extern_pattern=''):
+        return self.list_file_with_pattern(path, Const.NUMPY_PATTERN, extern_pattern,
+                                           self._gen_numpy_file_info)
+
+    def create_columns(self, content):
+        if not Columns:
+            self.log.error("No module named rich, please install it")
+            raise ParseException(ParseException.PARSE_NO_MODULE_ERROR)
+        return Columns(content)
+
+    def create_table(self, title, columns):
+        if not Table:
+            self.log.error("No module named rich, please install it and restart the parse tool")
+            raise ParseException(ParseException.PARSE_NO_MODULE_ERROR)
+        table = Table(title=title)
+        for column_name in columns:
+            table.add_column(column_name, overflow='fold')
+        return table
+    def check_path_valid(self, path):
+        path = self.path_strip(path)
+        if not path or not os.path.exists(path):
+            self.log.error("The path %s does not exist." % path)
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        if os.path.islink(path):
+            self.log.error('The file path {} is a soft link.'.format(path))
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        if len(os.path.realpath(path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(path)) > \
+                Const.FILE_NAME_LENGTH:
+            self.log.error('The file path length exceeds limit.')
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        if not re.match(Const.FILE_PATTERN, os.path.realpath(path)):
+            self.log.error('The file path {} contains special characters.'.format(path))
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        if os.path.isfile(path):
+            file_size = os.path.getsize(path)
+            if path.endswith(Const.PKL_SUFFIX) and file_size > Const.ONE_GB:
+                self.log.error('The file {} size is greater than 1GB.'.format(path))
+                raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+            if path.endswith(Const.NPY_SUFFIX) and file_size > Const.TEN_GB:
+                self.log.error('The file {} size is greater than 10GB.'.format(path))
+                raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        return True
+
+    def check_files_in_path(self, path):
+        if os.path.isdir(path) and len(os.listdir(path)) == 0:
+            self.log.error("No files in %s." % path)
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+
+    def npy_info(self, source_data):
+        if isinstance(source_data, np.ndarray):
+            data = source_data
+        else:
+            self.log.error("Invalid data, data is not ndarray")
+            raise ParseException(ParseException.PARSE_INVALID_DATA_ERROR)
+        if data.dtype == 'object':
+            self.log.error("Invalid data, data is object.")
+            raise ParseException(ParseException.PARSE_INVALID_DATA_ERROR)
+        if np.size(data) == 0:
+            self.log.error("Invalid data, data is empty")
+            raise ParseException(ParseException.PARSE_INVALID_DATA_ERROR)
+        npu_info_result = namedtuple('npu_info_result', ['shape', 'dtype', 'max', 'min', 'mean'])
+        res = npu_info_result(data.shape, data.dtype, data.max(), data.min(), data.mean())
+        return res
+
+    def list_file_with_pattern(self, path, pattern, extern_pattern, gen_info_func):
+        self.check_path_valid(path)
+        file_list = {}
+        re_pattern = re.compile(pattern)
+        for dir_path, _, file_names in os.walk(path, followlinks=True):
+            for name in file_names:
+                match = re_pattern.match(name)
+                if not match:
+                    continue
+                if extern_pattern != '' and not re.match(extern_pattern, name):
+                    continue
+                file_list[name] = gen_info_func(name, match, dir_path)
+        return file_list
+    def check_path_format(self, path, suffix):
+        if os.path.isfile(path):
+            if not path.endswith(suffix):
+                self.log.error("%s is not a %s file." % (path, suffix))
+                raise ParseException(ParseException.PARSE_INVALID_FILE_FORMAT_ERROR)
+        elif os.path.isdir(path):
+            self.log.error("Please specify a single file path")
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        else:
+            self.log.error("The file path %s is invalid" % path)
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+
+    def check_path_name(self, path):
+        if len(os.path.realpath(path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(path)) > \
+                Const.FILE_NAME_LENGTH:
+            self.log.error('The file path length exceeds limit.')
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+        if not re.match(Const.FILE_PATTERN, os.path.realpath(path)):
+            self.log.error('The file path {} contains special characters.'.format(path))
+            raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR)
+
+    def check_str_param(self, param):
+        if len(param) > Const.FILE_NAME_LENGTH:
+            self.log.error('The parameter length exceeds limit')
+            raise ParseException(ParseException.PARSE_INVALID_PARAM_ERROR)
+        if not re.match(Const.FILE_PATTERN, param):
+            self.log.error('The parameter {} contains special characters.'.format(param))
+            raise ParseException(ParseException.PARSE_INVALID_PARAM_ERROR)
+
+    def is_subdir_count_equal(self, dir1, dir2):
+        dir1_count = self.get_subdir_count(dir1)
+        dir2_count = self.get_subdir_count(dir2)
+        return dir1_count == dir2_count
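`list_file_with_pattern` above is driven by `Const.OFFLINE_DUMP_CONVERT_PATTERN`; a quick sketch of what that regex decodes from a converted-dump file name (the name itself is invented for illustration):

```python
# Illustrative only: decode a converted dump file name with the same regex
# that list_convert_files uses. The example file name is made up.
import re
from atat.pytorch.parse_tool.lib.config import Const

name = "Conv2D.conv2d_1.4.0.1716170000.output.0.npy"
match = re.match(Const.OFFLINE_DUMP_CONVERT_PATTERN, name)
if match:
    op_type, op_name, task_id = match.group(1), match.group(2), match.group(3)
    anchor_type, anchor_idx = match.group(6), match.group(7)
    print(op_type, op_name, task_id, anchor_type, anchor_idx)
    # -> Conv2D conv2d_1 4 output 0
```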
+""" +import json +import numpy as np + +from atat.pytorch.parse_tool.lib.config import Const +from atat.pytorch.parse_tool.lib.utils import Util +from atat.pytorch.parse_tool.lib.parse_exception import ParseException +from atat.core.common.file_check import FileOpen + + +class Visualization: + def __init__(self): + self.util = Util() + + def print_npy_summary(self, target_file): + try: + np_data = np.load(target_file, allow_pickle=True) + except UnicodeError as e: + self.util.log.error("%s %s" % ("UnicodeError", str(e))) + self.util.log.warning("Please check the npy file") + raise ParseException(ParseException.PARSE_UNICODE_ERROR) from e + table = self.util.create_table('', ['Index', 'Data']) + flatten_data = np_data.flatten() + tablesize = 8 + for i in range(min(16, int(np.ceil(flatten_data.size / tablesize)))): + last_idx = min(flatten_data.size, i * tablesize + tablesize) + table.add_row(str(i * tablesize), ' '.join(flatten_data[i * tablesize: last_idx].astype('str').tolist())) + summary = ['[yellow]%s[/yellow]' % self.util.gen_npy_info_txt(np_data), 'Path: %s' % target_file, + "TextFile: %s.txt" % target_file] + self.util.print_panel(self.util.create_columns([table, "\n".join(summary)]), target_file) + self.util.save_npy_to_txt(np_data, target_file + ".txt") + + def print_npy_data(self, file_name): + file_name = self.util.path_strip(file_name) + self.util.check_path_valid(file_name) + self.util.check_path_format(file_name, Const.NPY_SUFFIX) + return self.print_npy_summary(file_name) + + def parse_pkl(self, path, api_name): + path = self.util.path_strip(path) + self.util.check_path_valid(path) + self.util.check_path_format(path, Const.PKL_SUFFIX) + self.util.check_str_param(api_name) + with FileOpen(path, "r") as pkl_handle: + title_printed = False + while True: + pkl_line = pkl_handle.readline() + if pkl_line == '\n': + continue + if len(pkl_line) == 0: + break + try: + msg = json.loads(pkl_line) + except json.JSONDecodeError as e: + self.util.log.error("%s %s in line %s" % ("JSONDecodeError", str(e), pkl_line)) + self.util.log.warning("Please check the pkl file") + raise ParseException(ParseException.PARSE_JSONDECODE_ERROR) from e + info_prefix = msg[0] + if not info_prefix.startswith(api_name): + continue + if info_prefix.find("stack_info") != -1 and len(msg) == 2: + self.util.log.info("\nTrace back({}):".format(msg[0])) + if msg[1] and len(msg[1]) > 4: + for item in reversed(msg[1]): + self.util.log.info(" File \"{}\", line {}, in {}".format(item[0], item[1], item[2])) + self.util.log.info(" {}".format(item[3])) + continue + if len(msg) > 5 and len(msg[5]) >= 3: + summery_info = " [{}][dtype: {}][shape: {}][max: {}][min: {}][mean: {}]" \ + .format(msg[0], msg[3], msg[4], msg[5][0], msg[5][1], msg[5][2]) + if not title_printed: + self.util.log.info("\nStatistic Info:") + title_printed = True + self.util.log.info(summery_info) + pkl_handle.close() diff --git a/debug/accuracy_tools/atat/pytorch/pt_config.py b/debug/accuracy_tools/atat/pytorch/pt_config.py index 46d9b70cc9f9ef38aca9dc29c4df50be2d90e535..e04b88bb96633e54bf6c0e085247907a29752503 100644 --- a/debug/accuracy_tools/atat/pytorch/pt_config.py +++ b/debug/accuracy_tools/atat/pytorch/pt_config.py @@ -1,12 +1,11 @@ import json import os -from ..core.common_config import CommonConfig, BaseConfig -from ..core.file_check_util import FileOpen -from ..core.utils import Const +from atat.core.common_config import CommonConfig, BaseConfig +from atat.core.common.file_check import FileOpen +from atat.core.common.utils import Const -# 
特定任务配置类 class TensorConfig(BaseConfig): def __init__(self, json_config): super().__init__(json_config) diff --git a/debug/accuracy_tools/atat/pytorch/service.py b/debug/accuracy_tools/atat/pytorch/service.py index 8b7f2a1d874cead20d55d3e92121a90b21120756..cd80d0852ac7526615b0a14f059803ef6222ed7f 100644 --- a/debug/accuracy_tools/atat/pytorch/service.py +++ b/debug/accuracy_tools/atat/pytorch/service.py @@ -2,55 +2,45 @@ import functools import os from pathlib import Path -from .common import print_info_log_rank_0 -from .common.file_check import FileChecker, FileCheckConst, check_path_before_create -from .common.utils import get_rank_if_initialized, is_gpu, Const, DistributedNotInitializedError -from .functional import build_repair, build_data_collector, build_step_post_process -from .functional.data_processor import ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs -from .functional.scope import BaseScope -from .hook_module import remove_dropout -from .hook_module.api_registry import api_register -from .module_processer import ModuleProcesser - -from ..core.utils import DumpException +from atat.pytorch.common.log import logger +from atat.core.common.file_check import FileChecker, FileCheckConst, check_path_before_create +from atat.core.common.utils import Const +from atat.core.common.exceptions import DistributedNotInitializedError, MsaccException +from atat.core.data_dump.data_collector import build_data_collector +from atat.core.data_dump.scope import BaseScope +from atat.core.data_dump.data_processor.base import ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs +from atat.pytorch.common.utils import get_rank_if_initialized +from atat.pytorch.module_processer import ModuleProcesser +from atat.pytorch.hook_module import remove_dropout +from atat.pytorch.hook_module.api_registry import api_register class Service: - make_dir_flag = True - REGISTER_HOOK_KWARGS = ["overflow_nums", "dump_mode", "dump_config"] - def __init__(self, config): self.model = None self.config = config self.data_collector = build_data_collector(config) self.module_processor = ModuleProcesser(self.data_collector.scope) - self.repair = build_repair(config) - self.step_post_process = build_step_post_process(config) self.switch = False self.current_iter = 0 self.first_start = True self.current_rank = None - self.first_touch_dir = True self.dump_iter_dir = None def build_hook(self, module_type, name): - def pre_hook(repair, api_or_module_name, module, args, kwargs): - nonlocal module_type, pid + def pre_hook(api_or_module_name, module, args, kwargs): if module_type == BaseScope.Module_Type_Module: api_or_module_name = module.mindstudio_reserved_name self.data_collector.visit_and_clear_overflow_status(api_or_module_name) if not self.switch: return args, kwargs - if repair: - args, kwargs = repair.convert(api_or_module_name, module_type, args, kwargs) if self.data_collector: module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=None) self.data_collector.pre_forward_data_collect(api_or_module_name, module, pid, module_input_output) return args, kwargs - def forward_hook(repair, api_or_module_name, module, args, kwargs, output): - nonlocal module_type, pid + def forward_hook(api_or_module_name, module, args, kwargs, output): if module_type == BaseScope.Module_Type_Module: api_or_module_name = module.mindstudio_reserved_name self.data_collector.visit_and_clear_overflow_status(api_or_module_name) @@ -62,13 +52,9 @@ class Service: self.data_collector.forward_data_collect(api_or_module_name, 
module, pid, module_input_output) if self.data_collector.if_return_forward_new_output(): return self.data_collector.get_forward_new_output() - if repair: - output = repair.invert(api_or_module_name, module_type, output) - return output - def backward_hook(repair, api_or_module_name, module, grad_input, grad_output): - nonlocal module_type, pid + def backward_hook(api_or_module_name, module, grad_input, grad_output): if module_type == BaseScope.Module_Type_Module: api_or_module_name = module.mindstudio_reserved_name self.data_collector.visit_and_clear_overflow_status(api_or_module_name) @@ -82,15 +68,13 @@ class Service: pid = os.getpid() forward_name_template = name + Const.FORWARD backward_name_template = name + Const.BACKWARD - pre_forward_hook = functools.partial(pre_hook, self.repair, forward_name_template) - forward_hook = functools.partial(forward_hook, self.repair, forward_name_template) - backward_hook = functools.partial(backward_hook, None, backward_name_template) + pre_forward_hook = functools.partial(pre_hook, forward_name_template) + forward_hook = functools.partial(forward_hook, forward_name_template) + backward_hook = functools.partial(backward_hook, backward_name_template) return pre_forward_hook, forward_hook, backward_hook def step(self): self.current_iter += 1 - if self.step_post_process: - self.step_post_process() self.data_collector.update_iter(self.current_iter) def start(self, model): @@ -111,10 +95,10 @@ class Service: self.register_hook_new() self.first_start = False self.switch = True - print_info_log_rank_0(f"Dump switch is turned on at step {self.current_iter}. ") + logger.info_on_rank_0(f"Dump switch is turned on at step {self.current_iter}. ") if self.config.level != "L2": self.create_dirs() - print_info_log_rank_0(f"Dump data will be saved in {self.dump_iter_dir}.") + logger.info_on_rank_0(f"Dump data will be saved in {self.dump_iter_dir}.") def stop(self): if self.config.level == "L2": @@ -151,16 +135,11 @@ class Service: dump_file_path, stack_file_path, construct_file_path, dump_data_dir, free_benchmark_file_path) def register_hook_new(self): - hook_name = self.config.task - - if "overflow_check" in hook_name and not is_gpu: - pass - - print_info_log_rank_0("The {} hook function is successfully mounted to the model.".format(hook_name)) + logger.info_on_rank_0("The {} hook function is successfully mounted to the model.".format(self.config.task)) if self.config.level in ["L0", "mix"]: if self.model is None: - raise DumpException("Model is None") - print_info_log_rank_0("The init dump mode is enabled, and the module dump function will not be available") + logger.error_log_with_exp("The model is None.", MsaccException.INVALID_PARAM_ERROR) + logger.info_on_rank_0("The init dump mode is enabled, and the module dump function will not be available") for name, module in self.model.named_modules(): if module == self.model: continue @@ -184,5 +163,5 @@ class Service: api_register.initialize_hook(functools.partial(self.build_hook, BaseScope.Module_Type_API)) api_register.api_modularity() - if Const.STATISTICS in hook_name or Const.TENSOR in hook_name: + if Const.STATISTICS == self.config.task or Const.TENSOR == self.config.task: remove_dropout() diff --git a/debug/accuracy_tools/atat/test/core_ut/test_utils.py b/debug/accuracy_tools/atat/test/core_ut/test_utils.py index 9492bbc9f97a2349a9272ef478eed3b71724ef4d..89734f2c572bff2aa864db16f23dfe8665042f74 100644 --- a/debug/accuracy_tools/atat/test/core_ut/test_utils.py +++ 
b/debug/accuracy_tools/atat/test/core_ut/test_utils.py @@ -1,11 +1,12 @@ from unittest import TestCase from unittest.mock import patch -from atat.core.utils import check_seed_all, Const, CompareException +from atat.core.common.utils import check_seed_all, Const, CompareException, check_inplace_op +from atat.core.common.log import logger class TestUtils(TestCase): - @patch("atat.core.utils.print_error_log") + @patch.object(logger, "error") def test_check_seed_all(self, mock_print_error_log): self.assertIsNone(check_seed_all(1234, True)) self.assertIsNone(check_seed_all(0, True)) @@ -30,3 +31,11 @@ class TestUtils(TestCase): check_seed_all(1234, 1) self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) mock_print_error_log.assert_called_with("seed_all mode must be bool.") + + def test_check_inplace_op(self): + test_prefix_1 = "Distributed.broadcast.0.forward.input.0" + self.assertTrue(check_inplace_op(test_prefix_1)) + test_prefix_2 = "Distributed_broadcast_0_forward_input_0" + self.assertFalse(check_inplace_op(test_prefix_2)) + test_prefix_3 = "Torch.sum.0.backward.output.0" + self.assertFalse(check_inplace_op(test_prefix_3)) diff --git a/debug/accuracy_tools/atat/test/mindspore_ut/test_ms_config.py b/debug/accuracy_tools/atat/test/mindspore_ut/test_ms_config.py index 0029e24bdac79a9f02c92c79cfe06e8c14050154..6be8949684c89012f0dc2165ba24eab4e7a77f1c 100644 --- a/debug/accuracy_tools/atat/test/mindspore_ut/test_ms_config.py +++ b/debug/accuracy_tools/atat/test/mindspore_ut/test_ms_config.py @@ -1,7 +1,7 @@ from unittest import TestCase from unittest.mock import patch, mock_open -from atat.core.utils import Const +from atat.core.common.utils import Const from atat.mindspore.ms_config import parse_json_config @@ -20,8 +20,8 @@ class TestMsConfig(TestCase): "summary_mode": "statistics" } } - with (patch("atat.mindspore.ms_config.FileOpen", mock_open(read_data='')), - patch("atat.mindspore.ms_config.json.load", return_value=mock_json_data)): + with patch("atat.mindspore.ms_config.FileOpen", mock_open(read_data='')), \ + patch("atat.mindspore.ms_config.json.load", return_value=mock_json_data): common_config, task_config = parse_json_config("./config.json") self.assertEqual(common_config.task, Const.STATISTICS) self.assertEqual(task_config.data_mode, ["all"]) diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/advisor/test_advisor.py b/debug/accuracy_tools/atat/test/pytorch_ut/advisor/test_advisor.py new file mode 100644 index 0000000000000000000000000000000000000000..78e5b489e7ad14f2965b813d733a30c8849b8a71 --- /dev/null +++ b/debug/accuracy_tools/atat/test/pytorch_ut/advisor/test_advisor.py @@ -0,0 +1,82 @@ +import difflib +import os +import shutil +import unittest +from unittest.mock import patch + +import pandas + +from atat.pytorch.advisor.advisor import Advisor +from atat.pytorch.advisor.advisor_const import AdvisorConst + + +class TestAdvisor(unittest.TestCase): + + def setUp(self): + self.base_test_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) + self.input_dir = os.path.join(self.base_test_dir, 'resources') + self.output_path = os.path.abspath(os.path.join(self.base_test_dir, 'test_output')) + + os.makedirs(self.output_path, mode=0o700, exist_ok=True) + self.has_error = False + + self.input_data = pandas.read_csv(os.path.join(self.input_dir, 'compare_result_20230703104808.csv')) + self.advisor = Advisor(self.input_data, self.output_path) + + def tearDown(self) -> None: + shutil.rmtree(self.output_path, ignore_errors=True) + + 
@patch("os.path.realpath") + def test_init(self, mock_realpath): + mock_realpath.return_value = 'real_output_path' + adv = Advisor(self.input_data, self.output_path) + self.assertEqual(adv.out_path, 'real_output_path') + + def test_deterministic_advisor_when_api_in_need_determ_api(self): + msg = self.advisor.deterministic_advisor('', 'Functional.layer_norm.0.forward_input.0') + self.assertEqual(msg, AdvisorConst.DETERMINISTIC_SUGGEST) + + def test_deterministic_advisor_when_api_not_in_need_determ_api(self): + mock_message = 'mock message' + msg = self.advisor.deterministic_advisor(mock_message, 'Functional.linear.0.forward_input.0') + self.assertEqual(msg, mock_message) + + def test_batch_norm_advisor(self): + mock_message = 'mocked batch norm advisor message' + msg1 = self.advisor.batch_norm_advisor(mock_message, AdvisorConst.FUNC_BATCH_NORM + '' + + AdvisorConst.FORWARD_INPUT_1) + msg2 = self.advisor.batch_norm_advisor(mock_message, 'Functional.linear.0.forward_output.1') + self.assertEqual(msg1, AdvisorConst.BATCH_NORM_SUGGEST) + self.assertEqual(msg2, mock_message) + + def test_gen_advisor_message(self): + self.assertIn(AdvisorConst.FORWARD_OUTPUT_SUGGEST, self.advisor.gen_advisor_message( + 'Functional.linear.0.forward_output.1')) + self.assertIn(AdvisorConst.BACKWARD_INPUT_SUGGEST, self.advisor.gen_advisor_message( + 'Functional.linear.0.backward_input.1')) + + def test_advisor_summary_file(self): + self.advisor.analysis() + filenames = os.listdir(self.output_path) + for filename in filenames: + filename = os.path.join(self.output_path, filename) + self.result_check(os.path.join(self.input_dir, 'advisor.txt'), filename) + self.assertFalse(self.has_error) + + def result_check(self, standard_file, output_file): + with open(standard_file, 'r', encoding='utf-8') as st_file: + standard_content = st_file.read().splitlines() + with open(output_file, 'r', encoding='utf-8') as out_file: + output_content = out_file.read().splitlines() + result = list(difflib.unified_diff(standard_content, output_content, n=0)) + if result: + print('\n\n-------------------------------------------------------------------------', flush=True) + print(f'[ERROR] {output_file.replace(self.output_path, "")} advisor summary are inconsistent.', + flush=True) + print('\n'.join(result), flush=True) + print('-------------------------------------------------------------------------', flush=True) + self.has_error = True + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/common/test_common_utils.py b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/common/test_common_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..16d0c0bc12738bb7ce129224cf124761503c31fd --- /dev/null +++ b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/common/test_common_utils.py @@ -0,0 +1,108 @@ +import unittest +from unittest.mock import patch + +from atat.pytorch.api_accuracy_checker.common.utils import * + + +class TestUtils(unittest.TestCase): + + @patch('atat.pytorch.api_accuracy_checker.common.utils.get_file_content_bytes') + def test_get_json_contents_should_raise_exception(self, mock_get_file_content_bytes): + mock_get_file_content_bytes.return_value = 'not a dict' + with self.assertRaises(CompareException) as ce: + get_json_contents('') + self.assertEqual(ce.exception.code, CompareException.INVALID_FILE_ERROR) + + def test_get_json_contents_should_return_json_obj(self): + test_dict = {"key": "value"} + file_name = 
'test.json' + + fd = os.open(file_name, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644) + with os.fdopen(fd, 'w') as f: + json.dump(test_dict, f) + self.assertEqual(get_json_contents(file_name), test_dict) + os.remove(file_name) + + def test_write_csv(self): + test_file_name = 'test.csv' + test_data = [["name", "age"], ["Alice", "20"], ["Bob", "30"]] + write_csv(test_data, 'test.csv') + with open(test_file_name, 'r', encoding='utf-8-sig') as f: + reader = csv.reader(f) + for i, row in enumerate(reader): + self.assertEqual(row, test_data[i]) + os.remove(test_file_name) + + def test_check_need_convert(self): + self.assertEqual(check_need_convert('cross_entropy'), 'int32_to_int64') + self.assertIsNone(check_need_convert('linear')) + + def test_check_object_type(self): + try: + check_object_type(123, int) + except Exception as e: + self.fail(f"check_object_type raised exception {e}") + + def test_check_file_or_directory_path(self): + try: + check_file_or_directory_path(__file__) + except Exception as e: + self.fail(f"check_file_or_directory_path raised exception {e}") + + def test_create_directory(self): + test_dir_name = 'test_dir' + create_directory(test_dir_name) + self.assertTrue(os.path.exists(test_dir_name)) + os.rmdir(test_dir_name) + + def test_get_file_content_bytes(self): + fd = os.open('test.txt', os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644) + with os.fdopen(fd, 'w') as f: + f.write("Hello, World!") + self.assertEqual(get_file_content_bytes('test.txt'), b"Hello, World!") + os.remove('test.txt') + + @patch('os.path.exists') + def test_check_file_or_dir_path_should_raise_exe_when_dir_path_not_existed(self, mock_path_exists): + mock_path_exists.return_value = False + with self.assertRaises(CompareException) as ce: + check_file_or_directory_path('', isdir=True) + self.assertEqual(ce.exception.code, CompareException.INVALID_PATH_ERROR) + + @patch('os.path.exists') + @patch('os.path.isdir') + @patch('os.access') + def test_check_file_or_dir_path_should_pass_when_path_is_dir(self, mock_os_access, mock_path_is_dir, + mock_path_exists): + mock_os_access.return_value = True + mock_path_is_dir.return_value = True + mock_path_exists.return_value = True + check_file_or_directory_path('', isdir=True) + + @patch('os.path.isfile') + @patch('os.access') + def test_check_file_or_dir_path_should_raise_exe_when_file_not_access(self, mock_os_access, mock_path_is_file): + mock_os_access.return_value = False + mock_path_is_file.return_value = True + with self.assertRaises(CompareException) as ce: + check_file_or_directory_path('', isdir=False) + self.assertEqual(ce.exception.code, CompareException.INVALID_PATH_ERROR) + + def test_check_file_or_dir_path_should_pass_when_path_is_file(self): + with unittest.mock.patch('os.path.isfile', return_value=True), \ + unittest.mock.patch('os.access', return_value=True): + check_file_or_directory_path('', isdir=False) + + def test_api_info_preprocess_no_conversion_needed(self): + api_name = 'linear' + original_api_info = {'key': 'value'} + convert_type, processed_api_info = api_info_preprocess(api_name, original_api_info.copy()) + self.assertIsNone(convert_type) + self.assertEqual(original_api_info, processed_api_info) + + def test_api_info_preprocess_cross_entropy_positive(self): + api_name = 'cross_entropy' + api_info = {'args': [{'Name': 'logit'}, {'Name': 'labels', 'Min': 1}]} + convert_type, processed_api_info = api_info_preprocess(api_name, api_info.copy()) + self.assertEqual(convert_type, 'int32_to_int64') + self.assertEqual(processed_api_info, api_info) diff 
--git a/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/common/test_config.py b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/common/test_config.py new file mode 100644 index 0000000000000000000000000000000000000000..066e74aa518dd7958a511bb47be92bce7ce5ac0b --- /dev/null +++ b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/common/test_config.py @@ -0,0 +1,39 @@ +import unittest +import os +from unittest.mock import patch + +from atat.pytorch.api_accuracy_checker.common.config import Config + + +class TestConfig(unittest.TestCase): + def setUp(self): + self.base_test_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))) + self.input_dir = os.path.join(self.base_test_dir, 'resources') + self.yaml_file = os.path.join(self.input_dir, "config.yaml") + self.cfg = Config(self.yaml_file) + + def test_validate_valid_data(self): + for key, val in self.cfg.config.items(): + validated_type = self.cfg.validate(key, val) + self.assertEqual(validated_type, val) + + def test_validate_should_raise_when_invalid_type(self): + with self.assertRaises(ValueError): + self.cfg.validate('error_data_path', True) + + def test_validate_should_raise_when_invalid_key(self): + with self.assertRaises(ValueError): + self.cfg.validate('invalid_key', 'mock_value') + + def test_validate_precision(self): + self.assertEqual(self.cfg.validate('precision', 1), 1) + + with self.assertRaises(ValueError): + self.cfg.validate('precision', -1) + + def test_validate_white_list(self): + validate_white_list = ['conv1d', 'max_pool1d', 'dropout', '__add__'] + self.assertEqual(self.cfg.validate('white_list', validate_white_list), validate_white_list) + + with self.assertRaises(ValueError): + self.cfg.validate('white_list', ['invalid_api1', 'invalid_api2']) diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_algorithm.py b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_algorithm.py new file mode 100644 index 0000000000000000000000000000000000000000..9604e7a681c869c41cdfa70b7b1b551ceed9604e --- /dev/null +++ b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_algorithm.py @@ -0,0 +1,112 @@ +import unittest + +import numpy as np + +from atat.pytorch.api_accuracy_checker.compare import algorithm as alg + + +class TestAlgorithmMethods(unittest.TestCase): + + def setUp(self): + self.bench_data = np.array([1.0, 1.0, 9.0], dtype=np.float16) + self.device_data = np.array([5.0, 2.0, 1.0], dtype=np.float16) + self.abs_err = np.abs(self.device_data - self.bench_data) + self.rel_err_origin = np.abs(self.abs_err / self.bench_data) + eps = np.finfo(self.bench_data.dtype).eps + self.abs_bench = np.abs(self.bench_data) + self.abs_bench_with_eps = self.abs_bench + eps + self.rel_err = self.abs_err / self.abs_bench_with_eps + + def test_cosine_sim(self): + cpu_output = np.array([1.0, 2.0, 3.0]) + npu_output = np.array([1.0, 2.0, 3.0]) + self.assertEqual(alg.cosine_sim(cpu_output, npu_output), (1.0, True, '')) + + def test_get_rmse(self): + inf_nan_mask = [False, False, False] + self.assertAlmostEqual(alg.get_rmse(self.abs_err, inf_nan_mask), 5.196, 3) + + def test_get_error_balance(self): + self.assertEqual(alg.get_error_balance(self.bench_data, self.device_data), 1 / 3) + + def test_get_small_value_err_ratio(self): + small_value_mask = [True, True, True, False, True] + abs_err_greater_mask = [False, True, True, True, False] + 
self.assertEqual(alg.get_small_value_err_ratio(small_value_mask, abs_err_greater_mask), 0.5)
+
+    def test_get_rel_err(self):
+        eps = np.finfo(self.bench_data.dtype).eps
+        abs_bench = np.abs(self.bench_data)
+        abs_bench_with_eps = abs_bench + eps
+        small_value_mask = [False, False, False]
+        inf_nan_mask = [False, False, False]
+        rel_err = self.abs_err / abs_bench_with_eps
+        self.assertListEqual(list(alg.get_rel_err(self.abs_err, abs_bench_with_eps, small_value_mask, inf_nan_mask)),
+                             list(rel_err))
+
+    def test_get_abs_err(self):
+        self.assertListEqual(list(alg.get_abs_err(self.bench_data, self.device_data)), [4.0, 1.0, 8.0])
+
+    def test_get_rel_err_origin(self):
+        self.assertListEqual(list(alg.get_rel_err_origin(self.abs_err, self.bench_data)), list(self.rel_err_origin))
+
+    def test_get_max_abs_err(self):
+        self.assertEqual(alg.get_max_abs_err(self.abs_err), (8.0, False))
+
+    def test_get_max_rel_err(self):
+        self.assertAlmostEqual(alg.get_max_rel_err(self.rel_err), 3.996, 3)
+
+    def test_get_mean_rel_err(self):
+        self.assertAlmostEqual(alg.get_mean_rel_err(self.rel_err), 1.961, 3)
+
+    def test_get_rel_err_ratio_thousandth(self):
+        b_value = np.array([1.0, 2.0, 3.0])
+        n_value = np.array([1.0, 2.0, 3.0])
+        abs_err = np.abs(b_value - n_value)
+        rel_err = alg.get_rel_err_origin(abs_err, b_value)
+        self.assertEqual(alg.get_rel_err_ratio(rel_err, 0.001), (1.0, True))
+
+    def test_get_rel_err_ratio_ten_thousandth(self):
+        b_value = np.array([1.0, 2.0, 3.0])
+        n_value = np.array([1.0, 2.0, 3.0])
+        abs_err = np.abs(b_value - n_value)
+        rel_err = alg.get_rel_err_origin(abs_err, b_value)
+        self.assertEqual(alg.get_rel_err_ratio(rel_err, 0.0001), (1.0, True))
+
+    def test_get_finite_and_infinite_mask(self):
+        both_finite_mask, inf_nan_mask = alg.get_finite_and_infinite_mask(self.bench_data, self.device_data)
+        self.assertListEqual(list(both_finite_mask), [True, True, True])
+        self.assertListEqual(list(inf_nan_mask), [False, False, False])
+
+    def test_get_small_value_mask(self):
+        b_value = np.array([1e-7, 1.0, 2e-6], dtype=np.float16)
+        abs_bench = np.abs(b_value)
+        both_finite_mask = [True, True, True]
+        small_value_mask = alg.get_small_value_mask(abs_bench, both_finite_mask, 1e-3)
+        self.assertListEqual(list(small_value_mask), [True, False, True])
+
+    def test_get_abs_bench_with_eps(self):
+        abs_bench, abs_bench_with_eps = alg.get_abs_bench_with_eps(self.bench_data, np.float16)
+        self.assertListEqual(list(abs_bench), list(self.abs_bench))
+        self.assertListEqual(list(abs_bench_with_eps), list(self.abs_bench_with_eps))
+
+    def test_check_inf_nan_value(self):
+        both_finite_mask, inf_nan_mask = alg.get_finite_and_infinite_mask(self.bench_data, self.device_data)
+        self.assertEqual(alg.check_inf_nan_value(inf_nan_mask, self.bench_data, self.device_data, np.float16, 0.001), 0)
+
+    def test_check_small_value(self):
+        a_value = np.array([1e-7, 1.0, 2e-6], dtype=np.float16)
+        b_value = np.array([1e-7, 1.0, 2e-6], dtype=np.float16)
+        abs_bench = np.abs(b_value)
+        both_finite_mask = [True, True, True]
+        abs_err = abs(a_value - b_value)
+        small_value_mask = alg.get_small_value_mask(abs_bench, both_finite_mask, 1e-3)
+        self.assertEqual(alg.check_small_value(abs_err, small_value_mask, 0.001), 0)
+
+    def test_check_norm_value(self):
+        both_finite_mask, inf_nan_mask = alg.get_finite_and_infinite_mask(self.bench_data, self.device_data)
+        small_value_mask = alg.get_small_value_mask(self.abs_bench, both_finite_mask, 1e-3)
+        normal_value_mask = np.logical_and(both_finite_mask, np.logical_not(small_value_mask))
+        self.assertEqual(alg.check_norm_value(normal_value_mask, self.rel_err, 0.001), 1)
diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py
new file mode 100644
index 0000000000000000000000000000000000000000..aab90f122f64a148955a7c824e8975c7d19cb679
--- /dev/null
+++ b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py
@@ -0,0 +1,75 @@
+import unittest
+
+import pandas as pd
+
+from atat.pytorch.api_accuracy_checker.compare.api_precision_compare import (
+    CompareConfig,
+    BenchmarkStandard,
+    check_csv_columns,
+    check_error_rate,
+    get_api_checker_result,
+)
+from atat.pytorch.api_accuracy_checker.compare.compare_utils import CompareConst
+
+
+class TestApiPrecisionCompare(unittest.TestCase):
+
+    def setUp(self):
+        # Setup paths and mock data
+        self.config = CompareConfig(
+            npu_csv_path='mock_npu.csv',
+            gpu_csv_path='mock_gpu.csv',
+            result_csv_path='result.csv',
+            details_csv_path='details.csv'
+        )
+
+        self.npu_data = pd.DataFrame({
+            'API_NAME': ['api1.forward', 'api1.backward'],
+            'DEVICE_DTYPE': ['float32', 'float32'],
+            'ERROR_RATE': ['0', '0.1'],
+            'SMALL_VALUE_ERROR_RATE': ['0.01', '0.02'],
+            'RMSE': ['0.1', '0.2'],
+            'MAX_REL_ERR': ['0.1', '0.2'],
+            'MEAN_REL_ERR': ['0.1', '0.2'],
+            'EB': ['0.1', '0.2']
+        })
+
+        self.gpu_data = pd.DataFrame({
+            'API_NAME': ['api1.forward', 'api1.backward'],
+            'DEVICE_DTYPE': ['float32', 'float32'],
+            'ERROR_RATE': ['0', '0'],
+            'SMALL_VALUE_ERROR_RATE': ['0.01', '0.01'],
+            'RMSE': ['0.1', '0.1'],
+            'MAX_REL_ERR': ['0.1', '0.1'],
+            'MEAN_REL_ERR': ['0.1', '0.1'],
+            'EB': ['0.1', '0.1']
+        })
+
+    def test_benchmark_standard_calc_ratio(self):
+        result = BenchmarkStandard._calc_ratio('2', '1')
+        self.assertEqual(result, 2.0)
+
+        result = BenchmarkStandard._calc_ratio('0', '0')
+        self.assertEqual(result, 1.0)
+
+    def test_check_csv_columns(self):
+        with self.assertRaises(Exception):
+            check_csv_columns([], 'test_csv')
+
+    def test_check_error_rate(self):
+        result = check_error_rate('0')
+        self.assertEqual(result, CompareConst.PASS)
+
+        result = check_error_rate('0.1')
+        self.assertEqual(result, CompareConst.ERROR)
+
+    def test_get_api_checker_result(self):
+        result = get_api_checker_result([CompareConst.PASS, CompareConst.ERROR])
+        self.assertEqual(result, CompareConst.ERROR)
+
+        result = get_api_checker_result([CompareConst.PASS, CompareConst.PASS])
+        self.assertEqual(result, CompareConst.PASS)
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py
new file mode 100644
index 0000000000000000000000000000000000000000..46b0ddc800d59c6c2ed6fed79ecfc18b080a2a18
--- /dev/null
+++ b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py
@@ -0,0 +1,128 @@
+import csv
+import os
+import shutil
+import time
+import unittest
+
+import numpy as np
+import torch.nn.functional
+
+from atat.pytorch.api_accuracy_checker.compare.compare import Comparator
+from atat.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn
+from atat.pytorch.api_accuracy_checker.run_ut.run_ut import UtDataInfo
+
+current_time = time.strftime("%Y%m%d%H%M%S")
+RESULT_FILE_NAME = "accuracy_checking_result_" + current_time + ".csv"
+DETAILS_FILE_NAME = "accuracy_checking_details_" + current_time + '.csv'
+base_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+
+class TestCompare(unittest.TestCase):
+    def setUp(self):
+        self.output_path = os.path.join(base_dir, "../compare_result")
+        os.mkdir(self.output_path, mode=0o750)
+        self.result_csv_path = os.path.join(self.output_path, RESULT_FILE_NAME)
+        self.details_csv_path = os.path.join(self.output_path, DETAILS_FILE_NAME)
+        self.is_continue_run_ut = False
+        self.compare = Comparator(self.result_csv_path, self.details_csv_path, self.is_continue_run_ut)
+
+    def tearDown(self) -> None:
+        if os.path.exists(self.output_path):
+            shutil.rmtree(self.output_path)
+
+    def test_compare_dropout(self):
+        dummy_input = torch.randn(100, 100)
+        bench_out = torch.nn.functional.dropout2d(dummy_input, 0.3)
+        npu_out = torch.nn.functional.dropout2d(dummy_input, 0.3)
+        self.assertTrue(self.compare._compare_dropout(bench_out, npu_out))
+
+    def test_compare_core_wrapper(self):
+        # after the migration, detailed_result_total lost three columns; to be filled in later
+        dummy_input = torch.randn(100, 100)
+        bench_out, npu_out = dummy_input, dummy_input
+        test_final_success, detailed_result_total = self.compare._compare_core_wrapper("api", bench_out, npu_out)
+        actual_cosine_similarity = detailed_result_total[0][3]
+        # set a small tolerance value
+        tolerance = 1e-4
+        # check that the actual cosine similarity is within tolerance of the expected value
+        self.assertTrue(np.isclose(actual_cosine_similarity, 1.0, atol=tolerance))
+        # compare the remaining values to make sure they match expectations
+        detailed_result_total[0][3] = 1.0
+        self.assertEqual(detailed_result_total, [['torch.float32', 'torch.float32', (100, 100), 1.0, 0.0, ' ', ' ', ' ',
+                                                  ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', 'pass',
+                                                  '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n']])
+        self.assertTrue(test_final_success)
+
+        bench_out, npu_out = [dummy_input, dummy_input], [dummy_input, dummy_input]
+        test_final_success, detailed_result_total = self.compare._compare_core_wrapper("api", bench_out, npu_out)
+        actual_cosine_similarity = detailed_result_total[0][3]
+        self.assertTrue(np.isclose(actual_cosine_similarity, 1.0, atol=tolerance))
+        actual_cosine_similarity = detailed_result_total[1][3]
+        self.assertTrue(np.isclose(actual_cosine_similarity, 1.0, atol=tolerance))
+        detailed_result_total[0][3] = 1.0
+        detailed_result_total[1][3] = 1.0
+        self.assertTrue(test_final_success)
+        self.assertEqual(detailed_result_total, [['torch.float32', 'torch.float32', (100, 100), 1.0, 0.0, ' ', ' ', ' ',
+                                                  ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', 'pass',
+                                                  '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n'],
+                                                 ['torch.float32', 'torch.float32', (100, 100), 1.0, 0.0, ' ', ' ', ' ',
+                                                  ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', 'pass',
+                                                  '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n']])
+
+    def test_compare_output(self):
+        bench_out, npu_out = torch.randn(100, 100), torch.randn(100, 100)
+        bench_grad, npu_grad = [torch.randn(100, 100)], [torch.randn(100, 100)]
+        api_name = 'Functional.conv2d.0'
+        data_info = UtDataInfo(bench_grad, npu_grad, bench_out, npu_out, None, None)
+        is_fwd_success, is_bwd_success = self.compare.compare_output(api_name, data_info.bench_out,
+                                                                     data_info.device_out)
+        self.assertFalse(is_fwd_success)
+        # is_bwd_success should be checked
+
+        dummy_input = torch.randn(100, 100)
+        bench_out, npu_out = dummy_input, dummy_input
+        data_info = UtDataInfo(None, None, bench_out, npu_out, None, None)
+        is_fwd_success, 
is_bwd_success = self.compare.compare_output(api_name, data_info.bench_out, + data_info.device_out) + self.assertTrue(is_fwd_success) + self.assertTrue(is_bwd_success) + + def test_record_results(self): + args = ('Functional.conv2d.0', False, 'N/A', [['torch.float64', 'torch.float32', (32, 64, 112, 112), 1.0, + 0.012798667686, 'N/A', 0.81631212311, 0.159979121213, 'N/A', + 'error', '\n']], None, 0) + self.compare.record_results(args) + with open(self.details_csv_path, 'r') as file: + csv_reader = csv.reader(file) + next(csv_reader) + api_name_list = [row[0] for row in csv_reader] + self.assertEqual(api_name_list[0], 'Functional.conv2d.0.forward.output.0') + + def test_compare_torch_tensor(self): + cpu_output = torch.Tensor([1.0, 2.0, 3.0]) + npu_output = torch.Tensor([1.0, 2.0, 3.0]) + compare_column = CompareColumn() + status, compare_column, message = self.compare._compare_torch_tensor("api", cpu_output, npu_output, + compare_column) + self.assertEqual(status, "pass") + + def test_compare_bool_tensor(self): + cpu_output = np.array([True, False, True]) + npu_output = np.array([True, False, True]) + self.assertEqual(self.compare._compare_bool_tensor(cpu_output, npu_output), (0.0, 'pass', '')) + + def test_compare_builtin_type(self): + compare_column = CompareColumn() + bench_out = 1 + npu_out = 1 + status, compare_result, message = self.compare._compare_builtin_type(bench_out, npu_out, compare_column) + self.assertEqual((status, compare_result.error_rate, message), ('pass', 0, '')) + + def test_compare_float_tensor(self): + cpu_output = torch.Tensor([1.0, 2.0, 3.0]) + npu_output = torch.Tensor([1.0, 2.0, 3.0]) + compare_column = CompareColumn() + status, compare_column, message = self.compare._compare_float_tensor("api", cpu_output.numpy(), + npu_output.numpy(), + compare_column, npu_output.dtype) + self.assertEqual(status, "pass") diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare_column.py b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare_column.py new file mode 100644 index 0000000000000000000000000000000000000000..ee25a25e74d18dd0cc5436747767aa2acbbab05e --- /dev/null +++ b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare_column.py @@ -0,0 +1,10 @@ +import unittest + +from atat.pytorch.api_accuracy_checker.compare.compare_column import ApiPrecisionOutputColumn + + +class TestCompareColumns(unittest.TestCase): + + def test_api_precision_output_column(self): + col = ApiPrecisionOutputColumn() + self.assertIsInstance(col.to_column_value(), list) diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare_utils.py b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..afa8318f7c47b50725b0b5c72e1f199f50203889 --- /dev/null +++ b/debug/accuracy_tools/atat/test/pytorch_ut/api_accuracy_checker/compare/test_compare_utils.py @@ -0,0 +1,49 @@ +import unittest + +import numpy as np + +from atat.pytorch.api_accuracy_checker.common.utils import CompareException +from atat.pytorch.api_accuracy_checker.compare.compare_utils import check_dtype_comparable, convert_str_to_float + + +class TestCompareUtils(unittest.TestCase): + def test_check_dtype_comparable(self): + x = np.array([1, 2, 3], dtype=np.int32) + y = np.array([4, 5, 6], dtype=np.int32) + self.assertTrue(check_dtype_comparable(x, y)) + + x = np.array([1.0, 2.0, 3.0], dtype=np.float32) 
+ y = np.array([4.0, 5.0, 6.0], dtype=np.float32) + self.assertTrue(check_dtype_comparable(x, y)) + + x = np.array([True, False, True], dtype=np.bool_) + y = np.array([False, True, False], dtype=np.bool_) + self.assertTrue(check_dtype_comparable(x, y)) + + x = np.array([1, 2, 3], dtype=np.int32) + y = np.array([4.0, 5.0, 6.0], dtype=np.float32) + self.assertFalse(check_dtype_comparable(x, y)) + + x = np.array([1, 2, 3], dtype=np.int32) + y = np.array([True, False, True], dtype=np.bool_) + self.assertFalse(check_dtype_comparable(x, y)) + + def test_convert_str_to_float_when_valid_float(self): + self.assertEqual(convert_str_to_float("123.45"), 123.45) + + def test_convert_str_to_float_when_valid_int(self): + self.assertEqual(convert_str_to_float("123.0"), 123.0) + + def test_convert_str_to_float_when_valid_int_with_spaces(self): + self.assertEqual(convert_str_to_float(" 123.0 "), 123.0) + + def test_convert_str_to_float_when_empty_string(self): + with self.assertRaises(CompareException) as cm: + convert_str_to_float('') + self.assertEqual(cm.exception.code, CompareException.INVALID_DATA_ERROR) + + def test_convert_str_to_float_when_invalid_inf_string(self): + with self.assertRaises(CompareException) as cm: + convert_str_to_float('inf') + self.assertEqual(cm.exception.code, CompareException.INVALID_DATA_ERROR) + diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/compare/test_acc_compare.py b/debug/accuracy_tools/atat/test/pytorch_ut/compare/test_acc_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..5a82289a0003a332e41d9202b63e7bde4bc43c42 --- /dev/null +++ b/debug/accuracy_tools/atat/test/pytorch_ut/compare/test_acc_compare.py @@ -0,0 +1,17 @@ +# coding=utf-8 +import unittest +from atat.pytorch.compare.acc_compare import rename_api + +class TestUtilsMethods(unittest.TestCase): + + def test_rename_api(self): + test_name_1 = "Distributed.broadcast.0.forward.input.0" + expect_name_1 = "Distributed.broadcast.input.0" + actual_name_1 = rename_api(test_name_1, "forward") + self.assertEqual(actual_name_1, expect_name_1) + + test_name_2 = "Torch.sum.0.backward.output.0" + expect_name_2 = "Torch.sum.output.0" + actual_name_2 = rename_api(test_name_2, "backward") + self.assertEqual(actual_name_2, expect_name_2) + \ No newline at end of file diff --git a/debug/accuracy_tools/atat/test/pytorch_ut/test_pt_config.py b/debug/accuracy_tools/atat/test/pytorch_ut/test_pt_config.py index 8279c207765f802b741848f14b493e41758055fd..c931c8550716bc65b58655db6140684408f596cc 100644 --- a/debug/accuracy_tools/atat/test/pytorch_ut/test_pt_config.py +++ b/debug/accuracy_tools/atat/test/pytorch_ut/test_pt_config.py @@ -1,7 +1,7 @@ from unittest import TestCase from unittest.mock import patch, mock_open -from atat.core.utils import Const +from atat.core.common.utils import Const from atat.pytorch.pt_config import parse_json_config @@ -23,16 +23,16 @@ class TestPtConfig(TestCase): "file_format": "npy" } } - with (patch("atat.pytorch.pt_config.os.path.join", return_value="/path/config.json"), - patch("atat.pytorch.pt_config.FileOpen", mock_open(read_data='')), - patch("atat.pytorch.pt_config.json.load", return_value=mock_json_data)): + with patch("atat.pytorch.pt_config.os.path.join", return_value="/path/config.json"), \ + patch("atat.pytorch.pt_config.FileOpen", mock_open(read_data='')), \ + patch("atat.pytorch.pt_config.json.load", return_value=mock_json_data): common_config, task_config = parse_json_config(None, None) self.assertEqual(common_config.task, Const.STATISTICS) 
self.assertEqual(task_config.data_mode, ["all"]) - with (patch("atat.pytorch.pt_config.os.path.join", return_value="/path/config.json"), - patch("atat.pytorch.pt_config.FileOpen", mock_open(read_data='')), - patch("atat.pytorch.pt_config.json.load", return_value=mock_json_data)): + with patch("atat.pytorch.pt_config.os.path.join", return_value="/path/config.json"), \ + patch("atat.pytorch.pt_config.FileOpen", mock_open(read_data='')), \ + patch("atat.pytorch.pt_config.json.load", return_value=mock_json_data): common_config, task_config = parse_json_config(None, Const.TENSOR) self.assertEqual(common_config.task, Const.STATISTICS) self.assertEqual(task_config.file_format, "npy") diff --git a/debug/accuracy_tools/atat/test/resources/advisor.txt b/debug/accuracy_tools/atat/test/resources/advisor.txt new file mode 100644 index 0000000000000000000000000000000000000000..5c4825e28ebde12b43ad7e46bf05820929c88f8d --- /dev/null +++ b/debug/accuracy_tools/atat/test/resources/advisor.txt @@ -0,0 +1,3 @@ +Line: NA +Suspect Nodes: NA +Expert Advice: All data in comparison result meets the accuracy requirements. diff --git a/debug/accuracy_tools/atat/test/resources/compare_result_20230703104808.csv b/debug/accuracy_tools/atat/test/resources/compare_result_20230703104808.csv new file mode 100644 index 0000000000000000000000000000000000000000..a7742ff3fd0863fa157dbabebee252aea6b70888 --- /dev/null +++ b/debug/accuracy_tools/atat/test/resources/compare_result_20230703104808.csv @@ -0,0 +1,9 @@ +NPU Name,Bench Name,NPU Tensor Dtype,Bench Tensor Dtype,NPU Tensor Shape,Bench Tensor Shape,Cosine,MaxAbsErr,NPU max,NPU min,NPU mean,Bench max,Bench min,Bench mean,Accuracy Reached or Not,Err_message +Functional_linear_0_forward_input.0,Functional_linear_0_forward_input.0,torch.float32,torch.float32,"[3, 2]","[3, 2]",1.0,0.000000,1.948258399963379,-1.0052297115325928,-0.2003595232963562,1.948258399963379,-1.0052297115325928,-0.2003595232963562,Yes, +Functional_linear_0_forward_input.1,Functional_linear_0_forward_input.1,torch.float32,torch.float32,"[3, 2]","[3, 2]",1.0,0.000000,0.28375449776649475,-0.6661239266395569,-0.2789986729621887,0.28375449776649475,-0.6661239266395569,-0.2789986729621887,Yes, +Functional_linear_0_forward_input.2,Functional_linear_0_forward_input.2,torch.float32,torch.float32,[3],[3],1.0,0.000000,0.2457989901304245,-0.6338542103767395,-0.14437106251716614,0.2457989901304245,-0.6338542103767395,-0.14437106251716614,Yes, +Functional_linear_0_forward_output,Functional_linear_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,-0.8729169964790344,0.16790540516376495,0.8278868794441223,-0.8729169964790344,0.16790540516376495,Yes, +Torch_relu_0_forward_input.0,Torch_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,-0.8729169964790344,0.16790540516376495,0.8278868794441223,-0.8729169964790344,0.16790540516376495,Yes, +Torch_relu_0_forward_output,Torch_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,0.0,0.31367552280426025,0.8278868794441223,0.0,0.31367552280426025,Yes, +Functional_relu_0_forward_input.0,Functional_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,-0.8729169964790344,0.16790540516376495,0.8278868794441223,-0.8729169964790344,0.16790540516376495,Yes, +Functional_relu_0_forward_output,Functional_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 
3]",1.0,0.000000,0.8278868794441223,0.0,0.31367552280426025,0.8278868794441223,0.0,0.31367552280426025,Yes, diff --git a/debug/accuracy_tools/atat/test/resources/compare_result_without_accuracy.csv b/debug/accuracy_tools/atat/test/resources/compare_result_without_accuracy.csv new file mode 100644 index 0000000000000000000000000000000000000000..404af78ec03f497f91dc7fcfc7c6ab0e855e7e7b --- /dev/null +++ b/debug/accuracy_tools/atat/test/resources/compare_result_without_accuracy.csv @@ -0,0 +1,9 @@ +NPU Name,Bench Name,NPU Tensor Dtype,Bench Tensor Dtype,NPU Tensor Shape,Bench Tensor Shape,Cosine,MaxAbsErr,NPU max,NPU min,NPU mean,Bench max,Bench min,Bench mean,Accuracy Reached or Not,Err_message +,Functional_linear_0_forward_input.0,torch.float32,torch.float32,"[3, 2]","[3, 2]",1,0,1.9482584,-1.005229712,-0.200359523,1.9482584,-1.005229712,-0.200359523,, +,Functional_linear_0_forward_input.1,torch.float32,torch.float32,"[3, 2]","[3, 2]",1,0,0.283754498,-0.666123927,-0.278998673,0.283754498,-0.666123927,-0.278998673,, +,Functional_linear_0_forward_input.2,torch.float32,torch.float32,[3],[3],1,0,0.24579899,-0.63385421,-0.144371063,0.24579899,-0.63385421,-0.144371063,, +,Functional_linear_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,-0.872916996,0.167905405,0.827886879,-0.872916996,0.167905405,, +,Torch_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,-0.872916996,0.167905405,0.827886879,-0.872916996,0.167905405,, +,Torch_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,0,0.313675523,0.827886879,0,0.313675523,, +,Functional_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,-0.872916996,0.167905405,0.827886879,-0.872916996,0.167905405,, +,Functional_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,0,0.313675523,0.827886879,0,0.313675523,, diff --git a/debug/accuracy_tools/atat/test/resources/config.yaml b/debug/accuracy_tools/atat/test/resources/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..1744c9cf4a8aa0c034157edfbe80c8083a87ad9c --- /dev/null +++ b/debug/accuracy_tools/atat/test/resources/config.yaml @@ -0,0 +1,3 @@ +white_list: [] +error_data_path: './' +precision: 14 \ No newline at end of file diff --git a/debug/accuracy_tools/atat/test/resources/npu_test.pkl b/debug/accuracy_tools/atat/test/resources/npu_test.pkl new file mode 100644 index 0000000000000000000000000000000000000000..2e00b07b7c97e9cdb497bc63dd7eef8063388807 --- /dev/null +++ b/debug/accuracy_tools/atat/test/resources/npu_test.pkl @@ -0,0 +1,8 @@ +["Functional_linear_0_forward_input.0", 1, [], "torch.float32", [3, 2], [1.948258399963379, -1.0052297115325928, -0.2003595232963562]] +["Functional_linear_0_forward_input.1", 1, [], "torch.float32", [3, 2], [0.28375449776649475, -0.6661239266395569, -0.2789986729621887]] +["Functional_linear_0_forward_input.2", 1, [], "torch.float32", [3], [0.2457989901304245, -0.6338542103767395, -0.14437106251716614]] +["Functional_linear_0_forward_output", 1, [], "torch.float32", [3, 3], [0.8278868794441223, -0.8729169964790344, 0.16790540516376495]] +["Torch_relu_0_forward_input.0", 1, [], "torch.float32", [3, 3], [0.8278868794441223, -0.8729169964790344, 0.16790540516376495]] +["Torch_relu_0_forward_output", 1, [], "torch.float32", [3, 3], [0.8278868794441223, 0.0, 0.31367552280426025]] +["Functional_relu_0_forward_input.0", 1, [], "torch.float32", [3, 3], [0.8278868794441223, 
-0.8729169964790344, 0.16790540516376495]] +["Functional_relu_0_forward_output", 1, [], "torch.float32", [3, 3], [0.8278868794441223, 0.0, 0.31367552280426025]] diff --git a/debug/accuracy_tools/atat/test/run_ut.py b/debug/accuracy_tools/atat/test/run_ut.py index 7f51d266c24e74510113594e546882b27bf340d0..7c593c14abca82f39050276255316693a47c6fc9 100644 --- a/debug/accuracy_tools/atat/test/run_ut.py +++ b/debug/accuracy_tools/atat/test/run_ut.py @@ -3,31 +3,12 @@ import shutil import subprocess import sys -from atat.core.log import print_info_log, print_error_log - - -def get_ignore_dirs(cur_dir): - ignore_dirs = [] - try: - import torch - import torch_npu - except ImportError: - print_info_log(f"Skipping the {cur_dir}/pytorch_ut directory") - ignore_dirs.extend(["--ignore", f"{cur_dir}/pytorch_ut"]) - - try: - import mindspore - except ImportError: - print_info_log(f"Skipping the {cur_dir}/mindspore_ut directory") - ignore_dirs.extend(["--ignore", f"{cur_dir}/mindspore_ut"]) - - return ignore_dirs +from atat.core.common.log import logger def run_ut(): cur_dir = os.path.realpath(os.path.dirname(__file__)) ut_path = cur_dir - ignore_dirs = get_ignore_dirs(cur_dir) cov_dir = os.path.dirname(cur_dir) report_dir = os.path.join(cur_dir, "report") final_xml_path = os.path.join(report_dir, "final.xml") @@ -37,22 +18,37 @@ def run_ut(): shutil.rmtree(report_dir) os.makedirs(report_dir) - cmd = ["python3", "-m", "pytest", ut_path, "--junitxml=" + final_xml_path, "--cov=" + cov_dir, - "--cov-branch", "--cov-report=xml:" + cov_report_path] + ignore_dirs - result_ut = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - while result_ut.poll() is None: - line = result_ut.stdout.readline().strip() - if line: - print_info_log(str(line)) - - ut_flag = False - if result_ut.returncode == 0: - ut_flag = True - print_info_log("run ut successfully.") - else: - print_error_log("run ut failed.") + pytest_cmd = [ + "python3", "-m", "pytest", + ut_path, + f"--junitxml={final_xml_path}", + f"--cov={cov_dir}", + "--cov-branch", + f"--cov-report=xml:{cov_report_path}", + ] - return ut_flag + try: + with subprocess.Popen( + pytest_cmd, + shell=False, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + ) as proc: + for line in proc.stdout: + logger.info(line.strip()) + + proc.wait() + + if proc.returncode == 0: + logger.info("Unit tests executed successfully.") + return True + else: + logger.error("Unit tests execution failed.") + return False + except Exception as e: + logger.error(f"An error occurred during test execution: {e}") + return False if __name__ == "__main__": diff --git a/debug/accuracy_tools/grad_tool/README.md b/debug/accuracy_tools/grad_tool/README.md index ddb281242d748f7370a62e64246ed3d988303382..13a3930059897e7ddc665df20c15fddc61977d13 100644 --- a/debug/accuracy_tools/grad_tool/README.md +++ b/debug/accuracy_tools/grad_tool/README.md @@ -20,11 +20,9 @@ export PYTHONPATH=$PYTHONPATH:$MSTT_HOME/debug/accuracy_tools/ ``` -2. 安装依赖 +2. 
使用pip命令安装matplotlib、mindspore、numpy、pandas、PyYAML、torch、tqdm依赖。
 
-   ```bash
-   pip3 install pandas pyyaml tqdm matplotlib
-   ```
+   若环境中已安装部分依赖,不需要重复安装。
 
 ## 使用方式
 
diff --git a/debug/accuracy_tools/grad_tool/common/utils.py b/debug/accuracy_tools/grad_tool/common/utils.py
index b63e95578a13b54a7de5f9ca3886d10b72418e0a..cdce3fda7e38940c293849b63f139ce96c5d1d55 100644
--- a/debug/accuracy_tools/grad_tool/common/utils.py
+++ b/debug/accuracy_tools/grad_tool/common/utils.py
@@ -211,3 +211,12 @@ def create_directory(dir_path):
         os.makedirs(dir_path, mode=GradConst.DATA_DIR_AUTHORITY, exist_ok=True)
     except OSError as ex:
         raise RuntimeError("Failed to create directory. Please check the path permission or disk space.") from ex
+
+def change_mode(path, mode):
+    check_path_exists(path)
+    check_link(path)
+    try:
+        os.chmod(path, mode)
+    except PermissionError as ex:
+        print_error_log(f'Failed to change {path} authority. {str(ex)}')
+        raise ex
diff --git a/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py b/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py
index 19c5b32bf8bdb94373bf73784a9e608cc4fbfd83..6733d566d6aaadd2fbf97076bfaca84a3e05160d 100644
--- a/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py
+++ b/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py
@@ -4,19 +4,35 @@ from collections import defaultdict
 import torch
 from torch.optim.optimizer import register_optimizer_step_pre_hook
 from grad_tool.common.base_monitor import BaseMonitor
-from grad_tool.grad_pt.level_adapter import Level, LevelAdapter
 from grad_tool.grad_pt.grad_stat_csv import GradStatCsv
 from grad_tool.common.utils import check_numeral_list_ascend, data_in_list_target, \
-    write_csv, print_info_log, create_directory, print_warn_log
-from grad_tool.grad_pt.utils import get_rank_id, print_rank_0
+    write_csv, print_info_log, create_directory, print_warn_log, change_mode
+from grad_tool.grad_pt.utils import get_rank_id, print_rank_0, GradConst
 
 
 class PtGradientMonitor(BaseMonitor):
     default_bounds = [-10, -1, -0.1, -0.01, -0.001, 0, 0.001, 0.01, 0.1, 1, 10]
+    level_adp = {
+        "L0": {
+            "header": [GradConst.md5, GradConst.max, GradConst.min, GradConst.norm, GradConst.shape],
+            "have_grad_direction": False
+        },
+        "L1": {
+            "header": [GradConst.max, GradConst.min, GradConst.norm, GradConst.shape],
+            "have_grad_direction": True
+        },
+        "L2": {
+            "header": [GradConst.distribution, GradConst.max, GradConst.min, GradConst.norm, GradConst.shape],
+            "have_grad_direction": True
+        },
+    }
 
     def __init__(self, config_filepath):
         super(PtGradientMonitor, self).__init__(config_filepath)
-        self._level_adp: Level = LevelAdapter.level_adapter(self.config.get("level"))
+        level = self.config.get("level")
+        if level not in PtGradientMonitor.level_adp:
+            raise Exception(f"level is invalid, not in {PtGradientMonitor.level_adp.keys()}")
+        self._level_adp = PtGradientMonitor.level_adp[level]
         self._param_list = self.config.get('param_list')
         self._target_ranks = self.config.get("rank")
         print_info_log(f"target rank {self._target_ranks}")
@@ -38,6 +54,16 @@ class PtGradientMonitor(BaseMonitor):
     def output_path(self):
         return self._output_path
 
+    @staticmethod
+    def save_grad_direction(param_name, grad, save_path):
+        if not os.path.exists(save_path):
+            create_directory(save_path)
+        param_grad = grad.clone().detach()
+        is_positive = param_grad > 0
+        save_filepath = os.path.join(save_path, f"{param_name}.pt")
+        torch.save(is_positive, save_filepath)
+        change_mode(save_filepath, 0o640)
+
     def monitor(self, model):
         print_rank_0("> parameter names:")
         for name, 
param in model.named_parameters(): @@ -66,17 +92,14 @@ class PtGradientMonitor(BaseMonitor): if grad is None: print_info_log(f"grad is None: {param_name}") continue - grad_info = GradStatCsv.generate_csv_line( - level=self._level_adp, - param_name=param_name, - grad=grad, - bounds=self._bounds) + grad_info = GradStatCsv.generate_csv_line(param_name, self._level_adp, grad, self._bounds) output_lines.append(grad_info) - self._level_adp.save_grad_direction(param_name, grad, + if self._level_adp["have_grad_direction"]: + PtGradientMonitor.save_grad_direction(param_name, grad, f'{self._output_path}/rank_{self._rank}/step_{self._step}') output_path = os.path.join(self._output_path, f"rank_{getattr(self, '_rank')}", f"grad_summary_{self._step}.csv") write_csv(output_path, output_lines, - GradStatCsv.generate_csv_header(level=self._level_adp, bounds=self._bounds)) + GradStatCsv.generate_csv_header(self._level_adp, self._bounds)) register_optimizer_step_pre_hook(optimizer_pre_step_hook) diff --git a/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py b/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py index 442b763b47717a461e9dc1a147dca03cbabda488..48e9bff0aac458272a1b1f62c710b3649b89c25d 100644 --- a/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py +++ b/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py @@ -1,104 +1,127 @@ +from abc import ABC, abstractmethod +from collections import namedtuple import hashlib import torch -from grad_tool.grad_pt.level_adapter import Level +from grad_tool.grad_pt.utils import GradConst - -class GradExtremeOps: - @staticmethod - def tensor_max(tensor): - return torch._C._VariableFunctionsClass.max(tensor).cpu().detach().float().numpy().tolist() - - @staticmethod - def tensor_min(tensor): - return torch._C._VariableFunctionsClass.min(tensor).cpu().detach().float().numpy().tolist() - - @staticmethod - def tensor_norm(tensor): - return torch._C._VariableFunctionsClass.norm(tensor).cpu().detach().float().numpy().tolist() - - -class GradExtremes: - extremes = { - "max": GradExtremeOps.tensor_max, - "min": GradExtremeOps.tensor_min, - "norm": GradExtremeOps.tensor_norm - } - - -class GradStatOps: - @staticmethod - def md5_header(**kwargs): - level: Level = kwargs.get("level") - return level.MD5_header() - - @staticmethod - def intervals_header(**kwargs): - level: Level = kwargs.get("level") - bounds = kwargs.get("bounds") - return level.intervals_header(bounds) - - @staticmethod - def extremes_header(**kwargs): - return GradExtremes.extremes.keys() - - @staticmethod - def shape_header(**kwargs): - return ["shape"] - - @staticmethod - def md5_content(**kwargs): - grad = kwargs.get("grad") - level: Level = kwargs.get("level") - return level.MD5_content(grad) - - @staticmethod - def count_distribution(**kwargs): - level: Level = kwargs.get("level") - grad = kwargs.get("grad") - bounds = kwargs.get("bounds") - return level.count_grad_distribution(grad, bounds) - - @staticmethod - def extremes_content(**kwargs): - grad = kwargs.get("grad") - return [f(grad) for f in GradExtremes.extremes.values()] - - @staticmethod - def shape_content(**kwargs): - grad = kwargs.get("grad") - return [list(grad.shape)] +CSV_header_input = namedtuple("CSV_header_input", ["bounds"]) +CSV_content_input = namedtuple("CSV_content_input", ["grad", "bounds"]) class GradStatCsv: - CSV = { - "MD5": { - "header": GradStatOps.md5_header, - "content": GradStatOps.md5_content - }, - "distribution": { - "header": GradStatOps.intervals_header, - "content": GradStatOps.count_distribution 
-        },
-        "extremes": {
-            "header": GradStatOps.extremes_header,
-            "content": GradStatOps.extremes_content
-        },
-        "shape": {
-            "header": GradStatOps.shape_header,
-            "content": GradStatOps.shape_content
-        },
-    }
+    csv = {}
 
     @staticmethod
-    def generate_csv_header(**kwargs):
+    def generate_csv_header(level, bounds):
         header = ["param_name"]
-        for func in GradStatCsv.CSV.values():
-            header.extend(func["header"](**kwargs))
+        for key in level["header"]:
+            csv_header_input = CSV_header_input(bounds=bounds)
+            header.extend(GradStatCsv.csv[key].generate_csv_header(csv_header_input))
         return header
 
     @staticmethod
-    def generate_csv_line(**kwargs):
-        line = [kwargs.get("param_name")]
-        for func in GradStatCsv.CSV.values():
-            line.extend(func["content"](**kwargs))
+    def generate_csv_line(param_name, level, grad, bounds):
+        line = [param_name]
+        for key in level["header"]:
+            csv_content_input = CSV_content_input(grad=grad, bounds=bounds)
+            line.extend(GradStatCsv.csv[key].generate_csv_content(csv_content_input))
         return line
+
+
+def register_csv_item(key, cls=None):
+    if cls is None:
+        # called without a class: return a decorator that registers under this key
+        return lambda cls: register_csv_item(key, cls)
+    GradStatCsv.csv[key] = cls
+    return cls
+
+
+class CsvItem(ABC):
+    @abstractmethod
+    def generate_csv_header(csv_header_input):
+        pass
+
+    @abstractmethod
+    def generate_csv_content(csv_content_input):
+        pass
+
+
+@register_csv_item(GradConst.md5)
+class CSV_md5(CsvItem):
+    def generate_csv_header(csv_header_input):
+        return ["MD5"]
+
+    def generate_csv_content(csv_content_input):
+        grad = csv_content_input.grad
+        tensor_bytes = grad.cpu().detach().float().numpy().tobytes()
+        md5_hash = hashlib.md5(tensor_bytes)
+        return [md5_hash.hexdigest()]
+
+
+@register_csv_item(GradConst.distribution)
+class CSV_distribution(CsvItem):
+    def generate_csv_header(csv_header_input):
+        bounds = csv_header_input.bounds
+        intervals = []
+        for i, _ in enumerate(bounds):
+            if i == 0:
+                intervals.append(f"(-inf, {bounds[i]}]")
+            else:
+                intervals.append(f"({bounds[i-1]}, {bounds[i]}]")
+        intervals.extend([f"({bounds[-1]}, inf)", "=0"])
+        return intervals
+
+    def generate_csv_content(csv_content_input):
+        grad = csv_content_input.grad
+        bounds = csv_content_input.bounds
+        grad = grad.cpu().detach()
+        if grad.dtype == torch.bfloat16:
+            grad = grad.to(torch.float32)
+        element_num = grad.numel()
+        grad_equal_0_num = (grad == 0).sum().item()
+        bound = torch.Tensor(bounds)
+        bucketsize_result = torch.bucketize(grad, bound)
+        interval_nums = [(bucketsize_result == i).sum().item() for i in range(len(bound) + 1)]
+        interval_nums.append(grad_equal_0_num)
+        return_list = [x / element_num if element_num != 0 else 0 for x in interval_nums]
+        return return_list
+
+
+@register_csv_item(GradConst.max)
+class CSV_max(CsvItem):
+    def generate_csv_header(csv_header_input):
+        return ["max"]
+
+    def generate_csv_content(csv_content_input):
+        grad = csv_content_input.grad
+        return [torch.max(grad).cpu().detach().float().numpy().tolist()]
+
+
+@register_csv_item(GradConst.min)
+class CSV_min(CsvItem):
+    def generate_csv_header(csv_header_input):
+        return ["min"]
+
+    def generate_csv_content(csv_content_input):
+        grad = csv_content_input.grad
+        return [torch.min(grad).cpu().detach().float().numpy().tolist()]
+
+
+@register_csv_item(GradConst.norm)
+class CSV_norm(CsvItem):
+    def generate_csv_header(csv_header_input):
+        return ["norm"]
+
+    def generate_csv_content(csv_content_input):
+        grad = csv_content_input.grad
+        return [torch.norm(grad).cpu().detach().float().numpy().tolist()]
+
+
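+# Usage sketch (illustrative; the calls mirror test_grad_csv.py later in this
+# patch): a level entry from PtGradientMonitor.level_adp selects which
+# registered items contribute columns, e.g.
+#     GradStatCsv.generate_csv_header(PtGradientMonitor.level_adp["L2"], [-1, 0, 1])
+#     GradStatCsv.generate_csv_line("model.conv2d", PtGradientMonitor.level_adp["L2"], grad_tensor, [-1, 0, 1])
+
+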
+@register_csv_item(GradConst.shape) +class CSV_shape(CsvItem): + def generate_csv_header(csv_header_input): + return ["shape"] + + def generate_csv_content(csv_content_input): + grad = csv_content_input.grad + return [list(grad.shape)] \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/grad_pt/level_adapter.py b/debug/accuracy_tools/grad_tool/grad_pt/level_adapter.py deleted file mode 100644 index 520b0ce0cd8623cd7ca91956f80e97608a407b9b..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/grad_tool/grad_pt/level_adapter.py +++ /dev/null @@ -1,133 +0,0 @@ -import os -import hashlib -from abc import ABC, abstractmethod -import torch -from grad_tool.common.utils import print_info_log, create_directory - - -class LevelOps: - @staticmethod - def intervals_header(bounds): - intervals = [] - for i, _ in enumerate(bounds): - if i == 0: - intervals.append(f"(-inf, {bounds[i]}]") - else: - intervals.append(f"({bounds[i-1]}, {bounds[i]}]") - intervals.extend([f"({bounds[-1]}, inf)", "=0"]) - return intervals - - @staticmethod - def count_grad_distribution(grad, bounds): - grad = grad.cpu().detach() - if grad.dtype == torch.bfloat16: - grad = grad.to(torch.float32) - element_num = grad.numel() - grad_equal_0_num = (grad == 0).sum().item() - bound = torch.Tensor(bounds) - bucketsize_result = torch.bucketize(grad, bound) - interval_nums = [(bucketsize_result == i).sum().item() for i in range(len(bound) + 1)] - interval_nums.append(grad_equal_0_num) - return_list = [x / element_num if element_num != 0 else 0 for x in interval_nums] - return return_list - - @staticmethod - def save_grad_direction(param_name, grad, save_path): - if not os.path.exists(save_path): - create_directory(save_path) - param_grad = grad.clone().detach() - is_positive = param_grad > 0 - torch.save(is_positive, f'{save_path}/{param_name}.pt') - - @staticmethod - def MD5_content(grad): - tensor_bytes = grad.cpu().detach().float().numpy().tobytes() - md5_hash = hashlib.md5(tensor_bytes) - return [md5_hash.hexdigest()] - - @staticmethod - def MD5_header(): - return ["MD5"] - - -class Level(ABC): - @abstractmethod - def save_grad_direction(self, param_name, grad, save_path): - pass - - @abstractmethod - def count_grad_distribution(self, grad, bounds) -> list: - pass - - @abstractmethod - def intervals_header(self, bounds) -> list: - pass - - @abstractmethod - def MD5_content(self, grad) -> list: - pass - - @abstractmethod - def MD5_header(self) -> list: - pass - - -class Level_0(Level): - def save_grad_direction(self, param_name, grad, save_path): - pass - - def count_grad_distribution(self, grad, bounds): - return [] - - def intervals_header(self, bounds): - return [] - - def MD5_content(self, grad): - return LevelOps.MD5_content(grad) - - def MD5_header(self): - return LevelOps.MD5_header() - - -class Level_1(Level): - def save_grad_direction(self, param_name, grad, save_path): - LevelOps.save_grad_direction(param_name, grad, save_path) - - def count_grad_distribution(self, grad, bounds): - return [] - - def intervals_header(self, bounds): - return [] - - def MD5_content(self, grad): - return [] - - def MD5_header(self): - return [] - - -class Level_2(Level): - def save_grad_direction(self, param_name, grad, save_path): - LevelOps.save_grad_direction(param_name, grad, save_path) - - def count_grad_distribution(self, grad, bounds): - return LevelOps.count_grad_distribution(grad, bounds) - - def intervals_header(self, bounds): - return LevelOps.intervals_header(bounds) - - def MD5_content(self, 
grad): - return [] - - def MD5_header(self): - return [] - - -class LevelAdapter: - levels = {"L0": Level_0, "L1": Level_1, "L2": Level_2} - - @staticmethod - def level_adapter(level): - if level not in LevelAdapter.levels: - raise Exception(f"level is valid, not in {LevelAdapter.levels.keys()}") - return LevelAdapter.levels[level]() diff --git a/debug/accuracy_tools/grad_tool/grad_pt/utils.py b/debug/accuracy_tools/grad_tool/grad_pt/utils.py index cbccab7b480c6d38fea7c22b6779913f47e91a22..bfa7c158f1399657dc3f601e69ecc3cc8d725f5a 100644 --- a/debug/accuracy_tools/grad_tool/grad_pt/utils.py +++ b/debug/accuracy_tools/grad_tool/grad_pt/utils.py @@ -15,3 +15,11 @@ def print_rank_0(message): print(message) else: print(message) + +class GradConst: + md5 = "MD5" + distribution = "distribution" + shape = "shape" + max = "max" + min = "min" + norm = "norm" diff --git a/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py b/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py index 48a23c887e02c14d021d4191cb2fdae2bf9dbc33..4669da1c4d199c3007f36ae2b0f57802a1577be2 100644 --- a/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py +++ b/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py @@ -3,7 +3,7 @@ import unittest import os import torch from grad_tool.grad_pt.grad_stat_csv import GradStatCsv -from grad_tool.grad_pt.level_adapter import LevelAdapter +from grad_tool.grad_pt.grad_monitor import PtGradientMonitor grad_tensor = torch.tensor([[-2, 2], [0.2, 0.3]]) @@ -12,39 +12,27 @@ grad_tensor = torch.tensor([[-2, 2], [0.2, 0.3]]) class TestGradCSV(unittest.TestCase): def test_level_L0_header(self): self.assertEqual(['param_name', 'MD5', 'max', 'min', 'norm', 'shape'], - GradStatCsv.generate_csv_header(level=LevelAdapter.level_adapter("L0"), bounds=[-1, 0, 1])) + GradStatCsv.generate_csv_header(PtGradientMonitor.level_adp["L0"], [-1, 0, 1])) def test_level_L1_header(self): self.assertEqual(['param_name', 'max', 'min', 'norm', 'shape'], - GradStatCsv.generate_csv_header(level=LevelAdapter.level_adapter("L1"), bounds=[-1, 0, 1])) + GradStatCsv.generate_csv_header(PtGradientMonitor.level_adp["L1"], [-1, 0, 1])) def test_level_L2_header(self): self.assertEqual(['param_name', '(-inf, -1]', '(-1, 0]', '(0, 1]', '(1, inf)', '=0', 'max', 'min', 'norm', 'shape'], - GradStatCsv.generate_csv_header(level=LevelAdapter.level_adapter("L2"), bounds=[-1, 0, 1])) + GradStatCsv.generate_csv_header(PtGradientMonitor.level_adp["L2"], [-1, 0, 1])) def test_level_L0_content(self): - generated_csv_line = GradStatCsv.generate_csv_line( - level=LevelAdapter.level_adapter("L0"), - param_name="model.conv2d", - grad=grad_tensor, - bounds=[-1, 0, 1]) + generated_csv_line = GradStatCsv.generate_csv_line("model.conv2d", PtGradientMonitor.level_adp["L0"], grad_tensor, [-1, 0, 1]) self.assertEqual(['model.conv2d', '678a6c7d9d9716682b56fda097d0936c', 2.0, -2.0, 2.851315498352051, [2, 2]], generated_csv_line) def test_level_L1_content(self): - generated_csv_line = GradStatCsv.generate_csv_line( - level=LevelAdapter.level_adapter("L1"), - param_name="model.conv2d", - grad=grad_tensor, - bounds=[-1, 0, 1]) + generated_csv_line = GradStatCsv.generate_csv_line("model.conv2d", PtGradientMonitor.level_adp["L1"], grad_tensor, [-1, 0, 1]) self.assertEqual(['model.conv2d', 2.0, -2.0, 2.851315498352051, [2, 2]], generated_csv_line) def test_level_L2_content(self): - generated_csv_line = GradStatCsv.generate_csv_line( - level=LevelAdapter.level_adapter("L2"), - param_name="model.conv2d", - grad=grad_tensor, - bounds=[-1, 0, 1]) + 
generated_csv_line = GradStatCsv.generate_csv_line("model.conv2d", PtGradientMonitor.level_adp["L2"], grad_tensor, [-1, 0, 1]) self.assertEqual(['model.conv2d', 0.25, 0.0, 0.5, 0.25, 0.0, 2.0, -2.0, 2.851315498352051, [2, 2]], generated_csv_line) diff --git a/debug/accuracy_tools/kj600/kj600/anomaly_detect.py b/debug/accuracy_tools/kj600/kj600/anomaly_detect.py index 5365f6c54cf6c2445433e0c244af5f1f9cc19da7..cbd7b6daa2f0d9b0a9b28016993e836ee07df72d 100644 --- a/debug/accuracy_tools/kj600/kj600/anomaly_detect.py +++ b/debug/accuracy_tools/kj600/kj600/anomaly_detect.py @@ -4,10 +4,11 @@ from typing import List import sys from torch.utils.tensorboard import SummaryWriter from collections import defaultdict +from kj600.utils import print_info_log class ScanRule(ABC): def apply(self, history, cur): - raise NotImplemented("abstract method apply is not implemented") + raise NotImplementedError("abstract method apply is not implemented") class AnomalyTurbulence(ScanRule): name = "AnomalyTurbulence" @@ -66,9 +67,6 @@ class SummaryWriterWithAD(SummaryWriter): self.job_id = job_id self.anomaly_inform = anomaly_inform - def _ad(self, scalar_value, history): - return AnomalyScanner.scan(self.ad_rules, history, cur=scalar_value) - def add_scalar(self, tag, scalar_value, global_step=None, walltime=None, new_style=False, double_precision=False): new_avg = avg = scalar_value if tag in self.tag2scalars: @@ -78,8 +76,11 @@ class SummaryWriterWithAD(SummaryWriter): self.tag2scalars[tag].append((scalar_value, new_avg)) detected, rule_name = self._ad(scalar_value, history=avg) if detected: - print(f"{bcolors.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step {global_step}.{bcolors.ENDC}") + print_info_log(f"{bcolors.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step {global_step}.{bcolors.ENDC}") exception_message = f"{bcolors.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step {global_step}.{bcolors.ENDC}" if self.anomaly_inform: self.anomaly_inform.run(exception_message, self.job_id) return super().add_scalar(tag, scalar_value, global_step, walltime, new_style, double_precision) + + def _ad(self, scalar_value, history): + return AnomalyScanner.scan(self.ad_rules, history, cur=scalar_value) diff --git a/debug/accuracy_tools/kj600/kj600/module_hook.py b/debug/accuracy_tools/kj600/kj600/module_hook.py index 363ae8959a65899be06b5d63cf76ac99f96c8a1e..8043c5671c42675dc55944e4818bce8e9137b455 100644 --- a/debug/accuracy_tools/kj600/kj600/module_hook.py +++ b/debug/accuracy_tools/kj600/kj600/module_hook.py @@ -145,8 +145,10 @@ class TrainerMon: self.mix_precision_optimizer_mon = OptimizerMonFactory.create_optimizer_mon(opt_ty) if opt_ty is None: - assert not self.ur_distribution, "ur_distribution cannot be enabled with unknown optimizer." - assert not self.mv_distribution, "mv_distribution cannot be enabled with unknown optimizer." 
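+            # ur_distribution / mv_distribution are derived from optimizer state
+            # (first and second moments), which an unknown optimizer type cannot supply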
+ if self.ur_distribution: + raise Exception("ur_distribution cannot be enabled with unknown optimizer.") + if self.mv_distribution: + raise Exception("mv_distribution cannot be enabled with unknown optimizer.") self.print_struct = self.config.get("print_struct", False) self.struct_printed = False self.module_struct = {} @@ -169,110 +171,6 @@ class TrainerMon: return TrainerMon.tensor_metrics.stat_insert(target_tensor, ops_list, module_name, tensor_name, rank) - def _smallest_rank_print(self, msg): - if dist.is_initialized(): - if self.module_rank_list: - if dist.get_rank() == min(self.module_rank_list): - print_info_log(msg) - else: - if dist.get_rank() == 0: - print_info_log(msg) - else: - print_info_log(msg) - - def _hook_module(self, target_names, module: torch.nn.Module, fwd_or_bkd): - if '_modules' not in module.__dict__: - # nothing to hook - return 0 - - def fwd_hook_fun(module, module_input, module_output): - context: ModuleHookContext = self.module_fwd_hook_context_by_module[module] - if self.print_struct: - self.module_struct[context.module_name].update( - {"input": f"{get_param_struct(module_input)}", "output": f"{get_param_struct(module_output)}"}) - return - if not self.xy_distribution: - return - if not context.format_by_arg: - context.set_format_by_arg('input', self.config['targets']) - context.set_format_by_arg('output', self.config['targets']) - if not context.verified: - if not context.ignore_in: - context.focused_in_col = validate_config_spec(context.format_by_arg['input'], module_input, context.module_name, 'input') - context.focused_out_col = validate_config_spec(context.format_by_arg['output'], module_output, context.module_name, 'output') - context.verified = True - # expect output be tensor type - tbtag_tensor_map = {} - if not context.ignore_in: - cared_input = module_input if context.focused_in_col is None else module_input[context.focused_in_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input', cared_input)) - cared_output = module_output if context.focused_out_col is None else module_output[context.focused_out_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output', cared_output)) - metric_dict = {} - for metric_name in self.ops: - metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) - if context.micro_step == 0 and context.actv: - print_warn_log( - f"actv context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. 
Now clear it.") - context.actv.clear() - context.actv.append(metric_dict) - - context.micro_step += 1 - if context.micro_step == self.micro_batch_number: - context.micro_step = 0 - context.step += 1 - return - - def bwd_hook_fun(module, input_grad, output_grad): - context: ModuleHookContext = self.module_bwd_hook_context_by_module[module] - if self.print_struct: - self.module_struct[context.module_name].update( - {"input_grad": f"{get_param_struct(input_grad)}", "output_grad": f"{get_param_struct(output_grad)}"}) - return - if not self.xy_distribution: - return - if not context.format_by_arg: - context.set_format_by_arg('input_grad', self.config['targets']) - context.set_format_by_arg('output_grad', self.config['targets']) - if not context.verified: - if not context.ignore_in: - context.focused_in_col = validate_config_spec(context.format_by_arg['input_grad'], input_grad, context.module_name, 'input_grad') - context.focused_out_col = validate_config_spec(context.format_by_arg['output_grad'], output_grad, context.module_name, 'output_grad') - context.verified = True - - tbtag_tensor_map = {} - if not context.ignore_in: - cared_input_grad = input_grad if context.focused_in_col is None else input_grad[context.focused_in_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input_grad', cared_input_grad)) - cared_output_grad = output_grad if context.focused_out_col is None else output_grad[context.focused_out_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output_grad', cared_output_grad)) - metric_dict = {} - for metric_name in self.ops: - metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) - if context.micro_step == 0 and context.actvgrad: - print_warn_log(f"actvgrad context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. Now clear it.") - context.actvgrad.clear() - context.actvgrad.append(metric_dict) - - context.micro_step += 1 - if context.micro_step == self.micro_batch_number: - context.micro_step = 0 - context.step += 1 - return - - hooked_count = 0 - for name, submodule in module.named_modules(): - self.module_struct[name] = {} - if name in target_names: - submodule.register_forward_hook(fwd_hook_fun) - self.module_fwd_hook_context_by_module[submodule] = ModuleHookContext(name) - if not self.forward_only: - submodule.register_full_backward_hook(bwd_hook_fun) - self.module_bwd_hook_context_by_module[submodule] = ModuleHookContext(name) - print_rank_0(f"> {name} is monitored successfully") - hooked_count += 1 - return hooked_count - def hook_modules(self, model:torch.nn.Module, grad_acc_steps): # fwd=0, bkd=1 # targets is module name list like ["xx.xxx1", "xxx.xxx2"] which can be obtained when first run. 
@@ -430,3 +328,107 @@ class TrainerMon: register_optimizer_step_pre_hook(optimizer_pre_step_hook) register_optimizer_step_post_hook(optimizer_post_step_hook) return + + def _smallest_rank_print(self, msg): + if dist.is_initialized(): + if self.module_rank_list: + if dist.get_rank() == min(self.module_rank_list): + print_info_log(msg) + else: + if dist.get_rank() == 0: + print_info_log(msg) + else: + print_info_log(msg) + + def _hook_module(self, target_names, module: torch.nn.Module, fwd_or_bkd): + if '_modules' not in module.__dict__: + # nothing to hook + return 0 + + def fwd_hook_fun(module, module_input, module_output): + context: ModuleHookContext = self.module_fwd_hook_context_by_module[module] + if self.print_struct: + self.module_struct[context.module_name].update( + {"input": f"{get_param_struct(module_input)}", "output": f"{get_param_struct(module_output)}"}) + return + if not self.xy_distribution: + return + if not context.format_by_arg: + context.set_format_by_arg('input', self.config['targets']) + context.set_format_by_arg('output', self.config['targets']) + if not context.verified: + if not context.ignore_in: + context.focused_in_col = validate_config_spec(context.format_by_arg['input'], module_input, context.module_name, 'input') + context.focused_out_col = validate_config_spec(context.format_by_arg['output'], module_output, context.module_name, 'output') + context.verified = True + # expect output be tensor type + tbtag_tensor_map = {} + if not context.ignore_in: + cared_input = module_input if context.focused_in_col is None else module_input[context.focused_in_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input', cared_input)) + cared_output = module_output if context.focused_out_col is None else module_output[context.focused_out_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output', cared_output)) + metric_dict = {} + for metric_name in self.ops: + metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) + if context.micro_step == 0 and context.actv: + print_warn_log( + f"actv context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. 
Now clear it.") + context.actv.clear() + context.actv.append(metric_dict) + + context.micro_step += 1 + if context.micro_step == self.micro_batch_number: + context.micro_step = 0 + context.step += 1 + return + + def bwd_hook_fun(module, input_grad, output_grad): + context: ModuleHookContext = self.module_bwd_hook_context_by_module[module] + if self.print_struct: + self.module_struct[context.module_name].update( + {"input_grad": f"{get_param_struct(input_grad)}", "output_grad": f"{get_param_struct(output_grad)}"}) + return + if not self.xy_distribution: + return + if not context.format_by_arg: + context.set_format_by_arg('input_grad', self.config['targets']) + context.set_format_by_arg('output_grad', self.config['targets']) + if not context.verified: + if not context.ignore_in: + context.focused_in_col = validate_config_spec(context.format_by_arg['input_grad'], input_grad, context.module_name, 'input_grad') + context.focused_out_col = validate_config_spec(context.format_by_arg['output_grad'], output_grad, context.module_name, 'output_grad') + context.verified = True + + tbtag_tensor_map = {} + if not context.ignore_in: + cared_input_grad = input_grad if context.focused_in_col is None else input_grad[context.focused_in_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input_grad', cared_input_grad)) + cared_output_grad = output_grad if context.focused_out_col is None else output_grad[context.focused_out_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output_grad', cared_output_grad)) + metric_dict = {} + for metric_name in self.ops: + metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) + if context.micro_step == 0 and context.actvgrad: + print_warn_log(f"actvgrad context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. 
Now clear it.") + context.actvgrad.clear() + context.actvgrad.append(metric_dict) + + context.micro_step += 1 + if context.micro_step == self.micro_batch_number: + context.micro_step = 0 + context.step += 1 + return + + hooked_count = 0 + for name, submodule in module.named_modules(): + self.module_struct[name] = {} + if name in target_names: + submodule.register_forward_hook(fwd_hook_fun) + self.module_fwd_hook_context_by_module[submodule] = ModuleHookContext(name) + if not self.forward_only: + submodule.register_full_backward_hook(bwd_hook_fun) + self.module_bwd_hook_context_by_module[submodule] = ModuleHookContext(name) + print_rank_0(f"> {name} is monitored successfully") + hooked_count += 1 + return hooked_count diff --git a/debug/accuracy_tools/kj600/kj600/module_metric.py b/debug/accuracy_tools/kj600/kj600/module_metric.py index b6a192eaaee5a6d790c72d2f7bc41db88b3f0e1f..e09536b072cf7953e6b6106420936416d4264d0e 100644 --- a/debug/accuracy_tools/kj600/kj600/module_metric.py +++ b/debug/accuracy_tools/kj600/kj600/module_metric.py @@ -52,17 +52,16 @@ class Metric(object): def get_metric_value(tensor, eps): pass + @staticmethod + def metric_tensorboard(metric_name, summary_writer, metric_value, step): + pass + def get_metrics(self, tag2tensor: dict, eps): metrics_dict = {} for tag, tensor in tag2tensor.items(): metrics_dict[tag] = self.get_metric_value(tensor, eps) return metrics_dict - @staticmethod - def metric_tensorboard(metric_name, summary_writer, metric_value, step): - pass - - @register_config_metric("min") class MinMetric(Metric): @staticmethod @@ -148,7 +147,7 @@ def get_metrics(metric_name, tag2tensor, eps): fun_metric = config_metric_registry[metric_name] return fun_metric().get_metrics(tag2tensor, eps) except KeyError as e: - raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") + raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") from e def write_metrics_tensorboard(metric_name, summary_writer, metric_value, step): @@ -156,4 +155,4 @@ def write_metrics_tensorboard(metric_name, summary_writer, metric_value, step): fun_metric = config_metric_registry[metric_name] return fun_metric.metric_tensorboard(metric_name, summary_writer, metric_value, step) except KeyError as e: - raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") + raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") from e diff --git a/debug/accuracy_tools/kj600/kj600/optimizer_collect.py b/debug/accuracy_tools/kj600/kj600/optimizer_collect.py index dfb473ca074f135d809e32c85a8fc9b4047da4d3..285f17ca6dc6a00814b0847c7d203524d8a8caa6 100644 --- a/debug/accuracy_tools/kj600/kj600/optimizer_collect.py +++ b/debug/accuracy_tools/kj600/kj600/optimizer_collect.py @@ -4,7 +4,6 @@ import torch.distributed as dist from kj600.visualizer import HeatmapVisualizer - def print_rank_0(message, debug=False, force=False): if dist.is_initialized(): if dist.get_rank() == 0: @@ -16,12 +15,23 @@ def print_rank_0(message, debug=False, force=False): class MixPrecsionOptimizerMon: wrapped_optimizer = None + def __init__(self) -> None: + self.fp16_to_fp32_param = {} + @staticmethod def set_wrapped_optimizer(_wrapped_optimizer): MixPrecsionOptimizerMon.wrapped_optimizer = _wrapped_optimizer - def __init__(self) -> None: - self.fp16_to_fp32_param = {} 
+ # parameter tensors we want to monitor and their names are in params2name_dict + # base_optimizer is pytorch optimizer, wrapped_optimizer is a normal object with base_optimizer + def fetch_mv(self, monitor, torch_opt, params2name): + mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer + + if not self.fp16_to_fp32_param and mix_prec_opt is not None: + for fp16_group, fp32_group in zip(mix_prec_opt.float16_groups, mix_prec_opt.fp32_from_float16_groups): + for fp16_param, fp32_param in zip(fp16_group, fp32_group): + self.fp16_to_fp32_param[fp16_param] = fp32_param + return self._fetch_mv_in_adam(params2name, torch_opt, monitor) def _fetch_mv_in_adam(self, params2name, torch_opt, monitor): exp_avg_dict = defaultdict(float) @@ -48,22 +58,13 @@ class MixPrecsionOptimizerMon: monitor.ratio_heatmap_visualizer[name].pre_cal(ratio_dict[name]) return exp_avg_dict, exp_avg_sq_dict, update_dict, ratio_dict - # parameter tensors we want to monitor and their names are in params2name_dict - # base_optimizer is pytorch optimizer, wrapped_optimizer is a normal object with base_optimizer - def fetch_mv(self, monitor, torch_opt, params2name): - mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer - - if not self.fp16_to_fp32_param and mix_prec_opt is not None: - for fp16_group, fp32_group in zip(mix_prec_opt.float16_groups, mix_prec_opt.fp32_from_float16_groups): - for fp16_param, fp32_param in zip(fp16_group, fp32_group): - self.fp16_to_fp32_param[fp16_param] = fp32_param - return self._fetch_mv_in_adam(params2name, torch_opt, monitor) class MegatronDistributedOptimizerMon(MixPrecsionOptimizerMon): def fetch_mv(self, monitor, torch_opt, params2name): mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer - assert hasattr(mix_prec_opt, "model_float16_groups") and hasattr(mix_prec_opt, "shard_fp32_from_float16_groups"), \ - "megatron distributed optimizer should have model_float16_groups and shard_fp32_from_float16_groups, if not, please check megatron-lm version" + if not (hasattr(mix_prec_opt, "model_float16_groups") and hasattr(mix_prec_opt, "shard_fp32_from_float16_groups")): + raise Exception("megatron distributed optimizer should have model_float16_groups and shard_fp32_from_float16_groups, \ + if not, please check megatron-lm version") if not self.fp16_to_fp32_param and mix_prec_opt is not None: for fp16_group, shard_fp32_group in zip(mix_prec_opt.model_float16_groups, mix_prec_opt.shard_fp32_from_float16_groups): for fp16_param, shard_fp32_param in zip(fp16_group, shard_fp32_group): @@ -71,10 +72,12 @@ class MegatronDistributedOptimizerMon(MixPrecsionOptimizerMon): return self._fetch_mv_in_adam(params2name, torch_opt, monitor) + class DummyOptimizerMon(MixPrecsionOptimizerMon): def fetch_mv(self, monitor, torch_opt, params2name): return None, None, None, None + class OptimizerMonFactory: @staticmethod def create_optimizer_mon(opt_ty:str): @@ -82,6 +85,6 @@ class OptimizerMonFactory: return MixPrecsionOptimizerMon() if opt_ty == "Megatron_DistributedOptimizer": return MegatronDistributedOptimizerMon() - if opt_ty == None or opt_ty == "unknown": + if opt_ty is None or opt_ty == "unknown": return DummyOptimizerMon() - assert opt_ty != None, "opt_ty should be Megatron_Float16OptimizerWithFloat16Params or Megatron_DistributedOptimizer or None or unknown" \ No newline at end of file + raise Exception("opt_ty should be Megatron_Float16OptimizerWithFloat16Params or Megatron_DistributedOptimizer or None or unknown") diff --git a/debug/accuracy_tools/ptdbg_ascend/README.md 
b/debug/accuracy_tools/ptdbg_ascend/README.md index 06a612a3592496ace638c5d544e3c8c94991aabe..dce5406bfa5de2b8baf21d94e69c5ab2fd63d534 100644 --- a/debug/accuracy_tools/ptdbg_ascend/README.md +++ b/debug/accuracy_tools/ptdbg_ascend/README.md @@ -8,7 +8,11 @@ 进行PyTorch精度比对需要将ptdbg_ascend精度工具分别安装在CPU或GPU环境以及NPU环境下。 -1. whl包获取。 +1. 使用pip命令安装wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch依赖。 + + 若环境中已安装部分依赖,不需要重复安装。 + +2. whl包获取。 请通过下表链接下载ptdbg_ascend精度工具whl包,推荐下载最新版本。 @@ -20,8 +24,8 @@ | 3.0 | 2023-10-16 | 1.8.1/1.11.0/2.0/2.1 | [ptdbg_ascend-3.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/3.0/ptdbg_ascend-3.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v3.0](doc/ptdbg_ascend精度工具功能说明_v3.0.md) | eb177ec795f8ae8b0c937a3cf543914f535bb64c76ba2e520fc6f0456ff6740b | | 2.0 | 2023-7-07 | 1.8.1/1.11.0/2.0 | [ptdbg_ascend-2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/2.0/ptdbg_ascend-2.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v2.0](doc/ptdbg_ascend精度工具功能说明_v2.0.md) | 85e046f133f0f40ed660337ce8207249b1dac47ac668910625bea49809f31d66 | | 1.0 | 2023-3-30 | 1.8.1/1.11.0 | [ptdbg_ascend-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/1.0/ptdbg_ascend-1.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v1.0](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend%E7%B2%BE%E5%BA%A6%E5%B7%A5%E5%85%B7%E5%8A%9F%E8%83%BD%E8%AF%B4%E6%98%8E_v1.0.md) | 0559e12ba7accf80d182f227698163ee0de88bf86b1e9cd9f33b16fdead14759 | - -2. whl包校验。 + +3. whl包校验。 1. 根据以上下载链接下载whl包到Linux安装环境。 @@ -40,7 +44,7 @@ ba7ff7a1acffb1a2fab02fea76b6f957b2868bc6b66d72365622f6a8950406c6 *ptdbg_ascend-4.0-py3-none-any.whl ``` -3. whl包安装。 +4. whl包安装。 执行如下命令进行安装。 @@ -104,7 +108,7 @@ ptdbg_ascend为PyTorch精度工具,用来进行PyTorch整网API粒度的数据 ### 环境准备 -- 通过pip安装环境依赖wheel、numpy、pandas(1.3.5及以上版本)和pyyaml。 +- 通过pip安装环境依赖wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch、torch_npu。 - ptdbg_ascend与PyTorch有严格的版本配套关系,使用工具前,您需要确保已经正确安装了PyTorch v1.11.0、PyTorch v2.0.0或PyTorch v2.1.0版本: - CPU或GPU环境:请至[PyTorch官网](https://www.pytorch.org)下载并安装。 - NPU环境:请参见《[CANN软件安装指南](https://www.hiascend.com/document/detail/zh/canncommercial/63RC1/envdeployment/instg/instg_000002.html)》“安装开发环境 > 在昇腾设备上安装 > 安装深度学习框架 > 安装PyTorch”章节进行安装。 @@ -117,7 +121,11 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 #### 下载whl包安装 -1. whl包获取。 +1. 使用pip命令安装wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch依赖。 + + 若环境中已安装部分依赖,不需要重复安装。 + +2. 
whl包获取。 请通过下表链接下载ptdbg_ascend精度工具whl包,推荐下载最新版本。 @@ -129,8 +137,8 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 | 3.0 | 2023-10-16 | 1.8.1/1.11.0/2.0/2.1 | [ptdbg_ascend-3.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/3.0/ptdbg_ascend-3.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v3.0](doc/ptdbg_ascend精度工具功能说明_v3.0.md) | eb177ec795f8ae8b0c937a3cf543914f535bb64c76ba2e520fc6f0456ff6740b | | 2.0 | 2023-7-07 | 1.8.1/1.11.0/2.0 | [ptdbg_ascend-2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/2.0/ptdbg_ascend-2.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v2.0](doc/ptdbg_ascend精度工具功能说明_v2.0.md) | 85e046f133f0f40ed660337ce8207249b1dac47ac668910625bea49809f31d66 | | 1.0 | 2023-3-30 | 1.8.1/1.11.0 | [ptdbg_ascend-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/1.0/ptdbg_ascend-1.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v1.0](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend精度工具功能说明_v1.0.md) | 0559e12ba7accf80d182f227698163ee0de88bf86b1e9cd9f33b16fdead14759 | - -2. whl包校验。 + +3. whl包校验。 1. 根据以上下载链接下载whl包到Linux安装环境。 @@ -149,7 +157,7 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 ba7ff7a1acffb1a2fab02fea76b6f957b2868bc6b66d72365622f6a8950406c6 *ptdbg_ascend-4.0-py3-none-any.whl ``` -3. whl包安装。 +4. whl包安装。 执行如下命令进行安装。 @@ -171,13 +179,9 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 #### 源代码编译安装 -1. 安装依赖。 +1. 使用pip命令安装wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch依赖。 - 编译前需要安装wheel。 - - ```bash - pip3 install wheel - ``` + 若环境中已安装部分依赖,不需要重复安装。 2. 下载源码。 @@ -237,7 +241,7 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 6. 安装。 执行如下命令进行ptdbg_ascend安装。 - + ```bash pip3 install ./ptdbg_ascend/dist/ptdbg_ascend-{version}-py3-none-any.whl --upgrade --force-reinstall ``` diff --git "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" index 021a69e060881acc26d360cbc5534a76e84d1cde..09d608b676d7a02a59abbbdede4bda413b1152bd 100644 --- "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" +++ "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" @@ -2032,14 +2032,14 @@ PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度 - 红色可能出现的情况有: - NPU max或NPU min信息中存在nan/inf - Max diff存在大于1e+10的值 - - 统计数据中Max diff除以Bench max > 0.5 + - 统计数据中output的Max diff除以max(0.01, Bench max) > 0.5 - 真实数据中One Thousandth Err Ratio的input > 0.9同时output < 0.6 - 黄色可能出现的情况有: - - Max diff的输出与输入存在数量级差 - - 统计数据Max diff除以Bench max的input > 0.1同时input < 0.01 - - 真实数据One Thousandth Err Ratio的output - input > 0.1 - - 真实数据Cosine的output - input > 0.1 + - Max diff的input与output都大于1,同时output比input大一个数量级以上 + - 统计数据Max diff除以max(0.01, Bench max)的output > 0.1同时input < 0.01 + - 真实数据One Thousandth Err Ratio的input - output > 0.1 + - 真实数据Cosine的input - output > 0.1 ## ptdbg_ascend.parse数据解析功能 diff --git a/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py b/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py index be6b480657b88d4f7852e236c4215d62c5995ef5..c5ee7ff9d50dfaefbe94e2ccaf8b281433570c92 100644 --- 
a/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py
+++ b/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py
@@ -355,4 +355,17 @@ class TestUtilsMethods(unittest.TestCase):
         result_df = pd.DataFrame(result)
         highlight_dict = {'red_rows': [], 'yellow_rows': []}
         compare.find_compare_result_error_rows(result_df, highlight_dict)
-        self.assertEqual(highlight_dict, {'red_rows': [num_1, num_3], 'yellow_rows': [num_2]})
\ No newline at end of file
+        self.assertEqual(highlight_dict, {'red_rows': [num_1, num_3], 'yellow_rows': [num_2]})
+
+    def test_rename_api(self):
+        test_name_1 = "Distributed.broadcast.0.forward.input.0"
+        expect_name_1 = "Distributed.broadcast.input.0"
+        actual_name_1 = compare.rename_api(test_name_1, "forward")
+        self.assertEqual(actual_name_1, expect_name_1)
+
+        test_name_2 = "Torch.sum.0.backward.output.0"
+        expect_name_2 = "Torch.sum.output.0"
+        actual_name_2 = compare.rename_api(test_name_2, "backward")
+        self.assertEqual(actual_name_2, expect_name_2)
+
+
\ No newline at end of file
diff --git a/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py b/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py
index 4c91a6928c02c4a1e9eec5b21a4dc43d65ee631b..7a74eeb369c47a6d6ad7752bd75dd481039af2a2 100644
--- a/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py
+++ b/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py
@@ -65,6 +65,14 @@ class TestCommonUtilsMethods(unittest.TestCase):
         dump_path = "/usr/dump"
         mode = "api_stack"
         self.assertEqual(common.modify_dump_path(dump_path, mode), "/usr/api_stack_dump")
+
+    def test_check_inplace_op(self):
+        test_prefix_1 = "Distributed.broadcast.0.forward.input.0"
+        self.assertTrue(common.check_inplace_op(test_prefix_1))
+        test_prefix_2 = "Distributed_broadcast_0_forward_input_0"
+        self.assertFalse(common.check_inplace_op(test_prefix_2))
+        test_prefix_3 = "Torch.sum.0.backward.output.0"
+        self.assertFalse(common.check_inplace_op(test_prefix_3))
 
     def test_create_directory(self):
         pass
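The two tests above pin down the behavior of the naming helpers. A sketch consistent with those expectations is shown below; the real rename_api and check_inplace_op live in ptdbg_ascend's compare and common modules and may be implemented differently:

```python
import re

def rename_api(npu_name: str, process: str) -> str:
    # "Torch.sum.0.backward.output.0" -> "Torch.sum.output.0"
    head, tail = npu_name.split(process)
    api = head.rsplit(".", 2)[0]  # drop the trailing call index
    return api + tail

def check_inplace_op(prefix: str) -> bool:
    # True only for dot-separated Distributed dump prefixes such as
    # "Distributed.broadcast.0.forward.input.0".
    pattern = r"Distributed\.\w+\.\d+\.(forward|backward)\.(input|output)\.\d+"
    return bool(re.fullmatch(pattern, prefix))
```

diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py
index f9c3e88c388718994d9741f01a003f4ecc4e2a2f..fd7b265cfa7d67023075ec8d9bc59ed85f4e0f15 100644
--- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py
+++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py
@@ -4,4 +4,4 @@
 # Entry point for Pytorch TensorBoard plugin package.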
-__version__ = '0.4.0.5'
+__version__ = '0.4.0.8'
diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py
index 9961ae7cf9eb3144da8f1ac78e551a56ca4f27b8..d6f9bb245eb2d170cb4a63e7f912a9c69932e28b 100644
--- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py
+++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py
@@ -64,13 +64,17 @@ class RunProfileData(object):
         fwd_bwd_events = []
         if trace_body is not None:
             for data in trace_body:
+                if data.get('ts') is not None:
+                    try:
+                        self.profiler_start_ts = min(self.profiler_start_ts, float(data.get('ts')))
+                    except ValueError:
+                        logger.warning(f'The operator {data.get("name")} has wrong "ts" format, expected a number.')
                 if data.get('cat') == 'forward_backward':
                     fwd_bwd_events.append(data)
                 else:
                     event = trace.create_event(data, self.is_pytorch_lightning)
                     if event is not None:
                         event.ts = float(event.ts)
-                        self.profiler_start_ts = min(self.profiler_start_ts, event.ts)
                         self.events.append(event)
         self.events.sort(key=lambda e: e.ts)
diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py
index 061db7a4e0d46f7a56c70e0953128c30a243499e..3cd7ce9ff662a152cc9e1e4150bfe4d762e7a691 100644
--- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py
+++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py
@@ -88,10 +88,6 @@ class NodeParserMixin:
                 runtime_nodes = externalid_to_runtime.pop(op.external_id, [])
                 if runtime_nodes:
                     op.runtimes.extend(runtime_nodes)
-        for ext_id in externalid_to_runtime:
-            if ext_id != 0:
-                logger.warning("{} Runtime with external id {} don't correlate to any operator!".format(
-                    len(externalid_to_runtime[ext_id]), ext_id))
 
         if len(corrid_to_device) > 0:
             node_count_dict = defaultdict(int)
@@ -138,13 +134,6 @@ class NodeParserMixin:
                         if rt_node.device_nodes is None:
                             rt_node.device_nodes = []
                         rt_node.device_nodes.append(device_node)
-
-                        # Check the external_id
-                        if rt_node.external_id != device_node.external_id:
-                            logger.warning(
-                                'Runtime and Device-op have same correlation id %s but with different external id!'
- ' (runtime external_id, device external_id): (%s, %s)' % - (corrid, rt_node.external_id, device_node.external_id)) else: corrid_to_device[corrid].append(device_node) self.device_node_list.append(device_node) diff --git a/profiler/README.md b/profiler/README.md index c21a616c98cd8a369c10f17e8ac9387a4a017c0d..1669e3524e54bb78e6f4f09f597d2399196ff950 100644 --- a/profiler/README.md +++ b/profiler/README.md @@ -91,6 +91,7 @@ ascend pytorch profiler数据目录结构如下: | profiler版本 | 发布日期 | 下载链接 | 校验码 | | ------------ | ---------- | ------------------------------------------------------------ | ------------------------------------------------------------ | + | 1.1.2 | 2024-07-12 | [msprof_analyze-1.1.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.2/msprof_analyze-1.1.2-py3-none-any.whl) | af62125b1f9348bf491364e03af712fc6d0282ccee3fb07458bc9bbef82dacc6 | | 1.1.1 | 2024-06-20 | [msprof_analyze-1.1.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.1/msprof_analyze-1.1.1-py3-none-any.whl) | 76aad967a3823151421153d368d4d2f8e5cfbcb356033575e0b8ec5acea8e5e4 | | 1.1.0 | 2024-05-28 | [msprof_analyze-1.1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.0/msprof_analyze-1.1.0-py3-none-any.whl) | b339f70e7d1e45e81f289332ca64990a744d0e7ce6fdd84a8d82e814fa400698 | | 1.0 | 2024-05-10 | [msprof_analyze-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.0/msprof_analyze-1.0-py3-none-any.whl) | 95b2f41c8c8e8afe4887b738c8cababcb4f412e1874483b6adae4a025fcbb7d4 | @@ -124,12 +125,6 @@ ascend pytorch profiler数据目录结构如下: pip3 install ./msprof_analyze-{version}-py3-none-any.whl ``` - 若为覆盖安装,请在命令行末尾增加“--force-reinstall”参数强制安装,例如: - - ```bash - pip3 install ./msprof_analyze-{version}-py3-none-any.whl --force-reinstall - ``` - 提示如下信息则表示安装成功。 ```bash @@ -167,25 +162,45 @@ ascend pytorch profiler数据目录结构如下: ```bash cd dist - pip3 install ./msprof_analyze-{version}-py3-none-any.whl --force-reinstall + pip3 install ./msprof_analyze-{version}-py3-none-any.whl + ``` + +## 卸载和更新 + +若需要更新工具,请先卸载旧版本后再重新安装新版本,如下操作: + +1. 卸载 + + ```bash + pip3 uninstall msprof-analyze + ``` + +2. 
更新 + + ```bash + pip3 install ./msprof_analyze-{version}-py3-none-any.whl ``` ## 工具使用 ```bash -msprof-analyze advisor [-h] [-v] +msprof-analyze advisor [-h] ``` ```bash -msprof-analyze compare [-h] [-v] +msprof-analyze compare [-h] ``` ```bash -msprof-analyze cluster [-h] [-v] +msprof-analyze cluster [-h] ``` ```bash -msprof-analyze auto-completion [-h] [-v] +msprof-analyze auto-completion [-h] +``` + +``` +msprof-analyze [-h] [-v] ``` | 参数 | 说明 | diff --git a/profiler/advisor/README.md b/profiler/advisor/README.md index d08ae23d6e86adefda8dccfc5a17ef5c43792bff..ccaccdda017ad61674ea245eb5f9c9465747f63e 100644 --- a/profiler/advisor/README.md +++ b/profiler/advisor/README.md @@ -92,19 +92,19 @@ msprof-analyze的advisor功能是将Ascend PyTorch Profiler或者msprof采集的 - 总体性能瓶颈 ```bash - msprof-analyze advisor all -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [-d] [-h] + msprof-analyze advisor all -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [-D] [-h] ``` - 计算瓶颈 ```bash - msprof-analyze advisor computation -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [-d] [-h] + msprof-analyze advisor computation -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [-D] [-h] ``` - 调度瓶颈 ```bash - msprof-analyze advisor schedule -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-d] [-h] + msprof-analyze advisor schedule -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-D] [-h] ``` #### 参数介绍 diff --git a/profiler/advisor/analyzer/computation/operator_checker.py b/profiler/advisor/analyzer/computation/operator_checker.py index 0f47650943a7355b494bd766214d10526c46c0fa..64618b56a8df7f380277e99ae7ca47cd69d24648 100644 --- a/profiler/advisor/analyzer/computation/operator_checker.py +++ b/profiler/advisor/analyzer/computation/operator_checker.py @@ -118,7 +118,7 @@ class OperatorChecker(VersionControl): def is_dynamic_shape(self, profiling_database: ProfilingDataset) -> bool: less_than_cann800_list = [constant.CANN_VERSION_C30, constant.CANN_VERSION_C13, constant.CANN_VERSION_C15] - # CANN 8.0.0 之前从 ge_info 中获取 op_state 属性,进行动态 shape 逻辑判断 + # CANN 8.0.RC1 之前从 ge_info 中获取 op_state 属性,进行动态 shape 逻辑判断 if self.cann_version in less_than_cann800_list: if hasattr(profiling_database, "ge_info"): ge_info = profiling_database.ge_info @@ -131,7 +131,7 @@ class OperatorChecker(VersionControl): "To enable dynamic shape check, please try to set data_simplification=False in experimental_config.\n" "More details please refer to link : %s", constant.ASCEND_PROFILER_URL) else: - # CANN 8.0.0 之后 op_state 属性从 op_summary 文件中获取 + # CANN 8.0.RC1 之后 op_state 属性从 op_summary 文件中获取 if hasattr(profiling_database, "op_summary"): static_shape_operators = profiling_database.op_summary.get_static_shape_operators() if len(static_shape_operators) == 0: diff --git a/profiler/advisor/computation_analysis.ipynb b/profiler/advisor/computation_analysis.ipynb index 585f222c6a7ca765443104cf84ca562a99328506..0d4aaadfadff05d1e11d4a9873ef7ce4ae2cfaa8 100644 --- a/profiler/advisor/computation_analysis.ipynb +++ b/profiler/advisor/computation_analysis.ipynb @@ -44,7 +44,7 @@ "outputs": [], "source": [ "# 查询computation相关是否存在block dim问题\n", - "# 如果profiling数据采集自非8.0.0的CANN版本,需要在训练/推理环境中执行: 'cat 
CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", + "# 如果profiling数据采集自非8.0.RC1的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", "block_dim_result = interface.get_result(\"computation\", \"block_dim_analysis\", cann_version=\"7.0.RC1\")" ] }, @@ -252,7 +252,7 @@ "outputs": [], "source": [ "# 查询computation相关是否存在operator no bound问题\n", - "# 如果profiling数据采集自非8.0.0的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", + "# 如果profiling数据采集自非8.0.RC1的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", "operator_no_bound_result = interface.get_result(\"computation\", \"operator_no_bound_analysis\", cann_version=\"7.0.RC1\")" ] }, @@ -499,7 +499,7 @@ ], "source": [ "# 查询computation相关是否存在aicpu问题\n", - "# 如果profiling数据采集自非8.0.0的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", + "# 如果profiling数据采集自非8.0.RC1的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", "aicpu_result = interface.get_result(\"computation\", \"aicpu_analysis\")" ] }, diff --git a/profiler/advisor/config/profiling_data_version_config.yaml b/profiler/advisor/config/profiling_data_version_config.yaml index f73aecd3baf18e06981ef4d4b0db7d6faadd419a..4ef76105a07c28c5072c4bbfe20fd39a938038b7 100644 --- a/profiler/advisor/config/profiling_data_version_config.yaml +++ b/profiler/advisor/config/profiling_data_version_config.yaml @@ -1,5 +1,5 @@ versions: - - version: 8.0.0 + - version: 8.0.RC1 dirs_pattern: ^PROF_\d{6}_\d{17}_\w+$: mindstudio_profiler_output: diff --git a/profiler/advisor/fusion_operators_api_analysis.ipynb b/profiler/advisor/fusion_operators_api_analysis.ipynb index dcc71ba3c139f630c07545340e61c66b1f29d929..ac758f562f13c9dd7466279aac73002c0e68da55 100644 --- a/profiler/advisor/fusion_operators_api_analysis.ipynb +++ b/profiler/advisor/fusion_operators_api_analysis.ipynb @@ -81,7 +81,7 @@ " \n", " \n", " timeline_fusion_ops\n", - " Found 2 apis to be replaced based on the runtime env cann-8.0.0 and torch-2.1.0\n", + " Found 2 apis to be replaced based on the runtime env cann-8.0.RC1 and torch-2.1.0\n", " 1. Please replace training api according to sub table 'Affinity training api'\n", " \n", " \n", @@ -91,7 +91,7 @@ "+---------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------+\n", "| problem | description | suggestion |\n", "+---------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------+\n", - "| timeline_fusion_ops | Found 2 apis to be replaced based on the runtime env cann-8.0.0 and torch-2.1.0 | 1. Please replace training api according to sub table 'Affinity training api' |\n", + "| timeline_fusion_ops | Found 2 apis to be replaced based on the runtime env cann-8.0.RC1 and torch-2.1.0 | 1. 
Please replace training api according to sub table 'Affinity training api' |\n", "+---------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------+" ] }, diff --git a/profiler/advisor/rules/aicpu_rules.yaml b/profiler/advisor/rules/aicpu_rules.yaml index ac49b22f40efc813b58c2483bf325f30314b049c..58e6eef163204ea1b5efbb5148770948bd4afdad 100644 --- a/profiler/advisor/rules/aicpu_rules.yaml +++ b/profiler/advisor/rules/aicpu_rules.yaml @@ -46,7 +46,7 @@ CommonChecker: suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ __ALL__ ] ignore_type: [ cast, tensorequal, equal, nonzero, mul ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8, int16, complex64, complex128 ] @@ -54,28 +54,28 @@ CommonChecker: suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ cast ] input: [ float, float32, float16, bool, int32, uint32, int64, uint64, uint8, dt_bf16 ] output: [ float, float32, float16, bool, int32, uint32, int64, uint64, uint8, dt_bf16 ] suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ tensorequal ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int8, uint8 ] output: [ bool ] suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ equal ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8 ] output: [ bool ] suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ mul ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8, complex64 ] output: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8, complex64 ] diff --git a/profiler/advisor/rules/timeline_fusion_ops.yaml b/profiler/advisor/rules/timeline_fusion_ops.yaml index 10c12ff18dd8792e24a89c6d5fbb7ed87f643a9d..8207465dc4a5c5ddbb1cc934ef95951493c4bacb 100644 --- a/profiler/advisor/rules/timeline_fusion_ops.yaml +++ b/profiler/advisor/rules/timeline_fusion_ops.yaml @@ -49,7 +49,7 @@ "(slice|chunk)-sigmoid-mul-mul", "(slice|chunk)-mul-sigmoid-mul", "(slice|chunk)-mul-mul-sigmoid" ] -- cann_version: 8.0.0 +- cann_version: 8.0.RC1 torch_version: [1.11.0, 2.1.0] unique_id: 3 inherit_unique_id: 2 diff --git a/profiler/advisor/version.py b/profiler/advisor/version.py index 1a95cc3c0f93f49a2aaacf483770462d09961ff9..67d04140866a3df8ecb8484451c476006da2671d 100644 --- a/profiler/advisor/version.py +++ b/profiler/advisor/version.py @@ -30,9 +30,9 @@ def print_version_callback(ctx, param, value): # NOQA if not value or ctx.resilient_parsing: return - click.echo('Version {}'.format(get_package_version("att_advisor"))) + click.echo('Version {}'.format(get_package_version("msprof-analyze"))) ctx.exit() def cli_version(): - return get_package_version("att_advisor") + return get_package_version("msprof-analyze") diff --git a/profiler/version.txt b/profiler/version.txt index 8cfbc905b39f65131ba18e561d236557fbdc52cc..8428158dc5bd08a490b652db38f90e08cb471d25 100644 --- a/profiler/version.txt +++ b/profiler/version.txt @@ -1 +1 @@ -1.1.1 \ No newline at end of file +1.1.2 \ No newline at end of file
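The version.py change above makes the version callback report the merged msprof-analyze package instead of att_advisor. For reference, a minimal sketch of how such a lookup is typically implemented; this is an assumption about the helper, not the repo's actual get_package_version:

```python
# Illustrative stand-in for get_package_version; the real helper lives
# elsewhere in the repo and may behave differently.
from importlib.metadata import PackageNotFoundError, version

def get_package_version(package_name: str) -> str:
    try:
        return version(package_name)
    except PackageNotFoundError:
        # Package is not installed in the current environment.
        return "unknown"
```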
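Earlier in this patch, the v6.0 comparison doc tightens the red/yellow highlighting thresholds. A compact restatement of those updated rules in code; the field names (max_diff, bench_max, one_thousandth, cosine) are placeholders, not the tool's actual schema:

```python
import math

def is_red(in_row, out_row):
    # NPU max/min contains nan/inf, or the absolute diff is huge.
    if any(math.isnan(v) or math.isinf(v) for v in (out_row["npu_max"], out_row["npu_min"])):
        return True
    if out_row["max_diff"] > 1e10:
        return True
    # Statistics data: output Max diff over max(0.01, Bench max) exceeds 0.5.
    if out_row["max_diff"] / max(0.01, out_row["bench_max"]) > 0.5:
        return True
    # Real data: One Thousandth Err Ratio with input > 0.9 and output < 0.6.
    return in_row["one_thousandth"] > 0.9 and out_row["one_thousandth"] < 0.6

def is_yellow(in_row, out_row):
    # Both Max diffs above 1 and output at least an order of magnitude larger.
    if in_row["max_diff"] > 1 and out_row["max_diff"] > 10 * in_row["max_diff"]:
        return True
    # Statistics data: relative Max diff with output > 0.1 and input < 0.01.
    in_rel = in_row["max_diff"] / max(0.01, in_row["bench_max"])
    out_rel = out_row["max_diff"] / max(0.01, out_row["bench_max"])
    if out_rel > 0.1 and in_rel < 0.01:
        return True
    # Real data: input-minus-output gaps on One Thousandth Err Ratio and Cosine.
    return (in_row["one_thousandth"] - out_row["one_thousandth"] > 0.1
            or in_row["cosine"] - out_row["cosine"] > 0.1)
```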