diff --git a/.gitee/PULL_REQUEST_TEMPLATE.zh-CN.md b/.gitee/PULL_REQUEST_TEMPLATE.zh-CN.md index a4049752f470ddb41c4c74f7ae6d5819d4315773..e9cc1deb82ff0498f1a8267cd288ecde798f308c 100644 --- a/.gitee/PULL_REQUEST_TEMPLATE.zh-CN.md +++ b/.gitee/PULL_REQUEST_TEMPLATE.zh-CN.md @@ -1,5 +1,9 @@ # PR 合入模板 +**注:经过自检不涉及的可标注“不涉及”或直接打勾,特殊情况请文字备注。不符合规范的 PR 不允许合入,请(后备)commit 注意。** + +--- + ## 1. 修改描述 - **修改原因:** - **修改内容:** @@ -7,49 +11,56 @@ --- ## 2. 功能验证 -- [ ] **功能自验** -- [ ] **本地自验用例截图**(请确保不体现个人信息) -- [ ] **冒烟是否通过** +- [ ] **功能自验** +- [ ] **本地自验用例截图** +- [ ] **冒烟是否通过** (填入群链接的自验证报告中,如未通过,请说明原因:____________________ ,功能代码请主动申报添加冒烟) + +--- + +## 3. 分支合并要求 +- [ ] **代码合并**(请确保将 master 分支的最新代码同步合并至 poc 分支及 pre-research 分支,同时保证 poc 分支的代码也已正确合并到 pre-research 分支。) --- ## 3. 代码检视 - **要求:** - - 合入代码大于 200 行,需三人以上会议检视。 - - 检视密度≥2个/100行。 - - 检视缺陷密度达不到要求的需给出说明。 + - 合入代码超过 200 行,需三人以上会议检视。 + - 检视密度≥1个/100行。 + - 检视缺陷密度未达要求需提供说明。 - 大于 1000 行代码原则上不允许合入,需进行备案。 -- [ ] **是否经过代码检视** -- [ ] **是否具备UT测试用例看护** +- [ ] **是否经过代码检视** +- [ ] **是否具备 UT 测试用例看护** (如不符合,请说明原因:____________________) + +- **检视意见数:____ 条** (请填写本次检视的意见总数,用于commit合入前审视) --- ## 4. 安全自检 -- **典型安全编码问题 ** -- [ ] **若涉及对外接口,是否已校验外部数据** -- [ ] **MR 标题和描述是否按格式填写** -- [ ] **是否进行空指针校验** -- [ ] **是否进行返回值校验** -- [ ] **是否正确考虑文件权限配置** -- [ ] **是否充分考虑接口的异常场景** -- [ ] **是否正确记录错误日志** -- [ ] **若涉及正则表达式,是否对正则表达式做 ReDos 校验** -- [ ] **若涉及运算,是否存在整数溢出、除零等风险** ---- +### Python、C++: +- [ ] **对外接口新增/删除/变更,需要更新外部输入表格** +- [ ] **不允许私有的文件操作,需要使用公共函数** +- [ ] **数组使用需要校验越界场景** +- [ ] **对正则表达式做 ReDos 校验** +- [ ] **对除法做除零校验** +- [ ] **充分进行接口返回值异常情况的校验** +- [ ] **充分进行接口输入值异常情况的校验** +- [ ] **日志不要暴露代码细节和敏感信息** + +### C++: +- [ ] **指针使用前需要判空** +- [ ] **数值计算校验溢出和反转** +- [ ] **不可存在内存泄漏(异常场景需要释放内存)** +- [ ] **类型转换不能出现数据截断** +- [ ] **拷贝字符串时,目的缓冲区至少比源缓冲区大 1** +- [ ] **拷贝内存时,目的缓冲区不小于源缓冲区** +- [ ] **内存释放后指针赋值为 nullptr** -## 5. 变更知会 -- **资料修改:** -- **变更通知(消息知会 + 邮件知会):** --- -## 6. 
冒烟修改 -- **PR 来源:** - - [ ] 问题单 - - [ ] 需求特性 - - [ ] 其他 -- [ ] **是否存在冒烟可以拦截却未拦截的情况** -- [ ] **是否需要添加冒烟:** +## 5. 变更知会 +- [ ] **资料修改** +- [ ] **变更通知(消息知会 + 邮件知会)** ---- \ No newline at end of file +--- diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 0000000000000000000000000000000000000000..b08433f072bf89f62edf88b3aff40d24c1040ea8 --- /dev/null +++ b/.gitmodules @@ -0,0 +1,3 @@ +[submodule "dynolog_npu/third_party/dynolog"] + path = dynolog_npu/third_party/dynolog + url = https://github.com/facebookincubator/dynolog.git diff --git a/OWNERS b/OWNERS index 2e949debf181a6e75fdb5b1e1e091ce7a39c7e69..415d737ed907c577bc61e71c2839a485395b899c 100644 --- a/OWNERS +++ b/OWNERS @@ -1,7 +1,6 @@ approvers: - leo920320 - wo-wenjie -- ma-dongfang - xhahn - aerfaliang - wangchao285 @@ -11,16 +10,14 @@ approvers: - ly-qianxiao - blian - kun_8 -- binghamhuang reviewers: - lv-kaimeng -- litian_drinksnow -- binghamhuang - wo-wenjie - ly-qianxiao - leo920320 - sunboquan -- stby - Seanesmhxocism - TAJh -- czr9775 \ No newline at end of file +- czr9775 +- kali20gakki +- wjchuee \ No newline at end of file diff --git a/README.md b/README.md index ea548a0bfcab119d7af06912587d5887adc8a0a8..5ae0bf742fced7ed86452d03d013670cc3528316 100644 --- a/README.md +++ b/README.md @@ -26,25 +26,21 @@ 脚本迁移工具通过后端命令行,将 GPU 上训练的 PyTorch 脚本迁移至 NPU 上,得到新的训练脚本用于训练。 -4. [训推一体权重转换工具](https://gitee.com/Ascend/mstt/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E8%AE%AD%E6%8E%A8%E4%B8%80%E4%BD%93%E6%9D%83%E9%87%8D%E8%BD%AC%E6%8D%A2%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) - - 训推一体权重转换工具,支持在 GPU 和 NPU 上训练好的模型转成加速推理支持的格式。 - ## [精度工具](./debug/accuracy_tools/) [MindStudio Probe(msprobe,MindStudio 精度调试工具)](./debug/accuracy_tools/msprobe)。 -## [性能工具](./profiler) +## [性能工具](./profiler/msprof_analyze) -1. [compare_tools(性能比对工具)](./profiler/compare_tools) +1. 
[compare_tools(性能比对工具)](./profiler/msprof_analyze/compare_tools)
 
    提供 NPU 与 GPU 性能拆解功能以及算子、通信、内存性能的比对功能。
 
-2. [cluster_analyse(集群分析工具)](./profiler/cluster_analyse)
+2. [cluster_analyse(集群分析工具)](./profiler/msprof_analyze/cluster_analyse)
 
    提供多机多卡的集群分析能力(基于通信域的通信分析和迭代耗时分析), 当前需要配合 MindStudio Insight 的集群分析功能使用。
 
-3. [advisor](./profiler/advisor)
+3. [advisor](./profiler/msprof_analyze/advisor)
 
    将 Ascend PyTorch Profiler 或者 msprof 采集的 PyTorch 场景性能数据进行分析,并输出性能调优建议。
diff --git a/debug/OWNERS b/debug/OWNERS
index 311da9c60cb527eff4feb755c3a012fc042e3afb..0bda9243569f0b6bcd0ce761d7817d512b487ddd 100644
--- a/debug/OWNERS
+++ b/debug/OWNERS
@@ -3,13 +3,14 @@ options:
 approvers:
 - wangchao285
 - kun_8
-- binghamhuang
 - brightlyking
-- litian_drinksnow
 reviewers:
 - lv-kaimeng
-- binghamhuang
 - TAJh
 - jiandaobao
 - pengxiaopeng1
 - zhengxinqian
+- louyujing
+- yang_chen_2001_02_14
+- shawnzhu1
+- wqc01202410
diff --git a/debug/accuracy_tools/CMakeLists.txt b/debug/accuracy_tools/CMakeLists.txt
new file mode 100644
index 0000000000000000000000000000000000000000..b73df6f420415cc8edb42c77d3356654c41aab77
--- /dev/null
+++ b/debug/accuracy_tools/CMakeLists.txt
@@ -0,0 +1,18 @@
+project(accuracy_tools)
+cmake_minimum_required(VERSION 3.14)
+
+execute_process(
+    COMMAND uname -m
+    OUTPUT_VARIABLE machine_arch
+    OUTPUT_STRIP_TRAILING_WHITESPACE
+)
+
+if (DEFINED ARCH_TYPE AND NOT "${ARCH_TYPE}" STREQUAL "${machine_arch}")
+    message(FATAL_ERROR "Cross-compilation is not supported currently. (compile ${ARCH_TYPE} on ${machine_arch})")
+endif()
+
+
+set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/cmake")
+set(ENV{PROJECT_ROOT_PATH} "${CMAKE_SOURCE_DIR}")
+include(utils)
+add_subdirectory(msprobe)
\ No newline at end of file
diff --git a/debug/accuracy_tools/MANIFEST.in b/debug/accuracy_tools/MANIFEST.in
index 7997215ffdb2071277645bf47c520db304b1bd98..2afe7f3d2a54437b44b2d9f91505234f4c611740 100644
--- a/debug/accuracy_tools/MANIFEST.in
+++ b/debug/accuracy_tools/MANIFEST.in
@@ -2,4 +2,5 @@
 include README.md
 include LICENSE
 recursive-include msprobe *
 recursive-exclude msprobe/test *
+recursive-exclude msprobe/ccsrc *
diff --git a/debug/accuracy_tools/build.sh b/debug/accuracy_tools/build.sh
new file mode 100644
index 0000000000000000000000000000000000000000..a21d11e05f7e7bd7e9bbf28a0fd70ad3d4835fda
--- /dev/null
+++ b/debug/accuracy_tools/build.sh
@@ -0,0 +1,86 @@
+#!/bin/bash
+
+set -e
+
+BUILD_PATH=$(pwd)
+
+BUILD_ARGS=$(getopt -o ha:v:j:ft --long help,release,debug,arch:,python-version:,CANN-path:,jobs:,force-rebuild,local,test-cases -- "$@")
+eval set -- "${BUILD_ARGS}"
+
+ARCH_TYPE=$(uname -m)
+BUILD_TYPE=release
+CANN_PATH=""
+CONCURRENT_JOBS=16
+BUILD_TEST_CASE=False
+USE_LOCAL_FIRST=False
+PYTHON_VERSION=""
+
+HELP_DOC=$(cat << EOF
+Usage: build.sh [OPTION]...\n
+Build the C++ part of MsProbe.\n
+\n
+Arguments:\n
+    -a, --arch              Specify the schema, which generally does not need to be set up.\n
+    --CANN-path             Specify the CANN path. When set, the build script will find the dependent files in\n
+                            the specified path.\n
+    -j, --jobs              Specify the number of compilation jobs(default 16).\n
+    -f, --force-rebuild     Clean up the cache before building.\n
+    -t, --test-cases        Build test cases.\n
+    --local                 Prioritize the use of on-premises, third-party resources as dependencies.\n
+    --release               Build the release version(default).\n
+    --debug                 Build the debug version.\n
+    -v, --python-version    Specify the version of python.\n
+EOF
+)
+
+while true; do
+    case "$1" in
+        -h | --help)
+            echo -e "${HELP_DOC}"
+            exit 0 ;;
+        -a | --arch)
+            ARCH_TYPE="$2" ; shift 2 ;;
+        -v | --python-version)
+            PYTHON_VERSION="$2" ; shift 2 ;;
+        --release)
+            BUILD_TYPE=release ; shift ;;
+        --debug)
+            BUILD_TYPE=debug ; shift ;;
+        --CANN-path)
+            CANN_PATH="$2" ; shift 2 ;;
+        -j | --jobs)
+            CONCURRENT_JOBS="$2" ; shift 2 ;;
+        --local)
+            USE_LOCAL_FIRST=True ; shift ;;
+        -f | --force-rebuild)
+            rm -rf "${BUILD_PATH}/build_dependency" "${BUILD_PATH}/lib" "${BUILD_PATH}/output" "${BUILD_PATH}/third_party" \
+                "${BUILD_PATH}/msprobe/lib/_msprobe_c.so"
+            shift ;;
+        -t | --test-cases)
+            BUILD_TEST_CASE=True ; shift ;;
+        --)
+            shift ; break ;;
+        *)
+            echo "Unknown argument $1"
+            exit 1 ;;
+    esac
+done
+
+BUILD_OUTPUT_PATH=${BUILD_PATH}/output/${BUILD_TYPE}
+
+cmake -B ${BUILD_OUTPUT_PATH} -S . -DARCH_TYPE=${ARCH_TYPE} -DBUILD_TYPE=${BUILD_TYPE} -DCANN_PATH=${CANN_PATH} \
+    -DUSE_LOCAL_FIRST=${USE_LOCAL_FIRST} -DBUILD_TEST_CASE=${BUILD_TEST_CASE} \
+    -DPYTHON_VERSION=${PYTHON_VERSION}
+cd ${BUILD_OUTPUT_PATH}
+make -j${CONCURRENT_JOBS}
+
+if [[ ! -e ${BUILD_OUTPUT_PATH}/msprobe/ccsrc/lib_msprobe_c.so ]]; then
+    echo "Failed to build lib_msprobe_c.so."
+    exit 1
+fi
+
+if [[ ! -e ${BUILD_PATH}/msprobe/lib ]]; then
+    mkdir ${BUILD_PATH}/msprobe/lib
+fi
+
+cp ${BUILD_OUTPUT_PATH}/msprobe/ccsrc/lib_msprobe_c.so ${BUILD_PATH}/msprobe/lib/_msprobe_c.so
diff --git a/debug/accuracy_tools/cmake/Findcpython.cmake b/debug/accuracy_tools/cmake/Findcpython.cmake
new file mode 100644
index 0000000000000000000000000000000000000000..815fbc638de824fb91f2e7183781a6415007868b
--- /dev/null
+++ b/debug/accuracy_tools/cmake/Findcpython.cmake
@@ -0,0 +1,16 @@
+set(PKG_NAME cpython)
+
+if (NOT ${PKG_NAME}_FOUND)
+
+find_package(Python3 ${PYTHON_VERSION} EXACT COMPONENTS Development)
+if (NOT Python3_FOUND)
+    message(FATAL_ERROR "Python3 ${PYTHON_VERSION} is not found.")
+endif()
+
+set(PACKAGE_VERSION ${Python3_VERSION})
+
+include_directories(${Python3_INCLUDE_DIRS})
+set(${PKG_NAME}_LIBRARIES ${Python3_LIBRARIES})
+set(${PKG_NAME}_FOUND TRUE)
+
+endif()
diff --git a/debug/accuracy_tools/cmake/Findgtest.cmake b/debug/accuracy_tools/cmake/Findgtest.cmake
new file mode 100644
index 0000000000000000000000000000000000000000..dbfe76abcc9b5d3c2f61642cc8c6e270fc441a0f
--- /dev/null
+++ b/debug/accuracy_tools/cmake/Findgtest.cmake
@@ -0,0 +1,49 @@
+set(PACKAGE_VERSION 1.12.1)
+
+set(PKG_NAME gtest)
+set(URL "https://gitee.com/mirrors/googletest/repository/archive/release-1.12.1.tar.gz")
+set(SHA256_VALUE "81964fe578e9bd7c94dfdb09c8e4d6e6759e19967e397dbea48d1c10e45d0df2")
+set(DOWNLOAD_PATH "$ENV{PROJECT_ROOT_PATH}/third_party")
+set(DIR_NAME "${DOWNLOAD_PATH}/googletest-release-1.12.1")
+
+if (NOT ${PKG_NAME}_FOUND)
+
+download_opensource_pkg(${PKG_NAME}
+    URL ${URL}
+    SHA256 ${SHA256_VALUE}
+    DOWNLOAD_PATH ${DOWNLOAD_PATH}
+)
+
+include_directories(${DIR_NAME}/googletest/include)
+include_directories(${DIR_NAME}/googlemock/include)
+
+set(BUILD_DEPENDENCY_PATH "$ENV{PROJECT_ROOT_PATH}/build_dependency")
+execute_process(
+    WORKING_DIRECTORY ${DIR_NAME}
+    COMMAND cmake . 
-DBUILD_SHARED_LIBS=ON + RESULT_VARIABLE RESULT +) +if (NOT RESULT EQUAL 0) + message(FATAL_ERROR "Failed to build gtest. ${RESULT}") +endif() +execute_process( + WORKING_DIRECTORY ${DIR_NAME} + COMMAND make -j16 + RESULT_VARIABLE RESULT +) +if (NOT RESULT EQUAL 0) + message(FATAL_ERROR "Failed to build gtest. ${RESULT}") +endif() + +file(GLOB GTEST_SO "${DIR_NAME}/lib/libgtest.so") +file(GLOB GMOCK_SO "${DIR_NAME}/lib/libgmock.so") +file(GLOB GTEST_MAIN_SO "${DIR_NAME}/lib/libgtest_main.so") +file(GLOB GMOCK_MAIN_SO "${DIR_NAME}/lib/libgmock_main.so") +if (NOT GTEST_SO OR NOT GMOCK_SO OR NOT GTEST_MAIN_SO OR NOT GMOCK_MAIN_SO) + message(FATAL_ERROR "Failed to build gtest.") +endif() + +set(${PKG_NAME}_LIBRARIES "${GTEST_SO};${GMOCK_SO};${GTEST_MAIN_SO};${GMOCK_MAIN_SO}") +set(${PKG_NAME}_FOUND TRUE) + +endif() \ No newline at end of file diff --git a/debug/accuracy_tools/cmake/Findmockcpp.cmake b/debug/accuracy_tools/cmake/Findmockcpp.cmake new file mode 100644 index 0000000000000000000000000000000000000000..c360702c187bfdef553a6b67344ea132a18373f6 --- /dev/null +++ b/debug/accuracy_tools/cmake/Findmockcpp.cmake @@ -0,0 +1,45 @@ +set(PACKAGE_VERSION 2.7) + +set(PKG_NAME mockcpp) +set(URL "https://gitee.com/sinojelly/mockcpp/repository/archive/v2.7.zip") +set(SHA256_VALUE "0dc7111c5be9785d0550ed3b68db7e12fd5d7802b7bc6548c52ac7b9e727fcc1") +set(DOWNLOAD_PATH "$ENV{PROJECT_ROOT_PATH}/third_party") +set(DIR_NAME "${DOWNLOAD_PATH}/mockcpp-v2.7") + +if (NOT ${PKG_NAME}_FOUND) + +download_opensource_pkg(${PKG_NAME} + URL ${URL} + SHA256 ${SHA256_VALUE} + DOWNLOAD_PATH ${DOWNLOAD_PATH} +) + +include_directories(${DIR_NAME}/include) +include_directories(${DIR_NAME}/3rdparty) + +execute_process( + WORKING_DIRECTORY ${DIR_NAME} + COMMAND cmake . + RESULT_VARIABLE RESULT +) +if (NOT RESULT EQUAL 0) + message(FATAL_ERROR "Failed to build mockcpp. 
${RESULT}") +endif() +execute_process( + WORKING_DIRECTORY ${DIR_NAME} + COMMAND make -j16 + RESULT_VARIABLE RESULT +) +if (NOT RESULT EQUAL 0) + message(FATAL_ERROR "Failed to build mockcpp. ${RESULT}") +endif() + +file(GLOB MOCKCPP_LIB "${DIR_NAME}/src/libmockcpp.a") +if (NOT MOCKCPP_LIB) + message(FATAL_ERROR "Failed to build mockcpp.") +endif() + +set(${PKG_NAME}_LIBRARIES "${MOCKCPP_LIB}") +set(${PKG_NAME}_FOUND TRUE) + +endif() \ No newline at end of file diff --git a/debug/accuracy_tools/cmake/Findnlohmannjson.cmake b/debug/accuracy_tools/cmake/Findnlohmannjson.cmake new file mode 100644 index 0000000000000000000000000000000000000000..0f85cc00a0d30a3896a8f47cac95911929070e33 --- /dev/null +++ b/debug/accuracy_tools/cmake/Findnlohmannjson.cmake @@ -0,0 +1,20 @@ +set(PACKAGE_VERSION 3.10.1) + +set(PKG_NAME nlohmannjson) +set(URL "https://gitee.com/mirrors/JSON-for-Modern-CPP/repository/archive/v3.10.1.zip") +set(SHA256_VALUE "5c7d0a0542431fef628f8dc4c34fd022fe8747ccb577012d58f38672d8747e0d") +set(DOWNLOAD_PATH "$ENV{PROJECT_ROOT_PATH}/third_party") +set(DIR_NAME "${DOWNLOAD_PATH}/JSON-for-Modern-CPP-v3.10.1") + +if (NOT ${PKG_NAME}_FOUND) + +download_opensource_pkg(${PKG_NAME} + URL ${URL} + SHA256 ${SHA256_VALUE} + DOWNLOAD_PATH ${DOWNLOAD_PATH} +) + +include_directories(${DIR_NAME}/include) +set(${PKG_NAME}_FOUND TRUE) + +endif() diff --git a/debug/accuracy_tools/cmake/Findopenssl.cmake b/debug/accuracy_tools/cmake/Findopenssl.cmake new file mode 100644 index 0000000000000000000000000000000000000000..d361095242917df8accbb81a51de65c5ca5ac980 --- /dev/null +++ b/debug/accuracy_tools/cmake/Findopenssl.cmake @@ -0,0 +1,73 @@ +set(PACKAGE_VERSION 1.1.1) + +set(PKG_NAME openssl) +set(URL "https://gitee.com/mirrors/openssl/repository/archive/OpenSSL_1_1_1k.tar.gz") +set(SHA256_VALUE "b92f9d3d12043c02860e5e602e50a73ed21a69947bcc74d391f41148e9f6aa95") +set(DOWNLOAD_PATH "$ENV{PROJECT_ROOT_PATH}/third_party") +set(DIR_NAME "${DOWNLOAD_PATH}/openssl-OpenSSL_1_1_1k") + 
+if (NOT ${PKG_NAME}_FOUND)
+
+if (DEFINED USE_LOCAL_FIRST AND "${USE_LOCAL_FIRST}" STREQUAL "True")
+find_package(OpenSSL)
+if (OpenSSL_FOUND AND OPENSSL_INCLUDE_DIR AND OPENSSL_LIBRARIES)
+    if (${OPENSSL_VERSION} VERSION_GREATER_EQUAL ${PACKAGE_VERSION})
+        message("Found openssl ${OPENSSL_VERSION}, which is equal or greater than the minimum required version ${PACKAGE_VERSION}. Use it instead.")
+        set(PACKAGE_VERSION ${OPENSSL_VERSION})
+        set(${PKG_NAME}_FOUND TRUE)
+        include_directories(${OPENSSL_INCLUDE_DIR})
+        set(${PKG_NAME}_LIBRARIES ${OPENSSL_LIBRARIES})
+        return()
+    endif()
+endif()
+endif()
+
+download_opensource_pkg(${PKG_NAME}
+    URL ${URL}
+    SHA256 ${SHA256_VALUE}
+    DOWNLOAD_PATH ${DOWNLOAD_PATH}
+)
+
+include_directories(${DIR_NAME}/include)
+set(BUILD_DEPENDENCY_PATH "$ENV{PROJECT_ROOT_PATH}/build_dependency")
+file(GLOB OPENSSL_LIB "${BUILD_DEPENDENCY_PATH}/lib/libssl.a")
+file(GLOB CRYPTO_LIB "${BUILD_DEPENDENCY_PATH}/lib/libcrypto.a")
+if (OPENSSL_LIB AND CRYPTO_LIB)
+    set(${PKG_NAME}_FOUND TRUE)
+    set(${PKG_NAME}_LIBRARIES "${OPENSSL_LIB};${CRYPTO_LIB}")
+    return()
+endif()
+
+execute_process(
+    WORKING_DIRECTORY ${DIR_NAME}
+    COMMAND ./config -fPIC no-shared --prefix=${BUILD_DEPENDENCY_PATH}
+    RESULT_VARIABLE RESULT
+)
+if (NOT RESULT EQUAL 0)
+    message(FATAL_ERROR "Failed to build openssl. ${RESULT}")
+endif()
+
+execute_process(
+    WORKING_DIRECTORY ${DIR_NAME}
+    COMMAND make -j16
+    RESULT_VARIABLE RESULT
+)
+if (NOT RESULT EQUAL 0)
+    message(FATAL_ERROR "Failed to build openssl. ${RESULT}")
+endif()
+
+execute_process(
+    WORKING_DIRECTORY ${DIR_NAME}
+    COMMAND make install
+)
+
+file(GLOB OPENSSL_LIB "${BUILD_DEPENDENCY_PATH}/lib/libssl.a")
+file(GLOB CRYPTO_LIB "${BUILD_DEPENDENCY_PATH}/lib/libcrypto.a")
+if (NOT OPENSSL_LIB OR NOT CRYPTO_LIB)
+    message(FATAL_ERROR "Failed to build openssl.")
+endif()
+
+set(${PKG_NAME}_LIBRARIES "${OPENSSL_LIB};${CRYPTO_LIB}")
+set(${PKG_NAME}_FOUND TRUE)
+
+endif()
diff --git a/debug/accuracy_tools/cmake/Findprotobuf.cmake b/debug/accuracy_tools/cmake/Findprotobuf.cmake
new file mode 100644
index 0000000000000000000000000000000000000000..4d70515e980f7a921447250fe58400f600419e4c
--- /dev/null
+++ b/debug/accuracy_tools/cmake/Findprotobuf.cmake
@@ -0,0 +1,93 @@
+set(PACKAGE_VERSION 3.13.0)
+
+set(PKG_NAME protobuf)
+set(URL "https://gitee.com/mirrors/protobuf_source/repository/archive/v3.13.0.tar.gz")
+set(SHA256_VALUE "ab9b39e7053a6fb06b01bf75fb6ec6a71a1ada5a5f8e2446f927336e97b9e7bb")
+set(DOWNLOAD_PATH "$ENV{PROJECT_ROOT_PATH}/third_party")
+set(DIR_NAME "${DOWNLOAD_PATH}/protobuf_source-v3.13.0")
+
+if (NOT ${PKG_NAME}_FOUND)
+
+if (DEFINED USE_LOCAL_FIRST AND "${USE_LOCAL_FIRST}" STREQUAL "True")
+find_program(PROTOC_EXECUTABLE protoc)
+find_package(Protobuf)
+if (PROTOC_EXECUTABLE AND Protobuf_FOUND)
+execute_process(
+    COMMAND ${PROTOC_EXECUTABLE} --version
+    OUTPUT_VARIABLE PROTOC_VERSION_OUTPUT
+    ERROR_VARIABLE PROTOC_VERSION_OUTPUT
+    OUTPUT_STRIP_TRAILING_WHITESPACE
+)
+string(REGEX MATCH "[0-9]+\\.[0-9]+" PROTOC_VERSION ${PROTOC_VERSION_OUTPUT})
+if(${PROTOC_VERSION} VERSION_GREATER_EQUAL ${PACKAGE_VERSION})
+    message("Found protoc ${PROTOC_VERSION}, which is equal or greater than the minimum required version ${PACKAGE_VERSION}. 
Use it instead.") + set(PACKAGE_VERSION ${PROTOC_VERSION}) + set(${PKG_NAME}_FOUND TRUE) + set(${PKG_NAME}_LIBRARIES ${Protobuf_LIBRARIES}) + set(PROTOC_EXECUTABLE ${PROTOC_EXECUTABLE}) + include_directories(${Protobuf_INCLUDE_DIRS}) + return() +endif() +endif() +endif() + +download_opensource_pkg(${PKG_NAME} + URL ${URL} + SHA256 ${SHA256_VALUE} + DOWNLOAD_PATH ${DOWNLOAD_PATH} +) + +include_directories(${DIR_NAME}/src) +set(BUILD_DEPENDENCY_PATH "$ENV{PROJECT_ROOT_PATH}/build_dependency") +file(GLOB PROTOC_EXECUTABLE "${BUILD_DEPENDENCY_PATH}/bin/protoc") +file(GLOB ${PKG_NAME}_LIBRARIES "${BUILD_DEPENDENCY_PATH}/lib/libprotobuf.a") +if (PROTOC_EXECUTABLE AND ${PKG_NAME}_LIBRARIES) + set(${PKG_NAME}_FOUND TRUE) + set(PROTOC_EXECUTABLE ${PROTOC_EXECUTABLE}) + set(${PKG_NAME}_LIBRARIES ${${PKG_NAME}_LIBRARIES}) + return() +endif() + +execute_process( + WORKING_DIRECTORY ${DIR_NAME} + COMMAND ./autogen.sh + RESULT_VARIABLE RESULT +) +if (NOT RESULT EQUAL 0) + message(FATAL_ERROR "Failed to build protobuf. ${RESULT}") +endif() + +execute_process( + WORKING_DIRECTORY ${DIR_NAME} + COMMAND ./configure CFLAGS=-fPIC CXXFLAGS=-fPIC --prefix=${BUILD_DEPENDENCY_PATH} --enable-cpp + RESULT_VARIABLE RESULT +) +if (NOT RESULT EQUAL 0) + message(FATAL_ERROR "Failed to build protobuf. ${RESULT}") +endif() + +execute_process( + WORKING_DIRECTORY ${DIR_NAME} + COMMAND make -j16 + RESULT_VARIABLE RESULT +) +if (NOT RESULT EQUAL 0) + message(FATAL_ERROR "Failed to build protobuf. 
${RESULT}")
+endif()
+
+execute_process(
+    WORKING_DIRECTORY ${DIR_NAME}
+    COMMAND make install
+)
+
+file(GLOB PROTOC_EXECUTABLE "${BUILD_DEPENDENCY_PATH}/bin/protoc")
+file(GLOB ${PKG_NAME}_LIBRARIES "${BUILD_DEPENDENCY_PATH}/lib/libprotobuf.a")
+if (NOT PROTOC_EXECUTABLE OR NOT ${PKG_NAME}_LIBRARIES)
+    message(FATAL_ERROR "Failed to build protobuf.")
+endif()
+
+set(PROTOC_EXECUTABLE ${PROTOC_EXECUTABLE})
+set(${PKG_NAME}_LIBRARIES ${${PKG_NAME}_LIBRARIES})
+set(${PKG_NAME}_FOUND TRUE)
+
+endif()
diff --git a/debug/accuracy_tools/cmake/download_opensource.sh b/debug/accuracy_tools/cmake/download_opensource.sh
new file mode 100644
index 0000000000000000000000000000000000000000..725e971621434c32d9954c80b9efe234502eefcc
--- /dev/null
+++ b/debug/accuracy_tools/cmake/download_opensource.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+
+if [ "$#" -lt 2 ]; then
+    echo "Usage: $0 <url> <path> [<sha256>] [<tag>]"
+    exit 1
+fi
+
+url=$1
+path=$2
+
+if [ "$#" -ge 3 ]; then
+    sha256_value=$3
+fi
+if [ "$#" -ge 4 ]; then
+    tag=$4
+fi
+
+echo "Start to download ${url}..."
+
+if [ ! -d "$path" ]; then
+    echo "The specified path does not exist: $path"
+    exit 1
+fi
+cd ${path}
+
+extension=$(echo "${url}" | awk -F'[./]' '{print $NF}')
+if [[ "${extension}" == "gz" || "${extension}" == "zip" ]]; then
+    fullname="${path}/$(basename "${url}")"
+    if [[ -e ${fullname} ]]; then
+        echo "Source ${fullname} already exists, will not download again."
+    else
+        curl -L "${url}" -o ${fullname} -k
+        if [ $? -eq 0 ]; then
+            echo "Download successful: ${url}"
+        else
+            echo "Download failed: ${url}"
+            exit 1
+        fi
+    fi
+
+    if [[ ! -z "${sha256_value}" ]]; then
+        sha256data=$(sha256sum "${fullname}" | cut -d' ' -f1)
+        if [[ "${sha256data}" != "${sha256_value}" ]]; then
+            echo "Failed to verify sha256: ${url}"
+            exit 1
+        fi
+    fi
+
+    if [[ "${extension}" == "gz" ]]; then
+        tar -zxvf ${fullname} -C ./ --skip-old-files > /dev/null
+    elif [[ "${extension}" == "zip" ]]; then
+        unzip -n ${fullname} -d ./ > /dev/null
+    fi
+elif [[ "${extension}" == "git" ]]; then
+    if [[ -z "${tag}" ]]; then
+        git clone ${url}
+    else
+        git clone ${url} -b "${tag}"
+    fi
+    if [ $? -eq 0 ]; then
+        echo "Download successful: ${url}"
+    else
+        echo "Download failed: ${url}"
+        exit 1
+    fi
+else
+    echo "Unknown url ${url}"
+    exit 1
+fi
diff --git a/debug/accuracy_tools/cmake/utils.cmake b/debug/accuracy_tools/cmake/utils.cmake
new file mode 100644
index 0000000000000000000000000000000000000000..e3e963d63e99da4e0bb1fd2973051278feb04435
--- /dev/null
+++ b/debug/accuracy_tools/cmake/utils.cmake
@@ -0,0 +1,45 @@
+
+function(download_opensource_pkg pkg_name)
+    message("start to download ${pkg_name}...")
+    set(options)
+    set(oneValueArgs URL SHA256 GIT_TAG DOWNLOAD_PATH DIR_NAME BUILD_CMD)
+    set(multiValueArgs PATCHES)
+    cmake_parse_arguments(PKG "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+    if (NOT PKG_URL)
+        message(FATAL_ERROR "${pkg_name} needs URL.")
+    endif()
+    if (NOT PKG_DOWNLOAD_PATH)
+        set(PKG_DOWNLOAD_PATH "${CMAKE_SOURCE_DIR}/../third_party")
+    endif()
+    file(MAKE_DIRECTORY ${PKG_DOWNLOAD_PATH})
+
+    execute_process(
+        WORKING_DIRECTORY $ENV{PROJECT_ROOT_PATH}/cmake
+        COMMAND bash download_opensource.sh ${PKG_URL} ${PKG_DOWNLOAD_PATH} ${PKG_SHA256} ${PKG_GIT_TAG}
+        RESULT_VARIABLE RESULT
+    )
+    if (NOT RESULT EQUAL 0)
+        message(FATAL_ERROR "Failed to download ${pkg_name}(${RESULT}).")
+    endif()
+    if (PKG_BUILD_CMD)
+        execute_process(COMMAND bash -c "cd ${PKG_DOWNLOAD_PATH}/${PKG_DIR_NAME};${PKG_BUILD_CMD}")
+    endif()
+endfunction()
+
+function(compile_protobuf_file output_path)
+    if (NOT PROTOC_EXECUTABLE)
+
message(FATAL_ERROR "You shall install protobuf first.") + endif() + file(MAKE_DIRECTORY ${output_path}) + foreach(file ${ARGN}) + get_filename_component(abs_file_path ${file} ABSOLUTE) + get_filename_component(file_name ${file} NAME_WE) + get_filename_component(file_dir ${abs_file_path} PATH) + file(RELATIVE_PATH rel_path ${CMAKE_CURRENT_SOURCE_DIR} ${file_dir}) + execute_process( + COMMAND ${PROTOC_EXECUTABLE} -I${file_dir} --cpp_out=${output_path} ${abs_file_path} + ) + message("Compile protobuf file ${file}") + endforeach() +endfunction() diff --git a/debug/accuracy_tools/msprobe/CMakeLists.txt b/debug/accuracy_tools/msprobe/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..66085a4b0bdb0589f2d90e6f8b23e4c1d6b27c13 --- /dev/null +++ b/debug/accuracy_tools/msprobe/CMakeLists.txt @@ -0,0 +1,5 @@ +add_subdirectory(ccsrc) + +if (DEFINED BUILD_TEST_CASE AND "${BUILD_TEST_CASE}" STREQUAL "True") +add_subdirectory(test) +endif() \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/README.md b/debug/accuracy_tools/msprobe/README.md index 0cbc9e0871040653f21fc36c339ec42ef25ede0d..e31490f01e9f9d61504d9ee2311c82497323d886 100644 --- a/debug/accuracy_tools/msprobe/README.md +++ b/debug/accuracy_tools/msprobe/README.md @@ -15,7 +15,7 @@ debugger = PrecisionDebugger(config_path='./config.json') ... debugger.start() # 一般在训练循环开头启动工具 ... 
# 循环体 -debugger.stop() # 一般在训练循环末尾结束工具 +debugger.stop() # 一般在训练循环末尾结束工具。必须调用,否则可能导致精度数据落盘不全 debugger.step() # 在训练循环的最后需要重置工具,非循环场景不需要 ``` @@ -33,16 +33,40 @@ export MSPROBE_LOG_LEVEL={x} ``` **config.json** 的配置要求和各功能具体的使用指导详见后续章节。 +## 环境和依赖 + +- 硬件环境请参见《[昇腾产品形态说明](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F80RC22%2Fquickstart%2Fquickstart%2Fquickstart_18_0002.html)》。 +- 软件环境请参见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F80RC22%2Fsoftwareinst%2Finstg%2Finstg_0000.html%3FMode%3DPmIns%26OS%3DUbuntu%26Software%3DcannToolKit)》安装昇腾设备开发或运行环境,即toolkit软件包。 + +以上环境依赖请根据实际环境选择适配的版本。 + +## 版本配套说明 + +- msprobe支持AscendPyTorch 1.11.0或更高版本,支持的PyTorch和CANN以及PyTorch和python软件版本配套关系请参见《[Ascend Extension for PyTorch插件](https://gitee.com/ascend/pytorch)》。 +- msprobe支持MindSpore 2.4.0或更高版本,支持的MindSpore和CANN以及MindSpore和python软件版本配套关系请参见《[MindSpore版本发布列表](https://www.mindspore.cn/versions)》。 +- msprobe支持的固件驱动版本与配套CANN软件支持的固件驱动版本相同,开发者可通过“[昇腾社区-固件与驱动](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fhardware%2Ffirmware-drivers%2Fcommunity%3Fproduct%3D2%26model%3D28%26cann%3D8.0.RC3.alpha003%26driver%3D1.0.25.alpha)”页面根据产品型号与CANN软件版本获取配套的固件与驱动。 + + ## 🚨 工具限制与注意事项 **1. Pytorch 框架下,工具暂不支持 Fully Sharded Data Parallel(FSDP)。** +**2. 
工具读写的所有路径,如config_path、dump_path等,只允许包含大小写字母、数字、下划线、斜杠、点和短横线。** + ## ⚙️ [安装](./docs/01.installation.md) +## 🌟 新版本特性 + +请参见[特性变更说明](./docs/01.installation.md#特性变更说明)。 + ## 🛠️ config.json [介绍](./docs/02.config_introduction.md) 和 [示例](./docs/03.config_examples.md) ## 🧰 主要功能 +### 0 用前必看 + +使用工具前,建议先浏览[**工具功能模块简介、适用场景和当前版本局限性**](./docs/25.tool_function_introduction.md),了解功能特性。 + ### 1 数据采集 msprobe 通过在训练脚本中添加 PrecisionDebugger 接口的方式对 API 执行精度数据 dump 操作,对应 config.json 中的 task 为 statistics 或 tensor。 @@ -59,21 +83,21 @@ PyTorch 场景的[离线预检](./docs/07.accuracy_checker_PyTorch.md)和[在线 MindSpore 动态图场景的[离线预检](./docs/09.accuracy_checker_MindSpore.md) -### 3 精度比对 +### 3 分级可视化构图比对 -该功能进行 PyTorch 整网 API 粒度的数据 dump、精度比对,进而定位训练场景下的精度问题。 +该功能将msprobe工具dump的精度数据进行解析,还原模型图结构,实现模型各个层级的精度数据比对,方便用户理解模型结构、分析精度问题。 -[PyTorch 场景的精度比对](./docs/10.accuracy_compare_PyTorch.md) +[PyTorch 场景的分级可视化构图比对](./docs/21.visualization_PyTorch.md) -[MindSpore 场景的精度比对](./docs/11.accuracy_compare_MindSpore.md) +[MindSpore 场景的分级可视化构图比对](./docs/22.visualization_MindSpore.md) -### 4 溢出检测与解析 +### 4 精度比对 -溢出检测与解析是在执行精度数据 dump 时,判断是否存在输入正常但输出存在溢出的 API,从而判断是否为正常溢出。对应 config.json 中的 overflow_check。 +该功能进行 PyTorch 整网 API 粒度的数据 dump、精度比对,进而定位训练场景下的精度问题。 -[PyTorch 场景的溢出检测与解析](./docs/12.overflow_check_PyTorch.md) +[PyTorch 场景的精度比对](./docs/10.accuracy_compare_PyTorch.md) -[MindSpore 场景的溢出检测与解析](./docs/13.overflow_check_MindSpore.md) +[MindSpore 场景的精度比对](./docs/11.accuracy_compare_MindSpore.md) ### 5 数据解析 @@ -103,37 +127,28 @@ MindSpore 动态图场景的[离线预检](./docs/09.accuracy_checker_MindSpore. 
该功能收集和聚合模型训练过程中的网络层,优化器, 通信算子的中间值,帮助诊断模型训练过程中计算, 通信,优化器各部分出现的异常情况。 -[PyTorch 场景的训练状态监控](./docs/19.monitor.md) +[兼容 PyTorch 和 MindSpore 框架的训练状态监控](./docs/19.monitor.md) -### 10 分级可视化构图比对 +### 10 单算子API自动生成脚本 -该功能将msprobe工具dump的精度数据进行解析,还原模型图结构,实现模型各个层级的精度数据比对,方便用户理解模型结构、分析精度问题。 +该功能将msprobe工具dump的精度数据进行解析,自动生成单API脚本,用于复现整网中出现的算子问题,降低用户复现问题的成本,供开发分析算子问题。 -[PyTorch 场景的分级可视化构图比对](./docs/21.visualization_PyTorch.md) +[PyTorch 单算子API自动生成脚本](./docs/23.generate_operator_PyTorch.md) -## 🌟 新版本特性 +### 11 数码关联 -若查看历史版本特性,请点击[安装](./docs/01.installation.md)。 +该功能只支持 MindSpore 静态图场景,用于将IR图与dump数据进行关联,获取dump数据和代码调用栈的关联关系。 -【数据采集】 -- 支持 config.json 中的 step 传入范围; -- 优化了指定 step 的机制,指定 step 结束后工具不再采集数据,但训练会继续运行。工具结束运行后,日志提示信息如下: - ```bash - **************************************** - * msprobe ends successfully. * - **************************************** - ``` - 注:在多卡场景,每张卡进程训练到指定 step 之后都会打印一次上述信息。 +[MindSpore 场景的数码关联](./docs/24.code_mapping_Mindspore.md) -【精度预检】 -- 在 PyTorch 场景,支持部分 NPU 融合算子预检。 +### 12 溢出检测与解析 -【精度比对】 -- 解决了使用 MindSpore 需要安装 PyTorch 的问题。 +溢出检测与解析是在执行精度数据 dump 时,判断是否存在输入正常但输出存在溢出的 API,从而判断是否为正常溢出。对应 config.json 中的 overflow_check。 +推荐直接使用[数据采集](#1-数据采集)功能采集统计量信息检测溢出问题。 -【无标杆比对】 -- 补充在 PyTorch 场景的性能基线报告; -- 支持 MindSpore 场景的 change_value 扰动模式。 +[PyTorch 场景的溢出检测与解析](./docs/12.overflow_check_PyTorch.md) + +[MindSpore 场景的溢出检测与解析](./docs/13.overflow_check_MindSpore.md) ## 📑 补充材料 diff --git a/debug/accuracy_tools/msprobe/ccsrc/CMakeLists.txt b/debug/accuracy_tools/msprobe/ccsrc/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..2579a3a0e785c0e0ca384b4d52118a5d828249f8 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/CMakeLists.txt @@ -0,0 +1,60 @@ +project(msprobe VERSION 1.0.0 LANGUAGES CXX C) +cmake_minimum_required(VERSION 3.14) + +set(CMAKE_CXX_STANDARD 17) +set(CMAKE_CXX_STANDARD_REQUIRED ON) + +find_package(cpython MODULE REQUIRED) +find_package(openssl MODULE REQUIRED) +find_package(nlohmannjson MODULE REQUIRED) 
+find_package(protobuf MODULE REQUIRED)
+
+if (DEFINED CANN_PATH AND NOT "${CANN_PATH}" STREQUAL "")
+    file(GLOB_RECURSE DUMP_DATA_PROTOS "${CANN_PATH}/**/dump_data.proto")
+    if (DUMP_DATA_PROTOS)
+        list(GET DUMP_DATA_PROTOS 0 DUMP_DATA_PROTO)
+        file(COPY "${DUMP_DATA_PROTO}" DESTINATION "${CMAKE_CURRENT_SOURCE_DIR}/third_party/ACL/AclDumpMsg.proto")
+    else()
+        message("Warning: File dump_data.proto not found.")
+    endif()
+endif()
+
+set(PROTO_PATH ${CMAKE_CURRENT_SOURCE_DIR}/proto)
+file(GLOB_RECURSE PROTO_SRC "*.proto")
+compile_protobuf_file(
+    ${PROTO_PATH}
+    ${PROTO_SRC}
+)
+
+add_library(_msprobe_c SHARED)
+
+target_compile_options(_msprobe_c PRIVATE "-Wall")
+target_compile_options(_msprobe_c PRIVATE "-fPIC")
+target_compile_options(_msprobe_c PRIVATE "-fstack-protector-all")
+target_compile_options(_msprobe_c PRIVATE "-ftrapv")
+target_compile_options(_msprobe_c PRIVATE "-fstack-check")
+
+target_link_options(_msprobe_c PRIVATE "-Wl,-z,relro")
+target_link_options(_msprobe_c PRIVATE "-Wl,-z,now")
+target_link_options(_msprobe_c PRIVATE "-Wl,-z,noexecstack")
+
+target_link_libraries(_msprobe_c PUBLIC dl)
+target_link_libraries(_msprobe_c PUBLIC pthread)
+target_link_libraries(_msprobe_c PUBLIC ${cpython_LIBRARIES})
+target_link_libraries(_msprobe_c PUBLIC ${openssl_LIBRARIES})
+target_link_libraries(_msprobe_c PUBLIC ${protobuf_LIBRARIES})
+
+if(DEFINED BUILD_TYPE AND "${BUILD_TYPE}" STREQUAL "debug")
+    target_compile_options(_msprobe_c PRIVATE "-O0")
+    target_compile_options(_msprobe_c PRIVATE "-g")
+    target_compile_definitions(_msprobe_c PRIVATE __DEBUG__)
+else()
+    target_compile_options(_msprobe_c PRIVATE "-O2")
+endif()
+
+target_include_directories(_msprobe_c PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
+
+file(GLOB_RECURSE SOURCES "*.cpp" "*.cc")
+target_sources(_msprobe_c PRIVATE ${SOURCES})
+
+install(TARGETS _msprobe_c LIBRARY DESTINATION lib)
diff --git a/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfig.cpp b/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfig.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..9f61e03a31f6d4dfa2ca0b258d589bbcd29356fa
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfig.cpp
@@ -0,0 +1,488 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <algorithm>
+#include <map>
+#include <sstream>
+#include <vector>
+
+#include "include/ErrorCode.hpp"
+#include "include/Macro.hpp"
+#include "utils/FileUtils.hpp"
+#include "base/ErrorInfos.hpp"
+#include "DebuggerConfigFieldMap.hpp"
+#include "DebuggerConfig.hpp"
+
+namespace MindStudioDebugger {
+
+template <typename T>
+DebuggerErrno ParseJsonBaseObj2Var(const nlohmann::json& content, const std::string& field, T& output,
+                                   bool mandatory=false)
+{
+    nlohmann::json::const_iterator iter = content.find(field);
+    if (iter == content.end()) {
+        if (mandatory) {
+            return DebuggerErrno::ERROR_FIELD_NOT_EXISTS;
+        } else {
+            return DebuggerErrno::OK;
+        }
+    }
+
+    try {
+        output = iter->get<T>();
+        return DebuggerErrno::OK;
+    } catch (const nlohmann::detail::type_error& e) {
+        /* 数据类型不匹配异常 */
+        return DebuggerErrno::ERROR_INVALID_FORMAT;
+    }
+}
+
+template <typename T>
+DebuggerErrno ParseJsonStringAndTrans(const nlohmann::json& content, const std::string& field,
+    const std::map<int32_t, std::string>& enum2name, T& output, bool mandatory=false) {
+    DebuggerErrno ret;
+    std::string value;
+
+    ret = ParseJsonBaseObj2Var(content, field, value, true);
+    if (ret == DebuggerErrno::ERROR_FIELD_NOT_EXISTS && !mandatory) {
+        return DebuggerErrno::OK;
+    }
+
+    if (ret != DebuggerErrno::OK) {
+        return ret;
+    }
+
+    int32_t enumId = GetEnumIdFromName(enum2name, value);
+    if (enumId == debuggerInvalidEnum) {
+        return DebuggerErrno::ERROR_UNKNOWN_VALUE;
+    }
+
+    output = static_cast<T>(enumId);
+    return DebuggerErrno::OK;
+}
+
+#define PARSE_OPTIONAL_FIELD_CHECK_RET(content, field, output) \
+    { \
+        if (ParseJsonBaseObj2Var(content, field, output) != DebuggerErrno::OK) { \
+            LOG_ERROR(DebuggerErrno::ERROR_UNKNOWN_VALUE, \
+                      "Field " + std::string(field) + " cannot be parsed."); \
+        } \
+    }
+
+#define PARSE_OPTIONAL_FIELD_TRANS_CHECK_RET(content, field, transMap, output) \
+    { \
+        if (ParseJsonStringAndTrans(content, field, transMap, output) != DebuggerErrno::OK) { \
+            LOG_ERROR(DebuggerErrno::ERROR_UNKNOWN_VALUE, \
+                      "Value of field " + std::string(field) + " is unknown."); \
+        } \
+    }
+
+static bool DebuggerCfgParseUIntRangeGetBorder(const std::string& exp, uint32_t& left, uint32_t& right)
+{
+    if (std::count(exp.begin(), exp.end(), '-') != 1) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, "When using a range expression, it should be formatted as \"a-b\".");
+        return false;
+    }
+    std::istringstream iss(exp);
+    char dash;
+    iss >> left >> dash >> right;
+    if (iss.fail() || dash != '-') {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, "When using a range expression, it should be formatted as \"a-b\".");
+        return false;
+    }
+    if (left >= right) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT,
+                  "When using a range expression, the left border should be smaller than the right.");
+        return false;
+    }
+    return true;
+}
+
+void DebuggerCfgParseUIntRange(const nlohmann::json& content, const std::string& name, std::vector<uint32_t>& range)
+{
+    if (!content.contains(name)) {
+        return;
+    }
+
+    const nlohmann::json& array = content[name];
+    if (!array.is_array()) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, name + " should be empty or an array.");
+        return;
+    }
+
+    range.clear();
+    range.reserve(array.size());
+    std::vector<std::pair<uint32_t, uint32_t>> buf;
+    buf.reserve(array.size());
+    uint32_t realLen = 0;
+    /* a-b表示的范围可能很大,此处为了减少反复申请内存,对于a-b形式先预留空间再解析 */
+    for (const auto& element : array) {
+        if (element.is_number()) {
+            range.emplace_back(element.get<uint32_t>());
+            realLen++;
+        } else if (element.is_string()) {
+            std::string exp = element.get<std::string>();
+            uint32_t begin, end;
+            if (!DebuggerCfgParseUIntRangeGetBorder(exp, begin, end)) {
+                LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, "Failed to parse " + name + ".");
+                return;
+            }
+            realLen += (end - begin + 1);
+            buf.emplace_back(std::make_pair(begin, end));
+        }
+    }
+
+    constexpr uint32_t maxEleNum = 65536;
+    if (realLen > maxEleNum) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT,
+                  "When using a range expression in " + name + ", maximum of 65536 elements can be expressed.");
+        return;
+    }
+
+    if (!buf.empty()) {
+        range.reserve(realLen);
+        for (const auto& border : buf) {
+            for (uint32_t i = border.first; i <= border.second; ++i) {
+                range.emplace_back(i);
+            }
+        }
+    }
+    return;
+}
+
+/* 老规则此处只能指定一个task,新规则允许task列表,出于兼容性考虑,此处允许输入string或list格式 */
+void CommonCfgParseTasks(const nlohmann::json& content, std::vector<DebuggerTaskType>& tasks)
+{
+    std::vector<std::string> taskNameList;
+    std::string taskName;
+    DebuggerErrno ret;
+
+    ret = ParseJsonBaseObj2Var(content, kTask, taskName, true);
+    if (ret == DebuggerErrno::ERROR_FIELD_NOT_EXISTS) {
+        ret = ParseJsonBaseObj2Var<std::vector<std::string>>(content, kTasks, taskNameList, true);
+    } else {
+        taskNameList.emplace_back(taskName);
+    }
+
+    if (ret != DebuggerErrno::OK) {
+        LOG_ERROR(ret, "Value of field task(s) should be string or list.");
+        return;
+    }
+
+    for (auto& ele : taskNameList) {
+        int32_t enumId = GetEnumIdFromName(TaskTypeEnum2Name, ele);
+        if (enumId == debuggerInvalidEnum) {
+            LOG_WARNING(DebuggerErrno::ERROR_UNKNOWN_VALUE, "Task " + ele + " is unknown.");
+            continue;
+        }
+        if (!ELE_IN_VECTOR(tasks, static_cast<DebuggerTaskType>(enumId))) {
tasks.emplace_back(static_cast<DebuggerTaskType>(enumId));
+        }
+    }
+    return;
+}
+
+constexpr char kRegexPrefix[] = "name-regex(";
+constexpr char kRegexSuffix[] = ")";
+constexpr size_t kRegexPrefixLen = sizeof(kRegexPrefix) - 1;
+constexpr size_t kRegexSuffixLen = sizeof(kRegexSuffix) - 1;
+
+void KernelListMatcher::Parse(const std::vector<std::string>& expressions)
+{
+    for (const auto& expression : expressions) {
+        size_t len = expression.size();
+        if (len >= kRegexPrefixLen + kRegexSuffixLen &&
+            strncmp(expression.c_str(), kRegexPrefix, kRegexPrefixLen) == 0 &&
+            strncmp(expression.c_str() + (len - kRegexSuffixLen), kRegexSuffix, kRegexSuffixLen) == 0) {
+            /* name-regex(xxx) denotes a regular expression. */
+            regexList.emplace_back(expression.substr(kRegexPrefixLen, len - kRegexPrefixLen - kRegexSuffixLen));
+        } else {
+            /* Otherwise the expression is treated as a full scope name. */
+            fullNameList.emplace_back(expression);
+        }
+    }
+}
+
+std::vector<std::string> KernelListMatcher::GenRealKernelList(const char** fullKernelList) const
+{
+    std::vector<std::string> output;
+    /* An empty list means "dump everything"; a list holding one empty string means nothing matched, so dump nothing. */
+    if (this->empty() || fullKernelList == nullptr) {
+        return output;
+    }
+    output = fullNameList;
+
+    for (const auto& reg : regexList) {
+        for (const char** ss = fullKernelList; *ss != nullptr; ++ss) {
+            if (std::regex_search(*ss, reg)) {
+                output.emplace_back(*ss);
+            }
+        }
+    }
+
+    if (output.empty()) {
+        output.emplace_back("");
+        LOG_INFO("No kernel matches, so nothing will be dumped.");
+    }
+
+    return output;
+}
+
+void CommonCfg::Parse(const nlohmann::json& content)
+{
+    CommonCfgParseTasks(content, tasks);
+    if (tasks.empty()) {
+        return;
+    }
+
+    PARSE_OPTIONAL_FIELD_CHECK_RET(content, kOutputPath, outputPath);
+    outputPath = FileUtils::GetAbsPath(outputPath);
+    DebuggerCfgParseUIntRange(content, kRank, rank);
+    DebuggerCfgParseUIntRange(content, kStep, step);
+    PARSE_OPTIONAL_FIELD_TRANS_CHECK_RET(content, kLevel, DebuggerLevelEnum2Name, level);
+    PARSE_OPTIONAL_FIELD_CHECK_RET(content, kSeed, seed);
+    PARSE_OPTIONAL_FIELD_CHECK_RET(content, kIsDeterministic, isDeterministic);
+
PARSE_OPTIONAL_FIELD_CHECK_RET(content, kEnableDataloader, enableDataloader); + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kAclConfig, aclConfig); +} + +void DebuggerCfgParseDataMode(const nlohmann::json& content, DebuggerDataDirection& direction, DebuggerDataInOut& inout) +{ + std::vector buf; + bool fw, bw, in, out, all; + + direction = DebuggerDataDirection::DIRECTION_BOTH; + inout = DebuggerDataInOut::INOUT_BOTH; + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kDataMode, buf); + all = static_cast(std::find(buf.begin(), buf.end(), kDataModeAll) != buf.end()); + if (buf.empty() || all) { + return; + } + + fw = static_cast(std::find(buf.begin(), buf.end(), kDirectionForward) != buf.end()); + bw = static_cast(std::find(buf.begin(), buf.end(), kDirectionBackward) != buf.end()); + in = static_cast(std::find(buf.begin(), buf.end(), kInOutInput) != buf.end()); + out = static_cast(std::find(buf.begin(), buf.end(), kInOutOutput) != buf.end()); + + /* 互补项都配或都不配都表示both,因此关注不同的场景就行 */ + if (fw != bw) { + if (fw) { + direction = DebuggerDataDirection::DIRECTION_FORWARD; + } else { + direction = DebuggerDataDirection::DIRECTION_BACKWARD; + } + } + if (in != out) { + if (in) { + inout = DebuggerDataInOut::INOUT_INPUT; + } else { + inout = DebuggerDataInOut::INOUT_OUTPUT; + } + } + return; +} + +void StatisticsCfgParseSummary(const nlohmann::json& content, std::vector& summaryOption) +{ + /* 老规则支持"statistics"或"md5",新规则支持"max"/"min"/"l2norm"/"md5"组合,此处兼容 */ + DebuggerErrno ret; + std::string mode = kStatistics; + std::vector modeListName; + + /* 若无该字段,认为是statistic,因此这里给mode设个默认值 */ + ret = ParseJsonBaseObj2Var(content, kSummaryMode, mode); + if (ret == DebuggerErrno::OK) { + if (mode == kStatistics) { + summaryOption.push_back(DebuggerSummaryOption::MAX); + summaryOption.push_back(DebuggerSummaryOption::MIN); + summaryOption.push_back(DebuggerSummaryOption::MEAN); + summaryOption.push_back(DebuggerSummaryOption::L2NORM); + } else if (mode == kMd5) { + 
summaryOption.push_back(DebuggerSummaryOption::MD5); + } else { + LOG_ERROR(DebuggerErrno::ERROR_UNKNOWN_VALUE, "Summary mode " + mode + " is unknown."); + } + return; + } + + ret = ParseJsonBaseObj2Var>(content, kSummaryMode, modeListName); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "Value of field summary_mode should be string or list."); + return; + } + + /* 若有该字段但值为空,认为是statistic */ + if (modeListName.empty()) { + summaryOption.push_back(DebuggerSummaryOption::MAX); + summaryOption.push_back(DebuggerSummaryOption::MIN); + summaryOption.push_back(DebuggerSummaryOption::MEAN); + summaryOption.push_back(DebuggerSummaryOption::L2NORM); + return; + } + + for (auto& ele : modeListName) { + int32_t enumId = GetEnumIdFromName(SummaryOptionEnum2Name, ele); + if (enumId == debuggerInvalidEnum) { + LOG_ERROR(DebuggerErrno::ERROR_UNKNOWN_VALUE, "Summary mode " + ele + " is unknown."); + return; + } + summaryOption.push_back(static_cast(enumId)); + } + + return; +} + +void StatisticsCfg::Parse(const nlohmann::json& content) +{ + std::vector filter; + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kScope, scope); + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kList, filter); + filter.erase(std::remove_if(filter.begin(), filter.end(), + [](const std::string& s) { return s.find_first_not_of(' ') == std::string::npos; }), + filter.end()); + list = std::move(filter); + if (DebuggerConfig::GetInstance().GetDebugLevel() == DebuggerLevel::L2) { + matcher.Parse(list); + } + DebuggerCfgParseDataMode(content, direction, inout); + StatisticsCfgParseSummary(content, summaryOption); +} + +void DumpTensorCfg::Parse(const nlohmann::json& content) +{ + std::vector filter; + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kScope, scope); + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kList, filter); + filter.erase(std::remove_if(filter.begin(), filter.end(), + [](const std::string& s) { return s.find_first_not_of(' ') == std::string::npos; }), + filter.end()); + list = std::move(filter); + if 
(DebuggerConfig::GetInstance().GetDebugLevel() == DebuggerLevel::L2) { + matcher.Parse(list); + } + DebuggerCfgParseDataMode(content, direction, inout); + PARSE_OPTIONAL_FIELD_TRANS_CHECK_RET(content, kFileFormat, DumpFileFormatEnum2Name, fileFormat); + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kBackwardInput, backwardInput); +} + +void OverflowCheckCfg::Parse(const nlohmann::json& content) +{ + PARSE_OPTIONAL_FIELD_CHECK_RET(content, kOverflowNums, overflowNums); + PARSE_OPTIONAL_FIELD_TRANS_CHECK_RET(content, kCheckMode, OpCheckLevelEnum2Name, checkMode); +} + +void DebuggerConfig::Reset() +{ + LOG_INFO("Reset configuration."); + commonCfg = CommonCfg(); + statisticCfg.reset(); + dumpTensorCfg.reset(); + overflowCheckCfg.reset(); + loaded = false; +} + +void DebuggerConfig::Parse() +{ + std::ifstream cfgFile; + DebuggerErrno ret = FileUtils::OpenFile(cfgFilePath_, cfgFile); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "Failed to open file " + cfgFilePath_ + "."); + return; + } + + nlohmann::json content; + nlohmann::json::const_iterator iter; + try { + cfgFile >> content; + } catch (const nlohmann::json::parse_error& e) { + LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, "Failed to parse json file " + cfgFilePath_ + "."); + return; + } + + commonCfg.Parse(content); + +#define PARSE_SUBTASK_CONFIG(enumeration, name, member, basetype) \ + do { \ + if (ELE_IN_VECTOR(commonCfg.tasks, enumeration)) { \ + iter = content.find(name); \ + if (iter != content.end()) { \ + member = std::make_shared(); \ + member->Parse(*(iter)); \ + } \ + } \ + } while (0) + + PARSE_SUBTASK_CONFIG(DebuggerTaskType::TASK_DUMP_STATISTICS, kTaskStatistics, statisticCfg, StatisticsCfg); + PARSE_SUBTASK_CONFIG(DebuggerTaskType::TASK_DUMP_TENSOR, kTaskDumpTensor, dumpTensorCfg, DumpTensorCfg); + PARSE_SUBTASK_CONFIG(DebuggerTaskType::TASK_OVERFLOW_CHECK, kTaskOverflowCheck, overflowCheckCfg, OverflowCheckCfg); + +#undef PARSE_SUBTASK_CONFIG + return; +} + +int32_t 
DebuggerConfig::LoadConfig(const std::string& framework, const std::string& cfgFilePath) +{ + if (loaded) { + LOG_WARNING(DebuggerErrno::ERROR, "Repeated initialization, which may lead to errors."); + Reset(); + } + + cfgFilePath_ = FileUtils::GetAbsPath(cfgFilePath); + if (cfgFilePath_ == "") { + LOG_ERROR(DebuggerErrno::ERROR_CANNOT_PARSE_PATH, "Cannot parse path " + cfgFilePath + "."); + return -1; + } + + DebuggerErrno ret = FileUtils::CheckFileBeforeRead(cfgFilePath_, "r", FileType::JSON); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "Config file " + cfgFilePath + " is invalid."); + return -1; + } + + int32_t enumId = GetEnumIdFromName(FrameworkEnum2Name, framework); + if (enumId == debuggerInvalidEnum) { + LOG_ERROR(DebuggerErrno::ERROR_UNKNOWN_VALUE, "Unknown framework " + framework + "."); + return -1; + } + framework_ = static_cast(enumId); + + Parse(); + if (ErrorInfosManager::GetTopErrLevelInDuration() >= DebuggerErrLevel::LEVEL_ERROR) { + LOG_ERROR(DebuggerErrno::ERROR, "Failed to parse config file " + cfgFilePath + "."); + return -1; + } + + CheckConfigValidity(); + if (ErrorInfosManager::GetTopErrLevelInDuration() >= DebuggerErrLevel::LEVEL_ERROR) { + LOG_ERROR(DebuggerErrno::ERROR, "Config file " + cfgFilePath + " is invalid."); + return -1; + } + + loaded = true; + return 0; +} + +bool DebuggerConfig::CheckConfigValidity() +{ + if (commonCfg.tasks.empty()) { + LOG_WARNING(DebuggerErrno::ERROR, "No task configured. MsProbe will do nothing."); + return true; + } + + /* 解析时已做格式有效性校验,数值有效性放在python前端校验 */ + return true; +} + +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfig.hpp b/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfig.hpp new file mode 100644 index 0000000000000000000000000000000000000000..d56191443f8e6a7819c2bfbf402a5937bacd92ff --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfig.hpp @@ -0,0 +1,265 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. 
+ * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "include/Macro.hpp" + +namespace MindStudioDebugger { + +constexpr int debuggerInvalidEnum = -1; + +enum class DebuggerFramework { + FRAMEWORK_PYTORCH, + FRAMEWORK_MINDSPORE, + + FRAMEWORK_BUTT, +}; + +enum class DebuggerTaskType { + TASK_DUMP_TENSOR, + TASK_DUMP_STATISTICS, + TASK_OVERFLOW_CHECK, + TASK_FREE_BENCHMARK, + TASK_RUN_UT, + TASK_GRAD_PROBE, + + TASK_BUTT = debuggerInvalidEnum, +}; + +enum class DebuggerDevType { + DEVICE_TYPE_NPU, + DEVICE_TYPE_GPU, + DEVICE_TYPE_CPU, + + DEVICE_TYPE_BUTT = debuggerInvalidEnum, +}; + +enum class DebuggerLevel { + L0, + L1, + L2, + MIX, + + LEVEL_BUTT = debuggerInvalidEnum, +}; + +enum class DebuggerDataDirection { + DIRECTION_FORWARD, + DIRECTION_BACKWARD, + DIRECTION_BOTH, + + DIRECTION_BUTT = debuggerInvalidEnum, +}; + +enum class DebuggerDataInOut { + INOUT_INPUT, + INOUT_OUTPUT, + INOUT_BOTH, + + INOUT_BUTT = debuggerInvalidEnum, +}; + +enum class DebuggerDumpFileFormat { + FILE_FORMAT_BIN, + FILE_FORMAT_NPY, + + FILE_FORMAT_BUTT = debuggerInvalidEnum, +}; + +enum class DebuggerOpCheckLevel { + CHECK_LEVEL_AICORE, + CHECK_LEVEL_ATOMIC, + CHECK_LEVEL_ALL, + + CHECK_LEVEL_BUTT = debuggerInvalidEnum, +}; + +enum class DebuggerSummaryOption { + MAX, + MIN, + MEAN, + L2NORM, + NAN_CNT, + NEG_INF_CNT, + POS_INF_CNT, + MD5, + + SUMMARY_BUTT = 
debuggerInvalidEnum, +}; + +class KernelListMatcher { +public: + KernelListMatcher() = default; + ~KernelListMatcher() = default; + + void Parse(const std::vector& expressions); + std::vector GenRealKernelList(const char** fullKernelList) const; + + inline bool empty() const {return fullNameList.empty() && regexList.empty();} + inline bool needAllKernels() const {return !regexList.empty();} + +private: + std::vector fullNameList; + std::vector regexList; +}; + +/* 说明:config类作为基础的配置解析查询类,对外应该是只读的,外部仅能通过Parse接口解析配置文件,而不应该直接修改配置字段,此处用以下方式防止外部误操作 + * 1、外部统一调用单例类DebuggerConfig的Parse解析配置文件,无法创建子配置类并调用其Parse函数 + * 2、子配置类通过添加DebuggerConfig为友元类允许其调用子配置类的Parse + * 3、DebuggerConfig对外提供获取子配置类的方法,返回的是const类型指针,实现外部只读(而非将成员变量都写为private并提供get函数) + */ +class DebuggerConfig; + +class CommonCfg { +public: + friend class DebuggerConfig; + CommonCfg() = default; + ~CommonCfg() = default; + + std::vector tasks; + std::string outputPath{"./output"}; + std::vector rank; + std::vector step; + DebuggerLevel level{DebuggerLevel::L1}; + int32_t seed{1234}; + bool isDeterministic{false}; + bool enableDataloader{false}; + std::string aclConfig; + +private: + void Parse(const nlohmann::json &content); +}; + +class StatisticsCfg { +public: + friend class DebuggerConfig; + StatisticsCfg() = default; + ~StatisticsCfg() = default; + + std::vector scope; + std::vector list; + KernelListMatcher matcher; + DebuggerDataDirection direction{DebuggerDataDirection::DIRECTION_BOTH}; + DebuggerDataInOut inout{DebuggerDataInOut::INOUT_BOTH}; + std::vector summaryOption; + +private: + void Parse(const nlohmann::json &content); +}; + +class DumpTensorCfg { +public: + friend class DebuggerConfig; + DumpTensorCfg() = default; + ~DumpTensorCfg() = default; + + std::vector scope; + std::vector list; + KernelListMatcher matcher; + DebuggerDataDirection direction{DebuggerDataDirection::DIRECTION_BOTH}; + DebuggerDataInOut inout{DebuggerDataInOut::INOUT_BOTH}; + DebuggerDumpFileFormat 
fileFormat{DebuggerDumpFileFormat::FILE_FORMAT_NPY}; + std::vector backwardInput; + bool onlineRunUt{false}; + std::string nfsPath; + std::string tlsPath; + std::string host; + int32_t port{-1}; +private: + void Parse(const nlohmann::json &content); +}; + +class OverflowCheckCfg { +public: + friend class DebuggerConfig; + OverflowCheckCfg() = default; + ~OverflowCheckCfg() = default; + + int32_t overflowNums{1}; + DebuggerOpCheckLevel checkMode{DebuggerOpCheckLevel::CHECK_LEVEL_ALL}; + +private: + void Parse(const nlohmann::json &content); +}; + + +class DebuggerConfig { + +public: + static DebuggerConfig& GetInstance() { + static DebuggerConfig instance_; + return instance_; + } + + int32_t LoadConfig(const std::string& framework, const std::string& cfgFilePath); + void Reset(); + + bool IsCfgLoaded() const {return loaded;} + DebuggerFramework GetFramework() const {return framework_;} + const std::vector& GetTaskList() const {return commonCfg.tasks;} + const std::string& GetOutputPath() const {return commonCfg.outputPath;} + const std::vector& GetRankRange() const {return commonCfg.rank;}; + const std::vector& GetStepRange() const {return commonCfg.step;}; + DebuggerLevel GetDebugLevel() const {return commonCfg.level;} + int32_t GetRandSeed() const {return commonCfg.seed;} + bool IsDeterministic() const {return commonCfg.isDeterministic;} + bool IsDataloaderEnable() const {return commonCfg.enableDataloader;} + std::string GetAclConfigPath() const {return commonCfg.aclConfig;} + + std::shared_ptr GetStatisticsCfg() const + {return std::const_pointer_cast(statisticCfg);} + std::shared_ptr GetDumpTensorCfg() const + {return std::const_pointer_cast(dumpTensorCfg);} + std::shared_ptr GetOverflowCheckCfg() const + {return std::const_pointer_cast(overflowCheckCfg);} + + bool IsRankHits(uint32_t rankId) const + {return commonCfg.rank.empty() || ELE_IN_VECTOR(commonCfg.rank, rankId);} + bool IsStepHits(uint32_t stepId) const + {return commonCfg.step.empty() || 
ELE_IN_VECTOR(commonCfg.step, stepId);} + +private: + DebuggerConfig() = default; + ~DebuggerConfig() = default; + explicit DebuggerConfig(const DebuggerConfig &obj) = delete; + DebuggerConfig& operator=(const DebuggerConfig &obj) = delete; + explicit DebuggerConfig(DebuggerConfig &&obj) = delete; + DebuggerConfig& operator=(DebuggerConfig &&obj) = delete; + + void Parse(); + bool CheckConfigValidity(); + + DebuggerFramework framework_; + std::string cfgFilePath_; + bool loaded{false}; + CommonCfg commonCfg; + std::shared_ptr statisticCfg{nullptr}; + std::shared_ptr dumpTensorCfg{nullptr}; + std::shared_ptr overflowCheckCfg{nullptr}; +}; + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfigFieldMap.hpp b/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfigFieldMap.hpp new file mode 100644 index 0000000000000000000000000000000000000000..8ebef4206b42b702712edccc5b19d9611370c63b --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/base/DebuggerConfigFieldMap.hpp @@ -0,0 +1,166 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include +#include + +#include "DebuggerConfig.hpp" + +namespace MindStudioDebugger { + +constexpr const char* kFramework = "framework"; +constexpr const char* kFrameworkPyTorch = "PyTorch"; +constexpr const char* kFrameworkMindSpore = "MindSpore"; + +constexpr const char* kTaskStatistics = "statistics"; +constexpr const char* kTaskDumpTensor = "tensor"; +constexpr const char* kTaskOverflowCheck = "overflow_check"; +constexpr const char* kFreeBenchmark = "free_benchmark"; +constexpr const char* kRunUT = "run_ut"; +constexpr const char* kGradProbe = "grad_probe"; + +constexpr const char* kLevel0 = "L0"; +constexpr const char* kLevel1 = "L1"; +constexpr const char* kLevel2 = "L2"; +constexpr const char* kLevelMix = "mix"; + +constexpr const char* kDirectionForward = "forward"; +constexpr const char* kDirectionBackward = "backward"; +constexpr const char* kDirectionBoth = "both"; +constexpr const char* kInOutInput = "input"; +constexpr const char* kInOutOutput = "output"; +constexpr const char* kInOutBoth = "both"; +constexpr const char* kDataModeAll = "all"; + +constexpr const char* kFreeBenchmarkHandlerCheck = "check"; +constexpr const char* kFreeBenchmarkHandlerFix = "fix"; + +constexpr const char* kDumpFileFormatBin = "bin"; +constexpr const char* kDumpFileFormatNpy = "npy"; + +constexpr const char* kOpCheckLevelAiCore = "aicore"; +constexpr const char* kOpCheckLevelAtomic = "atomic"; +constexpr const char* kOpCheckLevelAll = "all"; + +constexpr const char* kTask = "task"; +constexpr const char* kTasks = "tasks"; +constexpr const char* kOutputPath = "dump_path"; +constexpr const char* kRank = "rank"; +constexpr const char* kStep = "step"; +constexpr const char* kLevel = "level"; +constexpr const char* kSeed = "seed"; +constexpr const char* kIsDeterministic = "is_deterministic"; +constexpr const char* kEnableDataloader = "enable_dataloader"; +constexpr const char* kAclConfig = "acl_config"; + +constexpr const char* kScope = "scope"; 
+constexpr const char* kList = "list"; + +constexpr const char* kDataMode = "data_mode"; +constexpr const char* kSummaryMode = "summary_mode"; +constexpr const char* kFileFormat = "file_format"; +constexpr const char* kOverflowNums = "overflow_nums"; +constexpr const char* kCheckMode = "check_mode"; +constexpr const char* kBackwardInput = "backward_input"; + +constexpr const char* kStatistics = "statistics"; +constexpr const char* kMd5 = "md5"; +constexpr const char* kMax = "max"; +constexpr const char* kMin = "min"; +constexpr const char* kMean = "mean"; +constexpr const char* kL2Norm = "l2norm"; +constexpr const char* kNanCount = "nan count"; +constexpr const char* kNegativeInfCount = "negative inf count"; +constexpr const char* kPositiveInfCount = "positive inf count"; + +const std::map FrameworkEnum2Name = { + {static_cast(DebuggerFramework::FRAMEWORK_PYTORCH), kFrameworkPyTorch}, + {static_cast(DebuggerFramework::FRAMEWORK_MINDSPORE), kFrameworkMindSpore}, +}; + +const std::map TaskTypeEnum2Name = { + {static_cast(DebuggerTaskType::TASK_DUMP_TENSOR), kTaskDumpTensor}, + {static_cast(DebuggerTaskType::TASK_DUMP_STATISTICS), kTaskStatistics}, + {static_cast(DebuggerTaskType::TASK_OVERFLOW_CHECK), kTaskOverflowCheck}, + {static_cast(DebuggerTaskType::TASK_FREE_BENCHMARK), kFreeBenchmark}, + {static_cast(DebuggerTaskType::TASK_RUN_UT), kRunUT}, + {static_cast(DebuggerTaskType::TASK_GRAD_PROBE), kGradProbe}, +}; + +const std::map DebuggerLevelEnum2Name = { + {static_cast(DebuggerLevel::L0), kLevel0}, + {static_cast(DebuggerLevel::L1), kLevel1}, + {static_cast(DebuggerLevel::L2), kLevel2}, + {static_cast(DebuggerLevel::MIX), kLevelMix}, +}; + +const std::map DataDirectionEnum2Name = { + {static_cast(DebuggerDataDirection::DIRECTION_FORWARD), kDirectionForward}, + {static_cast(DebuggerDataDirection::DIRECTION_BACKWARD), kDirectionBackward}, + {static_cast(DebuggerDataDirection::DIRECTION_BOTH), kDirectionBoth}, +}; + +const std::map DataInOutEnum2Name = { + 
{static_cast(DebuggerDataInOut::INOUT_INPUT), kInOutInput}, + {static_cast(DebuggerDataInOut::INOUT_OUTPUT), kInOutOutput}, + {static_cast(DebuggerDataInOut::INOUT_BOTH), kInOutBoth}, +}; + +const std::map DumpFileFormatEnum2Name = { + {static_cast(DebuggerDumpFileFormat::FILE_FORMAT_BIN), kDumpFileFormatBin}, + {static_cast(DebuggerDumpFileFormat::FILE_FORMAT_NPY), kDumpFileFormatNpy}, +}; + +const std::map OpCheckLevelEnum2Name = { + {static_cast(DebuggerOpCheckLevel::CHECK_LEVEL_AICORE), kOpCheckLevelAiCore}, + {static_cast(DebuggerOpCheckLevel::CHECK_LEVEL_ATOMIC), kOpCheckLevelAtomic}, + {static_cast(DebuggerOpCheckLevel::CHECK_LEVEL_ALL), kOpCheckLevelAll}, +}; + +const std::map SummaryOptionEnum2Name = { + {static_cast(DebuggerSummaryOption::MAX), kMax}, + {static_cast(DebuggerSummaryOption::MIN), kMin}, + {static_cast(DebuggerSummaryOption::MEAN), kMean}, + {static_cast(DebuggerSummaryOption::NAN_CNT), kNanCount}, + {static_cast(DebuggerSummaryOption::NEG_INF_CNT), kNegativeInfCount}, + {static_cast(DebuggerSummaryOption::POS_INF_CNT), kPositiveInfCount}, + {static_cast(DebuggerSummaryOption::L2NORM), kL2Norm}, + + {static_cast(DebuggerSummaryOption::MD5), kMd5}, +}; + +inline int32_t GetEnumIdFromName(const std::map& enum2name, const std::string& name) +{ + for (auto iter = enum2name.begin(); iter != enum2name.end(); iter++) { + if (iter->second == name) { + return iter->first; + } + } + return debuggerInvalidEnum; +} + +inline std::string GetNameFromEnumId(const std::map& enum2name, int32_t id) +{ + auto iter = enum2name.find(id); + if (iter == enum2name.end()) { + return "UNKNOWN"; + } + return iter->second; +} + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/base/Environment.cpp b/debug/accuracy_tools/msprobe/ccsrc/base/Environment.cpp new file mode 100644 index 0000000000000000000000000000000000000000..3a31e03cf898901767e3c658b993edc14b76e35a --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/base/Environment.cpp @@ 
-0,0 +1,90 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "utils/CPythonUtils.hpp" +#include "DebuggerConfig.hpp" +#include "Environment.hpp" + +namespace MindStudioDebugger { +namespace Environment { + +static int32_t GetRankID_PT() +{ + /* if torch.distributed.is_initialized(): + * return torch.distributed.get_rank() + */ + CPythonUtils::PythonObject torch = CPythonUtils::PythonObject::Import("torch"); + if (!torch.IsModule()) { + return -1; + } + + CPythonUtils::PythonObject distributed = torch.Get("distributed"); + if (distributed.IsNone()) { + return -1; + } + + if (!distributed.Get("is_initialized").Call()) { + return -1; + } + + CPythonUtils::PythonObject rank = distributed.Get("get_rank").Call(); + int32_t id; + if (rank.To(id) != 0) { + return -1; + } + return id; +} + +static int32_t GetRankID_MS() +{ + constexpr const char* kRankId = "RANK_ID"; + const char* rankIdEnv = getenv(kRankId); + if (rankIdEnv == nullptr) { + return -1; + } + + std::string rankId(rankIdEnv); + std::istringstream iss(rankId); + int32_t id = -1; + if (!(iss >> id) || id < 0) { + return -1; + } + + return id; +} + +int32_t GetRankID() +{ + if (!DebuggerConfig::GetInstance().IsCfgLoaded()) { + return -1; + } + + static int32_t id = -1; + if (id >= 0) { + return id; + } + + if (DebuggerConfig::GetInstance().GetFramework() == DebuggerFramework::FRAMEWORK_PYTORCH) { + id = 
GetRankID_PT();
+    } else if (DebuggerConfig::GetInstance().GetFramework() == DebuggerFramework::FRAMEWORK_MINDSPORE) {
+        id = GetRankID_MS();
+    }
+
+    return id;
+}
+
+}
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/base/Environment.hpp b/debug/accuracy_tools/msprobe/ccsrc/base/Environment.hpp
new file mode 100644
index 0000000000000000000000000000000000000000..187c6f23d32bf90602fad93765f7e916a412fb1b
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/base/Environment.hpp
@@ -0,0 +1,28 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <cstdint>
+
+namespace MindStudioDebugger {
+namespace Environment {
+
+/* -1 means the rank ID could not be obtained or has not been initialized yet. */
+int32_t GetRankID();
+
+}
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/base/ErrorInfos.cpp b/debug/accuracy_tools/msprobe/ccsrc/base/ErrorInfos.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..b07554a9fe10609ab4fa03357877b2f7630bd55e
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/base/ErrorInfos.cpp
@@ -0,0 +1,144 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cstdlib>
+#include <exception>
+#include <fstream>
+#include <iostream>
+#include <map>
+#include <mutex>
+#include <string>
+
+#include "utils/FileUtils.hpp"
+#include "ErrorInfos.hpp"
+
+namespace MindStudioDebugger {
+
+static std::mutex errInfoMtx;
+static std::ofstream logOfs;
+DebuggerErrLevel ErrorInfosManager::topLevel = DebuggerErrLevel::LEVEL_NONE;
+DebuggerErrLevel ErrorInfosManager::threshold = DebuggerErrLevel::LEVEL_INFO;
+
+static std::map<DebuggerErrLevel, std::string> ErrorLevelString = {
+    {DebuggerErrLevel::LEVEL_CRITICAL, "CRITICAL"},
+    {DebuggerErrLevel::LEVEL_ERROR, "ERROR"},
+    {DebuggerErrLevel::LEVEL_WARNING, "WARNING"},
+    {DebuggerErrLevel::LEVEL_INFO, "INFO"},
+    {DebuggerErrLevel::LEVEL_DEBUG, "DEBUG"},
+    {DebuggerErrLevel::LEVEL_NONE, "NONE"},
+};
+
+static std::map<DebuggerErrno, std::string> ErrnoString = {
+    {DebuggerErrno::OK, "OK"},
+    {DebuggerErrno::ERROR, "ERROR"},
+
+    {DebuggerErrno::ERROR_FILE_NOT_EXISTS, "FILE_NOT_EXISTS"},
+    {DebuggerErrno::ERROR_FILE_ALREADY_EXISTS, "FILE_ALREADY_EXISTS"},
+    {DebuggerErrno::ERROR_FAILED_TO_OPEN_FILE, "FAILED_TO_OPEN_FILE"},
+    {DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE, "FAILED_TO_WRITE_FILE"},
+    {DebuggerErrno::ERROR_DIR_NOT_EXISTS, "DIR_NOT_EXISTS"},
+    {DebuggerErrno::ERROR_PERMISSION_DENINED, "PERMISSION_DENIED"},
+    {DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK, "NOT_ALLOW_SOFTLINK"},
+    {DebuggerErrno::ERROR_ILLEGAL_FILE_TYPE, "ILLEGAL_FILE_TYPE"},
+    {DebuggerErrno::ERROR_PATH_TOO_LOOG, "PATH_TOO_LONG"},
+    {DebuggerErrno::ERROR_PATH_TOO_DEEP, "PATH_TOO_DEEP"},
+    {DebuggerErrno::ERROR_PATH_CONTAINS_INVALID_CHAR, "PATH_CONTAINS_INVALID_CHAR"},
+    {DebuggerErrno::ERROR_FILE_TOO_LARGE,
"FILE_TOO_LARGE"},
+    {DebuggerErrno::ERROR_UNKNOWN_FILE_SUFFIX, "UNKNOWN_FILE_SUFFIX"},
+    {DebuggerErrno::ERROR_CANNOT_PARSE_PATH, "CANNOT_PARSE_PATH"},
+
+    {DebuggerErrno::ERROR_INVALID_OPERATION, "INVALID_OPERATION"},
+    {DebuggerErrno::ERROR_INVALID_FORMAT, "INVALID_FORMAT"},
+    {DebuggerErrno::ERROR_INVALID_VALUE, "INVALID_VALUE"},
+    {DebuggerErrno::ERROR_UNKNOWN_FIELD, "UNKNOWN_FIELD"},
+    {DebuggerErrno::ERROR_UNKNOWN_VALUE, "UNKNOWN_VALUE"},
+    {DebuggerErrno::ERROR_UNKNOWN_TRANS, "UNKNOWN_TRANS"},
+    {DebuggerErrno::ERROR_FIELD_NOT_EXISTS, "FIELD_NOT_EXISTS"},
+    {DebuggerErrno::ERROR_VALUE_OVERFLOW, "VALUE_OVERFLOW"},
+
+    {DebuggerErrno::ERROR_NO_MEMORY, "NO_MEMORY"},
+    {DebuggerErrno::ERROR_BUFFER_OVERFLOW, "BUFFER_OVERFLOW"},
+    {DebuggerErrno::ERROR_SYSCALL_FAILED, "SYSCALL_FAILED"},
+    {DebuggerErrno::ERROR_OPERATION_FAILED, "OPERATION_FAILED"},
+
+    {DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND, "DEPENDENCY_NOT_FIND"},
+    {DebuggerErrno::ERROR_EXTERNAL_API_ERROR, "EXTERNAL_API_ERROR"},
+};
+
+void ErrorInfosManager::LogErrorInfo(DebuggerErrLevel level, DebuggerErrno errId, const std::string& info)
+{
+    if (level < threshold) {
+        return;
+    }
+
+    std::lock_guard<std::mutex> lk(errInfoMtx);
+    std::ostream& output = logOfs.is_open() ?
logOfs : std::cout;
+    output << "[" << ErrorLevelString[level] << "]";
+    if (errId != DebuggerErrno::NONE) {
+        output << "[" << ErrnoString[errId] << "]";
+    }
+    output << info << std::endl;
+
+    if (level > topLevel) {
+        topLevel = level;
+    }
+
+    return;
+}
+
+DebuggerErrLevel ErrorInfosManager::GetTopErrLevelInDuration()
+{
+    std::lock_guard<std::mutex> lk(errInfoMtx);
+    DebuggerErrLevel ret = topLevel;
+    topLevel = DebuggerErrLevel::LEVEL_NONE;
+    return ret;
+}
+
+void ErrorInfosManager::SetLogPath(const std::string& path)
+{
+    std::lock_guard<std::mutex> lk(errInfoMtx);
+    if (logOfs.is_open()) {
+        logOfs.close();
+    }
+
+    if (path.empty()) {
+        return;
+    }
+
+    FileUtils::OpenFile(path, logOfs);
+}
+
+__attribute__((constructor)) void InitDebuggerThreshold()
+{
+    const char* msprobeLogLevelEnv = getenv("MSPROBE_LOG_LEVEL");
+    if (msprobeLogLevelEnv == nullptr) {
+        return;
+    }
+
+    int msprobeLogLevel = 1;
+    try {
+        msprobeLogLevel = std::stoi(msprobeLogLevelEnv);
+    } catch (const std::exception& e) {
+        return;
+    }
+
+    if (msprobeLogLevel >= static_cast<int>(DebuggerErrLevel::LEVEL_DEBUG) &&
+        msprobeLogLevel <= static_cast<int>(DebuggerErrLevel::LEVEL_CRITICAL)) {
+        ErrorInfosManager::SetLogThreshold(static_cast<DebuggerErrLevel>(msprobeLogLevel));
+    }
+}
+
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/base/ErrorInfos.hpp b/debug/accuracy_tools/msprobe/ccsrc/base/ErrorInfos.hpp
new file mode 100644
index 0000000000000000000000000000000000000000..6c740a6a36cfd7692b793dfa7625789771731289
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/base/ErrorInfos.hpp
@@ -0,0 +1,78 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <cstdint>
+#include <string>
+
+#include "include/ErrorCode.hpp"
+
+namespace MindStudioDebugger {
+
+enum class DebuggerErrLevel {
+    LEVEL_NONE = -1,    /* none */
+    LEVEL_DEBUG = 0,    /* debug information only; does not affect functionality */
+    LEVEL_INFO,         /* information the user should be aware of; usually does not affect functionality */
+    LEVEL_WARNING,      /* warning; some features may be affected, but basic functionality keeps running */
+    LEVEL_ERROR,        /* an error occurred and this module cannot continue to run normally */
+    LEVEL_CRITICAL,     /* critical system-level error; execution must stop immediately and cannot be masked */
+};
+
+class ErrorInfosManager {
+public:
+    static void LogErrorInfo(DebuggerErrLevel level, DebuggerErrno errId, const std::string& info);
+    static DebuggerErrLevel GetTopErrLevelInDuration();
+    static void SetLogPath(const std::string& path);
+    static void SetLogThreshold(DebuggerErrLevel t) { threshold = t; }
+private:
+    static DebuggerErrLevel topLevel;
+    static DebuggerErrLevel threshold;
+};
+
+inline void CleanErrorInfoCache() {
+    ErrorInfosManager::GetTopErrLevelInDuration();
+}
+
+#ifdef __DEBUG__
+
+#define SOURCE_CODE_INFO \
+    ("[" + std::string(__FILE__) + ":" + std::to_string(__LINE__) + " @ " + std::string(__FUNCTION__) + "]:")
+#define LOG_CRITICAL(errid, msg) \
+    ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_CRITICAL, errid, SOURCE_CODE_INFO + (msg))
+#define LOG_ERROR(errid, msg) \
+    ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_ERROR, errid, SOURCE_CODE_INFO + (msg))
+#define LOG_WARNING(errid, msg) \
+    ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_WARNING, errid, SOURCE_CODE_INFO + (msg))
+#define LOG_INFO(msg) \
+    ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_INFO, DebuggerErrno::NONE, SOURCE_CODE_INFO + (msg))
+#define LOG_DEBUG(msg) \ + ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_DEBUG, DebuggerErrno::NONE, SOURCE_CODE_INFO + (msg)) +#define DEBUG_FUNC_TRACE() \ + ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_DEBUG, DebuggerErrno::NONE, \ + "TRACE: enter " + std::string(__FUNCTION__)) + +#else + +#define LOG_CRITICAL(errid, msg) ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_CRITICAL, errid, msg) +#define LOG_ERROR(errid, msg) ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_ERROR, errid, msg) +#define LOG_WARNING(errid, msg) ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_WARNING, errid, msg) +#define LOG_INFO(msg) ErrorInfosManager::LogErrorInfo(DebuggerErrLevel::LEVEL_INFO, DebuggerErrno::NONE, msg) +#define LOG_DEBUG(msg) +#define DEBUG_FUNC_TRACE() + +#endif + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/AclDumpDataProcessor.cpp b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumpDataProcessor.cpp new file mode 100644 index 0000000000000000000000000000000000000000..3374aa0be3117702ccde546fee14d7276d2e2c22 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumpDataProcessor.cpp @@ -0,0 +1,906 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#include <algorithm>
+#include <chrono>
+#include <cstring>
+#include <fstream>
+#include <map>
+#include <queue>
+#include <string>
+#include <thread>
+#include <vector>
+#include <fcntl.h>
+#include <sys/file.h>
+#include <unistd.h>
+#include <nlohmann/json.hpp>
+
+#include "include/Macro.hpp"
+#include "utils/FileUtils.hpp"
+#include "utils/FileOperation.hpp"
+#include "utils/DataUtils.hpp"
+#include "utils/MathUtils.hpp"
+#include "core/AclTensor.hpp"
+#include "base/ErrorInfos.hpp"
+#include "proto/AclDumpMsg.pb.h"
+#include "AclDumpDataProcessor.hpp"
+
+namespace MindStudioDebugger {
+
+namespace AclDumpMsg = toolkit::dumpdata;
+
+constexpr size_t kDhaAtomicAddInfoSize = 128;
+constexpr size_t kL2AtomicAddInfoSize = 128;
+constexpr size_t kAiCoreInfoSize = 256;
+constexpr size_t kDhaAtomicAddStatusSize = 256;
+constexpr size_t kL2AtomicAddStatusSize = 256;
+constexpr size_t kUint64Size = sizeof(uint64_t);
+constexpr const char* debugFileSign = "Opdebug.Node_OpDebug.";
+
+constexpr const char* kStatsHeaderInout = "Input/Output";
+constexpr const char* kStatsHeaderId = "Index";
+constexpr const char* kStatsHeaderDataSize = "Data Size";
+constexpr const char* kStatsHeaderDataType = "Data Type";
+constexpr const char* kStatsHeaderFormat = "Format";
+constexpr const char* kStatsHeaderShape = "Shape";
+constexpr const char* kStatsHeaderMax = "Max Value";
+constexpr const char* kStatsHeaderMin = "Min Value";
+constexpr const char* kStatsHeaderAvg = "Avg Value";
+constexpr const char* kStatsHeaderL2Norm = "l2norm";
+constexpr const char* kStatsHeaderL2NormInCsv = "L2Norm Value";
+constexpr const char* kStatsHeaderMD5 = "MD5 Value";
+constexpr const char* kStatsHeaderNan = "Nan Count";
+constexpr const char* kStatsHeaderNanInCsv = "NaN Count";
+constexpr const char* kStatsHeaderNegInf = "Negative Inf Count";
+constexpr const char* kStatsHeaderPosInf = "Positive Inf Count";
+constexpr const char* kRankId = "RANK_ID";
+constexpr const char* kDigitalNumbers = "0123456789";
+
+static const std::map<DebuggerSummaryOption, std::pair<std::string, std::string>> summaryOptionHeaderStrMap = {
+    {DebuggerSummaryOption::MAX, {kStatsHeaderMax, kStatsHeaderMax}},
+    {DebuggerSummaryOption::MIN,
{kStatsHeaderMin, kStatsHeaderMin}},
+    {DebuggerSummaryOption::MEAN, {kStatsHeaderAvg, kStatsHeaderAvg}},
+    {DebuggerSummaryOption::L2NORM, {kStatsHeaderL2Norm, kStatsHeaderL2NormInCsv}},
+    {DebuggerSummaryOption::NAN_CNT, {kStatsHeaderNan, kStatsHeaderNanInCsv}},
+    {DebuggerSummaryOption::NEG_INF_CNT, {kStatsHeaderNegInf, kStatsHeaderNegInf}},
+    {DebuggerSummaryOption::POS_INF_CNT, {kStatsHeaderPosInf, kStatsHeaderPosInf}},
+    {DebuggerSummaryOption::MD5, {kStatsHeaderMD5, kStatsHeaderMD5}},
+};
+
+const static std::map<AclDtype, AclDtype> kDtypeTransMap = {
+    {AclDtype::DT_BF16, AclDtype::DT_FLOAT},
+    {AclDtype::DT_INT4, AclDtype::DT_INT8},
+};
+
+class AclTensorStats {
+public:
+    AclTensorStats() = default;
+    explicit AclTensorStats(const AclTensorInfo& tensor, const std::map<DebuggerSummaryOption, std::string>& summary);
+    ~AclTensorStats() = default;
+
+    std::string GetCsvHeader() const;
+    std::string GetCsvValue() const;
+    std::string GetPath() const {return path;}
+    bool empty() const {return stats.empty();}
+
+    static AclTensorStats CalTensorSummary(const AclTensorInfo& tensor, const std::vector<DebuggerSummaryOption>& opt);
+    static AclTensorStats ParseTensorSummary(const std::string& dumpPath, const std::string& input);
+
+private:
+    std::string path;
+    std::string opType;
+    std::string opName;
+    std::string taskID;
+    std::string streamID;
+    std::string timestamp;
+    std::string inout;
+    std::string slot;
+    std::string dataSize;
+    std::string dataType;
+    std::string format;
+    std::string shape;
+    std::map<DebuggerSummaryOption, std::string> stats;
+
+    void ParseInfoFromDumpPath(const std::string& dumpPath);
+    std::string& operator[](DebuggerSummaryOption opt) { return stats[opt]; }
+
+    static constexpr const size_t bufferLen = 1024;
+};
+
+void AclTensorStats::ParseInfoFromDumpPath(const std::string& dumpPath)
+{
+    std::string filename;
+    if (FileUtils::GetFileSuffix(dumpPath) == "csv") {
+        filename = FileUtils::GetFileBaseName(dumpPath);
+    } else {
+        filename = FileUtils::GetFileName(dumpPath);
+    }
+
+    path = FileUtils::GetParentDir(dumpPath);
+    std::vector<std::string> tokens =
FileUtils::SplitPath(filename, '.');
+
+    /* dump file name format: {optype}.{opname}.{taskid}.{streamid}.{timestamp} */
+    if (tokens.size() < 5) {
+        LOG_WARNING(DebuggerErrno::ERROR_INVALID_FORMAT, "Skip dumping invalid op " + filename);
+        stats.clear();
+        return;
+    }
+
+    opType = std::move(tokens[0]);
+    opName = std::move(tokens[1]);
+    taskID = std::move(tokens[2]);
+    streamID = std::move(tokens[3]);
+    timestamp = std::move(tokens[4]);
+}
+
+AclTensorStats::AclTensorStats(const AclTensorInfo& tensor, const std::map<DebuggerSummaryOption, std::string>& summary)
+    : stats{summary}
+{
+    ParseInfoFromDumpPath(tensor.dumpPath);
+    /* an empty stats map means this is a header line, which does not need to be written to disk */
+    if (stats.empty()) {
+        return;
+    }
+    inout = tensor.inout;
+    slot = std::to_string(tensor.slot);
+    dataSize = std::to_string(tensor.dataSize);
+    dataType = DataUtils::GetDTypeString(tensor.dtype);
+    format = DataUtils::GetFormatString(tensor.hostFmt);
+    shape = DataUtils::GetShapeString(tensor.hostShape);
+}
+
+AclTensorStats AclTensorStats::CalTensorSummary(const AclTensorInfo& tensor, const std::vector<DebuggerSummaryOption>& opt)
+{
+    DEBUG_FUNC_TRACE();
+    std::map<DebuggerSummaryOption, std::string> summary;
+    if (ELE_IN_VECTOR(opt, DebuggerSummaryOption::MD5)) {
+        const uint8_t* data = tensor.transBuf.empty() ?
tensor.aclData : tensor.transBuf.data();
+        summary[DebuggerSummaryOption::MD5] = MathUtils::CalculateMD5(data, tensor.dataSize);
+    }
+
+    return AclTensorStats(tensor, summary);
+}
+
+static std::map<uint32_t, DebuggerSummaryOption> ParseTensorSummaryHeaderOrder(const std::vector<std::string>& segs)
+{
+    std::map<uint32_t, DebuggerSummaryOption> ret;
+    for (uint32_t pos = 0; pos < segs.size(); ++pos) {
+        const std::string& opt = segs[pos];
+        for (auto it = summaryOptionHeaderStrMap.begin(); it != summaryOptionHeaderStrMap.end(); ++it) {
+            if (opt == it->second.first) {
+                ret[pos] = it->first;
+                break;
+            }
+        }
+    }
+    return ret;
+}
+
+AclTensorStats AclTensorStats::ParseTensorSummary(const std::string& dumpPath, const std::string& input)
+{
+    constexpr const uint32_t optPosBase = 7;
+    static std::map<uint32_t, DebuggerSummaryOption> order;
+    static uint32_t headerLen = 0;
+
+    std::vector<std::string> segs = FileUtils::SplitPath(input, ',');
+    /* when statistics are computed on the device, every kernel reports its statistic items in the same order,
+       so the order only needs to be parsed once */
+    if (order.empty()) {
+        if (segs.size() <= optPosBase || segs[0] != kStatsHeaderInout) {
+            LOG_WARNING(DebuggerErrno::ERROR_INVALID_FORMAT, "Summary data miss header, some data may lose.");
+            return AclTensorStats();
+        }
+        headerLen = segs.size();
+        order = ParseTensorSummaryHeaderOrder(segs);
+
+        return AclTensorStats();
+    }
+
+    if (segs.size() < headerLen) {
+        LOG_WARNING(DebuggerErrno::ERROR_INVALID_FORMAT, "Summary data miss some fields, some data may lose.");
+        return AclTensorStats();
+    }
+
+    /* do not parse header lines repeatedly */
+    if (segs[0] == kStatsHeaderInout) {
+        return AclTensorStats();
+    }
+
+    /* device-side statistics line format: Input/Output,Index,Data Size,Data Type,Format,Shape,Count,...(statistic items) */
+    AclTensorStats stat = AclTensorStats();
+    stat.ParseInfoFromDumpPath(dumpPath);
+    stat.inout = segs[0];
+    stat.slot = segs[1];
+    stat.dataSize = segs[2];
+    stat.dataType = segs[3];
+    stat.format = segs[4];
+    stat.shape = segs[5];
+    for (auto it = order.begin(); it != order.end(); ++it) {
+        stat[it->second] = segs[it->first];
+    }
+    return stat;
+}
+
+std::string AclTensorStats::GetCsvHeader() const
+{
+    if (stats.empty()) {
return std::string();
+    }
+    std::string ret;
+    ret.reserve(bufferLen);
+    ret.append("Op Type,Op Name,Task ID,Stream ID,Timestamp,Input/Output,Slot,Data Size,Data Type,Format,Shape");
+    for (auto it = stats.begin(); it != stats.end(); it++) {
+        ret.append(",");
+        ret.append(summaryOptionHeaderStrMap.at(it->first).second);
+    }
+    ret.append("\n");
+
+    return ret;
+}
+
+std::string AclTensorStats::GetCsvValue() const
+{
+    if (stats.empty()) {
+        return std::string();
+    }
+
+    std::string ret;
+    ret.reserve(bufferLen);
+    ret.append(opType).append(",").append(opName).append(",").append(taskID).append(",").append(streamID).append(",")
+        .append(timestamp).append(",").append(inout).append(",").append(slot).append(",").append(dataSize)
+        .append(",").append(dataType).append(",").append(format).append(",").append(shape);
+    /* std::map sorts entries by key, which keeps the header and value columns in the same order; just append */
+    for (auto it = stats.begin(); it != stats.end(); it++) {
+        ret.append(",");
+        ret.append(it->second);
+    }
+    ret.append("\n");
+
+    return ret;
+}
+
+AclDumpDataProcessor::~AclDumpDataProcessor()
+{
+    while (!buffer.empty()) {
+        delete buffer.front();
+        buffer.pop();
+    }
+}
+
+std::string AclDumpDataProcessor::ToString() const
+{
+    return "AclDumpDataProcessor(path=" + dumpPath + ",completed=" + std::to_string(completed) + ",len=" +
+           std::to_string(totalLen) + ")";
+}
+
+DebuggerErrno AclDumpDataProcessor::PushData(const acldumpChunk *chunk)
+{
+    DEBUG_FUNC_TRACE();
+    if (completed) {
+        LOG_WARNING(DebuggerErrno::ERROR_INVALID_OPERATION,
+                    ToString() + " receive data when completed.
Some errors may occur.");
+        return DebuggerErrno::ERROR_INVALID_OPERATION;
+    }
+
+    /* set the completion flag first, so that an error while handling the last chunk does not leave a stale processor */
+    if (chunk->isLastChunk) {
+        completed = true;
+    }
+
+    size_t len = chunk->bufLen;
+    /* guard against integer overflow */
+    if (SIZE_MAX - len < totalLen || totalLen + len > kMaxDataLen || len == 0) {
+        LOG_ERROR(DebuggerErrno::ERROR_BUFFER_OVERFLOW, ToString() + ": buffer overflow(cached size " +
+                  std::to_string(totalLen) + ", receiving size " + std::to_string(len) + ").");
+        errorOccurred = true;
+        return DebuggerErrno::ERROR_BUFFER_OVERFLOW;
+    }
+
+    std::vector<uint8_t> *p = new std::vector<uint8_t>(len);
+    if (p == nullptr) {
+        LOG_ERROR(DebuggerErrno::ERROR_NO_MEMORY, "Acl dump data processor(" + dumpPath + "): Alloc failed(" +
+                  std::to_string(len) + " bytes).");
+        errorOccurred = true;
+        return DebuggerErrno::ERROR_NO_MEMORY;
+    }
+
+    if (memcpy(p->data(), chunk->dataBuf, len) == nullptr) {
+        LOG_ERROR(DebuggerErrno::ERROR_SYSCALL_FAILED, ToString() + ": Failed to copy data;");
+        delete p;
+        errorOccurred = true;
+        return DebuggerErrno::ERROR_SYSCALL_FAILED;
+    }
+
+    buffer.push(p);
+    totalLen += len;
+    if (!chunk->isLastChunk) {
+        return DebuggerErrno::OK;
+    }
+
+    completed = true;
+    DebuggerErrno ret = ConcatenateData();
+    if (ret != DebuggerErrno::OK) {
+        LOG_ERROR(ret, "Acl dump data processor(" + dumpPath + "): Failed to concatenate data.");
+        errorOccurred = true;
+        return ret;
+    }
+    LOG_DEBUG(ToString() + " is completed.");
+
+    return DebuggerErrno::OK;
+}
+
+DebuggerErrno AclDumpDataProcessor::ConcatenateData()
+{
+    DEBUG_FUNC_TRACE();
+    if (!completed) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_OPERATION, "Acl dump data processor(" + dumpPath +
+                  "): Data is incomplete.");
+        return DebuggerErrno::ERROR_INVALID_OPERATION;
+    }
+
+    if (buffer.empty()) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_VALUE, "Data processor(" + dumpPath + "): No data.");
+        return DebuggerErrno::ERROR_INVALID_VALUE;
+    }
+
+    /* to reduce redundant copies, merge only once here; the header is not stripped, and offsets are used to
+       locate the data segment */
+    if (buffer.size() > 1) {
std::vector<uint8_t> *p = new std::vector<uint8_t>(totalLen);
+        if (p == nullptr) {
+            LOG_ERROR(DebuggerErrno::ERROR_NO_MEMORY, "Alloc failed(" + std::to_string(totalLen) + ").");
+            return DebuggerErrno::ERROR_NO_MEMORY;
+        }
+
+        size_t offset = 0;
+        uint8_t* msg = p->data();
+        while (!buffer.empty()) {
+            if (memcpy(msg + offset, buffer.front()->data(), buffer.front()->size()) == nullptr) {
+                delete p;
+                LOG_ERROR(DebuggerErrno::ERROR_SYSCALL_FAILED, "Data processor(" + dumpPath + "): Failed to copy.");
+                return DebuggerErrno::ERROR_SYSCALL_FAILED;
+            }
+            offset += buffer.front()->size();
+            delete buffer.front();
+            buffer.pop();
+        }
+        buffer.push(p);
+    }
+
+    if (FileUtils::GetFileSuffix(dumpPath) == CSV_SUFFIX) {
+        dataSegOffset = 0;
+        dataSegLen = totalLen;
+        return DebuggerErrno::OK;
+    }
+
+    headerSegOffset = sizeof(uint64_t);
+    if (totalLen < headerSegOffset) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, "Acl dump data processor(" + dumpPath +
+                  "): Invalid data length " + std::to_string(totalLen) + ".");
+        return DebuggerErrno::ERROR_INVALID_FORMAT;
+    }
+
+    headerSegLen = *(reinterpret_cast<const uint64_t*>(buffer.front()->data()));
+    if (totalLen < headerSegOffset + headerSegLen) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, "Acl dump data processor(" + dumpPath +
+                  "): Invalid header len " + std::to_string(headerSegLen) + "/" + std::to_string(totalLen) + ".");
+        return DebuggerErrno::ERROR_INVALID_FORMAT;
+    }
+
+    dataSegOffset = headerSegOffset + headerSegLen;
+    dataSegLen = totalLen - dataSegOffset;
+    return DebuggerErrno::OK;
+}
+
+static nlohmann::json ParseOverflowInfo(const uint8_t* data)
+{
+    DEBUG_FUNC_TRACE();
+    uint32_t index = 0;
+    nlohmann::json overflowInfo;
+    uint64_t modelId = DataUtils::UnpackUint64Value_Le(data);
+    index += kUint64Size;
+    uint64_t streamId = DataUtils::UnpackUint64Value_Le(data + index);
+    index += kUint64Size;
+    uint64_t taskId = DataUtils::UnpackUint64Value_Le(data + index);
+    index += kUint64Size;
+    uint64_t taskType =
DataUtils::UnpackUint64Value_Le(data + index); + index += kUint64Size; + uint64_t pcStart = DataUtils::UnpackUint64Value_Le(data + index); + index += kUint64Size; + uint64_t paraBase = DataUtils::UnpackUint64Value_Le(data + index); + + overflowInfo["model_id"] = modelId; + overflowInfo["stream_id"] = streamId; + overflowInfo["task_id"] = taskId; + overflowInfo["task_type"] = taskType; + overflowInfo["pc_start"] = DataUtils::U64ToHexString(pcStart); + overflowInfo["para_base"] = DataUtils::U64ToHexString(paraBase); + return overflowInfo; +} + +static DebuggerErrno DumpOpDebugDataToDisk(const std::string& dumpPath, AclDumpMsg::DumpData& dumpData, + const uint8_t* data, size_t dataLen) +{ + DEBUG_FUNC_TRACE(); + std::string outPath = dumpPath + ".output."; + uint32_t num = dumpData.output().size(); + for (uint32_t slot = 0; slot < num; slot++) { + uint32_t offset = 0; + // parse DHA Atomic Add info + nlohmann::json dhaAtomicAddInfo = ParseOverflowInfo(data + offset); + offset += kDhaAtomicAddInfoSize; + // parse L2 Atomic Add info + nlohmann::json l2AtomicAddInfo = ParseOverflowInfo(data + offset); + offset += kL2AtomicAddInfoSize; + // parse AICore info + nlohmann::json aiCoreInfo = ParseOverflowInfo(data + offset); + offset += kAiCoreInfoSize; + // parse DHA Atomic Add status + dhaAtomicAddInfo["status"] = DataUtils::UnpackUint64Value_Le(data + offset); + offset += kDhaAtomicAddStatusSize; + // parse L2 Atomic Add status + l2AtomicAddInfo["status"] = DataUtils::UnpackUint64Value_Le(data + offset); + offset += kL2AtomicAddStatusSize; + // parse AICore status + uint64_t kernelCode = DataUtils::UnpackUint64Value_Le(data + offset); + offset += kUint64Size; + uint64_t blockIdx = DataUtils::UnpackUint64Value_Le(data + offset); + offset += kUint64Size; + uint64_t status = DataUtils::UnpackUint64Value_Le(data + offset); + aiCoreInfo["kernel_code"] = DataUtils::U64ToHexString(kernelCode); + aiCoreInfo["block_idx"] = blockIdx; + aiCoreInfo["status"] = status; + + 
nlohmann::json opdebugData;
+        opdebugData["DHA Atomic Add"] = dhaAtomicAddInfo;
+        opdebugData["L2 Atomic Add"] = l2AtomicAddInfo;
+        opdebugData["AI Core"] = aiCoreInfo;
+
+        // save json to file
+        std::string filePath = outPath + std::to_string(slot) + "." + JSON_SUFFIX;
+        DebuggerErrno ret = FileOperation::DumpJson(filePath, opdebugData);
+        if (ret != DebuggerErrno::OK) {
+            LOG_ERROR(ret, "Failed to dump data to " + filePath + ".");
+            return ret;
+        }
+    }
+    return DebuggerErrno::OK;
+}
+
+static DebuggerErrno ConvertFormatDeviceToHost(AclTensorInfo& tensor)
+{
+    DEBUG_FUNC_TRACE();
+    if (tensor.deviceFmt == tensor.hostFmt || AclTensor::SizeOfTensor(tensor) == 0) {
+        LOG_DEBUG(tensor + ": No need to convert format.");
+        return DebuggerErrno::OK;
+    }
+
+    DebuggerErrno ret = AclTensor::TransFormatD2H(tensor);
+    if (ret == DebuggerErrno::ERROR_UNKNOWN_TRANS) {
+        LOG_INFO("Do not support convert format from " +
+                 std::to_string(tensor.deviceFmt) + " to " + std::to_string(tensor.hostFmt) + ".");
+        tensor.hostFmt = tensor.deviceFmt;
+        return DebuggerErrno::OK;
+    }
+
+    if (ret != DebuggerErrno::OK) {
+        LOG_ERROR(ret, tensor + ": Failed to convert format.");
+        return ret;
+    }
+
+    LOG_DEBUG(tensor + ": Convert format successfully.");
+    return DebuggerErrno::OK;
+}
+
+static std::string MappingFilePath(const std::string& originPath)
+{
+    /* adump delivers at most 10 tensors at a time; operators with more inputs/outputs are split into several
+       deliveries that arrive consecutively, so the mapping of the previous delivery is cached here */
+    static std::string lastOriName;
+    static std::string lastMappingPath;
+
+    if (lastOriName == originPath && !lastMappingPath.empty()) {
+        return lastMappingPath;
+    }
+
+    std::string dir = FileUtils::GetParentDir(originPath);
+    std::string suffix = FileUtils::GetFileSuffix(originPath);
+    std::string mappingName;
+    uint32_t retry = 10;
+    constexpr uint32_t randFileNameLen = 32;
+    do {
+        mappingName = MathUtils::RandomString(randFileNameLen, '0', '9');
+        if (!suffix.empty()) {
+            mappingName.append(".").append(suffix);
+        }
+        if (!FileUtils::IsPathExist(dir + "/" + mappingName)) {
break; + } + } while (--retry); + + if (retry == 0) { + LOG_ERROR(DebuggerErrno::ERROR, "Failed to map path " + originPath + "."); + return std::string(); + } + + DebuggerErrno ret; + FileUtils::CreateDir(dir); + std::ofstream ofs; + constexpr const char* mapFileName = "mapping.csv"; + + ret = FileUtils::OpenFile(dir + "/" + mapFileName, ofs, std::ofstream::app); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(DebuggerErrno::ERROR, "Failed to open mapping file " + dir + "/" + mapFileName + "."); + return std::string(); + } + + ofs << mappingName << "," << FileUtils::GetFileName(originPath) << "\n"; + if (ofs.fail()) { + LOG_ERROR(DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE, "Failed to write file " + dir + "/" + mapFileName + "."); + ofs.close(); + return std::string(); + } + ofs.close(); + lastOriName = originPath; + lastMappingPath = dir + "/" + mappingName; + return lastMappingPath; +} + +static DebuggerErrno StandardizedDumpPath(std::string& originPath) +{ + std::string filename = FileUtils::GetFileName(originPath); + if (filename.length() <= FileUtils::FILE_NAME_MAX) { + return DebuggerErrno::OK; + } + + std::string mappingPath = MappingFilePath(originPath); + if (mappingPath.empty()) { + LOG_ERROR(DebuggerErrno::ERROR, "Failed to open mapping file " + originPath + "."); + return DebuggerErrno::ERROR; + } + + originPath = std::move(mappingPath); + return DebuggerErrno::OK; +} + +static std::string GenDataPath(const std::string& path) { + LOG_DEBUG("Original acl data path is " + path); + std::string outputPath = DebuggerConfig::GetInstance().GetOutputPath(); + std::string dataPath; + if (path.compare(0, outputPath.length(), outputPath) != 0) { + return path; + } + dataPath = path.substr(outputPath.length()); + const std::vector items = FileUtils::SplitPath(dataPath); + constexpr const size_t expectSegLen = 9; + constexpr const size_t rankIdPos = 0; + constexpr const size_t timeStampPos = 1; + constexpr const size_t stepIdPos = 2; + constexpr const size_t dataNamePos 
= 8;
+
+    if (items.size() >= expectSegLen) {
+        dataPath = outputPath;
+        if (dataPath.at(dataPath.length() - 1) != '/') {
+            dataPath.append("/");
+        }
+        /*
+         * The data path returned by the ACL interface has the following format:
+         * {dump_path}/rank_{rank_id}/{time stamp}/step_{step_id}/{time}/{device_id}/{model_name}/{model_id}/{iteration_id}/{data name}
+         * items[0] is rank_{rank_id}
+         * items[1] is {time stamp}
+         * items[2] is step_{step_id}
+         * items[8] is {data name}
+         */
+        dataPath.append(items[rankIdPos] + "/");
+        dataPath.append(items[timeStampPos] + "/");
+        dataPath.append(items[stepIdPos] + "/");
+        dataPath.append(items[dataNamePos]);
+        return dataPath;
+    }
+    return path;
+}
+
+inline std::string GetTensorInfoSuffix(AclTensorInfo& tensor)
+{
+    return "." + tensor.inout + "." + std::to_string(tensor.slot) +
+           "." + DataUtils::GetFormatString(tensor.hostFmt) + "." + DataUtils::GetDTypeString(tensor.oriDtype);
+}
+
+static DebuggerErrno DumpOneAclTensorFmtBin(AclTensorInfo& tensor)
+{
+    DebuggerErrno ret;
+    std::string dumpPathSlot = tensor.dumpPath + GetTensorInfoSuffix(tensor);
+    if (StandardizedDumpPath(dumpPathSlot) != DebuggerErrno::OK) {
+        LOG_ERROR(DebuggerErrno::ERROR, "Failed to standardize path " + dumpPathSlot + ".");
+        return DebuggerErrno::ERROR;
+    }
+
+    std::ofstream ofs;
+    ret = FileUtils::OpenFile(dumpPathSlot, ofs, std::ios::out | std::ios::binary);
+    if (ret != DebuggerErrno::OK) {
+        LOG_ERROR(ret, "Failed to open file " + dumpPathSlot + ".");
+        return ret;
+    }
+
+    ofs.write(reinterpret_cast<const char*>(tensor.aclData), tensor.dataSize);
+    if (ofs.fail()) {
+        LOG_ERROR(DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE, "Failed to write file " + dumpPathSlot + ".");
+        ret = DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE;
+    }
+    ofs.close();
+    return ret;
+}
+
+static DebuggerErrno DumpOneAclTensorFmtNpy(AclTensorInfo& tensor)
+{
+    DEBUG_FUNC_TRACE();
+    DebuggerErrno ret;
+    if (tensor.dataSize == 0) {
+        LOG_INFO(tensor + ": Data size is 0.
No need to dump.");
+        return DebuggerErrno::OK;
+    }
+
+    auto it = kDtypeTransMap.find(tensor.dtype);
+    if (it != kDtypeTransMap.end()) {
+        AclDtype dstDtype = it->second;
+        ret = AclTensor::TransDtype(tensor, dstDtype);
+        if (ret != DebuggerErrno::OK) {
+            LOG_ERROR(ret, tensor + ": Failed to transform dtype from " + DataUtils::GetDTypeString(it->first) +
+                      " to " + DataUtils::GetDTypeString(it->second) + ".");
+            return ret;
+        }
+    }
+
+    // dump_path: dump_dir/op_type.op_name.task_id.stream_id.timestamp
+    std::string dumpPathSlot = tensor.dumpPath + GetTensorInfoSuffix(tensor) + "." + NPY_SUFFIX;
+
+    if (StandardizedDumpPath(dumpPathSlot) != DebuggerErrno::OK) {
+        LOG_ERROR(DebuggerErrno::ERROR, "Failed to standardize path " + dumpPathSlot + ".");
+        return DebuggerErrno::ERROR;
+    }
+
+    if (tensor.transBuf.empty()) {
+        ret = FileOperation::DumpNpy(dumpPathSlot, tensor.aclData, tensor.dataSize, tensor.dtype, tensor.hostShape);
+    } else {
+        ret = FileOperation::DumpNpy(dumpPathSlot, tensor.transBuf.data(), tensor.transBuf.size(), tensor.dtype,
+                                     tensor.hostShape);
+    }
+
+    if (ret != DebuggerErrno::OK) {
+        LOG_ERROR(ret, tensor + ": Failed to dump as npy.");
+        return ret;
+    }
+
+    LOG_DEBUG(tensor + ": dump successfully.");
+
+    return ret;
+}
+
+static DebuggerErrno WriteOneTensorStatToDisk(const AclTensorStats& stat)
+{
+    DEBUG_FUNC_TRACE();
+    if (stat.empty()) {
+        return DebuggerErrno::OK;
+    }
+
+    std::string dumpfile = stat.GetPath() + "/statistic.csv";
+    /* a file lock is used here to avoid races between processes, hence the C-style file interface */
+    uint32_t retry = 100;
+    uint32_t interval = 10;
+    if (FileUtils::IsPathExist(dumpfile) && !FileUtils::IsRegularFile(dumpfile)) {
+        LOG_ERROR(DebuggerErrno::ERROR_FILE_ALREADY_EXISTS, "File " + dumpfile + " exists and has invalid format.");
+        return DebuggerErrno::ERROR_FILE_ALREADY_EXISTS;
+    }
+
+    int fd = open(dumpfile.c_str(), O_WRONLY | O_CREAT | O_APPEND, NORMAL_FILE_MODE_DEFAULT);
+    if (fd < 0) {
+        LOG_ERROR(DebuggerErrno::ERROR_FAILED_TO_OPEN_FILE, "Failed to open file " +
dumpfile);
+        return DebuggerErrno::ERROR_FAILED_TO_OPEN_FILE;
+    }
+
+    uint32_t i;
+    for (i = 0; i < retry; ++i) {
+        if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
+            break;
+        }
+        std::this_thread::sleep_for(std::chrono::milliseconds(interval));
+    }
+
+    if (i >= retry) {
+        LOG_ERROR(DebuggerErrno::ERROR_SYSCALL_FAILED, "Failed to occupy file " + dumpfile);
+        return DebuggerErrno::ERROR_SYSCALL_FAILED;
+    }
+
+    /* another process may have appended content while we waited for the lock, so locate the end of file again */
+    off_t offset = lseek(fd, 0, SEEK_END);
+    if (offset == 0) {
+        std::string header = stat.GetCsvHeader();
+        if (write(fd, header.c_str(), header.length()) < static_cast<ssize_t>(header.length())) {
+            LOG_ERROR(DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE, "Failed to write file " + dumpfile);
+            flock(fd, LOCK_UN);
+            close(fd);
+            return DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE;
+        }
+    }
+
+    std::string value = stat.GetCsvValue();
+    DebuggerErrno ret = DebuggerErrno::OK;
+    if (write(fd, value.c_str(), value.length()) < static_cast<ssize_t>(value.length())) {
+        LOG_ERROR(DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE, "Failed to write file " + dumpfile);
+        ret = DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE;
+    }
+
+    flock(fd, LOCK_UN);
+    close(fd);
+    return ret;
+}
+
+static DebuggerErrno DumpOneAclTensor(AclTensorInfo& tensor, std::vector<DebuggerSummaryOption>& opt)
+{
+    DEBUG_FUNC_TRACE();
+    if (tensor.dumpOriginData || !FileOperation::IsDtypeSupportByNpy(tensor.dtype)) {
+        if (kDtypeTransMap.find(tensor.dtype) == kDtypeTransMap.end()) {
+            return DumpOneAclTensorFmtBin(tensor);
+        }
+    }
+
+    DebuggerErrno ret = ConvertFormatDeviceToHost(tensor);
+    if (ret != DebuggerErrno::OK) {
+        LOG_ERROR(ret, tensor + ": Failed to convert format to host.");
+        return ret;
+    }
+
+    if (!opt.empty()) {
+        AclTensorStats stat = AclTensorStats::CalTensorSummary(tensor, opt);
+        return WriteOneTensorStatToDisk(stat);
+    }
+
+    return DumpOneAclTensorFmtNpy(tensor);
+}
+
+static void DumpAclTensor(std::vector<AclTensorInfo>::iterator begin, std::vector<AclTensorInfo>::iterator end,
+                          std::vector<DebuggerSummaryOption> opt)
+{
+    DEBUG_FUNC_TRACE();
+    DebuggerErrno
ret = DebuggerErrno::OK;
+    for (auto it = begin; it != end; it++) {
+        ret = DumpOneAclTensor(*it, opt);
+        if (ret != DebuggerErrno::OK) {
+            LOG_WARNING(ret, *it + ": Failed to dump to disk.");
+            break;
+        }
+    }
+    return;
+}
+
+static DebuggerErrno DumpTensorDataToDisk(const std::string& dumpPath, AclDumpMsg::DumpData& dumpData,
+                                          const uint8_t* data, size_t dataLen, std::vector<DebuggerSummaryOption>& opt)
+{
+    DEBUG_FUNC_TRACE();
+    std::vector<AclTensorInfo> aclTensorInfos;
+    uint64_t offset = 0;
+    uint32_t slot = 0;
+    for (auto& tensor : dumpData.input()) {
+        aclTensorInfos.push_back(AclTensor::ParseAttrsFromDumpData(dumpPath, data + offset, tensor, "input", slot));
+        offset += tensor.size();
+        slot++;
+    }
+
+    slot = 0;
+    for (auto& tensor : dumpData.output()) {
+        aclTensorInfos.push_back(AclTensor::ParseAttrsFromDumpData(dumpPath, data + offset, tensor, "output", slot));
+        offset += tensor.size();
+        slot++;
+    }
+
+    if (aclTensorInfos.empty()) {
+        return DebuggerErrno::OK;
+    }
+
+    if (offset > dataLen) {
+        LOG_ERROR(DebuggerErrno::ERROR_VALUE_OVERFLOW, dumpPath + ": offset overflow " + std::to_string(offset) + "/" +
+                  std::to_string(dataLen) + ".");
+        return DebuggerErrno::ERROR_VALUE_OVERFLOW;
+    }
+
+    /* dump serially when the tensor data is under 1MB; above 1MB dump concurrently, capped at (max threads / 4) */
+    constexpr int kMaxTensorSize = 1024 * 1024;
+    if (offset < kMaxTensorSize) {
+        DumpAclTensor(aclTensorInfos.begin(), aclTensorInfos.end(), opt);
+    } else {
+        size_t concurrent = std::max<size_t>(1, std::thread::hardware_concurrency() / 4);
+        concurrent = std::min(concurrent, aclTensorInfos.size());
+        size_t total = aclTensorInfos.size();
+        size_t batch = MathUtils::DivCeil(total, concurrent);
+        size_t cur = 0;
+        std::vector<std::thread> threads;
+        std::vector<AclTensorInfo>::iterator begin = aclTensorInfos.begin();
+
+        threads.reserve(concurrent);
+        while (cur < total) {
+            threads.emplace_back(std::thread(&DumpAclTensor, begin + cur, begin + std::min(total, cur + batch), opt));
+            cur += batch;
+        }
+
+        for (auto& t : threads) {
+            if (t.joinable()) {
+                t.join();
+            }
+        }
+    }
+
DebuggerErrLevel err = ErrorInfosManager::GetTopErrLevelInDuration(); + return err >= DebuggerErrLevel::LEVEL_ERROR ? DebuggerErrno::ERROR : DebuggerErrno::OK; +} + +static DebuggerErrno DumpStatsDataToDisk(const std::string& dumpPath, const uint8_t* data, size_t dataLen) +{ + DEBUG_FUNC_TRACE(); + constexpr const size_t maxDataSize = 10 * 1024 * 1024; + + if (dataLen > maxDataSize) { + LOG_ERROR(DebuggerErrno::ERROR_FILE_TOO_LARGE, "File " + dumpPath + " is too large to be dumped."); + return DebuggerErrno::ERROR_FILE_TOO_LARGE; + } + + std::string content(reinterpret_cast<const char*>(data), dataLen); + std::vector<std::string> lines = FileUtils::SplitPath(content, '\n'); + DebuggerErrno ret; + for (const auto& line : lines) { + if (line.empty() || line[0] == '\0') { + continue; + } + AclTensorStats stat = AclTensorStats::ParseTensorSummary(dumpPath, line); + ret = WriteOneTensorStatToDisk(stat); + if (ret != DebuggerErrno::OK) { + return ret; + } + } + + return DebuggerErrno::OK; +} + +DebuggerErrno AclDumpDataProcessor::DumpToDisk() +{ + DEBUG_FUNC_TRACE(); + if (!completed) { + LOG_ERROR(DebuggerErrno::ERROR_INVALID_OPERATION, ToString() + ": Data is incomplete."); + return DebuggerErrno::ERROR_INVALID_OPERATION; + } + + uint8_t* msg = buffer.front()->data(); + AclDumpMsg::DumpData dumpData; + if (headerSegLen > 0) { + if (!dumpData.ParseFromArray(msg + headerSegOffset, headerSegLen)) { + LOG_ERROR(DebuggerErrno::ERROR_INVALID_FORMAT, ToString() + ": Failed to parse header."); + return DebuggerErrno::ERROR_INVALID_FORMAT; + } + } + + const std::string dataPath = GenDataPath(dumpPath); + DebuggerErrno ret; + if (FileUtils::GetFileName(dumpPath).find(debugFileSign) == 0 && + DebuggerConfig::GetInstance().GetOverflowCheckCfg() != nullptr) { + ret = DumpOpDebugDataToDisk(dataPath, dumpData, msg + dataSegOffset, dataSegLen); + } else if (DebuggerConfig::GetInstance().GetStatisticsCfg() != nullptr && + hostAnalysisOpts.empty()) { + ret = DumpStatsDataToDisk(dataPath, msg + dataSegOffset, 
dataSegLen); + } else { + ret = DumpTensorDataToDisk(dataPath, dumpData, msg + dataSegOffset, dataSegLen, hostAnalysisOpts); + } + + if (ret != DebuggerErrno::OK) { + LOG_ERROR(DebuggerErrno::ERROR_OPERATION_FAILED, ToString() + ": Failed to dump to disk."); + } + + return ret; +} + +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/AclDumpDataProcessor.hpp b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumpDataProcessor.hpp new file mode 100644 index 0000000000000000000000000000000000000000..4ce2ab6e8c8709437791aba9699ec76184cb6761 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumpDataProcessor.hpp @@ -0,0 +1,59 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include <queue> +#include <string> +#include <vector> + +#include "include/ErrorCode.hpp" +#include "base/DebuggerConfig.hpp" +#include "third_party/ACL/AclApi.hpp" + +namespace MindStudioDebugger { + +constexpr size_t kMaxDataLen = 4ULL * 1024 * 1024 * 1024; + +class AclDumpDataProcessor { +public: + AclDumpDataProcessor(const std::string& path, const std::vector<DebuggerSummaryOption>& opts) : + dumpPath{path}, hostAnalysisOpts{opts} {}; + ~AclDumpDataProcessor(); + + bool IsCompleted() const {return completed;} + bool ErrorOccurred() const {return errorOccurred;} + DebuggerErrno PushData(const acldumpChunk *chunk); + DebuggerErrno DumpToDisk(); + std::string ToString() const; + +private: + DebuggerErrno ConcatenateData(); + + std::string dumpPath; + bool completed{false}; + bool errorOccurred{false}; + size_t totalLen{0}; + size_t headerSegOffset{0}; + size_t headerSegLen{0}; + size_t dataSegOffset{0}; + size_t dataSegLen{0}; + std::queue<std::vector<uint8_t>*> buffer; + std::vector<DebuggerSummaryOption> hostAnalysisOpts; +}; + +} + diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/AclDumper.cpp b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumper.cpp new file mode 100644 index 0000000000000000000000000000000000000000..805a6a7a0a24bb1fee1472698511d53beb7a35a6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumper.cpp @@ -0,0 +1,535 @@ +/* + * Copyright (C) 2024-2025. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include <climits> +#include <ctime> +#include <map> +#include <string> + +#include "include/Macro.hpp" +#include "utils/FileUtils.hpp" +#include "utils/FileOperation.hpp" +#include "third_party/ACL/AclApi.hpp" +#include "base/Environment.hpp" +#include "base/ErrorInfos.hpp" +#include "AclDumper.hpp" + +namespace MindStudioDebugger { + +constexpr const char* kAclDumpScene = "dump_scene"; +constexpr const char* kSceneNormal = "normal"; +constexpr const char* kSceneException = "lite_exception"; + +constexpr const char* kAclDumpPath = "dump_path"; +constexpr const char* kAclDumpStep = "dump_step"; + +constexpr const char* kAclDumpList = "dump_list"; +constexpr const char* kAclDumpLayer = "layer"; +constexpr const char* kAclDumpModel = "model_name"; + +constexpr const char* kAclDumpMode = "dump_mode"; +constexpr const char* kAclModeInput = "input"; +constexpr const char* kAclModeOutput = "output"; +constexpr const char* kAclModeAll = "all"; + +constexpr const char* kAclDumpOpSwitch = "dump_op_switch"; +constexpr const char* kAclDumpDebug = "dump_debug"; +constexpr const char* kAclSwitchOn = "on"; +constexpr const char* kAclSwitchOff = "off"; + +constexpr const char* kAclDumpData = "dump_data"; +constexpr const char* kAclDumpTensor = "tensor"; +constexpr const char* kAclDumpStats = "stats"; + +constexpr const char* kAclDumpStatsOpt = "dump_stats"; +constexpr const char* kAclDumpStatsMax = "Max"; +constexpr const char* kAclDumpStatsMin = "Min"; +constexpr const char* kAclDumpStatsAvg = "Avg"; +constexpr const char* kAclDumpStatsNorn = "L2norm"; +constexpr const char* kAclDumpStatsNan = "Nan"; +constexpr const char* kAclDumpStatsNegInf = "Negative Inf"; +constexpr const char* kAclDumpStatsPosInf = "Positive Inf"; + +constexpr const size_t kProcessorNumMax = 100; + +inline std::string GenAclJsonPath(const std::string& dumpPath, uint32_t rank) +{ + return std::move(dumpPath + "/acl_dump_" + std::to_string(rank) + "." 
+ JSON_SUFFIX); +} + +/* These conversion helpers resemble the DebuggerConfigFieldMap mappings, but they target ACL's own rules, which are essentially a different thing, so a separate set is kept here */ +static std::string GenDumpInoutString(DebuggerDataInOut mode) +{ + static std::map<DebuggerDataInOut, std::string> dumpModeMap = { + {DebuggerDataInOut::INOUT_INPUT, kAclModeInput}, + {DebuggerDataInOut::INOUT_OUTPUT, kAclModeOutput}, + {DebuggerDataInOut::INOUT_BOTH, kAclModeAll}, + }; + + auto it = dumpModeMap.find(mode); + if (it == dumpModeMap.end()) { + return kAclModeAll; + } else { + return it->second; + } +} + +static std::vector<std::string> GenStatsOptions(const std::vector<DebuggerSummaryOption>& options) +{ + static std::map<DebuggerSummaryOption, std::string> summaryOptMap = { + {DebuggerSummaryOption::MAX, kAclDumpStatsMax}, + {DebuggerSummaryOption::MIN, kAclDumpStatsMin}, + {DebuggerSummaryOption::MEAN, kAclDumpStatsAvg}, + {DebuggerSummaryOption::L2NORM, kAclDumpStatsNorn}, + {DebuggerSummaryOption::NAN_CNT, kAclDumpStatsNan}, + {DebuggerSummaryOption::NEG_INF_CNT, kAclDumpStatsNegInf}, + {DebuggerSummaryOption::POS_INF_CNT, kAclDumpStatsPosInf}, + }; + + std::vector<std::string> output; + for (auto& ele : options) { + auto it = summaryOptMap.find(ele); + if (it != summaryOptMap.end()) { + output.emplace_back(it->second); + } + } + return output; +} + +static std::string GenDumpPath(const std::string& path) +{ + std::string timestamp; + std::string dumpPath; + + time_t pTime; + time(&pTime); + char cTime[15]; + strftime(cTime, sizeof(cTime), "%Y%m%d%H%M%S", localtime(&pTime)); + timestamp = cTime; + + int32_t rankId = Environment::GetRankID(); + if (rankId < 0) { + rankId = 0; + } + + dumpPath = path + "/rank_" + std::to_string(rankId) + "/" + timestamp; + return dumpPath; +} + +bool AclDumper::IsIterNeedDump(uint32_t iterId) +{ + const DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + if (!cfg.IsCfgLoaded()) { + return false; + } + + return cfg.IsStepHits(iterId); +} + +bool AclDumper::IsCfgEnableAclDumper() +{ + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + if (!cfg.IsCfgLoaded() || cfg.GetDebugLevel() != DebuggerLevel::L2) { + return false; + } + 
const std::vector<DebuggerTaskType>& tasks = cfg.GetTaskList(); + return (ELE_IN_VECTOR(tasks, DebuggerTaskType::TASK_DUMP_TENSOR) || + ELE_IN_VECTOR(tasks, DebuggerTaskType::TASK_DUMP_STATISTICS) || + ELE_IN_VECTOR(tasks, DebuggerTaskType::TASK_OVERFLOW_CHECK)); +} + +bool AclDumper::IsOverflowCompleted() +{ + return overflowNums != -1 && realOverflowNums > overflowNums; +} + +void AclDumper::CountOverflowNumbers(const acldumpChunk* chunk) +{ + if (IsOverflowCompleted() || !isOverflowDump || !chunk->isLastChunk) { + return; + } + const std::string fileName = chunk->fileName; + auto separator = fileName.rfind("/"); + auto fileBaseName = fileName.substr(separator + 1); + if (fileBaseName.rfind("Opdebug.Node_OpDebug.") == 0) { + // count according to the first file: Node_OpDebug + realOverflowNums++; + } + return; +} + +std::string AclDumper::GetDumpPath(uint32_t curStep) const +{ + if (!initialized || foreDumpPath.empty()) { + return ""; + } + return foreDumpPath + "/step_" + std::to_string(curStep); +} + +DebuggerErrno AclDumper::AclDumpGenTensorJson(std::shared_ptr dumpTensorCfg, uint32_t rank, + uint32_t curStep, const char** kernels) +{ + DEBUG_FUNC_TRACE(); + nlohmann::json aclDumpJson; + bool needDump = AclDumper::IsIterNeedDump(curStep); + const std::string& dumpPath = DebuggerConfig::GetInstance().GetOutputPath(); + std::string fullDumpPath; + if (needDump) { + fullDumpPath = GetDumpPath(curStep); + FileUtils::CreateDir(fullDumpPath, true); + } else { + fullDumpPath = dumpPath; + } + + aclDumpJson[kAclDumpPath] = fullDumpPath; + aclDumpJson[kAclDumpMode] = GenDumpInoutString(dumpTensorCfg->inout); + aclDumpJson[kAclDumpData] = kAclDumpTensor; + aclDumpJson[kAclDumpList] = nlohmann::json::array(); + aclDumpJson[kAclDumpOpSwitch] = kAclSwitchOn; + + if (!needDump) { + /* Following the mindspore framework's scheme, a large number (0x7FFFFFFF) means "no dump needed"; this scheme is quite odd, so consider optimizing it later */ + aclDumpJson[kAclDumpStep] = std::to_string(INT_MAX); + } else { + std::vector<std::string> kernelsList = 
dumpTensorCfg->matcher.GenRealKernelList(kernels); + if (!kernelsList.empty()) { + aclDumpJson[kAclDumpList].push_back({{kAclDumpLayer, kernelsList}}); + } + } + + nlohmann::json content = {{"dump", aclDumpJson}}; + LOG_DEBUG("AclDumpGenTensorJson dump json to " + GenAclJsonPath(dumpPath, rank)); + return FileOperation::DumpJson(GenAclJsonPath(dumpPath, rank), content); +} + +DebuggerErrno AclDumper::AclDumpGenStatJson(std::shared_ptr statisticsCfg, uint32_t rank, + uint32_t curStep, const char** kernels) +{ + DEBUG_FUNC_TRACE(); + nlohmann::json aclDumpJson; + bool needDump = AclDumper::IsIterNeedDump(curStep); + const std::string& dumpPath = DebuggerConfig::GetInstance().GetOutputPath(); + std::string fullDumpPath; + if (needDump) { + fullDumpPath = GetDumpPath(curStep); + FileUtils::CreateDir(fullDumpPath, true); + } else { + fullDumpPath = dumpPath; + } + + aclDumpJson[kAclDumpPath] = fullDumpPath; + aclDumpJson[kAclDumpMode] = GenDumpInoutString(statisticsCfg->inout); + aclDumpJson[kAclDumpList] = nlohmann::json::array(); + aclDumpJson[kAclDumpOpSwitch] = kAclSwitchOn; + + /* If host-side analysis is needed, the task handed to acl is still a tensor dump; the tensors are then converted to statistics on the host side */ + if (!hostAnalysisOpt.empty()) { + aclDumpJson[kAclDumpData] = kAclDumpTensor; + } else { + aclDumpJson[kAclDumpData] = kAclDumpStats; + aclDumpJson[kAclDumpStatsOpt] = GenStatsOptions(statisticsCfg->summaryOption); + } + + if (!needDump) { + aclDumpJson[kAclDumpStep] = std::to_string(INT_MAX); + } else { + std::vector<std::string> kernelsList = statisticsCfg->matcher.GenRealKernelList(kernels); + if (!kernelsList.empty()) { + aclDumpJson[kAclDumpList].push_back({{kAclDumpLayer, kernelsList}}); + } + } + + nlohmann::json content = {{"dump", aclDumpJson}}; + LOG_DEBUG("AclDumpGenStatJson dump json to " + GenAclJsonPath(dumpPath, rank)); + return FileOperation::DumpJson(GenAclJsonPath(dumpPath, rank), content); +} + +DebuggerErrno AclDumper::AclDumpGenOverflowJson(std::shared_ptr overflowCfg, uint32_t rank, + uint32_t curStep) +{ + DEBUG_FUNC_TRACE(); + 
nlohmann::json aclDumpJson; + bool needDump = AclDumper::IsIterNeedDump(curStep); + const std::string& dumpPath = DebuggerConfig::GetInstance().GetOutputPath(); + std::string fullDumpPath; + if (needDump) { + fullDumpPath = GetDumpPath(curStep); + FileUtils::CreateDir(fullDumpPath, true); + } else { + fullDumpPath = dumpPath; + } + + DebuggerErrno ret = FileUtils::CreateDir(fullDumpPath, true); + if (ret != DebuggerErrno::OK) { + return ret; + } + + aclDumpJson[kAclDumpPath] = fullDumpPath; + aclDumpJson[kAclDumpDebug] = kAclSwitchOn; + if (!needDump) { + aclDumpJson[kAclDumpStep] = std::to_string(INT_MAX); + } + nlohmann::json content = {{"dump", aclDumpJson}}; + LOG_DEBUG("AclDumpGenOverflowJson dump json to " + GenAclJsonPath(dumpPath, rank)); + return FileOperation::DumpJson(GenAclJsonPath(dumpPath, rank), content); +} + +static DebuggerErrno InitAcl() +{ + DEBUG_FUNC_TRACE(); + nlohmann::json aclInitJson; + std::string aclInitJsonPath = FileUtils::GetAbsPath("./aclinit.json"); + if (aclInitJsonPath.empty()) { + LOG_ERROR(DebuggerErrno::ERROR_CANNOT_PARSE_PATH, "Failed to get full path of aclinit.json."); + return DebuggerErrno::ERROR_CANNOT_PARSE_PATH; + } + + constexpr const char* AclErrMsgOn = "1"; + aclInitJson["err_msg_mode"] = AclErrMsgOn; + LOG_DEBUG("InitAcl dump json to " + aclInitJsonPath); + FileOperation::DumpJson(aclInitJsonPath, aclInitJson); + aclError ret; + try { + ret = CALL_ACL_API(aclInit, aclInitJsonPath.c_str()); + } catch (const std::runtime_error& e) { + LOG_ERROR(DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND, "Cannot find function aclInit."); + return DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND; + } + + /* The framework may initialize ACL here as well; if a repeat-initialization error is reported, simply ignore it */ + if (ret != ACL_SUCCESS && ret != ACL_ERROR_REPEAT_INITIALIZE) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, "Failed to init acl(" + std::to_string(ret) + ")."); + return DebuggerErrno::ERROR_EXTERNAL_API_ERROR; + } + + LOG_DEBUG("InitAcl succeed"); + return DebuggerErrno::OK; +} + +int32_t 
AclDumpCallBack(const acldumpChunk* chunk, int32_t len) +{ + AclDumper& dumper = AclDumper::GetInstance(); + dumper.OnAclDumpCallBack(chunk, len); + return 0; +} + +DebuggerErrno AclDumper::Initialize() +{ + DEBUG_FUNC_TRACE(); + DebuggerErrno ret; + aclError aclRet; + const DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + std::shared_ptr statsCfg = cfg.GetStatisticsCfg(); + std::shared_ptr tensorCfg = cfg.GetDumpTensorCfg(); + std::shared_ptr overflowCheckCfg = cfg.GetOverflowCheckCfg(); + + ret = InitAcl(); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "Failed to call InitAcl."); + return ret; + } + + foreDumpPath = GenDumpPath(cfg.GetOutputPath()); + + bool needCallback = false; + if (statsCfg != nullptr) { + if (ELE_IN_VECTOR(statsCfg->summaryOption, DebuggerSummaryOption::MD5)) { + hostAnalysisOpt = {DebuggerSummaryOption::MD5}; + } + needCallback = true; + } + + if (tensorCfg != nullptr && tensorCfg->fileFormat == DebuggerDumpFileFormat::FILE_FORMAT_NPY) { + needCallback = true; + } + + if (overflowCheckCfg != nullptr) { + needCallback = true; + } + + if (needCallback) { + LOG_DEBUG("Register acl dump callback."); + /* Since aclInit succeeded above, the acldumpRegCallback symbol is assumed to exist here as well and will not throw */ + aclRet = CALL_ACL_API(acldumpRegCallback, AclDumpCallBack, 0); + if (aclRet != ACL_SUCCESS) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, + "Failed to register acldump callback(" + std::to_string(aclRet) + ")."); + return DebuggerErrno::ERROR_EXTERNAL_API_ERROR; + } + } + LOG_DEBUG("AclDumper::Initialize succeed"); + return DebuggerErrno::OK; +} + +void AclDumper::OnAclDumpCallBack(const acldumpChunk* chunk, int32_t len) +{ + DEBUG_FUNC_TRACE(); + CountOverflowNumbers(chunk); + if (IsOverflowCompleted()) { + return; + } + + std::string dumpPath = FileUtils::GetAbsPath(chunk->fileName); + auto it = dataProcessors.find(dumpPath); + if (it == dataProcessors.end()) { + if (dataProcessors.size() > kProcessorNumMax) { + LOG_ERROR(DebuggerErrno::ERROR_BUFFER_OVERFLOW, "The number 
of processors has reached the upper limit."); + return; + } + dataProcessors[dumpPath] = std::make_shared<AclDumpDataProcessor>(dumpPath, hostAnalysisOpt); + } + + std::shared_ptr<AclDumpDataProcessor> processor = dataProcessors[dumpPath]; + DebuggerErrno ret = processor->PushData(chunk); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "Failed to push data " + dumpPath + "."); + } + + LOG_DEBUG("Acl dump data processor " + dumpPath + " receive data, len=" + + std::to_string(chunk->bufLen)); + + if (!processor->IsCompleted()) { + return; + } + + if (!processor->ErrorOccurred()) { + ret = processor->DumpToDisk(); + } else { + ret = DebuggerErrno::ERROR; + } + + dataProcessors.erase(dumpPath); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "Failed to write data " + dumpPath + " to disk."); + } + return; +} + +void AclDumper::SetDump(uint32_t rank, uint32_t curStep, ExtArgs& args) +{ + DEBUG_FUNC_TRACE(); + DebuggerErrno ret; + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + if (aclDumpHasSet || !cfg.IsRankHits(rank) || !IsCfgEnableAclDumper()) { + return; + } + + if (!initialized) { + ret = Initialize(); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "AclDumper initialization failed."); + return; + } + initialized = true; + } + + /* The three tasks related to acl dump */ + std::shared_ptr dumpTensorCfg = cfg.GetDumpTensorCfg(); + std::shared_ptr statisticsCfg = cfg.GetStatisticsCfg(); + std::shared_ptr overflowCheckCfg = cfg.GetOverflowCheckCfg(); + + /* Currently only one of the three can be selected */ + const char** kernels = GetExtArgs(args, MindStudioExtensionArgs::ALL_KERNEL_NAMES); + if (dumpTensorCfg != nullptr) { + ret = AclDumpGenTensorJson(dumpTensorCfg, rank, curStep, kernels); + } else if (statisticsCfg != nullptr) { + ret = AclDumpGenStatJson(statisticsCfg, rank, curStep, kernels); + } else if (overflowCheckCfg != nullptr) { + ret = AclDumpGenOverflowJson(overflowCheckCfg, rank, curStep); + overflowNums = overflowCheckCfg->overflowNums; + isOverflowDump = true; + } + + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "AclDumper failed to 
generate cfg file."); + return; + } + + aclError aclRet; + aclRet = CALL_ACL_API(aclmdlInitDump); + if (aclRet != ACL_SUCCESS) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, + "Failed to init acldump(" + std::to_string(aclRet) + ")."); + return; + } + + const std::string& dumpPath = DebuggerConfig::GetInstance().GetOutputPath(); + aclRet = CALL_ACL_API(aclmdlSetDump, GenAclJsonPath(dumpPath, rank).c_str()); + if (aclRet != ACL_SUCCESS) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, + "Failed to enable acldump(" + std::to_string(aclRet) + ")."); + return; + } + + aclDumpHasSet = true; + return; +} + +void AclDumper::FinalizeDump(ExtArgs& args) +{ + DEBUG_FUNC_TRACE(); + if (!aclDumpHasSet) { + return; + } + + CALL_ACL_API(aclrtSynchronizeDevice); + aclError aclRet = CALL_ACL_API(aclmdlFinalizeDump); + if (aclRet != ACL_SUCCESS) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, + "Failed to finalize acldump(" + std::to_string(aclRet) + ")."); + + } + + aclDumpHasSet = false; +} + +void KernelInitDump() { + if (AscendCLApi::LoadAclApi() != DebuggerErrno::OK) { + return; + } + + DebuggerErrno ret = InitAcl(); + if (ret != DebuggerErrno::OK) { + LOG_ERROR(ret, "Failed to call InitAcl."); + return; + } + auto aclRet = CALL_ACL_API(aclmdlInitDump); + if (aclRet != ACL_SUCCESS) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, + "Failed to init acldump(" + std::to_string(aclRet) + ")."); + return; + } +} + +void KernelSetDump(const std::string &filePath) { + std::string dumpPath = FileUtils::GetAbsPath(filePath); + auto aclRet = CALL_ACL_API(aclmdlSetDump, dumpPath.c_str()); + if (aclRet != ACL_SUCCESS) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, + "Failed to enable acldump(" + std::to_string(aclRet) + ")."); + return; + } +} + +void KernelFinalizeDump() { + CALL_ACL_API(aclrtSynchronizeDevice); + auto aclRet = CALL_ACL_API(aclmdlFinalizeDump); + if (aclRet != ACL_SUCCESS) { + LOG_ERROR(DebuggerErrno::ERROR_EXTERNAL_API_ERROR, + "Failed 
to finalize acldump(" + std::to_string(aclRet) + ")."); + } +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/AclDumper.hpp b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumper.hpp new file mode 100644 index 0000000000000000000000000000000000000000..6985df65e166101c08501e5e206e003bda494b9a --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/AclDumper.hpp @@ -0,0 +1,77 @@ +/* + * Copyright (C) 2024-2025. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include <map> +#include <memory> +#include <string> +#include <vector> + +#include "include/ExtArgs.hpp" +#include "base/DebuggerConfig.hpp" +#include "AclDumpDataProcessor.hpp" + +namespace MindStudioDebugger { + +class AclDumper { +public: + static AclDumper& GetInstance() { + static AclDumper instance_; + return instance_; + } + + static bool IsIterNeedDump(uint32_t iterId); + static bool IsCfgEnableAclDumper(); + + void SetDump(uint32_t rank, uint32_t curStep, ExtArgs& args); + void FinalizeDump(ExtArgs& args); + void OnAclDumpCallBack(const acldumpChunk* chunk, int32_t len); + + std::string GetDumpPath(uint32_t curStep) const; + +private: + AclDumper() = default; + ~AclDumper() = default; + explicit AclDumper(const AclDumper &obj) = delete; + AclDumper& operator=(const AclDumper &obj) = delete; + explicit AclDumper(AclDumper &&obj) = delete; + AclDumper& operator=(AclDumper &&obj) = delete; + + DebuggerErrno Initialize(); + DebuggerErrno AclDumpGenTensorJson(std::shared_ptr dumpTensorCfg, uint32_t rank, + uint32_t curStep, const char** kernels); + DebuggerErrno AclDumpGenStatJson(std::shared_ptr statisticsCfg, uint32_t rank, + uint32_t curStep, const char** kernels); + DebuggerErrno AclDumpGenOverflowJson(std::shared_ptr overflowCfg, uint32_t rank, + uint32_t curStep); + void CountOverflowNumbers(const acldumpChunk* chunk); + bool IsOverflowCompleted(); + + bool initialized{false}; + bool aclDumpHasSet{false}; + std::string foreDumpPath; + std::vector<DebuggerSummaryOption> hostAnalysisOpt; + std::map<std::string, std::shared_ptr<AclDumpDataProcessor>> dataProcessors; + bool isOverflowDump{false}; + int32_t overflowNums{1}; + int32_t realOverflowNums{0}; +}; + +void KernelInitDump(); +void KernelSetDump(const std::string &filePath); +void KernelFinalizeDump(); +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/AclTensor.cpp b/debug/accuracy_tools/msprobe/ccsrc/core/AclTensor.cpp new file mode 100644 index 0000000000000000000000000000000000000000..4a5ec4c555198015603d7cc1446be66fda05765d --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/ccsrc/core/AclTensor.cpp @@ -0,0 +1,848 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include <cstdint> +#include <map> +#include <string> +#include <unordered_map> +#include <unordered_set> +#include <utility> +#include <vector> + +#include "utils/DataUtils.hpp" +#include "utils/MathUtils.hpp" +#include "base/ErrorInfos.hpp" +#include "AclTensor.hpp" + +namespace MindStudioDebugger { +namespace AclDumpMsg = toolkit::dumpdata; +namespace AclTensor { + +using namespace MathUtils; + +constexpr int64_t kCubeSize = 16; +constexpr int64_t kCube16 = kCubeSize; +constexpr int64_t kCube32 = 32; +constexpr int64_t kCube64 = 64; +constexpr int64_t kCubeSize_C04 = 4; + +constexpr size_t hwH = 1; +constexpr size_t hwW = 2; +constexpr size_t fnzW1 = 4; +constexpr size_t fnzH1 = 3; +constexpr size_t fnzH0 = 2; +constexpr size_t fnzW0 = 1; +constexpr size_t fzN0 = 1; +constexpr size_t fzNi = 2; +constexpr size_t fzC0 = 3; + +using TensorTransFunc = DebuggerErrno (*)(AclTensorInfo &); + +static DebuggerErrno FRAC_Z_TO_NCHW(AclTensorInfo& tensor); +static DebuggerErrno FRAC_NZ_TO_NCHW(AclTensorInfo& tensor); +static DebuggerErrno NC1HWC0_TO_NCHW(AclTensorInfo& tensor); +static DebuggerErrno NDC1HWC0_TO_NCDHW(AclTensorInfo& tensor); +static DebuggerErrno C1HWNCoC0_TO_NCHW(AclTensorInfo& tensor); +static DebuggerErrno NC1HWC0_C04_TO_NCHW(AclTensorInfo& tensor); +static DebuggerErrno FRAC_Z3D_TO_NCDHW(AclTensorInfo& tensor); 
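The conversion routines declared above de-tile NPU device layouts (such as NC1HWC0) back into host-friendly NCHW. As a rough illustration of the index arithmetic involved, here is a minimal, hypothetical de-tiling of an NC1HWC0 buffer, assuming C0-blocked channels with zero padding in the last block; the function name and signature are illustrative, not the msprobe implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: convert a flat NC1HWC0-layout buffer to NCHW.
// C1 = ceil(C / c0); channel slots beyond C inside the last c0 block
// are padding and are simply skipped.
std::vector<float> Nc1hwc0ToNchw(const std::vector<float>& src,
                                 size_t n, size_t c, size_t h, size_t w,
                                 size_t c0 = 16)
{
    size_t c1 = (c + c0 - 1) / c0;  // number of c0-sized channel blocks
    std::vector<float> dst(n * c * h * w);
    for (size_t ni = 0; ni < n; ++ni) {
        for (size_t ci = 0; ci < c; ++ci) {
            size_t c1i = ci / c0;  // which channel block
            size_t c0i = ci % c0;  // offset inside the block
            for (size_t hi = 0; hi < h; ++hi) {
                for (size_t wi = 0; wi < w; ++wi) {
                    // source index in N C1 H W C0 order
                    size_t srcIdx = (((ni * c1 + c1i) * h + hi) * w + wi) * c0 + c0i;
                    // destination index in N C H W order
                    size_t dstIdx = ((ni * c + ci) * h + hi) * w + wi;
                    dst[dstIdx] = src[srcIdx];
                }
            }
        }
    }
    return dst;
}
```

The real conversions additionally handle many more layouts (fractal Z/NZ, 3D variants) and operate on raw bytes of arbitrary dtypes, but the block-index decomposition above is the core idea they all share.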
+ +const static std::unordered_set<AclDtype> kSupportedDtypes = { + AclDtype::DT_UNDEFINED, + AclDtype::DT_FLOAT, + AclDtype::DT_FLOAT16, + AclDtype::DT_INT8, + AclDtype::DT_UINT8, + AclDtype::DT_INT16, + AclDtype::DT_UINT16, + AclDtype::DT_INT32, + AclDtype::DT_INT64, + AclDtype::DT_UINT32, + AclDtype::DT_UINT64, + AclDtype::DT_BOOL, + AclDtype::DT_DOUBLE, + AclDtype::DT_BF16, + AclDtype::DT_COMPLEX64, + AclDtype::DT_COMPLEX128, +}; + +const static std::unordered_set<AclFormat> kSupportedFormat = { + AclFormat::FORMAT_NCHW, + AclFormat::FORMAT_NHWC, + AclFormat::FORMAT_ND, + AclFormat::FORMAT_NC1HWC0, + AclFormat::FORMAT_FRACTAL_Z, + AclFormat::FORMAT_NC1HWC0_C04, + AclFormat::FORMAT_FRACTAL_Z_C04, + AclFormat::FORMAT_NC1KHKWHWC0, + AclFormat::FORMAT_HWCN, + AclFormat::FORMAT_NDHWC, + AclFormat::FORMAT_NCDHW, + AclFormat::FORMAT_DHWCN, + AclFormat::FORMAT_DHWNC, + AclFormat::FORMAT_NDC1HWC0, + AclFormat::FORMAT_FRACTAL_Z_3D, + AclFormat::FORMAT_C1HWNCoC0, + AclFormat::FORMAT_FRACTAL_NZ, + AclFormat::FORMAT_FRACTAL_ZN_LSTM, + AclFormat::FORMAT_NCL, +}; + +const static std::map<std::pair<AclFormat, AclFormat>, TensorTransFunc> formatTransFuncMap = { + /* {{from, to}, function} */ + {{AclFormat::FORMAT_HWCN, AclFormat::FORMAT_NCHW}, nullptr}, + {{AclFormat::FORMAT_NHWC, AclFormat::FORMAT_NCHW}, nullptr}, + {{AclFormat::FORMAT_FRACTAL_Z, AclFormat::FORMAT_NCHW}, FRAC_Z_TO_NCHW}, + {{AclFormat::FORMAT_FRACTAL_NZ, AclFormat::FORMAT_NCHW}, FRAC_NZ_TO_NCHW}, + {{AclFormat::FORMAT_NC1HWC0, AclFormat::FORMAT_NCHW}, NC1HWC0_TO_NCHW}, + {{AclFormat::FORMAT_NDC1HWC0, AclFormat::FORMAT_NCHW}, NDC1HWC0_TO_NCDHW}, + {{AclFormat::FORMAT_C1HWNCoC0, AclFormat::FORMAT_NCHW}, C1HWNCoC0_TO_NCHW}, + {{AclFormat::FORMAT_NC1HWC0_C04, AclFormat::FORMAT_NCHW}, NC1HWC0_C04_TO_NCHW}, + {{AclFormat::FORMAT_FRACTAL_Z_3D, AclFormat::FORMAT_NCHW}, FRAC_Z3D_TO_NCDHW}, +}; + +const static std::unordered_map<AclDumpMsg::OutputDataType, AclDtype> dtypeTransMap = { + {AclDumpMsg::OutputDataType::DT_UNDEFINED, AclDtype::DT_UNDEFINED}, + {AclDumpMsg::OutputDataType::DT_FLOAT, 
AclDtype::DT_FLOAT}, + {AclDumpMsg::OutputDataType::DT_FLOAT16, AclDtype::DT_FLOAT16}, + {AclDumpMsg::OutputDataType::DT_INT8, AclDtype::DT_INT8}, + {AclDumpMsg::OutputDataType::DT_UINT8, AclDtype::DT_UINT8}, + {AclDumpMsg::OutputDataType::DT_INT16, AclDtype::DT_INT16}, + {AclDumpMsg::OutputDataType::DT_UINT16, AclDtype::DT_UINT16}, + {AclDumpMsg::OutputDataType::DT_INT32, AclDtype::DT_INT32}, + {AclDumpMsg::OutputDataType::DT_INT64, AclDtype::DT_INT64}, + {AclDumpMsg::OutputDataType::DT_UINT32, AclDtype::DT_UINT32}, + {AclDumpMsg::OutputDataType::DT_UINT64, AclDtype::DT_UINT64}, + {AclDumpMsg::OutputDataType::DT_BOOL, AclDtype::DT_BOOL}, + {AclDumpMsg::OutputDataType::DT_DOUBLE, AclDtype::DT_DOUBLE}, + {AclDumpMsg::OutputDataType::DT_STRING, AclDtype::DT_STRING}, + {AclDumpMsg::OutputDataType::DT_DUAL_SUB_INT8, AclDtype::DT_DUAL_SUB_INT8}, + {AclDumpMsg::OutputDataType::DT_DUAL_SUB_UINT8, AclDtype::DT_DUAL_SUB_UINT8}, + {AclDumpMsg::OutputDataType::DT_COMPLEX64, AclDtype::DT_COMPLEX64}, + {AclDumpMsg::OutputDataType::DT_COMPLEX128, AclDtype::DT_COMPLEX128}, + {AclDumpMsg::OutputDataType::DT_QINT8, AclDtype::DT_QINT8}, + {AclDumpMsg::OutputDataType::DT_QINT16, AclDtype::DT_QINT16}, + {AclDumpMsg::OutputDataType::DT_QINT32, AclDtype::DT_QINT32}, + {AclDumpMsg::OutputDataType::DT_QUINT8, AclDtype::DT_QUINT8}, + {AclDumpMsg::OutputDataType::DT_QUINT16, AclDtype::DT_QUINT16}, + {AclDumpMsg::OutputDataType::DT_RESOURCE, AclDtype::DT_RESOURCE}, + {AclDumpMsg::OutputDataType::DT_STRING_REF, AclDtype::DT_STRING_REF}, + {AclDumpMsg::OutputDataType::DT_DUAL, AclDtype::DT_DUAL}, + {AclDumpMsg::OutputDataType::DT_VARIANT, AclDtype::DT_VARIANT}, + {AclDumpMsg::OutputDataType::DT_BF16, AclDtype::DT_BF16}, + {AclDumpMsg::OutputDataType::DT_INT4, AclDtype::DT_INT4}, + {AclDumpMsg::OutputDataType::DT_UINT1, AclDtype::DT_UINT1}, + {AclDumpMsg::OutputDataType::DT_INT2, AclDtype::DT_INT2}, + {AclDumpMsg::OutputDataType::DT_UINT2, AclDtype::DT_UINT2}, +}; + +const static 
std::unordered_map<AclDumpMsg::OutputFormat, AclFormat> formatTransMap = { + {AclDumpMsg::OutputFormat::FORMAT_NCHW, AclFormat::FORMAT_NCHW}, + {AclDumpMsg::OutputFormat::FORMAT_NHWC, AclFormat::FORMAT_NHWC}, + {AclDumpMsg::OutputFormat::FORMAT_ND, AclFormat::FORMAT_ND}, + {AclDumpMsg::OutputFormat::FORMAT_NC1HWC0, AclFormat::FORMAT_NC1HWC0}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_Z, AclFormat::FORMAT_FRACTAL_Z}, + {AclDumpMsg::OutputFormat::FORMAT_NC1C0HWPAD, AclFormat::FORMAT_NC1C0HWPAD}, + {AclDumpMsg::OutputFormat::FORMAT_NHWC1C0, AclFormat::FORMAT_NHWC1C0}, + {AclDumpMsg::OutputFormat::FORMAT_FSR_NCHW, AclFormat::FORMAT_FSR_NCHW}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_DECONV, AclFormat::FORMAT_FRACTAL_DECONV}, + {AclDumpMsg::OutputFormat::FORMAT_C1HWNC0, AclFormat::FORMAT_C1HWNC0}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_DECONV_TRANSPOSE, AclFormat::FORMAT_FRACTAL_DECONV_TRANSPOSE}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_DECONV_SP_STRIDE_TRANS, AclFormat::FORMAT_FRACTAL_DECONV_SP_STRIDE_TRANS}, + {AclDumpMsg::OutputFormat::FORMAT_NC1HWC0_C04, AclFormat::FORMAT_NC1HWC0_C04}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_Z_C04, AclFormat::FORMAT_FRACTAL_Z_C04}, + {AclDumpMsg::OutputFormat::FORMAT_CHWN, AclFormat::FORMAT_CHWN}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_DECONV_SP_STRIDE8_TRANS, AclFormat::FORMAT_FRACTAL_DECONV_SP_STRIDE8_TRANS}, + {AclDumpMsg::OutputFormat::FORMAT_HWCN, AclFormat::FORMAT_HWCN}, + {AclDumpMsg::OutputFormat::FORMAT_NC1KHKWHWC0, AclFormat::FORMAT_NC1KHKWHWC0}, + {AclDumpMsg::OutputFormat::FORMAT_BN_WEIGHT, AclFormat::FORMAT_BN_WEIGHT}, + {AclDumpMsg::OutputFormat::FORMAT_FILTER_HWCK, AclFormat::FORMAT_FILTER_HWCK}, + {AclDumpMsg::OutputFormat::FORMAT_HASHTABLE_LOOKUP_LOOKUPS, AclFormat::FORMAT_HASHTABLE_LOOKUP_LOOKUPS}, + {AclDumpMsg::OutputFormat::FORMAT_HASHTABLE_LOOKUP_KEYS, AclFormat::FORMAT_HASHTABLE_LOOKUP_KEYS}, + {AclDumpMsg::OutputFormat::FORMAT_HASHTABLE_LOOKUP_VALUE, AclFormat::FORMAT_HASHTABLE_LOOKUP_VALUE}, + 
{AclDumpMsg::OutputFormat::FORMAT_HASHTABLE_LOOKUP_OUTPUT, AclFormat::FORMAT_HASHTABLE_LOOKUP_OUTPUT}, + {AclDumpMsg::OutputFormat::FORMAT_HASHTABLE_LOOKUP_HITS, AclFormat::FORMAT_HASHTABLE_LOOKUP_HITS}, + {AclDumpMsg::OutputFormat::FORMAT_C1HWNCoC0, AclFormat::FORMAT_C1HWNCoC0}, + {AclDumpMsg::OutputFormat::FORMAT_MD, AclFormat::FORMAT_MD}, + {AclDumpMsg::OutputFormat::FORMAT_NDHWC, AclFormat::FORMAT_NDHWC}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_ZZ, AclFormat::FORMAT_FRACTAL_ZZ}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_NZ, AclFormat::FORMAT_FRACTAL_NZ}, + {AclDumpMsg::OutputFormat::FORMAT_NCDHW, AclFormat::FORMAT_NCDHW}, + {AclDumpMsg::OutputFormat::FORMAT_DHWCN, AclFormat::FORMAT_DHWCN}, + {AclDumpMsg::OutputFormat::FORMAT_NDC1HWC0, AclFormat::FORMAT_NDC1HWC0}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_Z_3D, AclFormat::FORMAT_FRACTAL_Z_3D}, + {AclDumpMsg::OutputFormat::FORMAT_CN, AclFormat::FORMAT_CN}, + {AclDumpMsg::OutputFormat::FORMAT_NC, AclFormat::FORMAT_NC}, + {AclDumpMsg::OutputFormat::FORMAT_DHWNC, AclFormat::FORMAT_DHWNC}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_Z_3D_TRANSPOSE, AclFormat::FORMAT_FRACTAL_Z_3D_TRANSPOSE}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_ZN_LSTM, AclFormat::FORMAT_FRACTAL_ZN_LSTM}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_Z_G, AclFormat::FORMAT_FRACTAL_Z_G}, + {AclDumpMsg::OutputFormat::FORMAT_RESERVED, AclFormat::FORMAT_RESERVED}, + {AclDumpMsg::OutputFormat::FORMAT_ALL, AclFormat::FORMAT_ALL}, + {AclDumpMsg::OutputFormat::FORMAT_NULL, AclFormat::FORMAT_NULL}, + {AclDumpMsg::OutputFormat::FORMAT_ND_RNN_BIAS, AclFormat::FORMAT_ND_RNN_BIAS}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_ZN_RNN, AclFormat::FORMAT_FRACTAL_ZN_RNN}, + {AclDumpMsg::OutputFormat::FORMAT_YUV, AclFormat::FORMAT_YUV}, + {AclDumpMsg::OutputFormat::FORMAT_YUV_A, AclFormat::FORMAT_YUV_A}, + {AclDumpMsg::OutputFormat::FORMAT_NCL, AclFormat::FORMAT_NCL}, + {AclDumpMsg::OutputFormat::FORMAT_FRACTAL_Z_WINO, AclFormat::FORMAT_FRACTAL_Z_WINO}, + 
{AclDumpMsg::OutputFormat::FORMAT_C1HWC0, AclFormat::FORMAT_C1HWC0}, +}; + +enum kAxis4D : int { kN = 0, kC, kH, kW, kNchwDims }; +enum Axis5D : int { + N_ncdhw = 0, + C_ncdhw, + D_ncdhw, + H_ncdhw, + W_ncdhw, + kNcdhw, + N_ndc1hwc0 = 0, + D_ndc1hwc0, + C1_ndc1hwc0, + H_ndc1hwc0, + W_ndc1hwc0, + C0_ndc1hwc0 +}; + +static inline AclDtype transAclDtype2MS(AclDumpMsg::OutputDataType dt) +{ + auto it = dtypeTransMap.find(dt); + if (it != dtypeTransMap.end()) { + return it->second; + } + return AclDtype::DT_MAX; +} + +static inline AclFormat transAclFormat2MS(AclDumpMsg::OutputFormat fmt) +{ + auto it = formatTransMap.find(fmt); + if (it != formatTransMap.end()) { + return it->second; + } + return AclFormat::FORMAT_MAX; +} + +static size_t EleNumOfTensor(const AclTensorInfo& tensor, bool host = true) { + size_t num = 1; + const AclShape& shape = host ? tensor.hostShape : tensor.deviceShape; + for (auto dim : shape) { + if (dim <= 0) { + /* For dynamic shape which has negative dimensions, data size should be zero. 
*/ + return 0; + } + + if (SIZE_MAX / dim < num) { + throw std::out_of_range(tensor + ": Element count overflows size_t."); + } + num *= static_cast<size_t>(dim); + } + return num; +} + +static inline size_t SizeOfAclDType(const AclTensorInfo& tensor) { + return DataUtils::SizeOfDType(tensor.dtype); +} + +static inline size_t SizeOfAclDType(const AclDtype& dtype) { + return DataUtils::SizeOfDType(dtype); +} + +size_t SizeOfTensor(const AclTensorInfo& tensor, bool host) { + size_t num = EleNumOfTensor(tensor, host); + size_t eleSize = SizeOfAclDType(tensor); + if (num != 0 && SIZE_MAX / num < eleSize) { + throw std::runtime_error(tensor + ": Tensor size overflows size_t."); + } + return num * eleSize; +} + +static inline int64_t GetCubeSizeByType(const AclDtype& dtype) { + if (dtype == AclDtype::DT_UINT8 || dtype == AclDtype::DT_INT8) { + return kCube32; + } + + if (dtype == AclDtype::DT_INT4) { + return kCube64; + } + + return kCube16; +} + +static inline void AssertDim(const AclShape& shape, size_t dim) +{ + if (shape.size() != dim) { + throw std::runtime_error("Dimension of tensor is expected to be " + std::to_string(dim) + + ", but is actually " + std::to_string(shape.size()) + "."); + } +} + +static inline void AssertConsis(const AclTensorInfo& tensor) +{ + size_t tensor_size = EleNumOfTensor(tensor, false) * SizeOfAclDType(tensor); + // Handle dtypes whose element size is less than 1 byte: + // the element count of the quantization type (qint4*2) in MindSpore must be even.
+ if (tensor.dtype == AclDtype::DT_INT4) tensor_size = EleNumOfTensor(tensor, false) / 2; + if (tensor_size != tensor.dataSize) { + throw std::runtime_error(tensor + ": The internal data of Tensor is inconsistent."); + } +} + +template <typename T> +AclTensorInfo ParseAttrsFromDumpData(const std::string& dumpPath, const uint8_t* data, const T& tensor, + const std::string& io, uint32_t slot) +{ + AclDumpMsg::OutputDataType oriDtype = tensor.data_type(); + AclDtype dtype = transAclDtype2MS(oriDtype); + bool dumpOriginData = false; + size_t dataSize = static_cast<size_t>(tensor.size()); + if (dtype == AclDtype::DT_MAX || kSupportedDtypes.find(dtype) == kSupportedDtypes.end()) { + dumpOriginData = true; + } + + AclDumpMsg::OutputFormat oriDeviceFmt = tensor.format(); + AclFormat dFmt = transAclFormat2MS(oriDeviceFmt); + if (dFmt == AclFormat::FORMAT_MAX || kSupportedFormat.find(dFmt) == kSupportedFormat.end()) { + dumpOriginData = true; + } + + AclShape dShape; + std::transform(tensor.shape().dim().begin(), tensor.shape().dim().end(), std::back_inserter(dShape), + DataUtils::SizeToS64); + AclShape hShape; + for (auto d : tensor.original_shape().dim()) { + if (d > INT64_MAX) { + LOG_WARNING(DebuggerErrno::ERROR_VALUE_OVERFLOW, + "The value(" + std::to_string(d) + ") exceeds the max value of int64_t, " + + "which may be caused by dynamic-shape operators."); + hShape.clear(); + break; + } + hShape.push_back(DataUtils::SizeToS64(d)); + } + + // Convert the device format to a host format: NCHW for 4-dimension shapes, otherwise ND.
+ AclFormat hFmt; + if (hShape.size() == kDim4) { + hFmt = AclFormat::FORMAT_NCHW; + } else if (hShape.empty()) { + hFmt = dFmt; + hShape = dShape; + LOG_WARNING(DebuggerErrno::NONE, + "Tensor(" + dumpPath + "): The host shape is empty, use device shape as host shape."); + } else { + hFmt = AclFormat::FORMAT_ND; + } + + int32_t subFormat = tensor.sub_format(); + return AclTensorInfo{dumpPath, data, dtype, dtype, dFmt, hFmt, dShape, hShape, dataSize, subFormat, io, slot, dumpOriginData}; +} + +template AclTensorInfo ParseAttrsFromDumpData<AclDumpMsg::OpOutput>( + const std::string& dumpPath, const uint8_t* data, const AclDumpMsg::OpOutput& tensor, const std::string& io, + uint32_t slot); +template AclTensorInfo ParseAttrsFromDumpData<AclDumpMsg::OpInput>( + const std::string& dumpPath, const uint8_t* data, const AclDumpMsg::OpInput& tensor, const std::string& io, + uint32_t slot); + +static inline void AllocTensorTransBuf(AclTensorInfo& tensor) +{ + tensor.transBuf.resize(SizeOfTensor(tensor)); +} + +static DebuggerErrno FRAC_Z_TO_NCHW_WITH_GROUPS(AclTensorInfo& tensor) +{ + AssertDim(tensor.hostShape, kDim4); + AssertConsis(tensor); + AllocTensorTransBuf(tensor); + + auto nDim = tensor.hostShape[kN]; + auto cDim = tensor.hostShape[kC]; + auto hDim = tensor.hostShape[kH]; + auto wDim = tensor.hostShape[kW]; + auto groups = tensor.subFormat; + auto cinOri = cDim; + auto coutOri = nDim / groups; + + if (cinOri == 0 || coutOri == 0) { + LOG_WARNING(DebuggerErrno::ERROR_INVALID_VALUE, tensor + ": cinOri/coutOri must not be 0."); + return DebuggerErrno::ERROR_INVALID_VALUE; + } + + auto cubeK = GetCubeSizeByType(tensor.dtype); + auto eMult = std::min(Lcm(Lcm(cinOri, cubeK) / cinOri, Lcm(coutOri, kCubeSize) / coutOri), + static_cast<int64_t>(groups)); + if (eMult == 0) { + LOG_WARNING(DebuggerErrno::ERROR_INVALID_VALUE, + tensor + ": The value of eMult must be greater than 0."); + return DebuggerErrno::ERROR_INVALID_VALUE; + } + + auto cinOpt = AlignCeil(eMult * cinOri, cubeK); + auto coutOpt = AlignCeil(eMult *
coutOri, kCubeSize); + auto c1Dim = cinOpt / cubeK; + const uint8_t* src = tensor.aclData; + uint8_t* dst = tensor.transBuf.data(); + auto dtypeSize = SizeOfAclDType(tensor); + + for (int64_t g = 0; g < groups; ++g) { + for (int64_t c = 0; c < cDim; ++c) { + for (int64_t h = 0; h < hDim; ++h) { + for (int64_t w = 0; w < wDim; ++w) { + for (int64_t n = 0; n < coutOri; ++n) { + int64_t eVal = g % eMult; + int64_t dstCi = eVal * cinOri + c; + int64_t dstCo = eVal * coutOri + n; + int64_t srcCo = g * coutOri + n; + int64_t temporary = dstCi % cubeK; + int64_t devIdx = (g / eMult) * c1Dim * hDim * wDim * coutOpt * cubeK + + (dstCi / cubeK) * hDim * wDim * coutOpt * cubeK + h * wDim * coutOpt * cubeK + + w * coutOpt * cubeK + dstCo * cubeK + temporary; + int64_t hstIdx = srcCo * cDim * hDim * wDim + c * hDim * wDim + h * wDim + w; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + hstIdx * dtypeSize, src + devIdx * dtypeSize, dtypeSize); + } + } + } + } + } + return DebuggerErrno::OK; +} + +static DebuggerErrno FRAC_Z_TO_NCHW(AclTensorInfo& tensor) +{ + if (tensor.subFormat > 1) { + return FRAC_Z_TO_NCHW_WITH_GROUPS(tensor); + } + + AssertDim(tensor.hostShape, kDim4); + AssertConsis(tensor); + AllocTensorTransBuf(tensor); + + auto n0 = tensor.deviceShape.at(fzN0); + auto ni = tensor.deviceShape.at(fzNi); + auto c0 = tensor.deviceShape.at(fzC0); + auto n = tensor.hostShape[kN]; + auto c = tensor.hostShape[kC]; + auto h = tensor.hostShape[kH]; + auto w = tensor.hostShape[kW]; + auto nc = ni * n0; + auto ncc0 = nc * c0; + auto wncc0 = w * ncc0; + auto hwncc0 = h * wncc0; + auto hw = h * w; + auto chw = c * hw; + + if (c0 == 0) { + return DebuggerErrno::ERROR_INVALID_VALUE; + } + + const uint8_t* src = tensor.aclData; + uint8_t* dst = tensor.transBuf.data(); + auto dtypeSize = SizeOfAclDType(tensor); + for (int64_t nIdx = 0; nIdx < n; nIdx++) { + int64_t nHeadAddr = nIdx * chw; + for (int64_t cIdx = 0; cIdx < c; cIdx++) { + int64_t cHeadAddr = nHeadAddr + cIdx * hw; + for
(int64_t hIdx = 0; hIdx < h; hIdx++) { + int64_t hHeadAddr = cHeadAddr + hIdx * w; + for (int64_t wIdx = 0; wIdx < w; wIdx++) { + auto dstIdx = hHeadAddr + wIdx; + auto c1Idx = cIdx / c0; + auto c0Idx = cIdx % c0; + auto ncIdx = nIdx; + auto srcIdx = c1Idx * hwncc0 + hIdx * wncc0 + wIdx * ncc0 + ncIdx * c0 + c0Idx; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + dstIdx * dtypeSize, src + srcIdx * dtypeSize, dtypeSize); + } + } + } + } + return DebuggerErrno::OK; +} + +static void TransShapeToHwNz(const AclShape &hostShape, AclShape& hwShape) +{ + if (hostShape.size() == kDim1) { + hwShape.push_back(1); + hwShape.push_back(1); + hwShape.push_back(hostShape[0]); + return; + } + auto size = hostShape.size(); + int64_t times = 1; + for (size_t i = 0; i != size - kDim2; i++) { + times *= hostShape[i]; + } + hwShape.push_back(times); + hwShape.push_back(hostShape[size - kDim2]); + hwShape.push_back(hostShape[size - kDim1]); +} + +static DebuggerErrno FRAC_NZ_TO_NCHW(AclTensorInfo& tensor) +{ + AssertConsis(tensor); + AllocTensorTransBuf(tensor); + + AclShape hwShape; + TransShapeToHwNz(tensor.hostShape, hwShape); + auto times = hwShape.at(0); + auto h = hwShape.at(hwH); + auto w = hwShape.at(hwW); + auto hw = h * w; + + auto shapeSize = tensor.deviceShape.size(); + if (shapeSize < kDim4) { + LOG_WARNING(DebuggerErrno::ERROR_INVALID_VALUE, tensor + ": Invalid shape size."); + return DebuggerErrno::ERROR_INVALID_VALUE; + } + + auto w1 = tensor.deviceShape[shapeSize - fnzW1]; + auto h1 = tensor.deviceShape[shapeSize - fnzH1]; + auto h0 = tensor.deviceShape[shapeSize - fnzH0]; + auto w0 = tensor.deviceShape[shapeSize - fnzW0]; + auto h1h0w0 = h1 * h0 * w0; + auto w1h1h0w0 = w1 * h1h0w0; + auto numW1 = w / w0; + + const uint8_t* src = tensor.aclData; + uint8_t* dst = tensor.transBuf.data(); + auto dtypeSize = SizeOfAclDType(tensor); + + for (int64_t timesIdx = 0; timesIdx < times; timesIdx++) { + auto timesHead = timesIdx * w1h1h0w0; + auto srcTimesHead = timesIdx * hw; + for
(int64_t h1h0Idx = 0; h1h0Idx < h; h1h0Idx++) { + auto h1h0Head = timesHead + h1h0Idx * w0; + auto srcHHead = srcTimesHead + h1h0Idx * w; + for (int64_t w1Idx = 0; w1Idx < numW1; w1Idx++) { + for (int64_t i = 0; i < w0; ++i) { + int64_t srcIdx = h1h0Head + w1Idx * h1h0w0 + i; + int64_t dstIdx = srcHHead + w1Idx * w0 + i; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + dstIdx * dtypeSize, src + srcIdx * dtypeSize, dtypeSize); + } + } + auto w1Head = numW1 * w0; + for (int64_t w0Idx = 0; w1Head + w0Idx < w; w0Idx++) { + auto srcWIdx = w1Head + w0Idx; + int64_t srcIdx = h1h0Head + numW1 * h1h0w0 + w0Idx; + int64_t dstIdx = srcHHead + srcWIdx; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + dstIdx * dtypeSize, src + srcIdx * dtypeSize, dtypeSize); + } + } + } + return DebuggerErrno::OK; +} + +static DebuggerErrno NC1HWC0_TO_NCHW(AclTensorInfo& tensor) +{ + AssertDim(tensor.hostShape, kDim4); + AssertConsis(tensor); + AllocTensorTransBuf(tensor); + + auto n = tensor.hostShape[kN]; + auto c = tensor.hostShape[kC]; + auto h = tensor.hostShape[kH]; + auto w = tensor.hostShape[kW]; + auto c1 = tensor.deviceShape[kDim1]; + auto c0 = tensor.deviceShape[kDim4]; + + auto hw = h * w; + auto chw = c * hw; + auto wc0 = w * c0; + auto hwc0 = h * wc0; + auto c1hwc0 = c1 * hwc0; + + const uint8_t* src = tensor.aclData; + uint8_t* dst = tensor.transBuf.data(); + auto dtypeSize = SizeOfAclDType(tensor); + for (int64_t nIndex = 0; nIndex < n; nIndex++) { + int64_t nHeadAddr = nIndex * chw; + for (int64_t cIndex = 0; cIndex < c; cIndex++) { + int64_t cHeadAddr = nHeadAddr + cIndex * hw; + for (int64_t hIndex = 0; hIndex < h; hIndex++) { + int64_t hHeadAddr = cHeadAddr + hIndex * w; + for (int64_t wIndex = 0; wIndex < w; wIndex++) { + int64_t dstIdx = hHeadAddr + wIndex; + int64_t c1Index = cIndex / c0; + int64_t c0Index = cIndex % c0; + int64_t srcIdx = nIndex * c1hwc0 + c1Index * hwc0 + hIndex * wc0 + wIndex * c0 + c0Index; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + dstIdx * dtypeSize,
src + srcIdx * dtypeSize, dtypeSize); + } + } + } + } + return DebuggerErrno::OK; +} + +static DebuggerErrno NDC1HWC0_TO_NCDHW(AclTensorInfo& tensor) +{ + AssertDim(tensor.hostShape, kDim5); + AssertConsis(tensor); + AllocTensorTransBuf(tensor); + + auto n = tensor.hostShape[N_ncdhw]; + auto c = tensor.hostShape[C_ncdhw]; + auto d = tensor.hostShape[D_ncdhw]; + auto h = tensor.hostShape[H_ncdhw]; + auto w = tensor.hostShape[W_ncdhw]; + auto c1 = tensor.deviceShape[C1_ndc1hwc0]; + auto c0 = tensor.deviceShape[C0_ndc1hwc0]; + + const int64_t cdhw = c * d * h * w; + const int64_t dhw = d * h * w; + const int64_t hw = h * w; + const int64_t dc1hwc0 = d * c1 * h * w * c0; + const int64_t c1hwc0 = c1 * h * w * c0; + const int64_t hwc0 = h * w * c0; + const int64_t wc0 = w * c0; + + const uint8_t* src = tensor.aclData; + uint8_t* dst = tensor.transBuf.data(); + auto dtypeSize = SizeOfAclDType(tensor); + for (int64_t nIndex = 0; nIndex < n; nIndex++) { + int64_t nHead = nIndex * cdhw; + for (int64_t cIndex = 0; cIndex < c; cIndex++) { + int64_t cHead = nHead + cIndex * dhw; + for (int64_t dIndex = 0; dIndex < d; dIndex++) { + int64_t dHead = cHead + dIndex * hw; + for (int64_t hIndex = 0; hIndex < h; hIndex++) { + int64_t hHead = dHead + hIndex * w; + for (int64_t wIndex = 0; wIndex < w; wIndex++) { + int64_t dstIdx = hHead + wIndex; + int64_t c1Index = cIndex / c0; + int64_t c0Index = cIndex % c0; + auto srcIdx = nIndex * dc1hwc0 + dIndex * c1hwc0 + c1Index * hwc0 + hIndex * wc0 + + wIndex * c0 + c0Index; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + dstIdx * dtypeSize, src + srcIdx * dtypeSize, dtypeSize); + } + } + } + } + } + return DebuggerErrno::OK; +} + +static DebuggerErrno C1HWNCoC0_TO_NCHW(AclTensorInfo& tensor) +{ + AssertDim(tensor.hostShape, kDim4); + AssertConsis(tensor); + AllocTensorTransBuf(tensor); + + auto n = tensor.hostShape[kN]; + auto c = tensor.hostShape[kC]; + auto h = tensor.hostShape[kH]; + auto w = tensor.hostShape[kW]; + const int coIdx = 4; +
const int c0Idx = 5; + auto co = tensor.deviceShape[coIdx]; + auto c0 = tensor.deviceShape[c0Idx]; + auto cubeK = GetCubeSizeByType(tensor.dtype); + + const uint8_t* src = tensor.aclData; + uint8_t* dst = tensor.transBuf.data(); + auto dtypeSize = SizeOfAclDType(tensor); + for (int64_t nIndex = 0; nIndex < n; nIndex++) { + for (int64_t cIndex = 0; cIndex < c; cIndex++) { + for (int64_t hIndex = 0; hIndex < h; hIndex++) { + for (int64_t wIndex = 0; wIndex < w; wIndex++) { + int64_t dstIdx = nIndex * c * h * w + cIndex * h * w + hIndex * w + wIndex; + int64_t c1Index = cIndex / cubeK; + int64_t c0Index = cIndex % cubeK; + int64_t coIndex = c0Index; + int64_t srcIdx = c1Index * h * w * n * co * c0 + hIndex * w * n * co * c0 + wIndex * n * co * c0 + + nIndex * co * c0 + coIndex * c0 + c0Index; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + dstIdx * dtypeSize, src + srcIdx * dtypeSize, dtypeSize); + } + } + } + } + return DebuggerErrno::OK; +} + +static DebuggerErrno NC1HWC0_C04_TO_NCHW(AclTensorInfo& tensor) +{ + return NC1HWC0_TO_NCHW(tensor); +} + +static DebuggerErrno FRAC_Z3D_TO_NCDHW(AclTensorInfo& tensor) +{ + AssertDim(tensor.hostShape, kDim5); + AssertConsis(tensor); + AllocTensorTransBuf(tensor); + + auto n = tensor.hostShape[N_ncdhw]; + auto c = tensor.hostShape[C_ncdhw]; + auto d = tensor.hostShape[D_ncdhw]; + auto h = tensor.hostShape[H_ncdhw]; + auto w = tensor.hostShape[W_ncdhw]; + constexpr int kFZ3D_C0 = 3; + auto c0 = tensor.deviceShape[kFZ3D_C0]; + auto cube_k = GetCubeSizeByType(tensor.dtype); + auto c1 = DivCeil(c, cube_k); + constexpr int64_t kNiSize = 16; + auto n1n0 = AlignCeil(n, kNiSize); + auto n1n0c0 = n1n0 * c0; + auto wn1n0c0 = w * n1n0c0; + auto hwn1n0c0 = h * wn1n0c0; + auto c1hwn1n0c0 = c1 * hwn1n0c0; + auto hw = h * w; + auto dhw = d * hw; + auto cdhw = c * dhw; + + const uint8_t* src = tensor.aclData; + uint8_t* dst = tensor.transBuf.data(); + auto dtypeSize = SizeOfAclDType(tensor); + for (int64_t nIdx = 0; nIdx < n; nIdx++) { +
int64_t nHead = nIdx * cdhw; + for (int64_t cIdx = 0; cIdx < c; cIdx++) { + int64_t cHead = nHead + cIdx * dhw; + for (int64_t dIdx = 0; dIdx < d; dIdx++) { + int64_t dHead = cHead + dIdx * hw; + for (int64_t hIdx = 0; hIdx < h; hIdx++) { + int64_t hHead = dHead + hIdx * w; + for (int64_t wI = 0; wI < w; wI++) { + int64_t dstIdx = hHead + wI; + int64_t c1I = cIdx / c0; + int64_t c0I = cIdx % c0; + int64_t ncIdx = nIdx; + int64_t srcIdx = dIdx * c1hwn1n0c0 + c1I * hwn1n0c0 + hIdx * wn1n0c0 + wI * n1n0c0 + + ncIdx * c0 + c0I; + /* The offset computation above guarantees no out-of-bounds read/write here. */ + std::memcpy(dst + dstIdx * dtypeSize, src + srcIdx * dtypeSize, dtypeSize); + } + } + } + } + } + return DebuggerErrno::OK; +} + +DebuggerErrno TransFormatD2H(AclTensorInfo& tensor) +{ + AclFormat from = tensor.deviceFmt; + AclFormat to = tensor.hostFmt; + auto it = formatTransFuncMap.find(std::make_pair(from, to)); + if (it == formatTransFuncMap.end()) { + return DebuggerErrno::ERROR_UNKNOWN_TRANS; + } + + try { + return it->second(tensor); + } catch (const std::exception& e) { + LOG_ERROR(DebuggerErrno::ERROR_OPERATION_FAILED, tensor + ": Failed to convert format from " + + std::to_string(from) + " to " + std::to_string(to) + " (" + e.what() + ")."); + return DebuggerErrno::ERROR_OPERATION_FAILED; + } +} + +static void TransBf16ToFp32(const uint8_t* input, size_t num, uint8_t* output, size_t bufferSize) +{ + if (bufferSize < num * sizeof(float)) { + LOG_ERROR(DebuggerErrno::ERROR_BUFFER_OVERFLOW, "Insufficient space for converting data from bf16 to fp32."); + return; + } + const DataUtils::BFloat16* in = reinterpret_cast<const DataUtils::BFloat16*>(input); + float* out = reinterpret_cast<float*>(output); + + for (size_t i = 0; i < num; i++) { + out[i] = static_cast<float>(in[i]); + } +} + +static void TransInt4ToInt8(const uint8_t* input, size_t elemNums, uint8_t* output, size_t bufferSize) +{ + if (bufferSize < elemNums * sizeof(int8_t)) { + LOG_ERROR(DebuggerErrno::ERROR_BUFFER_OVERFLOW, "Insufficient space for converting data from int4 to int8."); +
return; + } + const int8_t *srcData = reinterpret_cast<const int8_t*>(input); + int8_t *dstData = reinterpret_cast<int8_t*>(output); + size_t inputLength = elemNums / 2; + int maxValue = 7; + int minValue = -8; + int signBitShift = 3; + int signBitMask = 0x08; + for (size_t i = 0; i < inputLength; ++i) { + int8_t s = *srcData; + int8_t t = s & 0xf; + // Sign-extend the low nibble so the sign bit is preserved. + int8_t signBit = (t & signBitMask) >> signBitShift; + if (signBit == 1) { + t = t | 0xf0; + } else { + t = t & 0x0f; + } + if (t < minValue || t > maxValue) { + LOG_ERROR(DebuggerErrno::ERROR_INVALID_VALUE, "Invalid int4 value."); + } + *dstData = t; + ++dstData; + + int highByteShift = 4; + t = s >> highByteShift; + signBit = (t & signBitMask) >> signBitShift; + if (signBit == 1) { + t = t | 0xf0; + } else { + t = t & 0x0f; + } + if (t < minValue || t > maxValue) { + LOG_ERROR(DebuggerErrno::ERROR_INVALID_VALUE, "Invalid int4 value."); + } + *dstData = t; + ++dstData; + ++srcData; + } + return; +} + +DebuggerErrno TransDtype(AclTensorInfo& tensor, AclDtype to) +{ + if (tensor.dtype == to) { + return DebuggerErrno::OK; + } + + tensor.oriDtype = tensor.dtype; + std::vector<uint8_t> buffer; + AssertConsis(tensor); + size_t bufferSize = EleNumOfTensor(tensor) * SizeOfAclDType(to); + buffer.resize(bufferSize); + const uint8_t* input = tensor.transBuf.empty() ?
tensor.aclData : tensor.transBuf.data(); + uint8_t* output = buffer.data(); + + if (tensor.dtype == AclDtype::DT_BF16 && to == AclDtype::DT_FLOAT) { + TransBf16ToFp32(input, EleNumOfTensor(tensor), output, bufferSize); + } else if (tensor.dtype == AclDtype::DT_INT4 && to == AclDtype::DT_INT8) { + TransInt4ToInt8(input, EleNumOfTensor(tensor), output, bufferSize); + } else { + LOG_ERROR(DebuggerErrno::ERROR_UNKNOWN_TRANS, tensor + ": Trans " + DataUtils::GetDTypeString(tensor.dtype) + + " to " + DataUtils::GetDTypeString(to) + " is not supported."); + return DebuggerErrno::ERROR_UNKNOWN_TRANS; + } + + tensor.transBuf = std::move(buffer); + tensor.dtype = to; + return DebuggerErrno::OK; +} + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/AclTensor.hpp b/debug/accuracy_tools/msprobe/ccsrc/core/AclTensor.hpp new file mode 100644 index 0000000000000000000000000000000000000000..f2ac429a7f14370ea1721369c7f9089cb971bb6e --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/AclTensor.hpp @@ -0,0 +1,78 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include <string> +#include <vector> + +#include "include/ErrorCode.hpp" +#include "proto/AclDumpMsg.pb.h" +#include "utils/DataUtils.hpp" + +namespace MindStudioDebugger { + +using AclShape = DataUtils::TensorShape; +using AclDtype = DataUtils::DataType; +using AclFormat = DataUtils::TensorFormat; + +constexpr uint8_t kDim1 = 1; +constexpr uint8_t kDim2 = 2; +constexpr uint8_t kDim3 = 3; +constexpr uint8_t kDim4 = 4; +constexpr uint8_t kDim5 = 5; +constexpr uint8_t kDim6 = 6; + +struct AclTensorInfo { + std::string dumpPath; + const uint8_t* aclData; + AclDtype dtype; + AclDtype oriDtype; + AclFormat deviceFmt; + AclFormat hostFmt; + AclShape deviceShape; + AclShape hostShape; + size_t dataSize; + int32_t subFormat; + std::string inout; + uint32_t slot; + bool dumpOriginData; + std::vector<uint8_t> transBuf; + + std::string ToString() const { + return "AclTensor(path=" + dumpPath + ",dtype=" + DataUtils::GetDTypeString(dtype) + ",inout=" + inout + ")"; + } +}; + +inline std::string operator+(const std::string& s, const AclTensorInfo& tensor) { + return s + tensor.ToString(); +} + +inline std::string operator+(const AclTensorInfo& tensor, const std::string& s) { + return tensor.ToString() + s; +} + +namespace AclTensor { +size_t SizeOfTensor(const AclTensorInfo& tensor, bool host=true); +template <typename T> +AclTensorInfo ParseAttrsFromDumpData(const std::string &dumpPath, const uint8_t* data, const T& tensor, + const std::string& io, uint32_t slot); +DebuggerErrno TransFormatD2H(AclTensorInfo& tensor); +DebuggerErrno TransDtype(AclTensorInfo& tensor, AclDtype to); +bool IsDtypeSupportTrans(AclDtype dtype); + +} +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/PrecisionDebugger.cpp b/debug/accuracy_tools/msprobe/ccsrc/core/PrecisionDebugger.cpp new file mode 100644 index 0000000000000000000000000000000000000000..d4d74f1962222558c88c576b8ffbd8c474e152f2 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/PrecisionDebugger.cpp @@ -0,0 +1,157 @@ +/* + * Copyright (C)
2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include <stdexcept> + +#include "base/ErrorInfos.hpp" +#include "base/DebuggerConfig.hpp" +#include "third_party/ACL/AclApi.hpp" +#include "PrecisionDebugger.hpp" + +namespace MindStudioDebugger { + +void PrecisionDbgTaskBase::Register() +{ + PrecisionDebugger::GetInstance().RegisterDebuggerTask(this); +} + +void PrecisionDebugger::RegisterDebuggerTask(PrecisionDbgTaskBase* task) +{ + DEBUG_FUNC_TRACE(); + std::vector<PrecisionDbgTaskBase*>::iterator iter; + const DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + + if (cfg.IsCfgLoaded() && !task->Condition(cfg)) { + return; + } + + for (iter = subDebuggers.begin(); iter != subDebuggers.end(); ++iter) { + if (*iter == task) { + return; + } + } + + for (iter = subDebuggers.begin(); iter != subDebuggers.end(); ++iter) { + if ((*iter)->Priority() > task->Priority()) { + break; + } + } + + subDebuggers.insert(iter, task); + + /* If the config has not been loaded yet, the task stays cached and is filtered by condition at load time. */ + if (cfg.IsCfgLoaded()) { + task->Initialize(cfg); + LOG_DEBUG("PrecisionDebugger: " + task->Name() + " registered."); + } + return; +} + +void PrecisionDebugger::UnRegisterDebuggerTask(PrecisionDbgTaskBase* task) +{ + DEBUG_FUNC_TRACE(); + for (auto iter = subDebuggers.begin(); iter != subDebuggers.end(); iter++) { + if (*iter == task) { + LOG_DEBUG("PrecisionDebugger: " + task->Name() + " unregistered."); + subDebuggers.erase(iter); + return; + } + } + + return;
+} + +int32_t PrecisionDebugger::Initialize(const std::string& framework, const std::string& cfgFile) +{ + DEBUG_FUNC_TRACE(); + + int32_t ret = DebuggerConfig::GetInstance().LoadConfig(framework, cfgFile); + if (ret != 0) { + return ret; + } + + if (AscendCLApi::LoadAclApi() != DebuggerErrno::OK) { + return -1; + } + + const DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + for (auto iter = subDebuggers.begin(); iter != subDebuggers.end(); ) { + if (!(*iter)->Condition(cfg)) { + iter = subDebuggers.erase(iter); + } else { + (*iter)->Initialize(cfg); + LOG_DEBUG("PrecisionDebugger: " + (*iter)->Name() + " initialized."); + iter++; + } + } + + initialized = true; + return 0; +} + +void PrecisionDebugger::Start() +{ + DEBUG_FUNC_TRACE(); + if (!initialized) { + return; + } + + enable = true; + + for (auto task : subDebuggers) { + task->OnStart(); + } +} + +void PrecisionDebugger::Stop() +{ + DEBUG_FUNC_TRACE(); + if (!initialized) { + return; + } + + enable = false; + CALL_ACL_API(aclrtSynchronizeDevice); + + for (auto task : subDebuggers) { + task->OnStop(); + } +} + +void PrecisionDebugger::Step() +{ + return Step(1); +} + +void PrecisionDebugger::Step(uint32_t step) +{ + DEBUG_FUNC_TRACE(); + if (!initialized) { + return; + } + + if (step > UINT32_MAX - curStep) { + throw std::runtime_error("Step over upper limit (4294967295)."); + } + curStep += step; + CALL_ACL_API(aclrtSynchronizeDevice); + + for (auto task : subDebuggers) { + task->OnStep(curStep); + } +} + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/PrecisionDebugger.hpp b/debug/accuracy_tools/msprobe/ccsrc/core/PrecisionDebugger.hpp new file mode 100644 index 0000000000000000000000000000000000000000..fbc22c016c40285a90a3de5989684098639256c9 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/PrecisionDebugger.hpp @@ -0,0 +1,79 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include <string> +#include <vector> + +#include "base/DebuggerConfig.hpp" + +namespace MindStudioDebugger { + +class PrecisionDbgTaskBase { +public: + virtual bool Condition(const DebuggerConfig& cfg) const = 0; + virtual std::string Name() const = 0; + virtual uint32_t Priority() const {return 100;} + + virtual void Initialize(const DebuggerConfig& cfg) {}; + virtual void OnStart() {}; + virtual void OnStop() {}; + virtual void OnStep(uint32_t curStep) {}; + + void Register(); + +protected: + PrecisionDbgTaskBase() = default; + ~PrecisionDbgTaskBase() = default; +}; + +class PrecisionDebugger { +public: + static PrecisionDebugger& GetInstance() { + static PrecisionDebugger instance_; + return instance_; + } + + int32_t Initialize(const std::string& framework, const std::string& cfgFile); + bool HasInitialized() const {return initialized;} + + void Start(); + void Stop(); + void Step(); + void Step(uint32_t step); + + bool IsEnable() const {return enable;} + uint32_t GetCurStep() const {return curStep;} + + void RegisterDebuggerTask(PrecisionDbgTaskBase* task); + void UnRegisterDebuggerTask(PrecisionDbgTaskBase* task); + +private: + PrecisionDebugger() = default; + ~PrecisionDebugger() = default; + explicit PrecisionDebugger(const PrecisionDebugger &obj) = delete; + PrecisionDebugger& operator=(const PrecisionDebugger &obj) = delete; + explicit PrecisionDebugger(PrecisionDebugger &&obj) = delete; +
PrecisionDebugger& operator=(PrecisionDebugger &&obj) = delete; + + bool initialized{false}; + bool enable{false}; + uint32_t curStep{0}; + std::vector<PrecisionDbgTaskBase*> subDebuggers; +}; + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MSAclDumper.cpp b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MSAclDumper.cpp new file mode 100644 index 0000000000000000000000000000000000000000..2d80ed3ce1ab11ee5ddf9bad18583a6813f32529 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MSAclDumper.cpp @@ -0,0 +1,59 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */
+
+#include <cstdint>
+
+#include "base/ErrorInfos.hpp"
+#include "base/DebuggerConfig.hpp"
+#include "base/Environment.hpp"
+#include "core/AclDumper.hpp"
+#include "MSAclDumper.hpp"
+
+namespace MindStudioDebugger {
+
+void MSAclDumper::OnStepBegin(uint32_t device, uint32_t curStep, ExtArgs& args)
+{
+    DEBUG_FUNC_TRACE();
+    if (!PrecisionDebugger::GetInstance().IsEnable()) {
+        return;
+    }
+    const bool* isKbk = GetExtArgs<const bool*>(args, MindStudioExtensionArgs::IS_KBK);
+    if (isKbk != nullptr && *isKbk) {
+        /* acldump is only used in non-KBK scenarios */
+        return;
+    }
+
+    int32_t rank = Environment::GetRankID();
+    if (rank < 0) {
+        rank = static_cast<int32_t>(device);
+    }
+
+    AclDumper::GetInstance().SetDump(rank, curStep, args);
+    return;
+}
+
+void MSAclDumper::OnStepEnd(ExtArgs& args)
+{
+    DEBUG_FUNC_TRACE();
+    AclDumper::GetInstance().FinalizeDump(args);
+}
+
+__attribute__((constructor)) void RegisterMSAclDumper()
+{
+    MSAclDumper::GetInstance().Register();
+}
+
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MSAclDumper.hpp b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MSAclDumper.hpp
new file mode 100644
index 0000000000000000000000000000000000000000..cd09bf51af0dac67065d51b8ce60c20f011cd585
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MSAclDumper.hpp
@@ -0,0 +1,51 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +#pragma once + +#include + +#include "include/ExtArgs.hpp" +#include "core/PrecisionDebugger.hpp" + +namespace MindStudioDebugger { + +class MSAclDumper : public PrecisionDbgTaskBase { +public: + static MSAclDumper& GetInstance() { + static MSAclDumper instance_; + return instance_; + } + + std::string Name() const override {return "MindSpore AclDumper";} + bool Condition(const DebuggerConfig& cfg) const override { + return cfg.GetFramework() == DebuggerFramework::FRAMEWORK_MINDSPORE && + cfg.GetDebugLevel() == DebuggerLevel::L2; + } + + void OnStepBegin(uint32_t device, uint32_t curStep, ExtArgs& args); + void OnStepEnd(ExtArgs& args); + +private: + MSAclDumper() = default; + ~MSAclDumper() = default; + explicit MSAclDumper(const MSAclDumper &obj) = delete; + MSAclDumper& operator=(const MSAclDumper &obj) = delete; + explicit MSAclDumper(MSAclDumper &&obj) = delete; + MSAclDumper& operator=(MSAclDumper &&obj) = delete; +}; + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MindSporeTrigger.cpp b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MindSporeTrigger.cpp new file mode 100644 index 0000000000000000000000000000000000000000..631ea7c4acf4666b911a3bb5f28a3c6cc4fe0d54 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MindSporeTrigger.cpp @@ -0,0 +1,53 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include "include/Macro.hpp" +#include "base/ErrorInfos.hpp" +#include "MindSporeTrigger.hpp" +#include "MSAclDumper.hpp" + +namespace MindStudioDebugger { + +bool MindSporeTrigger::stepBeginFlag = false; + +void MindSporeTrigger::TriggerOnStepBegin(uint32_t device, uint32_t curStep, ExtArgs& args) +{ + DEBUG_FUNC_TRACE(); + CleanErrorInfoCache(); + + MSAclDumper::GetInstance().OnStepBegin(device, curStep, args); + stepBeginFlag = true; + + CleanErrorInfoCache(); + return; +} + +void MindSporeTrigger::TriggerOnStepEnd(ExtArgs& args) +{ + DEBUG_FUNC_TRACE(); + CleanErrorInfoCache(); + + if (!stepBeginFlag) { + return; + } + MSAclDumper::GetInstance().OnStepEnd(args); + stepBeginFlag = false; + + CleanErrorInfoCache(); + return; +} + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MindSporeTrigger.hpp b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MindSporeTrigger.hpp new file mode 100644 index 0000000000000000000000000000000000000000..022e5d7d4c14a9771681840b967b2ec3aebb811b --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/core/mindspore/MindSporeTrigger.hpp @@ -0,0 +1,39 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include + +#include "include/ExtArgs.hpp" + +namespace MindStudioDebugger { + +class MindSporeTrigger { +public: + static void TriggerOnStepBegin(uint32_t device, uint32_t curStep, ExtArgs& args); + static void TriggerOnStepEnd(ExtArgs& args); + static void LaunchPreDbg() {} + static void LaunchPostDbg() {} + +private: + MindSporeTrigger() = default; + ~MindSporeTrigger() = default; + + static bool stepBeginFlag; +}; + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/if/mindspore/MindSporeDbgHook.cpp b/debug/accuracy_tools/msprobe/ccsrc/if/mindspore/MindSporeDbgHook.cpp new file mode 100644 index 0000000000000000000000000000000000000000..42f3a2e5b61d5da021b2ef7da4a7b88c6dc2abbb --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/if/mindspore/MindSporeDbgHook.cpp @@ -0,0 +1,71 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#define _GLIBCXX_USE_CXX11_ABI 0
+
+#include <cstdint>
+#include <map>
+#include <string>
+#include <vector>
+
+#include "include/Macro.hpp"
+#include "include/ExtArgs.hpp"
+#include "core/mindspore/MindSporeTrigger.hpp"
+
+EXPORT_SYMBOL void MS_DbgOnStepBegin(uint32_t device, int32_t curStep,
+                                     std::map<int, void*> exts)
+{
+    MindStudioDebugger::ExtArgs args;
+    const char** strBuf = nullptr;
+    for (auto& ext : exts) {
+        if (ext.first >= static_cast<int>(MindStudioDebugger::MindStudioExtensionArgs::ARG_MAX)) {
+            continue;
+        }
+        /* MindSpore is built with _GLIBCXX_USE_CXX11_ABI=0; to work around the C++ ABI compatibility issue, the strings are converted to char* here */
+        if (ext.first == static_cast<int>(MindStudioDebugger::MindStudioExtensionArgs::ALL_KERNEL_NAMES)) {
+            std::vector<std::string>* ss = reinterpret_cast<std::vector<std::string>*>(ext.second);
+            strBuf = new const char*[(*ss).size() + 1];
+            strBuf[(*ss).size()] = nullptr;
+            size_t i = 0;
+            for (std::string& s : *ss) {
+                strBuf[i] = s.c_str();
+                i++;
+            }
+            args[static_cast<MindStudioDebugger::MindStudioExtensionArgs>(ext.first)] = reinterpret_cast<void*>(strBuf);
+            continue;
+        }
+        args[static_cast<MindStudioDebugger::MindStudioExtensionArgs>(ext.first)] = ext.second;
+    }
+
+    MindStudioDebugger::MindSporeTrigger::TriggerOnStepBegin(device, static_cast<uint32_t>(curStep), args);
+    if (strBuf != nullptr) {
+        delete[] strBuf;
+    }
+
+    return;
+}
+
+EXPORT_SYMBOL void MS_DbgOnStepEnd(std::map<int, void*>& exts)
+{
+    MindStudioDebugger::ExtArgs args;
+    for (auto& ext : exts) {
+        if (ext.first >= static_cast<int>(MindStudioDebugger::MindStudioExtensionArgs::ARG_MAX)) {
+            continue;
+        }
+        args[static_cast<MindStudioDebugger::MindStudioExtensionArgs>(ext.first)] = ext.second;
+    }
+    return MindStudioDebugger::MindSporeTrigger::TriggerOnStepEnd(args);
+}
+
+
diff --git a/debug/accuracy_tools/msprobe/ccsrc/if/python/ACLDump.cpp b/debug/accuracy_tools/msprobe/ccsrc/if/python/ACLDump.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..1c380ed3f505795eb622f7f401558f72a54db557
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/if/python/ACLDump.cpp
@@ -0,0 +1,64 @@
+/*
+ * Copyright (C) 2025-2025. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <Python.h>
+#include <string>
+
+#include "base/ErrorInfos.hpp"
+#include "core/AclDumper.hpp"
+#include "utils/CPythonUtils.hpp"
+
+namespace MindStudioDebugger {
+
+static PyObject *CPythonKernelInitDump(PyObject *module, PyObject *args) {
+    PyGILState_STATE gstate = PyGILState_Ensure();
+    KernelInitDump();
+    PyGILState_Release(gstate);
+    Py_RETURN_NONE;
+}
+
+static PyObject *CPythonKernelSetDump(PyObject *module, PyObject *args) {
+    const char *path;
+    if (!PyArg_ParseTuple(args, "s", &path)) {
+        LOG_ERROR(DebuggerErrno::ERROR_INVALID_VALUE,
+                  "npu set dump error, cfg_file must be a string");
+        return nullptr;
+    }
+    PyGILState_STATE gstate = PyGILState_Ensure();
+    KernelSetDump(std::string(path));
+    PyGILState_Release(gstate);
+    Py_RETURN_NONE;
+}
+
+static PyObject *CPythonKernelFinalizeDump(PyObject *module, PyObject *args) {
+    PyGILState_STATE gstate = PyGILState_Ensure();
+    KernelFinalizeDump();
+    PyGILState_Release(gstate);
+    Py_RETURN_NONE;
+}
+
+static PyMethodDef DumpMethods[] = {
+    {"init_dump", reinterpret_cast<PyCFunction>(CPythonKernelInitDump),
+     METH_NOARGS, "Initialize dump."},
+    {"set_dump", reinterpret_cast<PyCFunction>(CPythonKernelSetDump),
+     METH_VARARGS, "Set dump."},
+    {"finalize_dump", reinterpret_cast<PyCFunction>(CPythonKernelFinalizeDump),
+     METH_NOARGS, "Finalize dump."},
+    {nullptr, nullptr, 0, nullptr}};
+
+PyMethodDef *GetDumpMethods() { return DumpMethods; }
+}  // namespace MindStudioDebugger
diff --git 
a/debug/accuracy_tools/msprobe/ccsrc/if/python/ACLDump.hpp b/debug/accuracy_tools/msprobe/ccsrc/if/python/ACLDump.hpp new file mode 100644 index 0000000000000000000000000000000000000000..11ae2ad4adb634e0c7cf58295127f76340796b84 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/if/python/ACLDump.hpp @@ -0,0 +1,23 @@ +/* + * Copyright (C) 2025-2025. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include + +namespace MindStudioDebugger { +PyMethodDef *GetDumpMethods(); +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/if/python/CPythonAgent.cpp b/debug/accuracy_tools/msprobe/ccsrc/if/python/CPythonAgent.cpp new file mode 100644 index 0000000000000000000000000000000000000000..4b8fc03491e2c0792c3c707c272e7b587d60c7ad --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/if/python/CPythonAgent.cpp @@ -0,0 +1,106 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <Python.h>
+#include <string>
+
+#include "utils/CPythonUtils.hpp"
+
+namespace MindStudioDebugger {
+
+PyDoc_STRVAR(CPythonAgentModuleDoc,
+"A module for Python code to interact with C++ code.\n\
+ \n\
+...");
+
+static PyObject* CPythonAgentRegister(PyObject *module, PyObject *args)
+{
+    /* Expects 2 arguments: name and obj */
+    if (args == nullptr || PyTuple_GET_SIZE(args) != 2) {
+        PyErr_SetString(PyExc_TypeError, "\'register_context\' expects 2 arguments.");
+        Py_RETURN_NONE;
+    }
+
+    PyObject* obj = nullptr;
+    const char* name = nullptr;
+    if (!PyArg_ParseTuple(args, "sO", &name, &obj)) {
+        PyErr_SetString(PyExc_TypeError, "\"name\" should be a string and \"obj\" should be a python object.");
+        Py_RETURN_NONE;
+    }
+
+    if (CPythonUtils::RegisterPythonObject(name, obj) != 0) {
+        if (CPythonUtils::IsPyObjRegistered(name)) {
+            PyErr_Format(PyExc_RuntimeError, "\"%s\" has been registered already.", name);
+        } else {
+            PyErr_Format(PyExc_RuntimeError, "Failed to register \"%s\".", name);
+        }
+    }
+
+    Py_RETURN_NONE;
+}
+
+static PyObject* CPythonAgentUnRegister(PyObject *module, PyObject *obj)
+{
+    CPythonUtils::PythonStringObject name(obj);
+    if (name.IsNone()) {
+        PyErr_SetString(PyExc_TypeError, "\"name\" should be a string.");
+        Py_RETURN_NONE;
+    }
+
+    CPythonUtils::UnRegisterPythonObject(name);
+    Py_RETURN_NONE;
+}
+
+static PyObject* CPythonAgentGetContext(PyObject *module, PyObject *obj)
+{
+    CPythonUtils::PythonStringObject name(obj);
+    if (name.IsNone()) {
+        PyErr_SetString(PyExc_TypeError, "\"name\" should be a string.");
+        Py_RETURN_NONE;
+    }
+
+    return CPythonUtils::GetRegisteredPyObj(name).NewRef();
+}
+
+PyDoc_STRVAR(RegisterDoc,
+"register_context(name, obj)\n--\n\nRegister a python object, which will be available on the backend.");
+PyDoc_STRVAR(UnregisterDoc,
+"unregister_context(name)\n--\n\nUnregister a python object.");
+PyDoc_STRVAR(GetDoc,
+"get_context(name)\n--\n\nGet a python object, which may be registered by the backend.");
+
+static PyMethodDef CPythonAgentMethods[] = {
+    {"register_context", reinterpret_cast<PyCFunction>(CPythonAgentRegister), METH_VARARGS, RegisterDoc},
+    {"unregister_context", reinterpret_cast<PyCFunction>(CPythonAgentUnRegister), METH_O, UnregisterDoc},
+    {"get_context", reinterpret_cast<PyCFunction>(CPythonAgentGetContext), METH_O, GetDoc},
+    {nullptr, nullptr, 0, nullptr}
+};
+
+static struct PyModuleDef g_CPythonAgentModule = {
+    PyModuleDef_HEAD_INIT,
+    "_msprobe_c.CPythonAgent",      /* m_name */
+    CPythonAgentModuleDoc,          /* m_doc */
+    -1,                             /* m_size */
+    CPythonAgentMethods,            /* m_methods */
+};
+
+PyObject* GetCPythonAgentModule()
+{
+    return PyModule_Create(&g_CPythonAgentModule);
+}
+
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/if/python/CPythonAgent.hpp b/debug/accuracy_tools/msprobe/ccsrc/if/python/CPythonAgent.hpp
new file mode 100644
index 0000000000000000000000000000000000000000..103fa4430eb0f490654f30c1684b2427e062590c
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/if/python/CPythonAgent.hpp
@@ -0,0 +1,23 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +#pragma once + +#include + +namespace MindStudioDebugger { +PyObject* GetCPythonAgentModule(); +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/if/python/MsProbeIfPython.cpp b/debug/accuracy_tools/msprobe/ccsrc/if/python/MsProbeIfPython.cpp new file mode 100644 index 0000000000000000000000000000000000000000..a18c54a146f7d676d6b3c7f760e50f9e7eebe56c --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/if/python/MsProbeIfPython.cpp @@ -0,0 +1,85 @@ +/* + * Copyright (C) 2024-2025. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include + +#include "PrecisionDebuggerIfPython.hpp" +#include "CPythonAgent.hpp" +#include "ACLDump.hpp" + +namespace MindStudioDebugger { + +PyDoc_STRVAR(MsProbeCModuleDoc, +"The part of the module msprobe that is implemented in CXX.\n\ +class _PrecisionDebugger: PrecisionDebugger in CXX \n\ +class _DebuggerConfig: Configuration data of PrecisionDebugger \n\ +class CPythonAgent: Used for front-end and back-end code interactions \n\ + \n\ +..."); + +static struct PyModuleDef g_MsProbeCModule = { + PyModuleDef_HEAD_INIT, + "_msprobe_c", /* m_name */ + MsProbeCModuleDoc, /* m_doc */ + -1, /* m_size */ + nullptr, /* m_methods */ +}; + +} + +PyMODINIT_FUNC PyInit__msprobe_c(void) +{ + PyObject* m = PyModule_Create(&MindStudioDebugger::g_MsProbeCModule); + if (m == nullptr) { + return nullptr; + } + + PyTypeObject* precisionDebugger = MindStudioDebugger::GetPyPrecisionDebuggerType(); + if (precisionDebugger == nullptr) { + PyErr_SetString(PyExc_ImportError, "Failed to create class _PrecisionDebugger."); + Py_DECREF(m); + return nullptr; + } + if (PyModule_AddObject(m, "_PrecisionDebugger", reinterpret_cast(precisionDebugger)) < 0) { + PyErr_SetString(PyExc_ImportError, "Failed to bind class _PrecisionDebugger."); + Py_DECREF(m); + return nullptr; + } + Py_INCREF(precisionDebugger); + + PyObject* cpyAgent = MindStudioDebugger::GetCPythonAgentModule(); + if (cpyAgent == nullptr) { + PyErr_SetString(PyExc_ImportError, "Failed to create submodule CPythonAgent."); + Py_DECREF(m); + return nullptr; + } + if (PyModule_AddObject(m, "CPythonAgent", cpyAgent) < 0) { + PyErr_SetString(PyExc_ImportError, "Failed to bind submodule CPythonAgent."); + Py_DECREF(m); + return nullptr; + } + Py_INCREF(cpyAgent); + + PyMethodDef* dumpmethods = MindStudioDebugger::GetDumpMethods(); + for (PyMethodDef* method = dumpmethods; method->ml_name != nullptr; ++method) { + if (PyModule_AddObject(m, method->ml_name, PyCFunction_New(method, nullptr)) < 0) { + 
PyErr_SetString(PyExc_ImportError, "Failed to bind dump method."); + Py_DECREF(m); + return nullptr; + } + } + return m; +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/if/python/PrecisionDebuggerIfPython.cpp b/debug/accuracy_tools/msprobe/ccsrc/if/python/PrecisionDebuggerIfPython.cpp new file mode 100644 index 0000000000000000000000000000000000000000..da1cf3cf1c5d4c8894d0b12b5518657b5928a8d6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/if/python/PrecisionDebuggerIfPython.cpp @@ -0,0 +1,188 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#include <Python.h>
+#include <cstring>
+#include <stdexcept>
+
+#include "utils/CPythonUtils.hpp"
+#include "core/PrecisionDebugger.hpp"
+
+namespace MindStudioDebugger {
+
+static PyObject* NewPrecisionDebugger(PyTypeObject *type, PyObject *args, PyObject *kwds)
+{
+    if (type == nullptr || type->tp_alloc == nullptr) {
+        throw std::runtime_error("PrecisionDebugger: type or alloc is nullptr.");
+    }
+
+    /* Singleton, to avoid repeated construction */
+    static PyObject *self = nullptr;
+    if (self == nullptr) {
+        self = type->tp_alloc(type, 0);
+    }
+
+    Py_XINCREF(self);
+    return self;
+}
+
+static int InitPrecisionDebugger(PyObject *self, PyObject *args, PyObject *kws)
+{
+    if (PrecisionDebugger::GetInstance().HasInitialized()) {
+        return 0;
+    }
+
+    if (kws == nullptr) {
+        PyErr_SetString(PyExc_TypeError, "Need keyword args \'framework\' and \'config_path\'.");
+        return -1;
+    }
+
+    CPythonUtils::PythonDictObject kwArgs(kws);
+    std::string framework = kwArgs.GetItem("framework");
+    std::string cfgFile = kwArgs.GetItem("config_path");
+
+    if (PrecisionDebugger::GetInstance().Initialize(framework, cfgFile) != 0) {
+        PyErr_SetString(PyExc_RuntimeError, "Failed to load config, read log for more details.");
+        return -1;
+    }
+
+    return 0;
+}
+
+static PyObject* PrecisionDebuggerGetAttr(PyObject *self, PyObject *name)
+{
+    CPythonUtils::PythonStringObject attr(name);
+
+    if (attr.IsNone()) {
+        PyErr_SetString(PyExc_TypeError, "Attribute should be a string.");
+        Py_RETURN_NONE;
+    }
+
+    std::string attrStr = attr.ToString();
+    const char* s = attrStr.c_str();
+    if (strcmp(s, "enable") == 0) {
+        return CPythonUtils::PythonObject::From(PrecisionDebugger::GetInstance().IsEnable()).NewRef();
+    } else if (strcmp(s, "current_step") == 0) {
+        return CPythonUtils::PythonObject::From(PrecisionDebugger::GetInstance().GetCurStep()).NewRef();
+    }
+
+    PyObject* ret = PyObject_GenericGetAttr(self, name);
+    if (ret == nullptr) {
+        PyErr_Format(PyExc_AttributeError, "\'PrecisionDebugger\' object has no attribute \'%s\'", s);
+        Py_RETURN_NONE;
+    }
+
+    return ret;
+}
+
+static 
PyObject* PrecisionDebuggerStart(PyObject *self)
+{
+    PrecisionDebugger::GetInstance().Start();
+    Py_RETURN_NONE;
+}
+
+static PyObject* PrecisionDebuggerStop(PyObject *self)
+{
+    PrecisionDebugger::GetInstance().Stop();
+    Py_RETURN_NONE;
+}
+
+static PyObject* PrecisionDebuggerStep(PyObject *self, PyObject *args)
+{
+    if (args == nullptr || PyTuple_GET_SIZE(args) == 0) {
+        PrecisionDebugger::GetInstance().Step();
+        Py_RETURN_NONE;
+    }
+
+    PyObject* increment = PyTuple_GetItem(args, 0);
+    if (!PyLong_Check(increment)) {
+        PyErr_SetString(PyExc_TypeError, "\'step\' should be an int.");
+        Py_RETURN_NONE;
+    }
+
+    PrecisionDebugger::GetInstance().Step(PyLong_AsUnsignedLong(increment));
+    Py_RETURN_NONE;
+}
+
+PyDoc_STRVAR(StartDoc,
+"start($self, /)\n--\n\nEnable debug.");
+PyDoc_STRVAR(StopDoc,
+"stop($self, /)\n--\n\nDisable debug.");
+PyDoc_STRVAR(StepDoc,
+"step($self, [increment])\n--\n\nUpdate step.");
+
+static PyMethodDef PrecisionDebuggerMethods[] = {
+    {"start", reinterpret_cast<PyCFunction>(PrecisionDebuggerStart), METH_NOARGS, StartDoc},
+    {"stop", reinterpret_cast<PyCFunction>(PrecisionDebuggerStop), METH_NOARGS, StopDoc},
+    {"step", reinterpret_cast<PyCFunction>(PrecisionDebuggerStep), METH_VARARGS, StepDoc},
+    {nullptr, nullptr, 0, nullptr}
+};
+
+PyTypeObject PyPrecisionDebuggerType = {
+    PyVarObject_HEAD_INIT(&PyType_Type, 0)
+    "_msprobe_c._PrecisionDebugger",    /* tp_name */
+    0,                                  /* tp_basicsize */
+    0,                                  /* tp_itemsize */
+    /* methods */
+    0,                                  /* tp_dealloc */
+    0,                                  /* tp_vectorcall_offset */
+    0,                                  /* tp_getattr */
+    0,                                  /* tp_setattr */
+    0,                                  /* tp_as_async */
+    0,                                  /* tp_repr */
+    0,                                  /* tp_as_number */
+    0,                                  /* tp_as_sequence */
+    0,                                  /* tp_as_mapping */
+    0,                                  /* tp_hash */
+    0,                                  /* tp_call */
+    0,                                  /* tp_str */
+    PrecisionDebuggerGetAttr,           /* tp_getattro */
+    0,                                  /* tp_setattro */
+    0,                                  /* tp_as_buffer */
+    Py_TPFLAGS_DEFAULT,                 /* tp_flags */
+    0,                                  /* tp_doc */
+    0,                                  /* tp_traverse */
+    0,                                  /* tp_clear */
+    0,                                  /* tp_richcompare */
+    0,                                  /* tp_weaklistoffset */
+    0,                                  /* tp_iter */
+    0,                                  /* tp_iternext */
+    
PrecisionDebuggerMethods, /* tp_methods */ + 0, /* tp_members */ + 0, /* tp_getset */ + &PyBaseObject_Type, /* tp_base */ + 0, /* tp_dict */ + 0, /* tp_descr_get */ + 0, /* tp_descr_set */ + 0, /* tp_dictoffset */ + InitPrecisionDebugger, /* tp_init */ + 0, /* tp_alloc */ + NewPrecisionDebugger, /* tp_new */ + PyObject_Del, /* tp_free */ +}; + +PyTypeObject* GetPyPrecisionDebuggerType() +{ + static bool init = false; + if (!init) { + if (PyType_Ready(&PyPrecisionDebuggerType) < 0) { + return nullptr; + } + init = true; + } + return &PyPrecisionDebuggerType; +} + +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/if/python/PrecisionDebuggerIfPython.hpp b/debug/accuracy_tools/msprobe/ccsrc/if/python/PrecisionDebuggerIfPython.hpp new file mode 100644 index 0000000000000000000000000000000000000000..55e861c1ecf62a5326b3660a9846cf9458127e7a --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/if/python/PrecisionDebuggerIfPython.hpp @@ -0,0 +1,23 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include + +namespace MindStudioDebugger { +PyTypeObject* GetPyPrecisionDebuggerType(); +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/include/ErrorCode.hpp b/debug/accuracy_tools/msprobe/ccsrc/include/ErrorCode.hpp new file mode 100644 index 0000000000000000000000000000000000000000..19ce6ce1b83a970406c6ca13c96175eaea97b04f --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/include/ErrorCode.hpp @@ -0,0 +1,64 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#pragma once
+
+namespace MindStudioDebugger {
+
+enum class DebuggerErrno {
+    OK = 0,
+    ERROR,
+    NONE,
+
+    /* File operations */
+    ERROR_FILE_NOT_EXISTS = 100,
+    ERROR_FILE_ALREADY_EXISTS,
+    ERROR_FAILED_TO_OPEN_FILE,
+    ERROR_FAILED_TO_WRITE_FILE,
+    ERROR_DIR_NOT_EXISTS,
+    ERROR_PERMISSION_DENINED,
+    ERROR_NOT_ALLOW_SOFTLINK,
+    ERROR_ILLEGAL_FILE_TYPE,
+    ERROR_PATH_TOO_LOOG,
+    ERROR_PATH_TOO_DEEP,
+    ERROR_PATH_CONTAINS_INVALID_CHAR,
+    ERROR_FILE_TOO_LARGE,
+    ERROR_UNKNOWN_FILE_SUFFIX,
+    ERROR_CANNOT_PARSE_PATH,
+
+    /* Data parsing */
+    ERROR_INVALID_OPERATION = 200,
+    ERROR_INVALID_FORMAT,
+    ERROR_INVALID_VALUE,
+    ERROR_UNKNOWN_FIELD,
+    ERROR_UNKNOWN_VALUE,
+    ERROR_UNKNOWN_TRANS,
+    ERROR_FIELD_NOT_EXISTS,
+    ERROR_VALUE_OVERFLOW,
+
+    /* System calls */
+    ERROR_NO_MEMORY = 300,
+    ERROR_BUFFER_OVERFLOW,
+    ERROR_SYSCALL_FAILED,
+    ERROR_OPERATION_FAILED,
+
+    /* Environment dependencies */
+    ERROR_DEPENDENCY_NOT_FIND = 400,
+    ERROR_CONFIGURATION_CONFLICTS,
+    ERROR_EXTERNAL_API_ERROR,
+};
+
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/include/ExtArgs.hpp b/debug/accuracy_tools/msprobe/ccsrc/include/ExtArgs.hpp
new file mode 100644
index 0000000000000000000000000000000000000000..40624194e5690a974bf0b3881dfdc717ff01d064
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/include/ExtArgs.hpp
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <map>
+
+namespace MindStudioDebugger {
+
+enum class MindStudioExtensionArgs {
+    ALL_KERNEL_NAMES = 0,      /* const std::vector<std::string> --> char** */
+    IS_KBK = 1,                /* bool */
+
+    /* Add before this line */
+    ARG_MAX,
+};
+
+using ExtArgs = std::map<MindStudioExtensionArgs, void*>;
+
+template <typename T>
+T GetExtArgs(ExtArgs& args, MindStudioExtensionArgs id)
+{
+    auto it = args.find(id);
+    if (it == args.end()) {
+        return nullptr;
+    }
+
+    return reinterpret_cast<T>(it->second);
+}
+
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/include/Macro.hpp b/debug/accuracy_tools/msprobe/ccsrc/include/Macro.hpp
new file mode 100644
index 0000000000000000000000000000000000000000..f366ab426f51c5150792605ee6bf03f899c76fd2
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/include/Macro.hpp
@@ -0,0 +1,21 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +#pragma once + +#define EXPORT_SYMBOL extern "C" __attribute__((visibility("default"))) + +#define ELE_IN_VECTOR(vec, ele) (std::find((vec).begin(), (vec).end(), (ele)) != (vec).end()) diff --git a/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclApi.cpp b/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclApi.cpp new file mode 100644 index 0000000000000000000000000000000000000000..1636c6998d9096b62e9a7f281c7e5ac1b4de4818 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclApi.cpp @@ -0,0 +1,156 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#include <dlfcn.h>
+#include <map>
+#include <string>
+
+#include "base/ErrorInfos.hpp"
+#include "AclApi.hpp"
+
+namespace MindStudioDebugger {
+namespace AscendCLApi {
+
+using namespace MindStudioDebugger;
+
+constexpr const char* kLibAscendclName = "libascendcl.so";
+constexpr const char* kLibMSAscendName = "libmindspore_ascend.so.2";
+
+using aclInitFuncType = aclError (*)(const char *);
+using aclmdlInitDumpFuncType = aclError (*)();
+using aclmdlSetDumpFuncType = aclError (*)(const char *);
+using aclmdlFinalizeDumpFuncType = aclError (*)();
+using acldumpRegCallbackFuncType = aclError (*)(AclDumpCallbackFuncType, int32_t);
+using aclrtSynchronizeDeviceFuncType = aclError (*)();
+
+static aclInitFuncType aclInitFunc = nullptr;
+static aclmdlInitDumpFuncType aclmdlInitDumpFunc = nullptr;
+static aclmdlSetDumpFuncType aclmdlSetDumpFunc = nullptr;
+static aclmdlFinalizeDumpFuncType aclmdlFinalizeDumpFunc = nullptr;
+static acldumpRegCallbackFuncType acldumpRegCallbackFunc = nullptr;
+static aclrtSynchronizeDeviceFuncType aclrtSynchronizeDeviceFunc = nullptr;
+
+DebuggerErrno LoadAclApi()
+{
+    static void* hLibAscendcl = nullptr;
+
+    if (hLibAscendcl != nullptr) {
+        LOG_INFO("No need to load acl api again.");
+        return DebuggerErrno::OK;
+    }
+
+    hLibAscendcl = dlopen(kLibAscendclName, RTLD_LAZY);
+    if (hLibAscendcl == nullptr) {
+        LOG_ERROR(DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND,
+                  "Failed to search libascendcl.so."
+ std::string(dlerror()));
+        return DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND;
+    }
+
+    static const std::map<const char*, void**> functionMap = {
+        {"aclInit", reinterpret_cast<void**>(&aclInitFunc)},
+        {"aclmdlInitDump", reinterpret_cast<void**>(&aclmdlInitDumpFunc)},
+        {"aclmdlSetDump", reinterpret_cast<void**>(&aclmdlSetDumpFunc)},
+        {"aclmdlFinalizeDump", reinterpret_cast<void**>(&aclmdlFinalizeDumpFunc)},
+        {"aclrtSynchronizeDevice", reinterpret_cast<void**>(&aclrtSynchronizeDeviceFunc)},
+    };
+
+    for (auto& iter : functionMap) {
+        if (*(iter.second) != nullptr) {
+            continue;
+        }
+        *(iter.second) = dlsym(hLibAscendcl, iter.first);
+        if (*(iter.second) == nullptr) {
+            LOG_ERROR(DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND, "Failed to load function " +
+                      std::string(iter.first) + " from libascendcl.so." + std::string(dlerror()));
+            dlclose(hLibAscendcl);
+            hLibAscendcl = nullptr;
+            return DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND;
+        }
+        LOG_DEBUG("Load function " + std::string(iter.first) + " from libascendcl.so.");
+    }
+
+    /* Work around a bug in adump: in the MindSpore scenario, prefer the symbol from libmindspore_ascend.so */
+    void* handler = dlopen(kLibMSAscendName, RTLD_LAZY);
+    std::string libName = kLibMSAscendName;
+    if (handler == nullptr) {
+        handler = hLibAscendcl;
+        libName = kLibAscendclName;
+    }
+
+    acldumpRegCallbackFunc = reinterpret_cast<acldumpRegCallbackFuncType>(dlsym(handler, "acldumpRegCallback"));
+    if (acldumpRegCallbackFunc == nullptr) {
+        LOG_ERROR(DebuggerErrno::ERROR_DEPENDENCY_NOT_FIND, "Failed to load function acldumpRegCallback from " +
+                  libName + ".");
+    }
+    LOG_DEBUG("Load function acldumpRegCallback from " + libName);
+
+    if (handler != hLibAscendcl) {
+        dlclose(handler);
+    }
+
+    return DebuggerErrno::OK;
+}
+
+aclError ACLAPI_aclInit(const char* cfg)
+{
+    if (aclInitFunc == nullptr) {
+        throw std::runtime_error("API aclInit does not have a definition.");
+    }
+    return aclInitFunc(cfg);
+}
+
+aclError ACLAPI_aclmdlInitDump()
+{
+    if (aclmdlInitDumpFunc == nullptr) {
+        throw std::runtime_error("API aclmdlInitDump does not have a definition.");
+    }
+    return
aclmdlInitDumpFunc(); +} + +aclError ACLAPI_aclmdlSetDump(const char* cfg) +{ + if (aclmdlSetDumpFunc == nullptr) { + throw std::runtime_error("API aclmdlSetDump does not have a definition."); + } + return aclmdlSetDumpFunc(cfg); +} + +aclError ACLAPI_aclmdlFinalizeDump() +{ + if (aclmdlFinalizeDumpFunc == nullptr) { + throw std::runtime_error("API aclmdlFinalizeDump does not have a definition."); + } + return aclmdlFinalizeDumpFunc(); +} + +aclError ACLAPI_acldumpRegCallback(AclDumpCallbackFuncType messageCallback, int32_t flag) +{ + if (acldumpRegCallbackFunc == nullptr) { + throw std::runtime_error("API acldumpRegCallback does not have a definition."); + } + return acldumpRegCallbackFunc(messageCallback, flag); +} + +aclError ACLAPI_aclrtSynchronizeDevice() +{ + if (aclrtSynchronizeDeviceFunc == nullptr) { + throw std::runtime_error("API aclrtSynchronizeDevice does not have a definition."); + } + return aclrtSynchronizeDeviceFunc(); +} + +} +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclApi.hpp b/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclApi.hpp new file mode 100644 index 0000000000000000000000000000000000000000..731ae2e2caacaa345605ec572c8dcd6dba091488 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclApi.hpp @@ -0,0 +1,59 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#pragma once
+
+#include <cstdint>
+
+#include "include/ErrorCode.hpp"
+
+extern "C" {
+
+typedef int aclError;
+constexpr int ACL_SUCCESS = 0;
+constexpr int ACL_ERROR_NONE = 0;
+constexpr int ACL_ERROR_REPEAT_INITIALIZE = 100002;
+
+#define ACL_DUMP_MAX_FILE_PATH_LENGTH 4096
+typedef struct acldumpChunk {
+    char fileName[ACL_DUMP_MAX_FILE_PATH_LENGTH];   // Name of the dump file to write; ACL_DUMP_MAX_FILE_PATH_LENGTH is the maximum file name length, currently 4096
+    uint32_t bufLen;                                // Length of dataBuf, in bytes
+    uint32_t isLastChunk;                           // Whether this is the last chunk of the dump data: 0 = not the last chunk, 1 = the last chunk
+    int64_t offset;                                 // Offset of this chunk within the dump file; -1 means append to the file
+    int32_t flag;                                   // Reserved dump data flag; the data currently carries no flag
+    uint8_t dataBuf[0];                             // Memory address of the dump data
+} acldumpChunk;
+
+}
+
+namespace MindStudioDebugger {
+namespace AscendCLApi {
+
+DebuggerErrno LoadAclApi();
+
+using AclDumpCallbackFuncType = int32_t (*)(const acldumpChunk*, int32_t);
+aclError ACLAPI_aclInit(const char* cfg);
+aclError ACLAPI_aclmdlInitDump();
+aclError ACLAPI_aclmdlSetDump(const char* cfg);
+aclError ACLAPI_aclmdlFinalizeDump();
+aclError ACLAPI_acldumpRegCallback(AclDumpCallbackFuncType messageCallback, int32_t flag);
+
+aclError ACLAPI_aclrtSynchronizeDevice();
+
+#define CALL_ACL_API(func, ...)
MindStudioDebugger::AscendCLApi::ACLAPI_##func(__VA_ARGS__) + +} +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclDumpMsg.proto b/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclDumpMsg.proto new file mode 100644 index 0000000000000000000000000000000000000000..6ce5407bea3b6d10f4118a98170752b901d03ab8 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/third_party/ACL/AclDumpMsg.proto @@ -0,0 +1,143 @@ +syntax = "proto3"; +package toolkit.dumpdata; + +enum OutputDataType { + DT_UNDEFINED = 0; + DT_FLOAT = 1; + DT_FLOAT16 = 2; + DT_INT8 = 3; + DT_UINT8 = 4; + DT_INT16 = 5; + DT_UINT16 = 6; + DT_INT32 = 7; + DT_INT64 = 8; + DT_UINT32 = 9; + DT_UINT64 = 10; + DT_BOOL = 11; + DT_DOUBLE = 12; + DT_STRING = 13; + DT_DUAL_SUB_INT8 = 14; + DT_DUAL_SUB_UINT8 = 15; + DT_COMPLEX64 = 16; + DT_COMPLEX128 = 17; + DT_QINT8 = 18; + DT_QINT16 = 19; + DT_QINT32 = 20; + DT_QUINT8 = 21; + DT_QUINT16 = 22; + DT_RESOURCE = 23; + DT_STRING_REF = 24; + DT_DUAL = 25; + DT_VARIANT = 26; + DT_BF16 = 27; + DT_INT4 = 28; + DT_UINT1 = 29; + DT_INT2 = 30; + DT_UINT2 = 31; +} + +enum OutputFormat { + FORMAT_NCHW = 0; + FORMAT_NHWC = 1; + FORMAT_ND = 2; + FORMAT_NC1HWC0 = 3; + FORMAT_FRACTAL_Z = 4; + FORMAT_NC1C0HWPAD = 5; + FORMAT_NHWC1C0 = 6; + FORMAT_FSR_NCHW = 7; + FORMAT_FRACTAL_DECONV = 8; + FORMAT_C1HWNC0 = 9; + FORMAT_FRACTAL_DECONV_TRANSPOSE = 10; + FORMAT_FRACTAL_DECONV_SP_STRIDE_TRANS = 11; + FORMAT_NC1HWC0_C04 = 12; + FORMAT_FRACTAL_Z_C04 = 13; + FORMAT_CHWN = 14; + FORMAT_FRACTAL_DECONV_SP_STRIDE8_TRANS = 15; + FORMAT_HWCN = 16; + FORMAT_NC1KHKWHWC0 = 17; + FORMAT_BN_WEIGHT = 18; + FORMAT_FILTER_HWCK = 19; + FORMAT_HASHTABLE_LOOKUP_LOOKUPS = 20; + FORMAT_HASHTABLE_LOOKUP_KEYS = 21; + FORMAT_HASHTABLE_LOOKUP_VALUE = 22; + FORMAT_HASHTABLE_LOOKUP_OUTPUT = 23; + FORMAT_HASHTABLE_LOOKUP_HITS = 24; + FORMAT_C1HWNCoC0 = 25; + FORMAT_MD = 26; + FORMAT_NDHWC = 27; + FORMAT_FRACTAL_ZZ = 28; + FORMAT_FRACTAL_NZ = 29; + FORMAT_NCDHW = 30; + FORMAT_DHWCN = 31; // 3D 
filter input tensor format + FORMAT_NDC1HWC0 = 32; + FORMAT_FRACTAL_Z_3D=33; + FORMAT_CN = 34; + FORMAT_NC = 35; + FORMAT_DHWNC = 36; + FORMAT_FRACTAL_Z_3D_TRANSPOSE = 37; // 3D filter(transpose) input tensor format + FORMAT_FRACTAL_ZN_LSTM = 38; + FORMAT_FRACTAL_Z_G = 39; + FORMAT_RESERVED = 40; + FORMAT_ALL = 41; + FORMAT_NULL = 42; + FORMAT_ND_RNN_BIAS = 43; + FORMAT_FRACTAL_ZN_RNN = 44; + FORMAT_YUV = 45; + FORMAT_YUV_A = 46; + FORMAT_NCL = 47; + FORMAT_FRACTAL_Z_WINO = 48; + FORMAT_C1HWC0 = 49; + // Add new formats definition here + FORMAT_MAX = 0xff; +} + +message OriginalOp { + string name = 1; + uint32 output_index = 2; + OutputDataType data_type = 3; + OutputFormat format = 4; +} + +message Shape { + repeated uint64 dim = 1; +} + +message OpOutput { + OutputDataType data_type = 1; + OutputFormat format = 2; + Shape shape = 3; + OriginalOp original_op = 4; // the original op corresponding to the output + bytes data = 5; + uint64 size = 6; + Shape original_shape = 7; + int32 sub_format = 8; +} + +message OpInput { + OutputDataType data_type = 1; + OutputFormat format = 2; + Shape shape = 3; + bytes data = 4; + uint64 size = 5; + Shape original_shape = 6; + int32 sub_format = 7; +} + +enum BufferType { + L1 = 0; +} + +message OpBuffer { + BufferType buffer_type = 1; + bytes data = 2; + uint64 size = 3; +} + +message DumpData { + string version = 1; + uint64 dump_time = 2; + repeated OpOutput output = 3; + repeated OpInput input = 4; + repeated OpBuffer buffer = 5; + string op_name = 6; +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/CPythonUtils.cpp b/debug/accuracy_tools/msprobe/ccsrc/utils/CPythonUtils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..fd944f62db4ff728d1aa2c5d1d5ff818bd5dcf62 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/utils/CPythonUtils.cpp @@ -0,0 +1,542 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. 
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <cstdint>
+#include <map>
+#include <string>
+
+#include "CPythonUtils.hpp"
+
+namespace MindStudioDebugger {
+namespace CPythonUtils {
+
+static std::map<std::string, PythonObject> PyObjMap = {};
+
+int32_t RegisterPythonObject(const std::string& name, PythonObject obj)
+{
+    if (PyObjMap.find(name) != PyObjMap.end()) {
+        return -1;
+    }
+
+    PyObjMap[name] = obj;
+    return 0;
+}
+
+void UnRegisterPythonObject(const std::string& name)
+{
+    auto it = PyObjMap.find(name);
+    if (it == PyObjMap.end()) {
+        return;
+    }
+
+    PyObjMap.erase(it);
+}
+
+bool IsPyObjRegistered(const std::string& name)
+{
+    return PyObjMap.find(name) != PyObjMap.end();
+}
+
+PythonObject GetRegisteredPyObj(const std::string& name)
+{
+    auto it = PyObjMap.find(name);
+    if (it == PyObjMap.end()) {
+        return PythonObject();
+    }
+    return it->second;
+}
+
+PythonObject PythonObject::From(const PythonObject& input)
+{
+    return PythonObject(input);
+}
+
+PythonObject PythonObject::From(const int32_t& input)
+{
+    return PythonNumberObject::From(input);
+}
+
+PythonObject PythonObject::From(const uint32_t& input)
+{
+    return PythonNumberObject::From(input);
+}
+
+PythonObject PythonObject::From(const double& input)
+{
+    return PythonNumberObject::From(input);
+}
+
+PythonObject PythonObject::From(const std::string& input)
+{
+    return PythonStringObject::From(input);
+}
+
+PythonObject PythonObject::From(const char* input)
+{
+    return PythonStringObject::From(input);
+}
+
+PythonObject
PythonObject::From(const bool& input)
+{
+    return PythonBoolObject::From(input);
+}
+
+int32_t PythonObject::To(int32_t& output) const
+{
+    if (!PyLong_Check(ptr)) {
+        return -1;
+    }
+    output = static_cast<int32_t>(PyLong_AsLong(ptr));
+    return 0;
+}
+
+int32_t PythonObject::To(uint32_t& output) const
+{
+    if (!PyLong_Check(ptr)) {
+        return -1;
+    }
+    output = static_cast<uint32_t>(PyLong_AsUnsignedLong(ptr));
+    return 0;
+}
+
+int32_t PythonObject::To(double& output) const
+{
+    if (!PyFloat_Check(ptr)) {
+        return -1;
+    }
+
+    output = PyFloat_AsDouble(ptr);
+    return 0;
+}
+
+int32_t PythonObject::To(std::string& output) const
+{
+    PyObject* strObj = PyObject_Str(ptr);
+    if (strObj == nullptr) {
+        return -1;
+    }
+    const char* s = PyUnicode_AsUTF8(strObj);
+    if (s == nullptr) {
+        Py_DECREF(strObj);
+        return -1;
+    }
+    output = std::string(s);
+    Py_DECREF(strObj);
+    return 0;
+}
+
+int32_t PythonObject::To(bool& output) const
+{
+    output = static_cast<bool>(PyObject_IsTrue(ptr));
+    return 0;
+}
+
+PythonObject PythonObject::Get(const std::string& name, bool ignore) const
+{
+    PyObject* o = PyObject_GetAttrString(ptr, name.c_str());
+    if (o == nullptr && ignore) {
+        PyErr_Clear();
+    }
+    PythonObject ret(o);
+    Py_XDECREF(o);
+    return ret;
+}
+
+PythonObject PythonObject::Call(bool ignore)
+{
+    if (!PyCallable_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Object is not callable.");
+        }
+        return PythonObject();
+    }
+
+    PyObject* o = PyObject_CallObject(ptr, nullptr);
+    if (o == nullptr && ignore) {
+        PyErr_Clear();
+    }
+    PythonObject ret(o);
+    Py_XDECREF(o);
+    return ret;
+}
+
+PythonObject PythonObject::Call(PythonTupleObject& args, bool ignore)
+{
+    if (!PyCallable_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Object is not callable.");
+        }
+        return PythonObject();
+    }
+
+    PyObject* o = PyObject_CallObject(ptr, args.IsNone() ?
nullptr : args); + if (o == nullptr && ignore) { + PyErr_Clear(); + } + PythonObject ret(o); + Py_XDECREF(o); + return ret; +} + +PythonObject PythonObject::Call(PythonTupleObject& args, PythonDictObject& kwargs, bool ignore) +{ + if (!PyCallable_Check(ptr)) { + if (!ignore) { + PyErr_SetString(PyExc_TypeError, "Object is not callable."); + } + return PythonObject(); + } + + if (args.IsNone() || kwargs.IsNone()) { + if (!ignore) { + PyErr_SetString(PyExc_TypeError, "Call python object with invalid parameters."); + } + return PythonObject(); + } + + PyObject* o = PyObject_Call(ptr, args, kwargs); + if (o == nullptr && ignore) { + PyErr_Clear(); + } + PythonObject ret(o); + Py_XDECREF(o); + return ret; +} + +PythonObject PythonObject::GetGlobal(const std::string& name, bool ignore) +{ + PyObject *globals = PyEval_GetGlobals(); + if (globals == nullptr) { + if (ignore) { + PyErr_Clear(); + } + return PythonObject(); + } + + return PythonObject(PyDict_GetItemString(globals, name.c_str())); + +} + +PythonObject PythonObject::Import(const std::string& name, bool ignore) +{ + PyObject* m = PyImport_ImportModule(name.c_str()); + if (m == nullptr) { + if (ignore) { + PyErr_Clear(); + } + return PythonObject(); + } + PythonObject ret(m); + Py_XDECREF(m); + return ret; +} + +PythonNumberObject::PythonNumberObject() : PythonObject() +{ + PyObject* o = PyLong_FromLong(0); + SetPtr(o); + Py_XDECREF(o); +} + +PythonNumberObject::PythonNumberObject(PyObject* o) : PythonObject() +{ + if (!PyLong_Check(o) && !PyFloat_Check(o)) { + return; + } + + SetPtr(o); +} + +PythonNumberObject PythonNumberObject::From(const int32_t& input) +{ + PythonNumberObject ret; + PyObject* o = PyLong_FromLong(input); + if (o == nullptr) { + return ret; + } + ret.SetPtr(o); + Py_DECREF(o); + return ret; +} + +PythonNumberObject PythonNumberObject::From(const uint32_t& input) +{ + PythonNumberObject ret; + PyObject* o = PyLong_FromUnsignedLong(input); + if (o == nullptr) { + return ret; + } + 
ret.SetPtr(o); + Py_DECREF(o); + return ret; +} + +PythonNumberObject PythonNumberObject::From(const double& input) +{ + PythonNumberObject ret; + PyObject* o = PyFloat_FromDouble(input); + if (o == nullptr) { + return ret; + } + ret.SetPtr(o); + Py_DECREF(o); + return ret; +} + +PythonStringObject::PythonStringObject() : PythonObject() +{ + PyObject* o = PyUnicode_FromString(""); + SetPtr(o); + Py_XDECREF(o); +} + +PythonStringObject::PythonStringObject(PyObject* o) : PythonObject() +{ + if (!PyUnicode_Check(o)) { + return; + } + + SetPtr(o); +} + +PythonStringObject PythonStringObject::From(const std::string& input) +{ + PythonStringObject ret; + PyObject* o = PyUnicode_FromString(input.c_str()); + if (o == nullptr) { + return ret; + } + ret.SetPtr(o); + Py_DECREF(o); + return ret; +} + +PythonStringObject PythonStringObject::From(const char* input) +{ + PythonStringObject ret; + PyObject* o = PyUnicode_FromString(input); + if (o == nullptr) { + return ret; + } + ret.SetPtr(o); + Py_DECREF(o); + return ret; +} + +PythonBoolObject::PythonBoolObject() : PythonObject() +{ + SetPtr(Py_False); +} + +PythonBoolObject::PythonBoolObject(PyObject* o) : PythonObject() +{ + if (!PyBool_Check(o)) { + return; + } + + SetPtr(o); +} + +PythonBoolObject PythonBoolObject::From(const bool& input) +{ + PythonBoolObject ret; + PyObject* o = PyBool_FromLong(input); + if (o == nullptr) { + return ret; + } + ret.SetPtr(o); + Py_DECREF(o); + return ret; +} + +PythonListObject::PythonListObject() : PythonObject() +{ + PyObject* o = PyList_New(0); + SetPtr(o); + Py_XDECREF(o); +} + +PythonListObject::PythonListObject(size_t size) : PythonObject() +{ + PyObject* o = PyList_New(size); + SetPtr(o); + Py_XDECREF(o); +} + +PythonListObject::PythonListObject(PyObject* o) : PythonObject() +{ + if (!PyList_Check(o)) { + return; + } + + SetPtr(o); +} + +size_t PythonListObject::Size() const +{ + if (!PyList_Check(ptr)) { + return 0; + } + + return PyList_GET_SIZE(ptr); +} + +PythonObject 
PythonListObject::GetItem(size_t pos, bool ignore)
+{
+    if (!PyList_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Expect a list.");
+        }
+        return PythonObject();
+    }
+    if (static_cast<size_t>(PyList_GET_SIZE(ptr)) <= pos) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_IndexError, "list index out of range");
+        }
+        return PythonObject();
+    }
+
+    PyObject* o = PyList_GetItem(ptr, pos);
+    if (o == nullptr && ignore) {
+        PyErr_Clear();
+    }
+
+    return PythonObject(o);
+}
+
+PythonListObject& PythonListObject::SetItem(size_t pos, PythonObject& item, bool ignore)
+{
+    if (!PyList_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Expect a list.");
+        }
+        return *this;
+    }
+
+    if (static_cast<size_t>(PyList_GET_SIZE(ptr)) <= pos) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_IndexError, "list index out of range");
+        }
+        return *this;
+    }
+
+    if (PyList_SetItem(ptr, pos, item.NewRef()) != 0) {
+        if (ignore) {
+            PyErr_Clear();
+        }
+    }
+    return *this;
+}
+
+PythonListObject& PythonListObject::Insert(int64_t pos, PythonObject& item, bool ignore)
+{
+    if (!PyList_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Expect a list.");
+        }
+        return *this;
+    }
+
+    if (PyList_Insert(ptr, pos, item) != 0) {
+        if (ignore) {
+            PyErr_Clear();
+        }
+    }
+
+    return *this;
+}
+
+PythonTupleObject PythonListObject::ToTuple(bool ignore)
+{
+    if (!PyList_Check(ptr)) {
+        return PythonTupleObject();
+    }
+
+    PyObject* o = PyList_AsTuple(ptr);
+    if (o == nullptr && ignore) {
+        PyErr_Clear();
+    }
+    PythonTupleObject ret(o);
+    Py_XDECREF(o);
+    return ret;
+}
+
+PythonTupleObject::PythonTupleObject() : PythonObject()
+{
+    PyObject* o = PyTuple_New(0);
+    SetPtr(o);
+    Py_XDECREF(o);
+}
+
+PythonTupleObject::PythonTupleObject(PyObject* o) : PythonObject()
+{
+    if (!PyTuple_Check(o)) {
+        return;
+    }
+
+    SetPtr(o);
+}
+
+size_t PythonTupleObject::Size() const
+{
+    if (!PyTuple_Check(ptr)) {
+        return 0;
+    }
+
+    return PyTuple_GET_SIZE(ptr);
+}
+
+PythonObject
PythonTupleObject::GetItem(size_t pos, bool ignore)
+{
+    if (!PyTuple_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Expect a tuple.");
+        }
+        return PythonObject();
+    }
+    if (static_cast<size_t>(PyTuple_GET_SIZE(ptr)) <= pos) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_IndexError, "tuple index out of range");
+        }
+        return PythonObject();
+    }
+
+    PyObject* o = PyTuple_GetItem(ptr, pos);
+    if (o == nullptr && ignore) {
+        PyErr_Clear();
+    }
+
+    return PythonObject(o);
+}
+
+PythonDictObject::PythonDictObject() : PythonObject()
+{
+    PyObject* o = PyDict_New();
+    SetPtr(o);
+    Py_XDECREF(o);
+}
+
+PythonDictObject::PythonDictObject(PyObject* o) : PythonObject()
+{
+    if (!PyDict_Check(o)) {
+        return;
+    }
+
+    SetPtr(o);
+}
+
+}
+}
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/CPythonUtils.hpp b/debug/accuracy_tools/msprobe/ccsrc/utils/CPythonUtils.hpp
new file mode 100644
index 0000000000000000000000000000000000000000..40ebcb1dafd505fd7dfa3bda1c2c1609cb60297a
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/ccsrc/utils/CPythonUtils.hpp
@@ -0,0 +1,436 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <Python.h>
+
+#include <cstdint>
+#include <map>
+#include <string>
+#include <vector>
+
+namespace MindStudioDebugger {
+namespace CPythonUtils {
+
+/*
+ * C++ wrappers for common Python types; the mapping is:
+ * -------------------------------------------
+ * |  python     |       cpp wrapper         |
+ * |-----------------------------------------|
+ * | object      | PythonObject              |
+ * | str         | PythonStringObject        |
+ * | int/float   | PythonNumberObject        |
+ * | bool        | PythonBoolObject          |
+ * | list        | PythonListObject          |
+ * | tuple       | PythonTupleObject         |
+ * | dict        | PythonDictObject          |
+ * -------------------------------------------
+ *
+ * Ways to create an object:
+ * 1. From a native PyObject*: the PythonObject holds one reference to the native object for its lifetime.
+ * 2. From a C++ object, via the From methods.
+ * 3. From the interpreter context, via GetGlobal, Import, etc.
+ * 4. Via GetRegisteredPyObj, to fetch a Python object registered in the context.
+ * 5. Via Get, GetItem, etc. on an existing PythonObject.
+ *
+ * Conversions:
+ * 1. Implicit conversion is supported for PyObject*, bool, and string.
+ * 2. For other types, call the To method; a return value of 0 means success.
+ * 3. list, tuple, and dict objects whose elements all share one type can be converted directly to
+ *    vector/map; otherwise no direct conversion is possible.
+ * 4. For the To methods:
+ *    anything Python accepts for bool() can convert to bool (i.e. not only the bool type; likewise below);
+ *    anything supporting str() can convert to string;
+ *    any iterable (whose elements are convertible) can convert to vector.
+ *
+ * Passing objects:
+ * 1. A subclass can be passed or copied to a PythonObject safely.
+ * 2. A PythonObject can be passed to a subclass safely if the type matches; otherwise it becomes None.
+ * 3. When passing a PythonObject or subclass to a native CPython API that takes PyObject*:
+ *    if the API steals the reference, pass NewRef();
+ *    if the API borrows the reference, make sure the object outlives the called function
+ *    (do not construct a temporary in place).
+ */
+
+class PythonObject;
+class PythonNumberObject;
+class PythonStringObject;
+class PythonBoolObject;
+class PythonListObject;
+class PythonTupleObject;
+class PythonDictObject;
+
+/* Python <---> C++ interaction: the Python side uses _msprobe_c.CPythonAgent, the C++ side uses the functions below */
+int32_t RegisterPythonObject(const std::string& name, PythonObject obj);
+void UnRegisterPythonObject(const std::string& name);
+bool IsPyObjRegistered(const std::string& name);
+PythonObject GetRegisteredPyObj(const std::string& name);
+
+class PythonObject {
+public:
+    PythonObject() {
+        Py_INCREF(Py_None);
+        ptr = Py_None;
+    }
+    PythonObject(PyObject* o) : ptr(o) {
+        if (ptr == nullptr) {
+            ptr = Py_None;
+        }
+        Py_XINCREF(ptr);
+    }
+    ~PythonObject() {
+        Py_XDECREF(ptr);
+    }
+    PythonObject(const PythonObject &obj) : PythonObject(static_cast<PyObject*>(obj)) {}
+    PythonObject& operator=(const PythonObject &obj) {
+        SetPtr(static_cast<PyObject*>(obj));
+        return *this;
+    }
+
+    /* Get a global object */
+    static PythonObject GetGlobal(const std::string& name, bool ignore=true);
+    /* Get a module object; if it has not been loaded into the cache yet, load it once */
+    static PythonObject Import(const std::string& name, bool ignore=true);
+
+    /* From/To conversions; one copy lives in the base class for scenarios such as iterating */
+    static PythonObject From(const PythonObject& input);
+    static PythonObject From(const int32_t& input);
+    static PythonObject From(const uint32_t& input);
+    static PythonObject From(const double& input);
+    static PythonObject From(const std::string& input);
+    static PythonObject From(const char* input);
+    static PythonObject From(const bool& input);
+    template <typename T>
+    static PythonObject From(const std::vector<T>& input);
+    template <typename T1, typename T2>
+    static PythonObject From(const std::map<T1, T2>& input);
+    int32_t To(int32_t& output) const;
+    int32_t To(uint32_t& output) const;
+    int32_t To(double& output) const;
+    int32_t To(std::string& output) const;
+    int32_t To(bool& output) const;
+    template <typename T>
+    int32_t To(std::vector<T>& output) const;
+
+    bool IsNone() const {return ptr == Py_None;}
+    bool IsNumber() const {return PyLong_Check(ptr) || PyFloat_Check(ptr);}
+    bool IsString() const {return PyUnicode_Check(ptr);}
+    bool IsBool() const {return PyBool_Check(ptr);}
+    bool IsList() const {return PyList_Check(ptr);}
+    bool IsTuple() const {return PyTuple_Check(ptr);}
+    bool IsDict() const {return PyDict_Check(ptr);}
+    bool IsModule() const {return PyModule_Check(ptr);}
+    bool IsCallable() const {return PyCallable_Check(ptr);}
+
+    /* Call a callable object, like obj() in Python; for simplicity only the args+kwargs form is implemented */
+    PythonObject Call(bool ignore=true);
+    PythonObject Call(PythonTupleObject& args, bool ignore=true);
+    PythonObject Call(PythonTupleObject& args, PythonDictObject& kwargs, bool ignore=true);
+
+    /* Get an attribute of the object, like obj.xx in Python */
+    PythonObject Get(const
 std::string& name, bool ignore=true) const;
+    PythonObject& NewRef() {
+        Py_XINCREF(ptr);
+        return *this;
+    }
+    std::string ToString() const {
+        std::string ret;
+        if (To(ret) == 0) {
+            return ret;
+        }
+        return std::string();
+    }
+
+    operator PyObject*() const {return ptr;}
+    operator bool() const {return static_cast<bool>(PyObject_IsTrue(ptr));}
+    operator std::string() const {
+        return ToString();
+    }
+    PythonObject operator()(bool ignore=true) {return Call(ignore);}
+    PythonObject operator()(PythonTupleObject& args, bool ignore=true) {return Call(args, ignore);}
+    PythonObject operator()(PythonTupleObject& args, PythonDictObject& kwargs, bool ignore=true) {
+        return Call(args, kwargs, ignore);
+    }
+
+protected:
+    void SetPtr(PyObject* o) {
+        Py_XDECREF(ptr);
+        if (o == nullptr) {
+            o = Py_None;
+        }
+        Py_INCREF(o);
+        ptr = o;
+    }
+
+    PyObject* ptr{nullptr};
+
+private:
+    explicit PythonObject(PythonObject &&obj) = delete;
+    PythonObject& operator=(PythonObject &&obj) = delete;
+};
+
+class PythonNumberObject : public PythonObject {
+public:
+    PythonNumberObject();
+    PythonNumberObject(PyObject* o);
+
+    static PythonNumberObject From(const int32_t& input);
+    static PythonNumberObject From(const uint32_t& input);
+    static PythonNumberObject From(const double& input);
+};
+
+class PythonStringObject : public PythonObject {
+public:
+    PythonStringObject();
+    PythonStringObject(PyObject* o);
+
+    static PythonStringObject From(const std::string& input);
+    static PythonStringObject From(const char* input);
+};
+
+class PythonBoolObject : public PythonObject {
+public:
+    PythonBoolObject();
+    PythonBoolObject(PyObject* o);
+
+    static PythonBoolObject From(const bool& input);
+};
+
+class PythonListObject : public PythonObject {
+public:
+    PythonListObject();
+    explicit PythonListObject(size_t size);
+    PythonListObject(PyObject* o);
+
+    template <typename T>
+    static PythonListObject From(const std::vector<T>& input);
+
+    size_t Size() const;
+    template <typename T>
+    PythonListObject& Append(T value, bool
 ignore=true);
+    PythonObject GetItem(size_t pos, bool ignore=true);
+    PythonListObject& SetItem(size_t pos, PythonObject& item, bool ignore=true);
+    PythonListObject& Insert(int64_t pos, PythonObject& item, bool ignore=true);
+    PythonTupleObject ToTuple(bool ignore=true);
+};
+
+class PythonTupleObject : public PythonObject {
+public:
+    PythonTupleObject();
+    PythonTupleObject(PyObject* o);
+
+    template <typename T>
+    static PythonTupleObject From(const std::vector<T>& input);
+
+    size_t Size() const;
+    PythonObject GetItem(size_t pos, bool ignore=true);
+};
+
+class PythonDictObject : public PythonObject {
+public:
+    PythonDictObject();
+    PythonDictObject(PyObject* o);
+
+    template <typename T1, typename T2>
+    static PythonDictObject From(const std::map<T1, T2>& input);
+
+    template <typename T1, typename T2>
+    PythonDictObject& Add(T1 key, T2 value, bool ignore=true);
+    template <typename T>
+    PythonDictObject& Delete(T key, bool ignore=true);
+    template <typename T>
+    PythonObject GetItem(T key, bool ignore=true);
+};
+
+/**************************************************************************************************/
+/********* Implementation of the template functions below; callers need not read further **********/
+/**************************************************************************************************/
+template <typename T>
+PythonObject PythonObject::From(const std::vector<T>& input)
+{
+    return PythonListObject::From(input);
+}
+
+template <typename T1, typename T2>
+PythonObject PythonObject::From(const std::map<T1, T2>& input)
+{
+    return PythonDictObject::From(input);
+}
+
+template <typename T>
+int32_t PythonObject::To(std::vector<T>& output) const
+{
+    PyObject* item = nullptr;
+    PyObject* iter = PyObject_GetIter(ptr);
+    if (iter == nullptr) {
+        return -1;
+    }
+
+    while ((item = PyIter_Next(iter)) != nullptr) {
+        T tmp;
+        if (PythonObject(item).To(tmp) != 0) {
+            goto error;
+        }
+        output.emplace_back(tmp);
+        Py_DECREF(item);
+    }
+
+    Py_DECREF(iter);
+    return 0;
+error:
+    Py_DECREF(item);
+    Py_DECREF(iter);
+    return -1;
+}
+
+template <typename T>
+PythonListObject PythonListObject::From(const std::vector<T>& input)
+{
+    PyObject* o =
 PyList_New(input.size());
+    if (o == nullptr) {
+        return PythonListObject();
+    }
+
+    Py_ssize_t i = 0;
+    for (const T& ele : input) {
+        if (PyList_SetItem(o, i, PythonObject::From(ele).NewRef()) != 0) {
+            Py_DECREF(o);
+            return PythonListObject();
+        }
+        i++;
+    }
+
+    PythonListObject ret(o);
+    Py_DECREF(o);
+    return ret;
+}
+
+template <typename T>
+PythonListObject& PythonListObject::Append(T value, bool ignore)
+{
+    if (!PyList_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Expect a list.");
+        }
+        return *this;
+    }
+
+    PythonObject o = PythonObject::From(value);
+    PyList_Append(ptr, o);
+    return *this;
+}
+
+template <typename T>
+PythonTupleObject PythonTupleObject::From(const std::vector<T>& input)
+{
+    PyObject* o = PyTuple_New(input.size());
+    if (o == nullptr) {
+        return PythonTupleObject();
+    }
+
+    Py_ssize_t i = 0;
+
+    for (const T& ele : input) {
+        if (PyTuple_SetItem(o, i, PythonObject::From(ele).NewRef()) != 0) {
+            Py_DECREF(o);
+            return PythonTupleObject();
+        }
+        i++;
+    }
+
+    PythonTupleObject ret(o);
+    Py_DECREF(o);
+    return ret;
+}
+
+template <typename T1, typename T2>
+PythonDictObject PythonDictObject::From(const std::map<T1, T2>& input)
+{
+    PyObject* o = PyDict_New();
+    if (o == nullptr) {
+        return PythonDictObject();
+    }
+    for (const std::pair<T1, T2>& pair : input) {
+        PythonObject key = PythonObject::From(pair.first);
+        PythonObject value = PythonObject::From(pair.second);
+        if (PyDict_SetItem(o, key.NewRef(), value.NewRef()) != 0) {
+            Py_DECREF(o);
+            return PythonDictObject();
+        }
+    }
+
+    PythonDictObject ret(o);
+    Py_DECREF(o);
+    return ret;
+}
+
+template <typename T1, typename T2>
+PythonDictObject& PythonDictObject::Add(T1 key, T2 value, bool ignore)
+{
+    if (!PyDict_Check(ptr)) {
+        if (!ignore) {
+            PyErr_SetString(PyExc_TypeError, "Expect a dict.");
+        }
+        return *this;
+    }
+
+    if (PyDict_SetItem(ptr, PythonObject::From(key).NewRef(), PythonObject::From(value).NewRef()) != 0) {
+        if (ignore) {
+            PyErr_Clear();
+        }
+    }
+    return *this;
+}
+
+template <typename T>
+PythonDictObject& PythonDictObject::Delete(T key, bool ignore)
+{
if (!PyDict_Check(ptr)) { + if (!ignore) { + PyErr_SetString(PyExc_TypeError, "Expect a dict."); + } + return *this; + } + + PythonObject o = PythonObject::From(key); + if (PyDict_DelItem(ptr, o) != 0) { + if (ignore) { + PyErr_Clear(); + } + } + return *this; +} + +template +PythonObject PythonDictObject::GetItem(T key, bool ignore) +{ + if (!PyDict_Check(ptr)) { + if (!ignore) { + PyErr_SetString(PyExc_TypeError, "Expect a dict."); + } + return *this; + } + + PythonObject o = PythonObject::From(key); + PyObject* item = PyDict_GetItem(ptr, o); + if (item == nullptr && ignore) { + PyErr_Clear(); + } + return PythonObject(item); +} + +} +} diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/DataUtils.cpp b/debug/accuracy_tools/msprobe/ccsrc/utils/DataUtils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..c2d7df85294f7c96f0fe1a1b9458dfd2ad2e502c --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/utils/DataUtils.cpp @@ -0,0 +1,213 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <sstream>
+#include <stdexcept>
+#include <unordered_map>
+
+#include "DataUtils.hpp"
+
+namespace MindStudioDebugger {
+namespace DataUtils {
+
+int64_t SizeToS64(size_t v) {
+    if (v > static_cast<uint64_t>(INT64_MAX)) {
+        throw std::runtime_error("Value " + std::to_string(v) + " exceeds the maximum value of int64.");
+    }
+    return static_cast<int64_t>(v);
+}
+
+std::string U64ToHexString(uint64_t v) {
+    std::stringstream ss;
+    ss << "0x" << std::hex << std::uppercase << v;
+    return ss.str();
+}
+
+BFloat16::BFloat16(float f32)
+{
+    if (std::isnan(f32)) {
+        value_ = BFloat16::nan_value;
+    } else {
+        union {
+            uint32_t U32;
+            float F32;
+        };
+        F32 = f32;
+        uint32_t rounding_bias = ((U32 >> 16) & 1) + UINT32_C(0x7FFF);
+        value_ = static_cast<uint16_t>((U32 + rounding_bias) >> 16);
+    }
+}
+
+BFloat16::operator float() const
+{
+    float f32 = 0;
+    uint32_t tmp = value_;
+    tmp <<= 16;
+    std::memcpy(&f32, &tmp, sizeof(f32));
+    return f32;
+}
+
+const static std::unordered_map<DataType, size_t> kTypeSizeMap = {
+    {DataType::DT_BOOL, 1},
+    {DataType::DT_INT8, 1},
+    {DataType::DT_UINT8, 1},
+    {DataType::DT_INT16, 2},
+    {DataType::DT_UINT16, 2},
+    {DataType::DT_FLOAT16, 2},
+    {DataType::DT_BF16, 2},
+    {DataType::DT_INT32, 4},
+    {DataType::DT_UINT32, 4},
+    {DataType::DT_FLOAT, 4},
+    {DataType::DT_INT64, 8},
+    {DataType::DT_UINT64, 8},
+    {DataType::DT_DOUBLE, 8},
+    {DataType::DT_COMPLEX64, 8},
+    {DataType::DT_COMPLEX128, 16},
+};
+
+size_t SizeOfDType(DataType type)
+{
+    auto it = kTypeSizeMap.find(type);
+    if (it == kTypeSizeMap.end()) {
+        return 0;
+    }
+    return it->second;
+}
+
+constexpr auto kOpDType_UNKNOWN = "UNKNOWN";
+const static std::unordered_map<DataType, std::string> kDDTypeToStringMap = {
+    {DataType::DT_UNDEFINED, "UNDEFINED"},
+    {DataType::DT_FLOAT, "FLOAT"},
+    {DataType::DT_FLOAT16, "FLOAT16"},
+    {DataType::DT_INT8, "INT8"},
+    {DataType::DT_UINT8, "UINT8"},
+    {DataType::DT_INT16, "INT16"},
+    {DataType::DT_UINT16, "UINT16"},
+    {DataType::DT_INT32, "INT32"},
+    {DataType::DT_INT64, "INT64"},
{DataType::DT_UINT32, "UINT32"}, + {DataType::DT_UINT64, "UINT64"}, + {DataType::DT_BOOL, "BOOL"}, + {DataType::DT_DOUBLE, "DOUBLE"}, + {DataType::DT_STRING, "STRING"}, + {DataType::DT_DUAL_SUB_INT8, "DUAL_SUB_INT8"}, + {DataType::DT_DUAL_SUB_UINT8, "DUAL_SUB_UINT8"}, + {DataType::DT_COMPLEX64, "COMPLEX64"}, + {DataType::DT_COMPLEX128, "COMPLEX128"}, + {DataType::DT_QINT8, "QINT8"}, + {DataType::DT_QINT16, "QINT16"}, + {DataType::DT_QINT32, "QINT32"}, + {DataType::DT_QUINT8, "QUINT8"}, + {DataType::DT_QUINT16, "QUINT16"}, + {DataType::DT_RESOURCE, "RESOURCE"}, + {DataType::DT_STRING_REF, "STRING_REF"}, + {DataType::DT_DUAL, "DUAL"}, + {DataType::DT_VARIANT, "VARIANT"}, + {DataType::DT_BF16, "BF16"}, + {DataType::DT_INT4, "INT4"}, + {DataType::DT_UINT1, "UINT1"}, + {DataType::DT_INT2, "INT2"}, + {DataType::DT_UINT2, "UINT2"}, +}; + +std::string GetDTypeString(DataType dtype) +{ + auto it = kDDTypeToStringMap.find(dtype); + if (it != kDDTypeToStringMap.end()) { + return it->second; + } + return kOpDType_UNKNOWN; +} + +constexpr auto kOpFormat_UNKNOWN = "UNKNOWN"; +const static std::unordered_map kFormatToStringMap = { + {TensorFormat::FORMAT_NCHW, "NCHW"}, + {TensorFormat::FORMAT_NHWC, "NHWC"}, + {TensorFormat::FORMAT_ND, "ND"}, + {TensorFormat::FORMAT_NC1HWC0, "NC1HWC0"}, + {TensorFormat::FORMAT_FRACTAL_Z, "FRACTAL_Z"}, + {TensorFormat::FORMAT_NC1C0HWPAD, "NC1C0HWPAD"}, + {TensorFormat::FORMAT_NHWC1C0, "NHWC1C0"}, + {TensorFormat::FORMAT_FSR_NCHW, "FSR_NCHW"}, + {TensorFormat::FORMAT_FRACTAL_DECONV, "FRACTAL_DECONV"}, + {TensorFormat::FORMAT_C1HWNC0, "C1HWNC0"}, + {TensorFormat::FORMAT_FRACTAL_DECONV_TRANSPOSE, "FRACTAL_DECONV_TRANSPOSE"}, + {TensorFormat::FORMAT_FRACTAL_DECONV_SP_STRIDE_TRANS, "FRACTAL_DECONV_SP_STRIDE_TRANS"}, + {TensorFormat::FORMAT_NC1HWC0_C04, "NC1HWC0_C04"}, + {TensorFormat::FORMAT_FRACTAL_Z_C04, "FRACTAL_Z_C04"}, + {TensorFormat::FORMAT_CHWN, "CHWN"}, + {TensorFormat::FORMAT_FRACTAL_DECONV_SP_STRIDE8_TRANS, "FRACTAL_DECONV_SP_STRIDE8_TRANS"}, 
+ {TensorFormat::FORMAT_HWCN, "HWCN"}, + {TensorFormat::FORMAT_NC1KHKWHWC0, "NC1KHKWHWC0"}, + {TensorFormat::FORMAT_BN_WEIGHT, "BN_WEIGHT"}, + {TensorFormat::FORMAT_FILTER_HWCK, "FILTER_HWCK"}, + {TensorFormat::FORMAT_HASHTABLE_LOOKUP_LOOKUPS, "HASHTABLE_LOOKUP_LOOKUPS"}, + {TensorFormat::FORMAT_HASHTABLE_LOOKUP_KEYS, "HASHTABLE_LOOKUP_KEYS"}, + {TensorFormat::FORMAT_HASHTABLE_LOOKUP_VALUE, "HASHTABLE_LOOKUP_VALUE"}, + {TensorFormat::FORMAT_HASHTABLE_LOOKUP_OUTPUT, "HASHTABLE_LOOKUP_OUTPUT"}, + {TensorFormat::FORMAT_HASHTABLE_LOOKUP_HITS, "HASHTABLE_LOOKUP_HITS"}, + {TensorFormat::FORMAT_C1HWNCoC0, "C1HWNCoC0"}, + {TensorFormat::FORMAT_MD, "MD"}, + {TensorFormat::FORMAT_NDHWC, "NDHWC"}, + {TensorFormat::FORMAT_FRACTAL_ZZ, "FRACTAL_ZZ"}, + {TensorFormat::FORMAT_FRACTAL_NZ, "FRACTAL_NZ"}, + {TensorFormat::FORMAT_NCDHW, "NCDHW"}, + {TensorFormat::FORMAT_DHWCN, "DHWCN"}, + {TensorFormat::FORMAT_NDC1HWC0, "NDC1HWC0"}, + {TensorFormat::FORMAT_FRACTAL_Z_3D, "FRACTAL_Z_3D"}, + {TensorFormat::FORMAT_CN, "CN"}, + {TensorFormat::FORMAT_NC, "NC"}, + {TensorFormat::FORMAT_DHWNC, "DHWNC"}, + {TensorFormat::FORMAT_FRACTAL_Z_3D_TRANSPOSE, "FRACTAL_Z_3D_TRANSPOSE"}, + {TensorFormat::FORMAT_FRACTAL_ZN_LSTM, "FRACTAL_ZN_LSTM"}, + {TensorFormat::FORMAT_FRACTAL_Z_G, "FRACTAL_Z_G"}, + {TensorFormat::FORMAT_RESERVED, "RESERVED"}, + {TensorFormat::FORMAT_ALL, "ALL"}, + {TensorFormat::FORMAT_NULL, "NULL"}, + {TensorFormat::FORMAT_ND_RNN_BIAS, "ND_RNN_BIAS"}, + {TensorFormat::FORMAT_FRACTAL_ZN_RNN, "FRACTAL_ZN_RNN"}, + {TensorFormat::FORMAT_YUV, "YUV"}, + {TensorFormat::FORMAT_YUV_A, "YUV_A"}, + {TensorFormat::FORMAT_NCL, "NCL"}, + {TensorFormat::FORMAT_FRACTAL_Z_WINO, "FRACTAL_Z_WINO"}, + {TensorFormat::FORMAT_C1HWC0, "C1HWC0"}, +}; + +std::string GetFormatString(TensorFormat fmt) +{ + auto it = kFormatToStringMap.find(fmt); + if (it != kFormatToStringMap.end()) { + return it->second; + } + return kOpFormat_UNKNOWN; +} + +std::string GetShapeString(const TensorShape& shape) +{ + 
std::ostringstream buffer; + buffer << "("; + for (size_t i = 0; i < shape.size(); i++) { + buffer << (i > 0 ? "," : "") << shape[i]; + } + buffer << ")"; + return buffer.str(); +} + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/DataUtils.hpp b/debug/accuracy_tools/msprobe/ccsrc/utils/DataUtils.hpp new file mode 100644 index 0000000000000000000000000000000000000000..f58e15a8c77719f62ddeef8ebbcd25a5b5ebf624 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/utils/DataUtils.hpp @@ -0,0 +1,169 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+#pragma once
+
+#include <cstdint>
+#include <string>
+#include <vector>
+#include <endian.h>
+
+namespace MindStudioDebugger {
+namespace DataUtils {
+
+inline uint64_t UnpackUint64Value_Le(const void* data)
+{
+    return le64toh(*reinterpret_cast<const uint64_t*>(data));
+}
+inline uint64_t UnpackUint64Value_Be(const void* data)
+{
+    return be64toh(*reinterpret_cast<const uint64_t*>(data));
+}
+
+int64_t SizeToS64(size_t v);
+std::string U64ToHexString(uint64_t v);
+
+class BFloat16 {
+public:
+    static constexpr uint16_t value_mask = 0x7fff;
+    static constexpr uint16_t inf_value = 0x7f80;
+    static constexpr uint16_t nan_value = 0x7fc0;
+    static constexpr uint16_t true_value = 0x3c00;
+    static constexpr uint32_t f32_inf_value = 0x7f800000;
+
+    BFloat16() = default;
+    ~BFloat16() = default;
+    BFloat16(const BFloat16 &other) noexcept = default;
+    BFloat16(BFloat16 &&other) noexcept = default;
+    BFloat16 &operator=(const BFloat16 &other) noexcept = default;
+    BFloat16 &operator=(BFloat16 &&other) noexcept = default;
+
+    explicit BFloat16(float f);
+    explicit operator float() const;
+    BFloat16 operator+(const BFloat16& other) const
+    { return BFloat16(static_cast<float>(*this) + static_cast<float>(other)); }
+    float operator+(const float other) const { return static_cast<float>(*this) + other; }
+private:
+    uint16_t value_;
+};
+
+inline float operator+(const float fp32, const BFloat16& bf16)
+{
+    return fp32 + static_cast<float>(bf16);
+}
+
+using ShapeBaseType = int64_t;
+using TensorShape = std::vector<ShapeBaseType>;
+
+enum DataType : int {
+    DT_UNDEFINED = 0,
+    DT_FLOAT = 1,
+    DT_FLOAT16 = 2,
+    DT_INT8 = 3,
+    DT_UINT8 = 4,
+    DT_INT16 = 5,
+    DT_UINT16 = 6,
+    DT_INT32 = 7,
+    DT_INT64 = 8,
+    DT_UINT32 = 9,
+    DT_UINT64 = 10,
+    DT_BOOL = 11,
+    DT_DOUBLE = 12,
+    DT_STRING = 13,
+    DT_DUAL_SUB_INT8 = 14,
+    DT_DUAL_SUB_UINT8 = 15,
+    DT_COMPLEX64 = 16,
+    DT_COMPLEX128 = 17,
+    DT_QINT8 = 18,
+    DT_QINT16 = 19,
+    DT_QINT32 = 20,
+    DT_QUINT8 = 21,
+    DT_QUINT16 = 22,
+    DT_RESOURCE = 23,
+    DT_STRING_REF = 24,
+    DT_DUAL = 25,
+    DT_VARIANT = 26,
+    DT_BF16 = 27,
+    DT_INT4 = 28,
+    DT_UINT1
= 29, + DT_INT2 = 30, + DT_UINT2 = 31, + /* Add before this line */ + DT_MAX +}; + +enum TensorFormat : int { + FORMAT_NCHW = 0, + FORMAT_NHWC = 1, + FORMAT_ND = 2, + FORMAT_NC1HWC0 = 3, + FORMAT_FRACTAL_Z = 4, + FORMAT_NC1C0HWPAD = 5, + FORMAT_NHWC1C0 = 6, + FORMAT_FSR_NCHW = 7, + FORMAT_FRACTAL_DECONV = 8, + FORMAT_C1HWNC0 = 9, + FORMAT_FRACTAL_DECONV_TRANSPOSE = 10, + FORMAT_FRACTAL_DECONV_SP_STRIDE_TRANS = 11, + FORMAT_NC1HWC0_C04 = 12, + FORMAT_FRACTAL_Z_C04 = 13, + FORMAT_CHWN = 14, + FORMAT_FRACTAL_DECONV_SP_STRIDE8_TRANS = 15, + FORMAT_HWCN = 16, + FORMAT_NC1KHKWHWC0 = 17, + FORMAT_BN_WEIGHT = 18, + FORMAT_FILTER_HWCK = 19, + FORMAT_HASHTABLE_LOOKUP_LOOKUPS = 20, + FORMAT_HASHTABLE_LOOKUP_KEYS = 21, + FORMAT_HASHTABLE_LOOKUP_VALUE = 22, + FORMAT_HASHTABLE_LOOKUP_OUTPUT = 23, + FORMAT_HASHTABLE_LOOKUP_HITS = 24, + FORMAT_C1HWNCoC0 = 25, + FORMAT_MD = 26, + FORMAT_NDHWC = 27, + FORMAT_FRACTAL_ZZ = 28, + FORMAT_FRACTAL_NZ = 29, + FORMAT_NCDHW = 30, + FORMAT_DHWCN = 31, + FORMAT_NDC1HWC0 = 32, + FORMAT_FRACTAL_Z_3D = 33, + FORMAT_CN = 34, + FORMAT_NC = 35, + FORMAT_DHWNC = 36, + FORMAT_FRACTAL_Z_3D_TRANSPOSE = 37, + FORMAT_FRACTAL_ZN_LSTM = 38, + FORMAT_FRACTAL_Z_G = 39, + FORMAT_RESERVED = 40, + FORMAT_ALL = 41, + FORMAT_NULL = 42, + FORMAT_ND_RNN_BIAS = 43, + FORMAT_FRACTAL_ZN_RNN = 44, + FORMAT_YUV = 45, + FORMAT_YUV_A = 46, + FORMAT_NCL = 47, + FORMAT_FRACTAL_Z_WINO = 48, + FORMAT_C1HWC0 = 49, + /* Add before this line */ + FORMAT_MAX +}; + +size_t SizeOfDType(DataType type); +std::string GetDTypeString(DataType dtype); +std::string GetFormatString(TensorFormat fmt); +std::string GetShapeString(const TensorShape& shape); + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/FileOperation.cpp b/debug/accuracy_tools/msprobe/ccsrc/utils/FileOperation.cpp new file mode 100644 index 0000000000000000000000000000000000000000..7f025e568abdfe95830902d1e72bdb77300f7de5 --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/ccsrc/utils/FileOperation.cpp
@@ -0,0 +1,179 @@
+/*
+ * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <algorithm>
+#include <climits>
+#include <fstream>
+#include <sstream>
+#include <unordered_map>
+#include <utility>
+
+#include "FileUtils.hpp"
+#include "DataUtils.hpp"
+#include "FileOperation.hpp"
+
+namespace MindStudioDebugger {
+namespace FileOperation {
+
+using namespace MindStudioDebugger;
+using DataType = DataUtils::DataType;
+using NpyVersion = std::pair<char, char>;
+
+struct NpyDtypeDescr {
+    char byteorder;
+    char type;
+    size_t length;
+
+    std::string str() const {
+        std::ostringstream buffer;
+        buffer << "\'" << byteorder << type << length << "\'";
+        return buffer.str();
+    }
+};
+
+// npy file header start information
+constexpr char kNpyMagicPrefix[] = "\x93NUMPY";
+constexpr size_t kNpyMagicLen = sizeof(kNpyMagicPrefix) - 1;
+constexpr size_t kNpyArrayAlign = 64;
+static const std::unordered_map<DataType, NpyDtypeDescr> npyTypeDescMap = {
+    {DataType::DT_BOOL, NpyDtypeDescr{'|', 'b', 1}},   {DataType::DT_INT8, NpyDtypeDescr{'|', 'i', 1}},
+    {DataType::DT_INT16, NpyDtypeDescr{'<', 'i', 2}},  {DataType::DT_INT32, NpyDtypeDescr{'<', 'i', 4}},
+    {DataType::DT_INT64, NpyDtypeDescr{'<', 'i', 8}},  {DataType::DT_UINT8, NpyDtypeDescr{'|', 'u', 1}},
+    {DataType::DT_UINT16, NpyDtypeDescr{'<', 'u', 2}}, {DataType::DT_UINT32, NpyDtypeDescr{'<', 'u', 4}},
+    {DataType::DT_UINT64, NpyDtypeDescr{'<', 'u', 8}}, {DataType::DT_FLOAT16, NpyDtypeDescr{'<', 'f', 2}},
{DataType::DT_FLOAT, NpyDtypeDescr{'<', 'f', 4}}, {DataType::DT_DOUBLE, NpyDtypeDescr{'<', 'f', 8}}, + {DataType::DT_COMPLEX128, NpyDtypeDescr{'<', 'c', 16}}, {DataType::DT_COMPLEX64, NpyDtypeDescr{'<', 'c', 8}}, +}; + +DebuggerErrno DumpJson(const std::string &path, const nlohmann::json& content) +{ + DebuggerErrno ret; + std::ofstream ofs; + + ret = FileUtils::OpenFile(path, ofs); + if (ret != DebuggerErrno::OK) { + return ret; + } + + try { + ofs << content.dump(); + } catch (std::exception &e) { + ret = DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE; + } + + if (ofs.fail()) { + ret = DebuggerErrno::ERROR_FAILED_TO_WRITE_FILE; + } + + ofs.close(); + return ret; +} + +inline static std::string NpyTransShapeToStr(const DataUtils::TensorShape &shape) +{ + std::ostringstream buffer; + buffer << "("; + for (const auto i : shape) { + buffer << std::to_string(i) << ","; + } + buffer << ")"; + return buffer.str(); +} + +inline static std::vector NpyLen2Bytes(size_t length, size_t lengthLen) { + std::vector buff; + lengthLen = std::min(lengthLen, static_cast(sizeof(length))); + for (size_t i = 0; i < lengthLen; i++) { + buff.emplace_back(length & 0xff); + length >>= CHAR_BIT; + } + return buff; +} + +static std::string GenerateNpyHeader(const DataUtils::TensorShape &shape, DataUtils::DataType dt, bool fortranOrder=false) +{ + auto typeDesc = npyTypeDescMap.find(dt); + if (typeDesc == npyTypeDescMap.end()) { + return std::string(); + } + + std::ostringstream buffer; + std::string fortranOrderStr = fortranOrder ? 
"True" : "False" ; + + buffer << "{"; + buffer << "'descr': " << typeDesc->second.str() << ", "; + buffer << "'fortran_order': " << fortranOrderStr << ", "; + buffer << "'shape': " << NpyTransShapeToStr(shape) << ", "; + buffer << "}"; + + std::string headerStr = buffer.str(); + NpyVersion version{1, 0}; + const size_t headerLen = headerStr.length(); + constexpr const size_t versionLen = 2; + constexpr const size_t maxLen = 65535; + constexpr const size_t lengthLenV1 = 2; + constexpr const size_t lengthLenV2 = 4; + size_t lengthLen = lengthLenV1; + + size_t totalLen = kNpyMagicLen + versionLen + lengthLen + headerLen + 1; + if (totalLen > maxLen) { + version = {2, 0}; + lengthLen = lengthLenV2; + totalLen = kNpyMagicLen + versionLen + lengthLen + headerLen + 1; + } + + const size_t padLen = kNpyArrayAlign - totalLen % kNpyArrayAlign; + const size_t paddingHeaderLen = headerLen + padLen + 1; + const std::string padding(padLen, ' '); + std::vector lengthBytes = NpyLen2Bytes(paddingHeaderLen, lengthLen); + std::ostringstream out; + out.write(kNpyMagicPrefix, DataUtils::SizeToS64(kNpyMagicLen)); + out.put(version.first); + out.put(version.second); + out.write(lengthBytes.data(), DataUtils::SizeToS64(lengthBytes.size())); + out << headerStr << padding << "\n"; + return out.str(); +} + +bool IsDtypeSupportByNpy(DataUtils::DataType dt) +{ + return npyTypeDescMap.find(dt) != npyTypeDescMap.end(); +} + +DebuggerErrno DumpNpy(const std::string &path, const uint8_t* data, size_t len, DataUtils::DataType dt, + const DataUtils::TensorShape& shape) +{ + DebuggerErrno ret; + std::string header = GenerateNpyHeader(shape, dt); + if (header.empty()) { + return DebuggerErrno::ERROR_INVALID_FORMAT; + } + + std::ofstream fd; + ret = FileUtils::OpenFile(path, fd, std::ios::out | std::ios::binary); + if (ret != DebuggerErrno::OK) { + return ret; + } + + fd << header; + fd.write(reinterpret_cast(data), len); + if (fd.fail()) { + ret = DebuggerErrno::ERROR_OPERATION_FAILED; + } + 
fd.close(); + + return ret; +} + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/FileOperation.hpp b/debug/accuracy_tools/msprobe/ccsrc/utils/FileOperation.hpp new file mode 100644 index 0000000000000000000000000000000000000000..3f89263ae3621d33f5bbc8a67e86887d8063067e --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/utils/FileOperation.hpp @@ -0,0 +1,38 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#pragma once + +#include + +#include "include/ErrorCode.hpp" +#include "DataUtils.hpp" + +namespace MindStudioDebugger { + +constexpr const char* JSON_SUFFIX = "json"; +constexpr const char* NPY_SUFFIX = "npy"; +constexpr const char* CSV_SUFFIX = "csv"; + +namespace FileOperation { + +DebuggerErrno DumpJson(const std::string &path, const nlohmann::json& content); +bool IsDtypeSupportByNpy(DataUtils::DataType dt); +DebuggerErrno DumpNpy(const std::string &path, const uint8_t* data, size_t len, DataUtils::DataType dt, + const DataUtils::TensorShape& shape); + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/FileUtils.cpp b/debug/accuracy_tools/msprobe/ccsrc/utils/FileUtils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..246f899690ccd0e306f5b6b550870406086430cc --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/utils/FileUtils.cpp @@ -0,0 +1,662 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "include/ErrorCode.hpp" +#include "FileUtils.hpp" + +/* 部分环境上c++版本比较老,这里不用filesystem库实现 */ + +namespace MindStudioDebugger { +namespace FileUtils { + +using namespace MindStudioDebugger; + +/********************* 基础检查函数库,不做过多校验,路径有效性由调用者保证 ******************/ +bool IsPathExist(const std::string& path) { + struct stat buffer; + return (stat(path.c_str(), &buffer) == 0); +} + +static std::string GetFullPath(const std::string &originPath) +{ + if (originPath.empty()) { + return ""; + } + if (originPath[0] == '/') { + return originPath; + } + + std::string cwd; + char cwdBuf[PATH_MAX]; + + if (getcwd(cwdBuf, PATH_MAX) == nullptr) { + return ""; + } + + cwd = cwdBuf; + std::string fullPath = std::move(cwd + pathSeparator + originPath); + + return fullPath; +} + +std::vector SplitPath(const std::string &path, char separator) +{ + std::vector tokens; + size_t len = path.length(); + size_t start = 0; + + while (start < len) { + size_t end = path.find(separator, start); + if (end == std::string::npos) { + end = len; + } + if (start != end) { + tokens.push_back(path.substr(start, end - start)); + } + start = end + 1; + } + return tokens; +} + +std::string GetAbsPath(const std::string &originPath) { + std::string fullPath = GetFullPath(originPath); + if (fullPath.empty()) { + return ""; + } + + std::vector tokens = SplitPath(fullPath); + std::vector tokensRefined; + + for (std::string& token : tokens) { + if (token.empty() || token == ".") { + continue; + } else if (token == "..") { + if (tokensRefined.empty()) { + return ""; + } + tokensRefined.pop_back(); + } else { + tokensRefined.emplace_back(token); + } + } + + if (tokensRefined.empty()) { + return "/"; + } + + std::string resolvedPath(""); + for (std::string& token : tokensRefined) { + resolvedPath.append("/").append(token); + } + + return resolvedPath; +} + +bool IsDir(const 
std::string& path) { + struct stat buffer; + if (stat(path.c_str(), &buffer) == 0) { + return (buffer.st_mode & S_IFDIR) != 0; + } + return false; +} + +bool IsRegularFile(const std::string& path) { + struct stat path_stat; + if (stat(path.c_str(), &path_stat) == 0) { + return S_ISREG(path_stat.st_mode); + } + return false; +} + +bool IsFileSymbolLink(const std::string& path) { + struct stat buffer; + if (lstat(path.c_str(), &buffer) == 0) { + if (S_ISLNK(buffer.st_mode)) { + return true; + } + } + return false; +} + +bool IsPathCharactersValid(const std::string& path) { + for (const char& ch : path) { + if (!std::isalnum(ch) && ch != '_' && ch != '.' && ch != ':' && ch != '/' && ch != '-') { + return false; + } + } + return true; +} + +bool IsFileReadable(const std::string& path) +{ + return access(path.c_str(), R_OK) == 0; +} + +bool IsFileWritable(const std::string& path) +{ + return access(path.c_str(), W_OK) == 0; +} + +bool IsFileExecutable(const std::string& path) +{ + return (access(path.c_str(), R_OK) == 0) && (access(path.c_str(), X_OK) == 0); +} + +bool IsDirReadable(const std::string& path) +{ + return (access(path.c_str(), R_OK) == 0) && (access(path.c_str(), X_OK) == 0); +} + +std::string GetParentDir(const std::string& path) +{ + size_t found = path.find_last_of('/'); + if (found != std::string::npos) { + return path.substr(0, found); + } + return "."; +} + +std::string GetFileName(const std::string& path) +{ + size_t found = path.find_last_of('/'); + if (found != std::string::npos) { + return path.substr(found + 1); + } + return path; +} + +std::string GetFileBaseName(const std::string& path) +{ + std::string fileName = GetFileName(path); + size_t dotPos = fileName.find_last_of('.'); + if (dotPos != std::string::npos) { + return fileName.substr(0, dotPos); + } + return fileName; +} + +std::string GetFileSuffix(const std::string& path) +{ + std::string fileName = GetFileName(path); + size_t dotPos = fileName.find_last_of('.'); + if (dotPos != 
std::string::npos && dotPos + 1 < fileName.size()) { + return fileName.substr(dotPos + 1); + } + return ""; +} + +bool CheckFileRWX(const std::string& path, const std::string& permissions) +{ + if (permissions.find('r') != std::string::npos && !IsFileReadable(path)) { + return false; + } + if (permissions.find('w') != std::string::npos && !IsFileWritable(path)) { + return false; + } + if (permissions.find('x') != std::string::npos && !IsFileExecutable(path)) { + return false; + } + return true; +} + +bool IsPathLengthLegal(const std::string& path) +{ + if (path.length() > FULL_PATH_LENGTH_MAX || path.length() == 0) { + return false; + } + + std::vector tokens = SplitPath(path); + for (auto token : tokens) { + if (token.length() > FILE_NAME_LENGTH_MAX) { + return false; + } + } + + return true; +} + +bool IsPathDepthValid(const std::string& path) +{ + return std::count(path.begin(), path.end(), pathSeparator) <= PATH_DEPTH_MAX; +} + +bool IsFileOwner(const std::string& path) +{ + struct stat file_stat; + if (stat(path.c_str(), &file_stat) == 0) { + if (file_stat.st_uid == getuid()) { + return true; + } + } + return false; +} + +/****************** 文件操作函数库,会对入参做基本检查 ************************/ +DebuggerErrno DeleteFile(const std::string &path) { + if (!IsPathExist(path)) { + return DebuggerErrno::OK; + } + if (IsFileSymbolLink(path)) { + return DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK; + } + + if (remove(path.c_str()) == 0) { + return DebuggerErrno::OK; + } else { + return DebuggerErrno::ERROR_SYSCALL_FAILED; + } +} + +static DebuggerErrno DeleteDirRec(const std::string &path, uint32_t depth) +{ + if (depth > PATH_DEPTH_MAX) { + return DebuggerErrno::ERROR_PATH_TOO_DEEP; + } + + DebuggerErrno ret; + DIR* dir = opendir(path.c_str()); + if (dir == nullptr) { + return DebuggerErrno::ERROR_SYSCALL_FAILED; + } + + struct dirent* entry; + while ((entry = readdir(dir)) != nullptr) { + if (strcmp(entry->d_name, ".") == 0 || (strcmp(entry->d_name, "..") == 0)) { + continue; + } 
+ std::string entryPath = path + "/" + entry->d_name; + if (entry->d_type == DT_DIR) { + ret = DeleteDirRec(entryPath, depth + 1); + if (ret != DebuggerErrno::OK) { + closedir(dir); + return ret; + } + } else if (entry->d_type == DT_REG || entry->d_type == DT_LNK) { + if (remove(entryPath.c_str()) != 0) { + closedir(dir); + return DebuggerErrno::ERROR_SYSCALL_FAILED; + } + } else { + closedir(dir); + return DebuggerErrno::ERROR_ILLEGAL_FILE_TYPE; + } + + } + + closedir(dir); + if (rmdir(path.c_str()) != 0) { + if (errno == EACCES || errno == EROFS) { + return DebuggerErrno::ERROR_PERMISSION_DENINED; + } else { + return DebuggerErrno::ERROR_SYSCALL_FAILED; + } + } + + return DebuggerErrno::OK; +} + +DebuggerErrno DeleteDir(const std::string &path, bool recursion) { + if (!IsPathExist(path)) { + return DebuggerErrno::OK; + } + if (IsFileSymbolLink(path)) { + return DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK; + } + + if (recursion) { + return DeleteDirRec(path, 0); + } + + if (rmdir(path.c_str()) != 0) { + return DebuggerErrno::ERROR_SYSCALL_FAILED; + } + + return DebuggerErrno::OK; +} + +static DebuggerErrno CreateDirAux(const std::string& path, bool recursion, mode_t mode) { + std::string parent = GetParentDir(path); + DebuggerErrno ret; + + if (!IsPathExist(parent)) { + if (!recursion) { + return DebuggerErrno::ERROR_DIR_NOT_EXISTS; + } + /* 递归创建父目录,由于前面已经判断过目录深度,此处递归是安全的 */ + ret = CreateDirAux(parent, recursion, mode); + if (ret != DebuggerErrno::OK) { + return ret; + } + } + + if (mkdir(path.c_str(), mode) != 0) { + if (errno == EACCES || errno == EROFS) { + return DebuggerErrno::ERROR_PERMISSION_DENINED; + } else { + return DebuggerErrno::ERROR_SYSCALL_FAILED; + } + } + return DebuggerErrno::OK; +} + +DebuggerErrno CreateDir(const std::string &path, bool recursion, mode_t mode) +{ + if (IsPathExist(path)) { + return DebuggerErrno::OK; + } + + std::string realPath = GetAbsPath(path); + if (realPath.empty()) { + return DebuggerErrno::ERROR_CANNOT_PARSE_PATH; + } + 
if (!IsPathLengthLegal(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_LOOG; + } + if (!IsPathCharactersValid(realPath)) { + return DebuggerErrno::ERROR_PATH_CONTAINS_INVALID_CHAR; + } + if (!IsPathDepthValid(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_DEEP; + } + + return CreateDirAux(realPath, recursion, mode); +} + +DebuggerErrno Chmod(const std::string& path, const mode_t& mode) +{ + if (!IsPathExist(path)) { + return DebuggerErrno::ERROR_FILE_NOT_EXISTS; + } + if (IsFileSymbolLink(path)) { + return DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK; + } + + std::string absPath = GetAbsPath(path); + if (absPath.empty()) { + return DebuggerErrno::ERROR_CANNOT_PARSE_PATH; + } + return chmod(absPath.c_str(), mode) == 0 ? DebuggerErrno::OK : DebuggerErrno::ERROR_SYSCALL_FAILED; +} + +DebuggerErrno GetFileSize(const std::string &path, size_t& size) { + struct stat path_stat; + if (stat(path.c_str(), &path_stat) != 0) { + return DebuggerErrno::ERROR_FILE_NOT_EXISTS; + } + if (!S_ISREG(path_stat.st_mode)) { + return DebuggerErrno::ERROR_ILLEGAL_FILE_TYPE; + } + + size = static_cast(path_stat.st_size); + return DebuggerErrno::OK; +} + +DebuggerErrno OpenFile(const std::string& path, std::ifstream& ifs, std::ios::openmode mode) +{ + std::string realPath = GetAbsPath(path); + DebuggerErrno ret = CheckFileBeforeRead(realPath); + if (ret != DebuggerErrno::OK) { + return ret; + } + + std::ifstream tmpifs(realPath, mode); + if (!tmpifs.is_open()) { + return DebuggerErrno::ERROR_FAILED_TO_OPEN_FILE; + } + + ifs = std::move(tmpifs); + return DebuggerErrno::OK; +} + +DebuggerErrno OpenFile(const std::string& path, std::ofstream& ofs, std::ios::openmode mode, mode_t permission) +{ + DebuggerErrno ret; + std::string realPath = GetAbsPath(path); + if (realPath.empty()) { + return DebuggerErrno::ERROR_CANNOT_PARSE_PATH; + } + + std::string parent = GetParentDir(realPath); + ret = CheckFileBeforeCreateOrWrite(realPath, true); + if (ret != DebuggerErrno::OK) { + return ret; + } + + 
if (!IsPathExist(parent)) {
+        ret = CreateDir(parent, true);
+        if (ret != DebuggerErrno::OK) {
+            return ret;
+        }
+    }
+
+    if (!IsPathExist(path)) {
+        int fd = open(path.c_str(), O_CREAT | O_WRONLY, permission);
+        if (fd < 0) {
+            return DebuggerErrno::ERROR_FAILED_TO_OPEN_FILE;
+        }
+        close(fd);
+    }
+
+    std::ofstream tmpofs(realPath, mode);
+    if (!tmpofs.is_open()) {
+        return DebuggerErrno::ERROR_FAILED_TO_OPEN_FILE;
+    }
+
+    ofs = std::move(tmpofs);
+    return DebuggerErrno::OK;
+}
+
+/******************************* Common check functions **********************************/
+DebuggerErrno CheckFileSuffixAndSize(const std::string &path, FileType type)
+{
+    static const std::map<FileType, std::pair<std::string, size_t>> FileTypeCheckTbl = {
+        {FileType::PKL, {"pkl", MAX_PKL_SIZE}},
+        {FileType::NUMPY, {"npy", MAX_NUMPY_SIZE}},
+        {FileType::JSON, {"json", MAX_JSON_SIZE}},
+        {FileType::PT, {"pt", MAX_PT_SIZE}},
+        {FileType::CSV, {"csv", MAX_CSV_SIZE}},
+        {FileType::YAML, {"yaml", MAX_YAML_SIZE}},
+    };
+
+    size_t size;
+    DebuggerErrno ret = GetFileSize(path, size);
+    if (ret != DebuggerErrno::OK) {
+        return ret;
+    }
+
+    if (type == FileType::COMMON) {
+        if (size > MAX_FILE_SIZE_DEFAULT) {
+            return DebuggerErrno::ERROR_FILE_TOO_LARGE;
+        }
+        return DebuggerErrno::OK;
+    }
+
+    auto iter = FileTypeCheckTbl.find(type);
+    if (iter == FileTypeCheckTbl.end()) {
+        return DebuggerErrno::ERROR_UNKNOWN_FILE_SUFFIX;
+    }
+
+    std::string suffix = GetFileSuffix(path);
+    if (suffix != iter->second.first) {
+        return DebuggerErrno::ERROR_UNKNOWN_FILE_SUFFIX;
+    }
+    if (size > iter->second.second) {
+        return DebuggerErrno::ERROR_FILE_TOO_LARGE;
+    }
+
+    return DebuggerErrno::OK;
+}
+
+DebuggerErrno CheckDirCommon(const std::string &path)
+{
+    std::string realPath = GetAbsPath(path);
+    if (realPath.empty()) {
+        return DebuggerErrno::ERROR_CANNOT_PARSE_PATH;
+    }
+    if (!IsPathExist(realPath)) {
+        return DebuggerErrno::ERROR_FILE_NOT_EXISTS;
+    }
+    if (!IsDir(realPath)) {
+        return DebuggerErrno::ERROR_ILLEGAL_FILE_TYPE;
+    }
+    if
(!IsPathLengthLegal(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_LOOG; + } + if (!IsPathCharactersValid(realPath)) { + return DebuggerErrno::ERROR_PATH_CONTAINS_INVALID_CHAR; + } + if (!IsPathDepthValid(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_DEEP; + } + if (IsFileSymbolLink(path)) { + return DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK; + } + if (!IsDirReadable(path)) { + return DebuggerErrno::ERROR_PERMISSION_DENINED; + } + + return DebuggerErrno::OK; +} + +DebuggerErrno CheckFileBeforeRead(const std::string &path, const std::string& authority, FileType type) +{ + std::string realPath = GetAbsPath(path); + if (realPath.empty()) { + return DebuggerErrno::ERROR_CANNOT_PARSE_PATH; + } + if (!IsPathExist(realPath)) { + return DebuggerErrno::ERROR_FILE_NOT_EXISTS; + } + if (!IsPathLengthLegal(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_LOOG; + } + if (!IsPathCharactersValid(realPath)) { + return DebuggerErrno::ERROR_PATH_CONTAINS_INVALID_CHAR; + } + if (!IsPathDepthValid(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_DEEP; + } + if (IsFileSymbolLink(realPath)) { + return DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK; + } + if (!CheckFileRWX(realPath, authority)) { + return DebuggerErrno::ERROR_PERMISSION_DENINED; + } + + /* Do not use this function to check files like /dev/random whose size cannot be determined */ + return CheckFileSuffixAndSize(path, type); +} + +DebuggerErrno CheckFileBeforeCreateOrWrite(const std::string &path, bool overwrite) +{ + std::string realPath = GetAbsPath(path); + if (realPath.empty()) { + return DebuggerErrno::ERROR_CANNOT_PARSE_PATH; + } + if (!IsPathLengthLegal(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_LOOG; + } + if (!IsPathCharactersValid(realPath)) { + return DebuggerErrno::ERROR_PATH_CONTAINS_INVALID_CHAR; + } + if (!IsPathDepthValid(realPath)) { + return DebuggerErrno::ERROR_PATH_TOO_DEEP; + } + if (IsPathExist(realPath)) { + if (!overwrite) { + return DebuggerErrno::ERROR_FILE_ALREADY_EXISTS; + } + + /* By default, overwriting files created by other users is not allowed; special cases (e.g. multi-user communication pipes) must be validated by the caller
*/ + if (!IsFileWritable(realPath) || !IsFileOwner(realPath)) { + return DebuggerErrno::ERROR_PERMISSION_DENINED; + } + } + return DebuggerErrno::OK; +} + +/* Other file operation helpers */ +static DebuggerErrno ListAllAux(const std::string &path, std::vector<std::string>& output, uint32_t depth) +{ + if (depth > PATH_DEPTH_MAX) { + return DebuggerErrno::ERROR_PATH_TOO_DEEP; + } + + DIR* dir = opendir(path.c_str()); + if (dir == nullptr) { + return DebuggerErrno::ERROR_FAILED_TO_OPEN_FILE; + } + + DebuggerErrno ret = DebuggerErrno::OK; + size_t max = output.capacity(); + size_t num = output.size(); + if (num >= max) { + closedir(dir); + return DebuggerErrno::OK; + } + + struct dirent* entry = nullptr; + while ((entry = readdir(dir)) != nullptr) { + if (strcmp(entry->d_name, ".") == 0 || (strcmp(entry->d_name, "..") == 0)) { + continue; + } + std::string entryPath = path + "/" + entry->d_name; + if (entry->d_type == DT_DIR) { + ret = ListAllAux(entryPath, output, depth + 1); + if (ret != DebuggerErrno::OK) { + closedir(dir); + return ret; + } + } else if (entry->d_type == DT_REG) { + output.emplace_back(entryPath); + if (++num >= max) { + break; + } + } + } + closedir(dir); + return DebuggerErrno::OK; +} + +std::vector<std::string> ListAll(const std::string &path, size_t max) +{ + std::vector<std::string> ret; + std::string realPath = GetAbsPath(path); + if (CheckDirCommon(realPath) != DebuggerErrno::OK) { + return ret; + } + ret.reserve(max); + + uint32_t depth = std::count(realPath.begin(), realPath.end(), pathSeparator); + ListAllAux(realPath, ret, depth); + ret.resize(ret.size()); + return ret; +} + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/FileUtils.hpp b/debug/accuracy_tools/msprobe/ccsrc/utils/FileUtils.hpp new file mode 100644 index 0000000000000000000000000000000000000000..70b47137fc40fd7fb73be11ddb8d3551550e2b8d --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/utils/FileUtils.hpp @@ -0,0 +1,107 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd.
All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include <cstdint> +#include <string> +#include <vector> +#include <fstream> +#include <sys/types.h> +#include <sys/stat.h> + +#include "include/ErrorCode.hpp" + +namespace MindStudioDebugger { + +constexpr const char pathSeparator = '/'; +constexpr const uint32_t FULL_PATH_LENGTH_MAX = 4096; +constexpr const uint32_t FILE_NAME_LENGTH_MAX = 255; +constexpr const uint32_t PATH_DEPTH_MAX = 32; +constexpr const char* FILE_VALID_PATTERN = "^[a-zA-Z0-9_.:/-]+$"; + +constexpr size_t MAX_PKL_SIZE = 1024ULL * 1024 * 1024; +constexpr size_t MAX_NUMPY_SIZE = 10ULL * 1024 * 1024 * 1024; +constexpr size_t MAX_JSON_SIZE = 1024ULL * 1024 * 1024; +constexpr size_t MAX_PT_SIZE = 10ULL * 1024 * 1024 * 1024; +constexpr size_t MAX_CSV_SIZE = 1024ULL * 1024 * 1024; +constexpr size_t MAX_YAML_SIZE = 10ULL * 1024 * 1024; +constexpr size_t MAX_FILE_SIZE_DEFAULT = 10ULL * 1024 * 1024 * 1024; + +constexpr mode_t NORMAL_FILE_MODE_DEFAULT = 0640; +constexpr mode_t READONLY_FILE_MODE_DEFAULT = 0440; +constexpr mode_t SCRIPT_FILE_MODE_DEFAULT = 0550; +constexpr mode_t NORMAL_DIR_MODE_DEFAULT = 0750; + +enum class FileType { + PKL, + NUMPY, + JSON, + PT, + CSV, + YAML, + + /* Add new type before this line.
*/ + COMMON +}; + +namespace FileUtils { + +constexpr const uint32_t FILE_NAME_MAX = 255; + +/* Basic check helpers; they perform little validation, so callers must guarantee path validity */ +bool IsPathExist(const std::string& path); +std::vector<std::string> SplitPath(const std::string &path, char separator=pathSeparator); +std::string GetAbsPath(const std::string &path); +bool IsDir(const std::string& path); +bool IsRegularFile(const std::string& path); +bool IsFileSymbolLink(const std::string& path); +bool IsPathCharactersValid(const std::string& path); +bool IsFileReadable(const std::string& path); +bool IsFileWritable(const std::string& path); +bool IsFileExecutable(const std::string& path); +bool IsDirReadable(const std::string& path); +std::string GetParentDir(const std::string& path); +std::string GetFileName(const std::string& path); +std::string GetFileBaseName(const std::string& path); +std::string GetFileSuffix(const std::string& path); +bool CheckFileRWX(const std::string& path, const std::string& permissions); +bool IsPathLengthLegal(const std::string& path); +bool IsPathDepthValid(const std::string& path); +bool IsFileOwner(const std::string& path); + +/* File operation helpers; basic checks are performed on arguments */ +DebuggerErrno DeleteFile(const std::string &path); +DebuggerErrno DeleteDir(const std::string &path, bool recursion=false); +DebuggerErrno CreateDir(const std::string &path, bool recursion=false, mode_t mode=NORMAL_DIR_MODE_DEFAULT); +DebuggerErrno Chmod(const std::string& path, const mode_t& mode); +DebuggerErrno GetFileSize(const std::string &path, size_t& size); +DebuggerErrno OpenFile(const std::string& path, std::ifstream& ifs, std::ios::openmode mode=std::ios::in); +DebuggerErrno OpenFile(const std::string& path, std::ofstream& ofs, std::ios::openmode mode=std::ios::out, + mode_t permission=NORMAL_FILE_MODE_DEFAULT); + +/* Common check functions */ +DebuggerErrno CheckFileSuffixAndSize(const std::string &path, FileType type); +DebuggerErrno CheckDirCommon(const std::string &path); +DebuggerErrno CheckFileBeforeRead(const std::string &path, const std::string&
authority="r", + FileType type=FileType::COMMON); +DebuggerErrno CheckFileBeforeCreateOrWrite(const std::string &path, bool overwrite=false); + +/* Other file operation helpers */ +std::vector<std::string> ListAll(const std::string &path, size_t max = 1024); + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/MathUtils.cpp b/debug/accuracy_tools/msprobe/ccsrc/utils/MathUtils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..27111d60c9f86f2ae9b2b2a00b804ab886917755 --- /dev/null +++ b/debug/accuracy_tools/msprobe/ccsrc/utils/MathUtils.cpp @@ -0,0 +1,85 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ + +#include <random> +#include <string> +#include "openssl/md5.h" + +namespace MindStudioDebugger { +namespace MathUtils { + +float Random() +{ + std::mt19937 generator(std::random_device{}()); + std::uniform_real_distribution<float> distribution(0.0f, 1.0f); + return distribution(generator); +} + +float Random(float floor, float ceil) +{ + std::mt19937 generator(std::random_device{}()); + std::uniform_real_distribution<float> distribution(floor, ceil); + return distribution(generator); +} + +int32_t RandomInt(int32_t floor, int32_t ceil) +{ + std::mt19937 generator(std::random_device{}()); + std::uniform_int_distribution<int32_t> distribution(floor, ceil - 1); + + return distribution(generator); +} + +std::string RandomString(uint32_t len, char min, char max) +{ + std::mt19937 generator(std::random_device{}()); + std::string output(len, '\0'); + if (min > max) { + return output; + } + + std::uniform_int_distribution<int> distribution(min, max); + for (uint32_t i = 0; i < len; i++) { + output[i] = static_cast<char>(distribution(generator)); + } + + return output; +} + +std::string CalculateMD5(const uint8_t* data, size_t length) +{ + MD5_CTX md5ctx; + MD5_Init(&md5ctx); + MD5_Update(&md5ctx, data, length); + + unsigned char digest[MD5_DIGEST_LENGTH]; + MD5_Final(digest, &md5ctx); + + static const char hexchar[] = "0123456789abcdef"; + constexpr const uint8_t hexbase = 16; + constexpr const size_t byteToStrWidth = 2; + char md5string[MD5_DIGEST_LENGTH * byteToStrWidth + 1]; + for (int i = 0; i < MD5_DIGEST_LENGTH; i++) { + md5string[i * byteToStrWidth] = hexchar[digest[i] / hexbase]; + md5string[i * byteToStrWidth + 1] = hexchar[digest[i] % hexbase]; + } + md5string[sizeof(md5string) - 1] = '\0'; + + return std::string(md5string); +} + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/ccsrc/utils/MathUtils.hpp b/debug/accuracy_tools/msprobe/ccsrc/utils/MathUtils.hpp new file mode 100644 index 0000000000000000000000000000000000000000..141471ac8ce284ac1a7ab4b6db59f5d0da9a9fe2 --- /dev/null +++
b/debug/accuracy_tools/msprobe/ccsrc/utils/MathUtils.hpp @@ -0,0 +1,70 @@ +/* + * Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#pragma once + +#include <cstdint> +#include <string> + +namespace MindStudioDebugger { +namespace MathUtils { + +template <typename T> +T Gcd(T a, T b) { + if (a == 0 || b == 0) { + return 0; + } + T c = b; + while (a % b != 0) { + c = a % b; + a = b; + b = c; + } + return c; +} + +template <typename T> +T Lcm(T a, T b) { + if (a == 0 || b == 0) { + return 0; + } + T ret = (a * b) / (Gcd(a, b)); + return ret; +} + +template <typename T> +T DivCeil(T v, T divisor) { + if (divisor == 0) { + return 0; + } + return (v + divisor - 1) / divisor; +} + +template <typename T> +T AlignCeil(T v, T block) +{ + return DivCeil(v, block) * block; +} + +float Random(); +float Random(float floor, float ceil); +int32_t RandomInt(int32_t floor, int32_t ceil); +std::string RandomString(uint32_t len, char min=' ', char max='~'); + +std::string CalculateMD5(const uint8_t* data, size_t length); + +} +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/config.json b/debug/accuracy_tools/msprobe/config.json index 02d05e020928e94c2a9e941c27d67f81330accd8..553b7f9ee3b89215647b00fb14b70af44ea5f00c 100644 --- a/debug/accuracy_tools/msprobe/config.json +++ b/debug/accuracy_tools/msprobe/config.json @@ -5,12 +5,11 @@ "step": [], "level": "L1", "enable_dataloader": false, - "acl_config": "", + "async_dump": false, "tensor": {
"scope": [], "list":[], - "data_mode": ["all"], - "backward_input": [], + "data_mode": ["all"], "file_format": "npy" }, "statistics": { diff --git a/debug/accuracy_tools/msprobe/core/common/const.py b/debug/accuracy_tools/msprobe/core/common/const.py index 8d33aa99c13b209c113d621f1ccfc4610bd5f5e6..27dc231c75ca870a3d7ffee50d79dcc329f9c5dd 100644 --- a/debug/accuracy_tools/msprobe/core/common/const.py +++ b/debug/accuracy_tools/msprobe/core/common/const.py @@ -1,6 +1,7 @@ -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -11,7 +12,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" + import os import stat @@ -24,9 +25,11 @@ class Const: """ TOOL_NAME = "msprobe" + ipv4_pattern = r"([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])(\.([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){3}$" SEP = "." REGEX_PREFIX_MAX_LENGTH = 20 REGEX_PREFIX_PATTERN = r"^[a-zA-Z0-9_-]+$" + REGEX_FORWARD_BACKWARD = r'\.(forward|backward)\.'
FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' STRING_BLACKLIST = r"^[+-=%@\+\-=%@]|;[+-=%@\+\-=%@]" COMMA = "," @@ -34,6 +37,8 @@ class Const: OFF = 'OFF' BACKWARD = 'backward' FORWARD = 'forward' + PROGRESS_TIMEOUT = 3000 + EXCEPTION_NONE = None JIT = 'Jit' PRIMITIVE_PREFIX = 'Primitive' DEFAULT_LIST = [] @@ -61,6 +66,7 @@ class Const: ONLINE_DUMP_MODE = [ALL, LIST, AUTO, OFF] SUMMARY = "summary" MD5 = "md5" + VALUE = "value" SUMMARY_MODE = [ALL, SUMMARY, MD5] WRITE_FLAGS = os.O_WRONLY | os.O_CREAT @@ -69,6 +75,7 @@ class Const: PKL_SUFFIX = ".pkl" NUMPY_SUFFIX = ".npy" + NUMPY_PATTERN = "*.npy" PT_SUFFIX = ".pt" ONE_GB = 1073741824 # 1 * 1024 * 1024 * 1024 TEN_GB = 10737418240 # 10 * 1024 * 1024 * 1024 @@ -83,6 +90,8 @@ class Const: INPUT_KWARGS = 'input_kwargs' GRAD_INPUT = 'grad_input' GRAD_OUTPUT = 'grad_output' + PARAMS = 'parameters' + PARAMS_GRAD = 'parameters_grad' START = "start" STOP = "stop" ENV_ENABLE = "1" @@ -94,20 +103,23 @@ class Const: FREE_BENCHMARK = "free_benchmark" RUN_UT = "run_ut" GRAD_PROBE = "grad_probe" - TASK_LIST = [TENSOR, STATISTICS, OVERFLOW_CHECK, FREE_BENCHMARK, RUN_UT, GRAD_PROBE] - DUMP_DATA_COLLECTION_LIST = [STATISTICS, TENSOR] + STRUCTURE = "structure" + TASK_LIST = [TENSOR, STATISTICS, OVERFLOW_CHECK, FREE_BENCHMARK, RUN_UT, GRAD_PROBE, STRUCTURE] + DUMP_DATA_COLLECTION_LIST = [STATISTICS, TENSOR, STRUCTURE] DUMP_DATA_MODE_LIST = [ALL, INPUT, OUTPUT, FORWARD, BACKWARD] LEVEL_L0 = "L0" LEVEL_L1 = "L1" LEVEL_L2 = "L2" LEVEL_MIX = "mix" - LEVEL_LIST = [LEVEL_L0, LEVEL_L1, LEVEL_L2, LEVEL_MIX] + LEVEL_DEBUG = "debug" + LEVEL_LIST = [LEVEL_L0, LEVEL_L1, LEVEL_L2, LEVEL_MIX, LEVEL_DEBUG] ATTR_NAME_PREFIX = "wrap_" ATTR_NAME_PREFIX_LEN = len(ATTR_NAME_PREFIX) KERNEL_DUMP = "kernel_dump" DATA = "data" PT_FRAMEWORK = "pytorch" MS_FRAMEWORK = "mindspore" + MT_FRAMEWORK = "mindtorch" UNKNOWN_FRAMEWORK = "unknown" DIRECTORY_LENGTH = 4096 FILE_NAME_LENGTH = 255 @@ -118,7 +130,12 @@ class Const: NPU_LOWERCASE = 'npu' CPU_LOWERCASE = 'cpu' 
CUDA_LOWERCASE = 'cuda' + DEVICE = 'device' DISTRIBUTED = 'Distributed' + DUMP_PREFIX = ["Distributed", "Functional", "Torch", "Tensor", "Mint", "MintFunctional", "Primitive", + "Aten", "VF", "NPU", "Jit"] + MODULE_PREFIX = ["Module", "Cell"] + FORWARD_NAME_SUFFIX = ".forward" # struct json param ORIGIN_DATA = "origin_data" @@ -129,7 +146,7 @@ class Const: MODULE_WHITE_LIST = ["torch", "numpy"] FUNC_SKIP_LIST = ["construct", "__call__"] - FILE_SKIP_LIST = ["site-packages/mindspore", "package/mindspore", "msprobe", "site-packages/torch", "package/torch", "MindSpeed"] + FILE_SKIP_LIST = ["msprobe", "MindSpeed"] DATA_TYPE_SKIP_LIST = ["Primitive", "Jit"] STACK_FILE_INDEX = 0 @@ -139,15 +156,18 @@ class Const: SCOPE_ID_INDEX = -1 SCOPE_DIRECTION_INDEX = -2 TYPE_NAME_INDEX = -3 + PARAMS_GRAD_TYPE_NAME_INDEX = -2 LAYER_NAME_INDEX = -4 + PARAMS_GRAD_NAME_INDEX = -3 API_TYPE_INDEX = 0 LEFT_MOVE_INDEX = -1 RIGHT_MOVE_INDEX = 1 + LAST_INDEX = -1 TOP_LAYER = "TopLayer" CELL = "Cell" MODULE = "Module" - + FRAME_FILE_LIST = ["site-packages/torch", "package/torch", "site-packages/mindspore", "package/mindspore"] INPLACE_LIST = [ "broadcast", "all_reduce", "reduce", "all_gather", "gather", "scatter", "reduce_scatter", "_reduce_scatter_base", "_all_gather_base", "send", "recv", "irecv", "isend", "all_to_all_single", "all_to_all", @@ -156,12 +176,16 @@ class Const: CONVERT = { "int32_to_int64": ["torch.int32", "torch.int64"], + "int64_to_fp32": ["torch.int64", "torch.float32"] } CONVERT_API = { - "int32_to_int64": ["cross_entropy"] + "int32_to_int64": ["cross_entropy"], + "int64_to_fp32": ["histc"] } + FA_SPECIAL_SPARSE_MODE = [2, 3, 4] + FILL_CHAR_NUMS = 50 TOOL_ENDS_SUCCESSFULLY = f"{TOOL_NAME} ends successfully." WITHOUT_CALL_STACK = "The call stack retrieval failed." 
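Reviewer note: the new `Const` entries above add, among others, an `ipv4_pattern` and a `REGEX_FORWARD_BACKWARD` pattern. As a quick illustrative sketch of how they behave (the pattern strings are copied verbatim from this diff; the `direction_of` helper is hypothetical and not part of the patch):

```python
import re

# Patterns copied verbatim from the Const additions in this diff.
ipv4_pattern = r"([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])(\.([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){3}$"
REGEX_FORWARD_BACKWARD = r'\.(forward|backward)\.'

def direction_of(dump_name):
    """Hypothetical helper: pull 'forward'/'backward' out of a dump entry name."""
    match = re.search(REGEX_FORWARD_BACKWARD, dump_name)
    return match.group(1) if match else None

print(direction_of("Tensor.add.0.forward.input.0"))
print(bool(re.match(ipv4_pattern, "192.168.0.1")))
```

Note that `ipv4_pattern` is only anchored at the end (`$`), so callers should use `re.match` (or add a leading `^`) to get a full-string check.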
@@ -169,9 +193,12 @@ class Const: STEP = "step" RANK = "rank" HYPHEN = "-" - STEP_RANK_MAXIMUM_RANGE = [int(0), int(1e6)] + STEP_RANK_MINIMUM_VALUE = 0 + STEP_RANK_MAXIMUM_VALUE = int(1e6) # data type const + TORCH_INT_DTYPE = ["torch.int8", "torch.int32", "torch.int64"] + TORCH_FLOAT_DTYPE = ["torch.bfloat16", "torch.float16", "torch.float32", "torch.float64"] FLOAT16 = "Float16" FLOAT32 = "Float32" BFLOAT16 = "BFloat16" @@ -186,6 +213,23 @@ class Const: MEAN = 'Mean' NORM = 'Norm' + CODE_STACK = 'Code Stack' + OP_NAME = 'Op Name' + SCOPE_NAME = 'Scope Name' + CODE_STACKS = 'Code Stacks' + FILE_PATH = 'File Path' + NEW_LINE = '\n' + CSV_NEWLINE_SEPARATOR = ',\n' + # separator constants + SCOPE_SEPARATOR = "/" + REPLACEMENT_CHARACTER = "_" + + OPTIMIZER = "optimizer" + CLIP_GRAD = "clip_grad" + END_PREFIX = "end_" + + TENSOR_STAT_LEN = 2 + + class CompareConst: """ @@ -212,6 +256,7 @@ class CompareConst: MEAN_DIFF = "Mean diff" NORM_DIFF = "L2norm diff" COSINE = "Cosine" + EUC_DIST = "EucDist" MAX_ABS_ERR = "MaxAbsErr" MAX_RELATIVE_ERR = "MaxRelativeErr" MIN_RELATIVE_ERR = "MinRelativeErr" @@ -228,19 +273,66 @@ class CompareConst: RESULT = "Result" MAGNITUDE = 0.5 OP_NAME = "op_name" + STRUCT = "struct" INPUT_STRUCT = "input_struct" KWARGS_STRUCT = "kwargs_struct" OUTPUT_STRUCT = "output_struct" + PARAMS_STRUCT = "params_struct" + PARAMS_GRAD_STRUCT = "params_grad_struct" SUMMARY = "summary" + COMPARE_RESULT = "compare_result" + COMPARE_MESSAGE = "compare_message" MAX_EXCEL_LENGTH = 1048576 YES = "Yes" NO = "No" STATISTICS_INDICATOR_NUM = 4 EPSILON = 1e-10 + COMPARE_ENDS_SUCCESSFULLY = "msprobe compare ends successfully."
+ DEFAULT_RATIO_VALUE = 10000 + THOUSANDTH_PASS_VALUE = 0.999 + ZERO_SHAPE = '(0,)' + + BENCHMARK_COMPARE_ALGORITHM_NAME = "标杆比对法" + ULP_COMPARE_ALGORITHM_NAME = "ULP误差比对法" + BINARY_CONSISTENCY_ALGORITHM_NAME = "二进制一致法" + ABSOLUTE_THRESHOLD_ALGORITHM_NAME = "绝对阈值法" + THOUSANDTH_STANDARD_ALGORITHM_NAME = "双千指标法" + ACCUMULATIVE_ERROR_COMPARE_ALGORITHM_NAME = "累积误差比对法" + + ABSOLUTE_THRESHOLD = 'absolute_threshold' + BINARY_CONSISTENCY = 'binary_consistency' + ULP_COMPARE = 'ulp_compare' + THOUSANDTH_STANDARD = 'thousandth_threshold' + BENCHMARK = 'benchmark' + ACCUMULATIVE_ERROR_COMPARE = 'accumulative_error_compare' + + SMALL_VALUE_ERR_RATIO = "small_value_err_ratio" + RMSE_RATIO = "rmse_ratio" + MAX_REL_ERR_RATIO = "max_rel_err_ratio" + MEAN_REL_ERR_RATIO = "mean_rel_err_ratio" + EB_RATIO = "eb_ratio" + + SMALL_VALUE = "small_value" + RMSE = "rmse" + MAX_REL_ERR = "max_rel_err" + MEAN_REL_ERR = "mean_rel_err" + EB = "eb" + + SMALL_VALUE_ERR_STATUS = "small_value_err_status" + RMSE_STATUS = "rmse_status" + MAX_REL_ERR_STATUS = "max_rel_err_status" + MEAN_REL_ERR_STATUS = "mean_rel_err_status" + EB_STATUS = "eb_status" + + MEAN_ULP_ERR = "mean_ulp_err" + ULP_ERR_PROPORTION = "ulp_err_proportion" + ULP_ERR_PROPORTION_RATIO = "ulp_err_proportion_ratio" + + ULP_ERR_STATUS = "ulp_err_status" COMPARE_RESULT_HEADER = [ - NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, COSINE, MAX_ABS_ERR, MAX_RELATIVE_ERR, - ONE_THOUSANDTH_ERR_RATIO, FIVE_THOUSANDTHS_ERR_RATIO, + NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, COSINE, EUC_DIST, + MAX_ABS_ERR, MAX_RELATIVE_ERR, ONE_THOUSANDTH_ERR_RATIO, FIVE_THOUSANDTHS_ERR_RATIO, NPU_MAX, NPU_MIN, NPU_MEAN, NPU_NORM, BENCH_MAX, BENCH_MIN, BENCH_MEAN, BENCH_NORM, ACCURACY, ERROR_MESSAGE ] @@ -254,12 +346,58 @@ class CompareConst: NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, NPU_MD5, BENCH_MD5, RESULT ] + COMPARE_RESULT_HEADER_STACK = COMPARE_RESULT_HEADER + [STACK] + + 
SUMMARY_COMPARE_RESULT_HEADER_STACK = SUMMARY_COMPARE_RESULT_HEADER + [STACK] + + MD5_COMPARE_RESULT_HEADER_STACK = MD5_COMPARE_RESULT_HEADER + [STACK] + HEAD_OF_COMPARE_MODE = { Const.ALL: COMPARE_RESULT_HEADER, Const.SUMMARY: SUMMARY_COMPARE_RESULT_HEADER, Const.MD5: MD5_COMPARE_RESULT_HEADER } + ALL_COMPARE_INDEX = [COSINE, EUC_DIST, MAX_ABS_ERR, MAX_RELATIVE_ERR, ONE_THOUSANDTH_ERR_RATIO, + FIVE_THOUSANDTHS_ERR_RATIO] + SUMMARY_COMPARE_INDEX = [MAX_DIFF, MIN_DIFF, MEAN_DIFF, NORM_DIFF, + MAX_RELATIVE_ERR, MIN_RELATIVE_ERR, MEAN_RELATIVE_ERR, NORM_RELATIVE_ERR] + + # dtype match + MS_TYPE = [ + [Const.FLOAT16, Const.FLOAT32], [Const.FLOAT32, Const.FLOAT16], + [Const.FLOAT16, Const.BFLOAT16], [Const.BFLOAT16, Const.FLOAT16] + ] + TORCH_TYPE = [ + [Const.TORCH_FLOAT16, Const.TORCH_FLOAT32], [Const.TORCH_FLOAT32, Const.TORCH_FLOAT16], + [Const.TORCH_FLOAT16, Const.TORCH_BFLOAT16], [Const.TORCH_BFLOAT16, Const.TORCH_FLOAT16] + ] + + # read_op + IO_NAME_MAPPING = { + Const.INPUT_ARGS: '.input', + Const.INPUT_KWARGS: '.input', + Const.INPUT: '.input', + Const.OUTPUT: '.output', + Const.PARAMS: '.parameters' + } + + # state to struct mapping + STATE_TO_STRUCT_MAPPING = { + Const.INPUT: INPUT_STRUCT, + Const.KWARGS: INPUT_STRUCT, + Const.OUTPUT: OUTPUT_STRUCT, + Const.PARAMS: PARAMS_STRUCT, + Const.PARAMS_GRAD: PARAMS_GRAD_STRUCT + } + + STRUCT_COMPARE_KEY = [ + INPUT_STRUCT, + OUTPUT_STRUCT, + PARAMS_STRUCT, + PARAMS_GRAD_STRUCT + ] + # compare standard HUNDRED_RATIO_THRESHOLD = 0.01 THOUSAND_RATIO_THRESHOLD = 0.001 @@ -331,13 +469,22 @@ class CompareConst: BENCH_MEAN: None, BENCH_NORM: None, ACCURACY: '', ERROR_MESSAGE: '' } MS_GRAPH_NPY = { - COSINE: None, MAX_ABS_ERR: None, MAX_RELATIVE_ERR: None, ONE_THOUSANDTH_ERR_RATIO: None, + COSINE: None, EUC_DIST: None, MAX_ABS_ERR: None, MAX_RELATIVE_ERR: None, ONE_THOUSANDTH_ERR_RATIO: None, FIVE_THOUSANDTHS_ERR_RATIO: None } MS_GRAPH_STATISTIC = { MAX_DIFF: None, MIN_DIFF: None, MEAN_DIFF: None, NORM_DIFF: None, 
MAX_RELATIVE_ERR: None, MIN_RELATIVE_ERR: None, MEAN_RELATIVE_ERR: None, NORM_RELATIVE_ERR: None } + INPUT_PATTERN = Const.SEP + Const.INPUT + Const.SEP + KWARGS_PATTERN = Const.SEP + Const.KWARGS + Const.SEP + OUTPUT_PATTERN = Const.SEP + Const.OUTPUT + Const.SEP + PARAMS_PATTERN = Const.SEP + Const.PARAMS + Const.SEP + PARAMS_GRAD_PATTERN = Const.SEP + Const.PARAMS_GRAD + Const.SEP + COMPARE_KEY = 'compare_key' + COMPARE_SHAPE = 'compare_shape' + INTERNAL_API_MAPPING_FILE = 'ms_to_pt_api.yaml' + UNREADABLE = 'unreadable data' class FileCheckConst: @@ -356,13 +503,17 @@ class FileCheckConst: JSON_SUFFIX = ".json" PT_SUFFIX = ".pt" CSV_SUFFIX = ".csv" + XLSX_SUFFIX = ".xlsx" YAML_SUFFIX = ".yaml" + IR_SUFFIX = ".ir" MAX_PKL_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 MAX_NUMPY_SIZE = 10737418240 # 10 * 1024 * 1024 * 1024 MAX_JSON_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 MAX_PT_SIZE = 10737418240 # 10 * 1024 * 1024 * 1024 MAX_CSV_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 - MAX_YAML_SIZE = 1048576 # 1 * 1024 * 1024 + MAX_XLSX_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 + MAX_YAML_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 + MAX_IR_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 COMMOM_FILE_SIZE = 1048576 # 1 * 1024 * 1024 DIR = "dir" FILE = "file" @@ -374,7 +525,9 @@ class FileCheckConst: JSON_SUFFIX: MAX_JSON_SIZE, PT_SUFFIX: MAX_PT_SIZE, CSV_SUFFIX: MAX_CSV_SIZE, - YAML_SUFFIX: MAX_YAML_SIZE + XLSX_SUFFIX: MAX_XLSX_SIZE, + YAML_SUFFIX: MAX_YAML_SIZE, + IR_SUFFIX: MAX_IR_SIZE } CSV_BLACK_LIST = r'^[+-=%@\+\-=%@]|;[+-=%@\+\-=%@]' @@ -387,34 +540,6 @@ class OverflowConst: OVERFLOW_DEBUG_MODE = 1 -class MsCompareConst: - # api_info field - MINT = "Mint" - MINT_FUNCTIONAL = "MintFunctional" - - TASK_FIELD = "task" - STATISTICS_TASK = "statistics" - TENSOR_TASK = "tensor" - DUMP_DATA_DIR_FIELD = "dump_data_dir" - DATA_FIELD = "data" - - # detail_csv - DETAIL_CSV_API_NAME = "API Name" - DETAIL_CSV_BENCH_DTYPE = "Bench Dtype" - DETAIL_CSV_TESTED_DTYPE = "Tested Dtype" - 
DETAIL_CSV_SHAPE = "Shape" - DETAIL_CSV_PASS_STATUS = "Status" - DETAIL_CSV_MESSAGE = "Message" - DETAIL_CSV_FILE_NAME = "accuracy_checking_details" - - # result_csv - RESULT_CSV_FORWARD_TEST_SUCCESS = "Forward Test Success" - RESULT_CSV_BACKWARD_TEST_SUCCESS = "Backward Test Success" - RESULT_CSV_FILE_NAME = "accuracy_checking_result" - - EPSILON = 1e-8 - - class MsgConst: """ Class for log messages const @@ -435,6 +560,7 @@ class MsgConst: class ERROR: value = 3 + SPECIAL_CHAR = ["\n", "\r", "\u007F", "\b", "\f", "\t", "\u000B", "%08", "%0a", "%0b", "%0c", "%0d", "%7f"] NOT_CREATED_INSTANCE = "PrecisionDebugger instance is not created." @@ -450,8 +576,48 @@ class MonitorConst: """ Class for monitor const """ - OP_LIST = ["min", "max", "norm", "zeros", "nans", "id"] + OP_LIST = ["norm", "min", "max", "zeros", "nans", "id", "mean"] MONITOR_OUTPUT_DIR = "MONITOR_OUTPUT_DIR" DEFAULT_MONITOR_OUTPUT_DIR = "./monitor_output" DATABASE = "database" EMAIL = "email" + OPT_TY = ['Megatron_DistributedOptimizer', 'Megatron_Float16OptimizerWithFloat16Params'] + DEEPSPEED_OPT_TY = ( + "DeepSpeedZeroOptimizer_Stage0", + "DeepSpeedZeroOptimizer_Stage1_or_2", + "DeepSpeedZeroOptimizer_Stage3" + ) + DEEPSPEED_ZERO_OPT_FILTER = "DeepSpeedZeroOptimizer" + RULE_NAME = ['AnomalyTurbulence'] + + SLICE_SIZE = 20480 + # used for name + DOT = "." 
+ NAME_SEP = ":" + INPUT_GRAD = "input_grad" + OUTPUT_GRAD = "output_grad" + ACTV_IN = "input" + ACTV_OUT = "output" + ACTVGRAD_IN = "input_grad" + ACTVGRAD_OUT = "output_grad" + # used for tasks + ACTV = "actv" + ACTVGRAD = "actv_grad" + POST_GRAD = "post_grad" + PRE_GRAD = "pre_grad" + ACC_GRAD = "acc_grad" + PREFIX_POST = "post" + PREFIX_PRE = "pre" + EXP_AVG = "exp_avg" + EXP_AVG_SQ = "exp_avg_sq" + PARAM = "param" + + CSV_HEADER = ["vpp_stage", "name", "step"] + CSV_HEADER_XY = ["vpp_stage", "name", "step", "micro_step"] + OUTPUT_DIR_PATTERN = r"([\w-]{0,20})-rank(\d{1,5})-" + ANOMALY_JSON = "anomaly.json" + ANALYSE_JSON = "anomaly_analyse.json" + TENSORBOARD = "tensorboard" + CSV = "csv" + API = "api" + HEADER_NAME = 'name' diff --git a/debug/accuracy_tools/msprobe/core/common/exceptions.py b/debug/accuracy_tools/msprobe/core/common/exceptions.py index 5eb3839efb4375ee4a2723d865cdf513e063e0ca..d71d30224b677fb19361f62de0ee25b2d32d389f 100644 --- a/debug/accuracy_tools/msprobe/core/common/exceptions.py +++ b/debug/accuracy_tools/msprobe/core/common/exceptions.py @@ -1,6 +1,7 @@ -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -11,7 +12,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + class CodedException(Exception): def __init__(self, code, error_info=''): super().__init__() @@ -25,10 +26,14 @@ class CodedException(Exception): class MsprobeException(CodedException): INVALID_PARAM_ERROR = 0 OVERFLOW_NUMS_ERROR = 1 + RECURSION_LIMIT_ERROR = 2 + INTERFACE_USAGE_ERROR = 3 err_strs = { INVALID_PARAM_ERROR: "[msprobe] 无效参数:", - OVERFLOW_NUMS_ERROR: "[msprobe] 超过预设溢出次数 当前溢出次数:" + OVERFLOW_NUMS_ERROR: "[msprobe] 超过预设溢出次数 当前溢出次数:", + RECURSION_LIMIT_ERROR: "[msprobe] 递归调用超过限制:", + INTERFACE_USAGE_ERROR: "[msprobe] Invalid interface usage: " } @@ -55,7 +60,7 @@ class ParseJsonException(CodedException): InvalidDumpJson = 1 err_strs = { UnexpectedNameStruct: "[msprobe] Unexpected name in json: ", - InvalidDumpJson: "[msprobe] json格式不正确: ", + InvalidDumpJson: "[msprobe] Invalid dump.json format: ", } @@ -116,4 +121,4 @@ class ApiAccuracyCheckerException(CodedException): UnsupportType: "[msprobe] Api Accuracy Checker get unsupported type: ", WrongValue: "[msprobe] Api Accuracy Checker get wrong value: ", ApiWrong: "[msprobe] Api Accuracy Checker something wrong with api: ", - } \ No newline at end of file + } diff --git a/debug/accuracy_tools/msprobe/core/common/file_utils.py b/debug/accuracy_tools/msprobe/core/common/file_utils.py index 78d9c4fc32ae57944034063e34dc6c6e2cccf0d8..fdc626ca6a1a90e9060cefa237f9d5d8d7e42844 100644 --- a/debug/accuracy_tools/msprobe/core/common/file_utils.py +++ b/debug/accuracy_tools/msprobe/core/common/file_utils.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2023. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -13,13 +12,16 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" + import csv import fcntl import os +import stat import json import re import shutil +from datetime import datetime, timezone +from dateutil import parser import yaml import numpy as np import pandas as pd @@ -70,6 +72,8 @@ class FileChecker: check_path_pattern_valid(self.file_path) check_common_file_size(self.file_path) check_file_suffix(self.file_path, self.file_type) + if self.path_type == FileCheckConst.FILE: + check_dirpath_before_read(self.file_path) return self.file_path def check_path_ability(self): @@ -125,6 +129,7 @@ class FileOpen: check_path_pattern_valid(self.file_path) if os.path.exists(self.file_path): check_common_file_size(self.file_path) + check_dirpath_before_read(self.file_path) def check_ability_and_owner(self): if self.mode in self.SUPPORT_READ_MODE: @@ -217,7 +222,6 @@ def check_common_file_size(file_path): check_file_size(file_path, max_size) return check_file_size(file_path, FileCheckConst.COMMOM_FILE_SIZE) - def check_file_suffix(file_path, file_suffix): @@ -238,6 +242,15 @@ raise FileCheckException(FileCheckException.INVALID_FILE_ERROR) + +def check_others_writable(directory): + dir_stat = os.stat(directory) + is_writable = ( + bool(dir_stat.st_mode & stat.S_IWGRP) or # group writable + bool(dir_stat.st_mode & stat.S_IWOTH) # writable by other users + ) + return is_writable + + def make_dir(dir_path): check_path_before_create(dir_path) dir_path = os.path.realpath(dir_path) @@ -281,6 +294,17 @@ def check_path_before_create(path): 'The file path {} contains special characters.'.format(path)) + +def check_dirpath_before_read(path): + path = os.path.realpath(path) + dirpath = os.path.dirname(path) + if check_others_writable(dirpath): + logger.warning(f"The directory is writable by others:
{dirpath}.") + try: + check_path_owner_consistent(dirpath) + except FileCheckException: + logger.warning(f"The directory {dirpath} is not yours.") + + def check_file_or_directory_path(path, isdir=False): """ Function Description: @@ -356,7 +380,7 @@ def load_npy(filepath): def load_json(json_path): try: with FileOpen(json_path, "r") as f: - fcntl.flock(f, fcntl.LOCK_EX) + fcntl.flock(f, fcntl.LOCK_SH) data = json.load(f) fcntl.flock(f, fcntl.LOCK_UN) except Exception as e: @@ -365,11 +389,11 @@ def load_json(json_path): return data -def save_json(json_path, data, indent=None): +def save_json(json_path, data, indent=None, mode="w"): check_path_before_create(json_path) json_path = os.path.realpath(json_path) try: - with FileOpen(json_path, 'w') as f: + with FileOpen(json_path, mode) as f: fcntl.flock(f, fcntl.LOCK_EX) json.dump(data, f, indent=indent) fcntl.flock(f, fcntl.LOCK_UN) @@ -394,20 +418,36 @@ def save_yaml(yaml_path, data): def save_excel(path, data): + def validate_data(data): + """Validate that the data is a DataFrame or a list of (DataFrame, sheet_name) pairs.""" + if isinstance(data, pd.DataFrame): + return "single" + elif isinstance(data, list): + if all(isinstance(item, tuple) and len(item) == 2 and isinstance(item[0], pd.DataFrame) for item in data): + return "list" + raise ValueError("Data must be a DataFrame or a list of (DataFrame, sheet_name) pairs.") + check_path_before_create(path) path = os.path.realpath(path) + + # 验证数据类型 + data_type = validate_data(data) + try: - if isinstance(data, pd.DataFrame): + if data_type == "single": data.to_excel(path, index=False) - else: - logger.error(f'unsupported data type.') - return + elif data_type == "list": + with pd.ExcelWriter(path) as writer: + for data_df, sheet_name in data: + data_df.to_excel(writer, sheet_name=sheet_name, index=False) except Exception as e: logger.error(f'Save excel file "{os.path.basename(path)}" failed.') raise RuntimeError(f"Save excel file {path} failed.") from e 
change_mode(path, FileCheckConst.DATA_FILE_AUTHORITY) + + def move_file(src_path, dst_path): check_file_or_directory_path(src_path) check_path_before_create(dst_path) @@ -469,7 +509,7 @@ def save_workbook(workbook, file_path): def write_csv(data, filepath, mode="a+", malicious_check=False): def csv_value_is_valid(value: str) -> bool: if not isinstance(value, str): - return True + return True try: # -1.00 or +1.00 should be consdiered as digit numbers float(value) @@ -477,7 +517,7 @@ def write_csv(data, filepath, mode="a+", malicious_check=False): # otherwise, they will be considered as formular injections return not bool(re.compile(FileCheckConst.CSV_BLACK_LIST).search(value)) return True - + if malicious_check: for row in data: for cell in row: @@ -497,11 +537,11 @@ def write_csv(data, filepath, mode="a+", malicious_check=False): change_mode(filepath, FileCheckConst.DATA_FILE_AUTHORITY) -def read_csv(filepath, as_pd=True): +def read_csv(filepath, as_pd=True, header='infer'): check_file_or_directory_path(filepath) try: if as_pd: - csv_data = pd.read_csv(filepath) + csv_data = pd.read_csv(filepath, header=header) else: with FileOpen(filepath, 'r', encoding='utf-8-sig') as f: csv_reader = csv.reader(f, delimiter=',') @@ -512,6 +552,39 @@ def read_csv(filepath, as_pd=True): return csv_data +def write_df_to_csv(data, filepath, mode="w", header=True, malicious_check=False): + def csv_value_is_valid(value: str) -> bool: + if not isinstance(value, str): + return True + try: + # -1.00 or +1.00 should be consdiered as digit numbers + float(value) + except ValueError: + # otherwise, they will be considered as formular injections + return not bool(re.compile(FileCheckConst.CSV_BLACK_LIST).search(value)) + return True + + if not isinstance(data, pd.DataFrame): + raise ValueError("The data type of data is not supported. 
Only support pd.DataFrame.") + + if malicious_check: + for i in range(len(data)): + for j in range(len(data.columns)): + cell = data.iloc[i, j] + if not csv_value_is_valid(cell): + raise RuntimeError(f"Malicious value [{cell}] is not allowed " + f"to be written into the csv: {filepath}.") + + check_path_before_create(filepath) + file_path = os.path.realpath(filepath) + try: + data.to_csv(filepath, mode=mode, header=header, index=False) + except Exception as e: + logger.error(f'Save csv file "{os.path.basename(file_path)}" failed') + raise RuntimeError(f"Save csv file {file_path} failed.") from e + change_mode(filepath, FileCheckConst.DATA_FILE_AUTHORITY) + + def remove_path(path): if not os.path.exists(path): return @@ -545,6 +618,7 @@ def get_file_content_bytes(file): with FileOpen(file, 'rb') as file_handle: return file_handle.read() + # 对os.walk设置遍历深度 def os_walk_for_files(path, depth): res = [] @@ -556,3 +630,44 @@ def os_walk_for_files(path, depth): for file in files: res.append({"file": file, "root": root}) return res + + +def check_crt_valid(pem_path): + """ + Check the validity of the SSL certificate. + + Load the SSL certificate from the specified path, parse and check its validity period. + If the certificate is expired or invalid, raise a RuntimeError. + + Parameters: + pem_path (str): The file path of the SSL certificate. + + Raises: + RuntimeError: If the SSL certificate is invalid or expired. + """ + import OpenSSL + try: + with FileOpen(pem_path, "r") as f: + pem_data = f.read() + cert = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, pem_data) + pem_start = parser.parse(cert.get_notBefore().decode("UTF-8")) + pem_end = parser.parse(cert.get_notAfter().decode("UTF-8")) + logger.info(f"The SSL certificate passes the verification and the validity period " + f"starts from {pem_start} ends at {pem_end}.") + except Exception as e: + logger.error("Failed to parse the SSL certificate. 
Check the certificate.") + raise RuntimeError(f"The SSL certificate is invalid, {pem_path}") from e + + now_utc = datetime.now(tz=timezone.utc) + if cert.has_expired() or not (pem_start <= now_utc <= pem_end): + raise RuntimeError(f"The SSL certificate has expired and needs to be replaced, {pem_path}") + + +def read_xlsx(file_path): + check_file_or_directory_path(file_path) + try: + result_df = pd.read_excel(file_path, keep_default_na=False) + except Exception as e: + logger.error(f"The xlsx file failed to load. Please check the path: {file_path}.") + raise RuntimeError(f"Read xlsx file {file_path} failed.") from e + return result_df diff --git a/debug/accuracy_tools/msprobe/core/common/inplace_op_checker.py b/debug/accuracy_tools/msprobe/core/common/inplace_op_checker.py index 231fc1e22e4b8b72f41dcde0024a8a952a2a8cd6..d8544a8aa43dca8625e8b151488e6301ba475109 100644 --- a/debug/accuracy_tools/msprobe/core/common/inplace_op_checker.py +++ b/debug/accuracy_tools/msprobe/core/common/inplace_op_checker.py @@ -1,3 +1,18 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ import os from msprobe.core.common.file_utils import load_yaml diff --git a/debug/accuracy_tools/msprobe/core/common/inplace_ops.yaml b/debug/accuracy_tools/msprobe/core/common/inplace_ops.yaml index 0ca64ec947912570a98ef32d8ca43168cde6379d..dc899cbc8620ea6e62e946660942e1940f2bfa62 100644 --- a/debug/accuracy_tools/msprobe/core/common/inplace_ops.yaml +++ b/debug/accuracy_tools/msprobe/core/common/inplace_ops.yaml @@ -157,6 +157,9 @@ inplace_tensor_op: - trunc_ - unsqueeze_ - xlogy_ + - bitwise_left_shift_ + - bitwise_right_shift_ + - arctan2_ inplace_torch_op: - _add_relu_ @@ -247,5 +250,6 @@ inplace_distributed_op: - all_to_all - all_gather_into_tensor - reduce_scatter_tensor + - batch_isend_irecv diff --git a/debug/accuracy_tools/msprobe/core/common/log.py b/debug/accuracy_tools/msprobe/core/common/log.py index e6389d8c1a84c3b047451a57944492c93f96c7af..f20d25d991ef2d3da1307336e4aa05ec3bc87d86 100644 --- a/debug/accuracy_tools/msprobe/core/common/log.py +++ b/debug/accuracy_tools/msprobe/core/common/log.py @@ -1,6 +1,7 @@ -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -11,7 +12,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import os import time import sys @@ -25,6 +26,7 @@ def filter_special_chars(func): for char in MsgConst.SPECIAL_CHAR: msg = msg.replace(char, '_') return func(self, msg, **kwargs) + return func_level @@ -71,6 +73,7 @@ class BaseLogger: return func(*args, **kwargs) else: return None + return func_rank_0 def info_on_rank_0(self, msg): @@ -81,7 +84,7 @@ class BaseLogger: def warning_on_rank_0(self, msg): return self.on_rank_0(self.warning)(msg) - + def error_log_with_exp(self, msg, exception): self.error(msg) raise exception diff --git a/debug/accuracy_tools/msprobe/core/common/utils.py b/debug/accuracy_tools/msprobe/core/common/utils.py index 0dc06e6367eb7fc8091a441ce530ef0ec0ce330b..7ec0490168f3ec3c39afcd0915f85609e39f0030 100644 --- a/debug/accuracy_tools/msprobe/core/common/utils.py +++ b/debug/accuracy_tools/msprobe/core/common/utils.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,14 +12,17 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import collections import os import re import subprocess import time -import json +from collections import defaultdict from datetime import datetime, timezone +from functools import wraps + +import numpy as np from msprobe.core.common.file_utils import (FileOpen, check_file_or_directory_path, load_json) from msprobe.core.common.const import Const, CompareConst @@ -68,6 +70,11 @@ class MsprobeBaseException(Exception): FUNCTION_CALL_ERROR = 28 FORWARD_DATA_COLLECTION_ERROR = 29 BACKWARD_DATA_COLLECTION_ERROR = 30 + INVALID_KEY_ERROR = 31 + MISSING_HEADER_ERROR = 32 + MERGE_COMPARE_RESULT_ERROR = 33 + NAMES_STRUCTS_MATCH_ERROR = 34 + INVALID_STATE_ERROR = 35 def __init__(self, code, error_info: str = ""): super(MsprobeBaseException, self).__init__() @@ -99,7 +106,14 @@ class DumpException(MsprobeBaseException): return f"Dump Error Code {self.code}: {self.error_info}" -def check_compare_param(input_param, output_path, dump_mode): +def is_json_file(file_path): + if isinstance(file_path, str) and file_path.lower().endswith('.json'): + return True + else: + return False + + +def check_compare_param(input_param, output_path, dump_mode, stack_mode): if not isinstance(input_param, dict): logger.error(f"Invalid input parameter 'input_param', the expected type dict but got {type(input_param)}.") raise CompareException(CompareException.INVALID_PARAM_ERROR) @@ -107,18 +121,31 @@ def check_compare_param(input_param, output_path, dump_mode): logger.error(f"Invalid input parameter 'output_path', the expected type str but got {type(output_path)}.") raise CompareException(CompareException.INVALID_PARAM_ERROR) - check_file_or_directory_path(input_param.get("npu_json_path"), False) - check_file_or_directory_path(input_param.get("bench_json_path"), False) - check_file_or_directory_path(input_param.get("stack_json_path"), False) + def check_json_path(json_path_str): + json_path = input_param.get(json_path_str) + check_file_or_directory_path(json_path, False) + json_type_check = 
is_json_file(json_path) + if not json_type_check: + logger.error(f"Invalid {json_path_str}: {json_path}, please check!") + raise CompareException(CompareException.INVALID_PATH_ERROR) + + check_json_path("npu_json_path") + check_json_path("bench_json_path") + if stack_mode: + check_json_path("stack_json_path") + if dump_mode == Const.ALL: check_file_or_directory_path(input_param.get("npu_dump_data_dir"), True) check_file_or_directory_path(input_param.get("bench_dump_data_dir"), True) check_file_or_directory_path(output_path, True) with FileOpen(input_param.get("npu_json_path"), "r") as npu_json, \ - FileOpen(input_param.get("bench_json_path"), "r") as bench_json, \ - FileOpen(input_param.get("stack_json_path"), "r") as stack_json: - check_json_file(input_param, npu_json, bench_json, stack_json) + FileOpen(input_param.get("bench_json_path"), "r") as bench_json: + _check_json(npu_json, input_param.get("npu_json_path")) + _check_json(bench_json, input_param.get("bench_json_path")) + if stack_mode: + with FileOpen(input_param.get("stack_json_path"), "r") as stack_json: + _check_json(stack_json, input_param.get("stack_json_path")) def check_configuration_param(stack_mode=False, auto_analyze=True, fuzzy_match=False, is_print_compare_log=True): @@ -212,12 +239,18 @@ def md5_find(data): for data_detail in data[key_op][api_info]: if data_detail and 'md5' in data_detail: return True - elif 'md5' in data[key_op][api_info]: + if isinstance(data[key_op][api_info], bool): + continue + elif data[key_op][api_info] and 'md5' in data[key_op][api_info]: return True return False def detect_framework_by_dump_json(file_path): + json_data = load_json(file_path) + framework = json_data.get("framework", None) + if framework in [Const.PT_FRAMEWORK, Const.MS_FRAMEWORK]: + return framework pattern_ms = r'"type":\s*"mindspore' pattern_pt = r'"type":\s*"torch' with FileOpen(file_path, 'r') as file: @@ -247,8 +280,10 @@ def get_stack_construct_by_dump_json_path(dump_json_path): def 
set_dump_path(input_param): npu_path = input_param.get("npu_json_path", None) bench_path = input_param.get("bench_json_path", None) - if not npu_path or not bench_path: - logger.error(f"Please check the json path is valid.") + npu_path_valid = npu_path is not None and npu_path.endswith("dump.json") + bench_path_valid = bench_path is not None and bench_path.endswith("dump.json") + if not npu_path_valid or not bench_path_valid: + logger.error(f"Please check the json path is valid. npu_path: {npu_path}, bench_path: {bench_path}") raise CompareException(CompareException.INVALID_PATH_ERROR) input_param['npu_dump_data_dir'] = os.path.join(os.path.dirname(npu_path), Const.DUMP_TENSOR_DATA) input_param['bench_dump_data_dir'] = os.path.join(os.path.dirname(bench_path), Const.DUMP_TENSOR_DATA) @@ -259,21 +294,36 @@ def get_dump_mode(input_param): bench_path = input_param.get("bench_json_path", None) npu_json_data = load_json(npu_path) bench_json_data = load_json(bench_path) - if npu_json_data['task'] != bench_json_data['task']: + + npu_task = npu_json_data.get('task', None) + bench_task = bench_json_data.get('task', None) + + if not npu_task or not bench_task: + logger.error(f"Please check the dump task is correct, npu's task is {npu_task}, bench's task is {bench_task}.") + raise CompareException(CompareException.INVALID_TASK_ERROR) + + if npu_task != bench_task: logger.error(f"Please check the dump task is consistent.") raise CompareException(CompareException.INVALID_TASK_ERROR) - if npu_json_data['task'] == Const.TENSOR: - dump_mode = Const.ALL - elif npu_json_data['task'] == Const.STATISTICS: - md5_compare = md5_find(npu_json_data['data']) - if md5_compare: - dump_mode = Const.MD5 + + if npu_task == Const.TENSOR: + return Const.ALL + + if npu_task == Const.STRUCTURE: + return Const.STRUCTURE + + if npu_task == Const.STATISTICS: + npu_md5_compare = md5_find(npu_json_data['data']) + bench_md5_compare = md5_find(bench_json_data['data']) + if npu_md5_compare == 
bench_md5_compare: + return Const.MD5 if npu_md5_compare else Const.SUMMARY else: - dump_mode = Const.SUMMARY - else: - logger.error(f"Compare applies only to task is tensor or statistics") - raise CompareException(CompareException.INVALID_TASK_ERROR) - return dump_mode + logger.error(f"Please check the dump task is consistent, " + f"dump mode of npu and bench should both be statistics or md5.") + raise CompareException(CompareException.INVALID_TASK_ERROR) + + logger.error(f"Compare applies only to task is tensor or statistics") + raise CompareException(CompareException.INVALID_TASK_ERROR) def get_header_index(header_name, dump_mode): @@ -288,7 +338,7 @@ def get_header_index(header_name, dump_mode): def convert_tuple(data): - return data if isinstance(data, tuple) else (data, ) + return data if isinstance(data, tuple) else (data,) def check_op_str_pattern_valid(string, op_name=None, stack=False): @@ -308,6 +358,10 @@ def is_invalid_pattern(string): return re.search(pattern, string) +def is_int(x): + return isinstance(x, int) and not isinstance(x, bool) + + def print_tools_ends_info(): total_len = len(Const.TOOL_ENDS_SUCCESSFULLY) + Const.FILL_CHAR_NUMS logger.info('*' * total_len) @@ -327,7 +381,7 @@ def get_step_or_rank_from_string(step_or_rank, obj): raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f'The string parameter for {obj} only supports formats like "3-5". 
' f'Now string parameter for {obj} is "{step_or_rank}".') - if all(Const.STEP_RANK_MAXIMUM_RANGE[0] <= b <= Const.STEP_RANK_MAXIMUM_RANGE[1] for b in borderlines): + if all(Const.STEP_RANK_MINIMUM_VALUE <= b <= Const.STEP_RANK_MAXIMUM_VALUE for b in borderlines): if borderlines[0] <= borderlines[1]: continual_step_or_rank = list(range(borderlines[0], borderlines[1] + 1)) else: @@ -337,7 +391,7 @@ def get_step_or_rank_from_string(step_or_rank, obj): else: raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f"The boundaries must fall within the range of " - f"[{Const.STEP_RANK_MAXIMUM_RANGE[0]}, {Const.STEP_RANK_MAXIMUM_RANGE[1]}].") + f"[{Const.STEP_RANK_MINIMUM_VALUE}, {Const.STEP_RANK_MAXIMUM_VALUE}].") return continual_step_or_rank @@ -349,26 +403,33 @@ def get_real_step_or_rank(step_or_rank_input, obj): return [] if not isinstance(step_or_rank_input, list): raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f"{obj} is invalid, it should be a list") + if len(step_or_rank_input) > Const.STEP_RANK_MAXIMUM_VALUE: + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"{obj} is invalid, its length cannot exceed {Const.STEP_RANK_MAXIMUM_VALUE}") + real_step_or_rank = [] for element in step_or_rank_input: - if not isinstance(element, (int, str)): + if not is_int(element) and not isinstance(element, str): raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f"{obj} element {element} must be an integer or string.") - if isinstance(element, int) and element < 0: - raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, - f"Each element of {obj} must be non-negative, currently it is {element}.") - if isinstance(element, int) and Const.STEP_RANK_MAXIMUM_RANGE[0] <= element <= Const.STEP_RANK_MAXIMUM_RANGE[1]: + if is_int(element): + if not Const.STEP_RANK_MINIMUM_VALUE <= element <= Const.STEP_RANK_MAXIMUM_VALUE: + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, + f"Each element of {obj} must be between 
{Const.STEP_RANK_MINIMUM_VALUE} and " + f"{Const.STEP_RANK_MAXIMUM_VALUE}, currently it is {element}." + ) real_step_or_rank.append(element) - elif isinstance(element, str) and Const.HYPHEN in element: - continual_step_or_rank = get_step_or_rank_from_string(element, obj) - real_step_or_rank.extend(continual_step_or_rank) + continue + continual_step_or_rank = get_step_or_rank_from_string(element, obj) + real_step_or_rank.extend(continual_step_or_rank) real_step_or_rank = list(set(real_step_or_rank)) real_step_or_rank.sort() return real_step_or_rank -def check_seed_all(seed, mode): - if isinstance(seed, int): +def check_seed_all(seed, mode, rm_dropout): + if is_int(seed): if seed < 0 or seed > Const.MAX_SEED_VALUE: logger.error(f"Seed must be between 0 and {Const.MAX_SEED_VALUE}.") raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR) @@ -378,3 +439,78 @@ def check_seed_all(seed, mode): if not isinstance(mode, bool): logger.error("seed_all mode must be bool.") raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR) + if not isinstance(rm_dropout, bool): + logger.error("The rm_dropout parameter must be bool.") + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR) + + +def safe_get_value(container, index, container_name, key=None): + try: + # 处理字典情况 + if isinstance(container, dict): + return container.get(key)[index] + # 处理列表、元组、numpy情况 + elif isinstance(container, (list, tuple, np.ndarray)): + return container[index] + else: + err_msg = f"Unsupported container type for '{container_name}': {type(container)}" + logger.error(err_msg) + raise MsprobeBaseException(MsprobeBaseException.INVALID_OBJECT_TYPE_ERROR) + except IndexError as e: + err_msg = "index out of bounds error occurs, please check!\n" \ + f"{container_name} is {container}\n" \ + f"index is {index}" + logger.error(err_msg) + raise MsprobeBaseException(MsprobeBaseException.INDEX_OUT_OF_BOUNDS_ERROR) from e + except TypeError as e: + err_msg = "wrong type, please check!\n" \ + 
f"{container_name} is {container}\n" \ + f"index is {index}\n" \ + f"key is {key}" + logger.error(err_msg) + raise MsprobeBaseException(MsprobeBaseException.INVALID_OBJECT_TYPE_ERROR) from e + + +# 记录工具函数递归的深度 +recursion_depth = defaultdict(int) + + +# 装饰一个函数,当函数递归调用超过限制时,抛出异常并打印函数信息。 +def recursion_depth_decorator(func_info): + def decorator(func): + @wraps(func) + def wrapper(*args, **kwargs): + func_id = id(func) + recursion_depth[func_id] += 1 + if recursion_depth[func_id] > Const.MAX_DEPTH: + msg = f"call {func_info} exceeds the recursion limit." + logger.error_log_with_exp( + msg, + MsprobeException( + MsprobeException.RECURSION_LIMIT_ERROR, msg + ), + ) + try: + result = func(*args, **kwargs) + finally: + recursion_depth[func_id] -= 1 + return result + + return wrapper + + return decorator + + +def check_str_param(param): + if not re.match(Const.REGEX_PREFIX_PATTERN, param): + logger.error('The parameter {} contains special characters.'.format(param)) + raise MsprobeBaseException(MsprobeBaseException.INVALID_CHAR_ERROR) + + +class DumpPathAggregation: + dump_file_path = None + stack_file_path = None + construct_file_path = None + dump_tensor_data_dir = None + free_benchmark_file_path = None + debug_file_path = None \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/core/common_config.py b/debug/accuracy_tools/msprobe/core/common_config.py index 93f020ca4b43e0891f49dcb83dfbbb5541ebfd30..b9a717c0c52f11e52ac055e3cfe6a0e77fe7e44c 100644 --- a/debug/accuracy_tools/msprobe/core/common_config.py +++ b/debug/accuracy_tools/msprobe/core/common_config.py @@ -1,5 +1,6 @@ -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at @@ -11,11 +12,10 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" + from msprobe.core.common.const import Const, FileCheckConst from msprobe.core.common.log import logger from msprobe.core.common.exceptions import MsprobeException -from msprobe.core.common.file_utils import FileChecker from msprobe.core.common.utils import get_real_step_or_rank @@ -26,8 +26,8 @@ class CommonConfig: self.rank = get_real_step_or_rank(json_config.get('rank'), Const.RANK) self.step = get_real_step_or_rank(json_config.get('step'), Const.STEP) self.level = json_config.get('level') - self.acl_config = json_config.get('acl_config') self.enable_dataloader = json_config.get('enable_dataloader', False) + self.async_dump = json_config.get("async_dump", False) self._check_config() def _check_config(self): @@ -43,16 +43,11 @@ class CommonConfig: if not isinstance(self.enable_dataloader, bool): logger.error_log_with_exp("enable_dataloader is invalid, it should be a boolean", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) - if self.acl_config: - self._check_acl_config() - - def _check_acl_config(self): - if not isinstance(self.acl_config, str): - logger.error_log_with_exp("acl_config is invalid, it should be a string", + if not isinstance(self.async_dump, bool): + logger.error_log_with_exp("async_dump is invalid, it should be a boolean", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) - file_checker = FileChecker( - file_path=self.acl_config, path_type=FileCheckConst.FILE, file_type=FileCheckConst.JSON_SUFFIX) - file_checker.common_check() + elif self.async_dump: + logger.warning("async_dump is True, it may cause OOM when dumping large tensor.") class BaseConfig: @@ -60,7 +55,6 @@ class BaseConfig: self.scope = json_config.get('scope') self.list = json_config.get('list') self.data_mode = 
json_config.get('data_mode') - self.backward_input = json_config.get("backward_input") self.file_format = json_config.get("file_format") self.summary_mode = json_config.get("summary_mode") self.overflow_nums = json_config.get("overflow_nums") @@ -88,24 +82,29 @@ class BaseConfig: def check_config(self): self._check_str_list_config(self.scope, "scope") self._check_str_list_config(self.list, "list") - self._check_str_list_config(self.backward_input, "backward_input") self._check_data_mode() def _check_data_mode(self): if self.data_mode is not None: if not isinstance(self.data_mode, list): - logger.error_log_with_exp(f"data_mode is invalid, it should be a list[str]", + logger.error_log_with_exp("data_mode is invalid, it should be a list[str]", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) - if len(self.data_mode) > len(Const.DUMP_DATA_MODE_LIST): + if Const.ALL in self.data_mode and len(self.data_mode) != 1: + logger.error_log_with_exp( + "'all' cannot be combined with other options in data_mode.", + MsprobeException(MsprobeException.INVALID_PARAM_ERROR) + ) + + if len(self.data_mode) >= len(Const.DUMP_DATA_MODE_LIST): logger.error_log_with_exp( - f"The number of elements in the data_made cannot exceed {len(Const.DUMP_DATA_MODE_LIST)}.", + f"The number of elements in the data_made cannot exceed {len(Const.DUMP_DATA_MODE_LIST) - 1}.", MsprobeException(MsprobeException.INVALID_PARAM_ERROR) ) for mode in self.data_mode: if not isinstance(mode, str): - logger.error_log_with_exp(f"data_mode is invalid, it should be a list[str]", + logger.error_log_with_exp("data_mode is invalid, it should be a list[str]", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if mode not in Const.DUMP_DATA_MODE_LIST: logger.error_log_with_exp( diff --git a/debug/accuracy_tools/msprobe/core/compare/acc_compare.py b/debug/accuracy_tools/msprobe/core/compare/acc_compare.py index 8c6166fcea48288b21492e4eda90935cfa34ff94..f0ac97a0293b5a7ec95b61a4805af179a087eafc 100644 --- 
a/debug/accuracy_tools/msprobe/core/compare/acc_compare.py +++ b/debug/accuracy_tools/msprobe/core/compare/acc_compare.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -15,42 +15,57 @@ import multiprocessing import os +import re +from copy import deepcopy + import pandas as pd from tqdm import tqdm -from msprobe.core.common.file_utils import load_json + +from msprobe.core.advisor.advisor import Advisor from msprobe.core.common.const import CompareConst, Const from msprobe.core.common.exceptions import FileCheckException +from msprobe.core.common.file_utils import load_json, remove_path from msprobe.core.common.log import logger -from msprobe.core.common.utils import add_time_with_xlsx, CompareException, check_op_str_pattern_valid -from msprobe.core.common.file_utils import remove_path -from msprobe.core.compare.check import check_graph_mode, check_struct_match, fuzzy_check_op, check_dump_json_str, \ - check_stack_json_str +from msprobe.core.common.utils import CompareException, add_time_with_xlsx, check_op_str_pattern_valid, safe_get_value +from msprobe.core.compare.check import check_dump_json_str, check_graph_mode, check_stack_json_str, \ + check_struct_match, fuzzy_check_op from msprobe.core.compare.highlight import find_compare_result_error_rows, highlight_rows_xlsx -from msprobe.core.compare.utils import read_op, merge_tensor, get_un_match_accuracy, get_accuracy, \ - get_rela_diff_summary_mode -from msprobe.core.compare.multiprocessing_compute import _handle_multi_process, ComparisonResult, _save_cmp_result -from msprobe.core.compare.npy_compare import compare_ops_apply, get_error_type, reshape_value, get_relative_err, \ - get_error_message -from msprobe.core.advisor.advisor import Advisor +from msprobe.core.compare.multiprocessing_compute import ComparisonResult, 
_handle_multi_process, _save_cmp_result +from msprobe.core.compare.npy_compare import compare_ops_apply, get_error_flag_and_msg +from msprobe.core.compare.utils import get_accuracy, get_rela_diff_summary_mode, get_un_match_accuracy, merge_tensor, \ + print_compare_ends_info, read_op, get_name_and_state, reorder_op_x_list + + +class ModeConfig: + def __init__(self, stack_mode=False, auto_analyze=True, fuzzy_match=False, dump_mode=None): + self.stack_mode = stack_mode + self.auto_analyze = auto_analyze + self.fuzzy_match = fuzzy_match + self.dump_mode = dump_mode class Comparator: - - def __init__(self): - pass + def __init__(self, mode_config: ModeConfig): + self.stack_mode = mode_config.stack_mode + self.auto_analyze = mode_config.auto_analyze + self.fuzzy_match = mode_config.fuzzy_match + self.dump_mode = mode_config.dump_mode @staticmethod def get_result_md5_compare(ms_op_name, bench_op_name, npu_ops_all, bench_ops_all, *args): - result_item = [ms_op_name, bench_op_name, npu_ops_all.get(ms_op_name).get('struct')[0], - bench_ops_all.get(bench_op_name).get('struct')[0], - npu_ops_all.get(ms_op_name).get('struct')[1], - bench_ops_all.get(bench_op_name).get('struct')[1], - npu_ops_all.get(ms_op_name).get('struct')[2], - bench_ops_all.get(bench_op_name).get('struct')[2], - CompareConst.PASS if npu_ops_all.get(ms_op_name).get('struct')[2] - == bench_ops_all.get(bench_op_name).get('struct')[2] - else CompareConst.DIFF] - if args[0]: + npu_struct = npu_ops_all.get(ms_op_name).get('struct', []) + bench_struct = bench_ops_all.get(bench_op_name).get('struct', []) + + if len(npu_struct) < 3 or len(bench_struct) < 3: + logger.error(f"The length of npu_struct and bench_struct must be >= 3, " + f"but got npu_struct={len(npu_struct)} and bench_struct={len(bench_struct)}. 
Please check!") + raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) + + result_item = [ms_op_name, bench_op_name, npu_struct[0], bench_struct[0], + npu_struct[1], bench_struct[1], npu_struct[2], bench_struct[2], + CompareConst.PASS if npu_struct[2] == bench_struct[2] else CompareConst.DIFF] + + if len(args) >= 2 and args[0]: result_item.extend(args[1]) else: result_item.append(CompareConst.NONE) @@ -63,82 +78,98 @@ class Comparator: bench_summary_data, err_msg) result_item.append(accuracy_check) result_item.append(err_msg) - - @classmethod - def make_result_table(cls, result, stack_mode, dump_mode): - header = CompareConst.HEAD_OF_COMPARE_MODE[dump_mode][:] - if stack_mode: + @staticmethod + def _generate_na_data(ops_all): + if not ops_all: + return {} + key = next(iter(ops_all)) + value = deepcopy(ops_all[key]) + for k, v in value.items(): + if isinstance(v, tuple): + value[k] = tuple(CompareConst.N_A for _ in range(len(v))) + elif isinstance(v, list): + value[k] = [CompareConst.N_A] * len(v) + else: + value[k] = CompareConst.N_A + return value + + def make_result_table(self, result): + header = CompareConst.HEAD_OF_COMPARE_MODE[self.dump_mode][:] + + if self.stack_mode: header.append(CompareConst.STACK) - if dump_mode == Const.ALL: + if self.dump_mode == Const.ALL: header.append(CompareConst.DATA_NAME) else: - if dump_mode == Const.ALL: + if self.dump_mode == Const.ALL: for row in result: - del row[-2] # 输出结果不要堆栈信息时,删除中间结果result中的stack info,真实数据时为倒数第2列 + del row[-2] # 输出结果不要堆栈信息时,删除中间结果result中的stack info,真实数据时为倒数第2列 header.append(CompareConst.DATA_NAME) else: for row in result: - del row[-1] # 输出结果不要堆栈信息时,删除中间结果result中的stack info,非真实数据时为倒数第1列 + del row[-1] # 输出结果不要堆栈信息时,删除中间结果result中的stack info,非真实数据时为倒数第1列 result_df = pd.DataFrame(result, columns=header, dtype='object') - return result_df - - @classmethod - def gen_merge_list(cls, json_data, op_name, stack_json_data, dump_mode): + return result_df + + def gen_merge_list(self, json_data, op_name, 
stack_json_data): op_data = json_data['data'][op_name] check_dump_json_str(op_data, op_name) op_parsed_list = read_op(op_data, op_name) - stack_info = stack_json_data.get(op_name) - if stack_info is not None: - check_stack_json_str(stack_info, op_name) - op_parsed_list.append({ - 'full_op_name': op_name, - 'full_info': stack_info - }) - - merge_list = merge_tensor(op_parsed_list, dump_mode) + if self.stack_mode: + stack_info = stack_json_data.get(op_name) + if stack_info is not None: + check_stack_json_str(stack_info, op_name) + # append only when stack_mode is True, + op_parsed_list.append({ + 'full_op_name': op_name, + 'full_info': stack_info + }) + + merge_list = merge_tensor(op_parsed_list, self.dump_mode) return merge_list - - def check_op(self, npu_dict, bench_dict, fuzzy_match): - a_op_name = npu_dict["op_name"] - b_op_name = bench_dict["op_name"] - graph_mode = check_graph_mode(a_op_name[0], b_op_name[0]) - + + def check_op(self, npu_dict, bench_dict): + npu_op_name = npu_dict[CompareConst.OP_NAME] + bench_op_name = bench_dict[CompareConst.OP_NAME] + graph_mode = check_graph_mode(safe_get_value(npu_op_name, 0, "npu_op_name"), + safe_get_value(bench_op_name, 0, "bench_op_name")) + frame_name = getattr(self, "frame_name") if frame_name == "PTComparator": from msprobe.pytorch.compare.match import graph_mapping if graph_mode: - return graph_mapping.match(a_op_name[0], b_op_name[0]) + return graph_mapping.match(npu_op_name[0], bench_op_name[0]) struct_match = check_struct_match(npu_dict, bench_dict) - if not fuzzy_match: - return a_op_name == b_op_name and struct_match - is_match = True + if not self.fuzzy_match: + name_match = npu_op_name == bench_op_name + return name_match and struct_match try: - is_match = fuzzy_check_op(a_op_name, b_op_name) + name_match = fuzzy_check_op(npu_op_name, bench_op_name) except Exception as err: - logger.warning("%s and %s can not fuzzy match." 
% (a_op_name, b_op_name)) - is_match = False - return is_match and struct_match - - def match_op(self, npu_queue, bench_queue, fuzzy_match): + logger.warning("%s and %s can not fuzzy match." % (npu_op_name, bench_op_name)) + name_match = False + return name_match and struct_match + + def match_op(self, npu_queue, bench_queue): for b_index, b_op in enumerate(bench_queue[0: -1]): - if self.check_op(npu_queue[-1], b_op, fuzzy_match): + if self.check_op(npu_queue[-1], b_op): return len(npu_queue) - 1, b_index - if self.check_op(npu_queue[-1], bench_queue[-1], fuzzy_match): + if self.check_op(npu_queue[-1], bench_queue[-1]): return len(npu_queue) - 1, len(bench_queue) - 1 for n_index, n_op in enumerate(npu_queue[0: -1]): - if self.check_op(n_op, bench_queue[-1], fuzzy_match): + if self.check_op(n_op, bench_queue[-1]): return n_index, len(bench_queue) - 1 return -1, -1 - - def compare_process(self, file_lists, stack_mode, fuzzy_match, dump_mode): + + def compare_process(self, file_lists): npu_json_path, bench_json_path, stack_json_path = file_lists npu_json_data = load_json(npu_json_path) bench_json_data = load_json(bench_json_path) - stack_json_data = load_json(stack_json_path) + stack_json_data = load_json(stack_json_path) if self.stack_mode else None - if fuzzy_match: + if self.fuzzy_match: logger.warning("This task uses fuzzy matching, which may affect the accuracy of the comparison.") npu_ops_queue = [] @@ -162,8 +193,7 @@ class Comparator: last_npu_ops_len = len(npu_ops_queue) op_name_npu = next(ops_npu_iter) check_op_str_pattern_valid(op_name_npu) - read_err_npu = True - npu_merge_list = self.gen_merge_list(npu_json_data, op_name_npu, stack_json_data, dump_mode) + npu_merge_list = self.gen_merge_list(npu_json_data, op_name_npu, stack_json_data) if npu_merge_list: npu_ops_queue.append(npu_merge_list) except StopIteration: @@ -172,7 +202,7 @@ class Comparator: last_bench_ops_len = len(bench_ops_queue) op_name_bench = next(ops_bench_iter) 
check_op_str_pattern_valid(op_name_bench)
-                    bench_merge_list = self.gen_merge_list(bench_json_data, op_name_bench, stack_json_data, dump_mode)
+                    bench_merge_list = self.gen_merge_list(bench_json_data, op_name_bench, stack_json_data)
                     if bench_merge_list:
                         bench_ops_queue.append(bench_merge_list)
                 except StopIteration:
@@ -191,77 +221,105 @@ class Comparator:
                     logger.info("Please check whether the number and calls of APIs in NPU and Bench models are consistent.")
                     break
 
-            n_match_point, b_match_point = self.match_op(npu_ops_queue, bench_ops_queue, fuzzy_match)
+            n_match_point, b_match_point = self.match_op(npu_ops_queue, bench_ops_queue)
+
+            # if no match is found yet, keep the data queued and skip; once a later match is found, the APIs queued before the match point are recorded as unmatched
            if n_match_point == -1 and b_match_point == -1:
                 continue
+
             n_match_data = npu_ops_queue[n_match_point]
             b_match_data = bench_ops_queue[b_match_point]
             un_match_data = npu_ops_queue[0: n_match_point]
             for npu_data in un_match_data:
-                get_un_match_accuracy(result, npu_data, dump_mode)
-            get_accuracy(result, n_match_data, b_match_data, dump_mode)
+                get_un_match_accuracy(result, npu_data, self.dump_mode)
+            get_accuracy(result, n_match_data, b_match_data, self.dump_mode)
             del npu_ops_queue[0: n_match_point + 1]
             del bench_ops_queue[0: b_match_point + 1]
+        progress_bar.close()
         if npu_ops_queue:
             for npu_data in npu_ops_queue:
-                get_un_match_accuracy(result, npu_data, dump_mode)
-
-        result_df = self.make_result_table(result, stack_mode, dump_mode)
+                get_un_match_accuracy(result, npu_data, self.dump_mode)
+
+        result_df = self.make_result_table(result)
         return result_df
 
-    def merge_data(self, json_data, stack_json_data, dump_mode):
+    def merge_data(self, json_data, stack_json_data):
         ops_all = {}
         for op_name in json_data.get('data', {}):
-            merge_list = self.gen_merge_list(json_data, op_name, stack_json_data, dump_mode)
+            merge_list = self.gen_merge_list(json_data, op_name, stack_json_data)
             if merge_list:
-                input_index, output_index = 0, 0
-                for index, input_or_output in enumerate(merge_list['op_name']):
-                    
input_or_output_list = input_or_output.split(Const.SEP) - data_name = merge_list.get('data_name') - data_name = data_name[index] if data_name else None - if Const.INPUT in input_or_output_list or Const.KWARGS in input_or_output_list: - ops_all[input_or_output] = {'struct': merge_list.get('input_struct')[input_index], - 'summary': merge_list.get('summary')[index], - 'data_name': data_name, - 'stack_info': merge_list.get('stack_info')} - input_index += 1 - - elif Const.OUTPUT in input_or_output_list: - ops_all[input_or_output] = {'struct': merge_list.get('output_struct')[output_index], - 'summary': merge_list.get('summary')[index], - 'data_name': data_name, - 'stack_info': merge_list.get('stack_info')} - output_index += 1 + struct_to_index_mapping = { + CompareConst.INPUT_STRUCT: 0, + CompareConst.OUTPUT_STRUCT: 0, + CompareConst.PARAMS_STRUCT: 0, + CompareConst.PARAMS_GRAD_STRUCT: 0 + } + + op_name_list = merge_list.get(CompareConst.OP_NAME) + summary_list = merge_list.get(Const.SUMMARY) + data_name_list = merge_list.get('data_name') + op_name_reorder, summary_reorder, data_name_reorder = reorder_op_x_list(op_name_list, + summary_list, + data_name_list) + for index, op_full_name in enumerate(op_name_reorder): + data_name = data_name_reorder[index] if data_name_reorder else None + + _, state = get_name_and_state(op_full_name) + struct_key = CompareConst.STATE_TO_STRUCT_MAPPING.get(state) + if not struct_key: + continue + ops_all[op_full_name] = { + CompareConst.STRUCT: safe_get_value(merge_list, struct_to_index_mapping.get(struct_key), + "merge_list", key=struct_key), + CompareConst.SUMMARY: safe_get_value(summary_reorder, index, "summary_reorder"), + 'data_name': data_name, + 'stack_info': merge_list.get('stack_info') + } + struct_to_index_mapping[struct_key] += 1 return ops_all - def get_accuracy(self, npu_ops_all, bench_ops_all, dump_mode): + def get_accuracy(self, npu_ops_all, bench_ops_all): result = [] + bench_ops_all[CompareConst.N_A] = 
self._generate_na_data(bench_ops_all) for ms_op_name, bench_op_name in self.data_mapping_dict.items(): if ms_op_name in npu_ops_all and bench_op_name in bench_ops_all: npu_stack_info = npu_ops_all.get(ms_op_name).get("stack_info", None) bench_stack_info = bench_ops_all.get(bench_op_name).get("stack_info", None) has_stack = npu_stack_info and bench_stack_info - if dump_mode == Const.MD5: + if self.dump_mode == Const.MD5: result.append(self.get_result_md5_compare(ms_op_name, bench_op_name, npu_ops_all, bench_ops_all, has_stack, npu_stack_info)) continue - if dump_mode == Const.SUMMARY: - result_item = [ms_op_name, bench_op_name, npu_ops_all.get(ms_op_name).get('struct')[0], - bench_ops_all.get(bench_op_name).get('struct')[0], - npu_ops_all.get(ms_op_name).get('struct')[1], - bench_ops_all.get(bench_op_name).get('struct')[1], - " ", " ", " ", " ", " ", " ", " ", " "] + + npu_struct = npu_ops_all.get(ms_op_name).get('struct', []) + bench_struct = bench_ops_all.get(bench_op_name).get('struct', []) + + if len(npu_struct) < 2 or len(bench_struct) < 2: + logger.error( + f"The length of npu_struct and bench_struct must be >= 2, " + f"but got npu_struct={len(npu_struct)} and bench_struct={len(bench_struct)}. " + f"Please check!" 
+            )
+            raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR)
+
+            base_result_item = [
+                ms_op_name, bench_op_name,
+                npu_struct[0],
+                bench_struct[0],
+                npu_struct[1],
+                bench_struct[1]
+            ]
+
+            if self.dump_mode == Const.SUMMARY:
+                result_item = base_result_item + [" "] * 8  # 8 comparison-metric columns for summary (statistics) data
             else:
-                result_item = [ms_op_name, bench_op_name, npu_ops_all.get(ms_op_name).get('struct')[0],
-                               bench_ops_all.get(bench_op_name).get('struct')[0],
-                               npu_ops_all.get(ms_op_name).get('struct')[1],
-                               bench_ops_all.get(bench_op_name).get('struct')[1],
-                               " ", " ", " ", " ", " "]
+                result_item = base_result_item + [" "] * 6  # 6 comparison-metric columns for real data
+
             npu_summary_data = npu_ops_all.get(ms_op_name).get("summary")
             result_item.extend(npu_summary_data)
             bench_summary_data = bench_ops_all.get(bench_op_name).get("summary")
             result_item.extend(bench_summary_data)
-            if dump_mode == Const.SUMMARY:
+            if self.dump_mode == Const.SUMMARY:
                 self.calculate_summary_data(npu_summary_data, bench_summary_data, result_item)
             else:
                 result_item.append(CompareConst.ACCURACY_CHECK_YES)
@@ -270,7 +328,7 @@ class Comparator:
                 result_item.extend(npu_stack_info)
             else:
                 result_item.append(CompareConst.NONE)
-            if dump_mode == Const.ALL:
+            if self.dump_mode == Const.ALL:
                 result_item.append(npu_ops_all.get(ms_op_name).get("data_name", None))
             result.append(result_item)
         elif ms_op_name not in npu_ops_all:
@@ -279,26 +337,39 @@ class Comparator:
             logger.warning(f'Can not find bench op name : `{bench_op_name}` in bench dump json file.')
         return result
 
-    def compare_process_custom(self, file_lists, stack_mode, dump_mode):
+    def compare_process_custom(self, file_lists):
         npu_json_path, bench_json_path, stack_json_path = file_lists
         npu_json_data = load_json(npu_json_path)
         bench_json_data = load_json(bench_json_path)
-        stack_json_data = load_json(stack_json_path)
+        stack_json_data = load_json(stack_json_path) if self.stack_mode else None
+        npu_ops_all = self.merge_data(npu_json_data, stack_json_data)
+        bench_ops_all = self.merge_data(bench_json_data, stack_json_data)
 
-        npu_ops_all = self.merge_data(npu_json_data, stack_json_data, dump_mode)
-        bench_ops_all = self.merge_data(bench_json_data, stack_json_data, dump_mode)
-
-        result = self.get_accuracy(npu_ops_all, bench_ops_all, dump_mode)
-        result_df = self.make_result_table(result, stack_mode, dump_mode)
+        result = self.get_accuracy(npu_ops_all, bench_ops_all)
+        result_df = self.make_result_table(result)
         return result_df
 
-    def compare_by_op(self, npu_op_name, bench_op_name, op_name_mapping_dict, input_param):
+    def compare_by_op(self, npu_op_name, bench_op_name, op_name_mapping_dict, input_param, bench_data):
+        """
+        :param npu_op_name: NPU_Name in the Excel sheet, e.g. MintFunctional.conv2d.0.forward.input.3.0
+        :param bench_op_name: Bench_Name in the Excel sheet, e.g. Functional.conv2d.0.forward.input.3.0
+        :param op_name_mapping_dict: mapping between op_name and the npy or pt file
+        :param input_param: parameters such as npu_json_path/bench_json_path/stack_json_path
+        :param bench_data: the "data" field of the bench dump data
+        :return: result_list with the cosine similarity, max absolute error, max relative error,
+        one-thousandth error ratio, five-thousandth error ratio and the error message
+        Reads NPU_Name and Bench_Name from the Excel sheet, locates the npy or pt files through the mapping,
+        then reads and compares the data in those files, computing the cosine similarity, max absolute error,
+        max relative error, one-thousandth error ratio and five-thousandth error ratio, and generating error messages
+        """
         npu_bench_name_list = op_name_mapping_dict[npu_op_name]
-        data_name = npu_bench_name_list[1]
+        data_name = safe_get_value(npu_bench_name_list, 1, "npu_bench_name_list")
         error_file, relative_err, error_flag = None, None, False
+        bench_data_name = get_bench_data_name(bench_op_name, bench_data)
         if data_name == '-1' or data_name == -1:  # no real data path
             n_value, b_value = CompareConst.READ_NONE, CompareConst.READ_NONE
             error_flag = True
+        elif not bench_data_name:
+            n_value, b_value, error_flag = CompareConst.READ_NONE, CompareConst.READ_NONE, True
+            error_file = 'no_bench_data'
         else:
             try:
                 read_npy_data = getattr(self, "read_npy_data")
@@ -306,42 +377,39 @@ class Comparator:
                 if frame_name == "MSComparator":
                     n_value = read_npy_data(input_param.get("npu_dump_data_dir"), npu_op_name + Const.NUMPY_SUFFIX)
                     if self.cross_frame:
-                        b_value = read_npy_data(input_param.get("bench_dump_data_dir"),
-                                                bench_op_name + Const.PT_SUFFIX, load_pt_file=True)
+                        b_value = read_npy_data(input_param.get("bench_dump_data_dir"), bench_data_name,
+                                                load_pt_file=True)
                     else:
-                        b_value = read_npy_data(input_param.get("bench_dump_data_dir"),
-                                                bench_op_name + Const.NUMPY_SUFFIX)
+                        b_value = read_npy_data(input_param.get("bench_dump_data_dir"), bench_data_name)
                 else:
                     n_value = read_npy_data(input_param.get("npu_dump_data_dir"), npu_op_name + Const.PT_SUFFIX)
-                    b_value = read_npy_data(input_param.get("bench_dump_data_dir"), bench_op_name + Const.PT_SUFFIX)
+                    b_value = read_npy_data(input_param.get("bench_dump_data_dir"), bench_data_name)
             except IOError as error:
                 error_file = error.filename
                 n_value, b_value = CompareConst.READ_NONE, CompareConst.READ_NONE
                 error_flag = True
-            except FileCheckException:
+            except (FileCheckException, CompareException):
                 error_file = data_name
                 n_value, b_value = CompareConst.READ_NONE, CompareConst.READ_NONE
                 error_flag = True
 
-        n_value, b_value, error_flag = get_error_type(n_value, b_value, error_flag)
-        if not error_flag:
-            relative_err = get_relative_err(n_value, b_value)
-            n_value, b_value = reshape_value(n_value, b_value)
+        # derive the error flag and the error message from n_value and b_value in one pass
+        n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value, b_value,
+                                                                       error_flag=error_flag, error_file=error_file)
 
-        err_msg = get_error_message(n_value, b_value, npu_op_name, error_flag, error_file=error_file)
-        result_list, err_msg = compare_ops_apply(n_value, b_value, error_flag, err_msg, relative_err=relative_err)
+        result_list, err_msg = compare_ops_apply(n_value, b_value, error_flag, err_msg)
 
-        if npu_op_name != bench_op_name and bench_op_name != CompareConst.N_A:
+        if self.fuzzy_match and npu_op_name != bench_op_name and bench_op_name != CompareConst.N_A:
             err_msg += " Fuzzy matching data, the comparison accuracy may be affected."
result_list.append(err_msg) return result_list - - def compare_core(self, input_parma, output_path, **kwargs): + + def compare_core(self, input_param, output_path, **kwargs): """ Compares data from multiple JSON files and generates a comparison report. Args: - input_parma (dict): A dictionary containing paths to JSON files ("npu_path", "bench_path", + input_param (dict): A dictionary containing paths to JSON files ("npu_path", "bench_path", "stack_path"). output_path (str): The path where the output Excel report will be saved. **kwargs: Additional keyword arguments including: @@ -354,57 +422,58 @@ class Comparator: Returns: """ # get kwargs or set default value - stack_mode = kwargs.get('stack_mode', False) - auto_analyze = kwargs.get('auto_analyze', True) suffix = kwargs.get('suffix', '') - fuzzy_match = kwargs.get('fuzzy_match', False) - dump_mode = kwargs.get('dump_mode', None) logger.info("Please check whether the input data belongs to you. If not, there may be security risks.") file_name = add_time_with_xlsx("compare_result" + suffix) file_path = os.path.join(os.path.realpath(output_path), file_name) remove_path(file_path) - highlight_dict = {'red_rows': [], 'yellow_rows': []} + highlight_dict = {"red_rows": set(), "yellow_rows": set(), "red_lines": [], "yellow_lines": []} - npu_json = input_parma.get("npu_json_path") - bench_json = input_parma.get("bench_json_path") - stack_json = input_parma.get("stack_json_path") + npu_json = input_param.get("npu_json_path") + bench_json = input_param.get("bench_json_path") + stack_json = input_param.get("stack_json_path") if self.data_mapping: - result_df = self.compare_process_custom([npu_json, bench_json, stack_json], stack_mode, dump_mode) + result_df = self.compare_process_custom([npu_json, bench_json, stack_json]) else: - result_df = self.compare_process([npu_json, bench_json, stack_json], stack_mode, fuzzy_match, dump_mode) + result_df = self.compare_process([npu_json, bench_json, stack_json]) if not 
result_df.values.tolist(): logger.warning("Can`t match any op.") return - if dump_mode == Const.ALL: - result_df = self._do_multi_process(input_parma, result_df) + if self.dump_mode == Const.ALL: + result_df = self.do_multi_process(input_param, result_df) - logger.info("Highlight suspicious API/Module start.") - find_compare_result_error_rows(result_df, highlight_dict, dump_mode) + find_compare_result_error_rows(result_df, highlight_dict, self.dump_mode) highlight_rows_xlsx(result_df, highlight_dict, file_path) - logger.info("Highlight suspicious API/Module finish.") - if auto_analyze: + if self.auto_analyze: advisor = Advisor(result_df, output_path, suffix) advisor.analysis() - + + print_compare_ends_info() + def compare_ops(self, idx, dump_path_dict, result_df, lock, input_param): cos_result = [] + euc_dist_result = [] max_err_result = [] max_relative_err_result = [] - err_mess = [] one_thousand_err_ratio_result = [] five_thousand_err_ratio_result = [] + err_mess = [] + is_print_compare_log = input_param.get("is_print_compare_log") + bench_data = load_json(input_param.get("bench_json_path")).get('data') for i in range(len(result_df)): npu_op_name = result_df.iloc[i, 0] bench_op_name = result_df.iloc[i, 1] if is_print_compare_log: logger.info("start compare: {}".format(npu_op_name)) - cos_sim, max_abs_err, max_relative_err, one_thousand_err_ratio, five_thousand_err_ratio, err_msg = \ - self.compare_by_op(npu_op_name, bench_op_name, dump_path_dict, input_param) + + cos_sim, euc_dist, max_abs_err, max_relative_err, one_thousand_err_ratio, five_thousand_err_ratio, err_msg \ + = self.compare_by_op(npu_op_name, bench_op_name, dump_path_dict, input_param, bench_data) + if is_print_compare_log: logger.info( "[{}] Compare result: cosine {}, max_abs_err {}, max_relative_err {}, {}, \ @@ -412,29 +481,73 @@ class Comparator: "five_thousand_err_ratio {}".format(npu_op_name, cos_sim, max_abs_err, max_relative_err, err_msg, one_thousand_err_ratio, five_thousand_err_ratio)) 
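The loop in `compare_ops` above collects one list per metric column (cosine similarity, Euclidean distance, max absolute/relative error, and the thousandth-error ratios) before bundling them into a `ComparisonResult`. As a simplified NumPy sketch of what three of those columns mean — not the msprobe implementation (`compare_ops_apply` computes the real values):

```python
import numpy as np

def sketch_metrics(n_value, b_value):
    """Simplified versions of three of the comparison columns."""
    n = np.asarray(n_value, dtype=np.float64).ravel()
    b = np.asarray(b_value, dtype=np.float64).ravel()
    # Cosine similarity between the flattened NPU and bench tensors.
    cos_sim = float(np.dot(n, b) / (np.linalg.norm(n) * np.linalg.norm(b)))
    # Largest element-wise absolute difference.
    max_abs_err = float(np.max(np.abs(n - b)))
    # Share of elements whose relative error is below 1e-3, guarded against division by zero.
    rel_err = np.abs(n - b) / np.maximum(np.abs(b), np.finfo(np.float64).eps)
    one_thousandth_ratio = float(np.mean(rel_err < 1e-3))
    return cos_sim, max_abs_err, one_thousandth_ratio
```

Identical tensors yield a cosine similarity of 1.0, zero max absolute error, and a thousandth-error ratio of 1.0.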
cos_result.append(cos_sim) + euc_dist_result.append(euc_dist) max_err_result.append(max_abs_err) max_relative_err_result.append(max_relative_err) - err_mess.append(err_msg) one_thousand_err_ratio_result.append(one_thousand_err_ratio) five_thousand_err_ratio_result.append(five_thousand_err_ratio) + err_mess.append(err_msg) cr = ComparisonResult( cos_result=cos_result, + euc_dist_result=euc_dist_result, max_err_result=max_err_result, max_relative_err_result=max_relative_err_result, - err_msgs=err_mess, one_thousand_err_ratio_result=one_thousand_err_ratio_result, - five_thousand_err_ratio_result=five_thousand_err_ratio_result + five_thousand_err_ratio_result=five_thousand_err_ratio_result, + err_msgs=err_mess ) - return _save_cmp_result(idx, cr, result_df, lock) - - def _do_multi_process(self, input_parma, result_df): + return _save_cmp_result(idx, cr, result_df, lock) + + def do_multi_process(self, input_param, result_df): try: - result_df = _handle_multi_process(self.compare_ops, input_parma, result_df, + result_df = _handle_multi_process(self.compare_ops, input_param, result_df, multiprocessing.Manager().RLock()) return result_df except ValueError as e: logger.error('result dataframe is not found.') raise CompareException(CompareException.INVALID_DATA_ERROR) from e - \ No newline at end of file + + +def get_bench_data_name(bench_op_name, bench_data): + bench_name_list = re.split(r'\.(input|output|kwargs|parameters|parameters_grad)\.', bench_op_name) + if len(bench_name_list) > 1 and bench_name_list[1] == Const.PARAMS_GRAD: + bench_data_bundle = bench_data.get(bench_name_list[0] + Const.SEP + bench_name_list[1], {}) + else: + bench_data_bundle = bench_data.get(bench_name_list[0], {}) + if not bench_data_bundle or len(bench_name_list) < 3: + return None + layers = bench_name_list[2].split(Const.SEP) + + def _get(key, container): + if isinstance(container, dict): + return container.get(key) + if isinstance(container, list): + try: + return container[int(key)] + except 
(ValueError, IndexError):
+                return None
+        return None
+
+    def get_by_layer(container, params_grad=False):
+        data = container
+        # in dump.json, parameters_grad is structured as key: [{}]; when the key exists it holds exactly one list element, while op_name only names the key, so append '0'
+        if params_grad:
+            layers.append('0')
+        for layer in layers:
+            data = _get(layer, data)
+        return _get(CompareConst.DATA_NAME.lower(), data)
+
+    if Const.INPUT == bench_name_list[1]:
+        return get_by_layer(bench_data_bundle.get(Const.INPUT, bench_data_bundle.get(Const.INPUT_ARGS)))
+    elif Const.KWARGS == bench_name_list[1]:
+        return get_by_layer(bench_data_bundle.get(Const.INPUT_KWARGS))
+    elif Const.OUTPUT == bench_name_list[1]:
+        return get_by_layer(bench_data_bundle.get(Const.OUTPUT))
+    elif Const.PARAMS == bench_name_list[1]:
+        return get_by_layer(bench_data_bundle.get(Const.PARAMS))
+    elif Const.PARAMS_GRAD == bench_name_list[1]:
+        return get_by_layer(bench_data_bundle, params_grad=True)
+    else:
+        return None
diff --git a/debug/accuracy_tools/msprobe/core/compare/check.py b/debug/accuracy_tools/msprobe/core/compare/check.py
index 67f684d4cb9d6a71898c139fd85b5f1656bfea91..653823e20b29b14b6e7ede929f3bd2865bffaa18 100644
--- a/debug/accuracy_tools/msprobe/core/compare/check.py
+++ b/debug/accuracy_tools/msprobe/core/compare/check.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd.
 # All rights reserved. 
#
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -16,8 +16,7 @@
 from msprobe.core.common.log import logger
 from msprobe.core.compare.utils import rename_api
 from msprobe.core.common.utils import check_op_str_pattern_valid, CompareException
-from msprobe.core.common.const import Const
-
+from msprobe.core.common.const import CompareConst, Const
 
 dtype_mapping = {
     "Int8": "torch.int8",
@@ -35,37 +34,43 @@ dtype_mapping = {
     "BFloat16": "torch.bfloat16",
     "Complex64": "torch.complex64",
     "Complex128": "torch.complex128"
-    }
+}
+
+
+def compare_op_dict_struct(npu_dict, bench_dict):
+    return all(npu_dict.get(key) == bench_dict.get(key) for key in CompareConst.STRUCT_COMPARE_KEY)
 
-def check_struct_match(npu_dict, bench_dict, cross_frame=False):
-    npu_struct_in = npu_dict.get("input_struct")
-    bench_struct_in = bench_dict.get("input_struct")
-    npu_struct_out = npu_dict.get("output_struct")
-    bench_struct_out = bench_dict.get("output_struct")
-    if cross_frame:
-        npu_struct_in = [(dtype_mapping.get(item[0], item[0]), item[1]) for item in npu_struct_in]
-        npu_struct_out = [(dtype_mapping.get(item[0], item[0]), item[1]) for item in npu_struct_out]
-    is_match = npu_struct_in == bench_struct_in and npu_struct_out == bench_struct_out
+def check_struct_match(npu_dict, bench_dict):
+    is_match = compare_op_dict_struct(npu_dict, bench_dict)
     if not is_match:
-        if len(npu_struct_in) == 0 or len(bench_struct_in) == 0 or len(npu_struct_in) != len(bench_struct_in):
-            return False
+        struct_match_list = []
         try:
-            struct_in_is_match = check_type_shape_match(npu_struct_in, bench_struct_in)
-            struct_out_is_match = check_type_shape_match(npu_struct_out, bench_struct_out)
+            for i, key in enumerate(CompareConst.STRUCT_COMPARE_KEY):
+                # additionally check up front whether input_struct is empty; input_struct can never be empty
+                if i == 0 and (not npu_dict.get(key, []) or not bench_dict.get(key, [])):
+                    return False
+                struct_match_list.append(check_type_shape_match(npu_dict.get(key, []), bench_dict.get(key, [])))
        except CompareException as error:
             err_msg = f'index out of bounds error occurs in npu or bench api, please check!\n' \
                       f'npu_dict: {npu_dict}' \
                       f'bench_dict: {bench_dict}'
             logger.error(err_msg)
             raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from error
-        is_match = struct_in_is_match and struct_out_is_match
+        is_match = all(struct_match_list)
     return is_match
 
 
 def check_type_shape_match(npu_struct, bench_struct):
-    shape_type_match = False
+    """
+    further check dtypes with a dtype mapping list when dtypes are not entirely consistent.
+    """
+    if len(npu_struct) != len(bench_struct):
+        return False
+    if not npu_struct and not bench_struct:
+        return True
+
+    struct_match = False
     for npu_type_shape, bench_type_shape in zip(npu_struct, bench_struct):
         try:
             npu_type = npu_type_shape[0]
@@ -79,22 +84,14 @@ def check_type_shape_match(npu_struct, bench_struct):
             shape_match = npu_shape == bench_shape
             type_match = npu_type == bench_type
             if not type_match:
-                ms_type = [
-                    [Const.FLOAT16, Const.FLOAT32], [Const.FLOAT32, Const.FLOAT16],
-                    [Const.FLOAT16, Const.BFLOAT16], [Const.BFLOAT16, Const.FLOAT16]
-                ]
-                torch_type = [
-                    [Const.TORCH_FLOAT16, Const.TORCH_FLOAT32], [Const.TORCH_FLOAT32, Const.TORCH_FLOAT16],
-                    [Const.TORCH_FLOAT16, Const.TORCH_BFLOAT16], [Const.TORCH_BFLOAT16, Const.TORCH_FLOAT16]
-                ]
-                if ([npu_type, bench_type] in ms_type) or ([npu_type, bench_type] in torch_type):
+                if ([npu_type, bench_type] in CompareConst.MS_TYPE) or ([npu_type, bench_type] in CompareConst.TORCH_TYPE):
                     type_match = True
                 else:
                     type_match = False
-            shape_type_match = shape_match and type_match
-            if not shape_type_match:
+            struct_match = shape_match and type_match
+            if not struct_match:
                 return False
-    return shape_type_match
+    return struct_match
 
 
 def check_graph_mode(a_op_name, b_op_name):
@@ -106,6 +103,8 @@ def check_graph_mode(a_op_name, b_op_name):
 
 def fuzzy_check_op(npu_name_list, bench_name_list):
+    # first check that the item lengths of the api are equal; if it is not parameters_grad, there must be input or output, so the length cannot be 0
+    # if it is parameters_grad, the dict under the "parameters_grad" field is never an empty dict, so len >= 1
     if len(npu_name_list) == 0 or len(bench_name_list) == 0 or len(npu_name_list) != len(bench_name_list):
         return False
     is_match = True
@@ -151,11 +150,11 @@ def check_json_key_value(input_output, op_name, depth=0):
         return
     if isinstance(input_output, list):
         for item in input_output:
-            check_json_key_value(item, op_name, depth+1)
+            check_json_key_value(item, op_name, depth + 1)
     elif isinstance(input_output, dict):
         for key, value in input_output.items():
             if isinstance(value, dict):
-                check_json_key_value(value, op_name, depth+1)
+                check_json_key_value(value, op_name, depth + 1)
             else:
                 valid_key_value(key, value, op_name)
diff --git a/debug/accuracy_tools/msprobe/core/compare/compare_cli.py b/debug/accuracy_tools/msprobe/core/compare/compare_cli.py
index 086f1cebd74688aadb1517d2fa0aaf8b5cadd058..7df7315043cb57b057871a7d12f5aa63cf927c74 100644
--- a/debug/accuracy_tools/msprobe/core/compare/compare_cli.py
+++ b/debug/accuracy_tools/msprobe/core/compare/compare_cli.py
@@ -24,6 +24,12 @@ def compare_cli(args):
     input_param = load_json(args.input_path)
     npu_path = input_param.get("npu_path", None)
     bench_path = input_param.get("bench_path", None)
+    if not npu_path:
+        logger.error(f"Missing npu_path in configuration file {args.input_path}, please check!")
+        raise CompareException(CompareException.INVALID_PATH_ERROR)
+    if not bench_path:
+        logger.error(f"Missing bench_path in configuration file {args.input_path}, please check!")
+        raise CompareException(CompareException.INVALID_PATH_ERROR)
     frame_name = args.framework
     auto_analyze = not args.compare_only
     if frame_name == Const.PT_FRAMEWORK:
@@ -32,30 +38,43 @@ def compare_cli(args):
     else:
         from msprobe.mindspore.compare.ms_compare import ms_compare
         from msprobe.mindspore.compare.distributed_compare import ms_compare_distributed, ms_graph_compare
+
+    common_kwargs = {
+        "auto_analyze": auto_analyze,
+        "fuzzy_match": args.fuzzy_match,
+        "data_mapping": args.data_mapping,
+    }
+
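The `compare_cli` change above collects the flags shared by every entry point into a single `common_kwargs` dict and layers framework-specific options on top via dict unpacking, so later keys extend or override the shared defaults. A minimal sketch of that merge pattern (names are illustrative, not the msprobe API):

```python
def merged_kwargs(common, **extra):
    # Dict unpacking keeps one source of truth for shared flags;
    # framework-specific keys are added (or overridden) afterwards.
    return {**common, **extra}

common_kwargs = {"auto_analyze": True, "fuzzy_match": False, "data_mapping": None}
pt_kwargs = merged_kwargs(common_kwargs, stack_mode=True)
ms_kwargs = merged_kwargs(common_kwargs, stack_mode=True, cell_mapping="cell.yaml")
```

Because `{**common, **extra}` builds a new dict, `common_kwargs` itself is never mutated by a per-framework call.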
if check_file_type(npu_path) == FileCheckConst.FILE and check_file_type(bench_path) == FileCheckConst.FILE: input_param["npu_json_path"] = input_param.pop("npu_path") input_param["bench_json_path"] = input_param.pop("bench_path") - input_param["stack_json_path"] = input_param.pop("stack_path") + if "stack_path" not in input_param: + logger.warning(f"Missing stack_path in the configuration file. " + f"Automatically detecting stack.json to determine whether to display NPU_Stack_Info.") + else: + input_param["stack_json_path"] = input_param.pop("stack_path") + if frame_name == Const.PT_FRAMEWORK: - kwargs = { - "data_mapping": args.data_mapping - } - compare(input_param, args.output_path, stack_mode=args.stack_mode, auto_analyze=auto_analyze, - fuzzy_match=args.fuzzy_match, **kwargs) + kwargs = {**common_kwargs, "stack_mode": args.stack_mode} + compare(input_param, args.output_path, **kwargs) else: kwargs = { + **common_kwargs, "stack_mode": args.stack_mode, - "auto_analyze": auto_analyze, - "fuzzy_match": args.fuzzy_match, "cell_mapping": args.cell_mapping, "api_mapping": args.api_mapping, - "data_mapping": args.data_mapping, "layer_mapping": args.layer_mapping } - ms_compare(input_param, args.output_path, **kwargs) elif check_file_type(npu_path) == FileCheckConst.DIR and check_file_type(bench_path) == FileCheckConst.DIR: - kwargs = {"stack_mode": args.stack_mode, "auto_analyze": auto_analyze, "fuzzy_match": args.fuzzy_match} + kwargs = { + **common_kwargs, + "stack_mode": args.stack_mode, + "is_print_compare_log": input_param.get("is_print_compare_log", True), + "cell_mapping": args.cell_mapping, + "api_mapping": args.api_mapping, + "layer_mapping": args.layer_mapping + } if input_param.get("rank_id") is not None: ms_graph_compare(input_param, args.output_path) return diff --git a/debug/accuracy_tools/msprobe/core/compare/highlight.py b/debug/accuracy_tools/msprobe/core/compare/highlight.py index 
5461c54ccd18b946d88475767633068ce601038a..1983313249f34680a8f25c3a2466d8871fe0a693 100644 --- a/debug/accuracy_tools/msprobe/core/compare/highlight.py +++ b/debug/accuracy_tools/msprobe/core/compare/highlight.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -13,17 +13,23 @@ # See the License for the specific language governing permissions and # limitations under the License. -import math import abc +import math +import multiprocessing import re from collections import namedtuple + import numpy as np import openpyxl from openpyxl.styles import PatternFill -from msprobe.core.common.utils import get_header_index +from openpyxl.utils.dataframe import dataframe_to_rows +from tqdm import tqdm + +from msprobe.core.common.const import CompareConst, Const from msprobe.core.common.file_utils import save_workbook from msprobe.core.common.log import logger -from msprobe.core.common.const import CompareConst, FileCheckConst, Const +from msprobe.core.common.utils import get_header_index, safe_get_value +from msprobe.core.compare.utils import table_value_is_valid, get_name_and_state, CompareException class HighlightCheck(abc.ABC): @@ -32,8 +38,17 @@ class HighlightCheck(abc.ABC): raise NotImplementedError +def add_highlight_row_info(color_list, num, highlight_err_msg): + for i, (existing_num, existing_err_msg) in enumerate(color_list): + if num == existing_num: + color_list[i][1].append(highlight_err_msg) + return + color_list.append((num, [highlight_err_msg])) + + class CheckOrderMagnitude(HighlightCheck): """检查Max diff的数量级差异""" + def apply(self, info, color_columns, dump_mode): api_in, api_out, num = info max_diff_index = get_header_index(CompareConst.MAX_DIFF if dump_mode == Const.SUMMARY @@ -43,11 +58,14 @@ class CheckOrderMagnitude(HighlightCheck): in_order = 0 if abs(api_in[max_diff_index]) < 1 
else math.log10(abs(api_in[max_diff_index])) out_order = 0 if abs(api_out[max_diff_index]) < 1 else math.log10(abs(api_out[max_diff_index])) if out_order - in_order >= CompareConst.ORDER_MAGNITUDE_DIFF_YELLOW: - color_columns.yellow.append(num) + add_highlight_row_info(color_columns.yellow, num, + "maximum absolute errors of both input/parameters and output exceed 1, " + "with the output larger by an order of magnitude") class CheckOneThousandErrorRatio(HighlightCheck): """检查千分误差比率""" + def apply(self, info, color_columns, dump_mode): api_in, api_out, num = info one_thousand_index = get_header_index(CompareConst.ONE_THOUSANDTH_ERR_RATIO, dump_mode) @@ -56,42 +74,55 @@ class CheckOneThousandErrorRatio(HighlightCheck): return if (api_in[one_thousand_index] > CompareConst.ONE_THOUSAND_ERROR_IN_RED and api_out[one_thousand_index] < CompareConst.ONE_THOUSAND_ERROR_OUT_RED): - color_columns.red.append(num) + add_highlight_row_info(color_columns.red, num, + "The input/parameters' one thousandth err ratio exceeds 0.9, " + "while the output's is below 0.6") elif api_in[one_thousand_index] - api_out[one_thousand_index] > CompareConst.ONE_THOUSAND_ERROR_DIFF_YELLOW: - color_columns.yellow.append(num) + add_highlight_row_info(color_columns.yellow, num, + "The output's one thousandth err ratio decreases by more than 0.1 " + "compared to the input/parameters'") class CheckCosineSimilarity(HighlightCheck): """检查余弦相似度""" + def apply(self, info, color_columns, dump_mode): api_in, api_out, num = info cosine_index = get_header_index(CompareConst.COSINE, dump_mode) if not isinstance(api_in[cosine_index], (float, int)) or not isinstance(api_out[cosine_index], (float, int)): return if api_in[cosine_index] - api_out[cosine_index] > CompareConst.COSINE_DIFF_YELLOW: - color_columns.yellow.append(num) + add_highlight_row_info(color_columns.yellow, num, + "The output's cosine decreases by more than 0.1 " + "compared to the input/parameters'") class CheckMaxRelativeDiff(HighlightCheck): 
"""检查最大相对差异""" + def apply(self, info, color_columns, dump_mode): api_in, api_out, num = info max_diff_index = get_header_index(CompareConst.MAX_DIFF, dump_mode) bench_max_index = get_header_index(CompareConst.BENCH_MAX, dump_mode) - input_max_relative_diff = np.abs(np.divide(api_in[max_diff_index], max(0.01, api_in[bench_max_index]))) - output_max_relative_diff = np.abs(np.divide(api_out[max_diff_index], max(0.01, api_out[bench_max_index]))) + input_max_relative_diff = np.abs( + np.divide(api_in[max_diff_index], max(Const.FLOAT_EPSILON, api_in[bench_max_index]))) + output_max_relative_diff = np.abs( + np.divide(api_out[max_diff_index], max(Const.FLOAT_EPSILON, api_out[bench_max_index]))) if not isinstance(input_max_relative_diff, (float, int)) or not isinstance(output_max_relative_diff, (float, int)): return if output_max_relative_diff > CompareConst.MAX_RELATIVE_OUT_RED: - color_columns.red.append(num) + add_highlight_row_info(color_columns.red, num, "maximum relative error exceeds 0.5") elif (output_max_relative_diff > CompareConst.MAX_RELATIVE_OUT_YELLOW and input_max_relative_diff < CompareConst.MAX_RELATIVE_IN_YELLOW): - color_columns.yellow.append(num) + add_highlight_row_info(color_columns.yellow, num, + "The output's maximum relative error exceeds 0.1, " + "while the input/parameters' is below 0.01") class CheckOverflow(HighlightCheck): """检查是否存在溢出""" + def apply(self, info, color_columns, dump_mode): line, num = info npu_max_index = get_header_index(CompareConst.NPU_MAX, dump_mode) @@ -100,11 +131,11 @@ class CheckOverflow(HighlightCheck): else CompareConst.MAX_ABS_ERR, dump_mode) if str(line[npu_max_index]) in CompareConst.OVERFLOW_LIST or str( line[npu_min_index]) in CompareConst.OVERFLOW_LIST: - color_columns.red.append(num) + add_highlight_row_info(color_columns.red, num, "maximum or minimum is nan, -inf, or inf") return # check if Max_Diff > 1e+10 - if isinstance(line[max_diff_index], (float, int)) and line[max_diff_index] > 
CompareConst.MAX_DIFF_RED: - color_columns.red.append(num) + if isinstance(line[max_diff_index], (float, int)) and abs(line[max_diff_index]) > CompareConst.MAX_DIFF_RED: + add_highlight_row_info(color_columns.red, num, "maximum absolute error exceeds 1e+10") class HighlightRules: @@ -115,18 +146,35 @@ class HighlightRules: } # 用于比较输入和输出的规则 + # 真实数据检查规则 compare_rules = { "check_order_magnitude": CheckOrderMagnitude(), "check_one_thousand_error": CheckOneThousandErrorRatio(), "check_cosine_similarity": CheckCosineSimilarity() } + # 统计量数据检查规则 summary_compare_rules = { "check_order_magnitude": CheckOrderMagnitude(), "check_max_relative_diff": CheckMaxRelativeDiff(), } - -def find_error_rows(result, last_len, n_num_input, highlight_dict, dump_mode): + +def check_indices_numeric(api_items, indices: list): + """检查指定索引处的值是否都为数字类型(int 或 float)""" + return all(isinstance(api_items[i], (float, int)) for i in indices) + + +def apply_comparison_rules(api_info, dump_mode, color_columns): + """output与input/params的比较""" + if dump_mode == Const.SUMMARY: + for rule in HighlightRules.summary_compare_rules.values(): + rule.apply(api_info, color_columns, dump_mode) + else: + for rule in HighlightRules.compare_rules.values(): + rule.apply(api_info, color_columns, dump_mode) + + +def find_error_rows(result, api_batch, highlight_dict, dump_mode): """找到单个API中需要高亮的行""" if dump_mode == Const.MD5: return @@ -141,123 +189,229 @@ def find_error_rows(result, last_len, n_num_input, highlight_dict, dump_mode): ColorColumns = namedtuple('ColorColumns', ['red', 'yellow']) color_columns = ColorColumns(red=red_lines, yellow=yellow_lines) + api_batch_start = api_batch.start # result_df的input起始全局索引 + api_batch_params_end_index = api_batch.params_end_index # result_df的params结束全局索引 + 1 + api_batch_output_end_index = api_batch.output_end_index # result_df的output结束全局索引 + 1 + api_batch_params_slice_index_local = api_batch_params_end_index - api_batch_start # result的params结束局部切片索引 + 
api_batch_output_slice_index_local = api_batch_output_end_index - api_batch_start # result的output结束局部切片索引 + # 对单行API的输入或输出进行误差判断 for i, line in enumerate(result): - num = last_len + i - line_info = LineInfo(line_data=line, num_pointer=num) + index = api_batch_start + i + line_info = LineInfo(line_data=line, num_pointer=index) for rule in HighlightRules.basic_rules.values(): rule.apply(line_info, color_columns, dump_mode) # 对API的输出与输入比较,进行误差判断 - for n, api_out in enumerate(result[n_num_input:len(result)]): - num = last_len + n_num_input + n - if num in red_lines: + for n, api_out in enumerate(result[api_batch_params_slice_index_local: api_batch_output_slice_index_local]): + index = api_batch_start + api_batch_params_slice_index_local + n + # 单行检查只有溢出检查(红色),如果已经溢出,不进一步检查 + if index in red_lines: continue - if not isinstance(api_out[npu_max_index], (float, int)) \ - or not isinstance(api_out[bench_max_index], (float, int)) \ - or not isinstance(api_out[max_diff_index], (float, int)): + if not check_indices_numeric(api_out, [npu_max_index, bench_max_index, max_diff_index]): continue - for _, api_in in enumerate(result[0:n_num_input]): - if not isinstance(api_in[npu_max_index], (float, int)) \ - or not isinstance(api_in[bench_max_index], (float, int)) \ - or not isinstance(api_in[max_diff_index], (float, int)): - continue - api_info = ApiInfo(api_input=api_in, api_output=api_out, num_pointer=num) - if dump_mode == Const.SUMMARY: - for rule in HighlightRules.summary_compare_rules.values(): - rule.apply(api_info, color_columns, dump_mode) - else: - for rule in HighlightRules.compare_rules.values(): - rule.apply(api_info, color_columns, dump_mode) - - highlight_dict.get('red_rows', []).extend(list(set(red_lines))) - highlight_dict.get('yellow_rows', []).extend(list(set(yellow_lines) - set(red_lines))) - - -def get_name_and_state(name): - """Get api/module name and state""" - if Const.INPUT in name: - api_name = name.split(Const.INPUT)[0] - state = Const.INPUT + # 
input/parameters的比较检查, 这里api_in包括input、parameters + for _, api_in in enumerate(result[0: api_batch_params_slice_index_local]): + if not check_indices_numeric(api_in, [npu_max_index, bench_max_index, max_diff_index]): + continue + api_info = ApiInfo(api_input=api_in, api_output=api_out, num_pointer=index) + apply_comparison_rules(api_info, dump_mode, color_columns) + + red_lines_num_set = {x[0] for x in red_lines} + yellow_lines_num_set = {x[0] for x in yellow_lines} + highlight_dict.get('red_rows', set()).update(red_lines_num_set) + highlight_dict.get('yellow_rows', set()).update(yellow_lines_num_set - red_lines_num_set) + highlight_dict.get('red_lines', []).extend(red_lines) + highlight_dict.get('yellow_lines', []).extend(yellow_lines) + + +class ApiBatch: + def __init__(self, api_name: str, start: int): + self.api_name = api_name + self.start = start + self.input_len = 1 # input的数量 + self.params_end_index = start + 1 # params的结束index + self.output_end_index = start + 1 # output的结束index + self.params_grad_end_index = start + 1 # params_grad的结束index + # 内部state的标志("input", "output", "parameters", "parameters_grad"), + # 用于控制计算input_len, output_end_index, params_end_index, self.params_grad_end_index + self._state = Const.INPUT # api_batch初始化为input + + def set_state(self, state: str): + """设置当前状态""" + if state in {Const.INPUT, Const.OUTPUT, Const.KWARGS, Const.PARAMS, Const.PARAMS_GRAD}: + self._state = state + else: + raise ValueError(f"Invalid state: {state}") + + def increment(self, state: str): + self.set_state(state) + if self._state == Const.INPUT or self._state == Const.KWARGS: + self.input_len += 1 + self.params_end_index += 1 + self.output_end_index += 1 + if self._state == Const.PARAMS: + self.params_end_index += 1 + self.output_end_index += 1 + if self._state == Const.OUTPUT: + self.output_end_index += 1 + self.params_grad_end_index += 1 + + +def api_batches_update(api_batches, api_name, state, index): + """ + 当一个api的所有item更新完后,input, output的索引范围: + input: 
[start: start+input_len] + params: [start+input_len: params_end_index] + output: [params_end_index: output_end_index] + """ + if not api_batches: + api_batches.append(ApiBatch(api_name, index)) else:
api_batch.params_grad_end_index], api_batch, highlight_dict, + dump_mode) + progress_bar.update(1) + + +def value_check(value, api_name=None, i=None, result_df_columns=None): + if not table_value_is_valid(value): + if result_df_columns: + logger.error(f"Malicious value [{value}] at api_name [{api_name}], column [{result_df_columns[i]}], " + f"is not allowed to be written into the compare result xlsx.") else: - output_num = num - find_error_rows(result[start:start + input_num + output_num], start, input_num, highlight_dict, - dump_mode) + logger.error(f"Malicious value [{value}] is not allowed to be written into the compare result xlsx.") + + +def df_malicious_value_check(df_chunk, result_df_columns): + for row in df_chunk.itertuples(index=False): + api_name = row[0] + for i, value in enumerate(row): + value_check(value, api_name, i, result_df_columns) + + +def handle_multi_process_malicious_value_check(func, result_df): + result_total_nums = len(result_df) + process_num = int((multiprocessing.cpu_count() + 1) / 2) + + if result_total_nums <= process_num: + process_num = 1 + chunks = [result_df] + else: + chunk_size = result_total_nums // process_num + chunks = [result_df.iloc[i: i + chunk_size] for i in range(0, result_total_nums, chunk_size)] + + pool = multiprocessing.Pool(process_num) + + def err_call(args): + logger.error("Multiprocessing malicious value check failed! 
Reason: {}".format(args)) + try: + pool.terminate() + except OSError: + logger.error("Pool terminate failed") + + result_df_columns = result_df.columns.tolist() + for column in result_df_columns: + value_check(column) + for df_chunk in chunks: + pool.apply_async(func, args=(df_chunk, result_df_columns,), error_callback=err_call) + + pool.close() + pool.join() + + +def compare_result_df_convert(value): + if not isinstance(value, (float, int)) or isinstance(value, bool): # bool类型或者非数字类型转str + value = f"{str(value)}\t" if str(value) in ("inf", "-inf", "nan") else str(value) + if isinstance(value, float): + value = f"{str(value)}\t" if str(value) in ("inf", "-inf", "nan") else value + return value def highlight_rows_xlsx(result_df, highlight_dict, file_path): """Write and highlight results in Excel""" - logger.info('Compare result is %s' % file_path) + + update_highlight_err_msg(result_df, highlight_dict) # add highlight err_msg wb = openpyxl.Workbook() ws = wb.active # write header - for j, col_name in enumerate(result_df.columns, start=1): - if not csv_value_is_valid(col_name): - raise RuntimeError(f"Malicious value [{col_name}] is not allowed to be written into the xlsx: {file_path}.") - ws.cell(row=1, column=j, value=col_name) - - for i, row in enumerate(result_df.iterrows(), start=2): - for j, value in enumerate(row[1], start=1): - if not isinstance(value, (float, int)): - value = f'{str(value)}\t' if str(value) in ('inf', '-inf', 'nan') else str(value) - if not csv_value_is_valid(value): - raise RuntimeError(f"Malicious value [{value}] is not allowed to be written into the xlsx: " - f"{file_path}.") - ws.cell(row=i, column=j, value=f'{str(value)}\t' if str(value) in ('inf', '-inf', 'nan') else value) - - if (i - 2) in highlight_dict['red_rows']: - ws.cell(row=i, column=j).fill = PatternFill(start_color=CompareConst.RED, - end_color=CompareConst.RED, fill_type="solid") - elif (i - 2) in highlight_dict['yellow_rows']: - ws.cell(row=i, column=j).fill = 
PatternFill(start_color=CompareConst.YELLOW, - end_color=CompareConst.YELLOW, fill_type="solid") - + logger.info('Initializing Excel file.') + + handle_multi_process_malicious_value_check(df_malicious_value_check, result_df) + + result_df_convert = result_df.applymap(compare_result_df_convert) + + for row in dataframe_to_rows(result_df_convert, index=False, header=True): + ws.append(row) + + # 对可疑数据标色 + logger.info('Coloring Excel in progress.') + col_len = len(result_df.columns) + red_fill = PatternFill( + start_color=CompareConst.RED, end_color=CompareConst.RED, fill_type="solid" + ) + yellow_fill = PatternFill( + start_color=CompareConst.YELLOW, end_color=CompareConst.YELLOW, fill_type="solid", + ) + for i in highlight_dict.get("red_rows", []): + for j in range(1, col_len + 1): + ws.cell(row=i + 2, column=j).fill = red_fill # 2因为ws.cell中的row或column需要>=1,数据从第2行开始 + for i in highlight_dict.get("yellow_rows", []): + for j in range(1, col_len + 1): + ws.cell(row=i + 2, column=j).fill = yellow_fill + + logger.info('Saving Excel file to disk: %s' % file_path) save_workbook(wb, file_path) -def csv_value_is_valid(value: str) -> bool: - if not isinstance(value, str): - return True - try: - # -1.00 or +1.00 should be consdiered as digit numbers - float(value) - except ValueError: - # otherwise, they will be considered as formular injections - return not bool(re.compile(FileCheckConst.CSV_BLACK_LIST).search(value)) - return True +def update_highlight_err_msg(result_df, highlight_dict): + if result_df.shape[1] <= 1: + return + + if CompareConst.NPU_MD5 in result_df.columns: + return + + err_msg = result_df.get(CompareConst.ERROR_MESSAGE) + red_lines_num_set = highlight_dict.get('red_rows') + + for color in ['red', 'yellow']: + line_key = f'{color}_lines' + lines = highlight_dict.get(line_key, []) + for line_index, messages in lines: + if color == 'yellow' and line_index in red_lines_num_set: + continue # 如果是 yellow 行,且已被 red 行覆盖,跳过 + + for msg in messages: + if 
err_msg[line_index] == '': + err_msg[line_index] = msg + else: + err_msg[line_index] += '\n' + msg + + if color == 'red': + red_lines_num_set.add(line_index) + + result_df[CompareConst.ERROR_MESSAGE] = err_msg diff --git a/debug/accuracy_tools/msprobe/core/compare/layer_mapping/__init__.py b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..39df7ac53bbbee256c92bea0b0e1d05864a2b970 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/__init__.py @@ -0,0 +1,19 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from msprobe.core.compare.layer_mapping.layer_mapping import ( + generate_data_mapping_by_layer_mapping, + generate_api_mapping_by_layer_mapping, +) diff --git a/debug/accuracy_tools/msprobe/core/compare/data_scope_parser.py b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/data_scope_parser.py similarity index 60% rename from debug/accuracy_tools/msprobe/core/compare/data_scope_parser.py rename to debug/accuracy_tools/msprobe/core/compare/layer_mapping/data_scope_parser.py index 107554c8397544fdf04a2dbd6324c151d2341abd..5ba5aa69a10a8aa408868697ae2982bd1349ff76 100644 --- a/debug/accuracy_tools/msprobe/core/compare/data_scope_parser.py +++ b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/data_scope_parser.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -17,13 +17,14 @@ import os import re from copy import deepcopy from dataclasses import dataclass -from typing import ClassVar, Dict, Optional, Tuple +from typing import ClassVar, Dict, List, Optional, Tuple import yaml from msprobe.core.common.const import Const from msprobe.core.common.file_utils import save_yaml from msprobe.core.common.log import logger from msprobe.core.common.utils import CompareException, add_time_with_yaml +from msprobe.core.compare.layer_mapping.postprocess_pass import postprocess_pass @dataclass @@ -36,13 +37,15 @@ class DumpDataItem: full_scope: str = "" layer_scope: str = "" stack_scope: str = "" + frame_stack_scope: str = "" + user_stack_scope: str = "" construct_scope: str = "" scope_direction: Optional[str] = None scope_id: Optional[int] = None + state: str = "" # 类变量使用 ClassVar - framework2layername: ClassVar[Dict[str, str]] = { - Const.MS_FRAMEWORK: Const.CELL, Const.PT_FRAMEWORK: Const.MODULE} + layernames: ClassVar[set] = {Const.CELL, Const.MODULE} framework2stack_sign: 
ClassVar[Dict[str, Tuple[str, str]]] = { Const.MS_FRAMEWORK: ("Template", "construct"), Const.PT_FRAMEWORK: ("Template", r"in (for|back)ward,") @@ -70,43 +73,63 @@ class DumpDataItem: data_name_list = data_name.split(Const.SEP) if not data_name_list or len(data_name_list) < abs(Const.LAYER_NAME_INDEX): logger.error( - f"The dump data does not comply with the format specification and must contain no less than four fields. The current data is {data_name}") + f"The dump data does not comply with the format specification and " + f"must contain no less than four fields. " + f"The current data is {data_name}" + ) raise CompareException(CompareException.INVALID_DATA_ERROR) + if data_name_list[Const.LAST_INDEX] == Const.PARAMS_GRAD: + self.api_type = Const.PARAMS_GRAD + self.api_name = data_name_list[Const.PARAMS_GRAD_NAME_INDEX] + self.type_name = data_name_list[Const.PARAMS_GRAD_TYPE_NAME_INDEX] + self.state = Const.PARAMS_GRAD + return + self.api_type = data_name_list[Const.API_TYPE_INDEX] self.type_name = data_name_list[Const.TYPE_NAME_INDEX] - if self.api_type == self.framework2layername.get(self.framework): + if self.api_type in self.layernames: self.api_name = data_name_list[Const.LAYER_NAME_INDEX] + self.state = data_name_list[Const.SCOPE_DIRECTION_INDEX] else: self.api_name = self.type_name + self.state = data_name_list[Const.LAST_INDEX] def set_layer_scope(self, construct_info: str) -> None: self.construct_scope = construct_info - if not construct_info: - self.layer_scope = self.framework2layername.get(self.framework) - return - construct_info_list = construct_info.split(Const.SEP) - if len(construct_info_list) < abs(Const.LAYER_NAME_INDEX): - logger.error( - f"The construct data does not comply with the format specification and must contain no less than four fields. 
The current data is {construct_info}") - raise CompareException(CompareException.INVALID_DATA_ERROR) - if self.api_type == self.framework2layername.get(self.framework): + if self.api_type in self.layernames: # remove api name data_list = self.data_name.split(Const.SEP) data_list = data_list[:Const.LAYER_NAME_INDEX] + data_list[Const.TYPE_NAME_INDEX:] + elif self.api_type == Const.PARAMS_GRAD: + data_list = self.data_name.split(Const.SEP) + elif construct_info: + data_list = construct_info.split(Const.SEP) else: - data_list = construct_info_list - self.layer_scope = Const.SEP.join(data_list[:Const.TYPE_NAME_INDEX]) - self.scope_id = data_list[Const.SCOPE_ID_INDEX] - self.scope_direction = construct_info_list[Const.SCOPE_DIRECTION_INDEX] + data_list = [] + + if data_list: + self.layer_scope = Const.SEP.join(data_list[:Const.TYPE_NAME_INDEX]) + else: + self.layer_scope = Const.TOP_LAYER + if construct_info and Const.SEP in construct_info: + construct_list = construct_info.split(Const.SEP) + if len(construct_list) < abs(Const.LAYER_NAME_INDEX): + logger.error( + f"The construct data does not comply with the format specification and " + f"must contain no less than four fields. 
" + f"The current data is {construct_info}" + ) + raise CompareException(CompareException.INVALID_DATA_ERROR) + self.scope_id = construct_list[Const.SCOPE_ID_INDEX] + self.scope_direction = construct_list[Const.SCOPE_DIRECTION_INDEX] def set_stack_scope(self, stack_info: str) -> None: # Cell/Module has no stack info - if self.api_type == self.framework2layername.get(self.framework): + if self.api_type in self.layernames: return if self.api_type in Const.DATA_TYPE_SKIP_LIST or not stack_info: - self.stack_scope = self.api_name return start_sign, end_sign = self.framework2stack_sign.get(self.framework) @@ -114,13 +137,16 @@ class DumpDataItem: start_pos, end_pos = find_regard_scope(stack_info, start_sign, end_sign) # 获取指定范围的代码 regard_scope = stack_info[start_pos + 1:end_pos] - func_stack_list = find_stack_func_list(regard_scope) - self.stack_scope = Const.SEP.join(func_stack_list) + frame_func_stack_list, user_func_stack_list = find_stack_func_list(regard_scope) + self.frame_stack_scope = Const.SEP.join(frame_func_stack_list) + self.user_stack_scope = Const.SEP.join(user_func_stack_list) - def set_full_scope(self, use_stack_scope=False) -> None: + def set_full_scope(self, use_user_func_scope=False, use_frame_func_scope=True) -> None: scope_list = [self.layer_scope] - if use_stack_scope: - scope_list.append(self.stack_scope) + if use_user_func_scope and self.user_stack_scope: + scope_list.append(self.user_stack_scope) + if use_frame_func_scope and self.frame_stack_scope: + scope_list.append(self.frame_stack_scope) scope_list.append(self.api_name) self.full_scope = Const.SEP.join(scope_list) @@ -138,11 +164,41 @@ def find_regard_scope(lines, start_sign, end_sign): return start_pos, end_pos -def find_stack_func_list(lines): +def find_stack_func_list(lines, record_user=True): res_list = [] - # 过滤和处理 regard_scope + user_stack = [] + frame_stack = None + no_entrance = True for line in lines: - ele_list = line.split(',') + ele_list = line.split(Const.COMMA) + file_ele = 
ele_list[Const.STACK_FILE_INDEX] + # if framework func line and no framework entrance found yet + if any(ii in file_ele for ii in Const.FRAME_FILE_LIST) and no_entrance: + frame_stack = line # Update the last target index + else: + if record_user: + user_stack.append(line) + no_entrance = False + + # Check if the last string in the list contains target str + if frame_stack and no_entrance: + no_entrance = False + + # 过滤和处理 regard_scope + frame_func = get_stack_in_lines([frame_stack]) + user_func = get_stack_in_lines(user_stack) + return (frame_func, user_func) + + +def get_stack_in_lines(simplified: List[str]): + res_list = [] + if not simplified: + return res_list + for line in simplified: + if not line: + continue + + ele_list = line.split(Const.COMMA) file_ele = ele_list[Const.STACK_FILE_INDEX] if any(ii in file_ele for ii in Const.FILE_SKIP_LIST): continue @@ -154,6 +210,7 @@ def find_stack_func_list(lines): in_func_name = func_ele.split()[Const.STACK_FUNC_ELE_INDEX] res_list.append(in_func_name) + reversed_list = res_list[::-1] return reversed_list @@ -179,16 +236,7 @@ def get_dump_data_items(dump, stack, construct, framework, output_path=None): name2item[data_name] = data_item data_items.append(data_item) - # 处理反向数据,反向无栈信息,沿用正向数据栈信息 - for data_item in data_items: - data_name = data_item.data_name - if Const.BACKWARD in data_name: - forward_name = data_name.replace(Const.BACKWARD, Const.FORWARD) - forward_item = name2item.get(forward_name, None) - if not forward_item: - continue - data_item.stack_scope = forward_item.stack_scope - data_item.full_scope = forward_item.full_scope + postprocess_pass(data_items, name2item) if output_path: yaml.add_representer(DumpDataItem, dumpdata_representer) diff --git a/debug/accuracy_tools/msprobe/core/compare/layer_mapping.py b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/layer_mapping.py similarity index 68% rename from debug/accuracy_tools/msprobe/core/compare/layer_mapping.py rename to 
debug/accuracy_tools/msprobe/core/compare/layer_mapping/layer_mapping.py index a985621fab26712764306c103cc8edcdbd9bd798..d0f19462ee1ccf4d72c69885c18174cec32df056 100644 --- a/debug/accuracy_tools/msprobe/core/compare/layer_mapping.py +++ b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/layer_mapping.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,24 +14,28 @@ # limitations under the License. import os +from collections import defaultdict -from msprobe.core.common.const import Const -from msprobe.core.common.log import logger +from msprobe.core.common.const import CompareConst, Const from msprobe.core.common.file_utils import load_json, load_yaml, save_yaml -from msprobe.core.common.utils import (add_time_with_yaml, detect_framework_by_dump_json, - get_stack_construct_by_dump_json_path, CompareException) -from msprobe.core.compare.data_scope_parser import get_dump_data_items +from msprobe.core.common.utils import (add_time_with_yaml, + detect_framework_by_dump_json, + get_stack_construct_by_dump_json_path) +from msprobe.core.compare.layer_mapping.data_scope_parser import get_dump_data_items +from msprobe.core.compare.utils import read_op, reorder_op_name_list + class LayerTrie: def __init__(self, type_name, framework=None): self.type_name = type_name - self.data_items = [] + self.data_items = defaultdict(list) self.children = {} self.framework = framework def __repr__(self): - return f"Layer(type_name={self.type_name}, data_number={len(self.data_items)})" + data_nums = [{k: len(v)} for k, v in self.data_items.items()] + return f"Layer(type_name={self.type_name}, data_number={data_nums})" def get(self, name): return self.children.get(name) @@ -45,10 +49,10 @@ class LayerTrie: if name not in node.children: node.children[name] = LayerTrie(name, data_item.framework) node = 
node.children[name] - node.data_items.append(data_item) + node.data_items[data_item.state].append(data_item) node.type_name = data_item.type_name - def query_data(self, scope, index, default_value=None): + def query_data(self, scope, state, index, default_value=None): parts = scope.split(Const.SEP) node = self scope_name_list = parts[1:] @@ -57,9 +61,9 @@ class LayerTrie: if name not in node.children: return default_value node = node.children[name] - if index >= len(node.data_items): + if index >= len(node.data_items[state]): return default_value - return node.data_items[index] + return node.data_items[state][index] def save_to_yaml(self, output_path): result = {f"{self.type_name} @ {self}": self.convert_to_dict(self)} @@ -69,7 +73,7 @@ class LayerTrie: def convert_to_dict(self, node): result = {} - result["data_item"] = [node.data_name for node in node.data_items] + result["data_item"] = {st: [dt.data_name for dt in dts] for st, dts in node.data_items.items()} for child_key, child_node in node.children.items(): key = f"{child_key} @ {child_node}" result[key] = self.convert_to_dict(child_node) @@ -101,10 +105,11 @@ def convert_scope(layer_trie, data_item, mapping=None): cur_node = child_node idx += 1 index = -1 - for idx, child in enumerate(cur_node.data_items): + state = data_item.state + for idx, child in enumerate(cur_node.data_items[state]): if data_item.data_name == child.data_name: index = idx - return new_scope, index + return new_scope, state, index def get_data_items_and_tree(dump_json_path, output_path): @@ -121,8 +126,8 @@ def get_data_items_and_tree(dump_json_path, output_path): def convert_data_item(npu_tree, bench_tree, npu_data_item, mapping): - new_scope, index = convert_scope(npu_tree, npu_data_item, mapping) - bench_data_item = bench_tree.query_data(new_scope, index) + new_scope, state, index = convert_scope(npu_tree, npu_data_item, mapping) + bench_data_item = bench_tree.query_data(new_scope, state, index) return bench_data_item @@ -175,7 +180,7 
@@ def convert_data_items(npu_tree, bench_tree, npu_data_items, mapping): api_mapping = {} for npu_data_item in npu_data_items: bench_data_item = convert_data_item(npu_tree, bench_tree, npu_data_item, mapping) - bench_name = bench_data_item.data_name if bench_data_item else "" + bench_name = bench_data_item.data_name if bench_data_item else CompareConst.N_A npu_name = npu_data_item.data_name api_mapping[npu_name] = bench_name return api_mapping @@ -196,61 +201,37 @@ def generate_api_mapping_by_layer_mapping(npu_json_path, bench_json_path, layer_ return api_mapping -def generate_index_set(item, prefix="", depth=0, max_depth=10): - if depth > max_depth: - logger.error("parse index exceeds the recursion limit.") - raise CompareException(CompareException.RECURSION_LIMIT_ERROR) - result = set() - if isinstance(item, list): - for idx, value in enumerate(item): - pre = f"{prefix}.{idx}" if prefix else str(idx) - result.update(generate_index_set(value, pre, depth+1, max_depth)) - elif prefix: - result.add(prefix) - return result - - -def generate_file_mapping(npu_json_path, bench_json_path, api_mapping, output_path=None): - def get_input(data): - input_list = data.get(Const.INPUT_ARGS) - if not input_list: - input_list = data.get(Const.INPUT) - return input_list - - def generate_input_output_index_set(data, name): - data_item = data.get(name) - inputs = get_input(data_item) - outputs = data_item.get(Const.OUTPUT) - input_index_set = generate_index_set(inputs) - output_index_set = generate_index_set(outputs) - return input_index_set, output_index_set - - def get_common_index_list(npu_index_set, bench_index_set): - common_index = npu_index_set & bench_index_set - res = sorted(common_index, key=lambda x: [int(i) for i in x.split(Const.SEP)]) - return res - - def combine_data_name_and_index(npu_name, bench_name, index_list, input_output): - res = {} - for index in index_list: - k = Const.SEP.join([npu_name, input_output, index]) - v = Const.SEP.join([bench_name, input_output, 
index]) - res[k] = v - return res +def generate_data_mapping(npu_json_path, bench_json_path, api_mapping, output_path=None): + def read_full_op_names(data, op_name): + op_parsed_list = read_op(data.get(op_name, {}), op_name) + full_op_names = [op_parsed.get('full_op_name') for op_parsed in op_parsed_list] + return full_op_names + + def generate_op_data_mapping(npu_op_name, npu_full_op_names, bench_op_name, bench_full_op_names): + suffix_to_full_op_name = {} + op_data_mapping = {} + for bench_full_op_name in bench_full_op_names: + suffix = bench_full_op_name[len(bench_op_name):] + suffix_to_full_op_name[suffix] = bench_full_op_name + + for npu_full_op_name in npu_full_op_names: + suffix = npu_full_op_name[len(npu_op_name):] + op_data_mapping[npu_full_op_name] = suffix_to_full_op_name.get(suffix, CompareConst.N_A) + return op_data_mapping npu_data = load_json(npu_json_path).get("data", {}) bench_data = load_json(bench_json_path).get("data", {}) data_mapping = {} - for npu_name, bench_name in api_mapping.items(): - if not bench_name or not npu_name: + for npu_op_name, bench_op_name in api_mapping.items(): + if not npu_op_name: continue - npu_input_index_set, npu_output_index_set = generate_input_output_index_set(npu_data, npu_name) - bench_input_index_set, bench_output_index_set = generate_input_output_index_set(bench_data, bench_name) - common_input_index_list = get_common_index_list(npu_input_index_set, bench_input_index_set) - common_output_index_list = get_common_index_list(npu_output_index_set, bench_output_index_set) - - data_mapping.update(combine_data_name_and_index(npu_name, bench_name, common_input_index_list, Const.INPUT)) - data_mapping.update(combine_data_name_and_index(npu_name, bench_name, common_output_index_list, Const.OUTPUT)) + npu_full_op_names = read_full_op_names(npu_data, npu_op_name) + bench_full_op_names = read_full_op_names(bench_data, bench_op_name) + npu_full_op_names_reorder = reorder_op_name_list(npu_full_op_names) + 
bench_full_op_names_reorder = reorder_op_name_list(bench_full_op_names) + mapping = generate_op_data_mapping(npu_op_name, npu_full_op_names_reorder, + bench_op_name, bench_full_op_names_reorder) + data_mapping.update(mapping) if output_path: file_name = add_time_with_yaml("data_mapping") file_path = os.path.join(os.path.realpath(output_path), file_name) @@ -263,6 +244,6 @@ def generate_data_mapping_by_layer_mapping(input_param, layer_mapping_path=None, bench_json_path = input_param.get("bench_json_path") api_mapping = generate_api_mapping_by_layer_mapping( npu_json_path, bench_json_path, layer_mapping_path) - data_mapping = generate_file_mapping( + data_mapping = generate_data_mapping( npu_json_path, bench_json_path, api_mapping, output_path) return data_mapping diff --git a/debug/accuracy_tools/msprobe/core/compare/layer_mapping/postprocess_pass.py b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/postprocess_pass.py new file mode 100644 index 0000000000000000000000000000000000000000..2946b86122d6619338a5d9ec057bf3ba96c5ac75 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/compare/layer_mapping/postprocess_pass.py @@ -0,0 +1,95 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import re
+import math
+
+from msprobe.core.common.const import Const
+
+
+def postprocess_pass(data_items, name2item):
+    backward_pass(data_items, name2item)
+    renumber_index_pass(data_items, "ParallelTransformer", "layers")
+
+
+def backward_pass(data_items, name2item):
+    # Backward data carries no stack info, so reuse the stack info of the matching forward data
+    for data_item in data_items:
+        data_name_list = data_item.data_name.split(Const.SEP)
+        if not data_name_list:
+            continue
+        if Const.BACKWARD in data_name_list[Const.SCOPE_DIRECTION_INDEX:]:
+            data_name_list[Const.SCOPE_DIRECTION_INDEX:] = [
+                s.replace(Const.BACKWARD, Const.FORWARD)
+                for s in data_name_list[Const.SCOPE_DIRECTION_INDEX:]
+            ]
+            forward_name = Const.SEP.join(data_name_list)
+            forward_item = name2item.get(forward_name, None)
+            if not forward_item:
+                continue
+            data_item.stack_scope = forward_item.stack_scope
+            data_item.full_scope = forward_item.full_scope
+            data_item.layer_scope = forward_item.layer_scope
+
+
+def extract_next_item_last_number(data, prefix, default_result=None):
+    result = default_result
+    match = re.search(rf"^{re.escape(prefix)}\.(\S+?)(?:\.|$)", data)
+    if match:
+        next_item = match.group(1)
+        numbers = re.findall(r"\d+", next_item)
+        if numbers:
+            result = int(numbers[-1])
+    return result
+
+
+def replace_next_item_index(full_scope, prefix, index):
+    if math.isinf(index):
+        return full_scope
+    prefix_pattern = rf"^{re.escape(prefix)}\."
+    result = full_scope
+    match = re.search(rf"{prefix_pattern}(\S+?)(?:\.|$)", full_scope)
+    if match:
+        next_item = match.group(1)
+        pattern = rf"{prefix_pattern}{re.escape(next_item)}"
+        result = re.sub(pattern, f"{prefix}.{index}", full_scope, count=1)
+    return result
+
+
+def renumber_index_pass(data_items, type_name, suffix=None):
+    """
+    This function resolves index-numbering mismatches when comparing parallel-partitioned models, e.g. the PP-partitioned
+    ParallelTransformer scenario in MindSpore, where the member indices of layers are global in MindSpore but local in PyTorch.
+    To handle this, the indices of the specified layer are renumbered so that they stay aligned in later processing stages.
+    """
+    prefix_dict = {}  # maps each prefix of type type_name to its minimum index
+    for data_item in data_items:
+        if data_item.type_name == type_name:
+            prefix = f"{data_item.full_scope}.{suffix}" if suffix else data_item.layer_scope
+            prefix_dict[prefix] = math.inf
+
+    # compute the minimum index for each prefix
+    for prefix in prefix_dict:
+        for data_item in data_items:
+            res = extract_next_item_last_number(data_item.full_scope, prefix, math.inf)
+            prefix_dict[prefix] = min(prefix_dict[prefix], res)
+
+    # renumber relative to the minimum index
+    for prefix, min_index in prefix_dict.items():
+        for data_item in data_items:
+            full_scope = data_item.full_scope
+            abs_index = extract_next_item_last_number(data_item.full_scope, prefix, math.inf)
+            rel_index = abs_index - min_index
+            full_scope = replace_next_item_index(full_scope, prefix, rel_index)
+            data_item.full_scope = full_scope
diff --git a/debug/accuracy_tools/msprobe/core/compare/merge_result/merge_result.py b/debug/accuracy_tools/msprobe/core/compare/merge_result/merge_result.py
new file mode 100644
index 0000000000000000000000000000000000000000..b605bd59fca0b2b3a510a7a686caa94383488bd2
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/core/compare/merge_result/merge_result.py
@@ -0,0 +1,381 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import re
+import multiprocessing
+from functools import partial
+
+import pandas as pd
+from tqdm import tqdm
+
+from msprobe.core.common.file_utils import load_yaml, logger, FileChecker, save_excel, read_xlsx, create_directory
+from msprobe.core.common.const import FileCheckConst, Const, CompareConst
+from msprobe.core.common.utils import CompareException, add_time_with_xlsx
+from msprobe.core.compare.utils import table_value_is_valid
+from msprobe.core.compare.merge_result.utils import replace_compare_index_dict, check_config
+
+
+def check_compare_result_name(file_name):
+    """
+    check whether the compare result name is as expected
+    """
+    single_rank_pattern = r"^compare_result_rank-rank_\d{14}\.xlsx$"
+    multi_ranks_pattern = r"^compare_result_rank(\d+)-rank\1_\d{14}\.xlsx$"
+    if re.match(multi_ranks_pattern, file_name):
+        return True
+    if re.match(single_rank_pattern, file_name):
+        logger.warning("A single-rank compare result does not need to be merged.")
+        return False
+    logger.error(f"Wrong compare result name: {file_name}, please check!")
+    raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR)
+
+
+def reorder_path(compare_result_path_list):
+    """
+    reorder compare results by rank num
+    """
+    rank_pattern = r"compare_result_rank(\d+)-rank"
+    reorder_path_list = sorted(
+        compare_result_path_list,
+        key=lambda path: int(re.search(rank_pattern, os.path.basename(path)).group(1))
+    )
+    return reorder_path_list
+
+
+def get_result_path(input_dir):
+    """
+    get rank ordered compare result file path list
+    """
+    compare_result_path_list = [os.path.join(input_dir, f)
+                                for f in os.listdir(input_dir) if f.endswith(FileCheckConst.XLSX_SUFFIX)]
+    filt_compare_result_path_list = []
+    for file_path in compare_result_path_list:
+        file_name = os.path.basename(file_path)
+        if check_compare_result_name(file_name):
+            compare_result_path_checker = FileChecker(file_path, FileCheckConst.FILE, FileCheckConst.READ_ABLE)
+            compare_result_path = compare_result_path_checker.common_check()
+            filt_compare_result_path_list.append(compare_result_path)
+
+    filt_compare_result_path_list = reorder_path(filt_compare_result_path_list)  # reorder multi-rank compare results by rank number
+
+    if len(filt_compare_result_path_list) < 2:
+        logger.warning("Number of compare results is no more than 1, no need to merge.")  # single-rank results need no merging, exit directly
+        raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR)
+    return filt_compare_result_path_list
+
+
+def get_dump_mode(result_df, rank_num):
+
+    """
+    get dump mode from header of first compare result table
+    """
+    header = result_df.columns.tolist()
+    if header in [CompareConst.COMPARE_RESULT_HEADER + [CompareConst.DATA_NAME],
+                  CompareConst.COMPARE_RESULT_HEADER_STACK + [CompareConst.DATA_NAME]]:
+        return Const.ALL
+    elif header in [CompareConst.SUMMARY_COMPARE_RESULT_HEADER, CompareConst.SUMMARY_COMPARE_RESULT_HEADER_STACK]:
+        return Const.SUMMARY
+    elif header in [CompareConst.MD5_COMPARE_RESULT_HEADER, CompareConst.MD5_COMPARE_RESULT_HEADER_STACK]:
+        return Const.MD5
+    else:
+        logger.warning(f"A valid dump task can not be identified from rank{rank_num} compare result, please check! 
" + f"The compare result will not be shown in merged result.") + return "" + + +def check_index_dump_mode_consistent(dump_mode, rank_num): + """ + check compare index to merge is consistent with dump mode + if compare_index_list is None, return all compare_indexes of dump mode + """ + if dump_mode == Const.MD5: + logger.warning(f"Rank{rank_num} compare result is 'md5' dump task and does not support merging result, please " + f"check! The compare result will not be shown in merged result.") + return [] + + dump_mode_compare_index_map = { + Const.ALL: CompareConst.ALL_COMPARE_INDEX, + Const.SUMMARY: CompareConst.SUMMARY_COMPARE_INDEX + } + valid_compare_index = dump_mode_compare_index_map.get(dump_mode) + + share_list = list(share_compare_index_list) + + # 如果传入的compare_index_list为空,则比对指标为dump_mode对应的全部比对指标 + if not share_list: + share_compare_index_list.extend(valid_compare_index) + return list(share_compare_index_list) + if set(share_list).issubset(valid_compare_index): + return share_list + else: + invalid_compare_index = set(valid_compare_index) - set(share_list) + logger.warning(f"Compare indexes in rank{rank_num} compare result are not consistent with " + f"those in other compare results, please check!") + logger.warning(f"The compare result will not be shown in merged result.") + logger.warning(f"The invalid compare indexes: {invalid_compare_index}") + return [] + + +def extract_api_full_name(api_list, result_df, rank_num): + """ + find api full name from compare result according to api list + """ + api_full_name_list = [] + for api in api_list: + api_pat = api + Const.SEP + escaped_api_pat = api_pat.replace('.', r'\.') + single_api_full_name_list = result_df.loc[ + result_df[CompareConst.NPU_NAME].str.contains(escaped_api_pat, na=False), CompareConst.NPU_NAME].tolist() + if len(single_api_full_name_list) == 0: + logger.warning(f"{api} not found in rank{rank_num} compare result.") + continue + api_full_name_list.extend(single_api_full_name_list) + return 
api_full_name_list + + +def search_api_index_result(api_list, compare_index_list, result_df, rank_num, compare_index_dict): + """ + parsing single rank compare result into the intermediate target dict + { + compare_index1: { + api_full_name1:{ + rank1: value, + }, + api_full_name2, + ... + }, + compare_index2: {}, + ... + } + """ + api_full_name_list = extract_api_full_name(api_list, result_df, rank_num) + for compare_index in compare_index_list: + api_index_dict = {} + for api_full_name in api_full_name_list: + table_value_check(api_full_name) + row_num = result_df.index[result_df[CompareConst.NPU_NAME] == api_full_name].tolist()[0] + index_value = result_df.loc[row_num, compare_index] + table_value_check(index_value) + api_index_dict.setdefault(api_full_name, {})[rank_num] = index_value # update api_index_dict + compare_index_dict[compare_index] = api_index_dict + + compare_index_dict = replace_compare_index_dict(compare_index_dict, compare_index_list, rank_num) + return compare_index_dict + + +def table_value_check(value): + if not table_value_is_valid(value): + raise RuntimeError( + f"Malicious value [{value}] is not allowed to be written into the merged xlsx.") + + +def result_process(compare_result_path_list, api_list): + """ + process compare results into target intermediate dict list + """ + compare_index_dict_list = [] + rank_num_list = [] + compare_index_list = [] + + for compare_result_path in compare_result_path_list: + compare_index_dict = {} + result_df = read_xlsx(compare_result_path) + + rank_pattern = r"compare_result_rank(\d+)-rank" + rank_num = int(re.search(rank_pattern, os.path.basename(compare_result_path)).group(1)) + logger.info(f"Parsing rank{rank_num} compare result...") + if not result_df.empty: + dump_mode = get_dump_mode(result_df, rank_num) + if dump_mode == "": + return [], [], [] + # 因为compare_index是指定的,固定不变,所以一旦compare_index是确定的,dump_mode也是确定的, + # 所以只要校验compare_index和dump_mode一致性就能保证所有rank的结果都是dump_mode一致的 + compare_index_list = 
check_index_dump_mode_consistent(dump_mode, rank_num) + if len(compare_index_list) == 0: + return [], [], [] + compare_index_list.extend([CompareConst.NPU_MAX, CompareConst.BENCH_MAX]) + compare_index_dict = search_api_index_result(api_list, compare_index_list, + result_df, rank_num, compare_index_dict) + compare_index_dict_list.append(compare_index_dict) + rank_num_list.append(rank_num) + compare_index_list.pop() + compare_index_list.pop() + else: + logger.warning(f"Rank{rank_num} compare result is empty and will not shown in merged result.") + + return compare_index_dict_list, rank_num_list, compare_index_list + + +def handle_multi_process(func, func_args, lock): + compare_result_path_list, api_list = func_args + + result_num = len(compare_result_path_list) + process_num = int((multiprocessing.cpu_count() + 1) / 2) + if result_num <= process_num: + process_num = result_num + chunks = [[compare_result_path] for compare_result_path in compare_result_path_list] + else: + chunk_size = result_num // process_num + chunks = [compare_result_path_list[i:i + chunk_size] for i in range(0, result_num, chunk_size)] + + pool = multiprocessing.Pool(process_num) + + def err_call(args): + logger.error('Multiprocess merge result failed! 
Reason: {}'.format(args))
+        try:
+            pool.terminate()
+        except OSError:
+            logger.error("Pool terminate failed")
+
+    progress_bar = tqdm(total=result_num, desc="Compare Result Parsing Process", unit="num", ncols=100)
+
+    def update_progress(size, progress_lock, extra_param=None):
+        with progress_lock:
+            progress_bar.update(size)
+
+    results = []
+    for chunk in chunks:
+        chunk_size = len(chunk)
+        result = pool.apply_async(func,  # pool.apply_async returns an ApplyResult immediately, so results keeps submission order
+                                  args=(chunk, api_list),
+                                  error_callback=err_call,
+                                  callback=partial(update_progress, chunk_size, lock)
+                                  )
+        results.append(result)
+
+    all_compare_index_dict_list = []
+    all_rank_num_list = []
+    all_compare_index_list_list = []
+    for result in results:
+        compare_index_dict, rank_num_list, compare_index_list = result.get()
+        all_compare_index_dict_list.append(compare_index_dict)
+        all_rank_num_list.append(rank_num_list)
+        all_compare_index_list_list.append(compare_index_list)
+
+    pool.close()
+    pool.join()
+
+    if not any(all_compare_index_dict_list):
+        logger.warning("Nothing to merge.")
+        raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR)
+
+    return all_compare_index_dict_list, all_rank_num_list, all_compare_index_list_list
+
+
+def generate_result_df(api_index_dict, header):
+    """
+    Generates a DataFrame from the given api_index_dict and header.
+    api_index_dict:
+    {
+        api_full_name1: {
+            rank1: value,
+        },
+        api_full_name2: {
+            rank1: value
+        },
+        ...
+    }
+    """
+    result = []
+    for api_full_name, rank_value_dict in api_index_dict.items():
+        result_item = [api_full_name]
+        result_item.extend(rank_value_dict.values())
+        result.append(result_item)
+    return pd.DataFrame(result, columns=header, dtype="object")
+
+
+def generate_merge_result(all_compare_index_dict_list, all_rank_num_list, all_compare_index_list_list, output_dir):
+    """
+    generate merge result from the intermediate dict.
+    one compare index, one sheet
+    """
+    file_name = add_time_with_xlsx("multi_ranks_compare_merge")
+    output_path = os.path.join(output_dir, file_name)
+
+    compare_index_list = None
+    for item in all_compare_index_list_list:
+        if item:
+            compare_index_list = item
+            break
+    if not compare_index_list:
+        logger.error("No compare index recognized, please check!")
+        raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR)
+
+    all_result_df_list = []
+    for compare_index_dict_list, rank_num_list in zip(all_compare_index_dict_list, all_rank_num_list):
+        for compare_index_dict, rank_num in zip(compare_index_dict_list, rank_num_list):
+            header = [CompareConst.NPU_NAME, "rank" + str(rank_num)]
+            result_df_list = []
+            for _, api_index_dict in compare_index_dict.items():
+                result_df = generate_result_df(api_index_dict, header)
+                result_df_list.append(result_df)
+            all_result_df_list.append(result_df_list)
+
+    merge_df_list = df_merge(all_result_df_list)
+    final_result_df_list = []
+    for i, df in enumerate(merge_df_list):
+        # each df in merge_df_list corresponds one-to-one to a compare_index in compare_index_list
+        final_result_df_list.append((df, compare_index_list[i]))
+    save_excel(output_path, final_result_df_list)
+    logger.info(f"The compare results of the multiple ranks are merged and saved in: {output_path}.")
+
+
+def df_merge(all_result_df_list):
+    """
+    merge different rank result_df
+    """
+    if len(all_result_df_list) == 0:
+        logger.warning("Nothing to merge.")
+        raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR)
+    if len(all_result_df_list) == 1:
+        logger.info("Only one compare result provides merge data.")
+    merge_df_base = all_result_df_list[0]
+    for sublist in all_result_df_list[1:]:
+        for i, sub_df in enumerate(sublist):
+            merge_df_base[i] = pd.merge(merge_df_base[i], sub_df, on=CompareConst.NPU_NAME, how='outer')
+    for i, value in enumerate(merge_df_base):
+        merge_df_base[i] = value.reindex(
+            columns=[CompareConst.NPU_NAME] + [col for col in value.columns if col != CompareConst.NPU_NAME])
+    return merge_df_base
+
+
+share_compare_index_list = []
+
+
+def initialize_compare_index(config):
+    global share_compare_index_list
+    manager = multiprocessing.Manager()
+    share_compare_index_list = manager.list(config.get("compare_index", []))  # create a shared global list
+
+
+def merge_result(input_dir, output_dir, config_path):
+    input_dir = FileChecker(input_dir, FileCheckConst.DIR, FileCheckConst.READ_ABLE).common_check()
+    create_directory(output_dir)
+
+    compare_result_path_list = get_result_path(input_dir)  # full paths of all compare result files under input_dir; exits with a prompt if fewer than 2
+
+    config = load_yaml(config_path)
+    config = check_config(config)
+    api_list = config.get('api')
+
+    # initialize the shared global variable share_compare_index_list
+    initialize_compare_index(config)
+
+    func_args = (compare_result_path_list, api_list)
+    all_compare_index_dict_list, all_rank_num_list, all_compare_index_list_list = (
+        handle_multi_process(result_process, func_args, multiprocessing.Manager().RLock()))
+
+    generate_merge_result(all_compare_index_dict_list, all_rank_num_list, all_compare_index_list_list, output_dir)
diff --git a/debug/accuracy_tools/msprobe/core/compare/merge_result/merge_result_cli.py b/debug/accuracy_tools/msprobe/core/compare/merge_result/merge_result_cli.py
new file mode 100644
index 0000000000000000000000000000000000000000..46b24f796dbca497d197ce1d5bf670b9745ab7cd
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/core/compare/merge_result/merge_result_cli.py
@@ -0,0 +1,31 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from msprobe.core.compare.merge_result.merge_result import merge_result
+
+
+def _merge_result_parser(parser):
+    parser.add_argument("-i", "--input_dir", dest="input_dir", type=str,
+                        help="The compare result path, a dir.", required=True)
+    parser.add_argument("-o", "--output_dir", dest="output_dir", type=str,
+                        help="The output path of the merged result, a dir.", required=True)
+    parser.add_argument("-config", "--config-path", dest="config_path", type=str,
+                        help="YAML path containing the distributed APIs and compare indexes for merging data "
+                             "from compare results.",
+                        required=True)
+
+
+def merge_result_cli(args):
+    merge_result(args.input_dir, args.output_dir, args.config_path)
diff --git a/debug/accuracy_tools/msprobe/core/compare/merge_result/utils.py b/debug/accuracy_tools/msprobe/core/compare/merge_result/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..ce563e9682088b28e2100a3851588632a9bb4b3a
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/core/compare/merge_result/utils.py
@@ -0,0 +1,81 @@
+# Copyright (c) 2025-2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from msprobe.core.common.const import CompareConst
+from msprobe.core.common.file_utils import logger
+from msprobe.core.common.utils import CompareException
+
+
+def replace_compare_index_dict(compare_index_dict, compare_index_list, rank_num):
+    """
+    If a compare index value is N/A, unsupported or Nan, replace it with NPU max and Bench max (these statistics are identical)
+
+    Example:
+        The compare index value of Distributed.all_reduce.0.forward.output.group is N/A
+        After replacement:
+        the compare index value becomes:
+            NPU: tp-0-1-2-3
+            Bench: tp-0-1-2-3
+    """
+
+    if CompareConst.NPU_MAX not in compare_index_dict or CompareConst.BENCH_MAX not in compare_index_dict:
+        compare_index_dict.pop(CompareConst.NPU_MAX, None)
+        compare_index_dict.pop(CompareConst.BENCH_MAX, None)
+        return compare_index_dict
+
+    # iterate over the compare index list, excluding the last two indexes: NPU max, Bench max
+    for compare_index in compare_index_list[:-2]:
+        op_name_index_dict = compare_index_dict[compare_index]
+        # iterate over op_item names and their compare index values
+        for op_name, index_value in op_name_index_dict.items():
+            npu_max = compare_index_dict[CompareConst.NPU_MAX][op_name][rank_num]
+            bench_max = compare_index_dict[CompareConst.BENCH_MAX][op_name][rank_num]
+            # if the current compare index value is N/A, unsupported or Nan, replace it with the NPU/Bench max values
+            if index_value[rank_num] in [CompareConst.N_A, CompareConst.UNSUPPORTED, CompareConst.NAN]:
+                compare_index_dict[compare_index][op_name][rank_num] = f'NPU:{str(npu_max)} Bench:{str(bench_max)}'
+
+    # remove NPU_MAX and BENCH_MAX
+    compare_index_dict.pop(CompareConst.NPU_MAX, None)
+    compare_index_dict.pop(CompareConst.BENCH_MAX, None)
+    return compare_index_dict
+
+
+def check_config(config):
+    """
+    Validate the contents of config.yaml
+    Args: config
+    Returns: config
+    """
+    if not config:
+        logger.error('config.yaml is empty, please check.')
+        raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR)
+
+    api_list = config.get('api')
+    if not api_list:
+        logger.error('The APIs required to merge data were not found.')
+        raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR)
+    if not isinstance(api_list, list):
+        logger.error("The config format of 'api' is 
incorrect, please check.") + raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR) + + compare_index_list = config.get('compare_index', []) + if compare_index_list is None: + compare_index_list = [] + config['compare_index'] = compare_index_list + if not isinstance(compare_index_list, list): + logger.error("The config format of 'compare_index' is incorrect, please check.") + raise CompareException(CompareException.MERGE_COMPARE_RESULT_ERROR) + + return config diff --git a/debug/accuracy_tools/msprobe/core/compare/multiprocessing_compute.py b/debug/accuracy_tools/msprobe/core/compare/multiprocessing_compute.py index 864f29d2fbd098fd2ed4aa4d0a27c38cc025e30e..f79671827c1efc30f3f0a573e23d9d72f2fbd289 100644 --- a/debug/accuracy_tools/msprobe/core/compare/multiprocessing_compute.py +++ b/debug/accuracy_tools/msprobe/core/compare/multiprocessing_compute.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -15,15 +15,18 @@ import multiprocessing from dataclasses import dataclass +from functools import partial + import pandas as pd from tqdm import tqdm + from msprobe.core.common.log import logger from msprobe.core.common.utils import CompareException from msprobe.core.common.const import CompareConst def _handle_multi_process(func, input_parma, result_df, lock): - process_num = int((multiprocessing.cpu_count() + 1) / 2) + process_num = max(int((multiprocessing.cpu_count() + 1) // 4), 1) op_name_mapping_dict = read_dump_data(result_df) df_chunk_size = len(result_df) // process_num @@ -44,7 +47,7 @@ def _handle_multi_process(func, input_parma, result_df, lock): progress_bar = tqdm(total=len(result_df), desc="API/Module Item Compare Process", unit="row", ncols=100) - def update_progress(size, progress_lock): + def update_progress(size, progress_lock, extra_param=None): with progress_lock: progress_bar.update(size) @@ -54,8 +57,10 @@ def _handle_multi_process(func, input_parma, result_df, lock): result = pool.apply_async(func, args=(idx, op_name_mapping_dict, df_chunk, lock, input_parma), error_callback=err_call, - callback=update_progress(chunk_size, lock)) + callback=partial(update_progress, chunk_size, lock) + ) results.append(result) + final_results = [r.get() for r in results] pool.close() pool.join() @@ -63,7 +68,7 @@ def _handle_multi_process(func, input_parma, result_df, lock): def _ms_graph_handle_multi_process(func, result_df, mode): - process_num = int((multiprocessing.cpu_count() + 1) // 4) + process_num = max(int((multiprocessing.cpu_count() + 1) // 4), 1) df_chunk_size = len(result_df) // process_num if df_chunk_size > 0: df_chunks = [result_df.iloc[i:i + df_chunk_size] for i in range(0, len(result_df), df_chunk_size)] @@ -110,11 +115,12 @@ def read_dump_data(result_df): @dataclass class ComparisonResult: cos_result: list + euc_dist_result: list max_err_result: list 
max_relative_err_result: list - err_msgs: list one_thousand_err_ratio_result: list five_thousand_err_ratio_result: list + err_msgs: list def _save_cmp_result(offset, result: ComparisonResult, result_df, lock): @@ -135,15 +141,16 @@ def _save_cmp_result(offset, result: ComparisonResult, result_df, lock): for i, _ in enumerate(result.cos_result): process_index = i + offset result_df.loc[process_index, CompareConst.COSINE] = result.cos_result[i] + result_df.loc[process_index, CompareConst.EUC_DIST] = result.euc_dist_result[i] result_df.loc[process_index, CompareConst.MAX_ABS_ERR] = result.max_err_result[i] result_df.loc[process_index, CompareConst.MAX_RELATIVE_ERR] = result.max_relative_err_result[i] - result_df.loc[process_index, CompareConst.ERROR_MESSAGE] = result.err_msgs[i] - result_df.loc[process_index, CompareConst.ACCURACY] = ( - check_accuracy(result.cos_result[i], result.max_err_result[i])) result_df.loc[process_index, CompareConst.ONE_THOUSANDTH_ERR_RATIO] = ( result.one_thousand_err_ratio_result)[i] result_df.loc[process_index, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO] = ( result.five_thousand_err_ratio_result)[i] + result_df.loc[process_index, CompareConst.ACCURACY] = ( + check_accuracy(result.cos_result[i], result.max_err_result[i])) + result_df.loc[process_index, CompareConst.ERROR_MESSAGE] = result.err_msgs[i] return result_df except ValueError as e: logger.error('result dataframe is not found.') diff --git a/debug/accuracy_tools/msprobe/core/compare/npy_compare.py b/debug/accuracy_tools/msprobe/core/compare/npy_compare.py index 0646d24c01c9be9fdf8180ee040ff69bbfe70630..4103d361fec14284fc38f97e1418e5405e939cd9 100644 --- a/debug/accuracy_tools/msprobe/core/compare/npy_compare.py +++ b/debug/accuracy_tools/msprobe/core/compare/npy_compare.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,18 +14,31 @@ # limitations under the License. import abc + import numpy as np -from msprobe.core.common.utils import format_value + from msprobe.core.common.const import Const, CompareConst from msprobe.core.common.log import logger +from msprobe.core.common.utils import CompareException, format_value def handle_inf_nan(n_value, b_value): + def convert_to_float(value): + try: + if isinstance(value, np.ndarray): + return value.astype(float) + else: + return float(value) + except ValueError as e: + logger.error('\n'.join(e.args)) + raise CompareException(CompareException.INVALID_DATA_ERROR) from e + + n_value_convert, b_value_convert = convert_to_float(n_value), convert_to_float(b_value) """处理inf和nan的数据""" - n_inf = np.isinf(n_value) - b_inf = np.isinf(b_value) - n_nan = np.isnan(n_value) - b_nan = np.isnan(b_value) + n_inf = np.isinf(n_value_convert) + b_inf = np.isinf(b_value_convert) + n_nan = np.isnan(n_value_convert) + b_nan = np.isnan(b_value_convert) n_invalid = np.any(n_inf) or np.any(n_nan) b_invalid = np.any(b_inf) or np.any(b_nan) if n_invalid or b_invalid: @@ -39,58 +52,66 @@ def handle_inf_nan(n_value, b_value): return n_value, b_value -def get_error_type(n_value, b_value, error_flag): - """判断数据是否有异常并返回异常的n_value, b_value,同时返回error_flag""" +def get_error_flag_and_msg(n_value, b_value, error_flag=False, error_file=None): + """判断数据是否有异常并返回异常的n_value, b_value,同时返回error_flag和error_msg""" + err_msg = "" if error_flag: - return CompareConst.READ_NONE, CompareConst.READ_NONE, True + if error_file == "no_bench_data": + err_msg = "Bench does not have data file." + elif error_file: + err_msg = f"Dump file: {error_file} not found." 
+        else:
+            err_msg = CompareConst.NO_BENCH
+        error_flag = True
+        return CompareConst.READ_NONE, CompareConst.READ_NONE, error_flag, err_msg
+
     if n_value.size == 0:  # check whether the loaded data is empty
-        return CompareConst.NONE, CompareConst.NONE, True
+        err_msg = "This is empty data, can not compare."
+        error_flag = True
+        return CompareConst.NONE, CompareConst.NONE, error_flag, err_msg
+
+    if not n_value.shape:  # check whether the data is a 0-d tensor
+        err_msg = (f"This is type of 0-d tensor, can not calculate '{CompareConst.COSINE}', '{CompareConst.EUC_DIST}', "
+                   f"'{CompareConst.ONE_THOUSANDTH_ERR_RATIO}' and '{CompareConst.FIVE_THOUSANDTHS_ERR_RATIO}'. ")
+        error_flag = False  # max absolute and max relative error can still be calculated for 0-d tensors, so error_flag stays False and unified error handling is skipped
+        return n_value, b_value, error_flag, err_msg
     if n_value.shape != b_value.shape:  # check whether NPU and bench data structures match
-        return CompareConst.SHAPE_UNMATCH, CompareConst.SHAPE_UNMATCH, True
-    if not n_value.shape:  # check whether the data is a scalar
-        return n_value, b_value, False
+        err_msg = "Shape of NPU and bench tensor do not match. Skipped."
+        error_flag = True
+        return CompareConst.SHAPE_UNMATCH, CompareConst.SHAPE_UNMATCH, error_flag, err_msg

-    n_value, b_value = handle_inf_nan(n_value, b_value)  # check for nan/inf values
+    try:
+        n_value, b_value = handle_inf_nan(n_value, b_value)  # check for nan/inf values
+    except CompareException:
+        logger.error('Numpy data is unreadable, please check!')
+        err_msg = "Data is unreadable."
+        error_flag = True
+        return CompareConst.UNREADABLE, CompareConst.UNREADABLE, error_flag, err_msg
     if n_value is CompareConst.NAN or b_value is CompareConst.NAN:
-        return CompareConst.NAN, CompareConst.NAN, True
-    return n_value, b_value, False
+        err_msg = "The position of inf or nan in NPU and bench Tensor do not match."
+        error_flag = True
+        return CompareConst.NAN, CompareConst.NAN, error_flag, err_msg
+
+    if n_value.dtype != b_value.dtype:  # check whether the dtypes match
+        err_msg = "Dtype of NPU and bench tensor do not match."
+        error_flag = False
+        return n_value, b_value, error_flag, err_msg
+
+    return n_value, b_value, error_flag, err_msg


 def reshape_value(n_value, b_value):
     """Return the reshaped data"""
-    if not n_value.shape:  # check whether the data is a scalar
+    if not n_value.shape:  # a 0-d tensor is not reshaped to 1-d; it is returned as is
         if n_value.dtype == bool:
             n_value = n_value.astype(float)
             b_value = b_value.astype(float)
         return n_value, b_value

-    n_value = n_value.reshape(-1).astype(float)
+    n_value = n_value.reshape(-1).astype(float)  # cast 32-bit to 64-bit float to avoid precision errors when converting some values to a dataframe
     b_value = b_value.reshape(-1).astype(float)
     return n_value, b_value


-def get_error_message(n_value, b_value, npu_op_name, error_flag, error_file=None):
-    """Get the error message for abnormal cases"""
-    if error_flag:
-        if n_value == CompareConst.READ_NONE:
-            if error_file:
-                return "Dump file: {} not found.".format(error_file)
-            return CompareConst.NO_BENCH
-        if n_value == CompareConst.NONE:
-            return "This is empty data, can not compare."
-        if n_value == CompareConst.SHAPE_UNMATCH:
-            return "Shape of NPU and bench Tensor do not match. Skipped."
-        if n_value == CompareConst.NAN:
-            return "The position of inf or nan in NPU and bench Tensor do not match."
-    else:
-        if not n_value.shape:
-            return "This is type of scalar data, can not compare."
-        if n_value.dtype != b_value.dtype:
-            logger.warning("Dtype of NPU and bench Tensor do not match: {}".format(npu_op_name))
-            return "Dtype of NPU and bench Tensor do not match."
-    return ""
-
-
 def npy_data_check(n_value, b_value):
     error_message = ""
     if not isinstance(n_value, np.ndarray) or not isinstance(b_value, np.ndarray):
@@ -109,7 +130,11 @@ def npy_data_check(n_value, b_value):
         error_message += "Dtype of NPU and bench Tensor do not match. Skipped.\n"

     if not error_message:
-        n_value, b_value = handle_inf_nan(n_value, b_value)  # check for nan/inf values
+        try:
+            n_value, b_value = handle_inf_nan(n_value, b_value)  # check for nan/inf values
+        except CompareException:
+            logger.error('Numpy data is unreadable, please check!')
+            return True, 'Numpy data is unreadable, please check!'
         # handle_inf_nan returns 'Nan' or an ndarray; use the type to check for nan/inf values that cannot be handled
         if not isinstance(n_value, np.ndarray) or not isinstance(b_value, np.ndarray):
             error_message += "The position of inf or nan in NPU and bench Tensor do not match.\n"
@@ -143,13 +168,30 @@ def statistics_data_check(result_dict):
 class TensorComparisonBasic(abc.ABC):
     """Template for comparing npy data between NPU and bench"""
+
     @abc.abstractmethod
-    def apply(self, n_value, b_value, error_flag, relative_err=None):
+    def apply(self, n_value, b_value, relative_err, err_msg):
         raise NotImplementedError


+def get_relative_err(n_value, b_value):
+    """Compute the relative error"""
+    with np.errstate(divide='ignore', invalid='ignore'):
+        if b_value.dtype not in CompareConst.FLOAT_TYPE:
+            n_value, b_value = n_value.astype(float), b_value.astype(float)
+
+        n_value_copy = n_value.copy()
+        b_value_copy = b_value.copy()
+        zero_mask = (b_value_copy == 0)
+        b_value_copy[zero_mask] += Const.FLOAT_EPSILON
+        n_value_copy[zero_mask] += Const.FLOAT_EPSILON
+        relative_err = np.divide((n_value_copy - b_value_copy), b_value_copy)
+        return np.abs(relative_err)
+
+
 class GetCosineSimilarity(TensorComparisonBasic):
     """Compute the cosine similarity"""
+
     @staticmethod
     def correct_data(result):
         if result == CompareConst.NAN:
@@ -158,156 +200,120 @@ class GetCosineSimilarity(TensorComparisonBasic):
             return round(float(result), 6)
         return result

-    def apply(self, n_value, b_value, error_flag, relative_err=None):
-        if error_flag:
-            if n_value == CompareConst.READ_NONE:
-                return CompareConst.NONE, ''
-            if n_value == CompareConst.NONE:
-                return CompareConst.UNSUPPORTED, ''
-            if n_value == CompareConst.SHAPE_UNMATCH:
-                return CompareConst.SHAPE_UNMATCH, ''
-            if n_value == CompareConst.NAN:
-                return CompareConst.N_A, ''
-
-        if not n_value.shape:
-            return CompareConst.UNSUPPORTED, ''
-
-        with np.errstate(divide='ignore', invalid='ignore'):
+    def apply(self, n_value, b_value, relative_err, err_msg):
+        if "This is type of 0-d tensor" in err_msg:
+            return CompareConst.UNSUPPORTED, err_msg
+
+        with np.errstate(divide="ignore", invalid="ignore"):
             if len(n_value) == 1:
-                return CompareConst.UNSUPPORTED, "This tensor is scalar."
+                return CompareConst.UNSUPPORTED, "This is a 1-d tensor of length 1."
             num = n_value.dot(b_value)
             a_norm = np.linalg.norm(n_value)
             b_norm = np.linalg.norm(b_value)
             if a_norm <= Const.FLOAT_EPSILON and b_norm <= Const.FLOAT_EPSILON:
-                return 1.0, ''
+                return 1.0, ""
             if a_norm <= Const.FLOAT_EPSILON:
-                return CompareConst.NAN, 'Cannot compare by Cosine Similarity, All the data is Zero in npu dump data.'
+                return CompareConst.NAN, "Cannot compare by Cosine Similarity, All the data is Zero in npu dump data."
            if b_norm <= Const.FLOAT_EPSILON:
-                return CompareConst.NAN, 'Cannot compare by Cosine Similarity, All the data is Zero in Bench dump data.'
+                return CompareConst.NAN, "Cannot compare by Cosine Similarity, All the data is Zero in Bench dump data."
             cos = num / (a_norm * b_norm)
             if np.isnan(cos):
-                return CompareConst.NAN, 'Cannot compare by Cosine Similarity, the dump data has NaN.'
+                return CompareConst.NAN, "Cannot compare by Cosine Similarity, the dump data has NaN."
             result = format_value(cos)
             result = self.correct_data(result)
-            return 1.0 if float(result) > 0.99999 else result, ''
+            return result, ""
+
+
+class GetEuclideanDistance(TensorComparisonBasic):
+    """Compute the Euclidean distance"""
+
+    def apply(self, n_value, b_value, relative_err, err_msg):
+        if "This is type of 0-d tensor" in err_msg:
+            return CompareConst.UNSUPPORTED, err_msg
+
+        distance = np.linalg.norm(n_value - b_value, ord=2)
+
+        return distance, ""


 class GetMaxAbsErr(TensorComparisonBasic):
     """Compute the maximum absolute error"""

-    def apply(self, n_value, b_value, error_flag, relative_err=None):
-        if error_flag:
-            if n_value == CompareConst.READ_NONE:
-                return CompareConst.NONE, ""
-            if n_value == CompareConst.NONE:
-                return 0, ""
-            if n_value == CompareConst.SHAPE_UNMATCH:
-                return CompareConst.SHAPE_UNMATCH, ""
-            if n_value == CompareConst.NAN:
-                return CompareConst.N_A, ""
+    def apply(self, n_value, b_value, relative_err, err_msg):
         temp_res = n_value - b_value
         max_value = np.max(np.abs(temp_res))
         if np.isnan(max_value):
-            message = 'Cannot compare by MaxRelativeError, the data contains nan/inf/-inf in dump data.'
-            return CompareConst.NAN, message
+            msg = "Cannot compare by MaxAbsError, the data contains nan/inf/-inf in dump data."
+            return CompareConst.NAN, msg
         return format_value(max_value), ""


-def get_relative_err(n_value, b_value):
-    """Compute the relative error"""
-    with np.errstate(divide='ignore', invalid='ignore'):
-        if b_value.dtype not in CompareConst.FLOAT_TYPE:
-            n_value, b_value = n_value.astype(float), b_value.astype(float)
-        zero_mask = (b_value == 0)
-        b_value[zero_mask] += np.finfo(b_value.dtype).eps
-        n_value[zero_mask] += np.finfo(b_value.dtype).eps
-        relative_err = np.divide((n_value - b_value), b_value)
-        return np.abs(relative_err)
-
-
 class GetMaxRelativeErr(TensorComparisonBasic):
     """Compute the maximum relative error"""

-    def apply(self, n_value, b_value, error_flag, relative_err=None):
-        if error_flag:
-            if n_value == CompareConst.READ_NONE:
-                return CompareConst.NONE, ''
-            if n_value == CompareConst.NONE:
-                return 0, ''
-            if n_value == CompareConst.SHAPE_UNMATCH:
-                return CompareConst.SHAPE_UNMATCH, ''
-            if n_value == CompareConst.NAN:
-                return CompareConst.N_A, ''
-
-        if relative_err is None:
-            relative_err = get_relative_err(n_value, b_value)
+
+    def apply(self, n_value, b_value, relative_err, err_msg):
         max_relative_err = np.max(np.abs(relative_err))
         if np.isnan(max_relative_err):
-            message = 'Cannot compare by MaxRelativeError, the data contains nan/inf/-inf in dump data.'
-            return CompareConst.NAN, message
-        return format_value(max_relative_err), ''
-
-
-class GetThousandErrRatio(TensorComparisonBasic):
-    """Compute the ratio of elements whose relative error is below one thousandth"""
-    def apply(self, n_value, b_value, error_flag, relative_err=None):
-        if error_flag:
-            if n_value == CompareConst.READ_NONE:
-                return CompareConst.NONE, ""
-            if n_value == CompareConst.NONE:
-                return 0, ""
-            if n_value == CompareConst.SHAPE_UNMATCH:
-                return CompareConst.SHAPE_UNMATCH, ""
-            if n_value == CompareConst.NAN:
-                return CompareConst.N_A, ""
-
-        if not n_value.shape:
-            return CompareConst.NAN, ""
-        if relative_err is None:
-            relative_err = get_relative_err(n_value, b_value)
-        if not np.size(relative_err):
-            return CompareConst.NAN, ""
-        return format_value(np.sum(relative_err < CompareConst.THOUSAND_RATIO_THRESHOLD) / np.size(relative_err)), ""
-
-
-class GetFiveThousandErrRatio(TensorComparisonBasic):
-    """Compute the ratio of elements whose relative error is below five thousandths"""
-    def apply(self, n_value, b_value, error_flag, relative_err=None):
-        if error_flag:
-            if n_value == CompareConst.READ_NONE:
-                return CompareConst.NONE, ""
-            if n_value == CompareConst.NONE:
-                return 0, ""
-            if n_value == CompareConst.SHAPE_UNMATCH:
-                return CompareConst.SHAPE_UNMATCH, ""
-            if n_value == CompareConst.NAN:
-                return CompareConst.N_A, ""
-
-        if not n_value.shape:
-            return CompareConst.NAN, ""
-        if relative_err is None:
-            relative_err = get_relative_err(n_value, b_value)
+ return CompareConst.NAN, msg + return format_value(max_relative_err), "" + + +class GetErrRatio(TensorComparisonBasic): + """计算相对误差小于指定阈值(千分之一、千分之五)的比例""" + + def __init__(self, threshold): + self.threshold = threshold + + def apply(self, n_value, b_value, relative_err, err_msg): + if "This is type of 0-d tensor" in err_msg: + return CompareConst.UNSUPPORTED, err_msg + if not np.size(relative_err): return CompareConst.NAN, "" - return format_value( - np.sum(relative_err < CompareConst.FIVE_THOUSAND_RATIO_THRESHOLD) / np.size(relative_err)), "" + + ratio = np.sum(relative_err < self.threshold) / np.size(relative_err) + return format_value(ratio), "" class CompareOps: compare_ops = { "cosine_similarity": GetCosineSimilarity(), + "euclidean_distance": GetEuclideanDistance(), "max_abs_error": GetMaxAbsErr(), "max_relative_error": GetMaxRelativeErr(), - "one_thousand_err_ratio": GetThousandErrRatio(), - "five_thousand_err_ratio": GetFiveThousandErrRatio() + "one_thousand_err_ratio": GetErrRatio(CompareConst.THOUSAND_RATIO_THRESHOLD), + "five_thousand_err_ratio": GetErrRatio(CompareConst.FIVE_THOUSAND_RATIO_THRESHOLD) } -def compare_ops_apply(n_value, b_value, error_flag, err_msg, relative_err=None): +def error_value_process(n_value): + if n_value == CompareConst.READ_NONE or n_value == CompareConst.UNREADABLE: + return CompareConst.UNSUPPORTED, "" + if n_value == CompareConst.NONE: + return 0, "" + if n_value == CompareConst.SHAPE_UNMATCH: + return CompareConst.SHAPE_UNMATCH, "" + if n_value == CompareConst.NAN: + return CompareConst.N_A, "" + return CompareConst.N_A, "" + + +def compare_ops_apply(n_value, b_value, error_flag, err_msg): result_list = [] + if error_flag: + result, msg = error_value_process(n_value) + result_list = [result] * len(CompareOps.compare_ops) + err_msg += msg * len(CompareOps.compare_ops) + return result_list, err_msg + + relative_err = get_relative_err(n_value, b_value) + n_value, b_value = reshape_value(n_value, b_value) + for op in 
CompareOps.compare_ops.values(): - result, msg = op.apply(n_value, b_value, error_flag, relative_err=relative_err) - err_msg += msg + result, msg = op.apply(n_value, b_value, relative_err, err_msg) result_list.append(result) + err_msg += msg return result_list, err_msg diff --git a/debug/accuracy_tools/msprobe/core/compare/utils.py b/debug/accuracy_tools/msprobe/core/compare/utils.py index f37d9ce4decf01b3086c10edc68751832f7738a8..72b75ab254e59a4ec5788e95fde6721df2babe46 100644 --- a/debug/accuracy_tools/msprobe/core/compare/utils.py +++ b/debug/accuracy_tools/msprobe/core/compare/utils.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -16,36 +16,45 @@ import os import re import math +import zlib +from dataclasses import dataclass + import numpy as np -from msprobe.core.common.const import Const, CompareConst -from msprobe.core.common.utils import CompareException, check_regex_prefix_format_valid, logger + +from msprobe.core.common.const import Const, CompareConst, FileCheckConst +from msprobe.core.common.utils import CompareException, check_regex_prefix_format_valid, logger, safe_get_value from msprobe.core.common.file_utils import check_file_or_directory_path def extract_json(dirname, stack_json=False): json_path = '' - for fname in os.listdir(dirname): - if fname == "construct.json": - continue - full_path = os.path.join(dirname, fname) - if full_path.endswith('.json'): - json_path = full_path - if not stack_json and 'stack' not in json_path: - break - if stack_json and 'stack' in json_path: - break + for filename in os.listdir(dirname): + target_file_name = 'stack.json' if stack_json else 'dump.json' + if filename == target_file_name: + json_path = os.path.join(dirname, filename) + break # Provide robustness on invalid directory inputs if not json_path: - logger.error(f'No file is 
found in dump dir {dirname}. ') - raise CompareException(CompareException.NO_DUMP_FILE_ERROR) + if stack_json: + logger.warning(f'stack.json is not found in dump dir {dirname}.') + else: + logger.error(f'dump.json is not found in dump dir {dirname}.') + raise CompareException(CompareException.NO_DUMP_FILE_ERROR) return json_path +def set_stack_json_path(input_param): + npu_data_dir = os.path.dirname(input_param.get("npu_json_path")) + stack_path = extract_json(npu_data_dir, stack_json=True) + input_param["stack_json_path"] = stack_path if stack_path else None + return bool(stack_path) + + def check_and_return_dir_contents(dump_dir, prefix): """ check the given dump dir and validate files in dump dir by using the given prefix patterns to build a - pattern: ^{prefix}(?:0|[0-9][1-9]*)?$ + pattern: ^{prefix}(?:0|[1-9][0-9]*)?$ Args: dump_dir (str): dump dir @@ -61,7 +70,7 @@ def check_and_return_dir_contents(dump_dir, prefix): check_regex_prefix_format_valid(prefix) check_file_or_directory_path(dump_dir, True) contents = os.listdir(dump_dir) - pattern = re.compile(rf'^{prefix}(?:0|[0-9][1-9]*)?$') + pattern = re.compile(rf'^{prefix}(?:0|[1-9][0-9]*)?$') for name in contents: if not pattern.match(name): logger.error( @@ -73,6 +82,10 @@ def check_and_return_dir_contents(dump_dir, prefix): def rename_api(npu_name, process): + """ + 原api: {api_type}.{api_name}.{API调用次数}.{前向反向}.{input/output}.{参数序号} + rename后: {api_type}.{api_name}.{input/output}.{参数序号} + """ npu_split = npu_name.split(process) try: torch_func_index, in_out = npu_split[0], npu_split[1] @@ -85,122 +98,99 @@ def rename_api(npu_name, process): def read_op(op_data, op_name): - op_parsed_list = [] - if Const.FORWARD in op_name: - if Const.INPUT_ARGS in op_data: - input_item = op_data[Const.INPUT_ARGS] - input_parsed_list = op_item_parse(input_item, op_name + '.input', None) - op_parsed_list = input_parsed_list.copy() - input_parsed_list.clear() - if Const.INPUT_KWARGS in op_data: - kwargs_item = 
op_data[Const.INPUT_KWARGS] - if isinstance(kwargs_item, dict) and "type" in kwargs_item or isinstance(kwargs_item, list): - kwarg_parsed_list = op_item_parse(kwargs_item, op_name + '.input', None) - op_parsed_list += kwarg_parsed_list - kwarg_parsed_list.clear() - elif kwargs_item: - for kwarg in kwargs_item: - kwarg_parsed_list = op_item_parse(kwargs_item[kwarg], op_name + '.input.' + kwarg, None) - op_parsed_list += kwarg_parsed_list - kwarg_parsed_list.clear() - if Const.OUTPUT in op_data: - output_item = op_data[Const.OUTPUT] - output_parsed_list = op_item_parse(output_item, op_name + '.output', None) - op_parsed_list += output_parsed_list - output_parsed_list.clear() - if Const.BACKWARD in op_name: - if Const.INPUT in op_data: - input_item = op_data[Const.INPUT] - input_parsed_list = op_item_parse(input_item, op_name + '.input', None) - op_parsed_list = input_parsed_list.copy() - input_parsed_list.clear() - if Const.OUTPUT in op_data: - output_item = op_data[Const.OUTPUT] - output_parsed_list = op_item_parse(output_item, op_name + '.output', None) - op_parsed_list += output_parsed_list - output_parsed_list.clear() + if Const.PARAMS_GRAD in op_name.split(Const.SEP): + op_parsed_list = op_item_parse(op_data, op_name) + else: + op_parsed_list = [] + for name in CompareConst.IO_NAME_MAPPING: + if name in op_data: + op_parsed_list.extend(op_item_parse(op_data[name], op_name + CompareConst.IO_NAME_MAPPING[name])) return op_parsed_list -def op_item_parse(item, op_name, index, item_list=None, top_bool=True, depth=0): +def op_item_parse(op_data, op_name: str, depth: int = 0) -> list: + default_item = { + 'full_op_name': op_name, + 'type': None, + 'Max': None, + 'Min': None, + 'Mean': None, + 'Norm': None, + 'dtype': None, + 'shape': None, + 'md5': None, + 'value': None, + 'data_name': '-1' + } + if depth > Const.MAX_DEPTH: - logger.error(f"parse of api/module of {op_name} exceeds the recursion limit.") + logger.error(f'parse of api/module of {op_name} exceeds the 
recursion limit.') raise CompareException(CompareException.RECURSION_LIMIT_ERROR) - if item_list is None: - item_list = [] - if item is None or (isinstance(item, dict) and not item): - if not top_bool: - tmp = { - 'full_op_name': op_name + '.' + str(index), 'Max': None, 'Min': None, 'Mean': None, 'Norm': None, - 'dtype': None, 'shape': None, 'md5': None, 'data_name': '-1' - } - else: - tmp = { - 'full_op_name': op_name + '.0', 'Max': None, 'Min': None, 'Mean': None, 'Norm': None, 'dtype': None, - 'shape': None, 'md5': None, 'data_name': '-1' - } - item_list.append(tmp) - return item_list - if index is None: - if isinstance(item, dict): - full_op_name = op_name + '.0' - else: - full_op_name = op_name - else: - full_op_name = op_name + Const.SEP + str(index) - if isinstance(item, dict): - if 'type' not in item: - for kwarg in item: - kwarg_parsed_list = op_item_parse(item[kwarg], op_name + Const.SEP + kwarg, None, depth=depth+1) - item_list += kwarg_parsed_list - kwarg_parsed_list.clear() - elif 'dtype' in item: - parsed_item = item - parsed_item['full_op_name'] = full_op_name - item_list.append(parsed_item) - elif 'type' in item: - parsed_item = {} - if item['type'] == 'torch.Size': - parsed_item['full_op_name'] = full_op_name - parsed_item['dtype'] = 'torch.Size' - parsed_item['shape'] = str(item['value']) - parsed_item['md5'] = None - parsed_item['Max'] = None - parsed_item['Min'] = None - parsed_item['Mean'] = None - parsed_item['Norm'] = None - parsed_item['data_name'] = '-1' - item_list.append(parsed_item) - elif item['type'] == 'slice': - parsed_item['full_op_name'] = full_op_name - parsed_item['dtype'] = 'slice' - parsed_item['shape'] = str(np.shape(np.array(item['value']))) - parsed_item['md5'] = None - parsed_item['Max'] = None - parsed_item['Min'] = None - parsed_item['Mean'] = None - parsed_item['Norm'] = None - parsed_item['data_name'] = '-1' - item_list.append(parsed_item) + + if op_data is None: + return [default_item] + elif not op_data: + return [] + 
+ item_list = [] + if isinstance(op_data, list): + for i, data in enumerate(op_data): + if Const.PARAMS_GRAD not in op_name.split(Const.SEP): + item_list.extend(op_item_parse(data, op_name + Const.SEP + str(i), depth + 1)) else: - parsed_item['full_op_name'] = full_op_name - parsed_item['dtype'] = str(type(item['value'])) - parsed_item['shape'] = '[]' - parsed_item['md5'] = None - parsed_item['Max'] = item['value'] - parsed_item['Min'] = item['value'] - parsed_item['Mean'] = item['value'] - parsed_item['Norm'] = item['value'] - parsed_item['data_name'] = '-1' - item_list.append(parsed_item) - else: - resolve_api_special_parameters(item, full_op_name, item_list) - else: - for j, item_spec in enumerate(item): - op_item_parse(item_spec, full_op_name, j, item_list=item_list, top_bool=False, depth=depth+1) + item_list.extend(op_item_parse(data, op_name, depth + 1)) + elif isinstance(op_data, dict): + if is_leaf_data(op_data): + return [gen_op_item(op_data, op_name)] + for sub_name, sub_data in op_data.items(): + item_list.extend(op_item_parse(sub_data, op_name + Const.SEP + str(sub_name), depth + 1)) return item_list +def is_leaf_data(op_data): + return 'type' in op_data and isinstance(op_data['type'], str) + + +def gen_op_item(op_data, op_name): + op_item = {} + op_item.update(op_data) + data_name = op_data.get('data_name') if op_data.get('data_name') else '-1' # 如果是""也返回-1 + op_item['data_name'] = data_name + op_item['full_op_name'] = data_name.rsplit(Const.SEP, 1)[0] if data_name != '-1' else op_name + + params = ['Max', 'Min', 'Mean', 'Norm'] + for i in params: + if i not in op_item: + op_item[i] = None + + if not op_item.get('dtype'): + if op_item.get('type') == 'torch.Size': + op_item['dtype'] = op_data.get('type') + op_item['shape'] = str(op_data.get('value')) + elif op_item.get('type') == 'slice': + op_item['dtype'] = op_data.get('type') + op_item['shape'] = str(np.shape(np.array(op_data.get('value')))) + elif op_item.get('type') == 'ellipsis': + 
op_item['dtype'] = op_data.get('type') + op_item['shape'] = '[]' + for i in params: + op_item[i] = op_data.get('value') + elif op_item.get('type') == 'torch.ProcessGroup': + op_item['dtype'] = op_data.get('type') + op_item['shape'] = '[]' + for i in params: + op_item[i] = str(op_data.get('group_ranks')) + else: + op_item['dtype'] = str(type(op_data.get('value'))) + op_item['shape'] = '[]' + for i in params: + op_item[i] = op_data.get('value') + if not op_item.get('md5'): + op_item['md5'] = f"{zlib.crc32(str(op_data.get('value', '')).encode()):08x}" + + return op_item + + def resolve_api_special_parameters(data_dict, full_op_name, item_list): """ Function Description: @@ -268,6 +258,61 @@ def get_rela_diff_summary_mode(result_item, npu_summary_data, bench_summary_data return result_item, accuracy_check, err_msg +@dataclass +class ApiItemInfo: + name: str + struct: tuple + stack_info: list + + +def stack_column_process(result_item, has_stack, index, key, npu_stack_info): + if has_stack and index == 0 and key == CompareConst.INPUT_STRUCT: + result_item.extend(npu_stack_info) + else: + result_item.append(CompareConst.NONE) + return result_item + + +def result_item_init(n_info, b_info, dump_mode): + n_len = len(n_info.struct) + b_len = len(b_info.struct) + struct_long_enough = (n_len > 2 and b_len > 2) if dump_mode == Const.MD5 else (n_len > 1 and b_len > 1) + if struct_long_enough: + result_item = [ + n_info.name, b_info.name, n_info.struct[0], b_info.struct[0], n_info.struct[1], b_info.struct[1] + ] + if dump_mode == Const.MD5: + md5_compare_result = CompareConst.PASS if n_info.struct[2] == b_info.struct[2] else CompareConst.DIFF + result_item.extend([n_info.struct[2], b_info.struct[2], md5_compare_result]) + elif dump_mode == Const.SUMMARY: + result_item.extend([" "] * 8) # 8个统计量数据情况的比对指标 + else: + result_item.extend([" "] * 6) # 6个真实数据情况的比对指标 + else: + err_msg = "index out of bounds error will occur in result_item_init, please check!\n" \ + f"npu_info_struct is 
{n_info.struct}\n" \ + f"bench_info_struct is {b_info.struct}" + logger.error(err_msg) + raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) + return result_item + + +def count_struct(op_dict): + parts = [ + CompareConst.OP_NAME, + CompareConst.INPUT_STRUCT, + CompareConst.OUTPUT_STRUCT, + CompareConst.PARAMS_STRUCT, + CompareConst.PARAMS_GRAD_STRUCT + ] + lengths = [len(op_dict.get(part, [])) for part in parts] + num = lengths[0] + if num != sum(lengths[1:]): + logger.error(f"Length of names and structs of op_dict not match. Please check! op_dict: {op_dict}") + raise CompareException(CompareException.NAMES_STRUCTS_MATCH_ERROR) + return tuple(lengths) + + def get_accuracy(result, n_dict, b_dict, dump_mode): def get_accuracy_core(n_start, n_len, b_start, b_len, key): min_len = min(n_len, b_len) @@ -280,36 +325,23 @@ def get_accuracy(result, n_dict, b_dict, dump_mode): bench_data_name = b_dict.get("data_name", None) for index in range(min_len): - n_name = n_dict['op_name'][n_start + index] - b_name = b_dict['op_name'][b_start + index] - n_struct = n_dict[key][index] - b_struct = b_dict[key][index] + n_name = safe_get_value(n_dict, n_start + index, "n_dict", key="op_name") + b_name = safe_get_value(b_dict, b_start + index, "b_dict", key="op_name") + n_struct = safe_get_value(n_dict, index, "n_dict", key=key) + b_struct = safe_get_value(b_dict, index, "b_dict", key=key) err_msg = "" + + npu_info = ApiItemInfo(n_name, n_struct, npu_stack_info) + bench_info = ApiItemInfo(b_name, b_struct, bench_stack_info) + result_item = result_item_init(npu_info, bench_info, dump_mode) + if dump_mode == Const.MD5: - result_item = [ - n_name, b_name, n_struct[0], b_struct[0], n_struct[1], b_struct[1], n_struct[2], b_struct[2], - CompareConst.PASS if n_struct[2] == b_struct[2] else CompareConst.DIFF - ] - if has_stack and index == 0 and key == "input_struct": - result_item.extend(npu_stack_info) - else: - result_item.append(CompareConst.NONE) + result_item = 
stack_column_process(result_item, has_stack, index, key, npu_stack_info) result.append(result_item) continue - if dump_mode == Const.SUMMARY: - result_item = [ - n_name, b_name, n_struct[0], b_struct[0], n_struct[1], b_struct[1], - " ", " ", " ", " ", " ", " ", " ", " " - ] - else: - result_item = [ - n_name, b_name, n_struct[0], b_struct[0], n_struct[1], b_struct[1], - " ", " ", " ", " ", " " - ] - - npu_summary_data = n_dict.get(CompareConst.SUMMARY)[n_start + index] - bench_summary_data = b_dict.get(CompareConst.SUMMARY)[b_start + index] + npu_summary_data = safe_get_value(n_dict, n_start + index, "n_dict", key=CompareConst.SUMMARY) + bench_summary_data = safe_get_value(b_dict, b_start + index, "b_dict", key=CompareConst.SUMMARY) result_item.extend(process_summary_data(npu_summary_data)) result_item.extend(process_summary_data(bench_summary_data)) @@ -319,97 +351,121 @@ def get_accuracy(result, n_dict, b_dict, dump_mode): result_item.append(accuracy_check if dump_mode == Const.SUMMARY else CompareConst.ACCURACY_CHECK_YES) result_item.append(err_msg) - if has_stack and index == 0 and key == "input_struct": - result_item.extend(npu_stack_info) - else: - result_item.append(CompareConst.NONE) + result_item = stack_column_process(result_item, has_stack, index, key, npu_stack_info) if dump_mode == Const.ALL: - result_item.append(npu_data_name[n_start + index]) + result_item.append(safe_get_value(npu_data_name, n_start + index, "npu_data_name")) result.append(result_item) if n_len > b_len: for index in range(b_len, n_len): - n_name = n_dict['op_name'][n_start + index] - n_struct = n_dict[key][index] - if dump_mode == Const.MD5: + try: + n_name = n_dict['op_name'][n_start + index] + n_struct = n_dict[key][index] + if dump_mode == Const.MD5: + result_item = [ + n_name, CompareConst.NAN, n_struct[0], CompareConst.NAN, n_struct[1], CompareConst.NAN, + n_struct[2], CompareConst.NAN, CompareConst.NAN + ] + result.append(result_item) + continue result_item = [ n_name, 
CompareConst.NAN, n_struct[0], CompareConst.NAN, n_struct[1], CompareConst.NAN, - n_struct[2], CompareConst.NAN, CompareConst.NAN + " ", " ", " ", " ", " " ] - result.append(result_item) - continue - result_item = [ - n_name, CompareConst.NAN, n_struct[0], CompareConst.NAN, n_struct[1], CompareConst.NAN, - " ", " ", " ", " ", " " - ] - summary_data = n_dict.get(CompareConst.SUMMARY)[n_start + index] - result_item.extend(summary_data) - summary_data = [CompareConst.NAN for _ in range(len(n_dict.get(CompareConst.SUMMARY)[0]))] - result_item.extend(summary_data) + summary_data = n_dict.get(CompareConst.SUMMARY)[n_start + index] + result_item.extend(summary_data) + summary_data = [CompareConst.NAN for _ in range(len(n_dict.get(CompareConst.SUMMARY)[0]))] + result_item.extend(summary_data) + except IndexError as e: + err_msg = "index out of bounds error occurs, please check!\n" \ + f"n_dict is {n_dict}" + logger.error(err_msg) + raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from e err_msg = "" result_item.append(CompareConst.ACCURACY_CHECK_YES) result_item.append(err_msg) - - if has_stack and index == 0 and key == "input_struct": - result_item.extend(npu_stack_info) - else: - result_item.append(CompareConst.NONE) + result_item = stack_column_process(result_item, has_stack, index, key, npu_stack_info) if dump_mode == Const.ALL: - result_item.append(npu_data_name[n_start + index]) + result_item.append(safe_get_value(npu_data_name, n_start + index, "npu_data_name")) result.append(result_item) - n_num = len(n_dict['op_name']) - b_num = len(b_dict['op_name']) - n_num_input = len([name for name in n_dict['op_name'] - if Const.INPUT in name.split(Const.SEP) or Const.KWARGS in name.split(Const.SEP)]) - b_num_input = len([name for name in b_dict['op_name'] - if Const.INPUT in name.split(Const.SEP) or Const.KWARGS in name.split(Const.SEP)]) - n_num_output = n_num - n_num_input - b_num_output = b_num - b_num_input - get_accuracy_core(0, n_num_input, 0, 
b_num_input, 'input_struct') - get_accuracy_core(n_num_input, n_num_output, b_num_input, b_num_output, 'output_struct') + n_num, n_num_input, n_num_output, n_num_params, n_num_params_grad = count_struct(n_dict) + b_num, b_num_input, b_num_output, b_num_params, b_num_params_grad = count_struct(b_dict) + + get_accuracy_core(0, n_num_input, 0, b_num_input, CompareConst.INPUT_STRUCT) + get_accuracy_core(n_num_input + n_num_output, n_num_params, b_num_input + b_num_output, b_num_params, + CompareConst.PARAMS_STRUCT) + get_accuracy_core(n_num_input, n_num_output, b_num_input, b_num_output, CompareConst.OUTPUT_STRUCT) + get_accuracy_core(n_num_input + n_num_output + n_num_params, n_num_params_grad, + b_num_input + b_num_output + b_num_params, b_num_params_grad, + CompareConst.PARAMS_GRAD_STRUCT) + + +def append_stack_info(result_item, npu_stack_info, index): + """Append stack info to result_item""" + if npu_stack_info and index == 0: + result_item.extend(npu_stack_info) + else: + result_item.append(CompareConst.NONE) def get_un_match_accuracy(result, n_dict, dump_mode): - index_out = 0 npu_stack_info = n_dict.get("stack_info", None) bench_name, bench_type, bench_shape = CompareConst.N_A, CompareConst.N_A, CompareConst.N_A - err_msg = CompareConst.NO_BENCH - accuracy_check_res = CompareConst.N_A - for index, n_name in enumerate(n_dict["op_name"]): - name_ele_list = n_name.split(Const.SEP) - if Const.INPUT in name_ele_list or Const.KWARGS in name_ele_list: - n_struct = n_dict[CompareConst.INPUT_STRUCT][index] - if Const.OUTPUT in name_ele_list: - n_struct = n_dict[CompareConst.OUTPUT_STRUCT][index_out] - index_out += 1 - - result_item = [n_name, bench_name, n_struct[0], bench_type, n_struct[1], bench_shape] + + struct_to_index_mapping = { + CompareConst.INPUT_STRUCT: 0, + CompareConst.OUTPUT_STRUCT: 0, + CompareConst.PARAMS_STRUCT: 0, + CompareConst.PARAMS_GRAD_STRUCT: 0 + } + + op_name_list = n_dict.get(CompareConst.OP_NAME) + summary_list = n_dict.get(Const.SUMMARY) + data_name_list = n_dict.get('data_name') + op_name_reorder, summary_reorder, _ = reorder_op_x_list(op_name_list, + summary_list, + data_name_list) + for index, n_name in enumerate(op_name_reorder): + _, state = get_name_and_state(n_name) + struct_key = CompareConst.STATE_TO_STRUCT_MAPPING.get(state) + if not struct_key: + continue + n_struct = safe_get_value(n_dict, struct_to_index_mapping.get(struct_key), "n_dict", key=struct_key) + struct_to_index_mapping[struct_key] += 1 + + try: + result_item = [n_name, bench_name, n_struct[0], bench_type, n_struct[1], bench_shape] + except IndexError as e: + err_msg = "index out of bounds error occurs, please check!\n" \ + f"op_name of n_dict is {n_dict['op_name']}\n" \ + f"input_struct of n_dict is {n_dict[CompareConst.INPUT_STRUCT]}\n" \ + f"output_struct of n_dict is {n_dict[CompareConst.OUTPUT_STRUCT]}" + logger.error(err_msg) + raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from e + if dump_mode == Const.MD5: result_item.extend([CompareConst.N_A] * 3) - if npu_stack_info and index == 0: - result_item.extend(npu_stack_info) - else: - result_item.append(CompareConst.NONE) + append_stack_info(result_item, npu_stack_info, index) result.append(result_item) continue if dump_mode == Const.SUMMARY: - result_item.extend([CompareConst.N_A] * 8) - else: - result_item.extend([CompareConst.N_A] * 5) - npu_summary_data = n_dict.get("summary")[index] - result_item.extend(npu_summary_data) + result_item.extend([CompareConst.N_A] * 8) # 8 comparison metrics for the statistics (summary) case + if dump_mode == Const.ALL: + result_item.extend([CompareConst.N_A] * 6) # 6 comparison metrics for the real-data case + + npu_summary_data = safe_get_value(summary_reorder, index, "summary_reorder") bench_summary_data = [CompareConst.N_A] * 4 + result_item.extend(npu_summary_data) result_item.extend(bench_summary_data) + err_msg = CompareConst.NO_BENCH + accuracy_check_res = CompareConst.N_A result_item.append(accuracy_check_res) result_item.append(err_msg) - if npu_stack_info and index == 0: - 
result_item.extend(npu_stack_info) - else: - result_item.append(CompareConst.NONE) + append_stack_info(result_item, npu_stack_info, index) if dump_mode == Const.ALL and result_item[1] == CompareConst.N_A: result_item.extend(["-1"]) result.append(result_item) @@ -421,6 +477,8 @@ def merge_tensor(tensor_list, dump_mode): op_dict[CompareConst.INPUT_STRUCT] = [] op_dict[CompareConst.KWARGS_STRUCT] = [] op_dict[CompareConst.OUTPUT_STRUCT] = [] + op_dict[CompareConst.PARAMS_STRUCT] = [] + op_dict[CompareConst.PARAMS_GRAD_STRUCT] = [] op_dict[Const.SUMMARY] = [] op_dict["stack_info"] = [] @@ -428,41 +486,123 @@ def merge_tensor(tensor_list, dump_mode): op_dict["data_name"] = [] for tensor in tensor_list: + # A dict(len=2) with 'full_op_name' and 'full_info' is added to the tensor only if self.stack_mode is True if len(tensor) == 2: op_dict['stack_info'].append(tensor['full_info']) break + op_dict["op_name"].append(tensor['full_op_name']) - name_ele_list = tensor['full_op_name'].split(Const.SEP) - name_to_struct_mapping = { - Const.INPUT: CompareConst.INPUT_STRUCT, - Const.KWARGS: CompareConst.KWARGS_STRUCT, - Const.OUTPUT: CompareConst.OUTPUT_STRUCT - } - for name_key, struct_key in name_to_struct_mapping.items(): - if name_key in name_ele_list: - if dump_mode == Const.MD5: - op_dict.get(struct_key).append((tensor[Const.DTYPE], tensor[Const.SHAPE], tensor[Const.MD5])) - else: - op_dict.get(struct_key).append((tensor[Const.DTYPE], tensor[Const.SHAPE])) - break + + _, state = get_name_and_state(tensor['full_op_name']) + struct_key = CompareConst.STATE_TO_STRUCT_MAPPING.get(state) + if not struct_key: + continue + if dump_mode == Const.MD5: + op_dict.get(struct_key).append((tensor[Const.DTYPE], tensor[Const.SHAPE], tensor[Const.MD5])) + else: + op_dict.get(struct_key).append((tensor[Const.DTYPE], tensor[Const.SHAPE])) op_dict[Const.SUMMARY].append([tensor[Const.MAX], tensor[Const.MIN], tensor[Const.MEAN], tensor[Const.NORM]]) if dump_mode == Const.ALL: 
op_dict["data_name"].append(tensor['data_name']) - data_name = op_dict["data_name"][-1].rsplit(Const.SEP, 1)[0] - if data_name != "-1": - op_dict["op_name"][-1] = data_name if not op_dict[CompareConst.KWARGS_STRUCT]: del op_dict[CompareConst.KWARGS_STRUCT] return op_dict if op_dict["op_name"] else {} +def print_compare_ends_info(): + total_len = len(CompareConst.COMPARE_ENDS_SUCCESSFULLY) + Const.FILL_CHAR_NUMS + logger.info('*' * total_len) + logger.info(f"*{CompareConst.COMPARE_ENDS_SUCCESSFULLY.center(total_len - 2)}*") + logger.info('*' * total_len) + + +def table_value_is_valid(value: str) -> bool: + if not isinstance(value, str): + return True + try: + # -1.00 or +1.00 should be considered as numbers + float(value) + except ValueError: + # otherwise, they will be considered as formula injections + return not bool(re.compile(FileCheckConst.CSV_BLACK_LIST).search(value)) + return True + + +def get_name_and_state(name): + """ + Get api/module name and state + example: + name = 'conv2d.forward.1.input.0' + return: ('conv2d.forward.1.', 'input') + + name = 'Functional.pad.0.backward.output.0' + return: ('Functional.pad.0.backward.', 'output') + + state type: input, output, kwargs, parameters, parameters_grad + """ + if Const.PARAMS_GRAD in name.split(Const.SEP): + return name.split(Const.PARAMS_GRAD)[0], Const.PARAMS_GRAD + + split = re.split(Const.REGEX_FORWARD_BACKWARD, name) + api = f'{split[0]}.{split[1]}.' 
+ state_str = split[2] + match = re.match(r'^(\d+\.)?(input|output|kwargs|parameters)\..+$', state_str) + if not match: + raise CompareException(f'Invalid name string: {name}') + if match.group(1): + api = f'{api}{match.group(1)}' + state = match.group(2) + return api, state + + +def reorder_op_name_list(op_name_list): + if not op_name_list: + return op_name_list + + parameters = [] + output = [] + parameters_grad = [] + others = [] + for x in op_name_list: + state = get_name_and_state(x)[1] + if state == Const.PARAMS: + parameters.append(x) + elif state == Const.OUTPUT: + output.append(x) + elif state == Const.PARAMS_GRAD: + parameters_grad.append(x) + else: + others.append(x) + # Merge others, parameters and output, ensuring parameters come before output + op_name_reorder = others + parameters + output + parameters_grad + return op_name_reorder + + +def reorder_op_x_list(op_name_list, summary_list, data_name_list): + """Reorder op_name, summary and data_name, placing parameters after input and before output; data_name is handled separately because it is None in statistics-mode comparison""" + if not op_name_list or not summary_list: + return op_name_list, summary_list, data_name_list + + index_map = {name: index for index, name in enumerate(op_name_list)} + + op_name_reorder = reorder_op_name_list(op_name_list) + summary_reorder = [summary_list[index_map.get(name)] for name in op_name_reorder] + if data_name_list: + data_name_reorder = [data_name_list[index_map.get(name)] for name in op_name_reorder] + else: + data_name_reorder = data_name_list + + return op_name_reorder, summary_reorder, data_name_reorder + + def _compare_parser(parser): parser.add_argument("-i", "--input_path", dest="input_path", type=str, help=" The compare input path, a dict json.", required=True) parser.add_argument("-o", "--output_path", dest="output_path", type=str, - help=" The compare task result out path.", required=True) + help=" The compare task result out path. 
Default path: ./output", + required=False, default="./output", nargs="?", const="./output") parser.add_argument("-s", "--stack_mode", dest="stack_mode", action="store_true", help=" Whether to save stack info.", required=False) parser.add_argument("-c", "--compare_only", dest="compare_only", action="store_true", @@ -472,7 +612,7 @@ def _compare_parser(parser): parser.add_argument("-cm", "--cell_mapping", dest="cell_mapping", type=str, nargs='?', const=True, help=" The cell mapping file path.", required=False) parser.add_argument("-am", "--api_mapping", dest="api_mapping", type=str, nargs='?', const=True, - help=" The api mapping file path.", required=False) + help=" The api mapping file path.", required=False) parser.add_argument("-dm", "--data_mapping", dest="data_mapping", type=str, help=" The data mapping file path.", required=False) parser.add_argument("-lm", "--layer_mapping", dest="layer_mapping", type=str, nargs='?', const=True, diff --git a/debug/accuracy_tools/msprobe/core/data_dump/data_collector.py b/debug/accuracy_tools/msprobe/core/data_dump/data_collector.py index 3b8dc70b9e191006f4b8a6621fccc1eae122e099..20e4489f89e4bd345595e6a1db1e39ab427d4908 100644 --- a/debug/accuracy_tools/msprobe/core/data_dump/data_collector.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_collector.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -16,7 +16,7 @@ import atexit import os -from msprobe.core.data_dump.scope import build_scope, ListScope +from msprobe.core.data_dump.scope import ScopeFactory from msprobe.core.data_dump.json_writer import DataWriter from msprobe.core.common.log import logger from msprobe.core.common.const import Const @@ -28,7 +28,6 @@ def build_data_collector(config): class DataCollector: - multi_output_apis = ["_sort_", "npu_flash_attention"] tasks_need_tensor_data = [Const.OVERFLOW_CHECK, Const.TENSOR, Const.FREE_BENCHMARK] level_without_construct = [Const.LEVEL_L1, Const.LEVEL_L2] @@ -38,11 +37,10 @@ class DataCollector: self.data_processor = DataProcessorFactory.create_processor(self.config, self.data_writer) self.module_processor = DataProcessorFactory.get_module_processor(self.config.framework) self.module_count = {} - if self.config.task == Const.FREE_BENCHMARK: - self.scope = build_scope(ListScope, self.config.scope, self.config.list) - else: - self.scope = build_scope(None, self.config.scope, self.config.list) - + self.scope = ScopeFactory(self.config).build_scope() + self.backward_module_names = {} + self.optimizer_status = "" + self.optimizer_status_first_start = {Const.OPTIMIZER: True, Const.CLIP_GRAD: True} atexit.register(self.write_json) @property @@ -58,8 +56,15 @@ class DataCollector: return (not scope or scope.check(name)) and pid == os.getpid() @staticmethod - def is_inplace(module): - return getattr(module, "op_is_inplace", False) + def set_is_recomputable(data_info, is_recompute): + if data_info and len(data_info) == 1 and is_recompute is not None: # normally data_info should have length 1 + data_info[list(data_info.keys())[0]]["is_recompute"] = is_recompute + + def reset_status(self): + self.optimizer_status = "" + self.optimizer_status_first_start = {Const.OPTIMIZER: True, Const.CLIP_GRAD: True} + self.data_writer.reset_cache() + self.backward_module_names.clear() def if_return_forward_new_output(self): 
return self.data_processor.if_return_forward_new_output() @@ -84,62 +89,105 @@ class DataCollector: logger.debug(msg) self.data_writer.update_data(data_info) - def pre_forward_data_collect(self, name, module, pid, module_input_output): - backward_name = name.replace(Const.FORWARD, Const.BACKWARD) - if self.check_scope_and_pid(self.scope, backward_name, pid): - self.data_processor.analyze_pre_forward(backward_name, module, module_input_output) - if not self.is_inplace(module) or not self.check_scope_and_pid(self.scope, name, pid): + def forward_input_data_collect(self, name, module, pid, module_input_output, is_recompute=None): + if self.config.task == Const.FREE_BENCHMARK: + backward_name = name.replace(Const.FORWARD, Const.BACKWARD) + if self.check_scope_and_pid(self.scope, backward_name, pid): + self.data_processor.analyze_forward_input(backward_name, module, module_input_output) + return + + if not self.check_scope_and_pid(self.scope, name, pid): + return + + data_info = {} + if self.config.task != Const.STRUCTURE: + data_info = self.data_processor.analyze_forward_input(name, module, module_input_output) + self.set_is_recomputable(data_info, is_recompute) + if self.config.level == Const.LEVEL_L2: return - logger.info(f"API {name} is inplace.") - data_info = self.data_processor.analyze_pre_forward_inplace(name, module_input_output) self.handle_data(name, data_info, flush=self.data_processor.is_terminated) - def forward_data_collect(self, name, module, pid, module_input_output): + def forward_output_data_collect(self, name, module, pid, module_input_output, is_recompute=None): self.update_construct(name) if not self.check_scope_and_pid(self.scope, name, pid): return - if not self.is_inplace(module): - data_info = self.data_processor.analyze_forward(name, module, module_input_output) - else: - data_info = self.data_processor.analyze_forward_inplace(name, module_input_output) - if self.config.level == "L2": + data_info = {} + if self.config.task != Const.STRUCTURE: 
+ data_info = self.data_processor.analyze_forward_output(name, module, module_input_output) + self.set_is_recomputable(data_info, is_recompute) + if self.config.level == Const.LEVEL_L2: + return + self.data_writer.update_stack(self.data_processor.analyze_api_call_stack(name)) + self.handle_data(name, data_info, flush=self.data_processor.is_terminated) + + def forward_data_collect(self, name, module, pid, module_input_output, is_recompute=None): + self.update_construct(name) + if not self.check_scope_and_pid(self.scope, name, pid): return + + data_info = {} + if self.config.task != Const.STRUCTURE: + data_info = self.data_processor.analyze_forward(name, module, module_input_output) + self.set_is_recomputable(data_info, is_recompute) self.data_writer.update_stack(self.data_processor.analyze_api_call_stack(name)) self.handle_data(name, data_info, flush=self.data_processor.is_terminated) - def backward_data_collect(self, name, module, pid, module_input_output): + def backward_data_collect(self, name, module, pid, module_input_output, is_recompute=None): self.update_construct(name) if not self.check_scope_and_pid(self.scope, name, pid): return - data_info = self.data_processor.analyze_backward(name, module, module_input_output) + data_info = {} + if self.config.task != Const.STRUCTURE: + data_info = self.data_processor.analyze_backward(name, module, module_input_output) + if self.config.level == Const.LEVEL_L2: + return + # get the name of the module that ran backward + if data_info and name.split(Const.SEP)[0] in Const.MODULE_PREFIX: + module_name = name.rsplit(Const.SEP, 2)[0] + # add the module name to the backward-module set, used during gradient collection to decide whether gradients need to be collected + self.backward_module_names[module_name] = True self.handle_data(name, data_info, flush=self.data_processor.is_terminated) - def backward_input_data_collect(self, name, module, pid, module_input_output): + def backward_input_data_collect(self, name, module, pid, module_input_output, is_recompute=None): self.update_construct(name) if not self.check_scope_and_pid(self.scope, name, pid): 
return - data_info = self.data_processor.analyze_backward_input(name, module, module_input_output) + data_info = {} + if self.config.task != Const.STRUCTURE: + data_info = self.data_processor.analyze_backward_input(name, module, module_input_output) + self.set_is_recomputable(data_info, is_recompute) self.handle_data(name, data_info) - def backward_output_data_collect(self, name, module, pid, module_input_output): + def backward_output_data_collect(self, name, module, pid, module_input_output, is_recompute=None): self.update_construct(name) if not self.check_scope_and_pid(self.scope, name, pid): return - data_info = self.data_processor.analyze_backward_output(name, module, module_input_output) + data_info = {} + if self.config.task != Const.STRUCTURE: + data_info = self.data_processor.analyze_backward_output(name, module, module_input_output) + self.set_is_recomputable(data_info, is_recompute) self.handle_data(name, data_info) def update_construct(self, name): if self.config.level not in DataCollector.level_without_construct: - self.data_writer.update_construct({name: self.module_processor.api_parent_node}) + if self.optimizer_status in [Const.OPTIMIZER, Const.CLIP_GRAD]: + if self.optimizer_status_first_start[self.optimizer_status]: + self.data_writer.update_construct({self.optimizer_status: None}) + self.optimizer_status_first_start[self.optimizer_status] = False + self.data_writer.update_construct({name: self.optimizer_status}) + else: + self.data_writer.update_construct({name: self.module_processor.api_parent_node}) self.data_writer.update_construct(self.module_processor.module_node) def handle_data(self, name, data_info, flush=False): if data_info: self.update_data(name, data_info) + if self.config.async_dump: + return if not flush: self.data_writer.flush_data_periodically() else: @@ -147,7 +195,36 @@ class DataCollector: def update_dump_paths(self, *args): self.data_writer.update_dump_paths(*args) - self.data_writer.initialize_json_file(task=self.config.task, 
level=self.config.level) + + def initialize_json_file(self, framework=Const.UNKNOWN_FRAMEWORK): + self.data_writer.initialize_json_file(task=self.config.task, level=self.config.level, framework=framework) def update_iter(self, current_iter): self.data_processor.update_iter(current_iter) + + def params_data_collect(self, name, param_name, pid, data): + grad_name = name + Const.SEP + Const.PARAMS_GRAD + # check scope and pid, and whether the current name has gone through backward computation + if not self.check_scope_and_pid(self.scope, name, pid) and not self.backward_module_names.get(name): + # if there was no backward computation, clear the previously written placeholder grad data + if self.data_writer.cache_data.get("data"): + self.data_writer.cache_data.get("data").pop(grad_name, None) + return + data_info = self.data_processor.analyze_params(grad_name, param_name, data) + self.handle_data(grad_name, data_info, flush=self.data_processor.is_terminated) + + def fill_stack_tensor_data(self): + self.data_writer.fill_stack_tensor_data() + + def debug_data_collect_forward(self, variable, name_with_count): + + data_info = self.data_processor.analyze_debug_forward(variable, name_with_count) + self.data_writer.update_debug({name_with_count: data_info}) + + def debug_data_collect_backward(self, variable, grad_name_with_count): + # prepare all None nested data structure + all_none_data_info = self.data_processor.analyze_element_to_all_none(variable) + self.data_writer.update_debug({grad_name_with_count: all_none_data_info}) + + # register tensor backward hook + self.data_processor.analyze_debug_backward(variable, grad_name_with_count, self.data_writer.cache_debug['data']) diff --git a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/base.py b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/base.py index 3fa0c09a76ae355ddf2ca8adfd1fc6d0ba08de9e..775a80b2418ef356867228b4ca09fad8c86cce25 100644 --- a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/base.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/base.py @@ -1,7 +1,7 @@ # Copyright (c) 
2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -15,10 +15,14 @@ import inspect import os -from dataclasses import dataclass +from dataclasses import dataclass, is_dataclass from typing import Tuple, Dict, Optional, Any +from functools import partial +import copy +from typing import Union import numpy as np + from msprobe.core.common.const import Const from msprobe.core.common.log import logger from msprobe.core.common.utils import convert_tuple, CompareException @@ -38,9 +42,8 @@ class ModuleForwardInputsOutputs: def output_tuple(self): return convert_tuple(self.output) - def concat_args_and_kwargs(self): - args = self.args + tuple(self.kwargs.values()) - return args + def update_output_with_args_and_kwargs(self): + self.output = self.args + tuple(self.kwargs.values()) @dataclass @@ -76,17 +79,18 @@ class ModuleBackwardOutputs: class TensorStatInfo: - def __init__(self, max_val=None, min_val=None, mean_val=None, norm_val=None): + def __init__(self, max_val=None, min_val=None, mean_val=None, norm_val=None, stack_tensor_stat=None): self.max = max_val self.min = min_val self.mean = mean_val self.norm = norm_val + self.stack_tensor_stat = stack_tensor_stat class BaseDataProcessor: _recursive_key_stack = [] special_type = ( - np.integer, np.floating, np.bool_, np.complexfloating, np.str_, np.byte, np.unicode_, + np.integer, np.floating, np.bool_, np.complexfloating, np.str_, np.byte, np.unicode_, np.ndarray, bool, int, float, str, slice, type(Ellipsis) ) @@ -101,6 +105,9 @@ class BaseDataProcessor: self.current_iter = 0 self._return_forward_new_output = False self._forward_new_output = None + self.save_name = None + if hasattr(config, "data_mode"): + self.allowed_data_mode = 
self._get_allowed_data_mode(config.data_mode) @property def data_path(self): @@ -139,6 +146,37 @@ class BaseDataProcessor: else: return data + @staticmethod + def set_value_into_nested_structure(data_structure, indexes, value): + ''' + Args: + data_structure: nested data structure + indexes: List + value: value to be set + ''' + if not indexes: + raise ValueError("set_value_into_nested_structure failed: " + "indexes need to be non empty when set value to nested data structure") + current_level = data_structure + for i, index in enumerate(indexes): + valid_for_list = isinstance(current_level, list) and isinstance(index, int) and len(current_level) > index + valid_for_dict = isinstance(current_level, dict) and index in current_level + is_last = i == len(indexes) - 1 + if valid_for_dict or valid_for_list: + if is_last: + try: + current_level[index] = value + except Exception as e: + raise IndexError("set_value_into_nested_structure failed: passed indexes wrong") from e + else: + try: + current_level = current_level[index] + except Exception as e: + raise IndexError("set_value_into_nested_structure failed: passed indexes wrong") from e + else: + raise ValueError("set_value_into_nested_structure failed: " + "invalid data_structure type or invalid index") + @staticmethod def _convert_numpy_to_builtin(arg): type_mapping = { @@ -179,41 +217,94 @@ class BaseDataProcessor: return single_arg @staticmethod - def _analyze_numpy(value, numpy_type): - return {"type": numpy_type, "value": value} + def _analyze_numpy(ndarray, numpy_type): + ndarray_json = {} + ndarray_json.update({'type': 'numpy.ndarray'}) + ndarray_json.update({'dtype': str(ndarray.dtype)}) + ndarray_json.update({'shape': ndarray.shape}) + if ndarray.size > 0: + ndarray_json.update({"Max": np.max(ndarray).item()}) + ndarray_json.update({"Min": np.min(ndarray).item()}) + ndarray_json.update({"Mean": np.mean(ndarray).item()}) + ndarray_json.update({"Norm": np.linalg.norm(ndarray).item()}) + else: + 
ndarray_json.update({"Max": None}) + ndarray_json.update({"Min": None}) + ndarray_json.update({"Mean": None}) + ndarray_json.update({"Norm": None}) + return ndarray_json + + @staticmethod + def _get_allowed_data_mode(data_mode): + if Const.ALL in data_mode: + allowed_data_mode = [Const.FORWARD, Const.BACKWARD, Const.INPUT, Const.OUTPUT] + else: + allowed_data_mode = list(set(data_mode)) + if Const.FORWARD not in allowed_data_mode and Const.BACKWARD not in allowed_data_mode: + allowed_data_mode += [Const.FORWARD, Const.BACKWARD] + if Const.INPUT not in allowed_data_mode and Const.OUTPUT not in allowed_data_mode: + allowed_data_mode += [Const.INPUT, Const.OUTPUT] + return allowed_data_mode @classmethod def get_special_types(cls): return cls.special_type @classmethod - def recursive_apply_transform(cls, args, transform, depth=0): + def recursive_apply_transform(cls, args, transform, depth=0) -> Union[dict, list, None]: if depth > Const.MAX_DEPTH: logger.error(f"The maximum depth of recursive transform, {Const.MAX_DEPTH} is reached.") raise CompareException(CompareException.RECURSION_LIMIT_ERROR) if isinstance(args, cls.get_special_types()): arg_transform = transform(args, cls._recursive_key_stack) return arg_transform + elif isinstance(args, tuple) and hasattr(args, '_fields'): + # namedtuple to dict + args_dict = {field: getattr(args, field) for field in args._fields} + return cls.apply_transform_dict(args_dict, transform, depth) + elif is_dataclass(args): + # dataclass to dict + args_dict = {field: getattr(args, field) for field in args.__dataclass_fields__} + return cls.apply_transform_dict(args_dict, transform, depth) elif isinstance(args, (list, tuple)): - result_list = [] - for i, arg in enumerate(args): - cls._recursive_key_stack.append(str(i)) - result_list.append(cls.recursive_apply_transform(arg, transform, depth=depth + 1)) - cls._recursive_key_stack.pop() - return type(args)(result_list) + result_list = cls.apply_transform_list(args, transform, depth) + 
return result_list elif isinstance(args, dict): - result_dict = {} - for k, arg in args.items(): - cls._recursive_key_stack.append(str(k)) - result_dict[k] = cls.recursive_apply_transform(arg, transform, depth=depth + 1) - cls._recursive_key_stack.pop() - return result_dict + return cls.apply_transform_dict(args, transform, depth) elif args is not None: - logger.warning(f"Data type {type(args)} is not supported.") + logger.debug(f"Data type {type(args)} is not supported.") return None else: return None + @classmethod + def apply_transform_dict(cls, args, transform, depth): + result_dict = {} + for k, arg in args.items(): + cls._recursive_key_stack.append(k) + result_dict[k] = cls.recursive_apply_transform(arg, transform, depth=depth + 1) + cls._recursive_key_stack.pop() + return result_dict + + @classmethod + def apply_transform_list(cls, args, transform, depth): + result_list = [] + for i, arg in enumerate(args): + cls._recursive_key_stack.append(i) + result_list.append(cls.recursive_apply_transform(arg, transform, depth=depth + 1)) + cls._recursive_key_stack.pop() + return result_list + + @classmethod + def register_hook_single_element(cls, element, suffix_stack, hook_fn): + if cls.is_hookable_element(element): + indexes = copy.deepcopy(suffix_stack) + wrap_hook_fn = partial(hook_fn, indexes=indexes) + + def real_hook_fn(grad): + return wrap_hook_fn(grad) + element.register_hook(real_hook_fn) + def if_return_forward_new_output(self): return self._return_forward_new_output @@ -239,17 +330,12 @@ class BaseDataProcessor: Return: bool: True if the parameters are in data_mode or data_mode is all, False otherwise. 
""" - return (Const.ALL in self.config.data_mode or - forward_backward in self.config.data_mode or - input_output in self.config.data_mode) - - def analyze_pre_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): - pass + return forward_backward in self.allowed_data_mode and input_output in self.allowed_data_mode def analyze_element(self, element): return self.recursive_apply_transform(element, self.analyze_single_element) - def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + def analyze_forward_input(self, name, module, module_input_output: ModuleForwardInputsOutputs): api_info_struct = {} # check whether data_mode contains forward or input if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): @@ -261,16 +347,22 @@ class BaseDataProcessor: kwargs_info_list = self.analyze_element(module_input_output.kwargs) api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list - # check whether data_mode contains forward or output + return api_info_struct + + def analyze_forward_output(self, name, module, module_input_output: ModuleForwardInputsOutputs): + api_info_struct = {} + # check whether data_mode contains forward or input if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): - api_info_struct[name] = api_info_struct.get(name, {}) + api_info_struct[name] = {} self.api_data_category = Const.OUTPUT output_info_list = self.analyze_element(module_input_output.output_tuple) api_info_struct[name][Const.OUTPUT] = output_info_list + return api_info_struct - def analyze_pre_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): + def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): api_info_struct = {} + # check whether data_mode contains forward or input if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): api_info_struct[name] = {} self.api_data_category = Const.INPUT @@ -279,16 +371,18 @@ class BaseDataProcessor: 
self.api_data_category = Const.KWARGS kwargs_info_list = self.analyze_element(module_input_output.kwargs) api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list - return api_info_struct - def analyze_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): - concat_args = module_input_output.concat_args_and_kwargs() - api_info_struct = {} + # check whether data_mode contains forward or output if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): - api_info_struct[name] = {} + api_info_struct[name] = api_info_struct.get(name, {}) self.api_data_category = Const.OUTPUT - output_info_list = self.analyze_element(concat_args) + output_info_list = self.analyze_element(module_input_output.output_tuple) api_info_struct[name][Const.OUTPUT] = output_info_list + + if name in api_info_struct and hasattr(module_input_output, Const.PARAMS): + self.api_data_category = Const.PARAMS + api_info_struct[name][Const.PARAMS] = self.analyze_element(getattr(module_input_output, Const.PARAMS)) + return api_info_struct def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): @@ -329,9 +423,47 @@ class BaseDataProcessor: api_info_struct[name][Const.OUTPUT] = output_info_list return api_info_struct + def analyze_params(self, name, param_name, grad): + api_info_struct = {} + self.save_name = name + Const.SEP + param_name + data_info = self.analyze_element(grad) + grad_info_dict = {param_name: [data_info]} + api_info_struct[name] = grad_info_dict + return api_info_struct + def get_save_file_path(self, suffix): file_format = Const.PT_SUFFIX if self.config.framework == Const.PT_FRAMEWORK else Const.NUMPY_SUFFIX - dump_data_name = (self.current_api_or_module_name + Const.SEP + self.api_data_category + Const.SEP + - suffix + file_format) + if self.save_name is not None: + dump_data_name = (self.save_name + file_format) + self.save_name = None + else: + dump_data_name = (self.current_api_or_module_name + Const.SEP + 
self.api_data_category + Const.SEP + + suffix + file_format) file_path = os.path.join(self.data_writer.dump_tensor_data_dir, dump_data_name) return dump_data_name, file_path + + def analyze_element_to_all_none(self, element): + return self.recursive_apply_transform(element, lambda element, stack: None) + + def analyze_debug_forward(self, variable, name_with_count): + self.current_api_or_module_name = name_with_count + self.api_data_category = Const.TENSOR + # these two attributes are used to construct tensor file name {name_with_count}.tensor.{indexes}.npy/pt + data_info = self.analyze_element(variable) + return data_info + + def analyze_debug_backward(self, variable, grad_name_with_count, nested_data_structure): + def hook_fn(grad, indexes): + suffix = Const.SEP.join([str(index) for index in indexes]) + self.save_name = grad_name_with_count + Const.SEP + Const.TENSOR + Const.SEP + suffix + grad_data_info = self.analyze_element(grad) + self.save_name = None + full_index = [grad_name_with_count] + indexes + try: + self.set_value_into_nested_structure(nested_data_structure, full_index, grad_data_info) + except (ValueError, IndexError) as e: + logger.warning(f"error occurred while recording statistics of {grad_name_with_count} variable, " + f"skip current recording, detailed information: {e}") + return grad + wrap_register_hook_single_element = partial(self.register_hook_single_element, hook_fn=hook_fn) + self.recursive_apply_transform(variable, wrap_register_hook_single_element) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/factory.py b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/factory.py index 381789eb285bf2136d90fb454d6218a25a412fc8..83f3c717e88f018b4ecef9ec4e2a5edec3e56c4f 100644 --- a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/factory.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/factory.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., 
Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,6 +14,7 @@ # limitations under the License. from msprobe.core.common.const import Const +from msprobe.core.data_dump.data_processor.base import BaseDataProcessor class DataProcessorFactory: @@ -56,21 +57,25 @@ class DataProcessorFactory: FreeBenchmarkDataProcessor as PytorchFreeBenchmarkDataProcessor, KernelDumpDataProcessor as PytorchKernelDumpDataProcessor ) - from msprobe.pytorch.module_processer import ModuleProcesser + from msprobe.pytorch.dump.module_dump.module_processer import ModuleProcesser cls.register_processor(Const.PT_FRAMEWORK, Const.STATISTICS, PytorchStatisticsDataProcessor) cls.register_processor(Const.PT_FRAMEWORK, Const.TENSOR, PytorchTensorDataProcessor) cls.register_processor(Const.PT_FRAMEWORK, Const.OVERFLOW_CHECK, PytorchOverflowCheckDataProcessor) cls.register_processor(Const.PT_FRAMEWORK, Const.FREE_BENCHMARK, PytorchFreeBenchmarkDataProcessor) cls.register_processor(Const.PT_FRAMEWORK, Const.KERNEL_DUMP, PytorchKernelDumpDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.STRUCTURE, BaseDataProcessor) cls.register_module_processor(Const.PT_FRAMEWORK, ModuleProcesser) elif framework == Const.MS_FRAMEWORK: from msprobe.core.data_dump.data_processor.mindspore_processor import ( StatisticsDataProcessor as MindsporeStatisticsDataProcessor, TensorDataProcessor as MindsporeTensorDataProcessor, - OverflowCheckDataProcessor as MindsporeOverflowCheckDataProcessor + OverflowCheckDataProcessor as MindsporeOverflowCheckDataProcessor, + KernelDumpDataProcessor as MindsporeKernelDumpDataProcessor ) from msprobe.mindspore.cell_processor import CellProcessor cls.register_processor(Const.MS_FRAMEWORK, Const.STATISTICS, MindsporeStatisticsDataProcessor) cls.register_processor(Const.MS_FRAMEWORK, Const.TENSOR, MindsporeTensorDataProcessor) cls.register_processor(Const.MS_FRAMEWORK, 
Const.OVERFLOW_CHECK, MindsporeOverflowCheckDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.KERNEL_DUMP, MindsporeKernelDumpDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.STRUCTURE, BaseDataProcessor) cls.register_module_processor(Const.MS_FRAMEWORK, CellProcessor) diff --git a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/mindspore_processor.py b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/mindspore_processor.py index 3a931f487929c1bae1d59bc997b9fd7c8b66a5e5..8c4542a1917b76809aad21971e148ec17bd6045e 100644 --- a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/mindspore_processor.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/mindspore_processor.py @@ -1,4 +1,4 @@ -# Copyright 2024 Huawei Technologies Co., Ltd +# Copyright 2024-2025 Huawei Technologies Co., Ltd # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -16,18 +16,24 @@ import zlib import mindspore as ms -from mindspore import mint, ops +from mindspore import mint, ops, hal from mindspore._c_expression.typing import Number import numpy as np from msprobe.core.common.const import Const from msprobe.core.data_dump.data_processor.base import (BaseDataProcessor, TensorStatInfo, ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs) -from msprobe.core.common.file_utils import path_len_exceeds_limit +from msprobe.core.common.file_utils import path_len_exceeds_limit, save_npy from msprobe.mindspore.common.utils import convert_bf16_to_fp32, save_tensor_as_npy from msprobe.mindspore.common.log import logger from msprobe.mindspore.dump.hook_cell.api_registry import api_register +has_adump = True +try: + from msprobe.lib import _msprobe_c +except ImportError: + has_adump = False + class MindsporeDataProcessor(BaseDataProcessor): mindspore_special_type = tuple([ms.Tensor, Number]) @@ -37,11 +43,12 @@ class MindsporeDataProcessor(BaseDataProcessor): self.mindspore_object_key = { "dtype": self.analyze_dtype_in_kwargs } + self._async_dump_cache = {} @staticmethod def get_md5_for_tensor(x): x = convert_bf16_to_fp32(x) - tensor_bytes = x.contiguous().asnumpy().tobytes() + tensor_bytes = x.asnumpy().tobytes() crc32_hash = zlib.crc32(tensor_bytes) return f"{crc32_hash:08x}" @@ -49,28 +56,23 @@ class MindsporeDataProcessor(BaseDataProcessor): def analyze_dtype_in_kwargs(element): return {"type": "mindspore.dtype", "value": str(element)} - @classmethod - def get_special_types(cls): - return super().get_special_types() + cls.mindspore_special_type - - def get_stat_info(self, data): + @staticmethod + def get_stat_info_sync(data): tensor_stat = TensorStatInfo() - if data.numel() == 0: - return tensor_stat - elif data.dtype == ms.bool_: - data_np = data.contiguous().asnumpy() + if data.dtype == ms.bool_: + data_np = data.asnumpy() tensor_stat.max = np.max(data_np).item() tensor_stat.min = np.min(data_np).item() elif not data.shape: 
tensor_stat.max = tensor_stat.min = tensor_stat.mean = tensor_stat.norm = data.item() elif data.dtype == ms.complex64 or data.dtype == ms.complex128: - data_abs = np.abs(data.contiguous().asnumpy()) + data_abs = np.abs(data.asnumpy()) tensor_stat.max = np.max(data_abs).item() tensor_stat.min = np.min(data_abs).item() tensor_stat.mean = np.mean(data_abs).item() tensor_stat.norm = np.linalg.norm(data_abs).item() else: - if not ops.is_floating_point(data): + if not ops.is_floating_point(data) or data.dtype == ms.float64: data = data.to(ms.float32) api_register.norm_inner_op_set_ori_func() get_max_value = api_register.mint_ops_ori_attr.get("max", mint.max) @@ -87,17 +89,64 @@ class MindsporeDataProcessor(BaseDataProcessor): api_register.norm_inner_op_set_hook_func() return tensor_stat + @staticmethod + def get_stat_info_async(data): + tensor_stat = TensorStatInfo() + stack_method = api_register.functional_ori_attr.get("stack", ms.ops.stack) + if data.dtype == ms.complex64 or data.dtype == ms.complex128: + logger.warning("Async dump do not support complex data!") + return tensor_stat + elif data.dtype == ms.bool_: + tensor_stat.stack_tensor_stat = (["Max", "Min"], stack_method([data.any(), data.all()])) + elif not data.shape: + tensor_stat.stack_tensor_stat = (["Max", "Min", "Mean", "Norm"], stack_method([data, data, data, data])) + else: + if not ops.is_floating_point(data) or data.dtype == ms.float64: + data = data.to(ms.float32) + api_register.norm_inner_op_set_ori_func() + get_max_value = api_register.mint_ops_ori_attr.get("max", mint.max) + get_min_value = api_register.mint_ops_ori_attr.get("min", mint.min) + get_mean_value = api_register.mint_ops_ori_attr.get("mean", mint.mean) + if hasattr(mint, "norm"): + get_norm_value = api_register.mint_ops_ori_attr.get("norm", mint.norm) + else: + get_norm_value = api_register.functional_ori_attr.get("norm", ops.norm) + tensor_stat.stack_tensor_stat = (["Max", "Min", "Mean", "Norm"], stack_method( + [get_max_value(data), 
get_min_value(data), get_mean_value(data), get_norm_value(data)])) + api_register.norm_inner_op_set_hook_func() + return tensor_stat + + @staticmethod + def is_hookable_element(element): + return hasattr(element, "register_hook") and callable(element.register_hook) + + @classmethod + def get_special_types(cls): + return super().get_special_types() + cls.mindspore_special_type + + def get_stat_info(self, data): + tensor_stat = TensorStatInfo() + if data.numel() == 0: + return tensor_stat + else: + if self.config.async_dump: + return MindsporeDataProcessor.get_stat_info_async(data) + else: + return MindsporeDataProcessor.get_stat_info_sync(data) + def analyze_single_element(self, element, suffix_stack): if suffix_stack and suffix_stack[-1] in self.mindspore_object_key: return self.mindspore_object_key[suffix_stack[-1]](element) converted_numpy, numpy_type = self._convert_numpy_to_builtin(element) if converted_numpy is not element: - return self._analyze_numpy(converted_numpy, numpy_type) + return {"type": numpy_type, "value": converted_numpy} if isinstance(element, Number): return self.analyze_dtype_in_kwargs(element) if isinstance(element, ms.Tensor): - return self._analyze_tensor(element, Const.SEP.join(suffix_stack)) + return self._analyze_tensor(element, Const.SEP.join([str(suffix) for suffix in suffix_stack])) + if isinstance(element, np.ndarray): + return self._analyze_numpy(element, Const.SEP.join([str(suffix) for suffix in suffix_stack])) if isinstance(element, (bool, int, float, str, slice, type(Ellipsis))): return self._analyze_builtin(element) return {} @@ -107,13 +156,17 @@ class MindsporeDataProcessor(BaseDataProcessor): tensor_json = { 'type': 'mindspore.Tensor', 'dtype': str(tensor.dtype), - 'shape': tensor.shape, - 'Max': self.transfer_type(tensor_stat.max), - 'Min': self.transfer_type(tensor_stat.min), - 'Mean': self.transfer_type(tensor_stat.mean), - 'Norm': self.transfer_type(tensor_stat.norm), + 'shape': tensor.shape } - if 
self.config.summary_mode == Const.MD5: + + if tensor_stat.stack_tensor_stat is None: + tensor_json.update({'Max': self.transfer_type(tensor_stat.max)}) + tensor_json.update({'Min': self.transfer_type(tensor_stat.min)}) + tensor_json.update({'Mean': self.transfer_type(tensor_stat.mean)}) + tensor_json.update({'Norm': self.transfer_type(tensor_stat.norm)}) + else: + tensor_json.update({'tensor_stat': tensor_stat.stack_tensor_stat}) + if self.config.summary_mode == Const.MD5 and not self.config.async_dump: tensor_md5 = self.get_md5_for_tensor(tensor) tensor_json.update({Const.MD5: tensor_md5}) return tensor_json @@ -124,12 +177,27 @@ class StatisticsDataProcessor(MindsporeDataProcessor): class TensorDataProcessor(MindsporeDataProcessor): + def dump_async_data(self): + for file_path, tensor in self._async_dump_cache.items(): + save_tensor_as_npy(tensor, file_path) + self._async_dump_cache.clear() + def _analyze_tensor(self, tensor, suffix): dump_data_name, file_path = self.get_save_file_path(suffix) single_arg = super()._analyze_tensor(tensor, suffix) single_arg.update({"data_name": dump_data_name}) - save_tensor_as_npy(tensor, file_path) + if self.config.async_dump: + self._async_dump_cache[file_path] = tensor.copy() + else: + save_tensor_as_npy(tensor, file_path) return single_arg + + def _analyze_numpy(self, ndarray, suffix): + dump_data_name, file_path = self.get_save_file_path(suffix) + save_npy(ndarray, file_path) + ndarray_json = super()._analyze_numpy(ndarray, suffix) + ndarray_json.update({"data_name": dump_data_name}) + return ndarray_json class OverflowCheckDataProcessor(MindsporeDataProcessor): @@ -138,6 +206,7 @@ class OverflowCheckDataProcessor(MindsporeDataProcessor): def __init__(self, config, data_writer): super().__init__(config, data_writer) self.has_overflow = False + self.cached_api_info = {} self.cached_tensors_and_file_paths = {} self.real_overflow_nums = 0 self.overflow_nums = config.overflow_nums @@ -150,6 +219,20 @@ class 
OverflowCheckDataProcessor(MindsporeDataProcessor): return True return False + def analyze_forward_input(self, name, module, module_input_output: ModuleForwardInputsOutputs): + self.has_overflow = False + self.cached_api_info = super().analyze_forward_input(name, module, module_input_output) + return None + + def analyze_forward_output(self, name, module, module_input_output: ModuleForwardInputsOutputs): + api_info_struct = super().analyze_forward_output(name, module, module_input_output) + if name in self.cached_api_info and name in api_info_struct: + self.cached_api_info[name].update(api_info_struct[name]) + elif name in api_info_struct: + self.cached_api_info = api_info_struct + self.maybe_save_overflow_data() + return self.cached_api_info if self.has_overflow else None + def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): self.has_overflow = False api_info_struct = super().analyze_forward(name, module, module_input_output) @@ -162,6 +245,12 @@ class OverflowCheckDataProcessor(MindsporeDataProcessor): self.maybe_save_overflow_data() return api_info_struct if self.has_overflow else None + def analyze_params(self, name, param_name, grad): + self.has_overflow = False + api_info_struct = super().analyze_params(name, param_name, grad) + self.maybe_save_overflow_data() + return api_info_struct if self.has_overflow else None + def maybe_save_overflow_data(self): if self.has_overflow: for file_path, tensor in self.cached_tensors_and_file_paths.items(): @@ -190,3 +279,61 @@ class OverflowCheckDataProcessor(MindsporeDataProcessor): self._analyze_maybe_overflow_tensor(single_arg) single_arg.update({"data_name": dump_data_name}) return single_arg + + +class KernelDumpDataProcessor(MindsporeDataProcessor): + def __init__(self, config, data_writer): + super().__init__(config, data_writer) + self.enable_kernel_dump = True + + @staticmethod + def start_kernel_dump(config_path): + hal.synchronize() + _msprobe_c.init_dump() + 
_msprobe_c.set_dump(config_path) + hal.synchronize() + + @staticmethod + def stop_kernel_dump(): + hal.synchronize() + _msprobe_c.finalize_dump() + hal.synchronize() + + @staticmethod + def _print_unsupported_log(api_name): + logger.warning(f"The kernel dump does not support the {api_name} API.") + + def analyze_forward_input(self, name, module, module_input_output): + if not self.enable_kernel_dump: + return + if not has_adump: + logger.warning("The current msprobe package was not compiled with adump, so kernel dump cannot be used.") + self.enable_kernel_dump = False + return + self.start_kernel_dump(self.config.kernel_config_path) + + def analyze_forward_output(self, name, module, module_input_output): + if not self.enable_kernel_dump: + return + self.enable_kernel_dump = False + self.stop_kernel_dump() + logger.info(f"The kernel data of {name} is dumped successfully.") + + def analyze_backward_input(self, name, module, module_input_output): + if not self.enable_kernel_dump: + return + if not has_adump: + logger.warning("The current msprobe package was not compiled with adump, so kernel dump cannot be used.") + self.enable_kernel_dump = False + return + self.start_kernel_dump(self.config.kernel_config_path) + + def analyze_backward(self, name, module, module_input_output): + if not self.enable_kernel_dump: + return + self.enable_kernel_dump = False + self.stop_kernel_dump() + logger.info(f"The kernel data of {name} is dumped successfully.") + + def reset_status(self): + self.enable_kernel_dump = True diff --git a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/pytorch_processor.py b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/pytorch_processor.py index 3dbbdee2fcd50ca59455816275c16bc540e8214f..2cd98b125682434b517f6d70e09ea6a850b3e3bb 100644 --- a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/pytorch_processor.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/pytorch_processor.py @@ -1,4 +1,4 @@ -# Copyright (c) 
2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -13,19 +13,25 @@ # See the License for the specific language governing permissions and # limitations under the License. +import hashlib import zlib from dataclasses import asdict from typing import List import numpy as np import torch +from torch import distributed as dist +from torch.distributed.distributed_c10d import _get_default_group + from msprobe.core.common.const import Const from msprobe.core.common.file_utils import path_len_exceeds_limit from msprobe.core.common.log import logger +from msprobe.core.common.utils import convert_tuple from msprobe.core.data_dump.data_processor.base import BaseDataProcessor, ModuleBackwardInputsOutputs, \ ModuleForwardInputsOutputs, TensorStatInfo from msprobe.pytorch.common.utils import save_pt, load_pt from msprobe.pytorch.free_benchmark import FreeBenchmarkCheck, UnequalRow +from msprobe.core.common.utils import recursion_depth_decorator is_gpu = False try: @@ -35,7 +41,22 @@ except ImportError: class PytorchDataProcessor(BaseDataProcessor): - pytorch_special_type = (torch.device, torch.dtype, torch.Size, torch.Tensor) + pytorch_special_type = ( + torch.device, + torch.dtype, + torch.Size, + torch.Tensor, + torch.memory_format, + dist.ProcessGroup, + dist.P2POp, + dist.ReduceOp + ) + memory_format = { + torch.contiguous_format: "contiguous_format", + torch.channels_last: "channels_last", + torch.channels_last_3d: "channels_last_3d", + torch.preserve_format: "preserve_format" + } def __init__(self, config, data_writer): super().__init__(config, data_writer) @@ -43,6 +64,7 @@ class PytorchDataProcessor(BaseDataProcessor): "device": self.analyze_device_in_kwargs, "dtype": self.analyze_dtype_in_kwargs } + self._async_dump_cache = {} @staticmethod def get_md5_for_tensor(x): @@ -56,14 +78,16 @@ class 
PytorchDataProcessor(BaseDataProcessor): def analyze_device_in_kwargs(element): single_arg = {} single_arg.update({'type': "torch.device"}) - if not isinstance(element, str): + if isinstance(element, (int, str)): + single_arg.update({"value": element}) + elif isinstance(element, torch.device): if hasattr(element, "index"): device_value = element.type + ":" + str(element.index) else: device_value = element.type single_arg.update({"value": device_value}) else: - single_arg.update({"value": element}) + logger.debug(f"Device type {type(element)} is not supported.") return single_arg @staticmethod @@ -71,54 +95,127 @@ class PytorchDataProcessor(BaseDataProcessor): return {"type": "torch.dtype", "value": str(element)} @staticmethod - def get_stat_info(data): + def get_stat_info_async(data): tensor_stat = TensorStatInfo() - if data.is_meta: + if torch.is_complex(data): + logger.warning("Async dump do not support complex data!") return tensor_stat - data_clone = data.detach() - if data_clone.numel() == 0: - return tensor_stat - elif data_clone.dtype == torch.bool: - tensor_stat.max = torch._C._VariableFunctionsClass.any(data_clone).item() - tensor_stat.min = torch._C._VariableFunctionsClass.all(data_clone).item() - elif not data_clone.shape: - tensor_stat.max = tensor_stat.min = tensor_stat.mean = tensor_stat.norm = data_clone.item() - elif torch.is_complex(data_clone): - data_np = data_clone.cpu().numpy() + elif data.dtype == torch.bool: + tensor_stat.stack_tensor_stat = (["Max", "Min"], torch.stack( + [torch.any(data), torch.all(data)])) + elif not data.shape: + tensor_stat.stack_tensor_stat = (["Max", "Min", "Mean", "Norm"], torch.stack([data, data, data, data])) + else: + if not data.is_floating_point() or data.dtype == torch.float64: + data = data.float() + tensor_stat.stack_tensor_stat = (["Max", "Min", "Mean", "Norm"], torch.stack([ + torch.max(data), + torch.min(data), + torch.mean(data), + torch.norm(data) + ])) + return tensor_stat + + @staticmethod + def 
get_stat_info_sync(data): + tensor_stat = TensorStatInfo() + if torch.is_complex(data): + data_np = data.cpu().numpy() data_abs = np.abs(data_np) tensor_stat.max = np.max(data_abs).item() tensor_stat.min = np.min(data_abs).item() tensor_stat.mean = np.mean(data_abs).item() + elif data.dtype == torch.bool: + tensor_stat.max = torch.any(data).item() + tensor_stat.min = torch.all(data).item() + elif not data.shape: + tensor_stat.max = tensor_stat.min = tensor_stat.mean = tensor_stat.norm = data.item() else: - if not data_clone.is_floating_point() or data_clone.dtype == torch.float64: - data_clone = data_clone.float() - tensor_stat.max = torch._C._VariableFunctionsClass.max(data_clone).item() - tensor_stat.min = torch._C._VariableFunctionsClass.min(data_clone).item() - tensor_stat.mean = torch._C._VariableFunctionsClass.mean(data_clone).item() - tensor_stat.norm = torch._C._VariableFunctionsClass.norm(data_clone).item() + if not data.is_floating_point() or data.dtype == torch.float64: + data = data.float() + tensor_stat.max = torch.max(data).item() + tensor_stat.min = torch.min(data).item() + tensor_stat.mean = torch.mean(data).item() + tensor_stat.norm = torch.norm(data).item() return tensor_stat + @staticmethod + def get_stat_info(data, async_dump=False): + tensor_stat = TensorStatInfo() + if data.is_meta: + return tensor_stat + data_clone = data.detach() + if data_clone.numel() == 0: + return tensor_stat + else: + if data_clone.device.type == Const.CPU_LOWERCASE or not async_dump: + return PytorchDataProcessor.get_stat_info_sync(data_clone) + else: + return PytorchDataProcessor.get_stat_info_async(data_clone) + @staticmethod def handle_tensor_extremum_nan_inf(tensor, operator): data_clone = tensor.detach() - data_nan = torch._C._VariableFunctionsClass.isnan(data_clone) - if int(torch._C._VariableFunctionsClass.sum(data_nan)) == data_clone.numel(): + data_nan = torch.isnan(data_clone) + if int(torch.sum(data_nan)) == data_clone.numel(): return float('nan') - 
finite_mask = torch._C._VariableFunctionsClass.isfinite(data_clone) - if int(torch._C._VariableFunctionsClass.sum(finite_mask)) > 0: - finite_values = getattr(torch._C._TensorBase, "__getitem__")(data_clone, finite_mask) - return torch._C._VariableFunctionsClass.max(finite_values).item() if operator == 'max' else \ - torch._C._VariableFunctionsClass.min(finite_values).item() + finite_mask = torch.isfinite(data_clone) + if int(torch.sum(finite_mask)) > 0: + finite_values = data_clone[finite_mask] + return torch.max(finite_values).item() if operator == 'max' else \ + torch.min(finite_values).item() else: - data_no_nan = getattr(torch._C._TensorBase, "__getitem__")(data_clone, ~data_nan) - return torch._C._VariableFunctionsClass.max(data_no_nan).item() if operator == 'max' else \ - torch._C._VariableFunctionsClass.min(data_no_nan).item() + data_no_nan = data_clone[~data_nan] + return torch.max(data_no_nan).item() if operator == 'max' else \ + torch.min(data_no_nan).item() + + @staticmethod + def process_group_hash(arg): + group_ranks = dist.get_process_group_ranks(arg) + group_ranks_hash = hashlib.md5(str(group_ranks).encode('utf-8')).hexdigest() + return group_ranks_hash + + @staticmethod + def is_distributed_op(module): + return getattr(module, "op_is_distributed", False) + + @staticmethod + def is_hookable_element(element): + return (hasattr(element, "register_hook") and callable(element.register_hook)) and \ + (hasattr(element, "requires_grad") and element.requires_grad) @staticmethod def _analyze_torch_size(arg): return {"type": "torch.Size", "value": list(arg)} + + @staticmethod + def _analyze_memory_format(arg): + # Get the memory format + format_type = PytorchDataProcessor.memory_format.get(arg) + return {"type": "torch.memory_format", "format": format_type} + + @staticmethod + def _analyze_process_group(arg): + group_info = {"type": "torch.ProcessGroup"} + try: + group_ranks = dist.get_process_group_ranks(arg) + group_info.update({"group_ranks": group_ranks}) + group_id = 
PytorchDataProcessor.process_group_hash(arg) + group_info.update({"group_id": group_id}) + except Exception as e: + logger.warning(f"Failed to get process group ranks info with error info: {e}.") + return group_info + + @staticmethod + def _analyze_reduce_op(arg): + op_type = None + try: + op_type = str(arg) + except Exception as e: + logger.warning(f"Failed to get value of torch.distributed.ReduceOp with error info: {e}.") + return {"type": "torch.distributed.ReduceOp", "value": op_type} + @classmethod def get_special_types(cls): return super().get_special_types() + cls.pytorch_special_type @@ -128,35 +225,69 @@ class PytorchDataProcessor(BaseDataProcessor): return self.torch_object_key[suffix_stack[-1]](element) if isinstance(element, torch.Size): return self._analyze_torch_size(element) + if isinstance(element, torch.memory_format): + return self._analyze_memory_format(element) + if isinstance(element, dist.ProcessGroup): + return self._analyze_process_group(element) + if isinstance(element, dist.P2POp): + return self._analyze_p2pop(element) + if isinstance(element, dist.ReduceOp): + return self._analyze_reduce_op(element) converted_numpy, numpy_type = self._convert_numpy_to_builtin(element) if converted_numpy is not element: - return self._analyze_numpy(converted_numpy, numpy_type) + return {"type": numpy_type, "value": converted_numpy} if isinstance(element, torch.Tensor): - return self._analyze_tensor(element, Const.SEP.join(suffix_stack)) + return self._analyze_tensor(element, Const.SEP.join([str(suffix) for suffix in suffix_stack])) + if isinstance(element, np.ndarray): + return self._analyze_numpy(element, Const.SEP.join([str(suffix) for suffix in suffix_stack])) if isinstance(element, (bool, int, float, str, slice, type(Ellipsis))): return self._analyze_builtin(element) return {} + def analyze_forward_output(self, name, module, module_input_output: ModuleForwardInputsOutputs): + if self.is_distributed_op(module): + 
module_input_output.update_output_with_args_and_kwargs() + return super().analyze_forward_output(name, module, module_input_output) + + def _analyze_p2pop(self, arg): + p2pop_info = {"class_type": "torch.distributed.P2POp"} + try: + tensor_info = self._analyze_tensor(arg.tensor, []) + p2pop_info.update({"tensor": tensor_info}) + p2pop_info.update({"op": arg.op.__name__}) + p2pop_info.update({"peer": arg.peer}) + p2pop_info.update({"tag": arg.tag}) + group_id = PytorchDataProcessor.process_group_hash( + arg.group) if arg.group else PytorchDataProcessor.process_group_hash(_get_default_group()) + p2pop_info.update({"group_id": group_id}) + except Exception as e: + logger.warning(f"Failed to parse the P2POp content with error info: {e}.") + return p2pop_info + def _analyze_tensor(self, tensor, suffix): - tensor_stat = self.get_stat_info(tensor) + tensor_stat = self.get_stat_info(tensor, self.config.async_dump) tensor_json = {} tensor_json.update({'type': 'torch.Tensor'}) tensor_json.update({'dtype': str(tensor.dtype)}) tensor_json.update({"shape": tensor.shape}) - tensor_json.update({"Max": tensor_stat.max}) - tensor_json.update({"Min": tensor_stat.min}) - tensor_json.update({"Mean": tensor_stat.mean}) - tensor_json.update({"Norm": tensor_stat.norm}) - tensor_json.update({"requires_grad": tensor.requires_grad}) - - if tensor_stat.max is not None: - if np.isinf(tensor_stat.max) or np.isnan(tensor_stat.max): - tensor_json['Max_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(tensor, "max") - if tensor_stat.min is not None: - if np.isinf(tensor_stat.min) or np.isnan(tensor_stat.min): - tensor_json['Min_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(tensor, "min") - - if self.config.summary_mode == Const.MD5: + if tensor_stat.stack_tensor_stat is None: + tensor_json.update({"Max": tensor_stat.max}) + tensor_json.update({"Min": tensor_stat.min}) + tensor_json.update({"Mean": tensor_stat.mean}) + tensor_json.update({"Norm": tensor_stat.norm}) + 
tensor_json.update({"requires_grad": tensor.requires_grad}) + if tensor_stat.max is not None: + if np.isinf(tensor_stat.max) or np.isnan(tensor_stat.max): + tensor_json['Max_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(tensor, "max") + if tensor_stat.min is not None: + if np.isinf(tensor_stat.min) or np.isnan(tensor_stat.min): + tensor_json['Min_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(tensor, "min") + + else: + tensor_json.update({"requires_grad": tensor.requires_grad}) + tensor_json.update({"tensor_stat": tensor_stat.stack_tensor_stat}) + + if self.config.summary_mode == Const.MD5 and not self.config.async_dump: tensor_md5 = self.get_md5_for_tensor(tensor) tensor_json.update({Const.MD5: tensor_md5}) return tensor_json @@ -167,14 +298,29 @@ class StatisticsDataProcessor(PytorchDataProcessor): class TensorDataProcessor(PytorchDataProcessor): + def dump_async_data(self): + for file_path, tensor in self._async_dump_cache.items(): + save_pt(tensor.contiguous(), file_path) + self._async_dump_cache.clear() + def _analyze_tensor(self, tensor, suffix): dump_data_name, file_path = self.get_save_file_path(suffix) - saved_tensor = tensor.clone().contiguous().detach() - save_pt(saved_tensor, file_path) single_arg = super()._analyze_tensor(tensor, suffix) single_arg.update({"data_name": dump_data_name}) + if self.config.async_dump: + self._async_dump_cache[file_path] = tensor.clone().detach() + else: + saved_tensor = tensor.clone().contiguous().detach() + save_pt(saved_tensor, file_path) return single_arg + def _analyze_numpy(self, ndarray, suffix): + dump_data_name, file_path = self.get_save_file_path(suffix) + save_pt(torch.tensor(ndarray), file_path) + ndarray_json = super()._analyze_numpy(ndarray, suffix) + ndarray_json.update({"data_name": dump_data_name}) + return ndarray_json + class OverflowCheckDataProcessor(PytorchDataProcessor): __slots__ = ["cached_tensors_and_file_paths"] @@ -183,7 +329,7 @@ class 
OverflowCheckDataProcessor(PytorchDataProcessor): super().__init__(config, data_writer) self.has_overflow = False self.support_inf_nan = None - self.cached_inplace_api_info = {} + self.cached_api_info = {} self.cached_tensors_and_file_paths = {} self.bits_for_overflow = 8 self.real_overflow_nums = 0 @@ -197,21 +343,21 @@ class OverflowCheckDataProcessor(PytorchDataProcessor): return True return False - def analyze_pre_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): + def analyze_forward_input(self, name, module, module_input_output: ModuleForwardInputsOutputs): self.has_overflow = False self._is_support_inf_nan() - self.cached_inplace_api_info = super().analyze_pre_forward_inplace(name, module_input_output) + self.cached_api_info = super().analyze_forward_input(name, module, module_input_output) return None - def analyze_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): + def analyze_forward_output(self, name, module, module_input_output: ModuleForwardInputsOutputs): self._is_support_inf_nan() - api_info_struct = super().analyze_forward_inplace(name, module_input_output) - if name in self.cached_inplace_api_info and name in api_info_struct: - self.cached_inplace_api_info[name].update(api_info_struct[name]) + api_info_struct = super().analyze_forward_output(name, module, module_input_output) + if name in self.cached_api_info and name in api_info_struct: + self.cached_api_info[name].update(api_info_struct[name]) elif name in api_info_struct: - self.cached_inplace_api_info = api_info_struct + self.cached_api_info = api_info_struct self.handle_overflow() - return self.cached_inplace_api_info if self.has_overflow else None + return self.cached_api_info if self.has_overflow else None def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): self.has_overflow = False @@ -227,6 +373,13 @@ class OverflowCheckDataProcessor(PytorchDataProcessor): self.handle_overflow() return 
api_info_struct if self.has_overflow else None + def analyze_params(self, name, param_name, grad): + self.has_overflow = False + self._is_support_inf_nan() + api_info_struct = super().analyze_params(name, param_name, grad) + self.handle_overflow() + return api_info_struct if self.has_overflow else None + def handle_overflow(self): if not self.support_inf_nan: self._analyze_maybe_overflow_flag() @@ -300,10 +453,10 @@ class FreeBenchmarkDataProcessor(PytorchDataProcessor): ) return - def analyze_pre_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + def analyze_forward_input(self, name, module, module_input_output: ModuleForwardInputsOutputs): self.checker.pre_forward(name, module, self, module_input_output.args, module_input_output.kwargs) - def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + def analyze_forward_output(self, name, module, module_input_output: ModuleForwardInputsOutputs): new_output, unequal_rows = self.checker.forward( name, module, @@ -321,64 +474,120 @@ class FreeBenchmarkDataProcessor(PytorchDataProcessor): class KernelDumpDataProcessor(PytorchDataProcessor): - forward_init_status = False - multi_output_apis = ["_sort_", "npu_flash_attention"] - def __init__(self, config, data_writer): super().__init__(config, data_writer) + self.enable_kernel_dump = True + self.is_found_output_tensor = False + self.is_found_grad_input_tensor = False + self.forward_args = None + self.forward_kwargs = None + self.forward_output_tensor = None + self.grad_input_tensor = None + + @staticmethod + def start_kernel_dump(config_path): + torch_npu.npu.synchronize() + torch_npu.npu.init_dump() + torch_npu.npu.set_dump(config_path) + torch_npu.npu.synchronize() + + @staticmethod + def stop_kernel_dump(): + torch_npu.npu.synchronize() + torch_npu.npu.finalize_dump() + torch_npu.npu.synchronize() + + @staticmethod + def _print_unsupported_log(api_name): + logger.warning(f"The kernel dump does not support the 
{api_name} API.") + + def analyze_forward_input(self, name, module, module_input_output): + if not self.enable_kernel_dump: + return + if is_gpu: + logger.warning("The current environment is not a complete NPU environment, and kernel dump cannot be used.") + self.enable_kernel_dump = False + return + + if self.config.is_backward_kernel_dump: + self.forward_args = self.clone_and_detach_tensor(module_input_output.args) + self.forward_kwargs = self.clone_and_detach_tensor(module_input_output.kwargs) + try: + output = module.forward(*self.forward_args, **self.forward_kwargs) + except Exception: + self._print_unsupported_log(name) + self.enable_kernel_dump = False + return + + self.analyze_element(convert_tuple(output)) + if not self.is_found_output_tensor: + self._print_unsupported_log(name) + self.enable_kernel_dump = False + return + self.start_kernel_dump(self.config.kernel_config_path) + + def analyze_forward_output(self, name, module, module_input_output): + if not self.enable_kernel_dump: + return + if self.config.is_backward_kernel_dump: + return + self.enable_kernel_dump = False + self.stop_kernel_dump() + logger.info(f"The kernel data of {name} is dumped successfully.") + + def analyze_backward(self, name, module, module_input_output): + if not self.enable_kernel_dump: + return + self.enable_kernel_dump = False + + self.analyze_element(module_input_output.grad_input) + if not self.is_found_grad_input_tensor: + self._print_unsupported_log(name) + return + self.start_kernel_dump(self.config.kernel_config_path) - def analyze_forward(self, name, module, module_input_output): - if self.config.is_forward_acl_dump: - self.forward_acl_dump(name, module, module_input_output) + try: + self.forward_output_tensor.backward(self.grad_input_tensor, retain_graph=True) + except Exception: + self._print_unsupported_log(name) + self.stop_kernel_dump() + return + + self.stop_kernel_dump() + logger.info(f"The kernel data of {name} is dumped successfully.") + + 
@recursion_depth_decorator("KernelDump: KernelDumpDataProcessor.clone_and_detach_tensor") + def clone_and_detach_tensor(self, input_params): + if isinstance(input_params, torch.Tensor): + if input_params.requires_grad: + return input_params.clone().detach().requires_grad_() + return input_params.clone() + elif isinstance(input_params, tuple): + return tuple(self.clone_and_detach_tensor(x) for x in input_params) + elif isinstance(input_params, list): + return list(self.clone_and_detach_tensor(x) for x in input_params) + elif isinstance(input_params, dict): + return {k: self.clone_and_detach_tensor(v) for k, v in input_params.items()} else: - self.dump_mode_backward_acl_dump(name, module, module_input_output) - - def forward_acl_dump(self, name, module, module_input_output): - if not KernelDumpDataProcessor.forward_init_status: - KernelDumpDataProcessor.forward_init_status = True - torch_npu.npu.synchronize() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(self.config.acl_config) - torch_npu.npu.synchronize() - if self.op_need_trigger(name): - module.forward(*module_input_output.args, **module_input_output.kwargs).cpu() - else: - module.forward(*module_input_output.args, **module_input_output.kwargs) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - torch_npu.npu.synchronize() - KernelDumpDataProcessor.forward_init_status = False - logger.info("Dump %s op file." 
% name) - - def acl_backward_dump_status(self, output, grad, module_name): - if isinstance(output, torch.Tensor): - output.backward(grad, retain_graph=True) - return True + return input_params - for api_name in KernelDumpDataProcessor.multi_output_apis: - if api_name in module_name: - output[0].backward(grad, retain_graph=True) - return True - return False + def analyze_single_element(self, element, suffix_stack): + if isinstance(element, torch.Tensor): + if not self.is_found_output_tensor: + if element.requires_grad: + self.forward_output_tensor = element + self.is_found_output_tensor = True + return {} + if not self.is_found_grad_input_tensor: + self.grad_input_tensor = element.clone() + self.is_found_grad_input_tensor = True + return {} - def dump_mode_backward_acl_dump(self, name, module, module_input_output): - grad_path = self.config.backward_input.get(name) - if not KernelDumpDataProcessor.forward_init_status: - KernelDumpDataProcessor.forward_init_status = True - output = module.forward(*module_input_output.args, **module_input_output.kwargs) - pt = load_pt(grad_path) - grad = pt.to("npu").requires_grad_() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(self.config.acl_config) - torch_npu.npu.synchronize() - if not self.acl_backward_dump_status(output, grad, name): - logger.warning("The output of {} is not of tensor type and cannot be automatically derived. " - "you can manually construct a single API backward case for ACL dump.".format( - name)) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - KernelDumpDataProcessor.forward_init_status = False - logger.info("Dump %s op file." % name) - - def op_need_trigger(self, module_name): - return 'Tensor.__getitem__.' 
in module_name + def reset_status(self): + self.enable_kernel_dump = True + self.is_found_output_tensor = False + self.is_found_grad_input_tensor = False + self.forward_args = None + self.forward_kwargs = None + self.forward_output_tensor = None + self.grad_input_tensor = None diff --git a/debug/accuracy_tools/msprobe/core/data_dump/json_writer.py b/debug/accuracy_tools/msprobe/core/data_dump/json_writer.py index e99235977798102e120478c53216e982b2c82e5e..b1e26d16f9741765c1c9600a64efb112aa0f42d7 100644 --- a/debug/accuracy_tools/msprobe/core/data_dump/json_writer.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/json_writer.py @@ -15,10 +15,13 @@ import csv import os +import copy +import numpy as np from msprobe.core.common.const import Const, FileCheckConst -from msprobe.core.common.file_utils import change_mode, FileOpen, save_json +from msprobe.core.common.file_utils import change_mode, FileOpen, save_json, load_json from msprobe.core.common.log import logger +from msprobe.core.common.exceptions import MsprobeException class DataWriter: @@ -29,10 +32,12 @@ class DataWriter: self.construct_file_path = None self.free_benchmark_file_path = None self.dump_tensor_data_dir = None + self.debug_file_path = None self.flush_size = 1000 self.cache_data = {} self.cache_stack = {} self.cache_construct = {} + self.cache_debug = {} @staticmethod def write_data_to_csv(result: list, result_header: tuple, file_path: str): @@ -55,6 +60,13 @@ class DataWriter: self.cache_construct = {} def initialize_json_file(self, **kwargs): + if self.debug_file_path and not self.cache_debug: + # debug level case only create debug.json + debug_dict = copy.deepcopy(kwargs) + debug_dict.update({"dump_data_dir": self.dump_tensor_data_dir, Const.DATA: {}}) + self.cache_debug = debug_dict + save_json(self.debug_file_path, self.cache_debug, indent=1) + return if not self.cache_data: kwargs.update({"dump_data_dir": self.dump_tensor_data_dir, Const.DATA: {}}) self.cache_data = kwargs @@ -64,13 +76,13 
@@ class DataWriter: if not self.cache_construct: save_json(self.construct_file_path, self.cache_construct, indent=1) - def update_dump_paths(self, dump_file_path, stack_file_path, construct_file_path, dump_data_dir, - free_benchmark_file_path): - self.dump_file_path = dump_file_path - self.stack_file_path = stack_file_path - self.construct_file_path = construct_file_path - self.dump_tensor_data_dir = dump_data_dir - self.free_benchmark_file_path = free_benchmark_file_path + def update_dump_paths(self, dump_path_aggregation): + self.dump_file_path = dump_path_aggregation.dump_file_path + self.stack_file_path = dump_path_aggregation.stack_file_path + self.construct_file_path = dump_path_aggregation.construct_file_path + self.dump_tensor_data_dir = dump_path_aggregation.dump_tensor_data_dir + self.free_benchmark_file_path = dump_path_aggregation.free_benchmark_file_path + self.debug_file_path = dump_path_aggregation.debug_file_path def flush_data_periodically(self): dump_data = self.cache_data.get(Const.DATA) @@ -98,6 +110,9 @@ class DataWriter: def update_construct(self, new_data): self.cache_construct.update(new_data) + def update_debug(self, new_data): + self.cache_debug['data'].update(new_data) + def write_data_json(self, file_path): logger.info(f"dump.json is at {os.path.dirname(os.path.dirname(file_path))}. 
") save_json(file_path, self.cache_data, indent=1) @@ -108,6 +123,9 @@ class DataWriter: def write_construct_info_json(self, file_path): save_json(file_path, self.cache_construct, indent=1) + def write_debug_info_json(self, file_path): + save_json(file_path, self.cache_debug, indent=1) + def write_json(self): if self.cache_data: self.write_data_json(self.dump_file_path) @@ -115,3 +133,31 @@ class DataWriter: self.write_stack_info_json(self.stack_file_path) if self.cache_construct: self.write_construct_info_json(self.construct_file_path) + if self.cache_debug: + self.write_debug_info_json(self.debug_file_path) + + def fill_stack_tensor_data(self): + self.process_stat_data_recursive(self.cache_data) + + def process_stat_data_recursive(self, data, depth=0): + if depth > Const.MAX_DEPTH: + logger.error(f"The maximum depth of recursive process stat data, {Const.MAX_DEPTH} is reached.") + raise MsprobeException(MsprobeException.RECURSION_LIMIT_ERROR) + if isinstance(data, dict): + if "tensor_stat" in data.keys(): + tensor_stat = data["tensor_stat"] + if len(tensor_stat) != Const.TENSOR_STAT_LEN or len(tensor_stat[0]) != len(tensor_stat[1]): + logger.warning("Some bad data in async dump") + else: + tensor_stat_index, tensor_stat_data = tensor_stat[0], tensor_stat[1] + if hasattr(tensor_stat_data, "device") and tensor_stat_data.device != Const.CPU_LOWERCASE: + tensor_stat_data = tensor_stat_data.cpu() + for index, stat in zip(tensor_stat_index, tensor_stat_data): + data.update({index: stat.item()}) + del data["tensor_stat"] + else: + for key in data.keys(): + self.process_stat_data_recursive(data[key], depth + 1) + elif isinstance(data, (list, tuple)): + for i in data: + self.process_stat_data_recursive(i, depth + 1) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/core/data_dump/scope.py b/debug/accuracy_tools/msprobe/core/data_dump/scope.py index 5cd356ed106b12f302bcdb1474b054e10eb74eef..7632dcf30c9eb4cc6047cde5fff5d230176b9fc0 100644 --- 
a/debug/accuracy_tools/msprobe/core/data_dump/scope.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/scope.py @@ -14,36 +14,48 @@ # limitations under the License. from abc import ABC, abstractmethod +import re from msprobe.core.common.const import Const from msprobe.core.common.exceptions import ScopeException -def build_scope(scope_class, scope=None, api_list=None): - if not scope and not api_list: - return None - if scope is None: - scope = [] - if api_list is None: - api_list = [] - if scope_class: - return scope_class(scope, api_list) - return build_range_scope_according_to_scope_name(scope, api_list) - - -def build_range_scope_according_to_scope_name(scope, api_list): - api_range_scope = APIRangeScope(scope, api_list) - module_range_scope = ModuleRangeScope(scope, api_list) - if not scope: # 如果没有scope参数则用哪类scope都一样 - return api_range_scope - if api_range_scope.is_valid and module_range_scope.is_valid: - raise ScopeException(ScopeException.InvalidScope, f"scope={scope}.") - elif api_range_scope.is_valid: - return api_range_scope - elif module_range_scope.is_valid: - return module_range_scope - else: - raise ScopeException(ScopeException.InvalidScope, f"scope={scope}") +class ScopeFactory: + def __init__(self, config): + self.task = config.task + self.level = config.level + self.scope = config.scope + self.api_list = config.list + + def build_scope(self): + if not self.scope and not self.api_list: + return None + if self.scope is None: + self.scope = [] + if self.api_list is None: + self.api_list = [] + if self.task == Const.FREE_BENCHMARK: + return ListScope(self.scope, self.api_list) + return self._build_range_scope() + + def _build_range_scope(self): + api_range_scope = APIRangeScope(self.scope, self.api_list, self.level) + module_range_scope = ModuleRangeScope(self.scope, self.api_list, self.level) + mix_range_scope = MixRangeScope(self.scope, self.api_list, self.level) + + if self.level == Const.LEVEL_MIX: + return mix_range_scope + + if not self.scope: 
+ return api_range_scope + if api_range_scope.is_valid and module_range_scope.is_valid: + raise ScopeException(ScopeException.InvalidScope, f"scope={self.scope}.") + elif api_range_scope.is_valid: + return api_range_scope + elif module_range_scope.is_valid: + return module_range_scope + else: + raise ScopeException(ScopeException.InvalidScope, f"scope={self.scope}") class BaseScope(ABC): @@ -51,7 +63,8 @@ class BaseScope(ABC): Module_Type_API = "api" module_type = ["Module", "Cell"] - def __init__(self, scope, api_list): + def __init__(self, scope, api_list, level=None): + self.level = level scope, api_list = self.rectify_args(scope, api_list) self.scope = scope self.api_list = api_list @@ -60,21 +73,21 @@ class BaseScope(ABC): def rectify_args(scope, api_list): if not isinstance(api_list, list): raise ScopeException(ScopeException.InvalidApiStr, - f"api_list参数须配置为列表,实际类型为{type(api_list)}.") + f"api_list参数须配置为列表,实际类型为{type(api_list)}.") for api in api_list: if not isinstance(api, str): raise ScopeException(ScopeException.InvalidApiStr, - f"api_list中的元素须配置为字符串,实际类型为{type(api)}.") + f"api_list中的元素须配置为字符串,实际类型为{type(api)}.") if isinstance(scope, str): scope = [scope] return scope, api_list if not isinstance(scope, list): raise ScopeException(ScopeException.InvalidScope, - f"scope参数须配置为字符串或列表,实际类型为{type(scope)}.") + f"scope参数须配置为字符串或列表,实际类型为{type(scope)}.") for s in scope: if not isinstance(s, str): raise ScopeException(ScopeException.InvalidScope, - f"scope列表元素要求类型为字符串,实际类型为{type(s)}.") + f"scope列表元素要求类型为字符串,实际类型为{type(s)}.") return scope, api_list @abstractmethod @@ -95,7 +108,7 @@ class ListScope(BaseScope): def rectify_args(scope, api_list): if scope and api_list: raise ScopeException(ScopeException.ArgConflict, - f"scope和api_list不可以同时配置,实际配置为scope={scope}, api_list={api_list}.") + f"scope和api_list不可以同时配置,实际配置为scope={scope}, api_list={api_list}.") return super(ListScope, ListScope).rectify_args(scope, api_list) def check(self, name): @@ -109,17 +122,37 @@ class 
RangeScope(BaseScope, ABC): def __init__(self, *args): super().__init__(*args) self.in_scope = False + self.in_list = False + self.start_name_set = set() self.is_valid = self.check_scope_is_valid() - @staticmethod - def rectify_args(scope, api_list): - scope, api_list = super(RangeScope, RangeScope).rectify_args(scope, api_list) - if isinstance(scope, list): - if len(scope) == 1: - scope.append(scope[0]) - elif len(scope) > 2: + def check_name_pattern(self, name): + options_pattern = "|".join(re.escape(option) for option in Const.DUMP_PREFIX) + api_pattern = rf"^({options_pattern})\..*\.\d+\.(forward|backward)$" + module_pattern = r"^(Cell|Module)\..*\.(forward|backward)\.\d+$" + + if self.level == Const.LEVEL_L1: + if not re.match(api_pattern, name): + raise ScopeException(ScopeException.InvalidScope, + f"scope参数格式错误,要求格式为api完整命名,实际为{name}.") + + if self.level == Const.LEVEL_L0: + if not re.match(module_pattern, name): + raise ScopeException(ScopeException.InvalidScope, + f"scope参数格式错误,要求格式为模块完整命名,实际为{name}.") + + if self.level == Const.LEVEL_MIX: + if not re.match(api_pattern, name) and not re.match(module_pattern, name): raise ScopeException(ScopeException.InvalidScope, - f"scope参数指定区间断点,须传入长度为1或2的列表,实际长度为{len(scope)}.") + f"scope参数格式错误,要求格式为api或模块完整命名,实际为{name}.") + + def rectify_args(self, scope, api_list): + scope, api_list = super(RangeScope, RangeScope).rectify_args(scope, api_list) + if scope and len(scope) != 2: + raise ScopeException(ScopeException.InvalidScope, + f"scope参数指定区间断点,须传入长度为2的列表,实际长度为{len(scope)}.") + for name in scope: + self.check_name_pattern(name) return scope, api_list @abstractmethod @@ -192,3 +225,50 @@ class ModuleRangeScope(RangeScope): if not self.scope or self.in_scope: return self.check_api_list(name) return False + + +class MixRangeScope(RangeScope): + def check_scope_is_valid(self): + return True if self.scope else False + + def begin_module(self, module_name): + if self.scope and module_name == self.scope[0]: + self.in_scope = 
True + for name in self.api_list: + if name in module_name: + self.in_list = True + self.start_name_set.add(module_name) # 记录每一个开启in_list的module_name + + def end_module(self, module_name): + if self.scope and module_name == self.scope[1]: + self.in_scope = False + self.start_name_set.discard(module_name) # 从集合中删除每一个module_name + if not self.start_name_set: # 如果集合为空,说明当前module_name是最后一个开启in_list的module_name + self.in_list = False # 关闭in_list + + def check_api_list(self, api_name): + if not self.api_list: + return True + + for name in self.api_list: + if name in api_name: + return True + return False + + def check(self, name): + """ + dump时调用的接口,根据scope和api_list判断是否需要dump + """ + result = False + if self.scope and name == self.scope[0]: + self.in_scope = True + + if not self.scope or self.in_scope: + if self.in_list: + result = True + else: + result = self.check_api_list(name) + + if self.scope and name == self.scope[1]: + self.in_scope = False + return result diff --git a/debug/accuracy_tools/msprobe/core/grad_probe/utils.py b/debug/accuracy_tools/msprobe/core/grad_probe/utils.py index 0619db4c45aa612eec11488427a5e59621aa08ce..de3e4156acc74f135120e06116b5894a0e9ed09e 100644 --- a/debug/accuracy_tools/msprobe/core/grad_probe/utils.py +++ b/debug/accuracy_tools/msprobe/core/grad_probe/utils.py @@ -18,6 +18,7 @@ from msprobe.core.grad_probe.constant import GradConst from msprobe.core.common.log import logger from msprobe.core.common.file_utils import write_csv, check_path_before_create, change_mode from msprobe.core.common.const import FileCheckConst +from msprobe.core.common.utils import is_int import matplotlib.pyplot as plt @@ -41,13 +42,24 @@ def check_str(string, variable_name): if not isinstance(string, str): raise ValueError(f'The variable: "{variable_name}" is not a string.') + def check_bounds_element(bound): - return GradConst.BOUNDS_MINIMUM <= bound and bound <= GradConst.BOUNDS_MAXIMUM + return GradConst.BOUNDS_MINIMUM <= bound <= GradConst.BOUNDS_MAXIMUM + 
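The `check_bounds_element` rewrite above replaces the explicit `and` with Python's chained comparison. A minimal standalone sketch — the int64-style limits here are hypothetical stand-ins for `GradConst.BOUNDS_MINIMUM`/`GradConst.BOUNDS_MAXIMUM`:

```python
# Hypothetical stand-ins for GradConst.BOUNDS_MINIMUM / GradConst.BOUNDS_MAXIMUM.
BOUNDS_MINIMUM = -(2 ** 63)
BOUNDS_MAXIMUM = 2 ** 63 - 1


def check_bounds_element(bound):
    # Chained comparison: same result as
    # `BOUNDS_MINIMUM <= bound and bound <= BOUNDS_MAXIMUM`,
    # but `bound` is evaluated only once.
    return BOUNDS_MINIMUM <= bound <= BOUNDS_MAXIMUM
```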
+ +def check_param_element(param): + if not re.match(GradConst.PARAM_VALID_PATTERN, param): + return False + else: + return True + def check_bounds(bounds): + if not isinstance(bounds, list): + raise Exception(f"bounds must be a list") prev = GradConst.BOUNDS_MINIMUM - 1 for element in bounds: - if not isinstance(element, (int, float)): + if not is_int(element) and not isinstance(element, float): raise Exception("bounds element is not int or float") if not check_bounds_element(element): raise Exception("bounds element is out of int64 range") @@ -55,6 +67,7 @@ def check_bounds(bounds): raise Exception("bounds list is not ascending") prev = element + class ListCache(list): threshold = 1000 diff --git a/debug/accuracy_tools/msprobe/core/overflow_check/abnormal_scene.py b/debug/accuracy_tools/msprobe/core/overflow_check/abnormal_scene.py new file mode 100644 index 0000000000000000000000000000000000000000..54dae2576e48b7ad75df97fa046e6e90bbd144c2 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/overflow_check/abnormal_scene.py @@ -0,0 +1,189 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
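The `check_bounds` validation above (type check, range check, strictly ascending order) reduces to a small pure function. A simplified sketch, with plain `isinstance` in place of the project's `is_int` helper and hypothetical int64 limits in place of `GradConst`:

```python
def check_bounds(bounds, minimum=-(2 ** 63), maximum=2 ** 63 - 1):
    # Simplified sketch: bounds must be a list of int/float values,
    # each inside [minimum, maximum], in strictly ascending order.
    if not isinstance(bounds, list):
        raise TypeError("bounds must be a list")
    prev = minimum - 1
    for element in bounds:
        # bool is a subclass of int; exclude it, mirroring a stricter is_int.
        if isinstance(element, bool) or not isinstance(element, (int, float)):
            raise ValueError("bounds element is not int or float")
        if not minimum <= element <= maximum:
            raise ValueError("bounds element is out of int64 range")
        if element <= prev:
            raise ValueError("bounds list is not ascending")
        prev = element
    return True
```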
+ +from typing import List, Dict, Union, Any + +import numpy as np + +from msprobe.core.overflow_check.api_info import APIInfo +from msprobe.core.overflow_check.level import OverflowLevel +from msprobe.core.overflow_check.utils import has_nan_inf + + +class AnomalyScene: + """异常场景的基类""" + + def __init__(self, api_info: APIInfo): + self.api_name = api_info.api_name + self.api_data = api_info + + @property + def rank(self) -> OverflowLevel: + """获取异常等级""" + raise NotImplementedError + + @staticmethod + def _has_anomaly(data: Union[Dict, Any]) -> bool: + """检查张量是否包含异常值""" + if isinstance(data, dict): + return has_nan_inf(data) + elif isinstance(data, list): + return any(AnomalyScene._has_anomaly(x) for x in data) + return False + + def get_details(self) -> Dict: + """获取异常详情""" + return { + 'api_name': self.api_name, + 'rank': self.rank.value, + 'scene_type': self.__class__.__name__, + 'input_args_anomaly_indices': self._get_anomaly_indices_from_list(self.api_data.input_args), + 'input_kwargs_anomaly_keys': self._get_anomaly_keys_from_dict(self.api_data.input_kwargs), + 'output_anomaly_indices': self._get_anomaly_indices_from_list(self.api_data.output_data) + } + + def matches(self) -> bool: + """ + 待子类实现对应匹配逻辑 + Returns: + + """ + raise NotImplementedError + + def _get_anomaly_indices_from_list(self, data_list: List[Dict]) -> List[int]: + return [i for i, data in enumerate(data_list) if self._has_anomaly(data)] + + def _get_anomaly_keys_from_dict(self, data_dict: Dict) -> List[str]: + return [key for key, data in data_dict.items() if self._has_anomaly(data)] + + +class InputOutputAnomalyScene(AnomalyScene): + """输入输出异常检测的基类""" + def has_input_anomaly(self) -> bool: + """检查输入是否有异常(包括args和kwargs)""" + # args + args_anomaly = any(self._has_anomaly(x) for x in self.api_data.input_args) + # kwargs + kwargs_anomaly = any(self._has_anomaly(x) for x in self.api_data.input_kwargs.values()) + return args_anomaly or kwargs_anomaly + + def has_output_anomaly(self) -> bool: + 
"""检查输出是否有异常""" + return any(self._has_anomaly(x) for x in self.api_data.output_data) + + def matches(self) -> bool: + """判断是否匹配该场景""" + raise NotImplementedError + + +class InputAnomalyOutputNormalScene(InputOutputAnomalyScene): + """输入异常,输出正常场景""" + + @property + def rank(self) -> OverflowLevel: + return OverflowLevel.MEDIUM + + def matches(self) -> bool: + return self.has_input_anomaly() and not self.has_output_anomaly() + + +class InputAnomalyOutputAnomalyScene(InputOutputAnomalyScene): + """输入异常,输出异常场景""" + + @property + def rank(self) -> OverflowLevel: + return OverflowLevel.HIGH + + def matches(self) -> bool: + return self.has_input_anomaly() and self.has_output_anomaly() + + +class InputNormalOutputAnomalyScene(InputOutputAnomalyScene): + """输入正常,输出异常场景""" + + @property + def rank(self) -> OverflowLevel: + return OverflowLevel.CRITICAL + + def matches(self) -> bool: + return not self.has_input_anomaly() and self.has_output_anomaly() + + +class NumericalMutationScene(AnomalyScene): + """ + 检查数值突变,统计输入args、kwargs中norm值,同时统计输出的norm最大值,计算差异,大于 threshold 则认为是异常情况 + """ + def __init__(self, api_info: APIInfo, threshold: float = 100.0): + super().__init__(api_info) + self.threshold = threshold + + @property + def rank(self) -> OverflowLevel: + return OverflowLevel.HIGH + + @staticmethod + def _get_tensor_norms(data_list: List[Dict]) -> List[float]: + norms = [] + for data in data_list: + if isinstance(data, dict) and data.get('type') == 'torch.Tensor': + norm = data.get('Norm') + if norm is not None and not np.isnan(norm): + norms.append(norm) + return norms + + @staticmethod + def _get_kwargs_norms(data_dict: Dict) -> List[float]: + """ + 获取kwargs中张量的范数列表 + Args: + data_dict: + Returns: + """ + norms = [] + for data in data_dict.values(): + if isinstance(data, dict) and data.get('type') == 'torch.Tensor': + norm = data.get('Norm') + if norm is not None and not np.isnan(norm): + norms.append(norm) + return norms + + def matches(self) -> bool: + """ + 
继承父类函数,实现数值突变检查 + Returns: + """ + # 收集所有输入的范数 + input_norms = (self._get_tensor_norms(self.api_data.input_args) + + self._get_kwargs_norms(self.api_data.input_kwargs)) + # 收集所有输出的范数 + output_norms = self._get_tensor_norms(self.api_data.output_data) + + if not input_norms or not output_norms: + return False + + max_input = max(input_norms) + max_output = max(output_norms) + + if max_input == 0: + return max_output > self.threshold + return max_output / max_input > self.threshold + + def get_details(self) -> Dict: + details = super().get_details() + details.update({ + 'threshold': self.threshold, + 'scale_change_detected': self.matches() + }) + return details diff --git a/debug/accuracy_tools/msprobe/core/overflow_check/api_info.py b/debug/accuracy_tools/msprobe/core/overflow_check/api_info.py new file mode 100644 index 0000000000000000000000000000000000000000..9b5ef810f56a79583d47de26f3a6d77e01ac72d5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/overflow_check/api_info.py @@ -0,0 +1,55 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
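The ratio test in `NumericalMutationScene.matches` above boils down to a small decision rule over the collected norms; a standalone sketch (the function name here is illustrative, not the module's API):

```python
def is_numerical_mutation(input_norms, output_norms, threshold=100.0):
    # Flag an anomaly when the largest output norm exceeds the largest
    # input norm by more than `threshold` times; with a zero input norm,
    # compare the output norm against the threshold directly.
    if not input_norms or not output_norms:
        return False
    max_input = max(input_norms)
    max_output = max(output_norms)
    if max_input == 0:
        return max_output > threshold
    return max_output / max_input > threshold
```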
+ +from dataclasses import dataclass + +from typing import Dict, List + +from msprobe.core.common.const import Const + + +@dataclass +class APIInfo: + api_name: str + torch_api_name: str + input_args: List[Dict] + input_kwargs: Dict + output_data: List[Dict] + + def __init__(self, api_name, input_args=None, input_kwargs=None, output_data=None): + self.api_name = api_name + self.input_args = input_args + self.input_kwargs = input_kwargs + self.output_data = output_data + self.torch_api_name = self.extract_torch_api(self.api_name) + + @staticmethod + def extract_torch_api(api_name) -> str: + """ + Process tensor api name to extract first two fields in lowercase. + """ + # Empty string checking + if not api_name.strip(): + return "" + + parts = api_name.split(Const.SEP) + + # Handle different cases based on number of parts + if len(parts) == 0: + return "" + elif len(parts) == 1: + return parts[0].lower() + else: + return Const.SEP.join(parts[:2]).lower() diff --git a/debug/accuracy_tools/msprobe/core/overflow_check/checker.py b/debug/accuracy_tools/msprobe/core/overflow_check/checker.py new file mode 100644 index 0000000000000000000000000000000000000000..1287d1b0adcc54e34141c09811d7c24d269ca356 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/overflow_check/checker.py @@ -0,0 +1,138 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
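`APIInfo.extract_torch_api` above normalizes a dump entry name down to its first two dot-separated fields in lowercase; a standalone sketch of that behavior, with `SEP` standing in for `Const.SEP`:

```python
SEP = "."  # stand-in for Const.SEP


def extract_torch_api(api_name):
    # Keep the first two dot-separated fields, lowercased; empty or
    # whitespace-only names map to "".
    if not api_name.strip():
        return ""
    parts = api_name.split(SEP)
    if len(parts) == 1:
        return parts[0].lower()
    return SEP.join(parts[:2]).lower()
```

So a dump entry such as `Tensor.add.0.forward` is keyed as `tensor.add` when looking up ignore rules.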
+ +from typing import Dict, List, Optional, Any + +from msprobe.core.common.const import Const + +from msprobe.core.overflow_check.abnormal_scene import InputAnomalyOutputNormalScene, InputAnomalyOutputAnomalyScene, \ + InputNormalOutputAnomalyScene, NumericalMutationScene, AnomalyScene +from msprobe.core.overflow_check.api_info import APIInfo +from msprobe.core.overflow_check.filter import IgnoreFilter +from msprobe.core.overflow_check.level import OverflowLevel + + +class StatisticsFields: + """统计字段常量类""" + CRITICAL_APIS = 'critical_apis' + HIGH_PRIORITY_APIS = 'high_priority_apis' + MEDIUM_PRIORITY_APIS = 'medium_priority_apis' + ANOMALY_DETAILS = 'anomaly_details' + + # 所有字段 + ALL_FIELDS = [CRITICAL_APIS, HIGH_PRIORITY_APIS, MEDIUM_PRIORITY_APIS, ANOMALY_DETAILS] + + +class AnomalyDetector: + """异常检测器""" + + def __init__(self, dump_data: Dict): + """ + 初始化检测器,并保存dump_data + Args: + dump_data: 数据格式如下 + { + "api/module": {statistics} + } + """ + self.dump_data = dump_data + self.ignore_filter = IgnoreFilter() + self.scene_types = [ + InputNormalOutputAnomalyScene, # 输入正常,输出异常 + InputAnomalyOutputAnomalyScene, # 输入异常,输出异常 + InputAnomalyOutputNormalScene, # 输入异常,输出正常 + NumericalMutationScene # 输出较输入值突变 + ] + self.anomaly_scenes: Dict[str, AnomalyScene] = dict() + + @staticmethod + def _create_api_info(api_name: str, data: Dict) -> APIInfo: + """从原始数据创建APIInfo实例""" + return APIInfo( + api_name=api_name, + input_args=data.get(Const.INPUT_ARGS, data.get(Const.INPUT, [])), + input_kwargs=data.get(Const.INPUT_KWARGS, {}), + output_data=data.get(Const.OUTPUT, []) + ) + + def get_statistics(self) -> Dict[str, List]: + """获取统计信息 + + 使用StatisticsFields类统一管理字段名称,避免硬编码 + + Returns: + Dict[str, List]: 包含各优先级API列表和异常详情的字典 + """ + stats = {field: [] for field in StatisticsFields.ALL_FIELDS} + + # 定义rank到结果key的映射关系 + rank_to_key = { + OverflowLevel.CRITICAL: StatisticsFields.CRITICAL_APIS, + OverflowLevel.HIGH: StatisticsFields.HIGH_PRIORITY_APIS, + OverflowLevel.MEDIUM: 
StatisticsFields.MEDIUM_PRIORITY_APIS
+        }
+
+        for scene in self.anomaly_scenes.values():
+            stats[StatisticsFields.ANOMALY_DETAILS].append(scene.get_details())
+            # 根据rank分类API
+            key = rank_to_key.get(scene.rank, None)
+            if key:
+                stats[key].append(scene.api_name)
+
+        return stats
+
+    def analyze(self):
+        """
+        按照异常场景对调用数据进行分析
+        Returns:
+            返回类本身,若不进行过滤,则仅调用analyze即可
+        """
+        # 遍历data item
+        for api_name, data in self.dump_data.items():
+            api_info = self._create_api_info(api_name, data)
+
+            # 每种都进行检测,可能涉及多种命中,原则如下:
+            # - 就高原则
+            # - 优先原则,数据异常放最后检测
+            for scene_type in self.scene_types:
+                scene = scene_type(api_info)
+                if hasattr(scene, 'matches') and scene.matches():
+                    self.anomaly_scenes[api_name] = scene
+                    break  # 直接跳过,就高原则
+        return self
+
+    def filter(self):
+        """
+        对误检数据进行过滤
+        Returns:
+            检查checker自身,方便链式调用
+        """
+        result = dict()
+        for api_name, scene in self.anomaly_scenes.items():
+            if self.ignore_filter.apply_filter(scene.api_data):
+                continue
+            result[api_name] = scene
+        self.anomaly_scenes = result
+        return self
+
+    def overflow_result(self) -> Dict[str, AnomalyScene]:
+        return self.anomaly_scenes
+
+    def has_overflow(self, api_name: str) -> bool:
+        return api_name in self.anomaly_scenes.keys()
+
+    def get_overflow_level(self, api_name: str) -> Optional[Any]:
+        scene = self.anomaly_scenes.get(api_name, None)
+        return scene.rank if scene else None
diff --git a/debug/accuracy_tools/msprobe/core/overflow_check/filter.py b/debug/accuracy_tools/msprobe/core/overflow_check/filter.py
new file mode 100644
index 0000000000000000000000000000000000000000..096b9e2cdf4542dd9090fc94f7008f8ab1afbc56
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/core/overflow_check/filter.py
@@ -0,0 +1,157 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os.path +from dataclasses import dataclass, field +from typing import Set + +from msprobe.core.common.file_utils import load_yaml +from msprobe.core.overflow_check.api_info import APIInfo +from msprobe.core.overflow_check.utils import has_nan_inf + +cur_path = os.path.dirname(os.path.realpath(__file__)) + + +class IgnoreFilter: + def __init__(self, rule_path=os.path.join(cur_path, './ignore_rules.yaml')): + self.rules = dict() + self._load_rules(rule_path) + + def has_api_rule(self, api_name: str) -> bool: + return api_name in self.rules.keys() + + def apply_filter(self, api_info: APIInfo) -> bool: + """ + 应用过滤规则,返回是否需要被过滤 + Args: + api_info: API调用信息 + Returns: + 是否为误检,是否需要过滤 + """ + torch_api = api_info.torch_api_name + if not self.has_api_rule(torch_api): + return False + rule = self.rules.get(torch_api) + if not rule.match(api_info): + return False + return True + + def _load_rules(self, rule_file_path): + if self.rules and len(self.rules): + return + data = load_yaml(rule_file_path) + self.rules = dict() + for rule_item in data.get('ignore_nan_inf', []): + rule = Rule( + api_name=rule_item.get('api_name', ''), + desc=rule_item.get('description', ''), + input_ignore=rule_item.get('input_ignore', []), + output_ignore=rule_item.get('output_ignore', []) + ) + if not rule.verify_field(): + continue + if self.has_api_rule(rule.api_name): + continue + self.rules[rule.api_name] = rule + + +class Rule: + + def __init__(self, api_name, desc='', input_ignore=None, output_ignore=None): + self.api_name = api_name + self.desc = desc + self.input_ignore = 
IgnoreItem() + self.output_ignore = IgnoreItem() + self._init_ignore(input_ignore, output_ignore) + + def __repr__(self): + return (f'Rule(api_name={self.api_name}, desc={self.desc}, input_ignore={self.input_ignore}, output_ignore=' + f'{self.output_ignore})') + + def verify_field(self): + if self.api_name == '': + return False + # 若无输入输出规则长度,则为无效规则 + if not (len(self.input_ignore.index) + len(self.input_ignore.name) + len(self.output_ignore.index)): + return False + return True + + def match(self, api_info: APIInfo) -> bool: + """ + 匹配API信息是否符合规则 + Returns: + bool: True if the api_info matches this rule, False otherwise + """ + # 首先检查API名称是否匹配 + api_name = api_info.torch_api_name + if api_name != self.api_name: + return False + + # 检查输入参数中的NaN/Inf + if self.input_ignore.index and len(api_info.input_args): + for idx, arg in enumerate(api_info.input_args): + if has_nan_inf(arg) and not self.input_ignore.has_index(idx): + return False + + # 检查输入kwargs中的NaN/Inf + if self.input_ignore.name and len(api_info.input_kwargs): + for name, value in api_info.input_kwargs.items(): + if has_nan_inf(value) and not self.input_ignore.has_name(name): + return False + + # 检查输出中的NaN/Inf + if self.output_ignore.index and len(api_info.output_data): + for idx, out in enumerate(api_info.output_data): + if has_nan_inf(out) and not self.output_ignore.has_index(idx): + return False + + return True + + def _init_ignore(self, input_ignore=None, output_ignore=None): + """初始化忽略项""" + if input_ignore is None: + input_ignore = [] + if output_ignore is None: + output_ignore = [] + + # 处理输入忽略规则 + for item in input_ignore: + if 'index' in item: + self.input_ignore.add_index(item['index']) + if 'name' in item: + self.input_ignore.add_name(item['name']) + + # 处理输出忽略规则 + for item in output_ignore: + if 'index' in item: + self.output_ignore.add_index(item['index']) + + +@dataclass +class IgnoreItem: + """存储需要忽略的索引和名称""" + index: Set[int] = field(default_factory=set) + name: Set[str] = 
field(default_factory=set) + + def add_index(self, idx: int): + self.index.add(idx) + + def add_name(self, name: str): + self.name.add(name) + + def has_index(self, idx: int) -> bool: + return idx in self.index + + def has_name(self, name: str) -> bool: + return name in self.name diff --git a/debug/accuracy_tools/msprobe/core/overflow_check/ignore_rules.yaml b/debug/accuracy_tools/msprobe/core/overflow_check/ignore_rules.yaml new file mode 100644 index 0000000000000000000000000000000000000000..8d45e597c58d8c30ca5be6d5adf2de32151bcb24 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/overflow_check/ignore_rules.yaml @@ -0,0 +1,55 @@ +ignore_nan_inf: + # Create an uninitialized memory + - api_name: "torch.empty" + description: "Creates a tensor with uninitialized data. The values may contain NaN or Inf because the memory is not cleared or set to zero." + output_ignore: + - index: 0 + + - api_name: "torch.empty_like" + description: "Creates an uninitialized tensor with the same size, dtype, and device as the input tensor. The values may contain NaN or Inf due to uninitialized memory." + output_ignore: + - index: 0 + + - api_name: "torch.empty_strided" + description: "Creates a tensor with uninitialized data using specified strides. NaN or Inf may be present due to uninitialized memory." + output_ignore: + - index: 0 + + # Distributed func + - api_name: "distributed.recv" + description: "Receives a tensor from another process. The input tensor may contain uninitialized data before the recv call, but it will be overwritten with received data." + input_ignore: + - index: 0 # tensor (the input buffer, which may be uninitialized before receiving) + - name: tensor + + - api_name: "distributed.all_gather" + description: "Gathers tensors from all processes and distributes them to each process. The tensors in tensor_list may contain uninitialized data before the all_gather call, but they will be overwritten with collected data from all processes." 
+ input_ignore: + - index: 0 # tensor_list (the input list of tensors, which may contain uninitialized data before the all_gather call) + + - api_name: "distributed.reduce_scatter" + description: "Combines reduction and scatter operations. The output tensor may contain uninitialized data before the reduce_scatter call, but it will be overwritten with the reduced and scattered data from all processes." + input_ignore: + - index: 0 + - name: output + + - api_name: "distributed._reduce_scatter_base" + description: "Performs a combined reduction and scatter operation using a single input tensor. The output tensor may contain uninitialized data before the _reduce_scatter_base call, but it will be overwritten with the reduced and scattered data." + input_ignore: + - index: 0 + + - api_name: "distributed.all_gather_into_tensor" + description: "Gathers tensors from all processes into a single output tensor. The output tensor may contain uninitialized data before the all_gather_into_tensor call, but it will be overwritten with collected data from all processes." + input_ignore: + - index: 0 + + - api_name: "distributed.reduce_scatter_tensor" + description: "Performs a reduction operation across all processes and scatters the result into the output tensor. The output tensor may contain uninitialized data before the reduce_scatter_tensor call, but it will be overwritten with the reduced and scattered data." 
+    input_ignore:
+      - index: 0
+
+  # Tensor inplace func
+  - api_name: "tensor.masked_fill_"
+    description: "Fills the tensor in place with the given value at positions selected by the mask."
+    input_ignore:
+      - index: 0
diff --git a/debug/accuracy_tools/msprobe/core/overflow_check/level.py b/debug/accuracy_tools/msprobe/core/overflow_check/level.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f40468f6551a3787bdae7f9d94a5f66599151a0
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/core/overflow_check/level.py
@@ -0,0 +1,22 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from enum import Enum
+
+
+class OverflowLevel(Enum):
+    MEDIUM = "medium"
+    HIGH = "high"
+    CRITICAL = "critical"
diff --git a/debug/accuracy_tools/msprobe/core/overflow_check/utils.py b/debug/accuracy_tools/msprobe/core/overflow_check/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..c9bedb3c6367ff9b2d589bdb5b8af5e8e68014e8
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/core/overflow_check/utils.py
@@ -0,0 +1,28 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Any + +CHECK_FIELDS = ['Max', 'Min', 'Mean'] +OVERFLOW_VALUES = ['inf', '-inf', 'nan'] + + +def has_nan_inf(value: Any) -> bool: + """检查值是否包含NaN或Inf""" + if isinstance(value, dict): + for k, v in value.items(): + if k in CHECK_FIELDS and str(v).lower() in OVERFLOW_VALUES: + return True + return False diff --git a/debug/accuracy_tools/msprobe/docs/01.installation.md b/debug/accuracy_tools/msprobe/docs/01.installation.md index ab6b53fce070b699e047aed92567e2e2855ad1a8..530783e87d0bdadd51856cb1ae08160cb081da80 100644 --- a/debug/accuracy_tools/msprobe/docs/01.installation.md +++ b/debug/accuracy_tools/msprobe/docs/01.installation.md @@ -16,6 +16,10 @@ pip install mindstudio-probe |版本|发布日期|支持 PyTorch 版本|支持 MindSpore 版本|下载链接|校验码| |:--:|:--:|:--:|:--:|:--:|:--:| +|1.2.2|2025.3.03|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.2.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.2/mindstudio_probe-1.2.2-py3-none-any.whl)|961411bb460d327ea51d6ca4d0c8e8c5565f07c0852d7b8592b781ca35b87212| +|1.2.1|2025.2.07|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.2.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.2/mindstudio_probe-1.2.1-py3-none-any.whl)|b64b342118558e0339b39237f88a49b93fd24551b0cb202c872fbfef4260c86b| +|1.2.0|2025.1.13|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.2/mindstudio_probe-1.2.0-py3-none-any.whl)|1e3aeea1706112f6ee52fd1165037936bb209138f0b9ec42ea21e2c1c8942cdc| 
+|1.1.1|2024.12.09|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.1.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.1/mindstudio_probe-1.1.1-py3-none-any.whl)|577b597555dc155b76ba1a62d575c3546004644e140a456c3ba0824d46283735| |1.1.0|2024.10.14|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.1/mindstudio_probe-1.1.0-py3-none-any.whl)|83a5a9b7c65a357639f8c9636d88c693b4cf0eb590d4f8f5cb56395ba69b1f6d| |1.0.4|2024.09.09|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.0.4-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.0/mindstudio_probe-1.0.4-py3-none-any.whl)|4e1909566a71a855b356597750c20ee43d964a22b2c2b02ac08312a5def75fd6| | 1.0.3 | 2024.08.23 | 1.11/2.0/2.1/2.2 | 2.4.0 | [mindstudio_probe-1.0.3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.0/mindstudio_probe-1.0.3-py3-none-any.whl) | 7060cc141a5b98ef770cd9220995d299393f32a61938261e632c7e8b5160bef2 | @@ -41,18 +45,128 @@ cd mstt/debug/accuracy_tools pip install setuptools wheel -python setup.py bdist_wheel +python setup.py bdist_wheel [--include-mod=[adump]] cd ./dist pip install ./mindstudio_probe*.whl ``` -# 历史版本特性 +|参数|说明|是否必选| +|--|--|:--:| +|--include-mod|指定可选模块,可取值`adump`,表示在编whl包时加入adump模块。默认未配置该参数,表示编基础包。
• adump模块用于MindSpore静态图场景L2级别的dump。
• 仅MindSpore 2.5.0及以上版本支持adump模块。
• 若使用源码安装,编译环境需支持GCC 7或以上版本,和CMAKE 3.14或以上版本。
• 生成的whl包仅限编译时使用的python版本和处理器架构可用。|否| - - - - -
版本特性
1.0.3【精度预检】
1. 落盘数据小;
2. 支持随机生成模式和真实数据模式;
3. 单 API 测试,排除整网中的累计误差问题。
【梯度检测】
1. 使用便捷,无需在训练流程里插入代码。
2. 可以精准定位问题出现的 step。
+# 特性变更说明
+
+## 1.2.0
+
+【数据采集】
+
+- 模块级dump支持采集权重及权重梯度
+- 修复原地覆盖类API前向输入数据采集不正确的问题
+- seed_all接口支持控制dropout失效功能
+
+【精度预检】
+
+- MindSpore场景新增支持Tensor类的mint API的预检
+
+【训练状态监控】
+
+- 支持FSDP和ZeRO-0
+- 异常排序支持前向激活值和反向梯度
+
+【分级可视化构图比对】
+
+- 支持graph结构分页展示,支持graph批量构建和比对
+- 支持溢出检测模式
+
+## 1.1.1
+
+【数据采集】
+
+- dump 支持 processgroup、namedtuple、slice 等数据类型
+- MindSpore 动态图 dump 能力增强,支持 mix 模式 dump、控制 dropout 失效、支持控制区间正反向数据 dump
+
+【精度预检】
+
+- PyTorch 场景新增单算子 API 自动生成脚本
+- MindSpore 动态图场景新增支持 multi_run_ut 多线程预检
+- MindSpore 场景新增支持断点续检
+
+【精度比对】
+
+- 新增 MindSpore 跨框架比对能力,支持 MindSpore 与 PyTorch 跨框架比对
+- 支持异常比对结果数据自动颜色标注
+
+【无标杆比对】
+
+- MindSpore 动态图场景支持反向过程的无标杆比对
+
+【训练状态监控】
+
+- 新增支持通信聚合前梯度信息监控
+
+【分级可视化构图比对】
+
+- 新增分级可视化构图比对工具,支持单数据构图、溢出检测、双数据比对构图,同时支持传入映射文件,支持跨框架或同框架比对
+
+## 1.1.0
+
+【总体】
+
+- 训练精度一体化工具 atat 统一更名为 msprobe
+- msprobe 支持日志分级功能
+
+【数据采集】
+
+- 增加 L1 dump 接口,支持在指定区间内进行正反向 dump 功能
+- 新增 MindSpore 函数式接口的通信 API dump 功能
+
+【精度预检】
+
+- 支持配置 blacklist 黑名单字段
+- 补充了支持的融合算子列表
+
+【精度比对】
+
+- 支持 data mapping 和 layer mapping 的比对功能
+
+【梯度工具】
+
+- 增加了梯度工具中关于 JIT 限制的说明
+
+## 1.0.4
+
+【数据采集】
+
+- 支持在 config.json 中传入 step 范围配置
+- 优化了 MindSpore 场景下的 step 机制,step 结束后训练继续运行
+
+【精度预检】
+
+- 在 PyTorch 场景下,支持部分 NPU 融合算子精度预检
+
+【精度比对】
+
+- 解决了在 MindSpore 场景下需要安装 PyTorch 的问题
+
+【无标杆比对】
+
+- 补充了 PyTorch 场景的性能基线报告
+- 支持 MindSpore 场景下的 change_value 扰动模式
+
+## 1.0.3
+
+【精度预检】
+
+- 落盘数据缩减
+- 支持随机生成模式和真实数据模式
+- 单 API 测试,排除整网中的累计误差问题
+
+【梯度检测】
+
+- 使用便捷,无需在训练流程里插入代码
+- 可以精准定位问题出现的 step

 # 查看 msprobe 工具信息

diff --git a/debug/accuracy_tools/msprobe/docs/02.config_introduction.md b/debug/accuracy_tools/msprobe/docs/02.config_introduction.md
index a5d9e27062f7aac28a5ced084f00a4b56160e0be..bab79da92d26a63f9f796f855e75d608c142a46d 100644
--- a/debug/accuracy_tools/msprobe/docs/02.config_introduction.md
+++ b/debug/accuracy_tools/msprobe/docs/02.config_introduction.md
@@ -10,15 +10,15 @@

 ### 1.1 通用配置

-| 参数 | 解释 | 是否必选 |
-| ----------------- | ---------------- | -------- |
-| task | dump 
的任务类型,str 类型。可选参数:
"statistics":仅采集统计信息,默认值;
"tensor":采集统计信息和完全复刻整网的真实数据;
"run_ut":精度预检,仅 PyTorch 场景支持,采集数据时勿选;
"overflow_check":溢出检测;
"free_benchmark":无标杆比对。
根据 task 参数取值的不同,可以配置不同场景参数,详见:
[1.2 task 配置为 statistics](#12-task-配置为-statistics),
[1.3 task 配置为 tensor](#13-task-配置为-tensor),
[1.4 task 配置为 run_ut](#14-task-配置为-run_ut),
[1.5 task 配置为 overflow_check](#15-task-配置为-overflow_check),
[1.6 task 配置为 free_benchmark](#16-task-配置为-free_benchmark)。
**配置示例**:"task": "tensor"。 | 否 | -| dump_path | 设置 dump 数据目录路径,str 类型。
**配置示例**:"dump_path": "./dump_path"。 | 是 | -| rank | 指定对某张卡上的数据进行采集,list[Union[int, str]] 类型,默认未配置(表示采集所有卡的数据),应配置元素为 ≥0 的整数或类似"4-6"的字符串,且须配置实际可用的 Rank ID。
PyTorch 场景: Rank ID 从 0 开始计数,最大取值为所有节点可用卡总数-1,若所配置的值大于实际训练所运行的卡的 Rank ID,则 dump 数据为空,比如当前环境 Rank ID 为 0 到 7,实际训练运行 0 到 3 卡,此时若配置 Rank ID 为 4 或不存在的 10 等其他值,dump 数据为空。
MindSpore 场景:所有节点的 Rank ID 均从 0 开始计数,最大取值为每个节点可用卡总数-1,config.json 配置一次 rank 参数对所有节点同时生效。
**配置示例**:"rank": [1, "4-6"]。 | 否 | -| step | 指定采集某个 step 的数据,list[Union[int, str]] 类型。默认未配置,表示采集所有 step 数据。采集特定 step 时,须指定为训练脚本中存在的 step,可逐个配置,也可以指定范围。
**配置示例**:"step": [0, 1 , 2, "4-6"]。 | 否 | -| level | dump 级别,str 类型,根据不同级别采集不同数据。可选参数:
"L0":dump 模块级精度数据,仅 PyTorch 与 MindSpore 动态图场景支持,使用背景详见 [1.1.1 模块级精度数据 dump 说明](#111-模块级精度数据-dump-说明);
"L1":dump API 级精度数据,默认值,仅 PyTorch 与 MindSpore 动态图场景支持;
"L2":dump kernel 级精度数据,PyTorch 场景下须配置 acl_config 参数;
"mix":dump module 模块级和 API 级精度数据,即"L0"+"L1",仅 PyTorch 与 MindSpore 动态图场景支持。
**配置示例**:"level": "L1"。 | 否 | -| acl_config | kernel dump 的配置文件,str 类型。当 PyTorch 场景的 level 取"L2"时,该参数必选;level 为其他值时,该参数不选。
**配置示例**:acl_config="./acl_config.json"。具体配置见[ acl_config 示例](./04.acl_config_examples.md)。 | 否 | -| enable_dataloader | 自动控制开关,bool 类型,仅 PyTorch 场景支持。可选参数 true(开启)或 false(关闭),默认为 false。配置为 true 后自动识别 step 参数指定的迭代,并在该迭代执行完成后退出训练,此时 start、stop 和 step 函数可不配置,开启该开关要求训练脚本是通过 torch.utils.data.dataloader 方式加载数据。仅支持 PyTorch 单卡训练使用,分布式训练场景下存在数据 dump 不全问题。 | 否 | +| 参数 | 解释 | 是否必选 | +| ----------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------- | +| task | dump 的任务类型,str 类型。可选参数:
"statistics":仅采集统计信息,默认值;
"tensor":采集统计信息和完全复刻整网的真实数据;
"run_ut":精度预检,仅 PyTorch 场景支持,采集数据时勿选;
"overflow_check":溢出检测;
"free_benchmark":无标杆比对;
"grad_probe":梯度监控;
"structure":仅采集模型结构以及调用栈信息,不采集具体数据。
根据 task 参数取值的不同,可以配置不同场景参数,详见:
[1.2 task 配置为 statistics](#12-task-配置为-statistics),
[1.3 task 配置为 tensor](#13-task-配置为-tensor),
[1.4 task 配置为 run_ut](#14-task-配置为-run_ut),
[1.5 task 配置为 overflow_check](#15-task-配置为-overflow_check),
[1.6 task 配置为 free_benchmark](#16-task-配置为-free_benchmark),
[1.7 task 配置为 grad_probe](#17-task-配置为-grad_probe)。
**配置示例**:"task": "tensor"。 | 否 | +| dump_path | 设置 dump 数据目录路径,str 类型。
**配置示例**:"dump_path": "./dump_path"。 | 是 | +| rank | 指定对某张卡上的数据进行采集,list[Union[int, str]] 类型,默认未配置(表示采集所有卡的数据),应配置元素为 ≥0 的整数或类似"4-6"的字符串,且须配置实际可用的 Rank ID。
PyTorch 场景: Rank ID 从 0 开始计数,最大取值为所有节点可用卡总数-1,若所配置的值大于实际训练所运行的卡的 Rank ID,则 dump 数据为空,比如当前环境 Rank ID 为 0 到 7,实际训练运行 0 到 3 卡,此时若配置 Rank ID 为 4 或不存在的 10 等其他值,dump 数据为空。
MindSpore 场景:所有节点的 Rank ID 均从 0 开始计数,最大取值为每个节点可用卡总数-1,config.json 配置一次 rank 参数对所有节点同时生效。
注意,单卡训练时,rank必须为[],即空列表,不能指定rank。
**配置示例**:"rank": [1, "4-6"]。 | 否 | +| step | 指定采集某个 step 的数据,list[Union[int, str]] 类型。默认未配置,表示采集所有 step 数据。采集特定 step 时,须指定为训练脚本中存在的 step,可逐个配置,也可以指定范围。
**配置示例**:"step": [0, 1 , 2, "4-6"]。 | 否 | +| level | dump 级别,str 类型,根据不同级别采集不同数据。可选参数:
"L0":dump 模块级精度数据,仅 PyTorch 与 MindSpore 动态图场景支持,使用背景详见 [1.1.1 模块级精度数据 dump 说明](#111-模块级精度数据-dump-说明);
"L1":dump API 级精度数据,默认值,仅 PyTorch 与 MindSpore 动态图场景支持;
"L2":dump kernel 级精度数据,PyTorch场景详细介绍见 [PyTorch 场景的 kernel dump 说明](./04.kernel_dump_PyTorch.md);MindSpore场景详细介绍见 [MindSpore 场景的 kernel dump 说明](./28.kernel_dump_MindSpore.md);
"mix":dump module 模块级和 API 级精度数据,即"L0"+"L1",仅 PyTorch 与 MindSpore 动态图场景支持。
"debug":单点保存功能,细节详见[单点保存工具 README](./28.debugger_save_instruction.md)
**配置示例**:"level": "L1"。 | 否 | +| enable_dataloader | 自动控制开关,bool 类型,仅 PyTorch 场景支持。可选参数 true(开启)或 false(关闭),默认为 false。配置为 true 后自动识别 step 参数指定的迭代,并在该迭代执行完成后退出训练,此时 start、stop 和 step 函数可不配置,开启该开关要求训练脚本是通过 torch.utils.data.dataloader 方式加载数据。仅支持 PyTorch 单卡训练使用,分布式训练场景下存在数据 dump 不全问题。 **这个特性下个版本将被废弃** | 否 | +| async_dump | 异步 dump 开关,bool 类型。可选参数 true(开启)或 false(关闭),默认为 false。配置为 true 后开启异步 dump,即采集的精度数据会在当前 step 训练结束后统一落盘,训练过程中工具不触发同步操作。由于使用该模式有**显存溢出**的风险,当 task 配置为 tensor 时,即真实数据的异步dump模式,必须配置 [list](#13-task-配置为-tensor) 参数,指定需要 dump 的 tensor 。该模式暂不支持复数类型 tensor
的统计量计算。 | 否 | #### 1.1.1 模块级精度数据 dump 说明 @@ -30,19 +30,24 @@ 模块指的是继承 nn.Module 类(PyTorch场景)或 nn.Cell 类(MindSpore场景)的子类,通常情况下这类模块就是一个小模型,可以被视为一个整体,dump 数据时以模块为粒度进行 dump。 + + ### 1.2 task 配置为 statistics - + - - + + - - - + + +
参数解释是否必选
scopePyTorch 和 MindSpore 动态图场景 dump 范围,list[str] 类型,默认未配置(list 也未配置时表示 dump 所有 API 的数据)。该参数可以在 [ ] 内配置两个模块名或 API 名,用于锁定区间,dump 该范围内的数据;也可以在 [ ] 内配置一个模块名,表示dump此模块内部的API或子模块数据(仅level配置为mix时支持配置一个模块名)。
配置示例:"scope": ["MyModuleOP1", "MyModuleOP2"]或"scope": ["MyModuleOP1"]。与 level 参数取值相关,level 为 L0 和 mix 级别时,可配置模块名;level 为 L1 级别时,可配置 API 名。
scopePyTorch 和 MindSpore 动态图场景 dump 范围,list[str] 类型,默认未配置(list 也未配置时表示 dump 所有 API 的数据)。该参数可以在 [ ] 内配置两个模块名或 API 名,要求列表长度必须为2,需要配置按照工具命名格式的完整模块名或API名称,用于锁定区间,dump 该范围内的数据。
配置示例: + "scope": ["Module.conv1.Conv2d.forward.0", "Module.fc2.Linear.forward.0"], + 或 "scope": ["Cell.conv1.Conv2d.forward.0", "Cell.fc2.Dense.backward.0"], 或"scope": ["Tensor.add.0.forward", "Functional.square.2.forward"]。与 level 参数取值相关,level 为 L0 级别时,可配置模块名;level 为 L1 级别时,可配置 API 名, level为 mix 级别时,可配置为模块名或API名。
list自定义采集的算子列表,list[str] 类型,默认未配置(scope 也未配置时表示 dump 所有 API 的数据),包含以下配置方法:
PyTorch 和 MindSpore 动态图场景配置具体的 API 全称,dump 该 API 数据。
配置示例:"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]。
PyTorch 和 MindSpore 动态图场景指定某一类 API,dump 某一类的 API 级别输入输出数据。
配置示例:"list": ["relu"]。
PyTorch 和 MindSpore 动态图场景配置具体的 API 全称,dump 该 API 数据。在 PyTorch 场景,如果 level 配置成 L2,该配置为必填项。
配置示例:"list": ["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]。
PyTorch 和 MindSpore 动态图场景在level为 mix 级别时可以配置模块名称,dump该模块展开数据 (dump该模块从执行开始到执行结束期间的所有数据)。 +
配置示例:"list": ["Module.module.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"], 或 "list": ["Cell.network_with_loss.language_model.encoder.layers.0.mlp.ParallelMlp.forward.0"]
PyTorch 和 MindSpore 动态图场景指定某一类 API,dump 某一类的 API 级别输入输出数据。
配置示例:"list": ["relu"]。
PyTorch 和 MindSpore 动态图场景在level为 mix 级别时, 会dump名称中包含list中配置的字符串的API数据,还会将名称中包含list中配置的字符串的模块进行展开dump (dump该模块从执行开始到执行结束期间的所有数据)。
MindSpore 静态图场景配置 kernel_name,可以是算子的名称列表,也可以指定算子类型("level": "L2"时不支持),还可以配置算子名称的正则表达式(当字符串符合“name-regex(xxx)”格式时,后台则会将其作为正则表达式。
配置示例:list: ["name-regex(Default/.+)"]
可匹配算子名称以“Default/”开头的所有算子。
data_modedump 数据过滤,str 类型。
PyTorch 场景:支持"all"、"forward"、"backward"、"input"和"output",表示仅采集包含"forward"、"backward"、"input"和"output"字段的数据。
配置示例:"data_mode": ["backward"] 或 "data_mode": ["forward", "backward"]。默认为["all"],即保存所有 dump 的数据。所有参数可以自由组合。
MindSpore 场景:仅支持"all"、"input"和"output"参数,且各参数只能单独配置,不支持自由组合。
配置示例:"data_mode": ["all"]。
summary_mode控制 dump 文件输出的模式,str 类型,仅 PyTorch 场景支持,可选参数:
md5:dump 输出包含 CRC-32 值以及 API 统计信息的 dump.json 文件,用于验证数据的完整性;
statistics:dump 仅输出包含 API 统计信息的 dump.json 文件,默认值。
配置示例:"summary_mode": "md5"。
PyTorch 与 MindSpore 动态图场景:支持"all"、"forward"、"backward"、"input"和"output",除"all"外,其余参数可以自由组合。默认为["all"],即保存所有 dump 的数据。
配置示例:"data_mode": ["backward"] (仅保存反向数据)或 "data_mode": ["forward", "input"](仅保存前向的输入数据)。
MindSpore 静态图场景:仅支持"all"、"input"和"output"参数,且各参数只能单独配置,不支持自由组合。
配置示例:"data_mode": ["all"]。
summary_mode控制 dump 文件输出的模式,str 类型,仅 PyTorch 与 MindSpore 动态图场景支持,可选参数:
md5:dump 输出包含 CRC-32 值以及 API 统计信息的 dump.json 文件,用于验证数据的完整性;
statistics:dump 仅输出包含 API 统计信息的 dump.json 文件,默认值。
配置示例:"summary_mode": "md5"。
MindSpore静态图jit_level=O2场景L2级dump,支持上述配置的同时额外支持配置统计项列表,可选统计项为max、min、mean、l2norm,可从中任意选取组合搭配。其中mean、l2norm的结果为float数据格式。
配置示例:"summary_mode": ["max", "min"]。
**说明**:"summary_mode"配置为"md5"时,所使用的校验算法为CRC-32算法。 @@ -53,9 +58,8 @@ | -------------- | ---------------------- | -------- | | scope | 与[ 1.2 task 配置为 statistics ](#12-task-配置为-statistics)中的解释相同。 | 否 | | list | 与[ 1.2 task 配置为 statistics ](#12-task-配置为-statistics)中的解释相同。 | 否 | -| backward_input | 首次运行训练采集得到反向 API 输入的 dump 文件,list[str] 类型,仅支持 PyTorch 场景的 kernel dump(即level配置为"L2")且需要配置scope参数指定需要dump的反向API名,默认未配置。例如,若需要采集 Functional.conv2d.1 API 反向过程的输入输出,则需要在 dump 目录下查找命名包含 Functional.conv2d.1、backward 和 input 字段的 dump 文件。
**配置示例**:"backward_input": ["./npu_dump/step0/rank0/dump_tensor_data/Functional.conv2d.1.backward.input.0.pt"],仅支持配置一个反向输入API文件名。 | 否 | | data_mode | 与[ 1.2 task 配置为 statistics ](#12-task-配置为-statistics)中的解释相同 | 否 | -| file_format | tensor 数据的保存格式,str 类型,仅支持 MindSpore 静态图场景的 L2,不支持 L0 和 L1。可选参数:
"bin":dump 的 tensor 文件为二进制格式;
"npy":dump 的 tensor 文件后缀为 .npy,默认值。 | 否 | +| file_format | tensor 数据的保存格式,str 类型,仅支持 MindSpore 静态图场景的 L2 级别配置该字段,其他场景不生效。可选参数:
"bin":dump 的 tensor 文件为二进制格式;
"npy":dump 的 tensor 文件后缀为 .npy,默认值。 | 否 | | online_run_uta | 在线预检模式开关,bool 类型,可选参数 true(开启)、false(关闭),默认未配置,表示关闭。配置为 true 表示开启在线预检。| 否 | | nfs_patha | 在线预检模式共享存储目录路径,str 类型,用于 GPU 设备和 NPU 设备间进行通信。仅在 online_run_ut 字段配置为 true 时生效,配置该参数后 host 和 port 不生效。 | 否 | | hosta | 在线预检模式局域网场景信息接收端 IP,str 类型,用于 GPU 设备和 NPU 设备间进行通信,NPU 侧须配置为 GPU 侧的局域网 IP 地址。仅在 online_run_ut 字段配置为 true 时生效,局域网场景时,不能配置 nfs_path 参数,否则局域网场景不生效。 | 否 | @@ -82,7 +86,7 @@ ### 1.5 task 配置为 overflow_check -PyTorch 与 MindSpore 动态图场景下,"level"须为"L1";MindSpore 静态图场景下,"level"须为"L2",且模型编译优化等级(jit_level)须为"O2"。 +PyTorch 与 MindSpore 动态图场景下,"level"须为"L0"或"L1";MindSpore 静态图场景下,"level"须为"L2",且模型编译优化等级(jit_level)须为"O2"。 | 参数 | 解释 | 是否必选 | | ------------- | ---------------------- | -------- | @@ -109,7 +113,7 @@ PyTorch 与 MindSpore 动态图场景下,"level"须为"L1";MindSpore 静态 pert_mode无标杆扰动因子,str 类型。可选参数:
"improve_precision":对输入做升精度,默认值;
"add_noise":对输入增加噪声;
"no_change":不加扰动直接二次执行;
"bit_noise":输入的末位比特翻转,MindSpore 场景不支持 BF16 类型的向量;
"change_value":输入的张量首尾值调换;
"to_cpu":在 CPU 等价执行(仅 PyTorch 场景支持)。
"auto_fix":使用scale、切精度、同步等方法快速排除和恢复算子问题。
配置示例:"pert_mode": "improve_precision"。否 handler_type处理类型,可选参数:
"check":进行无标杆比对检查,默认值;
"fix":将扰动后的 API 输出结果覆盖原始 API 输出结果,尝试将 Loss 曲线恢复正常,该模式下不支持预热功能与反向过程,且仅支持"improve_precision"、"to_cpu"( PyTorch 场景)、"auto_fix"( PyTorch 场景)三种扰动因子。
配置示例:"handler_type": "check"。否 fuzz_level无标杆数据 dump 级别,即选择比对结果文件应输出的表头属性,当前仅支持取值为:"L1"。输出结果详见 1.6.1 无标杆比对数据存盘格式。否 - fuzz_stage比对过程,选择对 API 前向或反向进行无标杆比对,可选参数:
"forward":前向,默认值;
"backward":反向, 仅 PyTorch 场景支持。当 fuzz_stage 为 "backward" 时,handler_type 只能为 "check"。
配置示例:"fuzz_stage": "backward"。否 + fuzz_stage比对过程,选择对 API 前向或反向进行无标杆比对,可选参数:
"forward":前向,默认值;
"backward":反向。当 fuzz_stage 为 "backward" 时,handler_type 只能为 "check"。
配置示例:"fuzz_stage": "backward"。否 if_preheat预热功能(仅 PyTorch 场景支持),bool 类型。开启功能后工具可以根据每次迭代的输出调整精度算法的阈值,从而更准确地找出存在精度问题的 API。当"handler_type": "fix"时,不支持预热。可选参数:
true(开启)或 false(关闭),默认关闭。
配置示例:"if_preheat": "true"。否 preheat_step开启预热的迭代数量(仅 PyTorch 场景支持),int 类型,默认值为 15。须配置 "if_preheat": "true"。否 max_sample每个算子预热的采样次数的最大阈值(仅 PyTorch 场景支持),int 类型,默认值为 20。须配置 "if_preheat": "true"。否 @@ -132,3 +136,31 @@ PyTorch 与 MindSpore 动态图场景下,"level"须为"L1";MindSpore 静态 | dtype | 输入的 dtype,string 类型。 | | shape | 输入的 shape,tuple 类型。 | | output_index | 如果输出为列表或元组,其中一个元素检测不一致,则会有该元素的 index,否则为空,int 类型。 | + + +### 1.7 task 配置为 grad_probe + + **参数说明** + + | 参数 | 说明 | 输入类型 | 是否必选 | + |--------------------------------|-----------------------------------|-----------------|----------| + | task | 填为"grad_probe"。 | str | 是 | + | dump_path | 输出目录。如果不存在就会创建一个新目录。 | str | 是 | + | rank | rank id列表,在多卡场景下,表示需要导出梯度数据的进程的rank id。列表为空就表示导出所有rank的数据。默认为空。采集特定 rank 时,须指定为训练脚本中存在的 rank_id,可逐个配置,也可以指定范围。
**配置示例**:"rank": [0, 1 , 2, "4-6"]。(MindSpore静态图模式下,当前暂不支持指定rank功能) | list[Union[int, str]] | 否 | + | step | step列表,表示需要导出数据的step列表。列表为空就表示导出所有step的数据。默认为空。采集特定 step 时,须指定为训练脚本中存在的 step,可逐个配置,也可以指定范围。
**配置示例**:"step": [0, 1 , 2, "4-6"]。(MindSpore静态图模式下,当前暂不支持指定step功能) | list[Union[int, str]] | 否 | + | grad_level | 输出级别。决定导出数据的详细程度,级别越大导出数据越详细。可取值:L0, L1, L2。默认L1。|str | 否 | + | param_list | 权重名称列表,表示需要监控的权重。列表为空就表示监控所有权重。默认为空。 | List[str] | 否 | + | bounds | 区间列表,用来划分区间以统计数值的分布。需要保证由数据小到大排列,并且列表中的元素需要在int64取值范围内。可以使用默认值[-1, 0, 1]。 | List[float, int] | 否 | + + + **不同级别的level的导出数据** + + + | 级别 | 特征数据表头 | 是否有方向数据 | + | ---- | ------------------------------------------------------------ | -------------- | + | L0 | ("param_name", "MD5", "max", "min", "norm", "shape") | 否 | + | L1 | ("param_name", "max", "min", "norm", "shape") | 是 | + | L2 | ("param_name", *intervals, "=0", "max", "min", "norm", "shape") | 是 | + + intervals就是根据值分布bounds划分出的区间。 + MindSpore静态图模式下,L0级别中暂不支持"MD5" diff --git a/debug/accuracy_tools/msprobe/docs/03.config_examples.md b/debug/accuracy_tools/msprobe/docs/03.config_examples.md index a3c50cd9efeb71e5b6855cb9a12329d524d6755e..542250fac243f3ab2f1d0aff87bc509ac7c1a675 100644 --- a/debug/accuracy_tools/msprobe/docs/03.config_examples.md +++ b/debug/accuracy_tools/msprobe/docs/03.config_examples.md @@ -13,7 +13,6 @@ "rank": [], "step": [], "level": "L1", - "enable_dataloader": false, "statistics": { "scope": [], @@ -33,13 +32,11 @@ "rank": [], "step": [], "level": "L1", - "enable_dataloader": false, "tensor": { "scope": [], "list":[], - "data_mode": ["all"], - "backward_input": [] + "data_mode": ["all"] } } ``` @@ -53,7 +50,6 @@ "rank": [], "step": [], "level": "L1", - "enable_dataloader": false, "run_ut": { "white_list": [], @@ -72,7 +68,6 @@ "rank": [], "step": [], "level": "L1", - "enable_dataloader": false, "overflow_check": { "overflow_nums": 1 @@ -89,7 +84,6 @@ "rank": [], "step": [], "level": "L1", - "enable_dataloader": false, "free_benchmark": { "scope": [], @@ -106,6 +100,18 @@ } ``` +### 1.6 task 配置为 structure + +```json +{ + "task": "structure", + "dump_path": "/home/data_dump", + "rank": [], + "step": [], + "level": "mix" +} +``` + ## 2 
MindSpore 静态图场景 ### 2.1 task 配置为 statistics @@ -138,8 +144,7 @@ "tensor": { "list":[], - "data_mode": ["all"], - "backward_input": [] + "data_mode": ["all"] } } ``` @@ -235,3 +240,15 @@ } } ``` + +### 3.5 task 配置为 structure + +```json +{ + "task": "structure", + "dump_path": "/home/data_dump", + "rank": [], + "step": [], + "level": "mix" +} +``` diff --git a/debug/accuracy_tools/msprobe/docs/04.acl_config_examples.md b/debug/accuracy_tools/msprobe/docs/04.acl_config_examples.md deleted file mode 100644 index 2aafe939c5f686ae22b9aa5178799c049bd5a064..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/docs/04.acl_config_examples.md +++ /dev/null @@ -1,78 +0,0 @@ -# acl_config.json 配置文件说明 - -当 PyTorch 场景 level 取"L2"时,须配置 acl_config 参数,并指定 acl_config.json 文件(用于指定 L2 kernel 级 dump 的配置),此时 **config.json** 文件配置示例如下: - -## 1 前向 kernel dump 配置示例 - -"scope"配置为前向 API 名称,仅支持配置一个 API。 - -```json -{ - "task": "tensor", - "dump_path": "/home/data_dump", - "level": "L2", - "rank": [0], - "step": [0], - "is_deterministic": false, - - "tensor": { - "scope": ["Tensor.__mul__.10.forward"], - "list":[], - "data_mode": ["all"], - "backward_input": [""], - "file_format": "npy" - }, - "acl_config": "acl_config.json" -} -``` - -## 2 反向 kernel dump 配置示例 - -执行反向 kernel dump 前需要先使用工具 dump 该 API 的反向输入,保存 pt 文件,"backward_input"参数中传入该 pt 文件路径。 - -"scope"配置为反向 API 名称,仅支持配置一个 API。 - -```json -{ - "task": "tensor", - "dump_path": "/home/data_dump", - "level": "L2", - "rank": [0], - "step": [0], - "is_deterministic": false, - - "tensor": { - "scope": ["Tensor.__mul__.10.backward"], - "list":[], - "data_mode": ["all"], - "backward_input": ["Tensor.__mul__.10.backward.input.0.pt"], - "file_format": "npy" - }, - "acl_config": "acl_config.json" -} -``` - -## 3 acl_config.json 配置示例 - -该文件须自行创建,配置示例如下: - -```json -{ - "dump": - { - "dump_list":[], - "dump_path":"./dump/output", - "dump_mode":"all", - "dump_op_switch":"on" - } -} -``` - -**acl_config.json 参数说明** - -| 字段名 | 解释 | -| 
-------------- | ------------------------------------------------------------ | -| dump_list | 待 dump 数据的 API 模型。为空,无需配置。 | -| dump_path | dump 数据文件存储到运行环境的目录,主要配置的是 kernel 级数据的存放路径。支持配置绝对路径或相对路径。dump_path 须为已存在目录。 | -| dump_mode | dump 数据模式,可取值,
output:dump API 的输出数据,默认值;
input:dump API 的输入数据;
all:dump API 的输入、输出数据。 | -| dump_op_switch | 单 API 模型 dump 数据开关,可取值,
off:关闭单 API 模型 dump,默认值;
on:开启单 API 模型 dump。 | diff --git a/debug/accuracy_tools/msprobe/docs/04.kernel_dump_PyTorch.md b/debug/accuracy_tools/msprobe/docs/04.kernel_dump_PyTorch.md new file mode 100644 index 0000000000000000000000000000000000000000..ce3fd54f5a6741b262f6248f70a9f1166ca0b4a6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/04.kernel_dump_PyTorch.md @@ -0,0 +1,73 @@ +# PyTorch 场景的 kernel dump 说明 + +当使用 msprobe 数据采集功能时,level 配置为 "L2" 表示采集 kernel 层级的算子数据,仅支持昇腾 NPU 平台。 + +本文主要介绍 kernel dump 的配置示例和采集结果介绍, msprobe 数据采集功能的详细使用参考 《[PyTorch 场景的精度数据采集](./05.data_dump_PyTorch.md)》。 + +## 1 kernel dump 配置示例 + +使用 kernel dump 时,list 必须要填一个 API 名称,kernel dump 目前每个 step 只支持采集一个 API 的数据。 +API 名称填写参考 L1 dump 结果文件 dump.json 中的API名称,命名格式为:`{api_type}.{api_name}.{API调用次数}.{forward/backward}`。 + +```json +{ + "task": "tensor", + "dump_path": "/home/data_dump", + "level": "L2", + "rank": [], + "step": [], + "tensor": { + "scope": [], + "list": ["Functional.linear.0.backward"] + } +} +``` + +## 2 结果文件介绍 + +### 2.1 采集结果说明 + +如果 API kernel 级数据采集成功,会打印以下信息: + +```bash +The kernel data of {api_name} is dumped successfully. +``` + +注意:如果打印该信息后,没有数据生成,参考**常见问题3.1**进行排查。 + +如果 kernel dump 遇到不支持的 API, 会打印以下信息: + +```bash +The kernel dump does not support the {api_name} API. +``` + +其中 {api_name} 是对应溢出的 API 名称。 + +### 2.2 输出文件说明 +kernel dump 采集成功后,会在指定的 dump_path 目录下生成如下文件: + +``` +├── /home/data_dump/ +│ ├── step0 +│ │ ├── 20241201103000 # 日期时间格式,表示2024-12-01 10:30:00 +│ │ │ ├── 0 # 表示 device id +│ │ │ │ ├──{op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} # kernel 层算子数据 +│ │ │ ... +│ │ ├── kernel_config_{device_id}.json # kernel dump 在接口调用过程中生成的中间文件,一般情况下无需关注 +│ │ ... +│ ├── step1 +│ ... +``` +成功采集到数据后,可以使用 msprobe 工具提供的《[PyTorch 场景的数据解析](./14.data_parse_PyTorch.md)》功能分析数据。 + +## 3 常见问题 + +#### 3.1 采集结果文件为空,有可能是什么原因? + +1. 首先需要确认工具使用方式、配置文件内容、list 填写的 API 名称格式是否都正确无误。 + +2. 其次需要确认 API 是否运行在昇腾 NPU 上,如果是运行在其他设备上则不会存在 kernel 级数据。 + +3. 
如果排除上述两点仍然没有数据,您可以使用《[Ascend Extension for PyTorch 插件](https://gitee.com/ascend/pytorch)》提供的 +torch_npu.npu 接口进行 kernel 层数据采集,工具的 kernel dump 也是基于其中的init_dump、set_dump和finalize_dump三个子接口实现的。 +torch_npu.npu 接口详细描述见《[torch_npu.npu API 概述](https://www.hiascend.com/document/detail/zh/Pytorch/60RC3/apiref/apilist/ptaoplist_000192.html)》。 diff --git a/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md b/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md index dc2e544d878483bfe9a93f81751f40fc1983c53f..c2e33436e534f8c9bfbbd1a2a1b1506aa43b9e50 100644 --- a/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md +++ b/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md @@ -2,6 +2,8 @@ msprobe 工具主要通过在训练脚本内添加 dump 接口、启动训练的方式采集精度数据。 +dump的'tensor'模式采集数据量大小,可以参考[数据量基线](./26.data_dump_PyTorch_baseline.md)。 + 本工具提供固定的 API 支持列表,若需要删除或增加 dump 的 API,可以在 msprobe/pytorch/hook_module/support_wrap_ops.yaml 文件内手动修改,如下示例: ```yaml @@ -11,6 +13,8 @@ functional: # functional为算子类别,找到对应的类别,在该类别 - conv3d ``` +删除API的场景:部分模型代码逻辑会存在API原生类型校验,工具执行dump操作时,对模型的API封装可能与模型的原生API类型不一致,此时可能引发校验失败,详见《[FAQ](FAQ.md)》中“异常情况”的第10和11条。 + ## 1 接口介绍 ### 1.1 PrecisionDebugger @@ -23,8 +27,12 @@ functional: # functional为算子类别,找到对应的类别,在该类别 PrecisionDebugger(config_path=None, task=None, dump_path=None, level=None, model=None, step=None) ``` -1. config_path:指定 dump 配置文件路径;model:指定具体的 torch.nn.Module,默认未配置,level 配置为"L0"或"mix"时,必须在该接口或 **start** 接口中配置该参数。其他参数均在 [config.json](../config.json) 文件中可配,详细配置可见 [config.json 介绍](./02.config_introduction.md)。 -2. 此接口的参数均不是必要,且优先级高于 [config.json](../config.json) 文件中的配置,但可配置的参数相比 config.json 较少。 +1. config_path:指定 dump 配置文件路径; +2. model:指定需要采集 Module 级数据的模型,支持传入 torch.nn.Module 或 list[torch.nn.Module] 类型,默认未配置。 +level 配置为"L0"或"mix"时,必须在该接口或 **start** 接口中配置该参数。该参数在将来会从该接口移除,建议在 **start** 接口中配置该参数。 +3. 
其他参数均在 [config.json](../config.json) 文件中可配,详细配置可见 [config.json 介绍](./02.config_introduction.md)。 + +此接口的参数均不是必要,且优先级高于 [config.json](../config.json) 文件中的配置,但可配置的参数相比 config.json 较少。 ### 1.2 start @@ -36,12 +44,18 @@ PrecisionDebugger(config_path=None, task=None, dump_path=None, level=None, model debugger.start(model=None) ``` -1. model:指定具体的 torch.nn.Module,默认未配置,level 配置为"L0"或"mix"时,必须在该接口或 **PrecisionDebugger** 接口中配置该参数。 +1. model:指定需要采集 Module 级数据的模型,支持传入 torch.nn.Module、list[torch.nn.Module]或Tuple[torch.nn.Module] 类型,默认未配置。 +level 配置为"L0"或"mix"时,必须在该接口或 **PrecisionDebugger** 接口中配置该参数。 本接口中的 model 比 PrecisionDebugger 中 model 参数优先级更高,会覆盖 PrecisionDebugger 中的 model 参数。 ### 1.3 stop -**功能说明**:停止精度数据采集。在 **start** 函数之后的任意位置添加。若需要 dump 反向数据,则需要添加在反向计算代码(如,loss.backward)之后。使用示例可参见 [2.1 快速上手](#21-快速上手)和 [2.2 采集完整的前反向数据](#22-采集完整的前反向数据)。 +**功能说明**:停止精度数据采集。在 **start** 函数之后的任意位置添加。 +若 **stop** 函数添加在反向计算代码(如loss.backward)之后,则会采集 **start** 和该函数之间的前反向数据。 +若 **stop** 函数添加在反向计算代码之前,则需要将 [**step**](#15-step) 函数添加到反向计算代码之后,才能采集 **start** 和该函数之间的前反向数据。 +使用示例可参见 [2.1 快速上手](#21-快速上手) 和 [2.2 采集完整的前反向数据](#22-采集完整的前反向数据)。 + +**注意**:**stop** 函数必须调用,否则可能导致精度数据落盘不全。 **原型**: @@ -51,7 +65,8 @@ debugger.stop() ### 1.4 forward_backward_dump_end -**功能说明**:停止精度数据采集。用于 dump 指定代码的前反向数据。在 **start** 函数之后,反向计算代码(如,loss.backward)之前的任意位置添加,可以采集 **start** 函数和该函数之间的前反向数据,可以通过调整 **start** 函数与该函数的位置,来指定需要 dump 的代码块。要求 **stop** 函数添加在反向计算代码(如,loss.backward)之后,此时该函数与 **stop** 函数之间的代码不会被 dump。使用示例可参见 [2.3 采集指定代码块的前反向数据](#23-采集指定代码块的前反向数据) +**功能说明**:停止精度数据采集。与 **stop** 函数功能相同,该函数在将来会被移除,建议使用 **stop** 函数。 +使用示例可参见 [2.3 采集指定代码块的前反向数据](#23-采集指定代码块的前反向数据)。 **原型**: @@ -61,7 +76,8 @@ forward_backward_dump_end() ### 1.5 step -**功能说明**:更新 dump 参数。在最后一个 **stop** 函数后或一个 step 结束的位置添加。需要与 **start** 函数一起添加在 for 循环内。 +**功能说明**:结束一个 step 的数据采集,完成所有数据落盘并更新 dump 参数。在一个 step 结束的位置添加,且必须在 **stop** 函数之后的位置调用。 +该函数需要配合 **start** 和 **stop** 函数使用,尽量添加在反向计算代码(如loss.backward)之后,否则可能会导致反向数据丢失。使用示例可参见[2.2 采集完整的前反向数据](#22-采集完整的前反向数据)。 
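上述 **start**、**stop**、**step** 三个接口的调用顺序约束,可以用下面一段纯 Python 示意代码来理解。其中 `SketchDebugger` 是为说明调用契约而虚构的示例类,并非 msprobe 的实际实现,仅用于演示"先 start,再 stop,最后 step"的推荐顺序:

```python
# 示意代码:用一个虚构的 SketchDebugger 模拟 start/stop/step 的调用顺序约束。
# 注意:这不是 msprobe 的真实实现,仅用于说明接口调用契约。
class SketchDebugger:
    def __init__(self):
        self.collecting = False   # 是否处于采集区间(start 与 stop 之间)
        self.stopped = False      # 当前 step 内是否已调用 stop
        self.steps_flushed = 0    # 已完成数据落盘的 step 数

    def start(self):
        """开启一个采集区间。"""
        self.collecting = True
        self.stopped = False

    def stop(self):
        """关闭采集区间。stop 必须调用,否则数据可能落盘不全。"""
        if not self.collecting:
            raise RuntimeError("stop() 必须在 start() 之后调用")
        self.collecting = False
        self.stopped = True

    def step(self):
        """结束一个 step:完成落盘并更新 dump 参数,必须在 stop 之后调用。"""
        if not self.stopped:
            raise RuntimeError("step() 必须在 stop() 之后调用")
        self.steps_flushed += 1
        self.stopped = False


debugger = SketchDebugger()
for _ in range(3):          # 模拟 3 个训练 step
    debugger.start()        # 开启数据采集
    # ... 前向计算与 loss.backward() ...
    debugger.stop()         # 关闭数据采集
    debugger.step()         # 结束一个 step 的采集
print(debugger.steps_flushed)  # 输出:3
```

示意代码中,step() 在 stop() 之前调用会直接抛出异常;msprobe 的实际行为是可能导致反向数据丢失或落盘不全,不一定报错,因此请按上述顺序在训练循环中调用这三个接口。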
**原型**: @@ -101,13 +117,18 @@ module_dump_end() **原型**: ```python -seed_all(seed=1234, mode=False) +seed_all(seed=1234, mode=False, rm_dropout=True) ``` **参数说明**: 1. seed: 随机性种子。参数示例: seed=1000。默认值:1234。非必选 2. mode:确定性计算模式。可配置True或False。参数示例:mode=True。默认为False。非必选(注意:确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启) +3. rm_dropout:控制dropout失效的开关。可配置 True 或 False,默认值:True,非必选。参数示例:rm_dropout=True。 +该参数设置为 True 后, 工具会自动将 `torch.nn.functional.dropout`、`torch.nn.functional.dropout2d`、`torch.nn.functional.dropout3d`、`torch.nn.Dropout`、`torch.nn.Dropout2d`、`torch.nn.Dropout3d` +的接口参数 p 置为0,以避免因随机dropout造成的网络随机性。 注意:通过rm_dropout控制dropout失效或生效需要在初始化dropout实例前调用才能生效。 + +当前工具 dump 默认不会固定随机性和使 dropout 失效,若希望每次采集的数据保持一致,建议在 dump 数据前调用 seed_all 接口。 seed_all 函数可固定随机数的范围如下表。 @@ -126,28 +147,45 @@ seed_all 函数可固定随机数的范围如下表。 | torch.backends.cudnn.enable=False | 关闭 cuDNN | | torch.backends.cudnn.benchmark=False | cuDNN 确定性地选择算法 | | torch.backends.cudnn.deterministic=True | cuDNN 仅使用确定性的卷积算法 | +| torch.nn.functional.dropout | 将 dropout 的接口参数 p 置为0 | +| torch.nn.functional.dropout2d | 将 dropout2d 的接口参数 p 置为0 | +| torch.nn.functional.dropout3d | 将 dropout3d 的接口参数 p 置为0 | +| torch.nn.Dropout | 将 Dropout 的接口参数 p 置为0 | +| torch.nn.Dropout2d | 将 Dropout2d 的接口参数 p 置为0 | +| torch.nn.Dropout3d | 将 Dropout3d 的接口参数 p 置为0 | 需要保证 CPU 或 GPU 以及 NPU 的模型输入完全一致,dump 数据的比对才有意义,seed_all 并不能保证模型输入完全一致,如下表所示场景需要保证输入的一致性。 | 场景 | 固定方法 | | --------------- | ------------- | | 数据集的 shuffle | 关闭 shuffle。 | -| dropout | 关闭 dropout。 | 关闭 shuffle 示例: ```python train_loader = torch.utils.data.DataLoader( -train_dataset, -batch_size = batch_size, -shuffle = False, -num_workers = num_workers + train_dataset, + batch_size=batch_size, + shuffle=False, + num_workers=num_workers ) ``` -关闭 dropout: +### 1.9 save -在使用 `from msprobe.pytorch import PrecisionDebugger` 后,工具会自动将 `torch.nn.functional.dropout`、`torch.nn.functional.dropout2d`、`torch.nn.functional.dropout3d`、`torch.nn.Dropout`、`torch.nn.Dropout2d`、`torch.nn.Dropout3d` 的接口参数 
p 置为0. +**功能说明**:单点保存网络执行过程中正反向数值,并以统计值/张量文件落盘。 + +**原型**: +```python +save(variable, name, save_backward=True) +``` + +**参数说明**: +| 参数名称 | 参数含义 | 支持数据类型 | 是否必选| +| ---------- | ------------------| ------------------- | ------------------- | +| variable | 需要保存的变量 |dict, list, tuple, torch.tensor, int, float, str | 是 | +| name | 指定的名称 | str | 是 | +| save_backward | 是否保存反向数据 | boolean | 否 | ## 2 示例代码 @@ -159,38 +197,43 @@ num_workers = num_workers # 根据需要import包 import torch import torch.nn as nn -import torch_npu # 需安装 torch_npu import torch.nn.functional as F + +# 导入工具的数据采集接口 from msprobe.pytorch import PrecisionDebugger, seed_all + # 在模型训练开始前固定随机性 seed_all() +# 在模型训练开始前实例化PrecisionDebugger +debugger = PrecisionDebugger(config_path='./config.json') -torch.npu.set_device("npu:0") # 定义网络 class ModuleOP(nn.Module): def __init__(self) -> None: super().__init__() - self.linear_1 = nn.Linear(in_features=8,out_features=4) - self.linear_2 = nn.Linear(in_features=4,out_features=2) + self.linear_1 = nn.Linear(in_features=8, out_features=4) + self.linear_2 = nn.Linear(in_features=4, out_features=2) - def forward(self,x): + def forward(self, x): x1 = self.linear_1(x) x2 = self.linear_2(x1) r1 = F.relu(x2) return r1 if __name__ == "__main__": - module = ModuleOP() - # 注册工具 - debugger = PrecisionDebugger('./config.json', model=module) - debugger.start() - x = torch.randn(10,8) - out = module(x) - loss = out.sum() - loss.backward() - debugger.stop() + module = ModuleOP() + # 开启数据 dump + debugger.start(model=module) + x = torch.randn(10, 8) + out = module(x) + loss = out.sum() + loss.backward() + + # 关闭数据 dump + debugger.stop() ``` + ### 2.2 采集完整的前反向数据 ```Python @@ -203,19 +246,20 @@ debugger = PrecisionDebugger(config_path="./config.json", dump_path="./dump_path # ... # 数据集迭代的位置一般为模型训练开始的位置 for data, label in data_loader: - debugger.start() # 开启数据dump - # 如下是模型每个step执行的逻辑 + debugger.start() # 开启数据dump + # 如下是模型每个step执行的逻辑 output = model(data) #... 
loss.backward() - debugger.stop() # 关闭数据dump - debugger.step() # 结束一个step的dump + debugger.stop() # 关闭数据dump + debugger.step() # 结束一个step的dump ``` ### 2.3 采集指定代码块的前反向数据 ```Python from msprobe.pytorch import PrecisionDebugger, seed_all + # 在模型训练开始前固定随机性 seed_all() # 请勿将PrecisionDebugger的初始化流程插入到循环代码中 @@ -225,14 +269,14 @@ debugger = PrecisionDebugger(config_path="./config.json", dump_path="./dump_path # ... # 数据集迭代的位置一般为模型训练开始的位置 for data, label in data_loader: - debugger.start() # 开启数据dump - # 如下是模型每个step执行的逻辑 + debugger.start() # 开启数据dump + # 如下是模型每个step执行的逻辑 output = model(data) - debugger.forward_backward_dump_end() # 插入该函数到start函数之后,只dump start函数到该函数之间代码的前反向数据,本函数到stop函数之间的数据则不dump - #... + + debugger.forward_backward_dump_end() # 插入该函数到start函数之后,只dump start函数到该函数之间的前反向数据。 + # ... loss.backward() - debugger.stop() # 关闭数据dump - debugger.step() # 结束一个step的dump + debugger.step() # 结束一个step的dump ``` ### 2.4 采集函数模块化数据 @@ -241,11 +285,14 @@ for data, label in data_loader: # 根据需要import包 import torch import torch.nn as nn -import torch_npu # 需安装 torch_npu import torch.nn.functional as F + +# 导入工具的数据采集接口 from msprobe.pytorch import PrecisionDebugger, module_dump, module_dump_end -torch.npu.set_device("npu:0") +# 在模型训练开始前实例化PrecisionDebugger +debugger = PrecisionDebugger(config_path='./config.json') + # 定义网络 class ModuleOP(nn.Module): def __init__(self) -> None: @@ -261,37 +308,38 @@ class ModuleOP(nn.Module): if __name__ == "__main__": module = ModuleOP() - # 注册工具 - debugger = PrecisionDebugger(config_path='./config.json') - debugger.start() # 开启数据dump - - x = torch.randn(10, 8) + # 开启数据dump + debugger.start() + x = torch.randn(10, 8) # ... 
# start和module_dump接口之间的数据正常dump
     module_dump(module, "MyModuleOP") # 开启模块级精度数据dump
-    out = module(x) # module内部的child modules或API不会被dump
+    out = module(x) # module内部的child modules或API将不会被dump
     module_dump_end() # 关闭模块级精度数据dump
     loss = out.sum() # module_dump_end和stop接口之间的数据正常dump
-    loss.backward()
-
-    debugger.stop() # 关闭数据dump
+    loss.backward()
+    # 关闭数据dump
+    debugger.stop()
 ```
 
 ## 3 dump 结果文件介绍
 
 训练结束后,工具将 dump 的数据保存在 dump_path 参数指定的目录下。目录结构示例如下:
 
-```Python
+```lua
├── dump_path
│ ├── step0
│ | ├── rank0
│ | │ ├── dump_tensor_data
| | | | ├── Tensor.permute.1.forward.pt
-| | | | ├── MyModule.0.forward.input.pt # 开启模块级精度数据dump时存在模块级的dump数据文件
+| | | | ├── Functional.linear.5.backward.output.pt # 命名格式为{api_type}.{api_name}.{API调用次数}.{forward/backward}.{input/output}.{参数序号}, 其中,“参数序号”表示该API的第n个输入或输出,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个元素。
| | | | ...
-| | | | └── Fcuntion.linear.5.backward.output.pt
-│ | | ├── dump.json # 保存前反向算子、算子的统计量信息或溢出算子信息。包含dump数据的API名称(命名格式为:`{api_type}_{api_name}_{API调用次数}_{前向反向}_{input/output}.{参数序号}`)、dtype、 shape、各数据的max、min、mean、L2norm统计信息以及当配置summary_mode="md5"时的CRC-32数据。其中,“参数序号”表示该API下的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个子参数;L2norm表示L2范数(平方根)
-│ | | ├── stack.json # 算子调用栈信息
-│ | | └── construct.json # 分层分级结构
+| | | | ├── Module.conv1.Conv2d.forward.0.input.0.pt # 命名格式为{Module}.{module_name}.{class_name}.{forward/backward}.{调用次数}.{input/output}.{参数序号}, 其中,“参数序号”表示该Module的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该Module的第1个参数的第1个元素。
+| | | | ├── Module.conv1.Conv2d.forward.0.parameters.bias.pt # 模块参数数据:命名格式为{Module}.{module_name}.{class_name}.forward.{调用次数}.parameters.{parameter_name}。
+| | | | └── Module.conv1.Conv2d.parameters_grad.weight.pt # 模块参数梯度数据:命名格式为{Module}.{module_name}.{class_name}.parameters_grad.{parameter_name}。因为同一模块的参数使用同一梯度进行更新,所以参数梯度文件名不包含调用次数。
+| | | | # 
当dump时传入的model参数为List[torch.nn.Module]或Tuple[torch.nn.Module]时,模块级数据的命名中包含该模块在列表中的索引index,命名格式为{Module}.{index}.*,*表示以上三种模块级数据的命名格式,例如:Module.0.conv1.Conv2d.forward.0.input.0.pt。 +│ | | ├── dump.json +│ | | ├── stack.json +│ | | └── construct.json │ | ├── rank1 | | | ├── dump_tensor_data | | | | └── ... @@ -305,6 +353,12 @@ if __name__ == "__main__": │ | ├── ... │ ├── step2 ``` +* `rank`:设备 ID,每张卡的数据保存在对应的 `rank{ID}` 目录下。非分布式场景下没有 rank ID,目录名称为 rank。 +* `dump_tensor_data`:保存采集到的张量数据。 +* `dump.json`: 保存API或Module前反向数据的统计量信息。包含dump数据的API名称或Module名称,各数据的dtype、 shape、max、min、mean、L2norm(L2范数,平方根)统计信息以及当配置summary_mode="md5"时的CRC-32数据。具体介绍可参考[dump.json文件说明](./27.dump_json_instruction.md#1-dumpjson文件介绍pytorch)。 +* `stack.json`:API/Module的调用栈信息。 +* `construct.json`:分层分级结构,level为L1时,construct.json内容为空。 + dump 过程中,pt 文件在对应算子或者模块被执行后就会落盘,而 json 文件则需要在正常执行 PrecisionDebugger.stop() 后才会写入完整数据,异常的程序终止会保存终止前被执行算子的相关 npy 文件,可能会导致 json 文件中数据丢失。 @@ -321,6 +375,3 @@ pt 文件保存的前缀和 PyTorch 对应关系如下: | VF | torch._VF | | Aten | torch.ops.aten | | Distributed | torch.distributed | - - - diff --git a/debug/accuracy_tools/msprobe/docs/06.data_dump_MindSpore.md b/debug/accuracy_tools/msprobe/docs/06.data_dump_MindSpore.md index 9c3ee09386c43edd33373e9516c58bdf470d675a..96d37c170face1a0a866753ae3b9e67c70a820f1 100644 --- a/debug/accuracy_tools/msprobe/docs/06.data_dump_MindSpore.md +++ b/debug/accuracy_tools/msprobe/docs/06.data_dump_MindSpore.md @@ -1,37 +1,84 @@ -# MindSpore 场景的精度数据采集 -msprobe 工具主要通过在训练脚本内添加 dump 接口、启动训练的方式采集精度数据。目前,静态图场景仅支持 kernel 级数据采集,对应 config.json 配置中的 "L2" level;动态图场景支持cell、API、kernel级数据采集,对应 config.json 配置中的 "L0"、"L1" 、"L2"、"mix" level。 +# msprobe 工具 MindSpore场景精度数据采集指南 -需要注意,**动态图 kernel 级**("L2" level)精度数据采集对象为被 PSJit 或 PIJit 装饰的 Cell 或 function 内的算子,其被装饰部分实际以**静态图**模式执行,所以此场景下的工具使用方式与静态图场景完全相同。下文无特殊说明时,介绍的动态图场景不包括 kernel 级("L2" level)dump 情形。 -精度数据采集功能的配置示例见[MindSpore 静态图场景下 task 配置为 
statistics](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/03.config_examples.md#21-task-%E9%85%8D%E7%BD%AE%E4%B8%BA-statistics)、[MindSpore 静态图场景下 task 配置为 tensor](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/03.config_examples.md#22-task-%E9%85%8D%E7%BD%AE%E4%B8%BA-tensor)、[MindSpore 动态图场景下 task 配置为 statistics](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/03.config_examples.md#31-task-%E9%85%8D%E7%BD%AE%E4%B8%BA-statistics)、[MindSpore 动态图场景下 task 配置为 tensor](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/03.config_examples.md#32-task-%E9%85%8D%E7%BD%AE%E4%B8%BA-tensor)。 +## 1. 专业名词解释 -动态图 API 级 dump 时,本工具提供固定的 API 支持列表,仅支持对列表中的 API 进行精度数据采集。一般情况下,无需修改该列表,而是通过config.json中的scope/list字段进行 dump API 指定。若需要改变 API 支持列表,可以在 `msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml` 文件内手动修改,如下示例: +* **静态图**:在编译时就确定网络结构,静态图模式拥有较高的训练性能,但难以调试。 +* **动态图**:运行时动态构建网络,相较于静态图模式虽然易于调试,但难以高效执行。 +* **高阶 API**:如 `mindspore.train.Model`,封装了训练过程的高级接口。 +* **JIT(Just-In-Time 编译)**:MindSpore提供JIT(just-in-time)技术进一步进行性能优化。JIT模式会通过AST树解析的方式或者Python字节码解析的方式,将代码解析为一张中间表示图(IR,intermediate representation)。IR图作为该代码的唯一表示,编译器通过对该IR图的优化,来达到对代码的优化,提高运行性能。与动态图模式相对应,这种JIT的编译模式被称为静态图模式。 +* **Primitive op**:MindSpore 中的基本算子,通常由 `mindspore.ops.Primitive` 定义,提供底层的算子操作接口。 + + +## 2. 工具安装 + +请参见[《msprobe 工具安装指南》](./01.installation.md)。 + + +## 3. 快速入门 + +以下通过一个简单的示例,展示如何在 MindSpore 中使用 msprobe 工具进行精度数据采集。 + +您可以参考 [动态图快速入门示例](data_dump_MindSpore/dynamic_graph_quick_start_example.md) 了解详细步骤。 + +## 4. 概述 + +msprobe 工具通过在训练脚本中添加 `PrecisionDebugger` 接口并启动训练的方式,采集模型在运行过程中的精度数据。该工具支持对MindSpore的静态图和动态图场景进行不同Level等级的精度数据采集。 + +dump 的"tensor"模式采集数据量大小,可以参考[数据量基线](data_dump_MindSpore/data_dump_MindSpore_baseline.md)。 + +## 5. 
场景介绍 + +### 5.1 静态图场景 +在静态图场景下,msprobe 仅支持 **L2 Level** 的数据采集。 +- **L2 Level(Kernel 级)** :采集底层算子的输入输出数据,适用于深入分析算子级别的精度问题。 + +采集方式请参见[示例代码 > 静态图场景](#71-静态图场景)。详细介绍请参见[《config.json 配置文件介绍》](./02.config_introduction.md#11-通用配置)中的“level 参数”和[《config.json 配置示例》](./03.config_examples.md#2-mindspore-静态图场景) 中的“MindSpore 静态图场景”。 + +### 5.2 动态图场景 +在动态图场景下,msprobe 支持 **L0** 、**L1** 、**mix** 、**L2 Level**、 **debug** 的数据采集,具体分为以下几种情况: +- **使用高阶 API(如 `Model 高阶API`)** : + - 需要使用 `MsprobeStep` 回调类来控制数据采集的启停,适用于 **L0** 、**L1** 、**mix** 、**L2** 数据采集。 + +- **未使用高阶 API** : + - 手动在训练循环中调用 `start`、`stop`、`step` 等接口,适用于 **L0** 、**L1** 、**mix** 、**L2** 数据采集。 + +采集方式请参见[示例代码 > 动态图场景](#72-动态图场景)。 + +> **注意** :动态图模式下,使用 `PSJit` 或 `PIJit` 装饰的部分实际以静态图模式执行,此时的 **Kernel 级(L2 Level)** 数据采集方式与静态图场景相同。 +- **L0 Level(Cell 级)** :采集 `Cell` 对象的数据,适用于需要分析特定网络模块的情况。 + +- **L1 Level(API 级)** :采集 MindSpore API 的输入输出数据,适用于定位 API 层面的精度问题。 + +- **mix(模块级 + API 级)** :在 `L0` 和 `L1` 级别的基础上同时采集模块级和 API 级数据,适用于需要分析模块和 API 层面精度问题的场景。 + +- **debug level (单点保存)**:单点保存网络中变量的正反向数据,适用于用户熟悉网络结构的场景。 + + +详细介绍请参见[《config.json 配置文件介绍》](./02.config_introduction.md#11-通用配置)中的“level 参数”和[《config.json 配置示例》](./03.config_examples.md#3-mindspore-动态图场景) 中的“MindSpore 动态图场景”。 -```yaml -ops: # ops为算子类别,找到对应的类别,在该类别下按照下列格式删除或添加API - - adaptive_avg_pool1d - - adaptive_avg_pool2d - - adaptive_avg_pool3d -``` -## 1 接口介绍 +## 6 接口介绍 -### 1.1 msprobe.mindspore.PrecisionDebugger +### 6.1 msprobe.mindspore.PrecisionDebugger **功能说明**:通过加载 dump 配置文件的方式来确定 dump 操作的详细配置。 **原型**: ```Python -PrecisionDebugger(config_path=None) +PrecisionDebugger(config_path=None, task=None, dump_path=None, level=None, step=None) ``` **参数说明**: 1. config_path:指定 dump 配置文件路径,string 类型。参数示例:"./config.json"。未配置该路径时,默认使用 [config.json](../config.json) 文件的默认配置,配置选项含义可见 [config.json 介绍](./02.config_introduction.md)。 +2. 
其他参数均在 [config.json](../config.json) 文件中可配,详细配置可见 [config.json 介绍](./02.config_introduction.md)。 -#### 1.1.1 start +此接口的参数均不是必要,且优先级高于 [config.json](../config.json) 文件中的配置,但可配置的参数相比 config.json 较少。 + +#### 6.1.1 start **功能说明**:启动精度数据采集。需在模型执行模式(静态图/动态图、O0/O1/O2编译等级)设置后调用。静态图场景下,必须在模型初始化及 mindspore.communication.init 调用前添加;动态图场景下,如果没有使用 [Model](https://gitee.com/link?target=https%3A%2F%2Fwww.mindspore.cn%2Ftutorials%2Fzh-CN%2Fr2.3.1%2Fadvanced%2Fmodel.html) 高阶 API 进行训练,则需要与 stop 函数一起添加在 for 循环内,否则只有需要传入model参数时,才使用该接口。 @@ -43,11 +90,15 @@ start(model=None) **参数说明**: -1. model:指具体的 mindspore.nn.Cell,默认不配置。Cell级别("L0" level)dump 时,传入 model 可以采集 model 内的所有Cell 对象数据。API级别("L1" level)dump 时,传入 model 可以采集 model 内包含 primitive op 对象在内的所有 API 数据,若不传入 model 参数,则只采集非 primitive op 的 API 数据。 +1. model:指定需要采集数据的实例化模型,支持传入mindspore.nn.Cell、List[mindspore.nn.Cell]或Tuple[mindspore.nn.Cell] 类型, 默认未配置。Cell级别("L0" level)dump 与 "mix" level dump 时,必须传入 model 才可以采集 model 内的所有Cell 对象数据。API级别("L1" level)dump 时,传入 model 可以采集 model 内包含 primitive op 对象在内的所有 API 数据,若不传入 model 参数,则只采集非 primitive op 的 API 数据。 + +#### 6.1.2 stop -#### 1.1.2 stop +**功能说明**:停止精度数据采集。在 **start** 函数之后的任意位置添加。若 **stop** 函数添加在反向计算代码之后,则会采集 **start** 和该函数之间的前反向数据。 +若 **stop** 函数添加在反向计算代码之前,则需要将 [**step**](#613-step) 函数添加到反向计算代码之后,才能采集 **start** 和该函数之间的前反向数据。 +**仅未使用 Model 高阶 API 的动态图场景支持。** -**功能说明**:停止数据采集。在 **start** 函数之后的任意位置添加。需要与 start 函数一起添加在 for 循环内。若需要 dump 反向数据,则需要添加在反向计算代码之后。**仅未使用 Model 高阶 API 的动态图场景支持。** +**注意**:**stop** 函数必须调用,否则可能导致精度数据落盘不全。 **原型**: @@ -55,9 +106,11 @@ start(model=None) stop() ``` -#### 1.1.3 step +#### 6.1.3 step -**功能说明**:在最后一个 **stop** 函数后或一个 step 训练结束的位置添加。**仅未使用 Model 高阶 API 的动态图场景支持。** +**功能说明**:结束一个 step 的数据采集,完成所有数据落盘并更新 dump 参数。在一个 step 结束的位置添加,且必须在 **stop** 函数之后的位置调用。 +该函数需要配合 **start** 和 **stop** 函数使用,尽量添加在反向计算代码之后,否则可能会导致反向数据丢失。 +**仅未使用 Model 高阶 API 的动态图场景支持。** **原型**: @@ -65,13 +118,11 @@ stop() step() ``` -#### 1.1.4 forward_backward_dump_end +#### 6.1.4 
forward_backward_dump_end
 
-**功能说明**:在 **start** 函数和在 **stop** 函数之间调用,表示采集 **start** 到 **forward_backward_dump_end**之间的L1级别的正反向数据。
+**功能说明**:停止精度数据采集。与 **stop** 函数功能相同,该函数在将来会被移除,建议使用 **stop** 函数。
 
-**仅支持L1级别数据采集场景。**
-
-**L1级别数据中的jit数据采集行为不受此接口影响**
+**L1级别数据中的jit数据采集行为不受此接口影响。**
 
 **仅未使用 Model 高阶 API 的动态图场景支持。**
 
@@ -81,9 +132,27 @@ step()
 forward_backward_dump_end()
 ```
 
-### 1.2 msprobe.mindspore.common.utils.MsprobeStep
+#### 6.1.5 save
 
-**功能说明**:MindSpore Callback类,自动在每个step开始时调用start()接口,在每个step结束时调用stop()、step()接口。实现使用 Model 高阶 API 的动态图场景下 L0、L1 级别的精度数据采集控制,控制粒度为单个 **Step** ,而 PrecisionDebugger.start, PrecisionDebugger.stop 接口的控制粒度任意训练代码段。
+**功能说明**:单点保存网络执行过程中正反向数值,并以统计值/张量文件落盘。
+
+**原型**:
+```python
+save(variable, name, save_backward=True)
+```
+
+**参数说明**:
+| 参数名称 | 参数含义 | 支持数据类型 | 是否必选|
+| ---------- | ------------------| ------------------- | ------------------- |
+| variable | 需要保存的变量 | dict, list, tuple, mindspore.Tensor, int, float, str | 是 |
+| name | 指定的名称 | str | 是 |
+| save_backward | 是否保存反向数据 | boolean | 否 |
+
+
+
+### 6.2 msprobe.mindspore.common.utils.MsprobeStep
+
+**功能说明**:MindSpore Callback类,自动在每个step开始时调用start()接口,在每个step结束时调用stop()、step()接口。实现使用 Model 高阶 API 的动态图场景下 L0、L1、mix 级别的精度数据采集控制,控制粒度为单个 **Step** ,而 PrecisionDebugger.start, PrecisionDebugger.stop 接口的控制粒度任意训练代码段。
 
 **原型**:
 
@@ -95,13 +164,13 @@ MsprobeStep(debugger)
 
 1. debugger:PrecisionDebugger对象。
 
-### 1.3 msprobe.mindspore.seed_all
+### 6.3 msprobe.mindspore.seed_all
 
 **功能说明**:用于固定网络中的随机性和开启确定性计算。
 
 **原型**:
 ```python
-seed_all(seed=1234, mode=False)
+seed_all(seed=1234, mode=False, rm_dropout=True)
 ```
 
 **参数说明**:
@@ -110,26 +179,36 @@ seed_all(seed=1234, mode=False)
 
 2. mode:确定性计算使能,可配置 True 或 False,默认值:False,非必选。参数示例:mode=True。该参数设置为 True 后,将会开启算子确定性运行模式与归约类通信算子(AllReduce、ReduceScatter、Reduce)的确定性计算。注意:确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启。
 
-## 2 示例代码
+3. 
rm_dropout:控制dropout失效的开关。可配置 True 或 False,默认值:True,非必选。参数示例:rm_dropout=True。该参数设置为 True 后,将会使mindspore.ops.Dropout,mindspore.ops.Dropout2D,mindspore.ops.Dropout3D,mindspore.mint.nn.Dropout和mindspore.mint.nn.functional.dropout失效,以避免因随机dropout造成的网络随机性。建议在采集mindspore数据前开启。注意:通过rm_dropout控制dropout失效或生效需要在初始化Dropout实例前调用才能生效。 -### 2.1 MindSpore 静态图场景 -```Python + + +## 7. 示例代码 + +### 7.1 静态图场景 + +```python import mindspore as ms ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend") from msprobe.mindspore import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json") debugger.start() -# 请勿将以上初始化流程置于模型实例化或mindspore.communication.init调用后 +# 请勿将以上初始化流程置于模型实例化或 mindspore.communication.init 调用后 +# 模型定义和训练代码 # ... + ``` -### 2.2 MindSpore 动态图场景 +### 7.2 动态图场景 -#### 2.2.1 未使用 Model 高阶 API(非 L2 级别) +#### 7.2.1 L0 ,L1, mix 级别 -```Python +##### 7.2.1.1 未使用 Model 高阶 API + + +```python import mindspore as ms ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend") @@ -141,60 +220,66 @@ debugger = PrecisionDebugger(config_path="./config.json") model = Network() # 数据集迭代的地方往往是模型开始训练的地方 for data, label in data_loader: - debugger.start() # 进行L1级别下非primitive op采集时调用 - # debugger.start(model) # 进行L0级别或L1级别下primitive op的数据采集时调用 - # 如下是模型每个step执行的逻辑 + debugger.start() # 进行 L1 级别下非 primitive op 采集时调用 + # debugger.start(model) # 进行 L0, mix 级别或 L1 级别下 primitive op 的数据采集时调用 + # 如下是模型每个 step 执行的逻辑 grad_net = ms.grad(model)(data) # ... 
- debugger.stop() # 关闭数据dump - debugger.step() # 结束一个step的dump + debugger.stop() # 关闭数据 dump + debugger.step() # 更新迭代数 ``` -#### 2.2.2 未使用 Model 高阶 API(L2 级别) +##### 7.2.1.2 使用 Model 高阶 API -```Python + +```python import mindspore as ms +from mindspore.train import Model ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend") from msprobe.mindspore import PrecisionDebugger +from msprobe.mindspore.common.utils import MsprobeStep debugger = PrecisionDebugger(config_path="./config.json") -debugger.start() -# 请勿将以上初始化流程置于模型实例化或mindspore.communication.init调用后 # 模型、损失函数的定义以及初始化等操作 # ... + model = Network() -# 数据集迭代的地方往往是模型开始训练的地方 -for data, label in data_loader: - # 如下是模型每个step执行的逻辑 - grad_net = ms.grad(model)(data) - # ... +# 只有进行 L0 级别下 Cell 对象,mix 级别,L1 级别下 primitive op 的数据采集时才需要调用 +# debugger.start(model) +trainer = Model(model, loss_fn=loss_fn, optimizer=optimizer, metrics={'accuracy'}) +trainer.train(1, train_dataset, callbacks=[MsprobeStep(debugger)]) ``` -#### 2.2.3 使用 Model 高阶 API(非 L2 级别) +#### 7.2.2 L2 级别 -```Python +##### 7.2.2.1 未使用 Model 高阶 API + + +```python import mindspore as ms -from mindspore.train import Model ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend") from msprobe.mindspore import PrecisionDebugger -from msprobe.mindspore.common.utils import MsprobeStep debugger = PrecisionDebugger(config_path="./config.json") +debugger.start() +# 请勿将以上初始化流程置于模型实例化或 mindspore.communication.init 调用后 # 模型、损失函数的定义以及初始化等操作 # ... - model = Network() -# 只有进行L0级别下Cell对象或L1级别下primitive op的数据采集时才需要调用 -# debugger.start(model) -trainer = Model(model, loss_fn=loss_fn, optimizer=optimizer, metrics={'accuracy'}) -trainer.train(1, train_dataset, callbacks=[MsprobeStep(debugger)]) +# 数据集迭代的地方往往是模型开始训练的地方 +for data, label in data_loader: + # 如下是模型每个 step 执行的逻辑 + grad_net = ms.grad(model)(data) + # ... 
``` -#### 2.2.4 使用 Model 高阶 API(L2 级别) -```Python +##### 7.2.2.2 使用 Model 高阶 API + + +```python import mindspore as ms from mindspore.train import Model ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend") @@ -202,7 +287,7 @@ ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend") from msprobe.mindspore import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json") debugger.start() -# 请勿将以上初始化流程置于模型实例化或mindspore.communication.init调用后 +# 请勿将以上初始化流程置于模型实例化或 mindspore.communication.init 调用后 # 模型、损失函数的定义以及初始化等操作 # ... @@ -212,43 +297,65 @@ trainer = Model(model, loss_fn=loss_fn, optimizer=optimizer, metrics={'accuracy' trainer.train(1, train_dataset) ``` -## 3 dump 结果文件介绍 - -### 3.1 MindSpore 静态图场景 - -训练结束后,工具将 dump 的数据保存在 dump_path 参数指定的目录下。 - -- jit_level 为O0/O1时: +## 8. dump 结果文件介绍 - dump 结果目录请参见 MindSpore 官网中的[同步 Dump 数据对象目录](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/debug/dump.html#%E6%95%B0%E6%8D%AE%E5%AF%B9%E8%B1%A1%E7%9B%AE%E5%BD%95%E5%92%8C%E6%95%B0%E6%8D%AE%E6%96%87%E4%BB%B6%E4%BB%8B%E7%BB%8D)。 +### 8.1 静态图场景 -- jit_level 为O2时: +训练结束后,数据将保存在 `dump_path` 指定的目录下。 - dump 结果目录请参见 MindSpore 官网中的[异步 Dump 数据对象目录](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/debug/dump.html#%E6%95%B0%E6%8D%AE%E5%AF%B9%E8%B1%A1%E7%9B%AE%E5%BD%95%E5%92%8C%E6%95%B0%E6%8D%AE%E6%96%87%E4%BB%B6%E4%BB%8B%E7%BB%8D-1)。 +若jit_level=O2,且使用mindstudio-probe发布包或源码编包时添加了`--include-mod=adump`选项,目录结构示例如下: +``` +├── dump_path +│ ├── rank_0 +│ | ├── {timestamp} +│ | │ ├── step_0 +| | | | ├── AssignAdd.Default_network-TrainOneStepCell_optimzer-Gsd_AssignAdd-op0.0.10.1735011096403740.input.0.ND.INT32.npy +| | | | ├── Cast.Default_network-TrainOneStepCell_network-WithLossCell__backbone-Net_Cast-op0.9.10.1735011096426349.input.0.ND.FLOAT.npy +| | | | ├── GetNext.Default_GetNext-op0.0.11.17350110964032987.output.0.ND.FLOAT.npy +| | | | ... 
+| | | | ├── RefDAata.accum_bias1.6.10.1735011096424907.output.0.ND.FLOAT.npy +| | | | ├── Sub.Default_network-TrainOneStepCell_network-WithLossCell__backbone-Net_Sub-op0.10.10.1735011096427368.input.0.ND.BF16 +| | | | └── mapping.csv +│ | │ ├── step_1 +| | | | ├── ... +| | | ├── ... +| | ├── ... +| | +│ ├── ... +| | +│ └── rank_7 +│ ├── ... +``` +**说明** +1. 若配置文件中指定落盘npy格式,但是实际数据格式不在npy支持范围内(如bf16、int4等),则该tensor会以原始码流落盘,并不会转换为npy格式。 +2. 若原始文件全名长度超过255个字符,则文件基础名会被转换为长度为32位的随机数字字符串,原始文件名与转换后文件名的对应关系会保存在同目录下的`mapping.csv`文件中。 -jit_level 请参见 [mindspore.set_context](https://www.mindspore.cn/docs/zh-CN/r2.3.1/api_python/mindspore/mindspore.set_context.html) 中的 jit_config 参数。 -### 3.2 MindSpore 动态图场景 +其他场景请参见 MindSpore 官方文档中的[数据对象目录](https://www.mindspore.cn/docs/zh-CN/r2.4.0/model_train/debug/dump.html)。 -训练结束后,工具将 dump 的数据保存在dump_path参数指定的目录下。 +### 8.2 动态图场景 -dump结果目录结构示例如下: +dump 结果目录结构示例如下: -```bash +```lua ├── dump_path │ ├── step0 │ | ├── rank0 │ | │ ├── dump_tensor_data | | | | ├── MintFunctional.relu.0.backward.input.0.npy | | | | ├── Mint.abs.0.forward.input.0.npy -| | | | ├── Functional.split.0.forward.input.0.npy +| | | | ├── Functional.split.0.forward.input.0.npy # 命名格式为{api_type}.{api_name}.{API调用次数}.{forward/backward}.{input/output}.{参数序号}, 其中,“参数序号”表示该API的第n个输入或输出,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个元素。 | | | | ├── Tensor.__add__.0.forward.output.0.npy | | | | ... 
| | | | ├── Jit.AlexNet.0.forward.input.0.npy -| | | | └── Cell.relu.ReLU.forward.0.input.0.npy # config.json文件配置level为L0时dump的cell模块级数据,命名格式为{Cell}_{cell_name}_{class_name}_{前向反向}.{index}.{input/output}.{参数序号} -│ | | ├── dump.json # 保存前反向算子、算子的统计量信息或溢出算子信息。包含dump数据的API名称(命名格式为:{api_type}_{api_name}_{API调用次数}_{前向反向}_{input/output}.{参数序号})、dtype、 shape、各数据的max、min、mean、L2norm统计信息以及当配置summary_mode="md5"时的CRC-32数据。其中,“参数序号”表示该API下的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个子参数;L2norm表示L2范数(平方根) -│ | | ├── stack.json # 算子调用栈信息 -│ | | └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +| | | | ├── Primitive.conv2d.Conv2D.0.forward.input.0.npy +| | | | ├── Cell.conv1.Conv2D.forward.0.parameters.weight.npy # 模块参数数据:命名格式为{Cell}.{cell_name}.{class_name}.forward.{调用次数}.parameters.{parameter_name}。 +| | | | ├── Cell.conv1.Conv2D.parameters_grad.weight.npy # 模块参数梯度数据:命名格式为{Cell}.{cell_name}.{class_name}.parameters_grad.{parameter_name}。因为同一模块的参数使用同一梯度进行更新,所以参数梯度文件名不包含调用次数。 +| | | | └── Cell.relu.ReLU.forward.0.input.0.npy # 命名格式为{Cell}.{cell_name}.{class_name}.{forward/backward}.{调用次数}.{input/output}.{参数序号}, 其中,“参数序号”表示该Cell的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该Cell的第1个参数的第1个元素。 +| | | | # 当dump时传入的model参数为List[mindspore.nn.Cell]或Tuple[mindspore.nn.Cell]时,模块级数据的命名中包含该模块在列表中的索引index,命名格式为{Cell}.{index}.*,*表示以上三种模块级数据的命名格式,例如:Cell.0.relu.ReLU.forward.0.input.0.npy。 +│ | | ├── dump.json +│ | | ├── stack.json +│ | | └── construct.json │ | ├── rank1 | | | ├── dump_tensor_data | | | | └── ... 
@@ -263,23 +370,44 @@ dump结果目录结构示例如下: │ ├── step2 ``` -dump 过程中,npy 文件在对应算子或者模块被执行后就会落盘,而 json 文件则需要在正常执行 PrecisionDebugger.stop() 后才会写入完整数据,异常的程序终止会保存终止前被执行算子的相关 npy 文件,可能会导致 json 文件中数据丢失。 +* `rank`:设备 ID,每张卡的数据保存在对应的 `rank{ID}` 目录下。非分布式场景下没有 rank ID,目录名称为 rank。 +* `dump_tensor_data`:保存采集到的张量数据。 +* `dump.json`: 保存API或Cell前反向数据的统计量信息。包含dump数据的API名称或Cell名称,各数据的dtype、 shape、max、min、mean、L2norm(L2范数,平方根)统计信息以及当配置summary_mode="md5"时的CRC-32数据。具体介绍可参考[dump.json文件说明](./27.dump_json_instruction.md#2-dumpjson文件示例mindspore)。 +* `stack.json`:API/Cell的调用栈信息。 +* `construct.json`:分层分级结构,level为L1时,construct.json内容为空。 -其中 rank 为设备上各卡的 ID,每张卡上 dump 的数据会生成对应 dump 目录。非分布式场景下没有 rank ID,目录名称为 rank。 +dump 过程中,npy 文件在对应API或者模块被执行后就会落盘,而 json 文件则需要在正常执行 PrecisionDebugger.stop() 后才会写入完整数据,因此,程序异常终止时,被执行API对应的 npy 文件已被保存,但 json 文件中的数据可能丢失。 动态图场景下使能 PSJit 或 PIJit,装饰特定 Cell 或 function,被装饰的部分会全部/部分使能**静态图**流程。 - PSJit 场景下 config.json 文件配置 level 为 L1 时,被 PSJit 装饰的部分也作为 API 被 dump 到对应目录;配置 level 为 L2 时,则只会 dump 用户网络中静态图流程下的相关 kernel,其结果目录同jit_level 为 O0/O1 时的静态图 dump 相同。 - PIJit 场景下 config.json 文件配置 level 为 L1 时,会被还原为动态图,按 API 粒度进行 dump;配置 level 为 L2 时,则只会 dump 用户网络中静态图流程下的相关 kernel。 -npy 文件保存的前缀和 MindSpore 对应关系如下: - -| 前缀 | MindSpore 模块 | -| -------------- | ---------------------------- | -| Tensor | mindspore.Tensor | -| Functional | mindspore.ops | -| Primitive | mindspore.ops.Primitive | -| Mint | mindspore.mint | -| MintFunctional | mindspore.mint.nn.functional | -| Jit | mindspore.jit | -| Cell | mindspore.nn.Cell | + +npy文件名的前缀含义如下: + +| 前缀 | 含义 | +| -------------- |------------------------------| +| Tensor | mindspore.Tensor API数据 | +| Functional | mindspore.ops API数据 | +| Primitive | mindspore.ops.Primitive API数据 | +| Mint | mindspore.mint API数据 | +| MintFunctional | mindspore.mint.nn.functional API数据 | +| Distributed | mindspore.communication.comm_func API数据 | +| Jit | 被"jit"装饰的模块或函数数据 | +| Cell | mindspore.nn.Cell 类(模块)数据 | + + + +## 9.补充说明 + +### 9.1 修改 API 支持列表 + +动态图 API 级 dump 
时,本工具提供固定的 API 支持列表,仅支持对列表中的 API 进行精度数据采集。一般情况下,无需修改该列表,而是通过config.json中的scope/list字段进行 dump API 指定。若需要改变 API 支持列表,可以在 `msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml` 文件内手动修改,如下示例: + +```yaml +ops: + - adaptive_avg_pool1d + - adaptive_avg_pool2d + - adaptive_avg_pool3d +``` diff --git a/debug/accuracy_tools/msprobe/docs/07.accuracy_checker_PyTorch.md b/debug/accuracy_tools/msprobe/docs/07.accuracy_checker_PyTorch.md index a65664ccfe727061dd0f47be9bbcd68fc24e4da0..b07568e25a2915a4e8e5c2157e7de4252410f38d 100644 --- a/debug/accuracy_tools/msprobe/docs/07.accuracy_checker_PyTorch.md +++ b/debug/accuracy_tools/msprobe/docs/07.accuracy_checker_PyTorch.md @@ -31,7 +31,7 @@ run_ut 预检操作包括以下两种方式: 将 API 信息输入到 run_ut 模块进行精度检测并比对,运行如下命令: ```bash -msprobe -f pytorch run_ut -api_info ./dump.json +msprobe -f pytorch run_ut -api_info ./dump_path/step{step_number}/rank{rank_number}/dump.json ``` | 参数名称 | 解释 | 是否必选 | @@ -50,7 +50,7 @@ run_ut 执行结果包括 `accuracy_checking_result_{timestamp}.csv` 和 `accura 如果需要保存比对不达标的输入和输出数据,可以在 run_ut 执行命令结尾添加 `-save_error_data`,例如: ```bash -msprobe -f pytorch run_ut -api_info ./dump.json -save_error_data +msprobe -f pytorch run_ut -api_info ./dump_path/step{step_number}/rank{rank_number}/dump.json -save_error_data ``` 数据默认会存盘到 './ut_error_data{timestamp}' 路径下,如有需要,用户可以通过 error_data_path 参数来配置保存路径,error_data_path 参数在 [config.json](../config.json) 文件或 [config.yaml](../pytorch/api_accuracy_checker/config.yaml) 文件配置,config.json 文件需要在 run_ut 操作时通过 -config 参数指定。 @@ -98,7 +98,7 @@ msprobe -f pytorch run_ut -api_info ./dump.json -save_error_data multi_run_ut 脚本,可以并行执行多个 run_ut 操作,从而减少预检耗时。示例如下: ```bash -msprobe -f pytorch multi_run_ut -api_info ./dump.json -n 32 -d 0 1 2 3 +msprobe -f pytorch multi_run_ut -api_info ./dump_path/step{step_number}/rank{rank_number}/dump.json -n 32 -d 0 1 2 3 ``` | 参数名称 | 解释 | 是否必选 | @@ -117,7 +117,7 @@ msprobe -f pytorch multi_run_ut -api_info ./dump.json -n 32 -d 0 1 2 3 断点续检操作通过如下命令执行: ```bash -msprobe -f pytorch 
run_ut -api_info ./dump.json -csv_path /home/xxx/ut/accuracy_checking_result_{timestamp}.csv +msprobe -f pytorch run_ut -api_info ./dump_path/step{step_number}/rank{rank_number}/dump.json -csv_path /home/xxx/ut/accuracy_checking_result_{timestamp}.csv ``` 精度预检 run_ut 过程中,若因环境、数据量过大等原因导致预检进程中断,那么当用户解决这些问题后,重新执行 run_ut 操作,可以通过断点续检操作继续前面未完成的预检,会在 -csv_path 指定的 `accuracy_checking_result_{timestamp}.csv` 文件以及对应的 `accuracy_checking_details_{timestamp}.csv` 文件中继续写入后续的结果,不会重新创建结果文件。 @@ -194,15 +194,15 @@ Forward Test Success 和 Backward Test Success 是否通过测试是由 `accurac 判定为小值的阈值: - - torch.float32:e-6 - - torch.float16:e-3 - - torch.bfloat16:e-3 + - torch.float32:2**-20 + - torch.float16:2**-10 + - torch.bfloat16:2**-10 小值域的绝对误差阈值: - - torch.float32:e-9 - - torch.float16:e-5 - - torch.bfloat16:e-5 + - torch.float32:2**-30 + - torch.float16:2**-16 + - torch.bfloat16:2**-16 ## 5 预检结果比对 @@ -214,8 +214,8 @@ msprobe -f pytorch api_precision_compare -npu /home/xxx/npu/accuracy_checking_de | 参数名称 | 说明 | 是否必选 | | -------------------- | ------------- | -------- | -| -npu 或 --npu_csv_path | NPU 预检结果 `accuracy_checking_details_{timestamp}.csv` 文件路径。默认从当前目录下识别该文件。 | 否 | -| -gpu 或 --gpu_csv_path | GPU 预检结果 `accuracy_checking_details_{timestamp}.csv` 文件路径。默认从当前目录下识别该文件。 | 否 | +| -npu 或 --npu_csv_path | NPU 预检结果 `accuracy_checking_details_{timestamp}.csv` 文件路径。默认从当前目录下识别该文件。 | 是 | +| -gpu 或 --gpu_csv_path | GPU 预检结果 `accuracy_checking_details_{timestamp}.csv` 文件路径。默认从当前目录下识别该文件。 | 是 | | -o 或 --out_path | 指定 api_precision_compare.py 执行结果存盘路径,默认为当前目录。 | 否 | 执行完成后输出 `api_precision_compare_result_{timestamp}.csv` 和 `api_precision_compare_details_{timestamp}.csv` 文件。文件示例如下: @@ -229,8 +229,8 @@ msprobe -f pytorch api_precision_compare -npu /home/xxx/npu/accuracy_checking_de | 字段 | 含义 | | --------------------- | ------------------------------------------------------------ | | API name | API 名称。 | -| Forward Test Success | 前向 API 是否通过测试。pass 为通过;warning 为待观察;error 为错误;SKIP 表示跳过该 API 的计算,跳过原因在 
Message 字段中提示,包括:该 API 的数据类型不支持使用新精度标准进行比对(如 float64),或该 API 不支持精度预检,或该 API 被黑名单过滤或不在白名单上,或运行错误等。 | -| Backward Test Success | 反向 API 是否通过测试。pass 为通过;warning 为待观察;error 为错误;如果是空白的话代表该 API 没有反向输出;SKIP 表示该 API 的数据类型不支持使用新精度标准进行比对(如 float64)。 | +| Forward Test Success | 前向 API 是否通过测试。pass 为通过;error 为错误;SKIP 表示跳过该 API 的计算,跳过原因在 Message 字段中提示,包括:该 API 的数据类型不支持使用新精度标准进行比对(如 float64),或该 API 不支持精度预检,或该 API 被黑名单过滤或不在白名单上,或运行错误等。 | +| Backward Test Success | 反向 API 是否通过测试。pass 为通过;error 为错误;如果是空白的话代表该 API 没有反向输出;SKIP 表示该 API 的数据类型不支持使用新精度标准进行比对(如 float64)。 | | Message | 提示信息。 | Forward Test Success 和 Backward Test Success 是否通过测试是由 `api_precision_compare_details_{timestamp}.csv` 中的各个指标判定结果决定的。需要注意的是 `api_precision_compare_details_{timestamp}.csv` 中可能存在一个 API 的前向(反向)有多个输出,那么每个输出记录一行,而在 `api_precision_compare_result_{timestamp}.csv` 中的结果需要该 API 的所有结果均为 pass 才能标记为 pass,只要存在一个 error 则标记 error,仅存在 warning 和 pass 且不存在 error 标记 warning。 @@ -243,15 +243,15 @@ Forward Test Success 和 Backward Test Success 是否通过测试是由 `api_pre | ------------------------ | ------------------------------------------------------------ | | API name | NPU 或 GPU 下的 API 名称。 | | 小值域错误比值 | NPU 与 CPU 的小值域的错误比率 / GPU 与 CPU 的小值域的错误比率。标杆比对法指标。 | -| 小值域错误判定结果 | 小值域错误比值小于等于 1 标记为 pass,1 ~ 2 之间标记为 warning,大于 2 标记为 error。 | +| 小值域错误判定结果 | 小值域错误比值小于等于 2 标记为 pass,大于 2 标记为 error。 | | 均方根误差比值 | NPU 与 CPU 的均方根误差 / GPU 与 CPU 的均方根误差。标杆比对法指标。 | -| 均方根误差判定结果 | 均方根误差比值小于等于 1 标记为 pass,1~2 之间标记为 warning,大于 2 标记为 error。 | +| 均方根误差判定结果 | 均方根误差比值小于等于 2 标记为 pass,大于 2 标记为 error。 | | 相对误差最大值比值 | NPU 与 CPU 的相对误差最大值 / GPU 与 CPU 的相对误差最大值。标杆比对法指标。 | -| 相对误差最大值判定结果 | 相对误差最大值比值小于等于 1 标记为 pass,1 ~ 10 之间标记为 warning,大于 10 标记为 error。 | +| 相对误差最大值判定结果 | 相对误差最大值比值小于等于 10 标记为 pass,大于 10 标记为 error。 | | 相对误差平均值比值 | NPU 与 CPU 的相对误差的平均值 / GPU 与 CPU 的相对误差的平均值。标杆比对法指标。 | -| 相对误差平均值判定结果 | 相对误差平均值比值小于等于 1 标记为 pass,1 ~ 2 之间标记为 warning,大于 2 标记为 error。 | +| 相对误差平均值判定结果 | 相对误差平均值比值小于等于 2 标记为 pass,大于 2 标记为 error。 | | 误差均衡性比值 | NPU 与 CPU 的误差均衡性 / GPU 与 CPU 
的误差均衡性。标杆比对法指标。 | -| 误差均衡性判定结果 | 误差均衡性比值小于等于 1 标记为 pass,1 ~ 2 之间标记为 warning,大于 2 标记为 error。该字段暂不参与 api_precision_compare_result 的结果判定。 | +| 误差均衡性判定结果 | 误差均衡性比值小于等于 2 标记为 pass,大于 2 标记为 error。该字段暂不参与 api_precision_compare_result 的结果判定。 | | inf / nan 错误率 | NPU 与标杆 inf / nan 计算不一致的元素个数占总元素的个数比例。绝对阈值法指标。 | | inf / nan 判定结果 | inf / nan 错误率判定结果,等于 0 标记为 pass,其余情况标记为 error。 | | 相对误差错误率 | NPU 与标杆的正常值计算相对误差,其大于错误阈值的元素个数占正常值元素个数的比例。绝对阈值法指标。 | @@ -286,7 +286,7 @@ a:误差比对法指标。 - npu_linear -- npu_fusion_attention(该算子在 GPU 上预检时,需要额外安装 flash_attn,请用户自行安装。) +- npu_fusion_attention(该算子在 GPU 上预检时,需要额外安装 flash_attn,请用户自行安装,建议安装2.1以上版本。) - npu_rms_norm @@ -295,3 +295,13 @@ a:误差比对法指标。 - npu_scaled_masked_softmax - npu_swiglu + +- npu_apply_adam + +- npu_group_norm_silu + +- npu_mish + +- npu_moe_gating_top_k_softmax + +- npu_sort_v2 diff --git a/debug/accuracy_tools/msprobe/docs/08.accuracy_checker_online_PyTorch.md b/debug/accuracy_tools/msprobe/docs/08.accuracy_checker_online_PyTorch.md index ed7a91005ff5637be55675cedfcaad6453ca56b0..a93ad3b62405d549a16e7196e2f2145de68e8674 100644 --- a/debug/accuracy_tools/msprobe/docs/08.accuracy_checker_online_PyTorch.md +++ b/debug/accuracy_tools/msprobe/docs/08.accuracy_checker_online_PyTorch.md @@ -49,8 +49,6 @@ Host 与 GPU Host 设备间建立连接,将 NPU 上对应 API 的输入数据 | level | dump 级别,str 类型,在线预检时配置为 L1,表示 dump API 级精度数据。在线预检可不配置,默认取值 L1。 | 是 | | rank | 指定对某张卡上的数据进行 dump,list[int] 类型,默认未配置(表示 dump所有卡的数据),需要与 GPU 侧配置项 rank_list 保持一致。 | 否 | | step | 指定 dump 某个 step 的数据,list[int] 类型,默认未配置,表示 dump 所有 step 的数据。dump 特定 step 时,须指定为训练脚本中存在的 step。 | 否 | -| seed | 随机种子数,int 类型,默认值为 1234。通过固定随机数保证模型的输入或输出一致。 | 否 | -| is_deterministic | 确定性计算模式,bool 类型,可取值 true(开启)或 false(关闭),默认关闭。 | 否 | | scope | dump 范围,list[str] 类型,默认未配置(list 也未配置时表示 dump 所有 api 的数据),配置方式参考 [config.json 配置介绍](./02.config_introduction.md)。 | 否 | | list | dump 范围,list[str] 类型,默认未配置(scope 也未配置时表示 dump 所有 api 的数据),配置方式参考 [config.json 配置介绍](./02.config_introduction.md)。 | 否 | | online_run_ut | 
在线预检模式开关,bool 类型,可取值 True(开启)、False(关闭),默认关闭。 | 是 | @@ -115,8 +113,6 @@ NPU 侧: "rank": [0], "step": [0], "level": "L1", - "seed": 1234, - "is_deterministic": true, "tensor": { "scope": [], "list": [], @@ -159,8 +155,6 @@ NPU 侧: "rank": [0], "step": [0], "level": "L1", - "seed": 1234, - "is_deterministic": true, "tensor": { "scope": [], "list": [], diff --git a/debug/accuracy_tools/msprobe/docs/09.accuracy_checker_MindSpore.md b/debug/accuracy_tools/msprobe/docs/09.accuracy_checker_MindSpore.md index 10baf4f24b5cee46edebd0e2f97fe9dc2ae477c6..3bf65032edae2b8e35c5818d5c030c9ce4c79e95 100644 --- a/debug/accuracy_tools/msprobe/docs/09.accuracy_checker_MindSpore.md +++ b/debug/accuracy_tools/msprobe/docs/09.accuracy_checker_MindSpore.md @@ -2,34 +2,72 @@ ## 1 简介 -**MindSpore 动态图精度预检**a通过扫描昇腾 NPU 上用户训练 MindSpore 模型中的所有 Mint API,输出精度情况的诊断和分析。工具以模型中所有 Mint API 前反向的 dump 结果为输入,构造相应的 API 单元测试,将 NPU 输出与标杆(CPU 高精度)比对,计算对应的精度指标,从而找出 NPU 中存在精度问题的 Mint API。本工具支持**随机生成模式和真实数据模式**b。 +**MindSpore 动态图精度预检**a通过扫描昇腾 NPU 上用户训练 MindSpore 模型中的所有 Mint API 以及 Msadapter场景下迁移的 Mindspore API,输出精度情况的诊断和分析。工具以模型中所有 API 前反向的 dump 结果为输入,构造相应的 API 单元测试,将 NPU 输出与标杆(CPU 高精度)比对,计算对应的精度指标,从而找出 NPU 中存在精度问题的 API。本工具支持**随机生成模式和真实数据模式**b。 -a. 支持 Mindspore 版本:2.4; +a. 支持 Mindspore 版本:2.4/2.5; -b. 在预检 dump 时可以选择由工具构造随机数进行输入获得 dump 数据或选择获取真实输入数据进行预检 dump 操作。随机生成模式执行效率高,可以快速获得结果,但数据精度低,只能大致判断精度问题;真实数据模式执行效率略低于随机生成模式,但是数据精度高,可以准确判断精度问题。 +b. (可选)当使用Msadapter时,由于需要环境中同时存在 Torch 与 Msadapter,所以只支持在**安装原生Torch**的场景下通过export PYTHONPATH="xx/msadapter/build/lib"等通过**环境变量使能Msadapter的方式**的环境中进行预检,预检工具能够自动索引得到所需的 Torch 与 Msadapter环境,环境安装详细参考:[msadapter官网](https://gitee.com/mindspore/msadapter)。 + +c. 在预检时可以由工具构造随机数据或者获取真实dump数据进行预检操作。随机生成模式执行效率高,可以快速获得结果,但结果准确度低,只能大致判断精度问题;真实数据模式执行效率略低于随机生成模式,并且需要较大磁盘空间存放待预检数据,但是结果准确度高,可以准确判断精度问题。 ## 2 离线预检流程 操作流程如下: -1. 在 NPU 和 GPU 环境下分别安装 msprobe。详见[ msprobe 安装](./01.installation.md)章节。 +1. 在 NPU 环境下安装 msprobe。详见[ msprobe 安装](./01.installation.md)章节。 2. 
在 NPU 训练脚本内添加 msprobe 工具 dump 接口 PrecisionDebugger,采集待预检数据。详见 [MindSpore 场景下的数据采集](./06.data_dump_MindSpore.md)章节,注意需要配置 level="L1"。 3. 执行预检操作,查看预检结果文件,分析预检不达标的 API。 ## 3 离线预检操作指导 -命令如下: +### 3.1 使用 run_ut 执行预检 + +将 API 信息输入到 run_ut 模块进行精度检测并比对,运行如下命令: + + ```bash msprobe -f mindspore run_ut -api_info ./dump.json -o ./checker_result ``` -| 参数名称 | 说明 |参数类型 | 是否必选 | -| ---------------------------- | --------------------------------------|---------------------- | ---------------------------------- | -| -api_info 或 --api_info_file | 指定 API 信息文件 dump.json。 | str | 是 | -| -o 或 --out_path | 指定预检结果存盘路径,默认“./”。 | str | 否 | +| 参数名称 | 说明 |参数类型 | 是否必选 | +| ---------------------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------- | ---------------------------------- | +| -api_info 或 --api_info_file | 指定 API 信息文件 dump.json。对其中的mint api以及部分Tensor api进行预检,预检支持的Tensor api列表详见 [ 预检支持列表](../mindspore/api_accuracy_checker/checker_support_api.yaml)。 | str | 是 | +| -o 或 --out_path | 指定预检结果存盘路径,默认“./”。 | str | 否 | +| -csv_path 或 --result_csv_path | 指定本次运行中断时生成的 `accuracy_checking_result_{timestamp}.csv` 文件路径,执行 run_ut 中断时,若想从中断处继续执行,配置此参数即可。需要指定为上次中断的 `accuracy_checking_result_{timestamp}.csv` 文件。详见 [3.3 断点续检](#33-断点续检)。 | str | 否 | 预检执行结果包括 `accuracy_checking_result_{timestamp}.csv` 和 `accuracy_checking_details_{timestamp}.csv` 两个文件。`accuracy_checking_result_{timestamp}.csv` 属于 API 级,标明每个 API 是否通过测试。建议用户先查看 `accuracy_checking_result_{timestamp}.csv` 文件,对于其中没有通过测试的或者特定感兴趣的 API,根据其 API Name 字段在 `accuracy_checking_details_{timestamp}.csv` 中查询其各个输出的达标情况以及比较指标。详细介绍请参见 [4 预检结果](#4-预检结果)。 +### 3.2 使用 multi_run_ut 执行多线程预检 + +multi_run_ut 脚本,可以并行在多个Device执行 run_ut 操作,从而减少预检耗时。示例如下: + +```bash +msprobe -f mindspore multi_run_ut -api_info ./dump.json -d 0 1 2 3 +``` + +| 参数名称 | 说明 |参数类型 | 是否必选 | +| ---------------------------- 
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------- | ---------------------------------- | +| -api_info 或 --api_info_file | 指定 API 信息文件 dump.json。对其中的mint api以及部分Tensor api进行预检,预检支持的Tensor api列表详见 [ 预检支持列表](../mindspore/api_accuracy_checker/checker_support_api.yaml)。 | str | 是 | +| -o 或 --out_path | 指定预检结果存盘路径,默认“./”。 | str | 否 | +| -csv_path 或 --result_csv_path | 指定本次运行中断时生成的 `accuracy_checking_result_{timestamp}.csv` 文件路径,执行 run_ut 中断时,若想从中断处继续执行,配置此参数即可。需要指定为上次中断的 `accuracy_checking_result_{timestamp}.csv` 文件。详见 [3.3 断点续检](#33-断点续检)。 | str | 否 | +| -d 或 --device | 指定 Device ID,选择 UT 代码运行所在的卡,默认值为 0,支持同时指定 0 ~ Device数量 - 1 ,例如 0 1 2 3 4。 | List[int] | 否 | + +在不同卡数下,使用38B语言大模型的预检耗时基线参考 [multi_run_ut耗时基线](accuracy_checker_MindSpore/accuracy_checker_MindSpore_baseline.md) + + +### 3.3 断点续检 + +断点续检操作通过如下命令执行: + +```bash +msprobe -f mindspore run_ut -api_info ./dump.json -csv_path xxx/accuracy_checking_result_{timestamp}.csv +``` + +精度预检 run_ut 过程中,若因环境、数据量过大等原因导致预检进程中断,那么当用户解决这些问题后,重新执行 run_ut 操作,可以通过断点续检操作继续前面未完成的预检,会在 -csv_path 指定的 `accuracy_checking_result_{timestamp}.csv` 文件以及对应的 `accuracy_checking_details_{timestamp}.csv` 文件中继续写入后续的结果,不会重新创建结果文件。 + +须指定为上次预检中断的 `accuracy_checking_result_{timestamp}.csv` 文件。请勿修改 `accuracy_checking_result_{timestamp}.csv` 和 `accuracy_checking_details_{timestamp}.csv` 文件以及文件名,否则不对断点续检的结果负责。 + + ## 4 预检结果 精度预检生成的 `accuracy_checking_result_{timestamp}.csv` 和 `accuracy_checking_details_{timestamp}.csv` 文件内容详情如下: @@ -46,7 +84,7 @@ msprobe -f mindspore run_ut -api_info ./dump.json -o ./checker_result | MaxAbsErr | 被检验数据与标杆数据的最大绝对误差。 | | MaxRelativeErr | 被检验数据与标杆数据的最大相对误差。 | | Status | API 预检通过状态,pass 表示通过测试,error 表示未通过。 | -| message | 提示信息。 | +| Message | 提示信息。 | 注意:PyTorch 无法对 dtype 为整数类型的 tensor 进行反向求导,而 MindSpore 支持。反向过程的预检仅比较 dtype 为浮点型的输出。 diff --git 
a/debug/accuracy_tools/msprobe/docs/10.accuracy_compare_PyTorch.md b/debug/accuracy_tools/msprobe/docs/10.accuracy_compare_PyTorch.md index 9f8751641d957a8f3e945170a9f5b8d673f706d8..a5f83d8dfcbc7645691a8753c105fb7552522bf1 100644 --- a/debug/accuracy_tools/msprobe/docs/10.accuracy_compare_PyTorch.md +++ b/debug/accuracy_tools/msprobe/docs/10.accuracy_compare_PyTorch.md @@ -1,5 +1,11 @@ # PyTorch 场景的精度比对 +## 🚨 重要通知 + +**1. 精度比对操作中2.2比对函数方式(compare 函数、compare_distributed 函数)将于2025.9.30废弃。** + +**2. 精度比对已支持自动识别stack.json并呈现NPU_Stack_Info。命令行方式中用户可无需配置compare.json中的"stack_path"字段和命令行中的-s参数。具体使用参见“2.1.4 比对文件”中的参数说明。命令行方式中的-s(--stack_mode)将于2025.9.30废弃,并且不再需要配置compare.json中的"stack_path"字段。比对函数方式同理,详见“2.2.1 compare函数”和“2.2.2 compare_distributed函数”中的参数说明。** + ## 1 简介 - 本节主要介绍通过命令行和比对函数的方式进行 CPU 或 GPU 与 NPU 的精度数据比对,执行精度比对操作前需要先完成 CPU 或 GPU 与 NPU 的精度数据 dump,参见 [PyTorch 场景下的数据采集](./05.data_dump_PyTorch.md)章节。 @@ -10,19 +16,19 @@ - 工具性能:比对数据量较小时(单份文件小于 10 GB),比对速度 0.1 GB/s;比对数据量较大时,比对速度 0.3 GB/s。 推荐环境配置:独占环境,CPU 核心数 192,固态硬盘(IO 速度参考:固态硬盘 > 500 MB/s,机械硬盘 60 ~ 170 MB/s)。用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。比对速度的计算方式:两份比对文件大小/比对耗时。 -**使用场景**: +**使用场景** - 同一模型,从 CPU 或 GPU 移植到 NPU 中存在精度下降问题,对比 NPU 芯片中的 API 计算数值与 CPU 或 GPU 芯片中的 API 计算数值,进行问题定位。 - 同一模型,进行迭代(模型、框架版本升级或设备硬件升级)时存在的精度下降问题,对比相同模型在迭代前后版本的 API 计算数值,进行问题定位。 - 以上两个场景下,当存在无法自动匹配的API和模块时,则通过用户手动指定可以比对的API或模块来自定义映射关系,进行比对。 -**注意事项**: +**注意事项** - NPU 自研 API,在 CPU 或 GPU 侧若没有对应的 API,该 API 的 dump 数据不比对。 - NPU 与 CPU 或 GPU 的计算结果误差可能会随着模型的执行不断累积,最终会出现同一个 API 因为输入的数据差异较大而无法比对的情况。 - CPU 或 GPU 与 NPU 中两个相同的 API 会因为调用次数不同导致无法比对或比对到错误的 API,不影响整体运行,该 API 忽略。 -**API 匹配条件**: +**API 匹配条件** 进行精度比对时,需要判断 CPU 或 GPU 的 API 与 NPU 的 API 是否可以比对,须满足以下匹配条件: @@ -37,22 +43,22 @@ #### 2.1.1 比对命令说明 -命令示例如下: +命令示例: ```shell msprobe -f pytorch compare -i ./compare.json -o ./output -s ``` -**完整参数说明**: +完整参数说明: -| 参数名 | 说明 | 是否必选 | 
-|-------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------- | -| -i 或 --input_path | 指定[比对文件](#211-比对文件),str 类型。 | 是 | -| -o 或 --output_path | 配置比对结果文件存盘目录,str 类型。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.xlsx`。 | 是 | -| -s 或 --stack_mode | 配置 stack_mode 的开关,bool 类型。仅当[比对文件](#214-比对文件)配置 stack_path 需要开启。通过直接配置该参数开启,默认未配置,表示关闭。 | 否 | +| 参数名 | 说明 | 是否必选 | +|-------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------- | +| -i 或 --input_path | 指定[比对文件](#214-比对文件),str 类型。 | 是 | +| -o 或 --output_path | 配置比对结果文件存盘目录,str 类型,默认在当前目录创建output目录。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.xlsx`。 | 否 | +| -s 或 --stack_mode | 比对结果展示调用栈信息(NPU_Stack_Info)的开关,bool 类型。单卡场景开启时,根据[比对文件](#214-比对文件)的参数说明配置stack_path;多卡场景开启时,自动识别npu_dump目录下stack.json文件,如存在生成详细调用栈信息,否则不生成,此参数不生效。通过直接配置该参数开启,默认未配置,表示关闭。 | 否 | | -c 或 --compare_only | 仅比对开关,bool 类型。该参数默认未配置,会启用自动精度分析,工具自动针对比对结果进行分析,识别到第一个精度可能不达标节点(在比对结果文件中的 Accuracy Reached or Not 列显示为 No),并给出问题可能产生的原因(打屏展示并生成 `advisor_{timestamp}.txt` 文件)。通过配置该参数取消自动精度分析,仅输出比对结果表格。 | 否 | -| -f 或 --fuzzy_match | 模糊匹配,bool 类型。开启后,对于网络中同一层级且命名仅调用次数不同的 API,可匹配并进行比对。通过直接配置该参数开启,默认未配置,表示关闭。 | 否 | -| -dm或--data_mapping | 自定义映射关系比对。需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件](#215-自定义映射文件)。仅[API和模块无法自动匹配场景](#213-API和模块无法自动匹配场景)需要配置。仅支持逐卡比对,即使用[比对文件](#214-比对文件)的单卡场景示例。 | 否 | +| -f 或 --fuzzy_match | 模糊匹配,bool 类型。开启后,对于网络中同一层级且命名仅调用次数不同的 API,可匹配并进行比对。通过直接配置该参数开启,默认未配置,表示关闭。 | 否 | +| -dm或--data_mapping | 自定义映射关系比对。需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件](#215-自定义映射文件)。仅[API和模块无法自动匹配场景](#213-api和模块无法自动匹配场景)需要配置。仅支持逐卡比对,即使用[比对文件](#214-比对文件)的单卡场景示例。 | 否 | #### 2.1.2 整网比对场景 @@ -60,11 +66,11 @@ msprobe -f pytorch compare -i ./compare.json -o ./output -s 
支持单卡和多卡,可同时比对多卡的 dump 数据。多机场景需要每个设备单独执行比对操作。 -1. 配置[config.json](https://gitee.com/ascend/mstt/blob/8914fbb31ff6da3898c3bb7b97ba99e23b0f1d38/debug/accuracy_tools/msprobe/config.json)文件。 +1. 配置[config.json](../config.json)文件。 2. 参见 [PyTorch 场景下的数据采集](./05.data_dump_PyTorch.md)章节完成 CPU 或 GPU 与 NPU 的精度数据 dump。 -3. 创建[比对文件](#211-比对文件)。 +3. 创建[比对文件](#214-比对文件)。 4. 运行命令: @@ -78,11 +84,11 @@ msprobe -f pytorch compare -i ./compare.json -o ./output -s 当存在无法自动匹配的API和模块时,则用户可以通过提供自定义映射关系的配置文件来告知工具可匹配的API或模块,进行比对。 -1. [config.json](https://gitee.com/ascend/mstt/blob/8914fbb31ff6da3898c3bb7b97ba99e23b0f1d38/debug/accuracy_tools/msprobe/config.json)文件level配置为L0或L1、task配置为tensor或statistics并指定需要dump的API或模块名。 +1. [config.json](../config.json)文件level配置为L0或L1、task配置为tensor或statistics并指定需要dump的API或模块名。 2. 参见[PyTorch 场景下的数据采集](./05.data_dump_PyTorch.md)章节完成 CPU 或 GPU 与 NPU 的精度数据 dump。 -3. 创建[比对文件](#211-比对文件)(单卡场景示例)。 +3. 创建[比对文件](#214-比对文件)(单卡场景示例)。 4. 运行命令: @@ -121,14 +127,14 @@ msprobe -f pytorch compare -i ./compare.json -o ./output -s } ``` - **参数说明**: +**参数说明**: -| 参数名 | 说明 | 是否必选 | -| -------------------- | ------------------------------------------------------------ | ------------------ | -| npu_path | 配置 NPU 环境下的 dump.json 文件(单卡场景)或真实数据目录(多卡场景),str 类型。 | 是 | -| bench_path | 配置 CPU、GPU 或 NPU 环境下的 dump.json 文件(单卡场景)或真实数据目录(多卡场景),str 类型。 | 是 | -| stack_path | 配置 NPU dump 目录下的 stack.json 文件,str 类型。 | 单卡必选,多卡不选 | -| is_print_compare_log | 配置是否开启单个算子的日志打屏。可取值 true 或 false,默认为 true。关闭后则只输出常规日志,bool 类型。 | 否 | +| 参数名 | 说明 | 是否必选 | +| -------------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------| +| npu_path | 配置 NPU 环境下的 dump.json 文件(单卡场景)或真实数据目录(多卡场景),str 类型。 | 是 | +| bench_path | 配置 CPU、GPU 或 NPU 环境下的 dump.json 文件(单卡场景)或真实数据目录(多卡场景),str 类型。 | 是 | +| stack_path | 配置 NPU dump 目录下的 stack.json 文件,str 
类型。如果没有配置stack_path,命令行-s参数不生效,程序自动识别是否存在stack.json文件,如存在,则比对结果中呈现NPU_Stack_Info,如不存在,则不呈现。如果配置了stack_path,比对结果中是否呈现NPU_Stack_Info则通过命令行参数-s来控制。 | 否 | +| is_print_compare_log | 配置是否开启单个算子的日志打屏。可取值 true 或 false,默认为 true。关闭后则只输出常规日志,bool 类型。 | 否 | #### 2.1.5 自定义映射文件 @@ -145,7 +151,7 @@ msprobe -f pytorch compare -i ./compare.json -o ./output -s 冒号左侧和右侧分别为PyTorch框架不同版本或不同芯片环境的API的名称和module模块名称。 -API和模块名称请从《[PyTorch 场景的精度数据采集](https://gitee.com/ascend/mstt/blob/8914fbb31ff6da3898c3bb7b97ba99e23b0f1d38/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md)》中的dump.json文件获取。 +API和模块名称请从《[PyTorch 场景的精度数据采集](05.data_dump_PyTorch.md)》中的dump.json文件获取。 文件内容示例: @@ -174,13 +180,13 @@ compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_mat **参数说明**: -| 参数名 | 说明 | 是否必选 | -| ------------ | ------------------------------------------------------------ | -------- | +| 参数名 | 说明 | 是否必选 | +| ------------ |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------- | | input_param | 配置 dump 数据文件及目录,dict 类型。配置参数包括:
"npu_json_path":指定 NPU dump 目录下的 dump.json 文件。
**配置示例**:"npu_json_path": "./npu_dump/dump.json"。
"bench_json_path":指定 CPU、GPU 或 NPU dump 目录下的 dump.json 文件。
**配置示例**:"bench_json_path": "./bench_dump/dump.json"。
"stack_json_path":指定 NPU dump 目录下的 stack.json 文件。
**配置示例**:"stack_json_path": "./npu_dump/stack.json"。
"is_print_compare_log":配置是否开启单个算子的日志打屏。
**配置示例**:True 或 False。 | 是 | -| output_path | 配置比对结果文件存盘目录,str 类型。
**配置示例**:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.xlsx`。 | 是 | -| stack_mode | 配置 stack_mode 的开关,bool 类型。仅当配置 stack_json_path 时需要开启。
**配置示例**:stack_mode=True,默认为 False。 | 否 | -| auto_analyze | 自动精度分析,bool 类型。开启后工具自动针对比对结果进行分析,识别到第一个精度可能不达标节点(在比对结果文件中的 Accuracy Reached or Not 列显示为 No),并给出问题可能产生的原因(打屏展示并生成 advisor_{timestamp}.txt 文件)。
**配置示例**:auto_analyze=False,默认为 True。 | 否 | -| fuzzy_match | 模糊匹配,bool 类型。开启后,对于网络中同一层级且命名仅调用次数不同的 API,可匹配并进行比对。
**配置示例**:fuzzy_match=True,默认为 False。 | 否 | +| output_path | 配置比对结果文件存盘目录,str 类型。
**配置示例**:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.xlsx`。 | 是 | +| stack_mode | 配置 stack_mode 的开关,bool 类型。仅当配置 stack_json_path 时需要,开启时比对结果呈现NPU_Stack_Info,关闭时不呈现。当不配置stack_json_path 时,自动识别是否存在stack.json,存在时呈现NPU_Stack_Info,否则不呈现。
**配置示例**:stack_mode=True,默认为 False。 | 否 | +| auto_analyze | 自动精度分析,bool 类型。开启后工具自动针对比对结果进行分析,识别到第一个精度可能不达标节点(在比对结果文件中的 Accuracy Reached or Not 列显示为 No),并给出问题可能产生的原因(打屏展示并生成 advisor_{timestamp}.txt 文件)。
**配置示例**:auto_analyze=False,默认为 True。 | 否 | +| fuzzy_match | 模糊匹配,bool 类型。开启后,对于网络中同一层级且命名仅调用次数不同的 API,可匹配并进行比对。
**配置示例**:fuzzy_match=True,默认为 False。 | 否 | **函数示例**: @@ -209,12 +215,12 @@ compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) **参数说明**: -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| npu_dump_dir | 配置 NPU 环境下的 dump 目录。str 类型。dump 数据目录须指定到 step 级。
**配置示例**:'./npu_dump/step0'。 | 是 | -| bench_dump_dir | 配置 CPU、GPU 或 NPU 环境下的 dump 目录。str 类型。
**配置示例**:'./gpu_dump/step0'。 | 是 | +| 参数名 | 说明 | 是否必选 | +| -------------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------| -------- | +| npu_dump_dir | 配置 NPU 环境下的 dump 目录。str 类型。dump 数据目录须指定到 step 级。
**配置示例**:'./npu_dump/step0'。 | 是 | +| bench_dump_dir | 配置 CPU、GPU 或 NPU 环境下的 dump 目录。str 类型。
**配置示例**:'./gpu_dump/step0'。 | 是 | | output_path | 配置比对结果文件存盘目录。需要预先创建 output_path 目录。str 类型。
**配置示例**:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.xlsx`。 | 是 | -| **kwargs | 支持 compare 的所有可选参数。 | 否 | +| **kwargs | 支持 compare 的所有可选参数。 其中,stack_mode不生效,自动识别是否存在stack.json,如存在,呈现NPU_Stack_Info,否则不呈现。 | 否 | **函数示例**: @@ -231,7 +237,7 @@ PyTorch 精度比对是以 CPU 或 GPU 的计算结果为标杆,通过计算 - `compare_result_{timestamp}.xlsx` 文件列出了所有执行精度比对的 API 详细信息和比对结果,示例如下: - ![compare_result](https://gitee.com/cai-weiwei1989/att_ptdbg/raw/master/debug/accuracy_tools/ptdbg_ascend/doc/img/compare_result.png) + ![compare_result](./img/compare_result.png) - **提示**:比对结果通过颜色标记、比对结果(Result)、计算精度达标情况(Accuracy Reached no Not)、错误信息提示(Err_Message)定位可疑算子,但鉴于每种指标都有对应的判定标准,还需要结合实际情况进行判断。 @@ -251,25 +257,29 @@ PyTorch 精度比对是以 CPU 或 GPU 的计算结果为标杆,通过计算 统计量有 4 种:最大值(max)、最小值(min)、平均值(mean)和 L2-范数(L2 norm)。 -|dump 数据模式|Cosine (tensor 余弦相似度)|MaxAbsErr (tensor 最大绝对误差)|MaxRelativeErr (tensor 最大相对误差)|One Thousandth Err Ratio (tensor 相对误差小于千分之一的比例)|Five Thousandth Err Ratio (tensor 相对误差小于千分之五的比例)|NPU 和 bench 的统计量绝对误差 (max, min, mean, L2 norm) diff| NPU 和 bench 的统计量相对误差 (max, min, mean, L2 norm) RelativeErr |NPU 和 bench 的统计量 (max, min, mean, L2 norm)|NPU MD5 (NPU 数据 CRC-32 值)|BENCH MD5 (bench 数据 CRC-32 值)|Result (比对结果)|Accuracy Reached or Not (计算精度是否达标)|Err_message (错误信息提示)|NPU_Stack_Info (堆栈信息)|Data_Name (NPU 真实数据名)| -|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| -|真实数据模式|√|√|√|√|√|||√||||√|√|√|√| -|统计数据模式||||||√|√|√|||√||√|√|| -|MD5 模式|||||||||√|√|√|||√|| +|dump 数据模式|Cosine (tensor 余弦相似度)|EucDist (tensor 欧式距离)|MaxAbsErr (tensor 最大绝对误差)|MaxRelativeErr (tensor 最大相对误差)|One Thousandth Err Ratio (tensor 相对误差小于千分之一的比例)|Five Thousandth Err Ratio (tensor 相对误差小于千分之五的比例)|NPU 和 bench 的统计量绝对误差 (max, min, mean, L2 norm) diff| NPU 和 bench 的统计量相对误差 (max, min, mean, L2 norm) RelativeErr |NPU 和 bench 的统计量 (max, min, mean, L2 norm)|NPU MD5 (NPU 数据 CRC-32 值)|BENCH MD5 (bench 数据 CRC-32 值)|Result (比对结果)|Accuracy Reached or Not 
(计算精度是否达标)|Err_message (错误信息提示)|NPU_Stack_Info (堆栈信息)|Data_Name (NPU 真实数据名)| +|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| +|真实数据模式|√|√|√|√|√|√|||√||||√|√|√|√| +|统计数据模式|||||||√|√|√|||√||√|√|| +|MD5 模式||||||||||√|√|√|||√|| + +上表中NPU_Stack_Info字段需要配置-s参数生成。 ### 3.2 颜色标记——真实数据模式、统计数据模式 +在比对结果中的Err_message列呈现比对结果颜色标记的原因,具体含义如下: + 红色标记情况: -1. 一个 API 或模块的 One Thousandth Err Ratio 的 input > 0.9 同时 output < 0.6(真实数据模式); -2. 一个 API 或模块的 output 的最大值相对误差 (Max diff 除以 max(0.01, Bench max)) > 0.5(统计数据模式); -3. 一个 API 或模块的 NPU 的最大值或最小值中存在 nan/inf/-inf(真实数据模式、统计数据模式); -4. 一个 API 或模块的最大值绝对误差大于 1e+10(真实数据模式,统计数据模式)。 +1. 一个 API 或模块的 NPU 的最大值或最小值中存在 nan/inf/-inf(真实数据模式、统计数据模式); +2. 一个 API 或模块的最大值绝对误差大于 1e+10(真实数据模式,统计数据模式); +3. 一个 API 或模块的 One Thousandth Err Ratio 的 input/parameters > 0.9 同时 output < 0.6(真实数据模式)(仅标记output); +4. 一个 API 或模块的 output 的最大值相对误差 (Max diff 除以 max(0.01, Bench max)) > 0.5(统计数据模式)(仅标记output)。 -黄色标记情况: -1. 一个 API 或模块的 One Thousandth Err Ratio 的 input - output > 0.1(真实数据模式); -2. 一个 API 或模块的 Cosine 的 input - output > 0.1(真实数据模式); -3. 一个 API 或模块的 output 的最大值相对误差 > 0.1 同时 input < 0.01(真实数据模式,统计数据模式); -4. 一个 API 或模块的 input 与 output 的最大值绝对误差都大于 1,同时 output 比 input 大一个数量级以上(真实数据模式、统计数据模式)。 +黄色标记情况(仅标记output): +1. 一个 API 或模块的 input/parameters 与 output 的最大值绝对误差都大于 1,同时 output 比 input/parameters 大一个数量级以上(真实数据模式、统计数据模式); +2. 一个 API 或模块的 One Thousandth Err Ratio 的 input/parameters - output > 0.1(真实数据模式); +3. 一个 API 或模块的 output 的最大值相对误差 > 0.1 同时 input/parameters < 0.01(真实数据模式,统计数据模式); +4. 一个 API 或模块的 Cosine 的 input/parameters - output > 0.1(真实数据模式)。 ### 3.3 比对结果(Result)——统计数据模式、MD5 模式 @@ -310,18 +320,98 @@ MD5 模式: 5. "This is empty data, can not compare.":读取到的数据为空(真实数据模式); 6. "Shape of NPU and bench Tensor do not match. Skipped.":NPU 和 Bench 的数据结构不一致(真实数据模式); 7. "The Position of inf or nan in NPU and bench Tensor do not match.":NPU 和 Bench 的数据有 nan/inf(真实数据模式); -8. 
"This is type of scalar data, can not compare.":NPU 为标量(真实数据模式); +8. "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', 'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'.":NPU 为0维张量(真实数据模式); 9. "Dtype of NPU and bench Tensor do not match.":NPU 和 Bench 数据的数据类型不同(真实数据模式); 10. "":除以上情况的其余情况(真实数据模式、统计数据模式)。 +除以上错误信息提示外,异常数据颜色高亮标记的原因叠加呈现于此列。 + ### 3.6 计算精度评价指标分析 1. Cosine:通过计算两个向量的余弦值来判断其相似度,数值越接近于 1 说明计算出的两个张量越相似,实际可接受阈值为大于 0.99。在计算中可能会存在 nan,主要由于可能会出现其中一个向量为 0。 -2. MaxAbsErr:当最大绝对误差越接近 0 表示其计算的误差越小,实际可接受阈值为小于 0.001。 +2. EucDist:通过计算两个向量的欧式距离来判断其相似度,定义为多维空间中两个点之间的绝对距离。数值越接近0,张量越相似,数值越大,差异越大。 -3. MaxRelativeErr:当最大相对误差越接近 0 表示其计算的误差越小。 +3. MaxAbsErr:当最大绝对误差越接近 0 表示其计算的误差越小,实际可接受阈值为小于 0.001。 + +4. MaxRelativeErr:当最大相对误差越接近 0 表示其计算的误差越小。 当 dump 数据中存在 0 或 Nan 时,比对结果中最大相对误差则出现 inf 或 Nan 的情况,属于正常现象。 -4. One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指 NPU 的 Tensor 中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 +5. One Thousandth Err Ratio(相对误差小于千分之一的元素比例)、Five Thousandths Err Ratio(相对误差小于千分之五的元素比例)精度指标:是指 NPU 的 Tensor 中的元素逐个与对应的标杆数据对比,相对误差小于千分之一、千分之五的比例占总元素个数的比例。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 + +## 4 多卡比对结果提取汇总通信算子数据 + +本功能是将多卡比对场景的比对结果,进行通信算子数据提取和汇总,输出整理好的通信算子多卡比对精度表。 + +**使用场景** + +已完成精度比对,获得多卡精度比对结果,但是通信算子数据分布在多个结果件中,不利于精度问题的分析。通过此功能,可以汇总多卡通信算子数据,减少问题定位时间。 + +**约束** + +不支持MD5比对结果。 + +**命令示例** + +```bash +msprobe -f pytorch merge_result -i ./input_dir -o ./output_dir -config ./config.yaml +``` + +**完整参数说明** + +| 参数名 | 说明 | 是否必选 | +| ---------------------- |------------------------------------------------------------------------------------| -------- | +| -i 或 --input_dir | 多卡比对结果存盘目录,即使用compare比对的结果输出目录,str类型。所有比对结果应全部为真实数据比对结果或统计数据比对结果,否则可能导致汇总数据不完整。 | 是 | +| -o 或 --output_dir | 数据提取汇总结果存盘目录,str类型。文件名称基于时间戳自动生成,格式为:`multi_ranks_compare_merge_{timestamp}.xlsx`。 | 是 | +| -config或--config-path | 指定需要汇总数据的API和比对指标的yaml文件路径,str类型。
yaml文件详细介绍见下文“**yaml文件说明**”。 | 是 |
+
+**yaml文件说明**
+
+以config.yaml文件名为例,配置示例如下:
+
+```yaml
+api:
+- Distributed.all_reduce
+- Distributed.all_gather_into_tensor
+compare_index:
+- Max diff
+- L2norm diff
+- MeanRelativeErr
+```
+
+| 参数名 | 说明 |
+| ------------- | ------------------------------------------------------------ |
+| api | 表示需要汇总的API或module名称。如果没有配置,工具会提示报错。
api名称配置格式为:`{api_type}.{api_name}.{API调用次数}.{前向反向}`
须按顺序配置以上四个字段,可按如下组合配置:
{api_type}
{api_type}.{api_name}
{api_type}.{api_name}.{API调用次数}
{api_type}.{api_name}.{API调用次数}.{前向反向}
这里的api指代API或module。 | +| compare_index | 表示需要汇总的比对指标。compare_index需为dump_mode对应比对指标的子集。如果没有配置,工具将根据比对结果自动提取dump_mode对应的全部比对指标进行汇总。
统计数据模式比对指标:Max diff、Min diff、Mean diff、Norm diff、MaxRelativeErr、MinRelativeErr、MeanRelativeErr、NormRelativeErr
真实数据模式比对指标:Cosine、MaxAbsErr、MaxRelativeErr、One Thousandth Err Ratio、Five Thousandths Err Ratio | + +**汇总结果件说明** + +多卡数据汇总结果如下所示: + +![merge_result](img/merge_result.png) + +1. NPU Name列表示API或module名称。 +2. rank*列为多卡数据。 +3. 不同比对指标的数据通过不同sheet页呈现。 +4. 如果一个API或module在某张卡上找不到数据,汇总结果中将空白呈现。 +5. 如果比对指标值为N/A,unsupported,Nan,表示无法计算该比对指标值,汇总结果将以”NPU:’NPU max值‘ Bench:’Bench max值‘“呈现。 +6. 针对图示案例,此处NPU:N/A Bench:N/A表示output为None。 + +
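
前文 3.6 节中各项计算精度评价指标(Cosine、EucDist、MaxAbsErr、MaxRelativeErr、双千指标)的定义,可用如下示意性 Python 片段帮助理解(假设性示例,仅演示指标含义,`eps` 防除零项为演示假设,并非 msprobe 的实际实现):

```python
import math

def compare_metrics(npu, bench, eps=1e-9):
    """按 3.6 节的指标定义逐项计算(示意实现,eps 为假设的防除零项)。"""
    abs_err = [abs(a - b) for a, b in zip(npu, bench)]
    # 相对误差以标杆数据为分母,eps 防止除零
    rel_err = [e / max(abs(b), eps) for e, b in zip(abs_err, bench)]
    dot = sum(a * b for a, b in zip(npu, bench))
    norm_n = math.sqrt(sum(a * a for a in npu))
    norm_b = math.sqrt(sum(b * b for b in bench))
    return {
        # 余弦相似度:越接近 1 越相似,文档给出的可接受阈值为大于 0.99
        "Cosine": dot / (norm_n * norm_b) if norm_n and norm_b else float("nan"),
        # 欧式距离:越接近 0 越相似
        "EucDist": math.sqrt(sum((a - b) ** 2 for a, b in zip(npu, bench))),
        "MaxAbsErr": max(abs_err),
        "MaxRelativeErr": max(rel_err),
        # 相对误差小于千分之一 / 千分之五的元素占比
        "One Thousandth Err Ratio": sum(r < 1e-3 for r in rel_err) / len(rel_err),
        "Five Thousandths Err Ratio": sum(r < 5e-3 for r in rel_err) / len(rel_err),
    }

m = compare_metrics([1.0, 2.0, 3.001], [1.0, 2.0, 3.0])
print(m["Cosine"], m["MaxAbsErr"])
```

该片段只体现指标的数学含义,工具内部对 0 维张量、nan/inf 等特殊情况另有处理(见 3.5 节错误信息提示)。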
+
如何基于group信息查看分组数据:
以Distributed.all_reduce.0.forward为例。这个API对多卡数据进行规约操作,输出为一个group内的规约结果,同一个group内的输出保持一致。
这个API中,rank0-3为一个group,Distributed.all_reduce.0.forward.input.group展示为tp-0-1-2-3,rank0-3输出一致;rank4-7为一个group,展示为tp-4-5-6-7,rank4-7输出一致。
group信息除了tp-0-1-2-3这种命名形式,还有如[0, 1, 2, 3]的rank列表呈现形式。 +
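
上述按 group 规约、同组输出一致的行为,可用如下示意性 Python 片段模拟(假设性示例,以求和规约演示 all_reduce 语义,并非通信库实现):

```python
def all_reduce_by_group(rank_inputs, groups, op=sum):
    """模拟 Distributed.all_reduce:同一 group 内各 rank 的输入做规约,
    组内每个 rank 得到相同的规约结果(示意实现)。"""
    outputs = {}
    for group in groups:
        reduced = op(rank_inputs[r] for r in group)
        for r in group:
            outputs[r] = reduced
    return outputs

# rank0-3 为一个 group,rank4-7 为另一个 group(对应 tp-0-1-2-3 / tp-4-5-6-7)
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]
inputs = {r: float(r) for r in range(8)}
out = all_reduce_by_group(inputs, groups)
print(out[0], out[4])  # 同组各 rank 输出一致,不同组结果不同
```

据此可以理解多卡比对结果中,为何同一 group 内各 rank 的 all_reduce 输出列应当一致,而不同 group 之间可以不同。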
+常见通信API预期结果: + +1. Distributed.all_gather:多卡数据汇总,每张卡输入可以不一致,同group内输出一致,输出是张量列表。 +2. Distributed.all_gather_into_tensor:多卡数据汇总,每张卡输入可以不一致,同group内输出一致,输出是张量。 +3. Distributed.all_reduce:多卡数据规约操作,每张卡输入可以不一致,同group内输出一致,为规约结果。 +4. Distributed.reduce_scatter:多卡数据规约操作,每张卡输入可以不一致,输出为group内规约结果的不同部分,输入是张量列表。 +5. Distributed.reduce_scatter_tensor:多卡数据规约操作,每张卡输入可以不一致,输出为group内规约结果的不同部分,输入是张量。 +6. Distributed.broadcast:输入为要广播的数据,输出为广播后的数据。 +7. Distributed.isend:点对点通信,输入为要发送的数据,输出为发送的数据。 +8. Distributed.irecv:点对点通信,输入为原数据,输出为接收的新数据。 +9. Distributed.all_to_all_single:输出数据为所有卡上的数据切分后合并的结果。 \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/docs/11.accuracy_compare_MindSpore.md b/debug/accuracy_tools/msprobe/docs/11.accuracy_compare_MindSpore.md index f6c4f4c444ee75a64fb74cfce9271c44753a2392..1b1824a774f15a86106585669d5f3412b3faca2e 100644 --- a/debug/accuracy_tools/msprobe/docs/11.accuracy_compare_MindSpore.md +++ b/debug/accuracy_tools/msprobe/docs/11.accuracy_compare_MindSpore.md @@ -1,5 +1,10 @@ # MindSpore 场景的精度比对 +## 🚨 重要通知 + +**1. 精度比对已支持自动识别stack.json并呈现NPU_Stack_Info,用户可无需配置compare.json中的"stack_path"字段和命令行中的-s参数。具体使用参见“4.1 比对文件”中的参数说明。命令行方式中的-s(--stack_mode)将于2025.9.30废弃,并且不再需要配置compare.json中的"stack_path"字段。** + + ## 1 简介 msprobe精度比对工具主要用于如下场景: @@ -33,14 +38,16 @@ msprobe -f mindspore compare -i ./compare.json -o ./output -s | 参数名 | 说明 | 是否必选 | | -------------------- | ------------------------------------------------------------ | -------- | | -i或--input_path | 指定比对文件。比对文件内容及示例请参见[比对文件](#31-比对文件)或[比对文件(kernel)](#32-比对文件kernel)(比对文件(kernel)仅[不同版本下的全量kernel比对](#23-不同版本下的全量kernel比对)场景支持)。 | 是 | -| -o或--output_path | 配置比对结果文件存盘目录。文件名称基于时间戳自动生成,格式为:
`compare_result_{timestamp}.xlsx`
`compare_result_{rank_id}_{step_id}_{timestamp}.xlsx`(仅[不同版本下的全量kernel比对](#23-不同版本下的全量kernel比对)场景支持)。 | 是 | -| -s或--stack_mode | 配置stack_mode的开关。仅当[比对文件](#31-比对文件)配置"stack_path"需要开启。通过直接配置该参数开启,默认未配置,表示关闭。 | 否 | +| -o或--output_path | 配置比对结果文件存盘目录,默认会在当前目录创建output目录。文件名称基于时间戳自动生成,格式为:
`compare_result_{timestamp}.xlsx`
`compare_result_{rank_id}_{step_id}_{timestamp}.xlsx`(仅[不同版本下的全量kernel比对](#23-不同版本下的全量kernel比对)场景支持)。 | 否 | +| -s或--stack_mode | 比对结果展示调用栈信息(NPU_Stack_Info)的开关,bool 类型。单卡场景开启时,需要使用[比对文件](#31-比对文件)的单卡场景配置stack_path指定stack.json文件,才能生成详细调用栈信息,否则在比对时会报错;暂不支持多卡场景。通过直接配置该参数开启,默认未配置,表示关闭。 | 否 | | -c或--compare_only | 仅比对开关,bool 类型。该参数默认未配置,会启用自动精度分析,工具自动针对比对结果进行分析,识别到第一个精度可能不达标节点(在比对结果文件中的 Accuracy Reached or Not 列显示为 No),并给出问题可能产生的原因(打屏展示并生成 `advisor_{timestamp}.txt` 文件)。通过配置该参数取消自动精度分析,仅输出比对结果表格。 | 否 | | -f或--fuzzy_match | 模糊匹配。开启后,对于网络中同一层级且命名仅调用次数不同的API,可匹配并进行比对。通过直接配置该参数开启,默认未配置,表示关闭。 | 否 | -| -am或--api_mapping | 跨框架比对。配置该参数时表示开启跨框架API比对功能。仅[跨框架的API比对](#25-跨框架的api比对)场景需要配置。 | 否 | -| -cm或--cell_mapping | 跨框架比对。配置该参数时表示开启跨框架cell模块比对功能,可以指定自定义映射文件*.yaml,不指定映射文件时按照msprobe定义的默认映射关系进行比对。自定义映射文件的格式请参见[自定义映射文件(cell)](#33-自定义映射文件cell)。仅[跨框架的cell模块比对](#26-跨框架的cell模块比对)场景需要配置。 | 否 | -| -dm或--data_mapping | 跨框架比对。配置该参数时表示开启跨框架API或模块的比对功能,需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件(API和模块)](#34-自定义映射文件api和模块)。仅[跨框架的API或模块比对](#27-跨框架的api或模块比对)场景需要配置。 | 否 | -| -lm或--layer_mapping | 跨框架比对。配置该参数时表示开启跨框架Layer层的比对功能,指定模型代码中的Layer层后,可以识别对应dump数据中的模块或API。需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件(Layer)](#35-自定义映射文件layer)。仅[跨框架的Layer层比对](#28-跨框架的layer层比对)场景需要配置。 | 否 | +| -am或--api_mapping | 跨框架比对。配置该参数时表示开启跨框架API比对功能,可以指定自定义映射文件*.yaml,不指定映射文件时按照msprobe定义的默认映射关系进行比对。自定义映射文件的格式请参见[自定义映射文件(api_mapping)](#33-自定义映射文件api_mapping)。仅[跨框架的API比对](#25-跨框架的api比对)场景需要配置。 | 否 | +| -cm或--cell_mapping | 跨框架比对。配置该参数时表示开启跨框架cell模块比对功能,可以指定自定义映射文件*.yaml,不指定映射文件时按照msprobe定义的默认映射关系进行比对。自定义映射文件的格式请参见[自定义映射文件(cell_mapping)](#34-自定义映射文件cell_mapping)。仅[跨框架的cell模块比对](#26-跨框架的cell模块比对)场景需要配置。 | 否 | +| -dm或--data_mapping | 同框架或跨框架比对。通过映射文件指定两个具体参数的对应关系,可以在L0、L1或mix采集场景下使用。配置该参数的同时需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件(data_mapping)](#35-自定义映射文件data_mapping)。 | 否 | +| -lm或--layer_mapping | 
跨框架比对。配置该参数时表示开启跨框架Layer层的比对功能,指定模型代码中的Layer层后,可以识别对应dump数据中的模块或API。需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件(Layer_mapping)](#36-自定义映射文件layer_mapping)。仅[跨框架的Layer层比对](#27-跨框架的layer层比对)场景需要配置。 | 否 | + +动态图模式没有填写任何mapping时,按照同框架比对的方式进行比对,比对数据和标杆数据的Cell或Api名称需要完全相同才能匹配得上。 ### 2.2 不同版本下的全量API比对 @@ -102,6 +109,21 @@ msprobe -f mindspore compare -i ./compare.json -o ./output -s msprobe -f mindspore compare -i ./compare.json -o ./output -s -am ``` + 或 + + ```shell + msprobe -f mindspore compare -i ./compare.json -o ./output -s -am api_mapping.yaml + ``` + + api_mapping.yaml文件配置请参见[自定义映射文件(api_mapping)](#33-自定义映射文件api_mapping)。 + 不传入api_mapping.yaml的情况下将按照内置的api映射进行匹配;传入api_mapping.yaml的情况下优先按照api_mapping.yaml的内容进行匹配,api_mapping.yaml中没有涉及的按照内置的api映射进行匹配。 + + 此外,也可以通过data_mapping.yaml文件实现具体参数的匹配,例: + ```shell + msprobe -f mindspore compare -i ./compare.json -o ./output -s -dm data_mapping.yaml + ``` + data_mapping.yaml的写法请参见[自定义映射文件(data_mapping)](#35-自定义映射文件data_mapping)。 + 5. 查看比对结果,请详见PyTorch目录下的《[PyTorch 场景的精度比对-精度比对结果分析](./10.accuracy_compare_PyTorch.md#3-精度比对结果分析)》章节。 ### 2.6 跨框架的cell模块比对 @@ -124,15 +146,22 @@ msprobe -f mindspore compare -i ./compare.json -o ./output -s msprobe -f mindspore compare -i ./compare.json -o ./output -s -cm cell_mapping.yaml ``` - cell_mapping.yaml文件配置请参见[自定义映射文件(cell)](#33-自定义映射文件cell)。 + cell_mapping.yaml文件配置请参见[自定义映射文件(cell_mapping)](#34-自定义映射文件cell_mapping)。 + 不传入cell_mapping.yaml的情况下仅将Cell改成Module后进行匹配;传入cell_mapping.yaml的情况下将按照cell_mapping.yaml的内容进行匹配。 + + 此外,也可以通过data_mapping.yaml文件实现具体参数的匹配,例: + ```shell + msprobe -f mindspore compare -i ./compare.json -o ./output -s -dm data_mapping.yaml + ``` + data_mapping.yaml的写法请参见[自定义映射文件(data_mapping)](#35-自定义映射文件data_mapping)。 5. 
查看比对结果,请详见PyTorch目录下的《[PyTorch 场景的精度比对-精度比对结果分析](./10.accuracy_compare_PyTorch.md#3-精度比对结果分析)》章节。 -### 2.7 跨框架的API或模块比对 +### 2.7 跨框架的Layer层比对 -该场景可用于在“**跨框架的API比对**”和“**跨框架的cell模块比对**”场景均无法完全覆盖模型中的API和模块时,通过手动指定映射关系来补全未被比对的API或模块。 +layer_mapping可以从Layer层识别整网的API和Cell,简化配置。 -1. 配置[config.json](../config.json)文件level配置为L0或L1、task配置为tensor或statistics并指定需要dump的API或模块名。 +1. 配置[config.json](../config.json)文件level配置为L0或mix、task配置为tensor或statistics并指定需要dump的API或模块名。 2. 参见《[MindSpore 场景的精度数据采集](./06.data_dump_MindSpore.md)》和《[PyTorch 场景的精度数据采集](./05.data_dump_PyTorch.md)》完成不同环境下API或模块精度数据的采集,得到两个框架的API或模块dump数据。 @@ -141,36 +170,99 @@ msprobe -f mindspore compare -i ./compare.json -o ./output -s 4. 执行如下示例命令进行比对: ```shell - msprobe -f mindspore compare -i ./compare.json -o ./output -s -dm data_mapping.yaml + msprobe -f mindspore compare -i ./compare.json -o ./output -s -lm layer_mapping.yaml ``` - data_mapping.yaml文件配置请参见[自定义映射文件(all)](#34-自定义映射文件all)。 + layer_mapping.yaml文件配置请参见[自定义映射文件(layer_mapping)](#36-自定义映射文件layer_mapping)。 + + 此外,也可以通过data_mapping.yaml文件实现具体参数的匹配,例: + ```shell + msprobe -f mindspore compare -i ./compare.json -o ./output -s -dm data_mapping.yaml + ``` + data_mapping.yaml的写法请参见[自定义映射文件(data_mapping)](#35-自定义映射文件data_mapping)。 5. 查看比对结果,请详见PyTorch目录下的《[PyTorch 场景的精度比对-精度比对结果分析](./10.accuracy_compare_PyTorch.md#3-精度比对结果分析)》章节。 -### 2.8 跨框架的Layer层比对 +## 3 多卡比对结果提取汇总通信算子数据 -该场景可简化API或模块场景的配置,从Layer层识别整网的API和模块。 +本功能是将多卡比对场景的比对结果,进行通信算子数据提取和汇总,输出整理好的通信算子多卡比对精度表。 -1. 配置[config.json](../config.json)文件level配置为L0或mix、task配置为tensor或statistics并指定需要dump的API或模块名。 +**使用场景** -2. 参见《[MindSpore 场景的精度数据采集](./06.data_dump_MindSpore.md)》和《[PyTorch 场景的精度数据采集](./05.data_dump_PyTorch.md)》完成不同环境下API或模块精度数据的采集,得到两个框架的API或模块dump数据。 +已完成精度比对,获得多卡精度比对结果,但是通信算子数据分布在多个结果件中,不利于精度问题的分析。通过此功能,可以汇总多卡通信算子数据,减少问题定位时间。 -3. 创建比对文件,文件内容及示例请参见[比对文件](#31-比对文件)。 +**约束** -4. 
执行如下示例命令进行比对: +- 不支持MD5比对结果。 +- 不支持MindSpore静态图比对结果。 - ```shell - msprobe -f mindspore compare -i ./compare.json -o ./output -s -lm layer_mapping.yaml - ``` +**命令示例** - layer_mapping.yaml文件配置请参见[自定义映射文件(Layer)](#35-自定义映射文件layer)。 +```bash +msprobe -f mindspore merge_result -i ./input_dir -o ./output_dir -config ./config.yaml +``` -5. 查看比对结果,请详见PyTorch目录下的《[PyTorch 场景的精度比对-精度比对结果分析](./10.accuracy_compare_PyTorch.md#3-精度比对结果分析)》章节。 +**完整参数说明** + +| 参数名 | 说明 | 是否必选 | +| ---------------------- | ------------------------------------------------------------ | -------- | +| -i 或 --input_dir | 多卡比对结果存盘目录,即使用compare比对的结果输出目录,str类型。所有比对结果应全部为真实数据比对结果或统计数据比对结果,否则可能导致汇总数据不完整。 | 是 | +| -o 或 --output_dir | 数据提取汇总结果存盘目录,str类型。文件名称基于时间戳自动生成,格式为:`multi_ranks_compare_merge_{timestamp}.xlsx`。 | 是 | +| -config或--config-path | 指定需要汇总数据的API和比对指标的yaml文件路径,str类型。
yaml文件详细介绍见下文“**yaml文件说明**”。 | 是 | -## 3 附录 +**yaml文件说明** -### 3.1 比对文件 +以config.yaml文件名为例,配置示例如下: + +``` +api: +- Distributed.all_reduce +- Distributed.all_gather_into_tensor +compare_index: +- Max diff +- L2norm diff +- MeanRelativeErr +``` + +| 参数名 | 说明 | +| ------------- | ------------------------------------------------------------ | +| api | 表示需要汇总的API或module名称。如果没有配置,工具会提示报错。
api名称配置格式为:`{api_type}.{api_name}.{API调用次数}.{前向反向}`
须按顺序配置以上四个字段,可按如下组合配置:
{api_type}
{api_type}.{api_name}
{api_type}.{api_name}.{API调用次数}
{api_type}.{api_name}.{API调用次数}.{前向反向}
这里的api指代API或module。 | +| compare_index | 表示需要汇总的比对指标。compare_index需为dump_mode对应比对指标的子集。如果没有配置,工具将根据比对结果自动提取dump_mode对应的全部比对指标进行汇总。
统计数据模式比对指标:Max diff、Min diff、Mean diff、Norm diff、MaxRelativeErr、MinRelativeErr、MeanRelativeErr、NormRelativeErr
真实数据模式比对指标:Cosine、MaxAbsErr、MaxRelativeErr、One Thousandth Err Ratio、Five Thousandths Err Ratio |
+
+**汇总结果件说明**
+
+多卡数据汇总结果如下所示:
+
+![merge_result](img/merge_result.png)
+
+1. NPU Name列表示API或module名称。
+2. rank*列为多卡数据。
+3. 不同比对指标的数据通过不同sheet页呈现。
+4. 如果一个API或module在某张卡上找不到数据,汇总结果中将空白呈现。
+5. 如果比对指标值为N/A、unsupported、Nan,表示无法计算该比对指标值,汇总结果将以“NPU:‘NPU max值’ Bench:‘Bench max值’”呈现。
+6. 针对图示案例,此处NPU:N/A Bench:N/A表示output为None。
+
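上述汇总表的组织方式(每个比对指标一个 sheet,NPU Name 列加各 rank 列,缺数据处空白)可以用一小段 Python 示意。以下示例数据为虚构,用列表加字典模拟单个 sheet 的行,仅演示如何筛出在部分卡上缺少数据的 API;实际读取 xlsx 文件时可结合 pandas 或 openpyxl,此处是简化假设:

```python
# 用虚构数据模拟汇总结果中某个比对指标(如 Max diff)的 sheet:
# 每行一个 API/module,"NPU Name" 列为名称,rank* 列为各卡数据,None 表示该卡缺数据(表中空白)
sheet_max_diff = [
    {"NPU Name": "Distributed.all_reduce.0.forward.output.0",
     "rank0": 0.0, "rank1": 0.0, "rank2": 0.0, "rank3": 0.0},
    {"NPU Name": "Distributed.all_gather_into_tensor.0.forward.output.0",
     "rank0": 0.001, "rank1": None, "rank2": 0.001, "rank3": 0.001},
]

def find_incomplete_rows(sheet):
    """找出在任意 rank 列上缺少数据的 API/module 名称"""
    incomplete = []
    for row in sheet:
        rank_values = [v for k, v in row.items() if k.startswith("rank")]
        if any(v is None for v in rank_values):
            incomplete.append(row["NPU Name"])
    return incomplete

print(find_incomplete_rows(sheet_max_diff))
# 输出:['Distributed.all_gather_into_tensor.0.forward.output.0']
```

若使用 pandas,可通过 `pandas.read_excel(path, sheet_name=None)` 一次读入全部指标 sheet(返回 sheet 名到 DataFrame 的字典),再按相同思路逐表检查。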
+
如何基于group信息查看分组数据:

以Distributed.all_reduce.0.forward为例。这个API对多卡数据做规约操作,输出为一个group内的规约结果,同一个group内的输出保持一致。
这个API中,rank0-3为一个group,Distributed.all_reduce.0.forward.input.group展示为tp-0-1-2-3,rank0-3输出一致;rank4-7为一个group,展示为tp-4-5-6-7,rank4-7输出一致。
group除了这种形式,还有如[0, 1, 2, 3]的呈现形式。 + +
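基于上述两种 group 呈现形式,可以把 group 字段还原成 rank 列表,便于在汇总表中按组核对各 rank 列的输出是否一致。下面是一个简化的解析示意(解析规则按本文示例形式假设,实际字段格式以比对结果为准):

```python
import re

def parse_group(group_str):
    """将 group 字段解析为 rank 列表,覆盖文中的两种呈现形式:
    形式一:"tp-0-1-2-3"(组名-rank 序列);形式二:"[0, 1, 2, 3]"。"""
    s = group_str.strip()
    if s.startswith("["):                      # 形式二:直接提取括号内的所有数字
        return [int(x) for x in re.findall(r"\d+", s)]
    return [int(x) for x in s.split("-")[1:]]  # 形式一:去掉组名前缀后逐段转为整数

print(parse_group("tp-0-1-2-3"))    # [0, 1, 2, 3]
print(parse_group("[4, 5, 6, 7]"))  # [4, 5, 6, 7]
```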
+常见通信API预期结果: + +1. Distributed.all_gather:多卡数据汇总,每张卡输入可以不一致,同group内输出一致,输出是张量列表。 +2. Distributed.all_gather_into_tensor:多卡数据汇总,每张卡输入可以不一致,同group内输出一致,输出是张量。 +3. Distributed.all_reduce:多卡数据规约操作,每张卡输入可以不一致,同group内输出一致,为规约结果。 +4. Distributed.reduce_scatter:多卡数据规约操作,每张卡输入可以不一致,输出为group内规约结果的不同部分,输入是张量列表。 +5. Distributed.reduce_scatter_tensor:多卡数据规约操作,每张卡输入可以不一致,输出为group内规约结果的不同部分,输入是张量。 +6. Distributed.broadcast:输入为要广播的数据,输出为广播后的数据。 +7. Distributed.isend:点对点通信,输入为要发送的数据,输出为发送的数据。 +8. Distributed.irecv:点对点通信,输入为原数据,输出为接收的新数据。 +9. Distributed.all_to_all_single:输出数据为所有卡上的数据切分后合并的结果。 + +## 4 附录 + +### 4.1 比对文件 以在当前目录创建./compare.json为例,单卡场景示例如下: @@ -184,17 +276,25 @@ msprobe -f mindspore compare -i ./compare.json -o ./output -s } ``` +多卡场景示例如下: +```json +{ +"npu_path": "./npu_dump/step0", # 需填写到step层级(rank的上一层级) +"bench_path": "./bench_dump/step0", # 需填写到step层级(rank的上一层级) +"is_print_compare_log": true +} +``` **参数说明** | 参数名 | 说明 | 是否必选 | -| -------------------- | ------------------------------------------------------------ | -------- | -| npu_path | 配置NPU环境下的dump.json文件(单卡场景)。跨框架场景指定为MindSpore的json文件。数据类型:str。 | 是 | -| bench_path | 配置CPU、GPU或NPU环境下的dump.json文件(单卡场景)。 跨框架场景指定为PyTorch的json文件。数据类型:str。 | 是 | -| stack_path | 配置NPU dump目录下的stack.json文件。数据类型:str。 | 是 | -| is_print_compare_log | 配置是否开启单个算子的日志打屏。可取值true或false,默认为true。关闭后则只输出常规日志。数据类型:bool | 否 | +| -------------------- | ------------------------------------------------------------ |------| +| npu_path | 配置NPU环境下的dump.json文件(单卡场景)。跨框架场景指定为MindSpore的json文件。数据类型:str。 | 是 | +| bench_path | 配置CPU、GPU或NPU环境下的dump.json文件(单卡场景)。 跨框架场景指定为PyTorch的json文件。数据类型:str。 | 是 | +| stack_path | 配置NPU dump目录下的stack.json文件。数据类型:str。 如果没有配置stack_path,命令行-s参数不生效,程序自动识别是否存在stack.json文件,如存在,则比对结果中呈现NPU_Stack_Info,如不存在,则不呈现。如果配置了stack_path,比对结果中是否呈现NPU_Stack_Info则通过命令行参数-s来控制。 | 否 | +| is_print_compare_log | 配置是否开启单个算子的日志打屏。可取值true或false,默认为true。关闭后则只输出常规日志。数据类型:bool | 否 | -### 3.2 比对文件(kernel) +### 4.2 比对文件(kernel) 
仅[不同版本下的全量kernel比对](#23-不同版本下的全量kernel比对)场景支持。 @@ -232,19 +332,86 @@ msprobe -f mindspore compare -i ./compare.json -o ./output -s | rank_id | 配置比对的Rank ID。npu_path和bench_path目录下的dump文件需要存在对应Rank的数据。默认为空,表示比对所有Rank。可配置一个或多个Rank,多个Rank ID用逗号隔开,例如:"rank_id": [1,2,3]。数据类型:list[int]。 | 否 | | step_id | 配置比对的Step ID。npu_path和bench_path目录下的dump文件需要存在对应Step的数据。默认为空,表示比对所有Step。可配置一个或多个Step,多个Step ID用逗号隔开,例如:"step_id": [1,2,3]。数据类型:list[int]。 | 否 | -### 3.3 自定义映射文件(cell) +### 4.3 自定义映射文件(api_mapping) 文件名格式:\*.yaml,*为文件名,可自定义。 文件内容格式: ```yaml -{cell_name}.{class_name}: {module_name}.{class_name} +ms_api: {ms_api_name} +pt_api: {pt_api_name} +ms_args: +- {index1} +- {index2} +... +- {indexN} +pt_args: +- {index1} +- {index2} +... +- {indexN} +ms_outputs: +- {index1} +- {index2} +... +- {indexN} +pt_outputs: +- {index1} +- {index2} +... +- {indexN} ``` -冒号左侧为MindSpore框架cell模块的{cell_name}.{class_name},冒号右侧为PyTorch框架module模块的{module_name}.{class_name}。 +- ms_api/pt_api:分别为MindSpore和PyTorch框架的API名称,配置格式为{api_type}.{api_name}。API名称请分别从《[MindSpore 场景的精度数据采集](./06.data_dump_MindSpore.md)》和《[PyTorch 场景的精度数据采集](./05.data_dump_PyTorch.md)》中的dump.json文件获取。 +- ms_args/pt_args:分别为ms_api/pt_api对应的MindSpore和PyTorch框架API的入参的序号。 +- ms_outputs/pt_outputs:分别为ms_api/pt_api对应的MindSpore和PyTorch框架API的输出的序号。 + +**说明**: -{cell_name}.{class_name}从dump cell模块级.npy文件名获取,命名格式为:`{Cell}.{cell_name}.{class_name}.{前向反向}.{index}.{input/output}.{参数序号}` +- MindSpore和PyTorch框架的API映射关系可以从《[PyTorch与MindSpore API映射表](https://www.mindspore.cn/docs/zh-CN/r2.3.0rc2/note/api_mapping/pytorch_api_mapping.html)》获取,其中PyTorch与MindSpore API名称前缀的映射关系如下: + + | PyTorch | PyTorch在dump文件中的名称 | MindSpore | MindSpore在dump文件中的名称 | + | ------------------- | ------------------------- | ---------------- | --------------------------- | + | torch.nn.functional | Functional | mindspore.ops | Functional | + | torch.Tensor | Tensor | mindspore.Tensor | Tensor | + | torch | Torch | mindspore.ops | Functional | + + 
实际配置自定义映射文件(API)时需要使用dump文件中的名称。 + +- 自定义映射文件(API)需要满足ms_args/pt_args列表中的元素个数一致,ms_outputs/pt_outputs相同。 + +- 须确保列表自定义映射文件(API)配置元素的合法性,比如ms_args/pt_args的API用到的参数只有3个参数,那么用户实际指定的参数序号只能包含0、1、2;另外参数序号列表中的值不能重复。 + +文件内容示例: + +```yaml +ms_api: Functional.abs +pt_api: Torch.abs +ms_args: +- 0 +- 1 +pt_args: +- 0 +- 1 +ms_outputs: +- 0 +- 1 +pt_outputs: +- 0 +- 1 +# ms_args/pt_args和ms_outputs/pt_outputs参数的配置需要根据ms_api/pt_api的API入参和输出的顺序,例如Functional.abs API的入参为(a b c),那对应的ms_args为0 1 2,可根据实际需要选择,而Torch.abs的入参如果是(a b c),那么ms_args和pt_args配置一致即可,但如果Torch.abs的入参如果是(a c)或其他与Functional.abs不完全映射的值,那么ms_args和pt_args配置的序号需要与入参对应,Torch.abs(a c)的序号为0 1,Functional.abs(a b c)为0 1 2,只有a和c参数可以映射,那么ms_args配置为0 2,pt_args配置为0 1。ms_outputs/pt_outputs同理。 +``` + +### 4.4 自定义映射文件(cell_mapping) + +文件名格式:\*.yaml,*为文件名,可自定义。 + +文件内容格式: + +```yaml +{cell_name}.{class_name}: {module_name}.{class_name} +``` 文件内容示例: @@ -253,7 +420,21 @@ fc2.Dense: fc2.Linear conv1.Conv2d: conv3.Conv2d ``` -### 3.4 自定义映射文件(API和模块) +冒号左侧为MindSpore框架cell模块的{cell_name}.{class_name},冒号右侧为PyTorch框架module模块的{module_name}.{class_name}。 + +```yaml +{cell_name}.{class_name}从dump cell模块级.npy文件名获取,命名格式为: +{Cell}.{cell_name}.{class_name}.{forward/backward}.{index}.{input/output}.{参数序号/参数名} +或 +{Cell}.{cell_name}.{class_name}.parameters_grad.{parameter_name} + +{module_name}.{class_name}从dump module模块级.npy文件名获取,命名格式为: +{Module}.{module_name}.{class_name}.{forward/backward}.{index}.{input/output}.{参数序号/参数名} +或 +{Module}.{module_name}.{class_name}.parameters_grad.{parameter_name} +``` + +### 4.5 自定义映射文件(data_mapping) 文件名格式:\*.yaml,*为文件名,可自定义。 @@ -261,9 +442,11 @@ conv1.Conv2d: conv3.Conv2d ```yaml # API -{api_type}.{api_name}.{API调用次数}.{前向反向}.{input/output}.{参数序号}: {api_type}.{api_name}.{API调用次数}.{前向反向}.{input/output}.{参数序号} +{api_type}.{api_name}.{API调用次数}.{forward/backward}.{input/output}.{参数序号/参数名}: {api_type}.{api_name}.{API调用次数}.{forward/backward}.{input/output}.{参数序号/参数名} # 模块 
-{Cell}.{cell_name}.{class_name}.{前向反向}.{index}.{input/output}.{参数序号}: {Module}.{module_name}.{前向反向}.{index}.{input/output}.{参数序号} +{Cell}.{cell_name}.{class_name}.{forward/backward}.{index}.{input/output}.{参数序号/参数名}: {Module}.{module_name}.{class_name}.{forward/backward}.{index}.{input/output}.{参数序号/参数名} +或 +{Cell}.{cell_name}.{class_name}.parameters_grad.{parameter_name}: {Module}.{module_name}.{class_name}.parameters_grad.{parameter_name} ``` 冒号左侧为MindSpore框架API的名称和Cell模块的名称,冒号右侧为PyTorch框架API的名称和module模块名称。 @@ -277,9 +460,11 @@ API和模块名称请分别从《[MindSpore 场景的精度数据采集](./06.da Functional.flash_attention_score.4.forward.input.0: NPU.npu_fusion_attention.4.forward.input.0 # 模块 Cell.relu.ReLU.forward.0.input.0: Module.module.language_model.embedding.word_embedding.VocabParallelEmbedding.forward.0.input.0 +或 +Cell.relu.ReLU.parameters_grad.weight: Module.module.language_model.embedding.word_embedding.VocabParallelEmbedding.parameters_grad.weight ``` -API和模块名称在dump.json文件中的“data_name”字段展示,如下图红框处所示: +当dump.json文件中存在“data_name”字段时,API和模块名称为data_name字段去掉文件后缀,如下图红框处所示: - MindSpore dump @@ -289,7 +474,145 @@ API和模块名称在dump.json文件中的“data_name”字段展示,如下 ![pt_dump](./img/pt_dump.png) -### 3.5 自定义映射文件(Layer) +当dump.json文件中不存在“data_name”字段时,名称的拼写规则如下: + +input_args、input_kwargs和output使用统一的命名规则,当值是list类型时,名称后面添加'.{index}',当值类型是dict类型时,名称后面加'.{key}',当值类型是具体Tensor或null或空list/dict时,命名结束。 + +以下面cell的dump文件为例: +```yaml +"Cell.network.module.NetworkWithLoss.forward.0": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 24, + 16, + 1, + 60, + 34 + ], + "Max": 3.591925621032715, + "Min": -3.6856653690338135, + "Mean": -0.017044123262166977, + "Norm": 940.671630859375, + "md5": "00d69ba8" + }, + { + "y": { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 24, + 1, + 100, + 4096 + ], + "Max": 2.433350086212158, + "Min": -4.09375, + "Mean": -0.00010696164099499583, + "Norm": 170.3390655517578, + "md5": "a72e1fa4" + }, + "y_mask": { + "type": 
"mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 24, + 100 + ], + "Max": 1.0, + "Min": 0.0, + "Mean": 0.22999998927116394, + "Norm": 23.494680404663086, + "md5": "bbcbd5ab" + }, + "x_mask": { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 24, + 510 + ], + "Max": 1.0, + "Min": 1.0, + "Mean": 1.0, + "Norm": 110.63453674316406, + "md5": "766d1028" + }, + "loss_mask": { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 24, + 1, + 60, + 34 + ], + "Max": 1.0, + "Min": 1.0, + "Mean": 1.0, + "Norm": 221.26907348632812, + "md5": "0cb690ce" + }, + "data_info": { + "img_hw": null + } + } + ], +"input_kwargs": {}, +"output": [ +{ +"type": "mindspore.Tensor", +"dtype": "Float32", +"shape": [], +"Max": 0.3672327995300293, +"Min": 0.3672327995300293, +"Mean": 0.3672327995300293, +"Norm": 0.3672327995300293, +"md5": "28f8f74f" +} +] +} +``` +, +初始名称为`Cell.network.module.NetworkWithLoss.forward.0`,`input_args`是`list`,长度为2,按照顺序命名为 +``` +Cell.network.module.NetworkWithLoss.forward.0.input.0 +Cell.network.module.NetworkWithLoss.forward.0.input.1 +``` +第0项后面直接是`Tensor`,命名结束 +第1项后面是`dict`,key包括`y`、`y_mask`、`x_mask`和`data_info`,命名为 +``` +Cell.network.module.NetworkWithLoss.forward.0.input.1.y +Cell.network.module.NetworkWithLoss.forward.0.input.1.y_mask +Cell.network.module.NetworkWithLoss.forward.0.input.1.x_mask +Cell.network.module.NetworkWithLoss.forward.0.input.1.data_info +``` +`y`后面是`Tensor`,命名结束;`y_mask`后面是`Tensor`,命名结束;`x_mask`后面是`Tensor`,命名结束;`data_info`后面是`dict`,key是`img_hw`,命名为 +``` +Cell.network.module.NetworkWithLoss.forward.0.input.1.data_info.img_hw +``` +`img_hw`后面是`null`,命名结束。 + +`input_kwargs`是`dict`,长度为0,命名结束。 +`output`是`list`,长度为1,按照顺序命名为 +``` +Cell.network.module.NetworkWithLoss.forward.0.output.0 +``` +第0项后面是`Tensor`,命名结束。 + +综上,生成的op_name为: +``` +Cell.network.module.NetworkWithLoss.forward.0.input.0 +Cell.network.module.NetworkWithLoss.forward.0.input.1.y +Cell.network.module.NetworkWithLoss.forward.0.input.1.y_mask 
+Cell.network.module.NetworkWithLoss.forward.0.input.1.x_mask +Cell.network.module.NetworkWithLoss.forward.0.input.1.data_info.img_hw +Cell.network.module.NetworkWithLoss.forward.0.output.0 +``` + +### 4.6 自定义映射文件(Layer_mapping) 文件名格式:\*.yaml,*为文件名,可自定义。 @@ -315,13 +638,6 @@ PipelineCell: Cell: network_with_loss: module - -layers: # 手动映射MindSpore与PyTorch模型代码中的Layer层序号 - '5': '0' - '6': '1' - '7': '2' - '8': '3' - '9': '4' ``` Layer层名称需要从模型代码中获取。 @@ -330,4 +646,4 @@ yaml文件中只需配置MindSpore与PyTorch模型代码中功能一致但名称 模型代码示例: -![ms_dump](./img/ms_layer.png) +![ms_dump](./img/ms_layer.png) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/docs/12.overflow_check_PyTorch.md b/debug/accuracy_tools/msprobe/docs/12.overflow_check_PyTorch.md index 831e6f95a8a9fd2460dd7eddc7ff6a4b9a54cc2b..983477554e138f3e547f2d3efcf14fdfc4a991a0 100644 --- a/debug/accuracy_tools/msprobe/docs/12.overflow_check_PyTorch.md +++ b/debug/accuracy_tools/msprobe/docs/12.overflow_check_PyTorch.md @@ -26,7 +26,9 @@ msprobe 工具在 PyTorch 场景下提供溢出数据采集功能和溢出数据 ### 1.5 其他说明 -溢出数据采集功能在昇腾 NPU 上支持饱和模式和 INF/NAN 模式。INF/NAN 模式遵循 IEEE 754 标准,根据定义输出 INF/NAN 的计算结果。与之对应的饱和模式在计算出现溢出时,饱和为浮点数极值(+-MAX)。对于 CANN 侧配置,Atlas 训练系列产品,默认为饱和模式,且不建议使用 INF/NAN 模式;Atlas A2 训练系列产品,默认为 INF/NAN 模式,且不建议使用饱和模式。 +溢出数据采集功能在昇腾 NPU 上支持饱和模式(仅支持 Atlas 训练系列产品)和 INF/NAN 模式。 + +INF/NAN 模式遵循 IEEE 754 标准,根据定义输出 INF/NAN 的计算结果。与之对应的饱和模式在计算出现溢出时,饱和为浮点数极值(+-MAX)。对于 CANN 侧配置,Atlas 训练系列产品,默认为饱和模式,且不支持使用 INF/NAN 模式;Atlas A2 训练系列产品,默认为 INF/NAN 模式,且不建议使用饱和模式。 INF/NAN 模式的使能方式如下: @@ -53,7 +55,7 @@ export INF_NAN_MODE_ENABLE=1 2. 
执行溢出 API 解析操作。 ```bash - msprobe -f pytorch run_overflow_check -api_info ./dump.json + msprobe -f pytorch run_overflow_check -api_info ./dump_path/step{step_number}/rank{rank_number}/dump.json ``` | 参数名称 | 说明 | 是否必选 | diff --git a/debug/accuracy_tools/msprobe/docs/13.overflow_check_MindSpore.md b/debug/accuracy_tools/msprobe/docs/13.overflow_check_MindSpore.md index 35d2130a93787aab38b7ca839d72d175cb036f35..ef83aa17237d1cc56b8a67bf4b3ec9f57647fb9c 100644 --- a/debug/accuracy_tools/msprobe/docs/13.overflow_check_MindSpore.md +++ b/debug/accuracy_tools/msprobe/docs/13.overflow_check_MindSpore.md @@ -1,8 +1,8 @@ # MindSpore 场景的溢出检测 -msprobe 工具提供静态图O2编译等级下的过程溢出检测与动态图场景下的结果溢出检测。其中前者检测对象为 kernel 级别,对应 config.json 配置中的 "L2" level,后者检测对象为 API 级别(支持的API类型为ops、Tensor、mint和mint.nn.functional,不支持Primitive和Jit类API),对应 config.json 配置中的 "L1" level。 +msprobe 工具提供静态图O2编译等级下与动态图场景下的溢出检测功能。其中前者检测对象为 **kernel** 级别,对应 config.json 配置中的 **"L2"** level,后者检测对象为 **API** 级别(支持的API类型为ops、Tensor、mint和mint.nn.functional,不支持Primitive和Jit类API)或 **cell** 级别,分别对应 config.json 配置中的 **"L1"** 、**"L0"** level。 -需要注意,动态图场景下的溢出检测功能仅支持 INF/NAN 模式a。INF/NAN 模式的使能方式如下: +需要注意,本工具仅支持在 INF/NAN 模式a下进行溢出检测。INF/NAN 模式的使能方式如下: ```Shell # 使能 CANN 侧 INF/NAN 模式 @@ -11,21 +11,21 @@ export INF_NAN_MODE_ENABLE=1 export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" ``` -**a**:INF/NAN 模式遵循 IEEE 754 标准,根据定义输出 INF/NAN 的计算结果。与之对应的饱和模式在计算出现溢出时,饱和为浮点数极值(+-MAX)。对于 CANN 侧配置,Atlas 训练系列产品,默认为饱和模式,且不建议使用 INF/NAN 模式;Atlas A2训练系列产品,默认为 INF/NAN 模式,且不建议使用饱和模式。对于 MindSpore 框架侧配置,仅支持对 Atlas A2 训练系列产品进行设置,默认为 INF/NAN 模式。CANN 侧 与 MindSpore 框架侧配置须一致。 +**a**:在处理浮点数计算溢出问题时,NPU 当前支持两种溢出模式:INF/NAN 模式与饱和模式。INF/NAN 模式遵循 IEEE 754 标准,根据定义输出 INF/NAN 的计算结果。与之对应的饱和模式在计算出现溢出时,饱和为浮点数极值(+-MAX)。对于 CANN 侧配置,Atlas 训练系列产品,默认为饱和模式,且不支持使用 INF/NAN 模式;Atlas A2训练系列产品,默认为 INF/NAN 模式,且不建议使用饱和模式。对于 MindSpore 框架侧配置,仅支持对 Atlas A2 训练系列产品进行设置,默认为 INF/NAN 模式。CANN 侧 与 MindSpore 框架侧配置须一致。 溢出检测任务的配置示例见[MindSpore 静态图场景下 task 配置为 
overflow_check](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/03.config_examples.md#23-task-%E9%85%8D%E7%BD%AE%E4%B8%BA-overflow_check)、[MindSpore 动态图场景下 task 配置为 overflow_check](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/03.config_examples.md#33-task-%E9%85%8D%E7%BD%AE%E4%B8%BA-overflow_check)。 ## 1 接口介绍 -溢出检测功能提供的接口与数据采集任务一致,详见[MindSpore 场景的精度数据采集](./06.data_dump_MindSpore.md)中的"**1 接口介绍**"章节。 +溢出检测功能提供的接口与数据采集任务一致,详见MindSpore 场景的精度数据采集中的["**1 接口介绍**"](./06.data_dump_MindSpore.md#1-接口介绍)章节。 需要注意,目前暂不支持动态图 "L1" level 下 primitive op 的溢出检测。 ## 2 示例代码 -溢出检测功能使用方式与数据采集任务一致,详见[MindSpore 场景的精度数据采集](./06.data_dump_MindSpore.md)中的"**2 示例代码**"章节。 +溢出检测功能使用方式与数据采集任务一致,详见MindSpore 场景的精度数据采集中的["**2 示例代码**"](./06.data_dump_MindSpore.md#2-示例代码)节。 ## 3 溢出检测结果文件介绍 -溢出检测结果文件目录结构与含义与数据采集任务一致,详见[MindSpore 场景的精度数据采集](./06.data_dump_MindSpore.md)中的"**3 dump 结果文件介绍**"章节。 +溢出检测结果文件目录结构与含义与数据采集任务一致,但仅保存溢出 API 或 kernel 的真实数据或统计信息。详见MindSpore 场景的精度数据采集中的["**3 dump 结果文件介绍**"](./06.data_dump_MindSpore.md#3-dump-结果文件介绍)章节。 diff --git a/debug/accuracy_tools/msprobe/docs/15.free_benchmarking_PyTorch.md b/debug/accuracy_tools/msprobe/docs/15.free_benchmarking_PyTorch.md index bd345813400d31eacb9e425dd4c4ca9837a79e0f..1525b1aa6cb1da6ab6b2ef261a66b11270a77892 100644 --- a/debug/accuracy_tools/msprobe/docs/15.free_benchmarking_PyTorch.md +++ b/debug/accuracy_tools/msprobe/docs/15.free_benchmarking_PyTorch.md @@ -65,8 +65,6 @@ D-->config.json配置 "rank": [], "step": [], "level": "L1", - "seed": 1234, - "is_deterministic": false, "free_benchmark": { "scope": [], @@ -99,15 +97,16 @@ D-->config.json配置 参数是否必选可配置项适用场景 pert_mode否"improve_precision" (默认)(常用)(可做验证) 插桩算子可能在低精度下有精度问题,扰动因子会将输入的低精度向量升精度。 "bit_noise"(常用)插桩算子可能在轻微扰动下暴露精度问题,扰动因子会将输入向量最后一个比特位翻转。 - "add_noise"插桩算子可能在轻微扰动下暴露精度问题,扰动因子会为输入向量增加一个极小。 - "change_value"插桩算子可能存在大数吃小数问题,扰动因子会交换输入向量的首尾。 - "no_change"插桩算子可能存在数值稳定性精度问题,扰动因子会复制原始输。 + "add_noise"插桩算子可能在轻微扰动下暴露精度问题,扰动因子会为输入向量增加一个极小值。 + 
"change_value"插桩算子可能存在大数吃小数问题,扰动因子会交换输入向量的首尾值。 + "no_change"插桩算子可能存在数值稳定性精度问题,扰动因子会复制原始输入。 "to_cpu"(可做验证) 插桩算子可能在同 CPU 精度表现不一致,扰动因子会将输入转至 CPU,需要配合 fuzz_device="cpu"使用。 "auto_fix"(专做修复) 已有怀疑算子,实现自动恢复,检测前向中Nan/inf/全0问题,按照缩放->切高精度->Synchronize->Contiguous->引导tocpu的顺序进行排查替换,快速恢复。 fuzz_device否"npu" (默认)pert_mode 不需要to cpu操作。 "cpu"pert_mode 须配置为"to_cpu",目前仅支持"to cpu"扰动因子。 + #### 3.2.3 选择处理方式 diff --git a/debug/accuracy_tools/msprobe/docs/16.free_benchmarking_MindSpore.md b/debug/accuracy_tools/msprobe/docs/16.free_benchmarking_MindSpore.md index a1c9c49f7d0bb85dbdff8b79327cb85721c86c0f..57f1fad985b7c8cd623288759600135fd29c2fc0 100644 --- a/debug/accuracy_tools/msprobe/docs/16.free_benchmarking_MindSpore.md +++ b/debug/accuracy_tools/msprobe/docs/16.free_benchmarking_MindSpore.md @@ -13,6 +13,7 @@ * **验证低精度可疑 API**,确认升精度后是否对模型 Loss 有影响。 * 该工具的约束 * 仅支持 MindSpore 动态图场景。支持的 API 类型为 ops、Tensor、mint 和 mint.nn.functional 类的非 inplace 计算 API,不支持 Primitive 和 Jit 类 API。 + * 仅支持 输入输出向量为浮点数类型(BF16、FP16、FP32、FP64)的 API 比对。 * 建议配置白名单(设置 list),控制对少量 API 进行无标杆比对。比对 API 越多,性能和显存损耗越大。 ## 2 工具实现原理 @@ -79,40 +80,57 @@ D-->config.json配置 用户需根据自己的使用场景,对照[工具实现原理](#2-工具实现原理)中几个关键步骤进行配置 #### 3.2.1 确定比对范围 -| 相关参数名 | 是否必选 | 可配置项 | 适用场景 | -| ---------- | -------- | ----------------- | ------------------------------------------------------------------------------------------- | -| list | 可选 | | 需要通过指定 API 名来限制比对API个数 如:\["mindspore.ops.bmm"\] 会只对mindspore.ops.bmm API进行比对| -| fuzz_stage | 可选 | "forward"(默认) | 需要进行 API **前向**计算的精度问题排查或验证| +
+ + + + +
参数是否必选可配置项适用场景
list自定义需要通过指定 API 名来限制比对API个数 如:["mindspore.ops.bmm"] 会只对mindspore.ops.bmm API进行比对。
fuzz_stage"forward"(默认)需要进行 API 前向计算的精度问题排查或验证。
"backward"需要进行 API 反向计算的精度问题排查,不支持反向验证(前向验证包括反向)。
-#### 3.2.2. 选择扰动因子 +#### 3.2.2 选择扰动因子 -| 相关参数 | 是否必选 | 可配置项 | 适用场景 | -| ----------- | -------- | ---------------------------- | -------------------------------------------------------------------------------------------------------- | -| pert_mode | 可选 | "improve_precision" (默认) | (常用)(可做验证) API 可能在**低精度**下有精度问题,扰动因子会将输入的低精度向量升精度 | -| | | "add_noise" | API 可能在**轻微扰动**下暴露精度问题,扰动因子会为输入向量增加一个极小值 | -| | | "bit_noise" | API 可能在**轻微扰动**下暴露精度问题,扰动因子会翻转输入向量的最后一个比特位。不支持BF16类型向量 | -| | | "no_change" | API 可能存在**数值稳定性**精度问题,扰动因子会复制原始输入| -| | | "change_value" | API 可能存在**大数吃小数**问题,扰动因子会交换输入向量的首尾值 | + + + + + + + +
参数是否必选可配置项适用场景
pert_mode"improve_precision" (默认)(常用)(可做验证) API 可能在低精度下有精度问题,扰动因子会将输入的低精度向量升精度。
"add_noise"API 可能在轻微扰动下暴露精度问题,扰动因子会为输入向量增加一个极小值。
"bit_noise"API 可能在轻微扰动下暴露精度问题,扰动因子会翻转输入向量的最后一个比特位。不支持BF16类型向量。
"no_change"API 可能存在数值稳定性精度问题,扰动因子会复制原始输入。
"change_value"API 可能存在大数吃小数问题,扰动因子会交换输入向量的首尾值。
-#### 3.2.3. 选择处理方式 -| 相关参数名 | 是否必选 | 可配置项 | 适用场景 | -| ------------ | -------- | ----------------- | ----------------------------------------------------------------------------------------- | -| handler_type | 可选 | "check" (默认) | 要做精度问题 API 排查,输出扰动前后不符合精度标准的 API,支持所有扰动因子 | -| | | "fix" | 要做可疑 API 验证,用扰动后输出替换原始输出,仅支持 "improve_precision" 扰动因子 | +#### 3.2.3 选择处理方式 + + + + + +
参数是否必选可配置项适用场景
handler_type"check"(默认)要做精度问题 API 排查,输出扰动前后不符合精度标准的 API,支持所有扰动因子。
"fix"要做可疑 API 验证,用扰动后输出替换原始输出,仅支持 "improve_precision" 扰动因子。
### 3.3 在模型脚本中开启工具 通过 PrecisionDebugger 统一接口开启工具,示例如下: ```python -from msprobe.mindspore import PrecisionDebugger +import mindspore as ms +ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend") + +# 其他模块的导入 +# ... +from msprobe.mindspore import PrecisionDebugger debugger = PrecisionDebugger(config_path='./config.json') -... -debugger.start() # 一般在训练循环开头启动工具 -... # 循环体 -debugger.stop() # 一般在训练循环末尾结束工具 -debugger.step() # 在训练循环的最后需要重置工具,非循环场景不需要 + +# 模型、损失函数的定义以及初始化等操作 +# ... +model = Network() + +# 数据集迭代的地方往往是模型开始训练的地方 +for step, (input_ids, input_position, attention_mask) in enumerate(dataset.create_tuple_iterator()): + debugger.start() # 一般在训练循环开头启动工具 + # 单步训练逻辑 + # ... + debugger.stop() # 一般在训练循环末尾结束工具 + debugger.step() # 在训练循环的最后需要重置工具,非循环场景不需要 ``` ### 3.4 查看精度风险算子 @@ -121,20 +139,21 @@ check 模式下,若存在不符合精度标准的 API,则工具会在 dump_p | 字段 | 说明 | | ------------ | ---------------------------------------------------------------------------------------- | -| rank | Rank ID,int 类型 | -| pert_mode | 扰动因子的类型,string 类型 | -| stage | 前/反向,string 类型 | -| step | 迭代数,int 类型 | -| api_name | API 名称,string 类型 | -| max_rel | 输出对比最大相对误差,float 类型 | -| dtype | 输入的 dtype,string 类型 | -| shape | 输入的 shape,tuple 类型 | -| output_index | 如果输出为列表或元组,其中一个元素检测不一致,则会有该元素的 index,否则为空,int 类型 | +| rank | Rank ID,int 类型。 | +| pert_mode | 扰动因子的类型,string 类型。 | +| stage | 前/反向,string 类型。 | +| step | 迭代数,int 类型。 | +| api_name | API 名称,string 类型。 | +| max_rel | 输出对比最大相对误差,float 类型。 | +| dtype | 输入的 dtype,string 类型。 | +| shape | 输入的 shape,tuple 类型。 | +| output_index | 如果输出为列表或元组,其中一个元素检测不一致,则会有该元素的 index,否则为空,int 类型。 | 无标杆比对使用的精度标准如下: -| 输出dtype | 相对误差阈值 | -| ----------------- | ------------ | -| mindspore.float16 | 0.002 | -| mindspore.float32 | 0.0002 | -| 其他 | 0.0002 | +| 输出dtype | 相对误差阈值 | +| ------------------ | ------------ | +| mindspore.bfloat16 | 0.004 | +| mindspore.float16 | 0.002 | +| mindspore.float32 | 0.0002 | +| mindspore.float64 | 0.0002 | diff --git 
a/debug/accuracy_tools/msprobe/docs/17.grad_probe.md b/debug/accuracy_tools/msprobe/docs/17.grad_probe.md index 98c6a2ba7eda278388451cf46ff26afd4aa4b2c5..f210088013415e40167f3eea3aab6163b0c947dc 100644 --- a/debug/accuracy_tools/msprobe/docs/17.grad_probe.md +++ b/debug/accuracy_tools/msprobe/docs/17.grad_probe.md @@ -5,11 +5,11 @@ - 将模型权重的梯度数据导出。这种功能可以将模型权重的梯度值以统计量的形式采集出来,用以分析问题。 - 将两份梯度数据进行相似度对比。在有标杆问题中,可以确认训练过程中精度问题出现的step,以及抓取反向过程中的问题。 -工具支持PyTorch版本:2.0/2.1/2.2;支持MindSpore版本:r2.3。 +工具支持PyTorch版本:2.0/2.1/2.2;支持MindSpore版本:r2.3。暂不支持deepspeed的zero1、zero2、zero3。 ## 工具特性 -- 使用便捷,无需在训练流程里插入代码 +- 使用便捷,仅需在训练流程里插入少量代码 - 可以精准定位问题出现的step ## 使用方式 @@ -39,8 +39,8 @@ |--------------------------------|-----------------------------------|-----------------|----------| | task | 填为"grad_probe"。 | str | 是 | | dump_path | 输出目录。如果不存在就会创建一个新目录。 | str | 是 | - | rank | rank id列表,在多卡场景下,表示需要导出梯度数据的进程的rank id。列表为空就表示导出所有rank的数据。默认为空。(MindSpore静态图模式下,当前暂不支持指定rank功能) | List[int] | 否 | - | step | step列表,表示需要导出数据的step列表。列表为空就表示导出所有step的数据。默认为空。(MindSpore静态图模式下,当前暂不支持指定step功能) | List[int] | 否 | + | rank | rank id列表,在多卡场景下,表示需要导出梯度数据的进程的rank id。列表为空就表示导出所有rank的数据。默认为空。采集特定 rank 时,须指定为训练脚本中存在的 rank_id,可逐个配置,也可以指定范围。
**配置示例**:"rank": [0, 1 , 2, "4-6"]。(MindSpore静态图模式下,当前暂不支持指定rank功能) | list[Union[int, str]] | 否 | + | step | step列表,表示需要导出数据的step列表。列表为空就表示导出所有step的数据。默认为空。采集特定 step 时,须指定为训练脚本中存在的 step,可逐个配置,也可以指定范围。
**配置示例**:"step": [0, 1 , 2, "4-6"]。(MindSpore静态图模式下,当前暂不支持指定step功能) | list[Union[int, str]] | 否 | | grad_level | 输出级别。决定导出数据的详细程度,级别越大导出数据越详细。可取值:L0, L1, L2。默认L1。|str | 否 | | param_list | 权重名称列表,表示需要监控的权重。列表为空就表示监控所有权重。默认为空。 | List[str] | 否 | | bounds | 区间列表,用来划分区间以统计数值的分布。需要保证由数据小到大排列,并且列表中的元素需要在int64取值范围内。可以使用默认值[-1, 0, 1]。 | List[float, int] | 否 | @@ -100,11 +100,10 @@ debugger.stop() │ ├── step{step} │ │ ├── {param_name}.npy ``` -+ {timestamp}:梯度工具导出数据的时候会在output_path下生成一个时间戳目录,然后在这个时间戳目录下输出结果。 + rank_{rank_id}:在分布式场景下,会记录卡的rank_id。非分布式场景下,如果是CPU则记录进程号,如果是CPU或GPU则记录卡号 + grad_summary_{step}.csv:会分step记录每一步的梯度数据统计值。 + step_{step}:这个目录下会存放该step的梯度的方向数据。 -+ {param_name}.pt(npy):模型参数的梯度方向数据,PyTorch保存的是pt文件,MindSpore是npy文件。 ++ {param_name}.npy:模型参数的梯度方向数据。 **grad_summary_{step}.csv** diff --git a/debug/accuracy_tools/msprobe/docs/19.monitor.md b/debug/accuracy_tools/msprobe/docs/19.monitor.md index 8b5155a433a68bd96aa9cc2b9f770d956e6ce3ab..1c197ba5496378130d8d04b6f847ee2f35c3e946 100644 --- a/debug/accuracy_tools/msprobe/docs/19.monitor.md +++ b/debug/accuracy_tools/msprobe/docs/19.monitor.md @@ -1,193 +1,558 @@ -# Monitor模型训练状态监控工具 +# Monitor 训练状态轻量化监控工具 ## 简介 -本项目开发了一个模型训练状态监控工具,能够收集和聚合模型训练过程中的网络层,优化器, 通信算子的中间值,帮助诊断模型训练过程中计算, 通信,优化器各部分出现的异常情况。 +训练状态轻量化监控工具,能够在较低性能损耗下收集和记录模型训练过程中的激活值、权重梯度、优化器状态和通信算子的中间值,实时呈现训练状态。 ## 安装 +参见[msprobe安装](./01.installation.md)。 + +要求: + +- PyTorch场景:torch不低于**2.0** +- MindSpore场景:mindspore不低于**2.4.10**,仅支持**MindSpore动态图**,暂不支持**msadapter**套件 + +## 功能介绍 +下表中字段为训练状态轻量化监控工具的完整功能点: + +| 功能 | 说明 | 支持场景 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | ----------------- | +| [权重监控](#权重监控) | 开启权重监控 | PyTorch、MindSpore | +| [权重梯度监控](#权重梯度监控) | 开启权重梯度监控 | PyTorch、MindSpore | +| [激活值监控](#激活值监控) | 开启激活值监控 | PyTorch、MindSpore | +| [优化器状态监控](#优化器状态监控) | 开启优化器状态监控 | PyTorch、MindSpore | +| [指定监控对象](#指定监控对象) | 指定监控的nn.Module(nn.Cell)及对应的输入输出 | PyTorch、MindSpore | +| 
[打印模型结构](#打印模型结构) | 打印模型结构 | PyTorch | +| [Module全量监控](#Module全量监控) | 对全量module的输入输出做监控 | PyTorch、MindSpore | +| [Parameter全量监控](#Parameter全量监控) | 对全量Parameter的输入输出做监控 | PyTorch、MindSpore | +| [输出格式和统计量](#输出格式和统计量) | format PyTorch支持`csv`、`tensorboard`和`api`,MindSpore仅支持`csv`,`ops`均支持,`ndigits`仅PyTorch支持 | PyTorch、MindSpore | +| [梯度异常时序判断](#梯度异常时序判断) | 梯度异常时自动梯度落盘 | PyTorch | +| [csv格式数据转tensorboard可视化显示](#csv格式数据转tensorboard可视化显示) | 将csv转为tensorboard文件显示 | PyTorch | +| [动态启停](#动态启停) | 训练过程中动态修改配置开启监控 | PyTorch、MindSpore | +| [功能重载](#功能重载) | 训练中开启激活值监控。待废弃,请使用动态启停功能代替。 | PyTorch | + +## 快速上手 +根据需求监控相应对象。比如在loss上扬,grad norm正常的异常训练过程中,优先考虑监控模型前向过程;在grad norm异常的训练过程中,监控权重和激活值的梯度。 +推荐使用方式:权重梯度的监控性能损耗小(20B dense模型全量权重梯度监控,时间增加<1%,内存增加<1%),可以长期开启。激活值监控性能损耗大,在必要时开启或者仅监控部分。 + +### 工具使能 +在实际训练代码中找到模型、优化器定义的位置,使能monitor工具,通过配置文件(json)控制工具行为。如下分别为Pytorch场景和MindSpore场景下的使能方式。 + +- Pytorch使能方式: +```python +# Megatron-LM(core_r0.6.0) training.py +model, optimizer, opt_param_scheduler = setup_model_and_optimizer( + model_provider, model_type) + +... +from msprobe.pytorch import TrainerMon +monitor = TrainerMon( + config_file_path="./monitor_config.json", + params_have_main_grad=True, # 权重是否使用main_grad,通常megatron为True,deepspeed为False。默认为True。 +) +# 挂载监控对象 +monitor.set_monitor( + model, + grad_acc_steps=args.global_batch_size//args.data_parallel_size//args.micro_batch_size, + optimizer=optimizer, + dp_group=None, + tp_group=None, + start_iteration=0 # 断点续训时提供当前iteration,默认从0开始 +) +``` -### 1. 三方依赖 +*注意*:补充deepspeed下常用框架的使能位置。 -| 依赖软件 | -|-------------| -| torch>=2.0 | -| torch_npu | -| torchvision | -| tensorboard | -| matplotlib | -| sqlalchemy | -| pymysql | +deepspeed与accelerate、transformers同时使用时,optimizer传值方式为`optimizer=optimizer.optimizer`,若未使用deepspeed,单独使用accelerate、transformers,optimizer传值方式为`optimizer=optimizer`。 -### 2. 
安装 Monitor +1) 同时使用deepspeed和accelerate时,工具使能位置参考如下: -参考[msprobe 安装](./01.installation.md) +```python +model, optimizer, trainloader, evalloader, schedular = accelerator.prepare(...) +... +monitor = TrainerMon(...) +monitor.set_monitor(....optimizer=optimizer.optimizer) +``` -# 快速上手 +2. 同时使用deepspeed和transformers时,工具使能位置参考如下: - 下面以Ascend/ModelLink训练框架为例,给出Monitor工具的使用方法。 +```python +# src/transformers/trainer.py +class Trainer: + def _inner_training_loop: + ... + monitor = TrainerMon(...) + monitor.set_monitor(....optimizer=self.optimizer.optimizer) + + for epoch in range(epochs_trained, num_train_epochs): + ... +``` + +- MindSpore使能方式: +```python +... +from msprobe.mindspore import TrainerMon +monitor = TrainerMon( + config_file_path="./monitor_config.json", + process_group=None, + params_have_main_grad=True, # 权重是否使用main_grad,通常megatron为True,deepspeed为False。默认为True。 +) +# 挂载监控对象 +monitor.set_monitor( + model, + grad_acc_steps=args.global_batch_size//args.data_parallel_size//args.micro_batch_size, + optimizer=optimizer, + dp_group=None, + tp_group=None +) +``` -1. 
在ModelLink的根目录,创建json配置文件,如llama2_config.json,内容如下: +### 权重监控 +- 工具配置示例: ```json { - "targets": { - "language_model.encoder.layers.0": {"input": "tuple[2]:0", "output": "tensor", "input_grad":"tuple[2]:0", "output_grad":"tuple[1]:0"} + "targets": { + }, + "param_distribution": true, + "format": "csv", + "ops": ["norm", "min", "max", "nans"] +} +``` +`targets`中指定module包含的所有权重都会被监控。`targets`为空时,默认监控全部module。 +设置`param_distribution`为true,表示开启权重监控功能,默认值为false。 + +### 权重梯度监控 +- 工具配置示例: +```json +{ + "targets": { }, - "print_struct": false, - "module_ranks": [1,2,3,4], - "ur_distribution": true, - "xy_distribution": true, - "mv_distribution": true, "wg_distribution": true, - "cc_distribution": {"enable":true, "cc_codeline":[]}, - "alert": { - "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}], - "inform": {"recipient": "database", "connection_str": "mysql+pymysql://username:password@host:port/database"} + "format": "csv", + "ops": ["norm", "min", "max", "nans"] +} +``` +`targets`中指定module包含的所有权重都会被监控。`targets`为空时,默认监控全部module。 +设置`wg_distribution`(weight grad, noted as `wg`) 为true,表示开启权重梯度监控功能,默认值为false。 + +### 激活值监控 + +- 工具配置 +```json +{ + "targets": { }, - "ops": ["min", "max", "norm", "zeros", "id"], - "eps": 1e-8 + "xy_distribution": true, + "forward_only": false, + "backward_only": false, + "all_xy": true, + "format": "csv", + "ops": ["norm", "min", "max", "nans"] } ``` +`all_xy`为true表示监控全量module激活值,若需要对指定模块设置监控对象,在`targets`中进行配置,配置方式参考 [指定监控对象](#指定监控对象) 。 -每个要监控的module都有自己特定的输入输出格式(依赖于模型实现),所以我们需要指定前向输入输出格式和反向计算时输入张量的梯度和输出张量的梯度格式。 如果不清楚的话可以将"targets"填为空("targets":{}),然后将 "print_struct" 字段设置为 true, 之后工具会打印详细的模型结构。 我们也会随时更新更多常用module的格式规范。 +设置`xy_distribution`为true表示开启激活值监控功能,默认值为false。 -下面详细解释各个字段: +注意:`forward_only`和`backward_only`均为true时,触发warning,前反向均不采集;默认值均为false时,前反向均采集。 -| 字段| 说明 | 是否必选 | -|-------------------|---------------------------------------------|------| -| "targets" | 
指定需要监控的大模型层,例如transformer的第0层language_model.encoder.layers.0。如果不清楚模型结构,可以将"targets"填为空("targets":{}),然后将"print_struct"字段设置为true,之后监控工具会打印模型中torchmodule的名字和详细结构,并在第1个step后退出,你可以从中选择你关心的module。| 是 | -| "input" | "tuple[2]:0"的意思是目标module的前向input参数为长度为2的tuple,我们关心的是tuple第0个元素。| 否 | -| "output"| "tensor"的意思是目标module的前向output参数类型为tensor | 是 | -| "input_grad"| "tuple[2]:0"的意思是目标module的后向input_grad参数是长度为2的tuple,我们关心的是tuple的第0个元素。| 否 | -| "output_grad" | "tuple[1]:0"的意思是目标module的后向input_grad参数是长度为1的tuple,我们关心的是tuple的第0个元素。| 是 | -| "print_struct"| 设置为true后监控工具会打印模型中torchmodule的名字和详细结构,并在第1个step后退出。不填默认为false。 | 否 | -| "module_ranks"| 用于在分布式训练场景中希望控制在哪些rank开启module监控。如果不填,则默认在所有rank开启。| 否 | -| "ur_distribution" | 若为true则会统计adam优化器指定模块(targets中指定)参数的update和ratio向量的数值分布,并展示在heatmap里,默认为false。依赖histc算子,需要CANN8.0.rc2以上版本,否则会有严重的性能问题。 | 否 | -| "xy_distribution" | 若为true则会监控指定module(targets中指定)的输入输出张量。默认为false。| 否 | -| "forward_only"| 若为true,则在开启xy_distribution时,仅监控"input"和"output",忽略"input_grad"和"output_grad"。默认为false,同时监控激活值及其梯度。 | 否 | -| "mv_distribution" | 若为true则会监控指定模块中的参数的优化器状态,默认为false。需要在TrainerMon构造函数正确指定opt_ty.目前只支持megatron的混合精度优化器以及megatron的分布式优化器。Deepspeed的分布式优化器实现暂不支持。 | 否 | -| "wg_distribution" | 若为true则会监控指定模块的参数梯度,默认为false。| 否 | -| "mg_distribution" | 若为true则会监控指定模块梯度与adam一阶动量的方向一致性,默认为false。| 否 | -| "alert" | ·"rules":指定自动报警的异常检测机制及其相应的阈值。目前实现的异常检测是AnomalyTurbulence。如果统计标量超出历史均值的指定浮动范围(threshold指定,0.5意味着上浮或者下浮50%)则在控制台打印报警信息。
·"inform":自动报警需要的配置,若想关闭自动报警删掉inform的配置即可。其中"recipient"指定自动报警的通知方式,可选值为"database"或"email",默认为"database"。
-若"recipient"为"database",则需要指定"connection_str"字段,即数据库的连接URL,默认为{"recipient":"database","connection_str":"mysql+pymysql://username:password@host:port/database"},若有特殊字符需要转义。
-若"recipient"为"email",则需要指定"send_email_address"-发送方邮箱地址,"receive_email_address"-接收方邮箱地址,"send_email_username"-发送方邮箱用户名,"send_email_password"-发送方邮箱密码,"smtp_server"-发送方邮箱对应的SMTP服务器,"smtp_port"-发送方邮箱对应的SMTP端口号。默认为:
{"recipient":"email",send_email_address":"sender@huawei.com","receive_email_address":"receiver@huawei.com","send_email_username":"username","send_email_password":"******","smtp_server":"smtpscn.huawei.com","smtp_port":"587"} | 否 | -| "cc_distribution" | 其中"enable"字段控制通信监控模块的开关;需要监控通信算子时,务必尽量早地实例化`TrainerMon`,因为监控通过劫持原始func后挂hook实现,部分加速库初始化时会保存原始function,避免监控失效。"cc_codeline"字段指定监控的代码行,如:`train.py\\[23\\]`,默认为空列表,不特别指定;"cc_pre_hook"字段控制是否监控通信前的数据;模块会在第二个optimize.step之前打印通信日志,包括通信api的调用栈、输入dtype、通信group。"cc_log_only"为true时,仅打印日志,不监控通信的输入输出,并在打印后中断训练。可以根据通信日志设置"cc_codeline",规避与训练过程不相关的通信,比如一些时间、metrics的同步。| 否 | -| "ops" | 与ur_distribution、xy_distribution、mv_distribution、wg_distribution、mg_direction、cc_distribution配合,监控所选张量的min、max、norm、zeros值。其中,zeros代表监控所选张量的元素小于eps的比例,id代表监控所选的非张量本身,默认为[]。 | 否 | -| "eps" | 若ops里包含"zeros"则需要配置,默认为1e-8。 | 否 | - -下面给出transformer架构模型中常见的module的前向计算的输入输出和反向计算输入张量的梯度和输出张量的梯度格式,以供参考: - -| module | input | output | input_grad | output_grad | -| ------------------------------------------------------------ | -------- | -------- | ---------- | ----------- | -| language_model.embedding.word_embeddings | tuple[1] | tensor | tuple[1] | tuple[1] | -| language_model.embedding.embedding_dropout | tuple[1] | tensor | tuple[1] | tuple[1] | -| language_model.encoder.layers.0 | tuple[2] | tensor | tuple[2] | tuple[1] | -| language_model.encoder.layers.0.input_norm | tuple[1] | tensor | tuple[1] | tuple[1] | -| language_model.encoder.layers.0.self_attention | tuple[2] | tuple[2] | tuple[2] | tuple[2] | -| language_model.encoder.layers.0.self_attention.query_key_value | tuple[1] | tuple[2] | tuple[1] | tuple[2] | -| language_model.encoder.layers.2.self_attention.core_attention_flash | tuple[3] | tensor | tuple[3] | tuple[1] | -| language_model.encoder.layers.0.self_attention.dense | tuple[1] | tuple[2] | tuple[1] | tuple[2] | -| language_model.encoder.layers.0.post_attention_norm | tuple[1] | tensor | tuple[1] | tuple[1] | -| 
language_model.encoder.layers.0.mlp | tuple[1] | tuple[2] | tuple[1] | tuple[2] | -| language_model.encoder.final_norm | tuple[1] | tensor | tuple[1] | tuple[1] | - -对于language_model.embedding.word_embeddings这类输入层,我们不关心输入的情况下,可以不填"input"和"input_grad",监控的状态中不会包含输入的相关信息。config文件示例如下: +### 优化器状态监控 +- 工具配置示例: ```json { - "targets": { - "language_model.embedding.word_embeddings": {"output": "tensor","output_grad":"tuple[1]:0"} - } + "targets": { + }, + "mv_distribution": true, + "format": "csv", + "ops": ["norm", "min", "max", "nans"] } ``` +`targets`中指定module包含的所有权重都会被监控。`targets`为空时,默认监控全部module。 +设置`mv_distribution`为true表示开启优化监控功能(1st moment noted as `m`, 2nd moment noted as `v`),默认值为false。[什么是mv](https://arxiv.org/pdf/1412.6980) + +本工具针对分布式计算框架megatron和deepspeed框架做了适配,暂不支持其他框架。 + + +## 高阶功能 + +### 指定监控对象 + +工具支持对nn.Module(**激活值监控**)和nn.Parameter(**权重监控**、**权重梯度监控、优化器监控**)对象实现相应的监控行为,在配置文件的"targets"(dict)字段指定,targets格式为{module_name/param_name: {filed: format}}。 + +#### 打印模型结构 +工具提供可选项`print_struct`打印模型结构,帮助配置targets。工具会在在第一个step后打印结构并停止训练进程,模型结构默认打印在`$MONITOR_OUTPUT_DIR/module_struct.json`。 +```json +{ + "print_struct": true +} +``` + +输出样例: +字段`config`用于配置文件中指定module target。其余为各个元素的shape和dtype。 + +```json +"0:63.mlp.linear_fc2": { + "input": { + "config": "tuple[1]", + "0": "size=(4096, 4, 1024), dtype=torch.bfloat16" + }, + "output": { + "config": "tuple[2]", + "0": "size=(2048, 4, 512), dtype=torch.bfloat16", + "1": "size=(512,), dtype=torch.bfloat16" + }, + "input_grad": { + "config": "tuple[1]", + "0": "size=(4096, 4, 1024), dtype=torch.bfloat16" + }, + "output_grad": { + "config": "tuple[2]", + "0": "size=(2048, 4, 512), dtype=torch.bfloat16", + "1": "size=(512,), dtype=torch.bfloat16" + } +}, +``` -2. 
在训练器中加入代码,开启Monitor训练监控。 +- Module + 对于module对象,通常关心其前向的输入(input)输出(output)和反向的输入--前向输出的梯度(output_grad)和输出--前向输入的梯度(input_grad)。同时需要声明这些对象的类型,通常为"tensor"或"tuple\[length]"。 - 例如在ModelLink/pretrain_gpt.py的model_provider GPTModel构造后加入以下代码, **注意优化器类型opt_ty** : + "tensor"可以直接用来计算统计量,"tuple"需要进一步指定监控的索引。如"tuple[2]:0",表示该对象为长度2的tuple,对第0元素进行监控;不指定索引时,默认对第0元素进行监控。 - ```python - from msprobe.pytorch import TrainerMon - hooker = TrainerMon( - "./llama2_config.json", - params_have_main_grad=True, - opt_ty="Megatron_DistributedOptimizer" - ) # or opt_ty=Megatron_Float16OptimizerWithFloat16Params - hooker.hook_modules( - model=model, - grad_acc_steps=args.global_batch_size//args.data_parallel_size//args.micro_batch_size - ) - ``` - params_have_main_grad: 若为True则参数权重梯度为main_grad,否则为grad,默认为True。 - - 如果不是Megatron-LM的训练框架, 可以设置对应的梯度累积步数grad_acc_steps。 + module_name可以通过nn.Module的接口`named_modules()`获取。 +```json +// 示例:对一个名为"module.encoder.layers.0.mlp"的module,监控其前向输入第0元素和输出。 +{ + "targets": { + "module.encoder.layers.0.mlp": { + "input": "tuple[2]:0", + "output": "tensor" + } + } +} +``` +#### Module全量监控 +工具提供简便的全量module监控方式。或不配置targets、all_xy字段,同样表示全量监控。 - 如果要监控混合精度优化器的动量和方差, 需要在混合精度优化器构造后加入如下代码。 目前只支持Megatron_DistributedOptimizer, 使用bf16或者fp16混合精度时开启分布式优化器。 或者Megatron_Float16OptimizerWithFloat16Params, 使用bf16或者fp16混合精度选项并且不开启分布式优化器。 +```json +{ + "targets": {}, + "all_xy": true +} +``` - ```python - model, optimizer, opt_param_scheduler = setup_model_and_optimizer( - model_provider, model_type) - # 插入位置 - from msprobe.pytorch import TrainerMon - TrainerMon.set_wrapped_optimizer(optimizer) - ``` -3. 
配置tensorboard写入的目录
+- Parameter
+  对于parameter对象,通常会关注其在一个训练迭代中的梯度(weight grad)、adam类优化器中的动量(1st moment, 2nd moment)。
+  parameter归属于某一module,也可以通过指定module_name来监控包含在这一module中的**所有**parameter。

-   ```shell
-   export MONITOR_OUTPUT_DIR=/xxx/output_dir
-   ```
+  param_name可以通过nn.Module的接口`named_parameters()`获取。
+```json
+// 示例:监控"module.encoder.layers.0.mlp"的所有参数和"module.embedding.word_embedding.weight"这一参数
+{
+    "targets": {
+        "module.encoder.layers.0.mlp": {},
+        "module.embedding.word_embedding.weight": {}
+    }
+}
+```

-4. 开始预训练,在日志中如果发现以下内容, 则说明指定的模块被成功监视。
+#### Parameter全量监控
+工具提供简便的全量parameter监控方式。或不配置targets,同样表示全量监控。

-   ```txt
-   > language_model.encoder.layers.0 is monitored successfully
-   > 1 out of 1 modules are monitored.
-   ```
+```json
+{
+    "targets": {}
+}
+```

-5. 训练过程中,打开tensorboard,可以查看训练的中间状态:
+### 输出格式和统计量
+工具配置示例:
+```json
+{
+    "format": "csv",
+    "ops": ["norm", "min", "max", "mean", "nans", "zeros"],
+    "ndigits": 12
+}
+```
+#### 输出路径
+通过环境变量`MONITOR_OUTPUT_DIR`设置monitor输出路径,默认为`./monitor_output/`。
 ```shell
-tensorboard --logdir=$MONITOR_OUTPUT_DIR
+export MONITOR_OUTPUT_DIR=/xxx/output_dir
+```
+
+- 输出格式
+  通过可选配置项`format`指定,当前支持`csv`, `tensorboard`, `api`。其中`csv`为缺省值。
+
+  - **tensorboard**
+    监控结果写入tensorboard的event文件,启动tensorboard查看。
+    激活值监控任务的tag为{vpp_stage}:{module_name}.{input or output}:{micro_step}/{rank}/{task}\_{ops}
+    其他监控任务的tag为{vpp_stage}:{param_name}/{rank}/{task}\_{ops}
+    ```shell
+    tensorboard --logdir=$MONITOR_OUTPUT_DIR
+    ```
+    之后,运行以下SSH命令来建立端口转发,可以在本地通过http://localhost:6006访问tensorboard:
+    ```shell
+    ssh -N -L localhost:6006:localhost:6006 your_username@remote_server_address
+    ```
+
+  - **csv**
+    监控结果写入csv文件中,可以通过`ndigits`字段设置小数位数。
+    表头为 vpp_stage | name | step | micro_step(optional) | *ops |。
+    仅在激活值监控的输出文件中包含micro_step。
+    激活值监控的name为{module_name}.{input or output},其他任务的name为{param_name}。
+
+  - **api**
+    监控结果不落盘,在训练过程中可以通过`generate_wgrad_metrics`、`generate_xy_metrics`等接口获取,使用方式参考[公开接口](#公开接口) 。
+
+- 统计量
+通过配置项`ops`指定。当前支持`norm`, `min`, `max`, `mean`, 
`nans`,`zeros`。其中`nans`监控tensor中`nan`的数量,`zeros`统计tensor中数值小于`eps`的比例。 + +- csv输出件合并 + + 提供csv输出件合并功能,在配置json文件中设置`step_count_per_record`,表示每个csv文件存储多个step的监控数据。默认值为1,表示每个csv文件记录一个step的监控数据。 + + 如下图所示为梯度监控结果示例,配置`step_count_per_record`为5,连续监控10个step,每个csv文件记录了5个step的梯度数据。其中`grad_reduced_0-4.csv`为step0至step4共计5个step的聚合后梯度数据,`grad_unreduced_0-4.csv`为step0至step4共计5个step的聚合前梯度数据。 + + ![step_count_per_record](img/monitor/step_count_per_record.png) + +### 梯度异常时序判断 +1. 训练前配置相关参数 + +工具支持自动判断训练过程中的梯度异常,需要在配置文件中设置alert相关字段。"AnomalyTurbulence"会将当前数值与历史均值比较,如果相对偏差超过阈值,会在打屏信息中提示用户。如果打开"`dump`"选项,则会将异常梯度相关信息落盘到目录`monitor_output/anomaly_detected`,用于后续时序判断。 +```json + "alert": { + "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}], + "dump": true + }, +``` +2. 实例化工具时传入流水线并行group +```python +monitor = TrainerMon( + "./monitor_config.json", + process_group=mpu.get_pipeline_model_parallel_group(), + params_have_main_grad=True # 权重是否使用main_grad,通常megatron为True,deepspeed为False。默认为True。 +) ``` +训练过程中,检测到异常后打屏提示,并将异常信息按照rank分组写入json文件,文件路径默认为`monitor_output/anomaly_detected`,异常信息示例如下: -之后,运行以下SSH命令来建立端口转发,可以在本地通过http://localhost:6006访问tensorboard: +```json +{ + "0:1.self_attention.core_attention_flash_0/rank0/input_grad_step_1_call_112": { + "rank": 0, + "step": 1, + "micro_step": 0, + "pp_stage": 0, + "vpp_stage": 0, + "call_id": 112, + "tag_name": "0:1.self_attention.core_attention_flash_0/rank0/input_grad", + "message": "Rule AnomalyTurbulence reports anomaly signal in ('0:1.self_attention.core_attention_flash_0/rank0/input_grad', 'min') at step 1.", + "group_mates": [0, 1] + }, + ... +} +``` + +3. 
异常事件排序 + +当模型训练过程中出现较多异常数据,需要对异常事件排序。工具提供topk的异常排序能力,按照api的执行顺序进行排序,便于定界首次异常点。异常分析命令示例: ```shell -ssh -N -L localhost:6006:localhost:6006 your_username@remote_server_address +python3 -m msprobe.pytorch.monitor.anomaly_analyse -d $MONITOR_OUTPUT_DIR/anomaly_detected ``` +异常事件分析结束,将topk事件写入文件`anomaly_detected/anomaly_analyse.json`。异常分析支持以下参数配置: -# 高级用法 -TBD +| 字段名 | 解释 | 是否必选 | +| ----------------- | ------------------------------------------------------------ | -------- | +| -d 或 --data_path | 指定梯度异常落盘文件夹,梯度监控功能输出,一般为$MONITOR_OUTPUT_DIR/anomaly_detected。 | 是 | +| -o 或 --out_path | 排序后的异常落盘文件地址,默认在--data_path路径下落盘一个anomaly_analyse.json文件。 | 否 | +| -k 或 --topk | 指定保留前topk个异常,默认为8。 | 否 | +| -s 或 --step_list | 指定分析的step范围,默认为[]。 | 否 | -# 公开接口 -**接口说明** +### csv格式数据转tensorboard可视化显示 + +将csv数据转换为tensorboard格式数据。 ```python -TrainerMon.__init__(config_file_path, params_have_main_grad=True, opt_ty=None) -> None +from msprobe.pytorch.monitor.csv2tb import csv2tensorboard_by_step +# 前三个参数用来指定需要转换的一批文件,指定monitor输出目录及一个时间范围,会对这个范围内的文件进行转换 +# process_num指定拉起的进程个数,默认为1,更多的进程个数可以加速转换 +# data_type_list是一个列表,指定需要转换的数据类型, 数据类型应来自输出件文件前缀,所有类型数据: +# ["actv", "actv_grad", "exp_avg", "exp_avg_sq", "grad_unreduced", "grad_reduced", "param"] +# 不指定就转换全部数据 +# output_dirpath可指定输出目录, 不传值时保存到"{curtime}_csv2tensorboard_by_step"文件夹,其中curtime为自动获取的当前时间戳 +csv2tensorboard_by_step( + monitor_path="~/monitor_output", + time_start="Dec03_21-34-40", + time_end="Dec03_21-34-42", + process_num=8, + data_type_list=["param"] +) ``` -| 参数 | 说明 | 是否必选 | -| ----- | -------------------- | -------- | -| config_file_path |自己写的json配置文件路径。 | 是 | -| params_have_main_grad |权重是否使用main_grad,是就为True,否则为False。默认为True。 | 否 | -| opt_ty |优化器类型,有两个选项,Megatron_DistributedOptimizer:使用bf16或者fp16混合精度时开启分布式优化器;Megatron_Float16OptimizerWithFloat16Params:使用bf16或者fp16混合精度选项并且不开启分布式优化器,也适用于常规的adam优化器。如果使用的不是adam优化器,使用None。默认为None。 | 否 | +### 动态启停 +动态启停模式:支持用户在训练过程中随时启动/更新监控。 + 
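+下面先给出一个最小的配置示意(仅演示`dynamic_on`与`collect_times`两个字段的配合,其中`collect_times`取10为假定值,各字段的完整含义见下文及“详细配置”章节):
+
+```json
+{
+    "dynamic_on": true,
+    "collect_times": 10
+}
+```
+
+按此配置,工具检测到配置文件更新且`dynamic_on`为true后开始采集,采集满`collect_times`次后自动停止,并将`dynamic_on`写回false。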
+用户可在训练开始前通过配置环境变量DYNAMIC_MONITOR=True来确认开启动态启停模式,该模式下需要配合config.json文件中的dynamic_on字段来使用。 + +在动态启停模式下,启动和停止分别由如下控制: + +- 启动: + 首次监控:config.json文件中dynamic_on字段为true,代表是否需要开启监控。 + 非首次监控:config文件时间戳更新且config.json文件中dynamic_on字段为true。 +- 停止: + 到达collect_times之后自动停止并改config.json文件中dynamic_on字段为false,可再通过上述操作重启。 -**接口说明** +大部分情况下,用户可在看到异常趋势后再手动更新config.json文件并打开dynamic_on开关;此外,使用时若想要在一开始就启动监控,可直接打开dynamic_on开关做基础配置的监测(首次不要求时间戳更新) +注意事项: + +- 默认监控启动皆统一在配置初始化或查询到更新后的下一步,也就是若第n步挂上hook则第n+1步才启动采集,如需采集第0步数据请用静态模式。 +- config中途修改出错时,若此时不在监控就不生效,若在监控则用原配置继续。 +- 达到collect_times之后会自动将该值置为false待下次改true重启。 + +### 功能重载 +此功能将在2026年废弃。请使用[动态启停](#动态启停)功能代替。 + +- 统计量 +可以在训练过程中修改`TrainerMon`实例的`ops`属性, 调整监控的统计量。 +```python +if {some condition}: + monitor.ops = ["min", "max"] +``` + +- 训练过程中开关激活值监控 +激活值监控的性能损耗较大, 推荐仅在必要时开启, 比如发现loss出现尖刺, 根据loss的异常开启激活值监控. ```python -TrainerMon.hook_modules(model, grad_acc_steps) -> None +if {some condition}: + monitor.reload_xy(xy_distribution=True) ``` -| 参数 | 说明 | 是否必选 | -| ----- | -------------------- | -------- | -| model |需要监控的模型,需要是一个torch.nn.Module。 | 是 | -| grad_acc_steps | 梯度累积步数。 | 是 | +## 公开接口 +- monitor工具初始化 +```python +TrainerMon.__init__(config_file_path, process_group=None, params_have_main_grad=True) -> None +``` -**接口说明** +| 参数 | 说明 | 是否必选 | +| --------------------- | ------------------------------------------------------------ | -------- | +| config_file_path | json配置文件路径。 | 是 | +| process_group | 传入ProcessGroup对象,用以确定pipeline并行不同rank异常间时序,megatron下通过core.parallel_state.get_pipeline_model_parallel_group()获得。仅在异常时序判断功能中使用。 | 否 | +| params_have_main_grad | 权重是否使用main_grad,通常megatron为True,deepspeed为False。默认为True。 | 否 | +- 模型挂载monitor工具 ```python -TrainerMon.set_wrapped_optimizer(_wrapped_optimizer) -> None +TrainerMon.set_monitor(model, grad_acc_steps, optimizer, dp_group=None, tp_group=None, start_iteration=0) -> None ``` +| 参数 | 说明 | 是否必选 | +| --------------- | ------------------------------------------------------------ | -------- | +| 
model | 需要监控的模型,必须是torch.nn.Module或mindspore.nn.Cell。 | 是 |
+| grad_acc_steps | 梯度累积步数。 | 是 |
+| optimizer | 需要patch的优化器。 | 否 |
+| dp_group | 数据并行的通信组。<br>
dp域通信后,且没有使用分布式优化器时,group内所有rank的梯度相同,落盘数据冗余。
提供dp_group后,工具仅保留每个dp_group的第一个rank的梯度。 | 否 | +| tp_group | 张量并行的通信组。
tp域通信后,group内部分参数所有rank的梯度相同,落盘数据冗余。
提供tp_group后,工具仅保留每个tp_group中冗余参数在第一个rank的梯度。
当前适配Megatron core_r0.6.0, 通过权重属性"tensor_model_parallel"判断是否冗余。 | 否 | +| start_iteration | 训练的起始iteration,影响工具计数。**仅PyTorch场景支持此参数**。 | 否 | + +- csv输出件转tensorboard输出件 +```python +csv2tensorboard_by_step(monitor_path, time_start, time_end, process_num=1, data_type_list=None) -> None +``` +| 参数 | 说明 | 是否必选 | +| -------------- | ------------------------------------------------------------ | -------- | +| monitor_path | 待转换的csv存盘目录。 | 是 | +| time_start | 起始时间戳。搭配time_end一起使用。指定一个时间范围,会对这个范围内的文件进行转换。左闭右闭的区间。 | 是 | +| time_end | 结束时间戳。搭配time_start一起使用。指定一个时间范围,会对这个范围内的文件进行转换。左闭右闭的区间。 | 是 | +| process_num | 指定拉起的进程个数,默认为1,更多的进程个数可以加速转换。 | 否 | +| data_type_list | 指定需要转换的数据类型, 数据类型应来自输出件文件前缀,所有类型数据:
["actv", "actv_grad", "exp_avg", "exp_avg_sq", "grad_unreduced", "grad_reduced", "param"]。
不指定就转换全部数据。 | 否 | + +- 在模型任意位置获取当前参数**梯度**统计量 +```python +TrainerMon.generate_wgrad_metrics() -> tuple[dict, dict] +``` +具体使用方式如下: +```python +reduced, unreduced = monitor.generate_wgrad_metrics() +``` + +- 在模型任意位置获取当前参数**激活值**统计量 +```python +TrainerMon.generate_xy_metrics() -> tuple[dict, dict] +``` +具体使用方式如下: +```python +actv, actv_grad = monitor.generate_xy_metrics() +``` + + + +## 详细配置 + +```json +{ + "targets": { + "language_model.encoder.layers.0": {"input": "tuple[2]:0", "output": "tensor", "input_grad":"tuple[2]:0", "output_grad":"tuple[1]:0"} + }, + "dynamic_on": false, + "start_step": 0, + "collect_times": 100000000, + "step_interval": 1, + "print_struct": false, + "module_ranks": [0,1,2,3], + "ur_distribution": true, + "xy_distribution": true, + "all_xy": true, + "forward_only": false, + "backward_only": false, + "mv_distribution": true, + "param_distribution": true, + "wg_distribution": true, + "cc_distribution": {"enable":true, "cc_codeline":[]}, + "alert": { + "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}], + "dump": false + }, + "format": "csv", + "ops": ["min", "max", "norm", "zeros", "nans", "mean"], + "eps": 1e-8, + "ndigits": 12, + "step_count_per_record": 1, + "append_output": [], + "squash_name": true +} +``` + +下面详细解释各个字段: -| 参数 | 说明 | 是否必选 | -| ----- | -------------------- | -------- | -| _wrapped_optimizer |megatron创建好的混合精度优化器。 | 是 | \ No newline at end of file +| 字段名字 | 是否必选 | 解释 | +| ----------------------- | -------- 
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| "targets" | 可选 | 指定需要监控的模型层和监控对象, 例如transformer的第0层language_model.encoder.layers.0,可选择监控input、output、input_grad、output_grad。如果不清楚模型结构, 可以将 "print_struct" 字段设置为 true, 监控工具会打印模型中torch module的名字和详细结构,并在第1个step后退出。未配置时默认为全量监控。 | +| "input" | 可选 | "tuple[2]:0"的意思是目标module的前向input参数为长度为2的tuple, 我们关心的是tuple第0个元素。 | +| "output" | 必选 | "tensor"的意思是目标module的前向output参数类型为tensor | +| "input_grad" | 可选 | "tuple[2]:0"的意思是目标module的后向input_grad参数是长度为2的tuple, 我们关心的是tuple的第0个元素。 | +| "output_grad" | 必选 | "tuple[1]:0"的意思是目标module的后向input_grad参数是长度为1的tuple, 我们关心的是tuple的第0个元素。 | +| "dynamic_on" | 可选 | 在动态启停时使用,true代表打开监控,false代表关闭监控,默认值为false,且达到collect_times之后会自动将该值置为false待下次改true重启。**仅PyTorch场景支持此参数**。 | +| "collect_times" | 可选 | 设置采集次数,达到该次数后停止监控,默认值为100000000,目的是一直采集。 | +| "start_step" | 可选 | 设置开始采集step,模型训练达到start_step后开始监控采集,默认值为0,表示从step0开始监控采集。 | +| "step_interval" | 可选 | 设置采集step间隔,默认值为1,表示每个step均采集监控数据。 | +| "print_struct" | 可选 | 设置为true后监控工具会打印模型中torch module的名字和详细结构,并在第1个step后退出。不填默认为false。**仅PyTorch场景支持此参数**。 | +| "module_ranks" | 可选 | 用于在分布式训练场景中希望控制在哪些rank开启module监控。如果不填,则默认在所有rank开启。 | +| "ur_distribution" | 可选 | 若为true则会统计adam优化器指定模块(targets中指定)参数的update和ratio向量的数值分布,并展示在heatmap里,默认为false,同时format字段必须设置为tensorboard。
依赖histc算子, 需要CANN8.0.rc2以上版本, 否则会有严重的性能问题。**仅PyTorch场景支持此参数**。 | +| "xy_distribution" | 可选 | 若为true则会监控指定module(targets中指定)的输入输出张量。 默认为false。 | +| "all_xy" | 可选 | 开启xy_distribution后生效,若为true,监控所有module。默认为false。
与targets同时生效,all_xy配置为true时,若targets配置module_xx和指定对象,则module_xx按targets配置生效,其他module则监控全部对象,包含input、output、input_grad、output_grad。 | +| "forward_only" | 可选 | 开启xy_distribution后生效,若为true,仅监控指定module的前向,targets中的input_grad、output_grad不生效。默认为false。 | +| "backward_only" | 可选 | 开启xy_distribution后生效,若为true,仅监控指定module的反向,targets中的input、output不生效。默认为false。 | +| "mv_distribution" | 可选 | 若为true则会监控指定模块中的参数的优化器状态, 默认为false。需要在TrainerMon构造函数正确指定opt_ty。 目前支持megatron和Deepspeed的分布式优化器。
-Megatron_DistributedOptimizer:megatron分布式优化器;
-Megatron_Float16OptimizerWithFloat16Params:megatron混合精度优化器;
-Megatron_ChainedDistributedOptimizer:megatron分布式优化器序列;
-Megatron_ChainedFloat16OptimizerWithFloat16Params:megatron混合精度优化器序列;
-DeepSpeedZeroOptimizer_Stage0:DeepSpeed Zero0;<br>
-DeepSpeedZeroOptimizer_Stage1_or_2:DeepSpeed Zero1和Zero2;
-DeepSpeedZeroOptimizer_Stage3:DeepSpeed Zero3。
未使用megatron和deepspeed框架时,opt_ty默认为None,无需传入。 |
+| "wg_distribution" | 可选 | 若为true则会监控指定模块的参数梯度, 默认为false。 |
+| "param_distribution" | 可选 | 若为true则会监控指定模块的参数, 默认为false。 |
+| "alert" | 可选 | "rules": 指定自动报警的异常检测机制及其相应的阈值。目前实现的异常检测是AnomalyTurbulence, 如果统计标量超出历史均值的指定浮动范围(threshold 0.5意味着上浮或者下浮50%)则在控制台打印报警信息。当"dump"字段配置为true时表示异常事件写入文件,默认为false。**仅PyTorch场景支持此参数**。 |
+| "cc_distribution" | 可选 | 其中"enable"字段控制通信监控模块的开关;需要监控通信算子时,务必尽量早地实例化`TrainerMon`, 因为监控通过劫持原始func后挂hook实现,部分加速库初始化时会保存原始function,避免监控失效。"cc_codeline"字段指定监控的代码行,如:`train.py\\[23\\]`,默认为空列表,不特别指定;"cc_pre_hook"字段控制是否监控通信前的数据; 模块会在第二个optimize.step之前打印通信日志,包括通信api的调用栈、输入dtype、通信group。 "cc_log_only"为true时,仅打印日志,不监控通信的输入输出,并在打印后中断训练。可以根据通信日志设置"cc_codeline",规避与训练过程不相关的通信,比如一些时间、metrics的同步。**仅PyTorch场景支持此参数**。 |
+| "format" | 可选 | 数据落盘格式,默认值为"csv",可选 \["csv", "tensorboard", "api"\]。仅PyTorch和MindSpore动态图场景支持此参数,且MindSpore动态图场景仅支持\["csv"\]。 |
+| "ops" | 可选 | 类型为list,与ur_distribution、xy_distribution、mv_distribution、wg_distribution、mg_direction、cc_distribution配合,监控所选张量的统计指标,目前支持"min"、"max"、"norm"、"mean"、"zeros"、"nans"。其中,zeros代表监控所选张量的元素小于eps的比例,nans代表张量中nan的数量。当ops中无有效指标时,默认监控norm指标。 |
+| "eps" | 可选 | 若ops里包含"zeros"则需要配置,默认为1e-8。 |
+| "ndigits" | 可选 | "format"为"csv"时,设置落盘文件中的小数位数,默认为6。**仅PyTorch场景支持此参数**。 |
+| "step_count_per_record" | 可选 | "format"为"csv"时生效,每个csv记录多少个step的数据,默认为1。 |
+| "append_output" | 可选 | 适用于断点续训场景。多卡场景下生效,指定两个时间戳,将输出续写到这两个时间戳范围间的输出件中,不在范围内的rank不被续写。时间戳应来自原有输出件目录前缀,例如["Dec03_21-34-40", "Dec03_21-34-41"]。默认为[],不续写。**仅PyTorch场景支持此参数**。 |
+| "squash_name" | 可选 | 是否简化参数名/模块名,多模态场景建议关闭,默认为True。 |
diff --git a/debug/accuracy_tools/msprobe/docs/21.visualization_PyTorch.md b/debug/accuracy_tools/msprobe/docs/21.visualization_PyTorch.md
index 0e69a0c883961dc08d492a8e2dec00661f3dc4e6..34cdc2aa99b8f6f1ab65a2b692424506f2563b56 100644
--- a/debug/accuracy_tools/msprobe/docs/21.visualization_PyTorch.md
+++ b/debug/accuracy_tools/msprobe/docs/21.visualization_PyTorch.md
@@ -1,25 +1,35 @@
-# PyTorch模型分级可视化工具
+# 
PyTorch 场景的分级可视化构图比对 分级可视化工具将msprobe工具dump的精度数据进行解析,还原模型图结构,实现模型各个层级的精度数据比对,方便用户理解模型结构、分析精度问题。 工具支持PyTorch版本:2.1/2.2 +## 展示示例 + +支持重建模型的层级结构; + +支持两个模型的结构差异比对; + +支持两个模型的精度数据比对,支持疑似有精度问题节点的快速搜索,自动跳转展开节点所在的层级。 + +![vis_show](./img/visualization/vis_showcase.png) + ## 1.依赖安装 分级可视化工具依赖**msprobe工具**和**tensorboard。** ### 1.1 安装msprobe工具 -[msprobe工具安装](https://gitee.com/louyujing/mstt_graph_wiki/blob/master/debug/accuracy_tools/msprobe/docs/01.installation.md) +[msprobe工具安装](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/01.installation.md) -### 1.2 安装tensorboard +### 1.2 安装tb_graph_ascend -[tensorboard下载](https://mindstudio-sample.obs.cn-north-4.myhuaweicloud.com/tbgraph/tensorboard-2.15.1-py3-none-any.whl) +**请安装tb_graph_ascend,否则无法解析构图结果。** -``pip3 install``即可。 +``pip3 install tb-graph-ascend``即可。 -## 2.dump模型结构数据 -[PyTorch场景的数据采集](https://gitee.com/louyujing/mstt_graph_wiki/blob/master/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md) +## 2.模型结构数据采集 +[PyTorch场景的数据采集](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md) **需要选择level为L0(module信息)或者mix(module信息+api信息),才能采集到模型结构数据,即采集结果件construct.json内容不为空**。 @@ -33,12 +43,35 @@ msprobe -f pytorch graph -i ./compare.json -o ./output ``` **命令行参数说明**: -| 参数名 | 说明 | 是否必选 | -|-------------------|------------------------------------------------------------------| -------- | -| -i 或 --input_path | 指定比对文件,str 类型。 | 是 | -| -o 或 --output_path | 配置比对结果文件存盘目录,str 类型。文件名称基于时间戳自动生成,格式为:`compare_{timestamp}.vis`。 | 是 | +| 参数名 | 说明 | 是否必选 | +|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------| +| -i 或 --input_path | 指定比对文件,参考[比对文件说明](#313-比对文件说明) | 是 | +| -o 或 --output_path | 配置比对结果文件存盘目录,str 
类型。文件名称基于时间戳自动生成,格式为:`compare_{timestamp}.vis或build_{timestamp}.vis`。 | 是 | +| -lm 或 --layer_mapping | 跨套件比对,例如同一个模型分别使用了DeepSpeed和Megatron套件的比对场景。配置该参数时表示开启跨套件Layer层的比对功能,指定模型代码中的Layer层后,可以识别对应dump数据中的模块或API。需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件(Layer)](#71-自定义映射文件layer),如何配置自定义映射文件请参考[模型分级可视化如何配置layer mapping映射文件](./visualization/layer_mapping_example.md)。 | 否 | +| -oc 或 --overflow_check | 是否开启溢出检测模式,开启后会在输出vis文件中(`compare_{timestamp}.vis或build_{timestamp}.vis`)对每个溢出节点进行标记溢出等级,溢出等级说明参考[溢出等级说明](#312-溢出等级说明) | 否 | +| -f 或 --fuzzy_match | 是否开启模糊匹配,bool类型。模糊匹配说明参考[匹配说明](#311-匹配说明) | 否 | +| -cs 或 --complete_stack | 是否使用完整的堆栈信息,bool类型。默认使用精简的堆栈信息,数据量小有助于增加流畅度。完整堆栈和精简堆栈信息参考[堆栈信息说明](#72-堆栈信息说明) | 否 | + +#### 3.1.1 匹配说明 + +**注:dump名称 = 名称 + 调用次数**,例如Torch.matmul.2.forward,matmul是名称,2是调用次数 -**比对文件说明**: +1.默认匹配 +- 所有节点dump名称一致 +- 节点输入输出参数数量一致,参数type、shape一致 +- 节点的层级一致(父节点们一致) + +2.模糊匹配 +- Module节点dump名称一致,两个匹配上的Module节点, 忽略各自节点下所有api的dump调用次数,按照名称一致+Module节点内的调用顺序进行匹配 +- ![fuzzy_match_pt.png](./img/visualization/fuzzy_match_pt.png) +- 参数shape一致 + +#### 3.1.2 溢出等级说明 +- medium:输入异常,输出正常场景 +- high:输入异常,输出异常;输出norm值相较于输入存在异常增大情况 +- critical:输入正常,输出异常场景 + +#### 3.1.3 比对文件说明 以在当前目录创建 ./compare.json 为例。 ``` { @@ -49,11 +82,11 @@ msprobe -f pytorch graph -i ./compare.json -o ./output ``` **比对文件参数说明**: -| 参数名 | 说明 | 是否必选 | -|-------------------|--------------------------------------------------------------------------------------------------------|------| -| npu_path | 指定待调试侧比对路径,str类型,路径下必须包含construct.json、dump.json和stack.json文件,注意construct.json内容不能为空,否则无法构图 | 是 | -| bench_path | 指定标杆侧比对路径,str类型,路径下必须包含construct.json、dump.json和stack.json文件,注意construct.json内容不能为空,否则无法构图。单图构建场景可以不配置 | 否 | -| is_print_compare_log | 配置是否开启单个算子的日志打屏。可取值 true 或 false,默认为 true。关闭后则只输出常规日志,bool 类型。 | 否 | +| 参数名 | 说明 | 是否必选 | +|-------------------|----------------------------------------------------------------------------|------| +| npu_path | 
指定待调试侧比对路径,str类型。工具根据路径格式自动进行单rank比对、多rank批量比对或多step批量比对,具体格式参考3.2 图构建和比对。 | 是 | +| bench_path | 指定标杆侧比对路径,str类型。单图构建场景可以不配置。 | 否 | +| is_print_compare_log | 配置是否开启单个算子的日志打屏。可取值 true 或 false,默认为 true。关闭后则只输出常规日志,bool 类型。 | 否 | ### 3.2 图构建和比对 @@ -63,7 +96,7 @@ msprobe -f pytorch graph -i ./compare.json -o ./output #### 3.2.1 单图构建 -展示模型结构和精度数据。 +展示模型结构、精度数据、堆栈信息。 **1. 准备比对文件**: @@ -74,6 +107,18 @@ msprobe -f pytorch graph -i ./compare.json -o ./output "is_print_compare_log": true } ``` +npu_path格式:必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 +``` +├── npu_path +│ ├── dump_tensor_data(配置dump的task参数选择tensor时存在) +| | ├── Tensor.permute.1.forward.pt +| | ├── MyModule.0.forward.input.pt +| | ... +| | └── Fcuntion.linear.5.backward.output.pt +| ├── dump.json # 数据信息 +| ├── stack.json # 调用栈信息 +| └── construct.json # 分层分级结构 +``` **2. 执行命令**: ``` msprobe -f pytorch graph -i ./compare.json -o ./output @@ -90,7 +135,7 @@ msprobe -f pytorch graph -i ./compare.json -o ./output 3.md5:dump了API和Module的输入输出数据统计信息和md5信息。 -dump类型如何配置见[数据采集配置文件介绍](https://gitee.com/louyujing/mstt_graph_wiki/blob/master/debug/accuracy_tools/msprobe/docs/02.config_introduction.md) +dump类型如何配置见[数据采集配置文件介绍](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/02.config_introduction.md) **1. 准备比对文件**: @@ -102,6 +147,18 @@ dump类型如何配置见[数据采集配置文件介绍](https://gitee.com/louy "is_print_compare_log": true } ``` +npu_path或bench_path格式:必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 +``` +├── npu_path或bench_path +│ ├── dump_tensor_data(仅配置dump的task参数选择tensor时存在) +| | ├── Tensor.permute.1.forward.pt +| | ├── MyModule.0.forward.input.pt +| | ... +| | └── Function.linear.5.backward.output.pt +| ├── dump.json # 数据信息 +| ├── stack.json # 调用栈信息 +| └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +``` **2. 
执行命令**: ``` msprobe -f pytorch graph -i ./compare.json -o ./output @@ -109,33 +166,206 @@ msprobe -f pytorch graph -i ./compare.json -o ./output 比对完成后将在**output**下生成一个**vis后缀文件**。 +#### 3.2.3 批量构建或比对 +##### 3.2.3.1 多rank批量构建或比对 +批量构建或比对一个step下的所有rank的数据 + +**1. 准备比对文件**: + +以在当前目录创建 ./compare.json 为例。 +``` +{ +"npu_path": "./npu_dump", +"bench_path": "./bench_dump", # 只进行图构建可不配置 +"is_print_compare_log": true +} +``` +npu_path或bench_path格式:必须只包含rank+数字格式的文件夹,且每个rank文件夹中必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 + +进行批量图比对时,npu_path和bench_path中包含的rank+数字格式的文件夹必须数量一致且能够一一对应。 +``` +├── npu_path或bench_path +| ├── rank0 +| │ ├── dump_tensor_data(仅配置dump的task参数选择tensor时存在) +| | | ├── Tensor.permute.1.forward.pt +| | | ├── MyModule.0.forward.input.pt +| | | ... +| | | └── Function.linear.5.backward.output.pt +| | ├── dump.json # 数据信息 +| | ├── stack.json # 算子调用栈信息 +| | └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +| ├── rank1 +| | ├── dump_tensor_data +| | | └── ... +| | ├── dump.json +| | ├── stack.json +| | └── construct.json +| ├── ... +| | +| └── rankn +``` +**2. 执行命令**: +``` +msprobe -f pytorch graph -i ./compare.json -o ./output +``` +比对完成后将在**output**下生成n个**vis后缀文件**。 + +图构建: +``` +├── build_rank0_{timestamp}.vis +├── build_rank1_{timestamp}.vis +├── build_rank2_{timestamp}.vis +├── build_rank3_{timestamp}.vis +├── ... +├── build_rankn_{timestamp}.vis +``` +图比对: +``` +├── compare_rank0_{timestamp}.vis +├── compare_rank1_{timestamp}.vis +├── compare_rank2_{timestamp}.vis +├── compare_rank3_{timestamp}.vis +├── ... +├── compare_rankn_{timestamp}.vis +``` +##### 3.2.3.2 多step批量构建或比对 +批量构建或比对多个step下的所有rank的数据 + +**1. 
准备比对文件**: + +以在当前目录创建 ./compare.json 为例。 +``` +{ +"npu_path": "./npu_dump", +"bench_path": "./bench_dump", # 只进行图构建可不配置 +"is_print_compare_log": true +} +``` +npu_path或bench_path格式:必须只包含step+数字格式的文件夹,且每个step文件夹中必须只包含rank+数字格式的文件夹,每个rank文件夹中必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 + +进行批量图比对时,npu_path和bench_path中包含的step+数字格式的文件夹必须数量一致且能够一一对应,每个step文件夹中包含的rank+数字格式的文件夹必须数量一致且能够一一对应。 +``` +├── npu_path或bench_path +│ ├── step0 +│ | ├── rank0 +│ | │ ├── dump_tensor_data(仅配置dump的task参数选择tensor时存在) +| | | | ├── Tensor.permute.1.forward.pt +| | | | ├── MyModule.0.forward.input.pt +| | | | ... +| | | | └── Function.linear.5.backward.output.pt +│ | | ├── dump.json # 数据信息 +│ | | ├── stack.json # 调用栈信息 +│ | | └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +│ | ├── rank1 +| | | ├── dump_tensor_data +| | | | └── ... +│ | | ├── dump.json +│ | | ├── stack.json +| | | └── construct.json +│ | ├── ... +│ | | +| | └── rankn +│ ├── step1 +│ | ├── ... +│ ├── step2 +``` +**2. 执行命令**: +``` +msprobe -f pytorch graph -i ./compare.json -o ./output +``` +比对完成后将在**output**下生成若干个**vis后缀文件**。 + +图构建: +``` +├── build_step0_rank0_{timestamp}.vis +├── build_step0_rank1_{timestamp}.vis +├── build_step0_rank2_{timestamp}.vis +├── build_step0_rank3_{timestamp}.vis +├── build_step1_rank0_{timestamp}.vis +├── build_step1_rank1_{timestamp}.vis +├── build_step1_rank2_{timestamp}.vis +├── build_step1_rank3_{timestamp}.vis +├── ... +├── build_stepn_rankn_{timestamp}.vis +``` +图比对: +``` +├── compare_step0_rank0_{timestamp}.vis +├── compare_step0_rank1_{timestamp}.vis +├── compare_step0_rank2_{timestamp}.vis +├── compare_step0_rank3_{timestamp}.vis +├── compare_step1_rank0_{timestamp}.vis +├── compare_step1_rank1_{timestamp}.vis +├── compare_step1_rank2_{timestamp}.vis +├── compare_step1_rank3_{timestamp}.vis +├── ... 
+├── compare_stepn_rankn_{timestamp}.vis +``` + +#### 3.2.4 仅模型结构比对 + +适用场景:**主要关注模型结构而非训练过程数据**。例如,在模型迁移过程中,确保迁移前后模型结构的一致性,或在排查精度差异时,判断是否由模型结构差异所引起。 + +使用msprobe工具对模型数据进行采集时,**可选择仅采集模型结构(task配置为structure)**,此配置将避免采集模型训练过程的数据,从而显著减少采集所需的时间。 + +dump配置请参考[dump配置示例](./03.config_examples.md#16-task-配置为-structure) + +得到dump数据后,若需比较特定两个rank之间的数据,请参考[3.2.2 双图比对](#322-双图比对);若需进行多个rank或多个step的数据批量比对,请参考[3.2.3 批量构建或比对](#323-批量构建或比对)。 + ## 4.启动tensorboard +### 4.1 可直连的服务器 + 将生成vis文件的路径**out_path**传入--logdir ``` tensorboard --logdir out_path --bind_all --port [可选,端口号] ``` +启动后会打印日志: -启动后会打印日志。 -``TensorBoard 2.15.1 at http://localhost.localdomain:6008/ (Press CTRL+C to quit)`` -localhost.localdomain是机器地址,6008是端口号。 +![tensorboard_1](./img/visualization/tensorboard_1.png) -**如果链接打不开,可以尝试使用vscode连接服务器,在vscode终端输入:** +ubuntu是机器地址,6008是端口号。 + +**注意,ubuntu需要替换为真实的服务器地址,例如真实的服务器地址为10.123.456.78,则需要在浏览器窗口输入http://10.123.456.78:6008** + +### 4.2 不可直连的服务器 +**如果链接打不开(服务器无法直连需要挂vpn才能连接等场景),可以尝试使用vscode连接服务器,在vscode终端输入:** ``` tensorboard --logdir out_path ``` +![tensorboard_2](./img/visualization/tensorboard_2.png) -CTRL+C点击链接即可 +按住CTRL点击链接即可 ## 5.浏览器查看 -推荐使用谷歌浏览器,在浏览器中输入机器地址+端口号回车,出现TensorBoard页面,右上方选择GRAPHS即可展示模型结构图。 +### 5.1 浏览器打开图 +推荐使用谷歌浏览器,在浏览器中输入机器地址+端口号回车,出现TensorBoard页面,其中/#graph_ascend会自动拼接。 +![vis_browser_1](./img/visualization/vis_browser_1.png) +如果您切换了TensorBoard的其他功能,此时想回到模型分级可视化页面,可以点击左上方的**GRAPH_ASCEND** +![vis_browser_2](./img/visualization/vis_browser_2.png) + +### 5.2 查看图 +![vis_show_info.png](./img/visualization/vis_show_info.png) + +### 5.3 名称搜索 +![vis_search_info.png](./img/visualization/vis_search_info.png) + +### 5.4 精度筛选 +![vis_precision_info.png](./img/visualization/vis_precision_info.png) + +### 5.5 未匹配节点筛选 +节点匹配规则: + +1.名称一致 -节点需要双击打开。 +2.节点输入输出参数数量一致,参数type、shape一致 -键盘WS可放大缩小,AD可左右移动,鼠标滚轮可上下移动。 +3.节点的层级一致(父节点们一致) + +![vis_unmatch_info.png](./img/visualization/vis_unmatch_info.png) ## 6.图比对说明 @@ -146,14 +376,101 @@ CTRL+C点击链接即可 ### 疑似有精度问题判定 #### 真实数据模式 
-所有输入的最小双千指标和所有输出的最小双千指标的差值的绝对值,大于0.1。 +节点中所有输入的最小双千指标和所有输出的最小双千分之一指标的差值,反映了双千指标的下降情况,**值越大精度差距越大,颜色标记越深**。 -``One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指NPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五`` +``One Thousandth Err Ratio(双千分之一)精度指标:Tensor中的元素逐个与对应的标杆数据对比,相对误差小于千分之一的比例占总元素个数的比例,比例越接近1越好`` #### 统计信息模式 -所有输入输出的统计量相对diff值,大于0.5,其中小值域相对diff值(小于1e-3)不参与判定。 +节点中输出的统计量相对误差,**值越大精度差距越大,颜色标记越深**。 -``相对diff值:abs((npu统计值 - bench统计值) / bench统计值)`` +``相对误差:abs((npu统计值 - bench统计值) / bench统计值)`` #### md5模式 -任意输入输出的md5值不同。 +节点中任意输入输出的md5值不同。 + +## 7.附录 +### 7.1 自定义映射文件(Layer) + +文件名格式:\*.yaml,*为文件名,可自定义。 + +文件内容示例: + +```yaml +PanGuVLMModel: # Layer层名称 + vision_model: language_model.vision_encoder # 模型代码中嵌套的Layer层名称 + vision_projection: language_model.projection + +RadioViTModel: + input_conditioner: radio_model.input_conditioner + patch_generator: radio_model.patch_generator + radio_model: radio_model.transformer + +ParallelTransformerLayer: + input_norm: input_layernorm + post_attention_norm: post_attention_layernorm + +GPTModel: + decoder: encoder + +SelfAttention: + linear_qkv: query_key_value + core_attention: core_attention_flash + linear_proj: dense + +MLP: + linear_fc1: dense_h_to_4h + linear_fc2: dense_4h_to_h +``` + +Layer层名称需要从模型代码中获取。 + +yaml文件中只需配置待调试侧与标杆侧模型代码中功能一致但名称不同的Layer层,名称相同的Layer层会被自动识别并映射。 + +模型代码示例: + +![ms_dump](./img/ms_layer.png) + +### 7.2 堆栈信息说明 + +**精简堆栈** + +保留一条当前模块或api的调用信息 + +```json +{ + "Module.layer1.0.bn1.BatchNorm2d.forward.0": [ + "File /home/torchvision/models/resnet.py, line 93, in forward, \n out = self.bn1(out)" + ] +} +``` + +**完整堆栈** + +当前模块或api完整的调用信息 + +```json +{ + "Module.layer1.0.bn1.BatchNorm2d.forward.0": [ + "File /home/torchvision/models/resnet.py, line 93, in forward, \n out = self.bn1(out)", + "File /home/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /home/torch/nn/modules/module.py, line 1518, in 
_wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /home/torch/nn/modules/container.py, line 215, in forward, \n input = module(input)", + "File /home/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /home/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /home/torchvision/models/resnet.py, line 273, in _forward_impl, \n x = self.layer1(x)", + "File /home/torchvision/models/resnet.py, line 285, in forward, \n return self._forward_impl(x)", + "File /home/torch/nn/modules/module.py, line 1527, in _call_impl, \n return forward_call(*args, **kwargs)", + "File /home/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /home/visualization/resnet18.py, line 40, in , \n outputs = model(inputs)" + ] +} + +``` +# FAQ +1. 图比对场景,节点呈现灰色,且没有精度比对数据,怎么处理? + +节点呈现灰色,代表左边待调试侧节点与右边标杆侧节点没有匹配上,可能有以下几点原因: + +- **标杆侧确实没有能与待调试侧匹配上的节点**,属于代码实现上的差异,请确认此差异是否正常,是否会影响到整网精度。 +- **节点的输入或输出type、shape不一致,参数个数不一致,节点所在层级的父层级不一致**,导致节点无法匹配,具体匹配规则见[匹配说明](#311-匹配说明),可尝试使用模糊匹配功能,如何使用此功能请参考[构图命令行说明](#31-构图命令行说明)。如果是参数shape不一致,即使是模糊匹配功能也无法让节点匹配上,请检查参数shape不一致是否合理。 +- **节点名称不一致**,导致节点无法匹配,可使用layer mapping功能,如何使用此功能请参考[构图命令行说明](#31-构图命令行说明),如何自定义映射文件请参考[模型分级可视化如何配置layer mapping映射文件](./visualization/layer_mapping_example.md)。 diff --git a/debug/accuracy_tools/msprobe/docs/22.visualization_MindSpore.md b/debug/accuracy_tools/msprobe/docs/22.visualization_MindSpore.md new file mode 100644 index 0000000000000000000000000000000000000000..12306b8be027e7cee715f99f75b00f7504ba8252 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/22.visualization_MindSpore.md @@ -0,0 +1,492 @@ +# MindSpore 场景的分级可视化构图比对 + +分级可视化工具将msprobe工具dump的精度数据进行解析,还原模型图结构,实现模型各个层级的精度数据比对,方便用户理解模型结构、分析精度问题。 + +工具支持MindSpore版本:2.4.0 + +## 展示示例 + +支持重建模型的层级结构; + +支持两个模型的结构差异比对; + +支持两个模型的精度数据比对,支持疑似有精度问题节点的快速搜索,自动跳转展开节点所在的层级。 + 
+![vis_show](./img/visualization/vis_showcase.png) + +## 1.依赖安装 + +分级可视化工具依赖**msprobe工具**和**tensorboard。** + +### 1.1 安装msprobe工具 + +[msprobe工具安装](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/01.installation.md) + +### 1.2 安装tb_graph_ascend + +**请安装tb_graph_ascend,否则无法解析构图结果。** + +``pip3 install tb-graph-ascend``即可。 + +## 2.模型结构数据采集 +[MindSpore场景的精度数据采集](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/06.data_dump_MindSpore.md) + +**仅支持动态图场景,需要选择level为L0(cell信息)或者mix(cell信息+api信息),才能采集到模型结构数据,即采集结果件construct.json内容不为空**。 + +## 3.生成图结构文件 + +### 3.1 构图命令行说明 + +**命令示例如下**: +``` +msprobe -f mindspore graph -i ./compare.json -o ./output +``` +**命令行参数说明**: + +| 参数名 | 说明 | 是否必选 | +|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------- | +| -i 或 --input_path | 指定比对文件,参考[比对文件说明](#313-比对文件说明) | 是 | +| -o 或 --output_path | 配置比对结果文件存盘目录,str 类型。文件名称基于时间戳自动生成,格式为:`compare_{timestamp}.vis或build_{timestamp}.vis`。 | 是 | +| -lm 或 --layer_mapping| 跨框架比对,MindSpore和PyTorch的比对场景。配置该参数时表示开启跨框架Layer层的比对功能,指定模型代码中的Layer层后,可以识别对应dump数据中的模块或API。需要指定自定义映射文件*.yaml。自定义映射文件的格式请参见[自定义映射文件(Layer)](#71-自定义映射文件layer), 如何配置自定义映射文件请参考[模型分级可视化如何配置layer mapping映射文件](./visualization/layer_mapping_example.md)。 | 否 | +| -oc 或 --overflow_check | 是否开启溢出检测模式,开启后会在输出vis文件中(`compare_{timestamp}.vis或build_{timestamp}.vis`)对每个溢出节点进行标记溢出等级,溢出等级说明参考[溢出等级说明](#312-溢出等级说明) | 否 | +| -f 或 --fuzzy_match | 是否开启模糊匹配,bool类型。模糊匹配说明参考[匹配说明](#311-匹配说明) | 否 | +| -cs 或 --complete_stack | 是否使用完整的堆栈信息,bool类型。默认使用精简的堆栈信息,数据量小有助于增加流畅度。完整堆栈和精简堆栈信息参考[堆栈信息说明](#72-堆栈信息说明) | 否 | + +#### 3.1.1 匹配说明 + +**注:dump名称 = 名称 + 调用次数**,例如Functional.matmul.2.forward,matmul是名称,2是调用次数 + +1.默认匹配 +- 所有节点dump名称一致 +- 节点输入输出参数数量一致,参数type、shape一致 +- 节点的层级一致(父节点们一致) + +2.模糊匹配 +- 
Cell节点dump名称一致,两个匹配上的Cell节点, 忽略各自节点下所有api的dump调用次数,按照名称一致+Cell节点内的调用顺序进行匹配 +- ![fuzzy_match_ms.png](./img/visualization/fuzzy_match_ms.png) +- 参数shape一致 + +#### 3.1.2 溢出等级说明 +- medium:输入异常,输出正常场景 +- high:输入异常,输出异常;输出norm值相较于输入存在异常增大情况 +- critical:输入正常,输出异常场景 + +#### 3.1.3 比对文件说明 + +以在当前目录创建 ./compare.json 为例。 +``` +{ +"npu_path": "./npu_dump", +"bench_path": "./bench_dump", +"is_print_compare_log": true +} +``` +**比对文件参数说明**: + +| 参数名 | 说明 | 是否必选 | +|-------------------|-------------------------------------------------------------------------------------------------------|------| +| npu_path | 指定待调试侧比对路径,str类型。工具根据路径格式自动进行单rank比对、多rank批量比对或多step批量比对,具体格式参考3.2 图构建和比对。 | 是 | +| bench_path | 指定标杆侧比对路径,str类型。单图构建场景可以不配置 | 否 | +| is_print_compare_log | 配置是否开启单个算子的日志打屏。可取值 true 或 false,默认为 true。关闭后则只输出常规日志,bool 类型。 | 否 | + + +### 3.2 图构建和比对 + +**如果只是想查看一个模型的结构,请选择单图构建**; +**如果想比较两个模型的结构差异和精度数据差异,请选择双图比对**。 + +#### 3.2.1 单图构建 + +展示模型结构、精度数据、堆栈信息。 + +**1. 准备比对文件**: + +以在当前目录创建 ./compare.json 为例。 +``` +{ +"npu_path": "./npu_dump", +"is_print_compare_log": true +} +``` +npu_path格式:必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 +``` +├── npu_path +│ ├── dump_tensor_data(配置dump的task参数选择tensor时存在) +| | ├── MintFunctional.relu.0.backward.input.0.npy +| | ├── Mint.abs.0.forward.input.0.npy +| | ... +| | └── Cell.relu.ReLU.forward.0.input.0.npy +| ├── dump.json # 数据信息 +| ├── stack.json # 调用栈信息 +| └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +``` +**2. 
执行命令**: +``` +msprobe -f mindspore graph -i ./compare.json -o ./output +``` +#### 3.2.2 双图比对 + +展示模型结构、结构差异、精度数据和精度比对指标、精度是否疑似有问题(精度比对指标差异越大颜色越深)。 + +当前比对支持三种类型的dump数据,分级可视化工具比对时会自动判断: + +1.统计信息:仅dump了API和Module的输入输出数据统计信息,占用磁盘空间小; + +2.真实数据:不仅dump了API和Module的输入输出数据统计信息,还将tensor进行存盘,占用磁盘空间大,但比对更加准确; + +3.md5:dump了API和Module的输入输出数据统计信息和md5信息。 + +dump类型如何配置见[数据采集配置文件介绍](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/02.config_introduction.md) + +**1. 准备比对文件**: + +以在当前目录创建 ./compare.json 为例。 +``` +{ +"npu_path": "./npu_dump", +"bench_path": "./bench_dump", +"is_print_compare_log": true +} +``` +npu_path或bench_path格式:必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 +``` +├── npu_path或bench_path +│ ├── dump_tensor_data(配置dump的task参数选择tensor时存在) +| | ├── MintFunctional.relu.0.backward.input.0.npy +| | ├── Mint.abs.0.forward.input.0.npy +| | ... +| | └── Cell.relu.ReLU.forward.0.input.0.npy +| ├── dump.json # 数据信息 +| ├── stack.json # 调用栈信息 +| └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +``` +**2. 执行命令**: +``` +msprobe -f mindspore graph -i ./compare.json -o ./output +``` + +比对完成后将在**output**下生成一个**vis后缀文件**。 + +#### 3.2.3 批量构建或比对 +##### 3.2.3.1 多rank批量构建或比对 +批量构建或比对一个step下的所有rank的数据 + +**1. 准备比对文件**: + +以在当前目录创建 ./compare.json 为例。 +``` +{ +"npu_path": "./npu_dump", +"bench_path": "./bench_dump", # 只进行图构建可不配置 +"is_print_compare_log": true +} +``` +npu_path或bench_path格式:必须只包含rank+数字格式的文件夹,且每个rank文件夹中必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 + +进行批量图比对时,npu_path和bench_path中包含的rank+数字格式的文件夹必须数量一致且能够一一对应。 +``` +├── npu_path或bench_path +| ├── rank0 +| │ ├── dump_tensor_data(仅配置dump的task参数选择tensor时存在) +| | | ├── MintFunctional.relu.0.backward.input.0.npy +| | | ├── Mint.abs.0.forward.input.0.npy +| | | ... 
+| | | └── Cell.relu.ReLU.forward.0.input.0.npy +| | ├── dump.json # 数据信息 +| | ├── stack.json # 算子调用栈信息 +| | └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +| ├── rank1 +| | ├── dump_tensor_data +| | | └── ... +| | ├── dump.json +| | ├── stack.json +| | └── construct.json +| ├── ... +| | +| └── rankn +``` +**2. 执行命令**: +``` +msprobe -f mindspore graph -i ./compare.json -o ./output +``` +比对完成后将在**output**下生成n个**vis后缀文件**。 + +图构建: +``` +├── build_rank0_{timestamp}.vis +├── build_rank1_{timestamp}.vis +├── build_rank2_{timestamp}.vis +├── build_rank3_{timestamp}.vis +├── ... +├── build_rankn_{timestamp}.vis +``` +图比对: +``` +├── compare_rank0_{timestamp}.vis +├── compare_rank1_{timestamp}.vis +├── compare_rank2_{timestamp}.vis +├── compare_rank3_{timestamp}.vis +├── ... +├── compare_rankn_{timestamp}.vis +``` +##### 3.2.3.2 多step批量构建或比对 +批量构建或比对多个step下的所有rank的数据 + +**1. 准备比对文件**: + +以在当前目录创建 ./compare.json 为例。 +``` +{ +"npu_path": "./npu_dump", +"bench_path": "./bench_dump", # 只进行图构建可不配置 +"is_print_compare_log": true +} +``` +npu_path或bench_path格式:必须只包含step+数字格式的文件夹,且每个step文件夹中必须只包含rank+数字格式的文件夹,每个rank文件夹中必须包含dump.json、stack.json和construct.json,且construct.json不能为空。如果construct.json为空,请检查dump的level参数是否没有选择L0或者mix。 + +进行批量图比对时,npu_path和bench_path中包含的step+数字格式的文件夹必须数量一致且能够一一对应,每个step文件夹中包含的rank+数字格式的文件夹必须数量一致且能够一一对应。 +``` +├── npu_path或bench_path +│ ├── step0 +│ | ├── rank0 +│ | │ ├── dump_tensor_data(仅配置dump的task参数选择tensor时存在) +| | | | ├── MintFunctional.relu.0.backward.input.0.npy +| | | | ├── Mint.abs.0.forward.input.0.npy +| | | | ... +| | | | └── Cell.relu.ReLU.forward.0.input.0.npy +│ | | ├── dump.json # 数据信息 +│ | | ├── stack.json # 调用栈信息 +│ | | └── construct.json # 分层分级结构,level为L1时,construct.json内容为空 +│ | ├── rank1 +| | | ├── dump_tensor_data +| | | | └── ... +│ | | ├── dump.json +│ | | ├── stack.json +| | | └── construct.json +│ | ├── ... +│ | | +| | └── rankn +│ ├── step1 +│ | ├── ... +│ ├── step2 +``` +**2. 
执行命令**: +``` +msprobe -f mindspore graph -i ./compare.json -o ./output +``` +比对完成后将在**output**下生成若干个**vis后缀文件**。 + +图构建: +``` +├── build_step0_rank0_{timestamp}.vis +├── build_step0_rank1_{timestamp}.vis +├── build_step0_rank2_{timestamp}.vis +├── build_step0_rank3_{timestamp}.vis +├── build_step1_rank0_{timestamp}.vis +├── build_step1_rank1_{timestamp}.vis +├── build_step1_rank2_{timestamp}.vis +├── build_step1_rank3_{timestamp}.vis +├── ... +├── build_stepn_rankn_{timestamp}.vis +``` +图比对: +``` +├── compare_step0_rank0_{timestamp}.vis +├── compare_step0_rank1_{timestamp}.vis +├── compare_step0_rank2_{timestamp}.vis +├── compare_step0_rank3_{timestamp}.vis +├── compare_step1_rank0_{timestamp}.vis +├── compare_step1_rank1_{timestamp}.vis +├── compare_step1_rank2_{timestamp}.vis +├── compare_step1_rank3_{timestamp}.vis +├── ... +├── compare_stepn_rankn_{timestamp}.vis +``` + +#### 3.2.4 仅模型结构比对 + +适用场景:**主要关注模型结构而非训练过程数据**。例如,在模型迁移过程中,确保迁移前后模型结构的一致性,或在排查精度差异时,判断是否由模型结构差异所引起。 + +使用msprobe工具对模型数据进行采集时,**可选择仅采集模型结构(task配置为structure)**,此配置将避免采集模型训练过程的数据,从而显著减少采集所需的时间。 + +dump配置请参考[dump配置示例](./03.config_examples.md#35-task-配置为-structure) + +得到dump数据后,若需比较特定两个rank之间的数据,请参考[3.2.2 双图比对](#322-双图比对);若需进行多个rank或多个step的数据批量比对,请参考[3.2.3 批量构建或比对](#323-批量构建或比对)。 + + +## 4.启动tensorboard + +### 4.1 可直连的服务器 + +将生成vis文件的路径**out_path**传入--logdir + +``` +tensorboard --logdir out_path --bind_all --port [可选,端口号] +``` +启动后会打印日志: + +![tensorboard_1](./img/visualization/tensorboard_1.png) + +ubuntu是机器地址,6008是端口号。 + +**注意,ubuntu需要替换为真实的服务器地址,例如真实的服务器地址为10.123.456.78,则需要在浏览器窗口输入http://10.123.456.78:6008** + +### 4.2 不可直连的服务器 +**如果链接打不开(服务器无法直连需要挂vpn才能连接等场景),可以尝试使用vscode连接服务器,在vscode终端输入:** + +``` +tensorboard --logdir out_path +``` +![tensorboard_2](./img/visualization/tensorboard_2.png) + +按住CTRL点击链接即可 + +## 5.浏览器查看 + +### 5.1 浏览器打开图 +推荐使用谷歌浏览器,在浏览器中输入机器地址+端口号回车,出现TensorBoard页面,其中/#graph_ascend会自动拼接。 +![vis_browser_1](./img/visualization/vis_browser_1.png) 
+如果您切换了TensorBoard的其他功能,此时想回到模型分级可视化页面,可以点击左上方的**GRAPH_ASCEND** +![vis_browser_2](./img/visualization/vis_browser_2.png) + +### 5.2 查看图 +![vis_show_info.png](./img/visualization/vis_show_info.png) + +### 5.3 名称搜索 +![vis_search_info.png](./img/visualization/vis_search_info.png) + +### 5.4 精度筛选 +![vis_precision_info.png](./img/visualization/vis_precision_info.png) + +### 5.5 未匹配节点筛选 +节点匹配规则: + +1.名称一致 + +2.节点输入输出参数数量一致,参数type、shape一致 + +3.节点的层级一致(父节点们一致) + +![vis_unmatch_info.png](./img/visualization/vis_unmatch_info.png) + +## 6.图比对说明 + +### 颜色 + +颜色越深,精度比对差异越大,越可疑,具体信息可见浏览器页面左下角颜色图例。 + +### 疑似有精度问题判定 + +#### 真实数据模式 +节点中所有输入的最小双千指标和所有输出的最小双千分之一指标的差值,反映了双千指标的下降情况,**值越大精度差距越大,颜色标记越深**。 + +``One Thousandth Err Ratio(双千分之一)精度指标:Tensor中的元素逐个与对应的标杆数据对比,相对误差小于千分之一的比例占总元素个数的比例,比例越接近1越好`` + +#### 统计信息模式 +节点中输出的统计量相对误差,**值越大精度差距越大,颜色标记越深**。 + +``相对误差:abs((npu统计值 - bench统计值) / bench统计值)`` + +#### md5模式 +节点中任意输入输出的md5值不同。 + +## 7.附录 +### 7.1 自定义映射文件(Layer) + +文件名格式:\*.yaml,*为文件名,可自定义。 + +文件内容示例: + +```yaml +ParallelAttention: # Layer层名称 + qkv_proj: query_key_value # 冒号左侧为MindSpore框架模型代码中嵌套的Layer层名称,冒号右侧为PyTorch框架模型代码中嵌套的Layer层名称 + out_proj: dense + +ParallelTransformerLayer: + attention: self_attention + +Embedding: + dropout: embedding_dropout + +ParallelMLP: + mapping: dense_h_to_4h + projection: dense_4h_to_h + +PipelineCell: + model: module + +Cell: + network_with_loss: module +``` + +Layer层名称需要从模型代码中获取。 + +yaml文件中只需配置MindSpore与PyTorch模型代码中功能一致但名称不同的Layer层,名称相同的Layer层会被自动识别并映射。 + +模型代码示例: + +![ms_dump](./img/ms_layer.png) + +### 7.2 堆栈信息说明 + +**精简堆栈** + +保留一条当前模块或api的调用信息 + +```json +{ + "Cell.model.language_model.embedding.word_embeddings.reduce_scatter_to_sp_region.ReduceScatterToSequenceParallelRegion.forward.0": [ + "File /home/mindformers/experimental/distri_cores/tensor_parallel/layers.py, line 770, in construct, \n output = self.reduce_scatter_to_sp_region(output_parallel)" + ] +} +``` + +**完整堆栈** + +当前模块或api完整的调用信息 + +```json +{ + 
"Cell.model.language_model.embedding.word_embeddings.reduce_scatter_to_sp_region.ReduceScatterToSequenceParallelRegion.forward.0": [ + "File /home/mindspore/nn/cell.py, line 507, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /home/mindspore/nn/cell.py, line 759, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /home/mindspore/nn/cell.py, line 747, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /home/mindformers/experimental/distri_cores/tensor_parallel/layers.py, line 770, in construct, \n output = self.reduce_scatter_to_sp_region(output_parallel)", + "File /home/mindspore/nn/cell.py, line 2462, in _backward_hook_construct, \n outputs = self.construct(outputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 498, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 745, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /home/mindformers/experimental/distri_cores/transformer/language_model.py, line 151, in construct, \n embeddings = self.word_embeddings(input_ids)", + "File /home/mindspore/nn/cell.py, line 2460, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 498, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 745, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /home/mindformers/experimental/distri_cores/transformer/language_model.py, line 391, in construct, \n text_embedding_out = self.embedding(enc_input_ids, enc_position_ids,", + "File /home/mindspore/nn/cell.py, line 2460, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 498, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 
745, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /home/model/gpt_model.py, line 104, in construct, \n lm_output = self.language_model(tokens,", + "File /home/mindspore/nn/cell.py, line 2460, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 498, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /home/mindspore/nn/cell.py, line 745, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /home/mindformers/experimental/distri_cores/pipeline_parallel/pipeline_cell.py, line 429, in construct, \n return self.model(*inputs)", + "File /home/mindspore/nn/cell.py, line 757, in _complex_call, \n output = self.construct(*args, **kwargs)", + "File /home/mindspore/nn/cell.py, line 747, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /home/mindformers/experimental/distri_cores/pipeline_parallel/schedules.py, line 121, in run_forward, \n output_tensor = model(*input_data, recv_data=None)", + "File /home/mindformers/experimental/distri_cores/pipeline_parallel/schedules.py, line 735, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /home/mindformers/experimental/distri_cores/training.py, line 409, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /home/mindformers/experimental/distri_cores/training.py, line 533, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /home/mindspore/nn/cell.py, line 757, in _complex_call, \n output = self.construct(*args, **kwargs)", + "File /home/mindspore/nn/cell.py, line 747, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /home/mindformers/experimental/distri_cores/training.py, line 655, in train, \n loss, is_finite, loss_scale, 
learning_rate = train_one_step_cell(**data)", + "File /home/model/pretrain_gpt.py, line 303, in main, \n train(", + "File /home/model/pretrain_gpt.py, line 316, in , \n main()" + ] +} +``` +# FAQ +1. 图比对场景,节点呈现灰色,且没有精度比对数据,怎么处理? + +节点呈现灰色,代表左边待调试侧节点与右边标杆侧节点没有匹配上,可能有以下几点原因: + +- **标杆侧确实没有能与待调试侧匹配上的节点**,属于代码实现上的差异,请确认此差异是否正常,是否会影响到整网精度。 +- **节点的输入或输出type、shape不一致,参数个数不一致,节点所在层级的父层级不一致**,导致节点无法匹配,具体匹配规则见[匹配说明](#311-匹配说明),可尝试使用模糊匹配功能,如何使用此功能请参考[构图命令行说明](#31-构图命令行说明)。如果是参数shape不一致,即使是模糊匹配功能也无法让节点匹配上,请检查参数shape不一致是否合理。 +- **节点名称不一致**,导致节点无法匹配,可使用layer mapping功能,如何使用此功能请参考[构图命令行说明](#31-构图命令行说明),如何自定义映射文件请参考[模型分级可视化如何配置layer mapping映射文件](./visualization/layer_mapping_example.md)。 diff --git a/debug/accuracy_tools/msprobe/docs/23.generate_operator_PyTorch.md b/debug/accuracy_tools/msprobe/docs/23.generate_operator_PyTorch.md new file mode 100644 index 0000000000000000000000000000000000000000..59e2755ec3e5a3939af3a20d19fda12031a9bf51 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/23.generate_operator_PyTorch.md @@ -0,0 +1,107 @@ +# 单算子API自动生成脚本 + +## 1 简介 + +单算子API自动生成脚本通过提取dump数据中的可疑算子,对其进行单API复现,输出单API精度的比对结果。具体而言,该工具可以从dump数据中提取可疑API的前反向信息,根据前反向数据生成单API的前反向过程,最后通过**新精度标准比对法**a将 NPU/GPU 和 CPU 的结果进行比对,从而给出不同比对方法下的比对结果。本工具支持**随机生成模式和真实数据模式**b。 + +a. 依据新精度标准,对不同的API采取不同的比对算法(包括绝对阈值法、标杆比对法、二进制一致法、ULP误差比对法和双千指标法),最终给定比对结果; + +b. 在生成单API脚本时可以选择由工具构造随机数获得 dump 数据或选择真实输入的数据进行单API复现。随机生成模式(对应 task: "statistics")执行效率高,可以快速获得结果,但数据精度低,只能大致判断精度问题;真实数据模式(对应 task: "tensor")执行效率略低于随机生成模式,但是数据精度高,可以准确判断精度问题。 + +## 2 使用方式 + +### 前提 +1. 安装 msprobe。详见[ msprobe 安装](./01.installation.md)章节。 +2. 已完成对训练过程的dump,获得dump.json文件。 + [PyTorch场景的数据采集](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/docs/05.data_dump_PyTorch.md) + + **目前仅支持复现API级的数据,故dump时level可选择L0(API信息)或者mix(module信息+API信息)。如需复现真实数据场景的API脚本,dump时task应选择tensor,如需复现随机数据场景的API脚本,dump时task选择statistics**。 +3. 
发现某个算子疑似存在精度问题,并得知算子名,如Functional.softmax.3、Tensor.add.0、Torch.matmul.5等
+
+### 2.1 配置config_op.json
+单API复现参数配置如下(以复现softmax算子为例):
+```
+{
+    "dump_json_path": "./dump.json",
+    "api_name": "Functional.softmax.0",
+    "extract_api_path": "./Functional.softmax.0.json",
+    "propagation": "forward",
+    "data_mode": "random_data",
+    "random_seed": 1234,
+    "iter_times": 1
+}
+```
+**配置文件参数说明**
+
+| 参数名称 | 解释 | 是否必选 |
+| ---- | ---- | ---- |
+| dump_json_path | dump.json的文件路径,包含所有dump算子的信息;如果已经提取了可疑算子并保存,可以不指定。 | 否 |
+| api_name | 算子名,如Functional.softmax.3、Tensor.add.0、Torch.matmul.5等;如果已经提取了可疑算子并保存,可以不指定。 | 否 |
+| extract_api_path | 提取可疑算子的json文件路径。 | 是 |
+| propagation | 选择复现算子的forward还是backward,默认为forward。 | 否 |
+| data_mode | 选择复现算子的随机数据(random_data)还是真实数据(real_data)模式,默认为random_data。 | 否 |
+| random_seed | 仅random_data模式有效,表示手动设定的随机种子,默认为1234。 | 否 |
+| iter_times | 仅random_data模式有效,表示单API运行的次数。 | 否 |
+
+### 2.2 运行命令生成单API脚本
+config_op.json配置好后,运行如下命令:
+```
+msprobe -f pytorch op_generate -i ./config_op.json -o ./
+```
+或者进入mstt的generate_op_script文件夹:
+```
+cd mstt/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script
+```
+运行:
+```
+python op_generator.py -i ./config_op.json -o ./
+```
+**参数说明**
+
+| 参数名称 | 解释 | 是否必选 |
+| ---- | ---- | ---- |
+| -i 或 --config_input | config_op.json的路径 | 是 |
+| -o 或 --api_output_path | 单API脚本的输出路径 | 是 |
+
+### 2.3 运行单API脚本
+运行完op_generator.py后,会在指定路径下生成api_name.py的单API脚本,例如Functional.softmax.3.backward.py、Tensor.add.0.forward.py、Torch.matmul.5.backward.py。
+
+运行单API脚本即可获得不同比对方法下的比对结果:
+```
+python api_name.py
+```
+
+**运行结果参数说明**
+
+| 字段 | 含义 |
+| ------------------- | ------------------------------------------------------------ |
+| Shape | 单API输出结果的shape |
+| Dtype of out_device | NPU 或 GPU 数据的 API 数据类型。 |
+| Dtype of out_bench | 标杆数据的 API 数据类型。 |
+| Compare Standard | 比对方法(包括绝对阈值法、标杆比对法、二进制一致法、ULP误差比对法和双千指标法)。 |
+| Relative Error Ratio | 相对误差错误率。NPU 与标杆的正常值计算相对误差,其大于错误阈值的元素个数占正常值元素个数的比例。绝对阈值法指标。 |
+| 相对误差判定结果 | 相对误差错误率判定结果,等于 0 标记为 pass,其余情况标记为 error。 |
+| Absolute Error Ratio | 绝对误差错误率。NPU 与标杆的小值计算绝对误差,其大于错误阈值的元素个数占小值元素个数的比例。绝对阈值法指标。 |
+| 绝对误差判定结果 | 绝对误差错误率判定结果,等于 0 标记为 pass,其余情况标记为 error。 |
+| Small Value Error Proportion | 小值域错误比值。NPU 与 CPU 的小值域的错误比率 / GPU 与 CPU 的小值域的错误比率。标杆比对法指标。 |
+| 小值域错误判定结果 | 小值域错误比值小于等于 1 标记为 pass,1 ~ 2 之间标记为 warning,大于 2 标记为 error。 |
+| Maximum Relative Error | 相对误差最大值比值。NPU 与 CPU 的相对误差最大值 / GPU 与 CPU 的相对误差最大值。标杆比对法指标。 |
+| 相对误差最大值判定结果 | 相对误差最大值比值小于等于 1 标记为 pass,1 ~ 10 之间标记为 warning,大于 10 标记为 error。 |
+| Mean Relative Error | 相对误差平均值比值。NPU 与 CPU 的相对误差的平均值 / GPU 与 CPU 的相对误差的平均值。标杆比对法指标。 |
+| 相对误差平均值判定结果 | 相对误差平均值比值小于等于 1 标记为 pass,1 ~ 2 之间标记为 warning,大于 2 标记为 error。 |
+| Root Mean Squared Error | 均方根误差比值。NPU 与 CPU 的均方根误差 / GPU 与 CPU 的均方根误差。标杆比对法指标。 |
+| 均方根误差判定结果 | 均方根误差比值小于等于 1 标记为 pass,1 ~ 2 之间标记为 warning,大于 2 标记为 error。 |
+| Error Balance | 误差均衡性比值。NPU 与 CPU 的误差均衡性 / GPU 与 CPU 的误差均衡性。标杆比对法指标。 |
+| 误差均衡性判定结果 | 误差均衡性比值小于等于 1 标记为 pass,1 ~ 2 之间标记为 warning,大于 2 标记为 error。 |
+| Error Rate | 二进制一致错误率。NPU 或 GPU 数据中每个 Tensor 精度不一致的数值的数量与 Tensor 中数值数量的比值。只有数据是 builtin 类型(bool、int、float、str)、torch.bool 和 torch 的 int 类型或者在新精度标准中使用二进制一致算法进行比对的 API 才会展示。二进制一致法指标。 |
+| 二进制一致错误率判定结果 | 二进制一致错误率判定结果,等于 0 标记为 pass,其余情况标记为 error。 |
+| Maximum ULP Error | ULP 误差最大值a。NPU 数据与标杆数据 ULP 误差的最大值(取绝对值后)。 |
+| Mean ULP Error | ULP 误差平均值a。NPU 数据与标杆数据 ULP 误差的平均值(取绝对值后)。 |
+| ULP Error Proportion | ULP 误差大于阈值占比比值a。NPU 与 CPU 的 ULP 误差大于阈值占比 / GPU 与 CPU 的 ULP 误差大于阈值占比。 |
+| ULP 误差判定结果 | ULP 误差判定结果。<br>当 NPU 或 GPU 数据类型是 float16 或 bfloat16 时,以下两条标准满足其一标记为 pass,否则标记为 error:<br>NPU ULP 误差大于阈值占比小于 0.001;<br>NPU ULP 误差大于阈值占比小于 GPU ULP 误差大于阈值占比。<br>当 NPU 或 GPU 数据类型是 float32 时,以下三条标准满足其一标记为 pass,否则标记为 error:<br>NPU ULP 误差平均值小于 64;<br>NPU ULP 误差大于阈值占比小于 0.05;<br>NPU ULP 误差大于阈值占比小于 GPU ULP 误差大于阈值占比。 |
+| Thousandth ratio | 双千精度指标。是指 NPU 的 Tensor 中的元素逐个与对应的标杆数据对比,相对误差小于千分之一的个数占总元素个数的比例。测试通过标准为相对误差大于千分之一的个数占总元素个数的比例小于千分之一。仅 conv1d 和 conv2d 使用该指标。双千指标法指标。 |
+| 双千指标判定结果 | 双千指标判定结果。双千指标大于 0.999 标记为 pass,否则标记为 error。 |
+
+a:ULP误差比对法指标。
+
+最终判定单API是否符合精度标准由开发者通过**算子精度标准**判断。
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/docs/24.code_mapping_Mindspore.md b/debug/accuracy_tools/msprobe/docs/24.code_mapping_Mindspore.md
new file mode 100644
index 0000000000000000000000000000000000000000..05e3900d2647b07ed5334082e3ac519cfc7fb2b2
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/docs/24.code_mapping_Mindspore.md
@@ -0,0 +1,28 @@
+# MindSpore 场景的数码关联工具
+
+数码关联工具用于 MindSpore 静态图场景下将IR图与dump数据进行关联,获取 dump 数据和代码调用栈的关联关系。
+
+## 安装
+
+请参见[《msprobe 工具安装指南》](./01.installation.md)。
+
+## 功能说明
+
+数码关联是指数据和代码调用栈的关联,数据一般意义上指静态图`O0`、`O1`、`O2`下dump下来的数据。
+
+IR图推荐使用`anf_after_graph_build`图。
+
+命令格式(尖括号内为需替换的路径占位符):
+
+```
+msprobe -f mindspore code_mapping --ir <IR图文件路径> --dump_data <dump数据路径> [--output <输出目录>]
+```
+
+| 参数名称 | 说明 | 参数类型 | 是否必选 |
+| ---- | ---- | ---- | ---- |
+| --ir | 指定 MindSpore 静态图运行时生成的IR图文件。 | str | 是 |
+| --dump_data | 指定dump数据文件(支持tensor或statistic模式的dump数据)。可指定单个dump数据文件或dump数据文件的父目录,指定父目录表示关联目录下的所有dump数据文件。 | str | 是 |
+| --output | 关联结果输出目录,默认为"./",只在tensor模式时生效,会把数据文件路径和代码调用栈的关联关系存到output路径下的code_mapping_{时间戳}.csv中。如果关联的是statistic模式,则会把statistic.csv中每个条目加上该条目对应的代码栈。 | str | 否 |
+
diff --git a/debug/accuracy_tools/msprobe/docs/25.tool_function_introduction.md b/debug/accuracy_tools/msprobe/docs/25.tool_function_introduction.md
new file mode 100644
index 0000000000000000000000000000000000000000..f6f5db9781223fc299df978dfd55a9d2af2e07e6
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/docs/25.tool_function_introduction.md
@@ -0,0 +1,29
@@
+# msprobe 工具功能模块简介、适用场景和当前版本局限性
+
+## 1 PyTorch框架
+
+| 功能名(英文) | 简介 | 适用场景/优势 | 当前版本局限性 |
+| ---- | ---- | ---- | ---- |
+| [数据采集<br>(dump)](./05.data_dump_PyTorch.md) | 采集模型训练过程中的API或Module层级的前反向输入输出数据,包括层次关系、统计值信息、真实数据和调用栈等。 | 1、将模型中训练的API或Module的前反向输入输出数据保存下来分析<br>2、模型出现溢出时,可用于查看哪些API或Module出现了溢出 | 1、API级数据采集仅支持白名单列表上的API<br>2、工具会做一些同步操作,引入工具可能会导致一些同步问题消失<br>3、当前对inplace操作API或Module的支持度有限<br>4、暂不支持参数及参数梯度的采集 |
+| [离线预检<br>(api_accuracy_checker)](./07.accuracy_checker_PyTorch.md) | 为网络中每个API创建用例,检验其精度,并根据不同比对算法综合判定API在NPU上的精度是否达标,快速找出精度差异API。 | 1、对模型中所有的API做精度初步排查<br>2、精度排查不受模型累计误差影响 | 1、依赖GPU环境<br>2、不支持通信算子<br>3、仅支持部分融合算子 |
+| [整网比对<br>(compare)](./10.accuracy_compare_PyTorch.md) | 计算模型整网NPU和标杆设备的精度误差指标,标记精度异常API或Module,助力快速定位精度问题根因。 | 1、整网比对定位精度可疑算子 | 1、由于使用整网dump数据,定位的可疑算子受累计误差影响<br>2、当模型规模较大时,比对所需时间较长 |
+| [在线预检<br>(online_api_accuracy_checker)](./08.accuracy_checker_online_PyTorch.md) | 通过TCP通信或共享存储空间的方式,进行在线精度预检,解决离线预检大数据量落盘、传输困难痛点。 | 1、使用离线预检,数据量较大落盘困难或传输耗时长时,可通过在线预检进行精度排查 | 1、依赖GPU环境,NPU和GPU能够通信<br>2、重计算模式下,不支持反向aten算子预检 |
+| [溢出检查<br>(overflow_checker)](./12.overflow_check_PyTorch.md) | 检测模型计算过程的输入输出,并在溢出时落盘数据,助力用户快速定位溢出位置。 | 1、当模型出现溢出时,用于快速定位最先溢出的API或Module<br>2、相比数据采集,性能更优,磁盘压力更小 | 1、局限性同数据采集 |
+| [数据解析<br>(parse_tool)](./14.data_parse_PyTorch.md) | 交互式界面处理解析kernel层级dump数据,便于查看分析。 | 1、比对kernel层级dump数据的一致性 | 1、仅限于NPU |
+| [无标杆比对<br>(free_benchmark)](./15.free_benchmarking_PyTorch.md) | 不依赖标杆数据,通过对算子输入增加微小扰动,计算扰动后输出与原始输出的相对误差,识别有精度风险算子。 | 1、无标杆数据场景下的算子精度排查<br>2、对个别算子进行升精度、"to cpu"等操作,以验证其对模型loss的影响 | 1、由于需要拷贝输入进行二次执行,所以在遇到大张量的输入时容易发生显存OOM的问题,特别是反向比对过程,建议结合白名单使用<br>2、比对会延长训练时间,整网比对可能会造成严重的耗时膨胀,建议结合白名单使用 |
+| [梯度状态监测<br>(grad_probe)](./17.grad_probe.md) | 可导出模型权重梯度数据并对比相似度,助力确认训练过程精度问题step和反向中的异常。 | 1、需要分析梯度数据时<br>2、需要定位发生问题的step时 | 暂无 |
+| [在线精度比对<br>(online_dispatch)](./18.online_dispatch.md) | 训练过程中直接完成NPU和CPU的精度比对并输出比对结果。 | 1、执行一次就可获取NPU和CPU分别执行后的精度比对结果 | 暂无 |
+| [训练状态监控<br>(monitor)](./19.monitor.md) | 收集模型训练过程中的激活值、梯度和优化器状态,助力分析计算、通信、优化器各部分异常情况。 | 1、通过监控模块级统计量指标,快速定位异常模块位置,如loss出现nan | 1、仅支持模块级别统计量指标分析<br>2、仅支持megatron、deepspeed框架<br>3、少量增加时间和显存膨胀 |
+| [可视化比对<br>(visualization)](./21.visualization_PyTorch.md) | 解析dump的精度数据,还原模型图结构,比对各层级精度数据,助力理解模型结构、分析精度问题。 | 1、整网精度比对定位可疑算子,通过浏览器展示比对结果,支持快速搜索到可疑算子<br>2、支持查看模型层级结果,比对模型层级结构差异 | 1、由于使用整网dump数据,定位的可疑算子受累计误差影响<br>2、当模型规模较大时,比对所需时间较长 |
+| [单API自动生成脚本<br>(generate_operator)](./23.generate_operator_PyTorch.md) | 解析dump的精度数据,提取可疑的API算子,自动生成单API复现脚本,并根据不同的API采用不同的比对算法,给定最终比对结果数据,帮助开发者分析算子精度问题。 | 1、该工具支持从整网dump下来的数据中提取可疑算子,并自动生成单API脚本<br>2、除了支持复现单API的前反向过程,同时会根据不同的API选择不同的比对方法,并给出比对结果 | 1、不支持通信算子<br>2、融合算子需手动修改脚本进行适配<br>3、目前比对的标杆均为和CPU进行比对,暂不支持直接NPU和GPU比对 |
+
+## 2 MindSpore框架
+
+| 功能名(英文) | 简介 | 适用场景/优势 | 当前版本局限性 |
+| ---- | ---- | ---- | ---- |
+| [数据采集<br>(dump)](./06.data_dump_MindSpore.md) | 采集模型训练过程中的API或Cell层级的前反向输入输出数据,包括层次关系、统计值信息、真实数据和调用栈等。 | 1、将模型中训练的API或Cell的前反向输入输出数据保存下来分析<br>2、模型出现溢出时,可用于查看哪些API或Cell出现了溢出 | 1、API级数据采集仅支持白名单列表上的API<br>2、当前对inplace操作API或Cell的支持度有限<br>3、暂不支持参数及参数梯度的采集 |
+| [离线预检<br>(api_accuracy_checker)](./09.accuracy_checker_MindSpore.md) | 为网络中每个API创建用例,检验其精度,并根据不同比对算法综合判定API在NPU上的精度是否达标,快速找出精度差异API。 | 1、对模型中所有的API做精度初步排查<br>2、精度排查不受模型累计误差影响 | 1、仅针对MindSpore.mint API |
+| [整网比对<br>(compare)](./11.accuracy_compare_MindSpore.md) | NPU精度数据与标杆数据的比对,支持MindSpore框架内和与PyTorch跨框架的比对,助力快速定位精度异常API或Cell。 | 1、MindSpore同框架静态图比对<br>2、MindSpore同框架动态图比对<br>3、MindSpore vs PyTorch跨框架动态图比对 | 1、部分PyTorch的API关联不到MindSpore,需要手动配置映射关系 |
+| [溢出检查<br>(overflow_checker)](./13.overflow_check_MindSpore.md) | 检测模型计算过程的输入输出,并在溢出时落盘数据,助力用户快速定位溢出位置。 | 1、当模型出现溢出时,可用于定位最先溢出的API或Cell或kernel<br>2、相比数据采集,性能更优,磁盘压力更小 | 1、除具有与数据采集功能相同的局限性外,动态图场景下,不支持 Primitive 和 Jit 类 API 的检测<br>2、动态图场景下,仅支持检测API或Cell级别溢出<br>3、静态图场景下,仅支持检测kernel级别溢出 |
+| [无标杆比对<br>(free_benchmark)](./16.free_benchmarking_MindSpore.md) | 不依赖标杆数据,通过对算子输入增加微小扰动,计算扰动后输出与原始输出的相对误差,识别有精度风险算子。 | 1、无标杆数据场景下的算子精度排查<br>2、对个别算子进行升精度修复,验证其对模型loss的影响 | 1、仅支持动态图场景<br>2、由于需要拷贝输入进行二次执行,所以在遇到大张量的输入时容易发生显存OOM的问题,特别是反向比对过程,建议结合白名单使用<br>3、比对会延长训练时间,整网比对可能会造成严重的耗时膨胀,建议结合白名单使用<br>4、不支持"to cpu"操作,不支持预热功能 |
+| [可视化比对<br>(visualization)](./22.visualization_MindSpore.md) | 解析dump的精度数据,还原模型图结构,比对各层级精度数据,助力理解模型结构、分析精度问题。 | 1、整网精度比对定位可疑算子,通过浏览器展示比对结果,支持快速搜索到可疑算子<br>2、支持查看模型层级结果,比对模型层级结构差异 | 1、由于使用整网dump数据,定位的可疑算子受累计误差影响<br>2、当模型规模较大时,比对所需时间较长 |
diff --git a/debug/accuracy_tools/msprobe/docs/26.data_dump_PyTorch_baseline.md b/debug/accuracy_tools/msprobe/docs/26.data_dump_PyTorch_baseline.md
new file mode 100644
index 0000000000000000000000000000000000000000..5ca199ab6171a3634af0b26844d6ba8e7d04933f
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/docs/26.data_dump_PyTorch_baseline.md
@@ -0,0 +1,37 @@
+# PyTorch 场景的精度数据采集基线
+
+## "tensor"模式采集数据量参考基线
+
+该基线为 PyTorch 框架下使用"tensor"模式采集时的数据量参考基线。本基线测试了 LLAMA2-7B 和 LLAMA2-13B 两个模型在不同采集模式、不同global_batch_size、单卡和8卡下数据量的变化。
+
+### LLAMA2-7B
+
| 采集模式 | global_batch_size | 单卡 | 8卡 |
| --- | --- | --- | --- |
| L0 | 1 | 7.8GB | 63GB |
| L0 | 2 | 16GB | 125GB |
| L0 | 3 | 24GB | 187GB |
| L1 | 1 | 300.8GB | 2.3TB |
| L1 | 2 | 480GB | 3.6TB |
| L1 | 3 | 640GB | 4.9TB |
| mix | 1 | 313.6GB | 2.4TB |
| mix | 2 | 512GB | 3.8TB |
| mix | 3 | 672GB | 5.1TB |
+ +### LLAMA2-13B + + + + + + + + + + + + + +
| 采集模式 | global_batch_size | 单卡 | 8卡 |
| --- | --- | --- | --- |
| L0 | 1 | 13GB | 97GB |
| L0 | 2 | 25GB | 194GB |
| L0 | 3 | 37GB | 291GB |
| L1 | 1 | 440GB | 3.4TB |
| L1 | 2 | 720GB | 5.4TB |
| L1 | 3 | 960GB | 7.3TB |
| mix | 1 | 480GB | 3.6TB |
| mix | 2 | 720GB | 5.6TB |
| mix | 3 | 1000GB | 7.7TB |
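
正式开启"tensor"模式采集前,可结合上表估算数据量并预留足够磁盘空间。下面给出一个检查磁盘剩余空间的最小示意脚本(基线数值摘自上表,`has_enough_space` 等名称为本文自拟的示例,并非 msprobe 提供的接口):

```python
import shutil

# 上表中的部分基线数据量(单位:GB),键为 (模型, 采集模式, global_batch_size, 卡数)
# 注意:数值摘自本文档的参考基线,仅作示例
TENSOR_DUMP_BASELINE_GB = {
    ("LLAMA2-7B", "L0", 1, 1): 7.8,
    ("LLAMA2-7B", "L1", 1, 8): 2.3 * 1024,
    ("LLAMA2-13B", "mix", 3, 8): 7.7 * 1024,
}

def has_enough_space(dump_path, required_gb, margin=1.2):
    """检查 dump_path 所在磁盘的剩余空间是否足够,margin 为预留余量倍数。"""
    free_gb = shutil.disk_usage(dump_path).free / 1024 ** 3
    return free_gb >= required_gb * margin

# 示例:8 卡 L1 模式采集 LLAMA2-7B 前,检查根分区剩余空间
required = TENSOR_DUMP_BASELINE_GB[("LLAMA2-7B", "L1", 1, 8)]
print(has_enough_space("/", required))
```

实际数据量受模型结构、序列长度等因素影响,建议先以少量 step 试采集,再按上表外推。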
\ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/docs/27.dump_json_instruction.md b/debug/accuracy_tools/msprobe/docs/27.dump_json_instruction.md new file mode 100644 index 0000000000000000000000000000000000000000..f994dc2301bcae6b23dc7a7503297aa4fe5b3724 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/27.dump_json_instruction.md @@ -0,0 +1,525 @@ +# dump.json文件说明及示例 + +## 1. dump.json文件示例(PyTorch) + +### 1.1 L0级别 +L0级别的dump.json文件包括模块的前反向的输入输出,以及模块的参数和参数梯度。以PyTorch的Conv2d模块为例,网络中模块调用代码为: +`output = self.conv2(input) # self.conv2 = torch.nn.Conv2d(64, 128, 5, padding=2, bias=True)` + +dump.json文件中包含以下数据名称: + +- `Module.conv2.Conv2d.forward.0`:模块的前向数据,其中input_args为模块的输入数据(位置参数),input_kwargs为模块的输入数据(关键字参数),output为模块的输出数据,parameters为模块的参数数据,包括权重(weight)和偏置(bias)。 +- `Module.conv2.Conv2d.parameters_grad`:模块的参数梯度数据,包括权重(weight)和偏置(bias)的梯度。 +- `Module.conv2.Conv2d.backward.0`:模块的反向数据,其中input为模块反向的输入梯度(对应前向输出的梯度),output为模块的反向输出梯度(对应前向输入的梯度)。 + +**说明**:当dump时传入的model参数为List[torch.nn.Module]或Tuple[torch.nn.Module]时,模块级数据的命名中包含该模块在列表中的索引index,命名格式为`{Module}.{index}.*`,*表示以上三种模块级数据的命名格式,例如:`Module.0.conv1.Conv2d.forward.0`。 + +```json +{ + "task": "tensor", + "level": "L0", + "framework": "pytorch", + "dump_data_dir": "/dump/path", + "data": { + "Module.conv2.Conv2d.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 1.638758659362793, + "Min": 0.0, + "Mean": 0.2544615864753723, + "Norm": 70.50277709960938, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.input.0.pt" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 1.6815717220306396, + "Min": -1.5120246410369873, + "Mean": -0.025344856083393097, + "Norm": 149.65576171875, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.output.0.pt" + } + ], + "parameters": { + 
"weight": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.05992485210299492, + "Min": -0.05999220535159111, + "Mean": -0.0006165213999338448, + "Norm": 3.421217441558838, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.parameters.weight.pt" + }, + "bias": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32 + ], + "Max": 0.05744686722755432, + "Min": -0.04894155263900757, + "Mean": 0.006410328671336174, + "Norm": 0.17263513803482056, + "requires_grad": true, + "data_name": "Module.conv2.Conv2d.forward.0.parameters.bias.pt" + } + } + }, + "Module.conv2.Conv2d.parameters_grad": { + "weight": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.018550323322415352, + "Min": -0.008627401664853096, + "Mean": 0.0006675920449197292, + "Norm": 0.26084786653518677, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.parameters_grad.weight.pt" + } + ], + "bias": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32 + ], + "Max": 0.014914230443537235, + "Min": -0.006656786892563105, + "Mean": 0.002657240955159068, + "Norm": 0.029451673850417137, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.parameters_grad.bias.pt" + } + ] + }, + "Module.conv2.Conv2d.backward.0": { + "input": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 0.0015069986693561077, + "Min": -0.001139344065450132, + "Mean": 3.3215508210560074e-06, + "Norm": 0.020567523315548897, + "requires_grad": false, + "data_name": "Module.conv2.Conv2d.backward.0.input.0.pt" + } + ], + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 0.0007466732058674097, + "Min": -0.00044813455315306783, + "Mean": 6.814070275140693e-06, + "Norm": 0.01474067009985447, + "requires_grad": false, + "data_name": 
"Module.conv2.Conv2d.backward.0.output.0.pt" + } + ] + } + } +} +``` + +### 1.2 L1级别 +L1级别的dump.json文件包括API的前反向的输入输出。以PyTorch的relu函数为例,网络中API调用代码为: +`output = torch.nn.functional.relu(input)` + +dump.json文件中包含以下数据名称: +- `Functional.relu.0.forward`:API的前向数据,其中input_args为API的输入数据(位置参数),input_kwargs为API的输入数据(关键字参数),output为API的输出数据。 +- `Functional.relu.0.backward`:API的反向数据,其中input为API的反向输入梯度(对应前向输出的梯度),output为API的反向输出梯度(对应前向输入的梯度)。 + +```json +{ + "task": "tensor", + "level": "L1", + "framework": "pytorch", + "dump_data_dir":"/dump/path", + "data": { + "Functional.relu.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": -1.3364859819412231, + "Mean": 0.03711778670549393, + "Norm": 236.20692443847656, + "requires_grad": true, + "data_name": "Functional.relu.0.forward.input.0.pt" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": 0.0, + "Mean": 0.16849493980407715, + "Norm": 175.23345947265625, + "requires_grad": true, + "data_name": "Functional.relu.0.forward.output.0.pt" + } + ] + }, + "Functional.relu.0.backward": { + "input": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00013352684618439525, + "Mean": 0.00011915402356442064, + "Norm": 0.007598237134516239, + "requires_grad": false, + "data_name": "Functional.relu.0.backward.input.0.pt" + } + ], + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00012117840378778055, + "Mean": 2.0098118724831693e-08, + "Norm": 0.006532244384288788, + "requires_grad": false, + "data_name": "Functional.relu.0.backward.output.0.pt" + } + ] + } + } +} +``` + +### 1.3 mix级别 + 
+mix级别的dump.json文件同时包括L0和L1级别的dump数据,文件格式与上述示例相同。 + +## 2. dump.json文件示例(MindSpore) + +### 2.1 L0级别 + +L0级别的dump.json文件包括模块的前反向的输入输出,以及模块的参数和参数梯度。 +以MindSpore的Conv2d模块为例,dump.json文件中使用的模块调用代码为: +`output = self.conv2(input) # self.conv2 = mindspore.nn.Conv2d(64, 128, 5, pad_mode='same', has_bias=True)` + +dump.json文件中包含以下数据名称: +- `Cell.conv2.Conv2d.forward.0`:模块的前向数据,其中input_args为模块的输入数据(位置参数),input_kwargs为模块的输入数据(关键字参数),output为模块的输出数据,parameters为模块的参数数据,包括权重(weight)和偏置(bias)。 +- `Cell.conv2.Conv2d.parameters_grad`:模块的参数梯度数据,包括权重(weight)和偏置(bias)的梯度。 +- `Cell.conv2.Conv2d.backward.0`:模块的反向数据,其中input为模块反向的输入梯度(对应前向输出的梯度),output为模块的反向输出梯度(对应前向输入的梯度)。 + +**说明**:当dump时传入的model参数为List[mindspore.nn.Cell]或Tuple[mindspore.nn.Cell]时,模块级数据的命名中包含该模块在列表中的索引index,命名格式为`{Cell}.{index}.*`,*表示以上三种模块级数据的命名格式,例如:`Cell.0.conv2.Conv2d.forward.0`。 + +```json +{ + "task": "tensor", + "level": "L0", + "framework": "mindspore", + "dump_data_dir": "/dump/path", + "data": { + "Cell.conv2.Conv2d.forward.0": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 1.638758659362793, + "Min": 0.0, + "Mean": 0.2544615864753723, + "Norm": 70.50277709960938, + "data_name": "Cell.conv2.Conv2d.forward.0.input.0.npy" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 1.6815717220306396, + "Min": -1.5120246410369873, + "Mean": -0.025344856083393097, + "Norm": 149.65576171875, + "data_name": "Cell.conv2.Conv2d.forward.0.output.0.npy" + } + ], + "parameters": { + "weight": { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.05992485210299492, + "Min": -0.05999220535159111, + "Mean": -0.0006165213999338448, + "Norm": 3.421217441558838, + "data_name": "Cell.conv2.Conv2d.forward.0.parameters.weight.npy" + }, + "bias": { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32 + 
], + "Max": 0.05744686722755432, + "Min": -0.04894155263900757, + "Mean": 0.006410328671336174, + "Norm": 0.17263513803482056, + "data_name": "Cell.conv2.Conv2d.forward.0.parameters.bias.npy" + } + } + }, + "Cell.conv2.Conv2d.parameters_grad": { + "weight": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32, + 16, + 5, + 5 + ], + "Max": 0.018550323322415352, + "Min": -0.008627401664853096, + "Mean": 0.0006675920449197292, + "Norm": 0.26084786653518677, + "data_name": "Cell.conv2.Conv2d.parameters_grad.weight.npy" + } + ], + "bias": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32 + ], + "Max": 0.014914230443537235, + "Min": -0.006656786892563105, + "Mean": 0.002657240955159068, + "Norm": 0.029451673850417137, + "data_name": "Cell.conv2.Conv2d.parameters_grad.bias.npy" + } + ] + }, + "Cell.conv2.Conv2d.backward.0": { + "input": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 8, + 32, + 10, + 10 + ], + "Max": 0.0015069986693561077, + "Min": -0.001139344065450132, + "Mean": 3.3215508210560074e-06, + "Norm": 0.020567523315548897, + "data_name": "Cell.conv2.Conv2d.backward.0.input.0.npy" + } + ], + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 8, + 16, + 14, + 14 + ], + "Max": 0.0007466732058674097, + "Min": -0.00044813455315306783, + "Mean": 6.814070275140693e-06, + "Norm": 0.01474067009985447, + "data_name": "Cell.conv2.Conv2d.backward.0.output.0.npy" + } + ] + } + } +} +``` + +### 2.2 L1级别 +L1级别的dump.json文件包括API的前反向的输入输出,以MindSpore的relu函数为例,网络中API调用代码为: + `output = mindspore.ops.relu(input)` + + dump.json文件中包含以下数据名称: +- `Functional.relu.0.forward`:API的前向数据,其中input_args为API的输入数据(位置参数),input_kwargs为API的输入数据(关键字参数),output为API的输出数据。 +- `Functional.relu.0.backward`:API的反向数据,其中input为API的反向输入梯度(对应前向输出的梯度),output为API的反向输出梯度(对应前向输入的梯度)。 + +```json +{ + "task": "tensor", + "level": "L1", + "framework": "mindspore", + "dump_data_dir":"/dump/path", + "data": { + 
"Functional.relu.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": -1.3364859819412231, + "Mean": 0.03711778670549393, + "Norm": 236.20692443847656, + "data_name": "Functional.relu.0.forward.input.0.npy" + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 1.3864083290100098, + "Min": 0.0, + "Mean": 0.16849493980407715, + "Norm": 175.23345947265625, + "data_name": "Functional.relu.0.forward.output.0.npy" + } + ] + }, + "Functional.relu.0.backward": { + "input": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00013352684618439525, + "Mean": 0.00011915402356442064, + "Norm": 0.007598237134516239, + "data_name": "Functional.relu.0.backward.input.0.npy" + } + ], + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 32, + 16, + 28, + 28 + ], + "Max": 0.0001815402356442064, + "Min": -0.00012117840378778055, + "Mean": 2.0098118724831693e-08, + "Norm": 0.006532244384288788, + "data_name": "Functional.relu.0.backward.output.0.npy" + } + ] + } + } +} +``` + +### 2.3 mix级别 +mix级别的dump.json文件同时包括L0和L1级别的dump数据,文件格式与上述示例相同。 diff --git a/debug/accuracy_tools/msprobe/docs/28.debugger_save_instruction.md b/debug/accuracy_tools/msprobe/docs/28.debugger_save_instruction.md new file mode 100644 index 0000000000000000000000000000000000000000..6f4d519d5f61d5efaaffe54a1bde4f140b539f72 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/28.debugger_save_instruction.md @@ -0,0 +1,94 @@ +# 单点保存工具 README + +## 简介 +L0, L1, mix dump存在盲区,网络中的非api/module的输入输出不会被批量dump下来。单点保存提供类似np.save和print的功能和使用体验,可以保存指定的变量。同时针对大模型场景进行了增强,具备以下特性: +- 可保存变量的反向梯度结果。 +- 能直接保存嵌套结构数据(如 list、dict),无需手动遍历。 +- 自动分 rank 保存。 +- 多次调用时会自动计数。 +- 可配置保存统计值或者张量。 + +## 支持场景 +仅支持 PyTorch 与 
MindSpore 的动态图场景。 + +## 使能方式 + +### 配置文件说明 + +通用配置: + +| 参数 | 解释 | 是否必选 | +| -------- |-------------------------------------------| -------- | +| task | dump 的任务类型,str 类型。 单点保存场景仅支持传入"statistics", "tensor"。 | 是 | +| level | dump 级别,str 类型,根据不同级别采集不同数据。单点保存场景传入"debug"。 | 是 | +| dump_path | 设置 dump 数据目录路径,str 类型。细节详见[通用配置说明](./02.config_introduction.md#11-通用配置) | 是 | +| rank | 指定对某张卡上的数据进行采集,list[Union[int, str]] 类型。细节详见[通用配置说明](./02.config_introduction.md#11-通用配置) | 否 | + +"statistics" 任务子配置项: +| 参数 | 解释 | 是否必选 | +| -------- |-------------------------------------------| -------- | +| summary_mode | 控制 dump 文件输出的模式,str 类型。支持传入"statistics", "md5"。 细节详见[statistics任务子配置项说明](./02.config_introduction.md#12-task-配置为-statistics) | 否 | + +"tensor" 任务无子配置项。 + +### 接口调用说明 + +调用 PrecisionDebugger.save 接口,传入需要保存的变量,指定变量名称以及是否需要保存反向数据。接口入参说明详见[pytorch单点保存接口](./05.data_dump_PyTorch.md#19-save)、[mindspore单点保存接口](./06.data_dump_MindSpore.md#615-save)。 + +### 示例(以 PyTorch 场景为例) + +配置文件 +```json +{ + "task": "statistics", + "dump_path": "./dump_path", + "rank": [], + "level": "debug", + "statistics": { + "summary_mode": "statistics" + } +} +``` + +初始化 +```python +# 训练启动py脚本 +from msprobe.pytorch import PrecisionDebugger +debugger = PrecisionDebugger("./config.json") +for data, label in data_loader: + # 执行模型训练 + train(data, label) + +``` + +初始化(无配置文件) +```python +# 训练启动py脚本 +from msprobe.pytorch import PrecisionDebugger +debugger = PrecisionDebugger(dump_path="dump_path", level="debug") +for data, label in data_loader: + # 执行模型训练 + train(data, label) + +``` + +调用保存接口 +```python +# 训练过程中被调用py文件 +from msprobe.pytorch import PrecisionDebugger +dict_variable = {"key1": "value1", "key2": [1, 2]} +PrecisionDebugger.save(dict_variable, "dict_variable", save_backward=False) + +``` + +## 输出结果 + * **"task" 配置为 "statistics" 场景** :在 dump 目录下会生成包含变量统计值信息的 `debug.json` 文件。 + * **"task" 配置为 "tensor" 场景** :除了在 dump 目录下生成包含变量统计值信息的 `debug.json` 文件外,还会在 dump 子目录 `dump_tensor_data` 中保存张量二进制文件,文件名称格式为 
`{variable_name}{grad_flag}.{count}.tensor.{indexes}.{file_suffix}`。 + + - variable_name: 传入save接口的变量名称。 + - grad_flag: 反向数据标识,反向数据为"_grad",正向数据为""。 + - count: 调用计数,多次以相同变量名称调用时的计数。 + - indexes: 索引,在保存嵌套结构数据时的索引。例如:嵌套结构为`{"key1": "value1", "key2": ["value2", "value3"]}`,"value2"的索引为"key2.0"。 + - file_suffix: 文件后缀,pytorch场景为"pt",mindspore场景为"npy"。 + + diff --git a/debug/accuracy_tools/msprobe/docs/28.kernel_dump_MindSpore.md b/debug/accuracy_tools/msprobe/docs/28.kernel_dump_MindSpore.md new file mode 100644 index 0000000000000000000000000000000000000000..6b8cc558aa22526158033cfb35f31203d8b04278 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/28.kernel_dump_MindSpore.md @@ -0,0 +1,69 @@ +# MindSpore 场景的 kernel dump 说明 + +当使用 msprobe 数据采集功能时,level 配置为 "L2" 表示采集 kernel 层级的算子数据,仅支持昇腾 NPU 平台。 + +本文主要介绍 kernel dump 的配置示例和采集结果,msprobe 数据采集功能的详细使用参考《[MindSpore 场景的精度数据采集](./06.data_dump_MindSpore.md)》。 + +## 1 kernel dump 配置示例 + +使用 kernel dump 时,list 中必须填写一个 API 名称,kernel dump 目前每个 step 只支持采集一个 API 的数据。 +API 名称填写参考 L1 dump 结果文件 dump.json 中的 API 名称,命名格式为:`{api_type}.{api_name}.{API调用次数}.{forward/backward}`。 + +```json +{ + "task": "tensor", + "dump_path": "/home/data_dump", + "level": "L2", + "rank": [], + "step": [], + "tensor": { + "scope": [], + "list": ["Functional.linear.0.backward"] + } +} +``` + +## 2 结果文件介绍 + +### 2.1 采集结果说明 + +如果 API kernel 级数据采集成功,会打印以下信息: + +```bash +The kernel data of {api_name} is dumped successfully. +``` + +注意:如果打印该信息后,没有数据生成,参考**常见问题3.1**进行排查。 + +如果 kernel dump 遇到不支持的 API,会打印以下信息: + +```bash +The kernel dump does not support the {api_name} API. +``` + +其中 {api_name} 是不支持的 API 名称。 + +### 2.2 输出文件说明 +kernel dump 采集成功后,会在指定的 dump_path 目录下生成如下文件: + +``` +├── /home/data_dump/ +│ ├── step0 +│ │ ├── 20241201103000 # 日期时间格式,表示2024-12-01 10:30:00 +│ │ │ ├── 0 # 表示 device id +│ │ │ │ ├──{op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} # kernel 层算子数据 +│ │ │ ... 
+│ │ ├── kernel_config_{device_id}.json # kernel dump 在接口调用过程中生成的中间文件,一般情况下无需关注 +│ │ ... +│ ├── step1 +│ ... +``` +成功采集到数据后,可以使用 msprobe 工具提供的《[PyTorch 场景的数据解析](./14.data_parse_PyTorch.md)》功能分析数据。 + +## 3 常见问题 + +#### 3.1 采集结果文件为空,有可能是什么原因? + +1. 首先需要确认工具使用方式、配置文件内容、list 填写的 API 名称格式是否都正确无误。 + +2. 其次需要确认 API 是否运行在昇腾 NPU 上,如果是运行在其他设备上则不会存在 kernel 级数据。 diff --git a/debug/accuracy_tools/msprobe/docs/FAQ.md b/debug/accuracy_tools/msprobe/docs/FAQ.md index fd96525833865166e4ae484b270e19d5703b7c56..833ca07a236f33e69b102d4acb45d35cd6fe7e3a 100644 --- a/debug/accuracy_tools/msprobe/docs/FAQ.md +++ b/debug/accuracy_tools/msprobe/docs/FAQ.md @@ -10,6 +10,32 @@ - 输入参数或输出参数类型当前工具不支持,会有日志打印提醒。 - 输入或者输出tensor的dtype为bool时,Mean和Norm等字段为null。 +2. 如果存在namedtuple类型的数据作为nn.Module的输出,工具会将各字段数据dump下来,但是输出数据类型会被转成tuple,原因是什么? + - 这是由于pytorch框架自身,在注册module的backward hook时,会将namedtuple类型转成tuple类型。 + +3. 如果某个api在dump支持列表support_wrap_ops.yaml中,但没有dump该api的数据,原因是什么? + - 首先确认api调用是否在采集范围内,即需要在 **start** 和 **stop** 接口涵盖的范围内。 + - 其次,由于工具只在被调用时才对api进行patch,从而使得数据可以被dump下来。因此当api是被直接import进行调用时,由于该api的地址已经确定, + 工具无法再对其进行patch,故而该api数据无法被dump下来。如下示例,relu将无法被dump: + ```python + import torch + from torch import relu # 此时relu地址已经确定,无法修改 + + from msprobe.pytorch import PrecisionDebugger + + debugger = PrecisionDebugger(dump_path="./dump_data") + x = torch.randn(10) + debugger.start() # 此时会对torch下面的api进行patch,但已无法对import进来的api进行patch了 + x = relu(x) + debugger.stop() + ``` + 在上述场景中,若希望采集relu数据,只需要将`relu(x)`修改为`torch.relu(x)`即可。 + +4. 在使用L0 dump时,发现有些 module 的数据没有采集下来,原因是什么? + - 确认日志打印中是否存在`The {module_name} has registered deprecated register_backward_hook`信息, + 该信息说明 module 挂载了被 PyTorch 框架废弃的 register_backward_hook,这与工具使用的 register_full_backward_hook 接口会产生冲突,故工具会跳过该 module 的反向数据采集。 + - 如果您希望所有 module 数据都能采集下来,可以将模型中使用的 register_backward_hook 接口改为 PyTorch 框架推荐的 register_full_backward_pre_hook 或 register_full_backward_hook 接口。 + # 2 精度预检(PyTorch) 1. 预检工具在 dump 和 run_ut 的过程中,是否需要同时开启或关闭 jit 编译(jit_compile)? 
@@ -180,9 +206,10 @@ def npu_forward_fused_softmax(self, input_, mask): 答:注释工具目录 `mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml` 文件中 `Tensor: ` 下的 `- __getitem__`,工具会跳过采集该 API。如果是需要采集关键位置 API 也可以考虑根据报错堆栈信息注释引发报错的类型检查。 -11. 添加 msprobe 工具后 F.gelu 触发 ValueError 报错:`activation_func must be F.gelu` 等。以及采集 Megatron 数据时报错:`ValueError(Only support fusion of gelu and swiglu)`。 +11. 使用 msprobe 工具数据采集功能后,模型出现报错,报错信息为:`activation_func must be F.gelu` 或 `ValueError(Only support fusion of gelu and swiglu)`。 - 答:这一类问题是因为工具本身封装了 torch 算子,所以校验算子名时会报错。注释 `mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml` 文件中的 `-gelu` 或者 `-silu`,工具会跳过采集该 API。如果需要采集关键位置 API 也可以考虑根据报错堆栈信息注释引发报错的类型检查。 + 答:这一类报错常见于 Megatron/MindSpeed/ModelLink 等加速库或模型仓中。原因是工具会封装 torch 的 API(API 的类型和地址会发生改变),而有些 API 在工具使能前类型和地址就已经确定,工具无法再对其进行封装;加速库会对某些 API 做类型检查,即将工具无法封装的原始 API 与工具封装之后的 API 进行比较,因此会报错。 + 规避方式有 3 种:①将 PrecisionDebugger 的实例化放在文件的开始位置(即导包后的位置),确保所有 API 都被封装;②注释 `mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml` 文件中的 `-gelu` 或者 `-silu`,工具会跳过采集该 API;③根据报错堆栈信息,注释引发报错的类型检查。 12. 
添加 msprobe 工具后触发与 AsStrided 算子相关、或者编译相关的报错,如:`Failed to compile Op [AsStrided]`。 diff --git a/debug/accuracy_tools/msprobe/docs/accuracy_checker_MindSpore/accuracy_checker_MindSpore_baseline.md b/debug/accuracy_tools/msprobe/docs/accuracy_checker_MindSpore/accuracy_checker_MindSpore_baseline.md new file mode 100644 index 0000000000000000000000000000000000000000..855d9b51370894ac321b6f2a02b66794d12a692e --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/accuracy_checker_MindSpore/accuracy_checker_MindSpore_baseline.md @@ -0,0 +1,14 @@ +# MindSpore 场景的精度预检基线 + +## "multi_run_ut"模式精度预检耗时参考基线 + +该基线为MindSpore框架下,使用"multi_run_ut"模式精度预检耗时参考基线。本基线测试了38B语言大模型在不同卡数下耗时的变化。 + +### 38B语言大模型 + +| 卡数 | 总耗时 (分钟) | 备注 | +| ----- |----------|---------- | +| 1 卡 | 21.0 | 单卡基线 | +| 2 卡 | 11.5 | 双卡基线 | +| 4 卡 | 6.7 | 四卡基线 | +| 8 卡 | 3.5 | 八卡基线 | diff --git a/debug/accuracy_tools/msprobe/docs/data_dump_MindSpore/data_dump_MindSpore_baseline.md b/debug/accuracy_tools/msprobe/docs/data_dump_MindSpore/data_dump_MindSpore_baseline.md new file mode 100644 index 0000000000000000000000000000000000000000..0a76c51d71d77c9cbc86d98600203e6faa71a0f6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/data_dump_MindSpore/data_dump_MindSpore_baseline.md @@ -0,0 +1,22 @@ +# MindSpore 场景的精度数据采集基线 + +## "tensor"模式采集数据量参考基线 + +该基线为MindSpore框架下,使用"tensor"模式采集数据量参考基线。本基线测试了38B语言大模型在不同采集模式下,不同global_batch_size下,单卡和8卡下,数据量的变化。 + +### 38B语言大模型 + + + + + + + + + + + + + +
| 采集模式 | global_batch_size | 单卡 | 8卡 |
| --- | --- | --- | --- |
| L0 | 1 | 262GB | 2.1TB |
| L0 | 2 | 480GB | 3.8TB |
| L0 | 3 | 928GB | 7.4TB |
| L1 | 1 | 2.1TB | 17.1TB |
| L1 | 2 | 2.8TB | 22.7TB |
| L1 | 3 | 4.2TB | 34.3TB |
| mix | 1 | 2.4TB | 19.2TB |
| mix | 2 | 3.3TB | 26.6TB |
| mix | 3 | 5.1TB | 41.4TB |
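
上表中的数据量以 GB 或 TB 为单位,若需在脚本中对不同采集模式的数据量做比较或汇总,可先将容量字符串统一换算为 GB。以下为一个最小示意(`parse_size` 为本文自拟的示例函数,并非 msprobe 提供的接口):

```python
# 将 "262GB"、"2.1TB" 这类容量字符串统一换算为以 GB 为单位的数值
UNITS_IN_GB = {"TB": 1024, "GB": 1, "T": 1024, "G": 1}

def parse_size(text):
    """解析容量字符串,返回以 GB 为单位的浮点数。"""
    for unit in ("TB", "GB", "T", "G"):  # 先匹配较长的单位后缀
        if text.endswith(unit):
            return float(text[: -len(unit)]) * UNITS_IN_GB[unit]
    raise ValueError(f"无法识别的容量字符串: {text}")

# 示例:global_batch_size=1 时,L1 模式单卡数据量约为 L0 模式的多少倍
ratio = parse_size("2.1TB") / parse_size("262GB")
print(round(ratio, 1))  # 约 8.2
```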
+ diff --git a/debug/accuracy_tools/msprobe/docs/data_dump_MindSpore/dynamic_graph_quick_start_example.md b/debug/accuracy_tools/msprobe/docs/data_dump_MindSpore/dynamic_graph_quick_start_example.md new file mode 100644 index 0000000000000000000000000000000000000000..543d260650361431ffb8b5142ae3df6b09d0db1d --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/data_dump_MindSpore/dynamic_graph_quick_start_example.md @@ -0,0 +1,211 @@ +# 动态图精度数据采集快速入门示例 + +本示例将展示如何在 MindSpore 动态图模式下使用 msprobe 工具进行精度数据采集。 + +## 1. 配置文件 + +请在当前目录下创建一个名为 `config.json` 的配置文件,内容如下: + +```json +{ + "task": "statistics", + "dump_path": "./output", + "rank": [], + "step": ["0-2"], + "level": "L1", + "statistics": { + "scope": [], + "list": [], + "data_mode": [ + "all" + ], + "summary_mode": "statistics" + } +} + +``` +以上配置参数详细介绍和使用请参见[《config.json 配置文件介绍》](../02.config_introduction.md)和[《config.json 配置示例》](../03.config_examples.md#3-mindspore-动态图场景) 中的“MindSpore动态图场景”。 + +## 2. 模型脚本 + +在当前目录下创建一个 Python 脚本文件,例如 `alexnet_model.py`,将以下代码粘贴进去: + +```python +import os +import numpy as np +import mindspore as ms +from mindspore import nn, ops +from mindspore import context +from mindspore import Tensor +from msprobe.mindspore import PrecisionDebugger, seed_all + +# 设置随机种子以确保结果可重现 +seed_all(seed=1234, mode=False, rm_dropout=True) + +# 配置文件路径 +script_dir = os.path.dirname(os.path.abspath(__file__)) +config_path = os.path.join(script_dir, 'config.json') + +# 初始化精度调试器 +debugger = PrecisionDebugger(config_path=config_path) + +# 设置 MindSpore 设备上下文 +context.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend", device_id=0) + +# 定义卷积层 +def conv_layer(in_channels, out_channels, kernel_size, stride=1, padding=0, pad_mode="valid", has_bias=True): + return nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, + has_bias=has_bias, pad_mode=pad_mode) + +# 定义全连接层 +def fc_layer(input_channels, out_channels, has_bias=True): + return nn.Dense(input_channels, 
out_channels, has_bias=has_bias) + + +class AlexNet(nn.Cell): + """ + AlexNet 模型定义 + + 参数: + - num_classes: 分类数量 + - channel: 输入通道数(图像的颜色通道数) + - phase: 模型运行阶段('train' 或 'test') + - include_top: 是否包含全连接层的顶部(最后的分类层) + """ + def __init__(self, num_classes=10, channel=3, phase='train', include_top=True): + super(AlexNet, self).__init__() + + # 卷积层 + self.conv1 = conv_layer(channel, 64, 11, stride=4, pad_mode="same") + self.conv2 = conv_layer(64, 128, 5, pad_mode="same") + self.conv3 = conv_layer(128, 192, 3, pad_mode="same") + self.conv4 = conv_layer(192, 256, 3, pad_mode="same") + self.conv5 = conv_layer(256, 256, 3, pad_mode="same") + + # 激活函数和池化层 + self.relu = nn.ReLU() + self.max_pool2d = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode='valid') + + # 如果包括顶部(全连接层) + self.include_top = include_top + if self.include_top: + self.flatten = nn.Flatten() + self.fc1 = fc_layer(256 * 28 * 28, 4096) + self.fc2 = fc_layer(4096, 4096) + self.fc3 = fc_layer(4096, num_classes) + + # 数学操作 + self.add = ops.Add() + self.mul = ops.Mul() + + def construct(self, x): + """定义前向传播过程""" + + x = self.conv1(x) + x = self.add(x, 0.1) # 偏置加法 + x = self.mul(x, 2.0) # 乘法操作 + x = self.relu(x) # ReLU 激活函数 + x = ops.celu(x) + x = x + 2 + + # 打印每层输出形状,调试时可使用 + print(f"After Conv1: {x.shape}") + + x = self.max_pool2d(x) # Max pooling 操作 + print(f"After MaxPool: {x.shape}") # 打印池化后的形状 + + x = self.conv2(x) + x = self.relu(x) + + x = self.conv3(x) + x = self.relu(x) + + x = self.conv4(x) + x = self.relu(x) + + x = self.conv5(x) + x = self.relu(x) + + # 打印卷积层后的形状,调试时使用 + print(f"After Conv5: {x.shape}") + + # 可选的全连接层部分 + if self.include_top: + x = self.flatten(x) + x = self.fc1(x) + x = self.fc2(x) + x = self.fc3(x) + + return x + +# 前向函数 +def forward_fn(data, label): + out = net(data) + loss = criterion(out, label) + return loss + +# 训练步骤 +def train_step(data, label): + loss, grads = grad_fn(data, label) + optimizer(grads) + return loss + +# 测试模型 +if __name__ == "__main__": + net = AlexNet() + 
optimizer = nn.SGD(net.trainable_params(), learning_rate=0.01) + criterion = nn.MSELoss() + + grad_fn = ms.value_and_grad(forward_fn, None, optimizer.parameters) + + # 生成数据和标签 + batch_size = 1 + num_classes = 10 + data = np.random.normal(1, 1, (batch_size, 3, 227, 227)).astype(np.float32) + label = np.random.randint(0, num_classes, (batch_size,)).astype(np.float32) # 注意此处类型应为 float32 + + # 转换为 MindSpore 张量 + data = Tensor(data) + label = Tensor(label) + + steps = 5 + for i in range(steps): + debugger.start(net) # 启动调试器 + loss = train_step(data, label) # 执行训练步骤 + print(f"Step {i}, Loss: {loss}") + debugger.stop() # 停止调试器 + debugger.step() # 计数步数 +``` + +## 3. 运行训练脚本 + +在命令行中执行以下命令: + +```bash +python alexnet_model.py +``` + +## 4. 查看采集结果 + +执行训练命令后,工具会将模型训练过程中的精度数据采集下来。 + +日志中打印出现如下信息表示数据采集成功,即可手动停止模型训练查看采集数据。 + +```markdown +**************************************************************************** +* msprobe ends successfully. * +**************************************************************************** +``` + +## 5. 
数据分析 + +在 `dump_path` 参数指定的路径下(本例中为 `./output`),会出现如下目录结构,后续精度数据分析操作可使用 msprobe 工具的精度预检和精度比对等功能,详细流程请参见[《msprobe使用手册》](../../README.md#2-精度预检)。: + +```bash +output/ +└── step0 + └── rank + ├── construct.json # level为L0时,保存Cell的层级关系信息。当前场景为空 + ├── dump.json # 保存API前反向输入输出数据的统计量信息 + └── stack.json # 保存API的调用栈 +``` \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/docs/img/compare_result.png b/debug/accuracy_tools/msprobe/docs/img/compare_result.png new file mode 100644 index 0000000000000000000000000000000000000000..b6d7ec6dfcbc44b4b7056e1297a481f495ceb86e Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/compare_result.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/merge_result.png b/debug/accuracy_tools/msprobe/docs/img/merge_result.png new file mode 100644 index 0000000000000000000000000000000000000000..a8c97a9f619206ae53a86df2013c2ba19202713b Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/merge_result.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/monitor/step_count_per_record.png b/debug/accuracy_tools/msprobe/docs/img/monitor/step_count_per_record.png new file mode 100644 index 0000000000000000000000000000000000000000..9347d3ecae01b6d4717db8fcdd6c40b6766fa908 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/monitor/step_count_per_record.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/fuzzy_match_ms.png b/debug/accuracy_tools/msprobe/docs/img/visualization/fuzzy_match_ms.png new file mode 100644 index 0000000000000000000000000000000000000000..130fd83094bb37d2ea050a5fdd59b64c2ce0eaff Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/fuzzy_match_ms.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/fuzzy_match_pt.png b/debug/accuracy_tools/msprobe/docs/img/visualization/fuzzy_match_pt.png new file mode 100644 index 
0000000000000000000000000000000000000000..b818388388519414378a64c3b17a4d08c79a9c6a Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/fuzzy_match_pt.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/tensorboard_1.png b/debug/accuracy_tools/msprobe/docs/img/visualization/tensorboard_1.png new file mode 100644 index 0000000000000000000000000000000000000000..e99ff9ea47f0e3eb25ec324640589248398a7f5a Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/tensorboard_1.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/tensorboard_2.png b/debug/accuracy_tools/msprobe/docs/img/visualization/tensorboard_2.png new file mode 100644 index 0000000000000000000000000000000000000000..ed8b024a4b811bb24e9ae23f1b0ca8d04e229992 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/tensorboard_2.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/vis_browser_1.png b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_browser_1.png new file mode 100644 index 0000000000000000000000000000000000000000..96e8521fde4b776ba915a00b5d77851b8406c153 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_browser_1.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/vis_browser_2.png b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_browser_2.png new file mode 100644 index 0000000000000000000000000000000000000000..1cc076fc66bf6fde67b92cd5f619bca74c4b840a Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_browser_2.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/vis_precision_info.png b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_precision_info.png new file mode 100644 index 0000000000000000000000000000000000000000..ddd59b37f044fe64c02148b698b95296592e0399 Binary files /dev/null and 
b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_precision_info.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/vis_search_info.png b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_search_info.png new file mode 100644 index 0000000000000000000000000000000000000000..7c55b33840163c388f8fde69f0bbc531b23f81f6 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_search_info.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/vis_show_info.png b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_show_info.png new file mode 100644 index 0000000000000000000000000000000000000000..9a6217e04848e671d784ed0b484d2fe10151bde7 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_show_info.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/vis_showcase.png b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_showcase.png new file mode 100644 index 0000000000000000000000000000000000000000..e95b5eeee663d91a67b1ace422c8681797ca96c1 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_showcase.png differ diff --git a/debug/accuracy_tools/msprobe/docs/img/visualization/vis_unmatch_info.png b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_unmatch_info.png new file mode 100644 index 0000000000000000000000000000000000000000..e4c9ed4306f9a7b20d031d32f18c815628030da6 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/img/visualization/vis_unmatch_info.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/GPTModel.png b/debug/accuracy_tools/msprobe/docs/visualization/GPTModel.png new file mode 100644 index 0000000000000000000000000000000000000000..71c1ff2e5bd9a38489d6ff0b7365936508660fec Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/GPTModel.png differ diff --git 
a/debug/accuracy_tools/msprobe/docs/visualization/ParallelMLP.png b/debug/accuracy_tools/msprobe/docs/visualization/ParallelMLP.png new file mode 100644 index 0000000000000000000000000000000000000000..d76650c9103c2d81d6b07458832945a237b43acc Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/ParallelMLP.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/layer_mapping_example.md b/debug/accuracy_tools/msprobe/docs/visualization/layer_mapping_example.md new file mode 100644 index 0000000000000000000000000000000000000000..35acb23ab4c44a763909fce40ae5cec136b584f6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/docs/visualization/layer_mapping_example.md @@ -0,0 +1,132 @@ +# 模型分级可视化如何配置layer mapping映射文件 + +## 1.使用场景 +同框架跨套件比对(例如PyTorch DeepSpeed vs Megatron),或者跨框架比对(例如PyTorch vs MindSpore),**由于代码实现的差异,导致一些模型层级和层级命名有所不同无法进行匹配**,需要进行layer层名称映射,才能够比对。 + +## 2.模块命名说明 + +由于有些节点的名称比较长,例如Module.module.module.language_model.embedding.Embedding.forward.0,在图节点上由于字符串过长无法完整显示,forward或backward信息被省略,**因此节点中显示的名称字符串去掉了Module前缀,并将forward或backward信息提取到名称字符串的第二位展示**。 + +![module_name.png](./module_name.png) + +![module_name1.png](./module_name1.png) + +### 2.1 命名格式 + +**{Module}.{module_name}.{class_name}.{forward/backward}.{调用次数}** + +**layer mapping主要是针对module_name的映射** + +#### 2.1.1 命名示例 + +- **Module.module.Float16Module.forward.0** -----> Module{**Module**}.module{**module_name**}.Float16Module{**class_name**}.forward.0{**调用次数**} +- **Module.module.module.GPTModel.forward.0** -----> Module{**Module**}.module.module{**module_name**}.GPTModel{**class_name**}.forward.0{**调用次数**} +- **Module.module.module.language_model.TransformerLanguageModel.forward.0** -----> Module{**Module**}.module.module.language_model{**module_name**}.TransformerLanguageModel{**class_name**}.forward.0{**调用次数**} +- **Module.module.module.language_model.embedding.Embedding.forward.0** -----> 
Module{**Module**}.module.module.language_model.embedding{**module_name**}.Embedding{**class_name**}.forward.0{**call count**}
+
+As the model hierarchy deepens, module_name grows longer: **the embedding layer's module_name concatenates its parent language_model, its grandparent module, and the top-level module**.
+
+## 3. Example
+
+In the figure below, the NPU model is on the left and the GPU model on the right. Implementation differences make the model hierarchies and level names differ, so the nodes cannot be matched. **Unmatched nodes are shown in gray on the graph.**
+
+![no_mapping.png](./no_mapping.png)
+
+### 3.1 Reading the graph
+
+When the same model is built with different suites or frameworks, the hierarchies and level names of the two models may differ, but the **node names** on the graph still reveal matches. For example, an embedding layer is typically named xxx_embedding in code rather than xxx_norm, so its node name also carries the embedding hint, and the hierarchies are roughly alike.
+
+![no_mapping_analyze.png](./no_mapping_analyze.png)
+
+The analysis gives the following node matches:
+
+**Note: only the module_name differences matter.**
+
+| NPU node name | GPU node name | module_name difference |
+|-------------------|----------------------------------------------------------------|---------------------------|
+| Module.module.Float16Module.forward.0 | Module.model.FloatModule.forward.0 | NPU: module, GPU: model |
+| Module.module.module.GPTModel.forward.0 | Module.model.module.GPT2Model.forward.0 | Both: module, no difference |
+| Module.module.module.language_model.TransformerLanguageModel.forward.0 | None | NPU has one extra level |
+| Module.module.module.language_model.embedding.Embedding.forward.0 | Module.module.module.embedding.LanguageModelEmbedding.forward.0 | NPU: language_model.embedding, GPU: embedding |
+| Module.module.module.language_model.rotary_pos_emb.RotaryEmbedding.forward.0 | Module.module.module.rotary_pos_emb.RotaryEmbedding.forward.0 | NPU: language_model.rotary_pos_emb, GPU: rotary_pos_emb |
+| Module.module.module.language_model.encoder.ParallelTransformer.forward.0 | Module.module.module.decoder.TransformerBlock.forward.0 | NPU: language_model.encoder, GPU: decoder |
+| Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0 | Module.module.module.decoder.layers.0.TransformerLayer.forward.0 | Parent levels differ; both are named layers at this level, no difference |
+
+### 3.2 Building the layer_mapping file
+Prepare a file named mapping.yaml and define the **module_name** mappings.
+
+#### 3.2.1 Top-level module mapping
+The modules Module.module.Float16Module.forward.0 (NPU side) and Module.model.FloatModule.forward.0 (GPU side) sit at the top of the graph and are configured as follows:
+
+![top_layer.png](./top_layer.png)
+
+```yaml
+TopLayer:
+  module: model
+```
+
+#### 3.2.2 Mapping other modules
+Configure the submodules under module. Although the class_name differs on the two sides (GPTModel on the NPU side, GPT2Model on the GPU side), **only the class_name from the NPU side, i.e. the left-hand graph, is used in the configuration; the class_name in the right-hand graph does not matter**.
+
+**This involves a cross-level mapping, since the NPU side has an extra language_model level.** Use language_model as the prefix for the embedding, rotary_pos_emb and encoder layers, as follows:
+
+![GPTModel.png](./GPTModel.png)
+
+```yaml
+GPTModel:
+  language_model.embedding: embedding
+  language_model.rotary_pos_emb: rotary_pos_emb
+  language_model.encoder: decoder
+```
+Then look at the submodules under the Module.module.module.language_model.encoder.ParallelTransformer.forward.0 level:
+
+The layers below it are named layers on both the NPU and GPU sides. **When the names at the current level are identical, no configuration is needed.**
+
+### 3.3 Checking the result
+
+Run the command with -lm specified:
+```
+msprobe -f pytorch graph -i ./compare.json -o ./output -lm ./mapping.yaml
+```
+or
+```
+msprobe -f mindspore graph -i ./compare.json -o ./output -lm ./mapping.yaml
+```
+Except for the language_model level (the extra NPU level, which has no GPU counterpart), all the levels configured in mapping.yaml are now matched.
+
+![mapping.png](./mapping.png)
+
+### 3.4 Further configuration
+
+If unmatched nodes remain while expanding nodes, keep extending mapping.yaml
+
+![no_mapping1.png](./no_mapping1.png)
+
+Analyze and configure as in the previous section. The node matches are:
+
+| NPU node name | GPU node name | Difference |
+|-------------------|------------------------------------------------------------------|---------------------------------------------|
+| Module.module.module.language_model.encoder.layers.0.mlp.dense_h_to_4h.ColumnParallelLinear.forward.0 | Module.module.module.decoder.layers.0.mlp.linear_fc1.TELayerNormColumnParallelLinear.forward.0 | NPU: dense_h_to_4h, GPU: linear_fc1 |
+| Module.module.module.language_model.encoder.layers.0.mlp.dense_4h_to_h.RowParallelLinear.forward.0 | Module.module.module.decoder.layers.0.mlp.linear_fc2.TERowParallelLinear.forward.0 | NPU: dense_4h_to_h, GPU: linear_fc2 |
+
+![ParallelMLP.png](./ParallelMLP.png)
+
+Append to the mapping.yaml configuration:
+
+```yaml
+TopLayer:
+  module: model
+
+GPTModel:
+  language_model.embedding: embedding
+ 
language_model.rotary_pos_emb: rotary_pos_emb
+  language_model.encoder: decoder
+
+ParallelMLP:
+  dense_h_to_4h: linear_fc1
+  dense_4h_to_h: linear_fc2
+```
+
+Run the command again and check the result: the nodes are now matched.
+
+![mapping1.png](./mapping1.png)
diff --git a/debug/accuracy_tools/msprobe/docs/visualization/mapping.png b/debug/accuracy_tools/msprobe/docs/visualization/mapping.png new file mode 100644 index 0000000000000000000000000000000000000000..fb03d85fab802ed881b75b5eba67bff815f97b30 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/mapping.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/mapping1.png b/debug/accuracy_tools/msprobe/docs/visualization/mapping1.png new file mode 100644 index 0000000000000000000000000000000000000000..1ec713f29ca812b900adcab22c198e8705bbe1bb Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/mapping1.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/module_name.png b/debug/accuracy_tools/msprobe/docs/visualization/module_name.png new file mode 100644 index 0000000000000000000000000000000000000000..8e959dc7ce8d9c8e5dec72853b02e29bdaa2389c Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/module_name.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/module_name1.png b/debug/accuracy_tools/msprobe/docs/visualization/module_name1.png new file mode 100644 index 0000000000000000000000000000000000000000..764fa08166050123e12ebd87d9a4012a64d688bb Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/module_name1.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/no_mapping.png b/debug/accuracy_tools/msprobe/docs/visualization/no_mapping.png new file mode 100644 index 0000000000000000000000000000000000000000..47693dc78cdf0c205184f7be2bf2cd196a4b5ce8 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/no_mapping.png differ diff --git
a/debug/accuracy_tools/msprobe/docs/visualization/no_mapping1.png b/debug/accuracy_tools/msprobe/docs/visualization/no_mapping1.png new file mode 100644 index 0000000000000000000000000000000000000000..88f8dc9e7aa89775f2a33c6c41d314a60af8ab76 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/no_mapping1.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/no_mapping_analyze.png b/debug/accuracy_tools/msprobe/docs/visualization/no_mapping_analyze.png new file mode 100644 index 0000000000000000000000000000000000000000..f9ff18681a2b99507658152921a38dc00e1a2918 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/no_mapping_analyze.png differ diff --git a/debug/accuracy_tools/msprobe/docs/visualization/top_layer.png b/debug/accuracy_tools/msprobe/docs/visualization/top_layer.png new file mode 100644 index 0000000000000000000000000000000000000000..9f2482969c6b5340e15251a794b339858e010ba4 Binary files /dev/null and b/debug/accuracy_tools/msprobe/docs/visualization/top_layer.png differ diff --git a/debug/accuracy_tools/msprobe/mindspore/__init__.py b/debug/accuracy_tools/msprobe/mindspore/__init__.py index e3ef097eb046c1d694b5bf7cf000f8c5753fed1e..089c29eb098ad4305edcca1306462f8924dd9291 100644 --- a/debug/accuracy_tools/msprobe/mindspore/__init__.py +++ b/debug/accuracy_tools/msprobe/mindspore/__init__.py @@ -13,5 +13,16 @@ # See the License for the specific language governing permissions and # limitations under the License. +import os + +try: + from msprobe.lib import _msprobe_c + os.environ["MS_HOOK_ENABLE"] = "on" + os.environ["HOOK_TOOL_PATH"] = _msprobe_c.__file__ +except ImportError: + from .common.log import logger + logger.info("Module _msprobe_c has not been installed. 
L2-Dump may not work normally.") + from msprobe.mindspore.debugger.precision_debugger import PrecisionDebugger from msprobe.mindspore.common.utils import seed_all +from msprobe.mindspore.monitor.module_hook import TrainerMon \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_accuracy_checker.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_accuracy_checker.py index e3dc5ed7134aad9d589d45041e042fff7767e452..557d731e042913da3a622035219ec8dea0409ab4 100644 --- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_accuracy_checker.py +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_accuracy_checker.py @@ -13,18 +13,24 @@ # See the License for the specific language governing permissions and # limitations under the License. -import json import os +from tqdm import tqdm -from msprobe.core.common.const import Const, CompareConst, MsCompareConst -from msprobe.core.common.file_utils import FileOpen, create_directory, write_csv +from msprobe.core.common.const import Const, CompareConst +from msprobe.core.common.file_utils import FileOpen, create_directory, write_csv, load_json, load_yaml from msprobe.core.common.utils import add_time_as_suffix from msprobe.mindspore.api_accuracy_checker.api_info import ApiInfo from msprobe.mindspore.api_accuracy_checker.api_runner import api_runner, ApiInputAggregation from msprobe.mindspore.api_accuracy_checker.base_compare_algorithm import compare_algorithms +from msprobe.mindspore.api_accuracy_checker.data_manager import DataManager from msprobe.mindspore.api_accuracy_checker.utils import (check_and_get_from_json_dict, global_context, trim_output_compute_element_list) +from msprobe.mindspore.common.const import MsCompareConst from msprobe.mindspore.common.log import logger +from msprobe.mindspore.api_accuracy_checker import torch_mindtorch_importer + +cur_path = os.path.dirname(os.path.realpath(__file__)) +yaml_path = 
os.path.join(cur_path, MsCompareConst.SUPPORTED_API_LIST_FILE) class BasicInfoAndStatus: @@ -46,14 +52,21 @@ class ResultCsvEntry: self.overall_err_msg = None +class ProcessResultPacket: + def __init__(self, process_status, result, err_msg) -> None: + self.process_status = process_status + self.result = result + self.err_msg = err_msg + + class ApiAccuracyChecker: - def __init__(self): + def __init__(self, args): self.api_infos = dict() - self.results = dict() + self.data_manager = DataManager(args.out_path, args.result_csv_path) # 在初始化时实例化 DataManager @staticmethod def run_and_compare_helper(api_info, api_name_str, api_input_aggregation, forward_or_backward): - ''' + """ Args: api_info: ApiInfo api_name_str: str @@ -67,13 +80,15 @@ class ApiAccuracyChecker: get mindspore api output, run torch api and get output. compare output. record compare result. - ''' + """ # get output if global_context.get_is_constructed(): # constructed situation, need use constructed input to run mindspore api getting tested_output - tested_outputs = api_runner(api_input_aggregation, api_name_str, forward_or_backward, Const.MS_FRAMEWORK) + tested_outputs = api_runner(api_input_aggregation, api_name_str, + forward_or_backward, global_context.get_framework()) else: tested_outputs = api_info.get_compute_element_list(forward_or_backward, Const.OUTPUT) + bench_outputs = api_runner(api_input_aggregation, api_name_str, forward_or_backward, Const.PT_FRAMEWORK) tested_outputs = trim_output_compute_element_list(tested_outputs, forward_or_backward) bench_outputs = trim_output_compute_element_list(bench_outputs, forward_or_backward) @@ -101,8 +116,8 @@ class ApiAccuracyChecker: err_msg = "" else: status = CompareConst.ERROR - err_msg = compare_result_dict.get(CompareConst.COSINE).err_msg + \ - compare_result_dict.get(CompareConst.MAX_ABS_ERR).err_msg + err_msg = (compare_result_dict.get(CompareConst.COSINE).err_msg + + compare_result_dict.get(CompareConst.MAX_ABS_ERR).err_msg) basic_info_status = \ 
BasicInfoAndStatus(api_name_with_slot, bench_dtype, tested_dtype, shape, status, err_msg) output_list.append(tuple([api_name_str, forward_or_backward, basic_info_status, compare_result_dict])) @@ -110,13 +125,13 @@ class ApiAccuracyChecker: @staticmethod def prepare_api_input_aggregation(api_info, forward_or_backward=Const.FORWARD): - ''' + """ Args: api_info: ApiInfo forward_or_backward: str Returns: ApiInputAggregation - ''' + """ forward_inputs = api_info.get_compute_element_list(Const.FORWARD, Const.INPUT) kwargs = api_info.get_kwargs() if forward_or_backward == Const.FORWARD: @@ -125,29 +140,71 @@ class ApiAccuracyChecker: gradient_inputs = api_info.get_compute_element_list(Const.BACKWARD, Const.INPUT) return ApiInputAggregation(forward_inputs, kwargs, gradient_inputs) + @staticmethod + def is_api_checkable(api_name_str): + ''' + Args: + api_name_str: str, e.g. "MintFunctional.relu.0.forward", key in data field of api_info.json + Returns: + is_checkable: bool + Description: + tell whether this api is checkable based on the key in "data" dict in api_info.json + ''' + api_name_str_list = api_name_str.split(Const.SEP) + if len(api_name_str_list) < MsCompareConst.API_NAME_STR_LENGTH: + return False + api_type_str = api_name_str_list[0] + real_api_str = Const.SEP.join(api_name_str_list[1:-2]) + api_list = load_yaml(yaml_path) + supported_tensor_api_list = api_list.get(MsCompareConst.SUPPORTED_TENSOR_LIST_KEY) + supported_fusion_api_list = MsCompareConst.SUPPORTED_FUSION_LIST + if api_type_str in (MsCompareConst.MINT, MsCompareConst.MINT_FUNCTIONAL) \ + and global_context.get_framework() == Const.MS_FRAMEWORK: + return True + if api_type_str in MsCompareConst.MT_VALID_API_TYPES \ + and global_context.get_framework() == Const.MT_FRAMEWORK: + return True + if api_type_str == MsCompareConst.TENSOR_API and real_api_str in supported_tensor_api_list \ + and global_context.get_framework() == Const.MS_FRAMEWORK: + return True + if api_type_str == 
MsCompareConst.FUNCTIONAL_API and real_api_str in supported_fusion_api_list \ + and global_context.get_framework() == Const.MS_FRAMEWORK: + return True + return False + def parse(self, api_info_path): - with FileOpen(api_info_path, "r") as f: - api_info_dict = json.load(f) + + api_info_dict = load_json(api_info_path) # init global context task = check_and_get_from_json_dict(api_info_dict, MsCompareConst.TASK_FIELD, "task field in api_info.json", accepted_type=str, accepted_value=(MsCompareConst.STATISTICS_TASK, MsCompareConst.TENSOR_TASK)) + try: + framework = check_and_get_from_json_dict(api_info_dict, MsCompareConst.FRAMEWORK, + "framework field in api_info.json", accepted_type=str, + accepted_value=(Const.MS_FRAMEWORK, + Const.MT_FRAMEWORK)) + except Exception as e: + framework = Const.MS_FRAMEWORK + logger.warning(f"JSON parsing error in framework field: {e}") + + if framework == Const.MT_FRAMEWORK and not torch_mindtorch_importer.is_valid_pt_mt_env: + raise Exception(f"Please check if you have a valid PyTorch and MindTorch environment") + is_constructed = task == MsCompareConst.STATISTICS_TASK if not is_constructed: dump_data_dir = check_and_get_from_json_dict(api_info_dict, MsCompareConst.DUMP_DATA_DIR_FIELD, "dump_data_dir field in api_info.json", accepted_type=str) else: dump_data_dir = "" - global_context.init(is_constructed, dump_data_dir) + global_context.init(is_constructed, dump_data_dir, framework) api_info_data = check_and_get_from_json_dict(api_info_dict, MsCompareConst.DATA_FIELD, "data field in api_info.json", accepted_type=dict) for api_name, api_info in api_info_data.items(): - is_mint = api_name.split(Const.SEP)[0] in \ - (MsCompareConst.MINT, MsCompareConst.MINT_FUNCTIONAL) - if not is_mint: + if not self.is_api_checkable(api_name): continue forbackward_str = api_name.split(Const.SEP)[-1] if forbackward_str not in (Const.FORWARD, Const.BACKWARD): @@ -161,137 +218,87 @@ class ApiAccuracyChecker: else: 
self.api_infos[api_name].load_backward_info(api_info) + def process_forward(self, api_name_str, api_info): + """处理前向检查""" + if not api_info.check_forward_info(): + logger.debug(f"api: {api_name_str} is lack of forward information, skip forward check.") + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.API_NOT_FOUND, + result=None, + err_msg=f"forward info of {api_name_str} is not found") + return process_result_packet + + try: + forward_inputs_aggregation = self.prepare_api_input_aggregation(api_info, Const.FORWARD) + except Exception as e: + logger.warning(f"Exception occurs when getting inputs for {api_name_str} forward api. " + f"Skipping forward check. Detailed exception information: {e}.") + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.EXCEPTION_SKIP, + result=None, err_msg=f"{e}") + return process_result_packet + + try: + forward_output_list = self.run_and_compare_helper(api_info, api_name_str, forward_inputs_aggregation, + Const.FORWARD) + except Exception as e: + logger.warning(f"Exception occurs when running and comparing {api_name_str} forward api. 
" + f"Detailed exception information: {e}.") + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.EXCEPTION_SKIP, + result=None, err_msg=f"{e}") + return process_result_packet + + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.SUCCESS, + result=forward_output_list, err_msg="") + return process_result_packet + + def process_backward(self, api_name_str, api_info): + """处理反向检查""" + if not api_info.check_backward_info(): + logger.debug(f"api: {api_name_str} is lack of backward information, skipping backward check.") + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.API_NOT_FOUND, + result=None, + err_msg=f"backward info of {api_name_str} is not found") + return process_result_packet + + try: + backward_inputs_aggregation = self.prepare_api_input_aggregation(api_info, Const.BACKWARD) + except Exception as e: + logger.warning(f"Exception occurs when getting inputs for {api_name_str} backward api. " + f"Skipping backward check. Detailed exception information: {e}.") + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.EXCEPTION_SKIP, + result=None, err_msg=f"{e}") + return process_result_packet + + try: + backward_output_list = self.run_and_compare_helper(api_info, api_name_str, backward_inputs_aggregation, + Const.BACKWARD) + except Exception as e: + logger.warning(f"Exception occurs when running and comparing {api_name_str} backward api. 
" + f"Detailed exception information: {e}.") + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.EXCEPTION_SKIP, + result=None, err_msg=f"{e}") + return process_result_packet + + process_result_packet = ProcessResultPacket(process_status=MsCompareConst.ProcessStatus.SUCCESS, + result=backward_output_list, err_msg="") + return process_result_packet + def run_and_compare(self): - for api_name_str, api_info in self.api_infos.items(): - if not api_info.check_forward_info(): - logger.warning(f"api: {api_name_str} is lack of forward infomation, skip forward and backward check.") + for api_name_str, api_info in tqdm(self.api_infos.items()): + if not self.data_manager.is_unique_api(api_name_str): continue - try: - forward_inputs_aggregation = self.prepare_api_input_aggregation(api_info, Const.FORWARD) - except Exception as e: - logger.warning(f"exception occurs when getting inputs for {api_name_str} forward api. " - f"skip forward and backward check. detailed exception information: {e}.") - continue - forward_output_list = None - try: - forward_output_list = \ - self.run_and_compare_helper(api_info, api_name_str, forward_inputs_aggregation, Const.FORWARD) - except Exception as e: - logger.warning(f"exception occurs when running and comparing {api_name_str} forward api. " - f"detailed exception information: {e}.") - self.record(forward_output_list) - - if not api_info.check_backward_info(): - logger.warning(f"api: {api_name_str} is lack of backward infomation, skip backward check.") - continue - try: - backward_inputs_aggregation = self.prepare_api_input_aggregation(api_info, Const.BACKWARD) - except Exception as e: - logger.warning(f"exception occurs when getting inputs for {api_name_str} backward api. " - f"skip backward check. 
detailed exception information: {e}.") - continue - backward_output_list = None - try: - backward_output_list = \ - self.run_and_compare_helper(api_info, api_name_str, backward_inputs_aggregation, Const.BACKWARD) - except Exception as e: - logger.warning(f"exception occurs when running and comparing {api_name_str} backward api. " - f"detailed exception information: {e}.") - self.record(backward_output_list) - - def record(self, output_list): - if output_list is None: - return - for output in output_list: - api_real_name, forward_or_backward, basic_info, compare_result_dict = output - key = tuple([api_real_name, forward_or_backward]) - if key not in self.results: - self.results[key] = [] - self.results[key].append(tuple([basic_info, compare_result_dict])) - - def to_detail_csv(self, csv_dir): - # detail_csv - detail_csv = [] - detail_csv_header_basic_info = [ - MsCompareConst.DETAIL_CSV_API_NAME, - MsCompareConst.DETAIL_CSV_BENCH_DTYPE, - MsCompareConst.DETAIL_CSV_TESTED_DTYPE, - MsCompareConst.DETAIL_CSV_SHAPE, - ] - detail_csv_header_compare_result = list(compare_algorithms.keys()) - detail_csv_header_status = [ - MsCompareConst.DETAIL_CSV_PASS_STATUS, - MsCompareConst.DETAIL_CSV_MESSAGE, - ] - - detail_csv_header = detail_csv_header_basic_info + detail_csv_header_compare_result + detail_csv_header_status - detail_csv.append(detail_csv_header) - - for _, results in self.results.items(): - # detail csv - for res in results: - basic_info, compare_result_dict = res - csv_row_basic_info = \ - [basic_info.api_name, basic_info.bench_dtype, basic_info.tested_dtype, basic_info.shape] - csv_row_compare_result = list(compare_result_dict.get(algorithm_name).compare_value \ - for algorithm_name in detail_csv_header_compare_result) - csv_row_status = [basic_info.status, basic_info.err_msg] - csv_row = csv_row_basic_info + csv_row_compare_result + csv_row_status - detail_csv.append(csv_row) - - file_name = os.path.join(csv_dir, 
add_time_as_suffix(MsCompareConst.DETAIL_CSV_FILE_NAME)) - create_directory(csv_dir) - write_csv(detail_csv, file_name, mode="w") - - def to_result_csv(self, csv_dir): - result_csv_dict = dict() - for key, results in self.results.items(): - api_real_name, forward_or_backward = key - forward_or_backward_pass_status = CompareConst.PASS - forward_or_backward_overall_err_msg = "" - # detail csv - for res in results: - basic_info, _ = res - if basic_info.status != CompareConst.PASS: - forward_or_backward_pass_status = CompareConst.ERROR - forward_or_backward_overall_err_msg += basic_info.err_msg - forward_or_backward_overall_err_msg = \ - "" if forward_or_backward_pass_status == CompareConst.PASS else forward_or_backward_overall_err_msg - - # result_csv_dict - if api_real_name not in result_csv_dict: - result_csv_dict[api_real_name] = ResultCsvEntry() - if forward_or_backward == Const.FORWARD: - result_csv_dict[api_real_name].forward_pass_status = forward_or_backward_pass_status - result_csv_dict[api_real_name].forward_err_msg = forward_or_backward_overall_err_msg - else: - result_csv_dict[api_real_name].backward_pass_status = forward_or_backward_pass_status - result_csv_dict[api_real_name].backward_err_msg = forward_or_backward_overall_err_msg - - # result_csv - result_csv = [] - result_csv_header = [ - MsCompareConst.DETAIL_CSV_API_NAME, - MsCompareConst.RESULT_CSV_FORWARD_TEST_SUCCESS, - MsCompareConst.RESULT_CSV_BACKWARD_TEST_SUCCESS, - MsCompareConst.DETAIL_CSV_MESSAGE, - ] - result_csv.append(result_csv_header) - - for api_name, result_csv_entry in result_csv_dict.items(): - if result_csv_entry.forward_pass_status == CompareConst.PASS and \ - result_csv_entry.backward_pass_status == CompareConst.PASS: - overall_err_msg = "" - else: - overall_err_msg = result_csv_entry.forward_err_msg + result_csv_entry.backward_err_msg - row = [ - api_name, - result_csv_entry.forward_pass_status, - result_csv_entry.backward_pass_status, - overall_err_msg - ] - 
result_csv.append(row) - - file_name = os.path.join(csv_dir, add_time_as_suffix(MsCompareConst.RESULT_CSV_FILE_NAME)) - create_directory(csv_dir) - write_csv(result_csv, file_name, mode="w") + + # 处理前向 + process_result_packet = self.process_forward(api_name_str, api_info) + if process_result_packet.process_status is MsCompareConst.ProcessStatus.SUCCESS: + self.data_manager.record(process_result_packet.result) + elif process_result_packet.process_status == MsCompareConst.ProcessStatus.EXCEPTION_SKIP: + self.data_manager.record_exception_skip(api_name_str, Const.FORWARD, process_result_packet.err_msg) + + # 处理反向 + process_result_packet = self.process_backward(api_name_str, api_info) + if process_result_packet.process_status is MsCompareConst.ProcessStatus.SUCCESS: + self.data_manager.record(process_result_packet.result) + elif process_result_packet.process_status == MsCompareConst.ProcessStatus.EXCEPTION_SKIP: + self.data_manager.record_exception_skip(api_name_str, Const.BACKWARD, process_result_packet.err_msg) + + self.data_manager.save_results(api_name_str) diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_info.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_info.py index 3c0537542cf0fd030e14fb5d6d6573a51fb583b6..57985eb08ea89dc82d0f2f2c82a4dfa744744379 100644 --- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_info.py +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_info.py @@ -82,8 +82,8 @@ class ApiInfo: err_msg = "ApiInfo.get_kwargs failed: compute_element_dict key is not a string" logger.error_log_with_exp(err_msg, ApiAccuracyCheckerException(ApiAccuracyCheckerException.ParseJsonFailed)) - if not isinstance(compute_element_info, (list, dict)): - err_msg = "ApiInfo.get_kwargs failed: compute_element_dict value is not a list or dict" + if not (isinstance(compute_element_info, (list, dict)) or compute_element_info is None): + err_msg = "ApiInfo.get_kwargs failed: 
compute_element_dict value is not a list, dict or null" logger.error_log_with_exp(err_msg, ApiAccuracyCheckerException(ApiAccuracyCheckerException.ParseJsonFailed)) kwargs_compute_element_dict = {key_str: ComputeElement(compute_element_info=compute_element_info) diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_runner.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_runner.py index a8396f220fdba75b5864842ad2f6c915a7d41eb2..36e506f67737cdea4452ba27f4fad0524d4c2884 100644 --- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_runner.py +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/api_runner.py @@ -14,24 +14,39 @@ # limitations under the License. import mindspore -import torch from mindspore import ops -from msprobe.core.common.const import Const, MsCompareConst +from msprobe.core.common.const import Const from msprobe.core.common.exceptions import ApiAccuracyCheckerException from msprobe.mindspore.api_accuracy_checker.compute_element import ComputeElement from msprobe.mindspore.api_accuracy_checker.type_mapping import float_dtype_str_list, torch_dtype_to_dtype_str from msprobe.mindspore.api_accuracy_checker.utils import convert_to_tuple +from msprobe.mindspore.api_accuracy_checker.bench_functions.fusion_operator import fusion +from msprobe.mindspore.common.const import MsCompareConst from msprobe.mindspore.common.log import logger +from msprobe.mindspore.api_accuracy_checker import torch_mindtorch_importer + +from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import mindtorch +from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import mindtorch_tensor +from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import mindtorch_func +from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import mindtorch_dist + +if torch_mindtorch_importer.is_valid_pt_mt_env: + from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer 
import torch +else: + import torch + + + class ApiInputAggregation: def __init__(self, inputs, kwargs, gradient_inputs) -> None: - ''' + """ Args: inputs: List[ComputeElement] kwargs: dict{str: ComputeElement} gradient_inputs: Union[List[ComputeElement], None] - ''' + """ self.inputs = inputs self.kwargs = kwargs self.gradient_inputs = gradient_inputs @@ -41,7 +56,40 @@ api_parent_module_mapping = { (MsCompareConst.MINT, Const.MS_FRAMEWORK): mindspore.mint, (MsCompareConst.MINT, Const.PT_FRAMEWORK): torch, (MsCompareConst.MINT_FUNCTIONAL, Const.MS_FRAMEWORK): mindspore.mint.nn.functional, - (MsCompareConst.MINT_FUNCTIONAL, Const.PT_FRAMEWORK): torch.nn.functional + (MsCompareConst.MINT_FUNCTIONAL, Const.PT_FRAMEWORK): torch.nn.functional, + (MsCompareConst.TENSOR_API, Const.MS_FRAMEWORK): mindspore.Tensor, + (MsCompareConst.TENSOR_API, Const.PT_FRAMEWORK): torch.Tensor, + (MsCompareConst.MINDTORCH_TENSOR, Const.MT_FRAMEWORK): mindtorch_tensor, + (MsCompareConst.MINDTORCH_TENSOR, Const.PT_FRAMEWORK): torch.Tensor, + (MsCompareConst.MINDTORCH, Const.MT_FRAMEWORK): mindtorch, + (MsCompareConst.MINDTORCH, Const.PT_FRAMEWORK): torch, + (MsCompareConst.MINDTORCH_FUNC, Const.MT_FRAMEWORK): mindtorch_func, + (MsCompareConst.MINDTORCH_FUNC, Const.PT_FRAMEWORK): torch.nn.functional, + (MsCompareConst.MINDTORCH_DIST, Const.MT_FRAMEWORK): mindtorch_dist, + (MsCompareConst.MINDTORCH_DIST, Const.PT_FRAMEWORK): torch.distributed, + (MsCompareConst.FUNCTIONAL_API, Const.MS_FRAMEWORK): mindspore.ops, + (MsCompareConst.FUSION_API, Const.PT_FRAMEWORK): fusion + +} + + +api_parent_module_str_mapping = { + (MsCompareConst.MINT, Const.MS_FRAMEWORK): "mindspore.mint", + (MsCompareConst.MINT, Const.PT_FRAMEWORK): "torch", + (MsCompareConst.MINT_FUNCTIONAL, Const.MS_FRAMEWORK): "mindspore.mint.nn.functional", + (MsCompareConst.MINT_FUNCTIONAL, Const.PT_FRAMEWORK): "torch.nn.functional", + (MsCompareConst.TENSOR_API, Const.MS_FRAMEWORK): "mindspore.Tensor", + (MsCompareConst.TENSOR_API, 
Const.PT_FRAMEWORK): "torch.Tensor", + (MsCompareConst.MINDTORCH_TENSOR, Const.MT_FRAMEWORK): "mindtorch_tensor", + (MsCompareConst.MINDTORCH_TENSOR, Const.PT_FRAMEWORK): "torch.Tensor", + (MsCompareConst.MINDTORCH, Const.MT_FRAMEWORK): "mindtorch", + (MsCompareConst.MINDTORCH, Const.PT_FRAMEWORK): "torch", + (MsCompareConst.MINDTORCH_FUNC, Const.MT_FRAMEWORK): "mindtorch_func", + (MsCompareConst.MINDTORCH_FUNC, Const.PT_FRAMEWORK): "torch.nn.functional", + (MsCompareConst.MINDTORCH_DIST, Const.MT_FRAMEWORK): "mindtorch_dist", + (MsCompareConst.MINDTORCH_DIST, Const.PT_FRAMEWORK): "torch.distributed", + (MsCompareConst.FUNCTIONAL_API, Const.MS_FRAMEWORK): "mindspore.ops", + (MsCompareConst.FUSION_API, Const.PT_FRAMEWORK): "fusion" } @@ -53,7 +101,7 @@ class ApiRunner: api_input_aggregation: ApiInputAggregation api_name_str: str, e.g. "MintFunctional.relu.0" forward_or_backward: str, Union["forward", "backward"] - api_platform: str, Union["mindspore", "torch"] + api_platform: str, Union["mindspore", "torch", "mindtorch"] Return: outputs: list[ComputeElement] @@ -61,39 +109,46 @@ class ApiRunner: Description: run mindspore.mint/torch api ''' - api_type_str, api_sub_name = self.get_info_from_name(api_name_str) + + api_type_str, api_sub_name = self.get_info_from_name(api_name_str, api_platform) api_instance = self.get_api_instance(api_type_str, api_sub_name, api_platform) return self.run_api(api_instance, api_input_aggregation, forward_or_backward, api_platform) @staticmethod - def get_info_from_name(api_name_str): - ''' + def get_info_from_name(api_name_str, api_platform=Const.MS_FRAMEWORK): + """ Args: api_name_str: str, the trimmed key of data dict in api_info.json. e.g. "MintFunctional.relu.0" - + api_platform: str, the platform for the API, which can be either "mindspore" or "mindtorch". + It specifies which framework is being used. Default is "mindspore". 
Return: - api_type_str: str, Union["MintFunctional", "Mint"] + api_type_str: str, Union["MintFunctional", "Mint", "Tensor", "Torch", "Functional"] api_sub_name: str, e.g. "relu" - ''' + """ api_name_list = api_name_str.split(Const.SEP) if len(api_name_list) != 3: err_msg = f"ApiRunner.get_info_from_name failed: api_name_str: {api_name_str} is not in defined format" logger.error_log_with_exp(err_msg, ApiAccuracyCheckerException(ApiAccuracyCheckerException.WrongValue)) api_type_str, api_sub_name = api_name_list[0], api_name_list[1] - if api_type_str not in [MsCompareConst.MINT, MsCompareConst.MINT_FUNCTIONAL]: - err_msg = f"ApiRunner.get_info_from_name failed: not mint or mint.nn.functional api" + if api_type_str not in [MsCompareConst.MINT, MsCompareConst.MINT_FUNCTIONAL, MsCompareConst.TENSOR_API, + MsCompareConst.FUNCTIONAL_API] \ + and api_platform == Const.MS_FRAMEWORK: + err_msg = f"ApiRunner.get_info_from_name failed: not mint, mint.nn.functional or Tensor api" logger.error_log_with_exp(err_msg, ApiAccuracyCheckerException(ApiAccuracyCheckerException.WrongValue)) + if api_type_str not in MsCompareConst.MT_VALID_API_TYPES and api_platform == Const.MT_FRAMEWORK: + err_msg = f"ApiRunner.get_info_from_name failed: not torch, functional or Tensor api" + logger.error_log_with_exp(err_msg, ApiAccuracyCheckerException(ApiAccuracyCheckerException.WrongValue)) return api_type_str, api_sub_name @staticmethod def get_api_instance(api_type_str, api_sub_name, api_platform): - ''' + """ Args: - api_type_str: str, Union["MintFunctional", "Mint"] + api_type_str: str, Union["MintFunctional", "Mint", "Tensor", "Functional"] api_sub_name: str, e.g. 
"relu" - api_platform: str: Union["mindpore", "torch"] + api_platform: str: Union["mindspore", "pytorch"] Return: api_instance: function object @@ -102,12 +157,15 @@ class ApiRunner: get mindspore.mint/torch api fucntion mindspore.mint.{api_sub_name} <--> torch.{api_sub_name} mindspore.mint.nn.functional.{api_sub_name} <--> torch.nn.functional.{api_sub_name} - ''' + """ + if api_sub_name in MsCompareConst.SUPPORTED_FUSION_LIST and api_platform == "pytorch": + api_parent_module = api_parent_module_mapping.get((MsCompareConst.FUSION_API, api_platform)) + api_parent_module_str = api_parent_module_str_mapping.get((MsCompareConst.FUSION_API, api_platform)) + else: + api_parent_module = api_parent_module_mapping.get((api_type_str, api_platform)) + api_parent_module_str = api_parent_module_str_mapping.get((api_type_str, api_platform)) + full_api_name = api_parent_module_str + Const.SEP + api_sub_name - api_parent_module = api_parent_module_mapping.get((api_type_str, api_platform)) - module_str = "mindspore.mint." if api_platform == Const.MS_FRAMEWORK else "torch." - submodule_str = "nn.functional." 
if api_type_str == MsCompareConst.MINT_FUNCTIONAL else "" - full_api_name = module_str + submodule_str + api_sub_name if not hasattr(api_parent_module, api_sub_name): err_msg = f"ApiRunner.get_api_instance failed: {full_api_name} is not found" logger.error_log_with_exp(err_msg, ApiAccuracyCheckerException(ApiAccuracyCheckerException.ApiWrong)) @@ -137,7 +195,7 @@ class ApiRunner: logger.error_log_with_exp(err_msg, ApiAccuracyCheckerException(ApiAccuracyCheckerException.WrongValue)) gradient_inputs = tuple(compute_element.get_parameter(get_origin=False, tensor_platform=api_platform) for compute_element in gradient_inputs) - if api_platform == Const.MS_FRAMEWORK: + if api_platform == Const.MS_FRAMEWORK or api_platform == Const.MT_FRAMEWORK: if len(gradient_inputs) == 1: gradient_inputs = gradient_inputs[0] diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/base_compare_algorithm.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/base_compare_algorithm.py index ead03d25ea5c2e6bb0422486f1939c5b31ee589b..da2f8ad612fcf3a42083894ff1b8e56db757f919 100644 --- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/base_compare_algorithm.py +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/base_compare_algorithm.py @@ -18,9 +18,10 @@ from abc import ABC, abstractmethod import mindspore import numpy as np import torch -from msprobe.core.common.const import CompareConst, MsCompareConst +from msprobe.core.common.const import CompareConst from msprobe.core.common.exceptions import ApiAccuracyCheckerException from msprobe.mindspore.common.log import logger +from msprobe.mindspore.common.const import MsCompareConst class CompareResult: diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/bench_functions/flash_attention_score.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/bench_functions/flash_attention_score.py new file mode 100644 index 
0000000000000000000000000000000000000000..cb268efeae90a51465493c65caa948045bae4913 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/bench_functions/flash_attention_score.py @@ -0,0 +1,602 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from collections import namedtuple +import torch +import torch.nn as nn +import numpy as np + +from einops import rearrange + + +from msprobe.pytorch.common.utils import logger + +GTYPE = torch.float64 # ARM hosts must use float64; on x86, float32 is sufficient (float64 also works). ARM computation is slow; for s=8k scenarios, x86 is recommended. +SOFTMAX_BUILD_MODE = "QKV" # "MAX_SUM" + +FaForwardParams = namedtuple("FaForwardParams", + ["q", "k", "v", "drop_mask", "attn_mask", "pse", "scalar_value", "keep_prob"]) +FaBackwardParams = namedtuple("FaBackwardParams", + ["dx", "q", "k", "v", "softmax_res", "drop_mask", "pse", "scalar_value", "keep_prob"]) +RebuildSoftmaxParams = namedtuple("RebuildSoftmaxParams", + ["q", "k", "attn_mask", "pse", "scalar_value", "softmax_max", "softmax_sum"]) + + +def softmax_forward(x): + x_max = torch.max(x, dim=-1, keepdims=True)[0] + x_sub = x.sub(x_max) + y = torch.exp(x_sub) + x_sum = y.sum(dim=-1, keepdims=True) + res = y.div(x_sum) + return res, x_max, x_sum + + +def softmax_grad(dp, softmax_res): + muls = dp * softmax_res + muls_r = muls.sum(dim=-1, keepdims=True) + sub_r = dp - muls_r + res = sub_r * softmax_res + return res + + +def broadcast_kv(num_heads, 
num_kv_heads, kv_tensor, dtype): + if num_kv_heads == 0 or num_kv_heads > num_heads: + raise ValueError("num_kv_heads must be non-zero and must not exceed num_heads.") + + factor = num_heads // num_kv_heads + kv_shape = kv_tensor.shape + b = kv_shape[0] + s = kv_shape[2] + d = kv_shape[3] + kv_res = torch.zeros([b, num_heads, s, d]).to(dtype) + for i in range(num_heads): + j = i // factor + kv_res[:, i:i + 1, :, :] = kv_tensor[:, j:j + 1, :, :] + return kv_res + + +def calculate_qk(q, k, attn_mask, pse, scalar_value): + if k.dim() == 3: + k = k.unsqueeze(1) # expand along the head dimension + + if k.dim() != 4: + raise ValueError(f"k tensor dimension must be 4, but got {k.dim()} dimensions (shape: {k.shape})") + + if pse is None or len(pse.shape) == 0: + qk = torch.matmul(q, k.permute(0, 1, 3, 2)).mul(scalar_value) + else: + qk = (torch.matmul(q, k.permute(0, 1, 3, 2)) + pse).mul(scalar_value) + if attn_mask is None or len(attn_mask.shape) == 0: + return qk + else: + qk = qk + attn_mask.bool() * (-40000.0) # -10000 + return qk + + +def fusion_attention_forward(forward_params): + q = forward_params.q + k = forward_params.k + v = forward_params.v + drop_mask = forward_params.drop_mask + attn_mask = forward_params.attn_mask + pse = forward_params.pse + scalar_value = forward_params.scalar_value + keep_prob = forward_params.keep_prob + + qk = calculate_qk(q, k, attn_mask, pse, scalar_value) + softmax_res, softmax_max, softmax_sum = softmax_forward(qk) + if drop_mask is None or len(drop_mask.shape) == 0: + drop_res = softmax_res + else: + drop_res = softmax_res * drop_mask * (1.0 / keep_prob) + y = torch.matmul(drop_res, v) + return y, softmax_max, softmax_sum + + +def fusion_attention_backward(backward_params): + dx = backward_params.dx + q = backward_params.q + k = backward_params.k + v = backward_params.v + softmax_res = backward_params.softmax_res + drop_mask = backward_params.drop_mask + pse = backward_params.pse + scalar_value = backward_params.scalar_value + keep_prob = 
backward_params.keep_prob + dp = torch.matmul(dx, v.permute(0, 1, 3, 2)) + if drop_mask is None or len(drop_mask.shape) == 0: + drop_res = softmax_res.permute(0, 1, 3, 2) + dp_drop = dp + else: + drop_res = softmax_res.mul(drop_mask).mul(1.0 / keep_prob).permute(0, 1, 3, 2) + dp_drop = dp * drop_mask * (1.0 / keep_prob) + dv = torch.matmul(drop_res, dx) + softmax_grad_res = (softmax_grad(dp_drop, softmax_res) * scalar_value) + dq = torch.matmul(softmax_grad_res, k) + dk = torch.matmul(softmax_grad_res.permute(0, 1, 3, 2), q) + return dq, dk, dv + + +def parse_bsnd_args(query, key, head_num, input_layout): + supported_input_layout = ["BSH", "SBH", "BSND", "BNSD", "TND"] + b, s1, s2, n1, n2, d, h1, h2 = None, None, None, head_num, None, None, None, None + + if not isinstance(input_layout, str) or input_layout not in supported_input_layout: + raise ValueError(f"Invalid input_layout arg which must be one of {supported_input_layout}.") + + if input_layout == "TND": + raise ValueError(f"input_layout {input_layout} is not supported for now.") + try: + if input_layout == "BSH": + b, s1, h1 = query.shape + _, s2, h2 = key.shape + d = h1 // n1 + n2 = h2 // d + elif input_layout == "SBH": + s1, b, h1 = query.shape + s2, _, h2 = key.shape + d = h1 // n1 + n2 = h2 // d + elif input_layout == "BSND": + b, s1, n1, d = query.shape + _, s2, n2, _ = key.shape + h1 = n1 * d + h2 = n2 * d + elif input_layout == "BNSD": + b, n1, s1, d = query.shape + _, n2, s2, _ = key.shape + h1 = n1 * d + h2 = n2 * d + except Exception as e: + raise ValueError(f"query.shape: {query.shape}, key.shape: {key.shape}, parse_bsnd_args error: {e}") from e + + if d == 0: + raise ValueError(f"Value d must be non-zero.") + _dtype = query.dtype + ret = (b, s1, s2, n1, n2, d, h1, h2, _dtype) + return ret + + +def convert_from_bnsd(_input, input_layout): + """ + transform qkv from bnsd to input_layout. 
+ B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,N,S,D) + input_layout (str): "BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + """ + if input_layout == "BSH": + # (B,N,S,D)=>(B,S,N*D) + out = rearrange(_input, 'b n s d -> b s (n d)').contiguous() + elif input_layout == "SBH": + # (B,N,S,D)=>(S,B,N*D) + out = rearrange(_input, 'b n s d -> s b (n d)').contiguous() + elif input_layout == "BSND": + # (B,N,S,D)=>(B,S,N,D) + out = rearrange(_input, 'b n s d -> b s n d').contiguous() + elif input_layout == "TND": + raise ValueError(f"input_layout {input_layout} is not supported for now.") + else: + out = _input + return out + + +def convert_to_bnsd(_input, n, input_layout): + """ + transform qkv from input_layout to bnsd. + B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + n (int): num_heads + input_layout (str):"BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,N,S,D) + """ + if input_layout == "BSH": + # (B,S,N*D)=>(B,N,S,D) + out = rearrange(_input, 'b s (n d) -> b n s d', n=n) + elif input_layout == "SBH": + # (S,B,N*D)=>(B,N,S,D) + out = rearrange(_input, 's b (n d) -> b n s d', n=n) + elif input_layout == "BSND": + # (B,S,N,D)=>(B,N,S,D) + out = rearrange(_input, 'b s n d -> b n s d', n=n) + elif input_layout == "TND": + raise ValueError(f"input_layout {input_layout} is not supported for now.") + else: + out = _input + if out.dim() != 4: + raise ValueError(f"convert qkv format failed with input_layout {input_layout}.") + return out.to(GTYPE) + + +def convert_from_bsnd(_input, input_layout): + """ + transform qkv from bsnd to input_layout. 
+ B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,S,N,D) + input_layout (str): "BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + """ + if input_layout == "BSH": + # (B,S,N,D)=>(B,S,N*D) + out = rearrange(_input, 'b s n d -> b s (n d)').contiguous() + elif input_layout == "SBH": + # (B,S,N,D)=>(S,B,N*D) + out = rearrange(_input, 'b s n d -> s b (n d)').contiguous() + elif input_layout == "BNSD": + # (B,S,N,D)=>(B,N,S,D) + out = rearrange(_input, 'b s n d -> b n s d').contiguous() + elif input_layout == "TND": + raise ValueError(f"input_layout {input_layout} is not supported for now.") + else: + out = _input + return out + + +def convert_to_bsnd(_input, n, input_layout): + """ + transform qkv from input_layout to bsnd. + B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + n (int): num_heads + input_layout (str):"BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,S,N,D) + """ + if input_layout == "BSH": + # (B,S,N*D)=>(B,S,N,D) + out = rearrange(_input, 'b s (n d) -> b s n d', n=n) + elif input_layout == "SBH": + # (S,B,N*D)=>(B,S,N,D) + out = rearrange(_input, 's b (n d) -> b s n d', n=n) + elif input_layout == "BNSD": + # (B,N,S,D)=>(B,S,N,D) + out = rearrange(_input, 'b n s d -> b s n d', n=n) + elif input_layout == "TND": + raise ValueError(f"input_layout {input_layout} is not supported for now.") + else: + out = _input + if out.dim() != 4: + raise ValueError(f"convert qkv format failed with input_layout {input_layout}.") + return out + + +def generate_attn_mask(*args): + """ + # When sparse_mode is 2, 3, or 4, the small-operator-to-fusion-operator conversion takes this optimization; reversing it means decomposing back to the original basic implementation + ===> attn_mask = torch.from_numpy(np.triu(np.ones([2048, 2048]), k=1)).to(dtype) + """ + + sparse_mode, attn_mask, b, n1, s1, s2, pre_tocken, next_tocken, dtype = 
args + shape = [s1, s2] + + if attn_mask is not None: + # When the FA input already contains attn_mask, it can be regarded as the already-converted mask matrix; there are three special cases (sparse-matrix scenarios) that require inverse restoration + if sparse_mode == 2 or sparse_mode == 3 or sparse_mode == 4: + logger.info(f"s1: {s1}, s2:{s2}, attn_mask.shape:{attn_mask.shape}, attn_mask.dtype:{attn_mask.dtype}") + + if attn_mask.dim() == 2 and attn_mask.shape[0] == 2048 and attn_mask.shape[1] == 2048: + if attn_mask.equal(torch.from_numpy(np.triu(np.ones([2048, 2048]), k=1)).to(attn_mask.dtype)): + if sparse_mode == 2: + attn_mask = torch.from_numpy(np.triu(np.ones(shape), k=1)) + elif sparse_mode == 3: + attn_mask = torch.from_numpy(np.triu(np.ones(shape), k=s2 - s1 + 1)) + elif sparse_mode == 4: + attn_mask_u = torch.from_numpy(np.triu(np.ones(shape), k=next_tocken + 1)) + attn_mask_l = torch.from_numpy(np.tril(np.ones(shape), k=-pre_tocken - 1)) + attn_mask = attn_mask_u + attn_mask_l + logger.debug(f"Inversely converted attn_mask {attn_mask.shape}") + return attn_mask.to(dtype) + + return attn_mask.to(dtype) + + if attn_mask is not None: + if attn_mask.dim() == 2: + if attn_mask.shape[0] != s1 or attn_mask.shape[1] != s2: + raise ValueError(f"Invalid attn_mask shape `SS` {attn_mask.shape}") + shape = [s1, s2] + elif attn_mask.dim() == 4: + if attn_mask.shape[1] == 1: + shape = [b, 1, s1, s2] if b != 1 else [1, 1, s1, s2] + else: + shape = [b, n1, s1, s2] if b != 1 else [1, n1, s1, s2] + + if sparse_mode == 0: + attn_mask_u = torch.from_numpy(np.triu(np.ones(shape), k=next_tocken + 1)) + attn_mask_l = torch.from_numpy(np.tril(np.ones(shape), k=-pre_tocken - 1)) + attn_mask = attn_mask_u + attn_mask_l + elif sparse_mode == 1: # no sparse + attn_mask = torch.from_numpy(np.zeros(shape)) + elif sparse_mode == 2: + attn_mask = torch.from_numpy(np.triu(np.ones(shape), k=1)) + elif sparse_mode == 3: + attn_mask = torch.from_numpy(np.triu(np.ones(shape), k=s2 - s1 + 1)) + elif sparse_mode == 4: + attn_mask_u = torch.from_numpy(np.triu(np.ones(shape), k=next_tocken + 1)) + attn_mask_l = 
torch.from_numpy(np.tril(np.ones(shape), k=-pre_tocken - 1)) + attn_mask = attn_mask_u + attn_mask_l + # Note: sparse_mode=5 cannot occur here; that mode requires attn_mask to be passed in, with the attn_mask matrix in BNSS or B1SS format, + # so the FA input can be assumed to already be the correct attn_mask + return attn_mask.to(dtype) + + +def generate_kv(key, value, n1, n2): + # Adaptation for unequal N (by cdy) + if not (n1 == n2): + k_new = broadcast_kv(n1, n2, key, key.dtype) + v_new = broadcast_kv(n1, n2, value, value.dtype) + else: + k_new = key + v_new = value + return k_new, v_new + + +def rebuid_softmax_by_qkv(q, k, attn_mask, pse, scalar_value): + """ + attention = softmax(QK^T/sqrt(d))V + softmax(x_i) = e^(x_i - x_max) / sum(e^(x_i - x_max)) + """ + logger.info("Using QKV to rebuild original softmax") + qk = calculate_qk(q, k, attn_mask, pse, scalar_value) + softmax_res, _, _ = softmax_forward(qk) + return softmax_res + + +def rebuild_softmax_by_max_sum(softmax_params): + """ + attention = softmax(QK^T/sqrt(d))V + softmax(x_i) = e^(x_i - x_max_i) / x_sum_i + """ + q = softmax_params.q + k = softmax_params.k + attn_mask = softmax_params.attn_mask + pse = softmax_params.pse + scalar_value = softmax_params.scalar_value + softmax_max = softmax_params.softmax_max + softmax_sum = softmax_params.softmax_sum + logger.info("Using softmax_max and softmax_sum to rebuild original softmax") + + qk = calculate_qk(q, k, attn_mask, pse, scalar_value) + if softmax_max.shape[-1] == 0: + raise ValueError(f"softmax_max.shape[-1] must be non-zero, softmax_max.shape: {softmax_max.shape}") + repeat_dim = qk.shape[-1] // softmax_max.shape[-1] + softmax_res = torch.exp(qk.sub(softmax_max.repeat(1, 1, 1, repeat_dim))).div( + softmax_sum.repeat(1, 1, 1, repeat_dim)) + return softmax_res + + +def get_head_num(*args, **kwargs): + if kwargs.get("head_num", None): + head_num = kwargs.get("head_num") + elif len(args) >= 4: + head_num = args[3] + else: + raise ValueError(f"Unsupported npu_fusion_attention args {args}.") + return head_num + + +def get_input_layout(*args, **kwargs): + if 
kwargs.get("input_layout", None): + input_layout = kwargs.get("input_layout") + elif len(args) >= 5: + input_layout = args[4] + else: + raise ValueError(f"Unsupported npu_fusion_attention args {args}.") + return input_layout + + +def npu_fusion_attention_forward_patch(*args, **kwargs): + if len(args) < 2: + raise RuntimeError("npu_fusion_attention_forward_patch: length of args should be greater than or equal to 2.") + + # query, key, value, head_num, input_layout + head_num = get_head_num(*args, **kwargs) + input_layout = get_input_layout(*args, **kwargs) + + b, s1, s2, n1, n2, d, h1, h2, dtype = parse_bsnd_args(args[0], args[1], head_num, input_layout) + if n1 == n2 and s1 == s2: + logger.debug(f"running case : BNSD = {b}_{n1}_{s1}_{d}, sparse = {kwargs.get('sparse_mode', 0)}") + else: + logger.debug(f"running case: BNSD = {b}_{n1}({n2})_{s1}({s2})_{d}, sparse = {kwargs.get('sparse_mode', 0)}") + if not (n1 % n2 == 0 and n1 >= n2): + raise ValueError(f"n1 and n2 do not match, please check: n1 = {n1}, n2 = {n2}.") + + dims_kwargs = { + "b": b, "s1": s1, "s2": s2, "n1": n1, "n2": n2, + "d": d, "h1": h1, "h2": h2, "dtype": dtype + } + new_kwargs = { + "keep_prob": 1, + "scalar_value": kwargs.get("scalar_value", 1 / (d ** 0.5)), + "sparse_mode": kwargs.get("sparse_mode", 0), + "prefix": kwargs.get("prefix"), + "pre_tockens": kwargs.get("pre_tockens", 2147483647), + "next_tockens": kwargs.get("next_tockens", 2147483647), + "pse": kwargs.get("pse"), + "padding_mask": kwargs.get("padding_mask"), + "attn_mask": kwargs.get("attn_mask") + } + + return args, dims_kwargs, new_kwargs + + +def npu_fusion_attention_backward_patch(*args, **kwargs): + if len(args) != 6: + raise ValueError(f"Unsupported npu_fusion_attention_grad args {args}.") + + b, s1, s2, n1, n2, d, h1, h2, dtype = parse_bsnd_args(args[0], args[1], args[4], args[5]) + if n1 == n2 and s1 == s2: + logger.info(f"running case : bnsd = {b}_{n1}_{s1}_{d}, sparse = {kwargs.get('sparse_mode', 0)}") + else: + logger.info(f"running case: bnsd = 
{b}_{n1}({n2})_{s1}({s2})_{d}, sparse = {kwargs.get('sparse_mode', 0)}") + if not (n1 % n2 == 0 and n1 >= n2): + raise ValueError(f"n1 and n2 do not match, please check: n1 = {n1}, n2 = {n2}.") + + dims_kwargs = { + "b": b, "s1": s1, "s2": s2, "n1": n1, "n2": n2, + "d": d, "h1": h1, "h2": h2, "dtype": dtype + } + + new_kwargs = { + "keep_prob": 1, + "scalar_value": kwargs.get("scalar_value", 1 / (d ** 0.5)), + "sparse_mode": kwargs.get("sparse_mode", 0), + "prefix": kwargs.get("prefix"), + "pre_tockens": kwargs.get("pre_tockens", 2147483647), + "next_tockens": kwargs.get("next_tockens", 2147483647), + "pse": kwargs.get("pse"), + "padding_mask": kwargs.get("padding_mask"), + "softmax_max": kwargs.get("softmax_max"), + "softmax_sum": kwargs.get("softmax_sum"), + "softmax_in": kwargs.get("softmax_in"), + "attention_in": kwargs.get("attention_in"), + "seed": kwargs.get("seed", 0), + "offset": kwargs.get("offset", 0), + "numels": kwargs.get("numels", 0), + "attn_mask": kwargs.get("attn_mask") + } + + return args, dims_kwargs, new_kwargs + + +class FlashAttentionScore(nn.Module): + def __init__(self): + super(FlashAttentionScore, self).__init__() + # You can initialize any parameters here if necessary + + def forward(self, *inputs, **kwargs): + # Extract the inputs for the attention calculation + new_args, dims_kwargs, new_kwargs = npu_fusion_attention_forward_patch(*inputs, **kwargs) + query, key, value = new_args[0], new_args[1], new_args[2] + + input_layout = get_input_layout(*inputs, **kwargs) + + n1 = dims_kwargs.get("n1") + n2 = dims_kwargs.get("n2") + s1 = dims_kwargs.get("s1") + s2 = dims_kwargs.get("s2") + b = dims_kwargs.get("b") + dtype = dims_kwargs.get("dtype") + attn_mask = new_kwargs.get("attn_mask") + keep_prob = new_kwargs.get("keep_prob") + sparse_mode = new_kwargs.get("sparse_mode") + pre_tockens = new_kwargs.get("pre_tockens") + next_tockens = new_kwargs.get("next_tockens") + pse = new_kwargs.get("pse") + scalar_value = new_kwargs.get("scalar_value") + + 
args_temp = [sparse_mode, attn_mask, b, n1, s1, s2, pre_tockens, next_tockens, dtype] + + attn_mask = generate_attn_mask(*args_temp) + query = convert_to_bnsd(query, n1, input_layout) + key = convert_to_bnsd(key, n2, input_layout) + value = convert_to_bnsd(value, n2, input_layout) + + forward_params = FaForwardParams( + q=query, + k=key, + v=value, + drop_mask=None, + attn_mask=attn_mask, + pse=pse, + scalar_value=scalar_value, + keep_prob=keep_prob + ) + + out_golden, softmax_max, softmax_sum = fusion_attention_forward(forward_params) + + # If output dimension is 5, reshape accordingly + if out_golden.dim() == 5: + out_golden = out_golden.reshape(out_golden.size(0), + out_golden.size(1) * out_golden.size(2), + out_golden.size(3), out_golden.size(4)) + + out_golden = convert_from_bnsd(out_golden, input_layout) + + # Ensure the output matches the desired layout + out_golden = out_golden.cpu(), softmax_max.repeat(1, 1, 1, 8).cpu(), softmax_sum.repeat(1, 1, 1, 8).cpu() + + return out_golden + + def backward(self, *inputs, **kwargs): + # The backward pass will be similar to what was described for the gradient computation + new_args, dims_kwargs, new_kwargs = npu_fusion_attention_backward_patch(*inputs, **kwargs) + query, key, value, dx, input_layout = new_args[0], new_args[1], new_args[2], new_args[3], new_args[5] + n1 = dims_kwargs.get("n1") + n2 = dims_kwargs.get("n2") + s1 = dims_kwargs.get("s1") + s2 = dims_kwargs.get("s2") + b = dims_kwargs.get("b") + dtype = dims_kwargs.get("dtype") + attn_mask = new_kwargs.get("attn_mask") + keep_prob = new_kwargs.get("keep_prob") + sparse_mode = new_kwargs.get("sparse_mode") + pre_tockens = new_kwargs.get("pre_tockens") + next_tockens = new_kwargs.get("next_tockens") + pse = new_kwargs.get("pse") + softmax_max = new_kwargs.get("softmax_max") + softmax_sum = new_kwargs.get("softmax_sum") + scalar_value = new_kwargs.get("scalar_value") + + args_temp = [sparse_mode, attn_mask, b, n1, s1, s2, pre_tockens, next_tockens, dtype] + 
attn_mask = generate_attn_mask(*args_temp) + + query = convert_to_bnsd(query, n1, input_layout) + dx = convert_to_bnsd(dx, n1, input_layout) + key = convert_to_bnsd(key, n2, input_layout) + value = convert_to_bnsd(value, n2, input_layout) + + k_new, v_new = generate_kv(key, value, n1, n2) + + if SOFTMAX_BUILD_MODE == "QKV": + softmax_res = rebuid_softmax_by_qkv(query, k_new, attn_mask, pse, scalar_value) + else: + softmax_params = RebuildSoftmaxParams(query, k_new, attn_mask, pse, scalar_value, softmax_max, softmax_sum) + softmax_res = rebuild_softmax_by_max_sum(softmax_params) + + backward_params = FaBackwardParams(dx, query, k_new, v_new, softmax_res, None, pse, scalar_value, keep_prob) + dq, dk, dv = fusion_attention_backward(backward_params) + + # Reshape as needed + if dq.dim() == 5: + dq = dq.reshape(dq.size(0), dq.size(1) * dq.size(2), dq.size(3), dq.size(4)) + if dk.dim() == 5: + dk = dk.reshape(dk.size(0), dk.size(1) * dk.size(2), dk.size(3), dk.size(4)) + if dv.dim() == 5: + dv = dv.reshape(dv.size(0), dv.size(1) * dv.size(2), dv.size(3), dv.size(4)) + + dq = convert_from_bnsd(dq, input_layout) + dk = convert_from_bnsd(dk, input_layout) + dv = convert_from_bnsd(dv, input_layout) + + return dq.cpu(), dk.cpu(), dv.cpu() diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/bench_functions/fusion_operator.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/bench_functions/fusion_operator.py new file mode 100644 index 0000000000000000000000000000000000000000..e1344541e89c4dafd9d49d63e3fdea117366bdd9 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/bench_functions/fusion_operator.py @@ -0,0 +1,41 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from msprobe.mindspore.api_accuracy_checker.bench_functions.flash_attention_score import FlashAttentionScore + + +class FusionOperator: + """ + Parent class of all fusion operators, defining common interfaces and attributes. + """ + + # Initialize the operator registry + def __init__(self): + self.flash_attention_score = None # holds the FlashAttentionScore operator + self._register_operators() + + def __getattr__(self, name): + """ Dynamically look up an operator class """ + # __getattr__ only runs after normal attribute lookup fails, so calling + # hasattr(self, name) here would recurse endlessly; raise directly. + raise AttributeError(f"'FusionOperator' object has no attribute '{name}'") + + def _register_operators(self): + """ Register operators on this class so they can be called via fusion.xxx """ + self.flash_attention_score = FlashAttentionScore() + + +fusion = FusionOperator() diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/checker_support_api.yaml b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/checker_support_api.yaml new file mode 100644 index 0000000000000000000000000000000000000000..4de7715215b7f28037d1c553921b0f073e4aa300 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/checker_support_api.yaml @@ -0,0 +1,77 @@ +# Copyright 2024 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ + +# list of api that can be checked + +tensor: + - add_ + - add + - addmm_ + - all + - allclose + - any + - bool + - byte + - ceil + - clamp + - contiguous + - copy_ + - cos + - clone + - cumprod + - expand_as + - flatten + - float + - half + - int + - is_contiguous + - isnan + - item + - log + - log2 + - long + - masked_fill + - max + - mean + - min + - numel + - numpy + - repeat + - repeat_interleave + - reshape + - round + - select + - sin + - size + - split + - sqrt + - square + - sub + - swapaxes + - to + - t + - tolist + - topk + - transpose + - trunc + - type + - unsqueeze + - view + - view_as + - fill_ + - floor_ + - clamp_ + - type_as + - zero_ \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/cmd_parser.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/cmd_parser.py index cf0eeea0bf071d775dd1ccb705ea854f0c2c36db..4af92bfa1002c419d0bd84e5dfd250b712b57136 100644 --- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/cmd_parser.py +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/cmd_parser.py @@ -13,9 +13,57 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+import argparse +import os + +from msprobe.core.common.file_utils import check_file_or_directory_path, create_directory +from msprobe.core.common.utils import Const, MsprobeBaseException + + +class UniqueDeviceAction(argparse.Action): + def __call__(self, parser, namespace, values, option_string=None): + unique_values = set(values) + if len(values) != len(unique_values): + parser.error("device id must be unique") + for device_id in values: + if not 0 <= device_id <= 4095: + parser.error(f"the argument 'device_id' must be in range [0, 4095], but got {device_id}") + setattr(namespace, self.dest, values) + + def add_api_accuracy_checker_argument(parser): parser.add_argument("-api_info", "--api_info_file", dest="api_info_file", type=str, required=True, help=" The api param tool result file: generate from api param tool, " "a json file.") parser.add_argument("-o", "--out_path", dest="out_path", default="./", type=str, required=False, help=" The ut task result out path.") + parser.add_argument("-csv_path", "--result_csv_path", dest="result_csv_path", default="", type=str, required=False, + help=" the existing result csv, used to resume checking") + + +def multi_add_api_accuracy_checker_argument(parser): + parser.add_argument("-api_info", "--api_info_file", dest="api_info_file", type=str, required=True, + help=" The api param tool result file: generate from api param tool, " + "a json file.") + parser.add_argument("-o", "--out_path", dest="out_path", default="./", type=str, required=False, + help=" The ut task result out path.") + parser.add_argument("-csv_path", "--result_csv_path", dest="result_csv_path", default="", type=str, required=False, + help=" the existing result csv, used to resume checking") + # The following are multi-threading parameters + parser.add_argument("-d", "--device", dest="device_id", nargs='+', type=int, + help=" set device id to run ut, must be unique and in range 0-4095", + default=[0], required=False, action=UniqueDeviceAction) + + +def check_args(args): + args.api_info_file = os.path.abspath(args.api_info_file) + 
check_file_or_directory_path(args.api_info_file) + + if args.out_path == "": + args.out_path = "./" + args.out_path = os.path.abspath(args.out_path) + create_directory(args.out_path) + + if args.result_csv_path: + args.result_csv_path = os.path.abspath(args.result_csv_path) + check_file_or_directory_path(args.result_csv_path) diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/compute_element.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/compute_element.py index 47ce0c1e7a283c03284b5c1592f1214c530a64f1..5dcb1421c2b96c61cad766cba7ad5c85107b5b29 100644 --- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/compute_element.py +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/compute_element.py @@ -25,6 +25,7 @@ from msprobe.core.common.file_utils import load_npy from msprobe.mindspore.api_accuracy_checker.type_mapping import (api_info_type_str_to_type, ms_dtype_to_dtype_str, torch_dtype_to_dtype_str, dtype_str_to_ms_dtype, dtype_str_to_np_dtype, + dtype_str_to_mindtorch_dtype, dtype_str_to_torch_dtype, type_to_api_info_type_str, DEFAULT_CONSTRUCT_NP_FLOAT_DTYPE, TUPLE_TYPE_STR, MINDSPORE_TENSOR_TYPE_STR, MINDSPORE_DTYPE_TYPE_STR, @@ -33,6 +34,15 @@ from msprobe.mindspore.api_accuracy_checker.type_mapping import (api_info_type_s from msprobe.mindspore.api_accuracy_checker.utils import check_and_get_from_json_dict, global_context from msprobe.mindspore.common.log import logger +import msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer as env_module + + +if env_module.is_valid_pt_mt_env: + from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import mindtorch + from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import torch +else: + import torch + class MstensorMetaData: def __init__(self, dtype_str, npy_path, maximum, minimum, shape) -> None: @@ -78,16 +88,45 @@ class ComputeElement: else: torch_dtype = dtype_str_to_torch_dtype.get(dtype_str) - if dtype_str 
in float_dtype_str_list:
-            middle_dtype = mindspore.float64
-        elif dtype_str in int_dtype_str_list:
+        if dtype_str in int_dtype_str_list:
             middle_dtype = mindspore.int64
         else:
-            middle_dtype = mindspore.uint64
+            middle_dtype = mindspore.float64
         np_ndarray = ms_tensor.astype(middle_dtype).numpy()
         torch_tensor = torch.from_numpy(np_ndarray).to(torch_dtype)
         return torch_tensor
 
+    @staticmethod
+    def transfer_to_mindtorch_tensor(ms_tensor):
+        """
+        Args:
+            ms_tensor: mindspore.Tensor
+        Return:
+            mindtorch_tensor: mindtorch.Tensor
+        """
+
+        ms_dtype = ms_tensor.dtype
+
+        dtype_str = ms_dtype_to_dtype_str.get(ms_dtype)
+
+        if dtype_str not in dtype_str_to_mindtorch_dtype:
+            err_msg = f"ComputeElement.transfer_to_mindtorch_tensor failed: no matching mindtorch dtype for {dtype_str}"
+            logger.error_log_with_exp(err_msg,
+                                      ApiAccuracyCheckerException(ApiAccuracyCheckerException.UnsupportType))
+        else:
+            mindtorch_dtype = dtype_str_to_mindtorch_dtype.get(dtype_str)
+
+        if dtype_str in int_dtype_str_list:
+            middle_dtype = mindspore.int64
+        else:
+            middle_dtype = mindspore.float64
+
+        np_ndarray = ms_tensor.astype(middle_dtype).numpy()
+
+        mindtorch_tensor = mindtorch.from_numpy(np_ndarray).to(mindtorch_dtype)
+
+        return mindtorch_tensor
+
     @staticmethod
     def transfer_to_mindspore_tensor(torch_tensor):
         '''
@@ -106,10 +145,10 @@ class ComputeElement:
         else:
             ms_dtype = dtype_str_to_ms_dtype.get(dtype_str)
 
-        if dtype_str in float_dtype_str_list:
-            middle_dtype = torch.float64
-        elif dtype_str in int_dtype_str_list:
+        if dtype_str in int_dtype_str_list:
             middle_dtype = torch.int64
+        else:
+            middle_dtype = torch.float64
         np_ndarray = torch_tensor.to(middle_dtype, copy=True).numpy()
         ms_tensor = mindspore.Tensor.from_numpy(np_ndarray).astype(ms_dtype)
         return ms_tensor
@@ -143,8 +182,11 @@ class ComputeElement:
         elif isinstance(self.parameter, DtypeMetaData):
             if tensor_platform == Const.MS_FRAMEWORK:
                 parameter_tmp = dtype_str_to_ms_dtype.get(self.parameter.dtype_str)
-            else:
+            elif tensor_platform == 
Const.PT_FRAMEWORK: parameter_tmp = dtype_str_to_torch_dtype.get(self.parameter.dtype_str) + elif tensor_platform == Const.MT_FRAMEWORK: + parameter_tmp = dtype_str_to_mindtorch_dtype.get(self.parameter.dtype_str) + elif isinstance(self.parameter, MstensorMetaData): mstensor_meta_data = self.parameter ms_dtype = dtype_str_to_ms_dtype.get(mstensor_meta_data.dtype_str) @@ -163,6 +205,8 @@ class ComputeElement: # if necessary, do transfer if not get_origin and isinstance(parameter_tmp, mindspore.Tensor) and tensor_platform == Const.PT_FRAMEWORK: parameter = self.transfer_to_torch_tensor(parameter_tmp) + elif not get_origin and isinstance(parameter_tmp, mindspore.Tensor) and tensor_platform == Const.MT_FRAMEWORK: + parameter = self.transfer_to_mindtorch_tensor(parameter_tmp) elif not get_origin and isinstance(parameter_tmp, torch.Tensor) and tensor_platform == Const.MS_FRAMEWORK: parameter = self.transfer_to_mindspore_tensor(parameter_tmp) else: diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/data_manager.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/data_manager.py new file mode 100644 index 0000000000000000000000000000000000000000..fc2680d68a5697dae165c70a276b21038f87fbe0 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/data_manager.py @@ -0,0 +1,302 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
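The new data_manager.py below implements append-only CSV output with a write-header-once flag, plus resume support that rebuilds the set of already-checked APIs from an existing result CSV. A minimal, self-contained sketch of that pattern using only the standard library (`append_rows`, `load_checked_apis`, and the header names are illustrative, not the msprobe API):

```python
import csv
import os
import tempfile

HEADER = ["API Name", "Forward Test Success", "Backward Test Success", "Message"]


def append_rows(csv_path, rows):
    # Write the header only when the file does not exist yet, then append rows.
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(HEADER)
        writer.writerows(rows)


def load_checked_apis(csv_path):
    # Rebuild the set of already-checked API names so a resumed run can skip them.
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader, [])
        if "API Name" not in headers:
            return set()
        idx = headers.index("API Name")
        return {row[idx] for row in reader if len(row) > idx and row[idx]}


path = os.path.join(tempfile.mkdtemp(), "accuracy_checking_result.csv")
append_rows(path, [["Functional.relu.0", "pass", "pass", ""]])
append_rows(path, [["Tensor.add.0", "error", "pass", "forward mismatch"]])
print(sorted(load_checked_apis(path)))  # ['Functional.relu.0', 'Tensor.add.0']
```

Because the file is only ever opened in append mode, a crash between calls loses at most the rows of the interrupted write; the header is never duplicated.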
+
+import os
+import csv
+
+from msprobe.core.common.const import Const, CompareConst
+from msprobe.core.common.file_utils import FileOpen, create_directory, write_csv, read_csv
+from msprobe.core.common.utils import add_time_as_suffix, MsprobeBaseException
+from msprobe.mindspore.api_accuracy_checker.base_compare_algorithm import compare_algorithms
+from msprobe.core.common.file_utils import check_file_or_directory_path
+from msprobe.mindspore.common.log import logger
+from msprobe.mindspore.common.const import MsCompareConst
+
+
+class ResultCsvEntry:
+    def __init__(self) -> None:
+        self.forward_pass_status = None
+        self.backward_pass_status = None
+        self.forward_err_msg = ""
+        self.backward_err_msg = ""
+        self.overall_err_msg = None
+
+
+def write_csv_header(csv_path, header_func):
+    """Write the CSV header on the first write."""
+    header = header_func()  # fetch the header row
+    logger.debug(f"Writing CSV header: {header}")
+    write_csv([header], csv_path, mode="a+")
+
+
+def get_result_csv_header():
+    """Return the header of the result CSV file."""
+    return [
+        MsCompareConst.DETAIL_CSV_API_NAME,
+        MsCompareConst.RESULT_CSV_FORWARD_TEST_SUCCESS,
+        MsCompareConst.RESULT_CSV_BACKWARD_TEST_SUCCESS,
+        MsCompareConst.DETAIL_CSV_MESSAGE,
+    ]
+
+
+def get_detail_csv_header():
+    """Return the header of the detail CSV file."""
+    detail_csv_header_basic_info = [
+        MsCompareConst.DETAIL_CSV_API_NAME,
+        MsCompareConst.DETAIL_CSV_BENCH_DTYPE,
+        MsCompareConst.DETAIL_CSV_TESTED_DTYPE,
+        MsCompareConst.DETAIL_CSV_SHAPE,
+    ]
+    detail_csv_header_compare_result = list(compare_algorithms.keys())
+    detail_csv_header_status = [
+        MsCompareConst.DETAIL_CSV_PASS_STATUS,
+        MsCompareConst.DETAIL_CSV_MESSAGE,
+    ]
+    return detail_csv_header_basic_info + detail_csv_header_compare_result + detail_csv_header_status
+
+
+def check_csv_header(headers, required_constants, csv_path):
+    """Raise if the CSV header is missing any of the required constants."""
+    missing_constants = [const for const in required_constants if not any(const in header for header in headers)]
+
+    if missing_constants:
+        raise MsprobeBaseException(
+            MsprobeBaseException.MISSING_HEADER_ERROR,
+            f"{csv_path} is missing the required header fields: {missing_constants}"
+        )
+
+
+class DataManager:
+    def __init__(self, csv_dir, result_csv_path):
+        self.results = {}
+        self.results_exception_skip = {}
+        self.is_first_write = True  # flag that controls writing the headers
+        self.csv_dir = csv_dir
+        self.api_names_set = set()  # API names that have already been checked
+        # When result_csv_path is given, resume from the previous run.
+        if result_csv_path:
+            self.resume_from_last_csv(result_csv_path)
+            self.initialize_api_names_set(result_csv_path)
+        else:
+            # Otherwise derive fresh output paths, used on the first write.
+            self.result_out_path = os.path.join(self.csv_dir, add_time_as_suffix(MsCompareConst.RESULT_CSV_FILE_NAME))
+            self.detail_out_path = os.path.join(
+                self.csv_dir,
+                os.path.basename(self.result_out_path).replace("result", "details")
+            )
+
+        if self.detail_out_path and os.path.exists(self.detail_out_path):
+            check_file_or_directory_path(self.detail_out_path)
+
+        if self.result_out_path and os.path.exists(self.result_out_path):
+            check_file_or_directory_path(self.result_out_path)
+
+    def initialize_api_names_set(self, result_csv_path):
+        """Read the existing CSV and collect the API names that already appear in it."""
+        csv_data = read_csv(result_csv_path, as_pd=False)
+
+        # Read the header row (empty when the file is empty).
+        headers = csv_data[0] if csv_data else []
+
+        # Validate the header; check_csv_header raises when a required field is missing.
+        check_csv_header(headers, get_result_csv_header(), result_csv_path)
+
+        # Locate the "API Name" column.
+        api_name_index = None
+        for i, header in enumerate(headers):
+            if MsCompareConst.DETAIL_CSV_API_NAME in header:  # the header row may carry a BOM, so match by containment
+                api_name_index = i
+                break
+
+        if api_name_index is None:
+            logger.warning(f"{result_csv_path} No column contains 'API Name'.")
+            return
+
+        # Collect the API name from every data row.
+        for row in csv_data[1:]:  # skip the header row
+            if row and len(row) > api_name_index:
+                api_name = row[api_name_index]
+                if api_name:
+                    self.api_names_set.add(api_name)
+
+        logger.debug(f"Initialized API names set from existing CSV: {self.api_names_set}")
+
+    def is_unique_api(self, api_name):
+        """Return False when the API name was seen before; otherwise record it and return True."""
+        if api_name in self.api_names_set:
+            return False
+        self.api_names_set.add(api_name)
+        return True
+
+    def resume_from_last_csv(self, result_csv_path):
+        """Resume from the result_csv_path of the previous run."""
+        # Directory of the previous run.
+        last_dir = os.path.dirname(result_csv_path)
+
+        # Reuse that directory and its output paths on the first write.
+        self.csv_dir = last_dir
+        self.detail_out_path = os.path.join(last_dir, os.path.basename(result_csv_path).replace("result", "details"))
+        if self.detail_out_path and os.path.exists(self.detail_out_path):
+            check_file_or_directory_path(self.detail_out_path)
+        self.result_out_path = result_csv_path
+        self.is_first_write = False
+
+    def save_results(self, api_name_str):
+        if self.is_first_write:
+            # Write the headers directly.
+            logger.info("Writing CSV headers for the first time.")
+            write_csv_header(self.detail_out_path, get_detail_csv_header)
+            write_csv_header(self.result_out_path, get_result_csv_header)
+            self.is_first_write = False  # avoid writing the headers again
+
+        # Write the detailed output and the result summary, then clear the results.
+        logger.debug("Starting to write detailed output to CSV.")
+        self.to_detail_csv(self.detail_out_path)
+        logger.debug(f"Detailed output for {api_name_str} written to {self.detail_out_path}.")
+
+        logger.debug("Starting to write result summary to CSV.")
+        self.to_result_csv(self.result_out_path)
+        logger.debug(f"Result summary for {api_name_str} written to {self.result_out_path}.")
+
+        # Clear the records, ready for the next call.
+        self.clear_results()
+
+    def record(self, output_list):
+        if output_list is None:
+            return
+        for output in output_list:
+            api_real_name, forward_or_backward, basic_info, compare_result_dict = output
+            key = (api_real_name, forward_or_backward)
+            if key not in self.results:
+                self.results[key] = []
+            self.results[key].append((basic_info, compare_result_dict))
+            logger.debug(f"Updated self.results for key {key}: {self.results[key]}")
+        logger.debug(f"Complete self.results after recording: {self.results}")
+
+    def record_exception_skip(self, api_name, forward_or_backward, err_msg):
+        '''
+        
record exception_skip information into self.results_exception_skip.
+        self.results_exception_skip: dict{str: dict{"forward": str/None, "backward": str/None}}
+        the key is the api_name and the values are the err_msg strings
+        '''
+        if api_name not in self.results_exception_skip:
+            self.results_exception_skip[api_name] = {Const.FORWARD: None, Const.BACKWARD: None}
+        self.results_exception_skip[api_name][forward_or_backward] = err_msg
+
+    def clear_results(self):
+        """Clear the self.results data."""
+        logger.debug("Clearing self.results data.")
+        self.results.clear()
+        self.results_exception_skip.clear()
+
+    def to_detail_csv(self, csv_path):
+        logger.debug("Preparing detail CSV headers and rows.")
+        detail_csv = []
+
+        detail_csv_header_compare_result = list(compare_algorithms.keys())
+
+        for _, results in self.results.items():
+            for res in results:
+                basic_info, compare_result_dict = res
+                csv_row_basic_info = [
+                    basic_info.api_name,
+                    basic_info.bench_dtype,
+                    basic_info.tested_dtype,
+                    basic_info.shape
+                ]
+                csv_row_compare_result = [
+                    compare_result_dict.get(algorithm_name).compare_value
+                    for algorithm_name in detail_csv_header_compare_result
+                ]
+                csv_row_status = [basic_info.status, basic_info.err_msg]
+                csv_row = csv_row_basic_info + csv_row_compare_result + csv_row_status
+                detail_csv.append(csv_row)
+                logger.debug(f"Detail CSV row added: {csv_row}")
+
+        logger.debug(f"Writing detail CSV to {csv_path}.")
+        write_csv(detail_csv, csv_path, mode="a+")
+        logger.debug(f"Detail CSV written successfully to {csv_path}.")
+
+    def to_result_csv(self, csv_path):
+        '''
+        depends on both self.results and self.results_exception_skip
+        '''
+        logger.debug("Preparing result CSV data.")
+        result_csv = []
+
+        result_csv_dict = {}
+        for key, results in self.results.items():
+            api_real_name, forward_or_backward = key
+            pass_status = CompareConst.PASS
+            overall_err_msg = ""
+
+            for res in results:
+                basic_info, _ = res
+                if basic_info.status != CompareConst.PASS:
+                    pass_status = CompareConst.ERROR
+                    overall_err_msg += basic_info.err_msg
+
+            overall_err_msg = "" if pass_status == CompareConst.PASS else overall_err_msg
+
+            if api_real_name not in result_csv_dict:
+                result_csv_dict[api_real_name] = ResultCsvEntry()
+            if forward_or_backward == Const.FORWARD:
+                result_csv_dict[api_real_name].forward_pass_status = pass_status
+                result_csv_dict[api_real_name].forward_err_msg = overall_err_msg
+            else:
+                result_csv_dict[api_real_name].backward_pass_status = pass_status
+                result_csv_dict[api_real_name].backward_err_msg = overall_err_msg
+
+        for api_name, entry in result_csv_dict.items():
+            overall_err_msg = "" if (entry.forward_pass_status == CompareConst.PASS and
+                                     entry.backward_pass_status == CompareConst.PASS) else \
+                entry.forward_err_msg + entry.backward_err_msg
+            row = [
+                api_name,
+                entry.forward_pass_status,
+                entry.backward_pass_status,
+                overall_err_msg
+            ]
+            # change the row if this api has exception_skip information
+            if api_name in self.results_exception_skip:
+                if self.results_exception_skip[api_name][Const.FORWARD] is not None:
+                    row[1] = CompareConst.SKIP
+                    row[-1] += self.results_exception_skip[api_name][Const.FORWARD]
+                if self.results_exception_skip[api_name][Const.BACKWARD] is not None:
+                    row[2] = CompareConst.SKIP
+                    row[-1] += self.results_exception_skip[api_name][Const.BACKWARD]
+                del self.results_exception_skip[api_name]
+            result_csv.append(row)
+            logger.debug(f"Result CSV row added: {row}")
+        for api_name in self.results_exception_skip:
+            current_exception_skip = self.results_exception_skip[api_name]
+            forward_status = None
+            backward_status = None
+            err_msg = ""
+            if current_exception_skip[Const.FORWARD] is not None:
+                forward_status = CompareConst.SKIP
+                err_msg += current_exception_skip[Const.FORWARD]
+            if current_exception_skip[Const.BACKWARD] is not None:
+                backward_status = CompareConst.SKIP
+                err_msg += current_exception_skip[Const.BACKWARD]
+            row = [api_name, forward_status, backward_status, err_msg]
+            result_csv.append(row)
+
+        write_csv(result_csv, csv_path, mode="a+")
+        logger.debug(f"Result CSV written successfully to {csv_path}.")
+
+        # Set the flag to False to avoid writing the headers again.
+        self.is_first_write = False
diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/main.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/main.py
index 64060979893ca56e352ccde9c1a16b8ef836934a..e3963cb9fd4e71a773fa70565d8056b2a742b4a4 100644
--- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/main.py
+++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/main.py
@@ -15,10 +15,20 @@
 from msprobe.mindspore.api_accuracy_checker.api_accuracy_checker import ApiAccuracyChecker
+from msprobe.mindspore.api_accuracy_checker.multi_api_accuracy_checker import MultiApiAccuracyChecker
+
+from msprobe.mindspore.api_accuracy_checker.cmd_parser import check_args
+
 
 def api_checker_main(args):
-    api_accuracy_checker = ApiAccuracyChecker()
+    check_args(args)
+    api_accuracy_checker = ApiAccuracyChecker(args)
+    api_accuracy_checker.parse(args.api_info_file)
+    api_accuracy_checker.run_and_compare()
+
+
+def mul_api_checker_main(args):
+    check_args(args)
+    api_accuracy_checker = MultiApiAccuracyChecker(args)
     api_accuracy_checker.parse(args.api_info_file)
     api_accuracy_checker.run_and_compare()
-    api_accuracy_checker.to_detail_csv(args.out_path)
-    api_accuracy_checker.to_result_csv(args.out_path)
diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/multi_api_accuracy_checker.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/multi_api_accuracy_checker.py
new file mode 100644
index 0000000000000000000000000000000000000000..1913675ad162bf690fc0aed5fc84c245ae4f73ca
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/multi_api_accuracy_checker.py
@@ -0,0 +1,213 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Standard library imports
+import multiprocessing
+from multiprocessing import Manager
+import os
+import signal
+import sys
+import time
+
+# Third-party imports
+from mindspore import context
+import numpy as np
+from tqdm import tqdm
+
+# Local application imports
+from msprobe.core.common.const import Const, CompareConst
+from msprobe.mindspore.api_accuracy_checker.api_accuracy_checker import ApiAccuracyChecker, BasicInfoAndStatus
+from msprobe.mindspore.api_accuracy_checker.multi_data_manager import MultiDataManager
+from msprobe.mindspore.common.log import logger
+from msprobe.mindspore.common.const import MsCompareConst
+
+
+class MultiApiAccuracyChecker(ApiAccuracyChecker):
+    def __init__(self, args):
+        # Attributes specific to MultiApiAccuracyChecker can be added here.
+        self.api_infos = dict()
+
+        # Use a Manager to create shared state so the processes stay in sync.
+        self.manager = Manager()
+        self.is_first_write = self.manager.Value('b', True)  # shared flag
+
+        # Pass the shared is_first_write when initializing the data manager.
+        self.multi_data_manager = MultiDataManager(args.out_path, args.result_csv_path, self.is_first_write)
+
+        self.args = args  # keep args as an attribute
+
+        # Current device id, used in log messages.
+        self.current_device_id = None
+
+    def process_on_device(self, device_id, api_infos, progress_queue):
+        """
+        Process a subset of APIs on a specific device.
+
+        Parameters:
+            device_id (int): the device to use.
+            api_infos (list): list of (api_name, api_info) tuples.
+            progress_queue (multiprocessing.Queue): queue used to report progress.
+        """
+
+        # Record the current device id.
+        self.current_device_id = device_id
+
+        # Set the device id in the MindSpore context.
+        context.set_context(device_id=device_id)
+
+        # Iterate over the tasks assigned to this process.
+        for api_name_str, api_info in api_infos:
+            logger.debug(f"Processing API: {api_name_str}, Device: {device_id}")
+
+            if not self.multi_data_manager.is_unique_api(api_name_str):
+                logger.debug(f"API {api_name_str} is not unique, skipping.")
+                progress_queue.put(1)
+                continue
+
+            # Forward check.
+            forward_output_list = self.process_forward(api_name_str, api_info)
+            if forward_output_list is not Const.EXCEPTION_NONE:
+                self.multi_data_manager.record(forward_output_list)
+
+            # Backward check.
+            backward_output_list = self.process_backward(api_name_str, api_info)
+            if backward_output_list is not Const.EXCEPTION_NONE:
+                self.multi_data_manager.record(backward_output_list)
+
+            # Save the results.
+            self.multi_data_manager.save_results(api_name_str)
+            progress_queue.put(1)  # report progress
+
+    def run_and_compare(self):
+        # Devices to use.
+        device_ids = self.args.device_id
+
+        # APIs to process, split across the devices.
+        partitioned_api_infos = list(self.api_infos.items())
+
+        # Interleaved task split in the main process (modulo-based).
+        partitioned_api_infos_split = [[] for _ in range(len(device_ids))]
+        for idx, api_info in enumerate(partitioned_api_infos):
+            device_index = idx % len(device_ids)  # round-robin assignment
+            partitioned_api_infos_split[device_index].append(api_info)
+
+        # Shared progress queue.
+        progress_queue = multiprocessing.Queue()
+
+        # Progress bar.
+        total_tasks = len(partitioned_api_infos)  # total number of tasks
+        with tqdm(total=total_tasks, desc="Total Progress", ncols=100) as pbar:
+            # Start the worker processes.
+            processes = []
+            for index, device_id in enumerate(device_ids):
+                process = multiprocessing.Process(target=self.process_on_device,
+                                                  args=(device_id, partitioned_api_infos_split[index], progress_queue))
+                processes.append(process)
+                process.start()
+
+            # The main process updates the progress bar.
+            completed_tasks = 0
+            while completed_tasks < total_tasks:
+                try:
+                    completed_tasks += progress_queue.get(timeout=Const.PROGRESS_TIMEOUT)  # timeout in seconds
+                    pbar.update(1)
+                except multiprocessing.queues.Empty:
+                    logger.error("Timeout while waiting for progress updates. Skipping remaining tasks.")
+                    break
+
+                # Check child-process status (iterate over a copy so removals do not skip elements).
+                for process in list(processes):
+                    if not process.is_alive():
+                        if process.exitcode != 0:
+                            logger.error(f"Process {process.pid} exited with code {process.exitcode}.")
+                            total_tasks -= len(partitioned_api_infos_split[processes.index(process)])
+                        processes.remove(process)
+
+            # Make sure every child process finishes or is terminated.
+            for process in processes:
+                process.join(timeout=Const.PROGRESS_TIMEOUT)
+                if process.is_alive():
+                    logger.error(f"Process {process.pid} did not terminate. Forcing termination.")
+                    process.terminate()
+
+    def process_forward(self, api_name_str, api_info):
+        """
+        Overrides the parent class's process_forward method to log the device ID when exceptions occur.
+
+        Parameters:
+            api_name_str (str): The name of the API.
+            api_info (object): The API information object.
+
+        Returns:
+            list or None: The forward output list or None if an error occurs.
+        """
+        if not api_info.check_forward_info():
+            logger.debug(
+                f"[Device {self.current_device_id}] API: {api_name_str} lacks forward information, skipping "
+                f"forward check.")
+            return Const.EXCEPTION_NONE
+
+        try:
+            forward_inputs_aggregation = self.prepare_api_input_aggregation(api_info, Const.FORWARD)
+        except Exception as e:
+            logger.warning(
+                f"[Device {self.current_device_id}] Exception occurred while getting forward API inputs for "
+                f"{api_name_str}. Skipping forward check. Detailed exception information: {e}.")
+            return Const.EXCEPTION_NONE
+
+        forward_output_list = None
+        try:
+            forward_output_list = self.run_and_compare_helper(api_info, api_name_str, forward_inputs_aggregation,
+                                                              Const.FORWARD)
+        except Exception as e:
+            logger.warning(
+                f"[Device {self.current_device_id}] Exception occurred while running and comparing {api_name_str} "
+                f"forward API. Detailed exception information: {e}.")
+        return forward_output_list
+
+    def process_backward(self, api_name_str, api_info):
+        """
+        Overrides the parent class's process_backward method to log the device ID when exceptions occur. 
+ + Parameters: + api_name_str (str): The name of the API. + api_info (object): The API information object. + + Returns: + list or None: The backward output list or None if an error occurs. + """ + if not api_info.check_backward_info(): + logger.debug( + f"[Device {self.current_device_id}] API: {api_name_str} lacks backward information, skipping " + f"backward check.") + return Const.EXCEPTION_NONE + + try: + backward_inputs_aggregation = self.prepare_api_input_aggregation(api_info, Const.BACKWARD) + except Exception as e: + logger.warning( + f"[Device {self.current_device_id}] Exception occurred while getting backward API inputs for " + f"{api_name_str}. Skipping backward check. Detailed exception information: {e}.") + return Const.EXCEPTION_NONE + + backward_output_list = None + try: + backward_output_list = self.run_and_compare_helper(api_info, api_name_str, backward_inputs_aggregation, + Const.BACKWARD) + except Exception as e: + logger.warning( + f"[Device {self.current_device_id}] Exception occurred while running and comparing {api_name_str} " + f"backward API. Detailed exception information: {e}.") + return backward_output_list \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/multi_data_manager.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/multi_data_manager.py new file mode 100644 index 0000000000000000000000000000000000000000..19d5f1b0f978e79e3e2bdfe4337fd44ec76e7a9f --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/multi_data_manager.py @@ -0,0 +1,60 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
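MultiApiAccuracyChecker.run_and_compare above splits the API list across devices with `idx % len(device_ids)`. The round-robin partitioning on its own can be sketched as (hypothetical helper, not part of msprobe):

```python
def partition_round_robin(items, workers):
    # Assign item i to bucket i % workers, mirroring the device split above:
    # bucket sizes never differ by more than one, regardless of item order.
    if workers <= 0:
        raise ValueError("workers must be positive")
    buckets = [[] for _ in range(workers)]
    for idx, item in enumerate(items):
        buckets[idx % workers].append(item)
    return buckets


apis = ["api_0", "api_1", "api_2", "api_3", "api_4"]
print(partition_round_robin(apis, 2))  # [['api_0', 'api_2', 'api_4'], ['api_1', 'api_3']]
```

The interleaved split is done once in the main process, so each worker receives a fixed, disjoint slice and no queue-based work stealing is needed.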
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import multiprocessing
+import os
+
+from msprobe.mindspore.api_accuracy_checker.data_manager import (DataManager, ResultCsvEntry, write_csv_header,
+                                                                 get_result_csv_header, get_detail_csv_header,
+                                                                 check_csv_header)
+from msprobe.mindspore.common.log import logger
+
+
+class MultiDataManager(DataManager):
+    def __init__(self, csv_dir, result_csv_path, shared_is_first_write):
+        super().__init__(csv_dir, result_csv_path)
+
+        # The shared is_first_write flag controls when the headers are written.
+        self.shared_is_first_write = shared_is_first_write
+        # Lock that keeps the writes process-safe.
+        self.lock = multiprocessing.Lock()
+
+    def save_results(self, api_name_str):
+        """Save the results; process-safe."""
+
+        with self.lock:  # make sure only one process saves at a time
+            if self.is_first_write and self.shared_is_first_write.value:
+                self.shared_is_first_write.value = False
+                self.is_first_write = False  # avoid writing the headers again
+                # Write the headers directly.
+                logger.info("Writing CSV headers for the first time.")
+                write_csv_header(self.detail_out_path, get_detail_csv_header)
+                write_csv_header(self.result_out_path, get_result_csv_header)
+
+            # Write the detailed output and the result summary, then clear the results.
+            self.to_detail_csv(self.detail_out_path)
+            logger.debug(f"Detailed output for {api_name_str} written to {self.detail_out_path}.")
+
+            self.to_result_csv(self.result_out_path)
+            logger.debug(f"Result summary for {api_name_str} written to {self.result_out_path}.")
+
+            # Clear the records, ready for the next call.
+            self.clear_results()
+
+    def clear_results(self):
+        """Clear self.results and self.results_exception_skip; process-safe."""
+        self.results.clear()
+        self.results_exception_skip.clear()
diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/torch_mindtorch_importer.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/torch_mindtorch_importer.py
new file mode 100644
index 0000000000000000000000000000000000000000..7b319382eb4eba4abac3bd6894cc3b0262032d88
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/torch_mindtorch_importer.py
@@ -0,0 +1,130 @@
+# Copyright (c) 2025-2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
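torch_mindtorch_importer.py (this new file) obtains the real PyTorch alongside mindtorch by purging the shim's `torch` entries from `sys.modules` and pruning its directory from `sys.path`. The module-purge step, generalized to any package prefix and run here against a plain dict instead of the live `sys.modules` (so the sketch is side-effect free), can be outlined as:

```python
def purge_package(modules, package):
    # Remove `package` and all of its submodules from a module registry such
    # as sys.modules, returning the removed names. The `package + "."` check
    # mirrors the importer's `module == "torch" or module.startswith("torch.")`
    # test, so unrelated packages like `torchvision` survive a purge of `torch`.
    doomed = [name for name in modules
              if name == package or name.startswith(package + ".")]
    for name in doomed:
        del modules[name]
    return sorted(doomed)


registry = {"torch": object(), "torch.nn": object(),
            "torchvision": object(), "numpy": object()}
print(purge_package(registry, "torch"))  # ['torch', 'torch.nn']
```

Against the real `sys.modules` this forces the next `import torch` to go through the (pruned) `sys.path` again, which is exactly what the importer relies on.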
+ +import os +import gc +import sys +from pathlib import Path +import mindspore +from msprobe.mindspore.common.log import logger +from msprobe.core.common.const import Const, CompareConst +from msprobe.mindspore.common.const import MsCompareConst +import torch as mindtorch +from torch import Tensor as mindtorch_tensor +import torch.nn.functional as mindtorch_func +import torch.distributed as mindtorch_dist + + +is_valid_pt_mt_env = True + + +def is_mindtorch(): + mindtorch_check_result = False + try: + import torch as test_torch + from mindspore import Tensor as MindsporeTensor + except ImportError: + return mindtorch_check_result + tensor = test_torch.tensor(0.0) + if isinstance(tensor, MindsporeTensor): + mindtorch_check_result = True + + return mindtorch_check_result + + +def remove_torch_related_paths(): + removed_paths = [] + if not is_mindtorch(): + return + try: + import torch as remove_torch + torch_file = remove_torch.__file__ + except ImportError: + return + + torch_dir = os.path.dirname(torch_file) + + torch_dir_path = Path(torch_dir).resolve() + parent_dir = torch_dir_path.parent + + paths_to_remove = [str(parent_dir)] + + for path in paths_to_remove: + try: + path_resolved = str(Path(path).resolve()) + except Exception as error: + logger.debug(f"Failed to resolve path {path}: {error}") + continue + + if path_resolved in sys.path: + index = sys.path.index(path_resolved) + removed_paths.append((path_resolved, index)) + sys.path.pop(index) + + return + + +def clear_torch_from_sys_modules(): + modules_to_remove = [] + for module in sys.modules: + if module == "torch" or module.startswith("torch."): + modules_to_remove.append(module) + + for module in modules_to_remove: + del sys.modules[module] + + +def set_pt_mt_env_invalid(): + global is_valid_pt_mt_env + is_valid_pt_mt_env = False + + +def delete_torch_paths(): + + if not is_mindtorch(): + set_pt_mt_env_invalid() + + clear_torch_from_sys_modules() + + for count_delete_env_path in 
range(MsCompareConst.MAX_RECURSION_DEPTH):
+        if not is_mindtorch():
+            break
+
+        remove_torch_related_paths()
+
+        clear_torch_from_sys_modules()
+
+        if count_delete_env_path >= MsCompareConst.MAX_RECURSION_DEPTH - 1:
+            raise Exception(f"Please check if you have a valid PyTorch and MindTorch environment, and ensure "
+                            f"the PYTHONPATH environment variable depth does not exceed {MsCompareConst.MAX_RECURSION_DEPTH}.")
+
+
+if not is_mindtorch():
+    set_pt_mt_env_invalid()
+
+else:
+    initial_sys_path = sys.path.copy()
+    delete_torch_paths()
+
+    gc.collect()
+
+    import torch
+
+    if is_mindtorch():
+        set_pt_mt_env_invalid()
+
+    sys.path = initial_sys_path
+
+
diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/type_mapping.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/type_mapping.py
index 0e23a441c9df06c7a9b3fcc3b001397120fedc76..12981183698327556c92c6f5140f84109d4b2816 100644
--- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/type_mapping.py
+++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/type_mapping.py
@@ -15,10 +15,18 @@
 import mindspore
 import numpy as np
-import torch
 from mindspore._c_expression import typing
 from mindspore.common import dtype as mstype
+from msprobe.mindspore.api_accuracy_checker import torch_mindtorch_importer
+
+if torch_mindtorch_importer.is_valid_pt_mt_env:
+    from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import mindtorch
+    from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import torch
+else:
+    from msprobe.mindspore.api_accuracy_checker.torch_mindtorch_importer import mindtorch
+    import torch
+
 INT8 = "Int8"
 UINT8 = "UInt8"
 INT16 = "Int16"
@@ -82,6 +90,21 @@ dtype_str_to_torch_dtype = {
 }
 torch_dtype_to_dtype_str = {value: key for key, value in dtype_str_to_torch_dtype.items()}
+
+dtype_str_to_mindtorch_dtype = {
+    INT8: mindtorch.int8,
+    UINT8: mindtorch.uint8,
+    INT16: mindtorch.int16,
+    INT32: mindtorch.int32,
+    INT64: mindtorch.int64,
+ FLOAT16: mindtorch.float16, + FLOAT32: mindtorch.float32, + FLOAT64: mindtorch.float64, + BOOL: mindtorch.bool, + BFLOAT16: mindtorch.bfloat16, +} +mindtorch_dtype_to_dtype_str = {value: key for key, value in dtype_str_to_mindtorch_dtype.items()} + MINDSPORE_TENSOR_TYPE_STR = "mindspore.Tensor" BOOL_TYPE_STR = "bool" INT_TYPE_STR = "int" diff --git a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/utils.py b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/utils.py index a4fbbe142d4dad83344da8988baf9a7653d2b8b6..56503e757dfad4549439657a16c6114147ce761b 100644 --- a/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/utils.py +++ b/debug/accuracy_tools/msprobe/mindspore/api_accuracy_checker/utils.py @@ -82,10 +82,12 @@ class GlobalContext: def __init__(self): self.is_constructed = True self.dump_data_dir = "" + self.framework = Const.MS_FRAMEWORK - def init(self, is_constructed, dump_data_dir): + def init(self, is_constructed, dump_data_dir, framework): self.is_constructed = is_constructed self.dump_data_dir = dump_data_dir + self.framework = framework def get_dump_data_dir(self): return self.dump_data_dir @@ -93,5 +95,8 @@ class GlobalContext: def get_is_constructed(self): return self.is_constructed + def get_framework(self): + return self.framework + global_context = GlobalContext() diff --git a/debug/accuracy_tools/msprobe/mindspore/cell_processor.py b/debug/accuracy_tools/msprobe/mindspore/cell_processor.py index 604463275494a11d68fa45336fd5c207904932a8..6dc5d510ef51ab2a135a8bdf9f15ac670fba9e56 100644 --- a/debug/accuracy_tools/msprobe/mindspore/cell_processor.py +++ b/debug/accuracy_tools/msprobe/mindspore/cell_processor.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. 
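type_mapping.py above derives `mindtorch_dtype_to_dtype_str` by inverting `dtype_str_to_mindtorch_dtype` with a dict comprehension, which is only lossless while the forward mapping is one-to-one. The pattern, with plain strings standing in for the real mindtorch dtype objects:

```python
# Forward table maps dtype strings to framework dtypes (strings stand in
# for the real mindtorch dtype objects in this sketch).
dtype_str_to_fw_dtype = {
    "Int8": "fw.int8",
    "Float32": "fw.float32",
    "BFloat16": "fw.bfloat16",
}

# Reverse table, valid only because the forward mapping is injective;
# if two strings mapped to the same dtype, one entry would be silently lost.
fw_dtype_to_dtype_str = {value: key for key, value in dtype_str_to_fw_dtype.items()}

assert len(fw_dtype_to_dtype_str) == len(dtype_str_to_fw_dtype)
print(fw_dtype_to_dtype_str["fw.bfloat16"])  # BFloat16
```

The length assertion is a cheap guard worth keeping near such inversions: it fails immediately if a later edit makes the forward table non-injective.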
-from msprobe.core.data_dump.scope import ModuleRangeScope +from msprobe.core.data_dump.scope import ModuleRangeScope, MixRangeScope from msprobe.core.common.const import Const @@ -24,10 +24,7 @@ class CellProcessor: module_node = {} def __init__(self, scope): - if isinstance(scope, ModuleRangeScope): - self.scope = scope - else: - self.scope = None + self.scope = scope if isinstance(scope, (ModuleRangeScope, MixRangeScope)) else None @staticmethod def set_cell_count(cell_name): @@ -36,23 +33,22 @@ class CellProcessor: else: CellProcessor.cell_count[cell_name] += 1 return CellProcessor.cell_count[cell_name] - + @classmethod def reset_cell_stats(cls): cls.cell_count = {} cls.cell_stack = [] cls.api_parent_node = "" cls.module_node = {} - + def node_hook(self, name_prefix, start_or_stop, **kwargs): def begin_hook(cell, input_data): - index = self.set_cell_count(name_prefix) - cell.mindstudio_reserved_name = full_name = name_prefix + Const.SEP + str(index) + full_name = self.set_and_get_reserved_name(cell, name_prefix, is_called_by_pre_hook=True) if CellProcessor.cell_stack: CellProcessor.module_node[full_name] = CellProcessor.cell_stack[-1] else: CellProcessor.module_node[full_name] = None - + CellProcessor.cell_stack.append(full_name) CellProcessor.api_parent_node = full_name @@ -71,3 +67,13 @@ class CellProcessor: self.scope.end_module(cell.mindstudio_reserved_name) return begin_hook if Const.START == start_or_stop else end_hook + + def set_and_get_reserved_name(self, cell, cell_name, is_called_by_pre_hook=False): + if not is_called_by_pre_hook and hasattr(cell, 'has_pre_hook_called') and cell.has_pre_hook_called: + cell.has_pre_hook_called = False + else: + if is_called_by_pre_hook: + cell.has_pre_hook_called = True + index = self.set_cell_count(cell_name) + cell.mindstudio_reserved_name = cell_name + Const.SEP + str(index) + return cell.mindstudio_reserved_name diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/__init__.py 
b/debug/accuracy_tools/msprobe/mindspore/code_mapping/__init__.py similarity index 100% rename from debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/__init__.py rename to debug/accuracy_tools/msprobe/mindspore/code_mapping/__init__.py diff --git a/debug/accuracy_tools/msprobe/mindspore/code_mapping/bind.py b/debug/accuracy_tools/msprobe/mindspore/code_mapping/bind.py new file mode 100644 index 0000000000000000000000000000000000000000..614abdf20f238bde69742103cf3b9e534e269313 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/code_mapping/bind.py @@ -0,0 +1,264 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+import os
+import time
+import glob
+from typing import Dict, List
+from pathlib import Path
+
+import pandas as pd
+
+from msprobe.core.common.const import Const
+from msprobe.core.common.file_utils import (
+    check_file_or_directory_path,
+    FileOpen,
+    create_directory,
+    write_csv,
+    check_path_before_create,
+    read_csv,
+    write_df_to_csv
+)
+from msprobe.mindspore.code_mapping.graph import GraphNode
+from msprobe.mindspore.common.log import logger
+
+
+# Trie node definition
+class TrieNode:
+    def __init__(self):
+        self.children = {}
+        self.is_end_of_key = False
+        self.value = None
+
+
+# Trie definition
+class Trie:
+    def __init__(self):
+        self.root = TrieNode()
+
+    # Insert a key into the Trie
+    def insert(self, key, value):
+        node = self.root
+        for key_char in key:
+            if key_char not in node.children:
+                node.children[key_char] = TrieNode()
+            node = node.children[key_char]
+        # Mark the end of the key
+        node.is_end_of_key = True
+        node.value = value
+
+    # Find all keys that match inside the given string
+    def search_in_string(self, string):
+        matched_values = []
+        for i in range(len(string)):
+            node = self.root
+            j = i
+            # Starting from each character of the string, match character by character
+            while j < len(string) and string[j] in node.children:
+                node = node.children[string[j]]
+                if node.is_end_of_key:
+                    matched_values.append(node.value)
+                j += 1
+        return matched_values
+
+
+# Matching functions
+def match_codes(trie, name):
+    matched_nodes = trie.search_in_string(name)
+    matched_codes = [Const.NEW_LINE.join(node.code_info) for node in matched_nodes]
+    return Const.NEW_LINE.join(matched_codes)
+
+
+def match_names(trie, name):
+    matched_nodes = trie.search_in_string(name)
+    matched_names = [node.scope for node in matched_nodes]
+    return Const.NEW_LINE.join(matched_names)
+
+
+def map_op_names_to_codes_and_scopes(df, match_dict):
+    # Build the Trie and insert all keys
+    trie = Trie()
+    for key, value in match_dict.items():
+        trie.insert(key, value)
+
+    df[Const.CODE_STACK] = df[Const.OP_NAME].apply(lambda name: match_codes(trie, name))
+    df[Const.SCOPE_NAME] = df[Const.OP_NAME].apply(lambda name: match_names(trie, name))
+    return df
+
+
+def find_npy_files(npy_path):
+    """
+    Find all .npy files under the given path.
+
+    Parameters:
+        npy_path (str): Path to search; may be a file or a directory.
+
+    Returns:
+        List[Path]: List of the .npy file paths found.
+    """
+    npy_files = []
+    npy_path_obj = Path(npy_path)
+
+    # Check whether the current path is a file ending in .npy
+    if npy_path_obj.suffix == Const.NUMPY_SUFFIX and npy_path_obj.is_file():
+        check_file_or_directory_path(npy_path_obj)
+        npy_files.append(npy_path_obj.resolve())
+        return npy_files
+
+    # If it is a directory, use Path.rglob to find all .npy files
+    if npy_path_obj.is_dir():
+        for file in npy_path_obj.rglob(Const.NUMPY_PATTERN):
+            check_file_or_directory_path(file)
+            npy_files.append(file.resolve())
+    else:
+        logger.info(f"The specified path is neither an .npy file nor a directory: {npy_path}")
+
+    return npy_files
+
+
+def write_to_csv(param: Dict, output_dir: str):
+    """
+    Write the parameters to a CSV file.
+
+    Parameters:
+        param (Dict): Data to write, in the form {file name: (code stack, scope name)}.
+        output_dir (str): Output directory path.
+    """
+    create_directory(output_dir)
+
+    # Generate the file name with a timestamp
+    timestamp = time.strftime("%Y%m%d%H%M%S", time.localtime())
+    file_path = Path(output_dir) / f"code_mapping_{timestamp}.csv"
+    check_path_before_create(file_path)
+    data = [(name, res1, res2) for name, (res1, res2) in param.items()]
+    df = pd.DataFrame(data, columns=[Const.FILE_PATH, Const.CODE_STACK, Const.SCOPE_NAME])
+    write_df_to_csv(df, file_path)
+
+
+def find_statistic_files(path):
+    if not os.path.isdir(path):
+        if os.path.basename(path) == 'statistic.csv':
+            return [path]
+        else:
+            return []
+    pattern = os.path.join(path, '**', "statistic.csv")
+
+    statistic_files = list(glob.glob(pattern, recursive=True))
+    return statistic_files
+
+
+def check_and_fix_header(file_path: str):
+    """
+    Check whether the header of the CSV file ends with a comma, and append one if it does not.
+
+    Parameters:
+        file_path (str): Path to the CSV file.
+
+    Returns:
+        bool: True if the header was modified; otherwise, False.
+    """
+
+    with FileOpen(file_path, "r") as f:
+        lines = f.readlines()
+
+    if not lines:
+        logger.warning(f"The file {file_path} is empty.")
+        return False
+
+    # Get the header and strip the trailing newline
+    header = lines[0].rstrip(Const.NEW_LINE).rstrip('\r')
+
+    if not header.endswith(','):
+        logger.info(f"The header does not end with a comma. Adding a comma to the file: {file_path}.")
+        # Append the comma and restore the newline
+        lines[0] = header + Const.CSV_NEWLINE_SEPARATOR
+
+        # Write the fixed content back to the file
+        with FileOpen(file_path, "w") as f:
+            f.writelines(lines)
+        logger.info(f"Added a trailing comma to the file: {file_path}.")
+        return True
+    else:
+        logger.info(f"The header already ends with a comma. No modification needed for the file: {file_path}.")
+        return False
+
+
+def bind_for_statistic(statistic_files: List[str], match_dict: Dict):
+    """
+    Process the statistic files and bind code information.
+
+    Parameters:
+        statistic_files (List[str]): List of statistic file paths.
+        match_dict (Dict): Match dictionary used for the complex mapping.
+    """
+    for statistic_file in statistic_files:
+        # Open the file safely with FileOpen
+        header_modified = check_and_fix_header(statistic_file)
+        if header_modified:
+            logger.info(f"The header of the file {statistic_file} has been fixed.")
+
+        df = read_csv(statistic_file, as_pd=True)
+
+        # Perform the complex mapping
+        df = map_op_names_to_codes_and_scopes(df, match_dict)
+
+        # Write the file back safely with write_df_to_csv
+        write_df_to_csv(df, statistic_file)
+
+
+def bind_code_info_for_data(input_dir: str, nodes: Dict[str, GraphNode]) -> Dict[str, tuple]:
+    # Performance to be optimized after refactoring
+    match_dict = {}
+    for node in nodes.values():
+        # Skip subgraph nodes
+        if node.is_subgraph:
+            continue
+        # Get the normalized scope name
+        scope_name = node.scope.replace(Const.SCOPE_SEPARATOR, Const.REPLACEMENT_CHARACTER)
+        match_dict[scope_name] = node
+    npy_files = find_npy_files(input_dir)
+
+    bind_result = {}
+    if not npy_files:
+        statistic_files = find_statistic_files(input_dir)
+        if statistic_files:
+            bind_for_statistic(statistic_files, match_dict)
+        return bind_result
+
+    for npy_file in npy_files:
+        directory, file_name = os.path.split(npy_file)  # Split the path
+        name_without_ext = os.path.splitext(file_name)[0]  # Extract the file name (drop the extension)
+        if name_without_ext.isdigit():
+            # 3. Read the mapping csv file
+            csv_file_path = os.path.join(directory, 'mapping.csv')
+            check_file_or_directory_path(csv_file_path)
+            df = read_csv(csv_file_path, header=None)
+
+            # 4. Check for an entry matching xxx.npy
+            matching_row = df[df[0] == file_name]  # Assumes column A stores the file name
+            if not matching_row.empty:
+                corresponding_name = matching_row[1].values[0]
+            else:
+                continue  # No matching entry; skip this file instead of calling splitext(None)
+            name_without_ext = os.path.splitext(corresponding_name)[0]
+        npy_path = os.path.realpath(npy_file)
+        node_scope = name_without_ext.split(".")[1]
+        trie = Trie()
+        for key, value in match_dict.items():
+            trie.insert(key, value)
+        bind_code = match_codes(trie, node_scope)
+        bind_name = match_names(trie, node_scope)
+        bind_result[npy_path] = (bind_code, bind_name)
+    return bind_result
diff --git a/debug/accuracy_tools/msprobe/mindspore/code_mapping/cmd_parser.py b/debug/accuracy_tools/msprobe/mindspore/code_mapping/cmd_parser.py
new file mode 100644
index 0000000000000000000000000000000000000000..223d54dc445b3991801ac37c7950325159175def
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/code_mapping/cmd_parser.py
@@ -0,0 +1,40 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+from msprobe.core.common.file_utils import check_file_or_directory_path, create_directory
+
+
+def add_ir_parser_arguments(parser):
+    parser.add_argument('--ir', type=str, required=True, help="Path to the graph file")
+    parser.add_argument('--dump_data', type=str, required=True, help="Path to data dir")
+    parser.add_argument('--output', type=str, required=False, default="./", help="Path to output dir")
+
+
+def check_args(args):
+    args.ir = os.path.abspath(args.ir)
+
+    check_file_or_directory_path(args.ir)
+
+    args.dump_data = os.path.abspath(args.dump_data)
+    if os.path.isdir(args.dump_data):
+        check_file_or_directory_path(args.dump_data, isdir=True)
+    else:
+        check_file_or_directory_path(args.dump_data, isdir=False)
+
+    args.output = os.path.abspath(args.output)
+    create_directory(args.output)
+
diff --git a/debug/accuracy_tools/msprobe/mindspore/code_mapping/graph.py b/debug/accuracy_tools/msprobe/mindspore/code_mapping/graph.py
new file mode 100644
index 0000000000000000000000000000000000000000..69c067de0fc6ab1e646073bd2fe962766186ab9b
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/code_mapping/graph.py
@@ -0,0 +1,49 @@
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+ +from typing import List, Dict, Union + + +class GraphNode: + def __init__(self, name: str, pos: int = -1, unique_name: str = "", operator_name: str = "", + return_variable: str = "", return_value: str = "", + var_inputs: List[str] = None, has_constant_input: bool = False, + unique_id: str = "", scope: str = "", code_info: List[str] = None, + is_subgraph: bool = False, attrs: Union[Dict[str, str], List[str]] = None): + self.name = name + self.unique_name = unique_name + self.pos = pos + self.operator_name = operator_name + self.return_variable = return_variable + self.return_value = return_value + self.var_inputs = var_inputs if var_inputs else [] + self.has_constant_input = has_constant_input + self.unique_id = unique_id + self.scope = scope + self.code_info = code_info if code_info else [] + self.attrs = attrs if attrs else ({} if not is_subgraph else []) + self.nodes = {} # Internal nodes if this is a subgraph + self.predecessors = [] # Predecessor nodes + self.successors = [] # Successor nodes + self.is_subgraph = is_subgraph + + def trace_back_ancestors(self, ancestors: List[str], visited: Dict[str, bool], parser) -> None: + if visited[self.unique_name]: + return + visited[self.unique_name] = True + ancestors.append(self.unique_name) + for predecessor in self.predecessors: + predecessor.trace_back_ancestors(ancestors, visited, parser) + diff --git a/debug/accuracy_tools/msprobe/mindspore/code_mapping/graph_parser.py b/debug/accuracy_tools/msprobe/mindspore/code_mapping/graph_parser.py new file mode 100644 index 0000000000000000000000000000000000000000..ee35750fb35c100e2025b0dcbdd9e20ef998b2ee --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/code_mapping/graph_parser.py @@ -0,0 +1,226 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import re
+import logging
+from typing import Tuple, List, Dict
+from msprobe.mindspore.code_mapping.graph import GraphNode
+
+
+class Parser:
+    def __init__(self):
+        self.nodes = {}
+        self.local_dict = {}
+        self.number_dict = {}
+
+    @staticmethod
+    def parse_subgraph_attributes(text: str, subgraph_node: GraphNode, start_pos: int, end_pos: int) -> None:
+        subgraph_attr_pattern = re.compile(r'subgraph attr:\s*(.*)', re.DOTALL)
+        match = subgraph_attr_pattern.search(text, start_pos, end_pos)
+        if match:
+            attrs = match.group(1).strip().split('\n')
+            if isinstance(subgraph_node.attrs, list):
+                subgraph_node.attrs.extend(attrs)
+
+    @staticmethod
+    def parse_graph_attributes(text: str, graph_node: GraphNode) -> None:
+        attr_pattern = re.compile(r'# Attrs:\s*(.*)', re.DOTALL)
+        match = attr_pattern.search(text, graph_node.pos)
+        if match:
+            attrs = match.group(1).strip().split('\n')
+            for attr in attrs:
+                if not attr:
+                    break
+                key, value = attr.split(':', 1)
+                if isinstance(graph_node.attrs, dict):
+                    graph_node.attrs[key.strip()] = value.strip()
+
+    @staticmethod
+    def parse_code_info(text: str, start_pos: int, end_pos: int) -> List[str]:
+        code_info = []
+        code_info_pattern = re.compile(r'# .*', re.MULTILINE)
+        final_pos = end_pos if end_pos else len(text) - 1
+        lines = text[start_pos + 1:final_pos].split('\n')
+        for line in lines:
+            match = code_info_pattern.search(line)
+            if not match:
+                break
+            code_info.append(match.group(0).strip('# ').strip('/'))
+        return code_info
+
+    @staticmethod
+    def extract_bracket_content(text: str, start_pos: int) -> Tuple[str, 
int]: + stack = [] + content = [] + for i in range(start_pos, len(text)): + char = text[i] + if char == '(': + stack.append('(') + elif char == ')': + stack.pop() + if not stack: + content.append(char) + return ''.join(content), i + content.append(char) + raise ValueError("Mismatched parentheses") + + @staticmethod + def find_matching_brace(text: str, start_pos: int) -> int: + stack = [] + for i in range(start_pos, len(text)): + if text[i] == '{': + stack.append('{') + elif text[i] == '}': + stack.pop() + if not stack: + return i + raise ValueError("Matching closing brace not found") + + @staticmethod + def extract_constants(inputs_str: str) -> List[str]: + constant_pattern = re.compile(r'\b(\w+\(.*?\))') + constants = constant_pattern.findall(inputs_str) + return constants + + def parse_func_graph(self, text: str) -> None: + func_graph_pattern = re.compile(r'# IR entry: @(\S+)') + matches = func_graph_pattern.finditer(text) + for match in matches: + func_name = match.group(1) + func_graph_info = GraphNode(name=func_name, pos=match.start(), is_subgraph=False) + self.nodes[func_name] = func_graph_info + + def parse_nodes(self, text: str, subgraph_info: GraphNode) -> None: + node_pattern = re.compile(r'(%\d+)\((\S+)\)\s*=\s*(\S+)\(') + matches = list(node_pattern.finditer(text)) + for i, match in enumerate(matches): + series_number = match.group(1) + variable_name = match.group(2) + operator_name = match.group(3) + unique_name = "&".join([series_number, variable_name]) + self.local_dict[series_number] = unique_name + + args_str, end_pos = self.__class__.extract_bracket_content(text, match.end() - 1) + inputs = re.findall(r'%\w+', args_str) + subgraph_inputs = re.findall(r'@\w+', args_str) + inputs += subgraph_inputs + + constants = self.__class__.extract_constants(args_str) + + scope_pattern = re.compile(r'# .*scope.*:\s*\((.*?)\)', re.IGNORECASE | re.MULTILINE) + + scope_match = scope_pattern.search(text, end_pos) + scope = scope_match.group(1) if scope_match else 
"" + + id_pattern = re.compile(r'.*cnode_primal_attrs:' + r'\s*\{.*\b(?:forward_unique_id|unique_id):\s*\"(\d+)\".*', re.IGNORECASE) + unique_id_match = id_pattern.search(text, end_pos, scope_match.start()) + unique_id = unique_id_match.group(1) if unique_id_match else None + + if scope: + next_match = matches[i + 1].start() - 1 if i < len(matches) - 1 else None + code_info = self.__class__.parse_code_info(text, scope_match.end(), next_match) + else: + code_info = None + + node_info = GraphNode(name=variable_name, unique_name=unique_name, operator_name=operator_name, + var_inputs=inputs + constants, unique_id=unique_id, scope=scope, code_info=code_info) + + if unique_id and scope and not scope.startswith("Gradients"): + self.number_dict[unique_id] = node_info + + if subgraph_info: + subgraph_info.nodes[variable_name] = node_info + + if not self.nodes.get(unique_name, None): + self.nodes[unique_name] = node_info + else: + pass + + for const in constants: + if const not in self.nodes: + const_node = GraphNode(name=const, operator_name="Constant", var_inputs=[], has_constant_input=True) + if not self.nodes.get(const_node, None): + self.nodes[const] = const_node + if subgraph_info: + subgraph_info.nodes[const] = const_node + self.local_dict[const] = const + + for input_var in node_info.var_inputs: + if input_var in self.local_dict or input_var in self.nodes: + input_name = self.local_dict.get(input_var, input_var) + input_node = self.nodes.get(input_name, None) + if input_node: + node_info.predecessors.append(input_node) + input_node.successors.append(node_info) + else: + param_node = GraphNode(name=input_var, operator_name="Param", var_inputs=[], + has_constant_input=False) + if not self.nodes.get(input_var, None): + self.nodes[input_var] = param_node + node_info.predecessors.append(param_node) + param_node.successors.append(node_info) + + def extract_callees(self, text: str) -> None: + for node_info in self.nodes.values(): + func_start_pos = node_info.pos + 
func_end_pos = text.find('}', func_start_pos) + func_text = text[func_start_pos:func_end_pos] + callee_pattern = re.compile(r'Partial\(@(\S+)\(') + callee_matches = callee_pattern.finditer(func_text) + for callee_match in callee_matches: + callee_name = callee_match.group(1) + if callee_name not in node_info.var_inputs: + node_info.var_inputs.append(callee_name) + + def parse_subgraphs(self, text: str) -> None: + subgraph_pattern = re.compile(r'subgraph\s+@(\S+)(\([^\)]*\))?\s+.*\{') + matches = list(subgraph_pattern.finditer(text)) + end_pos = 0 + for match in matches: + last_pos = end_pos + 2 + subgraph_name = match.group(1).split('(')[0] + start_pos = match.start() + end_pos = self.__class__.find_matching_brace(text, start_pos) + subgraph_text = text[start_pos:end_pos + 1] + attr_text = text[last_pos:start_pos] + subgraph_info = GraphNode(name=subgraph_name, pos=start_pos, is_subgraph=True) + self.nodes[subgraph_name] = subgraph_info + self.__class__.parse_subgraph_attributes(text, subgraph_info, last_pos, start_pos) + self.parse_nodes(subgraph_text, subgraph_info) + subgraph_info.end = end_pos + logging.info('Parsed subgraph: %s', subgraph_name) + + def count_nodes(self) -> Tuple[int, int]: + total_nodes = len(self.nodes) + total_cnodes = sum(1 for node in self.nodes.values() if node.name.startswith('CNode')) + return total_nodes, total_cnodes + + def create_backward_map(self): + for node in self.nodes.values(): + if node.scope and node.scope.startswith("Gradients"): + related_forward_node = self.number_dict.get(node.unique_id, None) + if related_forward_node: + node.code_info = related_forward_node.code_info + + def parse(self, text: str) -> None: + self.parse_func_graph(text) + self.parse_subgraphs(text) + self.parse_nodes(text, None) + self.extract_callees(text) + self.create_backward_map() + + def get_nodes(self) -> Dict[str, GraphNode]: + return self.nodes diff --git a/debug/accuracy_tools/msprobe/mindspore/code_mapping/main.py 
b/debug/accuracy_tools/msprobe/mindspore/code_mapping/main.py new file mode 100644 index 0000000000000000000000000000000000000000..b1684baa8ab5475029a5602927285e7880572335 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/code_mapping/main.py @@ -0,0 +1,24 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from msprobe.mindspore.code_mapping.processor import process +from msprobe.mindspore.code_mapping.cmd_parser import check_args + + +def code_mapping_main(args): + check_args(args) + process(args) + + diff --git a/debug/accuracy_tools/msprobe/mindspore/code_mapping/processor.py b/debug/accuracy_tools/msprobe/mindspore/code_mapping/processor.py new file mode 100644 index 0000000000000000000000000000000000000000..ec71a5f9887c13a3feeed20723f23489312f308f --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/code_mapping/processor.py @@ -0,0 +1,34 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from msprobe.mindspore.code_mapping.graph_parser import Parser +from msprobe.mindspore.code_mapping.bind import bind_code_info_for_data, write_to_csv +from msprobe.core.common.file_utils import FileOpen + + +def process(args): + ir_file_path = args.ir + with FileOpen(ir_file_path, 'r') as f: + input_text = f.read() + + parser = Parser() + parser.parse(input_text) + + nodes = parser.get_nodes() + + bind_result = bind_code_info_for_data(args.dump_data, nodes) + if bind_result: + write_to_csv(bind_result, args.output) + diff --git a/debug/accuracy_tools/msprobe/mindspore/common/const.py b/debug/accuracy_tools/msprobe/mindspore/common/const.py index 8cb91f763dd78a69ffb8e5989bd29389d9988d79..067e783842f13899feaba00476777ded707e9eb7 100644 --- a/debug/accuracy_tools/msprobe/mindspore/common/const.py +++ b/debug/accuracy_tools/msprobe/mindspore/common/const.py @@ -1,7 +1,7 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -38,23 +38,18 @@ class Const: ASCEND_910A = "ascend910" OPS_PREFIX = "mindspore.ops." - Tensor_PREFIX = "mindspore.Tensor." + TENSOR_PREFIX = "mindspore.Tensor." MINT_PREFIX = "mindspore.mint." MINT_NN_FUNC_PREFIX = "mindspore.mint.nn.functional." - COMM_PREFIX = "mindspore.communication.comm_func." 
- COMMUNICATION_API_LIST = [ - "mindspore.communication.comm_func.all_gather_into_tensor", - "mindspore.communication.comm_func.gather_into_tensor", - "mindspore.communication.comm_func.all_reduce", - "mindspore.communication.comm_func.reduce", - "mindspore.communication.comm_func.reduce_scatter_tensor" - ] + TENSOR_DATA_PREFIX = "Tensor." STUB_TENSOR_DATA_PREFIX = "Tensor." OPS_DATA_PREFIX = "Functional." MINT_DATA_PREFIX = "Mint." MINT_NN_FUNC_DATA_PREFIX = "MintFunctional." DISTRIBUTED_DATA_PREFIX = "Distributed." + TORCH_DATA_PREFIX = "Torch." + TORCH_NPU_DATA_PREFIX = "NPU." SUPPORTED_API_LIST_FILE = "support_wrap_ops.yaml" SUPPORTED_TENSOR_LIST_KEY = "tensor" @@ -65,6 +60,76 @@ class Const: DROPOUT_API_NAME_PREFIX = "dropout" + GRAPH_DATA_MODE_LIST = [CoreConst.ALL, CoreConst.INPUT, CoreConst.OUTPUT] + + HOOK_MS_PREFIX_DICT = { + OPS_DATA_PREFIX: OPS_PREFIX, + TENSOR_DATA_PREFIX: TENSOR_PREFIX, + MINT_DATA_PREFIX: MINT_PREFIX, + MINT_NN_FUNC_DATA_PREFIX: MINT_NN_FUNC_PREFIX + } + + +class MsCompareConst: + # api_info field + MINT = "Mint" + MINT_FUNCTIONAL = "MintFunctional" + TENSOR_API = "Tensor" + FUNCTIONAL_API = "Functional" + FUSION_API = "FUSION" + + API_NAME_STR_LENGTH = 4 + MAX_RECURSION_DEPTH = 20 + + # Mindtorch api_info field + MINDTORCH_TENSOR = "Tensor" + MINDTORCH = "Torch" + MINDTORCH_FUNC = "Functional" + MINDTORCH_NPU = "NPU" + MINDTORCH_DIST = "Distributed" + + + + MT_VALID_API_TYPES = [ + MINDTORCH, MINDTORCH_FUNC, MINDTORCH_TENSOR + ] + SUPPORTED_FUSION_LIST = ["flash_attention_score"] + + + TASK_FIELD = "task" + STATISTICS_TASK = "statistics" + FRAMEWORK = "framework" + TENSOR_TASK = "tensor" + DUMP_DATA_DIR_FIELD = "dump_data_dir" + DATA_FIELD = "data" + + # supported api yaml + SUPPORTED_API_LIST_FILE = "checker_support_api.yaml" + SUPPORTED_TENSOR_LIST_KEY = "tensor" + + # detail_csv + DETAIL_CSV_API_NAME = "API Name" + DETAIL_CSV_BENCH_DTYPE = "Bench Dtype" + DETAIL_CSV_TESTED_DTYPE = "Tested Dtype" + DETAIL_CSV_SHAPE = "Shape" + 
DETAIL_CSV_PASS_STATUS = "Status" + DETAIL_CSV_MESSAGE = "Message" + DETAIL_CSV_FILE_NAME = "accuracy_checking_details" + + # result_csv + RESULT_CSV_FORWARD_TEST_SUCCESS = "Forward Test Success" + RESULT_CSV_BACKWARD_TEST_SUCCESS = "Backward Test Success" + RESULT_CSV_FILE_NAME = "accuracy_checking_result" + + EPSILON = 1e-8 + + class ProcessStatus: + SUCCESS = "success" + API_NOT_FOUND = "api_not_found" + EXCEPTION_SKIP = "exception_skip" + + + class FreeBenchmarkConst: ADD_NOISE = "add_noise" @@ -80,19 +145,21 @@ class FreeBenchmarkConst: DEFAULT_PERT_TYPE = IMPROVE_PRECISION DEFAULT_HANDLER_TYPE = CHECK DEVICE_LIST = [DEFAULT_DEVICE] - STAGE_LIST = [CoreConst.FORWARD] + STAGE_LIST = [CoreConst.FORWARD, CoreConst.BACKWARD] DUMP_LEVEL_LIST = [DEFAULT_DUMP_LEVEL] PERT_TYPE_LIST = [IMPROVE_PRECISION, ADD_NOISE, BIT_NOISE, NO_CHANGE, EXCHANGE_VALUE] HANDLER_TYPE_LIST = [CHECK, FIX] NO_CHANGE_ERROR_THRESHOLD = 1.0 SYMBOL_FLIPPING_RATIO = 8.0 + SUPPORTED_CHECK_API_FILE = "support_wrap_ops.yaml" + CHECK_RESULT_FILE = "free_benchmark.csv" + API_PREFIX_DICT = { "ops": Const.OPS_PREFIX, - "Tensor": Const.Tensor_PREFIX, + "Tensor": Const.TENSOR_PREFIX, "mint": Const.MINT_PREFIX, - "mint.nn.functional": Const.MINT_NN_FUNC_PREFIX, - "communication": Const.COMM_PREFIX + "mint.nn.functional": Const.MINT_NN_FUNC_PREFIX } PERT_VALUE_DICT = { @@ -103,6 +170,7 @@ class FreeBenchmarkConst: } ERROR_THRESHOLD = { + ms.bfloat16: 1.004, ms.float16: 1.002, ms.float32: 1.0002 } diff --git a/debug/accuracy_tools/msprobe/mindspore/common/log.py b/debug/accuracy_tools/msprobe/mindspore/common/log.py index ec027c750133ce4aabdac4ed914b4a5c50b2a2f1..25f1fdb7dca8ed0d875b6d5df80c8f98c27acca9 100644 --- a/debug/accuracy_tools/msprobe/mindspore/common/log.py +++ b/debug/accuracy_tools/msprobe/mindspore/common/log.py @@ -1,4 +1,5 @@ -# Copyright 2024 Huawei Technologies Co., Ltd +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,15 +12,10 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -# ============================================================================ -import os -import time -import sys - -from msprobe.mindspore.common.utils import get_rank_if_initialized -from msprobe.core.common.log import BaseLogger from msprobe.core.common.exceptions import DistributedNotInitializedError +from msprobe.core.common.log import BaseLogger +from msprobe.mindspore.common.utils import get_rank_if_initialized class MindsporeLogger(BaseLogger): @@ -35,4 +31,4 @@ class MindsporeLogger(BaseLogger): return current_rank -logger = MindsporeLogger() \ No newline at end of file +logger = MindsporeLogger() diff --git a/debug/accuracy_tools/msprobe/mindspore/common/utils.py b/debug/accuracy_tools/msprobe/mindspore/common/utils.py index 37826333f9e081824bef93844b6c46745248f05b..b205dabc6a3573f26381aca989b15b6cefe04003 100644 --- a/debug/accuracy_tools/msprobe/mindspore/common/utils.py +++ b/debug/accuracy_tools/msprobe/mindspore/common/utils.py @@ -1,4 +1,5 @@ -# Copyright 2024 Huawei Technologies Co., Ltd +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,13 +12,15 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-# ============================================================================ import os import random import mindspore as ms +from mindspore import ops +from mindspore.mint import nn + from msprobe.core.common.exceptions import DistributedNotInitializedError from msprobe.core.common.file_utils import path_len_exceeds_limit, check_path_exists, save_npy from msprobe.core.common.log import logger @@ -41,7 +44,7 @@ def convert_bf16_to_fp32(tensor): def save_tensor_as_npy(tensor, file_path): if not path_len_exceeds_limit(file_path): tensor = convert_bf16_to_fp32(tensor) - saved_tensor = tensor.contiguous().asnumpy() + saved_tensor = tensor.asnumpy() save_npy(saved_tensor, file_path) else: logger.warning(f'The file path {file_path} length exceeds limit.') @@ -54,6 +57,11 @@ def convert_to_int(value): return -1 +def clean_input_kwargs(cell): + if hasattr(cell, 'input_kwargs'): + del cell.input_kwargs + + def list_lowest_level_directories(root_dir): check_path_exists(root_dir) lowest_level_dirs = [] @@ -74,13 +82,15 @@ def list_lowest_level_directories(root_dir): return lowest_level_dirs -def seed_all(seed=1234, mode=False): - check_seed_all(seed, mode) +def seed_all(seed=1234, mode=False, rm_dropout=True): + check_seed_all(seed, mode, rm_dropout) os.environ['PYTHONHASHSEED'] = str(seed) ms.set_seed(seed) random.seed(seed) ms.set_context(deterministic="ON" if mode else "OFF") os.environ['HCCL_DETERMINISTIC'] = str(mode) + if rm_dropout: + remove_dropout() class MsprobeStep(ms.train.Callback): @@ -95,3 +105,95 @@ class MsprobeStep(ms.train.Callback): def on_train_step_end(self, run_context): self.debugger.stop() self.debugger.step() + + +class Dropout(ops.Dropout): + def __init__(self, keep_prob=0.5, seed0=0, seed1=1): + super().__init__(1., seed0, seed1) + + +class Dropout2D(ops.Dropout2D): + def __init__(self, keep_prob=0.5): + super().__init__(1.) + + +class Dropout3D(ops.Dropout3D): + def __init__(self, keep_prob=0.5): + super().__init__(1.) 
+ + +class DropoutExt(nn.Dropout): + def __init__(self, p=0.5): + super().__init__(0) + + +def dropout_ext(input_tensor, p=0.5, training=True): + return input_tensor + + +def remove_dropout(): + ops.Dropout = Dropout + ops.operations.Dropout = Dropout + ops.Dropout2D = Dropout2D + ops.operations.Dropout2D = Dropout2D + ops.Dropout3D = Dropout3D + ops.operations.Dropout3D = Dropout3D + nn.Dropout = DropoutExt + nn.functional.dropout = dropout_ext + + +mindtorch_check_result = None + + +def is_mindtorch(): + global mindtorch_check_result + if mindtorch_check_result is None: + mindtorch_check_result = False + try: + import torch + except ImportError: + return mindtorch_check_result + tensor = torch.tensor(0.0) + if isinstance(tensor, ms.Tensor): + mindtorch_check_result = True + return mindtorch_check_result + + +register_backward_hook_functions = {} + + +def set_register_backward_hook_functions(): + global register_backward_hook_functions + if is_mindtorch(): + import torch + from msprobe.mindspore.mindtorch import (_call_impl, + register_full_backward_pre_hook, + register_full_backward_hook) + if not hasattr(torch, "register_full_backward_hook"): + setattr(torch.nn.Module, "_call_impl", _call_impl) + setattr(torch.nn.Module, "register_full_backward_pre_hook", register_full_backward_pre_hook) + setattr(torch.nn.Module, "register_full_backward_hook", register_full_backward_hook) + register_backward_hook_functions["pre"] = torch.nn.Module.register_full_backward_pre_hook + register_backward_hook_functions["full"] = torch.nn.Module.register_full_backward_hook + else: + register_backward_hook_functions["pre"] = ms.nn.Cell.register_backward_pre_hook + register_backward_hook_functions["full"] = ms.nn.Cell.register_backward_hook + + +def check_save_param(variable, name, save_backward): + # try catch this api to skip invalid call + if not isinstance(variable, (list, dict, tuple, ms.Tensor, int, float, str)): + logger.warning("PrecisionDebugger.save variable type not valid, " 
+ "should be one of list, dict, tuple, ms.Tensor, int, float or string. " + "Skip current save process.") + raise ValueError + if not isinstance(name, str): + logger.warning("PrecisionDebugger.save name not valid, " + "should be string. " + "Skip current save process.") + raise ValueError + if not isinstance(save_backward, bool): + logger.warning("PrecisionDebugger.save_backward name not valid, " + "should be bool. " + "Skip current save process.") + raise ValueError \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/compare/distributed_compare.py b/debug/accuracy_tools/msprobe/mindspore/compare/distributed_compare.py index cf65280a8805bfcb92ec3ba7af71649d2fb586c4..46f825330dbb8b7ff5ce9d42cef5c6b74e3846f2 100644 --- a/debug/accuracy_tools/msprobe/mindspore/compare/distributed_compare.py +++ b/debug/accuracy_tools/msprobe/mindspore/compare/distributed_compare.py @@ -14,12 +14,11 @@ # limitations under the License. import os -from msprobe.core.common.utils import CompareException, check_compare_param, \ - check_configuration_param, set_dump_path, get_dump_mode +from msprobe.core.common.utils import CompareException from msprobe.core.common.file_utils import create_directory from msprobe.core.common.exceptions import FileCheckException from msprobe.mindspore.common.log import logger -from msprobe.mindspore.compare.ms_compare import MSComparator +from msprobe.mindspore.compare.ms_compare import ms_compare from msprobe.core.compare.utils import check_and_return_dir_contents, extract_json from msprobe.mindspore.compare.ms_graph_compare import GraphMSComparator @@ -28,41 +27,27 @@ def ms_compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): if kwargs.get('suffix'): logger.error("Argument 'suffix' is not supported for compare_distributed.") raise CompareException(CompareException.INVALID_PARAM_ERROR) - stack_mode = kwargs.get('stack_mode', False) - auto_analyze = kwargs.get('auto_analyze', True) - fuzzy_match = 
kwargs.get('fuzzy_match', False) + is_print_compare_log = kwargs.get('is_print_compare_log', True) # get the ranks and match by order npu_ranks = sorted(check_and_return_dir_contents(npu_dump_dir, 'rank')) bench_ranks = sorted(check_and_return_dir_contents(bench_dump_dir, 'rank')) if len(npu_ranks) != len(bench_ranks): logger.error('The number of ranks in the two runs are different. ' - 'Unable to match the ranks. Please use another folder to compare ' - 'or use compare() api and manually match the ranks.') + 'Unable to match the ranks. Please use another folder to compare ' + 'or use compare() api and manually match the ranks.') raise CompareException(CompareException.INVALID_PATH_ERROR) for nr, br in zip(npu_ranks, bench_ranks): npu_data_dir = os.path.join(npu_dump_dir, nr) bench_data_dir = os.path.join(bench_dump_dir, br) npu_path = extract_json(npu_data_dir, stack_json=False) bench_path = extract_json(bench_data_dir, stack_json=False) - stack_path = extract_json(npu_data_dir, stack_json=True) dump_result_param = { 'npu_json_path': npu_path, 'bench_json_path': bench_path, - 'stack_json_path': stack_path, - 'is_print_compare_log': True + 'is_print_compare_log': is_print_compare_log } - try: - set_dump_path(dump_result_param) - dump_mode = get_dump_mode(dump_result_param) - check_configuration_param(stack_mode, auto_analyze, fuzzy_match, dump_result_param.get('is_print_compare_log', True)) - create_directory(output_path) - check_compare_param(dump_result_param, output_path, dump_mode) - except (CompareException, FileCheckException) as error: - logger.error('Compare failed. 
Please check the arguments and do it again!') - raise CompareException(error.code) from error - ms_comparator = MSComparator() - ms_comparator.compare_core(dump_result_param, output_path, suffix=f'_{nr}-{br}', dump_mode=dump_mode, **kwargs) + ms_compare(input_param=dump_result_param, output_path=output_path, suffix=f'_{nr}-{br}', **kwargs) def ms_graph_compare(inputs, outputs): diff --git a/debug/accuracy_tools/msprobe/mindspore/compare/ms_compare.py b/debug/accuracy_tools/msprobe/mindspore/compare/ms_compare.py index 77d30c5393333a9ee9eb4ad634e35ea7552e7f78..9f1523c03aa63d0a487467e59b24830919bfb2ba 100644 --- a/debug/accuracy_tools/msprobe/mindspore/compare/ms_compare.py +++ b/debug/accuracy_tools/msprobe/mindspore/compare/ms_compare.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -13,36 +13,61 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-import copy import os import re +from collections import defaultdict + +import numpy as np +import pandas as pd from msprobe.core.common.const import CompareConst, Const from msprobe.core.common.exceptions import FileCheckException -from msprobe.core.common.file_utils import (FileOpen, create_directory, - load_npy, load_yaml) +from msprobe.core.common.file_utils import create_directory, load_json, load_npy, load_yaml from msprobe.core.common.log import logger -from msprobe.core.common.utils import (CompareException, check_compare_param, - check_configuration_param, - get_dump_mode, set_dump_path) -from msprobe.core.compare.acc_compare import Comparator -from msprobe.core.compare.check import check_struct_match, fuzzy_check_op +from msprobe.core.common.utils import CompareException, check_compare_param, check_configuration_param, \ + check_op_str_pattern_valid, get_dump_mode, set_dump_path, detect_framework_by_dump_json +from msprobe.core.compare.acc_compare import Comparator, ModeConfig +from msprobe.core.compare.check import dtype_mapping from msprobe.core.compare.layer_mapping import generate_data_mapping_by_layer_mapping +from msprobe.core.compare.utils import set_stack_json_path, reorder_op_x_list -class MSComparator(Comparator): - def __init__(self, cell_mapping=None, api_mapping=None, data_mapping=None, is_cross_framework=False): - self.frame_name = MSComparator.__name__ +class MappingConfig: + def __init__(self, cell_mapping=None, api_mapping=None, data_mapping=None): self.cell_mapping = cell_mapping self.api_mapping = api_mapping self.data_mapping = data_mapping - if data_mapping: + + +class MSComparator(Comparator): + """ + Accuracy comparison for MindSpore dynamic graphs, both same-framework and cross-framework; supports md5/summary/all modes. + cell_mapping: mapping between MindSpore cell-level (L0) dump data and PyTorch modules; + api_mapping: mapping between MindSpore api-level (L1) dump data and PyTorch apis; + data_mapping: mapping between the inputs/outputs of MindSpore cells or apis and those of PyTorch; + is_cross_framework: whether the comparison is cross-framework. + """ + def __init__(self, mode_config, mapping_config=None, 
is_cross_framework=False): + super().__init__(mode_config) + self.frame_name = MSComparator.__name__ + + self.stack_mode = mode_config.stack_mode + self.auto_analyze = mode_config.auto_analyze + self.fuzzy_match = mode_config.fuzzy_match + self.dump_mode = mode_config.dump_mode + + if mapping_config: + self.cell_mapping = mapping_config.cell_mapping + self.api_mapping = mapping_config.api_mapping + self.data_mapping = mapping_config.data_mapping + + if self.data_mapping: self.cross_frame = is_cross_framework else: - self.cross_frame = cell_mapping is not None or api_mapping is not None + self.cross_frame = self.cell_mapping is not None or self.api_mapping is not None self.cell_mapping_dict = self.load_mapping_file(self.cell_mapping) self.api_mapping_dict = self.load_mapping_file(self.api_mapping) - if api_mapping is not None: + if self.api_mapping is not None: self.ms_to_pt_mapping = self.load_internal_api() if isinstance(self.data_mapping, str) or self.data_mapping is None: @@ -53,9 +78,107 @@ class MSComparator(Comparator): raise TypeError(f"The type of parameter `data_mapping` must be dict, str or None, but got " f"{type(self.data_mapping)}") + def calc_accuracy(self, result_df, header): + condition_no_bench = result_df[CompareConst.BENCH_NAME] == CompareConst.N_A + result_df[condition_no_bench] = result_df[condition_no_bench].fillna(CompareConst.N_A) + result_df.loc[condition_no_bench, CompareConst.ERROR_MESSAGE] = CompareConst.NO_BENCH + + def calc_summary_diff(data_type: str): + def type_check(val): + check_series = pd.Series(False, index=val.index) + val_str = val.astype(str) + check_series[pd.to_numeric(val_str, errors='coerce').notna() | val_str.str.lower().eq('nan')] = True + return check_series + + def get_number(val): + return pd.to_numeric(val.astype(str), errors='coerce') + + ms_val = result_df['NPU ' + data_type] + pt_val = result_df['Bench ' + data_type] + diff_name = data_type.capitalize() + ' diff' + rel_err_name = ('norm' if data_type == 'l2norm' 
else data_type).capitalize() + 'RelativeErr' + condition_na = ~type_check(ms_val) | ~type_check(pt_val) + result_df.loc[condition_na, [diff_name, rel_err_name]] = CompareConst.N_A + result_df.loc[~(condition_no_bench | condition_na), diff_name] = get_number(ms_val) - get_number(pt_val) + condition_nan_diff = ~condition_no_bench & ~condition_na & result_df[diff_name].isna() + condition_not_nan_diff = ~condition_no_bench & ~condition_na & result_df[diff_name].notna() + result_df.loc[condition_nan_diff, [diff_name, rel_err_name]] = CompareConst.NAN + condition_pt_zero = pt_val == 0 + result_df.loc[condition_not_nan_diff & condition_pt_zero, rel_err_name] = CompareConst.NAN + condition_ref_err = condition_not_nan_diff & ~condition_pt_zero + result_df.loc[condition_ref_err, rel_err_name] = (result_df.loc[condition_ref_err, diff_name] / + pt_val[condition_ref_err] * 100) + result_df.loc[condition_ref_err, rel_err_name] = (result_df.loc[condition_ref_err, rel_err_name] + .abs().astype(str) + '%') + magnitude = get_number(result_df[diff_name]).abs() / ( + pd.Series(np.maximum(get_number(ms_val), get_number(pt_val))).abs() + CompareConst.EPSILON) + return magnitude > CompareConst.MAGNITUDE + + if self.dump_mode == Const.MD5: + condition_md5_equal = result_df[CompareConst.NPU_MD5] == result_df[CompareConst.BENCH_MD5] + result_df.loc[condition_md5_equal, CompareConst.RESULT] = CompareConst.PASS + result_df.loc[~condition_md5_equal & ~condition_no_bench, CompareConst.RESULT] = CompareConst.DIFF + elif self.dump_mode == Const.SUMMARY: + warning_list = [calc_summary_diff(data_type) for data_type in ['max', 'min', 'mean', 'l2norm']] + warning_flag = pd.DataFrame(warning_list).all() + result_df.loc[~condition_no_bench, [CompareConst.RESULT, CompareConst.ERROR_MESSAGE]] = '' + result_df.loc[warning_flag, CompareConst.RESULT] = CompareConst.WARNING + result_df.loc[warning_flag, CompareConst.ERROR_MESSAGE] = 'Need double check api accuracy.' 
+ else: + fill_cols = [CompareConst.COSINE, CompareConst.EUC_DIST, + CompareConst.MAX_ABS_ERR, CompareConst.MAX_RELATIVE_ERR, + CompareConst.ONE_THOUSANDTH_ERR_RATIO, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO, + CompareConst.ERROR_MESSAGE] + result_df.loc[~condition_no_bench, fill_cols] = '' + result_df.loc[~condition_no_bench, CompareConst.ACCURACY] = CompareConst.ACCURACY_CHECK_YES + return result_df[header] + + def make_result_df(self, result): + header = CompareConst.HEAD_OF_COMPARE_MODE[self.dump_mode][:] + + if self.stack_mode: + header.append(CompareConst.STACK) + if self.dump_mode == Const.ALL: + header.append(CompareConst.DATA_NAME) + result.rename(columns={'op_name_x': CompareConst.NPU_NAME, + 'op_name_y': CompareConst.BENCH_NAME, + 'dtype_x': CompareConst.NPU_DTYPE, + 'dtype_y': CompareConst.BENCH_DTYPE, + 'shape_x': CompareConst.NPU_SHAPE, + 'shape_y': CompareConst.BENCH_SHAPE, + 'md5_x': CompareConst.NPU_MD5, + 'md5_y': CompareConst.BENCH_MD5, + 'data_name_x': CompareConst.DATA_NAME, + 'stack_info_x': CompareConst.STACK}, inplace=True) + + npu_summary = [CompareConst.NPU_MAX, CompareConst.NPU_MIN, CompareConst.NPU_MEAN, CompareConst.NPU_NORM] + bench_summary = [CompareConst.BENCH_MAX, CompareConst.BENCH_MIN, CompareConst.BENCH_MEAN, + CompareConst.BENCH_NORM] + + def set_summary(summary): + if summary == CompareConst.N_A: + return [CompareConst.N_A] * 4 + summary_list = [] + for i in summary: + if i is None: + summary_list.append(CompareConst.N_A) + elif str(i).lower() == 'nan': + summary_list.append(CompareConst.NAN) + else: + summary_list.append(i) + return summary_list + + result[npu_summary] = result['summary_x'].apply(set_summary).tolist() + result[bench_summary] = result['summary_y'].apply(set_summary).tolist() + result_df = pd.DataFrame(columns=header) + for h in header: + if h in result.columns: + result_df[h] = result[h] + return self.calc_accuracy(result_df, header) + def load_internal_api(self): cur_path = 
os.path.dirname(os.path.realpath(__file__)) - yaml_path = os.path.join(cur_path, "ms_to_pt_api.yaml") + yaml_path = os.path.abspath(os.path.join(cur_path, CompareConst.INTERNAL_API_MAPPING_FILE)) return load_yaml(yaml_path) def load_mapping_file(self, mapping_file): @@ -66,42 +189,23 @@ class MSComparator(Comparator): return mapping_dict def process_cell_mapping(self, npu_op_name): - npu_op_name = [op_name.replace("Cell", "Module", 1) for op_name in npu_op_name] + if not npu_op_name: + return CompareConst.N_A + param_grad_flag = Const.PARAMS_GRAD in npu_op_name.split(Const.SEP) + if not param_grad_flag and not re.search(Const.REGEX_FORWARD_BACKWARD, npu_op_name): + return CompareConst.N_A + npu_op_name = npu_op_name.replace("Cell", "Module", 1) if self.cell_mapping_dict: - for index, op_name in enumerate(npu_op_name): - # get cell name & class name from op_name - # Cell.fc1.Dense.forward.0.input.0 - cell_name = op_name.split(Const.SEP, 1)[-1].rsplit(Const.SEP, 4)[0] - if cell_name in self.cell_mapping_dict: - npu_op_name[index] = op_name.replace(cell_name, self.cell_mapping_dict[cell_name], 1) + # get cell name & class name from op_name + # Cell.fc1.Dense.forward.0.input.0 + cell_name = re.split(r'\.(?:forward|backward|parameters_grad)\.', npu_op_name.split(Const.SEP, 1)[-1])[0] + if cell_name in self.cell_mapping_dict: + npu_op_name = npu_op_name.replace(cell_name, self.cell_mapping_dict[cell_name], 1) return npu_op_name - def check_op(self, npu_dict, bench_dict, fuzzy_match): - npu_dict_new, bench_dict_new = copy.deepcopy(npu_dict), copy.deepcopy(bench_dict) - npu_op_name, bench_op_name = npu_dict_new.get(CompareConst.OP_NAME), bench_dict_new.get(CompareConst.OP_NAME) - if self.cell_mapping is not None: - npu_op_name = self.process_cell_mapping(npu_op_name) - if self.api_mapping is not None: - npu_op_name = self.process_internal_api_mapping(npu_op_name, bench_op_name) - if isinstance(self.api_mapping, str): - npu_dict_new, bench_dict_new, target_dict = 
self.transform_user_mapping_api(npu_dict_new, - bench_dict_new) - if target_dict: - bench_dict = self.reconstitution_bench_dict(npu_dict, copy.deepcopy(bench_dict_new), target_dict) - npu_op_name = npu_dict_new.get(CompareConst.OP_NAME) - bench_op_name = bench_dict_new.get(CompareConst.OP_NAME) - struct_match = check_struct_match(npu_dict_new, bench_dict_new, cross_frame=self.cross_frame) - if not fuzzy_match: - return npu_op_name == bench_op_name and struct_match - is_match = True - try: - is_match = fuzzy_check_op(npu_op_name, bench_op_name) - except Exception as err: - logger.warning("%s and %s can not fuzzy match." % (npu_op_name, bench_op_name)) - is_match = False - return is_match and struct_match - def read_npy_data(self, dir_path, file_name, load_pt_file=False): + if not file_name: + return None data_path = os.path.join(dir_path, file_name) if load_pt_file: import torch @@ -111,35 +215,23 @@ class MSComparator(Comparator): data_value = data_value.to(torch.float32) data_value = data_value.numpy() else: - data_value = load_npy(data_path) - return data_value + data_value = load_npy(data_path) + return data_value - def api_replace(self, npu_op_name, target, para): - for idx, _ in enumerate(npu_op_name): - npu_op_name[idx] = npu_op_name[idx].replace(target, para) - return npu_op_name - - def process_internal_api_mapping(self, npu_op_name, bench_op_name): + def process_internal_api_mapping(self, npu_op_name): # get api name & class name from op_name # Functional.addcmul.0.forward.input.0 - npu_op_name, bench_op_name = npu_op_name.copy(), bench_op_name.copy() - ms_api_name = self.get_api_name(npu_op_name[0].split(Const.SEP)) - pt_api_name = self.get_api_name(bench_op_name[0].split(Const.SEP)) + ms_api_name = self.get_api_name(npu_op_name.split(Const.SEP)) class_name = ms_api_name.split(Const.SEP)[0] if class_name == "Mint": - return self.api_replace(npu_op_name, "Mint", "Torch") + return npu_op_name.replace("Mint", "Torch") elif class_name == "MintFunctional": - 
return self.api_replace(npu_op_name, "MintFunctional", "Functional") - elif self.ms_to_pt_mapping.get(ms_api_name) == pt_api_name: - return self.api_replace(npu_op_name, ms_api_name, pt_api_name) + return npu_op_name.replace("MintFunctional", "Functional") + elif self.ms_to_pt_mapping.get(ms_api_name): + return npu_op_name.replace(ms_api_name, self.ms_to_pt_mapping.get(ms_api_name)) else: return npu_op_name - - def remove_element(self, op_name, struct, summary, idx): - del op_name[idx] - del struct[idx] - del summary[idx] - + def get_api_name(self, api_list): try: api_name = api_list[0] + Const.SEP + api_list[1] @@ -147,136 +239,184 @@ class MSComparator(Comparator): logger.error(f'Failed to retrieve API name, please check if the dump data is reasonable') raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from error return api_name - - def transform_user_mapping_api(self, new_npu_dict, new_bench_dict): - """ - Transform user mapping API based on new NPU and benchmark dictionaries. - Parameters: - new_npu_dict (dict): New NPU operation dictionary. - new_bench_dict (dict): New benchmark operation dictionary. - Returns: - tuple: Updated NPU and benchmark dictionaries, along with the target dictionary. 
- """ - npu_op_name, bench_op_name = new_npu_dict.get(CompareConst.OP_NAME), new_bench_dict.get(CompareConst.OP_NAME) - npu_struct_in = new_npu_dict.get(CompareConst.INPUT_STRUCT) - bench_struct_in = new_bench_dict.get(CompareConst.INPUT_STRUCT) - npu_struct_out = new_npu_dict.get(CompareConst.OUTPUT_STRUCT) - bench_struct_out = new_bench_dict.get(CompareConst.OUTPUT_STRUCT) - npu_summary, bench_summary = new_npu_dict.get(CompareConst.SUMMARY), new_bench_dict.get(CompareConst.SUMMARY) - npu_in_len, bench_in_len = len(npu_struct_in), len(bench_struct_in) - npu_out_len, bench_out_len = len(npu_struct_out), len(bench_struct_out) - ms_api_list, pt_api_list = npu_op_name[0].split(Const.SEP), bench_op_name[0].split(Const.SEP) - ms_api_name = self.get_api_name(ms_api_list) - pt_api_name = self.get_api_name(pt_api_list) - target_dict = {} - for api_dict in self.api_mapping_dict: - if api_dict.get("pt_api") == pt_api_name and api_dict.get("ms_api") == ms_api_name: - ms_user_args_len, pt_user_args_len = len(api_dict.get("ms_args")), len(api_dict.get("pt_args")) - ms_user_output_len, pt_user_output_len = len(api_dict.get("ms_output")), len(api_dict.get("pt_output")) - if ms_user_args_len != pt_user_args_len or ms_user_output_len != pt_user_output_len: - logger.warning("The user-defined mapping table is incorrect,\ - make sure that the number of parameters is equal") - break - ms_out_list = api_dict.get("ms_output", []) - for idx in reversed(range(npu_out_len)): - if idx not in ms_out_list: - del npu_struct_out[idx] - if idx + npu_in_len < len(npu_summary) and idx + npu_in_len < len(npu_op_name): - del npu_summary[idx + npu_in_len] - del npu_op_name[idx + npu_in_len] - pt_out_list = api_dict.get("pt_output", []) - for idx in reversed(range(bench_out_len)): - if idx not in pt_out_list: - del bench_struct_out[idx] - if idx + bench_in_len < len(bench_summary) and idx + bench_in_len < len(bench_op_name): - del bench_summary[idx + bench_in_len] - del bench_op_name[idx + 
bench_in_len] - ms_para_list = api_dict.get("ms_args", []) - for idx in reversed(range(npu_in_len)): - if idx not in ms_para_list: - self.remove_element(npu_op_name, npu_struct_in, npu_summary, idx) - pt_para_list = api_dict.get("pt_args", []) - for idx in reversed(range(bench_in_len)): - if idx not in pt_para_list: - self.remove_element(bench_op_name, bench_struct_in, bench_summary, idx) - npu_op_name = self.api_replace(npu_op_name, ms_api_name, pt_api_name) - npu_op_name = self.para_sequence_update(npu_op_name, bench_op_name) - target_dict = api_dict - break - if target_dict: - new_npu_dict.update({CompareConst.OP_NAME: npu_op_name, CompareConst.INPUT_STRUCT: npu_struct_in, - CompareConst.OUTPUT_STRUCT: npu_struct_out, CompareConst.SUMMARY: npu_summary}) - new_bench_dict.update({CompareConst.OP_NAME: bench_op_name, CompareConst.INPUT_STRUCT: bench_struct_in, - CompareConst.OUTPUT_STRUCT: bench_struct_out, CompareConst.SUMMARY: bench_summary}) - return new_npu_dict, new_bench_dict, target_dict - - def para_sequence_update(self, npu_op_name, bench_op_name): - for idx, _ in enumerate(npu_op_name): - bench_op_name_list = bench_op_name[idx].rsplit(Const.SEP, 1) - if len(bench_op_name_list) != 0: - npu_op_name[idx] = npu_op_name[idx][:-1] + bench_op_name_list[-1] - return npu_op_name - def reconstitution_bench_dict(self, npu_dict, del_bench_dict, api_dict): - ms_user_args_list = api_dict.get("ms_args", []) - ms_user_output_list = api_dict.get("ms_output", []) - npu_struct_in = npu_dict.get(CompareConst.INPUT_STRUCT) - npu_struct_out = npu_dict.get(CompareConst.OUTPUT_STRUCT) - npu_in_len = len(npu_struct_in) - npu_out_len = len(npu_struct_out) - if npu_in_len == len(ms_user_args_list) and npu_out_len == len(ms_user_output_list): - return del_bench_dict - ms_input_args_list = [i for i in range(npu_in_len)] - input_sub_list = list(set(ms_input_args_list) - set(ms_user_args_list)) - ms_output_args_list = [i for i in range(npu_out_len)] - output_sub_list = 
list(set(ms_output_args_list) - set(ms_user_output_list)) - bench_op_name = del_bench_dict.get(CompareConst.OP_NAME, []) - bench_struct_in = del_bench_dict.get(CompareConst.INPUT_STRUCT, []) - bench_struct_out = del_bench_dict.get(CompareConst.OUTPUT_STRUCT, []) - bench_summary = del_bench_dict.get(CompareConst.SUMMARY, []) - for idx in input_sub_list: # Fill in the blank value field in the pt dictionary - bench_op_name.insert(idx, CompareConst.N_A) - bench_struct_in.insert(idx, CompareConst.N_A) - bench_summary.insert(idx, CompareConst.N_A) - for idx in output_sub_list: # Fill in the blank value field in the pt dictionary - bench_op_name.insert(npu_in_len + idx, CompareConst.N_A) - bench_struct_out.insert(idx, CompareConst.N_A) - bench_summary.insert(npu_in_len + idx, CompareConst.N_A) - del_bench_dict.update({CompareConst.OP_NAME: bench_op_name, CompareConst.INPUT_STRUCT: bench_struct_in, - CompareConst.OUTPUT_STRUCT: bench_struct_out, CompareConst.SUMMARY: bench_summary}) - return del_bench_dict + def compare_process(self, file_lists): + npu_json_path, bench_json_path, stack_json_path = file_lists + npu_json_data = load_json(npu_json_path) + bench_json_data = load_json(bench_json_path) + stack_json_data = load_json(stack_json_path) if self.stack_mode else None + + npu_df = self.gen_data_df(npu_json_data, stack_json_data) + bench_df = self.gen_data_df(bench_json_data, stack_json_data) + if self.cell_mapping: + npu_df[CompareConst.COMPARE_KEY] = npu_df[CompareConst.OP_NAME].apply(self.process_cell_mapping) + elif self.api_mapping: + npu_df[CompareConst.COMPARE_KEY] = npu_df[CompareConst.OP_NAME].apply(self.process_internal_api_mapping) + if isinstance(self.api_mapping, str): + self.modify_compare_data_with_user_mapping(npu_df, bench_df) + else: + npu_df[CompareConst.COMPARE_KEY] = npu_df[CompareConst.OP_NAME] + npu_df[[Const.DTYPE, Const.SHAPE]] = npu_df[[Const.DTYPE, Const.SHAPE]].astype(str) + bench_df[[Const.DTYPE, Const.SHAPE]] = bench_df[[Const.DTYPE, 
Const.SHAPE]].astype(str) + npu_df[CompareConst.COMPARE_SHAPE] = npu_df[Const.SHAPE] + bench_df[CompareConst.COMPARE_KEY] = bench_df[CompareConst.OP_NAME] + bench_df[CompareConst.COMPARE_SHAPE] = bench_df[Const.SHAPE] + match_result = pd.merge(npu_df, bench_df, on=[CompareConst.COMPARE_KEY, CompareConst.COMPARE_SHAPE], + how='outer') + match_result = match_result[match_result['op_name_x'].notna()].fillna(CompareConst.N_A) + + def gen_dtype_condition(): + npu_dtype = match_result['dtype_x'] + bench_dtype = match_result['dtype_y'] + if self.cross_frame: + npu_dtype = npu_dtype.map(dtype_mapping).fillna(npu_dtype) + return ((npu_dtype == bench_dtype) | + ((npu_dtype == Const.FLOAT16) & (bench_dtype == Const.FLOAT32)) | + ((npu_dtype == Const.FLOAT32) & (bench_dtype == Const.FLOAT16)) | + ((npu_dtype == Const.FLOAT16) & (bench_dtype == Const.BFLOAT16)) | + ((npu_dtype == Const.BFLOAT16) & (bench_dtype == Const.FLOAT16)) | + ((npu_dtype == Const.TORCH_FLOAT16) & (bench_dtype == Const.TORCH_FLOAT32)) | + ((npu_dtype == Const.TORCH_FLOAT32) & (bench_dtype == Const.TORCH_FLOAT16)) | + ((npu_dtype == Const.TORCH_FLOAT16) & (bench_dtype == Const.TORCH_BFLOAT16)) | + ((npu_dtype == Const.TORCH_BFLOAT16) & (bench_dtype == Const.TORCH_FLOAT16))) + + match_result.loc[~gen_dtype_condition(), [i + '_y' for i in bench_df.columns]] = CompareConst.N_A + return self.make_result_df(match_result) + + def modify_compare_data_with_user_mapping(self, npu_df, bench_df): + def get_api_indices_dict(op_name_df): + api_indices_dict = defaultdict(list) + for op_index, name in enumerate(op_name_df[CompareConst.OP_NAME]): + api = self.get_api_name(name.split(Const.SEP)) + api_indices_dict[api].append(op_index) + return api_indices_dict + + ms_api_indices_dict = get_api_indices_dict(npu_df) + pt_api_indices_dict = get_api_indices_dict(bench_df) + + def gen_input_compare_key(pattern, term): + flag = True + for i, prefix in enumerate(mapping_dict.get(f'ms_{term}')): + if 
op_name.split(pattern)[1].startswith(str(prefix)): + npu_df.loc[index, CompareConst.COMPARE_KEY] = ( + op_name.replace(pattern + str(prefix), + pattern + str(mapping_dict.get(f'pt_{term}')[i]))) + flag = False + return flag + + for mapping_dict in self.api_mapping_dict: + keys_to_compare = [ + ('ms_args', 'pt_args'), + ('ms_output', 'pt_output'), + ('ms_parameters', 'pt_parameters'), + ('ms_parameters_grad', 'pt_parameters_grad'), + ] + if not all(len(mapping_dict.get(k1, [])) == len(mapping_dict.get(k2, [])) for k1, k2 in keys_to_compare): + logger.warning('The user-defined mapping table is incorrect,\ + make sure that the number of parameters is equal') + continue + + ms_api, pt_api = mapping_dict.get('ms_api'), mapping_dict.get('pt_api') + if ms_api not in ms_api_indices_dict or pt_api not in pt_api_indices_dict: + continue + for index in ms_api_indices_dict.get(ms_api): + op_name = npu_df.loc[index, CompareConst.OP_NAME].replace(ms_api, pt_api, 1) + if CompareConst.INPUT_PATTERN in op_name: + is_abandoned = gen_input_compare_key(CompareConst.INPUT_PATTERN, 'args') + elif CompareConst.KWARGS_PATTERN in op_name: + is_abandoned = gen_input_compare_key(CompareConst.KWARGS_PATTERN, 'args') + elif CompareConst.OUTPUT_PATTERN in op_name: + is_abandoned = gen_input_compare_key(CompareConst.OUTPUT_PATTERN, 'output') + elif CompareConst.PARAMS_PATTERN in op_name: + is_abandoned = gen_input_compare_key(CompareConst.PARAMS_PATTERN, 'parameters') + elif CompareConst.PARAMS_GRAD_PATTERN in op_name: + is_abandoned = gen_input_compare_key(CompareConst.PARAMS_GRAD_PATTERN, 'parameters_grad') + else: + logger.error(f'Unexpected op_name: {op_name}') + raise CompareException(CompareException.INVALID_DATA_ERROR) + if is_abandoned: + npu_df.loc[index, CompareConst.COMPARE_KEY] = op_name + 'abandoned' + + def gen_data_df(self, data_json, stack_json_data): + result = { + CompareConst.OP_NAME: [], + Const.DTYPE: [], + Const.SHAPE: [], + Const.SUMMARY: [], + 'stack_info': [] + } + if 
self.dump_mode == Const.ALL: + result['data_name'] = [] + elif self.dump_mode == Const.MD5: + result[Const.MD5] = [] + for data_name in data_json['data']: + check_op_str_pattern_valid(data_name) + merge_list = self.gen_merge_list(data_json, data_name, stack_json_data) + if not merge_list: + continue + + op_name_list = merge_list.get(CompareConst.OP_NAME) + summary_list = merge_list.get(Const.SUMMARY) + data_name_list = merge_list.get('data_name') + op_name_reorder, summary_reorder, data_name_reorder = reorder_op_x_list(op_name_list, + summary_list, + data_name_list) + for op_name in op_name_reorder: + result[CompareConst.OP_NAME].append(op_name) + if (CompareConst.INPUT_PATTERN in op_name) or (CompareConst.KWARGS_PATTERN in op_name): + struct = merge_list[CompareConst.INPUT_STRUCT].pop(0) + elif CompareConst.OUTPUT_PATTERN in op_name: + struct = merge_list[CompareConst.OUTPUT_STRUCT].pop(0) + elif CompareConst.PARAMS_PATTERN in op_name: + struct = merge_list[CompareConst.PARAMS_STRUCT].pop(0) + else: + struct = merge_list[CompareConst.PARAMS_GRAD_STRUCT].pop(0) + result[Const.DTYPE].append(struct[0]) + result[Const.SHAPE].append(struct[1]) + if self.dump_mode == Const.MD5: + result[Const.MD5].append(struct[2]) + result[Const.SUMMARY].append(summary_reorder.pop(0)) + result['stack_info'].append(merge_list['stack_info'][0] if self.stack_mode else None) + if self.dump_mode == Const.ALL: + result['data_name'].append(data_name_reorder.pop(0)) + return pd.DataFrame(result) def check_cross_framework(bench_json_path): - pattern = r'"data_name":\s*"[^"]+\.pt"' - with FileOpen(bench_json_path, 'r') as file: - for line in file: - if re.search(pattern, line): - return True - return False + framework = detect_framework_by_dump_json(bench_json_path) + if framework == Const.PT_FRAMEWORK: + return True + else: + return False def ms_compare(input_param, output_path, **kwargs): try: - stack_mode = kwargs.get('stack_mode', False) auto_analyze = kwargs.get('auto_analyze', True) 
fuzzy_match = kwargs.get('fuzzy_match', False) cell_mapping = kwargs.get('cell_mapping', None) api_mapping = kwargs.get('api_mapping', None) data_mapping = kwargs.get('data_mapping', None) layer_mapping = kwargs.get('layer_mapping', None) + suffix = kwargs.get('suffix', '') set_dump_path(input_param) dump_mode = get_dump_mode(input_param) + if 'stack_json_path' in input_param: + stack_mode = kwargs.get('stack_mode', False) + else: + stack_mode = set_stack_json_path(input_param) # set stack_mode and set "stack_json_path" in input_param check_configuration_param(stack_mode, auto_analyze, fuzzy_match, input_param.get('is_print_compare_log', True)) create_directory(output_path) - check_compare_param(input_param, output_path, dump_mode) + check_compare_param(input_param, output_path, dump_mode, stack_mode) except (CompareException, FileCheckException) as error: logger.error('Compare failed. Please check the arguments and do it again!') raise CompareException(error.code) from error if layer_mapping: data_mapping = generate_data_mapping_by_layer_mapping(input_param, layer_mapping, output_path) - is_cross_framework = check_cross_framework(input_param.get("bench_json_path")) - ms_comparator = MSComparator(cell_mapping, api_mapping, data_mapping, is_cross_framework) - ms_comparator.compare_core(input_param, output_path, stack_mode=stack_mode, - auto_analyze=auto_analyze, fuzzy_match=fuzzy_match, dump_mode=dump_mode) + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig(cell_mapping, api_mapping, data_mapping) + is_cross_framework = check_cross_framework(input_param.get('bench_json_path')) + ms_comparator = MSComparator(mode_config, mapping_config, is_cross_framework) + ms_comparator.compare_core(input_param, output_path, suffix=suffix) diff --git a/debug/accuracy_tools/msprobe/mindspore/compare/ms_graph_compare.py b/debug/accuracy_tools/msprobe/mindspore/compare/ms_graph_compare.py index 
58439e42bdee24f99991e8cfd88db8a6c8889a90..153f4fd655212b24904c33d29dad694ee1dd2c1f 100644 --- a/debug/accuracy_tools/msprobe/mindspore/compare/ms_graph_compare.py +++ b/debug/accuracy_tools/msprobe/mindspore/compare/ms_graph_compare.py @@ -25,7 +25,7 @@ from msprobe.core.common.file_utils import load_npy, read_csv, save_excel from msprobe.core.common.log import logger from msprobe.core.common.utils import add_time_with_xlsx, CompareException from msprobe.core.compare.multiprocessing_compute import _ms_graph_handle_multi_process, check_accuracy -from msprobe.core.compare.npy_compare import npy_data_check, statistics_data_check, reshape_value, compare_ops_apply +from msprobe.core.compare.npy_compare import npy_data_check, statistics_data_check, compare_ops_apply from msprobe.mindspore.common.utils import convert_to_int, list_lowest_level_directories @@ -80,7 +80,7 @@ def statistic_data_read(statistic_file_list, statistic_file_path): data_list = [] statistic_data_list = [] header_index = { - 'Data Type': None, 'Shape': None, 'Max Value': None, + 'Data Type': None, 'Shape': None, 'Max Value': None, 'Min Value': None, 'Avg Value': None, 'L2Norm Value': None } for statistic_file in statistic_file_list: @@ -144,10 +144,16 @@ def generate_data_name(data_path): mode = GraphMode.STATISTIC_MODE else: mode = GraphMode.ERROR_MODE - logger.error(f"Error mode.") + logger.error("Error mode.") return mode, data_list +def transform_special_string_into_float(data_frame): + data_frame[data_frame == "null"] = '0' + data_frame[data_frame == "False"] = '0' + data_frame[data_frame == "True"] = '1' + + class GraphMSComparator: def __init__(self, input_param, output_path): self.output_path = output_path @@ -187,14 +193,14 @@ class GraphMSComparator: result_dict[CompareConst.ERROR_MESSAGE] = error_message if not error_flag: - n_value, b_value = reshape_value(n_value, b_value) result_list, err_msg = compare_ops_apply(n_value, b_value, False, "") result_dict[CompareConst.COSINE] = 
result_list[0] - result_dict[CompareConst.MAX_ABS_ERR] = result_list[1] - result_dict[CompareConst.MAX_RELATIVE_ERR] = result_list[2] - result_dict[CompareConst.ONE_THOUSANDTH_ERR_RATIO] = result_list[3] - result_dict[CompareConst.FIVE_THOUSANDTHS_ERR_RATIO] = result_list[4] - result_dict[CompareConst.ACCURACY] = check_accuracy(result_list[0], result_list[1]) + result_dict[CompareConst.EUC_DIST] = result_list[1] + result_dict[CompareConst.MAX_ABS_ERR] = result_list[2] + result_dict[CompareConst.MAX_RELATIVE_ERR] = result_list[3] + result_dict[CompareConst.ONE_THOUSANDTH_ERR_RATIO] = result_list[4] + result_dict[CompareConst.FIVE_THOUSANDTHS_ERR_RATIO] = result_list[5] + result_dict[CompareConst.ACCURACY] = check_accuracy(result_list[0], result_list[2]) result_dict[CompareConst.ERROR_MESSAGE] = err_msg return pd.Series(result_dict) @@ -228,7 +234,8 @@ class GraphMSComparator: result_dict[CompareConst.MAX_RELATIVE_ERR] = result_dict[CompareConst.MAX_DIFF] / result_dict[ CompareConst.BENCH_MAX] if result_dict[CompareConst.BENCH_MAX] > 0 else 0 if not np.isnan(result_dict[CompareConst.MAX_RELATIVE_ERR]): - result_dict[CompareConst.MAX_RELATIVE_ERR] = str(result_dict[CompareConst.MAX_RELATIVE_ERR] * 100) + "%" + result_dict[CompareConst.MAX_RELATIVE_ERR] = str( + result_dict[CompareConst.MAX_RELATIVE_ERR] * 100) + "%" result_dict[CompareConst.MIN_RELATIVE_ERR] = result_dict[CompareConst.MIN_DIFF] / result_dict[ CompareConst.BENCH_MIN] if result_dict[CompareConst.BENCH_MIN] > 0 else 0 if not np.isnan(result_dict[CompareConst.MIN_RELATIVE_ERR]): @@ -272,12 +279,12 @@ class GraphMSComparator: is_empty = True if is_empty or not mode: continue - compare_result_df = self._do_multi_process(compare_result_df, mode) + compare_result_df = self.do_multi_process(compare_result_df, mode) compare_result_name = add_time_with_xlsx(f"compare_result_{str(rank_id)}_{str(step_id)}") compare_result_path = os.path.join(os.path.realpath(self.output_path), f"{compare_result_name}") 
self.to_excel(compare_result_df, compare_result_path) logger.info(f"Compare rank: {rank_id} step: {step_id} finish. Compare result: {compare_result_path}.") - + def to_excel(self, compare_result_df: pd.DataFrame, compare_result_path: str, slice_num=0, need_slice=False) -> int: size = len(compare_result_df) # sheet size cannot be larger than 1048576 @@ -287,8 +294,8 @@ class GraphMSComparator: save_excel(compare_result_path, compare_result_df) return slice_num + 1 else: - slice_num = self.to_excel(compare_result_df.iloc[0: size//2], compare_result_path, slice_num, True) - return self.to_excel(compare_result_df.iloc[size//2:], compare_result_path, slice_num, True) + slice_num = self.to_excel(compare_result_df.iloc[0: size // 2], compare_result_path, slice_num, True) + return self.to_excel(compare_result_df.iloc[size // 2:], compare_result_path, slice_num, True) def compare_process(self, rank_id, step_id): # generate data_path @@ -333,13 +340,17 @@ class GraphMSComparator: CompareConst.BENCH_NORM]) npu_float_type = [CompareConst.NPU_MAX, CompareConst.NPU_MIN, CompareConst.NPU_MEAN, CompareConst.NPU_NORM] - npu_data_df[npu_float_type] = npu_data_df[npu_float_type].astype(float) + npu_float_data_df = npu_data_df[npu_float_type].astype(str) + transform_special_string_into_float(npu_float_data_df) + npu_data_df[npu_float_type] = npu_float_data_df.astype(float) bench_float_type = [ - CompareConst.BENCH_MAX, CompareConst.BENCH_MIN, + CompareConst.BENCH_MAX, CompareConst.BENCH_MIN, CompareConst.BENCH_MEAN, CompareConst.BENCH_NORM ] - bench_data_df[bench_float_type] = bench_data_df[bench_float_type].astype(float) + bench_float_data_df = bench_data_df[bench_float_type].astype(str) + transform_special_string_into_float(bench_float_data_df) + bench_data_df[bench_float_type] = bench_float_data_df.astype(float) npu_data_df['Local Index'] = npu_data_df.sort_values('TimeStamp').groupby('Compare Key').cumcount() bench_data_df['Local Index'] = 
bench_data_df.sort_values('TimeStamp').groupby('Compare Key').cumcount() @@ -388,7 +399,7 @@ class GraphMSComparator: rank_step_path_dict[rank_step_key] = [dir_path] return dict(sorted(rank_step_path_dict.items())) - def _do_multi_process(self, result_df, mode): + def do_multi_process(self, result_df, mode): try: result_df = _ms_graph_handle_multi_process(self.compare_ops, result_df, mode) except ValueError as e: diff --git a/debug/accuracy_tools/msprobe/mindspore/debugger/debugger_config.py b/debug/accuracy_tools/msprobe/mindspore/debugger/debugger_config.py index b1c3ad6fcc3c52cd2e4370e453f26941e2f1c95f..92155b4ec4ebd636477ef67f1c75b43e7a82b802 100644 --- a/debug/accuracy_tools/msprobe/mindspore/debugger/debugger_config.py +++ b/debug/accuracy_tools/msprobe/mindspore/debugger/debugger_config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -16,9 +16,11 @@ import os from msprobe.core.common.const import Const +from msprobe.core.common.exceptions import MsprobeException from msprobe.core.common.file_utils import create_directory from msprobe.mindspore.common.const import Const as MsConst from msprobe.mindspore.common.const import FreeBenchmarkConst +from msprobe.core.common.log import logger class DebuggerConfig: @@ -39,6 +41,7 @@ class DebuggerConfig: self.check_mode = task_config.check_mode self.framework = Const.MS_FRAMEWORK self.summary_mode = task_config.summary_mode + self.async_dump = common_config.async_dump if common_config.async_dump else False self.check() create_directory(self.dump_path) @@ -49,9 +52,12 @@ class DebuggerConfig: if not task_config.handler_type else task_config.handler_type) self.stage = FreeBenchmarkConst.DEFAULT_STAGE if not task_config.fuzz_stage else task_config.fuzz_stage if self.handler_type == FreeBenchmarkConst.FIX and \ - self.pert_type != 
FreeBenchmarkConst.DEFAULT_PERT_TYPE: + self.pert_type != FreeBenchmarkConst.DEFAULT_PERT_TYPE: raise ValueError("pert_mode must be improve_precision or empty when handler_type is fix, " f"but got {self.pert_type}.") + if self.stage == Const.BACKWARD and self.handler_type == FreeBenchmarkConst.FIX: + raise ValueError("handler_type must be check or empty when fuzz_stage is backward, " + f"but got {self.handler_type}.") self.dump_level = FreeBenchmarkConst.DEFAULT_DUMP_LEVEL def check(self): @@ -66,4 +72,27 @@ class DebuggerConfig: self.file_format = "npy" if not self.check_mode: self.check_mode = "all" + if not isinstance(self.async_dump, bool): + raise Exception("The parameters async_dump should be bool.") + if self.async_dump and self.task == Const.TENSOR and not self.list: + raise Exception("The parameters async_dump is true in tensor task, the parameters list cannot be empty.") + if self.task == Const.STRUCTURE and self.level_ori not in [Const.LEVEL_L0, Const.LEVEL_MIX]: + logger.warning_on_rank_0( + f"When the task is set to structure, the level should be one of {[Const.LEVEL_L0, Const.LEVEL_MIX]}. " + f"If not, the default level is {Const.LEVEL_MIX}." 
+ ) + self.level_ori = Const.LEVEL_MIX return True + + def check_config_with_l2(self): + if self.level_ori != Const.LEVEL_L2: + return + if self.task != Const.TENSOR: + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"When level is set to L2, the task must be set to tensor.") + if self.scope: + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"When level is set to L2, the scope cannot be configured.") + if not self.list or len(self.list) != 1: + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"When level is set to L2, the list must be configured as a list with one api name.") diff --git a/debug/accuracy_tools/msprobe/mindspore/debugger/precision_debugger.py b/debug/accuracy_tools/msprobe/mindspore/debugger/precision_debugger.py index 0de71e078d1122894815e03beab285bea4c7d9e0..7694d71dd98ae1c7c4611f9435a274ac018e5df6 100644 --- a/debug/accuracy_tools/msprobe/mindspore/debugger/precision_debugger.py +++ b/debug/accuracy_tools/msprobe/mindspore/debugger/precision_debugger.py @@ -1,7 +1,7 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -14,25 +14,42 @@ # limitations under the License. 
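The `check_config_with_l2` method added above enforces that L2-level dumps use the tensor task, no scope filter, and a single-API list. A standalone sketch of that rule set — class and constant names here are simplified stand-ins, not the actual msprobe API:

```python
# Hedged sketch of the L2-level validation added to DebuggerConfig.
# "L2" / "tensor" are used as plain strings in place of msprobe's Const values.

class ConfigError(ValueError):
    pass


def check_config_with_l2(level, task, scope, api_list):
    """L2 dumps require task=tensor, no scope, and exactly one API name."""
    if level != "L2":
        return  # rule only applies at level L2
    if task != "tensor":
        raise ConfigError("When level is set to L2, the task must be set to tensor.")
    if scope:
        raise ConfigError("When level is set to L2, the scope cannot be configured.")
    if not api_list or len(api_list) != 1:
        raise ConfigError("When level is set to L2, the list must contain exactly one api name.")


check_config_with_l2("L2", "tensor", None, ["Functional.matmul.0"])  # passes silently
```

Failing any one of the three checks raises immediately, which matches the diff's pattern of rejecting invalid configurations before any service is created.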
import os +from collections import defaultdict, namedtuple import mindspore as ms from mindspore._c_expression import MSContext -from msprobe.core.common.const import Const, MsgConst +from msprobe.core.common.const import Const, FileCheckConst, MsgConst +from msprobe.core.common.exceptions import MsprobeException +from msprobe.core.common.file_utils import FileChecker +from msprobe.core.common.utils import get_real_step_or_rank +from msprobe.mindspore.cell_processor import CellProcessor from msprobe.mindspore.common.const import Const as MsConst +from msprobe.mindspore.common.utils import set_register_backward_hook_functions, check_save_param from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.hook_cell.api_registry import api_register +from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell from msprobe.mindspore.grad_probe.grad_monitor import GradientMonitor from msprobe.mindspore.ms_config import parse_json_config from msprobe.mindspore.runtime import Runtime from msprobe.mindspore.service import Service from msprobe.mindspore.task_handler_factory import TaskHandlerFactory +try: + from msprobe.lib import _msprobe_c +except ImportError: + _msprobe_c = None + + +ConfigParameters = namedtuple("ConfigParameters", ["config_path", "task", "dump_path", "level"]) + class PrecisionDebugger: _instance = None task_not_need_service = [Const.GRAD_PROBE] - def __new__(cls, config_path=None, opt=None): + def __new__(cls, config_path=None, task=None, dump_path=None, + level=None, step=None, opt=None): if not cls._instance: cls._instance = super().__new__(cls) cls._instance.initialized = False @@ -41,22 +58,66 @@ class PrecisionDebugger: cls.first_start = False return cls._instance - def __init__(self, config_path=None): + def __init__(self, config_path=None, task=None, dump_path=None, + level=None, step=None): if self.initialized: return self.initialized = True + + set_register_backward_hook_functions() + if not 
config_path: config_path = os.path.join(os.path.dirname(__file__), "../../config.json") + + config_params = ConfigParameters(config_path, task, dump_path, level) + self.check_input_params(config_params) + common_config, task_config = parse_json_config(config_path) + common_config.task = task if task else common_config.task self.task = common_config.task if self.task == Const.GRAD_PROBE: self.gm = GradientMonitor(common_config, task_config) return + common_config.step = get_real_step_or_rank( + step, Const.STEP) if step is not None else common_config.step + common_config.level = level if level else common_config.level + common_config.dump_path = dump_path if dump_path else common_config.dump_path self.config = DebuggerConfig(common_config, task_config) + if _msprobe_c: + _msprobe_c._PrecisionDebugger(framework="MindSpore", config_path=config_path) + + self.config.execution_mode = self._get_execution_mode() + if self._need_service(): + self.config.check_config_with_l2() + self.service = Service(self.config) + Runtime.step_count = 0 Runtime.is_running = False + @staticmethod + def check_input_params(args): + if args.config_path is not None: + if not isinstance(args.config_path, str): + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, f"config_path must be a string") + file_checker = FileChecker( + file_path=args.config_path, path_type=FileCheckConst.FILE, file_type=FileCheckConst.JSON_SUFFIX) + file_checker.common_check() + + if args.task is not None and args.task not in Const.TASK_LIST: + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, f"task must be one of {Const.TASK_LIST}") + + if args.dump_path is not None: + if not isinstance(args.dump_path, str): + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, f"dump_path must be a string") + + if args.level is not None and args.level not in Const.LEVEL_LIST: + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, f"level must be one of {Const.LEVEL_LIST}") + 
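`check_input_params` above bundles the constructor arguments into a `ConfigParameters` namedtuple before validating them. A minimal sketch of that pattern, with hypothetical task/level lists (the real values live in msprobe's `Const`, and the real code additionally runs a `FileChecker` on `config_path`):

```python
from collections import namedtuple

# Field names match the diff; the allowed-value lists below are assumptions.
ConfigParameters = namedtuple("ConfigParameters", ["config_path", "task", "dump_path", "level"])

TASK_LIST = ["tensor", "statistics", "structure"]   # placeholder for Const.TASK_LIST
LEVEL_LIST = ["L0", "L1", "L2", "mix", "debug"]     # placeholder for Const.LEVEL_LIST


def check_input_params(args):
    # every field is optional: None means "fall back to the JSON config"
    if args.config_path is not None and not isinstance(args.config_path, str):
        raise ValueError("config_path must be a string")
    if args.task is not None and args.task not in TASK_LIST:
        raise ValueError(f"task must be one of {TASK_LIST}")
    if args.dump_path is not None and not isinstance(args.dump_path, str):
        raise ValueError("dump_path must be a string")
    if args.level is not None and args.level not in LEVEL_LIST:
        raise ValueError(f"level must be one of {LEVEL_LIST}")


check_input_params(ConfigParameters("./config.json", "tensor", "./dump", "L1"))  # ok
```

Using a namedtuple keeps the validation helper's signature stable as parameters are added, instead of threading four positional arguments through every check.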
@staticmethod def _get_execution_mode(): jit_level = ms.context.get_jit_config().get(MsConst.JIT_LEVEL) @@ -75,11 +136,23 @@ class PrecisionDebugger: else: return MsConst.PYNATIVE_MODE + @staticmethod + def _is_graph_dump(config): + if config.level != MsConst.KERNEL: + return False + if not config.list: + return True + is_graph = any(item.startswith("name-regex") for item in config.list) + is_graph |= all("." not in item for item in config.list) + return is_graph + @classmethod def start(cls, model=None): instance = cls._instance if not instance: raise Exception(MsgConst.NOT_CREATED_INSTANCE) + if _msprobe_c: + _msprobe_c._PrecisionDebugger().start() if instance.task in PrecisionDebugger.task_not_need_service: return @@ -90,6 +163,7 @@ class PrecisionDebugger: instance.service.start(model) else: if not instance.first_start: + api_register.api_set_ori_func() handler = TaskHandlerFactory.create(instance.config) handler.handle() @@ -99,18 +173,15 @@ class PrecisionDebugger: @classmethod def forward_backward_dump_end(cls): instance = cls._instance - if not instance: - raise Exception(MsgConst.NOT_CREATED_INSTANCE) - if instance.task in PrecisionDebugger.task_not_need_service: - return - if instance.service: - instance.service.forward_backward_dump_end() + instance.stop() @classmethod def stop(cls): instance = cls._instance if not instance: raise Exception(MsgConst.NOT_CREATED_INSTANCE) + if _msprobe_c: + _msprobe_c._PrecisionDebugger().stop() if instance.task == Const.GRAD_PROBE: instance.gm.stop() if instance.task in PrecisionDebugger.task_not_need_service: @@ -124,10 +195,15 @@ class PrecisionDebugger: instance = cls._instance if not instance: raise Exception(MsgConst.NOT_CREATED_INSTANCE) + if _msprobe_c: + _msprobe_c._PrecisionDebugger().step() if instance.task in PrecisionDebugger.task_not_need_service: return if instance.service: instance.service.step() + HOOKCell.cell_count = defaultdict(int) + CellProcessor.reset_cell_stats() + Runtime.step_count += 1 
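The `step()` changes above reset `HOOKCell.cell_count` (a `defaultdict`) and bump `Runtime.step_count` between training steps, so that per-API call numbering restarts each step. A simplified standalone illustration of that bookkeeping (the real prefix construction in `HOOKCell.__init__` is more involved):

```python
from collections import defaultdict

class Runtime:
    step_count = 0

cell_count = defaultdict(int)


def name_api_call(api_name):
    # number repeated calls within one step: name.0, name.1, ...
    index = cell_count[api_name]
    cell_count[api_name] += 1
    return f"{api_name}.{index}"


def step():
    global cell_count
    cell_count = defaultdict(int)   # restart per-call numbering
    Runtime.step_count += 1         # advance the global step counter


first = name_api_call("Functional.matmul")    # Functional.matmul.0
second = name_api_call("Functional.matmul")   # Functional.matmul.1
step()
third = name_api_call("Functional.matmul")    # Functional.matmul.0 again
```

Without the reset, call indices would keep growing across steps and dumped names from different steps could no longer be compared positionally.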
@classmethod @@ -139,6 +215,24 @@ class PrecisionDebugger: return instance.gm.monitor(opt) + @classmethod + def save(cls, variable, name, save_backward=True): + instance = cls._instance + if not instance: + raise Exception(MsgConst.NOT_CREATED_INSTANCE) + if instance.task not in [Const.TENSOR, Const.STATISTICS] or instance.config.level_ori != Const.LEVEL_DEBUG: + return + try: + check_save_param(variable, name, save_backward) + except ValueError: + return + + instance.config.execution_mode = cls._get_execution_mode() + if cls._need_service(): + if not instance.service: + instance.service = Service(instance.config) + instance.service.save(variable, name, save_backward) + @classmethod def _need_service(cls): instance = cls._instance @@ -147,4 +241,4 @@ class PrecisionDebugger: if instance.config.execution_mode != MsConst.PYNATIVE_MODE: return False else: - return instance.config.task != Const.FREE_BENCHMARK and instance.config.level != MsConst.KERNEL + return instance.config.task != Const.FREE_BENCHMARK and not instance._is_graph_dump(instance.config) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/dump_tool_factory.py b/debug/accuracy_tools/msprobe/mindspore/dump/dump_tool_factory.py index 42a5b3ad948db33ba151722fa4620a83be688520..0ca63b4a84aee00127bca37b7da36888e905a5aa 100644 --- a/debug/accuracy_tools/msprobe/mindspore/dump/dump_tool_factory.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/dump_tool_factory.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -40,6 +40,8 @@ class DumpToolFactory: @staticmethod def create(config: DebuggerConfig): + if len(config.data_mode) != 1 or config.data_mode[0] not in Const.GRAPH_DATA_MODE_LIST: + raise Exception("data_mode must be one of all, input, output.") tool = DumpToolFactory.tools.get(config.level) if not tool: raise Exception("Valid level is needed.") diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/api_registry.py b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/api_registry.py index 09a687de28e882f96d517107acf74a2b46502952..7aee1deccd9689985c7a2e270648bd0877cd7cf3 100644 --- a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/api_registry.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/api_registry.py @@ -1,4 +1,5 @@ -# Copyright 2024 Huawei Technologies Co., Ltd +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
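The `DumpToolFactory.create` change above rejects any `data_mode` that is not a single value from the graph-mode whitelist. The guard in isolation, assuming the whitelist is `["all", "input", "output"]` as the error message suggests:

```python
# Sketch of the data_mode guard; GRAPH_DATA_MODE_LIST is an assumed value here.
GRAPH_DATA_MODE_LIST = ["all", "input", "output"]


def validate_data_mode(data_mode):
    # exactly one entry, and it must be a known mode
    if len(data_mode) != 1 or data_mode[0] not in GRAPH_DATA_MODE_LIST:
        raise ValueError("data_mode must be one of all, input, output.")


validate_data_mode(["input"])   # ok; ["input", "output"] or ["foo"] would raise
```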
-# ============================================================================ from mindspore import Tensor, ops, mint from mindspore.mint.nn import functional @@ -20,8 +20,21 @@ from mindspore.communication import comm_func from msprobe.mindspore.dump.hook_cell.wrap_api import (HOOKTensor, HOOKStubTensor, HOOKFunctionalOP, HOOKMintOP, HOOKMintNNFunctionalOP, HOOKDistributedOP, - get_wrap_api_list, setup_hooks) + HOOKTorchOP, HOOKTorchTensor, HOOKTorchFunctionalOP, + HOOKTorchDistributedOP, HOOKTorchNpuOP, + get_wrap_api_list, get_wrap_torch_api_list, setup_hooks) from msprobe.core.common.utils import Const +from msprobe.mindspore.common.utils import is_mindtorch + +if is_mindtorch(): + import torch + import torch_npu + + +def stub_method(method): + def wrapped_method(*args, **kwargs): + return method(*args, **kwargs) + return wrapped_method class ApiRegistry: @@ -34,6 +47,12 @@ class ApiRegistry: self.distributed_ori_attr = {} self.norm_inner_ops_ori_attr = {} + self.torch_ori_attr = {} + self.torch_tensor_ori_attr = {} + self.torch_functional_ori_attr = {} + self.torch_distributed_ori_attr = {} + self.torch_npu_ori_attr = {} + self.tensor_hook_attr = {} self.stub_tensor_hook_attr = {} self.functional_hook_attr = {} @@ -42,6 +61,12 @@ class ApiRegistry: self.distibuted_hook_attr = {} self.norm_inner_ops_hook_attr = {} + self.torch_hook_attr = {} + self.torch_tensor_hook_attr = {} + self.torch_functional_hook_attr = {} + self.torch_distributed_hook_attr = {} + self.torch_npu_hook_attr = {} + self.norm_inner_ops = ["norm", "square", "sqrt", "is_complex"] @staticmethod @@ -50,9 +75,13 @@ class ApiRegistry: if Const.SEP in api: sub_module_name, sub_op = api.rsplit(Const.SEP, 1) sub_module = getattr(ori_api_group, sub_module_name) - api_ori_attr[api] = getattr(sub_module, sub_op) + ori_api_func = getattr(sub_module, sub_op) else: - api_ori_attr[api] = getattr(ori_api_group, api) + ori_api_func = getattr(ori_api_group, api) + if ori_api_group == StubTensor: + 
api_ori_attr[api] = stub_method(ori_api_func) + continue + api_ori_attr[api] = ori_api_func @staticmethod def set_api_attr(api_group, attr_dict): @@ -72,22 +101,73 @@ class ApiRegistry: self.set_api_attr(ops, self.norm_inner_ops_ori_attr) def api_set_hook_func(self): - self.set_api_attr(Tensor, self.tensor_hook_attr) - self.set_api_attr(StubTensor, self.stub_tensor_hook_attr) - self.set_api_attr(ops, self.functional_hook_attr) - self.set_api_attr(mint, self.mint_ops_hook_attr) - self.set_api_attr(functional, self.mint_func_ops_hook_attr) - self.set_api_attr(comm_func, self.distibuted_hook_attr) + if is_mindtorch(): + self.set_api_attr(torch, self.torch_hook_attr) + self.set_api_attr(torch.Tensor, self.torch_tensor_hook_attr) + self.set_api_attr(torch.nn.functional, self.torch_functional_hook_attr) + self.set_api_attr(torch.distributed, self.torch_distributed_hook_attr) + self.set_api_attr(torch.distributed.distributed_c10d, self.torch_distributed_hook_attr) + self.set_api_attr(torch_npu, self.torch_npu_hook_attr) + else: + self.set_api_attr(Tensor, self.tensor_hook_attr) + self.set_api_attr(StubTensor, self.stub_tensor_hook_attr) + self.set_api_attr(ops, self.functional_hook_attr) + self.set_api_attr(mint, self.mint_ops_hook_attr) + self.set_api_attr(functional, self.mint_func_ops_hook_attr) + self.set_api_attr(comm_func, self.distibuted_hook_attr) def api_set_ori_func(self): - self.set_api_attr(Tensor, self.tensor_ori_attr) - self.set_api_attr(StubTensor, self.stub_tensor_ori_attr) - self.set_api_attr(ops, self.functional_ori_attr) - self.set_api_attr(mint, self.mint_ops_ori_attr) - self.set_api_attr(functional, self.mint_func_ops_ori_attr) - self.set_api_attr(comm_func, self.distributed_ori_attr) + if is_mindtorch(): + self.set_api_attr(torch, self.torch_ori_attr) + self.set_api_attr(torch.Tensor, self.torch_tensor_ori_attr) + self.set_api_attr(torch.nn.functional, self.torch_functional_ori_attr) + self.set_api_attr(torch.distributed, 
self.torch_distributed_ori_attr) + self.set_api_attr(torch.distributed.distributed_c10d, self.torch_distributed_ori_attr) + self.set_api_attr(torch_npu, self.torch_npu_ori_attr) + else: + self.set_api_attr(Tensor, self.tensor_ori_attr) + self.set_api_attr(StubTensor, self.stub_tensor_ori_attr) + self.set_api_attr(ops, self.functional_ori_attr) + self.set_api_attr(mint, self.mint_ops_ori_attr) + self.set_api_attr(functional, self.mint_func_ops_ori_attr) + self.set_api_attr(comm_func, self.distributed_ori_attr) def initialize_hook(self, hook): + setup_hooks(hook) + if is_mindtorch(): + wrap_torch_api_name = get_wrap_torch_api_list() + self.store_ori_attr(torch, + wrap_torch_api_name.torch_api_names, self.torch_ori_attr) + self.store_ori_attr(torch.Tensor, + wrap_torch_api_name.tensor_api_names, self.torch_tensor_ori_attr) + self.store_ori_attr(torch.nn.functional, + wrap_torch_api_name.functional_api_names, self.torch_functional_ori_attr) + self.store_ori_attr(torch.distributed, + wrap_torch_api_name.distributed_api_names, self.torch_distributed_ori_attr) + self.store_ori_attr(torch_npu, + wrap_torch_api_name.npu_api_names, self.torch_npu_ori_attr) + for attr_name in dir(HOOKTorchOP): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): + api_name = attr_name[Const.ATTR_NAME_PREFIX_LEN:] + self.torch_hook_attr[api_name] = getattr(HOOKTorchOP, attr_name) + for attr_name in dir(HOOKTorchTensor): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): + api_name = attr_name[Const.ATTR_NAME_PREFIX_LEN:] + self.torch_tensor_hook_attr[api_name] = getattr(HOOKTorchTensor, attr_name) + for attr_name in dir(HOOKTorchFunctionalOP): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): + api_name = attr_name[Const.ATTR_NAME_PREFIX_LEN:] + self.torch_functional_hook_attr[api_name] = getattr(HOOKTorchFunctionalOP, attr_name) + for attr_name in dir(HOOKTorchDistributedOP): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): + api_name = attr_name[Const.ATTR_NAME_PREFIX_LEN:] + 
self.torch_distributed_hook_attr[api_name] = getattr(HOOKTorchDistributedOP, attr_name) + for attr_name in dir(HOOKTorchNpuOP): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): + api_name = attr_name[Const.ATTR_NAME_PREFIX_LEN:] + self.torch_npu_hook_attr[api_name] = getattr(HOOKTorchNpuOP, attr_name) + return + wrap_api_name = get_wrap_api_list() self.store_ori_attr(Tensor, wrap_api_name.tensor_api_names, self.tensor_ori_attr) self.store_ori_attr(StubTensor, wrap_api_name.stub_tensor_api_names, self.stub_tensor_ori_attr) @@ -96,7 +176,6 @@ class ApiRegistry: self.store_ori_attr(functional, wrap_api_name.mint_nn_func_api_names, self.mint_func_ops_ori_attr) self.store_ori_attr(comm_func, wrap_api_name.distributed_api_names, self.distributed_ori_attr) self.store_ori_attr(ops, self.norm_inner_ops, self.norm_inner_ops_ori_attr) - setup_hooks(hook) for attr_name in dir(HOOKTensor): if attr_name.startswith(Const.ATTR_NAME_PREFIX): api_name = attr_name[Const.ATTR_NAME_PREFIX_LEN:] diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py index 83e9d4d34881e21111767e8d8de80f932b236122..b68a7d995a56497a219281c5a43d692c46cfac4d 100644 --- a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py @@ -1,4 +1,5 @@ -# Copyright 2024 Huawei Technologies Co., Ltd +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,45 +12,66 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
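`ApiRegistry` above follows a store/patch/restore pattern: original attributes are saved with `store_ori_attr` before hooked versions are installed, and `api_set_ori_func` puts them back. A generic self-contained sketch of that pattern, using the stdlib `math` module as a stand-in for the mindspore/torch API groups:

```python
import math

ori_attr = {}
calls = []


def store_ori_attr(module, api_names, attr_dict):
    # save the original callables before they get replaced
    for name in api_names:
        attr_dict[name] = getattr(module, name)


def set_api_attr(module, attr_dict):
    # install (or restore) a set of attributes on the module
    for name, func in attr_dict.items():
        setattr(module, name, func)


def make_hook(name, func):
    def hooked(*args, **kwargs):
        calls.append(name)          # the real hooks collect dump data here
        return func(*args, **kwargs)
    return hooked


store_ori_attr(math, ["sqrt"], ori_attr)
set_api_attr(math, {"sqrt": make_hook("sqrt", ori_attr["sqrt"])})
result = math.sqrt(4.0)             # routed through the hook, still returns 2.0
set_api_attr(math, ori_attr)        # restore, as api_set_ori_func does
```

Keeping the originals in a plain dict is what makes the hooks reversible — the diff's `api_set_ori_func` call in `PrecisionDebugger.start` relies on exactly this.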
-# ============================================================================ from collections import defaultdict from mindspore import nn -from msprobe.core.common.const import Const - - -class HOOKCell(nn.Cell): - cell_count = defaultdict(int) - g_stop_hook = False - - def __init__(self, build_hook) -> None: - super(HOOKCell, self).__init__() - self.changed_status = False - self.input_kwargs = {} - self.prefix = "" - if not HOOKCell.g_stop_hook: - HOOKCell.g_stop_hook = True - self.changed_status = True - if hasattr(self, "prefix_api_name"): - self.prefix = self.prefix_api_name - - HOOKCell.cell_count[self.prefix] += 1 - self.prefix = self.prefix + str(HOOKCell.cell_count[self.prefix] - 1) + Const.SEP - forward_hook, backward_hook = build_hook(self.prefix) - self.register_forward_hook(forward_hook) - self.register_backward_hook(backward_hook) - - # 重载call,加全局标志。 - def __call__(self, *args, **kwargs): - try: - self.input_kwargs = kwargs - out = super(HOOKCell, self).__call__(*args, **kwargs) - except Exception as e: - raise e - finally: - if self.changed_status: - self.changed_status = False - HOOKCell.g_stop_hook = False - return out +from msprobe.mindspore.common.utils import is_mindtorch, register_backward_hook_functions + + +def add_cell_count(name): + HOOKCell.cell_count[name] += 1 + + +def get_cell_count(name): + return HOOKCell.cell_count[name] + + +def __init__(self, build_hook) -> None: + super(HOOKCell, self).__init__() + self.changed_status = False + self.input_kwargs = {} + self.prefix = "" + if not HOOKCell.g_stop_hook: + HOOKCell.g_stop_hook = True + self.changed_status = True + if hasattr(self, "prefix_api_name"): + self.prefix = self.prefix_api_name + + self.forward_data_collected = False + forward_pre_hook, forward_hook, backward_hook, backward_pre_hook = build_hook(self.prefix) + self.register_forward_pre_hook(forward_pre_hook) + self.register_forward_hook(forward_hook) + register_backward_hook_functions["full"](self, backward_hook) + 
register_backward_hook_functions["pre"](self, backward_pre_hook) + + +# 重载call,加全局标志。 +def __call__(self, *args, **kwargs): + try: + self.input_kwargs = kwargs + out = super(HOOKCell, self).__call__(*args, **kwargs) + except Exception as e: + raise e + finally: + if self.changed_status: + self.changed_status = False + HOOKCell.g_stop_hook = False + return out + + +hook_cell_dict = { + "cell_count": defaultdict(int), + "g_stop_hook": False, + "add_cell_count": staticmethod(add_cell_count), + "get_cell_count": staticmethod(get_cell_count), + "__init__": __init__, + "__call__": __call__ +} + +if is_mindtorch(): + import torch + HOOKCell = type("HOOKCell", (torch.nn.Module,), hook_cell_dict) +else: + HOOKCell = type("HOOKCell", (nn.Cell,), hook_cell_dict) diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/primitive_hooks.py b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/primitive_hooks.py index 1a8383330fbff3f6579ef7afb8a00d733e88684d..656e48c678956563a6f2d1d5f5ab8a4d03f074e7 100644 --- a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/primitive_hooks.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/primitive_hooks.py @@ -1,4 +1,5 @@ -# Copyright 2024 Huawei Technologies Co., Ltd +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,18 +12,16 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
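The rewritten `hook_cell.py` above builds `HOOKCell` with the three-argument form of `type()`, so one class body can inherit from `torch.nn.Module` under MindTorch or `mindspore.nn.Cell` otherwise. A minimal illustration of that pattern with placeholder base classes:

```python
from collections import defaultdict

class CellBase:      # stands in for mindspore.nn.Cell
    pass

class ModuleBase:    # stands in for torch.nn.Module
    pass


def is_mindtorch():
    return False     # stand-in for the real backend probe


# shared class body, independent of which framework is present
hook_cell_dict = {
    "cell_count": defaultdict(int),
    "g_stop_hook": False,
}

base = ModuleBase if is_mindtorch() else CellBase
HOOKCell = type("HOOKCell", (base,), hook_cell_dict)
```

The advantage over two near-identical `class` statements is that the methods and class attributes are defined once and only the base class is selected at import time.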
-# ============================================================================ import os -import mindspore as ms -from mindspore.common.tensor import Tensor from mindspore import ops +from mindspore.common.tensor import Tensor -from msprobe.mindspore.common.log import logger from msprobe.core.common.utils import Const, DumpException -from msprobe.core.data_dump.data_processor.base import ModuleBackwardInputsOutputs, ModuleForwardInputsOutputs, \ - ModuleBackwardInputs, ModuleBackwardOutputs +from msprobe.core.data_dump.data_processor.base import (ModuleBackwardInputs, ModuleBackwardOutputs, + ModuleForwardInputsOutputs) +from msprobe.mindspore.common.log import logger class PrimitiveHookService: @@ -136,6 +135,34 @@ class PrimitiveHookService: return tuple(hooked_outputs) return out + def pre_forward_hook(primitive_name, primitive_instance, args, kwargs): + module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=None) + try: + self.service_instance.data_collector.forward_input_data_collect( + primitive_name, + primitive_instance, + os.getpid(), + module_input_output + ) + except Exception as exception: + logger.error(f"This is a primitive op dump error during forward input data collection: {exception}, " + f"primitive_name: {primitive_name}") + raise DumpException(DumpException.FORWARD_DATA_COLLECTION_ERROR) from exception + + def post_forward_hook(primitive_name, primitive_instance, args, kwargs, output): + module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=output) + try: + self.service_instance.data_collector.forward_output_data_collect( + primitive_name, + primitive_instance, + os.getpid(), + module_input_output + ) + except Exception as exception: + logger.error(f"This is a primitive op dump error during forward output data collection: {exception}, " + f"primitive_name: {primitive_name}") + raise DumpException(DumpException.FORWARD_DATA_COLLECTION_ERROR) from exception + def 
wrapped_primitive_call(instance_self, *args, **kwargs): """ 包装后的 primitive 调用函数,添加输入和输出的 hook。 @@ -164,27 +191,17 @@ class PrimitiveHookService: f"primitive_name: {primitive_name}") raise DumpException(DumpException.INPUT_HOOK_ERROR) from exception + forward_primitive_name = f"{updated_primitive_name}{Const.SEP}{Const.FORWARD}" + self.service_instance.data_collector.update_api_or_module_name(forward_primitive_name) + + pre_forward_hook(forward_primitive_name, instance_self, hooked_inputs, kwargs) try: out = origin_func(*hooked_inputs, **kwargs) except Exception as exception: logger.error(f"This is a primitive op dump error during function call: {exception}, " f"primitive_name: {primitive_name}") raise DumpException(DumpException.FUNCTION_CALL_ERROR) from exception - - forward_primitive_name = f"{updated_primitive_name}{Const.SEP}{Const.FORWARD}" - self.service_instance.data_collector.update_api_or_module_name(forward_primitive_name) - if self.service_instance.data_collector: - module_input_output = ModuleForwardInputsOutputs(args=hooked_inputs, kwargs=kwargs, output=out) - try: - self.service_instance.data_collector.forward_data_collect(forward_primitive_name, instance_self, - os.getpid(), module_input_output) - except Exception as exception: - logger.error(f"This is a primitive op dump error during forward data collection: {exception}, " - f"primitive_name: {primitive_name}") - raise DumpException(DumpException.FORWARD_DATA_COLLECTION_ERROR) from exception - - if self.service_instance.data_collector.if_return_forward_new_output(): - out = self.service_instance.data_collector.get_forward_new_output() + post_forward_hook(forward_primitive_name, instance_self, hooked_inputs, kwargs, out) try: out = hook_primitive_outputs(out, captured_grads_output, updated_primitive_name) @@ -202,4 +219,3 @@ class PrimitiveHookService: self.primitive_counters[primitive_name] = 0 else: self.primitive_counters[primitive_name] += 1 - diff --git 
a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml index 0dac3f0ac1e7dea26735585c15654a769f85425f..364062b46478b63369269c2470ea526eec59a3d3 100644 --- a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml +++ b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml @@ -15,7 +15,7 @@ # List of ops that register hooks - + ops: - adaptive_avg_pool1d - adaptive_avg_pool2d @@ -85,6 +85,7 @@ ops: - relu6 - celu - rrelu + - rms_norm - selu - sigmoid - silu @@ -490,6 +491,31 @@ ops: - scatter_update - derivative - jet + - row_stack + - gather + - arange + - cond + - slice_scatter + - clip_by_norm + - eps + - layer_norm + - cast + - numel + - permute + - select_scatter + - group_norm + - eq + - embedding + - ones_like + - zeros + - nanmean + - shape + - zeros_like + - ones + - diagonal_scatter + - vander + - is_nonzero + - rotary_position_embedding tensor: - __abs__ @@ -528,6 +554,7 @@ tensor: - acos - acosh - add + - add_ - addbmm - addcdiv - addcmul @@ -537,15 +564,15 @@ tensor: - all - amax - amin + - angle - any - arccos - arccosh - - argmax - - angle - arcsin - arcsinh - arctan - arctanh + - argmax - argmin - argsort - asin @@ -555,19 +582,23 @@ tensor: - atanh - baddbmm - bernoulli + - bfloat16 - bincount - bitwise_and - bitwise_or - bitwise_xor - bmm - bool + - bool astype - broadcast_to + - byte - ceil - - cholesky_solve - cholesky + - cholesky_solve - clamp - clip - conj + - copy - copysign - cos - cosh @@ -579,10 +610,13 @@ tensor: - deg2rad - diag - diagflat + - diagonal - diff - digamma - div + - div_ - divide + - double - equal - erf - erfc @@ -590,13 +624,16 @@ tensor: - exp - expand_as - expm1 + - flatten - flip - fliplr - flipud + - float - float_power - floor - fmod - frac + - from_numpy - gather_elements - ge - geqrf @@ -620,12 +657,12 @@ tensor: - inner - int - inverse + - is_complex + - is_signed - isclose 
- isfinite - isinf - isnan - - is_complex - - is_signed - isneginf - isposinf - isreal @@ -676,28 +713,27 @@ tensor: - new_ones - new_zeros - nextafter - - norm - nonzero + - norm - not_equal - ormqr - permute - pow - prod - qr + - rad2deg - ravel - real - reciprocal - remainder - renorm - - rad2deg - - tile - repeat_interleave - reshape - reshape - - round + - resize - rot90 + - round - rsqrt - - sum_to_size - scatter - sgn - short @@ -714,8 +750,11 @@ tensor: - square - squeeze - std + - sub + - sub_ - subtract - - subtract + - sum + - sum_to_size - svd - swapaxes - swapdims @@ -723,13 +762,13 @@ tensor: - take - tan - tanh - - trace - - swapaxes + - tensor_split - tile + - to - topk - - tril - - tensor_split + - trace - transpose + - tril - true_divide - trunc - unbind @@ -739,17 +778,6 @@ tensor: - view - where - xlogy - - from_numpy - - std - - take - - var - - all - - any - - copy - - diagonal - - flatten - - resize - - sum mint.ops: - abs @@ -958,6 +986,7 @@ mint.nn.functional: - one_hot_ext - pad - relu + - relu_ - sigmoid - silu - softmax @@ -992,3 +1021,7 @@ communication.comm_func: - broadcast - gather_into_tensor - scatter_tensor + - send + - recv + - isend + - irecv diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py index 6a8b4a505e34a44b5cd5caf13422dcae8eed8b46..0e97929ecd7f8444b19fd531efc49883d0df58de 100644 --- a/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -23,10 +23,16 @@ from mindspore.mint.nn import functional from msprobe.core.common.const import Const from msprobe.core.common.file_utils import load_yaml from msprobe.mindspore.common.const import Const as MsConst +from msprobe.mindspore.common.utils import is_mindtorch from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell +if is_mindtorch(): + import torch + import torch_npu + cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, MsConst.SUPPORTED_API_LIST_FILE) +torch_yaml_path = os.path.join(cur_path, "../../../pytorch/hook_module", MsConst.SUPPORTED_API_LIST_FILE) class HOOKTensor(object): @@ -53,6 +59,26 @@ class HOOKDistributedOP(object): pass +class HOOKTorchOP(object): + pass + + +class HOOKTorchTensor(object): + pass + + +class HOOKTorchFunctionalOP(object): + pass + + +class HOOKTorchDistributedOP(object): + pass + + +class HOOKTorchNpuOP(object): + pass + + class ApiTemplate(HOOKCell): def __init__(self, api_name, api_dict, prefix, hook): self.api_name = api_name @@ -60,7 +86,30 @@ class ApiTemplate(HOOKCell): self.prefix_api_name = prefix + str(api_name.split(Const.SEP)[-1]) + Const.SEP super().__init__(hook) + @staticmethod + def async_to_sync(output): + # Fake handle, used to return after the CommHandle executes the wait method + fake_handle = type("FakeHandle", (), {"wait": lambda self: None})() + if isinstance(output, tuple) and len(output) == 2 and hasattr(output[1], "wait"): + output[1].wait() + output = (output[0], fake_handle) + elif hasattr(output, "wait"): + output.wait() + output = fake_handle + return output + def construct(self, *args, **kwargs): + if self.api_name.startswith(MsConst.DROPOUT_API_NAME_PREFIX): + return args[0] if args else kwargs.get(Const.INPUT) + + output = self.api_func(*args, **kwargs) + + if self.prefix_api_name.startswith(MsConst.DISTRIBUTED_DATA_PREFIX): + if kwargs.get("async_op") or self.api_name in 
["isend", "irecv"]: + output = self.async_to_sync(output) + return output + + def forward(self, *args, **kwargs): if self.api_name.startswith(MsConst.DROPOUT_API_NAME_PREFIX): return args[0] if args else kwargs.get(Const.INPUT) return self.api_func(*args, **kwargs) @@ -77,6 +126,15 @@ class WrapApiName: self.distributed_api_names = distributed_api_names +class WrapTorchApiName: + def __init__(self, torch_api_names, tensor_api_names, functional_api_names, distributed_api_names, npu_api_names): + self.torch_api_names = torch_api_names + self.tensor_api_names = tensor_api_names + self.functional_api_names = functional_api_names + self.distributed_api_names = distributed_api_names + self.npu_api_names = npu_api_names + + def get_wrap_api_list(): api_list = load_yaml(yaml_path) tensor_api = api_list.get(MsConst.SUPPORTED_TENSOR_LIST_KEY) @@ -93,6 +151,21 @@ def get_wrap_api_list(): return wrap_api_name +def get_wrap_torch_api_list(): + api_list = load_yaml(torch_yaml_path) + torch_api = api_list.get("torch") + tensor_api = api_list.get("tensor") + functional_api = api_list.get("functional") + distributed_api = api_list.get("distributed") + npu_api = api_list.get("torch_npu") + wrap_api_name = WrapTorchApiName(set(torch_api) & set(dir(torch)), + set(tensor_api) & set(dir(torch.Tensor)), + set(functional_api) & set(dir(torch.nn.functional)), + set(distributed_api) & set(dir(torch.distributed)), + set(npu_api) & set(dir(torch_npu))) + return wrap_api_name + + def wrap_api_func(api_name, api_dict, prefix, hook): def api_function(*args, **kwargs): return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs) @@ -106,6 +179,24 @@ def wrap_api_func_and_bind(api_list, api_dict, prefix, hook, hook_class): def setup_hooks(hook): + if is_mindtorch(): + torch_wrap_api_name = get_wrap_torch_api_list() + wrap_api_func_and_bind(torch_wrap_api_name.torch_api_names, + {f: getattr(torch, f) for f in dir(torch)}, + MsConst.TORCH_DATA_PREFIX, hook, HOOKTorchOP) + 
wrap_api_func_and_bind(torch_wrap_api_name.tensor_api_names, + {f: getattr(torch.Tensor, f) for f in dir(torch.Tensor)}, + MsConst.TENSOR_DATA_PREFIX, hook, HOOKTorchTensor) + wrap_api_func_and_bind(torch_wrap_api_name.functional_api_names, + {f: getattr(torch.nn.functional, f) for f in dir(torch.nn.functional)}, + MsConst.OPS_DATA_PREFIX, hook, HOOKTorchFunctionalOP) + wrap_api_func_and_bind(torch_wrap_api_name.distributed_api_names, + {f: getattr(torch.distributed, f) for f in dir(torch.distributed)}, + MsConst.DISTRIBUTED_DATA_PREFIX, hook, HOOKTorchDistributedOP) + wrap_api_func_and_bind(torch_wrap_api_name.npu_api_names, {f: getattr(torch_npu, f) for f in dir(torch_npu)}, + MsConst.TORCH_NPU_DATA_PREFIX, hook, HOOKTorchNpuOP) + return + wrap_api_name = get_wrap_api_list() wrap_api_func_and_bind(wrap_api_name.tensor_api_names, {f: getattr(Tensor, f) for f in dir(Tensor)}, MsConst.TENSOR_DATA_PREFIX, hook, HOOKTensor) diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/jit_dump.py b/debug/accuracy_tools/msprobe/mindspore/dump/jit_dump.py index 5910b325e23fc2ba895f8493c27fb247a37481d1..0a32200639a1f3805f815c37caaef5d3bb64c82f 100644 --- a/debug/accuracy_tools/msprobe/mindspore/dump/jit_dump.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/jit_dump.py @@ -16,14 +16,15 @@ import os from collections import defaultdict -from mindspore import Tensor from mindspore._c_expression import PyNativeExecutor_ -from mindspore.common.api import _MindsporeFunctionExecutor +try: + from mindspore.common.api import _MindsporeFunctionExecutor +except ImportError: + from mindspore.common.api import _JitExecutor as _MindsporeFunctionExecutor from msprobe.core.common.log import logger -from msprobe.core.data_dump.data_processor.base import ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs from msprobe.core.common.const import Const -from msprobe.core.data_dump.data_processor.base import ModuleForwardInputsOutputs +from msprobe.core.data_dump.data_processor.base import 
ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs from msprobe.mindspore.dump.hook_cell.api_registry import api_register @@ -40,8 +41,8 @@ def dump_jit(name, in_feat, out_feat, is_forward): if JitDump.need_dump(): if is_forward: JitDump.jit_count[result] += 1 - name_template = Const.JIT + Const.SEP + result + Const.SEP + str(JitDump.jit_count[result]) + Const.SEP + \ - Const.FORWARD + name_template = (Const.JIT + Const.SEP + result + Const.SEP + + str(JitDump.jit_count[result]) + Const.SEP + Const.FORWARD) JitDump.data_collector.update_api_or_module_name(name_template) module_input_output = ModuleForwardInputsOutputs(args=in_feat, kwargs={}, output=out_feat) JitDump.data_collector.forward_data_collect(name_template, None, pid, module_input_output) @@ -67,7 +68,8 @@ class JitDump(_MindsporeFunctionExecutor): self._executor = PyNativeExecutor_.get_instance() def __call__(self, *args, **kwargs): - api_register.api_set_ori_func() + if JitDump.jit_dump_switch: + api_register.api_set_ori_func() out = super().__call__(*args, **kwargs) if JitDump.jit_dump_switch and len(args) > 0: if self.name and self.name != "construct": @@ -75,9 +77,10 @@ class JitDump(_MindsporeFunctionExecutor): else: dump_jit(args[0], args, out, True) JitDump.jit_enable = True - else: + elif len(args) == 0: logger.warning(f"The jit function {self.name} has no input arguments, nothing will be dumped.") - api_register.api_set_hook_func() + if JitDump.jit_dump_switch: + api_register.api_set_hook_func() return out @classmethod diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/kernel_dump/kernel_config.py b/debug/accuracy_tools/msprobe/mindspore/dump/kernel_dump/kernel_config.py new file mode 100644 index 0000000000000000000000000000000000000000..aff10d79dc8879e3a5a4053f8e61d9bddc225f71 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/dump/kernel_dump/kernel_config.py @@ -0,0 +1,33 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+from msprobe.core.common.file_utils import save_json
+
+
+def create_kernel_config_json(dump_path, cur_rank):
+    kernel_config_name = "kernel_config.json" if cur_rank == '' else f"kernel_config_{cur_rank}.json"
+    kernel_config_path = os.path.join(dump_path, kernel_config_name)
+    config_info = {
+        "dump": {
+            "dump_list": [],
+            "dump_path": dump_path,
+            "dump_mode": "all",
+            "dump_op_switch": "on"
+        }
+    }
+    save_json(kernel_config_path, config_info, indent=4)
+    return kernel_config_path
diff --git a/debug/accuracy_tools/msprobe/mindspore/dump/kernel_graph_dump.py b/debug/accuracy_tools/msprobe/mindspore/dump/kernel_graph_dump.py
index a9a48d5a878f4fdfe7879064b4086407ec988154..731c336cbe2f1350e7de9af34bfc0349305c0221 100644
--- a/debug/accuracy_tools/msprobe/mindspore/dump/kernel_graph_dump.py
+++ b/debug/accuracy_tools/msprobe/mindspore/dump/kernel_graph_dump.py
@@ -56,6 +56,13 @@ class KernelGraphDump:
             self.dump_json["common_dump_settings"]["input_output"] = 2
 
     def handle(self):
+        try:
+            from msprobe.lib import _msprobe_c
+            return
+        except ImportError:
+            # Fall back to the legacy MindSpore flow when _msprobe_c is not installed.
+            logger.info("Module _msprobe_c has not been installed, use interface in mindspore instead.")
+
         if os.getenv("GRAPH_OP_RUN") == "1":
             raise Exception("Must run in graph mode, not kbk mode")
         json_path = self.dump_json["common_dump_settings"]["path"]
diff --git
a/debug/accuracy_tools/msprobe/mindspore/dym_loader/hook_dynamic_loader.cc b/debug/accuracy_tools/msprobe/mindspore/dym_loader/hook_dynamic_loader.cc
new file mode 100644
index 0000000000000000000000000000000000000000..b72d68741da491fc450c2d697a3ebfec895a3447
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/dym_loader/hook_dynamic_loader.cc
@@ -0,0 +1,140 @@
+/**
+ * Copyright 2024 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "hook_dynamic_loader.h"
+#include <dlfcn.h>
+#include <sys/stat.h>
+#include <cstdlib>
+#include "utils/log_adapter.h"
+
+namespace {
+
+// Utility function to check if a file path is valid
+bool IsValidPath(const std::string &path) {
+  struct stat fileStat;
+  if (stat(path.c_str(), &fileStat) != 0) {
+    MS_LOG(ERROR) << "File does not exist or cannot be accessed: " << path;
+    return false;
+  }
+
+  if (S_ISLNK(fileStat.st_mode)) {
+    MS_LOG(ERROR) << "File is a symbolic link, which is not allowed: " << path;
+    return false;
+  }
+
+  if (!S_ISREG(fileStat.st_mode)) {
+    MS_LOG(ERROR) << "File is not a regular file: " << path;
+    return false;
+  }
+
+  if (path.substr(path.find_last_of(".")) != ".so") {
+    MS_LOG(ERROR) << "File is not a .so file: " << path;
+    return false;
+  }
+
+  return true;
+}
+
+}  // namespace
+
+HookDynamicLoader &HookDynamicLoader::GetInstance() {
+  static HookDynamicLoader instance;
+  return instance;
+}
+
+bool HookDynamicLoader::loadFunction(void *handle, const std::string &functionName) {
+  void *func = dlsym(handle, functionName.c_str());
+  if (!func) {
+    MS_LOG(WARNING) << "Could not load function: " << functionName << ", error: " << dlerror();
+    return false;
+  }
+  funcMap_[functionName] = func;
+  return true;
+}
+
+bool HookDynamicLoader::validateLibraryPath(const std::string &libPath) {
+  char *realPath = realpath(libPath.c_str(), nullptr);
+  if (!realPath) {
+    MS_LOG(WARNING) << "Failed to resolve realpath for the library: " << libPath;
+    return false;
+  }
+
+  bool isValid = IsValidPath(realPath);
+  free(realPath);  // Free memory allocated by realpath
+  return isValid;
+}
+
+bool HookDynamicLoader::LoadLibrary() {
+  const char *libPath = std::getenv("HOOK_TOOL_PATH");
+  if (!libPath) {
+    MS_LOG(WARNING) << "HOOK_TOOL_PATH is not set!";
+    return false;
+  }
+
+  std::string resolvedLibPath(libPath);
+  if (!validateLibraryPath(resolvedLibPath)) {
+    MS_LOG(WARNING) << "Library path validation failed.";
+    return false;
+  }
+
+  std::lock_guard<std::mutex> lock(mutex_);
+  if (handle_) {
+    MS_LOG(WARNING) << "Hook library already loaded!";
+    return false;
+  }
+
+  handle_ = dlopen(resolvedLibPath.c_str(), RTLD_LAZY | RTLD_LOCAL);
+  if (!handle_) {
+    MS_LOG(WARNING) << "Failed to load Hook library: " << dlerror();
+    return false;
+  }
+
+  for (const auto &functionName : functionList_) {
+    if (!loadFunction(handle_, functionName)) {
+      MS_LOG(WARNING) << "Failed to load function: " << functionName;
+      dlclose(handle_);
+      handle_ = nullptr;
+      return false;
+    }
+  }
+
+  MS_LOG(INFO) << "Hook library loaded successfully.";
+  return true;
+}
+
+bool HookDynamicLoader::UnloadLibrary() {
+  std::lock_guard<std::mutex> lock(mutex_);
+  if (!handle_) {
+    MS_LOG(WARNING) << "Hook library hasn't been loaded.";
+    return false;
+  }
+
+  dlclose(handle_);
+  handle_ = nullptr;
+  funcMap_.clear();
+  MS_LOG(INFO) << "Library unloaded successfully.";
+  return true;
+}
+
+void *HookDynamicLoader::GetHooker(const std::string &funcName) {
+  std::lock_guard<std::mutex> lock(mutex_);
+  auto iter = funcMap_.find(funcName);
+  if (iter == funcMap_.end()) {
+    MS_LOG(WARNING) << "Function not found: " << funcName;
+    return nullptr;
+  }
+  return iter->second;
+}
diff --git a/debug/accuracy_tools/msprobe/mindspore/dym_loader/hook_dynamic_loader.h b/debug/accuracy_tools/msprobe/mindspore/dym_loader/hook_dynamic_loader.h
new file mode 100644
index 0000000000000000000000000000000000000000..6309e60b662a03d7f77cb450986ded5329fd8960
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/dym_loader/hook_dynamic_loader.h
@@ -0,0 +1,53 @@
+/**
+ * Copyright 2024 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef HOOK_DYNAMIC_LOADER_H
+#define HOOK_DYNAMIC_LOADER_H
+
+#include <map>
+#include <mutex>
+#include <string>
+#include <vector>
+
+constexpr auto kHookBegin = "MS_DbgOnStepBegin";
+constexpr auto kHookEnd = "MS_DbgOnStepEnd";
+
+class HookDynamicLoader {
+ public:
+  static HookDynamicLoader &GetInstance();
+
+  HookDynamicLoader(const HookDynamicLoader &) = delete;
+  HookDynamicLoader &operator=(const HookDynamicLoader &) = delete;
+
+  bool LoadLibrary();
+  bool UnloadLibrary();
+  void *GetHooker(const std::string &funcName);
+
+ private:
+  // Helper functions
+  bool loadFunction(void *handle, const std::string &functionName);
+  bool validateLibraryPath(const std::string &libPath);
+
+  HookDynamicLoader() = default;
+
+  void *handle_ = nullptr;
+  std::vector<std::string> functionList_ = {kHookBegin, kHookEnd};
+  std::map<std::string, void *> funcMap_;
+  std::mutex mutex_;
+};
+
+#endif  // HOOK_DYNAMIC_LOADER_H
diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/api_pynative_self_check.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/api_pynative_self_check.py
index 9a501b1457216969bc481225520a081d588ea1b6..57b7de4fa567d73a19178256d79f5e4cbeb38864 100644
--- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/api_pynative_self_check.py
+++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/api_pynative_self_check.py
@@ -1,7 +1,7 @@
 # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
 # All rights reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License"); 
+# Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
@@ -13,24 +13,31 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import functools import importlib -import inspect import os +import traceback import mindspore as ms -from mindspore.communication import comm_func - from msprobe.core.common.const import Const +from msprobe.core.common.exceptions import DistributedNotInitializedError from msprobe.core.common.file_utils import check_path_length, load_yaml from msprobe.mindspore.common.const import Const as MsConst from msprobe.mindspore.common.const import FreeBenchmarkConst from msprobe.mindspore.common.log import logger +from msprobe.mindspore.common.utils import get_rank_if_initialized from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.hook_cell.api_registry import api_register +from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell from msprobe.mindspore.free_benchmark.common.config import Config -from msprobe.mindspore.free_benchmark.decorator.decorator_factory import decorate_forward_function +from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.common.utils import Tools +from msprobe.mindspore.free_benchmark.handler.handler_factory import HandlerFactory +from msprobe.mindspore.free_benchmark.perturbation.perturbation_factory import PerturbationFactory +from msprobe.mindspore.runtime import Runtime -class ApiPyNativeSelFCheck: +class ApiPyNativeSelfCheck: def __init__(self, config: DebuggerConfig): Config.is_enable = True Config.handler_type = config.handler_type @@ -39,29 +46,77 @@ class ApiPyNativeSelFCheck: Config.dump_level = config.dump_level Config.steps = config.step Config.ranks = config.rank - Config.dump_path = os.path.join(config.dump_path, "free_benchmark.csv") + Config.dump_path = os.path.join(config.dump_path, FreeBenchmarkConst.CHECK_RESULT_FILE) check_path_length(Config.dump_path) + self.ori_func = {} + self.api_list = config.list all_api = get_supported_ops() if not self.api_list: self.api_list = all_api else: self.api_list = 
set(self.api_list) & all_api + self.store_original_func() def handle(self): + api_register.initialize_hook(self.build_hook) + api_register.api_set_hook_func() + + def build_hook(self, api_name): + def pre_hook(cell, input_data): + return None + + def forward_hook(api_name_with_id, cell, input_data, output_data): + ret = None + + if not need_wrapper_func(): + del cell.input_kwargs + return ret + + api_name_with_id = api_name_with_id[:-1] + hook_prefix = api_name_with_id[:api_name_with_id.find(Const.SEP) + 1] + api_name = (MsConst.HOOK_MS_PREFIX_DICT.get(hook_prefix, "") + + api_name_with_id[api_name_with_id.find(Const.SEP) + 1:api_name_with_id.rfind(Const.SEP)]) + if api_name in self.api_list: + ret = check_self(api_name_with_id, output_data, self.ori_func.get(api_name), + *input_data, **cell.input_kwargs) + + del cell.input_kwargs + return ret + + def backward_hook(cell, grad_input, grad_output): + pass + + HOOKCell.get_cell_count(api_name) + api_name_with_id = api_name + str(HOOKCell.get_cell_count(api_name)) + Const.SEP + forward_hook = functools.partial(forward_hook, api_name_with_id) + HOOKCell.add_cell_count(api_name) + + def wrap_forward_hook(cell, input_data, output_data): + return forward_hook(cell, input_data, output_data) + + def wrap_backward_hook(cell, grad_input, grad_output): + return backward_hook(cell, grad_input, grad_output) + + def pre_backward_hook(cell, grad_input): + return None + + return pre_hook, wrap_forward_hook, wrap_backward_hook, pre_backward_hook + + def store_original_func(self): for api_name in self.api_list: - hijack(api_name) + self.ori_func[api_name] = get_module(api_name)[1] def get_supported_ops(): supported_ops = [] cur_path = os.path.dirname(os.path.realpath(__file__)) - yaml_path = os.path.join(cur_path, "data", "support_wrap_ops.yaml") + yaml_path = os.path.join(cur_path, "data", FreeBenchmarkConst.SUPPORTED_CHECK_API_FILE) - yaml_data = load_yaml(yaml_path) + supported_ops_list = load_yaml(yaml_path) for k, v in 
FreeBenchmarkConst.API_PREFIX_DICT.items(): - ops = yaml_data.get(k) + ops = supported_ops_list.get(k) if ops: ops = [v + i for i in ops] supported_ops += ops @@ -72,7 +127,7 @@ def get_supported_ops(): _all_functional_ops += ms_ops ms_tensor = dir(ms.Tensor) - ms_tensor = [MsConst.Tensor_PREFIX + i for i in ms_tensor] + ms_tensor = [MsConst.TENSOR_PREFIX + i for i in ms_tensor] _all_functional_ops += ms_tensor ms_mint = dir(ms.mint) @@ -83,49 +138,109 @@ def get_supported_ops(): ms_mint_nn_func = [MsConst.MINT_NN_FUNC_PREFIX + i for i in ms_mint_nn_func] _all_functional_ops += ms_mint_nn_func - ms_communication = dir(comm_func) - ms_communication = [MsConst.COMM_PREFIX + i for i in ms_communication] - _all_functional_ops += ms_communication - return set(supported_ops) & set(_all_functional_ops) -def get_decorate_func(): - return decorate_forward_function - - -def is_func_support_decorate(orig_func): - return not inspect.isclass(orig_func) and callable(orig_func) - - -def get_wrapper_obj(orig_func, api_name): - if is_func_support_decorate(orig_func): - wrapped_obj = get_decorate_func()(orig_func, api_name) - else: - wrapped_obj = orig_func - return wrapped_obj - - def get_module(api_name): func_name_list = api_name.split(Const.SEP) func_name = func_name_list[-1] module_obj = importlib.import_module(func_name_list[0]) for i, module_name in enumerate(func_name_list[1:-1]): if not hasattr(module_obj, module_name): - importlib.import_module(f"{Const.SEP.join(func_name_list[:i+2])}") + importlib.import_module(f"{Const.SEP.join(func_name_list[:i + 2])}") module_obj = getattr(module_obj, module_name) orig_func = getattr(module_obj, func_name) return module_obj, orig_func -def hijack(api_name): - if not api_name.strip(): - return +def check_self(api_name_with_id, output, ori_func, *args, **kwargs): + ret = None + + if Config.stage == Const.BACKWARD and not (check_all_tensor(args) and check_all_tensor(output)): + logger.warning(f"{api_name_with_id} has non-tensor input or 
output.") + return ret + + params = data_pre_deal(api_name_with_id, ori_func, *args, **kwargs) + if params.index == -1: + return ret + + logger.info(f"[{api_name_with_id}] is {Config.handler_type}ing.") + api_register.api_set_ori_func() + try: - func_name = api_name.split(Const.SEP)[-1] - module_obj, origin_func = get_module(api_name) - wrapped_obj = get_wrapper_obj(origin_func, api_name) - setattr(module_obj, func_name, wrapped_obj) + perturbation = PerturbationFactory.create(api_name_with_id) + params.fuzzed_result = perturbation.handle(params) + if params.fuzzed_result is False: + api_register.api_set_hook_func() + return ret + if Config.stage == Const.BACKWARD: + params.original_result = Tools.get_grad(params.original_func, *params.args, **params.kwargs) + else: + params.original_result = output + ret = deal_fuzzed_and_original_result(api_name_with_id, params) except Exception as e: - logger.error(f"Failed decorator {api_name}: {e}") + logger.error(f"[{api_name_with_id}] Error: {str(e)}") + logger.error(f"[{api_name_with_id}] Error detail: {traceback.format_exc()}") + + api_register.api_set_hook_func() + return ret + + +def check_all_tensor(input_output): + if isinstance(input_output, ms.Tensor): + return True + if isinstance(input_output, (tuple, list)): + return all([check_all_tensor(v) for v in input_output]) + return False + + +def get_target_arg_index(args) -> int: + """ + 类型校验 + + """ + for i, arg in enumerate(args): + if ms.ops.is_tensor(arg): + if not ms.ops.is_floating_point(arg): + continue + return i + if isinstance(arg, (list, tuple, dict)): + return i + return -1 + + +def data_pre_deal(api_name_with_id, func, *args, **kwargs): + params = HandlerParams() + params.args = args + params.kwargs = kwargs + params.original_func = func + index = get_target_arg_index(args) + if index == -1: + logger.warning(f"{api_name_with_id} has no supported input type.") + params.index = index + return params + + +def need_wrapper_func(): + if not (Runtime.is_running 
and Config.is_enable): + return False + + if Config.steps and Runtime.step_count not in Config.steps: + return False + + if Runtime.rank_id == -1: + try: + Runtime.rank_id = get_rank_if_initialized() + except DistributedNotInitializedError: + Runtime.rank_id = -1 + if Config.ranks and Runtime.rank_id != -1 and Runtime.rank_id not in Config.ranks: + return False + + return True + + +def deal_fuzzed_and_original_result(api_name_with_id, params: HandlerParams): + handler = HandlerFactory.create(api_name_with_id) + result = handler.handle(params) + return result diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/handler_params.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/handler_params.py index fdb2b1ffc41909c4f908f7464c830320954b233e..1378b69c1c87dfa86af43508570edb557a7f045c 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/handler_params.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/handler_params.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -27,6 +27,5 @@ class HandlerParams: original_result: Optional[Any] = None fuzzed_result: Optional[Any] = None is_consistent: Optional[bool] = True - save_flag: Optional[bool] = True fuzzed_value: Optional[Any] = None original_func: Optional[Callable] = None diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/utils.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/utils.py index 24581181630b13a0bf14500564c12f4e74929299..14a72a5e6b6a6289595897a15c46a0e6397bcd1a 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/utils.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/common/utils.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -17,7 +17,7 @@ from dataclasses import dataclass from typing import Any, Optional import mindspore as ms -from mindspore import Tensor +from mindspore import Tensor, ops from msprobe.mindspore.common.const import FreeBenchmarkConst from msprobe.mindspore.free_benchmark.common.config import Config @@ -43,6 +43,23 @@ class Tools: return FreeBenchmarkConst.NO_CHANGE_ERROR_THRESHOLD return FreeBenchmarkConst.ERROR_THRESHOLD.get(dtype, FreeBenchmarkConst.ERROR_THRESHOLD.get(ms.float32)) + @staticmethod + def get_grad_out(outputs): + if isinstance(outputs, Tensor): + return ops.ones_like(outputs) + if isinstance(outputs, (tuple, list)): + return type(outputs)([Tools.get_grad_out(v) for v in outputs]) + return outputs + + @staticmethod + def get_grad(func, *args, **kwargs): + def target_func(*inputs): + return func(*inputs, **kwargs) + + outputs, vjp_fn = ms.vjp(target_func, *args) + values = Tools.get_grad_out(outputs) + return vjp_fn(values) + @dataclass class UnequalRow: @@ -73,10 +90,8 @@ def make_unequal_row( if isinstance(ratio, float): row.max_rel = ratio - 1 original_tensor = params.original_result - fuzzed_tensor = params.fuzzed_result if index is not None: original_tensor = original_tensor[index] - fuzzed_tensor = fuzzed_tensor[index] row.output_index = index if isinstance(original_tensor, Tensor): row.dtype = original_tensor.dtype diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/data/support_wrap_ops.yaml b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/data/support_wrap_ops.yaml index cc802d38142fd3d68aad03ee75abbbe77ce7eb35..36cb23767e1ccf8034431cec048dad894369e7ca 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/data/support_wrap_ops.yaml +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/data/support_wrap_ops.yaml @@ -1,11 +1,4 @@ # List of apis that support self check - -communication: - - all_gather_into_tensor - - gather_into_tensor - - all_reduce - - 
reduce - - reduce_scatter_tensor ops: - adaptive_avg_pool1d @@ -18,18 +11,10 @@ ops: - avg_pool3d - batch_norm - bias_add - - ctc_greedy_decoder - conv1d - conv2d - conv3d - deformable_conv2d - - dense - - dropout - - dropout1d - - dropout2d - - dropout3d - - flatten - - fold - fractional_max_pool3d - lp_pool1d - lp_pool2d @@ -39,7 +24,6 @@ ops: - max_unpool1d - max_unpool2d - max_unpool3d - - unfold - binary_cross_entropy - binary_cross_entropy_with_logits - cosine_embedding_loss @@ -105,8 +89,6 @@ ops: - pixel_shuffle - pixel_unshuffle - upsample - - abs - - absolute - accumulate_n - acos - arccos @@ -143,16 +125,9 @@ ops: - bessel_k1e - bessel_y0 - bessel_y1 - - bitwise_and - - bitwise_left_shift - - bitwise_or - - bitwise_right_shift - - bitwise_xor - ceil - clamp - clip - - combinations - - copysign - cos - cosh - cosine_similarity @@ -200,12 +175,8 @@ ops: - mul - multiply - mvlgamma - - neg - - negative - - nextafter - polar - polygamma - - positive - pow - rad2deg - ravel @@ -225,7 +196,6 @@ ops: - square - sub - subtract - - t - tan - tanhshrink - trapz @@ -238,11 +208,9 @@ ops: - xdivy - xlogy - zeta - - all - amax - amin - aminmax - - any - argmax - argmin - cummax @@ -264,28 +232,10 @@ ops: - var_mean - argsort - approximate_equal - - equal - - ge - - greater - - greater_equal - - gt - intopk - - isclose - - isfinite - - isinf - - isnan - - isneginf - - isposinf - - isreal - - le - - less - - less_equal - - lt - maximum - minimum - msort - - ne - - not_equal - searchsorted - topk - bmm @@ -329,30 +279,12 @@ ops: - hamming_window - hann_window - kaiser_window - - eye - - fill - - full - - full_like - - linspace - - logspace - - one_hot - - arange - - range - heaviside - bernoulli - gamma - laplace - multinomial - multinomial_with_replacement - - rand - - rand_like - - randint - - randint_like - - randn - - randn_like - - random_gamma - - random_poisson - - randperm - standard_laplace - standard_normal - uniform @@ -361,14 +293,10 @@ ops: - bincount - 
block_diag - broadcast_to - - cat - channel_shuffle - - chunk - column_stack - - concat - conj - count_nonzero - - deepcopy - diag - diagflat - diagonal @@ -395,49 +323,22 @@ ops: - nan_to_num - nansum - normal - - nonzero - population_count - - rank - - repeat_elements - - repeat_interleave - - reshape - - reverse - - reverse_sequence - - roll - - select - sequence_mask - - shuffle - - size - - slice - - sort - space_to_batch_nd - sparse_segment_mean - - split - - squeeze - - stack - - strided_slice - sum - swapaxes - swapdims - - tensor_split - - tile - tril - triu - - transpose - unbind - - unique - - unique_consecutive - - unique_with_pad - unsorted_segment_max - unsorted_segment_min - unsorted_segment_prod - unsorted_segment_sum - - unsqueeze - - unstack - - view_as_real - vsplit - vstack - - where - cross - renorm - tuple_to_array @@ -447,7 +348,6 @@ ops: - jet Tensor: - - __abs__ - __add__ - __and__ - __iadd__ @@ -459,8 +359,6 @@ Tensor: - __matmul__ - __mod__ - __mul__ - - __neg__ - - __or__ - __pow__ - __radd__ - __rmatmul__ @@ -471,8 +369,6 @@ Tensor: - __sub__ - __truediv__ - __xor__ - - abs - - absolute - acos - acosh - add @@ -504,18 +400,11 @@ Tensor: - baddbmm - bernoulli - bincount - - bitwise_and - - bitwise_or - - bitwise_xor - bmm - broadcast_to - - ceil - cholesky_solve - cholesky - - clamp - - clip - conj - - copysign - cos - cosh - cross @@ -530,7 +419,6 @@ Tensor: - digamma - div - divide - - equal - erf - erfc - erfinv @@ -541,14 +429,11 @@ Tensor: - fliplr - flipud - float_power - - floor - fmod - frac - gather_elements - geqrf - ger - - greater - - greater_equal - half - hardshrink - heaviside @@ -559,13 +444,7 @@ Tensor: - igammac - imag - index_add - - index_fill - - index_put - - index_select - inner - - int - - inverse - - item - lcm - ldexp - lerp @@ -587,30 +466,17 @@ Tensor: - masked_scatter - masked_select - matmul - - max - - maximum - mean - median - - min - - minimum - moveaxis - movedim - - msort - multinomial - multiply - 
mvlgamma - - nan_to_num - nansum - narrow - - neg - - negative - nelement - - new_ones - - new_zeros - - nextafter - norm - - nonzero - - not_equal - ormqr - permute - pow @@ -622,10 +488,6 @@ Tensor: - remainder - renorm - rad2deg - - tile - - repeat_interleave - - reshape - - reshape - round - rot90 - rsqrt @@ -641,73 +503,44 @@ Tensor: - sinh - slogdet - sort - - split - sqrt - square - - squeeze - std - subtract - subtract - svd - swapaxes - swapdims - - t - - take - tan - tanh - trace - swapaxes - - tile - topk - tril - - tensor_split - - transpose - true_divide - trunc - unbind - unique_consecutive - - unsqueeze - var - - view - - where - xlogy - from_numpy - std - - take - var - - all - - any - - copy - diagonal - - flatten - - resize - sum mint: - - abs - - absolute_import - add - add_ex - - all - - any - any_ex - - arange - argmax - avg_pool2d - baddbmm - baddbmm_ex - batch_norm - binary_cross_entropy_with_logits - - bitwise_and - - bitwise_or - - bitwise_xor - bmm - broadcast_to - - cat - - cat_ex - - ceil - - chunk - - clamp - conv2d - conv_transpose2d - cos @@ -717,59 +550,32 @@ mint: - cumsum - div - divide - - dropout - embedding - - eq - erf - erfinv - exp - - flatten - - flip - - flip_ex - - fold - - full - gather - gelu - greater - grid_sample - group_norm - - gt - index_select - interpolate - - isclose - - isfinite - layer_norm - - le - leaky_relu - - less - - less_equal - linear - linspace - log - logical_and - logical_not - logical_or - - lt - masked_select - matmul - - max - max_pool2d - - maximum - mean - mean_ex - - min - - minimum - mul - - ne - - neg - - negative - - nonzero - normal - - one_hot - - ones - - ones_ex - - ones_like - - pad - permute - permute_ex - pow @@ -786,7 +592,6 @@ mint: - softmax - softplus - sort - - split - sqrt - sqrt_ex - square @@ -795,17 +600,9 @@ mint: - sub_ex - sum - tanh - - tile - topk - - tril - triu - - unfold - - unique - - where - xlogy - - zeros - - zeros_ex - - zeros_like mint.nn.functional: - 
absolute_import @@ -816,7 +613,6 @@ mint.nn.functional: - binary_cross_entropy_with_logits - conv_transpose2d - dense - - dropout - embedding - fold - gelu diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/dec_forward.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/dec_forward.py deleted file mode 100644 index ae295b2734fcf91b4a5ec159a8b7ccebb178f5a5..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/dec_forward.py +++ /dev/null @@ -1,57 +0,0 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-
-from msprobe.mindspore.common.const import Const, FreeBenchmarkConst
-from msprobe.mindspore.free_benchmark.common.config import Config
-from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams
-from msprobe.mindspore.free_benchmark.handler.handler_factory import HandlerFactory
-from msprobe.mindspore.free_benchmark.perturbation.perturbation_factory import PerturbationFactory
-
-
-class ForwardSelfChecker:
-
-    def __init__(self, api_name: str):
-        self.api_name = api_name
-
-    def handle(self, params: HandlerParams):
-        """
-        Actual execution logic of the decorator.
-
-        """
-        perturbation = PerturbationFactory.create(self.api_name)
-        params.fuzzed_result = perturbation.handle(params)
-        params.original_result = params.original_func(*params.args, **params.kwargs)
-        if params.fuzzed_result is not False:
-            return self.deal_fuzzed_and_original_result(params)
-        return params.original_result
-
-    def get_compare_data(self, params: HandlerParams):
-        if self.api_name not in Const.COMMUNICATION_API_LIST:
-            return
-        # The following is the handling logic for communication APIs
-        params.fuzzed_result = params.fuzzed_value
-        if Config.pert_type == FreeBenchmarkConst.IMPROVE_PRECISION:
-            params.original_result = params.args
-        else:
-            params.original_result = params.args[params.index]
-
-    def deal_fuzzed_and_original_result(self, params: HandlerParams):
-        original_result = params.original_result
-        self.get_compare_data(params)
-        handler = HandlerFactory.create(self.api_name)
-        result = handler.handle(params)
-        if self.api_name in Const.COMMUNICATION_API_LIST:
-            result = original_result
-        return result
diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/decorator_factory.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/decorator_factory.py
deleted file mode 100644
index 5c70682e80e7e7e95587830a565abc12d42f40a5..0000000000000000000000000000000000000000
--- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/decorator/decorator_factory.py
+++ /dev/null
@@ -1,122 +0,0 @@
-# Copyright (c) 
2024-2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License"); 
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import sys
-import traceback
-from functools import wraps
-from typing import Dict, List, Tuple
-
-from mindspore import ops
-
-from msprobe.mindspore.common.log import logger
-from msprobe.mindspore.free_benchmark.common.config import Config
-from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams
-from msprobe.mindspore.free_benchmark.decorator.dec_forward import ForwardSelfChecker
-from msprobe.mindspore.runtime import Runtime
-
-
-def decorate(original_func, decorate_func, api_name=None):
-    """
-    Top-level decorator.
-    """
-    @wraps(original_func)
-    def fuzz_wrapper(*args, **kwargs):
-
-        def __exec_decorate_func():
-            params = data_pre_deal(api_name, original_func, *args, **kwargs)
-            result = decorate_func(params)
-            return result
-
-        try:
-            if Runtime.rank_id == -1:
-                Runtime.rank_id = os.environ.get("RANK_ID", -1)
-            if need_wrapper_func():
-                logger.info(f"[{api_name}] is checking.")
-                return __exec_decorate_func()
-        except Exception as e:
-            logger.error(f"[{api_name}] Error: {str(e)}")
-            logger.error(f"[{api_name}] Error detail: {traceback.format_exc()}")
-
-        return original_func(*args, **kwargs)
-
-    return fuzz_wrapper
-
-
-def decorate_forward_function(func, api_name=None):
-    """
-    Forward decorator.
-    """
-
-    if not api_name:
-        api_name = func.__name__
-
-    def forward_func(params: HandlerParams):
-        forward = ForwardSelfChecker(api_name)
-        
result = forward.handle(params)
-        return result
-
-    return decorate(func, forward_func, api_name)
-
-
-def stack_depth_check() -> bool:
-    nested_depth = 1
-    frame = sys._getframe(1)
-    while frame:
-        if frame.f_code.co_name == "fuzz_wrapper":
-            nested_depth -= 1
-            if nested_depth < 0:
-                return False
-        frame = frame.f_back
-    return True
-
-
-def get_target_arg_index(args: Tuple) -> int:
-    """
-    Type validation.
-
-    """
-    for i, arg in enumerate(args):
-        if ops.is_tensor(arg):
-            if not ops.is_floating_point(arg):
-                continue
-            return i
-        if isinstance(arg, (List, Tuple, Dict)):
-            return i
-    return -1
-
-
-def data_pre_deal(api_name, func, *args, **kwargs):
-    params = HandlerParams()
-    params.args = args
-    params.kwargs = kwargs
-    params.original_func = func
-    index = get_target_arg_index(args)
-    if index == -1:
-        raise Exception(f"{api_name} has no supported input type")
-    params.index = index
-    return params
-
-
-def need_wrapper_func():
-    if not (Runtime.is_running and Config.is_enable):
-        return False
-    if not stack_depth_check():
-        return False
-    if Config.steps and Runtime.step_count not in Config.steps:
-        return False
-    if Config.ranks and Runtime.rank_id != -1 and Runtime.rank_id not in Config.ranks:
-        return False
-    return True
diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/base_handler.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/base_handler.py
index 9e0638736a393fca671e15fde3b54b7ad2392cfa..6689a9ae035c68c6177fb41abbc730bf3d325184 100644
--- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/base_handler.py
+++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/base_handler.py
@@ -1,7 +1,7 @@
 # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
 # All rights reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License"); 
+# Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -28,8 +28,8 @@ from msprobe.mindspore.free_benchmark.common.utils import Tools class BaseHandler(ABC): - def __init__(self, api_name: str): - self.api_name = api_name + def __init__(self, api_name_with_id: str): + self.api_name_with_id = api_name_with_id @staticmethod def pre_calculate(original_output, fuzzed_output): diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/check_handler.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/check_handler.py index aa059ef0d4079962d1fe5d0ca540550782e6ed62..66946ba5d91b56502d752656f7ec47bbfdb89756 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/check_handler.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/check_handler.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -32,19 +32,19 @@ class CheckHandler(BaseHandler): is_consistent, ratio = self.npu_compare(original_output, fuzzed_output) params.is_consistent = params.is_consistent and is_consistent if not is_consistent: - row = make_unequal_row(self.api_name, params, ratio, output_index) + row = make_unequal_row(self.api_name_with_id, params, ratio, output_index) data_dict = asdict(row) DataWriter.write_data_to_csv( data_dict.values(), data_dict.keys(), Config.dump_path ) - logger.error(f"{self.api_name} is not consistent") + logger.error(f"{self.api_name_with_id} is not consistent") def handle(self, params: HandlerParams) -> Any: try: if not self.is_float_tensor(params.fuzzed_result): - return params.original_result + return if isinstance(params.fuzzed_result, Tensor): self.npu_compare_and_save(params.original_result, params.fuzzed_result, params) elif isinstance(params.fuzzed_result, (list, tuple)): @@ -53,4 +53,3 @@ class CheckHandler(BaseHandler): self.npu_compare_and_save(item, params.fuzzed_result[i], params, output_index=i) except Exception as e: logger.error(str(e)) - return params.original_result diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/fix_handler.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/fix_handler.py index 5961ae9e3843e612ceb7850afb3d360fa028c578..12686a5109c0456945950351ff91c36b4da42237 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/fix_handler.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/fix_handler.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -23,8 +23,8 @@ from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams class FixHandler: - def __init__(self, api_name: str): - self.api_name = api_name + def __init__(self, api_name_with_id: str): + self.api_name_with_id = api_name_with_id @staticmethod def use_fuzzed_result(original_result, fuzzed_result): @@ -46,6 +46,6 @@ class FixHandler: try: return FixHandler.use_fuzzed_result(params.original_result, params.fuzzed_result) except Exception as e: - logger.error(f"{self.api_name} failed to fix.") + logger.error(f"{self.api_name_with_id} failed to fix.") logger.error(str(e)) return params.original_result diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/handler_factory.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/handler_factory.py index 4bfb153d69a7c3f593757395e752a070c4a7d999..998bac18593ac3c522b4244512a91a9ecdd9bc7d 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/handler_factory.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/handler/handler_factory.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -23,14 +23,14 @@ from msprobe.mindspore.free_benchmark.handler.fix_handler import FixHandler class HandlerFactory: result_handlers = { FreeBenchmarkConst.CHECK: CheckHandler, - FreeBenchmarkConst.FIX: FixHandler, + FreeBenchmarkConst.FIX: FixHandler } @staticmethod - def create(api_name: str): + def create(api_name_with_id: str): handler = HandlerFactory.result_handlers.get(Config.handler_type) if handler: - return handler(api_name) + return handler(api_name_with_id) else: logger.error(f"{Config.handler_type} is not supported.") raise Exception diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/add_noise.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/add_noise.py index 7b491ec832ce0f42030abe614928445187d97745..e955aee589374d31fcf99a2cdad80092169a3610 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/add_noise.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/add_noise.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -32,7 +32,7 @@ class AddNoisePerturbation(BasePerturbation): """ params.fuzzed_value = self.add_noise(params.args[params.index]) if not self.is_fuzzed: - logger.warning(f"{self.api_name} can not add noise.") + logger.warning(f"{self.api_name_with_id} can not add noise.") return False return self.get_fuzzed_result(params) diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/base_perturbation.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/base_perturbation.py index 7e4937a5e0e5da27499175091b6773a2c4f3010e..c333fd82f8303426d93420a3efaa2450a2bcc240 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/base_perturbation.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/base_perturbation.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -15,21 +15,30 @@ from typing import Any +from msprobe.core.common.const import Const +from msprobe.mindspore.free_benchmark.common.config import Config from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.common.utils import Tools class BasePerturbation: - def __init__(self, api_name: str): - self.api_name = api_name + def __init__(self, api_name_with_id: str): + self.api_name_with_id = api_name_with_id self.is_fuzzed = False self.perturbation_value = None @staticmethod def get_fuzzed_result(params: HandlerParams): - args_front = params.args[:params.index] - args_rear = params.args[params.index + 1:] - fuzzed_result = params.original_func(*args_front, params.fuzzed_value, *args_rear, **params.kwargs) + if Config.stage == Const.BACKWARD: + fuzzed_result = Tools.get_grad(params.original_func, *params.args[:params.index], + params.fuzzed_value, *params.args[params.index + 1:], **params.kwargs) + + if fuzzed_result is None: + return False + else: + fuzzed_result = params.original_func(*params.args[:params.index], params.fuzzed_value, + *params.args[params.index + 1:], **params.kwargs) return fuzzed_result def handler(self, params: HandlerParams) -> Any: diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/bit_noise.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/bit_noise.py index fb033854b97f1e390355d05fd5db8a107fd7ec0e..8c3a6d6f3b6944811bd3878503e817b8578e3823 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/bit_noise.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/bit_noise.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. 
# -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -35,12 +35,12 @@ class BitNoisePerturbation(BasePerturbation): noise_type = list(FreeBenchmarkConst.MS_NUMPY_DTYPE_DICT.keys())[ list(FreeBenchmarkConst.MS_NUMPY_DTYPE_DICT.values()).index(bit_len_type)] noise = ops.full(inputs.shape, 1, dtype=noise_type) - input_np = inputs.contiguous().asnumpy() + input_np = inputs.asnumpy() input_np_int = input_np.view(bit_len_type) result = Tensor(input_np_int) result = ops.where(ops.abs(inputs) > sub_normal, ops.bitwise_xor(result, noise), result) - result_np = result.contiguous().asnumpy() + result_np = result.asnumpy() result_np_float = result_np.view(FreeBenchmarkConst.MS_NUMPY_DTYPE_DICT.get(inputs.dtype)) self.is_fuzzed = True return Tensor(result_np_float) @@ -55,7 +55,7 @@ class BitNoisePerturbation(BasePerturbation): args = params.args params.fuzzed_value = self.add_bit_noise(params.args[params.index]) if not self.is_fuzzed: - logger.warning(f"{self.api_name} can not add bit noise.") + logger.warning(f"{self.api_name_with_id} can not add bit noise.") return False params.args = args return self.get_fuzzed_result(params) diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/exchange_value.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/exchange_value.py index fe2dea766922710dc2dd1a2ea7e688f0b0004204..0c73b8256542f80167cb5fd7d2412e7c2493bf2e 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/exchange_value.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/exchange_value.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. 
# -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -41,7 +41,7 @@ class ExchangeValuePerturbation(BasePerturbation): """ params.fuzzed_value = self.exchange_value(params.args[params.index]) if not self.is_fuzzed: - logger.warning(f"{self.api_name} can not exchange value.") + logger.warning(f"{self.api_name_with_id} can not exchange value.") return False return self.get_fuzzed_result(params) diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/improve_precision.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/improve_precision.py index 12d1c55df02113542e6addcffe9ae49a7e732671..04f15c6604b212b7cf3d2ec5168d53cef6abdf04 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/improve_precision.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/improve_precision.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -18,9 +18,11 @@ from typing import Any import mindspore as ms from mindspore import Tensor, ops -from msprobe.mindspore.common.const import Const +from msprobe.core.common.const import Const from msprobe.mindspore.common.log import logger +from msprobe.mindspore.free_benchmark.common.config import Config from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.common.utils import Tools from msprobe.mindspore.free_benchmark.perturbation.base_perturbation import BasePerturbation @@ -40,10 +42,15 @@ class ImprovePrecisionPerturbation(BasePerturbation): def handle(self, params: HandlerParams) -> Any: args = self.improve_tensor_precision(params.args) kwargs = self.improve_tensor_precision(params.kwargs) - fuzzed_value = args - if self.api_name in Const.COMMUNICATION_API_LIST: - params.fuzzed_value = fuzzed_value if not self.is_fuzzed: - logger.warning(f"{self.api_name} can not improve precision.") + logger.warning(f"{self.api_name_with_id} can not improve precision.") return False + + if Config.stage == Const.BACKWARD: + fuzzed_result = Tools.get_grad(params.original_func, *args, **kwargs) + if fuzzed_result is not None: + return fuzzed_result + else: + return False + return params.original_func(*args, **kwargs) diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/perturbation_factory.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/perturbation_factory.py index 703082a685f57398dd5d8222f902d83a6e38d3f2..3fd1430bff792d5043429caac8fe477e457b8bee 100644 --- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/perturbation_factory.py +++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/perturbation/perturbation_factory.py @@ -36,9 +36,9 @@ class PerturbationFactory: } @staticmethod - def create(api_name: str): + def create(api_name_with_id: str): perturbation = 
PerturbationFactory.perturbations.get(Config.pert_type)
         if perturbation:
-            return perturbation(api_name)
+            return perturbation(api_name_with_id)
         else:
             raise Exception(f'{Config.pert_type} is an invalid perturbation type')
diff --git a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/self_check_tool_factory.py b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/self_check_tool_factory.py
index 85d77aa3d6b3924b7be46785c778eacf6f8f9611..35b5eb2ab65511fa4320dc97702a60a9c8d07f62 100644
--- a/debug/accuracy_tools/msprobe/mindspore/free_benchmark/self_check_tool_factory.py
+++ b/debug/accuracy_tools/msprobe/mindspore/free_benchmark/self_check_tool_factory.py
@@ -15,7 +15,7 @@
 
 from msprobe.mindspore.common.const import Const
 from msprobe.mindspore.debugger.debugger_config import DebuggerConfig
-from msprobe.mindspore.free_benchmark.api_pynative_self_check import ApiPyNativeSelFCheck
+from msprobe.mindspore.free_benchmark.api_pynative_self_check import ApiPyNativeSelfCheck
 
 
 class SelfCheckToolFactory:
@@ -28,7 +28,7 @@
         Const.API: {
             Const.GRAPH_KBYK_MODE: None,
             Const.GRAPH_GE_MODE: None,
-            Const.PYNATIVE_MODE: ApiPyNativeSelFCheck
+            Const.PYNATIVE_MODE: ApiPyNativeSelfCheck
         },
         Const.KERNEL: {
             Const.GRAPH_KBYK_MODE: None,
diff --git a/debug/accuracy_tools/msprobe/mindspore/grad_probe/global_context.py b/debug/accuracy_tools/msprobe/mindspore/grad_probe/global_context.py
index c16853e11dd04fa53bcff576bf435a39b77934ea..01e46e019a4d1634a4592970386d855637c34e8f 100644
--- a/debug/accuracy_tools/msprobe/mindspore/grad_probe/global_context.py
+++ b/debug/accuracy_tools/msprobe/mindspore/grad_probe/global_context.py
@@ -17,9 +17,10 @@ import os
 import threading
 from typing import Dict, Union, Tuple
 
+from msprobe.core.common.utils import is_int
 from msprobe.core.common.file_utils import create_directory, check_path_before_create
 from msprobe.core.grad_probe.constant import GradConst
-from msprobe.core.grad_probe.utils import check_str, 
check_bounds_element +from msprobe.core.grad_probe.utils import check_str, check_bounds_element, check_param_element from msprobe.mindspore.common.log import logger @@ -51,10 +52,10 @@ class GlobalContext: else: raise ValueError("Invalid level set in config yaml file, level option: L0, L1, L2") - self._set_input_list(config_dict, GradConst.PARAM_LIST, str) + self._set_input_list(config_dict, GradConst.PARAM_LIST, (str,), element_check=check_param_element) self._set_input_list(config_dict, GradConst.BOUNDS, (float, int), element_check=check_bounds_element) - self._set_input_list(config_dict, GradConst.STEP, int) - self._set_input_list(config_dict, GradConst.RANK, int) + self._set_input_list(config_dict, GradConst.STEP, (int,)) + self._set_input_list(config_dict, GradConst.RANK, (int,)) output_path = config_dict.get(GradConst.OUTPUT_PATH) check_str(output_path, variable_name="output_path in yaml") @@ -102,11 +103,15 @@ class GlobalContext: if value and isinstance(value, list): for val in value: if not isinstance(val, dtype): - logger.warning(f"Invalid {name} which must be None or list of {type_str}") + logger.warning(f"Invalid {name} which must be None or list of {type_str}, use default value.") + return + elif isinstance(val, int) and not is_int(val): + logger.warning(f"Invalid {name} which must be None or list of int, use default value.") return if element_check and not element_check(val): - logger.warning(f"Given {name} violates some rules.") + logger.warning(f"Given {name} violates some rules, use default value.") return + self._setting[name] = value else: logger.warning(f"{name} is None or not a list with valid items, use default value.") diff --git a/debug/accuracy_tools/msprobe/mindspore/grad_probe/grad_analyzer.py b/debug/accuracy_tools/msprobe/mindspore/grad_probe/grad_analyzer.py index c875d52794fea28d77532a53078afde5abbd51e9..8a154f4d65f63e55f6b0cf3165d3c905bcb68546 100644 --- a/debug/accuracy_tools/msprobe/mindspore/grad_probe/grad_analyzer.py +++ 
b/debug/accuracy_tools/msprobe/mindspore/grad_probe/grad_analyzer.py @@ -16,6 +16,7 @@ import multiprocessing import os import time +from dataclasses import dataclass from multiprocessing import Process from typing import List @@ -23,6 +24,7 @@ import mindspore as ms import numpy as np from mindspore.common.parameter import Parameter from mindspore.communication import get_rank + from msprobe.core.common.file_utils import (create_directory, check_file_or_directory_path, write_csv, remove_path, move_file, load_npy) from msprobe.core.grad_probe.constant import GradConst @@ -31,6 +33,16 @@ from msprobe.mindspore.common.log import logger from msprobe.mindspore.grad_probe.global_context import grad_context, GlobalContext +@dataclass +class GradDumpConfig: + dump_dir: str + g_name: str + dump_step: Parameter + grad: ms.Tensor + level: str + bounds: List + + def get_rank_id(): try: rank_id = get_rank() @@ -40,35 +52,35 @@ def get_rank_id(): @ms.jit -def grad_dump(dump_dir: str, g_name: str, dump_step: Parameter, grad: ms.Tensor, level: str, bounds: List): +def grad_dump(config: GradDumpConfig): """ Dump gradient statistic data. 
level0: [step, max, min, norm, shape_dim, shape] level1: [step, max, min, norm, shape_dim, shape] + grad_bool_data level2: [step, max, min, norm, shape_dim, shape, dist_dim, dist] + grad_bool_data """ - dump_path = os.path.join(dump_dir, g_name) + dump_path = os.path.join(config.dump_dir, config.g_name) dump_dir_path = dump_path + "_dir" save_op = ms.ops.TensorDump() - grad_flat = grad.reshape(-1) + grad_flat = config.grad.reshape(-1) max_val = grad_flat.max(axis=0).float() min_val = grad_flat.min(axis=0).float() norm_val = grad_flat.norm(ord=2).float() - shape = grad.shape - extrem_list = [dump_step[0].float(), max_val, min_val, norm_val] + shape = config.grad.shape + extrem_list = [config.dump_step[0].float(), max_val, min_val, norm_val] extrem_stat = ms.ops.stack(extrem_list) shape_list = [len(shape)] + list(shape) shape_stat = ms.Tensor(shape_list).float() level0_stat = ms.ops.concat((extrem_stat, shape_stat), axis=0) level_stat = level0_stat - if level == GradConst.LEVEL2: - zero_grad = (grad == 0).sum() - dist_dim = ms.Tensor([len(bounds) + 2]).float() - bucket_result = ms.ops.bucketize(grad.float(), bounds) + if config.level == GradConst.LEVEL2: + zero_grad = (config.grad == 0).sum() + dist_dim = ms.Tensor([len(config.bounds) + 2]).float() + bucket_result = ms.ops.bucketize(config.grad.float(), config.bounds) bucket_result = bucket_result.astype(ms.int8) - dist_stat = [(bucket_result == i).sum() for i in range(len(bounds) + 1)] + dist_stat = [(bucket_result == i).sum() for i in range(len(config.bounds) + 1)] dist_stat.append(zero_grad) dist_stat.append(ms.Tensor(1, dtype=ms.int64)) # make sure dist_stat is not empty dist_stat = ms.ops.stack(dist_stat, axis=0).float() @@ -76,8 +88,8 @@ def grad_dump(dump_dir: str, g_name: str, dump_step: Parameter, grad: ms.Tensor, level_stat = level2_stat save_op(dump_path, level_stat) - if level == GradConst.LEVEL1 or level == GradConst.LEVEL2: - grad_direction = grad > 0 + if config.level == GradConst.LEVEL1 or 
config.level == GradConst.LEVEL2: + grad_direction = config.grad > 0 save_op(dump_dir_path, grad_direction) diff --git a/debug/accuracy_tools/msprobe/mindspore/grad_probe/hook.py b/debug/accuracy_tools/msprobe/mindspore/grad_probe/hook.py index b93989870886fe47aad0b43d0cdccc34aa703814..1aa9fcfad10815d5845de66ab0ea6d4d7211741f 100644 --- a/debug/accuracy_tools/msprobe/mindspore/grad_probe/hook.py +++ b/debug/accuracy_tools/msprobe/mindspore/grad_probe/hook.py @@ -26,7 +26,7 @@ from msprobe.core.grad_probe.constant import GradConst from msprobe.mindspore.common.log import logger from msprobe.mindspore.grad_probe.global_context import grad_context from msprobe.mindspore.grad_probe.grad_analyzer import csv_generator -from msprobe.mindspore.grad_probe.grad_analyzer import grad_dump, get_rank_id +from msprobe.mindspore.grad_probe.grad_analyzer import grad_dump, get_rank_id, GradDumpConfig from msprobe.mindspore.grad_probe.grad_stat_csv import GradStatCsv, CsvInput from msprobe.mindspore.grad_probe.utils import save_grad_direction, get_adapted_level @@ -38,7 +38,14 @@ class HookInput: def __init__(self, opt) -> None: self.func = opt.construct - self.g_names = [param.name for param in opt._parameters] + if hasattr(opt, "_parameters"): + parameter_list = opt._parameters + elif hasattr(opt, "parameters"): + parameter_list = opt.parameters + else: + logger.error_log_with_exp("Given optimizer has no attributes: '_parameters' or 'parameters'. 
\ + Please check the type of the given optimizer.", ValueError) + self.g_names = [param.name for param in parameter_list] self.param_list = grad_context.get_context(GradConst.PARAM_LIST) self.rank_id = get_rank_id() output_path = grad_context.get_context(GradConst.OUTPUT_PATH) @@ -59,8 +66,10 @@ def hook_graph_mode_optimizer(opt, hook_input): for index, grad_value in enumerate(gradients): if hook_input.param_list and hook_input.g_names[index] not in hook_input.param_list: continue - grad_dump(hook_input.dump_dir, hook_input.g_names[index], self.dump_step, - grad_value, hook_input.level, hook_input.bounds) + conf = GradDumpConfig(dump_dir=hook_input.dump_dir, g_name=hook_input.g_names[index], + dump_step=self.dump_step, grad=grad_value, level=hook_input.level, + bounds=hook_input.bounds) + grad_dump(conf) ms.ops.TensorDump()(hook_input.step_finish_flag, self.dump_step) self.assignadd(self.dump_step, self.global_step_increase_tensor) out = hook_input.func(gradients) diff --git a/debug/accuracy_tools/msprobe/mindspore/mindtorch/__init__.py b/debug/accuracy_tools/msprobe/mindspore/mindtorch/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..fc695d05ccc010f824b61db39a8ea77714d2d73b --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/mindtorch/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2025-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
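The hook.py change above funnels `grad_dump`'s six positional arguments through a single `GradDumpConfig` dataclass. A minimal standalone sketch of the same pattern, with simplified types and a plain-Python stand-in for the tensor statistics (names and field types here are illustrative, not the real MindSpore ones):

```python
from dataclasses import dataclass
from typing import List


# Bundle the parameters that used to travel as six positional arguments.
# Field names mirror GradDumpConfig from the diff; ms.Tensor / Parameter
# are replaced with plain Python types for this sketch.
@dataclass
class GradDumpConfigSketch:
    dump_dir: str
    g_name: str
    dump_step: int
    grad: List[float]
    level: str
    bounds: List[float]


def grad_dump_sketch(config: GradDumpConfigSketch) -> dict:
    # Compute the level0 statistics named in the original docstring:
    # [step, max, min, norm, ...] -- here returned as a dict for clarity.
    flat = config.grad
    norm = sum(x * x for x in flat) ** 0.5
    return {
        "step": config.dump_step,
        "max": max(flat),
        "min": min(flat),
        "norm": norm,
        "name": config.g_name,
    }
```

Call sites then build one config object and pass it through, which is what the `GradDumpConfig(...)` / `grad_dump(conf)` change in hook.py does.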
+ +from .mindtorch_adaptor import (_call_impl, + register_full_backward_pre_hook, + register_full_backward_hook) diff --git a/debug/accuracy_tools/msprobe/mindspore/mindtorch/mindtorch_adaptor.py b/debug/accuracy_tools/msprobe/mindspore/mindtorch/mindtorch_adaptor.py new file mode 100644 index 0000000000000000000000000000000000000000..27e42d52ba6190ec7e7531af25464e6aa3996b2b --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/mindtorch/mindtorch_adaptor.py @@ -0,0 +1,255 @@ +# From PyTorch: + +# Copyright (c) 2025 Huawei Technologies Co., Ltd +# Copyright (c) 2016- Facebook, Inc (Adam Paszke) +# Copyright (c) 2014- Facebook, Inc (Soumith Chintala) +# Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert) +# Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu) +# Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu) +# Copyright (c) 2011-2013 NYU (Clement Farabet) +# Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston) +# Copyright (c) 2006 Idiap Research Institute (Samy Bengio) +# Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz) + +# From Caffe2: + +# Copyright (c) 2016-present, Facebook Inc. All rights reserved. + +# All contributions by Facebook: +# Copyright (c) 2016 Facebook Inc. + +# All contributions by Google: +# Copyright (c) 2015 Google Inc. +# All rights reserved. + +# All contributions by Yangqing Jia: +# Copyright (c) 2015 Yangqing Jia +# All rights reserved. + +# All contributions by Kakao Brain: +# Copyright 2019-2020 Kakao Brain + +# All contributions by Cruise LLC: +# Copyright (c) 2022 Cruise LLC. +# All rights reserved. + +# All contributions by Tri Dao: +# Copyright (c) 2024 Tri Dao. +# All rights reserved. 
+ +# All contributions by Arm: +# Copyright (c) 2021, 2023-2024 Arm Limited and/or its affiliates + +# All contributions from Caffe: +# Copyright(c) 2013, 2014, 2015, the respective contributors +# All rights reserved. + +# All other contributions: +# Copyright(c) 2015, 2016 the respective contributors +# All rights reserved. + +# Caffe2 uses a copyright model similar to Caffe: each contributor holds +# copyright over their contributions to Caffe2. The project versioning records +# all such contribution and copyright details. If a contributor wants to further +# mark their specific copyright on a particular contribution, they should +# indicate their copyright solely in the commit message of the change when it is +# committed. + +# All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: + +# 1. Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. + +# 2. Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. + +# 3. Neither the names of Facebook, Deepmind Technologies, NYU, NEC Laboratories +# America, IDIAP Research Institute and Huawei nor the names of its contributors +# may be used to endorse or promote products derived from this software without +# specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +# ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE +# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +# POSSIBILITY OF SUCH DAMAGE. + +import warnings + +import mindspore as ms +from mindspore.ops.operations import _inner_ops as inner +from torch.nn.modules.module import (_global_backward_pre_hooks, _global_backward_hooks, + _global_is_full_backward_hook, _global_forward_pre_hooks, + _global_forward_hooks, _global_forward_hooks_always_called) +from torch.utils.hooks import RemovableHandle + + +def _call_impl(self, *args, **kwargs): + forward_call = self.forward + if self.__ms_class__: + return forward_call(*args, **kwargs) + + # If we don't have any hooks, we want to skip the rest of the logic in + # this function, and just call forward. 
+ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks + or _global_backward_pre_hooks or _global_backward_hooks + or _global_forward_hooks or _global_forward_pre_hooks): + return forward_call(*args, **kwargs) + + try: + result = None + called_always_called_hooks = set() + + if self._backward_pre_hooks or _global_backward_pre_hooks: + _get_backward_pre_hooks(self) + + if self._backward_hooks or _global_backward_hooks: + _get_backward_hooks(self) + + if _global_forward_pre_hooks or self._forward_pre_hooks: + for hook_id, hook in ( + *_global_forward_pre_hooks.items(), + *self._forward_pre_hooks.items(), + ): + if hook_id in self._forward_pre_hooks_with_kwargs: + args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc] + if args_kwargs_result is not None: + if isinstance(args_kwargs_result, tuple) and len(args_kwargs_result) == 2: + args, kwargs = args_kwargs_result + else: + raise RuntimeError( + "forward pre-hook must return None or a tuple " + f"of (new_args, new_kwargs), but got {args_kwargs_result}." 
+ ) + else: + args_result = hook(self, args) + if args_result is not None: + if not isinstance(args_result, tuple): + args_result = (args_result,) + args = args_result + + bw_hook = None + if self._backward_hooks: + bw_hook = inner.CellBackwardHook(self.__class__.__name__ + "(" + str(id(self)) + ")", + self, self._backward_hooks) + bw_hook.register_backward_hook() + args = apply_backward_hook_on_tensors(bw_hook, args) + + result = forward_call(*args, **kwargs) + if _global_forward_hooks or self._forward_hooks: + for hook_id, hook in ( + *_global_forward_hooks.items(), + *self._forward_hooks.items(), + ): + # mark that always called hook is run + if hook_id in self._forward_hooks_always_called or hook_id in _global_forward_hooks_always_called: + called_always_called_hooks.add(hook_id) + + if hook_id in self._forward_hooks_with_kwargs: + hook_result = hook(self, args, kwargs, result) + else: + hook_result = hook(self, args, result) + + if hook_result is not None: + result = hook_result + + if bw_hook: + if not isinstance(result, (ms.Tensor, tuple)): + warnings.warn("For backward hooks to be called," + " module output should be a Tensor or a tuple of Tensors" + f" but received {type(result)}") + result = apply_backward_hook_on_tensors(bw_hook, result) + + if self._backward_pre_hooks: + bw_pre_hook = inner.CellBackwardHook(self.__class__.__name__ + "(" + str(id(self)) + ")", + self, self._backward_pre_hooks) + bw_pre_hook.register_backward_pre_hook() + result = apply_backward_hook_on_tensors(bw_pre_hook, result) + + return result + except Exception: + # run always called hooks if they have not already been run + # For now only forward hooks have the always_call option but perhaps + # this functionality should be added to full backward hooks as well. 
+ for hook_id, hook in _global_forward_hooks.items(): + # type: ignore[possibly-undefined] + if hook_id in _global_forward_hooks_always_called and hook_id not in called_always_called_hooks: + try: + hook_result = hook(self, args, result) # type: ignore[possibly-undefined] + if hook_result is not None: + result = hook_result + except Exception as e: + warnings.warn("global module forward hook with ``always_call=True`` raised an exception " + f"that was silenced as another error was raised in forward: {str(e)}") + continue + + for hook_id, hook in self._forward_hooks.items(): + # type: ignore[possibly-undefined] + if hook_id in self._forward_hooks_always_called and hook_id not in called_always_called_hooks: + try: + if hook_id in self._forward_hooks_with_kwargs: + hook_result = hook(self, args, kwargs, result) # type: ignore[possibly-undefined] + else: + hook_result = hook(self, args, result) # type: ignore[possibly-undefined] + if hook_result is not None: + result = hook_result + except Exception as e: + warnings.warn("module forward hook with ``always_call=True`` raised an exception " + f"that was silenced as another error was raised in forward: {str(e)}") + continue + # raise exception raised in try block + raise + + +def register_full_backward_pre_hook(self, hook, prepend: bool = False) -> RemovableHandle: + handle = RemovableHandle(self._backward_pre_hooks) + self._backward_pre_hooks[handle.id] = hook + if prepend: + self._backward_pre_hooks.move_to_end(handle.id, last=False) # type: ignore[attr-defined] + return handle + + +def register_full_backward_hook(self, hook, prepend: bool = False) -> RemovableHandle: + if self._is_full_backward_hook is False: + raise RuntimeError( + "Cannot use both regular backward hooks and full backward hooks on a " + "single Module. Please use only one of them." 
+ ) + + self._is_full_backward_hook = True + + handle = RemovableHandle(self._backward_hooks) + self._backward_hooks[handle.id] = hook + if prepend: + self._backward_hooks.move_to_end(handle.id, last=False) # type: ignore[attr-defined] + return handle + + +def _get_backward_pre_hooks(self): + self._backward_pre_hooks.update(_global_backward_pre_hooks) + + +def _get_backward_hooks(self): + if (_global_is_full_backward_hook is True): + self._backward_hooks.update(_global_backward_hooks) + + +def apply_backward_hook_on_tensors(cell_backward_hook, args): + is_tuple = True + if not isinstance(args, tuple): + args = (args,) + is_tuple = False + hooked_args = cell_backward_hook(*args) + if is_tuple and len(args) == 1: + hooked_args = (hooked_args, ) + return hooked_args diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/anomaly_detect.py b/debug/accuracy_tools/msprobe/mindspore/monitor/anomaly_detect.py new file mode 100644 index 0000000000000000000000000000000000000000..3544ebbd025614349585bc799b15e00a5c2c7956 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/monitor/anomaly_detect.py @@ -0,0 +1,404 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
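`register_full_backward_hook` and `register_full_backward_pre_hook` above both hand the caller a `torch.utils.hooks.RemovableHandle` keyed into the module's hook `OrderedDict`. A minimal sketch of that handle pattern, independent of torch (class and function names here are illustrative):

```python
import itertools
from collections import OrderedDict


class SimpleRemovableHandle:
    # Each handle owns an id into a shared OrderedDict of hooks and can
    # remove its own entry later. Simplified from torch.utils.hooks.
    _next_id = itertools.count()

    def __init__(self, hooks_dict):
        self.hooks_dict = hooks_dict
        self.id = next(SimpleRemovableHandle._next_id)

    def remove(self):
        self.hooks_dict.pop(self.id, None)


def register_hook(hooks, fn, prepend=False):
    # Mirrors the registration shape used in the adaptor: store the hook
    # under the handle's id, optionally moving it to the front.
    handle = SimpleRemovableHandle(hooks)
    hooks[handle.id] = fn
    if prepend:
        hooks.move_to_end(handle.id, last=False)
    return handle
```

Returning the handle instead of the hook id lets callers unregister without knowing how the module stores its hooks, which is why the adaptor can reuse PyTorch's `RemovableHandle` unchanged.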
+ +import itertools +import os +import sys +import statistics as st +from abc import ABC +from dataclasses import dataclass, field +from typing import List +from collections import defaultdict + +import pandas as pd + +from mindspore import ops +from mindspore import _no_grad +from msprobe.core.common.log import logger +from msprobe.core.common.file_utils import change_mode, create_directory, write_df_to_csv +from msprobe.core.common.const import FileCheckConst, MonitorConst + + +class ScanRule(ABC): + name = "ScanRule" + + def apply(self, history, cur): + raise NotImplementedError("abstract method apply is not implemented") + + +class AnomalyTurbulence(ScanRule): + name = "AnomalyTurbulence" + + def __init__(self, threshold) -> None: + self.threshold = threshold + + def apply(self, history, cur): + baseline = st.mean(history) if isinstance(history, list) else history + + up_bound = baseline + baseline * self.threshold + if baseline > 0: + return cur > up_bound + else: + return cur < up_bound + + +class AnomalyScanner: + + @staticmethod + def load_rules(specs: List[dict]): + """ + specs: [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}] + """ + if specs is None: + return [] + alert_rules = [] + for spec in specs: + # 使用get方法获取键值,如果键不存在则返回None + rule_cls_name = spec.get("rule_name") + rule_args = spec.get("args") + + # 检查必要的键是否存在 + if rule_cls_name is None or rule_args is None: + logger.warning(f"Spec is missing required keys: {spec}") + continue + + cur_module = sys.modules.get(__name__) + try: + rule_cls = getattr(cur_module, rule_cls_name) + except AttributeError: + logger.error(f"Rule class '{rule_cls_name}' not found in the current module.") + continue + + try: + rule_instance = rule_cls(**rule_args) + alert_rules.append(rule_instance) + except Exception as e: + logger.error(f"Error creating instance of rule '{rule_cls_name}': {e}") + continue + + return alert_rules + + @staticmethod + def scan(scan_rules: List[ScanRule], history, cur): + anomaly 
= False + for rule in scan_rules: + anomaly = rule.apply(history, cur) + if anomaly: + return anomaly, rule.name + return anomaly, None + + +class BCOLORS: + HEADER = '\033[95m' + OKBLUE = '\033[94m' + OKCYAN = '\033[96m' + OKGREEN = '\033[92m' + WARNING = '\033[93m' + FAIL = '\033[91m' + ENDC = '\033[0m' + BOLD = '\033[1m' + UNDERLINE = '\033[4m' + + +class AnomalyDataFactory(ABC): + def __init__(self, rank, pp_stage, group_mates): + super().__init__() + self.rank = rank + self.pp_stage = pp_stage + self.group_mates = group_mates + self.micro_step = 0 + self.name2callid = {} + + def set_call_id(self, name2callid): + """根据当前GradContext信息更新call_id vpp_stage等信息 + """ + self.name2callid = name2callid + + def create(self, tag, message, step): + """如果检查出异常, 调用当前接口生成GradAnomalyData实例 + tag (tuple): metric tag ('0:1.post_attention_norm.weight/rank0/pre_grad', 'min') + message (str): anomaly detect message + step (int): training step + """ + if not isinstance(tag, tuple) or len(tag) != 2: + raise ValueError("tag must be a tuple with length 2") + tag_name = tag[0] + param_name = tag_name.split('/')[0] + call_id = self.name2callid.get(tag_name, -1) + if MonitorConst.NAME_SEP in param_name: + vpp_stage = int(param_name.split(MonitorConst.NAME_SEP)[0]) + else: + vpp_stage = 0 + + return GradAnomalyData( + self.rank, + step, + self.micro_step, + self.pp_stage, + vpp_stage, + call_id, + tag_name, + message, + self.group_mates + ) + + +class TrainStage: + DEFAULT_STAGE = -1 + FORWARD_STAGE = 0 + BACKWARD_STAGE = 1 + OPTIMIZER_STAGE = 2 + + +FORWARD_KEY = [MonitorConst.ACTV_IN, MonitorConst.ACTV_OUT] +BACKWARD_KEY = [MonitorConst.ACTVGRAD_IN, MonitorConst.ACTVGRAD_OUT, + MonitorConst.PRE_GRAD, MonitorConst.POST_GRAD, MonitorConst.ACC_GRAD] +OPTIMIZER_KEY = [MonitorConst.EXP_AVG, MonitorConst.EXP_AVG_SQ] +TRAIN_STAGE = { + **{key_: TrainStage.FORWARD_STAGE for key_ in FORWARD_KEY}, + **{key_: TrainStage.BACKWARD_STAGE for key_ in BACKWARD_KEY}, + **{key_: TrainStage.OPTIMIZER_STAGE 
for key_ in OPTIMIZER_KEY} +} + + +@dataclass(eq=True) +class GradAnomalyData: + rank: int = 0 + step: int = 0 + micro_step: int = 0 + pp_stage: int = 0 + vpp_stage: int = 0 + call_id: int = 0 + tag_name: str = field(default=None, compare=False) + message: str = field(default="", compare=False) + group_mates: list = field(default=None, compare=False) + + def __lt__(self, other): + """ + 自定义比较函数,用于确定 GradAnomalyData 实例之间的顺序。 + 比较规则为: + step 和 micro_step 值越小优先级越高; + vpp 和 pp 在前向阶段值越小优先级越高,在非前向阶段值越大优先级越高; + call_id 值越小优先级越高。 + """ + if not isinstance(other, GradAnomalyData): + return NotImplemented + + self_train_stage = self.get_train_stage(self.tag_name) + other_train_stage = self.get_train_stage(other.tag_name) + + def vpp_pp_comparator(anomaly): + """ + Determine the priority rule for vpp and pp based on train stage + Forward stage prefers smaller vpp and pp + Other stages prefer larger vpp and pp + """ + if self_train_stage == TrainStage.FORWARD_STAGE: + return anomaly.vpp_stage, anomaly.pp_stage + else: + return -anomaly.vpp_stage, -anomaly.pp_stage + + self_cmp = [self.step, self.micro_step, self_train_stage, *vpp_pp_comparator(self), self.call_id] + other_cmp = [other.step, other.micro_step, other_train_stage, *vpp_pp_comparator(other), other.call_id] + return self_cmp < other_cmp + + def __le__(self, other): + if not isinstance(other, GradAnomalyData): + return NotImplemented + return self == other or self < other + + @staticmethod + def get_train_stage(tag_name): + """ + :param tag_name: "0:fc2_0/rank0/input", "0:fc1.weight/rank0/post_grad", "0:fc2.weight/rank0/exp_avg_sq" + :return: int, if forward return 0; if backward return 1; if optimizer return 2 + """ + key_ = tag_name.split("/")[-1] + return TRAIN_STAGE.get(key_, TrainStage.DEFAULT_STAGE) + + def to_dict(self): + return self.__dict__ + + def get_key(self): + # 0:1.self_attention.core_attention_flash_0/rank0/input_grad + return ''.join([str(self.tag_name), "_step_", str(self.step), "_call_", 
str(self.call_id)]) + + +@dataclass +class WriterInput: + path: str + ad_rules: list + job_id: str + anomaly_factory: AnomalyDataFactory = None + ndigits: int = 6 + step_count_per_record: int = 1 + + +class BaseWriterWithAD: + def __init__(self, writer_input: WriterInput): + self.tag2scalars = {} + self.ad_rules = writer_input.ad_rules + self.job_id = writer_input.job_id + self.anomaly_factory = writer_input.anomaly_factory + self.anomalies = [] + self.ndigits = writer_input.ndigits + + def get_anomalies(self): + """返回已检测到的异常列表 + """ + return self.anomalies + + def clear_anomalies(self): + self.anomalies.clear() + + def add_scalar(self, tag, scalar_value, global_step=None, need_explain=False): + """If an anomaly is detected, the anomaly information is recorded and added to self.anomalies. + Args: + tag (tuple): tuple of tag_name and tag like ('0:1.post_attention_norm.weight/rank0/pre_grad', 'min'). + scalar_value (float): scalar_value. + global_step (int): global_step. + Returns: + None + """ + detected = False + if self.ad_rules: + avg = self._update_tag2scalars(tag, scalar_value) + detected, rule_name = self._ad(scalar_value, history=avg) + if detected: + exception_message = f"Rule {rule_name} reports anomaly signal in {tag} at step {global_step}." 
+ logger.info(f"{BCOLORS.WARNING}> {exception_message}{BCOLORS.ENDC}") + # append to self.anomalies for dump + if self.anomaly_factory: + self.anomalies.append(self.anomaly_factory.create(tag, exception_message, global_step)) + + def write_metrics(self, op_list, metric_value, step, prefix='', need_explain=False): + if not metric_value: + return + tensors = [] + tags = list(itertools.product(metric_value.keys(), op_list)) + for op2tensor in metric_value.values(): + tensors.extend(op2tensor.values()) + with _no_grad(): + metric_list = ops.stack(tensors).tolist() if tensors else [] + for tag, metric in zip(tags, metric_list): + self.add_scalar(tag, metric, step, need_explain) + + def _ad(self, scalar_value, history): + return AnomalyScanner.scan(self.ad_rules, history, cur=scalar_value) + + def _update_tag2scalars(self, tag, scalar_value): + """Update the average and count of a scalar value associated with a tag. + + This method is used to maintain a running average of scalar values for each tag. + + + Args: + tag (str): The tag identifier. + scalar_value (float): The scalar value to be added. + + Returns: + float: The average value before update. 
+ """ + if tag not in self.tag2scalars: + self.tag2scalars[tag] = {'avg': scalar_value, 'count': 0} + avg = self.tag2scalars[tag]['avg'] + new_avg = (avg * self.tag2scalars[tag]['count'] + scalar_value) / (self.tag2scalars[tag]['count'] + 1) + self.tag2scalars[tag]['avg'] = new_avg + self.tag2scalars[tag]['count'] += 1 + return avg + + +class CSVWriterWithAD(BaseWriterWithAD): + def __init__(self, writer_input: WriterInput): + super().__init__(writer_input) + + path = writer_input.path + self.log_dir = path + create_directory(path) + change_mode(path, FileCheckConst.DATA_DIR_AUTHORITY) + self.context_dict = defaultdict(list) + self.header = [] + self.step_count_per_record = writer_input.step_count_per_record + + def get_step_interval(self, step): + count = step // self.step_count_per_record + return count * self.step_count_per_record, (count + 1) * self.step_count_per_record - 1 + + def write_csv(self, prefix, step): + """ + Args: + prefix[str]: prefix of output csv file e.g. grad_unreduced + step[int] + """ + if len(self.context_dict) == 0: + return + + ster_start, step_end = self.get_step_interval(step) + filepath = os.path.join(self.log_dir, f'{prefix}_{ster_start}-{step_end}.csv') + if not os.path.exists(filepath): + data_frame = pd.DataFrame(columns=self.header) + write_df_to_csv(data_frame, filepath) + + new_data = [] + for name, metric_value in self.context_dict.items(): + if MonitorConst.NAME_SEP not in name: + new_data.append([name] + [step] + metric_value) + else: + new_data.append(name.split(MonitorConst.NAME_SEP) + [step] + metric_value) + new_data = pd.DataFrame(new_data).round(self.ndigits) + write_df_to_csv(new_data, filepath, mode='a+', header=False) + self.context_dict = defaultdict(list) + + def add_scalar(self, tag, scalar_value, global_step, need_explain=False): + """ + ('0:1.post_attention_norm.weight/rank0/pre_grad', 'min') + """ + super().add_scalar(tag, scalar_value, global_step, need_explain=False) + split_name = tag[0].split('/') + name = 
split_name[0] + if need_explain: + if 'pre' in split_name[-1]: + name += '.input' + if 'post' in split_name[-1]: + name += '.output' + self.context_dict[name].append(scalar_value) + + def write_metrics(self, op_list, metric_value, step, prefix='', need_explain=False): + need_explain = prefix == 'other' + super().write_metrics(op_list, metric_value, step, prefix='', need_explain=need_explain) + + # generate csv headers + # set hashmap to reduce the number of headers generated. + # 前向的norm用input.ops_和output.ops_,反向的用input_grad.ops_和output_grad.ops_ + if prefix in {"actv", "actv_grad"}: + if prefix == "actv": + input_and_output = [MonitorConst.ACTV_IN, MonitorConst.ACTV_OUT] + else: + input_and_output = [MonitorConst.ACTVGRAD_IN, MonitorConst.ACTVGRAD_OUT] + ops_ = [MonitorConst.DOT.join(i) for i in itertools.product(input_and_output, op_list)] + csv_header = ["module_name", "step", *ops_] + else: + csv_header = ["param_name", "step", *op_list] + + keys = list(metric_value.keys()) + if keys and MonitorConst.NAME_SEP in keys[0]: + csv_header.insert(0, "vpp_stage") + + self.header = csv_header + self.write_csv(prefix, step) + self.header = [] + + def close(self): + pass diff --git a/debug/accuracy_tools/msprobe/pytorch/functional/__init__.py b/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/__init__.py similarity index 100% rename from debug/accuracy_tools/msprobe/pytorch/functional/__init__.py rename to debug/accuracy_tools/msprobe/mindspore/monitor/distributed/__init__.py diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/distributed_ops.yaml b/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/distributed_ops.yaml new file mode 100644 index 0000000000000000000000000000000000000000..6f336b2fffd81c3e3aa60a4dec1c743e31f2609b --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/distributed_ops.yaml @@ -0,0 +1,15 @@ +communication.comm_func: + - all_reduce + - all_gather_into_tensor + - reduce + - 
reduce_scatter_tensor + - all_to_all_single_with_output_shape + - all_to_all_with_output_shape + - batch_isend_irecv + - broadcast + - gather_into_tensor + - scatter_tensor + - send + - recv + - isend + - irecv \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/stack_blacklist.yaml b/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/stack_blacklist.yaml new file mode 100644 index 0000000000000000000000000000000000000000..068935cebec687497d75b688fad228866a0b3622 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/stack_blacklist.yaml @@ -0,0 +1,5 @@ +stack: +- msprobe/mindspore/monitor/distributed +- site-packages/mindspore/nn/cell.py +- multiprocessing +- debugpy \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/wrap_distributed.py b/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/wrap_distributed.py new file mode 100644 index 0000000000000000000000000000000000000000..33fd58c7278c6245140e50a984f44e59b90c69de --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/monitor/distributed/wrap_distributed.py @@ -0,0 +1,300 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
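`distributed_ops.yaml` above is only a candidate list; `wrap_distributed.py` intersects it with what `comm_func` actually exports before wrapping anything (its `get_distributed_ops`). A small sketch of that filtering step, using a hypothetical fake module in place of `mindspore.communication.comm_func`:

```python
import types

# Fake stand-in for mindspore.communication.comm_func: only three of the
# configured ops actually exist on it.
fake_comm_func = types.SimpleNamespace(all_reduce=lambda: None,
                                       broadcast=lambda: None,
                                       send=lambda: None)

# What distributed_ops.yaml would yield after load_yaml(...).get(...).
wrap_ops = ["all_reduce", "broadcast", "isend", "not_a_real_op"]


def get_wrappable_ops(module, candidates):
    # Intersect the configured candidates with the names the module really
    # provides, so a stale YAML entry never produces an AttributeError.
    return set(candidates) & set(dir(module))
```

This is why the YAML can safely list ops across MindSpore versions: entries that a given version does not export are simply dropped instead of failing at import time.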
+ +import inspect +import os +import re + +import numpy as np + +from mindspore import nn, Tensor, ops, _no_grad +from mindspore import communication +from mindspore.communication import comm_func, get_rank + +from msprobe.core.common.const import MonitorConst, Const +from msprobe.core.common.file_utils import load_yaml +from msprobe.mindspore.monitor.utils import get_metrics, get_summary_writer_tag_name + +enable_communication = True +try: + from mindspore._c_expression import CommHandle as CommHandle_ +except ImportError: + enable_communication = False + + +RANK = None + +OpsPath = os.path.join(os.path.dirname(__file__), "distributed_ops.yaml") +WrapDistributedOps = load_yaml(OpsPath).get("communication.comm_func", []) + +StackBlackListPath = os.path.join(os.path.dirname(__file__), "stack_blacklist.yaml") +StackBlackList = load_yaml(StackBlackListPath).get("stack", []) + +distributed_func = {} +for f in dir(comm_func): + distributed_func[f] = getattr(comm_func, f) + +ORIGIN_WAIT = CommHandle_.wait if enable_communication else None +PENDING_ASYNC_CC_BY_HANDLE = {} + + +def get_distributed_ops(): + global WrapDistributedOps + _all_distributed_ops = dir(comm_func) + return set(WrapDistributedOps) & set(_all_distributed_ops) + + +class DistributedOPTemplate(nn.Cell): + def __init__(self, op_name, pre_hooks, post_hooks): + super(DistributedOPTemplate, self).__init__() + self.op_name_ = str(op_name) + self.__name__ = self.op_name_ + self.cc_hooks = [] + for pre_hook in pre_hooks: + handle = self.register_forward_pre_hook(pre_hook) + self.cc_hooks.append(handle) + for hook in post_hooks: + handle = self.register_forward_hook(hook) + self.cc_hooks.append(handle) + + def construct(self, *args, **kwargs): + return distributed_func.get(self.op_name_)(*args, **kwargs) + + def forward(self, *args, **kwargs): + return distributed_func.get(self.op_name_)(*args, **kwargs) + + +class ApiRegistry: + def __init__(self): + self.distributed_attr_origin = {} + 
self.distributed_attr_hooked = {} + + @staticmethod + def store_ori_attr(ori_api_group, api_list, api_ori_attr): + for api in api_list: + if Const.SEP in api: + sub_module_name, sub_op = api.rsplit(Const.SEP, 1) + sub_module = getattr(ori_api_group, sub_module_name) + api_ori_attr[api] = getattr(sub_module, sub_op) + else: + api_ori_attr[api] = getattr(ori_api_group, api) + + @staticmethod + def set_api_attr(api_group, attr_dict): + for cc_api_name, cc_api_entry_func in attr_dict.items(): + if Const.SEP in cc_api_name: + sub_module_name, sub_op = cc_api_name.rsplit(Const.SEP, 1) + sub_module = getattr(api_group, sub_module_name, None) + if sub_module is not None: + setattr(sub_module, sub_op, cc_api_entry_func) + else: + setattr(api_group, cc_api_name, cc_api_entry_func) + + @staticmethod + def redirect_wait(): + global ORIGIN_WAIT + global PENDING_ASYNC_CC_BY_HANDLE + if not ORIGIN_WAIT: + return + + def wrapped_wait(work): + def wrapped_wait(*args, **kwargs): + ORIGIN_WAIT(*args, **kwargs) + if args[0] in PENDING_ASYNC_CC_BY_HANDLE: + store_func = PENDING_ASYNC_CC_BY_HANDLE.pop(args[0]) + store_func() + + return wrapped_wait + + CommHandle_.wait = wrapped_wait(CommHandle_) + + def redirect_api(self): + self.set_api_attr(comm_func, self.distributed_attr_hooked) + self.redirect_wait() + + def restore_api(self): + if not ORIGIN_WAIT: + return + self.set_api_attr(comm_func, self.distributed_attr_origin) + setattr(CommHandle_, 'wait', ORIGIN_WAIT) + + def initialize_hook(self, pre_hooks, post_hooks): + self.store_ori_attr(comm_func, get_distributed_ops(), self.distributed_attr_origin) + cc_hooks = [] + for op_name in get_distributed_ops(): + self.distributed_attr_hooked[op_name] = DistributedOPTemplate(op_name, pre_hooks, post_hooks) + cc_hooks.extend(self.distributed_attr_hooked[op_name].cc_hooks) + return cc_hooks + + +def get_process_group(process_group): + return ( + process_group + if process_group + else comm_func.HCCL_WORLD_GROUP + ) + + +def 
stack_filter(stack): + for pattern in StackBlackList: + if re.search(pattern, stack): + return False + return True + + +def get_callstack(): + callstack = [] + for (_, path, line, func, _, _) in inspect.stack(): + stack_line = f'{path}[{line}]' + if stack_filter(stack_line): + callstack.append(stack_line + ' ' + func) + return callstack + + +@_no_grad() +def op_aggregate(op, tensorlist): + if isinstance(tensorlist, Tensor): + return tensorlist + if not tensorlist: + return Tensor(float('nan')) + if op == 'min': + return min(tensorlist) + if op == 'max': + return max(tensorlist) + if op == 'norm': + return sum(tensorlist) + if op == 'zeros': + return sum(tensorlist) / len(tensorlist) + if op == 'nans': + return sum(tensorlist) + if op == 'mean': + return sum(tensorlist) / len(tensorlist) + return Tensor(float('nan')) + + +def update_data(old, new): + for tag, op2tensor in new.items(): + if tag not in old: + old[tag] = {} + for op, tensor in op2tensor.items(): + if op not in old[tag]: + old[tag][op] = [tensor] + else: + old[tag][op].append(tensor) + return old + + +def is_target_line(codeline): + stack = get_callstack() + whole_stack = ';'.join(stack) + if codeline == []: + return True + for pattern in codeline: + if re.search(pattern, whole_stack): + return True + return False + + +@_no_grad() +def catch_data(cc_context, cc_name, ops_list, args, prefix): + tensor_args = {} + for arg in args: + if isinstance(arg, Tensor): + key = get_summary_writer_tag_name(cc_name, f'{prefix}_{len(tensor_args)}', RANK) + tensor_args[key] = arg + elif isinstance(arg, list): + if isinstance(arg[0], Tensor): + stacked_arg = ops.stack(arg) + elif isinstance(arg[0], comm_func.P2POp): + stacked_arg = ops.stack([op.tensor for op in arg]) + key = get_summary_writer_tag_name(cc_name, f'{prefix}_{len(tensor_args)}', RANK) + tensor_args[key] = stacked_arg + + new_data = get_metrics(ops_list, tensor_args, 1e-8) + cc_context.data = update_data(cc_context.data, new_data) + + +def 
create_async_callback_func(context, cc_name, ops_list, args, prefix): + def store_data(): + catch_data(context, cc_name, ops_list, args, prefix) + + return store_data + + +def create_hooks(context, monitor): + def cc_log_hook(module, inputs): + stack = ';'.join(get_callstack()) + monitor.cc_logged_stack[module.op_name_].add(stack) + return + + def cc_pre_hook(module, inputs): + if not is_target_line(monitor.cc_codeline): + return + catch_data(context[module.op_name_], module.op_name_, monitor.ops, inputs, MonitorConst.PREFIX_PRE) + return + + def cc_hook(module, inputs, out=None): + if not is_target_line(monitor.cc_codeline): + return out + if out and enable_communication: # async + if isinstance(out, CommHandle_): + PENDING_ASYNC_CC_BY_HANDLE[out] = create_async_callback_func( + context[module.op_name_], + module.op_name_, + monitor.ops, inputs, + MonitorConst.PREFIX_POST + ) + elif isinstance(out, list): # batch_isend_irecv + for out_element in out: + if isinstance(out_element, comm_func.P2POp): + PENDING_ASYNC_CC_BY_HANDLE[out_element] = create_async_callback_func( + context[module.op_name_], + module.op_name_, + monitor.ops, inputs, + MonitorConst.PREFIX_POST + ) + elif isinstance(out, tuple): + if len(out) == 2 and isinstance(out[1], CommHandle_): + PENDING_ASYNC_CC_BY_HANDLE[out[1]] = create_async_callback_func( + context[module.op_name_], + module.op_name_, + monitor.ops, inputs, + MonitorConst.PREFIX_POST + ) + + return out + catch_data(context[module.op_name_], module.op_name_, monitor.ops, inputs, MonitorConst.PREFIX_POST) + return out + + global RANK + pre_hooks = [] + hooks = [] + RANK = str(get_rank()) + if communication.GlobalComm.INITED and RANK not in monitor.module_rank_list and monitor.module_rank_list != []: + return [pre_hooks, hooks] + + if monitor.cc_log_only: + pre_hooks.append(cc_log_hook) + return [pre_hooks, hooks] + + if monitor.cc_pre_hook: + pre_hooks.append(cc_pre_hook) + + hooks.append(cc_hook) + + return [pre_hooks, hooks] + + 
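`create_hooks` above returns pre- and post-hook lists that `DistributedOPTemplate` registers around each raw communication op: pre-hooks see the inputs before the op runs, post-hooks see inputs and output afterwards (with async outputs deferred until `wait()`). The synchronous part of that wrapping pattern can be sketched without MindSpore; the names below are illustrative stand-ins, not the real `DistributedOPTemplate` API:

```python
# Wrap an op so registered hooks fire around it, mirroring the
# register_forward_pre_hook / register_forward_hook flow above.
def make_wrapped_op(op, pre_hooks, post_hooks):
    def wrapped(*args, **kwargs):
        for pre in pre_hooks:
            pre(op.__name__, args)          # cf. catch_data(..., PREFIX_PRE)
        out = op(*args, **kwargs)
        for post in post_hooks:
            post(op.__name__, args, out)    # cf. catch_data(..., PREFIX_POST)
        return out
    return wrapped


calls = []


def all_reduce(x):  # stand-in communication op: just doubles its input
    return x * 2


wrapped = make_wrapped_op(
    all_reduce,
    [lambda name, a: calls.append(("pre", name, a))],
    [lambda name, a, o: calls.append(("post", name, o))],
)
result = wrapped(3)
print(result, calls)
# 6 [('pre', 'all_reduce', (3,)), ('post', 'all_reduce', 6)]
```

In the real code the post-hook additionally detects `CommHandle_` results and parks the capture in `PENDING_ASYNC_CC_BY_HANDLE`, so the data is only read once the collective has actually completed.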
+api_register = ApiRegistry() diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/features.py b/debug/accuracy_tools/msprobe/mindspore/monitor/features.py new file mode 100644 index 0000000000000000000000000000000000000000..be958dadfe8fcc50f26f16c93b3a090269235d1e --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/monitor/features.py @@ -0,0 +1,63 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
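`op_aggregate` in `wrap_distributed.py` above reduces the list of per-call metric tensors accumulated during a step into one value per op: min of mins, max of maxes, mean of means, and a plain sum for `norm` and `nans` counts. Its reduction rules can be sketched on plain floats (an assumption for testability; the real function operates on MindSpore tensors under `_no_grad`):

```python
import math


def op_aggregate(op, values):
    """Mirror of the tensor version's per-op reduction rules on floats."""
    if not values:
        return math.nan
    if op == "min":
        return min(values)
    if op == "max":
        return max(values)
    if op in ("norm", "nans"):   # accumulated by summing across calls
        return sum(values)
    if op in ("zeros", "mean"):  # averaged across calls
        return sum(values) / len(values)
    return math.nan


print(op_aggregate("max", [1.0, 3.0, 2.0]))   # 3.0
print(op_aggregate("nans", [0.0, 2.0, 1.0]))  # 3.0
```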
+ +from mindspore import mint, ops, _no_grad +from mindspore import Tensor +from mindspore import dtype as mstype + + +@_no_grad() +def square_sum(x: Tensor): + return (x * x).sum() + + +@_no_grad() +def get_min(x: Tensor): + return mint.min(x) + + +@_no_grad() +def get_mean(x: Tensor): + return mint.mean(x.astype(mstype.float32)) + + +@_no_grad() +def get_norm(x: Tensor): + norm_func = mint.norm if hasattr(mint, "norm") else ops.norm + return norm_func(x.astype(mstype.float32)) + + +@_no_grad() +def get_max(x: Tensor): + return mint.max(x) + + +@_no_grad() +def get_zeros(x: Tensor, eps: float): + return mint.sum(mint.abs(x) < eps) / x.numel() + + +@_no_grad() +def get_nans(t): + return ops.isnan(t.astype(mstype.float32)).sum() + + +FUNC_MAP = {"min" : get_min, + "max" : get_max, + "mean" : get_mean, + "norm" : get_norm, + "nans" : get_nans, + "zeros": get_zeros + } \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/module_hook.py b/debug/accuracy_tools/msprobe/mindspore/monitor/module_hook.py new file mode 100644 index 0000000000000000000000000000000000000000..068be9ff6c782bec2bf637999ef5f0eabe0c2675 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/monitor/module_hook.py @@ -0,0 +1,870 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
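Among the metric functions in `features.py` above, `get_zeros` reports the fraction of elements whose magnitude falls below `eps`, a cheap indicator of dead activations or vanished gradients. The same computation on a plain Python list (the real version uses `mint.sum`/`mint.abs` on tensors) looks like:

```python
def zeros_fraction(values, eps=1e-8):
    # Fraction of near-zero elements, mirroring get_zeros:
    # sum(|x| < eps) / numel.
    if not values:
        return float("nan")
    return sum(1 for v in values if abs(v) < eps) / len(values)


print(zeros_fraction([0.0, 1e-12, 0.5, -2.0]))  # 0.5
```

A value creeping toward 1.0 over training steps is the signal the `zeros` op is meant to surface in the CSV output.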
+ +import os +import re +import uuid +from collections import defaultdict +from datetime import datetime + +import pytz +import mindspore as ms +from mindspore import Tensor, mint +from mindspore import nn, _no_grad +from mindspore.communication import get_rank + +from msprobe.core.common.log import logger +from msprobe.core.common.const import MonitorConst +from msprobe.core.common.file_utils import load_json, save_json +from msprobe.mindspore.monitor.utils import get_summary_writer_tag_name, validate_config, step_accumulates_one, \ + is_skip_step, get_metrics, get_single_metrics, get_target_output_dir +from msprobe.mindspore.monitor.module_spec_verifier import validate_config_spec +from msprobe.mindspore.monitor.anomaly_detect import AnomalyScanner, AnomalyDataFactory, \ + CSVWriterWithAD, BaseWriterWithAD, WriterInput +from msprobe.mindspore.monitor.distributed.wrap_distributed import api_register, create_hooks, op_aggregate, \ + get_process_group + +FORMAT_MAPPING = { + MonitorConst.CSV: CSVWriterWithAD, + MonitorConst.API: BaseWriterWithAD +} + + +def get_output_base_dir(): + return os.getenv(MonitorConst.MONITOR_OUTPUT_DIR, MonitorConst.DEFAULT_MONITOR_OUTPUT_DIR) + + +def get_param_struct(param): + res = {} + if isinstance(param, (tuple, list)): + res['config'] = f'{type(param).__name__}[{len(param)}]' + for i, x in enumerate(param): + res[i] = f'size={tuple(x.shape)}, dtype={x.dtype}' if isinstance(x, Tensor) else f'{type(x)}' + elif isinstance(param, Tensor): + res['config'] = 'tensor' + res['tensor'] = f'size={tuple(param.shape)}, dtype={param.dtype}' + else: + res['config'] = f'{type(param)}' + logger.warning(f'Not support type({type(param)}) now, please check the type of param {param}') + return res + + +def param_is_not_tensor_parallel_duplicate(param, tp_group): + return (hasattr(param, 'tensor_model_parallel') and param.tensor_model_parallel) or ( + mint.distributed.get_rank(group=tp_group) == 0 + ) + + +def 
param_is_data_parallel_duplicate(dp_group): + return mint.distributed.get_rank(group=dp_group) != 0 + + +def squash_param_name(param_name): + for pattern in ['layers?\.(.*)', 'embeddings?\.(.*)', 'final.*', 'output.*', 'norm.*']: + match = re.findall(pattern, param_name) + if match: + return match[0] + return param_name + + +# Used For Module Forward & Backward Collect +class ModuleHookContext: + def __init__(self, module_name) -> None: + self.step = 0 + self.micro_step = 0 + self.actv = defaultdict(dict) + self.actvgrad = [] + self.module_name = module_name + self.struct = {} + self.format_by_arg = {} + self.verified = False + self.focused_in_col = 0 + self.focused_out_col = 0 + self.ignore_in = False # no need to care when no key 'input' or 'input_grad' found + + def set_format_by_arg(self, key_name: str, target_config: dict): + cared = target_config.get(self.module_name, self.struct) + if key_name in cared: + if isinstance(cared[key_name], dict): + # current cared is self.struct + config = cared[key_name].get('config') + self.format_by_arg[key_name] = config + else: + # current cared is target_config[self.module_name] + self.format_by_arg[key_name] = cared[key_name] + elif key_name in ['input', 'input_grad']: + self.ignore_in = True + + def reset(self): + self.actv.clear() + self.actvgrad.clear() + + +start_step = 0 + + +# Used For Optimizer Weight Grad & M/V Collect +class OptimizerContext: + def __init__(self) -> None: + self.step = start_step + self.param_mg_direction = defaultdict(float) + self.param_adam_update = defaultdict() + self.param_adam_ratio = defaultdict() + self.param_weight_grad = defaultdict() + self.param_exp_avg = defaultdict() + self.exp_avg_metric = {} + self.param_exp_avg_sq = defaultdict() + self.exp_avg_sq_metric = {} + self.metric_dict = {} + self.param_metric = {} + + def reset(self) -> None: + self.param_mg_direction.clear() + self.param_adam_update.clear() + self.param_adam_ratio.clear() + self.param_weight_grad.clear() + 
self.param_exp_avg.clear() + self.exp_avg_metric.clear() + self.param_exp_avg_sq.clear() + self.exp_avg_sq_metric.clear() + self.metric_dict.clear() + self.param_metric.clear() + + +# Used For Weight Grad Collect +class GradContext: + def __init__(self) -> None: + self.pre = {} + self.post = {} + self.acc_metric = {} + self.acc = {} + self.actv = {} + + def reset(self): + self.pre.clear() + self.post.clear() + self.acc_metric.clear() + self.acc.clear() + self.actv.clear() + + +class CommunicationContext: + def __init__(self) -> None: + self.data = {} + + @staticmethod + def _agg(data): + aggregated_data = {} + for tag, op2tensorlist in data.items(): + aggregated_data[tag] = {} + for op, tensorlist in op2tensorlist.items(): + aggregated_data[tag][op] = op_aggregate(op, tensorlist) + return aggregated_data + + def reset(self): + self.data = {} + + def aggregate(self): + self.data = self._agg(self.data) + + +class TrainerMon: + def __init__(self, config_file_path, process_group=None, params_have_main_grad=True) -> None: + # TYPE1: 只在这里初始化的变量, 不会随着训练中途config配置改变而重置 + self.config_file_path = config_file_path + self.process_group = process_group + self.params_have_main_grad = params_have_main_grad + self.config_timestamp = 0 # 后面有校验时间戳, 首次监控无需为了更新config文件时间戳而去改, 可通过dynamic_on开关直接打开 + self.config = load_json(config_file_path) + validate_config(self.config) + + local_tz = pytz.timezone("Asia/Shanghai") # 根据需要调整为目标时区 + cur_time = datetime.now(local_tz).strftime('%b%d_%H-%M-%S') + self.unique_id = str(uuid.uuid4())[:8] + self.output_base_dir = get_output_base_dir() + time_tags = self.config.get("append_output", []) + try: + self.rank = get_rank() + if time_tags: + output_append_dirs = get_target_output_dir(self.output_base_dir, time_tags[0], time_tags[1]) + if str(self.rank) in output_append_dirs: + self.tensorboard_dir = output_append_dirs[str(self.rank)] + logger.info(f"Append rank({self.rank}) result to {self.tensorboard_dir}") + else: + self.tensorboard_dir = 
os.path.join(self.output_base_dir, + f"{cur_time}-rank{self.rank}-{self.unique_id}") + except Exception as e: + self.rank = 0 + self.tensorboard_dir = os.path.join(self.output_base_dir, f"{cur_time}-rank{self.rank}-{self.unique_id}") + + self.pp_stage = 0 + self.group_mates = [0] + + # TYPE2: 只会在set_monitor()主调中赋值的变量 + self.model = None + self.vpp = False + self.dp_group = None + self.tp_group = None + self.micro_batch_number = 1 + + # TYPE3: 会随着训练中途config配置更新或监控状态改变而重置的变量 + self.module_fwd_hook_context_by_module = defaultdict(ModuleHookContext) + self.module_bwd_hook_context_by_module = defaultdict(ModuleHookContext) + self.optimizer_context = defaultdict(OptimizerContext) + self.cc_context = defaultdict(CommunicationContext) + self.grad_context = GradContext() + self.handles = defaultdict(list) + self.param2name = defaultdict(str) + self.name2index = defaultdict() + self.name2indices = defaultdict() + self.name2param = {} + self.duplicate_param = {} + self.name2tag = {} + self.param_name_call_id = {} + self.call_id = 0 + self.module_struct = defaultdict(dict) + self.grad_accs = [] + self.weight_hooked = False + self.optimizer_hooked = False + self.param_registered = False + self.struct_printed = False + + # 动静态区分 + self.dynamic_enable = os.getenv("DYNAMIC_MONITOR", 'False').lower() == 'true' + if self.dynamic_enable: + logger.warning(f"DYNAMIC_MONITOR is set, " + f"please make sure you have 'dynamic_on' and 'collect_times' in {self.config_file_path}") + self.monitoring = False + else: + self.set_config() + # 静态且collect_times>0时在第0步self.monitoring就可以True, 动态默认在下一步开启 + if self.collect_times > 0: + self.monitoring = True + + def set_config(self): + self.start_step = self.config.get("start_step", 0) + self.collect_times = self.config.get("collect_times", 100000000) # 默认大值, 目的是一直采集 + self.step_interval = self.config.get("step_interval", 1) + self.has_collect_times = 0 # 重设采集计数器 + self.print_struct = self.config.get("print_struct", False) + self.targets = 
self.config.get("targets", None) + self.is_select = self.config.get("is_select", False) + self.module_rank_list = self.config.get("module_ranks", []) + self.format = self.config.get('format', MonitorConst.CSV) # only csv supported in mindspore + self.eps = self.config.get('eps', 1e-8) + self.ops = self.config.get('ops', []) # monitor mean/max/norm/min/nan... + self.ndigits = self.config.get('ndigits', 6) + self.all_xy = self.config.get('all_xy', False) + self.xy_distribution = self.config.get('xy_distribution', False) + self.forward_only = self.config.get('forward_only', False) + self.backward_only = self.config.get('backward_only', False) + self.ur_distribution = self.config.get('ur_distribution', False) # vector and ratio vector of adam + self.mv_distribution = self.config.get("mv_distribution", False) # m/v of adam + self.wg_distribution = self.config.get("wg_distribution", False) + self.param_distribution = self.config.get("param_distribution", False) + self.mg_direction = self.config.get('mg_direction', False) # main grad direction + self.cc_distribution = self.config.get("cc_distribution", {}) # communication ops + if not self.cc_distribution.get('enable', False): + self.cc_log_only = False + else: + self.cc_codeline = self.cc_distribution.get('cc_codeline', []) + self.cc_log_only = self.cc_distribution.get('cc_log_only', False) + self.cc_logged_stack = defaultdict(set) + self.cc_pre_hook = self.cc_distribution.get('cc_pre_hook', False) + self.handles['cc'] = api_register.initialize_hook(*create_hooks(context=self.cc_context, monitor=self)) + api_register.redirect_api() + self.common_info() + + # 初始化AnomalyData工厂 + alert_setting = self.config.get('alert', {"rules": []}) + self.alert_rules = AnomalyScanner.load_rules(alert_setting["rules"]) + self.anomaly_data_factory = None + if alert_setting.get('dump', False): + self.anomaly_data_factory = AnomalyDataFactory(self.rank, self.pp_stage, self.group_mates) + + # 初始化writer, 创建输出目录 + if self.format not in 
FORMAT_MAPPING: + logger.error(f"Unsupported format: {self.format}, use default format: {MonitorConst.CSV}") + self.format = MonitorConst.CSV + writer = FORMAT_MAPPING[self.format] + self.step_count_per_record = self.config.get('step_count_per_record', 1) + self.summary_writer = writer( + WriterInput( + self.tensorboard_dir, + self.alert_rules, + self.unique_id, + self.anomaly_data_factory, + self.ndigits, + self.step_count_per_record + ) + ) + + def common_info(self): + if not self.xy_distribution: + logger.info("> module input/output input_grad/output_grad is not monitored. ") + if self.forward_only: + logger.info("> only module forward is monitored. ") + if not self.ur_distribution: + logger.info("> update vector and ratio vector of adam is not monitored. ") + if not self.mv_distribution: + logger.info("> momentum and variance of adam is not monitored. ") + if not self.wg_distribution: + logger.info("> weight grad of specified module is not monitored. ") + if not self.mg_direction: + logger.info('> grad and momentum direction will not be compared.') + if not self.cc_distribution.get('enable', False): + logger.info("> cc operator is not monitored.") + + def set_monitor( + self, + model, + optimizer, + grad_acc_steps=1, + tp_group=None, + dp_group=None, + start_iteration=0 + ): + global start_step + start_step = start_iteration + self.micro_batch_number = grad_acc_steps + self.dp_group = dp_group + self.tp_group = tp_group + self.hook_step_final(optimizer) + if not isinstance(model, list): + model = [model] + self.model = model + if len(model) > 1: + self.vpp = True + logger.info('vpp enabled') + if not self.dynamic_enable: + self.register_hooks(optimizer) + + def hook_step_final(self, optimizer): + def step_final_hook(optimizer, *args, **kwargs): + context = self.optimizer_context[optimizer] + # 静态在第0步就可以保存, 动态在第0步不可以, 因为动态设计的就是重置后下一步开启, 第0步的self.monitoring还是False + if self.monitoring: + module_rank_valid = self.is_target_rank() + step_condition = (context.step 
>= self.start_step and ( + context.step - self.start_step) % self.step_interval == 0) + if module_rank_valid and step_condition: + self.has_collect_times += 1 + self.write_xy_tb(context.step) + self.write_grad_tb(context.step) + self.write_mv_tb(context) + self.write_param_tb(context) + + if context.metric_dict: + self.summary_writer.write_metrics(self.ops, context.metric_dict, context.step, 'other') + context.metric_dict.clear() + + self.summary_writer.clear_anomalies() + self.call_id = 0 + self.param_name_call_id.clear() + + if self.has_collect_times >= self.collect_times: + self._remove_all_hooks_final(optimizer) + + context.step += 1 + self.dynamic_monitor(optimizer) + + optimizer.register_forward_hook(step_final_hook) + return + + def dynamic_monitor(self, optimizer): + """ + If dynamic monitor enabled and config.json updated, + remove hooks and register new hooks according to new configuration. + """ + context = self.optimizer_context[optimizer] + if not self.dynamic_enable: + return + try: + # 如果文件时间戳没变, 可以不读取节省时间 + config_timestamp = os.path.getmtime(self.config_file_path) + if config_timestamp == self.config_timestamp: + return + # 更新config文件最新修改时间戳 + self.config_timestamp = config_timestamp + config = load_json(self.config_file_path) + except Exception as e: + logger.error(f"get config.json wrong because {e}, not updated, please check!!!") + return + + if config.get("dynamic_on", False): + try: + validate_config(config) + self.config = config + self.set_config() + logger.warning(f"config is updated at step{context.step - 1}, " + f"will start new hook at step{context.step}.") + except Exception as e: + logger.error(f"set config wrong because {e}, not updated, please check!!!") + return + + self._remove_all_hooks() + self.register_hooks(optimizer) + + def register_hooks(self, optimizer): + self._register_param_name() + self.hook_modules() + self.hook_optimizer(optimizer) + self._patch_grad_sync() + self.monitoring = True + + def hook_modules(self): + if not 
self.is_target_rank(): + return + module_in_all_stage = [key for key in self.targets.keys() if MonitorConst.NAME_SEP not in key] + + for key in module_in_all_stage: + struct = self.targets.pop(key) + self.targets.update( + {f'{vpp_stage}{MonitorConst.NAME_SEP}{key}': struct for vpp_stage in range(len(self.model))}) + + hooked_count = 0 + for vpp_stage, model_chunk in enumerate(self.model): + if not isinstance(model_chunk, nn.Cell): + logger.info("Target Model is not Cell") + continue + vpp_stage = f'{vpp_stage}{MonitorConst.NAME_SEP}' + targets = [x for x, _ in model_chunk.cells_and_names()] if self.print_struct else self.targets.keys() + hooked_count += self._hook_module(targets, model_chunk, vpp_stage) + logger.info(f"> {hooked_count} modules are monitored.") + + def hook_optimizer(self, optimizer): + def optimizer_pre_hook_function(opt, grad_names, gradients): + context = self.optimizer_context[opt] + if is_skip_step(context.step, self.start_step, self.step_interval, self.has_collect_times, + self.collect_times): + return + gradient_list = gradients[0] if isinstance(gradients, tuple) else gradients + is_select = self.is_select + for idx, grad in enumerate(gradient_list): + grad_name = grad_names[idx] + if is_select and grad_name not in self.targets: + continue + get_single_metrics(self.ops, grad_name, grad, context.param_weight_grad) + + if self.mv_distribution: + # fetch mean + for param in m_list: + name = param.name + if is_select and name not in self.targets: + continue + get_single_metrics(self.ops, name, param, context.exp_avg_metric) + # fetch variance + for param in v_list: + name = param.name + if is_select and name not in self.targets: + continue + get_single_metrics(self.ops, name, param, context.exp_avg_sq_metric) + if self.param_distribution: + for param in param_list: + get_single_metrics(self.ops, param.name, param, context.param_metric) + self.generate_wgrad_metrics() + metric_dict = {} + for cc in self.cc_context.values(): + cc.aggregate() + 
metric_dict.update(cc.data) + cc.reset() + + if not metric_dict: + return + context.metric_dict = metric_dict + return + + def optimizer_pre_hook_wrapper(func, grad_names): + def wrapper(opt, gradients): + return func(opt, grad_names, gradients) + return wrapper + + if self.optimizer_hooked or not self.is_target_rank(): + return + + m_list = [] + v_list = [] + param_list = [] + grad_names = [] + for param in optimizer.get_parameters(): + if MonitorConst.EXP_AVG_SQ in param.name: + v_list.append(param) + elif MonitorConst.EXP_AVG in param.name: + m_list.append(param) + elif param.name in ['global_step', 'learning_rate']: + pass + else: + param_list.append(param) + grad_names.append(param.name) + + handle = optimizer.register_forward_pre_hook( + optimizer_pre_hook_wrapper(optimizer_pre_hook_function, grad_names)) + self.handles['optimizer'].append(handle) + self.optimizer_hooked = True + return + + def generate_wgrad_metrics(self): + if not self.wg_distribution: + return {}, {} + + if self.weight_hooked: + try: + get_metrics(self.ops, self.grad_context.acc, self.eps, self.grad_context.acc_metric) + except Exception as e: + logger.warning(f"An error occurred while generating wgrad pre metrics") + return {}, {} + + grad_dict = {} + for param, name in self.param2name.items(): + if self.duplicate_param.get(name, False): + continue + grad = param.main_grad if self.params_have_main_grad else param.grad + if grad is None: + logger.warning(f"grad is None: {name}, maybe something wrong happened.") + continue + tag = self.name2tag.get(name, {}).get(MonitorConst.POST_GRAD) + self._register_param_call_id("hook_optimizer", tag) + grad_dict[tag] = grad + try: + get_metrics(self.ops, grad_dict, self.eps, self.grad_context.post) + except Exception as e: + logger.warning(f"An error occurred while generating wgrad post metrics") + return {}, {} + return self.grad_context.post, self.grad_context.pre + + def write_xy_tb(self, step): + if not self.xy_distribution: + return + for _, 
fwd_context in self.module_fwd_hook_context_by_module.items(): + if len(fwd_context.actv) == 0: + continue + self.summary_writer.write_metrics(self.ops, fwd_context.actv, step, 'actv') + fwd_context.actv.clear() + if self.grad_context.actv: + self.summary_writer.write_metrics(self.ops, self.grad_context.actv, step, 'actv_grad') + + def write_param_tb(self, opt_context): + if not self.param_distribution: + return + self.summary_writer.write_metrics(self.ops, opt_context.param_metric, opt_context.step, 'param') + + def write_mv_tb(self, opt_context): + if not self.mv_distribution: + return + self.summary_writer.write_metrics(self.ops, opt_context.exp_avg_metric, opt_context.step, 'exp_avg') + self.summary_writer.write_metrics(self.ops, opt_context.exp_avg_sq_metric, opt_context.step, 'exp_avg_sq') + + def write_grad_tb(self, step): + if not self.wg_distribution: + return + + self.summary_writer.write_metrics(self.ops, self.grad_context.acc_metric, step, 'grad_unreduced') + self.summary_writer.write_metrics(self.ops, self.grad_context.post, step, 'grad_reduced') + + def is_target_rank(self): + if self.module_rank_list and (self.rank not in self.module_rank_list): + return False + return True + + def build_tbtag_tensor_map(self, module_name, tag, tensor): + metrics = {} + key = get_summary_writer_tag_name(module_name, tag, str(self.rank)) + if isinstance(tensor, Tensor): + self._register_param_call_id("_hook_module", key) + metrics[key] = tensor + return metrics + + def _register_param_name(self): + for vpp_stage, model_chunk in enumerate(self.model): + prefix = f'{vpp_stage}{MonitorConst.NAME_SEP}' + self._register_chunk(model_chunk, prefix) + + def _register_chunk(self, model_chunk, prefix): + index = 0 + for param in model_chunk.get_parameters(): + param_name = param.name + if not param.requires_grad: + continue + if self._is_target_param(param_name, param, prefix): + name = prefix + squash_param_name(param_name) + if name in self.param2name.values(): + name = 
prefix + param_name + self.param2name[param] = name + self.name2param[name] = param + self.name2index[name] = index + + if self.tp_group and not param_is_not_tensor_parallel_duplicate(param, self.tp_group): + self.duplicate_param[name] = True + if self.dp_group and param_is_data_parallel_duplicate(self.dp_group): + self.duplicate_param[name] = True + self.name2tag[name] = { + MonitorConst.PRE_GRAD: get_summary_writer_tag_name(name, MonitorConst.PRE_GRAD, self.rank), + MonitorConst.POST_GRAD: get_summary_writer_tag_name(name, MonitorConst.POST_GRAD, self.rank) + } + index += 1 + + def _hook_module(self, target_names, module, vpp_stage=''): + if not isinstance(module, nn.Cell): + # nothing to hook + return 0 + + def fwd_hook_fun(module, module_input, module_output, name): + if module not in self.module_fwd_hook_context_by_module: + self.module_fwd_hook_context_by_module[module] = ModuleHookContext(name) + context: ModuleHookContext = self.module_fwd_hook_context_by_module[module] + if not context.struct: + context.struct = { + MonitorConst.ACTV_IN: get_param_struct(module_input), + MonitorConst.ACTV_OUT: get_param_struct(module_output) + } + if self.print_struct: + self.module_struct[context.module_name].update(context.struct) + return + if not module.training: + return + if is_skip_step(context.step, self.start_step, self.step_interval, self.has_collect_times, + self.collect_times): + step_accumulates_one(context, self.micro_batch_number) + return + if not context.format_by_arg: + context.set_format_by_arg(MonitorConst.ACTV_IN, self.targets) + context.set_format_by_arg(MonitorConst.ACTV_OUT, self.targets) + if not context.format_by_arg: + return + if not context.verified: + if not context.ignore_in: + context.focused_in_col = validate_config_spec(context.format_by_arg[MonitorConst.ACTV_IN], + module_input, context.module_name, + MonitorConst.ACTV_IN) + context.focused_out_col = validate_config_spec(context.format_by_arg[MonitorConst.ACTV_OUT], + module_output, 
context.module_name,
+                                                               MonitorConst.ACTV_OUT)
+                context.verified = True
+
+            tbtag_tensor_map = {}
+            if not context.ignore_in:
+                cared_input = module_input if context.focused_in_col is None else module_input[context.focused_in_col]
+                tbtag_tensor_map.update(
+                    self.build_tbtag_tensor_map(f'{context.module_name}_{context.micro_step}', MonitorConst.ACTV_IN,
+                                                cared_input))
+            cared_output = module_output if context.focused_out_col is None else module_output[context.focused_out_col]
+            tbtag_tensor_map.update(
+                self.build_tbtag_tensor_map(f'{context.module_name}_{context.micro_step}', MonitorConst.ACTV_OUT,
+                                            cared_output))
+            try:
+                get_metrics(self.ops, tbtag_tensor_map, self.eps, context.actv)
+            except Exception as e:
+                logger.warning(f"An error occurred while generating forward activation metrics: {e}")
+
+            step_accumulates_one(context, self.micro_batch_number)
+            return
+
+        def bwd_hook_fun(module, input_grad, output_grad):
+            context: ModuleHookContext = self.module_bwd_hook_context_by_module[module]
+            if not context.struct:
+                context.struct = {
+                    MonitorConst.ACTVGRAD_IN: get_param_struct(input_grad),
+                    MonitorConst.ACTVGRAD_OUT: get_param_struct(output_grad)
+                }
+            if self.print_struct:
+                self.module_struct[context.module_name].update(context.struct)
+                return
+
+            if is_skip_step(context.step, self.start_step, self.step_interval, self.has_collect_times,
+                            self.collect_times):
+                step_accumulates_one(context, self.micro_batch_number)
+                return
+
+            if not context.format_by_arg:
+                context.set_format_by_arg(MonitorConst.ACTVGRAD_IN, self.targets)
+                context.set_format_by_arg(MonitorConst.ACTVGRAD_OUT, self.targets)
+            if not context.format_by_arg:
+                return
+            if not context.verified:
+                if not context.ignore_in:
+                    context.focused_in_col = validate_config_spec(context.format_by_arg[MonitorConst.ACTVGRAD_IN],
+                                                                  input_grad, context.module_name,
+                                                                  MonitorConst.ACTVGRAD_IN)
+                context.focused_out_col = validate_config_spec(context.format_by_arg[MonitorConst.ACTVGRAD_OUT],
+                                                               output_grad,
context.module_name, + MonitorConst.ACTVGRAD_OUT) + context.verified = True + + tbtag_tensor_map = {} + if not context.ignore_in: + cared_input_grad = input_grad if context.focused_in_col is None else input_grad[context.focused_in_col] + tbtag_tensor_map.update( + self.build_tbtag_tensor_map( + f'{context.module_name}_{context.micro_step}', MonitorConst.ACTVGRAD_IN, cared_input_grad)) + cared_output_grad = output_grad if context.focused_out_col is None else output_grad[context.focused_out_col] + tbtag_tensor_map.update( + self.build_tbtag_tensor_map(f'{context.module_name}_{context.micro_step}', MonitorConst.ACTVGRAD_OUT, + cared_output_grad)) + + if context.micro_step == 0 and context.actvgrad: + logger.warning(f"actvgrad context of {context.module_name} is not empty when first micro_step, " + f"maybe something wrong happened. Now clear it.") + context.actvgrad.clear() + try: + get_metrics(self.ops, tbtag_tensor_map, self.eps, self.grad_context.actv) + except Exception as e: + logger.warning(f"An error occurred while generating backward activation metrics: {e}") + + step_accumulates_one(context, self.micro_batch_number) + return + + def fwd_hook_fun_wrapper(fwd_hook_fun, name): + def wrapper(module, module_input, module_output): + return fwd_hook_fun(module, module_input, module_output, name) + return wrapper + + if self.backward_only and self.forward_only: + logger.warning('not enable backward_only and forward_only simultaneously') + hooked_count = 0 + if self.xy_distribution or self.print_struct: + for module_name, submodule in module.cells_and_names(): + name = self._is_target_module(module_name, target_names, vpp_stage) + if not name: + continue + if not self.backward_only: + handle = submodule.register_forward_hook(fwd_hook_fun_wrapper(fwd_hook_fun, name=name)) + self.handles['xy'].append(handle) + if not self.forward_only: + handle = submodule.register_backward_hook(bwd_hook_fun) + self.handles['xy'].append(handle) + 
self.module_bwd_hook_context_by_module[submodule] = ModuleHookContext(name) + logger.info(f"> {name} is monitored successfully") + hooked_count += 1 + return hooked_count + + def _patch_grad_sync(self): + if not self.wg_distribution: + return + self._hook_weights() + + def _hook_weights(self): + context = self.grad_context + + @_no_grad() + def param_hook(grad, context_dict, param, key): + param.micro_step += 1 + self._register_param_call_id("param_hook", key) + if param.micro_step == self.micro_batch_number: + param.micro_step = 0 + context_dict[key] = grad + + def param_hook_wrapper(param_hook, context_dict, param, key): + def wrapper(grad): + return param_hook(grad, context_dict, param, key) + return wrapper + + for param, name in self.param2name.items(): + key = get_summary_writer_tag_name(name, 'acc_grad', self.rank) + setattr(param, 'micro_step', 0) + handle = param.register_hook(param_hook_wrapper(param_hook, context_dict=context.acc, param=param, key=key)) + self.handles['wgrads'].append(handle) + self.weight_hooked = True + + def _is_target_param(self, param_name, param, prefix): + if not self.targets: + return True + squash_name = prefix + squash_param_name(param_name) + name = prefix + param_name + for target in self.targets.keys(): + if param_name.startswith(target) or squash_name.startswith(target) or name.startswith(target): + setattr(param, "zero_out_wgrad", True) + return True + return False + + def _is_target_module(self, module_name, targets, vpp_stage): + if self.all_xy or self.print_struct: + return vpp_stage + squash_param_name(module_name) + for pattern in [ + vpp_stage + squash_param_name(module_name), + vpp_stage + module_name, + ]: + if pattern in targets: + return pattern + return "" + + def _register_param_call_id(self, hook_name: str, key: str): + """ + :param hook_name: + :param key: str, '0:relu_0/output_grad' + :return: + """ + logger.debug(f"{hook_name} {key}: {self.call_id}") + self.param_name_call_id[key] = self.call_id + 
self.call_id += 1
+
+    def _remove_all_hooks(self):
+        # remove the activation hook handles
+        for handle in self.handles['xy']:
+            handle.remove()
+        self.handles['xy'].clear()
+        # clear the corresponding context caches
+        for _, fwd_context in self.module_fwd_hook_context_by_module.items():
+            fwd_context.reset()
+        for _, bwd_context in self.module_bwd_hook_context_by_module.items():
+            bwd_context.reset()
+        self.grad_context.reset()  # holds both weight grads and activation grads
+
+        for handle in self.handles['wgrads']:
+            handle.remove()
+        self.handles['wgrads'].clear()
+        self.weight_hooked = False
+
+        if self.optimizer_hooked:
+            for handle in self.handles['optimizer']:
+                handle.remove()
+            self.handles['optimizer'].clear()
+            for _, context in self.optimizer_context.items():
+                context.reset()
+            self.optimizer_hooked = False
+
+        for handle in self.handles['cc']:
+            handle.remove()
+        self.handles['cc'].clear()
+        for _, context in self.cc_context.items():
+            context.reset()
+
+        # clear the name caches
+        self.param2name.clear()
+        self.name2index.clear()
+        self.name2indices.clear()
+        self.name2param.clear()
+        self.duplicate_param.clear()
+        self.name2tag.clear()
+        self.module_struct.clear()
+        self.grad_accs.clear()
+
+        # switch off monitoring
+        self.monitoring = False
+
+    def _remove_all_hooks_final(self, optimizer):
+        if self.dynamic_enable:
+            # automatically reset dynamic_on to False when finished; the user re-enables it manually
+            try:
+                config = load_json(self.config_file_path)
+                config['dynamic_on'] = False
+                save_json(self.config_file_path, config, indent=2)
+                config_timestamp = os.path.getmtime(self.config_file_path)
+                self.config_timestamp = config_timestamp
+                logger.info(
+                    "Finish monitor, set dynamic_on=False in config; to restart, set it to True and update the config")
+            except Exception as e:
+                logger.warning(f"Finish monitor, failed to set dynamic_on=False in config because {e}, please check")
+        logger.info("Finish monitor")
+        self._remove_all_hooks()
diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/module_spec_verifier.py b/debug/accuracy_tools/msprobe/mindspore/monitor/module_spec_verifier.py
new file mode 100644
index
0000000000000000000000000000000000000000..c06e8ea10f6a2178c3670e596ad64e333db44cab
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/monitor/module_spec_verifier.py
@@ -0,0 +1,94 @@
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import re
+import abc
+from mindspore import Tensor
+
+from msprobe.core.common.log import logger
+
+
+# registry of all ConfigValidator implementations
+config_validator_registry = {}
+
+
+def register_config_validator(cls):
+    """Decorator that registers a ConfigValidator implementation."""
+    config_validator_registry[cls.__name__] = cls
+    return cls
+
+
+class ConfigValidator(metaclass=abc.ABCMeta):
+    @abc.abstractmethod
+    def check_pattern_match(self, config_spec: str):
+        pass
+
+    @abc.abstractmethod
+    def validate(self, actual_data, module_name: str, data_type: str, pattern_match):
+        pass
+
+
+@register_config_validator
+class TensorValidator(ConfigValidator):
+    def check_pattern_match(self, config_spec: str):
+        pattern = re.compile(r"tensor")
+        return pattern.match(config_spec)
+
+    def validate(self, actual_data, module_name: str, data_type: str, pattern_match):
+        if not isinstance(actual_data, Tensor):
+            raise ValueError(
+                f"Format of {module_name} {data_type} does not match the required format 'tensor' in config.")
+
+
+@register_config_validator
+class TupleValidator(ConfigValidator):
+    def check_pattern_match(self, config_spec: str):
+        pattern = re.compile(r"tuple\[(\d+)\]:?(\d+)?")
+        return
pattern.match(config_spec)
+
+    def validate(self, actual_data, module_name: str, data_type: str, pattern_match):
+        length, index = pattern_match.groups()
+        if index is None:
+            index = 0
+        length, index = int(length), int(index)
+
+        if not (0 <= index < length):
+            raise ValueError(
+                f"Format of {module_name} {data_type} in config.json does not match the required format 'tuple[x]:y'. "
+                f"y must be greater than or equal to 0 and less than x.")
+        if not isinstance(actual_data, tuple):
+            raise ValueError(
+                f"Type of {module_name} {data_type} does not match spec of config.json, should be tuple, please check.")
+        if len(actual_data) != length:
+            raise ValueError(
+                f"Length of {module_name} {data_type} does not match spec of config.json, should be {length}, "
+                f"actual is {len(actual_data)}, please check.")
+        return index
+
+
+def validate_config_spec(config_spec: str, actual_data, module_name: str, data_type: str):
+    focused_col = None
+    for _, validator_cls in config_validator_registry.items():
+        config_validator = validator_cls()
+        pattern_match = config_validator.check_pattern_match(config_spec)
+        if pattern_match:
+            try:
+                focused_col = config_validator.validate(actual_data, module_name, data_type, pattern_match)
+            except ValueError as e:
+                logger.warning(f"config spec validate failed: {str(e)}")
+            return focused_col
+    logger.warning(f"config spec in {module_name} {data_type} not supported, "
+                   rf"expected spec: 'tuple\[(\d+)\]:(\d+)' or 'tensor', actual spec: {config_spec}.")
+    return focused_col
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/mindspore/monitor/utils.py b/debug/accuracy_tools/msprobe/mindspore/monitor/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..506ad6c3f91c7c73e5e12109a6ea617309df72c0
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/mindspore/monitor/utils.py
@@ -0,0 +1,301 @@
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import re +from datetime import datetime +from mindspore import dtype as mstype, Tensor + +from msprobe.mindspore.monitor.features import FUNC_MAP +from msprobe.core.common.const import MonitorConst +from msprobe.core.common.utils import is_int +from msprobe.core.common.log import logger +from msprobe.core.common.file_utils import check_file_or_directory_path + + +def get_single_metrics(op_list, tag, tensor, output=None): + if output is None: + output = {} + if tag not in output: + output[tag] = {} + for op in op_list: + func = FUNC_MAP.get(op) + statistic = func(tensor) + if hasattr(statistic, "dtype") and statistic.dtype == mstype.bfloat16: + statistic = float(statistic) + statistic = Tensor(statistic) + output[tag][op] = statistic.astype(mstype.float32) + + +def get_metrics(op_list, tag2tensor, eps, output=None): + if output is None: + output = {} + for tag, tensor in tag2tensor.items(): + if tag not in output: + output[tag] = {} + get_single_metrics(op_list, tag, tensor, output) + return output + + +def get_summary_writer_tag_name(module_or_param_name: str, tag: str, rank): + if rank is None: + return f"{module_or_param_name}/{tag}" + else: + return f"{module_or_param_name}/rank{rank}/{tag}" + + +def step_accumulates_one(context, micro_batch_number): + """ + :param context: ModuleHookContext + :param micro_batch_number: mbs of training model. 
+    :return:
+    """
+    context.micro_step += 1
+    if context.micro_step == micro_batch_number:
+        context.micro_step = 0
+        context.step += 1
+
+
+def is_skip_step(step, start_step, step_interval, has_collect_times=0, collect_times=1e8):
+    """
+    Skip the current step if it is earlier than start_step, does not fall on step_interval,
+    or the collection budget is already used up.
+    :param step: current training step, int
+    :param start_step: int
+    :param step_interval: int
+    :return: whether to skip, bool
+    """
+    return step < start_step or (step - start_step) % step_interval != 0 or has_collect_times >= collect_times
+
+
+def validate_ops(ops):
+    if not isinstance(ops, list):
+        raise TypeError("ops should be a list")
+    valid_ops = []
+    for op in ops:
+        if op not in MonitorConst.OP_LIST:
+            logger.warning(f"op {op} is not supported. Optional ops: {MonitorConst.OP_LIST}")
+            continue
+        valid_ops.append(op)
+    if not valid_ops:
+        default_op = MonitorConst.OP_LIST[0]
+        valid_ops.append(default_op)
+        logger.info(f"There are no valid ops, default op {default_op} is used")
+    return valid_ops
+
+
+def validate_ranks(ranks):
+    if not isinstance(ranks, list):
+        raise TypeError("module_ranks should be a list")
+    for rank in ranks:
+        if not isinstance(rank, str):
+            raise TypeError(f"element in module_ranks should be a str, got {type(rank)}")
+
+
+def validate_targets(targets):
+    if not isinstance(targets, dict):
+        raise TypeError('targets in config.json should be a dict')
+    for module_name, field in targets.items():
+        if not isinstance(module_name, str):
+            raise TypeError('key of targets should be module_name[str] in config.json')
+        if not isinstance(field, dict):
+            raise TypeError('values of targets should be the cared fields, e.g. 
{"input": "tensor"} in config.json') + + +def validate_print_struct(print_struct): + if not isinstance(print_struct, bool): + raise TypeError("print_struct should be a bool") + + +def validate_ur_distribution(ur_distribution): + if not isinstance(ur_distribution, bool): + raise TypeError('ur_distribution should be a bool') + + +def validate_xy_distribution(xy_distribution): + if not isinstance(xy_distribution, bool): + raise TypeError('xy_distribution should be a bool') + + +def validate_wg_distribution(wg_distribution): + if not isinstance(wg_distribution, bool): + raise TypeError('wg_distribution should be a bool') + + +def validate_mg_distribution(mg_distribution): + if not isinstance(mg_distribution, bool): + raise TypeError('mg_distribution should be a bool') + + +def validate_param_distribution(param_distribution): + if not isinstance(param_distribution, bool): + raise TypeError('param_distribution should be a bool') + + +def validate_cc_distribution(cc_distribution): + if not isinstance(cc_distribution, dict): + raise TypeError('cc_distribution should be a dictionary') + expected_keys = { + 'enable': bool, + 'cc_codeline': list, + 'cc_pre_hook': bool, + 'cc_log_only': bool + } + for key, value in cc_distribution.items(): + if key in expected_keys: + if not isinstance(value, expected_keys[key]): + raise TypeError(f'cc_distribution {key} should be a {expected_keys[key].__name__}') + else: + raise TypeError(f'{key} of cc_distribution is not supported.') + + +def validate_alert(alert): + if not isinstance(alert, dict): + raise TypeError('alert should be a dictionary') + rules = alert.get('rules') + if rules and isinstance(rules, list): + for rule in rules: + rule_name = rule.get("rule_name") + if rule_name and rule_name not in MonitorConst.RULE_NAME: + raise TypeError(f"{rule_name} is not supported") + args = rule.get("args") + if args and isinstance(args, dict): + threshold = args.get("threshold") + if not isinstance(threshold, float) or threshold < 0: + raise 
TypeError('threshold must be a float and not less than 0')
+    dump = alert.get('dump')
+    if dump and not isinstance(dump, bool):
+        raise TypeError('dump must be bool.')
+
+
+def validate_step_count_per_record(step_count_per_record):
+    if not is_int(step_count_per_record):
+        raise TypeError('step_count_per_record must be int.')
+    if step_count_per_record < 1:
+        raise ValueError("step_count_per_record must be greater than 0")
+    if step_count_per_record > 1e6:
+        raise ValueError("step_count_per_record must not exceed 1e6")
+
+
+def validate_start_step(start_step):
+    if not is_int(start_step):
+        raise TypeError('start_step must be int.')
+    if start_step < 0:
+        raise ValueError("start_step must be non-negative")
+    if start_step > 1e8:
+        raise ValueError("start_step must not exceed 1e8")
+
+
+def validate_step_interval(step_interval):
+    if not is_int(step_interval):
+        raise TypeError('step_interval must be int.')
+    if step_interval < 1:
+        raise ValueError("step_interval must be at least 1")
+    if step_interval > 1e8:
+        raise ValueError("step_interval must not exceed 1e8")
+
+
+def validate_collect_times(collect_times):
+    if not is_int(collect_times):
+        raise TypeError('collect_times must be int.')
+    if collect_times < 1:
+        raise ValueError("collect_times must be at least 1")
+
+
+def validate_config(config):
+    config['ops'] = validate_ops(config.get('ops', []))
+
+    eps = config.get('eps', 1e-8)
+    if not isinstance(eps, float):
+        raise TypeError("eps should be a float")
+
+    ranks = config.get("module_ranks", [])
+    validate_ranks(ranks)
+
+    targets = config.get("targets", {})
+    validate_targets(targets)
+
+    print_struct = config.get('print_struct', False)
+    validate_print_struct(print_struct)
+
+    ur_distribution = config.get('ur_distribution', False)
+    validate_ur_distribution(ur_distribution)
+
+    xy_distribution = config.get('xy_distribution', False)
+    validate_xy_distribution(xy_distribution)
+
+    wg_distribution = config.get('wg_distribution', False)
+
validate_wg_distribution(wg_distribution) + + mg_distribution = config.get('mg_distribution', False) + validate_mg_distribution(mg_distribution) + + param_distribution = config.get('param_distribution', False) + validate_param_distribution(param_distribution) + + cc_distribution = config.get('cc_distribution', {}) + validate_cc_distribution(cc_distribution) + + alert = config.get('alert', {}) + validate_alert(alert) + + step_count_per_record = config.get('step_count_per_record', 1) + validate_step_count_per_record(step_count_per_record) + + start_step = config.get('start_step', 0) + validate_start_step(start_step) + + step_interval = config.get('step_interval', 1) + validate_step_interval(step_interval) + + collect_times = config.get('collect_times', int(1e8)) + validate_collect_times(collect_times) + + if not targets: + if xy_distribution: + config["all_xy"] = True + config["targets"] = {"": {}} + config["is_select"] = False + else: + config["is_select"] = True + + +def time_str2time_digit(time_str): + time_format = '%b%d_%H-%M-%S' + try: + time_digit = datetime.strptime(time_str, time_format) + except Exception as e: + raise RuntimeError(f"illegal timestamp: {time_str}, timestamp should be prefix \ + of existing output dirpath, like 'Dec03_21-34-40'.") from e + return time_digit + + +def get_target_output_dir(monitor_path, time_start, time_end): + check_file_or_directory_path(monitor_path, isdir=True) + time_start = time_str2time_digit(time_start) if time_start is not None else time_start + time_end = time_str2time_digit(time_end) if time_end is not None else time_end + if time_start and time_end and time_start > time_end: + raise ValueError(f"time_start({time_start}) greater than time_end({time_end})") + result = {} + for dirname in os.listdir(monitor_path): + match = re.match(MonitorConst.OUTPUT_DIR_PATTERN, dirname) + if not match: + continue + time_tag = match.group(1) + rank = match.group(2) + target_time = time_str2time_digit(time_tag) + start_ok = 
time_start is None or target_time >= time_start + end_ok = time_end is None or target_time <= time_end + if start_ok and end_ok: + result[rank] = os.path.join(monitor_path, dirname) + return result diff --git a/debug/accuracy_tools/msprobe/mindspore/ms_config.py b/debug/accuracy_tools/msprobe/mindspore/ms_config.py index 1e34c93955bf8a04798ea8a589df8210b98c7a94..f20ed804c5bb8d8fbe4dba3e208060e8f52a3120 100644 --- a/debug/accuracy_tools/msprobe/mindspore/ms_config.py +++ b/debug/accuracy_tools/msprobe/mindspore/ms_config.py @@ -1,7 +1,7 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,15 +13,14 @@ # See the License for the specific language governing permissions and # limitations under the License. 
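The `get_target_output_dir` helper above filters output directories by a timestamp window parsed with the `'%b%d_%H-%M-%S'` format. A stand-alone sketch of that window check (`in_window` and the tag values below are illustrative, not part of msprobe):

```python
from datetime import datetime

# Same timestamp format as time_str2time_digit, e.g. 'Dec03_21-34-40'.
TIME_FORMAT = '%b%d_%H-%M-%S'


def in_window(time_tag, time_start=None, time_end=None):
    # Parse the tag, then apply the same start/end bounds as get_target_output_dir:
    # a missing bound means "unbounded" on that side.
    target_time = datetime.strptime(time_tag, TIME_FORMAT)
    start_ok = time_start is None or target_time >= datetime.strptime(time_start, TIME_FORMAT)
    end_ok = time_end is None or target_time <= datetime.strptime(time_end, TIME_FORMAT)
    return start_ok and end_ok


print(in_window('Dec03_21-34-40', 'Dec01_00-00-00', 'Dec05_00-00-00'))  # True
print(in_window('Dec08_10-00-00', time_end='Dec05_00-00-00'))           # False
```

Note that the format carries no year, so all tags parse to the same (default) year and comparisons are only meaningful within one year of output.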
-import json - -from msprobe.core.common_config import CommonConfig, BaseConfig -from msprobe.core.common.file_utils import FileOpen from msprobe.core.common.const import Const -from msprobe.mindspore.common.const import FreeBenchmarkConst -from msprobe.mindspore.common.log import logger +from msprobe.core.common.file_utils import load_json +from msprobe.core.common.utils import is_int +from msprobe.core.common_config import BaseConfig, CommonConfig from msprobe.core.grad_probe.constant import level_adp from msprobe.core.grad_probe.utils import check_numeral_list_ascend +from msprobe.mindspore.common.const import FreeBenchmarkConst +from msprobe.mindspore.common.log import logger class TensorConfig(BaseConfig): @@ -33,9 +32,6 @@ class TensorConfig(BaseConfig): self._check_config() def _check_config(self): - if self.data_mode is not None and len(self.data_mode) > 0: - if len(self.data_mode) > 1 or self.data_mode[0] not in ["all", "input", "output"]: - raise Exception("data_mode must be all, input or output") if self.file_format and self.file_format not in ["npy", "bin"]: raise Exception("file_format is invalid") @@ -49,10 +45,11 @@ class StatisticsConfig(BaseConfig): self._check_config() def _check_config(self): - if self.data_mode is not None and len(self.data_mode) > 0: - if len(self.data_mode) > 1 or self.data_mode[0] not in ["all", "input", "output"]: - raise Exception("data_mode must be all, input or output") - if self.summary_mode and self.summary_mode not in ["statistics", "md5"]: + single_opt = ["statistics", "md5"] + muti_opt = ["md5", "max", "min", "mean", "l2norm"] + if isinstance(self.summary_mode, str) and self.summary_mode not in single_opt: + raise Exception("summary_mode is invalid") + if isinstance(self.summary_mode, list) and not all(opt in muti_opt for opt in self.summary_mode): raise Exception("summary_mode is invalid") @@ -63,7 +60,7 @@ class OverflowCheckConfig(BaseConfig): self._check_config() def _check_config(self): - if self.overflow_nums 
is not None and not isinstance(self.overflow_nums, int): + if self.overflow_nums is not None and not is_int(self.overflow_nums): raise Exception("overflow_nums is invalid, it should be an integer") if self.overflow_nums is not None and self.overflow_nums != -1 and self.overflow_nums <= 0: raise Exception("overflow_nums should be -1 or positive integer") @@ -87,7 +84,7 @@ class FreeBenchmarkConfig(BaseConfig): if self.fuzz_level and self.fuzz_level not in FreeBenchmarkConst.DUMP_LEVEL_LIST: raise Exception("fuzz_level must be L1 or empty") if self.fuzz_stage and self.fuzz_stage not in FreeBenchmarkConst.STAGE_LIST: - raise Exception("fuzz_stage must be forward or empty") + raise Exception("fuzz_stage must be forward, backward or empty") if self.if_preheat or self.preheat_step or self.max_sample: logger.warning("'if_preheat', 'preheat_step' and 'max_sample' settings " "are not supported for mindspore free benchmark task.") @@ -109,12 +106,18 @@ class GradProbeConfig(BaseConfig): check_numeral_list_ascend(self.bounds) +class StructureConfig(BaseConfig): + def __init__(self, json_config): + super().__init__(json_config) + + TaskDict = { Const.TENSOR: TensorConfig, Const.STATISTICS: StatisticsConfig, Const.OVERFLOW_CHECK: OverflowCheckConfig, Const.FREE_BENCHMARK: FreeBenchmarkConfig, - Const.GRAD_PROBE: GradProbeConfig + Const.GRAD_PROBE: GradProbeConfig, + Const.STRUCTURE: StructureConfig } @@ -134,8 +137,7 @@ def parse_task_config(task, json_config): def parse_json_config(json_file_path): if not json_file_path: raise Exception("json file path is None") - with FileOpen(json_file_path, 'r') as file: - json_config = json.load(file) + json_config = load_json(json_file_path) common_config = parse_common_config(json_config) if not common_config.task: common_config.task = Const.STATISTICS diff --git a/debug/accuracy_tools/msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py b/debug/accuracy_tools/msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py 
index ff0b81cce7fe1a8eee64c9f4462591c3f9d811d8..d093733192a8a86086cdc40569711cf5255394e3 100644 --- a/debug/accuracy_tools/msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py +++ b/debug/accuracy_tools/msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py @@ -46,6 +46,13 @@ class KernelGraphOverflowCheck: self.dump_json["common_dump_settings"]["op_debug_mode"] = 2 def handle(self): + try: + from msprobe.lib import _msprobe_c + return + except ImportError: + # 如果没有_msprobe_ce_c走MindSpore老流程 + logger.info("Module _msprobe_c has not been installed, use interface in mindspore instead.") + if os.getenv("GRAPH_OP_RUN") == "1": raise Exception("Must run in graph mode, not kbk mode") json_path = self.dump_json["common_dump_settings"]["path"] diff --git a/debug/accuracy_tools/msprobe/mindspore/service.py b/debug/accuracy_tools/msprobe/mindspore/service.py index 50b446d6ffd68a65d3a9795458806cfff8885059..5afbd046be4caf29c4b247a0f8fdd655c5208fd0 100644 --- a/debug/accuracy_tools/msprobe/mindspore/service.py +++ b/debug/accuracy_tools/msprobe/mindspore/service.py @@ -1,4 +1,5 @@ -# Copyright 2024 Huawei Technologies Co., Ltd +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,17 +12,17 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
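The `handle` method above attempts to import an optional C extension (`_msprobe_c`) and, on `ImportError`, falls back to the pure-MindSpore flow. A generic sketch of that optional-backend pattern (the module names passed in below are illustrative, not msprobe's real ones):

```python
import importlib


def load_optional_backend(preferred, fallback):
    """Import an optional accelerated module, falling back when it is absent.

    Same try/except ImportError shape as KernelGraphOverflowCheck.handle.
    """
    try:
        return importlib.import_module(preferred)
    except ImportError:
        # Optional extension not installed: continue on the always-available path.
        return importlib.import_module(fallback)


backend = load_optional_backend('_some_missing_c_extension', 'json')
print(backend.__name__)  # json
```

Catching `ImportError` (rather than checking for the file on disk) also covers `ModuleNotFoundError` and partially broken installs, which is why the original code gates on the import itself.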
-# ============================================================================ -import os import copy import functools +import os from collections import defaultdict import mindspore as ms -from mindspore.common.tensor import Tensor -from mindspore import ops from mindspore import nn +from mindspore.common.api import _no_grad +from mindspore.ops.primitive import Primitive + try: from mindspore.common._pijit_context import PIJitCaptureContext except ImportError: @@ -29,22 +30,25 @@ except ImportError: else: pijit_label = True - +from msprobe.core.common.exceptions import DistributedNotInitializedError, MsprobeException +from msprobe.core.common.file_utils import create_directory +from msprobe.core.common.utils import Const, print_tools_ends_info, DumpPathAggregation from msprobe.core.data_dump.data_collector import build_data_collector +from msprobe.core.data_dump.data_processor.base import (ModuleBackwardInputsOutputs, ModuleForwardInputsOutputs, + ModuleBackwardInputs) from msprobe.core.data_dump.scope import BaseScope -from msprobe.mindspore.common.utils import get_rank_if_initialized -from msprobe.core.common.file_utils import create_directory +from msprobe.mindspore.cell_processor import CellProcessor from msprobe.mindspore.common.log import logger -from msprobe.core.common.utils import Const, print_tools_ends_info -from msprobe.core.common.exceptions import DistributedNotInitializedError +from msprobe.mindspore.common.utils import (get_rank_if_initialized, clean_input_kwargs, + is_mindtorch, register_backward_hook_functions) from msprobe.mindspore.dump.hook_cell.api_registry import api_register from msprobe.mindspore.dump.hook_cell.primitive_hooks import PrimitiveHookService -from msprobe.core.data_dump.data_processor.base import ModuleBackwardInputsOutputs, ModuleForwardInputsOutputs, \ - ModuleBackwardInputs, ModuleBackwardOutputs -from msprobe.core.common.exceptions import MsprobeException -from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell 
-from msprobe.mindspore.cell_processor import CellProcessor from msprobe.mindspore.dump.jit_dump import JitDump +from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell +from msprobe.mindspore.dump.kernel_dump.kernel_config import create_kernel_config_json + +if is_mindtorch(): + import torch class Service: @@ -56,66 +60,196 @@ class Service: self.cell_processor = CellProcessor(self.data_collector.scope) self.primitive_hook_service = PrimitiveHookService(self) self.switch = False + self.inner_switch = False self.primitive_switch = False self.current_iter = 0 self.first_start = True self.current_rank = None self.dump_iter_dir = None self.start_call = False - self.check_level_valid() self.should_stop_service = False + self.params_grad_info = {} + self.hook_handle_dict = {} + # 提前注册,确保注册尽可能多的API hook + self.register_api_hook() + self.init_for_debug_level() @staticmethod - def check_model_valid(model): - if not model or isinstance(model, nn.Cell): - return model - raise MsprobeException( - MsprobeException.INVALID_PARAM_ERROR, "model 参数必须是 mindspore.nn.Cell 类型。" - ) + def check_model_valid(models): + target_module_type = (torch.nn.Module, "torch.nn.Module") if is_mindtorch() else (nn.Cell, "mindspore.nn.Cell") + if models is None or isinstance(models, target_module_type[0]): + return models + error_model = None + if isinstance(models, (list, tuple)): + for model in models: + if not isinstance(model, target_module_type[0]): + error_model = model + break + else: + error_model = models - def check_level_valid(self): - if self.config.level == Const.LEVEL_L2: + if error_model is not None: + error_info = (f"The 'model' parameter must be a {target_module_type[1]} or list[{target_module_type[1]}] " + f"type, currently there is a {type(error_model)} type.") raise MsprobeException( - MsprobeException.INVALID_PARAM_ERROR, "L2 level dump function is currently not supported." 
- ) + MsprobeException.INVALID_PARAM_ERROR, error_info) + return models + + @staticmethod + def prepare_module_input_output(target_type, cell, input_data, output): + if target_type == BaseScope.Module_Type_Module: + module_input_output = ModuleForwardInputsOutputs(args=input_data, kwargs={}, output=output) + else: + module_input_output = ModuleForwardInputsOutputs(args=input_data, kwargs=cell.input_kwargs, output=output) + return module_input_output def build_hook(self, target_type, name): - def forward_hook(api_or_cell_name, cell, input_data, output): - if not self.should_excute_hook(): + def pre_hook(api_or_cell_name, cell, input_data): + if not self.should_execute_hook(target_type, cell, True): + clean_input_kwargs(cell) return None - if target_type == BaseScope.Module_Type_Module: - api_or_cell_name = cell.mindstudio_reserved_name - module_input_output = ModuleForwardInputsOutputs(args=input_data, kwargs={}, output=output) - else: - module_input_output = ModuleForwardInputsOutputs(args=input_data, kwargs=cell.input_kwargs, - output=output) + with _no_grad(): + self.inner_switch = True + if target_type == BaseScope.Module_Type_Module: + api_or_cell_name = self.cell_processor.set_and_get_reserved_name(cell, api_or_cell_name) + else: + cell.forward_data_collected = True + HOOKCell.add_cell_count(name) + module_input_output = self.prepare_module_input_output(target_type, cell, input_data, None) + self.data_collector.update_api_or_module_name(api_or_cell_name) + self.data_collector.forward_input_data_collect(api_or_cell_name, cell, pid, module_input_output) + self.inner_switch = False + return input_data + + def grad_hook(cell, ori_name, param_name): + def hook_fn(grad): + if not self.should_execute_hook(target_type, cell, False): + return None + self.inner_switch = True + self.data_collector.params_data_collect(ori_name, param_name, pid, grad) + self.inner_switch = False + return None - self.data_collector.update_api_or_module_name(api_or_cell_name) - 
self.data_collector.forward_data_collect(api_or_cell_name, cell, pid, module_input_output) - if self.data_collector.if_return_forward_new_output(): - return self.data_collector.get_forward_new_output() - if target_type == BaseScope.Module_Type_API: - del cell.input_kwargs - return output + return hook_fn + + def register_param_hook(ori_name, cell, params_dict): + ''' + 注册参数hook + ''' + # data_mode为forward时,不注册参数hook + if not (Const.FORWARD in self.config.data_mode and Const.BACKWARD not in self.config.data_mode): + for param_name, param in params_dict.items(): + if param.requires_grad: + name = ori_name + Const.SEP + param_name + old_handle = self.hook_handle_dict.get(name) + if old_handle and hasattr(old_handle, "remove"): + old_handle.remove() + handle = param.register_hook(grad_hook(cell, ori_name, param_name)) + self.hook_handle_dict[name] = handle + + def init_params_grad_info(cell, params_dict): + ''' + 初始化参数梯度信息, 在前向hook结束后, 将参数梯度信息写入cache_data中用于占位 + ''' + if not params_dict: + return + if not (Const.FORWARD in self.config.data_mode and Const.BACKWARD not in self.config.data_mode): + grad_name = cell.params_grad_name if hasattr(cell, 'params_grad_name') else None + # 判断是否已经在cache_data中进行了占位, 若没有则先写入cache_data中 + if not self.params_grad_info.get(grad_name): + data_info = {grad_name: {key: [None] for key, value in params_dict.items() if value.requires_grad}} + # 当模块中的参数有requires_grad属性为True时,才会进行梯度计算,此时才需要占位 + if data_info.get(grad_name): + # 将grad_name的data_info先写入cache_data中, 梯度计算后再更新 + self.data_collector.handle_data(grad_name, data_info, + flush=self.data_collector.data_processor.is_terminated) + # 记录当前模块的参数梯度信息已占位 + self.params_grad_info[grad_name] = True + + def forward_hook(api_or_cell_name, cell, input_data, output): + if not self.should_execute_hook(target_type, cell, True): + clean_input_kwargs(cell) + return None + with _no_grad(): + self.inner_switch = True + module_input_output = self.prepare_module_input_output(target_type, cell, input_data, 
output) + if target_type == BaseScope.Module_Type_Module: + api_or_cell_name = self.cell_processor.set_and_get_reserved_name(cell, api_or_cell_name) + params_dict = {} + if self.config.task != Const.STRUCTURE: + params_dict = { + key.split(Const.SEP)[-1]: value + for key, value in cell.parameters_dict(recurse=False).items() + } + setattr(module_input_output, Const.PARAMS, params_dict) + # 判断是否需要注册参数hook + if params_dict: + ori_name = api_or_cell_name.rsplit(Const.SEP, 2)[0] + grad_name = ori_name + Const.SEP + Const.PARAMS_GRAD + # 首次执行前向hook时,添加params_grad_name属性,并注册参数hook + setattr(cell, 'params_grad_name', grad_name) + register_param_hook(ori_name, cell, params_dict) + self.data_collector.update_api_or_module_name(api_or_cell_name) + self.data_collector.forward_data_collect(api_or_cell_name, cell, pid, module_input_output) + init_params_grad_info(cell, params_dict) + else: + self.data_collector.update_api_or_module_name(api_or_cell_name) + self.data_collector.forward_output_data_collect(api_or_cell_name, cell, pid, module_input_output) + + if self.data_collector.if_return_forward_new_output(): + forward_new_output = self.data_collector.get_forward_new_output() + self.inner_switch = False + return forward_new_output + clean_input_kwargs(cell) + self.inner_switch = False + return output def backward_hook(api_or_cell_name, cell, grad_input, grad_output): - if not self.should_excute_hook(): + if not self.should_execute_hook(target_type, cell, False): return + self.inner_switch = True + need_exchange = True if target_type == BaseScope.Module_Type_Module: - api_or_cell_name = cell.mindstudio_reserved_name + if not hasattr(cell, 'has_pre_hook_called') or not cell.has_pre_hook_called: + need_exchange = False + api_or_cell_name = self.cell_processor.set_and_get_reserved_name(cell, api_or_cell_name) + self.data_collector.update_api_or_module_name(api_or_cell_name) if self.data_collector: # 框架最新接口变更,grad_input和grad_output的含义发生了变化,与torch含义保持一致,因此此处调换顺序传入 - 
module_input_output = ModuleBackwardInputsOutputs(grad_input=grad_output, grad_output=grad_input) + if need_exchange: + module_input_output = ModuleBackwardInputsOutputs(grad_input=grad_output, grad_output=grad_input) + else: + module_input_output = ModuleBackwardInputsOutputs(grad_input=grad_input, grad_output=grad_output) self.data_collector.backward_data_collect(api_or_cell_name, cell, pid, module_input_output) + self.inner_switch = False + + def pre_backward_hook(api_or_cell_name, cell, grad_input): + if not self.should_execute_hook(target_type, cell, False): + return + self.inner_switch = True + module_input = ModuleBackwardInputs(grad_input=grad_input) + self.data_collector.update_api_or_module_name(api_or_cell_name) + self.data_collector.backward_input_data_collect(api_or_cell_name, cell, pid, module_input) + + self.inner_switch = False pid = os.getpid() - forward_name_template = name + Const.FORWARD - backward_name_template = name + Const.BACKWARD - forward_hook = functools.partial(forward_hook, forward_name_template) - backward_hook = functools.partial(backward_hook, backward_name_template) + if target_type == BaseScope.Module_Type_Module: + full_forward_name = name + Const.FORWARD + full_backward_name = name + Const.BACKWARD + else: + full_forward_name = name + str(HOOKCell.get_cell_count(name)) + Const.SEP + Const.FORWARD + full_backward_name = name + str(HOOKCell.get_cell_count(name)) + Const.SEP + Const.BACKWARD + pre_forward_hook = functools.partial(pre_hook, full_forward_name) + forward_hook = functools.partial(forward_hook, full_forward_name) + backward_hook = functools.partial(backward_hook, full_backward_name) + pre_backward_hook = functools.partial(pre_backward_hook, full_backward_name) + + def wrap_pre_forward_hook(cell, input_data): + return pre_forward_hook(cell, input_data) def wrap_forward_hook(cell, input_data, output_data): return forward_hook(cell, input_data, output_data) @@ -123,8 +257,10 @@ class Service: def wrap_backward_hook(cell, 
grad_input, grad_output): return backward_hook(cell, grad_input, grad_output) - return wrap_forward_hook, wrap_backward_hook + def wrap_pre_backward_hook(cell, grad_input): + return pre_backward_hook(cell, grad_input) + return wrap_pre_forward_hook, wrap_forward_hook, wrap_backward_hook, wrap_pre_backward_hook def update_primitive_counters(self, primitive_name): if primitive_name not in self.primitive_counters: @@ -132,35 +268,25 @@ class Service: else: self.primitive_counters[primitive_name] += 1 - def register_primitive_hooks(self): - primitive_set = set() - for _, cell in self.model.cells_and_names(): - for pname, primitive in cell._primitives.items(): - primitive_set.add((pname, primitive)) - - for pname, primitive in primitive_set: - primitive_class_name = primitive.__class__.__name__ - primitive_combined_name = pname + Const.SEP + primitive_class_name - new_primitive = type('NewPrimitive', (primitive.__class__,), - {'__call__': self.primitive_hook_service.wrap_primitive(primitive.__call__, - primitive_combined_name)}) - primitive.__class__ = new_primitive - def step(self): + if self.config.level == Const.LEVEL_DEBUG: + return + if self.config.async_dump: + self.data_collector.fill_stack_tensor_data() + if self.config.task == Const.TENSOR: + self.data_collector.data_processor.dump_async_data() + self.data_collector.write_json() self.current_iter += 1 self.data_collector.update_iter(self.current_iter) - HOOKCell.cell_count = defaultdict(int) - CellProcessor.reset_cell_stats() - self.primitive_hook_service.primitive_counters.clear() - self.data_collector.data_writer.reset_cache() - JitDump.jit_count = defaultdict(int) + self.reset_status() def start(self, model=None): + if self.config.level == Const.LEVEL_DEBUG: + return self.start_call = True if self.should_stop_service: return if self.need_end_service(): - api_register.api_set_ori_func() self.should_stop_service = True self.switch = False self.primitive_switch = False @@ -180,11 +306,15 @@ class Service: if 
self.config.rank and self.current_rank not in self.config.rank: return - self.register_hook_new() + self.register_primitive_hook() + self.register_cell_hook() if self.config.level in [Const.LEVEL_MIX, Const.LEVEL_L1]: JitDump.set_config(self.config) JitDump.set_data_collector(self.data_collector) - ms.common.api._MindsporeFunctionExecutor = JitDump + if hasattr(ms.common.api, "_MindsporeFunctionExecutor"): + ms.common.api._MindsporeFunctionExecutor = JitDump + else: + ms.common.api._JitExecutor = JitDump ms.common.api._PyNativeExecutor.grad = JitDump.grad if pijit_label: PIJitCaptureContext.__enter__ = self.empty @@ -199,25 +329,9 @@ class Service: logger.info(f"Dump data will be saved in {self.dump_iter_dir}.") JitDump.jit_dump_switch = True - def forward_backward_dump_end(self): - if self.should_stop_service: - return - logger.info(f"{Const.TOOL_NAME}: debugger.forward_backward_dump_end() is set successfully. ") - if not self.start_call: - logger.error(f"{Const.TOOL_NAME}: debugger.start() is not set in the current scope.") - raise Exception("debugger.start() is not set in the current scope.") - if not self.switch: - logger.error(f"{Const.TOOL_NAME}: debugger.forward_backward_dump_end() should be called between " - "debugger.start() and debugger.stop() ") - raise Exception("debugger.stop() is already called. ") - if self.config.step and self.current_iter not in self.config.step: - return - if self.config.rank and self.current_rank not in self.config.rank: - return - self.primitive_switch = False - api_register.api_set_ori_func() - def stop(self): + if self.config.level == Const.LEVEL_DEBUG: + return if self.should_stop_service: return logger.info(f"{Const.TOOL_NAME}: debugger.stop() is set successfully. 
" @@ -232,6 +346,10 @@ class Service: self.switch = False self.primitive_switch = False self.start_call = False + if self.config.async_dump: + self.data_collector.fill_stack_tensor_data() + if self.config.task == Const.TENSOR: + self.data_collector.data_processor.dump_async_data() self.data_collector.write_json() JitDump.jit_dump_switch = False @@ -242,8 +360,16 @@ class Service: return True return False - def should_excute_hook(self): - if not self.switch: + def should_execute_hook(self, hook_type, cell, is_forward): + is_cell_hook = hook_type == BaseScope.Module_Type_Module + if is_cell_hook and not self.switch: + return False + elif not is_cell_hook and is_forward and not self.switch: + return False + elif not is_cell_hook and not is_forward and not cell.forward_data_collected: + return False + + if self.inner_switch: return False if not self.data_collector or self.data_collector.data_processor.is_terminated: return False @@ -253,6 +379,12 @@ class Service: create_directory(self.config.dump_path) self.dump_iter_dir = os.path.join(self.config.dump_path, f"step{self.current_iter}") cur_rank = self.current_rank if self.current_rank is not None else '' + if self.config.level == Const.LEVEL_L2: + create_directory(self.dump_iter_dir) + kernel_config_path = create_kernel_config_json(self.dump_iter_dir, cur_rank) + self.config.kernel_config_path = kernel_config_path + return + dump_dir = os.path.join(self.dump_iter_dir, f"rank{cur_rank}") create_directory(dump_dir) if self.config.task in self.data_collector.tasks_need_tensor_data: @@ -261,41 +393,151 @@ class Service: else: dump_data_dir = None - dump_file_path = os.path.join(dump_dir, "dump.json") - stack_file_path = os.path.join(dump_dir, "stack.json") - construct_file_path = os.path.join(dump_dir, "construct.json") - self.data_collector.update_dump_paths( - dump_file_path, stack_file_path, construct_file_path, dump_data_dir, None) + dump_path_aggregation = DumpPathAggregation() + dump_path_aggregation.dump_file_path 
= os.path.join(dump_dir, "dump.json") + dump_path_aggregation.stack_file_path = os.path.join(dump_dir, "stack.json") + dump_path_aggregation.construct_file_path = os.path.join(dump_dir, "construct.json") + dump_path_aggregation.dump_tensor_data_dir = dump_data_dir + self.data_collector.update_dump_paths(dump_path_aggregation) + + self.data_collector.initialize_json_file( + framework=Const.MT_FRAMEWORK if is_mindtorch() else Const.MS_FRAMEWORK + ) def empty(self, *args, **kwargs): pass - def register_hook_new(self): - logger.info("The {} hook function is successfully mounted to the model.".format(self.config.task)) - if self.config.level in [Const.LEVEL_MIX, Const.LEVEL_L1]: + def register_api_hook(self): + if self.config.level in [Const.LEVEL_MIX, Const.LEVEL_L1, Const.LEVEL_L2]: + logger.info(f"The api {self.config.task} hook function is successfully mounted to the model.") api_register.initialize_hook(functools.partial(self.build_hook, BaseScope.Module_Type_API)) api_register.api_set_hook_func() - if self.model and self.config.task in Const.DUMP_DATA_COLLECTION_LIST: - self.register_primitive_hooks() + def get_cells_and_names(self): + cells_and_names_with_index = {} + + def get_cell_or_module(model): + return model.named_modules() if is_mindtorch() else model.cells_and_names() + + if isinstance(self.model, (list, tuple)): + for index, model in enumerate(self.model): + cells_and_names_with_index[str(index)] = get_cell_or_module(model) + else: + cells_and_names_with_index["-1"] = get_cell_or_module(self.model) + return cells_and_names_with_index + + def register_primitive_hook(self): + if self.config.level not in [Const.LEVEL_MIX, Const.LEVEL_L1]: + return + if not self.model or self.config.task not in Const.DUMP_DATA_COLLECTION_LIST: + return + + primitive_set = set() + cells_and_names_with_index = self.get_cells_and_names() + for cells_and_names in cells_and_names_with_index.values(): + for _, cell in cells_and_names: + for attribute, value in vars(cell).items(): 
+ if isinstance(value, Primitive): + primitive_set.add((attribute, value)) + + for pname, primitive in primitive_set: + primitive_class_name = primitive.__class__.__name__ + primitive_combined_name = pname + Const.SEP + primitive_class_name + new_primitive = type('NewPrimitive', (primitive.__class__,), + {'__call__': self.primitive_hook_service.wrap_primitive(primitive.__call__, + primitive_combined_name)}) + primitive.__class__ = new_primitive + + def register_cell_hook(self): if self.config.level in [Const.LEVEL_MIX, Const.LEVEL_L0]: + logger.info(f"The cell {self.config.task} hook function is successfully mounted to the model.") if not self.model: raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f"The current level is {self.config.level}, the model cannot be None") - for name, cell in self.model.cells_and_names(): - if cell == self.model: - continue - prefix = 'Cell' + Const.SEP + name + Const.SEP + \ - cell.__class__.__name__ + Const.SEP - forward_hook, backward_hook = self.build_hook(BaseScope.Module_Type_Module, prefix) - cell.register_forward_hook(forward_hook) - cell.register_backward_hook(backward_hook) - - cell.register_forward_pre_hook( - self.cell_processor.node_hook(prefix + Const.FORWARD, Const.START)) - cell.register_forward_hook( - self.cell_processor.node_hook(prefix + Const.FORWARD, Const.STOP)) - cell.register_backward_pre_hook( - self.cell_processor.node_hook(prefix + Const.BACKWARD, Const.START)) - cell.register_backward_hook( - self.cell_processor.node_hook(prefix + Const.BACKWARD, Const.STOP)) + model_type = Const.MODULE if is_mindtorch() else Const.CELL + cells_and_names_with_index = self.get_cells_and_names() + + for index, cells_and_names in cells_and_names_with_index.items(): + model = self.model if index == "-1" else self.model[int(index)] + for name, cell in cells_and_names: + if cell == model: + continue + cell_index = (index + Const.SEP) if index != "-1" else "" + prefix = (model_type + Const.SEP + cell_index + name + + 
Const.SEP + cell.__class__.__name__ + Const.SEP) + _, forward_hook, backward_hook, _ = self.build_hook(BaseScope.Module_Type_Module, prefix) + cell.register_forward_hook(forward_hook) + cell.register_forward_pre_hook( + self.cell_processor.node_hook(prefix + Const.FORWARD, Const.START)) + cell.register_forward_hook( + self.cell_processor.node_hook(prefix + Const.FORWARD, Const.STOP)) + + register_backward_hook_functions["full"](cell, backward_hook) + register_backward_hook_functions["pre"]( + cell, self.cell_processor.node_hook(prefix + Const.BACKWARD, Const.START)) + register_backward_hook_functions["full"]( + cell, self.cell_processor.node_hook(prefix + Const.BACKWARD, Const.STOP)) + + def reset_status(self): + self.primitive_hook_service.primitive_counters.clear() + self.data_collector.reset_status() + JitDump.jit_count = defaultdict(int) + self.params_grad_info.clear() + if self.config.level == Const.LEVEL_L2: + self.data_collector.data_processor.reset_status() + return + if self.config.step and self.current_iter not in self.config.step: + return + if self.config.rank and self.current_rank not in self.config.rank: + return + + def init_for_debug_level(self): + if not (self.config.level == Const.LEVEL_DEBUG and self.config.task in [Const.TENSOR, Const.STATISTICS]): + return + try: + self.current_rank = get_rank_if_initialized() + except DistributedNotInitializedError: + self.current_rank = None + # dir: dump_path -- rank{} -- debug.json + self.dump_iter_dir = self.config.dump_path + cur_rank = self.current_rank if self.current_rank is not None else '' + dump_dir = os.path.join(self.dump_iter_dir, f"rank{cur_rank}") + create_directory(dump_dir) + if self.config.task in self.data_collector.tasks_need_tensor_data: + dump_data_dir = os.path.join(dump_dir, "dump_tensor_data") + create_directory(dump_data_dir) + else: + dump_data_dir = None + + dump_path_aggregation = DumpPathAggregation() + dump_path_aggregation.dump_tensor_data_dir = dump_data_dir + 
dump_path_aggregation.debug_file_path = os.path.join(dump_dir, "debug.json") + self.data_collector.update_dump_paths(dump_path_aggregation) + self.data_collector.initialize_json_file( + framework=Const.MT_FRAMEWORK if is_mindtorch() else Const.MS_FRAMEWORK + ) + self.debug_variable_counter = defaultdict(int) + + def save(self, variable, name, save_backward): + ''' + Args: + variable: Union[List[variable], dict{str: variable}, mindspore.tensor, str, float, int] + name: str + save_backward: boolean + Return: + void + ''' + if self.config.level != Const.LEVEL_DEBUG: + return + count = self.debug_variable_counter[name] + self.debug_variable_counter[name] += 1 + + name_with_count = f"{name}.{count}" + grad_name_with_count = f"{name}_grad.{count}" + + # forward save + self.data_collector.debug_data_collect_forward(variable, name_with_count) + + # backward save + if save_backward: + self.data_collector.debug_data_collect_backward(variable, grad_name_with_count) diff --git a/debug/accuracy_tools/msprobe/msprobe.py b/debug/accuracy_tools/msprobe/msprobe.py index 97b72a2fd9084e3e3b749d9ad9b72cb9aa8248e8..8e0386fde6dccc071c3d9d8e1a86729a2c483c7c 100644 --- a/debug/accuracy_tools/msprobe/msprobe.py +++ b/debug/accuracy_tools/msprobe/msprobe.py @@ -16,10 +16,12 @@ import argparse import sys import importlib.util -from msprobe.core.compare.utils import _compare_parser + +from msprobe.core.common.const import Const from msprobe.core.common.log import logger +from msprobe.core.compare.utils import _compare_parser from msprobe.core.compare.compare_cli import compare_cli -from msprobe.core.common.const import Const +from msprobe.core.compare.merge_result.merge_result_cli import _merge_result_parser, merge_result_cli def is_module_available(module_name): @@ -45,10 +47,15 @@ def main(): multi_run_ut_cmd_parser = subparsers.add_parser('multi_run_ut') api_precision_compare_cmd_parser = subparsers.add_parser('api_precision_compare') run_overflow_check_cmd_parser = 
subparsers.add_parser('run_overflow_check') + code_mapping_cmd_parser = subparsers.add_parser('code_mapping') graph_service_cmd_parser = subparsers.add_parser('graph') + op_generate_cmd_parser = subparsers.add_parser('op_generate') + merge_result_parser = subparsers.add_parser('merge_result') _compare_parser(compare_cmd_parser) + _merge_result_parser(merge_result_parser) + is_torch_available = is_module_available("torch") - is_mindspore_available = is_module_available("mindspore") + if len(sys.argv) < 4: parser.print_help() sys.exit(0) @@ -61,7 +68,9 @@ def main(): _api_precision_compare_command from msprobe.pytorch.api_accuracy_checker.run_ut.run_overflow_check import _run_overflow_check_parser, \ _run_overflow_check_command - from msprobe.visualization.graph_service import _graph_service_parser, _graph_service_command + from msprobe.visualization.graph_service import _pt_graph_service_parser, _pt_graph_service_command + from msprobe.pytorch.api_accuracy_checker.generate_op_script.op_generator import _op_generator_parser, \ + _run_operator_generate_commond _run_ut_parser(run_ut_cmd_parser) _run_ut_parser(multi_run_ut_cmd_parser) @@ -69,10 +78,18 @@ def main(): help='Number of splits for parallel processing. 
Range: 1-64') _api_precision_compare_parser(api_precision_compare_cmd_parser) _run_overflow_check_parser(run_overflow_check_cmd_parser) - _graph_service_parser(graph_service_cmd_parser) + _pt_graph_service_parser(graph_service_cmd_parser) + _op_generator_parser(op_generate_cmd_parser) elif framework_args.framework == Const.MS_FRAMEWORK: from msprobe.mindspore.api_accuracy_checker.cmd_parser import add_api_accuracy_checker_argument + from msprobe.visualization.graph_service import _ms_graph_service_parser, _ms_graph_service_command add_api_accuracy_checker_argument(run_ut_cmd_parser) + from msprobe.mindspore.api_accuracy_checker.cmd_parser import multi_add_api_accuracy_checker_argument + multi_add_api_accuracy_checker_argument(multi_run_ut_cmd_parser) + from msprobe.mindspore.code_mapping.cmd_parser import add_ir_parser_arguments + add_ir_parser_arguments(code_mapping_cmd_parser) + + _ms_graph_service_parser(graph_service_cmd_parser) args = parser.parse_args(sys.argv[1:]) if sys.argv[2] == Const.PT_FRAMEWORK: @@ -91,21 +108,35 @@ def main(): elif sys.argv[3] == "run_overflow_check": _run_overflow_check_command(args) elif sys.argv[3] == "graph": - _graph_service_command(args) + _pt_graph_service_command(args) + elif sys.argv[3] == 'op_generate': + _run_operator_generate_commond(args) elif sys.argv[3] == "compare": if args.cell_mapping is not None or args.api_mapping is not None: logger.error("Argument -cm or -am is not supported in PyTorch framework") raise Exception("Argument -cm or -am is not supported in PyTorch framework") compare_cli(args) + elif sys.argv[3] == "merge_result": + merge_result_cli(args) else: if not is_module_available(Const.MS_FRAMEWORK): logger.error("MindSpore does not exist, please install MindSpore library") raise Exception("MindSpore does not exist, please install MindSpore library") if sys.argv[3] == "compare": compare_cli(args) + elif sys.argv[3] == "merge_result": + merge_result_cli(args) elif sys.argv[3] == "run_ut": from 
msprobe.mindspore.api_accuracy_checker.main import api_checker_main api_checker_main(args) + elif sys.argv[3] == "multi_run_ut": + from msprobe.mindspore.api_accuracy_checker.main import mul_api_checker_main + mul_api_checker_main(args) + elif sys.argv[3] == "graph": + _ms_graph_service_command(args) + elif sys.argv[3] == "code_mapping": + from msprobe.mindspore.code_mapping.main import code_mapping_main + code_mapping_main(args) if __name__ == "__main__": diff --git a/debug/accuracy_tools/msprobe/pytorch/__init__.py b/debug/accuracy_tools/msprobe/pytorch/__init__.py index aa47633ccc2dccb0e7e0898d6b9d27fafb2e69b9..ce84e6b35b74e55a90915350ff3ef2da3f7ba441 100644 --- a/debug/accuracy_tools/msprobe/pytorch/__init__.py +++ b/debug/accuracy_tools/msprobe/pytorch/__init__.py @@ -1,6 +1,4 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -15,10 +13,12 @@ # See the License for the specific language governing permissions and # limitations under the License. 
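The `msprobe.py` dispatch above branches on `sys.argv[3]` through long `if`/`elif` chains per framework. A compact sketch of the same subcommand-dispatch idea, expressed as a lookup table over `argparse` subparsers (the handlers and flag names here are stand-ins, not the real msprobe CLI):

```python
import argparse


def run_compare(args):
    return f"compare {args.framework}"


def run_merge_result(args):
    return f"merge_result {args.framework}"


# Table-driven dispatch: one handler per subcommand name.
HANDLERS = {
    "compare": run_compare,
    "merge_result": run_merge_result,
}


def dispatch(argv):
    parser = argparse.ArgumentParser(prog="tool")
    parser.add_argument("-f", "--framework", required=True)
    sub = parser.add_subparsers(dest="command", required=True)
    for name in HANDLERS:
        sub.add_parser(name)
    args = parser.parse_args(argv)
    return HANDLERS[args.command](args)
```

Compared with chained `sys.argv[3] == ...` tests, the table keeps command names and handlers in one place, at the cost of eagerly importing every handler.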
- -from .debugger.precision_debugger import PrecisionDebugger -from .common.utils import seed_all +import torch from .compare.distributed_compare import compare_distributed from .compare.pt_compare import compare -from .functional.module_dump import module_dump, module_dump_end -from .monitor.module_hook import TrainerMon +from .common.utils import seed_all +from .debugger.precision_debugger import PrecisionDebugger, module_dump, module_dump_end + +torch_version_above_or_equal_2 = torch.__version__.split('+')[0] >= '2.0' +if torch_version_above_or_equal_2: + from msprobe.pytorch.monitor.module_hook import TrainerMon diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/config.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/config.py index 508d7966b3fe8c17ee42d3f36aecf62717f5feb8..f2b2d6a30463c62846bcc02e147c9c319f55d1b8 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/config.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/config.py @@ -16,10 +16,18 @@ # limitations under the License. 
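The new `__init__.py` gates `TrainerMon` on `torch.__version__.split('+')[0] >= '2.0'`, a plain string comparison. That happens to work for current torch versions, but lexicographic ordering misranks multi-digit components (e.g. `"10.0" >= "2.0"` is `False` as strings). A hedged sketch of a numeric-tuple comparison that avoids this pitfall (the helper name is ours, not the repo's):

```python
def version_at_least(version_string, minimum):
    """Compare dotted versions numerically, ignoring local suffixes like '+cpu'.

    Non-numeric characters inside a component (e.g. 'rc1') are dropped.
    """
    def to_tuple(v):
        parts = []
        for piece in v.split('+')[0].split('.'):
            digits = ''.join(ch for ch in piece if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return tuple(parts)

    a, b = to_tuple(version_string), to_tuple(minimum)
    # Pad the shorter tuple with zeros so (2,) compares equal to (2, 0).
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

In real code, `packaging.version.parse` would be the more standard choice when that dependency is acceptable.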
import os +from collections import namedtuple from msprobe.core.common.file_utils import load_yaml, check_file_or_directory_path +from msprobe.core.common.utils import is_int from msprobe.pytorch.pt_config import RunUTConfig +RunUtConfig = namedtuple('RunUtConfig', ['forward_content', 'backward_content', 'result_csv_path', 'details_csv_path', + 'save_error_data', 'is_continue_run_ut', 'real_data_path', 'white_list', + 'black_list', 'error_data_path', 'online_config']) +OnlineConfig = namedtuple('OnlineConfig', ['is_online', 'nfs_path', 'host', 'port', 'rank_list', 'tls_path']) + + class Config: def __init__(self, yaml_file): check_file_or_directory_path(yaml_file, False) @@ -50,6 +58,8 @@ class Config: raise ValueError(f"{key} must be one of {validators.keys()}") if not isinstance(value, validators.get(key)): raise ValueError(f"{key} must be {validators[key].__name__} type") + if key == 'precision' and not is_int(value): + raise ValueError("precision must be an integer") if key == 'precision' and (value < 0 or value > 20): raise ValueError("precision must be greater than or equal to 0 and less than 21") if key == 'white_list': @@ -68,3 +78,55 @@ class Config: cur_path = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) yaml_path = os.path.join(cur_path, "config.yaml") msCheckerConfig = Config(yaml_path) + + +class CheckerConfig: + def __init__(self, task_config=None): + self.white_list = msCheckerConfig.white_list + self.black_list = msCheckerConfig.black_list + self.error_data_path = msCheckerConfig.error_data_path + self.is_online = msCheckerConfig.is_online + self.nfs_path = msCheckerConfig.nfs_path + self.host = msCheckerConfig.host + self.port = msCheckerConfig.port + self.rank_list = msCheckerConfig.rank_list + self.tls_path = msCheckerConfig.tls_path + + if task_config: + self.load_config(task_config) + + def load_config(self, task_config): + self.white_list = task_config.white_list + self.black_list = task_config.black_list + self.error_data_path 
= task_config.error_data_path + self.is_online = task_config.is_online + self.nfs_path = task_config.nfs_path + self.host = task_config.host + self.port = task_config.port + self.rank_list = task_config.rank_list + self.tls_path = task_config.tls_path + + def get_online_config(self): + return OnlineConfig( + is_online=self.is_online, + nfs_path=self.nfs_path, + host=self.host, + port=self.port, + rank_list=self.rank_list, + tls_path=self.tls_path + ) + + def get_run_ut_config(self, **config_params): + return RunUtConfig( + forward_content=config_params.get('forward_content'), + backward_content=config_params.get('backward_content'), + result_csv_path=config_params.get('result_csv_path'), + details_csv_path=config_params.get('details_csv_path'), + save_error_data=config_params.get('save_error_data'), + is_continue_run_ut=config_params.get('is_continue_run_ut'), + real_data_path=config_params.get('real_data_path'), + white_list=self.white_list, + black_list=self.black_list, + error_data_path=config_params.get('error_data_path'), + online_config=self.get_online_config() + ) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/utils.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/utils.py index 07f483d9dc63fa028d0e11805865996187902df4..5724f626237af164c582d2165354d5ab35e3b839 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/utils.py @@ -72,38 +72,53 @@ def check_need_convert(api_name): return convert_type -def api_info_preprocess(api_name, api_info_dict): +def cross_entropy_process(api_info_dict): """ Function Description: - Preprocesses the API information. + Preprocesses the cross_entropy API information. Parameter: - api_name: Name of the API. api_info_dict: argument of the API. Return api_info_dict: - convert_type: Type of conversion. api_info_dict: Processed argument of the API. 
""" - convert_type = check_need_convert(api_name) - if api_name == 'cross_entropy': - api_info_dict = cross_entropy_process(api_info_dict) - return convert_type, api_info_dict + if 'input_args' in api_info_dict and len(api_info_dict['input_args']) > 1 \ + and 'Min' in api_info_dict['input_args'][1]: + if api_info_dict['input_args'][1]['Min'] <= 0: + # The second argument in cross_entropy should be -100 or not less than 0 + api_info_dict['input_args'][1]['Min'] = 0 + return api_info_dict -def cross_entropy_process(api_info_dict): +def histc_process(api_info_dict): + input_args = api_info_dict['input_args'] + if input_args and input_args[0].get('dtype'): + dtype = input_args[0]['dtype'] + if dtype in Const.TORCH_INT_DTYPE: + api_info_dict['input_args'][0]['dtype'] = Const.TORCH_FLOAT32 + return api_info_dict + + +API_PROCESS_MAP = { + 'cross_entropy': cross_entropy_process, + 'histc': histc_process +} + + +def api_info_preprocess(api_name, api_info_dict): """ Function Description: - Preprocesses the cross_entropy API information. + Preprocesses the API information. Parameter: + api_name: Name of the API. api_info_dict: argument of the API. Return api_info_dict: + convert_type: Type of conversion. api_info_dict: Processed argument of the API. 
""" - if 'input_args' in api_info_dict and len(api_info_dict['input_args']) > 1 \ - and 'Min' in api_info_dict['input_args'][1]: - if api_info_dict['input_args'][1]['Min'] <= 0: - # The second argument in cross_entropy should be -100 or not less than 0 - api_info_dict['input_args'][1]['Min'] = 0 - return api_info_dict + convert_type = check_need_convert(api_name) + if api_name in API_PROCESS_MAP: + api_info_dict = API_PROCESS_MAP[api_name](api_info_dict) + return convert_type, api_info_dict def initialize_save_path(save_path, dir_name): diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/algorithm.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/algorithm.py index 5d6dc772963cb4bbc94cbc4f578aee631f890d0f..ddee254c2b1085f9af96fe2774c53fb88c5821f4 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/algorithm.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/algorithm.py @@ -16,10 +16,12 @@ # limitations under the License. 
# 定义比对算法及比对标准 +import math import torch import numpy as np from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import ULP_PARAMETERS +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig from msprobe.core.common.const import CompareConst @@ -179,13 +181,13 @@ def check_inf_nan_value(inf_nan_mask, bench_output, device_output, dtype, rtol): def check_small_value(abs_err, small_value_mask, small_value_atol): ''' - 新精度标准的相对阈值法中,检查npu和golden小值域输出的相对误差是否满足阈值 + 新精度标准的绝对阈值法中,检查npu和golden正常值输出的绝对误差是否满足阈值 输入: - rel_err:npu输出和golden输出的相对误差 + abs_err:npu输出和golden输出的绝对误差 normal_value_mask:npu输出和golden输出的正常值mask - rtol:相对误差的阈值 + atol:绝对误差的阈值 输出: - rel_err_ratio:npu输出和golden输出的相对误差不满足阈值的比例 + abs_err_ratio:npu输出和golden输出的绝对误差不满足阈值的比例 ''' greater_mask = np.greater(abs_err, small_value_atol) err_mask = np.logical_and(greater_mask, small_value_mask) @@ -195,13 +197,13 @@ def check_small_value(abs_err, small_value_mask, small_value_atol): def check_norm_value(normal_value_mask, rel_err, rtol): ''' - 新精度标准的绝对阈值法中,检查npu和golden正常值输出的绝对误差是否满足阈值 + 新精度标准的相对阈值法中,检查npu和golden小值域输出的相对误差是否满足阈值 输入: - abs_err:npu输出和golden输出的绝对误差 + rel_err:npu输出和golden输出的相对误差 normal_value_mask:npu输出和golden输出的正常值mask - atol:绝对误差的阈值 + rtol:相对误差的阈值 输出: - abs_err_ratio:npu输出和golden输出的绝对误差不满足阈值的比例 + rel_err_ratio:npu输出和golden输出的相对误差不满足阈值的比例 ''' err_mask = np.greater(rel_err, rtol) err_mask = np.logical_and(err_mask, normal_value_mask) @@ -228,3 +230,34 @@ def get_ulp_err(bench_output, device_output, dtype): def calc_ulp_err(bench_output, device_output, eb, exponent_num, data_type): return (device_output.astype(data_type) - bench_output).astype(data_type) * \ np.exp2(-eb + exponent_num).astype(data_type) + + +def calc_ratio(x, y, dtype): + """ + Calculate the ratio between NPU and GPU statistical values. 
+ + Args: + x (float): Statistical value from the NPU side + y (float): Statistical value from the GPU side + dtype: Data type used to determine the minimum error value + + Returns: + float: The ratio of NPU to GPU statistical values + + Notes: + - Takes absolute values of both x and y for calculation + - Uses StandardConfig.get_minmum_err(dtype) to get minimum error for the specified dtype + - Prevents division by zero by ensuring denominator is not less than minimum error + - Returns |x| / max(|y|, minimum_error) + """ + x, y = abs(x), abs(y) + minmum_err = StandardConfig.get_minmum_err(dtype) + err_y = max(y, minmum_err) + return x / err_y + + +def compare_bool_tensor(bench_output, device_output): + error_nums = (bench_output != device_output).sum() + error_rate = float(error_nums / bench_output.size) + result = CompareConst.PASS if error_rate == 0 else CompareConst.ERROR + return error_rate, result, "" diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py index 7767bbc0766445ba28516c8a60a3fc94e3783690..8f7db73b58f42a4a64728bb0f12d25cf6f9f9ebe 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py @@ -29,11 +29,15 @@ from msprobe.pytorch.api_accuracy_checker.common.config import msCheckerConfig from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import API_PRECISION_COMPARE_RESULT_FILE_NAME, \ API_PRECISION_COMPARE_DETAILS_FILE_NAME, BENCHMARK_COMPARE_SUPPORT_LIST, API_PRECISION_COMPARE_UNSUPPORT_LIST, \ ApiPrecisionCompareColumn, absolute_standard_api, binary_standard_api, ulp_standard_api, thousandth_standard_api, \ - BINARY_COMPARE_UNSUPPORT_LIST, ULP_COMPARE_SUPPORT_LIST, convert_str_to_float, CompareMessage, is_inf_or_nan, \ - check_inf_or_nan + 
BINARY_COMPARE_UNSUPPORT_LIST, ULP_COMPARE_SUPPORT_LIST, convert_str_to_float, CompareMessage +from msprobe.pytorch.api_accuracy_checker.compare.compare_input import PrecisionCompareInput +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_register import StandardRegistry +from msprobe.pytorch.api_accuracy_checker.precision_standard.ulp_compare import UlpPrecisionCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.benchmark_compare import BenchmarkPrecisionCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig from msprobe.pytorch.api_accuracy_checker.compare.compare_column import ApiPrecisionOutputColumn from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import get_validated_result_csv_path -from msprobe.pytorch.api_accuracy_checker.common.utils import extract_detailed_api_segments +from msprobe.pytorch.api_accuracy_checker.common.utils import extract_detailed_api_segments, extract_basic_api_segments from msprobe.core.common.file_utils import FileChecker, change_mode, create_directory from msprobe.pytorch.common.log import logger from msprobe.core.common.utils import CompareException @@ -47,30 +51,6 @@ BenchmarkInfNanConsistency = namedtuple('BenchmarkInfNanConsistency', ['small_va 'eb_inf_nan_consistency']) UNSUPPORTED_MESSAGE = 'This data type does not support benchmark compare.' 
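The deletions below (the `benchmark_algorithms_thresholds` table and the `Standard`/`BenchmarkStandard`/`ULPStandard` classes) move threshold data into `StandardConfig` and dispatch into `StandardRegistry`. A hedged sketch of that registry pattern, with placeholder standard keys and api sets standing in for the real `CompareConst` values and the yaml-driven api lists:

```python
# Toy StandardRegistry: the real class is keyed by CompareConst values and
# consults yaml-driven api lists; the sets and keys below are placeholders.

ABSOLUTE_STANDARD_API = {'mul', 'div'}
BINARY_STANDARD_API = {'bitwise_and', 'equal'}

class StandardRegistry:
    def __init__(self):
        self._funcs = {}

    def register(self, standard, func):
        self._funcs[standard] = func

    def get_comparison_function(self, api_name, dtype=None):
        # Selection order mirrors the old if/elif chain: explicit api lists
        # first, then fall back to the benchmark standard.
        if api_name in ABSOLUTE_STANDARD_API:
            return self._funcs['absolute_threshold']
        if api_name in BINARY_STANDARD_API:
            return self._funcs['binary_consistency']
        return self._funcs['benchmark']

registry = StandardRegistry()
registry.register('absolute_threshold', lambda input_data: 'absolute')
registry.register('binary_consistency', lambda input_data: 'binary')
registry.register('benchmark', lambda input_data: 'benchmark')
```

`analyse_csv` then only builds a `PrecisionCompareInput` and calls `registry.get_comparison_function(api_name, dtype)(input_data)`, so adding a new precision standard becomes one `register` call plus one record function instead of another branch in `get_api_status`.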
-DEFAULT_THRESHOLD = 1 - -benchmark_algorithms_thresholds = { - 'small_value': { - 'error_threshold': 2, - 'warning_threshold': 1 - }, - 'rmse': { - 'error_threshold': 2, - 'warning_threshold': 1 - }, - 'max_rel_err': { - 'error_threshold': 10, - 'warning_threshold': 1 - }, - 'mean_rel_err': { - 'error_threshold': 2, - 'warning_threshold': 1 - }, - 'eb': { - 'error_threshold': 2, - 'warning_threshold': 1 - } -} benchmark_message = { "small_value_err_status": { @@ -92,189 +72,6 @@ benchmark_message = { } -class Standard: - @staticmethod - def _calc_ratio(column_name, x, y, default_value): - ''' - 计算npu侧和gpu侧统计量的比值 - 输入: - column_name:统计量名称 - x:npu侧统计量 - y:gpu侧统计量 - default:当x不接近0,y接近0,设置的比值默认值 - 输出: - ratio:统计量x和y的比值 - inf_nan_consistency:不出现inf或nan时为True,出现inf或nan时必须同时为inf或-inf或nan才为True,否则为False - message:当出现inf或nan时的提示信息 - ''' - x, y = convert_str_to_float(x), convert_str_to_float(y) - - if is_inf_or_nan(x) or is_inf_or_nan(y): - return check_inf_or_nan(x, y, column_name) - - inf_nan_consistency = True - message = "" - if math.isclose(y, 0.0): - if math.isclose(x, 0.0): - return 1.0, inf_nan_consistency, message - else: - return default_value, inf_nan_consistency, message - else: - return abs(x / y), inf_nan_consistency, message - - -class BenchmarkStandard(Standard): - def __init__(self, api_name, npu_precision, gpu_precision): - self.api_name = api_name - self.npu_precision = npu_precision - self.gpu_precision = gpu_precision - self.small_value_err_ratio = 1 - self.rmse_ratio = 1 - self.max_rel_err_ratio = 1 - self.mean_rel_err_ratio = 1 - self.eb_ratio = 1 - self.small_value_err_status = CompareConst.PASS - self.rmse_status = CompareConst.PASS - self.max_rel_err_status = CompareConst.PASS - self.mean_rel_err_status = CompareConst.PASS - self.eb_status = CompareConst.PASS - self.check_result_list = [] - self.final_result = CompareConst.PASS - self.compare_message = "" - - def __str__(self): - return "%s" % (self.api_name) - - @staticmethod - def 
_get_status(ratio, algorithm): - if math.isnan(ratio) or math.isinf(ratio): - return CompareConst.PASS - error_threshold = benchmark_algorithms_thresholds.get(algorithm, {}).get('error_threshold', DEFAULT_THRESHOLD) - warning_threshold = benchmark_algorithms_thresholds.get(algorithm, {}).get('warning_threshold', - DEFAULT_THRESHOLD) - if ratio > error_threshold: - return CompareConst.ERROR - elif ratio > warning_threshold: - return CompareConst.WARNING - return CompareConst.PASS - - def get_result(self): - inf_nan_consistency = self._compare_ratio() - small_value_inf_nan_consistency = inf_nan_consistency.small_value_inf_nan_consistency - rmse_inf_nan_consistency = inf_nan_consistency.rmse_inf_nan_consistency - max_rel_inf_nan_consistency = inf_nan_consistency.max_rel_inf_nan_consistency - mean_rel_inf_nan_consistency = inf_nan_consistency.mean_rel_inf_nan_consistency - eb_inf_nan_consistency = inf_nan_consistency.eb_inf_nan_consistency - self.small_value_err_status = self._get_status(self.small_value_err_ratio, 'small_value') if \ - small_value_inf_nan_consistency else CompareConst.ERROR - self.check_result_list.append(self.small_value_err_status) - self.rmse_status = self._get_status(self.rmse_ratio, 'rmse') if rmse_inf_nan_consistency \ - else CompareConst.ERROR - self.check_result_list.append(self.rmse_status) - self.max_rel_err_status = self._get_status( - self.max_rel_err_ratio, 'max_rel_err') if max_rel_inf_nan_consistency else CompareConst.ERROR - self.check_result_list.append(self.max_rel_err_status) - self.mean_rel_err_status = self._get_status( - self.mean_rel_err_ratio, 'mean_rel_err') if mean_rel_inf_nan_consistency else CompareConst.ERROR - self.check_result_list.append(self.mean_rel_err_status) - self.eb_status = self._get_status(self.eb_ratio, 'eb') - if CompareConst.ERROR in self.check_result_list: - self.final_result = CompareConst.ERROR - elif CompareConst.WARNING in self.check_result_list: - self.final_result = CompareConst.WARNING - - def 
to_column_value(self): - return [self.small_value_err_ratio, self.small_value_err_status, self.rmse_ratio, - self.rmse_status, self.max_rel_err_ratio, self.max_rel_err_status, self.mean_rel_err_ratio, - self.mean_rel_err_status, self.eb_ratio, self.eb_status] - - def _compare_ratio(self): - - self.small_value_err_ratio, small_value_inf_nan_consistency, small_value_message = self._calc_ratio( - ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE, - self.npu_precision.get(ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE), - self.gpu_precision.get(ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE), 10000.0) - self.compare_message += small_value_message - self.rmse_ratio, rmse_inf_nan_consistency, rmse_message = self._calc_ratio(ApiPrecisionCompareColumn.RMSE, - self.npu_precision.get(ApiPrecisionCompareColumn.RMSE), - self.gpu_precision.get(ApiPrecisionCompareColumn.RMSE), 10000.0) - self.compare_message += rmse_message - self.max_rel_err_ratio, max_rel_inf_nan_consistency, max_rel_message = self._calc_ratio( - ApiPrecisionCompareColumn.MAX_REL_ERR, - self.npu_precision.get(ApiPrecisionCompareColumn.MAX_REL_ERR), - self.gpu_precision.get(ApiPrecisionCompareColumn.MAX_REL_ERR), 10000.0) - self.compare_message += max_rel_message - self.mean_rel_err_ratio, mean_rel_inf_nan_consistency, mean_rel_message = self._calc_ratio( - ApiPrecisionCompareColumn.MEAN_REL_ERR, - self.npu_precision.get(ApiPrecisionCompareColumn.MEAN_REL_ERR), - self.gpu_precision.get(ApiPrecisionCompareColumn.MEAN_REL_ERR), 10000.0) - self.compare_message += mean_rel_message - self.eb_ratio, eb_inf_nan_consistency, eb_message = self._calc_ratio(ApiPrecisionCompareColumn.EB, - self.npu_precision.get(ApiPrecisionCompareColumn.EB), - self.gpu_precision.get(ApiPrecisionCompareColumn.EB), 10000.0) - self.compare_message += eb_message - - return BenchmarkInfNanConsistency(small_value_inf_nan_consistency, rmse_inf_nan_consistency, - max_rel_inf_nan_consistency, mean_rel_inf_nan_consistency, - 
eb_inf_nan_consistency) - - -class ULPStandard(Standard): - def __init__(self, api_name, npu_precision, gpu_precision): - self.api_name = api_name - self.npu_precision = npu_precision - self.gpu_precision = gpu_precision - self.mean_ulp_err = 0 - self.ulp_err_proportion = 0 - self.ulp_err_proportion_ratio = 1 - self.ulp_err_status = CompareConst.PASS - self.compare_message = "" - - def __str__(self): - return f"{self.api_name}" - - def get_result(self): - self.mean_ulp_err = convert_str_to_float(self.npu_precision.get(ApiPrecisionCompareColumn.MEAN_ULP_ERR)) - gpu_mean_ulp_err = convert_str_to_float(self.gpu_precision.get(ApiPrecisionCompareColumn.MEAN_ULP_ERR)) - inf_nan_consistency = True - if is_inf_or_nan(self.mean_ulp_err) or is_inf_or_nan(gpu_mean_ulp_err): - _, inf_nan_consistency, message = check_inf_or_nan(self.mean_ulp_err, gpu_mean_ulp_err, - ApiPrecisionCompareColumn.MEAN_ULP_ERR) - self.compare_message += message - self.ulp_err_proportion = convert_str_to_float( - self.npu_precision.get(ApiPrecisionCompareColumn.ULP_ERR_PROPORTION)) - self.ulp_err_proportion_ratio, ulp_inf_nan_consistency, message = self._calc_ratio( - ApiPrecisionCompareColumn.ULP_ERR_PROPORTION, - self.npu_precision.get(ApiPrecisionCompareColumn.ULP_ERR_PROPORTION), - self.gpu_precision.get(ApiPrecisionCompareColumn.ULP_ERR_PROPORTION), 10000.0) - inf_nan_consistency = inf_nan_consistency and ulp_inf_nan_consistency - self.compare_message += message - if inf_nan_consistency: - self.ulp_err_status = self._get_ulp_status(self.npu_precision.get(ApiPrecisionCompareColumn.DEVICE_DTYPE)) - else: - self.ulp_err_status = CompareConst.ERROR - - def _get_ulp_status(self, dtype): - if dtype == torch.float32: - if self.mean_ulp_err < 64: - return CompareConst.PASS - elif self.ulp_err_proportion < 0.05: - return CompareConst.PASS - elif self.ulp_err_proportion_ratio < 1: - return CompareConst.PASS - else: - self.compare_message += "ERROR: ULP误差不满足标准\n" - return CompareConst.ERROR - else: - if 
self.ulp_err_proportion < 0.001: - return CompareConst.PASS - elif self.ulp_err_proportion_ratio < 1: - return CompareConst.PASS - else: - self.compare_message += "ERROR: ULP误差不满足标准\n" - return CompareConst.ERROR - - def write_detail_csv(content, save_path): rows = [] content = ["{:.{}f}".format(item, msCheckerConfig.precision) \ @@ -283,6 +80,17 @@ def write_detail_csv(content, save_path): write_csv(rows, save_path) +def register_compare_func(): + registry = StandardRegistry() + registry.register(CompareConst.ABSOLUTE_THRESHOLD, record_absolute_threshold_result) + registry.register(CompareConst.BINARY_CONSISTENCY, record_binary_consistency_result) + registry.register(CompareConst.ULP_COMPARE, record_ulp_compare_result) + registry.register(CompareConst.THOUSANDTH_STANDARD, record_thousandth_threshold_result) + registry.register(CompareConst.BENCHMARK, record_benchmark_compare_result) + registry.register(CompareConst.ACCUMULATIVE_ERROR_COMPARE, record_accumulative_error_compare_result) + return registry + + def api_precision_compare(config): logger.info("Start compare task") logger.info(f"Compare task result will be saved in {config.result_csv_path}") @@ -337,6 +145,8 @@ def analyse_csv(npu_data, gpu_data, config): forward_status, backward_status = [], [] last_api_name, last_api_dtype, last_api_full_name = None, None, None last_api_skip_message = '' + registry = register_compare_func() + for _, row_npu in npu_data.iterrows(): message = '' compare_column = ApiPrecisionOutputColumn() @@ -362,7 +172,7 @@ def analyse_csv(npu_data, gpu_data, config): row_gpu = row_gpu.iloc[0] new_status = CompareConst.SPACE try: - new_status = get_api_status(row_npu, row_gpu, api_name, compare_column) + new_status = get_api_status(row_npu, row_gpu, api_name, compare_column, registry) except Exception as err: logger.error(f"Get api status error: {str(err)}") compare_column.api_name = full_api_name_with_direction_status @@ -383,7 +193,8 @@ def analyse_csv(npu_data, gpu_data, config): else: 
forward_result = get_api_checker_result(forward_status) backward_result = get_api_checker_result(backward_status) - message += CompareMessage.get(last_api_name, "") if forward_result == CompareConst.ERROR else "" + _, base_api_name = extract_basic_api_segments(last_api_name) + message += CompareMessage.get(base_api_name, "") if forward_result == CompareConst.ERROR else "" message += last_api_skip_message if forward_result == CompareConst.SKIP else "" write_csv([[last_api_name, forward_result, backward_result, message]], config.result_csv_path) print_test_success(last_api_name, forward_result, backward_result) @@ -415,37 +226,30 @@ def analyse_csv(npu_data, gpu_data, config): else: forward_result = get_api_checker_result(forward_status) backward_result = get_api_checker_result(backward_status) - message += CompareMessage.get(last_api_name, "") if forward_result == CompareConst.ERROR else "" + _, base_api_name = extract_basic_api_segments(last_api_name) + message += CompareMessage.get(base_api_name, "") if forward_result == CompareConst.ERROR else "" message += last_api_skip_message if forward_result == CompareConst.SKIP else "" write_csv([[last_api_name, forward_result, backward_result, message]], config.result_csv_path) print_test_success(last_api_name, forward_result, backward_result) last_api_skip_message = '' -def get_api_status(row_npu, row_gpu, api_name, compare_column): +def get_api_status(row_npu, row_gpu, api_name, compare_column, registry): full_api_name_with_direction_status = row_npu[ApiPrecisionCompareColumn.API_NAME] # 当前API的输出为空(例如反向过程中requires_grad=False),跳过比对 - if row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE].isspace(): + if row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE].isspace() or \ + row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] in API_PRECISION_COMPARE_UNSUPPORT_LIST or \ + row_npu[ApiPrecisionCompareColumn.SHAPE] == CompareConst.ZERO_SHAPE: compare_column.api_name = full_api_name_with_direction_status compare_column.compare_result = 
CompareConst.SKIP compare_column.compare_message = row_npu[ApiPrecisionCompareColumn.MESSAGE] new_status = CompareConst.SKIP else: compare_column.api_name = full_api_name_with_direction_status - if api_name in thousandth_standard_api: - new_status = record_thousandth_threshold_result(compare_column, row_npu) - elif row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] not in BINARY_COMPARE_UNSUPPORT_LIST or \ - api_name in binary_standard_api: - new_status = record_binary_consistency_result(api_name, compare_column, row_npu) - elif api_name in absolute_standard_api: - new_status = record_absolute_threshold_result(compare_column, row_npu) - elif api_name in ulp_standard_api and \ - row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] in ULP_COMPARE_SUPPORT_LIST: - us = ULPStandard(full_api_name_with_direction_status, row_npu, row_gpu) - new_status = record_ulp_compare_result(compare_column, us) - elif row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] in BENCHMARK_COMPARE_SUPPORT_LIST: - bs = BenchmarkStandard(full_api_name_with_direction_status, row_npu, row_gpu) - new_status = record_benchmark_compare_result(compare_column, bs) + dtype = row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] + input_data = PrecisionCompareInput(row_npu, row_gpu, dtype, compare_column) + comparison_func = registry.get_comparison_function(api_name, dtype) + new_status = comparison_func(input_data) return new_status @@ -505,21 +309,24 @@ def check_csv_columns(columns, csv_type): raise CompareException(CompareException.INVALID_DATA_ERROR, msg) -def record_binary_consistency_result(api_name, compare_column, row_npu): +def record_binary_consistency_result(input_data): + row_npu = input_data.row_npu + compare_column = input_data.compare_column new_status = check_error_rate(row_npu[ApiPrecisionCompareColumn.ERROR_RATE]) compare_column.error_rate = row_npu[ApiPrecisionCompareColumn.ERROR_RATE] compare_column.error_rate_status = new_status compare_column.compare_result = new_status - 
compare_column.compare_algorithm = "二进制一致法" + compare_column.compare_algorithm = CompareConst.BINARY_CONSISTENCY_ALGORITHM_NAME message = '' if compare_column.error_rate_status == CompareConst.ERROR: message += "ERROR: 二进制一致错误率超过阈值\n" - message += CompareMessage.get(api_name, "") compare_column.compare_message = message return new_status -def record_absolute_threshold_result(compare_column, row_npu): +def record_absolute_threshold_result(input_data): + row_npu = input_data.row_npu + compare_column = input_data.compare_column absolute_threshold_result = get_absolute_threshold_result(row_npu) compare_column.inf_nan_error_ratio = absolute_threshold_result.get("inf_nan_error_ratio") compare_column.inf_nan_error_ratio_status = absolute_threshold_result.get("inf_nan_result") @@ -528,62 +335,88 @@ def record_absolute_threshold_result(compare_column, row_npu): compare_column.abs_err_ratio = absolute_threshold_result.get("abs_err_ratio") compare_column.abs_err_ratio_status = absolute_threshold_result.get("abs_err_result") compare_column.compare_result = absolute_threshold_result.get("absolute_threshold_result") - compare_column.compare_algorithm = "绝对阈值法" + compare_column.compare_algorithm = CompareConst.ABSOLUTE_THRESHOLD_ALGORITHM_NAME message = '' if compare_column.inf_nan_error_ratio_status == CompareConst.ERROR: - message += "ERROR: inf/nan错误率超过阈值\n" + message += "ERROR: inf/nan错误率超过阈值" if compare_column.rel_err_ratio_status == CompareConst.ERROR: - message += "ERROR: 相对误差错误率超过阈值\n" + message += "ERROR: 相对误差错误率超过阈值" if compare_column.abs_err_ratio_status == CompareConst.ERROR: - message += "ERROR: 绝对误差错误率超过阈值\n" + message += "ERROR: 绝对误差错误率超过阈值" compare_column.compare_message = message return compare_column.compare_result -def record_benchmark_compare_result(compare_column, bs): - bs.get_result() - compare_column.small_value_err_ratio = bs.small_value_err_ratio - compare_column.small_value_err_status = bs.small_value_err_status - compare_column.rmse_ratio = 
bs.rmse_ratio - compare_column.rmse_status = bs.rmse_status - compare_column.max_rel_err_ratio = bs.max_rel_err_ratio - compare_column.max_rel_err_status = bs.max_rel_err_status - compare_column.mean_rel_err_ratio = bs.mean_rel_err_ratio - compare_column.mean_rel_err_status = bs.mean_rel_err_status - compare_column.eb_ratio = bs.eb_ratio - compare_column.eb_status = bs.eb_status - compare_column.compare_result = bs.final_result - compare_column.compare_algorithm = "标杆比对法" - compare_column.compare_message = bs.compare_message +def record_benchmark_compare_result(input_data): + bs = BenchmarkPrecisionCompare(input_data) + compare_result = bs.compare() for status_attr, messages in benchmark_message.items(): - status_value = getattr(compare_column, status_attr) + status_value = getattr(input_data.compare_column, status_attr) if status_value in messages: - compare_column.compare_message += messages[status_value] - return compare_column.compare_result + input_data.compare_column.compare_message += messages[status_value] + return compare_result + +def record_ulp_compare_result(input_data): + us = UlpPrecisionCompare(input_data) + compare_result = us.compare() + return compare_result -def record_ulp_compare_result(compare_column, us): - us.get_result() - compare_column.mean_ulp_err = us.mean_ulp_err - compare_column.ulp_err_proportion = us.ulp_err_proportion - compare_column.ulp_err_proportion_ratio = us.ulp_err_proportion_ratio - compare_column.ulp_err_status = us.ulp_err_status - compare_column.compare_result = us.ulp_err_status - compare_column.compare_algorithm = "ULP误差比对法" - compare_column.compare_message = us.compare_message + +def record_accumulative_error_compare_result(input_data): + row_npu = input_data.row_npu + compare_column = input_data.compare_column + absolute_threshold_result = get_absolute_threshold_result(row_npu) + threshold_result = absolute_threshold_result.get("absolute_threshold_result") + eb, eb_result = check_eb(row_npu) + 
accumulative_error_compare_result = CompareConst.PASS + if CompareConst.ERROR in [threshold_result, eb_result]: + accumulative_error_compare_result = CompareConst.ERROR + + compare_column.inf_nan_error_ratio = absolute_threshold_result.get("inf_nan_error_ratio") + compare_column.inf_nan_error_ratio_status = absolute_threshold_result.get("inf_nan_result") + compare_column.rel_err_ratio = absolute_threshold_result.get("rel_err_ratio") + compare_column.rel_err_ratio_status = absolute_threshold_result.get("rel_err_result") + compare_column.abs_err_ratio = absolute_threshold_result.get("abs_err_ratio") + compare_column.abs_err_ratio_status = absolute_threshold_result.get("abs_err_result") + compare_column.eb_ratio = eb + compare_column.eb_status = eb_result + compare_column.compare_result = accumulative_error_compare_result + compare_column.compare_algorithm = CompareConst.ACCUMULATIVE_ERROR_COMPARE_ALGORITHM_NAME + message = [] + if compare_column.inf_nan_error_ratio_status == CompareConst.ERROR: + message.append("ERROR: inf/nan错误率超过阈值\n") + if compare_column.rel_err_ratio_status == CompareConst.ERROR: + message.append("ERROR: 相对误差错误率超过阈值\n") + if compare_column.abs_err_ratio_status == CompareConst.ERROR: + message.append("ERROR: 绝对误差错误率超过阈值\n") + if compare_column.eb_status == CompareConst.ERROR: + message.append("ERROR: 误差均衡性超过阈值\n") + compare_column.compare_message = "\n".join(message) return compare_column.compare_result +def check_eb(row_npu): + eb = convert_str_to_float(row_npu[ApiPrecisionCompareColumn.EB]) + dtype = row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] + eb_threshold = StandardConfig.get_accumulative_error_eb_threshold(dtype) + eb_result = CompareConst.PASS if eb <= eb_threshold else CompareConst.ERROR + return eb, eb_result + + def check_thousandth_rate(thousandth_rate): - return CompareConst.PASS if convert_str_to_float(thousandth_rate) >= 0.999 else CompareConst.ERROR + return CompareConst.PASS if convert_str_to_float(thousandth_rate) >= 
CompareConst.THOUSANDTH_PASS_VALUE \ + else CompareConst.ERROR -def record_thousandth_threshold_result(compare_column, row_npu): +def record_thousandth_threshold_result(input_data): + row_npu = input_data.row_npu + compare_column = input_data.compare_column new_status = check_thousandth_rate(row_npu[ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH]) compare_column.rel_err_thousandth = row_npu[ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH] compare_column.rel_err_thousandth_status = new_status compare_column.compare_result = new_status - compare_column.compare_algorithm = "双千指标法" + compare_column.compare_algorithm = CompareConst.THOUSANDTH_STANDARD_ALGORITHM_NAME message = '' if compare_column.rel_err_thousandth_status == CompareConst.ERROR: message += "ERROR: 双千指标不达标\n" @@ -620,7 +453,7 @@ def _api_precision_compare_parser(parser): parser.add_argument("-gpu", "--gpu_csv_path", dest="gpu_csv_path", default="", type=str, help=" Accuracy_checking_details.csv generated on the GPU by using the " "api_accuracy_checker tool.", - required=False) + required=True) parser.add_argument("-o", "--out_path", dest="out_path", default="", type=str, help=" The api precision compare task result out path.", required=False) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml index 91170bd525b6cb8e7cb531a3c6e88888f9a628eb..1175c1ed42c80135e4033823fa94d2e189da1d12 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml @@ -66,6 +66,7 @@ BinaryCompareStandard: - greater_ - greater_equal - greater_equal_ + - histc - isfinite - isnan - less @@ -130,4 +131,6 @@ ULPStandard: ThousandthStandard: - conv1d - conv2d - \ No newline at end of file + +AccumulativeErrorStandard: + - test_api diff --git 
a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare.py index c40a43a51133dce878a585b3158e38cfb34a0270..cf5928e509e3138ea762cd9d7af6fc26a5d2c5c9 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare.py @@ -24,15 +24,20 @@ from msprobe.core.common.utils import CompareException from msprobe.core.common.file_utils import get_json_contents, write_csv import torch from msprobe.core.common.const import CompareConst -from msprobe.pytorch.api_accuracy_checker.compare.algorithm import get_rmse, get_error_balance, get_max_rel_err, \ - get_mean_rel_err, get_rel_err, get_abs_err, get_max_abs_err, get_rel_err_ratio, cosine_sim, get_rel_err_origin, \ - get_small_value_err_ratio, get_finite_and_infinite_mask, get_small_value_mask, check_inf_nan_value, \ - check_small_value, check_norm_value, get_abs_bench_with_eps, get_ulp_err +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_register import StandardRegistry +from msprobe.pytorch.api_accuracy_checker.precision_standard.absolute_threshold import AbsolutethdCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.benchmark_compare import BenchmarkCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.ulp_compare import UlpCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.binary_consistency import BinaryCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.thousandth_standard import ThousandthStdCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.accumulative_error_compare import AccumulativeErrorCompare +from msprobe.pytorch.api_accuracy_checker.compare.compare_input import CompareInput +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import get_abs_err, get_max_abs_err, get_rel_err_ratio, \ + cosine_sim, 
get_rel_err_origin, get_abs_bench_with_eps, compare_bool_tensor from msprobe.pytorch.api_accuracy_checker.common.config import msCheckerConfig from msprobe.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import check_dtype_comparable, \ - DETAIL_TEST_ROWS, precision_configs, BENCHMARK_COMPARE_SUPPORT_LIST, absolute_standard_api, binary_standard_api, \ - ulp_standard_api, thousandth_standard_api, apis_threshold + DETAIL_TEST_ROWS, BENCHMARK_COMPARE_SUPPORT_LIST from msprobe.pytorch.api_accuracy_checker.common.utils import extract_basic_api_segments from msprobe.pytorch.common.log import logger @@ -42,6 +47,7 @@ ResultInfo = namedtuple('ResultInfo', ['full_api_name', 'fwd_success_status', 'b INDEX_TEST_RESULT_GROUP = 3 +BACKWARD_RESULT_GROUP = 4 INDEX_FIRST_GROUP = 0 INDEX_MESSAGE = -1 @@ -66,6 +72,8 @@ class Comparator: self.detail_save_path_list = \ [self.detail_save_path_str.format(rank) for rank in config.online_config.rank_list] + self.registry = self._register_compare_func() + if not is_continue_run_ut: self.write_csv_title() if stack_info_json_path: @@ -101,22 +109,6 @@ class Comparator: compare_column.error_rate = 0 return CompareConst.PASS, compare_column, "" - @staticmethod - def _compare_bool_tensor(bench_output, device_output): - error_nums = (bench_output != device_output).sum() - if bench_output.size == 0: - return CompareConst.NAN, CompareConst.ERROR, "There is not bench calculation result." 
- error_rate = float(error_nums / bench_output.size) - result = CompareConst.PASS if error_rate == 0 else CompareConst.ERROR - return error_rate, result, "" - - @staticmethod - def _get_absolute_threshold_attribute(api_name, dtype): - small_value_threshold = apis_threshold.get(api_name).get(dtype).get('small_value') - small_value_atol = apis_threshold.get(api_name).get(dtype).get('small_value_atol') - rtol = apis_threshold.get(api_name).get(dtype).get('rtol') - return small_value_threshold, small_value_atol, rtol - @staticmethod def _get_run_ut_detail(test_result): """get run_ut detail before write to csv, called by online run_ut""" @@ -143,6 +135,36 @@ class Comparator: test_rows.append([subject] + list(test_subject)) return test_rows + @staticmethod + def _binary_standard_compare(input_data): + binary_compare = BinaryCompare(input_data) + binary_compare.compare() + + @staticmethod + def _thousandth_standard_compare(input_data): + thousandth_compare = ThousandthStdCompare(input_data) + thousandth_compare.compare() + + @staticmethod + def _absolute_standard_compare(input_data): + absolute_compare = AbsolutethdCompare(input_data) + absolute_compare.compare() + + @staticmethod + def _ulp_compare(input_data): + ulp_compare = UlpCompare(input_data) + ulp_compare.compare() + + @staticmethod + def _benchmark_compare(input_data): + benchmark_compare = BenchmarkCompare(input_data) + benchmark_compare.compare() + + @staticmethod + def _accumulative_error_compare(input_data): + accumulative_error_compare = AccumulativeErrorCompare(input_data) + accumulative_error_compare.compare() + def write_csv_title(self): summary_test_rows = [ [self.COLUMN_API_NAME, @@ -163,6 +185,8 @@ class Comparator: df_row = list(test_result[:INDEX_TEST_RESULT_GROUP]) if test_result[1] == CompareConst.SKIP: df_row.append(test_result[INDEX_TEST_RESULT_GROUP][INDEX_FIRST_GROUP][INDEX_MESSAGE]) + elif test_result[2] == CompareConst.SKIP: + 
df_row.append(test_result[BACKWARD_RESULT_GROUP][INDEX_FIRST_GROUP][INDEX_MESSAGE]) if self.stack_info: stack_info = "\n".join(self.stack_info[name]) df_row.append(stack_info) @@ -211,6 +235,7 @@ class Comparator: if backward_message: backward_column = CompareColumn() bwd_compare_alg_results = [backward_column.to_column_value(CompareConst.SKIP, backward_message)] + bwd_success_status = CompareConst.SKIP else: bwd_success_status = bwd_success_status if bwd_compare_alg_results is not None else CompareConst.SPACE result_info = ResultInfo(full_api_name, @@ -226,6 +251,16 @@ class Comparator: return fwd_success_status == CompareConst.PASS, bwd_success_status == CompareConst.PASS \ or bwd_success_status == CompareConst.SPACE + def _register_compare_func(self): + registry = StandardRegistry() + registry.register(CompareConst.ABSOLUTE_THRESHOLD, self._absolute_standard_compare) + registry.register(CompareConst.BINARY_CONSISTENCY, self._binary_standard_compare) + registry.register(CompareConst.ULP_COMPARE, self._ulp_compare) + registry.register(CompareConst.THOUSANDTH_STANDARD, self._thousandth_standard_compare) + registry.register(CompareConst.BENCHMARK, self._benchmark_compare) + registry.register(CompareConst.ACCUMULATIVE_ERROR_COMPARE, self._accumulative_error_compare) + return registry + def _compare_core_wrapper(self, api_name, bench_output, device_output): detailed_result_total = [] test_final_success = CompareConst.PASS @@ -308,11 +343,13 @@ class Comparator: return CompareConst.ERROR, compare_column, f"Bench out dtype is {bench_output.dtype} but " \ f"npu output dtype is {device_output.dtype}, cannot compare." message = "" + if bench_output.size == 0: + return CompareConst.ERROR, compare_column, "There is not bench calculation result." if bench_output.dtype in [bool, np.uint8, np.int8, np.int16, np.uint16, np.uint32, np.int32, np.int64, np.uint64]: message += f"Compare algorithm is not supported for {bench_output.dtype} data. " \ f"Only judged by Error Rate." 
- err_rate, status, msg = self._compare_bool_tensor(bench_output, device_output) + err_rate, status, msg = compare_bool_tensor(bench_output, device_output) message += msg + "\n" compare_column.error_rate = err_rate return status, compare_column, message @@ -321,56 +358,20 @@ class Comparator: compare_column, npu_dtype) return status, compare_column, message + def _perform_comparison(self, api_name, input_data): + comparison_func = self.registry.get_comparison_function(api_name, None) + comparison_func(input_data) + def _compare_float_tensor(self, api_name, bench_output, device_output, compare_column, dtype): message = "" - abs_bench, abs_bench_with_eps = get_abs_bench_with_eps(bench_output, dtype) + _, abs_bench_with_eps = get_abs_bench_with_eps(bench_output, dtype) abs_err = get_abs_err(bench_output, device_output) rel_err_orign = get_rel_err_origin(abs_err, abs_bench_with_eps) - if api_name in thousandth_standard_api: - thousand_res, thousand_status = get_rel_err_ratio(rel_err_orign, CompareConst.THOUSAND_RATIO_THRESHOLD) - compare_column.rel_err_thousandth = thousand_res + input_data = CompareInput(bench_output, device_output, compare_column, dtype, rel_err_orign) if str(dtype) in BENCHMARK_COMPARE_SUPPORT_LIST: - both_finite_mask, inf_nan_mask = get_finite_and_infinite_mask(bench_output, device_output) - if api_name in binary_standard_api: - err_rate, _, _ = self._compare_bool_tensor(bench_output, device_output) - compare_column.error_rate = err_rate - elif api_name in absolute_standard_api: - small_value_threshold, small_value_atol, rtol = self._get_absolute_threshold_attribute( - api_name, str(dtype)) - rel_err = abs_err / abs_bench_with_eps - small_value_mask = get_small_value_mask(abs_bench, both_finite_mask, small_value_threshold) - normal_value_mask = np.logical_and(both_finite_mask, np.logical_not(small_value_mask)) - compare_column.inf_nan_error_ratio = check_inf_nan_value(inf_nan_mask, bench_output, device_output, - dtype, rtol) - 
compare_column.rel_err_ratio = check_norm_value(normal_value_mask, rel_err, rtol) - compare_column.abs_err_ratio = check_small_value(abs_err, small_value_mask, small_value_atol) - elif api_name in ulp_standard_api: - if bench_output.size == 0: - compare_column.max_ulp_error = 0 - compare_column.mean_ulp_error = 0 - compare_column.ulp_error_proportion = 0 - else: - ulp_err = get_ulp_err(bench_output, device_output, dtype) - compare_column.max_ulp_error = np.max(ulp_err) - compare_column.mean_ulp_error = np.mean(ulp_err) - if dtype == torch.float32: - compare_column.ulp_error_proportion = \ - np.sum(ulp_err > CompareConst.ULP_FLOAT32_THRESHOLD) / bench_output.size - else: - compare_column.ulp_error_proportion = \ - np.sum(ulp_err > CompareConst.ULP_FLOAT16_THRESHOLD) / bench_output.size - else: - dtype_config = precision_configs.get(dtype) - small_value_mask = get_small_value_mask(abs_bench, both_finite_mask, dtype_config['small_value'][0]) - abs_err_greater_mask = np.greater(abs_err, dtype_config['small_value_atol'][0]) - compare_column.small_value_err_ratio = get_small_value_err_ratio(small_value_mask, abs_err_greater_mask) - rel_err = get_rel_err(abs_err, abs_bench_with_eps, small_value_mask, inf_nan_mask) - compare_column.rmse = get_rmse(abs_err, np.logical_or(inf_nan_mask, small_value_mask)) - compare_column.eb = get_error_balance(bench_output, device_output) - if rel_err.size == 0: - return CompareConst.ERROR, compare_column, "Relative error result list is empty." - compare_column.max_rel_error = get_max_rel_err(rel_err) - compare_column.mean_rel_error = get_mean_rel_err(rel_err) + self._perform_comparison(api_name, input_data) + else: + message += f"The data type {dtype} is not supported for new precision standard." 
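The hunk above replaces the long `if api_name in …` chain in `_compare_float_tensor` with registry dispatch through `StandardRegistry`. A minimal, hypothetical sketch of that pattern (simplified names and a benchmark fallback; this is not the actual `StandardRegistry` API):

```python
# Hypothetical, simplified registry: maps a comparison standard to its handler,
# with a fallback to the benchmark handler when no specific standard matches.
class SimpleRegistry:
    def __init__(self):
        self._funcs = {}

    def register(self, standard, func):
        self._funcs[standard] = func

    def get_comparison_function(self, standard):
        # Unknown standards fall back to the benchmark handler.
        return self._funcs.get(standard, self._funcs["benchmark"])


def benchmark_compare(data):
    return ("benchmark", data)


def binary_compare(data):
    return ("binary", data)


registry = SimpleRegistry()
registry.register("benchmark", benchmark_compare)
registry.register("binary_consistency", binary_compare)

func = registry.get_comparison_function("binary_consistency")
print(func("t"))  # ('binary', 't')
```

Registering each standard once and resolving it per call keeps `_compare_float_tensor` free of per-standard branching, which is the point of the `_register_compare_func` / `_perform_comparison` split in the diff.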
cos_res, cos_status, msg = cosine_sim(bench_output, device_output) compare_column.cosine_sim = cos_res diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_column.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_column.py index b1cbc3234682e9106a38301e0b8035cca74f010b..976fb7f5f258eaa4e6a57caf596f5bbfc39acfa5 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_column.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_column.py @@ -16,9 +16,17 @@ # limitations under the License. from msprobe.core.common.const import CompareConst +from msprobe.pytorch.common.log import logger class CompareColumn: + __slots__ = [ + 'bench_type', 'npu_type', 'shape', 'cosine_sim', 'max_abs_err', 'rel_err_hundredth', + 'rel_err_ten_thousandth', 'inf_nan_error_ratio', 'rel_err_ratio', 'abs_err_ratio', + 'small_value_err_ratio', 'max_rel_error', 'mean_rel_error', 'rmse', 'eb', 'max_ulp_error', + 'mean_ulp_error', 'ulp_error_proportion', 'error_rate', 'rel_err_thousandth' + ] + def __init__(self): self.bench_type = CompareConst.SPACE self.npu_type = CompareConst.SPACE @@ -41,6 +49,24 @@ class CompareColumn: self.mean_ulp_error = CompareConst.SPACE self.ulp_error_proportion = CompareConst.SPACE + def update(self, metrics): + """ + Updates the object's attributes with the provided metrics. + + Args: + metrics (dict): A dictionary containing attribute names and their corresponding values. + + Raises: + AttributeError: If the metric key is not a valid attribute of CompareColumn. 
+ """ + for key, value in metrics.items(): + if value is None: + continue + if key not in self.__slots__: + logger.error(f"The key '{key}' is not a valid attribute of CompareColumn.") + continue + setattr(self, key, value) + def to_column_value(self, is_pass, message): return [self.bench_type, self.npu_type, self.shape, self.cosine_sim, self.max_abs_err, self.rel_err_hundredth, self.rel_err_thousandth, self.rel_err_ten_thousandth, self.error_rate, self.eb, self.rmse, @@ -50,6 +76,16 @@ class CompareColumn: class ApiPrecisionOutputColumn: + __slots__ = [ + 'api_name', 'small_value_err_ratio', 'small_value_err_status', 'rmse_ratio', 'rmse_status', + 'max_rel_err_ratio', 'max_rel_err_status', 'mean_rel_err_ratio', 'mean_rel_err_status', 'eb_ratio', + 'eb_status', 'inf_nan_error_ratio', 'inf_nan_error_ratio_status', 'rel_err_ratio', + 'rel_err_ratio_status', 'abs_err_ratio', 'abs_err_ratio_status', 'error_rate', 'error_rate_status', + 'mean_ulp_err', 'ulp_err_proportion', 'ulp_err_proportion_ratio', 'ulp_err_status', + 'rel_err_thousandth', 'rel_err_thousandth_status', 'compare_result', 'compare_algorithm', + 'compare_message' + ] + def __init__(self): self.api_name = CompareConst.SPACE self.small_value_err_ratio = CompareConst.SPACE @@ -80,6 +116,24 @@ class ApiPrecisionOutputColumn: self.compare_algorithm = CompareConst.SPACE self.compare_message = CompareConst.SPACE + def update(self, metrics): + """ + Updates the object's attributes with the provided metrics. + + Args: + metrics (dict): A dictionary containing attribute names and their corresponding values. + + Raises: + AttributeError: If the metric key is not a valid attribute of CompareColumn. 
+ """ + for key, value in metrics.items(): + if value is None: + continue + if key not in self.__slots__: + logger.error("The key '%s' is not a valid attribute of CompareColumn.", key) + continue + setattr(self, key, value) + def to_column_value(self): return [self.api_name, self.small_value_err_ratio, self.small_value_err_status, self.rmse_ratio, self.rmse_status, self.max_rel_err_ratio, self.max_rel_err_status, self.mean_rel_err_ratio, diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_input.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_input.py new file mode 100644 index 0000000000000000000000000000000000000000..8c21def9d859a3ec5637e396f9396527c5fc8979 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_input.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +class CompareInput: + """ + A class to encapsulate the input data required for comparison operations. + + Attributes: + bench_output (np.ndarray): The benchmark output values. + device_output (np.ndarray): The device output values. + compare_column (class): A clasee to store and update comparison metrics. + dtype (type, optional): The data type of the outputs. Defaults to None. 
+ rel_err_orign (float or array-like, optional): The original relative error values. Defaults to None. + + Methods: + __init__(bench_output, device_output, compare_column, dtype, rel_err_orign): + Initializes an instance of CompareInput. + """ + def __init__(self, bench_output, device_output, compare_column, dtype=None, rel_err_orign=None): + self.bench_output = bench_output + self.device_output = device_output + if not isinstance(bench_output, np.ndarray) or not isinstance(device_output, np.ndarray): + raise TypeError("The input should be numpy array") + self.compare_column = compare_column + self.dtype = dtype + self.rel_err_orign = rel_err_orign + + +class PrecisionCompareInput: + def __init__(self, row_npu, row_gpu, dtype, compare_column): + self.row_npu = row_npu + self.row_gpu = row_gpu + self.dtype = dtype + self.compare_column = compare_column diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py index e83b6160357b9822b20b33ed1813eddf11cf243c..549230d0a9e200283f545eed608a8da5df6a53a8 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py @@ -43,10 +43,7 @@ absolute_standard_api = apis.get('AbsoluteThreshStandard') binary_standard_api = apis.get('BinaryCompareStandard') ulp_standard_api = apis.get('ULPStandard') thousandth_standard_api = apis.get('ThousandthStandard') - - -threshold_yaml_path = os.path.join(cur_path, "api_precision_threshold.yaml") -apis_threshold = load_yaml(threshold_yaml_path) +accumulative_error_standard_api = apis.get('AccumulativeErrorStandard') DETAIL_TEST_ROWS = [ @@ -134,6 +131,7 @@ ULP_PARAMETERS = { class ApiPrecisionCompareColumn: API_NAME = 'API Name' DEVICE_DTYPE = 'DEVICE Dtype' + SHAPE = 'Shape' SMALL_VALUE_ERROR_RATE = '小值域错误占比' RMSE = '均方根误差' MAX_REL_ERR = '相对误差最大值' diff 
--git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/config_op.json b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/config_op.json new file mode 100644 index 0000000000000000000000000000000000000000..6a54b58bdef971dec739202e1636838252d087d7 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/config_op.json @@ -0,0 +1,9 @@ +{ + "dump_json_path": "./dump.json", + "api_name": "", + "extract_api_path": "", + "propagation": "forward", + "data_mode": "random_data", + "random_seed": 1234, + "iter_times": 1 +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/op_generator.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/op_generator.py new file mode 100644 index 0000000000000000000000000000000000000000..797210f09c3b55a64002a4aa84a3d39770ae803c --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/op_generator.py @@ -0,0 +1,478 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
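The `config_op.json` template above is the input that drives `op_generator.py`. A standalone sketch of the field checks applied to such a config (allowed values mirror the `PROPAGATION_LIST` and `DATAMODE_LIST` constants in this file; the real validation lives in `CommonConfig`, and the helper name here is illustrative):

```python
# Allowed values, mirroring PROPAGATION_LIST and DATAMODE_LIST in op_generator.py.
PROPAGATION_CHOICES = ["forward", "backward"]
DATA_MODE_CHOICES = ["random_data", "real_data"]


def validate_op_config(cfg):
    """Raise ValueError if a config_op.json-style dict has invalid fields."""
    if cfg.get("propagation") not in PROPAGATION_CHOICES:
        raise ValueError(f"propagation should be one of {PROPAGATION_CHOICES}")
    if cfg.get("data_mode") not in DATA_MODE_CHOICES:
        raise ValueError(f"data_mode should be one of {DATA_MODE_CHOICES}")
    if not isinstance(cfg.get("iter_times"), int) or cfg["iter_times"] <= 0:
        raise ValueError("iter_times should be an integer bigger than zero")
    return True


print(validate_op_config(
    {"propagation": "forward", "data_mode": "random_data", "iter_times": 1}
))  # True
```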
+ +import argparse +import json +import os +import re + +import math +import numpy as np +import torch + +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import binary_standard_api, absolute_standard_api, \ +ulp_standard_api, thousandth_standard_api +from msprobe.core.common.file_utils import FileOpen, load_json, save_json +from msprobe.core.common.utils import check_file_or_directory_path, check_op_str_pattern_valid, is_int +from msprobe.core.common.const import Const, MonitorConst, MsgConst +from msprobe.core.common.log import logger +from msprobe.core.common.file_utils import make_dir +from msprobe.core.common.utils import recursion_depth_decorator + +TENSOR_DATA_LIST = ["torch.Tensor", "torch.nn.parameter.Parameter"] +TORCH_BOOL_TYPE = ["torch.bool"] +TORCH_INT_TYPE = ["torch.uint8", "torch.int8", "torch.int16", "torch.short", "torch.int32", "torch.int", + "torch.int64", "torch.long"] +TORCH_FLOAT_TYPE = ["torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.float", + "torch.float64", "torch.double"] +TORCH_COMPLEX_TYPE = ["torch.complex32", "torch.chalf", "torch.complex64", "torch.cfloat", "torch.complex128", + "torch.cdouble"] +OPERATOR_TYPE = ("Functional", "Tensor", "Torch") + +API_INFO = 2 +FOUR_SEGMENT = 4 +FIVE_SEGMENT = 5 +DATA_NAME = "data_name" +API_MAX_LENGTH = 30 +PROPAGATION_LIST = [Const.FORWARD, Const.BACKWARD] +DATAMODE_LIST = ["random_data", "real_data"] + + +class APIInfo: + def __init__(self, api_full_name, api_info_dict, backward_info=None): + self.api_full_name = api_full_name + self.api_info_dict = api_info_dict + self.backward_info = backward_info + + @property + def api_type(self): + return self.api_full_name.split(Const.SEP, -1)[0] + + @classmethod + def from_json(cls, json_content, propagation): + forward_name, forward_dict = list(json_content.items())[0] + forward_info = cls(api_full_name=forward_name, api_info_dict=forward_dict) + + if propagation == Const.BACKWARD: + backward_name, backward_dict = 
list(json_content.items())[1]
+ backward_info = cls(api_full_name=backward_name, api_info_dict=backward_dict)
+ forward_info.backward_info = backward_info
+
+ if not forward_info.is_supported_type():
+ raise ValueError(f"type {forward_info.api_type} of API is not supported!")
+
+ return forward_info
+
+ def is_supported_type(self):
+ return self.api_type in OPERATOR_TYPE
+
+
+class CommonConfig:
+ def __init__(self, json_config):
+ self.dump_json_path = json_config.get('dump_json_path')
+ self.api_name = json_config.get('api_name')
+ self.extract_api_path = json_config.get('extract_api_path')
+ self.propagation = json_config.get('propagation')
+ self.data_mode = json_config.get('data_mode')
+ self.random_seed = json_config.get('random_seed')
+ self.iter_times = json_config.get('iter_times')
+ self._check_config()
+
+
+ def check_user_settings(self):
+ iter_t = self.iter_times
+ if iter_t <= 0:
+ raise ValueError("iter_times should be an integer bigger than zero!")
+
+ json_file = self.extract_api_path
+ propagation = self.propagation
+
+ json_content = load_json(json_file)
+
+ # ensure the dict is not empty
+ if not json_content:
+ raise ValueError('json file is empty!')
+
+ # ensure json_content is of type dict
+ if not isinstance(json_content, dict):
+ raise ValueError('content of json file is not a dict!')
+
+ # ensure the length of json_content is within allowed limits
+ if len(json_content) > API_INFO:
+ raise ValueError('json file should contain only one API, with at most its forward and backward info')
+
+ # Retrieve the first API name and dictionary
+ forward_item = next(iter(json_content.items()), None)
+ if not forward_item or not isinstance(forward_item[1], dict):
+ raise ValueError('Invalid forward API data in json_content!')
+
+ # if propagation is backward, ensure json file contains forward and backward info
+ if propagation == Const.BACKWARD and len(json_content) < API_INFO:
+ raise ValueError('Backward propagation requires forward and
backward info!') + + # if propagation is backward, ensure it has valid data + if propagation == Const.BACKWARD: + backward_item = list(json_content.items())[1] + if not isinstance(backward_item[1], dict): + raise ValueError(f'Invalid backward API data in json_content!') + + return json_content + + + def _check_config(self): + if self.dump_json_path: + check_file_or_directory_path(self.dump_json_path) + if self.api_name: + check_op_str_pattern_valid(self.api_name) + if len(self.api_name) > API_MAX_LENGTH: + raise ValueError(f'API name {self.api_name} is too long!') + make_dir(os.path.dirname(self.extract_api_path)) + if self.propagation and self.propagation not in PROPAGATION_LIST: + raise ValueError(f'propagation is invalid, it should be one of {PROPAGATION_LIST}') + if self.data_mode and self.data_mode not in DATAMODE_LIST: + raise ValueError(f'data_mode is invalid, it should be one of {DATAMODE_LIST}') + if not is_int(self.random_seed): + raise ValueError(f'random_seed is invalid, it should be an int') + if not is_int(self.iter_times): + raise ValueError(f'iter_times is invalid, it should be an int') + + +class APIExtractor: + def __init__(self, api_name, dump_json_path, output_file): + self.api_name = api_name + self.dump_json_path = dump_json_path + self.output_file = output_file + self.data = None + + def extract_op(self): + self.data = load_json(self.dump_json_path) + new_data = {} + extract_key_pattern = re.compile(f"^{re.escape(self.api_name)}\..+") + real_data_path = self.data.get('dump_data_dir', '') + for key, value in self.data.get('data', {}).items(): + if extract_key_pattern.match(key): + if real_data_path: + value = self.load_real_data_path(value, real_data_path) + new_data[key] = value + if not new_data: + logger.error(f"Error: The api '{self.api_name}' does not exist in the file.") + else: + save_json(self.output_file, new_data, indent=4) + logger.info( + f"The api '{self.api_name}' has been successfully extracted and saved in: {self.output_file}") 
+ + def load_real_data_path(self, value, dump_data_dir): + parameters = [Const.INPUT_ARGS, Const.GRAD_INPUT, Const.INPUT, Const.OUTPUT, Const.GRAD_OUTPUT] + for parameter in parameters: + for v in value.get(parameter, []): + if v is not None: + self.update_data_name(v, dump_data_dir) + return value + + def update_data_name(self, data, dump_data_dir): + if isinstance(data, list): + for item in data: + self.update_data_name(item, dump_data_dir) + elif DATA_NAME in data: + data[DATA_NAME] = os.path.join(dump_data_dir, data[DATA_NAME]) + + +class OperatorScriptGenerator: + def __init__(self, common_config, args_info_forward, kwargs_info_forward, args_info_backward): + self.common_config = common_config + self.args_info_forward = args_info_forward + self.kwargs_info_forward = kwargs_info_forward + self.args_info_backward = args_info_backward + + @staticmethod + def get_compare_standard(api_name): + api_standard_map = { + "binary_standard_api": "CompareStandard.BINARY_EQUALITY_STANDARD", + "absolute_standard_api": "CompareStandard.ABSOLUTE_THRESHOLD_STANDARD", + "ulp_standard_api": "CompareStandard.ULP_ERROR_STANDARD", + "thousandth_standard_api": "CompareStandard.THOUSANDTH_STANDARD" + } + for standard_api, standard_value in api_standard_map.items(): + if api_name in globals()[standard_api]: + return standard_value + return "CompareStandard.BENCHMARK_STANDARD" + + @staticmethod + def extract_detailed_api_segments(full_api_name): + """ + Function Description: + Extract the name of the API. + Parameter: + full_api_name_with_direction_status: Full name of the API. Example: torch.matmul.0.forward.output.0 + Return: + api_name: Name of api. Example: matmul, mul, etc. + full_api_name: Full name of api. Example: torch.matmul.0 + direction_status: Direction status of api. Example: forward, backward, etc. 
+ """ + api_parts = full_api_name.split(Const.SEP) + api_parts_length = len(api_parts) + api_type, api_name, api_order = None, None, None + if api_parts_length == FOUR_SEGMENT: + api_type, api_name, api_order, _ = api_parts + elif api_parts_length == FIVE_SEGMENT: + api_type, prefix, api_name, api_order, _ = api_parts + api_name = Const.SEP.join([prefix, api_name]) + return api_type, api_name, api_order + + def get_settings(self, api_full_name): + ''' + internal_settings contain all information needed for the operator program. + keys: + api_full_name: api_type.api_name.ordinal_number + api_type: type of API, one of torch.nn.functional, torch.Tensor or Torch + api_name: name of API + ordinal_number: how many times the same api has been called + direction_status: forward + random_seed: if mode is random_data, random seed is random_seed + iter_times: if mode is random_data, generate iter_times group of data; if mode is real_data, + iter_times does not matter + args_element_assignment: code for args assignment + args_list_generator_device: code for generate args list on device + args_list_generator_bench: code for generate args list on bench + kwargs_value_assignment: code for kwargs assignment + kwargs_dict_generator_device: code for generate kwargs dict on device + kwargs_dict_generator_bench: code for generate kwargs dict on bench + ''' + # Generate an internal setting dictionary based on user settings + # including API name, type, comparison standard, random seed, number of iterations and other information + internal_settings = {} + internal_settings["propagation"] = self.common_config.propagation + internal_settings["api_full_name"] = api_full_name + api_type, api_name, ordinal_number = self.extract_detailed_api_segments(api_full_name) + if api_type == "Functional": + internal_settings["api_type"] = "torch.nn.functional" + elif api_type == "Tensor": + internal_settings["api_type"] = "torch.Tensor" + else: + internal_settings["api_type"] = "torch" + 
internal_settings["api_name"] = api_name + internal_settings["compare_standard"] = self.get_compare_standard(api_name) + internal_settings["ordinal_number"] = ordinal_number + internal_settings["direction_status"] = self.common_config.propagation + internal_settings["random_seed"] = self.common_config.random_seed + if self.common_config.data_mode == "real_data": + internal_settings["iter_times"] = 1 + else: + internal_settings["iter_times"] = self.common_config.iter_times + internal_settings["args_element_assignment"] = \ + self.generate_args_element_assignment_code(self.args_info_forward) + internal_settings["args_list_generator_device"] = \ + self.generate_args_list(self.args_info_forward, flag_device=True) + internal_settings["args_list_generator_bench"] = \ + self.generate_args_list(self.args_info_forward, flag_device=False) + internal_settings["kwargs_value_assignment"] = \ + self.generate_kwargs_value_assignment_code(self.kwargs_info_forward) + internal_settings["kwargs_dict_generator_device"] = \ + self.generate_kwargs_dict(self.kwargs_info_forward, flag_device=True) + internal_settings["kwargs_dict_generator_bench"] = \ + self.generate_kwargs_dict(self.kwargs_info_forward, flag_device=False) + if self.common_config.propagation == Const.BACKWARD: + internal_settings["args_element_assignment_backward"] = self.generate_args_element_assignment_code( + self.args_info_backward) + internal_settings["args_list_generator_device_backward"] = \ + self.generate_args_list(self.args_info_backward, flag_device=True) + internal_settings["args_list_generator_bench_backward"] = \ + self.generate_args_list(self.args_info_backward, flag_device=False) + else: + internal_settings["args_element_assignment_backward"] = '' + internal_settings["args_list_generator_device_backward"] = '' + internal_settings["args_list_generator_bench_backward"] = '' + + return internal_settings + + @recursion_depth_decorator("OpGenerator: OperatorScriptGenerator.recursive_args_element_assignment") + 
def recursive_args_element_assignment(self, args_info, name_number): + args_element_assignment = "" + for index, arg in enumerate(args_info): + if isinstance(arg, (list, tuple)): + new_args_element_assignment = \ + self.recursive_args_element_assignment(arg, name_number + "_" + str(index)) + args_element_assignment += new_args_element_assignment + else: + arg["parameter_name"] = "arg" + name_number + "_" + str(index) + args_element_assignment += " " + "arg_info" + name_number + "_" + str(index) + " = " + \ + "{}".format(str(arg)) + MsgConst.SPECIAL_CHAR[0] + args_element_assignment += " " + "arg" + name_number + "_" + str(index) + " = " + \ + "generate_data(arg_info" + name_number + "_" + str(index) + ")" + MsgConst.SPECIAL_CHAR[0] + return args_element_assignment + + + def generate_args_element_assignment_code(self, args_info): + args_element_assignment = self.recursive_args_element_assignment(args_info, "") + return args_element_assignment + + @recursion_depth_decorator("OpGenerator: OperatorScriptGenerator.recursive_args_list") + def recursive_args_list(self, args_info, flag_device=False, flag_bench=False): + args_list_generator = "" + for _, arg in enumerate(args_info): + if isinstance(arg, (list, tuple)): + (left_bracket, right_bracket) = ("[", "]") if isinstance(arg, list) else ("(", ")") + args_list_generator += left_bracket + new_args_list_generator = self.recursive_args_list(arg, flag_device=flag_device, flag_bench=flag_bench) + args_list_generator += new_args_list_generator + args_list_generator += right_bracket + else: + args_list_generator += arg.get("parameter_name") + if arg.get("type") in TENSOR_DATA_LIST: + if flag_device: + args_list_generator += ".to(device)" + if flag_bench: + args_list_generator += '.to(torch.device("cpu"))' + args_list_generator += ".to(RAISE_PRECISION.get(str(" + arg.get("parameter_name") + \ + ".dtype), " + arg.get("parameter_name") + ".dtype))" + args_list_generator += Const.COMMA + return args_list_generator + + def 
generate_args_list(self, args_info, flag_device): + if flag_device: + args_list_generator = self.recursive_args_list(args_info, flag_device=True) + else: + args_list_generator = self.recursive_args_list(args_info, flag_bench=True) + return args_list_generator + + @recursion_depth_decorator("OpGenerator: OperatorScriptGenerator.recursive_kwargs_value_assignment") + def recursive_kwargs_value_assignment(self, info, key_name, name_number): + kwargs_value_assignment = "" + if isinstance(info, dict): + if info.get("type") == "torch.device" or info.get("type") == "torch.dtype": + kwargs_value_assignment += " " + "kwarg_" + key_name + name_number + " = " + info.get("value") + else: + kwargs_value_assignment += " " + "kwarg_info_" + key_name + name_number + " = " + \ + "{}".format(str(info)) + MsgConst.SPECIAL_CHAR[0] + kwargs_value_assignment += " " + "kwarg_" + key_name + name_number + " = " + \ + "generate_data(kwarg_info_" + key_name + name_number + ")" + MsgConst.SPECIAL_CHAR[0] + info["parameter_name"] = "kwarg_" + key_name + name_number + else: + for index, arg in enumerate(info): + new_kwargs_value_assignment = self.recursive_kwargs_value_assignment(arg, key_name, name_number + \ + "_" + str(index)) + kwargs_value_assignment += new_kwargs_value_assignment + return kwargs_value_assignment + + def generate_kwargs_value_assignment_code(self, kwargs_info): + kwargs_value_assignment = "" + for key, value in kwargs_info.items(): + kwargs_value_assignment += self.recursive_kwargs_value_assignment(value, key, "") + return kwargs_value_assignment + + @recursion_depth_decorator("OpGenerator: OperatorScriptGenerator.recursive_kwargs_dict") + def recursive_kwargs_dict(self, info, flag_device=False, flag_bench=False): + kwargs_dict_generator = "" + if isinstance(info, dict): + kwargs_dict_generator += info.get("parameter_name") + if info.get("type") in TENSOR_DATA_LIST: + if flag_device: + kwargs_dict_generator += ".to(device)" + if flag_bench: + kwargs_dict_generator += 
'.to(torch.device("cpu"))' + kwargs_dict_generator += ".to(RAISE_PRECISION.get(str(" + info.get("parameter_name") + \ + ".dtype), " + info.get("parameter_name") + ".dtype))" + else: + (left_bracket, right_bracket) = ("[", "]") if isinstance(info, list) else ("(", ")") + kwargs_dict_generator += left_bracket + for arg in info: + kwargs_dict_generator += self.recursive_kwargs_dict(arg, flag_device=flag_device, flag_bench=flag_bench) + kwargs_dict_generator += Const.COMMA + kwargs_dict_generator += right_bracket + return kwargs_dict_generator + + + def generate_kwargs_dict(self, kwargs_info, flag_device): + kwargs_dict_generator = "" + for key, value in kwargs_info.items(): + kwargs_dict_generator += '"' + key + '"' + MonitorConst.NAME_SEP + if flag_device: + kwargs_dict_generator += self.recursive_kwargs_dict(value, flag_device=True) + Const.COMMA + else: + kwargs_dict_generator += self.recursive_kwargs_dict(value, flag_bench=True) + Const.COMMA + return kwargs_dict_generator + + + +def _op_generator_parser(parser): + parser.add_argument("-i", "--config_input", dest="config_input", default='', type=str, + help=" Path of config json file", required=True) + parser.add_argument("-o", "--api_output_path", dest="api_output_path", type=str, + help=" Path of extract api_name.json.", + required=True) + + +def parse_json_config(json_file_path): + if not json_file_path: + config_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) + json_file_path = os.path.join(config_dir, "config.json") + json_config = load_json(json_file_path) + common_config = CommonConfig(json_config) + return common_config + + +def _run_operator_generate_commond(cmd_args): + common_config = parse_json_config(cmd_args.config_input) + + if common_config.dump_json_path: + api_extract = APIExtractor(common_config.api_name, common_config.dump_json_path, common_config.extract_api_path) + api_extract.extract_op() + check_file_or_directory_path(common_config.extract_api_path) + 
check_file_or_directory_path(cmd_args.api_output_path, isdir=True) + json_content = common_config.check_user_settings() + api_info = APIInfo.from_json(json_content, common_config.propagation) + + if common_config.propagation == Const.BACKWARD: + # read and check json + api_full_name_forward, api_info_dict_forward = api_info.api_full_name, api_info.api_info_dict + api_full_name_backward, api_info_dict_backward = (api_info.backward_info.api_full_name, + api_info.backward_info.api_info_dict) + args_info_forward = api_info_dict_forward.get(Const.INPUT_ARGS) + kwargs_info_forward = api_info_dict_forward.get(Const.INPUT_KWARGS) + if Const.GRAD_INPUT in api_info_dict_backward: + args_info_backward = api_info_dict_backward.get(Const.GRAD_INPUT) + elif Const.INPUT in api_info_dict_backward: + args_info_backward = api_info_dict_backward.get(Const.INPUT) + op_generate = OperatorScriptGenerator(common_config, args_info_forward, kwargs_info_forward, args_info_backward) + internal_settings = op_generate.get_settings(api_full_name_backward) + else: + # read and check json + api_full_name_forward, api_info_dict_forward = api_info.api_full_name, api_info.api_info_dict + args_info_forward = api_info_dict_forward.get(Const.INPUT_ARGS) + kwargs_info_forward = api_info_dict_forward.get(Const.INPUT_KWARGS) + op_generate = OperatorScriptGenerator(common_config, args_info_forward, kwargs_info_forward, None) + internal_settings = op_generate.get_settings(api_full_name_forward) + + template_path = os.path.join(os.path.dirname(__file__), "operator_replication.template") + operator_script_path = os.path.join(cmd_args.api_output_path, + "{0}.py".format(internal_settings.get("api_full_name"))) + + try: + with FileOpen(template_path, 'r') as ftemp, FileOpen(operator_script_path, 'w') as fout: + code_template = ftemp.read() + fout.write(code_template.format(**internal_settings)) + except OSError: + logger.error(f"Failed to open file. 
Please check file {template_path} or {operator_script_path}.") + + logger.info(f"Generate operator script successfully and the name is {operator_script_path}.") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + _op_generator_parser(parser) + cmd_args = parser.parse_args() + _run_operator_generate_commond(cmd_args) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/operator_replication.template b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/operator_replication.template new file mode 100644 index 0000000000000000000000000000000000000000..131fd211ad82dad8256c48e59195fc335efa936b --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/generate_op_script/operator_replication.template @@ -0,0 +1,365 @@ +import json +import os +import math +from enum import Enum, auto +import torch +try: + import torch_npu +except ImportError: + pass +from tabulate import tabulate + +TENSOR_DATA_LIST = ["torch.Tensor", "torch.nn.parameter.Parameter"] +TORCH_BOOL_TYPE = ["torch.bool"] +TORCH_INT_TYPE = ["torch.uint8", "torch.int8", "torch.int16", "torch.short", "torch.int32", "torch.int", + "torch.int64", "torch.long"] +TORCH_FLOAT_TYPE = ["torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.float", + "torch.float64", "torch.double"] +TORCH_COMPLEX_TYPE = ["torch.complex32", "torch.chalf", "torch.complex64", "torch.cfloat", "torch.complex128", "torch.cdouble"] +RAISE_PRECISION = {{ + "torch.float16": torch.float32, + "torch.half": torch.float32, + "torch.bfloat16": torch.float32, + "torch.float32": torch.float64, + "torch.float": torch.float64 +}} +THOUSANDTH_THRESHOLDING = 0.001 +BACKWARD = 'backward' + +class CompareStandard(Enum): + BINARY_EQUALITY_STANDARD = auto() + ABSOLUTE_THRESHOLD_STANDARD = auto() + ULP_ERROR_STANDARD = auto() + BENCHMARK_STANDARD = auto() + THOUSANDTH_STANDARD = auto() + +def load_pt(pt_path, to_cpu=False): + pt_path = 
os.path.realpath(pt_path) + try: + if to_cpu: + pt = torch.load(pt_path, map_location=torch.device("cpu")) + else: + pt = torch.load(pt_path) + except Exception as e: + raise RuntimeError(f"load pt file {{pt_path}} failed") from e + return pt + +def get_device(): + if torch.cuda.is_available(): + device = torch.device("cuda") + elif torch_npu.npu.is_available(): + device = torch.device("npu") + else: + raise Exception("Error: This device is not NPU or GPU!") + return device + + +def generate_bool_tensor(low, high, shape): + low, high = int(low), int(high) + tensor = torch.randint(low, high + 1, shape) + bool_tensor = torch.gt(tensor, 0) + return bool_tensor + + +def generate_numerical_tensor(low, high, shape, data_dtype): + if data_dtype in TORCH_FLOAT_TYPE: + scale = high - low + rand01 = torch.rand(shape, dtype=eval(data_dtype)) + tensor = rand01 * scale + low + elif data_dtype in TORCH_INT_TYPE: + low, high = int(low), int(high) + tensor = torch.randint(low, high + 1, shape, dtype=eval(data_dtype)) + else: + raise NotImplementedError(f"{{data_dtype}} is not supported!") + if torch.numel(tensor) == 0: + return tensor + tmp_tensor = tensor.reshape(-1) + tmp_tensor[0] = low + tmp_tensor[-1] = high + data = tmp_tensor.reshape(shape) + return data + + +def generate_random_tensor(info): + low, high = info.get('Min'), info.get('Max') + data_dtype = info.get('dtype') + shape = tuple(info.get('shape')) + if data_dtype == "torch.bool": + data = generate_bool_tensor(low, high, shape) + else: + data = generate_numerical_tensor(low, high, shape, data_dtype) + return data + + +def generate_real_tensor(data_path): + data_path = os.path.realpath(data_path) + data = load_pt(data_path, to_cpu = True) + return data + + +def generate_data(info): + data_type = info.get("type") + data_path = info.get("data_name") + data_grad = info.get("requires_grad") + if data_type in TENSOR_DATA_LIST: + if data_path: + data = generate_real_tensor(data_path) + else: + data = 
generate_random_tensor(info) + else: + data = info.get("value") + if data_grad == True: + data.requires_grad_(True) + return data + + +def get_input(propagation): +{args_element_assignment} + args_device = [{args_list_generator_device}] + args_bench = [{args_list_generator_bench}] +{kwargs_value_assignment} + kwargs_device = {{{kwargs_dict_generator_device}}} + kwargs_bench = {{{kwargs_dict_generator_bench}}} +{args_element_assignment_backward} + args_device_backward = [{args_list_generator_device_backward}] + args_bench_backward = [{args_list_generator_bench_backward}] + if propagation == BACKWARD: + return args_device, kwargs_device, args_bench, kwargs_bench, args_device_backward, args_bench_backward + return args_device, kwargs_device, args_bench, kwargs_bench + +def exec_api(args, kwargs, args_grad_input, propagation): + output = {api_type}.{api_name}(*args, **kwargs) + if propagation == BACKWARD: + args_input_tensor = [tensor for tensor in args if isinstance(tensor, torch.Tensor) and tensor.requires_grad] + args_input_tensor.extend( + [value for value in kwargs.values() if isinstance(value, torch.Tensor) and value.requires_grad]) + output_backward = torch.autograd.grad(outputs=output, inputs=args_input_tensor, grad_outputs=args_grad_input) + return output_backward + return output + +def compute_inf_nan_proportion(inf_nan_mask, out_device, out_bench, abs_bench_with_eps, rtol): + out_bench = out_bench.to(out_device.dtype) + min = torch.finfo(out_device.dtype).min + max = torch.finfo(out_device.dtype).max + bench_clip = torch.clamp(out_bench, min=min, max=max) + device_clip = torch.clamp(out_device, min=min, max=max) + clipped_abs_ae = torch.abs(device_clip - bench_clip) + clipped_re = clipped_abs_ae / abs_bench_with_eps + pass_mask = torch.less_equal(clipped_re, rtol) + both_nan_mask = torch.logical_and(torch.isnan(out_device), torch.isnan(bench_clip)) + pass_mask = torch.logical_or(pass_mask, both_nan_mask) + not_pass_mask = torch.logical_not(pass_mask) + 
not_pass_mask = torch.logical_and(not_pass_mask, inf_nan_mask) + inf_nan_err_cnt = torch.sum(not_pass_mask) + return 0 if torch.sum(inf_nan_mask) == 0 else inf_nan_err_cnt / torch.sum(inf_nan_mask) + + +def compute_rmse(abs_err, normal_value_mask): + if torch.sum(normal_value_mask) == 0: + return 0 + else: + masked_ae = torch.where(normal_value_mask, abs_err, 0) + mse = torch.sum(torch.square(masked_ae)) / torch.sum(normal_value_mask) + rmse = torch.sqrt(mse) + return rmse + + +def compute_error_balance(out_device, out_bench): + larger_count = torch.sum(torch.greater(out_device - out_bench.to(out_device.dtype), 0)) + smaller_count = torch.sum(torch.less(out_device - out_bench.to(out_device.dtype), 0)) + if torch.numel(out_bench) == 0: + raise ZeroDivisionError(f"ERROR: please check torch.numel out_bench, its value is {{torch.numel(out_bench)}}") + error_balance = abs(larger_count - smaller_count) / torch.numel(out_bench) + return error_balance + + +def compare_tensor(out_device, out_bench, api_name): + if out_device.shape != out_bench.shape: + print("ERROR: shape of out_device and out_bench is not equal!") + return None + if torch.numel(out_bench) == 0: + print("Both out_device and out_bench have zero elements.") + return None + dtype_device = out_device.dtype + dtype_bench = out_bench.dtype + headers = ["Metric", "Value"] + table = [ + ["Shape", out_bench.shape], + ["Dtype of out_device", out_device.dtype], + ["Dtype of out_bench", out_bench.dtype] + ] + if str(dtype_device) in TORCH_FLOAT_TYPE and str(dtype_bench) in TORCH_FLOAT_TYPE \ + or str(dtype_device) in TORCH_INT_TYPE and str(dtype_bench) in TORCH_INT_TYPE \ + or str(dtype_device) in TORCH_BOOL_TYPE and str(dtype_bench) in TORCH_BOOL_TYPE: + out_device = out_device.to(torch.device("cpu")) + if str(dtype_device) in TORCH_BOOL_TYPE or str(dtype_device) in TORCH_INT_TYPE or compare_standard == CompareStandard.BINARY_EQUALITY_STANDARD: + error_number = torch.sum(out_device != out_bench).item() + if 
torch.numel(out_bench) == 0: + raise ZeroDivisionError(f"ERROR: please check torch.numel out_bench, its value is {{torch.numel(out_bench)}}") + error_rate = error_number / torch.numel(out_bench) + table.append(["Compare Standard", "Binary Equality Standard"]) + table.append(["Error Rate", error_rate]) + else: + abs_err = torch.abs(out_device - out_bench) + abs_bench = torch.abs(out_bench) + if dtype_bench == torch.float32: + eps = 2 ** -23 + if dtype_bench == torch.float64: + eps = 2 ** -52 + abs_bench_with_eps = abs_bench + eps + rel_err = torch.abs(abs_err / abs_bench_with_eps) + device_finite_mask = torch.isfinite(out_device) + bench_finite_mask = torch.isfinite(out_bench.to(dtype_device)) + both_finite_mask = torch.logical_and(device_finite_mask, bench_finite_mask) + inf_nan_mask = torch.logical_not(both_finite_mask) + if compare_standard == CompareStandard.ABSOLUTE_THRESHOLD_STANDARD: + if dtype_device == torch.float16: + rtol, small_value, small_value_atol = 1.0e-3, 1.0e-3, 1.0e-5 + elif dtype_device == torch.bfloat16: + rtol, small_value, small_value_atol = 4.0e-3, 1.0e-3, 1.0e-5 + else: + rtol, small_value, small_value_atol = 1.0e-6, 1.0e-6, 1.0e-9 + small_value_mask = torch.less_equal(abs_bench, small_value) + small_value_mask = torch.logical_and(small_value_mask, both_finite_mask) + normal_value_mask = torch.logical_and(both_finite_mask, torch.logical_not(small_value_mask)) + inf_nan_proportion = compute_inf_nan_proportion(inf_nan_mask, out_device, out_bench, abs_bench_with_eps, rtol) + rel_err_mask = torch.greater(rel_err, rtol) + rel_err_mask = torch.logical_and(rel_err_mask, normal_value_mask) + if torch.sum(normal_value_mask) == 0: + rel_err_proportion = 0 + else: + rel_err_proportion = torch.sum(rel_err_mask) / torch.sum(normal_value_mask) + abs_err_mask = torch.greater(abs_err, small_value_atol) + abs_err_mask = torch.logical_and(abs_err_mask, small_value_mask) + if torch.sum(small_value_mask) == 0: + abs_err_proportion = 0 + else: + 
abs_err_proportion = torch.sum(abs_err_mask) / torch.sum(small_value_mask) + table.append(["Compare Standard", "Absolute Threshold Standard"]) + table.append(["Relative Error Ratio", rel_err_proportion]) + table.append(["Absolute Error Ratio", abs_err_proportion]) + elif compare_standard == CompareStandard.ULP_ERROR_STANDARD: + if dtype_device == torch.float16: + min_eb, exponent_num = -14, 10 + elif dtype_device == torch.bfloat16: + min_eb, exponent_num = -126, 7 + else: + min_eb, exponent_num = -126, 23 + eb = torch.where(abs_bench == 0, torch.zeros(out_bench.shape), torch.floor(torch.log2(abs_bench))) + eb = torch.maximum(eb, min_eb * torch.ones(out_bench.shape)) + if dtype_device == torch.float32: + ulp_err = (out_device.to(torch.float64) - out_bench).to(torch.float64) * torch.exp2(-eb + exponent_num).to(torch.float64) + else: + ulp_err = (out_device.to(torch.float32) - out_bench).to(torch.float32) * torch.exp2(-eb + exponent_num).to(torch.float32) + ulp_err = torch.abs(ulp_err) + max_ulp_err = torch.max(ulp_err) + mean_ulp_err = torch.mean(ulp_err) + if torch.numel(out_bench) == 0: + raise ZeroDivisionError(f"ERROR: please check torch.numel out_bench, its value is {{torch.numel(out_bench)}}") + if dtype_device == torch.float32: + ulp_err_proportion = torch.sum(ulp_err > 32) / torch.numel(out_bench) + else: + ulp_err_proportion = torch.sum(ulp_err > 1) / torch.numel(out_bench) + table.append(["Compare Standard", "ULP error Standard"]) + table.append(["Maximum ULP Error", max_ulp_err]) + table.append(["Mean ULP Error", mean_ulp_err]) + table.append(["ULP Error Proportion", ulp_err_proportion]) + elif compare_standard == CompareStandard.THOUSANDTH_STANDARD: + rel_err_origin = torch.abs(abs_err / abs_bench_with_eps) + if torch.numel(rel_err_origin) == 0: + thousand_res = 1 + else: + thousand_res = torch.divide(torch.sum(rel_err < THOUSANDTH_THRESHOLDING), torch.numel(rel_err_origin)) + thousand_status = thousand_res > (1 - THOUSANDTH_THRESHOLDING) + 
table.append(["Compare Standard", "Thousandth Standard"]) + table.append(["Thousandth ratio", thousand_res]) + else: + if dtype_device == torch.float16: + small_value, small_value_atol = 1.0e-3, 1.0e-5 + elif dtype_device == torch.bfloat16: + small_value, small_value_atol = 1.0e-3, 1.0e-5 + else: + small_value, small_value_atol = 1.0e-6, 1.0e-9 + small_value_mask = torch.less_equal(abs_bench, small_value) + small_value_mask = torch.logical_and(small_value_mask, both_finite_mask) + normal_value_mask = torch.logical_and(both_finite_mask, torch.logical_not(small_value_mask)) + abs_err_mask = torch.greater(abs_err, small_value_atol) + abs_err_mask = torch.logical_and(abs_err_mask, small_value_mask) + if torch.sum(small_value_mask) == 0: + small_value_err_proportion = 0 + else: + small_value_err_proportion = torch.sum(abs_err_mask) / torch.sum(small_value_mask) + rel_err = torch.where(normal_value_mask, rel_err, -1 * torch.ones(out_device.shape)) + if torch.max(rel_err) >= 0: + max_rel_err = torch.max(rel_err) + else: + max_rel_err = 0 + if torch.sum(normal_value_mask) == 0: + mean_rel_err = 0 + else: + mean_rel_err = torch.sum(torch.clamp(rel_err, min=0)) / torch.sum(normal_value_mask) + rmse = compute_rmse(abs_err, normal_value_mask) + error_balance = compute_error_balance(out_device, out_bench) + table.append(["Compare Standard", "Benchmark Standard"]) + table.append(["Small Value Error Proportion", small_value_err_proportion]) + table.append(["Maximum Relative Error", max_rel_err]) + table.append(["Mean Relative Error", mean_rel_err]) + table.append(["Root Mean Squared Error", rmse]) + table.append(["Error Balance", error_balance]) + else: + print(f"ERROR: out_device dtype is {{dtype_device}}, out_bench dtype is {{dtype_bench}}, not comparable.") + return None + print(tabulate(table, headers, tablefmt='grid')) + return None + + +def compare_element(out_device, out_bench, api_name): + if type(out_device) != type(out_bench): + print("ERROR: out_device and out_bench are 
not the same type!") + return None + if isinstance(out_bench, torch.Tensor): + compare_tensor(out_device, out_bench, api_name) + elif isinstance(out_bench, (bool, int, float, str)): + if out_device == out_bench: + print("PASS: out_device and out_bench are equal.") + else: + print("ERROR: out_device and out_bench are not equal!") + else: + print(f"ERROR: comparison of type {{type(out_bench)}} is not supported.") + return None + + +def compare(out_device, out_bench, api_name): + print("Compare result:") + if type(out_device) != type(out_bench): + print("ERROR: out_device and out_bench are not the same type!") + return None + if isinstance(out_bench, (list, tuple)): + if len(out_device) != len(out_bench): + print("ERROR: lengths of out_device and out_bench are different!") + return None + for index, _ in enumerate(out_bench): + print(f"index {{index}}:") + compare_element(out_device[index], out_bench[index], api_name) + else: + compare_element(out_device, out_bench, api_name) + +if __name__ == "__main__": + device = get_device() + api_name = "{api_name}" + propagation = "{propagation}" + compare_standard = {compare_standard} + torch.manual_seed({random_seed}) + for i in range({iter_times}): + print(f"iter: {{i}}:") + if propagation == BACKWARD: + args_device, kwargs_device, args_bench, kwargs_bench, args_device_backward, args_bench_backward = get_input(propagation) + output_device = exec_api(args_device, kwargs_device, args_device_backward, propagation) + output_bench = exec_api(args_bench, kwargs_bench, args_bench_backward, propagation) + compare(output_device, output_bench, api_name) + else: + args_device, kwargs_device, args_bench, kwargs_bench = get_input(propagation) + output_device = exec_api(args_device, kwargs_device, None, propagation) + output_bench = exec_api(args_bench, kwargs_bench, None, propagation) + compare(output_device, output_bench, api_name) + print("Compare finished.") diff --git 
a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/absolute_threshold.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/absolute_threshold.py new file mode 100644 index 0000000000000000000000000000000000000000..fd6ac7dd41dd7c43eb4d4168b6352319be538047 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/absolute_threshold.py @@ -0,0 +1,106 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import check_inf_nan_value, check_norm_value, \ + check_small_value +from msprobe.pytorch.api_accuracy_checker.precision_standard.base_standard import BaseCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig +from msprobe.core.common.const import CompareConst + + + +class AbsolutethdCompare(BaseCompare): + """ + Absolute threshold compare class. + + This class is used to compare the absolute threshold of benchmark outputs and device outputs. + It calculates various metrics such as inf_nan_error_ratio, rel_err_ratio, and abs_err_ratio + to determine the accuracy of the device output compared to the benchmark output. + + Attributes: + bench_output (np.ndarray): The output from the benchmark. 
+ device_output (np.ndarray): The output from the device. + dtype (torch.dtype): The data type of the outputs. + abs_bench (np.ndarray): The absolute value of the benchmark output. + abs_bench_with_eps (np.ndarray): The absolute value of the benchmark output with epsilon. + both_finite_mask (np.ndarray): A mask indicating where both outputs are finite. + inf_nan_mask (np.ndarray): A mask indicating where either output is infinite or NaN. + rtol (float): The relative tolerance for comparison. + rel_err (np.ndarray): The relative error between the benchmark and device outputs. + small_value (float): The small value threshold for comparison. + small_value_atol (float): The absolute tolerance for small values. + small_value_mask (np.ndarray): A mask indicating where values are small. + normal_value_mask (np.ndarray): A mask indicating where values are normal. + + Methods: + _get_rtol(): Gets the relative tolerance based on the data type. + _get_rel_err(abs_err, abs_bench_with_eps): Calculates the relative error. + _get_normal_value_mask(both_finite_mask, small_value_mask): Gets the mask for normal values. + _pre_compare(): Prepares the comparison by calculating various metrics. + _compute_metrics(): Computes the comparison metrics. + + Note: + This class assumes that the input data is a dictionary containing 'bench_output', 'device_output', + 'compare_column' and 'dtype'. + The 'dtype' should be a PyTorch data type. + + See Also: + BaseCompare: The base class for comparison classes. + StandardConfig: The class containing standard configuration values. + """ + def __init__(self, input_data): + super(AbsolutethdCompare, self).__init__(input_data) + self.compare_algorithm = CompareConst.ABSOLUTE_THRESHOLD + + def _get_rtol(self): + return StandardConfig.get_rtol(self.dtype) + + def _pre_compare(self): + """ + Prepares the comparison by calculating various metrics. + + This method performs the following steps: + 1. Calculates the absolute benchmark values and their epsilon-adjusted versions. + 2. 
Determines masks for finite and infinite/NaN values in the outputs. + 3. Computes the absolute error between benchmark and device outputs. + 4. Retrieves the relative tolerance based on the data type. + 5. Calculates the relative error using the absolute error and epsilon-adjusted benchmark values. + 6. Determines the small value threshold and its absolute tolerance. + 7. Creates a mask for small values based on the benchmark values and finite mask. + 8. Creates a mask for normal values by excluding small values from the finite mask. + """ + self.abs_bench, self.abs_bench_with_eps = self.stat_abs_bench_with_eps() + self.both_finite_mask, self.inf_nan_mask = self.stat_finite_and_infinite_mask() + self.abs_err = self.stat_abs_error() + self.rtol = self._get_rtol() + self.rel_err = self._get_rel_err(self.abs_err, self.abs_bench_with_eps) + self.small_value, self.small_value_atol = self.get_small_value_threshold() + self.small_value_mask = self.stat_small_value_mask(self.abs_bench, self.both_finite_mask, self.small_value) + self.normal_value_mask = self._get_normal_value_mask(self.both_finite_mask, self.small_value_mask) + + def _compute_metrics(self): + inf_nan_error_ratio = check_inf_nan_value(self.inf_nan_mask, self.bench_output, self.device_output, self.dtype, + self.rtol) + rel_err_ratio = check_norm_value(self.normal_value_mask, self.rel_err, self.rtol) + abs_err_ratio = check_small_value(self.abs_err, self.small_value_mask, self.small_value_atol) + return { + "inf_nan_error_ratio": inf_nan_error_ratio, + "rel_err_ratio": rel_err_ratio, + "abs_err_ratio": abs_err_ratio + } diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/accumulative_error_compare.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/accumulative_error_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..1f2e875c0e49f14fb3266472058f116c0f4402c9 --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/accumulative_error_compare.py @@ -0,0 +1,107 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import check_inf_nan_value, check_norm_value, \ + check_small_value, get_error_balance +from msprobe.pytorch.api_accuracy_checker.precision_standard.base_standard import BaseCompare +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig +from msprobe.core.common.const import CompareConst + + +class AccumulativeErrorCompare(BaseCompare): + """ + Accumulative error compare class. + + This class is used to compare benchmark outputs and device outputs against the accumulative error standard. + It calculates various metrics such as inf_nan_error_ratio, rel_err_ratio, abs_err_ratio, and eb + to determine the accuracy of the device output compared to the benchmark output. + + Attributes: + bench_output (np.ndarray): The output from the benchmark. + device_output (np.ndarray): The output from the device. + dtype (torch.dtype): The data type of the outputs. + abs_bench (np.ndarray): The absolute value of the benchmark output. + abs_bench_with_eps (np.ndarray): The absolute value of the benchmark output with epsilon. 
+ both_finite_mask (np.ndarray): A mask indicating where both outputs are finite. + inf_nan_mask (np.ndarray): A mask indicating where either output is infinite or NaN. + bound (float): The tolerance for comparison. + rel_err (np.ndarray): The relative error between the benchmark and device outputs. + small_value (float): The small value threshold for comparison. + small_value_atol (float): The absolute tolerance for small values. + small_value_mask (np.ndarray): A mask indicating where values are small. + normal_value_mask (np.ndarray): A mask indicating where values are normal. + + Methods: + _get_bound(): Gets the accumulative error bound based on the data type. + _get_rel_err(abs_err, abs_bench_with_eps): Calculates the relative error. + _get_normal_value_mask(both_finite_mask, small_value_mask): Gets the mask for normal values. + _pre_compare(): Prepares the comparison by calculating various metrics. + _compute_metrics(): Computes the comparison metrics. + + Note: + This class assumes that the input data is a dictionary containing 'bench_output', 'device_output', + 'compare_column' and 'dtype'. + The 'dtype' should be a PyTorch data type. + + See Also: + BaseCompare: The base class for comparison classes. + StandardConfig: The class containing standard configuration values. + """ + def __init__(self, input_data): + super(AccumulativeErrorCompare, self).__init__(input_data) + self.compare_algorithm = CompareConst.ACCUMULATIVE_ERROR_COMPARE + + def _get_bound(self): + return StandardConfig.get_accumulative_error_bound(self.dtype) + + def _pre_compare(self): + """ + Prepares the comparison by calculating various metrics. + + This method performs the following steps: + 1. Calculates the absolute benchmark values and their epsilon-adjusted versions. + 2. Determines masks for finite and infinite/NaN values in the outputs. + 3. Computes the absolute error between benchmark and device outputs. + 4. Retrieves the tolerance based on the data type. + 5. 
Calculates the relative error using the absolute error and epsilon-adjusted benchmark values. + 6. Determines the small value threshold and its absolute tolerance. + 7. Creates a mask for small values based on the benchmark values and finite mask. + 8. Creates a mask for normal values by excluding small values from the finite mask. + """ + self.abs_bench, self.abs_bench_with_eps = self.stat_abs_bench_with_eps() + self.both_finite_mask, self.inf_nan_mask = self.stat_finite_and_infinite_mask() + self.abs_err = self.stat_abs_error() + self.bound = self._get_bound() + self.rel_err = self._get_rel_err(self.abs_err, self.abs_bench_with_eps) + self.small_value, self.small_value_atol = self.get_small_value_threshold() + self.small_value_mask = self.stat_small_value_mask(self.abs_bench, self.both_finite_mask, self.small_value) + self.normal_value_mask = self._get_normal_value_mask(self.both_finite_mask, self.small_value_mask) + + def _compute_metrics(self): + inf_nan_error_ratio = check_inf_nan_value(self.inf_nan_mask, self.bench_output, self.device_output, self.dtype, + self.bound) + rel_err_ratio = check_norm_value(self.normal_value_mask, self.rel_err, self.bound) + abs_err_ratio = check_small_value(self.abs_err, self.small_value_mask, self.bound) + eb = get_error_balance(self.bench_output, self.device_output) + return { + "inf_nan_error_ratio": inf_nan_error_ratio, + "rel_err_ratio": rel_err_ratio, + "abs_err_ratio": abs_err_ratio, + "eb": eb + } diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/base_standard.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/base_standard.py new file mode 100644 index 0000000000000000000000000000000000000000..e3ff6637586dd7e6e6c1ea966e5ecd88adf08c11 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/base_standard.py @@ -0,0 +1,151 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei 
Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from abc import ABC, abstractmethod +import numpy as np +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import convert_str_to_float +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import get_abs_bench_with_eps, get_abs_err, \ + get_finite_and_infinite_mask, get_small_value_mask +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig + + +class BaseCompare(ABC): + """ + Base comparison class for benchmarking and device output. + + This class provides a foundation for comparing benchmark outputs with device outputs. + It encapsulates the common logic for calculating accuracy metrics and + provides a framework for subclasses to implement specific comparison logic. + + Attributes: + bench_output (np.ndarray): The output from the benchmark. + device_output (np.ndarray): The output from the device. + compare_column (object): The column object to store comparison results. + dtype (torch.dtype): The data type of the outputs. + + Methods: + get_small_value_threshold(): Retrieves the small value threshold for the given data type. + stat_abs_bench_with_eps(): Calculates the absolute benchmark output with epsilon. + stat_abs_error(): Calculates the absolute error between the benchmark and device outputs. + stat_finite_and_infinite_mask(): Generates masks for finite and infinite/NaN values. 
+ stat_small_value_mask(abs_bench, both_finite_mask, small_value): Creates a mask for small values. + compare(): Performs the comparison and computes metrics. + _pre_compare(): Pre-comparison hook for subclass-specific initialization. + _compute_metrics(): Computes the comparison metrics. + _post_compare(metrics): Post-comparison hook to update comparison results. + + Note: + This class assumes that the input data is an instance of InputData containing the benchmark output, + device output, comparison column, and data type. Subclasses should implement the _pre_compare, + _compute_metrics, and _post_compare methods to provide specific comparison logic. + + See Also: + InputData: The class containing input data for comparison. + StandardConfig: The class containing standard configuration values. + """ + def __init__(self, input_data): + self.bench_output = input_data.bench_output + self.device_output = input_data.device_output + self.compare_column = input_data.compare_column + self.dtype = input_data.dtype + self.compare_algorithm = None + + @staticmethod + def stat_small_value_mask(abs_bench, both_finite_mask, small_value): + small_value_mask = get_small_value_mask(abs_bench, both_finite_mask, small_value) + return small_value_mask + + @staticmethod + def _get_rel_err(abs_err, abs_bench_with_eps): + rel_err = abs_err / abs_bench_with_eps + return rel_err + + @staticmethod + def _get_normal_value_mask(both_finite_mask, small_value_mask): + return np.logical_and(both_finite_mask, np.logical_not(small_value_mask)) + + @abstractmethod + def _pre_compare(self): + raise NotImplementedError + + def get_small_value_threshold(self): + small_value = StandardConfig.get_small_value(self.dtype, self.compare_algorithm) + small_value_atol = StandardConfig.get_small_value_atol(self.dtype, self.compare_algorithm) + return small_value, small_value_atol + + def stat_abs_bench_with_eps(self): + abs_bench, abs_bench_with_eps = get_abs_bench_with_eps(self.bench_output, self.dtype) + 
return abs_bench, abs_bench_with_eps + + def stat_abs_error(self): + abs_err = get_abs_err(self.bench_output, self.device_output) + return abs_err + + def stat_finite_and_infinite_mask(self): + both_finite_mask, inf_nan_mask = get_finite_and_infinite_mask(self.bench_output, self.device_output) + return both_finite_mask, inf_nan_mask + + def compare(self): + self._pre_compare() + metrics = self._compute_metrics() + self._post_compare(metrics) + + def _compute_metrics(self): + return {} + + def _post_compare(self, metrics): + self.compare_column.update(metrics) + + +class BasePrecisionCompare(ABC): + def __init__(self, input_data): + self.row_npu = input_data.row_npu + self.row_gpu = input_data.row_gpu + self.dtype = input_data.dtype + self.compare_column = input_data.compare_column + self.compare_algorithm = None + + @abstractmethod + def _get_status(self, metrics, inf_nan_consistency): + pass + + @abstractmethod + def _compute_ratio(self): + pass + + def compare(self): + metrics, inf_nan_consistency = self._compute_ratio() + compare_result = self._post_compare(metrics, inf_nan_consistency) + return compare_result + + def _get_and_convert_values(self, column_name): + npu_value = self.row_npu.get(column_name) + gpu_value = self.row_gpu.get(column_name) + if npu_value is None: + raise ValueError(f"NPU value for column '{column_name}' is None.") + if gpu_value is None: + raise ValueError(f"GPU value for column '{column_name}' is None.") + npu_value = convert_str_to_float(npu_value) + gpu_value = convert_str_to_float(gpu_value) + return npu_value, gpu_value + + def _post_compare(self, metrics, inf_nan_consistency): + metrics = self._get_status(metrics, inf_nan_consistency) + metrics.update({'compare_algorithm': self.compare_algorithm}) + self.compare_column.update(metrics) + compare_result = metrics.get('compare_result') + return compare_result \ No newline at end of file diff --git
a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/benchmark_compare.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/benchmark_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..6eb175b5f99a34c697e5529491041a9c1a800c49 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/benchmark_compare.py @@ -0,0 +1,226 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
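Editor's note on the control flow shared by the files in this patch: `BaseCompare.compare()` in base_standard.py is a template method. It always runs `_pre_compare()`, then `_compute_metrics()`, then `_post_compare()`, and subclasses such as `BenchmarkCompare` below only override the hooks. A minimal self-contained sketch of that pattern, using illustrative names only (`MiniCompare` and `MaxAbsErrCompare` are not msprobe classes):

```python
from abc import ABC, abstractmethod


class MiniCompare(ABC):
    """Simplified stand-in for BaseCompare's template method."""

    def __init__(self, bench_output, device_output):
        self.bench_output = bench_output
        self.device_output = device_output
        self.results = {}  # stand-in for compare_column

    @abstractmethod
    def _pre_compare(self):
        raise NotImplementedError

    def compare(self):
        # Fixed skeleton: hooks always run in this order,
        # mirroring BaseCompare.compare().
        self._pre_compare()
        metrics = self._compute_metrics()
        self._post_compare(metrics)

    def _compute_metrics(self):
        return {}

    def _post_compare(self, metrics):
        self.results.update(metrics)


class MaxAbsErrCompare(MiniCompare):
    """Toy subclass computing one metric instead of the benchmark set."""

    def _pre_compare(self):
        # Cache intermediates, as BenchmarkCompare._pre_compare() does.
        self.abs_err = [abs(b - d) for b, d in
                        zip(self.bench_output, self.device_output)]

    def _compute_metrics(self):
        return {"max_abs_err": max(self.abs_err)}


cmp = MaxAbsErrCompare([1.0, 2.0, 3.0], [1.0, 2.5, 3.0])
cmp.compare()
print(cmp.results)  # {'max_abs_err': 0.5}
```

`BenchmarkCompare`, `BinaryCompare`, `ThousandthStdCompare`, and `UlpCompare` below all follow this shape: `_pre_compare()` caches intermediate arrays, `_compute_metrics()` returns a plain dict of metrics, and `_post_compare()` writes it into the compare column.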
+ +from collections import namedtuple +import numpy as np + +from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig +from msprobe.pytorch.api_accuracy_checker.precision_standard.base_standard import BaseCompare, BasePrecisionCompare +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import calc_ratio, get_small_value_err_ratio, get_rel_err, \ + get_rmse, get_error_balance, get_max_rel_err, get_mean_rel_err +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import ApiPrecisionCompareColumn, check_inf_or_nan, \ + is_inf_or_nan +from msprobe.core.common.const import CompareConst + + +BenchmarkInfNanConsistency = namedtuple('BenchmarkInfNanConsistency', ['small_value_inf_nan_consistency', + 'rmse_inf_nan_consistency', + 'max_rel_inf_nan_consistency', + 'mean_rel_inf_nan_consistency', + 'eb_inf_nan_consistency']) + + +class BenchmarkCompare(BaseCompare): + """ + Benchmark comparison class for calculating accuracy metrics. + + This class is designed to compare the output of a benchmark test with the output of a device. + It calculates various metrics such as small value error ratio, RMSE, error balance, max relative error, + and mean relative error to assess the accuracy of the device output against the benchmark output. + + Attributes: + bench_output (np.ndarray): The output from the benchmark. + device_output (np.ndarray): The output from the device. + dtype (torch.dtype): The data type of the outputs. + abs_bench (np.ndarray): The absolute value of the benchmark output. + abs_bench_with_eps (np.ndarray): The absolute value of the benchmark output with epsilon. + both_finite_mask (np.ndarray): A mask indicating where both outputs are finite. + inf_nan_mask (np.ndarray): A mask indicating where either output is infinite or NaN. + abs_err (np.ndarray): The absolute error between the benchmark and device outputs. + small_value (float): The small value threshold for comparison.
+ small_value_atol (float): The absolute tolerance for small values. + small_value_mask (np.ndarray): A mask indicating where values are small. + rel_err (np.ndarray): The relative error between the benchmark and device outputs. + abs_err_greater_mask (np.ndarray): A mask indicating where absolute error is greater than the small value + tolerance. + + Methods: + _get_abs_err_greater_mask(small_value_atol): Calculates a mask where absolute error is greater than the small + value tolerance. + _compute_rel_err(): Computes the relative error between the benchmark and device outputs. + _pre_compare(): Prepares the comparison by calculating various metrics. + _compute_metrics(): Computes the accuracy metrics. + + Note: + This class assumes that the input data is a dictionary containing 'bench_output', 'device_output', + 'compare_column' and 'dtype'. + The data type should be a PyTorch data type. + + See Also: + BaseCompare: The base class for comparison classes. + InputData: The class containing input data for comparison. 
+ """ + + def __init__(self, input_data): + super(BenchmarkCompare, self).__init__(input_data) + self.compare_algorithm = CompareConst.BENCHMARK + + def _get_abs_err_greater_mask(self, small_value_atol): + abs_err_greater_mask = np.greater(self.abs_err, small_value_atol) + return abs_err_greater_mask + + def _compute_rel_err(self): + rel_err = get_rel_err(self.abs_err, self.abs_bench_with_eps, self.small_value_mask, self.inf_nan_mask) + return rel_err + + def _pre_compare(self): + self.abs_bench, self.abs_bench_with_eps = self.stat_abs_bench_with_eps() + self.both_finite_mask, self.inf_nan_mask = self.stat_finite_and_infinite_mask() + self.abs_err = self.stat_abs_error() + self.small_value, self.small_value_atol = self.get_small_value_threshold() + self.small_value_mask = self.stat_small_value_mask(self.abs_bench, self.both_finite_mask, self.small_value) + self.rel_err = self._compute_rel_err() + self.abs_err_greater_mask = self._get_abs_err_greater_mask(self.small_value_atol) + + def _compute_metrics(self): + """ + Computes a comprehensive set of error metrics for the comparison between benchmark and device outputs. + + This method calculates five key metrics: + 1. Small Value Error Ratio: The proportion of errors associated with small values. + 2. Root Mean Square Error (RMSE): The square root of the mean of the squared errors. + 3. Error Balance (EB): A measure of the balance between the errors in the benchmark and device outputs. + 4. Maximum Relative Error: The maximum relative error between the benchmark and device outputs. + 5. Mean Relative Error: The mean relative error between the benchmark and device outputs. + + Returns: + dict: A dictionary containing the computed error metrics. + The dictionary has the following keys: + - "small_value_err_ratio": The proportion of errors associated with small values. + - "max_rel_error": The maximum relative error. + - "mean_rel_error": The mean relative error. + - "rmse": The root mean square error. 
+ - "eb": The error balance. + """ + small_value_err_ratio = get_small_value_err_ratio(self.small_value_mask, self.abs_err_greater_mask) + rmse = get_rmse(self.abs_err, np.logical_or(self.inf_nan_mask, self.small_value_mask)) + eb = get_error_balance(self.bench_output, self.device_output) + max_rel_error = get_max_rel_err(self.rel_err) + mean_rel_error = get_mean_rel_err(self.rel_err) + + return { + "small_value_err_ratio": small_value_err_ratio, + "max_rel_error": max_rel_error, + "mean_rel_error": mean_rel_error, + "rmse": rmse, + "eb": eb + } + + +class BenchmarkPrecisionCompare(BasePrecisionCompare): + def __init__(self, input_data): + super().__init__(input_data) + self.compare_algorithm = CompareConst.BENCHMARK_COMPARE_ALGORITHM_NAME + + @staticmethod + def get_final_status(status_list): + compare_result = CompareConst.PASS + if CompareConst.ERROR in status_list: + compare_result = CompareConst.ERROR + elif CompareConst.WARNING in status_list: + compare_result = CompareConst.WARNING + return compare_result + + def _calc_ratio(self, column_name): + npu_value, gpu_value = self._get_and_convert_values(column_name) + if is_inf_or_nan(npu_value) or is_inf_or_nan(gpu_value): + return check_inf_or_nan(npu_value, gpu_value, column_name) + else: + return calc_ratio(npu_value, gpu_value, str(self.dtype)), True, "" + + def _compute_ratio(self): + compare_message = "" + small_value_err_ratio, small_value_inf_nan_consistency, small_value_message = \ + self._calc_ratio(ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE) + compare_message += small_value_message + rmse_ratio, rmse_inf_nan_consistency, rmse_message = self._calc_ratio(ApiPrecisionCompareColumn.RMSE) + compare_message += rmse_message + max_rel_err_ratio, max_rel_inf_nan_consistency, max_rel_message = \ + self._calc_ratio(ApiPrecisionCompareColumn.MAX_REL_ERR) + compare_message += max_rel_message + mean_rel_err_ratio, mean_rel_inf_nan_consistency, mean_rel_message = \ + 
self._calc_ratio(ApiPrecisionCompareColumn.MEAN_REL_ERR) + compare_message += mean_rel_message + eb_ratio, eb_inf_nan_consistency, eb_message = self._calc_ratio(ApiPrecisionCompareColumn.EB) + compare_message += eb_message + + metrics = { + CompareConst.SMALL_VALUE_ERR_RATIO: small_value_err_ratio, + CompareConst.RMSE_RATIO: rmse_ratio, + CompareConst.MAX_REL_ERR_RATIO: max_rel_err_ratio, + CompareConst.MEAN_REL_ERR_RATIO: mean_rel_err_ratio, + CompareConst.EB_RATIO: eb_ratio, + CompareConst.COMPARE_MESSAGE: compare_message + } + + return metrics, \ + BenchmarkInfNanConsistency(small_value_inf_nan_consistency, rmse_inf_nan_consistency, + max_rel_inf_nan_consistency, mean_rel_inf_nan_consistency, + eb_inf_nan_consistency) + + def _get_threshold(self, metric): + error_threshold = StandardConfig.get_benchmark_threshold(metric) + return error_threshold + + def _get_single_metric_status(self, ratio, metric): + if is_inf_or_nan(ratio): + return CompareConst.PASS + error_threshold = self._get_threshold(metric) + if ratio > error_threshold: + return CompareConst.ERROR + return CompareConst.PASS + + def _get_status(self, metrics, inf_nan_consistency): + small_value_err_ratio = metrics.get(CompareConst.SMALL_VALUE_ERR_RATIO) + rmse_ratio = metrics.get(CompareConst.RMSE_RATIO) + max_rel_err_ratio = metrics.get(CompareConst.MAX_REL_ERR_RATIO) + mean_rel_err_ratio = metrics.get(CompareConst.MEAN_REL_ERR_RATIO) + eb_ratio = metrics.get(CompareConst.EB_RATIO) + + small_value_err_status = self._get_single_metric_status(small_value_err_ratio, CompareConst.SMALL_VALUE) \ + if inf_nan_consistency.small_value_inf_nan_consistency else CompareConst.ERROR + rmse_status = self._get_single_metric_status(rmse_ratio, CompareConst.RMSE) \ + if inf_nan_consistency.rmse_inf_nan_consistency else CompareConst.ERROR + max_rel_err_status = self._get_single_metric_status(max_rel_err_ratio, CompareConst.MAX_REL_ERR) \ + if inf_nan_consistency.max_rel_inf_nan_consistency else CompareConst.ERROR + 
mean_rel_err_status = self._get_single_metric_status(mean_rel_err_ratio, CompareConst.MEAN_REL_ERR) \ + if inf_nan_consistency.mean_rel_inf_nan_consistency else CompareConst.ERROR + eb_status = self._get_single_metric_status(eb_ratio, CompareConst.EB) \ + if inf_nan_consistency.eb_inf_nan_consistency else CompareConst.ERROR + status_list = [small_value_err_status, rmse_status, max_rel_err_status, mean_rel_err_status] + compare_result = self.get_final_status(status_list) + status_dict = { + CompareConst.SMALL_VALUE_ERR_STATUS: small_value_err_status, + CompareConst.RMSE_STATUS: rmse_status, + CompareConst.MAX_REL_ERR_STATUS: max_rel_err_status, + CompareConst.MEAN_REL_ERR_STATUS: mean_rel_err_status, + CompareConst.EB_STATUS: eb_status + } + metrics.update(status_dict) + metrics.update({CompareConst.COMPARE_RESULT: compare_result}) + return metrics \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/binary_consistency.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/binary_consistency.py new file mode 100644 index 0000000000000000000000000000000000000000..661ab0088622175b64dc8cbf3526be6505e2be4d --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/binary_consistency.py @@ -0,0 +1,68 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import compare_bool_tensor +from msprobe.pytorch.api_accuracy_checker.precision_standard.base_standard import BaseCompare + + +class BinaryCompare(BaseCompare): + """ + Binary comparison class for comparing boolean tensors. + + This class is designed to compare the output of a binary operation between a benchmark and a device. + It calculates the error rate of the comparison and provides a simple metric for assessing the accuracy. + + Attributes: + bench_output (np.ndarray): The output from the benchmark. + device_output (np.ndarray): The output from the device. + compare_column (object): The column object to store comparison results. + dtype (torch.dtype): The data type of the outputs. + + Methods: + _compute_metrics(): Computes the comparison metrics, specifically the error rate. + + Note: + This class assumes that the input data is an instance of InputData containing the benchmark output, + device output, comparison column, and data type. The outputs are expected to be boolean tensors. + + See Also: + BaseCompare: The base class for comparison classes. + compare_bool_tensor: The function used to compare boolean tensors. + """ + def __init__(self, input_data): + super(BinaryCompare, self).__init__(input_data) + + def _pre_compare(self): + pass + + def _compute_metrics(self): + """ + Computes the error rate metric for the comparison between benchmark and device outputs. + + This method calculates the proportion of mismatches between the benchmark output and the device output. + It uses the `compare_bool_tensor` function to compare the two tensors and extract the error rate. + + Returns: + dict: A dictionary containing the computed error rate metric. + The dictionary has the following key: + - "error_rate": The proportion of mismatches between the benchmark and device outputs. 
+ """ + error_rate, _, _ = compare_bool_tensor(self.bench_output, self.device_output) + + return { + "error_rate": error_rate + } diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/standard_config.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/standard_config.py new file mode 100644 index 0000000000000000000000000000000000000000..11a99e044abe3870a41dacd4c8a8f01ba8dc26b0 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/standard_config.py @@ -0,0 +1,218 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch + +from msprobe.core.common.const import CompareConst + + +class StandardConfig: + """ + Standard configuration class for managing precision and comparison thresholds. + + This class provides a centralized way to manage the small value thresholds, absolute tolerances, + and relative tolerances (rtol) used in precision comparisons. It allows for different thresholds + based on the data type, with default values provided for common data types. + + Attributes: + _small_value (dict): A dictionary mapping data types to their corresponding small value thresholds. + _small_value_atol (dict): A dictionary mapping data types to their corresponding absolute tolerances. 
+ _rtol (dict): A dictionary mapping data types to their corresponding relative tolerances. + + Methods: + get_small_value(dtype, standard): Retrieves the small value threshold for the given data type and standard. + get_small_value_atol(dtype, standard): Retrieves the absolute tolerance for the given data type and standard. + get_rtol(dtype): Retrieves the relative tolerance for the given data type. + + Example: + >>> small_value = StandardConfig.get_small_value(torch.float32, CompareConst.BENCHMARK) + >>> atol = StandardConfig.get_small_value_atol(torch.float32, CompareConst.BENCHMARK) + >>> rtol = StandardConfig.get_rtol(torch.float32) + >>> print(small_value, atol, rtol) + 9.5367431640625e-07 9.313225746154785e-10 9.5367431640625e-07 + + Note: + The data type is expected to be a PyTorch data type. If the data type is not found in the dictionary, + the default value is returned. + + See Also: + torch.dtype: PyTorch data types. + """ + _small_value = { + torch.float16: 2**-10, + torch.bfloat16: 2**-10, + torch.float32: 2**-20, + "default": 2**-20 + } + _threshold_small_value_atol = { + torch.float16: 2**-16, + torch.bfloat16: 1e-16, + torch.float32: 2**-30, + "default": 2**-30 + } + _benchmark_small_value_atol = { + torch.float16: 1e-16, + torch.bfloat16: 1e-16, + torch.float32: 2**-30, + "default": 2**-30 + } + _rtol = { + torch.float16: 2**-10, + torch.bfloat16: 2**-8, + torch.float32: 2**-20, + "default": 2**-20 + } + _accumulative_error_bound = { + torch.float16: 2**-8, + torch.bfloat16: 2**-7, + torch.float32: 2**-11, + "default": 2**-11 + } + _small_value_threshold = { + 'error_threshold': 2, + 'warning_threshold': 1, + "default": 1 + } + _rmse_threshold = { + 'error_threshold': 2, + 'warning_threshold': 1, + "default": 1 + } + _max_rel_err_threshold = { + 'error_threshold': 10, + 'warning_threshold': 1, + "default": 1 + } + _mean_rel_err_threshold = { + 'error_threshold': 2, + 'warning_threshold': 1, + "default": 1 + } + _eb_threshold = { + 'error_threshold': 2, + 'warning_threshold': 1, + "default": 1 + } + _minmum_err = { + 'torch.float16': 2**-11, + 'torch.bfloat16': 2**-8, +
'torch.float32': 2**-14, + 'default': 2**-14 + } + _accumulative_error_eb_threshold = { + 'torch.float16': 2**-20, + 'torch.bfloat16': 2**-7, + 'torch.float32': 2**-14, + 'default': 2**-14 + } + + _fp32_mean_ulp_err_threshold = 64 + ulp_err_proportion_ratio = 1 + _fp32_ulp_err_proportion = 0.05 + _fp16_ulp_err_proportion = 0.001 + _special_small_value = 1 + + @classmethod + def get_small_value(cls, dtype, standard): + if standard == CompareConst.ACCUMULATIVE_ERROR_COMPARE: + return cls._special_small_value + return cls._small_value.get(dtype, cls._small_value["default"]) + + @classmethod + def get_small_value_atol(cls, dtype, standard): + standard_dict = { + CompareConst.ABSOLUTE_THRESHOLD: cls._threshold_small_value_atol, + CompareConst.BENCHMARK: cls._benchmark_small_value_atol + } + small_value_atol_standard = standard_dict.get(standard, cls._benchmark_small_value_atol) + return small_value_atol_standard.get(dtype, small_value_atol_standard["default"]) + + @classmethod + def get_rtol(cls, dtype): + return cls._rtol.get(dtype, cls._rtol["default"]) + + @classmethod + def get_small_value_threshold(cls, threshold_type): + return cls._small_value_threshold.get(threshold_type, cls._small_value_threshold["default"]) + + @classmethod + def get_rmse_threshold(cls, threshold_type): + return cls._rmse_threshold.get(threshold_type, cls._rmse_threshold["default"]) + + @classmethod + def get_max_rel_err_threshold(cls, threshold_type): + return cls._max_rel_err_threshold.get(threshold_type, cls._max_rel_err_threshold["default"]) + + @classmethod + def get_mean_rel_err_threshold(cls, threshold_type): + return cls._mean_rel_err_threshold.get(threshold_type, cls._mean_rel_err_threshold["default"]) + + @classmethod + def get_eb_threshold(cls, threshold_type): + return cls._eb_threshold.get(threshold_type, cls._eb_threshold["default"]) + + @classmethod + def get_benchmark_threshold(cls, metric): + metric_threshold_functions = { + 'small_value':
StandardConfig.get_small_value_threshold, + 'rmse': StandardConfig.get_rmse_threshold, + 'max_rel_err': StandardConfig.get_max_rel_err_threshold, + 'mean_rel_err': StandardConfig.get_mean_rel_err_threshold, + 'eb': StandardConfig.get_eb_threshold + } + + threshold_func = metric_threshold_functions.get(metric) + if threshold_func is None: + raise ValueError(f"Unsupported benchmark metric: {metric}") + return threshold_func('error_threshold') + + @classmethod + def get_fp32_mean_ulp_err_threshold(cls): + return cls._fp32_mean_ulp_err_threshold + + @classmethod + def get_ulp_err_proportion_ratio_threshold(cls): + return cls.ulp_err_proportion_ratio + + @classmethod + def get_fp32_ulp_err_proportion_threshold(cls): + return cls._fp32_ulp_err_proportion + + @classmethod + def get_fp16_ulp_err_proportion_threshold(cls): + return cls._fp16_ulp_err_proportion + + @classmethod + def get_ulp_threshold(cls, dtype): + ulp_err_proportion_ratio_threshold = StandardConfig.get_ulp_err_proportion_ratio_threshold() + if dtype == torch.float32: + mean_ulp_err_threshold = StandardConfig.get_fp32_mean_ulp_err_threshold() + ulp_err_proportion_threshold = StandardConfig.get_fp32_ulp_err_proportion_threshold() + return mean_ulp_err_threshold, ulp_err_proportion_threshold, ulp_err_proportion_ratio_threshold + else: + ulp_err_proportion_threshold = StandardConfig.get_fp16_ulp_err_proportion_threshold() + return None, ulp_err_proportion_threshold, ulp_err_proportion_ratio_threshold + + @classmethod + def get_minmum_err(cls, dtype): + return cls._minmum_err.get(dtype, cls._minmum_err["default"]) + + @classmethod + def get_accumulative_error_bound(cls, dtype): + return cls._accumulative_error_bound.get(dtype, cls._accumulative_error_bound["default"]) + + @classmethod + def get_accumulative_error_eb_threshold(cls, dtype): + return cls._accumulative_error_eb_threshold.get(dtype, cls._accumulative_error_eb_threshold["default"]) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/standard_register.py
b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/standard_register.py new file mode 100644 index 0000000000000000000000000000000000000000..82df8c54e87ea1627159a52aef2544028ab21b22 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/standard_register.py @@ -0,0 +1,104 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Callable +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import absolute_standard_api, binary_standard_api, \ + ulp_standard_api, thousandth_standard_api, accumulative_error_standard_api, BINARY_COMPARE_UNSUPPORT_LIST +from msprobe.core.common.const import CompareConst + + +class StandardRegistry: + """ + Registry class for managing comparison standards and functions. + + This class provides a centralized registry for different comparison standards and their corresponding functions. + It allows for dynamic registration of comparison functions based on the standard category. + + Attributes: + comparison_functions (dict): A dictionary mapping standard categories to their corresponding comparison + functions. + standard_categories (dict): A dictionary mapping standard names to their corresponding API categories. 
+ + Methods: + _get_standard_category(api_name, dtype): Determines the standard category for a given API name and data type. + register(standard, func): Registers a comparison function for a given standard category. + get_comparison_function(api_name, dtype): Retrieves the comparison function for a given API name and data type. + + Note: + The data type is used to determine the standard category if it is not supported by binary comparison. + If the API name is not found in any standard category, it defaults to the 'benchmark' category. + + See Also: + BaseCompare: The base class for comparison classes. + """ + def __init__(self): + self.comparison_functions = {} + self.api_standard_function_map = { + CompareConst.ABSOLUTE_THRESHOLD: absolute_standard_api, + CompareConst.BINARY_CONSISTENCY: binary_standard_api, + CompareConst.ULP_COMPARE: ulp_standard_api, + CompareConst.THOUSANDTH_STANDARD: thousandth_standard_api, + CompareConst.ACCUMULATIVE_ERROR_COMPARE: accumulative_error_standard_api + } + + def register(self, standard: str, func: Callable) -> None: + """ + Registers a comparison function for a given standard category. + + Args: + standard (str): The name of the standard category. + func (Callable): The comparison function to be registered. + + Raises: + ValueError: If the provided function is not callable. + """ + if not callable(func): + raise ValueError("The function to be registered must be callable.") + self.comparison_functions[standard] = func + + def get_comparison_function(self, api_name, dtype=None): + standard = self._get_standard_category(api_name, dtype) + return self.comparison_functions.get(standard) + + def _get_standard_category(self, api_name, dtype=None): + """ + Determines the standard category for a given API name and data type. + + This method checks if the provided data type is supported for binary comparison. + If it is, the method returns 'binary_consistency'.
Otherwise, it iterates over the + api_standard_function_map to find a matching category for the API name. + + Args: + api_name (str): The name of the API for which to determine the standard category. + dtype (type, optional): The data type to check against the BINARY_COMPARE_UNSUPPORT_LIST. Defaults to None. + + Returns: + str: The name of the standard category that matches the API name and data type, or 'benchmark' if no match + is found. + + Note: + This method assumes that the api_standard_function_map is properly populated with standard categories and + their corresponding API functions. + The BINARY_COMPARE_UNSUPPORT_LIST should be defined and contain all data types that are not supported for + binary comparison. + """ + if dtype and dtype not in BINARY_COMPARE_UNSUPPORT_LIST: + return CompareConst.BINARY_CONSISTENCY + for name, category in self.api_standard_function_map.items(): + if api_name in category: + return name + return CompareConst.BENCHMARK diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/thousandth_standard.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/thousandth_standard.py new file mode 100644 index 0000000000000000000000000000000000000000..d1114420b99678e38749fdef94bc03afd339bb7f --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/thousandth_standard.py @@ -0,0 +1,63 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import get_rel_err_ratio +from msprobe.core.common.const import CompareConst +from msprobe.pytorch.api_accuracy_checker.precision_standard.base_standard import BaseCompare + + +class ThousandthStdCompare(BaseCompare): + """ + Thousandth standard comparison class for calculating accuracy metrics. + + A subclass of BaseCompare, specifically designed to compare the relative error + between benchmark and device outputs, focusing on errors within a thousandth (0.001) threshold. + + Attributes: + rel_err_orign (float or array-like): The original relative error values to be compared. + compare_column (object): An object to store and update comparison metrics. + + Methods: + _compute_metrics(): Computes the relative error metrics, specifically the thousandth error ratio. + """ + def __init__(self, input_data): + self.rel_err_orign = input_data.rel_err_orign + self.compare_column = input_data.compare_column + + def _pre_compare(self): + pass + + def _compute_metrics(self): + """ + Computes the relative error metrics for the comparison, specifically focusing on errors within a thousandth + (0.001) threshold. + + This method calculates the proportion of relative errors that are within the thousandth threshold. + It uses the `get_rel_err_ratio` function to determine the ratio of relative errors that are less than or + equal to the + specified threshold defined in `CompareConst.THOUSAND_RATIO_THRESHOLD`. + + Returns: + dict: A dictionary containing the computed relative error metric. 
+ The dictionary has the following key: + - 'rel_err_thousandth': The proportion of relative errors within the thousandth threshold. + """ + rel_err_thousandth, _ = get_rel_err_ratio(self.rel_err_orign, CompareConst.THOUSAND_RATIO_THRESHOLD) + + return { + 'rel_err_thousandth': rel_err_thousandth + } diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/ulp_compare.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/ulp_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..df181588ad01836186c82df6fc2d23eef63238f0 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/precision_standard/ulp_compare.py @@ -0,0 +1,200 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+from collections import namedtuple
+import numpy as np
+import torch
+
+from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig
+from msprobe.pytorch.api_accuracy_checker.precision_standard.base_standard import BaseCompare, BasePrecisionCompare
+from msprobe.core.common.const import Const, CompareConst
+from msprobe.pytorch.api_accuracy_checker.compare.algorithm import calc_ratio, get_ulp_err
+from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import ApiPrecisionCompareColumn, check_inf_or_nan, \
+    is_inf_or_nan
+
+
+UlpInfNanConsistency = namedtuple('UlpInfNanConsistency', ['mean_ulp_err_inf_nan_consistency',
+                                                           'ulp_err_proportion_ratio_inf_nan_consistency'])
+
+
+class UlpCompare(BaseCompare):
+    """
+    ULP comparison class for calculating accuracy metrics.
+
+    Attributes:
+        bench_output (array-like): The benchmark output values.
+        device_output (array-like): The device output values.
+        dtype (torch.dtype): The data type of the outputs (e.g., torch.float32 or torch.float16).
+        ulp_err (array-like): The ULP errors calculated from the benchmark and device outputs.
+
+    Methods:
+        _stat_max_ulp_err(ulp_err): Calculates the maximum ULP error.
+        _stat_mean_ulp_err(ulp_err): Calculates the mean ULP error.
+        _stat_ulp_error_proportion(ulp_err): Calculates the proportion of ULP errors exceeding a threshold.
+        _pre_compare(): Prepares for comparison by calculating ULP errors.
+        _compute_metrics(): Computes the ULP error metrics.
+ """ + def __init__(self, input_data): + super(UlpCompare, self).__init__(input_data) + + @staticmethod + def _stat_max_ulp_err(ulp_err): + return np.max(ulp_err) + + @staticmethod + def _stat_mean_ulp_err(ulp_err): + return np.mean(ulp_err) + + def _stat_ulp_error_proportion(self, ulp_err): + if self.dtype == torch.float32: + return np.sum(ulp_err > CompareConst.ULP_FLOAT32_THRESHOLD) / self.bench_output.size + else: + return np.sum(ulp_err > CompareConst.ULP_FLOAT16_THRESHOLD) / self.bench_output.size + + def _pre_compare(self): + self.ulp_err = get_ulp_err(self.bench_output, self.device_output, self.dtype) + + def _compute_metrics(self): + """ + Computes the ULP error metrics for the comparison. + + This method calculates three key metrics: + 1. Maximum ULP error: The maximum difference in ULPs between the benchmark and device outputs. + 2. Mean ULP error: The average difference in ULPs between the benchmark and device outputs. + 3. ULP error proportion: The proportion of ULP errors that exceed a certain threshold. + + Args: + None (this method uses instance variables) + + Returns: + dict: A dictionary containing the computed ULP error metrics. + The dictionary has the following keys: + - "max_ulp_error": The maximum ULP error. + - "mean_ulp_error": The mean ULP error. + - "ulp_error_proportion": The proportion of ULP errors exceeding the threshold. 
+ """ + max_ulp_error = self._stat_max_ulp_err(self.ulp_err) + mean_ulp_error = self._stat_mean_ulp_err(self.ulp_err) + + ulp_error_proportion = self._stat_ulp_error_proportion(self.ulp_err) + + return { + "max_ulp_error": max_ulp_error, + "mean_ulp_error": mean_ulp_error, + "ulp_error_proportion": ulp_error_proportion + } + + +class UlpPrecisionCompare(BasePrecisionCompare): + def __init__(self, input_data): + super().__init__(input_data) + self.compare_algorithm = CompareConst.ULP_COMPARE_ALGORITHM_NAME + + @staticmethod + def _compute_ulp_err_proportion_ratio(npu_value, gpu_value, dtype): + column_name = ApiPrecisionCompareColumn.ULP_ERR_PROPORTION + if is_inf_or_nan(npu_value) or is_inf_or_nan(gpu_value): + return check_inf_or_nan(npu_value, gpu_value, column_name) + else: + return calc_ratio(npu_value, gpu_value, dtype), True, "" + + def _compute_mean_ulp_err(self): + column_name = ApiPrecisionCompareColumn.MEAN_ULP_ERR + npu_value, gpu_value = self._get_and_convert_values(column_name) + if is_inf_or_nan(npu_value) or is_inf_or_nan(gpu_value): + _, mean_ulp_err_inf_nan_consistency, message = check_inf_or_nan(npu_value, gpu_value, column_name) + return npu_value, mean_ulp_err_inf_nan_consistency, message + else: + return npu_value, True, "" + + def _compute_ulp_err_proportion(self): + column_name = ApiPrecisionCompareColumn.ULP_ERR_PROPORTION + npu_value, gpu_value = self._get_and_convert_values(column_name) + return npu_value, gpu_value + + def _get_status(self, metrics, inf_nan_consistency): + ulp_inf_nan_consistency = inf_nan_consistency.mean_ulp_err_inf_nan_consistency and \ + inf_nan_consistency.ulp_err_proportion_ratio_inf_nan_consistency + + if not ulp_inf_nan_consistency: + status_dict = { + CompareConst.ULP_ERR_STATUS: CompareConst.ERROR + } + compare_result = CompareConst.ERROR + metrics[CompareConst.COMPARE_MESSAGE] = metrics.get(CompareConst.COMPARE_MESSAGE, "") + \ + "ERROR: ULP误差不满足标准\n" + metrics.update({CompareConst.COMPARE_RESULT: 
compare_result}) + return metrics + + dtype = self.row_npu.get(ApiPrecisionCompareColumn.DEVICE_DTYPE) + mean_ulp_err = metrics.get(CompareConst.MEAN_ULP_ERR) + ulp_err_proportion = metrics.get(CompareConst.ULP_ERR_PROPORTION) + ulp_err_proportion_ratio = metrics.get(CompareConst.ULP_ERR_PROPORTION_RATIO) + if dtype == Const.TORCH_FLOAT32: + status, final_message = \ + self._get_fp32_ulp_err_status(mean_ulp_err, ulp_err_proportion, ulp_err_proportion_ratio) + else: + status, final_message = \ + self._get_fp16_ulp_err_status(ulp_err_proportion, ulp_err_proportion_ratio) + metrics[CompareConst.COMPARE_MESSAGE] = metrics.get(CompareConst.COMPARE_MESSAGE, "") + final_message + + status_dict = { + CompareConst.ULP_ERR_STATUS: status + } + compare_result = status + metrics.update(status_dict) + metrics.update({CompareConst.COMPARE_RESULT: compare_result}) + return metrics + + def _get_fp32_ulp_err_status(self, mean_ulp_err, ulp_err_proportion, ulp_err_proportion_ratio): + mean_ulp_err_threshold, ulp_err_proportion_threshold, ulp_err_proportion_ratio_threshold = \ + StandardConfig.get_ulp_threshold(torch.float32) + if mean_ulp_err < mean_ulp_err_threshold: + return CompareConst.PASS, "" + elif ulp_err_proportion < ulp_err_proportion_threshold: + return CompareConst.PASS, "" + elif ulp_err_proportion_ratio < ulp_err_proportion_ratio_threshold: + return CompareConst.PASS, "" + compare_message = "ERROR: ULP误差不满足标准\n" + return CompareConst.ERROR, compare_message + + def _get_fp16_ulp_err_status(self, ulp_err_proportion, ulp_err_proportion_ratio): + _, ulp_err_proportion_threshold, ulp_err_proportion_ratio_threshold = \ + StandardConfig.get_ulp_threshold(torch.float16) + if ulp_err_proportion < ulp_err_proportion_threshold: + return CompareConst.PASS, "" + elif ulp_err_proportion_ratio < ulp_err_proportion_ratio_threshold: + return CompareConst.PASS, "" + compare_message = "ERROR: ULP误差不满足标准\n" + return CompareConst.ERROR, compare_message + + def _compute_ratio(self): + 
compare_message = "" + mean_ulp_err, mean_ulp_err_inf_nan_consistency, mean_ulp_err_message = self._compute_mean_ulp_err() + compare_message += mean_ulp_err_message + npu_ulp_err_proportion, gpu_ulp_err_proportion = self._compute_ulp_err_proportion() + ulp_err_proportion_ratio, ulp_err_proportion_ratio_inf_nan_consistency, ulp_err_proportion_ratio_message = \ + self._compute_ulp_err_proportion_ratio(npu_ulp_err_proportion, gpu_ulp_err_proportion, str(self.dtype)) + compare_message += ulp_err_proportion_ratio_message + metrics = { + CompareConst.MEAN_ULP_ERR: mean_ulp_err, + CompareConst.ULP_ERR_PROPORTION: npu_ulp_err_proportion, + CompareConst.ULP_ERR_PROPORTION_RATIO: ulp_err_proportion_ratio, + CompareConst.COMPARE_MESSAGE: compare_message + } + return metrics, UlpInfNanConsistency(mean_ulp_err_inf_nan_consistency, + ulp_err_proportion_ratio_inf_nan_consistency) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py index 0026d9e36dbcc52c6f4547ec543d9226ac3b1989..9d89b2de32f70c6fa7abf38add49b58a13531d7a 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py @@ -28,6 +28,7 @@ from msprobe.pytorch.common.log import logger from msprobe.pytorch.common.utils import load_pt from msprobe.core.common.const import Const, FileCheckConst, CompareConst + TORCH_TYPE = ["torch.device", "torch.dtype"] TENSOR_DATA_LIST = ["torch.Tensor", "torch.nn.parameter.Parameter"] FLOAT_TYPE = [ @@ -139,7 +140,12 @@ def gen_random_tensor(info, convert_type): high_info = [high, high_origin] data_dtype = info.get('dtype') shape = tuple(info.get('shape')) - if not isinstance(low, (int, float)) or not isinstance(high, (int, float)): + if 0 in shape: + low, low_origin = 0, 0 + high, high_origin = 0, 0 + low_info = [low, low_origin] + high_info = 
[high, high_origin] + elif not isinstance(low, (int, float)) or not isinstance(high, (int, float)): error_info = f'Data info Min: {low} , Max: {high}, info type must be int or float.' raise CompareException(CompareException.INVALID_PARAM_ERROR, error_info) if data_dtype == "torch.bool": @@ -305,6 +311,19 @@ def gen_kwargs(api_info, api_name, convert_type=None, real_data_path=None): kwargs_params[key] = gen_list_kwargs(value, api_name, convert_type, real_data_path) elif value is None: kwargs_params[key] = None + elif key == 'atten_mask' and api_name == 'npu_fusion_attention': + sparse_mode = kwargs_params.get('sparse_mode', {}) + if isinstance(sparse_mode, dict): + sparse_mode_value = sparse_mode.get('value', 0) + elif isinstance(sparse_mode, int): + sparse_mode_value = sparse_mode + else: + msg = f'The sparse_mode value is not int or dict, but {type(sparse_mode)}' + raise CompareException(CompareException.INVALID_PARAM_ERROR, msg) + if sparse_mode_value in Const.FA_SPECIAL_SPARSE_MODE: + kwargs_params[key] = gen_atten_mask(value, convert_type, real_data_path) + else: + kwargs_params[key] = gen_data(value, api_name, True, convert_type, real_data_path) elif value.get('type') in TENSOR_DATA_LIST or value.get('type').startswith("numpy"): kwargs_params[key] = gen_data(value, api_name, True, convert_type, real_data_path) elif value.get('type') in TORCH_TYPE: @@ -314,6 +333,30 @@ def gen_kwargs(api_info, api_name, convert_type=None, real_data_path=None): return kwargs_params +def gen_atten_mask(info, convert_type, real_data_path): + """ + Function Description: + Based on API basic information, generate input parameters: atten_mask, for API forward running + Parameter: + info: API basic information. Dict + convert_type: convert ori_type to dist_type flag. + real_data_path: the root directory for storing real data. 
+    """
+    check_object_type(info, dict)
+    data_type = info.get('type')
+    data_path = info.get('datapath', info.get('data_name'))
+    data_path = get_full_data_path(data_path, real_data_path)
+    data = None
+    if data_type in TENSOR_DATA_LIST:
+        if data_path:
+            data = gen_real_tensor(data_path, convert_type)
+        else:
+            # Generate a 2048x2048 triangular matrix: elements above the diagonal are 1, the rest are 0.
+            # This is the atten_mask shape expected by npu_fusion_attention when sparse_mode is 2, 3 or 4.
+            data = torch.triu(torch.ones([2048, 2048]), diagonal=1).to(torch.bool)
+    return data
+
+
 def gen_torch_kwargs(kwargs_params, key, value):
     if value.get('type') != "torch.device":
         module_name, attribute_name = get_module_and_atttribute_name(value.get('value'))
@@ -341,6 +384,23 @@ def gen_list_kwargs(kwargs_item_value, api_name, convert_type, real_data_path=No
     return kwargs_item_result
 
 
+def get_output_dtype(api_info):
+    """
+    Function Description:
+        Based on API basic information, get the output data dtype
+    Parameter:
+        api_info: API basic information. Dict
+    """
+    output_dtype = None
+    output_info = api_info.get(Const.OUTPUT)
+    if output_info and isinstance(output_info[0], dict):
+        output_str_dtype = output_info[0].get(Const.DTYPE)
+        if output_str_dtype in Const.TORCH_FLOAT_DTYPE:
+            module_name, attribute_name = get_module_and_atttribute_name(output_str_dtype)
+            output_dtype = get_attribute(module_name, attribute_name)
+    return output_dtype
+
+
 def gen_api_params(api_info, api_name, need_grad=True, convert_type=None, real_data_path=None):
     """
     Function Description:
@@ -367,4 +427,5 @@ def gen_api_params(api_info, api_name, need_grad=True, convert_type=None, real_d
     else:
         logger.warning(f'Warning: No args in {api_info} ')
         args_params = []
-    return args_params, kwargs_params
+    output_dtype = get_output_dtype(api_info)
+    return args_params, kwargs_params, output_dtype
diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py
index 
05d268f3ec96c81b3cc03934d39d980720d31040..498102b475f564564d6039a81e305fba3bceec17 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py @@ -33,13 +33,15 @@ from msprobe.pytorch.api_accuracy_checker.compare.compare import Comparator from msprobe.pytorch.common import parse_json_info_forward_backward from msprobe.pytorch.common.log import logger from msprobe.core.common.file_utils import FileChecker, check_file_suffix, check_link, FileOpen, \ - create_directory + create_directory, load_json, save_json from msprobe.core.common.file_utils import remove_path -from msprobe.core.common.const import FileCheckConst +from msprobe.core.common.const import FileCheckConst, Const +from msprobe.core.common.utils import CompareException def split_json_file(input_file, num_splits, filter_api): forward_data, backward_data, real_data_path = parse_json_info_forward_backward(input_file) + input_dir = os.path.dirname(os.path.abspath(input_file)) if filter_api: forward_data = preprocess_forward_content(forward_data) for data_name in list(forward_data.keys()): @@ -47,9 +49,11 @@ def split_json_file(input_file, num_splits, filter_api): for data_name in list(backward_data.keys()): backward_data[f"{data_name}.backward"] = backward_data.pop(data_name) - with FileOpen(input_file, 'r') as file: - input_data = json.load(file) - input_data.pop("data") + input_data = load_json(input_file) + if input_data.get("data") is None: + logger.error("Invalid input file, 'data' field is missing") + raise CompareException("Invalid input file, 'data' field is missing") + input_data.pop("data") items = list(forward_data.items()) total_items = len(items) @@ -68,9 +72,8 @@ def split_json_file(input_file, num_splits, filter_api): **backward_data } } - split_filename = f"temp_part{i}.json" - with FileOpen(split_filename, 'w') as split_file: - json.dump(temp_data, split_file) + 
split_filename = os.path.join(input_dir, f"temp_part{i}.json") + save_json(split_filename, temp_data) split_files.append(split_filename) return split_files, total_items @@ -122,7 +125,7 @@ def run_parallel_ut(config): if output == '': break if '[ERROR]' in output: - logger.warning(output, end='') + logger.warning(output) sys.stdout.flush() except ValueError as e: logger.warning(f"An error occurred while reading subprocess output: {e}") @@ -182,15 +185,19 @@ def run_parallel_ut(config): def prepare_config(args): - check_link(args.api_info_file) - api_info = os.path.realpath(args.api_info_file) - check_file_suffix(api_info, FileCheckConst.JSON_SUFFIX) - out_path = os.path.realpath(args.out_path) if args.out_path else "./" + api_info_file_checker = FileChecker(file_path=args.api_info_file, path_type=FileCheckConst.FILE, + ability=FileCheckConst.READ_ABLE, file_type=FileCheckConst.JSON_SUFFIX) + api_info = api_info_file_checker.common_check() + out_path = args.out_path if args.out_path else Const.DEFAULT_PATH create_directory(out_path) out_path_checker = FileChecker(out_path, FileCheckConst.DIR, ability=FileCheckConst.WRITE_ABLE) out_path = out_path_checker.common_check() split_files, total_items = split_json_file(api_info, args.num_splits, args.filter_api) - config_path = os.path.realpath(args.config_path) if args.config_path else None + config_path = args.config_path if args.config_path else None + if config_path: + config_path_checker = FileChecker(config_path, FileCheckConst.FILE, + FileCheckConst.READ_ABLE, FileCheckConst.JSON_SUFFIX) + config_path = config_path_checker.common_check() result_csv_path = args.result_csv_path or os.path.join( out_path, f"accuracy_checking_result_{time.strftime('%Y%m%d%H%M%S')}.csv") if not args.result_csv_path: diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py index 
0450c2c492a571fb123da771c3dd7675e25fccdf..6214d892906bef44d94474c6415674f39099357b 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py @@ -23,16 +23,19 @@ try: import torch_npu except ImportError: is_gpu = True + current_device = "cuda" else: is_gpu = False + current_device = "npu" import torch from tqdm import tqdm from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import generate_device_params, get_api_info -from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import exec_api -from msprobe.core.common.file_utils import check_link +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import exec_api, is_unsupported_api, ExecParams +from msprobe.core.common.file_utils import check_link, FileChecker +from msprobe.pytorch.api_accuracy_checker.common.utils import extract_basic_api_segments +from msprobe.core.common.const import FileCheckConst, Const from msprobe.pytorch.common.log import logger from msprobe.pytorch.common.parse_json import parse_json_info_forward_backward -from msprobe.core.common.const import Const def check_tensor_overflow(x): @@ -60,52 +63,80 @@ def check_tensor_overflow(x): return False -def check_data_overflow(x): - if isinstance(x, (tuple, list)) and x: - for _, item in enumerate(x): - if check_data_overflow(item): - return True - return False +def check_data_overflow(x, device): + if isinstance(x, (tuple, list)): + if not x: + return False + return any(check_data_overflow(item, device) for item in x) else: - return check_tensor_overflow(x) + if device == Const.CPU_LOWERCASE: + return check_tensor_overflow(x) + else: + return torch_npu.npu.utils.npu_check_overflow(x) + + +def is_bool_output(x): + if isinstance(x, (tuple, list)): + if not x: + return False + return any(is_bool_output(item) for item in x) + else: + return isinstance(x, bool) def run_overflow_check(forward_file): 
     logger.info("start UT test")
     forward_content, _, real_data_path = parse_json_info_forward_backward(forward_file)
+    if real_data_path:
+        dump_path = os.path.dirname(forward_file)
+        real_data_path = os.path.join(dump_path, Const.DUMP_TENSOR_DATA)
     for api_full_name, api_info_dict in tqdm(forward_content.items()):
+        if is_unsupported_api(api_full_name, is_overflow_check=True):
+            continue
         try:
             run_torch_api(api_full_name, api_info_dict, real_data_path)
         except Exception as err:
             _, api_name, _ = api_full_name.split(Const.SEP)
             if "not implemented for 'Half'" in str(err):
-                logger.warning(f"API {api_name} not support half tensor in CPU, please add {api_name} to CONVERT_API "
-                               f"'fp16_to_fp32' list in accuracy_tools/api_accuracy_check/common/utils.py file.")
+                logger.warning(f"API {api_name} does not support half tensors on CPU. This API does not support "
+                               "overflow check, so it will be skipped.")
             elif "expected scalar type Long" in str(err):
                 logger.warning(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API "
-                               f"'int32_to_int64' list in accuracy_tools/api_accuracy_check/common/utils.py file.")
+                               "'int32_to_int64' list in accuracy_tools/msprobe/core/common/const.py file.")
+            elif "could not create a primitive descriptor for a matmul primitive" in str(err):
+                logger.warning(f"API {api_name} does not support the matmul primitive on CPU due to a PyTorch bug, "
+                               "so it will be skipped.")
             else:
                 logger.error(f"Run {api_full_name} UT Error: %s" % str(err))
 
 
 def run_torch_api(api_full_name, api_info_dict, real_data_path):
     torch.npu.clear_npu_overflow_flag()
-    api_type, api_name, _ = api_full_name.split(Const.SEP)
+    api_type, api_name = extract_basic_api_segments(api_full_name)
     args, kwargs, need_grad = get_api_info(api_info_dict, api_name, real_data_path)
     if not need_grad:
         logger.warning("%s function with out=... arguments don't support automatic differentiation, skip backward."
                        % api_full_name)
+    device_info_kwargs = kwargs.get(Const.DEVICE)
+    if device_info_kwargs and device_info_kwargs.get(Const.VALUE):
+        kwargs[Const.DEVICE] = current_device
     npu_args, npu_kwargs = generate_device_params(args, kwargs, False, api_name)
-    if kwargs.get("device"):
-        del kwargs["device"]
-    out = exec_api(api_type, api_name, Const.CPU_LOWERCASE, args, kwargs)
-    npu_out = exec_api(api_type, api_name, Const.NPU_LOWERCASE, npu_args, npu_kwargs)
+    if kwargs.get(Const.DEVICE):
+        del kwargs[Const.DEVICE]
+    cpu_exec_params = ExecParams(api_type, api_name, Const.CPU_LOWERCASE, args, kwargs, False, None)
+    device_exec_params = ExecParams(api_type, api_name, Const.NPU_LOWERCASE, npu_args, npu_kwargs, False, None)
+    out = exec_api(cpu_exec_params)
+    npu_out = exec_api(device_exec_params)
     if out is None and npu_out is None:
         logger.warning("The %s overflow is a normal overflow, out and npu_out is None." % api_full_name)
         return
+    if is_bool_output(out) or is_bool_output(npu_out):
+        logger.warning("The output of %s is bool type. This dtype does not support overflow, so it will be skipped."
+                       % api_full_name)
+        return
 
-    cpu_overflow = check_data_overflow(out)
-    npu_overflow = torch_npu.npu.utils.npu_check_overflow(npu_out)
+    cpu_overflow = check_data_overflow(out, Const.CPU_LOWERCASE)
+    npu_overflow = check_data_overflow(npu_out, Const.NPU_LOWERCASE)
     if cpu_overflow == npu_overflow:
         logger.warning("The %s overflow is a normal overflow."
% api_full_name) else: @@ -135,8 +166,9 @@ def _run_overflow_check(parser=None): def _run_overflow_check_command(args): torch.npu.set_compile_mode(jit_compile=args.jit_compile) npu_device = "npu:" + str(args.device_id) - check_link(args.api_info_file) - api_info = os.path.realpath(args.api_info_file) + api_info_file_checker = FileChecker(file_path=args.api_info_file, path_type=FileCheckConst.FILE, + ability=FileCheckConst.READ_ABLE, file_type=FileCheckConst.JSON_SUFFIX) + api_info = api_info_file_checker.common_check() try: torch.npu.set_device(npu_device) except Exception as error: diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py index 1302ad94ea072a22c5980af1f3519a506d3f6405..905687c1bfc932883396481410c333a7566fd342 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py @@ -17,7 +17,7 @@ import argparse import os -import csv +import re import sys import time import gc @@ -31,39 +31,40 @@ except ImportError: else: is_gpu = False current_device = "npu" + import torch from tqdm import tqdm from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import BackwardMessage, UtDataInfo, \ - get_validated_result_csv_path, get_validated_details_csv_path, exec_api, record_skip_info + get_validated_result_csv_path, get_validated_details_csv_path, exec_api, record_skip_info, is_unsupported_api from msprobe.pytorch.api_accuracy_checker.run_ut.data_generate import gen_api_params, gen_args from msprobe.pytorch.api_accuracy_checker.common.utils import api_info_preprocess, \ initialize_save_path, UtDataProcessor, extract_basic_api_segments, ApiData from msprobe.pytorch.api_accuracy_checker.compare.compare import Comparator from msprobe.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn -from 
msprobe.pytorch.api_accuracy_checker.common.config import msCheckerConfig +from msprobe.pytorch.api_accuracy_checker.common.config import CheckerConfig from msprobe.pytorch.common.parse_json import parse_json_info_forward_backward from msprobe.core.common.file_utils import FileChecker, change_mode, \ - create_directory, get_json_contents, read_csv + create_directory, get_json_contents, read_csv, check_file_or_directory_path, check_crt_valid from msprobe.pytorch.common.log import logger from msprobe.pytorch.pt_config import parse_json_config from msprobe.core.common.const import Const, FileCheckConst, CompareConst +from msprobe.core.common.utils import safe_get_value, CompareException +from msprobe.pytorch.common.utils import seed_all from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.attl import ATTL, ATTLConfig, move2device_exec from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.device_dispatch import ConsumerDispatcher -from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import generate_cpu_params, generate_device_params +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import generate_cpu_params, generate_device_params, \ + ExecParams current_time = time.strftime("%Y%m%d%H%M%S") UT_ERROR_DATA_DIR = 'ut_error_data' + current_time RESULT_FILE_NAME = "accuracy_checking_result_" + current_time + ".csv" DETAILS_FILE_NAME = "accuracy_checking_details_" + current_time + ".csv" -RunUTConfig = namedtuple('RunUTConfig', ['forward_content', 'backward_content', 'result_csv_path', 'details_csv_path', - 'save_error_data', 'is_continue_run_ut', 'real_data_path', 'white_list', - 'black_list', 'error_data_path', 'online_config']) -OnlineConfig = namedtuple('OnlineConfig', ['is_online', 'nfs_path', 'host', 'port', 'rank_list', 'tls_path']) not_backward_list = ['repeat_interleave'] +unsupported_backward_list = ['masked_select'] tqdm_params = { @@ -99,7 +100,11 @@ def run_ut(config): run_api_online(config, compare) else: csv_df = 
read_csv(config.result_csv_path) - api_name_set = {row[0] for row in csv_df.itertuples(index=False, name=None)} + try: + api_name_set = {row[0] for row in csv_df.itertuples(index=False, name=None)} + except IndexError: + logger.error(f"Read {config.result_csv_path} error, api_name_set is empty.") + api_name_set = set() run_api_offline(config, compare, api_name_set) for result_csv_path, details_csv_path in zip(compare.save_path_list, compare.detail_save_path_list): change_mode(result_csv_path, FileCheckConst.DATA_FILE_AUTHORITY) @@ -140,7 +145,7 @@ def run_api_offline(config, compare, api_name_set): except Exception as err: if "expected scalar type Long" in str(err): logger.warning(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " - f"'int32_to_int64' list in accuracy_tools/api_accuracy_check/common/utils.py file.") + "'int32_to_int64' list in accuracy_tools/msprobe/core/common/const.py file.") else: logger.error(f"Run {api_full_name} UT Error: %s" % str(err)) compare_alg_results = err_column.to_column_value(CompareConst.SKIP, str(err)) @@ -220,14 +225,6 @@ def blacklist_and_whitelist_filter(api_name, black_list, white_list): return False -def is_unsupported_api(api_name): - split_name = api_name.split(Const.SEP)[0] - flag = split_name == Const.DISTRIBUTED - if flag: - logger.info(f"{split_name} api is not supported for run ut. 
SKIP.") - return flag - - def do_save_error_data(api_full_name, data_info, error_data_path, is_fwd_success, is_bwd_success): if not is_fwd_success or not is_bwd_success: processor = UtDataProcessor(error_data_path) @@ -244,7 +241,8 @@ def run_torch_api(api_full_name, real_data_path, backward_content, api_info_dict in_fwd_data_list = [] backward_message = '' api_type, api_name = extract_basic_api_segments(api_full_name) - args, kwargs, need_grad = get_api_info(api_info_dict, api_name, real_data_path) + args, kwargs, output_dtype = get_api_info(api_info_dict, api_name, real_data_path) + need_grad = check_need_grad(api_info_dict) in_fwd_data_list.append(args) in_fwd_data_list.append(kwargs) need_backward = api_full_name in backward_content @@ -253,16 +251,32 @@ def run_torch_api(api_full_name, real_data_path, backward_content, api_info_dict backward_message += BackwardMessage.UNSUPPORT_BACKWARD_MESSAGE if api_name in not_backward_list: need_grad = False - logger.warning("%s %s" % (api_full_name, BackwardMessage.NO_BACKWARD_RESULT_MESSAGE)) + logger.info("%s %s" % (api_full_name, BackwardMessage.NO_BACKWARD_RESULT_MESSAGE)) backward_message += BackwardMessage.NO_BACKWARD_RESULT_MESSAGE + if api_name in unsupported_backward_list: + need_grad = False + logger.info("%s %s" % (api_full_name, BackwardMessage.UNSUPPORT_API_MESSAGE)) + backward_message += BackwardMessage.UNSUPPORT_API_MESSAGE need_backward = need_backward and need_grad - if kwargs.get("device"): - del kwargs["device"] - cpu_args, cpu_kwargs = generate_cpu_params(args, kwargs, need_backward, api_name) + + device_info_kwargs = kwargs.get(Const.DEVICE) + if device_info_kwargs and device_info_kwargs.get(Const.VALUE): + kwargs[Const.DEVICE] = current_device device_args, device_kwargs = generate_device_params(args, kwargs, need_backward, api_name) + if kwargs.get(Const.DEVICE): + del kwargs[Const.DEVICE] + cpu_params = generate_cpu_params(args, kwargs, need_backward, api_name) + cpu_args, cpu_kwargs = 
cpu_params.cpu_args, cpu_params.cpu_kwargs + autocast_dtype, is_autocast = cpu_params.autocast_dtype, cpu_params.is_autocast + if not is_autocast and output_dtype: + is_autocast = autocast_dtype != output_dtype + autocast_dtype = output_dtype bench_grad_out, device_grad_out = None, None - out = exec_api(api_type, api_name, Const.CPU_LOWERCASE, cpu_args, cpu_kwargs) - device_out = exec_api(api_type, api_name, current_device, device_args, device_kwargs) + cpu_exec_params = ExecParams(api_type, api_name, Const.CPU_LOWERCASE, cpu_args, cpu_kwargs, False, autocast_dtype) + out = exec_api(cpu_exec_params) + device_exec_params = ExecParams(api_type, api_name, current_device, device_args, device_kwargs, is_autocast, + autocast_dtype) + device_out = exec_api(device_exec_params) current_path = os.path.dirname(os.path.realpath(__file__)) ut_setting_path = os.path.join(current_path, "torch_ut_setting.json") api_setting_dict = get_json_contents(ut_setting_path) @@ -278,16 +292,18 @@ def run_torch_api(api_full_name, real_data_path, backward_content, api_info_dict func_options = { 'real_data_path': real_data_path } - grad = gen_args(backward_args, api_name, func_options)[0] - bench_grad, _ = generate_cpu_params(grad, {}, False, api_name) + grad = gen_args(backward_args, api_name, func_options) + grad = safe_get_value(grad, 0, "grad") + grad_params = generate_cpu_params(grad, {}, False, api_name) + bench_grad = grad_params.cpu_args bench_grad_out = run_backward(cpu_args, bench_grad, grad_index, out) device_grad = grad.clone().detach().to(current_device) device_grad_out = run_backward(device_args, device_grad, grad_index, device_out) else: backward_message += BackwardMessage.MULTIPLE_BACKWARD_MESSAGE if api_name == "npu_fusion_attention": - out = out[0] - device_out = device_out[0] + out = safe_get_value(out, 0, "out") + device_out = safe_get_value(device_out, 0, "device_out") return UtDataInfo(bench_grad_out, device_grad_out, device_out, out, bench_grad, in_fwd_data_list, 
backward_message) @@ -306,13 +322,18 @@ def run_torch_api_online(api_full_name, api_data, backward_content): return UtDataInfo(None, None, out, device_out, None, in_fwd_data_list, None, rank=api_data.rank) -def get_api_info(api_info_dict, api_name, real_data_path): - convert_type, api_info_dict = api_info_preprocess(api_name, api_info_dict) +def check_need_grad(api_info_dict): need_grad = True - if api_info_dict.get("input_kwargs") and "out" in api_info_dict.get("input_kwargs"): + if api_info_dict.get(Const.INPUT_KWARGS) and "out" in api_info_dict.get(Const.INPUT_KWARGS): need_grad = False - args, kwargs = gen_api_params(api_info_dict, api_name, need_grad, convert_type, real_data_path) - return args, kwargs, need_grad + return need_grad + + +def get_api_info(api_info_dict, api_name, real_data_path): + convert_type, api_info_dict = api_info_preprocess(api_name, api_info_dict) + need_grad = check_need_grad(api_info_dict) + args, kwargs, output_dtype = gen_api_params(api_info_dict, api_name, need_grad, convert_type, real_data_path) + return args, kwargs, output_dtype def need_to_backward(grad_index, out): @@ -323,18 +344,31 @@ def need_to_backward(grad_index, out): def run_backward(args, grad, grad_index, out): if grad_index is not None: + if grad_index >= len(out): + logger.error(f"Run backward error when grad_index is {grad_index}") + raise IndexError(f"Run backward error when grad_index is {grad_index}") out[grad_index].backward(grad) else: out.backward(grad) - args_grad = [] - for arg in args: - if isinstance(arg, torch.Tensor): - args_grad.append(arg.grad) - grad_out = args_grad + + grad_out = extract_tensors_grad(args) return grad_out +def extract_tensors_grad(args, depth=0): + if depth > Const.MAX_DEPTH: + logger.error("The depth of arg_in is too large, please check the arg_in.") + raise CompareException(CompareException.RECURSION_LIMIT_ERROR) + grads = [] + for arg in args: + if isinstance(arg, torch.Tensor): + grads.append(arg.grad) + elif isinstance(arg, 
(list, tuple)): + grads.extend(extract_tensors_grad(arg, depth+1)) + return grads + + def initialize_save_error_data(error_data_path): create_directory(error_data_path) error_data_path_checker = FileChecker(error_data_path, FileCheckConst.DIR, @@ -437,9 +471,55 @@ def _run_ut(parser=None): run_ut_command(args) +def checked_online_config(online_config): + if not online_config.is_online: + return + if not isinstance(online_config.is_online, bool): + raise ValueError("is_online must be bool type") + # rank_list + if not isinstance(online_config.rank_list, list): + raise ValueError("rank_list must be a list") + if online_config.rank_list and not all(isinstance(rank, int) for rank in online_config.rank_list): + raise ValueError("All elements in rank_list must be integers") + + # nfs_path + if online_config.nfs_path: + check_file_or_directory_path(online_config.nfs_path, isdir=True) + return + # tls_path + if online_config.tls_path: + check_file_or_directory_path(online_config.tls_path, isdir=True) + check_file_or_directory_path(os.path.join(online_config.tls_path, "server.key")) + check_file_or_directory_path(os.path.join(online_config.tls_path, "server.crt")) + check_crt_valid(os.path.join(online_config.tls_path, "server.crt")) + + # host and port + if not isinstance(online_config.host, str) or not re.match(Const.ipv4_pattern, online_config.host): + raise Exception(f"host: {online_config.host} is invalid.") + if not isinstance(online_config.port, int) or not (0 < online_config.port <= 65535): + raise Exception(f"port: {online_config.port} is invalid, port range 0-65535.") + + def run_ut_command(args): + if args.config_path: + config_path_checker = FileChecker(args.config_path, FileCheckConst.FILE, + FileCheckConst.READ_ABLE, FileCheckConst.JSON_SUFFIX) + checked_config_path = config_path_checker.common_check() + _, task_config = parse_json_config(checked_config_path, Const.RUN_UT) + checker_config = CheckerConfig(task_config) + else: + checker_config = CheckerConfig() 
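The `checked_online_config` helper above validates the online endpoint with an IPv4 regex match on `host` and a range check on `port`. A standalone sketch of those two checks — the pattern below is a stand-in for `Const.ipv4_pattern`, whose exact definition is not shown in this diff, and the helper name is invented:

```python
import re

# Hypothetical stand-in for Const.ipv4_pattern; the real pattern lives in msprobe's Const.
IPV4_PATTERN = r"^((25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)\.){3}(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)$"


def validate_endpoint(host, port):
    # Mirrors the diff's checks: host must be a dotted-quad IPv4 string,
    # port must be an int in (0, 65535].
    if not isinstance(host, str) or not re.match(IPV4_PATTERN, host):
        return False
    return isinstance(port, int) and 0 < port <= 65535
```

In the diff these failures raise `Exception` with a message; the sketch returns a bool purely to stay self-contained.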
+ + if not checker_config.is_online and not args.api_info_file: + logger.error("Please provide api_info_file for offline run ut.") + raise Exception("Please provide api_info_file for offline run ut.") + if not is_gpu: torch.npu.set_compile_mode(jit_compile=args.jit_compile) + if args.jit_compile: + torch.npu.config.allow_internal_format = True + else: + torch.npu.config.allow_internal_format = False used_device = current_device + ":" + str(args.device_id[0]) try: if is_gpu: @@ -458,6 +538,9 @@ def run_ut_command(args): ability=FileCheckConst.READ_ABLE, file_type=FileCheckConst.JSON_SUFFIX) checked_api_info = api_info_file_checker.common_check() forward_content, backward_content, real_data_path = parse_json_info_forward_backward(checked_api_info) + if real_data_path: + dump_path = os.path.dirname(checked_api_info) + real_data_path = os.path.join(dump_path, Const.DUMP_TENSOR_DATA) if args.filter_api: logger.info("Start filtering the api in the api_info_file.") forward_content = preprocess_forward_content(forward_content) @@ -474,43 +557,31 @@ def run_ut_command(args): if args.result_csv_path: result_csv_path = get_validated_result_csv_path(args.result_csv_path, 'result') details_csv_path = get_validated_details_csv_path(result_csv_path) - white_list = msCheckerConfig.white_list - black_list = msCheckerConfig.black_list - error_data_path = msCheckerConfig.error_data_path - is_online = msCheckerConfig.is_online - nfs_path = msCheckerConfig.nfs_path - host = msCheckerConfig.host - port = msCheckerConfig.port - rank_list = msCheckerConfig.rank_list - tls_path = msCheckerConfig.tls_path - if args.config_path: - config_path_checker = FileChecker(args.config_path, FileCheckConst.FILE, - FileCheckConst.READ_ABLE, FileCheckConst.JSON_SUFFIX) - checked_config_path = config_path_checker.common_check() - _, task_config = parse_json_config(checked_config_path, Const.RUN_UT) - white_list = task_config.white_list - black_list = task_config.black_list - error_data_path = 
task_config.error_data_path - is_online = task_config.is_online - nfs_path = task_config.nfs_path - host = task_config.host - port = task_config.port - rank_list = task_config.rank_list - tls_path = task_config.tls_path + error_data_path = checker_config.error_data_path if save_error_data: if args.result_csv_path: time_info = result_csv_path.split('.')[0].split('_')[-1] global UT_ERROR_DATA_DIR UT_ERROR_DATA_DIR = 'ut_error_data' + time_info error_data_path = initialize_save_error_data(error_data_path) - online_config = OnlineConfig(is_online, nfs_path, host, port, rank_list, tls_path) - run_ut_config = RunUTConfig(forward_content, backward_content, result_csv_path, details_csv_path, save_error_data, - args.result_csv_path, real_data_path, set(white_list), set(black_list), error_data_path, - online_config) + online_config = checker_config.get_online_config() + checked_online_config(online_config) + config_params = { + 'forward_content': forward_content, + 'backward_content': backward_content, + 'result_csv_path': result_csv_path, + 'details_csv_path': details_csv_path, + 'save_error_data': save_error_data, + 'is_continue_run_ut': args.result_csv_path, + 'real_data_path': real_data_path, + 'error_data_path': error_data_path + } + run_ut_config = checker_config.get_run_ut_config(**config_params) run_ut(run_ut_config) if __name__ == '__main__': + seed_all() _run_ut() logger.info("UT task completed.") diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py index a0b256cbd6702245b3da2eeab9e190b6af407fb3..dc0174212e3f8f8cf70fa1701aadc664138dbcdf 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py @@ -16,6 +16,7 @@ # limitations under the License. 
import os +from collections import namedtuple import re import torch @@ -23,8 +24,10 @@ try: import torch_npu except ImportError: current_device = "cuda" + from torch.cuda.amp import autocast else: current_device = "npu" + from torch_npu.npu.amp import autocast from msprobe.core.common.const import FileCheckConst, Const, CompareConst from msprobe.core.common.file_utils import FileChecker @@ -47,11 +50,17 @@ PRECISION_MAPPING = { } + +CpuParams = namedtuple("CpuParams", ["cpu_args", "cpu_kwargs", "autocast_dtype", "is_autocast"]) +ExecParams = namedtuple("ExecParams", ["api_type", "api_name", "device", "args", "kwargs", + "is_autocast", "autocast_dtype"]) + + class BackwardMessage: MULTIPLE_BACKWARD_MESSAGE = "Multiple backward is not supported." UNSUPPORT_BACKWARD_MESSAGE = "function with out=... arguments don't support automatic differentiation, " \ "skip backward." - NO_BACKWARD_RESULT_MESSAGE = "function backward result is None, skip backward." + NO_BACKWARD_RESULT_MESSAGE = "This API does not have backward input data, skip backward." + UNSUPPORT_API_MESSAGE = "This API does not support backward ut, skip backward." 
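`exec_api` now receives a single `ExecParams` namedtuple instead of five positional arguments, and `generate_cpu_params` returns a `CpuParams`. A minimal illustration of the pattern — only the field names come from the diff; the `describe` consumer is invented for the example:

```python
from collections import namedtuple

# Same fields as the ExecParams introduced in run_ut_utils.py.
ExecParams = namedtuple("ExecParams", ["api_type", "api_name", "device", "args", "kwargs",
                                       "is_autocast", "autocast_dtype"])


def describe(params):
    # Consumers unpack by field name, so adding a field later does not
    # silently shift positional arguments at every call site.
    return f"{params.api_type}.{params.api_name} on {params.device}"


cpu_exec_params = ExecParams("Functional", "relu", "cpu", (), {}, False, None)
```

This is why the diff can thread `is_autocast`/`autocast_dtype` through `exec_api` without touching every caller's argument order.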
class UtDataInfo: @@ -91,7 +100,15 @@ def get_validated_details_csv_path(validated_result_csv_path): return validated_details_csv_path -def exec_api(api_type, api_name, device, args, kwargs): +def exec_api(exec_params): + api_type = exec_params.api_type + api_name = exec_params.api_name + device = exec_params.device + args = exec_params.args + kwargs = exec_params.kwargs + is_autocast = exec_params.is_autocast + autocast_dtype = exec_params.autocast_dtype + if api_type == "Functional": torch_api = FunctionalOPTemplate(api_name, str, False) if api_type == "Tensor": @@ -102,7 +119,11 @@ def exec_api(api_type, api_name, device, args, kwargs): torch_api = AtenOPTemplate(api_name, None, False) if api_type == "NPU": torch_api = NpuOPTemplate(api_name, None, False, device) - out = torch_api.forward(*args, **kwargs) + if is_autocast: + with autocast(dtype=autocast_dtype): + out = torch_api.forward(*args, **kwargs) + else: + out = torch_api.forward(*args, **kwargs) return out @@ -196,21 +217,38 @@ def generate_cpu_params(input_args, input_kwargs, need_backward, api_name): return set() raise_dtype = None + autocast_dtype = None + is_autocast = False need_raise_dtypes = recursive_find_dtypes(input_args) need_raise_dtypes.update(recursive_find_dtypes(input_kwargs, check_kwargs=True)) if len(need_raise_dtypes) == 1: - raise_dtype = PRECISION_MAPPING.get(need_raise_dtypes.pop(), torch.float32) + origin_dtype = need_raise_dtypes.pop() + raise_dtype = PRECISION_MAPPING.get(origin_dtype, torch.float32) + autocast_dtype = origin_dtype + elif len(need_raise_dtypes) >= 2: raise_dtype = torch.float32 + need_raise_dtypes.discard(torch.float32) + autocast_dtype = need_raise_dtypes.pop() + is_autocast = True raise_dtype = None if api_name in not_raise_dtype_set else raise_dtype is_detach = api_name not in not_detach_set cpu_args = recursive_arg_to_cpu(input_args, is_detach, raise_dtype=raise_dtype) cpu_kwargs = {key: recursive_arg_to_cpu(value, key != "out" and is_detach, 
raise_dtype=raise_dtype) for key, value in input_kwargs.items()} - return cpu_args, cpu_kwargs + cpu_params = CpuParams(cpu_args, cpu_kwargs, autocast_dtype, is_autocast) + return cpu_params def record_skip_info(api_full_name, compare, compare_alg_results): result_info = (api_full_name, CompareConst.SKIP, CompareConst.SKIP, [compare_alg_results], None, 0) compare.record_results(result_info) + + +def is_unsupported_api(api_name, is_overflow_check=False): + split_name = api_name.split(Const.SEP)[0] + flag = (split_name == Const.DISTRIBUTED) or (is_overflow_check and split_name == Const.NPU) + if flag: + logger.info(f"{split_name} api is not supported for run ut. SKIP.") + return flag diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/attl.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/attl.py index 80fb901a7230ebd8149132eaeea428a19614b794..f31c29c6bb6fa8a863b83bf09d15aba09645436f 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/attl.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/attl.py @@ -16,7 +16,6 @@ import glob import os.path import time -import re from multiprocessing import Queue from typing import Optional, Union, Dict, Any from dataclasses import dataclass @@ -54,7 +53,6 @@ class ATTL: self.dequeue_list = [] self.message_end = False self.kill_progress = False - self.check_attl_config() self.nfs_path = None if self.session_config.nfs_path: self.nfs_path = self.session_config.nfs_path @@ -72,18 +70,6 @@ class ATTL: self.session_config.tls_path) self.socket_manager.start() - def check_attl_config(self): - if self.session_config.nfs_path: - if os.path.exists(self.session_config.nfs_path): - return - else: - raise Exception(f"nfs path {self.session_config.nfs_path} doesn't exists.") - ipv4_pattern = "([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])(\.([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){3}$" - if not 
re.match(ipv4_pattern, self.session_config.connect_ip): - raise Exception(f"host {self.session_config.connect_ip} is invalid.") - if not (0 < self.session_config.connect_port <= 65535): - raise Exception(f"port {self.session_config.connect_port} is invalid.") - def stop_serve(self): if isinstance(self.socket_manager, TCPServer): self.socket_manager.stop() @@ -114,21 +100,21 @@ class ATTL: self.socket_manager.add_to_sending_queue(data, rank=rank, step=step) def recv(self, timeout_ms=0) -> Optional[BufferType]: - buffer = None - while buffer is None: + buffer = '' + while not buffer: if timeout_ms > 0: time.sleep(timeout_ms / 1000.0) - if buffer is None and not self.data_queue.empty(): + if not buffer and not self.data_queue.empty(): buffer = self.data_queue.get() break - if buffer is None and timeout_ms > 0: # timeout is the only case we give up and return None + if not buffer and timeout_ms > 0: # timeout is the only case we give up and return None break if self.message_end and self.data_queue.empty(): buffer = b"KILL_CONFIRM" self.kill_progress = True break time.sleep(0.1) # waiting outside the lock before next attempt - if buffer is None: + if not buffer: # this is a result of a timeout self.logger.info(f"RECEIVE API DATA TIMED OUT") else: @@ -145,7 +131,7 @@ class ATTL: except Exception as e: self.logger.warning("there is something error. please check it. 
%s", e) if isinstance(buffer, bytes): - return None + return '' if isinstance(buffer, str): return buffer diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/client.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/client.py index 8fc8877d9e6f56a127bd73136cfff9503c3228bd..fbb087deec73bb6e77c0d7581128c74e2d9be9fa 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/client.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/client.py @@ -27,8 +27,8 @@ from twisted.internet import reactor, protocol, endpoints from twisted.protocols.basic import FileSender from msprobe.pytorch.common.utils import logger -from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.utils import struct_unpack_mode as unpack_mode, \ - str_to_bytes_order as bytes_order +from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.utils import STRUCT_UNPACK_MODE as unpack_mode, \ + STR_TO_BYTES_ORDER as bytes_order MAX_SENDING_QUEUE_SIZE = 20 @@ -84,15 +84,6 @@ class TCPClient: def run_reactor(): reactor.run(installSignalHandlers=False) - def check_tls_path(self): - client_key = os.path.join(self.tls_path, "client.key") - client_crt = os.path.join(self.tls_path, "client.crt") - if not os.path.exists(client_key): - raise Exception(f"client_key: {client_key} is not exists.") - if not os.path.exists(client_crt): - raise Exception(f"client_crt: {client_crt} is not exists.") - return client_key, client_crt - def start(self): def conn_callback(cur_protocol): if cur_protocol.transport and cur_protocol.transport.getPeer().host == self.host: @@ -114,7 +105,8 @@ class TCPClient: self.factory.protocol = cur_protocol if self.tls_path: from twisted.internet import ssl - client_key, client_crt = self.check_tls_path() + client_key = os.path.join(self.tls_path, "client.key") + client_crt = os.path.join(self.tls_path, "client.crt") 
client_context_factory = ssl.DefaultOpenSSLContextFactory(client_key, client_crt) endpoint = endpoints.SSL4ClientEndpoint(reactor, self.host, self.port, client_context_factory) else: diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/device_dispatch.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/device_dispatch.py index 56b3f6ff3e9bc5196e1d11096ff3d9072307e0ba..8777af9cc37ad03dacfa82bf29854fb1c1babe95 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/device_dispatch.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/device_dispatch.py @@ -24,7 +24,7 @@ from msprobe.core.common.const import Const, CompareConst from msprobe.pytorch.api_accuracy_checker.compare.api_precision_compare import online_api_precision_compare from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import DETAIL_TEST_ROWS, thousandth_standard_api, \ binary_standard_api, absolute_standard_api -from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import UtDataInfo, exec_api +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import UtDataInfo, exec_api, ExecParams from msprobe.pytorch.common.log import logger from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.attl import move2target_device from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import generate_cpu_params @@ -92,8 +92,10 @@ def online_precision_compare(api_data, device, common_config, api_precision_csv_ try: # NPU vs CPU - cpu_args, cpu_kwargs = generate_cpu_params(npu_args, npu_kwargs, False, api_name) - cpu_out = exec_api(api_type, api_name, Const.CPU_LOWERCASE, cpu_args, cpu_kwargs) + cpu_params = generate_cpu_params(npu_args, npu_kwargs, False, api_name) + cpu_args, cpu_kwargs = cpu_params.cpu_args, cpu_params.cpu_kwargs + cpu_exec_params = ExecParams(api_type, api_name, Const.CPU_LOWERCASE, cpu_args, cpu_kwargs, 
False, None) + cpu_out = exec_api(cpu_exec_params) npu_data_info = UtDataInfo(None, None, npu_out, cpu_out, None, [], None, rank=api_data.rank) npu_detail = compare.compare_output(api_full_name, npu_data_info, True) npu_data = pd.DataFrame(npu_detail, columns=DETAIL_TEST_ROWS[-1]) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/server.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/server.py index de411e2b75157085254b7de4556d8071b3fcef22..411e36d4cb3014b75a46d58ebec99b7e8b7c7c44 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/server.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/server.py @@ -24,7 +24,7 @@ from twisted.internet import reactor, protocol, endpoints from msprobe.pytorch.common.utils import logger from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.utils import cipher_list, \ - struct_unpack_mode as unpack_mode, str_to_bytes_order as bytes_order + STRUCT_UNPACK_MODE as unpack_mode, STR_TO_BYTES_ORDER as bytes_order class TCPServer: @@ -40,22 +40,14 @@ class TCPServer: def run_reactor(): reactor.run(installSignalHandlers=False) - def check_tls_path(self): - server_key = os.path.join(self.tls_path, "server.key") - server_crt = os.path.join(self.tls_path, "server.crt") - if not os.path.exists(server_key): - raise Exception(f"server_key: {server_key} is not exists.") - if not os.path.exists(server_crt): - raise Exception(f"server_crt: {server_crt} is not exists.") - return server_key, server_crt - def start(self): self.factory.protocol = self.build_protocol if self.tls_path: from OpenSSL import SSL from twisted.internet import ssl - server_key, server_crt = self.check_tls_path() + server_key = os.path.join(self.tls_path, "server.key") + server_crt = os.path.join(self.tls_path, "server.crt") server_context_factory = ssl.DefaultOpenSSLContextFactory(server_key, server_crt, 
SSL.TLSv1_2_METHOD) server_context_ = server_context_factory.getContext() server_context_.set_cipher_list(cipher_list) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/utils.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/utils.py index 5ca1a965ea137747718adb7140fabbf4acccb9e6..aace2f13cc0eeb34a51c03907c9a87a6479617c4 100644 --- a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/utils.py @@ -40,5 +40,5 @@ cipher_list = ":".join( "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256"] ).encode() -struct_unpack_mode = "!Q" -str_to_bytes_order = "big" +STRUCT_UNPACK_MODE = "!Q" +STR_TO_BYTES_ORDER = "big" diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/apply_adam.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/apply_adam.py new file mode 100644 index 0000000000000000000000000000000000000000..408929685e0c5de9984f06674df9ad6a76cd1281 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/apply_adam.py @@ -0,0 +1,215 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
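The `apply_adam` bench function added in the next hunk builds the first moment as `m_t = m + (beta1 - 1) * (m - grad)`, which is algebraically the textbook EMA `m_t = beta1 * m + (1 - beta1) * grad`. A scalar check of that identity with plain floats (no torch; the values are arbitrary):

```python
def m_update_diff_form(m, beta1, grad):
    # Form used by _output_m_compute: m + (beta1 - 1) * (m - grad)
    return m + (beta1 - 1.0) * (m - grad)


def m_update_ema_form(m, beta1, grad):
    # Textbook exponential-moving-average form of the same update.
    return beta1 * m + (1.0 - beta1) * grad


a = m_update_diff_form(0.5, 0.9, 2.0)
b = m_update_ema_form(0.5, 0.9, 2.0)
```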
+ +from collections import namedtuple +import torch + + +VarParams = namedtuple('VarParams', ['var', 'lr_t', 'm_t', 'beta1_broad', 'grad', 'epsilon', 'v_t']) + + +def _output_m_compute(m, beta1_broad, grad): + """ + _output_m_compute + do compute m_t = m + (beta1 - 1) * (m - grad) + """ + input_dtype = m.dtype + + sneg_one = torch.ones((1), dtype=input_dtype) * -1 + sneg_one = sneg_one.to(beta1_broad.device) + + # `formula; beta1 -1` + vsub_beta1_1 = torch.add(beta1_broad, sneg_one) + + # `formula; m - grad` + vsub_m_grad = torch.sub(m, grad) + + # `formula; (beta1 - 1) * (m - grad)` + vmul_m = torch.mul(vsub_beta1_1, vsub_m_grad) + + # `formula; m_t = m + (beta1 - 1) * (m - grad)` + m_t = torch.add(m, vmul_m) + + return m_t + + +def _output_v_compute(v, beta2, grad): + """ + _output_v_compute + do compute v_t = v + (1 - beta2)*(grad*grad -v) + """ + input_dtype = v.dtype + + sneg_one = torch.ones((1), dtype=input_dtype) * -1 + + # `formula; broadcast beta2 to vector` + beta2_tensor = torch.tensor(beta2, dtype=input_dtype) + beta2_broad = beta2_tensor.expand_as(v) + + # `formula; beta2 - 1` + vsub_beta2_1 = torch.add(beta2_broad, sneg_one) + vsub_beta2_1 = vsub_beta2_1.to(v.device) + + # `formula; grad * grad` + vmul_grad_grad = torch.mul(grad, grad) + + # `formula; (v - grad*grad)` + vsub_v_grad = torch.sub(v, vmul_grad_grad) + + # `formula; (beta2 -1) * (v - grad * grad)` + vmul_grad = torch.mul(vsub_beta2_1, vsub_v_grad) + + # `formula; v_t = v + (beta2 - 1) * (v - grad * grad)` + v_t = torch.add(v, vmul_grad) + + return v_t + + +def _inner_lr_compute(lr, beta2_power, beta1_power, compute_shape_tensor): + """ + _inner_lr_compute + `formula; lr_t = learning_rate * (sqrt(1-beta2_power)) / (1 - beta1_power)` + """ + + input_dtype = compute_shape_tensor.dtype + + s_one = torch.ones((1), dtype=input_dtype) + + s_neg_one = torch.ones((1), dtype=input_dtype) * -1 + + # `formula; (1 - beta2_power)` + v_neg_beta2_power = torch.mul(beta2_power, s_neg_one) + 
v_add_beta2_power = torch.add(v_neg_beta2_power, s_one) + + # `formula; sqrt(1 - beta2_power)` + v_sqrt_beta2_power = torch.sqrt(v_add_beta2_power) + + # `formula; (1 - beta1_power)` + v_neg_beta1_power = torch.mul(beta1_power, s_neg_one) + v_add_beta1_power = torch.add(v_neg_beta1_power, s_one) + + # `formula; learning_rate * (sqrt(1-beta2_power))` + res = torch.mul(lr, v_sqrt_beta2_power) + + # `formula; learning_rate*(sqrt(1-beta2_power))/(1-beta1_power)` + res = torch.div(res, v_add_beta1_power) + return res.expand_as(compute_shape_tensor) + + +def _inner_eps_add_sqrt_vt_compute(epsilon, v_t): + """ + (epsilon + sqrt(v_t)) + """ + # `formula; sqrt(v_t)` + sqrt_vt = torch.sqrt(v_t) + + # `formula; broadcast epsilon to vector` + input_dtype = v_t.dtype + epsilon_tensor = torch.tensor(epsilon, dtype=input_dtype) + epsilon_broad = epsilon_tensor.expand_as(v_t) + epsilon_broad = epsilon_broad.to(sqrt_vt.device) + + # `formula; epsilon + sqrt(v_t)` + v_add_sqrt_v = torch.add(sqrt_vt, epsilon_broad) + + return v_add_sqrt_v + + +def _output_var_t_compute_use_nesterov(varparams): + """ + _output_var_t_compute_use_nesterov + `formula; var_t = var - lr_t * (m_t * beta1 + (1 - beta1) * grad) / (epsilon + sqrt(v_t))` + """ + var = varparams.var + lr_t = varparams.lr_t + m_t = varparams.m_t + beta1_broad = varparams.beta1_broad + grad = varparams.grad + epsilon = varparams.epsilon + v_t = varparams.v_t + + input_dtype = var.dtype + + s_one = torch.ones((1), dtype=input_dtype) + + s_neg_one = torch.ones((1), dtype=input_dtype) * -1 + + # `formula; m_t * beta1` + v_muls_mt_beta1 = torch.mul(m_t, beta1_broad) + + # `formula; 1 - beta1` + v_neg_beta1 = torch.mul(beta1_broad, s_neg_one) + vsub_1_beta1 = torch.add(v_neg_beta1, s_one) + + # `formula; (1-beta1)* grad` + v_mul_grad = torch.mul(vsub_1_beta1, grad) + + # `formula; (m_t*beta1 + (1 - beta1)*grad)` + v_div_left = 
torch.add(v_muls_mt_beta1, v_mul_grad) + + # `formula; lr_t * (m_t*beta1 + (1 - beta1) * grad)` + # broadcast lr_t to vector + + lrt_broad = lr_t.expand_as(var) + v_mul_left = torch.mul(lrt_broad, v_div_left) + + # `formula; (epsilon + sqrt(v_t))` + v_add_sqrt_v = _inner_eps_add_sqrt_vt_compute(epsilon, v_t) + + # `formula; lr_t * (m_t*beta1 + (1-beta1)*grad / (epsilon + sqrt(v_t))` + v_div_res = torch.div(v_mul_left, v_add_sqrt_v) + + # `formula; var - lr_t * (m_t*beta1 + (1-beta1)*grad) / (epsilon + sqrt(v_t))` + v_t = torch.sub(var, v_div_res) + + return v_t + + +def _output_var_t_compute(var, lr_t, m_t, epsilon, v_t): + """ + _output_var_t_compute + `var_t = var - lr_t * m_t / (epsilon + sqrt(v_t))` + """ + # `formula; lr_t * m_t` + lr_t = lr_t.to(m_t.device) + v_mul_left = torch.mul(lr_t, m_t) + + # `formula; (epsilon + sqrt(v_t))` + v_add_sqrt_v = _inner_eps_add_sqrt_vt_compute(epsilon, v_t) + + # `formula; lr_t * m_t /(epsilon + sqrt(v_t))` + v_div_res = torch.div(v_mul_left, v_add_sqrt_v) + + # `formula; var - lr_t * m_t / (epsilon + sqrt(v_t))` + v_t = torch.sub(var, v_div_res) + + return v_t + + +def npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out): + var, m, v = out + input_dtype = m.dtype + beta1_tensor = torch.tensor(beta1, dtype=input_dtype).to(m.device) + beta1_broad = beta1_tensor.expand_as(m) + m_t = _output_m_compute(m, beta1_broad, grad) + v_t = _output_v_compute(v, beta2, grad) + lr_t = _inner_lr_compute(lr, beta2_power, beta1_power, grad) + if use_nesterov: + var_params = VarParams(var, lr_t, m_t, beta1_broad, grad, epsilon, v_t) + var_t = _output_var_t_compute_use_nesterov(var_params) + else: + var_t = _output_var_t_compute(var, lr_t, m_t, epsilon, v_t) + return var_t, m_t, v_t diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/confusion_transpose.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/confusion_transpose.py index 
6bda1fc6e8e5f141c05824260b20ad045eccaa8e..81fb70e8a4765d252a375e53c7bebdfddfac7e9e 100644 --- a/debug/accuracy_tools/msprobe/pytorch/bench_functions/confusion_transpose.py +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/confusion_transpose.py @@ -22,7 +22,11 @@ def npu_confusion_transpose(data, perm, shape, transpose_first): def npu_confusion_transpose_backward(grad, perm, shape, transpose_first): - shape_cal = shape if transpose_first else [shape[perm_dim] for perm_dim in perm] + try: + shape_cal = shape if transpose_first else [shape[perm_dim] for perm_dim in perm] + except IndexError as e: + raise IndexError("npu_confusion_transpose_backward: Invalid perm index for shape") from e + perm_cal = [0] * len(perm) for i, perm_dim in enumerate(perm): perm_cal[perm_dim] = i diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/group_norm_silu.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/group_norm_silu.py new file mode 100644 index 0000000000000000000000000000000000000000..c8757083c56b78cabbb83ec5d2b7b80f0edd8421 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/group_norm_silu.py @@ -0,0 +1,27 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
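`npu_group_norm_silu` in the next hunk runs `native_group_norm` and then applies SiLU to the normalized output. SiLU itself is just `x * sigmoid(x)`; a dependency-free scalar reference, handy for sanity-checking the fused result (the function name here is illustrative, not from the repo):

```python
import math


def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x * (1.0 / (1.0 + math.exp(-x)))
```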
+ +import torch + + +def npu_group_norm_silu(x, gama, beta, group, eps): + if len(x.shape) != 4: + raise ValueError("x shape should be (N, C, H, W)") + res = torch.ops.aten.native_group_norm(x, gama, beta, x.shape[0], x.shape[1], x.shape[2] * x.shape[3], group, eps) + res = list(res) + if not res: + raise ValueError("run native_group_norm failed") + res[0] = torch.nn.functional.silu(res[0]) + return res diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/matmul_backward.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/matmul_backward.py index 5e10da6e609dd360567ea60141022682ed59879c..bceb0a0e1b1921455e3a10ff8d3539864c6f0c2c 100644 --- a/debug/accuracy_tools/msprobe/pytorch/bench_functions/matmul_backward.py +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/matmul_backward.py @@ -17,6 +17,9 @@ import torch def matmul_backward(grad, self, other, mask): + if len(mask) < 2: + raise RuntimeError("mask must contain at least 2 elements") + grad_self, grad_other = None, None dim_self = self.dim() dim_other = other.dim() @@ -24,6 +27,7 @@ size_grad = list(grad.size()) size_self = list(self.size()) size_other = list(other.size()) + if dim_self == 1 and dim_other == 1: grad_self = other.mul(grad) if mask[0] else grad_self grad_other = self.mul(grad) if mask[1] else grad_other @@ -34,19 +38,27 @@ grad_self = grad.unsqueeze(0).mm(other.transpose(-1, -2)).squeeze_(0) if mask[0] else grad_self grad_other = self.unsqueeze(1).mm(grad.unsqueeze(0)) if mask[1] else grad_other elif dim_self >= 3 and (dim_other == 1 or dim_other == 2): + if len(size_grad) < 1: + raise RuntimeError("size_grad must have at least 1 element") + view_size = 1 if dim_other == 1 else size_grad[-1] unfolded_grad = (grad.unsqueeze(-1) if dim_other == 1 else grad).contiguous().view(-1, view_size) if mask[0]: grad_self = unfolded_grad.mm(other.unsqueeze(0) if dim_other == 1 else other.transpose(-1, -2)) \ 
.view(size_self) if mask[1]: + if len(size_self) < 1: + raise RuntimeError("size_self must have at least 1 element") + unfolded_self = self.contiguous().view([-1, size_self[-1]]) grad_other = unfolded_self.transpose(-1, -2).mm(unfolded_grad).view(size_other) elif (dim_self == 1 or dim_self == 2) and dim_other >= 3: + if len(size_grad) < 2: + raise RuntimeError("size_grad must have at least 2 elements") + view_size = 1 if dim_self == 1 else size_grad[-2] unfolded_grad_t = grad.view([-1, view_size]) \ if dim_self == 1 else grad.transpose(-1, -2).contiguous().view([-1, view_size]) if mask[0]: + if len(size_other) < 2: + raise RuntimeError("size_other must have at least 2 elements") + # create a 2D-matrix from other unfolded_other_t = \ other.transpose(-1, -2).contiguous().view([-1, size_other[-2]]).transpose(-1, -2) diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/mish.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/mish.py new file mode 100644 index 0000000000000000000000000000000000000000..f395a30ee60db57ab9a298a637c8318ffce7aec4 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/mish.py @@ -0,0 +1,21 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
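`npu_mish` in the next hunk wraps `torch.nn.Mish`. The activation is defined as `x * tanh(softplus(x))`; a pure-Python scalar reference (no overflow guard for large positive `x`, which torch handles internally):

```python
import math


def mish(x):
    # mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)
    softplus = math.log1p(math.exp(x))
    return x * math.tanh(softplus)
```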
+ +import torch + + +def npu_mish(x): + mish = torch.nn.Mish() + return mish(x) diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/moe_gating_top_k_softmax.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/moe_gating_top_k_softmax.py new file mode 100644 index 0000000000000000000000000000000000000000..be15935ce9c9f77bc0a8447902f7f4a7b536a7fb --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/moe_gating_top_k_softmax.py @@ -0,0 +1,44 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
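`softmax_func` in the `moe_gating_top_k_softmax` module that follows subtracts the per-row maximum before exponentiating — the standard trick to keep `exp` from overflowing without changing the result. The same computation on a plain Python list:

```python
import math


def stable_softmax(xs):
    # Subtracting the max leaves the result unchanged mathematically
    # but keeps every exp() argument <= 0, so nothing overflows.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Without the subtraction, `math.exp(1000.0)` below would raise `OverflowError`.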
+ +import torch +import numpy as np + + +def softmax_func(x, axis=None): + x = x.float() + x_max = x.max(dim=axis, keepdims=True).values + x_sub = x - x_max + y = torch.exp(x_sub) + x_sum = y.sum(dim=axis, keepdims=True) + ans = 0 if (x_sum == 0).any() else y / x_sum + return ans + + +def npu_moe_gating_top_k_softmax(x, finished_optional, k): + input_dtype = x.dtype + num_expert = x.shape[-1] + softmax = softmax_func(x, -1) + softmax = softmax.to(input_dtype) + expert_idx = torch.argsort(-softmax, dim=-1, stable=True) + expert_idx = expert_idx[:, :k] + y = torch.gather(softmax, index=expert_idx, dim=-1) + if finished_optional is not None: + finished_optional = finished_optional.view(finished_optional.shape[0], 1) + finished_optional = finished_optional.expand(-1, k) + expert_idx = torch.where(finished_optional, num_expert, expert_idx) + row_idx = torch.arange(y.shape[0] * y.shape[1]).reshape(y.shape[1], y.shape[0]).t() + + return y, expert_idx, row_idx diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/npu_fusion_attention.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/npu_fusion_attention.py index 518cee09215aee061d4efaa2914fb740ff2175fa..58a585f5a05f4b2d533d150db3a9fbfd907f5a07 100644 --- a/debug/accuracy_tools/msprobe/pytorch/bench_functions/npu_fusion_attention.py +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/npu_fusion_attention.py @@ -30,6 +30,7 @@ numels=0, prefix=None, sparse_mode=0, gen_mask_parallel=True, sync=False """ +from collections import namedtuple import torch import numpy as np from einops import rearrange @@ -50,8 +51,16 @@ else: from msprobe.pytorch.common.utils import logger from msprobe.core.common.const import Const, CompareConst -gtype = torch.float64 # arm host必须选择float64,x86环境选择float32即可,64也行。arm计算很慢,s=8k的场景建议使用x86 -softmax_build_mode = "QKV" # "MAX_SUM" +GTYPE = torch.float64 # arm host必须选择float64,x86环境选择float32即可,64也行。arm计算很慢,s=8k的场景建议使用x86 +SOFTMAX_BUILD_MODE = "QKV" # "MAX_SUM" + + 
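Reviewer note: the gating logic added in `moe_gating_top_k_softmax.py` above combines a max-subtracted softmax (the overflow guard in `softmax_func`) with a stable descending sort to pick the top-k experts. A one-token sketch in plain Python (illustrative only, names are not from the patch):

```python
import math


def stable_softmax(xs):
    # Same overflow guard as softmax_func: subtract the max before
    # exponentiating so exp() stays bounded.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gating(logits, k):
    # One-token sketch of npu_moe_gating_top_k_softmax: softmax over the
    # expert logits, then keep the k largest weights and their expert
    # indices; ties break on lower index, mirroring argsort(stable=True).
    probs = stable_softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    top = order[:k]
    return [probs[i] for i in top], top


weights, experts = top_k_gating([1.0, 3.0, 2.0, 0.5], k=2)
```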
+FaForwardParams = namedtuple("FaForwardParams", + ["q", "k", "v", "drop_mask", "atten_mask", "pse", "scale", "keep_prob"]) +FaBackwardParams = namedtuple("FaBackwardParams", + ["dx", "q", "k", "v", "softmax_res", "drop_mask", "pse", "scale", "keep_prob"]) +RebuildSoftmaxParams = namedtuple("RebuildSoftmaxParams", + ["q", "k", "atten_mask", "pse", "scale", "softmax_max", "softmax_sum"]) def softmax_forward(x): @@ -99,7 +108,15 @@ def calculate_qk(q, k, atten_mask, pse, scale): return qk -def fusion_attention_forward(q, k, v, drop_mask, atten_mask, pse, scale, keep_prob): +def fusion_attention_forward(forward_params): + q = forward_params.q + k = forward_params.k + v = forward_params.v + drop_mask = forward_params.drop_mask + atten_mask = forward_params.atten_mask + pse = forward_params.pse + scale = forward_params.scale + keep_prob = forward_params.keep_prob qk = calculate_qk(q, k, atten_mask, pse, scale) softmax_res, softmax_max, softmax_sum = softmax_forward(qk) if drop_mask is None or len(drop_mask.shape) == 0: @@ -110,7 +127,16 @@ def fusion_attention_forward(q, k, v, drop_mask, atten_mask, pse, scale, keep_pr return y, softmax_max, softmax_sum -def fusion_attention_backward(dx, q, k, v, softmax_res, drop_mask, pse, scale, keep_prob): +def fusion_attention_backward(backward_params): + dx = backward_params.dx + q = backward_params.q + k = backward_params.k + v = backward_params.v + softmax_res = backward_params.softmax_res + drop_mask = backward_params.drop_mask + pse = backward_params.pse + scale = backward_params.scale + keep_prob = backward_params.keep_prob dp = torch.matmul(dx, v.permute(0, 1, 3, 2)) if drop_mask is None or len(drop_mask.shape) == 0: drop_res = softmax_res.permute(0, 1, 3, 2) @@ -166,6 +192,18 @@ def parse_bsnd_args(query, key, head_num, input_layout): def convert_from_bnsd(_input, input_layout): + """ + transform qkv from bnsd to input_layout. 
+ B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,N,S,D) + input_layout (str): "BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + """ if input_layout == "BSH": # (B,N,S,D)=>(B,S,N*D) out = rearrange(_input, 'b n s d -> b s (n d)').contiguous() @@ -183,7 +221,19 @@ def convert_from_bnsd(_input, input_layout): def convert_to_bnsd(_input, n, input_layout): - # 默认"BNSD"无需处理 + """ + transform qkv from input_layout to bnsd. + B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + n (int): num_heads + input_layout (str):"BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,N,S,D) + """ if input_layout == "BSH": # (B,S,N*D)=>(B,N,S,D) out = rearrange(_input, 'b s (n d) -> b n s d', n=n) @@ -199,7 +249,68 @@ def convert_to_bnsd(_input, n, input_layout): out = _input if out.dim() != 4: raise ValueError(f"convert qkv format failed with input_layout {input_layout}.") - return out.to(gtype) + return out.to(GTYPE) + + +def convert_from_bsnd(_input, input_layout): + """ + transform qkv from bsnd to input_layout. 
+ B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,S,N,D) + input_layout (str): "BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + """ + if input_layout == "BSH": + # (B,S,N,D)=>(B,S,N*D) + out = rearrange(_input, 'b s n d -> b s (n d)').contiguous() + elif input_layout == "SBH": + # (B,S,N,D)=>(S,B,N*D) + out = rearrange(_input, 'b s n d -> s b (n d)').contiguous() + elif input_layout == "BNSD": + # (B,S,N,D)=>(B,N,S,D) + out = rearrange(_input, 'b s n d -> b n s d').contiguous() + elif input_layout == "TND": + raise ValueError(f"input_layout {input_layout} does not supported for now.") + else: + out = _input + return out + + +def convert_to_bsnd(_input, n, input_layout): + """ + transform qkv from input_layout to bsnd. + B: batch_size + S: sequence_length + N: num_heads + D: head_dim + Args: + _input (torch.Tensor): tensor of shape (B,N,S,D) or (B,S,N,D) or (S,B,H) or (B,S,H) + n (int): num_heads + input_layout (str):"BSH" or "SBH" or "BSND" or "BNSD" or "TND" + Returns: + tensor of shape (B,S,N,D) + """ + if input_layout == "BSH": + # (B,S,N*D)=>(B,S,N,D) + out = rearrange(_input, 'b s (n d) -> b s n d', n=n) + elif input_layout == "SBH": + # (S,B,N*D)=>(B,S,N,D) + out = rearrange(_input, 's b (n d) -> b s n d', n=n) + elif input_layout == "BNSD": + # (B,N,S,D)=>(B,S,N,D) + out = rearrange(_input, 'b n s d -> b s n d', n=n) + elif input_layout == "TND": + raise ValueError(f"input_layout {input_layout} does not supported for now.") + else: + out = _input + if out.dim() != 4: + raise ValueError(f"convert qkv format failed with input_layout {input_layout}.") + return out def generate_atten_mask(*args): @@ -283,11 +394,18 @@ def rebuid_softmax_by_qkv(q, k, atten_mask, pse, scale): return softmax_res -def rebuild_softmax_by_max_sum(q, k, atten_mask, pse, scale, softmax_max, softmax_sum): +def 
rebuild_softmax_by_max_sum(softmax_params): """ attention = softmax(QK^T/sqrt(d))V softmax(x_i) = e^(x_i - x_max_i) / x_sum_i) """ + q = softmax_params.q + k = softmax_params.k + atten_mask = softmax_params.atten_mask + pse = softmax_params.pse + scale = softmax_params.scale + softmax_max = softmax_params.softmax_max + softmax_sum = softmax_params.softmax_sum logger.info("Using softmax_max and softmax_sum to rebuild original softmax") qk = calculate_qk(q, k, atten_mask, pse, scale) if softmax_max.shape[-1] == 0: @@ -319,6 +437,10 @@ def get_input_layout(*args, **kwargs): def npu_fusion_attention_forward_patch(*args, **kwargs): + + if len(args) < 2: + raise RuntimeError("npu_fusion_attention_forward_patch: length of args should greater than or equal to 2.") + # query, key, value, head_num, input_layout head_num = get_head_num(*args, **kwargs) input_layout = get_input_layout(*args, **kwargs) @@ -413,10 +535,8 @@ def npu_fusion_attention(*args, **kwargs): key = convert_to_bnsd(key, n2, input_layout) value = convert_to_bnsd(value, n2, input_layout) k_new, v_new = generate_kv(key, value, n1, n2) - out_golden, softmax_max, softmax_sum = fusion_attention_forward(q=query, k=k_new, v=v_new, - drop_mask=None, atten_mask=atten_mask, - pse=pse, scale=scale, - keep_prob=keep_prob) + forward_params = FaForwardParams(query, k_new, v_new, None, atten_mask, pse, scale, keep_prob) + out_golden, softmax_max, softmax_sum = fusion_attention_forward(forward_params) if out_golden.dim() == 5: out_golden = out_golden.reshape(out_golden.size(0), out_golden.size(1) * out_golden.size(2), out_golden.size(3), out_golden.size(4)) @@ -454,12 +574,13 @@ def npu_fusion_attention_grad(*args, **kwargs): value = convert_to_bnsd(value, n2, input_layout) k_new, v_new = generate_kv(key, value, n1, n2) - if softmax_build_mode == "QKV": + if SOFTMAX_BUILD_MODE == "QKV": softmax_res = rebuid_softmax_by_qkv(query, k_new, atten_mask, pse, scale_value) else: - softmax_res = rebuild_softmax_by_max_sum(query, 
k_new, atten_mask, pse, scale_value, softmax_max, softmax_sum) - - dq, dk, dv = fusion_attention_backward(dx, query, k_new, v_new, softmax_res, None, pse, scale_value, keep_prob) + softmax_params = RebuildSoftmaxParams(query, k_new, atten_mask, pse, scale_value, softmax_max, softmax_sum) + softmax_res = rebuild_softmax_by_max_sum(softmax_params) + backward_params = FaBackwardParams(dx, query, k_new, v_new, softmax_res, None, pse, scale_value, keep_prob) + dq, dk, dv = fusion_attention_backward(backward_params) # N不等长适配by cdy if not (n1 == n2): @@ -531,8 +652,13 @@ def gpu_fusion_attention(*args, **kwargs): else: alibi_slopes = None + input_layout = get_input_layout(*args, **kwargs) + query = convert_to_bsnd(query, n1, input_layout) + key = convert_to_bsnd(key, n2, input_layout) + value = convert_to_bsnd(value, n2, input_layout) out = flash_attn_func( query, key, value, dropout_p=(1 - keep_prob), softmax_scale=scale, causal=causal_switch, window_size=(window_left, window_right), alibi_slopes=alibi_slopes, deterministic=deterministic ) + out = convert_from_bsnd(out, input_layout) return out, Const.NONE, Const.NONE diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/rotary_mul.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/rotary_mul.py index 61a9fb6ef6d9e8c03b5b3a7ef3102b696670e206..dd328dd1d9144fdafc3bc825a15b44f9c6f5d2ea 100644 --- a/debug/accuracy_tools/msprobe/pytorch/bench_functions/rotary_mul.py +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/rotary_mul.py @@ -40,6 +40,9 @@ def npu_rotary_mul_backward(dy_tensor, x, r1, r2): x_shape = x.shape h = x.float() grad = dy_tensor.float() + if len(r1_shape) < 4 or len(x_shape) < 4: + raise RuntimeError(f"Shape of r1 and x should at least be 4-dimension, " + f"but got r1 shape:{r1_shape}, x shape:{x_shape}") condition_1 = (r1_shape[0] == 1 and r1_shape[1] == x_shape[1] and r1_shape[2] == 1 @@ -68,4 +71,5 @@ def npu_rotary_mul_backward(dy_tensor, x, r1, r2): for j in range(x_shape[2]): 
r2_grad[:, 0, 0, :] += (x_new2[:, i, j, :] * grad[:, i, j, :]) r1_grad[:, 0, 0, :] += (h[:, i, j, :] * grad[:, i, j, :]) + return x.grad.cpu(), r1_grad.cpu(), r2_grad.cpu() diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/sort_v2.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/sort_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..c5bd1c141f83158632cffd9ce6238f191fbfe826 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/sort_v2.py @@ -0,0 +1,21 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import torch + + +def npu_sort_v2(x, dim=-1, descending=False, out=None): + y, _ = torch.sort(x, dim=dim, descending=descending) + return y diff --git a/debug/accuracy_tools/msprobe/pytorch/bench_functions/swiglu.py b/debug/accuracy_tools/msprobe/pytorch/bench_functions/swiglu.py index a6d3a75af5cde45a1a31530eccb10a73e6cdc7ba..a9f24e25bb1837635423031b52ade635d556aeff 100644 --- a/debug/accuracy_tools/msprobe/pytorch/bench_functions/swiglu.py +++ b/debug/accuracy_tools/msprobe/pytorch/bench_functions/swiglu.py @@ -19,7 +19,11 @@ import torch def npu_swiglu(x, dim=-1): tensor_dtype = x.dtype - in_tensors = torch.chunk(x, 2, dim=dim) + try: + in_tensors = torch.chunk(x, 2, dim=dim) + except Exception as e: + raise RuntimeError(f"Invalid chunk x into 2 tensors with shape {x.shape} and dimension {dim}") from e + if tensor_dtype == torch.float32: tensor_scalar = torch.sigmoid(torch.mul(in_tensors[0], 1.0)) output_data = torch.mul(torch.mul(tensor_scalar, in_tensors[0]), in_tensors[1]) @@ -34,7 +38,11 @@ def npu_swiglu(x, dim=-1): def npu_swiglu_backward(grad, x, dim=-1): tensor_dtype = grad.dtype - in_tensors = torch.chunk(x, 2, dim=dim) + try: + in_tensors = torch.chunk(x, 2, dim=dim) + except Exception as e: + raise RuntimeError(f"Invalid chunk x into 2 tensors with shape {x.shape} and dimension {dim}") from e + tensor_grad_out = grad if tensor_dtype == torch.float16: diff --git a/debug/accuracy_tools/msprobe/pytorch/common/parse_json.py b/debug/accuracy_tools/msprobe/pytorch/common/parse_json.py index 2d91a56a53e7c5fe327761aa23a20a10a5b8db3a..b46dbdac7c4620d2dccd31aff8217b80583391c3 100644 --- a/debug/accuracy_tools/msprobe/pytorch/common/parse_json.py +++ b/debug/accuracy_tools/msprobe/pytorch/common/parse_json.py @@ -15,6 +15,7 @@ from msprobe.core.common.exceptions import ParseJsonException from msprobe.core.common.file_utils import load_json +from msprobe.core.common.log import logger def parse_json_info_forward_backward(json_path): @@ -22,8 +23,11 @@ def 
parse_json_info_forward_backward(json_path): real_data_path = dump_json.get("dump_data_dir") dump_data = dump_json.get("data") + if dump_data is None: + raise ParseJsonException(ParseJsonException.InvalidDumpJson, + "something wrong with dump, no data found in dump.json") if not dump_data: - raise ParseJsonException(ParseJsonException.InvalidDumpJson, "dump数据中没有data字段") + logger.warning("data field is empty, no overflow data found.") forward_data = {} backward_data = {} diff --git a/debug/accuracy_tools/msprobe/pytorch/common/utils.py b/debug/accuracy_tools/msprobe/pytorch/common/utils.py index 478978eb0a521638cc61b0d5ea198c9b005548b7..4e82bee4a04d9ffe2be8aebe1a85791eccae4070 100644 --- a/debug/accuracy_tools/msprobe/pytorch/common/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/common/utils.py @@ -18,6 +18,7 @@ import os import pickle import random import stat +import inspect from functools import wraps import numpy as np @@ -105,8 +106,49 @@ def get_rank_if_initialized(): raise DistributedNotInitializedError("torch distributed environment is not initialized") -def seed_all(seed=1234, mode=False): - check_seed_all(seed, mode) +def remove_dropout(): + if torch.__version__ > "1.8": + logger.info_on_rank_0("For precision comparison, the probability p in the dropout method is set to 0.") + import torch.nn.functional as F + from torch import _VF + from torch.overrides import has_torch_function_unary, handle_torch_function + + def function_dropout(input_tensor: torch.Tensor, p: float = 0.5, training: bool = True, + inplace: bool = False) -> torch.Tensor: + if has_torch_function_unary(input_tensor): + return handle_torch_function( + function_dropout, (input_tensor,), input_tensor, p=0., training=training, inplace=inplace) + if p < 0.0 or p > 1.0: + raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p)) + return _VF.dropout_(input_tensor, 0., training) if inplace else _VF.dropout(input_tensor, 0., training) + + def 
function_dropout2d(input_tensor: torch.Tensor, p: float = 0.5, training: bool = True, + inplace: bool = False) -> torch.Tensor: + if has_torch_function_unary(input_tensor): + return handle_torch_function( + function_dropout2d, (input_tensor,), input_tensor, p=0., training=training, inplace=inplace) + if p < 0.0 or p > 1.0: + raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p)) + return _VF.feature_dropout_(input_tensor, 0., training) if inplace else _VF.feature_dropout(input_tensor, + 0., training) + + def function_dropout3d(input_tensor: torch.Tensor, p: float = 0.5, training: bool = True, + inplace: bool = False) -> torch.Tensor: + if has_torch_function_unary(input_tensor): + return handle_torch_function( + function_dropout3d, (input_tensor,), input_tensor, p=0., training=training, inplace=inplace) + if p < 0.0 or p > 1.0: + raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p)) + return _VF.feature_dropout_(input_tensor, 0., training) if inplace else _VF.feature_dropout(input_tensor, + 0., training) + + F.dropout = function_dropout + F.dropout2d = function_dropout2d + F.dropout3d = function_dropout3d + + +def seed_all(seed=1234, mode=False, rm_dropout=True): + check_seed_all(seed, mode, rm_dropout) try: random.seed(seed) os.environ['PYTHONHASHSEED'] = str(seed) @@ -126,6 +168,8 @@ def seed_all(seed=1234, mode=False): else: torch_npu.npu.manual_seed_all(seed) torch_npu.npu.manual_seed(seed) + if rm_dropout: + remove_dropout() except Exception as e: logger.error(f"There is an unexpected error while determinating randomness. {e}") @@ -359,3 +403,73 @@ def load_api_data(api_data_bytes): except Exception as e: raise RuntimeError(f"load api_data from bytes failed") from e return buffer + + +def is_recomputation(): + """Check if the current operation is in the re-computation phase. 
+ + This function inspects the current call stack to indicate whether the current operation is in the + re-computation phase. We use a blacklist mechanism, now supported megatron and mindspeed framework. + megatron: The 'backward' function is called by the 'torch/autograd/function.py' file. + mindspeed: The 'checkpoint_function_backward' function is called by the 'torch/autograd/function.py' + file or the custom module(use CheckpointWithoutOutput) with the 'recompute_fn' function is executed within the + 'torch/utils/checkpoint.py' file. + + Returns: + bool: True if in the re-computation phase, False otherwise. + """ + backward_function_indices = [] + call_stack = inspect.stack() + + # Identify the function 'backward' is being executed within the 'torch/_tensor.py' file. + for frame_info in call_stack: + if frame_info.function == "recompute_fn" and frame_info.filename.endswith('torch/utils/checkpoint.py'): + del call_stack + return True + + # Identify indices in the call stack where the specific function is being executed + for idx, frame_info in enumerate(call_stack): + if frame_info.function == Const.BACKWARD or frame_info.function == 'checkpoint_function_backward': + backward_function_indices.append(idx) + + # Check if the execution is within 'torch/autograd/function.py' file + for idx in backward_function_indices: + # The Megatron and MindSpeed L0&L1 scenes + if idx + 1 < len(call_stack) and call_stack[idx + 1].filename.endswith('torch/autograd/function.py'): + del call_stack + return True + # The latest MindSpeed L2 and ModelLink scenes + if idx + 2 < len(call_stack) and call_stack[idx + 2].filename.endswith('torch/autograd/function.py'): + del call_stack + return True + + del call_stack + return False + + +def check_save_param(variable, name, save_backward): + # try catch this api to skip invalid call + if not isinstance(variable, (list, dict, tuple, torch.Tensor, int, float, str)): + logger.warning("PrecisionDebugger.save variable type not valid, " + "should 
be one of list, dict, tuple, torch.Tensor, int, float or string. " + "Skip current save process.") + raise ValueError + if not isinstance(name, str): + logger.warning("PrecisionDebugger.save name not valid, " + "should be string. " + "skip current save process.") + raise ValueError + if not isinstance(save_backward, bool): + logger.warning("PrecisionDebugger.save_backward name not valid, " + "should be bool. " + "Skip current save process.") + raise ValueError + + +def replace_last_occurrence(text, old, new): + if text is None: + return text + index = text.rfind(old) + if index != -1: + return text[:index] + text[index:].replace(old, new, 1) + return text diff --git a/debug/accuracy_tools/msprobe/pytorch/compare/distributed_compare.py b/debug/accuracy_tools/msprobe/pytorch/compare/distributed_compare.py index ab197afe79e58f53d28687d5282a1187eb9db4f8..de62af421b5a37e39140a9836fb16853443740d7 100644 --- a/debug/accuracy_tools/msprobe/pytorch/compare/distributed_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/compare/distributed_compare.py @@ -14,52 +14,40 @@ # limitations under the License. 
import os -from msprobe.core.common.utils import CompareException, check_compare_param, \ - check_configuration_param, set_dump_path, get_dump_mode -from msprobe.core.common.file_utils import create_directory + from msprobe.core.common.exceptions import FileCheckException +from msprobe.core.common.file_utils import create_directory +from msprobe.core.common.utils import CompareException, check_compare_param, check_configuration_param, get_dump_mode, \ + set_dump_path +from msprobe.core.compare.acc_compare import ModeConfig +from msprobe.core.compare.utils import check_and_return_dir_contents, extract_json, set_stack_json_path from msprobe.pytorch.common.log import logger -from msprobe.pytorch.compare.pt_compare import PTComparator -from msprobe.core.compare.utils import check_and_return_dir_contents, extract_json +from msprobe.pytorch.compare.pt_compare import PTComparator, compare def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): - if kwargs.get('suffix'): + if kwargs.get("suffix"): logger.error("Argument 'suffix' is not supported for compare_distributed.") raise CompareException(CompareException.INVALID_PARAM_ERROR) - stack_mode = kwargs.get('stack_mode', False) - auto_analyze = kwargs.get('auto_analyze', True) - fuzzy_match = kwargs.get('fuzzy_match', False) + is_print_compare_log = kwargs.get("is_print_compare_log", True) # get the ranks and match by order npu_ranks = sorted(check_and_return_dir_contents(npu_dump_dir, 'rank')) bench_ranks = sorted(check_and_return_dir_contents(bench_dump_dir, 'rank')) if len(npu_ranks) != len(bench_ranks): - logger.error('The number of ranks in the two runs are different. ' - 'Unable to match the ranks. Please use another folder to compare ' - 'or use compare() api and manually match the ranks.') + logger.error( + "The number of ranks in the two runs are different. " + "Unable to match the ranks. 
" + "Please use another folder to compare or use compare() api and manually match the ranks.") raise CompareException(CompareException.INVALID_PATH_ERROR) for nr, br in zip(npu_ranks, bench_ranks): npu_data_dir = os.path.join(npu_dump_dir, nr) bench_data_dir = os.path.join(bench_dump_dir, br) npu_path = extract_json(npu_data_dir, stack_json=False) bench_path = extract_json(bench_data_dir, stack_json=False) - stack_path = extract_json(npu_data_dir, stack_json=True) dump_result_param = { - 'npu_json_path': npu_path, - 'bench_json_path': bench_path, - 'stack_json_path': stack_path, - 'is_print_compare_log': True + "npu_json_path": npu_path, + "bench_json_path": bench_path, + "is_print_compare_log": is_print_compare_log } - try: - set_dump_path(dump_result_param) - dump_mode = get_dump_mode(dump_result_param) - check_configuration_param(stack_mode, auto_analyze, fuzzy_match, - dump_result_param.get('is_print_compare_log', True)) - create_directory(output_path) - check_compare_param(dump_result_param, output_path, dump_mode) - except (CompareException, FileCheckException) as error: - logger.error('Compare failed. Please check the arguments and do it again!') - raise CompareException(error.code) from error - pt_comparator = PTComparator() - pt_comparator.compare_core(dump_result_param, output_path, suffix=f'_{nr}-{br}', dump_mode=dump_mode, **kwargs) + compare(input_param=dump_result_param, output_path=output_path, suffix=f'_{nr}-{br}', **kwargs) diff --git a/debug/accuracy_tools/msprobe/pytorch/compare/pt_compare.py b/debug/accuracy_tools/msprobe/pytorch/compare/pt_compare.py index 534326267f5728e31fec408b3d0da2d97f641d07..308a82b3d6e9beb67a669ea05b83d7b8a6eddc90 100644 --- a/debug/accuracy_tools/msprobe/pytorch/compare/pt_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/compare/pt_compare.py @@ -14,19 +14,29 @@ # limitations under the License. 
import os.path + import torch + from msprobe.core.common.const import FileCheckConst -from msprobe.pytorch.common.log import logger from msprobe.core.common.exceptions import FileCheckException -from msprobe.core.compare.acc_compare import Comparator -from msprobe.core.common.utils import check_configuration_param, check_compare_param, \ - CompareException, set_dump_path, get_dump_mode from msprobe.core.common.file_utils import FileChecker, create_directory, load_yaml +from msprobe.core.common.utils import CompareException, check_compare_param, check_configuration_param, get_dump_mode, \ + set_dump_path +from msprobe.core.compare.acc_compare import Comparator, ModeConfig +from msprobe.core.compare.utils import set_stack_json_path +from msprobe.pytorch.common.log import logger from msprobe.pytorch.common.utils import load_pt -class PTComparator (Comparator): - def __init__(self, data_mapping=None): +class PTComparator(Comparator): + def __init__(self, mode_config, data_mapping=None): + super().__init__(mode_config) + + self.stack_mode = mode_config.stack_mode + self.auto_analyze = mode_config.auto_analyze + self.fuzzy_match = mode_config.fuzzy_match + self.dump_mode = mode_config.dump_mode + self.frame_name = PTComparator.__name__ self.data_mapping = data_mapping if isinstance(self.data_mapping, str) or self.data_mapping is None: @@ -37,21 +47,24 @@ class PTComparator (Comparator): raise TypeError(f"The type of parameter `data_mapping` must be dict, str or None, but got " f"{type(self.data_mapping)}") - def load_mapping_file(self, mapping_file): + @staticmethod + def load_mapping_file(mapping_file): if isinstance(mapping_file, str): mapping_dict = load_yaml(mapping_file) else: mapping_dict = {} return mapping_dict - + def read_npy_data(self, dir_path, file_name): + if not file_name: + return None data_path = os.path.join(dir_path, file_name) path_checker = FileChecker(data_path, FileCheckConst.FILE, FileCheckConst.READ_ABLE, - FileCheckConst.PT_SUFFIX, False) + 
FileCheckConst.PT_SUFFIX, False) data_path = path_checker.common_check() try: - data_value = load_pt(data_path, - to_cpu=True).detach() # detach because numpy can not process gradient information + # detach because numpy can not process gradient information + data_value = load_pt(data_path, to_cpu=True).detach() except RuntimeError as e: # 这里捕获 load_pt 中抛出的异常 logger.error(f"Failed to load the .pt file at {data_path}.") @@ -63,20 +76,29 @@ class PTComparator (Comparator): if data_value.dtype == torch.bfloat16: data_value = data_value.to(torch.float32) data_value = data_value.numpy() - return data_value - - -def compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_match=False, **kwargs): + return data_value + + +def compare(input_param, output_path, **kwargs): try: + auto_analyze = kwargs.get('auto_analyze', True) + fuzzy_match = kwargs.get('fuzzy_match', False) + data_mapping = kwargs.get('data_mapping', None) + suffix = kwargs.get('suffix', '') + set_dump_path(input_param) dump_mode = get_dump_mode(input_param) + if "stack_json_path" in input_param: + stack_mode = kwargs.get('stack_mode', False) + else: + stack_mode = set_stack_json_path(input_param) # set stack_mode and set "stack_json_path" in input_param check_configuration_param(stack_mode, auto_analyze, fuzzy_match, input_param.get('is_print_compare_log', True)) create_directory(output_path) - check_compare_param(input_param, output_path, dump_mode) - data_mapping = kwargs.get('data_mapping', None) + check_compare_param(input_param, output_path, dump_mode, stack_mode) except (CompareException, FileCheckException) as error: logger.error('Compare failed. 
Please check the arguments and do it again!') raise CompareException(error.code) from error - pt_comparator = PTComparator(data_mapping) - pt_comparator.compare_core(input_param, output_path, stack_mode=stack_mode, - auto_analyze=auto_analyze, fuzzy_match=fuzzy_match, dump_mode=dump_mode) + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + pt_comparator = PTComparator(mode_config, data_mapping) + pt_comparator.compare_core(input_param, output_path, suffix=suffix) diff --git a/debug/accuracy_tools/msprobe/pytorch/debugger/debugger_config.py b/debug/accuracy_tools/msprobe/pytorch/debugger/debugger_config.py index 054aa94912d375e995d921a7380835267a1712fe..77e78bc38063602e64b533291d60b9b12fd2ae00 100644 --- a/debug/accuracy_tools/msprobe/pytorch/debugger/debugger_config.py +++ b/debug/accuracy_tools/msprobe/pytorch/debugger/debugger_config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -26,18 +26,15 @@ class DebuggerConfig: self.task = task or common_config.task or Const.STATISTICS self.rank = common_config.rank if common_config.rank else [] self.step = common_config.step if common_config.step else [] - self.level = level or common_config.level or "L1" + self.level = level or common_config.level or Const.LEVEL_L1 self.enable_dataloader = common_config.enable_dataloader self.scope = task_config.scope if task_config.scope else [] self.list = task_config.list if task_config.list else [] self.data_mode = task_config.data_mode if task_config.data_mode else ["all"] - self.backward_input_list = task_config.backward_input if task_config.backward_input else [] - self.backward_input = {} - self.acl_config = common_config.acl_config if common_config.acl_config else "" - self.is_forward_acl_dump = True self.summary_mode = task_config.summary_mode if task_config.summary_mode else Const.STATISTICS self.overflow_nums = task_config.overflow_nums if task_config.overflow_nums else 1 self.framework = Const.PT_FRAMEWORK + self.async_dump = common_config.async_dump if common_config.async_dump else False if self.task == Const.FREE_BENCHMARK: self.fuzz_device = task_config.fuzz_device @@ -64,16 +61,9 @@ class DebuggerConfig: self.check() - if self.level == "L2": - if not self.scope or not isinstance(self.scope, list) or len(self.scope) != 1: - raise ValueError("scope must be configured as a list with one api name") - if isinstance(self.scope[0], str) and Const.BACKWARD in self.scope[0] and not self.backward_input_list: - raise ValueError("backward_input must be configured when scope contains 'backward'") - if Const.BACKWARD in self.scope[0]: - self.is_forward_acl_dump = False - for index, scope_spec in enumerate(self.scope): - self.scope[index] = scope_spec.replace(Const.BACKWARD, Const.FORWARD) - self.backward_input[self.scope[index]] = self.backward_input_list[index] + if self.level == 
Const.LEVEL_L2: + self.is_backward_kernel_dump = False + self._check_and_adjust_config_with_l2() def check_kwargs(self): if self.task and self.task not in Const.TASK_LIST: @@ -85,26 +75,63 @@ class DebuggerConfig: if not self.dump_path: raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f"The dump_path not found.") + if not isinstance(self.async_dump, bool): + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"The parameters async_dump should be bool.") + if self.async_dump and self.task == Const.TENSOR and not self.list: + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"The parameters async_dump is true in tensor task, the parameters list cannot be " + f"empty.") + if self.task == Const.STRUCTURE and self.level not in [Const.LEVEL_L0, Const.LEVEL_MIX]: + logger.warning_on_rank_0( + f"When the task is set to structure, the level should be one of {[Const.LEVEL_L0, Const.LEVEL_MIX]}. " + f"If not, the default level is {Const.LEVEL_MIX}." + ) + self.level = Const.LEVEL_MIX def check(self): self.check_kwargs() return True def check_model(self, instance, start_model): - if self.level not in ["L0", "mix"]: + if self.level not in [Const.LEVEL_L0, Const.LEVEL_MIX]: if instance.model is not None or start_model is not None: - logger.warning_on_rank_0( + logger.info_on_rank_0( f"The current level is not L0 or mix level, so the model parameters will not be used.") return - if start_model is None: - if instance.model is None: - logger.error_on_rank_0( - f"For level {self.level}, PrecisionDebugger or start interface must receive a 'model' argument.") - raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f"missing the parameter 'model'") + if start_model is None and instance.model is None: + logger.error_on_rank_0( + f"For level {self.level}, PrecisionDebugger or start interface must receive a 'model' parameter.") + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, f"missing the parameter 'model'") + + instance.model = 
start_model if start_model is not None else instance.model + if isinstance(instance.model, torch.nn.Module): return - if isinstance(start_model, torch.nn.Module): - instance.model = start_model + + error_model = None + if isinstance(instance.model, (list, tuple)): + for model in instance.model: + if not isinstance(model, torch.nn.Module): + error_model = model + break else: - logger.error_on_rank_0(f"The 'model' parameter of start must be a torch.nn.Module type.") + error_model = instance.model + + if error_model is not None: + error_info = (f"The 'model' parameter must be a torch.nn.Module or list[torch.nn.Module] " + f"type, but received a {type(error_model)} type.") raise MsprobeException( - MsprobeException.INVALID_PARAM_ERROR, f"model must be a torch.nn.Module") + MsprobeException.INVALID_PARAM_ERROR, error_info) + + def _check_and_adjust_config_with_l2(self): + if self.scope: + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"When level is set to L2, the scope cannot be configured.") + if not self.list or len(self.list) != 1: + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR, + f"When level is set to L2, the list must be configured as a list with one API name.") + api_name = self.list[0] + if api_name.endswith(Const.BACKWARD): + self.is_backward_kernel_dump = True + api_forward_name = api_name[:-len(Const.BACKWARD)] + Const.FORWARD + self.list.append(api_forward_name) diff --git a/debug/accuracy_tools/msprobe/pytorch/debugger/precision_debugger.py b/debug/accuracy_tools/msprobe/pytorch/debugger/precision_debugger.py index 72890761e9550adc1709fced46c38ddbf3f6e8d8..5bb1d3a14e82d7b4bce9d7da8921a1d701e82222 100644 --- a/debug/accuracy_tools/msprobe/pytorch/debugger/precision_debugger.py +++ b/debug/accuracy_tools/msprobe/pytorch/debugger/precision_debugger.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved.
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -21,7 +21,9 @@ from msprobe.core.common.exceptions import MsprobeException from msprobe.core.common.file_utils import FileChecker from msprobe.core.common.utils import get_real_step_or_rank from msprobe.pytorch.common.log import logger +from msprobe.pytorch.common.utils import check_save_param from msprobe.pytorch.debugger.debugger_config import DebuggerConfig +from msprobe.pytorch.dump.module_dump.module_dump import ModuleDumper from msprobe.pytorch.grad_probe.grad_monitor import GradientMonitor from msprobe.pytorch.pt_config import parse_json_config from msprobe.pytorch.service import Service @@ -49,7 +51,7 @@ class PrecisionDebugger: dump_path=None, level=None, model=None, - step=None, + step=None ): if not hasattr(self, "initialized"): config_params = ConfigParameters(config_path, @@ -59,7 +61,6 @@ class PrecisionDebugger: model) self.check_input_params(config_params) - self.api_origin = False self.initialized = True self.model = model common_config, task_config = parse_json_config(config_path, task) @@ -67,12 +68,13 @@ class PrecisionDebugger: if self.task == Const.GRAD_PROBE: self.gm = GradientMonitor(common_config, task_config) return - if step: + if step is not None: common_config.step = get_real_step_or_rank(step, Const.STEP) self.config = DebuggerConfig( common_config, task_config, task, dump_path, level ) self.service = Service(self.config) + self.module_dumper = ModuleDumper(self.service) self.enable_dataloader = self.config.enable_dataloader if self.enable_dataloader: logger.warning_on_rank_0("The enable_dataloader feature will be deprecated in the future.") @@ -105,9 +107,11 @@ class PrecisionDebugger: raise MsprobeException( MsprobeException.INVALID_PARAM_ERROR, f"level must be one of {Const.LEVEL_LIST}") - if args.model is not None and not isinstance(args.model, torch.nn.Module): - raise MsprobeException( - MsprobeException.INVALID_PARAM_ERROR, f"model must be a torch.nn.Module") 
+ if args.model is not None: + logger.warning_on_rank_0( + "The 'model' parameter in the PrecisionDebugger will be deprecated in the future. " + "It is recommended to pass the 'model' parameter in the start interface instead." + ) @classmethod def start(cls, model=None): @@ -120,15 +124,12 @@ class PrecisionDebugger: if instance.enable_dataloader: logger.warning_on_rank_0("DataLoader is enabled, start() skipped.") else: - instance.service.start(instance.model, instance.api_origin) - instance.api_origin = False + instance.service.start(instance.model) - # 指定代码段dump前反向结束符,之后的计算过程数据将被忽略,无法被dump @classmethod def forward_backward_dump_end(cls): instance = cls._instance - instance.service.forward_backward_dump_end() - instance.api_origin = True + instance.stop() @classmethod def stop(cls): @@ -158,6 +159,49 @@ class PrecisionDebugger: return cls._instance.gm.monitor(model) + @classmethod + def save(cls, variable, name, save_backward=True): + instance = cls._instance + if not instance: + raise Exception(MsgConst.NOT_CREATED_INSTANCE) + if instance.task not in [Const.TENSOR, Const.STATISTICS] or instance.config.level != Const.LEVEL_DEBUG: + return + try: + check_save_param(variable, name, save_backward) + except ValueError: + return + instance.service.save(variable, name, save_backward) + + +def module_dump(module, dump_name): + if not isinstance(module, torch.nn.Module): + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, + f"the module argument in module_dump must be a torch.nn.Module subclass" + ) + if not isinstance(dump_name, str): + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, + f"the dump_name argument in module_dump must be a str type" + ) + instance = PrecisionDebugger._instance + if not instance: + raise MsprobeException( + MsprobeException.INTERFACE_USAGE_ERROR, + f"PrecisionDebugger must be instantiated before using module_dump interface" + ) + instance.module_dumper.start_module_dump(module, dump_name) + + +def 
module_dump_end(): + instance = PrecisionDebugger._instance + if not instance: + raise MsprobeException( + MsprobeException.INTERFACE_USAGE_ERROR, + f"PrecisionDebugger must be instantiated before using module_dump_end interface" + ) + instance.module_dumper.stop_module_dump() + def iter_tracer(func): def func_wrapper(*args, **kwargs): diff --git a/debug/accuracy_tools/msprobe/pytorch/dump/kernel_dump/kernel_config.py b/debug/accuracy_tools/msprobe/pytorch/dump/kernel_dump/kernel_config.py new file mode 100644 index 0000000000000000000000000000000000000000..48d0918ca68d7f429cc97fc64c5ba7d7f884960b --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/dump/kernel_dump/kernel_config.py @@ -0,0 +1,33 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os + +from msprobe.core.common.file_utils import save_json + + +def create_kernel_config_json(dump_path, cur_rank): + kernel_config_name = "kernel_config.json" if cur_rank == '' else f"kernel_config_{cur_rank}.json" + kernel_config_path = os.path.join(dump_path, kernel_config_name) + config_info = { + "dump": { + "dump_list": [], + "dump_path": dump_path, + "dump_mode": "all", + "dump_op_switch": "on" + } + } + save_json(kernel_config_path, config_info, indent=4) + return kernel_config_path diff --git a/profiler/advisor/advisor_backend/overall_advice/__init__.py b/debug/accuracy_tools/msprobe/pytorch/dump/module_dump/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/overall_advice/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/dump/module_dump/__init__.py diff --git a/debug/accuracy_tools/msprobe/pytorch/dump/module_dump/module_dump.py b/debug/accuracy_tools/msprobe/pytorch/dump/module_dump/module_dump.py new file mode 100644 index 0000000000000000000000000000000000000000..4700de6f1f9f3b5ddfb9507decb6f8739b5eda9b --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/dump/module_dump/module_dump.py @@ -0,0 +1,86 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import torch +from msprobe.core.common.const import Const +from msprobe.core.data_dump.scope import BaseScope +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.hook_module.api_registry import api_register + +torch_version_above_or_equal_2 = torch.__version__.split('+')[0] >= '2.0' + + +class ModuleDumper: + def __init__(self, service): + self.service = service + self.hook_handle_list = [] + + def start_module_dump(self, module, dump_name): + api_register.api_originality() + self.register_hook(module, dump_name) + + def stop_module_dump(self): + api_register.api_modularity() + for hook_handle in self.hook_handle_list: + if isinstance(hook_handle, torch.utils.hooks.RemovableHandle): + hook_handle.remove() + self.hook_handle_list.clear() + + def register_hook(self, module, dump_name): + prefix_name = ( + BaseScope.Module_Type_Module + Const.SEP + + dump_name + Const.SEP + + module.__class__.__name__ + Const.SEP + ) + module_processor = self.service.module_processor + _, forward_hook, backward_hook, forward_hook_torch_version_below_2 = self.service.build_hook( + BaseScope.Module_Type_Module, + prefix_name + ) + + if module_processor.has_register_backward_hook(module): + logger.warning( + f"The {dump_name} module has registered deprecated register_backward_hook, " + f"which may cause abnormal data dump. The backward data dump for this module will be skipped."
+ ) + if torch_version_above_or_equal_2: + forward_hook_handle = module.register_forward_hook(forward_hook, with_kwargs=True) + else: + if not module_processor.has_register_backward_hook(module): + backward_hook_handle = module.register_full_backward_hook( + module_processor.node_hook(prefix_name + Const.BACKWARD, Const.STOP) + ) + self.hook_handle_list.append(backward_hook_handle) + forward_hook_handle = module.register_forward_hook(forward_hook_torch_version_below_2) + self.hook_handle_list.append(forward_hook_handle) + if not module_processor.has_register_backward_hook(module): + backward_hook_handle = module.register_full_backward_hook(backward_hook) + self.hook_handle_list.append(backward_hook_handle) + + forward_pre_hook_handle = module.register_forward_pre_hook( + module_processor.node_hook(prefix_name + Const.FORWARD, Const.START) + ) + forward_hook_handle = module.register_forward_hook( + module_processor.node_hook(prefix_name + Const.FORWARD, Const.STOP) + ) + self.hook_handle_list.extend([forward_pre_hook_handle, forward_hook_handle]) + if torch_version_above_or_equal_2 and not module_processor.has_register_backward_hook(module): + backward_pre_hook_handle = module.register_full_backward_pre_hook( + module_processor.node_hook(prefix_name + Const.BACKWARD, Const.START) + ) + backward_hook_handle = module.register_full_backward_hook( + module_processor.node_hook(prefix_name + Const.BACKWARD, Const.STOP) + ) + self.hook_handle_list.extend([backward_pre_hook_handle, backward_hook_handle]) diff --git a/debug/accuracy_tools/msprobe/pytorch/module_processer.py b/debug/accuracy_tools/msprobe/pytorch/dump/module_dump/module_processer.py similarity index 45% rename from debug/accuracy_tools/msprobe/pytorch/module_processer.py rename to debug/accuracy_tools/msprobe/pytorch/dump/module_dump/module_processer.py index 4ff6fa08b46e759c6b265313b632e728fa14794b..b5ca1da461fd4235a09172de4b9dcea34a624e58 100644 --- a/debug/accuracy_tools/msprobe/pytorch/module_processer.py 
+++ b/debug/accuracy_tools/msprobe/pytorch/dump/module_dump/module_processer.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -17,12 +17,25 @@ from functools import wraps import torch from msprobe.core.common.const import Const -from msprobe.core.data_dump.scope import ModuleRangeScope +from msprobe.core.data_dump.scope import BaseScope, ModuleRangeScope, MixRangeScope +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.common.utils import replace_last_occurrence +from torch.utils.checkpoint import checkpoint as origin_checkpoint +from torch.utils.checkpoint import set_checkpoint_early_stop from torch.utils.hooks import BackwardHook torch_version_above_or_equal_2 = torch.__version__.split('+')[0] >= '2.0' +def checkpoint_without_early_stop(*args, **kwargs): + with set_checkpoint_early_stop(False): + return origin_checkpoint(*args, **kwargs) + + +def replace_checkpoint(): + torch.utils.checkpoint.checkpoint = checkpoint_without_early_stop + + class ModuleProcesser: module_count = {} module_stack = [] @@ -30,33 +43,10 @@ class ModuleProcesser: module_node = {} def __init__(self, scope): - if isinstance(scope, ModuleRangeScope): - self.scope = scope - else: - self.scope = None + self.scope = scope if isinstance(scope, (ModuleRangeScope, MixRangeScope)) else None BackwardHook.setup_input_hook = ModuleProcesser.clone_return_value(BackwardHook.setup_input_hook) BackwardHook.setup_output_hook = ModuleProcesser.clone_return_value(BackwardHook.setup_output_hook) - BackwardHook.setup_output_hook = ModuleProcesser.filter_tensor_and_tuple(BackwardHook.setup_output_hook) - - @staticmethod - def filter_tensor_and_tuple(func): - @wraps(func) - def wrap_by_filter_tensor_and_tuple(*args, **kwargs): - # setup_output_hook传入非tensor数据,工具后续dump会报错,处理方式是解析非tensor数据的属性,对tensor属性挂hook - # 
setup_output_hook定义为setup_output_hook(self, args),因此处理第二个位置参数,即*args[1] - if not isinstance(args[1], (torch.Tensor, tuple)): - for item_str in dir(args[1]): - item = getattr(args[1], item_str) - # 处理tensor或者只包含tensor的元组 - if isinstance(item, torch.Tensor) or \ - (isinstance(item, tuple) and all(isinstance(x, torch.Tensor) for x in item)): - args_new = (args[0], item) - result = func(*args_new, **kwargs) - setattr(args[1], item_str, result) - return args[1] - return func(*args, **kwargs) - - return wrap_by_filter_tensor_and_tuple + replace_checkpoint() @staticmethod def clone_return_value(func): @@ -66,16 +56,16 @@ class ModuleProcesser: return ModuleProcesser.clone_if_tensor(result) return clone_return_value_func - + @staticmethod def clone_if_tensor(result): if isinstance(result, torch.Tensor): return result.clone() - elif isinstance(result, tuple): + elif type(result) is tuple: return tuple(ModuleProcesser.clone_if_tensor(x) for x in result) - elif isinstance(result, list): + elif type(result) is list: return list(ModuleProcesser.clone_if_tensor(x) for x in result) - elif isinstance(result, dict): + elif type(result) is dict: return {k: ModuleProcesser.clone_if_tensor(v) for k, v in result.items()} else: return result @@ -88,6 +78,22 @@ class ModuleProcesser: ModuleProcesser.module_count[module_name] += 1 return ModuleProcesser.module_count[module_name] + @staticmethod + def has_register_backward_hook(module): + return hasattr(module, '_backward_hooks') and \ + len(module._backward_hooks) > 0 and \ + module._is_full_backward_hook is False + + @staticmethod + def get_modules_and_names(models): + modules_and_names_with_index = {} + if isinstance(models, (list, tuple)): + for index, model in enumerate(models): + modules_and_names_with_index[str(index)] = model.named_modules() + else: + modules_and_names_with_index["-1"] = models.named_modules() + return modules_and_names_with_index + @classmethod def reset_module_stats(cls): cls.module_count = {} @@ -95,6 +101,42 @@ 
class ModuleProcesser: cls.api_parent_node = "" cls.module_node = {} + def register_module_hook(self, models, build_hook): + logger.info_on_rank_0("The init dump is enabled, and the module dump function will not be available.") + modules_and_names_with_index = self.get_modules_and_names(models) + for index, modules_and_names in modules_and_names_with_index.items(): + model = models if index == "-1" else models[int(index)] + for name, module in modules_and_names: + if module == model: + continue + module_index = (index + Const.SEP) if index != "-1" else "" + prefix_name = (BaseScope.Module_Type_Module + Const.SEP + module_index + + name + Const.SEP + module.__class__.__name__ + Const.SEP) + pre_forward_hook, forward_hook, backward_hook, forward_hook_torch_version_below_2 = build_hook( + BaseScope.Module_Type_Module, + prefix_name + ) + + if self.has_register_backward_hook(module): + logger.warning( + f"The {prefix_name[:-1]} has registered deprecated register_backward_hook, " + f"which may cause abnormal data dump. The backward data dump for this module will be skipped."
+ ) + if torch_version_above_or_equal_2: + module.register_forward_hook(forward_hook, with_kwargs=True) + else: + if not self.has_register_backward_hook(module): + module.register_full_backward_hook(self.node_hook(prefix_name + Const.BACKWARD, Const.STOP)) + module.register_forward_hook(forward_hook_torch_version_below_2) + if not self.has_register_backward_hook(module): + module.register_full_backward_hook(backward_hook) + + module.register_forward_pre_hook(self.node_hook(prefix_name + Const.FORWARD, Const.START)) + module.register_forward_hook(self.node_hook(prefix_name + Const.FORWARD, Const.STOP)) + if torch_version_above_or_equal_2 and not self.has_register_backward_hook(module): + module.register_full_backward_pre_hook(self.node_hook(prefix_name + Const.BACKWARD, Const.START)) + module.register_full_backward_hook(self.node_hook(prefix_name + Const.BACKWARD, Const.STOP)) + def node_hook(self, name_prefix, start_or_stop, **kwargs): def pre_hook(module, input, output=None): @@ -103,7 +145,10 @@ class ModuleProcesser: except IndexError as e: index = None pass - module.mindstudio_reserved_name = full_name = name_prefix + Const.SEP + str(index) + full_name = name_prefix + Const.SEP + str(index) + if not hasattr(module, "mindstudio_reserved_name") or not module.mindstudio_reserved_name: + module.mindstudio_reserved_name = [] + module.mindstudio_reserved_name.append(full_name) if self.module_stack: ModuleProcesser.module_node[full_name] = self.module_stack[-1] else: @@ -122,8 +167,11 @@ class ModuleProcesser: ModuleProcesser.api_parent_node = self.module_stack[-1] else: ModuleProcesser.api_parent_node = None + if not hasattr(module, "mindstudio_reserved_name") or not module.mindstudio_reserved_name: + raise RuntimeError("module reserved name is empty when popping") + current_name = module.mindstudio_reserved_name.pop() if self.scope: - self.scope.end_module(module.mindstudio_reserved_name) + self.scope.end_module(current_name) def backward_hook(module, input, 
output=None): try: @@ -131,10 +179,13 @@ class ModuleProcesser: except IndexError as e: index = None pass - module.mindstudio_reserved_name = full_name = name_prefix + Const.SEP + str(index) - forward_full_name = full_name.replace(Const.BACKWARD, Const.FORWARD) - ModuleProcesser.module_node[full_name] = ModuleProcesser.module_node[forward_full_name].replace( - Const.FORWARD, Const.BACKWARD) if ModuleProcesser.module_node[forward_full_name] else None + full_name = name_prefix + Const.SEP + str(index) + if not hasattr(module, "mindstudio_reserved_name") or not module.mindstudio_reserved_name: + module.mindstudio_reserved_name = [] + module.mindstudio_reserved_name.append(full_name) + forward_full_name = replace_last_occurrence(full_name, Const.BACKWARD, Const.FORWARD) + ModuleProcesser.module_node[full_name] = replace_last_occurrence( + ModuleProcesser.module_node.get(forward_full_name), Const.FORWARD, Const.BACKWARD) ModuleProcesser.api_parent_node = None if self.scope: self.scope.begin_module(full_name) diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/constant.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/constant.py index c5e93be138d24af8c18858db483e397527fb4092..c469914fea24903c1a15650496cc9d14a3ae89d5 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/constant.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/constant.py @@ -1,3 +1,18 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + from typing import Dict import numpy as np diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/counter.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/counter.py index b2f8c81f3a4ea57d712e49b0b58fc77747797323..f0a16697d568fa3a7cf72d17b70e0a3a5c686d99 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/counter.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/counter.py @@ -1,3 +1,18 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + from collections import defaultdict from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/enums.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/enums.py index dac78f016a09792539a15ef4a9f11bb3af1bf0e8..eb5c3a93c74ec01a2bdb0bc538241608839a8c8c 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/enums.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/enums.py @@ -1,3 +1,18 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + from msprobe.core.common.const import Const diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/params.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/params.py index 326f623be728c13b2a39627b67b3fca63babe3eb..5d88912024fc93858d27f1f611f4d30b3d2cd5c7 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/params.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/params.py @@ -39,7 +39,6 @@ class DataParams: origin_func: Optional[Callable] = None api_type: Optional[str] = None fuzz_stage: Optional[str] = None - grad_unequal_flag: Optional[bool] = True @dataclass @@ -127,6 +126,8 @@ def make_unequal_row( ) if isinstance(ratio, float): row.max_rel = ratio - 1 + if isinstance(ratio, str): + row.max_rel = ratio origin_tensor = data_params.original_result perturbed_tensor = data_params.perturbed_result if index is not None: diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/utils.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/utils.py index 3dbded07e940efc6b9e0e8f4e98ded9f140df4e4..a5e7cabd85186336b7b4cb5bf5d6f25599ad9d7f 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/utils.py @@ -13,8 +13,10 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+ import torch from msprobe.core.common.exceptions import FreeBenchmarkException +from msprobe.core.common.utils import recursion_depth_decorator from msprobe.pytorch.free_benchmark.common.enums import DeviceType from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode @@ -52,6 +54,7 @@ class Tools: return api_name.rsplit(".", 2)[0] @staticmethod + @recursion_depth_decorator("FreeBenchmark: Tools.convert_device_and_dtype") def convert_device_and_dtype( tensor_seq, device: str = DeviceType.CPU, change_dtype: bool = False ): @@ -74,6 +77,7 @@ class Tools: return tensor_seq @staticmethod + @recursion_depth_decorator("FreeBenchmark: Tools.convert_fuzz_output_to_origin") def convert_fuzz_output_to_origin(origin, perturbed, pert_mode): if isinstance(origin, torch.Tensor) and isinstance(perturbed, torch.Tensor): if pert_mode == PerturbationMode.AUTO: @@ -123,6 +127,7 @@ class TorchC: abs = torch._C._VariableFunctionsClass.abs where = torch._C._VariableFunctionsClass.where div = torch._C._VariableFunctionsClass.div + mul = torch._C._VariableFunctionsClass.mul max = torch._C._VariableFunctionsClass.max min = torch._C._VariableFunctionsClass.min gt = torch._C._VariableFunctionsClass.gt @@ -137,3 +142,5 @@ class TorchC: tensor_split = torch._C._VariableFunctionsClass.tensor_split stack = torch._C._VariableFunctionsClass.stack reshape = torch._C._VariableFunctionsClass.reshape + nan_to_num = torch._C._VariableFunctionsClass.nan_to_num + aminmax = torch._C._VariableFunctionsClass.aminmax diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/grad_saver.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/grad_saver.py index efdf23999bc1abfa01e7a6bb14960a0aef0d4671..58cfea45d00459db65355a2cdba4471bac7b754e 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/grad_saver.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/grad_saver.py @@ -82,13 +82,11 @@ class GradSaver: data_params = DataParams() 
data_params.original_result = origin_grad data_params.perturbed_result = perturbed_grad - data_params.grad_unequal_flag = False data_params.valid_input_index = index try: handler.handle(data_params) if not data_params.is_consistent: self.is_compare = False - data_params.grad_unequal_flag = True data_params.is_consistent = True data_params.perturbed_result = self.perturbed_grad_input data_params.original_result = self.origin_grad_input diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/single_benchmark.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/single_benchmark.py index cf4bac80333671f5a24ec06ed65da03cd1b1e3f1..49e845da4011565f1b6ccf0c0e1193fb3fcffcbf 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/single_benchmark.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/single_benchmark.py @@ -16,6 +16,7 @@ import math import torch +from msprobe.core.common.utils import recursion_depth_decorator from msprobe.pytorch.free_benchmark import logger from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig from msprobe.pytorch.free_benchmark.common.utils import TorchC @@ -67,6 +68,7 @@ class SingleCompare: return False return True + @recursion_depth_decorator("FreeBenchmark: SingleCompare.compare_seq") def compare_seq(self, actual, golden): if isinstance(golden, torch.Tensor): return self.compare_tensor_seq(actual, golden) diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py index 29b1de6d8f58a5e907f6156fb146dd7848e45d59..41ec39e3a3b6233720c047d5d2b736d91bba989e 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py @@ -14,6 +14,7 @@ # limitations under the License. 
import torch +from msprobe.core.common.utils import recursion_depth_decorator from msprobe.pytorch.free_benchmark import logger from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode @@ -26,6 +27,7 @@ from msprobe.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import class AddNoiseLayer(NpuBaseLayer): + @recursion_depth_decorator("FreeBenchmark: AddNoiseLayer.add_noise") def add_noise(self, tensor_obj): if isinstance(tensor_obj, torch.Tensor): self.perturbed_value = ThresholdConfig.PERTURBATION_VALUE_DICT.get( diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py index 9bea9f67ca08790202f8b174346e8b2a3b4581bf..df1a73127aa0b69e42254cce1d3334810319f7cf 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py @@ -14,6 +14,7 @@ # limitations under the License. 
import torch +from msprobe.core.common.utils import recursion_depth_decorator from msprobe.pytorch.free_benchmark import logger from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode @@ -31,6 +32,7 @@ class BitNoiseLayer(NpuBaseLayer): self.bit_tail: int = 1 self.bit_type = None + @recursion_depth_decorator("FreeBenchmark: BitNoiseLayer.add_bit_noise") def add_bit_noise(self, tensor_obj): """ 对输入添加噪声 diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py index e2e0c8aa012a6ff57169cfd47c4728bd70a2563a..c4fbeaf82f8fcafba235a7faa6dd9073d4d556d8 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py @@ -14,6 +14,7 @@ # limitations under the License. 
import torch +from msprobe.core.common.utils import recursion_depth_decorator from msprobe.pytorch.free_benchmark import logger from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode from msprobe.pytorch.free_benchmark.common.params import DataParams @@ -29,6 +30,7 @@ class ChangeValueLayer(NpuBaseLayer): self.head: int = 0 self.tail: int = -1 + @recursion_depth_decorator("FreeBenchmark: ChangeValueLayer.change_value") def change_value(self, tensor_obj): """ 交换张量首尾 diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py index 393d82a3a06b01308ce53b9f79b5901ea93f7cdc..095e77ffaff39a795cb1418c1695608d91d7427b 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py @@ -15,6 +15,7 @@ import torch from msprobe.core.common.const import Const +from msprobe.core.common.utils import recursion_depth_decorator from msprobe.pytorch.free_benchmark import logger from msprobe.pytorch.free_benchmark.common.constant import CommonField from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode @@ -26,6 +27,9 @@ from msprobe.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import class ImprovePrecisionLayer(NpuBaseLayer): + @recursion_depth_decorator( + "FreeBenchmark: ImprovePrecisionLayer.improve_tensor_precision" + ) def improve_tensor_precision(self, tensor_obj): if ( isinstance(tensor_obj, torch.Tensor) diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/base_handler.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/base_handler.py index a9ccbc1d4374777ea006cdb58f406664395e3d2d..47f93ab7b89f44bdd4f92ceafc6e9dbe503d0374 100644 --- 
a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/base_handler.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/base_handler.py @@ -20,6 +20,7 @@ from typing import Any, Optional, Tuple import numpy as np import torch from msprobe.core.common.const import Const +from msprobe.core.common.exceptions import FreeBenchmarkException from msprobe.pytorch.free_benchmark import logger from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig from msprobe.pytorch.free_benchmark.common.enums import ( @@ -88,12 +89,6 @@ class FuzzHandler(ABC): ) return origin_output_chunks, perturbed_output_chunks - @staticmethod - def convert_overflow_ratio_to_consistent(ratio): - if math.isnan(ratio) or math.isinf(ratio): - return ThresholdConfig.COMP_CONSISTENT - return ratio - @abstractmethod def get_threshold(self, dtype): pass @@ -106,55 +101,45 @@ class FuzzHandler(ABC): self, origin_output, perturbed_output, norm_type, abs_tol ): if norm_type == NormType.ENDLESS_NORM: - return self.calculate_error(origin_output, perturbed_output, abs_tol) + return self.calculate_max_ratio(origin_output, perturbed_output, abs_tol) return ThresholdConfig.COMP_CONSISTENT - def calculate_error(self, origin_output, perturbed_output, abs_tol): + def calculate_max_ratio(self, origin_output, perturbed_output, abs_tol): origin_output_chunks, perturbed_output_chunks = ( self.tensor_split_for_error_calculate(origin_output, perturbed_output) ) - if len(origin_output) != len(perturbed_output): - logger.warning( - f"[msprobe] Free Benchmark: For {self.params.api_name} " - f"The compare tensor chunks is different: {len(origin_output)} != {len(perturbed_output)}" + if len(origin_output_chunks) != len(perturbed_output_chunks): + err_msg = ( + f"For {self.params.api_name}, the number of compare tensor chunks is different: " + f"{len(origin_output_chunks)} != {len(perturbed_output_chunks)}. please check!" 
) - return 1 - norm1 = -np.inf - norm2 = -np.inf - norm3 = np.inf + raise FreeBenchmarkException( + FreeBenchmarkException.OutputIndexError, err_msg + ) + + max_ratio = ThresholdConfig.COMP_CONSISTENT for i, chunk_origin in enumerate(origin_output_chunks): if chunk_origin.nelement() == 0: break chunk_perturbed = perturbed_output_chunks[i] - ratio_tensor1 = TorchC.where( - TorchC.abs(chunk_perturbed) > abs_tol, - TorchC.div( - TorchC.clamp(chunk_origin, min=abs_tol), - TorchC.clamp(chunk_perturbed, min=abs_tol), - ), - 1, + # If the minimum of the elementwise product is below -(abs_tol**2), some non-negligible pair has opposite signs + if TorchC.lt( + TorchC.min(TorchC.mul(chunk_origin, chunk_perturbed)), -(abs_tol**2) + ): + return ThresholdConfig.SYMBOL_FLIPPING + # Before taking the A/B and B/A ratios, clamp magnitudes to at least abs_tol + clamp_origin = TorchC.clamp(TorchC.abs(chunk_origin), min=abs_tol) + clamp_perturbed = TorchC.clamp(TorchC.abs(chunk_perturbed), min=abs_tol) + # Treat nan results as "no difference" + ratio_tensor = TorchC.nan_to_num( + TorchC.div(clamp_origin, clamp_perturbed), + nan=ThresholdConfig.COMP_CONSISTENT, ) - ratio_tensor2 = TorchC.where( - TorchC.abs(chunk_origin) > abs_tol, - TorchC.div( - TorchC.clamp(chunk_perturbed, min=abs_tol), - TorchC.clamp(chunk_origin, min=abs_tol), - ), - 1, - ) - norm_values = TorchC.stack( - [TorchC.max(ratio_tensor1), TorchC.max(ratio_tensor2)] - ) - max_ratio1, max_ratio2 = norm_values.tolist() - norm1 = max(norm1, self.convert_overflow_ratio_to_consistent(max_ratio1)) - norm2 = max(norm2, self.convert_overflow_ratio_to_consistent(max_ratio2)) - norm3 = min(norm3, self.convert_overflow_ratio_to_consistent(max_ratio1)) - - if norm3 < 0: - ratio = ThresholdConfig.SYMBOL_FLIPPING - else: - ratio = max(norm1, norm2) - return ratio + # Max over A/B and B/A; the max of B/A is the reciprocal of the min of A/B + min_ratio, max_ratio = TorchC.stack([*TorchC.aminmax(ratio_tensor)]).tolist() + min_ratio_reciprocal = np.inf if min_ratio == 0 else 1 / min_ratio + max_ratio = max(max_ratio, min_ratio_reciprocal) + return max_ratio def ratio_calculate(self, origin_output,
perturbed_output, norm_type) -> float: try: @@ -217,10 +202,12 @@ class FuzzHandler(ABC): ) npu_consistent = is_consistent max_fuzz_ratio = ( - max_fuzz_ratio if ratio is None else max(max_fuzz_ratio, ratio) + max_fuzz_ratio + if not isinstance(ratio, (int, float)) + else max(max_fuzz_ratio, ratio) ) - data_params.is_consistent = is_consistent and data_params.is_consistent - if not is_consistent and data_params.grad_unequal_flag: + data_params.is_consistent = is_consistent + if not is_consistent: self.unequal_rows.append( make_unequal_row(data_params, self.params, ratio=ratio) ) @@ -232,12 +219,12 @@ class FuzzHandler(ABC): ) npu_consistent = npu_consistent and is_consistent max_fuzz_ratio = ( - max_fuzz_ratio if ratio is None else max(max_fuzz_ratio, ratio) - ) - data_params.is_consistent = ( - is_consistent and data_params.is_consistent + max_fuzz_ratio + if not isinstance(ratio, (int, float)) + else max(max_fuzz_ratio, ratio) ) - if not is_consistent and data_params.grad_unequal_flag: + data_params.is_consistent = is_consistent + if not is_consistent: self.unequal_rows.append( make_unequal_row( data_params, self.params, ratio=ratio, index=index_ diff --git a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py index 1bfed53e6d8223a3a3f90004b0b9e2915c1729a6..5bfc672df18b97eb47b2ffed3ab9bf6b66dd03e7 100644 --- a/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py @@ -75,10 +75,6 @@ class PreheatHandler(FuzzHandler): if self.params.preheat_config.get("preheat_step") <= self.params.step: return data_params.original_result - if not data_params.grad_unequal_flag: - data_params.grad_unequal_flag = True - data_params.is_consistent = False - return data_params.original_result 
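The rewritten `calculate_max_ratio` above replaces the old three-norm bookkeeping with a sign-flip check followed by a single clamped ratio tensor. A simplified scalar sketch of that logic over plain Python lists (the real code operates on tensor chunks through `TorchC`, and the `SYMBOL_FLIPPING` / `COMP_CONSISTENT` values here are assumed placeholders for the `ThresholdConfig` constants):

```python
import math

COMP_CONSISTENT = 1.0   # placeholder for ThresholdConfig.COMP_CONSISTENT
SYMBOL_FLIPPING = -1.0  # placeholder for ThresholdConfig.SYMBOL_FLIPPING

def calculate_max_ratio(origin, perturbed, abs_tol=1e-9):
    """Scalar sketch of the chunked tensor computation in FuzzHandler."""
    # A product below -(abs_tol**2) means two non-negligible values
    # of opposite sign: report a symbol flip immediately.
    if min(a * b for a, b in zip(origin, perturbed)) < -(abs_tol ** 2):
        return SYMBOL_FLIPPING
    ratios = []
    for a, b in zip(origin, perturbed):
        # Clamp magnitudes to abs_tol before dividing, as the tensor code does.
        ratio = max(abs(a), abs_tol) / max(abs(b), abs_tol)
        # Treat nan as "no difference".
        ratios.append(COMP_CONSISTENT if math.isnan(ratio) else ratio)
    min_ratio, max_ratio = min(ratios), max(ratios)
    # The max of B/A is the reciprocal of the min of A/B.
    reciprocal = math.inf if min_ratio == 0 else 1.0 / min_ratio
    return max(max_ratio, reciprocal)
```

So `calculate_max_ratio([1.0, 0.5], [1.0, 1.0])` is driven to 2.0 by the B/A side, while `calculate_max_ratio([1.0], [-1.0])` short-circuits to the sign-flip sentinel.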
preheat_counter.add_api_called_time(self.pure_name) if not self._is_take_a_sample(): diff --git a/debug/accuracy_tools/msprobe/pytorch/function_factory.py b/debug/accuracy_tools/msprobe/pytorch/function_factory.py index e3ac6947f69074b5041a44e23c5a720e62680941..247e2cd0ed5ea11047cc0d75954dbc1e92b889f4 100644 --- a/debug/accuracy_tools/msprobe/pytorch/function_factory.py +++ b/debug/accuracy_tools/msprobe/pytorch/function_factory.py @@ -27,6 +27,11 @@ from msprobe.pytorch.bench_functions.rotary_mul import npu_rotary_mul, npu_rotar from msprobe.pytorch.bench_functions.scaled_mask_softmax import npu_scaled_masked_softmax, \ npu_scaled_masked_softmax_backward from msprobe.pytorch.bench_functions.swiglu import npu_swiglu, npu_swiglu_backward +from msprobe.pytorch.bench_functions.apply_adam import npu_apply_adam +from msprobe.pytorch.bench_functions.group_norm_silu import npu_group_norm_silu +from msprobe.pytorch.bench_functions.mish import npu_mish +from msprobe.pytorch.bench_functions.moe_gating_top_k_softmax import npu_moe_gating_top_k_softmax +from msprobe.pytorch.bench_functions.sort_v2 import npu_sort_v2 from msprobe.pytorch.common.utils import logger @@ -79,7 +84,8 @@ class Register(dict): npu_custom_functions = Register() npu_custom_functions([ npu_apply_adam_w, npu_confusion_transpose, npu_fast_gelu, npu_layer_norm_eval, npu_linear, npu_fusion_attention, - npu_rms_norm, npu_rotary_mul, npu_scaled_masked_softmax, npu_swiglu, gpu_fusion_attention + npu_rms_norm, npu_rotary_mul, npu_scaled_masked_softmax, npu_swiglu, gpu_fusion_attention, npu_apply_adam, + npu_group_norm_silu, npu_mish, npu_moe_gating_top_k_softmax, npu_sort_v2 ]) # register for npu custom backward bench functions diff --git a/debug/accuracy_tools/msprobe/pytorch/functional/module_dump.py b/debug/accuracy_tools/msprobe/pytorch/functional/module_dump.py deleted file mode 100644 index f1eb8ab77527f0a2b27146893121916e81be4d2c..0000000000000000000000000000000000000000 --- 
a/debug/accuracy_tools/msprobe/pytorch/functional/module_dump.py +++ /dev/null @@ -1,84 +0,0 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import torch -import torch.nn as nn -from msprobe.core.common.const import Const -from msprobe.core.common.exceptions import MsprobeException -from msprobe.core.data_dump.scope import BaseScope -from msprobe.pytorch.common.log import logger -from msprobe.pytorch.debugger.precision_debugger import PrecisionDebugger -from msprobe.pytorch.hook_module.api_registry import api_register -from msprobe.pytorch.service import torch_version_above_or_equal_2 - -hook_handle_list = [] - - -def module_dump(module, dump_name): - if not isinstance(module, nn.Module): - logger.error("The parameter module in module_dump must be a Module subclass.") - raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR) - if not isinstance(dump_name, str): - logger.error("The parameter dump_name in module_dump must be a str type.") - raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR) - - api_register.api_originality() - register_hook(module, dump_name) - - -def module_dump_end(): - api_register.api_modularity() - remove_hook() - hook_handle_list.clear() - - -def register_hook(module, dump_name): - prefix = BaseScope.Module_Type_Module + Const.SEP + dump_name + Const.SEP + module.__class__.__name__ + Const.SEP - - pdg = PrecisionDebugger() - _, forward_hook, 
backward_hook, forward_hook_torch_version_below_2 = \ - pdg.service.build_hook(BaseScope.Module_Type_Module, prefix) - - if torch_version_above_or_equal_2: - forward_hook_handle = module.register_forward_hook(forward_hook, with_kwargs=True) - hook_handle_list.append(forward_hook_handle) - else: - pdg.service.check_register_full_backward_hook(module) - full_backward_hook_handle = module.register_full_backward_hook( - pdg.service.module_processor.node_hook(prefix + Const.BACKWARD, Const.STOP)) - forward_hook_handle = module.register_forward_hook(forward_hook_torch_version_below_2) - hook_handle_list.extend([full_backward_hook_handle, forward_hook_handle]) - pdg.service.check_register_full_backward_hook(module) - full_backward_hook_handle = module.register_full_backward_hook(backward_hook) - - forward_pre_hook_handle = module.register_forward_pre_hook( - pdg.service.module_processor.node_hook(prefix + Const.FORWARD, Const.START)) - forward_hook_handle = module.register_forward_hook( - pdg.service.module_processor.node_hook(prefix + Const.FORWARD, Const.STOP)) - hook_handle_list.extend([full_backward_hook_handle, forward_pre_hook_handle, forward_hook_handle]) - - if torch_version_above_or_equal_2: - backward_pre_hook_handle = module.register_full_backward_pre_hook( - pdg.service.module_processor.node_hook(prefix + Const.BACKWARD, Const.START)) - pdg.service.check_register_full_backward_hook(module) - full_backward_hook_handle = module.register_full_backward_hook( - pdg.service.module_processor.node_hook(prefix + Const.BACKWARD, Const.STOP)) - hook_handle_list.extend([backward_pre_hook_handle, full_backward_hook_handle]) - - -def remove_hook(): - for hook_handle in hook_handle_list: - if isinstance(hook_handle, torch.utils.hooks.RemovableHandle): - hook_handle.remove() diff --git a/debug/accuracy_tools/msprobe/pytorch/grad_probe/grad_monitor.py b/debug/accuracy_tools/msprobe/pytorch/grad_probe/grad_monitor.py index 
3111c63eac616dcfa1a4ce8cb78b3f7c4a64b35f..926476b8fb353531e54a485ccb47c4c59860c5d0 100644 --- a/debug/accuracy_tools/msprobe/pytorch/grad_probe/grad_monitor.py +++ b/debug/accuracy_tools/msprobe/pytorch/grad_probe/grad_monitor.py @@ -17,14 +17,15 @@ import os from collections import defaultdict import torch -if int(torch.__version__.split('.')[0]) >= 2: - from torch.optim.optimizer import register_optimizer_step_pre_hook -from msprobe.pytorch.grad_probe.grad_stat_csv import GradStatCsv -from msprobe.core.grad_probe.utils import check_numeral_list_ascend, data_in_list_target +from msprobe.core.common.file_utils import remove_path, save_npy, write_csv, create_directory from msprobe.core.grad_probe.constant import level_adp +from msprobe.core.grad_probe.utils import check_numeral_list_ascend, data_in_list_target from msprobe.pytorch.common.log import logger -from msprobe.core.common.file_utils import remove_path, save_npy, write_csv, create_directory from msprobe.pytorch.common.utils import get_rank_id, print_rank_0 +from msprobe.pytorch.grad_probe.grad_stat_csv import GradStatCsv + +if int(torch.__version__.split('.')[0]) >= 2: + from torch.optim.optimizer import register_optimizer_step_pre_hook class GradientMonitor: @@ -90,7 +91,7 @@ class GradientMonitor: output_lines.append(grad_info) if self._level_adp["have_grad_direction"]: GradientMonitor.save_grad_direction(param_name, grad, - f'{self._output_path}/rank{self._rank}/step{self._step}') + f'{self._output_path}/rank{self._rank}/step{self._step}') output_dirpath = os.path.join(self._output_path, f"rank{getattr(self, '_rank')}") if not os.path.isdir(output_dirpath): create_directory(output_dirpath) @@ -102,5 +103,6 @@ class GradientMonitor: output_lines.insert(0, header_result) write_csv(output_lines, output_path) logger.info(f"write grad data to {output_path}") + if int(torch.__version__.split('.')[0]) >= 2: register_optimizer_step_pre_hook(optimizer_pre_step_hook) diff --git 
a/debug/accuracy_tools/msprobe/pytorch/hook_module/__init__.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/__init__.py index 0cb7a03c02967205b2f64b3ff307bd63b2c7ffc8..fe9bfef31f3cdae07151b67923a98f806c4e8b5a 100644 --- a/debug/accuracy_tools/msprobe/pytorch/hook_module/__init__.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/__init__.py @@ -13,4 +13,4 @@ # See the License for the specific language governing permissions and # limitations under the License. -from .wrap_functional import remove_dropout +from msprobe.pytorch.common.utils import remove_dropout \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/hook_module/hook_module.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/hook_module.py index 876353b0b75e031ba8f01e64465197129d79e989..b59d4be82f2b55326c2a1d6a8a9e127a8470bff6 100644 --- a/debug/accuracy_tools/msprobe/pytorch/hook_module/hook_module.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/hook_module.py @@ -15,17 +15,17 @@ import functools import threading +from collections import defaultdict import torch import torch.nn as nn import torch.utils.hooks as full_hooks -from msprobe.core.common.const import Const torch_version_above_or_equal_2 = torch.__version__.split('+')[0] >= '2.0' class HOOKModule(nn.Module): - module_count = {} + module_count = defaultdict(int) inner_stop_hook = {} def __init__(self, build_hook) -> None: @@ -41,12 +41,7 @@ class HOOKModule(nn.Module): if hasattr(self, "prefix_op_name_"): self.prefix = self.prefix_op_name_ - if self.prefix not in HOOKModule.module_count: - HOOKModule.module_count[self.prefix] = 1 - self.prefix += '0' + Const.SEP - else: - HOOKModule.module_count[self.prefix] += 1 - self.prefix = self.prefix + str(HOOKModule.module_count[self.prefix] - 1) + Const.SEP + self.forward_data_collected = False forward_pre_hook, forward_hook, backward_hook, _ = build_hook(self.prefix) if torch_version_above_or_equal_2: self.register_forward_pre_hook(forward_pre_hook, 
with_kwargs=True) @@ -66,9 +61,17 @@ class HOOKModule(nn.Module): HOOKModule.inner_stop_hook[self.current_thread] = False return result - @classmethod - def reset_module_stats(cls): - cls.module_count = {} + @staticmethod + def reset_module_stats(): + HOOKModule.module_count = defaultdict(int) + + @staticmethod + def add_module_count(name): + HOOKModule.module_count[name] += 1 + + @staticmethod + def get_module_count(name): + return HOOKModule.module_count[name] def _call_func(self, *args, **kwargs): full_backward_hooks, non_full_backward_hooks = [], [] diff --git a/debug/accuracy_tools/msprobe/pytorch/hook_module/register_optimizer_hook.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/register_optimizer_hook.py new file mode 100644 index 0000000000000000000000000000000000000000..75be9fc4532ea5863ed3daad569c062c4ccb91ba --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/register_optimizer_hook.py @@ -0,0 +1,59 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
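The `HOOKModule` change above swaps the ad-hoc `module_count` dict for a `defaultdict(int)` and routes access through static helpers, so lookups of unseen prefixes return 0 instead of raising `KeyError`. A minimal sketch of that pattern (hypothetical class name, not the full `HOOKModule`):

```python
from collections import defaultdict

class HookCounter:
    """Per-prefix call counter shared across instances, mirroring the
    defaultdict refactor of HOOKModule.module_count."""
    module_count = defaultdict(int)

    @staticmethod
    def reset_module_stats():
        # Rebind rather than clear, so stale references don't linger.
        HookCounter.module_count = defaultdict(int)

    @staticmethod
    def add_module_count(name):
        HookCounter.module_count[name] += 1

    @staticmethod
    def get_module_count(name):
        # defaultdict yields 0 for names never seen; no KeyError path.
        return HookCounter.module_count[name]
```

Callers can now increment and read counters without the membership checks the old dict-based code needed.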
+ +import torch +from msprobe.core.common.const import Const +from msprobe.pytorch.common.log import logger + +torch_version_above_or_equal_2 = torch.__version__.split('+')[0] >= '2.0' +if torch_version_above_or_equal_2: + from torch.optim.optimizer import register_optimizer_step_pre_hook, register_optimizer_step_post_hook + + +def register_optimizer_hook(data_collector): + def optimizer_pre_step_hook(optimizer, args, kwargs): + data_collector.optimizer_status = Const.OPTIMIZER + + def optimizer_post_step_hook(optimizer, args, kwargs): + data_collector.optimizer_status = Const.END_PREFIX + Const.OPTIMIZER + + def patch_clip_grad(func): + def wrapper(*args, **kwargs): + data_collector.optimizer_status = Const.CLIP_GRAD + func(*args, **kwargs) + data_collector.optimizer_status = Const.END_PREFIX + Const.CLIP_GRAD + + return wrapper + + if torch_version_above_or_equal_2: + register_optimizer_step_pre_hook(optimizer_pre_step_hook) + register_optimizer_step_post_hook(optimizer_post_step_hook) + else: + logger.info_on_rank_0("Pytorch version is below 2.0, cannot register optimizer hook.") + + try: + torch.nn.utils.clip_grad_norm_ = patch_clip_grad(torch.nn.utils.clip_grad_norm_) + torch.nn.utils.clip_grad_norm = patch_clip_grad(torch.nn.utils.clip_grad_norm) + torch.nn.utils.clip_grad_value_ = patch_clip_grad(torch.nn.utils.clip_grad_value_) + except Exception as e: + logger.info_on_rank_0("Cannot patch clip grad function. detail:%s" % str(e)) + + try: + from megatron.core.optimizer import MegatronOptimizer + MegatronOptimizer.clip_grad_norm = patch_clip_grad(MegatronOptimizer.clip_grad_norm) + except ImportError: + pass + except Exception as e: + logger.info_on_rank_0("Cannot patch megatron clip grad function. 
detail:%s" % str(e)) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml b/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml index 11281396b9e43536d8f1042e4f56c591937f33ba..91eb016284ad1ab7e2d1701ba991d56138ecd054 100644 --- a/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml @@ -138,6 +138,10 @@ functional: - fold - multi_head_attention_forward - scaled_dot_product_attention + - lp_pool3d + - dropout1d + - mish + - huber_loss tensor: - __add__ @@ -145,9 +149,10 @@ tensor: - __bool__ - __div__ - __eq__ + - __floordiv__ - __ge__ - - __gt__ - __getitem__ + - __gt__ - __iadd__ - __iand__ - __idiv__ @@ -156,18 +161,29 @@ tensor: - __imod__ - __imul__ - __ior__ + - __ipow__ - __irshift__ - __isub__ - __ixor__ + - __le__ - __lshift__ + - __lt__ - __matmul__ - __mod__ - __mul__ + - __ne__ - __nonzero__ - __or__ + - __pow__ - __radd__ + - __rdiv__ + - __rmod__ - __rmul__ + - __ror__ + - __rpow__ - __rshift__ + - __rsub__ + - __rxor__ - __setitem__ - __sub__ - __truediv__ @@ -194,12 +210,14 @@ tensor: - addmv_ - addr - addr_ + - adjoint - align_as - align_to - all - allclose - amax - amin + - aminmax - angle - any - arccos @@ -211,12 +229,15 @@ tensor: - arcsinh - arcsinh_ - arctan + - arctan2 + - arctan2_ - arctan_ - arctanh - arctanh_ - argmax - argmin - argsort + - argwhere - asin - asin_ - asinh @@ -231,39 +252,51 @@ tensor: - baddbmm_ - bernoulli - bernoulli_ + - bfloat16 - bincount - bitwise_and - bitwise_and_ + - bitwise_left_shift + - bitwise_left_shift_ - bitwise_not - bitwise_not_ - bitwise_or - bitwise_or_ + - bitwise_right_shift + - bitwise_right_shift_ - bitwise_xor - bitwise_xor_ - bmm + - bool - broadcast_to + - byte - cauchy_ - ceil - ceil_ + - cfloat + - char - cholesky + - cholesky_inverse + - cholesky_solve - chunk - clamp - - cholesky_solve - - cholesky_inverse - clamp_ - 
clamp_max - clamp_max_ - - clip - clamp_min - clamp_min_ + - clip - clip_ + - conj_physical - copysign - copysign_ + - corrcoef - cos - cos_ - cosh - cosh_ - count_nonzero + - cov - cummax - cummin - cumprod @@ -277,20 +310,23 @@ tensor: - diag_embed - diagflat - diagonal + - diagonal_scatter - diff - - dist - digamma - digamma_ + - dist - div - div_ - divide - divide_ - dot + - double + - dsplit - eig - eq - eq_ - - erf - equal + - erf - erf_ - erfc - erfc_ @@ -299,18 +335,21 @@ tensor: - exp - exp2 - exp2_ - - expm1 - exp_ + - expand + - expand_as + - expm1 - expm1_ - exponential_ - fill_ - - fix - fill_diagonal_ + - fix - fix_ + - flatten - flip - fliplr - - flatten - flipud + - float - float_power - float_power_ - floor @@ -323,6 +362,7 @@ tensor: - fmod_ - frac - frac_ + - frexp - gather - gcd - gcd_ @@ -333,31 +373,37 @@ tensor: - ger - greater - greater_ - - gt - - gt_ - greater_equal - greater_equal_ + - gt + - gt_ + - half - hardshrink - heaviside - heaviside_ - histc + - histogram + - hsplit - hypot - hypot_ + - i0 + - i0_ - igamma - igamma_ - igammac - igammac_ - index_add - index_add_ - - inverse - index_copy - index_copy_ - index_fill - index_fill_ - index_put - index_put_ - - inner - index_select + - inner + - int + - inverse - isclose - isfinite - isinf @@ -375,7 +421,6 @@ tensor: - le_ - lerp - lerp_ - - where - less - less_ - less_equal @@ -392,43 +437,47 @@ tensor: - log_ - log_normal_ - log_softmax - - logcumsumexp - - logdet - logaddexp - logaddexp2 + - logcumsumexp + - logdet - logical_and - logical_and_ - logical_not - - logit - logical_not_ - logical_or - logical_or_ - logical_xor - logical_xor_ + - logit - logit_ - logsumexp + - long - lstsq - lt - lt_ + - lu - lu_solve - map2_ - map_ - masked_fill - - matmul - masked_fill_ - masked_scatter - masked_scatter_ - masked_select + - matmul - matrix_exp + - matrix_power - max - maximum - mean - - matrix_power - median - min - minimum - mm - mode + - moveaxis + - movedim - msort - mul - mul_ @@ 
-438,6 +487,11 @@ tensor: - mv - mvlgamma - mvlgamma_ + - nan_to_num + - nan_to_num_ + - nanmean + - nanmedian + - nanquantile - nansum - narrow - narrow_copy @@ -447,20 +501,29 @@ tensor: - neg_ - negative - negative_ + - nextafter + - nextafter_ - nonzero - norm - normal_ - not_equal - not_equal_ + - numpy + - orgqr + - ormqr + - outer - permute - pinverse - polygamma + - polygamma_ - pow - pow_ - - polygamma_ - prelu - prod - put_ + - q_zero_point + - qr + - quantile - rad2deg - rad2deg_ - ravel @@ -469,15 +532,16 @@ tensor: - relu - relu_ - remainder - - repeat_interleave - - reshape - remainder_ - renorm - renorm_ - repeat + - repeat_interleave + - reshape - reshape_as - resize_ - resize_as_ + - resolve_neg - roll - rot90 - round @@ -491,6 +555,7 @@ tensor: - select - sgn - sgn_ + - short - sigmoid - sigmoid_ - sign @@ -502,11 +567,13 @@ tensor: - sinc_ - sinh - sinh_ + - slice_scatter - slogdet - smm - softmax - solve - sort + - split - split_with_sizes - sqrt - sqrt_ @@ -516,21 +583,29 @@ tensor: - squeeze_ - sspaddmm - std + - stft + - stride - sub - sub_ + - subtract - sum - sum_to_size - svd + - swapaxes + - swapdims + - swapdims_ - symeig - t - t_ - take + - take_along_dim - tan - tan_ - tanh - tanh_ - tensor_split - tile + - to - topk - transpose - transpose_ @@ -538,8 +613,8 @@ tensor: - tril - tril_ - triu - - true_divide - triu_ + - true_divide - true_divide_ - trunc - trunc_ @@ -547,14 +622,18 @@ tensor: - unbind - unflatten - unfold + - unique + - unique_consecutive - unsafe_chunk - - unsqueeze - unsafe_split - unsafe_split_with_sizes + - unsqueeze + - unsqueeze_ - var - vdot - - unsqueeze_ - view_as + - vsplit + - where - xlogy - xlogy_ @@ -616,13 +695,14 @@ torch: - addmv - addmv_ - addr - - amax - affine_grid_generator - align_tensors - all - alpha_dropout - - amin - alpha_dropout_ + - amax + - amin + - aminmax - angle - any - arange @@ -635,12 +715,14 @@ torch: - arcsinh - arcsinh_ - arctan + - arctan2 - arctan_ - arctanh - arctanh_ - argmax - 
argmin - argsort + - argwhere - asin - asin_ - asinh @@ -661,13 +743,13 @@ torch: - batch_norm_elemt - batch_norm_gather_stats - batch_norm_gather_stats_with_counts - - bernoulli - batch_norm_stats - batch_norm_update_stats + - bernoulli - bilinear + - binary_cross_entropy_with_logits - bincount - binomial - - binary_cross_entropy_with_logits - bitwise_and - bitwise_not - bitwise_or @@ -713,9 +795,9 @@ torch: - conv_transpose1d - conv_transpose2d - conv_transpose3d - - cos - convolution - copysign + - cos - cos_ - cosh - cosh_ @@ -729,14 +811,16 @@ torch: - cummin - cumprod - cumsum + - cumulative_trapezoid - deg2rad - deg2rad_ - det - diag - diag_embed - - diff - diagflat - diagonal + - diagonal_scatter + - diff - digamma - dist - div @@ -745,12 +829,15 @@ torch: - dropout - dropout_ - dsmm + - dsplit - dstack - eig - einsum - embedding - embedding_bag - embedding_renorm_ + - empty + - empty_like - eq - equal - erf @@ -765,12 +852,12 @@ torch: - expm1 - expm1_ - eye - - feature_dropout - feature_alpha_dropout - feature_alpha_dropout_ + - feature_dropout - feature_dropout_ - - fix - fill_ + - fix - fix_ - flatten - flip @@ -785,8 +872,9 @@ torch: - fmod - frac - frac_ - - full + - frexp - frobenius_norm + - full - full_like - gather - gcd @@ -798,8 +886,8 @@ torch: - greater_equal - grid_sampler - grid_sampler_2d - - group_norm - grid_sampler_3d + - group_norm - gru - gru_cell - gt @@ -809,23 +897,29 @@ torch: - heaviside - hinge_embedding_loss - histc + - histogram + - histogramdd - hsmm + - hsplit - hspmm - hstack - hypot + - i0 + - i0_ - igamma - igammac - index_add - index_copy - - inner - index_fill - index_put - index_put_ - index_select + - inner - instance_norm - inverse - isclose - isfinite + - isin - isinf - isnan - isneginf @@ -853,8 +947,8 @@ torch: - log1p_ - log2 - log2_ - - log_softmax - log_ + - log_softmax - logaddexp - logaddexp2 - logcumsumexp @@ -873,18 +967,18 @@ torch: - lt - lu_solve - lu_unpack - - masked_fill - margin_ranking_loss + - 
masked_fill - masked_scatter - masked_select - - matrix_exp - matmul + - matrix_exp - matrix_power - matrix_rank - max - max_pool1d - - max_pool2d - max_pool1d_with_indices + - max_pool2d - max_pool3d - maximum - mean @@ -903,18 +997,20 @@ torch: - mvlgamma - nan_to_num - nan_to_num_ + - nanmean - nanmedian + - nanquantile - nansum - narrow + - narrow_copy - native_batch_norm - native_group_norm - - narrow_copy - native_layer_norm - native_norm - ne - neg - - negative - neg_ + - negative - negative_ - nextafter - nonzero @@ -946,30 +1042,31 @@ torch: - ravel - real - reciprocal - - relu - reciprocal_ + - relu - relu_ - remainder - renorm - repeat_interleave - reshape - resize_as_ + - resolve_neg - roll - rot90 - round - round_ + - row_stack - rrelu - rrelu_ - rsqrt - - row_stack - rsqrt_ - rsub - saddmm - scalar_tensor - scatter - - select - scatter_add - searchsorted + - select - selu - selu_ - sgn @@ -989,12 +1086,12 @@ torch: - solve - sort - sparse_coo_tensor - - square - split - split_with_sizes - spmm - sqrt - sqrt_ + - square - square_ - squeeze - sspaddmm @@ -1016,8 +1113,8 @@ torch: - tan_ - tanh - tanh_ - - tensordot - tensor_split + - tensordot - threshold - threshold_ - tile @@ -1033,19 +1130,21 @@ torch: - true_divide - trunc - trunc_ - - unique_consecutive - - xlogy - unbind + - unflatten + - unique_consecutive - unsafe_chunk - unsafe_split - - vander - - var - - vdot - unsafe_split_with_sizes - unsqueeze + - vander + - var - var_mean + - vdot + - vsplit - vstack - where + - xlogy - xlogy_ _VF: @@ -1130,6 +1229,36 @@ torch_npu: - npu_prompt_flash_attention - npu_lstm - npu_apply_adam + - npu_apply_adam_w + - npu_anti_quant + - npu_grouped_matmu + - npu_quant_scatter + - npu_group_norm_silu + - npu_format_cast + - npu_moe_finalize_routing + - npu_moe_gating_top_k_softmax + - npu_trans_quant_param + - npu_gelu + - npu_ffn + - npu_quant_matmul + - npu_format_cast_ + - npu_dynamic_quant + - npu_moe_compute_expert_tokens + - npu_weight_quant_batchmatmul + 
- npu_dynamic_quant_asymmetric + - npu_grouped_matmul + - npu_quant_scatter_ + - npu_group_quant + - npu_fused_infer_attention_score + - npu_quantize + - npu_fast_gelu + - npu_weight_quant_batchmatmul + - scatter_update + - scatter_update_ + - npu_moe_init_routing + - npu_scatter_nd_update_ + - npu_scatter_nd_update + - npu_prefetch aten: - signbit @@ -1876,4 +2005,5 @@ distributed: - all_to_all_single - all_to_all - all_gather_into_tensor - - reduce_scatter_tensor \ No newline at end of file + - reduce_scatter_tensor + - batch_isend_irecv \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py index 44fa291a0f3e9246d1ef7d562435f8aba02f4b96..1cd11842c31bacdad7c1bb90f98ac81c3415a40e 100644 --- a/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py @@ -21,7 +21,6 @@ from msprobe.pytorch.hook_module.hook_module import HOOKModule from msprobe.pytorch.common.utils import torch_device_guard from msprobe.core.common.const import Const from msprobe.core.common.file_utils import load_yaml -from msprobe.core.common.inplace_op_checker import InplaceOpChecker cur_path = os.path.dirname(os.path.realpath(__file__)) @@ -49,17 +48,20 @@ class DistributedOPTemplate(HOOKModule): self.op_name_ = op_name self.prefix_op_name_ = "Distributed" + Const.SEP + str(op_name) + Const.SEP super().__init__(build_hook) - if not self.stop_hook and InplaceOpChecker.check(self.op_name_, InplaceOpChecker.OP_DISTRIBUTED): - self.op_is_inplace = True + if not self.stop_hook: + self.op_is_distributed = True @torch_device_guard def forward(self, *args, **kwargs): + handle = distributed_func.get(self.op_name_)(*args, **kwargs) if kwargs.get("async_op") or self.op_name_ in ["isend", "irecv"]: - handle = distributed_func.get(self.op_name_)(*args, **kwargs) - handle.wait() - return handle - else: 
- return distributed_func.get(self.op_name_)(*args, **kwargs) + if handle and hasattr(handle, 'wait'): + handle.wait() + if self.op_name_ == "batch_isend_irecv": + if isinstance(handle, list): + for req in handle: + req.wait() + return handle def wrap_distributed_op(op_name, hook): diff --git a/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_functional.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_functional.py index f101f746f09268a6b4ca7dfa1ba1548bc4773a91..6164169476dab66ac2bdb8d0cbc41a04ddce6713 100644 --- a/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_functional.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_functional.py @@ -23,44 +23,6 @@ from msprobe.pytorch.common.log import logger from msprobe.core.common.file_utils import load_yaml -def remove_dropout(): - if torch.__version__ > "1.8": - logger.info_on_rank_0("For precision comparison, the probability p in the dropout method is set to 0.") - import torch.nn.functional as F - from torch import _VF - from torch.overrides import has_torch_function_unary, handle_torch_function - - def function_dropout(input: torch.Tensor, p: float = 0.5, training: bool = True, - inplace: bool = False) -> torch.Tensor: - if has_torch_function_unary(input): - return handle_torch_function( - function_dropout, (input,), input, p=0., training=training, inplace=inplace) - if p < 0.0 or p > 1.0: - raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p)) - return _VF.dropout_(input, 0., training) if inplace else _VF.dropout(input, 0., training) - - def function_dropout2d(input: torch.Tensor, p: float = 0.5, training: bool = True, - inplace: bool = False) -> torch.Tensor: - if has_torch_function_unary(input): - return handle_torch_function( - function_dropout2d, (input,), input, p=0., training=training, inplace=inplace) - if p < 0.0 or p > 1.0: - raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p)) - return 
_VF.feature_dropout_(input, 0., training) if inplace else _VF.feature_dropout(input, 0., training) - - def function_dropout3d(input: torch.Tensor, p: float = 0.5, training: bool = True, - inplace: bool = False) -> torch.Tensor: - if has_torch_function_unary(input): - return handle_torch_function( - function_dropout3d, (input,), input, p=0., training=training, inplace=inplace) - if p < 0.0 or p > 1.0: - raise ValueError("dropout probability has to be between 0 and 1, " "but got {}".format(p)) - return _VF.feature_dropout_(input, 0., training) if inplace else _VF.feature_dropout(input, 0., training) - - F.dropout = function_dropout - F.dropout2d = function_dropout2d - F.dropout3d = function_dropout3d - cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_analyse.py b/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_analyse.py new file mode 100644 index 0000000000000000000000000000000000000000..9a0b71e8a5791bc216c82737d1d4f4a482abceb9 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_analyse.py @@ -0,0 +1,201 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import sys +import argparse +import ast +import heapq + +from msprobe.pytorch.common.log import logger +from msprobe.core.common.const import MonitorConst +from msprobe.core.common.file_utils import check_path_before_create, save_json, create_directory, remove_path, \ + check_file_or_directory_path, load_json +from msprobe.pytorch.monitor.anomaly_detect import GradAnomalyData + + +class AnomalyDataWriter: + """ + 异常数据写入类,负责将异常数据写入到JSON文件中。 + """ + + def __init__(self, dump_path, rank) -> None: + self.dump_path = dump_path + self.dump_rank_dir = os.path.join(self.dump_path, f"rank{rank}") + self.json_path = os.path.join(self.dump_rank_dir, MonitorConst.ANOMALY_JSON) + + @staticmethod + def get_anomaly_dict(anomalies): + """将GradAnomalyData列表转换为json""" + anomalies_json = {} + for anomaly in anomalies: + anomalies_json.update({anomaly.get_key(): anomaly.to_dict()}) + return anomalies_json + + def init_detected_json(self): + """初始化落盘文件""" + check_path_before_create(self.dump_path) + if not os.path.exists(self.dump_path): + create_directory(self.dump_path) + + if not os.path.exists(self.dump_rank_dir): + create_directory(self.dump_rank_dir) + + if os.path.exists(self.json_path): + check_file_or_directory_path(self.json_path, isdir=False) + logger.warning(f"The existing file will be deleted: {self.json_path}.") + remove_path(self.json_path) + save_json(self.json_path, {}, indent=1) + + def write_detected_json(self, anomalies): + """ + 落盘异常数据 + Args: + anomalies: GradAnomalyData对象列表 + """ + anomalies_json = self.get_anomaly_dict(anomalies) + logger.info(f"{MonitorConst.ANOMALY_JSON} is at {self.dump_rank_dir}.") + + data_to_write = load_json(self.json_path) if os.path.exists(self.json_path) else {} + data_to_write.update(anomalies_json) + save_json(self.json_path, data_to_write, indent=1) + + +class AnomalyDataLoader: + def __init__(self, data_path) -> None: + self.data_path = data_path + + @staticmethod + def create_instances_from_dict(anomalies_dict: 
dict): + instances = [] + for values in anomalies_dict.values(): + try: + instances.append(GradAnomalyData(**values)) + except KeyError as e: + logger.warning(f"Missing key in anomaly data: {e}.") + except Exception as e: + logger.warning(f"Value error when creating a GradAnomalyData instance: {e}.") + return instances + + def get_anomalies_from_jsons(self): + """遍历文件夹,从rankK/anomaly.json中读取异常数据 + return: anomalies: GradAnomalyData对象列表 + """ + anomalies = [] + check_file_or_directory_path(self.data_path, isdir=True) + for rank_dir in os.listdir(self.data_path): + rank_path = os.path.join(self.data_path, rank_dir) + if not os.path.isdir(rank_path): + continue + json_path = os.path.join(rank_path, MonitorConst.ANOMALY_JSON) + if not os.path.exists(json_path): + continue + data_anomalies = load_json(json_path) + instances = self.create_instances_from_dict(data_anomalies) + anomalies.extend(instances) + return anomalies + + +class AnomalyAnalyse: + def __init__(self) -> None: + self.sorted_anomalies = [] + + def get_range_top_k(self, topk, step_list, anomalies): + """ + 获取前topk个step_list范围内的异常。 + """ + if not step_list: + filtered_anomalies = anomalies + else: + filtered_anomalies = [ + anomaly + for anomaly in anomalies + if anomaly.step in step_list + ] + if topk >= len(filtered_anomalies): + self.sorted_anomalies = sorted(filtered_anomalies) + else: + self.sorted_anomalies = list(heapq.nsmallest(topk, filtered_anomalies)) + return self.sorted_anomalies + + def rewrite_sorted_anomalies(self, output_path): + """ + 将排序后的异常数据重新落盘 + """ + check_file_or_directory_path(output_path, isdir=True) + + sorted_data = AnomalyDataWriter.get_anomaly_dict(self.sorted_anomalies) + logger.info(f"{MonitorConst.ANALYSE_JSON} is at {output_path}.") + json_path = os.path.join(output_path, MonitorConst.ANALYSE_JSON) + if os.path.exists(json_path): + logger.warning(f"The existing file will be deleted: {json_path}.") + remove_path(json_path) + save_json(json_path, sorted_data, indent=1) + + 
+def _get_parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("-d", "--data_path", dest="data_path_dir", default="./", type=str, + help=" The anomaly detect result dictionary: generate from monitor tool.", + required=True, + ) + parser.add_argument("-o", "--out_path", dest="out_path", default="", type=str, + help=" The analyse task result out path.", + required=False, + ) + parser.add_argument("-k", "--topk", dest="top_k_number", default=8, type=int, + help=" Top K number of earliest anomalies.", + required=False, + ) + parser.add_argument("-s", "--step", dest="step_list", default="[]", type=str, + help=" Analyse which steps.", + required=False, + ) + return parser.parse_args(sys.argv[1:]) + + +def _get_step_and_stop(args): + try: + step_list = ast.literal_eval(args.step_list) + if not isinstance(step_list, list): + raise ValueError(f"{args.step_list} is not a list.") + except (ValueError, SyntaxError, RecursionError) as e: + raise Exception(f"The step list must be a resolvable list type.") from e + if args.top_k_number <= 0: + raise Exception("The top k number must be greater than 0.") + return step_list, args.top_k_number + + +def _anomaly_analyse(): + args = _get_parse_args() + step_list, top_k_number = _get_step_and_stop(args) + loader = AnomalyDataLoader(args.data_path_dir) + anomalies = loader.get_anomalies_from_jsons() + analyser = AnomalyAnalyse() + top_anomalies = analyser.get_range_top_k( + top_k_number, step_list, anomalies + ) + analyser.rewrite_sorted_anomalies( + args.out_path if args.out_path else args.data_path_dir + ) + + logger.info(f"Top {top_k_number} anomalies are listed as follows:") + for index, anomaly in enumerate(top_anomalies): + logger.info(f"{index}: {anomaly.message}") + + +if __name__ == "__main__": + _anomaly_analyse() + logger.info("Analyse task completed.") diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_detect.py b/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_detect.py index 
1a9ecc3ad1f05a01f6457ddf8a5530d1c7d48f78..63f20b1928c80e1e29d7cb8224f267c246fcaa8b 100644 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_detect.py +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_detect.py @@ -1,6 +1,4 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,21 +12,27 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. - +import itertools import os -import sys import statistics as st +import sys from abc import ABC -from typing import List from collections import defaultdict +from dataclasses import dataclass, field +from typing import List +import pandas as pd +import torch from torch.utils.tensorboard import SummaryWriter -from msprobe.pytorch.monitor.utils import print_info_log, print_warn_log, print_error_log -from msprobe.pytorch.monitor.file_check import check_path_before_create, change_mode, FileCheckConst, create_directory +from msprobe.core.common.const import FileCheckConst, MonitorConst +from msprobe.core.common.file_utils import change_mode, create_directory, write_df_to_csv +from msprobe.pytorch.common.log import logger class ScanRule(ABC): + name = "ScanRule" + def apply(self, history, cur): raise NotImplementedError("abstract method apply is not implemented") @@ -53,6 +57,9 @@ class AnomalyScanner: @staticmethod def load_rules(specs: List[dict]): + """ + specs: [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}] + """ if specs is None: return [] alert_rules = [] @@ -63,21 +70,21 @@ class AnomalyScanner: # 检查必要的键是否存在 if rule_cls_name is None or rule_args is None: - print_warn_log(f"Spec is missing required keys: {spec}") + logger.warning(f"Spec is missing required 
keys: {spec}") continue - cur_module = sys.modules[__name__] + cur_module = sys.modules.get(__name__) try: rule_cls = getattr(cur_module, rule_cls_name) except AttributeError: - print_error_log(f"Rule class '{rule_cls_name}' not found in the current module.") + logger.error(f"Rule class '{rule_cls_name}' not found in the current module.") continue try: rule_instance = rule_cls(**rule_args) alert_rules.append(rule_instance) except Exception as e: - print_error_log(f"Error creating instance of rule '{rule_cls_name}': {e}") + logger.error(f"Error creating instance of rule '{rule_cls_name}': {e}") continue return alert_rules @@ -104,40 +111,300 @@ class BCOLORS: UNDERLINE = '\033[4m' -class SummaryWriterWithAD(SummaryWriter): - def __init__(self, path, ad_rules, job_id, anomaly_inform=False): - check_path_before_create(path) +class AnomalyDataFactory(ABC): + def __init__(self, rank, pp_stage, group_mates): + super().__init__() + self.rank = rank + self.pp_stage = pp_stage + self.group_mates = group_mates + self.micro_step = 0 + self.name2callid = {} + + def set_call_id(self, name2callid): + """根据当前GradContext信息更新call_id vpp_stage等信息 + """ + self.name2callid = name2callid + + def create(self, tag, message, step): + """如果检查出异常, 调用当前接口生成GradAnomalyData实例 + tag (tuple): metric tag ('0:1.post_attention_norm.weight/rank0/pre_grad', 'min') + message (str): anomaly detect message + step (int): training step + """ + if not isinstance(tag, tuple) or len(tag) != 2: + raise ValueError("tag must be a tuple with length 2") + tag_name = tag[0] + param_name = tag_name.split('/')[0] + call_id = self.name2callid.get(tag_name, -1) + if MonitorConst.NAME_SEP in param_name: + vpp_stage = int(param_name.split(MonitorConst.NAME_SEP)[0]) + else: + vpp_stage = 0 + + return GradAnomalyData( + self.rank, + step, + self.micro_step, + self.pp_stage, + vpp_stage, + call_id, + tag_name, + message, + self.group_mates + ) + + +class TrainStage: + DEFAULT_STAGE = -1 + FORWARD_STAGE = 0 + BACKWARD_STAGE 
= 1 + OPTIMIZER_STAGE = 2 + + +FORWARD_KEY = [MonitorConst.ACTV] +BACKWARD_KEY = [MonitorConst.ACTVGRAD, MonitorConst.PRE_GRAD, + MonitorConst.POST_GRAD, MonitorConst.ACC_GRAD] +OPTIMIZER_KEY = [MonitorConst.EXP_AVG, MonitorConst.EXP_AVG_SQ] +TRAIN_STAGE = { + **{key_: TrainStage.FORWARD_STAGE for key_ in FORWARD_KEY}, + **{key_: TrainStage.BACKWARD_STAGE for key_ in BACKWARD_KEY}, + **{key_: TrainStage.OPTIMIZER_STAGE for key_ in OPTIMIZER_KEY} +} + + +@dataclass(eq=True) +class GradAnomalyData: + rank: int = 0 + step: int = 0 + micro_step: int = 0 + pp_stage: int = 0 + vpp_stage: int = 0 + call_id: int = 0 + tag_name: str = field(default=None, compare=False) + message: str = field(default="", compare=False) + group_mates: list = field(default=None, compare=False) + + def __lt__(self, other): + """ + 自定义比较函数,用于确定 GradAnomalyData 实例之间的顺序。 + 比较规则为: + step 和 micro_step 值越小优先级越高; + vpp 和 pp 在前向阶段值越小优先级越高,在非前向阶段值越大优先级越高; + call_id 值越小优先级越高。 + """ + if not isinstance(other, GradAnomalyData): + return NotImplemented + + self_train_stage = self.get_train_stage(self.tag_name) + other_train_stage = self.get_train_stage(other.tag_name) + + def vpp_pp_comparator(anomaly): + """ + Determine the priority rule for vpp and pp based on train stage + Forward stage prefers smaller vpp and pp + Other stages prefer larger vpp and pp + """ + if self_train_stage == TrainStage.FORWARD_STAGE: + return anomaly.vpp_stage, anomaly.pp_stage + else: + return -anomaly.vpp_stage, -anomaly.pp_stage + + self_cmp = [self.step, self.micro_step, self_train_stage, *vpp_pp_comparator(self), self.call_id] + other_cmp = [other.step, other.micro_step, other_train_stage, *vpp_pp_comparator(other), other.call_id] + return self_cmp < other_cmp + + def __le__(self, other): + if not isinstance(other, GradAnomalyData): + return NotImplemented + return self == other or self < other + + @staticmethod + def get_train_stage(tag_name): + """ + :param tag_name: "0:fc2.input:0/rank0/actv", 
"0:fc1.weight/rank0/post_grad", "0:fc2.weight/rank0/exp_avg_sq" + :return: int, if forward return 0; if backward return 1; if optimizer return 2 + """ + key_ = tag_name.split("/")[-1] + return TRAIN_STAGE.get(key_, TrainStage.DEFAULT_STAGE) + + def to_dict(self): + return self.__dict__ + + def get_key(self): + # 0:1.self_attention.core_attention_flash_0/rank0/input_grad + return ''.join([str(self.tag_name), "_step_", str(self.step), "_call_", str(self.call_id)]) + + +@dataclass +class WriterInput: + path: str + ad_rules: list + job_id: str + anomaly_factory: AnomalyDataFactory = None + ndigits: int = 6 + step_count_per_record: int = 1 + + +class BaseWriterWithAD: + def __init__(self, writer_input: WriterInput): + self.tag2scalars = {} + self.ad_rules = writer_input.ad_rules + self.job_id = writer_input.job_id + self.anomaly_factory = writer_input.anomaly_factory + self.anomalies = [] + self.ndigits = writer_input.ndigits + + def get_anomalies(self): + """返回已检测到的异常列表 + """ + return self.anomalies + + def clear_anomalies(self): + self.anomalies.clear() + + def add_scalar(self, tag, scalar_value, global_step=None): + """If an anomaly is detected, the anomaly information is recorded and added to self.anomalies. + Args: + tag (tuple): tuple of tag_name and tag like ('0:1.post_attention_norm.weight/rank0/pre_grad', 'min'). + scalar_value (float): scalar_value. + global_step (int): global_step. + Returns: + None + """ + detected = False + if self.ad_rules: + avg = self._update_tag2scalars(tag, scalar_value) + detected, rule_name = self._ad(scalar_value, history=avg) + if detected: + exception_message = f"Rule {rule_name} reports anomaly signal in {tag} at step {global_step}." 
+ logger.info(f"{BCOLORS.WARNING}> {exception_message}{BCOLORS.ENDC}") + # append to self.anomalies for dump + if self.anomaly_factory: + self.anomalies.append(self.anomaly_factory.create(tag, exception_message, global_step)) + + def write_metrics(self, ops, metric_value, step, prefix=''): + if not metric_value: + return + tensors = [] + tags = list(itertools.product(metric_value.keys(), ops)) + for op2tensor in metric_value.values(): + tensors.extend(op2tensor.values()) + if not tensors: + return + + n_slices = len(tensors) // MonitorConst.SLICE_SIZE + with torch.no_grad(): + for i in range(n_slices + 1): + begin = i * MonitorConst.SLICE_SIZE + end = (i+1) * MonitorConst.SLICE_SIZE + if begin == len(tensors): + continue + metric_list = torch.stack(tensors[begin:end]).cpu() + for tag, metric in zip(tags[begin:end], metric_list): + self.add_scalar(tag, metric, step) + + def _ad(self, scalar_value, history): + return AnomalyScanner.scan(self.ad_rules, history, cur=scalar_value) + + def _update_tag2scalars(self, tag, scalar_value): + """Update the average and count of a scalar value associated with a tag. + + This method is used to maintain a running average of scalar values for each tag. + + + Args: + tag (str): The tag identifier. + scalar_value (float): The scalar value to be added. + + Returns: + float: The average value before update. 
+        """
+        if tag not in self.tag2scalars:
+            self.tag2scalars[tag] = {'avg': scalar_value, 'count': 0}
+        avg = self.tag2scalars[tag]['avg']
+        new_avg = (avg * self.tag2scalars[tag]['count'] + scalar_value) / (self.tag2scalars[tag]['count'] + 1)
+        self.tag2scalars[tag]['avg'] = new_avg
+        self.tag2scalars[tag]['count'] += 1
+        return avg
+
+
+class CSVWriterWithAD(BaseWriterWithAD):
+    def __init__(self, writer_input: WriterInput):
+        super().__init__(writer_input)
+
+        path = writer_input.path
+        self.log_dir = path
         create_directory(path)
+        change_mode(path, FileCheckConst.DATA_DIR_AUTHORITY)
+        self.context_dict = defaultdict(list)
+        self.header = []
+        self.step_count_per_record = writer_input.step_count_per_record
+
+    def get_step_interval(self, step):
+        count = step // self.step_count_per_record
+        return count * self.step_count_per_record, (count + 1) * self.step_count_per_record - 1
+
+    def write_csv(self, prefix, step):
+        """
+        Args:
+            prefix[str]: prefix of output csv file e.g. grad_unreduced
+            step[int]
+        """
+        if len(self.context_dict) == 0:
+            return
+
+        step_start, step_end = self.get_step_interval(step)
+        filepath = os.path.join(self.log_dir, f'{prefix}_{step_start}-{step_end}.csv')
+        if not os.path.exists(filepath):
+            data_frame = pd.DataFrame(columns=self.header)
+            write_df_to_csv(data_frame, filepath)
+
+        new_data = []
+        for name, metric_value in self.context_dict.items():
+            new_line = name.split(MonitorConst.NAME_SEP) + metric_value
+            new_line.insert(2, step)
+            new_data.append(new_line)
+
+        new_data = pd.DataFrame(new_data).round(self.ndigits).fillna("nan")
+        write_df_to_csv(new_data, filepath, mode='a+', header=False)
+        self.context_dict = defaultdict(list)
+
+    def add_scalar(self, tag, scalar_value, global_step):
+        """
+        ('0:1.post_attention_norm.weight/rank0/pre_grad', 'min')
+        """
+        super().add_scalar(tag, scalar_value, global_step)
+
+        name = tag[0].split('/')[0]
+        self.context_dict[name].append(scalar_value.item())
+
+    def write_metrics(self, ops, 
metric_value, step, prefix=''): + super().write_metrics(ops, metric_value, step, prefix='') + + if prefix in [MonitorConst.ACTV, MonitorConst.ACTVGRAD]: + self.header = MonitorConst.CSV_HEADER_XY + ops + else: + self.header = MonitorConst.CSV_HEADER + ops + self.write_csv(prefix, step) + + def close(self): + pass + + +class SummaryWriterWithAD(SummaryWriter, BaseWriterWithAD): + def __init__(self, writer_input: WriterInput): + + path = writer_input.path + if not os.path.exists(path): + create_directory(path) try: + super(SummaryWriter, self).__init__(writer_input) super().__init__(path) except Exception as e: - print_error_log(f'error when init summary writer at {path}: {e}') + logger.error(f'error when init summary writer at {path}: {e}') raise ValueError("Init summary writer error.") from e - for event in os.listdir(path): - change_mode(os.path.join(path, event), FileCheckConst.DATA_FILE_AUTHORITY) - self.tag2scalars = defaultdict(list) - self.ad_rules = ad_rules - self.job_id = job_id - self.anomaly_inform = anomaly_inform - - def add_scalar(self, tag, scalar_value, global_step=None, walltime=None, new_style=False, double_precision=False): - new_avg = avg = scalar_value - if tag in self.tag2scalars: - n = len(self.tag2scalars[tag]) - _, avg = self.tag2scalars[tag][-1] - new_avg = (avg * n + scalar_value) / (n + 1) - self.tag2scalars[tag].append((scalar_value, new_avg)) - detected, rule_name = self._ad(scalar_value, history=avg) - if detected: - print_info_log( - f"{BCOLORS.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step {global_step}." 
- f"{BCOLORS.ENDC}") - exception_message = (f"{BCOLORS.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step " - f"{global_step}.{BCOLORS.ENDC}") - if self.anomaly_inform: - self.anomaly_inform.run(exception_message, self.job_id) - args = [tag, scalar_value, global_step, walltime, new_style, double_precision] - return super().add_scalar(*args) - def _ad(self, scalar_value, history): - return AnomalyScanner.scan(self.ad_rules, history, cur=scalar_value) + def add_scalar(self, tag, scalar_value, global_step): + super(SummaryWriter, self).add_scalar(tag, scalar_value, global_step) + tag = f'{tag[0]}_{tag[1]}' + super().add_scalar(tag, scalar_value, global_step) diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_inform.py b/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_inform.py deleted file mode 100644 index 21e4e3a84fdf947787b80275c34d7e384f77f1b2..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/anomaly_inform.py +++ /dev/null @@ -1,102 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import smtplib -from email.mime.text import MIMEText -from datetime import datetime, timedelta - -from msprobe.core.common.const import MonitorConst -from msprobe.pytorch.monitor.database import Database, ExceptionMessage -from msprobe.pytorch.monitor.utils import beijing_tz - - -# define class InformRegistry to get inform_sub_class -class AnomalyInformFactory: - @staticmethod - def create_informer(**kwargs): - recipient = kwargs.get("recipient") - if recipient == MonitorConst.DATABASE: - return DatabaseInform(**kwargs) - elif recipient == MonitorConst.EMAIL: - return EmailInform(**kwargs) - raise ValueError("Invalid recipient specified") - - -# define class AnomalyInform to inform with database or email -class AnomalyInform: - def __init__(self, **kwargs): - self.inform_args = kwargs - self.exception_message_list = [] - self.time = 0 - self.current_time = 0 - - def inform_fun(self, exception_message_list, job_id): - pass - - def run(self, exception_message, job_id): - if self.time != 0 and self.current_time == 0: - self.current_time = datetime.now(tz=beijing_tz) - if self.time == 0 or ((self.current_time - self.time) > timedelta(minutes=self.interval_time)): - self.exception_message_list.append(exception_message) - self.inform_fun(self.exception_message_list, job_id) - self.exception_message_list = [] - self.time = datetime.now(tz=beijing_tz) - elif (self.current_time - self.time) <= timedelta(minutes=self.interval_time): - self.exception_message_list.append(exception_message) - self.current_time = datetime.now(tz=beijing_tz) - - -class DatabaseInform(AnomalyInform): - def __init__(self, **kwargs): - super().__init__(**kwargs) - self.interval_time = 2 - self.database = Database(self.inform_args.get("connection_str", None)) - self.database.create_table() - - def inform_fun(self, exception_message_list, job_id): - save_list = [] - for exception_message in exception_message_list: - item = { - 'job_id': job_id, - 'message': exception_message, - 'create_time': 
datetime.now(tz=beijing_tz) - } - save_list.append(ExceptionMessage(**item)) - self.database.insert_batch(save_list) - - -class EmailInform(AnomalyInform): - def __init__(self, **kwargs): - super().__init__(**kwargs) - self.interval_time = 10 - - def inform_fun(self, exception_message_list, job_id): - subject = "Exception Detected in Your Program" - text = f"{len(exception_message_list)} exception was detected in your program:\n\n" - for exception_message in exception_message_list: - text += f"{job_id}: {exception_message}\n" - message = MIMEText(text, "plain") - message["Subject"] = subject - message["From"] = self.inform_args.get('send_email_address', None) - message["To"] = self.inform_args.get('receive_email_address', None) - - with smtplib.SMTP(self.inform_args.get('smtp_server', None), self.inform_args.get('smtp_port', 587)) as server: - server.starttls() - server.login(self.inform_args.get('send_email_username', None), - self.inform_args.get('send_email_password', None)) - server.sendmail(self.inform_args.get('send_email_address', None), - self.inform_args.get('receive_email_address', None), message.as_string()) diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/csv2tb.py b/debug/accuracy_tools/msprobe/pytorch/monitor/csv2tb.py new file mode 100644 index 0000000000000000000000000000000000000000..6ffd1ffabe7b113ff4e61786d4d9f0709b8b605b --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/csv2tb.py @@ -0,0 +1,164 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import datetime
+import os
+import re
+from multiprocessing import Process

+
+import pytz
+from torch.utils.tensorboard import SummaryWriter
+from tqdm import tqdm
+
+from msprobe.core.common.const import MonitorConst
+from msprobe.core.common.file_utils import read_csv, create_directory, remove_path
+from msprobe.core.common.utils import is_int
+from msprobe.pytorch.common.log import logger
+from msprobe.pytorch.monitor.utils import get_target_output_dir
+
+all_data_type_list = ["actv", "actv_grad", "exp_avg", "exp_avg_sq", "grad_unreduced", "grad_reduced", "param"]
+CSV_FILE_SUFFIX = r"_\d+-\d+\.csv"
+
+
+def parse_step_line(line, ops):
+    vp_id = line["vpp_stage"]
+    module_name = line[MonitorConst.HEADER_NAME]
+    step = line["step"]
+    vpp_name = f"vp{vp_id}:{module_name}"
+    if 'micro_step' in line:
+        vpp_name = f'{vpp_name}{MonitorConst.NAME_SEP}micro{line["micro_step"]}'
+    ops_result = {}
+    for op in ops:
+        ops_result[op] = line[op]
+    return vpp_name, step, ops_result
+
+
+def parse_step_fn(filepath):
+    data = read_csv(filepath)
+    ops = [k for k in data.keys() if k in MonitorConst.OP_LIST]
+    parse_step_result = {}
+
+    for _, line in data.iterrows():
+        vpp_name, step, ops_result = parse_step_line(line, ops)
+        if vpp_name not in parse_step_result:
+            parse_step_result[vpp_name] = {}
+        if step in parse_step_result[vpp_name]:
+            raise Exception(f"duplicated step({step})")
+        parse_step_result[vpp_name][step] = ops_result
+    return parse_step_result
+
+
+def write_step(output_dirpath, parse_step_result, rank, data_type):
+    tb_output_path = os.path.join(output_dirpath, f"rank{rank}", data_type)
+    if os.path.exists(tb_output_path):
+        remove_path(tb_output_path)
+        logger.warning(f"existing path {tb_output_path} has been removed and will be recreated")
+    writer = SummaryWriter(tb_output_path)
+    for vpp_name, step_data_dict in parse_step_result.items():
+        step_data_list = [(step, ops) for 
step, ops in step_data_dict.items()]
+        step_data_list.sort(key=lambda x: x[0])
+        for step_data in step_data_list:
+            step = step_data[0]
+            ops = step_data[1]
+            for op, value in ops.items():
+                tag = f"{vpp_name}/{op}"
+                writer.add_scalar(tag, value, step)
+
+
+def update_dict(dict1, dict2):
+    for key, value in dict2.items():
+        if key in dict1:
+            if isinstance(dict1[key], dict) and isinstance(value, dict):
+                try:
+                    update_dict(dict1[key], value)
+                except Exception as e:
+                    raise Exception(f"failed to update nested dict at key '{key}': {e}") from e
+            else:
+                raise Exception(f"duplicate key: {key}")
+        else:
+            dict1[key] = value
+    return dict1
+
+
+def csv2tb_by_step_work(target_output_dirs, output_dirpath, data_type_list):
+    for directory in tqdm(target_output_dirs):
+        dirpath = directory["path"]
+        rank = directory["rank"]
+        for data_type in data_type_list:
+            all_step_result = {}
+            for filename in os.listdir(dirpath):
+                if not re.match(f"{data_type}{CSV_FILE_SUFFIX}", filename):
+                    continue
+                filepath = os.path.join(dirpath, filename)
+                try:
+                    parse_step_result = parse_step_fn(filepath)
+                except Exception as e:
+                    logger.error(f"csv2tensorboard parse {filepath} failed \n {e}")
+                    break
+
+                all_step_result = update_dict(all_step_result, parse_step_result)
+            if all_step_result:
+                write_step(output_dirpath, all_step_result, rank, data_type)
+
+
+def check_process_num(process_num):
+    if not is_int(process_num) or process_num <= 0:
+        raise ValueError(f"process_num({process_num}) is not a positive integer")
+
+
+def check_data_type_list(data_type_list):
+    if data_type_list is None:
+        logger.info(f"data_type_list is None, use default all_data_type_list: {all_data_type_list}")
+        return
+    if not isinstance(data_type_list, list):
+        raise ValueError(f"data_type_list({data_type_list}) is not a list")
+    for data_type in data_type_list:
+        if data_type not in all_data_type_list:
+            raise ValueError(f"data type({data_type}) is not supported, supported data type: {all_data_type_list}")
+ + +def csv2tensorboard_by_step( + monitor_path, + time_start=None, + time_end=None, + process_num=1, + data_type_list=None, + output_dirpath=None +): + check_process_num(process_num) + check_data_type_list(data_type_list) + target_output_dirs = get_target_output_dir(monitor_path, time_start, time_end) + target_output_dirs = [{"rank": rank, "path": path} for rank, path in target_output_dirs.items()] + if output_dirpath is None: + local_tz = pytz.timezone("Asia/Shanghai") # 根据需要调整为目标时区 + cur_time = datetime.datetime.now(local_tz).strftime("%b%d_%H-%M-%S") + output_dirpath = os.path.join(monitor_path, f"{cur_time}-csv2tensorboard_by_step") + create_directory(output_dirpath) + + task_num = len(target_output_dirs) + task_num_per_pro = task_num // process_num + target_data_type = data_type_list if data_type_list else all_data_type_list + + processes = [] + for pro_id in range(process_num): + task_start_id = pro_id * task_num_per_pro + task_end_id = (pro_id + 1) * task_num_per_pro if pro_id != process_num - 1 else task_num + task_dirs = target_output_dirs[task_start_id: task_end_id] + + p = Process(target=csv2tb_by_step_work, args=(task_dirs, output_dirpath, target_data_type)) + processes.append(p) + p.start() + for p in processes: + p.join() + logger.info(f"output has been saved to: {output_dirpath}") diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/database.py b/debug/accuracy_tools/msprobe/pytorch/monitor/database.py deleted file mode 100644 index 68103e28537ee0dff587b0036f4a14b7c06289f2..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/database.py +++ /dev/null @@ -1,72 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from sqlalchemy import create_engine -from sqlalchemy.orm import sessionmaker -from sqlalchemy.ext.declarative import declarative_base -from sqlalchemy import Column, Integer, String, DateTime -from pymysql.err import OperationalError - -Base = declarative_base() - - -class ExceptionMessage(Base): - __tablename__ = 'exception_message' - - id = Column(Integer, primary_key=True) - job_id = Column(String(40), index=True, nullable=False) - message = Column(String(255)) - create_time = Column(DateTime, nullable=False) - - def __repr__(self): - return ' FileCheckConst.DIRECTORY_LENGTH or \ - len(os.path.basename(path)) > file_max_name_length: - logger.error('The file path length exceeds limit.') - raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) - - -def check_path_exists(path): - if not os.path.exists(path): - logger.error('The file path %s does not exist.' % path) - raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) - - -def check_path_readability(path): - if not os.access(path, os.R_OK): - logger.error('The file path %s is not readable.' % path) - raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) - - -def check_path_writability(path): - if not os.access(path, os.W_OK): - logger.error('The file path %s is not writable.' % path) - raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) - - -def check_path_executable(path): - if not os.access(path, os.X_OK): - logger.error('The file path %s is not executable.' 
% path) - raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) - - -def check_other_user_writable(path): - st = os.stat(path) - if st.st_mode & 0o002: - logger.error('The file path %s may be insecure because other users have write permissions. ' % path) - raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) - - -def check_path_owner_consistent(path): - file_owner = os.stat(path).st_uid - if file_owner != os.getuid(): - logger.error('The file path %s may be insecure because is does not belong to you.' % path) - raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) - - -def check_path_pattern_valid(path): - if not re.match(FileCheckConst.FILE_VALID_PATTERN, path): - logger.error('The file path %s contains special characters.' % path) - raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) - - -def check_file_size(file_path, max_size): - file_size = os.path.getsize(file_path) - if file_size >= max_size: - logger.error(f'The size of file path {file_path} exceeds {max_size} bytes.') - raise FileCheckException(FileCheckException.FILE_TOO_LARGE_ERROR) - - -def check_common_file_size(file_path): - if os.path.isfile(file_path): - for suffix, max_size in FileCheckConst.FILE_SIZE_DICT.items(): - if file_path.endswith(suffix): - check_file_size(file_path, max_size) - break - - -def check_file_suffix(file_path, file_suffix): - if file_suffix: - if not file_path.endswith(file_suffix): - logger.error(f"The {file_path} should be a {file_suffix} file!") - raise FileCheckException(FileCheckException.INVALID_FILE_ERROR) - - -def check_path_type(file_path, file_type): - if file_type == FileCheckConst.FILE: - if not os.path.isfile(file_path): - logger.error(f"The {file_path} should be a file!") - raise FileCheckException(FileCheckException.INVALID_FILE_ERROR) - if file_type == FileCheckConst.DIR: - if not os.path.isdir(file_path): - logger.error(f"The {file_path} should be a dictionary!") - raise 
FileCheckException(FileCheckException.INVALID_FILE_ERROR) - - -def create_directory(dir_path): - """ - Function Description: - creating a directory with specified permissions - Parameter: - dir_path: directory path - Exception Description: - when invalid data throw exception - """ - dir_path = os.path.realpath(dir_path) - try: - os.makedirs(dir_path, mode=FileCheckConst.DATA_DIR_AUTHORITY, exist_ok=True) - except OSError as ex: - raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR, - 'Failed to create {}. Please check the path permission or disk space .{}'.format( - dir_path, str(ex))) from ex - - -def check_path_before_create(path): - if path_len_exceeds_limit(path): - raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR, 'The file path length exceeds limit.') - - if not re.match(FileCheckConst.FILE_PATTERN, os.path.realpath(path)): - raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR, - 'The file path {} contains special characters.'.format(path)) - - -def change_mode(path, mode): - if not os.path.exists(path) or os.path.islink(path): - return - try: - os.chmod(path, mode) - except PermissionError as ex: - raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR, - 'Failed to change {} authority. 
{}'.format(path, str(ex))) from ex - - -def path_len_exceeds_limit(file_path): - return len(os.path.realpath(file_path)) > FileCheckConst.DIRECTORY_LENGTH or \ - len(os.path.basename(file_path)) > FileCheckConst.FILE_NAME_LENGTH - - -def check_file_type(path): - """ - Function Description: - determine if it is a file or a directory - Parameter: - path: path - Exception Description: - when neither a file nor a directory throw exception - """ - if os.path.isdir(path): - return FileCheckConst.DIR - elif os.path.isfile(path): - return FileCheckConst.FILE - else: - logger.error('Neither a file nor a directory.') - raise FileCheckException(FileCheckException.INVALID_FILE_ERROR) diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/module_hook.py b/debug/accuracy_tools/msprobe/pytorch/monitor/module_hook.py index f6e7a7de67a7e50133c0f0a6a4d7ee849954686a..0c9efaab999e71d896eaf64d837978bd26f214ad 100644 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/module_hook.py +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/module_hook.py @@ -1,6 +1,4 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,71 +12,129 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
- -import logging +import json import os import uuid -import json from collections import defaultdict from datetime import datetime -import torch +from functools import partial +import pytz +import torch import torch.distributed as dist -from torch.optim.optimizer import register_optimizer_step_pre_hook, register_optimizer_step_post_hook +from torch.utils.hooks import BackwardHook + +from msprobe.core.common.const import MonitorConst, Const +from msprobe.core.common.file_utils import load_json, save_json +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.common.utils import is_recomputation +from msprobe.pytorch.monitor.anomaly_analyse import AnomalyDataWriter +from msprobe.pytorch.monitor.anomaly_detect import AnomalyScanner, SummaryWriterWithAD, AnomalyDataFactory, \ + CSVWriterWithAD, BaseWriterWithAD, WriterInput +from msprobe.pytorch.monitor.distributed.wrap_distributed import api_register, create_hooks, op_aggregate, \ + get_process_group +from msprobe.pytorch.monitor.features import get_sign_matches +from msprobe.pytorch.monitor.module_metric import get_metrics, get_summary_writer_tag_name, \ + TensorMetrics, squash_param_name from msprobe.pytorch.monitor.module_spec_verifier import validate_config_spec -from msprobe.pytorch.monitor.optimizer_collect import MixPrecsionOptimizerMon, OptimizerMonFactory -from msprobe.pytorch.monitor.features import eff_rank, get_sign_matches +from msprobe.pytorch.monitor.optimizer_collect import OptimizerMonFactory +from msprobe.pytorch.monitor.utils import get_param_struct, validate_config, validate_ops, \ + get_output_base_dir, get_target_output_dir from msprobe.pytorch.monitor.visualizer import HeatmapVisualizer -from msprobe.pytorch.monitor.anomaly_detect import AnomalyScanner, SummaryWriterWithAD -from msprobe.pytorch.monitor.anomaly_inform import AnomalyInformFactory -from msprobe.pytorch.monitor.module_metric import get_metrics, write_metrics_tensorboard, get_summary_writer_tag_name, \ - TensorMetrics 
-from msprobe.pytorch.monitor.distributed.wrap_distributed import api_register, create_hooks, op_aggregate -from msprobe.pytorch.monitor.utils import print_warn_log, print_info_log, print_rank_0, get_param_struct, \ - check_path_length, check_path_pattern_valid, change_mode, FileCheckConst, validate_config, beijing_tz -from msprobe.pytorch.monitor.file_check import FileOpen -from msprobe.core.common.const import MonitorConst torch_version_above_or_equal_2 = torch.__version__.split('+')[0] >= '2.0' if not torch_version_above_or_equal_2: raise ValueError("monitor require torch>=2.0") -output_base_dir = os.getenv(MonitorConst.MONITOR_OUTPUT_DIR, MonitorConst.DEFAULT_MONITOR_OUTPUT_DIR) + +FORMAT_MAPPING = { + MonitorConst.TENSORBOARD: SummaryWriterWithAD, + MonitorConst.CSV: CSVWriterWithAD, + MonitorConst.API: BaseWriterWithAD +} + + +def param_is_not_tensor_parallel_duplicate(param, tp_group): + return (hasattr(param, 'tensor_model_parallel') and param.tensor_model_parallel) or ( + torch.distributed.get_rank(group=tp_group) == 0 + ) + + +def param_is_data_parallel_duplicate(dp_group): + return torch.distributed.get_rank(group=dp_group) != 0 class ModuleHookContext: def __init__(self, module_name) -> None: - self.step = 0 self.micro_step = 0 - self.actv = [] + self.actv = defaultdict(dict) self.actvgrad = [] self.module_name = module_name + self.struct = {} self.format_by_arg = {} self.verified = False self.focused_in_col = 0 self.focused_out_col = 0 - self.ignore_in = False # no need to care when no key 'input' or 'input_grad' found def set_format_by_arg(self, key_name: str, target_config: dict): - if key_name in target_config[self.module_name]: - self.format_by_arg[key_name] = target_config[self.module_name][key_name] - elif key_name in ['input', 'input_grad']: - self.ignore_in = True + """ 按照监控对象配置format_by_arg + 1) module_name 在 target 中配置监控对象 + 2) module_name 未在 targets 中配置,且 all_xy 全量监控 + 3) module_name 未在 targets 中配置,且 all_xy 未全量监控 + + :param key_name: str, 
one of [input, output, input_grad, output_grad] + :param target_config: target obj in config json. + :return: + """ + cared = target_config.get(self.module_name, self.struct) + if key_name in cared: + target_module_config = cared[key_name] + if isinstance(target_module_config, dict): + # current cared is self.struct, monitor all data for module_name + self.format_by_arg[key_name] = target_module_config.get('config') + elif isinstance(target_module_config, str): + # current cared is target_config[self.module_name] + self.format_by_arg[key_name] = target_module_config + else: + logger.warning_on_rank_0(f"target module config error, result maybe empty." + f"module_name: {self.module_name}, key_name: {key_name}") + self.format_by_arg[key_name] = None else: - raise KeyError(f"Missing key: {key_name} of {self.module_name} in config.json") + self.format_by_arg[key_name] = self.struct.get(key_name).get('config') + + def reset(self): + self.actv.clear() + self.actvgrad.clear() + + +start_step = 0 class OptimizerContext: def __init__(self) -> None: - self.step = 0 - self.param_effective_rank = defaultdict(float) + self.step = start_step self.param_mg_direction = defaultdict(float) self.param_adam_update = defaultdict() self.param_adam_ratio = defaultdict() self.param_weight_grad = defaultdict() self.param_exp_avg = defaultdict() + self.exp_avg_metric = {} self.param_exp_avg_sq = defaultdict() - self.metric_list = [] + self.exp_avg_sq_metric = {} + self.metric_dict = {} + self.param_metric = {} + + def reset(self): + self.param_mg_direction.clear() + self.param_adam_update.clear() + self.param_adam_ratio.clear() + self.param_weight_grad.clear() + self.param_exp_avg.clear() + self.exp_avg_metric.clear() + self.param_exp_avg_sq.clear() + self.exp_avg_sq_metric.clear() + self.metric_dict.clear() + self.param_metric.clear() class CommunicationContext: @@ -88,10 +144,10 @@ class CommunicationContext: @staticmethod def _agg(data): aggregated_data = {} - for op, tag2tensorlist in 
data.items(): - aggregated_data[op] = {} - for tag, tensorlist in tag2tensorlist.items(): - aggregated_data[op][tag] = op_aggregate(op, tensorlist) + for tag, op2tensorlist in data.items(): + aggregated_data[tag] = {} + for op, tensorlist in op2tensorlist.items(): + aggregated_data[tag][op] = op_aggregate(op, tensorlist) return aggregated_data def reset(self): @@ -101,410 +157,938 @@ class CommunicationContext: self.data = self._agg(self.data) +class GradContext: + def __init__(self) -> None: + self.pre = {} + self.post = {} + self.acc_metric = {} + self.acc = {} + self.actv = {} + + def reset(self): + self.pre.clear() + self.post.clear() + self.acc_metric.clear() + self.acc.clear() + self.actv.clear() + + class TrainerMon: tensor_metrics = TensorMetrics() - def __init__(self, config_file_path, params_have_main_grad=True, opt_ty=None) -> None: - """ - config_file_path: str, monitor config path - params_have_main_grad: bool, whether param has attribution main_grad - opt_ty: str, Megatron_Float16OptimizerWithFloat16Params or Megatron_DistributedOptimizer - """ + def __init__(self, config_file_path, process_group=None, params_have_main_grad=True) -> None: + # TYPE1: 只在这里初始化的变量, 不会随着训练中途config配置改变而重置 + self.config_file_path = config_file_path + self.process_group = get_process_group(process_group) + self.params_have_main_grad = params_have_main_grad + self.update_heatmap_visualizer = defaultdict(HeatmapVisualizer) + self.ratio_heatmap_visualizer = defaultdict(HeatmapVisualizer) + self.origin_step_func = None + self.origin_start_grad_sync = None + self.config_timestamp = 0 # 后面有校验时间戳, 首次监控无需为了更新config文件时间戳而去改, 可通过dynamic_on开关直接打开 + self.config = load_json(config_file_path) + validate_config(self.config) + + self.squash_name = self.config.get('squash_name', True) # 不允许修改防止前后名字对不上 + local_tz = pytz.timezone("Asia/Shanghai") # 根据需要调整为目标时区 + cur_time = datetime.now(local_tz).strftime('%b%d_%H-%M-%S') + self.unique_id = str(uuid.uuid4())[:8] + self.output_base_dir = 
get_output_base_dir() + time_tags = self.config.get("append_output", []) + if dist.is_initialized(): + self.rank = dist.get_rank() + if time_tags: + output_append_dirs = get_target_output_dir(self.output_base_dir, time_tags[0], time_tags[1]) + if str(self.rank) in output_append_dirs: + self.tensorboard_dir = output_append_dirs[str(self.rank)] + logger.info(f"append rank({self.rank}) result to {self.tensorboard_dir}") + else: + self.tensorboard_dir = os.path.join(self.output_base_dir, + f"{cur_time}-rank{self.rank}-{self.unique_id}") + self.pp_stage = dist.get_group_rank(self.process_group, self.rank) + self.group_mates = dist.get_process_group_ranks(self.process_group) + else: + self.rank = 0 + self.tensorboard_dir = os.path.join(self.output_base_dir, f"{cur_time}-rank{self.rank}-{self.unique_id}") + self.pp_stage = 0 + self.group_mates = [0] + + # TYPE2: 只会在set_monitor()主调中赋值的变量 + self.model = None + self.vpp = False + self.dp_group = None + self.tp_group = None + self.enable_megatron = False + self.micro_batch_number = 1 + self.optimizer_class = None + self.optimizer_mon = None + + # TYPE3: 会随着训练中途config配置更新或监控状态改变而重置的变量 self.module_fwd_hook_context_by_module = defaultdict(ModuleHookContext) self.module_bwd_hook_context_by_module = defaultdict(ModuleHookContext) self.optimizer_context = defaultdict(OptimizerContext) self.cc_context = defaultdict(CommunicationContext) - self.params_have_main_grad = params_have_main_grad - with FileOpen(config_file_path, 'r') as f: - self.config = json.load(f) - validate_config(self.config) + self.grad_context = GradContext() + self.handles = defaultdict(list) + self.param2name = defaultdict(str) + self.name2index = defaultdict() + self.name2indices = defaultdict() + self.name2param = {} + self.duplicate_param = {} + self.name2tag = {} + self.param_name_call_id = {} + self.call_id = 0 + self.module_struct = defaultdict(dict) + self.grad_accs = [] + self.weight_hooked = False + self.optimizer_hooked = False + self.param_registered = 
False + self.struct_printed = False + + # 动静态区分 + self.dynamic_enable = os.getenv("DYNAMIC_MONITOR", 'False').lower() == 'true' + if self.dynamic_enable: + logger.warning(f"DYNAMIC_MONITOR is set, " + f"please make sure you have 'dynamic_on' and 'collect_times' in {self.config_file_path}") + self.monitoring = False + else: + self.set_config() + # 静态且collect_times>0时在第0步self.monitoring就可以True, 动态默认在下一步开启 + if self.collect_times > 0: + self.monitoring = True + + def __del__(self): + if hasattr(self, "summary_writer"): + self.summary_writer.close() + + @property + def ops(self): + return self._ops + + @ops.setter + def ops(self, value): + self._ops = validate_ops(value) + + @staticmethod + def has_register_backward_hook(module_name, module): + if hasattr(module, '_backward_hooks') and \ + len(module._backward_hooks) > 0 and \ + module._is_full_backward_hook is False: + logger.warning( + f"The {module_name} has registered deprecated register_backward_hook," + f"which may cause abnormal data dump. The backward input/output for this module will be skipped." 
+ ) + return True + return False + + @staticmethod + def generate_cc_metrics(cc_name, cc_tensor): + metrics = defaultdict(dict) + rank = dist.get_rank() if dist.is_initialized() else None + for op, tag2tensor in cc_tensor.data.items(): + for tag, tensor in tag2tensor.items(): + key = get_summary_writer_tag_name(cc_name, tag, rank) + metrics[op].update({key: tensor}) + cc_tensor.reset() + return metrics + + def set_config(self): + logger.info(f"current config: {self.config}") + self.start_step = self.config.get("start_step", 0) + self.collect_times = self.config.get("collect_times", 100000000) # 默认大值, 目的是一直采集 + self.step_interval = self.config.get("step_interval", 1) + self.has_collect_times = 0 # 重设采集计数器 + self.print_struct = self.config.get("print_struct", False) self.module_rank_list = self.config.get("module_ranks", []) + self.format = self.config.get('format', MonitorConst.CSV) self.eps = self.config.get('eps', 1e-8) self.ops = self.config.get('ops', []) + self.ndigits = self.config.get('ndigits', 6) + self.all_xy = self.config.get('all_xy', False) self.xy_distribution = self.config.get('xy_distribution', False) - if not self.xy_distribution: - print_rank_0("> module input/output input_grad/output_grad is not monitored. ") - - # backward hook cause megatron-lm pipeline parallel schedule assert exception. - # TBD: backward hook cause output tensor is view of some base tensor. root cause invesigation pending. self.forward_only = self.config.get('forward_only', False) - if self.forward_only: - print_rank_0("> only module forward is monitored. ") - + self.backward_only = self.config.get('backward_only', False) self.ur_distribution = self.config.get('ur_distribution', False) - if not self.ur_distribution: - print_rank_0("> update vector and ratio vector of adam is not monitored. ") self.mv_distribution = self.config.get("mv_distribution", False) - if not self.mv_distribution: - print_rank_0("> momentum and variance of adam is not monitored. 
") self.wg_distribution = self.config.get("wg_distribution", False) - if not self.wg_distribution: - print_rank_0("> weight grad of specified module is not monitored. ") + self.param_distribution = self.config.get("param_distribution", False) self.mg_direction = self.config.get('mg_direction', False) - if not self.mg_direction: - print_rank_0('> grad and momentum direction will not be compared.') self.cc_distribution = self.config.get("cc_distribution", {}) + if not self.cc_distribution.get('enable', False): - print_rank_0("> cc operator is not monitored.") self.cc_log_only = False else: self.cc_codeline = self.cc_distribution.get('cc_codeline', []) self.cc_log_only = self.cc_distribution.get('cc_log_only', False) self.cc_logged_stack = defaultdict(set) self.cc_pre_hook = self.cc_distribution.get('cc_pre_hook', False) - api_register.initialize_hook(*create_hooks(context=self.cc_context, monitor=self)) + self.handles['cc'] = api_register.initialize_hook(*create_hooks(context=self.cc_context, monitor=self)) api_register.redirect_api() + self.common_info() + + # 初始化AnomalyData工厂 alert_setting = self.config.get('alert', {"rules": []}) self.alert_rules = AnomalyScanner.load_rules(alert_setting["rules"]) + self.anomaly_data_factory = None + if alert_setting.get('dump', False): + self.anomaly_data_factory = AnomalyDataFactory(self.rank, self.pp_stage, self.group_mates) - anomaly_inform = AnomalyInformFactory.create_informer( - **alert_setting["inform"]) if "inform" in alert_setting else None + # 初始化writer, 创建输出目录 + if self.format not in FORMAT_MAPPING: + logger.error(f"Unsupported format: {self.format}, use default format: {MonitorConst.CSV}") + self.format = MonitorConst.CSV - self.optimizer_hooked = False - cur_time = datetime.now(beijing_tz).strftime('%b%d_%H-%M-%S') - unique_id = str(uuid.uuid4())[:8] - if dist.is_initialized(): - cur_path = os.path.join(output_base_dir, f"{cur_time}-rank{dist.get_rank()}-{unique_id}") - if (dist.get_rank() in self.module_rank_list) 
or len(self.module_rank_list) == 0: - check_path_length(cur_path) - check_path_pattern_valid(cur_path) - self.summary_writer = SummaryWriterWithAD( - cur_path, self.alert_rules, unique_id, anomaly_inform) - else: - cur_path = os.path.join(output_base_dir, f"{cur_time}-{unique_id}") - check_path_length(cur_path) - check_path_pattern_valid(cur_path) - self.summary_writer = SummaryWriterWithAD(cur_path, self.alert_rules, unique_id, anomaly_inform) + if self.ur_distribution and self.format != 'tensorboard': + logger.error("can only set ur_distribution when format is 'tensorboard', cancel ur_distribution") + self.ur_distribution = False - full_path = os.path.realpath(cur_path) - change_mode(full_path, FileCheckConst.DATA_DIR_AUTHORITY) + writer = FORMAT_MAPPING[self.format] + self.step_count_per_record = self.config.get('step_count_per_record', 1) - # A HeatmapVisualizer instance is associated with an image - self.update_heatmap_visualizer = defaultdict(HeatmapVisualizer) - self.ratio_heatmap_visualizer = defaultdict(HeatmapVisualizer) - self.micro_batch_number = 0 + if (self.rank in self.module_rank_list) or len(self.module_rank_list) == 0: + self.summary_writer = writer( + WriterInput( + self.tensorboard_dir, + self.alert_rules, + self.unique_id, + self.anomaly_data_factory, + self.ndigits, + self.step_count_per_record + ) + ) + # 初始化anomaly detected文件目录 + if self.anomaly_data_factory: + self.anomaly_data_writer = AnomalyDataWriter(os.path.join(self.output_base_dir, "anomaly_detected"), + self.rank) + self.anomaly_data_writer.init_detected_json() - self.param_name_list = [] - self.param2name = defaultdict(str) + def common_info(self): + if not self.xy_distribution: + logger.info_on_rank_0("> module input/output input_grad/output_grad is not monitored. ") + if self.forward_only: + logger.info_on_rank_0("> only module forward is monitored. ") + if not self.ur_distribution: + logger.info_on_rank_0("> update vector and ratio vector of adam is not monitored. 
") + if not self.mv_distribution: + logger.info_on_rank_0("> momentum and variance of adam is not monitored. ") + if not self.wg_distribution: + logger.info_on_rank_0("> weight grad of specified module is not monitored. ") + if not self.mg_direction: + logger.info_on_rank_0('> grad and momentum direction will not be compared.') + if not self.cc_distribution.get('enable', False): + logger.info_on_rank_0("> cc operator is not monitored.") - self.mix_precision_optimizer_mon = OptimizerMonFactory.create_optimizer_mon(opt_ty) - if opt_ty is None: - if self.ur_distribution: - raise Exception("ur_distribution cannot be enabled with unknown optimizer.") - if self.mv_distribution: - raise Exception("mv_distribution cannot be enabled with unknown optimizer.") - self.print_struct = self.config.get("print_struct", False) - self.struct_printed = False - self.module_struct = defaultdict(dict) + def hook_modules(self): + if self.module_rank_list and (self.rank not in self.module_rank_list): + return + + targets = self.config['targets'] + module_in_all_stage = [key for key in targets.keys() if MonitorConst.NAME_SEP not in key] + for key in module_in_all_stage: + struct = targets.pop(key) + targets.update({f'{vpp_stage}{MonitorConst.NAME_SEP}{key}': struct for vpp_stage in range(len(self.model))}) + + hooked_count = 0 + for vpp_stage, model_chunk in enumerate(self.model): + vpp_stage = f'{vpp_stage}{MonitorConst.NAME_SEP}' + targets = [x for x, _ in model_chunk.named_modules()] if self.print_struct else self.config[ + 'targets'].keys() + hooked_count += self._hook_module(targets, model_chunk, vpp_stage) + + logger.info_on_rank_0(f"> {hooked_count} modules are monitored.") + + def clone_if_tensor(args): + if isinstance(args, tuple): + return tuple([clone_if_tensor(arg) for arg in args]) + elif isinstance(args, torch.Tensor): + return args.clone() + else: + return args + + @torch.no_grad + def wrap_hook_setup(setup): + def wrapped_setup(*args, **kwargs): + args = setup(*args, 
**kwargs) + args = clone_if_tensor(args) + return args + + return wrapped_setup + + BackwardHook.setup_input_hook = wrap_hook_setup(BackwardHook.setup_input_hook) + BackwardHook.setup_output_hook = wrap_hook_setup(BackwardHook.setup_output_hook) return - def __del__(self): - if hasattr(self, "summary_writer"): - self.summary_writer.close() + def set_monitor( + self, + model, + optimizer, + grad_acc_steps=1, + tp_group=None, + dp_group=None, + start_iteration=0 + ): + """External interface""" + global start_step + start_step = start_iteration + logger.info(f'grad acc steps {grad_acc_steps}') + self.micro_batch_number = grad_acc_steps + self.dp_group = dp_group + self.tp_group = tp_group + self.optimizer_mon, self.optimizer_class = OptimizerMonFactory.create_optimizer_mon(optimizer) + self.hook_step_final(optimizer) + if not isinstance(model, list): + model = [model] + self.model = model + if len(model) > 1: + self.vpp = True + self._smallest_rank_print('vpp enabled') + if not self.dynamic_enable: + self.register_hooks(optimizer) - @staticmethod - def set_wrapped_optimizer(_wrapped_optimizer): - MixPrecsionOptimizerMon.set_wrapped_optimizer(_wrapped_optimizer) + def register_hooks(self, optimizer): + self._register_param_name() + self.hook_optimizer(optimizer) + self._patch_grad_sync() + self.hook_modules() + self.monitoring = True - @staticmethod - def adhoc_check(target_tensor: torch.tensor, module_name: str, tensor_name: str, rank_list, ops_list): + def adhoc_check(self, target_tensor: torch.tensor, module_name: str, tensor_name: str, rank_list, ops_list): rank = None if dist.is_initialized(): rank = dist.get_rank() if (rank not in rank_list) and len(rank_list) != 0: return - TrainerMon.tensor_metrics.stat_insert(target_tensor, ops_list, module_name, tensor_name, rank) - - @staticmethod - def build_tbtag_tensor_map(module_name, tag, tensor): - metrics = {} - rank = dist.get_rank() if dist.is_initialized() else None - key = get_summary_writer_tag_name(module_name, 
tag, rank) - if tensor is not None: - metrics[key] = tensor - return metrics + self.tensor_metrics.stat_insert(target_tensor, ops_list, module_name, tensor_name, rank) - @staticmethod - def generate_cc_metrics(cc_name, cc_tensor): - metrics = defaultdict(dict) - rank = dist.get_rank() if dist.is_initialized() else None - for op, tag2tensor in cc_tensor.data.items(): - for tag, tensor in tag2tensor.items(): - key = get_summary_writer_tag_name(cc_name, tag, rank) - metrics[op].update({key: tensor}) - cc_tensor.reset() - return metrics + def build_tbtag_tensor_map(self, module_name, tag, tensor): + key = get_summary_writer_tag_name(module_name, tag, self.rank) + self._register_param_call_id("_hook_module", key) + return {key: tensor} - def generate_param_metrics(self, tag, param_tensor): + def generate_param_map(self, tag, param_tensor): metrics = {} - rank = dist.get_rank() if dist.is_initialized() else None - for _, name in self.param2name.items(): - key = get_summary_writer_tag_name(name, tag, rank) + for name in self.param2name.values(): + key = get_summary_writer_tag_name(name, tag, self.rank) + self._register_param_call_id("optimizer_pre_step_hook", key) if name not in param_tensor or param_tensor[name] is None: continue metrics[key] = param_tensor[name] return metrics - def hook_modules(self, model: torch.nn.Module, grad_acc_steps): - # fwd=0, bkd=1 - # targets is module name list like ["xx.xxx1", "xxx.xxx2"] which can be obtained when first run. 
- if not isinstance(model, torch.nn.Module): - raise TypeError("model should be a nn.Module") - if not isinstance(grad_acc_steps, int) or isinstance(grad_acc_steps, bool): - raise TypeError("grad_acc_steps should be int") - print_rank_0("> module names:") - for name, _ in model.named_modules(): - print_rank_0(f"\t{name}") - - self.micro_batch_number = grad_acc_steps + def generate_param_metrics(self, opt_context): + if not self.param_distribution: + return + get_metrics(self.ops, self.name2param, self.eps, opt_context.param_metric) - if not self.module_rank_list or (dist.is_initialized() and dist.get_rank() in self.module_rank_list): - targets = [x for x, _ in model.named_modules()] if self.print_struct else self.config['targets'].keys() - hooked_count = self._hook_module(targets, model, fwd_or_bkd=0) - print_rank_0(f"> {hooked_count} out of {len(self.config['targets'])} are monitored.") - else: + def generate_mv_metrics(self, opt_context): + if not self.mv_distribution: return + opt_context.exp_avg_metric = {} + opt_context.exp_avg_sq_metric = {} + m_tag_tensor_map = self.generate_param_map(MonitorConst.EXP_AVG, opt_context.param_exp_avg) + v_tag_tensor_map = self.generate_param_map(MonitorConst.EXP_AVG_SQ, opt_context.param_exp_avg_sq) + get_metrics(self.ops, m_tag_tensor_map, self.eps, opt_context.exp_avg_metric) + get_metrics(self.ops, v_tag_tensor_map, self.eps, opt_context.exp_avg_sq_metric) - if not self.optimizer_hooked: - self.optimizer_hooked = True - print_rank_0("> parameter names:") - for name, param in model.named_parameters(): - print_rank_0(f"\t{name}") - for target_module, _ in self.config['targets'].items(): - if name.startswith(target_module): - # name : language_model.encoder.layers.0.mlp.weight - # target_module:language_model.encoder.layers.0 - self.param_name_list.append(name) - self.param2name[param] = name - self.hook_optimizer() - return + def generate_wgrad_metrics(self): + if not self.wg_distribution: + return {}, {} + + if 
self.weight_hooked: + get_metrics(self.ops, self.grad_context.acc, self.eps, self.grad_context.acc_metric) + + grad_dict = {} + for param, name in self.param2name.items(): + if self.duplicate_param.get(name, False): + continue + grad = param.main_grad if self.params_have_main_grad else param.grad + if grad is None: + logger.warning(f"grad is None: {name}, maybe something wrong happened.") + continue + tag = self.name2tag.get(name, {}).get(MonitorConst.POST_GRAD) + self._register_param_call_id("hook_optimizer", tag) + grad_dict[tag] = grad + + get_metrics(self.ops, grad_dict, self.eps, self.grad_context.post) + unreduced_grad = self.grad_context.acc_metric if self.weight_hooked else self.grad_context.pre + return self.grad_context.post, unreduced_grad + + def generate_xy_metrics(self): + actv = {} + for fwd_context in self.module_fwd_hook_context_by_module.values(): + actv.update(fwd_context.actv) + + actv_grad = self.grad_context.actv + + return actv, actv_grad + + def reload_xy(self, xy_distribution=False): + logger.warning("reload_xy() is deprecated and will be removed in a future version. 
" + "Use DYNAMIC_MONITOR instead.") + self.xy_distribution = xy_distribution + + for handle in self.handles['xy']: + handle.remove() + self.handles['xy'].clear() + self.hook_modules() + for _, fwd_context in self.module_fwd_hook_context_by_module.items(): + fwd_context.actv.clear() def write_adhoc_check(self, step): - TrainerMon.tensor_metrics.flush(self.summary_writer) + self.tensor_metrics.flush(self.summary_writer) def write_xy_tb(self, step): if not self.xy_distribution: return for _, fwd_context in self.module_fwd_hook_context_by_module.items(): - if not len(fwd_context.actv) == self.micro_batch_number: - print_warn_log( - f"fwd_context.actv not equal to micro_batch_number: {len(fwd_context.actv)}, " - f"{self.micro_batch_number}") - for metric_name in self.ops: - write_metrics_tensorboard(metric_name, self.summary_writer, fwd_context.actv, step) + if len(fwd_context.actv) == 0: + continue + self.summary_writer.write_metrics(self.ops, fwd_context.actv, step, MonitorConst.ACTV) fwd_context.actv.clear() + if self.grad_context.actv: + self.summary_writer.write_metrics(self.ops, self.grad_context.actv, step, MonitorConst.ACTVGRAD) - for _, bwd_context in self.module_bwd_hook_context_by_module.items(): - if not len(bwd_context.actvgrad) == self.micro_batch_number: - print_warn_log( - f"bwd_context.actvgrad not equal to micro_batch_number: {len(bwd_context.actvgrad)}, " - f"{self.micro_batch_number}") - for metric_name in self.ops: - write_metrics_tensorboard(metric_name, self.summary_writer, bwd_context.actvgrad, step) - bwd_context.actvgrad.clear() - - def hook_optimizer(self): + def write_param_tb(self, opt_context): + if not self.param_distribution: + return + self.summary_writer.write_metrics(self.ops, opt_context.param_metric, opt_context.step, MonitorConst.PARAM) + + def write_mv_tb(self, opt_context): + if not self.mv_distribution: + return + self.summary_writer.write_metrics(self.ops, opt_context.exp_avg_metric, + opt_context.step, MonitorConst.EXP_AVG) + 
self.summary_writer.write_metrics(self.ops, opt_context.exp_avg_sq_metric, + opt_context.step, MonitorConst.EXP_AVG_SQ) + + def write_grad_tb(self, step): + if not self.wg_distribution: + return + + if self.enable_megatron: + self.summary_writer.write_metrics(self.ops, self.grad_context.pre, step, 'grad_unreduced') + else: + self.summary_writer.write_metrics(self.ops, self.grad_context.acc_metric, step, 'grad_unreduced') + self.summary_writer.write_metrics(self.ops, self.grad_context.post, step, 'grad_reduced') + + def hook_optimizer(self, optimizer): # in DDP by default use params_have_main_grad def optimizer_pre_step_hook(optimizer, args, kwargs): context = self.optimizer_context[optimizer] + if (self.print_struct and not all(value == {} for value in self.module_struct.values()) and not self.struct_printed): - self._smallest_rank_print("> module struct:") - self._smallest_rank_print(json.dumps(self.module_struct)) - self.struct_printed = True + self._save_module_struct() if not self.cc_log_only: - raise Exception("exit after first step when print model struct") + raise Exception("exit after first monitor step when print model struct") if self.cc_log_only and context.step > 0: self._smallest_rank_print("> Used communication ops and corresponding stack") self._smallest_rank_print( - json.dumps({k: [i.split(';') for i in v] for k, v in self.cc_logged_stack.items()}, indent=4)) + json.dumps({k: [i.split(';') for i in v] for k, v in self.cc_logged_stack.items()})) raise Exception("exit after first step when print cc stack") - context.param_exp_avg, context.param_exp_avg_sq, context.param_adam_update, context.param_adam_ratio = \ - self.mix_precision_optimizer_mon.fetch_mv(self, optimizer, self.param2name) + # skip generate metrics + if context.step < self.start_step or (context.step - self.start_step) % self.step_interval != 0: + return + if MonitorConst.DEEPSPEED_ZERO_OPT_FILTER in self.optimizer_class: # use deepspeed with zero1/2/3 + if not self.name2indices: + 
self.name2indices = self.optimizer_mon.get_param_index(self.param2name, self.name2index, optimizer) + mv_result = self.optimizer_mon.fetch_mv(self, optimizer, self.param2name, self.name2indices) + self.param2name = mv_result.grad + else: + mv_result = self.optimizer_mon.fetch_mv(self, optimizer, self.param2name) + context.param_exp_avg = mv_result.exp_avg + context.param_exp_avg_sq = mv_result.exp_avg_sq + context.param_adam_update = mv_result.update + context.param_adam_ratio = mv_result.ratio - for param, name in self.param2name.items(): - if "params_effrank" in self.config and name in self.config["params_effrank"]: - context.param_effective_rank[name] = eff_rank(param.detach()) - grad = param.main_grad if self.params_have_main_grad else param.grad - if grad is None: - print_warn_log(f"grad is None: {name}, maybe something wrong happened.") - continue - if self.wg_distribution: - context.param_weight_grad[name] = grad - if self.mg_direction: + self.generate_wgrad_metrics() + self.generate_mv_metrics(context) + self.generate_param_metrics(context) + + tbtag_tensor_map = {} + if self.mg_direction: + for param, name in self.param2name.items(): + grad = param.main_grad if self.params_have_main_grad else param.grad + if grad is None: + logger.warning(f"grad is None: {name}, maybe something wrong happened.") + continue if context.step == 0: same_direction_ratio = torch.tensor(1.) 
else: same_direction_ratio = get_sign_matches(grad, context.param_exp_avg[name]) context.param_mg_direction[name] = same_direction_ratio + tbtag_tensor_map.update(self.generate_param_map('mg_direction', context.param_mg_direction)) - tbtag_tensor_map = {} - if self.wg_distribution: - tbtag_tensor_map.update(self.generate_param_metrics('weight_grad', context.param_weight_grad)) - if self.mv_distribution: - tbtag_tensor_map.update(self.generate_param_metrics('exp_avg', context.param_exp_avg)) - tbtag_tensor_map.update(self.generate_param_metrics('exp_avg_sq', context.param_exp_avg_sq)) - if self.mg_direction: - tbtag_tensor_map.update(self.generate_param_metrics('mg_direction', context.param_mg_direction)) metric_dict = {} - for metric_name in self.ops: - metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) - for k, c in self.cc_context.items(): - c.aggregate() - cc_metrics = self.generate_cc_metrics(k, c) - for op, m in cc_metrics.items(): - metric_dict[op].update(m) + get_metrics(self.ops, tbtag_tensor_map, self.eps, metric_dict) + for cc in self.cc_context.values(): + cc.aggregate() + metric_dict.update(cc.data) + cc.reset() + if not metric_dict: return - context.metric_list.append(metric_dict) + context.metric_dict = metric_dict + return + + def patch_step(func, optimizer): + def wrapper(*args, **kwargs): + optimizer_pre_step_hook(optimizer, args, kwargs) + out = func(*args, **kwargs) + return out + return wrapper + + if self.optimizer_hooked: + return + + optimizer.__class__.step = patch_step(optimizer.__class__.step, optimizer) + + self.optimizer_hooked = True + return + + def dynamic_monitor(self, optimizer): + """ + If dynamic monitor enabled and config.json updated, + remove hooks and register new hooks according to new configuration. 
+        """
+        context = self.optimizer_context[optimizer]
+        if not self.dynamic_enable:
+            return
+        try:
+            # skip re-reading the config if the file timestamp is unchanged, to save time
+            config_timestamp = os.path.getmtime(self.config_file_path)
+            if config_timestamp == self.config_timestamp:
+                return
+            # record the config file's latest modification timestamp
+            self.config_timestamp = config_timestamp
+            config = load_json(self.config_file_path)
+        except Exception as e:
+            logger.error(f"failed to load config.json because {e}; config not updated, please check!")
             return
 
-        def optimizer_post_step_hook(optimizer, args, kwargs):
+        if config.get("dynamic_on", False):
+            try:
+                validate_config(config)
+                self.config = config
+                self.set_config()
+                logger.warning(f"config is updated at step {context.step - 1}, "
+                               f"will start new hooks at step {context.step}.")
+            except Exception as e:
+                logger.error(f"failed to apply new config because {e}; config not updated, please check!")
+                return
+
+            self._remove_all_hooks(optimizer)
+            self.register_hooks(optimizer)
+
+    def hook_step_final(self, optimizer):
+        def step_final_hook(optimizer, args, kwargs):
             context = self.optimizer_context[optimizer]
             rank = dist.get_rank() if dist.is_initialized() else None
+            # static monitoring can save at step 0; dynamic cannot, since it starts on the step after a reset, so self.monitoring is still False at step 0
+            if self.monitoring:
+                module_rank_valid = not self.module_rank_list or (
+                        dist.is_initialized() and dist.get_rank() in self.module_rank_list)
+                step_condition = (context.step >= self.start_step and (
+                        context.step - self.start_step) % self.step_interval == 0)
+                if module_rank_valid and step_condition:
+                    self.has_collect_times += 1
+
+                    if self.anomaly_data_factory:
+                        self.anomaly_data_factory.set_call_id(self.param_name_call_id)
+                    self.write_xy_tb(context.step)
+                    self.write_grad_tb(context.step)
+                    self.write_mv_tb(context)
+                    self.write_param_tb(context)
+                    self.write_adhoc_check(context.step)
+
+                    if self.ur_distribution:
+                        for param_name, _ in context.param_adam_update.items():
+                            self.update_heatmap_visualizer[param_name].visualize(
+                                get_summary_writer_tag_name(param_name, 'adam_update', rank), context.step,
+                                self.summary_writer)
+                        for param_name, _ in context.param_adam_ratio.items():
+                            self.ratio_heatmap_visualizer[param_name].visualize(
+                                get_summary_writer_tag_name(param_name, 'adam_ratio', rank), context.step,
+                                self.summary_writer)
+
+                    if context.metric_dict:
+                        self.summary_writer.write_metrics(self.ops, context.metric_dict, context.step, 'other')
+                    context.metric_dict.clear()
+
+                    if self.anomaly_data_factory:
+                        self.anomaly_data_writer.write_detected_json(self.summary_writer.get_anomalies())
+                    self.summary_writer.clear_anomalies()
+                    self.call_id = 0
+                    self.param_name_call_id.clear()
+
+                    if self.has_collect_times >= self.collect_times:
+                        self._remove_all_hooks_final(optimizer)
 
-            self.write_xy_tb(context.step)
-            self.write_adhoc_check(context.step)
-
-            if self.ur_distribution:
-                for param_name, _ in context.param_adam_update.items():
-                    self.update_heatmap_visualizer[param_name].visualize(
-                        get_summary_writer_tag_name(param_name, 'adam_update', rank), context.step,
-                        self.summary_writer)
-                for param_name, _ in context.param_adam_ratio.items():
-                    self.ratio_heatmap_visualizer[param_name].visualize(
-                        get_summary_writer_tag_name(param_name, 'adam_ratio', rank), context.step,
-                        self.summary_writer)
-
-            for metric_name in self.ops:
-                if not context.metric_list:
-                    break
-                write_metrics_tensorboard(metric_name, self.summary_writer, context.metric_list, context.step)
-            context.metric_list.clear()
             context.step += 1
+            self.dynamic_monitor(optimizer)
 
-            return
+        def patch_step(func, optimizer):
+            def wrapper(*args, **kwargs):
+                out = func(*args, **kwargs)
+                step_final_hook(optimizer, args, kwargs)
+                return out
+            return wrapper
+
+        optimizer.__class__.step = patch_step(optimizer.__class__.step, optimizer)
+        self.origin_step_func = optimizer.__class__.step
 
-        if not self.module_rank_list or (dist.is_initialized() and dist.get_rank() in self.module_rank_list):
-            register_optimizer_step_pre_hook(optimizer_pre_step_hook)
-            register_optimizer_step_post_hook(optimizer_post_step_hook)
         return
 
+    def _remove_all_hooks(self, optimizer):
+        # remove hook handles
+        for handle in self.handles['xy']:
+            handle.remove()
+        self.handles['xy'].clear()
+        # clear the corresponding context caches
+        for _, fwd_context in self.module_fwd_hook_context_by_module.items():
+            fwd_context.reset()
+        for _, bwd_context in self.module_bwd_hook_context_by_module.items():
+            bwd_context.reset()
+        self.grad_context.reset()  # holds both weight grads and activation grads
+
+        if self.origin_start_grad_sync:  # megatron
+            try:
+                from megatron.core.distributed.param_and_grad_buffer import Bucket
+                Bucket.start_grad_sync = self.origin_start_grad_sync
+                logger.info("remove Bucket start_grad_sync")
+            except ImportError:
+                pass
+            try:
+                from megatron.core.distributed.param_and_grad_buffer import _ParamAndGradBucketGroup
+                _ParamAndGradBucketGroup.start_grad_sync = self.origin_start_grad_sync
+                logger.info("remove _ParamAndGradBucketGroup start_grad_sync")
+            except ImportError:
+                pass
+        else:  # not megatron
+            for handle in self.handles['wgrads']:
+                handle.remove()
+            self.handles['wgrads'].clear()
+            self.weight_hooked = False
+
+        if self.optimizer_hooked:
+            optimizer.__class__.step = self.origin_step_func
+
+        for _, context in self.optimizer_context.items():
+            context.reset()
+        self.optimizer_hooked = False
+
+        for handle in self.handles['cc']:
+            handle.remove()
+        self.handles['cc'].clear()
+        for _, context in self.cc_context.items():
+            context.reset()
+
+        # clear per-parameter caches
+        self.param2name.clear()
+        self.name2index.clear()
+        self.name2indices.clear()
+        self.name2param.clear()
+        self.duplicate_param.clear()
+        self.name2tag.clear()
+        self.module_struct.clear()
+        self.grad_accs.clear()
+
+        # disable the monitoring state
+        self.monitoring = False
+
+    def _remove_all_hooks_final(self, optimizer):
+        if self.dynamic_enable:
+            # after finishing, automatically reset dynamic_on to False until the user re-enables it
+            try:
+                config = load_json(self.config_file_path)
+                config['dynamic_on'] = False
+                save_json(self.config_file_path, config, indent=2)
+                config_timestamp = os.path.getmtime(self.config_file_path)
+                self.config_timestamp = config_timestamp
+                logger.info(
+                    "Finished monitoring and set dynamic_on=False in config; to restart, set it to True and update the config")
+            except Exception as e:
+                logger.warning(f"Finished monitoring, but failed to set dynamic_on=False in config because {e}, please check!")
+        logger.info("Finished monitoring")
+        self._remove_all_hooks(optimizer)
+
     def _smallest_rank_print(self, msg):
         if dist.is_initialized():
             if self.module_rank_list:
                 if dist.get_rank() == min(self.module_rank_list):
-                    print_info_log(msg)
+                    logger.info(msg)
             else:
                 if dist.get_rank() == 0:
-                    print_info_log(msg)
+                    logger.info(msg)
         else:
-            print_info_log(msg)
+            logger.info(msg)
+
+    def _save_module_struct(self):
+        save_module_struct = (not dist.is_initialized()
+                              or (self.module_rank_list and dist.get_rank() == min(self.module_rank_list))
+                              or (not self.module_rank_list and dist.get_rank() == 0))
+
+        if save_module_struct:
+            module_struct_file = os.path.realpath(os.path.join(get_output_base_dir(), 'module_struct.json'))
+            save_json(module_struct_file, self.module_struct, indent=2)
+            logger.info(f"> save module struct to {module_struct_file}")
+        self.struct_printed = True
+
+    def _is_target_param(self, param_name, param, prefix):
+        name = prefix + param_name
+        squash_name = prefix + squash_param_name(param_name, self.squash_name)
+        for target in self.config['targets'].keys():
+            if param_name.startswith(target) or squash_name.startswith(target) or name.startswith(target):
+                setattr(param, "zero_out_wgrad", True)
+                return True
+
+        return False
+
+    def _register_chunk(self, model_chunk, prefix):
+        index = 0
+        for (param_name, param) in model_chunk.named_parameters():
+            if not param.requires_grad:
+                continue
+            if self._is_target_param(param_name, param, prefix):
+                name = prefix + squash_param_name(param_name, self.squash_name)
+                if name in self.param2name.values():
+                    name = prefix + param_name
+                self.param2name[param] = name
+                self.name2param[name] = param
+                self.name2index[name] = index
+
+                if self.tp_group and not param_is_not_tensor_parallel_duplicate(param, self.tp_group):
+                    self.duplicate_param[name] = True
+                if self.dp_group and param_is_data_parallel_duplicate(self.dp_group):
+                    self.duplicate_param[name] = True
+                self.name2tag[name] = {
+                    MonitorConst.PRE_GRAD: get_summary_writer_tag_name(name, MonitorConst.PRE_GRAD, self.rank),
+                    MonitorConst.POST_GRAD: get_summary_writer_tag_name(name, MonitorConst.POST_GRAD, self.rank)
+                }
+            index += 1
+
+    def _register_param_name(self):
+        for vpp_stage, model_chunk in enumerate(self.model):
+            prefix = f'{vpp_stage}{MonitorConst.NAME_SEP}'
+            self._register_chunk(model_chunk, prefix)
+
+    def _is_target_module(self, module_name, targets, vpp_stage):
+        if self.all_xy or self.print_struct:
+            return vpp_stage + squash_param_name(module_name, self.squash_name)
+        for pattern in [
+            vpp_stage + squash_param_name(module_name, self.squash_name),
+            vpp_stage + module_name,
+        ]:
+            if pattern in targets:
+                return pattern
+        return ""
 
-    def _hook_module(self, target_names, module: torch.nn.Module, fwd_or_bkd):
+    def _hook_module(self, target_names, module: torch.nn.Module, vpp_stage=''):
         if '_modules' not in module.__dict__:
             # nothing to hook
             return 0
 
-        def fwd_hook_fun(module, module_input, module_output):
+        def fwd_hook_fun(module, module_input, module_output, name):
+            if not module.training or is_recomputation():
+                # 1. only monitor the training stage.
+                # 2. when recompute is enabled, skip the recomputed forward stage.
+ return + if module not in self.module_fwd_hook_context_by_module: + self.module_fwd_hook_context_by_module[module] = ModuleHookContext(name) context: ModuleHookContext = self.module_fwd_hook_context_by_module[module] + if not context.struct: + context.struct = { + Const.INPUT: get_param_struct(module_input), + Const.OUTPUT: get_param_struct(module_output) + } if self.print_struct: - if context.module_name not in self.module_struct: - self.module_struct[context.module_name] = {} - self.module_struct[context.module_name].update({ - "input": f"{get_param_struct(module_input)}", - "output": f"{get_param_struct(module_output)}" - }) - return - if not self.xy_distribution: + self.module_struct[context.module_name].update(context.struct) return if not context.format_by_arg: - context.set_format_by_arg('input', self.config['targets']) - context.set_format_by_arg('output', self.config['targets']) + context.set_format_by_arg(Const.INPUT, self.config['targets']) + context.set_format_by_arg(Const.OUTPUT, self.config['targets']) + if not context.format_by_arg: + return if not context.verified: - if not context.ignore_in: - context.focused_in_col = validate_config_spec(context.format_by_arg['input'], module_input, - context.module_name, 'input') - context.focused_out_col = validate_config_spec(context.format_by_arg['output'], module_output, - context.module_name, 'output') + context.focused_in_col = validate_config_spec(context.format_by_arg[Const.INPUT], + module_input, context.module_name, + Const.INPUT) + context.focused_out_col = validate_config_spec(context.format_by_arg[Const.OUTPUT], + module_output, context.module_name, + Const.OUTPUT) context.verified = True # expect output be tensor type tbtag_tensor_map = {} - if not context.ignore_in: - cared_input = module_input if context.focused_in_col is None else module_input[context.focused_in_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input', cared_input)) + cared_input = module_input if 
context.focused_in_col is None else module_input[context.focused_in_col] + tbtag_tensor_map.update( + self.build_tbtag_tensor_map( + f'{context.module_name}.{Const.INPUT}{MonitorConst.NAME_SEP}{context.micro_step}', + MonitorConst.ACTV, cared_input)) cared_output = module_output if context.focused_out_col is None else module_output[context.focused_out_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output', cared_output)) - metric_dict = {} - for metric_name in self.ops: - metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) - if context.micro_step == 0 and context.actv: - print_warn_log(f"actv context of {context.module_name} is not empty when first micro_step, " - f"maybe something wrong happened. Now clear it.") - context.actv.clear() - context.actv.append(metric_dict) + tbtag_tensor_map.update( + self.build_tbtag_tensor_map( + f'{context.module_name}.{Const.OUTPUT}{MonitorConst.NAME_SEP}{context.micro_step}', + MonitorConst.ACTV, cared_output)) + get_metrics(self.ops, tbtag_tensor_map, self.eps, context.actv) context.micro_step += 1 if context.micro_step == self.micro_batch_number: context.micro_step = 0 - context.step += 1 return def bwd_hook_fun(module, input_grad, output_grad): context: ModuleHookContext = self.module_bwd_hook_context_by_module[module] + if not context.struct: + context.struct = { + MonitorConst.INPUT_GRAD: get_param_struct(input_grad), + MonitorConst.OUTPUT_GRAD: get_param_struct(output_grad) + } if self.print_struct: - self.module_struct[context.module_name].update({ - "input_grad": f"{get_param_struct(input_grad)}", - "output_grad": f"{get_param_struct(output_grad)}" - }) - return - if not self.xy_distribution: + self.module_struct[context.module_name].update(context.struct) return if not context.format_by_arg: - context.set_format_by_arg('input_grad', self.config['targets']) - context.set_format_by_arg('output_grad', self.config['targets']) + 
context.set_format_by_arg(MonitorConst.INPUT_GRAD, self.config['targets']) + context.set_format_by_arg(MonitorConst.OUTPUT_GRAD, self.config['targets']) + if not context.format_by_arg: + return if not context.verified: - if not context.ignore_in: - context.focused_in_col = validate_config_spec( - context.format_by_arg['input_grad'], input_grad, context.module_name, 'input_grad') + context.focused_in_col = validate_config_spec( + context.format_by_arg[MonitorConst.INPUT_GRAD], + input_grad, context.module_name, MonitorConst.INPUT_GRAD) context.focused_out_col = validate_config_spec( - context.format_by_arg['output_grad'], output_grad, context.module_name, 'output_grad') + context.format_by_arg[MonitorConst.OUTPUT_GRAD], + output_grad, context.module_name, MonitorConst.OUTPUT_GRAD) context.verified = True tbtag_tensor_map = {} - if not context.ignore_in: - cared_input_grad = input_grad if context.focused_in_col is None else input_grad[context.focused_in_col] - tbtag_tensor_map.update( - self.build_tbtag_tensor_map(context.module_name, 'input_grad', cared_input_grad)) + cared_input_grad = input_grad if context.focused_in_col is None else input_grad[context.focused_in_col] + tbtag_tensor_map.update( + self.build_tbtag_tensor_map( + f'{context.module_name}.{Const.INPUT}{MonitorConst.NAME_SEP}{context.micro_step}', + MonitorConst.ACTV, cared_input_grad)) cared_output_grad = output_grad if context.focused_out_col is None else output_grad[context.focused_out_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output_grad', - cared_output_grad)) - metric_dict = {} - for metric_name in self.ops: - metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) + tbtag_tensor_map.update( + self.build_tbtag_tensor_map( + f'{context.module_name}.{Const.OUTPUT}{MonitorConst.NAME_SEP}{context.micro_step}', + MonitorConst.ACTV, cared_output_grad)) + if context.micro_step == 0 and context.actvgrad: - print_warn_log(f"actvgrad context of 
{context.module_name} is not empty when first micro_step, " + logger.warning(f"actvgrad context of {context.module_name} is not empty when first micro_step, " f"maybe something wrong happened. Now clear it.") context.actvgrad.clear() - context.actvgrad.append(metric_dict) + + get_metrics(self.ops, tbtag_tensor_map, self.eps, self.grad_context.actv) context.micro_step += 1 if context.micro_step == self.micro_batch_number: context.micro_step = 0 - context.step += 1 return + if self.backward_only and self.forward_only: + logger.warning('not enable backward_only and forward_only simultaneously') + hooked_count = 0 - for name, submodule in module.named_modules(): - self.module_struct[name] = {} - if name in target_names: - submodule.register_forward_hook(fwd_hook_fun) - self.module_fwd_hook_context_by_module[submodule] = ModuleHookContext(name) - if not self.forward_only: - submodule.register_full_backward_hook(bwd_hook_fun) + if self.xy_distribution or self.print_struct: + for module_name, submodule in module.named_modules(): + name = self._is_target_module(module_name, target_names, vpp_stage) + if not name: + continue + if not self.backward_only: + handle = submodule.register_forward_hook(partial(fwd_hook_fun, name=name)) + self.handles['xy'].append(handle) + if not self.forward_only and not self.has_register_backward_hook(name, submodule): + handle = submodule.register_full_backward_hook(bwd_hook_fun) + self.handles['xy'].append(handle) self.module_bwd_hook_context_by_module[submodule] = ModuleHookContext(name) - print_rank_0(f"> {name} is monitored successfully") + logger.info_on_rank_0(f"> {name} is monitored successfully") hooked_count += 1 return hooked_count + + def _patch_grad_sync(self): + def patch_sync(sync_grad_func): + def wrapper(bucket): + grad_dict = {} + # Megatron between core_r0.6.0 and core_r0.8.0, this bucket is Bucket. + # When megatron is core_r0.9.0, this bucket is _ParamAndGradBucketGroup. 
+ # In megatron version core_r0.9.0, func start_grad_sync from Bucket moved to _ParamAndGradBucketGroup. + bucket_params_id_list = [id(params) for params in bucket.params] + for param, name in self.param2name.items(): + if id(param) not in bucket_params_id_list: + continue + grad = param.main_grad if self.params_have_main_grad else param.grad + if grad is None: + logger.warning(f"grad is None: {name}, maybe something wrong happened.") + continue + tag = self.name2tag.get(name, {}).get(MonitorConst.PRE_GRAD) + if tag is None: + continue + grad_dict[tag] = grad + self._register_param_call_id("sync_grad_func", tag) + get_metrics(self.ops, grad_dict, self.eps, self.grad_context.pre) + out = sync_grad_func(bucket) + return out + + return wrapper + + if not self.wg_distribution: + return + + try: + from megatron.core.distributed.param_and_grad_buffer import Bucket + self.origin_start_grad_sync = Bucket.start_grad_sync + Bucket.start_grad_sync = patch_sync(Bucket.start_grad_sync) + self.enable_megatron = True + logger.info("megatron version is >= core_r0.6.0 <= core_r0.8.0") + except ImportError: + self.enable_megatron = False + + try: + from megatron.core.distributed.param_and_grad_buffer import _ParamAndGradBucketGroup + self.origin_start_grad_sync = _ParamAndGradBucketGroup.start_grad_sync + _ParamAndGradBucketGroup.start_grad_sync = patch_sync(_ParamAndGradBucketGroup.start_grad_sync) + self.enable_megatron = True + logger.info("megatron version is > core_r0.8.0 <= core_r0.9.0") + except ImportError: + self.enable_megatron = False | self.enable_megatron + + if not self.enable_megatron: + self._hook_weights() + + def _hook_weights(self): + context = self.grad_context + + @torch.no_grad + def param_hook(*args, context_dict, param, key, name): + param.micro_step += 1 + self._register_param_call_id("param_hook", key) + if param.micro_step == self.micro_batch_number: + param.micro_step = 0 + if self.params_have_main_grad: + context_dict[key] = param.main_grad.clone() + 
else: + context_dict[key] = param.grad.clone() + + logger.info("hooking weights.") + for param, name in self.param2name.items(): + key = get_summary_writer_tag_name(name, 'acc_grad', self.rank) + setattr(param, 'micro_step', 0) + param_tmp = param.expand_as(param) + grad_acc = param_tmp.grad_fn.next_functions[0][0] + handle = grad_acc.register_hook( + partial(param_hook, context_dict=context.acc, param=param, key=key, name=name)) + self.grad_accs.append(grad_acc) + self.handles['wgrads'].append(handle) + + self.weight_hooked = True + + def _register_param_call_id(self, hook_name: str, key: str): + """ + :param hook_name: + :param key: str, '0:relu_0/output_grad' + :return: + """ + logger.debug(f"{hook_name} {key}: {self.call_id}") + self.param_name_call_id[key] = self.call_id + self.call_id += 1 diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/module_metric.py b/debug/accuracy_tools/msprobe/pytorch/monitor/module_metric.py index e840c306a697a2459e21679a2d880a149c5294fd..87963812006413a90fd33bc70d6172a7c73c3f10 100644 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/module_metric.py +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/module_metric.py @@ -1,6 +1,4 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,19 +12,33 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
+import re
 
-import math
-import statistics
+import torch
 
-from msprobe.pytorch.monitor.features import square_sum, get_max, get_min, get_zeros, get_nans, get_norm
-from msprobe.pytorch.monitor.utils import print_error_log
+from msprobe.pytorch.monitor.features import get_max, get_min, get_zeros, get_nans, get_norm, get_mean
+from msprobe.pytorch.monitor.utils import get_nan_tensor
 
 
 def get_summary_writer_tag_name(module_or_param_name: str, tag: str, rank):
     if rank is None:
         return f"{module_or_param_name}/{tag}"
     else:
-        return f"{module_or_param_name}/{rank}/{tag}"
+        return f"{module_or_param_name}/rank{rank}/{tag}"
+
+
+def squash_param_name(param_name, enable=True):
+    if not enable:
+        return param_name
+    name = ''
+    for pattern in [r'layers?\.(.*)', r'embeddings?\.(.*)', r'final.*', r'output.*', r'norm.*']:
+        match = re.findall(pattern, param_name)
+        if match:
+            name += match[0]
+            break
+    if name == '':
+        name = param_name
+    return name
 
 
 # 用于存储所有metric实现类的注册表
@@ -37,21 +49,20 @@
 def register_config_metric(key, cls=None):
     """装饰器 用于注册Metric的实现类"""
     if cls is None:
         # 无参数时,返回装饰器函数
-        return lambda cls: register_config_metric(key, cls)
-    config_metric_registry[key] = cls
+        return lambda cls_: register_config_metric(key, cls_)
+    config_metric_registry[key] = cls()
     return cls
 
 
 class TensorMetrics:
+    fun_map = {"norm": get_norm, "max": get_max, "min": get_min, "mean": get_mean}
+
     def __init__(self) -> None:
-        # tensor_tag --> []
-        self.metrics = {}
+        self.metrics = {}  # tensor_tag --> []
         self.cur_idx = {}
 
-    fun_map = {"norm": get_norm, "max": get_max, "min": get_min}
-
-    # get stats and insert into metrics dictionary
     def stat_insert(self, tensor, stat_ops, module_name, tensor_name, rank):
+        """get stats and insert into metrics dictionary"""
         prefix = get_summary_writer_tag_name(module_name, tensor_name, rank)
         for stat_op in stat_ops:
             y = TensorMetrics.fun_map[stat_op](tensor)
@@ -72,17 +83,13 @@ class TensorMetrics:
 
 class Metric(object):
     @staticmethod
     def get_metric_value(tensor, eps):
-        pass
+        raise NotImplementedError
 
-    @staticmethod
-    def metric_tensorboard(metric_name, summary_writer, metric_value, step):
-        pass
-
-    def get_metrics(self, tag2tensor: dict, eps):
-        metrics_dict = {}
-        for tag, tensor in tag2tensor.items():
-            metrics_dict[tag] = self.get_metric_value(tensor, eps)
-        return metrics_dict
+    def get_metric(self, tensor, eps):
+        try:
+            return self.get_metric_value(tensor, eps)
+        except RuntimeError:
+            return torch.tensor(torch.nan).to(tensor.device)
 
 
 @register_config_metric("min")
@@ -91,14 +98,12 @@ class MinMetric(Metric):
     def get_metric_value(tensor, eps):
         return get_min(tensor)
 
+
+@register_config_metric("mean")
+class MeanMetric(Metric):
     @staticmethod
-    def metric_tensorboard(metric_name, summary_writer, metric_value, step):
-        try:
-            for key in metric_value[0][metric_name].keys():
-                min_value = min([item[metric_name][key].item() for item in metric_value])
-                summary_writer.add_scalar(f'{key}_min', min_value, step)
-        except Exception as e:
-            print_error_log(f"min metric metric_tensorboard error: {e}")
+    def get_metric_value(tensor, eps):
+        return get_mean(tensor)
 
 
 @register_config_metric("max")
@@ -107,30 +112,12 @@ class MaxMetric(Metric):
     def get_metric_value(tensor, eps):
         return get_max(tensor)
 
-    @staticmethod
-    def metric_tensorboard(metric_name, summary_writer, metric_value, step):
-        try:
-            for key in metric_value[0][metric_name].keys():
-                max_value = max([item[metric_name][key].item() for item in metric_value])
-                summary_writer.add_scalar(f'{key}_max', max_value, step)
-        except Exception as e:
-            print_error_log(f"max metric metric_tensorboard error: {e}")
-
 
 @register_config_metric("norm")
 class NormMetric(Metric):
     @staticmethod
     def get_metric_value(tensor, eps):
-        return square_sum(tensor)
-
-    @staticmethod
-    def metric_tensorboard(metric_name, summary_writer, metric_value, step):
-        try:
-            for key in metric_value[0][metric_name].keys():
-                norm_value = math.sqrt(sum([item[metric_name][key].item() for item in metric_value]))
- summary_writer.add_scalar(f'{key}_norm', norm_value, step) - except Exception as e: - print_error_log(f"norm metric metric_tensorboard error: {e}") + return get_norm(tensor) @register_config_metric("zeros") @@ -139,15 +126,6 @@ class ZerosMetric(Metric): def get_metric_value(tensor, eps): return get_zeros(tensor, eps) - @staticmethod - def metric_tensorboard(metric_name, summary_writer, metric_value, step): - try: - for key in metric_value[0][metric_name].keys(): - zeros_value = statistics.mean([item[metric_name][key].item() for item in metric_value]) - summary_writer.add_scalar(f'{key}_zeros', zeros_value, step) - except Exception as e: - print_error_log(f"zeros metric metric_tensorboard error: {e}") - @register_config_metric("nans") class NaNsMetric(Metric): @@ -155,15 +133,6 @@ class NaNsMetric(Metric): def get_metric_value(tensor, eps): return get_nans(tensor) - @staticmethod - def metric_tensorboard(metric_name, summary_writer, metric_value, step): - try: - for key in metric_value[0][metric_name].keys(): - nans_value = sum([v[metric_name][key].item() for v in metric_value]) - summary_writer.add_scalar(f'{key}_nans', nans_value, step) - except Exception as e: - print_error_log(f"nans metric metric_tensorboard error: {e}") - @register_config_metric("id") class IdentMetric(Metric): @@ -173,34 +142,31 @@ class IdentMetric(Metric): return None return tensor - @staticmethod - def metric_tensorboard(metric_name, summary_writer, metric_value, step): - # metric_value is a dict, key is parameter name and value is a list of scalar tensor - try: - if len(metric_value) == 1: - for key, value in metric_value[0][metric_name].items(): - if not value: - continue - summary_writer.add_scalar(f'{key}_identical', value.item(), step) - except Exception as e: - print_error_log(f"id metric metric_tensorboard error: {e}") - - -def get_metrics(metric_name, tag2tensor, eps): - try: - fun_metric = config_metric_registry[metric_name] - return fun_metric().get_metrics(tag2tensor, eps) - 
except KeyError as e: - raise ValueError( - f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: " - f"{metric_name}") from e - - -def write_metrics_tensorboard(metric_name, summary_writer, metric_value, step): - try: - fun_metric = config_metric_registry[metric_name] - return fun_metric.metric_tensorboard(metric_name, summary_writer, metric_value, step) - except KeyError as e: - raise ValueError( - f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: " - f"{metric_name}") from e + +def get_metrics(ops, tag2tensor, eps, out_dict=None): + """ + :param ops: ["op1", "op2"] + :param tag2tensor: { + '0:fc.input:0/actv': torch.randn([3, 4]), + '0:fc.output:0/actv': torch.randn([3, 3]) + } + :param eps: float 1e-8 + :param out_dict:{ + '0:fc.input:0/actv': {"op1": op1(torch.randn([3, 4])), "op2": op2(torch.randn([3, 4]))} + '0:fc.output:0/actv': {"op1": op1(torch.randn([3, 3])), "op2": op2(torch.randn([3, 3]))} + } + :return: out_dict + """ + if out_dict is None: + out_dict = {} + for tag, tensor in tag2tensor.items(): + if tag not in out_dict: + out_dict[tag] = {} + if not torch.is_tensor(tensor): + # Non-tensor in/output filled with nan. 
+ out_dict[tag].update({metric_name: get_nan_tensor() for metric_name in ops}) + continue + for metric_name in ops: + fun_metric = config_metric_registry.get(metric_name) + out_dict[tag][metric_name] = fun_metric.get_metric(tensor, eps) + return out_dict diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/module_spec_verifier.py b/debug/accuracy_tools/msprobe/pytorch/monitor/module_spec_verifier.py index 226a3ba56b02505ef9bd8511f2cbd2100295ec3d..72c35c90bf9540a31cfa1176274a3d2c66bc8946 100644 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/module_spec_verifier.py +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/module_spec_verifier.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # @@ -19,6 +17,8 @@ import re import abc import torch +from msprobe.pytorch.common.log import logger + # 用于存储所有validator实现类的注册表 config_validator_registry = {} @@ -54,11 +54,15 @@ class TensorValidator(ConfigValidator): @register_config_validator class TupleValidator(ConfigValidator): def check_pattern_match(self, config_spec: str): - pattern = re.compile(r"tuple\[(\d+)\]:(\d+)") + pattern = re.compile(r"tuple\[(\d+)\]:?(\d+)?") return pattern.match(config_spec) def validate(self, actual_data, module_name: str, data_type: str, pattern_match): - length, index = map(int, pattern_match.groups()) + length, index = pattern_match.groups() + if index is None: + index = 0 + length, index = int(length), int(index) + if not (0 <= index < length): raise ValueError( f"Format of {module_name} {data_type} in config.json does not match the required format 'tuple[x]:y'." 
@@ -74,11 +78,18 @@ class TupleValidator(ConfigValidator):


 def validate_config_spec(config_spec: str, actual_data, module_name: str, data_type: str):
+    focused_col = None
+    if not config_spec or not isinstance(config_spec, str):
+        return focused_col
     for _, validator_cls in config_validator_registry.items():
         config_validator = validator_cls()
         pattern_match = config_validator.check_pattern_match(config_spec)
         if pattern_match:
-            focused_col = config_validator.validate(actual_data, module_name, data_type, pattern_match)
+            try:
+                focused_col = config_validator.validate(actual_data, module_name, data_type, pattern_match)
+            except ValueError as e:
+                logger.warning(f"config spec validation failed: {str(e)}")
             return focused_col
-    raise ValueError(f"config spec in {module_name} {data_type} not supported, "
-                     f"expected spec:'tuple\[(\d+)\]:(\d+)' or 'tensor', actual spec: {config_spec}.")
+    logger.warning(f"config spec in {module_name} {data_type} not supported, "
+                   rf"expected spec:'tuple\[(\d+)\]:(\d+)' or 'tensor', actual spec: {config_spec}.")
+    return focused_col
diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/optimizer_collect.py b/debug/accuracy_tools/msprobe/pytorch/monitor/optimizer_collect.py
index a11c61cf8c9a4233bc16f2634b7ac261ec9daf5c..602514836d2531ad4a6be3a23f56bc3b942ba199 100644
--- a/debug/accuracy_tools/msprobe/pytorch/monitor/optimizer_collect.py
+++ b/debug/accuracy_tools/msprobe/pytorch/monitor/optimizer_collect.py
@@ -1,6 +1,4 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd.
 # All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -16,87 +14,302 @@
 # limitations under the License.
from collections import defaultdict + import torch +import torch.distributed as dist +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.monitor.utils import MVResult, MVGradResult -class MixPrecsionOptimizerMon: - wrapped_optimizer = None +class OptimizerMon(object): def __init__(self) -> None: self.fp16_to_fp32_param = {} + self.is_stage3 = False - @staticmethod - def set_wrapped_optimizer(_wrapped_optimizer): - MixPrecsionOptimizerMon.wrapped_optimizer = _wrapped_optimizer - - # parameter tensors we want to monitor and their names are in params2name_dict - # base_optimizer is pytorch optimizer, wrapped_optimizer is a normal object with base_optimizer def fetch_mv(self, monitor, torch_opt, params2name): - mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer + pass - if not self.fp16_to_fp32_param and mix_prec_opt is not None: - for fp16_group, fp32_group in zip(mix_prec_opt.float16_groups, mix_prec_opt.fp32_from_float16_groups): - for fp16_param, fp32_param in zip(fp16_group, fp32_group): - self.fp16_to_fp32_param[fp16_param] = fp32_param - return self._fetch_mv_in_adam(params2name, torch_opt, monitor) - - def _fetch_mv_in_adam(self, params2name, torch_opt, monitor): + def _fetch_mv_in_adam(self, monitor, torch_opt, params2name): exp_avg_dict = defaultdict(float) exp_avg_sq_dict = defaultdict(float) update_dict = defaultdict() ratio_dict = defaultdict() - for param, name in params2name.items(): if param in self.fp16_to_fp32_param: param = self.fp16_to_fp32_param[param] - + if param in torch_opt.state: - exp_avg = torch_opt.state[param]["exp_avg"] - exp_avg_sq = torch_opt.state[param]["exp_avg_sq"] + state_param = torch_opt.state.get(param, None) + exp_avg = state_param.get("exp_avg", None) + exp_avg_sq = state_param.get("exp_avg_sq", None) + if exp_avg is None or exp_avg_sq is None: + logger.warning(f"exp_avg or exp_avg_sq of {name} is None, maybe something wrong happened.") + continue if monitor.mv_distribution: exp_avg_dict[name] = exp_avg 
exp_avg_sq_dict[name] = exp_avg_sq if monitor.mg_direction: exp_avg_dict[name] = exp_avg if monitor.ur_distribution: - update_dict[name] = exp_avg / (torch.sqrt(exp_avg_sq) + torch_opt.defaults['eps']) - ratio_dict[name] = exp_avg / torch.sqrt(exp_avg_sq) + if len(torch_opt.param_groups) > 1: + logger.info(f"the length of torch_opt.param_groups is {len(torch_opt.param_groups)}.") + if 'step' in state_param: + step = state_param['step'] # Optimizer from pytorch or FusedAdam from apex(used by megatron) + elif 'step' in torch_opt.param_groups[0]: + step = torch_opt.param_groups[0]['step'] # AdamW from mindspeed + else: + logger.warning(f"step of {name} is None, maybe something wrong happened.") + continue + exp_avg_hat = exp_avg / (1 - torch_opt.defaults['betas'][0] ** step) + exp_avg_sq_hat = exp_avg_sq / (1 - torch_opt.defaults['betas'][1] ** step) + update_dict[name] = exp_avg_hat / (torch.sqrt(exp_avg_sq_hat) + torch_opt.defaults['eps']) + ratio_dict[name] = exp_avg_hat / torch.sqrt(exp_avg_sq_hat) monitor.update_heatmap_visualizer[name].pre_cal(update_dict[name]) monitor.ratio_heatmap_visualizer[name].pre_cal(ratio_dict[name]) - res = (exp_avg_dict, exp_avg_sq_dict, update_dict, ratio_dict) - return res + return MVResult(exp_avg=exp_avg_dict, exp_avg_sq=exp_avg_sq_dict, update=update_dict, ratio=ratio_dict) + + def _fetch_mv_grad_in_adam(self, monitor, torch_opt, params2name, name2indices, fp32_partitioned_groups_flat): + exp_avg_dict = defaultdict(float) + exp_avg_sq_dict = defaultdict(float) + update_dict = defaultdict() + ratio_dict = defaultdict() + param2name = defaultdict() + fp32_partitioned_groups_flat_grad = defaultdict() + partition_id = dist.get_rank() + + def get_flatten_grad(self, optimizer, group_idx): + if fp32_partitioned_groups_flat[group_idx].grad is None: + if partition_id == dist.get_world_size() - 1 and not self.is_stage3: + fp32_partitioned_groups_flat_grad = optimizer.flatten_dense_tensors_aligned( + optimizer.averaged_gradients[group_idx], 
+ int(optimizer.partition_size[group_idx]) + ).to(fp32_partitioned_groups_flat[group_idx].dtype) + else: + fp32_partitioned_groups_flat_grad = optimizer.flatten( + optimizer.averaged_gradients[group_idx] + ).to(fp32_partitioned_groups_flat[group_idx].dtype) + return fp32_partitioned_groups_flat_grad + else: + return fp32_partitioned_groups_flat[group_idx].grad + + for group_idx in range(len(fp32_partitioned_groups_flat)): + fp32_partitioned_groups_flat_grad[group_idx] = get_flatten_grad(self, torch_opt, group_idx) + + for name in params2name.values(): + start_idx, end_idx, group_idx, group_with_rank = name2indices[name] + if group_with_rank != partition_id and isinstance(group_with_rank, int): + continue + fp32_param = fp32_partitioned_groups_flat[group_idx][start_idx: end_idx] + fp32_param.grad = fp32_partitioned_groups_flat_grad[group_idx][start_idx: end_idx] + param2name[fp32_param] = name + if not torch_opt.state: + continue + state_param = list(torch_opt.state.values())[group_idx] + exp_avg = state_param.get("exp_avg", None) + exp_avg_sq = state_param.get("exp_avg_sq", None) + if exp_avg is None or exp_avg_sq is None: + logger.warning(f"exp_avg or exp_avg_sq of {name} is None, maybe something wrong happened.") + continue + exp_avg = exp_avg[start_idx: end_idx] + exp_avg_sq = exp_avg_sq[start_idx: end_idx] + if monitor.mv_distribution: + exp_avg_dict[name] = exp_avg + exp_avg_sq_dict[name] = exp_avg_sq + if monitor.mg_direction: + exp_avg_dict[name] = exp_avg + if monitor.ur_distribution: + if 'step' in state_param: + step = state_param['step'] # Optimizer from pytorch or FusedAdam from apex(used by megatron) + elif 'step' in torch_opt.param_groups[group_idx]: + step = torch_opt.param_groups[group_idx]['step'] # AdamW from mindspeed + else: + logger.warning(f"step of {name} is None, maybe something wrong happened.") + continue + exp_avg_hat = exp_avg / (1 - torch_opt.defaults['betas'][0] ** step) + exp_avg_sq_hat = exp_avg_sq / (1 - 
torch_opt.defaults['betas'][1] ** step)
+                update_dict[name] = exp_avg_hat / (torch.sqrt(exp_avg_sq_hat) + torch_opt.defaults['eps'])
+                ratio_dict[name] = exp_avg_hat / torch.sqrt(exp_avg_sq_hat)
+                monitor.update_heatmap_visualizer[name].pre_cal(update_dict[name])
+                monitor.ratio_heatmap_visualizer[name].pre_cal(ratio_dict[name])
+        del fp32_partitioned_groups_flat_grad
+        return MVGradResult(exp_avg=exp_avg_dict, exp_avg_sq=exp_avg_sq_dict, update=update_dict, ratio=ratio_dict,
+                            grad=param2name)
+
+
+class MixPrecisionOptimizerMon(OptimizerMon):
+    """
+    Monitor class for mixed-precision optimizers: observes and manages the optimizer during mixed-precision training.
+    Mixed-precision training speeds up training and reduces memory consumption
+    by appropriately lowering the precision of selected computations.
+    """
+
+    def map_fp16_tp_fp32_param(self, torch_opt):
+        for fp16_group, fp32_group in zip(torch_opt.float16_groups, torch_opt.fp32_from_float16_groups):
+            for fp16_param, fp32_param in zip(fp16_group, fp32_group):
+                self.fp16_to_fp32_param[fp16_param] = fp32_param
+
+    def fetch_mv(self, monitor, torch_opt, params2name):
+        if not self.fp16_to_fp32_param and torch_opt is not None:
+            self.map_fp16_tp_fp32_param(torch_opt)
+
+        return self._fetch_mv_in_adam(monitor, torch_opt, params2name)
+
+
+class MegatronDistributedOptimizerMon(OptimizerMon):
+    def map_fp16_tp_fp32_param(self, torch_opt):
+        if not (hasattr(torch_opt, "model_float16_groups") and
+                hasattr(torch_opt, "shard_fp32_from_float16_groups")):
+            raise Exception(
+                "megatron distributed optimizer should have model_float16_groups and shard_fp32_from_float16_groups; "
+                "if it does not, please check the megatron-lm version")
+        for fp16_group, shard_fp32_group in zip(torch_opt.model_float16_groups,
+                                                torch_opt.shard_fp32_from_float16_groups):
+            for fp16_param, shard_fp32_param in zip(fp16_group, shard_fp32_group):
+                self.fp16_to_fp32_param[fp16_param] = shard_fp32_param
+
+    def fetch_mv(self, monitor, torch_opt, params2name):
+        if not self.fp16_to_fp32_param and torch_opt is not None:
+            self.map_fp16_tp_fp32_param(torch_opt)
+
+        return self._fetch_mv_in_adam(monitor, torch_opt, params2name)
+
+
+class 
MegatronFP32OptimizerMon(OptimizerMon): + def fetch_mv(self, monitor, torch_opt, params2name): + return self._fetch_mv_in_adam(monitor, torch_opt, params2name) + + +class MegatronChainedDistributedOptimizerMon(MegatronDistributedOptimizerMon): + def fetch_mv(self, monitor, torch_opt, params2name): + if not self.fp16_to_fp32_param and torch_opt is not None: + for opt in torch_opt.chained_optimizers: + self.map_fp16_tp_fp32_param(opt) + + if not isinstance(torch_opt, torch.optim.Optimizer): + torch_opt.state = {} + for opt in torch_opt.chained_optimizers: + torch_opt.state.update(opt.optimizer.state) + return self._fetch_mv_in_adam(monitor, torch_opt, params2name) -class MegatronDistributedOptimizerMon(MixPrecsionOptimizerMon): +class MegatronChainedMixPrecisionOptimizerMon(MixPrecisionOptimizerMon): def fetch_mv(self, monitor, torch_opt, params2name): - mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer - if not (hasattr(mix_prec_opt, "model_float16_groups") - and hasattr(mix_prec_opt, "shard_fp32_from_float16_groups")): - raise Exception("megatron distributed optimizer should have model_float16_groups " - "and shard_fp32_from_float16_groups, if not, please check megatron-lm version") - if not self.fp16_to_fp32_param and mix_prec_opt is not None: - for fp16_group, shard_fp32_group in zip(mix_prec_opt.model_float16_groups, - mix_prec_opt.shard_fp32_from_float16_groups): - for fp16_param, shard_fp32_param in zip(fp16_group, shard_fp32_group): - self.fp16_to_fp32_param[fp16_param] = shard_fp32_param + if not self.fp16_to_fp32_param and torch_opt is not None: + for opt in torch_opt.chained_optimizers: + self.map_fp16_tp_fp32_param(opt) - return self._fetch_mv_in_adam(params2name, torch_opt, monitor) + if not isinstance(torch_opt, torch.optim.Optimizer): + torch_opt.state = {} + for opt in torch_opt.chained_optimizers: + torch_opt.state.update(opt.optimizer.state) + return self._fetch_mv_in_adam(monitor, torch_opt, params2name) -class 
DummyOptimizerMon(MixPrecsionOptimizerMon): +class DeepSpeedZeroOptimizerStage0Mon(OptimizerMon): def fetch_mv(self, monitor, torch_opt, params2name): - res = None, None, None, None - return res + return self._fetch_mv_in_adam(monitor, torch_opt, params2name) + + +class DeepSpeedZeroOptimizerStage3Mon(OptimizerMon): + def get_param_index(self, params2name, name2index, torch_opt): + fp16_groups = torch_opt.fp16_partitioned_groups + name2indices = defaultdict() + index_length = defaultdict() + index = 0 + idx = 0 + for group_idx, fp16_group in enumerate(fp16_groups): + for param in fp16_group: + param_length = len(param.flatten()) + index_length[idx] = (index, index + param_length, group_idx) + index += param_length + idx += 1 + for _, name in params2name.items(): + idx = name2index[name] + start_idx, end_idx, group_idx = index_length[idx] + name2indices[name] = (start_idx, end_idx, group_idx, None) + return name2indices + + def fetch_mv(self, monitor, torch_opt, params2name, name2indices=None): + self.is_stage3 = True + fp32_partitioned_groups_flat = torch_opt.fp32_partitioned_groups_flat + return self._fetch_mv_grad_in_adam(monitor, torch_opt, params2name, name2indices, fp32_partitioned_groups_flat) + + +class DeepSpeedZeroOptimizerStage1or2Mon(OptimizerMon): + @staticmethod + def get_group_index(fp32_length, world_size, index): + for i in range(len(fp32_length) - 1): + if fp32_length[i] <= index < fp32_length[i + 1]: + interval_start = fp32_length[i] + interval_length = fp32_length[i + 1] - fp32_length[i] + sub_interval_length = interval_length // world_size + sub_index = (index - interval_start) // sub_interval_length + sub_interval_start = interval_start + sub_index * sub_interval_length + return sub_interval_start, min(sub_index, world_size - 1) + return fp32_length[-1], 0 + + def get_param_index(self, params2name, name2index, torch_opt): + padding = torch_opt.groups_padding + world_size = dist.get_world_size() + fp32_length = [0] + for fp32_group_index, 
single_partition_of_fp32_group in enumerate(torch_opt.single_partition_of_fp32_groups): + fp32_length.append(len(single_partition_of_fp32_group) * world_size + fp32_length[fp32_group_index]) + + bf16_groups = [] + name2indices = defaultdict() + index_length = defaultdict() + index = 0 + idx = 0 + for group_idx, bf16_group in enumerate(torch_opt.bit16_groups): + bf16_groups.extend(bf16_group) + for param in bf16_group: + param_length = len(param.flatten()) + group_index, group_with_rank = self.get_group_index(fp32_length, world_size, index) + index_length[idx] = (index, index + param_length, group_idx, group_index, group_with_rank) + index += param_length + idx += 1 + group_length = len(bf16_groups) / len(torch_opt.bit16_groups) + for _, name in params2name.items(): + name_index = name2index[name] + start_idx, end_idx, group_idx, group_index, group_with_rank = index_length[name_index] + need_padding = True if group_with_rank == world_size - 1 else False + new_start_idx = start_idx - group_index + new_end_idx = end_idx - group_index + if need_padding and group_length - 1 <= name_index <= len(bf16_groups) - 1 and name_index % ( + group_length - 1) == 0: + new_end_idx -= padding[int(name_index // (group_length - 1) - 1)] + name2indices[name] = (new_start_idx, new_end_idx, group_idx, group_with_rank) + return name2indices + + def fetch_mv(self, monitor, torch_opt, params2name, name2indices=None): + fp32_partitioned_groups_flat = torch_opt.single_partition_of_fp32_groups + return self._fetch_mv_grad_in_adam(monitor, torch_opt, params2name, name2indices, fp32_partitioned_groups_flat) + + +class DummyOptimizerMon(OptimizerMon): + def fetch_mv(self, monitor, torch_opt, params2name): + return self._fetch_mv_in_adam(monitor, torch_opt, params2name) class OptimizerMonFactory: + _optimizer_mon_map = { + "FP32Optimizer": MegatronFP32OptimizerMon, + "Float16OptimizerWithFloat16Params": MixPrecisionOptimizerMon, + "DistributedOptimizer": MegatronDistributedOptimizerMon, + 
"ChainedDistributedOptimizer": MegatronChainedDistributedOptimizerMon, + "ChainedFloat16OptimizerWithFloat16Params": MegatronChainedMixPrecisionOptimizerMon, + "BF16_Optimizer": DeepSpeedZeroOptimizerStage0Mon, + "DeepSpeedZeroOptimizer": DeepSpeedZeroOptimizerStage1or2Mon, + "DeepSpeedZeroOptimizer_Stage3": DeepSpeedZeroOptimizerStage3Mon, + "Adam": DummyOptimizerMon + } + @staticmethod - def create_optimizer_mon(opt_ty:str): - if opt_ty == "Megatron_Float16OptimizerWithFloat16Params": - return MixPrecsionOptimizerMon() - if opt_ty == "Megatron_DistributedOptimizer": - return MegatronDistributedOptimizerMon() - if opt_ty is None or opt_ty == "unknown": - return DummyOptimizerMon() - raise Exception("opt_ty should be Megatron_Float16OptimizerWithFloat16Params " - "or Megatron_DistributedOptimizer or None or unknown") + def create_optimizer_mon(optimizer): + # auto replace opt_ty + optimizer_class = optimizer.__class__.__name__ + if optimizer_class == "ChainedOptimizer": + optimizer_class = "Chained" + optimizer.chained_optimizers[0].__class__.__name__ + + optimizer_mon_class = OptimizerMonFactory._optimizer_mon_map.get(optimizer_class, DummyOptimizerMon) + return optimizer_mon_class(), optimizer_class diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/cc_utils.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/cc_utils.py deleted file mode 100644 index 9861b5bb2b11c34d5845ba1ae49e401d62e0cfe1..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/cc_utils.py +++ /dev/null @@ -1,83 +0,0 @@ -import os -from functools import partial -import torch -from torch import distributed as dist -from torch import nn -try: - import torch_npu - BACKEND = 'hccl' - DEVICE = 'npu' -except: - BACKEND = 'nccl' - DEVICE = 'cuda' - -from msprobe.pytorch.monitor.features import square_sum, get_max, get_min, get_zeros -from msprobe.pytorch.monitor.module_hook import CommunicationContext - - -OP_FUNCS = { - "min": 
get_min, - "max": get_max, - "norm": square_sum, - "zeros": partial(get_zeros, eps=1e-8) -} - -def ddp_setup(rank, world_size): - os.environ["MASTER_ADDR"] = "localhost" - os.environ["MASTER_PORT"] = "12346" - dist.init_process_group(backend=BACKEND, rank=rank, world_size=world_size) - -def reset_context(context): - if isinstance(context, CommunicationContext): - context.reset() - elif isinstance(context, dict): - for op, v in context.items(): - v.reset() - -def wrap_reset(func): - def reset_and_test(*args, **kwargs): - print(f"testing {func.__name__}") - reset_context(args[0]) - res = func(*args, **kwargs) - return res - - return reset_and_test - -def assert_empty(data): - assert len(data) == 0, f'data is not empty as expected' - -def assert_nonempty(data): - assert len(data) != 0, f'data is empty' - -def assert_equal(a, b, rank, op_name=None, tag=None): - if a.dim() == 0: - assert a==b, f'inequal in rank {rank}: {a}, {b}, {op_name}, {tag}' - else: - assert torch.equal(a,b), f'inequal in rank {rank}: {a},{b}' - -def assert_inequal(a, b, rank): - if a.dim() == 0: - assert a!=b, f'equal in rank {rank}: {a},{b}' - else: - assert not torch.equal(a,b), f'equal in rank {rank}: {a},{b}' - -def assert_context(data, src, rank): - if len(src) == 0: - assert_empty(data) - else: - assert_nonempty(data) - - for op_name, tensors in data.items(): - for tag, tensor in tensors.items(): - prefix, idx = tag.split('_') - idx = int(idx) - assert_equal(tensor, OP_FUNCS[op_name](src[prefix][idx]), rank, op_name, tag) - - -class Model(nn.Module): - def __init__(self): - super(Model, self).__init__() - self.layer = nn.Linear(2,2) - - def forward(self, x): - return self.layer(x) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc.json b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc.json deleted file mode 100644 index a4667ce6fea8052831ddde3fb879402a30f4e946..0000000000000000000000000000000000000000 --- 
a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc.json +++ /dev/null @@ -1,7 +0,0 @@ -{ - "targets": { - "foo": {} - }, - "cc_distribution": {"enable": true, "cc_pre_hook":true}, - "ops":["max","min","norm","zeros"] -} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc_codeline_ranks.json b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc_codeline_ranks.json deleted file mode 100644 index f139e9b27557c11a02060e686043e83d2120f1de..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc_codeline_ranks.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "targets": { - "foo": {} - }, - "cc_distribution": {"enable": true, "cc_codeline":["monitor/unittest/test_cc_codeline_ranks.py\\[19\\]"]}, - "module_ranks": [1], - "ops":["max","min","norm","zeros"] -} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc_logonly.json b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc_logonly.json deleted file mode 100644 index 51e619fc2d87ffb4a52575ac53ccf0921eb78cce..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_cc_logonly.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "targets": { - "foo": {} - }, - "cc_distribution": {"enable": true, "cc_log_only": true}, - "module_ranks": [0,1], - "ops":["max","min","norm","zeros"] -} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/expected_cc_log.json b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/expected_cc_log.json deleted file mode 100644 index 8204f4a5d5feea0aaf588546700674fa29f49e7e..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/expected_cc_log.json +++ /dev/null @@ -1,20 +0,0 @@ -{ - "all_gather": [ - [ - "|torch.float32||", - "0|1", - 
"/home/jovyan/workspace/kj_dev/monitor/unittest/test_cc_log_only.py[18] test_all_gather", - "/home/jovyan/workspace/kj_dev/monitor/unittest/test_cc_log_only.py[40] main", - "[1] " - ] - ], - "all_reduce": [ - [ - "torch.float32|||", - "0|1", - "/home/jovyan/workspace/kj_dev/monitor/unittest/test_cc_log_only.py[23] test_all_reduce", - "/home/jovyan/workspace/kj_dev/monitor/unittest/test_cc_log_only.py[41] main", - "[1] " - ] - ] -} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_anomaly_inform.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_anomaly_inform.py deleted file mode 100644 index 22c4a4dea65cb11ec7fd5112ea0c522d50599507..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_anomaly_inform.py +++ /dev/null @@ -1,26 +0,0 @@ -import uuid -import unittest - -from msprobe.pytorch.monitor.anomaly_inform import AnomalyInformFactory - - -class TestAnomalyInform(unittest.TestCase): - def test_database_inform(self): - inform_args = {"inform": {"recipient": "database", "connection_str": "mysql+pymysql://username:password@host:port/database"}} - anomaly_inform = AnomalyInformFactory.create_informer(**inform_args["inform"]) - exception_message = '\x1b[93m> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0.self_attention.query_key_value.weight/0/exp_avg_sq_min at step 49.\x1b[0m' - job_id = str(uuid.uuid4()) - anomaly_inform.run(exception_message, job_id) - - def test_email_inform(self): - inform_args = {"inform": {"recipient": "email", "send_email_address": "test@huawei.com", "receive_email_address": "test@huawei.com", - "send_email_username": "foo", "send_email_password": "********", - "smtp_server": "smtpscn.huawei.com", "smtp_port": "587"}} - anomaly_inform = AnomalyInformFactory.create_informer(**inform_args["inform"]) - exception_message = '\x1b[93m> Rule AnomalyTurbulence reports anomaly signal in 
language_model.encoder.layers.0.self_attention.query_key_value.weight/0/exp_avg_sq_min at step 49.\x1b[0m' - job_id = str(uuid.uuid4()) - anomaly_inform.run(exception_message, job_id) - - -if __name__ == "__main__": - unittest.main() diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_basic_functions.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_basic_functions.py deleted file mode 100644 index 7a9dc26d1381f515bf79a6951f17bcce90f9e355..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_basic_functions.py +++ /dev/null @@ -1,151 +0,0 @@ -import unittest -import shutil -import json -import os - -import torch -try: - import torch_npu - device = torch.device('npu:0') -except ModuleNotFoundError: - device = torch.device('cpu') -from tensorboard.backend.event_processing.event_accumulator import EventAccumulator - -from msprobe.pytorch.monitor.module_hook import TrainerMon - - -class Model(torch.nn.Module): - def __init__(self): - super().__init__() - self.fc = torch.nn.Linear(784, 2) - self.relu = torch.nn.ReLU() - - def forward(self, x): - return self.relu(self.fc(x)) - -class ToyDataset(torch.utils.data.Dataset): - def __init__(self): - self.data = torch.randn(16, 784, requires_grad=True) - self.labels = torch.randint(low=0, high=8, size=(16,)) - def __len__(self): - return len(self.labels) - def __getitem__(self, idx): - return self.data[idx].to(device), self.labels[idx].to(device) -def get_file_path(): - output_dir = os.environ.get("MONITOR_OUTPUT_DIR") - for root1, dirs, files in os.walk(output_dir): - for root2, dir, file in os.walk(os.path.join(root1, dirs[-1])): - return os.path.join(root2, file[0]) - -def get_config(): - os.environ["MONITOR_OUTPUT_DIR"] = "./test_monitor_output" - with open("config_basic_functions.json", 'r') as file: - config_test = json.load(file) - return config_test -def get_tensorbaord(event_file_path): - tensorboard = 
EventAccumulator(event_file_path) - tensorboard.Reload() - tags = tensorboard.Tags() - scalers_tag = [] - for tag in tags['scalars']: - tag = tag.split('/') - scalers_tag.append(tag[1]) - images_tag = [] - for tag in tags['images']: - tag = tag.split('/') - images_tag.append(tag[1]) - return scalers_tag, images_tag - -def clean_output(): - folder_path = os.environ.get("MONITOR_OUTPUT_DIR") - if os.path.exists(folder_path): - shutil.rmtree(folder_path) - -def train(): - model = Model().to(device=device) - hooker = TrainerMon('config_basic_functions.json', False, - opt_ty="Megatron_Float16OptimizerWithFloat16Params") # or opt_ty=Megatron_DistributedOptimizer - hooker.hook_modules(model=model, grad_acc_steps=1) - - train_ds = ToyDataset() - train_loader = torch.utils.data.DataLoader(train_ds, shuffle=True, batch_size=2) - - optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001) - - for (inputs, targets) in train_loader: - optimizer.zero_grad() - # inputs and param torch.float32 -> torch.float16 - inputs = inputs.half() - for param in model.parameters(): - param.data = param.data.half() - # outputs torch.float32 - outputs = model(inputs) - output = outputs[0] - targets = targets.float() - # loss torch.float16 -> torch.float32 - loss = torch.nn.functional.cross_entropy(output, targets) - - loss.backward() - optimizer.step() - -class TestMonitor(unittest.TestCase): - def __init__(self, method_name: str) -> None: - super(TestMonitor, self).__init__(method_name) - self.config_test = get_config() - self.event_file_path = None - self.scalers_tag = None - self.images_tag = None - - @classmethod - def setUpClass(cls): - train() - - def setUp(self): - self.config_test = get_config() - self.event_file_path = get_file_path() - self.scalers_tag, self.images_tag = get_tensorbaord(self.event_file_path) - - def test_ops(self): - if self.config_test["ops"]: - for op in self.config_test.get("ops"): - if op == "id": - assert any(op in item for item in self.scalers_tag) == 
self.config_test.get('mg_direction'), f"{op} in ops did not take effect" - else: - assert any(op in item for item in self.scalers_tag), f"{op} in ops did not take effect" - print("ops has taken effect") - - def test_ur_distribution(self): - if self.config_test.get("ur_distribution"): - assert any('adam_update' in item for item in self.images_tag) and any( - 'adam_ratio' in item for item in self.images_tag), "ur_distribution did not take effect" - print("ur_distribution has taken effect") - - def test_xy_distribution(self): - if self.config_test.get("xy_distribution"): - assert any('input' in item for item in self.scalers_tag) and any( - 'output' in item for item in self.scalers_tag), "xy_distribution did not take effect" - print("xy_distribution has taken effect") - - def test_mv_distribution(self): - if self.config_test.get("mv_distribution"): - assert any('exp_avg' in item for item in self.scalers_tag) and any( - 'exp_avg_sq' in item for item in self.scalers_tag), "mv_distribution did not take effect" - print("mv_distribution has taken effect") - - def test_mg_direction(self): - if self.config_test.get("mg_direction"): - assert any('mg_direction' in item for item in self.scalers_tag), "mg_direction did not take effect" - print("mg_direction has taken effect") - - def test_wg_distribution(self): - if self.config_test.get("wg_distribution"): - assert any('weight' in item for item in self.scalers_tag), "wg_distribution did not take effect" - print("wg_distribution has taken effect") - - @classmethod - def tearDownClass(cls) -> None: - clean_output() - - -if __name__ == "__main__": - unittest.main() diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc.py deleted file mode 100644 index 1b10d46373c4aad95661707cccfbe8ca0d61adb2..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc.py +++ /dev/null @@ -1,260 +0,0 @@ -import sys 
-sys.path.append(".") -import time -import torch -from torch import nn -from torch import distributed as dist -import torch.multiprocessing as mp -from msprobe.pytorch.monitor.module_hook import TrainerMon -from msprobe.pytorch.monitor.unittest.cc_utils import DEVICE, assert_context, assert_equal, wrap_reset, ddp_setup - -DEBUG = False -DIM = 2 -DTYPE = torch.float16 - -# 采集数据正确 -# 通信结果正确 - -def test_broadcast(context, rank, async_op): - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - local_a = a.clone() - src = 0 - work = dist.broadcast(a, src, dist.group.WORLD, async_op) - if work: - work.wait() - context.aggregate() - if rank == src: - assert_context(context.data, {'pre':[local_a], 'post':[a]}, rank) - assert torch.equal(local_a, a), f"{local_a}, {a}" - else: - src_tensor = torch.tensor([src+1, src+1], dtype=DTYPE, device=f'{DEVICE}:{rank}') - assert_context(context.data, {'pre': [local_a], 'post':[src_tensor]}, rank) - assert_equal(src_tensor, a, rank) - -@wrap_reset -def test_gather(context, rank, world_size, async_op): - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - dst = 0 - if rank == dst: - data = [torch.zeros_like(a) for _ in range(world_size)] - else: - data = None - work = dist.gather(a, data, dst, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - context.aggregate() - if rank == dst: - assert_context(context.data, {'pre':[a, torch.zeros(world_size, 2, dtype=DTYPE)], 'post':[a, torch.stack(data)]}, rank) - for i in range(world_size): - local_a = torch.tensor([i+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - assert_equal(data[i], local_a, rank) - - -@wrap_reset -def test_all_gather(context, rank, world_size, async_op): - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - data = [torch.zeros_like(a, dtype=DTYPE) for _ in range(world_size)] - work = dist.all_gather(data, a, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - 
context.aggregate() - assert_context(context.data, {'pre':[torch.zeros(world_size, DIM, dtype=DTYPE), a], 'post':[torch.stack(data), a]}, rank) - assert_equal(data[rank], a, rank) - -@wrap_reset -def test_all_gather_into_tensor(context, rank, world_size, async_op): - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - # concatenation - data = torch.zeros(world_size * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - res = torch.tensor([[i+1] for i in range(world_size)], dtype=DTYPE, device=f'{DEVICE}:{rank}').repeat(1, DIM) - work = dist.all_gather_into_tensor(data, a, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - context.aggregate() - assert_context(context.data, {'pre': [torch.zeros(world_size * DIM, dtype=DTYPE), a], 'post': [data, a]}, rank) - assert_equal(data, res.flatten(), rank) - - context.reset() - # concatenation - data = torch.zeros(world_size, DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - work = dist.all_gather_into_tensor(data, a, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - - context.aggregate() - assert_context(context.data, {'pre': [torch.zeros(world_size, DIM, dtype=DTYPE), a], 'post': [data, a]}, rank) - assert_equal(data, res, rank) - -@wrap_reset -def test_reduce(context, rank, world_size, async_op): - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - local_a = a.clone() - dst = 0 - work = dist.reduce(a, dst, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - context.aggregate() - total = sum([i+1 for i in range(world_size)]) - res = torch.tensor([total] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - if rank == dst: - assert_context(context.data, {'pre':[local_a], 'post':[res]}, rank) - assert_equal(res, a, rank) - else: - assert_context(context.data, {'pre':[a], 'post':[a]}, rank) - assert_equal(local_a, a, rank) - -@wrap_reset -def test_all_reduce(context, rank, world_size, async_op): - repeat = 2 - for _ 
in range(repeat): # test aggregate - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - local_a = a.clone() - if rank == 0: - time.sleep(6) - work = dist.all_reduce(a, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - context.aggregate() - total = sum([i+1 for i in range(world_size)]) - res = torch.tensor([total] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - assert_context(context.data, {'pre': [local_a.repeat(repeat)],'post': [res.repeat(repeat)]}, rank) - assert_equal(res, a, rank) - - -@wrap_reset -def test_reduce_scatter(context, rank, world_size, async_op): - a = torch.tensor([rank+1, rank+1], dtype=DTYPE, device=f'{DEVICE}:{rank}') - output = torch.zeros_like(a) - data = [a*(i+1) for i in range(world_size)] - work = dist.reduce_scatter(output, data, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - context.aggregate() - total = sum([i+1 for i in range(world_size)]) - res = (rank+1) * torch.tensor([total, total], dtype=DTYPE, device=f'{DEVICE}:{rank}') - assert_context(context.data,{'pre': [torch.zeros_like(a), torch.stack(data)], 'post':[output, torch.stack(data)]}, rank) - assert_equal(res, output, rank) - - -@wrap_reset -def test_reduce_scatter_tensor(context, rank, world_size, async_op): - a = torch.tensor([rank+1] * DIM * world_size, dtype=DTYPE, device=f'{DEVICE}:{rank}') - output = torch.zeros(DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - work = dist.reduce_scatter_tensor(output, a, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - context.aggregate() - total = sum([i+1 for i in range(world_size)]) - res = torch.tensor([total] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - assert_context(context.data,{'pre': [torch.zeros_like(a, dtype=DTYPE, device=f'{DEVICE}:{rank}'), a], 'post':[output, a]}, rank) - assert_equal(res, output, rank) - -@wrap_reset -def test_scatter(context, rank, world_size, 
async_op): - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - local_a = a.clone() - src = 0 - if rank == src: - scatter_list = [10*torch.tensor([i+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') for i in range(world_size)] - else: - scatter_list = None - work = dist.scatter(a, scatter_list, src, group=dist.group.WORLD, async_op=async_op) - if work: - work.wait() - context.aggregate() - if rank == src: - assert_context(context.data, {'pre': [local_a, torch.stack(scatter_list)], 'post': [a, torch.stack(scatter_list)]}, rank) - else: - assert_context(context.data, {'pre': [local_a], 'post': [a]}, rank) - assert_equal(a, 10*torch.tensor([(rank+1)] * DIM ,dtype=DTYPE, device=f'{DEVICE}:{rank}'), rank) - -## point2point -@wrap_reset -def test_send_recv(context, rank, world_size, async_op): - """send from rank 0 to rank world_size-1""" - if world_size<2: - return - a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - local_a = a.clone() - src = 0 - dst = world_size-1 - if rank == src: - dist.send(a, dst, group=dist.group. - WORLD) - context['send'].aggregate() - assert_context(context['send'].data, {'pre': [local_a], 'post': [a]}, rank) - assert_equal(a, local_a, rank) - if rank == dst: - src_tensor = torch.tensor([src+1, src+1], dtype=DTYPE, device=f'{DEVICE}:{rank}') - dist.recv(a, src, group=dist.group. 
- WORLD) - context['recv'].aggregate() - assert_context(context['recv'].data, {'pre':[local_a], 'post': [a]}, rank) - assert_equal(a, src_tensor, rank) - -@wrap_reset -def test_batch_isend_irecv(context, rank, world_size, async_op): - send_tensor = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') - recv_tensor = torch.zeros_like(send_tensor) - send_op = dist.P2POp(dist.isend, send_tensor, (rank + 1)%world_size) - recv_op = dist.P2POp(dist.irecv, recv_tensor, (rank - 1 + world_size)%world_size) - reqs = dist.batch_isend_irecv([send_op, recv_op]) - for req in reqs: - req.wait() - context.aggregate() - assert_context(context.data, {'pre': [torch.stack([send_tensor, torch.zeros_like(send_tensor)])], 'post':[torch.stack([send_tensor, recv_tensor])]}, rank) - assert_equal( recv_tensor, torch.tensor([(rank - 1 + world_size)%world_size + 1] * DIM, device=f'{DEVICE}:{rank}'), rank) - -def test_all(monitor, rank, world_size, async_op): - cc_context = monitor.cc_context - - test_send_recv(cc_context, rank, world_size, async_op) - test_broadcast(cc_context['broadcast'], rank, async_op) - test_gather(cc_context['gather'], rank, world_size, async_op) - test_all_gather(cc_context['all_gather'], rank, world_size, async_op) - test_all_gather_into_tensor(cc_context['all_gather_into_tensor'], rank, world_size, async_op) - test_reduce(cc_context['reduce'], rank, world_size, async_op) - test_all_reduce(cc_context['all_reduce'], rank, world_size, async_op) - test_reduce_scatter(cc_context['reduce_scatter'], rank, world_size, async_op) - test_reduce_scatter_tensor(cc_context['reduce_scatter_tensor'], rank, world_size, async_op) - test_scatter(cc_context['scatter'], rank, world_size, async_op) - test_batch_isend_irecv(cc_context['batch_isend_irecv'], rank, world_size, async_op) - - -def main(rank, world_size): - - ddp_setup(rank, world_size) - if rank == 0 and DEBUG: - import debugpy - debugpy.listen(5678) - debugpy.wait_for_client() - steps = 2 - - net = Model() - 
monitor = TrainerMon("monitor/unittest/config_cc.json", opt_ty="Megatron_Float16OptimizerWithFloat16Params") - # monitor = None - # monitor.hook_optimizer() # to enable tb - optimizer = torch.optim.Adam(net.parameters()) - for step in range(steps): - print('setp: ', step) - test_all(monitor, rank, world_size, False) - test_all(monitor, rank, world_size, True) - optimizer.step() - - -class Model(nn.Module): - def __init__(self): - super(Model, self).__init__() - self.layer = nn.Linear(2,2) - - def forward(self, x): - return self.layer(x) - -if __name__ == '__main__': - if len(sys.argv)>1: - DEBUG = sys.argv[1] - world_size=4 - torch.manual_seed(1234) - mp.spawn(main, args=(world_size,), nprocs=world_size) - - \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc_codeline_ranks.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc_codeline_ranks.py deleted file mode 100644 index 656dba73af720b541376f5d7934a277e5ef089c7..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc_codeline_ranks.py +++ /dev/null @@ -1,52 +0,0 @@ -import sys -sys.path.append(".") -import torch -from torch import distributed as dist -import torch.multiprocessing as mp -from msprobe.pytorch.monitor.module_hook import TrainerMon -from msprobe.pytorch.monitor.unittest.cc_utils import DEVICE, wrap_reset, assert_context, ddp_setup, Model - -@wrap_reset -def test_all_gather(context, rank, target_rank, world_size, async_op): - a = torch.tensor([rank+1, rank+1], dtype=torch.float32, device=f'{DEVICE}:{rank}') - data = [torch.empty_like(a) for _ in range(world_size)] - dist.all_gather(data, a, group=dist.group.WORLD, async_op=async_op) - assert_context(context.data, {}, rank) - -@wrap_reset -def test_all_reduce(context, rank, target_rank, world_size, async_op): - a = torch.tensor([rank+1, rank+1], dtype=torch.float32, device=f'{DEVICE}:{rank}') - dist.all_reduce(a, op=dist.ReduceOp.SUM, 
group=dist.group.WORLD, async_op=async_op) - total = sum([i+1 for i in range(world_size)]) - sum_reduced = torch.tensor([total, total], dtype=torch.float32, device=f'{DEVICE}:{rank}') - context.aggregate() - if rank in target_rank: - assert_context(context.data, {"post": [sum_reduced]}, rank) - else: - assert_context(context.data, {}, rank) - -def main(rank, world_size): - - ddp_setup(rank, world_size) - steps = 2 - async_op = False - - net = Model() - monitor = TrainerMon("monitor/unittest/config_cc_codeline_ranks.json") - target_rank = monitor.module_rank_list - # monitor = None - # monitor.hook_optimizer() # to enable tb - optimizer = torch.optim.Adam(net.parameters()) - cc_context = monitor.cc_context - for step in range(steps): - print('setp: ', step) - test_all_gather(cc_context['all_gather'], rank, target_rank, world_size, async_op) - test_all_reduce(cc_context['all_reduce'], rank, target_rank, world_size, async_op) - optimizer.step() - -if __name__ == '__main__': - world_size=2 - torch.manual_seed(1234) - mp.spawn(main, args=(world_size,), nprocs=world_size) - - \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc_log_only.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc_log_only.py deleted file mode 100644 index b3b4c43dce002e2eb34c349bad08fdf6615e5439..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_cc_log_only.py +++ /dev/null @@ -1,54 +0,0 @@ -import os -import sys -sys.path.append(".") -import json -import torch -from torch import distributed as dist -import torch.multiprocessing as mp -from msprobe.pytorch.monitor.module_hook import TrainerMon -from msprobe.pytorch.monitor.unittest.cc_utils import assert_context, DEVICE, ddp_setup, Model - -with open(os.path.join(os.path.dirname(__file__), 'expected_cc_log.json')) as f: - EXPECTED = json.load(f) - -def test_all_gather(context, rank, world_size, async_op): - a = 
torch.tensor([rank+1, rank+1], dtype=torch.float32, device=f'{DEVICE}:{rank}') - data = [torch.empty_like(a) for _ in range(world_size)] - dist.all_gather(data, a, group=dist.group.WORLD, async_op=async_op) - assert_context(context.data, {}, rank) - -def test_all_reduce(context, rank, world_size, async_op): - a = torch.tensor([rank+1, rank+1], dtype=torch.float32, device=f'{DEVICE}:{rank}') - dist.all_reduce(a, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) - assert_context(context.data, {}, rank) - - -def main(rank, world_size): - ddp_setup(rank, world_size) - steps = 3 - async_op = False - - net = Model() - monitor = TrainerMon("monitor/unittest/config_cc_logonly.json") - monitor.hook_optimizer() # to enable tb - optimizer = torch.optim.Adam(net.parameters()) - cc_context = monitor.cc_context - try: - for step in range(steps): - print('step: ', step) - test_all_gather(cc_context['all_gather'], rank, world_size, async_op) - test_all_reduce(cc_context['all_reduce'], rank, world_size, async_op) - optimizer.step() - except Exception as e: - assert step == 1 - assert e.__str__() == "exit after first step when print cc stack", e - for k in EXPECTED.keys(): - assert [';'.join(stack) for stack in EXPECTED[k]] == list(monitor.cc_logged_stack[k]) - - -if __name__ == '__main__': - world_size=2 - torch.manual_seed(1234) - mp.spawn(main, args=(world_size,), nprocs=world_size) - - \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_database.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_database.py deleted file mode 100644 index ad41757189e655e2bda387a150fa0e70f3e93a3d..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_database.py +++ /dev/null @@ -1,42 +0,0 @@ -import unittest -import uuid -from datetime import datetime -from unittest import TestCase - -from sqlalchemy import inspect - -from msprobe.pytorch.monitor.database import 
Database, ExceptionMessage - - -class TestDatabase(TestCase): - def __init__(self, method_name: str): - super(TestDatabase, self).__init__(method_name) - self.db = Database('mysql+pymysql://username:password@host:port/database') - - def test_create_table(self): - self.db.create_table() - inspect_ = inspect(self.db.engine) - table_names = inspect_.get_table_names() - print(table_names) - self.assertIn("exception_message", table_names) - - def test_insert_batch(self): - self.db.create_table() - job_id = str(uuid.uuid4()) - print(job_id) - save_list = [] - exception_message_list = [ - '> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0/1/input_zeros at step 1.', - '> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0.input_norm.weight/0/exp_avg_min at step 2.', - '> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0.input_norm.weight/1/exp_avg_min at step 2.'] - for exception_message in exception_message_list: - item = {'job_id': job_id, 'message': exception_message, 'create_time': datetime.now()} - save_list.append(ExceptionMessage(**item)) - self.db.insert_batch(save_list) - find_by_job_id = self.db.find_by_job_id(job_id) - exception_messages = [item.message for item in find_by_job_id] - self.assertEqual(exception_messages, exception_message_list) - - -if __name__ == '__main__': - unittest.main() diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_features.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_features.py deleted file mode 100644 index b19f6655552e74052dad1eb3d005cbe501b9ed65..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_features.py +++ /dev/null @@ -1,33 +0,0 @@ -import unittest -import torch -import torch.nn as nn -import torch_npu -from msprobe.pytorch.monitor.features import eff_rank - - -class TestFeatureCalculation(unittest.TestCase): - def 
test_effective_rank(self): - param = torch.randn(10, 10).npu() - rank = eff_rank(param) - self.assertTrue(rank.item() >= 1) - - def test_lambda_max(self): - pass - # input_dim = 10 - # hidden_dim = 100 - # output_dim = 1 - # num_samples = 100 - # X = torch.randn(num_samples, input_dim) - # network = nn.Sequential( - # nn.Linear(input_dim, hidden_dim), - # nn.ReLU(), - # nn.Linear(hidden_dim, output_dim) - # ) - # Y = network(X) - # Y.backward() - # for name, param in network.named_parameters(): - # lm = lambda_max(param) - - -if __name__ == "__main__": - unittest.main() \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_module_hook.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_module_hook.py deleted file mode 100644 index 9e3491179cac39bd07d04f5313a54403bc290b2f..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_module_hook.py +++ /dev/null @@ -1,69 +0,0 @@ -import sys - -sys.path.append('./') -import argparse -import torch - -try: - import torch_npu - - device = torch.device('npu:0') -except ModuleNotFoundError: - device = torch.device('cpu') -import torch.nn.functional as F -from msprobe.pytorch.monitor.module_hook import TrainerMon # Modify PYTHONPATH to import TrainerMon - -parser = argparse.ArgumentParser(prog="monitor debug", description="monitor sample code", epilog="") -parser.add_argument("-o", "--out_dir", type=str, default=".") -args = parser.parse_args() -DTYPE = torch.float32 - - -class Model(torch.nn.Module): - def __init__(self): - super().__init__() - self.fc = torch.nn.Linear(784, 10, dtype=DTYPE) - self.relu = torch.nn.ReLU() - - def forward(self, x): - return self.relu(self.fc(x).type(DTYPE)) - - -net = Model().to(device=device) - -config = { - "targets": { - "fc": {"input": "tuple[2]:0", "output": "tensor::"}, - "relu": {"input": "..", "output": ".."} - } -} - -optimizer = torch.optim.Adam(net.parameters(), lr=0.0001) - 
-hooker = TrainerMon('./monitor/unittest/config_1.json', opt_ty='Megatron_Float16OptimizerWithFloat16Params') -hooker.hook_modules(model=net, global_batch_size=2, dp=1, micro_batch_size=2, fwd_or_bkd=0, params_have_main_grad=False) - - -class ToyDataset(torch.utils.data.Dataset): - def __init__(self): - self.data = torch.randn(16, 784, dtype=DTYPE, requires_grad=True) - self.labels = torch.randint(low=0, high=9, size=(16,)) - - def __len__(self): - return len(self.labels) - - def __getitem__(self, idx): - return self.data[idx].to(device), self.labels[idx].to(device) - - -train_ds = ToyDataset() -train_loader = torch.utils.data.DataLoader(train_ds, shuffle=True, batch_size=2) - - -for (inputs, labels) in train_loader: - optimizer.zero_grad() - outputs = net(inputs) - loss = F.cross_entropy(outputs, labels) - - loss.backward() - optimizer.step() diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_monitor.py b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_monitor.py new file mode 100644 index 0000000000000000000000000000000000000000..4d5c1a717d80ee30414f25b44a93ddc7257ef2c7 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/test_monitor.py @@ -0,0 +1,160 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+import argparse
+import os
+import re
+from glob import glob
+
+import pandas as pd
+
+from msprobe.pytorch.common.log import logger
+
+
+def parse_logfile(logfile):
+    grad_norm = []
+    step = []
+    with open(logfile) as f:
+        for line in f.readlines():
+            if 'consumed samples' in line:
+                grad_norm.append(float(re.findall(r'(?<=grad norm: )[\d.]*', line)[0]))
+    return grad_norm
+
+
+def parse_monitor_output(output_dir):
+    reduced = {}
+    unreduced = {}
+    for directory in glob(output_dir + '*'):
+        rank = int(re.findall(r'(?<=rank)[\d]*', directory)[0])
+        unreduced[rank] = []
+        reduced[rank] = []
+        for file in os.listdir(directory):
+            df = pd.read_csv(os.path.join(directory, file))
+            if '_unreduced_' in file:
+                unreduced[rank].append(df)
+                pass
+            elif '_reduced_' in file:
+                reduced[rank].append(df)
+            else:
+                logger.info(f'unexpected file {file} in {directory}')
+    return reduced, unreduced
+
+
+def valid_reduce(reduced, unreduced, tp_size, dp_size, sequence_parallel):
+    steps = len(reduced[0])
+    world_size = len(reduced)
+    errors = []
+    for _, row in unreduced[0][0].iterrows():
+        param = row['param_name']
+        is_tp_duplicate = False
+        for step in range(2):
+            # sum reduced
+            reduced_mean = 0.
+            for rank in range(world_size):
+                if len(reduced[rank]) == 0:
+                    continue
+                df = reduced[rank][step]
+                value = list(df[df['param_name'] == param]['mean'])
+                if not value:
+                    if step == 0:
+                        is_tp_duplicate = True
+                    continue
+                reduced_mean += value[0]
+
+            # sum unreduced
+            unreduced_mean = 0.
+            for rank in range(world_size):
+                df = unreduced[rank][step]
+                value = list(df[df['param_name'] == param]['mean'])
+                if not value:
+                    continue
+                unreduced_mean += value[0]
+
+            unreduced_mean /= dp_size
+            if is_tp_duplicate and (not sequence_parallel or 'embedding' in param):
+                unreduced_mean /= tp_size
+            try:
+                assert_equal(unreduced_mean, reduced_mean)
+            except AssertionError as e:
+                errors.append([param, step, e, is_tp_duplicate])
+    if errors:
+        logger.info(errors)
+    else:
+        logger.info('grad mean is consistent between unreduced and reduced grads monitored.')
+
+
+def assert_equal(a, b):
+    if a == 0 and b == 0:
+        return
+    if b == 0:
+        rel_diff = a
+    elif a == 0:
+        rel_diff = b
+    else:
+        rel_diff = abs(a / b - 1)
+    assert rel_diff < 0.01, f'{a}, {b}, {rel_diff}'
+
+
+def valid_total_norm(total_norm, reduced, duplicate_embedding):
+    steps = len(total_norm)
+    world_size = len(reduced)
+    errors = []
+    for step in range(steps):
+        calculated_norm = 0.
+        for rank in range(world_size):
+            if len(reduced[rank]) == 0:
+                if step == 0:
+                    logger.info(f'rank {rank} is duplicated in dp group')
+                continue
+            for _, row in reduced[rank][step].iterrows():
+                if duplicate_embedding and 'word_embedding' in row['param_name']:
+                    continue
+                calculated_norm += row['norm'] ** 2
+        try:
+            assert_equal(calculated_norm ** 0.5, total_norm[step])
+        except AssertionError as e:
+            errors.append([step, e])
+    if errors:
+        logger.info(f'total norm errors: {errors}')
+    else:
+        logger.info('grad norm is consistent between training log and reduced gradients monitored')
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--monitor_output', '-m', type=str, required=True,
+                        help='path prefix to the output of monitor e.g. monitor_output/Aug12_07-16')
+    parser.add_argument('--logfile', '-l', type=str, required=True, help='path to the training log file')
+    parser.add_argument('--tp_size', '-t', type=int, required=True, help='tp parallel size')
+    parser.add_argument('--dp_size', '-d', type=int, required=True, help='dp parallel size')
+    parser.add_argument('--pp_size', '-p', type=int, required=True, help='pp parallel size')
+    parser.add_argument('--untie_embeddings_and_output_weights', '-u', action="store_true", default=False,
+                        help='whether untie_embeddings_and_output_weights in pp parallel')
+    parser.add_argument('--sequence_parallel', '-s', action="store_true", default=False,
+                        help='whether sequence parallel is enabled. Add -s to store true')
+
+    args = parser.parse_args()
+
+    assert args.tp_size > 0, 'if tp not enabled, set tp_size = 1'
+    assert args.dp_size > 0, 'if dp not enabled, set dp_size = 1'
+    assert args.pp_size > 0, 'if pp not enabled, set pp_size = 1'
+
+    total_norm = parse_logfile(args.logfile)
+    reduced, unreduced = parse_monitor_output(args.monitor_output)
+
+    duplicate_embedding = not args.untie_embeddings_and_output_weights and args.pp_size > 1
+
+    valid_total_norm(total_norm, reduced, duplicate_embedding)
+    valid_reduce(reduced, unreduced, args.tp_size, args.dp_size, args.sequence_parallel)
diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/utils.py b/debug/accuracy_tools/msprobe/pytorch/monitor/utils.py
index 46386db99d644347bf80eac8baa9b2562cc0b034..94afe56ffcfe7571a189c5f6959b2eb9a2779d81 100644
--- a/debug/accuracy_tools/msprobe/pytorch/monitor/utils.py
+++ b/debug/accuracy_tools/msprobe/pytorch/monitor/utils.py
@@ -1,5 +1,3 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
 # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
 # All rights reserved.
 #
@@ -14,23 +12,38 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
- -import os -import time -import sys -import re +import inspect +from collections import namedtuple from datetime import timezone, timedelta from functools import wraps -from torch import distributed as dist +from datetime import datetime +import os +import re + +import torch -from msprobe.core.common.const import MonitorConst +from msprobe.core.common.const import MonitorConst, Const +from msprobe.pytorch.common.log import logger +from msprobe.core.common.utils import is_int +from msprobe.core.common.file_utils import check_file_or_directory_path + +device = "cpu" +try: + import torch_npu + device = "npu" +except ImportError: + if torch.cuda.is_available(): + device = "cuda" + +NAN_TENSOR_ON_DEVICE = None FILE_MAX_SIZE = 10 * 1024 * 1024 * 1024 FILE_NAME_MAX_LENGTH = 255 DIRECTORY_MAX_LENGTH = 4096 -FILE_NAME_VALID_PATTERN = r"^[a-zA-Z0-9_.:/-]+$" beijing_tz = timezone(timedelta(hours=8)) +MVResult = namedtuple('MVResult', ("exp_avg", "exp_avg_sq", "update", "ratio")) +MVGradResult = namedtuple('MVGradResult', ("exp_avg", "exp_avg_sq", "update", "ratio", "grad")) class MsgConst: @@ -40,6 +53,17 @@ class MsgConst: SPECIAL_CHAR = ["\n", "\r", "\u007F", "\b", "\f", "\t", "\u000B", "%08", "%0a", "%0b", "%0c", "%0d", "%7f"] +def get_output_base_dir(): + return os.getenv(MonitorConst.MONITOR_OUTPUT_DIR, MonitorConst.DEFAULT_MONITOR_OUTPUT_DIR) + + +def get_nan_tensor(): + global NAN_TENSOR_ON_DEVICE + if not NAN_TENSOR_ON_DEVICE: + NAN_TENSOR_ON_DEVICE = torch.tensor(torch.nan, device=device) + return NAN_TENSOR_ON_DEVICE + + def filter_special_chars(func): @wraps(func) def func_level(msg): @@ -50,209 +74,43 @@ def filter_special_chars(func): return func_level -class FileCheckConst: - """ - Class for file check const - """ - READ_ABLE = "read" - WRITE_ABLE = "write" - READ_WRITE_ABLE = "read and write" - DIRECTORY_LENGTH = 4096 - FILE_NAME_LENGTH = 255 - FILE_VALID_PATTERN = r"^[a-zA-Z0-9_.:/-]+$" - PKL_SUFFIX = ".pkl" - NUMPY_SUFFIX = ".npy" - JSON_SUFFIX = ".json" - 
PT_SUFFIX = ".pt" - CSV_SUFFIX = ".csv" - YAML_SUFFIX = ".yaml" - MAX_PKL_SIZE = 1 * 1024 * 1024 * 1024 - MAX_NUMPY_SIZE = 10 * 1024 * 1024 * 1024 - MAX_JSON_SIZE = 1 * 1024 * 1024 * 1024 - MAX_PT_SIZE = 10 * 1024 * 1024 * 1024 - MAX_CSV_SIZE = 1 * 1024 * 1024 * 1024 - MAX_YAML_SIZE = 10 * 1024 * 1024 - DIR = "dir" - FILE = "file" - DATA_DIR_AUTHORITY = 0o750 - DATA_FILE_AUTHORITY = 0o640 - FILE_SIZE_DICT = { - PKL_SUFFIX: MAX_PKL_SIZE, - NUMPY_SUFFIX: MAX_NUMPY_SIZE, - JSON_SUFFIX: MAX_JSON_SIZE, - PT_SUFFIX: MAX_PT_SIZE, - CSV_SUFFIX: MAX_CSV_SIZE, - YAML_SUFFIX: MAX_YAML_SIZE - } - - -class FileCheckException(Exception): - """ - Class for File Check Exception - """ - NONE_ERROR = 0 - INVALID_PATH_ERROR = 1 - INVALID_FILE_TYPE_ERROR = 2 - INVALID_PARAM_ERROR = 3 - INVALID_PERMISSION_ERROR = 3 - - def __init__(self, code, error_info: str = ""): - super(FileCheckException, self).__init__() - self.code = code - self.error_info = error_info - - def __str__(self): - return self.error_info - - -def print_rank_0(message): - if dist.is_initialized(): - if dist.get_rank() == 0: - print(message) - else: - print(message) - - -def _print_log(level, msg, end='\n'): - current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(time.time()))) - pid = os.getgid() - print(current_time + "(" + str(pid) + ")-[" + level + "]" + msg, end=end) - sys.stdout.flush() - - -@filter_special_chars -def print_info_log(info_msg): - """ - Function Description: - print info log. - Parameter: - info_msg: the info message. - """ - _print_log("INFO", info_msg) - - -@filter_special_chars -def print_error_log(error_msg): - """ - Function Description: - print error log. - Parameter: - error_msg: the error message. - """ - _print_log("ERROR", error_msg) - - -@filter_special_chars -def print_warn_log(warn_msg): - """ - Function Description: - print warn log. - Parameter: - warn_msg: the warning message. 
- """ - _print_log("WARNING", warn_msg) - - def get_param_struct(param): - if isinstance(param, tuple): - return f"tuple[{len(param)}]" - if isinstance(param, list): - return f"list[{len(param)}]" - return "tensor" - - -def check_link(path): - abs_path = os.path.abspath(path) - if os.path.islink(abs_path): - raise RuntimeError("The path is a soft link.") - - -def check_path_length(path, name_length_limit=None): - file_max_name_length = name_length_limit if name_length_limit else FILE_NAME_MAX_LENGTH - if len(path) > DIRECTORY_MAX_LENGTH or \ - len(os.path.basename(path)) > file_max_name_length: - raise RuntimeError("The file path length exceeds limit.") - - -def check_path_pattern_valid(path): - if not re.match(FILE_NAME_VALID_PATTERN, path): - raise RuntimeError("The file path contains special characters.") - - -def check_path_readability(path): - if not os.access(path, os.R_OK): - raise RuntimeError("The file path is not readable.") - - -def check_path_writability(path): - if not os.access(path, os.W_OK): - raise RuntimeError("The file path is not writable.") - - -def check_file_size(file_path, max_size=FILE_MAX_SIZE): - file_size = os.path.getsize(file_path) - if file_size >= max_size: - raise RuntimeError("The file size excess limit.") - - -def check_path_exists(path): - if not os.path.exists(path): - raise RuntimeError("The file path does not exist.") - - -def check_file_valid(path): - check_path_exists(path) - check_link(path) - real_path = os.path.realpath(path) - check_path_length(real_path) - check_path_pattern_valid(real_path) - check_file_size(real_path) - - -def check_file_valid_readable(path): - check_file_valid(path) - check_path_readability(path) - - -def check_file_valid_writable(path): - check_file_valid(path) - check_path_writability(path) - - -def change_mode(path, mode): - if not os.path.exists(path) or os.path.islink(path): - return - try: - os.chmod(path, mode) - except PermissionError as ex: - print_error_log('Failed to change {} authority. 
{}'.format(path, str(ex))) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) from ex + res = {} + if isinstance(param, (tuple, list)): + res['config'] = f'{type(param).__name__}[{len(param)}]' + for i, x in enumerate(param): + res[i] = f'size={tuple(x.shape)}, dtype={x.dtype}' if torch.is_tensor(x) else f'{type(x)}' + elif torch.is_tensor(param): + res['config'] = 'tensor' + res['tensor'] = f'size={tuple(param.shape)}, dtype={param.dtype}' + else: + res['config'] = f'{type(param)}' + logger.warning(f'Not support type({type(param)}) now, please check the type of param {param}') + return res def validate_ops(ops): if not isinstance(ops, list): raise TypeError("ops should be a list") - if not ops: - raise TypeError(f"specify ops to calculate metrics. Optional ops: {MonitorConst.OP_LIST}") - valid_ops = [] for op in ops: if op not in MonitorConst.OP_LIST: - raise ValueError(f"op {op} is not supported. Optional ops: {MonitorConst.OP_LIST}") - else: - valid_ops.append(op) + logger.warning(f"op {op} is not supported. Optional ops: {MonitorConst.OP_LIST}") + continue + valid_ops.append(op) + if not valid_ops: + default_op = MonitorConst.OP_LIST[0] + valid_ops.append(default_op) + logger.info_on_rank_0(f"There is no valid ops, default op {default_op} is used") return valid_ops def validate_ranks(ranks): - world_size = dist.get_world_size() if not isinstance(ranks, list): raise TypeError("module_ranks should be a list") for rank in ranks: if not isinstance(rank, int) or isinstance(rank, bool): raise TypeError(f"element in module_ranks should be a int, get {type(rank)}") - if rank < 0 or rank >= world_size: - print_warn_log(f"rank {rank} is beyond world size [0, {world_size - 1}] and will be ignored") def validate_targets(targets): @@ -265,6 +123,89 @@ def validate_targets(targets): raise TypeError('values of targets should be cared filed e.g. 
{"input": "tensor"} in config.json') +def validate_print_struct(print_struct): + if not isinstance(print_struct, bool): + raise TypeError("print_struct should be a bool") + + +def validate_ur_distribution(ur_distribution): + if not isinstance(ur_distribution, bool): + raise TypeError('ur_distribution should be a bool') + + +def validate_xy_distribution(xy_distribution): + if not isinstance(xy_distribution, bool): + raise TypeError('xy_distribution should be a bool') + + +def validate_wg_distribution(wg_distribution): + if not isinstance(wg_distribution, bool): + raise TypeError('wg_distribution should be a bool') + + +def validate_mg_distribution(mg_distribution): + if not isinstance(mg_distribution, bool): + raise TypeError('mg_distribution should be a bool') + + +def validate_param_distribution(param_distribution): + if not isinstance(param_distribution, bool): + raise TypeError('param_distribution should be a bool') + + +def validate_cc_distribution(cc_distribution): + if not isinstance(cc_distribution, dict): + raise TypeError('cc_distribution should be a dictionary') + for key, value in cc_distribution.items(): + if key == 'enable': + if not isinstance(value, bool): + raise TypeError('cc_distribution enable should be a bool') + elif key == 'cc_codeline': + if not isinstance(value, list): + raise TypeError('cc_distribution cc_codeline should be a list') + elif key == 'cc_pre_hook': + if not isinstance(value, bool): + raise TypeError('cc_distribution cc_pre_hook should be a bool') + elif key == 'cc_log_only': + if not isinstance(value, bool): + raise TypeError('cc_distribution cc_log_only should be a bool') + else: + raise TypeError(f'{key} of cc_distribution is not supported.') + + +def validate_squash_name(squash_name): + if not isinstance(squash_name, bool): + raise TypeError('squash_name should be a bool') + + +def validate_alert(alert): + if not isinstance(alert, dict): + raise TypeError('alert should be a dictionary') + rules = alert.get('rules') + if 
rules and isinstance(rules, list): + for rule in rules: + rule_name = rule.get("rule_name") + if rule_name and rule_name not in MonitorConst.RULE_NAME: + raise TypeError(f"{rule_name} is not supported") + args = rule.get("args") + if args and isinstance(args, dict): + threshold = args.get("threshold") + if not isinstance(threshold, float) or threshold < 0: + raise TypeError('threshold must be float and not less than 0') + dump = alert.get('dump') + if dump and not isinstance(dump, bool): + raise TypeError('dump must be bool.') + + +def validate_step_count_per_record(step_count_per_record): + if not is_int(step_count_per_record): + raise TypeError('step_count_per_record must be int.') + if step_count_per_record < 1: + raise ValueError("step_count_per_record must greater than 0") + if step_count_per_record > 1e6: + raise ValueError("step_count_per_record must smaller than 1e6") + + def validate_config(config): config['ops'] = validate_ops(config.get('ops', [])) @@ -277,3 +218,69 @@ def validate_config(config): targets = config.get("targets", {}) validate_targets(targets) + + print_struct = config.get('print_struct', False) + validate_print_struct(print_struct) + + ur_distribution = config.get('ur_distribution', False) + validate_ur_distribution(ur_distribution) + + xy_distribution = config.get('xy_distribution', False) + validate_xy_distribution(xy_distribution) + + wg_distribution = config.get('wg_distribution', False) + validate_wg_distribution(wg_distribution) + + mg_distribution = config.get('mg_distribution', False) + validate_mg_distribution(mg_distribution) + + param_distribution = config.get('param_distribution', False) + validate_param_distribution(param_distribution) + + cc_distribution = config.get('cc_distribution', {}) + validate_cc_distribution(cc_distribution) + + alert = config.get('alert', {}) + validate_alert(alert) + + step_count_per_record = config.get('step_count_per_record', 1) + validate_step_count_per_record(step_count_per_record) + + 
squash_name = config.get('squash_name', True) + validate_squash_name(squash_name) + + if not targets: + if xy_distribution: + config["all_xy"] = True + config["targets"] = {"": {}} + + +def time_str2time_digit(time_str): + time_format = '%b%d_%H-%M-%S' + try: + time_digit = datetime.strptime(time_str, time_format) + except Exception as e: + raise RuntimeError(f"illegal timestamp: {time_str}, timestamp should be prefix \ + of existing output dirpath, like 'Dec03_21-34-40'.") from e + return time_digit + + +def get_target_output_dir(monitor_path, time_start, time_end): + check_file_or_directory_path(monitor_path, isdir=True) + time_start = time_str2time_digit(time_start) if time_start is not None else time_start + time_end = time_str2time_digit(time_end) if time_end is not None else time_end + if time_start and time_end and time_start > time_end: + raise ValueError(f"time_start({time_start}) greater than time_end({time_end})") + result = {} + for dirname in os.listdir(monitor_path): + match = re.match(MonitorConst.OUTPUT_DIR_PATTERN, dirname) + if not match: + continue + time_tag = match.group(1) + rank = match.group(2) + target_time = time_str2time_digit(time_tag) + start_ok = time_start is None or target_time >= time_start + end_ok = time_end is None or target_time <= time_end + if start_ok and end_ok: + result[rank] = os.path.join(monitor_path, dirname) + return result diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/visualizer.py b/debug/accuracy_tools/msprobe/pytorch/monitor/visualizer.py index aeb5bae176084a23879c4766ffe8619a5bc8f3bd..525ed5317c3ce2ca3cece9326914f60b068cc7be 100644 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/visualizer.py +++ b/debug/accuracy_tools/msprobe/pytorch/monitor/visualizer.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. 
# @@ -27,7 +25,7 @@ class HeatmapVisualizer: self.min_val = -1 self.max_val = 1 self.histogram_edges = None - self.histogram_sum_data_np = None # matrix shape is [bins_num * total_step] + self.histogram_sum_data_np = None # matrix shape is [bins_num * total_step] self.cur_step_histogram_data = None self.histogram_edges = torch.linspace(self.min_val, self.max_val, self.histogram_bins_num) @@ -35,7 +33,7 @@ class HeatmapVisualizer: self.cur_step_histogram_data = cal_histc(tensor_cal=tensor, bins_total=self.histogram_bins_num, min_val=self.min_val, max_val=self.max_val) - def visualize(self, tag_name:str, step, summary_writer): + def visualize(self, tag_name: str, step, summary_writer): if self.histogram_sum_data_np is None or self.histogram_sum_data_np.size == 0: self.histogram_sum_data_np = np.expand_dims(self.cur_step_histogram_data.cpu(), 0).T else: @@ -43,7 +41,7 @@ class HeatmapVisualizer: # matrix shape is [bins_num * total_step] self.histogram_sum_data_np = np.concatenate((self.histogram_sum_data_np, np.expand_dims( self.cur_step_histogram_data.cpu(), 1)), axis=1) - + fig, ax = plt.subplots() cax = ax.matshow(self.histogram_sum_data_np, cmap='hot', aspect='auto') fig.colorbar(cax) diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/compare.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/compare.py index 07989907a036ea8ca86a59ee2f51529a475b5f3b..7a265e70fa4cbe95c897c35d68e4afa8ebd77249 100644 --- a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/compare.py @@ -20,9 +20,8 @@ import sys from collections import namedtuple from msprobe.core.common.const import CompareConst, FileCheckConst -from msprobe.core.common.file_utils import FileOpen, change_mode, read_csv -from msprobe.core.common.utils import CompareException, check_op_str_pattern_valid -from msprobe.pytorch.common.log import logger +from msprobe.core.common.file_utils import read_csv, get_json_contents, 
write_csv +from msprobe.core.common.utils import check_op_str_pattern_valid from msprobe.pytorch.online_dispatch.single_compare import single_benchmark_compare_wrap from rich.console import Console from rich.table import Table @@ -35,31 +34,6 @@ ResultInfo = namedtuple('ResultInfo', ['api_name', 'is_fwd_success', 'is_bwd_suc 'fwd_compare_alg_results', 'bwd_compare_alg_results']) -def get_file_content_bytes(file): - with FileOpen(file, 'rb') as file_handle: - return file_handle.read() - - -def get_json_contents(file_path): - ops = get_file_content_bytes(file_path) - try: - json_obj = json.loads(ops) - except ValueError as error: - logger.error('Failed to load "%s". %s' % (file_path, str(error))) - raise CompareException(CompareException.INVALID_FILE_ERROR) from error - if not isinstance(json_obj, dict): - logger.error('Json file %s, content is not a dictionary!' % file_path) - raise CompareException(CompareException.INVALID_FILE_ERROR) - return json_obj - - -def write_csv(data, filepath): - with FileOpen(filepath, 'a', encoding='utf-8-sig') as f: - writer = csv.writer(f) - writer.writerows(data) - change_mode(filepath, FileCheckConst.DATA_FILE_AUTHORITY) - - class Saver: # consts for result csv COLUMN_API_NAME = "API name" diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dispatch.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dispatch.py index df8899954a75da3d3ea6c1d7aa688f41d7fb9e70..b9201cfaac74e38bbbaee468b6c452895f8b38f9 100644 --- a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dispatch.py +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dispatch.py @@ -28,7 +28,7 @@ except ImportError: else: is_npu = True -from msprobe.core.common.file_utils import check_file_or_directory_path, load_yaml +from msprobe.core.common.file_utils import check_file_or_directory_path, load_yaml, FileOpen, create_directory from msprobe.core.common.const import Const, CompareConst from msprobe.pytorch.common.log import logger from 
msprobe.pytorch.online_dispatch.dump_compare import dispatch_workflow, dispatch_multiprocess, error_call, \ @@ -36,7 +36,7 @@ from msprobe.pytorch.online_dispatch.dump_compare import dispatch_workflow, disp from msprobe.pytorch.online_dispatch.utils import get_callstack, data_to_cpu, get_sys_info, DispatchException, \ COMPARE_LOGO from msprobe.pytorch.online_dispatch.compare import Comparator -from msprobe.core.common.file_utils import FileOpen, create_directory +from msprobe.core.common.utils import check_str_param, safe_get_value current_time = time.strftime("%Y%m%d%H%M%S") RESULT_FILE_NAME = "accuracy_checking_result_" + current_time + ".csv" @@ -56,7 +56,7 @@ class PtdbgDispatch(TorchDispatchMode): self.device_id = torch_npu._C._npu_getDevice() self.dump_mode = dump_mode - self.dump_api_list = api_list + self.dump_api_list = api_list or [] self.debug_flag = debug self.api_index = 0 self.single_api_index_dict = {} @@ -65,8 +65,9 @@ class PtdbgDispatch(TorchDispatchMode): self.all_summary = [] self.call_stack_list = [] self.process_num = process_num - self.filter_dump_api() + self.tag = tag self.check_param() + self.filter_dump_api() dir_name = self.get_dir_name(tag) self.root_path = os.path.join(os.path.realpath(dump_path), dir_name) self.root_cpu_path = os.path.join(self.root_path, f'cpu') @@ -170,17 +171,24 @@ class PtdbgDispatch(TorchDispatchMode): cpu_kwargs = [] data_to_cpu(args, 0, cpu_args) data_to_cpu(kwargs, 0, cpu_kwargs) - cpu_args = cpu_args[0] - cpu_kwargs = cpu_kwargs[0] + + cpu_args = safe_get_value(cpu_args, 0, "cpu_args") + cpu_kwargs = safe_get_value(cpu_kwargs, 0, "cpu_kwargs") with TimeStatistics("NPU RUN", run_param): npu_out = func(*args, **kwargs) npu_out_cpu = [] data_to_cpu(npu_out, 0, npu_out_cpu) - npu_out_cpu = npu_out_cpu[0] + npu_out_cpu = safe_get_value(npu_out_cpu, 0, "npu_out_cpu") with TimeStatistics("CPU RUN", run_param): - cpu_out = func(*cpu_args, **cpu_kwargs) + try: + cpu_out = func(*cpu_args, **cpu_kwargs) + except 
RuntimeError as e: + self.api_index -= 1 + logger.warning(f"RuntimeError: {e}") + logger.warning(f"This aten_api {aten_api} does not support running on cpu, so skip it.") + return npu_out if isinstance(cpu_out, torch.Tensor) and cpu_out.dtype in [torch.bfloat16, torch.float16, torch.half]: cpu_out = cpu_out.float() @@ -272,6 +280,17 @@ class PtdbgDispatch(TorchDispatchMode): if not isinstance(self.dump_api_list, list): logger.error('The type of parameter "api_list" can only be list.') raise DispatchException(DispatchException.INVALID_PARAMETER) + if not all(isinstance(item, str) for item in self.dump_api_list): + logger.error('The type of parameter in "api_list" can only be str.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + if len(self.dump_api_list) > Const.STEP_RANK_MAXIMUM_VALUE: + logger.error('The length of parameter "api_list" should not be greater ' + f'than {Const.STEP_RANK_MAXIMUM_VALUE}.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + for item in self.dump_api_list: + check_str_param(item) + if self.tag is not None: + check_str_param(self.tag) if not isinstance(self.debug_flag, bool): logger.error('The type of parameter "debug" can only be bool.') raise DispatchException(DispatchException.INVALID_PARAMETER) diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dump_compare.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dump_compare.py index edb9c40d38c0c81c0278419a93adbbd1399bb378..b185bc1110d4062d8a31b9cc94dc946d8fb8456c 100644 --- a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dump_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dump_compare.py @@ -19,7 +19,7 @@ import os from datetime import datetime, timezone import torch -from msprobe.core.common.file_utils import FileOpen, save_npy +from msprobe.core.common.file_utils import FileOpen, save_npy, save_json from msprobe.pytorch.common.log import logger @@ -107,10 +107,8 @@ def dump_data(data, prefix, 
dump_path): def save_temp_summary(api_index, single_api_summary, path, lock): summary_path = os.path.join(path, f'summary.json') lock.acquire() - with FileOpen(summary_path, "a") as f: - json.dump([api_index, single_api_summary], f) - f.write('\n') - lock.release() + data = [api_index, single_api_summary] + save_json(summary_path, data, mode='a') def dispatch_workflow(run_param: DispatchRunParam, data_info: DisPatchDataInfo): diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/single_compare.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/single_compare.py index 3d52db1ba6ff0b5de45ada16d4cc3e83bd4868c3..f5c3bb955c0058221c374b8201e7b92e4bbbdd07 100644 --- a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/single_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/single_compare.py @@ -19,10 +19,10 @@ from functools import wraps import torch from msprobe.pytorch.common.log import logger +from msprobe.pytorch.online_dispatch.utils import check_idx_valid from prettytable import PrettyTable - def func_log_wrapper(): def _out_wrapper(func): @wraps(func) @@ -217,12 +217,13 @@ class SingleBenchmarkAccuracyCompare: err_for_max = torch.where(abs_err_idx == 1, diff_abs, zeros) logging.debug("err_for_max for abs %s", err_for_max) max_abs_idx = torch.argmax(err_for_max) - max_abs_diff = diff_abs[max_abs_idx] + if check_idx_valid(diff_abs, max_abs_idx): + max_abs_diff = diff_abs[max_abs_idx] elif torch.sum(abs_mask_idx) > 0: err_for_max = torch.where(abs_mask_idx == 1, diff_abs, zeros) logging.debug("error_for_max for abs %s", err_for_max) max_abs_idx = torch.argmax(err_for_max) - if err_for_max.max() != 0: + if err_for_max.max() != 0 and check_idx_valid(diff_abs, max_abs_idx): max_abs_diff = diff_abs[max_abs_idx] return (float(max_abs_diff), int(max_abs_idx) if torch.is_tensor(max_abs_idx) else max_abs_idx) @@ -247,12 +248,13 @@ class SingleBenchmarkAccuracyCompare: err_for_max = torch.where(rel_err_idx == 1, diff_rel, zeros) 
logging.debug("error_for_max for rel %s", err_for_max) max_rel_idx = torch.argmax(err_for_max) - max_rel_diff = diff_rel[max_rel_idx] + if check_idx_valid(diff_rel, max_rel_idx): + max_rel_diff = diff_rel[max_rel_idx] elif torch.sum(rel_mask_idx > 0): err_for_max = torch.where(rel_mask_idx == 1, diff_rel, zeros) logging.debug("err_for_max for rel %s", err_for_max) max_rel_idx = torch.argmax(err_for_max) - if torch.sum(err_for_max) != 0: + if torch.sum(err_for_max) != 0 and check_idx_valid(diff_rel, max_rel_idx): max_rel_diff = diff_rel[max_rel_idx] return (float(max_rel_diff), int(max_rel_idx) if torch.is_tensor(max_rel_idx) else max_rel_idx) @@ -282,7 +284,8 @@ class SingleBenchSummary: def get_result_msg(self): result_str = "" if self.failed_info: - return self.failed_info + result_str = self.failed_info + return result_str if self.result: result_str += "误差均衡性EB: %s <= 阈值%s\n" % (self.error_balance, self.eb_thd) diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/utils.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/utils.py index 527b622908a9ad7f6f023ce7284320d5f426ccdc..ae8b9435a34ced607d4e70fab615b2b017083fe9 100644 --- a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/utils.py @@ -14,9 +14,10 @@ # limitations under the License. 
import inspect + +import numpy as np import psutil import torch -import numpy as np try: import torch_npu @@ -26,6 +27,7 @@ else: pta_cpu_device = torch.device("cpu") from msprobe.core.common.const import CompareConst +from msprobe.pytorch.common.log import logger cpu_device = torch._C.device("cpu") COLOR_RED = '\033[31m' @@ -75,8 +77,11 @@ INT_TYPE = [np.int32, np.int64] def get_callstack(): callstack = [] for (_, path, line, func, code, _) in inspect.stack()[2:]: - stack_line = [path, str(line), func, code[0].strip() if code else code] - callstack.append(stack_line) + try: + stack_line = [path, str(line), func, code[0].strip() if code else code] + callstack.append(stack_line) + except IndexError: + logger.error("Failed to get callstack for code:{} index out of range".format(code)) return callstack @@ -142,3 +147,9 @@ class DispatchException(Exception): def __str__(self): return self.err_msg + + +def check_idx_valid(data, idx): + if data is not None and data.numel() > 0 and 0 <= idx < data.numel(): + return True + return False diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/compare.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/compare.py index 4ccd8a79202c2758b5c34e6bcbe89ed1119db360..4a4632913cfbd92d274e9467929fdfcdf0e7ef0e 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/compare.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,16 +12,17 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and # limitations under the License. -""" import os import time -import numpy as np from collections import namedtuple -from msprobe.pytorch.parse_tool.lib.utils import Util + +import numpy as np + +from msprobe.core.common.file_utils import create_directory, load_npy, save_npy_to_txt, write_csv, os_walk_for_files from msprobe.pytorch.parse_tool.lib.config import Const from msprobe.pytorch.parse_tool.lib.parse_exception import ParseException -from msprobe.core.common.file_utils import create_directory, load_npy, save_npy_to_txt, write_csv, os_walk_for_files +from msprobe.pytorch.parse_tool.lib.utils import Util class Compare: @@ -126,7 +126,7 @@ class Compare: all_close = np.allclose(data_left, data_right, atol=al, rtol=rl) np.seterr(divide='raise') cos_sim = np.dot(data_left, data_right) / ( - np.sqrt(np.dot(data_left, data_left)) * np.sqrt(np.dot(data_right, data_right))) + np.sqrt(np.dot(data_left, data_left)) * np.sqrt(np.dot(data_right, data_right))) err_cnt = 0 total_cnt = data_left.shape[0] diff_table_columns = ['Index', 'Left', 'Right', 'Diff'] diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/config.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/config.py index 53fb9ac4407b326d288e3249e204710a4ed30cfa..6dc70afe465bc81e57e22dbe5105d566979ebd03 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/config.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/config.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -13,14 +12,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" import os + import numpy as np class Const: - MS_ACCU_CMP_PATH = '/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py' MS_ACCU_CMP_FILE_NAME = 'msaccucmp.py' ROOT_DIR = "" diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/file_desc.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/file_desc.py index 14ba27277168bc110b38287afbba957b69f8cdff..c883547251c6fafbfb5884511817f0a632d91ef4 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/file_desc.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/file_desc.py @@ -1,4 +1,18 @@ -# coding=utf-8 +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ import os diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/interactive_cli.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/interactive_cli.py index 1ea7dd30153e458b758dc0a79779b54a25fe8289..ac6f3d234e3a6681a580f16e56d94204223102f1 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/interactive_cli.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/interactive_cli.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,13 +12,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" -import cmd + import argparse -from msprobe.pytorch.parse_tool.lib.parse_tool import ParseTool -from msprobe.pytorch.parse_tool.lib.utils import Util +import cmd + from msprobe.pytorch.parse_tool.lib.config import Const from msprobe.pytorch.parse_tool.lib.parse_exception import catch_exception +from msprobe.pytorch.parse_tool.lib.parse_tool import ParseTool +from msprobe.pytorch.parse_tool.lib.utils import Util class InteractiveCli(cmd.Cmd): @@ -81,7 +81,7 @@ class InteractiveCli(cmd.Cmd): self.util.check_files_in_path(args.my_dump_path) self.util.check_files_in_path(args.golden_dump_path) if self.util.dir_contains_only(args.my_dump_path, ".npy") and \ - self.util.dir_contains_only(args.golden_dump_path, ".npy"): + self.util.dir_contains_only(args.golden_dump_path, ".npy"): self.parse_tool.do_compare_converted_dir(args) else: self.parse_tool.do_vector_compare(args) diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_exception.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_exception.py index 7525230cedc7ff11d4112a55998c6414e8f09217..d6ab6c708aa50a4d050a87b464d740d316065e1c 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_exception.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_exception.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,13 +12,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import logging + from msprobe.core.common.exceptions import FileCheckException class ParseException(Exception): - PARSE_INVALID_PATH_ERROR = 0 PARSE_NO_FILE_ERROR = 1 PARSE_NO_MODULE_ERROR = 2 @@ -51,4 +50,5 @@ def catch_exception(func): except FileCheckException: log.error("Command execution failed") return result + return inner diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_tool.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_tool.py index 6daa622b9052a4b309d75b219a90ed28580d9e69..ca508886f5324a436e357d5dd9598c8c4f0cd363 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_tool.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_tool.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,17 +12,18 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import argparse import os from collections import namedtuple +from msprobe.core.common.file_utils import create_directory +from msprobe.pytorch.parse_tool.lib.compare import Compare from msprobe.pytorch.parse_tool.lib.config import Const +from msprobe.pytorch.parse_tool.lib.parse_exception import catch_exception, ParseException from msprobe.pytorch.parse_tool.lib.utils import Util -from msprobe.pytorch.parse_tool.lib.compare import Compare from msprobe.pytorch.parse_tool.lib.visualization import Visualization -from msprobe.pytorch.parse_tool.lib.parse_exception import catch_exception, ParseException -from msprobe.core.common.file_utils import create_directory + class ParseTool: def __init__(self): @@ -117,7 +117,8 @@ class ParseTool: self.util.check_path_valid(args.golden_dump_path) self.util.check_file_path_format(args.my_dump_path, Const.NPY_SUFFIX) self.util.check_file_path_format(args.golden_dump_path, Const.NPY_SUFFIX) - compare_data_args = namedtuple('compare_data_args', ['my_dump_path', 'golden_dump_path', 'save', 'rtol', 'atol', 'count']) + compare_data_args = namedtuple('compare_data_args', + ['my_dump_path', 'golden_dump_path', 'save', 'rtol', 'atol', 'count']) compare_data_args.__new__.__defaults__ = (False, 0.001, 0.001, 20) res = compare_data_args(args.my_dump_path, args.golden_dump_path, args.save, args.rtol, args.atol, args.count) self.compare.compare_data(res) @@ -132,8 +133,7 @@ class ParseTool: " '-m' and '-g'.") raise ParseException("My directory path and golden directory path is same.") output_path = self.util.path_strip(args.output_path) if args.output_path else Const.BATCH_COMPARE_DIR - if not os.path.isdir(output_path): - os.makedirs(output_path, mode=0o750) + create_directory(output_path) self.compare.compare_converted_dir(my_dump_dir, golden_dump_dir, output_path) @catch_exception diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/utils.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/utils.py index 
3bdc419dd0426b6b9f4551dc176f0fd909cd741b..66229d36b8d0b532eea48f1aa5d96e178ed80cdc 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/utils.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,24 +12,24 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" + +import hashlib import os import re -import sys import subprocess -import hashlib +import sys import time -import numpy as np from collections import namedtuple -from msprobe.pytorch.parse_tool.lib.config import Const -from msprobe.pytorch.parse_tool.lib.file_desc import DumpDecodeFileDesc, FileDesc -from msprobe.pytorch.parse_tool.lib.parse_exception import ParseException -from msprobe.core.common.file_utils import change_mode, check_other_user_writable,\ - check_path_executable, check_path_owner_consistent + +import numpy as np from msprobe.core.common.const import FileCheckConst +from msprobe.core.common.file_utils import change_mode, check_other_user_writable, \ + check_path_executable, check_path_owner_consistent from msprobe.core.common.file_utils import check_file_or_directory_path, remove_path, check_file_type, os_walk_for_files from msprobe.pytorch.common.log import logger - +from msprobe.pytorch.parse_tool.lib.config import Const +from msprobe.pytorch.parse_tool.lib.file_desc import DumpDecodeFileDesc, FileDesc +from msprobe.pytorch.parse_tool.lib.parse_exception 
import ParseException try: from rich.traceback import install @@ -135,7 +134,7 @@ class Util: zero_mask = (data == 0) data[zero_mask] += np.finfo(float).eps return data - + @staticmethod def dir_contains_only(path, endfix): files = os_walk_for_files(path, Const.MAX_TRAVERSAL_DEPTH) @@ -143,11 +142,11 @@ class Util: if not file['file'].endswith(endfix): return False return True - + @staticmethod def localtime_str(): return time.strftime("%Y%m%d%H%M%S", time.localtime()) - + @staticmethod def change_filemode_safe(path): change_mode(path, FileCheckConst.DATA_FILE_AUTHORITY) @@ -208,7 +207,7 @@ class Util: def list_numpy_files(self, path, extern_pattern=''): return self.list_file_with_pattern(path, Const.NUMPY_PATTERN, extern_pattern, - self._gen_numpy_file_info) + self._gen_numpy_file_info) def create_columns(self, content): if not Columns: diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/visualization.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/visualization.py index 8d807ab568557cf3dfbc1d115f53a5d72ad97d20..5b53831b1c6fb9280dbad5621ee222baa2712225 100644 --- a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/visualization.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/visualization.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,14 +12,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import json -import numpy as np +import numpy as np +from msprobe.core.common.file_utils import FileOpen, load_npy, save_npy_to_txt from msprobe.pytorch.parse_tool.lib.config import Const -from msprobe.pytorch.parse_tool.lib.utils import Util from msprobe.pytorch.parse_tool.lib.parse_exception import ParseException -from msprobe.core.common.file_utils import FileOpen, load_npy, save_npy_to_txt +from msprobe.pytorch.parse_tool.lib.utils import Util class Visualization: @@ -65,6 +64,8 @@ class Visualization: self.util.log.error("%s %s in line %s" % ("JSONDecodeError", str(e), pkl_line)) self.util.log.warning("Please check the pkl file") raise ParseException(ParseException.PARSE_JSONDECODE_ERROR) from e + if not isinstance(msg, list) or len(msg) == 0: + break info_prefix = msg[0] if not info_prefix.startswith(api_name): continue @@ -75,7 +76,7 @@ class Visualization: self.util.log.info(" File \"{}\", line {}, in {}".format(item[0], item[1], item[2])) self.util.log.info(" {}".format(item[3])) continue - if len(msg) > 5 and len(msg[5]) >= 3: + if len(msg) > 5 and len(msg[5]) >= 3: summery_info = " [{}][dtype: {}][shape: {}][max: {}][min: {}][mean: {}]" \ .format(msg[0], msg[3], msg[4], msg[5][0], msg[5][1], msg[5][2]) if not title_printed: diff --git a/debug/accuracy_tools/msprobe/pytorch/pt_config.py b/debug/accuracy_tools/msprobe/pytorch/pt_config.py index 9ddc16c24dbee02fb2773ea9f07642d2f0f01e83..8293ac969490b103eef630081b6001234ca8bb07 100644 --- a/debug/accuracy_tools/msprobe/pytorch/pt_config.py +++ b/debug/accuracy_tools/msprobe/pytorch/pt_config.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,11 +14,13 @@ # limitations under the License. 
import os +import re from msprobe.core.common.const import Const from msprobe.core.common.exceptions import MsprobeException -from msprobe.core.common.file_utils import FileOpen, load_json +from msprobe.core.common.file_utils import FileOpen, load_json, check_file_or_directory_path, check_crt_valid from msprobe.core.common.log import logger +from msprobe.core.common.utils import is_int from msprobe.core.common_config import BaseConfig, CommonConfig from msprobe.core.grad_probe.constant import level_adp from msprobe.core.grad_probe.utils import check_bounds @@ -41,15 +43,35 @@ class TensorConfig(BaseConfig): self.online_run_ut_recompute = json_config.get("online_run_ut_recompute", False) self.check_config() self._check_file_format() - self._check_tls_path_config() + if self.online_run_ut: + self._check_online_run_ut() def _check_file_format(self): if self.file_format is not None and self.file_format not in ["npy", "bin"]: raise Exception("file_format is invalid") - def _check_tls_path_config(self): - if self.tls_path and not os.path.exists(self.tls_path): - raise Exception("tls_path: %s does not exist" % self.tls_path) + def _check_online_run_ut(self): + if not isinstance(self.online_run_ut, bool): + raise Exception(f"online_run_ut: {self.online_run_ut} is invalid.") + + if not isinstance(self.online_run_ut_recompute, bool): + raise Exception(f"online_run_ut_recompute: {self.online_run_ut_recompute} is invalid.") + + if self.nfs_path: + check_file_or_directory_path(self.nfs_path, isdir=True) + return + + if self.tls_path: + check_file_or_directory_path(self.tls_path, isdir=True) + check_file_or_directory_path(os.path.join(self.tls_path, "client.key")) + check_file_or_directory_path(os.path.join(self.tls_path, "client.crt")) + check_crt_valid(os.path.join(self.tls_path, "client.crt")) + + if not isinstance(self.host, str) or not re.match(Const.ipv4_pattern, self.host): + raise Exception(f"host: {self.host} is invalid.") + + if not isinstance(self.port, int) or not (0 
< self.port <= 65535): + raise Exception(f"port: {self.port} is invalid, port range 0-65535.") class StatisticsConfig(BaseConfig): @@ -71,7 +93,7 @@ class OverflowCheckConfig(BaseConfig): self.check_overflow_config() def check_overflow_config(self): - if self.overflow_nums is not None and not isinstance(self.overflow_nums, int): + if self.overflow_nums is not None and not is_int(self.overflow_nums): raise Exception("overflow_num is invalid") if self.check_mode is not None and self.check_mode not in ["all", "aicore", "atomic"]: raise Exception("check_mode is invalid") @@ -171,7 +193,7 @@ class FreeBenchmarkCheckConfig(BaseConfig): ) def _check_preheat_config(self): - if not isinstance(self.preheat_step, int): + if not is_int(self.preheat_step): msg = "preheat_step is invalid, it should be an integer" logger.error_log_with_exp( msg, MsprobeException(MsprobeException.INVALID_PARAM_ERROR, msg) @@ -181,7 +203,7 @@ class FreeBenchmarkCheckConfig(BaseConfig): logger.error_log_with_exp( msg, MsprobeException(MsprobeException.INVALID_PARAM_ERROR, msg) ) - if not isinstance(self.max_sample, int): + if not is_int(self.max_sample): msg = "max_sample is invalid, it should be an integer" logger.error_log_with_exp( msg, MsprobeException(MsprobeException.INVALID_PARAM_ERROR, msg) @@ -281,28 +303,25 @@ class GradToolConfig(BaseConfig): check_bounds(self.bounds) +class StructureConfig(BaseConfig): + def __init__(self, json_config): + super().__init__(json_config) + + +TaskDict = { + Const.TENSOR: TensorConfig, + Const.STATISTICS: StatisticsConfig, + Const.OVERFLOW_CHECK: OverflowCheckConfig, + Const.FREE_BENCHMARK: FreeBenchmarkCheckConfig, + Const.RUN_UT: RunUTConfig, + Const.GRAD_PROBE: GradToolConfig, + Const.STRUCTURE: StructureConfig +} + + def parse_task_config(task, json_config): - default_dic = {} - if task == Const.TENSOR: - config_dic = json_config.get(Const.TENSOR, default_dic) - return TensorConfig(config_dic) - elif task == Const.STATISTICS: - config_dic = 
json_config.get(Const.STATISTICS, default_dic) - return StatisticsConfig(config_dic) - elif task == Const.OVERFLOW_CHECK: - config_dic = json_config.get(Const.OVERFLOW_CHECK, default_dic) - return OverflowCheckConfig(config_dic) - elif task == Const.FREE_BENCHMARK: - config_dic = json_config.get(Const.FREE_BENCHMARK, default_dic) - return FreeBenchmarkCheckConfig(config_dic) - elif task == Const.RUN_UT: - config_dic = json_config.get(Const.RUN_UT, default_dic) - return RunUTConfig(config_dic) - elif task == Const.GRAD_PROBE: - config_dic = json_config.get(Const.GRAD_PROBE, default_dic) - return GradToolConfig(config_dic) - else: - return StatisticsConfig(default_dic) + task_map = json_config.get(task, dict()) + return TaskDict.get(task)(task_map) def parse_json_config(json_file_path, task): diff --git a/debug/accuracy_tools/msprobe/pytorch/service.py b/debug/accuracy_tools/msprobe/pytorch/service.py index 04f9136e43005d18f9eede6a84a0f2e89d546268..fd81a7f1cf064506a4fb91481429828c97113509 100644 --- a/debug/accuracy_tools/msprobe/pytorch/service.py +++ b/debug/accuracy_tools/msprobe/pytorch/service.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -15,23 +15,24 @@ import functools import os +from collections import namedtuple, defaultdict -from collections import namedtuple import torch from msprobe.core.common.const import Const -from msprobe.core.common.exceptions import DistributedNotInitializedError, MsprobeException +from msprobe.core.common.exceptions import DistributedNotInitializedError from msprobe.core.common.file_utils import create_directory -from msprobe.core.common.utils import print_tools_ends_info +from msprobe.core.common.utils import print_tools_ends_info, DumpPathAggregation from msprobe.core.data_dump.data_collector import build_data_collector from msprobe.core.data_dump.data_processor.base import ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs from msprobe.core.data_dump.scope import BaseScope +from msprobe.pytorch.api_accuracy_checker.common.utils import ApiData from msprobe.pytorch.common.log import logger -from msprobe.pytorch.common.utils import get_rank_if_initialized -from msprobe.pytorch.hook_module import remove_dropout +from msprobe.pytorch.common.utils import get_rank_if_initialized, is_recomputation +from msprobe.pytorch.dump.kernel_dump.kernel_config import create_kernel_config_json +from msprobe.pytorch.dump.module_dump.module_processer import ModuleProcesser from msprobe.pytorch.hook_module.api_registry import api_register from msprobe.pytorch.hook_module.hook_module import HOOKModule -from msprobe.pytorch.module_processer import ModuleProcesser -from msprobe.pytorch.api_accuracy_checker.common.utils import ApiData +from msprobe.pytorch.hook_module.register_optimizer_hook import register_optimizer_hook torch_version_above_or_equal_2 = torch.__version__.split('+')[0] >= '2.0' if torch_version_above_or_equal_2: @@ -47,100 +48,206 @@ class Service: self.data_collector = build_data_collector(config) self.module_processor = ModuleProcesser(self.data_collector.scope) self.switch = False + 
self.inner_switch = False self.current_iter = 0 self.first_start = True self.current_rank = None self.dump_iter_dir = None self.should_stop_service = False self.attl = None - - @staticmethod - def forward_backward_dump_end(): - logger.info_on_rank_0("Data needed ends here.") - api_register.api_originality() - - @staticmethod - def is_registered_backward_hook(module): - if hasattr(module, '_backward_hooks') and \ - len(module._backward_hooks) > 0 and \ - module._is_full_backward_hook is False: - return True - return False - - def check_register_full_backward_hook(self, module): - if self.is_registered_backward_hook(module): - module._backward_hooks.clear() - module._is_full_backward_hook = None - logger.warning("Found deprecated backward hooks. Removing them and switching to full backward hooks.") + self.params_grad_info = {} + self.hook_handle_dict = {} + # 提前注册,确保注册尽可能多的API hook + self.register_api_hook() + self.init_for_debug_level() def build_hook(self, module_type, name): def pre_hook(api_or_module_name, module, args, kwargs): - if not self.should_execute_hook(): + if not self.should_execute_hook(module_type, module, True): return args, kwargs + is_recompute = is_recomputation() + self.inner_switch = True if module_type == BaseScope.Module_Type_Module: - api_or_module_name = module.mindstudio_reserved_name + api_or_module_name = module.mindstudio_reserved_name[-1] + else: + module.forward_data_collected = True + HOOKModule.add_module_count(name) self.data_collector.update_api_or_module_name(api_or_module_name) if self.config.online_run_ut: + self.inner_switch = False return None, None if self.data_collector: module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=None) - self.data_collector.pre_forward_data_collect(api_or_module_name, module, pid, module_input_output) + self.data_collector.forward_input_data_collect( + api_or_module_name, + module, + pid, + module_input_output, + is_recompute + ) + + self.inner_switch = False return 
args, kwargs + def grad_hook(module, ori_name, param_name): + def hook_fn(grad): + if not self.should_execute_hook(module_type, module, False): + return grad + self.inner_switch = True + self.data_collector.params_data_collect(ori_name, param_name, pid, grad) + self.inner_switch = False + return grad + + return hook_fn + + def register_param_hook(ori_name, module, params_dict): + ''' + 注册参数hook + ''' + # data_mode为forward时,不注册参数hook + if not (Const.FORWARD in self.config.data_mode and Const.BACKWARD not in self.config.data_mode): + for param_name, param in params_dict.items(): + if param.requires_grad: + name = ori_name + Const.SEP + param_name + old_handle = self.hook_handle_dict.get(name) + if old_handle and hasattr(old_handle, "remove"): + old_handle.remove() + handle = param.register_hook(grad_hook(module, ori_name, param_name)) + self.hook_handle_dict[name] = handle + + def init_params_grad_info(module, params_dict): + ''' + 初始化参数梯度信息, 在前向hook结束后, 将参数梯度信息写入cache_data中用于占位 + ''' + if not params_dict: + return + if not (Const.FORWARD in self.config.data_mode and Const.BACKWARD not in self.config.data_mode): + grad_name = module.params_grad_name if hasattr(module, 'params_grad_name') else None + # 判断是否已经在cache_data中进行了占位, 若没有则先写入cache_data中 + if not self.params_grad_info.get(grad_name): + data_info = {grad_name: {key: [None] for key, value in params_dict.items() if value.requires_grad}} + # 当模块中的参数有requires_grad属性为True时,才会进行梯度计算,此时才需要占位 + if data_info.get(grad_name): + # 将grad_name的data_info先写入cache_data中, 梯度计算后再更新 + self.data_collector.handle_data(grad_name, data_info, + flush=self.data_collector.data_processor.is_terminated) + # 记录当前模块的参数梯度信息已占位 + self.params_grad_info[grad_name] = True + def forward_hook(api_or_module_name, module, args, kwargs, output): - if not self.should_execute_hook(): + if not self.should_execute_hook(module_type, module, True): return None + is_recompute = is_recomputation() - if module_type == BaseScope.Module_Type_Module: - 
api_or_module_name = module.mindstudio_reserved_name - self.data_collector.update_api_or_module_name(api_or_module_name) - + self.inner_switch = True if self.config.online_run_ut: + self.data_collector.update_api_or_module_name(api_or_module_name) if self.data_collector.scope and not self.data_collector.scope.check(api_or_module_name): return None - api_data = ApiData(name[:-1], args, kwargs, output, self.current_iter, self.current_rank) + api_data = ApiData( + api_or_module_name[:-len(Const.FORWARD_NAME_SUFFIX)], + args, + kwargs, + output, + self.current_iter, + self.current_rank + ) self.attl_send(api_data) + self.inner_switch = False return None - if self.data_collector: - module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=output) - self.data_collector.forward_data_collect(api_or_module_name, module, pid, module_input_output) - if self.data_collector.if_return_forward_new_output(): - return self.data_collector.get_forward_new_output() + module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=output) + if module_type == BaseScope.Module_Type_Module: + api_or_module_name = module.mindstudio_reserved_name[-1] + self.data_collector.update_api_or_module_name(api_or_module_name) + params_dict = {} + if self.config.task != Const.STRUCTURE: + params_dict = { + key.split(Const.SEP)[-1]: value + for key, value in module.named_parameters(recurse=False) + } + setattr(module_input_output, Const.PARAMS, params_dict) + # 判断是否需要注册参数hook + if params_dict: + ori_name = api_or_module_name.rsplit(Const.SEP, 2)[0] + grad_name = ori_name + Const.SEP + Const.PARAMS_GRAD + # 首次执行前向hook时,添加params_grad_name属性,并注册参数hook + setattr(module, 'params_grad_name', grad_name) + register_param_hook(ori_name, module, params_dict) + self.data_collector.forward_data_collect( + api_or_module_name, + module, + pid, + module_input_output, + is_recompute + ) + init_params_grad_info(module, params_dict) + else: + 
self.data_collector.update_api_or_module_name(api_or_module_name) + self.data_collector.forward_output_data_collect( + api_or_module_name, + module, + pid, + module_input_output, + is_recompute + ) + + if self.data_collector.if_return_forward_new_output(): + forward_new_output = self.data_collector.get_forward_new_output() + self.inner_switch = False + return forward_new_output + self.inner_switch = False return output def forward_hook_torch_version_below_2(api_or_module_name, module, args, output): return forward_hook(api_or_module_name, module, args, {}, output) def backward_hook(api_or_module_name, module, grad_input, grad_output): - if not self.should_execute_hook(): + if not self.should_execute_hook(module_type, module, False): return + is_recompute = is_recomputation() + self.inner_switch = True if module_type == BaseScope.Module_Type_Module: - api_or_module_name = module.mindstudio_reserved_name + api_or_module_name = module.mindstudio_reserved_name[-1] self.data_collector.update_api_or_module_name(api_or_module_name) if self.config.online_run_ut: + self.inner_switch = False return if self.data_collector: # 此处获取到的grad_input实际为反向过程的输出数据,grad_output为反向过程的输入数据,因此传入时调换顺序 module_input_output = ModuleBackwardInputsOutputs(grad_input=grad_output, grad_output=grad_input) - self.data_collector.backward_data_collect(api_or_module_name, module, pid, module_input_output) + self.data_collector.backward_data_collect( + api_or_module_name, + module, + pid, + module_input_output, + is_recompute + ) + self.inner_switch = False pid = os.getpid() - forward_name_template = name + Const.FORWARD - backward_name_template = name + Const.BACKWARD - pre_forward_hook_fn = functools.partial(pre_hook, forward_name_template) - forward_hook_fn = functools.partial(forward_hook, forward_name_template) - backward_hook_fn = functools.partial(backward_hook, backward_name_template) - forward_hook_torch_version_below_2_fn = functools.partial(forward_hook_torch_version_below_2, - 
forward_name_template) + full_forward_name = None + full_backward_name = None + if module_type == BaseScope.Module_Type_API: + full_forward_name = name + str(HOOKModule.get_module_count(name)) + Const.SEP + Const.FORWARD + full_backward_name = name + str(HOOKModule.get_module_count(name)) + Const.SEP + Const.BACKWARD + pre_forward_hook_fn = functools.partial(pre_hook, full_forward_name) + forward_hook_fn = functools.partial(forward_hook, full_forward_name) + backward_hook_fn = functools.partial(backward_hook, full_backward_name) + forward_hook_torch_version_below_2_fn = functools.partial( + forward_hook_torch_version_below_2, + full_forward_name + ) return HookFn(pre_forward_hook_fn, forward_hook_fn, backward_hook_fn, forward_hook_torch_version_below_2_fn) - def start(self, model, api_origin=False): + def start(self, model): + if self.config.level == Const.LEVEL_DEBUG: + return if self.need_stop_service(): return @@ -154,42 +261,52 @@ class Service: if self.config.rank and self.current_rank not in self.config.rank: return - self.register_hook_new() + self.register_module_hook() + if self.config.level == Const.LEVEL_MIX: + register_optimizer_hook(self.data_collector) self.first_start = False - if api_origin: - api_register.api_modularity() if self.config.online_run_ut and torch_version_above_or_equal_2: run_ut_dispatch(self.attl, True, self.config.online_run_ut_recompute) self.switch = True logger.info_on_rank_0(f"Dump switch is turned on at step {self.current_iter}. 
") - if self.config.level != "L2" and not self.config.online_run_ut: + if not self.config.online_run_ut: self.create_dirs() logger.info_on_rank_0(f"Dump data will be saved in {self.dump_iter_dir}.") def stop(self): - if self.should_stop_service: + if self.config.level == Const.LEVEL_DEBUG: return - if self.config.level == "L2": + if self.should_stop_service: return if self.config.step and self.current_iter not in self.config.step: return if self.config.rank and self.current_rank not in self.config.rank: return self.switch = False + if self.config.level == Const.LEVEL_L2: + return if self.config.online_run_ut and torch_version_above_or_equal_2: run_ut_dispatch(self.attl, False, self.config.online_run_ut_recompute) return + if self.config.async_dump: + self.data_collector.fill_stack_tensor_data() + if self.config.task == Const.TENSOR: + self.data_collector.data_processor.dump_async_data() self.data_collector.write_json() def step(self): + if self.config.level == Const.LEVEL_DEBUG: + return if self.should_stop_service: return + if self.config.async_dump: + self.data_collector.fill_stack_tensor_data() + if self.config.task == Const.TENSOR: + self.data_collector.data_processor.dump_async_data() + self.data_collector.write_json() self.current_iter += 1 self.data_collector.update_iter(self.current_iter) - - ModuleProcesser.reset_module_stats() - HOOKModule.reset_module_stats() - self.data_collector.data_writer.reset_cache() + self.reset_status() def need_stop_service(self): if self.should_stop_service: @@ -200,8 +317,6 @@ class Service: if self.config.online_run_ut: # send stop signal if online_run_ut self.attl_stop() - if self.config.level in [Const.LEVEL_L1, Const.LEVEL_L2, Const.LEVEL_MIX]: - api_register.api_originality() self.switch = False self.should_stop_service = True print_tools_ends_info() @@ -210,10 +325,18 @@ class Service: return True return False - def should_execute_hook(self): - if not self.switch: + def should_execute_hook(self, hook_type, module, 
is_forward): + is_module_hook = hook_type == BaseScope.Module_Type_Module + if is_module_hook and not self.switch: return False - if self.data_collector and self.data_collector.data_processor.is_terminated: + elif not is_module_hook and is_forward and not self.switch: + return False + elif not is_module_hook and not is_forward and not module.forward_data_collected: + return False + + if self.inner_switch: + return False + if not self.data_collector or self.data_collector.data_processor.is_terminated: return False return True @@ -221,6 +344,12 @@ class Service: create_directory(self.config.dump_path) self.dump_iter_dir = os.path.join(self.config.dump_path, f"step{self.current_iter}") cur_rank = self.current_rank if self.current_rank is not None else '' + if self.config.level == Const.LEVEL_L2: + create_directory(self.dump_iter_dir) + kernel_config_path = create_kernel_config_json(self.dump_iter_dir, cur_rank) + self.config.kernel_config_path = kernel_config_path + return + dump_dir = os.path.join(self.dump_iter_dir, f"rank{cur_rank}") create_directory(dump_dir) if self.config.task in self.data_collector.tasks_need_tensor_data: @@ -229,55 +358,28 @@ class Service: else: dump_data_dir = None - dump_file_path = os.path.join(dump_dir, "dump.json") - stack_file_path = os.path.join(dump_dir, "stack.json") - construct_file_path = os.path.join(dump_dir, "construct.json") - free_benchmark_file_path = os.path.join(self.config.dump_path, "free_benchmark.csv") - self.data_collector.update_dump_paths( - dump_file_path, stack_file_path, construct_file_path, dump_data_dir, free_benchmark_file_path) - - def register_hook_new(self): - logger.info_on_rank_0("The {} hook function is successfully mounted to the model.".format(self.config.task)) - if self.config.level in ["L0", "mix"]: - if self.model is None: - logger.error_log_with_exp("The model is None.", MsprobeException.INVALID_PARAM_ERROR) - logger.info_on_rank_0("The init dump mode is enabled, and the module dump function will 
not be available") - for name, module in self.model.named_modules(): - if module == self.model: - continue - prefix = BaseScope.Module_Type_Module + Const.SEP + name + Const.SEP + \ - module.__class__.__name__ + Const.SEP - - pre_forward_hook, forward_hook, backward_hook, forward_hook_torch_version_below_2 = self.build_hook( - BaseScope.Module_Type_Module, prefix) - if torch_version_above_or_equal_2: - module.register_forward_hook(forward_hook, with_kwargs=True) - else: - self.check_register_full_backward_hook(module) - module.register_full_backward_hook( - self.module_processor.node_hook(prefix + Const.BACKWARD, Const.STOP)) - module.register_forward_hook(forward_hook_torch_version_below_2) - self.check_register_full_backward_hook(module) - module.register_full_backward_hook(backward_hook) - - module.register_forward_pre_hook( - self.module_processor.node_hook(prefix + Const.FORWARD, Const.START)) - module.register_forward_hook( - self.module_processor.node_hook(prefix + Const.FORWARD, Const.STOP)) - if torch_version_above_or_equal_2: - module.register_full_backward_pre_hook( - self.module_processor.node_hook(prefix + Const.BACKWARD, Const.START)) - self.check_register_full_backward_hook(module) - module.register_full_backward_hook( - self.module_processor.node_hook(prefix + Const.BACKWARD, Const.STOP)) - - if self.config.level in ["mix", "L1", "L2"]: - api_register.initialize_hook(functools.partial(self.build_hook, BaseScope.Module_Type_API), - self.config.online_run_ut) + dump_path_aggregation = DumpPathAggregation() + dump_path_aggregation.dump_file_path = os.path.join(dump_dir, "dump.json") + dump_path_aggregation.stack_file_path = os.path.join(dump_dir, "stack.json") + dump_path_aggregation.construct_file_path = os.path.join(dump_dir, "construct.json") + dump_path_aggregation.dump_tensor_data_dir = dump_data_dir + dump_path_aggregation.free_benchmark_file_path = os.path.join(dump_dir, "free_benchmark.csv") + 
self.data_collector.update_dump_paths(dump_path_aggregation) + self.data_collector.initialize_json_file(framework=Const.PT_FRAMEWORK) + + def register_api_hook(self): + if self.config.level in [Const.LEVEL_MIX, Const.LEVEL_L1, Const.LEVEL_L2]: + logger.info_on_rank_0(f"The api {self.config.task} hook function is successfully mounted to the model.") + api_register.initialize_hook( + functools.partial(self.build_hook, BaseScope.Module_Type_API), + self.config.online_run_ut + ) api_register.api_modularity() - if Const.STATISTICS == self.config.task or Const.TENSOR == self.config.task: - remove_dropout() + def register_module_hook(self): + if self.config.level in [Const.LEVEL_L0, Const.LEVEL_MIX]: + logger.info_on_rank_0(f"The module {self.config.task} hook function is successfully mounted to the model.") + self.module_processor.register_module_hook(self.model, self.build_hook) def attl_init(self): if self.config.online_run_ut: @@ -309,3 +411,60 @@ class Service: elif self.attl.socket_manager is not None: logger.info(f"pid: {os.getpid()} finished, start send STOP signal.") self.attl.socket_manager.send_stop_signal() + + def reset_status(self): + ModuleProcesser.reset_module_stats() + HOOKModule.reset_module_stats() + self.data_collector.reset_status() + self.params_grad_info.clear() + + if self.config.level == Const.LEVEL_L2: + self.data_collector.data_processor.reset_status() + return + if self.config.step and self.current_iter not in self.config.step: + return + if self.config.rank and self.current_rank not in self.config.rank: + return + + def init_for_debug_level(self): + if not (self.config.level == Const.LEVEL_DEBUG and self.config.task in [Const.TENSOR, Const.STATISTICS]): + return + try: + self.current_rank = get_rank_if_initialized() + except DistributedNotInitializedError: + self.current_rank = None + + # dir: dump_path -- rank{} -- debug.json + self.dump_iter_dir = self.config.dump_path + cur_rank = self.current_rank if self.current_rank is not None else '' 
+ dump_dir = os.path.join(self.dump_iter_dir, f"rank{cur_rank}") + create_directory(dump_dir) + if self.config.task in self.data_collector.tasks_need_tensor_data: + dump_data_dir = os.path.join(dump_dir, "dump_tensor_data") + create_directory(dump_data_dir) + else: + dump_data_dir = None + + dump_path_aggregation = DumpPathAggregation() + dump_path_aggregation.dump_tensor_data_dir = dump_data_dir + dump_path_aggregation.debug_file_path = os.path.join(dump_dir, "debug.json") + self.data_collector.update_dump_paths(dump_path_aggregation) + self.data_collector.initialize_json_file(framework=Const.PT_FRAMEWORK) + + self.debug_variable_counter = defaultdict(int) + + def save(self, variable, name, save_backward): + if self.config.level != Const.LEVEL_DEBUG: + return + count = self.debug_variable_counter[name] + self.debug_variable_counter[name] += 1 + + name_with_count = f"{name}.{count}" + grad_name_with_count = f"{name}_grad.{count}" + + # forward save + self.data_collector.debug_data_collect_forward(variable, name_with_count) + + # backward save + if save_backward: + self.data_collector.debug_data_collect_backward(variable, grad_name_with_count) diff --git a/debug/accuracy_tools/msprobe/test/CMakeLists.txt b/debug/accuracy_tools/msprobe/test/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..da8ed956f6bd903fff3b88b8b4512c54a607d063 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/CMakeLists.txt @@ -0,0 +1 @@ +add_subdirectory(cpp) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/dump_no_pt_no_ms.json b/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/dump_no_pt_no_ms.json new file mode 100644 index 0000000000000000000000000000000000000000..63a062d8ffa264a0254fc2bab0208dcf951ae094 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/dump_no_pt_no_ms.json @@ -0,0 +1,3 @@ +{ + "task": "tensor" +} \ No newline at end of file diff 
--git a/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/ms_dump_no_framework.json b/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/ms_dump_no_framework.json new file mode 100644 index 0000000000000000000000000000000000000000..b223c74b2315af1b9454e5f1e70c29502d449c56 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/ms_dump_no_framework.json @@ -0,0 +1,4 @@ +{ + "task": "tensor", + "type": "mindspore.float16" +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/pt_dump_no_framework.json b/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/pt_dump_no_framework.json new file mode 100644 index 0000000000000000000000000000000000000000..2444ae1fd4096b083a9e8a0e51c9166bb990f51f --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/common/test_dump_file/pt_dump_no_framework.json @@ -0,0 +1,4 @@ +{ + "task": "tensor", + "type": "torch.float16" +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/core_ut/common/test_file_utils.py b/debug/accuracy_tools/msprobe/test/core_ut/common/test_file_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9ed13f78aed57fd4d8153e2f005ea14d4fb33643 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/common/test_file_utils.py @@ -0,0 +1,536 @@ +from unittest.mock import patch, mock_open, MagicMock + +import numpy as np +import pandas as pd +import pytest + +from msprobe.core.common.file_utils import * + + +class TestFileChecks: + @pytest.fixture(autouse=True) + def setup(self, tmp_path): + self.test_file = str(tmp_path / "test_file.txt") + self.test_dir = tmp_path / "test_dir" + + # Common mocks + self.mock_stat = MagicMock() + self.mock_stat.st_mode = 0o755 + self.mock_stat.st_uid = 1000 + + def test_check_link(self): + with patch('os.path.islink', return_value=False): + check_link(self.test_file) + + def test_check_path_length(self): + # 
Test normal path + check_path_length(str(self.test_file)) + + # Test too long path + long_path = self.test_dir / ('a' * FileCheckConst.DIRECTORY_LENGTH) + with pytest.raises(FileCheckException) as exc_info: + check_path_length(str(long_path)) + assert exc_info.value.code == FileCheckException.ILLEGAL_PATH_ERROR + + def test_check_path_exists(self): + with patch('os.path.exists', return_value=False), \ + pytest.raises(FileCheckException) as exc_info: + check_path_exists(self.test_file) + assert exc_info.value.code == FileCheckException.ILLEGAL_PATH_ERROR + + def test_check_path_readability(self): + with patch('os.access', return_value=False), \ + pytest.raises(FileCheckException) as exc_info: + check_path_readability(self.test_file) + assert exc_info.value.code == FileCheckException.FILE_PERMISSION_ERROR + + with patch('os.access', return_value=True): + check_path_readability(self.test_file) + + def test_check_path_writability(self): + with patch('os.access', return_value=False), \ + pytest.raises(FileCheckException) as exc_info: + check_path_writability(self.test_file) + assert exc_info.value.code == FileCheckException.FILE_PERMISSION_ERROR + + with patch('os.access', return_value=True): + check_path_writability(self.test_file) + + def test_check_path_executable(self): + with patch('os.access', return_value=False), \ + pytest.raises(FileCheckException) as exc_info: + check_path_executable(self.test_file) + assert exc_info.value.code == FileCheckException.FILE_PERMISSION_ERROR + + with patch('os.access', return_value=True): + check_path_executable(self.test_file) + + def test_check_other_user_writable(self): + self.mock_stat.st_mode = 0o777 # Others writable + with patch('os.stat', return_value=self.mock_stat), \ + pytest.raises(FileCheckException) as exc_info: + check_other_user_writable(self.test_file) + assert exc_info.value.code == FileCheckException.FILE_PERMISSION_ERROR + + self.mock_stat.st_mode = 0o755 # Others not writable + with patch('os.stat', 
return_value=self.mock_stat): + check_other_user_writable(self.test_file) + + def test_check_path_owner_consistent(self): + with patch('os.stat', return_value=self.mock_stat), \ + patch('os.getuid', return_value=1001), \ + pytest.raises(FileCheckException) as exc_info: + check_path_owner_consistent(self.test_file) + assert exc_info.value.code == FileCheckException.FILE_PERMISSION_ERROR + + # Test root user case + with patch('os.stat', return_value=self.mock_stat), \ + patch('os.getuid', return_value=0): + check_path_owner_consistent(self.test_file) + + def test_check_path_pattern_valid(self): + valid_paths = [ + self.test_dir / "file.txt", + self.test_dir / "file-1.txt", + self.test_dir / "file_1.txt", + self.test_dir / "file.1.txt", + ] + + invalid_paths = [ + self.test_dir / "file*.txt", + self.test_dir / "file?.txt", + self.test_dir / "file;.txt", + self.test_dir / "file|.txt", + ] + + for path in valid_paths: + path = str(path) + check_path_pattern_valid(path) + + for path in invalid_paths: + path = str(path) + with pytest.raises(FileCheckException) as exc_info: + check_path_pattern_valid(path) + assert exc_info.value.code == FileCheckException.ILLEGAL_PATH_ERROR + + @pytest.mark.parametrize("file_size,max_size,should_raise", [ + (100, 200, False), + (200, 100, True), + (1024 * 1024, 1024 * 1024 - 1, True), + ]) + def test_check_file_size(self, file_size, max_size, should_raise): + with patch('os.path.getsize', return_value=file_size): + if should_raise: + with pytest.raises(FileCheckException) as exc_info: + check_file_size(self.test_file, max_size) + assert exc_info.value.code == FileCheckException.FILE_TOO_LARGE_ERROR + else: + check_file_size(self.test_file, max_size) + + +class TestFileOperations: + @pytest.fixture(autouse=True) + def setup(self, tmp_path): + self.test_file = tmp_path / "test_file" + self.test_dir = tmp_path / "test_dir" + self.yaml_file = tmp_path / "test.yaml" + self.json_file = tmp_path / "test.json" + self.npy_file = tmp_path / 
"test.npy" + self.excel_file = tmp_path / "test.xlsx" + + def test_check_common_file_size(self): + with patch('os.path.isfile', return_value=True), \ + patch('os.path.getsize', return_value=100): + check_common_file_size(str(self.test_file)) + check_common_file_size(str(self.test_file.with_suffix('.csv'))) + + with patch('os.path.isfile', return_value=True), \ + patch('os.path.getsize', return_value=FileCheckConst.COMMOM_FILE_SIZE + 1), \ + pytest.raises(FileCheckException) as exc_info: + check_common_file_size(str(self.test_file)) + assert exc_info.value.code == FileCheckException.FILE_TOO_LARGE_ERROR + + def test_check_file_suffix(self): + check_file_suffix(str(self.test_file.with_suffix('.txt')), '.txt') + + with pytest.raises(FileCheckException) as exc_info: + check_file_suffix(str(self.test_file.with_suffix('.txt')), '.csv') + assert exc_info.value.code == FileCheckException.INVALID_FILE_ERROR + + check_file_suffix((self.test_file.with_suffix('.txt')), None) + + def test_make_dir(self): + with patch('os.path.isdir', return_value=False), \ + patch('os.makedirs') as mock_makedirs, \ + patch('msprobe.core.common.file_utils.FileChecker') as mock_checker: + mock_checker.return_value.common_check.return_value = None + make_dir(self.test_dir) + mock_makedirs.assert_called_once_with( + str(self.test_dir), + mode=FileCheckConst.DATA_DIR_AUTHORITY, + exist_ok=True + ) + + def test_load_yaml(self): + yaml_content = """ + key: value + list: + - item1 + - item2 + """ + with patch('builtins.open', mock_open(read_data=yaml_content)), \ + patch('msprobe.core.common.file_utils.FileChecker') as mock_checker, \ + patch('msprobe.core.common.file_utils.FileOpen.check_file_path', return_value=None): + mock_checker.return_value.common_check.return_value = str(self.yaml_file) + result = load_yaml(str(self.yaml_file)) + assert result == {'key': 'value', 'list': ['item1', 'item2']} + + # Test load error + with patch('builtins.open', mock_open(read_data="invalid: yaml: content")), \ + 
patch('msprobe.core.common.file_utils.FileChecker') as mock_checker, \ + pytest.raises(RuntimeError): + mock_checker.return_value.common_check.return_value = str(self.yaml_file) + load_yaml(str(self.yaml_file)) + + def test_load_npy(self): + mock_array = np.array([1, 2, 3]) + with patch('numpy.load', return_value=mock_array), \ + patch('msprobe.core.common.file_utils.check_file_or_directory_path', return_value=None): + result = load_npy(str(self.npy_file)) + np.testing.assert_array_equal(result, mock_array) + + with patch('numpy.load', side_effect=Exception), \ + patch('msprobe.core.common.file_utils.check_file_or_directory_path', return_value=None), \ + pytest.raises(RuntimeError): + load_npy(str(self.npy_file)) + + def test_save_npy(self): + mock_data = np.array([1, 2, 3]) + with patch('numpy.save') as mock_save, \ + patch('os.chmod') as mock_chmod: + save_npy(mock_data, str(self.npy_file)) + mock_save.assert_called_once() + + def test_save_json(self): + test_data = {'key': 'value'} + mock_file = mock_open() + + with patch('builtins.open', mock_file), \ + patch('fcntl.flock') as mock_flock, \ + patch('json.dump') as mock_dump, \ + patch('os.chmod') as mock_chmod: + save_json(self.json_file, test_data) + mock_file.assert_called_once_with(str(self.json_file), 'w', encoding='utf-8') + mock_dump.assert_called_once_with(test_data, mock_file(), indent=None) + + def test_load_json(self): + test_data = '{"key": "value"}' + mock_file = mock_open(read_data=test_data) + + with patch('builtins.open', mock_file), \ + patch('fcntl.flock') as mock_flock, \ + patch('msprobe.core.common.file_utils.FileOpen.check_file_path', return_value=None): + result = load_json(str(self.json_file)) + mock_file.assert_called_once_with(str(self.json_file), 'r', encoding='utf-8') + assert mock_flock.call_count == 2 + assert result == {'key': 'value'} + + def test_save_yaml(self): + test_data = {'key': 'value'} + mock_file = mock_open() + + with patch('builtins.open', mock_file), \ + 
patch('fcntl.flock') as mock_flock, \ + patch('yaml.dump') as mock_dump, \ + patch('os.chmod') as mock_chmod: + save_yaml(str(self.yaml_file), test_data) + mock_file.assert_called_once_with(str(self.yaml_file), 'w', encoding='utf-8') + assert mock_flock.call_count == 2 + mock_dump.assert_called_once_with(test_data, mock_file(), sort_keys=False) + + def test_save_excel(self): + df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) + with patch('pandas.DataFrame.to_excel') as mock_to_excel, \ + patch('os.chmod') as mock_chmod: + save_excel(self.excel_file, df) + mock_to_excel.assert_called_once_with(str(self.excel_file), index=False) + + def test_move_file(self): + dst_file = self.test_dir / "moved_file" + with patch('shutil.move') as mock_move, \ + patch('os.chmod') as mock_chmod, \ + patch('msprobe.core.common.file_utils.check_file_or_directory_path', return_value=None), \ + patch('msprobe.core.common.file_utils.check_path_before_create', return_value=None): + move_file(str(self.test_file), str(dst_file)) + mock_move.assert_called_once_with(str(self.test_file), str(dst_file)) + + with patch('shutil.move', side_effect=Exception), \ + patch('msprobe.core.common.file_utils.check_file_or_directory_path', return_value=None), \ + patch('msprobe.core.common.file_utils.check_path_before_create', return_value=None), \ + pytest.raises(RuntimeError): + move_file(self.test_file, dst_file) + + +class TestCSVOperations: + @pytest.fixture(autouse=True) + def setup(self, tmp_path): + self.csv_file = tmp_path / "test.csv" + self.test_data = [['header1', 'header2'], ['value1', 'value2']] + self.test_df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) + + def test_write_csv(self): + mock_file = mock_open() + with patch('builtins.open', mock_file), \ + patch('csv.writer') as mock_writer, \ + patch('os.chmod') as mock_chmod: + write_csv(self.test_data, self.csv_file) + mock_file.assert_called_once_with( + str(self.csv_file), 'a+', encoding='utf-8-sig' + ) + 
mock_writer.return_value.writerows.assert_called_once_with(self.test_data) + + def test_write_csv_malicious_check(self): + test_data = [['normal', '=1+1']] # Formula injection attempt + with pytest.raises(RuntimeError): + write_csv(test_data, self.csv_file, malicious_check=True) + + def test_read_csv(self): + # Test pandas read + with patch('pandas.read_csv', return_value=self.test_df), \ + patch('msprobe.core.common.file_utils.check_file_or_directory_path', return_value=None): + result = read_csv(str(self.csv_file), as_pd=True) + assert isinstance(result, pd.DataFrame) + pd.testing.assert_frame_equal(result, self.test_df) + + # Test standard csv read + mock_file = mock_open() + with patch('builtins.open', mock_file), \ + patch('csv.reader', return_value=self.test_data), \ + patch('msprobe.core.common.file_utils.check_file_or_directory_path', return_value=None), \ + patch('msprobe.core.common.file_utils.FileOpen.check_file_path', return_value=None): + result = read_csv(self.csv_file, as_pd=False) + assert result == self.test_data + + def test_write_df_to_csv(self): + with patch('pandas.DataFrame.to_csv') as mock_to_csv, \ + patch('os.chmod') as mock_chmod: + write_df_to_csv(self.test_df, str(self.csv_file)) + mock_to_csv.assert_called_once_with( + str(self.csv_file), + mode='w', + header=True, + index=False + ) + + # Test invalid data type + with pytest.raises(ValueError): + write_df_to_csv([1, 2, 3], self.csv_file) + + # Test malicious check + df_with_formula = pd.DataFrame({'col1': ['=1+1']}) + with pytest.raises(RuntimeError): + write_df_to_csv(df_with_formula, self.csv_file, malicious_check=True) + + +class TestPathOperations: + @pytest.fixture(autouse=True) + def setup(self, tmp_path): + self.test_path = tmp_path / "test_path" + self.test_dir = tmp_path / "test_dir" + self.test_file = tmp_path / "test_file.txt" + + def test_check_path_type(self): + # Test file type + with patch('os.path.isfile', return_value=True): + check_path_type(self.test_file, 
FileCheckConst.FILE) + + with patch('os.path.isfile', return_value=False), \ + pytest.raises(FileCheckException) as exc_info: + check_path_type(self.test_file, FileCheckConst.FILE) + assert exc_info.value.code == FileCheckException.INVALID_FILE_ERROR + + # Test directory type + with patch('os.path.isdir', return_value=True): + check_path_type(self.test_dir, FileCheckConst.DIR) + + with patch('os.path.isdir', return_value=False), \ + pytest.raises(FileCheckException) as exc_info: + check_path_type(self.test_dir, FileCheckConst.DIR) + assert exc_info.value.code == FileCheckException.INVALID_FILE_ERROR + + def test_check_others_writable(self): + mock_stat = MagicMock() + + # Test group writable + mock_stat.st_mode = stat.S_IWGRP + with patch('os.stat', return_value=mock_stat): + assert check_others_writable(self.test_path) is True + + # Test others writable + mock_stat.st_mode = stat.S_IWOTH + with patch('os.stat', return_value=mock_stat): + assert check_others_writable(self.test_path) is True + + # Test not writable by others + mock_stat.st_mode = stat.S_IRUSR | stat.S_IWUSR + with patch('os.stat', return_value=mock_stat): + assert check_others_writable(self.test_path) is False + + def test_create_directory(self): + with patch('os.path.isdir', return_value=True), \ + patch('os.makedirs') as mock_makedirs, \ + patch('msprobe.core.common.file_utils.FileChecker') as mock_checker: + mock_checker.return_value.common_check.return_value = None + create_directory(str(self.test_dir)) + + def test_check_path_before_create(self): + # Test valid path + check_path_before_create(self.test_path) + + # Test path length exceeds limit + long_path = self.test_dir / ('a' * FileCheckConst.DIRECTORY_LENGTH) + with pytest.raises(FileCheckException) as exc_info: + check_path_before_create(long_path) + assert exc_info.value.code == FileCheckException.ILLEGAL_PATH_ERROR + + # Test invalid characters + invalid_path = self.test_dir / "test*file" + with pytest.raises(FileCheckException) as 
exc_info: + check_path_before_create(invalid_path) + assert exc_info.value.code == FileCheckException.ILLEGAL_PATH_ERROR + + +class TestUtilityOperations: + @pytest.fixture(autouse=True) + def setup(self, tmp_path): + self.test_file = tmp_path / "test_file" + self.test_dir = tmp_path / "test_dir" + self.npy_file = tmp_path / "test.npy" + self.txt_file = tmp_path / "test.txt" + self.workbook_file = tmp_path / "test.xlsx" + + def test_save_npy_to_txt(self): + test_data = np.array([1, 2, 3, 4]) + + with patch('os.path.exists', return_value=False), \ + patch('numpy.savetxt') as mock_savetxt: + # Test without alignment + save_npy_to_txt(test_data, self.txt_file) + mock_savetxt.assert_called_once() + + # Test with alignment + with patch('os.path.exists', return_value=False), \ + patch('numpy.savetxt') as mock_savetxt: + save_npy_to_txt(test_data, self.txt_file, align=3) + mock_savetxt.assert_called_once() + + def test_save_workbook(self): + mock_workbook = MagicMock() + with patch('os.chmod') as mock_chmod: + save_workbook(mock_workbook, self.workbook_file) + mock_workbook.save.assert_called_once_with(str(self.workbook_file)) + + # Test save error + mock_workbook = MagicMock() + mock_workbook.save.side_effect = Exception + with pytest.raises(RuntimeError): + save_workbook(mock_workbook, self.workbook_file) + + def test_remove_path(self): + # Test remove file + with patch('os.path.exists', return_value=True), \ + patch('os.path.islink', return_value=True), \ + patch('os.remove') as mock_remove: + remove_path(str(self.test_file)) + mock_remove.assert_called_once_with(str(self.test_file)) + + # Test remove directory + with patch('os.path.exists', return_value=True), \ + patch('os.path.islink', return_value=False), \ + patch('os.path.isfile', return_value=False), \ + patch('shutil.rmtree') as mock_rmtree: + remove_path(str(self.test_dir)) + mock_rmtree.assert_called_once_with(str(self.test_dir)) + + def test_get_json_contents(self): + json_content = '{"key": "value"}' + with 
patch('builtins.open', mock_open(read_data=json_content)), \ + patch('os.path.exists', return_value=True), \ + patch('msprobe.core.common.file_utils.FileOpen.check_file_path', return_value=None): + result = get_json_contents(str(self.test_file)) + assert result == {'key': 'value'} + + # Test invalid JSON + with patch('builtins.open', mock_open(read_data='invalid json')), \ + patch('os.path.exists', return_value=True), \ + pytest.raises(FileCheckException) as exc_info: + get_json_contents(self.test_file) + assert exc_info.value.code == FileCheckException.FILE_PERMISSION_ERROR + + def test_get_file_content_bytes(self): + test_content = b'test content' + with patch('builtins.open', mock_open(read_data=test_content)), \ + patch('os.path.exists', return_value=True), \ + patch('msprobe.core.common.file_utils.FileOpen.check_file_path', return_value=None): + result = get_file_content_bytes(self.test_file) + assert result == test_content + + def test_os_walk_for_files(self): + mock_walk_data = [ + (str(self.test_dir), ['dir1'], ['file1.txt']), + (str(self.test_dir / 'dir1'), [], ['file2.txt']) + ] + + with patch('os.walk', return_value=mock_walk_data), \ + patch('msprobe.core.common.file_utils.check_file_or_directory_path'): + # Test with depth 1 + result = os_walk_for_files(str(self.test_dir), 2) + assert len(result) == 2 + assert result[0]['file'] == 'file1.txt' + assert result[1]['file'] == 'file2.txt' + + # Test with depth 0 + result = os_walk_for_files(str(self.test_dir), 1) + assert len(result) == 1 + assert result[0]['file'] == 'file1.txt' + + +class TestCertificateOperations: + @pytest.fixture(autouse=True) + def setup(self, tmp_path): + self.cert_file = tmp_path / "test.pem" + self.mock_cert = MagicMock() + self.mock_cert.get_notBefore.return_value = b'20230101000000Z' + self.mock_cert.get_notAfter.return_value = b'20250101000000Z' + self.mock_cert.has_expired.return_value = False + + def test_check_crt_valid(self): + # Test expired certificate + 
self.mock_cert.has_expired.return_value = True + with patch('OpenSSL.crypto.load_certificate', return_value=self.mock_cert), \ + patch('builtins.open', mock_open(read_data='cert data')), \ + pytest.raises(RuntimeError): + check_crt_valid(self.cert_file) + + +class TestDirectoryChecks: + @pytest.fixture(autouse=True) + def setup(self, tmp_path): + self.test_dir = tmp_path / "test_dir" + self.test_file = tmp_path / "test_file" + + def test_check_dirpath_before_read(self): + with patch('msprobe.core.common.file_utils.check_others_writable', return_value=True), \ + patch('msprobe.core.common.file_utils.check_path_owner_consistent', + side_effect=FileCheckException(0)), \ + patch('msprobe.core.common.file_utils.logger') as mock_logger: + check_dirpath_before_read(self.test_dir) + assert mock_logger.warning.call_count == 2 + + def test_check_file_or_directory_path(self): + with patch('msprobe.core.common.file_utils.FileChecker') as mock_checker: + mock_checker.return_value.common_check.return_value = None + # Test file path + check_file_or_directory_path(self.test_file, isdir=False) + # Test directory path + check_file_or_directory_path(self.test_dir, isdir=True) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/core_ut/common/test_utils.py b/debug/accuracy_tools/msprobe/test/core_ut/common/test_utils.py index c0235840db4dbddcfacb4ee79b31c573c9bce823..61766ed27c0a58f4fff81fb2f45618de60bb5b48 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/common/test_utils.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/common/test_utils.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (C) 2024-2025. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at @@ -17,18 +17,26 @@ import json import os import tempfile +from datetime import datetime, timezone +import unittest from unittest import TestCase -from unittest.mock import patch, MagicMock, mock_open +from unittest.mock import MagicMock, mock_open, patch + +import OpenSSL +import numpy as np +from pathlib import Path from msprobe.core.common.const import Const -from msprobe.core.common.file_utils import (FileCheckConst, - FileCheckException, - check_file_size, - check_file_or_directory_path, - get_json_contents, - get_file_content_bytes, - save_json) -from msprobe.core.common.inplace_op_checker import InplaceOpChecker +from msprobe.core.common.file_utils import ( + FileCheckConst, + FileCheckException, + check_file_or_directory_path, + check_file_size, + check_crt_valid, + get_file_content_bytes, + get_json_contents, + save_json, +) from msprobe.core.common.log import logger from msprobe.core.common.exceptions import MsprobeException from msprobe.core.common.utils import (CompareException, @@ -41,7 +49,14 @@ from msprobe.core.common.utils import (CompareException, get_dump_mode, get_real_step_or_rank, get_step_or_rank_from_string, - get_stack_construct_by_dump_json_path) + get_stack_construct_by_dump_json_path, + check_seed_all, + safe_get_value, + recursion_depth_decorator, + MsprobeBaseException, + check_str_param, + is_json_file, + detect_framework_by_dump_json) class TestUtils(TestCase): @@ -82,43 +97,42 @@ class TestUtils(TestCase): @patch.object(logger, "error") def test_check_compare_param(self, mock_error): params = { - "npu_json_path": "npu_path", - "bench_json_path": "bench_path", - "stack_json_path": "stack_path", + "npu_json_path": "npu_path.json", + "bench_json_path": "bench_path.json", + "stack_json_path": "stack_path.json", "npu_dump_data_dir": "npu_dump_data_dir", "bench_dump_data_dir": "bench_dump_data_dir" } call_args = [ - ("npu_path", False), - ("bench_path", False), - ("stack_path", False), + 
("npu_path.json", False), + ("bench_path.json", False), ("npu_dump_data_dir", True), ("bench_dump_data_dir", True), ("output_path", True), - ("npu_path", False), - ("bench_path", False), - ("stack_path", False), + ("npu_path.json", False), + ("bench_path.json", False), + ("stack_path.json", False), ("output_path", True) ] with self.assertRaises(CompareException) as context: - check_compare_param("npu_path", "output_path", dump_mode=Const.ALL) + check_compare_param("npu_path", "output_path", dump_mode=Const.ALL, stack_mode=False) self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) mock_error.assert_called_with("Invalid input parameter 'input_param', " "the expected type dict but got .") mock_check_file_or_directory_path = MagicMock() - mock_check_json_file = MagicMock() + mock__check_json = MagicMock() with patch("msprobe.core.common.utils.FileOpen", mock_open(read_data="")), \ - patch("msprobe.core.common.utils.check_json_file", new=mock_check_json_file), \ - patch("msprobe.core.common.utils.check_file_or_directory_path", new=mock_check_file_or_directory_path): - check_compare_param(params, "output_path", dump_mode=Const.ALL) - check_compare_param(params, "output_path", dump_mode=Const.MD5) + patch("msprobe.core.common.utils._check_json", mock__check_json), \ + patch("msprobe.core.common.utils.check_file_or_directory_path", mock_check_file_or_directory_path): + check_compare_param(params, "output_path", dump_mode=Const.ALL, stack_mode=False) + check_compare_param(params, "output_path", dump_mode=Const.MD5, stack_mode=True) for i in range(len(call_args)): self.assertEqual(mock_check_file_or_directory_path.call_args_list[i][0], call_args[i]) - self.assertEqual(len(mock_check_json_file.call_args[0]), 4) - self.assertEqual(mock_check_json_file.call_args[0][0], params) + self.assertEqual(len(mock__check_json.call_args[0]), 2) + self.assertEqual(mock__check_json.call_args[0][1], "stack_path.json") @patch.object(logger, "error") def 
test_check_configuration_param(self, mock_error): @@ -192,7 +206,7 @@ class TestUtils(TestCase): with self.assertRaises(CompareException) as context: set_dump_path(input_param) self.assertEqual(context.exception.code, CompareException.INVALID_PATH_ERROR) - mock_error.assert_called_with("Please check the json path is valid.") + mock_error.assert_called_with("Please check the json path is valid. npu_path: None, bench_path: bench_path") @patch.object(logger, "error") def test_get_dump_mode(self, mock_error): @@ -264,6 +278,12 @@ class TestUtils(TestCase): with self.assertRaises(MsprobeException) as context: get_real_step_or_rank([1, 2, 3.5], "step") self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + with self.assertRaises(MsprobeException) as context: + get_real_step_or_rank([True, 1, 2], "step") + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + with self.assertRaises(MsprobeException) as context: + get_real_step_or_rank([10000000], "step") + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) result = get_real_step_or_rank([1, 10, 50], "step") self.assertEqual(result, [1, 10, 50]) @@ -312,3 +332,201 @@ class TestUtils(TestCase): self.assertEqual(stack, {'stack_key': 'stack_value'}) self.assertEqual(construct, {'construct_key': 'construct_value'}) + + @patch.object(logger, "error") + def test_recursion_depth_decorator(self, mock_error): + # Test the recursion depth limiting decorator + recursion_list = [[]] + temp_list = recursion_list[0] + for _ in range(Const.MAX_DEPTH): + temp_list.append([]) + temp_list = temp_list[0] + temp_list.append(0) + call_record = [] + @recursion_depth_decorator("test func_info") + def recursion_func(test_list, call_record): + call_record.append(1) + if isinstance(test_list, list): + recursion_func(test_list[0], call_record) + with self.assertRaises(MsprobeException) as context: + recursion_func(recursion_list, call_record) + # Exceeding the recursion limit raises an exception, and the number of successful calls equals the limit + 
self.assertEqual(context.exception.code, MsprobeException.RECURSION_LIMIT_ERROR) + mock_error.assert_called_with("call test func_info exceeds the recursion limit.") + self.assertEqual(len(call_record), Const.MAX_DEPTH) + + def test_check_seed_all(self): + with self.assertRaises(MsprobeException) as context: + check_seed_all(-1, True, True) + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + with self.assertRaises(MsprobeException) as context: + check_seed_all(Const.MAX_SEED_VALUE + 1, True, True) + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + with self.assertRaises(MsprobeException) as context: + check_seed_all("1", True, True) + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + with self.assertRaises(MsprobeException) as context: + check_seed_all(True, True, True) + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + with self.assertRaises(MsprobeException) as context: + check_seed_all(True, 1, True) + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + with self.assertRaises(MsprobeException) as context: + check_seed_all(1, True, "test") + self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) + + def test_safe_get_value_dict_valid_key_index(self): + # Test valid key and index in a dictionary + dict_container = {'a': [1, 2, 3], 'b': [4, 5, 6]} + self.assertEqual(safe_get_value(dict_container, 1, 'dict_container', key='a'), 2) + + def test_safe_get_value_invalid_key(self): + # Test invalid key in dictionary + dict_container = {'a': [1, 2, 3], 'b': [4, 5, 6]} + with self.assertRaises(MsprobeBaseException) as context: + safe_get_value(dict_container, 1, 'dict_container', key='invalid_key') + self.assertEqual(context.exception.code, MsprobeBaseException.INVALID_OBJECT_TYPE_ERROR) + + def test_safe_get_value_valid_key_invalid_index(self): + # Test invalid index in dictionary[key] + 
dict_container = {'a': [1, 2, 3], 'b': [4, 5, 6]} + with self.assertRaises(MsprobeBaseException) as context: + safe_get_value(dict_container, 5, 'dict_container', key='a') + self.assertEqual(context.exception.code, MsprobeBaseException.INDEX_OUT_OF_BOUNDS_ERROR) + + def test_safe_get_value_list_valid_index(self): + # Test valid index in a list + list_container = [10, 20, 30] + self.assertEqual(safe_get_value(list_container, 1, 'list_container'), 20) + + def test_safe_get_value_list_index_out_of_bounds(self): + # Test index out of bounds in a list + list_container = [10, 20, 30] + with self.assertRaises(MsprobeBaseException) as context: + safe_get_value(list_container, 10, 'list_container') + self.assertEqual(context.exception.code, MsprobeBaseException.INDEX_OUT_OF_BOUNDS_ERROR) + + def test_safe_get_value_tuple_valid_index(self): + # Test valid index in a tuple + tuple_container = (100, 200, 300) + self.assertEqual(safe_get_value(tuple_container, 2, 'tuple_container'), 300) + + def test_safe_get_value_array_valid_index(self): + # Test valid index in a numpy array + array_container = np.array([1000, 2000, 3000]) + self.assertEqual(safe_get_value(array_container, 0, 'array_container'), 1000) + + def test_safe_get_value_unsupported_container_type(self): + # Test unsupported container type (e.g., a string) + with self.assertRaises(MsprobeBaseException) as context: + safe_get_value("unsupported_type", 0, 'string_container') + self.assertEqual(context.exception.code, MsprobeBaseException.INVALID_OBJECT_TYPE_ERROR) + + def test_valid_str_param(self): + valid_param = "valid_string_without_special_chars" + check_str_param(valid_param) + + def test_invalid_str_param(self): + invalid_param = "invalid$tring&with^special*chars()" + with self.assertRaises(MsprobeBaseException) as context: + check_str_param(invalid_param) + self.assertEqual(context.exception.code, MsprobeBaseException.INVALID_CHAR_ERROR) + + def test_is_json_file(self): + file_path_true = 'step/rank/stack.json' 
+ file_path_false = 1 + self.assertTrue(is_json_file(file_path_true)) + self.assertFalse(is_json_file(file_path_false)) + + +class TestCheckCrtValid(TestCase): + """ + Test the check_crt_valid function. + """ + + def setUp(self): + self.cert_file_path = "cert_file_path.pem" + if not os.path.exists(self.cert_file_path): + with open(self.cert_file_path, 'w') as f: + f.write("This is a test certificate.") + + def tearDown(self): + if os.path.exists(self.cert_file_path): + os.remove(self.cert_file_path) + + @patch('msprobe.core.common.file_utils.datetime') + @patch('OpenSSL.crypto.load_certificate') + @patch('builtins.open', new_callable=mock_open, read_data="cert_data") + def test_check_crt_valid_success(self, mock_open_, mock_load_certificate, mock_datetime): + mock_cert = MagicMock() + mock_cert.get_notBefore.return_value = b'20220101' + mock_cert.get_notAfter.return_value = b'20230101' + mock_cert.has_expired.return_value = False + mock_load_certificate.return_value = mock_cert + mock_datetime.now.return_value = datetime(2022, 10, 1) + + check_crt_valid(self.cert_file_path) + mock_load_certificate.assert_called_once_with(OpenSSL.crypto.FILETYPE_PEM, 'cert_data') + + @patch('datetime.datetime') + @patch('OpenSSL.crypto.load_certificate') + @patch('builtins.open', new_callable=mock_open, read_data="cert_data") + def test_check_crt_valid_expired(self, mock_open_, mock_load_certificate, mock_datetime): + mock_cert = MagicMock() + mock_cert.get_notBefore.return_value = b'20220101' + mock_cert.get_notAfter.return_value = b'20230101' + mock_cert.has_expired.return_value = True + mock_load_certificate.return_value = mock_cert + mock_datetime.now.return_value = datetime(2022, 10, 1, tzinfo=timezone.utc) + + with self.assertRaises(RuntimeError) as context: + check_crt_valid(self.cert_file_path) + self.assertIn('The SSL certificate has expired and needs to be replaced', str(context.exception)) + + @patch('OpenSSL.crypto.load_certificate') + @patch('builtins.open', 
new_callable=mock_open, read_data="cert_data") + def test_check_crt_valid_exception(self, mock_open_, mock_load_certificate): + mock_load_certificate.side_effect = Exception('Test Exception') + + with self.assertRaises(RuntimeError) as context: + check_crt_valid(self.cert_file_path) + self.assertIn('The SSL certificate is invalid', str(context.exception)) + + +class TestDetectFrameworkByDumpJson(unittest.TestCase): + + @patch('msprobe.core.common.utils.load_json') + def test_valid_pytorch_framework(self, mock_load_json): + mock_load_json.return_value = {"framework": Const.PT_FRAMEWORK} + + result = detect_framework_by_dump_json("dummy_path") + + self.assertEqual(result, Const.PT_FRAMEWORK) + + @patch('msprobe.core.common.utils.load_json') + def test_valid_mindspore_framework(self, mock_load_json): + mock_load_json.return_value = {"framework": Const.MS_FRAMEWORK} + + result = detect_framework_by_dump_json("dummy_path") + + self.assertEqual(result, Const.MS_FRAMEWORK) + + def test_detect_framework_in_file(self): + self.current_dir = Path(__file__).parent + file_path = self.current_dir / "test_dump_file/pt_dump_no_framework.json" + result = detect_framework_by_dump_json(file_path) + self.assertEqual(result, Const.PT_FRAMEWORK) + + self.current_dir = Path(__file__).parent + file_path = self.current_dir / "test_dump_file/ms_dump_no_framework.json" + result = detect_framework_by_dump_json(file_path) + self.assertEqual(result, Const.MS_FRAMEWORK) + + @patch("msprobe.core.common.utils.logger") + def test_detect_framework_exception(self, mock_logger): + self.current_dir = Path(__file__).parent + file_path = self.current_dir / "test_dump_file/dump_no_pt_no_ms.json" + with self.assertRaises(CompareException) as context: + result = detect_framework_by_dump_json(file_path) + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_logger.error.assert_called_once_with(f"{file_path} must be based on the MindSpore or PyTorch framework.") diff --git 
a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare.py index 7a12a2ea4816c8695cc2df7c7cd88d4d912de19e..c882e331f5513ddbd3cbb5baf4c1292079680f4f 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare.py @@ -1,21 +1,24 @@ # coding=utf-8 -import unittest -import pandas as pd +import json import os import shutil -import json -import torch import threading -from msprobe.core.compare.utils import get_accuracy -from msprobe.core.compare.highlight import find_error_rows, find_compare_result_error_rows -from msprobe.core.compare.acc_compare import Comparator +import unittest +from unittest.mock import patch + +import pandas as pd +import torch + from msprobe.core.common.const import CompareConst, Const +from msprobe.core.common.utils import CompareException +from msprobe.core.compare.acc_compare import Comparator, ModeConfig, get_bench_data_name +from msprobe.core.compare.highlight import find_error_rows, find_compare_result_error_rows, ApiBatch +from msprobe.core.compare.utils import get_accuracy from msprobe.pytorch.compare.pt_compare import PTComparator - npu_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.1', 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output'], - 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), + 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), ('torch.float32', [16])], 'output_struct': [('torch.float32', [1, 16, 28, 28])], 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], @@ -24,18 +27,18 @@ npu_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.conv2 [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} npu_dict2 = {'op_name': 
['Functional.conv2d.1.forward.input.0', 'Functional.conv2d.0.forward.input.1', - 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output'], - 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), - ('torch.float32', [16])], - 'output_struct': [('torch.float32', [1, 16, 28, 28])], - 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], - [0.19919930398464203, -0.19974489510059357, 0.006269412115216255], - [0.19734230637550354, -0.18177609145641327, 0.007903944700956345], - [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} + 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output'], + 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), + ('torch.float32', [16])], + 'output_struct': [('torch.float32', [1, 16, 28, 28])], + 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], + [0.19919930398464203, -0.19974489510059357, 0.006269412115216255], + [0.19734230637550354, -0.18177609145641327, 0.007903944700956345], + [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} bench_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.1', 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output'], - 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), + 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), ('torch.float32', [16])], 'output_struct': [('torch.float32', [1, 16, 28, 28])], 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], @@ -45,7 +48,7 @@ bench_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.con tensor_list = [ {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'Max': 0.33033010363578796, - 'Min': -0.331031858921051,'Mean': -0.030964046716690063, 'Norm': 
2.2533628940582275, 'requires_grad': True, + 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_.0.forward.input.0'}, {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, @@ -63,23 +66,28 @@ result_op_dict = {'op_name': ['Tensor.add_.0.forward.input.0', 'Tensor.add_.0.fo ("", '[]')], 'output_struct': [('torch.float32', [16, 1, 3, 3])], 'summary': [[0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275], - [0.003992878366261721, -0.008102823048830032, -0.0002002553956117481, 0.02844562754034996], + [0.003992878366261721, -0.008102823048830032, -0.0002002553956117481, + 0.02844562754034996], [-0.1, -0.1, -0.1, -0.1], [0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275]], 'stack_info': []} o_result = [ ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.0', 'torch.float32', 'torch.float32', - [1, 1, 28, 28], [1, 1, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 3.029174327850342, -2.926689624786377, + [1, 1, 28, 28], [1, 1, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 3.029174327850342, + -2.926689624786377, -0.06619918346405029, 3.029174327850342, -2.926689624786377, -0.06619918346405029, '', '', 'None'], ['Functional.conv2d.0.forward.input.1', 'Functional.conv2d.0.forward.input.1', 'torch.float32', 'torch.float32', - [16, 1, 5, 5], [16, 1, 5, 5], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19919930398464203, -0.19974489510059357, + [16, 1, 5, 5], [16, 1, 5, 5], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19919930398464203, + -0.19974489510059357, 0.006269412115216255, 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, '', '', 'None'], ['Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.input.2', 'torch.float32', 
'torch.float32', - [16], [16], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, + [16], [16], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19734230637550354, -0.18177609145641327, + 0.007903944700956345, 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, '', '', 'None'], ['Functional.conv2d.0.forward.output', 'Functional.conv2d.0.forward.output', 'torch.float32', 'torch.float32', - [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 2.1166646480560303, -2.190781354904175, + [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 2.1166646480560303, + -2.190781354904175, -0.003579073818400502, 2.1166646480560303, -2.190781354904175, -0.003579073818400502, '', '', 'None']] npu_dict_aten = {'op_name': ['Aten__native_batch_norm_legit_functional.default_0_forward.input.0', @@ -182,29 +190,33 @@ summary_line_3 = ['Functional_batch_norm_0_forward.output.2', 'Functional_batch_ 'torch.float16', 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0, 0, 0, 0, 2, 0, 1, 1, 1, 1, 1, 1, 'Warning', ''] -line_input = ['Functional_batch_norm_0_forward.input.0', 'Functional_batch_norm_0_forward.input.0', 'torch.float16', - 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 1, 1, 1, 0.95, 1, 1, 1, 1, 1, 1.01, 1, 1, 1, +line_input = ['Functional.batch.norm.0.forward.input.0', 'Functional.batch.norm.0.forward.input.0', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 1, 0.5, 1, 1, 0.95, 1, + 1, 1, 1, 1, 1.01, 1, 1, 1, 'Yes', ''] -line_1 = ['Functional_batch_norm_0_forward.output.0', 'Functional_batch_norm_0_forward.output.0', 'torch.float16', - 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.8, 1, 1, 0.59, 1, 'nan', 0, 1, 1, 19, 1, 1, 1, - 'Warning', ''] -line_2 = ['Functional_batch_norm_0_forward.output.1', 'Functional_batch_norm_0_forward.output.1', 'torch.float16', - 'torch.float32', [256, 256, 
14, 14], [256, 256, 14, 14], 0.9, 1, 1, 0.8, 1, 0, 0.12, 0, 1, 1, 0.1, 1, 1, 1, - 'Warning', ''] -line_3 = ['Functional_batch_norm_0_forward.output.2', 'Functional_batch_norm_0_forward.output.2', 'torch.float16', - 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.8, 1.1e+10, 1, 0.85, 1, 9, 0.12, 0, 1, 1, 0.1, 1, - 1, 1, 'Warning', ''] +line_1 = ['Functional.batch.norm.0.forward.output.0', 'Functional.batch.norm.0.forward.output.0', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.8, 0.5, 1, 1, 0.59, 1, + 'nan', 0, 1, 1, 19, 1, 1, 1, + 'Yes', ''] +line_2 = ['Functional.batch.norm.0.forward.output.1', 'Functional.batch.norm.0.forward.output.1', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.9, 0.5, 1, 1, 0.8, 1, + 0, 0.12, 0, 1, 1, 0.1, 1, 1, + 'Yes', ''] +line_3 = ['Functional.batch.norm.0.forward.output.2', 'Functional.batch.norm.0.forward.output.2', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.8, 0.5, 1.1e+10, 1, 0.85, 1, + 9, 0.12, 0, 1, 1, 0.1, 1, 1, + 'Yes', ''] op_data = { 'input_args': [{'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], - 'Max': 0.33033010363578796, 'Min': -0.331031858921051,'Mean': -0.030964046716690063, + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True}, {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, 'Norm': 0.02844562754034996, 'requires_grad': False}], 'input_kwargs': {'alpha': {'type': 'float', 'value': -0.1}}, 'output': [{'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], - 'Max': 0.33033010363578796, 'Min': -0.331031858921051,'Mean': -0.030964046716690063, + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 
'requires_grad': True}]} op_name = "Tensor.add_0.0.forward" @@ -294,29 +306,59 @@ class TestUtilsMethods(unittest.TestCase): self.assertEqual(result, aten_result) def test_find_error_rows(self): + api_batch = ApiBatch("Functional_batch_norm_0_forward", 0) + api_batch.input_len = 1 + api_batch.output_end_index = 4 + api_batch.params_end_index = 4 summary_result = [summary_line_input, summary_line_1, summary_line_2, summary_line_3] - highlight_dict = {'red_rows': [], 'yellow_rows': []} - find_error_rows(summary_result, 0, 1, highlight_dict, dump_mode=Const.SUMMARY) - self.assertEqual(highlight_dict, {'red_rows': [], 'yellow_rows': []}) + highlight_dict_test = {"red_rows": set(), "yellow_rows": set(), "red_lines": [], "yellow_lines": []} + find_error_rows(summary_result, api_batch, highlight_dict_test, dump_mode=Const.SUMMARY) + self.assertEqual(highlight_dict_test, + {"red_rows": set(), "yellow_rows": set(), "red_lines": [], "yellow_lines": []}) def test_find_compare_result_error_rows(self): result = [line_input, line_1, line_2, line_3] result_df = pd.DataFrame(result) - highlight_dict = {'red_rows': [], 'yellow_rows': []} - find_compare_result_error_rows(result_df, highlight_dict, dump_mode=Const.ALL) - self.assertEqual(highlight_dict, {'red_rows': [num_1, num_3], 'yellow_rows': [num_2]}) + highlight_dict_test = {"red_rows": set(), "yellow_rows": set(), "red_lines": [], "yellow_lines": []} + find_compare_result_error_rows(result_df, highlight_dict_test, dump_mode=Const.ALL) + self.assertEqual(highlight_dict_test, { + "red_rows": {1, 3}, + "yellow_rows": {2}, + "red_lines": [ + (1, ["maximum or minimum is nan, -inf, or inf"]), + (3, ["maximum absolute error exceeds 1e+10"]) + ], + "yellow_lines": [ + (2, ["The output's one thousandth err ratio decreases by more than 0.1 compared to the input/parameters's"]), + (3, [ + "maximum absolute error of both input/parameters and output exceed 1, " + "with the output larger by an order of magnitude", + "The output's cosine 
decreases by more than 0.1 compared to the input/parameters's"]) + ] + }) def test_calculate_summary_data(self): npu_summary_data = [1, 1, 1, 1] bench_summary_data = [2, 2, 2, 2] result_item = ['', '', '', '', '', '', '', '', '', '', '', '', '', ''] - Comparator().calculate_summary_data(npu_summary_data, bench_summary_data, result_item) - self.assertEqual(result_item, ['', '', '', '', '', '', -1, -1, -1, -1, '50.0%', '50.0%', '50.0%', '50.0%', '', '']) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + comparator = Comparator(mode_config) + comparator.calculate_summary_data(npu_summary_data, bench_summary_data, result_item) + self.assertEqual(result_item, + ['', '', '', '', '', '', -1, -1, -1, -1, '50.0%', '50.0%', '50.0%', '50.0%', '', '']) bench_summary_data = [0, 0, 0, 0] result_item = ['', '', '', '', '', '', '', '', '', '', '', '', '', ''] - Comparator().calculate_summary_data(npu_summary_data, bench_summary_data, result_item) - self.assertEqual(result_item, ['', '', '', '', '', '', 1, 1, 1, 1, 'N/A', 'N/A', 'N/A', 'N/A', 'Warning', 'Need double check api accuracy.']) + + comparator.calculate_summary_data(npu_summary_data, bench_summary_data, result_item) + self.assertEqual(result_item, ['', '', '', '', '', '', 1, 1, 1, 1, 'N/A', 'N/A', 'N/A', 'N/A', 'Warning', + 'Need double check api accuracy.']) def test_make_result_table_stack_mode_True(self): result_md5 = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', @@ -325,7 +367,7 @@ class TestUtilsMethods(unittest.TestCase): 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', 'File']] result_all = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', - 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], 
'', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', 'File', '-1']] columns_md5_stack_mode_true = CompareConst.MD5_COMPARE_RESULT_HEADER + ['NPU_Stack_Info'] result_table_md5_true = pd.DataFrame(result_md5, columns=columns_md5_stack_mode_true, dtype=object) @@ -335,32 +377,40 @@ class TestUtilsMethods(unittest.TestCase): result_table_all_true = pd.DataFrame(result_all, columns=columns_all_stack_mode_true, dtype=object) stack_mode = True + auto_analyze = True + fuzzy_match = False - result_df = Comparator().make_result_table(result_md5, stack_mode, dump_mode=Const.MD5) + dump_mode = Const.MD5 + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + result_df = Comparator(mode_config).make_result_table(result_md5) self.assertTrue(result_df.equals(result_table_md5_true)) - result_df = Comparator().make_result_table(result_summary, stack_mode, dump_mode=Const.SUMMARY) + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + result_df = Comparator(mode_config).make_result_table(result_summary) self.assertTrue(result_df.equals(result_table_summary_true)) - result_df = Comparator().make_result_table(result_all, stack_mode, dump_mode=Const.ALL) + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + result_df = Comparator(mode_config).make_result_table(result_all) self.assertTrue(result_df.equals(result_table_all_true)) def test_make_result_table_stack_mode_False(self): result_md5_test = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', - 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '']] + 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '']] result_md5 = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '']] result_summary_test = [['Functional.linear.0.forward.input.0', 
'Functional.linear.0.forward.input.0', - 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', '', '', '', - 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '']] + 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', '', '', '', + 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '']] result_summary = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '']] result_all_test = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', - 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', - 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '', '-1']] + 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', '', + 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '', '-1']] result_all = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', - 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1']] columns_md5_stack_mode_true = CompareConst.MD5_COMPARE_RESULT_HEADER result_table_md5_true = pd.DataFrame(result_md5, columns=columns_md5_stack_mode_true, dtype='object') @@ -371,18 +421,25 @@ class TestUtilsMethods(unittest.TestCase): result_table_all_true = pd.DataFrame(result_all, columns=columns_all_stack_mode_true, dtype='object') stack_mode = False + auto_analyze = True + fuzzy_match = False - result_df = Comparator().make_result_table(result_md5_test, stack_mode, dump_mode=Const.MD5) + dump_mode = Const.MD5 + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + result_df = Comparator(mode_config).make_result_table(result_md5_test) self.assertTrue(result_df.equals(result_table_md5_true)) - result_df = Comparator().make_result_table(result_summary_test, stack_mode, dump_mode=Const.SUMMARY) + dump_mode = Const.SUMMARY + 
mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + result_df = Comparator(mode_config).make_result_table(result_summary_test) self.assertTrue(result_df.equals(result_table_summary_true)) - result_df = Comparator().make_result_table(result_all_test, stack_mode, dump_mode=Const.ALL) + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + result_df = Comparator(mode_config).make_result_table(result_all_test) self.assertTrue(result_df.equals(result_table_all_true)) def test_gen_merge_list(self): - dump_mode = Const.SUMMARY op_data = { 'input_args': [ { @@ -400,53 +457,94 @@ class TestUtilsMethods(unittest.TestCase): 'input_struct': [('torch.float32', [2, 2])], 'op_name': ['Functional.linear.0.forward.input.0'], 'output_struct': [], + 'params_struct': [], + 'params_grad_struct': [], 'stack_info': [['File']], 'summary': [[1, 1, 1, 1]] } - result = Comparator().gen_merge_list(json_data, op_name, stack_json_data, dump_mode) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + result = Comparator(mode_config).gen_merge_list(json_data, op_name, stack_json_data) self.assertEqual(result, merge_list) def test_check_op_fuzzy_false(self): + stack_mode = False + auto_analyze = True + dump_mode = Const.SUMMARY + fuzzy_match = False - pt_comparator = PTComparator() - result = pt_comparator.check_op(npu_dict, bench_dict, fuzzy_match) + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + result = pt_comparator.check_op(npu_dict, bench_dict) self.assertEqual(result, True) def test_check_op_fuzzy_true(self): + stack_mode = False + auto_analyze = True + dump_mode = Const.SUMMARY + fuzzy_match = True - pt_comparator = PTComparator() - result = pt_comparator.check_op(npu_dict2, bench_dict, fuzzy_match) + mode_config = 
ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + result = pt_comparator.check_op(npu_dict2, bench_dict) self.assertEqual(result, True) def test_match_op_both_last_element(self): + stack_mode = False + auto_analyze = True fuzzy_match = False - pt_comparator = PTComparator() - a, b = pt_comparator.match_op([npu_dict], [bench_dict], fuzzy_match) + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + a, b = pt_comparator.match_op([npu_dict], [bench_dict]) self.assertEqual(a, 0) self.assertEqual(b, 0) def test_match_op_only_npu_last_element(self): + stack_mode = False + auto_analyze = True fuzzy_match = False - pt_comparator = PTComparator() - a, b = pt_comparator.match_op([npu_dict], [bench_dict, 1], fuzzy_match) + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + a, b = pt_comparator.match_op([npu_dict], [bench_dict, 1]) self.assertEqual(a, 0) self.assertEqual(b, 0) def test_match_op_only_bench_last_element(self): + stack_mode = False + auto_analyze = True fuzzy_match = False - pt_comparator = PTComparator() - a, b = pt_comparator.match_op([npu_dict, npu_dict2], [bench_dict], fuzzy_match) + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + a, b = pt_comparator.match_op([npu_dict, npu_dict2], [bench_dict]) self.assertEqual(a, 0) self.assertEqual(b, 0) def test_compare_process(self): generate_dump_json(base_dir) generate_stack_json(base_dir) - file_lists = [os.path.join(base_dir, 'dump.json'), os.path.join(base_dir, 'dump.json'), os.path.join(base_dir, 'stack.json')] + file_lists = [os.path.join(base_dir, 'dump.json'), os.path.join(base_dir, 'dump.json'), + os.path.join(base_dir, 'stack.json')] + stack_mode = True + 
auto_analyze = True fuzzy_match = False dump_mode = Const.SUMMARY - result = PTComparator().compare_process(file_lists, stack_mode, fuzzy_match, dump_mode) + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + result = PTComparator(mode_config).compare_process(file_lists) o_data = [ ['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], 0, 0, 0, 0, '0.0%', 'N/A', '0.0%', '0.0%', @@ -458,7 +556,6 @@ class TestUtilsMethods(unittest.TestCase): self.assertTrue(result.equals(o_result)) def test_merge_data(self): - dump_mode = Const.SUMMARY op_data = { 'input_args': [ { @@ -471,7 +568,14 @@ class TestUtilsMethods(unittest.TestCase): } json_data = {'data': {'Functional.linear.0.forward': op_data}} stack_json_data = {'Functional.linear.0.forward': ['File']} - result = Comparator().merge_data(json_data, stack_json_data, dump_mode) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + result = Comparator(mode_config).merge_data(json_data, stack_json_data) ops_all = { 'Functional.linear.0.forward.input.0': { 'data_name': None, 'stack_info': [['File']], @@ -490,7 +594,13 @@ class TestUtilsMethods(unittest.TestCase): } output_path = base_dir2 - PTComparator().compare_core(input_params, output_path, dump_mode=Const.SUMMARY) + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + PTComparator(mode_config).compare_core(input_params, output_path) output_files = os.listdir(output_path) self.assertTrue(any(f.endswith(".xlsx") for f in output_files)) @@ -509,8 +619,16 @@ class TestUtilsMethods(unittest.TestCase): 'NPU Name': ['Functional.linear.0.forward.input.0'], 'Bench Name': ['Functional.linear.0.forward.input.0'] }) - updated_df = 
PTComparator().compare_ops(idx=0, dump_path_dict=dump_path_dict, result_df=result_df, lock=self.lock, - input_param=input_param) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + updated_df = pt_comparator.compare_ops(idx=0, dump_path_dict=dump_path_dict, result_df=result_df, + lock=self.lock, input_param=input_param) self.assertEqual(updated_df.loc[0, CompareConst.COSINE], 1.0) self.assertEqual(updated_df.loc[0, CompareConst.MAX_ABS_ERR], 0) @@ -518,15 +636,25 @@ class TestUtilsMethods(unittest.TestCase): def test_do_multi_process(self): data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], - '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1']] + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1']] o_data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', - 'torch.float32', 'torch.float32', [2, 2], [2, 2], 'None', 'None', 'None', 'None', 'None', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], + 'unsupported', 'unsupported', 'unsupported', 'unsupported', 'unsupported', 'unsupported', 1, 1, 1, 1, 1, 1, 1, 1, 'None', 'No bench data matched.', '-1']] columns = CompareConst.COMPARE_RESULT_HEADER + ['Data_name'] result_df = pd.DataFrame(data, columns=columns) o_result = pd.DataFrame(o_data, columns=columns) - input_param = {} - result = Comparator()._do_multi_process(input_param, result_df) + generate_dump_json(base_dir) + input_param = {'bench_json_path': os.path.join(base_dir, 'dump.json')} + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + comparator = Comparator(mode_config) + result = comparator.do_multi_process(input_param, result_df) 
self.assertTrue(result.equals(o_result)) def test_compare_by_op_1(self): @@ -534,19 +662,101 @@ class TestUtilsMethods(unittest.TestCase): bench_op_name = 'N/A' op_name_mapping_dict = {'Functional.linear.0.forward.input.0': [-1, -1]} input_param = {} - result = PTComparator().compare_by_op(npu_op_name, bench_op_name, op_name_mapping_dict, input_param) - self.assertEqual(result, ['None', 'None', 'None', 'None', 'None', 'No bench data matched.']) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + result = pt_comparator.compare_by_op(npu_op_name, bench_op_name, op_name_mapping_dict, input_param, {}) + + self.assertEqual(result, ['unsupported', 'unsupported', 'unsupported', 'unsupported', 'unsupported', + 'unsupported', 'No bench data matched.']) def test_compare_by_op_2(self): npu_op_name = 'Functional.linear.0.forward.input.0' bench_op_name = 'Functional.linear.0.forward.input.0' + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + + pt_name = '-1' + pt_path = os.path.join(base_dir, pt_name) + op_name_mapping_dict = {'Functional.linear.0.forward.input.0': [pt_path, pt_path]} + input_param = {'npu_dump_data_dir': base_dir, 'bench_dump_data_dir': base_dir} + result = pt_comparator.compare_by_op(npu_op_name, bench_op_name, op_name_mapping_dict, input_param, + {'Functional.linear.0.forward': {'input_args': [ + {'data_name': 'Functional.linear.0.forward.input.0.pt'}]}}) + self.assertEqual(result, ['unsupported', 'unsupported', 'unsupported', 'unsupported', 'unsupported', + 'unsupported', f'Dump file: {pt_path} not found.']) + pt_name = 'Functional.linear.0.forward.input.0.pt' pt_path = os.path.join(base_dir, pt_name) op_name_mapping_dict = 
{'Functional.linear.0.forward.input.0': [pt_path, pt_path]} input_param = {'npu_dump_data_dir': base_dir, 'bench_dump_data_dir': base_dir} - result = PTComparator().compare_by_op(npu_op_name, bench_op_name, op_name_mapping_dict, input_param) - self.assertEqual(result, ['None', 'None', 'None', 'None', 'None', f'Dump file: {pt_path} not found.']) + result = pt_comparator.compare_by_op(npu_op_name, bench_op_name, op_name_mapping_dict, input_param, {}) + self.assertEqual(result, ['unsupported', 'unsupported', 'unsupported', 'unsupported', 'unsupported', + 'unsupported', 'Bench does not have data file.']) generate_pt(base_dir) - result = PTComparator().compare_by_op(npu_op_name, bench_op_name, op_name_mapping_dict, input_param) - self.assertEqual(result, [1.0, 0.0, 0.0, 1.0, 1.0, '']) + result = pt_comparator.compare_by_op(npu_op_name, bench_op_name, op_name_mapping_dict, input_param, + {'Functional.linear.0.forward': {'input_args': [ + {'data_name': 'Functional.linear.0.forward.input.0.pt'}]}}) + self.assertEqual(result, [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, '']) + + def test_get_bench_data_name_input(self): + bench_op_name = "Functional.linear.0.forward.input.0" + bench_data = {"Functional.linear.0.forward": {"input_args": [{"data_name": "Functional.linear.0.forward.input.0.pt"}], "input_kwargs": {}, "output": []}} + result = get_bench_data_name(bench_op_name, bench_data) + + self.assertEqual(result, "Functional.linear.0.forward.input.0.pt") + + def test_get_bench_data_name_output(self): + bench_op_name = "Functional.linear.0.forward.output.0" + bench_data = {"Functional.linear.0.forward": {"input_args": [], "input_kwargs": {}, "output": [{"data_name": "Functional.linear.0.forward.output.0.pt"}]}} + result = get_bench_data_name(bench_op_name, bench_data) + + self.assertEqual(result, "Functional.linear.0.forward.output.0.pt") + + +class TestComparator(unittest.TestCase): + def setUp(self): + mode_config = ModeConfig(dump_mode=Const.MD5) + self.comparator = 
Comparator(mode_config=mode_config) + self.npu_ops_all = { + 'op1': {'struct': ['float32', [1, 96, 2], '83dcefb7']}, + } + self.bench_ops_all = { + 'op1': {'struct': ['float32', [1, 96, 2], '83dcefb7']}, + } + + def test_normal(self): + expected_result = ['op1', 'op1', 'float32', 'float32', [1, 96, 2], [1, 96, 2], '83dcefb7', '83dcefb7', + CompareConst.PASS, CompareConst.NONE] + result = self.comparator.get_result_md5_compare('op1', 'op1', + self.npu_ops_all, self.bench_ops_all) + self.assertEqual(result, expected_result) + + @patch('msprobe.core.compare.acc_compare.logger') + def test_length_exception(self, mock_logger): + self.npu_ops_all['op1']['struct'] = ['npu_val1', 'npu_val2'] + with self.assertRaises(CompareException) as context: + self.comparator.get_result_md5_compare('op1', 'op1', + self.npu_ops_all, self.bench_ops_all) + self.assertEqual(context.exception.code, CompareException.INDEX_OUT_OF_BOUNDS_ERROR) + mock_logger.error.assert_called_once_with("The length of npu_struct and bench_struct must be >= 3, " + "but got npu_struct=2 and bench_struct=3. 
Please check!") + + def test_with_extra_args(self): + expected_result = ['op1', 'op1', 'float32', 'float32', [1, 96, 2], [1, 96, 2], '83dcefb7', '83dcefb7', + CompareConst.PASS, 'extra_data'] + result = self.comparator.get_result_md5_compare('op1', 'op1', + self.npu_ops_all, self.bench_ops_all, True, ['extra_data']) + self.assertEqual(result, expected_result) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_check.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_check.py index 95065ff7b798d514e9a8d783aebead38772173ca..a1e5f8eee1bce9b170e6f4f7fdfeda65d47252c9 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_check.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_check.py @@ -67,7 +67,7 @@ op_name = 'Functional.conv2d.0.backward.input.0' class TestUtilsMethods(unittest.TestCase): def test_check_struct_match_success(self): - result = check_struct_match(npu_dict, bench_dict, cross_frame=False) + result = check_struct_match(npu_dict, bench_dict) self.assertTrue(result) def test_check_struct_match_fail(self): @@ -80,7 +80,7 @@ class TestUtilsMethods(unittest.TestCase): ('torch.float32', [16])], 'output_struct': [('torch.float32', [1, 16, 28, 28])] } - result = check_struct_match(npu_dict2, bench_dict2, cross_frame=False) + result = check_struct_match(npu_dict2, bench_dict2) self.assertFalse(result) def test_check_struct_index_error(self): @@ -94,7 +94,7 @@ class TestUtilsMethods(unittest.TestCase): 'output_struct': [('torch.float32')] } with self.assertRaises(CompareException) as context: - result = check_struct_match(npu_dict3, bench_dict3, cross_frame=False) + result = check_struct_match(npu_dict3, bench_dict3) self.assertEqual(context.exception.code, CompareException.INDEX_OUT_OF_BOUNDS_ERROR) def test_check_type_shape_match_success(self): diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_npy_compare.py 
b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_npy_compare.py index 2a34222dc011d1e766b8607c9610e8a77b31a533..da315b657c8c1fc691136a1dbc56574d69c92076 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_npy_compare.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_npy_compare.py @@ -1,117 +1,172 @@ # coding=utf-8 +""" +# Copyright (C) 2024-2025. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" import unittest import numpy as np -from msprobe.core.compare.npy_compare import handle_inf_nan, get_error_type, reshape_value, get_error_message, \ - npy_data_check, statistics_data_check, GetCosineSimilarity, GetMaxAbsErr, get_relative_err, GetMaxRelativeErr, \ - GetThousandErrRatio, GetFiveThousandErrRatio, compare_ops_apply +from unittest.mock import patch + from msprobe.core.common.const import CompareConst +from msprobe.core.compare.npy_compare import handle_inf_nan, reshape_value, get_error_flag_and_msg, \ + npy_data_check, statistics_data_check, get_relative_err, GetCosineSimilarity, GetMaxAbsErr, GetMaxRelativeErr, \ + GetErrRatio, error_value_process, compare_ops_apply, GetEuclideanDistance op_name = 'Functional.conv2d.0.backward.input.0' class TestUtilsMethods(unittest.TestCase): - def test_handle_inf_nan_1(self): + def test_handle_inf_nan_normal(self): + n_value = np.array([1, 2, 3, 4]) + b_value = np.array([1, 2, 3, 4]) + + a, b = handle_inf_nan(n_value, b_value) + + self.assertTrue(np.array_equal(a, n_value) and np.array_equal(b, b_value)) + + def test_handle_inf_nan_with_inf(self): n_value = np.array([1, 2, np.inf, 4]) b_value = np.array([1, 2, 3, 4]) + a, b = handle_inf_nan(n_value, b_value) + self.assertTrue(a == CompareConst.NAN and b == CompareConst.NAN) - def test_handle_inf_nan_2(self): + def test_handle_inf_nan_with_nan(self): n_value = np.array([1, 2, 3, 4]) b_value = np.array([1, 2, np.nan, 4]) + a, b = handle_inf_nan(n_value, b_value) + self.assertTrue(a == CompareConst.NAN and b == CompareConst.NAN) - def test_handle_inf_nan_3(self): - n_value = np.array([1, 2, 3, 4]) - b_value = np.array([1, 2, 3, 4]) + def test_handle_inf_nan_both_nan(self): + n_value = np.array([1, 2, np.nan, 4]) + b_value = np.array([1, 2, np.nan, 4]) + a, b = handle_inf_nan(n_value, b_value) - self.assertTrue(np.array_equal(a, n_value) and np.array_equal(b, b_value)) - def test_get_error_type_1(self): - n_value = np.array([1, 2, np.inf, 4]) - b_value = 
np.array([1, 2, 3, 4]) - error_flag = True - a, b, c = get_error_type(n_value, b_value, error_flag) - self.assertTrue(a == CompareConst.READ_NONE and b == CompareConst.READ_NONE and c == True) + self.assertTrue(np.array_equal(a, np.array([1, 2, 0, 4]))) + self.assertTrue(np.array_equal(b, np.array([1, 2, 0, 4]))) - def test_get_error_type_2(self): + def test_handle_inf_nan_both_inf(self): n_value = np.array([1, 2, np.inf, 4]) - b_value = np.array([1, 2, 3, 4]) + b_value = np.array([1, 2, np.inf, 4]) + + a, b = handle_inf_nan(n_value, b_value) + + self.assertTrue(np.array_equal(a, np.array([1, 2, 0, 4]))) + self.assertTrue(np.array_equal(b, np.array([1, 2, 0, 4]))) + + def test_get_error_flag_and_msg_normal(self): + n_value_0 = np.array([1, 2, 3, 4]) + b_value_0 = np.array([1, 2, 3, 4]) error_flag = False - a, b, c = get_error_type(n_value, b_value, error_flag) - self.assertTrue(a == CompareConst.NAN and b == CompareConst.NAN and c == True) - def test_get_error_type_3(self): - n_value = np.array([1, 2, 3, 4]) + n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value_0, b_value_0, error_flag=error_flag) + + self.assertTrue(np.array_equal(n_value, n_value_0)) + self.assertTrue(np.array_equal(b_value, b_value_0)) + self.assertFalse(error_flag) + self.assertEqual(err_msg, "") + + def test_get_error_flag_and_msg_read_none(self): + n_value = np.array([1, 2, np.inf, 4]) b_value = np.array([1, 2, 3, 4]) - error_flag = False - a, b, c = get_error_type(n_value, b_value, error_flag) - self.assertTrue(np.array_equal(a, n_value) and np.array_equal(b, b_value) and c == False) + error_flag = True + + n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value, b_value, error_flag=error_flag) - def test_get_error_type_4(self): + self.assertEqual(n_value, CompareConst.READ_NONE) + self.assertEqual(b_value, CompareConst.READ_NONE) + self.assertTrue(error_flag) + self.assertEqual(err_msg, CompareConst.NO_BENCH) + + def test_get_error_flag_and_msg_none(self): 
n_value = np.array([])
         b_value = np.array([1, 2, 3, 4, 5])
         error_flag = False
-        a, b, c = get_error_type(n_value, b_value, error_flag)
-        self.assertTrue(a == CompareConst.NONE and b == CompareConst.NONE and c == True)
-    def test_get_error_type_5(self):
+        n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value, b_value, error_flag=error_flag)
+
+        self.assertEqual(n_value, CompareConst.NONE)
+        self.assertEqual(b_value, CompareConst.NONE)
+        self.assertTrue(error_flag)
+        self.assertEqual(err_msg, "This is empty data, can not compare.")
+
+    def test_get_error_flag_and_msg_0d_tensor(self):
+        n_value = np.array(1)
+        b_value = np.array(1)
+        error_flag = False
+
+        n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value, b_value, error_flag=error_flag)
+
+        self.assertFalse(error_flag)
+        self.assertEqual(err_msg, "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', "
+                         "'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'. ")
+
+    def test_get_error_flag_and_msg_shape_unmatch(self):
         n_value = np.array([1, 2, 3, 4])
         b_value = np.array([1, 2, 3, 4, 5])
         error_flag = False
-        a, b, c = get_error_type(n_value, b_value, error_flag)
-        self.assertTrue(a == CompareConst.SHAPE_UNMATCH and b == CompareConst.SHAPE_UNMATCH and c == True)
-    def test_reshape_value_1(self):
-        n_value = np.array([[1, 2], [3, 4]])
-        b_value = np.array([[1, 2, 3], [3, 4, 5]])
-        a, b = reshape_value(n_value, b_value)
-        self.assertTrue(np.array_equal(a, np.array([1., 2., 3., 4.])) and np.array_equal(b, np.array([1., 2., 3., 3., 4., 5.])))
+        n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value, b_value, error_flag=error_flag)
-    def test_reshape_value_2(self):
-        n_value = np.array([])
-        b_value = np.array([])
-        a, b = reshape_value(n_value, b_value)
-        self.assertTrue(np.array_equal(a, n_value) and np.array_equal(b, b_value))
+        self.assertEqual(n_value, CompareConst.SHAPE_UNMATCH)
+        self.assertEqual(b_value, CompareConst.SHAPE_UNMATCH)
+        
self.assertTrue(error_flag) + self.assertEqual(err_msg, "Shape of NPU and bench tensor do not match. Skipped.") - def test_get_error_message_True(self): - b_value = CompareConst.READ_NONE - error_flag = True + def test_get_error_flag_and_msg_nan(self): + n_value = np.array([1.0, 2.0, np.inf, 4.0]) + b_value = np.array([1.0, 2.0, 3.0, 4.0]) + error_flag = False - n_value_1 = CompareConst.READ_NONE - result_1 = get_error_message(n_value_1, b_value, op_name, error_flag, error_file='abc') - self.assertEqual(result_1, 'Dump file: abc not found.') + n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value, b_value, error_flag=error_flag) - n_value_2 = CompareConst.READ_NONE - result_2 = get_error_message(n_value_2, b_value, op_name, error_flag) - self.assertEqual(result_2, CompareConst.NO_BENCH) + self.assertEqual(n_value, CompareConst.NAN) + self.assertEqual(b_value, CompareConst.NAN) + self.assertTrue(error_flag) + self.assertEqual(err_msg, "The position of inf or nan in NPU and bench Tensor do not match.") - n_value_3 = CompareConst.NONE - result_3 = get_error_message(n_value_3, b_value, op_name, error_flag) - self.assertEqual(result_3, 'This is empty data, can not compare.') + def test_get_error_flag_and_msg_diff_dtype(self): + n_value = np.array([1, 2, 3, 4]) + b_value = np.array([1.0, 2.0, 3.0, 4.0]) + error_flag = False - n_value_4 = CompareConst.SHAPE_UNMATCH - result_4 = get_error_message(n_value_4, b_value, op_name, error_flag) - self.assertEqual(result_4, 'Shape of NPU and bench Tensor do not match. 
Skipped.') + n_value, b_value, error_flag, err_msg = get_error_flag_and_msg(n_value, b_value, error_flag=error_flag) - n_value_5 = CompareConst.NAN - result_5 = get_error_message(n_value_5, b_value, op_name, error_flag) - self.assertEqual(result_5, 'The position of inf or nan in NPU and bench Tensor do not match.') + self.assertFalse(error_flag) + self.assertEqual(err_msg, "Dtype of NPU and bench tensor do not match.") - def test_get_error_message_False(self): - b_value = CompareConst.READ_NONE - error_flag = False + def test_reshape_value_normal(self): + n_value = np.array([[1, 2], [3, 4]]) + b_value = np.array([[1, 2, 3], [3, 4, 5]]) + a, b = reshape_value(n_value, b_value) + self.assertTrue(np.array_equal(a, np.array([1., 2., 3., 4.])) and np.array_equal(b, np.array([1., 2., 3., 3., 4., 5.]))) - n_value_1 = np.array(1) - result_1 = get_error_message(n_value_1, b_value, op_name, error_flag, error_file='abc') - self.assertEqual(result_1, 'This is type of scalar data, can not compare.') + def test_reshape_value_not_shape(self): + n_value = np.array([]) + b_value = np.array([]) + a, b = reshape_value(n_value, b_value) + self.assertTrue(np.array_equal(a, n_value) and np.array_equal(b, b_value)) - b_value = np.array([1]) - n_value_2 = np.array(['abc']) - result_2 = get_error_message(n_value_2, b_value, op_name, error_flag) - self.assertEqual(result_2, 'Dtype of NPU and bench Tensor do not match.') + def test_reshape_value_bool(self): + n_value = np.array(True) + b_value = np.array(True) + a, b = reshape_value(n_value, b_value) + self.assertTrue(np.array_equal(a, np.array(1.)) and np.array_equal(b, np.array(1.))) def test_data_check(self): n_value_1 = None @@ -170,210 +225,261 @@ class TestUtilsMethods(unittest.TestCase): self.assertEqual(error_message_3, 'Dump file not found.\nThis is type of scalar data, can not compare.\n''Dtype of NPU and bench Tensor do not match. 
Skipped.\n') self.assertTrue(error_flag_3) - def test_GetCosineSimilarity_Ture(self): - b_value = CompareConst.READ_NONE - error_flag = True + def test_get_relative_err(self): + n_value = np.array([1, 2]) + b_value = np.array([1, 1]) + result = get_relative_err(n_value, b_value) + + self.assertTrue(np.array_equal(result, [0.0, 1.0])) + + def test_GetCosineSimilarity_normal(self): + op = GetCosineSimilarity() - n_value_1 = CompareConst.READ_NONE + n_value_1 = np.array(1) + b_value_1 = np.array(1) + relative_err = get_relative_err(n_value_1, b_value_1) + n_value_1, b_value_1 = reshape_value(n_value_1, b_value_1) + err_msg = "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', 'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'. " + result, err_msg = op.apply(n_value_1, b_value_1, relative_err, err_msg) + self.assertEqual(result, CompareConst.UNSUPPORTED) + self.assertEqual(err_msg, "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', 'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'. 
") + + n_value_2 = np.array([1, 2]) + b_value_2 = np.array([1, 2]) + relative_err = get_relative_err(n_value_2, b_value_2) + n_value_2, b_value_2 = reshape_value(n_value_2, b_value_2) + err_msg = "" + result, err_msg = op.apply(n_value_2, b_value_2, relative_err, err_msg) + self.assertEqual(result, 1.0) + self.assertEqual(err_msg, "") + + n_value_3 = np.array([0, 0]) + b_value_3 = np.array([0, 0]) + relative_err = get_relative_err(n_value_3, b_value_3) + n_value_3, b_value_3 = reshape_value(n_value_3, b_value_3) + err_msg = "" + result, err_msg = op.apply(n_value_3, b_value_3, relative_err, err_msg) + self.assertEqual(result, 1.0) + self.assertEqual(err_msg, "") + + n_value_4 = np.array([0, 0]) + b_value_4 = np.array([1, 2]) + relative_err = get_relative_err(n_value_4, b_value_4) + n_value_4, b_value_4 = reshape_value(n_value_4, b_value_4) + err_msg = "" + result, err_msg = op.apply(n_value_4, b_value_4, relative_err, err_msg) + self.assertEqual(result, CompareConst.NAN) + self.assertEqual(err_msg, 'Cannot compare by Cosine Similarity, All the data is Zero in npu dump data.') + + n_value_5 = np.array([1, 2]) + b_value_5 = np.array([0, 0]) + relative_err = get_relative_err(n_value_5, b_value_5) + n_value_5, b_value_5 = reshape_value(n_value_5, b_value_5) + err_msg = "" + result, err_msg = op.apply(n_value_5, b_value_5, relative_err, err_msg) + self.assertEqual(result, CompareConst.NAN) + self.assertEqual(err_msg, 'Cannot compare by Cosine Similarity, All the data is Zero in Bench dump data.') + + def test_GetCosineSimilarity_not_shape(self): op = GetCosineSimilarity() - a_1, b_1 = op.apply(n_value_1, b_value, error_flag) - self.assertEqual(a_1, CompareConst.NONE) - self.assertEqual(b_1, '') - - n_value_2 = CompareConst.NONE - a_2, b_2 = op.apply(n_value_2, b_value, error_flag) - self.assertEqual(a_2, CompareConst.UNSUPPORTED) - self.assertEqual(b_2, '') - - n_value_3 = CompareConst.SHAPE_UNMATCH - a_3, b_3 = op.apply(n_value_3, b_value, error_flag) - 
self.assertEqual(a_3, CompareConst.SHAPE_UNMATCH) - self.assertEqual(b_3, '') - - n_value_4 = CompareConst.NAN - a_4, b_4 = op.apply(n_value_4, b_value, error_flag) - self.assertEqual(a_4, 'N/A') - self.assertEqual(b_4, '') - - def test_GetCosineSimilarity_False(self): - error_flag_2 = False - b_value = CompareConst.READ_NONE - - n_value_5 = np.array(1) + + n_value_1 = np.array([1]) + b_value_1 = np.array([1]) + relative_err = get_relative_err(n_value_1, b_value_1) + n_value_1, b_value_1 = reshape_value(n_value_1, b_value_1) + err_msg = "" + + result, err_msg = op.apply(n_value_1, b_value_1, relative_err, err_msg) + self.assertEqual(result, CompareConst.UNSUPPORTED) + self.assertEqual(err_msg, "This is a 1-d tensor of length 1.") + + @patch("numpy.isnan", return_value=True) + def test_GetCosineSimilarity_isnan(self, mock_isnan): op = GetCosineSimilarity() - a_5, b_5 = op.apply(n_value_5, b_value, error_flag_2) - self.assertEqual(a_5, CompareConst.UNSUPPORTED) - self.assertEqual(b_5, '') - - n_value_6 = np.array([1, 2]) - b_value_6 = np.array([1, 2]) - a_6, b_6 = op.apply(n_value_6, b_value_6, error_flag_2) - self.assertEqual(a_6, 1.0) - self.assertEqual(b_6, '') - - n_value_7 = np.array([0, 0]) - b_value_7 = np.array([0, 0]) - a_7, b_7 = op.apply(n_value_7, b_value_7, error_flag_2) - self.assertEqual(a_7, 1.0) - self.assertEqual(b_7, '') - - n_value_8 = np.array([0, 0]) - b_value_8 = np.array([1, 2]) - a_8, b_8 = op.apply(n_value_8, b_value_8, error_flag_2) - self.assertEqual(a_8, CompareConst.NAN) - self.assertEqual(b_8, 'Cannot compare by Cosine Similarity, All the data is Zero in npu dump data.') - - n_value_9 = np.array([1, 2]) - b_value_9 = np.array([0, 0]) - a_9, b_9 = op.apply(n_value_9, b_value_9, error_flag_2) - self.assertEqual(a_9, CompareConst.NAN) - self.assertEqual(b_9, 'Cannot compare by Cosine Similarity, All the data is Zero in Bench dump data.') - - def test_GetMaxAbsErr_True(self): - b_value = CompareConst.READ_NONE - error_flag = True - 
n_value_1 = CompareConst.READ_NONE - op = GetMaxAbsErr() - a_1, b_1 = op.apply(n_value_1, b_value, error_flag) - self.assertEqual(a_1, CompareConst.NONE) - self.assertEqual(b_1, '') + n_value = np.array([1, 2]) + b_value = np.array([1, 1]) + relative_err = get_relative_err(n_value, b_value) + n_value, b_value = reshape_value(n_value, b_value) + err_msg = "" - n_value_2 = CompareConst.NONE - a_2, b_2 = op.apply(n_value_2, b_value, error_flag) - self.assertEqual(a_2, 0) - self.assertEqual(b_2, '') + result, err_msg = op.apply(n_value, b_value, relative_err, err_msg) - n_value_3 = CompareConst.SHAPE_UNMATCH - a_3, b_3 = op.apply(n_value_3, b_value, error_flag) - self.assertEqual(a_3, CompareConst.SHAPE_UNMATCH) - self.assertEqual(b_3, '') + self.assertEqual(result, CompareConst.NAN) + self.assertEqual(err_msg, "Cannot compare by Cosine Similarity, the dump data has NaN.") + mock_isnan.assert_called_once() - n_value_4 = CompareConst.NAN - a_4, b_4 = op.apply(n_value_4, b_value, error_flag) - self.assertEqual(a_4, 'N/A') - self.assertEqual(b_4, '') + def test_GetCosineSimilarity_correct_data(self): + op = GetCosineSimilarity() - def test_GetMaxAbsErr_False(self): - error_flag_2 = False + result_origin = CompareConst.NAN + result = op.correct_data(result_origin) + self.assertEqual(result, CompareConst.NAN) - n_value_5 = np.array([1, 2]) - b_value_5 = np.array([0, 0]) + result_origin = 1 + result = op.correct_data(result_origin) + self.assertEqual(result, float(result_origin)) + + def test_GetMaxAbsErr_normal(self): + op = GetMaxAbsErr() + + n_value = np.array([1, 2]) + b_value = np.array([0, 0]) + relative_err = get_relative_err(n_value, b_value) + n_value, b_value = reshape_value(n_value, b_value) + err_msg = "" + + result, err_msg = op.apply(n_value, b_value, relative_err, err_msg) + + self.assertEqual(result, 2.0) + self.assertEqual(err_msg, "") + + @patch("numpy.isnan", return_value=True) + def test_GetMaxAbsErr_isnan(self, mock_isnan): op = GetMaxAbsErr() - a_5, b_5 
= op.apply(n_value_5, b_value_5, error_flag_2) - self.assertEqual(a_5, 2.0) - self.assertEqual(b_5, '') - def test_get_relative_err(self): n_value = np.array([1, 2]) b_value = np.array([1, 1]) - result = get_relative_err(n_value, b_value) - self.assertTrue(np.array_equal(result, [0.0, 1.0])) + relative_err = get_relative_err(n_value, b_value) + n_value, b_value = reshape_value(n_value, b_value) + err_msg = "" - def test_GetMaxRelativeErr_True(self): - b_value = CompareConst.READ_NONE - error_flag = True + result, err_msg = op.apply(n_value, b_value, relative_err, err_msg) - n_value_1 = CompareConst.READ_NONE - op = GetMaxRelativeErr() - a_1, b_1 = op.apply(n_value_1, b_value, error_flag) - self.assertEqual(a_1, CompareConst.NONE) - self.assertEqual(b_1, '') + self.assertEqual(result, CompareConst.NAN) + self.assertEqual(err_msg, "Cannot compare by MaxAbsError, the data contains nan/inf/-inf in dump data.") + mock_isnan.assert_called_once() - n_value_2 = CompareConst.NONE - a_2, b_2 = op.apply(n_value_2, b_value, error_flag) - self.assertEqual(a_2, 0) - self.assertEqual(b_2, '') + def test_GetMaxRelativeErr_normal(self): + op = GetMaxRelativeErr() - n_value_3 = CompareConst.SHAPE_UNMATCH - a_3, b_3 = op.apply(n_value_3, b_value, error_flag) - self.assertEqual(a_3, CompareConst.SHAPE_UNMATCH) - self.assertEqual(b_3, '') + n_value = np.array([1, 2]) + b_value = np.array([1, 1]) + relative_err = get_relative_err(n_value, b_value) + n_value, b_value = reshape_value(n_value, b_value) + err_msg = "" - n_value_4 = CompareConst.NAN - a_4, b_4 = op.apply(n_value_4, b_value, error_flag) - self.assertEqual(a_4, 'N/A') - self.assertEqual(b_4, '') + result, err_msg = op.apply(n_value, b_value, relative_err, err_msg) - def test_GetMaxRelativeErr_False(self): - error_flag_2 = False + self.assertEqual(result, 1.0) + self.assertEqual(err_msg, "") - n_value_5 = np.array([1, 2]) - b_value_5 = np.array([1, 1]) + @patch("numpy.isnan", return_value=True) + def 
test_GetMaxRelativeErr_isnan(self, mock_isnan): op = GetMaxRelativeErr() - a_5, b_5 = op.apply(n_value_5, b_value_5, error_flag_2) - self.assertEqual(a_5, 1.0) - self.assertEqual(b_5, '') - def test_GetThousandErrRatio_True(self): - b_value = CompareConst.READ_NONE - error_flag = True + n_value = np.array([1, 2]) + b_value = np.array([1, 1]) + relative_err = get_relative_err(n_value, b_value) + n_value, b_value = reshape_value(n_value, b_value) + err_msg = "" - n_value_1 = CompareConst.READ_NONE - op = GetThousandErrRatio() - a_1, b_1 = op.apply(n_value_1, b_value, error_flag) - self.assertEqual(a_1, CompareConst.NONE) - self.assertEqual(b_1, '') + result, err_msg = op.apply(n_value, b_value, relative_err, err_msg) - n_value_2 = CompareConst.NONE - a_2, b_2 = op.apply(n_value_2, b_value, error_flag) - self.assertEqual(a_2, 0) - self.assertEqual(b_2, '') + self.assertEqual(result, CompareConst.NAN) + self.assertEqual(err_msg, "Cannot compare by MaxRelativeError, the data contains nan/inf/-inf in dump data.") + mock_isnan.assert_called_once() - n_value_3 = CompareConst.SHAPE_UNMATCH - a_3, b_3 = op.apply(n_value_3, b_value, error_flag) - self.assertEqual(a_3, CompareConst.SHAPE_UNMATCH) - self.assertEqual(b_3, '') + def test_GetThousandErrRatio_normal(self): + op = GetErrRatio(CompareConst.THOUSAND_RATIO_THRESHOLD) - n_value_4 = CompareConst.NAN - a_4, b_4 = op.apply(n_value_4, b_value, error_flag) - self.assertEqual(a_4, 'N/A') - self.assertEqual(b_4, '') + n_value = np.array([1, 2]) + b_value = np.array([1, 1]) + relative_err = get_relative_err(n_value, b_value) + n_value, b_value = reshape_value(n_value, b_value) + err_msg = "" - def test_GetThousandErrRatio_False(self): - error_flag_2 = False + result, err_msg = op.apply(n_value, b_value, relative_err, err_msg) - n_value_5 = np.array([1, 2]) - b_value_5 = np.array([1, 1]) - op = GetThousandErrRatio() - a_5, b_5 = op.apply(n_value_5, b_value_5, error_flag_2) - self.assertEqual(a_5, 0.5) - self.assertEqual(b_5, '') 
-
-    def test_GetFiveThousandErrRatio_True(self):
-        b_value = CompareConst.READ_NONE
-        error_flag = True
+        self.assertEqual(result, 0.5)
+        self.assertEqual(err_msg, "")
-        n_value_1 = CompareConst.READ_NONE
-        op = GetFiveThousandErrRatio()
-        a_1, b_1 = op.apply(n_value_1, b_value, error_flag)
-        self.assertEqual(a_1, CompareConst.NONE)
-        self.assertEqual(b_1, '')
+    def test_GetThousandErrRatio_not_shape(self):
+        op = GetErrRatio(CompareConst.THOUSAND_RATIO_THRESHOLD)
-        n_value_2 = CompareConst.NONE
-        a_2, b_2 = op.apply(n_value_2, b_value, error_flag)
-        self.assertEqual(a_2, 0)
-        self.assertEqual(b_2, '')
+        n_value = np.array(1)  # scalar (0-d tensor)
+        b_value = np.array(1)
+        relative_err = np.array(0)
+        err_msg = "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', 'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'. "
-        n_value_3 = CompareConst.SHAPE_UNMATCH
-        a_3, b_3 = op.apply(n_value_3, b_value, error_flag)
-        self.assertEqual(a_3, CompareConst.SHAPE_UNMATCH)
-        self.assertEqual(b_3, '')
+        result, err_msg = op.apply(n_value, b_value, relative_err, err_msg)
-        n_value_4 = CompareConst.NAN
-        a_4, b_4 = op.apply(n_value_4, b_value, error_flag)
-        self.assertEqual(a_4, 'N/A')
-        self.assertEqual(b_4, '')
+        self.assertEqual(result, CompareConst.UNSUPPORTED)
+        self.assertEqual(err_msg, "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', 'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'. 
")
-    def test_GetFiveThousandErrRatio_False(self):
-        error_flag_2 = False
+    def test_GetThousandErrRatio_not_size(self):
+        op = GetErrRatio(CompareConst.THOUSAND_RATIO_THRESHOLD)
-        n_value_5 = np.array([1, 2])
-        b_value_5 = np.array([1, 1])
-        op = GetFiveThousandErrRatio()
-        a_5, b_5 = op.apply(n_value_5, b_value_5, error_flag_2)
-        self.assertEqual(a_5, 0.5)
-        self.assertEqual(b_5, '')
+        n_value = np.array([1, 2])
+        b_value = np.array([1, 2])
+        relative_err = np.array([])  # empty array
+        err_msg = ""
+
+        result, err_msg = op.apply(n_value, b_value, relative_err, err_msg)
+
+        self.assertEqual(result, CompareConst.NAN)
+        self.assertEqual(err_msg, "")
+
+    def test_GetFiveThousandErrRatio_normal(self):
+        op = GetErrRatio(CompareConst.FIVE_THOUSAND_RATIO_THRESHOLD)
+
+        n_value = np.array([1, 2])
+        b_value = np.array([1, 1])
+        relative_err = get_relative_err(n_value, b_value)
+        n_value, b_value = reshape_value(n_value, b_value)
+        err_msg = ""
+
+        result, err_msg = op.apply(n_value, b_value, relative_err, err_msg)
+
+        self.assertEqual(result, 0.5)
+        self.assertEqual(err_msg, "")
+
+    def test_error_value_process_read_none(self):
+        n_value = CompareConst.READ_NONE
+        result, err_msg = error_value_process(n_value)
+
+        self.assertEqual(result, CompareConst.UNSUPPORTED)
+        self.assertEqual(err_msg, "")
+
+    def test_error_value_process_unreadable(self):
+        n_value = CompareConst.UNREADABLE
+
+        result, err_msg = error_value_process(n_value)
+
+        self.assertEqual(result, CompareConst.UNSUPPORTED)
+        self.assertEqual(err_msg, "")
+
+    def test_error_value_process_none(self):
+        n_value = CompareConst.NONE
+
+        result, err_msg = error_value_process(n_value)
+
+        self.assertEqual(result, 0)
+        self.assertEqual(err_msg, "")
+
+    def test_error_value_process_shape_unmatch(self):
+        n_value = CompareConst.SHAPE_UNMATCH
+
+        result, err_msg = error_value_process(n_value)
+
+        self.assertEqual(result, CompareConst.SHAPE_UNMATCH)
+        self.assertEqual(err_msg, "")
+
+    def test_error_value_process_nan(self):
+        
n_value = CompareConst.NAN
+
+        result, err_msg = error_value_process(n_value)
+
+        self.assertEqual(result, CompareConst.N_A)
+        self.assertEqual(err_msg, "")
+
+    def test_error_value_process_other(self):
+        n_value = "abc"
+
+        result, err_msg = error_value_process(n_value)
+
+        self.assertEqual(result, CompareConst.N_A)
+        self.assertEqual(err_msg, "")
 
     def test_compare_ops_apply(self):
         n_value = np.array([1, 1])
@@ -381,5 +487,34 @@ class TestUtilsMethods(unittest.TestCase):
         error_flag = False
         err_msg = ''
         a, b = compare_ops_apply(n_value, b_value, error_flag, err_msg)
-        self.assertEqual(a, [1.0, 0.0, 0.0, 1.0, 1.0])
-        self.assertEqual(b, '')
\ No newline at end of file
+        self.assertEqual(a, [1.0, 0.0, 0.0, 0.0, 1.0, 1.0])
+        self.assertEqual(b, '')
+
+
+class TestGetEuclideanDistance(unittest.TestCase):
+
+    def setUp(self):
+        self.euc_distance = GetEuclideanDistance()
+
+    def test_euclidean_distance_normal(self):
+        # Test the Euclidean distance between two ordinary tensors
+        n_value = np.array([1, 2, 3])
+        b_value = np.array([4, 5, 6])
+        relative_err = None
+        err_msg = ""
+
+        result, msg = self.euc_distance.apply(n_value, b_value, relative_err, err_msg)
+        expected_distance = np.linalg.norm(n_value - b_value)
+        self.assertEqual(result, expected_distance)
+        self.assertEqual(msg, '')
+
+    def test_euclidean_distance_0d_tensor(self):
+        # A 0-d tensor cannot be compared; expect UNSUPPORTED
+        n_value = np.array(1)
+        b_value = np.array(1)
+        relative_err = None
+        err_msg = "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', 'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'. "
+
+        result, msg = self.euc_distance.apply(n_value, b_value, relative_err, err_msg)
+        self.assertEqual(result, CompareConst.UNSUPPORTED)
+        self.assertEqual(msg, "This is type of 0-d tensor, can not calculate 'Cosine', 'EucDist', 'One Thousandth Err Ratio' and 'Five Thousandths Err Ratio'. 
") diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_utils.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_utils.py index 3150ee14f5dedb45456a9ca1b38cdcab88862fe4..2e9a46572662489e861f98f03f25e9e480031bcf 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_utils.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_acc_compare_utils.py @@ -1,40 +1,45 @@ # coding=utf-8 -import os +import argparse import json +import os import shutil import unittest -import argparse -from msprobe.core.compare.utils import extract_json, rename_api, read_op, op_item_parse, \ - check_and_return_dir_contents, resolve_api_special_parameters, get_rela_diff_summary_mode, \ - get_accuracy, get_un_match_accuracy, merge_tensor, _compare_parser -from msprobe.core.common.utils import CompareException -from msprobe.core.common.const import Const +from unittest.mock import patch +import zlib +import numpy as np + +from msprobe.core.common.const import CompareConst, Const +from msprobe.core.common.utils import CompareException +from msprobe.core.compare.utils import ApiItemInfo, _compare_parser, check_and_return_dir_contents, extract_json, \ + count_struct, get_accuracy, append_stack_info, get_rela_diff_summary_mode, get_un_match_accuracy, merge_tensor, \ + op_item_parse, read_op, rename_api, resolve_api_special_parameters, result_item_init, stack_column_process, \ + table_value_is_valid, get_name_and_state, reorder_op_name_list, reorder_op_x_list, gen_op_item # test_read_op_1 op_data = { 'input_args': [{'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], - 'Max': 0.33033010363578796, 'Min': -0.331031858921051,'Mean': -0.030964046716690063, + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True}, {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'Max': 0.003992878366261721, 'Min': 
-0.008102823048830032, 'Mean': -0.0002002553956117481, 'Norm': 0.02844562754034996, 'requires_grad': False}], 'input_kwargs': {'alpha': {'type': 'float', 'value': -0.1}}, 'output': [{'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], - 'Max': 0.33033010363578796, 'Min': -0.331031858921051,'Mean': -0.030964046716690063, + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True}]} op_name = "Tensor.add_0.0.forward" op_result = [ - {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], - 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'md5': '00000000', + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'data_name': '-1', 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_0.0.forward.input.0'}, - {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], - 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'md5': '00000000', + 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, 'data_name': '-1', 'Norm': 0.02844562754034996, 'requires_grad': False, 'full_op_name': 'Tensor.add_0.0.forward.input.1'}, - {'full_op_name': 'Tensor.add_0.0.forward.input.alpha.0', 'dtype': "", 'shape': '[]', 'md5': None, - 'Max': -0.1, 'Min': -0.1, 'Mean': -0.1, 'Norm': -0.1, 'data_name': '-1'}, - {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], - 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, + {'full_op_name': 'Tensor.add_0.0.forward.input.alpha', 'dtype': "", 'shape': '[]', 'md5': '0dae4479', + 'Max': -0.1, 'Min': -0.1, 'Mean': -0.1, 'Norm': -0.1, 'data_name': 
'-1', 'type': 'float', 'value': -0.1}, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'md5': '00000000', + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'data_name': '-1', 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_0.0.forward.output.0'}] # test_read_op_1 @@ -50,20 +55,20 @@ op_data_b = { 'Norm': 2.2533628940582275, 'requires_grad': True}]} op_name_b = "Tensor.add_0.0.backward" op_result_b = [ - {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'data_name': '-1', 'md5': '00000000', 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_0.0.backward.input.0'}, - {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'data_name': '-1', 'md5': '00000000', 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, 'Norm': 0.02844562754034996, 'requires_grad': False, 'full_op_name': 'Tensor.add_0.0.backward.input.1'}, - {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'data_name': '-1', 'md5': '00000000', 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_0.0.backward.output.0'}] - # test_op_item_parse parse_item = [ - {'Max': 4097.0, 'Mean': 820.2, 'Min': 0.0, 'Norm': 4097.0, 'dtype': 'torch.int64', 'requires_grad': False, 'shape': [5], 'type': 'torch.Tensor'}, + {'Max': 4097.0, 'Mean': 820.2, 'Min': 0.0, 'Norm': 4097.0, 'dtype': 'torch.int64', 'requires_grad': False, + 'shape': [5], 'type': 'torch.Tensor'}, {'type': 
'int', 'value': 0}, {'type': 'slice', 'value': [None, None, None]} ] @@ -73,14 +78,15 @@ parse_item_list = None parse_top_bool = True o_result_parse = [ {'Max': 4097.0, 'Mean': 820.2, 'Min': 0.0, 'Norm': 4097.0, 'dtype': 'torch.int64', 'requires_grad': False, - 'shape': [5], 'type': 'torch.Tensor', 'full_op_name': 'Distributed.broadcast.0.forward.input.0'}, + 'shape': [5], 'type': 'torch.Tensor', 'full_op_name': 'Distributed.broadcast.0.forward.input.0', + 'data_name': '-1', 'md5': '00000000'}, {'full_op_name': 'Distributed.broadcast.0.forward.input.1', 'dtype': "", 'shape': '[]', - 'md5': None, 'Max': 0, 'Min': 0, 'Mean': 0, 'Norm': 0, 'data_name': '-1'}, - {'Max': None, 'Mean': None, 'Min': None, 'Norm': None, 'data_name': '-1', 'dtype': 'slice', - 'full_op_name': 'Distributed.broadcast.0.forward.input.2', 'md5': None, 'shape': '(3,)'} + 'md5': 'f4dbdf21', 'Max': 0, 'Min': 0, 'Mean': 0, 'Norm': 0, 'data_name': '-1', 'type': 'int', 'value': 0}, + {'Max': None, 'Mean': None, 'Min': None, 'Norm': None, 'data_name': '-1', 'dtype': 'slice', 'type': 'slice', + 'full_op_name': 'Distributed.broadcast.0.forward.input.2', 'md5': '5fbbe87f', 'shape': '(3,)', + 'value': [None, None, None]} ] - # test_resolve_api_special_parameters data_dict = { "last_hidden_state": @@ -90,71 +96,165 @@ data_dict = { } full_op_name = "Tensor.add_0.0.forward.input.0" o_result_api_special = [ - {"type": "torch.Tensor", "dtype": "torch.bfloat16", "full_op_name": "Tensor.add_0.0.forward.input.last_hidden_state.0"}, + {"type": "torch.Tensor", "dtype": "torch.bfloat16", + "full_op_name": "Tensor.add_0.0.forward.input.last_hidden_state.0"}, {"type": "torch.Tensor", "dtype": "torch.float32", "full_op_name": "Tensor.add_0.0.forward.input.loss.0"} ] - # test_get_accuracy npu_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.1', - 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output'], - 'input_struct': [('torch.float32', [1, 1, 28, 28]), 
('torch.float32', [16, 1, 5, 5]), + 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output.0', + 'Functional.conv2d.0.forward.parameters.weight', 'Functional.conv2d.0.forward.parameters.bias', + 'Functional.conv2d.0.parameters_grad.weight', 'Functional.conv2d.0.parameters_grad.bias'], + 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), ('torch.float32', [16])], 'output_struct': [('torch.float32', [1, 16, 28, 28])], - 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], - [0.19919930398464203, -0.19974489510059357, 0.006269412115216255], - [0.19734230637550354, -0.18177609145641327, 0.007903944700956345], - [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} + 'params_struct': [('torch.float32', [1, 16, 28, 28]), ('torch.float32', [1, 16, 28, 28])], + 'params_grad_struct': [('torch.float32', [1, 16, 28, 28]), ('torch.float32', [1, 16, 28, 28])], + 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029, 1.0], + [0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 1.0], + [0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 1.0], + [2.1166646480560303, -2.190781354904175, -0.003579073818400502, 1.0], + [1.0, 1.0, 1.0, 1.0], + [1.0, 1.0, 1.0, 1.0], + [1.0, 1.0, 1.0, 1.0], + [1.0, 1.0, 1.0, 1.0]], + 'stack_info': []} bench_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.1', - 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output'], - 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), + 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output.0', + 'Functional.conv2d.0.forward.parameters.weight', 'Functional.conv2d.0.forward.parameters.bias', + 'Functional.conv2d.0.parameters_grad.weight', 'Functional.conv2d.0.parameters_grad.bias'], + 'input_struct': [('torch.float32', [1, 1, 28, 28]), 
('torch.float32', [16, 1, 5, 5]), ('torch.float32', [16])], 'output_struct': [('torch.float32', [1, 16, 28, 28])], - 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], - [0.19919930398464203, -0.19974489510059357, 0.006269412115216255], - [0.19734230637550354, -0.18177609145641327, 0.007903944700956345], - [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} + 'params_struct': [('torch.float32', [1, 16, 28, 28]), ('torch.float32', [1, 16, 28, 28])], + 'params_grad_struct': [('torch.float32', [1, 16, 28, 28]), ('torch.float32', [1, 16, 28, 28])], + 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029, 1.0], + [0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 1.0], + [0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 1.0], + [2.1166646480560303, -2.190781354904175, -0.003579073818400502, 1.0], + [1.0, 1.0, 1.0, 1.0], + [1.0, 1.0, 1.0, 1.0], + [1.0, 1.0, 1.0, 1.0], + [1.0, 1.0, 1.0, 1.0]], + 'stack_info': []} highlight_dict = {'red_rows': [], 'yellow_rows': []} o_result = [ ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.0', 'torch.float32', 'torch.float32', - [1, 1, 28, 28], [1, 1, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 3.029174327850342, -2.926689624786377, - -0.06619918346405029, 3.029174327850342, -2.926689624786377, -0.06619918346405029, '', '', 'None'], + [1, 1, 28, 28], [1, 1, 28, 28], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 3.029174327850342, -2.926689624786377, -0.06619918346405029, 1.0, + 3.029174327850342, -2.926689624786377, -0.06619918346405029, 1.0,'', '', 'None'], ['Functional.conv2d.0.forward.input.1', 'Functional.conv2d.0.forward.input.1', 'torch.float32', 'torch.float32', - [16, 1, 5, 5], [16, 1, 5, 5], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19919930398464203, -0.19974489510059357, - 0.006269412115216255, 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, '', 
'', 'None'], + [16, 1, 5, 5], [16, 1, 5, 5], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 1.0, + 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 1.0, '', '', 'None'], ['Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.input.2', 'torch.float32', 'torch.float32', - [16], [16], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, - 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, '', '', 'None'], - ['Functional.conv2d.0.forward.output', 'Functional.conv2d.0.forward.output', 'torch.float32', 'torch.float32', - [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 2.1166646480560303, -2.190781354904175, - -0.003579073818400502, 2.1166646480560303, -2.190781354904175, -0.003579073818400502, '', '', 'None']] - + [16], [16], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 1.0, + 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 1.0, '', '', 'None'], + ['Functional.conv2d.0.forward.parameters.weight', 'Functional.conv2d.0.forward.parameters.weight', 'torch.float32', + 'torch.float32', + [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, '', '', 'None'], + ['Functional.conv2d.0.forward.parameters.bias', 'Functional.conv2d.0.forward.parameters.bias', 'torch.float32', + 'torch.float32', + [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, '', '', 'None'], + ['Functional.conv2d.0.forward.output.0', 'Functional.conv2d.0.forward.output.0', 'torch.float32', 'torch.float32', + [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 2.1166646480560303, -2.190781354904175, -0.003579073818400502, 1.0, + 
2.1166646480560303, -2.190781354904175, -0.003579073818400502, 1.0, '', '', 'None'], + ['Functional.conv2d.0.parameters_grad.weight', 'Functional.conv2d.0.parameters_grad.weight', 'torch.float32', 'torch.float32', + [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, '', '', 'None'], + ['Functional.conv2d.0.parameters_grad.bias', 'Functional.conv2d.0.parameters_grad.bias', 'torch.float32', 'torch.float32', + [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, '', '', 'None'], +] # test_get_un_match_accuracy o_result_unmatch_1 = [ - ['Functional.conv2d.0.forward.input.0', 'N/A', 'torch.float32', 'N/A', [1, 1, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', 'None'], - ['Functional.conv2d.0.forward.input.1', 'N/A', 'torch.float32', 'N/A', [16, 1, 5, 5], 'N/A', 'N/A', 'N/A', 'N/A', 'None'], + ['Functional.conv2d.0.forward.input.0', 'N/A', 'torch.float32', 'N/A', [1, 1, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'None'], + ['Functional.conv2d.0.forward.input.1', 'N/A', 'torch.float32', 'N/A', [16, 1, 5, 5], 'N/A', 'N/A', 'N/A', 'N/A', + 'None'], ['Functional.conv2d.0.forward.input.2', 'N/A', 'torch.float32', 'N/A', [16], 'N/A', 'N/A', 'N/A', 'N/A', 'None'], - ['Functional.conv2d.0.forward.output', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', 'None'] + ['Functional.conv2d.0.forward.parameters.weight', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', + 'N/A', 'N/A', + 'None'], + ['Functional.conv2d.0.forward.parameters.bias', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', + 'N/A', + 'None'], + ['Functional.conv2d.0.forward.output.0', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'None'], + ['Functional.conv2d.0.parameters_grad.weight', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'None'], + 
['Functional.conv2d.0.parameters_grad.bias', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'None'] ] o_result_unmatch_2 = [ - ['Functional.conv2d.0.forward.input.0', 'N/A', 'torch.float32', 'N/A', [1, 1, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 3.029174327850342, -2.926689624786377, -0.06619918346405029, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], - ['Functional.conv2d.0.forward.input.1', 'N/A', 'torch.float32', 'N/A', [16, 1, 5, 5], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], - ['Functional.conv2d.0.forward.input.2', 'N/A', 'torch.float32', 'N/A', [16], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], - ['Functional.conv2d.0.forward.output', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 2.1166646480560303, -2.190781354904175, -0.003579073818400502, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'] + ['Functional.conv2d.0.forward.input.0', 'N/A', 'torch.float32', 'N/A', [1, 1, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 3.029174327850342, -2.926689624786377, -0.06619918346405029, 1.0, 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], + ['Functional.conv2d.0.forward.input.1', 'N/A', 'torch.float32', 'N/A', [16, 1, 5, 5], 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 1.0, 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], + ['Functional.conv2d.0.forward.input.2', 'N/A', 'torch.float32', 'N/A', [16], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 
'N/A', 'N/A', 'N/A', 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 1.0, 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'No bench data matched.', 'None'], + ['Functional.conv2d.0.forward.parameters.weight', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', + 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 1.0, 1.0, 1.0, 1.0, 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], + ['Functional.conv2d.0.forward.parameters.bias', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', + 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 1.0, 1.0, 1.0, 1.0, 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], + ['Functional.conv2d.0.forward.output.0', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 2.1166646480560303, -2.190781354904175, -0.003579073818400502, 1.0, 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], + ['Functional.conv2d.0.parameters_grad.weight', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 1.0, 1.0, 1.0, 1.0, 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'], + ['Functional.conv2d.0.parameters_grad.bias', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 1.0, 1.0, 1.0, 1.0, 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None'] ] o_result_unmatch_3 = [ - ['Functional.conv2d.0.forward.input.0', 'N/A', 'torch.float32', 'N/A', [1, 1, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 3.029174327850342, -2.926689624786377, -0.06619918346405029, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None', '-1'], - ['Functional.conv2d.0.forward.input.1', 'N/A', 'torch.float32', 'N/A', [16, 1, 5, 5], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 'N/A', 'N/A', 'N/A', 'N/A', 
'N/A', 'No bench data matched.', 'None', '-1'], - ['Functional.conv2d.0.forward.input.2', 'N/A', 'torch.float32', 'N/A', [16], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None', '-1'], - ['Functional.conv2d.0.forward.output', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 2.1166646480560303, -2.190781354904175, -0.003579073818400502, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None', '-1'] + ['Functional.conv2d.0.forward.input.0', 'N/A', 'torch.float32', 'N/A', [1, 1, 28, 28], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 3.029174327850342, -2.926689624786377, -0.06619918346405029, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 'No bench data matched.', 'None', '-1'], + ['Functional.conv2d.0.forward.input.1', 'N/A', 'torch.float32', 'N/A', [16, 1, 5, 5], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 'No bench data matched.', 'None', '-1'], + ['Functional.conv2d.0.forward.input.2', 'N/A', 'torch.float32', 'N/A', [16], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 'No bench data matched.', 'None', '-1'], + ['Functional.conv2d.0.forward.parameters.weight', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 1.0, 1.0, 1.0, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None', '-1'], + ['Functional.conv2d.0.forward.parameters.bias', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 1.0, 1.0, 1.0, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None', '-1'], + ['Functional.conv2d.0.forward.output.0', 'N/A', 'torch.float32', 
'N/A', [1, 16, 28, 28], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 2.1166646480560303, -2.190781354904175, -0.003579073818400502, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 'No bench data matched.', 'None', '-1'], + ['Functional.conv2d.0.parameters_grad.weight', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 1.0, 1.0, 1.0, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None', '-1'], + ['Functional.conv2d.0.parameters_grad.bias', 'N/A', 'torch.float32', 'N/A', [1, 16, 28, 28], 'N/A', + 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 1.0, 1.0, 1.0, 1.0, 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', 'No bench data matched.', 'None', '-1'] ] - # test_merge_tensor tensor_list = [ {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'Max': 0.33033010363578796, - 'Min': -0.331031858921051,'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True, + 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_.0.forward.input.0'}, {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, @@ -170,8 +270,11 @@ result_op_dict = {'op_name': ['Tensor.add_.0.forward.input.0', 'Tensor.add_.0.fo 'input_struct': [('torch.float32', [16, 1, 3, 3]), ('torch.float32', [16, 1, 3, 3]), ("", '[]')], 'output_struct': [('torch.float32', [16, 1, 3, 3])], + 'params_struct': [], + 'params_grad_struct': [], 'summary': [[0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275], - [0.003992878366261721, -0.008102823048830032, -0.0002002553956117481, 0.02844562754034996], + [0.003992878366261721, -0.008102823048830032, -0.0002002553956117481, + 0.02844562754034996], [-0.1, -0.1, -0.1, -0.1], [0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275]], 
'stack_info': []} @@ -188,15 +291,16 @@ tensor_list_md5 = [ ] result_op_dict_md5 = {'op_name': ['Tensor.add_.0.forward.input.0', 'Tensor.add_.0.forward.kwargs.alpha.0', 'Tensor.add_.0.forward.output.0'], - 'input_struct': [('torch.float32', [16, 1, 3, 3], 1)], - 'kwargs_struct': [("", '[]', None)], + 'input_struct': [('torch.float32', [16, 1, 3, 3], 1), ("", '[]', None)], 'output_struct': [('torch.float32', [16, 1, 3, 3], 2)], - 'summary': [[0.003992878366261721, -0.008102823048830032, -0.0002002553956117481, 0.02844562754034996], - [-0.1, -0.1, -0.1, -0.1], - [0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275]], + 'params_struct': [], + 'params_grad_struct': [], + 'summary': [ + [0.003992878366261721, -0.008102823048830032, -0.0002002553956117481, 0.02844562754034996], + [-0.1, -0.1, -0.1, -0.1], + [0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275]], 'stack_info': []} - base_dir1 = os.path.join(os.path.dirname(os.path.abspath(__file__)), f'test_acc_compare_utils1') base_dir2 = os.path.join(os.path.dirname(os.path.abspath(__file__)), f'test_acc_compare_utils2') @@ -267,12 +371,12 @@ class TestUtilsMethods(unittest.TestCase): self.assertEqual(result, op_result_b) def test_op_item_parse(self): - result = op_item_parse(parse_item, parse_op_name, parse_index, parse_item_list, parse_top_bool) + result = op_item_parse(parse_item, parse_op_name) self.assertEqual(result, o_result_parse) def test_op_item_parse_max_depth(self): with self.assertRaises(CompareException) as context: - result = op_item_parse(parse_item, parse_op_name, parse_index, parse_item_list, parse_top_bool, depth=11) + op_item_parse(parse_item, parse_op_name, depth=11) self.assertEqual(context.exception.code, CompareException.RECURSION_LIMIT_ERROR) def test_resolve_api_special_parameters(self): @@ -313,12 +417,74 @@ class TestUtilsMethods(unittest.TestCase): self.assertEqual(accuracy_check, '') self.assertEqual(err_msg, '') + def 
test_count_struct_normal(self): + op_dict = { + CompareConst.OP_NAME: ['op1', 'op2', 'op3', 'op4', 'op5', 'op6', 'op7', 'op8'], + CompareConst.INPUT_STRUCT: [("torch.float32", [1]), ("torch.float32", [1])], + CompareConst.OUTPUT_STRUCT: [("torch.float32", [1]), ("torch.float32", [1])], + CompareConst.PARAMS_STRUCT: [("torch.float32", [1]), ("torch.float32", [1])], + CompareConst.PARAMS_GRAD_STRUCT: [("torch.float32", [1]), ("torch.float32", [1])], + } + + result = count_struct(op_dict) + + self.assertEqual(result, (8, 2, 2, 2, 2)) + + @patch('msprobe.core.compare.utils.logger') + def test_mismatch_case(self, mock_logger): + op_dict = { + CompareConst.OP_NAME: ['op1', 'op2', 'op3', 'op4', 'op5', 'op6', 'op7', 'op8'], + CompareConst.INPUT_STRUCT: [("torch.float32", [1])], + CompareConst.OUTPUT_STRUCT: [("torch.float32", [1]), ("torch.float32", [1])], + CompareConst.PARAMS_STRUCT: [("torch.float32", [1]), ("torch.float32", [1])], + CompareConst.PARAMS_GRAD_STRUCT: [("torch.float32", [1]), ("torch.float32", [1])], + } + + with self.assertRaises(CompareException) as context: + count_struct(op_dict) + self.assertEqual(context.exception.code, CompareException.NAMES_STRUCTS_MATCH_ERROR) def test_get_accuracy(self): result = [] get_accuracy(result, npu_dict, bench_dict, dump_mode=Const.SUMMARY) self.assertEqual(result, o_result) + def test_append_stack_info_stack_exist_index_0(self): + result_item = ['item1'] + npu_stack_info = ['stack_info1'] + index = 0 + + append_stack_info(result_item, npu_stack_info, index) + + self.assertEqual(result_item, ['item1', 'stack_info1']) + + def test_append_stack_info_stack_exist_index_not_0(self): + result_item = ['item1'] + npu_stack_info = ['stack_info1'] + index = 1 + + append_stack_info(result_item, npu_stack_info, index) + + self.assertEqual(result_item, ['item1', CompareConst.NONE]) + + def test_append_stack_info_stack_empty_index_0(self): + result_item = ['item1'] + npu_stack_info = [] + index = 0 + + append_stack_info(result_item, 
npu_stack_info, index) + + self.assertEqual(result_item, ['item1', CompareConst.NONE]) + + def test_append_stack_info_stack_empty_index_not_0(self): + result_item = ['item1'] + npu_stack_info = [] + index = 1 + + append_stack_info(result_item, npu_stack_info, index) + + self.assertEqual(result_item, ['item1', CompareConst.NONE]) + def test_get_un_match_accuracy_md5(self): result = [] get_un_match_accuracy(result, npu_dict, dump_mode=Const.MD5) @@ -353,10 +519,9 @@ class TestUtilsMethods(unittest.TestCase): self.assertTrue(args.fuzzy_match) def test_compare_parser_2(self): - test_args = ["-i", "input.json"] - - with self.assertRaises(SystemExit): # argparse 会抛出 SystemExit - self.parser.parse_args(test_args) + self.assertEqual(self.parser.parse_args('-i aaa -o'.split(' ')).output_path, './output') + self.assertEqual(self.parser.parse_args('-i aaa'.split(' ')).output_path, './output') + self.assertEqual(self.parser.parse_args('-i aaa -o ./aaa/output'.split(' ')).output_path, './aaa/output') def test_compare_parser_3(self): test_args = ["-i", "input.json", "-o", "output.json", "-cm", "cell_mapping.txt", "-dm", @@ -367,3 +532,325 @@ class TestUtilsMethods(unittest.TestCase): self.assertIsNone(args.api_mapping) # 默认值应为 None self.assertEqual(args.data_mapping, "data_mapping.txt") self.assertEqual(args.layer_mapping, "layer_mapping.txt") + + def test_stack_column_process_stack_info(self): + result_item = [] + has_stack = True + index = 0 + key = CompareConst.INPUT_STRUCT + npu_stack_info = ['abc'] + result_item = stack_column_process(result_item, has_stack, index, key, npu_stack_info) + self.assertEqual(result_item, ['abc']) + + def test_stack_column_process_None(self): + result_item = [] + has_stack = True + index = 1 + key = CompareConst.INPUT_STRUCT + npu_stack_info = ['abc'] + result_item = stack_column_process(result_item, has_stack, index, key, npu_stack_info) + self.assertEqual(result_item, ['None']) + + def test_result_item_init_all_and_summary(self): + n_name = 
'Tensor.add.0.forward.input.0' + n_struct = ('torch.float32', [96]) + npu_stack_info = ['abc'] + b_name = 'Tensor.add.0.forward.input.0' + b_struct = ('torch.float32', [96]) + bench_stack_info = ['abc'] + n_info = ApiItemInfo(n_name, n_struct, npu_stack_info) + b_info = ApiItemInfo(b_name, b_struct, bench_stack_info) + + dump_mode = Const.ALL + result_item = result_item_init(n_info, b_info, dump_mode) + self.assertEqual(result_item, ['Tensor.add.0.forward.input.0', 'Tensor.add.0.forward.input.0', + 'torch.float32', 'torch.float32', [96], [96], ' ', ' ', ' ', ' ', ' ', ' ']) + + dump_mode = Const.SUMMARY + result_item = result_item_init(n_info, b_info, dump_mode) + self.assertEqual(result_item, ['Tensor.add.0.forward.input.0', 'Tensor.add.0.forward.input.0', + 'torch.float32', 'torch.float32', [96], [96], ' ', ' ', ' ', ' ', ' ', ' ', ' ', + ' ']) + + def test_result_item_init_md5(self): + n_name = 'Tensor.add.0.forward.input.0' + n_struct = ('torch.float32', [96], 'e87000dc') + npu_stack_info = ['abc'] + b_name = 'Tensor.add.0.forward.input.0' + b_struct = ('torch.float32', [96], 'e87000dc') + bench_stack_info = ['abc'] + n_info = ApiItemInfo(n_name, n_struct, npu_stack_info) + b_info = ApiItemInfo(b_name, b_struct, bench_stack_info) + + dump_mode = Const.MD5 + result_item = result_item_init(n_info, b_info, dump_mode) + self.assertEqual(result_item, ['Tensor.add.0.forward.input.0', 'Tensor.add.0.forward.input.0', + 'torch.float32', 'torch.float32', [96], [96], 'e87000dc', 'e87000dc', 'pass']) + + def test_result_item_init_md5_index_error(self): + n_name = 'Tensor.add.0.forward.input.0' + n_struct = ('torch.float32', [96]) + npu_stack_info = ['abc'] + b_name = 'Tensor.add.0.forward.input.0' + b_struct = ('torch.float32', [96]) + bench_stack_info = ['abc'] + n_info = ApiItemInfo(n_name, n_struct, npu_stack_info) + b_info = ApiItemInfo(b_name, b_struct, bench_stack_info) + + dump_mode = Const.MD5 + with self.assertRaises(CompareException) as context: + result_item = 
result_item_init(n_info, b_info, dump_mode) + self.assertEqual(context.exception.code, CompareException.INDEX_OUT_OF_BOUNDS_ERROR) + + def test_table_value_is_valid_int(self): + result = table_value_is_valid(1) + self.assertTrue(result) + + def test_table_value_is_valid_float(self): + result = table_value_is_valid("-1.00") + self.assertTrue(result) + + result = table_value_is_valid("+1.00") + self.assertTrue(result) + + def test_table_value_is_valid_invalid_str(self): + result = table_value_is_valid("=1.00") + self.assertFalse(result) + + +class TestGetNameAndState(unittest.TestCase): + def test_valid_forward_input(self): + name = 'conv2d.forward.1.input.0' + expected_api = 'conv2d.forward.1.' + expected_state = 'input' + self.assertEqual(get_name_and_state(name), (expected_api, expected_state)) + + def test_valid_backward_output(self): + name = 'Functional.pad.0.backward.output.0' + expected_api = 'Functional.pad.0.backward.' + expected_state = 'output' + self.assertEqual(get_name_and_state(name), (expected_api, expected_state)) + + def test_valid_with_kwargs(self): + name = 'layer.norm.2.forward.kwargs.attr' + expected_api = 'layer.norm.2.forward.' + expected_state = 'kwargs' + self.assertEqual(get_name_and_state(name), (expected_api, expected_state)) + + def test_no_numeric_index(self): + name = 'conv2d.forward.input.0' + expected_api = 'conv2d.forward.' 
+        expected_state = 'input'
+        self.assertEqual(get_name_and_state(name), (expected_api, expected_state))
+
+    def test_invalid_state(self):
+        name = 'conv2d.forward.1.invalidstate.0'
+        with self.assertRaises(CompareException) as context:
+            get_name_and_state(name)
+        self.assertIn('Invalid name string', str(context.exception.code))
+
+
+class TestReorderOpNameList(unittest.TestCase):
+    def test_reorder_op_name_list(self):
+        # standard order
+        op_name_list = ["op.forward.input.0.0", "op.forward.output.0", "op.forward.output.1", "op.forward.parameters.1", "op.forward.parameters.2", "op.parameters_grad.0"]
+        result = reorder_op_name_list(op_name_list)
+        expected = ["op.forward.input.0.0", "op.forward.parameters.1", "op.forward.parameters.2", "op.forward.output.0", "op.forward.output.1", "op.parameters_grad.0"]
+        self.assertEqual(result, expected)
+
+        # only input elements
+        op_name_list = ["op.forward.input.0", "op.forward.input.1"]
+        result = reorder_op_name_list(op_name_list)
+        expected = ["op.forward.input.0", "op.forward.input.1"]
+        self.assertEqual(result, expected)
+
+        # empty input
+        op_name_list = []
+        result = reorder_op_name_list(op_name_list)
+        expected = []
+        self.assertEqual(result, expected)
+
+
+class TestReorderOpXList(unittest.TestCase):
+    def test_reorder_op_x_list(self):
+        # standard order
+        op_name_list = ["op.forward.input.0", "op.forward.output.0", "op.forward.parameters.weight"]
+        summary_list = ["summary1", "summary2", "summary3"]
+        data_name_list = ["data1", "data2", "data3"]
+        result_op_name, result_summary, result_data_name = reorder_op_x_list(op_name_list, summary_list, data_name_list)
+        self.assertEqual(result_op_name, ["op.forward.input.0", "op.forward.parameters.weight", "op.forward.output.0"])
+        self.assertEqual(result_summary, ["summary1", "summary3", "summary2"])
+        self.assertEqual(result_data_name, ["data1", "data3", "data2"])
+
+        # empty op_name_list or summary_list
+        op_name_list = []
+        summary_list = []
+        data_name_list = ["data1", "data2", "data3"]
+        result_op_name,
result_summary, result_data_name = reorder_op_x_list(op_name_list, summary_list, data_name_list) + self.assertEqual(result_op_name, []) + self.assertEqual(result_summary, []) + self.assertEqual(result_data_name, ["data1", "data2", "data3"]) + + # 空 data_name_list + op_name_list = ["op.forward.input.0", "op.forward.output.0", "op.forward.parameters.weight"] + summary_list = ["summary1", "summary2", "summary3"] + data_name_list = [] + result_op_name, result_summary, result_data_name = reorder_op_x_list(op_name_list, summary_list, data_name_list) + self.assertEqual(result_op_name, ["op.forward.input.0", "op.forward.parameters.weight", "op.forward.output.0"]) + self.assertEqual(result_summary, ["summary1", "summary3", "summary2"]) + self.assertEqual(result_data_name, []) + + # data_name_list 为 None + op_name_list = ["op.forward.input.0", "op.forward.output.0", "op.forward.parameters.weight"] + summary_list = ["summary1", "summary2", "summary3"] + data_name_list = None + result_op_name, result_summary, result_data_name = reorder_op_x_list(op_name_list, summary_list, data_name_list) + self.assertEqual(result_op_name, ["op.forward.input.0", "op.forward.parameters.weight", "op.forward.output.0"]) + self.assertEqual(result_summary, ["summary1", "summary3", "summary2"]) + self.assertEqual(result_data_name, None) + + +class TestGenOpItem(unittest.TestCase): + def test_gen_op_item_with_data_name(self): + op_data = { + 'data_name': 'test_data', + 'type': 'torch.Tensor', + 'dtype': 'torch.int64', + 'shape': [3], + 'value': [1, 2, 3], + 'Max': 3, + 'Min': 1, + 'Mean': 2, + 'Norm': 2 + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + self.assertEqual(result['data_name'], 'test_data') + self.assertEqual(result['full_op_name'], 'test_data') + self.assertEqual(result['dtype'], 'torch.int64') + self.assertEqual(result['shape'], [3]) + self.assertEqual(result['Max'], 3) + self.assertEqual(result['Min'], 1) + self.assertEqual(result['Mean'], 2) + 
self.assertEqual(result['Norm'], 2) + self.assertEqual(result['md5'], f"{zlib.crc32(str(op_data['value']).encode()):08x}") + + def test_gen_op_item_with_empty_data_name(self): + op_data = { + 'data_name': '', + 'type': 'torch.Tensor', + 'value': [1, 2, 3] + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + # When data_name is empty, it should be set to '-1' + self.assertEqual(result['data_name'], '-1') + self.assertEqual(result['full_op_name'], op_name) + + def test_gen_op_item_with_none_data_name(self): + op_data = { + 'data_name': None, + 'type': 'torch.Tensor', + 'value': [1, 2, 3] + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + # When data_name is None, it should be set to '-1' + self.assertEqual(result['data_name'], '-1') + self.assertEqual(result['full_op_name'], op_name) + + def test_gen_op_item_with_type_torch_size(self): + op_data = { + 'data_name': 'test_data', + 'type': 'torch.Size', + 'value': [2, 3, 4] + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + self.assertEqual(result['dtype'], 'torch.Size') + self.assertEqual(result['shape'], '[2, 3, 4]') + self.assertEqual(result['Max'], None) + self.assertEqual(result['Min'], None) + self.assertEqual(result['Mean'], None) + self.assertEqual(result['Norm'], None) + + def test_gen_op_item_with_type_slice(self): + op_data = { + 'data_name': 'test_data', + 'type': 'slice', + 'value': [1, 2, 3] + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + self.assertEqual(result['dtype'], 'slice') + self.assertEqual(result['shape'], str(np.shape(np.array(op_data['value'])))) + + def test_gen_op_item_with_type_ellipsis(self): + op_data = { + 'data_name': 'test_data', + 'type': 'ellipsis', + 'value': '...'
+ } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + self.assertEqual(result['dtype'], 'ellipsis') + self.assertEqual(result['shape'], '[]') + self.assertEqual(result['Max'], '...') + self.assertEqual(result['Min'], '...') + self.assertEqual(result['Mean'], '...') + self.assertEqual(result['Norm'], '...') + + def test_gen_op_item_with_type_torch_process_group(self): + op_data = { + 'data_name': 'test_data', + 'type': 'torch.ProcessGroup', + 'group_ranks': [0, 1] + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + self.assertEqual(result['dtype'], 'torch.ProcessGroup') + self.assertEqual(result['shape'], '[]') + self.assertEqual(result['Max'], '[0, 1]') + self.assertEqual(result['Min'], '[0, 1]') + self.assertEqual(result['Mean'], '[0, 1]') + self.assertEqual(result['Norm'], '[0, 1]') + + def test_gen_op_item_with_default_dtype(self): + op_data = { + 'data_name': 'test_data', + 'type': 'other_type', + 'value': [1, 2, 3] + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + self.assertEqual(result['dtype'], str(type(op_data['value']))) + self.assertEqual(result['shape'], '[]') + + def test_gen_op_item_with_md5(self): + op_data = { + 'data_name': 'test_data', + 'type': 'torch.Tensor', + 'value': [1, 2, 3] + } + op_name = 'op_test' + + result = gen_op_item(op_data, op_name) + + expected_md5 = f"{zlib.crc32(str(op_data['value']).encode()):08x}" + self.assertEqual(result['md5'], expected_md5) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_highlight.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_highlight.py index 117ae4b31237925fe24754538c54fa2ce2f909aa..3261bce5d6d0a15d8e46c7d9fc22df0cf64c9e4d 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_highlight.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_highlight.py @@ -1,15 +1,22 @@ # coding=utf-8 -import unittest import os import shutil -import pandas as pd +import sys +from 
collections import namedtuple +import unittest +from unittest.mock import patch + import numpy as np +import pandas as pd import openpyxl from openpyxl import load_workbook from openpyxl.styles import PatternFill -from collections import namedtuple -from msprobe.core.compare.highlight import CheckMaxRelativeDiff, highlight_rows_xlsx, csv_value_is_valid + + from msprobe.core.common.const import CompareConst, Const +from msprobe.core.compare.highlight import ApiBatch, CheckMaxRelativeDiff, CheckOrderMagnitude, \ + CheckOneThousandErrorRatio, CheckCosineSimilarity, add_highlight_row_info, compare_result_df_convert, \ + df_malicious_value_check, find_error_rows, highlight_rows_xlsx, update_highlight_err_msg, value_check base_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), f'test_highlight') @@ -19,7 +26,7 @@ def generate_result_xlsx(base_dir): data_path = os.path.join(base_dir, 'target_result.xlsx') data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], - '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] ] columns = CompareConst.COMPARE_RESULT_HEADER + ['Data_name'] result_df = pd.DataFrame(data, columns=columns) @@ -34,6 +41,18 @@ def generate_result_xlsx(base_dir): cell.fill = red_fill wb.save(data_path) + data_path_yellow = os.path.join(base_dir, 'target_result_yellow.xlsx') + result_df.to_excel(data_path_yellow, index=False, sheet_name='Sheet') + wb = load_workbook(data_path_yellow) + ws = wb.active + yellow_fill = PatternFill(start_color=CompareConst.YELLOW, end_color=CompareConst.YELLOW, fill_type='solid') + for row_index, row in enumerate(ws.iter_rows()): + if row_index == 0: + continue + for cell in row: + cell.fill = yellow_fill + wb.save(data_path_yellow) + def compare_excel_files_with_highlight(file1, file2): wb1 = openpyxl.load_workbook(file1) @@ -70,7 +89,53 @@ class 
TestUtilsMethods(unittest.TestCase): if os.path.exists(base_dir): shutil.rmtree(base_dir) - def test_CheckMaxRelativeDiff_1(self): + def test_CheckOrderMagnitude_normal(self): + api_in = [1, 1, 1, 1, 1, 1, 5, 1, 1] + api_out = [1, 1, 1, 1, 1, 1, 1, 1, 1] + info = (api_in, api_out, 1) + color_columns = () + dump_mode = Const.SUMMARY + + result = CheckOrderMagnitude().apply(info, color_columns, dump_mode) + + self.assertEqual(result, None) + + def test_CheckOneThousandErrorRatio_str(self): + api_in = [1, 1, 1, 1, 1, 1, 0.9, 0.5, 1, 1, "unsupported"] + api_out = [1, 1, 1, 1, 1, 1, 0.9, 0.5, 1, 1, "unsupported"] + info = (api_in, api_out, 1) + color_columns = () + dump_mode = Const.ALL + + result = CheckOneThousandErrorRatio().apply(info, color_columns, dump_mode) + + self.assertEqual(result, None) + + @patch("msprobe.core.compare.highlight.add_highlight_row_info") + def test_CheckOneThousandErrorRatio_red(self, mock_add_highlight_row_info): + api_in = [1, 1, 1, 1, 1, 1, 0.9, 0.5, 1, 1, 1] + api_out = [1, 1, 1, 1, 1, 1, 0.9, 0.5, 1, 1, 0.5] + info = (api_in, api_out, 1) + ColorColumns = namedtuple('ColorColumns', ['red', 'yellow']) + color_columns = ColorColumns(red=[], yellow=[]) + dump_mode = Const.ALL + + CheckOneThousandErrorRatio().apply(info, color_columns, dump_mode) + + mock_add_highlight_row_info.assert_called_once() + + def test_CheckCosineSimilarity_str(self): + api_in = [1, 1, 1, 1, 1, 1, "unsupported", 1, 1, "unsupported"] + api_out = [1, 1, 1, 1, 1, 1, "unsupported", 1, 1, "unsupported"] + info = (api_in, api_out, 1) + color_columns = () + dump_mode = Const.ALL + + result = CheckCosineSimilarity().apply(info, color_columns, dump_mode) + + self.assertEqual(result, None) + + def test_CheckMaxRelativeDiff_red(self): ColorColumns = namedtuple('ColorColumns', ['red', 'yellow']) red_lines, yellow_lines = [], [] @@ -81,11 +146,11 @@ class TestUtilsMethods(unittest.TestCase): num = 1 info = (api_in, api_out, num) CheckMaxRelativeDiff().apply(info, color_columns, 
dump_mode=Const.SUMMARY) - red_lines, yellow_lines = [1], [] + red_lines, yellow_lines = [(1, ["maximum relative error exceeds 0.5"])], [] target_color_columns = ColorColumns(red=red_lines, yellow=yellow_lines) self.assertEqual(color_columns, target_color_columns) - def test_CheckMaxRelativeDiff_2(self): + def test_CheckMaxRelativeDiff_yellow(self): ColorColumns = namedtuple('ColorColumns', ['red', 'yellow']) red_lines, yellow_lines = [], [] @@ -96,11 +161,11 @@ class TestUtilsMethods(unittest.TestCase): num = 1 info = (api_in, api_out, num) CheckMaxRelativeDiff().apply(info, color_columns, dump_mode=Const.SUMMARY) - red_lines, yellow_lines = [], [1] + red_lines, yellow_lines = [], [(1, ["The output's maximum relative error exceeds 0.1, while the input/parameters's is below 0.01"])] target_color_columns = ColorColumns(red=red_lines, yellow=yellow_lines) self.assertEqual(color_columns, target_color_columns) - def test_CheckMaxRelativeDiff_3(self): + def test_CheckMaxRelativeDiff_other_type(self): ColorColumns = namedtuple('ColorColumns', ['red', 'yellow']) red_lines, yellow_lines = [], [] @@ -113,10 +178,158 @@ class TestUtilsMethods(unittest.TestCase): result = CheckMaxRelativeDiff().apply(info, color_columns, dump_mode=Const.SUMMARY) self.assertEqual(result, None) - def test_highlight_rows_xlsx_1(self): + def test_find_error_rows_normal(self): + compare_result = np.array([ + ["Functional.linear.0.forward.input.0", "Functional.linear.0.forward.input.0", + "torch.float32", "torch.float32", [2, 2], [2, 2], 0.0, 0.0, 0.0, 0.0, "0.0%", "0.0%", "0.0%", "0.0%", + 1, 1, 1, 1, 1, 1, 1, 1, "", ""], + ["Functional.linear.0.forward.input.1", "Functional.linear.0.forward.input.1", + "torch.float32", "torch.float32", [2, 2], [2, 2], 0.0, 0.0, 0.0, 0.0, "0.0%", "0.0%", "0.0%", "0.0%", + 1, 1, 1, 1, 1, 1, 1, 1, "", ""], + ["Functional.linear.0.forward.input.2", "Functional.linear.0.forward.input.2", + "torch.float32", "torch.float32", [2], [2], 0.0, 0.0, 0.0, 0.0, "0.0%", "0.0%", 
"0.0%", "0.0%", + 1, 1, 1, 1, 1, 1, 1, 1, "", ""], + ["Functional.linear.0.forward.output.0", "Functional.linear.0.forward.output.0", + "torch.float32", "torch.float32", [2, 2], [2, 2], 0.0, 0.0, 0.0, 0.0, "0.0%", "0.0%", "0.0%", "0.0%", + 1, 1, 1, 1, 1, 1, 1, 1, "", ""], + ], dtype=object) + api_batch = ApiBatch("Functional.linear.0.forward", 0) + api_batch.input_len = 3 + api_batch.output_end_index = 4 + api_batch.params_end_index = 4 + highlight_dict = {"red_lines": [], "red_rows": set(), "yellow_lines": [], "yellow_rows": set()} + dump_mode = Const.ALL + + find_error_rows(compare_result, api_batch, highlight_dict, dump_mode) + + self.assertEqual(highlight_dict, {"red_lines": [], "red_rows": set(), "yellow_lines": [], "yellow_rows": set()}) + + def test_find_error_rows_md5(self): + compare_result = [] + api_batch = ApiBatch("", 0) + api_batch.input_len = 0 + api_batch.output_end_index = 1 + api_batch.params_end_index = 1 + highlight_dict = {} + dump_mode = Const.MD5 + + result = find_error_rows(compare_result, api_batch, highlight_dict, dump_mode) + + self.assertEqual(result, None) + + def test_ApiBatch_increment_input(self): + api_name = "functional.conv2d" + start = 2 + api_batch = ApiBatch(api_name, start) + + api_batch.increment(Const.INPUT) + + self.assertEqual(api_batch._state, Const.INPUT) + self.assertEqual(api_batch.input_len, 2) + self.assertEqual(api_batch.params_end_index, 4) + self.assertEqual(api_batch.output_end_index, 4) + self.assertEqual(api_batch.params_grad_end_index, 4) + + def test_ApiBatch_increment_output(self): + api_name = "functional.conv2d" + start = 2 + api_batch = ApiBatch(api_name, start) + + api_batch.increment(Const.OUTPUT) + + self.assertEqual(api_batch._state, Const.OUTPUT) + self.assertEqual(api_batch.input_len, 1) + self.assertEqual(api_batch.params_end_index, 3) + self.assertEqual(api_batch.output_end_index, 4) + self.assertEqual(api_batch.params_grad_end_index, 4) + + def test_ApiBatch_increment_kwargs(self): + api_name = 
"functional.conv2d" + start = 2 + api_batch = ApiBatch(api_name, start) + + api_batch.increment(Const.KWARGS) + + self.assertEqual(api_batch._state, Const.KWARGS) + self.assertEqual(api_batch.input_len, 2) + self.assertEqual(api_batch.params_end_index, 4) + self.assertEqual(api_batch.output_end_index, 4) + self.assertEqual(api_batch.params_grad_end_index, 4) + + def test_ApiBatch_increment_params(self): + api_name = "functional.conv2d" + start = 2 + api_batch = ApiBatch(api_name, start) + + api_batch.increment(Const.PARAMS) + + self.assertEqual(api_batch._state, Const.PARAMS) + self.assertEqual(api_batch.input_len, 1) + self.assertEqual(api_batch.params_end_index, 4) + self.assertEqual(api_batch.output_end_index, 4) + self.assertEqual(api_batch.params_grad_end_index, 4) + + def test_ApiBatch_increment_multiple_input(self): + api_name = "functional.conv2d" + start = 2 + api_batch = ApiBatch(api_name, start) + + api_batch.increment(Const.INPUT) + api_batch.increment(Const.INPUT) + + self.assertEqual(api_batch._state, Const.INPUT) + self.assertEqual(api_batch.input_len, 3) + self.assertEqual(api_batch.params_end_index, 5) + self.assertEqual(api_batch.output_end_index, 5) + self.assertEqual(api_batch.params_grad_end_index, 5) + + def test_ApiBatch_increment_multiple_output(self): + api_name = "functional.conv2d" + start = 2 + api_batch = ApiBatch(api_name, start) + + api_batch.increment(Const.OUTPUT) + api_batch.increment(Const.OUTPUT) + + self.assertEqual(api_batch._state, Const.OUTPUT) + self.assertEqual(api_batch.input_len, 1) + self.assertEqual(api_batch.params_end_index, 3) + self.assertEqual(api_batch.output_end_index, 5) + self.assertEqual(api_batch.params_grad_end_index, 5) + + @patch("msprobe.core.compare.highlight.logger") + def test_value_check(self, mock_logger): + value = "=functional.conv2d" + api_name = "=functional.conv2d" + i = 1 + result_df_columns = CompareConst.COMPARE_RESULT_HEADER + + value_check(value, api_name, i, result_df_columns) + + 
mock_logger.error.assert_called_once_with( + "Malicious value [=functional.conv2d] at api_name [=functional.conv2d], column [Bench Name], " + "is not allowed to be written into the compare result xlsx." + ) + + def test_df_malicious_value_check(self): + columns = CompareConst.COMPARE_RESULT_HEADER + data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', ''] + ] + result_df = pd.DataFrame(data, columns=columns) + + df_malicious_value_check(result_df, columns) + + def test_compare_result_df_convert(self): + value = float("nan") + result = compare_result_df_convert(value) + self.assertEqual(result, "nan\t") + + def test_highlight_rows_xlsx_red(self): data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], - '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] ] columns = CompareConst.COMPARE_RESULT_HEADER + ['Data_name'] result_df = pd.DataFrame(data, columns=columns) @@ -126,47 +339,132 @@ class TestUtilsMethods(unittest.TestCase): generate_result_xlsx(base_dir) self.assertTrue(compare_excel_files_with_highlight(file_path, os.path.join(base_dir, 'target_result.xlsx'))) - def test_highlight_rows_xlsx_2(self): + def test_highlight_rows_xlsx_yellow(self): data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], - '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] + ] + columns = CompareConst.COMPARE_RESULT_HEADER + ['Data_name'] + result_df = pd.DataFrame(data, columns=columns) + highlight_dict = {'yellow_rows': [0]} + file_path = os.path.join(base_dir, 'result.xlsx') + highlight_rows_xlsx(result_df, highlight_dict, 
file_path) + generate_result_xlsx(base_dir) + self.assertTrue(compare_excel_files_with_highlight(file_path, os.path.join(base_dir, 'target_result_yellow.xlsx'))) + + @patch("msprobe.core.compare.highlight.save_workbook") + def test_highlight_rows_xlsx_malicious_columns(self, mock_save_book): + data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] ] columns = CompareConst.COMPARE_RESULT_HEADER + ['=Data_name'] result_df = pd.DataFrame(data, columns=columns) highlight_dict = {} file_path = base_dir - with self.assertRaises(RuntimeError) as context: - highlight_rows_xlsx(result_df, highlight_dict, file_path) - self.assertIn("Malicious value", str(context.exception)) - def test_highlight_rows_xlsx_3(self): + temp_output_file = 'temp_output.txt' + sys.stdout = open(temp_output_file, 'w') + + highlight_rows_xlsx(result_df, highlight_dict, file_path) + + with open(temp_output_file, 'r') as f: + output = f.read() + os.remove(temp_output_file) + + self.assertIn('Malicious value [=Data_name]', output) + + @patch("msprobe.core.compare.highlight.save_workbook") + def test_highlight_rows_xlsx_malicious_type(self, mock_save_book): data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', '=torch.float32', 'torch.float32', [2, 2], [2, 2], - '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'], + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'], ['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', '=torch.float32', 'torch.float32', [2, 2], [2, 2], - '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] ] columns = CompareConst.COMPARE_RESULT_HEADER + ['Data_name'] result_df = pd.DataFrame(data, columns=columns) highlight_dict = {'red_rows': [], 'yellow_rows': []} file_path = 
base_dir - with self.assertRaises(RuntimeError) as context: - highlight_rows_xlsx(result_df, highlight_dict, file_path) - self.assertIn("Malicious value", str(context.exception)) - def test_csv_value_is_valid_1(self): - result = csv_value_is_valid(1) - self.assertTrue(result) + temp_output_file = 'temp_output.txt' + sys.stdout = open(temp_output_file, 'w') - def test_csv_value_is_valid_2(self): - result = csv_value_is_valid("-1.00") - self.assertTrue(result) + highlight_rows_xlsx(result_df, highlight_dict, file_path) - result = csv_value_is_valid("+1.00") - self.assertTrue(result) + with open(temp_output_file, 'r') as f: + output = f.read() + os.remove(temp_output_file) - def test_csv_value_is_valid_3(self): - result = csv_value_is_valid("=1.00") - self.assertFalse(result) + self.assertIn('Malicious value [=torch.float32]', output) + def test_add_highlight_row_info_existing(self): + color_list = [(1, ["a", "b"]), (5, ["c"])] + num = 5 + highlight_err_msg = "highlight" + add_highlight_row_info(color_list, num, highlight_err_msg) + self.assertEqual(color_list, [(1, ["a", "b"]), (5, ["c", "highlight"])]) + + def test_add_highlight_row_info_new(self): + color_list = [(1, ["a", "b"]), (5, ["c"])] + num = 6 + highlight_err_msg = "highlight" + add_highlight_row_info(color_list, num, highlight_err_msg) + self.assertEqual(color_list, [(1, ["a", "b"]), (5, ["c"]), (6, ["highlight"])]) + + def test_update_highlight_err_msg(self): + data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'], + ['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1'] + ] + columns = CompareConst.COMPARE_RESULT_HEADER + ['Data_name'] + result_df = pd.DataFrame(data, columns=columns) + highlight_dict = { + 
'red_rows': set([0]), + 'yellow_rows': {0, 1}, + 'red_lines': [(0, ['a', 'b'])], + 'yellow_lines': [(0, ['c']), (1, ['d'])] + } + update_highlight_err_msg(result_df, highlight_dict) + + t_data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', 'a\nb', '-1'], + ['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', 'd', '-1'] + ] + target_result_df = pd.DataFrame(t_data, columns=columns) + self.assertTrue(result_df.equals(target_result_df)) + + def test_update_highlight_err_msg_md5(self): + data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', + 'torch.float32', 'torch.float32', [2, 2], [2, 2], 'abc', 'abc', 'pass'] + ] + columns = CompareConst.MD5_COMPARE_RESULT_HEADER + result_df = pd.DataFrame(data, columns=columns) + highlight_dict = {} + + result = update_highlight_err_msg(result_df, highlight_dict) + + self.assertEqual(result, None) + + def test_update_highlight_err_msg_fail(self): + data = [ + ['err_msg1'], + ['err_msg2'] + ] + columns = ['Err_message'] + result_df = pd.DataFrame(data, columns=columns) + highlight_dict = { + 'red_rows': set([0]), + 'yellow_rows': {0, 1}, + 'red_lines': [(0, ['a', 'b'])], + 'yellow_lines': [(0, ['c']), (1, ['d'])] + } + result = update_highlight_err_msg(result_df, highlight_dict) + self.assertEqual(result, None) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_multiprocessing_compute.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_multiprocessing_compute.py index 1e10af4ddfbb8e20a0a878861703a9c5b80808d8..3fa16b0d9d487250a7a8d9ec97b5572d3c0b387a 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_multiprocessing_compute.py +++ 
b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_cmp_multiprocessing_compute.py @@ -1,44 +1,60 @@ # coding=utf-8 -import unittest -import threading -import pandas as pd import multiprocessing -from msprobe.core.compare.multiprocessing_compute import _handle_multi_process, read_dump_data, ComparisonResult, \ - _save_cmp_result, check_accuracy -from msprobe.core.compare.acc_compare import Comparator -from msprobe.core.common.const import CompareConst -from msprobe.core.common.utils import CompareException +import os +import shutil +import threading +import unittest +import pandas as pd +from msprobe.core.common.const import CompareConst, Const +from msprobe.core.common.utils import CompareException +from msprobe.core.compare.acc_compare import Comparator, ModeConfig +from msprobe.core.compare.multiprocessing_compute import ComparisonResult, _handle_multi_process, _save_cmp_result, \ + check_accuracy, read_dump_data +from test_acc_compare import generate_dump_json data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], - '', '', '', '', '', + '', '', '', '', '', '', 1, 1, 1, 1, 1, 1, 1, 1, 'Yes', '', '-1']] o_data = [['Functional.linear.0.forward.input.0', 'Functional.linear.0.forward.input.0', 'torch.float32', 'torch.float32', [2, 2], [2, 2], - 'None', 'None', 'None', 'None', 'None', + 'unsupported', 'unsupported', 'unsupported', 'unsupported', 'unsupported', 'unsupported', 1, 1, 1, 1, 1, 1, 1, 1, 'None', 'No bench data matched.', '-1']] columns = CompareConst.COMPARE_RESULT_HEADER + ['Data_name'] result_df = pd.DataFrame(data, columns=columns) o_result = pd.DataFrame(o_data, columns=columns) +base_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), f'test_cmp_multiprocessing_compute') class TestUtilsMethods(unittest.TestCase): def setUp(self): self.result_df = pd.DataFrame(columns=[ - CompareConst.COSINE, CompareConst.MAX_ABS_ERR, CompareConst.MAX_RELATIVE_ERR, - 
CompareConst.ERROR_MESSAGE, CompareConst.ACCURACY, - CompareConst.ONE_THOUSANDTH_ERR_RATIO, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO + CompareConst.COSINE, CompareConst.EUC_DIST, CompareConst.MAX_ABS_ERR, CompareConst.MAX_RELATIVE_ERR, + CompareConst.ONE_THOUSANDTH_ERR_RATIO, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO, + CompareConst.ACCURACY, CompareConst.ERROR_MESSAGE ]) + os.makedirs(base_dir, mode=0o750, exist_ok=True) self.lock = threading.Lock() + def tearDown(self): + if os.path.exists(base_dir): + shutil.rmtree(base_dir) + def test_handle_multi_process(self): - func = Comparator().compare_ops - input_parma = {} + stack_mode = False + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + func = Comparator(mode_config).compare_ops + generate_dump_json(base_dir) + input_parma = {'bench_json_path': os.path.join(base_dir, 'dump.json')} lock = multiprocessing.Manager().RLock() result = _handle_multi_process(func, input_parma, result_df, lock) self.assertTrue(result.equals(o_result)) @@ -56,9 +72,10 @@ class TestUtilsMethods(unittest.TestCase): cos_result=[0.99, 0.98], max_err_result=[0.01, 0.02], max_relative_err_result=[0.001, 0.002], - err_msgs=['', 'Error in comparison'], + euc_dist_result=[0.5, 0.49], one_thousand_err_ratio_result=[0.1, 0.2], - five_thousand_err_ratio_result=[0.05, 0.1] + five_thousand_err_ratio_result=[0.05, 0.1], + err_msgs=['', 'Error in comparison'] ) offset = 0 updated_df = _save_cmp_result(offset, comparison_result, self.result_df, self.lock) @@ -72,9 +89,10 @@ class TestUtilsMethods(unittest.TestCase): cos_result=[0.99], max_err_result=[], max_relative_err_result=[0.001], - err_msgs=[''], + euc_dist_result=[0.5], one_thousand_err_ratio_result=[0.1], - five_thousand_err_ratio_result=[0.05] + five_thousand_err_ratio_result=[0.05], + err_msgs=[''] ) with self.assertRaises(CompareException) as context: _save_cmp_result(0, comparison_result, 
self.result_df, self.lock) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_common_data_scope_parser.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_common_data_scope_parser.py new file mode 100644 index 0000000000000000000000000000000000000000..ee9c7ac866cdd80bfa83bf61434b8294fdce62f2 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_common_data_scope_parser.py @@ -0,0 +1,34 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +from unittest import TestCase +from msprobe.core.compare.layer_mapping.data_scope_parser import DumpDataItem +from msprobe.core.common.utils import CompareException + + +class TestDataScopeParser(TestCase): + + def test_check_stack_valid_invalid_stack_type(self): + stack_info_string = "conv1.Conv2d.forward.input" + with self.assertRaises(CompareException) as context: + DumpDataItem.check_stack_valid(stack_info_string) + self.assertEqual(context.exception.code, CompareException.INVALID_DATA_ERROR) + + def test_check_stack_valid_invalid_stack_info(self): + stack_info_list = ["conv1.Conv2d.forward.input", 1] + with self.assertRaises(CompareException) as context: + DumpDataItem.check_stack_valid(stack_info_list) + self.assertEqual(context.exception.code, CompareException.INVALID_DATA_ERROR) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_common_layer_mapping.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_common_layer_mapping.py new file mode 100644 index 0000000000000000000000000000000000000000..c51bd8be2422652f2eb0f219606969fae667ba05 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_common_layer_mapping.py @@ -0,0 +1,25 @@ +import os +import unittest +from pathlib import Path + +from msprobe.core.compare.layer_mapping import ( + generate_api_mapping_by_layer_mapping, + generate_data_mapping_by_layer_mapping) + + +class TestLayerMapping(unittest.TestCase): + def setUp(self): + self.base_test_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) + self.input_dir = os.path.join(self.base_test_dir, 'resources', 'layer_mapping') + self.npu_dump_json = os.path.join(self.input_dir, 'mindspore', 'dump.json') + self.bench_dump_json = os.path.join(self.input_dir, 'pytorch', 'dump.json') + self.layer_mapping = os.path.join(self.input_dir, 'layer_mapping.yaml') + + def test_generate_api_mapping_by_layer_mapping(self): + # Example test to check if construct.json is processed correctly + 
res = generate_api_mapping_by_layer_mapping(self.npu_dump_json, self.bench_dump_json, self.layer_mapping) + expected_api_mapping = { + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": + "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + } + self.assertDictEqual(res, expected_api_mapping) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result.py new file mode 100644 index 0000000000000000000000000000000000000000..a0b023bf7ead42e4048f3130f7cd4ba91950364b --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result.py @@ -0,0 +1,400 @@ +# coding=utf-8 +""" +# Copyright (C) 2024-2025. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+""" +import unittest +import multiprocessing +from unittest.mock import patch, MagicMock + +import pandas as pd + +from msprobe.core.common.const import Const, CompareConst +from msprobe.core.compare.merge_result.merge_result import check_compare_result_name, reorder_path, get_result_path, \ + get_dump_mode, check_index_dump_mode_consistent, extract_api_full_name, search_api_index_result, \ + table_value_check, result_process, handle_multi_process, generate_result_df, generate_merge_result, df_merge, \ + initialize_compare_index, merge_result + + +class TestUtilsMethods(unittest.TestCase): + def setUp(self): + self.api_list = ['api1', 'api2'] + self.compare_index_list = ['index1', 'index2'] + self.all_compare_index_list_list = [['index1', 'index2']] + self.result_df = pd.DataFrame({ + CompareConst.NPU_NAME: ['api1', 'api2'], + 'index1': [100, 200], + 'index2': [300, 400] + }) + + self.compare_index_dict = {} + + self.result_df_1 = pd.DataFrame({ + CompareConst.NPU_NAME: ['api1', 'api2'], + 'index1': [100, 200], + 'index2': [300, 400] + }) + self.result_df_2 = pd.DataFrame({ + CompareConst.NPU_NAME: ['api1', 'api2'], + 'index1': [150, 250], + 'index2': [350, 450] + }) + + self.compare_result_path_list = [ + "/path/to/compare_result_rank1-rank1.xlsx", + "/path/to/compare_result_rank2-rank2.xlsx" + ] + + self.all_compare_index_dict_list = [ + [ + { + 'index1': {'api1': {1: 100}, + 'api2': {1: 200} + }, + 'index2': {'api1': {1: 300}, + 'api2': {1: 400} + } + } + ], + [ + { + 'index1': {'api1': {2: 500}, + 'api2': {2: 600} + }, + 'index2': {'api1': {2: 700}, + 'api2': {2: 800} + } + } + ] + ] + + self.all_rank_num_list = [[1], [2]] + + def test_check_compare_result_name_multi_rank_pattern(self): + valid_name = "compare_result_rank1-rank1_20240101010101.xlsx" + + result = check_compare_result_name(valid_name) + self.assertTrue(result) + + @patch('msprobe.core.compare.merge_result.merge_result.logger') + def test_single_rank_pattern_single_rank_pattern(self, 
mock_logger): + valid_name = "compare_result_rank-rank_20240101010101.xlsx" + + result = check_compare_result_name(valid_name) + self.assertFalse(result) + mock_logger.warning.assert_called_once_with("Single rank compare result do not need to be merged.") + + def test_reorder_path(self): + paths = [ + "/path/to/compare_result_rank3-rank3_20240101010101.xlsx", + "/path/to/compare_result_rank1-rank1_20240101010101.xlsx", + "/path/to/compare_result_rank2-rank2_20240101010101.xlsx", + ] + expected_order = [ + "/path/to/compare_result_rank1-rank1_20240101010101.xlsx", + "/path/to/compare_result_rank2-rank2_20240101010101.xlsx", + "/path/to/compare_result_rank3-rank3_20240101010101.xlsx", + ] + result = reorder_path(paths) + self.assertEqual(result, expected_order) + + @patch("os.listdir") + @patch("os.path.join") + @patch("msprobe.core.compare.merge_result.merge_result.check_compare_result_name") + @patch("msprobe.core.compare.merge_result.merge_result.FileChecker") + @patch("msprobe.core.compare.merge_result.merge_result.reorder_path") + def test_get_result_path_valid_files(self, mock_reorder_path, mock_file_checker, mock_check_name, mock_join, + mock_listdir): + mock_listdir.return_value = [ + "/path/to/compare_result_rank2-rank2_20240101010101.xlsx", + "/path/to/compare_result_rank1-rank1_20240101010101.xlsx", + "/path/to/compare_result_rank3-rank3_20240101010101.xlsx" + ] + mock_join.side_effect = lambda dir, name: f"{dir}/{name}" + mock_check_name.return_value = True + mock_file_checker.return_value.common_check.side_effect = lambda: True + mock_reorder_path.return_value = [ + "/mock_dir/path/to/compare_result_rank1-rank1_20240101010101.xlsx", + "/mock_dir/path/to/compare_result_rank2-rank2_20240101010101.xlsx", + "/mock_dir/path/to/compare_result_rank3-rank3_20240101010101.xlsx" + ] + + input_dir = "/mock_dir" + result = get_result_path(input_dir) + + expected_result = [ + "/mock_dir/path/to/compare_result_rank1-rank1_20240101010101.xlsx", + 
"/mock_dir/path/to/compare_result_rank2-rank2_20240101010101.xlsx", + "/mock_dir/path/to/compare_result_rank3-rank3_20240101010101.xlsx" + ] + self.assertEqual(result, expected_result) + mock_file_checker.assert_called() + mock_reorder_path.assert_called_once() + + def test_get_dump_mode_all_mode(self): + header = CompareConst.COMPARE_RESULT_HEADER + [CompareConst.DATA_NAME] + result_df = pd.DataFrame(columns=header) + + result = get_dump_mode(result_df, rank_num=1) + self.assertEqual(result, Const.ALL) + + def test_get_dump_mode_summary_mode(self): + header = CompareConst.SUMMARY_COMPARE_RESULT_HEADER + result_df = pd.DataFrame(columns=header) + + result = get_dump_mode(result_df, rank_num=2) + self.assertEqual(result, Const.SUMMARY) + + def test_get_dump_mode_md5_mode(self): + header = CompareConst.MD5_COMPARE_RESULT_HEADER + result_df = pd.DataFrame(columns=header) + + result = get_dump_mode(result_df, rank_num=3) + self.assertEqual(result, Const.MD5) + + @patch("msprobe.core.compare.merge_result.merge_result.logger") + def test_check_index_dump_mode_consistent_md5(self, mock_logger): + result = check_index_dump_mode_consistent(Const.MD5, rank_num=1) + + self.assertEqual(result, []) + mock_logger.warning.assert_called_once_with( + "Rank1 compare result is 'md5' dump task and does not support merging result, please " + "check! The compare result will not be shown in merged result." 
+ ) + + def test_check_index_dump_mode_consistent_valid_compare_index_subset(self): + config = {"compare_index": ["Cosine", "MaxAbsErr"]} + initialize_compare_index(config) + + result = check_index_dump_mode_consistent(Const.ALL, rank_num=2) + + compare_index_list = ["Cosine", "MaxAbsErr"] + self.assertEqual(result, compare_index_list) + + def test_extract_api_full_name_all_apis_found(self): + api_list = ["api1", "api2"] + result_df = pd.DataFrame({ + CompareConst.NPU_NAME: ["api1.forward", "api2.forward", "api11.forward"] + }) + rank_num = 1 + + result = extract_api_full_name(api_list, result_df, rank_num) + expected = ["api1.forward", "api2.forward"] + self.assertEqual(result, expected) + + @patch("msprobe.core.compare.merge_result.merge_result.table_value_check") + @patch("msprobe.core.compare.merge_result.merge_result.extract_api_full_name") + def test_search_api_index_result(self, mock_extract_api_full_name, mock_table_value_check): + mock_extract_api_full_name.return_value = self.api_list + mock_table_value_check.return_value = None + result = search_api_index_result( + self.api_list, + self.compare_index_list, + self.result_df, + 1, # rank_num + self.compare_index_dict + ) + + expected_result = { + 'index1': { + 'api1': {1: 100}, + 'api2': {1: 200}, + }, + 'index2': { + 'api1': {1: 300}, + 'api2': {1: 400}, + } + } + + self.assertEqual(result, expected_result) + mock_table_value_check.assert_any_call('api1') + mock_table_value_check.assert_any_call('api2') + mock_extract_api_full_name.assert_called_with(self.api_list, self.result_df, 1) + + @patch("msprobe.core.compare.merge_result.merge_result.table_value_is_valid") + def test_table_value_check_invalid_value(self, mock_table_value_is_valid): + mock_table_value_is_valid.return_value = False + value = "invalid_value" + + with self.assertRaises(RuntimeError) as context: + table_value_check(value) + + self.assertEqual( + str(context.exception), + f"Malicious value [{value}] is not allowed to be written into the 
merged xlsx." + ) + mock_table_value_is_valid.assert_called_once_with(value) + + @patch('msprobe.core.compare.merge_result.merge_result.read_xlsx') + @patch('msprobe.core.compare.merge_result.merge_result.get_dump_mode') + @patch('msprobe.core.compare.merge_result.merge_result.check_index_dump_mode_consistent') + @patch('msprobe.core.compare.merge_result.merge_result.search_api_index_result') + @patch('msprobe.core.compare.merge_result.merge_result.logger') + def test_result_process(self, mock_logger, mock_search_api_index_result, mock_check_index_dump_mode_consistent, + mock_get_dump_mode, mock_read_xlsx): + + mock_read_xlsx.side_effect = [self.result_df_1, self.result_df_2] + mock_get_dump_mode.side_effect = ["mode1", "mode1"] + mock_check_index_dump_mode_consistent.return_value = self.compare_index_list + mock_search_api_index_result.return_value = { + "index1": + {"api1": {1: 100}, "api2": {1: 200}}, + "index2": + {"api1": {1: 300}, "api2": {1: 400}} + } + config = {"compare_index": ["index1", "index2"]} + initialize_compare_index(config) + + compare_index_dict_list, rank_num_list, compare_index_list = result_process(self.compare_result_path_list, + self.api_list) + + self.assertEqual(len(compare_index_dict_list), 2) + self.assertEqual(len(rank_num_list), 2) + self.assertEqual(rank_num_list, [1, 2]) + self.assertEqual(compare_index_list, ['index1', 'index2']) + + mock_logger.info.assert_any_call("Parsing rank1 compare result...") + mock_logger.warning.assert_not_called() + + expected_dict = { + "index1": {"api1": {1: 100}, "api2": {1: 200}}, + "index2": {"api1": {1: 300}, "api2": {1: 400}}, + } + self.assertEqual(compare_index_dict_list[0], expected_dict) + self.assertEqual(compare_index_dict_list[1], expected_dict) + + @patch('multiprocessing.Pool') + def test_handle_multi_process(self, mock_pool): + mock_pool_instance = MagicMock() + mock_pool.return_value = mock_pool_instance + mock_result = MagicMock() + mock_result.get.return_value = ([{'index1': {'api1': 
{1: 100}}}], [1], [['index1']]) + mock_pool_instance.apply_async.return_value = mock_result + + compare_result_path_list = ['/path/to/compare_result_rank1-rank1.xlsx'] + + config = {"compare_index": ["index1", "index2"]} + initialize_compare_index(config) + + func_args = (compare_result_path_list, self.api_list) + lock = multiprocessing.Manager().RLock() + + all_compare_index_dict_list, all_rank_num_list, all_compare_index_list_list = handle_multi_process(result_process, func_args, lock) + + self.assertEqual(all_compare_index_dict_list, [[{'index1': {'api1': {1: 100}}}]]) + self.assertEqual(all_rank_num_list, [[1]]) + self.assertEqual(mock_pool_instance.apply_async.call_count, 1) + + def test_generate_result_df_valid_input(self): + api_index_dict = { + "api_full_name1": {"rank1": 100}, + "api_full_name2": {"rank1": 200}, + } + header = ["API Full Name", "rank1"] + + result_df = generate_result_df(api_index_dict, header) + + expected_data = [ + ["api_full_name1", 100], + ["api_full_name2", 200], + ] + expected_df = pd.DataFrame(expected_data, columns=header, dtype="object") + pd.testing.assert_frame_equal(result_df, expected_df) + + @patch('msprobe.core.compare.merge_result.merge_result.logger') + @patch('msprobe.core.compare.merge_result.merge_result.save_excel') + @patch("os.path.join") + @patch('msprobe.core.compare.merge_result.merge_result.add_time_with_xlsx') + def test_generate_merge_result(self, mock_add_time_with_xlsx, mock_join, mock_save_excel, mock_logger): + mock_add_time_with_xlsx.return_value = "multi_ranks_compare_merge_20240101010101.xlsx" + mock_join.return_value = "/path/to/multi_ranks_compare_merge_20240101010101.xlsx" + output_dir = "/path/to" + + generate_merge_result(self.all_compare_index_dict_list, self.all_rank_num_list, + self.all_compare_index_list_list, output_dir) + + mock_save_excel.assert_called_once() + mock_logger.info.assert_called_once_with("The compare results of the multi-ranks are merged and saved in: " + 
"/path/to/multi_ranks_compare_merge_20240101010101.xlsx.") + + def test_df_merge_multiple_dataframes(self): + df1 = pd.DataFrame({CompareConst.NPU_NAME: ["api1", "api2"], "rank1": [100, 200]}) + df2 = pd.DataFrame({CompareConst.NPU_NAME: ["api2", "api3"], "rank2": [150, 250]}) + df3 = pd.DataFrame({CompareConst.NPU_NAME: ["api1", "api3"], "rank3": [120, 300]}) + + all_result_df_list = [[df1], [df2], [df3]] + + result = df_merge(all_result_df_list) + + expected_df = pd.DataFrame({ + CompareConst.NPU_NAME: ["api1", "api2", "api3"], + "rank1": [100, 200, None], + "rank2": [None, 150, 250], + "rank3": [120, None, 300] + }) + + self.assertEqual(len(result), 1) + pd.testing.assert_frame_equal(result[0], expected_df) + + @patch("multiprocessing.Manager") + def test_initialize_compare_index(self, mock_manager): + mock_list = MagicMock() + mock_manager_instance = MagicMock() + mock_manager_instance.list.return_value = mock_list + mock_manager.return_value = mock_manager_instance + + config = {"compare_index": [1, 2, 3]} + + initialize_compare_index(config) + + mock_manager.assert_called_once() + mock_manager_instance.list.assert_called_once_with([1, 2, 3]) + + from msprobe.core.compare.merge_result.merge_result import share_compare_index_list + self.assertIs(share_compare_index_list, mock_list) + + @patch('msprobe.core.compare.merge_result.merge_result.FileChecker') + @patch('msprobe.core.compare.merge_result.merge_result.create_directory') + @patch('msprobe.core.compare.merge_result.merge_result.get_result_path') + @patch('msprobe.core.compare.merge_result.merge_result.load_yaml') + @patch('msprobe.core.compare.merge_result.merge_result.handle_multi_process') + @patch('msprobe.core.compare.merge_result.merge_result.generate_merge_result') + def test_merge_result(self, mock_generate_merge_result, mock_handle_multi_process, mock_load_yaml, + mock_get_result_path, mock_create_directory, mock_file_checker): + + input_dir = '/path/to/input' + output_dir = '/path/to/output' + 
config_path = '/path/to/config.yaml' + + mock_file_checker.return_value.common_check.return_value = input_dir + mock_create_directory.return_value = None + mock_get_result_path.return_value = ['/path/to/input/compare_result_rank1-rank1_20240101010101.xlsx', + '/path/to/input/compare_result_rank2-rank2_20240101010101.xlsx'] + mock_load_yaml.return_value = { + 'api': ['api1', 'api2'], + 'compare_index': ['index1', 'index2'] + } + mock_handle_multi_process.return_value = ( + [[{'index1': {'api1': {1: 100}}}], [{'index1': {'api1': {2: 100}}}]], # all_compare_index_dict_list + [[1], [2]], # all_rank_num_list + [['index1'], ['index2']] # all_compare_index_list_list + ) + + merge_result(input_dir, output_dir, config_path) + + mock_file_checker.assert_called_once_with(input_dir, "dir", "read") + mock_create_directory.assert_called_once_with(output_dir) + mock_get_result_path.assert_called_once_with(input_dir) + mock_load_yaml.assert_called_once_with(config_path) + mock_handle_multi_process.assert_called_once() + mock_generate_merge_result.assert_called_once() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result_cli.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result_cli.py new file mode 100644 index 0000000000000000000000000000000000000000..e8b6575a0b3323f42afd754b64e244723af9d6d1 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result_cli.py @@ -0,0 +1,24 @@ +import unittest +from unittest.mock import patch +import argparse +from msprobe.core.compare.merge_result.merge_result_cli import _merge_result_parser, merge_result_cli + + +class TestMergeResultCLI(unittest.TestCase): + @patch('msprobe.core.compare.merge_result.merge_result_cli.merge_result') + def test_merge_result_cli_success(self, mock_merge_result): + args = [ + '-i', '/path/to/input', + '-o', '/path/to/output', + '-config', '/path/to/config.yaml' + ] + + parser = argparse.ArgumentParser() + _merge_result_parser(parser) + 
parsed_args = parser.parse_args(args) + + merge_result_cli(parsed_args) + + mock_merge_result.assert_called_once_with( + '/path/to/input', '/path/to/output', '/path/to/config.yaml' + ) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result_utils.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..69b0e7ff01ba7304e977dfd9608ce8482c131d5b --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_merge_result_utils.py @@ -0,0 +1,266 @@ +# coding=utf-8 +""" +# Copyright (C) 2025-2025. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import unittest +from unittest.mock import patch + +from msprobe.core.common.const import CompareConst +from msprobe.core.common.utils import CompareException +from msprobe.core.compare.merge_result.utils import replace_compare_index_dict, check_config + + +class TestReplaceCompareIndexDict(unittest.TestCase): + + def setUp(self): + # Initialize test data + self.compare_index_dict = { + 'Max diff': { + 'op_name_1': {0: 'N/A'}, + 'op_name_2': {0: 'N/A'} + }, + 'L2norm diff': { + 'op_name_1': {0: 'N/A'}, + 'op_name_2': {0: 'N/A'} + }, + 'MeanRelativeErr': { + 'op_name_1': {0: 'N/A'}, + 'op_name_2': {0: 'N/A'} + }, + CompareConst.NPU_MAX: { + 'op_name_1': {0: 'tp-0-1-2-3'}, + 'op_name_2': {0: 'tp-0-1-2-3'} + }, + CompareConst.BENCH_MAX: { + 'op_name_1': {0: 'tp-0-1-2-3'}, + 'op_name_2': {0: 'tp-0-1-2-3'} + } + } + self.compare_index_list = ['Max diff', 'L2norm diff', 'MeanRelativeErr', 'NPU max', 'Bench max'] + self.rank_num = 0 + + def test_process_compare_index_dict_na(self): + result = replace_compare_index_dict(self.compare_index_dict, self.compare_index_list, self.rank_num) + + # Check that the N/A values were replaced + self.assertEqual(result['Max diff']['op_name_1'][self.rank_num], 'NPU:tp-0-1-2-3 Bench:tp-0-1-2-3') + self.assertEqual(result['Max diff']['op_name_2'][self.rank_num], 'NPU:tp-0-1-2-3 Bench:tp-0-1-2-3') + + self.assertEqual(result['L2norm diff']['op_name_1'][self.rank_num], 'NPU:tp-0-1-2-3 Bench:tp-0-1-2-3') + self.assertEqual(result['L2norm diff']['op_name_2'][self.rank_num], 'NPU:tp-0-1-2-3 Bench:tp-0-1-2-3') + + self.assertEqual(result['MeanRelativeErr']['op_name_1'][self.rank_num], 'NPU:tp-0-1-2-3 Bench:tp-0-1-2-3') + self.assertEqual(result['MeanRelativeErr']['op_name_2'][self.rank_num], 'NPU:tp-0-1-2-3 Bench:tp-0-1-2-3') + + def test_no_na_values(self): + # Modify the test data so that no N/A values remain + for index in self.compare_index_list[:-2]:  # exclude 'NPU max' and 'Bench max' + self.compare_index_dict[index] = { + 'op_name_1': {0: 'tp-0-1-2-3'}, + 'op_name_2': {0: 'tp-0-1-2-3'} + } + + result = 
replace_compare_index_dict(self.compare_index_dict, self.compare_index_list, self.rank_num) + + # Verify the returned values are unchanged + self.assertEqual(result['Max diff']['op_name_1'][self.rank_num], 'tp-0-1-2-3') + self.assertEqual(result['Max diff']['op_name_2'][self.rank_num], 'tp-0-1-2-3') + + self.assertEqual(result['L2norm diff']['op_name_1'][self.rank_num], 'tp-0-1-2-3') + self.assertEqual(result['L2norm diff']['op_name_2'][self.rank_num], 'tp-0-1-2-3') + + self.assertEqual(result['MeanRelativeErr']['op_name_1'][self.rank_num], 'tp-0-1-2-3') + self.assertEqual(result['MeanRelativeErr']['op_name_2'][self.rank_num], 'tp-0-1-2-3') + + def test_non_string_npu_bench(self): + # Change the NPU and Bench statistics to non-string types + self.compare_index_dict[CompareConst.NPU_MAX] = { + 'op_name_1': {0: 123}, + 'op_name_2': {0: 123} + } + self.compare_index_dict[CompareConst.BENCH_MAX] = { + 'op_name_1': {0: 123}, + 'op_name_2': {0: 123} + } + + result = replace_compare_index_dict(self.compare_index_dict, self.compare_index_list, self.rank_num) + + expected_value = 'NPU:123 Bench:123' + self.assertEqual(result['Max diff']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['Max diff']['op_name_2'][self.rank_num], expected_value) + + self.assertEqual(result['L2norm diff']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['L2norm diff']['op_name_2'][self.rank_num], expected_value) + + self.assertEqual(result['MeanRelativeErr']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['MeanRelativeErr']['op_name_2'][self.rank_num], expected_value) + + def test_missing_npu_bench_max(self): + # Remove the NPU_MAX and BENCH_MAX keys + del self.compare_index_dict[CompareConst.NPU_MAX] + del self.compare_index_dict[CompareConst.BENCH_MAX] + + result = replace_compare_index_dict(self.compare_index_dict, self.compare_index_list, self.rank_num) + + # Verify the original data is unchanged + self.assertEqual(result['Max diff']['op_name_1'][self.rank_num], 'N/A') + self.assertEqual(result['Max 
diff']['op_name_2'][self.rank_num], 'N/A') + + self.assertEqual(result['L2norm diff']['op_name_1'][self.rank_num], 'N/A') + self.assertEqual(result['L2norm diff']['op_name_2'][self.rank_num], 'N/A') + + self.assertEqual(result['MeanRelativeErr']['op_name_1'][self.rank_num], 'N/A') + self.assertEqual(result['MeanRelativeErr']['op_name_2'][self.rank_num], 'N/A') + + def test_unsupported_values(self): + # 'unsupported' + self.compare_index_dict['Max diff'] = { + 'op_name_1': {0: 'unsupported'}, + 'op_name_2': {0: 'unsupported'} + } + self.compare_index_dict['L2norm diff'] = { + 'op_name_1': {0: 'unsupported'}, + 'op_name_2': {0: 'unsupported'} + } + self.compare_index_dict['MeanRelativeErr'] = { + 'op_name_1': {0: 'unsupported'}, + 'op_name_2': {0: 'unsupported'} + } + + result = replace_compare_index_dict(self.compare_index_dict, self.compare_index_list, self.rank_num) + + # Check that 'unsupported' values were replaced + expected_value = 'NPU:tp-0-1-2-3 Bench:tp-0-1-2-3' + + self.assertEqual(result['Max diff']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['Max diff']['op_name_2'][self.rank_num], expected_value) + + self.assertEqual(result['L2norm diff']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['L2norm diff']['op_name_2'][self.rank_num], expected_value) + + self.assertEqual(result['MeanRelativeErr']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['MeanRelativeErr']['op_name_2'][self.rank_num], expected_value) + + def test_nan_values(self): + # 'Nan' + self.compare_index_dict['Max diff'] = { + 'op_name_1': {0: 'Nan'}, + 'op_name_2': {0: 'Nan'} + } + self.compare_index_dict['L2norm diff'] = { + 'op_name_1': {0: 'Nan'}, + 'op_name_2': {0: 'Nan'} + } + self.compare_index_dict['MeanRelativeErr'] = { + 'op_name_1': {0: 'Nan'}, + 'op_name_2': {0: 'Nan'} + } + + result = replace_compare_index_dict(self.compare_index_dict, self.compare_index_list, self.rank_num) + + # Check that 'Nan' values were replaced + expected_value = 'NPU:tp-0-1-2-3 
Bench:tp-0-1-2-3' + + self.assertEqual(result['Max diff']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['Max diff']['op_name_2'][self.rank_num], expected_value) + + self.assertEqual(result['L2norm diff']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['L2norm diff']['op_name_2'][self.rank_num], expected_value) + + self.assertEqual(result['MeanRelativeErr']['op_name_1'][self.rank_num], expected_value) + self.assertEqual(result['MeanRelativeErr']['op_name_2'][self.rank_num], expected_value) + + def test_empty_dict(self): + # Test handling of an empty dict + empty_dict = {} + result = replace_compare_index_dict(empty_dict, [], self.rank_num) + self.assertEqual(result, {}) + + def test_empty_compare_index_list(self): + # Test the case of an empty compare_index_list + result = replace_compare_index_dict(self.compare_index_dict, [], self.rank_num) + self.assertEqual(result, self.compare_index_dict) + + +class TestCheckConfig(unittest.TestCase): + + @patch('msprobe.core.common.file_utils.logger.error') + def test_check_config_empty(self, mock_logger_error): + config = None + + with self.assertRaises(CompareException): + check_config(config) + + mock_logger_error.assert_called_once_with('config.yaml is empty, please check.') + + @patch('msprobe.core.common.file_utils.logger.error') + def test_check_config_missing_api(self, mock_logger_error): + config = { + 'compare_index': ['index1', 'index2'] + } + + with self.assertRaises(CompareException): + check_config(config) + + mock_logger_error.assert_called_once_with('The APIs required to merge data were not found.') + + @patch('msprobe.core.common.file_utils.logger.error') + def test_check_config_api_is_not_list(self, mock_logger_error): + config = { + 'api': 'api1', + 'compare_index': ['index1', 'index2'] + } + + with self.assertRaises(CompareException): + check_config(config) + + mock_logger_error.assert_called_once_with("The config format of 'api' is incorrect, please check.") + + 
@patch('msprobe.core.common.file_utils.logger.error') + def test_check_config_compare_index_is_not_list(self, mock_logger_error): + config = { + 'api': ['api1', 'api2'], + 'compare_index': 'index1' + } + + with self.assertRaises(CompareException): + check_config(config) + + mock_logger_error.assert_called_once_with("The config format of 'compare_index' is incorrect, please check.") + + def test_check_config_compare_index_is_none(self): + config = { + 'api': ['api1', 'api2'], + 'compare_index': None + } + result_target = { + 'api': ['api1', 'api2'], + 'compare_index': [] + } + result = check_config(config) + + self.assertEqual(result, result_target) + + @patch('msprobe.core.common.file_utils.logger.error') + def test_check_config_success(self, mock_logger_error): + config = { + 'api': ['api1', 'api2'], + 'compare_index': ['index1', 'index2'] + } + + result = check_config(config) + + self.assertEqual(result, config) + mock_logger_error.assert_not_called() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/compare/test_postprocess_pass.py b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_postprocess_pass.py new file mode 100644 index 0000000000000000000000000000000000000000..9cb33eb277848fa96bdf5b7456867d8579359723 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/compare/test_postprocess_pass.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" +from unittest import TestCase +from msprobe.core.compare.layer_mapping.postprocess_pass import extract_next_item_last_number +from msprobe.core.compare.layer_mapping.postprocess_pass import replace_next_item_index + + +class TestPostProcessPass(TestCase): + + def test_check_path_type_None(self): + input_data = "conv1.Conv2d.forward.input" + prefix = "Conv2d" + none_result = extract_next_item_last_number(input_data, prefix) + self.assertEqual(none_result, None) + + def test_check_path_type_find_result(self): + input_data = "conv1.Conv2d.forward.input.conv1" + prefix = "conv1" + result_2 = extract_next_item_last_number(input_data, prefix) + self.assertEqual(result_2, 2) + + def test_replace_next_item_index(self): + input_data = "conv1.Conv2d.forward.input.conv1" + prefix = "conv1" + replace_result = replace_next_item_index(input_data, prefix, 1) + self.assertEqual(replace_result, "conv1.1.forward.input.conv1") + + def test_replace_next_item_index_with_inf(self): + input_data = "conv1.Conv2d.forward.input.conv1" + prefix = "conv1" + inf_value = float("inf") + replace_result = replace_next_item_index(input_data, prefix, inf_value) + self.assertEqual(replace_result, input_data) + diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_base.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_base.py index 2cba5e890f6d878b5ca3e7f9deed866da02ab373..8ff89437646ee203aaa4a3fac5bbfea1538e9409 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_base.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_base.py @@ -1,11 +1,15 @@ import unittest from unittest.mock import patch, MagicMock import os +from collections import namedtuple +from dataclasses import dataclass +from typing import Optional, Tuple import numpy as np from msprobe.core.common.log import 
logger from msprobe.core.data_dump.data_processor.base import ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs, \ TensorStatInfo, BaseDataProcessor +from msprobe.core.data_dump.data_processor.mindspore_processor import MindsporeDataProcessor class TestModuleForwardInputsOutputs(unittest.TestCase): @@ -22,9 +26,10 @@ class TestModuleForwardInputsOutputs(unittest.TestCase): module = ModuleForwardInputsOutputs(args=None, kwargs=None, output=(4, 5, 6)) self.assertEqual(module.output_tuple, (4, 5, 6)) - def test_concat_args_and_kwargs(self): + def test_update_output_with_args_and_kwargs(self): module = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3, 'b': 4}, output=None) - self.assertEqual(module.concat_args_and_kwargs(), (1, 2, 3, 4)) + module.update_output_with_args_and_kwargs() + self.assertEqual(module.output, (1, 2, 3, 4)) class TestModuleBackwardInputsOutputs(unittest.TestCase): @@ -61,7 +66,7 @@ class TestBaseDataProcessor(unittest.TestCase): self.data_writer.dump_tensor_data_dir = "./dump_data" self.processor.current_api_or_module_name = "test_api" self.processor.api_data_category = "input" - + @patch('inspect.stack') def test_analyze_api_call_stack(self, mock_stack): mock_stack.return_value = [ @@ -77,8 +82,8 @@ class TestBaseDataProcessor(unittest.TestCase): result = BaseDataProcessor.analyze_api_call_stack('test_stack') expected_output = { 'test_stack': [ - 'File file5.py, line 50, in function5, \n code line 5', - 'File file6.py, line 60, in function6, \n code line 6', + 'File file5.py, line 50, in function5, \n code line 5', + 'File file6.py, line 60, in function6, \n code line 6', 'File file7.py, line 70, in function7, \n code line 7', ] } @@ -108,21 +113,47 @@ class TestBaseDataProcessor(unittest.TestCase): expected = {'type': 'int', 'value': 1} self.assertEqual(result, expected) - def test_analyze_numpy(self): - result = BaseDataProcessor._analyze_numpy(5, 'int32') - self.assertEqual(result, {'type': 'int32', 'value': 5}) - def 
test_get_special_types(self): self.assertIn(int, BaseDataProcessor.get_special_types()) + def test_analyze_numpy(self): + ndarray = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32) + result = BaseDataProcessor._analyze_numpy(ndarray, 'numpy.ndarray') + expected_result = { + 'type': 'numpy.ndarray', + 'dtype': 'int32', + 'shape': (2, 3), + 'Max': 6, + 'Min': 1, + 'Mean': 3.5, + 'Norm':9.539392014169456 + } + self.assertEqual(result, expected_result) + def test_recursive_apply_transform(self): transform = lambda x, _: x * 2 + Test = namedtuple("Test", ['a']) + myNamedTuple = Test(1) + @dataclass + class MyDataClass: + last_hidden_state: int = None + hidden_states: Optional[Tuple[int, ...]] = None + attentions: Optional[Tuple[int, ...]] = None + + myData = MyDataClass( + last_hidden_state=1, + hidden_states=(2, 3), + attentions=(4, 5) + ) + expected_dataclass_res = {'last_hidden_state': 2, 'hidden_states': [4, 6], 'attentions': [8,10]} self.assertEqual(BaseDataProcessor.recursive_apply_transform(2, transform), 4) + self.assertEqual(BaseDataProcessor.recursive_apply_transform(myData, transform), expected_dataclass_res) + self.assertEqual(BaseDataProcessor.recursive_apply_transform(myNamedTuple, transform), {'a': 2}) self.assertEqual(BaseDataProcessor.recursive_apply_transform([1, 2], transform), [2, 4]) - self.assertEqual(BaseDataProcessor.recursive_apply_transform((1, 2), transform), (2, 4)) + self.assertEqual(BaseDataProcessor.recursive_apply_transform((1, 2), transform), [2, 4]) self.assertEqual(BaseDataProcessor.recursive_apply_transform({'a': 1}, transform), {'a': 2}) - @patch.object(logger, 'warning') + @patch.object(logger, 'debug') def test_recursive_apply_transform_with_warning(self, mock_logger): transform = lambda x, _: x * 2 BaseDataProcessor.recursive_apply_transform({1, 2, 3}, transform) @@ -148,52 +179,67 @@ class TestBaseDataProcessor(unittest.TestCase): def test_is_dump_for_data_mode(self): self.config.data_mode = ["all"] + 
self.processor.allowed_data_mode = self.processor._get_allowed_data_mode(self.config.data_mode) self.assertTrue(self.processor.is_dump_for_data_mode("forward", "input")) + self.config.data_mode = ["forward"] + self.processor.allowed_data_mode = self.processor._get_allowed_data_mode(self.config.data_mode) self.assertTrue(self.processor.is_dump_for_data_mode("forward", "input")) + self.config.data_mode = ["input"] + self.processor.allowed_data_mode = self.processor._get_allowed_data_mode(self.config.data_mode) self.assertTrue(self.processor.is_dump_for_data_mode("forward", "input")) + self.config.data_mode = ["backward"] + self.processor.allowed_data_mode = self.processor._get_allowed_data_mode(self.config.data_mode) self.assertFalse(self.processor.is_dump_for_data_mode("forward", "input")) + self.config.data_mode = ["forward", "input"] + self.processor.allowed_data_mode = self.processor._get_allowed_data_mode(self.config.data_mode) + self.assertFalse(self.processor.is_dump_for_data_mode("forward", "output")) + + self.config.data_mode = ["forward", "input"] + self.processor.allowed_data_mode = self.processor._get_allowed_data_mode(self.config.data_mode) + self.assertFalse(self.processor.is_dump_for_data_mode("backward", "input")) + @patch.object(BaseDataProcessor, 'analyze_element') - def test_analyze_forward(self, mock_analyze_element): + def test_analyze_forward_input(self, mock_analyze_element): mock_analyze_element.side_effect = lambda args: args - module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=(4, 5)) + module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=None) self.config.data_mode = ["all"] - result = self.processor.analyze_forward("test_forward", None, module_io) + result = self.processor.analyze_forward_input("test_forward_input", None, module_io) expected = { - "test_forward": { + "test_forward_input": { "input_args": (1, 2), - "input_kwargs": {'a': 3}, - "output": (4, 5) + "input_kwargs": {'a': 3} } } 
self.assertEqual(result, expected) @patch.object(BaseDataProcessor, 'analyze_element') - def test_analyze_pre_forward_inplace(self, mock_analyze_element): + def test_analyze_forward_output(self, mock_analyze_element): mock_analyze_element.side_effect = lambda args: args - module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=None) + module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=(4, 5)) self.config.data_mode = ["all"] - result = self.processor.analyze_pre_forward_inplace("test_pre_forward", module_io) + result = self.processor.analyze_forward_output("test_forward_output", None, module_io) expected = { - "test_pre_forward": { - "input_args": (1, 2), - "input_kwargs": {'a': 3} + "test_forward_output": { + "output": (4, 5) } } self.assertEqual(result, expected) @patch.object(BaseDataProcessor, 'analyze_element') - def test_analyze_forward_inplace(self, mock_analyze_element): + def test_analyze_forward(self, mock_analyze_element): mock_analyze_element.side_effect = lambda args: args - module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=None) + module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=(4, 5)) self.config.data_mode = ["all"] - result = self.processor.analyze_forward_inplace("test_forward_inplace", module_io) + result = self.processor.analyze_forward("test_forward", None, module_io) expected = { - "test_forward_inplace": { - "output": (1, 2, 3) + "test_forward": { + "input_args": (1, 2), + "input_kwargs": {'a': 3}, + "output": (4, 5) } } self.assertEqual(result, expected) @@ -218,3 +264,67 @@ class TestBaseDataProcessor(unittest.TestCase): expected_file_name = "test_api.input.suffix.pt" expected_file_path = os.path.join(self.data_writer.dump_tensor_data_dir, expected_file_name) self.assertEqual(result, (expected_file_name, expected_file_path)) + + def test_get_save_file_path_with_save_name(self): + self.config.framework = "pytorch" + self.processor.save_name = 
"custom_name" + result = self.processor.get_save_file_path("suffix") + expected_file_name = "custom_name.pt" + expected_file_path = os.path.join(self.data_writer.dump_tensor_data_dir, expected_file_name) + self.assertEqual(result, (expected_file_name, expected_file_path)) + + def test_set_value_into_nested_structure(self): + dst_data_structure = {"key1": [None, None]} + self.processor.set_value_into_nested_structure(dst_data_structure, ["key1", 0], 12) + excepted_result = {"key1": [12, None]} + self.assertEqual(dst_data_structure, excepted_result) + + def test_analyze_element_to_all_none(self): + element = {"key1": [12, 3, {"key2": 10, "key3":["12"]}]} + result = self.processor.analyze_element_to_all_none(element) + excepted_result = {"key1": [None, None, {"key2": None, "key3":[None]}]} + self.assertEqual(result, excepted_result) + + @patch.object(MindsporeDataProcessor, "is_hookable_element", return_value=True) + def test_register_hook_single_element(self, _): + element = MagicMock() + element.hasattr = MagicMock(side_effect=lambda attr: attr == "register_hook") + element.requires_grad = True + hook_fn = MagicMock() + MindsporeDataProcessor.register_hook_single_element(element, [1, 2], hook_fn) + element.register_hook.assert_called_once() + + @patch("msprobe.core.data_dump.data_processor.base.partial") + def test_analyze_debug_backward(self, mock_partial): + variable = MagicMock() # 模拟输入变量 + grad_name_with_count = "grad_name_1" + nested_data_structure = {"key": "value"} # 模拟嵌套数据结构 + + self.processor.recursive_apply_transform = MagicMock() + self.processor.set_value_into_nested_structure = MagicMock() + self.processor.analyze_element = MagicMock(return_value="grad_data_info") + self.processor.register_hook_single_element = MagicMock() + + # call + self.processor.analyze_debug_backward(variable, grad_name_with_count, nested_data_structure) + + # check partial + args, kwargs = mock_partial.call_args + self.assertIn("hook_fn", kwargs) + self.assertEqual(args[0], 
self.processor.register_hook_single_element)
+        self.assertEqual(kwargs["hook_fn"].__name__, "hook_fn")
+
+        wrap_func = mock_partial.return_value
+        self.processor.recursive_apply_transform.assert_called_once_with(variable, wrap_func)
+
+        grad = MagicMock()
+        index = ["layer1", "layer2"]
+        result = kwargs["hook_fn"](grad, index)
+
+        # verify the internal logic of hook_fn
+        self.processor.analyze_element.assert_called_once_with(grad)
+        self.processor.set_value_into_nested_structure.assert_called_once_with(
+            nested_data_structure, ["grad_name_1", "layer1", "layer2"], "grad_data_info"
+        )
+        self.assertIsNone(self.processor.save_name)
+        self.assertEqual(result, grad)
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_mindspore_processor.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_mindspore_processor.py
index 5ecd508acf4f076f13325a71b3f8e8cdc446350c..b593d34c5d86c7fb3b4a0e8a3ff548c55555e09d 100644
--- a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_mindspore_processor.py
+++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_mindspore_processor.py
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
-# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+# Copyright (C) 2024-2025. Huawei Technologies Co., Ltd. All rights reserved.
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at @@ -26,8 +26,10 @@ from msprobe.core.data_dump.data_processor.base import BaseDataProcessor from msprobe.core.data_dump.data_processor.mindspore_processor import ( MindsporeDataProcessor, TensorDataProcessor, - OverflowCheckDataProcessor + OverflowCheckDataProcessor, + KernelDumpDataProcessor, ) +from msprobe.mindspore.common.log import logger class TestMindsporeDataProcessor(unittest.TestCase): @@ -56,6 +58,7 @@ class TestMindsporeDataProcessor(unittest.TestCase): self.assertEqual(result, expected_result) def test_get_stat_info_float(self): + self.config.async_dump = False tensor = ms.Tensor([1.0, 2.0, 3.0]) result = self.processor.get_stat_info(tensor) self.assertEqual(result.max, 3.0) @@ -63,7 +66,17 @@ class TestMindsporeDataProcessor(unittest.TestCase): self.assertEqual(result.mean, 2.0) self.assertEqual(result.norm, ms.ops.norm(tensor).item()) + def test_get_stat_info_float_async(self): + self.config.async_dump = True + tensor = ms.tensor([1.0, 2.0, 3.0]) + result = self.processor.get_stat_info(tensor).stack_tensor_stat[1] + self.assertEqual(result[0].item(), 3.0) + self.assertEqual(result[1].item(), 1.0) + self.assertEqual(result[2].item(), 2.0) + self.assertEqual(result[3].item(), ms.ops.norm(tensor).item()) + def test_get_stat_info_int(self): + self.config.async_dump = False tensor = ms.Tensor([1, 2, 3], dtype=ms.int32) result = self.processor.get_stat_info(tensor) self.assertEqual(result.max, 3) @@ -71,7 +84,15 @@ class TestMindsporeDataProcessor(unittest.TestCase): self.assertEqual(result.mean, 2) self.assertEqual(result.norm, ms.ops.norm(tensor).item()) + def test_get_stat_info_int_async(self): + self.config.async_dump = True + tensor = ms.tensor([1, 2, 3]) + result = self.processor.get_stat_info(tensor).stack_tensor_stat[1] + self.assertEqual(result[0].item(), 3.0) + self.assertEqual(result[1].item(), 1.0) + def test_get_stat_info_bool(self): + self.config.async_dump = False tensor = ms.Tensor([True, False, 
True]) result = self.processor.get_stat_info(tensor) self.assertEqual(result.max, True) @@ -79,11 +100,19 @@ class TestMindsporeDataProcessor(unittest.TestCase): self.assertIsNone(result.mean) self.assertIsNone(result.norm) + def test_get_stat_info_bool_async(self): + self.config.async_dump = True + tensor = ms.Tensor([True, False, True]) + result = self.processor.get_stat_info(tensor).stack_tensor_stat[1] + self.assertEqual(result[0].item(), True) + self.assertEqual(result[1].item(), False) + @patch.object(MindsporeDataProcessor, 'get_md5_for_tensor') def test__analyze_tensor(self, get_md5_for_tensor): get_md5_for_tensor.return_value = "test_md5" tensor = ms.Tensor(np.array([1, 2, 3], dtype=np.int32)) self.config.summary_mode = 'md5' + self.config.async_dump = False suffix = "test_tensor" expected_result = { 'type': 'mindspore.Tensor', @@ -112,6 +141,7 @@ class TestTensorDataProcessor(unittest.TestCase): @patch('msprobe.core.data_dump.data_processor.mindspore_processor.save_tensor_as_npy') def test_analyze_tensor(self, mock_save): self.config.framework = "mindspore" + self.config.async_dump = False tensor = ms.Tensor([1.0, 2.0, 3.0]) suffix = 'suffix' result = self.processor._analyze_tensor(tensor, suffix) @@ -239,3 +269,68 @@ class TestOverflowCheckDataProcessor(unittest.TestCase): return_value=True): self.data_processor._analyze_tensor("tensor", "suffix") mock_warning.assert_called_with("The file path file_path length exceeds limit.") + +class TestKernelDumpDataProcessor(unittest.TestCase): + def setUp(self): + self.config = MagicMock() + self.data_writer = MagicMock() + self.processor = KernelDumpDataProcessor(self.config, self.data_writer) + + @patch.object(logger, 'warning') + def test_print_unsupported_log(self, mock_logger_warning): + self.processor._print_unsupported_log("test_api_name") + mock_logger_warning.assert_called_with("The kernel dump does not support the test_api_name API.") + + 
@patch('msprobe.core.data_dump.data_processor.mindspore_processor.KernelDumpDataProcessor.start_kernel_dump') + @patch('msprobe.core.data_dump.data_processor.mindspore_processor.has_adump', new=True) + def test_analyze_pre_forward_with_adump(self, mock_start_kernel_dump): + self.processor.analyze_forward_input("test_api_name", None, None) + mock_start_kernel_dump.assert_called_once() + self.assertTrue(self.processor.enable_kernel_dump) + + @patch('msprobe.core.data_dump.data_processor.mindspore_processor.has_adump', new=False) + @patch.object(logger, 'warning') + def test_analyze_pre_forward_without_adump(self, mock_logger_warning): + self.processor.enable_kernel_dump = True + self.processor.analyze_forward_input("test_api_name", None, None) + mock_logger_warning.assert_called_with("The current msprobe package does not compile adump, and kernel dump cannot be used.") + self.assertFalse(self.processor.enable_kernel_dump) + + @patch('msprobe.core.data_dump.data_processor.mindspore_processor.KernelDumpDataProcessor.stop_kernel_dump') + @patch.object(logger, 'info') + def test_analyze_forward_successfully(self, mock_logger_info, mock_stop_kernel_dump): + self.processor.enable_kernel_dump = True + self.processor.analyze_forward_output('test_api_name', None, None) + self.assertFalse(self.processor.enable_kernel_dump) + mock_stop_kernel_dump.assert_called_once() + mock_logger_info.assert_called_with("The kernel data of test_api_name is dumped successfully.") + + @patch('msprobe.core.data_dump.data_processor.mindspore_processor.has_adump', new=True) + @patch('msprobe.core.data_dump.data_processor.mindspore_processor.KernelDumpDataProcessor.start_kernel_dump') + def test_analyze_pre_backward_with_adump(self, mock_start_kernel_dump): + self.processor.enable_kernel_dump = True + self.processor.analyze_backward_input("test_api_name", None, None) + self.assertTrue(self.processor.enable_kernel_dump) + mock_start_kernel_dump.assert_called_once() + + 
@patch('msprobe.core.data_dump.data_processor.mindspore_processor.has_adump', new=False) + @patch.object(logger, 'warning') + def test_analyze_pre_backward_without_adump(self, mock_logger_warning): + self.processor.enable_kernel_dump = True + self.processor.analyze_backward_input("test_api_name", None, None) + self.assertFalse(self.processor.enable_kernel_dump) + mock_logger_warning.assert_called_with("The current msprobe package does not compile adump, and kernel dump cannot be used.") + + @patch('msprobe.core.data_dump.data_processor.mindspore_processor.KernelDumpDataProcessor.stop_kernel_dump') + @patch.object(logger, 'info') + def test_analyze_backward_successfully(self, mock_logger_info, mock_stop_kernel_dump): + self.processor.enable_kernel_dump = True + self.processor.analyze_backward('test_api_name', None, None) + self.assertFalse(self.processor.enable_kernel_dump) + mock_stop_kernel_dump.assert_called_once() + mock_logger_info.assert_called_with("The kernel data of test_api_name is dumped successfully.") + + def test_reset_status(self): + self.processor.enable_kernel_dump = False + self.processor.reset_status() + self.assertTrue(self.processor.enable_kernel_dump) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_pytorch_processor.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_pytorch_processor.py index 648ce488d3ba3e70f44f1fc816bb962b1e198302..34064e7cc2b9d0aa5c0c2e98806b8993137a589c 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_pytorch_processor.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/data_processor/test_pytorch_processor.py @@ -1,20 +1,24 @@ +import hashlib +import os import sys import unittest -from unittest.mock import patch, MagicMock, Mock import zlib +from unittest.mock import patch, MagicMock, Mock -import torch import numpy as np - -from msprobe.core.data_dump.data_processor.base import ModuleBackwardInputsOutputs, 
ModuleForwardInputsOutputs, BaseDataProcessor +import torch +from msprobe.core.common.log import logger +from msprobe.core.data_dump.data_processor.base import ModuleBackwardInputsOutputs, ModuleForwardInputsOutputs, \ + BaseDataProcessor from msprobe.core.data_dump.data_processor.pytorch_processor import ( PytorchDataProcessor, FreeBenchmarkDataProcessor, TensorDataProcessor, - OverflowCheckDataProcessor, - TensorStatInfo, + OverflowCheckDataProcessor, + TensorStatInfo, KernelDumpDataProcessor ) +from torch import distributed as dist class TestPytorchDataProcessor(unittest.TestCase): @@ -66,6 +70,14 @@ class TestPytorchDataProcessor(unittest.TestCase): self.assertEqual(result.mean, 2.0) self.assertEqual(result.norm, torch.norm(tensor).item()) + def test_get_stat_info_float_async(self): + tensor = torch.tensor([1.0, 2.0, 3.0]) + result = self.processor.get_stat_info_async(tensor).stack_tensor_stat[1] + self.assertEqual(result[0].item(), 3.0) + self.assertEqual(result[1].item(), 1.0) + self.assertEqual(result[2].item(), 2.0) + self.assertEqual(result[3].item(), torch.norm(tensor).item()) + def test_get_stat_info_int(self): tensor = torch.tensor([1, 2, 3], dtype=torch.int32) result = self.processor.get_stat_info(tensor) @@ -74,6 +86,14 @@ class TestPytorchDataProcessor(unittest.TestCase): self.assertEqual(result.mean, 2) self.assertEqual(result.norm, torch.norm(tensor.float()).item()) + def test_get_stat_info_int_async(self): + tensor = torch.tensor([1, 2, 3]) + result = self.processor.get_stat_info_async(tensor).stack_tensor_stat[1] + self.assertEqual(result[0].item(), 3.0) + self.assertEqual(result[1].item(), 1.0) + self.assertEqual(result[2].item(), 2.0) + self.assertEqual(result[3].item(), torch.norm(tensor.float()).item()) + def test_get_stat_info_empty(self): tensor = torch.tensor([]) result = self.processor.get_stat_info(tensor) @@ -89,7 +109,13 @@ class TestPytorchDataProcessor(unittest.TestCase): self.assertEqual(result.min, False) 
self.assertIsNone(result.mean) self.assertIsNone(result.norm) - + + def test_get_stat_info_bool_async(self): + tensor = torch.tensor([True, False, True]) + result = self.processor.get_stat_info_async(tensor).stack_tensor_stat[1] + self.assertEqual(result[0].item(), True) + self.assertEqual(result[1].item(), False) + def test_get_stat_info_with_scalar_tensor(self): scalar_tensor = torch.tensor(42.0) result = PytorchDataProcessor.get_stat_info(scalar_tensor) @@ -102,9 +128,9 @@ class TestPytorchDataProcessor(unittest.TestCase): def test_get_stat_info_with_complex_tensor(self): complex_tensor = torch.tensor([1 + 2j, 3 + 4j], dtype=torch.complex64) result = PytorchDataProcessor.get_stat_info(complex_tensor) - expected_max = np.abs(np.array([1+2j, 3+4j])).max().item() - expected_min = np.abs(np.array([1+2j, 3+4j])).min().item() - expected_mean = np.abs(np.array([1+2j, 3+4j])).mean().item() + expected_max = np.abs(np.array([1 + 2j, 3 + 4j])).max().item() + expected_min = np.abs(np.array([1 + 2j, 3 + 4j])).min().item() + expected_mean = np.abs(np.array([1 + 2j, 3 + 4j])).mean().item() self.assertIsInstance(result, TensorStatInfo) self.assertAlmostEqual(result.max, expected_max, places=6) self.assertAlmostEqual(result.min, expected_min, places=6) @@ -162,12 +188,61 @@ class TestPytorchDataProcessor(unittest.TestCase): expected = {'type': 'slice', 'value': [None, None, None]} self.assertEqual(result, expected) + def test_process_group_hash(self): + os.environ['MASTER_ADDR'] = 'localhost' + os.environ['MASTER_PORT'] = '12345' + if dist.is_initialized(): + dist.destroy_process_group() + dist.init_process_group(backend='gloo', world_size=1, rank=0) + process_group_element = dist.group.WORLD + result = self.processor.process_group_hash(process_group_element) + expected = hashlib.md5('[0]'.encode('utf-8')).hexdigest() + self.assertEqual(result, expected) + def test_analyze_torch_size(self): size = torch.Size([3, 4, 5]) result = self.processor._analyze_torch_size(size) expected = 
{'type': 'torch.Size', 'value': [3, 4, 5]} self.assertEqual(result, expected) + def test_analyze_memory_format(self): + memory_format_element = torch.contiguous_format + result = self.processor._analyze_memory_format(memory_format_element) + expected = {'type': 'torch.memory_format', 'format': 'contiguous_format'} + self.assertEqual(result, expected) + + def test_analyze_process_group(self): + os.environ['MASTER_ADDR'] = 'localhost' + os.environ['MASTER_PORT'] = '12345' + if dist.is_initialized(): + dist.destroy_process_group() + dist.init_process_group(backend='gloo', world_size=1, rank=0) + process_group_element = dist.group.WORLD + result = self.processor._analyze_process_group(process_group_element) + expected = { + 'type': 'torch.ProcessGroup', + 'group_ranks': [0], + 'group_id': hashlib.md5('[0]'.encode('utf-8')).hexdigest() + } + self.assertEqual(result, expected) + + def test_analyze_reduce_op_successful(self): + arg = dist.ReduceOp.SUM + result = self.processor._analyze_reduce_op(arg) + expected = {'type': 'torch.distributed.ReduceOp', 'value': 'RedOpType.SUM'} + self.assertEqual(result, expected) + + @patch.object(logger, 'warning') + def test_analyze_reduce_op_failed(self, mock_logger_warning): + class TestReduceOp: + def __str__(self): + raise Exception("failed to convert str type") + arg = TestReduceOp() + self.processor._analyze_reduce_op(arg) + mock_logger_warning.assert_called_with( + "Failed to get value of torch.distributed.ReduceOp with error info: failed to convert str type." 
+ ) + def test_get_special_types(self): special_types = self.processor.get_special_types() self.assertIn(torch.Tensor, special_types) @@ -176,14 +251,29 @@ class TestPytorchDataProcessor(unittest.TestCase): size_element = torch.Size([2, 3]) result = self.processor.analyze_single_element(size_element, []) self.assertEqual(result, self.processor._analyze_torch_size(size_element)) - + + def test_analyze_single_element_memory_size(self): + memory_format_element = torch.contiguous_format + result = self.processor.analyze_single_element(memory_format_element, []) + self.assertEqual(result, self.processor._analyze_memory_format(memory_format_element)) + + def test_analyze_single_element_process_group(self): + os.environ['MASTER_ADDR'] = 'localhost' + os.environ['MASTER_PORT'] = '12345' + if dist.is_initialized(): + dist.destroy_process_group() + dist.init_process_group(backend='gloo', world_size=1, rank=0) + process_group_element = dist.group.WORLD + result = self.processor.analyze_single_element(process_group_element, []) + self.assertEqual(result, self.processor._analyze_process_group(process_group_element)) + def test_analyze_single_element_numpy_conversion(self): numpy_element = np.int64(1) converted_numpy, numpy_type = self.processor._convert_numpy_to_builtin(numpy_element) result = self.processor.analyze_single_element(numpy_element, []) - expected_result = self.processor._analyze_numpy(converted_numpy, numpy_type) + expected_result = {"type": numpy_type, "value": converted_numpy} self.assertEqual(result, expected_result) - + def test_analyze_single_element_tensor(self): tensor_element = torch.tensor([1, 2, 3]) result = self.processor.analyze_single_element(tensor_element, ['tensor']) @@ -206,6 +296,7 @@ class TestPytorchDataProcessor(unittest.TestCase): get_md5_for_tensor.return_value = 'mocked_md5' tensor = torch.tensor([1.0, 2.0, 3.0]) self.config.summary_mode = 'md5' + self.config.async_dump = False result = self.processor._analyze_tensor(tensor, 'suffix') 
expected = { 'type': 'torch.Tensor', @@ -248,6 +339,7 @@ class TestTensorDataProcessor(unittest.TestCase): @patch('torch.save') def test_analyze_tensor(self, mock_save): self.config.framework = "pytorch" + self.config.async_dump = False tensor = torch.tensor([1.0, 2.0, 3.0]) suffix = 'suffix' result = self.processor._analyze_tensor(tensor, suffix) @@ -292,21 +384,21 @@ class TestOverflowCheckDataProcessor(unittest.TestCase): self.processor.real_overflow_nums = 1 self.assertFalse(self.processor.is_terminated) - def test_analyze_pre_forward_inplace(self): - with patch.object(BaseDataProcessor, "analyze_pre_forward_inplace", return_value={"name": 1}): - api_info = self.processor.analyze_pre_forward_inplace("name", "module_input_output") - self.assertEqual(self.processor.cached_inplace_api_info, {"name": 1}) + def test_analyze_forward_input(self): + with patch.object(BaseDataProcessor, "analyze_forward_input", return_value={"name": 1}): + api_info = self.processor.analyze_forward_input("name", "module","module_input_output") + self.assertEqual(self.processor.cached_api_info, {"name": 1}) self.assertIsNone(api_info) - def test_analyze_forward_inplace(self): + def test_analyze_forward_output(self): def func(_): self.processor.has_overflow = True - with patch.object(BaseDataProcessor, "analyze_forward_inplace", return_value={"name": {"output": 2}}), \ - patch.object(OverflowCheckDataProcessor, "handle_overflow", new=func): - self.processor.cached_inplace_api_info = {"name": {"intput": 1}} - api_info = self.processor.analyze_forward_inplace("name", "module_input_output") - self.assertEqual(api_info, {"name": {"intput": 1, "output": 2}}) + with patch.object(BaseDataProcessor, "analyze_forward_output", return_value={"name": {"output": 2}}), \ + patch.object(OverflowCheckDataProcessor, "handle_overflow", new=func): + self.processor.cached_api_info = {"name": {"intput": 1}} + api_info = self.processor.analyze_forward_output("name", "module", "module_input_output") + 
self.assertEqual(api_info, {"name": {"intput": 1, "output": 2}}) def test_analyze_forward(self): def func(_): @@ -402,18 +494,18 @@ class TestFreeBenchmarkDataProcessor(unittest.TestCase): self.data_writer.write_data_to_csv.assert_not_called() @patch('msprobe.pytorch.free_benchmark.FreeBenchmarkCheck.pre_forward') - def test_analyze_pre_forward(self, mock_pre_forward): + def test_analyze_forward_input(self, mock_pre_forward): module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=None) - self.processor.analyze_pre_forward('test_pre_forward', None, module_io) + self.processor.analyze_forward_input('test_analyze_forward_input', None, module_io) mock_pre_forward.assert_called_once() @patch('msprobe.pytorch.free_benchmark.FreeBenchmarkCheck.forward', return_value=(None, [])) - def test_analyze_forward(self, mock_forward): + def test_analyze_forward_output(self, mock_forward): module_io = ModuleForwardInputsOutputs(args=(1, 2), kwargs={'a': 3}, output=(4, 5)) - self.processor.analyze_forward('test_forward', None, module_io) + self.processor.analyze_forward_output('test_analyze_forward_output', None, module_io) mock_forward.assert_called_once() - def test_analyze_forward_if_fix_branch(self): + def test_analyze_forward_output_if_fix_branch(self): self.processor.checker = MagicMock() name = "test_module" module = MagicMock() @@ -423,10 +515,10 @@ class TestFreeBenchmarkDataProcessor(unittest.TestCase): module_input_output.output = "some_output" new_output = "new_output_value" - unequal_rows = [] + unequal_rows = [] self.processor.checker.forward.return_value = (new_output, unequal_rows) self.processor.checker.if_fix.return_value = True - self.processor.analyze_forward(name, module, module_input_output) + self.processor.analyze_forward_output(name, module, module_input_output) self.processor.checker.if_fix.assert_called_once() self.assertTrue(self.processor._return_forward_new_output) self.assertEqual(self.processor._forward_new_output, new_output) @@ 
-436,3 +528,154 @@ class TestFreeBenchmarkDataProcessor(unittest.TestCase):
         module_io = ModuleBackwardInputsOutputs(grad_output=(torch.tensor([1.0, 2.0]),), grad_input=None)
         self.processor.analyze_backward('test_backward', None, module_io)
         mock_backward.assert_called_once()
+
+
+class TestKernelDumpDataProcessor(unittest.TestCase):
+    def setUp(self):
+        self.config = MagicMock()
+        self.data_writer = MagicMock()
+        self.processor = KernelDumpDataProcessor(self.config, self.data_writer)
+
+    @patch.object(logger, 'warning')
+    def test_print_unsupported_log(self, mock_logger_warning):
+        self.processor._print_unsupported_log("test_api_name")
+        mock_logger_warning.assert_called_with("The kernel dump does not support the test_api_name API.")
+
+    @patch('msprobe.core.data_dump.data_processor.pytorch_processor.is_gpu', new=True)
+    @patch.object(logger, 'warning')
+    def test_analyze_pre_forward_with_gpu(self, mock_logger_warning):
+        self.processor.analyze_forward_input("test_api_name", None, None)
+        mock_logger_warning.assert_called_with(
+            "The current environment is not a complete NPU environment, and kernel dump cannot be used.")
+        self.assertFalse(self.processor.enable_kernel_dump)
+
+    @patch('msprobe.core.data_dump.data_processor.pytorch_processor.is_gpu', new=False)
+    @patch('msprobe.core.data_dump.data_processor.pytorch_processor.KernelDumpDataProcessor.analyze_element')
+    @patch.object(logger, 'warning')
+    def test_analyze_pre_forward_with_not_gpu(self, mock_logger_warning, mock_analyze_element):
+        self.config.is_backward_kernel_dump = True
+        mock_module = MagicMock()
+        mock_module_input_output = MagicMock()
+        self.processor.analyze_forward_input("test_api_name", mock_module, mock_module_input_output)
+        mock_module.forward.assert_called_once()
+        mock_analyze_element.assert_called()
+        mock_logger_warning.assert_called_with("The kernel dump does not support the test_api_name API.")
+        self.assertFalse(self.processor.enable_kernel_dump)
+
+    
@patch('msprobe.core.data_dump.data_processor.pytorch_processor.KernelDumpDataProcessor.stop_kernel_dump') + @patch.object(logger, 'info') + def test_analyze_forward_successfully(self, mock_logger_info, mock_stop_kernel_dump): + self.processor.enable_kernel_dump = True + self.processor.config.is_backward_kernel_dump = False + self.processor.analyze_forward_output('test_api_name', None, None) + self.assertFalse(self.processor.enable_kernel_dump) + mock_stop_kernel_dump.assert_called_once() + mock_logger_info.assert_called_with("The kernel data of test_api_name is dumped successfully.") + + @patch('msprobe.core.data_dump.data_processor.pytorch_processor.KernelDumpDataProcessor.analyze_element') + @patch.object(logger, 'warning') + def test_analyze_backward_unsuccessfully(self, mock_logger_warning, mock_analyze_element): + self.processor.enable_kernel_dump = True + self.processor.is_found_grad_input_tensor = False + mock_module_input_output = MagicMock() + self.processor.analyze_backward("test_api_name", None, mock_module_input_output) + mock_analyze_element.assert_called_once() + mock_logger_warning.assert_called_with("The kernel dump does not support the test_api_name API.") + self.assertFalse(self.processor.enable_kernel_dump) + + @patch('msprobe.core.data_dump.data_processor.pytorch_processor.KernelDumpDataProcessor.stop_kernel_dump') + @patch('msprobe.core.data_dump.data_processor.pytorch_processor.KernelDumpDataProcessor.start_kernel_dump') + @patch('msprobe.core.data_dump.data_processor.pytorch_processor.KernelDumpDataProcessor.analyze_element') + @patch.object(logger, 'info') + def test_analyze_backward_successfully(self, mock_logger_info, mock_analyze_element, mock_start, mock_stop): + self.processor.enable_kernel_dump = True + self.processor.is_found_grad_input_tensor = True + self.processor.forward_output_tensor = MagicMock() + mock_module_input_output = MagicMock() + self.processor.analyze_backward("test_api_name", None, mock_module_input_output) + 
mock_analyze_element.assert_called_once() + self.assertFalse(self.processor.enable_kernel_dump) + self.processor.forward_output_tensor.backward.assert_called_once() + mock_start.assert_called_once() + mock_stop.assert_called_once() + mock_logger_info.assert_called_with("The kernel data of test_api_name is dumped successfully.") + + def test_clone_tensor(self): + tensor = torch.tensor([1.0, 2.0, 3.0]) + clone_tensor = self.processor.clone_and_detach_tensor(tensor) + self.assertTrue(torch.equal(tensor, clone_tensor)) + self.assertFalse(clone_tensor.requires_grad) + + tensor = torch.tensor([1.0, 2.0, 3.0], requires_grad=True) + clone_tensor = self.processor.clone_and_detach_tensor(tensor) + self.assertTrue(torch.equal(tensor, clone_tensor)) + self.assertTrue(clone_tensor.requires_grad) + + tensor1 = torch.tensor([1.0], requires_grad=True) + tensor2 = torch.tensor([1.0]) + input_tuple = (tensor1, tensor2) + clone_tuple = self.processor.clone_and_detach_tensor(input_tuple) + self.assertEqual(len(input_tuple), len(clone_tuple)) + self.assertTrue(clone_tuple[0].requires_grad) + self.assertFalse(clone_tuple[1].requires_grad) + + input_list = [tensor1, tensor2] + clone_list = self.processor.clone_and_detach_tensor(input_list) + self.assertEqual(len(input_list), len(clone_list)) + self.assertTrue(clone_tuple[0].requires_grad) + self.assertFalse(clone_tuple[1].requires_grad) + + input_dict = {'tensor1': tensor1, 'tensor2': tensor2} + clone_dict = self.processor.clone_and_detach_tensor(input_dict) + self.assertEqual(len(clone_dict), len(input_dict)) + self.assertTrue(clone_dict["tensor1"].requires_grad) + self.assertFalse(clone_dict["tensor2"].requires_grad) + + non_tensor_input = 1 + result = self.processor.clone_and_detach_tensor(non_tensor_input) + self.assertEqual(result, non_tensor_input) + + def test_analyze_single_element_with_output_grad(self): + self.processor.is_found_output_tensor = False + tensor = torch.tensor([1.0], requires_grad=True) + 
self.processor.analyze_single_element(tensor, None) + self.assertTrue(self.processor.is_found_output_tensor) + + def test_analyze_single_element_without_output_grad(self): + self.processor.is_found_output_tensor = False + tensor = torch.tensor([1.0]) + self.processor.analyze_single_element(tensor, None) + self.assertFalse(self.processor.is_found_output_tensor) + + def test_analyze_single_element_with_grad_input(self): + self.processor.is_found_output_tensor = True + self.processor.is_found_grad_input_tensor = False + tensor = torch.tensor([1.0]) + self.processor.analyze_single_element(tensor, None) + self.assertTrue(self.processor.is_found_grad_input_tensor) + + def test_analyze_single_element_without_grad_input(self): + self.processor.is_found_output_tensor = True + self.processor.is_found_grad_input_tensor = True + tensor = torch.tensor([1.0]) + self.processor.analyze_single_element(tensor, None) + self.assertTrue(self.processor.is_found_grad_input_tensor) + + def test_reset_status(self): + self.processor.enable_kernel_dump = False + self.processor.is_found_output_tensor = True + self.processor.is_found_grad_input_tensor = True + self.processor.forward_args = 0 + self.processor.forward_kwargs = 1 + self.processor.forward_output_tensor = 2 + self.processor.grad_input_tensor = 3 + + self.processor.reset_status() + + self.assertTrue(self.processor.enable_kernel_dump) + self.assertFalse(self.processor.is_found_output_tensor) + self.assertFalse(self.processor.is_found_grad_input_tensor) + self.assertIsNone(self.processor.forward_args) + self.assertIsNone(self.processor.forward_kwargs) + self.assertIsNone(self.processor.forward_output_tensor) + self.assertIsNone(self.processor.grad_input_tensor) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_data_collector.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_data_collector.py index 6e0f1a430a23e92c16f9890cbf451d6b8c4f83eb..b9d2e7abef7244fc12dc71e3113c26af52529ce9 100644 --- 
a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_data_collector.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_data_collector.py @@ -59,17 +59,6 @@ class TestDataCollector(unittest.TestCase): mock_warning.assert_not_called() mock_debug.assert_called_once_with("msprobe is collecting data on Tensor.add.") - def test_pre_forward_data_collect(self): - self.data_collector.check_scope_and_pid = MagicMock(return_value=False) - self.data_collector.is_inplace = MagicMock(return_value=False) - self.data_collector.data_processor.analyze_pre_forward = MagicMock() - name = "TestModule.forward" - pid = 123 - - self.data_collector.pre_forward_data_collect(name, None, pid, None) - self.data_collector.check_scope_and_pid.assert_called_once_with( - self.data_collector.scope, "TestModule.backward", 123) - def test_handle_data(self): with patch.object(DataCollector, "update_data") as mock_update_data, \ patch.object(DataCollector, "write_json") as mock_write_json, \ @@ -93,7 +82,6 @@ class TestDataCollector(unittest.TestCase): @patch.object(DataCollector, "handle_data") def test_forward_data_collect(self, mock_handle_data, _, __, ___): with patch.object(DataCollector, "check_scope_and_pid", return_value=True), \ - patch.object(DataCollector, "is_inplace", return_value=False), \ patch.object(StatisticsDataProcessor, "analyze_forward", return_value={}): with patch.object(StatisticsDataProcessor, "is_terminated", new=True): self.data_collector.forward_data_collect("name", "module", "pid", "module_input_output") @@ -113,3 +101,19 @@ class TestDataCollector(unittest.TestCase): self.data_collector.backward_data_collect("name", "module", "pid", "module_input_output") mock_handle_data.assert_called_with("name", {}, flush=False) + + @patch.object(DataWriter, "update_debug") + @patch.object(BaseDataProcessor, "analyze_debug_forward", return_value="data_info") + def test_debug_data_collect_forward(self, _, mock_update_debug): + 
self.data_collector.debug_data_collect_forward("variable", "name_with_count")
+        mock_update_debug.assert_called_with({"name_with_count": "data_info"})
+
+    @patch.object(DataWriter, "update_debug")
+    @patch.object(BaseDataProcessor, "analyze_debug_backward")
+    @patch.object(BaseDataProcessor, "analyze_element_to_all_none", return_value="all_none_data_info")
+    def test_debug_data_collect_backward(self, _, mock_analyze_debug_backward, mock_update_debug):
+        self.data_collector.data_writer.cache_debug = {"data": None}
+        self.data_collector.debug_data_collect_backward("variable", "name_with_count")
+        mock_update_debug.assert_called_with({"name_with_count": "all_none_data_info"})
+        mock_analyze_debug_backward.assert_called_with("variable", "name_with_count", self.data_collector.data_writer.cache_debug['data'])
+        self.data_collector.data_writer.cache_debug = None
diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_json_writer.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_json_writer.py
index 042e16c5d33e4021dfa039264df97b7193fb1f9c..9b20ffb2197882e16c1550cf013d1ba132096063 100644
--- a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_json_writer.py
+++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_json_writer.py
@@ -3,6 +3,7 @@ import os
 import unittest
 from unittest.mock import patch
 
+from msprobe.core.common.utils import DumpPathAggregation
 from msprobe.core.common.file_utils import FileOpen, remove_path, load_json
 from msprobe.core.data_dump.json_writer import DataWriter
 
@@ -75,12 +76,21 @@ class TestDataWriter(unittest.TestCase):
         self.assertIsNone(self.data_writer.dump_file_path)
 
         test_path = os.path.join(self.cur_path, "test1.json")
-        self.data_writer.update_dump_paths(test_path, test_path, test_path, test_path, test_path)
+        dump_path_aggregation = DumpPathAggregation()
+        dump_path_aggregation.dump_file_path = test_path
+        dump_path_aggregation.stack_file_path = test_path
+        
dump_path_aggregation.construct_file_path = test_path + dump_path_aggregation.dump_tensor_data_dir = test_path + dump_path_aggregation.free_benchmark_file_path = test_path + dump_path_aggregation.debug_file_path = test_path + + self.data_writer.update_dump_paths(dump_path_aggregation) self.assertTrue(self.data_writer.dump_file_path == test_path) self.assertTrue(self.data_writer.stack_file_path == test_path) self.assertTrue(self.data_writer.construct_file_path == test_path) self.assertTrue(self.data_writer.dump_tensor_data_dir == test_path) self.assertTrue(self.data_writer.free_benchmark_file_path == test_path) + self.assertTrue(self.data_writer.debug_file_path == test_path) @patch.object(DataWriter, "write_json") def test_flush_data_periodically(self, mock_write_json): diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_scope.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_scope.py index 2c4d35652a413010bd63ac9e7f171ecbfba8e6e1..246a6b558e597f2cb09a1e96dc9d9f3272a46b6b 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_scope.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_scope.py @@ -1,65 +1,34 @@ import unittest -from unittest.mock import MagicMock - -from msprobe.core.common.exceptions import ScopeException -from msprobe.core.data_dump.scope import ( - build_scope, - build_range_scope_according_to_scope_name, - BaseScope, - ListScope, - RangeScope, - APIRangeScope, - ModuleRangeScope -) - - -class TestBuildScope(unittest.TestCase): - - def test_build_scope_with_no_scope_and_no_api_list(self): - result = build_scope(None) - self.assertIsNone(result) - - def test_build_scope_with_no_scope(self): - result = build_scope(None, api_list=['api1', 'api2']) - self.assertIsInstance(result, APIRangeScope) - - def test_build_scope_with_no_api_list(self): - result = build_scope(None, scope=['scope1', 'scope2']) - self.assertIsInstance(result, APIRangeScope) - - def 
test_build_scope_with_valid_scope_class(self): - class DummyScope: - def __init__(self, scope, api_list): - self.scope = scope - self.api_list = api_list - - result = build_scope(DummyScope, scope=['scope1', 'scope2'], api_list=['api1', 'api2']) - self.assertIsInstance(result, DummyScope) - self.assertEqual(result.scope, ['scope1', 'scope2']) - self.assertEqual(result.api_list, ['api1', 'api2']) - - def test_build_scope_with_invalid_scope_class(self): - with self.assertRaises(TypeError): - build_scope("NotAScopeClass", scope=['scope1'], api_list=['api1']) - - def test_build_range_scope_with_valid_api_range_scope(self): - result = build_range_scope_according_to_scope_name(['scope1'], ['api1']) - self.assertIsInstance(result, APIRangeScope) - self.assertTrue(result.is_valid) - - def test_build_range_scope_with_valid_module_range_scope(self): - result = build_range_scope_according_to_scope_name(['Module.m1', 'Module.m2'], ['api1']) - self.assertIsInstance(result, ModuleRangeScope) - self.assertTrue(result.is_valid) - - def test_build_range_scope_with_invalid_scope(self): - with self.assertRaises(ScopeException) as context: - build_range_scope_according_to_scope_name(['Module.m1', 'scope1'], ['api1']) - self.assertIn("scope=['Module.m1', 'scope1']", str(context.exception)) +from unittest.mock import Mock +from msprobe.core.data_dump.scope import ScopeFactory, ListScope, APIRangeScope, \ + ModuleRangeScope, MixRangeScope, BaseScope, RangeScope, ScopeException +from msprobe.core.common.const import Const + - def test_build_range_scope_with_empty_scope(self): - result = build_range_scope_according_to_scope_name([], ['api1']) - self.assertIsInstance(result, APIRangeScope) +class TestScopeFactory(unittest.TestCase): + def setUp(self): + self.config = Mock() + self.config.task = None + self.config.level = None + self.config.scope = None + self.config.list = None + + def test_build_scope_none(self): + factory = ScopeFactory(self.config) + 
self.assertIsNone(factory.build_scope()) + + def test_build_scope_free_benchmark(self): + self.config.task = Const.FREE_BENCHMARK + self.config.scope = ['scope1'] + factory = ScopeFactory(self.config) + result = factory.build_scope() + self.assertIsInstance(result, ListScope) + + self.config.scope = ['scope1'] + self.config.list = ['api1'] + factory = ScopeFactory(self.config) + with self.assertRaises(ScopeException): + factory.build_scope() class TestBaseScope(unittest.TestCase): @@ -142,18 +111,51 @@ class TestListScope(unittest.TestCase): self.assertFalse(result) +class MockRangeScope(RangeScope): + def check_scope_is_valid(self): + pass + + def check(self, name): + pass + + class TestRangeScope(unittest.TestCase): + def test_init_valid(self): + scope = ['Tensor.add.0.forward', 'Tensor.add.0.backward'] + rs = MockRangeScope(scope, [], Const.LEVEL_L1) + self.assertFalse(rs.in_scope) + self.assertFalse(rs.in_list) + + def test_rectify_args_valid(self): + valid_scope = ['Tensor.add.0.forward', 'Tensor.add.0.backward'] + valid_api_list = ["relu"] + rs = MockRangeScope(valid_scope, valid_api_list) + scope, api_list = rs.rectify_args(valid_scope, valid_api_list) + self.assertEqual(scope, valid_scope) + self.assertEqual(api_list, valid_api_list) + + def test_rectify_args_invalid_scope_length(self): + with self.assertRaises(ScopeException) as context: + rs = MockRangeScope(['Tensor.add.0.forward'], []) + rs.rectify_args(['Tensor.add.0.forward'], []) + self.assertIn("须传入长度为2的列表", str(context.exception)) - def test_rectify_args(self): - scope = ["module1", "module2", "module3"] + def test_scope_length_invalid(self): + scope = ['API.scope1.forward'] + with self.assertRaises(ScopeException): + MockRangeScope(scope, [], Const.LEVEL_L1) + + def test_rectify_args_invalid_api_scope_format(self): with self.assertRaises(ScopeException) as context: - RangeScope.rectify_args(scope, []) - self.assertEqual(context.exception.code, ScopeException.InvalidScope) + rs = 
MockRangeScope(['Tensor.add.', 'API.scope2.backward'], [], Const.LEVEL_L1) + rs.rectify_args(['Tensor.add.', 'API.scope2.backward'], []) + self.assertIn("scope参数格式错误", str(context.exception)) - scope = ["module1"] - expected_scope = ["module1", "module1"] - result_scope, result_api_list = RangeScope.rectify_args(scope, []) - self.assertEqual(result_scope, expected_scope) + def test_rectify_args_invalid_module_scope_format(self): + with self.assertRaises(ScopeException) as context: + rs = MockRangeScope(['Cell.conv2d.', 'Module.scope2.backward'], [], Const.LEVEL_L0) + rs.rectify_args(['Cell.conv2d.', 'Module.scope2.backward'], []) + self.assertIn("scope参数格式错误", str(context.exception)) class TestAPIRangeScope(unittest.TestCase): @@ -197,7 +199,7 @@ class TestModuleRangeScope(unittest.TestCase): result = module_range_scope.check_scope_is_valid() self.assertTrue(result) - module_range_scope = ModuleRangeScope(["Module.1"], ["Module.2"]) + module_range_scope = ModuleRangeScope(["Module.1", "Module.2"], ["Module.2"]) self.assertTrue(module_range_scope.check_scope_is_valid()) def test_begin_module(self): @@ -236,5 +238,59 @@ class TestModuleRangeScope(unittest.TestCase): result = module_range_scope.check(module_name) self.assertTrue(result) - module_range_scope = ModuleRangeScope(["Module.1"], []) + module_range_scope = ModuleRangeScope(["Module.1", "Module.2"], []) self.assertFalse(module_range_scope.check("")) + + +class TestMixRangeScope(unittest.TestCase): + def setUp(self): + self.scope = ['module1', 'module2'] + self.api_list = ['api1', 'api2'] + self.rs = MixRangeScope(self.scope, self.api_list) + + def test_check_scope_is_valid_with_non_empty_scope(self): + self.assertTrue(self.rs.check_scope_is_valid()) + + def test_check_scope_is_valid_with_empty_scope(self): + rs_empty = MixRangeScope([], self.api_list) + self.assertFalse(rs_empty.check_scope_is_valid()) + + def test_begin_module_with_scope_match(self): + self.rs.begin_module('module1') + 
self.assertTrue(self.rs.in_scope) + + def test_begin_module_with_api_list_match(self): + self.rs.begin_module('api1') + self.assertTrue(self.rs.in_list) + + def test_end_module_with_scope_match(self): + self.rs.end_module('module2') + self.assertFalse(self.rs.in_scope) + + def test_end_module_with_api_list_match(self): + self.rs.begin_module('api1') + self.rs.end_module('api1') + self.assertFalse(self.rs.in_list) + + def test_check_api_list_empty(self): + rs_empty = MixRangeScope(self.scope, []) + self.assertTrue(rs_empty.check_api_list('any_api')) + + def test_check_api_list_match(self): + self.assertTrue(self.rs.check_api_list('api1')) + + def test_check_api_list_no_match(self): + self.assertFalse(self.rs.check_api_list('api3')) + + def test_check_with_scope_none_or_in_scope_true(self): + self.rs.in_scope = True + self.assertTrue(self.rs.check('api1')) + self.assertFalse(self.rs.check('api3')) + + def test_check_with_scope_non_empty_and_in_scope_false(self): + self.rs.in_scope = False + self.assertFalse(self.rs.check('api1')) + + +if __name__ == '__main__': + unittest.main() \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_abnormal_scene.py b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_abnormal_scene.py new file mode 100644 index 0000000000000000000000000000000000000000..ea2ce071da31a39cea1d7f25211a0e512ecf41e4 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_abnormal_scene.py @@ -0,0 +1,141 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + +import unittest + +from msprobe.core.overflow_check.abnormal_scene import AnomalyScene, InputNormalOutputAnomalyScene, \ + InputAnomalyOutputAnomalyScene, InputAnomalyOutputNormalScene, NumericalMutationScene +from msprobe.core.overflow_check.api_info import APIInfo +from msprobe.core.overflow_check.level import OverflowLevel + + +class TestAnomalyScene(unittest.TestCase): + + def setUp(self): + self.api_info = APIInfo( + api_name="torch.add", + input_args=[{"type": "torch.Tensor", "Max": "nan"}], + input_kwargs={"bias": {"type": "torch.Tensor", "Max": "-inf"}}, + output_data=[{"type": "torch.Tensor", "Norm": 0.8}] + ) + self.anomaly_scene = InputAnomalyOutputNormalScene(self.api_info) + + def test_get_details(self): + details = self.anomaly_scene.get_details() + self.assertEqual(details["api_name"], self.api_info.api_name) + self.assertEqual(details["scene_type"], "InputAnomalyOutputNormalScene") + self.assertEqual(details["input_args_anomaly_indices"], [0]) + self.assertEqual(details["input_kwargs_anomaly_keys"], ["bias"]) + self.assertEqual(details["output_anomaly_indices"], []) + + +class TestInputNormalOutputAnomalyScene(unittest.TestCase): + + def setUp(self): + self.api_info = APIInfo( + api_name="torch.mul", + input_args=[{"type": "torch.Tensor", "Max": 0.2}], + input_kwargs={}, + output_data=[{"type": "torch.Tensor", "Max": "nan"}] + ) + self.scene = InputNormalOutputAnomalyScene(self.api_info) + + def test_rank(self): + self.assertEqual(self.scene.rank, OverflowLevel.CRITICAL) + + def test_matches(self): + 
self.assertTrue(self.scene.matches()) + + +class TestInputAnomalyOutputAnomalyScene(unittest.TestCase): + + def setUp(self): + self.api_info = APIInfo( + api_name="torch.div", + input_args=[{"type": "torch.Tensor", "Max": "nan"}], + input_kwargs={}, + output_data=[{"type": "torch.Tensor", "Max": "nan"}] + ) + self.scene = InputAnomalyOutputAnomalyScene(self.api_info) + + def test_rank(self): + self.assertEqual(self.scene.rank, OverflowLevel.HIGH) + + def test_matches(self): + self.assertTrue(self.scene.matches()) + + +class TestInputAnomalyOutputNormalScene(unittest.TestCase): + + def setUp(self): + self.api_info = APIInfo( + api_name="torch.relu", + input_args=[{"type": "torch.Tensor", "Max": "nan"}], + input_kwargs={}, + output_data=[{"type": "torch.Tensor", "Max": 0.8}] + ) + self.scene = InputAnomalyOutputNormalScene(self.api_info) + + def test_rank(self): + self.assertEqual(self.scene.rank, OverflowLevel.MEDIUM) + + def test_matches(self): + self.assertTrue(self.scene.matches()) + + def test_input_kwargs_matches(self): + api_info = APIInfo( + api_name="torch.linear", + input_args=[], + input_kwargs={ + "input1":{ + "type": "torch.Tensor", + "Min": "nan", + "Max": 1.245486, + } + }, + output_data=[{"type": "torch.Tensor", "Norm": 0.8}] + ) + scene = InputAnomalyOutputNormalScene(api_info) + self.assertTrue(scene.matches()) + + +class TestNumericalMutationScene(unittest.TestCase): + + def setUp(self): + self.api_info = APIInfo( + api_name="torch.exp", + input_args=[{"type": "torch.Tensor", "Norm": 1.0}], + input_kwargs={}, + output_data=[{"type": "torch.Tensor", "Norm": 200000.0}] + ) + self.scene = NumericalMutationScene(self.api_info, threshold=100000.0) + + def test_rank(self): + self.assertEqual(self.scene.rank, OverflowLevel.HIGH) + + def test_matches(self): + self.assertTrue(self.scene.matches()) + + def test_get_details(self): + details = self.scene.get_details() + self.assertEqual(details["api_name"], self.api_info.api_name) + 
self.assertEqual(details["threshold"], 100000.0) + self.assertTrue(details["scale_change_detected"]) + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_filter.py b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_filter.py new file mode 100644 index 0000000000000000000000000000000000000000..00bd6aba5e47dc3300c64c0da53380821dd1f128 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_filter.py @@ -0,0 +1,148 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + +import unittest +from unittest.mock import MagicMock, patch + +from msprobe.core.overflow_check.api_info import APIInfo +from msprobe.core.overflow_check.filter import IgnoreFilter, Rule, IgnoreItem + + +class TestIgnoreFilter(unittest.TestCase): + + def setUp(self): + self.mock_rule_path = "./mock_ignore_rules.yaml" + self.filter = IgnoreFilter() + + @patch("msprobe.core.common.file_utils.load_yaml") + def test_load_rules(self, mock_load_yaml): + mock_load_yaml.return_value = { + "ignore_nan_inf": [ + { + "api_name": "distributed.reduce_scatter", + "description": "Combines reduction and scatter operations. 
The output tensor may contain " + "uninitialized data before the reduce_scatter call, but it will be overwritten " + "with the reduced and scattered data from all processes.", + "input_ignore": [ + {"index": 0} + ] + } + ] + } + + self.filter._load_rules(self.mock_rule_path) + self.assertIn("distributed.reduce_scatter", self.filter.rules) + rule = self.filter.rules["distributed.reduce_scatter"] + self.assertEqual(rule.api_name, "distributed.reduce_scatter") + self.assertTrue(rule.input_ignore.has_index(0)) + + def test_has_api_rule(self): + self.filter.rules = {"distributed.reduce_scatter": Rule("distributed.reduce_scatter")} + self.assertTrue(self.filter.has_api_rule("distributed.reduce_scatter")) + self.assertFalse(self.filter.has_api_rule("torch.mul")) + + def test_apply_filter(self): + api_info = APIInfo( + api_name="torch.empty.0.forward", + input_args=[{"Max": "nan"}], + input_kwargs={}, + output_data=[] + ) + rule = Rule( + api_name="torch.empty", + output_ignore=[{"index": 0}] + ) + rule.match = MagicMock(return_value=True) + self.filter.rules = {"torch.empty": rule} + self.assertTrue(self.filter.apply_filter(api_info)) + + def test_apply_filter_no_rule(self): + api_info = APIInfo( + api_name="torch.mul", + input_args=[{"Max": "inf"}], + input_kwargs={}, + output_data=[] + ) + self.filter.rules = {"torch.empty_like": Rule("torch.empty_like")} + self.assertFalse(self.filter.apply_filter(api_info)) + + +class TestRule(unittest.TestCase): + + def setUp(self): + self.rule = Rule( + api_name="distributed.recv", + desc="Test description", + input_ignore=[ + {"index": 0}, + {"name": "tensor"} + ], + output_ignore=[{"index": 1}] + ) + + def test_verify_field(self): + self.assertTrue(self.rule.verify_field()) + self.rule.api_name = "" + self.assertFalse(self.rule.verify_field()) + + def test_match(self): + api_info = APIInfo( + api_name="distributed.recv", + input_args=[{"Max": "nan"}, {"Max": 0}], + input_kwargs={ + "tensor": {"Max": "nan"} + }, + 
output_data=[{"Max": 1}, {"Max": "inf"}] + ) + self.assertTrue(self.rule.match(api_info)) + + def test_match_no_ignore(self): + api_info = APIInfo( + api_name="torch.add", + input_args=[{"Max": 0}], + input_kwargs={}, + output_data=[{"Max": 1}] + ) + self.assertFalse(self.rule.match(api_info)) + + +class TestIgnoreItem(unittest.TestCase): + + def setUp(self): + self.item = IgnoreItem() + + def test_add_index(self): + self.item.add_index(0) + self.assertIn(0, self.item.index) + + def test_add_name(self): + self.item.add_name("bias") + self.assertIn("bias", self.item.name) + + def test_has_index(self): + self.item.add_index(0) + self.assertTrue(self.item.has_index(0)) + self.assertFalse(self.item.has_index(1)) + + def test_has_name(self): + self.item.add_name("bias") + self.assertTrue(self.item.has_name("bias")) + self.assertFalse(self.item.has_name("weight")) + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_level.py b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_level.py new file mode 100644 index 0000000000000000000000000000000000000000..7fe03f5ddfe0ff23d3c2ac556752f04e790db918 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_level.py @@ -0,0 +1,41 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import unittest + +from msprobe.core.overflow_check.level import OverflowLevel + + +class TestOverflowLevel(unittest.TestCase): + + def test_enum_values(self): + self.assertEqual(OverflowLevel.MEDIUM.value, "medium") + self.assertEqual(OverflowLevel.HIGH.value, "high") + self.assertEqual(OverflowLevel.CRITICAL.value, "critical") + + def test_enum_names(self): + self.assertEqual(OverflowLevel.MEDIUM.name, "MEDIUM") + self.assertEqual(OverflowLevel.HIGH.name, "HIGH") + self.assertEqual(OverflowLevel.CRITICAL.name, "CRITICAL") + + def test_enum_iteration(self): + levels = [level for level in OverflowLevel] + self.assertEqual(levels, [OverflowLevel.MEDIUM, OverflowLevel.HIGH, OverflowLevel.CRITICAL]) + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_check_api_info.py b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_check_api_info.py new file mode 100644 index 0000000000000000000000000000000000000000..98d3cd941b42ec68429a6d78627bf8820493f773 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_check_api_info.py @@ -0,0 +1,88 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import unittest +from msprobe.core.common.const import Const +from msprobe.core.overflow_check.api_info import APIInfo + + +class TestAPIInfo(unittest.TestCase): + + def setUp(self): + self.api_name = "Functional.linear.1.forward" + self.input_args = [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 40, + 10 + ], + "Max": 0.3156644105911255, + "Min": -0.3159552812576294, + "Mean": -0.007069610990583897, + "Norm": 3.8414149284362793, + "requires_grad": True + }, + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 40 + ], + "Max": 0.2751258611679077, + "Min": -0.29283690452575684, + "Mean": -0.01155175268650055, + "Norm": 1.0337861776351929, + "requires_grad": True + } + ] + self.input_kwargs = { + "bias": {"type": "torch.Tensor", "Norm": 0.2} + } + self.output_data = [ + {"type": "torch.Tensor", "Norm": 0.8} + ] + + def test_init(self): + api_info = APIInfo( + api_name=self.api_name, + input_args=self.input_args, + input_kwargs=self.input_kwargs, + output_data=self.output_data + ) + self.assertEqual(api_info.api_name, self.api_name) + self.assertEqual(api_info.input_args, self.input_args) + self.assertEqual(api_info.input_kwargs, self.input_kwargs) + self.assertEqual(api_info.output_data, self.output_data) + + def test_extract_torch_api(self): + torch_api = APIInfo.extract_torch_api("Functional.linear.1.backward") + self.assertEqual(torch_api, "functional.linear") + + # Case with single part + torch_api = APIInfo.extract_torch_api("torch") + self.assertEqual(torch_api, "torch") + + # Case with empty string + torch_api = APIInfo.extract_torch_api("") + self.assertEqual(torch_api, "") + + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_check_utils.py b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_check_utils.py new file mode 100644 index 
0000000000000000000000000000000000000000..cf166b535311397855b3eeccc05b0705bc0f3410 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_check_utils.py @@ -0,0 +1,83 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" + +import unittest +from typing import Any +from msprobe.core.overflow_check.utils import has_nan_inf + + +class TestHasNanInf(unittest.TestCase): + def test_empty_dict(self): + """Test with an empty dictionary""" + self.assertFalse(has_nan_inf({})) + + def test_dict_without_nan_inf(self): + """Test with a dictionary that doesn't contain NaN or Inf""" + test_dict = { + 'Max': 10, + 'Min': 1, + 'Mean': 5, + 'Norm': 7 + } + self.assertFalse(has_nan_inf(test_dict)) + + def test_dict_with_nan(self): + """Test dict with 'NaN' as a string in key values""" + test_dict = { + 'Max': 'NaN', + 'Min': 5, + 'Mean': 3 + } + self.assertTrue(has_nan_inf(test_dict)) + + def test_dict_with_inf(self): + """Test dict with 'Inf' as a string in key values""" + test_dict = { + 'Max': 'Inf', + 'Min': 5, + 'Mean': 3 + } + self.assertTrue(has_nan_inf(test_dict)) + + def test_dict_with_lowercase_nan(self): + """Test dict with lowercase 'nan'""" + test_dict = { + 'Max': 'nan', + 'Min': 5, + 'Mean': 3 + } + self.assertTrue(has_nan_inf(test_dict)) + + def test_dict_with_lowercase_inf(self): + """Test dict with 
lowercase 'inf'""" + test_dict = { + 'Max': 'inf', + 'Min': 5, + 'Mean': 3 + } + self.assertTrue(has_nan_inf(test_dict)) + + def test_non_dict_input(self): + """Test with non-dictionary input""" + self.assertFalse(has_nan_inf(42)) + self.assertFalse(has_nan_inf("string")) + self.assertFalse(has_nan_inf(None)) + self.assertFalse(has_nan_inf([1, 2, 3])) + + +if __name__ == '__main__': + unittest.main(exit=False) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_checker.py b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_checker.py new file mode 100644 index 0000000000000000000000000000000000000000..35793d3b85adbe1052853dbf8c93f8e12c5c252e --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/overflow_check/test_overflow_checker.py @@ -0,0 +1,254 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import unittest + +from msprobe.core.overflow_check.checker import AnomalyDetector +from msprobe.core.overflow_check.level import OverflowLevel + + +class TestAnomalyDetector(unittest.TestCase): + def setUp(self): + """初始化测试数据""" + self.dump_data = { + # 场景 1: 输入正常,输出异常 + "Torch.add.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": 1.223, + "Min": -1.386, + "Mean": -0.0448, + "Norm": 26.5, + "requires_grad": True, + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": float("nan"), + "Min": float("-inf"), + "Mean": float("nan"), + "Norm": float("inf"), + "requires_grad": False, + } + ], + }, + # 场景 2: 输入异常,输出异常 + "Torch.mul.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": float("inf"), + "Min": -1.386, + "Mean": float("nan"), + "Norm": float("nan"), + "requires_grad": True, + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": float("nan"), + "Min": float("-inf"), + "Mean": float("nan"), + "Norm": float("inf"), + "requires_grad": False, + } + ], + }, + # 场景 3: 输入异常,输出正常 + "Torch.div.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": float("inf"), + "Min": -1.386, + "Mean": float("nan"), + "Norm": float("nan"), + "requires_grad": True, + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": 2.0, + "Min": -1.0, + "Mean": 0.5, + "Norm": 50.0, + "requires_grad": False, + } + ], + }, + # 场景 4: 数值突变 + "Torch.matmul.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": 1.0, + "Min": -1.0, + "Mean": 0.0, + "Norm": 10.0, + "requires_grad": True, + } + ], + 
"input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": 1e6, + "Min": -1e6, + "Mean": 0.0, + "Norm": 1e10, + "requires_grad": False, + } + ], + }, + } + self.detector = AnomalyDetector(self.dump_data) + + def test_analyze(self): + """测试 analyze 方法,确保场景检测正确分类""" + self.detector.analyze() + + # 检查每个场景是否正确分类 + self.assertTrue(self.detector.has_overflow("Torch.add.0.forward")) # 输入正常,输出异常 + self.assertEqual( + self.detector.get_overflow_level("Torch.add.0.forward"), + OverflowLevel.CRITICAL, + ) + + self.assertTrue(self.detector.has_overflow("Torch.mul.0.forward")) # 输入异常,输出异常 + self.assertEqual( + self.detector.get_overflow_level("Torch.mul.0.forward"), + OverflowLevel.HIGH, + ) + + self.assertTrue(self.detector.has_overflow("Torch.div.0.forward")) # 输入异常,输出正常 + self.assertEqual( + self.detector.get_overflow_level("Torch.div.0.forward"), + OverflowLevel.MEDIUM, + ) + + self.assertTrue(self.detector.has_overflow("Torch.matmul.0.forward")) # 数值突变 + self.assertEqual( + self.detector.get_overflow_level("Torch.matmul.0.forward"), + OverflowLevel.HIGH, + ) + + def test_filter(self): + """测试 filter 方法,确保 Torch.empty 被正确过滤""" + self.dump_data["Torch.empty.0.forward"] = { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 100, + 40 + ], + "Max": 1.2230066061019897, + "Min": -1.3862265348434448, + "Mean": -0.044829513877630234, + "Norm": 26.499610900878906, + "requires_grad": True + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [100, 40], + "Max": float("inf"), + "Min": float("-inf"), + "Mean": float("nan"), + "Norm": float("nan"), + "requires_grad": False, + } + ], + } + + new_detector = AnomalyDetector(self.dump_data) + + new_detector.analyze().filter() + self.assertFalse(new_detector.has_overflow("Torch.empty.0.forward")) # 被过滤 + self.assertTrue(new_detector.has_overflow("Torch.add.0.forward")) # 未过滤 + + 
def test_statistics(self): + """Test statistics output""" + self.detector.analyze().filter() + stats = self.detector.get_statistics() + self.assertIn("critical_apis", stats) + self.assertIn("high_priority_apis", stats) + self.assertIn("medium_priority_apis", stats) + self.assertIn("anomaly_details", stats) + + def test_overflow_result(self): + """Test the overflow_result method""" + self.detector.analyze() + results = self.detector.overflow_result() + + # Verify the results contain the anomalous APIs + self.assertIn("Torch.add.0.forward", results) + self.assertIn("Torch.mul.0.forward", results) + + def test_has_overflow(self): + """Test the has_overflow method""" + self.detector.analyze() + self.assertTrue(self.detector.has_overflow("Torch.add.0.forward")) + self.assertFalse(self.detector.has_overflow("Non.existent.api")) + + def test_get_overflow_level(self): + """Test the get_overflow_level method""" + self.detector.analyze() + level = self.detector.get_overflow_level("Torch.add.0.forward") + self.assertEqual(level, OverflowLevel.CRITICAL) + + # Test a non-existent API + self.assertIsNone(self.detector.get_overflow_level("Non.existent.api")) + + def test_chain_calls(self): + """Test chained calls""" + self.detector.analyze().filter() + stats = self.detector.get_statistics() + + # Verify the result of the chained calls + self.assertIn("anomaly_details", stats) + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/test_common_config.py b/debug/accuracy_tools/msprobe/test/core_ut/test_common_config.py index 4419f32a97eced1f612e2978592ce8a3dbb29e91..26426484591f8b5626fd6bcd689920efa6c88416 100644 --- a/debug/accuracy_tools/msprobe/test/core_ut/test_common_config.py +++ b/debug/accuracy_tools/msprobe/test/core_ut/test_common_config.py @@ -34,7 +34,6 @@ class TestCommonConfig(TestCase): self.assertEqual(common_config.rank, []) self.assertEqual(common_config.step, []) self.assertIsNone(common_config.level) - self.assertIsNone(common_config.acl_config) self.assertFalse(common_config.enable_dataloader) json_config.update({"task": "md5"}) @@ -75,7 +74,6
@@ class TestBaseConfig(TestCase): base_config.check_config() self.assertIsNone(base_config.scope) self.assertIsNone(base_config.list) - self.assertIsNone(base_config.backward_input) self.assertIsNone(base_config.file_format) self.assertIsNone(base_config.summary_mode) self.assertIsNone(base_config.overflow_nums) @@ -113,15 +111,23 @@ class TestBaseConfig(TestCase): self.config._check_data_mode() self.assertEqual( mock_error_log_with_exp.call_args_list[0][0][0], - f"data_mode is invalid, it should be a list[str]", + "data_mode is invalid, it should be a list[str]" ) mock_error_log_with_exp.reset_mock() - self.config.data_mode = ["test", "all", "input", "output", "forward", "backward"] + self.config.data_mode = ["all", "forward"] self.config._check_data_mode() self.assertEqual( mock_error_log_with_exp.call_args_list[0][0][0], - f"The number of elements in the data_made cannot exceed {len(Const.DUMP_DATA_MODE_LIST)}.", + "'all' cannot be combined with other options in data_mode." + ) + + mock_error_log_with_exp.reset_mock() + self.config.data_mode = ["test", "input", "output", "forward", "backward"] + self.config._check_data_mode() + self.assertEqual( + mock_error_log_with_exp.call_args_list[0][0][0], + f"The number of elements in the data_made cannot exceed {len(Const.DUMP_DATA_MODE_LIST) - 1}." 
) mock_error_log_with_exp.reset_mock() @@ -129,11 +135,11 @@ class TestBaseConfig(TestCase): self.config._check_data_mode() self.assertEqual( mock_error_log_with_exp.call_args_list[0][0][0], - f"data_mode is invalid, it should be a list[str]" + "data_mode is invalid, it should be a list[str]" ) mock_error_log_with_exp.reset_mock() - self.config.data_mode = ['all', 'test_case_1'] + self.config.data_mode = ['forward', 'test_case_1'] self.config._check_data_mode() self.assertEqual( mock_error_log_with_exp.call_args_list[0][0][0], diff --git a/debug/accuracy_tools/msprobe/test/cpp/CMakeLists.txt b/debug/accuracy_tools/msprobe/test/cpp/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..8807d800b8f745fbea895339a09586c411ca6da0 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/CMakeLists.txt @@ -0,0 +1,24 @@ +cmake_minimum_required(VERSION 3.14) +project(msprobe VERSION 1.0.0 LANGUAGES CXX C) + +set(CMAKE_CXX_STANDARD 17) +set(CMAKE_CXX_STANDARD_REQUIRED ON) + +find_package(gtest MODULE REQUIRED) +find_package(mockcpp MODULE REQUIRED) +find_package(nlohmannjson MODULE REQUIRED) +find_package(cpython MODULE REQUIRED) + +add_executable(msprobe_test) +target_link_libraries(msprobe_test PRIVATE ${gtest_LIBRARIES}) +target_link_libraries(msprobe_test PRIVATE ${mockcpp_LIBRARIES}) +target_link_libraries(msprobe_test PRIVATE _msprobe_c) + +target_include_directories(msprobe_test PRIVATE $ENV{PROJECT_ROOT_PATH}/msprobe/ccsrc) +target_include_directories(msprobe_test PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/include) +target_include_directories(msprobe_test PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/mock) + +target_compile_definitions(msprobe_test PRIVATE __RESOURCES_PATH__="${CMAKE_CURRENT_SOURCE_DIR}/../resources") + +file(GLOB_RECURSE SOURCES "*.cpp") +target_sources(msprobe_test PUBLIC ${SOURCES}) diff --git a/debug/accuracy_tools/msprobe/test/cpp/include/test_utils.cpp b/debug/accuracy_tools/msprobe/test/cpp/include/test_utils.cpp new file mode
100644 index 0000000000000000000000000000000000000000..e744233b3199c15f5ce77b4690bbaa523b0bad45 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/include/test_utils.cpp @@ -0,0 +1,31 @@ +#include <array> +#include <cstdio> +#include <memory> +#include <stdexcept> +#include <string> + +std::string TEST_ExecShellCommand(const std::string& cmd) +{ + std::array<char, 128> buffer; + std::string result; + std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(cmd.c_str(), "r"), pclose); + if (!pipe) { + throw std::runtime_error("popen() failed!"); + } + while (fgets(buffer.data(), buffer.size(), pipe.get()) != nullptr) { + result += buffer.data(); + } + return result; +} + +std::string trim(const std::string& str) +{ + std::string::size_type first = str.find_first_not_of(" \t\n\r\f\v"); + std::string::size_type last = str.find_last_not_of(" \t\n\r\f\v"); + + if (first == std::string::npos || last == std::string::npos) { + return ""; + } + + return str.substr(first, (last - first + 1)); +} diff --git a/debug/accuracy_tools/msprobe/test/cpp/include/test_utils.hpp b/debug/accuracy_tools/msprobe/test/cpp/include/test_utils.hpp new file mode 100644 index 0000000000000000000000000000000000000000..ed842b87db77e75e618acd7a25949145a1578c37 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/include/test_utils.hpp @@ -0,0 +1,8 @@ +#pragma once + +#include <string> + +#define CONFIG_EXAMPLE __RESOURCES_PATH__"/config.json" + +std::string TEST_ExecShellCommand(const std::string& cmd); +std::string trim(const std::string& str); diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_config.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_config.cpp new file mode 100644 index 0000000000000000000000000000000000000000..e8b9b73fb66c3fcae40819545c84b7fafb5d2c4d --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/test_config.cpp @@ -0,0 +1,276 @@ +#include <fstream> +#include "gtest/gtest.h" +#include "nlohmann/json.hpp" +#include "test_utils.hpp" +#include "base/ErrorInfos.hpp" +#include "base/DebuggerConfig.hpp" + +using namespace MindStudioDebugger; + +namespace
MsProbeTest { + +static const std::string cfgContent = R"({ + "task": "statistics", + "dump_path": "./dump_path", + "rank": [], + "step": [], + "level": "L1", + "seed": 1234, + "is_deterministic": false, + "enable_dataloader": false, + "acl_config": "", + "tensor": { + "scope": [], + "list":[], + "data_mode": ["all"], + "backward_input": [], + "file_format": "npy" + }, + "statistics": { + "scope": [], + "list":[], + "data_mode": ["all"], + "summary_mode": "statistics" + }, + "overflow_check": { + "overflow_nums": 1, + "check_mode":"all" + }, + "run_ut": { + "white_list": [], + "black_list": [], + "error_data_path": "./" + }, + "grad_probe": { + "grad_level": "L1", + "param_list": [], + "bounds": [-1, 0, 1] + }, + "free_benchmark": { + "scope": [], + "list": [], + "fuzz_device": "npu", + "pert_mode": "improve_precision", + "handler_type": "check", + "fuzz_level": "L1", + "fuzz_stage": "forward", + "if_preheat": false, + "preheat_step": 15, + "max_sample": 20 + } +})"; + +class TestConfigPyTorch : public ::testing::Test +{ +protected: + void SetUp(){} + void TearDown(){} +}; + +class TestConfigMindSpore : public ::testing::Test +{ +protected: + void SetUp(); + void TearDown(); + int32_t DumpCfgFile(); + const std::string framework = "MindSpore"; + const std::string cfgPath = "./config.json"; + nlohmann::json cfgJson; + const std::string logpath = "./test.log"; +}; + +int32_t TestConfigMindSpore::DumpCfgFile() +{ + std::ofstream ofs(cfgPath, std::ios::out | std::ios::trunc); + if (!ofs.is_open()) { + return -1; + } + try { + ofs << cfgJson.dump(); + } catch (std::exception &e) { + ofs.close(); + return -1; + } + + if (ofs.fail()) { + return -1; + } + + return 0; +} + +void TestConfigMindSpore::SetUp() +{ + DebuggerConfig::GetInstance().Reset(); + CleanErrorInfoCache(); + ErrorInfosManager::SetLogPath(logpath); + cfgJson = nlohmann::json::parse(cfgContent); +} + +void TestConfigMindSpore::TearDown() +{ + TEST_ExecShellCommand("rm -f " + cfgPath); + 
TEST_ExecShellCommand("rm -f " + logpath); +} + +TEST_F(TestConfigMindSpore, TestDefaultValue) +{ + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + EXPECT_FALSE(cfg.IsCfgLoaded()); + EXPECT_EQ(cfg.GetFramework(), DebuggerFramework::FRAMEWORK_PYTORCH); + EXPECT_TRUE(cfg.GetTaskList().empty()); + EXPECT_EQ(cfg.GetOutputPath(), "./output"); + EXPECT_TRUE(cfg.GetRankRange().empty()); + EXPECT_TRUE(cfg.GetStepRange().empty()); + EXPECT_EQ(cfg.GetDebugLevel(), DebuggerLevel::L1); + EXPECT_EQ(cfg.GetRandSeed(), 1234); + EXPECT_FALSE(cfg.IsDeterministic()); + EXPECT_FALSE(cfg.IsDataloaderEnable()); + EXPECT_EQ(cfg.GetStatisticsCfg(), nullptr); + EXPECT_EQ(cfg.GetDumpTensorCfg(), nullptr); + EXPECT_EQ(cfg.GetOverflowCheckCfg(), nullptr); +} + +TEST_F(TestConfigMindSpore, TestLoadConfigBase) +{ + int32_t ret; + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + ret = cfg.LoadConfig("", cfgPath); + EXPECT_EQ(ret, -1); + CleanErrorInfoCache(); + ret = cfg.LoadConfig(framework, "./xxx"); + EXPECT_EQ(ret, -1); + TEST_ExecShellCommand("echo \"invalid content\" > ./invalid.json"); + CleanErrorInfoCache(); + ret = cfg.LoadConfig(framework, "./invalid.json"); + EXPECT_EQ(ret, -1); + TEST_ExecShellCommand("rm ./invalid.json"); + ASSERT_EQ(DumpCfgFile(), 0); + CleanErrorInfoCache(); + ret = cfg.LoadConfig(framework, cfgPath); + EXPECT_EQ(ret, 0); +} + +TEST_F(TestConfigMindSpore, TestCommonCfg) +{ + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + + /* test static method */ + EXPECT_TRUE(cfg.IsRankHits(0)); + EXPECT_TRUE(cfg.IsRankHits(7)); + EXPECT_TRUE(cfg.IsRankHits(12345)); + EXPECT_TRUE(cfg.IsStepHits(0)); + EXPECT_TRUE(cfg.IsStepHits(7)); + EXPECT_TRUE(cfg.IsStepHits(12345)); + + cfgJson["dump_path"] = "./output1"; + cfgJson["rank"] = nlohmann::json::array({0, 1, 8}); + cfgJson["step"] = nlohmann::json::array({2, 4, "6-8"}); + cfgJson["level"] = "L2"; + cfgJson["seed"] = 2345; + cfgJson["is_deterministic"] = true; + cfgJson["enable_dataloader"] = true; + 
ASSERT_EQ(DumpCfgFile(), 0); + EXPECT_EQ(cfg.LoadConfig(framework, cfgPath), 0); + EXPECT_EQ(cfg.GetTaskList(), std::vector({DebuggerTaskType::TASK_DUMP_STATISTICS})); + EXPECT_EQ(cfg.GetOutputPath(), trim(TEST_ExecShellCommand("realpath ./output1"))); + EXPECT_EQ(cfg.GetRankRange(), std::vector({0, 1, 8})); + EXPECT_EQ(cfg.GetStepRange(), std::vector({2, 4, 6, 7, 8})); + EXPECT_EQ(cfg.GetDebugLevel(), DebuggerLevel::L2); + EXPECT_EQ(cfg.GetRandSeed(), 2345); + EXPECT_TRUE(cfg.IsDeterministic()); + EXPECT_TRUE(cfg.IsDataloaderEnable()); + EXPECT_NE(cfg.GetStatisticsCfg(), nullptr); + EXPECT_EQ(cfg.GetDumpTensorCfg(), nullptr); + EXPECT_EQ(cfg.GetOverflowCheckCfg(), nullptr); + EXPECT_TRUE(cfg.IsRankHits(0)); + EXPECT_FALSE(cfg.IsRankHits(7)); + EXPECT_FALSE(cfg.IsRankHits(12345)); + EXPECT_TRUE(cfg.IsStepHits(4)); + EXPECT_TRUE(cfg.IsStepHits(6)); + EXPECT_TRUE(cfg.IsStepHits(8)); + EXPECT_FALSE(cfg.IsStepHits(9)); + + /* invalid case */ + cfg.Reset(); + ErrorInfosManager::SetLogPath("./test.log"); + cfgJson["dump_path"] = 111; + cfgJson["rank"] = "abc"; + cfgJson["step"] = nlohmann::json::array({"a", "b"}); + cfgJson["level"] = "L10"; + cfgJson["seed"] = "123"; + cfgJson["is_deterministic"] = 1; + cfgJson["enable_dataloader"] = "true"; + ASSERT_EQ(DumpCfgFile(), 0); + EXPECT_NE(cfg.LoadConfig(framework, cfgPath), 0); + std::string logContent = TEST_ExecShellCommand("cat " + logpath); + EXPECT_NE(logContent.find("dump_path"), std::string::npos); + EXPECT_NE(logContent.find("rank"), std::string::npos); + EXPECT_NE(logContent.find("step"), std::string::npos); + EXPECT_NE(logContent.find("level"), std::string::npos); + EXPECT_NE(logContent.find("seed"), std::string::npos); + EXPECT_NE(logContent.find("is_deterministic"), std::string::npos); + EXPECT_NE(logContent.find("enable_dataloader"), std::string::npos); +} + +TEST_F(TestConfigMindSpore, TestTensorCfg) +{ + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + cfgJson["task"] = "tensor"; + cfgJson["level"] = 
"L2"; + nlohmann::json& tensorCfgJson = cfgJson["tensor"]; + tensorCfgJson["scope"] = nlohmann::json::array({"a", "b"}); + tensorCfgJson["list"] = nlohmann::json::array({"name-regex(conv)", "add", "ReduceMean-op0.10.5"}); + tensorCfgJson["data_mode"] = nlohmann::json::array({"all"}); + tensorCfgJson["backward_input"] = nlohmann::json::array({"/a.pt", "/b.pt"});; + tensorCfgJson["file_format"] = "npy"; + ASSERT_EQ(DumpCfgFile(), 0); + EXPECT_EQ(cfg.LoadConfig(framework, cfgPath), 0); + std::shared_ptr tensorcfg = cfg.GetDumpTensorCfg(); + ASSERT_NE(tensorcfg, nullptr); + EXPECT_EQ(tensorcfg->scope, std::vector({"a", "b"})); + EXPECT_EQ(tensorcfg->list, std::vector({"name-regex(conv)", "add", "ReduceMean-op0.10.5"})); + EXPECT_EQ(tensorcfg->direction, DebuggerDataDirection::DIRECTION_BOTH); + EXPECT_EQ(tensorcfg->inout, DebuggerDataInOut::INOUT_BOTH); + EXPECT_EQ(tensorcfg->backwardInput, std::vector({"/a.pt", "/b.pt"})); + EXPECT_EQ(tensorcfg->fileFormat, DebuggerDumpFileFormat::FILE_FORMAT_NPY); +} + +TEST_F(TestConfigMindSpore, TestStatisticCfg) +{ + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + cfgJson["task"] = "statistics"; + cfgJson["level"] = "L2"; + nlohmann::json& statisticsCfgJson = cfgJson["statistics"]; + statisticsCfgJson["scope"] = nlohmann::json::array({"c", "d"}); + statisticsCfgJson["list"] = nlohmann::json::array({"name-regex(conv)", "add", "ReduceMean-op0.10.5"}); + statisticsCfgJson["data_mode"] = nlohmann::json::array({"input"}); + statisticsCfgJson["summary_mode"] = "statistics"; + ASSERT_EQ(DumpCfgFile(), 0); + EXPECT_EQ(cfg.LoadConfig(framework, cfgPath), 0); + std::shared_ptr statisticscfg = cfg.GetStatisticsCfg(); + ASSERT_NE(statisticscfg, nullptr); + EXPECT_EQ(statisticscfg->scope, std::vector({"c", "d"})); + EXPECT_EQ(statisticscfg->list, std::vector({"name-regex(conv)", "add", "ReduceMean-op0.10.5"})); + EXPECT_EQ(statisticscfg->direction, DebuggerDataDirection::DIRECTION_BOTH); + EXPECT_EQ(statisticscfg->inout, 
DebuggerDataInOut::INOUT_INPUT); + EXPECT_EQ(statisticscfg->summaryOption, std::vector<DebuggerSummaryOption>( + {DebuggerSummaryOption::MAX, DebuggerSummaryOption::MIN, DebuggerSummaryOption::MEAN, DebuggerSummaryOption::L2NORM})); +} + +TEST_F(TestConfigMindSpore, TestOverflowCfg) +{ + DebuggerConfig& cfg = DebuggerConfig::GetInstance(); + cfgJson["task"] = "overflow_check"; + nlohmann::json& overflowCfgJson = cfgJson["overflow_check"]; + overflowCfgJson["overflow_nums"] = 3; + overflowCfgJson["check_mode"] = "all"; + ASSERT_EQ(DumpCfgFile(), 0); + EXPECT_EQ(cfg.LoadConfig(framework, cfgPath), 0); + std::shared_ptr overflowcfg = cfg.GetOverflowCheckCfg(); + ASSERT_NE(overflowcfg, nullptr); + EXPECT_EQ(overflowcfg->overflowNums, 3); + EXPECT_EQ(overflowcfg->checkMode, DebuggerOpCheckLevel::CHECK_LEVEL_ALL); +} + +} diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_cpython_utils.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_cpython_utils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..0d9188878c0864d66d76cc3a823b0a0a5cf644d5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/test_cpython_utils.cpp @@ -0,0 +1,312 @@ +#include +#include + +#include "test_utils.hpp" +#include "utils/CPythonUtils.hpp" + +using namespace MindStudioDebugger; +using namespace MindStudioDebugger::CPythonUtils; + +namespace MsProbeTest { + +class CPythonUtilsTest : public ::testing::Test { +protected: + void SetUp() override { + Py_Initialize(); + } + + void TearDown() override { + Py_Finalize(); + } +}; + +TEST_F(CPythonUtilsTest, CPythonAgent) { + PythonObject obj = PythonObject::From("test"); + std::string name = "test_object"; + int32_t result = RegisterPythonObject(name, obj); + EXPECT_EQ(result, 0); + bool registerd = IsPyObjRegistered(name); + EXPECT_TRUE(registerd); + + result = RegisterPythonObject(name, obj); + EXPECT_EQ(result, -1); + registerd = IsPyObjRegistered(name); + EXPECT_TRUE(registerd); + + name = "test_object"; + UnRegisterPythonObject(name); + name =
"test_object1"; + UnRegisterPythonObject(name); + registerd = IsPyObjRegistered(name); + EXPECT_FALSE(registerd); + + result = RegisterPythonObject(name, obj); + EXPECT_EQ(result, 0); + registerd = IsPyObjRegistered(name); + EXPECT_TRUE(registerd); + + PythonObject registerd_obj = GetRegisteredPyObj(name); + EXPECT_EQ(static_cast(registerd_obj), static_cast(obj)); + EXPECT_TRUE(registerd_obj.IsString()); + EXPECT_EQ(registerd_obj.ToString(), "test"); + + PythonObject invalid_obj = GetRegisteredPyObj("invalid_name"); + EXPECT_TRUE(invalid_obj.IsNone()); +} + +TEST_F(CPythonUtilsTest, PythonObjectFromTo) { + // 测试PythonObject的From和To函数 + int32_t input_int = -42; + PythonObject obj_int = PythonObject::From(input_int); + EXPECT_TRUE(obj_int.IsNumber()); + + int32_t output_int; + EXPECT_EQ(obj_int.To(output_int), 0); + EXPECT_EQ(output_int, input_int); + + uint32_t input_uint = 56; + PythonObject obj_uint = PythonObject::From(input_uint); + EXPECT_TRUE(obj_uint.IsNumber()); + + uint32_t output_uint; + EXPECT_EQ(obj_uint.To(output_uint), 0); + EXPECT_EQ(output_uint, input_uint); + + double input_double = 3.14; + PythonObject obj_double = PythonObject::From(input_double); + EXPECT_TRUE(obj_double.IsNumber()); + + double output_double; + EXPECT_EQ(obj_double.To(output_double), 0); + EXPECT_DOUBLE_EQ(output_double, input_double); + + std::string input_str = "hello"; + PythonObject obj_str = PythonObject::From(input_str); + EXPECT_TRUE(obj_str.IsString()); + + std::string output_str; + EXPECT_EQ(obj_str.To(output_str), 0); + EXPECT_EQ(output_str, input_str); + + const char* input_char = "world"; + PythonObject obj_str1 = PythonObject::From(input_char); + EXPECT_TRUE(obj_str1.IsString()); + + EXPECT_EQ(obj_str1.To(output_str), 0); + EXPECT_EQ(output_str, std::string(input_char)); + + bool input_bool = true; + PythonObject obj_bool = PythonObject::From(input_bool); + EXPECT_TRUE(obj_bool.IsBool()); + + bool output_bool; + EXPECT_EQ(obj_bool.To(output_bool), 0); + 
EXPECT_EQ(output_bool, input_bool); + + std::vector<int> input_vector_int = {1, 2, 3, 100}; + PythonObject list_int_obj = PythonObject::From(input_vector_int); + EXPECT_TRUE(list_int_obj.IsList()); + + std::vector<int> output_vector_int; + EXPECT_EQ(list_int_obj.To(output_vector_int), 0); + + size_t size = input_vector_int.size(); + EXPECT_EQ(size, output_vector_int.size()); + + for (size_t i = 0; i < size; ++i) { + EXPECT_EQ(input_vector_int[i], output_vector_int[i]); + } + + std::vector<std::string> input_vector_str = {"a", "bb", "ccc", "dddd"}; + PythonObject list_str_obj = PythonObject::From(input_vector_str); + EXPECT_TRUE(list_str_obj.IsList()); + + std::vector<std::string> output_vector_str; + EXPECT_EQ(list_str_obj.To(output_vector_str), 0); + + size = input_vector_str.size(); + EXPECT_EQ(size, output_vector_str.size()); + + for (size_t i = 0; i < size; ++i) { + EXPECT_EQ(input_vector_str[i], output_vector_str[i]); + } +} + +TEST_F(CPythonUtilsTest, PythonObjectImport) { + PythonObject sys = PythonObject::Import("sys"); + EXPECT_TRUE(sys.IsModule()); + EXPECT_EQ(static_cast<PyObject*>(sys), PyImport_ImportModule("sys")); + EXPECT_FALSE(sys.IsNone()); + PythonObject invalid = PyImport_ImportModule("invalid"); + EXPECT_TRUE(invalid.IsNone()); +} + +TEST_F(CPythonUtilsTest, PythonObjectGetAttr) { + PythonObject sys = PythonObject::Import("sys"); + PythonObject sys_path = sys.Get("path"); + EXPECT_TRUE(sys_path.IsList()); + PythonObject fexit = sys.Get("exit"); + EXPECT_TRUE(fexit.IsCallable()); + PythonObject invalid = sys.Get("invalid"); + EXPECT_TRUE(invalid.IsNone()); + + std::vector<int> input_vector = {1, 2, 3, 100}; + PythonObject list_obj = PythonObject::From(input_vector); + PythonObject append = list_obj.Get("append"); + EXPECT_TRUE(append.IsCallable()); +} + +TEST_F(CPythonUtilsTest, PythonObjectCall) { + PythonObject int_class = PythonObject::Import("builtins").Get("int"); + EXPECT_TRUE(int_class.IsCallable()); + PythonObject int_obj = int_class.Call(); + EXPECT_TRUE(int_obj.IsNumber()); + int result
= -1; + EXPECT_EQ(int_obj.To(result), 0); + EXPECT_EQ(result, 0); + + PythonObject ret = PythonObject::Import("builtins").Call(); + EXPECT_TRUE(ret.IsNone()); +} + +TEST_F(CPythonUtilsTest, PythonObjectType) { + PythonObject none = Py_None; + EXPECT_TRUE(none.IsNone()); + EXPECT_FALSE(none.IsNumber() || none.IsCallable()); + + PythonObject pytrue = Py_True; + EXPECT_TRUE(pytrue.IsBool()); + EXPECT_FALSE(pytrue.IsString() || pytrue.IsCallable()); + + PythonObject builtins = PyImport_ImportModule("builtins"); + EXPECT_TRUE(builtins.IsModule()); + EXPECT_FALSE(builtins.IsList() || builtins.IsCallable()); + + PythonObject int_class = builtins.Get("int"); + EXPECT_TRUE(int_class.IsCallable()); + EXPECT_FALSE(builtins.IsDict()); + + PythonObject dict = builtins.Get("__dict__"); + EXPECT_TRUE(dict.IsDict()); + EXPECT_FALSE(dict.IsNone() || dict.IsCallable()); +} + +TEST_F(CPythonUtilsTest, PythonNumberObject) { + PythonNumberObject o1(PyLong_FromLong(123)); + PythonNumberObject o2(PyFloat_FromDouble(3.14)); + PythonNumberObject o3 = PythonNumberObject::From(321); + PythonNumberObject o4 = PythonNumberObject::From(2.33); + PythonNumberObject o5(PythonObject::From(4.44)); + PythonNumberObject o6(PythonObject::From("1111")); + + int int_v; + EXPECT_EQ(o1.To(int_v), 0); + EXPECT_EQ(int_v, 123); + double double_v; + EXPECT_EQ(o2.To(double_v), 0); + EXPECT_TRUE(std::fabs(double_v - 3.14) < 1e-5); + EXPECT_EQ(o3.To(int_v), 0); + EXPECT_EQ(int_v, 321); + EXPECT_EQ(o4.To(double_v), 0); + EXPECT_TRUE(std::fabs(double_v - 2.33) < 1e-5); + EXPECT_EQ(o5.To(double_v), 0); + EXPECT_TRUE(std::fabs(double_v - 4.44) < 1e-5); + EXPECT_TRUE(o6.IsNone()); +} + +TEST_F(CPythonUtilsTest, PythonStringObject) { + PythonStringObject o1(PyUnicode_FromString("hello")); + PythonStringObject o2 = PythonStringObject::From("OK"); + PythonStringObject o3 = PythonStringObject::From(std::string("banana")); + PythonStringObject o4(PythonObject::From(1)); + + EXPECT_EQ(o1.ToString(), "hello"); + 
EXPECT_EQ(o2.ToString(), "OK"); + EXPECT_EQ(o3.ToString(), "banana"); + EXPECT_TRUE(o4.IsNone()); +} + +TEST_F(CPythonUtilsTest, PythonBoolObject) { + PythonBoolObject o1(Py_True); + PythonBoolObject o2(Py_False); + PythonBoolObject o3(PythonObject::From(true)); + PythonBoolObject o4(PythonObject::From(0)); + + EXPECT_EQ(o1, true); + EXPECT_EQ(o2, false); + EXPECT_EQ(o3, true); + EXPECT_TRUE(o4.IsNone()); +} + +TEST_F(CPythonUtilsTest, PythonListObject) { + PythonListObject empty_list(5); + PythonListObject sys_path(static_cast(PythonObject::Import("sys").Get("path"))); + PythonListObject list1 = PythonListObject::From(std::vector({1, 3, 5, 7})); + PythonListObject list2 = PythonListObject::From(std::vector>({{1, 3, 5, 7}, {2, 4, 6}})); + PythonListObject list3; + + int val; + EXPECT_EQ(empty_list.Size(), 5); + EXPECT_FALSE(sys_path.IsNone()); + EXPECT_TRUE(sys_path.Size() > 0); + EXPECT_TRUE(sys_path.GetItem(0).IsString()); + EXPECT_EQ(list1.Size(), 4); + EXPECT_EQ(list1.GetItem(1).To(val), 0); + EXPECT_EQ(val, 3); + EXPECT_EQ(list1.GetItem(3).ToString(), "7"); + EXPECT_TRUE(list1.GetItem(4).IsNone()); + EXPECT_EQ(list2.Size(), 2); + EXPECT_TRUE(list2.GetItem(0).IsList()); + EXPECT_EQ(list2.GetItem(1).ToString(), "[2, 4, 6]"); + EXPECT_EQ(list3.Size(), 0); + list3.Append(PythonObject::From(1)); + EXPECT_EQ(list3.Size(), 1); + list3.Append(PythonObject::From("2")).Append(PythonObject::From(true)); + EXPECT_EQ(list3.Size(), 3); + EXPECT_EQ(list3.GetItem(1).ToString(), "2"); + list3.SetItem(1, empty_list); + EXPECT_EQ(list3.Size(), 3); + EXPECT_EQ(static_cast(list3.GetItem(1)), static_cast(empty_list)); + list3.Insert(0, sys_path); + EXPECT_EQ(list3.Size(), 4); + EXPECT_EQ(static_cast(list3.GetItem(0)), static_cast(sys_path)); + PythonTupleObject tuple = list3.ToTuple(); + EXPECT_FALSE(tuple.IsNone()); +} + +TEST_F(CPythonUtilsTest, PythonTupleObject) { + PythonTupleObject tuple1; + PythonTupleObject tuple2(PyTuple_New(0)); + PythonTupleObject tuple3 = 
PythonTupleObject::From(std::vector<std::string>({"ab", "cd"})); + PythonTupleObject tuple4 = PythonListObject::From(std::vector<int>({1, 3, 5})).ToTuple(); + + EXPECT_FALSE(tuple1.IsNone()); + EXPECT_EQ(tuple1.Size(), 0); + EXPECT_TRUE(tuple1.GetItem(0).IsNone()); + EXPECT_FALSE(tuple2.IsNone()); + EXPECT_EQ(tuple2.Size(), 0); + EXPECT_EQ(tuple3.Size(), 2); + EXPECT_EQ(tuple3.GetItem(0).ToString(), "ab"); + EXPECT_EQ(tuple4.Size(), 3); + EXPECT_EQ(tuple4.GetItem(0).ToString(), "1"); +} + +TEST_F(CPythonUtilsTest, PythonDictObject) { + PythonDictObject dict1; + PythonDictObject dict2(PyDict_New()); + PythonDictObject dict3 = PythonDictObject::From(std::map<int, std::string>({{1, "a"}, {2, "b"}})); + + EXPECT_FALSE(dict1.IsNone()); + EXPECT_FALSE(dict2.IsNone()); + EXPECT_TRUE(dict2.GetItem("none").IsNone()); + EXPECT_FALSE(dict3.IsNone()); + EXPECT_EQ(dict3.GetItem(1).ToString(), "a"); + EXPECT_EQ(dict3.GetItem(2).ToString(), "b"); + EXPECT_TRUE(dict3.GetItem(3).IsNone()); + dict3.Add(std::string("apple"), std::string("banana")); + EXPECT_EQ(dict3.GetItem(std::string("apple")).ToString(), "banana"); + dict3.Delete(std::string("apple")); + EXPECT_TRUE(dict3.GetItem(std::string("apple")).IsNone()); +} + +} diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_data_utils.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_data_utils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..11442f12bfea9179ecd4e2e357bcf70b4212ab84 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/test_data_utils.cpp @@ -0,0 +1,89 @@ +#include +#include +#include +#include +#include "utils/DataUtils.hpp" + +using namespace MindStudioDebugger; +using namespace MindStudioDebugger::DataUtils; + +namespace MsProbeTest { + +TEST(DataUtilsTest, TestUnpackUint64Value) { + uint64_t data_le = 0x0102030405060708; + uint64_t result = UnpackUint64Value_Le(&data_le); +#if __BYTE_ORDER == __LITTLE_ENDIAN + EXPECT_EQ(result, 0x0102030405060708); +#else + EXPECT_EQ(result, 0x0807060504030201); +#endif + uint64_t
data_be = 0x0102030405060708; + result = UnpackUint64Value_Be(&data_be); +#if __BYTE_ORDER == __LITTLE_ENDIAN + EXPECT_EQ(result, 0x0807060504030201); +#else + EXPECT_EQ(result, 0x0102030405060708); +#endif +} + +TEST(DataUtilsTest, TestDataTrans) { + size_t value = 123456; + int64_t result = SizeToS64(value); + EXPECT_EQ(result, 123456); + bool exception = false; + try { + int64_t result = SizeToS64(static_cast(INT64_MAX) + 1ULL); + } catch (const std::runtime_error& e) { + exception = true; + } + EXPECT_TRUE(exception); + uint64_t num = 0x123456789ABCDEF0; + std::string s = U64ToHexString(num); + EXPECT_EQ(s, "0x123456789ABCDEF0"); +} + +TEST(DataUtilsTest, TestBFloat16) { + float fp32 = 3.14f; + BFloat16 bf16(fp32); +#define BF16_EQ(a, b) (-0.01f < static_cast((a) - (b)) && static_cast((a) - (b)) < 0.01f) + EXPECT_TRUE(BF16_EQ(fp32, static_cast(bf16))); + EXPECT_TRUE(BF16_EQ(fp32 + fp32, static_cast(bf16 + bf16))); + EXPECT_TRUE(BF16_EQ(fp32 + fp32, bf16 + fp32)); + EXPECT_TRUE(BF16_EQ(fp32 + fp32, bf16 + fp32)); +#undef BF16_EQ +} + +TEST(DataUtilsTest, TestDType) { + EXPECT_EQ(SizeOfDType(DataType::DT_FLOAT), 4); + EXPECT_EQ(SizeOfDType(DataType::DT_DOUBLE), 8); + EXPECT_EQ(SizeOfDType(DataType::DT_INT64), 8); + EXPECT_EQ(SizeOfDType(DataType::DT_UINT8), 1); + EXPECT_EQ(SizeOfDType(DataType::DT_FLOAT16), 2); + EXPECT_EQ(SizeOfDType(static_cast(99)), 0); + EXPECT_EQ(GetDTypeString(DataType::DT_BOOL), "BOOL"); + EXPECT_EQ(GetDTypeString(DataType::DT_INT8), "INT8"); + EXPECT_EQ(GetDTypeString(DataType::DT_BF16), "BF16"); + EXPECT_EQ(GetDTypeString(DataType::DT_UINT64), "UINT64"); + EXPECT_EQ(GetDTypeString(DataType::DT_COMPLEX64), "COMPLEX64"); + EXPECT_EQ(GetDTypeString(static_cast(99)), "UNKNOWN"); +} + +TEST(DataUtilsTest, TestGetFormatString) { + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_NCHW), "NCHW"); + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_NHWC), "NHWC"); + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_FRACTAL_Z), "FRACTAL_Z"); + 
EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_C1HWNC0), "C1HWNC0"); + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_HWCN), "HWCN"); + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_C1HWNCoC0), "C1HWNCoC0"); + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_DHWNC), "DHWNC"); + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_NCL), "NCL"); + EXPECT_EQ(GetFormatString(TensorFormat::FORMAT_MAX), "UNKNOWN"); +} + +TEST(DataUtilsTest, GetShapeString) { + EXPECT_EQ(GetShapeString({2, 3, 5}), "(2,3,5)"); + EXPECT_EQ(GetShapeString({}), "()"); + EXPECT_EQ(GetShapeString({3}), "(3)"); +} + +} diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_environ.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_environ.cpp new file mode 100644 index 0000000000000000000000000000000000000000..94c830227ae58637642a189f36ade78de9a2a75c --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/test_environ.cpp @@ -0,0 +1,28 @@ +#include +#include + +#include "include/test_utils.hpp" +#include "base/DebuggerConfig.hpp" +#include "base/Environment.hpp" + +using namespace MindStudioDebugger; +using namespace MindStudioDebugger::Environment; + +namespace MsProbeTest { + +TEST(EnvironmentTest, TestRankId) { + DebuggerConfig::GetInstance().Reset(); + EXPECT_EQ(GetRankID(), -1); + DebuggerConfig::GetInstance().LoadConfig("MindSpore", CONFIG_EXAMPLE); + EXPECT_EQ(GetRankID(), -1); + setenv("RANK_ID", "xxxx", 1); + EXPECT_EQ(GetRankID(), -1); + setenv("RANK_ID", "-5", 1); + EXPECT_EQ(GetRankID(), -1); + setenv("RANK_ID", "2", 1); + EXPECT_EQ(GetRankID(), 2); + + DebuggerConfig::GetInstance().Reset(); +} + +} diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_file_operation.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_file_operation.cpp new file mode 100644 index 0000000000000000000000000000000000000000..2886126e9f568fba6b8ce3eabd752653d4493108 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/test_file_operation.cpp @@ -0,0 +1,47 @@ +#include +#include +#include +#include + 
+#include "test_utils.hpp"
+#include "utils/DataUtils.hpp"
+#include "utils/FileOperation.hpp"
+
+using namespace MindStudioDebugger;
+using namespace MindStudioDebugger::FileOperation;
+
+namespace MsProbeTest {
+
+TEST(FileOperationTest, TestDumpJson) {
+ std::string testPath = "./test.json";
+ nlohmann::json testJson = {{"key", "value"}};
+ auto result = DumpJson(testPath, testJson);
+ EXPECT_EQ(result, DebuggerErrno::OK);
+
+ std::ifstream ifs(testPath);
+ std::string fileContent((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
+ ifs.close();
+ EXPECT_EQ(fileContent, testJson.dump());
+ remove(testPath.c_str());
+}
+
+TEST(FileOperationTest, TestDumpNpy) {
+ std::string testPath = "./test.npy";
+ std::vector<uint8_t> int8Data = {0, 1, 2, 3, 4, 5};
+ auto result = DumpNpy(testPath, int8Data.data(), int8Data.size() * sizeof(uint8_t), DataUtils::DataType::DT_UINT8,
+ {2, 3});
+ EXPECT_EQ(result, DebuggerErrno::OK);
+ std::string content = TEST_ExecShellCommand("python -c \'import numpy; print(numpy.load(\"./test.npy\"))\'");
+ EXPECT_EQ(content, "[[0 1 2]\n [3 4 5]]\n");
+ remove(testPath.c_str());
+
+ std::vector<float> fp32Data = {0.1f, 1.2f, 2.3f, 3.4f, 4.5f, 5.6f, 6.7f, 7.8f};
+ result = DumpNpy(testPath, reinterpret_cast<const uint8_t*>(fp32Data.data()), fp32Data.size() * sizeof(float),
+ DataUtils::DataType::DT_FLOAT, {2, 2, 2});
+ EXPECT_EQ(result, DebuggerErrno::OK);
+ content = TEST_ExecShellCommand("python -c \'import numpy; print(numpy.load(\"./test.npy\"))\'");
+ EXPECT_EQ(content, "[[[0.1 1.2]\n [2.3 3.4]]\n\n [[4.5 5.6]\n [6.7 7.8]]]\n");
+ remove(testPath.c_str());
+}
+
+}
diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_file_utils.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_file_utils.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..03449f761be0c8548021218581f4cbff12d4e07d
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/cpp/test_file_utils.cpp
@@ -0,0 +1,391 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "test_utils.hpp"
+#include "utils/FileUtils.hpp"
+
+using namespace MindStudioDebugger;
+using namespace MindStudioDebugger::FileUtils;
+
+namespace MsProbeTest {
+
+class FileUtilsTest : public ::testing::Test {
+protected:
+ void SetUp() override {
+ // Create the test directories
+ ASSERT_EQ(mkdir(testDir.c_str(), 0750), 0);
+ ASSERT_EQ(mkdir(testDirSub.c_str(), 0750), 0);
+ // Create a regular file
+ std::ofstream file(testRegularFile);
+ file.close();
+ // Create a symbolic link
+ ASSERT_EQ(symlink(GetAbsPath(testRegularFile).c_str(), testLink.c_str()), 0);
+ ASSERT_EQ(mkfifo(testFifo.c_str(), 0640), 0);
+ }
+
+ void TearDown() override {
+ // Remove the test directories and files
+ TEST_ExecShellCommand("rm -rf " + testDir);
+ }
+
+ const std::string testDir = "./FileUtilsTest";
+ const std::string testDirSub = testDir + "/subdir";
+ const std::string testRegularFile = testDir + "/RegularFile.txt";
+ const std::string testNotExistsFile = testDir + "/NotExistsFile.txt";
+ const std::string testLink = testDir + "/testlink";
+ const std::string testFifo = testDir + "/testfifo";
+};
+
+TEST_F(FileUtilsTest, TestIsPathExist)
+{
+ EXPECT_TRUE(IsPathExist("/"));
+ EXPECT_TRUE(IsPathExist("."));
+ EXPECT_TRUE(IsPathExist(testRegularFile));
+ EXPECT_FALSE(IsPathExist(testNotExistsFile));
+}
+
+TEST_F(FileUtilsTest, TestGetAbsPath)
+{
+ std::string pwd = trim(TEST_ExecShellCommand("pwd"));
+ EXPECT_EQ(pwd, GetAbsPath("."));
+ EXPECT_EQ(pwd + "/testpath", GetAbsPath("./testpath"));
+ EXPECT_EQ(pwd + "/testpath", GetAbsPath("./testpath/"));
+ EXPECT_EQ(pwd + "/testpath", GetAbsPath("./subdir/../testpath"));
+ EXPECT_EQ(pwd + "/testpath", GetAbsPath("subdir/subdir/.././../testpath"));
+ EXPECT_EQ(pwd + "/subdir/testpath", GetAbsPath("./subdir/.././/subdir/testpath"));
+}
+
+TEST_F(FileUtilsTest, TestIsDir)
+{
+ EXPECT_TRUE(IsDir("/"));
+ EXPECT_TRUE(IsDir("./"));
+ EXPECT_TRUE(IsDir(testDirSub));
+ EXPECT_FALSE(IsDir(testRegularFile));
+ EXPECT_FALSE(IsDir(testFifo));
+}
+
+TEST_F(FileUtilsTest, TestIsRegularFile)
+{
+
EXPECT_TRUE(IsRegularFile(testRegularFile)); + EXPECT_FALSE(IsRegularFile(testDirSub)); + EXPECT_TRUE(IsRegularFile(testLink)); + EXPECT_FALSE(IsRegularFile(testFifo)); + EXPECT_FALSE(IsRegularFile(testNotExistsFile)); +} + +TEST_F(FileUtilsTest, TestIsFileSymbolLink) +{ + EXPECT_TRUE(IsFileSymbolLink(testLink)); + EXPECT_FALSE(IsFileSymbolLink(testDirSub)); + EXPECT_FALSE(IsFileSymbolLink(testNotExistsFile)); + EXPECT_FALSE(IsFileSymbolLink(testRegularFile)); + EXPECT_FALSE(IsFileSymbolLink(testFifo)); +} + +TEST_F(FileUtilsTest, TestIsPathCharactersValid) +{ + std::string validPath = "/tmp/FileUtilsTest/testfile.txt"; + std::string invalidPath1 = "/tmp/FileUtilsTest/<>:|?*\""; + std::string invalidPath2 = " /tmp/FileUtilsTest/testfile.txt"; + EXPECT_TRUE(IsPathCharactersValid("123456789")); + EXPECT_TRUE(IsPathCharactersValid(validPath)); + EXPECT_FALSE(IsPathCharactersValid("")); + EXPECT_FALSE(IsPathCharactersValid(invalidPath1)); + EXPECT_FALSE(IsPathCharactersValid(invalidPath2)); +} + +TEST_F(FileUtilsTest, TestIsFileReadable) +{ + TEST_ExecShellCommand("chmod -r " + testRegularFile); + EXPECT_FALSE(IsFileReadable(testRegularFile)); + TEST_ExecShellCommand("chmod +r " + testRegularFile); + EXPECT_TRUE(IsFileReadable(testRegularFile)); + TEST_ExecShellCommand("chmod -r " + testDirSub); + EXPECT_FALSE(IsFileReadable(testDirSub)); + TEST_ExecShellCommand("chmod +r " + testDirSub); + EXPECT_TRUE(IsFileReadable(testDirSub)); +} + +TEST_F(FileUtilsTest, TestIsFileWritable) +{ + TEST_ExecShellCommand("chmod -w " + testRegularFile); + EXPECT_FALSE(IsFileWritable(testRegularFile)); + TEST_ExecShellCommand("chmod +w " + testRegularFile); + EXPECT_TRUE(IsFileWritable(testRegularFile)); + TEST_ExecShellCommand("chmod -w " + testDirSub); + EXPECT_FALSE(IsFileWritable(testDirSub)); + TEST_ExecShellCommand("chmod +w " + testDirSub); + EXPECT_TRUE(IsFileWritable(testDirSub)); +} + +TEST_F(FileUtilsTest, TestIsFileExecutable) +{ + TEST_ExecShellCommand("chmod -x " + 
testRegularFile);
+ EXPECT_FALSE(IsFileExecutable(testRegularFile));
+ TEST_ExecShellCommand("chmod +x " + testRegularFile);
+ EXPECT_TRUE(IsFileExecutable(testRegularFile));
+ TEST_ExecShellCommand("chmod -x " + testDirSub);
+ EXPECT_FALSE(IsFileExecutable(testDirSub));
+ TEST_ExecShellCommand("chmod +x " + testDirSub);
+ EXPECT_TRUE(IsFileExecutable(testDirSub));
+}
+
+TEST_F(FileUtilsTest, TestIsDirReadable)
+{
+ EXPECT_TRUE(IsDirReadable("."));
+ EXPECT_TRUE(IsDirReadable(testDirSub));
+ TEST_ExecShellCommand("chmod 100 " + testDirSub);
+ EXPECT_FALSE(IsDirReadable(testDirSub));
+ TEST_ExecShellCommand("chmod 400 " + testDirSub);
+ EXPECT_FALSE(IsDirReadable(testDirSub));
+ TEST_ExecShellCommand("chmod 500 " + testDirSub);
+ EXPECT_TRUE(IsDirReadable(testDirSub));
+}
+
+TEST_F(FileUtilsTest, TestGetParentDir)
+{
+ EXPECT_EQ("/tmp/FileUtilsTest", GetParentDir("/tmp/FileUtilsTest/dir"));
+ EXPECT_EQ("/tmp/FileUtilsTest", GetParentDir("/tmp/FileUtilsTest/"));
+ EXPECT_EQ("./FileUtilsTest", GetParentDir("./FileUtilsTest/testfile.txt"));
+ EXPECT_EQ(".", GetParentDir("testfile.txt"));
+ EXPECT_EQ(".", GetParentDir(""));
+}
+
+TEST_F(FileUtilsTest, TestGetFileName)
+{
+ EXPECT_EQ("dir", GetFileName("/tmp/FileUtilsTest/dir"));
+ EXPECT_EQ("", GetFileName("/tmp/FileUtilsTest/"));
+ EXPECT_EQ("testfile.txt", GetFileName("./FileUtilsTest/testfile.txt"));
+ EXPECT_EQ("testfile.txt", GetFileName("testfile.txt"));
+ EXPECT_EQ("", GetFileName(""));
+}
+
+TEST_F(FileUtilsTest, TestGetFileBaseName)
+{
+ EXPECT_EQ("dir", GetFileBaseName("/tmp/FileUtilsTest/dir"));
+ EXPECT_EQ("", GetFileBaseName("/tmp/FileUtilsTest/"));
+ EXPECT_EQ("testfile", GetFileBaseName("./FileUtilsTest/testfile.txt"));
+ EXPECT_EQ("testfile", GetFileBaseName("testfile.txt"));
+ EXPECT_EQ("testfile", GetFileBaseName("testfile"));
+}
+
+TEST_F(FileUtilsTest, TestGetFileSuffix)
+{
+ EXPECT_EQ("", GetFileSuffix("/tmp/FileUtilsTest/dir"));
+ EXPECT_EQ("", GetFileSuffix("/tmp/FileUtilsTest/"));
+ EXPECT_EQ("txt",
GetFileSuffix("./FileUtilsTest/testfile.txt")); + EXPECT_EQ("txt", GetFileSuffix("testfile.txt")); + EXPECT_EQ("", GetFileSuffix("testfile")); + EXPECT_EQ("", GetFileSuffix("testfile.")); +} + +TEST_F(FileUtilsTest, TestCheckFileRWX) +{ + TEST_ExecShellCommand("chmod 640 " + testRegularFile); + EXPECT_TRUE(CheckFileRWX(testRegularFile, "rw")); + EXPECT_FALSE(CheckFileRWX(testRegularFile, "rx")); + TEST_ExecShellCommand("chmod 750 " + testDirSub); + EXPECT_TRUE(CheckFileRWX(testDirSub, "rwx")); +} + +TEST_F(FileUtilsTest, TestIsPathLengthLegal) +{ + std::string maxFile = std::string(FILE_NAME_LENGTH_MAX, 'a'); + std::string longFile = std::string(FILE_NAME_LENGTH_MAX + 1, 'a'); + std::string maxPath(FULL_PATH_LENGTH_MAX, '/'); + std::string longPath = maxPath + "/"; + EXPECT_TRUE(IsPathLengthLegal(maxFile)); + EXPECT_TRUE(IsPathLengthLegal(maxPath)); + EXPECT_FALSE(IsPathLengthLegal(longFile)); + EXPECT_FALSE(IsPathLengthLegal(longPath)); + EXPECT_FALSE(IsPathLengthLegal("")); +} + +TEST_F(FileUtilsTest, TestIsPathDepthValid) +{ + EXPECT_TRUE(IsPathDepthValid("")); + EXPECT_TRUE(IsPathDepthValid(std::string(PATH_DEPTH_MAX, pathSeparator))); + EXPECT_FALSE(IsPathDepthValid(std::string(PATH_DEPTH_MAX + 1, pathSeparator))); +} + +TEST_F(FileUtilsTest, TestIsFileOwner) +{ + EXPECT_TRUE(IsFileOwner(testRegularFile)); + EXPECT_TRUE(IsFileOwner(testDirSub)); + EXPECT_FALSE(IsFileOwner("/")); +} + +TEST_F(FileUtilsTest, TestDeleteFile) +{ + ASSERT_TRUE(IsPathExist(testRegularFile)); + EXPECT_EQ(DeleteFile(testLink), DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK); + EXPECT_EQ(DeleteFile(testRegularFile), DebuggerErrno::OK); + EXPECT_FALSE(IsPathExist(testRegularFile)); + EXPECT_EQ(DeleteFile(testRegularFile), DebuggerErrno::OK); + EXPECT_EQ(DeleteFile(testFifo), DebuggerErrno::OK); + EXPECT_EQ(DeleteFile(testDirSub), DebuggerErrno::OK); + EXPECT_EQ(DeleteFile(testDir), DebuggerErrno::ERROR_SYSCALL_FAILED); + EXPECT_EQ(DeleteFile(testLink), DebuggerErrno::OK); +} + 
+TEST_F(FileUtilsTest, TestDeleteDir) +{ + ASSERT_TRUE(IsPathExist(testDirSub)); + EXPECT_EQ(DeleteDir(testDirSub), DebuggerErrno::OK); + EXPECT_FALSE(IsPathExist(testDirSub)); + EXPECT_EQ(DeleteDir(testDirSub), DebuggerErrno::OK); + std::string subSubDir = testDirSub + "/subdir"; + std::string subSubFile = testDirSub + "/subfile"; + TEST_ExecShellCommand("mkdir " + testDirSub); + TEST_ExecShellCommand("mkdir " + subSubDir); + TEST_ExecShellCommand("touch " + subSubFile); + EXPECT_EQ(DeleteDir(testLink), DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK); + EXPECT_EQ(DeleteDir(testRegularFile), DebuggerErrno::ERROR_SYSCALL_FAILED); + EXPECT_EQ(DeleteDir(testDirSub), DebuggerErrno::ERROR_SYSCALL_FAILED); + EXPECT_EQ(DeleteDir(testDirSub, true), DebuggerErrno::OK); + EXPECT_FALSE(IsPathExist(testDirSub)); +} + +TEST_F(FileUtilsTest, TestCreateDir) +{ + ASSERT_TRUE(IsPathExist(testDirSub)); + EXPECT_EQ(CreateDir(testDirSub), DebuggerErrno::OK); + TEST_ExecShellCommand("rm -rf " + testDirSub); + ASSERT_FALSE(IsPathExist(testDirSub)); + EXPECT_EQ(CreateDir(testDirSub), DebuggerErrno::OK); + EXPECT_TRUE(IsPathExist(testDirSub)); + TEST_ExecShellCommand("rm -rf " + testDirSub); + std::string subSubDir = testDirSub + "/subdir"; + EXPECT_EQ(CreateDir(subSubDir), DebuggerErrno::ERROR_DIR_NOT_EXISTS); + EXPECT_EQ(CreateDir(subSubDir, true), DebuggerErrno::OK); + EXPECT_TRUE(IsPathExist(subSubDir)); + EXPECT_TRUE(CheckFileRWX(subSubDir, "rwx")); + TEST_ExecShellCommand("rm -rf " + testDirSub); + EXPECT_EQ(CreateDir(subSubDir, true, 0750), DebuggerErrno::OK); + EXPECT_TRUE(CheckFileRWX(testDirSub, "rwx")); + EXPECT_TRUE(CheckFileRWX(subSubDir, "rwx")); +} + +TEST_F(FileUtilsTest, TestChmod) +{ + EXPECT_EQ(Chmod(testNotExistsFile, 0640), DebuggerErrno::ERROR_FILE_NOT_EXISTS); + EXPECT_EQ(Chmod(testRegularFile, 0440), DebuggerErrno::OK); + EXPECT_FALSE(IsFileWritable(testRegularFile)); + EXPECT_EQ(Chmod(testDirSub, 0550), DebuggerErrno::OK); + EXPECT_FALSE(IsFileWritable(testDirSub)); + 
EXPECT_EQ(Chmod(testRegularFile, 0640), DebuggerErrno::OK);
+ EXPECT_TRUE(IsFileWritable(testRegularFile));
+ EXPECT_EQ(Chmod(testLink, 0640), DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK);
+ EXPECT_EQ(Chmod("", 0640), DebuggerErrno::ERROR_FILE_NOT_EXISTS);
+ EXPECT_EQ(Chmod("/", 0750), DebuggerErrno::ERROR_SYSCALL_FAILED);
+}
+
+TEST_F(FileUtilsTest, TestGetFileSize)
+{
+ size_t size;
+ EXPECT_EQ(GetFileSize(testRegularFile, size), DebuggerErrno::OK);
+ EXPECT_EQ(size, 0);
+ TEST_ExecShellCommand("echo \"123456789\" > " + testRegularFile);
+ EXPECT_EQ(GetFileSize(testRegularFile, size), DebuggerErrno::OK);
+ EXPECT_EQ(size, 10);
+ EXPECT_EQ(GetFileSize(testNotExistsFile, size), DebuggerErrno::ERROR_FILE_NOT_EXISTS);
+ EXPECT_EQ(GetFileSize(testDirSub, size), DebuggerErrno::ERROR_ILLEGAL_FILE_TYPE);
+ EXPECT_EQ(GetFileSize(testFifo, size), DebuggerErrno::ERROR_ILLEGAL_FILE_TYPE);
+}
+
+TEST_F(FileUtilsTest, TestOpenFileRead)
+{
+ std::ifstream ifs;
+ EXPECT_EQ(OpenFile(testNotExistsFile, ifs), DebuggerErrno::ERROR_FILE_NOT_EXISTS);
+ TEST_ExecShellCommand("chmod -r " + testRegularFile);
+ EXPECT_EQ(OpenFile(testRegularFile, ifs), DebuggerErrno::ERROR_PERMISSION_DENINED);
+ TEST_ExecShellCommand("chmod +r " + testRegularFile);
+ EXPECT_EQ(OpenFile(testLink, ifs), DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK);
+ TEST_ExecShellCommand("echo \"123456789\" > " + testRegularFile);
+ ASSERT_EQ(OpenFile(testRegularFile, ifs), DebuggerErrno::OK);
+ ASSERT_TRUE(ifs.is_open());
+ std::string content((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
+ EXPECT_EQ(content, "123456789\n");
+ ifs.close();
+}
+
+TEST_F(FileUtilsTest, TestOpenFileWrite)
+{
+ std::ofstream ofs;
+ ASSERT_EQ(OpenFile(testRegularFile, ofs), DebuggerErrno::OK);
+ ofs << "123456789";
+ ofs.close();
+ std::ifstream ifs(testRegularFile, std::ios::in);
+ std::string content((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
+ ifs.close();
+ EXPECT_EQ(content, "123456789");
+}
+
+TEST_F(FileUtilsTest, TestCheckFileSuffixAndSize) +{ + EXPECT_EQ(CheckFileSuffixAndSize(testRegularFile, FileType::COMMON), DebuggerErrno::OK); + EXPECT_EQ(CheckFileSuffixAndSize(testRegularFile, FileType::JSON), DebuggerErrno::ERROR_UNKNOWN_FILE_SUFFIX); + std::string sparseKpl = testDir + "/test.kpl"; + std::string sparseNpy = testDir + "/test.npy"; + std::string sparseJson = testDir + "/test.json"; + std::string sparsePt = testDir + "/test.pt"; + std::string sparseCsv = testDir + "/test.csv"; + std::string sparseYaml = testDir + "/test.yaml"; + TEST_ExecShellCommand("truncate -s 1G " + sparseCsv); + EXPECT_EQ(CheckFileSuffixAndSize(sparseCsv, FileType::CSV), DebuggerErrno::OK); + TEST_ExecShellCommand("rm " + sparseCsv); + TEST_ExecShellCommand("truncate -s 1025M " + sparseCsv); + EXPECT_EQ(CheckFileSuffixAndSize(sparseCsv, FileType::CSV), DebuggerErrno::ERROR_FILE_TOO_LARGE); + TEST_ExecShellCommand("truncate -s 1025M " + sparseKpl); + EXPECT_EQ(CheckFileSuffixAndSize(sparseKpl, FileType::PKL), DebuggerErrno::ERROR_FILE_TOO_LARGE); + TEST_ExecShellCommand("truncate -s 11G " + sparseNpy); + EXPECT_EQ(CheckFileSuffixAndSize(sparseNpy, FileType::NUMPY), DebuggerErrno::ERROR_FILE_TOO_LARGE); + TEST_ExecShellCommand("truncate -s 1025M " + sparseJson); + EXPECT_EQ(CheckFileSuffixAndSize(sparseJson, FileType::JSON), DebuggerErrno::ERROR_FILE_TOO_LARGE); + TEST_ExecShellCommand("truncate -s 11G " + sparsePt); + EXPECT_EQ(CheckFileSuffixAndSize(sparsePt, FileType::PT), DebuggerErrno::ERROR_FILE_TOO_LARGE); + TEST_ExecShellCommand("truncate -s 10241K " + sparseYaml); + EXPECT_EQ(CheckFileSuffixAndSize(sparseYaml, FileType::YAML), DebuggerErrno::ERROR_FILE_TOO_LARGE); +} + +TEST_F(FileUtilsTest, TestCheckDirCommon) +{ + EXPECT_EQ(CheckDirCommon(""), DebuggerErrno::ERROR_CANNOT_PARSE_PATH); + EXPECT_EQ(CheckDirCommon(testNotExistsFile), DebuggerErrno::ERROR_FILE_NOT_EXISTS); + EXPECT_EQ(CheckDirCommon(testRegularFile), DebuggerErrno::ERROR_ILLEGAL_FILE_TYPE); + 
std::string linkdir = testDir + "/linkdir"; + TEST_ExecShellCommand("ln -s " + GetAbsPath(testDirSub) + " " + linkdir); + EXPECT_EQ(CheckDirCommon(linkdir), DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK); + EXPECT_EQ(CheckDirCommon(testDirSub), DebuggerErrno::OK); + TEST_ExecShellCommand("chmod -r " + testDirSub); + EXPECT_EQ(CheckDirCommon(testDirSub), DebuggerErrno::ERROR_PERMISSION_DENINED); +} + +TEST_F(FileUtilsTest, TestCheckFileBeforeRead) +{ + EXPECT_EQ(CheckFileBeforeRead(""), DebuggerErrno::ERROR_CANNOT_PARSE_PATH); + EXPECT_EQ(CheckFileBeforeRead(testNotExistsFile), DebuggerErrno::ERROR_FILE_NOT_EXISTS); + EXPECT_EQ(CheckFileBeforeRead(testLink), DebuggerErrno::ERROR_NOT_ALLOW_SOFTLINK); + EXPECT_EQ(CheckFileBeforeRead(testRegularFile), DebuggerErrno::OK); + TEST_ExecShellCommand("chmod -r " + testRegularFile); + EXPECT_EQ(CheckFileBeforeRead(testRegularFile), DebuggerErrno::ERROR_PERMISSION_DENINED); +} + +TEST_F(FileUtilsTest, TestCheckFileBeforeCreateOrWrite) +{ + EXPECT_EQ(CheckFileBeforeCreateOrWrite(""), DebuggerErrno::ERROR_CANNOT_PARSE_PATH); + EXPECT_EQ(CheckFileBeforeCreateOrWrite(testNotExistsFile), DebuggerErrno::OK); + EXPECT_EQ(CheckFileBeforeCreateOrWrite(testRegularFile), DebuggerErrno::ERROR_FILE_ALREADY_EXISTS); + EXPECT_EQ(CheckFileBeforeCreateOrWrite(testRegularFile, true), DebuggerErrno::OK); + TEST_ExecShellCommand("chmod -w " + testRegularFile); + EXPECT_EQ(CheckFileBeforeCreateOrWrite(testRegularFile, true), DebuggerErrno::ERROR_PERMISSION_DENINED); + EXPECT_EQ(CheckFileBeforeCreateOrWrite("/", true), DebuggerErrno::ERROR_PERMISSION_DENINED); +} + +} diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_log.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_log.cpp new file mode 100644 index 0000000000000000000000000000000000000000..254b54359a50166e1d893c5b936eb220ee0b2a73 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/test_log.cpp @@ -0,0 +1,50 @@ +#include + +#include "gtest/gtest.h" +#include "test_utils.hpp" +#include 
"base/ErrorInfos.hpp"
+
+using namespace MindStudioDebugger;
+
+namespace MsProbeTest {
+
+TEST(ErrorInfoTest, TestLog)
+{
+ std::string testDir = "./testdir";
+ ASSERT_EQ(mkdir(testDir.c_str(), 0750), 0);
+ ErrorInfosManager::SetLogPath(testDir + "/logfile1.log");
+ LOG_CRITICAL(DebuggerErrno::ERROR_DIR_NOT_EXISTS, "Critical log content.");
+ std::ifstream ifs1(testDir + "/logfile1.log", std::ios::in);
+ ASSERT_TRUE(ifs1.is_open());
+ std::string content1((std::istreambuf_iterator<char>(ifs1)), std::istreambuf_iterator<char>());
+ ifs1.close();
+ EXPECT_EQ(content1, "[CRITICAL][DIR_NOT_EXISTS]Critical log content.\n");
+ LOG_ERROR(DebuggerErrno::ERROR_INVALID_OPERATION, "Error log content.");
+ ifs1.open(testDir + "/logfile1.log");
+ ASSERT_TRUE(ifs1.is_open());
+ std::string content2((std::istreambuf_iterator<char>(ifs1)), std::istreambuf_iterator<char>());
+ EXPECT_EQ(content2,
+ "[CRITICAL][DIR_NOT_EXISTS]Critical log content.\n[ERROR][INVALID_OPERATION]Error log content.\n");
+
+ ErrorInfosManager::SetLogPath(testDir + "/logfile2.log");
+ LOG_WARNING(DebuggerErrno::ERROR_SYSCALL_FAILED, "Warning log content.");
+ std::ifstream ifs2(testDir + "/logfile2.log", std::ios::in);
+ ASSERT_TRUE(ifs2.is_open());
+ std::string content3((std::istreambuf_iterator<char>(ifs2)), std::istreambuf_iterator<char>());
+ ifs2.close();
+ EXPECT_EQ(content3, "[WARNING][SYSCALL_FAILED]Warning log content.\n");
+
+ ErrorInfosManager::SetLogPath(testDir + "/logfile3.log");
+ LOG_INFO("Info log content.");
+ LOG_DEBUG("Debug log content.");
+ std::ifstream ifs3(testDir + "/logfile3.log", std::ios::in);
+ ASSERT_TRUE(ifs3.is_open());
+ std::string content4((std::istreambuf_iterator<char>(ifs3)), std::istreambuf_iterator<char>());
+ ifs3.close();
+ EXPECT_EQ(content4, "[INFO]Info log content.\n");
+ TEST_ExecShellCommand("rm -rf " + testDir);
+
+ ErrorInfosManager::SetLogPath("");
+}
+
+}
diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_main.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_main.cpp
new file mode 100644
index 
0000000000000000000000000000000000000000..08fb83905205f05c0710aec3d0bdaed3c8bdd54f
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/cpp/test_main.cpp
@@ -0,0 +1,7 @@
+#include "gtest/gtest.h"
+
+int main(int argc, char** argv)
+{
+ ::testing::InitGoogleTest(&argc, argv);
+ return RUN_ALL_TESTS();
+}
diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_math_utils.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_math_utils.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..3b23e9c879c431ef7457990ba774aa0dc1321b45
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/cpp/test_math_utils.cpp
@@ -0,0 +1,100 @@
+#include
+#include
+#include
+#include
+#include
+#include "utils/MathUtils.hpp"
+
+using namespace MindStudioDebugger;
+using namespace MindStudioDebugger::MathUtils;
+
+namespace MsProbeTest {
+
+TEST(MathUtilsTest, TestRandom)
+{
+ for (uint32_t i = 0; i < 5; i++) {
+ float result = Random();
+ EXPECT_GE(result, 0.0f);
+ EXPECT_LE(result, 1.0f);
+ }
+ for (uint32_t i = 0; i < 5; i++) {
+ float floor = static_cast<float>(i * 5) - 10.0f;
+ float ceil = static_cast<float>(i * 10);
+ float result = Random(floor, ceil);
+ EXPECT_GE(result, floor);
+ EXPECT_LE(result, ceil);
+ }
+}
+
+TEST(MathUtilsTest, TestRandomInt)
+{
+ for (uint32_t i = 0; i < 5; i++) {
+ int32_t floor = static_cast<int32_t>(i * 5) - 10;
+ int32_t ceil = static_cast<int32_t>(i * 10);
+ int32_t result = RandomInt(floor, ceil);
+ EXPECT_GE(result, floor);
+ EXPECT_LT(result, ceil);
+ }
+}
+
+TEST(MathUtilsTest, TestRandomString)
+{
+ uint32_t len = 16;
+ std::string result = RandomString(len);
+ EXPECT_EQ(result.length(), len);
+ for (char c : result) {
+ EXPECT_TRUE((c >= ' ' && c <= '~'));
+ }
+
+ result = RandomString(len, 'a', 'f');
+ EXPECT_EQ(result.length(), len);
+ for (char c : result) {
+ EXPECT_TRUE(c >= 'a' && c <= 'f');
+ }
+}
+
+TEST(MathUtilsTest, TestCalculateMD5)
+{
+ const uint8_t data[] = "Hello, world!";
+ std::string result = CalculateMD5(data, sizeof(data) - 1);
+
EXPECT_EQ(result, "6cd3556deb0da54bca060b4c39479839"); +} + +TEST(MathUtilsTest, TestGcd) +{ + EXPECT_EQ(Gcd(10, 5), 5); + EXPECT_EQ(Gcd(15, 5), 5); + EXPECT_EQ(Gcd(0, 5), 0); + EXPECT_EQ(Gcd(5, 0), 0); + EXPECT_EQ(Gcd(0, 0), 0); + EXPECT_EQ(Gcd(1, 1), 1); +} + +TEST(MathUtilsTest, TestLcm) +{ + EXPECT_EQ(Lcm(10, 5), 10); + EXPECT_EQ(Lcm(15, 5), 15); + EXPECT_EQ(Lcm(0, 5), 0); + EXPECT_EQ(Lcm(5, 0), 0); + EXPECT_EQ(Lcm(0, 0), 0); + EXPECT_EQ(Lcm(1, 1), 1); +} + +TEST(MathUtilsTest, TestDivCeil) +{ + EXPECT_EQ(DivCeil(10, 5), 2); + EXPECT_EQ(DivCeil(10, 3), 4); + EXPECT_EQ(DivCeil(10, 1), 10); + EXPECT_EQ(DivCeil(0, 5), 0); + EXPECT_EQ(DivCeil(0, 0), 0); +} + +TEST(MathUtilsTest, TestAlignCeil) +{ + EXPECT_EQ(AlignCeil(10, 5), 10); + EXPECT_EQ(AlignCeil(7, 5), 10); + EXPECT_EQ(AlignCeil(0, 5), 0); + EXPECT_EQ(AlignCeil(10, 0), 0); +} + +} diff --git a/debug/accuracy_tools/msprobe/test/cpp/test_precision_debugger.cpp b/debug/accuracy_tools/msprobe/test/cpp/test_precision_debugger.cpp new file mode 100644 index 0000000000000000000000000000000000000000..69df0c18fcc27cd0ac359262649fcc588f2e9b9f --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/cpp/test_precision_debugger.cpp @@ -0,0 +1,122 @@ +#include +#include + +#include "include/test_utils.hpp" +#include "third_party/ACL/AclApi.hpp" +#include "base/ErrorInfos.hpp" +#include "core/PrecisionDebugger.hpp" + +using namespace MindStudioDebugger; + +namespace MsProbeTest { + +class PrecisionDbgTaskStub : public PrecisionDbgTaskBase { +public: + PrecisionDbgTaskStub() = default; + ~PrecisionDbgTaskStub() = default; + std::string Name() const override {return "PrecisionDbgTaskStub";} + bool Condition(const DebuggerConfig& cfg) const override {return true;} + + void Initialize(const DebuggerConfig& cfg) {initialize_called = true;} + void OnStart() {start_called = true;} + void OnStop() {stop_called = true;} + void OnStep() {step_called = true;} + + bool initialize_called{false}; + bool start_called{false}; + bool 
stop_called{false}; + bool step_called{false}; +}; + +class PrecisionDbgTaskUselessStub : public PrecisionDbgTaskStub { +public: + bool Condition(const DebuggerConfig& cfg) const override {return false;} +}; + +TEST(PrecisionDebuggerTest, TestRegisterBeforeInit) { + PrecisionDebugger& debugger = PrecisionDebugger::GetInstance(); + PrecisionDbgTaskStub stub_task; + + DebuggerConfig::GetInstance().Reset(); + debugger.RegisterDebuggerTask(&stub_task); + stub_task.Register(); + + EXPECT_FALSE(debugger.IsEnable()); + EXPECT_EQ(debugger.GetCurStep(), 0); + debugger.Start(); + EXPECT_FALSE(debugger.IsEnable()); + debugger.Stop(); + debugger.Step(); + EXPECT_EQ(debugger.GetCurStep(), 0); + + EXPECT_FALSE(stub_task.initialize_called); + EXPECT_FALSE(stub_task.start_called); + EXPECT_FALSE(stub_task.stop_called); + EXPECT_FALSE(stub_task.step_called); + + debugger.UnRegisterDebuggerTask(&stub_task); + debugger.UnRegisterDebuggerTask(nullptr); +} + +TEST(PrecisionDebuggerTest, TestInit) { + PrecisionDebugger& debugger = PrecisionDebugger::GetInstance(); + MOCKER(MindStudioDebugger::AscendCLApi::LoadAclApi) + .stubs() + .then(returnValue(0)) + .expects(atLeast(1)); + + DebuggerConfig::GetInstance().Reset(); + EXPECT_FALSE(debugger.HasInitialized()); + EXPECT_NE(debugger.Initialize("", ""), 0); + EXPECT_FALSE(debugger.HasInitialized()); + CleanErrorInfoCache(); + EXPECT_EQ(debugger.Initialize("MindSpore", CONFIG_EXAMPLE), 0); + EXPECT_TRUE(debugger.HasInitialized()); + EXPECT_EQ(debugger.Initialize("MindSpore", CONFIG_EXAMPLE), 0); + EXPECT_TRUE(debugger.HasInitialized()); + + GlobalMockObject::verify(); + GlobalMockObject::reset(); +} + +TEST(PrecisionDebuggerTest, TestSubTaskDispatch) { + PrecisionDebugger& debugger = PrecisionDebugger::GetInstance(); + PrecisionDbgTaskStub stub_task1; + PrecisionDbgTaskStub stub_task2; + PrecisionDbgTaskUselessStub stub_task3; + MOCKER(MindStudioDebugger::AscendCLApi::LoadAclApi) + .stubs() + .then(returnValue(0)); + 
MOCKER(MindStudioDebugger::AscendCLApi::ACLAPI_aclrtSynchronizeDevice) + .stubs() + .then(returnValue(0)) + .expects(atLeast(1)); + + stub_task1.Register(); + EXPECT_EQ(debugger.Initialize("MindSpore", CONFIG_EXAMPLE), 0); + stub_task2.Register(); + stub_task3.Register(); + + EXPECT_TRUE(stub_task1.initialize_called); + EXPECT_TRUE(stub_task2.initialize_called); + EXPECT_FALSE(stub_task3.initialize_called); + EXPECT_FALSE(stub_task1.start_called); + EXPECT_FALSE(stub_task2.stop_called); + EXPECT_FALSE(stub_task3.step_called); + + debugger.Start(); + EXPECT_TRUE(stub_task1.start_called); + EXPECT_FALSE(stub_task3.start_called); + + debugger.Stop(); + EXPECT_TRUE(stub_task1.stop_called); + EXPECT_TRUE(stub_task2.stop_called); + + debugger.Step(); + EXPECT_TRUE(stub_task1.step_called); + + GlobalMockObject::verify(); + GlobalMockObject::reset(); +} + +} diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_accuracy_checker.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_accuracy_checker.py index 15ee6d3759178b27d0ae4ccd269db6ca6dfce6fd..2cf47a2064626db792c55efb449b47ca8ab9b04e 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_accuracy_checker.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_accuracy_checker.py @@ -38,17 +38,28 @@ def find_with_prefix(directory, prefix): target_files = [os.path.join(directory, entry) for entry in entries if entry.startswith(prefix) and os.path.isfile(os.path.join(directory, entry))] return target_files + +class Args: + def __init__(self, api_info_file=None, out_path=None, result_csv_path=None): + self.api_info_file = api_info_file if api_info_file is not None else os.path.join(directory, "files", "api_info_statistics.json") + self.out_path = out_path if out_path is not None else os.path.join(directory, "files") + self.result_csv_path = result_csv_path if result_csv_path is not None else "" + 
+
class TestApiAccuracyChecker(unittest.TestCase):
 def test_statistics_mode(self):
 api_info_statistics_path = os.path.join(directory, "files", "api_info_statistics.json")
 result_directory = os.path.join(directory, "files")
+
+ # Initialize the Args class with the three path arguments
+ args = Args(api_info_file=api_info_statistics_path, out_path=result_directory) # pass the custom path arguments in here
+ delete_files_with_prefix(result_directory, "accuracy_checking")
- api_accuracy_checker = ApiAccuracyChecker()
+ api_accuracy_checker = ApiAccuracyChecker(args)
 api_accuracy_checker.parse(api_info_statistics_path)
 api_accuracy_checker.run_and_compare()
- api_accuracy_checker.to_detail_csv(result_directory)
- api_accuracy_checker.to_result_csv(result_directory)
+
 detail_csv = find_with_prefix(result_directory, "accuracy_checking_detail")
 assert len(detail_csv) == 1
 check_csv(detail_csv[0], 2)
@@ -63,11 +74,13 @@ class TestApiAccuracyChecker(unittest.TestCase):
 result_directory = os.path.join(directory, "files")
 delete_files_with_prefix(result_directory, "accuracy_checking")
 modify_tensor_api_info_json(api_info_tensor_path, result_directory)
- api_accuracy_checker = ApiAccuracyChecker()
+
+ args = Args(api_info_file=api_info_tensor_path, out_path=result_directory)
+
+ api_accuracy_checker = ApiAccuracyChecker(args)
 api_accuracy_checker.parse(api_info_tensor_path)
 api_accuracy_checker.run_and_compare()
- api_accuracy_checker.to_detail_csv(result_directory)
- api_accuracy_checker.to_result_csv(result_directory)
+
 detail_csv = find_with_prefix(result_directory, "accuracy_checking_detail")
 assert len(detail_csv) == 1
 check_csv(detail_csv[0], 2)
@@ -78,6 +91,13 @@ class TestApiAccuracyChecker(unittest.TestCase):
 modify_tensor_api_info_json(api_info_tensor_path, "")
 delete_files_with_prefix(result_directory, "accuracy_checking")
+ def test_is_api_checkable(self):
+ input_return_mapping = {"fake api": False, "MintFunctional.relu.0.forward": True, "Tensor.add_.0.forward": True,
+ "Tensor.new.add.0.forward": False}
+ for api_name,
target_result in input_return_mapping.items(): + result = ApiAccuracyChecker.is_api_checkable(api_name) + assert result == target_result + if __name__ == '__main__': unittest.main() \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_info.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_info.py index d4f1c11d7544785c353d2dd22792e45c257cdfe6..da82e143b0966fccc6ade65a28dfee55d588e91d 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_info.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_api_info.py @@ -19,7 +19,22 @@ class TestApiInfo(unittest.TestCase): """ class level setup_class """ - global_context.init(False, os.path.join(directory, "files")) + global_context.init(False, os.path.join(directory, "files"), "mindspore") + + def test_get_kwargs_with_null(self): + # first load forward backward api_info + only_kwargs_api_info_dict = { + "input_kwargs": { + "approximate": None, + } + } + api_info = ApiInfo("only_input_kwargs_api") + api_info.load_forward_info(only_kwargs_api_info_dict) + + self.assertTrue(api_info.check_forward_info()) + kwargs_compute_element_dict = api_info.get_kwargs() + self.assertEqual(kwargs_compute_element_dict.get("approximate").get_parameter(), None) + def test_get_compute_element_list(self): # first load forward backward api_info diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_compute_element.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_compute_element.py index 4fd0357c5111c9361507e703445469177ebf701b..47ed33844c5376d852554b880a596f20b4a45981 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_compute_element.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_compute_element.py @@ -28,7 +28,7 @@ class TestComputeElement(unittest.TestCase): 
cls.init(TestComputeElement) def init(self): - global_context.init(False, os.path.join(directory, "files")) + global_context.init(False, os.path.join(directory, "files"), "mindspore") self.ndarray = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.float32) self.ms_tensor = mindspore.Tensor(self.ndarray) self.torch_tensor = torch.Tensor(self.ndarray) @@ -154,7 +154,26 @@ class TestComputeElement(unittest.TestCase): pt_parameter = compute_element.get_parameter(tensor_platform=Const.PT_FRAMEWORK) self.assertEqual(ms_parameter, mindspore.float32) self.assertEqual(pt_parameter, torch.float32) - + + def test_transfer_to_torch_tensor(self): + ms_tensor_2_torch_tensor_mapping = { + mindspore.Tensor([1, 2, 3], dtype=mindspore.uint8): torch.tensor([1, 2, 3], dtype=torch.uint8), + mindspore.Tensor([1, 2, 3], dtype=mindspore.float32): torch.tensor([1, 2, 3], dtype=torch.float32) + } + for ms_tensor, torch_tensor in ms_tensor_2_torch_tensor_mapping.items(): + real_torch_tensor = ComputeElement.transfer_to_torch_tensor(ms_tensor) + self.assertTrue((real_torch_tensor == torch_tensor).all()) + self.assertEqual(real_torch_tensor.dtype, torch_tensor.dtype) + + def test_transfer_to_mindspore_tensor(self): + ms_tensor_2_torch_tensor_mapping = { + mindspore.Tensor([1, 2, 3], dtype=mindspore.uint8): torch.tensor([1, 2, 3], dtype=torch.uint8), + mindspore.Tensor([1, 2, 3], dtype=mindspore.float32): torch.tensor([1, 2, 3], dtype=torch.float32) + } + for ms_tensor, torch_tensor in ms_tensor_2_torch_tensor_mapping.items(): + real_ms_tensor = ComputeElement.transfer_to_mindspore_tensor(torch_tensor) + self.assertTrue((real_ms_tensor == ms_tensor).all()) + self.assertEqual(real_ms_tensor.dtype, ms_tensor.dtype) if __name__ == '__main__': diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_data_manager.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_data_manager.py new file mode 100644 index 
0000000000000000000000000000000000000000..9cfad00d8ff13e91eb84fff5f46ab434f9ed1d4d --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_data_manager.py @@ -0,0 +1,86 @@ +import unittest +from unittest.mock import patch, mock_open, MagicMock +import os +from msprobe.mindspore.api_accuracy_checker.api_accuracy_checker import DataManager +from msprobe.core.common.const import CompareConst +from msprobe.mindspore.common.const import MsCompareConst + + +class TestDataManager(unittest.TestCase): + + def setUp(self): + # Set up the CSV directory and result paths for the test + self.csv_dir = "./test_data_manager_csv_dir" + self.result_csv_path = os.path.join(self.csv_dir, "result.csv") + self.details_csv_path = os.path.join(self.csv_dir, "details.csv") # newly added details.csv path + + # Make sure the directory exists + if not os.path.exists(self.csv_dir): + os.makedirs(self.csv_dir) + + # Make sure the file exists and write the correct header + with open(self.result_csv_path, 'w') as f: + f.write("API Name,Forward Test Success,Backward Test Success,Message\n") # write the required header fields + # Write some sample data + f.write("API1,pass,pass,All tests passed\n") + f.write("API2,pass,pass,Forward test failed\n") + f.write("api_name,pass,pass,Backward test failed\n") + + # Make sure details.csv exists and write some default data + with open(self.details_csv_path, 'w') as f: + f.write("api_name,details\n") # header + f.write("API1,Detail for API1\n") + f.write("API2,Detail for API2\n") + + # Create a DataManager instance + self.data_manager = DataManager(self.csv_dir, self.result_csv_path) + + def test_is_unique_api(self): + # Test unique API name detection + self.assertTrue(self.data_manager.is_unique_api("API4")) + self.assertFalse(self.data_manager.is_unique_api("API4")) + + @patch('os.path.exists', return_value=True) # mock: path exists + @patch('msprobe.core.common.file_utils.check_file_or_directory_path') + @patch('os.path.isfile', return_value=True) # mock: file exists + @patch('os.access', return_value=True) # mock: file is readable + def test_resume_from_last_csv(self, mock_access, mock_isfile, mock_check_file, mock_exists): + # Test resuming from the last checkpoint + 
self.data_manager.resume_from_last_csv(self.result_csv_path) + + # Verify the paths and outputs are correct + self.assertEqual(self.data_manager.csv_dir, os.path.dirname(self.result_csv_path)) + self.assertIsNotNone(self.data_manager.detail_out_path) + self.assertIsNotNone(self.data_manager.result_out_path) + + @patch('msprobe.core.common.file_utils.FileOpen', mock_open()) + def test_clear_results(self): + # Test clearing results + self.data_manager.results[("API1", "FORWARD")] = ["test_data"] + self.data_manager.clear_results() + self.assertEqual(len(self.data_manager.results), 0) + + @patch('msprobe.core.common.file_utils.write_csv') + def test_record(self, mock_write_csv): + # Test recording data + output_list = [("API1", "FORWARD", MagicMock(api_name="API1"), MagicMock())] + self.data_manager.record(output_list) + self.assertIn(("API1", "FORWARD"), self.data_manager.results) + self.assertEqual(len(self.data_manager.results[("API1", "FORWARD")]), 1) + + def test_record_exception_skip(self): + self.data_manager.record_exception_skip("API3", "FORWARD", "custom err msg") + self.assertEqual(self.data_manager.results_exception_skip["API3"]["FORWARD"], "custom err msg") + + def tearDown(self): + # Clean up the test directories and files created above + if os.path.exists(self.csv_dir): + for root, dirs, files in os.walk(self.csv_dir, topdown=False): + for file in files: + os.remove(os.path.join(root, file)) + for dir in dirs: + os.rmdir(os.path.join(root, dir)) + os.rmdir(self.csv_dir) + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_multi_api_accuracy_checker.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_multi_api_accuracy_checker.py new file mode 100644 index 0000000000000000000000000000000000000000..cee0f9fe53574e65a117f2923f35ecc51a196898 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_multi_api_accuracy_checker.py @@ -0,0 +1,75 @@ +# Python standard library +import os +from multiprocessing import Manager, Queue + 
+# Third-party libraries +import unittest +from unittest.mock import patch, MagicMock + +# Project modules +from msprobe.mindspore.api_accuracy_checker.multi_api_accuracy_checker import ( + MultiApiAccuracyChecker +) +from msprobe.mindspore.api_accuracy_checker.multi_data_manager import MultiDataManager +from msprobe.core.common.const import Const + + +class TestMultiApiAccuracyChecker(unittest.TestCase): + def setUp(self): + # Initialize arguments + self.args = MagicMock() + self.args.out_path = "./test_output" + self.args.result_csv_path = "./test_output/result.csv" + self.args.device_id = [0, 1] # simulate two device IDs + + # Create the test output directory + if not os.path.exists(self.args.out_path): + os.makedirs(self.args.out_path) + + # Create an empty result.csv file + with open(self.args.result_csv_path, 'w') as f: + f.write("API Name,Forward Test Success,Backward Test Success,Message\n") # write the header + + # Create a MultiApiAccuracyChecker instance + self.checker = MultiApiAccuracyChecker(self.args) + + # Mock api_infos data + self.checker.api_infos = { + 'API_1': MagicMock(), + 'API_2': MagicMock(), + 'API_3': MagicMock(), + 'API_4': MagicMock(), + } + + @patch('msprobe.mindspore.api_accuracy_checker.multi_api_accuracy_checker.logger') + def test_process_forward_no_forward_info(self, mock_logger): + """Test that when check_forward_info returns False, process_forward returns None and logs a debug message.""" + # Set current_device_id + self.checker.current_device_id = 0 + + api_info = MagicMock() + api_info.check_forward_info.return_value = False + + result = self.checker.process_forward("API_1", api_info) + + self.assertEqual(result, Const.EXCEPTION_NONE) + mock_logger.debug.assert_called_with( + "[Device 0] API: API_1 lacks forward information, skipping forward check." 
+ ) + + def tearDown(self): + # Clean up resources + if hasattr(self.checker, 'manager'): + self.checker.manager.shutdown() + # Clean up the test output directory + if os.path.exists(self.args.out_path): + for root, dirs, files in os.walk(self.args.out_path, topdown=False): + for file in files: + os.remove(os.path.join(root, file)) + for dir in dirs: + os.rmdir(os.path.join(root, dir)) + os.rmdir(self.args.out_path) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_multi_data_manager.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_multi_data_manager.py new file mode 100644 index 0000000000000000000000000000000000000000..59a3eca2426d80313e77e6a17b3575af0234b9f8 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/api_accuracy_checker/test_multi_data_manager.py @@ -0,0 +1,212 @@ +# Python standard library +import multiprocessing +import os +import threading +from collections import namedtuple +from multiprocessing import Manager + +# Third-party libraries +import unittest +from unittest.mock import MagicMock, patch + +# Project modules +from msprobe.mindspore.api_accuracy_checker.data_manager import ( + get_detail_csv_header, + get_result_csv_header, + write_csv_header +) +from msprobe.mindspore.api_accuracy_checker.multi_data_manager import MultiDataManager + +class TestMultiDataManager(unittest.TestCase): + + def setUp(self): + # Create the test directory and file paths + self.csv_dir = "./test_data_csv_dir" + self.result_csv_path = os.path.join(self.csv_dir, "result.csv") + self.detail_csv_path = os.path.join(self.csv_dir, "details.csv") + + # Create the test directory + os.makedirs(self.csv_dir, exist_ok=True) + + # Create the result files and write their headers + # Create result.csv and write the header + with open(self.result_csv_path, 'w') as f: + f.write("API Name,Forward Test Success,Backward Test Success,Message\n") # required header fields + + # Create details.csv and write the header + with open(self.detail_csv_path, 'w') as f: + f.write("api_name,details\n") # required header fields + + # Initialize shared variables + self.manager = Manager() + 
self.shared_is_first_write = self.manager.Value('b', True) + + # Create a MultiDataManager instance + self.data_manager = MultiDataManager(self.csv_dir, self.result_csv_path, self.shared_is_first_write) + + def test_save_results_first_write(self): + # Test header writing on the first write + self.data_manager.is_first_write = True + self.shared_is_first_write.value = True + api_name = "TestAPI" + + with patch("msprobe.mindspore.api_accuracy_checker.multi_data_manager.write_csv_header") as mock_write_header: + with patch.object(self.data_manager, 'to_detail_csv') as mock_to_detail_csv: + with patch.object(self.data_manager, 'to_result_csv') as mock_to_result_csv: + # Call save_results + self.data_manager.save_results(api_name) + + # Verify the header writer was called twice (detail and result) + mock_write_header.assert_any_call(self.data_manager.detail_out_path, get_detail_csv_header) + mock_write_header.assert_any_call(self.data_manager.result_out_path, get_result_csv_header) + self.assertEqual(mock_write_header.call_count, 2) + + # Verify is_first_write and shared_is_first_write have been updated + self.assertFalse(self.data_manager.is_first_write) + self.assertFalse(self.shared_is_first_write.value) + + def test_save_results_multiple_calls(self): + # Test calling save_results multiple times in a row + api_name = "TestAPI" + self.data_manager.is_first_write = True + + with patch("msprobe.mindspore.api_accuracy_checker.multi_data_manager.write_csv_header") as mock_write_header: + with patch.object(self.data_manager, 'to_detail_csv') as mock_to_detail_csv: + with patch.object(self.data_manager, 'to_result_csv') as mock_to_result_csv: + # Call save_results repeatedly + for _ in range(3): + self.data_manager.save_results(api_name) + + # Verify the headers were only written on the first call + self.assertEqual(mock_write_header.call_count, 2) + mock_write_header.assert_any_call(self.data_manager.detail_out_path, get_detail_csv_header) + mock_write_header.assert_any_call(self.data_manager.result_out_path, get_result_csv_header) + + # Verify the detail output and result summary write counts + self.assertEqual(mock_to_detail_csv.call_count, 3) + 
self.assertEqual(mock_to_result_csv.call_count, 3) + + def test_save_results_with_shared_is_first_write_false(self): + # Test the case where shared_is_first_write is already False + self.data_manager.is_first_write = True + self.shared_is_first_write.value = False + api_name = "TestAPI" + + with patch("msprobe.mindspore.api_accuracy_checker.multi_data_manager.write_csv_header") as mock_write_header: + self.data_manager.save_results(api_name) + + # Verify the header writer was not called + mock_write_header.assert_not_called() + + def test_save_results_exception_handling(self): + # Test how save_results handles an exception + api_name = "TestAPI" + + with patch.object(self.data_manager, 'to_detail_csv', side_effect=Exception("Test Exception")): + with patch.object(self.data_manager, 'to_result_csv') as mock_to_result_csv: + # Calling save_results should raise an exception + with self.assertRaises(Exception) as context: + self.data_manager.save_results(api_name) + + # Verify the exception message + self.assertEqual(str(context.exception), "Test Exception") + + # Verify to_result_csv was not called + mock_to_result_csv.assert_not_called() + + def test_clear_results_after_save(self): + # Test that results is cleared after calling save_results + self.data_manager.results = {'some_key': 'some_value'} + api_name = "TestAPI" + + with patch.object(self.data_manager, 'to_detail_csv'): + with patch.object(self.data_manager, 'to_result_csv'): + self.data_manager.save_results(api_name) + + # Verify results has been cleared + self.assertEqual(self.data_manager.results, {}) + + def test_thread_safety_with_threads(self): + # Test thread safety with multiple threads + self.data_manager.is_first_write = True + self.shared_is_first_write.value = True + api_name = "TestAPI" + call_counts = {'write_header': 0} + + original_write_csv_header = write_csv_header + + def write_csv_header_counter(*args, **kwargs): + call_counts['write_header'] += 1 + return original_write_csv_header(*args, **kwargs) + + with patch("msprobe.mindspore.api_accuracy_checker.multi_data_manager.write_csv_header", + side_effect=write_csv_header_counter): + def run_save_results(): + 
self.data_manager.save_results(api_name) + + threads = [] + for _ in range(5): + t = threading.Thread(target=run_save_results) + threads.append(t) + t.start() + + for t in threads: + t.join() + + # Verify the headers were written only once + if 'write_header' in call_counts: # make sure the key is present + self.assertEqual(call_counts['write_header'], 2) # once each for detail and result + + def test_save_results_with_existing_api_names(self): + # Test behavior when api_names_set already contains an API name + api_name = "TestAPI" + self.data_manager.api_names_set.add(api_name) + + with patch.object(self.data_manager, 'to_detail_csv') as mock_to_detail_csv: + with patch.object(self.data_manager, 'to_result_csv') as mock_to_result_csv: + self.data_manager.save_results(api_name) + + # Verify to_detail_csv and to_result_csv are still called + mock_to_detail_csv.assert_called_once() + mock_to_result_csv.assert_called_once() + + def test_save_results_without_results(self): + # Test calling save_results when results is not set + del self.data_manager.results # delete the results attribute + api_name = "TestAPI" + + with self.assertRaises(AttributeError): + self.data_manager.save_results(api_name) + + def test_save_results_with_empty_results(self): + # Test calling save_results when results is empty + self.data_manager.results = {} + api_name = "TestAPI" + + with patch.object(self.data_manager, 'to_detail_csv') as mock_to_detail_csv: + with patch.object(self.data_manager, 'to_result_csv') as mock_to_result_csv: + self.data_manager.save_results(api_name) + + # Verify to_detail_csv and to_result_csv are called even when results is empty + mock_to_detail_csv.assert_called_once() + mock_to_result_csv.assert_called_once() + + def test_clear_results_empty(self): + # Test clear_results when results is already empty + self.data_manager.results = {} + self.data_manager.clear_results() + self.assertEqual(self.data_manager.results, {}) + + def tearDown(self): + # Clean up the test directory + if os.path.exists(self.csv_dir): + for root, dirs, files in os.walk(self.csv_dir, topdown=False): + for file in files: + os.remove(os.path.join(root, file)) + for dir in dirs: + 
os.rmdir(os.path.join(root, dir)) + os.rmdir(self.csv_dir) + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/code_mapping/test_statistic_code_mapping.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/code_mapping/test_statistic_code_mapping.py new file mode 100644 index 0000000000000000000000000000000000000000..2189feed4d58f3d596dbfbc73273713f17748c65 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/code_mapping/test_statistic_code_mapping.py @@ -0,0 +1,246 @@ +# test_code_mapping.py +import unittest +import tempfile +import os +import re +import argparse +import numpy as np +from unittest.mock import patch +from pathlib import Path + +# Import the functions and classes under test from the project +from msprobe.mindspore.code_mapping.cmd_parser import add_ir_parser_arguments +from msprobe.mindspore.code_mapping.main import code_mapping_main + +# The IR and CSV test fixtures are extracted into standalone variables +TEST_IR_CONTENT = """# IR entry: @19_1___main___Net_construct_72 +# Total subgraphs: 3 + +# Attrs: +has_shard: 0 +flash_sp_send_recv_has_attached: 1 +has_attached: 1 +check_set_strategy_valid_once_only: 1 +FLASH_SP_RUN_ONCE_ONLY: 1 +FIAS_SP_RUN_ONCE_ONLY: 1 +less_bn: 0 +auto_parallel_finish_pre_action: 1 + +# Total params: 2 +# Params: +%para1_x: : [] +%para2_y: : [] + +Node counting information: +Total number of nodes: 29 +Total number of cnodes: 12 + +subgraph attr: +has_shard: 0 +flash_sp_send_recv_has_attached: 1 +has_attached: 1 +check_set_strategy_valid_once_only: 1 +FLASH_SP_RUN_ONCE_ONLY: 1 +FIAS_SP_RUN_ONCE_ONLY: 1 +less_bn: 0 +auto_parallel_finish_pre_action: 1 +subgraph instance: 19_1___main___Net_construct_72 : 0xaaaae0ca0250 +# In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ +subgraph @19_1___main___Net_construct_72() { + %0(CNode_69$a) = PrimFunc_Sub(%para1_x, Tensor(shape=[], dtype=Float32, value=1)) cnode_attrs: {checkpoint: Bool(1)} + : (, ) -> () + # Fullname with scope: (Default/Sub-op0) + # In file test_ir.py:15~20, 4~16/ def 
construct(self, x, y):/ + # In file test_ir.py:16, 12~25/ a = ops.sub(x, 1)/ + # In file test_ir.py:16, 12~19/ a = ops.sub(x, 1)/<~~This line of code can be shared by multiple nodes, and may be duplicated./ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/auto_generate/gen_ops_def.py:7431~7474, 0~31/ >>> y = Tensor(1, mindspore.int32)/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/auto_generate/gen_ops_def.py:7474, 11~31/ Supported Platforms:/ + %1(CNode_70$b) = PrimFunc_Add(%0, %para2_y) cnode_attrs: {checkpoint: Bool(1)} + : (, ) -> () + # Fullname with scope: (Default/Add-op0) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:17, 12~25/ b = ops.add(a, y)/ + # In file test_ir.py:17, 12~19/ b = ops.add(a, y)/<~~This line of code can be shared by multiple nodes, and may be duplicated./ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/auto_generate/gen_ops_def.py:312~370, 0~31/def add(input, other):/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/auto_generate/gen_ops_def.py:370, 11~31/ return add_op(input, other)/ + %2(CNode_71) = PrimFunc_Cast(%1, I64(30)) primitive_attrs: {output_names: [output], input_names: [x, dst_type]} cnode_attrs: {checkpoint: Bool(1)} + : (, ) -> () + # Fullname with scope: (Default/Cast-op0) + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/_extends/parse/standard_method.py:2755~2757, 0~23/def bool_(x):/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/_extends/parse/standard_method.py:2757, 11~23/ return x.__bool__()/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/_extends/parse/standard_method.py:2757, 11~21/ return x.__bool__()/ + # 
In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/_extends/parse/standard_method.py:3275~3280, 0~34/def tensor_bool(x):/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/_extends/parse/standard_method.py:3278~3279, 4~38/ if is_cond and F.isconstant(x):/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/_extends/parse/standard_method.py:3280, 11~34/ return F.cast(x, mstype.bool_)/ + %3(CNode_80) = Partial(@20_4_✓__main___Net_construct_75, %1, %0) primitive_attrs: {side_effect_propagate: I64(1)} cnode_attrs: {checkpoint: Bool(1)} + : (, , ) -> () + # Fullname with scope: (Default/Partial-op0) + %4(CNode_82) = Partial(@21_14_✗__main___Net_construct_76, %1) primitive_attrs: {side_effect_propagate: I64(1)} cnode_attrs: {checkpoint: Bool(1)} + : (, ) -> () + # Fullname with scope: (Default/Partial-op1) + %5(CNode_74) = Switch(%2, %3, %4) cnode_attrs: {checkpoint: Bool(1)} + : (, , ) -> () + # Fullname with scope: (Default/Switch-op0) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:18~19, 8~43/ if b :/ + %6(CNode_77) = %5[@FuncUnion(@20_4_✓__main___Net_construct_75, @21_14_✗__main___Net_construct_76)]() + : () -> () + # Fullname with scope: (0) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:18~19, 8~43/ if b :/ + Return(%6) cnode_attrs: {checkpoint: Bool(1)} + : () + # Fullname with scope: (Default/Return-op2) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:18~19, 8~43/ if b :/ +} + + +indirect: 1 +subgraph attr: +defer_inline: 0 +undeterminate: 0 +subgraph instance: 20_4_✓__main___Net_construct_75 : 0xaaaae0c9dc10 +# Parameters: 2, (, ) +# In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ +subgraph @20_4_✓__main___Net_construct_75(%para3_Parameter_79, %para4_Parameter_78) { + %0(output) = 
PrimFunc_Div(%para4_Parameter_78, %para3_Parameter_79) + : (, ) -> () + # Fullname with scope: (Default/Div-op0) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:19, 27~42/ b = ops.mul(b, self.func(a, b))/ + # In file test_ir.py:19, 27~36/ b = ops.mul(b, self.func(a, b))/<~~This line of code can be shared by multiple nodes, and may be duplicated./ + # In file test_ir.py:12~13, 4~28/ def func(x, y):/ + # In file test_ir.py:13, 15~28/ return ops.div(x, y)/ + # In file test_ir.py:13, 15~22/ return ops.div(x, y)/<~~This line of code can be shared by multiple nodes, and may be duplicated./ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/function/math_func.py:727~786, 0~17/def div(input, other, *, rounding_mode=None):/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/function/math_func.py:782~785, 4~42/ if rounding_mode:/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/function/math_func.py:785, 17~42/ output = tensor_div_(input, other)/<~~This line of code can be shared by multiple nodes, and may be duplicated./ + %1(CNode_73$b) = PrimFunc_Mul(%para3_Parameter_79, %0) + : (, ) -> () + # Fullname with scope: (Default/Mul-op0) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:19, 16~43/ b = ops.mul(b, self.func(a, b))/ + # In file test_ir.py:19, 16~23/ b = ops.mul(b, self.func(a, b))/<~~This line of code can be shared by multiple nodes, and may be duplicated./ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/auto_generate/gen_ops_def.py:5222~5269, 0~31/ >>> import mindspore/ + # In file /home/maoyanlongbak/anaconda3/envs/pytorch21copy/lib/python3.8/site-packages/mindspore/ops/auto_generate/gen_ops_def.py:5269, 11~31/ Supported Platforms:/ + Return(%1) + : () + # Fullname with scope: 
(Default/Return-op0) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:19, 12~43/ b = ops.mul(b, self.func(a, b))/ +} + + +indirect: 1 +subgraph attr: +defer_inline: 0 +undeterminate: 0 +subgraph instance: 21_14_✗__main___Net_construct_76 : 0xaaaae0c71c00 +# Parameters: 1, () +# In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ +subgraph @21_14_✗__main___Net_construct_76(%para5_Parameter_81) { + Return(%para5_Parameter_81) + : () + # Fullname with scope: (Default/Return-op1) + # In file test_ir.py:15~20, 4~16/ def construct(self, x, y):/ + # In file test_ir.py:18~19, 8~43/ if b :/ +}""" + +TEST_CSV_CONTENT = """Op Type,Op Name,Task ID,Stream ID,Timestamp,IO,Slot,Data Size,Data Type,Shape,Max Value,Min Value,L2Norm Value +Sub,Default_Sub-op0,0,0,1733905446819790,input,0,4,float32,"()",3,3,3, +Sub,Default_Sub-op0,0,0,1733905446820357,input,1,4,float32,"()",1,1,1, +Sub,Default_Sub-op0,0,0,1733905446820495,output,0,4,float32,"()",2,2,2, +Add,Default_Add-op0,0,0,1733905446822806,input,0,4,float32,"()",2,2,2, +Add,Default_Add-op0,0,0,1733905446822996,input,1,4,float32,"()",2,2,2, +Add,Default_Add-op0,0,0,1733905446823151,output,0,4,float32,"()",4,4,4, +Cast,Default_Cast-op0,0,0,1733905446823900,input,0,4,float32,"()",4,4,4, +Cast,Default_Cast-op0,0,0,1733905446824053,input,1,8,int64,"()",30,30,30, +Cast,Default_Cast-op0,0,0,1733905446824184,output,0,1,bool,"()",1,1,1, +Div,Default_Div-op0,0,0,1733905446827858,input,0,4,float32,"()",2,2,2, +Div,Default_Div-op0,0,0,1733905446828193,input,1,4,float32,"()",4,4,4, +Div,Default_Div-op0,0,0,1733905446828341,output,0,4,float32,"()",0.5,0.5,0.5, +Mul,Default_Mul-op0,0,0,1733905446831139,input,0,4,float32,"()",4,4,4, +Mul,Default_Mul-op0,0,0,1733905446831365,input,1,4,float32,"()",0.5,0.5,0.5, +Mul,Default_Mul-op0,0,0,1733905446831510,output,0,4,float32,"()",2,2,2, +""" + + +class TestCodeMapping(unittest.TestCase): + def test_statistic_code_mapping(self): + # Create the fixtures in a temporary directory and run the test + with 
tempfile.TemporaryDirectory() as tmpdir: + ir_file_path = os.path.join(tmpdir, "18_validate_0068.ir") + csv_file_path = os.path.join(tmpdir, "statistic.csv") + + # Create the IR file + with open(ir_file_path, 'w') as f: + f.write(TEST_IR_CONTENT) + + # Create the CSV file + with open(csv_file_path, 'w') as f: + f.write(TEST_CSV_CONTENT) + + # Read the CSV content before the run + with open(csv_file_path, 'r') as f: + original_csv_content = f.read() + + # Prepare the arguments + parser = argparse.ArgumentParser() + add_ir_parser_arguments(parser) + + args = parser.parse_args(["--ir", ir_file_path, "--dump_data", csv_file_path, "--output", tmpdir]) + + # Run the main function + code_mapping_main(args) + + # Read the CSV content after the run + with open(csv_file_path, 'r') as f: + updated_csv_content = f.read() + + # Check whether the file content changed + self.assertNotEqual(original_csv_content, updated_csv_content, "CSV file content did not change; test failed") + + def test_npy_code_mapping(self): + with tempfile.TemporaryDirectory() as tmpdir: + ir_file_path = os.path.join(tmpdir, "18_validate_0068.ir") + + # Create the IR file + with open(ir_file_path, 'w') as f: + f.write(TEST_IR_CONTENT) + + # Create a data directory containing npy files + data_dir = os.path.join(tmpdir, "data_dir") + os.makedirs(data_dir, exist_ok=True) + + # List of npy file names to create + npy_files = ["Add.Default_Add-op0.0.0.1734008383669918.input.0.DefaultFormat.float32.npy"] + + # Create the npy files, writing some test data + dummy_data = np.array([1.0, 2.0, 3.0]) + for fname in npy_files: + file_path = os.path.join(data_dir, fname) + np.save(file_path, dummy_data) # write test data + + # Prepare the arguments + parser = argparse.ArgumentParser() + add_ir_parser_arguments(parser) + + # Here --dump_data is given the data directory we created + args = parser.parse_args(["--ir", ir_file_path, "--dump_data", data_dir, "--output", tmpdir]) + + # Run the main function + code_mapping_main(args) + + # Validate the generated file names with a regular expression + pattern = r"code_mapping_\d{8}\d{6}\.csv" # matches names like code_mapping_YYYYMMDDHHMMSS.csv + generated_files = os.listdir(tmpdir) + + # Check that a file matching the pattern was generated + matching_files = [f for f in generated_files if re.match(pattern, f)] + self.assertTrue(matching_files, + 
msg=f"No CSV file matching the expected pattern was generated. Generated files: {generated_files}. Regex pattern: {pattern}") + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/common/test_ms_utils.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/common/test_ms_utils.py index a7699d479308f3a8beedfd5611041e88b147b84f..1ed3ca016108519fb3f643c9d4bb768f63a52d40 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/common/test_ms_utils.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/common/test_ms_utils.py @@ -28,6 +28,7 @@ from msprobe.mindspore.common.utils import (get_rank_if_initialized, convert_to_int, list_lowest_level_directories, seed_all, + remove_dropout, MsprobeStep) class MockCell: @@ -81,18 +82,12 @@ class TestMsprobeFunctions(unittest.TestCase): self.assertEqual(rank, 0) mock_get_rank.assert_called_once() - @patch('mindspore.Tensor') - def test_convert_bf16_to_fp32(self, mock_tensor): - mock_tensor.dtype = ms.bfloat16 - mock_tensor.to.return_value = mock_tensor - result = convert_bf16_to_fp32(mock_tensor) - self.assertEqual(result, mock_tensor) - mock_tensor.to.assert_called_once_with(ms.float32) - - # Test when tensor is not bfloat16 - mock_tensor.dtype = ms.float32 - result = convert_bf16_to_fp32(mock_tensor) - self.assertEqual(result, mock_tensor) + def test_convert_bf16_to_fp32(self): + original_tensor = ms.Tensor(np.array([1.5, 2.5, 3.5]), dtype=ms.bfloat16) + converted_tensor = convert_bf16_to_fp32(original_tensor) + self.assertEqual(converted_tensor.dtype, ms.float32) + np.testing.assert_array_almost_equal( + converted_tensor.asnumpy(), np.array([1.5, 2.5, 3.5], dtype=np.float32)) def test_convert_to_int(self): self.assertEqual(convert_to_int("123"), 123) @@ -118,7 +113,7 @@ class TestMsprobeFunctions(unittest.TestCase): seed_all(42, True) # 验证 check_seed_all 的调用 - mock_check_seed_all.assert_called_once_with(42, True) + mock_check_seed_all.assert_called_once_with(42, True, True) # 验证环境变量是否设置正确 
self.assertEqual(mock_environ.get('PYTHONHASHSEED'), '42') # 验证其他函数是否正确调用 @@ -126,6 +121,22 @@ class TestMsprobeFunctions(unittest.TestCase): mock_random_seed.assert_called_once_with(42) mock_set_context.assert_called_once_with(deterministic="ON") + def test_remove_dropout(self): + remove_dropout() + from mindspore import Tensor + x1d = Tensor(np.ones([5, 5]), ms.float32) + x2d = Tensor(np.ones([5, 5, 5, 5]), ms.float32) + x3d = Tensor(np.ones([5, 5, 5, 5, 5]), ms.float32) + from mindspore.ops import Dropout, Dropout2D, Dropout3D + self.assertTrue((Dropout(0.5)(x1d)[0].numpy() == x1d.numpy()).all()) + self.assertTrue((Dropout2D(0.5)(x2d)[0].numpy() == x2d.numpy()).all()) + self.assertTrue((Dropout3D(0.5)(x3d)[0].numpy() == x3d.numpy()).all()) + + from mindspore.mint.nn import Dropout + from mindspore.mint.nn.functional import dropout + self.assertTrue((Dropout(0.5)(x1d).numpy() == x1d.numpy()).all()) + self.assertTrue((dropout(x1d, p=0.5).numpy() == x1d.numpy()).all()) + if __name__ == "__main__": diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/layer_mapping.yaml b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/layer_mapping.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a928b0c1de1f75daafbb96a46a165f1e758131a5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/layer_mapping.yaml @@ -0,0 +1,15 @@ +TopLayer: + network_with_loss: module + +VocabParallelEmbedding: + logical_or: __or__ + reduce_from_mp_region.all_reduce: all_reduce + +ParallelTransformerLayer: + attention: self_attention + +ParallelAttention: + flash_attention_score: core_attention_flash.npu_fusion_attention + +FusedRMSNorm: + RmsNorm: npu_rms_norm diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/construct.json b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/construct.json new file mode 100644 index 
0000000000000000000000000000000000000000..f1d4b05a6cc6b91539d0b0d95b2a6710bed5ae53 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/construct.json @@ -0,0 +1,15 @@ +{ + "Tensor.__add__.0.forward": "", + "Tensor.__bool__.1.forward": "", + "Tensor.__add__.1.forward": "", + "Mint.logical_or.0.forward": "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + "Distributed.all_reduce.0.forward": "Cell.network_with_loss.module.language_model.embedding.word_embeddings.reduce_from_mp_region.ReduceFromModelParallelRegion.forward.0", + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": "Cell.network_with_loss.module.language_model.embedding.Embedding.forward.0", + "Primitive.norm.RmsNorm.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.forward.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": "Cell.network_with_loss.module.language_model.encoder.ParallelTransformer.forward.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.forward.0": "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Mint.cos.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0", + "Functional.flash_attention_score.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0", + "Functional.flash_attention_score.0.backward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.backward.0" +} \ No newline at 
end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/dump.json b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..48800c0455c6651b146600e61e636d4dc25fac31 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/dump.json @@ -0,0 +1,530 @@ +{ + "task": "statistics", + "level": "mix", + "framework": "mindspore", + "dump_data_dir": null, + "data": { + "Tensor.__add__.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Int32", + "shape": [ + 5, + 2 + ], + "Max": 3430.0, + "Min": 0.0, + "Mean": 926.5, + "Norm": 4977.4384765625 + }, + { + "type": "mindspore.Tensor", + "dtype": "Int64", + "shape": [ + 2 + ], + "Max": 4096.0, + "Min": 1.0, + "Mean": 2048.5, + "Norm": 4096.0 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Int64", + "shape": [ + 5, + 2 + ], + "Max": 7526.0, + "Min": 1.0, + "Mean": 2974.999755859375, + "Norm": 13585.2802734375 + } + ] + }, + "Tensor.__bool__.1.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Bool", + "shape": [], + "Max": false, + "Min": false, + "Mean": null, + "Norm": null + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "bool", + "value": false + } + ] + }, + "Tensor.__add__.1.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Int64", + "shape": [], + "Max": 423, + "Min": 423, + "Mean": 423, + "Norm": 423 + }, + { + "type": "int", + "value": 1 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Int64", + "shape": [], + "Max": 424, + "Min": 424, + "Mean": 424, + "Norm": 424 + } + ] + }, + "Mint.logical_or.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Bool", + "shape": [ + 1, + 4096 + ], + "Max": false, + "Min": false, + "Mean": null, + 
"Norm": null + }, + { + "type": "mindspore.Tensor", + "dtype": "Bool", + "shape": [ + 1, + 4096 + ], + "Max": true, + "Min": false, + "Mean": null, + "Norm": null + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Bool", + "shape": [ + 1, + 4096 + ], + "Max": true, + "Min": false, + "Mean": null, + "Norm": null + } + ] + }, + "Distributed.all_reduce.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.0002651214599609375, + "Norm": 2352.0 + } + ], + "input_kwargs": { + "group": { + "type": "str", + "value": "tp-0-1-2-3" + } + }, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.000316619873046875, + "Norm": 2512.0 + }, + null + ] + }, + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Int32", + "shape": [ + 1, + 4096 + ], + "Max": 165558.0, + "Min": 0.0, + "Mean": 16050.638671875, + "Norm": 2257767.75 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.000316619873046875, + "Norm": 2512.0 + } + ] + }, + "Primitive.norm.RmsNorm.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 2.5625, + "Min": -2.640625, + "Mean": 0.0002841949462890625, + "Norm": 1256.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 6144 + ], + "Max": 1.0, + "Min": 1.0, + "Mean": 1.0, + "Norm": 78.5 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 5.125, + "Min": 
-5.25, + "Mean": 0.0005645751953125, + "Norm": 2512.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 1024, + 1, + 1 + ], + "Max": 2.0491626262664795, + "Min": 1.9447283744812012, + "Mean": 1.9983034133911133, + "Norm": 63.948543548583984 + } + ] + }, + "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": { + "input": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 4.5299530029296875e-05, + "Min": -4.935264587402344e-05, + "Mean": -6.781192496418953e-09, + "Norm": 0.0126953125 + } + ], + "output": [ + null + ] + }, + "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.backward.0": { + "input": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 1.9311904907226562e-05, + "Min": -1.7881393432617188e-05, + "Mean": -7.741618901491165e-09, + "Norm": 0.00201416015625 + } + ], + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 3.838539123535156e-05, + "Min": -3.552436828613281e-05, + "Mean": -1.548323780298233e-08, + "Norm": 0.0040283203125 + } + ] + }, + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": { + "input_args": [], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 0.421875, + "Min": -0.443359375, + "Mean": -0.0002346038818359375, + "Norm": 50.75 + }, + null + ] + }, + "Mint.cos.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 4095.0, + "Min": 0.0, + "Mean": 238.66024780273438, + "Norm": 427910.46875 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 
1.0000001192092896, + "Min": -1.0000001192092896, + "Mean": 0.13587358593940735, + "Norm": 528.9301147460938 + } + ] + }, + "Functional.flash_attention_score.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.671875, + "Min": -3.765625, + "Mean": -0.00072479248046875, + "Norm": 1744.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.484375, + "Min": -3.0625, + "Mean": -0.00115966796875, + "Norm": 1728.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.28125, + "Min": -3.234375, + "Mean": -0.001861572265625, + "Norm": 1744.0 + }, + { + "type": "int", + "value": 12 + } + ], + "input_kwargs": { + "attn_mask": { + "type": "mindspore.Tensor", + "dtype": "UInt8", + "shape": [ + 1, + 1, + 4096, + 4096 + ], + "Max": 1.0, + "Min": 0.0, + "Mean": 0.876311182975769, + "Norm": 3834.326904296875 + }, + "scalar_value": { + "type": "float", + "value": 0.08838834764831843 + }, + "pre_tokens": { + "type": "int", + "value": 65536 + }, + "next_tokens": { + "type": "int", + "value": 0 + }, + "input_layout": { + "type": "str", + "value": "SBH" + } + }, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 2.734375, + "Min": -2.578125, + "Mean": -0.001373291015625, + "Norm": 266.0 + } + ] + }, + "Functional.flash_attention_score.0.backward": { + "input": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 1.5139579772949219e-05, + "Min": -1.5854835510253906e-05, + "Mean": -9.313225746154785e-09, + "Norm": 0.0064697265625 + } + ], + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 1.3634562492370605e-06, + "Min": -1.3485550880432129e-06, + "Mean": -3.765308065339923e-10, + "Norm": 
0.000286102294921875 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.11434268951416e-06, + "Min": -2.8312206268310547e-06, + "Mean": 6.661338147750939e-14, + "Norm": 0.000244140625 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 2.181529998779297e-05, + "Min": -2.205371856689453e-05, + "Mean": -9.313225746154785e-09, + "Norm": 0.001922607421875 + }, + null + ] + } + } +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/stack.json b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..17c9286cd048318275de92aba0b72dc40f92140f --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/mindspore_data/stack.json @@ -0,0 +1,290 @@ +{ + "Tensor.__add__.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 487, in format_index_tensor, \n return F.select(index < 0, index + format_dims, index)", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 140, in data_update_by_ops, \n new_index = 
format_index_tensor(new_index, (None, F.shape(data)[:F.shape(new_index)[-1]]))", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 102, in data_update, \n data = data_update_by_ops(transfer_type, arg, data, new_index, origin_data, value)", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 203, in _tensor_getitem, \n return data_update(tensor_update_types, tensor_update_args, self, new_index)", + "File /path_to_package/site-packages/mindspore/common/tensor.py, line 483, in __getitem__, \n out = tensor_operator_registry.get('__getitem__')(self, index)", + "File /path_to_net/PanGu_ms/pangu/training/utils.py, line 65, in get_ltor_reset_masks_and_position_ids, \n eod_index = position_ids[b, data[b] == eod_token]", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 113, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 724, in train, \n data = get_batch_func(train_dataset_dict_iterator)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Tensor.__bool__.1.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File 
/path_to_net/PanGu_ms/pangu/training/utils.py, line 75, in get_ltor_reset_masks_and_position_ids, \n if i == pre_eod_idx:", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 113, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 724, in train, \n data = get_batch_func(train_dataset_dict_iterator)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Tensor.__add__.1.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/training/utils.py, line 78, in get_ltor_reset_masks_and_position_ids, \n attention_mask[b, 0, (i + 1):, :(i + 1)] = 0", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 113, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 724, in train, \n data = get_batch_func(train_dataset_dict_iterator)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Mint.logical_or.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = 
self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/layers.py, line 1145, in construct, \n input_mask = mint.logical_or((input_ < self.vocab_start_index), (input_ >= self.vocab_end_index))", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2455, in _backward_hook_construct, \n outputs = self.construct(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 226, in construct, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File 
/path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 554, in construct, \n text_embedding_out = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, 
**kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Distributed.all_reduce.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", 
+ "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/mappings.py, line 241, in construct, \n output = comm_func.all_reduce(input_, group=self.tp_group)[0]", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 785, in _call_custom_bprop, \n output = self.construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2450, in _backward_hook_construct, \n outputs = self._call_custom_bprop(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/layers.py, line 1168, in construct, \n output = self.reduce_from_mp_region(output_parallel)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2455, in _backward_hook_construct, \n outputs = self.construct(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 226, in construct, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 554, in construct, \n text_embedding_out = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, 
**kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 226, in construct, \n words_embeddings 
= self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 554, in construct, \n text_embedding_out = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in 
_backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Primitive.norm.RmsNorm.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 198, in ParallelTransformerLayerForward, \n norm_output = self.input_norm(hidden_states)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File 
/path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File 
/path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Mint.cos.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = 
super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/rotary_pos_embedding.py, line 123, in _apply_fused_rotary_pos_emb, \n cos_ = mint.cos(freqs).to(t.dtype)", + "File /path_to_net/PanGu_ms/pangu/model/rotary_pos_embedding.py, line 136, in apply_rotary_pos_emb, \n return _apply_fused_rotary_pos_emb(t, freqs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 619, in construct, \n query = apply_rotary_pos_emb(query, q_pos_emb, self.config)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n 
outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = 
self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Functional.flash_attention_score.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File 
/path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 637, in construct, \n output = ops.flash_attention_score(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + 
"File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/construct.json b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..e99f6e1729808261e23b92ae7ace90bc81743853 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/construct.json @@ -0,0 +1,14 @@ +{ + "Tensor.__add__.0.forward": "", + "Tensor.__add__.1.forward": "", + "Tensor.__or__.0.forward": 
"Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + "Distributed.all_reduce.0.forward": "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": "Module.module.module.language_model.embedding.Embedding.forward.0", + "NPU.npu_rms_norm.0.forward": "Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0", + "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": "Module.module.module.language_model.encoder.ParallelTransformer.forward.0", + "Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0": "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0": "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Torch.cos.0.forward": "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0", + "NPU.npu_fusion_attention.0.forward": "Module.module.module.language_model.encoder.layers.0.self_attention.core_attention_flash.FlashSelfAttention.forward.0", + "NPU.npu_fusion_attention.0.backward": "Module.module.module.language_model.encoder.layers.0.self_attention.core_attention_flash.FlashSelfAttention.backward.0" +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/dump.json b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..b2704185ff19b961b43453f81247236d77677d83 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/dump.json @@ -0,0 +1,701 @@ +{ + "task": "statistics", + 
"level": "mix", + "framework": "pytorch", + "dump_data_dir": null, + "data": { + "Tensor.__add__.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 423, + "Min": 423, + "Mean": 423, + "Norm": 423, + "requires_grad": false + }, + { + "type": "int", + "value": 1 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 424, + "Min": 424, + "Mean": 424, + "Norm": 424, + "requires_grad": false + } + ] + }, + "Tensor.__add__.1.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 423, + "Min": 423, + "Mean": 423, + "Norm": 423, + "requires_grad": false + }, + { + "type": "int", + "value": 1 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 424, + "Min": 424, + "Mean": 424, + "Norm": 424, + "requires_grad": false + } + ] + }, + "Tensor.__or__.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.bool", + "shape": [ + 1, + 4096 + ], + "Max": false, + "Min": false, + "Mean": null, + "Norm": null, + "requires_grad": false + }, + { + "type": "torch.Tensor", + "dtype": "torch.bool", + "shape": [ + 1, + 4096 + ], + "Max": true, + "Min": false, + "Mean": null, + "Norm": null, + "requires_grad": false + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bool", + "shape": [ + 1, + 4096 + ], + "Max": true, + "Min": false, + "Mean": null, + "Norm": null, + "requires_grad": false + } + ] + }, + "Distributed.all_reduce.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.0002651214599609375, + "Norm": 2352.0, + "requires_grad": true + } + ], + "input_kwargs": { + "group": null + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": 
"torch.bfloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.000316619873046875, + "Norm": 2512.0, + "requires_grad": true + }, + null + ] + }, + "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [ + 1, + 4096 + ], + "Max": 165558.0, + "Min": 0.0, + "Mean": 16050.638671875, + "Norm": 2257767.75, + "requires_grad": false + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.000316619873046875, + "Norm": 2512.0, + "requires_grad": true + } + ] + }, + "NPU.npu_rms_norm.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 2.5625, + "Min": -2.640625, + "Mean": 0.0002841949462890625, + "Norm": 1256.0, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 6144 + ], + "Max": 1.0, + "Min": 1.0, + "Mean": 1.0, + "Norm": 78.5, + "requires_grad": true + } + ], + "input_kwargs": { + "epsilon": { + "type": "float", + "value": 1e-05 + } + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 5.125, + "Min": -5.25, + "Mean": 0.0005645751953125, + "Norm": 2512.0, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 1024, + 1, + 1 + ], + "Max": 2.0491626262664795, + "Min": 1.9447283744812012, + "Mean": 1.9983034133911133, + "Norm": 63.948543548583984, + "requires_grad": false + } + ] + }, + "Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 2.5625, + "Min": -2.640625, 
+ "Mean": 0.0002841949462890625, + "Norm": 1256.0, + "requires_grad": true + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 5.125, + "Min": -5.25, + "Mean": 0.0005645751953125, + "Norm": 2512.0, + "requires_grad": true + } + ] + }, + "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 5.125, + "Min": -5.25, + "Mean": 0.0005645751953125, + "Norm": 2512.0, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.bool", + "shape": [ + 1, + 1, + 4096, + 4096 + ], + "Max": true, + "Min": false, + "Mean": null, + "Norm": null, + "requires_grad": false + } + ], + "input_kwargs": { + "inference_params": null, + "rotary_pos_emb": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 4095.0, + "Min": 0.0, + "Mean": 238.66024780273438, + "Norm": 427910.46875, + "requires_grad": false + } + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 0.421875, + "Min": -0.443359375, + "Mean": -0.0002346038818359375, + "Norm": 50.75, + "requires_grad": true + }, + null + ] + }, + "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 2.5625, + "Min": -2.640625, + "Mean": 0.0002841949462890625, + "Norm": 1256.0, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.bool", + "shape": [ + 1, + 1, + 4096, + 4096 + ], + "Max": true, + "Min": false, + "Mean": null, + "Norm": null, + "requires_grad": false + } + ], + "input_kwargs": { + "encoder_output": null, + "enc_dec_attn_mask": null, + 
"inference_params": null, + "rotary_pos_emb": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 4095.0, + "Min": 0.0, + "Mean": 238.66024780273438, + "Norm": 427910.46875, + "requires_grad": false + }, + "retriever_input": null, + "retriever_output": null, + "retriever_attn_mask": null + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 2.46875, + "Min": -2.6875, + "Mean": -0.000438690185546875, + "Norm": 1280.0, + "requires_grad": true + } + ] + }, + "Torch.cos.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 4095.0, + "Min": 0.0, + "Mean": 238.66024780273438, + "Norm": 427910.46875, + "requires_grad": false + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 1.0000001192092896, + "Min": -1.0000001192092896, + "Mean": 0.13587358593940735, + "Norm": 528.9301147460938, + "requires_grad": false + } + ] + }, + "NPU.npu_fusion_attention.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.671875, + "Min": -3.765625, + "Mean": -0.00072479248046875, + "Norm": 1744.0, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.484375, + "Min": -3.0625, + "Mean": -0.00115966796875, + "Norm": 1728.0, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.28125, + "Min": -3.234375, + "Mean": -0.001861572265625, + "Norm": 1744.0, + "requires_grad": true + }, + { + "type": "int", + "value": 12 + }, + { + "type": "str", + "value": "SBH" + } + ], + "input_kwargs": { + "pse": null, + "padding_mask": null, + "atten_mask": 
{ + "type": "torch.Tensor", + "dtype": "torch.bool", + "shape": [ + 1, + 1, + 4096, + 4096 + ], + "Max": true, + "Min": false, + "Mean": null, + "Norm": null, + "requires_grad": false + }, + "scale": { + "type": "float", + "value": 0.08838834764831843 + }, + "pre_tockens": { + "type": "int", + "value": 65536 + }, + "next_tockens": { + "type": "int", + "value": 0 + }, + "keep_prob": { + "type": "float", + "value": 1.0 + }, + "inner_precise": { + "type": "int", + "value": 0 + }, + "actual_seq_qlen": null, + "actual_seq_kvlen": null, + "sparse_mode": { + "type": "int", + "value": 0 + } + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 2.734375, + "Min": -2.578125, + "Mean": -0.001373291015625, + "Norm": 266.0, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 1, + 12, + 4096, + 8 + ], + "Max": 2.60323166847229, + "Min": -0.8802953958511353, + "Mean": 1.3589274883270264, + "Norm": 870.7459716796875, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": [ + 1, + 12, + 4096, + 8 + ], + "Max": 546.2620849609375, + "Min": 1.0, + "Mean": 132.59361267089844, + "Norm": 102677.125, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 0 + ], + "Max": null, + "Min": null, + "Mean": null, + "Norm": null, + "requires_grad": true + }, + { + "type": "int", + "value": 187651464146080 + }, + { + "type": "int", + "value": 281472727719936 + }, + { + "type": "int", + "value": 201326592 + } + ] + }, + "NPU.npu_fusion_attention.0.backward": { + "input": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 1.5139579772949219e-05, + "Min": -1.5854835510253906e-05, + "Mean": -9.313225746154785e-09, + "Norm": 0.0064697265625, + "requires_grad": false + }, + null, + null, + null + ], + "output": [ + { + "type": 
"torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 1.3634562492370605e-06, + "Min": -1.3485550880432129e-06, + "Mean": -3.765308065339923e-10, + "Norm": 0.000286102294921875, + "requires_grad": false + }, + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.11434268951416e-06, + "Min": -2.8312206268310547e-06, + "Mean": 6.661338147750939e-14, + "Norm": 0.000244140625, + "requires_grad": false + }, + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 2.181529998779297e-05, + "Min": -2.205371856689453e-05, + "Mean": -9.313225746154785e-09, + "Norm": 0.001922607421875, + "requires_grad": false + }, + null + ] + } + } +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/stack.json b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..7a8f68284215cd5a14e436661c57150bc012f61a --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/dump_file/pytorch_data/stack.json @@ -0,0 +1,322 @@ +{ + "Tensor.__add__.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py, line 61, in tensor_op_template, \n return TensorOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/utils.py, line 176, in get_ltor_reset_masks_and_position_ids, \n attention_mask[b, 0, (i + 1):, :(i + 1)] = 0", + "File /path_to_net/PanGu/pretrain_gpt.py, line 174, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 243, in forward_step, \n tokens, labels, loss_mask, attention_mask, position_ids = get_batch(", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, 
in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Tensor.__add__.1.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py, line 61, in tensor_op_template, \n return TensorOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/utils.py, line 176, in get_ltor_reset_masks_and_position_ids, \n attention_mask[b, 0, (i + 1):, :(i + 1)] = 0", + "File /path_to_net/PanGu/pretrain_gpt.py, line 174, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 243, in forward_step, \n tokens, labels, loss_mask, attention_mask, position_ids = get_batch(", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, 
\n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Tensor.__or__.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py, line 61, in tensor_op_template, \n return TensorOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/tensor_parallel/layers.py, line 19, in vocab_parallel_embedding_forward, \n input_mask = (input_ < self.vocab_start_index) | \\", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 217, in forward, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 473, in forward, \n 
encoder_input = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File 
/path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Distributed.all_reduce.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py, line 68, in distributed_op_template, \n return DistributedOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 24, in _reduce, \n torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group())", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 223, in forward, \n return _reduce(input_)", + "File /path_to_package/site-packages/torch/autograd/function.py, line 539, in apply, \n return super().apply(*args, **kwargs) # type: ignore[misc]", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 436, in reduce_from_tensor_model_parallel_region, \n return _ReduceFromModelParallelRegion.apply(input_)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/tensor_parallel/layers.py, line 35, in vocab_parallel_embedding_forward, \n output = 
reduce_from_tensor_model_parallel_region(output_parallel)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 217, in forward, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 473, in forward, \n encoder_input = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": [ + "File 
/path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 217, in forward, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 473, in forward, \n encoder_input = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in 
_call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "NPU.npu_rms_norm.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_npu_custom.py, line 78, in npu_op_template, \n return NpuOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/fusions/rms_norm.py, line 26, in wrapper, \n return torch_npu.npu_rms_norm(x, self.weight, epsilon=self.eps)[0]", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1194, in forward, \n norm_output = self.input_norm(hidden_states)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = 
forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File 
/path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0": [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1194, in forward, \n norm_output = self.input_norm(hidden_states)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return 
self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + 
"File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0": [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1198, in forward, \n self.self_attention(", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 
1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, 
in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, 
**kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Torch.cos.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_torch.py, line 76, in torch_op_template, \n return TorchOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/fusions/rotary_pos_embedding.py, line 16, in wrapper, \n cos_ = torch.cos(freqs).to(t.dtype)", + "File /path_to_net/PanGu/pangu/core/fusions/rotary_pos_embedding.py, line 13, in wrapper, \n t = fn(t, freqs, rotary_interleaved)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/models/common/embeddings/rotary_pos_embedding.py, line 313, in apply_rotary_pos_emb, \n return apply_rotary_pos_emb_bshd(t, freqs, rotary_interleaved=config.rotary_interleaved)", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 738, in parallel_attention_forward, \n query_layer = apply_rotary_pos_emb(query_layer, q_pos_emb, self.config)", + "File /path_to_net/PanGu/pangu/model/transformer.py, line 97, in wrapper, \n return fn(self, hidden_states, attention_mask,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1198, in forward, \n self.self_attention(", + "File 
/path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = 
self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + 
"NPU.npu_fusion_attention.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_npu_custom.py, line 78, in npu_op_template, \n return NpuOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 525, in flash_self_attention_forward, \n output = torch_npu.npu_fusion_attention(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 757, in parallel_attention_forward, \n context_layer = self.core_attention_flash(query_layer, key_layer, value_layer, attention_mask)", + "File /path_to_net/PanGu/pangu/model/transformer.py, line 97, in wrapper, \n return fn(self, hidden_states, attention_mask,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1198, in forward, \n self.self_attention(", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n 
hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in 
_call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_data_scope_parser.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_data_scope_parser.py index 4d8bc9635eb569ec4ace6c659b5c69a6a9326645..4a54eacfa5160443ec78c099ca2be50c53f3b39d 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_data_scope_parser.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_data_scope_parser.py @@ -1,6 +1,14 @@ +import os import unittest -from msprobe.core.compare.data_scope_parser 
import * +import tempfile +from msprobe.core.compare.layer_mapping.data_scope_parser import ( + find_regard_scope, + find_stack_func_list, + get_dump_data_items, + get_stack_in_lines, +) from msprobe.core.common.const import Const +from msprobe.core.common.file_utils import load_yaml class TestModifyMapping(unittest.TestCase): @@ -41,6 +49,27 @@ class TestModifyMapping(unittest.TestCase): "File /home/user/envs/python3.9/site-packages/mindspore/nn/cell.py, line 2462, \ in _backward_hook_construct, \n outputs = self.construct(output, **kwargs)" ] + + self.frame_func_lines = [ + "File /home/user/envs/python3.9/site-packages/mindspore/common/tensor.py, line 2465, \ + in copy, x = x / 1.0", + "File /home/user/envs/python3.9/site-packages/mindformers/tensor_parallel/layers.py, line 1147, \ + in construct, \n masked_input = input_.copy() = self.vocab_start_index", + "File /home/user/envs/python3.9/site-packages/mindspore/nn/cell.py, line 2455, \ + in _backward_hook_construct, \n outputs = self.construct(outputs, **kwargs)", + "File /home/user/envs/python3.9/site-packages/mindspore/nn/cell.py, line 494, \ + in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /home/user/envs/python3.9/site-packages/mindspore/nn/cell.py, line 733, \ + in __call__, \n return self._run_construct(*args, *kwargs)" + ] + + self.simplified_lines = [ + "File /home/user/envs/python3.9/site-packages/mindspore/common/tensor.py, line 2465, \ + in copy, \n x = x / 1.0", + "File /home/user/envs/python3.9/site-packages/mindformers/tensor_parallel/layers.py, line 1147, \ + in construct, \n masked_input = input_.copy() = self.vocab_start_index" + ] + self.pt_construct = { "Functional.max_pool2d.0.forward": "Module.pool1.MaxPool2d.forward.0", "Functional.conv2d.1.forward": "Module.conv2.Conv2d.forward.0", @@ -48,11 +77,13 @@ class TestModifyMapping(unittest.TestCase): "Module.conv1.Conv2d.backward.1": None, "Functional.conv2d.2.forward": None } + self.ms_construct = { 
"Functional.add.0.forward": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.ScaleMaskSoftmax.forward.0", "Tensor.reshape.2.forward": "Cell.transformer_layers.0.attention.ParallelAttention.forward.0", "Functional.add.4.forward": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.ScaleMaskSoftmax.forward.0" } + self.ms_dump = { "data": {"Functional.add.0.forward": "", "Functional.add.4.forward": "", @@ -68,6 +99,7 @@ class TestModifyMapping(unittest.TestCase): "Module.conv1.Conv2d.backward.1": "" } } + self.ms_stack = { "Functional.add.0.forward": [ "File /home/user/envs/python3.9/site-packages/mindspore/nn/cell.py, line 507, \ @@ -160,6 +192,7 @@ class TestModifyMapping(unittest.TestCase): in _backward_hook_construct, \n outputs = self.construct(output, **kwargs)" ] } + self.pt_stack = { "Functional.max_pool2d.0.forward": [ "File /home/user/envs/lib/python3.9/site-packages/msprobe/pytorch/hook_module/wrap_Functional.py, line 97, \ @@ -203,6 +236,324 @@ class TestModifyMapping(unittest.TestCase): ] } + self.pt_dump_source = { + "task": "statistics", + "level": "mix", + "dump_data_dir": None, + "data": { + "Tensor.__add__.0.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 423, + "Min": 423, + "Mean": 423, + "Norm": 423, + "requires_grad": False + }, + { + "type": "int", + "value": 1 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 424, + "Min": 424, + "Mean": 424, + "Norm": 424, + "requires_grad": False + } + ] + }, + "Tensor.__add__.1.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 423, + "Min": 423, + "Mean": 423, + "Norm": 423, + "requires_grad": False + }, + { + "type": "int", + "value": 1 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [], + "Max": 424, + "Min": 
424, + "Mean": 424, + "Norm": 424, + "requires_grad": False + } + ] + }, + } + } + + self.ms_dump_source = { + "task": "statistics", + "level": "mix", + "dump_data_dir": None, + "data": { + # layer type + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": { + "input_args": [], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1024, + 1, + 6144 + ], + "Max": 0.421875, + "Min": -0.443359375, + "Mean": -0.0002346038818359375, + "Norm": 50.75 + }, + None + ] + }, + # Mint Type + "Mint.cos.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 4095.0, + "Min": 0.0, + "Mean": 238.66024780273438, + "Norm": 427910.46875 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": [ + 4096, + 1, + 1, + 128 + ], + "Max": 1.0000001192092896, + "Min": -1.0000001192092896, + "Mean": 0.13587358593940735, + "Norm": 528.9301147460938 + } + ] + }, + # Functional Type + "Functional.flash_attention_score.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.671875, + "Min": -3.765625, + "Mean": -0.00072479248046875, + "Norm": 1744.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.484375, + "Min": -3.0625, + "Mean": -0.00115966796875, + "Norm": 1728.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.28125, + "Min": -3.234375, + "Mean": -0.001861572265625, + "Norm": 1744.0 + }, + { + "type": "int", + "value": 12 + } + ], + "input_kwargs": { + "attn_mask": { + "type": "mindspore.Tensor", + "dtype": "UInt8", + "shape": [ + 1, + 1, + 4096, + 4096 + ], + "Max": 1.0, + "Min": 0.0, + "Mean": 0.876311182975769, + "Norm": 3834.326904296875 + }, + 
"scalar_value": { + "type": "float", + "value": 0.08838834764831843 + }, + "pre_tokens": { + "type": "int", + "value": 65536 + }, + "next_tokens": { + "type": "int", + "value": 0 + }, + "input_layout": { + "type": "str", + "value": "SBH" + } + }, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 2.734375, + "Min": -2.578125, + "Mean": -0.001373291015625, + "Norm": 266.0 + } + ] + }, + } + } + + self.ms_construct_source = { + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Mint.cos.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0", + "Functional.flash_attention_score.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0" + } + self.ms_stack_source = { + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, " + "in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, " + "in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, " + "in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, " + "in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative" + "/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, " + "in 
_backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, " + "in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative" + "/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, " + "in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, " + "in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, " + "in construct, \n lm_output = self.language_model(tokens,", + ], + "Mint.cos.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, " + "in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, " + "in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, " + "in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/rotary_pos_embedding.py, line 123, " + "in _apply_fused_rotary_pos_emb, \n cos_ = mint.cos(freqs).to(t.dtype)", + "File /path_to_net/PanGu_ms/pangu/model/rotary_pos_embedding.py, line 136, " + "in apply_rotary_pos_emb, \n return _apply_fused_rotary_pos_emb(t, freqs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 619, " + "in construct, \n query = apply_rotary_pos_emb(query, q_pos_emb, self.config)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, " + "in 
_backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, " + "in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, " + "in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, " + "in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, " + "in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative" + "/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, " + "in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, " + "in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative" + "/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, " + "in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, " + "in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, " + "in construct, \n lm_output = self.language_model(tokens,", + ], + "Functional.flash_attention_score.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output 
= self._run_forward_hook(inputs, output)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 637, in construct, \n output = ops.flash_attention_score(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + ] + } + def test_find_regard_scope(self): start_sign = "add" end_sign = "attention" @@ -213,7 +564,7 @@ class TestModifyMapping(unittest.TestCase): def test_find_stack_func_list(self): result = find_stack_func_list(self.lines) - self.assertEqual(result, ['attn_mask_add']) + self.assertEqual(result, ([], ['attn_mask_add'])) def test_get_dump_data_items_when_pt_valid_then_pass(self): result = get_dump_data_items( @@ -249,19 +600,21 @@ class TestModifyMapping(unittest.TestCase): expect_values = [(item.get("data_name"), item.get("construct_scope")) for item in expected_result] self.assertListEqual(actual_values, expect_values) - def test_get_dump_data_items_when_ms_valid_then_pass(self): + def test_get_dump_data_items_when_ms_functional_valid_then_pass(self): result = get_dump_data_items( self.ms_dump, self.ms_stack, self.ms_construct, Const.MS_FRAMEWORK) expected_result = [ { "data_name": "Functional.add.0.forward", - "construct_scope": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.ScaleMaskSoftmax.forward.0", - "full_scope": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.attn_mask_add.add" + "construct_scope": "Cell.transformer_layers.0.attention.core_attention." 
+ "scale_mask_softmax.ScaleMaskSoftmax.forward.0", + "full_scope": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.add" }, { "data_name": "Functional.add.4.forward", - "construct_scope": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.ScaleMaskSoftmax.forward.0", - "full_scope": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.attn_mask_add.add" + "construct_scope": "Cell.transformer_layers.0.attention.core_attention." + "scale_mask_softmax.ScaleMaskSoftmax.forward.0", + "full_scope": "Cell.transformer_layers.0.attention.core_attention.scale_mask_softmax.add" }, { "data_name": "Tensor.reshape.2.forward", @@ -269,6 +622,183 @@ class TestModifyMapping(unittest.TestCase): "full_scope": "Cell.transformer_layers.0.attention.reshape" } ] - actual_values = [(res.data_name, res.construct_scope) for res in result] - expect_values = [(item.get("data_name"), item.get("construct_scope")) for item in expected_result] + actual_values = [(res.data_name, res.construct_scope, res.full_scope) for res in result] + expect_values = [(item.get("data_name"), item.get("construct_scope"), item.get("full_scope")) for item in expected_result] + self.assertListEqual(actual_values, expect_values) + + def test_get_stack_in_lines_when_frame_func_then_pass(self): + result = get_stack_in_lines(self.simplified_lines) + expected_result = ["copy"] + self.assertEqual(result, expected_result) + + def test_find_stack_func_list_when_frame_func_then_pass(self): + result = find_stack_func_list(self.frame_func_lines) + self.assertEqual(result, (["copy"], [])) + + def test_get_dump_data_items_when_layer_and_mint_valid_then_pass(self): + result = get_dump_data_items(self.ms_dump_source, self.ms_stack_source, self.ms_construct_source, Const.MS_FRAMEWORK) + + expected_result = [ + { + "data_name": "Cell.network_with_loss.module.language_model.encoder.layers." 
+ "0.attention.ParallelAttention.forward.0", + "construct_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.ParallelTransformerLayer.forward.0", + "full_scope": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention" + }, + { + "data_name": "Mint.cos.0.forward", + "construct_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.ParallelAttention.forward.0", + "full_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.cos" + }, + { + "data_name": "Functional.flash_attention_score.0.forward", + "construct_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.ParallelAttention.forward.0", + "full_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.flash_attention_score" + } + ] + # result store DumpDataItem Object List + actual_values = [(res.data_name, res.construct_scope, res.full_scope) for res in result] + expect_values = [(item.get("data_name"), item.get("construct_scope"), item.get("full_scope")) for item in expected_result] + self.assertListEqual(actual_values, expect_values) + + # Test 2: stack and construct are empty + def test_get_dump_data_items_when_empty_construct_stack_then_pass(self): + dump = { + "data": { + "Functional.flash_attention_score.0.forward": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.671875, + "Min": -3.765625, + "Mean": -0.00072479248046875, + "Norm": 1744.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.484375, + "Min": -3.0625, + "Mean": -0.00115966796875, + "Norm": 1728.0 + }, + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 4096, + 1, + 1536 + ], + "Max": 3.28125, + "Min": -3.234375, + "Mean": -0.001861572265625, + "Norm": 1744.0 + }, + { + "type": "int", + "value": 12 + } + ], + } + } + } + stack 
= {} + construct = {} + framework = Const.MS_FRAMEWORK + + result = get_dump_data_items(dump, stack, construct, framework) + + self.assertEqual(result, []) + + # Test 3: No data in dump + def test_get_dump_data_items_when_empty_dump_stack_then_pass(self): + dump = {} + stack = {} + construct = { + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0" + } + framework = Const.MS_FRAMEWORK + + result = get_dump_data_items(dump, stack, construct, framework) + + self.assertEqual(result, []) + + # Test 4: output_path is provided with an empty dump, ensure an empty file is saved + def test_empty_dump_with_output_path(self): + dump = {} + stack = { + "Tensor.__add__.0.forward": ["stack data"] + } + construct = { + "Tensor.__add__.0.forward": "construct data" + } + framework = Const.MS_FRAMEWORK + output_path = "./" + + result = get_dump_data_items(dump, stack, construct, framework, output_path) + + # Check if file was saved + entries = os.listdir(output_path) + # filter files ending with _data.yaml + sign = f"{Const.MS_FRAMEWORK}_data" + data_yaml_files = [os.path.join(output_path, entry) for entry in entries if sign in entry] + saved_file = data_yaml_files[0] + yaml_info = load_yaml(saved_file) + os.remove(saved_file) + self.assertEqual(yaml_info, {}) + + # Test 5: valid dump with output_path, ensure the saved file matches + def test_get_dump_data_items_when_valid_with_output_path_then_pass(self): + output_path = "./" + result = get_dump_data_items(self.ms_dump_source, self.ms_stack_source, self.ms_construct_source, Const.MS_FRAMEWORK, output_path) + # Check if file was saved + entries = os.listdir(output_path) + sign = f"{Const.MS_FRAMEWORK}_data" + data_yaml_files = [os.path.join(output_path, entry) for entry in entries if sign in entry] + # filter files ending with _data.yaml + saved_file = data_yaml_files[0] + yaml_info = load_yaml(saved_file) + expected_result = [ + { + "data_name":
"Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.ParallelAttention.forward.0", + "construct_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.ParallelTransformerLayer.forward.0", + "full_scope": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention" + }, + { + "data_name": "Mint.cos.0.forward", + "construct_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.ParallelAttention.forward.0", + "full_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.cos" + }, + { + "data_name": "Functional.flash_attention_score.0.forward", + "construct_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.ParallelAttention.forward.0", + "full_scope": "Cell.network_with_loss.module.language_model.encoder.layers." + "0.attention.flash_attention_score" + } + ] + actual_values = [(name, res.get("construct_scope"), res.get("full_scope")) for name, res in yaml_info.items()] + expect_values = [(item.get("data_name"), item.get("construct_scope"), item.get("full_scope")) for item in expected_result] + os.remove(saved_file) self.assertListEqual(actual_values, expect_values) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_layer_mapping.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_layer_mapping.py index e43a47673454febcb853b046c7693bd8f92f8412..41d64fcfcbe7f7aef31eecbd97eeb700e2c52cd6 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_layer_mapping.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_layer_mapping.py @@ -1,28 +1,86 @@ import unittest -from msprobe.core.compare.layer_mapping import * -from msprobe.core.common.const import Const +from pathlib import Path + +from msprobe.core.compare.layer_mapping import ( + generate_api_mapping_by_layer_mapping, + generate_data_mapping_by_layer_mapping) class TestLayerMapping(unittest.TestCase): 
+ def setUp(self): + # Get the path of the current file + self.current_dir = Path(__file__).parent + self.npu_dump_json = self.current_dir / "dump_file/mindspore_data/dump.json" + self.bench_dump_json = self.current_dir / "dump_file/pytorch_data/dump.json" + self.layer_mapping = str(self.current_dir / "dump_file/layer_mapping.yaml") + + def test_generate_api_mapping_by_layer_mapping_then_pass(self): + # Example test to check if construct.json is processed correctly + res = generate_api_mapping_by_layer_mapping(self.npu_dump_json, self.bench_dump_json, self.layer_mapping) + expected_api_mapping = { + "Tensor.__add__.0.forward": "N/A", + "Tensor.__bool__.1.forward": "N/A", + "Tensor.__add__.1.forward": "Tensor.__add__.0.forward", + "Mint.logical_or.0.forward": "Tensor.__or__.0.forward", + "Distributed.all_reduce.0.forward": "Distributed.all_reduce.0.forward", + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + "Primitive.norm.RmsNorm.0.forward": "NPU.npu_rms_norm.0.forward", + "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.backward.0": "N/A", + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0", + "Mint.cos.0.forward": "Torch.cos.0.forward", + "Functional.flash_attention_score.0.forward": "NPU.npu_fusion_attention.0.forward", + "Functional.flash_attention_score.0.backward": "NPU.npu_fusion_attention.0.backward", + } + self.assertDictEqual(res, expected_api_mapping) - def test_generate_index_set_then_pass(self): - data0 = [0, 0] - data1 = [[0, 0, 0]] - data2 =
[[0, 0, 0], 0] - data3 = [[0, 0, 0], 0, [0], [0,[0,[0]]]] - data4 = [] - expect0 = {"0", "1"} - expect1 = {"0.0", "0.1", "0.2"} - expect2 = {"0.0", "0.1", "0.2", "1"} - expect3 = {"0.0", "0.1", "0.2", "1", "2.0", "3.0", "3.1.0", "3.1.1.0"} - expect4 = set() - res0 = generate_index_set(data0) - self.assertEqual(res0, expect0) - res1 = generate_index_set(data1) - self.assertEqual(res1, expect1) - res2 = generate_index_set(data2) - self.assertEqual(res2, expect2) - res3 = generate_index_set(data3) - self.assertEqual(res3, expect3) - res4 = generate_index_set(data4) - self.assertEqual(res4, expect4) + def test_generate_data_mapping_by_layer_mapping_then_pass(self): + input_param = {"npu_json_path": self.npu_dump_json, "bench_json_path": self.bench_dump_json} + res = generate_data_mapping_by_layer_mapping(input_param, self.layer_mapping) + excepted_data_mapping = { + "Tensor.__add__.0.forward.input.0": "N/A", + "Tensor.__add__.0.forward.input.1": "N/A", + "Tensor.__add__.0.forward.output.0": "N/A", + "Tensor.__bool__.1.forward.input.0": "N/A", + "Tensor.__bool__.1.forward.output.0": "N/A", + "Tensor.__add__.1.forward.input.0": "Tensor.__add__.0.forward.input.0", + "Tensor.__add__.1.forward.input.1": "Tensor.__add__.0.forward.input.1", + "Tensor.__add__.1.forward.output.0": "Tensor.__add__.0.forward.output.0", + "Mint.logical_or.0.forward.input.0": "Tensor.__or__.0.forward.input.0", + "Mint.logical_or.0.forward.input.1": "Tensor.__or__.0.forward.input.1", + "Mint.logical_or.0.forward.output.0": "Tensor.__or__.0.forward.output.0", + "Distributed.all_reduce.0.forward.input.0": "Distributed.all_reduce.0.forward.input.0", + "Distributed.all_reduce.0.forward.input.group": "Distributed.all_reduce.0.forward.input.group", + "Distributed.all_reduce.0.forward.output.0": "Distributed.all_reduce.0.forward.output.0", + "Distributed.all_reduce.0.forward.output.1": "Distributed.all_reduce.0.forward.output.1", + 
"Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0.input.0": "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0.input.0", + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0.output.0": "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0.output.0", + "Primitive.norm.RmsNorm.0.forward.input.0": "NPU.npu_rms_norm.0.forward.input.0", + "Primitive.norm.RmsNorm.0.forward.input.1": "NPU.npu_rms_norm.0.forward.input.1", + "Primitive.norm.RmsNorm.0.forward.output.0": "NPU.npu_rms_norm.0.forward.output.0", + "Primitive.norm.RmsNorm.0.forward.output.1": "NPU.npu_rms_norm.0.forward.output.1", + "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0.input.0": "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0.input.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0.output.0": "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0.output.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.backward.0.input.0": "N/A", + "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.backward.0.output.0": "N/A", + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0.output.0": "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0.output.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0.output.1": "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0.output.1", + "Mint.cos.0.forward.input.0": "Torch.cos.0.forward.input.0", + "Mint.cos.0.forward.output.0": 
"Torch.cos.0.forward.output.0", + "Functional.flash_attention_score.0.forward.input.0": "NPU.npu_fusion_attention.0.forward.input.0", + "Functional.flash_attention_score.0.forward.input.1": "NPU.npu_fusion_attention.0.forward.input.1", + "Functional.flash_attention_score.0.forward.input.2": "NPU.npu_fusion_attention.0.forward.input.2", + "Functional.flash_attention_score.0.forward.input.3": "NPU.npu_fusion_attention.0.forward.input.3", + "Functional.flash_attention_score.0.forward.input.attn_mask": "N/A", + "Functional.flash_attention_score.0.forward.input.scalar_value": "N/A", + "Functional.flash_attention_score.0.forward.input.pre_tokens": "N/A", + "Functional.flash_attention_score.0.forward.input.next_tokens": "N/A", + "Functional.flash_attention_score.0.forward.input.input_layout": "N/A", + "Functional.flash_attention_score.0.forward.output.0": "NPU.npu_fusion_attention.0.forward.output.0", + "Functional.flash_attention_score.0.backward.input.0": "NPU.npu_fusion_attention.0.backward.input.0", + "Functional.flash_attention_score.0.backward.output.0": "NPU.npu_fusion_attention.0.backward.output.0", + "Functional.flash_attention_score.0.backward.output.1": "NPU.npu_fusion_attention.0.backward.output.1", + "Functional.flash_attention_score.0.backward.output.2": "NPU.npu_fusion_attention.0.backward.output.2", + "Functional.flash_attention_score.0.backward.output.3": "NPU.npu_fusion_attention.0.backward.output.3", + } + self.assertDictEqual(res, excepted_data_mapping) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_compare.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_compare.py index bb021eea2f7deec5e7d180298d65a2ed32265bda..035fe0c53a4470b14e5b1ba44bb99b6de33a5d40 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_compare.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_compare.py @@ -1,9 +1,19 @@ # coding=utf-8 -import unittest -import tempfile import json +import os 
+import random +import shutil +import tempfile +import unittest +from unittest.mock import patch + +import numpy as np +import torch +import yaml -from msprobe.mindspore.compare.ms_compare import MSComparator, check_cross_framework +from msprobe.core.common.utils import CompareException +from msprobe.core.compare.acc_compare import ModeConfig +from msprobe.mindspore.compare.ms_compare import MappingConfig, MSComparator, check_cross_framework from msprobe.core.common.const import Const npu_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.1', @@ -46,7 +56,7 @@ bench_dict = {'op_name': ['Functional.conv2d.0.forward.input.0', 'Functional.con [0.19734230637550354, -0.18177609145641327, 0.007903944700956345], [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} -npu_op_name = ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.1', +npu_op_name_list = ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.forward.input.1', 'Functional.conv2d.0.forward.input.2', 'Functional.conv2d.0.forward.output'] npu_op_name_Mint = ['Mint.conv2d.0.forward.input.0', 'Mint.conv2d.0.forward.input.1', @@ -58,6 +68,16 @@ bench_op_name = ['Functional.conv2d.0.forward.input.0', 'Functional.conv2d.0.for data_mapping = {'Functional.flash_attention_score.4.forward.input.0': 'NPU.npu_fusion_attention.4.forward.input.0', 'Functional.flash_attention_score.4.forward.output.0': 'NPU.npu_fusion_attention.4.forward.output.0'} +npu_cell_dict = {'op_name': ['Cell.fc1.Dense.forward.0.input.0', 'Cell.fc1.Dense.forward.0.input.1', + 'Cell.fc1.Dense.forward.0.input.2', 'Cell.fc1.Dense.forward.0.output.0'], + 'input_struct': [('Float32', [1, 1, 28, 28]), ('Float32', [16, 1, 5, 5]), + ('Float32', [16])], + 'output_struct': [('Float32', [1, 16, 28, 28])], + 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], + [0.19919930398464203, -0.19974489510059357, 0.006269412115216255], + 
[0.19734230637550354, -0.18177609145641327, 0.007903944700956345], + [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], "stack_info": []} + npu_json_data = { 'task': 'statistics', 'level': 'L1', @@ -147,20 +167,110 @@ bench_json_data = { } +json_data_template = { + 'task': 'statistics', + 'level': 'L1', + 'dump_data_dir': '', + 'data': {} +} + + +def gen_data(is_ms=True): + type_value = 'mindspore.Tensor' if is_ms else 'torch.Tensor' + dtype_value = 'BFloat16' if is_ms else 'torch.bfloat16' + return { + 'type': type_value, + 'dtype': dtype_value, + 'shape': [4096, 1, 2048], + 'Max': random.uniform(0, 4), + 'Min': random.uniform(-4, 0), + 'Mean': random.random() / 10000, + 'Norm': random.random() * 1000 + } + + +def gen_api_mapping_test_data(need_user_mapping=False): + result_npu = json_data_template.copy() + result_bench = json_data_template.copy() + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig() + ms_comparator = MSComparator(mode_config, mapping_config) + + api_mapping = ms_comparator.load_internal_api() + ms_api_list = np.random.choice(list(api_mapping.keys()), size=5, replace=False).astype(str).tolist() + ms_api_data = {} + pt_api_data = {} + user_mapping = [] + for api in ms_api_list: + call_num = random.randint(1, 10) + direction = random.choice(['forward', 'backward']) + data_name_ms = api + '.' + str(call_num) + '.' + direction + data_name_pt = api_mapping.get(api) + '.' + str(call_num) + '.' 
+ direction + input_num = random.randint(1, 5) + output_num = random.randint(1, 5) + ms_data = {'input_args': [gen_data(True) for _ in range(input_num)], + 'output': [gen_data(True) for _ in range(output_num)]} + pt_data = {'input_args': [gen_data(False) for _ in range(input_num)], + 'output': [gen_data(False) for _ in range(output_num)]} + ms_api_data[data_name_ms] = ms_data + pt_api_data[data_name_pt] = pt_data + if need_user_mapping: + compare_num_input = random.randint(1, input_num) + compare_num_output = random.randint(1, output_num) + user_mapping_item = {'ms_api': api, + 'pt_api': api_mapping.get(api), + 'ms_args': sorted(np.random.choice(list(range(input_num)), size=compare_num_input, + replace=False).astype(int).tolist()), + 'pt_args': sorted(np.random.choice(list(range(input_num)), size=compare_num_input, + replace=False).astype(int).tolist()), + 'ms_output': sorted(np.random.choice(list(range(output_num)), size=compare_num_output, + replace=False).astype(int).tolist()), + 'pt_output': sorted(np.random.choice(list(range(output_num)), size=compare_num_output, + replace=False).astype(int).tolist())} + user_mapping.append(user_mapping_item) + ms_api_key_list = list(ms_api_data.keys()) + random.shuffle(ms_api_key_list) + result_npu['data'] = {k: ms_api_data.get(k) for k in ms_api_key_list} + pt_api_key_list = list(pt_api_data.keys()) + random.shuffle(pt_api_key_list) + result_bench['data'] = {k: pt_api_data.get(k) for k in pt_api_key_list} + return result_npu, result_bench, user_mapping + + class TestUtilsMethods(unittest.TestCase): def test_check_op_ms(self): + stack_mode = True + auto_analyze = True fuzzy_match = False - ms_comparator = MSComparator() - result = ms_comparator.check_op(npu_dict, bench_dict, fuzzy_match) - self.assertEqual(result, True) + dump_mode = Const.ALL + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig() + + ms_comparator = MSComparator(mode_config, mapping_config) + result 
= ms_comparator.check_op(npu_dict, bench_dict) + self.assertTrue(result) def test_data_mapping(self): - dump_mode = Const.SUMMARY stack_json_data = {} - ms_comparator = MSComparator(data_mapping=data_mapping) - npu_ops_all = ms_comparator.merge_data(npu_json_data, stack_json_data, dump_mode) + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig(data_mapping=data_mapping) + ms_comparator = MSComparator(mode_config, mapping_config) + + npu_ops_all = ms_comparator.merge_data(npu_json_data, stack_json_data) npu_ops_all_correct = { 'Functional.flash_attention_score.4.forward.input.0': { 'struct': ('BFloat16', [4096, 1, 2048]), @@ -177,7 +287,7 @@ class TestUtilsMethods(unittest.TestCase): } self.assertDictEqual(npu_ops_all, npu_ops_all_correct) - bench_ops_all = ms_comparator.merge_data(bench_json_data, stack_json_data, dump_mode) + bench_ops_all = ms_comparator.merge_data(bench_json_data, stack_json_data) bench_ops_all_correct = { 'NPU.npu_fusion_attention.4.forward.input.0': { 'struct': ('torch.bfloat16', [4096, 1, 2048]), @@ -194,7 +304,7 @@ class TestUtilsMethods(unittest.TestCase): } self.assertDictEqual(bench_ops_all, bench_ops_all_correct) - result = ms_comparator.get_accuracy(npu_ops_all, bench_ops_all, dump_mode) + result = ms_comparator.get_accuracy(npu_ops_all, bench_ops_all) result_correct = [['Functional.flash_attention_score.4.forward.input.0', 'NPU.npu_fusion_attention.4.forward.input.0', 'BFloat16', 'torch.bfloat16', [4096, 1, 2048], [4096, 1, 2048], 0.0, 0.0, @@ -214,33 +324,214 @@ class TestUtilsMethods(unittest.TestCase): self.compare_process_custom(dump_mode=Const.ALL) def compare_process_custom(self, dump_mode): - import os, tempfile, json data_path = tempfile.mkdtemp(prefix='dump_data', dir='/tmp') - npu_dump_path = os.path.join(data_path, 'npu_dump.json') - bench_dump_path = os.path.join(data_path, 
'bench_dump.json') - npu_stack_path = os.path.join(data_path, 'npu_stack.json') - - with open(npu_dump_path, 'w') as n_d_f, open(bench_dump_path, 'w') as b_d_f, open(npu_stack_path, 'w') as n_s_f: - json.dump(npu_json_data, n_d_f) - json.dump(bench_json_data, b_d_f) - json.dump({}, n_s_f) - ms_comparator = MSComparator() - result_df = ms_comparator.compare_process_custom((npu_dump_path, bench_dump_path, npu_stack_path), - False, dump_mode) - self.assertListEqual(result_df.values.tolist(), []) - - def test_check_cross_framework(self): - ms_data = { - "data_name": "Cell.model.language_model.encoder.layers.5.input_norm.FusedRMSNorm.forward.0.input.0.npy", - } - pt_data = { - "data_name": "Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0.input.0.pt", - } + try: + npu_dump_path = os.path.join(data_path, 'npu_dump.json') + bench_dump_path = os.path.join(data_path, 'bench_dump.json') + npu_stack_path = os.path.join(data_path, 'npu_stack.json') + + with open(npu_dump_path, 'w') as n_d_f: + json.dump(npu_json_data, n_d_f) + with open(bench_dump_path, 'w') as b_d_f: + json.dump(bench_json_data, b_d_f) + with open(npu_stack_path, 'w') as n_s_f: + json.dump({}, n_s_f) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig() + + ms_comparator = MSComparator(mode_config, mapping_config) + result_df = ms_comparator.compare_process_custom((npu_dump_path, bench_dump_path, npu_stack_path)) + self.assertListEqual(result_df.values.tolist(), []) + finally: + shutil.rmtree(data_path) + + @patch('msprobe.mindspore.compare.ms_compare.detect_framework_by_dump_json') + def test_check_cross_framework_valid_pytorch(self, mock_detect_framework): + mock_detect_framework.return_value = Const.PT_FRAMEWORK + + result = check_cross_framework("dummy_path") + + self.assertTrue(result) + + 
@patch('msprobe.mindspore.compare.ms_compare.detect_framework_by_dump_json')
+    def test_check_cross_framework_invalid_framework(self, mock_detect_framework):
+        mock_detect_framework.return_value = Const.MS_FRAMEWORK
+
+        result = check_cross_framework("dummy_path")
+
+        self.assertFalse(result)
+
+    def test_compare_process(self):
+        data_path = tempfile.mkdtemp(prefix='dump_data', dir='/tmp')
+        try:
+            npu_dump_path = os.path.join(data_path, 'npu_dump.json')
+            bench_dump_path = os.path.join(data_path, 'bench_dump.json')
+            npu_stack_path = os.path.join(data_path, 'npu_stack.json')
+
+            npu_data, bench_data, _ = gen_api_mapping_test_data()
+            with open(npu_dump_path, 'w', encoding='utf8') as n_d_f:
+                json.dump(npu_data, n_d_f)
+            with open(bench_dump_path, 'w', encoding='utf8') as b_d_f:
+                json.dump(bench_data, b_d_f)
+            with open(npu_stack_path, 'w', encoding='utf8') as n_s_f:
+                json.dump({}, n_s_f)
+
+            stack_mode = True
+            auto_analyze = True
+            fuzzy_match = False
+            dump_mode = Const.SUMMARY
+            mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode)
+            mapping_config = MappingConfig(api_mapping=True)
+
+            ms_comparator = MSComparator(mode_config, mapping_config)
+            result_df = ms_comparator.compare_process((npu_dump_path, bench_dump_path, npu_stack_path))
+            self.assertTrue((result_df['Bench Name'] != 'N/A').all())
+        finally:
+            shutil.rmtree(data_path)
+
+    def test_compare_process_with_customize_api_mapping(self):
+        data_path = tempfile.mkdtemp(prefix='dump_data', dir='/tmp')
+        try:
+            npu_dump_path = os.path.join(data_path, 'npu_dump.json')
+            bench_dump_path = os.path.join(data_path, 'bench_dump.json')
+            npu_stack_path = os.path.join(data_path, 'npu_stack.json')
+            user_mapping_path = os.path.join(data_path, 'user_mapping.yaml')
+
+            npu_data, bench_data, user_mapping = gen_api_mapping_test_data(True)
+            with open(npu_dump_path, 'w', encoding='utf8') as n_d_f:
+                json.dump(npu_data, n_d_f)
+            with open(bench_dump_path, 'w', encoding='utf8') as b_d_f:
+
json.dump(bench_data, b_d_f) + with open(npu_stack_path, 'w', encoding='utf8') as n_s_f: + json.dump({}, n_s_f) + with open(user_mapping_path, 'w', encoding='utf8') as u_m_f: + yaml.safe_dump(user_mapping, u_m_f) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig(api_mapping=user_mapping_path) + + ms_comparator = MSComparator(mode_config, mapping_config) + result_df = ms_comparator.compare_process((npu_dump_path, bench_dump_path, npu_stack_path)) + + user_mapping_dict = {} + for i in user_mapping: + user_mapping_dict[i.get('ms_api')] = {'input': i.get('ms_args'), 'output': i.get('ms_output')} + match_set = set() + for key in npu_data.get('data').keys(): + matched_dict = user_mapping_dict.get(key.rsplit('.', 2)[0]) + match_set.update({key + '.input.' + str(i) for i in matched_dict.get('input')}) + match_set.update({key + '.output.' + str(i) for i in matched_dict.get('output')}) + + self.assertTrue((result_df.loc[result_df['NPU Name'].isin(match_set), 'Bench Name'] != 'N/A').all()) + self.assertTrue((result_df.loc[~result_df['NPU Name'].isin(match_set), 'Bench Name'] == 'N/A').all()) + finally: + shutil.rmtree(data_path) + + def test_load_internal_api(self): + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig() + + ms_comparator = MSComparator(mode_config, mapping_config) + api_dict = ms_comparator.load_internal_api() + self.assertEqual(api_dict['Functional.abs'], 'Torch.abs') + + def test_process_cell_mapping(self): + self.base_test_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) + self.input_dir = os.path.join(self.base_test_dir, 'resources') + cell_mapping_path = os.path.join(self.input_dir, 'common', 'cell_mapping.yaml') + + 
stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.SUMMARY + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig(cell_mapping=cell_mapping_path) + + ms_comparator = MSComparator(mode_config, mapping_config) + npu_op_name = ms_comparator.process_cell_mapping(npu_cell_dict.get('op_name')[0]) + self.assertEqual(npu_op_name, 'Module.fc1.Linear.forward.0.input.0') + + def test_read_npy_data(self): + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig() + + ms_comparator = MSComparator(mode_config, mapping_config) + + self.temp_file = tempfile.NamedTemporaryFile(suffix='.pt') + tensor = torch.Tensor([1, 2, 3]) + filename = self.temp_file.name.split('/')[-1] + torch.save(tensor, self.temp_file.name) + result = ms_comparator.read_npy_data('/tmp', filename, load_pt_file=True) + self.assertTrue(np.array_equal(result, np.array([1, 2, 3]))) + self.temp_file.close() + + self.temp_file = tempfile.NamedTemporaryFile(suffix='.npy') + tensor = np.array([1, 2, 3]) + filename = self.temp_file.name.split('/')[-1] + np.save(self.temp_file.name, tensor) + result = ms_comparator.read_npy_data('/tmp', filename, load_pt_file=False) + self.assertTrue(np.array_equal(result, np.array([1, 2, 3]))) + self.temp_file.close() + + def test_process_internal_api_mapping(self): + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig(api_mapping=1) + + ms_comparator = MSComparator(mode_config, mapping_config) + + npu_op_name = "Mint.addcmul.0.forward.input.0" + result = ms_comparator.process_internal_api_mapping(npu_op_name) + self.assertEqual(result, "Torch.addcmul.0.forward.input.0") + + npu_op_name = 
"MintFunctional.addcmul.0.forward.input.0" + result = ms_comparator.process_internal_api_mapping(npu_op_name) + self.assertEqual(result, "Functional.addcmul.0.forward.input.0") + + npu_op_name = "Functional.abs" + result = ms_comparator.process_internal_api_mapping(npu_op_name) + self.assertEqual(result, "Torch.abs") + + def test_get_api_name(self): + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + mapping_config = MappingConfig() + + ms_comparator = MSComparator(mode_config, mapping_config) + + api_list = ["Functional", "absolute", "0", "forward", "input", "0"] + result = ms_comparator.get_api_name(api_list) + self.assertEqual(result, "Functional.absolute") - def check_data(data): - with tempfile.NamedTemporaryFile(mode='w+', suffix='.json', encoding='utf-8', delete=True) as temp_file: - json.dump(data, temp_file, ensure_ascii=False, indent=4) - temp_file.flush() - return check_cross_framework(temp_file.name) - self.assertFalse(check_data(ms_data)) - self.assertTrue(check_data(pt_data)) + api_list = ["Mint"] + with self.assertRaises(CompareException): + ms_comparator.get_api_name(api_list) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_graph_compare.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_graph_compare.py index e3fd9348efe7dd4df0a6db2cd52a45f4757dae01..c2e7c9368c3f049511657469ebb16388015b621a 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_graph_compare.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_ms_graph_compare.py @@ -78,7 +78,7 @@ class TestMsGraphCompare(unittest.TestCase): result_correct = ( f"[['{npu_file_path}', '{bench_file_path}', dtype('float16'), dtype('float16'), (10, 10), (10, 10), " - f"44.0, 44.0, 44.0, inf, 44.0, 44.0, 44.0, inf, 'Yes', '', 1.0, 0.0, 0.0, 1.0, 1.0]]") + f"44.0, 44.0, 44.0, inf, 
44.0, 44.0, 44.0, inf, 'Yes', '', 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]]") self.assertNotEqual(len(files), 0) self.assertEqual(result, result_correct) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_post_process.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_post_process.py new file mode 100644 index 0000000000000000000000000000000000000000..3286296bbf35f99457f9b9a201fd843894366120 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/compare/test_post_process.py @@ -0,0 +1,300 @@ +import pytest +import unittest + + +from msprobe.core.common.utils import CompareException +from msprobe.core.compare.layer_mapping.data_scope_parser import ( + DumpDataItem, +) +from msprobe.core.compare.layer_mapping.postprocess_pass import ( + postprocess_pass, + backward_pass, + renumber_index_pass, +) +from msprobe.core.common.const import Const + + +class TestModifyMapping(unittest.TestCase): + def setUp(self): + pt_name1 = "Distributed.all_reduce.0.forward" + pt_construct_info1 = "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0" + pt_stack_info1 = [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py, line 68, in distributed_op_template, \n return DistributedOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 24, in _reduce, \n torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group())", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 223, in forward, \n return _reduce(input_)", + "File /path_to_package/site-packages/torch/autograd/function.py, line 539, in apply, \n return super().apply(*args, **kwargs) # type: ignore[misc]", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 436, in reduce_from_tensor_model_parallel_region, \n return 
_ReduceFromModelParallelRegion.apply(input_)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/tensor_parallel/layers.py, line 35, in vocab_parallel_embedding_forward, \n output = reduce_from_tensor_model_parallel_region(output_parallel)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 217, in forward, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 473, in forward, \n encoder_input = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = 
self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ] + 
pt_item1 = DumpDataItem(Const.PT_FRAMEWORK) + pt_item1.set(pt_name1, pt_construct_info1, pt_stack_info1) + + pt_name2 = "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0" + pt_construct_info2 = "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0" + pt_stack_info2 = [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1198, in forward, \n self.self_attention(", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File 
/path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File 
/path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ] + pt_item2 = DumpDataItem(Const.PT_FRAMEWORK) + pt_item2.set(pt_name2, pt_construct_info2, pt_stack_info2) + + """ + ---------------------------------------------------------- + Normal Case Data + ---------------------------------------------------------- + """ + ms_name1 = "Distributed.all_reduce.0.forward" + ms_construct_info1 = "Cell.network_with_loss.module.language_model.embedding.word_embeddings.reduce_from_mp_region.ReduceFromModelParallelRegion.forward.0" + ms_stack_info1 = [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/mappings.py, line 241, in construct, \n output = comm_func.all_reduce(input_, group=self.tp_group)[0]", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 785, in 
_call_custom_bprop, \n output = self.construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2450, in _backward_hook_construct, \n outputs = self._call_custom_bprop(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/layers.py, line 1168, in construct, \n output = self.reduce_from_mp_region(output_parallel)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2455, in _backward_hook_construct, \n outputs = self.construct(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 226, in construct, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 554, in construct, \n text_embedding_out = 
self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n 
output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ] + ms_item1 = DumpDataItem(Const.MS_FRAMEWORK) + ms_item1.set(ms_name1, ms_construct_info1, ms_stack_info1) + ms_item1.type_name = "all_reduce" + ms_item1.layer_scope = "layer_2" + ms_item1.full_scope = "Cell.network_with_loss.module.language_model.embedding.word_embeddings.reduce_from_mp_region.ReduceFromModelParallelRegion.all_reduce.0" + + """ + ---------------------------------------------------------- + Used for renumber layer id + ---------------------------------------------------------- + """ + ms_name2 = "Cell.network_with_loss.module.language_model.encoder.layers.1.attention.ParallelAttention.forward.0" + ms_construct_info2 = "Cell.network_with_loss.module.language_model.encoder.layers.1.ParallelTransformerLayer.forward.0" + ms_stack_info2 = [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = 
self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return 
self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = 
forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ] + ms_item2 = DumpDataItem(Const.MS_FRAMEWORK) + ms_item2.set(ms_name2, ms_construct_info2, ms_stack_info2) + + """ + ---------------------------------------------------------- + backward sample data used for backward_pass + ---------------------------------------------------------- + """ + ms_name3 = "Functional.flash_attention_score.0.backward" + ms_construct_info3 = "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.backward.0" + ms_stack_info3 = [] + ms_item3 = DumpDataItem(Const.MS_FRAMEWORK) + ms_item3.set(ms_name3, ms_construct_info3, ms_stack_info3) + + + """ + ---------------------------------------------------------- + corresponding forward for backward sample data used for backward_pass + ---------------------------------------------------------- + """ + ms_name4 = "Functional.flash_attention_score.0.forward" + ms_construct_info4 = "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0" + ms_stack_info4 = [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, 
**kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 637, in construct, \n output = ops.flash_attention_score(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = 
self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = 
self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ] + ms_item4 = DumpDataItem(Const.MS_FRAMEWORK) + ms_item4.set(ms_name4, ms_construct_info4, ms_stack_info4) + self.ms_data_items = [ms_item1, ms_item2, ms_item3, ms_item4] + self.pt_data_items = [pt_item1, pt_item2] + + def test_backward_pass_when_ms_valid_then_pass(self): + name2item = {data_item.data_name : data_item for data_item in self.ms_data_items} + backward_pass(self.ms_data_items, name2item) + expected_stack_scope = self.ms_data_items[3].stack_scope + + self.assertEqual(self.ms_data_items[2].stack_scope, 
expected_stack_scope) + self.assertEqual(self.ms_data_items[2].full_scope, self.ms_data_items[3].full_scope) + self.assertEqual(self.ms_data_items[2].layer_scope, self.ms_data_items[3].layer_scope) + + def test_backward_pass_when_none_then_pass(self): + with self.assertRaises(CompareException) as context: + non_data = DumpDataItem(Const.MS_FRAMEWORK) + non_data.set('', '', []) + non_datas = [non_data, non_data] + name2item = {data_item.data_name : data_item for data_item in non_datas} + backward_pass(non_datas, name2item) + self.assertTrue(isinstance(context.exception, CompareException)) + self.assertEqual(context.exception.code, CompareException.INVALID_DATA_ERROR) + + def test_renumber_index_pass_when_ms_valid_then_pass(self): + suffix = "layers" + type_name = "ParallelTransformer" + renumber_index_pass(self.ms_data_items, type_name, suffix) + self.assertEqual(self.ms_data_items[1].full_scope, "Cell.network_with_loss.module.language_model.encoder.layers.1.attention") + + def test_postprocess_pass_when_ms_valid_then_pass(self): + name2item = {data_item.data_name : data_item for data_item in self.ms_data_items} + try: + postprocess_pass(self.ms_data_items, name2item) + except Exception as e: + self.fail(f"Unexpected exception raised: {e}") + + def test_postprocess_pass_when_pt_valid_then_pass(self): + name2item = {data_item.data_name : data_item for data_item in self.pt_data_items} + try: + postprocess_pass(self.pt_data_items, name2item) + except Exception as e: + self.fail(f"Unexpected exception raised: {e}") + + def test_postprocess_pass_when_non_data_then_pass(self): + with self.assertRaises(CompareException) as context: + non_data = DumpDataItem(Const.MS_FRAMEWORK) + non_data.set('', '', []) + non_datas = [non_data, non_data] + name2item = {data_item.data_name : data_item for data_item in non_datas} + postprocess_pass(non_datas, name2item) + self.assertTrue(isinstance(context.exception, CompareException)) + self.assertEqual(context.exception.code, 
CompareException.INVALID_DATA_ERROR) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_forward_backward_dump_end.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_forward_backward_dump_end.py deleted file mode 100644 index 5c8867a40b5cdc6ca7aca829c89289c71dbe4f29..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_forward_backward_dump_end.py +++ /dev/null @@ -1,137 +0,0 @@ -import mindspore -from mindspore import Tensor -import mindspore.nn as nn -import mindspore.ops as ops -import numpy as np -import os -from mindspore import mint -import shutil -import json -from unittest import TestCase -import hashlib -from msprobe.core.common.file_utils import FileOpen - -file_path = os.path.abspath(__file__) -directory = os.path.dirname(file_path) -config_json_path = os.path.join(directory, "config.json") - -from msprobe.mindspore import PrecisionDebugger - -def main(): - PrecisionDebugger._instance = None - PrecisionDebugger.initialized = False - debugger = PrecisionDebugger(config_json_path) - num_classes = 10 - - class SimplifiedAlexNet(nn.Cell): - def __init__(self, num_classes=10, channel=3): - super(SimplifiedAlexNet, self).__init__() - # 第一层卷积 - self.conv1 = nn.Conv2d(channel, 96, 11, stride=4, pad_mode='same') - self.relu1 = nn.ReLU() - self.max_pool2d = nn.MaxPool2d(kernel_size=3, stride=2) - - # 第二层卷积 - self.conv2 = nn.Conv2d(96, 256, 5, pad_mode='same') - self.relu2 = nn.ReLU() - - # 全连接层 - self.flatten = nn.Flatten() - self.fc1 = nn.Dense(13*13*256, num_classes) - - def construct(self, x): - # 第一层卷积 + ReLU + MaxPool - x = self.conv1(x) - x = self.relu1(x) - x = self.max_pool2d(x) - x = mint.add(x, 0.5) - x = ops.mul(x, 1.2) - debugger.forward_backward_dump_end() - - # 第二层卷积 + ReLU - x = self.conv2(x) - x = self.relu2(x) - x = self.max_pool2d(x) - x = mint.add(x, 0.5) - x = ops.mul(x, 1.2) - - # 展平 + 全连接层 - x = self.flatten(x) - x = 
self.fc1(x) - return x - - net = SimplifiedAlexNet(num_classes=num_classes) - optimizer = nn.SGD(net.trainable_params(), learning_rate=0.01) - criterion = nn.MSELoss() - - def forward_fn(data, label): - out = net(data) - loss = criterion(out, label) - return loss - - grad_fn = mindspore.value_and_grad(forward_fn, None, optimizer.parameters) - - def train_step(data, label): - loss, grads = grad_fn(data, label) - optimizer(grads) - return loss - - batch_size = 1 - data = np.random.normal(1, 1, (batch_size, 3, 227, 227)).astype(np.float32) - label = np.random.randint(0, num_classes, (batch_size,)).astype(np.int32) - data = Tensor(data) - label = Tensor(label) - - for i in range(3): - debugger.start(net) - loss = train_step(data, label) - print(f"step: {i}, loss: {loss}") - debugger.stop() - debugger.step() - -def save_dict_as_json(data, json_file_path): - with FileOpen(json_file_path, 'w') as f: - json.dump(data, f, ensure_ascii=False, indent=4) - print(f"字典已保存为json文件: {json_file_path}") - -class TestDump(TestCase): - def test_gradient_monitor_L2(self): - output_path = os.path.join(directory, "output") - if os.path.isfile(config_json_path): - os.remove(config_json_path) - if os.path.isdir(output_path): - shutil.rmtree(output_path) - - config_dict = { - "task": "statistics", - "dump_path": output_path, - "rank": [], - "step": [], - "level": "L1", - "statistics": { - "scope": [], - "list":[], - "data_mode": ["all"], - "summary_mode": "statistics" - }, - } - save_dict_as_json(config_dict, config_json_path) - main() - - #check - target_keys = ["Mint.add.0.forward", "Functional.mul.0.forward", - "Mint.add.0.backward", "Functional.mul.0.backward"] - for root, _, files in os.walk(output_path): - for file in files: - if file == 'dump.json': - dump_json_path = os.path.join(root, file) - with open(dump_json_path, 'r', encoding='utf-8') as file: - # 使用json.load()函数读取文件内容并转换为字典 - data_dict = json.load(file) - data_dict = data_dict.get("data") - for key in target_keys: - 
self.assertTrue(key in data_dict, f"{key} not found in dump.json") - if os.path.isfile(config_json_path): - os.remove(config_json_path) - if os.path.isdir(output_path): - shutil.rmtree(output_path) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_jit_dump.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_jit_dump.py index fa23ebb940d90bb50fc15bc66d63bdbdaf4f46b4..819c19daed948c63c43a76143a6b674299314ff6 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_jit_dump.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_jit_dump.py @@ -1,185 +1,30 @@ -import numpy as np +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest import os from unittest.mock import patch, MagicMock -import mindspore as ms + import mindspore.common.dtype as mstype -import mindspore.nn as nn -import mindspore.ops as ops +import numpy as np from mindspore.common.tensor import Tensor -from mindspore import jit -from msprobe.mindspore import PrecisionDebugger -from msprobe.core.common_config import CommonConfig, BaseConfig -from msprobe.mindspore.dump.jit_dump import JitDump, dump_jit -import unittest -from unittest.mock import MagicMock, patch - -def conv(in_channels, out_channels, kernel_size, stride=1, padding=0, pad_mode="valid", has_bias=True): - return nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, -has_bias=has_bias, pad_mode=pad_mode) - -def fc_with_initialize(input_channels, out_channels, has_bias=True): - return nn.Dense(input_channels, out_channels, has_bias=has_bias) - -class DataNormTranspose(nn.Cell): - """Normalize an tensor image with mean and standard deviation. - - Given mean: (R, G, B) and std: (R, G, B), - will normalize each channel of the torch.*Tensor, i.e. - channel = (channel - mean) / std - - Args: - mean (sequence): Sequence of means for R, G, B channels respectively. - std (sequence): Sequence of standard deviations for R, G, B channels - respectively. 
- """ - - def __init__(self, dataset_name='imagenet'): - super(DataNormTranspose, self).__init__() - # Computed from random subset of ImageNet training images - if dataset_name == 'imagenet': - self.mean = Tensor(np.array([0.485 * 255, 0.456 * 255, 0.406 * 255]).reshape((1, 1, 1, 3)), mstype.float32) - self.std = Tensor(np.array([0.229 * 255, 0.224 * 255, 0.225 * 255]).reshape((1, 1, 1, 3)), mstype.float32) - else: - self.mean = Tensor(np.array([0.4914, 0.4822, 0.4465]).reshape((1, 1, 1, 3)), mstype.float32) - self.std = Tensor(np.array([0.2023, 0.1994, 0.2010]).reshape((1, 1, 1, 3)), mstype.float32) - - def construct(self, x): - x = (x - self.mean) / self.std - x = ops.transpose(x, (0, 3, 1, 2)) - return x - -class AlexNet(nn.Cell): - """ - Alexnet - """ - - def __init__(self, num_classes=10, channel=3, phase='train', include_top=True, dataset_name='imagenet'): - super(AlexNet, self).__init__() - self.data_trans = DataNormTranspose(dataset_name=dataset_name) - self.conv1 = conv(channel, 64, 11, stride=4, pad_mode="same", has_bias=True) - self.conv2 = conv(64, 128, 5, pad_mode="same", has_bias=True) - self.conv3 = conv(128, 192, 3, pad_mode="same", has_bias=True) - self.conv4 = conv(192, 256, 3, pad_mode="same", has_bias=True) - self.conv5 = conv(256, 256, 3, pad_mode="same", has_bias=True) - self.relu = nn.ReLU() - nn.BatchNorm2d - self.max_pool2d = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode='valid') - self.include_top = include_top - if self.include_top: - dropout_ratio = 0.65 - if phase == 'test': - dropout_ratio = 1.0 - self.flatten = nn.Flatten() - self.fc1 = fc_with_initialize(6 * 6 * 256, 4096) - self.fc2 = fc_with_initialize(4096, 4096) - self.fc3 = fc_with_initialize(4096, num_classes) - self.dropout = nn.Dropout(p=1 - dropout_ratio) - @jit - def construct(self, x): - """define network""" - x = self.data_trans(x) - x = self.conv1(x) - x = self.relu(x) - x = self.max_pool2d(x) - x = self.conv2(x) - x = self.relu(x) - x = self.max_pool2d(x) - x = 
self.conv3(x) - x = self.relu(x) - x = self.conv4(x) - x = self.relu(x) - x = self.conv5(x) - x = self.relu(x) - x = self.max_pool2d(x) - if not self.include_top: - return x - x = self.flatten(x) - x = self.fc1(x) - x = self.relu(x) - x = self.dropout(x) - x = self.fc2(x) - x = self.relu(x) - x = self.dropout(x) - x = self.fc3(x) - x = ops.celu(x, 2.0) - return x - -if __name__ == "__main__": - json_config = { - "task": "statistics", - "dump_path": "/absolute_path", - "rank": [], - "step": [], - "level": "L1" - } - - common_config = CommonConfig(json_config) - task_config = BaseConfig(json_config) - mock_parse_json_config = MagicMock() - mock_parse_json_config.return_value = [common_config, task_config] - debugger = PrecisionDebugger() - ms.set_context(mode=ms.PYNATIVE_MODE) - net = AlexNet() - debugger.start() - ops.relu(ms.Tensor(np.random.random([1, 227, 227, 3]).astype(np.float32))) - grad_net = ms.grad(net, None, net.trainable_params()) - output = grad_net(ms.Tensor(np.random.random([1, 227, 227, 3]).astype(np.float32))) - debugger.stop() - expected_file_count = 5 - dir_path = "/absolute_path/step0/rank/dump_tensor_data/" - actual_file_count = len(os.listdir(dir_path)) - assert actual_file_count == expected_file_count - - -class SimpleNet(nn.Cell): - def __init__(self): - super(SimpleNet, self).__init__() - self.relu = ops.relu +from msprobe.mindspore.dump.jit_dump import JitDump, dump_jit - def construct(self, x): - return self.relu(x) class TestJitDump(unittest.TestCase): - @patch('msprobe.mindspore.dump.hook_cell.api_registry.api_register.api_set_ori_func') - @patch('msprobe.mindspore.dump.hook_cell.api_registry.api_register.api_set_hook_func') - @patch.object(JitDump, 'need_dump', return_value=True) - def test_jitdump_forward(self, mock_need_dump, mock_set_ori_func, mock_set_hook_func): - # Set up configurations - json_config = { - "task": "statistics", - "dump_path": "/absolute_path", - "rank": [], - "step": [], - "level": "L1" - } - - common_config = 
CommonConfig(json_config) - task_config = BaseConfig(json_config) - debugger = MagicMock() - - # Setup MindSpore context and JitDump class - ms.set_context(mode=ms.PYNATIVE_MODE) - - def identity_fn(x): - return x - - jit_dump_instance = JitDump(fn=identity_fn, ms_create_time=0) - - # Set collector and config for JitDump - JitDump.set_config(common_config) - JitDump.set_data_collector(MagicMock()) - - net = SimpleNet() - input_tensor = Tensor(np.random.random([1, 227, 227, 3]).astype(np.float32)) - output_tensor = net(input_tensor) - - # Call the JitDump instance and validate expectations - jit_dump_instance(input_tensor) - - # Assertions to ensure required methods are called - self.assertTrue(mock_set_ori_func.called, "api_set_ori_func should be called during forward pass.") - @patch('os.getpid', return_value=12345) def test_dump_jit(self, mock_getpid): in_feat = Tensor(np.array([1, 2, 3]), mstype.float32) @@ -204,7 +49,3 @@ class TestJitDump(unittest.TestCase): expected_file_count = 5 actual_file_count = len(os.listdir(dir_path)) self.assertEqual(actual_file_count, expected_file_count) - - -if __name__ == "__main__": - unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_debugger_config.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_debugger_config.py index f4f18914e45f146513d1d0f132b1fa61e97c13c4..033b0c1ea5769c3f1f8e19dd8b45c48918e15814 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_debugger_config.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_debugger_config.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
 # You may obtain a copy of the License at
@@ -13,13 +12,13 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
+
 import unittest
 from unittest.mock import patch
 
 from msprobe.core.common.const import Const
-from msprobe.mindspore.common.const import FreeBenchmarkConst
 from msprobe.core.common_config import CommonConfig, BaseConfig
+from msprobe.mindspore.common.const import FreeBenchmarkConst
 from msprobe.mindspore.debugger.debugger_config import DebuggerConfig
 
@@ -55,3 +54,12 @@ class TestDebuggerConfig(unittest.TestCase):
             self.assertEqual(str(context.exception),
                              "pert_mode must be improve_precision or empty when handler_type is fix, "
                              f"but got {FreeBenchmarkConst.ADD_NOISE}.")
+
+        task_config.handler_type = FreeBenchmarkConst.FIX
+        task_config.pert_mode = FreeBenchmarkConst.DEFAULT_PERT_TYPE
+        task_config.fuzz_stage = Const.BACKWARD
+        with self.assertRaises(Exception) as context:
+            DebuggerConfig(common_config, task_config)
+        self.assertEqual(str(context.exception),
+                         "handler_type must be check or empty when fuzz_stage is backward, "
+                         f"but got {task_config.handler_type}.")
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_precision_debugger.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_precision_debugger.py
index 59f9a09815bca541c664fe3d0dc28615b71778e7..066ff537ce6fba12f712ae3d4681115499be35a6 100644
--- a/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_precision_debugger.py
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/debugger/test_ms_precision_debugger.py
@@ -1,7 +1,6 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
@@ -13,16 +12,18 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
+
 import unittest
 from unittest.mock import patch, MagicMock
 
 from msprobe.core.common_config import CommonConfig, BaseConfig
 from msprobe.core.common.const import Const, MsgConst
+from msprobe.mindspore.cell_processor import CellProcessor
+from msprobe.mindspore.common.const import Const as MsConst
 from msprobe.mindspore.debugger.debugger_config import DebuggerConfig
 from msprobe.mindspore.debugger.precision_debugger import PrecisionDebugger
+from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell
 from msprobe.mindspore.runtime import Runtime
-from msprobe.mindspore.common.const import Const as MsConst
 
 
 class TestPrecisionDebugger(unittest.TestCase):
@@ -30,6 +31,7 @@ class TestPrecisionDebugger(unittest.TestCase):
     @patch("msprobe.mindspore.debugger.debugger_config.create_directory")
     def test_start(self, _):
         PrecisionDebugger._instance = None
+
         class Handler:
             called = False
 
@@ -41,7 +43,8 @@ class TestPrecisionDebugger(unittest.TestCase):
             "dump_path": "/absolute_path",
             "rank": [],
             "step": [],
-            "level": "L1"
+            "level": "L1",
+            "async_dump": False
         }
 
         common_config = CommonConfig(json_config)
@@ -52,7 +55,8 @@ class TestPrecisionDebugger(unittest.TestCase):
         mock_parse_json_config = MagicMock()
         with patch("msprobe.mindspore.debugger.precision_debugger.parse_json_config", new=mock_parse_json_config), \
                 patch.object(PrecisionDebugger, "_get_execution_mode", new=mock_get_mode), \
-                patch("msprobe.mindspore.debugger.precision_debugger.TaskHandlerFactory.create", return_value=handler):
+                patch("msprobe.mindspore.debugger.precision_debugger.TaskHandlerFactory.create", return_value=handler), \
+                patch("msprobe.mindspore.debugger.precision_debugger.set_register_backward_hook_functions"):
             mock_get_mode.return_value = MsConst.GRAPH_GE_MODE
             mock_parse_json_config.return_value = [common_config, task_config]
             debugger = PrecisionDebugger()
@@ -64,7 +68,8 @@ class TestPrecisionDebugger(unittest.TestCase):
             self.assertTrue(Handler.called)
 
         mock_get_mode.return_value = MsConst.PYNATIVE_MODE
-        with patch("msprobe.mindspore.debugger.precision_debugger.Service") as mock_Service:
+        with patch("msprobe.mindspore.debugger.precision_debugger.Service") as mock_Service, \
+                patch("msprobe.mindspore.debugger.precision_debugger.set_register_backward_hook_functions"):
             debugger = PrecisionDebugger()
             debugger.start()
             service = mock_Service.return_value
@@ -78,7 +83,8 @@ class TestPrecisionDebugger(unittest.TestCase):
         with patch("msprobe.mindspore.debugger.precision_debugger.parse_json_config", new=mock_parse_json_config), \
                 patch.object(PrecisionDebugger, "_get_execution_mode", new=mock_get_mode), \
-                patch("msprobe.mindspore.debugger.precision_debugger.TaskHandlerFactory.create", return_value=handler):
+                patch("msprobe.mindspore.debugger.precision_debugger.TaskHandlerFactory.create", return_value=handler), \
+                patch("msprobe.mindspore.debugger.precision_debugger.set_register_backward_hook_functions"):
             common_config.task = Const.FREE_BENCHMARK
             mock_get_mode.return_value = MsConst.PYNATIVE_MODE
             mock_parse_json_config.return_value = [common_config, task_config]
@@ -106,3 +112,58 @@ class TestPrecisionDebugger(unittest.TestCase):
         Runtime.step_count = 0
         PrecisionDebugger.step()
         self.assertEqual(Runtime.step_count, 1)
+        Runtime.step_count = 0
+
+        HOOKCell.cell_count["api"] = 1
+        PrecisionDebugger.step()
+        self.assertEqual(HOOKCell.cell_count["api"], 0)
+
+        with patch.object(CellProcessor, "reset_cell_stats") as mock_reset_cell:
+            PrecisionDebugger.step()
+            mock_reset_cell.assert_called_once()
+
+    def test_forward_backward_dump_end(self):
+        with patch("msprobe.mindspore.debugger.precision_debugger.set_register_backward_hook_functions"):
+            debugger = PrecisionDebugger()
+            debugger.task = "statistics"
+            debugger.service = MagicMock()
+            debugger.forward_backward_dump_end()
+            debugger.service.stop.assert_called_once()
+
+    def test_is_graph_dump_level_not_kernel(self):
+        config = MagicMock()
+        config.level = "NOT_KERNEL"
+        config.list = ["some_value"]
+        result = PrecisionDebugger._is_graph_dump(config)
+        self.assertFalse(result)
+
+    def test_is_graph_dump_empty_list(self):
+        config = MagicMock()
+        config.level = MsConst.KERNEL
+        config.list = []
+        result = PrecisionDebugger._is_graph_dump(config)
+        self.assertTrue(result)
+
+    def test_is_graph_dump_multiple_items_in_list(self):
+        config = MagicMock()
+        config.level = MsConst.KERNEL
+        config.list = ["item1", "item2"]
+        result = PrecisionDebugger._is_graph_dump(config)
+        self.assertTrue(result)
+
+    def test_is_graph_dump_single_item_with_slash_or_dash(self):
+        config = MagicMock()
+        config.level = MsConst.KERNEL
+        config.list = ["item/with/slash"]
+        result = PrecisionDebugger._is_graph_dump(config)
+        self.assertTrue(result)
+        config.list = ["item-with-dash"]
+        result = PrecisionDebugger._is_graph_dump(config)
+        self.assertTrue(result)
+
+    def test_is_graph_dump_single_item_without_dash_or_slash(self):
+        config = MagicMock()
+        config.level = MsConst.KERNEL
+        config.list = ["Functional.relu.1.forward"]
+        result = PrecisionDebugger._is_graph_dump(config)
+        self.assertFalse(result)
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/dump/test_ms_kernel_config.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/dump/test_ms_kernel_config.py
new file mode 100644
index 0000000000000000000000000000000000000000..54c59b6409cb546384dcb50f47c7c27975fa1cb7
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/dump/test_ms_kernel_config.py
@@ -0,0 +1,53 @@
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+from unittest.mock import patch
+
+from msprobe.mindspore.dump.kernel_dump.kernel_config import create_kernel_config_json
+
+
+class TestPtKernelConfig(unittest.TestCase):
+    @patch("msprobe.mindspore.dump.kernel_dump.kernel_config.save_json")
+    def test_create_kernel_config_json_with_rank(self, mock_save_json):
+        dump_path = "./step0"
+        cur_rank = 0
+        kernel_config_path = create_kernel_config_json(dump_path, cur_rank)
+        self.assertEqual(kernel_config_path, "./step0/kernel_config_0.json")
+        config_info = {
+            "dump": {
+                "dump_list": [],
+                "dump_path": dump_path,
+                "dump_mode": "all",
+                "dump_op_switch": "on"
+            }
+        }
+        mock_save_json.assert_called_once_with(kernel_config_path, config_info, indent=4)
+
+    @patch("msprobe.mindspore.dump.kernel_dump.kernel_config.save_json")
+    def test_create_kernel_config_json_without_rank(self, mock_save_json):
+        dump_path = "./step0"
+        cur_rank = ''
+        kernel_config_path = create_kernel_config_json(dump_path, cur_rank)
+        self.assertEqual(kernel_config_path, "./step0/kernel_config.json")
+        config_info = {
+            "dump": {
+                "dump_list": [],
+                "dump_path": dump_path,
+                "dump_mode": "all",
+                "dump_op_switch": "on"
+            }
+        }
+        mock_save_json.assert_called_once_with(kernel_config_path, config_info, indent=4)
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_free_benchmark_utils.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_free_benchmark_utils.py
index 3521fb35c5c31bd10fbc8a719cb24b42b12220b5..1f37e18c6ef8d3facb526f3c54169a17f4616189 100644
--- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_free_benchmark_utils.py
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_free_benchmark_utils.py
@@ -1,7 +1,6 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
@@ -13,17 +12,16 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" import unittest import mindspore as ms from mindspore import Tensor -from msprobe.mindspore.free_benchmark.common.utils import Tools, UnequalRow, make_unequal_row -from msprobe.mindspore.free_benchmark.common.config import Config from msprobe.mindspore.common.const import FreeBenchmarkConst +from msprobe.mindspore.free_benchmark.common.config import Config from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.common.utils import Tools, UnequalRow, make_unequal_row from msprobe.mindspore.runtime import Runtime @@ -46,6 +44,18 @@ class TestUtils(unittest.TestCase): ret = Tools.get_default_error_threshold(ms.float16) self.assertEqual(ret, FreeBenchmarkConst.ERROR_THRESHOLD.get(ms.float16)) + def test_get_grad_out(self): + tensor = Tensor([1.0, 5.0], dtype=ms.float32) + target_grad_out = Tensor([1.0, 1.0], dtype=ms.float32) + ret = Tools.get_grad_out(tensor) + self.assertTrue((ret == target_grad_out).all()) + + tensors = (Tensor([1.0, 5.0], dtype=ms.float16), Tensor([1.0, 5.0], dtype=ms.float16)) + target_grad_out = (Tensor([1.0, 1.0], dtype=ms.float16), Tensor([1.0, 1.0], dtype=ms.float16)) + ret = Tools.get_grad_out(tensors) + self.assertTrue((ret[0] == target_grad_out[0]).all()) + self.assertTrue((ret[1] == target_grad_out[1]).all()) + def test_unequal_row(self): self.assertIsNone(UnequalRow.rank) self.assertIsNone(UnequalRow.pert_type) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_handler_params.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_handler_params.py index 06c475b5e45fea0cbff8e9494b383848187bf841..78d8f405cfe530ac8505b5d7db602a919dcff4e2 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_handler_params.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/common/test_ms_handler_params.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- 
-""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" import unittest diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/data/test_ms_free_benchmark_api.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/data/test_ms_free_benchmark_api.py index 5d2fd25084e810befa30944660187f0bc8b8f2be..261260a6e347b0b0d22ca8e8fcfb945e60b5511b 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/data/test_ms_free_benchmark_api.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/data/test_ms_free_benchmark_api.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,13 +12,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" import os import unittest -from msprobe.mindspore.common.const import FreeBenchmarkConst from msprobe.core.common.file_utils import load_yaml +from msprobe.mindspore.common.const import FreeBenchmarkConst class TestSupportWrapOps(unittest.TestCase): diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/decorator/test_ms_dec_forward.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/decorator/test_ms_dec_forward.py deleted file mode 100644 index cd29b865841775c9b9e216a87fe3203f6fd503bc..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/decorator/test_ms_dec_forward.py +++ /dev/null @@ -1,118 +0,0 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-
-from unittest import TestCase
-from unittest.mock import patch
-
-import mindspore as ms
-from mindspore import Tensor, ops
-
-from msprobe.mindspore.common.const import Const, FreeBenchmarkConst
-from msprobe.mindspore.free_benchmark.common.config import Config
-from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams
-from msprobe.mindspore.free_benchmark.decorator.dec_forward import ForwardSelfChecker
-from msprobe.mindspore.free_benchmark.handler.check_handler import CheckHandler
-
-
-class TestForwardSelfChecker(TestCase):
-    checker = None
-
-    @classmethod
-    def setUpClass(cls):
-        cls.checker = ForwardSelfChecker("api_name")
-
-    def test__init__(self):
-        self_checker = ForwardSelfChecker("api_name")
-        self.assertEqual(self_checker.api_name, "api_name")
-
-    def test_get_compare_data(self):
-        params = HandlerParams()
-        params.args = (Tensor([1.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32))
-        params.index = 0
-        params.fuzzed_value = Tensor([1.0001], dtype=ms.float32)
-        params.original_result = (Tensor([2.0], dtype=ms.float32), Tensor([6.0], dtype=ms.float32))
-        params.fuzzed_result = (Tensor([2.0001], dtype=ms.float32), Tensor([6.0001], dtype=ms.float32))
-
-        TestForwardSelfChecker.checker.api_name = "api_name"
-        self.checker.get_compare_data(params)
-        target = (Tensor([2.0001], dtype=ms.float32), Tensor([6.0001], dtype=ms.float32))
-        self.assertTrue((params.fuzzed_result[0] == target[0]).all())
-        self.assertTrue((params.fuzzed_result[1] == target[1]).all())
-
-        TestForwardSelfChecker.checker.api_name = Const.COMMUNICATION_API_LIST[0]
-        Config.pert_type = FreeBenchmarkConst.IMPROVE_PRECISION
-        self.checker.get_compare_data(params)
-        target = Tensor([1.0001], dtype=ms.float32)
-        self.assertTrue((params.fuzzed_result == target).all())
-        target = (Tensor([1.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32))
-        self.assertTrue((params.original_result[0] == target[0]).all())
-        self.assertTrue((params.original_result[1] == target[1]).all())
-
-        params.fuzzed_value = Tensor([1.0001], dtype=ms.float32)
-        params.original_result = (Tensor([2.0], dtype=ms.float32), Tensor([6.0], dtype=ms.float32))
-        Config.pert_type = FreeBenchmarkConst.ADD_NOISE
-        self.checker.get_compare_data(params)
-        target = Tensor([1.0001], dtype=ms.float32)
-        self.assertTrue((params.fuzzed_result == target).all())
-        target = Tensor([1.0], dtype=ms.float32)
-        self.assertTrue((params.original_result == target).all())
-
-    def test_deal_fuzzed_and_original_result(self):
-        params = HandlerParams()
-        params.fuzzed_value = [Tensor([1.0001], dtype=ms.float32)]
-        params.args = [Tensor([1.0], dtype=ms.float32)]
-        params.original_result = Tensor([1.0], dtype=ms.float32)
-        Config.handler_type = FreeBenchmarkConst.CHECK
-        handler_return = Tensor([2.0], dtype=ms.float32)
-
-        with patch.object(CheckHandler, "handle", return_value=handler_return) as mock_handle:
-            TestForwardSelfChecker.checker.api_name = "api_name"
-            ret = self.checker.deal_fuzzed_and_original_result(params)
-            mock_handle.assert_called_with(params)
-            self.assertTrue((ret == handler_return).all())
-
-        TestForwardSelfChecker.checker.api_name = Const.COMMUNICATION_API_LIST[0]
-        Config.pert_type = FreeBenchmarkConst.IMPROVE_PRECISION
-        target = Tensor([1.0], dtype=ms.float32)
-        ret = self.checker.deal_fuzzed_and_original_result(params)
-        self.assertTrue((ret == target).all())
-
-    def test_handle(self):
-        params = HandlerParams()
-        params.args = [Tensor([1.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32)]
-        params.kwargs = {}
-        params.index = 0
-        params.original_func = ops.add
-        original_result = ops.add(params.args[0], params.args[1])
-        fuzzed_result = ops.add(original_result, 1e-8)
-        deal_result = Tensor([2.0], dtype=ms.float32)
-
-        with patch.object(ForwardSelfChecker,
-                          "deal_fuzzed_and_original_result", return_value=deal_result) as mock_deal:
-            Config.pert_type = FreeBenchmarkConst.ADD_NOISE
-            ret = self.checker.handle(params)
-            self.assertTrue((params.fuzzed_result == fuzzed_result).all())
-            self.assertTrue((params.original_result == original_result).all())
-            self.assertTrue((ret == deal_result).all())
-            mock_deal.assert_called_with(params)
-
-            params.args = [Tensor([0.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32)]
-            ret = self.checker.handle(params)
-            self.assertTrue((ret == params.original_result).all())
-            mock_deal.assert_called_once()
-
-    @classmethod
-    def tearDownClass(cls):
-        cls.checker = None
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/decorator/test_ms_decorator_factory.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/decorator/test_ms_decorator_factory.py
deleted file mode 100644
index bd3b0e1cf77a5ea10900f3716c5d20b1efca1c87..0000000000000000000000000000000000000000
--- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/decorator/test_ms_decorator_factory.py
+++ /dev/null
@@ -1,177 +0,0 @@
-# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-from unittest import TestCase
-from unittest.mock import patch
-
-import mindspore as ms
-from mindspore import Tensor, ops
-
-from msprobe.mindspore.common.log import logger
-from msprobe.mindspore.free_benchmark.common.config import Config
-from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams
-from msprobe.mindspore.free_benchmark.decorator.decorator_factory import (data_pre_deal, decorate,
-                                                                          decorate_forward_function,
-                                                                          get_target_arg_index,
-                                                                          need_wrapper_func, stack_depth_check)
-from msprobe.mindspore.runtime import Runtime
-
-
-class TestDecoratorFactory(TestCase):
-    def test_need_wrapper_func(self):
-        Runtime.is_running = True
-        Config.is_enable = False
-        self.assertFalse(need_wrapper_func())
-
-        Runtime.is_running = False
-        Config.is_enable = True
-        self.assertFalse(need_wrapper_func())
-
-        Runtime.is_running = True
-        Config.is_enable = True
-        with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.stack_depth_check",
-                   return_value=False):
-            self.assertFalse(need_wrapper_func())
-
-        with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.stack_depth_check",
-                   return_value=True):
-            Config.steps = [0]
-            Runtime.step_count = 1
-            self.assertFalse(need_wrapper_func())
-
-            Config.steps = []
-            Config.ranks = [0]
-            Runtime.rank_id = 1
-            self.assertFalse(need_wrapper_func())
-
-            Config.ranks = []
-            self.assertTrue(need_wrapper_func())
-
-            Config.ranks = [1]
-            self.assertTrue(need_wrapper_func())
-
-    def test_data_pre_deal(self):
-        params = HandlerParams()
-        params.args = (Tensor([1.0, 1.0], dtype=ms.float32), 1)
-        params.kwargs = {"axis": 0}
-        params.original_func = ops.split
-        params.index = 0
-
-        ret = data_pre_deal("ops.split", params.original_func, *params.args, **params.kwargs)
-        self.assertTrue((ret.args[0] == params.args[0]).all())
-        self.assertEqual(ret.args[1], params.args[1])
-        self.assertEqual(ret.kwargs, params.kwargs)
-        self.assertEqual(ret.original_func, params.original_func)
-        self.assertEqual(ret.index, params.index)
-
-        params.args = (Tensor([1, 1], dtype=ms.int32), 1)
-        with self.assertRaises(Exception) as context:
-            ret = data_pre_deal("ops.split", params.original_func, *params.args, **params.kwargs)
-        self.assertEqual(str(context.exception), "ops.split has no supported input type")
-        self.assertEqual(ret.index, -1)
-
-    def test_get_target_arg_index(self):
-        args = (1, Tensor([1.0, 1.0], dtype=ms.float32))
-        target = 1
-        ret = get_target_arg_index(args)
-        self.assertEqual(ret, target)
-
-        args = (1, (1.0, 1.0))
-        target = 1
-        ret = get_target_arg_index(args)
-        self.assertEqual(ret, target)
-
-        args = (1, Tensor([1.0, 1.0], dtype=ms.int32))
-        target = -1
-        ret = get_target_arg_index(args)
-        self.assertEqual(ret, target)
-
-    def test_stack_depth_check(self):
-        def fuzz_wrapper(call_times):
-            call_times += 1
-            if stack_depth_check():
-                call_times = fuzz_wrapper(call_times)
-            return call_times
-
-        ret = fuzz_wrapper(0)
-        self.assertEqual(ret, 2)
-
-    def test_decorate_forward_function(self):
-        def func():
-            pass
-
-        with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.decorate",
-                   return_value=0) as mock_decorate:
-            decorate_forward_function(func)
-            ret = decorate_forward_function(func, api_name="api_name")
-            self.assertEqual(mock_decorate.call_args_list[0][0][0], func)
-            self.assertEqual(mock_decorate.call_args_list[0][0][1].__name__, "forward_func")
-            self.assertEqual(mock_decorate.call_args_list[0][0][2], "func")
-            self.assertEqual(mock_decorate.call_args_list[1][0][2], "api_name")
-            self.assertEqual(ret, 0)
-
-    def test_decorate(self):
-        def decorate_func(input):
-            if isinstance(input, int):
-                return input + 1
-            else:
-                raise
-
-        original_func = ops.add
-        api_name = "ops.add"
-        fuzz_wrapper = decorate(original_func, decorate_func, api_name)
-
-        with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.data_pre_deal",
-                   return_value=0) as mock_pre_deal:
-            args = (Tensor([1.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32))
-            kwargs = {}
-            os.environ["RANK_ID"] = "1"
-
-            Runtime.rank_id = 0
-            with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.need_wrapper_func",
-                       return_value=True), \
-                    patch.object(logger, "info") as mock_info:
-                ret = fuzz_wrapper(*args, **kwargs)
-                mock_info.assert_called_with(f"[{api_name}] is checking.")
-                mock_pre_deal.assert_called_with(api_name, original_func, *args, **kwargs)
-                self.assertEqual(ret, 1)
-                self.assertEqual(Runtime.rank_id, 0)
-
-            Runtime.rank_id = -1
-            with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.need_wrapper_func",
-                       return_value=False), \
-                    patch.object(logger, "info") as mock_info:
-                target = Tensor([6.0], dtype=ms.float32)
-                ret = fuzz_wrapper(*args, **kwargs)
-                mock_pre_deal.assert_called_once()
-                self.assertEqual(ret, target)
-                self.assertEqual(Runtime.rank_id, "1")
-
-            del os.environ["RANK_ID"]
-
-        with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.data_pre_deal",
-                   return_value="0") as mock_pre_deal:
-            args = (Tensor([1.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32))
-            kwargs = {}
-            Runtime.rank_id = 0
-            with patch("msprobe.mindspore.free_benchmark.decorator.decorator_factory.need_wrapper_func",
-                       return_value=True), \
-                    patch.object(logger, "info"), \
-                    patch.object(logger, "error") as mock_error:
-                target = Tensor([6.0], dtype=ms.float32)
-                ret = fuzz_wrapper(*args, **kwargs)
-                self.assertEqual(mock_error.call_count, 2)
-                self.assertEqual(ret, target)
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_base_handler.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_base_handler.py
index 429c54d99dbc077ba3121ccd01e5ab8efb3d900e..d7f5b0745cff481d0bc2e5771df36beb492d4015 100644
--- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_base_handler.py
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_base_handler.py
@@ -1,7 +1,6 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
@@ -13,7 +12,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
 
 import unittest
 from unittest.mock import patch
@@ -21,11 +19,11 @@ from unittest.mock import patch
 import mindspore as ms
 from mindspore import Tensor, ops
 
-from msprobe.mindspore.free_benchmark.handler.base_handler import BaseHandler
 from msprobe.mindspore.common.const import FreeBenchmarkConst
 from msprobe.mindspore.common.log import logger
 from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams
 from msprobe.mindspore.free_benchmark.common.utils import Tools
+from msprobe.mindspore.free_benchmark.handler.base_handler import BaseHandler
 
 
 class Handler(BaseHandler):
@@ -46,11 +44,11 @@ class TestBaseHandler(unittest.TestCase):
 
     @classmethod
     def setUpClass(cls):
-        cls.base_handler = Handler("api_name")
+        cls.base_handler = Handler("api_name_with_id")
 
     def test___init__(self):
-        base_handler = Handler("api_name")
-        self.assertEqual(base_handler.api_name, "api_name")
+        base_handler = Handler("api_name_with_id")
+        self.assertEqual(base_handler.api_name_with_id, "api_name_with_id")
 
     def test_pre_calculate(self):
         fuzzed_output = Tensor([1.0], dtype=ms.float32)
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_check_handler.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_check_handler.py
index 9ff7593ed549b279986425ce99059f4cbd417c02..58c0a7b46ad7ca5c05a157733d57dd8828ced24d 100644
--- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_check_handler.py
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_check_handler.py
@@ -1,7 +1,6 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved.
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
@@ -13,7 +12,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""
 
 import unittest
 from unittest.mock import patch
@@ -21,11 +19,11 @@ from unittest.mock import patch
 import mindspore as ms
 from mindspore import Tensor, ops
 
-from msprobe.mindspore.free_benchmark.handler.check_handler import CheckHandler
-from msprobe.mindspore.common.log import logger
-from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams
 from msprobe.core.data_dump.json_writer import DataWriter
+from msprobe.mindspore.common.log import logger
 from msprobe.mindspore.free_benchmark.common.config import Config
+from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams
+from msprobe.mindspore.free_benchmark.handler.check_handler import CheckHandler
 from msprobe.mindspore.runtime import Runtime
 
@@ -42,7 +40,39 @@ class TestCheckHandler(unittest.TestCase):
 
     @classmethod
     def setUpClass(cls):
-        cls.check_handler = CheckHandler("api_name")
+        cls.check_handler = CheckHandler("api_name_with_id")
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.check_handler = None
+
+    @patch.object(CheckHandler, "npu_compare_and_save")
+    @patch.object(logger, "error")
+    def test_handle(self, mock_error, mock_compare):
+        params = HandlerParams()
+        params.original_result = Tensor([1.0], dtype=ms.float32)
+        params.fuzzed_result = Tensor([1], dtype=ms.int32)
+        self.check_handler.handle(params)
+        mock_compare.assert_not_called()
+
+        params.fuzzed_result = Tensor([1.0001], dtype=ms.float32)
+        with patch.object(CheckHandler, "npu_compare_and_save") as mock_compare:
+            self.check_handler.handle(params)
+            mock_compare.assert_called_with(params.original_result, params.fuzzed_result, params)
+
+        params.original_result = (Tensor([1.0], dtype=ms.float32), Tensor([2.0], dtype=ms.float32))
+        params.fuzzed_result = (Tensor([1.0001], dtype=ms.float32), Tensor([2.0001], dtype=ms.float32))
+        with patch.object(CheckHandler, "npu_compare_and_save") as mock_compare:
+            self.check_handler.handle(params)
+            self.assertEqual(mock_compare.call_count, 2)
+            self.assertEqual(mock_compare.call_args_list[0][0],
+                             (params.original_result[0], params.fuzzed_result[0], params))
+            self.assertEqual(mock_compare.call_args_list[0][1], {"output_index": 0})
+            self.assertEqual(mock_compare.call_args_list[1][0],
+                             (params.original_result[1], params.fuzzed_result[1], params))
+            self.assertEqual(mock_compare.call_args_list[1][1], {"output_index": 1})
+
+        mock_error.assert_not_called()
 
     @patch.object(logger, "error")
     def test_npu_compare_and_save(self, mock_error):
@@ -56,7 +86,7 @@ class TestCheckHandler(unittest.TestCase):
             "pert_type": Config.pert_type,
             "stage": Config.stage,
             "step": Runtime.step_count,
-            "api_name": "api_name",
+            "api_name": "api_name_with_id",
             "max_rel": ops.max(ops.div(fuzzed_output, original_output))[0].item() - 1,
             "dtype": ms.float32,
             "shape": original_output.shape,
@@ -69,34 +99,4 @@ class TestCheckHandler(unittest.TestCase):
         self.assertEqual(list(mock_write.call_args[0][0]), list(data_dict.values()))
self.assertEqual(mock_write.call_args[0][1], data_dict.keys()) self.assertEqual(mock_write.call_args[0][2], Config.dump_path) - mock_error.assert_called_with("api_name is not consistent") - - def test_handle(self): - params = HandlerParams() - params.original_result = Tensor([1.0], dtype=ms.float32) - params.fuzzed_result = Tensor([1], dtype=ms.int32) - ret = self.check_handler.handle(params) - self.assertTrue((ret == params.original_result).all()) - - params.fuzzed_result = Tensor([1.0001], dtype=ms.float32) - with patch.object(CheckHandler, "npu_compare_and_save") as mock_compare: - ret = self.check_handler.handle(params) - mock_compare.assert_called_with(params.original_result, params.fuzzed_result, params) - self.assertTrue((ret == params.original_result).all()) - - params.original_result = (Tensor([1.0], dtype=ms.float32), Tensor([2.0], dtype=ms.float32)) - params.fuzzed_result = (Tensor([1.0001], dtype=ms.float32), Tensor([2.0001], dtype=ms.float32)) - with patch.object(CheckHandler, "npu_compare_and_save") as mock_compare: - ret = self.check_handler.handle(params) - self.assertEqual(mock_compare.call_count, 2) - self.assertEqual(mock_compare.call_args_list[0][0], - (params.original_result[0], params.fuzzed_result[0], params)) - self.assertEqual(mock_compare.call_args_list[0][1], {"output_index": 0}) - self.assertEqual(mock_compare.call_args_list[1][0], - (params.original_result[1], params.fuzzed_result[1], params)) - self.assertEqual(mock_compare.call_args_list[1][1], {"output_index": 1}) - self.assertTrue(ret == params.original_result) - - @classmethod - def tearDownClass(cls): - cls.check_handler = None + mock_error.assert_called_with("api_name_with_id is not consistent") diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_fix_handler.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_fix_handler.py index c48c4b52deae69022078b3a13a878a0f23525da8..8f3892fb1bf7aec3bb852ebe471f697ecb393eda 
100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_fix_handler.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_fix_handler.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" import unittest from unittest.mock import patch @@ -21,9 +19,9 @@ from unittest.mock import patch import mindspore as ms from mindspore import Tensor -from msprobe.mindspore.free_benchmark.handler.fix_handler import FixHandler from msprobe.mindspore.common.log import logger from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.handler.fix_handler import FixHandler class TestFixHandler(unittest.TestCase): @@ -31,11 +29,11 @@ class TestFixHandler(unittest.TestCase): @classmethod def setUpClass(cls): - cls.fix_handler = FixHandler("api_name") + cls.fix_handler = FixHandler("api_name_with_id") def test__init__(self): - fix_handler = FixHandler("api_name") - self.assertEqual(fix_handler.api_name, "api_name") + fix_handler = FixHandler("api_name_with_id") + self.assertEqual(fix_handler.api_name_with_id, "api_name_with_id") def test_use_fuzzed_result(self): original_result = 1.0 @@ -81,7 +79,7 @@ class TestFixHandler(unittest.TestCase): patch.object(logger, "error") as mock_error: ret = self.fix_handler.handle(params) self.assertEqual(mock_error.call_count, 2) - 
self.assertEqual(mock_error.call_args_list[0][0][0], "api_name failed to fix.") + self.assertEqual(mock_error.call_args_list[0][0][0], "api_name_with_id failed to fix.") self.assertEqual(mock_error.call_args_list[1][0][0], "raise Exception") self.assertTrue((ret == params.original_result).all()) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_handler_factory.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_handler_factory.py index a4d168f9fb60c817774adf9eecbd0399fb884a3f..82627094a02e058281528ae4e4a6902e9bfbc292 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_handler_factory.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/handler/test_ms_handler_factory.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,35 +12,34 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" import unittest from unittest.mock import patch -from msprobe.mindspore.free_benchmark.handler.handler_factory import HandlerFactory -from msprobe.mindspore.free_benchmark.common.config import Config from msprobe.mindspore.common.const import FreeBenchmarkConst from msprobe.mindspore.common.log import logger +from msprobe.mindspore.free_benchmark.common.config import Config from msprobe.mindspore.free_benchmark.handler.check_handler import CheckHandler from msprobe.mindspore.free_benchmark.handler.fix_handler import FixHandler +from msprobe.mindspore.free_benchmark.handler.handler_factory import HandlerFactory class TestHandlerFactory(unittest.TestCase): @patch.object(logger, "error") def test_create(self, mock_error): - api_name = "mindspore.ops.add" + api_name_with_id = "Mint.add.0" Config.handler_type = "UNKNOWN" with self.assertRaises(Exception): - HandlerFactory.create(api_name) + HandlerFactory.create(api_name_with_id) mock_error.assert_called_with("UNKNOWN is not supported.") Config.handler_type = FreeBenchmarkConst.CHECK - handler = HandlerFactory.create(api_name) + handler = HandlerFactory.create(api_name_with_id) self.assertTrue(isinstance(handler, CheckHandler)) - self.assertEqual(handler.api_name, api_name) + self.assertEqual(handler.api_name_with_id, api_name_with_id) Config.handler_type = FreeBenchmarkConst.FIX - handler = HandlerFactory.create(api_name) + handler = HandlerFactory.create(api_name_with_id) self.assertTrue(isinstance(handler, FixHandler)) - self.assertEqual(handler.api_name, api_name) + self.assertEqual(handler.api_name_with_id, api_name_with_id) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_add_noise.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_add_noise.py index 44a2f0c9cf66987d00650642a173e344fce89236..7715eac3fa9d912646e43159366c044823a592d8 100644 --- 
a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_add_noise.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_add_noise.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" import unittest from unittest.mock import patch @@ -32,7 +30,7 @@ class TestAddNoisePerturbation(unittest.TestCase): @classmethod def setUpClass(cls): - cls.add_noise_pert = AddNoisePerturbation("mindspore.ops.add") + cls.add_noise_pert = AddNoisePerturbation("Mint.add.0") def test__get_noise(self): input = Tensor([1.0], dtype=ms.float32) @@ -88,7 +86,7 @@ class TestAddNoisePerturbation(unittest.TestCase): params.args = input params.index = 0 ret = self.add_noise_pert.handle(params) - mock_warning.assert_called_with("mindspore.ops.add can not add noise.") + mock_warning.assert_called_with("Mint.add.0 can not add noise.") self.assertFalse(self.add_noise_pert.is_fuzzed) self.assertFalse(ret) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_base_perturbation.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_base_perturbation.py index 6c8e89668238b7555a77492b4558b6e24bbaf91a..3469e809d3fb27f9e366d128cfa10d68c776e391 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_base_perturbation.py +++ 
b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_base_perturbation.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,29 +12,29 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" import unittest +from unittest.mock import patch import mindspore as ms from mindspore import Tensor - -from msprobe.mindspore.free_benchmark.perturbation.base_perturbation import BasePerturbation -from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams -from msprobe.mindspore.free_benchmark.common.config import Config from msprobe.core.common.const import Const +from msprobe.mindspore.free_benchmark.common.config import Config +from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.perturbation.base_perturbation import BasePerturbation class TestBasePerturbation(unittest.TestCase): base_pert = None def test___init__(self): - TestBasePerturbation.base_pert = BasePerturbation("mindspore.ops.add") - self.assertEqual(TestBasePerturbation.base_pert.api_name, "mindspore.ops.add") + TestBasePerturbation.base_pert = BasePerturbation("Functional.add.0") + self.assertEqual(TestBasePerturbation.base_pert.api_name_with_id, "Functional.add.0") self.assertFalse(TestBasePerturbation.base_pert.is_fuzzed) self.assertIsNone(TestBasePerturbation.base_pert.perturbation_value) - def test_get_fuzzed_result(self): + @patch("msprobe.mindspore.service.Service.should_execute_hook", 
return_value=False) + def test_get_fuzzed_result(self, _): params = HandlerParams() params.args = [Tensor([1.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32)] params.kwargs = {} @@ -43,6 +42,12 @@ class TestBasePerturbation(unittest.TestCase): params.index = 0 params.original_func = ms.ops.add + Config.stage = Const.BACKWARD + target = (Tensor([1.0], dtype=ms.float32), Tensor([1.0], dtype=ms.float32)) + ret = self.base_pert.get_fuzzed_result(params) + self.assertTrue((ret[0] == target[0]).all()) + self.assertTrue((ret[1] == target[1]).all()) + Config.stage = Const.FORWARD target = Tensor([7.0], dtype=ms.float32) ret = self.base_pert.get_fuzzed_result(params) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_bit_noise.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_bit_noise.py index 7a1e08e2c33c7dddb73747cb05f7c762c94efe6c..49b114960ee562ac0c2d2ddf29f871ec216971ca 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_bit_noise.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_bit_noise.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" import unittest from unittest.mock import patch @@ -22,9 +20,9 @@ import numpy as np import mindspore as ms from mindspore import Tensor, ops -from msprobe.mindspore.free_benchmark.perturbation.bit_noise import BitNoisePerturbation -from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams from msprobe.mindspore.common.log import logger +from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.perturbation.bit_noise import BitNoisePerturbation class TestBitNoisePerturbation(unittest.TestCase): @@ -33,7 +31,7 @@ class TestBitNoisePerturbation(unittest.TestCase): @classmethod def setUpClass(cls): - cls.bit_noise_pert = BitNoisePerturbation("mindspore.ops.add") + cls.bit_noise_pert = BitNoisePerturbation("Mint.add.0") def test__get_bit_len_type(self): input = Tensor([1.0], dtype=ms.float32) @@ -93,7 +91,7 @@ class TestBitNoisePerturbation(unittest.TestCase): params.args = input params.index = 0 ret = self.bit_noise_pert.handle(params) - mock_warning.assert_called_with("mindspore.ops.add can not add bit noise.") + mock_warning.assert_called_with("Mint.add.0 can not add bit noise.") self.assertFalse(self.bit_noise_pert.is_fuzzed) self.assertFalse(ret) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_exchange_value.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_exchange_value.py index 688444b3659ca4449c4a138a7238288d876ef586..047c3b92304c6368481d0d3150399781a4eaa934 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_exchange_value.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_exchange_value.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. 
+# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" import unittest diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_improve_precision.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_improve_precision.py index fddb49037d7776dd3758be13042ddbb56bf85fdc..e200bb40868fab8a9618047244830aa8a74cec27 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_improve_precision.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_improve_precision.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" import unittest from unittest.mock import patch @@ -21,9 +19,11 @@ from unittest.mock import patch import mindspore as ms from mindspore import Tensor -from msprobe.mindspore.free_benchmark.perturbation.improve_precision import ImprovePrecisionPerturbation -from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.core.common.const import Const from msprobe.mindspore.common.log import logger +from msprobe.mindspore.free_benchmark.common.config import Config +from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.perturbation.improve_precision import ImprovePrecisionPerturbation class TestImprovePrecisionPerturbation(unittest.TestCase): @@ -32,7 +32,7 @@ class TestImprovePrecisionPerturbation(unittest.TestCase): @classmethod def setUpClass(cls): - cls.improve_precision_pert = ImprovePrecisionPerturbation("mindspore.ops.add") + cls.improve_precision_pert = ImprovePrecisionPerturbation("Functional.add.0") def test_improve_tensor_precision(self): self.improve_precision_pert.is_fuzzed = False @@ -68,8 +68,9 @@ class TestImprovePrecisionPerturbation(unittest.TestCase): self.assertEqual(ret.dtype, target.dtype) self.assertFalse(self.improve_precision_pert.is_fuzzed) + @patch("msprobe.mindspore.service.Service.should_execute_hook", return_value=False) @patch.object(logger, "warning") - def test_handle(self, mock_warning): + def test_handle(self, mock_warning, _): self.improve_precision_pert.is_fuzzed = False params = HandlerParams() @@ -77,24 +78,23 @@ class TestImprovePrecisionPerturbation(unittest.TestCase): params.args = input params.kwargs = {} ret = self.improve_precision_pert.handle(params) - mock_warning.assert_called_with("mindspore.ops.add can not improve precision.") + mock_warning.assert_called_with("Functional.add.0 can not improve precision.") self.assertFalse(self.improve_precision_pert.is_fuzzed) self.assertFalse(ret) + Config.stage = Const.FORWARD 
params.args = [Tensor([1.0], dtype=ms.float16), Tensor([5.0], dtype=ms.float16)] params.original_func = ms.ops.add target = Tensor([6.0], dtype=ms.float32) ret = self.improve_precision_pert.handle(params) self.assertTrue(ret == target) - self.assertIsNone(params.fuzzed_value) - - TestImprovePrecisionPerturbation.improve_precision_pert.api_name = "mindspore.communication.comm_func.reduce" - fuzzed_value = [Tensor([1.0], dtype=ms.float32), Tensor([5.0], dtype=ms.float32)] - self.improve_precision_pert.handle(params) - self.assertTrue((params.fuzzed_value[0] == fuzzed_value[0]).all()) - self.assertTrue((params.fuzzed_value[1] == fuzzed_value[1]).all()) - self.assertEqual(params.fuzzed_value[0].dtype, fuzzed_value[0].dtype) - self.assertEqual(params.fuzzed_value[1].dtype, fuzzed_value[1].dtype) + + Config.stage = Const.BACKWARD + params.args = [Tensor([1.0], dtype=ms.float16), Tensor([5.0], dtype=ms.float16)] + params.original_func = ms.ops.add + target = (Tensor([1.0], dtype=ms.float32), Tensor([1.0], dtype=ms.float32)) + ret = self.improve_precision_pert.handle(params) + self.assertTrue(ret == target) @classmethod def tearDownClass(cls): diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_perturbation_factory.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_perturbation_factory.py index da90393a495b597d18b45f22ec44332d55f4d9c1..858e664bbaddb3506bf53ea067eeca1c9706b43b 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_perturbation_factory.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/perturbation/test_ms_perturbation_factory.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,7 +12,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" import unittest @@ -30,7 +28,7 @@ from msprobe.mindspore.free_benchmark.perturbation.exchange_value import Exchang class TestPerturbationFactory(unittest.TestCase): def test_create(self): - api_name = "mindspore.ops.add" + api_name = "Functional.add.0" Config.pert_type = "UNKNOWN" with self.assertRaises(Exception) as context: @@ -41,29 +39,29 @@ class TestPerturbationFactory(unittest.TestCase): Config.pert_type = FreeBenchmarkConst.EXCHANGE_VALUE pert = PerturbationFactory.create(api_name) self.assertTrue(isinstance(pert, ExchangeValuePerturbation)) - self.assertEqual(pert.api_name, api_name) + self.assertEqual(pert.api_name_with_id, api_name) self.assertFalse(pert.is_fuzzed) Config.pert_type = FreeBenchmarkConst.NO_CHANGE pert = PerturbationFactory.create(api_name) self.assertTrue(isinstance(pert, NoChangePerturbation)) - self.assertEqual(pert.api_name, api_name) + self.assertEqual(pert.api_name_with_id, api_name) self.assertFalse(pert.is_fuzzed) Config.pert_type = FreeBenchmarkConst.BIT_NOISE pert = PerturbationFactory.create(api_name) self.assertTrue(isinstance(pert, BitNoisePerturbation)) - self.assertEqual(pert.api_name, api_name) + self.assertEqual(pert.api_name_with_id, api_name) self.assertFalse(pert.is_fuzzed) Config.pert_type = FreeBenchmarkConst.ADD_NOISE pert = PerturbationFactory.create(api_name) self.assertTrue(isinstance(pert, AddNoisePerturbation)) - self.assertEqual(pert.api_name, api_name) + self.assertEqual(pert.api_name_with_id, api_name) self.assertFalse(pert.is_fuzzed) Config.pert_type = FreeBenchmarkConst.IMPROVE_PRECISION pert = PerturbationFactory.create(api_name) 
self.assertTrue(isinstance(pert, ImprovePrecisionPerturbation)) - self.assertEqual(pert.api_name, api_name) + self.assertEqual(pert.api_name_with_id, api_name) self.assertFalse(pert.is_fuzzed) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_api_pynative_self_check.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_api_pynative_self_check.py index 34f2dcdb1e0df667bb1f236a8a068bd236064864..e589dd4d58715d74644047f8c7e7a6ce79ccf225 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_api_pynative_self_check.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_api_pynative_self_check.py @@ -1,7 +1,7 @@ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. # All rights reserved. # -# Licensed under the Apache License, Version 2.0 (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -19,15 +19,22 @@ from unittest.mock import patch import mindspore as ms from mindspore import Tensor, mint, ops -from mindspore.communication import comm_func from msprobe.core.common.const import Const from msprobe.mindspore.common.const import FreeBenchmarkConst -from msprobe.mindspore.free_benchmark.api_pynative_self_check import (ApiPyNativeSelFCheck, get_decorate_func, - get_module, get_supported_ops, get_wrapper_obj, - hijack, is_func_support_decorate) +from msprobe.mindspore.common.log import logger +from msprobe.mindspore.dump.hook_cell.api_registry import api_register +from msprobe.mindspore.free_benchmark.api_pynative_self_check import (ApiPyNativeSelfCheck, check_all_tensor, + check_self, data_pre_deal, + deal_fuzzed_and_original_result, + get_module, get_supported_ops, + get_target_arg_index, need_wrapper_func) from msprobe.mindspore.free_benchmark.common.config import Config -from msprobe.mindspore.free_benchmark.decorator.decorator_factory import decorate_forward_function +from msprobe.mindspore.free_benchmark.common.handler_params import HandlerParams +from msprobe.mindspore.free_benchmark.common.utils import Tools +from msprobe.mindspore.free_benchmark.handler.check_handler import CheckHandler +from msprobe.mindspore.free_benchmark.handler.fix_handler import FixHandler +from msprobe.mindspore.runtime import Runtime class DebuggerConfig: @@ -41,44 +48,81 @@ class DebuggerConfig: list = [] -class TestApiPyNativeSelFCheck(TestCase): - def test___init__(self): +class Cell: + def __init__(self): + self.input_kwargs = {} + + +class TestApiPyNativeSelfCheck(TestCase): + checker = None + + @classmethod + def setUpClass(cls): config = DebuggerConfig() config.list = [] - self_checker = ApiPyNativeSelFCheck(config) + cls.checker = ApiPyNativeSelfCheck(config) + def test___init__(self): self.assertTrue(Config.is_enable) - self.assertEqual(Config.handler_type, config.handler_type) - 
self.assertEqual(Config.pert_type, config.pert_type) - self.assertEqual(Config.stage, config.stage) - self.assertEqual(Config.dump_level, config.dump_level) - self.assertEqual(Config.steps, config.step) - self.assertEqual(Config.ranks, config.rank) - self.assertEqual(Config.dump_path, os.path.join(config.dump_path, "free_benchmark.csv")) + self.assertEqual(Config.handler_type, DebuggerConfig.handler_type) + self.assertEqual(Config.pert_type, DebuggerConfig.pert_type) + self.assertEqual(Config.stage, DebuggerConfig.stage) + self.assertEqual(Config.dump_level, DebuggerConfig.dump_level) + self.assertEqual(Config.steps, DebuggerConfig.step) + self.assertEqual(Config.ranks, DebuggerConfig.rank) + self.assertEqual(Config.dump_path, os.path.join(DebuggerConfig.dump_path, "free_benchmark.csv")) target_api_list = get_supported_ops() - self.assertEqual(self_checker.api_list, target_api_list) + self.assertEqual(self.checker.api_list, target_api_list) + config = DebuggerConfig() config.list = ["mindspore.ops.add"] - self_checker = ApiPyNativeSelFCheck(config) + self_checker = ApiPyNativeSelfCheck(config) target_api_list = set(config.list) self.assertEqual(self_checker.api_list, target_api_list) + target_ori_func = {"mindspore.ops.add": ops.add} + self.assertEqual(self_checker.ori_func, target_ori_func) def test_handle(self): - config = DebuggerConfig() - config.list = [] - self_checker = ApiPyNativeSelFCheck(config) + with patch.object(api_register, "initialize_hook") as mock_init_hook, \ + patch.object(api_register, "api_set_hook_func") as mock_set_hook: + self.checker.handle() + mock_init_hook.assert_called_with(self.checker.build_hook) + mock_set_hook.assert_called_once() + + def test_build_hook(self): + _, forward_hook, backward_hook, _ = self.checker.build_hook("Functional.add.") + + cell = Cell() - with patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.hijack") as mock_hijack: - self_checker.handle() - self.assertEqual(mock_hijack.call_count, 
len(self_checker.api_list)) + with patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.need_wrapper_func", return_value=False): + self.assertIsNone(forward_hook(cell, "input", "output")) + + cell = Cell() + self.checker.api_list = ["mindspore.ops.add"] + self.checker.ori_func["mindspore.ops.add"] = "add" + with patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.need_wrapper_func", return_value=True), \ + patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.check_self", + return_value="ret") as mock_check: + ret = forward_hook(cell, ("input",), ("output",)) + self.assertEqual(ret, "ret") + mock_check.assert_called_with("Functional.add.0", ("output",), "add", "input") + + self.assertIsNone(backward_hook("cell", "grad_input", "grad_output")) + + def test_store_original_func(self): + self.checker.api_list = ["mindspore.ops.add"] + self.checker.ori_func = {} + target_ori_func = {"mindspore.ops.add": ops.add} + self.checker.store_original_func() + self.assertEqual(self.checker.ori_func, target_ori_func) def test_get_supported_ops(self): yaml_api = { "ops": ["add"], "Tensor": ["div"], "mint": ["mean"], - "mint.nn.functional": ["relu"], - "communication": ["all_reduce"]} + "mint.nn.functional": ["relu"] + } with patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.load_yaml", return_value=yaml_api): api_list = get_supported_ops() target_list = [] @@ -90,48 +134,144 @@ class TestApiPyNativeSelFCheck(TestCase): target_list.append("mindspore.mint.mean") if hasattr(mint.nn.functional, "relu"): target_list.append("mindspore.mint.nn.functional.relu") - if hasattr(comm_func, "all_reduce"): - target_list.append("mindspore.communication.comm_func.all_reduce") self.assertEqual(api_list, set(target_list)) - def test_get_decorate_func(self): - ret = get_decorate_func() - self.assertEqual(ret, decorate_forward_function) + def test_get_module(self): + module_obj, orig_func = get_module("mindspore.ops.add") + self.assertEqual(module_obj, 
ops) + self.assertEqual(orig_func, ops.add) - def test_is_func_support_decorate(self): - ret = is_func_support_decorate(ops.Add) - self.assertFalse(ret) + @patch.object(logger, "warning") + def test_check_self(self, mock_warning): + api_name_with_id = "Functional.add.0" + output = (ms.Tensor([2.0], dtype=ms.float16),) + ori_func = ops.add + args = (ms.Tensor([1.0]), 1.0) + kwargs = {} - ret = is_func_support_decorate(ops.__name__) - self.assertFalse(ret) + Config.stage = Const.BACKWARD + self.assertFalse(check_self(api_name_with_id, output, ori_func, *args, **kwargs)) + mock_warning.assert_called_with(f"{api_name_with_id} has non-tensor input or output.") - ret = is_func_support_decorate(ops.add) - self.assertTrue(ret) + mock_warning.reset_mock() + Config.stage = Const.FORWARD + with patch.object(logger, "info") as mock_info, \ + patch.object(api_register, "api_set_ori_func") as mock_set_ori, \ + patch.object(api_register, "api_set_hook_func") as mock_set_hook, \ + patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.deal_fuzzed_and_original_result", + return_value="ret"): + args = (1.0, 1.0) + ret = check_self(api_name_with_id, output, ori_func, *args, **kwargs) + self.assertIsNone(ret) + mock_warning.assert_called_once() + mock_info.assert_not_called() - def test_get_wrapper_obj(self): - with patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.decorate_forward_function", - return_value=0) as mock_dec: - ret = get_wrapper_obj(ops.add, "ops.add") - mock_dec.assert_called_with(ops.add, "ops.add") - self.assertEqual(ret, 0) - ret = get_wrapper_obj(ops.__name__, "ops.ops.__name__") - mock_dec.assert_called_once() - self.assertEqual(ret, ops.__name__) + Config.pert_type = FreeBenchmarkConst.IMPROVE_PRECISION + args = (ms.Tensor([1.0], dtype=ms.float32), ms.Tensor([1.0], dtype=ms.float32)) + ret = check_self(api_name_with_id, output, ori_func, *args, **kwargs) + mock_info.assert_called_with(f"[{api_name_with_id}] is {Config.handler_type}ing.") + 
mock_set_ori.assert_called_once() + mock_set_hook.assert_called_once() + self.assertIsNone(ret) - def test_get_module(self): - module_obj, orig_func = get_module("mindspore.ops.add") - self.assertEqual(module_obj, ops) - self.assertEqual(orig_func, ops.add) + mock_set_hook.reset_mock() + args = (ms.Tensor([1.0], dtype=ms.float16), ms.Tensor([1.0], dtype=ms.float16)) + ret = check_self(api_name_with_id, output, ori_func, *args, **kwargs) + mock_set_hook.assert_called_once() + self.assertEqual(ret, "ret") + + Config.stage = Const.BACKWARD + mock_set_hook.reset_mock() + with patch.object(Tools, "get_grad") as mock_grad: + ret = check_self(api_name_with_id, output, ori_func, *args, **kwargs) + self.assertEqual(mock_grad.call_count, 2) + mock_set_hook.assert_called_once() + self.assertEqual(ret, "ret") + Config.stage = Const.FORWARD + + def test_check_all_tensor(self): + inputs = ms.Tensor([1.0]) + self.assertTrue(check_all_tensor(inputs)) + + inputs = (ms.Tensor([1.0]), ms.Tensor([2.0])) + self.assertTrue(check_all_tensor(inputs)) + + inputs = (ms.Tensor([1.0]), 2.0) + self.assertFalse(check_all_tensor(inputs)) + + def test_get_target_arg_index(self): + args = (ms.Tensor([1], dtype=ms.int32), 2.0, ms.Tensor([1.0])) + self.assertEqual(get_target_arg_index(args), 2) + + args = ((1.0, 2.0), ms.Tensor([1.0])) + self.assertEqual(get_target_arg_index(args), 0) + + args = (1.0, 2.0) + self.assertEqual(get_target_arg_index(args), -1) + + def test_data_pre_deal(self): + params = HandlerParams() + params.args = (Tensor([1.0, 1.0], dtype=ms.float32), 1) + params.kwargs = {"axis": 0} + params.original_func = ops.split + params.index = 0 + + ret = data_pre_deal("Functional.split.0", params.original_func, *params.args, **params.kwargs) + self.assertTrue((ret.args[0] == params.args[0]).all()) + self.assertEqual(ret.args[1], params.args[1]) + self.assertEqual(ret.kwargs, params.kwargs) + self.assertEqual(ret.original_func, params.original_func) + self.assertEqual(ret.index, 
params.index) + + params.args = (Tensor([1, 1], dtype=ms.int32), 1) + with patch.object(logger, "warning") as mock_warning: + ret = data_pre_deal("Functional.split.0", params.original_func, *params.args, **params.kwargs) + mock_warning.assert_called_with("Functional.split.0 has no supported input type.") + self.assertEqual(ret.index, -1) + + def test_need_wrapper_func(self): + Runtime.is_running = True + Config.is_enable = False + self.assertFalse(need_wrapper_func()) + + Runtime.is_running = False + Config.is_enable = True + self.assertFalse(need_wrapper_func()) + + Runtime.is_running = True + Config.is_enable = True + self.assertTrue(need_wrapper_func()) + + Config.steps = [1] + Runtime.step_count = 0 + self.assertFalse(need_wrapper_func()) + + Config.steps = [] + Runtime.step_count = 0 + + Config.ranks = [] + Runtime.rank_id = -1 + self.assertTrue(need_wrapper_func()) + + with patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.get_rank_if_initialized", return_value=0): + self.assertTrue(need_wrapper_func()) + self.assertEqual(Runtime.rank_id, 0) + + Config.ranks = [0] + Runtime.rank_id = 1 + self.assertFalse(need_wrapper_func()) + Config.ranks = [] + Runtime.rank_id = -1 + + def test_deal_fuzzed_and_original_result(self): + params = HandlerParams() + + Config.handler_type = FreeBenchmarkConst.FIX + with patch.object(FixHandler, "handle") as mock_fix: + deal_fuzzed_and_original_result("api_name_with_id", params) + mock_fix.assert_called_with(params) - def test_hijack(self): - def wrapped_func(): - pass - with patch("msprobe.mindspore.free_benchmark.api_pynative_self_check.get_wrapper_obj", - return_value=wrapped_func) as mock_dec: - hijack(" ") - mock_dec.assert_not_called() - ori_func_backup = ops.add - hijack("mindspore.ops.add") - wrapped_ori_func = getattr(ops, "add") - self.assertEqual(wrapped_ori_func, wrapped_func) - setattr(ops, "add", ori_func_backup) + Config.handler_type = FreeBenchmarkConst.CHECK + with patch.object(CheckHandler, 
"handle") as mock_check: + deal_fuzzed_and_original_result("api_name_with_id", params) + mock_check.assert_called_with(params) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_self_check_tool_factory.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_self_check_tool_factory.py index e85f54bd5210cc45243b002fba6acecb10164940..fa68b8896c26d4156833c54d2b2bf5b443164e8f 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_self_check_tool_factory.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/free_benchmark/test_ms_self_check_tool_factory.py @@ -19,7 +19,7 @@ import os import unittest from msprobe.mindspore.free_benchmark.self_check_tool_factory import SelfCheckToolFactory -from msprobe.mindspore.free_benchmark.api_pynative_self_check import ApiPyNativeSelFCheck +from msprobe.mindspore.free_benchmark.api_pynative_self_check import ApiPyNativeSelfCheck from msprobe.mindspore.debugger.debugger_config import DebuggerConfig from msprobe.core.common_config import CommonConfig, BaseConfig from msprobe.mindspore.common.const import Const as MsConst @@ -49,4 +49,4 @@ class TestSelfCheckToolFactory(unittest.TestCase): config.execution_mode = MsConst.PYNATIVE_MODE tool = SelfCheckToolFactory.create(config) - self.assertTrue(isinstance(tool, ApiPyNativeSelFCheck)) + self.assertIsInstance(tool, ApiPyNativeSelfCheck, "") diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_grad_analyzer.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_grad_analyzer.py index b7e37db43b4f6142cf49f0d3dddd141c39c17aff..802769d9005916c8723d436349d13ca7f557a00a 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_grad_analyzer.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_grad_analyzer.py @@ -6,7 +6,7 @@ import mindspore as ms from unittest import TestCase, mock from unittest.mock import patch from mindspore import Tensor, 
Parameter -from msprobe.mindspore.grad_probe.grad_analyzer import CSVGenerator, grad_dump +from msprobe.mindspore.grad_probe.grad_analyzer import CSVGenerator, grad_dump, GradDumpConfig from msprobe.mindspore.grad_probe.global_context import grad_context from msprobe.core.grad_probe.constant import GradConst @@ -86,6 +86,11 @@ class TestGradAnalyzer(TestCase): self.csv_generator.gen_csv_line(file_path, stat_data) mock_append.assert_called_once() + file_path = os.path.join(self.dump_dir, "0.npy") + with mock.patch.object(self.csv_generator.cache_list, 'append') as mock_append: + with self.assertRaises(RuntimeError): + self.csv_generator.gen_csv_line(file_path, stat_data) + def test_grad_dump(self): # Test grad_dump function with numpy file output dump_dir = self.dump_dir @@ -97,7 +102,9 @@ class TestGradAnalyzer(TestCase): # Run the grad_dump function try: - grad_dump(dump_dir, g_name, dump_step, grad, level, bounds) + conf = GradDumpConfig(dump_dir=dump_dir, g_name=g_name, dump_step=dump_step, grad=grad, level=level, + bounds=bounds) + grad_dump(conf) except RuntimeError as e: # If TensorDump fails due to environment, skip the file existence check self.skipTest(f"TensorDump operation failed: {e}") @@ -140,8 +147,34 @@ class TestGradAnalyzer(TestCase): self.csv_generator.traverse_files(npy_files) self.assertFalse(os.path.exists(test_file_path)) + npy_files = ["step_finish.npy"] + test_file_path = os.path.join(self.dump_dir, "step_finish.npy") + np.save(test_file_path, np.array([1, 2, 3, 4, 5])) + with mock.patch.object(self.csv_generator, 'load_npy_data', return_value=np.array([1, 2, 3, 4, 5])): + self.csv_generator.traverse_files(npy_files) + self.assertTrue(self.csv_generator.last_finish) + + npy_files = ["step_dir.npy"] + test_file_path = os.path.join(self.dump_dir, "step_dir.npy") + np.save(test_file_path, np.array([1, 2, 3, 4, 5])) + with mock.patch.object(self.csv_generator, 'load_npy_data', return_value=np.array([1, 2, 3, 4, 5])): + with 
self.assertRaises(RuntimeError): + self.csv_generator.traverse_files(npy_files) + + def test_traverse_files_with_data_successful_move(self): + npy_files = ["step_dir.npy"] + self.csv_generator.current_step = 0 + test_file_path = os.path.join(self.dump_dir, "step_dir.npy") + np.save(test_file_path, np.array([1, 2, 3, 4, 5])) + with mock.patch.object(self.csv_generator, 'load_npy_data', return_value=np.array([1, 2, 3, 4, 5])): + self.csv_generator.traverse_files(npy_files) + dst_file_path = os.path.join(self.save_dir, f"step{self.csv_generator.current_step}", "dir.npy") + assert os.path.exists(dst_file_path) + real_tensor = np.load(dst_file_path) + self.assertTrue((real_tensor == np.array([1, 2, 3, 4, 5])).all()) + if __name__ == "__main__": from unittest import main - main() \ No newline at end of file + main() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_ms_grad_monitor.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_ms_grad_monitor.py index 955f77a196ca212e8dc779c1d79a3e1e65afb2c5..ae24457a444bfdddc796802126150577280d7e62 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_ms_grad_monitor.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/grad_probe/test_ms_grad_monitor.py @@ -1,26 +1,47 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import hashlib +import json import os -import numpy as np import shutil -import json from unittest import TestCase -import hashlib +from unittest.mock import patch + +import numpy as np import mindspore from mindspore import nn, Tensor from mindspore.nn import SGD + +from msprobe.core.common.file_utils import FileOpen +from msprobe.core.grad_probe.constant import GradConst from msprobe.mindspore import PrecisionDebugger from msprobe.mindspore.grad_probe.global_context import grad_context -from msprobe.core.grad_probe.constant import GradConst -from msprobe.core.common.file_utils import FileOpen + file_path = os.path.abspath(__file__) directory = os.path.dirname(file_path) config_json_path = os.path.join(directory, "config.json") + def main(): PrecisionDebugger._instance = None PrecisionDebugger.initialized = False grad_context._setting[GradConst.CURRENT_STEP] = 0 - debugger = PrecisionDebugger(config_json_path) + with patch("msprobe.mindspore.debugger.precision_debugger.set_register_backward_hook_functions"): + debugger = PrecisionDebugger(config_json_path) class SimpleNet(nn.Cell): def __init__(self): @@ -37,7 +58,7 @@ def main(): debugger.monitor(optimizer) fix_gradient = tuple([Tensor(np.arange(5*16).reshape((5, 16)), dtype=mindspore.float32), - Tensor(np.arange(5).reshape(5), dtype=mindspore.float32)]) + Tensor(np.arange(5).reshape(5), dtype=mindspore.float32)]) steps = 10 @@ -98,7 +119,6 @@ class TestMsGradientMonitor(TestCase): target_md5_value = "d5e71f1aa37d48ef0ca0a75932597a29" self.assertEqual(real_md5_value, target_md5_value, "hash value of grad_summary_1.csv is not same as target") - def test_gradient_monitor_L1(self): gradient_output_path = os.path.join(directory, "gradient_output") if os.path.isfile(config_json_path): @@ -159,4 +179,4 @@ class TestMsGradientMonitor(TestCase): real_md5_value = get_hash(os.path.join(gradient_output_path, "rank0", "grad_summary_1.csv")) target_md5_value = "62e137a119c0d1a44623f10049c3f80d" - 
self.assertEqual(real_md5_value, target_md5_value, "hash value of grad_summary_1.csv is not same as target")
\ No newline at end of file
+        self.assertEqual(real_md5_value, target_md5_value, "hash value of grad_summary_1.csv is not same as target")
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/hook_module/test_ms_wrap_distributed.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/hook_module/test_ms_wrap_distributed.py
new file mode 100644
index 0000000000000000000000000000000000000000..325996ae0ee422f6e111ad831d20fea1e8344736
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/hook_module/test_ms_wrap_distributed.py
@@ -0,0 +1,112 @@
+import unittest
+from unittest.mock import Mock, patch
+import numpy as np
+import mindspore
+from mindspore import Tensor, ops
+
+from msprobe.mindspore.monitor.distributed.wrap_distributed import (
+    catch_data,
+    DistributedOPTemplate,
+    ApiRegistry,
+    get_distributed_ops,
+    is_target_line,
+    op_aggregate,
+    update_data
+)
+from msprobe.core.common.const import MonitorConst
+
+class TestWrapDistributed(unittest.TestCase):
+    def setUp(self):
+        self.mock_ops = ['min', 'max', 'norm']
+        self.mock_rank = '0'
+
+    def hook(self):
+        def forward_pre_hook(nope, inputs):
+            return inputs
+
+        def forward_hook(nope, inputs, output):
+            return 2
+
+        return [forward_pre_hook], [forward_hook]
+
+    def test_catch_data(self):
+        # Prepare test data
+        cc_context = Mock()
+        cc_context.data = {}
+        cc_name = "all_reduce"
+        args = [Tensor(np.array([1.0, 2.0, 3.0]))]
+
+        # Verify that input data is captured
+        catch_data(cc_context, cc_name, self.mock_ops, args, MonitorConst.PREFIX_PRE)
+        self.assertTrue('all_reduce/pre_0' in cc_context.data)
+
+        # Verify that output data is captured
+        catch_data(cc_context, cc_name, self.mock_ops, args, MonitorConst.PREFIX_POST)
+        self.assertTrue('all_reduce/post_0' in cc_context.data)
+
+    def test_distributed_op_template(self):
+        # Test the distributed-op template
+        pre_hooks, post_hooks = self.hook()
+        op = DistributedOPTemplate("all_reduce", pre_hooks, post_hooks)
+
+        self.assertEqual(op.op_name_, "all_reduce")
+        self.assertEqual(len(op.cc_hooks), 2)
+
+    def test_api_registry(self):
+        # Test the API registry
+        registry = ApiRegistry()
+
+        # Verify that original API attributes are stored
+        ori_api_group = Mock()
+        api_list = ["all_reduce", "all_gather"]
+        api_ori_attr = {}
+
+        ApiRegistry.store_ori_attr(ori_api_group, api_list, api_ori_attr)
+        self.assertEqual(len(api_ori_attr), 2)
+
+    def test_op_aggregate(self):
+        # Test op aggregation
+        tensor_list = [Tensor(1.0), Tensor(2.0), Tensor(3.0)]
+
+        # Test the min op
+        result = op_aggregate('min', tensor_list)
+        self.assertEqual(result.asnumpy(), 1.0)
+
+        # Test the max op
+        result = op_aggregate('max', tensor_list)
+        self.assertEqual(result.asnumpy(), 3.0)
+
+        # Test the mean op
+        result = op_aggregate('mean', tensor_list)
+        self.assertEqual(result.asnumpy(), 2.0)
+
+    def test_update_data(self):
+        # Test data update
+        old_data = {}
+        new_data = {
+            'tag1': {
+                'min': Tensor(1.0),
+                'max': Tensor(2.0)
+            }
+        }
+
+        result = update_data(old_data, new_data)
+        self.assertTrue('tag1' in result)
+        self.assertTrue('min' in result['tag1'])
+        self.assertTrue('max' in result['tag1'])
+
+    def test_is_target_line(self):
+        # Test target-line matching
+        # An empty codeline list should return True
+        self.assertTrue(is_target_line([]))
+
+        # Test pattern matching
+        codeline = ['test_pattern']
+        with patch('msprobe.mindspore.monitor.distributed.wrap_distributed.get_callstack') as mock_callstack:
+            mock_callstack.return_value = ['test_pattern_line']
+            self.assertTrue(is_target_line(codeline))
+
+    def test_get_distributed_ops(self):
+        # Test getting the set of distributed ops
+        ops = get_distributed_ops()
+        self.assertIsInstance(ops, set)
diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_cell_processor.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_cell_processor.py
index f49b8b09923f1d2abd829ea729031b07623a61b5..40f5c0164115e18cdd49c046ce29967e7a3f63eb 100644
--- a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_cell_processor.py
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_cell_processor.py
@@ -1,8 +1,24 @@
+# Copyright (c) 2024-2024, Huawei 
Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import unittest from unittest.mock import MagicMock, patch -from msprobe.core.data_dump.scope import ModuleRangeScope + from msprobe.core.common.const import Const -from msprobe.mindspore.cell_processor import CellProcessor # 替换为实际的模块名 +from msprobe.core.data_dump.scope import ModuleRangeScope +from msprobe.mindspore.cell_processor import CellProcessor class MockCell: @@ -95,6 +111,37 @@ class TestCellProcessor(unittest.TestCase): self.assertEqual(len(CellProcessor.cell_stack), 0) # Stack should be empty now self.assertIsNone(CellProcessor.api_parent_node) + def test_set_and_get_reserved_name(self): + cell = MockCell() + cell.mindstudio_reserved_name = "mindstudio_reserved_name" + CellProcessor.reset_cell_stats() + + cell_name = "Cell.net.Net.forward" + ret = self.processor.set_and_get_reserved_name(cell, cell_name) + self.assertEqual(ret, cell_name + Const.SEP + "0") + self.assertEqual(cell.mindstudio_reserved_name, ret) + self.assertEqual(CellProcessor.cell_count[cell_name], 0) + self.assertFalse(hasattr(cell, "has_pre_hook_called")) + + cell.has_pre_hook_called = False + ret = self.processor.set_and_get_reserved_name(cell, cell_name) + self.assertEqual(ret, cell_name + Const.SEP + "1") + self.assertEqual(cell.mindstudio_reserved_name, ret) + self.assertEqual(CellProcessor.cell_count[cell_name], 1) + self.assertFalse(cell.has_pre_hook_called) + + cell.has_pre_hook_called 
= True + cell.mindstudio_reserved_name = "mindstudio_reserved_name" + CellProcessor.reset_cell_stats() + ret = self.processor.set_and_get_reserved_name(cell, cell_name) + self.assertEqual(ret, "mindstudio_reserved_name") + self.assertEqual(cell.mindstudio_reserved_name, ret) + self.assertEqual(CellProcessor.cell_count, {}) + self.assertFalse(cell.has_pre_hook_called) -if __name__ == "__main__": - unittest.main() + ret = self.processor.set_and_get_reserved_name(cell, cell_name, is_called_by_pre_hook=True) + self.assertEqual(ret, cell_name + Const.SEP + "0") + self.assertEqual(cell.mindstudio_reserved_name, ret) + self.assertEqual(CellProcessor.cell_count[cell_name], 0) + self.assertTrue(cell.has_pre_hook_called) + CellProcessor.reset_cell_stats() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_dump_tool_factory.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_dump_tool_factory.py index 973132a5ac90ed4acdbd1215eea93c2429b0b8f5..8f5d207c41923175b6efe4f9dc313896f879fd89 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_dump_tool_factory.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_dump_tool_factory.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -13,12 +12,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + from unittest import TestCase from unittest.mock import patch -from msprobe.mindspore.common.const import Const from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.core.common.const import Const as CoreConst +from msprobe.mindspore.common.const import Const from msprobe.mindspore.debugger.debugger_config import DebuggerConfig from msprobe.mindspore.dump.dump_tool_factory import DumpToolFactory @@ -38,6 +38,17 @@ class TestDumpToolFactory(TestCase): task_config = BaseConfig(json_config) config = DebuggerConfig(common_config, task_config) + config.data_mode = [CoreConst.INPUT, CoreConst.OUTPUT] + with self.assertRaises(Exception) as context: + DumpToolFactory.create(config) + self.assertEqual(str(context.exception), "data_mode must be one of all, input, output.") + + config.data_mode = [CoreConst.FORWARD] + with self.assertRaises(Exception) as context: + DumpToolFactory.create(config) + self.assertEqual(str(context.exception), "data_mode must be one of all, input, output.") + + config.data_mode = [CoreConst.ALL] config.level = "module" with self.assertRaises(Exception) as context: DumpToolFactory.create(config) diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_config.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_config.py index 3d6d409daabc58eddad28ce43b1cfc9bfe67419f..7717f9c336202d67ee524f59c3c5f328e70a045f 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_config.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_config.py @@ -1,7 +1,6 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at @@ -13,9 +12,9 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" + import unittest -from unittest.mock import patch, mock_open +from unittest.mock import patch from msprobe.core.common.const import Const from msprobe.mindspore.ms_config import (parse_json_config, parse_task_config, @@ -36,8 +35,7 @@ class TestMsConfig(unittest.TestCase): "summary_mode": "statistics" } } - with patch("msprobe.mindspore.ms_config.FileOpen", mock_open(read_data='')), \ - patch("msprobe.mindspore.ms_config.json.load", return_value=mock_json_data): + with patch("msprobe.mindspore.ms_config.load_json", return_value=mock_json_data): common_config, task_config = parse_json_config("./config.json") self.assertEqual(common_config.task, Const.STATISTICS) self.assertEqual(task_config.data_mode, ["all"]) @@ -79,9 +77,19 @@ class TestMsConfig(unittest.TestCase): task_config = parse_task_config("overflow_check", mock_json_config) self.assertEqual(str(context.exception), "check_mode is invalid") + mock_json_config.update({"free_benchmark": {"fuzz_stage": Const.FORWARD}}) task_config = parse_task_config("free_benchmark", mock_json_config) self.assertTrue(isinstance(task_config, FreeBenchmarkConfig)) + mock_json_config.update({"free_benchmark": {"fuzz_stage": Const.BACKWARD}}) + task_config = parse_task_config("free_benchmark", mock_json_config) + self.assertTrue(isinstance(task_config, FreeBenchmarkConfig)) + + mock_json_config.update({"free_benchmark": {"fuzz_stage": "unsupported_stage"}}) + with self.assertRaises(Exception) as context: + task_config = parse_task_config("free_benchmark", mock_json_config) + self.assertEqual(str(context.exception), "fuzz_stage must be forward, backward or empty") + with self.assertRaises(Exception) as context: parse_task_config("unsupported_task", mock_json_config) 
self.assertEqual(str(context.exception), "task is invalid.") diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_debug_save.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_debug_save.py new file mode 100644 index 0000000000000000000000000000000000000000..495eedbf41384f820c2ca054fd73192d1966a8bd --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_debug_save.py @@ -0,0 +1,77 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from unittest import TestCase +from unittest.mock import patch +import mindspore + +from msprobe.mindspore import PrecisionDebugger +from msprobe.core.common_config import CommonConfig, BaseConfig + +class TestMindsporeDebuggerSave(TestCase): + def setUp(self): + PrecisionDebugger._instance = None + mindspore.set_context(mode=mindspore.PYNATIVE_MODE) + statistics_task_json = { + "task": "statistics", + "dump_path": "./dump_path", + "rank": [], + "step": [], + "level": "debug", + "enable_dataloader": False, + "statistics": { + "summary_mode": "statistics" + } + } + common_config = CommonConfig(statistics_task_json) + task_config = BaseConfig(statistics_task_json) + with patch("msprobe.mindspore.debugger.precision_debugger.parse_json_config", return_value=(common_config, task_config)), \ + patch("msprobe.mindspore.debugger.precision_debugger.set_register_backward_hook_functions"): + self.debugger = PrecisionDebugger() + + def test_forward_and_backward(self): + def forward_func(x, y): + PrecisionDebugger.save(x, "x_tensor") + return x * y + x = mindspore.Tensor([1.]) + y = mindspore.Tensor([2.]) + result_json = { + "task": "statistics", + "level": "debug", + "framework": "mindspore", + "dump_data_dir": None, + "data": { + "x_tensor.0": { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": (1,), + "Max": 1.0, + "Min": 1.0, + "Mean": 1.0, + "Norm": 1.0 + }, + "x_tensor_grad.0": { + "type": "mindspore.Tensor", + "dtype": "Float32", + "shape": (1,), + "Max": 2.0, + "Min": 2.0, + "Mean": 2.0, + "Norm": 2.0 + } + } + } + grad_fn = mindspore.value_and_grad(forward_func, (0, 1)) + grad_fn(x, y) + self.assertEqual(self.debugger.service.data_collector.data_writer.cache_debug, result_json) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_service.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_service.py new file mode 100644 index 
0000000000000000000000000000000000000000..912830ea1ab705aae63c69f5c240887d4b4ce5b7
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_service.py
@@ -0,0 +1,300 @@
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+from collections import defaultdict
+from unittest.mock import MagicMock, patch
+
+from mindspore import nn, ops
+
+from msprobe.core.common.exceptions import MsprobeException
+from msprobe.core.common.utils import Const, DumpPathAggregation
+from msprobe.core.data_dump.scope import BaseScope
+from msprobe.mindspore.cell_processor import CellProcessor
+from msprobe.mindspore.common.log import logger
+from msprobe.mindspore.common.utils import register_backward_hook_functions
+from msprobe.mindspore.dump.hook_cell.api_registry import ApiRegistry, api_register
+from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell
+from msprobe.mindspore.dump.jit_dump import JitDump
+from msprobe.mindspore.service import Service
+
+
+class TestService(unittest.TestCase):
+    def setUp(self):
+        self.config_mock = MagicMock()
+        self.config_mock.level_ori = Const.LEVEL_L0
+        self.config_mock.dump_path = "/tmp/dump"
+        self.config_mock.step = []
+        self.config_mock.rank = []
+        self.config_mock.task = Const.TENSOR
+        self.config_mock.framework = Const.MS_FRAMEWORK
+        self.config_mock.list = 
[] + self.config_mock.scope = [] + self.service = Service(self.config_mock) + self.service.model = MagicMock(spec=nn.Cell) + self.service.data_collector = MagicMock() + self.service.primitive_hook_service = MagicMock() + + def tearDown(self) -> None: + api_register.api_set_ori_func() + + def test_init(self): + self.assertEqual(self.service.config.level, "L0") + self.assertFalse(self.service.switch) + self.assertFalse(self.service.should_stop_service) + self.assertFalse(self.service.start_call) + self.assertTrue(self.service.first_start) + + def test_check_model_valid_with_valid_cell(self): + model = nn.Cell() + model_list = [model] + self.assertEqual(self.service.check_model_valid(model), model) + self.assertEqual(self.service.check_model_valid(model_list), model_list) + + def test_check_model_valid_with_invalid_type(self): + model = nn.Cell() + with self.assertRaises(MsprobeException): + self.service.check_model_valid("not a cell") + with self.assertRaises(MsprobeException): + self.service.check_model_valid(["not a cell", model]) + + def test_update_primitive_counters(self): + self.service.primitive_counters = {} + self.service.update_primitive_counters("conv2d") + self.assertEqual(self.service.primitive_counters["conv2d"], 0) + self.service.update_primitive_counters("conv2d") + self.assertEqual(self.service.primitive_counters["conv2d"], 1) + + @patch('msprobe.mindspore.service.create_directory') + def test_create_dirs(self, mock_create_directory): + self.service.current_iter = 1 + self.service.current_rank = 0 + self.service.data_collector.tasks_need_tensor_data = [Const.TENSOR] + self.service.data_collector.update_dump_paths = MagicMock() + self.service.create_dirs() + expected_calls = [ + ("/tmp/dump"), + ("/tmp/dump/step1/rank0"), + "/tmp/dump/step1/rank0/dump_tensor_data" + ] + mock_create_directory.assert_has_calls( + [unittest.mock.call(path) for path in expected_calls], any_order=True) + + args, _ = self.service.data_collector.update_dump_paths.call_args + 
self.assertEqual(args[0].dump_file_path, "/tmp/dump/step1/rank0/dump.json")
+        self.assertEqual(args[0].stack_file_path, "/tmp/dump/step1/rank0/stack.json")
+        self.assertEqual(args[0].construct_file_path, "/tmp/dump/step1/rank0/construct.json")
+        self.assertEqual(args[0].dump_tensor_data_dir, "/tmp/dump/step1/rank0/dump_tensor_data")
+        self.service.data_collector.initialize_json_file.assert_called_once_with(
+            framework=Const.MS_FRAMEWORK
+        )
+
+    @patch.object(Service, 'need_end_service', return_value=False)
+    def test_start_stop_cycle(self, mock_need_end_service):
+        self.service.model = nn.Cell()
+        with patch.object(self.service, 'register_cell_hook') as mock_register_hook:
+            self.service.should_stop_service = False
+            self.service.start(self.service.model)
+            self.assertTrue(self.service.switch)
+            self.service.stop()
+            self.assertFalse(self.service.switch)
+            mock_register_hook.assert_called_once()
+            mock_need_end_service.assert_called_once()
+
+    def test_should_execute_hook_return_false(self):
+        cell = MagicMock()
+        self.service.switch = False
+        self.assertFalse(self.service.should_execute_hook("Module", cell, True))
+        self.assertFalse(self.service.should_execute_hook("api", cell, True))
+
+        self.service.switch = True
+        cell.forward_data_collected = False
+        self.assertFalse(self.service.should_execute_hook("api", cell, False))
+
+        self.service.inner_switch = True
+        self.assertFalse(self.service.should_execute_hook("Module", cell, True))
+
+        self.service.inner_switch = False
+        self.service.data_collector = None
+        self.assertFalse(self.service.should_execute_hook("Module", cell, True))
+
+    def test_should_execute_hook_return_true(self):
+        cell = MagicMock()
+        self.service.switch = True
+        self.service.inner_switch = False
+        self.service.data_collector = MagicMock()
+        self.service.data_collector.data_processor = MagicMock()
+        self.service.data_collector.data_processor.is_terminated = False
+        self.assertTrue(self.service.should_execute_hook("Module", cell, True))
+
+        
cell.forward_data_collected = True + self.assertTrue(self.service.should_execute_hook("api", cell, False)) + + def test_need_end_service_with_high_step(self): + self.service.config.step = [1, 2, 3] + self.service.current_iter = 4 + self.assertTrue(self.service.need_end_service()) + + def test_need_end_service_with_low_step(self): + self.service.config.step = [1, 2, 3] + self.service.current_iter = 2 + self.service.data_collector.data_processor.is_terminated = False + self.assertFalse(self.service.need_end_service()) + + def test_start_with_termination_condition(self): + self.service.config.step = [1, 2, 3] + self.service.current_iter = 4 + self.service.start() + self.assertFalse(self.service.switch) + self.assertTrue(self.service.should_stop_service) + self.assertFalse(self.service.primitive_switch) + + @patch('msprobe.mindspore.service.print_tools_ends_info') + @patch.object(Service, 'need_end_service', return_value=True) + def test_start_with_end_service(self, mock_need_end_service, mock_print_tools_ends_info): + self.service.start(self.service.model) + mock_need_end_service.assert_called_once() + mock_print_tools_ends_info.assert_called_once() + self.assertFalse(self.service.switch) + self.assertTrue(self.service.should_stop_service) + + @patch.object(Service, 'need_end_service', return_value=False) + @patch.object(logger, 'info') + @patch.object(Service, 'register_cell_hook') + @patch.object(Service, 'register_primitive_hook') + @patch.object(Service, 'create_dirs') + @patch('msprobe.mindspore.service.get_rank_if_initialized', return_value=0) + def test_start_first_time(self, mock_get_rank, mock_create_dirs, mock_register_primitive_hook, + mock_register_cell_hook, mock_logger, mock_need_end_service): + self.service.first_start = True + self.service.should_stop_service = False + self.service.start(self.service.model) + mock_get_rank.assert_called_once() + mock_register_cell_hook.assert_called_once() + mock_register_primitive_hook.assert_called_once() + 
mock_need_end_service.assert_called_once() + mock_create_dirs.assert_called_once() + self.assertFalse(self.service.first_start) + self.assertTrue(self.service.switch) + self.assertTrue(self.service.primitive_switch) + mock_logger.assert_called_with(f"Dump data will be saved in {self.service.dump_iter_dir}.") + + @patch.object(Service, 'register_primitive_hook') + @patch.object(Service, 'register_cell_hook') + @patch.object(Service, 'need_end_service', return_value=False) + @patch.object(JitDump, 'set_config') + @patch.object(JitDump, 'set_data_collector') + @patch.object(ApiRegistry, 'api_set_hook_func') + def test_start_with_jit_dump_enabled(self, mock_api_set_hook_func, mock_set_data_collector, + mock_set_config, mock_need_end_service, mock_register_cell_hook, + mock_register_primitive_hook): + self.service.config.level = Const.LEVEL_MIX + self.service.first_start = True + self.service.should_stop_service = False + self.service.start(self.service.model) + mock_set_config.assert_called_with(self.service.config) + mock_set_data_collector.assert_called_with(self.service.data_collector) + mock_api_set_hook_func.assert_called_once() + mock_need_end_service.assert_called_once() + mock_register_cell_hook.assert_called_once() + mock_register_primitive_hook.assert_called_once() + self.assertTrue(JitDump.jit_dump_switch) + + def test_step_updates(self): + CellProcessor.cell_count = {"test_api": 1} + HOOKCell.cell_count = {"test_api": 1} + JitDump.jit_count = {"test_api": 1} + self.service.primitive_hook_service.primitive_counters = {"test_api": 1} + self.service.current_iter = 0 + self.service.step() + self.assertEqual(self.service.current_iter, 1) + self.service.data_collector.update_iter.assert_called_once_with(1) + self.service.data_collector.reset_status.assert_called_once() + self.assertEqual(JitDump.jit_count, defaultdict(int)) + self.assertEqual((self.service.primitive_hook_service.primitive_counters), {}) + + @patch.object(Service, 'should_execute_hook') + def 
test_build_forward_and_backward_hooks(self, mock_should_execute_hook): + mock_should_execute_hook.return_value = True + self.service.data_collector = MagicMock() + self.service.data_collector.update_api_or_module_name = MagicMock() + self.service.data_collector.forward_data_collect = MagicMock() + self.service.data_collector.if_return_forward_new_output = MagicMock(return_value=False) + self.service.data_collector.backward_data_collect = MagicMock() + + mock_cell = MagicMock() + mock_cell.mindstudio_reserved_name = "TestCell" + mock_input = (MagicMock(),) + mock_output = MagicMock() + + _, forward_hook, backward_hook, _ = self.service.build_hook(BaseScope.Module_Type_Module, "TestHook") + + forward_hook(mock_cell, mock_input, mock_output) + self.service.data_collector.update_api_or_module_name.assert_called_with('TestCell') + self.service.data_collector.forward_data_collect.assert_called() + + self.service.data_collector.reset_mock() + + mock_grad_input = (MagicMock(),) + mock_grad_output = MagicMock() + + backward_hook(mock_cell, mock_grad_input, mock_grad_output) + self.service.data_collector.update_api_or_module_name.assert_called_with('TestHookbackward.0') + self.service.data_collector.backward_data_collect.assert_called() + + def test_register_primitive_hook(self): + self.service.config.level = Const.LEVEL_MIX + primitive_attr = ops.Add() + primitive_name = "primitive_api" + cell_mock = MagicMock() + cell_mock.primitive_api = primitive_attr + primitive_combined_name = primitive_name + Const.SEP + primitive_attr.__class__.__name__ + self.service.model.cells_and_names.return_value = [("cell_name", cell_mock)] + self.service.register_primitive_hook() + self.assertTrue(hasattr(primitive_attr.__class__, '__call__')) + self.assertEqual(self.service.primitive_hook_service.wrap_primitive.call_args[0][1], + primitive_combined_name) + + @patch.object(ApiRegistry, 'initialize_hook') + @patch.object(ApiRegistry, 'api_set_hook_func') + 
@patch("msprobe.mindspore.service.logger.info") + def test_register_hook_new_with_level_mix(self, mock_logger, mock_api_set_hook_func, mock_initialize_hook): + self.service.config.level = Const.LEVEL_MIX + self.service.register_api_hook() + self.service.register_cell_hook() + mock_logger.assert_called_with(f"The cell {self.service.config.task} hook function " + "is successfully mounted to the model.") + mock_api_set_hook_func.assert_called() + mock_initialize_hook.assert_called() + + @patch.object(CellProcessor, 'node_hook') + def test_register_hook_new_with_level_l0(self, mock_node_hook): + global register_backward_hook_functions + self.service.config.level = Const.LEVEL_L0 + cell_mock = MagicMock() + self.service.model.cells_and_names.return_value = [("cell_name", cell_mock)] + register_backward_hook_functions["pre"] = cell_mock.register_backward_pre_hook + register_backward_hook_functions["full"] = cell_mock.register_backward_hook + self.service.register_cell_hook() + cell_mock.register_forward_hook.assert_called() + cell_mock.register_backward_hook.assert_called() + mock_node_hook.assert_called() + register_backward_hook_functions = {} + + def test_register_hook_new_without_model_raises_exception(self): + self.service.config.level = Const.LEVEL_L0 + self.service.model = None + with self.assertRaises(MsprobeException): + self.service.register_cell_hook() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_primitive_dump.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_primitive_dump.py index bab91046a8c4aeb611467c3f17ebea6b742a3937..3cafd49f2c101c45dbb65a08803dd77c6bca485d 100644 --- a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_primitive_dump.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_primitive_dump.py @@ -76,10 +76,6 @@ class TestService(unittest.TestCase): with self.assertRaises(MsprobeException) as context: self.service.check_model_valid(model) - # For the purpose of the test, let's also verify the expected 
exception message - expected_message = f"{MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)}model 参数必须是 mindspore.nn.Cell 类型。" - self.assertEqual(str(context.exception), expected_message) - def test_update_primitive_counters(self): primitive_name = "test_primitive" self.service.primitive_hook_service.update_primitive_counters(primitive_name) @@ -523,6 +519,7 @@ class TestPrimitiveHookService(unittest.TestCase): # 确保在 switch 关闭时不应用 hook mock_origin_func.assert_called_once() + HOOKCell.cell_count = defaultdict(int) self.assertTrue((result == input_tensor).all()) # 使用 .all() 来比较 Tensor @patch('msprobe.mindspore.dump.hook_cell.primitive_hooks.ops.HookBackward') diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_config.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_config.py index 35fc6164763e685d09e737e7f85bec33623ec111..df03485dc6c77371750fd0b67ca2c37ff7e2ed7b 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_config.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_config.py @@ -2,7 +2,20 @@ import unittest import os from unittest.mock import patch -from msprobe.pytorch.api_accuracy_checker.common.config import Config +from msprobe.pytorch.api_accuracy_checker.common.config import Config, CheckerConfig, OnlineConfig, msCheckerConfig + + +class TestUtConfig(): + def __init__(self): + self.white_list = ['api1', 'api2'] + self.black_list = ['api3'] + self.error_data_path = '/path/to/error_data' + self.is_online = True + self.nfs_path = '/path/to/nfs' + self.host = 'localhost' + self.port = 8080 + self.rank_list = [0, 1, 2] + self.tls_path = '/path/to/tls' class TestConfig(unittest.TestCase): @@ -11,6 +24,7 @@ class TestConfig(unittest.TestCase): self.input_dir = os.path.join(self.base_test_dir, 'resources') self.yaml_file = os.path.join(self.input_dir, "config.yaml") self.cfg = Config(self.yaml_file) + 
self.task_config = TestUtConfig() def test_validate_valid_data(self): for key, val in self.cfg.config.items(): @@ -30,6 +44,9 @@ class TestConfig(unittest.TestCase): with self.assertRaises(ValueError): self.cfg.validate('precision', -1) + + with self.assertRaises(ValueError): + self.cfg.validate('precision', True) def test_validate_white_list(self): validate_white_list = ['conv1d', 'max_pool1d', 'dropout', '__add__'] @@ -37,3 +54,83 @@ class TestConfig(unittest.TestCase): with self.assertRaises(Exception): self.cfg.validate('white_list', ['invalid_api1', 'invalid_api2']) + + def test_CheckerConfig_init_with_defaults(self): + checker_config = CheckerConfig() + self.assertEqual(checker_config.white_list, msCheckerConfig.white_list) + self.assertEqual(checker_config.black_list, msCheckerConfig.black_list) + self.assertEqual(checker_config.error_data_path, msCheckerConfig.error_data_path) + self.assertEqual(checker_config.is_online, msCheckerConfig.is_online) + self.assertEqual(checker_config.nfs_path, msCheckerConfig.nfs_path) + self.assertEqual(checker_config.host, msCheckerConfig.host) + self.assertEqual(checker_config.port, msCheckerConfig.port) + self.assertEqual(checker_config.rank_list, msCheckerConfig.rank_list) + self.assertEqual(checker_config.tls_path, msCheckerConfig.tls_path) + + def test_init_with_task_config(self): + checker_config = CheckerConfig(self.task_config) + self.assertEqual(checker_config.white_list, self.task_config.white_list) + self.assertEqual(checker_config.black_list, self.task_config.black_list) + self.assertEqual(checker_config.error_data_path, self.task_config.error_data_path) + self.assertEqual(checker_config.is_online, self.task_config.is_online) + self.assertEqual(checker_config.nfs_path, self.task_config.nfs_path) + self.assertEqual(checker_config.host, self.task_config.host) + self.assertEqual(checker_config.port, self.task_config.port) + self.assertEqual(checker_config.rank_list, self.task_config.rank_list) + 
self.assertEqual(checker_config.tls_path, self.task_config.tls_path) + + def test_load_config(self): + checker_config = CheckerConfig() + checker_config.load_config(self.task_config) + self.assertEqual(checker_config.is_online, self.task_config.is_online) + self.assertEqual(checker_config.nfs_path, self.task_config.nfs_path) + self.assertEqual(checker_config.host, self.task_config.host) + self.assertEqual(checker_config.port, self.task_config.port) + self.assertEqual(checker_config.rank_list, self.task_config.rank_list) + self.assertEqual(checker_config.tls_path, self.task_config.tls_path) + + def test_get_online_config(self): + checker_config = CheckerConfig() + checker_config.load_config(self.task_config) + online_config = checker_config.get_online_config() + self.assertIsInstance(online_config, OnlineConfig) + self.assertEqual(online_config.is_online, self.task_config.is_online) + self.assertEqual(online_config.nfs_path, self.task_config.nfs_path) + self.assertEqual(online_config.host, self.task_config.host) + self.assertEqual(online_config.port, self.task_config.port) + self.assertEqual(online_config.rank_list, self.task_config.rank_list) + self.assertEqual(online_config.tls_path, self.task_config.tls_path) + + def test_get_run_ut_config(self): + forward_content = {'api1': 'data1', 'api2': 'data2'} + backward_content = {'api3': 'data3'} + result_csv_path = '/path/to/result.csv' + details_csv_path = '/path/to/details.csv' + save_error_data = True + api_result_csv_path = '/path/to/api_result.csv' + real_data_path = '/path/to/real_data' + error_data_path = '/path/to/error_data' + + checker_config = CheckerConfig() + + config_params = { + 'forward_content': forward_content, + 'backward_content': backward_content, + 'result_csv_path': result_csv_path, + 'details_csv_path': details_csv_path, + 'save_error_data': save_error_data, + 'is_continue_run_ut': api_result_csv_path, + 'real_data_path': real_data_path, + 'error_data_path': error_data_path + } + + run_ut_config 
= checker_config.get_run_ut_config(**config_params) + + self.assertEqual(run_ut_config.forward_content, forward_content) + self.assertEqual(run_ut_config.backward_content, backward_content) + self.assertEqual(run_ut_config.result_csv_path, result_csv_path) + self.assertEqual(run_ut_config.details_csv_path, details_csv_path) + self.assertEqual(run_ut_config.save_error_data, save_error_data) + self.assertEqual(run_ut_config.is_continue_run_ut, api_result_csv_path) + self.assertEqual(run_ut_config.real_data_path, real_data_path) + self.assertEqual(run_ut_config.error_data_path, error_data_path) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py index 18376a05811027367033e95c06783f1748b5ef20..15a7908ad8de6d4883e0574ceaf451a03dbfbfe3 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py @@ -1,10 +1,108 @@ import unittest -from unittest.mock import patch +from unittest.mock import patch, MagicMock, Mock import pandas as pd from msprobe.pytorch.api_accuracy_checker.compare.api_precision_compare import * +from msprobe.core.common.exceptions import FileCheckException +from msprobe.pytorch.api_accuracy_checker.compare.api_precision_compare import _api_precision_compare_command, register_compare_func from msprobe.core.common.const import CompareConst +from msprobe.pytorch.api_accuracy_checker.compare.compare_input import PrecisionCompareInput + + +class Args: + def __init__(self, npu_csv_path=None, gpu_csv_path=None, out_path=None): + self.npu_csv_path = npu_csv_path + self.gpu_csv_path = gpu_csv_path + self.out_path = out_path + + +class TestFileCheck(unittest.TestCase): + def setUp(self): + src_path = 'temp_path' + 
create_directory(src_path) + dst_path = 'compare_soft_link' + os.symlink(src_path, dst_path) + self.hard_path = os.path.abspath(src_path) + self.soft_path = os.path.abspath(dst_path) + csv_path = os.path.join(self.hard_path, 'test.csv') + csv_data = [['1', '2', '3']] + write_csv(csv_data, csv_path) + self.hard_csv_path = os.path.abspath(csv_path) + soft_csv_path = 'soft.csv' + os.symlink(csv_path, soft_csv_path) + self.soft_csv_path = os.path.abspath(soft_csv_path) + self.empty_path = "empty_path" + + def tearDown(self): + os.unlink(self.hard_csv_path) + os.unlink(self.soft_csv_path) + os.unlink(self.soft_path) + for file in os.listdir(self.hard_path): + os.remove(os.path.join(self.hard_path, file)) + os.rmdir(self.hard_path) + + def test_npu_path_soft_link_check(self): + args = Args(npu_csv_path=self.soft_csv_path, gpu_csv_path=self.hard_csv_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_gpu_path_soft_link_check(self): + args = Args(npu_csv_path=self.hard_csv_path, gpu_csv_path=self.soft_csv_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_out_path_soft_link_check(self): + args = Args(npu_csv_path=self.hard_csv_path, gpu_csv_path=self.hard_csv_path, out_path=self.soft_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_npu_path_empty_check(self): + args = Args(npu_csv_path=self.empty_path, gpu_csv_path=self.hard_csv_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, 
FileCheckException.ILLEGAL_PATH_ERROR) + + def test_gpu_path_empty_check(self): + args = Args(npu_csv_path=self.hard_csv_path, gpu_csv_path=self.empty_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_out_path_empty_check(self): + args = Args(npu_csv_path=self.hard_csv_path, gpu_csv_path=self.hard_csv_path, out_path=self.empty_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_npu_path_invalid_type_check(self): + args = Args(npu_csv_path=123, gpu_csv_path=self.hard_csv_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.INVALID_FILE_ERROR) + + def test_gpu_path_invalid_type_check(self): + args = Args(npu_csv_path=self.hard_csv_path, gpu_csv_path=123, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.INVALID_FILE_ERROR) + + def test_out_path_invalid_type_check(self): + args = Args(npu_csv_path=self.hard_csv_path, gpu_csv_path=self.hard_csv_path, out_path=123) + with self.assertRaises(Exception) as context: + _api_precision_compare_command(args) + self.assertEqual(context.exception.code, FileCheckException.INVALID_FILE_ERROR) class TestApiPrecisionCompare(unittest.TestCase): @@ -19,7 +117,7 @@ class TestApiPrecisionCompare(unittest.TestCase): ) self.npu_data = pd.DataFrame({ - 'API_NAME': ['api1.forward', 'api1.backward'], + 'API_NAME': ['torch.abs.0.forward.output.0', 'torch.matmul.0.forward.output.0'], 'DEVICE_DTYPE': ['float32', 'float32'], 'ERROR_RATE': ['0', '0.1'], 'SMALL_VALUE_ERROR_RATE': ['0.01', '0.02'], @@ 
-30,7 +128,7 @@ class TestApiPrecisionCompare(unittest.TestCase): }) self.gpu_data = pd.DataFrame({ - 'API_NAME': ['api1.forward', 'api1.backward'], + 'API_NAME': ['torch.abs.0.forward.output.0', 'torch.matmul.0.forward.output.0'], 'DEVICE_DTYPE': ['float32', 'float32'], 'ERROR_RATE': ['0', '0'], 'SMALL_VALUE_ERROR_RATE': ['0.01', '0.01'], @@ -40,6 +138,44 @@ class TestApiPrecisionCompare(unittest.TestCase): 'EB': ['0.1', '0.1'] }) + self.test_data = { + 'API Name': ['torch.abs.0.forward.output.0'], + 'Shape': ['(2,3)'], + 'DEVICE Dtype': ['float32'], + '小值域错误占比': ['0'], + '均方根误差': ['0'], + '相对误差最大值': ['0'], + '相对误差平均值': ['0'], + '误差均衡性': ['0'], + '二进制一致错误率': ['0'], + 'inf/nan错误率': ['0'], + '相对误差错误率': ['0'], + '绝对误差错误率': ['0'], + 'ULP误差平均值': ['0'], + 'ULP误差大于阈值占比': ['0'], + '双千指标': ['0.999'], + 'Message': ['error'] + } + + self.test_data_2 = { + 'API Name': ['torch.abs.0.forward.output.0', 'torch.matmul.0.forward.output.0', + 'torch.matmul.0.backward.output.0', 'torch.add.0.forward.output.0'], + 'DEVICE Dtype': ['torch.float32', 'torch.float32', 'torch.float32', 'torch.float64'], + '小值域错误占比': ['0', '0', '0', '0'], + '均方根误差': ['0', '0', '0', '0'], + '相对误差最大值': ['0', '0', '0', '0'], + '相对误差平均值': ['0', '0', '0', '0'], + '误差均衡性': ['0', '0', '0', '0'], + '二进制一致错误率': ['0', '0', '0', '0'], + 'inf/nan错误率': ['0', '0', '0', '0'], + '相对误差错误率': ['0', '0', '0', '0'], + '绝对误差错误率': ['0', '0', '0', '0'], + 'ULP误差平均值': ['0', '0', '0', '0'], + 'ULP误差大于阈值占比': ['0', '0', '0', '0'], + '双千指标': ['0', '0', '0', '0'], + 'Message': ['error', 'pass', 'error', 'pass'] + } + self.api_name = "test_api" self.npu_precision = { ApiPrecisionCompareColumn.INF_NAN_ERROR_RATIO: '0', ApiPrecisionCompareColumn.REL_ERR_RATIO: '0', @@ -47,33 +183,37 @@ class TestApiPrecisionCompare(unittest.TestCase): ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE: '0.01', ApiPrecisionCompareColumn.RMSE: '0.1', ApiPrecisionCompareColumn.MAX_REL_ERR: '0.1', ApiPrecisionCompareColumn.MEAN_REL_ERR: '0.1', 
ApiPrecisionCompareColumn.EB: '0.1', ApiPrecisionCompareColumn.MEAN_ULP_ERR: '0.1', - ApiPrecisionCompareColumn.ULP_ERR_PROPORTION: '0.05' - } + ApiPrecisionCompareColumn.ULP_ERR_PROPORTION: '0.05', ApiPrecisionCompareColumn.SHAPE: '(2,3)' + } self.gpu_precision = { ApiPrecisionCompareColumn.INF_NAN_ERROR_RATIO: '0', ApiPrecisionCompareColumn.REL_ERR_RATIO: '0', ApiPrecisionCompareColumn.ABS_ERR_RATIO: '0', ApiPrecisionCompareColumn.ERROR_RATE: '0', ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE: '0.01', ApiPrecisionCompareColumn.RMSE: '0.1', ApiPrecisionCompareColumn.MAX_REL_ERR: '0.1', ApiPrecisionCompareColumn.MEAN_REL_ERR: '0.1', ApiPrecisionCompareColumn.EB: '0.1', ApiPrecisionCompareColumn.MEAN_ULP_ERR: '0.2', - ApiPrecisionCompareColumn.ULP_ERR_PROPORTION: '0.06'} - - self.ulp_standard = ULPStandard(self.api_name, self.npu_precision, self.gpu_precision) - self.benchmark_standard = BenchmarkStandard(self.api_name, self.npu_precision, self.gpu_precision) + ApiPrecisionCompareColumn.ULP_ERR_PROPORTION: '0.06', ApiPrecisionCompareColumn.SHAPE: '(2,3)' + } - def test_benchmark_standard_calc_ratio(self): - column_name = "TEST_COLUMN" - default_value = 0 - result = BenchmarkStandard._calc_ratio(column_name, '2', '1', default_value) - self.assertEqual(result[0], 2.0) + # 创建 DataFrame + self.npu_data = pd.DataFrame(self.test_data) + self.gpu_data = pd.DataFrame(self.test_data) + + self.npu_data_2 = pd.DataFrame(self.test_data_2) + self.gpu_data_2 = pd.DataFrame(self.test_data_2) + + # 使用第一行数据作为测试用例 + self.row_npu = self.npu_data.iloc[0] + self.row_gpu = self.gpu_data.iloc[0] - result = BenchmarkStandard._calc_ratio(column_name, '0', '0', default_value) - self.assertEqual(result[0], 1.0) + # 添加 compare_column + self.compare_column = MagicMock() + self.compare_column.api_name = MagicMock(return_value="test_api") - result = BenchmarkStandard._calc_ratio(column_name, '1', '0', default_value) - self.assertEqual(result[0], default_value) + self.registry = 
register_compare_func() + + self.dtype = 'torch.float16' - result = BenchmarkStandard._calc_ratio(column_name, 'nan', '0', default_value) - self.assertTrue(math.isnan(result[0])) + self.input_data = PrecisionCompareInput(self.row_npu, self.row_gpu, self.dtype, self.compare_column) def test_check_csv_columns(self): with self.assertRaises(Exception): @@ -139,21 +279,7 @@ class TestApiPrecisionCompare(unittest.TestCase): for filename in os.listdir(save_path): os.remove(os.path.join(save_path, filename)) os.rmdir(save_path) - - def test_ulp_standard(self): - self.ulp_standard.get_result() - self.assertEqual(self.ulp_standard.ulp_err_status, CompareConst.PASS) - - self.assertEqual(self.ulp_standard._get_ulp_status(torch.float32), CompareConst.PASS) - def test_benchmark_standard(self): - self.benchmark_standard.get_result() - self.assertEqual(self.benchmark_standard.final_result, CompareConst.PASS) - - column_list = self.benchmark_standard.to_column_value() - expect_column_list = [1, 'pass', 1, 'pass', 1, 'pass', 1, 'pass', 1, 'pass'] - self.assertEqual(column_list, expect_column_list) - def test_get_absolute_threshold_result_pass(self): row_npu = { ApiPrecisionCompareColumn.INF_NAN_ERROR_RATIO: '0', @@ -184,6 +310,203 @@ class TestApiPrecisionCompare(unittest.TestCase): self.assertEqual(result['abs_err_result'], CompareConst.PASS) self.assertEqual(result['absolute_threshold_result'], CompareConst.ERROR) + + def test_api_precision_compare(self): + # 准备测试目录和文件 + base_path = 'test_compare_tmp' + os.makedirs(base_path, exist_ok=True) + + # 创建测试用的CSV文件 + npu_csv = os.path.join(base_path, 'npu.csv') + gpu_csv = os.path.join(base_path, 'gpu.csv') + result_csv = os.path.join(base_path, 'result.csv') + details_csv = os.path.join(base_path, 'details.csv') + + # 将测试数据写入CSV文件 + df = pd.DataFrame(self.test_data) + df.to_csv(npu_csv, index=False) + df.to_csv(gpu_csv, index=False) + + try: + # 执行比较操作 + config = CompareConfig(npu_csv, gpu_csv, result_csv, details_csv) + 
api_precision_compare(config) + + # 验证结果文件是否生成 + self.assertTrue(os.path.exists(result_csv)) + self.assertTrue(os.path.exists(details_csv)) + + # 读取并验证结果 + result_df = pd.read_csv(result_csv) + self.assertFalse(result_df.empty) + + details_df = pd.read_csv(details_csv) + self.assertFalse(details_df.empty) + + finally: + # 清理测试文件 + for file_path in [npu_csv, gpu_csv, result_csv, details_csv]: + if os.path.exists(file_path): + os.remove(file_path) + if os.path.exists(base_path): + os.rmdir(base_path) + + def test_online_api_precision_compare(self): + # 准备测试目录和文件 + base_path = 'test_online_compare_tmp' + os.makedirs(base_path, exist_ok=True) + + # 创建测试用的CSV文件 + npu_csv = os.path.join(base_path, 'npu.csv') + gpu_csv = os.path.join(base_path, 'gpu.csv') + result_csv = os.path.join(base_path, 'results_rank1.csv') + details_csv = os.path.join(base_path, 'details_rank1.csv') + + # 准备在线比较的配置 + online_config = MagicMock() + online_config.rank = 1 + online_config.result_csv_path = os.path.join(base_path, "results_rank*.csv") + online_config.details_csv_path = os.path.join(base_path, "details_rank*.csv") + + # 将测试数据写入CSV文件 + df = pd.DataFrame(self.test_data) + df.to_csv(npu_csv, index=False) + df.to_csv(gpu_csv, index=False) + + # 设置online_config的数据 + online_config.npu_data = pd.read_csv(npu_csv) + online_config.gpu_data = pd.read_csv(gpu_csv) + + try: + # 执行在线比较 + online_api_precision_compare(online_config) + + # 验证结果文件是否生成 + self.assertTrue(os.path.exists(result_csv)) + self.assertTrue(os.path.exists(details_csv)) + + # 读取并验证结果 + result_df = pd.read_csv(result_csv) + self.assertFalse(result_df.empty) + + details_df = pd.read_csv(details_csv) + self.assertFalse(details_df.empty) + + # 验证文件权限 + self.assertEqual(os.stat(result_csv).st_mode & 0o777, FileCheckConst.DATA_FILE_AUTHORITY) + self.assertEqual(os.stat(details_csv).st_mode & 0o777, FileCheckConst.DATA_FILE_AUTHORITY) + + finally: + # 清理测试文件 + for file_path in [npu_csv, gpu_csv, result_csv, details_csv]: + if 
os.path.exists(file_path): + os.remove(file_path) + if os.path.exists(base_path): + os.rmdir(base_path) + + def test_skip_due_to_empty_output(self): + self.row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] = ' ' + api_name = "abs" + result = get_api_status(self.row_npu, self.row_gpu, api_name, self.compare_column, self.registry) + self.assertEqual(result, CompareConst.SKIP) + + def test_thousandth_standard(self): + self.row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] = 'torch.float16' + api_name = "conv2d" + result = get_api_status(self.row_npu, self.row_gpu, api_name, self.compare_column, self.registry) + self.assertEqual(result, CompareConst.PASS) + + def test_binary_consistency(self): + self.row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] = 'torch.float16' + api_name = "abs" + result = get_api_status(self.row_npu, self.row_gpu, api_name, self.compare_column, self.registry) + self.assertEqual(result, CompareConst.PASS) + + def test_absolute_threshold(self): + self.row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] = 'torch.float16' + api_name = "mul" + result = get_api_status(self.row_npu, self.row_gpu, api_name, self.compare_column, self.registry) + self.assertEqual(result, CompareConst.PASS) + + def test_ulp_standard(self): + self.row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] = "torch.float16" + api_name = "matmul" + result = get_api_status(self.row_npu, self.row_gpu, api_name, self.compare_column, self.registry) + self.assertEqual(result, CompareConst.PASS) + + def test_benchmark_compare(self): + self.row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] = "torch.float16" + api_name = "mean" + result = get_api_status(self.row_npu, self.row_gpu, api_name, self.compare_column, self.registry) + self.assertEqual(result, CompareConst.PASS) + + def test_record_binary_consistency_result_pass(self): + self.row_npu[ApiPrecisionCompareColumn.ERROR_RATE] = "0.0" + self.compare_column.ERROR = CompareConst.PASS + + result = record_binary_consistency_result(self.input_data) + + 
self.assertEqual(result, CompareConst.PASS) + self.assertEqual(self.compare_column.compare_algorithm, "二进制一致法") + + def test_record_binary_consistency_result_error(self): + self.row_npu[ApiPrecisionCompareColumn.ERROR_RATE] = "2.0" + self.compare_column.ERROR = CompareConst.ERROR + + input_data = PrecisionCompareInput(self.row_npu, self.row_gpu, self.dtype, self.compare_column) + result = record_binary_consistency_result(input_data) + + self.assertEqual(result, CompareConst.ERROR) + self.assertIn("ERROR: 二进制一致错误率超过阈值\n", self.compare_column.compare_message) + + def test_record_absolute_threshold_result(self): + row_npu = { + ApiPrecisionCompareColumn.INF_NAN_ERROR_RATIO: "0.0", + ApiPrecisionCompareColumn.REL_ERR_RATIO: "0.0", + ApiPrecisionCompareColumn.ABS_ERR_RATIO: "0.0" + } + compare_column = MagicMock() + + input_data = PrecisionCompareInput(row_npu, self.row_gpu, self.dtype, compare_column) + result = record_absolute_threshold_result(input_data) + + self.assertEqual(result, CompareConst.PASS) + + def test_record_benchmark_compare_result(self): + bs = MagicMock() + bs.get_result = MagicMock() + bs.small_value_err_status = CompareConst.PASS + bs.final_result = CompareConst.PASS + compare_column = MagicMock() + + result = record_benchmark_compare_result(self.input_data) + + self.assertEqual(result, CompareConst.PASS) + + def test_record_ulp_compare_result(self): + us = MagicMock() + us.get_result = MagicMock() + us.ulp_err_status = CompareConst.PASS + compare_column = MagicMock() + + result = record_ulp_compare_result(self.input_data) + + self.assertEqual(result, CompareConst.PASS) + + def test_record_thousandth_threshold_result(self): + self.row_npu[ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH] = 0.999 + self.compare_column.rel_err_thousandth = 0.999 + self.compare_column.rel_err_thousandth_status = CompareConst.PASS + + input_data = PrecisionCompareInput(self.row_npu, self.row_gpu, self.dtype, self.compare_column) + result = 
record_thousandth_threshold_result(input_data)
+
+        self.assertEqual(result, CompareConst.PASS)
+        self.assertEqual(self.compare_column.compare_message, "")
+
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.api_precision_compare.write_detail_csv')
+    def test_analyse_csv(self, mock_write_detail_csv):
+        analyse_csv(self.npu_data_2, self.gpu_data_2, self.config)
+        mock_write_detail_csv.assert_called()


 if __name__ == '__main__':
     unittest.main()
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py
index 825806baf59a918b3963023ed3a1c41657e4af41..fc05f9d469a811b41feda4baf8e05a61c63b7e6d 100644
--- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py
@@ -3,11 +3,14 @@ import os
 import shutil
 import time
 import unittest
+from unittest.mock import patch

 import numpy as np
 import torch.nn.functional

+from msprobe.core.common.utils import CompareException
 from msprobe.pytorch.api_accuracy_checker.compare.compare import Comparator
+from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import DETAIL_TEST_ROWS
 from msprobe.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn
 from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import UtDataInfo

@@ -30,11 +33,41 @@ class TestCompare(unittest.TestCase):
         if os.path.exists(self.output_path):
             shutil.rmtree(self.output_path)

-    def test_compare_dropout(self):
-        dummy_input = torch.randn(100, 100)
-        bench_out = torch.nn.functional.dropout2d(dummy_input, 0.3)
-        npu_out = torch.nn.functional.dropout2d(dummy_input, 0.3)
-        self.assertTrue(self.compare._compare_dropout(bench_out, npu_out))
+    def test_compare_dropout_pass_large_tensor(self):
+        # Arrange
+        bench_output = torch.tensor([0, 0, 1, 1, 0, 1] * 20)  # 120 elements, 60 zeros
+        device_output = torch.tensor([0, 0, 1, 1, 0, 1] * 20)  # Same as bench_output
+        # Act
+        result = self.compare._compare_dropout(bench_output, device_output)
+        # Assert
+        self.assertEqual(result, ('pass', 1))
+
+    def test_compare_dropout_error_large_tensor(self):
+        # Arrange
+        bench_output = torch.tensor([0, 0, 1, 1, 0, 1] * 20)  # 120 elements, 60 zeros
+        device_output = torch.tensor([1, 1, 1, 1, 1, 1] * 20)  # 120 elements, 0 zeros
+        # Act
+        result = self.compare._compare_dropout(bench_output, device_output)
+        # Assert
+        self.assertEqual(result, ('error', 0))
+
+    def test_compare_dropout_pass_small_tensor(self):
+        # Arrange
+        bench_output = torch.tensor([0, 1, 0])  # 3 elements
+        device_output = torch.tensor([0, 1, 0])  # Same as bench_output
+        # Act
+        result = self.compare._compare_dropout(bench_output, device_output)
+        # Assert
+        self.assertEqual(result, ('pass', 1))
+
+    def test_compare_dropout_large_tensor_boundary(self):
+        # Arrange
+        bench_output = torch.tensor([0, 0, 1] * 33 + [0])  # 100 elements, 67 zeros
+        device_output = torch.tensor([0, 1, 1] * 33 + [0])  # 100 elements, 34 zeros
+        # Act
+        result = self.compare._compare_dropout(bench_output, device_output)
+        # Assert
+        self.assertEqual(result, ('error', 0))

     def test_compare_core_wrapper(self):
         dummy_input = torch.randn(100, 100)
@@ -68,6 +101,56 @@ class TestCompare(unittest.TestCase):
                                         ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', ' ', ' ', ' ', 'pass',
                                         '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n']])

+    def test_compare_core_different(self):
+        res = self.compare._compare_core('api', 1, 'str')
+
+        self.assertEqual(res[0], 'error')
+        self.assertEqual(res[2], 'bench and npu output type is different.')
+
+    def test_compare_core_with_dict(self):
+        output_dict = {
+            'key1': 1,
+            'key2': 2
+        }
+        res = self.compare._compare_core('api', output_dict, output_dict)
+
+        self.assertEqual(res[0], 'error')
+        self.assertEqual(res[2], "Unexpected output type in compare_core: ")
+
+    def test_compare_core_with_dict_different(self):
+        bench_dict = {
+            'key1': 1,
+            'key2': 2
+        }
+        device_dict = {
+            'key3': 3,
+            'key4': 4
+        }
+        res = self.compare._compare_core('api', bench_dict, device_dict)
+
+        self.assertEqual(res[0], 'error')
+        self.assertEqual(res[2], 'bench and npu output dict keys are different.')
+
+    def test_compare_core_with_tensor(self):
+        tensor = torch.tensor([1, 2, 3])
+        res = self.compare._compare_core('api', tensor, tensor)
+
+        self.assertEqual(res[0], 'pass')
+        self.assertEqual(res[2], 'Compare algorithm is not supported for int64 data. Only judged by Error Rate.\n')
+
+    def test_compare_core_with_buildin(self):
+        integer = 1
+        res = self.compare._compare_core('api', integer, integer)
+
+        self.assertEqual(res[0], 'pass')
+        self.assertEqual(res[2], '')
+
+    def test_compare_core_with_none(self):
+        res = self.compare._compare_core('api', None, None)
+
+        self.assertEqual(res[0], 'SKIP')
+        self.assertEqual(res[2], 'Bench output is None, skip this test.')
+
     def test_compare_output(self):
         bench_out, npu_out = torch.randn(100, 100), torch.randn(100, 100)
         bench_grad, npu_grad = [torch.randn(100, 100)], [torch.randn(100, 100)]
@@ -84,6 +167,33 @@ class TestCompare(unittest.TestCase):
         self.assertTrue(is_fwd_success)
         self.assertTrue(is_bwd_success)

+    def test_compare_output_error(self):
+        bench_out, npu_out = torch.randn(100, 100), torch.randn(100, 100)
+        bench_grad, npu_grad = [torch.randn(100, 100)], [torch.randn(100, 100)]
+        api_name = 'Functional.conv2d'
+        data_info = UtDataInfo(bench_grad, npu_grad, bench_out, npu_out, None, None, None)
+
+        with self.assertRaises(ValueError):
+            self.compare.compare_output(api_name, data_info)
+
+    def test_compare_output_with_dropout(self):
+        bench_out, npu_out = torch.randn(100, 100), torch.randn(100, 100)
+        bench_grad, npu_grad = [torch.randn(100, 100)], [torch.randn(100, 100)]
+        api_name = 'Functional.dropout.0'
+        data_info = UtDataInfo(bench_grad, npu_grad, bench_out, npu_out, None, None, None)
+        is_fwd_success, is_bwd_success = self.compare.compare_output(api_name, data_info)
+        self.assertTrue(is_fwd_success)
+        self.assertTrue(is_bwd_success)
+
+    def test_compare_output_with_backward_message(self):
+        bench_out, npu_out = torch.randn(100, 100), torch.randn(100, 100)
+        bench_grad, npu_grad = [torch.randn(100, 100)], [torch.randn(100, 100)]
+        api_name = 'Functional.conv2d.0'
+        data_info = UtDataInfo(bench_grad, npu_grad, bench_out, npu_out, None, "test_message", None)
+        is_fwd_success, is_bwd_success = self.compare.compare_output(api_name, data_info)
+        self.assertFalse(is_fwd_success)
+        self.assertFalse(is_bwd_success)
+
     def test_record_results(self):
         args = ('Functional.conv2d.0', False, 'N/A',
                 [['torch.float64', 'torch.float32', (32, 64, 112, 112), 1.0, 0.012798667686, 'N/A', 0.81631212311,
                   0.159979121213, 'N/A',
@@ -103,23 +213,238 @@ class TestCompare(unittest.TestCase):
                                                                              compare_column)
         self.assertEqual(status, "pass")

-    def test_compare_bool_tensor(self):
-        cpu_output = np.array([True, False, True])
-        npu_output = np.array([True, False, True])
-        self.assertEqual(self.compare._compare_bool_tensor(cpu_output, npu_output), (0.0, 'pass', ''))
+    def test_compare_torch_tensor_bf16(self):
+        cpu_output = torch.Tensor([1.0, 2.0, 3.0])
+        npu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.bfloat16)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_torch_tensor("api", cpu_output, npu_output,
+                                                                            compare_column)
+        self.assertEqual(status, "pass")
+
+    def test_compare_torch_tensor_different_shape(self):
+        cpu_output = torch.Tensor([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
+        npu_output = torch.Tensor([1.0, 2.0, 3.0])
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_torch_tensor("api", cpu_output, npu_output,
+                                                                            compare_column)
+        self.assertEqual(status, "error")
+
+    def test_compare_torch_tensor_different_dtype(self):
+        cpu_output = torch.Tensor([True, True, False])
+        npu_output = torch.Tensor([1.0, 2.0, 3.0])
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_torch_tensor("api", cpu_output, npu_output,
+                                                                            compare_column)
+        self.assertEqual(status, "error")
+
+    def test_compare_torch_tensor_special_dtype(self):
+        cpu_output = torch.Tensor([True, True, False])
+        npu_output = torch.Tensor([True, True, False])
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_torch_tensor("api", cpu_output, npu_output,
+                                                                            compare_column)
+        self.assertEqual(status, "pass")

-    def test_compare_builtin_type(self):
+    def test_compare_builtin_type_pass_with_special_types(self):
         compare_column = CompareColumn()
         bench_out = 1
         npu_out = 1
         status, compare_result, message = self.compare._compare_builtin_type(bench_out, npu_out, compare_column)
         self.assertEqual((status, compare_result.error_rate, message), ('pass', 0, ''))

+    def test_compare_builtin_type_pass_with_none_special_types(self):
+        compare_column = CompareColumn()
+        bench_out = np.array([1])
+        npu_out = np.array([1])
+        status, compare_result, message = self.compare._compare_builtin_type(bench_out, npu_out, compare_column)
+        self.assertEqual((status, compare_result.error_rate, message), ('pass', ' ', ''))
+
+    def test_compare_builtin_type_error(self):
+        compare_column = CompareColumn()
+        bench_out = 1
+        npu_out = 2
+        status, compare_result, message = self.compare._compare_builtin_type(bench_out, npu_out, compare_column)
+        self.assertEqual((status, compare_result.error_rate, message), ('error', ' ', ''))
+
     def test_compare_float_tensor(self):
         cpu_output = torch.Tensor([1.0, 2.0, 3.0])
         npu_output = torch.Tensor([1.0, 2.0, 3.0])
         compare_column = CompareColumn()
-        status, compare_column, message = self.compare._compare_float_tensor("api", cpu_output.numpy(),
+        status, compare_column, message = self.compare._compare_float_tensor("conv2d", cpu_output.numpy(),
                                                                             npu_output.numpy(),
                                                                             compare_column, npu_output.dtype)
         self.assertEqual(status, "pass")
+
+    def test_compare_float_tensor_binary(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
+        npu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("abs", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+        self.assertEqual(status, "pass")
+
+    def test_compare_float_tensor_absolute(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        npu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("mul", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+
+        self.assertEqual(status, "pass")
+
+    def test_compare_float_tensor_ulp(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        npu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("__matmul__", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+
+        self.assertEqual(status, "pass")
+
+    def test_compare_float_tensor_error_16(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
+        npu_output = torch.tensor([1.1, 2.1, 3.1], dtype=torch.float16)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("__matmul__", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+
+        self.assertEqual(status, "error")
+
+    def test_compare_float_tensor_pass_16(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
+        npu_output = torch.tensor([1.0001, 2.0001, 3.0001], dtype=torch.float16)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("__matmul__", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+
+        self.assertEqual(status, "pass")
+
+    def test_compare_float_tensor_warn_16(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
+        npu_output = torch.tensor([1.01, 2.01, 3.01], dtype=torch.float16)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("__matmul__", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+
+        self.assertEqual(status, "Warning")
+
+    def test_compare_float_tensor_error_32(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        npu_output = torch.tensor([1.01, 2.01, 3.01], dtype=torch.float32)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("__matmul__", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+
+        self.assertEqual(status, "error")
+
+    def test_compare_float_tensor_pass_32(self):
+        cpu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        npu_output = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
+        compare_column = CompareColumn()
+        status, compare_column, message = self.compare._compare_float_tensor("__matmul__", cpu_output.numpy(),
+                                                                            npu_output.numpy(),
+                                                                            compare_column, npu_output.dtype)
+
+        self.assertEqual(status, "pass")
+
+    def test_get_run_ut_detail_success(self):
+        # Arrange
+        test_result = [
+            "test_subject",  # subject_prefix
+            None,  # Placeholder for other indices
+            None,
+            [[0.123456, 1.234567], [2.345678]],  # fwd_result
+            [[3.456789], [4.567890]]  # bwd_result
+        ]
+
+        # Act
+        result = self.compare._get_run_ut_detail(test_result)
+
+        # Assert
+        expected_result = [
+            ["test_subject.forward.output.0", "0.12345600000000", "1.23456700000000"],
+            ["test_subject.forward.output.1", "2.34567800000000"],
+            ["test_subject.backward.output.0", "3.45678900000000"],
+            ["test_subject.backward.output.1", "4.56789000000000"],
+        ]
+        self.assertEqual(result, expected_result)
+
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.logger')
+    def test_get_run_ut_detail_index_error(self, mock_logger):
+        # Arrange
+        test_result = [
+            "test_subject",  # subject_prefix
+            None  # Placeholder for other indices
+        ]
+
+        # Act and Assert
+        with self.assertRaises(CompareException):
+            self.compare._get_run_ut_detail(test_result)
+        mock_logger.error.assert_called_once_with("List index out of bounds when writing detail CSV.")
+
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.write_csv')
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.os.path.exists')
+    def test_write_csv_title(self, mock_exists, mock_write_csv):
+        # Mock the behavior of os.path.exists
+        mock_exists.return_value = False
+
+        self.compare.save_path_list = ['result.csv', 'summary.csv']
+        self.compare.detail_save_path_list = ['detail_result.csv', 'detail_summary.csv']
+
+        # Act
+        self.compare.write_csv_title()
+
+        # Assert
+        # Check that write_csv was called for the paths that do not exist
+        mock_write_csv.assert_any_call(
+            [[self.compare.COLUMN_API_NAME,
+              self.compare.COLUMN_FORWARD_SUCCESS,
+              self.compare.COLUMN_BACKWARD_SUCCESS,
+              "Message"]],
+            'summary.csv'
+        )
+        mock_write_csv.assert_any_call(DETAIL_TEST_ROWS, 'detail_result.csv')
+
+        # Ensure write_csv was not called for 'result.csv' and 'detail_summary.csv'
+        self.assertEqual(mock_write_csv.call_count, 4)
+
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.write_csv')
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.logger')
+    def test_write_summary_csv(self, mock_logger, mock_write_csv):
+        self.compare.stack_info = {'test1': ['info1', 'info2']}
+        self.compare.save_path_list = ['summary_result.csv', 'summary_detail.csv']
+        test_result = [
+            'test1',  # name
+            'SKIP',  # status
+            None,  # Placeholder
+            [[{'message': 'Skipped test'}]],  # Group message
+            0  # rank
+        ]
+
+        # Act
+        self.compare.write_summary_csv(test_result)
+
+        # Assert
+        mock_write_csv.assert_called_once()
+
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.write_csv')
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.Comparator._get_run_ut_detail')
+    @patch('msprobe.pytorch.api_accuracy_checker.compare.compare.Comparator.get_path_from_rank')
+    def test_write_detail_csv(self, mock_get_path, mock_get_detail, mock_write_csv):
+        test_result = ['test_result']
+        self.compare.write_detail_csv(test_result)
+
+        mock_get_detail.assert_called_once()
+        mock_get_path.assert_called_once()
+        mock_write_csv.assert_called_once()
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_absolute_threshold.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_absolute_threshold.py
new file mode 100644
index 0000000000000000000000000000000000000000..10cd9ef6e8e640b48acecf64fbb7ca98a07090b7
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_absolute_threshold.py
@@ -0,0 +1,77 @@
+import unittest
+import numpy as np
+import torch
+from msprobe.pytorch.api_accuracy_checker.precision_standard.absolute_threshold import AbsolutethdCompare
+
+
+class InputData:
+    """Test data class"""
+    def __init__(self, bench_output, device_output, compare_column, dtype):
+        self.bench_output = bench_output
+        self.device_output = device_output
+        self.dtype = dtype
+        self.compare_column = compare_column
+
+
+class TestAbsolutethdCompare(unittest.TestCase):
+
+    def setUp(self):
+        # Set up test data
+        self.compare_column = {}
+        self.input_data = InputData(
+            bench_output=np.array([1.0, 2.0, 3.0, float('inf'), float('nan')]),
+            device_output=np.array([1.1, 1.9, 3.1, float('inf'), 1.0]),
+            compare_column={},
+            dtype=torch.float32
+        )
+        self.compare = AbsolutethdCompare(self.input_data)
+
+    def test_get_rtol(self):
+        # Test the _get_rtol method
+        rtol = self.compare._get_rtol()
+        self.assertEqual(rtol, 2**-20)
+
+    def test_get_rel_err(self):
+        # Test the _get_rel_err method
+        abs_err = np.abs(self.input_data.bench_output - self.input_data.device_output)
+        abs_bench_with_eps = np.abs(self.input_data.bench_output) + np.finfo(np.float32).eps
+        rel_err = self.compare._get_rel_err(abs_err, abs_bench_with_eps)
+        self.assertTrue(np.all(np.isfinite(rel_err[~np.isnan(rel_err)])))
+
+    def test_get_normal_value_mask(self):
+        # Test the _get_normal_value_mask method
+        self.compare._pre_compare()
+
+        # Create a sample small_value_mask
+        small_value_mask = np.array([True, False, False, False, False])
+
+        normal_mask = self.compare._get_normal_value_mask(self.compare.both_finite_mask, small_value_mask)
+
+        # Verify the return value is a boolean array
+        self.assertTrue(isinstance(normal_mask, np.ndarray))
+        self.assertEqual(normal_mask.dtype, bool)
+
+        # Verify normal_mask is the logical AND of both_finite_mask and the negation of small_value_mask
+        expected_mask = np.logical_and(self.compare.both_finite_mask,
+                                       np.logical_not(small_value_mask))
+        np.testing.assert_array_equal(normal_mask, expected_mask)
+
+    def test_pre_compare(self):
+        # Test the _pre_compare method
+        self.compare._pre_compare()
+        self.assertIsNotNone(self.compare.abs_bench)
+        self.assertIsNotNone(self.compare.both_finite_mask)
+        self.assertIsNotNone(self.compare.small_value_mask)
+        self.assertIsNotNone(self.compare.normal_value_mask)
+
+    def test_compute_metrics(self):
+        # Test the _compute_metrics method
+        self.compare._pre_compare()
+        metrics = self.compare._compute_metrics()
+        self.assertIn("inf_nan_error_ratio", metrics)
+        self.assertIn("rel_err_ratio", metrics)
+        self.assertIn("abs_err_ratio", metrics)
+
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_base_standard.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_base_standard.py
new file mode 100644
index
0000000000000000000000000000000000000000..129da1585b24d5c71efe07da9e17b623d8346cc1
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_base_standard.py
@@ -0,0 +1,117 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+import numpy as np
+import torch
+
+from msprobe.pytorch.api_accuracy_checker.precision_standard.base_standard import BaseCompare
+
+
+class MockInputData:
+    def __init__(self, bench_output, device_output, dtype):
+        self.bench_output = bench_output
+        self.device_output = device_output
+        self.compare_column = {}
+        self.dtype = dtype
+
+
+class TestCompare(BaseCompare):
+    """Test implementation of BaseCompare"""
+    def _pre_compare(self):
+        """Implement the abstract method"""
+        pass
+
+
+class TestBaseStandard(unittest.TestCase):
+    """Test base_standard.py"""
+    def setUp(self):
+        """Test environment setup"""
+        self.bench_output = np.array([1.0, 2.0, 3.0, float('inf'), float('nan')])
+        self.device_output = np.array([1.1, 2.1, 3.1, float('inf'), float('nan')])
+        self.input_data = MockInputData(self.bench_output, self.device_output, torch.float32)
+        self.compare = TestCompare(self.input_data)
+
+    def test_init(self):
+        """Test BaseCompare initialization"""
+        np.testing.assert_array_equal(self.compare.bench_output, self.input_data.bench_output)
+        np.testing.assert_array_equal(self.compare.device_output, self.input_data.device_output)
+        self.assertEqual(self.compare.dtype, self.input_data.dtype)
+
+    def test_stat_finite_and_infinite_mask(self):
+        """Test finite and infinite mask generation"""
+        both_finite_mask, inf_nan_mask = self.compare.stat_finite_and_infinite_mask()
+
+        # Check first three values are finite
+        self.assertTrue(np.array_equal(both_finite_mask[:3], [True, True, True]))
+        # Check last two values are infinite or NaN
+        self.assertTrue(np.array_equal(inf_nan_mask[3:], [True, True]))
+
+    def test_stat_abs_error(self):
+        """Test absolute error calculation"""
+        abs_err = self.compare.stat_abs_error()
+
+        expected_errors = np.array([0.1, 0.1, 0.1])
+        np.testing.assert_array_almost_equal(abs_err[:3], expected_errors)
+
+    def test_stat_small_value_mask(self):
+        """Test small value mask generation"""
+        abs_bench = np.array([1e-10, 1.0, 1e-8])
+        both_finite_mask = np.array([True, True, True])
+        small_value = 1e-9
+
+        result = TestCompare.stat_small_value_mask(abs_bench, both_finite_mask, small_value)
+        expected = np.array([True, False, False])
+        self.assertTrue(np.array_equal(result, expected))
+
+    def test_compare_workflow(self):
+        """Test compare workflow execution"""
+        self.compare.compare()
+        self.assertEqual(self.compare.compare_column, {})
+
+    def test_get_small_value_threshold(self):
+        """Test small value threshold retrieval"""
+        small_value, small_value_atol = self.compare.get_small_value_threshold()
+        self.assertIsInstance(small_value, (int, float))
+        self.assertIsInstance(small_value_atol, (int, float))
+
+    def test_stat_abs_bench_with_eps(self):
+        """Test absolute benchmark with epsilon calculation"""
+        abs_bench, abs_bench_with_eps = self.compare.stat_abs_bench_with_eps()
+
+        # Check finite values
+        self.assertTrue(np.array_equal(abs_bench[:3], np.abs(self.input_data.bench_output[:3])))
+        self.assertTrue(np.all(abs_bench_with_eps[:3] >= abs_bench[:3]))
+
+        # Add a test for very small values
+        small_bench_output =
np.array([1e-10, 1e-8, 1e-6])
+        small_input_data = MockInputData(small_bench_output, small_bench_output + 1e-10, torch.float32)
+        small_compare = TestCompare(small_input_data)
+
+        abs_bench_small, abs_bench_with_eps_small = small_compare.stat_abs_bench_with_eps()
+
+        # Verify the error-tolerance calculation for very small values
+        self.assertTrue(np.all(abs_bench_with_eps_small > 0))  # Ensure the tolerance is not 0
+        self.assertTrue(np.all(abs_bench_with_eps_small >= abs_bench_small))  # The tolerance is always >= the original value
+
+        # Verify the relative error is within a reasonable range
+        relative_tolerance = (abs_bench_with_eps_small - abs_bench_small) / abs_bench_small
+        self.assertTrue(np.all(relative_tolerance <= 1.0))  # The relative error should not be too large
+
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_benchmark_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_benchmark_compare.py
new file mode 100644
index 0000000000000000000000000000000000000000..90f8a65cef3eff316d0463689ec5b49181c19386
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_benchmark_compare.py
@@ -0,0 +1,104 @@
+import unittest
+import numpy as np
+import torch
+
+from msprobe.pytorch.api_accuracy_checker.precision_standard.benchmark_compare import BenchmarkCompare
+
+
+class InputData:
+    """Test data class"""
+    def __init__(self, bench_output, device_output, dtype, compare_column):
+        self.bench_output = bench_output
+        self.device_output = device_output
+        self.dtype = dtype
+        self.compare_column = compare_column
+
+
+class TestBenchmarkCompare(unittest.TestCase):
+    def setUp(self):
+        """Create basic test data"""
+        self.bench_output = np.array([1.0, 2.0, 3.0, float('inf'), float('nan')])
+        self.device_output = np.array([1.1, 2.1, 3.1, float('inf'), float('nan')])
+        self.compare_column = {}
+        self.dtype = torch.float32
+
+        self.input_data = InputData(
+            bench_output=self.bench_output,
+            device_output=self.device_output,
+            compare_column=self.compare_column,
+            dtype=self.dtype
+        )
+
+        self.compare = BenchmarkCompare(self.input_data)
+        self.compare._pre_compare()
+
+    def test_get_abs_err_greater_mask(self):
+        """Test the _get_abs_err_greater_mask function"""
+        # Test with a specific threshold
+        small_value_atol = 0.05
+        mask = self.compare._get_abs_err_greater_mask(small_value_atol)
+
+        # Verify mask type and shape
+        self.assertIsInstance(mask, np.ndarray)
+        self.assertEqual(mask.dtype, bool)
+        self.assertEqual(mask.shape, self.bench_output.shape)
+
+        # Verify the mask values are correct
+        expected_mask = np.array([True, True, True, False, False])  # The first three diffs exceed 0.05; the last two are inf/nan
+        np.testing.assert_array_equal(mask, expected_mask)
+
+    def test_compute_rel_err(self):
+        """Test the _compute_rel_err function"""
+        rel_err = self.compare._compute_rel_err()
+
+        # Verify the type and shape of the relative error
+        self.assertIsInstance(rel_err, np.ndarray)
+        self.assertEqual(rel_err.shape, self.bench_output.shape)
+
+        # Verify the relative error of the first three valid values
+        expected_rel_err = np.array([0.1, 0.05, 0.033333], dtype=np.float32)
+        np.testing.assert_array_almost_equal(rel_err[:3], expected_rel_err, decimal=5)
+
+    def test_pre_compare(self):
+        """Test the _pre_compare function"""
+        # Create a new compare object and run pre-processing
+        compare = BenchmarkCompare(self.input_data)
+        compare._pre_compare()
+
+        # Verify the attributes are set correctly after pre-processing
+        self.assertTrue(hasattr(compare, 'abs_bench'))
+        self.assertTrue(hasattr(compare, 'abs_bench_with_eps'))
+        self.assertTrue(hasattr(compare, 'both_finite_mask'))
+        self.assertTrue(hasattr(compare, 'inf_nan_mask'))
+        self.assertTrue(hasattr(compare, 'abs_err'))
+        self.assertTrue(hasattr(compare, 'small_value'))
+        self.assertTrue(hasattr(compare, 'small_value_atol'))
+        self.assertTrue(hasattr(compare, 'small_value_mask'))
+        self.assertTrue(hasattr(compare, 'rel_err'))
+        self.assertTrue(hasattr(compare, 'abs_err_greater_mask'))
+
+        # Verify the finite-value mask
+        expected_finite_mask = np.array([True, True, True, False, False])
+        np.testing.assert_array_equal(compare.both_finite_mask, expected_finite_mask)
+
+        # Verify the absolute error
+        expected_abs_err = np.abs(self.device_output - self.bench_output)
+        np.testing.assert_array_equal(compare.abs_err[:3], expected_abs_err[:3])
+
+    def test_compute_metrics(self):
+        """Test the _compute_metrics function"""
+        metrics = self.compare._compute_metrics()
+
+        # Verify the returned metrics dictionary
+        expected_metrics = {
+            "small_value_err_ratio",
+            "max_rel_error",
+            "mean_rel_error",
+            "rmse",
+            "eb"
+        }
+        self.assertEqual(set(metrics.keys()), expected_metrics)
+
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_binary_consistency.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_binary_consistency.py
new file mode 100644
index 0000000000000000000000000000000000000000..5421599f4ce517beea084a919751f74fd3b217f2
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_binary_consistency.py
@@ -0,0 +1,86 @@
+import unittest
+import numpy as np
+import torch
+
+from msprobe.pytorch.api_accuracy_checker.precision_standard.binary_consistency import BinaryCompare
+
+
+class InputData:
+    """Test data class"""
+    def __init__(self, bench_output, device_output, compare_column, dtype):
+        self.bench_output = bench_output
+        self.device_output = device_output
+        self.dtype = dtype
+        self.compare_column = compare_column
+
+
+class TestBinaryCompare(unittest.TestCase):
+    def setUp(self):
+        # Create actual test data
+        self.bench_output = np.array([True, False, True, False])
+        self.device_output = np.array([True, False, False, False])
+        self.compare_column = {}
+        self.dtype = torch.bool
+
+        self.input_data = InputData(
+            bench_output=self.bench_output,
+            device_output=self.device_output,
+            compare_column=self.compare_column,
+            dtype=self.dtype
+        )
+
+    def test_binary_compare(self):
+        """Test basic binary comparison"""
+        binary_compare = BinaryCompare(self.input_data)
+        metrics = binary_compare._compute_metrics()
+
+        # In this example, 1 of the 4 elements mismatches, so the error rate should be 0.25
+        self.assertAlmostEqual(metrics['error_rate'], 0.25)
+
+    def test_binary_compare_all_match(self):
+        """Test the fully matching case"""
+        input_data = InputData(
+            bench_output=np.array([True, False, True]),
+            device_output=np.array([True, False, True]),
+            compare_column=self.compare_column,
+            dtype=self.dtype
+        )
+
+        binary_compare = BinaryCompare(input_data)
+        metrics = binary_compare._compute_metrics()
+
+        self.assertAlmostEqual(metrics['error_rate'], 0.0)
+
+    def test_binary_compare_no_match(self):
+        """Test the fully mismatching case"""
+        input_data = InputData(
+            bench_output=np.array([True, True, True]),
+            device_output=np.array([False, False, False]),
+            compare_column=self.compare_column,
+            dtype=self.dtype
+        )
+
+        binary_compare = BinaryCompare(input_data)
+        metrics = binary_compare._compute_metrics()
+
+        self.assertAlmostEqual(metrics['error_rate'], 1.0)
+
+    def test_binary_compare_multidimensional(self):
+        """Test the multidimensional array case"""
+        input_data = InputData(
+            bench_output=np.array([[True, False], [True, True]]),
+            device_output=np.array([[True, False], [False, True]]),
+            compare_column=self.compare_column,
+            dtype=self.dtype
+        )
+
+        binary_compare = BinaryCompare(input_data)
+        metrics = binary_compare._compute_metrics()
+
+        # 1 of the 4 elements mismatches; the error rate should be 0.25
+        self.assertAlmostEqual(metrics['error_rate'], 0.25)
+
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_standard_config.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_standard_config.py
new file mode 100644
index 0000000000000000000000000000000000000000..30cebf81cfee7bb4ffc330abdb85fa2ead9999bb
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_standard_config.py
@@ -0,0 +1,47 @@
+import unittest
+import torch
+from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_config import StandardConfig
+from msprobe.core.common.const import CompareConst
+
+class
TestStandardConfig(unittest.TestCase):
+    def test_get_small_value(self):
+        # Test the defined dtypes
+        self.assertEqual(StandardConfig.get_small_value(torch.float16, CompareConst.BENCHMARK), 2**-10)
+        self.assertEqual(StandardConfig.get_small_value(torch.bfloat16, CompareConst.BENCHMARK), 2**-10)
+        self.assertEqual(StandardConfig.get_small_value(torch.float32, CompareConst.BENCHMARK), 2**-20)
+
+        # Test an undefined dtype (should return the default value)
+        self.assertEqual(StandardConfig.get_small_value(torch.int32, CompareConst.BENCHMARK), 2**-20)
+
+        self.assertEqual(StandardConfig.get_small_value(torch.float16, CompareConst.ACCUMULATIVE_ERROR_COMPARE), 1)
+
+    def test_get_small_value_atol(self):
+        standard = 'absolute_threshold'
+        # Test the defined dtypes
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.float16, standard), 2**-16)
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.bfloat16, standard), 1e-16)
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.float32, standard), 2**-30)
+
+        # Test an undefined dtype (should return the default value)
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.int32, standard), 2**-30)
+
+        standard = 'benchmark'
+        # Test the defined dtypes
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.float16, standard), 1e-16)
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.bfloat16, standard), 1e-16)
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.float32, standard), 2**-30)
+
+        # Test an undefined dtype (should return the default value)
+        self.assertEqual(StandardConfig.get_small_value_atol(torch.int32, standard), 2**-30)
+
+    def test_get_rtol(self):
+        # Test the defined dtypes
+        self.assertEqual(StandardConfig.get_rtol(torch.float16), 2**-10)
+        self.assertEqual(StandardConfig.get_rtol(torch.bfloat16), 2**-8)
+        self.assertEqual(StandardConfig.get_rtol(torch.float32), 2**-20)
+
+        # Test an undefined dtype (should return the default value)
+        self.assertEqual(StandardConfig.get_rtol(torch.int32), 2**-20)
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file
diff --git
a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_standard_register.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_standard_register.py
new file mode 100644
index 0000000000000000000000000000000000000000..0a776348933361e0934d6bef2dded9dca124bc02
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_standard_register.py
@@ -0,0 +1,91 @@
+import unittest
+from unittest.mock import Mock
+import numpy as np
+
+from msprobe.pytorch.api_accuracy_checker.precision_standard.standard_register import StandardRegistry
+from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import (
+    absolute_standard_api, binary_standard_api, BINARY_COMPARE_UNSUPPORT_LIST
+)
+
+class TestStandardRegistry(unittest.TestCase):
+    def setUp(self):
+        self.registry = StandardRegistry()
+
+    def test_register_valid_function(self):
+        """Test registering a comparison function normally"""
+        mock_func = Mock()
+        self.registry.register("test_standard", mock_func)
+        self.assertEqual(self.registry.comparison_functions["test_standard"], mock_func)
+
+    def test_register_invalid_function(self):
+        """Test that registering a non-callable object raises an exception"""
+        with self.assertRaises(ValueError):
+            self.registry.register("test_standard", "not_callable")
+
+    def test_get_comparison_function_binary_consistency(self):
+        """Test retrieving the binary consistency comparison function"""
+        mock_func = Mock()
+        self.registry.register("binary_consistency", mock_func)
+        # Use a dtype that supports binary comparison
+        result = self.registry.get_comparison_function("abs", dtype='torch.int8')
+        self.assertEqual(result, mock_func)
+
+    def test_get_comparison_function_absolute_threshold(self):
+        """Test retrieving the absolute threshold comparison function"""
+        mock_func = Mock()
+        self.registry.register("absolute_threshold", mock_func)
+        # Assumes 'mul' is in the absolute_standard_api list
+        result = self.registry.get_comparison_function("mul")
+        self.assertEqual(result, mock_func)
+
+    def test_get_comparison_function_ulp(self):
+        """Test retrieving the ULP comparison function"""
+        mock_func = Mock()
+        self.registry.register("ulp_compare", mock_func)
+        result = self.registry.get_comparison_function("matmul")
+        self.assertEqual(result, mock_func)
+
+    def test_get_comparison_function_thousandth(self):
+        """Test retrieving the thousandth-threshold comparison function"""
+        mock_func = Mock()
+        self.registry.register("thousandth_threshold", mock_func)
+        result = self.registry.get_comparison_function("conv2d")
+        self.assertEqual(result, mock_func)
+
+    def test_get_comparison_function_benchmark(self):
+        """Test retrieving the default benchmark comparison function"""
+        mock_func = Mock()
+        self.registry.register("benchmark", mock_func)
+        result = self.registry.get_comparison_function("npu_fusion_attention")
+        self.assertEqual(result, mock_func)
+
+    def test_get_standard_category_binary(self):
+        """Test getting the binary consistency standard category"""
+        dtype = 'torch.int8'
+        self.assertNotIn(dtype, BINARY_COMPARE_UNSUPPORT_LIST)
+        category = self.registry._get_standard_category("abs", dtype)
+        self.assertEqual(category, "binary_consistency")
+
+    def test_get_standard_category_absolute(self):
+        """Test getting the absolute threshold standard category"""
+        category = self.registry._get_standard_category("mul")
+        self.assertEqual(category, "absolute_threshold")
+
+    def test_get_standard_category_default(self):
+        """Test getting the default benchmark standard category"""
+        category = self.registry._get_standard_category("unknown_api")
+        self.assertEqual(category, "benchmark")
+
+    def test_get_standard_category_ulp(self):
+        """Test getting the ULP standard category"""
+        category = self.registry._get_standard_category("matmul")
+        self.assertEqual(category, "ulp_compare")
+
+    def test_get_standard_category_thousandth(self):
+        """Test getting the thousandth-threshold standard category"""
+        category = self.registry._get_standard_category("conv2d")
+        self.assertEqual(category, "thousandth_threshold")
+
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_thousandth_standard.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_thousandth_standard.py
new file mode 100644
index
0000000000000000000000000000000000000000..1dc28c96eb0606831df57c7d563926128eed13cb --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_thousandth_standard.py @@ -0,0 +1,63 @@ +import unittest +import numpy as np +from unittest.mock import Mock + +from msprobe.pytorch.api_accuracy_checker.precision_standard.thousandth_standard import ThousandthStdCompare +from msprobe.core.common.const import CompareConst + +class TestThousandthStdCompare(unittest.TestCase): + def setUp(self): + # 创建模拟的input_data对象 + self.mock_input = Mock() + + def test_initialization(self): + """测试ThousandthStdCompare类的初始化""" + # 设置模拟数据 + self.mock_input.rel_err_orign = np.array([0.0001, 0.002, 0.0005]) + self.mock_input.compare_column = Mock() + + # 创建实例 + compare = ThousandthStdCompare(self.mock_input) + + # 验证属性是否正确设置 + np.testing.assert_array_equal(compare.rel_err_orign, self.mock_input.rel_err_orign) + self.assertEqual(compare.compare_column, self.mock_input.compare_column) + + def test_compute_metrics_all_within_threshold(self): + """测试所有值都在阈值内的情况""" + # 设置模拟数据 - 所有值都小于阈值(0.001) + self.mock_input.rel_err_orign = np.array([0.0001, 0.0005, 0.0008]) + compare = ThousandthStdCompare(self.mock_input) + + # 计算指标 + result = compare._compute_metrics() + + # 验证结果 + self.assertEqual(result['rel_err_thousandth'], 1.0) + + def test_compute_metrics_mixed_values(self): + """测试混合值的情况""" + # 设置模拟数据 - 部分值超过阈值 + self.mock_input.rel_err_orign = np.array([0.0005, 0.002, 0.003, 0.0008]) + compare = ThousandthStdCompare(self.mock_input) + + # 计算指标 + result = compare._compute_metrics() + + # 验证结果 - 2个值在阈值内,2个值超过阈值 + self.assertEqual(result['rel_err_thousandth'], 0.5) + + def test_compute_metrics_all_exceed_threshold(self): + """测试所有值都超过阈值的情况""" + # 设置模拟数据 - 所有值都大于阈值 + self.mock_input.rel_err_orign = np.array([0.002, 0.003, 0.005]) + compare = ThousandthStdCompare(self.mock_input) + + # 计算指标 + result = compare._compute_metrics() + + # 验证结果 + 
self.assertEqual(result['rel_err_thousandth'], 0.0) + +if __name__ == '__main__': + unittest.main() \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_ulp_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_ulp_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..720e7c6b84e7b3630db0294dea54beec4edc7d17 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/precision_standard/test_ulp_compare.py @@ -0,0 +1,113 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+import unittest
+import numpy as np
+import torch
+
+from msprobe.pytorch.api_accuracy_checker.precision_standard.ulp_compare import UlpCompare
+from msprobe.core.common.const import CompareConst
+
+
+class InputData:
+    """Test data class"""
+    def __init__(self):
+        self.bench_output = None
+        self.device_output = None
+        self.dtype = None
+        self.compare_column = None
+
+
+class TestUlpCompare(unittest.TestCase):
+    """Unit tests for the UlpCompare class"""
+    def setUp(self):
+        """Set-up before each test"""
+        self.input_data = InputData()
+        self.input_data.bench_output = np.array([1.0, 2.0, 3.0], dtype=np.float32)
+        self.input_data.device_output = np.array([1.0, 2.0001, 3.0], dtype=np.float32)
+        self.input_data.dtype = torch.float32
+        self.input_data.compare_column = ['output']
+        self.ulp_compare = UlpCompare(self.input_data)
+
+    def test_init(self):
+        """Test initialization"""
+        self.assertIsInstance(self.ulp_compare, UlpCompare)
+        self.assertEqual(self.ulp_compare.dtype, torch.float32)
+        np.testing.assert_array_equal(self.ulp_compare.bench_output, self.input_data.bench_output)
+        np.testing.assert_array_equal(self.ulp_compare.device_output, self.input_data.device_output)
+
+    def test_stat_max_ulp_err(self):
+        """Test max ULP error computation"""
+        test_ulp_err = np.array([0, 2, 5], dtype=np.float32)
+        max_err = self.ulp_compare._stat_max_ulp_err(test_ulp_err)
+        self.assertEqual(max_err, 5)
+
+    def test_stat_mean_ulp_err(self):
+        """Test mean ULP error computation"""
+        test_ulp_err = np.array([1, 2, 3], dtype=np.float32)
+        mean_err = self.ulp_compare._stat_mean_ulp_err(test_ulp_err)
+        self.assertEqual(mean_err, 2)
+
+    def test_stat_ulp_error_proportion_float32(self):
+        """Test ULP error proportion computation for float32"""
+        test_ulp_err = np.array([
+            CompareConst.ULP_FLOAT32_THRESHOLD - 1,  # below threshold
+            CompareConst.ULP_FLOAT32_THRESHOLD + 1,  # above threshold
+            CompareConst.ULP_FLOAT32_THRESHOLD - 1  # below threshold
+        ], dtype=np.float32)
+        proportion = self.ulp_compare._stat_ulp_error_proportion(test_ulp_err)
+        self.assertAlmostEqual(proportion, 1/3)
+
+    def test_stat_ulp_error_proportion_float16(self):
+        """Test ULP error proportion computation for float16"""
+        self.ulp_compare.dtype = torch.float16
+        test_ulp_err = np.array([
+            CompareConst.ULP_FLOAT16_THRESHOLD - 1,  # below threshold
+            CompareConst.ULP_FLOAT16_THRESHOLD + 1,  # above threshold
+            CompareConst.ULP_FLOAT16_THRESHOLD - 1  # below threshold
+        ], dtype=np.float32)
+        proportion = self.ulp_compare._stat_ulp_error_proportion(test_ulp_err)
+        self.assertAlmostEqual(proportion, 1/3)
+
+    def test_pre_compare(self):
+        """Test the pre-compare step"""
+        self.ulp_compare._pre_compare()
+        self.assertTrue(hasattr(self.ulp_compare, 'ulp_err'))
+        self.assertIsInstance(self.ulp_compare.ulp_err, np.ndarray)
+
+    def test_compute_metrics(self):
+        """Test the full metrics computation flow"""
+        self.ulp_compare._pre_compare()
+        metrics = self.ulp_compare._compute_metrics()
+
+        # Verify the returned dict contains all required keys
+        expected_keys = {'max_ulp_error', 'mean_ulp_error', 'ulp_error_proportion'}
+        self.assertEqual(set(metrics.keys()), expected_keys)
+
+        # Verify return value types
+        for value in metrics.values():
+            self.assertIsInstance(value, (np.float32, np.float64))
+
+        # Verify value ranges
+        self.assertGreaterEqual(metrics['max_ulp_error'], 0)
+        self.assertGreaterEqual(metrics['mean_ulp_error'], 0)
+        self.assertGreaterEqual(metrics['ulp_error_proportion'], 0)
+        self.assertLessEqual(metrics['ulp_error_proportion'], 1)
+
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_data_generate.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_data_generate.py
index f3c3c5d607a1c4f458d3113ce29c0f0141079eda..0a88476d600958b26eaf6ca20a9a70d35b4221cc 100644
--- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_data_generate.py
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_data_generate.py
@@ -5,6 +5,7 @@
 from unittest.mock import patch
 import copy
 import math
 import numpy as np
+import torch
 from msprobe.pytorch.api_accuracy_checker.run_ut.data_generate import *
 from msprobe.core.common.file_utils import get_json_contents, create_directory
@@ -43,7 +44,7 @@ class TestDataGenerateMethods(unittest.TestCase):
 
     def test_gen_api_params(self):
         api_info = copy.deepcopy(api_info_dict)
-        args_params, kwargs_params = gen_api_params(api_info, "conv2d", True, None, None)
+        args_params, kwargs_params, output_dtype = gen_api_params(api_info, "conv2d", True, None, None)
         max_diff = abs(args_params[0].max() - max_value)
         min_diff = abs(args_params[0].min() - min_value)
         self.assertEqual(len(args_params), 2)
@@ -53,6 +54,7 @@
         self.assertLessEqual(min_diff, 0.001)
         self.assertEqual(args_params[0].shape, torch.Size([2048, 2, 1, 256]))
         self.assertEqual(kwargs_params, {'dim': -1})
+        self.assertEqual(output_dtype, torch.float16)
 
     def test_gen_args(self):
         func_options = {}
@@ -149,6 +151,12 @@
         info = {'Min': 0, 'Max': 1, 'dtype': "torch.bool", 'shape': (1, 2)}
         data = gen_random_tensor(info, convert_type=None)
         self.assertEqual(data.dtype, torch.bool)
+
+    def test_gen_random_tensor_gen_cat(self):
+        info = {'Min': None, 'Max': None, 'dtype': "torch.float32", 'shape': (1, 0, 256)}
+        data = gen_random_tensor(info, None)
+        self.assertEqual(data.dtype, torch.float32)
+        self.assertEqual(data.shape, torch.Size([1, 0, 256]))
 
     def test_gen_random_tensor(self):
         data = gen_random_tensor(api_info_dict.get('input_args')[0], None)
@@ -174,6 +182,23 @@
         api_info = copy.deepcopy(api_info_dict)
         kwargs_params = gen_kwargs(api_info, None)
         self.assertEqual(kwargs_params, {'dim': -1})
+
+    def test_gen_kwargs_fa_special_sparse_mode(self):
+        api_info = {"input_kwargs": {"atten_mask": {"type": "torch.Tensor", "shape": [2048, 2048]},
+                                     "sparse_mode": {"type": "int", "value": 3}}}
+        api_name = "npu_fusion_attention"
+        kwargs_params = gen_kwargs(api_info, api_name, None, None)
+
+        # Verify each key of kwargs_params separately
+        self.assertIn('atten_mask', kwargs_params)
+        self.assertIn('sparse_mode', kwargs_params)
+
+        # Verify the properties of atten_mask
+        expected_mask = torch.triu(torch.ones([2048, 2048]), diagonal=1).to(torch.bool)
+        self.assertTrue(torch.equal(kwargs_params['atten_mask'], expected_mask))
+
+        # Verify the value of sparse_mode
+        self.assertEqual(kwargs_params['sparse_mode'], 3)
 
     def test_gen_kwargs_2(self):
         k_dict = {"dtype": {"type": "torch.dtype", "value": "torch.float16"}}
@@ -187,7 +212,6 @@
         convert_type = None
         real_data_path = None
         kwargs = gen_kwargs(api_info, api_name, convert_type, real_data_path)
-        print(kwargs['key'])
         self.assertEqual(kwargs["key"], [1.0])
 
     def test_gen_kwargs_none_kwargs(self):
@@ -348,6 +372,6 @@
         convert_type = None
         api_info = {"input_args": None, "input_kwargs": {}}
         with patch('msprobe.pytorch.common.log.logger.warning') as mock_logger:
-            result_args, result_kwargs = gen_api_params(api_info, api_name, need_grad, convert_type, real_data_path)
+            result_args, result_kwargs, _ = gen_api_params(api_info, api_name, need_grad, convert_type, real_data_path)
         self.assertEqual(result_args, [])
         mock_logger.assert_called_once_with(f'Warning: No args in {api_info} ')
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_multi_run_ut.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_multi_run_ut.py
index 8913c7160538e02932ec02251a7e03240715788e..1ad191a0d4e85715e6199367d1d305c10a728630 100644
--- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_multi_run_ut.py
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_multi_run_ut.py
@@ -5,10 +5,133 @@
 import logging
 from unittest.mock import patch, mock_open, MagicMock
 import json
 import signal
+from msprobe.core.common.file_utils import create_directory, save_json, write_csv
+from msprobe.core.common.exceptions import FileCheckException
from msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut import split_json_file, signal_handler, run_parallel_ut, \ prepare_config, main, ParallelUTConfig +class Args: + def __init__(self, config_path=None, api_info_path=None, out_path=None, result_csv_path=None): + self.config_path = config_path + self.api_info_path = api_info_path + self.out_path = out_path + self.result_csv_path = result_csv_path + + +class TestFileCheck(unittest.TestCase): + def setUp(self): + src_path = 'temp_path' + create_directory(src_path) + dst_path = 'soft_link' + os.symlink(src_path, dst_path) + self.hard_path = os.path.abspath(src_path) + self.soft_path = os.path.abspath(dst_path) + json_path = os.path.join(self.hard_path, 'test.json') + json_data = {'key': 'value'} + save_json(json_path, json_data) + self.hard_json_path = json_path + soft_json_path = 'soft.json' + os.symlink(json_path, soft_json_path) + self.soft_json_path = os.path.abspath(soft_json_path) + csv_path = os.path.join(self.hard_path, 'test.csv') + csv_data = [['1', '2', '3']] + write_csv(csv_data, csv_path) + soft_csv_path = 'soft.csv' + os.symlink(csv_path, soft_csv_path) + self.csv_path = os.path.abspath(soft_csv_path) + self.empty_path = "empty_path" + + def tearDown(self): + os.unlink(self.soft_json_path) + os.unlink(self.csv_path) + os.unlink(self.soft_path) + for file in os.listdir(self.hard_path): + os.remove(os.path.join(self.hard_path, file)) + os.rmdir(self.hard_path) + + def test_config_path_soft_link_check(self): + args = Args(config_path=self.soft_json_path, api_info_path=self.hard_json_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_api_info_path_soft_link_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.soft_json_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + prepare_config(args) + 
self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_out_path_soft_link_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.soft_path) + + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_result_csv_path_soft_link_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.hard_path, + result_csv_path=self.csv_path) + + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_config_path_empty_check(self): + args = Args(config_path=self.empty_path, api_info_path=self.hard_json_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_api_info_path_empty_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.empty_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_out_path_empty_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.empty_path) + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_result_csv_path_empty_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.hard_path, + result_csv_path=self.empty_path) + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def 
test_config_path_invalid_check(self): + args = Args(config_path=123, api_info_path=self.hard_json_path, out_path=self.hard_path) + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_api_info_path_invalid_check(self): + args = Args(config_path=self.hard_json_path, api_info_path="123", out_path=self.hard_path) + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_out_path_invalid_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=123) + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_result_csv_path_invalid_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.hard_path, + result_csv_path=123) + with self.assertRaises(Exception) as context: + prepare_config(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + class TestMultiRunUT(unittest.TestCase): def setUp(self): diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_overflow_check.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_overflow_check.py index b22e0f5e2e17473fd6c4d96a7caccd4bb643390a..094e8897e8fa58ab3f292af264286fdbafebab96 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_overflow_check.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_overflow_check.py @@ -4,6 +4,8 @@ from msprobe.pytorch.api_accuracy_checker.run_ut.run_overflow_check import * class TestRunOverflowCheck(unittest.TestCase): + def setUp(self): + self.device = "cpu" def 
test_check_tensor_overflow_tensor_inf(self): x = torch.tensor(float('inf')) @@ -36,33 +38,33 @@ class TestRunOverflowCheck(unittest.TestCase): def test_check_data_overflow_list_with_overflow(self): tensor_overflow = torch.tensor(float('inf')) tensor_list = [tensor_overflow, torch.tensor(1.0)] - self.assertTrue(check_data_overflow(tensor_list)) + self.assertTrue(check_data_overflow(tensor_list, self.device)) def test_check_data_overflow_list_without_overflow(self): tensor_list = [torch.tensor(1.0), torch.tensor(2.0)] - self.assertFalse(check_data_overflow(tensor_list)) + self.assertFalse(check_data_overflow(tensor_list, self.device)) def test_check_data_overflow_tuple_with_overflow(self): tensor_overflow = torch.tensor(float('inf')) tensor_tuple = (tensor_overflow, torch.tensor(1.0)) - self.assertTrue(check_data_overflow(tensor_tuple)) + self.assertTrue(check_data_overflow(tensor_tuple, self.device)) def test_check_data_overflow_tuple_without_overflow(self): tensor_tuple = (torch.tensor(1.0), torch.tensor(2.0)) - self.assertFalse(check_data_overflow(tensor_tuple)) + self.assertFalse(check_data_overflow(tensor_tuple, self.device)) def test_check_data_overflow_single_tensor_overflow(self): tensor_overflow = torch.tensor(float('inf')) - self.assertTrue(check_data_overflow(tensor_overflow)) + self.assertTrue(check_data_overflow(tensor_overflow, self.device)) def test_check_data_overflow_single_tensor_no_overflow(self): tensor = torch.tensor(1.0) - self.assertFalse(check_data_overflow(tensor)) + self.assertFalse(check_data_overflow(tensor, self.device)) def test_check_data_overflowt_empty_list(self): empty_list = [] - self.assertFalse(check_data_overflow(empty_list)) + self.assertFalse(check_data_overflow(empty_list, self.device)) def test_check_data_overflow_empty_tuple(self): empty_tuple = () - self.assertFalse(check_data_overflow(empty_tuple)) + self.assertFalse(check_data_overflow(empty_tuple, self.device)) diff --git 
a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut.py index 59c84eabd1ed0575a980649c9dcd8313d2e9f72f..cb54b4ccfef5c1aa19c4a3527b6b5cfdac7dcc77 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut.py @@ -1,9 +1,12 @@ # coding=utf-8 +import os import copy +import shutil import unittest from unittest.mock import patch, DEFAULT from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import * -from msprobe.core.common.file_utils import get_json_contents +from msprobe.core.common.file_utils import get_json_contents, create_directory, save_json, write_csv +from msprobe.core.common.exceptions import FileCheckException from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import UtDataInfo, exec_api base_dir = os.path.dirname(os.path.realpath(__file__)) @@ -14,14 +17,137 @@ for api_full_name, api_info_dict in forward_content.items(): api_info_dict = api_info_dict +class Args: + def __init__(self, config_path=None, api_info_path=None, out_path=None, result_csv_path=None): + self.config_path = config_path + self.api_info_path = api_info_path + self.out_path = out_path + self.result_csv_path = result_csv_path + + +class TestFileCheck(unittest.TestCase): + def setUp(self): + src_path = 'temp_path' + create_directory(src_path) + dst_path = 'soft_link' + os.symlink(src_path, dst_path) + self.hard_path = os.path.abspath(src_path) + self.soft_path = os.path.abspath(dst_path) + json_path = os.path.join(self.hard_path, 'test.json') + json_data = {'key': 'value'} + save_json(json_path, json_data) + self.hard_json_path = json_path + soft_json_path = 'soft.json' + os.symlink(json_path, soft_json_path) + self.soft_json_path = os.path.abspath(soft_json_path) + csv_path = os.path.join(self.hard_path, 'test.csv') + 
csv_data = [['1', '2', '3']] + write_csv(csv_data, csv_path) + soft_csv_path = 'soft.csv' + os.symlink(csv_path, soft_csv_path) + self.csv_path = os.path.abspath(soft_csv_path) + self.empty_path = "empty_path" + + def tearDown(self): + os.unlink(self.soft_json_path) + os.unlink(self.csv_path) + os.unlink(self.soft_path) + for file in os.listdir(self.hard_path): + os.remove(os.path.join(self.hard_path, file)) + os.rmdir(self.hard_path) + + def test_config_path_soft_link_check(self): + args = Args(config_path=self.soft_json_path, api_info_path=self.hard_json_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_api_info_path_soft_link_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.soft_json_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_out_path_soft_link_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.soft_path) + + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_result_csv_path_soft_link_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.hard_path, + result_csv_path=self.csv_path) + + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.SOFT_LINK_ERROR) + + def test_config_path_empty_check(self): + args = Args(config_path=self.empty_path, api_info_path=self.hard_json_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, 
FileCheckException.ILLEGAL_PATH_ERROR) + + def test_api_info_path_empty_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.empty_path, out_path=self.hard_path) + + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_out_path_empty_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.empty_path) + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_result_csv_path_empty_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=self.hard_path, + result_csv_path=self.empty_path) + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_config_path_invalid_check(self): + args = Args(config_path=123, api_info_path=self.hard_json_path, out_path=self.hard_path) + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_api_info_path_invalid_check(self): + args = Args(config_path=self.hard_json_path, api_info_path="123", out_path=self.hard_path) + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_out_path_invalid_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, out_path=123) + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + def test_result_csv_path_invalid_check(self): + args = Args(config_path=self.hard_json_path, api_info_path=self.hard_json_path, 
out_path=self.hard_path, + result_csv_path=123) + with self.assertRaises(Exception) as context: + run_ut_command(args) + self.assertEqual(context.exception.code, FileCheckException.ILLEGAL_PATH_ERROR) + + class TestRunUtMethods(unittest.TestCase): def test_exec_api(self): api_info = copy.deepcopy(api_info_dict) [api_type, api_name, _, _] = api_full_name.split(".") args, kwargs, need_grad = get_api_info(api_info, api_name, None) - cpu_args, cpu_kwargs = generate_cpu_params(args, kwargs, True, '') - out = exec_api(api_type, api_name, Const.CPU_LOWERCASE, cpu_args, cpu_kwargs) + cpu_params = generate_cpu_params(args, kwargs, True, '') + cpu_args, cpu_kwargs = cpu_params.cpu_args, cpu_params.cpu_kwargs + cpu_exec_params = ExecParams(api_type, api_name, Const.CPU_LOWERCASE, cpu_args, cpu_kwargs, False, None) + out = exec_api(cpu_exec_params) self.assertEqual(out[0].dtype, torch.float32) self.assertTrue(out[0].requires_grad) self.assertEqual(out[0].shape, torch.Size([2048, 2, 1, 128])) @@ -54,7 +180,8 @@ class TestRunUtMethods(unittest.TestCase): api_info = copy.deepcopy(api_info_dict) [api_type, api_name, _, _] = api_full_name.split(".") args, kwargs, need_grad = get_api_info(api_info, api_name, None) - cpu_args, cpu_kwargs = generate_cpu_params(args, kwargs, True, '') + cpu_params = generate_cpu_params(args, kwargs, True, '') + cpu_args, cpu_kwargs = cpu_params.cpu_args, cpu_params.cpu_kwargs self.assertEqual(len(cpu_args), 2) self.assertEqual(cpu_args[0].dtype, torch.float32) self.assertTrue(cpu_args[0].requires_grad) @@ -114,3 +241,76 @@ class TestRunUtMethods(unittest.TestCase): out = 42 result = need_to_backward(grad_index, out) self.assertTrue(result) + + +class TestRunUtOnlineConfig(unittest.TestCase): + + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.run_ut.check_crt_valid') + def test_checked_online_config(self, mock_check_crt_valid): + class OnlineConfigClass: + is_online = True + rank_list = [0, 1] + nfs_path = "" + tls_path = "" + host = "127.0.0.1" + 
port = 12345 + + mock_check_crt_valid.return_value = None + + online_config = OnlineConfigClass() + res = checked_online_config(online_config) + self.assertIsNone(res) + + # test is_online + online_config.is_online = "True" + with self.assertRaises(Exception) as context: + checked_online_config(online_config) + self.assertIn(str(context.exception), f"is_online must be bool type") + online_config.is_online = True + + # test rank_list + online_config.rank_list = "1234" + with self.assertRaises(Exception) as context: + checked_online_config(online_config) + self.assertIn(str(context.exception), f"rank_list must be a list") + online_config.rank_list = ["1", "2"] + with self.assertRaises(Exception) as context: + checked_online_config(online_config) + self.assertIn(str(context.exception), f"All elements in rank_list must be integers") + online_config.rank_list = [1, 2] + + # test nfs_path + online_config.nfs_path = "./nfs_path" + with self.assertRaises(Exception) as context: + checked_online_config(online_config) + self.assertIn(str(context.exception), "[msprobe] 非法文件路径: ") + online_config.nfs_path = "" + + # test tls_path + online_config.tls_path = "./tls_path" + with self.assertRaises(Exception) as context: + checked_online_config(online_config) + self.assertIn(str(context.exception), "[msprobe] 非法文件路径: ") + + os.makedirs(online_config.tls_path) + with open(os.path.join(online_config.tls_path, "server.key"), 'w') as file: + file.write("1") + with open(os.path.join(online_config.tls_path, "server.crt"), 'w') as file: + file.write("1") + checked_online_config(online_config) + shutil.rmtree(online_config.tls_path) + online_config.tls_path = "" + + # test host + online_config.host = "invalid_host" + with self.assertRaises(Exception) as context: + checked_online_config(online_config) + self.assertIn(str(context.exception), f"host: {online_config.host} is invalid.") + online_config.host = "127.0.0.1" + + # test port + online_config.port = -1 + with 
self.assertRaises(Exception) as context: + checked_online_config(online_config) + self.assertIn(str(context.exception), f"port: {online_config.port} is invalid, port range 0-65535.") + online_config.port = 6123 diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut_utils.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut_utils.py index 384cee0584e4e6d04e1470a062a949cd18d7bd23..0cf30461aec70b85577c38ebed011bf9f818874d 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut_utils.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut_utils.py @@ -55,25 +55,37 @@ class TestRunUtUtils(unittest.TestCase): def test_exec_api_functional_api(self): api_name = "sigmoid" args = (torch.tensor([1])) - result = exec_api("Functional", api_name, None, args, kwargs={}) + kwargs={} + api_type = "Functional" + exec_params = ExecParams(api_type, api_name, "cpu", args, kwargs, False, None) + result = exec_api(exec_params) self.assertTrue(torch.allclose(result, torch.tensor(0.7311), atol=1e-4)) def test_exec_api_tensor_api(self): api_name = "add" args = (torch.tensor(1), torch.tensor(2)) - result = exec_api("Tensor", api_name, None, args, kwargs={}) + kwargs={} + api_type = "Tensor" + exec_params = ExecParams(api_type, api_name, "cpu", args, kwargs, False, None) + result = exec_api(exec_params) self.assertEqual(result, torch.tensor(3)) def test_exec_api_torch_api(self): api_name = "add" args = (torch.tensor(1), torch.tensor(2)) - result = exec_api("Torch", api_name, None, args, kwargs={}) + kwargs={} + api_type = "Torch" + exec_params = ExecParams(api_type, api_name, "cpu", args, kwargs, False, None) + result = exec_api(exec_params) self.assertEqual(result, torch.tensor(3)) def test_exec_api_aten_api(self): api_name = "add" args = (torch.tensor(1), torch.tensor(2)) - result = exec_api("Aten", api_name, None, args, kwargs={}) + 
kwargs={} + api_type = "Aten" + exec_params = ExecParams(api_type, api_name, "cpu", args, kwargs, False, None) + result = exec_api(exec_params) self.assertEqual(result, torch.tensor(3)) def test_raise_bench_data_dtype_dtype_unchanged(self): diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_attl.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_attl.py new file mode 100644 index 0000000000000000000000000000000000000000..7d4e6e950dc1d3e51ef69ca46895fcf5078c5f67 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_attl.py @@ -0,0 +1,108 @@ +# coding=utf-8 +import unittest +from unittest.mock import patch +from multiprocessing import Queue + +from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.attl import * +from msprobe.core.common.file_utils import create_directory + +class TestATTL(unittest.TestCase): + + def setUp(self): + nfs_path = "temp_nfs_path" + create_directory(nfs_path) + self.nfs_path = os.path.realpath(nfs_path) + self.session_id = "test_session" + self.session_config = ATTLConfig(is_benchmark_device=False, connect_ip='127.0.0.1', + connect_port=8080, nfs_path=self.nfs_path , check_sum=False, queue_size=100) + self.attls = ATTL(self.session_id, self.session_config, need_dump=False) + self.buffer = ApiData('test_api', args=(torch.randn(2, 2),), kwargs={'device': 'cpu'}, + result=torch.randn(2, 2), step=1, rank=1) + + def tearDown(self): + for filename in os.listdir(self.nfs_path): + os.remove(os.path.join(self.nfs_path, filename)) + os.rmdir(self.nfs_path) + + def test_attl_config(self): + config = ATTLConfig(is_benchmark_device=True, connect_ip='192.168.1.1', connect_port=9090, + nfs_path=self.nfs_path, tls_path='/path/to/tls', check_sum=False, queue_size=100) + self.assertEqual(config.is_benchmark_device, True) + self.assertEqual(config.connect_ip, '192.168.1.1') + 
self.assertEqual(config.connect_port, 9090) + self.assertEqual(config.nfs_path, self.nfs_path) + self.assertEqual(config.tls_path, '/path/to/tls') + self.assertFalse(config.check_sum) + self.assertEqual(config.queue_size, 100) + + @patch('msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.attl.move2target_device') + def test_upload_api_data(self, mock_move2target_device): + mock_move2target_device.return_value = self.buffer + self.attls.upload(self.buffer) + mock_move2target_device.assert_called_once_with(self.buffer, torch.device('cpu')) + + @patch('glob.glob') + def test_download_no_files(self, mock_glob): + mock_glob.return_value = [] + result = self.attls.download() + self.assertIsNone(result) + + @patch('glob.glob') + @patch('msprobe.pytorch.common.utils.load_pt') + def test_download_with_exception(self, mock_load_pt, mock_glob): + mock_glob.return_value = ['/tmp/start_file.pt'] + mock_load_pt.side_effect = Exception('Load error') + with patch.object(self.attls.logger, 'warning') as mock_logger: + result = self.attls.download() + self.assertIsNone(result) + mock_logger.assert_called_once() + + def test_move2device_exec_tensor(self): + tensor = torch.randn(2, 2) + device = torch.device("cpu") + moved_tensor = move2device_exec(tensor, device) + self.assertEqual(moved_tensor.device, device) + + def test_move2device_exec_list(self): + tensor_list = [torch.randn(2, 2), torch.randn(2, 2)] + device = torch.device("cpu") + moved_list = move2device_exec(tensor_list, device) + for tensor in moved_list: + self.assertEqual(tensor.device, device) + + def test_move2device_exec_tuple(self): + tensor_tuple = (torch.randn(2, 2), torch.randn(2, 2)) + device = torch.device("cpu") + moved_tuple = move2device_exec(tensor_tuple, device) + for tensor in moved_tuple: + self.assertEqual(tensor.device, device) + + def test_move2device_exec_dict(self): + tensor_dict = {"a": torch.randn(2, 2), "b": torch.randn(2, 2)} + device = torch.device("cpu") + moved_dict = 
move2device_exec(tensor_dict, device) + for tensor in moved_dict.values(): + self.assertEqual(tensor.device, device) + + def test_move2device_exec_device(self): + device = torch.device("cpu") + moved_device = move2device_exec(torch.device("cpu"), device) + self.assertEqual(moved_device, device) + + def test_move2device_exec_non_tensor(self): + obj = "This is a string" + device = torch.device("cpu") + self.assertEqual(move2device_exec(obj, device), obj) + + def test_move2target_device_to_cpu(self): + tensor_args = (torch.randn(2, 2), torch.randn(3, 3)) + tensor_kwargs = {'key1': torch.randn(2, 2), 'key2': torch.randn(3, 3)} + tensor_result = torch.randn(2, 2) + buffer = ApiData('test_api', tensor_args, tensor_kwargs, tensor_result, 1, 1) + target_device = torch.device('cpu') + moved_buffer = move2target_device(buffer, target_device) + self.assertEqual(moved_buffer.result.device, target_device) + for tensor in moved_buffer.args: + self.assertEqual(tensor.device, target_device) + for tensor in moved_buffer.kwargs.values(): + self.assertEqual(tensor.device, target_device) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_client.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_client.py new file mode 100644 index 0000000000000000000000000000000000000000..d35cfc3387559064298a451fb9d868838bb25aac --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_client.py @@ -0,0 +1,33 @@ +# coding=utf-8 +import unittest +from unittest.mock import patch, MagicMock +from multiprocessing import Queue + +from msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.client import * +from msprobe.core.common.file_utils import create_directory + + +class TestClient(unittest.TestCase): + + def setUp(self) -> None: + self.host = "localhost" + self.port = 8000 + self.check_sum = False + tls_path = "temp_tls_path" + 
create_directory(tls_path) + self.tls_path = os.path.realpath(tls_path) + + def tearDown(self) -> None: + for filename in os.listdir(self.tls_path): + os.remove(os.path.join(self.tls_path, filename)) + os.rmdir(self.tls_path) + + def test_TCPDataItem(self): + data_item = TCPDataItem(data="example_data", sequence_number=10, rank=1, step=2) + self.assertEqual(data_item.raw_data, "example_data") + self.assertEqual(data_item.sequence_number, 10) + self.assertEqual(data_item.rank, 1) + self.assertEqual(data_item.step, 2) + self.assertEqual(data_item.retry_times, 0) + self.assertEqual(data_item.pending_time, 0) + self.assertEqual(data_item.busy_time, 0) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_accuracy_server.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_accuracy_server.py index 4e535f4881e42fe616effc6a1a5a55957fecad83..b60cfdc323bed57e1cda1fc2d9db3197638cee4c 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_accuracy_server.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_accuracy_server.py @@ -39,27 +39,6 @@ class TestTCPServer(unittest.TestCase): self.tcp_server.run_reactor() mock_reactor.run.assert_called_once_with(installSignalHandlers=False) - @patch("os.path.exists") - def test_check_tls_path(self, mock_path_exists): - mock_path_exists.side_effect = lambda path: True - server_key, server_crt = self.tcp_server.check_tls_path() - - self.assertEqual(server_key, "/test/path/server.key") - self.assertEqual(server_crt, "/test/path/server.crt") - - @patch("os.path.exists") - def test_check_tls_path_missing_key(self, mock_path_exists): - def side_effect(path): - if "server.key" in path: - return False - return True - - mock_path_exists.side_effect = side_effect - - with self.assertRaises(Exception) as context: - 
self.tcp_server.check_tls_path() - self.assertIn("/test/path/server.key is not exists", str(context.exception)) - def test_is_running(self): self.tcp_server.is_running() self.tcp_server.factory.is_all_connection_closed.assert_called_once_with() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_device_dispatch.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_device_dispatch.py index 677bb95c067256cacb9c8e52cc56a5ad6e89b44b..79f569cdcaeaa662f403b73fd4047caf7c2f0311 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_device_dispatch.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/tensor_transport_layer/test_pt_device_dispatch.py @@ -44,7 +44,7 @@ class TestDeviceDispatchFunc(unittest.TestCase): patch("msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.device_dispatch.pd"), \ patch( "msprobe.pytorch.api_accuracy_checker.tensor_transport_layer.device_dispatch.online_api_precision_compare"): - mock_gen_cpu_params.return_value = (MagicMock(), MagicMock()) + mock_gen_cpu_params.return_value = (MagicMock()) mock_api_data = MagicMock() mock_api_data.name.split.return_value = ("tensor", "conv2d", 1) mock_com_config = MagicMock() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_apply_adam.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_apply_adam.py new file mode 100644 index 0000000000000000000000000000000000000000..02631ec1fb487de28e9300934637b36791cf22ea --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_apply_adam.py @@ -0,0 +1,85 @@ +import unittest +import torch + +from msprobe.pytorch.bench_functions.apply_adam import npu_apply_adam + + +class TestNPUApplyAdam(unittest.TestCase): + def setUp(self): + # Initialize test data + self.beta1_power = 0.9 + self.beta2_power = 0.999 + self.lr = 0.001 + 
self.beta1 = 0.9 + self.beta2 = 0.999 + self.epsilon = 1e-8 + self.grad = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32) + self.use_locking = False + self.var = torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32) + self.m = torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32) + self.v = torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32) + self.out = (self.var, self.m, self.v) + + def test_npu_apply_adam_without_nesterov(self): + # Test the case without Nesterov momentum + use_nesterov = False + var_t, m_t, v_t = npu_apply_adam( + self.beta1_power, self.beta2_power, self.lr, self.beta1, self.beta2, + self.epsilon, self.grad, self.use_locking, use_nesterov, self.out + ) + + # Verify the var_t result + expected_var_t = torch.tensor([-0.0010, -0.0010, -0.0010], dtype=torch.float32) + self.assertTrue(torch.allclose(var_t, expected_var_t, atol=1e-4)) + + # Verify the m_t result + expected_m_t = torch.tensor([0.1000, 0.2000, 0.3000], dtype=torch.float32) + self.assertTrue(torch.allclose(m_t, expected_m_t, atol=1e-4)) + + # Verify the v_t result + expected_v_t = torch.tensor([0.0010, 0.0040, 0.0090], dtype=torch.float32) + self.assertTrue(torch.allclose(v_t, expected_v_t, atol=1e-4)) + + def test_npu_apply_adam_with_nesterov(self): + # Test the case with Nesterov momentum + use_nesterov = True + var_t, m_t, v_t = npu_apply_adam( + self.beta1_power, self.beta2_power, self.lr, self.beta1, self.beta2, + self.epsilon, self.grad, self.use_locking, use_nesterov, self.out + ) + + # Verify the var_t result + expected_var_t = torch.tensor([-0.0019, -0.0019, -0.0019], dtype=torch.float32) + self.assertTrue(torch.allclose(var_t, expected_var_t, atol=1e-4)) + + # Verify the m_t result + expected_m_t = torch.tensor([0.1000, 0.2000, 0.3000], dtype=torch.float32) + self.assertTrue(torch.allclose(m_t, expected_m_t, atol=1e-4)) + + # Verify the v_t result + expected_v_t = torch.tensor([0.0010, 0.0040, 0.0090], dtype=torch.float32) + self.assertTrue(torch.allclose(v_t, expected_v_t, atol=1e-4)) + + def test_npu_apply_adam_with_non_zero_initial_values(self): + # Test the case with non-zero initial values + self.m = 
torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32) + self.v = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32) + self.out = (self.var, self.m, self.v) + + use_nesterov = False + var_t, m_t, v_t = npu_apply_adam( + self.beta1_power, self.beta2_power, self.lr, self.beta1, self.beta2, + self.epsilon, self.grad, self.use_locking, use_nesterov, self.out + ) + + # Verify the var_t result + expected_var_t = torch.tensor([-0.0003, -0.0004, -0.0005], dtype=torch.float32) + self.assertTrue(torch.allclose(var_t, expected_var_t, atol=1e-4)) + + # Verify the m_t result + expected_m_t = torch.tensor([1., 2., 3.], dtype=torch.float32) + self.assertTrue(torch.allclose(m_t, expected_m_t, atol=1e-4)) + + # Verify the v_t result + expected_v_t = torch.tensor([1.0000, 2.0020, 3.0060], dtype=torch.float32) + self.assertTrue(torch.allclose(v_t, expected_v_t, atol=1e-4)) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_group_norm_silu.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_group_norm_silu.py new file mode 100644 index 0000000000000000000000000000000000000000..7e4a447df1a10d57f47a427be672bbced2a5cffd --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_group_norm_silu.py @@ -0,0 +1,23 @@ +import unittest +import torch + +from msprobe.pytorch.bench_functions.group_norm_silu import npu_group_norm_silu + + +class TestNPUGroupNormSILU(unittest.TestCase): + def setUp(self): + self.input0 = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]]) + self.gama = torch.tensor([1.0]) + self.beta = torch.tensor([0.0]) + self.group = 1 + self.eps = 1e-5 + + def test_npu_group_norm_silu_positive(self): + # Call the npu_group_norm_silu function + result = npu_group_norm_silu(self.input0, self.gama, self.beta, self.group, self.eps) + + # Expected result + expected_result = torch.tensor([[[[-0.2780, -0.1744], [0.2728, 1.0636]]]]) + + # Approximate comparison using torch.allclose + self.assertTrue(torch.allclose(result[0], expected_result, atol=1e-4)) diff --git 
a/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_mish.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_mish.py new file mode 100644 index 0000000000000000000000000000000000000000..e1684859d8cb48a12584f0c41f37687cb57e7e13 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_mish.py @@ -0,0 +1,20 @@ +import unittest +import torch + +from msprobe.pytorch.bench_functions.mish import npu_mish + + +class TestNPUMish(unittest.TestCase): + def setUp(self): + self.input0 = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]]) + self.eps = 1e-5 + + def test_npu_mish_positive(self): + # Call the npu_mish function + result = npu_mish(self.input0) + + # Expected result + expected_result = torch.tensor([[[[0.8651, 1.9440], [2.9865, 3.9974]]]]) + + # Approximate comparison using torch.allclose + self.assertTrue(torch.allclose(result[0], expected_result, atol=1e-4)) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_npu_fusion_attention.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_npu_fusion_attention.py index 737cf747b24de45e795bc751824eb62b288f9636..36b307eed3f7acea0d8b23ed76f93d6f21e8a805 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_npu_fusion_attention.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_npu_fusion_attention.py @@ -3,7 +3,7 @@ import torch import unittest from msprobe.pytorch.bench_functions.npu_fusion_attention import npu_fusion_attention, npu_fusion_attention_grad, \ - broadcast_kv + broadcast_kv, convert_from_bsnd, convert_to_bsnd, rearrange class TestNpuFusionAttention(unittest.TestCase): @@ -18,6 +18,37 @@ class TestNpuFusionAttention(unittest.TestCase): self.key = torch.randn(self.B, self.S2, self.N2, self.D) self.value = torch.randn(self.B, self.S2, self.N2, self.D) self.atten_mask = torch.randn(self.B, 1, self.S1, self.S2) + self.batch_size = 2 + self.seq_len = 3 + self.num_heads = 4 + self.head_dim = 5 + self.input_tensor 
= torch.randn(self.batch_size, self.seq_len, self.num_heads, self.head_dim) + + def test_convert_from_bsnd(self): + # Test converting from bsnd to BSH + converted_tensor = convert_from_bsnd(self.input_tensor, "BSH") + self.assertEqual(converted_tensor.shape, (self.batch_size, self.seq_len, self.num_heads * self.head_dim)) + + # Test converting from bsnd to SBH + converted_tensor = convert_from_bsnd(self.input_tensor, "SBH") + self.assertEqual(converted_tensor.shape, (self.seq_len, self.batch_size, self.num_heads * self.head_dim)) + + # Test converting from bsnd to BNSD + converted_tensor = convert_from_bsnd(self.input_tensor, "BNSD") + self.assertEqual(converted_tensor.shape, (self.batch_size, self.num_heads, self.seq_len, self.head_dim)) + + def test_convert_to_bsnd(self): + # Test converting back from BSH to bsnd + converted_tensor = convert_to_bsnd(rearrange(self.input_tensor, 'b s n d -> b s (n d)'), self.num_heads, "BSH") + self.assertEqual(converted_tensor.shape, (self.batch_size, self.seq_len, self.num_heads, self.head_dim)) + + # Test converting back from SBH to bsnd + converted_tensor = convert_to_bsnd(rearrange(self.input_tensor, 'b s n d -> s b (n d)'), self.num_heads, "SBH") + self.assertEqual(converted_tensor.shape, (self.batch_size, self.seq_len, self.num_heads, self.head_dim)) + + # Test converting back from BNSD to bsnd + converted_tensor = convert_to_bsnd(rearrange(self.input_tensor, 'b s n d -> b n s d'), self.num_heads, "BNSD") + self.assertEqual(converted_tensor.shape, (self.batch_size, self.seq_len, self.num_heads, self.head_dim)) def test_basic_forward_input_layout_is_BSND(self): # Basic forward pass test diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_npu_moe_gating_top_k_softmax.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_npu_moe_gating_top_k_softmax.py new file mode 100644 index 0000000000000000000000000000000000000000..e33915afb7455e647c7631e225e0f28668796e64 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_npu_moe_gating_top_k_softmax.py @@ -0,0 +1,37 @@ +import 
unittest +import torch + +from msprobe.pytorch.bench_functions.moe_gating_top_k_softmax import npu_moe_gating_top_k_softmax, softmax_func + + +class TestNPUMoEGatingTopKSoftmax(unittest.TestCase): + def setUp(self): + self.input0 = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]]) + self.finished_optional = None + self.k = 2 + + def test_npu_moe_gating_top_k_softmax(self): + # Call the npu_moe_gating_top_k_softmax function + result = npu_moe_gating_top_k_softmax(self.input0, self.finished_optional, self.k) + + # Expected result + expected_result = ( + torch.tensor([[[[0.7311, 0.2689], [0.7311, 0.2689]]]]), + torch.tensor([[[[1, 0], [1, 0]]]]), + torch.tensor([[0]]) + ) + + # Approximate comparison using torch.allclose + self.assertTrue(torch.allclose(result[0], expected_result[0], atol=1e-4)) + self.assertTrue(torch.allclose(result[1], expected_result[1], atol=1e-4)) + self.assertTrue(torch.allclose(result[2], expected_result[2], atol=1e-4)) + + def test_softmax_func(self): + # Call the softmax_func function + result = softmax_func(self.input0, -1) + + # Expected result + expected_result = torch.tensor([[[[0.2689, 0.7311], [0.2689, 0.7311]]]]) + + # Approximate comparison using torch.allclose + self.assertTrue(torch.allclose(result, expected_result, atol=1e-4)) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_sort_v2.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_sort_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..9c008172dbcfa6e3fd7c8e6c340a679e2ac3e9c8 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/bench_functions/test_sort_v2.py @@ -0,0 +1,22 @@ +import unittest +import torch + +from msprobe.pytorch.bench_functions.sort_v2 import npu_sort_v2 + + +class TestSortV2(unittest.TestCase): + def setUp(self): + self.input0 = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]]) + self.dim = -1 + self.descending = False + self.out = None + + def test_npu_sort_v2(self): + # Call the npu_sort_v2 function + result = npu_sort_v2(self.input0, self.dim, self.descending, self.out) + + # Expected result + expected_result = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]]) + + # Approximate comparison using torch.allclose + self.assertTrue(torch.allclose(result, expected_result, atol=1e-4)) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_pt_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_pt_compare.py index 865b64401085fe1346a9074fd212a645aa7ba8b3..b079e646c4a8f4098bb233e3e6259ef3ebea9c94 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_pt_compare.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_pt_compare.py @@ -1,11 +1,15 @@ # coding=utf-8 import os -import torch -import unittest import shutil +import unittest + import numpy as np -from msprobe.pytorch.compare.pt_compare import PTComparator, compare +import torch + +from msprobe.core.common.const import Const from msprobe.core.common.utils import CompareException +from msprobe.core.compare.acc_compare import ModeConfig +from msprobe.pytorch.compare.pt_compare import PTComparator, compare from msprobe.test.core_ut.compare.test_acc_compare import generate_dump_json, generate_stack_json @@ -38,14 +42,32 @@ class TestUtilsMethods(unittest.TestCase): def test_read_npy_data_bf16(self): generate_bf16_pt(base_dir1) - result = PTComparator().read_npy_data(base_dir1, 'bf16.pt') + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + result = pt_comparator.read_npy_data(base_dir1, 'bf16.pt') + target_result = torch.tensor([1, 2, 3, 4], dtype=torch.float32).numpy() self.assertTrue(np.array_equal(result, target_result)) def test_read_npy_data_dict(self): generate_dict_pt(base_dir1) + + stack_mode = True + auto_analyze = True + fuzzy_match = False + dump_mode = Const.ALL + mode_config = ModeConfig(stack_mode, auto_analyze, fuzzy_match, dump_mode) + + pt_comparator = PTComparator(mode_config) + with 
self.assertRaises(CompareException) as context: - result = PTComparator().read_npy_data(base_dir1, 'dict.pt') + result = pt_comparator.read_npy_data(base_dir1, 'dict.pt') self.assertEqual(context.exception.code, CompareException.DETACH_ERROR) def test_compare(self): @@ -53,12 +75,10 @@ class TestUtilsMethods(unittest.TestCase): generate_stack_json(base_dir2) dump_path = os.path.join(base_dir2, 'dump.json') - stack_path = os.path.join(base_dir2, 'stack.json') input_param = { 'npu_json_path': dump_path, 'bench_json_path': dump_path, - 'stack_json_path': stack_path, 'is_print_compare_log': True } output_path = base_dir2 @@ -70,7 +90,6 @@ class TestUtilsMethods(unittest.TestCase): input_param2 = { 'npu_json_path': '', 'bench_json_path': dump_path, - 'stack_json_path': stack_path, 'is_print_compare_log': True } with self.assertRaises(CompareException) as context: diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_debugger_config.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_debugger_config.py index da89c91d343af9fe82f1ba482986cd889ed44bf8..4fc27c267ebe65ea46ecf0f17bc47ff702eb241d 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_debugger_config.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_debugger_config.py @@ -15,6 +15,7 @@ class TestDebuggerConfig(unittest.TestCase): self.common_config.task = Const.STATISTICS self.common_config.level = "L1" self.common_config.enable_dataloader = True + self.common_config.async_dump = False def test_default_init(self): debugger = DebuggerConfig(self.common_config, self.task_config, None, None, None) @@ -33,24 +34,6 @@ class TestDebuggerConfig(unittest.TestCase): self.assertEqual(debugger.handler_type, "check") self.assertTrue(debugger.preheat_config["if_preheat"]) - def test_level_l2_scope_validation(self): - self.task_config.scope = ["backward.api"] - self.task_config.backward_input = ["input"] - debugger = 
DebuggerConfig(self.common_config, self.task_config, None, None, "L2") - self.assertEqual(debugger.scope, ["forward.api"]) - self.assertEqual(debugger.backward_input["forward.api"], "input") - - self.task_config.scope = ["op1", "op2"] - with self.assertRaises(ValueError) as context: - DebuggerConfig(self.common_config, self.task_config, None, None, "L2") - self.assertIn("scope must be configured as a list with one api name", str(context.exception)) - - self.task_config.scope = ["backward.api"] - self.task_config.backward_input = [] - with self.assertRaises(ValueError) as context: - DebuggerConfig(self.common_config, self.task_config, None, None, "L2") - self.assertIn("backward_input must be configured when scope contains 'backward'", str(context.exception)) - def test_online_run_ut_initialization(self): self.task_config.online_run_ut = True self.task_config.nfs_path = "./nfs_path" @@ -62,7 +45,7 @@ class TestDebuggerConfig(unittest.TestCase): self.assertTrue(debugger.online_run_ut) self.assertEqual(debugger.nfs_path, "./nfs_path") self.assertEqual(debugger.port, 8080) - + def test_valid_task_and_level(self): config = DebuggerConfig(self.common_config, self.task_config, "tensor", None, "L1") config.check_kwargs() @@ -85,3 +68,35 @@ class TestDebuggerConfig(unittest.TestCase): config = DebuggerConfig(self.common_config, self.task_config, "tensor", None, "L1") config.check_kwargs() self.assertIn("dump_path not found", str(context.exception)) + + def test_check_and_adjust_config_with_l2_scope_not_empty(self): + self.common_config.dump_path = "./dump_path" + self.common_config.task = Const.TENSOR + + self.task_config.scope = ["test_api_name"] + debugger = DebuggerConfig(self.common_config, self.task_config, None, None, None) + with self.assertRaises(MsprobeException) as context: + debugger._check_and_adjust_config_with_l2() + self.assertIn("the scope cannot be configured", str(context.exception)) + + def test_check_and_adjust_config_with_l2_list_empty(self): + 
self.common_config.dump_path = "./dump_path" + self.common_config.task = Const.TENSOR + self.common_config.async_dump = False + + self.task_config.scope = [] + self.task_config.list = [] + debugger = DebuggerConfig(self.common_config, self.task_config, None, None, None) + with self.assertRaises(MsprobeException) as context: + debugger._check_and_adjust_config_with_l2() + self.assertIn("the list must be configured", str(context.exception)) + + def test_check_and_adjust_config_with_l2_success(self): + self.common_config.dump_path = "./dump_path" + self.common_config.task = Const.TENSOR + + self.task_config.scope = [] + self.task_config.list = ["Functional.conv2d.0.backward"] + debugger = DebuggerConfig(self.common_config, self.task_config, None, None, None) + debugger._check_and_adjust_config_with_l2() + self.assertIn("Functional.conv2d.0.forward", self.task_config.list) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_precision_debugger.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_precision_debugger.py index 0b32274d271973f63868a2575f7c63a426930c11..a2f3e8a816e356b68e598138b30a9e14b42107d9 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_precision_debugger.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/debugger/test_pt_precision_debugger.py @@ -61,11 +61,6 @@ class TestPrecisionDebugger(unittest.TestCase): PrecisionDebugger.check_input_params(args) self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) - args = Args(model = 1) - with self.assertRaises(MsprobeException) as context: - PrecisionDebugger.check_input_params(args) - self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) - args = Args(config_path = os.path.join(os.path.dirname(__file__), "../../../config.json"), task = Const.TASK_LIST[0], dump_path="./dump_path", @@ -88,19 +83,17 @@ class TestPrecisionDebugger(unittest.TestCase): debugger = 
PrecisionDebugger(dump_path="./dump_path") debugger.service = MagicMock() debugger.config = MagicMock() - debugger.model = 'model' - debugger.api_origin = 'api_origin' - debugger.task = '' + debugger.task = 'statistics' debugger.start() - debugger.service.start.assert_called_once_with('model', 'api_origin') - self.assertFalse(debugger.api_origin) + debugger.service.start.assert_called_once() def test_forward_backward_dump_end(self): debugger = PrecisionDebugger(dump_path="./dump_path") debugger.service = MagicMock() + debugger.config = MagicMock() + debugger.task = 'statistics' debugger.forward_backward_dump_end() - debugger.service.forward_backward_dump_end.assert_called_once() - self.assertTrue(debugger.api_origin) + debugger.service.stop.assert_called_once() def test_stop_grad_probe(self): with self.assertRaises(Exception) as context: diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/functional/test_module_dump.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_module_dump.py similarity index 38% rename from debug/accuracy_tools/msprobe/test/pytorch_ut/functional/test_module_dump.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_module_dump.py index 8a0ff72dd266e056f8a549b698b56b6e8c6e1041..63d6abc3a2430bb6f092820c4b97a02cdf675612 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/functional/test_module_dump.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_module_dump.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -18,16 +18,12 @@ from unittest.mock import patch, MagicMock import torch import torch.nn as nn -from msprobe.core.common.exceptions import MsprobeException -from msprobe.core.common.log import logger from msprobe.pytorch import PrecisionDebugger -from msprobe.pytorch.service import torch_version_above_or_equal_2 -from msprobe.pytorch.functional.module_dump import module_dump, module_dump_end, \ - hook_handle_list, remove_hook, register_hook from msprobe.pytorch.hook_module.api_registry import api_register +from msprobe.pytorch.service import torch_version_above_or_equal_2 -class TestModuleDump(unittest.TestCase): +class TestModuleDumper(unittest.TestCase): @classmethod def setUpClass(cls): PrecisionDebugger._instance = None @@ -40,44 +36,25 @@ class TestModuleDump(unittest.TestCase): def setUp(self): self.module = nn.Linear(8, 4) - - def tearDown(self): - hook_handle_list.clear() - - @patch.object(logger, 'error') - def test_module_dump(self, mock_error): - with self.assertRaises(MsprobeException) as context: - module_dump(1, "TestModule") - self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) - mock_error.assert_called_with("The parameter module in module_dump must be a Module subclass.") - - with self.assertRaises(MsprobeException) as context: - module_dump(self.module, 1) - self.assertEqual(context.exception.code, MsprobeException.INVALID_PARAM_ERROR) - mock_error.assert_called_with("The parameter dump_name in module_dump must be a str type.") - - with patch('msprobe.pytorch.functional.module_dump.register_hook') as mock_register_hook: - module_dump(self.module, "TestModule") - mock_register_hook.assert_called_with(self.module, "TestModule") - - def test_module_dump_end(self): - hook_handle_list.extend([1, 2, 3]) - with patch('msprobe.pytorch.functional.module_dump.remove_hook') as mock_remove_hook: - module_dump_end() - mock_remove_hook.assert_called_once() - 
self.assertEqual(hook_handle_list, []) + debugger = PrecisionDebugger(dump_path="./") + self.module_dumper = debugger.module_dumper + + def test_stop_module_dump(self): + self.module_dumper.hook_handle_list.extend([1, 2, 3]) + with patch('msprobe.pytorch.dump.module_dump.module_dump.api_register') as mock_api_register: + mock_handle1 = MagicMock(spec=torch.utils.hooks.RemovableHandle) + mock_handle2 = MagicMock(spec=torch.utils.hooks.RemovableHandle) + self.module_dumper.hook_handle_list.extend([mock_handle1, mock_handle2]) + + self.module_dumper.stop_module_dump() + mock_handle1.remove.assert_called_once() + mock_handle2.remove.assert_called_once() + self.assertEqual(self.module_dumper.hook_handle_list, []) + mock_api_register.api_modularity.assert_called_once() def test_register_hook(self): - PrecisionDebugger(dump_path="./") - register_hook(self.module, "TestModule") + self.module_dumper.register_hook(self.module, "TestModule") if torch_version_above_or_equal_2: - self.assertEqual(len(hook_handle_list), 6) + self.assertEqual(len(self.module_dumper.hook_handle_list), 6) else: - self.assertEqual(len(hook_handle_list), 5) - - def test_remove_hook(self): - mock_handle = MagicMock(spec=torch.utils.hooks.RemovableHandle) - hook_handle_list.append(mock_handle) - remove_hook() - - mock_handle.remove.assert_called_once() + self.assertEqual(len(self.module_dumper.hook_handle_list), 5) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_module_processer.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_module_processer.py similarity index 62% rename from debug/accuracy_tools/msprobe/test/pytorch_ut/test_module_processer.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_module_processer.py index 172799deba3672c4714998ddbc4e01e64d9b49fa..f8a561b61b6a758a525675bdc59957e5c923b261 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_module_processer.py +++ 
b/debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_module_processer.py @@ -5,7 +5,7 @@ import torch from msprobe.core.data_dump.scope import ModuleRangeScope from msprobe.pytorch.common.utils import Const -from msprobe.pytorch.module_processer import ModuleProcesser +from msprobe.pytorch.dump.module_dump.module_processer import ModuleProcesser class TestModuleProcesser(unittest.TestCase): @@ -25,31 +25,6 @@ class TestModuleProcesser(unittest.TestCase): processor = ModuleProcesser(scope) self.assertIsNone(processor.scope) - def test_filter_tensor_and_tuple(self): - def func(nope, x): - return x * 2 - result_1 = ModuleProcesser.filter_tensor_and_tuple(func)(None, torch.tensor([1])) - self.assertEqual(result_1, torch.tensor([2])) - result_2 = ModuleProcesser.filter_tensor_and_tuple(func)(None, "test") - self.assertEqual(result_2, "test") - - def test_filter_tensor_and_tuple_with_tensor(self): - class MockBackwardHook: - @staticmethod - def setup_output_hook(*args, **kwargs): - return args[1] - - mock_hook = MockBackwardHook.setup_output_hook - wrapped_hook = ModuleProcesser.filter_tensor_and_tuple(mock_hook) - - tensor = torch.tensor([1, 2, 3]) - mock_obj = type('MockObj', (object,), {'tensor_attr': tensor})() - wrapped_hook(None, mock_obj) - self.assertIs(mock_obj.tensor_attr, tensor) - non_tensor_obj = type('MockObj', (object,), {'non_tensor_attr': 'non_tensor_value'})() - wrapped_hook(None, non_tensor_obj) - self.assertEqual(non_tensor_obj.non_tensor_attr, 'non_tensor_value') - def test_clone_return_value_and_test_clone_if_tensor(self): def func(x): return x @@ -87,7 +62,7 @@ class TestModuleProcesser(unittest.TestCase): module.mindstudio_reserved_name = None hook(module, input) expected_name = f"forward_layer{Const.SEP}0" - self.assertEqual(module.mindstudio_reserved_name, expected_name) + self.assertEqual(module.mindstudio_reserved_name, [expected_name]) self.assertIn(expected_name, ModuleProcesser.module_stack) 
self.assertEqual(ModuleProcesser.api_parent_node, expected_name) @@ -98,32 +73,32 @@ class TestModuleProcesser(unittest.TestCase): module = MagicMock() input = (self.mock_tensor,) - module.mindstudio_reserved_name = f"forward_layer{Const.SEP}0" - hook(module, input) - self.assertNotIn([f"forward_layer{Const.SEP}0"], ModuleProcesser.module_stack) - self.assertEqual(ModuleProcesser.api_parent_node, module.mindstudio_reserved_name) - - def test_node_hook_forward_stop(self): - name_prefix = "forward_layer" - hook = self.processor.node_hook(name_prefix, start_or_stop=Const.STOP) - ModuleProcesser.module_stack.append(f"forward_layer{Const.SEP}0") - - module = MagicMock() - input = (self.mock_tensor,) - module.mindstudio_reserved_name = f"forward_layer{Const.SEP}0" + reserved_name = f"forward_layer{Const.SEP}0" + module.mindstudio_reserved_name = [reserved_name] hook(module, input) self.assertNotIn([f"forward_layer{Const.SEP}0"], ModuleProcesser.module_stack) - self.assertEqual(ModuleProcesser.api_parent_node, module.mindstudio_reserved_name) + self.assertEqual(ModuleProcesser.api_parent_node, reserved_name) def test_node_hook_backward(self): name_prefix = "backward_layer" hook = self.processor.node_hook(name_prefix, start_or_stop=Const.START) - + module = MagicMock() input = (self.mock_tensor,) module.mindstudio_reserved_name = None ModuleProcesser.module_node[f"forward_layer{Const.SEP}0"] = None hook(module, input) expected_name = f"backward_layer{Const.SEP}0" - self.assertEqual(module.mindstudio_reserved_name, expected_name) + self.assertEqual(module.mindstudio_reserved_name, [expected_name]) self.assertIn(expected_name, ModuleProcesser.module_node) + + def test_has_register_backward_hook(self): + module = MagicMock() + module._backward_hooks = {0: lambda: None} + module._is_full_backward_hook = False + result = self.processor.has_register_backward_hook(module) + self.assertTrue(result) + + module._is_full_backward_hook = True + result = 
self.processor.has_register_backward_hook(module) + self.assertFalse(result) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_pt_kernel_config.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_pt_kernel_config.py new file mode 100644 index 0000000000000000000000000000000000000000..fbeeb07ffc9ac43eedc22ed95d1fa142bb2dd6e4 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/dump/test_pt_kernel_config.py @@ -0,0 +1,53 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest +from unittest.mock import patch + +from msprobe.pytorch.dump.kernel_dump.kernel_config import create_kernel_config_json + + +class TestPtKernelConfig(unittest.TestCase): + @patch("msprobe.pytorch.dump.kernel_dump.kernel_config.save_json") + def test_create_kernel_config_json_with_rank(self, mock_save_json): + dump_path = "./step0" + cur_rank = 0 + kernel_config_path = create_kernel_config_json(dump_path, cur_rank) + self.assertEqual(kernel_config_path, "./step0/kernel_config_0.json") + config_info = { + "dump": { + "dump_list": [], + "dump_path": dump_path, + "dump_mode": "all", + "dump_op_switch": "on" + } + } + mock_save_json.assert_called_once_with(kernel_config_path, config_info, indent=4) + + @patch("msprobe.pytorch.dump.kernel_dump.kernel_config.save_json") + def test_create_kernel_config_json_without_rank(self, mock_save_json): + dump_path = "./step0" + cur_rank = '' + kernel_config_path = create_kernel_config_json(dump_path, cur_rank) + self.assertEqual(kernel_config_path, "./step0/kernel_config.json") + config_info = { + "dump": { + "dump_list": [], + "dump_path": dump_path, + "dump_mode": "all", + "dump_op_switch": "on" + } + } + mock_save_json.assert_called_once_with(kernel_config_path, config_info, indent=4) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/result_handlers/test_calculate_max_ratio.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/result_handlers/test_calculate_max_ratio.py new file mode 100644 index 0000000000000000000000000000000000000000..5734a5b84e6e005564bf6a3208c41fd06ec30767 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/result_handlers/test_calculate_max_ratio.py @@ -0,0 +1,68 @@ +from unittest import TestCase + +import torch +from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig +from msprobe.pytorch.free_benchmark.common.params import HandlerParams +from msprobe.pytorch.free_benchmark.result_handlers.check_handler 
import CheckerHandler + + +class TestFuzzHandler(TestCase): + + def setUp(self) -> None: + self.api_name = "test_api" + self.handler = CheckerHandler(HandlerParams(api_name=self.api_name)) + self.abs_tol = 1e-4 + + def test_calculate_max_ratio_with_equal_outputs(self): + # When the two outputs are equal, the ratio should be close to 1 + origin_output = torch.tensor([1.0, 2.0, 3.0]) + perturbed_output = torch.tensor([1.0, 2.0, 3.0]) + max_ratio = self.handler.calculate_max_ratio( + origin_output, perturbed_output, self.abs_tol + ) + self.assertAlmostEqual(max_ratio, 1.0) + + def test_calculate_max_ratio_with_different_outputs(self): + # When the two outputs differ, the result should be the largest element-wise ratio + origin_output = torch.tensor([1.0, 2.0, 1e-4]) + perturbed_output = torch.tensor([1.3, 2.7, 1e-3]) + max_ratio = self.handler.calculate_max_ratio( + origin_output, perturbed_output, self.abs_tol + ) + self.assertAlmostEqual(max_ratio, 10.0, places=2) + + def test_calculate_max_ratio_with_tol_elements(self): + # Elements whose absolute value is below the tolerance are clamped to the tolerance before the ratio is computed + origin_output = torch.tensor([1.0, 1e-8, 1e-6]) + perturbed_output = torch.tensor([1.0, 1e-4, -1e-8]) + max_ratio = self.handler.calculate_max_ratio( + origin_output, perturbed_output, self.abs_tol + ) + self.assertAlmostEqual(max_ratio, 1.0) + + def test_calculate_max_ratio_with_symbol_flipping(self): + # When the product of the two outputs is negative, SYMBOL_FLIPPING should be returned + origin_output = torch.tensor([1.0, -2.0, 3.0]) + perturbed_output = torch.tensor([1.0, 2.0, 3.0]) + result = self.handler.calculate_max_ratio( + origin_output, perturbed_output, self.abs_tol + ) + self.assertEqual(result, ThresholdConfig.SYMBOL_FLIPPING) + + def test_calculate_max_ratio_with_nan_values(self): + # The ratio should still be computed correctly when NaN values are present + origin_output = torch.tensor([1.0, float("nan"), 2.0]) + perturbed_output = torch.tensor([1.1, float("nan"), 2.4]) + max_ratio = self.handler.calculate_max_ratio( + origin_output, perturbed_output, self.abs_tol + ) + self.assertAlmostEqual(max_ratio, 1.2) + + def test_calculate_max_ratio_with_empty_chunks(self): + # Empty outputs should be handled gracefully + origin_output = 
torch.tensor([]) + perturbed_output = torch.tensor([]) + max_ratio = self.handler.calculate_max_ratio( + origin_output, perturbed_output, self.abs_tol + ) + self.assertEqual(max_ratio, ThresholdConfig.COMP_CONSISTENT) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_register_optimizer_hook.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_register_optimizer_hook.py new file mode 100644 index 0000000000000000000000000000000000000000..dde9f37b2b6c752a18f237db7ce1430881ea80f4 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_register_optimizer_hook.py @@ -0,0 +1,27 @@ +import unittest +from unittest.mock import patch +import torch + +from msprobe.pytorch.hook_module.register_optimizer_hook import register_optimizer_hook + + +class DataCollector: + def __init__(self): + self.optimizer_status = "" + + +class TestRegisterOptimizerHook(unittest.TestCase): + def test_register_optimizer_hook(self): + data_collector = DataCollector() + with patch("torch.nn.utils.clip_grad_norm_") as clip, \ + patch("torch.nn.utils.clip_grad_value_") as clip_value: + clip.return_value = None + clip_value.return_value = None + register_optimizer_hook(data_collector) + + torch.nn.utils.clip_grad_norm_() + self.assertEqual(data_collector.optimizer_status, "end_clip_grad") + + data_collector.optimizer_status = "" + torch.nn.utils.clip_grad_value_() + self.assertEqual(data_collector.optimizer_status, "end_clip_grad") diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_functional.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_functional.py index bd206a0c36745f5c152b83af08909783a728709d..282551e3cefdb2ae63efda284f5e7ae7482ae81c 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_functional.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_functional.py @@ -1,8 +1,9 @@ import unittest import torch import 
torch.nn.functional as F -from msprobe.pytorch.hook_module.wrap_functional import remove_dropout, get_functional_ops, \ +from msprobe.pytorch.hook_module.wrap_functional import get_functional_ops, \ wrap_functional_ops_and_bind, HOOKFunctionalOP +from msprobe.pytorch.common.utils import remove_dropout class TestDropoutFunctions(unittest.TestCase): diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/all_config.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/all_config.json new file mode 100644 index 0000000000000000000000000000000000000000..9c2eb5b43a278a6e2e8104e3d9f8bce912930a0d --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/all_config.json @@ -0,0 +1,13 @@ +{ + "targets": { + "": {} + }, + "param_distribution": true, + "xy_distribution": true, + "mv_distribution": true, + "wg_distribution": true, + "all_xy": true, + "format": "csv", + "ops": ["norm", "nans"], + "step_count_per_record": 3 +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_basic_functions.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/cc_config.json similarity index 31% rename from debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_basic_functions.json rename to debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/cc_config.json index 6ce01d653dcfd288e4955b71aefd609056fd38e9..667f474d51928598c184e78128ca39be015b8c20 100644 --- a/debug/accuracy_tools/msprobe/pytorch/monitor/unittest/config_basic_functions.json +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/cc_config.json @@ -1,17 +1,12 @@ { "targets": { - "fc": {"input": "tuple[1]:0", "output": "tensor", "input_grad": "tuple[1]:0", "output_grad": "tuple[1]:0"} + "": {} }, - "module_ranks": [], - "ur_distribution": true, - "xy_distribution": true, - "mv_distribution": true, - "wg_distribution": true, - "mg_direction": true, "cc_distribution": {"enable":true, 
"cc_codeline":[]}, "alert": { - "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}] + "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}], + "dump": true }, - "eps": 1e-8, - "ops": ["min", "max", "norm", "zeros", "id"] + "format": "csv", + "ops": ["norm"] } \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/mv_config.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/mv_config.json new file mode 100644 index 0000000000000000000000000000000000000000..50c06c88c94170021c59e15bf814afbebaa86d08 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/mv_config.json @@ -0,0 +1,8 @@ +{ + "targets": { + "": {} + }, + "mv_distribution": true, + "format": "csv", + "ops": ["norm"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/struct_config.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/struct_config.json new file mode 100644 index 0000000000000000000000000000000000000000..263d079747f95864b98b9e822de0159ec70f17a9 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/struct_config.json @@ -0,0 +1,9 @@ +{ + "targets": { + }, + "print_struct": true, + "xy_distribution": true, + "all_xy": true, + "format": "csv", + "ops": ["norm"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/ur_config.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/ur_config.json new file mode 100644 index 0000000000000000000000000000000000000000..ee97270a889b8dca6260257b85ed5929911322c7 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/ur_config.json @@ -0,0 +1,8 @@ +{ + "targets": { + "": {} + }, + "ur_distribution": true, + "format": "tensorboard", + "ops": ["norm"] +} \ No newline at end of file diff --git 
a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/wg_config.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/wg_config.json new file mode 100644 index 0000000000000000000000000000000000000000..0b547bb98a96e746d050cd8395696f70e8233899 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/wg_config.json @@ -0,0 +1,8 @@ +{ + "targets": { + "": {} + }, + "wg_distribution": true, + "format": "csv", + "ops": ["norm"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/xy_config.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/xy_config.json new file mode 100644 index 0000000000000000000000000000000000000000..8540929ad2d4163f0064d165b5b9cb2da4261ab0 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/config/xy_config.json @@ -0,0 +1,8 @@ +{ + "targets": { + }, + "xy_distribution": true, + "all_xy": true, + "format": "csv", + "ops": ["norm", "nans"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/demo_model.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/demo_model.py new file mode 100644 index 0000000000000000000000000000000000000000..f5de419440224cca261b62df2495e8ce28b8e2d4 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/demo_model.py @@ -0,0 +1,58 @@ +import torch +import torch.nn.functional as F +from msprobe.pytorch import TrainerMon +from msprobe.pytorch.common import seed_all + +device = torch.device('cpu') +dtype_float32 = torch.float32 +seed_all(mode=True) + + +class Model(torch.nn.Module): + def __init__(self): + super().__init__() + self.fc = torch.nn.Linear(784, 10, dtype=dtype_float32) + self.relu = torch.nn.ReLU() + + def forward(self, x): + return self.relu(self.fc(x).type(dtype_float32)) + + +class ToyDataset(torch.utils.data.Dataset): + def __init__(self): + self.data = torch.randn(16, 784, dtype=dtype_float32, 
requires_grad=True) + self.labels = torch.randint(low=0, high=9, size=(16,)) + + def __len__(self): + return len(self.labels) + + def __getitem__(self, idx): + return self.data[idx].to(device), self.labels[idx].to(device) + + +def monitor_demo(config: str = "./config/monitor_config.json"): + net = Model().to(device=device) + optimizer = torch.optim.Adam(net.parameters(), lr=0.0001) + + hooker = TrainerMon( + config, + params_have_main_grad=False + ) + hooker.set_monitor( + model=net, + grad_acc_steps=1, + optimizer=optimizer + ) + + train_ds = ToyDataset() + train_loader = torch.utils.data.DataLoader(train_ds, shuffle=True, batch_size=10) + + for (inputs, labels) in train_loader: + optimizer.zero_grad() + outputs = net(inputs) + loss = F.cross_entropy(outputs, labels) + + loss.backward() + optimizer.step() + + hooker.summary_writer.close() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_anomaly_analyse.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_anomaly_analyse.py new file mode 100644 index 0000000000000000000000000000000000000000..904be210a3771f1757e4410b5e0fa0f2ad6152f2 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_anomaly_analyse.py @@ -0,0 +1,380 @@ +import os +import unittest +from unittest.mock import patch, MagicMock + +from msprobe.pytorch.monitor.anomaly_detect import GradAnomalyData + +from msprobe.pytorch.monitor.anomaly_analyse import AnomalyDataWriter, AnomalyDataLoader, AnomalyAnalyse, \ + _get_parse_args, _get_step_and_stop, _anomaly_analyse + + +class TestAnomalyDataWriter(unittest.TestCase): + + def test_get_anomaly_dict(self): + # Test the get_anomaly_dict method + anomaly1 = MagicMock() + anomaly1.get_key.return_value = 'anomaly1' + anomaly1.to_dict.return_value = {'value': 1} + + anomaly2 = MagicMock() + anomaly2.get_key.return_value = 'anomaly2' + anomaly2.to_dict.return_value = {'value': 2} + + anomalies = [anomaly1, anomaly2] + result = 
AnomalyDataWriter.get_anomaly_dict(anomalies) + + expected = { + 'anomaly1': {'value': 1}, + 'anomaly2': {'value': 2} + } + self.assertEqual(result, expected) + + @patch('msprobe.pytorch.monitor.anomaly_analyse.os.path.exists') + @patch('msprobe.pytorch.monitor.anomaly_analyse.create_directory') + @patch('msprobe.pytorch.monitor.anomaly_analyse.save_json') + def test_init_detected_json(self, mock_save_json, mock_create_directory, mock_exists): + # Mock the path checks + mock_exists.side_effect = [False, False, False]  # dump_path, dump_rank_dir, json_path + # Simulate the files not existing + mock_create_directory.side_effect = None + + writer = AnomalyDataWriter('/tmp/dump', 0) + writer.init_detected_json() + + # Check that the directories were created + mock_create_directory.assert_any_call('/tmp/dump') + mock_create_directory.assert_any_call('/tmp/dump/rank0') + + # Check that the JSON file was initialized + mock_save_json.assert_called_once_with(writer.json_path, {}, indent=1) + + @patch('msprobe.pytorch.monitor.anomaly_analyse.check_file_or_directory_path') + @patch('msprobe.pytorch.monitor.anomaly_analyse.remove_path') + @patch('msprobe.pytorch.monitor.anomaly_analyse.save_json') + @patch('msprobe.pytorch.monitor.anomaly_analyse.logger') + def test_init_detected_json_existing_file(self, mock_logger, mock_save_json, mock_remove_path, mock_check_path): + # Set up test parameters + dump_path = 'test/dump_path' + rank = 0 + writer = AnomalyDataWriter(dump_path, rank) + + # Simulate the file-exists case + mock_check_path.side_effect = None  # prevent the real call + mock_remove_path.return_value = None  # prevent the real call + + # Simulate json_path existing + writer.json_path = 'existing_file.json' + with patch('os.path.exists', return_value=True): + writer.init_detected_json() + + # Verify the old file was removed and the new file was saved + mock_remove_path.assert_called_once_with(writer.json_path) + mock_logger.warning.assert_called_once_with(f"The existing file will be deleted: {writer.json_path}.") + mock_save_json.assert_called_once_with(writer.json_path, {}, indent=1) + + @patch('msprobe.pytorch.monitor.anomaly_analyse.os.path.exists') + 
@patch('msprobe.pytorch.monitor.anomaly_analyse.load_json') + @patch('msprobe.pytorch.monitor.anomaly_analyse.save_json') + def test_write_detected_json(self, mock_save_json, mock_load_json, mock_exists): + mock_exists.side_effect = [True, True]  # json_path exists + + # Create mocked anomaly data + anomalies = [MagicMock(), MagicMock()] + anomalies[0].get_key.return_value = 'anomaly1' + anomalies[0].to_dict.return_value = {'value': 1} + anomalies[1].get_key.return_value = 'anomaly2' + anomalies[1].to_dict.return_value = {'value': 2} + + mock_load_json.return_value = {'existing_anomaly': {'value': 0}} + + writer = AnomalyDataWriter('/tmp/dump', 0) + writer.write_detected_json(anomalies) + + expected_data = { + 'existing_anomaly': {'value': 0}, + 'anomaly1': {'value': 1}, + 'anomaly2': {'value': 2} + } + + # Check that the JSON was loaded and saved + mock_load_json.assert_called_once_with(writer.json_path) + mock_save_json.assert_called_once_with(writer.json_path, expected_data, indent=1) + + +class TestAnomalyDataLoader(unittest.TestCase): + + @patch('msprobe.pytorch.monitor.anomaly_analyse.GradAnomalyData')  # patch GradAnomalyData at its actual import path + def test_create_instances_from_dict(self, mock_GradAnomalyData): + # Mock the GradAnomalyData constructor + def mock_constructor(**kwargs): + return None + + mock_GradAnomalyData.side_effect = mock_constructor  # assume construction succeeds + + data = { + 'anomaly1': {'key1': 'value1', 'key2': 'value2'}, + 'anomaly2': {'key1': 'value3', 'key2': 'value4'}, + } + + loader = AnomalyDataLoader('/tmp/data') + instances = loader.create_instances_from_dict(data) + + # Ensure both instances were created + self.assertEqual(len(instances), 2) + + @patch('msprobe.pytorch.monitor.anomaly_analyse.os.listdir') + @patch('msprobe.pytorch.monitor.anomaly_analyse.os.path.exists') + @patch('msprobe.pytorch.monitor.anomaly_analyse.load_json') + @patch('msprobe.pytorch.monitor.anomaly_analyse.check_file_or_directory_path') + @patch('msprobe.pytorch.monitor.anomaly_analyse.GradAnomalyData') + def test_get_anomalies_from_jsons(self, 
mock_GradAnomalyData, mock_check_path, mock_load_json, mock_exists, + mock_listdir): + mock_check_path.return_value = None + mock_listdir.return_value = ['rank0', 'rank1'] + + # Simulate rank0/anomaly.json existing and rank1/anomaly.json missing + mock_exists.side_effect = [True, False] + mock_load_json.return_value = { + 'anomaly1': {'key1': 'value1', 'key2': 'value2'}, + 'anomaly2': {'key1': 'value3', 'key2': 'value4'} + } + + # Mock the GradAnomalyData constructor + def mock_constructor(**kwargs): + return None + + mock_GradAnomalyData.side_effect = mock_constructor  # assume construction succeeds + + loader = AnomalyDataLoader('/tmp/data') + with patch('msprobe.pytorch.monitor.anomaly_analyse.os.path.isdir', return_value=True): + anomalies = loader.get_anomalies_from_jsons() + + # Ensure the anomaly data was read from rank0 + self.assertEqual(len(anomalies), 2) + mock_check_path.assert_called_once_with('/tmp/data', isdir=True) + mock_load_json.assert_called_once_with('/tmp/data/rank0/anomaly.json') + + +class TestAnomalyAnalyse(unittest.TestCase): + + def setUp(self): + self.anomaly_analyse = AnomalyAnalyse() + self.anomalies = [ + MagicMock(step=1, value=5), + MagicMock(step=2, value=3), + MagicMock(step=3, value=8), + MagicMock(step=4, value=1), + ] + + def test_get_range_top_k(self): + anomalies = [ + GradAnomalyData(step=1, micro_step=1, vpp_stage=0, pp_stage=0, call_id=0, tag_name=""), + GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name=""), + GradAnomalyData(step=2, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name=""), + GradAnomalyData(step=3, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="") + ] + + # step_list not empty + result = self.anomaly_analyse.get_range_top_k(3, [1], anomalies) + self.assertEqual(len(result), 2) + result = self.anomaly_analyse.get_range_top_k(3, [1, 2], anomalies) + self.assertEqual(len(result), 3) + + # top_k greater than anomalies length + result = self.anomaly_analyse.get_range_top_k(4, [], anomalies) + self.assertEqual(len(result), 4) + + # top_k 
less than anomalies length + result = self.anomaly_analyse.get_range_top_k(3, [], anomalies) + self.assertEqual(len(result), 3) + self.assertEqual(result, [anomalies[1], anomalies[0], anomalies[2]]) + + @patch('msprobe.pytorch.monitor.anomaly_analyse.os.path.exists') + @patch('msprobe.pytorch.monitor.anomaly_analyse.AnomalyDataWriter.get_anomaly_dict') + @patch('msprobe.pytorch.monitor.anomaly_analyse.save_json') + @patch('msprobe.pytorch.monitor.anomaly_analyse.logger') + def test_rewrite_sorted_anomalies(self, mock_logger, mock_save_json, mock_get_anomaly_dict, mock_exists): + # Set up mocks + mock_exists.return_value = False + mock_get_anomaly_dict.return_value = {'anomalies': 'data'} + + output_path = 'output_path' + + # Call the method under test + self.anomaly_analyse.sorted_anomalies = self.anomalies + with patch("msprobe.pytorch.monitor.anomaly_analyse.check_file_or_directory_path", return_value=None): + self.anomaly_analyse.rewrite_sorted_anomalies(output_path) + + # Verify the calls + mock_get_anomaly_dict.assert_called_once_with(self.anomaly_analyse.sorted_anomalies) + mock_save_json.assert_called_once_with( + os.path.join(output_path, 'anomaly_analyse.json'), + {'anomalies': 'data'}, + indent=1 + ) + mock_logger.info.assert_called_once_with("anomaly_analyse.json is at output_path.") + + @patch('msprobe.pytorch.monitor.anomaly_analyse.os.path.exists') + @patch('msprobe.pytorch.monitor.anomaly_analyse.logger') + def test_rewrite_sorted_anomalies_file_exists(self, mock_logger, mock_exists): + # Simulate the case where the file already exists + mock_exists.return_value = True + output_path = 'output_path' + + # Call the method under test + with patch("msprobe.pytorch.monitor.anomaly_analyse.check_file_or_directory_path", return_value=None), \ + patch("msprobe.pytorch.monitor.anomaly_analyse.remove_path", return_value=None), \ + patch("msprobe.pytorch.monitor.anomaly_analyse.save_json", return_value=None): + self.anomaly_analyse.rewrite_sorted_anomalies(output_path) + + # Verify the warning log + mock_logger.warning.assert_called_once_with( + f"The existing file will 
be deleted: output_path/anomaly_analyse.json.") + + +class TestParseArgs(unittest.TestCase): + + @patch('msprobe.pytorch.monitor.anomaly_analyse.sys.argv', + new=['script_name', '-d', 'path/to/data', '-o', 'path/to/output', '-k', '5', '-s', '[1,2,3]']) + def test_parse_args_with_all_arguments(self): + args = _get_parse_args() + self.assertEqual(args.data_path_dir, 'path/to/data') + self.assertEqual(args.out_path, 'path/to/output') + self.assertEqual(args.top_k_number, 5) + self.assertEqual(args.step_list, '[1,2,3]') + + @patch('msprobe.pytorch.monitor.anomaly_analyse.sys.argv', new=['script_name', '-d', 'path/to/data']) + def test_parse_args_with_required_argument_only(self): + args = _get_parse_args() + self.assertEqual(args.data_path_dir, 'path/to/data') + self.assertEqual(args.out_path, '') + self.assertEqual(args.top_k_number, 8)  # default value + self.assertEqual(args.step_list, '[]')  # default value + + @patch('msprobe.pytorch.monitor.anomaly_analyse.sys.argv', new=['script_name', '-d', 'path/to/data', '-k', '10']) + def test_parse_args_with_topk_only(self): + args = _get_parse_args() + self.assertEqual(args.data_path_dir, 'path/to/data') + self.assertEqual(args.out_path, '') + self.assertEqual(args.top_k_number, 10)  # value supplied on the command line + self.assertEqual(args.step_list, '[]')  # default value + + +class TestGetStepAndStop(unittest.TestCase): + + def test_valid_step_list_and_top_k(self): + # Build a valid args object + args = MagicMock() + args.step_list = '[1, 2, 3]' + args.top_k_number = 5 + + step_list, top_k = _get_step_and_stop(args) + + self.assertEqual(step_list, [1, 2, 3]) + self.assertEqual(top_k, 5) + + def test_invalid_step_list(self): + # Build an invalid args object + args = MagicMock() + args.step_list = '[1, 2, 3'  # incomplete list + args.top_k_number = 5 + + with self.assertRaises(Exception) as context: + _get_step_and_stop(args) + + self.assertEqual(str(context.exception), "The step list must be a resolvable list type.") + + def test_non_list_step_list(self): + # Build an invalid args object + args = MagicMock() + args.step_list = 
'not_a_list'  # not a list + args.top_k_number = 5 + + with self.assertRaises(Exception) as context: + _get_step_and_stop(args) + + self.assertEqual(str(context.exception), "The step list must be a resolvable list type.") + + def test_top_k_number_zero(self): + # Build an invalid args object + args = MagicMock() + args.step_list = '[1, 2, 3]' + args.top_k_number = 0  # invalid value + + with self.assertRaises(Exception) as context: + _get_step_and_stop(args) + + self.assertEqual(str(context.exception), "The top k number must be greater than 0.") + + def test_top_k_number_negative(self): + # Build an invalid args object + args = MagicMock() + args.step_list = '[1, 2, 3]' + args.top_k_number = -1  # invalid value + + with self.assertRaises(Exception) as context: + _get_step_and_stop(args) + + self.assertEqual(str(context.exception), "The top k number must be greater than 0.") + + +class TestAnomalyAnalyseFunction(unittest.TestCase): + + @patch('msprobe.pytorch.monitor.anomaly_analyse._get_parse_args')  # mock CLI argument parsing + @patch('msprobe.pytorch.monitor.anomaly_analyse._get_step_and_stop')  # mock step-list and top-k parsing + @patch('msprobe.pytorch.monitor.anomaly_analyse.AnomalyDataLoader')  # mock the data loader + @patch('msprobe.pytorch.monitor.anomaly_analyse.AnomalyAnalyse')  # mock the anomaly analyser + @patch('msprobe.pytorch.monitor.anomaly_analyse.logger')  # mock logging + def test_anomaly_analyse(self, mock_logger, mock_anomaly_analyse, mock_anomaly_data_loader, mock_get_step_and_stop, + mock_get_parse_args): + # Mock the command-line arguments + mock_args = MagicMock() + mock_args.data_path_dir = 'path/to/data' + mock_args.out_path = 'path/to/output' + mock_args.step_list = '[1, 2, 3]' + mock_args.top_k_number = 5 + mock_get_parse_args.return_value = mock_args + + # Mock the parsed step list and top-k number + mock_step_list = [1, 2, 3] + mock_top_k_number = 5 + mock_get_step_and_stop.return_value = (mock_step_list, mock_top_k_number) + + # Mock the data loading + mock_loader_instance = MagicMock() + mock_loader_instance.get_anomalies_from_jsons.return_value = [ + MagicMock(message='Anomaly 1'), + MagicMock(message='Anomaly 2'), + 
MagicMock(message='Anomaly 3') + ] + mock_anomaly_data_loader.return_value = mock_loader_instance + + # Mock the anomaly analysis + mock_analyser_instance = MagicMock() + mock_analyser_instance.get_range_top_k.return_value = [ + MagicMock(message='Top Anomaly 1'), + MagicMock(message='Top Anomaly 2') + ] + mock_anomaly_analyse.return_value = mock_analyser_instance + + # Call the function under test + _anomaly_analyse() + + # Verify the calls + mock_get_parse_args.assert_called_once() + mock_get_step_and_stop.assert_called_once_with(mock_args) + mock_anomaly_data_loader.assert_called_once_with(mock_args.data_path_dir) + mock_loader_instance.get_anomalies_from_jsons.assert_called_once() + mock_analyser_instance.get_range_top_k.assert_called_once_with( + mock_top_k_number, mock_step_list, mock_loader_instance.get_anomalies_from_jsons.return_value + ) + mock_analyser_instance.rewrite_sorted_anomalies.assert_called_once_with(mock_args.out_path) + + # Verify the logging + mock_logger.info.assert_any_call(f"Top {mock_top_k_number} anomalies are listed as follows:") + mock_logger.info.assert_any_call("0: Top Anomaly 1") + mock_logger.info.assert_any_call("1: Top Anomaly 2") + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_anomaly_detect.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_anomaly_detect.py new file mode 100644 index 0000000000000000000000000000000000000000..fa0960e2cc1842a138b47fad3f86c1ed0d089db8 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_anomaly_detect.py @@ -0,0 +1,291 @@ +import unittest +from unittest import TestCase +from unittest.mock import patch + +from msprobe.pytorch.monitor.anomaly_detect import AnomalyTurbulence, AnomalyScanner, \ + AnomalyDataFactory, GradAnomalyData, BaseWriterWithAD, ScanRule, WriterInput + + +class TestScanRule(TestCase): + def test_apply_not_implemented(self): + scan_rule = ScanRule() + with self.assertRaises(Exception) as context: + scan_rule.apply(None, None) + + 
self.assertEqual(str(context.exception), "abstract method apply is not implemented") + + +class TestAnomalyTurbulence(TestCase): + + def setUp(self) -> None: + self.threshold = 0.2 + self.rule = AnomalyTurbulence(self.threshold) + + def test_apply_with_positive_baseline(self): + history = [10, 12, 14] + cur = 16 + result = self.rule.apply(history, cur) + self.assertTrue(result) + + def test_apply_with_non_positive_baseline(self): + history = [0, 0, 0] + cur = -1 + result = self.rule.apply(history, cur) + self.assertTrue(result) + + +class TestAnomalyScanner(TestCase): + + def test_load_rules_with_valid_spec(self): + specs = [ + {"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.2}} + ] + rules = AnomalyScanner.load_rules(specs) + + self.assertEqual(len(rules), 1) + self.assertIsInstance(rules[0], AnomalyTurbulence) + self.assertEqual(rules[0].threshold, 0.2) + + rules = AnomalyScanner.load_rules(None) + self.assertEqual(len(rules), 0) + + @patch("msprobe.pytorch.monitor.anomaly_detect.logger") + def test_load_rules_with_missing_keys(self, mock_logger): + specs = [ + {"rule_name": "AnomalyTurbulence"} + ] + rules = AnomalyScanner.load_rules(specs) + + self.assertEqual(len(rules), 0) + mock_logger.warning.assert_called_once_with(f"Spec is missing required keys: {specs[0]}") + + def test_load_rules_with_invalid_rule(self): + # test invalid rule_name + specs = [{"rule_name": "InvalidRule", "args": {"threshold": 0.2}}] + rules = AnomalyScanner.load_rules(specs) + self.assertEqual(len(rules), 0) + + # test invalid args + specs = [{"rule_name": "AnomalyTurbulence", "args": "invalid args"}] + rules = AnomalyScanner.load_rules(specs) + self.assertEqual(len(rules), 0) + + def test_scan(self): + ad_rules = [AnomalyTurbulence(0.2)] + # test scan with anomaly + expected = True, "AnomalyTurbulence" + self.assertEqual(AnomalyScanner.scan(ad_rules, 1.0, 2.0), expected) + # test scan with no anomaly + expected = False, None + self.assertEqual(AnomalyScanner.scan(ad_rules, 
1.0, 1.0), expected) + + +class TestAnomalyDataFactory(TestCase): + + def setUp(self) -> None: + rank = 0 + pp_stage = 0 + group_mates = [0] + self.AnomalyDataFactory = AnomalyDataFactory(rank, pp_stage, group_mates) + + def test_set_call_id(self): + name2callid = {'param_name': 0} + self.AnomalyDataFactory.set_call_id(name2callid) + + self.assertEqual(self.AnomalyDataFactory.name2callid, {'param_name': 0}) + + def test_create_success(self): + tag = ('0:1.self_attention.core_attention_flash_0/rank0/output', 'min') + message = "Rule AnomalyTurbulence reports anomaly signal in ('0:1.self_attention.core_attention_flash_0/rank0/output', 'min') at step 2." + step = 2 + result = self.AnomalyDataFactory.create(tag, message, step) + + self.assertEqual(result.step, step) + self.assertEqual(result.tag_name, tag[0]) + self.assertEqual(result.message, message) + self.assertEqual(result.vpp_stage, 0) + + # test no vpp_stage + tag = ('1.self_attention.core_attention_flash_0/rank0/output', 'min') + result = self.AnomalyDataFactory.create(tag, message, step) + self.assertEqual(result.vpp_stage, 0) + + def test_create_failed(self): + error_tag = '0:1.self_attention.core_attention_flash_0/rank0/output' + message = "Rule AnomalyTurbulence reports anomaly signal in ('0:1.self_attention.core_attention_flash_0/rank0/output', 'min') at step 2." + step = 2 + with self.assertRaises(Exception) as context: + self.AnomalyDataFactory.create(error_tag, message, step) + self.assertEqual(str(context.exception), "tag must be a tuple with length 2") + + +class TestGradAnomalyData(TestCase): + + def setUp(self) -> None: + tag_name = "0:1.self_attention.core_attention_flash.output:0/rank0/actv" + message = "Rule AnomalyTurbulence reports anomaly signal in ('0:1.self_attention.core_attention_flash.output:0/rank0/actv', 'min') at step 2." 
+ group_mates = [0] + self.GradAnomalyData = GradAnomalyData(tag_name=tag_name, message=message, group_mates=group_mates) + + def test_get_train_stage(self): + tag_name_list = ["0:fc2.input:0/rank0/actv", "0:fc1.weight/rank0/post_grad", "0:fc2.weight/rank0/exp_avg_sq", ""] + expected_train_stage_list = [0, 1, 2, -1] + for tag_name, expected_train_stage in zip(tag_name_list, expected_train_stage_list): + train_stage = GradAnomalyData.get_train_stage(tag_name) + self.assertEqual(train_stage, expected_train_stage) + + def test_to_dict(self): + expected = { + 'rank': 0, + 'step': 0, + 'micro_step': 0, + 'pp_stage': 0, + 'vpp_stage': 0, + 'call_id': 0, + 'tag_name': "0:1.self_attention.core_attention_flash.output:0/rank0/actv", + 'message': "Rule AnomalyTurbulence reports anomaly signal in ('0:1.self_attention.core_attention_flash.output:0/rank0/actv', 'min') at step 2.", + 'group_mates': [0] + } + + self.assertEqual(self.GradAnomalyData.to_dict(), expected) + + def test_get_key(self): + expected = "0:1.self_attention.core_attention_flash.output:0/rank0/actv_step_0_call_0" + + self.assertEqual(self.GradAnomalyData.get_key(), expected) + + def test_lt_different_step(self): + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="") + data2 = GradAnomalyData(step=2, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="") + self.assertLess(data1, data2) + self.assertGreater(data2, data1) + + def test_lt_same_step_different_micro_step(self): + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="") + data2 = GradAnomalyData(step=1, micro_step=1, vpp_stage=0, pp_stage=0, call_id=0, tag_name="") + self.assertLess(data1, data2) + self.assertGreater(data2, data1) + + def test_lt_same_step_same_micro_step_different_vpp_stage(self): + # same forward + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="xxx/actv") + data2 = GradAnomalyData(step=1, 
micro_step=0, vpp_stage=1, pp_stage=0, call_id=0, tag_name="xxx/actv") + self.assertLess(data1, data2) + self.assertGreater(data2, data1) + + # same backward + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="xxx/post_grad") + data2 = GradAnomalyData(step=1, micro_step=0, vpp_stage=1, pp_stage=0, call_id=0, tag_name="xxx/post_grad") + self.assertLess(data2, data1) + self.assertGreater(data1, data2) + + # diff train stage + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="xxx/actv") + data2 = GradAnomalyData(step=1, micro_step=0, vpp_stage=1, pp_stage=0, call_id=0, tag_name="xxx/post_grad") + self.assertLess(data1, data2) + self.assertGreater(data2, data1) + + def test_lt_same_step_same_micro_step_same_vpp_stage_different_pp_stage(self): + # same forward + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="xxx/actv") + data2 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=1, call_id=0, tag_name="xxx/actv") + self.assertLess(data1, data2) + self.assertGreater(data2, data1) + + # same backward + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="xxx/post_grad") + data2 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=1, call_id=0, tag_name="xxx/post_grad") + self.assertLess(data2, data1) + self.assertGreater(data1, data2) + + # diff train stage + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="xxx/input") + data2 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=1, call_id=0, tag_name="xxx/post_grad") + self.assertLess(data1, data2) + self.assertGreater(data2, data1) + + def test_lt_same_step_same_micro_step_same_vpp_stage_same_pp_stage_different_call_id(self): + data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="") + data2 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, 
pp_stage=0, call_id=1, tag_name="")
+        self.assertLess(data1, data2)
+        self.assertGreater(data2, data1)
+
+    def test_lt_same_data(self):
+        data1 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="")
+        data2 = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0, tag_name="")
+        self.assertGreaterEqual(data1, data2)
+        self.assertLessEqual(data1, data2)
+
+    def test_lt_not_instance(self):
+        data = GradAnomalyData(step=1, micro_step=0, vpp_stage=0, pp_stage=0, call_id=0)
+        not_instance = "not an instance of GradAnomalyData"
+        self.assertEqual(data.__lt__(not_instance), NotImplemented)
+
+    def test_le_same_instance(self):
+        # test comparing an instance with itself
+        data1 = GradAnomalyData()
+        self.assertTrue(data1 <= data1)
+
+    def test_le_different_instance(self):
+        # test two different instances
+        data1 = GradAnomalyData()
+        data2 = GradAnomalyData()
+        self.assertTrue(data1 <= data2)
+
+    def test_le_not_instance(self):
+        # test an operand that is not a GradAnomalyData instance
+        data = GradAnomalyData()
+        not_instance = "Not an instance of GradAnomalyData"
+        self.assertEqual(data.__le__(not_instance), NotImplemented)
+
+    def test_le_different_instance_not_equal(self):
+        # test different instances that are not equal
+        data1 = GradAnomalyData()
+        data2 = GradAnomalyData()
+        data2.some_attribute = "some value"
+        self.assertTrue(data1 <= data2)
+
+
+class TestBaseWriterWithAD(TestCase):
+
+    def setUp(self) -> None:
+        self.BaseWriter = BaseWriterWithAD(WriterInput('', None, None))
+
+    def test_get_anomalies(self):
+        expected = []
+
+        self.assertEqual(self.BaseWriter.get_anomalies(), expected)
+
+    def test_clear_anomalies(self):
+        self.BaseWriter.anomalies = ['anomaly1', 'anomaly2']
+        self.BaseWriter.clear_anomalies()
+
+        self.assertEqual(self.BaseWriter.anomalies, [])
+
+    @patch("msprobe.pytorch.monitor.anomaly_detect.logger")
+    def test_add_scalar(self, mock_logger):
+        AnomalyTurbulence_obj = AnomalyTurbulence(0.2)
+        self.BaseWriter.ad_rules = [AnomalyTurbulence_obj]
+        self.BaseWriter.tag2scalars = {'tag': {'avg': 1.0, 
'count': 1}} + self.BaseWriter.add_scalar('tag', 2.0) + + mock_logger.info.assert_called_once() + + def test_ad(self): + AnomalyTurbulence_obj = AnomalyTurbulence(0.2) + self.BaseWriter.ad_rules = [AnomalyTurbulence_obj] + expected = True, "AnomalyTurbulence" + + self.assertEqual(self.BaseWriter._ad(2.0, 1.0), expected) + + def test_update_tag2scalars(self): + self.BaseWriter._update_tag2scalars('tag1', 1.0) + self.assertEqual(self.BaseWriter.tag2scalars['tag1']['avg'], 1.0) + self.assertEqual(self.BaseWriter.tag2scalars['tag1']['count'], 1) + self.BaseWriter._update_tag2scalars('tag1', 2.0) + self.assertEqual(self.BaseWriter.tag2scalars['tag1']['avg'], 1.5) + self.assertEqual(self.BaseWriter.tag2scalars['tag1']['count'], 2) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_csv2tb.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_csv2tb.py new file mode 100644 index 0000000000000000000000000000000000000000..f2bc82ffafc2a1f10719d4a46669bc0050c12782 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_csv2tb.py @@ -0,0 +1,751 @@ +import os +import shutil +import random +import unittest +import pytest +import torch +import numpy as np +import torch.nn as nn +from tensorboard.backend.event_processing.event_accumulator import EventAccumulator + +from msprobe.pytorch import TrainerMon +from msprobe.core.common.const import MonitorConst +from msprobe.pytorch.monitor.csv2tb import parse_step_fn, csv2tensorboard_by_step + + +base_dir = os.path.dirname(os.path.realpath(__file__)) +config_json_path = os.path.join(base_dir, "config", "all_config.json") +monitor_output = os.path.join(base_dir, "./monitor_output_csv2tb") +os.environ[MonitorConst.MONITOR_OUTPUT_DIR] = monitor_output +timestamp_dirpath = None +csv2tb_dirpath = None + + +def seed_all(seed=1234, mode=False): + random.seed(seed) + os.environ['PYTHONHASHSEED'] = str(seed) + np.random.seed(seed) + 
torch.manual_seed(seed)
+    torch.use_deterministic_algorithms(mode)
+
+seed_all()
+
+
+inputs = [torch.rand(10, 10) for _ in range(10)]
+labels = [torch.randint(0, 5, (10,)) for _ in range(10)]
+
+
+class MockModule(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear = nn.Linear(10, 5)
+        self.relu = nn.ReLU()
+
+    def forward(self, x):
+        x1 = self.linear(x)
+        x2 = self.relu(x1)
+        return x2
+
+
+def data_collect():
+    loss_fun = nn.CrossEntropyLoss()
+    test_module = MockModule()
+    nn.init.constant_(test_module.linear.weight, 1.0)
+    nn.init.constant_(test_module.linear.bias, 1.0)
+    optimizer = torch.optim.Adam(test_module.parameters())
+
+    monitor = TrainerMon(config_json_path, params_have_main_grad=False)
+    monitor.set_monitor(test_module, grad_acc_steps=1, optimizer=optimizer)
+
+    for input_data, label in zip(inputs, labels):
+        output = test_module(input_data)
+        loss = loss_fun(output, label)
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+
+    global timestamp_dirpath, csv2tb_dirpath
+    timestamp_dirpath = os.path.join(monitor_output, os.listdir(monitor_output)[0])
+    csv2tensorboard_by_step(monitor_output)
+    for dirname in os.listdir(monitor_output):
+        if "csv2tensorboard" in dirname:
+            csv2tb_dirpath = os.path.join(monitor_output, dirname, "rank0")
+
+
+def extract_scalars_from_tensorboard(log_dir):
+    # initialize the EventAccumulator
+    event_acc = EventAccumulator(log_dir)
+    event_acc.Reload()  # load the event data
+
+    # fetch all scalar tags
+    scalar_tags = event_acc.Tags()['scalars']
+
+    # build a dict keyed by tag, mapping to its list of (step, value) pairs
+    scalars_dict = {}
+    for tag in scalar_tags:
+        scalar_events = event_acc.Scalars(tag)
+        scalars_dict[tag] = [(event.step, event.value) for event in scalar_events]
+
+    return scalars_dict
+
+
+def dict_equal(a, b):
+    if not isinstance(a, dict) or not isinstance(b, dict):
+        if np.isnan(a) and np.isnan(b):
+            return True
+        return a == b
+
+    if set(a.keys()) != set(b.keys()):
+        return False
+
+    for key in a:
+        if not dict_equal(a[key], b[key]):
+            return False
+
+    return True
+
+
+def compare_scalar_dicts(dict1, dict2):
+    if set(dict1.keys()) != set(dict2.keys()):
+        return False
+
+    for key in dict1:
+        list1 = dict1[key]
+        list2 = dict2[key]
+
+        if len(list1) != len(list2):
+            return False
+
+        # compare each (step, value) pair
+        for (step1, value1), (step2, value2) in zip(list1, list2):
+            if step1 != step2:
+                return False
+
+            if not (value1 == value2 or (np.isnan(value1) and np.isnan(value2))):
+                return False
+    return True
+
+
+@pytest.fixture(scope="session")
+def setup_all():
+    data_collect()
+    yield
+    shutil.rmtree(monitor_output)
+
+@pytest.mark.usefixtures("setup_all")
+class TestGradMonitor(unittest.TestCase):
+
+    def setUp(self):
+        self.maxDiff = None
+
+    def test_actv(self):
+        data = parse_step_fn(os.path.join(timestamp_dirpath,"actv_0-2.csv"))
+        result = {
+            'vp0:.input:micro0': {
+                0: {'nans': 0.0,'norm': 5.550016},
+                1: {'nans': 0.0,'norm': 5.975112},
+                2: {'nans': 0.0,'norm': 5.789881}
+            },
+            'vp0:.output:micro0': {
+                0: {'nans': 0.0,'norm': 41.842655},
+                1: {'nans': 0.0,'norm': 44.40981},
+                2: {'nans': 0.0,'norm': 43.578354}
+            },
+            'vp0:linear.input:micro0': {
+                0: {'nans': 0.0,'norm': 5.550016},
+                1: {'nans': 0.0,'norm': 5.975112},
+                2: {'nans': 0.0,'norm': 5.789881}
+            },
+            'vp0:linear.output:micro0': {
+                0: {'nans': 0.0,'norm': 41.842655},
+                1: {'nans': 0.0,'norm': 44.40981},
+                2: {'nans': 0.0,'norm': 43.578354}
+            },
+            'vp0:relu.input:micro0': {
+                0: {'nans': 0.0,'norm': 41.842655},
+                1: {'nans': 0.0,'norm': 44.40981},
+                2: {'nans': 0.0,'norm': 43.578354}
+            },
+            'vp0:relu.output:micro0': {
+                0: {'nans': 0.0,'norm': 41.842655},
+                1: {'nans': 0.0,'norm': 44.40981},
+                2: {'nans': 0.0,'norm': 43.578354}
+            }
+        }
+        self.assertEqual(dict_equal(data, result), True)
+        tb_data = extract_scalars_from_tensorboard(os.path.join(csv2tb_dirpath, "actv"))
+        tb_result = {
+            'vp0:.input:micro0/nans': [(0, 0.0),
+                                       (1, 0.0),
+                                       (2, 0.0),
+                                       (3, 0.0),
+                                       (4, 0.0),
+                                       (5, 0.0),
+                                       (6, 0.0),
+                                       (7, 0.0),
+                                       (8, 0.0),
+                                       
(9, 0.0)], + 'vp0:.input:micro0/norm': [(0, 5.550015926361084), + (1, 5.975111961364746), + (2, 5.789881229400635), + (3, 6.052319049835205), + (4, 5.573315143585205), + (5, 5.864360809326172), + (6, 5.292460918426514), + (7, 5.477899074554443), + (8, 5.884613990783691), + (9, 5.456457138061523)], + 'vp0:.output:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:.output:micro0/norm': [(0, 41.842655181884766), + (1, 44.40980911254883), + (2, 43.57835388183594), + (3, 45.83631134033203), + (4, 42.0673828125), + (5, 43.46839141845703), + (6, 39.77947235107422), + (7, 40.200843811035156), + (8, 44.453147888183594), + (9, 40.841522216796875)], + 'vp0:linear.input:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.input:micro0/norm': [(0, 5.550015926361084), + (1, 5.975111961364746), + (2, 5.789881229400635), + (3, 6.052319049835205), + (4, 5.573315143585205), + (5, 5.864360809326172), + (6, 5.292460918426514), + (7, 5.477899074554443), + (8, 5.884613990783691), + (9, 5.456457138061523)], + 'vp0:linear.output:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.output:micro0/norm': [(0, 41.842655181884766), + (1, 44.40980911254883), + (2, 43.57835388183594), + (3, 45.83631134033203), + (4, 42.0673828125), + (5, 43.46839141845703), + (6, 39.77947235107422), + (7, 40.200843811035156), + (8, 44.453147888183594), + (9, 40.841522216796875)], + 'vp0:relu.input:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:relu.input:micro0/norm': [(0, 41.842655181884766), + (1, 44.40980911254883), + (2, 43.57835388183594), + (3, 45.83631134033203), + (4, 42.0673828125), + (5, 43.46839141845703), + (6, 39.77947235107422), + 
(7, 40.200843811035156), + (8, 44.453147888183594), + (9, 40.841522216796875)], + 'vp0:relu.output:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:relu.output:micro0/norm': [(0, 41.842655181884766), + (1, 44.40980911254883), + (2, 43.57835388183594), + (3, 45.83631134033203), + (4, 42.0673828125), + (5, 43.46839141845703), + (6, 39.77947235107422), + (7, 40.200843811035156), + (8, 44.453147888183594), + (9, 40.841522216796875)]} + self.assertEqual(compare_scalar_dicts(tb_data, tb_result), True) + + + def test_actv_grad(self): + data = parse_step_fn(os.path.join(timestamp_dirpath,"actv_grad_0-2.csv")) + nan = np.nan + result = { + 'vp0:.input:micro0': { + 0: {'norm': nan, 'nans': nan}, + 1: {'norm': nan, 'nans': nan}, + 2: {'norm': nan, 'nans': nan} + }, + 'vp0:.output:micro0': { + 0: {'norm': 0.282843, 'nans': 0.0}, + 1: {'norm': 0.282617, 'nans': 0.0}, + 2: {'norm': 0.282655, 'nans': 0.0} + }, + 'vp0:relu.input:micro0': { + 0: {'norm': 0.282843, 'nans': 0.0}, + 1: {'norm': 0.282617, 'nans': 0.0}, + 2: {'norm': 0.282655, 'nans': 0.0} + }, + 'vp0:relu.output:micro0': { + 0: {'norm': 0.282843, 'nans': 0.0}, + 1: {'norm': 0.282617, 'nans': 0.0}, + 2: {'norm': 0.282655, 'nans': 0.0} + }, + 'vp0:linear.input:micro0': { + 0: {'norm': nan, 'nans': nan}, + 1: {'norm': nan, 'nans': nan}, + 2: {'norm': nan, 'nans': nan} + }, + 'vp0:linear.output:micro0': { + 0: {'norm': 0.282843, 'nans': 0.0}, + 1: {'norm': 0.282617, 'nans': 0.0}, + 2: {'norm': 0.282655, 'nans': 0.0} + } + } + self.assertEqual(dict_equal(data, result), True) + + tb_data = extract_scalars_from_tensorboard(os.path.join(csv2tb_dirpath, "actv_grad")) + tb_result = { + 'vp0:.input:micro0/nans': [(0, nan), + (1, nan), + (2, nan), + (3, nan), + (4, nan), + (5, nan), + (6, nan), + (7, nan), + (8, nan), + (9, nan)], + 'vp0:.input:micro0/norm': [(0, nan), + (1, nan), + (2, nan), + (3, nan), + (4, nan), + (5, nan), + (6, nan), 
+ (7, nan), + (8, nan), + (9, nan)], + 'vp0:.output:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:.output:micro0/norm': [(0, 0.2828429937362671), + (1, 0.2826170027256012), + (2, 0.2826550006866455), + (3, 0.2828519940376282), + (4, 0.2822929918766022), + (5, 0.2826640009880066), + (6, 0.28316599130630493), + (7, 0.28274500370025635), + (8, 0.2833530008792877), + (9, 0.2825529873371124)], + 'vp0:linear.input:micro0/nans': [(0, nan), + (1, nan), + (2, nan), + (3, nan), + (4, nan), + (5, nan), + (6, nan), + (7, nan), + (8, nan), + (9, nan)], + 'vp0:linear.input:micro0/norm': [(0, nan), + (1, nan), + (2, nan), + (3, nan), + (4, nan), + (5, nan), + (6, nan), + (7, nan), + (8, nan), + (9, nan)], + 'vp0:linear.output:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.output:micro0/norm': [(0, 0.2828429937362671), + (1, 0.2826170027256012), + (2, 0.2826550006866455), + (3, 0.2828519940376282), + (4, 0.2822929918766022), + (5, 0.2826640009880066), + (6, 0.28316599130630493), + (7, 0.28274500370025635), + (8, 0.2833530008792877), + (9, 0.2825529873371124)], + 'vp0:relu.input:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:relu.input:micro0/norm': [(0, 0.2828429937362671), + (1, 0.2826170027256012), + (2, 0.2826550006866455), + (3, 0.2828519940376282), + (4, 0.2822929918766022), + (5, 0.2826640009880066), + (6, 0.28316599130630493), + (7, 0.28274500370025635), + (8, 0.2833530008792877), + (9, 0.2825529873371124)], + 'vp0:relu.output:micro0/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:relu.output:micro0/norm': [(0, 0.2828429937362671), + (1, 0.2826170027256012), + (2, 0.2826550006866455), + (3, 
0.2828519940376282), + (4, 0.2822929918766022), + (5, 0.2826640009880066), + (6, 0.28316599130630493), + (7, 0.28274500370025635), + (8, 0.2833530008792877), + (9, 0.2825529873371124)]} + self.assertEqual(compare_scalar_dicts(tb_data, tb_result), True) + + + def test_param(self): + data = parse_step_fn(os.path.join(timestamp_dirpath,"param_0-2.csv")) + result = { + 'vp0:linear.bias': { + 0: {'nans': 0.0, 'norm': 2.236068}, + 1: {'nans': 0.0, 'norm': 2.236198}, + 2: {'nans': 0.0, 'norm': 2.235769} + }, + 'vp0:linear.weight': { + 0: {'nans': 0.0, 'norm': 7.071068}, + 1: {'nans': 0.0, 'norm': 7.068808}, + 2: {'nans': 0.0, 'norm': 7.06771} + } + } + self.assertEqual(dict_equal(data, result), True) + tb_data = extract_scalars_from_tensorboard(os.path.join(csv2tb_dirpath, "param")) + tb_result = { + 'vp0:linear.weight/norm': [ + (0, 7.071067810058594), + (1, 7.068808078765869), + (2, 7.067709922790527), + (3, 7.0673418045043945), + (4, 7.066926956176758), + (5, 7.066311836242676), + (6, 7.065629959106445), + (7, 7.065262794494629), + (8, 7.065001964569092), + (9, 7.064840793609619)], + 'vp0:linear.weight/nans': [ + (0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.bias/norm': [ + (0, 2.2360680103302), + (1, 2.2361979484558105), + (2, 2.235769033432007), + (3, 2.235903024673462), + (4, 2.2360129356384277), + (5, 2.2359039783477783), + (6, 2.2357990741729736), + (7, 2.2357349395751953), + (8, 2.2356700897216797), + (9, 2.235619068145752)], + 'vp0:linear.bias/nans': [ + (0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)] + } + self.assertEqual(compare_scalar_dicts(tb_data, tb_result), True) + + def test_exp_avg(self): + data = parse_step_fn(os.path.join(timestamp_dirpath,"exp_avg_0-2.csv")) + result = { + 'vp0:linear.bias': { + 1: {'nans': 0.0, 'norm': 0.024495}, + 2: {'nans': 0.0, 'norm': 0.052203} + }, + 'vp0:linear.weight': 
{ + 1: {'nans': 0.0, 'norm': 0.052394}, + 2: {'nans': 0.0, 'norm': 0.099221} + } + } + self.assertEqual(dict_equal(data, result), True) + tb_data = extract_scalars_from_tensorboard(os.path.join(csv2tb_dirpath, "exp_avg")) + tb_result = { + 'vp0:linear.bias/nans': [(1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.bias/norm': [(1, 0.024495000019669533), + (2, 0.05220299959182739), + (3, 0.06452500075101852), + (4, 0.05751600116491318), + (5, 0.07189200073480606), + (6, 0.07151799649000168), + (7, 0.053112998604774475), + (8, 0.06187799945473671), + (9, 0.04195199906826019)], + 'vp0:linear.weight/nans': [(1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.weight/norm': [(1, 0.05239399895071983), + (2, 0.09922099858522415), + (3, 0.12258800119161606), + (4, 0.11325100064277649), + (5, 0.14186500012874603), + (6, 0.14408400654792786), + (7, 0.11372199654579163), + (8, 0.12264800071716309), + (9, 0.09017200022935867)]} + self.assertEqual(compare_scalar_dicts(tb_data, tb_result), True) + + def test_exp_avg_sq(self): + data = parse_step_fn(os.path.join(timestamp_dirpath,"exp_avg_sq_0-2.csv")) + result = { + 'vp0:linear.bias': { + 1: {'nans': 0.0, 'norm': 4.2e-05}, + 2: {'nans': 0.0, 'norm': 9.6e-05} + }, + 'vp0:linear.weight': { + 1: {'nans': 0.0, 'norm': 6.7e-05}, + 2: {'nans': 0.0, 'norm': 0.000126} + } + } + self.assertEqual(dict_equal(data, result), True) + tb_data = extract_scalars_from_tensorboard(os.path.join(csv2tb_dirpath, "exp_avg_sq")) + tb_result = { + 'vp0:linear.bias/nans': [(1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.bias/norm': [(1, 4.199999966658652e-05), + (2, 9.600000339560211e-05), + (3, 0.00013099999341648072), + (4, 0.00013099999341648072), + (5, 0.00016500000492669642), + (6, 0.0001900000061141327), + (7, 0.00020199999562464654), + (8, 
0.00022899999748915434), + (9, 0.00024300000222865492)], + 'vp0:linear.weight/nans': [(1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.weight/norm': [(1, 6.70000008540228e-05), + (2, 0.00012599999899975955), + (3, 0.00015799999528098851), + (4, 0.00016599999798927456), + (5, 0.00021399999968707561), + (6, 0.00024199999461416155), + (7, 0.00026000000070780516), + (8, 0.00028700000257231295), + (9, 0.0003060000017285347)]} + self.assertEqual(compare_scalar_dicts(tb_data, tb_result), True) + + def test_grad_reduced(self): + data = parse_step_fn(os.path.join(timestamp_dirpath,"grad_reduced_0-2.csv")) + result = { + 'vp0:linear.bias': { + 0: {'nans': 0.0, 'norm': 0.244949}, + 1: {'nans': 0.0, 'norm': 0.314345}, + 2: {'nans': 0.0, 'norm': 0.281475} + }, + 'vp0:linear.weight': { + 0: {'nans': 0.0, 'norm': 0.523935}, + 1: {'nans': 0.0, 'norm': 0.595672}, + 2: {'nans': 0.0, 'norm': 0.497603} + } + } + self.assertEqual(dict_equal(data, result), True) + tb_data = extract_scalars_from_tensorboard(os.path.join(csv2tb_dirpath, "grad_reduced")) + tb_result = { + 'vp0:linear.bias/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.bias/norm': [(0, 0.24494899809360504), + (1, 0.31434500217437744), + (2, 0.2814750075340271), + (3, 0.006068999879062176), + (4, 0.2398650050163269), + (5, 0.2817699909210205), + (6, 0.1456969976425171), + (7, 0.2817710041999817), + (8, 0.15226399898529053), + (9, 0.1355219930410385)], + 'vp0:linear.weight/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.weight/norm': [(0, 0.5239350199699402), + (1, 0.5956720113754272), + (2, 0.49760299921035767), + (3, 0.23948900401592255), + (4, 0.5050320029258728), + (5, 0.5136330127716064), + (6, 0.3642309904098511), + (7, 0.4831080138683319), + (8, 
0.3234719932079315), + (9, 0.32385098934173584)]} + self.assertEqual(compare_scalar_dicts(tb_data, tb_result), True) + + def test_grad_unreduced(self): + data = parse_step_fn(os.path.join(timestamp_dirpath,"grad_unreduced_0-2.csv")) + result = { + 'vp0:linear.bias': { + 0: {'nans': 0.0, 'norm': 0.244949}, + 1: {'nans': 0.0, 'norm': 0.314345}, + 2: {'nans': 0.0, 'norm': 0.281475} + }, + 'vp0:linear.weight': { + 0: {'nans': 0.0, 'norm': 0.523935}, + 1: {'nans': 0.0, 'norm': 0.595672}, + 2: {'nans': 0.0, 'norm': 0.497603} + } + } + self.assertEqual(dict_equal(data, result), True) + + tb_data = extract_scalars_from_tensorboard(os.path.join(csv2tb_dirpath, "grad_unreduced")) + tb_result = { + 'vp0:linear.bias/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.bias/norm': [(0, 0.24494899809360504), + (1, 0.31434500217437744), + (2, 0.2814750075340271), + (3, 0.006068999879062176), + (4, 0.2398650050163269), + (5, 0.2817699909210205), + (6, 0.1456969976425171), + (7, 0.2817710041999817), + (8, 0.15226399898529053), + (9, 0.1355219930410385)], + 'vp0:linear.weight/nans': [(0, 0.0), + (1, 0.0), + (2, 0.0), + (3, 0.0), + (4, 0.0), + (5, 0.0), + (6, 0.0), + (7, 0.0), + (8, 0.0), + (9, 0.0)], + 'vp0:linear.weight/norm': [(0, 0.5239350199699402), + (1, 0.5956720113754272), + (2, 0.49760299921035767), + (3, 0.23948900401592255), + (4, 0.5050320029258728), + (5, 0.5136330127716064), + (6, 0.3642309904098511), + (7, 0.4831080138683319), + (8, 0.3234719932079315), + (9, 0.32385098934173584)]} + self.assertEqual(compare_scalar_dicts(tb_data, tb_result), True) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_features.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_features.py new file mode 100644 index 0000000000000000000000000000000000000000..ff00cf7490d8110f2198df57ee5d91b6b75f5092 --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_features.py @@ -0,0 +1,92 @@ +import unittest +import torch +from msprobe.pytorch.monitor.features import square_sum, get_min, get_mean, get_norm, get_max, get_zeros, \ + get_sign_matches, eff_rank, mNTK, lambda_max_subsample, cal_histc, get_nans + + +class TestMathFunctions(unittest.TestCase): + def test_square_sum(self): + tensor = torch.tensor([1.0, 2.0, 3.0]) + result = square_sum(tensor) + self.assertEqual(result, 14.0) + + def test_get_min(self): + tensor = torch.tensor([1.0, 2.0, 3.0]) + result = get_min(tensor) + self.assertEqual(result, 1.0) + + def test_get_mean(self): + tensor = torch.tensor([1.0, 2.0, 3.0]) + result = get_mean(tensor) + self.assertAlmostEqual(result, 2.0, places=1) + + def test_get_norm(self): + tensor = torch.tensor([1.0, 2.0, 3.0]) + result = get_norm(tensor) + self.assertTrue(torch.allclose(result, torch.tensor(3.7417, dtype=torch.float64), atol=1e-4)) + + def test_get_max(self): + tensor = torch.tensor([1.0, 2.0, 3.0]) + result = get_max(tensor) + self.assertEqual(result, 3.0) + + def test_get_zeros(self): + tensor = torch.tensor([1e-10, 2e-10, 3e-10]) + result = get_zeros(tensor, eps=1e-10) + res = torch.allclose(result, torch.tensor(0.), atol=1e-1) + self.assertTrue(res) + + def test_get_sign_matches(self): + tensor_x = torch.tensor([1.0, -1.0, 1.0]) + tensor_y = torch.tensor([1.0, 1.0, -1.0]) + result = get_sign_matches(tensor_x, tensor_y) + res = torch.allclose(result, torch.tensor(0.3333), atol=1e-4) + self.assertTrue(res) + + def test_eff_rank(self): + tensor = torch.tensor([[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]]) + result = eff_rank(tensor) + res = torch.allclose(result, torch.tensor(2), atol=1e-1) + self.assertTrue(res) + + def test_mNTK(self): + class MockModule(torch.nn.Module): + def __init__(self): + super(MockModule, self).__init__() + + def forward(self, x): + return x + 1 + + module = MockModule() + tensor = torch.tensor([1.0]) + result = 
mNTK(module, tensor) + res = torch.allclose(result, torch.tensor([[1.]]), atol=1e-1) + self.assertTrue(res) + + def test_lambda_max_subsample(self): + class MockModule(torch.nn.Module): + def __init__(self): + super(MockModule, self).__init__() + + def forward(self, x): + return x + + module = MockModule() + tensor = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]) + result = lambda_max_subsample(module, tensor) + res = torch.allclose(result, torch.tensor(1.0), atol=1e-1) + self.assertTrue(res) + + def test_cal_histc(self): + tensor = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0]) + result = cal_histc(tensor, bins_total=3, min_val=1.0, max_val=5.0) + self.assertEqual(result.size(), (3,)) + + def test_get_nans(self): + tensor = torch.tensor([1.0, float('nan'), 3.0]) + result = get_nans(tensor) + self.assertEqual(result, 1) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_module_hook.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_module_hook.py new file mode 100644 index 0000000000000000000000000000000000000000..eefacb73c8e76636086554775b0e6f2e916ddf6e --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_module_hook.py @@ -0,0 +1,319 @@ +import os.path +import shutil +import unittest +from unittest.mock import patch, MagicMock + +import pandas as pd +import torch +from msprobe.core.common.const import MonitorConst, Const +from torch import distributed as dist + +from msprobe.pytorch.monitor.module_hook import CommunicationContext, GradContext, ModuleHookContext, \ + param_is_not_tensor_parallel_duplicate, param_is_data_parallel_duplicate +from msprobe.test.pytorch_ut.monitor.demo_model import monitor_demo +from msprobe.pytorch import TrainerMon + +base_dir = os.path.dirname(os.path.realpath(__file__)) + + +def clean_output(path): + if os.path.exists(path): + shutil.rmtree(path) + + +class TestModuleHook(unittest.TestCase): + monitor_output = "./monitor_output" + + 
@staticmethod + def get_dist_mock(initialized=False): + dist_mock = MagicMock() + dist_mock.is_initialized.return_value = initialized + dist_mock.get_rank.return_value = 0 + dist_mock.get_process_group_ranks.return_value = [0] + + dist.is_initialized = dist_mock.is_initialized + dist.get_rank = dist_mock.get_rank + dist.get_process_group_ranks = dist_mock.get_process_group_ranks + + def test_smallest_rank_print(self): + xy_config = os.path.join(base_dir, "config/xy_config.json") + hooker = TrainerMon( + xy_config, + params_have_main_grad=False + ) + self.get_dist_mock(True) + + hooker._smallest_rank_print("test print") + + hooker.module_rank_list = [0] + hooker._smallest_rank_print("test print") + self.assertIsNotNone(hooker) + + def test_print_struct(self): + print_struct_config = os.path.join(base_dir, "config/struct_config.json") + self.get_dist_mock(False) + + with self.assertRaises(Exception) as context: + monitor_demo(print_struct_config) + self.assertEqual(str(context.exception), "exit after first monitor step when print model struct") + + def test_xy_distribution(self): + xy_monitor_output = "./test_xy_distribution" + clean_output(xy_monitor_output) + os.environ[MonitorConst.MONITOR_OUTPUT_DIR] = xy_monitor_output + xy_config = os.path.join(base_dir, "config/xy_config.json") + monitor_demo(xy_config) + # validate output file + output_dir_list = os.listdir(xy_monitor_output) + self.assertEqual(len(output_dir_list), 1) + actv_0_csv = os.path.join(xy_monitor_output, output_dir_list[0], "actv_0-0.csv") + actv_grad_0_csv = os.path.join(xy_monitor_output, output_dir_list[0], "actv_grad_0-0.csv") + self.assertTrue(os.path.exists(actv_0_csv)) + self.assertTrue(os.path.exists(actv_grad_0_csv)) + # validate columns and lines + actv_0 = pd.read_csv(actv_0_csv) + expect_columns = ['vpp_stage', 'name', 'step', 'micro_step', 'norm', 'nans'] + self.assertListEqual(list(actv_0.columns), expect_columns) + self.assertEqual(actv_0.shape, tuple([6, 6])) + actv_grad_0 = 
pd.read_csv(actv_grad_0_csv)
+        expect_columns = ['vpp_stage', 'name', 'step', 'micro_step', 'norm', 'nans']
+        self.assertListEqual(list(actv_grad_0.columns), expect_columns)
+        self.assertEqual(actv_grad_0.shape, tuple([6, 6]))
+
+    def test_wg_distribution(self):
+        self.get_dist_mock(False)
+        wg_monitor_output = "./test_wg_distribution"
+        clean_output(wg_monitor_output)
+        os.environ[MonitorConst.MONITOR_OUTPUT_DIR] = wg_monitor_output
+        mv_config = os.path.join(base_dir, "config/wg_config.json")
+        monitor_demo(mv_config)
+        # validate output file
+        output_dir_list = os.listdir(wg_monitor_output)
+        self.assertEqual(len(output_dir_list), 1)
+        grad_reduced_0_csv = os.path.join(wg_monitor_output, output_dir_list[0], "grad_reduced_0-0.csv")
+        grad_unreduced_0_csv = os.path.join(wg_monitor_output, output_dir_list[0], "grad_unreduced_0-0.csv")
+        self.assertTrue(os.path.exists(grad_reduced_0_csv))
+        self.assertTrue(os.path.exists(grad_unreduced_0_csv))
+        # validate columns and lines
+        expect_columns = ["vpp_stage", "name", "step", "norm"]
+        grad_reduced_0 = pd.read_csv(grad_reduced_0_csv)
+        self.assertListEqual(list(grad_reduced_0.columns), expect_columns)
+        self.assertEqual(grad_reduced_0.shape, tuple([2, 4]))
+        grad_unreduced_0 = pd.read_csv(grad_unreduced_0_csv)
+        self.assertListEqual(list(grad_unreduced_0.columns), expect_columns)
+        self.assertEqual(grad_unreduced_0.shape, tuple([2, 4]))
+
+    def test_mv_distribution(self):
+        self.get_dist_mock(False)
+        mv_monitor_output = "./test_mv_distribution"
+        clean_output(mv_monitor_output)
+        os.environ[MonitorConst.MONITOR_OUTPUT_DIR] = mv_monitor_output
+        mv_config = os.path.join(base_dir, "config/mv_config.json")
+        monitor_demo(mv_config)
+        # validate output file
+        output_dir_list = os.listdir(mv_monitor_output)
+        self.assertEqual(len(output_dir_list), 1)
+        exp_avg_1_csv = os.path.join(mv_monitor_output, output_dir_list[0], "exp_avg_1-1.csv")
+        exp_avg_sq_1_csv = os.path.join(mv_monitor_output, output_dir_list[0], 
"exp_avg_sq_1-1.csv") + self.assertTrue(os.path.exists(exp_avg_1_csv)) + self.assertTrue(os.path.exists(exp_avg_sq_1_csv)) + # validate columns and lines + expect_columns = ["vpp_stage", "name", "step", "norm"] + exp_avg_1 = pd.read_csv(exp_avg_1_csv) + self.assertListEqual(list(exp_avg_1.columns), expect_columns) + self.assertEqual(exp_avg_1.shape, tuple([2, 4])) + exp_avg_sq_1 = pd.read_csv(exp_avg_sq_1_csv) + self.assertListEqual(list(exp_avg_sq_1.columns), expect_columns) + self.assertEqual(exp_avg_sq_1.shape, tuple([2, 4])) + + def test_ur_distribution(self): + self.get_dist_mock(False) + ur_monitor_output = "./test_ur_distribution" + clean_output(ur_monitor_output) + os.environ[MonitorConst.MONITOR_OUTPUT_DIR] = ur_monitor_output + ur_config = os.path.join(base_dir, "config/ur_config.json") + monitor_demo(ur_config) + # validate output file + output_dir_list = os.listdir(ur_monitor_output) + self.assertEqual(len(output_dir_list), 1) + tb_dir = os.listdir(os.path.join(ur_monitor_output, output_dir_list[0])) + self.assertEqual(len(tb_dir), 1) + self.assertTrue(tb_dir[0].startswith("events.out.tfevents.")) + + def test_cc_distribution(self): + cc_config = os.path.join(base_dir, "config/cc_config.json") + self.get_dist_mock(True) + hooker = TrainerMon( + cc_config, + params_have_main_grad=False + ) + self.assertIsNotNone(hooker) + + def test_adhoc_check(self): + # mock dist + self.get_dist_mock(True) + target_tensor = torch.randn(10) + module_name = 'test_module' + tensor_name = 'test_tensor' + rank_list = [1, 2] + ops_list = ['max', 'min'] + cc_config = os.path.join(base_dir, "config/cc_config.json") + hooker = TrainerMon(cc_config, params_have_main_grad=False) + hooker.adhoc_check(target_tensor, module_name, tensor_name, rank_list, ops_list) + + def test_generate_cc_metrics(self): + self.get_dist_mock(True) + + cc_name = 'test_cc' + cc_tensor = CommunicationContext() + cc_tensor.data = { + 'min': { + 'tag1': 'tensor1', + 'tag2': 'tensor2' + }, + 'max': { + 
'tag3': 'tensor3', + 'tag4': 'tensor4' + } + } + expected_metrics = {'min': {'test_cc/rank0/tag1': 'tensor1', 'test_cc/rank0/tag2': 'tensor2'}, + 'max': {'test_cc/rank0/tag3': 'tensor3', 'test_cc/rank0/tag4': 'tensor4'}} + result = TrainerMon.generate_cc_metrics(cc_name, cc_tensor) + self.assertDictEqual(result, expected_metrics) + + def test_generate_xy_metrics(self): + xy_config = os.path.join(base_dir, "config/xy_config.json") + trainer_mon = TrainerMon( + xy_config, + params_have_main_grad=False + ) + + fwd_context = ModuleHookContext("module1") + fwd_context.actv = {'module1': 'value1'} + trainer_mon.module_fwd_hook_context_by_module = {'module1': fwd_context} + trainer_mon.grad_context.actv = {'module2': 'value2'} + + actv, actv_grad = trainer_mon.generate_xy_metrics() + self.assertEqual(actv, {'module1': 'value1'}) + self.assertEqual(actv_grad, {'module2': 'value2'}) + + def test_reload_xy(self): + xy_config = os.path.join(base_dir, "config/xy_config.json") + trainer_mon = TrainerMon( + xy_config, + params_have_main_grad=False + ) + trainer_mon.rank = 0 + trainer_mon.module_rank_list = [1, 2] + trainer_mon.handles = {'xy': []} + trainer_mon.module_fwd_hook_context_by_module = {"a": ModuleHookContext("test")} + trainer_mon.hook_modules = MagicMock() + + handle = MagicMock() + trainer_mon.handles['xy'].append(handle) + trainer_mon.reload_xy() + self.assertEqual(trainer_mon.handles['xy'], []) + + +class TestParamIsNotTensorParallelDuplicate(unittest.TestCase): + @patch('torch.distributed.get_rank') + def test_param_is_not_tensor_parallel_duplicate(self, mock_get_rank): + class MockParam: + def __init__(self, tensor_model_parallel): + self.tensor_model_parallel = tensor_model_parallel + + param = MockParam(True) + tp_group = 'dummy_group' + self.assertTrue(param_is_not_tensor_parallel_duplicate(param, tp_group)) + + +class TestParamIsDataParallelDuplicate(unittest.TestCase): + @patch('torch.distributed.get_rank') + def 
test_param_is_data_parallel_duplicate_true(self, mock_get_rank): + mock_get_rank.return_value = 1 + dp_group = 'dp_group' + result = param_is_data_parallel_duplicate(dp_group) + self.assertTrue(result) + + @patch('torch.distributed.get_rank') + def test_param_is_data_parallel_duplicate_false(self, mock_get_rank): + mock_get_rank.return_value = 0 + dp_group = 'dp_group' + result = param_is_data_parallel_duplicate(dp_group) + self.assertFalse(result) + + +class TestModuleHookContext(unittest.TestCase): + def setUp(self): + self.module_name = "test_module" + self.context = ModuleHookContext(self.module_name) + self.context.struct = { + Const.INPUT: { + "config": "tuple[1]", + "0": "size=(2, 784), dtype=torch.float32", + }, + Const.OUTPUT: { + "config": "tensor", + "tensor": "size=(2, 10), dtype=torch.float32" + }, + MonitorConst.INPUT_GRAD: { + "config": "tuple[1]", + "0": "size=(2, 784), dtype=torch.float32" + }, + MonitorConst.OUTPUT_GRAD: { + "config": "tuple[1]", + "0": "size=(2, 10), dtype=torch.float32" + } + } + self.target_config = { + self.module_name: { + Const.INPUT: "tuple[1]:0", + Const.OUTPUT: "tensor", + MonitorConst.INPUT_GRAD: "tuple[1]:0" + } + } + + def test_set_format_by_arg_module_name_in_target_config(self): + self.context.set_format_by_arg(Const.INPUT, self.target_config) + self.assertEqual(self.context.format_by_arg[Const.INPUT], "tuple[1]:0") + self.context.set_format_by_arg(Const.OUTPUT, self.target_config) + self.assertEqual(self.context.format_by_arg[Const.OUTPUT], "tensor") + self.context.set_format_by_arg(MonitorConst.INPUT_GRAD, self.target_config) + self.assertEqual(self.context.format_by_arg[MonitorConst.INPUT_GRAD], "tuple[1]:0") + self.context.set_format_by_arg(MonitorConst.OUTPUT_GRAD, self.target_config) + self.assertEqual(self.context.format_by_arg[MonitorConst.OUTPUT_GRAD], "tuple[1]") + + def test_set_format_by_arg_module_name_not_in_target_config(self): + target_config = {} + self.context.set_format_by_arg(Const.INPUT, 
target_config) + self.assertEqual(self.context.format_by_arg[Const.INPUT], "tuple[1]") + self.context.set_format_by_arg(Const.OUTPUT, target_config) + self.assertEqual(self.context.format_by_arg[Const.OUTPUT], "tensor") + + @patch('msprobe.pytorch.monitor.module_hook.logger') + def test_set_format_by_arg_target_module_config_error(self, mock_logger): + target_config = {self.module_name: {Const.INPUT: 123}} + self.context.set_format_by_arg(Const.INPUT, target_config) + self.assertIsNone(self.context.format_by_arg.get(Const.INPUT)) + mock_logger.warning_on_rank_0.assert_called_once() + + +class TestContext(unittest.TestCase): + def test_communication_context(self): + cc_ctx = CommunicationContext() + cc_ctx.reset() + cc_ctx.data = {'tag1': {'min': [1, 2, 3], 'max': [10, 11, 12]}, + 'tag2': {'min': [16, 17, 18], 'max': [22, 23, 24]}} + cc_ctx.aggregate() + expected_aggregated_data = {'tag1': {'max': 12, 'min': 1}, 'tag2': {'max': 24, 'min': 16}} + self.assertEqual(cc_ctx.data, expected_aggregated_data) + + def test_grad_context(self): + grad_ctx = GradContext() + grad_ctx.reset() + self.assertEqual(grad_ctx.pre, {}) + self.assertEqual(grad_ctx.post, {}) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_monitor_utils.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_monitor_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0462ac3f39531119b40d3cc5051fad77f687b9b5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_monitor_utils.py @@ -0,0 +1,178 @@ +import os +import unittest +from unittest.mock import patch, MagicMock + +import torch +from msprobe.core.common.const import MonitorConst + +from msprobe.pytorch.monitor.utils import filter_special_chars, MsgConst, get_param_struct, validate_ops, \ + validate_ranks, validate_targets, validate_print_struct, validate_ur_distribution, validate_xy_distribution, \ + validate_mg_distribution, 
validate_wg_distribution, validate_cc_distribution, validate_alert, validate_config, \ + get_output_base_dir +from msprobe.pytorch.common.utils import is_recomputation + + + +class TestValidationFunctions(unittest.TestCase): + + def test_get_output_base_dir(self): + # not set env + if os.getenv(MonitorConst.MONITOR_OUTPUT_DIR): + del os.environ[MonitorConst.MONITOR_OUTPUT_DIR] + output_base_dir = get_output_base_dir() + expect_output_base_dir = "./monitor_output" + self.assertEqual(output_base_dir, expect_output_base_dir) + + # set env + os.environ[MonitorConst.MONITOR_OUTPUT_DIR] = "test123" + output_base_dir = get_output_base_dir() + expect_output_base_dir = "test123" + self.assertEqual(output_base_dir, expect_output_base_dir) + + def test_filter_special_chars(self): + @filter_special_chars + def func(msg): + return msg + + self.assertEqual(func(MsgConst.SPECIAL_CHAR[0]), '_') + + def test_get_param_struct(self): + param = (torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])) + res = get_param_struct(param) + self.assertEqual(res['config'], 'tuple[2]') + + def test_validate_ops(self): + ops = ['op1', 'op2', 'norm', 'max'] + valid_ops = validate_ops(ops) + self.assertEqual(valid_ops, ['norm', 'max']) + + def test_no_valid_ops(self): + ops = ['op1', 'op2'] + valid_ops = validate_ops(ops) + target_ops = [MonitorConst.OP_LIST[0]] + self.assertEqual(valid_ops, target_ops) + + def test_validate_ranks(self): + ranks = [0, 1, 2, 3] + res = validate_ranks(ranks) + self.assertIsNone(res) + + def test_validate_targets(self): + targets = {'module_name': {'input': 'tensor'}} + validate_targets(targets) + + def test_validate_print_struct(self): + print_struct = True + validate_print_struct(print_struct) + + def test_validate_ur_distribution(self): + ur_distribution = True + validate_ur_distribution(ur_distribution) + + def test_validate_xy_distribution(self): + xy_distribution = True + validate_xy_distribution(xy_distribution) + + def test_validate_wg_distribution(self): + 
wg_distribution = True
+        validate_wg_distribution(wg_distribution)
+
+    def test_validate_mg_distribution(self):
+        mg_distribution = True
+        validate_mg_distribution(mg_distribution)
+
+    def test_validate_cc_distribution(self):
+        cc_distribution = {'enable': True, 'cc_codeline': ['line1'], 'cc_pre_hook': False, 'cc_log_only': True}
+        validate_cc_distribution(cc_distribution)
+
+    def test_validate_alert(self):
+        alert = {'rules': [{'rule_name': 'AnomalyTurbulence', 'args': {'threshold': 10.0}}], 'dump': True}
+        validate_alert(alert)
+
+    def test_validate_config(self):
+        config = {
+            'ops': ['op1', 'op2'],
+            'eps': 1e-8,
+            'module_ranks': [0, 1, 2, 3],
+            'targets': {'module_name': {'input': 'tensor'}},
+            'print_struct': True,
+            'ur_distribution': True,
+            'xy_distribution': True,
+            'wg_distribution': True,
+            'mg_distribution': True,
+            'cc_distribution': {'enable': True, 'cc_codeline': ['line1'], 'cc_pre_hook': False, 'cc_log_only': True},
+            'alert': {'rules': [{'rule_name': 'AnomalyTurbulence', 'args': {'threshold': 10.0}}], 'dump': True}
+        }
+        validate_config(config)
+        target_ops = [MonitorConst.OP_LIST[0]]
+        self.assertEqual(config["ops"], target_ops)
+        del config["targets"]
+        validate_config(config)
+        self.assertEqual(config["targets"], {"": {}})
+        self.assertEqual(config["all_xy"], True)
+
+
+class TestIsRecomputation(unittest.TestCase):
+    @patch('inspect.stack')
+    def test_in_recomputation_megatron(self, mock_stack):
+        # Simulate the call stack under the Megatron framework
+        frame1 = MagicMock()
+        frame1.function = 'backward'
+        frame1.filename = 'torch/_tensor.py'
+
+        frame2 = MagicMock()
+        frame2.function = 'some_function'
+        frame2.filename = 'torch/autograd/function.py'
+
+        mock_stack.return_value = [frame1, frame2]
+
+        self.assertTrue(is_recomputation())
+
+    @patch('inspect.stack')
+    def test_in_recomputation_mindspeed_L0L1(self, mock_stack):
+        # Simulate the call stack in the MindSpeed L0&L1 scenario
+        frame1 = MagicMock()
+        frame1.function = 'checkpoint_function_backward'
+        frame1.filename = 'some_module.py'
+
+        frame2 = 
MagicMock()
+        frame2.function = 'some_other_function'
+        frame2.filename = 'torch/autograd/function.py'
+
+        mock_stack.return_value = [frame1, frame2]
+
+        self.assertTrue(is_recomputation())
+
+    @patch('inspect.stack')
+    def test_in_recomputation_mindspeed_L2(self, mock_stack):
+        # Simulate the call stack in the MindSpeed L2 scenario
+        frame1 = MagicMock()
+        frame1.function = 'checkpoint_function_backward'
+        frame1.filename = 'another_module.py'
+
+        frame2 = MagicMock()
+        frame2.function = 'yet_another_function'
+        frame2.filename = 'some_file.py'
+
+        frame3 = MagicMock()
+        frame3.function = 'final_function'
+        frame3.filename = 'torch/autograd/function.py'
+
+        mock_stack.return_value = [frame1, frame2, frame3]
+
+        self.assertTrue(is_recomputation())
+
+    @patch('inspect.stack')
+    def test_not_in_recomputation(self, mock_stack):
+        # Simulate a call stack outside the recomputation phase
+        frame1 = MagicMock()
+        frame1.function = 'forward'
+        frame1.filename = 'my_model.py'
+
+        mock_stack.return_value = [frame1]
+
+        self.assertFalse(is_recomputation())
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_monitor_wrap_distributed.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_monitor_wrap_distributed.py
new file mode 100644
index 0000000000000000000000000000000000000000..0463c37364820bb08e675a572a25c065de048c05
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_monitor_wrap_distributed.py
@@ -0,0 +1,371 @@
+import unittest
+from collections import defaultdict
+from unittest.mock import patch, MagicMock
+import torch
+from torch import distributed as dist
+from msprobe.pytorch.monitor.module_hook import CommunicationContext
+
+from msprobe.pytorch.monitor.distributed.wrap_distributed import get_distributed_ops, DistributedOPTemplate, \
+    ApiRegistry, \
+    get_process_group, stack_filter, get_callstack, op_aggregate, update_data, is_target_line, \
+    create_async_callback_func, \
+    ORIGIN_WAIT, PENDING_ASYNC_CC_BY_HANDLE, catch_data, 
create_hooks
+
+WrapDistributedOps = ['all_reduce', 'broadcast', 'all_gather']
+
+
+class TestGetDistributedOps(unittest.TestCase):
+
+    def test_get_distributed_ops(self):
+        expected = {'all_reduce', 'broadcast', 'all_gather'}
+        with patch('msprobe.pytorch.monitor.distributed.wrap_distributed.WrapDistributedOps', \
+                   new=['all_reduce', 'broadcast', 'all_gather']):
+            result = get_distributed_ops()
+
+        self.assertEqual(result, expected)
+
+    def test_get_distributed_ops_with_non_existent_op(self):
+        expected = {'all_reduce', 'broadcast'}
+        with patch('msprobe.pytorch.monitor.distributed.wrap_distributed.WrapDistributedOps', \
+                   new=['all_reduce', 'broadcast', 'non_existent_op']):
+            result = get_distributed_ops()
+
+        self.assertEqual(result, expected)
+
+    def test_get_distributed_ops_only_exclusions(self):
+        expected = set()
+        with patch('msprobe.pytorch.monitor.distributed.wrap_distributed.WrapDistributedOps', new=['exclusion']):
+            result = get_distributed_ops()
+
+        self.assertEqual(result, expected)
+
+
+class TestDistributedOPTemplate(unittest.TestCase):
+
+    def hook(name):
+        def forward_pre_hook(nope, input, kwargs):
+            pass
+
+        def forward_hook(nope, input, kwargs, result):
+            pass
+
+        return forward_pre_hook, forward_hook
+
+    def test_distributed_op(self):
+        op_name = 'all_reduce'
+        if op_name in get_distributed_ops():
+            op = DistributedOPTemplate(op_name, [self.hook()[0]], [self.hook()[1]])
+            self.assertEqual(op.op_name_, op_name)
+
+
+class TestApiRegistry(unittest.TestCase):
+
+    def setUp(self) -> None:
+        self.ApiRegistry = ApiRegistry()
+        global ORIGIN_WAIT
+        global PENDING_ASYNC_CC_BY_HANDLE
+        self.attr_dict = {"b2": 2, "b3": 3}
+
+    def hook(name):
+        def forward_pre_hook(nope, input, kwargs):
+            pass
+
+        def forward_hook(nope, input, kwargs, result):
+            pass
+
+        return forward_pre_hook, forward_hook
+
+    def tearDown(self) -> None:
+        # Clear PENDING_ASYNC_CC_BY_HANDLE
+        PENDING_ASYNC_CC_BY_HANDLE.clear()
+
+    def test_store_ori_attr(self):
+        class A():
+            a1 = 1
+
+        class B():
+            a = A()
+            b1 = 1
+            b2 = 2
+
+        api_list = ["a.a1", "b1", "b2"]
+        expect_output = {"a.a1": 1, "b1": 1, "b2": 2}
+        actual_output = dict()
+        ApiRegistry.store_ori_attr(B, api_list, actual_output)
+        self.assertEqual(actual_output, expect_output)
+
+    def test_set_api_attr(self):
+        class A():
+            a1 = 1
+
+        class B():
+            a = A().__class__
+            b1 = 1
+
+        attr_dict = {"a.a2": 2, "b2": 2, "b3": 3}
+        ApiRegistry.set_api_attr(B, attr_dict)
+
+        for k, v in attr_dict.items():
+            if '.' in k:
+                sub_module_name, sub_op = k.rsplit('.', 1)
+                sub_module = getattr(B, sub_module_name, None)
+
+                self.assertEqual(getattr(sub_module, sub_op), v)
+            else:
+                self.assertEqual(getattr(B, k), v)
+
+    @patch('msprobe.pytorch.monitor.distributed.wrap_distributed.torch.distributed.Work')
+    @patch('msprobe.pytorch.monitor.distributed.wrap_distributed.ORIGIN_WAIT')
+    @patch('msprobe.pytorch.monitor.distributed.wrap_distributed.PENDING_ASYNC_CC_BY_HANDLE')
+    def test_redirect_wait_with_pending(self, mock_handle, mock_wait, mock_work):
+        # Register a pending function
+        mock_wait.return_value = MagicMock()
+        mock_handle["handle"] = MagicMock()
+
+        # Run redirect_wait
+        ApiRegistry.redirect_wait()
+
+        # Call wrapped_wait
+        wrapped_wait = dist.Work.wait
+        wrapped_wait("handle")
+
+        # Verify ORIGIN_WAIT was called
+        mock_wait.assert_called_once_with("handle")
+
+    def test_redirect_api(self):
+        self.ApiRegistry.distributed_attr_hooked = self.attr_dict
+        self.ApiRegistry.redirect_api()
+
+        self.assertEqual(dist.b2, 2)
+        self.assertEqual(dist.distributed_c10d.b2, 2)
+
+    def test_restore_api(self):
+        self.ApiRegistry.distributed_attr_origin = self.attr_dict
+        self.ApiRegistry.restore_api()
+
+        self.assertEqual(dist.b2, 2)
+        self.assertEqual(dist.distributed_c10d.b2, 2)
+
+    def test_initialize_hook(self):
+        self.ApiRegistry.initialize_hook([self.hook()[0]], [self.hook()[1]])
+
+        self.assertEqual(len(get_distributed_ops()), len(self.ApiRegistry.distributed_attr_origin))
+        self.assertEqual(len(get_distributed_ops()), 
len(self.ApiRegistry.distributed_attr_hooked)) + + +class TestFunctions(unittest.TestCase): + + def test_get_process_group(self): + process_group_element = dist.GroupMember.WORLD + result = get_process_group(process_group_element) + + self.assertEqual(result, process_group_element) + + def test_get_process_group_with_none(self): + result = get_process_group(None) + + self.assertEqual(result, dist.GroupMember.WORLD) + + def test_stack_filter_false(self): + stack = 'msprobe/pytorch/monitor/distributed' + result = stack_filter(stack) + + self.assertFalse(result) + + def test_stack_filter_true(self): + stack = 'wrong/stack' + result = stack_filter(stack) + + self.assertTrue(result) + + @patch('msprobe.pytorch.monitor.distributed.wrap_distributed.inspect') + def test_get_callstack(self, mock_inspect): + mock_inspect.stack.return_value = [(None, 'wrong/stack', 1, 'function', None, None)] + expected = ['wrong/stack[1] function'] + result = get_callstack() + + self.assertEqual(result, expected) + + def test_op_aggregate_with_tensor(self): + tensor = torch.tensor([1, 2, 3]) + + self.assertTrue(torch.equal(op_aggregate('', tensor), tensor)) + + def test_op_aggregate_with_non_tensor(self): + self.assertTrue(torch.isnan(op_aggregate('', None))) + + def test_op_aggregate_with_op_min(self): + tensorlist = [1, 2, 3] + + self.assertEqual(op_aggregate('min', tensorlist), 1) + + def test_op_aggregate_with_op_max(self): + tensorlist = [1, 2, 3] + + self.assertEqual(op_aggregate('max', tensorlist), 3) + + def test_op_aggregate_with_op_norm(self): + tensorlist = [1, 2, 3] + + self.assertEqual(op_aggregate('norm', tensorlist), 6) + + def test_op_aggregate_with_op_zeros(self): + tensorlist = [1, 2, 3] + + self.assertEqual(op_aggregate('zeros', tensorlist), 2) + + def test_op_aggregate_with_op_nans(self): + tensorlist = [1, 2, 3] + + self.assertEqual(op_aggregate('nans', tensorlist), 6) + + def test_op_aggregate_with_op_mean(self): + tensorlist = [1, 2, 3] + + 
self.assertEqual(op_aggregate('mean', tensorlist), 2)
+
+    def test_op_aggregate_with_default_op(self):
+        tensorlist = [1, 2, 3]
+        res = op_aggregate('test_op', tensorlist)
+        self.assertTrue(res.isnan().item())
+
+    def test_op_aggregate_other(self):
+        self.assertTrue(torch.isnan(op_aggregate('', None)))
+
+    def test_update_data_new(self):
+        old = {}
+        new = {'tag1': {'op1': torch.tensor([1, 2, 3])}}
+        expected = torch.tensor([1, 2, 3])
+        old = update_data(old, new)
+
+        self.assertIsInstance(old['tag1']['op1'], list)
+        self.assertTrue(torch.equal(old['tag1']['op1'][0], expected))
+
+    def test_update_data_append(self):
+        old = {'tag1': {'op1': [torch.tensor([1, 2, 3])]}}
+        new = {'tag1': {'op1': torch.tensor([2, 3, 4]), 'op2': torch.tensor([3, 4, 5])}}
+        old = update_data(old, new)
+
+        self.assertIsInstance(old['tag1']['op1'], list)
+        self.assertEqual(len(old['tag1']['op1']), 2)
+        self.assertTrue(torch.equal(old['tag1']['op1'][1], torch.tensor([2, 3, 4])))
+        self.assertTrue(torch.equal(old['tag1']['op2'][0], torch.tensor([3, 4, 5])))
+
+    def test_is_target_line_with_empty(self):
+        self.assertTrue(is_target_line([]))
+
+    @patch('msprobe.pytorch.monitor.distributed.wrap_distributed.get_callstack')
+    def test_is_target_line_with_pattern_found(self, mock_stack):
+        mock_stack.return_value = ['stack1', 'stack2']
+
+        self.assertTrue(is_target_line(['stack1']))
+
+    @patch('msprobe.pytorch.monitor.distributed.wrap_distributed.get_callstack')
+    def test_is_target_line_other(self, mock_stack):
+        mock_stack.return_value = ['stack1', 'stack2']
+
+        self.assertFalse(is_target_line(['stack3']))
+
+    @patch('msprobe.pytorch.monitor.distributed.wrap_distributed.catch_data')
+    def test_create_async_callback_func(self, mock_catch_data):
+        context = 'test_context'
+        cc_name = 'test_cc_name'
+        ops = 'test_ops'
+        args = 'test_args'
+        prefix = 'test_prefix'
+
+        # Create the callback function
+        callback_func = create_async_callback_func(context, cc_name, ops, args, prefix)
+
+        # Invoke the callback function
+        callback_func()
+
+        # Verify catch_data was called with the correct arguments
+        mock_catch_data.assert_called_once_with(context, cc_name, ops, args, prefix)
+
+
+class TestCatchData(unittest.TestCase):
+
+    def setUp(self) -> None:
+        self.cc_context = CommunicationContext()
+        self.cc_name = 'cc_name'
+        self.ops = ["min", "max"]
+        self.prefix = 'prefix'
+        self.target_key = "cc_name/prefix_0"
+
+    def test_catch_data_with_tensor(self):
+        args = [torch.tensor([1, 2, 3])]
+        catch_data(self.cc_context, self.cc_name, self.ops, args, self.prefix)
+        self.assertEqual(len(self.cc_context.data), 1)
+
+    def test_catch_data_with_list_of_tensors(self):
+        args = [[torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])]]
+        catch_data(self.cc_context, self.cc_name, self.ops, args, self.prefix)
+        self.assertEqual(len(self.cc_context.data), 1)
+
+
+class TestCreateHooks(unittest.TestCase):
+
+    def setUp(self) -> None:
+        class MockMonitor:
+            cc_logged_stack = defaultdict(set)
+            cc_codeline = []
+            ops = ["min", "max"]
+            module_rank_list = []
+            cc_log_only = False
+            cc_pre_hook = False
+            cc_context = defaultdict(CommunicationContext)
+
+        self.monitor = MockMonitor()
+        self.context = self.monitor.cc_context
+        self.dist_mock = MagicMock()
+        self.dist_mock.get_rank.return_value = 0
+        self.dist_mock.is_initialized.return_value = True
+        dist.is_initialized = self.dist_mock.is_initialized
+        dist.get_rank = self.dist_mock.get_rank
+
+    def test_create_hooks_without_hook(self):
+        self.monitor.module_rank_list = [0]
+        pre_hooks, hooks = create_hooks(self.context, self.monitor)
+        self.assertEqual(len(pre_hooks), 0)
+        self.assertEqual(len(hooks), 1)
+
+    def test_create_hooks_with_cc_log_only(self):
+        self.monitor.cc_log_only = True
+        pre_hooks, hooks = create_hooks(self.context, self.monitor)
+        self.assertEqual(hooks, [])
+        cc_log_hook = pre_hooks[0]
+
+        mock_get_callstack = MagicMock()
+        mock_get_callstack.return_value = "test string"
+        mock_module = MagicMock()
+        mock_module.op_name_ = "op"
+        cc_log_hook(mock_module, None, None)
+        self.assertIn("op", 
self.monitor.cc_logged_stack)
+        self.assertEqual(1, len(self.monitor.cc_logged_stack["op"]))
+
+    def test_create_hooks_with_cc_pre_hook_and_cc_hook(self):
+        self.monitor.cc_pre_hook = True
+        pre_hooks, hooks = create_hooks(self.context, self.monitor)
+        self.assertEqual(1, len(pre_hooks))
+        self.assertEqual(1, len(hooks))
+
+        cc_pre_hook = pre_hooks[0]
+        mock_module = MagicMock()
+        mock_module.op_name_ = "test_module"
+        args = tuple([torch.tensor([1, 2, 3])])
+        kwargs = {}
+        cc_pre_hook(mock_module, args, kwargs)
+        self.assertIn("test_module", self.context)
+        self.assertIsInstance(self.context["test_module"], CommunicationContext)
+
+        cc_hook = hooks[0]
+        cc_hook(mock_module, args, kwargs)
+
+        res = cc_hook(mock_module, args, kwargs, [])
+        self.assertEqual(res, [])
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_optimizer_collect.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_optimizer_collect.py
new file mode 100644
index 0000000000000000000000000000000000000000..793b086b02db03f8a04b159f35f1df55fc1a9d2c
--- /dev/null
+++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_optimizer_collect.py
@@ -0,0 +1,339 @@
+import unittest
+from collections import defaultdict
+from unittest.mock import Mock, patch, MagicMock
+
+import torch
+from msprobe.pytorch.monitor.optimizer_collect import OptimizerMon, \
+    OptimizerMonFactory, DummyOptimizerMon, \
+    MixPrecisionOptimizerMon, MegatronDistributedOptimizerMon, MegatronFP32OptimizerMon, \
+    MegatronChainedDistributedOptimizerMon, MegatronChainedMixPrecisionOptimizerMon, \
+    DeepSpeedZeroOptimizerStage0Mon, DeepSpeedZeroOptimizerStage1or2Mon, DeepSpeedZeroOptimizerStage3Mon
+
+from msprobe.pytorch.monitor.utils import MVResult, MVGradResult
+
+
+class TestOptimizerMon(unittest.TestCase):
+    def setUp(self) -> None:
+        # Initialize the required monitor, torch_opt, params2name and related objects
+        self.monitor = Mock()
+        self.monitor.mv_distribution = True
+        
self.monitor.mg_direction = True + self.monitor.ur_distribution = True + self.monitor.update_heatmap_visualizer = {'param1': Mock(), 'param2': Mock()} + self.monitor.ratio_heatmap_visualizer = {'param1': Mock(), 'param2': Mock()} + + def test_fetch_mv(self): + optimizer_mon = OptimizerMon() + res = optimizer_mon.fetch_mv(None, None, None) + self.assertEqual(res, None) + + def test_fetch_mv_in_adam(self): + self.torch_opt = Mock() + self.torch_opt.state = { + 'param1': {'exp_avg': torch.tensor(0.1), 'exp_avg_sq': torch.tensor(0.2), 'step': torch.tensor(10)}, + 'param2': {'exp_avg': torch.tensor(0.3), 'exp_avg_sq': torch.tensor(0.4), 'step': torch.tensor(20)} + } + self.torch_opt.param_groups = [{'step': 10}] + self.torch_opt.defaults = {'betas': (0.9, 0.999), 'eps': 1e-8} + self.params2name = {'param1': 'param1', 'param2': 'param2'} + + self.optimizer_mon = OptimizerMon() + result = self.optimizer_mon._fetch_mv_in_adam(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(result, MVResult) + + @patch('msprobe.pytorch.monitor.optimizer_collect.dist') + def test_fetch_mv_grad_in_adam(self, mock_dist): + self.optimizer_mon = OptimizerMon() + self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = defaultdict(str) + self.name2indices = defaultdict(tuple) + self.fp32_partitioned_groups_flat = defaultdict(torch.Tensor) + + # Mocking the dist.get_rank() and dist.get_world_size() + mock_dist.get_rank.return_value = 0 + mock_dist.get_world_size.return_value = 1 + + # Mocking the wrapped_optimizer + self.torch_opt.state = defaultdict(dict) + self.torch_opt.averaged_gradients = defaultdict(torch.Tensor) + self.torch_opt.partition_size = defaultdict(int) + self.torch_opt.flatten_dense_tensors_aligned = MagicMock() + self.torch_opt.flatten = MagicMock() + + # Mocking the torch_opt.param_groups + self.torch_opt.param_groups = [{'step': 1, 'betas': (0.9, 0.999)}, + {'step': 2, 'betas': (0.9, 0.999)}, + {'step': 3, 'betas': (0.9, 0.999)}] + 
+ # Mocking the monitor.mv_distribution, monitor.mg_direction, monitor.ur_distribution + self.monitor.mv_distribution = True + self.monitor.mg_direction = True + self.monitor.ur_distribution = True + + # Mocking the monitor.update_heatmap_visualizer and monitor.ratio_heatmap_visualizer + self.monitor.update_heatmap_visualizer = defaultdict(MagicMock) + self.monitor.ratio_heatmap_visualizer = defaultdict(MagicMock) + + result = self.optimizer_mon._fetch_mv_grad_in_adam(self.monitor, self.torch_opt, self.params2name, + self.name2indices, self.fp32_partitioned_groups_flat) + self.assertIsInstance(result, MVGradResult) + + +class TestMixPrecisionOptimizerMon(unittest.TestCase): + def test_fetch_mv_with_fp16_to_fp32_param_and_mix_prec_opt(self): + # init monitor, torch_opt ... + self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = MagicMock() + self.mix_prec_opt = MagicMock() + self.mix_prec_opt.float16_groups = [MagicMock()] + self.mix_prec_opt.fp32_from_float16_groups = [MagicMock()] + self.optimizer = MixPrecisionOptimizerMon() + self.optimizer.fp16_to_fp32_param = {} + + # Mock _fetch_mv_in_adam method and set a fixed return value + mv_result = MVResult(exp_avg={}, exp_avg_sq={}, update={}, ratio={}) + self.mock_fetch_mv_in_adam = MagicMock(return_value=mv_result) + self.optimizer._fetch_mv_in_adam = self.mock_fetch_mv_in_adam + + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.mock_fetch_mv_in_adam.assert_called_once_with(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVResult) + + +class TestChainedMixPrecisionOptimizerMon(unittest.TestCase): + def test_fetch_mv_with_fp16_to_fp32_param_and_mix_prec_opt(self): + # init monitor, torch_opt ... 
+ self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = MagicMock() + self.torch_opt.float16_groups = [MagicMock()] + self.torch_opt.fp32_from_float16_groups = [MagicMock()] + self.optimizer = MegatronChainedMixPrecisionOptimizerMon() + self.optimizer.optimizer = [MagicMock(), MagicMock()] + self.optimizer.fp16_to_fp32_param = {} + + # Mock _fetch_mv_in_adam method and set a fixed return value + mv_result = MVResult(exp_avg={}, exp_avg_sq={}, update={}, ratio={}) + self.mock_fetch_mv_in_adam = MagicMock(return_value=mv_result) + self.optimizer._fetch_mv_in_adam = self.mock_fetch_mv_in_adam + + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.mock_fetch_mv_in_adam.assert_called_once_with(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVResult) + + +class TestMegatronChainedDistributedOptimizerMon(unittest.TestCase): + def setUp(self): + self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = MagicMock() + mv_result = MVResult(exp_avg={}, exp_avg_sq={}, update={}, ratio={}) + self.mock_fetch_mv_in_adam = MagicMock(return_value=mv_result) + self.optimizer = MegatronChainedDistributedOptimizerMon() + + def test_fetch_mv_with_valid_optimizer(self): + self.torch_opt.model_float16_groups = [MagicMock()] + self.torch_opt.shard_fp32_from_float16_groups = [MagicMock()] + self.optimizer._fetch_mv_in_adam = self.mock_fetch_mv_in_adam + + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVResult) + + def test_fetch_mv_with_invalid_optimizer(self): + self.torch_opt = Mock() + self.torch_opt.model_float16_groups = None + self.torch_opt.shard_fp32_from_float16_groups = None + self.optimizer._fetch_mv_in_adam = self.mock_fetch_mv_in_adam + + with self.assertRaises(Exception): + self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + + +class 
TestMegatronDistributedOptimizerMon(unittest.TestCase): + def setUp(self): + self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = MagicMock() + mv_result = MVResult(exp_avg={}, exp_avg_sq={}, update={}, ratio={}) + self.mock_fetch_mv_in_adam = MagicMock(return_value=mv_result) + self.optimizer = MegatronDistributedOptimizerMon() + + def test_fetch_mv_with_valid_optimizer(self): + self.torch_opt.model_float16_groups = [MagicMock()] + self.torch_opt.shard_fp32_from_float16_groups = [MagicMock()] + self.optimizer._fetch_mv_in_adam = self.mock_fetch_mv_in_adam + + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVResult) + + def test_fetch_mv_with_invalid_optimizer(self): + self.torch_opt = Mock() + self.torch_opt.model_float16_groups = None + self.torch_opt.shard_fp32_from_float16_groups = None + self.optimizer._fetch_mv_in_adam = self.mock_fetch_mv_in_adam + + with self.assertRaises(Exception): + self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + + +class TestCommonFetchMv(unittest.TestCase): + def setUp(self) -> None: + self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = MagicMock() + + def test_megatron_fp32_optimizer_mon(self): + self.optimizer = MegatronFP32OptimizerMon() + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVResult) + + def test_deepspeed_zero_optimizer_stage0_mon(self): + self.optimizer = DeepSpeedZeroOptimizerStage0Mon() + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVResult) + + def test_dummy_optimizer_mon(self): + self.optimizer = DummyOptimizerMon() + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVResult) + + +class TestDeepSpeedZeroOptimizerStage3Mon(unittest.TestCase): + def test_get_param_index(self): + self.torch_opt = Mock() 
+ self.torch_opt.fp16_partitioned_groups = [ + [Mock(flatten=lambda: [1, 2, 3]), + Mock(flatten=lambda: [4, 5])], + [Mock(flatten=lambda: [6, 7, 8, 9])] + ] + self.params2name = {'param1': 'weight1', 'param2': 'weight2'} + self.name2index = {'weight1': 0, 'weight2': 2} + + optimizer_stage3_mon = DeepSpeedZeroOptimizerStage3Mon() + name2indices = optimizer_stage3_mon.get_param_index(self.params2name, self.name2index, self.torch_opt) + + expected_name2indices = {'weight1': (0, 3, 0, None), 'weight2': (5, 9, 1, None)} + self.assertDictEqual(dict(name2indices), expected_name2indices) + + def test_fetch_mv(self): + self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = MagicMock() + self.torch_opt.fp16_partitioned_groups = MagicMock() + self.optimizer = DeepSpeedZeroOptimizerStage3Mon() + + # mock _fetch_mv_grad_in_adam + mv_result = MVGradResult(exp_avg={}, exp_avg_sq={}, update={}, ratio={}, grad={}) + self.mock_fetch_mv_grad_in_adam = MagicMock(return_value=mv_result) + self.optimizer._fetch_mv_grad_in_adam = self.mock_fetch_mv_grad_in_adam + + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVGradResult) + + +class TestDeepSpeedZeroOptimizerStage1or2Mon(unittest.TestCase): + def test_get_group_index(self): + self.fp32_length = [10, 20, 30, 40] + self.world_size = 4 + self.indexes = [5, 7, 12, 25, 35, 45] + self.expected_results = [(40, 0), (40, 0), (12, 1), (24, 2), (34, 2), (40, 0)] + + optimizer = DeepSpeedZeroOptimizerStage1or2Mon() + results = [optimizer.get_group_index(self.fp32_length, self.world_size, index) for index in self.indexes] + self.assertEqual(results, self.expected_results) + + @patch('msprobe.pytorch.monitor.optimizer_collect.dist') + def test_get_param_index(self, mock_dist): + mock_dist.get_world_size.return_value = 4 + + self.params2name = {'param1': 'weight', 'param2': 'bias'} + self.name2index = {'weight': 0, 'bias': 1} + + self.optimizer_monitor = 
DeepSpeedZeroOptimizerStage1or2Mon() + + self.torch_opt = MagicMock() + self.torch_opt.groups_padding = [1, 2, 3] + self.torch_opt.single_partition_of_fp32_groups = [torch.tensor([1, 2]), torch.tensor([3, 4, 5])] + self.torch_opt.bit16_groups = [ + [torch.tensor([6, 7]), torch.tensor([8])], + [torch.tensor([9, 10, 11])] + ] + + name2indices = self.optimizer_monitor.get_param_index(self.params2name, self.name2index, self.torch_opt) + for name, indices in name2indices.items(): + self.assertIn(name, self.params2name.values()) + self.assertIsInstance(indices, tuple) + self.assertEqual(len(indices), 4) + + def test_fetch_mv(self): + self.monitor = MagicMock() + self.torch_opt = MagicMock() + self.params2name = MagicMock() + self.torch_opt.fp16_partitioned_groups = MagicMock() + self.optimizer = DeepSpeedZeroOptimizerStage1or2Mon() + + # mock _fetch_mv_grad_in_adam + mv_result = MVGradResult(exp_avg={}, exp_avg_sq={}, update={}, ratio={}, grad={}) + self.mock_fetch_mv_grad_in_adam = MagicMock(return_value=mv_result) + self.optimizer._fetch_mv_grad_in_adam = self.mock_fetch_mv_grad_in_adam + + res = self.optimizer.fetch_mv(self.monitor, self.torch_opt, self.params2name) + self.assertIsInstance(res, MVGradResult) + + +class TestOptimizerMonFactory(unittest.TestCase): + + def test_create_optimizer_mon(self): + # Test known optimizer types + mix_optimizer = MagicMock() + mix_optimizer_class = MagicMock() + mix_optimizer_class.__name__ = "Float16OptimizerWithFloat16Params" + mix_optimizer.__class__ = mix_optimizer_class + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(mix_optimizer)[0], + MixPrecisionOptimizerMon) + dis_optimizer = MagicMock() + dis_optimizer_class = MagicMock() + dis_optimizer_class.__name__ = "DistributedOptimizer" + dis_optimizer.__class__ = dis_optimizer_class + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(dis_optimizer)[0], + MegatronDistributedOptimizerMon) + fp32_optimizer = MagicMock() + fp32_optimizer_class = MagicMock() + 
fp32_optimizer_class.__name__ = "FP32Optimizer" + fp32_optimizer.__class__ = fp32_optimizer_class + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(fp32_optimizer)[0], + MegatronFP32OptimizerMon) + chained_optimizer = MagicMock() + chained_optimizer_class = MagicMock() + chained_optimizer_class.__name__ = "ChainedOptimizer" + chained_optimizer.__class__ = chained_optimizer_class + chained_optimizer.chained_optimizers = [mix_optimizer, mix_optimizer] + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(chained_optimizer)[0], + MegatronChainedMixPrecisionOptimizerMon) + chained_optimizer.chained_optimizers = [dis_optimizer, dis_optimizer] + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(chained_optimizer)[0], + MegatronChainedDistributedOptimizerMon) + deepspeed_optimizer = MagicMock() + deepspeed_optimizer_class = MagicMock() + deepspeed_optimizer_class.__name__ = "BF16_Optimizer" + deepspeed_optimizer.__class__ = deepspeed_optimizer_class + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(deepspeed_optimizer)[0], + DeepSpeedZeroOptimizerStage0Mon) + deepspeed_optimizer_class.__name__ = "DeepSpeedZeroOptimizer" + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(deepspeed_optimizer)[0], + DeepSpeedZeroOptimizerStage1or2Mon) + deepspeed_optimizer_class.__name__ = "DeepSpeedZeroOptimizer_Stage3" + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(deepspeed_optimizer)[0], + DeepSpeedZeroOptimizerStage3Mon) + # Unknown optimizer types should return DummyOptimizerMon + unknown_optimizer = MagicMock() + unknown_optimizer_class = MagicMock() + unknown_optimizer_class.__name__ = "unknown" + unknown_optimizer.__class__ = unknown_optimizer_class + self.assertIsInstance(OptimizerMonFactory.create_optimizer_mon(unknown_optimizer)[0], DummyOptimizerMon) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_visualizer.py 
b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_visualizer.py new file mode 100644 index 0000000000000000000000000000000000000000..602ed7ef586a33eb30a25d33074fdad57789f3ca --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/monitor/test_visualizer.py @@ -0,0 +1,56 @@ +import unittest +import torch +from msprobe.pytorch.monitor.visualizer import HeatmapVisualizer + + +class TestHeatmapVisualizer(unittest.TestCase): + def setUp(self): + self.heatmap_visualizer = HeatmapVisualizer() + + def test_init(self): + self.assertEqual(self.heatmap_visualizer.histogram_bins_num, 30) + self.assertEqual(self.heatmap_visualizer.min_val, -1) + self.assertEqual(self.heatmap_visualizer.max_val, 1) + self.assertIsNotNone(self.heatmap_visualizer.histogram_edges) + self.assertIsNone(self.heatmap_visualizer.histogram_sum_data_np) + self.assertIsNone(self.heatmap_visualizer.cur_step_histogram_data) + + def test_pre_cal(self): + tensor = torch.tensor([1., 2., 3., 4., 5.]) + self.heatmap_visualizer.pre_cal(tensor) + expected_histogram = torch.tensor([0. 
for _ in range(29)] + [1.0]) + res = torch.allclose(self.heatmap_visualizer.cur_step_histogram_data, expected_histogram, atol=1e-1) + self.assertTrue(res) + + def mock_summary_writer(self): + class MockSummaryWriter: + def add_image(self, tag, img, global_step, dataformats): + self.called_with_tag = tag + self.called_with_img = img + self.called_with_global_step = global_step + + return MockSummaryWriter() + + def test_visualize(self): + + self.tag_name = "histogram" + self.step = 10 + self.summary_writer = self.mock_summary_writer() + + # Prepare test data + self.heatmap_visualizer.cur_step_histogram_data = torch.tensor([1., 2., 3., 4., 5.]) + self.heatmap_visualizer.histogram_edges = [0.01, 0.02, 0.03, 0.04, 0.05] + self.heatmap_visualizer.histogram_bins_num = 5 + + # Call the method under test + self.heatmap_visualizer.visualize(self.tag_name, self.step, self.summary_writer) + + # Verify the results + self.assertEqual(self.summary_writer.called_with_tag, self.tag_name) + self.assertEqual(self.summary_writer.called_with_global_step, self.step) + img = self.summary_writer.called_with_img + self.assertEqual(list(img.shape), [4, 480, 640]) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/online_dispatch/test_compare_online_dispatch.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/online_dispatch/test_compare_online_dispatch.py index e0c3c3368de27122140c7bb0daf5dbc5c7d83e54..47db945c296d9063506e69ca8b80f262fb8e30a4 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/online_dispatch/test_compare_online_dispatch.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/online_dispatch/test_compare_online_dispatch.py @@ -22,7 +22,7 @@ from unittest.mock import Mock, patch import pandas as pd from msprobe.core.common.file_utils import FileOpen from msprobe.core.common.utils import CompareException -from msprobe.pytorch.online_dispatch.compare import get_json_contents, Saver, Comparator +from msprobe.pytorch.online_dispatch.compare import Saver, Comparator 
from rich.table import Table from io import StringIO from rich.console import Console @@ -41,22 +41,6 @@ class TestCompare(unittest.TestCase): if os.path.exists(self.list_json_path): os.remove(self.list_json_path) - def test_get_json_contents_when_get_json(self): - data = {"one": 1} - with FileOpen(self.dict_json_path, 'w') as f: - json.dump(data, f) - self.assertEqual(get_json_contents(self.dict_json_path), data) - - @patch('msprobe.core.common.log.BaseLogger.error') - def test_get_json_contents_when_get_list(self, mock_error): - data = [1, 2] - with FileOpen(self.list_json_path, 'w') as f: - json.dump(data, f) - with self.assertRaises(CompareException) as context: - get_json_contents(self.list_json_path) - self.assertEqual(context.exception.code, CompareException.INVALID_FILE_ERROR) - mock_error.assert_called_once_with('Json file %s, content is not a dictionary!' % self.list_json_path) - class TestSaver(unittest.TestCase): def setUp(self): diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/parse_tool/test_parse_utils.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/parse_tool/test_parse_utils.py index dbf6765eb185ad155ca3cfbef107dd4c2818ece9..dfec4d20366c6e834939130009dc6d33d1cbe9ed 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/parse_tool/test_parse_utils.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/parse_tool/test_parse_utils.py @@ -1,10 +1,12 @@ import unittest -from unittest.mock import patch import os import shutil +from unittest.mock import patch from pathlib import Path -import numpy as np from collections import namedtuple +from rich.panel import Panel + +import numpy as np from msprobe.pytorch.parse_tool.lib.utils import Util from msprobe.pytorch.parse_tool.lib.parse_exception import ParseException @@ -23,12 +25,16 @@ class TestUtils(unittest.TestCase): os.makedirs(self.npy_file_dir, exist_ok=True) for i in range(3): (Path(self.npy_file_dir) / f'file{i}.npy').touch() + self.empty_dir = './empty_dir' + 
os.makedirs(self.empty_dir, exist_ok=True) def tearDown(self): if os.path.exists(self.base_dir): shutil.rmtree(self.base_dir) if os.path.exists(self.npy_file_dir): shutil.rmtree(self.npy_file_dir) + if os.path.exists(self.empty_dir): + shutil.rmtree(self.empty_dir) def test_path_strip(self): path = "\'\"./path\"\'" @@ -72,6 +78,12 @@ class TestUtils(unittest.TestCase): self.assertTrue(res) + def test_check_npy_files_valid_in_dir_false(self): + (Path(self.npy_file_dir) / f'file4.pt').touch() + res = self.util.check_npy_files_valid_in_dir(self.npy_file_dir) + + self.assertFalse(res) + def test_get_md5_for_numpy(self): obj = np.array([1, 2, 3, 4, 5]) res = self.util.get_md5_for_numpy(obj) @@ -96,17 +108,52 @@ class TestUtils(unittest.TestCase): self.assertFalse(res1) self.assertTrue(res2) + def test_change_filemode_safe(self): + test_path = './test/path' + res = self.util.change_filemode_safe(test_path) + + self.assertIsNone(res) + def test_execute_command(self): res = self.util.execute_command('pwd') self.assertEqual(res, 0) + def test_execute_command_error(self): + res = self.util.execute_command(None) + + self.assertEqual(res, -1) + + @patch('msprobe.pytorch.parse_tool.lib.utils.Panel') + def test_print_panel_none(self, mock_panel): + mock_panel.return_value = None + res = self.util.print_panel('test content') + + self.assertIsNone(res) + + @patch('msprobe.pytorch.parse_tool.lib.utils.Panel') + def test_print_panel_with_fit(self, mock_panel): + self.util.print_panel('test content') + + mock_panel.fit.assert_called_once_with('test content', title='') + + @patch('msprobe.pytorch.parse_tool.lib.utils.Util.print') + def test_print_panel_with_none_fit(self, mock_print): + self.util.print_panel('test content', fit=False) + + mock_print.assert_called_once() + @patch('msprobe.pytorch.parse_tool.lib.utils.subprocess.run') - def test_check_msaccucmp(self, mock_run): + def test_check_msaccucmp_fail(self, mock_run): mock_run.returncode.return_value = 1 + with 
self.assertRaises(ParseException): self.util.check_msaccucmp('./msaccucmp.py') + def test_check_msaccucmp_with_wrong_file(self): + with self.assertRaises(ParseException): + self.util.check_msaccucmp('./aerfaew') + @patch('msprobe.pytorch.parse_tool.lib.utils.Util.npy_info') def test_gen_npy_info_txt(self, mock_npu_info): mock_npu_info.return_value = (1, 1, 1, 1, 1) @@ -133,6 +180,14 @@ class TestUtils(unittest.TestCase): self.assertTrue(res) + def test_check_path_valid_fail(self): + with self.assertRaises(ParseException): + self.util.check_path_valid('non_existent_path') + + def test_check_files_in_path(self): + with self.assertRaises(ParseException): + self.util.check_files_in_path(self.empty_dir) + def test_npy_info(self): var = np.array([1, 2, 3, 4, 5]) res = self.util.npy_info(var) @@ -140,6 +195,20 @@ class TestUtils(unittest.TestCase): self.assertEqual(res, npu_info_res(shape=(5,), dtype=np.int64, max=5, min=1, mean=3)) + def test_npy_info_fail_with_none_nparray(self): + with self.assertRaises(ParseException): + self.util.npy_info(1) + + def test_npy_info_fail_with_none_object(self): + with self.assertRaises(ParseException): + var = np.array([1, 2, 3, 4, 5], dtype=object) + self.util.npy_info(var) + + def test_npy_info_fail_with_size_0(self): + with self.assertRaises(ParseException): + var = np.empty((0,)) + self.util.npy_info(var) + def test_list_file_with_pattern(self): with patch('msprobe.pytorch.parse_tool.lib.utils.Util.check_path_valid', return_value=True), \ patch('msprobe.pytorch.parse_tool.lib.utils.check_file_or_directory_path', return_value=None): @@ -147,15 +216,21 @@ class TestUtils(unittest.TestCase): self.assertEqual(len(res), 3) - def test_check_file_path_format(self): + def test_check_file_path_format_with_dir(self): with self.assertRaises(ParseException): self.util.check_file_path_format(self.base_dir, Const.PKL_SUFFIX) - self.util.check_file_path_format(self.npy_file_dir.join('file1.npy'), Const.PKL_SUFFIX) + + def 
test_check_file_path_format_with_file(self): + with self.assertRaises(ParseException): + self.util.check_file_path_format(os.path.join(self.npy_file_dir, 'file1.npy'), Const.PKL_SUFFIX) def test_check_str_param(self): with self.assertRaises(ParseException): param = 'a' * 256 self.util.check_str_param(param) + + def test_check_str_param_with_special_chars(self): + with self.assertRaises(ParseException): self.util.check_str_param('faworf9 823*(A#&./)') def test_is_subdir_count_equal(self): @@ -164,3 +239,6 @@ class TestUtils(unittest.TestCase): def test_check_positive(self): with self.assertRaises(ParseException): self.util.check_positive(-1) + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_config.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_config.py index 2031a69c94d99f89cc026f64bd4ce05b338bf352..c1b8bac47fda100636b55fbc5ad452c2843e8aaa 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_config.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_config.py @@ -1,9 +1,9 @@ import os +import shutil import unittest -from unittest.mock import MagicMock, patch +from unittest.mock import patch from msprobe.core.common.const import Const -from msprobe.core.common.exceptions import MsprobeException from msprobe.pytorch.pt_config import parse_json_config, parse_task_config, TensorConfig, \ StatisticsConfig, OverflowCheckConfig, FreeBenchmarkCheckConfig, RunUTConfig, GradToolConfig @@ -86,8 +86,8 @@ class TestTensorConfig(unittest.TestCase): def setUp(self): self.json_config = { - "online_run_ut": True, - "host": "localhost", + "online_run_ut": False, + "host": "127.0.0.1", "port": 8080 } self.config = TensorConfig(self.json_config) @@ -105,24 +105,56 @@ class TestTensorConfig(unittest.TestCase): self.config._check_file_format() self.assertIn(str(context.exception), "file_format is invalid") - @patch('os.path.exists') - def test_check_tls_path_config_exists(self, mock_exists): - mock_exists.return_value = 
True - self.config.tls_path = "/valid/path" - try: - self.config._check_tls_path_config() - except Exception as e: - self.fail(f"Unexpected exception raised: {e}") - mock_exists.assert_called_once_with("/valid/path") + @patch('msprobe.pytorch.pt_config.check_crt_valid') + def test_check_online_run_ut(self, mock_check_crt_valid): + mock_check_crt_valid.return_value = True + + self.config.online_run_ut = "True" + with self.assertRaises(Exception) as context: + self.config._check_online_run_ut() + self.assertIn(str(context.exception), f"online_run_ut: {self.config.online_run_ut} is invalid.") + self.config.online_run_ut = True + + self.config.online_run_ut_recompute = "True" + with self.assertRaises(Exception) as context: + self.config._check_online_run_ut() + self.assertIn(str(context.exception), f"online_run_ut_recompute: {self.config.online_run_ut_recompute} is invalid.") + self.config.online_run_ut_recompute = False + + self.config.nfs_path = "./nfs_path" + with self.assertRaises(Exception) as context: + self.config._check_online_run_ut() + self.assertIn(str(context.exception), "[msprobe] 非法文件路径: ") + self.config.nfs_path = "" - @patch('os.path.exists') - def test_check_tls_path_config_not_exists(self, mock_exists): - mock_exists.return_value = False - self.config.tls_path = "/invalid/path" + self.config.tls_path = "./tls_path" with self.assertRaises(Exception) as context: - self.config._check_tls_path_config() - self.assertEqual(str(context.exception), "tls_path: /invalid/path does not exist") - mock_exists.assert_called_once_with("/invalid/path") + self.config._check_online_run_ut() + self.assertIn(str(context.exception), "[msprobe] 非法文件路径: ") + + os.makedirs(self.config.tls_path) + with open(os.path.join(self.config.tls_path, "client.key"), 'w') as file: + file.write("1") + with open(os.path.join(self.config.tls_path, "client.crt"), 'w') as file: + file.write("1") + self.config._check_online_run_ut() + shutil.rmtree(self.config.tls_path) + self.config.tls_path = "" + + 
self.config.host = "invalid_host" + with self.assertRaises(Exception) as context: + self.config._check_online_run_ut() + self.assertIn(str(context.exception), f"host: {self.config.host} is invalid.") + self.config.host = "127.0.0.1" + + self.config.port = -1 + with self.assertRaises(Exception) as context: + self.config._check_online_run_ut() + self.assertIn(str(context.exception), f"port: {self.config.port} is invalid, port range 0-65535.") + self.config.port = 6123 + + # all configs are valid + self.config._check_online_run_ut() class TestStatisticsConfig(unittest.TestCase): @@ -165,10 +197,14 @@ class TestOverflowCheckConfig(unittest.TestCase): "overflow_nums": 2, "check_mode": "all" } - self.invalid_overflow_nums_config = { + self.invalid_overflow_nums_config_str = { "overflow_nums": "not_an_int", "check_mode": "all" } + self.invalid_overflow_nums_config_bool = { + "overflow_nums": True, + "check_mode": "all" + } self.invalid_check_mode_config = { "overflow_nums": 2, "check_mode": "invalid_mode" @@ -179,9 +215,14 @@ class TestOverflowCheckConfig(unittest.TestCase): self.assertEqual(config.overflow_nums, 2) self.assertEqual(config.check_mode, Const.ALL) - def test_invalid_overflow_nums(self): + def test_invalid_overflow_nums_str_type(self): + with self.assertRaises(Exception) as context: + OverflowCheckConfig(self.invalid_overflow_nums_config_str) + self.assertEqual(str(context.exception), "overflow_num is invalid") + + def test_invalid_overflow_nums_bool_type(self): with self.assertRaises(Exception) as context: - OverflowCheckConfig(self.invalid_overflow_nums_config) + OverflowCheckConfig(self.invalid_overflow_nums_config_bool) self.assertEqual(str(context.exception), "overflow_num is invalid") def test_invalid_check_mode(self): diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_debug_save.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_debug_save.py new file mode 100644 index 
0000000000000000000000000000000000000000..534437260e66d9e586d69d557d30e308a9f4f3ee --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_debug_save.py @@ -0,0 +1,80 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from unittest import TestCase +from unittest.mock import patch +import torch + +from msprobe.pytorch import PrecisionDebugger +from msprobe.core.common_config import CommonConfig, BaseConfig + + +class TestPytorchDebuggerSave(TestCase): + def setUp(self): + PrecisionDebugger._instance = None + statistics_task_json = { + "task": "statistics", + "dump_path": "./dump_path", + "rank": [], + "step": [], + "level": "debug", + "enable_dataloader": False, + "statistics": { + "summary_mode": "statistics" + } + } + common_config = CommonConfig(statistics_task_json) + task_config = BaseConfig(statistics_task_json) + with patch("msprobe.pytorch.debugger.precision_debugger.parse_json_config", return_value=(common_config, task_config)): + self.debugger = PrecisionDebugger() + + def test_forward_and_backward(self): + def forward_func(x, y): + PrecisionDebugger.save(x, "x_tensor") + return x * y + x = torch.tensor([1.]) + y = torch.tensor([2.]) + x.requires_grad = True + y.requires_grad = True + result_json = { + "task": "statistics", + "level": "debug", + "framework": "pytorch", + "dump_data_dir": None, + "data": { + "x_tensor.0": { + "type": "torch.Tensor", + "dtype": 
"torch.float32", + "shape": torch.Size([1]), + "Max": 1.0, + "Min": 1.0, + "Mean": 1.0, + "Norm": 1.0, + "requires_grad": True + }, + "x_tensor_grad.0": { + "type": "torch.Tensor", + "dtype": "torch.float32", + "shape": torch.Size([1]), + "Max": 2.0, + "Min": 2.0, + "Mean": 2.0, + "Norm": 2.0, + "requires_grad": False + } + } + } + loss = forward_func(x, y) + loss.backward() + self.assertEqual(self.debugger.service.data_collector.data_writer.cache_debug, result_json) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_service.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_service.py index 8a3e41cbce04210e7e9d119652808836877144b2..6687f3111050ea53e14e62f3afd55ae1eff2b8c0 100644 --- a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_service.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_service.py @@ -1,7 +1,6 @@ import unittest -from unittest.mock import patch, mock_open +from unittest.mock import patch, mock_open, MagicMock -import torch.nn as nn from msprobe.core.common.utils import Const from msprobe.pytorch.debugger.debugger_config import DebuggerConfig from msprobe.pytorch.pt_config import parse_json_config @@ -19,40 +18,86 @@ class TestService(unittest.TestCase): self.config = DebuggerConfig(common_config, task_config, Const.STATISTICS, "./ut_dump", "L1") self.service = Service(self.config) - def test_start(self): + def test_start_success(self): with patch("msprobe.pytorch.service.get_rank_if_initialized", return_value=0), \ patch("msprobe.pytorch.service.Service.create_dirs", return_value=None): self.service.start(None) self.assertEqual(self.service.current_rank, 0) - def test_stop_and_step(self): - with patch("msprobe.core.data_dump.data_collector.DataCollector.write_json", return_value=None): - self.service.stop() + def test_start_fail(self): + self.service.config.rank = [1, 2] + self.service.current_rank = 3 + self.assertIsNone(self.service.start(None)) + + self.service.config.step = [1, 2] + 
self.service.current_iter = 3 + self.assertIsNone(self.service.start(None)) + + @patch("msprobe.core.data_dump.data_collector.DataCollector.write_json") + def test_stop_success(self, mock_write_json): + mock_write_json.return_value = None + self.service.stop() + self.assertFalse(self.service.switch) + def test_stop_fail(self): + self.service.switch = True + + self.service.config.rank = [1, 2] + self.service.current_rank = 3 + res = self.service.stop() + self.assertIsNone(res) + self.assertTrue(self.service.switch) + + self.service.config.step = [1, 2] + self.service.current_iter = 3 + res = self.service.stop() + self.assertIsNone(res) + self.assertTrue(self.service.switch) + + self.service.config.level = "L2" + res = self.service.stop() + self.assertIsNone(res) + self.assertTrue(self.service.switch) + + self.service.should_stop_service = True + res = self.service.stop() + self.assertIsNone(res) + self.assertTrue(self.service.switch) + + def test_step_success(self): self.service.step() self.assertEqual(self.service.current_iter, 1) - def test_register_hook_new(self): - class TestModule(nn.Module): - def __init__(self) -> None: - super().__init__() - self.linear = nn.Linear(in_features=8, out_features=4) - - def forward(self, x): - x = self.linear(x) - return x + def test_step_fail(self): + self.service.should_stop_service = True + self.assertIsNone(self.service.step()) - self.service.model = TestModule() + def test_register_module_hook_with_level0(self): + self.service.model = MagicMock() + self.service.build_hook = MagicMock() self.config.level = "L0" with patch("msprobe.pytorch.service.logger.info_on_rank_0") as mock_logger, \ - patch("msprobe.pytorch.service.remove_dropout", return_value=None): - self.service.register_hook_new() - self.assertEqual(mock_logger.call_count, 2) + patch("msprobe.pytorch.service.ModuleProcesser.register_module_hook") as mock_register_module_hook: + self.service.register_module_hook() + self.assertEqual(mock_logger.call_count, 1) + 
mock_register_module_hook.assert_called_once() + + def test_register_api_hook_with_level1(self): + self.service.build_hook = MagicMock() + self.config.level = "L1" + with patch("msprobe.pytorch.service.logger.info_on_rank_0") as mock_logger, \ + patch("msprobe.pytorch.service.api_register.initialize_hook") as mock_init_hook, \ + patch("msprobe.pytorch.service.api_register.api_modularity") as mock_api_modularity: + self.service.register_api_hook() + self.assertEqual(mock_logger.call_count, 1) + mock_init_hook.assert_called_once() + mock_api_modularity.assert_called_once() def test_create_dirs(self): with patch("msprobe.pytorch.service.create_directory"), \ - patch("msprobe.core.data_dump.data_collector.DataCollector.update_dump_paths"): + patch("msprobe.core.data_dump.data_collector.DataCollector.update_dump_paths"), \ + patch("msprobe.core.data_dump.data_collector.DataCollector.initialize_json_file"): self.service.create_dirs() self.assertEqual(self.service.dump_iter_dir, "./ut_dump/step0") @@ -75,22 +120,31 @@ class TestService(unittest.TestCase): self.assertFalse(self.service.switch) self.assertTrue(self.service.should_stop_service) - def test_should_execute_hook(self): - self.service.switch = True - self.service.data_collector = None - self.assertTrue(self.service.should_execute_hook()) - + def test_should_execute_hook_return_false(self): + module = MagicMock() self.service.switch = False - self.assertFalse(self.service.should_execute_hook()) + self.assertFalse(self.service.should_execute_hook("Module", module, True)) + self.assertFalse(self.service.should_execute_hook("api", module, True)) + + self.service.switch = True + module.forward_data_collected = False + self.assertFalse(self.service.should_execute_hook("api", module, False)) - class DataProcessor: - def __init__(self): - self.is_terminated = True + self.service.inner_switch = True + self.assertFalse(self.service.should_execute_hook("Module", module, True)) - class DataCollector: - def __init__(self): - 
self.data_processor = DataProcessor() + self.service.inner_switch = False + self.service.data_collector = None + self.assertFalse(self.service.should_execute_hook("Module", module, True)) + def test_should_execute_hook_return_true(self): + module = MagicMock() self.service.switch = True - self.service.data_collector = DataCollector() - self.assertFalse(self.service.should_execute_hook()) + self.service.inner_switch = False + self.service.data_collector = MagicMock() + self.service.data_collector.data_processor = MagicMock() + self.service.data_collector.data_processor.is_terminated = False + self.assertTrue(self.service.should_execute_hook("Module", module, True)) + + module.forward_data_collected = True + self.assertTrue(self.service.should_execute_hook("api", module, False)) diff --git a/debug/accuracy_tools/msprobe/test/resources/common/cell_mapping.yaml b/debug/accuracy_tools/msprobe/test/resources/common/cell_mapping.yaml new file mode 100644 index 0000000000000000000000000000000000000000..77592e66cda5aaac8ac68c1c86f2b2a9b4dd55ef --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/common/cell_mapping.yaml @@ -0,0 +1,2 @@ +fc1.Dense: fc1.Linear +conv1.Conv2d: conv3.Conv2d \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/config.json b/debug/accuracy_tools/msprobe/test/resources/config.json new file mode 100644 index 0000000000000000000000000000000000000000..a61fd5ca83a787913413ba8aac589cb50dfd13e3 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/config.json @@ -0,0 +1,50 @@ +{ + "task": "statistics", + "dump_path": "./dump_path", + "rank": [], + "step": [], + "level": "L1", + "seed": 1234, + "is_deterministic": false, + "enable_dataloader": false, + "acl_config": "", + "tensor": { + "scope": [], + "list":[], + "data_mode": ["all"], + "backward_input": [], + "file_format": "npy" + }, + "statistics": { + "scope": [], + "list":[], + "data_mode": ["all"], + "summary_mode": "statistics" + }, + 
"overflow_check": { + "overflow_nums": 1, + "check_mode":"all" + }, + "run_ut": { + "white_list": [], + "black_list": [], + "error_data_path": "./" + }, + "grad_probe": { + "grad_level": "L1", + "param_list": [], + "bounds": [-1, 0, 1] + }, + "free_benchmark": { + "scope": [], + "list": [], + "fuzz_device": "npu", + "pert_mode": "improve_precision", + "handler_type": "check", + "fuzz_level": "L1", + "fuzz_stage": "forward", + "if_preheat": false, + "preheat_step": 15, + "max_sample": 20 + } +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/layer_mapping/layer_mapping.yaml b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/layer_mapping.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a928b0c1de1f75daafbb96a46a165f1e758131a5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/layer_mapping.yaml @@ -0,0 +1,15 @@ +TopLayer: + network_with_loss: module + +VocabParallelEmbedding: + logical_or: __or__ + reduce_from_mp_region.all_reduce: all_reduce + +ParallelTransformerLayer: + attention: self_attention + +ParallelAttention: + flash_attention_score: core_attention_flash.npu_fusion_attention + +FusedRMSNorm: + RmsNorm: npu_rms_norm diff --git a/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/construct.json b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..f1d4b05a6cc6b91539d0b0d95b2a6710bed5ae53 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/construct.json @@ -0,0 +1,15 @@ +{ + "Tensor.__add__.0.forward": "", + "Tensor.__bool__.1.forward": "", + "Tensor.__add__.1.forward": "", + "Mint.logical_or.0.forward": "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + "Distributed.all_reduce.0.forward": 
"Cell.network_with_loss.module.language_model.embedding.word_embeddings.reduce_from_mp_region.ReduceFromModelParallelRegion.forward.0", + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": "Cell.network_with_loss.module.language_model.embedding.Embedding.forward.0", + "Primitive.norm.RmsNorm.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.forward.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": "Cell.network_with_loss.module.language_model.encoder.ParallelTransformer.forward.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.input_norm.FusedRMSNorm.forward.0": "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": "Cell.network_with_loss.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Mint.cos.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0", + "Functional.flash_attention_score.0.forward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0", + "Functional.flash_attention_score.0.backward": "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.backward.0" +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/dump.json b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..153d84e7d117b5be89dfdb522edc39dc066929cb --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/dump.json @@ -0,0 +1,40 @@ +{ + "task": "statistics", + "level": "mix", + "framework": "mindspore", + 
"dump_data_dir": null, + "data": { + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": { + "input_args": [ + { + "type": "mindspore.Tensor", + "dtype": "Int32", + "shape": [ + 1, + 4096 + ], + "Max": 165558.0, + "Min": 0.0, + "Mean": 16050.638671875, + "Norm": 2257767.75 + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "mindspore.Tensor", + "dtype": "BFloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.000316619873046875, + "Norm": 2512.0 + } + ] + } + } +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/stack.json b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..17c9286cd048318275de92aba0b72dc40f92140f --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/mindspore/stack.json @@ -0,0 +1,290 @@ +{ + "Tensor.__add__.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 487, in format_index_tensor, \n return F.select(index < 0, index + format_dims, index)", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 140, in 
data_update_by_ops, \n new_index = format_index_tensor(new_index, (None, F.shape(data)[:F.shape(new_index)[-1]]))", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 102, in data_update, \n data = data_update_by_ops(transfer_type, arg, data, new_index, origin_data, value)", + "File /path_to_package/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py, line 203, in _tensor_getitem, \n return data_update(tensor_update_types, tensor_update_args, self, new_index)", + "File /path_to_package/site-packages/mindspore/common/tensor.py, line 483, in __getitem__, \n out = tensor_operator_registry.get('__getitem__')(self, index)", + "File /path_to_net/PanGu_ms/pangu/training/utils.py, line 65, in get_ltor_reset_masks_and_position_ids, \n eod_index = position_ids[b, data[b] == eod_token]", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 113, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 724, in train, \n data = get_batch_func(train_dataset_dict_iterator)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Tensor.__bool__.1.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File 
/path_to_net/PanGu_ms/pangu/training/utils.py, line 75, in get_ltor_reset_masks_and_position_ids, \n if i == pre_eod_idx:", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 113, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 724, in train, \n data = get_batch_func(train_dataset_dict_iterator)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Tensor.__add__.1.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/training/utils.py, line 78, in get_ltor_reset_masks_and_position_ids, \n attention_mask[b, 0, (i + 1):, :(i + 1)] = 0", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 113, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 724, in train, \n data = get_batch_func(train_dataset_dict_iterator)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Mint.logical_or.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = 
self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/layers.py, line 1145, in construct, \n input_mask = mint.logical_or((input_ < self.vocab_start_index), (input_ >= self.vocab_end_index))", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2455, in _backward_hook_construct, \n outputs = self.construct(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 226, in construct, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File 
/path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 554, in construct, \n text_embedding_out = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, 
**kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Distributed.all_reduce.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", 
+ "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/mappings.py, line 241, in construct, \n output = comm_func.all_reduce(input_, group=self.tp_group)[0]", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 785, in _call_custom_bprop, \n output = self.construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2450, in _backward_hook_construct, \n outputs = self._call_custom_bprop(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/tensor_parallel/layers.py, line 1168, in construct, \n output = self.reduce_from_mp_region(output_parallel)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2455, in _backward_hook_construct, \n outputs = self.construct(outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 226, in construct, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 554, in construct, \n text_embedding_out = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, 
**kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Cell.network_with_loss.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 226, in construct, \n words_embeddings 
= self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 554, in construct, \n text_embedding_out = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in 
_backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Primitive.norm.RmsNorm.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 198, in ParallelTransformerLayerForward, \n norm_output = self.input_norm(hidden_states)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File 
/path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File 
/path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Cell.network_with_loss.module.language_model.encoder.layers.0.attention.ParallelAttention.forward.0": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Mint.cos.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = 
super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/rotary_pos_embedding.py, line 123, in _apply_fused_rotary_pos_emb, \n cos_ = mint.cos(freqs).to(t.dtype)", + "File /path_to_net/PanGu_ms/pangu/model/rotary_pos_embedding.py, line 136, in apply_rotary_pos_emb, \n return _apply_fused_rotary_pos_emb(t, freqs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 619, in construct, \n query = apply_rotary_pos_emb(query, q_pos_emb, self.config)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n 
outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = 
self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ], + "Functional.flash_attention_score.0.forward": [ + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 505, in _run_construct, \n output = self._run_forward_hook(inputs, output)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File 
/path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/hook_cell.py, line 48, in __call__, \n out = super(HOOKCell, self).__call__(*args, **kwargs)", + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/wrap_api.py, line 98, in api_function, \n return ApiTemplate(api_name, api_dict, prefix, hook)(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 637, in construct, \n output = ops.flash_attention_score(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/model/transformer.py, line 201, in ParallelTransformerLayerForward, \n attention_output, _ = self.attention(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/transformer.py, line 1454, in construct, \n hidden_states = layer(", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + 
"File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/transformer/language_model.py, line 579, in construct, \n encoder_output = self.encoder(encoder_input,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/gpt_model.py, line 101, in construct, \n lm_output = self.language_model(tokens,", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 733, in __call__, \n return self._run_construct(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/distributed/distributed_data_parallel.py, line 171, in construct, \n output = self.module(*inputs, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 2453, in _backward_hook_construct, \n outputs = self.construct(*outputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 494, in _run_construct, \n output = self._backward_hook_construct(*inputs, **kwargs)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 747, in _complex_call, \n output = self._run_construct(*args, **kwargs)", + "File 
/path_to_package/site-packages/mindspore/nn/cell.py, line 735, in __call__, \n return self._complex_call(*args, **kwargs)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 113, in run_forward, \n output_tensor = model(*input_data)", + "File /path_to_net/third_party/dynamic-parallel/mindformers/experimental/parallel_core/pynative/pipeline_parallel/schedules.py, line 760, in forward_backward_pipelining_without_interleaving, \n micro_input_data = run_forward(*micro_input_data,", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 444, in forward_backward_with_pipelining, \n loss, logits, grads = forward_backward_pipelining_without_interleaving(", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 580, in construct, \n (loss, _), grads = self.forward_backward_func(*inputs_tuple, loss_scale=current_step_loss_scale, **inputs_dict)", + "File /path_to_package/site-packages/mindspore/nn/cell.py, line 731, in __call__, \n return self.construct(*args, **kwargs)", + "File /path_to_net/PanGu_ms/pangu/pynative/training/training.py, line 725, in train, \n loss, is_finite, loss_scale, learning_rate, _ = train_one_step_cell(**data)", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 329, in main, \n train(", + "File /path_to_net/PanGu_ms/pretrain_gpt.py, line 342, in , \n main()" + ] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/construct.json b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..e99f6e1729808261e23b92ae7ace90bc81743853 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/construct.json @@ -0,0 +1,14 @@ +{ + "Tensor.__add__.0.forward": "", + "Tensor.__add__.1.forward": "", + "Tensor.__or__.0.forward": 
"Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + "Distributed.all_reduce.0.forward": "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0", + "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": "Module.module.module.language_model.embedding.Embedding.forward.0", + "NPU.npu_rms_norm.0.forward": "Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0", + "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": "Module.module.module.language_model.encoder.ParallelTransformer.forward.0", + "Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0": "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0": "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0", + "Torch.cos.0.forward": "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0", + "NPU.npu_fusion_attention.0.forward": "Module.module.module.language_model.encoder.layers.0.self_attention.core_attention_flash.FlashSelfAttention.forward.0", + "NPU.npu_fusion_attention.0.backward": "Module.module.module.language_model.encoder.layers.0.self_attention.core_attention_flash.FlashSelfAttention.backward.0" +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/dump.json b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..02239176a9d690c4ce70c06cc6ab117a3c122811 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/dump.json @@ -0,0 +1,42 @@ +{ + "task": "statistics", + "level": "mix", + "framework": "pytorch", 
+ "dump_data_dir": null, + "data": { + "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.int64", + "shape": [ + 1, + 4096 + ], + "Max": 165558.0, + "Min": 0.0, + "Mean": 16050.638671875, + "Norm": 2257767.75, + "requires_grad": false + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.bfloat16", + "shape": [ + 1, + 4096, + 6144 + ], + "Max": 2.6875, + "Min": -2.640625, + "Mean": 0.000316619873046875, + "Norm": 2512.0, + "requires_grad": true + } + ] + } + } +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/stack.json b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..7a8f68284215cd5a14e436661c57150bc012f61a --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/layer_mapping/pytorch/stack.json @@ -0,0 +1,322 @@ +{ + "Tensor.__add__.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py, line 61, in tensor_op_template, \n return TensorOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/utils.py, line 176, in get_ltor_reset_masks_and_position_ids, \n attention_mask[b, 0, (i + 1):, :(i + 1)] = 0", + "File /path_to_net/PanGu/pretrain_gpt.py, line 174, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 243, in forward_step, \n tokens, labels, loss_mask, attention_mask, position_ids = get_batch(", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 
1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Tensor.__add__.1.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py, line 61, in tensor_op_template, \n return TensorOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/utils.py, line 176, in get_ltor_reset_masks_and_position_ids, \n attention_mask[b, 0, (i + 1):, :(i + 1)] = 0", + "File /path_to_net/PanGu/pretrain_gpt.py, line 174, in get_batch, \n attention_mask, position_ids = get_ltor_reset_masks_and_position_ids(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 243, in forward_step, \n tokens, labels, loss_mask, attention_mask, position_ids = get_batch(", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + 
"File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Tensor.__or__.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py, line 61, in tensor_op_template, \n return TensorOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/tensor_parallel/layers.py, line 19, in vocab_parallel_embedding_forward, \n input_mask = (input_ < self.vocab_start_index) | \\", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 217, in forward, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 473, in forward, \n encoder_input = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = 
forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + 
"File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Distributed.all_reduce.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py, line 68, in distributed_op_template, \n return DistributedOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 24, in _reduce, \n torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group())", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 223, in forward, \n return _reduce(input_)", + "File /path_to_package/site-packages/torch/autograd/function.py, line 539, in apply, \n return super().apply(*args, **kwargs) # type: ignore[misc]", + "File /path_to_net/third_party/Megatron-LM/megatron/core/tensor_parallel/mappings.py, line 436, in reduce_from_tensor_model_parallel_region, \n return _ReduceFromModelParallelRegion.apply(input_)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/tensor_parallel/layers.py, line 35, in vocab_parallel_embedding_forward, \n output = reduce_from_tensor_model_parallel_region(output_parallel)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 217, in forward, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 473, in forward, \n encoder_input = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, 
line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Module.module.module.language_model.embedding.word_embeddings.VocabParallelEmbedding.forward.0": [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 217, in forward, \n words_embeddings = self.word_embeddings(input_ids)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 473, in forward, \n encoder_input = self.embedding(enc_input_ids, enc_position_ids,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = 
model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "NPU.npu_rms_norm.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_npu_custom.py, line 78, in npu_op_template, \n return NpuOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/fusions/rms_norm.py, line 26, in wrapper, \n return torch_npu.npu_rms_norm(x, self.weight, epsilon=self.eps)[0]", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1194, in forward, \n norm_output = self.input_norm(hidden_states)", + "File 
/path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = 
self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + 
"Module.module.module.language_model.encoder.layers.0.input_norm.RMSNorm.forward.0": [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1194, in forward, \n norm_output = self.input_norm(hidden_states)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n 
iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Module.module.module.language_model.encoder.layers.0.self_attention.ParallelAttention.forward.0": [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1198, in forward, \n self.self_attention(", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File 
/path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File 
/path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Module.module.module.language_model.encoder.layers.0.ParallelTransformerLayer.forward.0": [ + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n 
iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "Torch.cos.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_torch.py, line 76, in torch_op_template, \n return TorchOPTemplate(op_name, hook)(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/fusions/rotary_pos_embedding.py, line 16, in wrapper, \n cos_ = torch.cos(freqs).to(t.dtype)", + "File /path_to_net/PanGu/pangu/core/fusions/rotary_pos_embedding.py, line 13, in wrapper, \n t = fn(t, freqs, rotary_interleaved)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/models/common/embeddings/rotary_pos_embedding.py, line 313, in apply_rotary_pos_emb, \n return apply_rotary_pos_emb_bshd(t, freqs, rotary_interleaved=config.rotary_interleaved)", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 738, in parallel_attention_forward, \n query_layer = apply_rotary_pos_emb(query_layer, q_pos_emb, self.config)", + "File /path_to_net/PanGu/pangu/model/transformer.py, line 97, in wrapper, \n return fn(self, hidden_states, attention_mask,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1198, in forward, \n self.self_attention(", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result 
= forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n 
return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ], + "NPU.npu_fusion_attention.0.forward": [ + "File /path_to_package/mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_npu_custom.py, line 78, in npu_op_template, \n return NpuOPTemplate(op_name, hook)(*args, **kwargs)", + "File 
/path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 525, in flash_self_attention_forward, \n output = torch_npu.npu_fusion_attention(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 757, in parallel_attention_forward, \n context_layer = self.core_attention_flash(query_layer, key_layer, value_layer, attention_mask)", + "File /path_to_net/PanGu/pangu/model/transformer.py, line 97, in wrapper, \n return fn(self, hidden_states, attention_mask,", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1198, in forward, \n self.self_attention(", + "File /path_to_package/third_party/MindSpeed/mindspeed/core/transformer/transformer.py, line 21, in row_parallel_forward, \n output = forward_func(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/transformer.py, line 1832, in forward, \n hidden_states = layer(", + "File /path_to_package/third_party/MindSpeed/mindspeed/model/transformer.py, line 349, in wrapper, \n return fn(self, hidden_states, attention_mask, **kwargs)", + "File 
/path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/language_model.py, line 500, in forward, \n encoder_output = self.encoder(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/gpt_model.py, line 86, in forward, \n lm_output = self.language_model(", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/legacy/model/module.py, line 190, in forward, \n outputs = self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, **kwargs)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py, line 179, in forward, \n return self.module(*inputs, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1568, in _call_impl, \n result = forward_call(*args, **kwargs)", + "File /path_to_package/site-packages/torch/nn/modules/module.py, line 1518, in _wrapped_call_impl, \n return self._call_impl(*args, 
**kwargs)", + "File /path_to_net/PanGu/pretrain_gpt.py, line 247, in forward_step, \n output_tensor = model(tokens, position_ids, attention_mask,", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 193, in forward_step, \n output_tensor, loss_func = forward_step_func(data_iterator, model)", + "File /path_to_net/third_party/Megatron-LM/megatron/core/pipeline_parallel/schedules.py, line 1225, in forward_backward_pipelining_without_interleaving, \n output_tensor = forward_step(", + "File /path_to_net/third_party/Megatron-LM/megatron/training/training.py, line 624, in train_step, \n losses_reduced = forward_backward_func(", + "File /path_to_net/PanGu/pangu/training/auto_parallel_wrapper.py, line 34, in wrapper, \n ret = train_step(*args, **kwargs)", + "File /path_to_net/PanGu/pangu/training/training.py, line 495, in train, \n train_step(forward_step_func,", + "File /path_to_net/PanGu/pangu/training/training.py, line 303, in pretrain, \n iteration, num_floating_point_operations_so_far = train(", + "File /path_to_net/PanGu/pretrain_gpt.py, line 372, in main, \n pretrain(train_valid_test_datasets_provider,", + "File /path_to_net/PanGu/pretrain_gpt.py, line 392, in , \n main()" + ] +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/builder/construct.json b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..f38780de744675a62cee03c58fb4682448c210a6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/construct.json @@ -0,0 +1,4 @@ +{ + "Tensor1": "Module1", + "Module1": null +} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/builder/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..9e26dfeeb6e641a33dae4961196235bdb965b21b --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/dump.json @@ -0,0 +1 @@ +{} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/builder/stack.json b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/stack.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_graph_builder.py b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_graph_builder.py index 3d296355fc7b65304511ec53afb2c55753ee0aea..706dc8bf82e59f413c3fd559a39af89c6a70be47 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_graph_builder.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_graph_builder.py @@ -1,3 +1,4 @@ +import os import unittest from unittest.mock import MagicMock, patch from msprobe.visualization.builder.graph_builder import GraphBuilder, Graph, GraphExportConfig @@ -8,8 +9,9 @@ from msprobe.visualization.graph.base_node import BaseNode class TestGraphBuilder(unittest.TestCase): def setUp(self): - self.construct_path = "step/rank/construct.json" - self.data_path = "step/rank/dump.json" + self.construct_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "construct.json") + self.data_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "dump.json") + self.stack_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "stack.json") self.model_name = "TestModel" self.graph = Graph(self.model_name) self.graph_b = Graph(self.model_name) @@ -22,14 +24,10 @@ class TestGraphBuilder(unittest.TestCase): "Module1": {"data": "data for Module1"}, "Tensor1": {"data": "data for Tensor1"} } + self.stack_dict = {} - @patch('msprobe.visualization.builder.graph_builder.load_json_file') - 
@patch('msprobe.visualization.builder.graph_builder.load_data_json_file') - def test_build(self, mock_load_data_json_file, mock_load_json_file): - mock_load_data_json_file.return_value = self.data_dict - mock_load_json_file.return_value = self.construct_dict - - graph = GraphBuilder.build(self.construct_path, self.data_path, self.model_name) + def test_build(self): + graph = GraphBuilder.build(self.construct_path, self.data_path, self.stack_path, self.model_name) self.assertIsNotNone(graph) self.assertIsInstance(graph, Graph) self.assertEqual(len(graph.node_map), 3) @@ -42,7 +40,7 @@ class TestGraphBuilder(unittest.TestCase): @patch('msprobe.visualization.graph.node_op.NodeOp.get_node_op') @patch('msprobe.visualization.builder.msprobe_adapter.get_input_output', return_value=([], [])) def test__init_nodes(self, mock_get_input_output, mock_get_node_op): - GraphBuilder._init_nodes(self.graph, self.construct_dict, self.data_dict) + GraphBuilder._init_nodes(self.graph, self.construct_dict, self.data_dict, self.stack_dict) mock_get_node_op.assert_any_call("Tensor1") mock_get_node_op.assert_any_call("Module1") self.assertIs(self.graph.root, self.graph.get_node("TestModel")) @@ -50,7 +48,8 @@ class TestGraphBuilder(unittest.TestCase): def test__create_or_get_node(self): node_op = MagicMock() data_dict = {"node1": {}} - node = GraphBuilder._create_or_get_node(self.graph, data_dict, node_op, "node1") + stack_dict = {} + node = GraphBuilder._create_or_get_node(self.graph, [data_dict, stack_dict], node_op, "node1") self.assertIn("node1", self.graph.node_map) self.assertEqual(node.input_data, {}) self.assertEqual(node.output_data, {}) @@ -67,6 +66,17 @@ class TestGraphBuilder(unittest.TestCase): self.assertEqual(node_id_b, 'Module.root.backward.0') node_id_c = GraphBuilder._handle_backward_upnode_missing(construct_dict, 'Module.module.c.backward.0', None) self.assertIsNone(node_id_c) + construct_dict = {'Module.module.a.forward': 'Module.root.forward', 
'Module.module.a.backward': None, + 'Module.root.forward': None, 'Module.root.backward': None, + 'Module.module.b.forward': 'Module.root.forward', + 'Module.module.b.backward': 'Module.root.backward', 'Module.module.c.backward': None} + node_id_a = GraphBuilder._handle_backward_upnode_missing(construct_dict, 'Module.module.a.backward', None) + self.assertEqual(node_id_a, 'Module.root.backward') + node_id_b = GraphBuilder._handle_backward_upnode_missing(construct_dict, 'Module.module.b.backward', + 'Module.root.backward') + self.assertEqual(node_id_b, 'Module.root.backward') + node_id_c = GraphBuilder._handle_backward_upnode_missing(construct_dict, 'Module.module.c.backward', None) + self.assertIsNone(node_id_c) def test__collect_apis_between_modules_only_apis(self): graph = Graph('TestNet') diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_msprobe_adapter.py b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_msprobe_adapter.py index 9286dea5fdf42ecfca5f72be098c2fbafb6c1482..bee32a34a0509d5559b47d7a1625f618dc132d4e 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_msprobe_adapter.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/builder/test_msprobe_adapter.py @@ -8,8 +8,7 @@ from msprobe.visualization.builder.msprobe_adapter import ( format_node_data, compare_node, _format_decimal_string, - _format_data, - compare_mapping_data + _format_data ) from msprobe.visualization.utils import GraphConst import torch @@ -52,7 +51,6 @@ class TestMsprobeAdapter(unittest.TestCase): def test_format_node_data(self): data_dict = {'node1': {'data_name': 'data1', 'full_op_name': 'op1'}} result = format_node_data(data_dict) - self.assertNotIn('data_name', result['node1']) self.assertNotIn('requires_grad', result['node1']) @patch('msprobe.visualization.builder.msprobe_adapter.get_accuracy') @@ -84,19 +82,6 @@ class TestMsprobeAdapter(unittest.TestCase): self.assertEqual(data_dict['value4'], 'inf') 
self.assertEqual(data_dict['value5'], '-1') - all_none_dict = {'a': None, 'b': None, 'c': None, 'd': None, 'e': None} + all_none_dict = {'Max': None, 'Min': None, 'Mean': None, 'Norm': None, 'type': None} _format_data(all_none_dict) self.assertEqual({'value': 'null'}, all_none_dict) - - def test_compare_mapping_data(self): - dict1 = {'a': {'shape': [1, 2, 3]}, 'b': {'shape': [1, 2, 3]}, 'c': {'shape': [1, 2, 3]}} - dict2 = {'a': {'shape': [1, 2, 3]}, 'b': {'shape': [1, 2, 3]}, 'c': {'shape': [1, 2, 3]}} - dict3 = {'a': {'shape': [1, 2, 3]}, 'b': {'shape': [1, 2, 3]}} - dict4 = {'a': {'shape': [2, 1, 3]}, 'b': {'shape': [1, 2, 3]}} - dict5 = {'a': {'shape': [2, 2, 3]}, 'b': {'shape': [1, 2, 3]}} - dict6 = {'a': {'type': 'str'}} - self.assertTrue(compare_mapping_data(dict1, dict2)) - self.assertTrue(compare_mapping_data(dict1, dict3)) - self.assertTrue(compare_mapping_data(dict1, dict4)) - self.assertFalse(compare_mapping_data(dict1, dict5)) - self.assertTrue(compare_mapping_data(dict1, dict6)) diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_graph_comparator.py b/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_graph_comparator.py index 4bc3ef5db272b4def02feafbe1886601e44d1581..f4d68ccb530919dbdfedaa12bea716b2c70e278d 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_graph_comparator.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_graph_comparator.py @@ -1,4 +1,6 @@ +import os import unittest +from dataclasses import dataclass from unittest.mock import patch from unittest.mock import MagicMock from msprobe.visualization.compare.graph_comparator import GraphComparator @@ -6,6 +8,16 @@ from msprobe.visualization.graph.graph import Graph, BaseNode, NodeOp from msprobe.visualization.utils import GraphConst +@dataclass +class Args: + input_path: str = None + output_path: str = None + layer_mapping: str = None + framework: str = None + overflow_check: bool = False + fuzzy_match: 
bool = False + + class TestGraphComparator(unittest.TestCase): def setUp(self): @@ -27,7 +39,7 @@ class TestGraphComparator(unittest.TestCase): mock_load_data_json_file.return_value = "data_dict" mock_load_json_file.return_value = "construct_dict" mock_get_compare_mode.return_value = GraphConst.SUMMARY_COMPARE - self.comparator = GraphComparator(self.graphs, self.dump_path_param, self.output_path) + self.comparator = GraphComparator(self.graphs, self.dump_path_param, Args(output_path=self.output_path)) self.comparator._parse_param(self.dump_path_param, self.output_path) self.assertEqual(self.comparator.dump_path_param, { @@ -45,7 +57,7 @@ class TestGraphComparator(unittest.TestCase): mock_load_data_json_file.return_value = "data_dict" mock_load_json_file.return_value = "construct_dict" mock_get_compare_mode.return_value = GraphConst.SUMMARY_COMPARE - comparator = GraphComparator(self.graphs, self.dump_path_param, self.output_path) + comparator = GraphComparator(self.graphs, self.dump_path_param, Args(output_path=self.output_path)) comparator._compare_nodes = MagicMock() comparator._postcompare = MagicMock() @@ -64,7 +76,7 @@ class TestGraphComparator(unittest.TestCase): node = MagicMock() compare_result_list = [("output1", "data1"), ("input1", "data2")] - comparator = GraphComparator(self.graphs, self.dump_path_param, self.output_path) + comparator = GraphComparator(self.graphs, self.dump_path_param, Args(output_path=self.output_path)) comparator.ma = MagicMock() comparator.ma.prepare_real_data.return_value = True @@ -88,7 +100,7 @@ class TestGraphComparator(unittest.TestCase): mock_run_real_data.return_value = mock_df mock_get_csv_df.return_value = mock_df mock_get_node_error_status.return_value = True - comparator = GraphComparator(self.graphs, self.dump_path_param, self.output_path) + comparator = GraphComparator(self.graphs, self.dump_path_param, Args(output_path=self.output_path)) comparator.ma = MagicMock() comparator.ma.compare_mode = 
GraphConst.REAL_DATA_COMPARE comparator._handle_api_collection_index = MagicMock() @@ -98,7 +110,6 @@ class TestGraphComparator(unittest.TestCase): comparator._postcompare() comparator._handle_api_collection_index.assert_called_once() - comparator.ma.add_error_key.assert_called() @patch('msprobe.visualization.compare.graph_comparator.get_compare_mode') @patch('msprobe.visualization.compare.graph_comparator.load_json_file') @@ -107,7 +118,7 @@ class TestGraphComparator(unittest.TestCase): mock_load_data_json_file.return_value = "data_dict" mock_load_json_file.return_value = "construct_dict" mock_get_compare_mode.return_value = GraphConst.SUMMARY_COMPARE - comparator = GraphComparator(self.graphs, self.dump_path_param, self.output_path) + comparator = GraphComparator(self.graphs, self.dump_path_param, Args(output_path=self.output_path)) apis = BaseNode(NodeOp.api_collection, 'Apis_Between_Modules.0') api1 = BaseNode(NodeOp.function_api, 'Tensor.a.0') api1.data = {GraphConst.JSON_INDEX_KEY: 0.9} @@ -117,7 +128,7 @@ class TestGraphComparator(unittest.TestCase): sub_nodes = [BaseNode(NodeOp.module, 'Module.a.0'), apis, BaseNode(NodeOp.module, 'Module.a.1')] comparator.graph_n.root.subnodes = sub_nodes comparator._handle_api_collection_index() - self.assertEqual(comparator.graph_n.root.subnodes[1].data.get(GraphConst.JSON_INDEX_KEY), 0.6) + self.assertEqual(comparator.graph_n.root.subnodes[1].data.get(GraphConst.JSON_INDEX_KEY), 0.9) @patch('msprobe.visualization.builder.msprobe_adapter.compare_node') @patch('msprobe.visualization.graph.graph.Graph.match') @@ -134,12 +145,12 @@ class TestGraphComparator(unittest.TestCase): mock_get_compare_mode.return_value = GraphConst.SUMMARY_COMPARE mock_mapping_match.return_value = (node_b, [], []) mock_compare_node.return_value = ['result'] - comparator = GraphComparator(self.graphs, self.dump_path_param, self.output_path) - comparator.mapping_config = True + comparator = GraphComparator(self.graphs, self.dump_path_param, 
Args(output_path=self.output_path)) + comparator.mapping_dict = True comparator._compare_nodes(node_n) self.assertEqual(node_n.matched_node_link, ['Tensor.b.0']) self.assertEqual(node_b.matched_node_link, ['Tensor.a.0']) - comparator.mapping_config = False + comparator.mapping_dict = False node_n = BaseNode(NodeOp.function_api, 'Tensor.a.0') node_b = BaseNode(NodeOp.function_api, 'Tensor.a.0') mock_match.return_value = (node_b, []) @@ -147,3 +158,33 @@ class TestGraphComparator(unittest.TestCase): self.assertEqual(node_n.matched_node_link, ['Tensor.a.0']) self.assertEqual(node_b.matched_node_link, ['Tensor.a.0']) + def test_add_compare_result_node(self): + compare_result_list = [ + ['Module.module.Float16Module.forward.0.input.0', 'Module.module.Float16Module.forward.0.input.0', + 'torch.int64', 'torch.int64', [4, 4096], [4, 4096], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 30119.0, 1.0, 8466.25, 1786889.625, 30119.0, 1.0, 8466.25, 1786889.625, '', '', ], + ['Module.module.Float16Module.forward.0.input.1', 'Module.module.Float16Module.forward.0.input.1', + 'torch.int64', 'torch.int64', [4, 4096], [4, 4096], 0.0, 0.0, 0.0, 0.0, '0.0%', 'N/A', '0.0%', '0.0%', + 4095.0, 0.0, 2047.5, 302642.375, 4095.0, 0.0, 2047.5, 302642.375, '', '', 'None'], + ['Module.module.Float16Module.forward.0.input.2', 'Module.module.Float16Module.forward.0.input.2', + 'torch.bool', 'torch.bool', [1, 1, 4096, 4096], [1, 1, 4096, 4096], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', True, False, None, None, True, False, None, None, '', '', 'None'], + ['Module.module.Float16Module.forward.0.input.labels', 'Module.module.Float16Module.forward.0.input.labels', + 'torch.int64', 'torch.int64', [4, 4096], [4, 4096], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 30119.0, 1.0, 8460.7685546875, 1786117.625, 30119.0, 1.0, 8460.7685546875, 1786117.625, '', '', 'None'], + ['Module.module.Float16Module.forward.0.output.0', 'Module.module.Float16Module.forward.0.output.0', + 
'torch.float32', 'torch.float32', [4, 4096], [4, 4096], 7.7903289794921875, -4.33783483505249, + 1.8622245788574219, 256.28173828125, '73.29288957533336%', '42.86585137147556%', '17.943317141609008%', + '19.29155502636134%', 18.41936683654785, 5.781723499298096, 12.240598678588867, 1584.7476806640625, + 10.629037857055664, 10.119558334350586, 10.378374099731445, 1328.4659423828125, '', '', 'None']] + node = BaseNode(NodeOp.module, 'Module.module.Float16Module.forward.0') + dir_name = os.path.join(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) + dump_path_param = { + 'npu_json_path': os.path.join(dir_name, 'input', 'step0', 'rank0', 'dump.json'), + 'bench_json_path': os.path.join(dir_name, 'input', 'step0', 'rank0', 'dump.json'), + 'stack_json_path': os.path.join(dir_name, 'input', 'step0', 'rank0', 'stack.json'), + 'is_print_compare_log': True + } + comparator = GraphComparator(self.graphs, dump_path_param, Args(output_path=self.output_path)) + comparator.add_compare_result_to_node(node, compare_result_list) + self.assertEqual(node.data, {'precision_index': 0}) diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_mode_adapter.py b/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_mode_adapter.py index 628545afafda44caf328c53579e54d8df0df41da..87d1f9ee5f01c7c9b2f264f3e6ec16b5155c1f8e 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_mode_adapter.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/compare/test_mode_adapter.py @@ -23,6 +23,175 @@ class TestModeAdapter(unittest.TestCase): precision_index = ModeAdapter._add_md5_compare_data(node_data, compare_data_dict) self.assertEqual(precision_index, 1) + node_data = {'Tensor.__imul__.0.forward.input.0': {'type': 'torch.Tensor', 'dtype': 'torch.int64', 'shape': [], + 'Max': 16388, 'Min': 16388, 'Mean': 16388, 'Norm': 16388, + 'requires_grad': False, 'md5': 'a563a4ea', + 'full_op_name': 'Tensor.__imul__.0.forward.input.0', + 
'data_name': '-1'}, + 'Tensor.__imul__.0.forward.input.1': {'type': 'torch.Tensor', 'dtype': 'torch.int64', 'shape': [], + 'Max': 4097, 'Min': 4097, 'Mean': 4097, 'Norm': 4097, + 'requires_grad': False, 'md5': 'ce564339', + 'full_op_name': 'Tensor.__imul__.0.forward.input.1', + 'data_name': '-1'}} + compare_dict = {'Tensor.__imul__.0.forward.input.0': ['Tensor.__imul__.0.forward.input.0', + 'Tensor.__imul__.0.forward.input.0', 'torch.int64', + 'torch.int64', [], [], 'a563a4ea', 'a563a4ea', 'pass', + []], + 'Tensor.__imul__.0.forward.input.1': ['Tensor.__imul__.0.forward.input.1', + 'Tensor.__imul__.0.forward.input.1', 'torch.int64', + 'torch.int64', [], [], 'ce564339', 'ce564559', 'diff', + 'None']} + precision_index = ModeAdapter._add_md5_compare_data(node_data, compare_dict) + self.assertEqual(precision_index, 0) + + def test_add_real_compare_data(self): + tensor_data = {'Module.module.Float16Module.forward.0.input.0': + ['Module.module.Float16Module.forward.0.input.0', + 'Module.module.Float16Module.forward.0.input.0', + 'torch.int64', 'torch.int64', [1, 1024], [1, 1024], + 1.0, 0.0, 0.0, 1.0, 1.0, 29992.0, 1.0, 9100.3125, + 474189.09375, 29992.0, 1.0, 9100.3125, 474189.09375, + 'Yes', '', None, + 'Module.module.Float16Module.forward.0.input.0.pt'], + 'Module.module.Float16Module.forward.0.input.1': [ + 'Module.module.Float16Module.forward.0.input.1', + 'Module.module.Float16Module.forward.0.input.1', + 'torch.int64', 'torch.int64', [1, 1024], [1, 1024], + 1.0, 0.0, 0.0, None, 1.0, 1023.0, 0.0, 511.5, + 18904.755859375, 1023.0, 0.0, 511.5, 18904.755859375, + 'Yes', '', 'None', + 'Module.module.Float16Module.forward.0.input.1.pt'], + 'Module.module.Float16Module.forward.0.input.2': [ + 'Module.module.Float16Module.forward.0.input.2', + 'Module.module.Float16Module.forward.0.input.2', + 'torch.bool', 'torch.bool', [1, 1, 1024, 1024], + [1, 1, 1024, 1024], 1.0, 0.0, 0.0, 1.0, 1.0, True, + False, None, None, True, False, None, None, 'Yes', '', + 'None', + 
'Module.module.Float16Module.forward.0.input.2.pt'], + 'Module.module.Float16Module.forward.0.kwargs.labels': [ + 'Module.module.Float16Module.forward.0.kwargs.labels', + 'Module.module.Float16Module.forward.0.kwargs.labels', 'torch.int64', + 'torch.int64', [1, 1024], + [1, 1024], 1.0, 0.0, 0.0, 1.0, 1.0, 29992.0, 1.0, 9108.99609375, 474332.28125, + 29992.0, 1.0, + 9108.99609375, 474332.28125, 'Yes', '', 'None', + 'Module.module.Float16Module.forward.0.kwargs.labels.pt'], + 'Module.module.Float16Module.forward.0.output.0': [ + 'Module.module.Float16Module.forward.0.output.0', + 'Module.module.Float16Module.forward.0.output.0', + 'torch.float32', 'torch.float32', [1, 1024], + [1, 1024], 0.994182636336, 4.863566398621, + 0.461487948895, 0.0068359375, 0.0234375, + 15.402446746826172, 7.318280220031738, + 11.375151634216309, 366.3365173339844, + 10.538880348205566, 10.215872764587402, + 10.378824234008789, 332.1264953613281, 'No', '', + 'None', + 'Module.module.Float16Module.forward.0.output.0.pt']} + node_data = {'Module.module.Float16Module.forward.0.input.0': {'type': 'torch.Tensor', 'dtype': 'torch.int64', + 'shape': [1, 1024], 'Max': 29992.0, 'Min': 1.0, + 'Mean': 9100.3125, 'Norm': 474189.09375, + 'requires_grad': False, + 'md5': '00000000'}, + 'Module.module.Float16Module.forward.0.input.1': {'type': 'torch.Tensor', 'dtype': 'torch.int64', + 'shape': [1, 1024], 'Max': 1023.0, 'Min': 0.0, + 'Mean': 511.5, 'Norm': 18904.755859375, + 'requires_grad': False, + 'md5': '00000000'}, + 'Module.module.Float16Module.forward.0.input.2': {'type': 'torch.Tensor', 'dtype': 'torch.bool', + 'shape': [1, 1, 1024, 1024], 'Max': True, + 'Min': False, 'Mean': None, 'Norm': None, + 'requires_grad': False, + 'md5': '00000000'}, + 'Module.module.Float16Module.forward.0.kwargs.labels': {'type': 'torch.Tensor', + 'dtype': 'torch.int64', 'shape': None, + 'Max': 29992.0, 'Min': 1.0, + 'Mean': 9108.99609375, + 'Norm': 474332.28125, + 'requires_grad': False, + 'md5': '00000000'}, + 
'Module.module.Float16Module.forward.0.kwargs.None': None} + min_thousandth = ModeAdapter._add_real_compare_data(node_data, tensor_data) + self.assertEqual(min_thousandth, 1.0) + + def test_add_summary_compare_data(self): + compare_data_dict = { + 'Module.module.Float16Module.forward.0.input.0': ['Module.module.Float16Module.forward.0.input.0', + 'Module.module.Float16Module.forward.0.input.0', + 'torch.int64', 'torch.int64', [4, 4096], [4, 4096], 0.0, + 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', 30119.0, + 1.0, 8466.25, 1786889.625, 30119.0, 1.0, 8466.25, + 1786889.625, '', ''], + 'Module.module.Float16Module.forward.0.input.1': ['Module.module.Float16Module.forward.0.input.1', + 'Module.module.Float16Module.forward.0.input.1', + 'torch.int64', 'torch.int64', [4, 4096], [4, 4096], 0.0, + 0.0, 0.0, 0.0, '0.0%', 'N/A', '0.0%', '0.0%', 4095.0, 0.0, + 2047.5, 302642.375, 4095.0, 0.0, 2047.5, 302642.375, '', + '', 'None'], + 'Module.module.Float16Module.forward.0.input.2': ['Module.module.Float16Module.forward.0.input.2', + 'Module.module.Float16Module.forward.0.input.2', + 'torch.bool', 'torch.bool', [1, 1, 4096, 4096], + [1, 1, 4096, 4096], 'N/A', 'N/A', 'N/A', 'N/A', 'N/A', + 'N/A', 'N/A', 'N/A', True, False, None, None, True, False, + None, None, '', '', 'None'], + 'Module.module.Float16Module.forward.0.input.labels': ['Module.module.Float16Module.forward.0.input.labels', + 'Module.module.Float16Module.forward.0.input.labels', + 'torch.float16', 'torch.float16', [4, 4096], + [4, 4096], + 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', + 30119.0, 0.00001, 8460.7685546875, 1786117.625, + 30119.0, + 1.0, 8460.7685546875, 1786117.625, '', '', 'None']} + node_data = {'Module.module.Float16Module.forward.0.input.0': {'type': 'torch.Tensor', 'dtype': 'torch.int64', + 'shape': [4, 4096], 'Max': 30119.0, 'Min': 1.0, + 'Mean': 8466.25, 'Norm': 1786889.625, + 'requires_grad': False, + 'data_name': '-1', 'md5': '00000000'}, + 
'Module.module.Float16Module.forward.0.input.1': {'type': 'torch.Tensor', 'dtype': 'torch.int64', + 'shape': [4, 4096], 'Max': 4095.0, 'Min': 0.0, + 'Mean': 2047.5, 'Norm': 302642.375, + 'requires_grad': False, + 'data_name': '-1', 'md5': '00000000'}, + 'Module.module.Float16Module.forward.0.input.2': {'type': 'torch.Tensor', 'dtype': 'torch.bool', + 'shape': [1, 1, 4096, 4096], 'Max': True, + 'Min': False, 'Mean': None, 'Norm': None, + 'requires_grad': False, + 'data_name': '-1', 'md5': '00000000'}, + 'Module.module.Float16Module.forward.0.input.labels': {'type': 'torch.Tensor', + 'dtype': 'torch.float16', + 'shape': [4, 4096], + 'Max': 30119.0, 'Min': 0.00001, + 'Mean': 8460.7685546875, + 'Norm': 1786117.625, 'requires_grad': False, + 'data_name': '-1', 'md5': '00000000'}, + 'Module.module.Float16Module.forward.0.kwargs.None': None} + precision_index = ModeAdapter._add_summary_compare_data(node_data, compare_data_dict) + self.assertEqual(precision_index, 0) + + def test_match_data(self): + compare_data = ['Module.module.Float16Module.forward.0.input.0', + 'Module.module.Float16Module.forward.0.input.0', 'torch.int64', 'torch.int64', [4, 4096], + [4, 4096], 0.0, 0.0, 0.0, 0.0, '0.0%', '0.0%', '0.0%', '0.0%', 30119.0, 1.0, 8466.25, + 1786889.625, 30119.0, 1.0, 8466.25, 1786889.625, '', ''] + data_dict = {'type': 'torch.Tensor', 'dtype': 'torch.int64', 'shape': [4, 4096], 'Max': 30119.0, 'Min': 1.0, + 'Mean': 8466.25, 'Norm': 1786889.625, 'requires_grad': False, + 'full_op_name': 'Module.module.Float16Module.forward.0.input.0', 'data_name': '-1', + 'md5': '00000000'} + id_list = [6, 7, 8, 9, 10, 11, 12, 13] + id_list1 = [6, 7, 8, 9, 10, 11, 12, 13, 14] + key_list = ['Max diff', 'Min diff', 'Mean diff', 'L2norm diff', 'MaxRelativeErr', 'MinRelativeErr', + 'MeanRelativeErr', 'NormRelativeErr'] + ModeAdapter._match_data(data_dict, compare_data, key_list, id_list1) + self.assertNotIn('Max diff', data_dict) + ModeAdapter._match_data(data_dict, compare_data, key_list, 
id_list) + self.assertIn('Max diff', data_dict) + + def test_check_list_len(self): + data_list = [1, 2] + with self.assertRaises(ValueError): + ModeAdapter._check_list_len(data_list, 3) + @patch('msprobe.visualization.compare.mode_adapter.ModeAdapter') def test_parse_result(self, mock_mode_adapter): mock_mode_adapter._add_summary_compare_data.return_value = 0.5 @@ -68,6 +237,14 @@ class TestModeAdapter(unittest.TestCase): self.assertEqual(node_data['key'][GraphConst.ERROR_KEY], [CompareConst.MAX_RELATIVE_ERR, CompareConst.MIN_RELATIVE_ERR, CompareConst.MEAN_RELATIVE_ERR, CompareConst.NORM_RELATIVE_ERR]) + node_data = {'key': []} + self.adapter.add_error_key(node_data) + self.assertEqual(node_data['key'], []) + + node_data = {'key': {}} + self.adapter.compare_mode = '111' + self.adapter.add_error_key(node_data) + self.assertEqual(node_data['key'], {'error_key': []}) def test_get_tool_tip(self): self.adapter.compare_mode = GraphConst.MD5_COMPARE diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_base_node.py b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_base_node.py index db584a97896e1e437e490117f5bcea804dec71c5..480b95620e6a81577d825b7af55b45fc0a04c34c 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_base_node.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_base_node.py @@ -1,5 +1,4 @@ import unittest -from unittest.mock import patch from msprobe.visualization.graph.base_node import BaseNode, NodeOp from msprobe.visualization.utils import GraphConst @@ -21,13 +20,6 @@ class TestBaseNode(unittest.TestCase): other_node = BaseNode(self.node_op, self.node_id, self.up_node) self.assertEqual(self.node, other_node) - def test_get_suggestions(self): - self.node.get_suggestions() - self.assertIn(GraphConst.SUGGEST_KEY, self.node.suggestions) - - node = BaseNode(NodeOp.function_api, "up_node_1") - node.get_suggestions() - self.assertIn(GraphConst.SUGGEST_KEY, node.suggestions) def 
test_set_input_output(self): input_data = {'input1': 'value1'} @@ -68,9 +60,3 @@ class TestBaseNode(unittest.TestCase): def test_get_ancestors(self): expected_ancestors = ['up_node_1'] self.assertEqual(self.node.get_ancestors(), expected_ancestors) - - @patch('msprobe.visualization.builder.msprobe_adapter.compare_mapping_data') - def test_compare_mapping_node(self, mock_compare_mapping_data): - mock_compare_mapping_data.return_value = True - result = self.node.compare_mapping_node(BaseNode(NodeOp.function_api, "up_node_1")) - self.assertTrue(result) diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_graph.py b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_graph.py index 3626a889a3b7f8f56481934b8f4745b8cd04e5e8..81f9fdca5277de6e1670da409bcf93e56ece3206 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_graph.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_graph.py @@ -67,17 +67,19 @@ class TestGraph(unittest.TestCase): 'matched_node_link': [], 'suggestions': {}, 'stack_info': []}}) def test_split_nodes_by_micro_step(self): - nodes = [BaseNode(NodeOp.module, 'a.0'), BaseNode(NodeOp.module, 'b.0'), - BaseNode(NodeOp.api_collection, 'apis.0'), BaseNode(NodeOp.module, 'a.1'), - BaseNode(NodeOp.module, 'b.1'), BaseNode(NodeOp.api_collection, 'apis.1')] + nodes = [BaseNode(NodeOp.module, 'a.forward.0'), BaseNode(NodeOp.module, 'a.backward.0'), + BaseNode(NodeOp.api_collection, 'apis.0'), BaseNode(NodeOp.module, 'a.forward.1'), + BaseNode(NodeOp.module, 'b.forward.0'), BaseNode(NodeOp.module, 'b.backward.0'), + BaseNode(NodeOp.module, 'a.backward.1'), BaseNode(NodeOp.api_collection, 'apis.1')] result = Graph.split_nodes_by_micro_step(nodes) self.assertEqual(len(result), 2) self.assertEqual(len(result[0]), 3) def test_paging_by_micro_step(self): - nodes = [BaseNode(NodeOp.module, 'a.0'), BaseNode(NodeOp.module, 'b.0'), - BaseNode(NodeOp.api_collection, 'apis.0'), BaseNode(NodeOp.module, 
'a.1'), - BaseNode(NodeOp.module, 'b.1'), BaseNode(NodeOp.api_collection, 'apis.1')] + nodes = [BaseNode(NodeOp.module, 'a.forward.0'), BaseNode(NodeOp.module, 'a.backward.0'), + BaseNode(NodeOp.api_collection, 'apis.0'), BaseNode(NodeOp.module, 'a.forward.1'), + BaseNode(NodeOp.module, 'b.forward.0'), BaseNode(NodeOp.module, 'b.backward.0'), + BaseNode(NodeOp.module, 'a.backward.1'), BaseNode(NodeOp.api_collection, 'apis.1')] graph = Graph('Model1') graph.root.subnodes = nodes @@ -90,13 +92,12 @@ class TestGraph(unittest.TestCase): self.assertEqual(graph_other.root.subnodes[0].micro_step_id, 0) def test_mapping_match(self): - mapping_config = MagicMock() graph_a = Graph("model_name_a") graph_b = Graph("model_name_b") graph_a.add_node(NodeOp.module, "a1", BaseNode(NodeOp.module, "root")) graph_b.add_node(NodeOp.module, "b1", BaseNode(NodeOp.module, "root")) - mapping_config.get_mapping_string.return_value = "b1" - node_b, ancestors_n, ancestors_b = Graph.mapping_match(graph_a.get_node("a1"), graph_b, mapping_config) + mapping_dict = {"a1": "b1"} + node_b, ancestors_n, ancestors_b = Graph.mapping_match(graph_a.get_node("a1"), graph_b, mapping_dict) self.assertIsNotNone(node_b) self.assertEqual(ancestors_n, ["root"]) self.assertEqual(ancestors_b, ["root"]) diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_colors.py b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_colors.py index 3f4587e9c14495222f42e54dcd320559991c9bb9..869df1ba096b27e33a3340c69af794920229cae7 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_colors.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_colors.py @@ -30,7 +30,7 @@ class TestNodeColors(unittest.TestCase): self.assertIn("#FFEDBE", colors_info) self.assertIn("#FFDC7F", colors_info) self.assertIn("#FFC62E", colors_info) - self.assertIn("#E32020", colors_info) + self.assertIn("#FF704D", colors_info) self.assertIn("#C7C7C7", colors_info) # 
Ensure the returned dictionary has the correct descriptions and value ranges diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_op.py b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_op.py index 4e0bc926b1ecb493deee5f30a680a086f220c739..8cc51126cd76b09ac2abcb20bfb3f2adb2d606cb 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_op.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/graph/test_node_op.py @@ -10,8 +10,7 @@ class TestNodeOp(unittest.TestCase): def test_get_node_op_invalid(self): node_name = "InvalidNodeName" - with self.assertRaises(Exception): - NodeOp.get_node_op(node_name) + self.assertEqual(NodeOp.get_node_op(node_name), NodeOp.module) def test_get_node_op_all(self): test_cases = [ diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/construct.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/construct.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..330122252bd65cb01bbf9f0cd6c912f407b32a28 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/dump.json @@ -0,0 +1,6 @@ +{ + "task": "statistics", + "level": "mix", + "dump_data_dir": null, + "data": {} +} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/stack.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank0/stack.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank1/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank1/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step0/rank1/dump.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input/step1/rank0/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step1/rank0/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step1/rank0/dump.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input/step1/step1/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step1/step1/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input/step1/step1/dump.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/construct.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/construct.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/dump.json new file mode 100644 index 
0000000000000000000000000000000000000000..330122252bd65cb01bbf9f0cd6c912f407b32a28 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/dump.json @@ -0,0 +1,6 @@ +{ + "task": "statistics", + "level": "mix", + "dump_data_dir": null, + "data": {} +} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/stack.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank0/stack.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/construct.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/construct.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..330122252bd65cb01bbf9f0cd6c912f407b32a28 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/dump.json @@ -0,0 +1,6 @@ +{ + "task": "statistics", + "level": "mix", + "dump_data_dir": null, + "data": {} +} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/stack.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/stack.json new file mode 100644 index 
0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step0/rank1/stack.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/construct.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/construct.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..330122252bd65cb01bbf9f0cd6c912f407b32a28 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/dump.json @@ -0,0 +1,6 @@ +{ + "task": "statistics", + "level": "mix", + "dump_data_dir": null, + "data": {} +} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/stack.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step1/rank0/stack.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/construct.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/construct.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/construct.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/dump.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..330122252bd65cb01bbf9f0cd6c912f407b32a28 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/dump.json @@ -0,0 +1,6 @@ +{ + "task": "statistics", + "level": "mix", + "dump_data_dir": null, + "data": {} +} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/stack.json b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/stack.json new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/input_format_correct/step2/rank0/stack.json @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/layer_mapping.yaml b/debug/accuracy_tools/msprobe/test/visualization_ut/layer_mapping.yaml new file mode 100644 index 0000000000000000000000000000000000000000..0967ef424bce6791893e9a57bb952f80fd536e93 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/layer_mapping.yaml @@ -0,0 +1 @@ +{} diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/mapping.yaml b/debug/accuracy_tools/msprobe/test/visualization_ut/mapping.yaml index 8b2f85ebf872aae4b3377842ac899824da5877f9..59e1d5e3c3068a50be17d6b780e3141601be015b 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/mapping.yaml +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/mapping.yaml @@ -1,2 +1,2 @@ -- vision_model: "language_model.vision_encoder" -- vision_projection: "language_model.projection" \ No newline at end of file 
+NPU.npu_fusion_attention.4.forward.input.0: Function.attention.4.forward.input.0 +Module.module.language_model.embedding.word_embedding.VocabParallelEmbedding.forward.0.input.0: Module.module.language_model.embedding.word_embedding.VocabParallelEmbedding.forward.0.input.0 \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/test_graph_service.py b/debug/accuracy_tools/msprobe/test/visualization_ut/test_graph_service.py new file mode 100644 index 0000000000000000000000000000000000000000..7dfd9564ebc21327f3e7e29be90da7f78c3b0393 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/test_graph_service.py @@ -0,0 +1,222 @@ +import os +import re +import json +import unittest +import shutil +import argparse +from dataclasses import dataclass + +from unittest.mock import patch +from msprobe.visualization.graph_service import _compare_graph, _build_graph, _compare_graph_ranks, \ + _compare_graph_steps, _build_graph_ranks, _build_graph_steps, _graph_service_command, _graph_service_parser +from msprobe.core.common.utils import CompareException + + +@dataclass +class Args: + input_path: str = None + output_path: str = None + layer_mapping: str = None + framework: str = None + overflow_check: bool = False + fuzzy_match: bool = False + complete_stack: bool = False + + +class TestGraphService(unittest.TestCase): + def setUp(self): + self.current_path = os.path.dirname(os.path.realpath(__file__)) + self.input = os.path.join(self.current_path, "input_format_correct") + self.output = os.path.join(self.current_path, 'output') + self.input_param = { + 'npu_path': os.path.join(self.input, 'step0', 'rank0'), + 'bench_path': os.path.join(self.input, 'step0', 'rank0'), + 'is_print_compare_log': True + } + self.layer_mapping = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'layer_mapping.yaml') + self.pattern = r'\b\w+\.vis\b' + self.pattern_rank = r'[\w_]+\.vis\b' + self.output_json = [] + for i in range(7): + 
self.output_json.append(os.path.join(self.current_path, f"compare{i}.json")) + + def assert_log_info(self, mock_log_info, + log_info='Model graphs compared successfully, the result file is saved in'): + last_call_args = mock_log_info.call_args[0][0] + self.assertIn(log_info, last_call_args) + matches = re.findall(self.pattern, last_call_args) + self.assertTrue(os.path.exists(os.path.join(self.output, matches[0]))) + + @patch('msprobe.core.common.log.logger.info') + def test_compare_graph(self, mock_log_info): + args = Args(output_path=self.output, framework='pytorch') + result = _compare_graph(self.input_param, args) + self.assertEqual(mock_log_info.call_count, 2) + self.assertIsNotNone(result) + + args = Args(output_path=self.output, framework='mindspore') + result = _compare_graph(self.input_param, args) + self.assertIsNotNone(result) + + args = Args(output_path=self.output, framework='pytorch', layer_mapping=self.layer_mapping) + result = _compare_graph(self.input_param, args) + self.assertIsNotNone(result) + + args = Args(output_path=self.output, framework='pytorch', overflow_check=True) + result = _compare_graph(self.input_param, args) + self.assertIsNotNone(result) + + @patch('msprobe.core.common.log.logger.info') + def test_build_graph(self, mock_log_info): + result = _build_graph(os.path.join(self.input, 'step0', 'rank0'), Args(overflow_check=True)) + self.assertEqual(mock_log_info.call_count, 1) + self.assertIsNotNone(result) + + @patch('msprobe.core.common.log.logger.info') + def test_compare_graph_ranks(self, mock_log_info): + input_param = { + 'npu_path': os.path.join(self.input, 'step0'), + 'bench_path': os.path.join(self.input, 'step0'), + 'is_print_compare_log': True + } + args = Args(output_path=self.output, framework='pytorch') + _compare_graph_ranks(input_param, args) + self.assert_log_info(mock_log_info) + + input_param1 = { + 'npu_path': os.path.join(self.input, 'step0'), + 'bench_path': os.path.join(self.input, 'step1'), + 
'is_print_compare_log': True + } + args = Args(output_path=self.output, framework='pytorch') + with self.assertRaises(CompareException): + _compare_graph_ranks(input_param1, args) + + @patch('msprobe.core.common.log.logger.info') + def test_compare_graph_steps(self, mock_log_info): + input_param = { + 'npu_path': self.input, + 'bench_path': self.input, + 'is_print_compare_log': True + } + args = Args(output_path=self.output, framework='pytorch') + _compare_graph_steps(input_param, args) + self.assert_log_info(mock_log_info) + + input_param1 = { + 'npu_path': self.input, + 'bench_path': os.path.join(self.current_path, "input"), + 'is_print_compare_log': True + } + args = Args(output_path=self.output, framework='pytorch') + with self.assertRaises(CompareException): + _compare_graph_steps(input_param1, args) + + @patch('msprobe.core.common.log.logger.info') + def test_build_graph_ranks(self, mock_log_info): + _build_graph_ranks(os.path.join(self.input, 'step0'), Args(output_path=self.output)) + self.assert_log_info(mock_log_info, "Model graph built successfully, the result file is saved in") + + @patch('msprobe.core.common.log.logger.info') + def test_build_graph_steps(self, mock_log_info): + _build_graph_steps(self.input, Args(output_path=self.output)) + self.assert_log_info(mock_log_info, "Model graph built successfully, the result file is saved in") + + @patch('msprobe.core.common.log.logger.info') + def test_graph_service_command(self, mock_log_info): + with open(self.output_json[0], 'w') as f: + json.dump(self.input_param, f, indent=4) + + args = Args(input_path=self.output_json[0], output_path=self.output, framework='pytorch') + _graph_service_command(args) + self.assert_log_info(mock_log_info) + + input_param1 = { + 'npu_path': os.path.join(self.input, 'step0', 'rank0'), + 'is_print_compare_log': True + } + with open(self.output_json[1], 'w') as f: + json.dump(input_param1, f, indent=4) + args = Args(input_path=self.output_json[1], output_path=self.output, 
framework='pytorch') + _graph_service_command(args) + self.assert_log_info(mock_log_info, "Model graph built successfully, the result file is saved in") + + input_param2 = { + 'npu_path': os.path.join(self.input, 'step0'), + 'bench_path': os.path.join(self.input, 'step0'), + 'is_print_compare_log': True + } + with open(self.output_json[2], 'w') as f: + json.dump(input_param2, f, indent=4) + args = Args(input_path=self.output_json[2], output_path=self.output, framework='pytorch') + _graph_service_command(args) + self.assert_log_info(mock_log_info) + + input_param3 = { + 'npu_path': self.input, + 'bench_path': self.input, + 'is_print_compare_log': True + } + with open(self.output_json[3], 'w') as f: + json.dump(input_param3, f, indent=4) + args = Args(input_path=self.output_json[3], output_path=self.output, framework='pytorch') + _graph_service_command(args) + self.assert_log_info(mock_log_info) + + input_param4 = { + 'npu_path': os.path.join(self.input, 'step0'), + 'is_print_compare_log': True + } + with open(self.output_json[4], 'w') as f: + json.dump(input_param4, f, indent=4) + args = Args(input_path=self.output_json[4], output_path=self.output, framework='pytorch') + _graph_service_command(args) + self.assert_log_info(mock_log_info, "Model graph built successfully, the result file is saved in") + + input_param5 = { + 'npu_path': self.input, + 'is_print_compare_log': True + } + with open(self.output_json[5], 'w') as f: + json.dump(input_param5, f, indent=4) + args = Args(input_path=self.output_json[5], output_path=self.output, framework='pytorch') + _graph_service_command(args) + self.assert_log_info(mock_log_info, "Model graph built successfully, the result file is saved in") + + input_param6 = { + 'npu_path': self.input, + 'bench_path': os.path.join(self.input, 'step0'), + 'is_print_compare_log': True + } + with open(self.output_json[6], 'w') as f: + json.dump(input_param6, f, indent=4) + args = Args(input_path=self.output_json[6], output_path=self.output, 
framework='pytorch') + with self.assertRaises(ValueError): + _graph_service_command(args) + + def test_graph_service_parser(self): + parser = argparse.ArgumentParser() + _graph_service_parser(parser) + args = parser.parse_args(['-i', 'input.json', '-o', 'output.json']) + self.assertEqual(args.input_path, 'input.json') + self.assertEqual(args.output_path, 'output.json') + + args = parser.parse_args(['-i', 'input.json', '-o', 'output.json', '-lm', 'mapping.json']) + self.assertEqual(args.layer_mapping, 'mapping.json') + + args = parser.parse_args(['-i', 'input.json', '-o', 'output.json', '-oc']) + self.assertTrue(args.overflow_check) + + args = parser.parse_args(['-i', 'input.json', '-o', 'output.json']) + self.assertFalse(args.overflow_check) + + def tearDown(self): + if os.path.exists(self.output): + shutil.rmtree(self.output) + for json_data in self.output_json: + if os.path.exists(json_data): + os.remove(json_data) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/test_mapping_config.py b/debug/accuracy_tools/msprobe/test/visualization_ut/test_mapping_config.py deleted file mode 100644 index 8db4242a6e565d534ce2000fd68aa4a744821513..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/test_mapping_config.py +++ /dev/null @@ -1,52 +0,0 @@ -import os -import unittest -from msprobe.visualization.mapping_config import MappingConfig - - -class TestMappingConfig(unittest.TestCase): - - def setUp(self): - self.yaml_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "mapping.yaml") - - def test_validate(self): - with self.assertRaises(ValueError): - MappingConfig.validate(123, "some value") - with self.assertRaises(ValueError): - MappingConfig.validate("some key", 456) - self.assertEqual(MappingConfig.validate("key", "value"), "value") - - def test_convert_to_regex(self): - regex = MappingConfig.convert_to_regex("hello{world}") - 
self.assertEqual(regex, ".*hello\\{world\\}.*") - - def test_replace_parts(self): - result = MappingConfig._replace_parts('hello world', 'world', 'everyone') - self.assertEqual(result, 'hello everyone') - result = MappingConfig._replace_parts('radio_model.layers.0.input_norm', 'radio_model.layers.{}.input_norm', - 'radio_model.transformer.layers.{}.input_layernorm') - self.assertEqual(result, 'radio_model.transformer.layers.0.input_layernorm') - - def test_get_mapping_string(self): - mc = MappingConfig(self.yaml_path) - mc.classify_config = { - 'category1': [('category1.key1', 'replacement1')], - 'category2': [('category2.key1', 'replacement2')] - } - result = mc.get_mapping_string("some category1.key1 text") - self.assertEqual(result, "some replacement1 text") - - def test_long_string(self): - long_string = "x" * (MappingConfig.MAX_STRING_LEN + 1) - mc = MappingConfig(self.yaml_path) - result = mc.get_mapping_string(long_string) - self.assertEqual(result, long_string) - - def test__classify_and_sort_keys(self): - mc = MappingConfig(self.yaml_path) - result = mc._classify_and_sort_keys() - self.assertEqual(result, {'vision_model': [('vision_model', 'language_model.vision_encoder')], - 'vision_projection': [('vision_projection', 'language_model.projection')]}) - - -if __name__ == '__main__': - unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/visualization_ut/test_visualization_utils.py b/debug/accuracy_tools/msprobe/test/visualization_ut/test_visualization_utils.py index daa018565beb3492361cde5044399a8ee9e9d9fe..e5b0afaadf9def910c248b945ad15084300a65c0 100644 --- a/debug/accuracy_tools/msprobe/test/visualization_ut/test_visualization_utils.py +++ b/debug/accuracy_tools/msprobe/test/visualization_ut/test_visualization_utils.py @@ -1,12 +1,14 @@ import os import unittest -from msprobe.visualization.utils import (load_json_file, load_data_json_file, str2float) +from msprobe.visualization.utils import (load_json_file, load_data_json_file, str2float, 
check_directory_content, + GraphConst) class TestMappingConfig(unittest.TestCase): def setUp(self): self.yaml_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "mapping.yaml") + self.input = os.path.join(os.path.dirname(os.path.realpath(__file__)), "input") def test_load_json_file(self): result = load_json_file(self.yaml_path) @@ -22,6 +24,19 @@ class TestMappingConfig(unittest.TestCase): result = str2float('2.3.4%') self.assertAlmostEqual(result, 0) + def test_check_directory_content(self): + input_type = check_directory_content(self.input) + self.assertEqual(input_type, GraphConst.STEPS) + + input_type = check_directory_content(os.path.join(self.input, "step0")) + self.assertEqual(input_type, GraphConst.RANKS) + + with self.assertRaises(ValueError): + check_directory_content(os.path.join(self.input, "step1")) + + input_type = check_directory_content(os.path.join(self.input, "step0", "rank0")) + self.assertEqual(input_type, GraphConst.FILES) + if __name__ == '__main__': unittest.main() diff --git a/debug/accuracy_tools/msprobe/visualization/builder/graph_builder.py b/debug/accuracy_tools/msprobe/visualization/builder/graph_builder.py index 89cf629ce8441cdd8fab67776731dfa3d22891f5..814882e6b819e9e6b6b421aec5f8f0b89f03f7c6 100644 --- a/debug/accuracy_tools/msprobe/visualization/builder/graph_builder.py +++ b/debug/accuracy_tools/msprobe/visualization/builder/graph_builder.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,27 +14,42 @@ # limitations under the License. 
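The test above exercises `check_directory_content`, which classifies a dump directory as a set of step directories, a set of rank directories, or plain files. The real logic lives in `msprobe.visualization.utils`; the sketch below is a hypothetical re-implementation for illustration only, assuming `step*`/`rank*` naming conventions (the names `STEPS`, `RANKS`, `FILES`, and `classify_dump_dir` are invented here, not the project's API):

```python
import os

# Hypothetical constants standing in for GraphConst.STEPS / RANKS / FILES.
STEPS, RANKS, FILES = "steps", "ranks", "files"


def classify_dump_dir(path):
    """Classify a dump directory: step* subdirs, rank* subdirs, or plain files."""
    entries = sorted(os.listdir(path))
    paths = [os.path.join(path, e) for e in entries]
    if entries and all(os.path.isdir(p) and e.startswith("step") for e, p in zip(entries, paths)):
        return STEPS
    if entries and all(os.path.isdir(p) and e.startswith("rank") for e, p in zip(entries, paths)):
        return RANKS
    if entries and all(os.path.isfile(p) for p in paths):
        return FILES
    # Mixed or empty content cannot be classified, mirroring the ValueError the test expects.
    raise ValueError(f"mixed or empty directory content: {path}")
```

A directory tree like `step0/rank0/dump.json` would then classify as steps at the top, ranks one level down, and files at the leaf, which is the behavior the unit test asserts.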
 import re
+
+from msprobe.core.common.const import Const
+from msprobe.core.common.file_utils import load_json
+from msprobe.visualization.builder.msprobe_adapter import get_input_output
+from msprobe.visualization.builder.msprobe_adapter import op_patterns
 from msprobe.visualization.graph.graph import Graph
 from msprobe.visualization.graph.node_op import NodeOp
-from msprobe.visualization.utils import load_json_file, load_data_json_file, save_json_file, GraphConst
-from msprobe.visualization.builder.msprobe_adapter import get_input_output
+from msprobe.visualization.utils import save_json_file, GraphConst
 
 
 class GraphBuilder:
+    backward_pattern = re.compile(r"(\.backward\.)(\d+)$")
+    forward_pattern = re.compile(r"(\.forward\.)(\d+)$")
+    # Matches names that start with an uppercase letter, followed by any letters, and end with "Template("
+    template_pattern = re.compile(r'\b[A-Z][a-zA-Z]*Template\(')
+
     @staticmethod
-    def build(construct_path, data_path, model_name='DefaultModel'):
+    def build(construct_path, data_path, stack_path, model_name='DefaultModel', complete_stack=False):
         """
         The public graph-building method exposed by GraphBuilder
         Args:
             construct_path: path to construct.json
             data_path: path to dump.json
+            stack_path: path to stack.json
             model_name: model name, supplied by the caller
+            complete_stack: keep the complete stack information
         Returns:
             Graph, the data structure representing the graph
         """
-        construct_dict = load_json_file(construct_path)
-        data_dict = load_data_json_file(data_path)
-        graph = Graph(model_name)
-        GraphBuilder._init_nodes(graph, construct_dict, data_dict)
+        construct_dict = load_json(construct_path)
+        dump_dict = load_json(data_path)
+        stack_dict = load_json(stack_path)
+        if not complete_stack:
+            GraphBuilder._simplify_stack(stack_dict)
+        data_dict = dump_dict.get(GraphConst.DATA_KEY, {})
+        graph = Graph(model_name, data_path=dump_dict.get('dump_data_dir', ''), dump_data=data_dict)
+        GraphBuilder._init_nodes(graph, construct_dict, data_dict, stack_dict)
         GraphBuilder._collect_apis_between_modules(graph)
         return graph
 
@@ -55,52 +70,126 @@ class GraphBuilder:
             result[GraphConst.COLORS] = config.node_colors
         if config.micro_steps:
             result[GraphConst.MICRO_STEPS] = config.micro_steps
+        if config.task:
+            result[GraphConst.JSON_TASK_KEY] = config.task
+        result[GraphConst.OVERFLOW_CHECK] = config.overflow_check
         save_json_file(filename, result)
 
+    @staticmethod
+    def _simplify_stack(stack_dict):
+        """
+        Simplify the stack contents: at module level keep the stack line containing "<module name>(",
+        at api level keep the line that follows the "xxxTemplate(" line
+
+        E.g. for module Module.layer3.0.bn2.BatchNorm2d.forward.0 the module name is bn2, "bn2(" matches,
+        so the stack line "File /home/models/resnet.py, line 97, in forward, \n out = self.bn2(out)" is kept
+
+        E.g. for api Tensor.__iadd__.4.forward with the stack:
+        "File /home/wrap_tensor.py, line 61, return TensorOPTemplate(op_name, hook)(*args, **kwargs)",
+        "File /home/torchvision/models/resnet.py, line 102, in forward, \n out += identity",
+        "TensorOPTemplate(" matches the first line, so the following stack line is kept
+        """
+        module_pattern = re.compile(op_patterns[0])
+        for dump_name, stack_list in stack_dict.items():
+            if not isinstance(stack_list, list):
+                continue
+            if module_pattern.match(dump_name):
+                parts = dump_name.split(Const.SEP)
+                if len(parts) < abs(Const.LAYER_NAME_INDEX):
+                    continue
+                module_name = parts[Const.LAYER_NAME_INDEX]
+                for stack in stack_list:
+                    if re.search(module_name + r'\(', stack):
+                        stack_list = [stack]
+                        break
+            else:
+                for index, stack in enumerate(stack_list):
+                    if GraphBuilder.template_pattern.search(stack) and index < len(stack_list) - 1:
+                        stack_list = [stack_list[index + 1]]
+                        break
+            stack_dict[dump_name] = stack_list
+
     @staticmethod
     def _handle_backward_upnode_missing(construct_dict, subnode_id, upnode_id):
         """
         If a backward node's parent node is null, try to find a parent via the forward node of the same name
         """
         # Matches ids ending with .backward. followed by one or more digits
-        backward_pattern = r"(\.backward\.)(\d+)$"
-        forward_pattern = r"(\.forward\.)(\d+)$"
-        if re.search(backward_pattern, subnode_id) and not upnode_id:
-            forward_upnode_id = construct_dict.get(re.sub(backward_pattern, r".forward.\2", subnode_id))
+        if GraphBuilder.backward_pattern.search(subnode_id) and not upnode_id:
+            forward_upnode_id = construct_dict.get(GraphBuilder.backward_pattern.sub(r".forward.\2", subnode_id))
+            if forward_upnode_id:
+                new_upnode_id = GraphBuilder.forward_pattern.sub(r".backward.\2", forward_upnode_id)
+                if new_upnode_id in construct_dict:
+                    return new_upnode_id
+        # Matches nodes ending with .backward
+        if subnode_id.endswith(Const.SEP + Const.BACKWARD) and not upnode_id:
+            forward_upnode_id = construct_dict.get(subnode_id.replace(Const.BACKWARD, Const.FORWARD))
             if forward_upnode_id:
-                new_upnode_id = re.sub(forward_pattern, r".backward.\2", forward_upnode_id)
+                new_upnode_id = forward_upnode_id.replace(Const.FORWARD, Const.BACKWARD)
                 if new_upnode_id in construct_dict:
                     return new_upnode_id
         return upnode_id
 
     @staticmethod
-    def _init_nodes(graph, construct_dict, data_dict):
+    def _init_nodes(graph, construct_dict, data_dict, stack_dict):
         for subnode_id, upnode_id in construct_dict.items():
             upnode_id = GraphBuilder._handle_backward_upnode_missing(construct_dict, subnode_id, upnode_id)
             if upnode_id:
                 upnode_op = NodeOp.get_node_op(upnode_id)
-                upnode = GraphBuilder._create_or_get_node(graph, data_dict, upnode_op, upnode_id)
+                upnode = GraphBuilder._create_or_get_node(graph, [data_dict, stack_dict], upnode_op, upnode_id)
             else:
                 upnode = graph.root
             node_op = NodeOp.get_node_op(subnode_id)
-            GraphBuilder._create_or_get_node(graph, data_dict, node_op, subnode_id, upnode)
+            GraphBuilder._create_or_get_node(graph, [data_dict, stack_dict], node_op, subnode_id, upnode)
 
     @staticmethod
-    def _create_or_get_node(graph, data_dict, op, name, upnode=None):
+    def _create_or_get_node(graph, data_stack_list, op, name, upnode=None):
         if name in graph.node_map:
             node = graph.get_node(name)
         else:
             graph.add_node(op, name, upnode)
             node = graph.get_node(name)
-        node_data = data_dict.get(name, {})
+        node_data = data_stack_list[0].get(name, {})
+        node_stack_info = data_stack_list[1].get(name, [])
         # Add input/output data
         input_data, output_data = get_input_output(node_data, node.id)
         # Update the data
         node.set_input_output(input_data, output_data)
+        if GraphConst.BATCH_P2P in name:
+            GraphBuilder._extract_batch_p2p_info(node, node_data)
+        # Backward nodes use the stack info of the corresponding forward node
+        # Module naming example: Module.module.module.GPTModel.backward.0; API naming example: Tensor.permute.1.backward
+        if (not node_stack_info and
+                (GraphBuilder.backward_pattern.search(name) or name.endswith(f'{Const.SEP}{Const.BACKWARD}'))):
+            forward_node = graph.get_node(
+                # A module name is globally unique and its stack info is identical no matter how many times it is
+                # called, so directly use the stack info of the same-named module numbered 0 to avoid misses
+                GraphBuilder.backward_pattern.sub(f'{Const.SEP}{Const.FORWARD}{Const.SEP}0', name)) \
+                if GraphBuilder.backward_pattern.search(name) \
+                else graph.get_node(name.replace(Const.BACKWARD, Const.FORWARD))
+            node_stack_info = forward_node.stack_info if forward_node \
+                else ['This backward node cannot find the forward node and cannot retrieve stack information.']
+        node.stack_info = node_stack_info
         # Add the node
         node.add_upnode(upnode)
         return node
 
+    @staticmethod
+    def _is_valid_batch_p2p_output(param_list):
+        if not isinstance(param_list, list) or not param_list:
+            return False
+        if not isinstance(param_list[0], list) or not param_list[0]:
+            return False
+        return True
+
+    @staticmethod
+    def _extract_batch_p2p_info(node, node_data):
+        param_list = node_data.get(Const.OUTPUT, [])
+        # Data format: "output": [[{param1}, {param2}, ...]]
+        if GraphBuilder._is_valid_batch_p2p_output(param_list):
+            for param in param_list[0]:
+                info = {GraphConst.OP: param.get(GraphConst.OP), GraphConst.PEER: param.get(GraphConst.PEER),
+                        GraphConst.GROUP_ID: param.get(GraphConst.GROUP_ID)}
+                node.batch_p2p_info.append(info)
+
     @staticmethod
     def _collect_apis_between_modules(graph):
         """
@@ -131,6 +220,10 @@
                                    id_accumulation=True)
                 api_collection_node = graph.get_node(node_id)
                 api_collection_node.subnodes = temp_nodes
+                # Re-establish the parent-child relationships
+                for node in temp_nodes:
+                    node.upnode = api_collection_node
+                api_collection_node.upnode = graph.root
                 output.append(api_collection_node)
             else:
                 # If there are fewer than 2 consecutive api nodes, add them to the output list as-is
@@ -144,9 +237,12 @@
 
 
 class GraphExportConfig:
-    def __init__(self, graph_n, graph_b=None, tool_tip=None, node_colors=None, micro_steps=None):
+    def __init__(self, graph_n, graph_b=None, tool_tip=None, node_colors=None, micro_steps=None, task='',
+                 overflow_check=False):
         self.graph_n = graph_n
         self.graph_b = graph_b
         self.tool_tip = tool_tip
         self.node_colors = node_colors
         self.micro_steps = micro_steps
+        self.task = task
+        self.overflow_check = overflow_check
diff --git a/debug/accuracy_tools/msprobe/visualization/builder/msprobe_adapter.py b/debug/accuracy_tools/msprobe/visualization/builder/msprobe_adapter.py
index bc35dd2e4ee8f3b5176670c109ec385d626aad27..ee5e3f519ed126b2aaa493e0d3a3b7fce33313e4 100644
--- a/debug/accuracy_tools/msprobe/visualization/builder/msprobe_adapter.py
+++ b/debug/accuracy_tools/msprobe/visualization/builder/msprobe_adapter.py
@@ -13,16 +13,19 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import re
+import math
 from msprobe.core.compare.acc_compare import read_op, merge_tensor, get_accuracy
 from msprobe.core.common.utils import set_dump_path, get_dump_mode
-from msprobe.visualization.utils import GraphConst, process_kwargs_parameter
-from msprobe.pytorch.compare.pt_compare import PTComparator
-
+from msprobe.visualization.utils import GraphConst
+from msprobe.core.common.const import Const
+from msprobe.core.compare.acc_compare import ModeConfig
 
 # Rules used to resolve node names into the corresponding NodeOp
 op_patterns = [
-    r'^(Module)', #NodeOp.module
-    r'^(Tensor|Torch|Functional|NPU|VF|Distributed|Aten)' #NodeOp.function_api
+    # NodeOp.module
+    r'^(Module.|Cell.|optimizer|clip_grad)',
+    # NodeOp.function_api
+    r'^(Tensor.|Torch.|Functional.|NPU.|VF.|Distributed.|Aten.|Mint.|Primitive.|Jit.|MintFunctional.)'
 ]
 
@@ -39,14 +42,25 @@ def get_compare_mode(dump_path_param):
     return compare_mode
 
 
-def run_real_data(dump_path_param, csv_path):
+def run_real_data(dump_path_param, csv_path, framework, is_cross_frame=False):
     """
     Generate real data by running multiple processes
    Args:
         dump_path_param: parameters required by the acc_compare interface
         csv_path: path of the generated file
+        framework: framework type, pytorch or mindspore
+        is_cross_frame: whether to compare across frameworks; only mindspore vs. pytorch is supported,
+                        with pytorch as the benchmark
     """
-    return PTComparator()._do_multi_process(dump_path_param, csv_path)
+    mode_config = ModeConfig(stack_mode=False, auto_analyze=True, fuzzy_match=False, dump_mode=Const.ALL)
+
+    if framework == Const.PT_FRAMEWORK:
+        from msprobe.pytorch.compare.pt_compare import PTComparator
+        return PTComparator(mode_config).do_multi_process(dump_path_param, csv_path)
+    else:
+        from msprobe.mindspore.compare.ms_compare import MSComparator, MappingConfig
+        ms_comparator = MSComparator(mode_config, MappingConfig())
+        ms_comparator.cross_frame = is_cross_frame
+        return ms_comparator.do_multi_process(dump_path_param, csv_path)
 
 
 def get_input_output(node_data, node_id):
@@ -63,14 +77,15 @@
         full_op_name = item.get('full_op_name', '')
         if not full_op_name:
             continue
-        splits = full_op_name.split('.')
-        if len(splits) < GraphConst.OUTPUT_MIN_LEN:
-            continue
-        if GraphConst.OUTPUT in splits[GraphConst.OUTPUT_INDEX_TWO] and \
-                GraphConst.INPUT not in splits[GraphConst.OUTPUT_INDEX_THREE]:
+        if GraphConst.OUTPUT in full_op_name and GraphConst.INPUT not in full_op_name:
             output_data[full_op_name] = item
         else:
-            input_data[process_kwargs_parameter(full_op_name)] = item
+            name = item.get('data_name')
+            # Prefer the name of the dumped data for node parameter names
+            if isinstance(name, str) and name != '-1':
+                input_data[name.rsplit(Const.SEP, 1)[0]] = item
+            else:
+                input_data[full_op_name] = item
     return input_data, output_data
 
@@ -93,28 +108,25 @@
     return True
 
 
-def compare_mapping_data(data_dict_list1, data_dict_list2):
+def compare_data_fuzzy(data_dict_list1, data_dict_list2):
     """
-    node1 maps to node2; node1 may have more or fewer parameters than node2, the shape dimension order of
-    individual parameters may differ, and a null parameter on node1 may correspond to another value on node2.
-    The tool should keep node data comparable as far as possible, so it does a weak check that only validates
-    whether the shape dimension values are the same
+    Fuzzy match: only check that the parameter shapes are consistent
     """
     for x, y in zip(data_dict_list1.values(), data_dict_list2.values()):
-        x_shape = x.get('shape')
-        y_shape = y.get('shape')
-        if x_shape is None or y_shape is None:
-            continue
-        x_shape = sorted(x_shape) if isinstance(x_shape, list) else x_shape
-        y_shape = sorted(y_shape) if isinstance(y_shape, list) else y_shape
+        x_shape = x.get(Const.SHAPE)
+        y_shape = y.get(Const.SHAPE)
         if x_shape != y_shape:
             return False
     return True
 
 
-def format_node_data(data_dict):
+def format_node_data(data_dict, node_id=None):
     """
-    Batch-output the node data
+    Remove fields that should not be displayed from the node data
     """
-    del_list = ['requires_grad', 'data_name', 'full_op_name']
+    del_list = ['requires_grad', 'full_op_name']
+    if node_id and GraphConst.BATCH_P2P in node_id:
+        del_list.extend(['op', 'peer', 'tag', 'group_id'])
     for _, value in data_dict.items():
         if not isinstance(value, dict):
             continue
@@ -183,7 +195,14 @@ def _format_data(data_dict):
     """
     Format the data: keep 6 decimal places and handle some abnormal values
     """
     pattern = r'^[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)$'
-    none_num = 0
+    all_null = False
+
+    keys_to_keep = ['type', 'group_ranks', 'group_id', 'data_name']
+    if data_dict.get('type') == 'torch.ProcessGroup':
+        keys_to_remove = [key for key in data_dict if key not in keys_to_keep]
+        for key in keys_to_remove:
+            del data_dict[key]
+
     for key, value in data_dict.items():
         if isinstance(value, str):
             # Strip single quotes and replace None with null to avoid front-end parsing errors
@@ -197,12 +216,14 @@
         elif isinstance(value, float):
             value = round(value, GraphConst.ROUND_TH)
         # Inf ends up here; make sure it is converted to Inf. Also a fallback for other unexpected types
-        if not isinstance(value, (list, tuple, dict, str)):
+        if key != GraphConst.ERROR_KEY:
+            # Convert everything except error_key to str to avoid front-end parsing errors
             value = str(value)
-        if value == GraphConst.NULL or key == GraphConst.ERROR_KEY:
-            none_num += 1
+        # Max being null means this parameter's value is null
+        if key == Const.MAX and value == GraphConst.NULL:
+            all_null = True
         data_dict[key] = value
     # If every value in the dict is null, keep only a single null
-    if none_num == len(data_dict):
+    if all_null:
         data_dict.clear()
         data_dict[GraphConst.VALUE] = GraphConst.NULL
diff --git a/debug/accuracy_tools/msprobe/visualization/compare/graph_comparator.py b/debug/accuracy_tools/msprobe/visualization/compare/graph_comparator.py
index 158c25f6eb006bb91fcf1ae15775b410bdda8086..902d721a8d1047b687b878eb45a802a1df4154bd 100644
--- a/debug/accuracy_tools/msprobe/visualization/compare/graph_comparator.py
+++ b/debug/accuracy_tools/msprobe/visualization/compare/graph_comparator.py
@@ -13,26 +13,33 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import re
 from msprobe.visualization.builder.msprobe_adapter import compare_node, get_compare_mode, run_real_data
-from msprobe.visualization.utils import (GraphConst, load_json_file, load_data_json_file, get_csv_df,
-                                         process_kwargs_parameter)
+from msprobe.visualization.utils import GraphConst, load_json_file, load_data_json_file, get_csv_df
 from msprobe.visualization.graph.graph import Graph, NodeOp
 from msprobe.visualization.graph.node_colors import NodeColors
 from msprobe.visualization.compare.mode_adapter import ModeAdapter
+from msprobe.core.common.const import Const
 
 
 class GraphComparator:
-    def __init__(self, graphs, dump_path_param, output_path, mapping_config=None):
+    def __init__(self, graphs, dump_path_param, args, mapping_dict=None):
         self.graph_n = graphs[0]
         self.graph_b = graphs[1]
-        self._parse_param(dump_path_param, output_path)
-        self.mapping_config = mapping_config
+        self._parse_param(dump_path_param, args.output_path)
+        self.framework = args.framework
+        self.mapping_dict = mapping_dict
+        self.fuzzy_match = args.fuzzy_match
+        self.pattern = re.compile(r'\.\d+\.')
 
     def compare(self):
         """
         Comparison entry, called separately after initialization. The results are written into graph_n
         """
-        self._compare_nodes(self.graph_n.root)
+        if self.fuzzy_match:
+            self._compare_nodes_fuzzy(self.graph_n.root)
+        else:
+            self._compare_nodes(self.graph_n.root)
         self._postcompare()
 
     def add_compare_result_to_node(self, node, compare_result_list):
@@ -49,19 +56,16 @@
         compare_out_dict = {}
         # Keep the input and output comparison data separate
         for item in compare_result_list:
-            if not node.stack_info and node.id in item[0]:
-                node.stack_info = item[-1]
+            if not isinstance(item, (list, tuple)) or not item:
+                continue
             if '.output.' in item[0]:
                 compare_out_dict[item[0]] = item
             else:
-                compare_in_dict[process_kwargs_parameter(item[0])] = item
+                compare_in_dict[item[0]] = item
         precision_index, other_dict = (
             self.ma.parse_result(node, [compare_in_dict, compare_out_dict]))
         node.data[GraphConst.JSON_INDEX_KEY] = precision_index
         node.data.update(other_dict)
-        if NodeColors.get_node_error_status(self.ma.compare_mode, precision_index):
-            self.ma.add_error_key(node.output_data)
-            node.get_suggestions()
 
     def _parse_param(self, dump_path_param, output_path):
         self.dump_path_param = dump_path_param
@@ -77,24 +81,26 @@
         if not self.ma.compare_mode == GraphConst.REAL_DATA_COMPARE:
             return
         df = get_csv_df(True, self.ma.csv_data, self.ma.compare_mode)
-        df = run_real_data(self.dump_path_param, df)
+        df = run_real_data(self.dump_path_param, df, self.framework, True if self.mapping_dict else False)
         compare_data_dict = {row[0]: row.tolist() for _, row in df.iterrows()}
         for node in self.ma.compare_nodes:
             precision_index, _ = self.ma.parse_result(node, [compare_data_dict])
             node.data[GraphConst.JSON_INDEX_KEY] = precision_index
-            if NodeColors.get_node_error_status(self.ma.compare_mode, precision_index):
-                self.ma.add_error_key(node.output_data)
-                node.get_suggestions()
 
     def _handle_api_collection_index(self):
         """
-        The api collection's index uses the minimum index of all apis in the collection
+        The api collection's index: md5 mode uses the minimum index of all apis in the collection,
+        statistics and tensor modes use the maximum index of all apis in the collection
+        In md5 mode an index of 0 is the worst; in statistics and tensor modes an index of 1 is the worst
         """
         for node in self.graph_n.root.subnodes:
             if node.op == NodeOp.api_collection:
-                precision_index = 1
+                precision_index = GraphConst.MAX_INDEX_KEY if self.ma.compare_mode == GraphConst.MD5_COMPARE \
+                    else GraphConst.MIN_INDEX_KEY
                 for api in node.subnodes:
-                    precision_index = min(precision_index, api.data.get(GraphConst.JSON_INDEX_KEY, 1))
+                    precision_index = min(precision_index,
+                                          api.data.get(GraphConst.JSON_INDEX_KEY, GraphConst.MAX_INDEX_KEY)) \
+                        if self.ma.compare_mode == GraphConst.MD5_COMPARE \
+ else max(precision_index, api.data.get(GraphConst.JSON_INDEX_KEY, GraphConst.MIN_INDEX_KEY)) node.data[GraphConst.JSON_INDEX_KEY] = precision_index def _compare_nodes(self, node_n): @@ -102,8 +108,8 @@ class GraphComparator: 递归遍历NPU树中的节点,如果在Bench中找到具有相同名称的节点,检查他们的祖先和参数信息,检查一致则及逆行精度数据对比 这里采用先序遍历,好处在于当这个节点被比较时,他的先序已经被匹配,这可以为后续的模糊匹配提供重要信息 """ - if self.mapping_config: - node_b, ancestors_n, ancestors_b = Graph.mapping_match(node_n, self.graph_b, self.mapping_config) + if self.mapping_dict: + node_b, ancestors_n, ancestors_b = Graph.mapping_match(node_n, self.graph_b, self.mapping_dict) if node_b: ancestors_n.append(node_n.id) ancestors_b.append(node_b.id) @@ -116,11 +122,59 @@ class GraphComparator: node_n.add_link(node_b, ancestors) if node_b: # 真实数据比对只会得到基本信息,并没有精度指标,需要调用多进程对比接口 - compare_result_list = compare_node([node_n.id, node_b.id], - [self.data_n_dict, self.data_b_dict], - self.stack_json_data, self.ma.compare_mode) - if compare_result_list: - self.ma.add_csv_data(compare_result_list) - self.add_compare_result_to_node(node_n, compare_result_list) + self._get_and_add_result(node_n, node_b) for subnode in node_n.subnodes: self._compare_nodes(subnode) + + def _compare_nodes_fuzzy(self, node_n): + if node_n.op != NodeOp.function_api: + # 模块经过模糊匹配 + node_b, ancestors_n, ancestors_b = Graph.fuzzy_match(node_n, self.graph_b.node_map.get(node_n.id)) + if node_b: + self._process_matched_nodes(node_n, node_b, ancestors_n, ancestors_b) + # 匹配上的两个模块中的所有api, 忽略dump调用次数,按照名称一致+模块中的调用顺序进行匹配 + recount_result_n = self._recount_api_node(node_n) + recount_result_b = self._recount_api_node(node_b) + for recount_node_id, node_id_n in recount_result_n.items(): + api_node_n = self.graph_n.node_map.get(node_id_n) + if not api_node_n: + continue + api_node_b, ancestors_n, ancestors_b = Graph.fuzzy_match( + api_node_n, self.graph_b.node_map.get(recount_result_b.get(recount_node_id))) + if api_node_b: + self._process_matched_nodes(api_node_n, api_node_b, ancestors_n, ancestors_b) + 
for sub_node in node_n.subnodes: + self._compare_nodes_fuzzy(sub_node) + + def _get_and_add_result(self, node_n, node_b): + compare_result_list = compare_node([node_n.id, node_b.id], + [self.data_n_dict, self.data_b_dict], + self.stack_json_data, self.ma.compare_mode) + if compare_result_list: + self.ma.add_csv_data(compare_result_list) + self.add_compare_result_to_node(node_n, compare_result_list) + + def _recount_api_node(self, node): + """ + 两个匹配上的模块, 忽略各自模块下所有api的dump调用次数, 并赋予模块中的调用顺序 + Return: + {赋予模块中的调用顺序的node_id: 原始node_id} + """ + recount_result = {} + node_count = {} + for sub_node in node.subnodes: + if sub_node.op == NodeOp.function_api: + # 忽略dump调用次数 + count_removed_id = self.pattern.sub(Const.SEP, sub_node.id) + node_count[count_removed_id] = node_count.get(count_removed_id, 0) + 1 + # 赋予模块中的调用顺序 + recount_node_id = count_removed_id + str(node_count.get(count_removed_id)) + recount_result[recount_node_id] = sub_node.id + return recount_result + + def _process_matched_nodes(self, node_n, node_b, ancestors_n, ancestors_b): + ancestors_n.append(node_n.id) + ancestors_b.append(node_b.id) + node_n.matched_node_link = ancestors_b + node_b.matched_node_link = ancestors_n + self._get_and_add_result(node_n, node_b) diff --git a/debug/accuracy_tools/msprobe/visualization/compare/mode_adapter.py b/debug/accuracy_tools/msprobe/visualization/compare/mode_adapter.py index cf61e4c2be1deb96a4ab26d5fdcbc7691fbbbaab..535192d80c566c48cedde4ea5b4474b6dc82dec0 100644 --- a/debug/accuracy_tools/msprobe/visualization/compare/mode_adapter.py +++ b/debug/accuracy_tools/msprobe/visualization/compare/mode_adapter.py @@ -14,6 +14,7 @@ # limitations under the License. 
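The `_recount_api_node` helper above can be illustrated standalone: drop the global dump call count from each api id, then re-number the apis by their call order within the module, so two runs whose apis were dumped a different number of times still line up. A minimal sketch under an assumed id layout (`Prefix.api.N.direction`; the sample ids are invented, not taken from msprobe):

```python
import re

# Assumed id layout: "<Prefix>.<api>.<dump-count>.<direction>"; the dump count
# is the part fuzzy matching wants to ignore.
CALL_COUNT = re.compile(r"\.\d+\.")

def recount(api_ids):
    """Map 'id re-numbered by in-module call order' -> original id."""
    counts, result = {}, {}
    for api_id in api_ids:
        stripped = CALL_COUNT.sub(".", api_id)             # drop the dump call count
        counts[stripped] = counts.get(stripped, 0) + 1
        result[stripped + str(counts[stripped])] = api_id  # append in-module order
    return result
```

Two fuzzily matched modules can then be compared key-by-key on the re-numbered ids, precisely because those keys no longer depend on how often an api was dumped elsewhere in the model.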
import json +import math from msprobe.core.common.const import CompareConst, Const from msprobe.visualization.utils import ToolTip, GraphConst, str2float @@ -23,7 +24,7 @@ class ModeAdapter: self.compare_mode = compare_mode self.csv_data = [] self.compare_nodes = [] - + @staticmethod def _add_md5_compare_data(node_data, compare_data_dict): precision_index = GraphConst.MAX_INDEX_KEY @@ -40,7 +41,7 @@ class ModeAdapter: precision_index = GraphConst.MIN_INDEX_KEY node_data[key] = value return precision_index - + @staticmethod def _add_real_compare_data(node_data, compare_data_dict): min_thousandth = float(1) @@ -53,6 +54,9 @@ class ModeAdapter: headers = CompareConst.COMPARE_RESULT_HEADER id_list = [headers.index(x) for x in GraphConst.REAL_DATA_INDEX_LIST] ModeAdapter._match_data(value, compare_data, GraphConst.REAL_DATA_INDEX_LIST, id_list) + # 跳过scalar data,因为无法计算双千指标,会得到Nan + if not value.get(Const.SHAPE): + continue # 获取一个节点所有的输入或输出最小的双千指标 thousandth = value.get(CompareConst.ONE_THOUSANDTH_ERR_RATIO) # 可能是None,可能是非数字内容str @@ -69,12 +73,13 @@ class ModeAdapter: else: min_thousandth = min(numbers + [min_thousandth]) return min_thousandth - + @staticmethod - def _add_summary_compare_data( node_data, compare_data_dict): - max_relative_err = 0 - for key, value in node_data.items(): - if not isinstance(value, dict): + def _add_summary_compare_data(node_data, compare_data_dict): + max_relative_err = GraphConst.MIN_INDEX_KEY + # data_info: {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [2, 536320], 'Max': 9.66036224, ...} + for key, data_info in node_data.items(): + if not isinstance(data_info, dict): continue compare_data = compare_data_dict.get(key) if compare_data: @@ -82,19 +87,14 @@ class ModeAdapter: key_list = GraphConst.SUMMARY_INDEX_LIST headers = CompareConst.SUMMARY_COMPARE_RESULT_HEADER id_list = [headers.index(x) for x in key_list] - ModeAdapter._match_data(value, compare_data, key_list, id_list) - # 相对误差大于0.5疑似有精度问题,小值域1e-3不比较相对误差 - for index, 
item in enumerate(key_list[4:]): - value_diff = value.get(key_list[index]) - if isinstance(value_diff, float) and value_diff != 0 and abs(value_diff) < GraphConst.SMALL_VALUE: - value[item] = ToolTip.SMALL_VALUE_TIP.format(key_list[index]) - continue - relative_err = str2float(value.get(item)) + ModeAdapter._match_data(data_info, compare_data, key_list, id_list) + for item in key_list[4:]: + relative_err = str2float(data_info.get(item)) max_relative_err = max(max_relative_err, relative_err) - node_data[key] = value + node_data[key] = data_info max_relative_err = 1 if max_relative_err > 1 else max_relative_err return max_relative_err - + @staticmethod def _match_data(data_dict, compare_data, key_list, id_list): """ @@ -102,40 +102,47 @@ class ModeAdapter: """ if len(key_list) != len(id_list): return - for id, key in zip(id_list, key_list): - data = compare_data[id] - if data is not None and 'nan' not in str(data) and str(data) != ' ': - data_dict[key] = data - else: - data_dict[key] = 'null' - - def parse_result(self, node, compare_data_dict): + for id_val, key in zip(id_list, key_list): + data_dict[key] = compare_data[id_val] + + @staticmethod + def _check_list_len(data_list, len_num): + if len(data_list) < len_num: + raise ValueError(f"compare_data_dict_list must contain at least {len_num} items.") + + def parse_result(self, node, compare_data_dict_list): """ 根据结果返回数据,分别是precision_index,和附加数据 """ + other_dict = {} if self.compare_mode == GraphConst.MD5_COMPARE: - precision_index_in = ModeAdapter._add_md5_compare_data(node.input_data, compare_data_dict[0]) - precision_index_out = ModeAdapter._add_md5_compare_data(node.output_data, compare_data_dict[1]) + ModeAdapter._check_list_len(compare_data_dict_list, 2) + precision_index_in = ModeAdapter._add_md5_compare_data(node.input_data, compare_data_dict_list[0]) + precision_index_out = ModeAdapter._add_md5_compare_data(node.output_data, compare_data_dict_list[1]) # 所有输入输出md5对比通过,这个节点才算通过 precision_index = 
min(precision_index_in, precision_index_out) - other_result = CompareConst.PASS if precision_index == 1 else CompareConst.DIFF + other_result = CompareConst.PASS if precision_index == GraphConst.MAX_INDEX_KEY else CompareConst.DIFF other_dict[CompareConst.RESULT] = other_result elif self.compare_mode == GraphConst.SUMMARY_COMPARE: - precision_index_in = ModeAdapter._add_summary_compare_data(node.input_data, compare_data_dict[0]) - precision_index_out = ModeAdapter._add_summary_compare_data(node.output_data, compare_data_dict[1]) - precision_index = max(precision_index_in, precision_index_out) + ModeAdapter._check_list_len(compare_data_dict_list, 2) + ModeAdapter._add_summary_compare_data(node.input_data, compare_data_dict_list[0]) + precision_index_out = ModeAdapter._add_summary_compare_data(node.output_data, compare_data_dict_list[1]) + precision_index = precision_index_out else: - min_thousandth_in = ModeAdapter._add_real_compare_data(node.input_data, compare_data_dict[0]) - min_thousandth_out = ModeAdapter._add_real_compare_data(node.output_data, compare_data_dict[0]) + ModeAdapter._check_list_len(compare_data_dict_list, 1) + min_thousandth_in = ModeAdapter._add_real_compare_data(node.input_data, compare_data_dict_list[0]) + min_thousandth_out = ModeAdapter._add_real_compare_data(node.output_data, compare_data_dict_list[0]) if min_thousandth_in is not None and min_thousandth_out is not None: - change_percentage = abs(min_thousandth_in - min_thousandth_out) + change_percentage = min_thousandth_in - min_thousandth_out else: - change_percentage = 0 + change_percentage = GraphConst.MIN_INDEX_KEY + change_percentage = GraphConst.MIN_INDEX_KEY if change_percentage < GraphConst.MIN_INDEX_KEY \ + else change_percentage precision_index = GraphConst.MAX_INDEX_KEY \ if change_percentage > GraphConst.MAX_INDEX_KEY else change_percentage return precision_index, other_dict - + def prepare_real_data(self, node): """ 为真实数据比较模式准备节点信息 @@ -144,12 +151,12 @@ class ModeAdapter: 
self.compare_nodes.append(node) return True return False - + def add_csv_data(self, compare_result_list): if self.compare_mode != GraphConst.REAL_DATA_COMPARE: return self.csv_data.extend(compare_result_list) - + def add_error_key(self, node_data): """ 根据不同的模式进行提供不同错误信息 @@ -167,7 +174,7 @@ class ModeAdapter: message = [] value[GraphConst.ERROR_KEY] = message node_data[key] = value - + def get_tool_tip(self): """ 用于前端展示字段的具体含义 diff --git a/debug/accuracy_tools/msprobe/visualization/graph/base_node.py b/debug/accuracy_tools/msprobe/visualization/graph/base_node.py index cb8e3aebd472611fd5e723d5c5f2bcde683ee7bb..2642ff1e97ebcc055212d4d776eb7c8a08866dc8 100644 --- a/debug/accuracy_tools/msprobe/visualization/graph/base_node.py +++ b/debug/accuracy_tools/msprobe/visualization/graph/base_node.py @@ -12,10 +12,10 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. - +from msprobe.core.overflow_check.level import OverflowLevel from msprobe.visualization.graph.node_op import NodeOp -from msprobe.visualization.utils import Suggestions, GraphConst -from msprobe.visualization.builder.msprobe_adapter import format_node_data, compare_data, compare_mapping_data +from msprobe.visualization.utils import GraphConst +from msprobe.visualization.builder.msprobe_adapter import format_node_data, compare_data, compare_data_fuzzy class BaseNode: @@ -32,6 +32,9 @@ class BaseNode: self.suggestions = {} self.stack_info = [] self.micro_step_id = None + self.overflow_level = None + self.matched_distributed = {} + self.batch_p2p_info = [] def __str__(self): info = f'id:\t{self.id}' @@ -47,28 +50,23 @@ class BaseNode: return False return True - def compare_mapping_node(self, other): - if not compare_mapping_data(self.input_data, other.input_data): + def fuzzy_eq(self, other): + if not compare_data_fuzzy(self.input_data, other.input_data): return False - if not 
compare_mapping_data(self.output_data, other.output_data): + if not compare_data_fuzzy(self.output_data, other.output_data): return False return True - def get_suggestions(self): - """ - 精度疑似有问题时,提供一些建议 - """ - if self.op == NodeOp.module: - self.suggestions[GraphConst.SUGGEST_KEY] = Suggestions.Module - self.suggestions[Suggestions.DUMP] = Suggestions.DUMP_URL - elif self.op == NodeOp.function_api: - self.suggestions[GraphConst.SUGGEST_KEY] = Suggestions.API - self.suggestions[Suggestions.API_ACCURACY_CHECKER] = Suggestions.API_ACCURACY_CHECKER_URL - def set_input_output(self, input_data, output_data): self.input_data = input_data self.output_data = output_data + def set_overflow_level(self, level): + if not level or not isinstance(level, OverflowLevel): + return + self.overflow_level = level + self.data[GraphConst.OVERFLOW_LEVEL] = self.overflow_level.value + def add_upnode(self, node): """ 绑定upnode,用于对两个节点进行上下级关联 @@ -92,19 +90,22 @@ class BaseNode: """ 输出数据 """ - result = {} - result['id'] = self.id - result['node_type'] = self.op.value - result['data'] = self.data - result['output_data'] = format_node_data(self.output_data) - result['input_data'] = format_node_data(self.input_data) - result['upnode'] = self.upnode.id if self.upnode else 'None' - result['subnodes'] = [node.id for node in self.subnodes] - result['matched_node_link'] = self.matched_node_link - result['suggestions'] = self.suggestions - result['stack_info'] = self.stack_info + result = { + 'id': self.id, + 'node_type': self.op.value, + 'output_data': format_node_data(self.output_data, self.id), + 'input_data': format_node_data(self.input_data, self.id), + 'upnode': self.upnode.id if self.upnode else 'None', + 'subnodes': [node.id for node in self.subnodes], + 'matched_node_link': self.matched_node_link, + 'suggestions': self.suggestions, + 'stack_info': self.stack_info + } if self.micro_step_id is not None: result['micro_step_id'] = self.micro_step_id + result['data'] = self.data + if 
self.matched_distributed: + result[GraphConst.MATCHED_DISTRIBUTED] = self.matched_distributed return result def get_ancestors(self): diff --git a/debug/accuracy_tools/msprobe/visualization/graph/distributed_analyzer.py b/debug/accuracy_tools/msprobe/visualization/graph/distributed_analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..5e68d6b2528aea4d6645da2885fa76a7b9bb97b2 --- /dev/null +++ b/debug/accuracy_tools/msprobe/visualization/graph/distributed_analyzer.py @@ -0,0 +1,395 @@ +# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
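The new `DistributedAnalyzer` below matches point-to-point sends and receives across ranks by rewriting a group id: an `isend` to rank 1 recorded on rank 0 should pair with the `irecv` from rank 0 recorded on rank 1. A simplified sketch of that rewriting idea (the bare `rank` prefix and the helper name are illustrative assumptions; the real ids also append a per-group call counter):

```python
# Peer api for each p2p api (mirrors the analyzer's config table below).
PEER_API = {"send": "recv", "recv": "send", "isend": "irecv", "irecv": "isend"}

def peer_group_id(api_name, my_rank, target_rank):
    """Build my group id, then rewrite it into the id expected on target_rank."""
    group_id = api_name + "rank" + str(target_rank)    # id as recorded on my_rank
    return (group_id
            .replace("rank" + str(target_rank), "rank" + str(my_rank))
            .replace(api_name, PEER_API[api_name]))    # id to look up on target_rank
```

Swapping both the rank suffix and the api name is what makes the lookup symmetric: each side of the p2p pair computes the other side's id without any shared state.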
+from enum import Enum +from msprobe.visualization.utils import GraphConst +from msprobe.core.common.const import Const, CompareConst +from msprobe.core.common.log import logger + + +class CommunicationType(Enum): + """ + 通信类型:发送、接收、发送接收 + """ + SEND = 'send' + RECEIVE = 'receive' + SEND_RECEIVE = 'send_receive' + + +class DistributedType(Enum): + """ + 分布式类型:点对点通信、集体通信 + """ + P2P = 'p2p' + COLLECTIVE = 'collective' + + +CANNOT_MATCH = 'cannot match distributed node in rank' + + +class DistributedAnalyzer: + + def __init__(self, graphs: dict, overflow_check: bool): + self.graphs = graphs + self.overflow_check = overflow_check + self.config = { + # 当前通信api名称: 匹配目标通信api名称, 获取rank信息的位置参数或关键字参数, 通信类型, 分布式类型 + 'send': ['recv', GraphConst.DST, CommunicationType.SEND.value, DistributedType.P2P], + 'isend': ['irecv', GraphConst.DST, CommunicationType.SEND.value, DistributedType.P2P], + 'recv': ['send', GraphConst.SRC, CommunicationType.RECEIVE.value, DistributedType.P2P], + 'irecv': ['isend', GraphConst.SRC, CommunicationType.RECEIVE.value, DistributedType.P2P], + 'broadcast': ['broadcast', '1', CommunicationType.SEND.value, DistributedType.COLLECTIVE], + 'scatter': ['scatter', GraphConst.SRC, CommunicationType.SEND.value, DistributedType.COLLECTIVE], + 'gather': ['gather', GraphConst.DST, CommunicationType.RECEIVE.value, DistributedType.COLLECTIVE], + 'reduce': ['reduce', '1', CommunicationType.RECEIVE.value, DistributedType.COLLECTIVE] + } + self.group_node_mapping = {} + self._make_group_node_mapping() + + @staticmethod + def _get_opposite_communication_type(action): + if action == CommunicationType.SEND.value: + return CommunicationType.RECEIVE.value + elif action == CommunicationType.RECEIVE.value: + return CommunicationType.SEND.value + return action + + @staticmethod + def _node_output_all_equal(data: dict, target_data: dict): + keys_to_compare = [Const.DTYPE, Const.SHAPE, Const.MAX, Const.MIN, Const.MEAN, Const.NORM] + return all(data.get(key) == 
target_data.get(key) for key in keys_to_compare) + + @staticmethod + def _get_target_rank(node, rank, parameter): + """ + 点对点通信, 从输出数据参数src或dst, 获取通信目标rank + 一对多通信和多对一通信, 从输出数据参数src或dst或位置参数, 获取发送或接收的rank源头 + :param node: 当前节点 + :param rank: 当前rank + :param parameter: 输出数据参数 + :return: 目标rank + """ + target_rank = node.input_data.get(f'{node.id}{GraphConst.INPUT}{parameter}', {}).get('value') + if target_rank is None: + logger.warning(f'The parameter {parameter} of node {node.id} does not exist, {CANNOT_MATCH}{rank}') + return target_rank + + @staticmethod + def _get_group_info(node, rank): + """ + 获取当前通信节点的group参数中的group_ranks和group_id + :param node: 当前通信节点 + :param rank: 当前rank + :return: group_ranks和group_id + """ + group = node.input_data.get(f'{node.id}{GraphConst.INPUT}group', {}) + if not group: + logger.warning(f'The kwarg group of node {node.id} does not exist, {CANNOT_MATCH}{rank}') + return None, None + group_ranks = group.get('group_ranks') + if not group_ranks: + logger.warning(f'The group_ranks of node {node.id} does not exist, {CANNOT_MATCH}{rank}') + return None, None + group_id = group.get('group_id') + if not group_id: + logger.warning(f'The group_id of node {node.id} does not exist, {CANNOT_MATCH}{rank}') + return None, None + return group_ranks, group_id + + @staticmethod + def _get_batch_group_info(node, rank): + for data in node.input_data.values(): + group_id = data.get('group_id') + if group_id is not None: + return group_id + logger.warning(f'The group_id of node {node.id} does not exist, {CANNOT_MATCH}{rank}') + return None + + def distributed_match(self): + for rank, graph in self.graphs.items(): + nodes = graph.node_map + for node_id, node in nodes.items(): + # 不是通信节点或者已经匹配过了 + if not node_id.startswith(Const.DISTRIBUTED) or node.matched_distributed: + continue + api_name, distributed_type = self._get_distributed_name_and_type(node_id) + if api_name == GraphConst.BATCH_P2P: + self._batch_p2p_match(node, rank) + elif distributed_type == 
DistributedType.P2P: + self._p2p_match(node, rank, api_name) + else: + self._collective_match(node, rank, api_name) + + def _make_group_node_mapping(self): + """ + 建立通信节点的全局唯一标识映射 + key: rank号, value: unique_group_id与node_id之间的映射 + { + "0": { + "unique_group_id1": "node_id1", + "unique_group_id2": "node_id2", + "node_id1": "unique_group_id1", + "node_id2": "unique_group_id2" + }, + "1": {}, + "2": {} + } + """ + for rank, graph in self.graphs.items(): + group_count = {} + group_info = {} + batch_p2p_count = {} + nodes = graph.node_map + for node_id, node in nodes.items(): + if not node_id.startswith(Const.DISTRIBUTED): + continue + api_name, distributed_type = self._get_distributed_name_and_type(node_id) + if api_name == GraphConst.BATCH_P2P: + self._make_batch_p2p_mapping(node, rank, batch_p2p_count) + continue + elif distributed_type == DistributedType.P2P: + config_info = self.config.get(api_name) + target_rank = self._get_target_rank(node, rank, config_info[1]) + if target_rank is None: + continue + # p2p通信节点,api名称+传输目标rank作为group_id + group_id = api_name + Const.RANK + str(target_rank) + else: + # 其他通信节点直接获取group_id, 并拼接api名称 + _, group_id = self._get_group_info(node, rank) + if not group_id: + continue + group_id += api_name + # 同group_id的调用次数累计 + group_count[group_id] = group_count.get(group_id, 0) + 1 + # group_id+同group_id的调用次数作为唯一的unique_group_id + unique_group_id = group_id + Const.REPLACEMENT_CHARACTER + str(group_count.get(group_id)) + group_info[unique_group_id] = node_id + group_info[node_id] = unique_group_id + if rank not in self.group_node_mapping: + self.group_node_mapping[rank] = {} + self.group_node_mapping[rank].update(group_info) + + def _make_batch_p2p_mapping(self, node, rank, batch_p2p_count): + """ + 给batch_isend_irecv接口的每个p2p内容赋予唯一标识 + """ + if rank not in self.group_node_mapping: + self.group_node_mapping[rank] = {} + params = [] + for info_dict in node.batch_p2p_info: + op = info_dict.get(GraphConst.OP) + target_rank = 
info_dict.get(GraphConst.PEER) + if op is None or target_rank is None: + logger.warning('Cannot get param op or peer.') + continue + group_id = op + Const.REPLACEMENT_CHARACTER + Const.RANK + str(target_rank) + \ + Const.REPLACEMENT_CHARACTER + info_dict.get(GraphConst.GROUP_ID, '') + batch_p2p_count[group_id] = batch_p2p_count.get(group_id, 0) + 1 + # 例如: isend_rank0_5a4d31ad765260ba50eb190f1f9fd163_1 + unique_group_id = group_id + Const.REPLACEMENT_CHARACTER + str(batch_p2p_count.get(group_id)) + params.append(unique_group_id) + self.group_node_mapping.get(rank)[unique_group_id] = node.id + if params: + self.group_node_mapping.get(rank)[node.id] = params + + def _get_distributed_name_and_type(self, node_id): + if Const.SEP not in node_id: + raise ValueError(f'Invalid node id {node_id}.') + api_name = node_id.split(Const.SEP)[1] + if api_name in self.config: + return api_name, self.config.get(api_name)[3] + return api_name, DistributedType.COLLECTIVE + + def _get_target_node(self, rank, unique_group_id, api_name, target_rank, target_api_name=None): + """ + 获取名称匹配上的目标节点 + :param rank: 当前rank + :param unique_group_id: 当前节点唯一group id + :param api_name: 当前节点的api名称, 例如Distributed.isend.0.forward, api名称为isend + :param target_rank: 与当前节点产生通信的rank + :param target_api_name: 与当前节点产生通信的节点api名称, 仅p2p通信需要配置 + :return: 目标节点 + """ + target_graph = self.graphs.get(target_rank) + if not target_graph: + logger.warning(f'Graph data does not exist, {CANNOT_MATCH}{target_rank}') + return None + target_group_mapping = self.group_node_mapping.get(target_rank) + # p2p通信,想要获取目标节点,需要替换unique_group_id中的rank和api name, + # 例如isend发送到rank1,对应的irecv接收自rank0, isend_rank1与irecv_rank0对应 + target_unique_group_id = (unique_group_id + .replace(Const.RANK + str(target_rank), Const.RANK + str(rank)) + .replace(api_name, target_api_name)) if target_api_name else unique_group_id + target_node_id = target_group_mapping.get(target_unique_group_id, '') + target_node = 
target_graph.node_map.get(target_node_id) + if not target_node: + logger.warning(f'Node {target_node_id} does not exist, {CANNOT_MATCH}{target_rank}') + return None + return target_node + + def _add_node_matched_distributed(self, node, target_node, api_name, target_rank, reversal_type=False): + """ + 给当前节点添加matched_distributed字段信息 + :param node: 当前节点 + :param target_node: 匹配上的目标节点 + :param api_name: 当前节点的api名称 + :param target_rank: 匹配上的目标rank + :param reversal_type: 是否需要反转通信类型,例如broadcast在rank0通信类型是发送,但在其他rank通信类型是接收 + """ + communications_type = self.config.get(api_name)[2] + communications_type = self._get_opposite_communication_type(communications_type) if reversal_type \ + else communications_type + index = target_node.data.get(GraphConst.OVERFLOW_LEVEL, CompareConst.NAN) if self.overflow_check \ + else target_node.data.get(GraphConst.JSON_INDEX_KEY, CompareConst.NAN) + matched_distributed = { + 'communications_type': communications_type, + 'nodes_info': {target_rank: [str(index), target_node.id]} + } + node.matched_distributed = matched_distributed + + def _p2p_match(self, node, rank, api_name): + """ + 点对点通信匹配 + + 根据当前点对点通信节点的输出数据中的src或dst参数, 确定目标rank, 并从目标rank中找到对应的点对点通信节点, 校验输出数据是否一致, + 校验通过则在两个匹配节点增加匹配信息 + Args: + node: 当前点对点通信节点 + rank: 当前节点所属rank + api_name: 当前节点的api名称 + Returns: + """ + config_info = self.config.get(api_name) + target_api_name = config_info[0] + # + target_rank = self._get_target_rank(node, rank, config_info[1]) + if target_rank is None: + return + unique_group_id = self.group_node_mapping.get(rank, {}).get(node.id, '') + target_node = self._get_target_node(rank, unique_group_id, api_name, target_rank, target_api_name) + if not target_node: + return + target_config_info = self.config.get(target_api_name) + source_rank = (target_node.input_data.get(f'{target_node.id}{GraphConst.INPUT}{target_config_info[1]}', {}) + .get('value')) + if source_rank is None: + logger.warning( + f'The kwarg {target_config_info[1]} of node {target_node.id} 
does not exist, ' + f'{CANNOT_MATCH}{target_rank}') + return + if source_rank != rank: + # 点对点通信,待匹配目标节点包含的rank信息要与当前rank一致 + logger.warning( + f'{node.id} of rank{rank} is expected to communicate with {target_node.id} of rank{target_rank}, ' + f'but the data shows that {target_node.id} communicates with rank{source_rank}.' + f' The rank is inconsistent, cannot match distributed node') + return + + # 点对点通信,两个匹配节点的输出数据要一致 + if not DistributedAnalyzer._node_output_all_equal(node.output_data.get(node.id + '.output.0'), + target_node.output_data.get(target_node.id + '.output.0')): + logger.warning(f'{node.id} output of rank{rank} is different from the {target_node.id} ' + f'output of rank{target_rank}, cannot match distributed node') + return + + self._add_node_matched_distributed(node, target_node, api_name, target_rank) + self._add_node_matched_distributed(target_node, node, target_api_name, rank) + + def _collective_match(self, node, rank, api_name): + """ + 集体通信匹配 + + 一对多通信和多对一通信, 需要先获取节点输入数据中的src或dst或位置参数, 确定发送源或接收源, 多对多通信不需要 + :param node: 当前集体通信节点 + :param rank: 当前节点所属rank + :param api_name: 当前节点的api名称 + :return: + """ + communications_type = CommunicationType.SEND_RECEIVE.value + config_info = self.config.get(api_name) + if config_info: + # 此时为一对多通信或多对一通信 + source_rank = self._get_target_rank(node, rank, config_info[1]) + if source_rank is None or str(source_rank) != str(rank): + return + communications_type = config_info[2] + group_ranks, group_id = self._get_group_info(node, rank) + if not group_ranks or not group_id: + return + unique_group_id = self.group_node_mapping.get(rank, {}).get(node.id, '') + matched_distributed = {'communications_type': communications_type} + nodes_info = {} + for target_rank in group_ranks: + if str(target_rank) == str(rank): + continue + target_node = self._get_target_node(rank, unique_group_id, api_name, target_rank) + if not target_node: + continue + _, target_group_id = self._get_group_info(target_node, target_rank) + if not 
target_group_id: + continue + if group_id != target_group_id: + logger.warning( + f'{node.id} of rank{rank} is expected to communicate with {target_node.id} of rank{target_rank}' + f', but the data shows that the group ids of the two nodes are different, ' + f'cannot match distributed node') + continue + # 给当前通信节点添加matched_distributed字段信息 + index = target_node.data.get(GraphConst.OVERFLOW_LEVEL, CompareConst.NAN) if self.overflow_check \ + else target_node.data.get(GraphConst.JSON_INDEX_KEY, CompareConst.NAN) + nodes_info[target_rank] = [str(index), target_node.id] + if config_info: + # 给匹配上的目标节点也添加matched_distributed字段信息 + self._add_node_matched_distributed(target_node, node, api_name, rank, True) + if nodes_info: + matched_distributed['nodes_info'] = nodes_info + node.matched_distributed = matched_distributed + + def _batch_p2p_match(self, node, rank): + """ + 批量点对点匹配 + + 针对torch.distributed.batch_isend_irecv接口,其入参是一个包含点对点通信信息的集合,需要遍历集合对每个点对点通信信息进行匹配 + :param node: 当前集体通信节点 + :param rank: 当前节点所属rank + :return: + """ + unique_group_ids = self.group_node_mapping.get(rank, {}).get(node.id) + if not unique_group_ids: + return + matched_distributed = [] if len(unique_group_ids) > 1 else {} + for unique_group_id in unique_group_ids: + try: + id_info = unique_group_id.split(Const.REPLACEMENT_CHARACTER) + api_name = id_info[0] + target_api_name = self.config.get(api_name)[0] + target_rank = int(id_info[1].replace(Const.RANK, '')) + except Exception as e: + logger.warning(f'Failed to parse batch p2p parameters with error info: {e}.') + continue + target_node = self._get_target_node(rank, unique_group_id, api_name, target_rank, target_api_name) + if not target_node: + continue + communications_type = self.config.get(api_name)[2] + index = target_node.data.get(GraphConst.OVERFLOW_LEVEL, CompareConst.NAN) if self.overflow_check \ + else target_node.data.get(GraphConst.JSON_INDEX_KEY, CompareConst.NAN) + matched_info = { + 'communications_type': communications_type, + 
'nodes_info': {target_rank: [str(index), target_node.id]} + } + matched_distributed.append(matched_info) if isinstance(matched_distributed, list) \ + else matched_distributed.update(matched_info) + if matched_distributed: + node.matched_distributed = matched_distributed diff --git a/debug/accuracy_tools/msprobe/visualization/graph/graph.py b/debug/accuracy_tools/msprobe/visualization/graph/graph.py index 2c88a09dd73d0ab3d64442da785e79e1ba3c004d..5ce12d1cadb9aec2cc7c65954bb861b85032212d 100644 --- a/debug/accuracy_tools/msprobe/visualization/graph/graph.py +++ b/debug/accuracy_tools/msprobe/visualization/graph/graph.py @@ -12,7 +12,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. - +from msprobe.core.overflow_check.checker import AnomalyDetector from msprobe.visualization.graph.base_node import BaseNode from msprobe.visualization.graph.node_op import NodeOp from msprobe.visualization.utils import GraphConst @@ -20,12 +20,17 @@ from msprobe.core.common.log import logger from msprobe.core.common.const import Const +MAX_RECUR_LEVEL = 100 + + class Graph: - def __init__(self, model_name): + def __init__(self, model_name, data_path='', dump_data=None): self.node_map = {} self.node_id_map = {} self.add_node(NodeOp.module, model_name) self.root = self.get_node(model_name) + self.data_path = data_path + self.dump_data = dump_data def __str__(self): infos = [f'{str(self.node_map.get(node_id))}' for node_id in self.node_map] @@ -51,12 +56,21 @@ class Graph: return node_b, ancestors_n @staticmethod - def mapping_match(node_n, graph_b, mapping_config): + def mapping_match(node_n, graph_b, mapping_dict): """ 根据映射配置对节点进行匹配 """ - node_b = graph_b.node_map.get(mapping_config.get_mapping_string(node_n.id)) - if not node_b or not node_n.compare_mapping_node(node_b): + node_b = graph_b.node_map.get(mapping_dict.get(node_n.id, node_n.id)) + if not 
node_b: + return None, [], [] + ancestors_n = node_n.get_ancestors() + ancestors_b = node_b.get_ancestors() + return node_b, ancestors_n, ancestors_b + + + @staticmethod + def fuzzy_match(node_n, node_b): + if not node_n or not node_b or not node_n.fuzzy_eq(node_b): return None, [], [] ancestors_n = node_n.get_ancestors() ancestors_b = node_b.get_ancestors() @@ -72,27 +86,45 @@ class Graph: @staticmethod def split_nodes_by_micro_step(nodes): """ - 根据Module名称后缀数字, 区分一个step中的多个micro steps, 后缀数字相同代表节点属于同一个micro step. + 根据Module名称, 区分一个step中的多个micro steps. + 一个micro step必须是一次完整的前反向过程 + Example:: + =============== micro step0 + Module.forward + Module.forward + ... + Module.backward + Module.backward + =============== micro step1 + Module.forward + Module.forward + ... + Module.backward + Module.backward + =============== micro step2 + Module.forward + Module.forward + ... + Module.backward + Module.backward + 如果是非Module节点,分类到前一个Module节点所在的micro step. """ result = {} - default_id = 0 - result[default_id] = [] + micro_step = 0 + result[micro_step] = [] + backward_flag = False for node in nodes: if node.op == NodeOp.module: - micro_step_id = node.id.split(Const.SEP)[-1] - try: - micro_step_id = int(micro_step_id) - except ValueError: - logger.warning(f'The node id suffix {micro_step_id} is not a number, micro steps cannot be split.') - micro_step_id = 0 - if micro_step_id not in result: - default_id = micro_step_id - result[micro_step_id] = [] - result[micro_step_id].append(node) - else: - result[default_id].append(node) + if f'{Const.SEP}{Const.FORWARD}{Const.SEP}' in node.id: + if backward_flag: + micro_step += 1 + result[micro_step] = [] + backward_flag = False + else: + backward_flag = True + result[micro_step].append(node) return result def add_node(self, node_op, node_id, up_node=None, id_accumulation=False): @@ -131,6 +163,7 @@ class Graph: """ result = {} result[GraphConst.JSON_ROOT_KEY] = self.root.id if self.root else 'None' + result[GraphConst.JSON_DATA_KEY] = 
self.data_path result[GraphConst.JSON_NODE_KEY] = {} for node_id in self.node_map: info = self.node_map.get(node_id).to_dict() @@ -165,3 +198,12 @@ class Graph: micro_step_id = 0 node.micro_step_id = micro_step_id return len(batches_n) + + def overflow_check(self): + detector = AnomalyDetector(self.dump_data) + detector.analyze().filter() + + for node_id, _node in self.node_map.items(): + if detector.has_overflow(node_id): + lv = detector.get_overflow_level(node_id) + _node.set_overflow_level(lv) diff --git a/debug/accuracy_tools/msprobe/visualization/graph/node_colors.py b/debug/accuracy_tools/msprobe/visualization/graph/node_colors.py index 51b908ada407f67a56ed65fd3b24c99c9c767d9d..f824ec433043bbdae7f0b12c4f1f6796da9116b7 100644 --- a/debug/accuracy_tools/msprobe/visualization/graph/node_colors.py +++ b/debug/accuracy_tools/msprobe/visualization/graph/node_colors.py @@ -44,7 +44,7 @@ class NodeColors(Enum): GraphConst.SUMMARY_COMPARE: {GraphConst.VALUE: [0.6, 0.8], GraphConst.DESCRIPTION: SUMMARY_DESCRIPTION}, GraphConst.REAL_DATA_COMPARE: {GraphConst.VALUE: [0.15, 0.2], GraphConst.DESCRIPTION: REAL_DATA_DESCRIPTION} }) - RED = ("#E32020", { + RED = ("#FF704D", { GraphConst.SUMMARY_COMPARE: {GraphConst.VALUE: [0.8, 1], GraphConst.DESCRIPTION: SUMMARY_DESCRIPTION}, GraphConst.REAL_DATA_COMPARE: {GraphConst.VALUE: [0.2, 1], GraphConst.DESCRIPTION: REAL_DATA_DESCRIPTION}, GraphConst.MD5_COMPARE: {GraphConst.VALUE: [0, 0], GraphConst.DESCRIPTION: MD5_DESCRIPTION_N}, diff --git a/debug/accuracy_tools/msprobe/visualization/graph/node_op.py b/debug/accuracy_tools/msprobe/visualization/graph/node_op.py index 26839398ca3b15ab6b8cffa9137999befb6541b2..33bfa9cc2e34a0960c3ff236a1bd183a5753a0ab 100644 --- a/debug/accuracy_tools/msprobe/visualization/graph/node_op.py +++ b/debug/accuracy_tools/msprobe/visualization/graph/node_op.py @@ -16,6 +16,7 @@ from enum import Enum import re from msprobe.visualization.builder.msprobe_adapter import op_patterns +from 
msprobe.core.common.log import logger class NodeOp(Enum): @@ -32,8 +33,9 @@ class NodeOp(Enum): for op in NodeOp: index = op.value if index < 0 or index >= len(op_patterns): - raise Exception("NodeOp and op_patterns in MsprobeAdapter do not match") + continue pattern = op_patterns[index] if re.match(pattern, node_name): return op - raise Exception(f"Cannot parse node_name {node_name} into NodeOp") + logger.warning(f"Cannot parse node_name {node_name} into NodeOp, defaulting to module.") + return NodeOp.module diff --git a/debug/accuracy_tools/msprobe/visualization/graph_service.py b/debug/accuracy_tools/msprobe/visualization/graph_service.py index 8f59fd1d92ae6c9d215f6acf53a648de59b51f3e..75b0014c1c09abb8dfecf285fed5eed3063827a0 100644 --- a/debug/accuracy_tools/msprobe/visualization/graph_service.py +++ b/debug/accuracy_tools/msprobe/visualization/graph_service.py @@ -15,70 +15,193 @@ import os import time -import argparse -import sys import json -from msprobe.core.common.file_utils import FileOpen, check_file_type, create_directory -from msprobe.core.common.const import FileCheckConst +from msprobe.core.common.file_utils import (check_file_type, create_directory, FileChecker, + check_file_or_directory_path, load_json) +from msprobe.core.common.const import FileCheckConst, Const from msprobe.core.common.utils import CompareException +from msprobe.core.overflow_check.checker import AnomalyDetector from msprobe.visualization.compare.graph_comparator import GraphComparator -from msprobe.visualization.utils import GraphConst +from msprobe.visualization.utils import GraphConst, check_directory_content from msprobe.visualization.builder.graph_builder import GraphBuilder, GraphExportConfig from msprobe.core.common.log import logger -from msprobe.visualization.mapping_config import MappingConfig from msprobe.visualization.graph.node_colors import NodeColors +from msprobe.core.compare.layer_mapping import generate_api_mapping_by_layer_mapping +from
msprobe.core.compare.utils import check_and_return_dir_contents +from msprobe.visualization.graph.distributed_analyzer import DistributedAnalyzer current_time = time.strftime("%Y%m%d%H%M%S") -def compare_graph(dump_path_n, dump_path_b, out_path, is_print_compare_log=True, mapping_file=None): +def _compare_graph(input_param, args): logger.info('Start building model graphs...') # 对两个数据进行构图 - construct_path_n = os.path.join(dump_path_n, GraphConst.CONSTRUCT_FILE) - construct_path_b = os.path.join(dump_path_b, GraphConst.CONSTRUCT_FILE) - data_path_n = os.path.join(dump_path_n, GraphConst.DUMP_FILE) - data_path_b = os.path.join(dump_path_b, GraphConst.DUMP_FILE) - graph_n = GraphBuilder.build(construct_path_n, data_path_n) - graph_b = GraphBuilder.build(construct_path_b, data_path_b) + dump_path_n = input_param.get('npu_path') + dump_path_b = input_param.get('bench_path') + construct_path_n = FileChecker(os.path.join(dump_path_n, GraphConst.CONSTRUCT_FILE), + FileCheckConst.FILE, FileCheckConst.READ_ABLE).common_check() + construct_path_b = FileChecker(os.path.join(dump_path_b, GraphConst.CONSTRUCT_FILE), + FileCheckConst.FILE, FileCheckConst.READ_ABLE).common_check() + data_path_n = FileChecker(os.path.join(dump_path_n, GraphConst.DUMP_FILE), FileCheckConst.FILE, + FileCheckConst.READ_ABLE).common_check() + data_path_b = FileChecker(os.path.join(dump_path_b, GraphConst.DUMP_FILE), FileCheckConst.FILE, + FileCheckConst.READ_ABLE).common_check() + stack_path_n = FileChecker(os.path.join(dump_path_n, GraphConst.STACK_FILE), FileCheckConst.FILE, + FileCheckConst.READ_ABLE).common_check() + stack_path_b = FileChecker(os.path.join(dump_path_b, GraphConst.STACK_FILE), FileCheckConst.FILE, + FileCheckConst.READ_ABLE).common_check() + graph_n = GraphBuilder.build(construct_path_n, data_path_n, stack_path_n, complete_stack=args.complete_stack) + graph_b = GraphBuilder.build(construct_path_b, data_path_b, stack_path_b, complete_stack=args.complete_stack) logger.info('Model 
graphs built successfully, start comparing graphs...') # 基于graph、stack和data进行比较 - stack_path = os.path.join(dump_path_n, GraphConst.STACK_FILE) dump_path_param = { 'npu_json_path': data_path_n, 'bench_json_path': data_path_b, - 'stack_json_path': stack_path, - 'is_print_compare_log': is_print_compare_log + 'stack_json_path': stack_path_n, + 'is_print_compare_log': input_param.get("is_print_compare_log", True) } - graph_comparator = GraphComparator([graph_n, graph_b], dump_path_param, out_path, - mapping_config=MappingConfig(mapping_file) if mapping_file else None) + mapping_dict = None + if args.layer_mapping: + yaml_path = FileChecker(args.layer_mapping, FileCheckConst.FILE, FileCheckConst.READ_ABLE).common_check() + try: + mapping_dict = generate_api_mapping_by_layer_mapping(data_path_n, data_path_b, yaml_path) + except Exception: + logger.warning('Failed to parse the layer mapping file, please check the file format; the mapping will not take effect.') + graph_comparator = GraphComparator([graph_n, graph_b], dump_path_param, args, mapping_dict=mapping_dict) graph_comparator.compare() micro_steps = graph_n.paging_by_micro_step(graph_b) - create_directory(out_path) - output_path = os.path.join(out_path, f'compare_{current_time}.vis') - export_config = GraphExportConfig(graph_n, graph_b, graph_comparator.ma.get_tool_tip(), - NodeColors.get_node_colors(graph_comparator.ma.compare_mode), micro_steps) + # 开启溢出检测 + if args.overflow_check: + graph_n.overflow_check() + graph_b.overflow_check() + + return CompareGraphResult(graph_n, graph_b, graph_comparator, micro_steps) + + +def _export_compare_graph_result(args, graphs, graph_comparator, micro_steps, + output_file_name=f'compare_{current_time}.vis'): + create_directory(args.output_path) + output_path = os.path.join(args.output_path, output_file_name) + task = GraphConst.GRAPHCOMPARE_MODE_TO_DUMP_MODE_TO_MAPPING.get(graph_comparator.ma.compare_mode) + export_config = GraphExportConfig(graphs[0], graphs[1],
graph_comparator.ma.get_tool_tip(), + NodeColors.get_node_colors(graph_comparator.ma.compare_mode), micro_steps, task, + args.overflow_check) GraphBuilder.to_json(output_path, export_config) logger.info(f'Model graphs compared successfully, the result file is saved in {output_path}') -def build_graph(dump_path, out_path): +def _build_graph(dump_path, args): logger.info('Start building model graph...') - construct_path = os.path.join(dump_path, GraphConst.CONSTRUCT_FILE) - data_path = os.path.join(dump_path, GraphConst.DUMP_FILE) - create_directory(out_path) - output_path = os.path.join(out_path, f'build_{current_time}.vis') - graph = GraphBuilder.build(construct_path, data_path) + construct_path = FileChecker(os.path.join(dump_path, GraphConst.CONSTRUCT_FILE), FileCheckConst.FILE, + FileCheckConst.READ_ABLE).common_check() + data_path = FileChecker(os.path.join(dump_path, GraphConst.DUMP_FILE), FileCheckConst.FILE, + FileCheckConst.READ_ABLE).common_check() + stack_path = FileChecker(os.path.join(dump_path, GraphConst.STACK_FILE), FileCheckConst.FILE, + FileCheckConst.READ_ABLE).common_check() + graph = GraphBuilder.build(construct_path, data_path, stack_path, complete_stack=args.complete_stack) micro_steps = graph.paging_by_micro_step() - GraphBuilder.to_json(output_path, GraphExportConfig(graph, micro_steps=micro_steps)) + # 开启溢出检测 + if args.overflow_check: + graph.overflow_check() + return BuildGraphResult(graph, micro_steps) + + +def _export_build_graph_result(out_path, graph, micro_steps, overflow_check, + output_file_name=f'build_{current_time}.vis'): + create_directory(out_path) + output_path = os.path.join(out_path, output_file_name) + GraphBuilder.to_json(output_path, GraphExportConfig(graph, micro_steps=micro_steps, overflow_check=overflow_check)) logger.info(f'Model graph built successfully, the result file is saved in {output_path}') -def _graph_service(parser=None): - if not parser: - parser = argparse.ArgumentParser() - _graph_service_parser(parser) - 
args = parser.parse_args(sys.argv[1:]) - _graph_service_command(args) +def _compare_graph_ranks(input_param, args, step=None): + dump_rank_n = input_param.get('npu_path') + dump_rank_b = input_param.get('bench_path') + npu_ranks = sorted(check_and_return_dir_contents(dump_rank_n, Const.RANK)) + bench_ranks = sorted(check_and_return_dir_contents(dump_rank_b, Const.RANK)) + if npu_ranks != bench_ranks: + logger.error('The number of ranks in the two runs is different. Unable to match the ranks.') + raise CompareException(CompareException.INVALID_PATH_ERROR) + compare_graph_results = [] + for nr, br in zip(npu_ranks, bench_ranks): + logger.info(f'Start processing data for {nr}...') + input_param['npu_path'] = os.path.join(dump_rank_n, nr) + input_param['bench_path'] = os.path.join(dump_rank_b, br) + output_file_name = f'compare_{step}_{nr}_{current_time}.vis' if step else f'compare_{nr}_{current_time}.vis' + result = _compare_graph(input_param, args) + result.output_file_name = output_file_name + if nr != Const.RANK: + try: + result.rank = int(nr.replace(Const.RANK, "")) + except Exception as e: + logger.error('The folder name format is incorrect, expected rank+number.') + raise CompareException(CompareException.INVALID_PATH_ERROR) from e + # 暂存所有rank的graph,用于匹配rank间的分布式节点 + compare_graph_results.append(result) + + # 匹配rank间的分布式节点 + if len(compare_graph_results) > 1: + DistributedAnalyzer({obj.rank: obj.graph_n for obj in compare_graph_results}, + args.overflow_check).distributed_match() + DistributedAnalyzer({obj.rank: obj.graph_b for obj in compare_graph_results}, + args.overflow_check).distributed_match() + + for result in compare_graph_results: + _export_compare_graph_result(args, [result.graph_n, result.graph_b], result.graph_comparator, + result.micro_steps, output_file_name=result.output_file_name) + + +def _compare_graph_steps(input_param, args): + dump_step_n = input_param.get('npu_path') + dump_step_b = input_param.get('bench_path') + + npu_steps =
sorted(check_and_return_dir_contents(dump_step_n, Const.STEP)) + bench_steps = sorted(check_and_return_dir_contents(dump_step_b, Const.STEP)) + + if npu_steps != bench_steps: + logger.error('The number of steps in the two runs is different. Unable to match the steps.') + raise CompareException(CompareException.INVALID_PATH_ERROR) + + for folder_step in npu_steps: + logger.info(f'Start processing data for {folder_step}...') + input_param['npu_path'] = os.path.join(dump_step_n, folder_step) + input_param['bench_path'] = os.path.join(dump_step_b, folder_step) + + _compare_graph_ranks(input_param, args, step=folder_step) + + +def _build_graph_ranks(dump_ranks_path, args, step=None): + ranks = sorted(check_and_return_dir_contents(dump_ranks_path, Const.RANK)) + build_graph_results = [] + for rank in ranks: + logger.info(f'Start processing data for {rank}...') + dump_path = os.path.join(dump_ranks_path, rank) + output_file_name = f'build_{step}_{rank}_{current_time}.vis' if step else f'build_{rank}_{current_time}.vis' + result = _build_graph(dump_path, args) + result.output_file_name = output_file_name + if rank != Const.RANK: + try: + result.rank = int(rank.replace(Const.RANK, "")) + except Exception as e: + logger.error('The folder name format is incorrect, expected rank+number.') + raise CompareException(CompareException.INVALID_PATH_ERROR) from e + build_graph_results.append(result) + + if len(build_graph_results) > 1: + DistributedAnalyzer({obj.rank: obj.graph for obj in build_graph_results}, + args.overflow_check).distributed_match() + + for result in build_graph_results: + _export_build_graph_result(args.output_path, result.graph, result.micro_steps, args.overflow_check, + result.output_file_name) + + +def _build_graph_steps(dump_steps_path, args): + steps = sorted(check_and_return_dir_contents(dump_steps_path, Const.STEP)) + for step in steps: + logger.info(f'Start processing data for {step}...') + dump_ranks_path = os.path.join(dump_steps_path, step) +
_build_graph_ranks(dump_ranks_path, args, step) def _graph_service_parser(parser): @@ -86,18 +209,79 @@ help=" The compare input path, a dict json.", required=True) parser.add_argument("-o", "--output_path", dest="output_path", type=str, help=" The compare task result out path.", required=True) + parser.add_argument("-lm", "--layer_mapping", dest="layer_mapping", type=str, + help=" The layer mapping file path.", required=False) + parser.add_argument("-oc", "--overflow_check", dest="overflow_check", action="store_true", + help=" Whether to enable overflow check for the graph.", required=False) + parser.add_argument("-f", "--fuzzy_match", dest="fuzzy_match", action="store_true", + help=" Whether to perform a fuzzy match on the api name.", required=False) + parser.add_argument("-cs", "--complete_stack", dest="complete_stack", action="store_true", + help=" Whether to use complete stack information.", required=False) def _graph_service_command(args): - with FileOpen(args.input_path, "r") as file: - input_param = json.load(file) + input_param = load_json(args.input_path) npu_path = input_param.get("npu_path") bench_path = input_param.get("bench_path") - is_print_compare_log = input_param.get("is_print_compare_log", True) + check_file_or_directory_path(npu_path, isdir=True) + if bench_path: + check_file_or_directory_path(bench_path, isdir=True) if check_file_type(npu_path) == FileCheckConst.DIR and not bench_path: - build_graph(npu_path, args.output_path) + content = check_directory_content(npu_path) + if content == GraphConst.RANKS: + _build_graph_ranks(npu_path, args) + elif content == GraphConst.STEPS: + _build_graph_steps(npu_path, args) + else: + result = _build_graph(npu_path, args) + _export_build_graph_result(args.output_path, result.graph, result.micro_steps, args.overflow_check) elif check_file_type(npu_path) == FileCheckConst.DIR and check_file_type(bench_path) == FileCheckConst.DIR: - compare_graph(npu_path, bench_path,
args.output_path, is_print_compare_log=is_print_compare_log) + content_n = check_directory_content(npu_path) + content_b = check_directory_content(bench_path) + if content_n != content_b: + raise ValueError('The directory structures of npu_path and bench_path are inconsistent.') + if content_n == GraphConst.RANKS: + _compare_graph_ranks(input_param, args) + elif content_n == GraphConst.STEPS: + _compare_graph_steps(input_param, args) + else: + result = _compare_graph(input_param, args) + _export_compare_graph_result(args, [result.graph_n, result.graph_b], + result.graph_comparator, result.micro_steps) else: logger.error("The npu_path or bench_path should be a folder.") raise CompareException(CompareException.INVALID_COMPARE_MODE) + + +def _pt_graph_service_parser(parser): + _graph_service_parser(parser) + + +def _pt_graph_service_command(args): + _graph_service_command(args) + + +def _ms_graph_service_parser(parser): + _graph_service_parser(parser) + + +def _ms_graph_service_command(args): + _graph_service_command(args) + + +class CompareGraphResult: + def __init__(self, graph_n, graph_b, graph_comparator, micro_steps, rank=0, output_file_name=''): + self.graph_n = graph_n + self.graph_b = graph_b + self.graph_comparator = graph_comparator + self.micro_steps = micro_steps + self.rank = rank + self.output_file_name = output_file_name + + +class BuildGraphResult: + def __init__(self, graph, micro_steps, rank=0, output_file_name=''): + self.graph = graph + self.micro_steps = micro_steps + self.rank = rank + self.output_file_name = output_file_name diff --git a/debug/accuracy_tools/msprobe/visualization/mapping_config.py b/debug/accuracy_tools/msprobe/visualization/mapping_config.py deleted file mode 100644 index 2a7b849252d8e4dac3ea2023236438c39cae5612..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/msprobe/visualization/mapping_config.py +++ /dev/null @@ -1,77 +0,0 @@ -import re -import yaml -from msprobe.core.common.file_utils import FileOpen 
-from msprobe.core.common.const import Const -from msprobe.visualization.utils import GraphConst - - -class MappingConfig: - MAX_STRING_LEN = 10000 - - def __init__(self, yaml_file): - with FileOpen(yaml_file, 'r') as file: - config = yaml.safe_load(file) - try: - self.config = {key: self.validate(key, value) for data in config for key, value in data.items()} - except Exception as e: - raise RuntimeError("Line of yaml contains content that is not '- key: value'.") from e - self.classify_config = self._classify_and_sort_keys() - - @staticmethod - def validate(key, value): - if not isinstance(key, str): - raise ValueError(f"{key} must be a string.") - if not isinstance(value, str): - raise ValueError(f"{value} must be a string.") - return value - - @staticmethod - def convert_to_regex(s): - """ - 字符串转换为正则表达式, {}替换为d+以匹配一个或多个数字, 开始和结束添加.*以匹配任意前缀和后缀 - Args: - s: 字符串 - Returns: 正则表达式 - """ - escaped_pattern = re.escape(s) - pattern = re.sub(r'\\\{\\\}', r'\\d+', escaped_pattern) - pattern = f'.*{pattern}.*' - return pattern - - @staticmethod - def _replace_parts(origin_string, mapping_key, mapping_value): - if GraphConst.BRACE in mapping_key: - parts = mapping_key.split(GraphConst.BRACE) - m_parts = mapping_value.split(GraphConst.BRACE) - return origin_string.replace(parts[0], m_parts[0]).replace(parts[1], m_parts[1]) - else: - return origin_string.replace(mapping_key, mapping_value) - - def get_mapping_string(self, origin_string: str): - if len(origin_string) > MappingConfig.MAX_STRING_LEN: - return origin_string - for category, items in self.classify_config.items(): - if category in origin_string: - for key, value in items: - if re.match(MappingConfig.convert_to_regex(key), origin_string): - return MappingConfig._replace_parts(origin_string, key, value) - return origin_string - - def _classify_and_sort_keys(self): - categorized_dict = {} - for key, value in self.config.items(): - parts = key.split(Const.SEP) - # 获取第一个部分作为新的分类key - category_key = parts[0] - - if 
category_key not in categorized_dict: - categorized_dict[category_key] = [] - - # 将原始的key-value对添加到对应的分类中 - categorized_dict[category_key].append((key, value)) - - # 对每个分类中的项按key中的.数量进行排序, .数量越多排越靠前, 优先匹配 - for category in categorized_dict: - categorized_dict[category].sort(key=lambda x: -x[0].count(Const.SEP)) - - return categorized_dict diff --git a/debug/accuracy_tools/msprobe/visualization/utils.py b/debug/accuracy_tools/msprobe/visualization/utils.py index be789b99ce9448f0dd810af55c28ea3abbd25910..f6e8258bb67cd850b5fc93b2b541b263fb48578e 100644 --- a/debug/accuracy_tools/msprobe/visualization/utils.py +++ b/debug/accuracy_tools/msprobe/visualization/utils.py @@ -13,10 +13,12 @@ # See the License for the specific language governing permissions and # limitations under the License. +import os +import re import json from msprobe.core.common.file_utils import FileOpen from msprobe.core.common.const import CompareConst, Const -from msprobe.core.compare.acc_compare import Comparator +from msprobe.core.compare.acc_compare import Comparator, ModeConfig def load_json_file(file_path): @@ -48,12 +50,13 @@ def save_json_file(file_path, data): f.write(json.dumps(data, indent=4)) -def get_csv_df(stack, csv_data, compare_mode): +def get_csv_df(stack_mode, csv_data, compare_mode): """ 调用acc接口写入csv """ dump_mode = GraphConst.GRAPHCOMPARE_MODE_TO_DUMP_MODE_TO_MAPPING.get(compare_mode) - return Comparator.make_result_table(csv_data, stack, dump_mode) + mode_config = ModeConfig(stack_mode=stack_mode, dump_mode=dump_mode) + return Comparator(mode_config).make_result_table(csv_data) def str2float(percentage_str): @@ -70,18 +73,57 @@ def str2float(percentage_str): return 0 -def process_kwargs_parameter(parameter): +def is_integer(s): + try: + int(s) + return True + except Exception: + return False + + +def check_directory_content(input_path): """ - 转换kwargs参数命名 - Args: - parameter: 'Module.module.Float16Module.forward.0.input.labels.0' - Returns: 
'Module.module.Float16Module.forward.0.kwargs.labels' + 检查input_path内容, 是否全是step{数字}命名的文件夹(例如step0), 或者全是rank{数字}命名的文件夹(例如rank0), 或者全是文件 """ - parts = parameter.split(Const.SEP) - if parts[GraphConst.OUTPUT_INDEX_THREE] == GraphConst.INPUT: - parts[GraphConst.OUTPUT_INDEX_THREE] = 'kwargs' - return Const.SEP.join(parts[:-1]) - return parameter + contents = os.listdir(input_path) + if not contents: + raise ValueError(f'The path {input_path} is empty.') + + # 真实数据dump会有dump_tensor_data文件夹 + if os.path.exists(os.path.join(input_path, Const.DUMP_TENSOR_DATA)): + return GraphConst.FILES + + # 检查是否全是文件 + if all(os.path.isfile(os.path.join(input_path, item)) for item in contents): + return GraphConst.FILES + + # 单卡只有一个rank文件夹 + if contents == [Const.RANK]: + return GraphConst.RANKS + + rank_pattern = re.compile(r'^rank\d+$') + step_pattern = re.compile(r'^step\d+$') + + rank_all = True + step_all = True + + for item in contents: + item_path = os.path.join(input_path, item) + if not os.path.isdir(item_path): + continue + if not rank_pattern.match(item): + rank_all = False + if not step_pattern.match(item): + step_all = False + + if rank_all: + return GraphConst.RANKS + if step_all: + return GraphConst.STEPS + + raise ValueError("The input path content does not conform to the expected naming convention. 
" + "It is expected to be all step{number} named folders (such as step0), " + "all rank{number} named folders (such as rank0), or all files.") class ToolTip: @@ -92,19 +134,16 @@ class ToolTip: MD5 = '数据MD5信息,用于比较两个数据信息是否完全一致' ONE_THOUSANDTH_ERR_RATIO = 'Tensor中的元素逐个与对应的标杆数据对比,相对误差小于千分之一的比例占总元素个数的比例,比例越接近1越好' FIVE_THOUSANDTHS_ERR_RATIO = 'Tensor中的元素逐个与对应的标杆数据对比,相对误差小于千分之五的比例占总元素个数的比例,比例越接近1越好' - COSINE = '通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。在计算中可能会存在nan,主要由于可能会出现其中一个向量为0' + COSINE = ( + '通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。' + '在计算中可能会存在nan,主要由于可能会出现其中一个向量为0' + ) MAX_ABS_ERR = '当最大绝对误差越接近0表示其计算的误差越小,实际可接受阈值为小于0.001' - MAX_RELATIVE_ERR = '当最大相对误差越接近0表示其计算的误差越小。当dump数据中存在0或Nan时,比对结果中最大相对误差则出现inf或Nan的情况,属于正常现象' - SMALL_VALUE_TIP = '{} 小于1e-3,不计算相对误差' - - -class Suggestions: - Module = '此模块精度比对结果疑似异常,请使用msprobe工具的数据采集功能对模块中的api进行dump比对' - API = '此api精度比对结果疑似异常,请使用msprobe工具的预检功能对api进行精度检测' - DUMP = 'msprobe工具的数据采集功能' - DUMP_URL = 'https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/pytorch/doc/dump.md' - API_ACCURACY_CHECKER = 'msprobe工具的预检功能' - API_ACCURACY_CHECKER_URL = 'https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/pytorch/doc/api_accuracy_checker.md' + MAX_RELATIVE_ERR = ( + '当最大相对误差越接近0表示其计算的误差越小。' + '当dump数据中存在0或Nan时,比对结果中最大相对误差则出现inf或Nan的情况,属于正常现象' + ) + SMALL_VALUE_TIP = '{}, 由于{}小于{}, 建议不参考此相对误差,请参考绝对误差' class GraphConst: @@ -116,16 +155,21 @@ class GraphConst: SUMMARY_COMPARE = 0 MD5_COMPARE = 1 REAL_DATA_COMPARE = 2 + STRUCTURE_COMPARE = 3 JSON_NPU_KEY = 'NPU' JSON_BENCH_KEY = 'Bench' JSON_TIP_KEY = 'ToolTip' JSON_ROOT_KEY = 'root' JSON_NODE_KEY = 'node' + JSON_DATA_KEY = 'dump_data_dir' + JSON_TASK_KEY = 'task' DATA_KEY = 'data' REAL_DATA_TH = 0.1 MAX_RELATIVE_ERR_TH = 0.5 ROUND_TH = 6 JSON_INDEX_KEY = 'precision_index' + MATCHED_DISTRIBUTED = 'matched_distributed' + OVERFLOW_LEVEL = 'overflow_level' MAX_INDEX_KEY = 1 MIN_INDEX_KEY = 0 SUGGEST_KEY = 'text' @@ -133,16 +177,14 
@@ class GraphConst: OUTPUT_INDEX_TWO = -2 OUTPUT_INDEX_THREE = -3 OUTPUT_MIN_LEN = 3 - INPUT = 'input' - OUTPUT = 'output' + INPUT = '.input.' + OUTPUT = '.output.' STR_MAX_LEN = 50 SMALL_VALUE = 1e-3 MD5_INDEX_LIST = [CompareConst.RESULT] - REAL_DATA_INDEX_LIST = [CompareConst.COSINE, CompareConst.MAX_ABS_ERR, CompareConst.MAX_RELATIVE_ERR, - CompareConst.ONE_THOUSANDTH_ERR_RATIO, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO] - SUMMARY_INDEX_LIST = [CompareConst.MAX_DIFF, CompareConst.MIN_DIFF, CompareConst.MEAN_DIFF, - CompareConst.NORM_DIFF, CompareConst.MAX_RELATIVE_ERR, CompareConst.MIN_RELATIVE_ERR, - CompareConst.MEAN_RELATIVE_ERR, CompareConst.NORM_RELATIVE_ERR] + REAL_DATA_INDEX_LIST = CompareConst.ALL_COMPARE_INDEX + SUMMARY_INDEX_LIST = CompareConst.SUMMARY_COMPARE_INDEX + VALUE_INDEX_LIST = [Const.MAX, Const.MIN, Const.MEAN, Const.NORM] APIS_BETWEEN_MODULES = 'Apis_Between_Modules' NULL = 'null' NONE = 'None' @@ -151,15 +193,30 @@ class GraphConst: DESCRIPTION = 'description' COLORS = 'Colors' MICRO_STEPS = 'MicroSteps' + OVERFLOW_CHECK = 'OverflowCheck' DUMP_MODE_TO_GRAPHCOMPARE_MODE_MAPPING = { Const.ALL: REAL_DATA_COMPARE, Const.SUMMARY: SUMMARY_COMPARE, - Const.MD5: MD5_COMPARE + Const.MD5: MD5_COMPARE, + Const.STRUCTURE: STRUCTURE_COMPARE } GRAPHCOMPARE_MODE_TO_DUMP_MODE_TO_MAPPING = { REAL_DATA_COMPARE: Const.ALL, SUMMARY_COMPARE: Const.SUMMARY, - MD5_COMPARE: Const.MD5 + MD5_COMPARE: Const.MD5, + STRUCTURE_COMPARE: Const.STRUCTURE } + + RANKS = 'ranks' + STEPS = 'steps' + FILES = 'files' + + SRC = 'src' + DST = 'dst' + + BATCH_P2P = 'batch_isend_irecv' + OP = 'op' + PEER = 'peer' + GROUP_ID = 'group_id' diff --git a/debug/accuracy_tools/setup.py b/debug/accuracy_tools/setup.py index eb58dee7cf614cd8911493b66e89a7a628a4a0f8..2da7fcf667765a841b9db1bbf5628fad5b1cf8a9 100644 --- a/debug/accuracy_tools/setup.py +++ b/debug/accuracy_tools/setup.py @@ -1,4 +1,4 @@ -# Copyright (c) 2024-2024, Huawei Technologies Co., Ltd. 
+# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. # All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -14,8 +14,11 @@ # limitations under the License. -__version__ = '1.1.0' +__version__ = '1.2.2' +import subprocess +import platform +import sys import setuptools INSTALL_REQUIRED = [ @@ -31,21 +34,58 @@ INSTALL_REQUIRED = [ "twisted", "matplotlib", "tensorboard", - "sqlalchemy", - "pymysql" + "tabulate" ] EXCLUDE_PKGS = [ "api_accuracy_checker*", "grad_tool*", "ptdbg_ascend*", + "msprobe.ccsrc*", "msprobe.test*", + "build.sh", + "build_dependency*", + "cmake*", + "output*", + "third_party*", ] +if "--plat-name" in sys.argv or "--python-tag" in sys.argv: + raise SystemError("Specifying platforms or python version is not supported.") + +if platform.system() != "Linux": + raise SystemError("MsProbe is only supported on Linux platforms.") + +mod_list_range = {"adump", } +mod_list = [] +for i, arg in enumerate(sys.argv): + if arg.startswith("--include-mod"): + if arg.startswith("--include-mod="): + mod_list = arg[len("--include-mod="):].split(',') + sys.argv.remove(arg) + elif i + 1 < len(sys.argv) and not sys.argv[i + 1].startswith("--"): + mod_list = sys.argv[i + 1].split(',') + sys.argv.remove(sys.argv[i + 1]) + sys.argv.remove(arg) + mod_list = list(set(mod_list) & mod_list_range) + break + +# 当前只有adump一个mod +if mod_list: + arch = platform.machine() + sys.argv.append("--plat-name") + sys.argv.append(f"linux_{arch}") + sys.argv.append("--python-tag") + sys.argv.append(f"cp{sys.version_info.major}{sys.version_info.minor}") + build_cmd = f"bash ./build.sh -j16 -a {arch} -v {sys.version_info.major}.{sys.version_info.minor}" + p = subprocess.run(build_cmd.split(), shell=False) + if p.returncode != 0: + raise RuntimeError(f"Failed to build source({p.returncode})") + setuptools.setup( name="mindstudio-probe", version=__version__, - description="Pytorch Ascend Probe Utils", + description="Ascend Probe Utils",
long_description="MindStudio-Probe is a set of tools for diagnosing and improving model accuracy on Ascend NPU, " "including API acc checker, ptdbg, grad tool etc.", url="https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/msprobe", @@ -60,6 +100,7 @@ setuptools.setup( 'Intended Audience :: Education', 'Intended Audience :: Science/Research', 'Programming Language :: Python :: 3', + 'Programming Language :: C++', 'Topic :: Scientific/Engineering', 'Topic :: Scientific/Engineering :: Mathematics', 'Topic :: Scientific/Engineering :: Artificial Intelligence', diff --git a/dynolog_npu/README.md b/dynolog_npu/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d6ebd6f7ff04f0fa40601500eaf66b89ed7a7f97 --- /dev/null +++ b/dynolog_npu/README.md @@ -0,0 +1,213 @@ +# Ascend Extension for dynolog + +## 安装方式 + +### 1. clone 代码 + +```bash +git clone https://gitee.com/ascend/mstt.git +``` + +### 2. 安装依赖 +dynolog的编译依赖,确保安装了以下依赖: + + + + + + + + + + + + + +
+| Language | Toolchain |
+| -------- | --------- |
+| C++ | gcc 8.5.0+ |
+| Rust | Rust 1.58.1 (1.56+ required for clap dependency) |
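
编译前可以先快速自检依赖是否就绪。下面是一个最小示意脚本(编辑补充的示例,非仓库自带;仅检查工具是否在 PATH 中,版本号是否满足上表要求仍需手动用 `gcc --version`、`rustc --version` 确认):

```shell
# 逐个检查编译所需工具是否已安装(仅检查存在性,不检查版本)
for tool in gcc rustc cargo cmake ninja; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```

输出中出现 `MISSING` 的行,对应上文的安装步骤需要先补齐。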
+ +- 安装rust + +```bash +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh + +source $HOME/.cargo/env +``` + +- 安装ninja + +```bash +# debian +sudo apt-get install -y cmake ninja-build + +# centos +sudo yum install -y cmake ninja +``` + +### 3. 编译 + +默认编译生成dyno和dynolog二进制文件, -t参数可以支持将二进制文件打包成deb包或rpm包. + +```bash +# 编译dyno和dynolog二进制文件 +bash scripts/build.sh + +# 编译deb包, 当前支持amd64和aarch64平台, 默认为amd64, 编译aarch64平台需要修改third_party/dynolog/scripts/debian/control文件中的Architecture改为aarch64 +bash scripts/build.sh -t deb + +# 编译rpm包, 当前只支持amd64平台 +bash scripts/build.sh -t rpm +``` + +## 使用方式 + +### Profiler trace dump功能 +Profiler trace dump功能基于dynolog开发,实现类似于动态profiling的动态触发Ascend Torch Profiler采集profiling的功能。用户基于dyno CLI命令行可以动态触发指定节点的训练进程trace dump。 + +- 查看nputrace支持的命令和帮助 + +```bash +dyno nputrace --help +``` + +- nputrace使用方式 + +```bash +dyno nputrace [SUBCOMMANDS] --log-file +``` + +nputrace子命令支持的参数选项 + +| 子命令 | 参数类型 | 说明 | +|-------|-------|-------| +| job-id | u64 | 采集任务的job id,默认值0,dynolog原生参数 | +| pids | String | 采集任务的pid列表,多个pid用逗号分隔,默认值0,dynolog原生参数 | +| process-limit | u64 | 最大采集进程的数量,默认值3,dynolog原生参数 | +| profile-start-time | u64 | 用于同步采集的Unix时间戳,单位毫秒,默认值0,dynolog原生参数 | +| duration-ms | u64 | 采集的周期,单位毫秒,默认值500,dynolog原生参数 | +| iterations | i64 | 采集总迭代数,默认值-1,dynolog原生参数 | +| log-file | String | 采集落盘的路径,必选值 | +| start-step | u64 | 开始采集的迭代数,默认值0 | +| record-shapes | action | 是否采集算子的InputShapes和InputTypes,设置参数采集,默认不采集 | +| profile-memory | action | 是否采集算子内存信息,设置参数采集,默认不采集 | +| with-stack | action | 是否采集Python调用栈,设置参数采集,默认不采集 | +| with-flops | action | 是否采集算子flops,设置参数采集,默认不采集 | +| with-modules | action | 是否采集modules层级的Python调用栈,设置参数采集,默认不采集 | +| analyse | action | 采集后是否自动解析,设置参数解析,默认不解析 | +| l2-cache | action | 是否采集L2 Cache数据,设置参数采集,默认不采集 | +| op-attr | action | 是否采集算子属性信息,设置参数采集,默认不采集 | +| msprof-tx | action | 是否使能MSTX,设置参数采集,默认使能 | +| data-simplification | String | 解析完成后是否数据精简,可选值范围[`true`, `false`],默认值`true` | +| activities | String | 
控制CPU、NPU事件采集范围,可选值范围[`CPU,NPU`, `NPU,CPU`, `CPU`, `NPU`],默认值`CPU,NPU` | +| profiler-level | String | 控制profiler的采集等级,可选值范围[`Level_none`, `Level0`, `Level1`, `Level2`],默认值`Level0`| +| aic-metrics | String | AI Core的性能指标采集项,可选值范围[`AiCoreNone`, `PipeUtilization`, `ArithmeticUtilization`, `Memory`, `MemoryL0`, `ResourceConflictRatio`, `MemoryUB`, `L2Cache`, `MemoryAccess`],默认值`AiCoreNone`| +| export-type | String | profiler解析导出数据的类型,可选值范围[`Text`, `Db`],默认值`Text`| +| gc-detect-threshold | Option | GC检测阈值,单位ms,只采集超过阈值的GC事件。该参数为可选参数,默认不设置时不开启GC检测 | + + +- nputrace使用方法 + +Step1: 拉起dynolog daemon进程 +```bash +# 方法1:使用systemd拉起service +# 修改配置文件/etc/dynolog.gflags, 使能ipc_monitor +echo "--enable_ipc_monitor" | sudo tee -a /etc/dynolog.gflags +sudo systemctl start dynolog + +# 方法2:命令行执行 +dynolog --enable-ipc-monitor + +#dynolog daemon的日志路径为:/var/log/dynolog.log +``` + +Step 2:使能dynolog trace dump环境变量 +```bash +export KINETO_USE_DAEMON=1 +``` + +Step 3: 拉起训练任务 +```bash +# 训练任务中需要使用pytorch的优化器/继承原生优化器 +bash train.sh +``` + +Step 4:使用dyno CLI动态触发trace dump +```bash +# 示例1:从第10个step开始采集,采集2个step,采集框架、CANN和device数据,同时采集完后自动解析以及解析完成不做数据精简,落盘路径为/tmp/profile_data +dyno nputrace --start-step 10 --iterations 2 --activities CPU,NPU --analyse --data-simplification false --log-file /tmp/profile_data + +# 示例2:从第10个step开始采集,采集2个step,只采集CANN和device数据,同时采集完后自动解析以及解析完成后开启数据精简,落盘路径为/tmp/profile_data +dyno nputrace --start-step 10 --iterations 2 --activities NPU --analyse --data-simplification true --log-file /tmp/profile_data + +# 示例3:从第10个step开始采集,采集2个step,只采集CANN和device数据,只采集不解析,落盘路径为/tmp/profile_data +dyno nputrace --start-step 10 --iterations 2 --activities NPU --log-file /tmp/profile_data +``` + +### NPU Monitor功能 +NPU Monitor基于MSPTI/MSTX能力开发,实现了轻量级在线监控能力,能够用于性能问题的初步定位。 + +```bash +dyno npu-monitor --help +``` + +- npu-monitor使用方式 + +```bash +dyno npu-monitor [SUBCOMMANDS] +``` + +npu-monitor子命令支持的参数选项 +| 子命令 | 参数类型 | 说明 | +|-------|-------|-------| +| npu-monitor-start | action | 
开启性能监控,设置参数开启,默认不采集 | +| npu-monitor-stop | action | 停止性能监控,设置参数开启,默认不采集 | +| report-interval-s | int | 性能监控数据上报周期,单位s,需要在启动时设置。默认值60 | +| mspti-activity-kind | String | 性能监控数据上报数据类型,可以设置单个或多个,多个类型以逗号分隔,需要在启动时设置。可选值范围[`Marker`, `Kernel`, `API`, `Hccl`, `Memory`, `MemSet`, `MemCpy`] , 默认值`Marker`| + +- npu-monitor使用方法 + +Step1: 拉起dynolog daemon进程 +```bash +# 方法1:使用systemd拉起service +# 修改配置文件/etc/dynolog.gflags, 使能ipc_monitor +echo "--enable_ipc_monitor" | sudo tee -a /etc/dynolog.gflags +sudo systemctl start dynolog + +# 方法2:命令行执行 +dynolog --enable-ipc-monitor + +#dynolog daemon的日志路径为:/var/log/dynolog.log +``` + +Step 2:使能dynolog trace dump环境变量 +```bash +export KINETO_USE_DAEMON=1 +``` + +Step 3: 拉起训练任务 +```bash +# 训练任务中需要使用pytorch的优化器/继承原生优化器 +bash train.sh +``` + +Step 4:使用dyno CLI使能npu-monitor +```bash +# 示例1:开启性能监控,使用默认配置 +dyno npu-monitor --npu-monitor-start + +# 示例2:暂停性能监控 +dyno npu-monitor --npu-monitor-stop + +# 示例3:性能监控过程中修改配置 +# 上报周期30s, 上报数据类型Marker和Kernel +dyno npu-monitor --report-interval-s 30 --mspti-activity-kind Marker,Kernel + +# 示例4:性能监控开启时修改配置 +# 上报周期30s, 上报数据类型Marker和Kernel +dyno npu-monitor --npu-monitor-start --report-interval-s 30 --mspti-activity-kind Marker,Kernel +``` \ No newline at end of file diff --git a/dynolog_npu/dynolog_npu/cli/src/commands/mod.rs b/dynolog_npu/dynolog_npu/cli/src/commands/mod.rs new file mode 100644 index 0000000000000000000000000000000000000000..18950d3c1a01d972db58a614a46f08176b02c725 --- /dev/null +++ b/dynolog_npu/dynolog_npu/cli/src/commands/mod.rs @@ -0,0 +1,18 @@ +// Copyright (c) Meta Platforms, Inc. and affiliates. +// +// This source code is licensed under the MIT license found in the +// LICENSE file in the root directory of this source tree. + +// Export all command submodules to be used in main.rs +// Note: This "intermediate" commands module is purely for organizational purposes. +// This allows for a clear distinction between the command dispatching code and the command +// handling code. 
Additionally, explicitly "exporting" all the command modules here allows +// us to avoid having to explicitly list all the command modules in main.rs. + +pub mod dcgm; +pub mod gputrace; +pub mod nputrace; +pub mod npumonitor; +pub mod status; +pub mod version; +// ... add new command modules here \ No newline at end of file diff --git a/dynolog_npu/dynolog_npu/cli/src/commands/npumonitor.rs b/dynolog_npu/dynolog_npu/cli/src/commands/npumonitor.rs new file mode 100644 index 0000000000000000000000000000000000000000..1edfaea5939f5cee5df8618720d1bfa16d0071b5 --- /dev/null +++ b/dynolog_npu/dynolog_npu/cli/src/commands/npumonitor.rs @@ -0,0 +1,59 @@ +use std::net::TcpStream; + +use anyhow::Result; + +#[path = "utils.rs"] +mod utils; + +#[derive(Debug)] +pub struct NpuMonitorConfig { + pub npu_monitor_start: bool, + pub npu_monitor_stop: bool, + pub report_interval_s: u32, + pub mspti_activity_kind: String, +} + +impl NpuMonitorConfig { + fn config(&self) -> String { + format!( + r#" +NPU_MONITOR_START={} +NPU_MONITOR_STOP={} +REPORT_INTERVAL_S={} +MSPTI_ACTIVITY_KIND={}"#, + self.npu_monitor_start, + self.npu_monitor_stop, + self.report_interval_s, + self.mspti_activity_kind + ) + } +} + +pub fn run_npumonitor( + client: TcpStream, + config: NpuMonitorConfig, +) -> Result<()> { + let config_str = config.config(); + println!("Npu monitor config = \n{}", config_str); + let config_str = config_str.replace('\n', "\\n"); + + let request_json = format!( + r#" +{{ + "fn": "setKinetOnDemandRequest", + "config": "{}", + "job_id": 0, + "pids": [0], + "process_limit": 3 +}}"#, + config_str + ); + + utils::send_msg(&client, &request_json).expect("Error sending message to service"); + + let resp_str = utils::get_resp(&client).expect("Unable to decode output bytes"); + + println!("response = {}", resp_str); + + Ok(()) +} diff --git a/dynolog_npu/dynolog_npu/cli/src/commands/nputrace.rs b/dynolog_npu/dynolog_npu/cli/src/commands/nputrace.rs new file mode 100644 index 
0000000000000000000000000000000000000000..f70923bca4cc5ce29a8855a464c411b63a930ef0 --- /dev/null +++ b/dynolog_npu/dynolog_npu/cli/src/commands/nputrace.rs @@ -0,0 +1,247 @@ +use std::net::TcpStream; + +use anyhow::Result; +use serde_json::Value; + +#[path = "utils.rs"] +mod utils; + +#[derive(Debug)] +pub enum NpuTraceTriggerConfig { + DurationBased { + profile_start_time: u64, + duration_ms: u64, + }, + IterationBased { + start_step: u64, + iterations: i64, + }, +} + +impl NpuTraceTriggerConfig { + fn config(&self) -> String { + match *self { + NpuTraceTriggerConfig::DurationBased { + profile_start_time, + duration_ms, + } => format!( + "PROFILE_START_TIME={}\nACTIVITIES_DURATION_MSECS={}", + profile_start_time, duration_ms + ), + NpuTraceTriggerConfig::IterationBased { + start_step, + iterations, + } => format!( + r#"PROFILE_START_ITERATION=0 +PROFILE_START_STEP={} +ACTIVITIES_ITERATIONS={}"#, + start_step, iterations + ), + } + } +} + +// torch npu profiler config +#[derive(Debug)] +pub struct NpuTraceOptions { + pub record_shapes: bool, + pub profile_memory: bool, + pub with_stack: bool, + pub with_flops: bool, + pub with_modules: bool, + pub activities: String, + pub analyse: bool, + pub profiler_level: String, + pub aic_metrics: String, + pub l2_cache: bool, + pub op_attr: bool, + pub msprof_tx: bool, + pub gc_detect_threshold: Option, + pub data_simplification: String, + pub export_type: String, +} + +impl NpuTraceOptions { + fn config(&self) -> String { + format!( + r#" +PROFILE_RECORD_SHAPES={} +PROFILE_PROFILE_MEMORY={} +PROFILE_WITH_STACK={} +PROFILE_WITH_FLOPS={} +PROFILE_WITH_MODULES={} +PROFILE_ACTIVITIES={} +PROFILE_ANALYSE={} +PROFILE_PROFILER_LEVEL={} +PROFILE_AIC_METRICS={} +PROFILE_L2_CACHE={} +PROFILE_OP_ATTR={} +PROFILE_MSPROF_TX={} +PROFILE_GC_DETECT_THRESHOLD={} +PROFILE_DATA_SIMPLIFICATION={} +PROFILE_EXPORT_TYPE={}"#, + self.record_shapes, + self.profile_memory, + self.with_stack, + self.with_flops, + self.with_modules, + self.activities, 
+ self.analyse, + self.profiler_level, + self.aic_metrics, + self.l2_cache, + self.op_attr, + self.msprof_tx, + self.gc_detect_threshold.map_or("None".to_string(), |v| v.to_string()), + self.data_simplification, + self.export_type + ) + } +} + +#[derive(Debug)] +pub struct NpuTraceConfig { + pub log_file: String, + pub trigger_config: NpuTraceTriggerConfig, + pub trace_options: NpuTraceOptions, +} + +impl NpuTraceConfig { + fn config(&self) -> String { + format!( + "ACTIVITIES_LOG_FILE={}\n{}{}", + self.log_file, + self.trigger_config.config(), + self.trace_options.config() + ) + } +} + +pub fn run_nputrace( + client: TcpStream, + job_id: u64, + pids: &str, + process_limit: u32, + config: NpuTraceConfig, +) -> Result<()> { + let config_str = config.config(); + println!("NpuTrace config = \n{}", config_str); + let config_str = config_str.replace('\n', "\\n"); + + let request_json = format!( + r#" +{{ + "fn": "setKinetOnDemandRequest", + "config": "{}", + "job_id": {}, + "pids": [{}], + "process_limit": {} +}}"#, + config_str, job_id, pids, process_limit + ); + + utils::send_msg(&client, &request_json).expect("Error sending message to service"); + + let resp_str = utils::get_resp(&client).expect("Unable to decode output bytes"); + + println!("response = {}", resp_str); + + let resp_v: Value = serde_json::from_str(&resp_str)?; + let processes = resp_v["processesMatched"].as_array().unwrap(); + + if processes.is_empty() { + println!("No processes were matched, please check --job-id or --pids flags"); + } else { + println!("Matched {} processes", processes.len()); + println!("Trace output files will be written to:"); + + for pid in processes { + let pid = pid.as_i64().unwrap(); + println!( + " {}", + config.log_file.replace(".json", &format!("_{}.json", pid)) + ); + } + } + + Ok(()) +} + + +#[cfg(test)] +mod test { + use crate::*; + + #[test] + fn test_nputrace_trigger_config() { + let trigger_config = NpuTraceTriggerConfig::DurationBased { + profile_start_time: 1000, + 
duration_ms: 1000,
+        };
+        assert_eq!(
+            trigger_config.config(),
+            r#"PROFILE_START_TIME=1000
+ACTIVITIES_DURATION_MSECS=1000"#
+        );
+
+        let trigger_config = NpuTraceTriggerConfig::IterationBased {
+            start_step: 1000,
+            iterations: 1000,
+        };
+        assert_eq!(
+            trigger_config.config(),
+            r#"PROFILE_START_ITERATION=0
+PROFILE_START_STEP=1000
+ACTIVITIES_ITERATIONS=1000"#
+        );
+    }
+
+    #[test]
+    fn test_nputrace_config() {
+        let config = NpuTraceConfig {
+            log_file: "test.json".to_string(),
+            trigger_config: NpuTraceTriggerConfig::DurationBased {
+                profile_start_time: 1000,
+                duration_ms: 1000,
+            },
+            trace_options: NpuTraceOptions {
+                record_shapes: true,
+                profile_memory: false,
+                with_stack: true,
+                with_flops: true,
+                with_modules: true,
+                activities: "CPU,NPU".to_string(),
+                analyse: false,
+                profiler_level: "Level0".to_string(),
+                aic_metrics: "AiCoreNone".to_string(),
+                l2_cache: true,
+                op_attr: true,
+                msprof_tx: true,
+                gc_detect_threshold: Some(0.1),
+                data_simplification: "true".to_string(),
+                export_type: "Text".to_string(),
+            },
+        };
+        assert_eq!(
+            config.config(),
+            r#"ACTIVITIES_LOG_FILE=test.json
+PROFILE_START_TIME=1000
+ACTIVITIES_DURATION_MSECS=1000
+PROFILE_RECORD_SHAPES=true
+PROFILE_PROFILE_MEMORY=false
+PROFILE_WITH_STACK=true
+PROFILE_WITH_FLOPS=true
+PROFILE_WITH_MODULES=true
+PROFILE_ACTIVITIES=CPU,NPU
+PROFILE_ANALYSE=false
+PROFILE_PROFILER_LEVEL=Level0
+PROFILE_AIC_METRICS=AiCoreNone
+PROFILE_L2_CACHE=true
+PROFILE_OP_ATTR=true
+PROFILE_MSPROF_TX=true
+PROFILE_GC_DETECT_THRESHOLD=0.1
+PROFILE_DATA_SIMPLIFICATION=true
+PROFILE_EXPORT_TYPE=Text"#
+        );
+    }
+}
diff --git a/dynolog_npu/dynolog_npu/cli/src/main.rs b/dynolog_npu/dynolog_npu/cli/src/main.rs
new file mode 100644
index 0000000000000000000000000000000000000000..9fdea3d1254467081356b2e0daeb8ed3ca05a16d
--- /dev/null
+++ b/dynolog_npu/dynolog_npu/cli/src/main.rs
@@ -0,0 +1,355 @@
+// Copyright (c) Meta Platforms, Inc. and affiliates.
+//
+// This source code is licensed under the MIT license found in the
+// LICENSE file in the root directory of this source tree.
+
+use std::net::TcpStream;
+use std::net::ToSocketAddrs;
+
+use anyhow::Result;
+use clap::Parser;
+use std::collections::HashSet;
+
+// Make all the command modules accessible to this file.
+mod commands;
+use commands::gputrace::GpuTraceConfig;
+use commands::gputrace::GpuTraceOptions;
+use commands::gputrace::GpuTraceTriggerConfig;
+use commands::nputrace::NpuTraceConfig;
+use commands::nputrace::NpuTraceOptions;
+use commands::nputrace::NpuTraceTriggerConfig;
+use commands::npumonitor::NpuMonitorConfig;
+use commands::*;
+
+/// Instructions on adding a new Dyno CLI command:
+///
+/// 1. Add a new variant to the `Command` enum.
+///    Please include a description of the command and, if applicable, its flags/subcommands.
+///
+/// 2. Create a new file for the command's implementation in the commands/ directory (i.e.
+///    commands/status.rs). This new file is where the command should be implemented.
+///    Make the new command's module accessible from this file by adding
+///    a new line with `pub mod <module_name>;` to commands/mod.rs.
+///
+///
+/// 3. Add a branch to the match statement in main() to handle the new enum variant (from step 1).
+///    From here, invoke the handling logic defined in the new file (from step 2). In an effort to keep
+///    the command dispatching logic clear and concise, please keep the code in the match branch to a minimum.
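The three steps above can be sketched with a minimal, hypothetical `Ping` command (the `clap` derive attributes, the `TcpStream` client argument, and the real command modules are omitted; all names here are illustrative, not part of the actual CLI):

```rust
// Step 1: a new variant on the Command enum (clap attributes omitted).
#[derive(Debug)]
enum Command {
    Ping,
}

// Step 2: the handler, which would live in its own file under commands/
// and be exported via `pub mod ping;` in commands/mod.rs.
mod ping {
    pub fn run_ping() -> Result<String, String> {
        Ok("pong".to_string())
    }
}

// Step 3: a branch in main()'s match statement; the branch body stays
// minimal and just delegates to the handler.
fn dispatch(cmd: Command) -> Result<String, String> {
    match cmd {
        Command::Ping => ping::run_ping(),
    }
}

fn main() {
    assert_eq!(dispatch(Command::Ping), Ok("pong".to_string()));
}
```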
+
+const DYNO_PORT: u16 = 1778;
+
+#[derive(Debug, Parser)]
+struct Opts {
+    #[clap(long, default_value = "localhost")]
+    hostname: String,
+    #[clap(long, default_value_t = DYNO_PORT)]
+    port: u16,
+    #[clap(subcommand)]
+    cmd: Command,
+}
+
+const ALLOWED_VALUES: &[&str] = &["Marker", "Kernel", "API", "Hccl", "Memory", "MemSet", "MemCpy"];
+
+fn parse_mspti_activity_kinds(src: &str) -> Result<String, String> {
+    let allowed_values: HashSet<&str> = ALLOWED_VALUES.iter().cloned().collect();
+
+    let kinds: Vec<&str> = src.split(',').map(|s| s.trim()).collect();
+
+    for kind in &kinds {
+        if !allowed_values.contains(kind) {
+            return Err(format!("Invalid MSPTI activity kind: {}, Possible values: {:?}.", kind, allowed_values));
+        }
+    }
+
+    Ok(src.to_string())
+}
+
+#[derive(Debug, Parser)]
+enum Command {
+    /// Check the status of a dynolog process
+    Status,
+    /// Check the version of a dynolog process
+    Version,
+    /// Capture gputrace
+    Gputrace {
+        /// Job id of the application to trace.
+        #[clap(long, default_value_t = 0)]
+        job_id: u64,
+        /// List of pids to capture trace for (comma separated).
+        #[clap(long, default_value = "0")]
+        pids: String,
+        /// Duration of trace to collect in ms.
+        #[clap(long, default_value_t = 500)]
+        duration_ms: u64,
+        /// Training iterations to collect, this takes precedence over duration.
+        #[clap(long, default_value_t = -1)]
+        iterations: i64,
+        /// Log file for trace.
+        #[clap(long)]
+        log_file: String,
+        /// Unix timestamp used for synchronized collection (milliseconds since epoch).
+        #[clap(long, default_value_t = 0)]
+        profile_start_time: u64,
+        /// Start iteration roundup, starts an iteration based trace at a multiple
+        /// of this value.
+        #[clap(long, default_value_t = 1)]
+        profile_start_iteration_roundup: u64,
+        /// Max number of processes to profile.
+        #[clap(long, default_value_t = 3)]
+        process_limit: u32,
+        /// Record PyTorch operator input shapes and types.
+        #[clap(long, action)]
+        record_shapes: bool,
+        /// Profile PyTorch memory.
+ #[clap(long, action)] + profile_memory: bool, + /// Capture Python stacks in traces. + #[clap(long, action)] + with_stacks: bool, + /// Annotate operators with analytical flops. + #[clap(long, action)] + with_flops: bool, + /// Capture PyTorch operator modules in traces. + #[clap(long, action)] + with_modules: bool, + }, + /// Capture nputrace. Subcommand functions aligned with Ascend Torch Profiler. + Nputrace { + /// Job id of the application to trace. + #[clap(long, default_value_t = 0)] + job_id: u64, + /// List of pids to capture trace for (comma separated). + #[clap(long, default_value = "0")] + pids: String, + /// Duration of trace to collect in ms. + #[clap(long, default_value_t = 500)] + duration_ms: u64, + /// Training iterations to collect, this takes precedence over duration. + #[clap(long, default_value_t = -1)] + iterations: i64, + /// Log file for trace. + #[clap(long)] + log_file: String, + /// Unix timestamp used for synchronized collection (milliseconds since epoch). + #[clap(long, default_value_t = 0)] + profile_start_time: u64, + /// Number of steps to start profile. + #[clap(long, default_value_t = 0)] + start_step: u64, + /// Max number of processes to profile. + #[clap(long, default_value_t = 3)] + process_limit: u32, + /// Whether to record PyTorch operator input shapes and types. + #[clap(long, action)] + record_shapes: bool, + /// Whether to profile PyTorch memory. + #[clap(long, action)] + profile_memory: bool, + /// Whether to profile the Python call stack in trace. + #[clap(long, action)] + with_stack: bool, + /// Annotate operators with analytical flops. + #[clap(long, action)] + with_flops: bool, + /// Whether to profile PyTorch operator modules in traces. + #[clap(long, action)] + with_modules: bool, + /// The scope of the profile's events. + #[clap(long, value_parser = ["CPU,NPU", "NPU,CPU", "CPU", "NPU"], default_value = "CPU,NPU")] + activities: String, + /// Profiler level. 
+        #[clap(long, value_parser = ["Level0", "Level1", "Level2", "Level_none"], default_value = "Level0")]
+        profiler_level: String,
+        /// AIC metrics.
+        #[clap(long, value_parser = ["AiCoreNone", "PipeUtilization", "ArithmeticUtilization", "Memory", "MemoryL0", "ResourceConflictRatio", "MemoryUB", "L2Cache", "MemoryAccess"], default_value = "AiCoreNone")]
+        aic_metrics: String,
+        /// Whether to analyse the data after collection.
+        #[clap(long, action)]
+        analyse: bool,
+        /// Whether to collect L2 cache.
+        #[clap(long, action)]
+        l2_cache: bool,
+        /// Whether to collect op attributes.
+        #[clap(long, action)]
+        op_attr: bool,
+        /// Whether to enable MSTX.
+        #[clap(long, action)]
+        msprof_tx: bool,
+        /// GC detect threshold.
+        #[clap(long)]
+        gc_detect_threshold: Option<f64>,
+        /// Whether to streamline data after analyse is complete.
+        #[clap(long, value_parser = ["true", "false"], default_value = "true")]
+        data_simplification: String,
+        /// Types of data exported by the profiler.
+        #[clap(long, value_parser = ["Text", "Db"], default_value = "Text")]
+        export_type: String,
+    },
+    /// Ascend MSPTI Monitor
+    NpuMonitor {
+        /// Start NPU monitor.
+        #[clap(long, action)]
+        npu_monitor_start: bool,
+        /// Stop NPU monitor.
+        #[clap(long, action)]
+        npu_monitor_stop: bool,
+        /// NPU monitor report interval in seconds.
+        #[clap(long, default_value_t = 60)]
+        report_interval_s: u32,
+        /// MSPTI collect activity kind.
+        #[clap(long, value_parser = parse_mspti_activity_kinds, default_value = "Marker")]
+        mspti_activity_kind: String,
+    },
+    /// Pause dcgm profiling. This enables running tools like Nsight compute and avoids conflicts.
+    DcgmPause {
+        /// Duration to pause dcgm profiling in seconds.
+        #[clap(long, default_value_t = 300)]
+        duration_s: i32,
+    },
+    /// Resume dcgm profiling
+    DcgmResume,
+}
+
+/// Create a socket connection to dynolog
+fn create_dyno_client(host: &str, port: u16) -> Result<TcpStream> {
+    let addr = (host, port)
+        .to_socket_addrs()?
+ .next() + .expect("Failed to connect to the server"); + + TcpStream::connect(addr).map_err(|err| err.into()) +} + +fn main() -> Result<()> { + let Opts { + hostname, + port, + cmd, + } = Opts::parse(); + + let dyno_client = + create_dyno_client(&hostname, port).expect("Couldn't connect to the server..."); + + match cmd { + Command::Status => status::run_status(dyno_client), + Command::Version => version::run_version(dyno_client), + Command::Gputrace { + job_id, + pids, + log_file, + duration_ms, + iterations, + profile_start_time, + profile_start_iteration_roundup, + process_limit, + record_shapes, + profile_memory, + with_stacks, + with_flops, + with_modules, + } => { + let trigger_config = if iterations > 0 { + GpuTraceTriggerConfig::IterationBased { + profile_start_iteration_roundup, + iterations, + } + } else { + GpuTraceTriggerConfig::DurationBased { + profile_start_time, + duration_ms, + } + }; + let trace_options = GpuTraceOptions { + record_shapes, + profile_memory, + with_stacks, + with_flops, + with_modules, + }; + let trace_config = GpuTraceConfig { + log_file, + trigger_config, + trace_options, + }; + gputrace::run_gputrace(dyno_client, job_id, &pids, process_limit, trace_config) + } + Command::Nputrace { + job_id, + pids, + log_file, + duration_ms, + iterations, + profile_start_time, + start_step, + process_limit, + record_shapes, + profile_memory, + with_stack, + with_flops, + with_modules, + activities, + analyse, + profiler_level, + aic_metrics, + l2_cache, + op_attr, + msprof_tx, + gc_detect_threshold, + data_simplification, + export_type, + } => { + let trigger_config = if iterations > 0 { + NpuTraceTriggerConfig::IterationBased { + start_step, + iterations, + } + } else { + NpuTraceTriggerConfig::DurationBased { + profile_start_time, + duration_ms, + } + }; + + let trace_options = NpuTraceOptions { + record_shapes, + profile_memory, + with_stack, + with_flops, + with_modules, + activities, + analyse, + profiler_level, + aic_metrics, + l2_cache, 
+ op_attr, + msprof_tx, + gc_detect_threshold, + data_simplification, + export_type, + }; + let trace_config = NpuTraceConfig { + log_file, + trigger_config, + trace_options, + }; + nputrace::run_nputrace(dyno_client, job_id, &pids, process_limit, trace_config) + } + Command::NpuMonitor { + npu_monitor_start, + npu_monitor_stop, + report_interval_s, + mspti_activity_kind, + } => { + let npu_mon_config = NpuMonitorConfig { + npu_monitor_start, + npu_monitor_stop, + report_interval_s, + mspti_activity_kind + }; + npumonitor::run_npumonitor(dyno_client, npu_mon_config) + } + Command::DcgmPause { duration_s } => dcgm::run_dcgm_pause(dyno_client, duration_s), + Command::DcgmResume => dcgm::run_dcgm_resume(dyno_client), + // ... add new commands here + } +} \ No newline at end of file diff --git a/dynolog_npu/dynolog_npu/dynolog/src/Main.cpp b/dynolog_npu/dynolog_npu/dynolog/src/Main.cpp new file mode 100644 index 0000000000000000000000000000000000000000..8e5177768327e37173d4e7661e334a9400bd6172 --- /dev/null +++ b/dynolog_npu/dynolog_npu/dynolog/src/Main.cpp @@ -0,0 +1,206 @@ +// Copyright (c) Meta Platforms, Inc. and affiliates. +// +// This source code is licensed under the MIT license found in the +// LICENSE file in the root directory of this source tree. + +// Dynolog : A portable telemetry monitoring daemon. 
+ +#include +#include +#include +#include +#include +#include "dynolog/src/CompositeLogger.h" +#include "dynolog/src/FBRelayLogger.h" +#include "dynolog/src/KernelCollector.h" +#include "dynolog/src/Logger.h" +#include "dynolog/src/ODSJsonLogger.h" +#include "dynolog/src/PerfMonitor.h" +#include "dynolog/src/ScubaLogger.h" +#include "dynolog/src/ServiceHandler.h" +#include "dynolog/src/gpumon/DcgmGroupInfo.h" +#include "dynolog/src/rpc/SimpleJsonServer.h" +#include "dynolog/src/rpc/SimpleJsonServerInl.h" +#include "dynolog/src/tracing/IPCMonitor.h" +#include "hbt/src/perf_event/BuiltinMetrics.h" + +#ifdef USE_PROMETHEUS +#include "dynolog/src/PrometheusLogger.h" +#endif + +using namespace dynolog; +using json = nlohmann::json; +namespace hbt = facebook::hbt; + +DEFINE_int32(port, 1778, "Port for listening RPC requests."); +DEFINE_bool(use_JSON, false, "Emit metrics to JSON file through JSON logger"); +#ifdef USE_PROMETHEUS +DEFINE_bool(use_prometheus, false, "Emit metrics to Prometheus"); +#endif +DEFINE_bool(use_fbrelay, false, "Emit metrics to FB Relay on Lab machines"); +DEFINE_bool(use_ODS, false, "Emit metrics to ODS through ODS logger"); +DEFINE_bool(use_scuba, false, "Emit metrics to Scuba through Scuba logger"); +DEFINE_int32( + kernel_monitor_reporting_interval_s, + 60, + "Duration in seconds to read and report metrics for kernel monitor"); +DEFINE_int32( + perf_monitor_reporting_interval_s, + 60, + "Duration in seconds to read and report metrics for performance monitor"); +DEFINE_int32( + dcgm_reporting_interval_s, + 10, + "Duration in seconds to read and report metrics for DCGM"); +DEFINE_bool( + enable_ipc_monitor, + false, + "Enabled IPC monitor for on system tracing requests."); +DEFINE_bool( + enable_gpu_monitor, + false, + "Enabled GPU monitorng, currently supports NVIDIA GPUs."); +DEFINE_bool(enable_perf_monitor, false, "Enable heartbeat perf monitoring."); + +std::unique_ptr getLogger(const std::string& scribe_category = "") { + std::vector> 
loggers; +#ifdef USE_PROMETHEUS + if (FLAGS_use_prometheus) { + loggers.push_back(std::make_unique()); + } +#endif + if (FLAGS_use_fbrelay) { + loggers.push_back(std::make_unique()); + } + if (FLAGS_use_ODS) { + loggers.push_back(std::make_unique()); + } + if (FLAGS_use_JSON) { + loggers.push_back(std::make_unique()); + } + if (FLAGS_use_scuba && !scribe_category.empty()) { + loggers.push_back(std::make_unique(scribe_category)); + } + return std::make_unique(std::move(loggers)); +} + +auto next_wakeup(int sec) { + return std::chrono::steady_clock::now() + std::chrono::seconds(sec); +} + +void kernel_monitor_loop() { + KernelCollector kc; + + LOG(INFO) << "Running kernel monitor loop : interval = " + << FLAGS_kernel_monitor_reporting_interval_s << " s."; + + while (1) { + auto logger = getLogger(); + auto wakeup_timepoint = + next_wakeup(FLAGS_kernel_monitor_reporting_interval_s); + + kc.step(); + kc.log(*logger); + logger->finalize(); + + /* sleep override */ + std::this_thread::sleep_until(wakeup_timepoint); + } +} + +void perf_monitor_loop() { + PerfMonitor pm( + hbt::CpuSet::makeAllOnline(), + std::vector{"instructions", "cycles"}, + getDefaultPmuDeviceManager(), + getDefaultMetrics()); + + LOG(INFO) << "Running perf monitor loop : interval = " + << FLAGS_perf_monitor_reporting_interval_s << " s."; + + while (1) { + auto logger = getLogger(); + auto wakeup_timepoint = + next_wakeup(FLAGS_perf_monitor_reporting_interval_s); + + pm.step(); + pm.log(*logger); + + logger->finalize(); + /* sleep override */ + std::this_thread::sleep_until(wakeup_timepoint); + } +} + +auto setup_server(std::shared_ptr handler) { + return std::make_unique>( + handler, FLAGS_port); +} + +void gpu_monitor_loop(std::shared_ptr dcgm) { + auto logger = getLogger(FLAGS_scribe_category); + + LOG(INFO) << "Running DCGM loop : interval = " + << FLAGS_dcgm_reporting_interval_s << " s."; + LOG(INFO) << "DCGM fields: " << gpumon::FLAGS_dcgm_fields; + + while (1) { + auto wakeup_timepoint = 
next_wakeup(FLAGS_dcgm_reporting_interval_s); + + dcgm->update(); + dcgm->log(*logger); + + /* sleep override */ + std::this_thread::sleep_until(wakeup_timepoint); + } +} + +int main(int argc, char** argv) { + gflags::ParseCommandLineFlags(&argc, &argv, true); + FLAGS_logtostderr = 1; + google::InitGoogleLogging(argv[0]); + + LOG(INFO) << "Starting Ascend Extension for dynolog, version = " DYNOLOG_VERSION + << ", build git-hash = " DYNOLOG_GIT_REV; + + std::shared_ptr dcgm; + + std::unique_ptr ipcmon; + std::unique_ptr ipcmon_thread, gpumon_thread, pm_thread; + + if (FLAGS_enable_ipc_monitor) { + LOG(INFO) << "Starting IPC Monitor"; + ipcmon = std::make_unique(); + ipcmon_thread = + std::make_unique([&ipcmon]() { ipcmon->loop(); }); + } + + if (FLAGS_enable_gpu_monitor) { + dcgm = gpumon::DcgmGroupInfo::factory( + gpumon::FLAGS_dcgm_fields, FLAGS_dcgm_reporting_interval_s * 1000); + gpumon_thread = std::make_unique(gpu_monitor_loop, dcgm); + } + std::thread km_thread{kernel_monitor_loop}; + if (FLAGS_enable_perf_monitor) { + pm_thread = std::make_unique(perf_monitor_loop); + } + + // setup service + auto handler = std::make_shared(dcgm); + + // use simple json RPC server for now + auto server = setup_server(handler); + server->run(); + + km_thread.join(); + if (pm_thread) { + pm_thread->join(); + } + if (gpumon_thread) { + gpumon_thread->join(); + } + + server->stop(); + + return 0; +} \ No newline at end of file diff --git a/dynolog_npu/plugin/Readme.md b/dynolog_npu/plugin/Readme.md new file mode 100644 index 0000000000000000000000000000000000000000..c59bfffad5aaac5383b407e3ff3d23ed126131f5 --- /dev/null +++ b/dynolog_npu/plugin/Readme.md @@ -0,0 +1,17 @@ + + +# Build and Install npu-dynolog-plugin +``` +# install pybind11 +pip install pybind11 + +# build dynolog_npu_plugin wheel +python3 setup.py bdist_wheel +# install +pip install dist/{dynolog-npu-plugin-xxx.wheel} + +# example +import IPCMonitor +dyno_worker = IPCMonitor.PyDynamicMonitorProxy() 
+dyno_worker.init_dyno(0)
+```
diff --git a/dynolog_npu/plugin/bindings.cpp b/dynolog_npu/plugin/bindings.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..c0cdaa4d577b3a76ec2d6f3eae4b426556a56532
--- /dev/null
+++ b/dynolog_npu/plugin/bindings.cpp
@@ -0,0 +1,11 @@
+#include <pybind11/pybind11.h>
+#include "ipc_monitor/PyDynamicMonitorProxy.h"
+
+namespace py = pybind11;
+
+PYBIND11_MODULE(IPCMonitor, m) {
+    py::class_<dynolog_npu::ipc_monitor::PyDynamicMonitorProxy>(m, "PyDynamicMonitorProxy")
+        .def(py::init<>())
+        .def("init_dyno", &dynolog_npu::ipc_monitor::PyDynamicMonitorProxy::InitDyno, py::arg("npuId"))
+        .def("poll_dyno", &dynolog_npu::ipc_monitor::PyDynamicMonitorProxy::PollDyno);
+}
\ No newline at end of file
diff --git a/dynolog_npu/plugin/build.sh b/dynolog_npu/plugin/build.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ce20d9d2be546afbc63e3aace524f74858eff6ff
--- /dev/null
+++ b/dynolog_npu/plugin/build.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+# install pybind11
+pip install pybind11
+
+# build dynolog_npu_plugin wheel
+python3 setup.py bdist_wheel
+
+# find .whl files in dist; grep -c . counts non-empty lines, so an
+# empty find result yields 0 rather than the 1 that `wc -l` would report
+files=$(find dist -type f -name "*.whl" 2>/dev/null)
+count=$(echo "$files" | grep -c .)
+if [ "$count" -eq 1 ]; then
+    echo "find .whl in dist: $files"
+else
+    echo "find no or multi .whl in dist"
+    exit 1
+fi
+
+# pip install whl
+echo "pip install ${files}"
+pip install "${files}"
\ No newline at end of file
diff --git a/dynolog_npu/plugin/ipc_monitor/DynoLogNpuMonitor.cpp b/dynolog_npu/plugin/ipc_monitor/DynoLogNpuMonitor.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..940f5aae167f088361057fe2a7a389a76f5bb2b4
--- /dev/null
+++ b/dynolog_npu/plugin/ipc_monitor/DynoLogNpuMonitor.cpp
@@ -0,0 +1,36 @@
+#include "DynoLogNpuMonitor.h"
+
+#include <iostream>
+
+#include "utils.h"
+
+namespace dynolog_npu {
+namespace ipc_monitor {
+
+bool DynoLogNpuMonitor::Init()
+{
+    if (isInitialized_) {
+        std::cout << "[WARNING] DynoLog npu monitor already initialized" << std::endl;
+        return true;
+    }
+    bool 
res = ipcClient_.RegisterInstance(npuId_);
+    if (res) {
+        isInitialized_ = true;
+        std::cout << "[INFO] DynoLog npu monitor initialized successfully" << std::endl;
+    }
+    return res;
+}
+
+std::string DynoLogNpuMonitor::Poll()
+{
+    std::string res = ipcClient_.IpcClientNpuConfig();
+    if (res.empty()) {
+        std::cout << "[INFO] Config received from dynolog server is empty" << std::endl;
+        return "";
+    }
+    std::cout << "[INFO] Received NPU configuration successfully" << std::endl;
+    return res;
+}
+
+} // namespace ipc_monitor
+} // namespace dynolog_npu
\ No newline at end of file
diff --git a/dynolog_npu/plugin/ipc_monitor/DynoLogNpuMonitor.h b/dynolog_npu/plugin/ipc_monitor/DynoLogNpuMonitor.h
new file mode 100644
index 0000000000000000000000000000000000000000..40ee21072710312a86cd75befdcefa67e24efb8f
--- /dev/null
+++ b/dynolog_npu/plugin/ipc_monitor/DynoLogNpuMonitor.h
@@ -0,0 +1,33 @@
+#ifndef DYNOLOG_NPU_MONITOR_H
+#define DYNOLOG_NPU_MONITOR_H
+
+#include "MonitorBase.h"
+#include "NpuIpcClient.h"
+#include "singleton.h"
+
+namespace dynolog_npu {
+namespace ipc_monitor {
+
+class DynoLogNpuMonitor : public MonitorBase, public Singleton<DynoLogNpuMonitor> {
+    friend class Singleton<DynoLogNpuMonitor>;
+
+public:
+    DynoLogNpuMonitor() = default;
+    bool Init() override;
+    std::string Poll() override;
+    void SetNpuId(int id) override
+    {
+        npuId_ = id;
+    }
+
+private:
+    bool isInitialized_ = false;
+    int32_t npuId_ = 0;
+    IpcClient ipcClient_;
+};
+
+} // namespace ipc_monitor
+} // namespace dynolog_npu
+
+#endif
+
diff --git a/dynolog_npu/plugin/ipc_monitor/MonitorBase.h b/dynolog_npu/plugin/ipc_monitor/MonitorBase.h
new file mode 100644
index 0000000000000000000000000000000000000000..108023c7624b747e5987be9184d6c594decd360a
--- /dev/null
+++ b/dynolog_npu/plugin/ipc_monitor/MonitorBase.h
@@ -0,0 +1,18 @@
+#ifndef MONITOR_BASE_H
+#define MONITOR_BASE_H
+#include <string>
+
+namespace dynolog_npu {
+namespace ipc_monitor {
+
+class MonitorBase {
+public:
+    virtual bool Init() = 0;
+    virtual std::string Poll() = 0;
+    
virtual void SetNpuId(int id) = 0;
+};
+
+} // namespace ipc_monitor
+} // namespace dynolog_npu
+
+#endif
\ No newline at end of file
diff --git a/dynolog_npu/plugin/ipc_monitor/NpuIpcClient.cpp b/dynolog_npu/plugin/ipc_monitor/NpuIpcClient.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..97966e8eeacc7276426feb237aa122eb8dee046f
--- /dev/null
+++ b/dynolog_npu/plugin/ipc_monitor/NpuIpcClient.cpp
@@ -0,0 +1,144 @@
+#include "NpuIpcClient.h"
+
+#include <iostream>
+#include <unistd.h>
+
+namespace dynolog_npu {
+namespace ipc_monitor {
+
+bool IpcClient::RegisterInstance(int32_t id)
+{
+    NpuContext context{
+        .npu = id,
+        .pid = getpid(),
+        .jobId = JOB_ID,
+    };
+    std::unique_ptr<Message> message = Message::ConstructMessage(context, "ctxt");
+    try {
+        if (!SyncSendMessage(*message, std::string(DYNO_IPC_NAME))) {
+            std::cout << "[WARNING] Failed to send register ctxt for pid " << context.pid << " with dyno" << std::endl;
+            return false;
+        }
+    } catch (const std::exception &e) {
+        std::cout << "[WARNING] Error when SyncSendMessage: " << e.what() << std::endl;
+        return false;
+    }
+    std::cout << "[INFO] Registered pid " << context.pid << " with dynolog successfully" << std::endl;
+    return true;
+}
+std::string IpcClient::IpcClientNpuConfig()
+{
+    auto size = pids_.size();
+    auto *req = (NpuRequest *)malloc(sizeof(NpuRequest) + sizeof(int32_t) * size);
+    if (req == nullptr) {
+        std::cout << "[ERROR] Failed to allocate memory for NpuRequest" << std::endl;
+        return "";
+    }
+    req->type = DYNO_IPC_TYPE;
+    req->pidSize = static_cast<int>(size);
+    req->jobId = JOB_ID;
+    for (size_t i = 0; i < size; i++) {
+        req->pids[i] = pids_[i];
+    }
+    std::unique_ptr<Message> message = Message::ConstructMessage(*req, "req", size);
+    if (!SyncSendMessage(*message, std::string(DYNO_IPC_NAME))) {
+        std::cout << "[WARNING] Failed to send config to dyno server !" << std::endl;
+        free(req);
+        req = nullptr;
+        return "";
+    }
+    free(req);
+    req = nullptr;
+    message = PollRecvMessage(MAX_IPC_RETRIES, MAX_SLEEP_US);
+    if (!message) {
+        std::cout << "[WARNING] Failed to receive on-demand config !" 
<< std::endl; + return ""; + } + std::string res = std::string(ReinterpretConvert<const char *>(message->buf.get()), message->metadata.size); + + return res; +} +std::unique_ptr<Message> IpcClient::ReceiveMessage() +{ + std::lock_guard<std::mutex> wguard(dequeLock_); + if (msgDynoDeque_.empty()) { + return nullptr; + } + std::unique_ptr<Message> message = std::move(msgDynoDeque_.front()); + msgDynoDeque_.pop_front(); + return message; +} +bool IpcClient::SyncSendMessage(const Message &message, const std::string &destName, int numRetry, int seepTimeUs) +{ + if (destName.empty()) { + std::cout << "[WARNING] Cannot send to an empty socket name!" << std::endl; + return false; + } + int i = 0; + std::vector<NpuPayLoad> npuPayLoad{ NpuPayLoad(sizeof(struct Metadata), (void *)&message.metadata), + NpuPayLoad(message.metadata.size, message.buf.get()) }; + try { + auto ctxt = ep_.BuildSendNpuCtxt(destName, npuPayLoad, std::vector<fileDesT>()); + while (!ep_.TrySendMessage(*ctxt) && i < numRetry) { + i++; + usleep(seepTimeUs); + seepTimeUs *= 2; // 2: double sleep time on each retry + } + } catch (const std::exception &e) { + std::cout << "[ERROR] Error when SyncSendMessage: " << e.what() << std::endl; + return false; + } + return i < numRetry; +} +bool IpcClient::Recv() +{ + try { + Metadata recvMetadata; + std::vector<NpuPayLoad> PeekNpuPayLoad{ NpuPayLoad(sizeof(struct Metadata), &recvMetadata) }; + auto peekCtxt = ep_.BuildNpuRcvCtxt(PeekNpuPayLoad); + bool successFlag = false; + try { + successFlag = ep_.TryPeekMessage(*peekCtxt); + } catch (const std::exception &e) { + std::cout << "[ERROR] Error when TryPeekMessage: " << e.what() << std::endl; + return false; + } + if (successFlag) { + std::unique_ptr<Message> npuMessage = std::make_unique<Message>(); + npuMessage->metadata = recvMetadata; + npuMessage->buf = std::make_unique<uint8_t[]>(recvMetadata.size); + npuMessage->src = std::string(ep_.GetName(*peekCtxt)); + std::vector<NpuPayLoad> npuPayLoad{ NpuPayLoad(sizeof(struct Metadata), (void *)&npuMessage->metadata), + NpuPayLoad(recvMetadata.size, npuMessage->buf.get()) }; + auto recvCtxt =
ep_.BuildNpuRcvCtxt(npuPayLoad); + try { + successFlag = ep_.TryRcvMessage(*recvCtxt); + } catch (const std::exception &e) { + std::cout << "[ERROR] Error when TryRcvMessage: " << e.what() << std::endl; + return false; + } + if (successFlag) { + std::lock_guard<std::mutex> wguard(dequeLock_); + msgDynoDeque_.push_back(std::move(npuMessage)); + return true; + } + } + } catch (const std::exception &e) { + std::cout << "[ERROR] Error in Recv(): " << e.what() << std::endl; + return false; + } + return false; +} +std::unique_ptr<Message> IpcClient::PollRecvMessage(int maxRetry, int sleeTimeUs) +{ + for (int i = 0; i < maxRetry; i++) { + if (Recv()) { + return ReceiveMessage(); + } + usleep(sleeTimeUs); + } + return nullptr; +} + +} // namespace ipc_monitor +} // namespace dynolog_npu \ No newline at end of file diff --git a/dynolog_npu/plugin/ipc_monitor/NpuIpcClient.h b/dynolog_npu/plugin/ipc_monitor/NpuIpcClient.h new file mode 100644 index 0000000000000000000000000000000000000000..ae7b00eb51b935db4e799fab470c3343e78bcb6f --- /dev/null +++ b/dynolog_npu/plugin/ipc_monitor/NpuIpcClient.h @@ -0,0 +1,103 @@ +#ifndef NPU_IPC_CLIENT_H +#define NPU_IPC_CLIENT_H +#include <cstdint> +#include <cstring> +#include <deque> +#include <memory> +#include <mutex> +#include <stdexcept> +#include <string> +#include <vector> +#include "NpuIpcEndPoint.h" +#include "utils.h" + +namespace dynolog_npu { +namespace ipc_monitor { + +constexpr int TYPE_SIZE = 32; +constexpr int JOB_ID = 0; +constexpr const char *DYNO_IPC_NAME = "dynolog"; +constexpr const int DYNO_IPC_TYPE = 3; +constexpr const int MAX_IPC_RETRIES = 5; +constexpr const int MAX_SLEEP_US = 10000; +struct NpuRequest { + int type; + int pidSize; + int64_t jobId; + int32_t pids[0]; +}; +struct NpuContext { + int32_t npu; + pid_t pid; + int64_t jobId; +}; +struct Metadata { + size_t size = 0; + char type[TYPE_SIZE] = ""; +}; +struct Message { + Metadata metadata; + std::unique_ptr<uint8_t[]> buf; + std::string src; + template <typename T> static std::unique_ptr<Message> ConstructMessage(const T &data, const std::string &type) + { + std::unique_ptr<Message> ipcNpuMessage =
std::make_unique<Message>(); + if (type.size() + 1 > sizeof(ipcNpuMessage->metadata.type)) { + throw std::runtime_error("Type string is too long to fit in metadata.type" + IPC_ERROR(ErrCode::PARAM)); + } + memcpy(ipcNpuMessage->metadata.type, type.c_str(), type.size() + 1); +#if __cplusplus >= 201703L + if constexpr (std::is_same<std::string, T>::value) { + ipcNpuMessage->metadata.size = data.size(); + ipcNpuMessage->buf = std::make_unique<uint8_t[]>(ipcNpuMessage->metadata.size); + memcpy(ipcNpuMessage->buf.get(), data.c_str(), data.size()); + return ipcNpuMessage; + } +#endif + static_assert(std::is_trivially_copyable<T>::value); + ipcNpuMessage->metadata.size = sizeof(data); + ipcNpuMessage->buf = std::make_unique<uint8_t[]>(ipcNpuMessage->metadata.size); + memcpy(ipcNpuMessage->buf.get(), &data, sizeof(data)); + return ipcNpuMessage; + } + + template <typename T, typename U> + static std::unique_ptr<Message> ConstructMessage(const T &data, const std::string &type, int n) + { + std::unique_ptr<Message> ipcNpuMessage = std::make_unique<Message>(); + if (type.size() + 1 > sizeof(ipcNpuMessage->metadata.type)) { + throw std::runtime_error("Type string is too long to fit in metadata.type" + IPC_ERROR(ErrCode::PARAM)); + } + memcpy(ipcNpuMessage->metadata.type, type.c_str(), type.size() + 1); + static_assert(std::is_trivially_copyable<T>::value); + static_assert(std::is_trivially_copyable<U>::value); + ipcNpuMessage->metadata.size = sizeof(data) + sizeof(U) * n; + ipcNpuMessage->buf = std::make_unique<uint8_t[]>(ipcNpuMessage->metadata.size); + memcpy(ipcNpuMessage->buf.get(), &data, ipcNpuMessage->metadata.size); + return ipcNpuMessage; + } +}; +class IpcClient { +public: + IpcClient(const IpcClient &) = delete; + IpcClient &operator = (const IpcClient &) = delete; + IpcClient() = default; + bool RegisterInstance(int32_t npu); + std::string IpcClientNpuConfig(); + +private: + std::vector<int32_t> pids_ = GetPids(); + NpuIpcEndPoint<0> ep_{ "dynoconfigclient" + GenerateUuidV4() }; + std::mutex dequeLock_; + std::deque<std::unique_ptr<Message>> msgDynoDeque_; + std::unique_ptr<Message>
ReceiveMessage(); + bool SyncSendMessage(const Message &message, const std::string &destName, int numRetry = 10, + int seepTimeUs = 10000); + bool Recv(); + std::unique_ptr<Message> PollRecvMessage(int maxRetry, int sleeTimeUs); +}; + +} // namespace ipc_monitor +} // namespace dynolog_npu + +#endif \ No newline at end of file diff --git a/dynolog_npu/plugin/ipc_monitor/NpuIpcEndPoint.h b/dynolog_npu/plugin/ipc_monitor/NpuIpcEndPoint.h new file mode 100644 index 0000000000000000000000000000000000000000..6560fa515646226ddbffbca49c4f818eb0d0ebcf --- /dev/null +++ b/dynolog_npu/plugin/ipc_monitor/NpuIpcEndPoint.h @@ -0,0 +1,204 @@ +#ifndef NPU_IPC_ENDPOINT_H +#define NPU_IPC_ENDPOINT_H +#include <cerrno> +#include <cstring> +#include <memory> +#include <stdexcept> +#include <string> +#include <sys/socket.h> +#include <sys/stat.h> +#include <sys/un.h> +#include <unistd.h> +#include <vector> +#include "utils.h" + +namespace dynolog_npu { +namespace ipc_monitor { + +using fileDesT = int; +constexpr const char STR_END_CHAR = '\0'; +constexpr int SOCKET_FD_CHMOD = 0666; + +struct NpuPayLoad { + size_t size; + void *data; + NpuPayLoad(size_t size, void *data) : size(size), data(data) {} +}; + +template <size_t MaxNumFileDes> struct NpuIpcEndPointCtxt { + struct sockaddr_un messageName; + size_t messageLen; + fileDesT *fileDesPtr; + struct msghdr msghdr; + std::vector<struct iovec> iov; + char ancillaryBuf[CMSG_SPACE(MaxNumFileDes * sizeof(fileDesT))]; + explicit NpuIpcEndPointCtxt(size_t num) : iov(std::vector<struct iovec>(num)) {} +}; + +template <size_t MaxNumFileDes> class NpuIpcEndPoint final { + using Ctxt = NpuIpcEndPointCtxt<MaxNumFileDes>; + +public: + constexpr static size_t addressMaxLen = 108 - 2; // Max unix socket path length + explicit NpuIpcEndPoint(const std::string &addressName) + { + socketFd = socket(AF_UNIX, SOCK_DGRAM, 0); + if (socketFd == -1) { + throw std::runtime_error(std::strerror(errno) + IPC_ERROR(ErrCode::PARAM)); + } + struct sockaddr_un address; + size_t addressLen = SetSocketAdress(addressName, address); + if (address.sun_path[0] != STR_END_CHAR) { + unlink(address.sun_path); + } + int res = bind(socketFd, ReinterpretConvert<struct sockaddr *>(&address),
addressLen); + if (res == -1) { + throw std::runtime_error("Bind socket failed." + IPC_ERROR(ErrCode::PARAM)); + } + if (address.sun_path[0] != STR_END_CHAR) { + chmod(address.sun_path, SOCKET_FD_CHMOD); + } + } + ~NpuIpcEndPoint() + { + close(socketFd); + } + [[nodiscard]] auto BuildSendNpuCtxt(const std::string &desAddrName, const std::vector<NpuPayLoad> &npuPayLoad, + const std::vector<fileDesT> &fileDes) + { + if (fileDes.size() > MaxNumFileDes) { + throw std::runtime_error("Request to fill more than max connections " + IPC_ERROR(ErrCode::PARAM)); + } + if (desAddrName.empty()) { + throw std::runtime_error("Cannot send to dest endpoint because dest socket name is empty " + + IPC_ERROR(ErrCode::PARAM)); + } + auto ctxt = BuildNpuCtxt_(npuPayLoad, fileDes.size()); + ctxt->msghdr.msg_namelen = SetSocketAdress(desAddrName, ctxt->messageName); + if (!fileDes.empty()) { + if (fileDes.size() * sizeof(fileDesT) > MaxNumFileDes * sizeof(fileDesT)) { + throw std::runtime_error("Memcpy failed when fileDes size larger than ctxt capacity " + + IPC_ERROR(ErrCode::PARAM)); + } + memcpy(ctxt->fileDesPtr, fileDes.data(), fileDes.size() * sizeof(fileDesT)); + } + return ctxt; + } + + [[nodiscard]] bool TrySendMessage(Ctxt const & ctxt, bool retryOnConnRefused = true) + { + ssize_t retCode = sendmsg(socketFd, &ctxt.msghdr, MSG_DONTWAIT); + if (retCode > 0) { + return true; + } + if ((errno == EAGAIN || errno == EWOULDBLOCK) && retCode == -1) { + return false; + } + if (retryOnConnRefused && errno == ECONNREFUSED && retCode == -1) { + return false; + } + throw std::runtime_error("TrySendMessage failed: " + std::string(std::strerror(errno)) + " " + + IPC_ERROR(ErrCode::PARAM)); + } + + [[nodiscard]] auto BuildNpuRcvCtxt(const std::vector<NpuPayLoad> &npuPayLoad) + { + return BuildNpuCtxt_(npuPayLoad, MaxNumFileDes); + } + + [[nodiscard]] bool TryRcvMessage(Ctxt &ctxt) + { + auto retCode = recvmsg(socketFd, &ctxt.msghdr, MSG_DONTWAIT); + if (retCode > 0) { + return true; + } + if (retCode == 0) { + return false; + }
+ if (errno == EWOULDBLOCK || errno == EAGAIN) { + return false; + } + throw std::runtime_error("TryRcvMessage failed: " + std::string(std::strerror(errno)) + " " + + IPC_ERROR(ErrCode::PARAM)); + } + + [[nodiscard]] bool TryPeekMessage(Ctxt &ctxt) + { + ssize_t ret = recvmsg(socketFd, &ctxt.msghdr, MSG_DONTWAIT | MSG_PEEK); + if (ret > 0) { + return true; + } + if (ret == 0) { + return false; + } + if (errno == EAGAIN || errno == EWOULDBLOCK) { + return false; + } + throw std::runtime_error("TryPeekMessage failed: " + std::string(std::strerror(errno))); + } + + const char *GetName(Ctxt const & ctxt) const + { + if (ctxt.messageName.sun_path[0] != STR_END_CHAR) { + throw std::runtime_error("GetName() expected an abstract socket, but got " + + std::string(ctxt.messageName.sun_path)); + } + return ctxt.messageName.sun_path + 1; + } + + std::vector<fileDesT> GetFileDes(const Ctxt &ctxt) const + { + struct cmsghdr *cmg = CMSG_FIRSTHDR(&ctxt.msghdr); + if (cmg == nullptr) { + return {}; + } + unsigned numFileDes = (cmg->cmsg_len - sizeof(struct cmsghdr)) / sizeof(fileDesT); + return { ctxt.fileDesPtr, ctxt.fileDesPtr + numFileDes }; + } + +protected: + fileDesT socketFd; + size_t SetSocketAdress(const std::string &srcSocket, struct sockaddr_un &destSocket) + { + if (srcSocket.size() > addressMaxLen) { + throw std::runtime_error("Abstract UNIX Socket path cannot be larger than addressMaxLen"); + } + destSocket.sun_family = AF_UNIX; + destSocket.sun_path[0] = STR_END_CHAR; + if (srcSocket.empty()) { + return sizeof(sa_family_t); + } + srcSocket.copy(destSocket.sun_path + 1, srcSocket.size()); + destSocket.sun_path[srcSocket.size() + 1] = STR_END_CHAR; + return sizeof(sa_family_t) + srcSocket.size() + 2; // 2: leading '\0' and trailing '\0' + } + + auto BuildNpuCtxt_(const std::vector<NpuPayLoad> &npuPayLoad, unsigned numFileDes) + { + auto ctxt = std::make_unique<Ctxt>(npuPayLoad.size()); + std::memset(&ctxt->msghdr, 0, sizeof(ctxt->msghdr)); + for (size_t i = 0; i < npuPayLoad.size(); i++) { + ctxt->iov[i] = {npuPayLoad[i].data, npuPayLoad[i].size}; + }
ctxt->msghdr.msg_name = &ctxt->messageName; + ctxt->msghdr.msg_namelen = sizeof(decltype(ctxt->messageName)); + ctxt->msghdr.msg_iov = ctxt->iov.data(); + ctxt->msghdr.msg_iovlen = npuPayLoad.size(); + ctxt->fileDesPtr = nullptr; + if (numFileDes == 0) { + return ctxt; + } + const size_t fileDesSize = sizeof(fileDesT) * numFileDes; + ctxt->msghdr.msg_control = ctxt->ancillaryBuf; + ctxt->msghdr.msg_controllen = CMSG_SPACE(fileDesSize); + + struct cmsghdr *cmsg = CMSG_FIRSTHDR(&ctxt->msghdr); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + cmsg->cmsg_len = CMSG_LEN(fileDesSize); + ctxt->fileDesPtr = ReinterpretConvert<fileDesT *>(CMSG_DATA(cmsg)); + return ctxt; + } +}; + +} // namespace ipc_monitor +} // namespace dynolog_npu + +#endif diff --git a/dynolog_npu/plugin/ipc_monitor/PyDynamicMonitorProxy.h b/dynolog_npu/plugin/ipc_monitor/PyDynamicMonitorProxy.h new file mode 100644 index 0000000000000000000000000000000000000000..8b5f88abf9d2cf589bec685cd3a520729afe8dd5 --- /dev/null +++ b/dynolog_npu/plugin/ipc_monitor/PyDynamicMonitorProxy.h @@ -0,0 +1,40 @@ +#ifndef PYDYNAMIC_MONITOR_PROXY_H +#define PYDYNAMIC_MONITOR_PROXY_H + +#include <iostream> +#include <string> +#include "MonitorBase.h" +#include "DynoLogNpuMonitor.h" + +namespace dynolog_npu { +namespace ipc_monitor { + +class PyDynamicMonitorProxy { +public: + PyDynamicMonitorProxy() = default; + bool InitDyno(int npuId) + { + try { + monitor_ = DynoLogNpuMonitor::GetInstance(); + monitor_->SetNpuId(npuId); + bool res = monitor_->Init(); + return res; + } catch (const std::exception &e) { + std::cout << "[ERROR] Error when init dyno: " << e.what() << std::endl; + return false; + } + } + + std::string PollDyno() + { + return monitor_->Poll(); + } + +private: + MonitorBase *monitor_ = nullptr; +}; + +} // namespace ipc_monitor +} // namespace dynolog_npu + +#endif diff --git a/dynolog_npu/plugin/ipc_monitor/singleton.h b/dynolog_npu/plugin/ipc_monitor/singleton.h new file mode 100644 index
0000000000000000000000000000000000000000..8bb106f3adc8b365ef81feb603c6aaac917a00e2 --- /dev/null +++ b/dynolog_npu/plugin/ipc_monitor/singleton.h @@ -0,0 +1,31 @@ +#ifndef SINGLETON_H +#define SINGLETON_H +#include <type_traits> + +namespace dynolog_npu { +namespace ipc_monitor { + +template <typename T> +class Singleton { +public: + static T *GetInstance() noexcept(std::is_nothrow_constructible<T>::value) { + static T instance; + return &instance; + } + + virtual ~Singleton() = default; + +protected: + explicit Singleton() = default; + +private: + Singleton(const Singleton &obj) = delete; + Singleton& operator=(const Singleton &obj) = delete; + Singleton(Singleton &&obj) = delete; + Singleton& operator=(Singleton &&obj) = delete; +}; + +} // namespace ipc_monitor +} // namespace dynolog_npu + +#endif \ No newline at end of file diff --git a/dynolog_npu/plugin/ipc_monitor/utils.cpp b/dynolog_npu/plugin/ipc_monitor/utils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..936821fd34bc34bc9db9e09515132e8af39ba57a --- /dev/null +++ b/dynolog_npu/plugin/ipc_monitor/utils.cpp @@ -0,0 +1,135 @@ +#include "utils.h" + +namespace dynolog_npu { +namespace ipc_monitor { +std::unordered_map<SubModule, std::string> submoduleMap = { + {SubModule::IPC, "IPC"}, +}; + +std::unordered_map<ErrCode, std::string> errCodeMap = { + {ErrCode::SUC, "success"}, + {ErrCode::PARAM, "invalid parameter"}, + {ErrCode::TYPE, "invalid type"}, + {ErrCode::VALUE, "invalid value"}, + {ErrCode::PTR, "invalid pointer"}, + {ErrCode::INTERNAL, "internal error"}, + {ErrCode::MEMORY, "memory error"}, + {ErrCode::NOT_SUPPORT, "feature not supported"}, + {ErrCode::NOT_FOUND, "resource not found"}, + {ErrCode::UNAVAIL, "resource unavailable"}, + {ErrCode::SYSCALL, "system call failed"}, + {ErrCode::TIMEOUT, "timeout error"}, + {ErrCode::PERMISSION, "permission error"}, +}; + +std::string getCurrentTimestamp() +{ + auto now = std::chrono::system_clock::now(); + auto micros = std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch()); + + std::time_t currentTime =
std::chrono::system_clock::to_time_t(now); + std::tm* timeInfo = std::localtime(&currentTime); + + auto milli_time = std::chrono::duration_cast<std::chrono::milliseconds>(micros).count() % 1000; + auto micro_time = micros.count() % 1000; + + std::ostringstream oss; + oss << std::put_time(timeInfo, "%Y-%m-%d-%H:%M:%S") << "." + << std::setw(3) << std::setfill('0') << milli_time + << std::setw(3) << std::setfill('0') << micro_time; + return oss.str(); +} + +std::string formatErrorCode(SubModule submodule, ErrCode errorCode) +{ + std::ostringstream oss; + oss << "\n[ERROR] " << getCurrentTimestamp() << " (PID:" << getpid() << ") "; + oss << "ERR" << std::setw(2) << std::setfill('0') << static_cast<int>(submodule); // 2: field width + oss << std::setw(3) << std::setfill('0') << static_cast<int>(errorCode); // 3: field width + oss << " " << submoduleMap[submodule] << " " << errCodeMap[errorCode]; + + return oss.str(); +} + + +int32_t GetProcessId() +{ + return static_cast<int32_t>(getpid()); +} + +std::pair<int32_t, std::string> GetParentPidAndCommand(int32_t pid) +{ + std::string fileName = "/proc/" + std::to_string(pid) + "/stat"; + std::ifstream statFile(fileName); + if (!statFile) { + return std::make_pair(0, ""); + } + int32_t parentPid = 0; + std::string command; + std::string line; + if (std::getline(statFile, line)) { + char commandBuf[256] = {0}; + int ret = sscanf(line.c_str(), "%*d (%255[^)]) %*c %d", commandBuf, &parentPid); + if (ret == 2) { // 2: two fields parsed + command = commandBuf; + std::cout << "[INFO] Succeeded to get parent pid: " << parentPid << std::endl; + return std::make_pair(parentPid, command); + } + } + std::cout << "[WARNING] Failed to parse /proc/" << pid << "/stat" << std::endl; + return std::make_pair(0, ""); +} + +std::vector<std::pair<int32_t, std::string>> GetPidCommandPairsofAncestors() +{ + std::vector<std::pair<int32_t, std::string>> process_pids_and_cmds; + process_pids_and_cmds.reserve(MaxParentPids + 1); + int32_t current_pid = GetProcessId(); + for (int i = 0; i <= MaxParentPids && (i == 0 || current_pid > 1); i++) { + std::pair<int32_t, std::string> parent_pid_and_cmd = GetParentPidAndCommand(current_pid); + process_pids_and_cmds.push_back(std::make_pair(current_pid, parent_pid_and_cmd.second)); + current_pid = parent_pid_and_cmd.first; + } + return
process_pids_and_cmds; +} + +std::vector<int32_t> GetPids() +{ + const auto &pids = GetPidCommandPairsofAncestors(); + std::vector<int32_t> res; + res.reserve(pids.size()); + for (const auto &pidPair : pids) { + res.push_back(pidPair.first); + } + return res; +} +std::string GenerateUuidV4() +{ + static std::random_device randomDevice; + static std::mt19937 gen(randomDevice()); + static std::uniform_int_distribution<> dis(0, 15); // range (0, 15) + static std::uniform_int_distribution<> dis2(8, 11); // range (8, 11) + + std::stringstream stringStream; + stringStream << std::hex; + for (int i = 0; i < 8; i++) { // 8 times + stringStream << dis(gen); + } + stringStream << "-"; + for (int j = 0; j < 4; j++) { // 4 times + stringStream << dis(gen); + } + stringStream << "-4"; // add -4 + for (int k = 0; k < 3; k++) { // 3 times + stringStream << dis(gen); + } + stringStream << "-"; + stringStream << dis2(gen); + for (int m = 0; m < 3; m++) { // 3 times + stringStream << dis(gen); + } + stringStream << "-"; + for (int n = 0; n < 12; n++) { // 12 times + stringStream << dis(gen); + } + return stringStream.str(); +} + +} // namespace ipc_monitor +} // namespace dynolog_npu diff --git a/dynolog_npu/plugin/ipc_monitor/utils.h b/dynolog_npu/plugin/ipc_monitor/utils.h new file mode 100644 index 0000000000000000000000000000000000000000..0d8ceb8cfd0bf81b6d8b807c6ac1b505276ddf83 --- /dev/null +++ b/dynolog_npu/plugin/ipc_monitor/utils.h @@ -0,0 +1,63 @@ +#ifndef IPC_MONITOR_UTILS_H +#define IPC_MONITOR_UTILS_H +#include <chrono> +#include <cstdint> +#include <cstdio> +#include <ctime> +#include <fstream> +#include <iomanip> +#include <iostream> +#include <random> +#include <sstream> +#include <string> +#include <unistd.h> +#include <unordered_map> +#include <utility> +#include <vector> + + +namespace dynolog_npu { +namespace ipc_monitor { + +constexpr int MaxParentPids = 5; +int32_t GetProcessId(); +std::string GenerateUuidV4(); +std::vector<int32_t> GetPids(); +std::pair<int32_t, std::string> GetParentPidAndCommand(int32_t pid); +std::vector<std::pair<int32_t, std::string>> GetPidCommandPairsofAncestors(); +std::string getCurrentTimestamp(); + +enum class SubModule { + IPC = 0 +}; + +enum class ErrCode { + SUC = 0,
+ PARAM = 1, + TYPE = 2, + VALUE = 3, + PTR = 4, + INTERNAL = 5, + MEMORY = 6, + NOT_SUPPORT = 7, + NOT_FOUND = 8, + UNAVAIL = 9, + SYSCALL = 10, + TIMEOUT = 11, + PERMISSION = 12, +}; + + +std::string formatErrorCode(SubModule submodule, ErrCode errorCode); + +#define IPC_ERROR(error) formatErrorCode(SubModule::IPC, error) + +template <typename T, typename V> +inline T ReinterpretConvert(V ptr) { + return reinterpret_cast<T>(ptr); +} + + +} // namespace ipc_monitor +} // namespace dynolog_npu + +#endif + diff --git a/dynolog_npu/plugin/setup.py b/dynolog_npu/plugin/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..151b9b3fb3fa1a42e147685f632163c8b3f5a564 --- /dev/null +++ b/dynolog_npu/plugin/setup.py @@ -0,0 +1,42 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+import os +from setuptools import setup +from pybind11.setup_helpers import Pybind11Extension + +BASE_DIR = os.path.dirname(os.path.realpath(__file__)) + +# Define the extension module +ext_modules = [ + Pybind11Extension( + "IPCMonitor", # Name of the Python module + sources=["bindings.cpp", + "ipc_monitor/utils.cpp", + "ipc_monitor/DynoLogNpuMonitor.cpp", + "ipc_monitor/NpuIpcClient.cpp", + ], # Source files + include_dirs=[os.path.join(BASE_DIR, "ipc_monitor")], # Include Pybind11 headers + language="c++", # Specify the language + ), +] + +# Set up the package +setup( + name="dynolog_npu_plugin", + version="0.1", + description="dynolog npu plugins", + ext_modules=ext_modules, + install_requires=["pybind11"], +) \ No newline at end of file diff --git a/dynolog_npu/scripts/apply_dyno_patches.sh b/dynolog_npu/scripts/apply_dyno_patches.sh new file mode 100644 index 0000000000000000000000000000000000000000..c492db74a2a56948433a47e9cffcccd4ac71e098 --- /dev/null +++ b/dynolog_npu/scripts/apply_dyno_patches.sh @@ -0,0 +1,36 @@ +#! /bin/bash +set -e + +apply_ascend_patches() { + cd ./third_party/dynolog || return 1 + + if [ ! -d "../../patches" ]; then + echo "ERROR: patches directory not found" + cd ../.. + return 1 + fi + + for patch_file in ../../patches/*.patch; do + if [ -f "$patch_file" ]; then + echo "Applying patch: $patch_file" + git apply --check -p1 "$patch_file" + if [ $? -ne 0 ]; then + echo "ERROR: Failed to apply patch: $(basename $patch_file)" + cd ../.. + return 1 + fi + git apply -p1 "$patch_file" + if [ $? -ne 0 ]; then + echo "ERROR: Failed to apply patch: $(basename $patch_file)" + cd ../.. + return 1 + fi + fi + done + + cd ../.. 
+ echo "Successfully applied all Ascend patches" + return 0 +} + +apply_ascend_patches \ No newline at end of file diff --git a/dynolog_npu/scripts/build.sh b/dynolog_npu/scripts/build.sh new file mode 100644 index 0000000000000000000000000000000000000000..aa3508e14faa6bfea06afe0cd3083ad1a5317037 --- /dev/null +++ b/dynolog_npu/scripts/build.sh @@ -0,0 +1,108 @@ +#!/bin/bash +set -e + +check_gcc_version() { + if ! command -v gcc >/dev/null 2>&1; then + echo "ERROR: gcc command not found" + return 1 + fi + + local GCC_VERSION=$(gcc -dumpversion) + local GCC_MAJOR=$(echo $GCC_VERSION | cut -d. -f1) + local GCC_MINOR=$(echo $GCC_VERSION | cut -d. -f2) + + if [ "$GCC_MAJOR" -lt 8 ] || ([ "$GCC_MAJOR" -eq 8 ] && [ "$GCC_MINOR" -lt 5 ]); then + echo "ERROR: gcc version must be greater than or equal to 8.5.0" + echo "Current gcc version: $GCC_VERSION" + return 1 + fi + echo "Check pass: current gcc version is $GCC_VERSION" + return 0 +} + +check_rust_version() { + if ! command -v rustc >/dev/null 2>&1; then + echo "ERROR: rustc command not found" + return 1 + fi + + local RUST_VERSION=$(rustc --version | cut -d' ' -f2) + local RUST_MAJOR=$(echo $RUST_VERSION | cut -d. -f1) + local RUST_MINOR=$(echo $RUST_VERSION | cut -d. -f2) + + if [ "$RUST_MAJOR" -lt 1 ] || ([ "$RUST_MAJOR" -eq 1 ] && [ "$RUST_MINOR" -lt 56 ]); then + echo "ERROR: Rust version must be greater than or equal to 1.56.0" + echo "Current Rust version: $RUST_VERSION" + return 1 + fi + echo "Check pass: current Rust version is $RUST_VERSION" + return 0 +} + +update_and_checkout_submodule() { + DYNLOG_COMMIT_ID="a9b6aeddcd6363252f5388cb0dd942981a09a24b" + + git submodule update --init --recursive + if [ $? -ne 0 ]; then + echo "ERROR: update git submodule failed" + return 1 + fi + + cd ./third_party/dynolog + git checkout ${DYNLOG_COMMIT_ID} + if [ $? -ne 0 ]; then + echo "ERROR: switch to dynolog specified commit failed" + cd .. 
+ return 1 + fi + echo "Check pass: switch to dynolog specified commit ${DYNLOG_COMMIT_ID}" + cd ../../ + return 0 +} + +PACKAGE_TYPE="" +while getopts "t:" opt; do + case $opt in + t) + PACKAGE_TYPE="$OPTARG" + if [[ "$PACKAGE_TYPE" != "deb" && "$PACKAGE_TYPE" != "rpm" ]]; then + echo "ERROR: Invalid package type. Supported types: deb, rpm" + exit 1 + fi + ;; + \?) + echo "Usage: $0 [-t package_type]" + echo "package_type: deb or rpm (optional, if not specified will only build)" + exit 1 + ;; + esac +done + +echo "------------------ Check GCC and Rust version ----------------------" +check_gcc_version +check_rust_version + +echo "------------------ Update and checkout submodule -------------------" +update_and_checkout_submodule + +echo "------------------ Generate patch for Ascend -----------------------" +bash scripts/gen_dyno_patches.sh + +echo "------------------ Apply patch for Ascend --------------------------" +bash scripts/apply_dyno_patches.sh + +echo "------------------ Build dynolog patch for Ascend-------------------" +cd third_party/dynolog +rm -rf build +if [ -z "$PACKAGE_TYPE" ]; then + bash scripts/build.sh + echo "Build dynolog success without packaging" +elif [ "$PACKAGE_TYPE" = "deb" ]; then + bash scripts/debian/make_deb.sh + mv dynolog_*.deb ../../ + echo "Build dynolog deb package success" +elif [ "$PACKAGE_TYPE" = "rpm" ]; then + bash scripts/rpm/make_rpm.sh + mv dynolog_*.rpm ../../ + echo "Build dynolog rpm package success" +fi diff --git a/dynolog_npu/scripts/gen_dyno_patches.sh b/dynolog_npu/scripts/gen_dyno_patches.sh new file mode 100644 index 0000000000000000000000000000000000000000..5ade74dbcfcf88dfbc072c9de790ec4f3ec451d9 --- /dev/null +++ b/dynolog_npu/scripts/gen_dyno_patches.sh @@ -0,0 +1,63 @@ +#!/bin/bash +set -e + +WORK_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." 
&& pwd)" +PATCHES_DIR="${WORK_DIR}/patches" +DYNOLOG_DIR="${WORK_DIR}/third_party/dynolog" +MODIFIED_FILES_DIR="${WORK_DIR}/dynolog_npu" + +mkdir -p "${PATCHES_DIR}" + +generate_patches() { + echo "Generating patches from modified files..." + + # Check that the directory of modified files exists + if [ ! -d "${MODIFIED_FILES_DIR}" ]; then + echo "ERROR: dynolog_npu directory not found" + return 1 + fi + + # Clean up old patch files + rm -f "${PATCHES_DIR}"/*.patch + + # Walk the modified files directory + find "${MODIFIED_FILES_DIR}" -type f | while read -r modified_file; do + # Get the relative path + rel_path=$(realpath --relative-to="${MODIFIED_FILES_DIR}" "${modified_file}") + original_file="${DYNOLOG_DIR}/${rel_path}" + + echo "original_file: ${original_file}" + # Check whether the original file exists + if [ ! -f "${original_file}" ]; then + echo "WARN: Original file not found: ${original_file}" + + cp "${modified_file}" "${original_file}" + echo "Copied ${modified_file} to ${original_file}" + continue + fi + + # Generate the patch file name (replace slashes in the path with underscores) + patch_name=$(echo "${rel_path}" | sed 's/\//_/g') + patch_file="${PATCHES_DIR}/${patch_name}.patch" + + echo "Generating patch for: ${rel_path}" + + ( + cd "${WORK_DIR}" + diff -u "third_party/dynolog/${rel_path}" "dynolog_npu/${rel_path}" > "${patch_file}" || true + ) + + # Check the patch file size + if [ !
-s "${patch_file}" ]; then + rm "${patch_file}" + echo "No differences found for: ${rel_path}" + else + echo "Successfully generated patch: ${patch_file}" + fi + done + + echo "Patch generation completed" + return 0 +} + +generate_patches \ No newline at end of file diff --git a/dynolog_npu/third_party/dynolog b/dynolog_npu/third_party/dynolog new file mode 160000 index 0000000000000000000000000000000000000000..d5d37bc182bc2aa8fa60ba7d5ee897bacb5cbd4b --- /dev/null +++ b/dynolog_npu/third_party/dynolog @@ -0,0 +1 @@ +Subproject commit d5d37bc182bc2aa8fa60ba7d5ee897bacb5cbd4b diff --git a/flight_recorder/analysis_flight.py b/flight_recorder/analysis_flight.py new file mode 100644 index 0000000000000000000000000000000000000000..f81f771ab1c81ad79cb93401e200b600a4b17af3 --- /dev/null +++ b/flight_recorder/analysis_flight.py @@ -0,0 +1,164 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Copyright Huawei Technologies Co., Ltd. 2024-2025. All rights reserved. 
+ +import os +import pickle +import sys +import logging +from collections import defaultdict + +from check_path import get_valid_read_path + + +logging.basicConfig( + level=logging.INFO, # Log level INFO + format="%(asctime)s - %(levelname)s - %(message)s", # Log format + handlers=[logging.StreamHandler()], # Log to console +) + + +SAFE_CLASSES = { + # Built-in safe types + "builtins": {"str", "int", "float", "list", "dict", "tuple"}, +} + + +class SafeUnpickler(pickle.Unpickler): + def find_class(self, module, name): + # Only allow whitelisted modules and classes + if module in SAFE_CLASSES and name in SAFE_CLASSES[module]: + return super().find_class(module, name) + raise pickle.UnpicklingError(f"Forbidden class: {module}.{name}") + + +def load_recorder_data(path, world_size): + """Load recorder data for all ranks.""" + recorder_dict = {} + for rank in range(world_size): + file_path = os.path.join(path, str(rank)) + file_path = get_valid_read_path(file_path) + try: + with open(file_path, "rb") as f: + res = SafeUnpickler(f).load() + recorder_dict[str(rank)] = res + except Exception as e: + logging.error(f"Failed to load data from {file_path}: {e}") + return recorder_dict + + +def extract_hccl_info(recorder_dict): + """Extract HCCL-related info from recorder data.""" + hccl_dict = {} + for rank, recorder in recorder_dict.items(): + entries = recorder.get("entries", []) + if not entries: + continue + last_entry = entries[-1] + hccl_dict[rank] = { + "state": last_entry.get("state", None), + "record_id": last_entry.get("record_id", None), + "pg_id": last_entry.get("pg_id", None), + "time_discovered_completed_ns": last_entry.get("time_discovered_completed_ns", None), + "name": (last_entry.get("frames") or [{}])[0].get("name", None), + } + return hccl_dict + + +def analyze_pg_groups(hccl_dict): + """Group HCCL data by pg_id and check for problems.""" + pg_groups = defaultdict(list) + for _, op in hccl_dict.items(): + pg_groups[op["pg_id"]].append(op) + + for pg_id, group in pg_groups.items(): + scheduled_ops = [op for op in group
if op["state"] == "scheduled"] + completed_ops = [op for op in group if op["state"] == "completed"] + + # Case 1: every rank is scheduled with the same record_id and name + if scheduled_ops and len(scheduled_ops) == len(group): + record_id = scheduled_ops[0]["record_id"] + name = scheduled_ops[0]["name"] + all_same = all(op["record_id"] == record_id and op["name"] == name for op in scheduled_ops) + if all_same: + logging.info( + f"The pg_id {pg_id}'s communication operator {name}" + " executed too slowly, causing the HCCL to time out." + ) + + # Case 2: a completed op exists whose record_id is one less than the scheduled ops' + elif completed_ops and scheduled_ops: + completed_op = completed_ops[0] + scheduled_record_id = scheduled_ops[0]["record_id"] + if completed_op["record_id"] == scheduled_record_id - 1: + logging.info( + f"In pg_id {pg_id}, one rank's " + "computational task took too long, causing the other ranks' " + "HCCL task to time out." + ) + + # Case 3: all ops are completed + elif not scheduled_ops and completed_ops: + latest_op = max(completed_ops, key=lambda x: x["time_discovered_completed_ns"] or 0) + logging.info( + f"The computational task of the pg_id {pg_id} " + f"after the communication operator {latest_op['name']} " + "took too long."
+            )
+
+        else:
+            logging.info(f"The situation of pg_id {pg_id} cannot be recognized.")
+
+
+def get_int_arg(args, idx, default):
+    if len(args) > idx:
+        try:
+            return int(args[idx])
+        except ValueError:
+            logging.warning(f"Invalid input {args[idx]}, using default: {default}")
+    return default
+
+
+def main():
+    # default values
+    default_path = os.getenv("TORCH_HCCL_DEBUG_INFO_TEMP_FILE")
+    default_world_size = 8
+
+    # read the command-line arguments, falling back to the defaults when absent
+    path = sys.argv[1] if len(sys.argv) > 1 else default_path
+    world_size = get_int_arg(sys.argv, 2, default_world_size)
+
+    if not path:
+        raise ValueError("Path is required and cannot be empty.")
+
+    logging.info(f"Path: {path}")
+    logging.info(f"World Size: {world_size}")
+
+    # load the recorder data
+    recorder_dict = load_recorder_data(path, world_size)
+    if not recorder_dict:
+        logging.error("No valid recorder data found.")
+        return
+
+    # extract the HCCL information
+    hccl_dict = extract_hccl_info(recorder_dict)
+
+    # analyze the HCCL data
+    analyze_pg_groups(hccl_dict)
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/flight_recorder/check_path.py b/flight_recorder/check_path.py
new file mode 100644
index 0000000000000000000000000000000000000000..b34e4dcdb68b28b44f387cb14919ad127658ca8f
--- /dev/null
+++ b/flight_recorder/check_path.py
@@ -0,0 +1,81 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+ +import re +import os +import sys +import stat + + +PATH_WHITE_LIST_REGEX = re.compile(r"[^_A-Za-z0-9/.-]") +MAX_READ_FILE_SIZE_4G = 4294967296 # 4G, 4 * 1024 * 1024 * 1024 +MAX_READ_FILE_SIZE_32G = 34359738368 # 32G, 32 * 1024 * 1024 * 1024 +MAX_READ_FILE_SIZE_512G = 549755813888 # 512G, 512 * 1024 * 1024 * 1024 + +# group not writable, others no permission, max stat is 750 +WRITE_FILE_NOT_PERMITTED_STAT = stat.S_IWGRP | stat.S_IWOTH | stat.S_IROTH | stat.S_IXOTH +# group not writable, others not writable, max stat is 755 +READ_FILE_NOT_PERMITTED_STAT = stat.S_IWGRP | stat.S_IWOTH + + +def type_to_str(value_type): + return ' or '.join([ii.__name__ for ii in value_type]) if isinstance(value_type, tuple) else value_type.__name__ + + +def check_type(value, value_type, param_name="value"): + if not isinstance(value, value_type): + raise TypeError('{} must be {}, not {}.'.format(param_name, type_to_str(value_type), type(value).__name__)) + + +def get_valid_path(path): + check_type(path, str, "path") + if not path or len(path) == 0: + raise ValueError("The value of the path cannot be empty.") + if PATH_WHITE_LIST_REGEX.search(path): # Check special char + raise ValueError("Input path contains invalid characters.") # Not printing out the path value for invalid char + path = os.path.expanduser(path) # Consider paths starting with "~" + if os.path.islink(os.path.abspath(path)): # when checking link, get rid of the "/" at the path tail if any + raise ValueError("The value of the path cannot be soft link: {}.".format(path)) + + real_path = os.path.realpath(path) + + if len(real_path) > 4096: + raise ValueError("The length of file path should be less than 4096.") + + if real_path != path and PATH_WHITE_LIST_REGEX.search(real_path): # Check special char again + raise ValueError("Input path contains invalid characters.") # Not printing out the path value for invalid char + + return real_path + + +def is_belong_to_user_or_group(file_stat): + return file_stat.st_uid == 
os.getuid() or file_stat.st_gid in os.getgroups()
+
+
+def get_valid_read_path(path, size_max=MAX_READ_FILE_SIZE_4G, check_user_stat=True, is_dir=False):
+    real_path = get_valid_path(path)
+    if not os.path.isfile(real_path):
+        raise ValueError("The path {} doesn't exist or is not a file.".format(path))
+
+    file_stat = os.stat(real_path)
+    if check_user_stat and not sys.platform.startswith("win") and not is_belong_to_user_or_group(file_stat):
+        raise ValueError("The file {} doesn't belong to the current user or group.".format(path))
+    if check_user_stat and file_stat.st_mode & READ_FILE_NOT_PERMITTED_STAT > 0:
+        raise ValueError("The file {} is group writable, or is others writable.".format(path))
+    if not os.access(real_path, os.R_OK) or file_stat.st_mode & stat.S_IRUSR == 0:  # At least been 400
+        raise ValueError("Current user doesn't have read permission to the file {}.".format(path))
+    if not is_dir and size_max > 0 and file_stat.st_size > size_max:
+        raise ValueError("The file {} exceeds size limitation of {}.".format(path, size_max))
+    return real_path
\ No newline at end of file
diff --git a/flight_recorder/flight_recorder.md b/flight_recorder/flight_recorder.md
new file mode 100644
index 0000000000000000000000000000000000000000..8b398a6730bae0823b04c20a22258a81392922c9
--- /dev/null
+++ b/flight_recorder/flight_recorder.md
@@ -0,0 +1,49 @@
+# Analysis of Flight Recorder Timeout Problems
+
+A hung training job is the main blocker for large-scale distributed AI cluster training: today it is only noticed once a collective-communication timeout fires, which hurts cluster availability. The framework therefore needs to detect hung training jobs, identify them early, and save the necessary diagnostic information, improving troubleshooting efficiency and cluster device availability. When the HeartbeatMonitor has not observed a heartbeat for a long time, the training job can be considered hung and the diagnostic dump must be triggered.
+
+This tool reads and parses the logs written by the torch_npu flight recorder and, on top of the parsed logs, performs a first-pass analysis of timeout problems. It recognizes and analyzes the following three timeout scenarios:
+
+| Problem | Description |
+| --- | --- |
+| Type 1 | One card in the communication group is slow in computation, so the other cards wait until the flight recorder and the HCCL timeout are triggered |
+| Type 2 | A non-communication task that follows a communication operator in the group takes too long |
+| Type 3 | A communication operator in the group times out while communicating |
+
+## Usage
+
+### 1 Enabling the flight recorder
+
+Set the following environment variables to enable the flight recorder:
+
+```
+export TORCH_HCCL_ENABLE_MONITORING=1 # controls whether hang detection is enabled
+export TORCH_HCCL_DUMP_ON_TIMEOUT=1 # controls whether diagnostic information is saved
+export TORCH_HCCL_TRACE_BUFFER_SIZE=1 # controls how many collective-communication states are kept
+export TORCH_HCCL_HEARTBEAT_TIMEOUT_SEC=20 # heartbeat timeout, i.e. how long the job may go without launching a collective-communication operator before it is considered hung; default 10 minutes, unit: seconds (must be smaller than HCCL_EXEC_TIMEOUT, so that the collective communication does not report a timeout first)
+export TORCH_HCCL_DEBUG_INFO_TEMP_FILE=/tmp/ # path where the diagnostic information is saved
+```
+
+### 2 Using the tool
+
+```
+python analysis_flight.py path world_size
+```
+
+The script reads `path` and `world_size` from the command line and logs them. When an argument is not provided, its default value is used.
+
+* `path`: taken from the first command-line argument; falls back to `default_path`, which is read from TORCH_HCCL_DEBUG_INFO_TEMP_FILE.
+* `world_size`: taken from the second command-line argument; falls back to `default_world_size`, which is 8.
+
+| Parameter | Meaning | Constraints |
+| --- | --- | --- |
+| path | Logs of the flight recorder | Optional. Type: string. Defaults to TORCH_HCCL_DEBUG_INFO_TEMP_FILE from the environment; if the configured log format adds a prefix, the prefix must be included in the path |
+| world_size | Number of cards in the same communication group | Optional. Type: int. Defaults to 8 |
+
+### 3 Sample output
+
+```
+2025-02-19 08:10:07,160 - INFO - Path: /tmp/
+2025-02-19 08:10:07,160 - INFO - World Size: 8
+2025-02-19 08:10:07,162 - INFO - The pg_id 0's rank 0's Computational task took too long, causing the other ranks' HCCL task to time out.
+```
diff --git a/plugins/tensorboard-plugins/OWNERS b/plugins/tensorboard-plugins/OWNERS
index 34c383beaf138da92df0991b472135496450a827..8dd996262b04faf778976324fa4221e51c4bfa30 100644
--- a/plugins/tensorboard-plugins/OWNERS
+++ b/plugins/tensorboard-plugins/OWNERS
@@ -3,7 +3,8 @@ options:
 approvers:
 - wo-wenjie
 - ly-qianxiao
+- leo920320
+- ninghuang
 reviewers:
-- wo-wenjie
-- ly-qianxiao
 - leo920320
+- ninghuang
diff --git a/plugins/tensorboard-plugins/tb_plugin/README.md b/plugins/tensorboard-plugins/tb_plugin/README.md
index 3b3edbb9af0d6cb2bd6d877906fa31679cb42a6f..b4b417c4d21899c195674be08d196a5d9f11021a 100644
--- a/plugins/tensorboard-plugins/tb_plugin/README.md
+++ b/plugins/tensorboard-plugins/tb_plugin/README.md
@@ -17,7 +17,7 @@
 2. Install from source

 * Download the source code from the repository:

-  `git clone https://gitee.com/ascend/att.git`
+  `git clone https://gitee.com/ascend/mstt.git`

 * Enter the directory `/plugins/tensorboard_plugins/tb_plugin`.
 * Build the front-end code
@@ -128,25 +128,37 @@
 ##### Kernel View

-  Kernel View shows the details of the operators running on the accelerator cores.
+  Kernel View shows the details of the operators running on the accelerator cores. The view contains two pie charts and two tables; Group By switches the table between the operator detail table and the statistics table.

-  ![Alt text](./docs/images/kernel_view.PNG)
+  * At the top are pie charts showing the time share of the most expensive operators (left) and the percentage of operator time spent on each kind of accelerator core (right).

-  * Calls: number of times the operator is scheduled.
+    ![Alt text](./docs/images/kernel_view.PNG)

-  * Accelerator Core: compute core.
+  * When Group By is All, the operator detail table is shown; some of its fields are:

-  * Block Dim: number of task splits, corresponding to the number of cores the task runs on.
+    | Field | Description |
+    | ---------------- | -------------------------------------- |
+    | Step Id | step in which the data was collected |
+    | Name | name of the operator running on the npu |
+    | Type | operator type |
+    | Accelerator Core | AI accelerator core type, including AI Core, AI CPU, etc. |
+    | Start Time(us) | operator start time |
+    | Duration(us) | execution time of the operator |
+    | Wait Time(us) | time the operator waits before executing |
+    | Block Dim | number of task splits, corresponding to the number of cores at execution time |

   ![Alt text](./docs/images/kernel_view_group_by_statistic.PNG)

-  * Accelerator Core Utilization: percentage of operator time spent on each kind of core.
+  * When Group By is Statistic, the operator statistics table is shown, with per-operator execution statistics:

-  * Name: name of the operator running on the npu.
-
-  * Total Duration, Max Duration, Avg Duration, Min Duration: total, maximum, average and minimum time over the operator's calls.
-
-  The view contains two pie charts and two tables; Group By switches the table between the operator detail table and the statistics table.
+    | Field | Description |
+    | ---------------- | -------|
+    | Name | name of the operator running on the npu |
+    | Calls | number of times the operator is executed |
+    | Total Duration(us) | total execution time of the operator |
+    | Min Duration(us) | minimum execution time of the operator |
+    | Max Duration(us) | maximum execution time of the operator |
+    | Avg Duration(us) | average execution time of the operator |

 ##### Trace View
@@ -162,7 +174,7 @@
 ![Alt text](./docs/images/trace_view_launch.PNG)

-  Selecting only async_nup shows how framework-side operators relate to the operators executed on the Ascend hardware.
+  Selecting only async_npu shows the launch-execution relationship between framework-side operators and the operators executed on the Ascend hardware.

 ![Alt text](./docs/images/trace_view_npu_utilization.PNG)
diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/api.ts b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/api.ts
index c5193706ecfd199b31d3c8cc5d3fda2d4684ae03..29cde96ebbde928cde967b3b1b365d12e74ee734 100644
--- a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/api.ts
+++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/api.ts
@@ -1432,31 +1432,31 @@ export 
const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (exp_run !== undefined) { - localVarQueryParameter['exp_run'] = exp_run; + localVarQueryParameter.exp_run = exp_run; } if (exp_worker !== undefined) { - localVarQueryParameter['exp_worker'] = exp_worker; + localVarQueryParameter.exp_worker = exp_worker; } if (exp_span !== undefined) { - localVarQueryParameter['exp_span'] = exp_span; + localVarQueryParameter.exp_span = exp_span; } if (path !== undefined) { - localVarQueryParameter['path'] = path; + localVarQueryParameter.path = path; } localVarUrlObj.query = Object.assign( @@ -1520,15 +1520,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -1592,15 +1592,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -1664,15 +1664,15 @@ export const DefaultApiFetchParamCreator = function ( const 
localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -1736,15 +1736,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -1817,19 +1817,19 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (group_by !== undefined) { - localVarQueryParameter['group_by'] = group_by; + localVarQueryParameter.group_by = group_by; } localVarUrlObj.query = Object.assign( @@ -1895,19 +1895,19 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (group_by !== undefined) { - 
localVarQueryParameter['group_by'] = group_by; + localVarQueryParameter.group_by = group_by; } localVarUrlObj.query = Object.assign( @@ -1971,15 +1971,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -2043,15 +2043,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -2119,23 +2119,23 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (start_ts !== undefined) { - localVarQueryParameter['start_ts'] = start_ts; + localVarQueryParameter.start_ts = start_ts; } if (end_ts !== undefined) { - localVarQueryParameter['end_ts'] = end_ts; + localVarQueryParameter.end_ts = end_ts; } localVarUrlObj.query = Object.assign( @@ -2203,23 +2203,23 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - 
localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (start_ts !== undefined) { - localVarQueryParameter['start_ts'] = start_ts; + localVarQueryParameter.start_ts = start_ts; } if (end_ts !== undefined) { - localVarQueryParameter['end_ts'] = end_ts; + localVarQueryParameter.end_ts = end_ts; } localVarUrlObj.query = Object.assign( @@ -2283,15 +2283,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -2364,19 +2364,19 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (group_by !== undefined) { - localVarQueryParameter['group_by'] = group_by; + localVarQueryParameter.group_by = group_by; } localVarUrlObj.query = Object.assign( @@ -2460,27 +2460,27 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + 
localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (group_by !== undefined) { - localVarQueryParameter['group_by'] = group_by; + localVarQueryParameter.group_by = group_by; } if (op_name !== undefined) { - localVarQueryParameter['op_name'] = op_name; + localVarQueryParameter.op_name = op_name; } if (input_shape !== undefined) { - localVarQueryParameter['input_shape'] = input_shape; + localVarQueryParameter.input_shape = input_shape; } localVarUrlObj.query = Object.assign( @@ -2553,19 +2553,19 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } if (group_by !== undefined) { - localVarQueryParameter['group_by'] = group_by; + localVarQueryParameter.group_by = group_by; } localVarUrlObj.query = Object.assign( @@ -2629,15 +2629,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -2719,11 +2719,11 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker 
= worker; } localVarUrlObj.query = Object.assign( @@ -2787,15 +2787,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -2859,15 +2859,15 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (worker !== undefined) { - localVarQueryParameter['worker'] = worker; + localVarQueryParameter.worker = worker; } if (span !== undefined) { - localVarQueryParameter['span'] = span; + localVarQueryParameter.span = span; } localVarUrlObj.query = Object.assign( @@ -2910,7 +2910,7 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } localVarUrlObj.query = Object.assign( @@ -2961,11 +2961,11 @@ export const DefaultApiFetchParamCreator = function ( const localVarQueryParameter = {} as any; if (run !== undefined) { - localVarQueryParameter['run'] = run; + localVarQueryParameter.run = run; } if (view !== undefined) { - localVarQueryParameter['view'] = view; + localVarQueryParameter.view = view; } localVarUrlObj.query = Object.assign( diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/configuration.ts b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/configuration.ts index 78bcbeff871e0a67a6c0c129e2b818220b960115..85b77bf651c049ec5a2ec85379414f619904c6dd 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/configuration.ts +++ 
b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/configuration.ts @@ -14,7 +14,6 @@ * https://github.com/swagger-api/swagger-codegen.git * Do not edit the file manually. */ - export interface ConfigurationParameters { apiKey?: string | ((name: string) => string); username?: string; diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/index.ts b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/index.ts index 516ebd97b0c7b44684edcb4a744b6223a4a3ab12..7ad784e60de2777174cea9d902ad9cf2550fad68 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/index.ts +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/generated/index.ts @@ -14,6 +14,5 @@ * https://github.com/swagger-api/swagger-codegen.git * Do not edit the file manually. */ - export * from './api'; export * from './configuration'; diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/mock.ts b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/mock.ts index 9e7b3375b62826347e43547715d40d0b5e5ee594..4b4b447d97192b7c7c00784dd9176faeed25d64b 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/api/mock.ts +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/api/mock.ts @@ -20,11 +20,11 @@ export class MockAPI { ]); } - spansGet(run: string, view: String) { + spansGet(run: string, view: string): Promise { return Promise.resolve(['1', '2']); } - workersGet(run: string, view: String) { + workersGet(run: string, view: string): Promise { return Promise.resolve(['worker0']); } diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/app.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/app.tsx index 0b7c290f57bb7e3ede34a4d24b9efafb495516d0..19eb4b112529073c6b8db9a86b8d68a7633598db 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/app.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/app.tsx @@ -39,6 +39,7 @@ import Tabs from '@material-ui/core/Tabs'; import Typography from '@material-ui/core/Typography'; import ChevronLeftIcon from 
'@material-ui/icons/ChevronLeft'; import ChevronRightIcon from '@material-ui/icons/ChevronRight'; +import { message } from 'antd'; import 'antd/es/button/style/css'; import 'antd/es/list/style/css'; import 'antd/es/table/style/css'; @@ -51,11 +52,11 @@ import { LossComparison } from './components/Accuracy/LossComparison'; import { DiffOverview } from './components/DiffOverview'; import { DistributedView } from './components/DistributedView'; import { FullCircularProgress } from './components/FullCircularProgress'; -import { Kernel } from './components/Kernel'; +import { Kernel as KernelView } from './components/Kernel'; import { MemoryView } from './components/MemoryView'; import { ModuleView } from './components/ModuleView'; -import { Operator } from './components/Operator'; -import { Overview } from './components/Overview'; +import { Operator as OperatorView } from './components/Operator'; +import { Overview as OverviewPage } from './components/Overview'; import { TraceView } from './components/TraceView'; import { setup } from './setup'; import './styles.css'; @@ -72,7 +73,7 @@ export enum Views { Lightning = 'Lightning', } -const ViewNames = { +const viewNames = { [Views.Overview]: Views.Overview, [Views.Operator]: Views.Operator, [Views.Kernel]: 'Kernel', @@ -83,8 +84,6 @@ const ViewNames = { [Views.Lightning]: Views.Lightning, }; -const accViews = ['Loss Comparison']; - const drawerWidth = 340; const useStyles = makeStyles((theme) => ({ root: { @@ -163,11 +162,10 @@ const useStyles = makeStyles((theme) => ({ }, })); -export const App = () => { +export const App = (): JSX.Element => { const classes = useStyles(); // #region - State - const [selectedTab, setSelectedTab] = React.useState(0); const [run, setRun] = React.useState(''); @@ -186,22 +184,14 @@ export const App = () => { const iframeRef = React.useRef(null); const [deviceTarget, setDeviceTarget] = React.useState('GPU'); - const [diffLeftWorkerOptions, setDiffLeftWorkerOptions] = React.useState< - 
string[] - >([]); - const [diffLeftSpansOptions, setDiffLeftSpansOptions] = React.useState< - string[] - >([]); + const [diffLeftWorkerOptions, setDiffLeftWorkerOptions] = React.useState([]); + const [diffLeftSpansOptions, setDiffLeftSpansOptions] = React.useState([]); const [diffLeftRun, setDiffLeftRun] = React.useState(''); const [diffLeftWorker, setDiffLeftWorker] = React.useState(''); const [diffLeftSpan, setDiffLeftSpan] = React.useState(''); - const [diffRightWorkerOptions, setDiffRightWorkerOptions] = React.useState< - string[] - >([]); - const [diffRightSpansOptions, setDiffRightSpansOptions] = React.useState< - string[] - >([]); + const [diffRightWorkerOptions, setDiffRightWorkerOptions] = React.useState([]); + const [diffRightSpansOptions, setDiffRightSpansOptions] = React.useState([]); const [diffRightRun, setDiffRightRun] = React.useState(''); const [diffRightWorker, setDiffRightWorker] = React.useState(''); const [diffRightSpan, setDiffRightSpan] = React.useState(''); @@ -210,28 +200,26 @@ export const App = () => { const [topTab, setTopTab] = React.useState(0); const [fileList, setFileList] = React.useState([]); - const [uploadedCount, setUploadedCount] = React.useState(0); - - // #endregion + const [uploadedCount, setUploadedCount] = React.useState(0); // #endregion React.useEffect(() => { setup() .catch(() => { - console.log('google chart is not supported offline'); + message.warning('google chart is not supported offline'); }) .finally(() => { setLoaded(true); }); }, []); - const continuouslyFetchRuns = async () => { + const continuouslyFetchRuns = async (): Promise => { while (true) { try { - const runs = await api.defaultApi.runsGet(); - setRuns(runs.runs); - setRunsLoading(runs.loading); + const result = await api.defaultApi.runsGet(); + setRuns(result.runs); + setRunsLoading(result.loading); } catch (e) { - console.info('Cannot fetch runs: ', e); + message.warning(`Cannot fetch runs: ${e}`); } await sleep(5000); } @@ -245,60 +233,50 @@ export 
const App = () => { if (!run || !runs.includes(run)) { setRun(firstOrUndefined(runs) ?? ''); } - }, [runs]); - - // #region - Diff Left + }, [runs]); // #region - Diff Left React.useEffect(() => { if (diffLeftRun) { - api.defaultApi.workersGet(diffLeftRun, Views.Overview).then((workers) => { - setDiffLeftWorkerOptions(workers); + api.defaultApi.workersGet(diffLeftRun, Views.Overview).then((data) => { + setDiffLeftWorkerOptions(data); }); } }, [diffLeftRun]); React.useEffect(() => { if (diffLeftRun && diffLeftWorker) { - api.defaultApi.spansGet(diffLeftRun, diffLeftWorker).then((spans) => { - setDiffLeftSpansOptions(spans); + api.defaultApi.spansGet(diffLeftRun, diffLeftWorker).then((data) => { + setDiffLeftSpansOptions(data); }); } }, [diffLeftRun, diffLeftWorker]); // #endregion - // #region - Diff Right - React.useEffect(() => { if (diffRightRun) { - api.defaultApi - .workersGet(diffRightRun, Views.Overview) - .then((workers) => { - setDiffRightWorkerOptions(workers); - }); + api.defaultApi.workersGet(diffRightRun, Views.Overview).then((data) => { + setDiffRightWorkerOptions(data); + }); } }, [diffRightRun]); React.useEffect(() => { if (diffRightRun && diffRightWorker) { - api.defaultApi.spansGet(diffRightRun, diffRightWorker).then((spans) => { - setDiffRightSpansOptions(spans); + api.defaultApi.spansGet(diffRightRun, diffRightWorker).then((data) => { + setDiffRightSpansOptions(data); }); } }, [diffRightRun, diffRightWorker]); // #endregion - // #region - normal - React.useEffect(() => { if (run) { api.defaultApi.viewsGet(run).then((rawViews) => { - const views = rawViews.views - .map((v) => Views[Views[v as Views]]) - .filter(Boolean); + const result = rawViews.views.map((v) => Views[Views[v as Views]]).filter(Boolean); setDeviceTarget(rawViews.device_target); - setViews(views); + setViews(result); }); } }, [run]); @@ -309,8 +287,8 @@ export const App = () => { React.useEffect(() => { if (run && view) { - api.defaultApi.workersGet(run, view).then((workers) => { 
- setWorkers(workers); + api.defaultApi.workersGet(run, view).then((data) => { + setWorkers(data); }); } }, [run, view]); @@ -321,8 +299,8 @@ export const App = () => { React.useEffect(() => { if (run && worker) { - api.defaultApi.spansGet(run, worker).then((spans) => { - setSpans(spans); + api.defaultApi.spansGet(run, worker).then((data) => { + setSpans(data); }); } }, [run, worker]); @@ -334,11 +312,11 @@ export const App = () => { // #endregion // #region - Event Handler - const handleTabChange = (event: React.ChangeEvent<{}>, value: any) => { + const handleTabChange = (event: React.ChangeEvent>, value: any): void => { setSelectedTab(value as number); }; - const handleTopTabChange = (event: React.ChangeEvent<{}>, value: any) => { + const handleTopTabChange = (event: React.ChangeEvent>, value: any): void => { setTopTab(value as number); }; @@ -394,34 +372,40 @@ export const App = () => { setDiffRightSpan(event.target.value as string); }; - const handleDrawerOpen = () => { + const handleDrawerOpen = (): void => { setOpen(true); - SetIframeActive(); + setIframeActive(); }; - const handleDrawerClose = () => { + const handleDrawerClose = (): void => { setOpen(false); - SetIframeActive(); + setIframeActive(); }; - const SetIframeActive = () => { + const setIframeActive = (): void => { iframeRef.current?.focus(); }; - const _changeFileList = (files: FileInfo[]) => { + const _changeFileList = (files: FileInfo[]): void => { if (JSON.stringify(files) !== JSON.stringify(fileList)) { setFileList(files); } }; - const _changeUploadCount = (count: number) => { - setUploadedCount(count); + const _getViews = (viewName: Views): string => { + if (viewName === Views.Kernel) { + return deviceTarget === 'Ascend' ? 
`NPU ${viewNames[viewName]}` : `GPU ${viewNames[viewName]}`; + } else { + return viewNames[viewName]; + } }; - // #endregion + const _changeUploadCount = (count: number): void => { + setUploadedCount(count); + }; // #endregion - const renderContent = () => { - if (!runsLoading && runs.length == 0) { + const renderContent = (): JSX.Element => { + if (!runsLoading && runs.length === 0) { return ( @@ -431,56 +415,30 @@ export const App = () => { ); } - - if (!loaded || !run || !worker || !view || !span) { + const notReady = !loaded || !run || !worker || !view || !span; + if (notReady) { return ; } if (selectedTab === 0) { switch (view) { case Views.Overview: - return ; + return ; case Views.Operator: - return ( - - ); + return ; case Views.Kernel: - return ( - - ); + return ; case Views.Trace: - return ( - - ); + return ; case Views.Distributed: return ; case Views.Memory: - return ( - - ); + return ; case Views.Module: case Views.Lightning: return ; + default: + return <>; } } else { return ( @@ -496,15 +454,15 @@ export const App = () => { } }; - const spanComponent = () => { + const spanComponent = (): JSX.Element => { const spanFragment = ( Spans - + @@ -535,23 +493,16 @@ export const App = () => { [classes.drawerClose]: !open, }), }} - onClick={SetIframeActive} + onClick={setIframeActive} >
- +
- + @@ -559,58 +510,39 @@ export const App = () => { {topTab === 0 ? ( <> - + - {selectedTab == 0 ? ( + {selectedTab === 0 ? ( <> Runs - - + + Views - - + + Workers - - + + @@ -622,35 +554,26 @@ export const App = () => {   Baseline Runs - + {runs.map((item) => ( + {item} ))} Workers - + {diffLeftWorkerOptions.map((worker2) => ( + {worker2} ))} Spans - + {diffLeftSpansOptions.map((span1) => ( + {span1} ))} @@ -660,34 +583,25 @@ export const App = () => {   Experimental Runs - + {runs.map((item) => ( + {item} ))} Workers - + {diffRightWorkerOptions.map((worker3) => ( + {worker3} ))} Spans - + {diffRightSpansOptions.map((span2) => ( + {span2} ))} @@ -695,29 +609,16 @@ export const App = () => { )} ) : ( - + )} {!open && ( - + )}
- {topTab === 0 ? ( - renderContent() - ) : ( - - )} + {topTab === 0 ? renderContent() : }
); diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/AccuracyLeftPanel.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/AccuracyLeftPanel.tsx index ff7b379d178372e02da8c62e652725fbca43fc76..c7b7d7cf0841e7dc3686138b584e101e5052f4a6 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/AccuracyLeftPanel.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/AccuracyLeftPanel.tsx @@ -22,13 +22,7 @@ import { useState, useEffect, useCallback, useRef } from 'react'; import { makeStyles } from '@material-ui/core/styles'; import { Button, Checkbox, Spin, Modal, message } from 'antd'; import { CheckboxChangeEvent } from 'antd/es/checkbox'; -import { - DeleteOutlined, - DownloadOutlined, - ImportOutlined, - SettingOutlined, - WarningTwoTone, -} from '@ant-design/icons'; +import { DeleteOutlined, DownloadOutlined, ImportOutlined, SettingOutlined, WarningTwoTone } from '@ant-design/icons'; import { RegexConfigModal } from './RegexConfigModal'; import { FileInfo } from './entity'; @@ -117,9 +111,7 @@ export const AccuracyLeftPanel: React.FC = (props) => { const [deleteModalVis, setDeleteModalVis] = useState(false); const [fileList, setFileList] = useState([]); const [importSpin, setImportSpin] = useState(false); - const [selectedFile, setSelectedFile] = useState( - undefined - ); + const [selectedFile, setSelectedFile] = useState(undefined); const downLoadRef = useRef(null); const parseFile = (file: FileInfo): FileInfo => { @@ -139,11 +131,7 @@ export const AccuracyLeftPanel: React.FC = (props) => { return file; }; - const parseByTag = ( - line: string, - tag: string, - isLoss: boolean - ): number | null => { + const parseByTag = (line: string, tag: string, isLoss: boolean): number | null => { let pos = line.indexOf(tag); let result: number | null = null; if (pos !== -1) { @@ -160,21 +148,17 @@ export const AccuracyLeftPanel: React.FC = (props) => { result = parseInt(res[0]); } } else 
{ - console.log( - `Found ${ - isLoss ? 'loss' : 'iteration' - } text, but parse value with error: [${line}]` - ); + console.warn(`Found ${isLoss ? 'loss' : 'iteration'} text, but parse value with error: [${line}]`); } } return result; }; - const importFile = () => { + const importFile = (): void => { document.getElementById('accComparisonSelectFile')?.click(); }; - const uploadFile = (e: React.ChangeEvent) => { + const uploadFile = (e: React.ChangeEvent): void => { setImportSpin(true); const file = e.target.files?.[0]; if (file) { @@ -186,9 +170,9 @@ export const AccuracyLeftPanel: React.FC = (props) => { return; } const reader = new FileReader(); - reader.onload = ((selectedFile) => { - return (e) => { - addFile(selectedFile.name.trim(), e.target?.result as string); + reader.onload = ((loadedFile) => { + return (event) => { + addFile(loadedFile.name.trim(), event.target?.result as string); setImportSpin(false); }; })(file); @@ -198,24 +182,23 @@ export const AccuracyLeftPanel: React.FC = (props) => { e.target.value = ''; }; - const addFile = (fileName: string, fileContent: string) => { + const addFile = (fileName: string, fileContent: string): void => { const fileLength = fileName.length; const tempList: FileInfo[] = JSON.parse(JSON.stringify(fileList)); + let updatedFileName = fileName; // 新变量用于存储更新后的文件名 // 上传同名文件加上(1~最大文件数减1)标识 if (!!tempList.find((item) => item.fileName === fileName)) { for (let i = 1; i < MAX_FILE_COUNT; i++) { - let temp = `${fileName.slice(0, fileLength - 4)}(${i})${fileName.slice( - fileLength - 4 - )}`; + let temp = `${fileName.slice(0, fileLength - 4)}(${i})${fileName.slice(fileLength - 4)}`; if (tempList.find((item) => item.fileName === temp) === undefined) { - fileName = temp; + updatedFileName = temp; break; } } } const file: FileInfo = { id: fileList.length, - fileName: fileName, + fileName: updatedFileName, fileContent, checked: true, lossTag: 'loss:', @@ -228,7 +211,7 @@ export const AccuracyLeftPanel: React.FC = (props) => { 
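The `parseByTag` change above swaps `console.log` for `console.warn` when a tag is found but its value cannot be parsed. A self-contained sketch of that parsing shape (the regexes and function name are assumptions; the plugin's actual patterns were stripped from this diff):

```typescript
// Sketch of the parseByTag idea: locate a tag such as "loss:" in a log line
// and parse the number that follows. Losses parse as floats, iteration
// counters as integers. Regexes are assumed, not taken from the plugin.
function parseValueAfterTag(line: string, tag: string, isLoss: boolean): number | null {
  const pos = line.indexOf(tag);
  if (pos === -1) {
    return null; // tag not present in this line
  }
  const rest = line.slice(pos + tag.length);
  const match = isLoss ? rest.match(/-?\d+(?:\.\d+)?/) : rest.match(/\d+/);
  if (match === null) {
    // Mirrors the diff's console.warn branch: tag found, value unparseable.
    console.warn(`Found ${isLoss ? 'loss' : 'iteration'} text, but parse value with error: [${line}]`);
    return null;
  }
  return isLoss ? parseFloat(match[0]) : parseInt(match[0], 10);
}
```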
setFileList(tempList); }; - const exportCsv = (data: FileInfo) => { + const exportCsv = (data: FileInfo): void => { let csvContent = `data:text/csv;charset=utf-8,${data.iterTag},${data.lossTag}\n`; data.losses.forEach((item) => { csvContent += `${item[0]},${item[1]}\n`; @@ -238,23 +221,23 @@ export const AccuracyLeftPanel: React.FC = (props) => { downLoadRef.current?.click(); }; - const onCheckChange = (e: CheckboxChangeEvent, index: number) => { + const onCheckChange = (e: CheckboxChangeEvent, index: number): void => { const tempList: FileInfo[] = JSON.parse(JSON.stringify(fileList)); tempList[index].checked = e.target.checked; setFileList(tempList); }; - const onConfigIconClick = (data: FileInfo) => { + const onConfigIconClick = (data: FileInfo): void => { setSelectedFile(data); setConfigModalVis(true); }; - const onDeleteIconClick = (data: FileInfo) => { + const onDeleteIconClick = (data: FileInfo): void => { setSelectedFile(data); setDeleteModalVis(true); }; - const configModalOk = (data: FileInfo) => { + const configModalOk = (data: FileInfo): void => { const tempList = fileList.map((item) => { return item.id === data.id ? parseFile(data) : item; }); @@ -262,11 +245,11 @@ export const AccuracyLeftPanel: React.FC = (props) => { setConfigModalVis(false); }; - const configModalCancel = () => { + const configModalCancel = (): void => { setConfigModalVis(false); }; - const deleteModalOk = () => { + const deleteModalOk = (): void => { const tempList = JSON.parse(JSON.stringify(fileList)); let founded = false; let index = 0; @@ -290,29 +273,14 @@ export const AccuracyLeftPanel: React.FC = (props) => { return fileList.map((item) => { return (
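The `addFile` change above renames a colliding upload by appending `(1)`…`(N-1)` before the extension instead of mutating the `fileName` parameter. A minimal standalone sketch of that dedup rule (function name and the `MAX_FILE_COUNT` value are illustrative; the 4-character extension slice follows the diff):

```typescript
// Sketch of addFile's duplicate-name handling: on collision, insert "(i)"
// before a fixed 4-char extension (e.g. ".log"), trying i = 1..N-1.
const MAX_FILE_COUNT = 6; // assumed limit for this sketch

function dedupeFileName(fileName: string, existingNames: string[]): string {
  if (!existingNames.includes(fileName)) {
    return fileName;
  }
  const stem = fileName.slice(0, fileName.length - 4); // "train" from "train.log"
  const ext = fileName.slice(fileName.length - 4); // ".log"
  for (let i = 1; i < MAX_FILE_COUNT; i++) {
    const candidate = `${stem}(${i})${ext}`;
    if (!existingNames.includes(candidate)) {
      return candidate;
    }
  }
  return fileName; // every suffix taken; caller should reject the upload
}
```

Using a new `updatedFileName` variable, as the diff does, keeps the parameter immutable, which is a common lint requirement.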
- onCheckChange(e, item.id)} - /> + onCheckChange(e, item.id)} /> {item.fileName}
- onConfigIconClick(item)} - /> - exportCsv(item)} - /> - onDeleteIconClick(item)} - /> + onConfigIconClick(item)} /> + exportCsv(item)} /> + onDeleteIconClick(item)} />
); @@ -336,37 +304,25 @@ export const AccuracyLeftPanel: React.FC = (props) => { > Import files - + {renderFileItems()} {configModalVis && ( - + )} setDeleteModalVis(false)} + onCancel={(): void => setDeleteModalVis(false)} onOk={deleteModalOk} width={500} className={classes.deleteModal} >
- + Are you sure to delete "{selectedFile?.fileName}"? diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/ComparisonPanel.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/ComparisonPanel.tsx index 22b00ef81482f756ce9f773edb8412723b9cdae0..500d29764c5209958ba19630ac1d4e08c10f24a5 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/ComparisonPanel.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/ComparisonPanel.tsx @@ -100,9 +100,7 @@ export const ComparisonPanel: React.FC = (props) => { const [selectedFiles, setSelectedFiles] = useState([]); const [compareWay, setCompareWay] = useState(0); const [pageSize, setPageSize] = useState(20); - const [lineData, setLineData] = useState( - undefined - ); + const [lineData, setLineData] = useState(undefined); const [tableData, setTableData] = useState([]); const chartRef = useRef(null); @@ -129,7 +127,7 @@ export const ComparisonPanel: React.FC = (props) => { return columns; }; - const compareFile = (fileNames: string[]) => { + const compareFile = (fileNames: string[]): void => { if (fileNames.length < 2) { return; } @@ -137,14 +135,8 @@ export const ComparisonPanel: React.FC = (props) => { const expFile = fileList.find((item) => item.fileName === fileNames[1]); if (!!baseFile && !!expFile) { const commonIters: number[] = []; - const lessIters = - baseFile.iters.length <= expFile.iters.length - ? baseFile.iters - : expFile.iters; - const moreIters = - baseFile.iters.length > expFile.iters.length - ? baseFile.iters - : expFile.iters; + const lessIters = baseFile.iters.length <= expFile.iters.length ? baseFile.iters : expFile.iters; + const moreIters = baseFile.iters.length > expFile.iters.length ? 
baseFile.iters : expFile.iters; lessIters.forEach((iter) => { if (moreIters.includes(iter)) { commonIters.push(iter); @@ -168,35 +160,40 @@ export const ComparisonPanel: React.FC = (props) => { }); tempChartData.normal.push([iter, expLoss - baseLoss]); tempChartData.absolute.push([iter, Math.abs(expLoss - baseLoss)]); - tempChartData.relative.push([ - iter, - baseLoss === 0 ? 0 : Math.abs(expLoss - baseLoss) / baseLoss, - ]); + tempChartData.relative.push([iter, baseLoss === 0 ? 0 : Math.abs(expLoss - baseLoss) / baseLoss]); }); setTableData(tempTableData); setLineData(tempChartData); } }; - const onSelectChange = (value: string[]) => { + const onSelectChange = (value: string[]): void => { setSelectedFiles(value); compareFile(value); }; - const onRadioChange = (e: RadioChangeEvent) => { + const onRadioChange = (e: RadioChangeEvent): void => { setCompareWay(e.target.value); }; - const onShowSizeChange = (current: number, size: number) => { + const onShowSizeChange = (current: number, size: number): void => { setPageSize(size); }; useLayoutEffect(() => { const element = chartRef.current; if (!element || !lineData) { - return; + return undefined; } const echart = echarts.init(element); + let dataSource: number[][] = []; + if (compareWay === 0) { + dataSource = lineData.normal; + } else if (compareWay === 1) { + dataSource = lineData.absolute; + } else { + dataSource = lineData.relative; + } const option: echarts.EChartsOption = { title: { text: 'Comparison Chart', @@ -224,12 +221,7 @@ export const ComparisonPanel: React.FC = (props) => { type: 'inside', }, dataset: { - source: - compareWay === 0 - ? lineData.normal - : compareWay === 1 - ? 
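The `compareFile` logic above walks the shorter iteration list and keeps values present in the longer one, so two runs are compared only at common steps. A standalone sketch (function name is illustrative; a `Set` replaces the diff's O(n·m) `includes` scan without changing the result):

```typescript
// Sketch of compareFile's iteration matching: intersect two iteration
// lists, iterating the shorter one as the diff does.
function commonIterations(baseIters: number[], expIters: number[]): number[] {
  const lessIters = baseIters.length <= expIters.length ? baseIters : expIters;
  const moreIters = baseIters.length > expIters.length ? baseIters : expIters;
  // Set membership lookup avoids a linear scan per element.
  const moreSet = new Set(moreIters);
  return lessIters.filter((iter) => moreSet.has(iter));
}
```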
lineData.absolute - : lineData.relative, + source: dataSource, }, series: { type: 'line', @@ -238,7 +230,9 @@ export const ComparisonPanel: React.FC = (props) => { }, }; - option && echart.setOption(option, true); + if (option) { + echart.setOption(option, true); + } return () => { echart.dispose(); }; @@ -272,9 +266,7 @@ export const ComparisonPanel: React.FC = (props) => { return { value: file.fileName, label: file.fileName, - disabled: - !selectedFiles.includes(file.fileName) && - selectedFiles.length > 1, + disabled: !selectedFiles.includes(file.fileName) && selectedFiles.length > 1, }; })} /> @@ -294,8 +286,7 @@ export const ComparisonPanel: React.FC = (props) => { Absolute: The absolute value of difference.
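The refactor above replaces a nested ternary with an if/else chain selecting among the `normal`, `absolute`, and `relative` series. The three series themselves are computed per iteration as sketched here (a standalone illustration of the diff's math, including its zero-baseline guard):

```typescript
// Sketch of the three comparison modes built in compareFile: raw
// difference, absolute difference, and relative difference with a
// division-by-zero guard mapping a zero baseline to 0.
interface LossDiff {
  normal: number;
  absolute: number;
  relative: number;
}

function lossDiff(baseLoss: number, expLoss: number): LossDiff {
  const delta = expLoss - baseLoss;
  return {
    normal: delta,
    absolute: Math.abs(delta),
    relative: baseLoss === 0 ? 0 : Math.abs(delta) / baseLoss,
  };
}
```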
- Relative: The absolute value of difference divided by the - loss value of the first file. + Relative: The absolute value of difference divided by the loss value of the first file.
} @@ -304,10 +295,7 @@ export const ComparisonPanel: React.FC = (props) => { {selectedFiles.length < 2 ? ( - + ) : (
diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossComparison.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossComparison.tsx index 3a7288a1029d526490ff2382e203be1ebe19e2ab..fcf1c67953460c367246ee7a0374c7e6676ffda4 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossComparison.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossComparison.tsx @@ -58,7 +58,7 @@ export const LossComparison: React.FC = (props) => { const { fileList, fileCount } = props; const classes = useStyles(); - const onImportFile = () => { + const onImportFile = (): void => { if (fileCount >= MAX_FILE_COUNT) { message.warn(`You can import no more than ${MAX_FILE_COUNT} files.`); return; @@ -72,8 +72,7 @@ export const LossComparison: React.FC = (props) => { <> Welcome to loss comparison
- Select left files or{' '} - Import files + Select left files or Import files
diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossDisplayPanel.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossDisplayPanel.tsx index e6ee88ff802c08cb58097ace50b132478281e0ed..87190756e58bb951c30edd381b512ce4c91d1afc 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossDisplayPanel.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/LossDisplayPanel.tsx @@ -122,14 +122,14 @@ export const LossDisplayPanel: React.FC = (props) => { return dataSource; }; - const onShowSizeChange = (current: number, size: number) => { + const onShowSizeChange = (current: number, size: number): void => { setPageSize(size); }; useLayoutEffect(() => { const element = chartRef.current; if (!element) { - return; + return undefined; } const echart = echarts.init(element); const dataset: echarts.DatasetComponentOption[] = []; @@ -170,7 +170,7 @@ export const LossDisplayPanel: React.FC = (props) => { }, formatter: (name) => { // Show ellipsis and set tooltip for legends with too long name - return name.length > 50 ? name.slice(0, 48) + '...' : name; + return name.length > 50 ? `${name.slice(0, 48)}...` : name; }, }, xAxis: { @@ -191,7 +191,9 @@ export const LossDisplayPanel: React.FC = (props) => { series, }; - option && echart.setOption(option, true); + if (option) { + echart.setOption(option, true); + } return () => { echart.dispose(); @@ -209,11 +211,8 @@ export const LossDisplayPanel: React.FC = (props) => { dataSource={getTableData()} size='small' scroll={{ - x: 150 * fileList.length + 100, - y: - fileList.length < 2 - ? 'calc(100vh - 240px)' - : 'calc(50vh - 185px)', + x: (150 * fileList.length) + 100, + y: fileList.length < 2 ? 
'calc(100vh - 240px)' : 'calc(50vh - 185px)', }} pagination={{ pageSize, diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/RegexConfigModal.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/RegexConfigModal.tsx index 146a1bc53b11d7e5c46e87c28a322db04b4161d4..456fa2d9a2b6359ed5f03ad5763ef17373544ea4 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/RegexConfigModal.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Accuracy/RegexConfigModal.tsx @@ -63,15 +63,15 @@ export const RegexConfigModal: React.FC = (props) => { const [lossTag, setLossTag] = useState(props.file.lossTag); const [iterTag, setIterTag] = useState(props.file.iterTag); - const lossTagChange = (e: React.ChangeEvent) => { + const lossTagChange = (e: React.ChangeEvent): void => { setLossTag(e.target.value); }; - const iterTagChange = (e: React.ChangeEvent) => { + const iterTagChange = (e: React.ChangeEvent): void => { setIterTag(e.target.value); }; - const configModalOk = () => { + const configModalOk = (): void => { if (lossTag.trim() === '') { message.warning('Loss Tag cannot be empty or only spaces!'); return; diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DataLoading.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DataLoading.tsx index 6e685efb4057a430225401646ef3b8abf79b0ee2..3c5d353ce641c409b51a7aaef8c00ff2f57df6e8 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DataLoading.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DataLoading.tsx @@ -6,11 +6,11 @@ import * as React from 'react'; import { FullCircularProgress } from './FullCircularProgress'; interface IProps { - value: T | undefined | null; + value?: T | null; children: (t: T) => JSX.Element; } -export function DataLoading(props: IProps) { +export function DataLoading(props: IProps): JSX.Element { if (props.value === undefined || props.value === null) { return ; } diff --git 
a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DiffOverview.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DiffOverview.tsx index 8b20682cb13a6a3dfbd7fde9cc79807a4d589986..ed029d5020ed1eaf8caea159b25d33c7a5ad03e3 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DiffOverview.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DiffOverview.tsx @@ -52,12 +52,12 @@ const useStyles = makeStyles((theme) => ({ }, })); -const getAngleByDataLength = (data: number) => { +const getAngleByDataLength = (data: number): number => { if (data < 10) { return 0; } else { // 数量越大越趋近于旋转90度 - return 90 * (1 - 10 / data); + return 90 * (1 - (10 / data)); } }; @@ -70,46 +70,17 @@ export interface DiffStepChartIProps { rawData: any[]; } -const DiffColumnChart: React.FC = ( - props: DiffColumnChartIProps -) => { +const DiffColumnChart: React.FC = (props: DiffColumnChartIProps) => { const { rawData, selectCallback } = props; const graphRef = React.useRef(null); const [resizeEventDependency] = useResizeEventDependency(); React.useLayoutEffect(() => { const element = graphRef.current; - if (!element) return; - - let left_duration_data: number[] = []; - let left_accumulated_duration_data: number[] = []; - - let right_duration_data: number[] = []; - let right_accumulated_duration_data: number[] = []; - - for (let i = 0; i < rawData.length; i++) { - let curr = rawData[i]; - left_duration_data.push(curr[1]); - right_duration_data.push(curr[2]); - left_accumulated_duration_data.push(curr[3]); - right_accumulated_duration_data.push(curr[4]); + if (!element) { + return undefined; } - let left_duration_max = Math.max(...left_duration_data); - let right_duration_max = Math.max(...right_duration_data); - let duration_max = Math.max(left_duration_max, right_duration_max); - - let left_accumulated_duration_max = Math.max( - ...left_accumulated_duration_data - ); - let right_accumulated_duration_max = Math.max( - ...right_accumulated_duration_data 
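The `getAngleByDataLength` change above only adds explicit grouping parentheses; the behavior is unchanged. As a standalone copy it shows the intent: with few bars the x-axis labels stay horizontal, and as the bar count grows the label rotation approaches, but never reaches, 90 degrees.

```typescript
// Copy of getAngleByDataLength from the diff: label rotation tends toward
// 90 degrees as the number of chart items grows (90 * (1 - 10/n)).
const getAngleByDataLength = (data: number): number => {
  if (data < 10) {
    return 0;
  }
  return 90 * (1 - (10 / data));
};
```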
- ); - let accumulated_max = Math.max( - left_accumulated_duration_max, - right_accumulated_duration_max - ); - const chart = echarts.init(element); const options: echarts.EChartsOption = { @@ -124,12 +95,8 @@ const DiffColumnChart: React.FC = ( trigger: 'axis', formatter: function (params: any) { const index = params[0].name.indexOf('@'); - const safeName = params[0].name - .replace(//g, '>'); - var res = `${ - index > -1 ? safeName.slice(index + 1) : safeName - }
`; + const safeName = params[0].name.replace(//g, '>'); + let res = `${index > -1 ? safeName.slice(index + 1) : safeName}
`; for (const item of params) { if (typeof item.value[item.encode.y[0]] === 'number') { res += ` = ( ], }; - options && chart.setOption(options, true); + if (options) { + chart.setOption(options, true); + } return () => { chart.dispose(); }; @@ -395,7 +358,6 @@ let columnTableDataSourceStack: TableRow[][] = []; export const DiffOverview: React.FC = (props: IProps) => { // #region - Constant - const COMPOSITE_NODES_NAME = 'CompositeNodes'; const hostDurationColumns = [ @@ -403,29 +365,41 @@ export const DiffOverview: React.FC = (props: IProps) => { title: 'Baseline Host Duration (us)', dataIndex: 'baselineHostDuration', key: 'baselineHostDuration', - sorter: (a: TableRow, b: TableRow) => - a.baselineHostDuration - b.baselineHostDuration, + sorter: (a: TableRow, b: TableRow): number => { + const aBaselineHost = a.baselineHostDuration ?? 0; + const bBaselineHost = b.baselineHostDuration ?? 0; + return aBaselineHost - bBaselineHost; + }, }, { title: 'Exp Host Duration (us)', dataIndex: 'expHostDuration', key: 'expHostDuration', - sorter: (a: TableRow, b: TableRow) => - a.expHostDuration - b.expHostDuration, + sorter: (a: TableRow, b: TableRow): number => { + const aExpHost = a.expHostDuration ?? 0; + const bExpHost = b.expHostDuration ?? 0; + return aExpHost - bExpHost; + }, }, { title: 'Delta Host Duration (us)', dataIndex: 'deltaHostDuration', key: 'deltaHostDuration', - sorter: (a: TableRow, b: TableRow) => - a.deltaHostDuration! - b.deltaHostDuration!, + sorter: (a: TableRow, b: TableRow): number => { + const aDeltaHost = a.deltaHostDuration ?? 0; + const bDeltaHost = b.deltaHostDuration ?? 0; + return aDeltaHost - bDeltaHost; + }, }, { title: 'Delta Host Duration%', dataIndex: 'deltaHostDurationPercent', key: 'deltaHostDurationPercent', - sorter: (a: TableRow, b: TableRow) => - a.deltaHostDurationPercentNumber! - b.deltaHostDurationPercentNumber!, + sorter: (a: TableRow, b: TableRow): number => { + const aPercent = a.deltaHostDurationPercentNumber ?? 
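The tooltip formatter above escapes the operator name before embedding it in echarts tooltip HTML, but the `replace` patterns were mangled when this diff was captured (`replace(//g, '>')`). This sketch assumes the conventional `<` → `&lt;` and `>` → `&gt;` escaping; treat the exact patterns as an assumption, not the plugin source:

```typescript
// Assumed reconstruction of the tooltip-name escaping: HTML-escape angle
// brackets so operator names like "aten::<lambda>" render literally
// inside echarts tooltip HTML instead of being parsed as tags.
const escapeTooltipName = (name: string): string => name.replace(/</g, '&lt;').replace(/>/g, '&gt;');
```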
0; + const bPercent = b.deltaHostDurationPercentNumber ?? 0; + return aPercent - bPercent; + }, }, ]; @@ -434,30 +408,33 @@ export const DiffOverview: React.FC = (props: IProps) => { title: 'Baseline Self Host Duration (us)', dataIndex: 'baselineSelfHostDuration', key: 'baselineSelfHostDuration', - sorter: (a: TableRow, b: TableRow) => - a.baselineSelfHostDuration - b.baselineSelfHostDuration, + sorter: (a: TableRow, b: TableRow): number => a.baselineSelfHostDuration - b.baselineSelfHostDuration, }, { title: 'Exp Self Host Duration (us)', dataIndex: 'expSelfHostDuration', key: 'expSelfHostDuration', - sorter: (a: TableRow, b: TableRow) => - a.expSelfHostDuration - b.expSelfHostDuration, + sorter: (a: TableRow, b: TableRow): number => a.expSelfHostDuration - b.expSelfHostDuration, }, { title: 'Delta Self Host Duration (us)', dataIndex: 'deltaSelfHostDuration', key: 'deltaSelfHostDuration', - sorter: (a: TableRow, b: TableRow) => - a.deltaSelfHostDuration! - b.deltaSelfHostDuration!, + sorter: (a: TableRow, b: TableRow): number => { + const aDeltaSelfHost = a.deltaSelfHostDuration ?? 0; + const bDeltaSelfHost = b.deltaSelfHostDuration ?? 0; + return aDeltaSelfHost - bDeltaSelfHost; + }, }, { title: 'Delta Self Host Duration%', dataIndex: 'deltaSelfHostDurationPercent', key: 'deltaSelfHostDurationPercent', - sorter: (a: TableRow, b: TableRow) => - a.deltaSelfHostDurationPercentNumber! - - b.deltaSelfHostDurationPercentNumber!, + sorter: (a: TableRow, b: TableRow): number => { + const aSelfPercent = a.deltaSelfHostDurationPercentNumber ?? 0; + const bSelfPercent = b.deltaSelfHostDurationPercentNumber ?? 
0; + return aSelfPercent - bSelfPercent; + }, }, ]; @@ -466,30 +443,33 @@ export const DiffOverview: React.FC = (props: IProps) => { title: 'Baseline Device Duration (us)', dataIndex: 'baselineDeviceDuration', key: 'baselineDeviceDuration', - sorter: (a: TableRow, b: TableRow) => - a.baselineDeviceDuration - b.baselineDeviceDuration, + sorter: (a: TableRow, b: TableRow): number => a.baselineDeviceDuration - b.baselineDeviceDuration, }, { title: 'Exp Device Duration (us)', dataIndex: 'expDeviceDuration', key: 'expDeviceDuration', - sorter: (a: TableRow, b: TableRow) => - a.expDeviceDuration - b.expDeviceDuration, + sorter: (a: TableRow, b: TableRow): number => a.expDeviceDuration - b.expDeviceDuration, }, { title: 'Delta Device Duration (us)', dataIndex: 'deltaDeviceDuration', key: 'deltaDeviceDuration', - sorter: (a: TableRow, b: TableRow) => - a.deltaDeviceDuration! - b.deltaDeviceDuration!, + sorter: (a: TableRow, b: TableRow): number => { + const aDeltaDeviceDuration = a.deltaDeviceDuration ?? 0; + const bdeltaDeviceDuration = b.deltaDeviceDuration ?? 0; + return aDeltaDeviceDuration - bdeltaDeviceDuration; + }, }, { title: 'Delta Device Duration%', dataIndex: 'deltaDeviceDurationPercent', key: 'deltaDeviceDurationPercent', - sorter: (a: TableRow, b: TableRow) => - a.deltaDeviceDurationPercentNumber! - - b.deltaDeviceDurationPercentNumber!, + sorter: (a: TableRow, b: TableRow): number => { + const aDeltaDeviceDurationPercentNumber = a.deltaDeviceDurationPercentNumber ?? 0; + const bDeltaDeviceDurationPercentNumber = b.deltaDeviceDurationPercentNumber ?? 
0; + return aDeltaDeviceDurationPercentNumber - bDeltaDeviceDurationPercentNumber; + }, }, ]; @@ -498,34 +478,40 @@ export const DiffOverview: React.FC = (props: IProps) => { title: 'Baseline Self Device Duration (us)', dataIndex: 'baselineSelfDeviceDuration', key: 'baselineSelfDeviceDuration', - sorter: (a: TableRow, b: TableRow) => - a.baselineSelfDeviceDuration - b.baselineSelfDeviceDuration, + sorter: (a: TableRow, b: TableRow): number => a.baselineSelfDeviceDuration - b.baselineSelfDeviceDuration, }, { title: 'Exp Self Device Duration (us)', dataIndex: 'expSelfDeviceDuration', key: 'expSelfDeviceDuration', - sorter: (a: TableRow, b: TableRow) => - a.expSelfDeviceDuration - b.expSelfDeviceDuration, + sorter: (a: TableRow, b: TableRow): number => a.expSelfDeviceDuration - b.expSelfDeviceDuration, }, { title: 'Delta Self Device Duration (us)', dataIndex: 'deltaSelfDeviceDuration', key: 'deltaSelfDeviceDuration', - sorter: (a: TableRow, b: TableRow) => - a.deltaSelfDeviceDuration! - b.deltaSelfDeviceDuration!, + sorter: (a: TableRow, b: TableRow): number => { + const aDeltaSelfDeviceDuration = a.deltaSelfDeviceDuration ?? 0; + const bDeltaSelfDeviceDuration = b.deltaSelfDeviceDuration ?? 0; + return aDeltaSelfDeviceDuration - bDeltaSelfDeviceDuration; + }, }, { title: 'Delta Self Device Duration%', dataIndex: 'deltaSelfDeviceDurationPercent', key: 'deltaSelfDeviceDurationPercent', - sorter: (a: TableRow, b: TableRow) => - a.deltaSelfDeviceDurationPercentNumber! - - b.deltaSelfDeviceDurationPercentNumber!, + sorter: (a: TableRow, b: TableRow): number => { + const aDeltaSelfDeviceDurationPercentNumber = a.deltaSelfDeviceDurationPercentNumber ?? 0; + const bDeltaSelfDeviceDurationPercentNumber = b.deltaSelfDeviceDurationPercentNumber ?? 
0; + return aDeltaSelfDeviceDurationPercentNumber - bDeltaSelfDeviceDurationPercentNumber; + }, }, ]; - type IColumnMapType = { [key: string]: any }; + interface IColumnMap { + [key: string]: any; + } + type IColumnMapType = IColumnMap; const tableSourceColumnMap: IColumnMapType = { selfHostDuration: selfHostDurationColumns, @@ -539,33 +525,31 @@ export const DiffOverview: React.FC = (props: IProps) => { title: 'Operator', dataIndex: 'operator', key: 'operator', - sorter: (a: TableRow, b: TableRow) => - a.operator.localeCompare(b.operator), + sorter: (a: TableRow, b: TableRow) => a.operator.localeCompare(b.operator), }, { title: 'Baseline Calls', dataIndex: 'baselineCalls', key: 'baselineCalls', - sorter: (a: TableRow, b: TableRow) => a.baselineCalls! - b.baselineCalls!, + sorter: (a: TableRow, b: TableRow) => a.baselineCalls ?? 0 - (b.baselineCalls ?? 0), }, { title: 'Exp Calls', dataIndex: 'expCalls', key: 'expCalls', - sorter: (a: TableRow, b: TableRow) => a.expCalls! - b.expCalls!, + sorter: (a: TableRow, b: TableRow) => a.expCalls ?? 0 - (b.expCalls ?? 0), }, { title: 'Delta Calls', dataIndex: 'deltaCalls', key: 'deltaCalls', - sorter: (a: TableRow, b: TableRow) => a.deltaCalls! - b.deltaCalls!, + sorter: (a: TableRow, b: TableRow) => a.deltaCalls ?? 0 - (b.deltaCalls ?? 0), }, { title: 'Delta Calls%', dataIndex: 'deltaCallsPercent', key: 'deltaCallsPercent', - sorter: (a: TableRow, b: TableRow) => - a.deltaCallsPercentNumber! - b.deltaCallsPercentNumber!, + sorter: (a: TableRow, b: TableRow) => a.deltaCallsPercentNumber ?? 0 - (b.deltaCallsPercentNumber ?? 
0), }, ]; @@ -575,21 +559,18 @@ export const DiffOverview: React.FC = (props: IProps) => { const [tableDataSource, setTableDataSource] = React.useState([]); const { run, worker, span, expRun, expWorker, expSpan } = props; - const [columnUnderlyingData, setColumnUnderlyingData] = React.useState< - ColumnUnderlyingData[] - >([]); + const [columnUnderlyingData, setColumnUnderlyingData] = React.useState([]); - const [rootUnderlyingData, setRootUnderlyingData] = - React.useState(); + const [rootUnderlyingData, setRootUnderlyingData] = React.useState(); const [columnChartData, setColumnChartData] = React.useState([]); const [stepChartData, setStepChartData] = React.useState([]); - const [selectedTableColumnsOptions, setSelectedTableColumnsOptions] = - React.useState<[key: string]>(['hostDuration']); - const [selectedTableColumns, setSelectedTableColumns] = React.useState( - [...baseTableColumns, ...hostDurationColumns] - ); + const [selectedTableColumnsOptions, setSelectedTableColumnsOptions] = React.useState<[key: string]>(['hostDuration']); + const [selectedTableColumns, setSelectedTableColumns] = React.useState([ + ...baseTableColumns, + ...hostDurationColumns, + ]); const [dataStackLevel, setDataStackLevel] = React.useState(0); const [loading, setLoading] = React.useState(false); @@ -598,7 +579,7 @@ export const DiffOverview: React.FC = (props: IProps) => { const classes = useStyles(); // #region - Event Handler - const handleChartColumnSelect = (row: number, column: number) => { + const handleChartColumnSelect = (row: number, column: number): void => { if (columnUnderlyingData.length === 0) { return; } @@ -608,29 +589,19 @@ export const DiffOverview: React.FC = (props: IProps) => { return; } - let tableDataSource = generateDataSourceFromUnderlyingData( - selectedUnderlyingData - ); - setTableDataSource(tableDataSource); - columnTableDataSourceStack.push(tableDataSource); + let tableDataSource1 = generateDataSourceFromUnderlyingData(selectedUnderlyingData); + 
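One caution on the calls sorters above: they are written as `a.baselineCalls ?? 0 - (b.baselineCalls ?? 0)` without parentheses around the left operand. Because `-` binds tighter than `??`, that parses as `a ?? (0 - b)`, so whenever the left value is defined the subtraction never runs; the parenthesized form used by the duration sorters is presumably what was intended. A runnable demonstration of the difference:

```typescript
// Demonstrates the operator-precedence pitfall in the diff's calls sorters.
// suspectSorter mirrors the unparenthesized form; intendedSorter mirrors
// the parenthesized form used by the duration-column sorters.
const suspectSorter = (a?: number, b?: number): number => a ?? 0 - (b ?? 0);
const intendedSorter = (a?: number, b?: number): number => (a ?? 0) - (b ?? 0);
```

With both operands defined, `suspectSorter(3, 5)` returns `3` (the left operand, not a difference), while `intendedSorter(3, 5)` returns `-2`, so table sorting on the unparenthesized comparator is effectively broken for defined values.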
setTableDataSource(tableDataSource1); + columnTableDataSourceStack.push(tableDataSource1); setLoading(true); api.defaultApi - .diffnodeGet( - run, - worker, - span, - expRun, - expWorker, - expSpan, - selectedUnderlyingData.path - ) + .diffnodeGet(run, worker, span, expRun, expWorker, expSpan, selectedUnderlyingData.path) .then((resp) => handleDiffNodeResp(resp)) .finally(() => setLoading(false)); }; - const handleGoBack = () => { + const handleGoBack = (): void => { if (columnChartDataStack.length > 1) { columnChartDataStack.pop(); let top = columnChartDataStack[columnChartDataStack.length - 1]; @@ -651,23 +622,20 @@ export const DiffOverview: React.FC = (props: IProps) => { if (columnTableDataSourceStack.length > 0) { columnTableDataSourceStack.pop(); - let top = - columnTableDataSourceStack[columnTableDataSourceStack.length - 1]; + let top = columnTableDataSourceStack[columnTableDataSourceStack.length - 1]; if (top) { setTableDataSource(top); } else { - let tableDataSource = generateDataSourceFromUnderlyingData( - rootUnderlyingData! 
- ); - setTableDataSource(tableDataSource); + let tableDataSource2 = generateDataSourceFromUnderlyingData(rootUnderlyingData); + setTableDataSource(tableDataSource2); } } setDataStackLevel(dataStackLevel - 1); }; - const toPercentString = (percentNumber: number) => { + const toPercentString = (percentNumber: number): string => { if (isNaN(percentNumber)) { return 'N/A'; } @@ -675,44 +643,37 @@ export const DiffOverview: React.FC = (props: IProps) => { return `${percentNumber.toFixed(2)}%`; }; - const handleColumnSelectionChange = (value: [key: string]) => { + const handleColumnSelectionChange = (value: [key: string]): void => { let columns = value.map((x) => tableSourceColumnMap[x]).flat(); let r = [...baseTableColumns, ...columns]; setSelectedTableColumnsOptions(value); setSelectedTableColumns(r); }; - const generateDataSourceFromUnderlyingData = ( - selectedUnderlyingData: ColumnUnderlyingData - ) => { - let tableDataSource: TableRow[] = []; + const generateDataSourceFromUnderlyingData = (selectedUnderlyingData?: ColumnUnderlyingData): TableRow[] => { + if (!selectedUnderlyingData) { + return []; + } + let newTableDataSource: TableRow[] = []; for (let i = 0; i < selectedUnderlyingData.leftAggs.length; i++) { let left = selectedUnderlyingData.leftAggs[i]; let right = selectedUnderlyingData.rightAggs[i]; - let deltaCallsPercentNumber = - ((right.calls - left.calls) / left.calls) * 100; + let deltaCallsPercentNumber = ((right.calls - left.calls) / left.calls) * 100; - let deltaHostDurationPercentNumber = - ((right.host_duration - left.host_duration) / left.host_duration) * 100; + let deltaHostDurationPercentNumber = ((right.host_duration - left.host_duration) / left.host_duration) * 100; let deltaSelfHostDurationPercentNumber = - ((right.self_host_duration - left.self_host_duration) / - left.self_host_duration) * - 100; + ((right.self_host_duration - left.self_host_duration) / left.self_host_duration) * 100; let deltaDeviceDurationPercentNumber = - 
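The `toPercentString` helper above, combined with the delta-percent computation in `generateDataSourceFromUnderlyingData`, formats the relative change of an experimental value against its baseline. A standalone sketch (`deltaPercent` is an illustrative wrapper, not a plugin function; note the `N/A` branch only catches `NaN`, e.g. a 0/0 baseline, not `Infinity`):

```typescript
// Sketch of the diff's percent formatting: percentage change of exp vs.
// baseline, rendered with two decimals, or "N/A" when the math yields NaN.
function toPercentString(percentNumber: number): string {
  if (isNaN(percentNumber)) {
    return 'N/A';
  }
  return `${percentNumber.toFixed(2)}%`;
}

// Illustrative wrapper combining the delta computation with the formatter.
function deltaPercent(baseline: number, exp: number): string {
  return toPercentString(((exp - baseline) / baseline) * 100);
}
```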
((right.device_duration - left.device_duration) / - left.device_duration) * - 100; + ((right.device_duration - left.device_duration) / left.device_duration) * 100; let deltaSelfDeviceDurationPercentNumber = - ((right.self_device_duration - left.self_device_duration) / - left.self_device_duration) * - 100; + ((right.self_device_duration - left.self_device_duration) / left.self_device_duration) * 100; - tableDataSource.push({ + newTableDataSource.push({ key: i, operator: left.name, baselineCalls: left.calls, @@ -723,59 +684,42 @@ export const DiffOverview: React.FC = (props: IProps) => { baselineHostDuration: left.host_duration, expHostDuration: right.host_duration, - deltaHostDuration: parseFloat( - (right.host_duration - left.host_duration).toFixed(3) - ), + deltaHostDuration: parseFloat((right.host_duration - left.host_duration).toFixed(3)), deltaHostDurationPercentNumber: deltaHostDurationPercentNumber, - deltaHostDurationPercent: toPercentString( - deltaHostDurationPercentNumber - ), + deltaHostDurationPercent: toPercentString(deltaHostDurationPercentNumber), baselineSelfHostDuration: left.self_host_duration, expSelfHostDuration: right.self_host_duration, - deltaSelfHostDuration: parseFloat( - (right.self_host_duration - left.self_host_duration).toFixed(3) - ), + deltaSelfHostDuration: parseFloat((right.self_host_duration - left.self_host_duration).toFixed(3)), deltaSelfHostDurationPercentNumber: deltaSelfHostDurationPercentNumber, - deltaSelfHostDurationPercent: toPercentString( - deltaSelfHostDurationPercentNumber - ), + deltaSelfHostDurationPercent: toPercentString(deltaSelfHostDurationPercentNumber), baselineDeviceDuration: left.device_duration, expDeviceDuration: right.device_duration, - deltaDeviceDuration: parseFloat( - (right.device_duration - left.device_duration).toFixed(3) - ), + deltaDeviceDuration: parseFloat((right.device_duration - left.device_duration).toFixed(3)), deltaDeviceDurationPercentNumber: deltaDeviceDurationPercentNumber, - 
deltaDeviceDurationPercent: toPercentString( - deltaDeviceDurationPercentNumber - ), + deltaDeviceDurationPercent: toPercentString(deltaDeviceDurationPercentNumber), baselineSelfDeviceDuration: left.self_device_duration, expSelfDeviceDuration: right.self_device_duration, - deltaSelfDeviceDuration: parseFloat( - (right.self_device_duration - left.self_device_duration).toFixed(3) - ), - deltaSelfDeviceDurationPercentNumber: - deltaSelfDeviceDurationPercentNumber, - deltaSelfDeviceDurationPercent: toPercentString( - deltaSelfDeviceDurationPercentNumber - ), + deltaSelfDeviceDuration: parseFloat((right.self_device_duration - left.self_device_duration).toFixed(3)), + deltaSelfDeviceDurationPercentNumber: deltaSelfDeviceDurationPercentNumber, + deltaSelfDeviceDurationPercent: toPercentString(deltaSelfDeviceDurationPercentNumber), }); } - return tableDataSource; + return newTableDataSource; }; React.useEffect(() => { - if ( + const hasData = run.length > 0 && worker.length > 0 && span.length > 0 && expRun.length > 0 && expWorker.length > 0 && - expSpan.length > 0 - ) { + expSpan.length > 0; + if (hasData) { setLoading(true); columnChartDataStack = []; @@ -787,18 +731,16 @@ export const DiffOverview: React.FC = (props: IProps) => { .diffnodeGet(run, worker, span, expRun, expWorker, expSpan) .then((resp) => { handleDiffNodeResp(resp); - let rootUnderlyingData = { + let newRootUnderlyingData = { name: 'rootNode', path: resp.path, leftAggs: resp.left.aggs, rightAggs: resp.right.aggs, }; - setRootUnderlyingData(rootUnderlyingData); - let tableDataSource = generateDataSourceFromUnderlyingData( - rootUnderlyingData! 
- ); - setTableDataSource(tableDataSource); + setRootUnderlyingData(newRootUnderlyingData); + let tableDataSource3 = generateDataSourceFromUnderlyingData(newRootUnderlyingData); + setTableDataSource(tableDataSource3); }) .finally(() => setLoading(false)); @@ -806,24 +748,18 @@ export const DiffOverview: React.FC = (props: IProps) => { } }, [run, worker, span, expRun, expWorker, expSpan]); - const handleDiffNodeResp = (resp: any) => { - let columnChartData: any[] = []; - let stepChartData: any[] = []; + const handleDiffNodeResp = (resp: any): void => { + let newColumnChartData: any[] = []; + let newStepChartData: any[] = []; let underlyingData: ColumnUnderlyingData[] = []; - columnChartData.push([ - 'Call', - 'Baseline', - 'Experiment', - 'Baseline Trend', - 'Exp Trend', - ]); - stepChartData.push(['Call', 'Diff', 'Accumulated Diff']); + newColumnChartData.push(['Call', 'Baseline', 'Experiment', 'Baseline Trend', 'Exp Trend']); + newStepChartData.push(['Call', 'Diff', 'Accumulated Diff']); if (resp.children.length > 0) { - let accumulated_left_duration = 0; - let accumulated_right_duration = 0; - let accumulated_step_diff = 0; + let accumulatedLeftDuration = 0; + let accumulatedRightDuration = 0; + let accumulatedStepDiff = 0; for (let i = 0; i < resp.children.length; i++) { let left = resp.children[i].left; let right = resp.children[i].right; @@ -864,12 +800,12 @@ export const DiffOverview: React.FC = (props: IProps) => { currColumn.push(left.total_duration); currColumn.push(right.total_duration); - accumulated_left_duration += left.total_duration; - currColumn.push(accumulated_left_duration); + accumulatedLeftDuration += left.total_duration; + currColumn.push(accumulatedLeftDuration); - accumulated_right_duration += right.total_duration; - currColumn.push(accumulated_right_duration); - columnChartData.push(currColumn); + accumulatedRightDuration += right.total_duration; + currColumn.push(accumulatedRightDuration); + newColumnChartData.push(currColumn); 
underlyingData.push({ name: name, @@ -882,10 +818,10 @@ export const DiffOverview: React.FC = (props: IProps) => { let stepDiff = right.total_duration - left.total_duration; currStep.push(stepDiff); - accumulated_step_diff += stepDiff; - currStep.push(accumulated_step_diff); + accumulatedStepDiff += stepDiff; + currStep.push(accumulatedStepDiff); - stepChartData.push(currStep); + newStepChartData.push(currStep); } } else { let left = resp.left; @@ -904,28 +840,26 @@ export const DiffOverview: React.FC = (props: IProps) => { currColumn.push(left.total_duration); currColumn.push(right.total_duration); - columnChartData.push(currColumn); + newColumnChartData.push(currColumn); currStep.push(name); let stepDiff = right.total_duration - left.total_duration; currStep.push(stepDiff); currStep.push(stepDiff); - stepChartData.push(currStep); + newStepChartData.push(currStep); } - setColumnChartData(columnChartData); - columnChartDataStack.push(columnChartData); + setColumnChartData(newColumnChartData); + columnChartDataStack.push(newColumnChartData); - setStepChartData(stepChartData); - stepChartDataStack.push(stepChartData); + setStepChartData(newStepChartData); + stepChartDataStack.push(newStepChartData); setColumnUnderlyingData(underlyingData); columnUnderlyingDataStack.push(underlyingData); setDataStackLevel(columnChartDataStack.length); - }; - - // #endregion + }; // #endregion if (!loading && columnUnderlyingDataStack.length === 0) { return ( @@ -961,16 +895,11 @@ export const DiffOverview: React.FC = (props: IProps) => { {columnChartData.length > 1 && ( <> - + )} - {columnChartData.length === 1 && ( - No more level to show. - )} + {columnChartData.length === 1 && No more level to show.} @@ -997,18 +926,12 @@ export const DiffOverview: React.FC = (props: IProps) => { -   - +
diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DistributedView.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DistributedView.tsx index 14a760c578347d9b3f937ac8d72b517b626cf04f..096501b61bc9ce41978c65dc24f6b3640ab960f3 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DistributedView.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/DistributedView.tsx @@ -21,10 +21,10 @@ import { DataLoading } from './DataLoading'; import { GpuInfoTable } from './GpuInfoTable'; import { makeChartHeaderRenderer, useTooltipCommonStyles } from './helpers'; import { - DistributedCommopsTableTooltip, - DistributedGpuInfoTableTooltip, - DistributedOverlapGraphTooltip, - DistributedWaittimeGraphTooltip, + distributedCommopsTableTooltip, + distributedGpuInfoTableTooltip, + distributedOverlapGraphTooltip, + distributedWaittimeGraphTooltip, } from './TooltipDescriptions'; export interface IProps { @@ -74,15 +74,9 @@ export const DistributedView: React.FC = (props) => { let { run, worker, span } = props; const classes = useStyles(); - const [overlapGraph, setOverlapGraph] = React.useState< - DistributedGraph | undefined - >(undefined); - const [waittimeGraph, setWaittimeGraph] = React.useState< - DistributedGraph | undefined - >(undefined); - const [commopsTableData, setCommopsTableData] = React.useState< - any | undefined - >(undefined); + const [overlapGraph, setOverlapGraph] = React.useState(undefined); + const [waittimeGraph, setWaittimeGraph] = React.useState(undefined); + const [commopsTableData, setCommopsTableData] = React.useState(undefined); const [gpuInfo, setGpuInfo] = React.useState(undefined); const [commopsTableTitle, setCommopsTableTitle] = React.useState(''); const [commopsWorkers, setCommopsWorkers] = React.useState([]); @@ -145,11 +139,10 @@ export const DistributedView: React.FC = (props) => { setWaittimeStep(event.target.value as string); }; - const getColumnChartData = ( - distributedGraph?: 
DistributedGraph, - step?: string - ) => { - if (!distributedGraph || !step) return undefined; + const getColumnChartData = (distributedGraph?: DistributedGraph, step?: string): any => { + if (!distributedGraph || !step) { + return undefined; + } const barLabels = Object.keys(distributedGraph.data[step]); return { legends: distributedGraph.metadata.legends, @@ -157,31 +150,28 @@ export const DistributedView: React.FC = (props) => { barHeights: barLabels.map((label) => distributedGraph.data[step][label]), }; }; - const overlapData = React.useMemo( - () => getColumnChartData(overlapGraph, overlapStep), - [overlapGraph, overlapStep] - ); + const overlapData = React.useMemo(() => getColumnChartData(overlapGraph, overlapStep), [overlapGraph, overlapStep]); const waittimeData = React.useMemo( () => getColumnChartData(waittimeGraph, waittimeStep), [waittimeGraph, waittimeStep] ); - const getTableData = (tableData?: any, worker?: string) => { - if (!tableData || !worker) { + const getTableData = (tableData?: any, opsWorker?: string): any[] => { + if (!tableData || !opsWorker) { return []; } - let dataInfo: api.Graph = tableData[worker]; - const stringCompare = (a: string, b: string) => a.localeCompare(b); - const numberCompare = (a: number, b: number) => a - b; + let dataInfo: api.Graph = tableData[opsWorker]; + const stringCompare = (a: string, b: string): number => a.localeCompare(b); + const numberCompare = (a: number, b: number): number => a - b; let column: any[] = dataInfo.columns.map((item) => { return { title: item.name, key: item.name, dataIndex: item.name, sorter: - item.type == 'string' - ? (a: any, b: any) => stringCompare(a[item.name], b[item.name]) - : (a: any, b: any) => numberCompare(a[item.name], b[item.name]), + item.type === 'string' + ? 
(a: any, b: any): number => stringCompare(a[item.name], b[item.name]) + : (a: any, b: any): number => numberCompare(a[item.name], b[item.name]), }; }); setColumns(column); @@ -190,8 +180,8 @@ export const DistributedView: React.FC = (props) => { return null; } const dataRow: { [column: string]: number | string } = { key: index }; - dataInfo.columns.forEach((column, index) => { - dataRow[column.name] = row[index] as string | number; + dataInfo.columns.forEach((item, idx) => { + dataRow[item.name] = row[idx] as string | number; }); return dataRow; }); @@ -200,7 +190,7 @@ export const DistributedView: React.FC = (props) => { return getTableData(commopsTableData, commopsWorker); }, [commopsTableData, commopsWorker]); - const onShowSizeChange = (current: number, size: number) => { + const onShowSizeChange = (current: number, size: number): void => { setPageSize(size); }; @@ -213,12 +203,7 @@ export const DistributedView: React.FC = (props) => { {gpuInfo && ( - + @@ -227,7 +212,7 @@ export const DistributedView: React.FC = (props) => { )} - {(chartData) => ( + {(chartData): JSX.Element => ( @@ -235,11 +220,7 @@ export const DistributedView: React.FC = (props) => { Step - {overlapSteps.map((step) => ( {step} ))} @@ -249,23 +230,17 @@ export const DistributedView: React.FC = (props) => { {overlapGraph?.metadata?.title && ( )} - + )} - {(chartData) => ( + {(chartData): JSX.Element => ( @@ -273,11 +248,7 @@ export const DistributedView: React.FC = (props) => { Step - {waittimeSteps.map((step) => ( {step} ))} @@ -287,10 +258,7 @@ export const DistributedView: React.FC = (props) => { {waittimeGraph?.metadata?.title && ( )} = (props) => { - + @@ -317,13 +280,9 @@ export const DistributedView: React.FC = (props) => { Worker - + {commopsWorkers.map((item) => ( + {item} ))} diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/GpuInfoTable.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/GpuInfoTable.tsx index 
dcc54ad02a7677f59c66a3ce7878d4c74f3bf049..07f6f1d78c88abab5f62f844356b47ca517a2561 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/GpuInfoTable.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/GpuInfoTable.tsx @@ -49,46 +49,44 @@ interface TableCellInfo { function makeTableCellInfo(gpuInfo: any): TableCellInfo[][] { const rows: TableCellInfo[][] = []; - let curr_row: TableCellInfo[] = []; - rows.push(curr_row); - Object.keys(gpuInfo.data).forEach(function (node_name) { - const node_cell = { - content: node_name, + let currRow: TableCellInfo[] = []; + rows.push(currRow); + Object.keys(gpuInfo.data).forEach((nodeName) => { + const nodeCell = { + content: nodeName, rowspan: 0, cellType: 'node' as const, }; const i = rows.length; - curr_row.push(node_cell); - Object.keys(gpuInfo.data[node_name]).forEach(function (pid) { - const pid_cell = { content: pid, rowspan: 0, cellType: 'pid' as const }; - const i = rows.length; - curr_row.push(pid_cell); - Object.keys(gpuInfo.data[node_name][pid]).forEach(function (gpu) { - const gpu_cell = { content: gpu, rowspan: 0, cellType: 'gpu' as const }; - const i = rows.length; - curr_row.push(gpu_cell); - Object.keys(gpuInfo.data[node_name][pid][gpu]).forEach(function ( - key_name - ) { - curr_row.push({ - content: key_name, + currRow.push(nodeCell); + Object.keys(gpuInfo.data[nodeName]).forEach((pid) => { + const pidCell = { content: pid, rowspan: 0, cellType: 'pid' as const }; + const j = rows.length; + currRow.push(pidCell); + Object.keys(gpuInfo.data[nodeName][pid]).forEach((gpu) => { + const gpuCell = { content: gpu, rowspan: 0, cellType: 'gpu' as const }; + const k = rows.length; + currRow.push(gpuCell); + Object.keys(gpuInfo.data[nodeName][pid][gpu]).forEach((keyName) => { + currRow.push({ + content: keyName, rowspan: 1, cellType: 'key' as const, }); - const value: string = gpuInfo.data[node_name][pid][gpu][key_name]; - curr_row.push({ + const value: string = 
gpuInfo.data[nodeName][pid][gpu][keyName]; + currRow.push({ content: value, rowspan: 1, cellType: 'value' as const, }); - curr_row = []; - rows.push(curr_row); + currRow = []; + rows.push(currRow); }); - gpu_cell.rowspan = rows.length - i; + gpuCell.rowspan = rows.length - k; }); - pid_cell.rowspan = rows.length - i; + pidCell.rowspan = rows.length - j; }); - node_cell.rowspan = rows.length - i; + nodeCell.rowspan = rows.length - i; }); rows.pop(); return rows; @@ -96,16 +94,13 @@ function makeTableCellInfo(gpuInfo: any): TableCellInfo[][] { export const GpuInfoTable: React.FC = (props) => { const classes = useStyles(); - interface TableCellInfo { + interface TableCellInfoNoLast { content: string; rowspan: number; cellType: 'node' | 'pid' | 'gpu' | 'key' | 'value'; } - const rows = React.useMemo( - () => makeTableCellInfo(props.gpuInfo), - [props.gpuInfo] - ); + const rows = React.useMemo(() => makeTableCellInfo(props.gpuInfo), [props.gpuInfo]); const cellToClass = { node: classes.nodeTd, @@ -115,11 +110,11 @@ export const GpuInfoTable: React.FC = (props) => { value: classes.valueTd, }; - const renderCell = function (info: TableCellInfo) { + const renderCell = function (info: TableCellInfoNoLast): JSX.Element { let cellClass = cellToClass[info.cellType]; - let content = info.cellType == 'key' ? info.content + ':' : info.content; + let content = info.cellType === 'key' ? 
`${info.content}:` : info.content; return ( - ); diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Kernel.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Kernel.tsx index 0e900a2ac88f448396731135afb5315d7bf6c68b..66e05695153a853f68d382a2f3b6a68931861abf 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Kernel.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Kernel.tsx @@ -30,10 +30,7 @@ import Radio from '@material-ui/core/Radio'; import RadioGroup, { RadioGroupProps } from '@material-ui/core/RadioGroup'; import Select, { SelectProps } from '@material-ui/core/Select'; import { makeStyles } from '@material-ui/core/styles'; -import TextField, { - StandardTextFieldProps, - TextFieldProps, -} from '@material-ui/core/TextField'; +import TextField, { StandardTextFieldProps, TextFieldProps } from '@material-ui/core/TextField'; import * as React from 'react'; import * as api from '../api'; import { Graph } from '../api'; @@ -45,9 +42,9 @@ import { PieChart } from './charts/PieChart'; import { DataLoading } from './DataLoading'; import { makeChartHeaderRenderer, useTooltipCommonStyles } from './helpers'; import { - GPUKernelTotalTimeTooltip, - TensorCoresPieChartTooltip, - TensorCoresPieChartTooltipAscend, + gpuKernelTotalTimeTooltip, + tensorCoresPieChartTooltip, + tensorCoresPieChartTooltipAscend, } from './TooltipDescriptions'; export interface IProps { @@ -86,21 +83,17 @@ export const Kernel: React.FC = (props) => { [tooltipCommonClasses] ); - const [kernelGraph, setKernelGraph] = React.useState( - undefined - ); + const [kernelGraph, setKernelGraph] = React.useState(undefined); const [tcGraph, setTcGraph] = React.useState(undefined); - const [kernelTable, setKernelTable] = React.useState( - undefined - ); - const [groupBy, setGroupBy] = React.useState(KernelGroupBy.Kernel); + const [kernelTable, setKernelTable] = React.useState(undefined); + const [groupBy, setGroupBy] = 
React.useState(KernelGroupBy.KERNEL); const [searchKernelName, setSearchKernelName] = React.useState(''); const [searchOpName, setSearchOpName] = React.useState(''); const [sortColumn, setSortColumn] = React.useState(''); const [hasStep, setHasStep] = React.useState(false); const [topText, actualTop, useTop, setTopText, setUseTop] = useTopN({ - defaultUseTop: UseTop.Use, + defaultUseTop: UseTop.USE, defaultTop: 10, }); @@ -118,24 +111,16 @@ export const Kernel: React.FC = (props) => { api.defaultApi.kernelTableGet(run, worker, span, groupBy).then((resp) => { setSortColumn(resp.metadata.sort); setKernelTable(resp.data); - const nameColumnIdx = resp.data.columns.findIndex( - (c) => c.name.toLowerCase() === 'step id' - ); + const nameColumnIdx = resp.data.columns.findIndex((c) => c.name.toLowerCase() === 'step id'); setHasStep(nameColumnIdx > -1); }); }, [run, worker, span, groupBy]); React.useEffect(() => { - api.defaultApi - .kernelGet(run, worker, span, KernelGroupBy.Kernel) - .then((resp) => { - setKernelGraph(resp.total); - setGroupBy( - resp.device_target === 'Ascend' - ? KernelGroupBy.KernelNameAndOpName - : KernelGroupBy.Kernel - ); - }); + api.defaultApi.kernelGet(run, worker, span, KernelGroupBy.KERNEL).then((resp) => { + setKernelGraph(resp.total); + setGroupBy(resp.device_target === 'Ascend' ? KernelGroupBy.KERNEL_NAME_AND_OP_NAME : KernelGroupBy.KERNEL); + }); }, [run, worker, span]); React.useEffect(() => { @@ -144,11 +129,7 @@ export const Kernel: React.FC = (props) => { }); }, [run, worker, span]); - const [searchedKernelTable] = useSearch( - searchKernelName, - 'name', - kernelTable - ); + const [searchedKernelTable] = useSearch(searchKernelName, 'name', kernelTable); const [searchedOpTable] = useSearch( searchOpName, deviceTarget === 'Ascend' ? 
'step id' : 'operator', @@ -171,7 +152,7 @@ export const Kernel: React.FC = (props) => { setUseTop(event.target.value as UseTop); }; - const onTopChanged = (event: React.ChangeEvent) => { + const onTopChanged = (event: React.ChangeEvent): void => { setTopText(event.target.value); }; @@ -180,21 +161,15 @@ export const Kernel: React.FC = (props) => { }; const GPUKernelTotalTimeTitle = React.useMemo( - () => chartHeaderRenderer('Total Time (us)', GPUKernelTotalTimeTooltip), + () => chartHeaderRenderer('Total Time (us)', gpuKernelTotalTimeTooltip), [chartHeaderRenderer] ); const TensorCoresTitle = React.useMemo( () => deviceTarget === 'Ascend' - ? chartHeaderRenderer( - 'Accelerator Core Utilization', - TensorCoresPieChartTooltipAscend - ) - : chartHeaderRenderer( - 'Tensor Cores Utilization', - TensorCoresPieChartTooltip - ), + ? chartHeaderRenderer('Accelerator Core Utilization', tensorCoresPieChartTooltipAscend) + : chartHeaderRenderer('Tensor Cores Utilization', tensorCoresPieChartTooltip), [chartHeaderRenderer, deviceTarget] ); @@ -207,19 +182,11 @@ export const Kernel: React.FC = (props) => { - } - label='All kernels' - /> - } - label='Top kernels to show' - /> + } label='All kernels' /> + } label='Top kernels to show' /> - {useTop === UseTop.Use && ( + {useTop === UseTop.USE && ( = (props) => { - {(graph) => ( + {(graph): JSX.Element => ( - + )} - {(graph) => ( + {(graph): JSX.Element => ( = (props) => { graph={graph} colors={['#0099C6', '#DD4477', '#66AA00', '#B82E2E']} top={actualTop} - tooltip_mode='percentage' + tooltipMode='percentage' /> )} @@ -267,17 +230,11 @@ export const Kernel: React.FC = (props) => { Group By - + + {deviceTarget === 'Ascend' ? 'Statistic' : 'Kernel Properties + Op Name'} - + {deviceTarget === 'Ascend' ? 'All' : 'Kernel Name'} @@ -297,7 +254,7 @@ export const Kernel: React.FC = (props) => { /> {deviceTarget === 'Ascend' - ? groupBy === KernelGroupBy.Kernel && + ? 
groupBy === KernelGroupBy.KERNEL && hasStep && ( = (props) => { /> ) - : groupBy === KernelGroupBy.KernelNameAndOpName && ( + : groupBy === KernelGroupBy.KERNEL_NAME_AND_OP_NAME && ( = (props) => { - {(graph) => ( - - )} + {(graph): JSX.Element => } diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/MemoryView.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/MemoryView.tsx index b90b410cbc24f807ec3b533c40af60ff6d0defb7..225f28a931e969d7cfd40d3f490e7cb45c64a305 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/MemoryView.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/MemoryView.tsx @@ -102,39 +102,29 @@ export const MemoryView: React.FC = React.memo((props) => { const { run, worker, span, deviceTarget } = props; const classes = useStyles(); - const [memoryStatsData, setMemoryStatsData] = React.useState< - MemoryStatsData | undefined - >(undefined); + const [memoryStatsData, setMemoryStatsData] = React.useState(undefined); // for backward compatability, old profile do not have events to show - const showEvents = () => { - return memoryEventsData && Object.keys(memoryEventsData.rows).length != 0; + const showEvents = (): boolean | undefined => { + return memoryEventsData && Object.keys(memoryEventsData.rows).length !== 0; }; - const [memoryEventsData, setMemoryEventsData] = React.useState< - MemoryEventsData | undefined - >(undefined); + const [memoryEventsData, setMemoryEventsData] = React.useState(undefined); // for backward compatability, old profile do not have curve to show - const showCurve = () => { - return memoryCurveData && Object.keys(memoryCurveData.rows).length != 0; + const showCurve = (): boolean | undefined => { + return memoryCurveData && Object.keys(memoryCurveData.rows).length !== 0; }; - const [memoryCurveData, setMemoryCurveData] = React.useState< - MemoryCurveData | MemoryCurveDataAscend | undefined - >(undefined); + const [memoryCurveData, setMemoryCurveData] = React.useState( 
+ undefined + ); - const [lineChartData, setLineChartData] = React.useState< - Graph | GraphAscend | undefined - >(undefined); + const [lineChartData, setLineChartData] = React.useState(undefined); const [devices, setDevices] = React.useState([]); const [device, setDevice] = React.useState(''); const [tag, setTag] = React.useState('Operator'); - const memoryCurveDataAllRef = React.useRef( - undefined - ); - const memoryEventDataAllRef = React.useRef( - undefined - ); + const memoryCurveDataAllRef = React.useRef(undefined); + const memoryEventDataAllRef = React.useRef(undefined); interface SelectedRange { start: number; @@ -142,37 +132,29 @@ export const MemoryView: React.FC = React.memo((props) => { startTs: number; endTs: number; } - const [selectedRange, setSelectedRange] = React.useState< - SelectedRange | undefined - >(); + const [selectedRange, setSelectedRange] = React.useState(); const [searchOperatorName, setSearchOperatorName] = React.useState(''); - const [searchEventOperatorName, setSearchEventOperatorName] = - React.useState(''); - const [filterEventSize, setFilterEventSize] = React.useState( - {} - ); + const [searchEventOperatorName, setSearchEventOperatorName] = React.useState(''); + const [filterEventSize, setFilterEventSize] = React.useState({}); const [maxSize, setMaxSize] = React.useState({}); - const getSearchIndex = function () { + const getSearchIndex = function (): number { if (!memoryStatsData) { return -1; } for (let i = 0; i < memoryStatsData.columns.length; i++) { - if (memoryStatsData.columns[i].name == memoryStatsData.metadata.search) { + if (memoryStatsData.columns[i].name === memoryStatsData.metadata.search) { return i; } } return -1; }; - const getStep = (size: number, indexBias: number) => { - return 10 ** (Math.floor(Math.log10(size != 0 ? size : 1)) - indexBias); + const getStep = (size: number, indexBias: number): number => { + return 10 ** (Math.floor(Math.log10(size !== 0 ? 
size : 1)) - indexBias); }; - const filterByEventSize = ( - rows: T[] | undefined, - size: Array - ) => { + const filterByEventSize = (rows: T[] | undefined, size: Array): T[] | undefined => { const result = React.useMemo(() => { if (!rows) { return undefined; @@ -193,23 +175,13 @@ export const MemoryView: React.FC = React.memo((props) => { }; const searchIndex = getSearchIndex(); - const getName = React.useCallback( - (row: any) => row[searchIndex], - [searchIndex] - ); - const getNameAscend = (row: any) => row[0]; - const [searchedTableDataRows] = useSearchDirectly( - searchOperatorName, - getName, - memoryStatsData?.rows[device] ?? [] - ); + const getName = React.useCallback((row: any) => row[searchIndex], [searchIndex]); + const getNameAscend = (row: any): any => row[0]; + const [searchedTableDataRows] = useSearchDirectly(searchOperatorName, getName, memoryStatsData?.rows[device] ?? []); const [searchedEventsTableDataRows] = useSearchDirectly( searchEventOperatorName, deviceTarget === 'Ascend' ? getNameAscend : getName, - filterByEventSize( - memoryEventsData?.rows[device], - filterEventSize[device] ?? [0, Infinity] - ) ?? [] + filterByEventSize(memoryEventsData?.rows[device], filterEventSize[device] ?? [0, Infinity]) ?? 
[] ); const onSearchOperatorChanged: TextFieldProps['onChange'] = (event) => { @@ -221,32 +193,25 @@ export const MemoryView: React.FC = React.memo((props) => { }; const [selectedRecord, setSelectedRecord] = React.useState(); - const onRowSelected = (record?: object, rowIndex?: number) => { + const onRowSelected = (record?: object, rowIndex?: number): void => { setSelectedRecord(record); }; - const onFilterEventSizeChanged = ( - event: any, - newValue: number | number[] - ) => { + const onFilterEventSizeChanged = (event: any, newValue: number | number[]): void => { setFilterEventSize({ ...filterEventSize, [device]: newValue as number[], }); }; - const onFilterEventMinSizeInputChanged = ( - event: React.ChangeEvent - ) => { + const onFilterEventMinSizeInputChanged = (event: React.ChangeEvent): void => { setFilterEventSize({ ...filterEventSize, [device]: [Number(event.target.value), filterEventSize[device][1]], }); }; - const onFilterEventMaxSizeInputChanged = ( - event: React.ChangeEvent - ) => { + const onFilterEventMaxSizeInputChanged = (event: React.ChangeEvent): void => { setFilterEventSize({ ...filterEventSize, [device]: [filterEventSize[device][0], Number(event.target.value)], @@ -254,63 +219,39 @@ export const MemoryView: React.FC = React.memo((props) => { }; React.useEffect(() => { - deviceTarget !== 'Ascend' && - api.defaultApi - .memoryGet( - run, - worker, - span, - selectedRange?.startTs, - selectedRange?.endTs - ) - .then((resp) => { - setMemoryStatsData(resp); - if (!devices || devices.length == 0) { - // setDevices only execute on view load. Since selection on curve - // might filter all events later, some devices might is missing. 
- setDevices(Object.keys(resp.rows)); - setDevice(resp.metadata.default_device); - }); + if (deviceTarget !== 'Ascend') { + api.defaultApi.memoryGet(run, worker, span, selectedRange?.startTs, selectedRange?.endTs).then((resp) => { + setMemoryStatsData(resp); + if (!devices || devices.length === 0) { + // setDevices only executes on view load. Since selection on curve + // might filter all events later, some devices might be missing. + setDevices(Object.keys(resp.rows)); + setDevice(resp.metadata.default_device); + } + }); + } }, [run, worker, span, selectedRange]); React.useEffect(() => { - api.defaultApi - .memoryEventsGet( - run, - worker, - span, - selectedRange?.startTs, - selectedRange?.endTs - ) - .then((resp) => { - const tempRes = - deviceTarget === 'Ascend' - ? (resp as MemoryEventsDataAll).operator - : (resp as MemoryEventsData); - if (deviceTarget === 'Ascend') { - memoryEventDataAllRef.current = resp as MemoryEventsDataAll; - } - let curMaxSize: MaxEventSize = {}; - let curFilterEventSize: EventSizeFilter = {}; - for (let deviceName in tempRes.rows) { - curMaxSize[deviceName] = 0; - for (let i = 0; i < tempRes.rows[deviceName].length; i++) { - curMaxSize[deviceName] = Math.max( - curMaxSize[deviceName], - tempRes.rows[deviceName][i][1] - ); - } - curFilterEventSize[deviceName] = [ - curMaxSize[deviceName] / 4, - curMaxSize[deviceName], - ]; - curMaxSize[deviceName] = curMaxSize[deviceName]; + api.defaultApi.memoryEventsGet(run, worker, span, selectedRange?.startTs, selectedRange?.endTs).then((resp) => { + const tempRes = deviceTarget === 'Ascend' ?
(resp as MemoryEventsDataAll).operator : (resp as MemoryEventsData); + if (deviceTarget === 'Ascend') { + memoryEventDataAllRef.current = resp as MemoryEventsDataAll; + } + let curMaxSize: MaxEventSize = {}; + let curFilterEventSize: EventSizeFilter = {}; + Object.keys(tempRes.rows).forEach((deviceName) => { + curMaxSize[deviceName] = 0; + for (let i = 0; i < tempRes.rows[deviceName].length; i++) { + curMaxSize[deviceName] = Math.max(curMaxSize[deviceName], tempRes.rows[deviceName][i][1]); } - setMaxSize(curMaxSize); - setFilterEventSize(curFilterEventSize); - setMemoryEventsData(tempRes); + curFilterEventSize[deviceName] = [curMaxSize[deviceName] / 4, curMaxSize[deviceName]]; + curMaxSize[deviceName] = curMaxSize[deviceName]; }); + setMaxSize(curMaxSize); + setFilterEventSize(curFilterEventSize); + setMemoryEventsData(tempRes); + }); }, [run, worker, span, selectedRange]); React.useEffect(() => { @@ -365,16 +306,13 @@ export const MemoryView: React.FC = React.memo((props) => { } }; - const onSelectedRangeChanged = (start: number, end: number) => { + const onSelectedRangeChanged = (start: number, end: number): void => { if (start > end) { setSelectedRange(undefined); return; } - let allDatas = - deviceTarget === 'Ascend' - ? memoryCurveData?.rows[device]?.Allocated - : memoryCurveData?.rows[device]; + let allDatas = deviceTarget === 'Ascend' ? memoryCurveData?.rows[device]?.Allocated : memoryCurveData?.rows[device]; if (allDatas.length <= 1) { setSelectedRange(undefined); return; @@ -415,8 +353,8 @@ export const MemoryView: React.FC = React.memo((props) => { } else { let bias = memoryCurveData?.metadata.first_ts ?? 0; let scale = 1 / (memoryCurveData?.metadata.time_factor ?? 
1); - startTs = Math.round(allDatas[realStart][0] * scale + bias); - endTs = Math.round(allDatas[realEnd][0] * scale + bias); + startTs = Math.round((allDatas[realStart][0] * scale) + bias); + endTs = Math.round((allDatas[realEnd][0] * scale) + bias); } setSelectedRange({ start, end, startTs, endTs }); @@ -430,59 +368,43 @@ export const MemoryView: React.FC = React.memo((props) => { - {(graph) => ( + {(graph): JSX.Element => ( Device - + {devices.map((item) => ( + {item} ))} {deviceTarget === 'Ascend' && ( - - Group By - - + {tags.map((item) => ( + {item} ))} )} - {showCurve() && - lineChartData && - lineChartData.columns.length > 0 && ( - -
- -
-
- )} + {showCurve() && lineChartData && lineChartData.columns.length > 0 && ( + +
+ +
+
+ )}
)}
@@ -554,15 +476,12 @@ export const MemoryView: React.FC = React.memo((props) => { )} - {(data) => { + {(data): JSX.Element => { return ( = React.memo((props) => { - {(data) => ( + {(data): JSX.Element => ( )} diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/ModuleView.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/ModuleView.tsx index dd3a6dd6222ec14db45b414904d977d349b92d31..a66a825365fd3c813e58865c609643ab547b4c49 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/ModuleView.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/ModuleView.tsx @@ -7,16 +7,10 @@ import InputLabel from '@material-ui/core/InputLabel'; import MenuItem from '@material-ui/core/MenuItem'; import Select, { SelectProps } from '@material-ui/core/Select'; import { makeStyles } from '@material-ui/core/styles'; -import { Table } from 'antd'; +import { message, Table } from 'antd'; import * as React from 'react'; import { FlameGraph } from 'react-flame-graph'; -import { - defaultApi, - KeyedColumn, - ModuleStats, - ModuleViewData, - OperatorNode, -} from '../api'; +import { defaultApi, KeyedColumn, ModuleStats, ModuleViewData, OperatorNode } from '../api'; const useStyles = makeStyles((theme) => ({ root: { @@ -33,7 +27,7 @@ export interface IProps { span: string; } -const getKeyedTableColumns = (columns: KeyedColumn[]) => { +const getKeyedTableColumns = (columns: KeyedColumn[]): any[] => { return columns.map((col) => { return { dataIndex: col.key, @@ -43,10 +37,12 @@ const getKeyedTableColumns = (columns: KeyedColumn[]) => { }); }; -const getTableRows = (key: number, rows: ModuleStats[]) => { +const getTableRows = (key: number, rows: ModuleStats[]): any[] => { + let initialKey = key; return rows.map((row) => { + const currentKey = initialKey++; const data: any = { - key: key++, + key: currentKey, name: row.name, occurences: row.occurences, operators: row.operators, @@ -64,7 +60,7 @@ const getTableRows = (key: number, rows: 
ModuleStats[]) => { }); }; -const getFlameGraphData = (rows: ModuleStats[]) => { +const getFlameGraphData = (rows: ModuleStats[]): any[] => { return rows.map((row) => { const data: any = { name: row.name, @@ -81,18 +77,14 @@ const getFlameGraphData = (rows: ModuleStats[]) => { }; const getTreeHeight = (row: ModuleStats): number => { - if (row.children && row.children.length) { + if (row.children?.length) { return 1 + Math.max(...row.children.map((child) => getTreeHeight(child))); } else { return 1; } }; -const getOperatorTree = ( - level: number, - row: OperatorNode, - result: object[] -) => { +const getOperatorTree = (level: number, row: OperatorNode, result: object[]): void => { result.push({ level: level, name: row.name, @@ -108,9 +100,7 @@ export const ModuleView: React.FC = (props) => { const { run, worker, span } = props; const classes = useStyles(); - const [moduleView, setModuleView] = React.useState< - ModuleViewData | undefined - >(undefined); + const [moduleView, setModuleView] = React.useState(undefined); const [flameData, setFlameData] = React.useState([]); const [flameHeight, setFlameHeight] = React.useState(0); const [modules, setModules] = React.useState([]); @@ -120,9 +110,7 @@ export const ModuleView: React.FC = (props) => { const [rows, setRows] = React.useState([]); const cardRef = React.useRef(null); - const [cardWidth, setCardWidth] = React.useState( - undefined - ); + const [cardWidth, setCardWidth] = React.useState(undefined); const timelineRef = React.useRef(null); React.useEffect(() => { @@ -132,13 +120,11 @@ export const ModuleView: React.FC = (props) => { setModuleView(resp); if (resp) { // set the flamegraph data - const flameData: any[] = getFlameGraphData(resp.data); - setFlameData(flameData); - const flameHeight = Math.max( - ...flameData.map((x) => getTreeHeight(x)) - ); - setFlameHeight(flameHeight * 25); - setModules(Array.from(Array(flameData.length).keys())); + const flameGraphData: any[] = getFlameGraphData(resp.data); + 
setFlameData(flameGraphData); + const flameGraphHeight = Math.max(...flameGraphData.map((x) => getTreeHeight(x))); + setFlameHeight(flameGraphHeight * 25); + setModules(Array.from(Array(flameGraphData.length).keys())); setModule(0); // set the tree table data @@ -147,7 +133,7 @@ export const ModuleView: React.FC = (props) => { } }) .catch((e) => { - if (e.status == 404) { + if (e.status === 404) { setModules([]); setFlameData([]); setRows([]); @@ -168,11 +154,11 @@ export const ModuleView: React.FC = (props) => { data.addColumn({ type: 'number', id: 'Start' }); data.addColumn({ type: 'number', id: 'End' }); - let timeline_data: any[] = []; - getOperatorTree(0, resp, timeline_data); - timeline_data.sort((a, b) => a.level - b.level); - const max_level = timeline_data[timeline_data.length - 1].level; - timeline_data.forEach((d) => { + let timelineData: any[] = []; + getOperatorTree(0, resp, timelineData); + timelineData.sort((a, b) => a.level - b.level); + const maxLevel = timelineData[timelineData.length - 1].level; + timelineData.forEach((d) => { data.addRow([ d.level.toString(), d.name, @@ -182,11 +168,9 @@ export const ModuleView: React.FC = (props) => { ]); }); - const chart = new google.visualization.Timeline( - timelineRef.current - ); + const chart = new google.visualization.Timeline(timelineRef.current); const options = { - height: (max_level + 1) * 50, + height: (maxLevel + 1) * 50, tooltip: { isHtml: true, }, @@ -199,7 +183,7 @@ export const ModuleView: React.FC = (props) => { }); } } catch (e) { - console.warn('Timeline in module view is not supported offline.'); + message.warning('Timeline in module view is not supported offline.'); } }, [run, worker, span]); @@ -207,7 +191,7 @@ export const ModuleView: React.FC = (props) => { setModule(event.target.value as number); }; - const moduleComponent = () => { + const moduleComponent = (): JSX.Element => { const moduleFragment = ( Module @@ -249,7 +233,7 @@ export const ModuleView: React.FC = (props) => { 
data={flameData[module]} height={flameHeight} width={cardWidth} - onChange={(node: any) => {}} + onChange={(node: any): void => {}} /> )} diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Operator.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Operator.tsx index 4d3d1ee2eecbf1b875277f1811cc4931deafdc2c..b19bef1967a31915c3c1d660b699b11c83ebb226 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Operator.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/Operator.tsx @@ -32,17 +32,10 @@ import Radio from '@material-ui/core/Radio'; import RadioGroup, { RadioGroupProps } from '@material-ui/core/RadioGroup'; import Select, { SelectProps } from '@material-ui/core/Select'; import { makeStyles } from '@material-ui/core/styles'; -import TextField, { - StandardTextFieldProps, - TextFieldProps, -} from '@material-ui/core/TextField'; +import TextField, { StandardTextFieldProps, TextFieldProps } from '@material-ui/core/TextField'; import * as React from 'react'; import * as api from '../api'; -import { - OperationTableData, - OperationTableDataInner, - OperatorGraph, -} from '../api'; +import { OperationTableData, OperationTableDataInner, OperatorGraph } from '../api'; import { OperationGroupBy } from '../constants/groupBy'; import { useSearchDirectly } from '../utils/search'; import { topIsValid, UseTop, useTopN } from '../utils/top'; @@ -51,12 +44,12 @@ import { DataLoading } from './DataLoading'; import { makeChartHeaderRenderer, useTooltipCommonStyles } from './helpers'; import { OperationTable } from './tables/OperationTable'; import { - DeviceSelfTimeTooltip, - DeviceSelfTimeTooltipAscend, - DeviceTotalTimeTooltip, - DeviceTotalTimeTooltipAscend, - HostSelfTimeTooltip, - HostTotalTimeTooltip, + deviceSelfTimeTooltip, + deviceSelfTimeTooltipAscend, + deviceTotalTimeTooltip, + deviceTotalTimeTooltipAscend, + hostSelfTimeTooltip, + hostTotalTimeTooltip, } from './TooltipDescriptions'; const useStyles = 
makeStyles((theme) => ({ @@ -98,32 +91,19 @@ export const Operator: React.FC = (props) => { [tooltipCommonClasses] ); - const [operatorGraph, setOperatorGraph] = React.useState< - OperatorGraph | undefined - >(undefined); - const [operatorTable, setOperatorTable] = React.useState< - OperationTableData | undefined - >(undefined); + const [operatorGraph, setOperatorGraph] = React.useState(undefined); + const [operatorTable, setOperatorTable] = React.useState(undefined); const [sortColumn, setSortColumn] = React.useState(''); - const [tableTooltips, setTableTooltips] = React.useState( - undefined - ); - const [groupBy, setGroupBy] = React.useState(OperationGroupBy.Operation); + const [tableTooltips, setTableTooltips] = React.useState(undefined); + const [groupBy, setGroupBy] = React.useState(OperationGroupBy.OPERATION); const [searchOperatorName, setSearchOperatorName] = React.useState(''); const [topText, actualTop, useTop, setTopText, setUseTop] = useTopN({ - defaultUseTop: UseTop.Use, + defaultUseTop: UseTop.USE, defaultTop: 10, }); - const getName = React.useCallback( - (row: OperationTableDataInner) => row.name, - [] - ); - const [searchedOperatorTable] = useSearchDirectly( - searchOperatorName, - getName, - operatorTable - ); + const getName = React.useCallback((row: OperationTableDataInner) => row.name, []); + const [searchedOperatorTable] = useSearchDirectly(searchOperatorName, getName, operatorTable); const onSearchOperatorChanged: TextFieldProps['onChange'] = (event) => { setSearchOperatorName(event.target.value as string); @@ -142,13 +122,11 @@ export const Operator: React.FC = (props) => { }, [operatorGraph]); React.useEffect(() => { - api.defaultApi - .operationTableGet(run, worker, span, groupBy) - .then((resp) => { - setSortColumn(resp.metadata.sort); - setTableTooltips(resp.metadata.tooltips); - setOperatorTable(resp.data); - }); + api.defaultApi.operationTableGet(run, worker, span, groupBy).then((resp) => { + setSortColumn(resp.metadata.sort); + 
setTableTooltips(resp.metadata.tooltips); + setOperatorTable(resp.data); + }); }, [run, worker, span, groupBy]); React.useEffect(() => { @@ -165,7 +143,7 @@ export const Operator: React.FC = (props) => { setUseTop(event.target.value as UseTop); }; - const onTopChanged = (event: React.ChangeEvent) => { + const onTopChanged = (event: React.ChangeEvent): void => { setTopText(event.target.value); }; @@ -173,7 +151,7 @@ export const Operator: React.FC = (props) => { min: 1, }; - const renderCharts = (graph: api.OperatorGraph) => { + const renderCharts = (graph: api.OperatorGraph): JSX.Element => { return ( {graph.device_self_time && ( @@ -183,9 +161,7 @@ export const Operator: React.FC = (props) => { )} @@ -200,9 +176,7 @@ export const Operator: React.FC = (props) => { )} @@ -213,12 +187,7 @@ export const Operator: React.FC = (props) => { {graph.host_self_time.title && ( - + )} @@ -226,12 +195,7 @@ export const Operator: React.FC = (props) => { {graph.host_total_time.title && ( - + )} @@ -249,19 +213,11 @@ export const Operator: React.FC = (props) => { - } - label='All operators' - /> - } - label='Top operators to show' - /> + } label='All operators' /> + } label='Top operators to show' /> - {useTop === UseTop.Use && ( + {useTop === UseTop.USE && ( = (props) => { Group By - + Operator + Input Shape + Operator @@ -311,7 +259,7 @@ export const Operator: React.FC = (props) => { - {(table) => ( + {(table): JSX.Element => ( = (props) => { const { run, worker, span } = props; - const [steps, setSteps] = React.useState( - undefined - ); + const [steps, setSteps] = React.useState(undefined); const [performances, setPerformances] = React.useState([]); const [environments, setEnvironments] = React.useState([]); - const [gpuMetrics, setGpuMetrics] = React.useState< - api.GpuMetrics | undefined - >(undefined); + const [gpuMetrics, setGpuMetrics] = React.useState(undefined); const [recommendations, setRecommendations] = React.useState(''); const [columns, setColumns] = 
React.useState>([]); @@ -88,17 +81,17 @@ export const Overview: React.FC = (props) => { if (dataInfo.columns.length < 3) { return []; } - const stringCompare = (a: string, b: string) => a.localeCompare(b); - const numberCompare = (a: number, b: number) => a - b; + const stringCompare = (a: string, b: string): number => a.localeCompare(b); + const numberCompare = (a: number, b: number): number => a - b; let column: any[] = dataInfo.columns.map((item) => { return { title: item.name, key: item.name, dataIndex: item.name, sorter: - item.type == 'string' - ? (a: any, b: any) => stringCompare(a[item.name], b[item.name]) - : (a: any, b: any) => numberCompare(a[item.name], b[item.name]), + item.type === 'string' + ? (a: any, b: any): number => stringCompare(a[item.name], b[item.name]) + : (a: any, b: any): number => numberCompare(a[item.name], b[item.name]), }; }); setColumns(column); @@ -137,13 +130,11 @@ export const Overview: React.FC = (props) => { ); const stepTimeBreakDownTitle = React.useMemo( - () => chartHeaderRenderer('Step Time Breakdown', StepTimeBreakDownTooltip), + () => chartHeaderRenderer('Step Time Breakdown', stepTimeBreakDownTooltip), [tooltipCommonClasses, chartHeaderRenderer] ); - const cardSizes = gpuMetrics - ? ([2, 3, 7] as const) - : ([4, undefined, 8] as const); + const cardSizes = gpuMetrics ? ([2, 3, 7] as const) : ([4, undefined, 8] as const); return (
@@ -156,10 +147,7 @@ export const Overview: React.FC = (props) => { {environments.map((environment) => ( - + ))} @@ -170,19 +158,10 @@ export const Overview: React.FC = (props) => { {gpuMetrics && ( - - + + {gpuMetrics.data.map((metric) => ( - + ))} @@ -203,10 +182,7 @@ export const Overview: React.FC = (props) => { /> - + @@ -219,12 +195,8 @@ export const Overview: React.FC = (props) => { - {(graph) => ( - + {(graph): JSX.Element => ( + )} diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TextListItem.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TextListItem.tsx index 400ac9de8083d5578175b5ede47016ccdd1b4b61..59eb79c2a8f05cc750d264880bb66ab646c4bbb4 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TextListItem.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TextListItem.tsx @@ -35,7 +35,7 @@ const useStyles = makeStyles((theme) => ({ export const TextListItem: React.FC = (props) => { const classes = useStyles(); - const getSizes = function () { + const getSizes = function (): readonly any[] { if (props.value && props.extra) { return [4, 4, 4] as const; } @@ -50,14 +50,9 @@ export const TextListItem: React.FC = (props) => { const sizes = getSizes(); - const renderSpan = function (content: string, className?: string) { + const renderSpan = function (content: string, className?: string): React.JSX.Element { if (props.dangerouslyAllowHtml) { - return ( - - ); + return ; } return {content}; }; @@ -69,9 +64,7 @@ export const TextListItem: React.FC = (props) => { {renderSpan(props.name, props.classes?.name)} - {props.description && ( - {renderSpan(props.description)} - )} + {props.description && {renderSpan(props.description)}} {props.value && ( diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TooltipDescriptions.ts b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TooltipDescriptions.ts index 
d9aaaabb742f4e14bc035bb78709c81455a46eaa..6d3631fee97a4dd8da5ebde1550573d8c6e501fa 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TooltipDescriptions.ts +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TooltipDescriptions.ts @@ -2,7 +2,7 @@ * Copyright (c) Microsoft Corporation. All rights reserved. *--------------------------------------------------------------------------------------------*/ -export const StepTimeBreakDownTooltip = `The time spent on each step is broken down into multiple categories as follows: +export const stepTimeBreakDownTooltip = `The time spent on each step is broken down into multiple categories as follows: Kernel: Kernels execution time on GPU device; Memcpy: GPU involved memory copy time (either D2D, D2H or H2D); Memset: GPU involved memory set time; @@ -11,28 +11,28 @@ DataLoader: The data loading time spent in PyTorch DataLoader object; CPU Exec: Host compute time, including every PyTorch operator running time; Other: The time not included in any of the above.`; -export const DeviceSelfTimeTooltip = `The accumulated time spent on GPU, not including this operator’s child operators.`; +export const deviceSelfTimeTooltip = `The accumulated time spent on GPU, not including this operator’s child operators.`; -export const DeviceSelfTimeTooltipAscend = `The accumulated time spent on NPU, not including this operator’s child operators.`; +export const deviceSelfTimeTooltipAscend = `The accumulated time spent on NPU, not including this operator’s child operators.`; -export const DeviceTotalTimeTooltip = `The accumulated time spent on GPU, including this operator’s child operators.`; +export const deviceTotalTimeTooltip = `The accumulated time spent on GPU, including this operator’s child operators.`; -export const DeviceTotalTimeTooltipAscend = `The accumulated time spent on NPU, including this operator’s child operators.`; +export const deviceTotalTimeTooltipAscend = `The accumulated time spent on NPU, 
including this operator’s child operators.`;
 
-export const HostSelfTimeTooltip = `The accumulated time spent on Host, not including this operator’s child operators.`;
+export const hostSelfTimeTooltip = `The accumulated time spent on Host, not including this operator’s child operators.`;
 
-export const HostTotalTimeTooltip = `The accumulated time spent on Host, including this operator’s child operators.`;
+export const hostTotalTimeTooltip = `The accumulated time spent on Host, including this operator’s child operators.`;
 
-export const GPUKernelTotalTimeTooltip = `The accumulated time of all calls of this kernel.`;
+export const gpuKernelTotalTimeTooltip = `The accumulated time of all calls of this kernel.`;
 
-export const TensorCoresPieChartTooltip = `The accumulated time of all kernels using or not using Tensor Cores.`;
+export const tensorCoresPieChartTooltip = `The accumulated time of all kernels using or not using Tensor Cores.`;
 
-export const TensorCoresPieChartTooltipAscend = `The accumulated time of all kernels group by Accelerator Core.`;
+export const tensorCoresPieChartTooltipAscend = `The accumulated time of all kernels grouped by Accelerator Core.`;
 
-export const DistributedGpuInfoTableTooltip = `Information about GPU hardware used during the run.`;
+export const distributedGpuInfoTableTooltip = `Information about GPU hardware used during the run.`;
 
-export const DistributedOverlapGraphTooltip = `The time spent on computation vs communication.`;
+export const distributedOverlapGraphTooltip = `The time spent on computation vs communication.`;
 
-export const DistributedWaittimeGraphTooltip = `The time spent waiting vs communicating between devices.`;
+export const distributedWaittimeGraphTooltip = `The time spent waiting vs communicating between devices.`;
 
-export const DistributedCommopsTableTooltip = `Statistics for operations managing communications between nodes.`;
+export const distributedCommopsTableTooltip = `Statistics for operations managing 
communications between nodes.`; diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TraceView.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TraceView.tsx index 22aa87782c8af9ca5d4a6f5dccfcc542c1bcecb2..be499794936a085ed72740eea8bac5f33df37171 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TraceView.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/TraceView.tsx @@ -29,9 +29,7 @@ export const TraceView: React.FC = (props) => { const { run, worker, span, iframeRef } = props; const classes = useStyles(); - const [traceData, setTraceData] = React.useState | null>( - null - ); + const [traceData, setTraceData] = React.useState | null>(null); const [traceViewReady, setTraceViewReady] = React.useState(false); React.useEffect(() => { @@ -43,7 +41,7 @@ export const TraceView: React.FC = (props) => { }, [run, worker, span]); React.useEffect(() => { - function callback(event: MessageEvent) { + function callback(event: MessageEvent): void { const data = event.data || {}; if (data.msg === 'ready') { setTraceViewReady(true); @@ -59,26 +57,19 @@ export const TraceView: React.FC = (props) => { React.useEffect(() => { if (traceData && traceViewReady) { traceData.then((data) => { - iframeRef.current?.contentWindow?.postMessage( - { msg: 'data', data }, - window.origin - ); + iframeRef.current?.contentWindow?.postMessage({ msg: 'data', data }, window.origin); }); } }, [traceData, traceViewReady]); - const SetIframeActive = () => { + const setIframeActive = (): void => { iframeRef.current?.focus(); }; return (
{React.useMemo( () => ( - - + + ), [] diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AntTableChart.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AntTableChart.tsx index 479ca410cd4583b482cfad9191a92851e1fb7b33..83618064b55223ab06d4d1fec8b8b5eeab8d3268 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AntTableChart.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AntTableChart.tsx @@ -23,35 +23,29 @@ const useStyles = makeStyles((theme) => ({ }, })); -const getTableColumns = function ( - columns: any, - sort: string | undefined, - tooltipClass: string -) { +const getTableColumns = function (columns: any, sort: string | undefined, tooltipClass: string): any { let i = 0; - return columns.map(function (col: any) { - const key = 'col' + i++; - const stringCompare = (a: any, b: any) => a[key].localeCompare(b[key]); - const numberCompare = (a: any, b: any) => (a[key] || 0) - (b[key] || 0); + return columns.map((col: any) => { + const key = `col${i++}`; + const stringCompare = (a: any, b: any): number => a[key].localeCompare(b[key]); + const numberCompare = (a: any, b: any): number => (a[key] || 0) - (b[key] || 0); return { dataIndex: key, key: key, title: col.name, - sorter: col.type == 'string' ? stringCompare : numberCompare, - defaultSortOrder: sort == col.name ? ('descend' as const) : undefined, - showSorterTooltip: col.tooltip - ? { title: col.tooltip, overlayClassName: tooltipClass } - : true, + sorter: col.type === 'string' ? stringCompare : numberCompare, + defaultSortOrder: sort === col.name ? ('descend' as const) : undefined, + showSorterTooltip: col.tooltip ? 
{ title: col.tooltip, overlayClassName: tooltipClass } : true, }; }); }; -const getTableRows = function (rows: any) { - return rows.map(function (row: any) { +const getTableRows = function (rows: any): any { + return rows.map((row: any) => { let i = 0; const res: any = {}; - row.forEach(function (entry: any) { - res['col' + i++] = entry; + row.forEach((entry: any) => { + res[`col${i++}`] = entry; }); return res; }); @@ -69,21 +63,27 @@ export const AntTableChart: React.FC = (props) => { ); // key is used to reset the Table state (page and sort) if the columns change - const key = React.useMemo(() => Math.random() + '', [graph.columns]); + const key: string = React.useMemo(() => `${Math.random()}`, [graph.columns]); const [pageSize, setPageSize] = React.useState(initialPageSize ?? 30); - const onShowSizeChange = (current: number, size: number) => { + const onShowSizeChange = (current: number, size: number): void => { setPageSize(size); }; - const onRow = (record: object, rowIndex?: number) => { + const onRow = ( + record: object, + rowIndex?: number + ): { + onMouseEnter: (event: any) => void; + onMouseLeave: (event: any) => void; + } => { return { - onMouseEnter: (event: any) => { + onMouseEnter: (event: any): void => { if (onRowSelected) { onRowSelected(record, rowIndex); } }, - onMouseLeave: (event: any) => { + onMouseLeave: (event: any): void => { if (onRowSelected) { onRowSelected(undefined, undefined); } diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AreaChart.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AreaChart.tsx index a4d495a2dcc88fcd60878bc494a8c641c3d4fdf4..cda12860c2fba41f5a15c5d9e73fb92093c0371b 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AreaChart.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/AreaChart.tsx @@ -15,7 +15,7 @@ interface IProps { const useStyles = makeStyles(() => ({ root: { - height: (props: Pick) => props.height, + height: 
(props: Pick): number | undefined => props.height, }, })); @@ -27,7 +27,9 @@ export const AreaChart: React.FC = (props) => { React.useLayoutEffect(() => { const element = graphRef.current; - if (!element) return; + if (!element) { + return undefined; + } const data = new google.visualization.DataTable(); data.addColumn('string', 'step'); diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/ColumnChart.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/ColumnChart.tsx index f68e9fe44b5dae6be9aab58644879db457d218bd..ae51dc1a34e94b1c91eab2fe502ffe2cbc20f618 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/ColumnChart.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/ColumnChart.tsx @@ -42,25 +42,28 @@ export const ColumnChart: React.FC = (props) => { const graphRef = React.useRef(null); const [resizeEventDependency] = useResizeEventDependency(); - const getAngleByDataLength = (data: number) => { + const getAngleByDataLength = (data: number): number => { if (data < 10) { return 0; } else { // the larger the count, the closer the rotation gets to 90 degrees - return 90 * (1 - 10 / data); + return 90 * (1 - (10 / data)); } }; React.useLayoutEffect(() => { const element = graphRef.current; - if (!element) return; + if (!element) { + return undefined; + } const chart = echarts.init(element); const dataSource: Array> = []; dataSource.push(['worker', ...legends]); barHeights.forEach((item, index) => { - barLabels[index] !== undefined && + if (barLabels[index] !== undefined) { dataSource.push([barLabels[index], ...item]); + } }); const options: echarts.EChartsOption = { title: { @@ -76,10 +79,8 @@ export const ColumnChart: React.FC = (props) => { rotate: getAngleByDataLength(barLabels.length), formatter: (name: string) => { const index = name.indexOf('@'); - if (index > -1) { - name = name.slice(index + 1); - } - return name.length > 16 ? name.slice(0, 14) + '...' : name; + const processedName = index > -1 ? 
name.slice(index + 1) : name; // use a new variable instead of reassigning the parameter + return processedName.length > 16 ? `${processedName.slice(0, 14)}...` : processedName; }, }, }, @@ -105,7 +106,9 @@ export const ColumnChart: React.FC = (props) => { options.color = colors.slice(0, barLabels.length); } - options && chart.setOption(options, true); + if (options) { + chart.setOption(options, true); + } return () => { chart.dispose(); }; diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/LineChart.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/LineChart.tsx deleted file mode 100644 index 3370dc56e7bdc47ba68c062223f4c69a510fae3d..0000000000000000000000000000000000000000 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/LineChart.tsx +++ /dev/null @@ -1,246 +0,0 @@ -/*--------------------------------------------------------------------------------------------- - * Copyright (c) Microsoft Corporation. All rights reserved. - *--------------------------------------------------------------------------------------------*/ - -import { makeStyles } from '@material-ui/core/styles'; -import * as React from 'react'; -import { Graph, GraphAscend } from '../../api'; -import { useResizeEventDependency } from '../../utils/resize'; -import { binarySearch } from '../../utils/binarysearch'; - -interface IProps { - graph: Graph | GraphAscend; - height?: number; - deviceTarget: string; - tag: string; - hAxisTitle?: string; - vAxisTitle?: string; - explorerOptions?: object; - onSelectionChanged?: (start: number, end: number) => void; - record?: any; -} - -const useStyles = makeStyles(() => ({ - root: { - height: (props: Pick) => props.height, - }, -})); - -export const LineChart: React.FC = (props) => { - const { - graph, - height = 400, - deviceTarget, - tag, - hAxisTitle, - vAxisTitle, - onSelectionChanged, - explorerOptions, - record, - } = props; - const classes = useStyles({ height }); - const graphRef = React.useRef(null); - const [resizeEventDependency] = 
useResizeEventDependency(); - const [chartObj, setChartObj] = React.useState(); - - React.useLayoutEffect(() => { - const element = graphRef.current; - if (!element) return; - - const options = { - title: graph.title, - isStacked: true, - height, - legend: { position: 'bottom' }, - tooltip: { isHtml: true }, - hAxis: { - title: hAxisTitle, - }, - vAxis: { - title: vAxisTitle, - }, - explorer: explorerOptions, - }; - - const chart = new google.visualization.LineChart(element); - - // Disable selection of single point - google.visualization.events.addListener(chart, 'select', function () { - chart.setSelection(); - }); - - google.visualization.events.addListener(chart, 'ready', function () { - var zoomLast = getCoords(); - var observer = new MutationObserver(function () { - var zoomCurrent = getCoords(); - if (JSON.stringify(zoomLast) !== JSON.stringify(zoomCurrent)) { - zoomLast = getCoords(); - if (onSelectionChanged) { - onSelectionChanged(zoomLast.x_min, zoomLast.x_max); - } - } - }); - if (graphRef.current) { - observer.observe(graphRef.current, { - childList: true, - subtree: true, - }); - } - }); - - function getCoords() { - var chartLayout = chart.getChartLayoutInterface(); - var chartBounds = chartLayout.getChartAreaBoundingBox(); - - return { - x_min: chartLayout.getHAxisValue(chartBounds.left), - x_max: chartLayout.getHAxisValue(chartBounds.width + chartBounds.left), - }; - } - - if (deviceTarget === 'Ascend') { - let data = new google.visualization.DataTable(); - if (tag === 'Component') { - if (graph.columns.length === 3) { - graph.columns.forEach((column) => { - data.addColumn({ - type: column.type, - label: column.name, - role: column.role, - p: column.p, - }); - }); - data.addRows(graph.rows['PTA'] ?? 
graph.rows['GE']); - } else if (graph.columns.length === 5) { - const data2 = new google.visualization.DataTable(); - graph.columns.forEach((column, index) => { - if (index === 0 || index < 3) { - data.addColumn({ - type: column.type, - label: column.name, - role: column.role, - p: column.p, - }); - } - if (index === 0 || index >= 3) { - data2.addColumn({ - type: column.type, - label: column.name, - role: column.role, - p: column.p, - }); - } - }); - data.addRows(graph.rows['PTA']); - data2.addRows(graph.rows['GE']); - data = google.visualization.data.join( - data, - data2, - 'full', - [[0, 0]], - [1, 2], - [1, 2] - ); - } - } else { - if (graph.columns.length === 2) { - graph.columns.forEach((column) => { - data.addColumn({ - type: column.type, - label: column.name, - role: column.role, - p: column.p, - }); - }); - data.addRows(graph.rows['Allocated'] ?? graph.rows['Reserved']); - } else if (graph.columns.length === 3) { - const data2 = new google.visualization.DataTable(); - graph.columns.forEach((column, index) => { - if (index === 0 || index < 2) { - data.addColumn({ - type: column.type, - label: column.name, - role: column.role, - p: column.p, - }); - } - if (index === 0 || index >= 2) { - data2.addColumn({ - type: column.type, - label: column.name, - role: column.role, - p: column.p, - }); - } - }); - data.addRows(graph.rows['Allocated']); - data2.addRows(graph.rows['Reserved']); - data = google.visualization.data.join( - data, - data2, - 'full', - [[0, 0]], - [1], - [1] - ); - } - } - - chart.draw(data, options); - } else { - const data = new google.visualization.DataTable(); - graph.columns.forEach((column) => { - data.addColumn({ - type: column.type, - label: column.name, - role: column.role, - p: column.p, - }); - }); - data.addRows(graph.rows); - chart.draw(data, options); - } - - setChartObj(chart); - }, [graph, height, resizeEventDependency]); - - React.useEffect(() => { - const compare_fn = (key: number, mid: Array) => - key - 
parseFloat(mid[0].toFixed(2)); - if (chartObj && tag === 'Operator') { - if (record) { - if (deviceTarget === 'Ascend') { - let startId = binarySearch( - graph.rows['Allocated'], - record.col2, - compare_fn - ); - let endId = binarySearch( - graph.rows['Allocated'], - record.col3, - compare_fn - ); - let selection = []; - if (startId >= 0) selection.push({ row: startId, column: 1 }); - if (endId >= 0) selection.push({ row: endId, column: 1 }); - chartObj.setSelection(selection); - } else { - let startId = binarySearch(graph.rows, record.col2, compare_fn); - let endId = binarySearch(graph.rows, record.col3, compare_fn); - let selection = []; - if (startId >= 0) selection.push({ row: startId, column: 1 }); - if (endId >= 0) selection.push({ row: endId, column: 1 }); - chartObj.setSelection(selection); - } - } else { - chartObj.setSelection(); - } - } - }, [graph, record, chartObj]); - - return ( -
-
-
- ); -}; diff --git a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/NewLineChart.tsx b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/NewLineChart.tsx index 83d854e9cdd82ca4339fa156082d72f6554d6d63..a6e222a6cc9d04b3b0c9031be60b91b75fe9ab37 100644 --- a/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/NewLineChart.tsx +++ b/plugins/tensorboard-plugins/tb_plugin/fe/src/components/charts/NewLineChart.tsx @@ -33,16 +33,7 @@ interface IProps { } export const LineChart: React.FC = (props) => { - const { - graph, - height = 400, - deviceTarget, - tag, - hAxisTitle, - vAxisTitle, - onSelectionChanged, - record, - } = props; + const { graph, height = 400, deviceTarget, tag, hAxisTitle, vAxisTitle, onSelectionChanged, record } = props; const graphRef = React.useRef(null); const [resizeEventDependency] = useResizeEventDependency(); const [chartObj, setChartObj] = React.useState(); @@ -50,8 +41,10 @@ export const LineChart: React.FC = (props) => { React.useLayoutEffect(() => { const element = graphRef.current; - if (!element) return; - element.oncontextmenu = () => { + if (!element) { + return undefined; + } + element.oncontextmenu = (): boolean => { return false; }; @@ -94,7 +87,7 @@ export const LineChart: React.FC = (props) => { const mixedTooltip: echarts.TooltipComponentOption = { trigger: 'axis', formatter: function (params: any) { - var res = `${params[0].name}
`; + let res = `${params[0].name}
`; for (const item of params) { if (typeof item.value[item.encode.y[0]] === 'number') { res += `= 1.43.0 < 2" - -compression@^1.7.4: - version "1.7.4" - resolved "https://registry.yarnpkg.com/compression/-/compression-1.7.4.tgz#95523eff170ca57c29a0ca41e6fe131f41e5bb8f" - integrity sha512-jaSIDzP9pZVS4ZfQ+TzvtiWhdpFhE2RDHz8QJkpX9SIpLq88VueF5jJw6t+6CUQcAoA6t+x89MLrWAqpfDE8iQ== - dependencies: - accepts "~1.3.5" - bytes "3.0.0" - compressible "~2.0.16" - debug "2.6.9" - on-headers "~1.0.2" - safe-buffer "5.1.2" - vary "~1.1.2" - -compute-scroll-into-view@^1.0.17: - version "1.0.17" - resolved "https://registry.yarnpkg.com/compute-scroll-into-view/-/compute-scroll-into-view-1.0.17.tgz#6a88f18acd9d42e9cf4baa6bec7e0522607ab7ab" - integrity sha512-j4dx+Fb0URmzbwwMUrhqWM2BEWHdFGx+qZ9qqASHRPqvTYdqvWnHg0H1hIbcyLnvgnoNAVMlwkepyqM3DaIFUg== - -concat-map@0.0.1: - version "0.0.1" - resolved "https://registry.yarnpkg.com/concat-map/-/concat-map-0.0.1.tgz#d8a96bd77fd68df7793a73036a3ba0d5405d477b" - integrity sha1-2Klr13/Wjfd5OnMDajug1UBdR3s= - -connect-history-api-fallback@^1.6.0: - version "1.6.0" - resolved "https://registry.yarnpkg.com/connect-history-api-fallback/-/connect-history-api-fallback-1.6.0.tgz#8b32089359308d111115d81cad3fceab888f97bc" - integrity sha512-e54B99q/OUoH64zYYRf3HBP5z24G38h5D3qXu23JGRoigpX5Ss4r9ZnDk3g0Z8uQC2x2lPaJ+UlWBc1ZWBWdLg== - -content-disposition@0.5.4: - version "0.5.4" - resolved "https://registry.yarnpkg.com/content-disposition/-/content-disposition-0.5.4.tgz#8b82b4efac82512a02bb0b1dcec9d2c5e8eb5bfe" - integrity sha512-FveZTNuGw04cxlAiWbzi6zTAL/lhehaWbTtgluJh4/E95DqMwTmha3KZN1aAWA8cFIhHzMZUvLevkw5Rqk+tSQ== - dependencies: - safe-buffer "5.2.1" - -content-type@~1.0.4: - version "1.0.4" - resolved "https://registry.yarnpkg.com/content-type/-/content-type-1.0.4.tgz#e138cc75e040c727b1966fe5e5f8c9aee256fe3b" - integrity sha512-hIP3EEPs8tB9AT1L+NUqtwOAps4mk2Zob89MWXMHjHWg9milF/j4osnnQLXBCBFBk/tvIG/tUc9mOUJiPBhPXA== - -cookie-signature@1.0.6: - version 
"1.0.6" - resolved "https://registry.yarnpkg.com/cookie-signature/-/cookie-signature-1.0.6.tgz#e303a882b342cc3ee8ca513a79999734dab3ae2c" - integrity sha1-4wOogrNCzD7oylE6eZmXNNqzriw= - -cookie@0.4.2: - version "0.4.2" - resolved "https://registry.yarnpkg.com/cookie/-/cookie-0.4.2.tgz#0e41f24de5ecf317947c82fc789e06a884824432" - integrity sha512-aSWTXFzaKWkvHO1Ny/s+ePFpvKsPnjc551iI41v3ny/ow6tBG5Vd+FuqGNhh1LxOmVzOlGUriIlOaokOvhaStA== - -copy-to-clipboard@^3.2.0: - version "3.3.1" - resolved "https://registry.yarnpkg.com/copy-to-clipboard/-/copy-to-clipboard-3.3.1.tgz#115aa1a9998ffab6196f93076ad6da3b913662ae" - integrity sha512-i13qo6kIHTTpCm8/Wup+0b1mVWETvu2kIMzKoK8FpkLkFxlt0znUAHcMzox+T8sPlqtZXq3CulEjQHsYiGFJUw== - dependencies: - toggle-selection "^1.0.6" - -core-util-is@~1.0.0: - version "1.0.3" - resolved "https://registry.yarnpkg.com/core-util-is/-/core-util-is-1.0.3.tgz#a6042d3634c2b27e9328f837b965fac83808db85" - integrity sha512-ZQBvi1DcpJ4GDqanjucZ2Hj3wEO5pZDS89BWbkcrvdxksJorwUDDZamX9ldFkp9aw2lmBDLgkObEA4DWNJ9FYQ== - -cross-env@^7.0.2: - version "7.0.3" - resolved "https://registry.yarnpkg.com/cross-env/-/cross-env-7.0.3.tgz#865264b29677dc015ba8418918965dd232fc54cf" - integrity sha512-+/HKd6EgcQCJGh2PSjZuUitQBQynKor4wrFbRg4DtAgS1aWO+gU52xpH7M9ScGgXSYmAVS9bIJ8EzuaGw0oNAw== - dependencies: - cross-spawn "^7.0.1" - -cross-spawn@^7.0.1, cross-spawn@^7.0.3: - version "7.0.3" - resolved "https://registry.yarnpkg.com/cross-spawn/-/cross-spawn-7.0.3.tgz#f73a85b9d5d41d045551c177e2882d4ac85728a6" - integrity sha512-iRDPJKUPVEND7dHPO8rkbOnPpyDygcDFtWjpeWNCgy8WP2rXcxXL8TskReQl6OrB2G7+UJrags1q15Fudc7G6w== - dependencies: - path-key "^3.1.0" - shebang-command "^2.0.0" - which "^2.0.1" - -css-loader@^5.2.4: - version "5.2.7" - resolved "https://registry.yarnpkg.com/css-loader/-/css-loader-5.2.7.tgz#9b9f111edf6fb2be5dc62525644cbc9c232064ae" - integrity sha512-Q7mOvpBNBG7YrVGMxRxcBJZFL75o+cH2abNASdibkj/fffYD8qWbInZrD0S9ccI6vZclF3DsHE7njGlLtaHbhg== - dependencies: - icss-utils 
"^5.1.0" - loader-utils "^2.0.0" - postcss "^8.2.15" - postcss-modules-extract-imports "^3.0.0" - postcss-modules-local-by-default "^4.0.0" - postcss-modules-scope "^3.0.0" - postcss-modules-values "^4.0.0" - postcss-value-parser "^4.1.0" - schema-utils "^3.0.0" - semver "^7.3.5" - -css-select@^4.1.3: - version "4.2.1" - resolved "https://registry.yarnpkg.com/css-select/-/css-select-4.2.1.tgz#9e665d6ae4c7f9d65dbe69d0316e3221fb274cdd" - integrity sha512-/aUslKhzkTNCQUB2qTX84lVmfia9NyjP3WpDGtj/WxhwBzWBYUV3DgUpurHTme8UTPcPlAD1DJ+b0nN/t50zDQ== - dependencies: - boolbase "^1.0.0" - css-what "^5.1.0" - domhandler "^4.3.0" - domutils "^2.8.0" - nth-check "^2.0.1" - -css-vendor@^2.0.8: - version "2.0.8" - resolved "https://registry.yarnpkg.com/css-vendor/-/css-vendor-2.0.8.tgz#e47f91d3bd3117d49180a3c935e62e3d9f7f449d" - integrity sha512-x9Aq0XTInxrkuFeHKbYC7zWY8ai7qJ04Kxd9MnvbC1uO5DagxoHQjm4JvG+vCdXOoFtCjbL2XSZfxmoYa9uQVQ== - dependencies: - "@babel/runtime" "^7.8.3" - is-in-browser "^1.0.2" - -css-what@^5.1.0: - version "5.1.0" - resolved "https://registry.yarnpkg.com/css-what/-/css-what-5.1.0.tgz#3f7b707aadf633baf62c2ceb8579b545bb40f7fe" - integrity sha512-arSMRWIIFY0hV8pIxZMEfmMI47Wj3R/aWpZDDxWYCPEiOMv6tfOrnpDtgxBYPEQD4V0Y/958+1TdC3iWTFcUPw== - -cssesc@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/cssesc/-/cssesc-3.0.0.tgz#37741919903b868565e1c09ea747445cd18983ee" - integrity sha512-/Tb/JcjK111nNScGob5MNtsntNM1aCNUDipB/TkwZFhyDrrE47SOx/18wF2bbjgc3ZzCSKW1T5nt5EbFoAz/Vg== - -csstype@^2.5.2: - version "2.6.20" - resolved "https://registry.yarnpkg.com/csstype/-/csstype-2.6.20.tgz#9229c65ea0b260cf4d3d997cb06288e36a8d6dda" - integrity sha512-/WwNkdXfckNgw6S5R125rrW8ez139lBHWouiBvX8dfMFtcn6V81REDqnH7+CRpRipfYlyU1CmOnOxrmGcFOjeA== - -csstype@^3.0.2: - version "3.0.11" - resolved "https://registry.yarnpkg.com/csstype/-/csstype-3.0.11.tgz#d66700c5eacfac1940deb4e3ee5642792d85cd33" - integrity 
sha512-sa6P2wJ+CAbgyy4KFssIb/JNMLxFvKF1pCYCSXS8ZMuqZnMsrxqI2E5sPyoTpxoPU/gVZMzr2zjOfg8GIZOMsw== - -date-fns@2.x: - version "2.28.0" - resolved "https://registry.yarnpkg.com/date-fns/-/date-fns-2.28.0.tgz#9570d656f5fc13143e50c975a3b6bbeb46cd08b2" - integrity sha512-8d35hViGYx/QH0icHYCeLmsLmMUheMmTyV9Fcm6gvNwdw31yXXH+O85sOBJ+OLnLQMKZowvpKb6FgMIQjcpvQw== - -dayjs@1.x: - version "1.10.8" - resolved "https://registry.yarnpkg.com/dayjs/-/dayjs-1.10.8.tgz#267df4bc6276fcb33c04a6735287e3f429abec41" - integrity sha512-wbNwDfBHHur9UOzNUjeKUOJ0fCb0a52Wx0xInmQ7Y8FstyajiV1NmK1e00cxsr9YrE9r7yAChE0VvpuY5Rnlow== - -debug@2.6.9: - version "2.6.9" - resolved "https://registry.yarnpkg.com/debug/-/debug-2.6.9.tgz#5d128515df134ff327e90a4c93f4e077a536341f" - integrity sha512-bC7ElrdJaJnPbAP+1EotYvqZsb3ecl5wi6Bfi6BJTUcNowp6cvspg0jXznRTKDjm/E7AdgFBVeAPVMNcKGsHMA== - dependencies: - ms "2.0.0" - -debug@^3.1.1: - version "3.2.7" - resolved "https://registry.yarnpkg.com/debug/-/debug-3.2.7.tgz#72580b7e9145fb39b6676f9c5e5fb100b934179a" - integrity sha512-CFjzYYAi4ThfiQvizrFQevTTXHtnCqWfe7x1AhgEscTz6ZbLbfoLRLPugTQyBth6f8ZERVUSyWHFD/7Wu4t1XQ== - dependencies: - ms "^2.1.1" - -debug@^4.1.0: - version "4.3.3" - resolved "https://registry.yarnpkg.com/debug/-/debug-4.3.3.tgz#04266e0b70a98d4462e6e288e38259213332b664" - integrity sha512-/zxw5+vh1Tfv+4Qn7a5nsbcJKPaSvCDhojn6FEl9vupwK2VCSDtEiEtqr8DFtzYFOdz63LBkxec7DYuc2jon6Q== - dependencies: - ms "2.1.2" - -deep-equal@^1.0.1: - version "1.1.1" - resolved "https://registry.yarnpkg.com/deep-equal/-/deep-equal-1.1.1.tgz#b5c98c942ceffaf7cb051e24e1434a25a2e6076a" - integrity sha512-yd9c5AdiqVcR+JjcwUQb9DkhJc8ngNr0MahEBGvDiJw8puWab2yZlh+nkasOnZP+EGTAP6rRp2JzJhJZzvNF8g== - dependencies: - is-arguments "^1.0.4" - is-date-object "^1.0.1" - is-regex "^1.0.4" - object-is "^1.0.1" - object-keys "^1.1.1" - regexp.prototype.flags "^1.2.0" - -default-gateway@^6.0.3: - version "6.0.3" - resolved 
"https://registry.yarnpkg.com/default-gateway/-/default-gateway-6.0.3.tgz#819494c888053bdb743edbf343d6cdf7f2943a71" - integrity sha512-fwSOJsbbNzZ/CUFpqFBqYfYNLj1NbMPm8MMCIzHjC83iSJRBEGmDUxU+WP661BaBQImeC2yHwXtz+P/O9o+XEg== - dependencies: - execa "^5.0.0" - -define-lazy-prop@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/define-lazy-prop/-/define-lazy-prop-2.0.0.tgz#3f7ae421129bcaaac9bc74905c98a0009ec9ee7f" - integrity sha512-Ds09qNh8yw3khSjiJjiUInaGX9xlqZDY7JVryGxdxV7NPeuqQfplOpQ66yJFZut3jLa5zOwkXw1g9EI2uKh4Og== - -define-properties@^1.1.3: - version "1.1.3" - resolved "https://registry.yarnpkg.com/define-properties/-/define-properties-1.1.3.tgz#cf88da6cbee26fe6db7094f61d870cbd84cee9f1" - integrity sha512-3MqfYKj2lLzdMSf8ZIZE/V+Zuy+BgD6f164e8K2w7dgnpKArBDerGYpM46IYYcjnkdPNMjPk9A6VFB8+3SKlXQ== - dependencies: - object-keys "^1.0.12" - -del@^6.0.0: - version "6.0.0" - resolved "https://registry.yarnpkg.com/del/-/del-6.0.0.tgz#0b40d0332cea743f1614f818be4feb717714c952" - integrity sha512-1shh9DQ23L16oXSZKB2JxpL7iMy2E0S9d517ptA1P8iw0alkPtQcrKH7ru31rYtKwF499HkTu+DRzq3TCKDFRQ== - dependencies: - globby "^11.0.1" - graceful-fs "^4.2.4" - is-glob "^4.0.1" - is-path-cwd "^2.2.0" - is-path-inside "^3.0.2" - p-map "^4.0.0" - rimraf "^3.0.2" - slash "^3.0.0" - -depd@~1.1.2: - version "1.1.2" - resolved "https://registry.yarnpkg.com/depd/-/depd-1.1.2.tgz#9bcd52e14c097763e749b274c4346ed2e560b5a9" - integrity sha1-m81S4UwJd2PnSbJ0xDRu0uVgtak= - -destroy@~1.0.4: - version "1.0.4" - resolved "https://registry.yarnpkg.com/destroy/-/destroy-1.0.4.tgz#978857442c44749e4206613e37946205826abd80" - integrity sha1-l4hXRCxEdJ5CBmE+N5RiBYJqvYA= - -detect-node@^2.0.4: - version "2.1.0" - resolved "https://registry.yarnpkg.com/detect-node/-/detect-node-2.1.0.tgz#c9c70775a49c3d03bc2c06d9a73be550f978f8b1" - integrity sha512-T0NIuQpnTvFDATNuHN5roPwSBG83rFsuO+MXXH9/3N1eFbn4wcPjttvjMLEPWJ0RGUYgQE7cGgS3tNxbqCGM7g== - -dir-glob@^3.0.1: - version "3.0.1" - resolved 
"https://registry.yarnpkg.com/dir-glob/-/dir-glob-3.0.1.tgz#56dbf73d992a4a93ba1584f4534063fd2e41717f" - integrity sha512-WkrWp9GR4KXfKGYzOLmTuGVi1UWFfws377n9cc55/tb6DuqyF6pcQ5AbiHEshaDpY9v6oaSr2XCDidGmMwdzIA== - dependencies: - path-type "^4.0.0" - -dns-equal@^1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/dns-equal/-/dns-equal-1.0.0.tgz#b39e7f1da6eb0a75ba9c17324b34753c47e0654d" - integrity sha1-s55/HabrCnW6nBcySzR1PEfgZU0= - -dns-packet@^1.3.1: - version "1.3.4" - resolved "https://registry.yarnpkg.com/dns-packet/-/dns-packet-1.3.4.tgz#e3455065824a2507ba886c55a89963bb107dec6f" - integrity sha512-BQ6F4vycLXBvdrJZ6S3gZewt6rcrks9KBgM9vrhW+knGRqc8uEdT7fuCwloc7nny5xNoMJ17HGH0R/6fpo8ECA== - dependencies: - ip "^1.1.0" - safe-buffer "^5.0.1" - -dns-txt@^2.0.2: - version "2.0.2" - resolved "https://registry.yarnpkg.com/dns-txt/-/dns-txt-2.0.2.tgz#b91d806f5d27188e4ab3e7d107d881a1cc4642b6" - integrity sha1-uR2Ab10nGI5Ks+fRB9iBocxGQrY= - dependencies: - buffer-indexof "^1.0.0" - -dom-align@^1.7.0: - version "1.12.2" - resolved "https://registry.yarnpkg.com/dom-align/-/dom-align-1.12.2.tgz#0f8164ebd0c9c21b0c790310493cd855892acd4b" - integrity sha512-pHuazgqrsTFrGU2WLDdXxCFabkdQDx72ddkraZNih1KsMcN5qsRSTR9O4VJRlwTPCPb5COYg3LOfiMHHcPInHg== - -dom-converter@^0.2.0: - version "0.2.0" - resolved "https://registry.yarnpkg.com/dom-converter/-/dom-converter-0.2.0.tgz#6721a9daee2e293682955b6afe416771627bb768" - integrity sha512-gd3ypIPfOMr9h5jIKq8E3sHOTCjeirnl0WK5ZdS1AW0Odt0b1PaWaHdJ4Qk4klv+YB9aJBS7mESXjFoDQPu6DA== - dependencies: - utila "~0.4" - -dom-helpers@^5.0.1: - version "5.2.1" - resolved "https://registry.yarnpkg.com/dom-helpers/-/dom-helpers-5.2.1.tgz#d9400536b2bf8225ad98fe052e029451ac40e902" - integrity sha512-nRCa7CK3VTrM2NmGkIy4cbK7IZlgBE/PYMn55rrXefr5xXDP0LdtfPnblFDoVdcAfslJ7or6iqAUnx0CCGIWQA== - dependencies: - "@babel/runtime" "^7.8.7" - csstype "^3.0.2" - -dom-serializer@^1.0.1: - version "1.3.2" - resolved 
"https://registry.yarnpkg.com/dom-serializer/-/dom-serializer-1.3.2.tgz#6206437d32ceefaec7161803230c7a20bc1b4d91" - integrity sha512-5c54Bk5Dw4qAxNOI1pFEizPSjVsx5+bpJKmL2kPn8JhBUq2q09tTCa3mjijun2NfK78NMouDYNMBkOrPZiS+ig== - dependencies: - domelementtype "^2.0.1" - domhandler "^4.2.0" - entities "^2.0.0" - -domelementtype@^2.0.1, domelementtype@^2.2.0: - version "2.2.0" - resolved "https://registry.yarnpkg.com/domelementtype/-/domelementtype-2.2.0.tgz#9a0b6c2782ed6a1c7323d42267183df9bd8b1d57" - integrity sha512-DtBMo82pv1dFtUmHyr48beiuq792Sxohr+8Hm9zoxklYPfa6n0Z3Byjj2IV7bmr2IyqClnqEQhfgHJJ5QF0R5A== - -domhandler@^4.0.0, domhandler@^4.2.0, domhandler@^4.3.0: - version "4.3.0" - resolved "https://registry.yarnpkg.com/domhandler/-/domhandler-4.3.0.tgz#16c658c626cf966967e306f966b431f77d4a5626" - integrity sha512-fC0aXNQXqKSFTr2wDNZDhsEYjCiYsDWl3D01kwt25hm1YIPyDGHvvi3rw+PLqHAl/m71MaiF7d5zvBr0p5UB2g== - dependencies: - domelementtype "^2.2.0" - -domutils@^2.5.2, domutils@^2.8.0: - version "2.8.0" - resolved "https://registry.yarnpkg.com/domutils/-/domutils-2.8.0.tgz#4437def5db6e2d1f5d6ee859bd95ca7d02048135" - integrity sha512-w96Cjofp72M5IIhpjgobBimYEfoPjx1Vx0BSX9P30WBdZW2WIKU0T1Bd0kz2eNZ9ikjKgHbEyKx8BB6H1L3h3A== - dependencies: - dom-serializer "^1.0.1" - domelementtype "^2.2.0" - domhandler "^4.2.0" - -dot-case@^3.0.4: - version "3.0.4" - resolved "https://registry.yarnpkg.com/dot-case/-/dot-case-3.0.4.tgz#9b2b670d00a431667a8a75ba29cd1b98809ce751" - integrity sha512-Kv5nKlh6yRrdrGvxeJ2e5y2eRUpkUosIW4A2AS38zwSz27zu7ufDwQPi5Jhs3XAlGNetl3bmnGhQsMtkKJnj3w== - dependencies: - no-case "^3.0.4" - tslib "^2.0.3" - -ee-first@1.1.1: - version "1.1.1" - resolved "https://registry.yarnpkg.com/ee-first/-/ee-first-1.1.1.tgz#590c61156b0ae2f4f0255732a158b266bc56b21d" - integrity sha1-WQxhFWsK4vTwJVcyoViyZrxWsh0= - -electron-to-chromium@^1.4.76: - version "1.4.76" - resolved 
"https://registry.yarnpkg.com/electron-to-chromium/-/electron-to-chromium-1.4.76.tgz#a0494baedaf51094b1c172999919becd9975a934" - integrity sha512-3Vftv7cenJtQb+k00McEBZ2vVmZ/x+HEF7pcZONZIkOsESqAqVuACmBxMv0JhzX7u0YltU0vSqRqgBSTAhFUjA== - -emojis-list@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/emojis-list/-/emojis-list-3.0.0.tgz#5570662046ad29e2e916e71aae260abdff4f6a78" - integrity sha512-/kyM18EfinwXZbno9FyUGeFh87KC8HRQBQGildHZbEuRyWFOmv1U10o9BBp8XVZDVNNuQKyIGIu5ZYAAXJ0V2Q== - -encodeurl@~1.0.2: - version "1.0.2" - resolved "https://registry.yarnpkg.com/encodeurl/-/encodeurl-1.0.2.tgz#ad3ff4c86ec2d029322f5a02c3a9a606c95b3f59" - integrity sha1-rT/0yG7C0CkyL1oCw6mmBslbP1k= - -enhanced-resolve@^4.0.0: - version "4.5.0" - resolved "https://registry.yarnpkg.com/enhanced-resolve/-/enhanced-resolve-4.5.0.tgz#2f3cfd84dbe3b487f18f2db2ef1e064a571ca5ec" - integrity sha512-Nv9m36S/vxpsI+Hc4/ZGRs0n9mXqSWGGq49zxb/cJfPAQMbUtttJAlNPS4AQzaBdw/pKskw5bMbekT/Y7W/Wlg== - dependencies: - graceful-fs "^4.1.2" - memory-fs "^0.5.0" - tapable "^1.0.0" - -enhanced-resolve@^5.9.2: - version "5.9.2" - resolved "https://registry.yarnpkg.com/enhanced-resolve/-/enhanced-resolve-5.9.2.tgz#0224dcd6a43389ebfb2d55efee517e5466772dd9" - integrity sha512-GIm3fQfwLJ8YZx2smuHpBKkXC1yOk+OBEmKckVyL0i/ea8mqDEykK3ld5dgH1QYPNyT/lIllxV2LULnxCHaHkA== - dependencies: - graceful-fs "^4.2.4" - tapable "^2.2.0" - -entities@^2.0.0: - version "2.2.0" - resolved "https://registry.yarnpkg.com/entities/-/entities-2.2.0.tgz#098dc90ebb83d8dffa089d55256b351d34c4da55" - integrity sha512-p92if5Nz619I0w+akJrLZH0MX0Pb5DX39XOwQTtXSdQQOaYH03S1uIQp4mhOZtAXrxq4ViO67YTiLBo2638o9A== - -envinfo@^7.7.3: - version "7.8.1" - resolved "https://registry.yarnpkg.com/envinfo/-/envinfo-7.8.1.tgz#06377e3e5f4d379fea7ac592d5ad8927e0c4d475" - integrity sha512-/o+BXHmB7ocbHEAs6F2EnG0ogybVVUdkRunTT2glZU9XAaGmhqskrvKwqXuDfNjEO0LZKWdejEEpnq8aM0tOaw== - -errno@^0.1.3: - version "0.1.8" - resolved 
"https://registry.yarnpkg.com/errno/-/errno-0.1.8.tgz#8bb3e9c7d463be4976ff888f76b4809ebc2e811f" - integrity sha512-dJ6oBr5SQ1VSd9qkk7ByRgb/1SH4JZjCHSW/mr63/QcXO9zLVxvJ6Oy13nio03rxpSnVDDjFor75SjVeZWPW/A== - dependencies: - prr "~1.0.1" - -es-module-lexer@^0.9.0: - version "0.9.3" - resolved "https://registry.yarnpkg.com/es-module-lexer/-/es-module-lexer-0.9.3.tgz#6f13db00cc38417137daf74366f535c8eb438f19" - integrity sha512-1HQ2M2sPtxwnvOvT1ZClHyQDiggdNjURWpY2we6aMKCQiUVxTmVs2UYPLIrD84sS+kMdUwfBSylbJPwNnBrnHQ== - -escalade@^3.1.1: - version "3.1.1" - resolved "https://registry.yarnpkg.com/escalade/-/escalade-3.1.1.tgz#d8cfdc7000965c5a0174b4a82eaa5c0552742e40" - integrity sha512-k0er2gUkLf8O0zKJiAhmkTnJlTvINGv7ygDNPbeIsX/TJjGJZHuh9B2UxbsaEkmlEo9MfhrSzmhIlhRlI2GXnw== - -escape-html@~1.0.3: - version "1.0.3" - resolved "https://registry.yarnpkg.com/escape-html/-/escape-html-1.0.3.tgz#0258eae4d3d0c0974de1c169188ef0051d1d1988" - integrity sha1-Aljq5NPQwJdN4cFpGI7wBR0dGYg= - -eslint-scope@5.1.1: - version "5.1.1" - resolved "https://registry.yarnpkg.com/eslint-scope/-/eslint-scope-5.1.1.tgz#e786e59a66cb92b3f6c1fb0d508aab174848f48c" - integrity sha512-2NxwbF/hZ0KpepYN0cNbo+FN6XoK7GaHlQhgx/hIZl6Va0bF45RQOOwhLIy8lQDbuCiadSLCBnH2CFYquit5bw== - dependencies: - esrecurse "^4.3.0" - estraverse "^4.1.1" - -esrecurse@^4.3.0: - version "4.3.0" - resolved "https://registry.yarnpkg.com/esrecurse/-/esrecurse-4.3.0.tgz#7ad7964d679abb28bee72cec63758b1c5d2c9921" - integrity sha512-KmfKL3b6G+RXvP8N1vr3Tq1kL/oCFgn2NYXEtqP8/L3pKapUA4G8cFVaoF3SU323CD4XypR/ffioHmkti6/Tag== - dependencies: - estraverse "^5.2.0" - -estraverse@^4.1.1: - version "4.3.0" - resolved "https://registry.yarnpkg.com/estraverse/-/estraverse-4.3.0.tgz#398ad3f3c5a24948be7725e83d11a7de28cdbd1d" - integrity sha512-39nnKffWz8xN1BU/2c79n9nB9HDzo0niYUqx6xyqUnyoAnQyyWpOTdZEeiCch8BBu515t4wp9ZmgVfVhn9EBpw== - -estraverse@^5.2.0: - version "5.3.0" - resolved 
"https://registry.yarnpkg.com/estraverse/-/estraverse-5.3.0.tgz#2eea5290702f26ab8fe5370370ff86c965d21123" - integrity sha512-MMdARuVEQziNTeJD8DgMqmhwR11BRQ/cBP+pLtYdSTnf3MIO8fFeiINEbX36ZdNlfU/7A9f3gUw49B3oQsvwBA== - -etag@~1.8.1: - version "1.8.1" - resolved "https://registry.yarnpkg.com/etag/-/etag-1.8.1.tgz#41ae2eeb65efa62268aebfea83ac7d79299b0887" - integrity sha1-Qa4u62XvpiJorr/qg6x9eSmbCIc= - -eventemitter3@^4.0.0: - version "4.0.7" - resolved "https://registry.yarnpkg.com/eventemitter3/-/eventemitter3-4.0.7.tgz#2de9b68f6528d5644ef5c59526a1b4a07306169f" - integrity sha512-8guHBZCwKnFhYdHr2ysuRWErTwhoN2X8XELRlrRwpmfeY2jjuUN4taQMsULKUVo1K4DvZl+0pgfyoysHxvmvEw== - -events@^3.2.0: - version "3.3.0" - resolved "https://registry.yarnpkg.com/events/-/events-3.3.0.tgz#31a95ad0a924e2d2c419a813aeb2c4e878ea7400" - integrity sha512-mQw+2fkQbALzQ7V0MY0IqdnXNOeTtP4r0lN9z7AAawCXgqea7bDii20AYrIBrFd/Hx0M2Ocz6S111CaFkUcb0Q== - -execa@^5.0.0: - version "5.1.1" - resolved "https://registry.yarnpkg.com/execa/-/execa-5.1.1.tgz#f80ad9cbf4298f7bd1d4c9555c21e93741c411dd" - integrity sha512-8uSpZZocAZRBAPIEINJj3Lo9HyGitllczc27Eh5YYojjMFMn8yHMDMaUHE2Jqfq05D/wucwI4JGURyXt1vchyg== - dependencies: - cross-spawn "^7.0.3" - get-stream "^6.0.0" - human-signals "^2.1.0" - is-stream "^2.0.0" - merge-stream "^2.0.0" - npm-run-path "^4.0.1" - onetime "^5.1.2" - signal-exit "^3.0.3" - strip-final-newline "^2.0.0" - -express@^4.17.1: - version "4.17.3" - resolved "https://registry.yarnpkg.com/express/-/express-4.17.3.tgz#f6c7302194a4fb54271b73a1fe7a06478c8f85a1" - integrity sha512-yuSQpz5I+Ch7gFrPCk4/c+dIBKlQUxtgwqzph132bsT6qhuzss6I8cLJQz7B3rFblzd6wtcI0ZbGltH/C4LjUg== - dependencies: - accepts "~1.3.8" - array-flatten "1.1.1" - body-parser "1.19.2" - content-disposition "0.5.4" - content-type "~1.0.4" - cookie "0.4.2" - cookie-signature "1.0.6" - debug "2.6.9" - depd "~1.1.2" - encodeurl "~1.0.2" - escape-html "~1.0.3" - etag "~1.8.1" - finalhandler "~1.1.2" - fresh "0.5.2" - merge-descriptors 
"1.0.1" - methods "~1.1.2" - on-finished "~2.3.0" - parseurl "~1.3.3" - path-to-regexp "0.1.7" - proxy-addr "~2.0.7" - qs "6.9.7" - range-parser "~1.2.1" - safe-buffer "5.2.1" - send "0.17.2" - serve-static "1.14.2" - setprototypeof "1.2.0" - statuses "~1.5.0" - type-is "~1.6.18" - utils-merge "1.0.1" - vary "~1.1.2" - -fast-deep-equal@^3.1.1, fast-deep-equal@^3.1.3: - version "3.1.3" - resolved "https://registry.yarnpkg.com/fast-deep-equal/-/fast-deep-equal-3.1.3.tgz#3a7d56b559d6cbc3eb512325244e619a65c6c525" - integrity sha512-f3qQ9oQy9j2AhBe/H9VC91wLmKBCCU/gDOnKNAYG5hswO7BLKj09Hc5HYNz9cGI++xlpDCIgDaitVs03ATR84Q== - -fast-glob@^3.2.9: - version "3.2.11" - resolved "https://registry.yarnpkg.com/fast-glob/-/fast-glob-3.2.11.tgz#a1172ad95ceb8a16e20caa5c5e56480e5129c1d9" - integrity sha512-xrO3+1bxSo3ZVHAnqzyuewYT6aMFHRAd4Kcs92MAonjwQZLsK9d0SF1IyQ3k5PoirxTW0Oe/RqFgMQ6TcNE5Ew== - dependencies: - "@nodelib/fs.stat" "^2.0.2" - "@nodelib/fs.walk" "^1.2.3" - glob-parent "^5.1.2" - merge2 "^1.3.0" - micromatch "^4.0.4" - -fast-json-stable-stringify@^2.0.0: - version "2.1.0" - resolved "https://registry.yarnpkg.com/fast-json-stable-stringify/-/fast-json-stable-stringify-2.1.0.tgz#874bf69c6f404c2b5d99c481341399fd55892633" - integrity sha512-lhd/wF+Lk98HZoTCtlVraHtfh5XYijIjalXck7saUtuanSDyLMxnHhSXEDJqHxD7msR8D0uCmqlkwjCV8xvwHw== - -fastest-levenshtein@^1.0.12: - version "1.0.12" - resolved "https://registry.yarnpkg.com/fastest-levenshtein/-/fastest-levenshtein-1.0.12.tgz#9990f7d3a88cc5a9ffd1f1745745251700d497e2" - integrity sha512-On2N+BpYJ15xIC974QNVuYGMOlEVt4s0EOI3wwMqOmK1fdDY+FN/zltPV8vosq4ad4c/gJ1KHScUn/6AWIgiow== - -fastq@^1.6.0: - version "1.13.0" - resolved "https://registry.yarnpkg.com/fastq/-/fastq-1.13.0.tgz#616760f88a7526bdfc596b7cab8c18938c36b98c" - integrity sha512-YpkpUnK8od0o1hmeSc7UUs/eB/vIPWJYjKck2QKIzAf71Vm1AAQ3EbuZB3g2JIy+pg+ERD0vqI79KyZiB2e2Nw== - dependencies: - reusify "^1.0.4" - -faye-websocket@^0.11.3: - version "0.11.4" - resolved 
"https://registry.yarnpkg.com/faye-websocket/-/faye-websocket-0.11.4.tgz#7f0d9275cfdd86a1c963dc8b65fcc451edcbb1da" - integrity sha512-CzbClwlXAuiRQAlUyfqPgvPoNKTckTPGfwZV4ZdAhVcP2lh9KUxJg2b5GkE7XbjKQ3YJnQ9z6D9ntLAlB+tP8g== - dependencies: - websocket-driver ">=0.5.1" - -fill-range@^7.0.1: - version "7.0.1" - resolved "https://registry.yarnpkg.com/fill-range/-/fill-range-7.0.1.tgz#1919a6a7c75fe38b2c7c77e5198535da9acdda40" - integrity sha512-qOo9F+dMUmC2Lcb4BbVvnKJxTPjCm+RRpe4gDuGrzkL7mEVl/djYSu2OdQ2Pa302N4oqkSg9ir6jaLWJ2USVpQ== - dependencies: - to-regex-range "^5.0.1" - -finalhandler@~1.1.2: - version "1.1.2" - resolved "https://registry.yarnpkg.com/finalhandler/-/finalhandler-1.1.2.tgz#b7e7d000ffd11938d0fdb053506f6ebabe9f587d" - integrity sha512-aAWcW57uxVNrQZqFXjITpW3sIUQmHGG3qSb9mUah9MgMC4NeWhNOlNjXEYq3HjRAvL6arUviZGGJsBg6z0zsWA== - dependencies: - debug "2.6.9" - encodeurl "~1.0.2" - escape-html "~1.0.3" - on-finished "~2.3.0" - parseurl "~1.3.3" - statuses "~1.5.0" - unpipe "~1.0.0" - -find-up@^4.0.0: - version "4.1.0" - resolved "https://registry.yarnpkg.com/find-up/-/find-up-4.1.0.tgz#97afe7d6cdc0bc5928584b7c8d7b16e8a9aa5d19" - integrity sha512-PpOwAdQ/YlXQ2vj8a3h8IipDuYRi3wceVQQGYWxNINccq40Anw7BlsEXCMbt1Zt+OLA6Fq9suIpIWD0OsnISlw== - dependencies: - locate-path "^5.0.0" - path-exists "^4.0.0" - -flow-bin@^0.118.0: - version "0.118.0" - resolved "https://registry.yarnpkg.com/flow-bin/-/flow-bin-0.118.0.tgz#fb706364a58c682d67a2ca7df39396467dc397d1" - integrity sha512-jlbUu0XkbpXeXhan5xyTqVK1jmEKNxE8hpzznI3TThHTr76GiFwK0iRzhDo4KNy+S9h/KxHaqVhTP86vA6wHCg== - -follow-redirects@^1.0.0: - version "1.14.9" - resolved "https://registry.yarnpkg.com/follow-redirects/-/follow-redirects-1.14.9.tgz#dd4ea157de7bfaf9ea9b3fbd85aa16951f78d8d7" - integrity sha512-MQDfihBQYMcyy5dhRDJUHcw7lb2Pv/TuE6xP1vyraLukNDHKbDxDNaOE3NbCAdKQApno+GPRyo1YAp89yCjK4w== - -forwarded@0.2.0: - version "0.2.0" - resolved 
"https://registry.yarnpkg.com/forwarded/-/forwarded-0.2.0.tgz#2269936428aad4c15c7ebe9779a84bf0b2a81811" - integrity sha512-buRG0fpBtRHSTCOASe6hD258tEubFoRLb4ZNA6NxMVHNw2gOcwHo9wyablzMzOA5z9xA9L1KNjk/Nt6MT9aYow== - -fresh@0.5.2: - version "0.5.2" - resolved "https://registry.yarnpkg.com/fresh/-/fresh-0.5.2.tgz#3d8cadd90d976569fa835ab1f8e4b23a105605a7" - integrity sha1-PYyt2Q2XZWn6g1qx+OSyOhBWBac= - -fs-monkey@1.0.3: - version "1.0.3" - resolved "https://registry.yarnpkg.com/fs-monkey/-/fs-monkey-1.0.3.tgz#ae3ac92d53bb328efe0e9a1d9541f6ad8d48e2d3" - integrity sha512-cybjIfiiE+pTWicSCLFHSrXZ6EilF30oh91FDP9S2B051prEa7QWfrVTQm10/dDpswBDXZugPa1Ogu8Yh+HV0Q== - -fs.realpath@^1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/fs.realpath/-/fs.realpath-1.0.0.tgz#1504ad2523158caa40db4a2787cb01411994ea4f" - integrity sha1-FQStJSMVjKpA20onh8sBQRmU6k8= - -fsevents@~2.3.2: - version "2.3.2" - resolved "https://registry.yarnpkg.com/fsevents/-/fsevents-2.3.2.tgz#8a526f78b8fdf4623b709e0b975c52c24c02fd1a" - integrity sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA== - -function-bind@^1.1.1: - version "1.1.1" - resolved "https://registry.yarnpkg.com/function-bind/-/function-bind-1.1.1.tgz#a56899d3ea3c9bab874bb9773b7c5ede92f4895d" - integrity sha512-yIovAzMX49sF8Yl58fSCWJ5svSLuaibPxXQJFLmBObTuCr0Mf1KiPopGM9NiFjiYBCbfaa2Fh6breQ6ANVTI0A== - -get-intrinsic@^1.0.2: - version "1.1.1" - resolved "https://registry.yarnpkg.com/get-intrinsic/-/get-intrinsic-1.1.1.tgz#15f59f376f855c446963948f0d24cd3637b4abc6" - integrity sha512-kWZrnVM42QCiEA2Ig1bG8zjoIMOgxWwYCEeNdwY6Tv/cOSeGpcoX4pXHfKUxNKVoArnrEr2e9srnAxxGIraS9Q== - dependencies: - function-bind "^1.1.1" - has "^1.0.3" - has-symbols "^1.0.1" - -get-stream@^6.0.0: - version "6.0.1" - resolved "https://registry.yarnpkg.com/get-stream/-/get-stream-6.0.1.tgz#a262d8eef67aced57c2852ad6167526a43cbf7b7" - integrity 
sha512-ts6Wi+2j3jQjqi70w5AlN8DFnkSwC+MqmxEzdEALB2qXZYV3X/b1CTfgPLGJNMeAWxdPfU8FO1ms3NUfaHCPYg== - -glob-parent@^5.1.2, glob-parent@~5.1.2: - version "5.1.2" - resolved "https://registry.yarnpkg.com/glob-parent/-/glob-parent-5.1.2.tgz#869832c58034fe68a4093c17dc15e8340d8401c4" - integrity sha512-AOIgSQCepiJYwP3ARnGx+5VnTu2HBYdzbGP45eLw1vr3zB3vZLeyed1sC9hnbcOc9/SrMyM5RPQrkGz4aS9Zow== - dependencies: - is-glob "^4.0.1" - -glob-to-regexp@^0.4.1: - version "0.4.1" - resolved "https://registry.yarnpkg.com/glob-to-regexp/-/glob-to-regexp-0.4.1.tgz#c75297087c851b9a578bd217dd59a92f59fe546e" - integrity sha512-lkX1HJXwyMcprw/5YUZc2s7DrpAiHB21/V+E1rHUrVNokkvB6bqMzT0VfV6/86ZNabt1k14YOIaT7nDvOX3Iiw== - -glob@^7.1.3: - version "7.2.0" - resolved "https://registry.yarnpkg.com/glob/-/glob-7.2.0.tgz#d15535af7732e02e948f4c41628bd910293f6023" - integrity sha512-lmLf6gtyrPq8tTjSmrO94wBeQbFR3HbLHbuyD69wuyQkImp2hWqMGB47OX65FBkPffO641IP9jWa1z4ivqG26Q== - dependencies: - fs.realpath "^1.0.0" - inflight "^1.0.4" - inherits "2" - minimatch "^3.0.4" - once "^1.3.0" - path-is-absolute "^1.0.0" - -globby@^11.0.1: - version "11.1.0" - resolved "https://registry.yarnpkg.com/globby/-/globby-11.1.0.tgz#bd4be98bb042f83d796f7e3811991fbe82a0d34b" - integrity sha512-jhIXaOzy1sb8IyocaruWSn1TjmnBVs8Ayhcy83rmxNJ8q2uWKCAj3CnJY+KpGSXCueAPc0i05kVvVKtP1t9S3g== - dependencies: - array-union "^2.1.0" - dir-glob "^3.0.1" - fast-glob "^3.2.9" - ignore "^5.2.0" - merge2 "^1.4.1" - slash "^3.0.0" - -graceful-fs@^4.1.2, graceful-fs@^4.2.4, graceful-fs@^4.2.6, graceful-fs@^4.2.9: - version "4.2.9" - resolved "https://registry.yarnpkg.com/graceful-fs/-/graceful-fs-4.2.9.tgz#041b05df45755e587a24942279b9d113146e1c96" - integrity sha512-NtNxqUcXgpW2iMrfqSfR73Glt39K+BLwWsPs94yR63v45T0Wbej7eRmL5cWfwEgqXnmjQp3zaJTshdRW/qC2ZQ== - -handle-thing@^2.0.0: - version "2.0.1" - resolved "https://registry.yarnpkg.com/handle-thing/-/handle-thing-2.0.1.tgz#857f79ce359580c340d43081cc648970d0bb234e" - integrity 
sha512-9Qn4yBxelxoh2Ow62nP+Ka/kMnOXRi8BXnRaUwezLNhqelnN49xKz4F/dPP8OYLxLxq6JDtZb2i9XznUQbNPTg== - -has-flag@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/has-flag/-/has-flag-4.0.0.tgz#944771fd9c81c81265c4d6941860da06bb59479b" - integrity sha512-EykJT/Q1KjTWctppgIAgfSO0tKVuZUjhgMr17kqTumMl6Afv3EISleU7qZUzoXDFTAHTDC4NOoG/ZxU3EvlMPQ== - -has-symbols@^1.0.1, has-symbols@^1.0.2: - version "1.0.3" - resolved "https://registry.yarnpkg.com/has-symbols/-/has-symbols-1.0.3.tgz#bb7b2c4349251dce87b125f7bdf874aa7c8b39f8" - integrity sha512-l3LCuF6MgDNwTDKkdYGEihYjt5pRPbEg46rtlmnSPlUbgmB8LOIrKJbYYFBSbnPaJexMKtiPO8hmeRjRz2Td+A== - -has-tostringtag@^1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/has-tostringtag/-/has-tostringtag-1.0.0.tgz#7e133818a7d394734f941e73c3d3f9291e658b25" - integrity sha512-kFjcSNhnlGV1kyoGk7OXKSawH5JOb/LzUc5w9B02hOTO0dfFRjbHQKvg1d6cf3HbeUmtU9VbbV3qzZ2Teh97WQ== - dependencies: - has-symbols "^1.0.2" - -has@^1.0.3: - version "1.0.3" - resolved "https://registry.yarnpkg.com/has/-/has-1.0.3.tgz#722d7cbfc1f6aa8241f16dd814e011e1f41e8796" - integrity sha512-f2dvO0VU6Oej7RkWJGrehjbzMAjFp5/VKPp5tTpWIV4JHHZK1/BxbFRtf/siA2SWTe09caDmVtYYzWEIbBS4zw== - dependencies: - function-bind "^1.1.1" - -he@^1.2.0: - version "1.2.0" - resolved "https://registry.yarnpkg.com/he/-/he-1.2.0.tgz#84ae65fa7eafb165fddb61566ae14baf05664f0f" - integrity sha512-F/1DnUGPopORZi0ni+CvrCgHQ5FyEAHRLSApuYWMmrbSwoN2Mn/7k+Gl38gJnR7yyDZk6WLXwiGod1JOWNDKGw== - -hoist-non-react-statics@^3.3.2: - version "3.3.2" - resolved "https://registry.yarnpkg.com/hoist-non-react-statics/-/hoist-non-react-statics-3.3.2.tgz#ece0acaf71d62c2969c2ec59feff42a4b1a85b45" - integrity sha512-/gGivxi8JPKWNm/W0jSmzcMPpfpPLc3dY/6GxhX2hQ9iGj3aDfklV4ET7NjKpSinLpJ5vafa9iiGIEZg10SfBw== - dependencies: - react-is "^16.7.0" - -hpack.js@^2.1.6: - version "2.1.6" - resolved "https://registry.yarnpkg.com/hpack.js/-/hpack.js-2.1.6.tgz#87774c0949e513f42e84575b3c45681fade2a0b2" - integrity 
sha1-h3dMCUnlE/QuhFdbPEVoH63ioLI= - dependencies: - inherits "^2.0.1" - obuf "^1.0.0" - readable-stream "^2.0.1" - wbuf "^1.1.0" - -html-entities@^2.3.2: - version "2.3.2" - resolved "https://registry.yarnpkg.com/html-entities/-/html-entities-2.3.2.tgz#760b404685cb1d794e4f4b744332e3b00dcfe488" - integrity sha512-c3Ab/url5ksaT0WyleslpBEthOzWhrjQbg75y7XUsfSzi3Dgzt0l8w5e7DylRn15MTlMMD58dTfzddNS2kcAjQ== - -html-minifier-terser@^6.0.2: - version "6.1.0" - resolved "https://registry.yarnpkg.com/html-minifier-terser/-/html-minifier-terser-6.1.0.tgz#bfc818934cc07918f6b3669f5774ecdfd48f32ab" - integrity sha512-YXxSlJBZTP7RS3tWnQw74ooKa6L9b9i9QYXY21eUEvhZ3u9XLfv6OnFsQq6RxkhHygsaUMvYsZRV5rU/OVNZxw== - dependencies: - camel-case "^4.1.2" - clean-css "^5.2.2" - commander "^8.3.0" - he "^1.2.0" - param-case "^3.0.4" - relateurl "^0.2.7" - terser "^5.10.0" - -html-webpack-plugin@^5.3.1: - version "5.5.0" - resolved "https://registry.yarnpkg.com/html-webpack-plugin/-/html-webpack-plugin-5.5.0.tgz#c3911936f57681c1f9f4d8b68c158cd9dfe52f50" - integrity sha512-sy88PC2cRTVxvETRgUHFrL4No3UxvcH8G1NepGhqaTT+GXN2kTamqasot0inS5hXeg1cMbFDt27zzo9p35lZVw== - dependencies: - "@types/html-minifier-terser" "^6.0.0" - html-minifier-terser "^6.0.2" - lodash "^4.17.21" - pretty-error "^4.0.0" - tapable "^2.0.0" - -htmlparser2@^6.1.0: - version "6.1.0" - resolved "https://registry.yarnpkg.com/htmlparser2/-/htmlparser2-6.1.0.tgz#c4d762b6c3371a05dbe65e94ae43a9f845fb8fb7" - integrity sha512-gyyPk6rgonLFEDGoeRgQNaEUvdJ4ktTmmUh/h2t7s+M8oPpIPxgNACWa+6ESR57kXstwqPiCut0V8NRpcwgU7A== - dependencies: - domelementtype "^2.0.1" - domhandler "^4.0.0" - domutils "^2.5.2" - entities "^2.0.0" - -http-deceiver@^1.2.7: - version "1.2.7" - resolved "https://registry.yarnpkg.com/http-deceiver/-/http-deceiver-1.2.7.tgz#fa7168944ab9a519d337cb0bec7284dc3e723d87" - integrity sha1-+nFolEq5pRnTN8sL7HKE3D5yPYc= - -http-errors@1.8.1: - version "1.8.1" - resolved 
"https://registry.yarnpkg.com/http-errors/-/http-errors-1.8.1.tgz#7c3f28577cbc8a207388455dbd62295ed07bd68c" - integrity sha512-Kpk9Sm7NmI+RHhnj6OIWDI1d6fIoFAtFt9RLaTMRlg/8w49juAStsrBgp0Dp4OdxdVbRIeKhtCUvoi/RuAhO4g== - dependencies: - depd "~1.1.2" - inherits "2.0.4" - setprototypeof "1.2.0" - statuses ">= 1.5.0 < 2" - toidentifier "1.0.1" - -http-errors@~1.6.2: - version "1.6.3" - resolved "https://registry.yarnpkg.com/http-errors/-/http-errors-1.6.3.tgz#8b55680bb4be283a0b5bf4ea2e38580be1d9320d" - integrity sha1-i1VoC7S+KDoLW/TqLjhYC+HZMg0= - dependencies: - depd "~1.1.2" - inherits "2.0.3" - setprototypeof "1.1.0" - statuses ">= 1.4.0 < 2" - -http-parser-js@>=0.5.1: - version "0.5.6" - resolved "https://registry.yarnpkg.com/http-parser-js/-/http-parser-js-0.5.6.tgz#2e02406ab2df8af8a7abfba62e0da01c62b95afd" - integrity sha512-vDlkRPDJn93swjcjqMSaGSPABbIarsr1TLAui/gLDXzV5VsJNdXNzMYDyNBLQkjWQCJ1uizu8T2oDMhmGt0PRA== - -http-proxy-middleware@^2.0.0: - version "2.0.3" - resolved "https://registry.yarnpkg.com/http-proxy-middleware/-/http-proxy-middleware-2.0.3.tgz#5df04f69a89f530c2284cd71eeaa51ba52243289" - integrity sha512-1bloEwnrHMnCoO/Gcwbz7eSVvW50KPES01PecpagI+YLNLci4AcuKJrujW4Mc3sBLpFxMSlsLNHS5Nl/lvrTPA== - dependencies: - "@types/http-proxy" "^1.17.8" - http-proxy "^1.18.1" - is-glob "^4.0.1" - is-plain-obj "^3.0.0" - micromatch "^4.0.2" - -http-proxy@^1.18.1: - version "1.18.1" - resolved "https://registry.yarnpkg.com/http-proxy/-/http-proxy-1.18.1.tgz#401541f0534884bbf95260334e72f88ee3976549" - integrity sha512-7mz/721AbnJwIVbnaSv1Cz3Am0ZLT/UBwkC92VlxhXv/k/BBQfM2fXElQNC27BVGr0uwUpplYPQM9LnaBMR5NQ== - dependencies: - eventemitter3 "^4.0.0" - follow-redirects "^1.0.0" - requires-port "^1.0.0" - -human-signals@^2.1.0: - version "2.1.0" - resolved "https://registry.yarnpkg.com/human-signals/-/human-signals-2.1.0.tgz#dc91fcba42e4d06e4abaed33b3e7a3c02f514ea0" - integrity sha512-B4FFZ6q/T2jhhksgkbEW3HBvWIfDW85snkQgawt07S7J5QXTk6BkNV+0yAeZrM5QpMAdYlocGoljn0sJ/WQkFw== - 
-hyphenate-style-name@^1.0.3: - version "1.0.4" - resolved "https://registry.yarnpkg.com/hyphenate-style-name/-/hyphenate-style-name-1.0.4.tgz#691879af8e220aea5750e8827db4ef62a54e361d" - integrity sha512-ygGZLjmXfPHj+ZWh6LwbC37l43MhfztxetbFCoYTM2VjkIUpeHgSNn7QIyVFj7YQ1Wl9Cbw5sholVJPzWvC2MQ== - -iconv-lite@0.4.24: - version "0.4.24" - resolved "https://registry.yarnpkg.com/iconv-lite/-/iconv-lite-0.4.24.tgz#2022b4b25fbddc21d2f524974a474aafe733908b" - integrity sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA== - dependencies: - safer-buffer ">= 2.1.2 < 3" - -icss-utils@^5.0.0, icss-utils@^5.1.0: - version "5.1.0" - resolved "https://registry.yarnpkg.com/icss-utils/-/icss-utils-5.1.0.tgz#c6be6858abd013d768e98366ae47e25d5887b1ae" - integrity sha512-soFhflCVWLfRNOPU3iv5Z9VUdT44xFRbzjLsEzSr5AQmgqPMTHdU3PMT1Cf1ssx8fLNJDA1juftYl+PUcv3MqA== - -ignore@^5.2.0: - version "5.2.0" - resolved "https://registry.yarnpkg.com/ignore/-/ignore-5.2.0.tgz#6d3bac8fa7fe0d45d9f9be7bac2fc279577e345a" - integrity sha512-CmxgYGiEPCLhfLnpPp1MoRmifwEIOgjcHXxOBjv7mY96c+eWScsOP9c112ZyLdWHi0FxHjI+4uVhKYp/gcdRmQ== - -import-local@^3.0.2: - version "3.1.0" - resolved "https://registry.yarnpkg.com/import-local/-/import-local-3.1.0.tgz#b4479df8a5fd44f6cdce24070675676063c95cb4" - integrity sha512-ASB07uLtnDs1o6EHjKpX34BKYDSqnFerfTOJL2HvMqF70LnxpjkzDB8J44oT9pu4AMPkQwf8jl6szgvNd2tRIg== - dependencies: - pkg-dir "^4.2.0" - resolve-cwd "^3.0.0" - -indent-string@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/indent-string/-/indent-string-4.0.0.tgz#624f8f4497d619b2d9768531d58f4122854d7251" - integrity sha512-EdDDZu4A2OyIK7Lr/2zG+w5jmbuk1DVBnEwREQvBzspBJkCEbRa8GxU1lghYcaGJCnRWibjDXlq779X1/y5xwg== - -inflight@^1.0.4: - version "1.0.6" - resolved "https://registry.yarnpkg.com/inflight/-/inflight-1.0.6.tgz#49bd6331d7d02d0c09bc910a1075ba8165b56df9" - integrity sha1-Sb1jMdfQLQwJvJEKEHW6gWW1bfk= - dependencies: - once "^1.3.0" - wrappy "1" - 
-inherits@2, inherits@2.0.4, inherits@^2.0.1, inherits@^2.0.3, inherits@~2.0.3: - version "2.0.4" - resolved "https://registry.yarnpkg.com/inherits/-/inherits-2.0.4.tgz#0fa2c64f932917c3433a0ded55363aae37416b7c" - integrity sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ== - -inherits@2.0.3: - version "2.0.3" - resolved "https://registry.yarnpkg.com/inherits/-/inherits-2.0.3.tgz#633c2c83e3da42a502f52466022480f4208261de" - integrity sha1-Yzwsg+PaQqUC9SRmAiSA9CCCYd4= - -inline-chunk-html-plugin@^1.1.1: - version "1.1.1" - resolved "https://registry.yarnpkg.com/inline-chunk-html-plugin/-/inline-chunk-html-plugin-1.1.1.tgz#f64111aed16fac274d2b929f6a6a08671d82354e" - integrity sha512-6W1eGIj8z/Yla6xJx5il6jJfCxMZS3kVkbiLQThbbjdsDLRIWkUVmpnhfW2l6WAwCW+qfy0zoXVGBZM1E5XF3g== - -interpret@^2.2.0: - version "2.2.0" - resolved "https://registry.yarnpkg.com/interpret/-/interpret-2.2.0.tgz#1a78a0b5965c40a5416d007ad6f50ad27c417df9" - integrity sha512-Ju0Bz/cEia55xDwUWEa8+olFpCiQoypjnQySseKtmjNrnps3P+xfpUmGr90T7yjlVJmOtybRvPXhKMbHr+fWnw== - -ip@^1.1.0: - version "1.1.5" - resolved "https://registry.yarnpkg.com/ip/-/ip-1.1.5.tgz#bdded70114290828c0a039e72ef25f5aaec4354a" - integrity sha1-vd7XARQpCCjAoDnnLvJfWq7ENUo= - -ipaddr.js@1.9.1: - version "1.9.1" - resolved "https://registry.yarnpkg.com/ipaddr.js/-/ipaddr.js-1.9.1.tgz#bff38543eeb8984825079ff3a2a8e6cbd46781b3" - integrity sha512-0KI/607xoxSToH7GjN1FfSbLoU0+btTicjsQSWQlh/hZykN8KpmMf7uYwPW3R+akZ6R/w18ZlXSHBYXiYUPO3g== - -ipaddr.js@^2.0.1: - version "2.0.1" - resolved "https://registry.yarnpkg.com/ipaddr.js/-/ipaddr.js-2.0.1.tgz#eca256a7a877e917aeb368b0a7497ddf42ef81c0" - integrity sha512-1qTgH9NG+IIJ4yfKs2e6Pp1bZg8wbDbKHT21HrLIeYBTRLgMYKnMTPAuI3Lcs61nfx5h1xlXnbJtH1kX5/d/ng== - -is-arguments@^1.0.4: - version "1.1.1" - resolved "https://registry.yarnpkg.com/is-arguments/-/is-arguments-1.1.1.tgz#15b3f88fda01f2a97fec84ca761a560f123efa9b" - integrity 
sha512-8Q7EARjzEnKpt/PCD7e1cgUS0a6X8u5tdSiMqXhojOdoV9TsMsiO+9VLC5vAmO8N7/GmXn7yjR8qnA6bVAEzfA== - dependencies: - call-bind "^1.0.2" - has-tostringtag "^1.0.0" - -is-binary-path@~2.1.0: - version "2.1.0" - resolved "https://registry.yarnpkg.com/is-binary-path/-/is-binary-path-2.1.0.tgz#ea1f7f3b80f064236e83470f86c09c254fb45b09" - integrity sha512-ZMERYes6pDydyuGidse7OsHxtbI7WVeUEozgR/g7rd0xUimYNlvZRE/K2MgZTjWy725IfelLeVcEM97mmtRGXw== - dependencies: - binary-extensions "^2.0.0" - -is-core-module@^2.8.1: - version "2.8.1" - resolved "https://registry.yarnpkg.com/is-core-module/-/is-core-module-2.8.1.tgz#f59fdfca701d5879d0a6b100a40aa1560ce27211" - integrity sha512-SdNCUs284hr40hFTFP6l0IfZ/RSrMXF3qgoRHd3/79unUTvrFO/JoXwkGm+5J/Oe3E/b5GsnG330uUNgRpu1PA== - dependencies: - has "^1.0.3" - -is-date-object@^1.0.1: - version "1.0.5" - resolved "https://registry.yarnpkg.com/is-date-object/-/is-date-object-1.0.5.tgz#0841d5536e724c25597bf6ea62e1bd38298df31f" - integrity sha512-9YQaSxsAiSwcvS33MBk3wTCVnWK+HhF8VZR2jRxehM16QcVOdHqPn4VPHmRK4lSr38n9JriurInLcP90xsYNfQ== - dependencies: - has-tostringtag "^1.0.0" - -is-docker@^2.0.0, is-docker@^2.1.1: - version "2.2.1" - resolved "https://registry.yarnpkg.com/is-docker/-/is-docker-2.2.1.tgz#33eeabe23cfe86f14bde4408a02c0cfb853acdaa" - integrity sha512-F+i2BKsFrH66iaUFc0woD8sLy8getkwTwtOBjvs56Cx4CgJDeKQeqfz8wAYiSb8JOprWhHH5p77PbmYCvvUuXQ== - -is-extglob@^2.1.1: - version "2.1.1" - resolved "https://registry.yarnpkg.com/is-extglob/-/is-extglob-2.1.1.tgz#a88c02535791f02ed37c76a1b9ea9773c833f8c2" - integrity sha1-qIwCU1eR8C7TfHahueqXc8gz+MI= - -is-glob@^4.0.1, is-glob@~4.0.1: - version "4.0.3" - resolved "https://registry.yarnpkg.com/is-glob/-/is-glob-4.0.3.tgz#64f61e42cbbb2eec2071a9dac0b28ba1e65d5084" - integrity sha512-xelSayHH36ZgE7ZWhli7pW34hNbNl8Ojv5KVmkJD4hBdD3th8Tfk9vYasLM+mXWOZhFkgZfxhLSnrwRr4elSSg== - dependencies: - is-extglob "^2.1.1" - -is-in-browser@^1.0.2, is-in-browser@^1.1.3: - version "1.1.3" - resolved 
"https://registry.yarnpkg.com/is-in-browser/-/is-in-browser-1.1.3.tgz#56ff4db683a078c6082eb95dad7dc62e1d04f835" - integrity sha1-Vv9NtoOgeMYILrldrX3GLh0E+DU= - -is-number@^7.0.0: - version "7.0.0" - resolved "https://registry.yarnpkg.com/is-number/-/is-number-7.0.0.tgz#7535345b896734d5f80c4d06c50955527a14f12b" - integrity sha512-41Cifkg6e8TylSpdtTpeLVMqvSBEVzTttHvERD741+pnZ8ANv0004MRL43QKPDlK9cGvNp6NZWZUBlbGXYxxng== - -is-path-cwd@^2.2.0: - version "2.2.0" - resolved "https://registry.yarnpkg.com/is-path-cwd/-/is-path-cwd-2.2.0.tgz#67d43b82664a7b5191fd9119127eb300048a9fdb" - integrity sha512-w942bTcih8fdJPJmQHFzkS76NEP8Kzzvmw92cXsazb8intwLqPibPPdXf4ANdKV3rYMuuQYGIWtvz9JilB3NFQ== - -is-path-inside@^3.0.2: - version "3.0.3" - resolved "https://registry.yarnpkg.com/is-path-inside/-/is-path-inside-3.0.3.tgz#d231362e53a07ff2b0e0ea7fed049161ffd16283" - integrity sha512-Fd4gABb+ycGAmKou8eMftCupSir5lRxqf4aD/vd0cD2qc4HL07OjCeuHMr8Ro4CoMaeCKDB0/ECBOVWjTwUvPQ== - -is-plain-obj@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/is-plain-obj/-/is-plain-obj-3.0.0.tgz#af6f2ea14ac5a646183a5bbdb5baabbc156ad9d7" - integrity sha512-gwsOE28k+23GP1B6vFl1oVh/WOzmawBrKwo5Ev6wMKzPkaXaCDIQKzLnvsA42DRlbVTWorkgTKIviAKCWkfUwA== - -is-plain-object@^2.0.4: - version "2.0.4" - resolved "https://registry.yarnpkg.com/is-plain-object/-/is-plain-object-2.0.4.tgz#2c163b3fafb1b606d9d17928f05c2a1c38e07677" - integrity sha512-h5PpgXkWitc38BBMYawTYMWJHFZJVnBquFE57xFpjB8pJFiF6gZ+bU+WyI/yqXiFR5mdLsgYNaPe8uao6Uv9Og== - dependencies: - isobject "^3.0.1" - -is-regex@^1.0.4: - version "1.1.4" - resolved "https://registry.yarnpkg.com/is-regex/-/is-regex-1.1.4.tgz#eef5663cd59fa4c0ae339505323df6854bb15958" - integrity sha512-kvRdxDsxZjhzUX07ZnLydzS1TU/TJlTUHHY4YLL87e37oUA49DfkLqgy+VjFocowy29cKvcSiu+kIv728jTTVg== - dependencies: - call-bind "^1.0.2" - has-tostringtag "^1.0.0" - -is-stream@^2.0.0: - version "2.0.1" - resolved 
"https://registry.yarnpkg.com/is-stream/-/is-stream-2.0.1.tgz#fac1e3d53b97ad5a9d0ae9cef2389f5810a5c077" - integrity sha512-hFoiJiTl63nn+kstHGBtewWSKnQLpyb155KHheA1l39uvtO9nWIop1p3udqPcUd/xbF1VLMO4n7OI6p7RbngDg== - -is-wsl@^2.2.0: - version "2.2.0" - resolved "https://registry.yarnpkg.com/is-wsl/-/is-wsl-2.2.0.tgz#74a4c76e77ca9fd3f932f290c17ea326cd157271" - integrity sha512-fKzAra0rGJUUBwGBgNkHZuToZcn+TtXHpeCgmkMJMMYx1sQDYaCSyjJBSCa2nH1DGm7s3n1oBnohoVTBaN7Lww== - dependencies: - is-docker "^2.0.0" - -isarray@~1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/isarray/-/isarray-1.0.0.tgz#bb935d48582cba168c06834957a54a3e07124f11" - integrity sha1-u5NdSFgsuhaMBoNJV6VKPgcSTxE= - -isexe@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/isexe/-/isexe-2.0.0.tgz#e8fbf374dc556ff8947a10dcb0572d633f2cfa10" - integrity sha1-6PvzdNxVb/iUehDcsFctYz8s+hA= - -isobject@^3.0.1: - version "3.0.1" - resolved "https://registry.yarnpkg.com/isobject/-/isobject-3.0.1.tgz#4e431e92b11a9731636aa1f9c8d1ccbcfdab78df" - integrity sha1-TkMekrEalzFjaqH5yNHMvP2reN8= - -jest-worker@^27.4.5: - version "27.5.1" - resolved "https://registry.yarnpkg.com/jest-worker/-/jest-worker-27.5.1.tgz#8d146f0900e8973b106b6f73cc1e9a8cb86f8db0" - integrity sha512-7vuh85V5cdDofPyxn58nrPjBktZo0u9x1g8WtjQol+jZDaE+fhN+cIvTj11GndBnMnyfrUOG1sZQxCdjKh+DKg== - dependencies: - "@types/node" "*" - merge-stream "^2.0.0" - supports-color "^8.0.0" - -"js-tokens@^3.0.0 || ^4.0.0": - version "4.0.0" - resolved "https://registry.yarnpkg.com/js-tokens/-/js-tokens-4.0.0.tgz#19203fb59991df98e3a287050d4647cdeaf32499" - integrity sha512-RdJUflcE3cUzKiMqQgsCu06FPu9UdIJO0beYbPhHN4k6apgJtifcoCtT9bcxOpYBtpD2kCM6Sbzg4CausW/PKQ== - -json-parse-better-errors@^1.0.2: - version "1.0.2" - resolved "https://registry.yarnpkg.com/json-parse-better-errors/-/json-parse-better-errors-1.0.2.tgz#bb867cfb3450e69107c131d1c514bab3dc8bcaa9" - integrity 
sha512-mrqyZKfX5EhL7hvqcV6WG1yYjnjeuYDzDhhcAAUrq8Po85NBQBJP+ZDUT75qZQ98IkUoBqdkExkukOU7Ts2wrw== - -json-schema-traverse@^0.4.1: - version "0.4.1" - resolved "https://registry.yarnpkg.com/json-schema-traverse/-/json-schema-traverse-0.4.1.tgz#69f6a87d9513ab8bb8fe63bdb0979c448e684660" - integrity sha512-xbbCH5dCYU5T8LcEhhuh7HJ88HXuW3qsI3Y0zOZFKfZEHcpWiHU/Jxzk629Brsab/mMiHQti9wMP+845RPe3Vg== - -json-schema-traverse@^1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/json-schema-traverse/-/json-schema-traverse-1.0.0.tgz#ae7bcb3656ab77a73ba5c49bf654f38e6b6860e2" - integrity sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug== - -json2mq@^0.2.0: - version "0.2.0" - resolved "https://registry.yarnpkg.com/json2mq/-/json2mq-0.2.0.tgz#b637bd3ba9eabe122c83e9720483aeb10d2c904a" - integrity sha1-tje9O6nqvhIsg+lyBIOusQ0skEo= - dependencies: - string-convert "^0.2.0" - -json5@^2.1.2: - version "2.2.0" - resolved "https://registry.yarnpkg.com/json5/-/json5-2.2.0.tgz#2dfefe720c6ba525d9ebd909950f0515316c89a3" - integrity sha512-f+8cldu7X/y7RAJurMEJmdoKXGB/X550w2Nr3tTbezL6RwEE/iMcm+tZnXeoZtKuOq6ft8+CqzEkrIgx1fPoQA== - dependencies: - minimist "^1.2.5" - -jss-plugin-camel-case@^10.5.1: - version "10.9.0" - resolved "https://registry.yarnpkg.com/jss-plugin-camel-case/-/jss-plugin-camel-case-10.9.0.tgz#4921b568b38d893f39736ee8c4c5f1c64670aaf7" - integrity sha512-UH6uPpnDk413/r/2Olmw4+y54yEF2lRIV8XIZyuYpgPYTITLlPOsq6XB9qeqv+75SQSg3KLocq5jUBXW8qWWww== - dependencies: - "@babel/runtime" "^7.3.1" - hyphenate-style-name "^1.0.3" - jss "10.9.0" - -jss-plugin-default-unit@^10.5.1: - version "10.9.0" - resolved "https://registry.yarnpkg.com/jss-plugin-default-unit/-/jss-plugin-default-unit-10.9.0.tgz#bb23a48f075bc0ce852b4b4d3f7582bc002df991" - integrity sha512-7Ju4Q9wJ/MZPsxfu4T84mzdn7pLHWeqoGd/D8O3eDNNJ93Xc8PxnLmV8s8ZPNRYkLdxZqKtm1nPQ0BM4JRlq2w== - dependencies: - "@babel/runtime" "^7.3.1" - jss "10.9.0" - -jss-plugin-global@^10.5.1: - 
version "10.9.0" - resolved "https://registry.yarnpkg.com/jss-plugin-global/-/jss-plugin-global-10.9.0.tgz#fc07a0086ac97aca174e37edb480b69277f3931f" - integrity sha512-4G8PHNJ0x6nwAFsEzcuVDiBlyMsj2y3VjmFAx/uHk/R/gzJV+yRHICjT4MKGGu1cJq2hfowFWCyrr/Gg37FbgQ== - dependencies: - "@babel/runtime" "^7.3.1" - jss "10.9.0" - -jss-plugin-nested@^10.5.1: - version "10.9.0" - resolved "https://registry.yarnpkg.com/jss-plugin-nested/-/jss-plugin-nested-10.9.0.tgz#cc1c7d63ad542c3ccc6e2c66c8328c6b6b00f4b3" - integrity sha512-2UJnDrfCZpMYcpPYR16oZB7VAC6b/1QLsRiAutOt7wJaaqwCBvNsosLEu/fUyKNQNGdvg2PPJFDO5AX7dwxtoA== - dependencies: - "@babel/runtime" "^7.3.1" - jss "10.9.0" - tiny-warning "^1.0.2" - -jss-plugin-props-sort@^10.5.1: - version "10.9.0" - resolved "https://registry.yarnpkg.com/jss-plugin-props-sort/-/jss-plugin-props-sort-10.9.0.tgz#30e9567ef9479043feb6e5e59db09b4de687c47d" - integrity sha512-7A76HI8bzwqrsMOJTWKx/uD5v+U8piLnp5bvru7g/3ZEQOu1+PjHvv7bFdNO3DwNPC9oM0a//KwIJsIcDCjDzw== - dependencies: - "@babel/runtime" "^7.3.1" - jss "10.9.0" - -jss-plugin-rule-value-function@^10.5.1: - version "10.9.0" - resolved "https://registry.yarnpkg.com/jss-plugin-rule-value-function/-/jss-plugin-rule-value-function-10.9.0.tgz#379fd2732c0746fe45168011fe25544c1a295d67" - integrity sha512-IHJv6YrEf8pRzkY207cPmdbBstBaE+z8pazhPShfz0tZSDtRdQua5jjg6NMz3IbTasVx9FdnmptxPqSWL5tyJg== - dependencies: - "@babel/runtime" "^7.3.1" - jss "10.9.0" - tiny-warning "^1.0.2" - -jss-plugin-vendor-prefixer@^10.5.1: - version "10.9.0" - resolved "https://registry.yarnpkg.com/jss-plugin-vendor-prefixer/-/jss-plugin-vendor-prefixer-10.9.0.tgz#aa9df98abfb3f75f7ed59a3ec50a5452461a206a" - integrity sha512-MbvsaXP7iiVdYVSEoi+blrW+AYnTDvHTW6I6zqi7JcwXdc6I9Kbm234nEblayhF38EftoenbM+5218pidmC5gA== - dependencies: - "@babel/runtime" "^7.3.1" - css-vendor "^2.0.8" - jss "10.9.0" - -jss@10.9.0, jss@^10.5.1: - version "10.9.0" - resolved 
"https://registry.yarnpkg.com/jss/-/jss-10.9.0.tgz#7583ee2cdc904a83c872ba695d1baab4b59c141b" - integrity sha512-YpzpreB6kUunQBbrlArlsMpXYyndt9JATbt95tajx0t4MTJJcCJdd4hdNpHmOIDiUJrF/oX5wtVFrS3uofWfGw== - dependencies: - "@babel/runtime" "^7.3.1" - csstype "^3.0.2" - is-in-browser "^1.1.3" - tiny-warning "^1.0.2" - -kind-of@^6.0.2: - version "6.0.3" - resolved "https://registry.yarnpkg.com/kind-of/-/kind-of-6.0.3.tgz#07c05034a6c349fa06e24fa35aa76db4580ce4dd" - integrity sha512-dcS1ul+9tmeD95T+x28/ehLgd9mENa3LsvDTtzm3vyBEO7RPptvAD+t44WVXaUjTBRcrpFeFlC8WCruUR456hw== - -loader-runner@^4.2.0: - version "4.2.0" - resolved "https://registry.yarnpkg.com/loader-runner/-/loader-runner-4.2.0.tgz#d7022380d66d14c5fb1d496b89864ebcfd478384" - integrity sha512-92+huvxMvYlMzMt0iIOukcwYBFpkYJdpl2xsZ7LrlayO7E8SOv+JJUEK17B/dJIHAOLMfh2dZZ/Y18WgmGtYNw== - -loader-utils@^2.0.0: - version "2.0.2" - resolved "https://registry.yarnpkg.com/loader-utils/-/loader-utils-2.0.2.tgz#d6e3b4fb81870721ae4e0868ab11dd638368c129" - integrity sha512-TM57VeHptv569d/GKh6TAYdzKblwDNiumOdkFnejjD0XwTH87K90w3O7AiJRqdQoXygvi1VQTJTLGhJl7WqA7A== - dependencies: - big.js "^5.2.2" - emojis-list "^3.0.0" - json5 "^2.1.2" - -locate-path@^5.0.0: - version "5.0.0" - resolved "https://registry.yarnpkg.com/locate-path/-/locate-path-5.0.0.tgz#1afba396afd676a6d42504d0a67a3a7eb9f62aa0" - integrity sha512-t7hw9pI+WvuwNJXwk5zVHpyhIqzg2qTlklJOf0mVxGSbe3Fp2VieZcduNYjaLDoy6p9uGpQEGWG87WpMKlNq8g== - dependencies: - p-locate "^4.1.0" - -lodash@^4.17.14, lodash@^4.17.20, lodash@^4.17.21: - version "4.17.21" - resolved "https://registry.yarnpkg.com/lodash/-/lodash-4.17.21.tgz#679591c564c3bffaae8454cf0b3df370c3d6911c" - integrity sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg== - -loose-envify@^1.1.0, loose-envify@^1.4.0: - version "1.4.0" - resolved "https://registry.yarnpkg.com/loose-envify/-/loose-envify-1.4.0.tgz#71ee51fa7be4caec1a63839f7e682d8132d30caf" - integrity 
sha512-lyuxPGr/Wfhrlem2CL/UcnUc1zcqKAImBDzukY7Y5F/yQiNdko6+fRLevlw1HgMySw7f611UIY408EtxRSoK3Q== - dependencies: - js-tokens "^3.0.0 || ^4.0.0" - -lower-case@^2.0.2: - version "2.0.2" - resolved "https://registry.yarnpkg.com/lower-case/-/lower-case-2.0.2.tgz#6fa237c63dbdc4a82ca0fd882e4722dc5e634e28" - integrity sha512-7fm3l3NAF9WfN6W3JOmf5drwpVqX78JtoGJ3A6W0a6ZnldM41w2fV5D490psKFTpMds8TJse/eHLFFsNHHjHgg== - dependencies: - tslib "^2.0.3" - -lru-cache@^6.0.0: - version "6.0.0" - resolved "https://registry.yarnpkg.com/lru-cache/-/lru-cache-6.0.0.tgz#6d6fe6570ebd96aaf90fcad1dafa3b2566db3a94" - integrity sha512-Jo6dJ04CmSjuznwJSS3pUeWmd/H0ffTlkXXgwZi+eq1UCmqQwCh+eLsYOYCwY991i2Fah4h1BEMCx4qThGbsiA== - dependencies: - yallist "^4.0.0" - -media-typer@0.3.0: - version "0.3.0" - resolved "https://registry.yarnpkg.com/media-typer/-/media-typer-0.3.0.tgz#8710d7af0aa626f8fffa1ce00168545263255748" - integrity sha1-hxDXrwqmJvj/+hzgAWhUUmMlV0g= - -memfs@^3.4.1: - version "3.4.1" - resolved "https://registry.yarnpkg.com/memfs/-/memfs-3.4.1.tgz#b78092f466a0dce054d63d39275b24c71d3f1305" - integrity sha512-1c9VPVvW5P7I85c35zAdEr1TD5+F11IToIHIlrVIcflfnzPkJa0ZoYEoEdYDP8KgPFoSZ/opDrUsAoZWym3mtw== - dependencies: - fs-monkey "1.0.3" - -"memoize-one@>=3.1.1 <6": - version "5.2.1" - resolved "https://registry.yarnpkg.com/memoize-one/-/memoize-one-5.2.1.tgz#8337aa3c4335581839ec01c3d594090cebe8f00e" - integrity sha512-zYiwtZUcYyXKo/np96AGZAckk+FWWsUdJ3cHGGmld7+AhvcWmQyGCYUh1hc4Q/pkOhb65dQR/pqCyK0cOaHz4Q== - -memoize-one@^3.1.1: - version "3.1.1" - resolved "https://registry.yarnpkg.com/memoize-one/-/memoize-one-3.1.1.tgz#ef609811e3bc28970eac2884eece64d167830d17" - integrity sha512-YqVh744GsMlZu6xkhGslPSqSurOv6P+kLN2J3ysBZfagLcL5FdRK/0UpgLoL8hwjjEvvAVkjJZyFP+1T6p1vgA== - -memoize-one@^6.0.0: - version "6.0.0" - resolved "https://registry.yarnpkg.com/memoize-one/-/memoize-one-6.0.0.tgz#b2591b871ed82948aee4727dc6abceeeac8c1045" - integrity 
sha512-rkpe71W0N0c0Xz6QD0eJETuWAJGnJ9afsl1srmwPrI+yBCkge5EycXXbYRyvL29zZVUWQCY7InPRCv3GDXuZNw== - -memory-fs@^0.5.0: - version "0.5.0" - resolved "https://registry.yarnpkg.com/memory-fs/-/memory-fs-0.5.0.tgz#324c01288b88652966d161db77838720845a8e3c" - integrity sha512-jA0rdU5KoQMC0e6ppoNRtpp6vjFq6+NY7r8hywnC7V+1Xj/MtHwGIbB1QaK/dunyjWteJzmkpd7ooeWg10T7GA== - dependencies: - errno "^0.1.3" - readable-stream "^2.0.1" - -merge-descriptors@1.0.1: - version "1.0.1" - resolved "https://registry.yarnpkg.com/merge-descriptors/-/merge-descriptors-1.0.1.tgz#b00aaa556dd8b44568150ec9d1b953f3f90cbb61" - integrity sha1-sAqqVW3YtEVoFQ7J0blT8/kMu2E= - -merge-stream@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/merge-stream/-/merge-stream-2.0.0.tgz#52823629a14dd00c9770fb6ad47dc6310f2c1f60" - integrity sha512-abv/qOcuPfk3URPfDzmZU1LKmuw8kT+0nIHvKrKgFrwifol/doWcdA4ZqsWQ8ENrFKkd67Mfpo/LovbIUsbt3w== - -merge2@^1.3.0, merge2@^1.4.1: - version "1.4.1" - resolved "https://registry.yarnpkg.com/merge2/-/merge2-1.4.1.tgz#4368892f885e907455a6fd7dc55c0c9d404990ae" - integrity sha512-8q7VEgMJW4J8tcfVPy8g09NcQwZdbwFEqhe/WZkoIzjn/3TGDwtOCYtXGxA3O8tPzpczCCDgv+P2P5y00ZJOOg== - -methods@~1.1.2: - version "1.1.2" - resolved "https://registry.yarnpkg.com/methods/-/methods-1.1.2.tgz#5529a4d67654134edcc5266656835b0f851afcee" - integrity sha1-VSmk1nZUE07cxSZmVoNbD4Ua/O4= - -micromatch@^4.0.0, micromatch@^4.0.2, micromatch@^4.0.4: - version "4.0.4" - resolved "https://registry.yarnpkg.com/micromatch/-/micromatch-4.0.4.tgz#896d519dfe9db25fce94ceb7a500919bf881ebf9" - integrity sha512-pRmzw/XUcwXGpD9aI9q/0XOwLNygjETJ8y0ao0wdqprrzDa4YnxLcz7fQRZr8voh8V10kGhABbNcHVk5wHgWwg== - dependencies: - braces "^3.0.1" - picomatch "^2.2.3" - -mime-db@1.51.0: - version "1.51.0" - resolved "https://registry.yarnpkg.com/mime-db/-/mime-db-1.51.0.tgz#d9ff62451859b18342d960850dc3cfb77e63fb0c" - integrity sha512-5y8A56jg7XVQx2mbv1lu49NR4dokRnhZYTtL+KGfaa27uq4pSTXkwQkFJl4pkRMyNFz/EtYDSkiiEHx3F7UN6g== - 
-"mime-db@>= 1.43.0 < 2": - version "1.52.0" - resolved "https://registry.yarnpkg.com/mime-db/-/mime-db-1.52.0.tgz#bbabcdc02859f4987301c856e3387ce5ec43bf70" - integrity sha512-sPU4uV7dYlvtWJxwwxHD0PuihVNiE7TyAbQ5SWxDCB9mUYvOgroQOwYQQOKPJ8CIbE+1ETVlOoK1UC2nU3gYvg== - -mime-types@^2.1.27, mime-types@^2.1.31, mime-types@~2.1.17, mime-types@~2.1.24, mime-types@~2.1.34: - version "2.1.34" - resolved "https://registry.yarnpkg.com/mime-types/-/mime-types-2.1.34.tgz#5a712f9ec1503511a945803640fafe09d3793c24" - integrity sha512-6cP692WwGIs9XXdOO4++N+7qjqv0rqxxVvJ3VHPh/Sc9mVZcQP+ZGhkKiTvWMQRr2tbHkJP/Yn7Y0npb3ZBs4A== - dependencies: - mime-db "1.51.0" - -mime@1.6.0: - version "1.6.0" - resolved "https://registry.yarnpkg.com/mime/-/mime-1.6.0.tgz#32cd9e5c64553bd58d19a568af452acff04981b1" - integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg== - -mimic-fn@^2.1.0: - version "2.1.0" - resolved "https://registry.yarnpkg.com/mimic-fn/-/mimic-fn-2.1.0.tgz#7ed2c2ccccaf84d3ffcb7a69b57711fc2083401b" - integrity sha512-OqbOk5oEQeAZ8WXWydlu9HJjz9WVdEIvamMCcXmuqUYjTknH/sqsWvhQ3vgwKFRR1HpjvNBKQ37nbJgYzGqGcg== - -minimalistic-assert@^1.0.0: - version "1.0.1" - resolved "https://registry.yarnpkg.com/minimalistic-assert/-/minimalistic-assert-1.0.1.tgz#2e194de044626d4a10e7f7fbc00ce73e83e4d5c7" - integrity sha512-UtJcAD4yEaGtjPezWuO9wC4nwUnVH/8/Im3yEHQP4b67cXlD/Qr9hdITCU1xDbSEXg2XKNaP8jsReV7vQd00/A== - -minimatch@^3.0.4: - version "3.1.2" - resolved "https://registry.yarnpkg.com/minimatch/-/minimatch-3.1.2.tgz#19cd194bfd3e428f049a70817c038d89ab4be35b" - integrity sha512-J7p63hRiAjw1NDEww1W7i37+ByIrOWO5XQQAzZ3VOcL0PNybwpfmV/N05zFAzwQ9USyEcX6t3UO+K5aqBQOIHw== - dependencies: - brace-expansion "^1.1.7" - -minimist@^1.2.5: - version "1.2.5" - resolved "https://registry.yarnpkg.com/minimist/-/minimist-1.2.5.tgz#67d66014b66a6a8aaa0c083c5fd58df4e4e97602" - integrity 
sha512-FM9nNUYrRBAELZQT3xeZQ7fmMOBg6nWNmJKTcgsJeaLstP/UODVpGsr5OhXhhXg6f+qtJ8uiZ+PUxkDWcgIXLw== - -mkdirp@^0.5.5: - version "0.5.5" - resolved "https://registry.yarnpkg.com/mkdirp/-/mkdirp-0.5.5.tgz#d91cefd62d1436ca0f41620e251288d420099def" - integrity sha512-NKmAlESf6jMGym1++R0Ra7wvhV+wFW63FaSOFPwRahvea0gMUcGUhVeAg/0BC0wiv9ih5NYPB1Wn1UEI1/L+xQ== - dependencies: - minimist "^1.2.5" - -moment@^2.24.0, moment@^2.25.3: - version "2.29.1" - resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.1.tgz#b2be769fa31940be9eeea6469c075e35006fa3d3" - integrity sha512-kHmoybcPV8Sqy59DwNDY3Jefr64lK/by/da0ViFcuA4DH0vQg5Q6Ze5VimxkfQNSC+Mls/Kx53s7TjP1RhFEDQ== - -ms@2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/ms/-/ms-2.0.0.tgz#5608aeadfc00be6c2901df5f9861788de0d597c8" - integrity sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g= - -ms@2.1.2: - version "2.1.2" - resolved "https://registry.yarnpkg.com/ms/-/ms-2.1.2.tgz#d09d1f357b443f493382a8eb3ccd183872ae6009" - integrity sha512-sGkPx+VjMtmA6MX27oA4FBFELFCZZ4S4XqeGOXCv68tT+jb3vk/RyaKWP0PTKyWtmLSM0b+adUTEvbs1PEaH2w== - -ms@2.1.3, ms@^2.1.1: - version "2.1.3" - resolved "https://registry.yarnpkg.com/ms/-/ms-2.1.3.tgz#574c8138ce1d2b5861f0b44579dbadd60c6615b2" - integrity sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA== - -multicast-dns-service-types@^1.1.0: - version "1.1.0" - resolved "https://registry.yarnpkg.com/multicast-dns-service-types/-/multicast-dns-service-types-1.1.0.tgz#899f11d9686e5e05cb91b35d5f0e63b773cfc901" - integrity sha1-iZ8R2WhuXgXLkbNdXw5jt3PPyQE= - -multicast-dns@^6.0.1: - version "6.2.3" - resolved "https://registry.yarnpkg.com/multicast-dns/-/multicast-dns-6.2.3.tgz#a0ec7bd9055c4282f790c3c82f4e28db3b31b229" - integrity sha512-ji6J5enbMyGRHIAkAOu3WdV8nggqviKCEKtXcOqfphZZtQrmHKycfynJ2V7eVPUA4NhJ6V7Wf4TmGbTwKE9B6g== - dependencies: - dns-packet "^1.3.1" - thunky "^1.0.2" - -nanoid@^3.1.31, nanoid@^3.3.1: - version "3.3.1" - resolved 
"https://registry.yarnpkg.com/nanoid/-/nanoid-3.3.1.tgz#6347a18cac88af88f58af0b3594b723d5e99bb35" - integrity sha512-n6Vs/3KGyxPQd6uO0eH4Bv0ojGSUvuLlIHtC3Y0kEO23YRge8H9x1GCzLn28YX0H66pMkxuaeESFq4tKISKwdw== - -negotiator@0.6.3: - version "0.6.3" - resolved "https://registry.yarnpkg.com/negotiator/-/negotiator-0.6.3.tgz#58e323a72fedc0d6f9cd4d31fe49f51479590ccd" - integrity sha512-+EUsqGPLsM+j/zdChZjsnX51g4XrHFOIXwfnCVPGlQk/k5giakcKsuxCObBRu6DSm9opw/O6slWbJdghQM4bBg== - -neo-async@^2.6.2: - version "2.6.2" - resolved "https://registry.yarnpkg.com/neo-async/-/neo-async-2.6.2.tgz#b4aafb93e3aeb2d8174ca53cf163ab7d7308305f" - integrity sha512-Yd3UES5mWCSqR+qNT93S3UoYUkqAZ9lLg8a7g9rimsWmYGK8cVToA4/sF3RrshdyV3sAGMXVUmpMYOw+dLpOuw== - -no-case@^3.0.4: - version "3.0.4" - resolved "https://registry.yarnpkg.com/no-case/-/no-case-3.0.4.tgz#d361fd5c9800f558551a8369fc0dcd4662b6124d" - integrity sha512-fgAN3jGAh+RoxUGZHTSOLJIqUc2wmoBwGR4tbpNAKmmovFoWq0OdRkb0VkldReO2a2iBT/OEulG9XSUc10r3zg== - dependencies: - lower-case "^2.0.2" - tslib "^2.0.3" - -node-fetch@^1.0.1, node-fetch@^2.6.1: - version "2.6.7" - resolved "https://registry.yarnpkg.com/node-fetch/-/node-fetch-2.6.7.tgz#24de9fba827e3b4ae44dc8b20256a379160052ad" - integrity sha512-ZjMPFEfVx5j+y2yF35Kzx5sF7kDzxuDj6ziH4FFbOp87zKDZNx8yExJIb05OGF4Nlt9IHFIMBkRl41VdvcNdbQ== - dependencies: - whatwg-url "^5.0.0" - -node-forge@^1.2.0: - version "1.2.1" - resolved "https://registry.yarnpkg.com/node-forge/-/node-forge-1.2.1.tgz#82794919071ef2eb5c509293325cec8afd0fd53c" - integrity sha512-Fcvtbb+zBcZXbTTVwqGA5W+MKBj56UjVRevvchv5XrcyXbmNdesfZL37nlcWOfpgHhgmxApw3tQbTr4CqNmX4w== - -node-releases@^2.0.2: - version "2.0.2" - resolved "https://registry.yarnpkg.com/node-releases/-/node-releases-2.0.2.tgz#7139fe71e2f4f11b47d4d2986aaf8c48699e0c01" - integrity sha512-XxYDdcQ6eKqp/YjI+tb2C5WM2LgjnZrfYg4vgQt49EK268b6gYCHsBLrK2qvJo4FmCtqmKezb0WZFK4fkrZNsg== - -normalize-path@^3.0.0, normalize-path@~3.0.0: - version "3.0.0" - resolved 
"https://registry.yarnpkg.com/normalize-path/-/normalize-path-3.0.0.tgz#0dcd69ff23a1c9b11fd0978316644a0388216a65" - integrity sha512-6eZs5Ls3WtCisHWp9S2GUy8dqkpGi4BVSz3GaqiE6ezub0512ESztXUwUB6C6IKbQkY2Pnb/mD4WYojCRwcwLA== - -npm-run-path@^4.0.1: - version "4.0.1" - resolved "https://registry.yarnpkg.com/npm-run-path/-/npm-run-path-4.0.1.tgz#b7ecd1e5ed53da8e37a55e1c2269e0b97ed748ea" - integrity sha512-S48WzZW777zhNIrn7gxOlISNAqi9ZC/uQFnRdbeIHhZhCA6UqpkOT8T1G7BvfdgP4Er8gF4sUbaS0i7QvIfCWw== - dependencies: - path-key "^3.0.0" - -nth-check@^2.0.1: - version "2.0.1" - resolved "https://registry.yarnpkg.com/nth-check/-/nth-check-2.0.1.tgz#2efe162f5c3da06a28959fbd3db75dbeea9f0fc2" - integrity sha512-it1vE95zF6dTT9lBsYbxvqh0Soy4SPowchj0UBGj/V6cTPnXXtQOPUbhZ6CmGzAD/rW22LQK6E96pcdJXk4A4w== - dependencies: - boolbase "^1.0.0" - -object-assign@^4.1.1: - version "4.1.1" - resolved "https://registry.yarnpkg.com/object-assign/-/object-assign-4.1.1.tgz#2109adc7965887cfc05cbbd442cac8bfbb360863" - integrity sha1-IQmtx5ZYh8/AXLvUQsrIv7s2CGM= - -object-is@^1.0.1: - version "1.1.5" - resolved "https://registry.yarnpkg.com/object-is/-/object-is-1.1.5.tgz#b9deeaa5fc7f1846a0faecdceec138e5778f53ac" - integrity sha512-3cyDsyHgtmi7I7DfSSI2LDp6SK2lwvtbg0p0R1e0RvTqF5ceGx+K2dfSjm1bKDMVCFEDAQvy+o8c6a7VujOddw== - dependencies: - call-bind "^1.0.2" - define-properties "^1.1.3" - -object-keys@^1.0.12, object-keys@^1.1.1: - version "1.1.1" - resolved "https://registry.yarnpkg.com/object-keys/-/object-keys-1.1.1.tgz#1c47f272df277f3b1daf061677d9c82e2322c60e" - integrity sha512-NuAESUOUMrlIXOfHKzD6bpPu3tYt3xvjNdRIQ+FeT0lNb4K8WR70CaDxhuNguS2XG+GjkyMwOzsN5ZktImfhLA== - -obuf@^1.0.0, obuf@^1.1.2: - version "1.1.2" - resolved "https://registry.yarnpkg.com/obuf/-/obuf-1.1.2.tgz#09bea3343d41859ebd446292d11c9d4db619084e" - integrity sha512-PX1wu0AmAdPqOL1mWhqmlOd8kOIZQwGZw6rh7uby9fTc5lhaOWFLX3I6R1hrF9k3zUY40e6igsLGkDXK92LJNg== - -on-finished@~2.3.0: - version "2.3.0" - resolved 
"https://registry.yarnpkg.com/on-finished/-/on-finished-2.3.0.tgz#20f1336481b083cd75337992a16971aa2d906947" - integrity sha1-IPEzZIGwg811M3mSoWlxqi2QaUc= - dependencies: - ee-first "1.1.1" - -on-headers@~1.0.2: - version "1.0.2" - resolved "https://registry.yarnpkg.com/on-headers/-/on-headers-1.0.2.tgz#772b0ae6aaa525c399e489adfad90c403eb3c28f" - integrity sha512-pZAE+FJLoyITytdqK0U5s+FIpjN0JP3OzFi/u8Rx+EV5/W+JTWGXG8xFzevE7AjBfDqHv/8vL8qQsIhHnqRkrA== - -once@^1.3.0: - version "1.4.0" - resolved "https://registry.yarnpkg.com/once/-/once-1.4.0.tgz#583b1aa775961d4b113ac17d9c50baef9dd76bd1" - integrity sha1-WDsap3WWHUsROsF9nFC6753Xa9E= - dependencies: - wrappy "1" - -onetime@^5.1.2: - version "5.1.2" - resolved "https://registry.yarnpkg.com/onetime/-/onetime-5.1.2.tgz#d0e96ebb56b07476df1dd9c4806e5237985ca45e" - integrity sha512-kbpaSSGJTWdAY5KPVeMOKXSrPtr8C8C7wodJbcsd51jRnmD+GZu8Y0VoU6Dm5Z4vWr0Ig/1NKuWRKf7j5aaYSg== - dependencies: - mimic-fn "^2.1.0" - -open@^8.0.9: - version "8.4.0" - resolved "https://registry.yarnpkg.com/open/-/open-8.4.0.tgz#345321ae18f8138f82565a910fdc6b39e8c244f8" - integrity sha512-XgFPPM+B28FtCCgSb9I+s9szOC1vZRSwgWsRUA5ylIxRTgKozqjOCrVOqGsYABPYK5qnfqClxZTFBa8PKt2v6Q== - dependencies: - define-lazy-prop "^2.0.0" - is-docker "^2.1.1" - is-wsl "^2.2.0" - -p-limit@^2.2.0: - version "2.3.0" - resolved "https://registry.yarnpkg.com/p-limit/-/p-limit-2.3.0.tgz#3dd33c647a214fdfffd835933eb086da0dc21db1" - integrity sha512-//88mFWSJx8lxCzwdAABTJL2MyWB12+eIY7MDL2SqLmAkeKU9qxRvWuSyTjm3FUmpBEMuFfckAIqEaVGUDxb6w== - dependencies: - p-try "^2.0.0" - -p-locate@^4.1.0: - version "4.1.0" - resolved "https://registry.yarnpkg.com/p-locate/-/p-locate-4.1.0.tgz#a3428bb7088b3a60292f66919278b7c297ad4f07" - integrity sha512-R79ZZ/0wAxKGu3oYMlz8jy/kbhsNrS7SKZ7PxEHBgJ5+F2mtFW2fK2cOtBh1cHYkQsbzFV7I+EoRKe6Yt0oK7A== - dependencies: - p-limit "^2.2.0" - -p-map@^4.0.0: - version "4.0.0" - resolved 
"https://registry.yarnpkg.com/p-map/-/p-map-4.0.0.tgz#bb2f95a5eda2ec168ec9274e06a747c3e2904d2b" - integrity sha512-/bjOqmgETBYB5BoEeGVea8dmvHb2m9GLy1E9W43yeyfP6QQCZGFNa+XRceJEuDB6zqr+gKpIAmlLebMpykw/MQ== - dependencies: - aggregate-error "^3.0.0" - -p-retry@^4.5.0: - version "4.6.1" - resolved "https://registry.yarnpkg.com/p-retry/-/p-retry-4.6.1.tgz#8fcddd5cdf7a67a0911a9cf2ef0e5df7f602316c" - integrity sha512-e2xXGNhZOZ0lfgR9kL34iGlU8N/KO0xZnQxVEwdeOvpqNDQfdnxIYizvWtK8RglUa3bGqI8g0R/BdfzLMxRkiA== - dependencies: - "@types/retry" "^0.12.0" - retry "^0.13.1" - -p-try@^2.0.0: - version "2.2.0" - resolved "https://registry.yarnpkg.com/p-try/-/p-try-2.2.0.tgz#cb2868540e313d61de58fafbe35ce9004d5540e6" - integrity sha512-R4nPAVTAU0B9D35/Gk3uJf/7XYbQcyohSKdvAxIRSNghFl4e71hVoGnBNQz9cWaXxO2I10KTC+3jMdvvoKw6dQ== - -param-case@^3.0.4: - version "3.0.4" - resolved "https://registry.yarnpkg.com/param-case/-/param-case-3.0.4.tgz#7d17fe4aa12bde34d4a77d91acfb6219caad01c5" - integrity sha512-RXlj7zCYokReqWpOPH9oYivUzLYZ5vAPIfEmCTNViosC78F8F0H9y7T7gG2M39ymgutxF5gcFEsyZQSph9Bp3A== - dependencies: - dot-case "^3.0.4" - tslib "^2.0.3" - -parseurl@~1.3.2, parseurl@~1.3.3: - version "1.3.3" - resolved "https://registry.yarnpkg.com/parseurl/-/parseurl-1.3.3.tgz#9da19e7bee8d12dff0513ed5b76957793bc2e8d4" - integrity sha512-CiyeOxFT/JZyN5m0z9PfXw4SCBJ6Sygz1Dpl0wqjlhDEGGBP1GnsUVEL0p63hoG1fcj3fHynXi9NYO4nWOL+qQ== - -pascal-case@^3.1.2: - version "3.1.2" - resolved "https://registry.yarnpkg.com/pascal-case/-/pascal-case-3.1.2.tgz#b48e0ef2b98e205e7c1dae747d0b1508237660eb" - integrity sha512-uWlGT3YSnK9x3BQJaOdcZwrnV6hPpd8jFH1/ucpiLRPh/2zCVJKS19E4GvYHvaCcACn3foXZ0cLB9Wrx1KGe5g== - dependencies: - no-case "^3.0.4" - tslib "^2.0.3" - -path-exists@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/path-exists/-/path-exists-4.0.0.tgz#513bdbe2d3b95d7762e8c1137efa195c6c61b5b3" - integrity sha512-ak9Qy5Q7jYb2Wwcey5Fpvg2KoAc/ZIhLSLOSBmRmygPsGwkVVt0fZa0qrtMz+m6tJTAHfZQ8FnmB4MG4LWy7/w== - 
-path-is-absolute@^1.0.0: - version "1.0.1" - resolved "https://registry.yarnpkg.com/path-is-absolute/-/path-is-absolute-1.0.1.tgz#174b9268735534ffbc7ace6bf53a5a9e1b5c5f5f" - integrity sha1-F0uSaHNVNP+8es5r9TpanhtcX18= - -path-key@^3.0.0, path-key@^3.1.0: - version "3.1.1" - resolved "https://registry.yarnpkg.com/path-key/-/path-key-3.1.1.tgz#581f6ade658cbba65a0d3380de7753295054f375" - integrity sha512-ojmeN0qd+y0jszEtoY48r0Peq5dwMEkIlCOu6Q5f41lfkswXuKtYrhgoTpLnyIcHm24Uhqx+5Tqm2InSwLhE6Q== - -path-parse@^1.0.7: - version "1.0.7" - resolved "https://registry.yarnpkg.com/path-parse/-/path-parse-1.0.7.tgz#fbc114b60ca42b30d9daf5858e4bd68bbedb6735" - integrity sha512-LDJzPVEEEPR+y48z93A0Ed0yXb8pAByGWo/k5YYdYgpY2/2EsOsksJrq7lOHxryrVOn1ejG6oAp8ahvOIQD8sw== - -path-to-regexp@0.1.7: - version "0.1.7" - resolved "https://registry.yarnpkg.com/path-to-regexp/-/path-to-regexp-0.1.7.tgz#df604178005f522f15eb4490e7247a1bfaa67f8c" - integrity sha1-32BBeABfUi8V60SQ5yR6G/qmf4w= - -path-type@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/path-type/-/path-type-4.0.0.tgz#84ed01c0a7ba380afe09d90a8c180dcd9d03043b" - integrity sha512-gDKb8aZMDeD/tZWs9P6+q0J9Mwkdl6xMV8TjnGP3qJVJ06bdMgkbBlLU8IdfOsIsFz2BW1rNVT3XuNEl8zPAvw== - -picocolors@^1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/picocolors/-/picocolors-1.0.0.tgz#cb5bdc74ff3f51892236eaf79d68bc44564ab81c" - integrity sha512-1fygroTLlHu66zi26VoTDv8yRgm0Fccecssto+MhsZ0D/DGW2sm8E8AjW7NU5VVTRt5GxbeZ5qBuJr+HyLYkjQ== - -picomatch@^2.0.4, picomatch@^2.2.1, picomatch@^2.2.3: - version "2.3.1" - resolved "https://registry.yarnpkg.com/picomatch/-/picomatch-2.3.1.tgz#3ba3833733646d9d3e4995946c1365a67fb07a42" - integrity sha512-JU3teHTNjmE2VCGFzuY8EXzCDVwEqB2a8fsIvwaStHhAWJEeVd1o1QD80CU6+ZdEXXSLbSsuLwJjkCBWqRQUVA== - -pkg-dir@^4.2.0: - version "4.2.0" - resolved "https://registry.yarnpkg.com/pkg-dir/-/pkg-dir-4.2.0.tgz#f099133df7ede422e81d1d8448270eeb3e4261f3" - integrity 
sha512-HRDzbaKjC+AOWVXxAU/x54COGeIv9eb+6CkDSQoNTt4XyWoIJvuPsXizxu/Fr23EiekbtZwmh1IcIG/l/a10GQ== - dependencies: - find-up "^4.0.0" - -popper.js@1.16.1-lts: - version "1.16.1-lts" - resolved "https://registry.yarnpkg.com/popper.js/-/popper.js-1.16.1-lts.tgz#cf6847b807da3799d80ee3d6d2f90df8a3f50b05" - integrity sha512-Kjw8nKRl1m+VrSFCoVGPph93W/qrSO7ZkqPpTf7F4bk/sqcfWK019dWBUpE/fBOsOQY1dks/Bmcbfn1heM/IsA== - -portable-fetch@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/portable-fetch/-/portable-fetch-3.0.0.tgz#3cbf4aa6dbc5a5734b41c0419c9273313bfd9ad8" - integrity sha1-PL9KptvFpXNLQcBBnJJzMTv9mtg= - dependencies: - node-fetch "^1.0.1" - whatwg-fetch ">=0.10.0" - -portfinder@^1.0.28: - version "1.0.28" - resolved "https://registry.yarnpkg.com/portfinder/-/portfinder-1.0.28.tgz#67c4622852bd5374dd1dd900f779f53462fac778" - integrity sha512-Se+2isanIcEqf2XMHjyUKskczxbPH7dQnlMjXX6+dybayyHvAf/TCgyMRlzf/B6QDhAEFOGes0pzRo3by4AbMA== - dependencies: - async "^2.6.2" - debug "^3.1.1" - mkdirp "^0.5.5" - -postcss-modules-extract-imports@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/postcss-modules-extract-imports/-/postcss-modules-extract-imports-3.0.0.tgz#cda1f047c0ae80c97dbe28c3e76a43b88025741d" - integrity sha512-bdHleFnP3kZ4NYDhuGlVK+CMrQ/pqUm8bx/oGL93K6gVwiclvX5x0n76fYMKuIGKzlABOy13zsvqjb0f92TEXw== - -postcss-modules-local-by-default@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/postcss-modules-local-by-default/-/postcss-modules-local-by-default-4.0.0.tgz#ebbb54fae1598eecfdf691a02b3ff3b390a5a51c" - integrity sha512-sT7ihtmGSF9yhm6ggikHdV0hlziDTX7oFoXtuVWeDd3hHObNkcHRo9V3yg7vCAY7cONyxJC/XXCmmiHHcvX7bQ== - dependencies: - icss-utils "^5.0.0" - postcss-selector-parser "^6.0.2" - postcss-value-parser "^4.1.0" - -postcss-modules-scope@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/postcss-modules-scope/-/postcss-modules-scope-3.0.0.tgz#9ef3151456d3bbfa120ca44898dfca6f2fa01f06" - integrity 
sha512-hncihwFA2yPath8oZ15PZqvWGkWf+XUfQgUGamS4LqoP1anQLOsOJw0vr7J7IwLpoY9fatA2qiGUGmuZL0Iqlg== - dependencies: - postcss-selector-parser "^6.0.4" - -postcss-modules-values@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/postcss-modules-values/-/postcss-modules-values-4.0.0.tgz#d7c5e7e68c3bb3c9b27cbf48ca0bb3ffb4602c9c" - integrity sha512-RDxHkAiEGI78gS2ofyvCsu7iycRv7oqw5xMWn9iMoR0N/7mf9D50ecQqUo5BZ9Zh2vH4bCUR/ktCqbB9m8vJjQ== - dependencies: - icss-utils "^5.0.0" - -postcss-selector-parser@^6.0.2, postcss-selector-parser@^6.0.4: - version "6.0.9" - resolved "https://registry.yarnpkg.com/postcss-selector-parser/-/postcss-selector-parser-6.0.9.tgz#ee71c3b9ff63d9cd130838876c13a2ec1a992b2f" - integrity sha512-UO3SgnZOVTwu4kyLR22UQ1xZh086RyNZppb7lLAKBFK8a32ttG5i87Y/P3+2bRSjZNyJ1B7hfFNo273tKe9YxQ== - dependencies: - cssesc "^3.0.0" - util-deprecate "^1.0.2" - -postcss-value-parser@^4.1.0: - version "4.2.0" - resolved "https://registry.yarnpkg.com/postcss-value-parser/-/postcss-value-parser-4.2.0.tgz#723c09920836ba6d3e5af019f92bc0971c02e514" - integrity sha512-1NNCs6uurfkVbeXG4S8JFT9t19m45ICnif8zWLd5oPSZ50QnwMfK+H3jv408d4jw/7Bttv5axS5IiHoLaVNHeQ== - -postcss@^8.2.15: - version "8.4.8" - resolved "https://registry.yarnpkg.com/postcss/-/postcss-8.4.8.tgz#dad963a76e82c081a0657d3a2f3602ce10c2e032" - integrity sha512-2tXEqGxrjvAO6U+CJzDL2Fk2kPHTv1jQsYkSoMeOis2SsYaXRO2COxTdQp99cYvif9JTXaAk9lYGc3VhJt7JPQ== - dependencies: - nanoid "^3.3.1" - picocolors "^1.0.0" - source-map-js "^1.0.2" - -prettier@^2.1.2: - version "2.5.1" - resolved "https://registry.yarnpkg.com/prettier/-/prettier-2.5.1.tgz#fff75fa9d519c54cf0fce328c1017d94546bc56a" - integrity sha512-vBZcPRUR5MZJwoyi3ZoyQlc1rXeEck8KgeC9AwwOn+exuxLxq5toTRDTSaVrXHxelDMHy9zlicw8u66yxoSUFg== - -pretty-error@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/pretty-error/-/pretty-error-4.0.0.tgz#90a703f46dd7234adb46d0f84823e9d1cb8f10d6" - integrity 
sha512-AoJ5YMAcXKYxKhuJGdcvse+Voc6v1RgnsR3nWcYU7q4t6z0Q6T86sv5Zq8VIRbOWWFpvdGE83LtdSMNd+6Y0xw== - dependencies: - lodash "^4.17.20" - renderkid "^3.0.0" - -process-nextick-args@~2.0.0: - version "2.0.1" - resolved "https://registry.yarnpkg.com/process-nextick-args/-/process-nextick-args-2.0.1.tgz#7820d9b16120cc55ca9ae7792680ae7dba6d7fe2" - integrity sha512-3ouUOpQhtgrbOa17J7+uxOTpITYWaGP7/AhoR3+A+/1e9skrzelGi/dXzEYyvbxubEF6Wn2ypscTKiKJFFn1ag== - -prop-types@^15.6.2, prop-types@^15.7.2: - version "15.8.1" - resolved "https://registry.yarnpkg.com/prop-types/-/prop-types-15.8.1.tgz#67d87bf1a694f48435cf332c24af10214a3140b5" - integrity sha512-oj87CgZICdulUohogVAR7AjlC0327U4el4L6eAvOqCeudMDVU0NThNaV+b9Df4dXgSP1gXMTnPdhfe/2qDH5cg== - dependencies: - loose-envify "^1.4.0" - object-assign "^4.1.1" - react-is "^16.13.1" - -proxy-addr@~2.0.7: - version "2.0.7" - resolved "https://registry.yarnpkg.com/proxy-addr/-/proxy-addr-2.0.7.tgz#f19fe69ceab311eeb94b42e70e8c2070f9ba1025" - integrity sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg== - dependencies: - forwarded "0.2.0" - ipaddr.js "1.9.1" - -prr@~1.0.1: - version "1.0.1" - resolved "https://registry.yarnpkg.com/prr/-/prr-1.0.1.tgz#d3fc114ba06995a45ec6893f484ceb1d78f5f476" - integrity sha1-0/wRS6BplaRexok/SEzrHXj19HY= - -punycode@^2.1.0: - version "2.1.1" - resolved "https://registry.yarnpkg.com/punycode/-/punycode-2.1.1.tgz#b58b010ac40c22c5657616c8d2c2c02c7bf479ec" - integrity sha512-XRsRjdf+j5ml+y/6GKHPZbrF/8p2Yga0JPtdqTIY2Xe5ohJPD9saDJJLPvp9+NSBprVvevdXZybnj2cv8OEd0A== - -qs@6.9.7: - version "6.9.7" - resolved "https://registry.yarnpkg.com/qs/-/qs-6.9.7.tgz#4610846871485e1e048f44ae3b94033f0e675afe" - integrity sha512-IhMFgUmuNpyRfxA90umL7ByLlgRXu6tIfKPpF5TmcfRLlLCckfP/g3IQmju6jjpu+Hh8rA+2p6A27ZSPOOHdKw== - -queue-microtask@^1.2.2: - version "1.2.3" - resolved "https://registry.yarnpkg.com/queue-microtask/-/queue-microtask-1.2.3.tgz#4929228bbc724dfac43e0efb058caf7b6cfb6243" - 
integrity sha512-NuaNSa6flKT5JaSYQzJok04JzTL1CA6aGhv5rfLW3PgqA+M2ChpZQnAC8h8i4ZFkBS8X5RqkDBHA7r4hej3K9A== - -randombytes@^2.1.0: - version "2.1.0" - resolved "https://registry.yarnpkg.com/randombytes/-/randombytes-2.1.0.tgz#df6f84372f0270dc65cdf6291349ab7a473d4f2a" - integrity sha512-vYl3iOX+4CKUWuxGi9Ukhie6fsqXqS9FE2Zaic4tNFD2N2QQaXOMFbuKK4QmDHC0JO6B1Zp41J0LpT0oR68amQ== - dependencies: - safe-buffer "^5.1.0" - -range-parser@^1.2.1, range-parser@~1.2.1: - version "1.2.1" - resolved "https://registry.yarnpkg.com/range-parser/-/range-parser-1.2.1.tgz#3cf37023d199e1c24d1a55b84800c2f3e6468031" - integrity sha512-Hrgsx+orqoygnmhFbKaHE6c296J+HTAQXoxEF6gNupROmmGJRoyzfG3ccAveqCBrwr/2yxQ5BVd/GTl5agOwSg== - -raw-body@2.4.3: - version "2.4.3" - resolved "https://registry.yarnpkg.com/raw-body/-/raw-body-2.4.3.tgz#8f80305d11c2a0a545c2d9d89d7a0286fcead43c" - integrity sha512-UlTNLIcu0uzb4D2f4WltY6cVjLi+/jEN4lgEUj3E04tpMDpUlkBo/eSn6zou9hum2VMNpCCUone0O0WeJim07g== - dependencies: - bytes "3.1.2" - http-errors "1.8.1" - iconv-lite "0.4.24" - unpipe "1.0.0" - -rc-align@^4.0.0: - version "4.0.11" - resolved "https://registry.yarnpkg.com/rc-align/-/rc-align-4.0.11.tgz#8198c62db266bc1b8ef05e56c13275bf72628a5e" - integrity sha512-n9mQfIYQbbNTbefyQnRHZPWuTEwG1rY4a9yKlIWHSTbgwI+XUMGRYd0uJ5pE2UbrNX0WvnMBA1zJ3Lrecpra/A== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "2.x" - dom-align "^1.7.0" - lodash "^4.17.21" - rc-util "^5.3.0" - resize-observer-polyfill "^1.5.1" - -rc-cascader@~3.2.1: - version "3.2.7" - resolved "https://registry.yarnpkg.com/rc-cascader/-/rc-cascader-3.2.7.tgz#74ac3ab9258f930e0c84dfacffd838b122b2cedf" - integrity sha512-M8VtKtifTXXo/qqXj63p12tsMNXm1z45Lytj7tu86L6gxIF8keDPcJ16/ZqrhS5JwlBPfoJNA1VooNl/KId15A== - dependencies: - "@babel/runtime" "^7.12.5" - array-tree-filter "^2.1.0" - classnames "^2.3.1" - rc-select "~14.0.0-alpha.23" - rc-tree "~5.4.3" - rc-util "^5.6.1" - -rc-checkbox@~2.3.0: - version "2.3.2" - resolved 
"https://registry.yarnpkg.com/rc-checkbox/-/rc-checkbox-2.3.2.tgz#f91b3678c7edb2baa8121c9483c664fa6f0aefc1" - integrity sha512-afVi1FYiGv1U0JlpNH/UaEXdh6WUJjcWokj/nUN2TgG80bfG+MDdbfHKlLcNNba94mbjy2/SXJ1HDgrOkXGAjg== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.1" - -rc-collapse@~3.1.0: - version "3.1.2" - resolved "https://registry.yarnpkg.com/rc-collapse/-/rc-collapse-3.1.2.tgz#76028a811b845d03d9460ccc409c7ea8ad09db14" - integrity sha512-HujcKq7mghk/gVKeI6EjzTbb8e19XUZpakrYazu1MblEZ3Hu3WBMSN4A3QmvbF6n1g7x6lUlZvsHZ5shABWYOQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "2.x" - rc-motion "^2.3.4" - rc-util "^5.2.1" - shallowequal "^1.1.0" - -rc-dialog@~8.6.0: - version "8.6.0" - resolved "https://registry.yarnpkg.com/rc-dialog/-/rc-dialog-8.6.0.tgz#3b228dac085de5eed8c6237f31162104687442e7" - integrity sha512-GSbkfqjqxpZC5/zc+8H332+q5l/DKUhpQr0vdX2uDsxo5K0PhvaMEVjyoJUTkZ3+JstEADQji1PVLVb/2bJeOQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.6" - rc-motion "^2.3.0" - rc-util "^5.6.1" - -rc-drawer@~4.4.2: - version "4.4.3" - resolved "https://registry.yarnpkg.com/rc-drawer/-/rc-drawer-4.4.3.tgz#2094937a844e55dc9644236a2d9fba79c344e321" - integrity sha512-FYztwRs3uXnFOIf1hLvFxIQP9MiZJA+0w+Os8dfDh/90X7z/HqP/Yg+noLCIeHEbKln1Tqelv8ymCAN24zPcfQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.6" - rc-util "^5.7.0" - -rc-dropdown@^3.2.0, rc-dropdown@~3.3.2: - version "3.3.2" - resolved "https://registry.yarnpkg.com/rc-dropdown/-/rc-dropdown-3.3.2.tgz#097c2ec1b6d55c10eeb94dcf6120ba034c7a58e0" - integrity sha512-49GOz42oNvLtYGoJ2X5UWXJFp7aUiSZkj9OcgTV1UpxFZqHQMw+xijkaL5k3XDkMbb92XsuFnFt7IGG3/C0DKw== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.6" - rc-trigger "^5.0.4" - -rc-field-form@~1.23.0: - version "1.23.1" - resolved "https://registry.yarnpkg.com/rc-field-form/-/rc-field-form-1.23.1.tgz#638c11d05d7ed2efdcb862ff3da5fe2a7d199aaa" - integrity 
sha512-Mun+eaFmX1Pjud9bz0fD0IvxwDfFKWk2Q8tkt4sg4aKR9/FML/rzYC5MjY77p86X45XBurBDUR3gAda+Cg/ULw== - dependencies: - "@babel/runtime" "^7.8.4" - async-validator "^4.0.2" - rc-util "^5.8.0" - -rc-image@~5.2.5: - version "5.2.5" - resolved "https://registry.yarnpkg.com/rc-image/-/rc-image-5.2.5.tgz#44e6ffc842626827960e7ab72e1c0d6f3a8ce440" - integrity sha512-qUfZjYIODxO0c8a8P5GeuclYXZjzW4hV/5hyo27XqSFo1DmTCs2HkVeQObkcIk5kNsJtgsj1KoPThVsSc/PXOw== - dependencies: - "@babel/runtime" "^7.11.2" - classnames "^2.2.6" - rc-dialog "~8.6.0" - rc-util "^5.0.6" - -rc-input-number@~7.3.0: - version "7.3.4" - resolved "https://registry.yarnpkg.com/rc-input-number/-/rc-input-number-7.3.4.tgz#674aea98260250287d36e330a7e065b174486e9d" - integrity sha512-W9uqSzuvJUnz8H8vsVY4kx+yK51SsAxNTwr8SNH4G3XqQNocLVmKIibKFRjocnYX1RDHMND9FFbgj2h7E7nvGA== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.5" - rc-util "^5.9.8" - -rc-input@^0.0.1-alpha.5: - version "0.0.1-alpha.5" - resolved "https://registry.yarnpkg.com/rc-input/-/rc-input-0.0.1-alpha.5.tgz#cc043c44570c651f4d10d9809b3d634ed12537e6" - integrity sha512-RHvNweOVWFbbx2l/y6hgnSAdOg5fXc1D1VGhX2RNkGGyGr6cemnvyiYMxwZJjcXs0al3YK9jMObm20+DgH/mpw== - dependencies: - "@babel/runtime" "^7.11.1" - classnames "^2.2.1" - rc-util "^5.18.1" - -rc-mentions@~1.6.1: - version "1.6.2" - resolved "https://registry.yarnpkg.com/rc-mentions/-/rc-mentions-1.6.2.tgz#62ed7cdd8fa86d857c3ce3f9e73438022130815e" - integrity sha512-cntfJkNMq8B910rXuvnsnOV88DfmoUidnQnSIeXzWiYiUX4RL5oWUfSZzs+HAXYRU4SL1l8Mwjx95wHETiZ/fQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.6" - rc-menu "^9.0.0" - rc-textarea "^0.3.0" - rc-trigger "^5.0.4" - rc-util "^5.0.1" - -rc-menu@^9.0.0: - version "9.3.2" - resolved "https://registry.yarnpkg.com/rc-menu/-/rc-menu-9.3.2.tgz#bb842d37ebf71da912bea201cf7ef0a27267ad49" - integrity sha512-h3m45oY1INZyqphGELkdT0uiPnFzxkML8m0VMhJnk2fowtqfiT7F5tJLT3znEVaPIY80vMy1bClCkgq8U91CzQ== - dependencies: - "@babel/runtime" 
"^7.10.1" - classnames "2.x" - rc-motion "^2.4.3" - rc-overflow "^1.2.0" - rc-trigger "^5.1.2" - rc-util "^5.12.0" - shallowequal "^1.1.0" - -rc-menu@~9.2.1: - version "9.2.1" - resolved "https://registry.yarnpkg.com/rc-menu/-/rc-menu-9.2.1.tgz#6fbe47f4846363bb81a5a21f0960026c3ada497a" - integrity sha512-UbEtn3rflJ8zS+etYGTVQuzy7Fm+yWXR5c0Rl6ecNTS/dPknRyWAyhJcbeR0Hu1+RdQT+0VCqrUPrgKnm4iY+w== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "2.x" - rc-motion "^2.4.3" - rc-overflow "^1.2.0" - rc-trigger "^5.1.2" - rc-util "^5.12.0" - shallowequal "^1.1.0" - -rc-motion@^2.0.0, rc-motion@^2.0.1, rc-motion@^2.2.0, rc-motion@^2.3.0, rc-motion@^2.3.4, rc-motion@^2.4.3, rc-motion@^2.4.4: - version "2.4.5" - resolved "https://registry.yarnpkg.com/rc-motion/-/rc-motion-2.4.5.tgz#b061c50bb29ecd3d735d5f4c40924a3c78226cbd" - integrity sha512-f3uJHR4gcpeZS/s8/nYFSOrXt2Wu/h9GrEcbJmC0qmKrVNgwL1pTgrT5kW7lgG6PFeoL4yHDmpQoEKkrPtKIzQ== - dependencies: - "@babel/runtime" "^7.11.1" - classnames "^2.2.1" - rc-util "^5.18.1" - -rc-notification@~4.5.7: - version "4.5.7" - resolved "https://registry.yarnpkg.com/rc-notification/-/rc-notification-4.5.7.tgz#265e6e6a0c1a0fac63d6abd4d832eb8ff31522f1" - integrity sha512-zhTGUjBIItbx96SiRu3KVURcLOydLUHZCPpYEn1zvh+re//Tnq/wSxN4FKgp38n4HOgHSVxcLEeSxBMTeBBDdw== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "2.x" - rc-motion "^2.2.0" - rc-util "^5.0.1" - -rc-overflow@^1.0.0, rc-overflow@^1.2.0: - version "1.2.3" - resolved "https://registry.yarnpkg.com/rc-overflow/-/rc-overflow-1.2.3.tgz#1754216d807f5473304272b0321c3aba7615f47a" - integrity sha512-Bz6dXTn/ww8nmu70tUQfRV0wT3BkfXY6j1lB1O38OVkDPz4xwfAcGK+LJ2zewUR5cTXkJ8hAN7YULohG8z4M7Q== - dependencies: - "@babel/runtime" "^7.11.1" - classnames "^2.2.1" - rc-resize-observer "^1.0.0" - rc-util "^5.15.0" - -rc-pagination@~3.1.9: - version "3.1.15" - resolved "https://registry.yarnpkg.com/rc-pagination/-/rc-pagination-3.1.15.tgz#e05eddf4c15717a5858290bed0857e27e2f957ff" - integrity 
sha512-4L3fot8g4E+PjWEgoVGX0noFCg+8ZFZmeLH4vsnZpB3O2T2zThtakjNxG+YvSaYtyMVT4B+GLayjKrKbXQpdAg== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.1" - -rc-picker@~2.6.4: - version "2.6.4" - resolved "https://registry.yarnpkg.com/rc-picker/-/rc-picker-2.6.4.tgz#916aa5fcd8abd11106f1c2fb64bfd549439abfa0" - integrity sha512-Mnc1udPyGNSG7/ya5SmYltUjCUcsMH7jfJnuuXVAvEaEdx9qZxDGMWtIii//+ARC06CSHQ83s5iwiGFwM+FcDw== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.1" - date-fns "2.x" - dayjs "1.x" - moment "^2.24.0" - rc-trigger "^5.0.4" - rc-util "^5.4.0" - shallowequal "^1.1.0" - -rc-progress@~3.2.1: - version "3.2.4" - resolved "https://registry.yarnpkg.com/rc-progress/-/rc-progress-3.2.4.tgz#4036acdae2566438545bc4df2203248babaf7549" - integrity sha512-M9WWutRaoVkPUPIrTpRIDpX0SPSrVHzxHdCRCbeoBFrd9UFWTYNWRlHsruJM5FH1AZI+BwB4wOJUNNylg/uFSw== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.6" - rc-util "^5.16.1" - -rc-rate@~2.9.0: - version "2.9.1" - resolved "https://registry.yarnpkg.com/rc-rate/-/rc-rate-2.9.1.tgz#e43cb95c4eb90a2c1e0b16ec6614d8c43530a731" - integrity sha512-MmIU7FT8W4LYRRHJD1sgG366qKtSaKb67D0/vVvJYR0lrCuRrCiVQ5qhfT5ghVO4wuVIORGpZs7ZKaYu+KMUzA== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.5" - rc-util "^5.0.1" - -rc-resize-observer@^1.0.0, rc-resize-observer@^1.1.0, rc-resize-observer@^1.2.0: - version "1.2.0" - resolved "https://registry.yarnpkg.com/rc-resize-observer/-/rc-resize-observer-1.2.0.tgz#9f46052f81cdf03498be35144cb7c53fd282c4c7" - integrity sha512-6W+UzT3PyDM0wVCEHfoW3qTHPTvbdSgiA43buiy8PzmeMnfgnDeb9NjdimMXMl3/TcrvvWl5RRVdp+NqcR47pQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.1" - rc-util "^5.15.0" - resize-observer-polyfill "^1.5.1" - -rc-select@~14.0.0-alpha.15, rc-select@~14.0.0-alpha.23, rc-select@~14.0.0-alpha.8: - version "14.0.0" - resolved "https://registry.yarnpkg.com/rc-select/-/rc-select-14.0.0.tgz#87735dbc548f1cc8e94d579b21682ed2d34f7653" - 
integrity sha512-DkoWMhyxmrfpc1KJSqPORZdkKevzgOINvjR4WI+dibRe6i6DyqGB4Jk21sencnK9di6dumzOCHf93x9t9+gp3Q== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "2.x" - rc-motion "^2.0.1" - rc-overflow "^1.0.0" - rc-trigger "^5.0.4" - rc-util "^5.16.1" - rc-virtual-list "^3.2.0" - -rc-slider@~10.0.0-alpha.4: - version "10.0.0-alpha.4" - resolved "https://registry.yarnpkg.com/rc-slider/-/rc-slider-10.0.0-alpha.4.tgz#f14ec0905d53f1f9d7f495c301527d6eca5781cf" - integrity sha512-ih2xwkBgXAWAf7MjZIZyCiiWo6tnoIMuHifn0UeKXVAup7sH53QdSVvT9x/cysuSZIPNMYWEf6mec184n3gbiQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.5" - rc-tooltip "^5.0.1" - rc-util "^5.18.1" - shallowequal "^1.1.0" - -rc-steps@~4.1.0: - version "4.1.4" - resolved "https://registry.yarnpkg.com/rc-steps/-/rc-steps-4.1.4.tgz#0ba82db202d59ca52d0693dc9880dd145b19dc23" - integrity sha512-qoCqKZWSpkh/b03ASGx1WhpKnuZcRWmvuW+ZUu4mvMdfvFzVxblTwUM+9aBd0mlEUFmt6GW8FXhMpHkK3Uzp3w== - dependencies: - "@babel/runtime" "^7.10.2" - classnames "^2.2.3" - rc-util "^5.0.1" - -rc-switch@~3.2.0: - version "3.2.2" - resolved "https://registry.yarnpkg.com/rc-switch/-/rc-switch-3.2.2.tgz#d001f77f12664d52595b4f6fb425dd9e66fba8e8" - integrity sha512-+gUJClsZZzvAHGy1vZfnwySxj+MjLlGRyXKXScrtCTcmiYNPzxDFOxdQ/3pK1Kt/0POvwJ/6ALOR8gwdXGhs+A== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.1" - rc-util "^5.0.1" - -rc-table@~7.23.0: - version "7.23.0" - resolved "https://registry.yarnpkg.com/rc-table/-/rc-table-7.23.0.tgz#e5f76998ecf3246147d45ed311417c08886e6507" - integrity sha512-Q1gneB2+lUa8EzCCfbrq+jO1qNSwQv1RUUXKB84W/Stdp4EvGOt2+QqGyfotMNM4JUw0fgGLwY+WjnhUhnLuQQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.5" - rc-resize-observer "^1.1.0" - rc-util "^5.14.0" - shallowequal "^1.1.0" - -rc-tabs@~11.10.0: - version "11.10.7" - resolved "https://registry.yarnpkg.com/rc-tabs/-/rc-tabs-11.10.7.tgz#7d8b5dcc17f1608cf3b9425d80069f1415479335" - integrity 
sha512-7IKmcU7QU3CdYnJTabeXs2DDeLiXLyALC8fvOtgyWWFXUD47G5vG+4bFO3f9+AI+rcFAPpfwapZbXxgmiRuWYQ== - dependencies: - "@babel/runtime" "^7.11.2" - classnames "2.x" - rc-dropdown "^3.2.0" - rc-menu "^9.0.0" - rc-resize-observer "^1.0.0" - rc-util "^5.5.0" - -rc-textarea@^0.3.0, rc-textarea@~0.3.0: - version "0.3.7" - resolved "https://registry.yarnpkg.com/rc-textarea/-/rc-textarea-0.3.7.tgz#987142891efdedb774883c07e2f51b318fde5a11" - integrity sha512-yCdZ6binKmAQB13hc/oehh0E/QRwoPP1pjF21aHBxlgXO3RzPF6dUu4LG2R4FZ1zx/fQd2L1faktulrXOM/2rw== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.1" - rc-resize-observer "^1.0.0" - rc-util "^5.7.0" - shallowequal "^1.1.0" - -rc-tooltip@^5.0.1, rc-tooltip@~5.1.1: - version "5.1.1" - resolved "https://registry.yarnpkg.com/rc-tooltip/-/rc-tooltip-5.1.1.tgz#94178ed162d0252bc4993b725f5dc2ac0fccf154" - integrity sha512-alt8eGMJulio6+4/uDm7nvV+rJq9bsfxFDCI0ljPdbuoygUscbsMYb6EQgwib/uqsXQUvzk+S7A59uYHmEgmDA== - dependencies: - "@babel/runtime" "^7.11.2" - rc-trigger "^5.0.0" - -rc-tree-select@~5.1.1: - version "5.1.4" - resolved "https://registry.yarnpkg.com/rc-tree-select/-/rc-tree-select-5.1.4.tgz#3577135399d1f4931b0f4d8245e0845861802e2b" - integrity sha512-sA6vTUQghzbjh3u6YAwJIebKkJEHUWDPFHQpfiPObqsEYqi9TKE1LvWqbJ77NbOlOARZq0KIb7LDGF8X0dikDQ== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "2.x" - rc-select "~14.0.0-alpha.8" - rc-tree "~5.4.3" - rc-util "^5.16.1" - -rc-tree@~5.4.3: - version "5.4.4" - resolved "https://registry.yarnpkg.com/rc-tree/-/rc-tree-5.4.4.tgz#2ea3663ad3c566aef79a46ba6a1e050d24323e01" - integrity sha512-2qoObRgp31DBXmVzMJmo4qmwP20XEa4hR3imWQtRPcgN3pmljW3WKFmZRrYdOFHz7CyTnRsFZR065bBkIoUpiA== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "2.x" - rc-motion "^2.0.1" - rc-util "^5.16.1" - rc-virtual-list "^3.4.2" - -rc-trigger@^5.0.0, rc-trigger@^5.0.4, rc-trigger@^5.1.2, rc-trigger@^5.2.10: - version "5.2.10" - resolved 
"https://registry.yarnpkg.com/rc-trigger/-/rc-trigger-5.2.10.tgz#8a0057a940b1b9027eaa33beec8a6ecd85cce2b1" - integrity sha512-FkUf4H9BOFDaIwu42fvRycXMAvkttph9AlbCZXssZDVzz2L+QZ0ERvfB/4nX3ZFPh1Zd+uVGr1DEDeXxq4J1TA== - dependencies: - "@babel/runtime" "^7.11.2" - classnames "^2.2.6" - rc-align "^4.0.0" - rc-motion "^2.0.0" - rc-util "^5.5.0" - -rc-upload@~4.3.0: - version "4.3.3" - resolved "https://registry.yarnpkg.com/rc-upload/-/rc-upload-4.3.3.tgz#e237aa525e5313fa16f4d04d27f53c2f0e157bb8" - integrity sha512-YoJ0phCRenMj1nzwalXzciKZ9/FAaCrFu84dS5pphwucTC8GUWClcDID/WWNGsLFcM97NqIboDqrV82rVRhW/w== - dependencies: - "@babel/runtime" "^7.10.1" - classnames "^2.2.5" - rc-util "^5.2.0" - -rc-util@^5.0.1, rc-util@^5.0.6, rc-util@^5.0.7, rc-util@^5.12.0, rc-util@^5.14.0, rc-util@^5.15.0, rc-util@^5.16.1, rc-util@^5.18.1, rc-util@^5.2.0, rc-util@^5.2.1, rc-util@^5.3.0, rc-util@^5.4.0, rc-util@^5.5.0, rc-util@^5.6.1, rc-util@^5.7.0, rc-util@^5.8.0, rc-util@^5.9.4, rc-util@^5.9.8: - version "5.18.1" - resolved "https://registry.yarnpkg.com/rc-util/-/rc-util-5.18.1.tgz#80bd1450b5254655d2fbea63e3d34f6871e9be79" - integrity sha512-24xaSrMZUEKh1+suDOtJWfPe9E6YrwryViZcoPO0miJTKzP4qhUlV5AAlKQ82AJilz/AOHfi3l6HoX8qa1ye8w== - dependencies: - "@babel/runtime" "^7.12.5" - react-is "^16.12.0" - shallowequal "^1.1.0" - -rc-virtual-list@^3.2.0, rc-virtual-list@^3.4.2: - version "3.4.2" - resolved "https://registry.yarnpkg.com/rc-virtual-list/-/rc-virtual-list-3.4.2.tgz#1078327aa7230b5e456d679ed2ce99f3c036ebd1" - integrity sha512-OyVrrPvvFcHvV0ssz5EDZ+7Rf5qLat/+mmujjchNw5FfbJWNDwkpQ99EcVE6+FtNRmX9wFa1LGNpZLUTvp/4GQ== - dependencies: - classnames "^2.2.6" - rc-resize-observer "^1.0.0" - rc-util "^5.0.7" - -react-dom@^16.13.1: - version "16.14.0" - resolved "https://registry.yarnpkg.com/react-dom/-/react-dom-16.14.0.tgz#7ad838ec29a777fb3c75c3a190f661cf92ab8b89" - integrity sha512-1gCeQXDLoIqMgqD3IO2Ah9bnf0w9kzhwN5q4FGnHZ67hBm9yePzB5JJAIQCc8x3pFnNlwFq4RidZggNAAkzWWw== - dependencies: - 
loose-envify "^1.1.0" - object-assign "^4.1.1" - prop-types "^15.6.2" - scheduler "^0.19.1" - -react-flame-graph@^1.4.0: - version "1.4.0" - resolved "https://registry.yarnpkg.com/react-flame-graph/-/react-flame-graph-1.4.0.tgz#52d118cc94348f630a812fc0ec530a5b73c30cdb" - integrity sha512-DaCK9ZX+xK0mNca72kUE5cu6T8hGe/KLsefQWf+eT9sVt+0WP1dVxZCGD8Svfn2KrZB9Mv011Intg/yG2YWSxA== - dependencies: - flow-bin "^0.118.0" - memoize-one "^3.1.1" - react-window "^1" - -react-is@^16.12.0, react-is@^16.13.1, react-is@^16.7.0: - version "16.13.1" - resolved "https://registry.yarnpkg.com/react-is/-/react-is-16.13.1.tgz#789729a4dc36de2999dc156dd6c1d9c18cea56a4" - integrity sha512-24e6ynE2H+OKt4kqsOvNd8kBpV65zoxbA4BVsEOB3ARVWQki/DHzaUoC5KuON/BiccDaCCTZBuOcfZs70kR8bQ== - -"react-is@^16.8.0 || ^17.0.0": - version "17.0.2" - resolved "https://registry.yarnpkg.com/react-is/-/react-is-17.0.2.tgz#e691d4a8e9c789365655539ab372762b0efb54f0" - integrity sha512-w2GsyukL62IJnlaff/nRegPQR94C/XXamvMWmSHRJ4y7Ts/4ocGRmTHvOs8PSE6pB3dWOrD/nueuU5sduBsQ4w== - -react-transition-group@^4.4.0: - version "4.4.2" - resolved "https://registry.yarnpkg.com/react-transition-group/-/react-transition-group-4.4.2.tgz#8b59a56f09ced7b55cbd53c36768b922890d5470" - integrity sha512-/RNYfRAMlZwDSr6z4zNKV6xu53/e2BuaBbGhbyYIXTrmgu/bGHzmqOs7mJSJBHy9Ud+ApHx3QjrkKSp1pxvlFg== - dependencies: - "@babel/runtime" "^7.5.5" - dom-helpers "^5.0.1" - loose-envify "^1.4.0" - prop-types "^15.6.2" - -react-window@^1: - version "1.8.6" - resolved "https://registry.yarnpkg.com/react-window/-/react-window-1.8.6.tgz#d011950ac643a994118632665aad0c6382e2a112" - integrity sha512-8VwEEYyjz6DCnGBsd+MgkD0KJ2/OXFULyDtorIiTz+QzwoP94tBoA7CnbtyXMm+cCeAUER5KJcPtWl9cpKbOBg== - dependencies: - "@babel/runtime" "^7.0.0" - memoize-one ">=3.1.1 <6" - -react@^16.13.1: - version "16.14.0" - resolved "https://registry.yarnpkg.com/react/-/react-16.14.0.tgz#94d776ddd0aaa37da3eda8fc5b6b18a4c9a3114d" - integrity 
sha512-0X2CImDkJGApiAlcf0ODKIneSwBPhqJawOa5wCtKbu7ZECrmS26NvtSILynQ66cgkT/RJ4LidJOc3bUESwmU8g== - dependencies: - loose-envify "^1.1.0" - object-assign "^4.1.1" - prop-types "^15.6.2" - -readable-stream@^2.0.1: - version "2.3.7" - resolved "https://registry.yarnpkg.com/readable-stream/-/readable-stream-2.3.7.tgz#1eca1cf711aef814c04f62252a36a62f6cb23b57" - integrity sha512-Ebho8K4jIbHAxnuxi7o42OrZgF/ZTNcsZj6nRKyUmkhLFq8CHItp/fy6hQZuZmP/n3yZ9VBUbp4zz/mX8hmYPw== - dependencies: - core-util-is "~1.0.0" - inherits "~2.0.3" - isarray "~1.0.0" - process-nextick-args "~2.0.0" - safe-buffer "~5.1.1" - string_decoder "~1.1.1" - util-deprecate "~1.0.1" - -readable-stream@^3.0.6: - version "3.6.0" - resolved "https://registry.yarnpkg.com/readable-stream/-/readable-stream-3.6.0.tgz#337bbda3adc0706bd3e024426a286d4b4b2c9198" - integrity sha512-BViHy7LKeTz4oNnkcLJ+lVSL6vpiFeX6/d3oSH8zCW7UxP2onchk+vTGB143xuFjHS3deTgkKoXXymXqymiIdA== - dependencies: - inherits "^2.0.3" - string_decoder "^1.1.1" - util-deprecate "^1.0.1" - -readdirp@~3.6.0: - version "3.6.0" - resolved "https://registry.yarnpkg.com/readdirp/-/readdirp-3.6.0.tgz#74a370bd857116e245b29cc97340cd431a02a6c7" - integrity sha512-hOS089on8RduqdbhvQ5Z37A0ESjsqz6qnRcffsMU3495FuTdqSm+7bhJ29JvIOsBDEEnan5DPu9t3To9VRlMzA== - dependencies: - picomatch "^2.2.1" - -rechoir@^0.7.0: - version "0.7.1" - resolved "https://registry.yarnpkg.com/rechoir/-/rechoir-0.7.1.tgz#9478a96a1ca135b5e88fc027f03ee92d6c645686" - integrity sha512-/njmZ8s1wVeR6pjTZ+0nCnv8SpZNRMT2D1RLOJQESlYFDBvwpTA4KWJpZ+sBJ4+vhjILRcK7JIFdGCdxEAAitg== - dependencies: - resolve "^1.9.0" - -regenerator-runtime@^0.13.4: - version "0.13.9" - resolved "https://registry.yarnpkg.com/regenerator-runtime/-/regenerator-runtime-0.13.9.tgz#8925742a98ffd90814988d7566ad30ca3b263b52" - integrity sha512-p3VT+cOEgxFsRRA9X4lkI1E+k2/CtnKtU4gcxyaCUreilL/vqI6CdZ3wxVUx3UOUg+gnUOQQcRI7BmSI656MYA== - -regexp.prototype.flags@^1.2.0: - version "1.4.1" - resolved 
"https://registry.yarnpkg.com/regexp.prototype.flags/-/regexp.prototype.flags-1.4.1.tgz#b3f4c0059af9e47eca9f3f660e51d81307e72307" - integrity sha512-pMR7hBVUUGI7PMA37m2ofIdQCsomVnas+Jn5UPGAHQ+/LlwKm/aTLJHdasmHRzlfeZwHiAOaRSo2rbBDm3nNUQ== - dependencies: - call-bind "^1.0.2" - define-properties "^1.1.3" - -relateurl@^0.2.7: - version "0.2.7" - resolved "https://registry.yarnpkg.com/relateurl/-/relateurl-0.2.7.tgz#54dbf377e51440aca90a4cd274600d3ff2d888a9" - integrity sha1-VNvzd+UUQKypCkzSdGANP/LYiKk= - -renderkid@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/renderkid/-/renderkid-3.0.0.tgz#5fd823e4d6951d37358ecc9a58b1f06836b6268a" - integrity sha512-q/7VIQA8lmM1hF+jn+sFSPWGlMkSAeNYcPLmDQx2zzuiDfaLrOmumR8iaUKlenFgh0XRPIUeSPlH3A+AW3Z5pg== - dependencies: - css-select "^4.1.3" - dom-converter "^0.2.0" - htmlparser2 "^6.1.0" - lodash "^4.17.21" - strip-ansi "^6.0.1" - -require-from-string@^2.0.2: - version "2.0.2" - resolved "https://registry.yarnpkg.com/require-from-string/-/require-from-string-2.0.2.tgz#89a7fdd938261267318eafe14f9c32e598c36909" - integrity sha512-Xf0nWe6RseziFMu+Ap9biiUbmplq6S9/p+7w7YXP/JBHhrUDDUhwa+vANyubuqfZWTveU//DYVGsDG7RKL/vEw== - -requires-port@^1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/requires-port/-/requires-port-1.0.0.tgz#925d2601d39ac485e091cf0da5c6e694dc3dcaff" - integrity sha1-kl0mAdOaxIXgkc8NpcbmlNw9yv8= - -resize-observer-polyfill@^1.5.0, resize-observer-polyfill@^1.5.1: - version "1.5.1" - resolved "https://registry.yarnpkg.com/resize-observer-polyfill/-/resize-observer-polyfill-1.5.1.tgz#0e9020dd3d21024458d4ebd27e23e40269810464" - integrity sha512-LwZrotdHOo12nQuZlHEmtuXdqGoOD0OhaxopaNFxWzInpEgaLWoVuAMbTzixuosCx2nEG58ngzW3vxdWoxIgdg== - -resolve-cwd@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/resolve-cwd/-/resolve-cwd-3.0.0.tgz#0f0075f1bb2544766cf73ba6a6e2adfebcb13f2d" - integrity 
sha512-OrZaX2Mb+rJCpH/6CpSqt9xFVpN++x01XnN2ie9g6P5/3xelLAkXWVADpdz1IHD/KFfEXyE6V0U01OQ3UO2rEg== - dependencies: - resolve-from "^5.0.0" - -resolve-from@^5.0.0: - version "5.0.0" - resolved "https://registry.yarnpkg.com/resolve-from/-/resolve-from-5.0.0.tgz#c35225843df8f776df21c57557bc087e9dfdfc69" - integrity sha512-qYg9KP24dD5qka9J47d0aVky0N+b4fTU89LN9iDnjB5waksiC49rvMB0PrUJQGoTmH50XPiqOvAjDfaijGxYZw== - -resolve@^1.9.0: - version "1.22.0" - resolved "https://registry.yarnpkg.com/resolve/-/resolve-1.22.0.tgz#5e0b8c67c15df57a89bdbabe603a002f21731198" - integrity sha512-Hhtrw0nLeSrFQ7phPp4OOcVjLPIeMnRlr5mcnVuMe7M/7eBn98A3hmFRLoFo3DLZkivSYwhRUJTyPyWAk56WLw== - dependencies: - is-core-module "^2.8.1" - path-parse "^1.0.7" - supports-preserve-symlinks-flag "^1.0.0" - -retry@^0.13.1: - version "0.13.1" - resolved "https://registry.yarnpkg.com/retry/-/retry-0.13.1.tgz#185b1587acf67919d63b357349e03537b2484658" - integrity sha512-XQBQ3I8W1Cge0Seh+6gjj03LbmRFWuoszgK9ooCpwYIrhhoO80pfq4cUkU5DkknwfOfFteRwlZ56PYOGYyFWdg== - -reusify@^1.0.4: - version "1.0.4" - resolved "https://registry.yarnpkg.com/reusify/-/reusify-1.0.4.tgz#90da382b1e126efc02146e90845a88db12925d76" - integrity sha512-U9nH88a3fc/ekCF1l0/UP1IosiuIjyTh7hBvXVMHYgVcfGvt897Xguj2UOLDeI5BG2m7/uwyaLVT6fbtCwTyzw== - -rimraf@^3.0.2: - version "3.0.2" - resolved "https://registry.yarnpkg.com/rimraf/-/rimraf-3.0.2.tgz#f1a5402ba6220ad52cc1282bac1ae3aa49fd061a" - integrity sha512-JZkJMZkAGFFPP2YqXZXPbMlMBgsxzE8ILs4lMIX/2o0L9UBw9O/Y3o6wFw/i9YLapcUJWwqbi3kdxIPdC62TIA== - dependencies: - glob "^7.1.3" - -run-parallel@^1.1.9: - version "1.2.0" - resolved "https://registry.yarnpkg.com/run-parallel/-/run-parallel-1.2.0.tgz#66d1368da7bdf921eb9d95bd1a9229e7f21a43ee" - integrity sha512-5l4VyZR86LZ/lDxZTR6jqL8AFE2S0IFLMP26AbjsLVADxHdhB/c0GUsH+y39UfCi3dzz8OlQuPmnaJOMoDHQBA== - dependencies: - queue-microtask "^1.2.2" - -safe-buffer@5.1.2, safe-buffer@~5.1.0, safe-buffer@~5.1.1: - version "5.1.2" - resolved 
"https://registry.yarnpkg.com/safe-buffer/-/safe-buffer-5.1.2.tgz#991ec69d296e0313747d59bdfd2b745c35f8828d" - integrity sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g== - -safe-buffer@5.2.1, safe-buffer@>=5.1.0, safe-buffer@^5.0.1, safe-buffer@^5.1.0, safe-buffer@~5.2.0: - version "5.2.1" - resolved "https://registry.yarnpkg.com/safe-buffer/-/safe-buffer-5.2.1.tgz#1eaf9fa9bdb1fdd4ec75f58f9cdb4e6b7827eec6" - integrity sha512-rp3So07KcdmmKbGvgaNxQSJr7bGVSVk5S9Eq1F+ppbRo70+YeaDxkw5Dd8NPN+GD6bjnYm2VuPuCXmpuYvmCXQ== - -"safer-buffer@>= 2.1.2 < 3": - version "2.1.2" - resolved "https://registry.yarnpkg.com/safer-buffer/-/safer-buffer-2.1.2.tgz#44fa161b0187b9549dd84bb91802f9bd8385cd6a" - integrity sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg== - -scheduler@^0.19.1: - version "0.19.1" - resolved "https://registry.yarnpkg.com/scheduler/-/scheduler-0.19.1.tgz#4f3e2ed2c1a7d65681f4c854fa8c5a1ccb40f196" - integrity sha512-n/zwRWRYSUj0/3g/otKDRPMh6qv2SYMWNq85IEa8iZyAv8od9zDYpGSnpBEjNgcMNq6Scbu5KfIPxNF72R/2EA== - dependencies: - loose-envify "^1.1.0" - object-assign "^4.1.1" - -schema-utils@^3.0.0, schema-utils@^3.1.0, schema-utils@^3.1.1: - version "3.1.1" - resolved "https://registry.yarnpkg.com/schema-utils/-/schema-utils-3.1.1.tgz#bc74c4b6b6995c1d88f76a8b77bea7219e0c8281" - integrity sha512-Y5PQxS4ITlC+EahLuXaY86TXfR7Dc5lw294alXOq86JAHCihAIZfqv8nNCWvaEJvaC51uN9hbLGeV0cFBdH+Fw== - dependencies: - "@types/json-schema" "^7.0.8" - ajv "^6.12.5" - ajv-keywords "^3.5.2" - -schema-utils@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/schema-utils/-/schema-utils-4.0.0.tgz#60331e9e3ae78ec5d16353c467c34b3a0a1d3df7" - integrity sha512-1edyXKgh6XnJsJSQ8mKWXnN/BVaIbFMLpouRUrXgVq7WYne5kw3MW7UPhO44uRXQSIpTSXoJbmrR2X0w9kUTyg== - dependencies: - "@types/json-schema" "^7.0.9" - ajv "^8.8.0" - ajv-formats "^2.1.1" - ajv-keywords "^5.0.0" - -scroll-into-view-if-needed@^2.2.25: - 
version "2.2.29" - resolved "https://registry.yarnpkg.com/scroll-into-view-if-needed/-/scroll-into-view-if-needed-2.2.29.tgz#551791a84b7e2287706511f8c68161e4990ab885" - integrity sha512-hxpAR6AN+Gh53AdAimHM6C8oTN1ppwVZITihix+WqalywBeFcQ6LdQP5ABNl26nX8GTEL7VT+b8lKpdqq65wXg== - dependencies: - compute-scroll-into-view "^1.0.17" - -select-hose@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/select-hose/-/select-hose-2.0.0.tgz#625d8658f865af43ec962bfc376a37359a4994ca" - integrity sha1-Yl2GWPhlr0Psliv8N2o3NZpJlMo= - -selfsigned@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/selfsigned/-/selfsigned-2.0.0.tgz#e927cd5377cbb0a1075302cff8df1042cc2bce5b" - integrity sha512-cUdFiCbKoa1mZ6osuJs2uDHrs0k0oprsKveFiiaBKCNq3SYyb5gs2HxhQyDNLCmL51ZZThqi4YNDpCK6GOP1iQ== - dependencies: - node-forge "^1.2.0" - -semver@^7.3.4, semver@^7.3.5: - version "7.3.5" - resolved "https://registry.yarnpkg.com/semver/-/semver-7.3.5.tgz#0b621c879348d8998e4b0e4be94b3f12e6018ef7" - integrity sha512-PoeGJYh8HK4BTO/a9Tf6ZG3veo/A7ZVsYrSA6J8ny9nb3B1VrpkuN+z9OE5wfE5p6H4LchYZsegiQgbJD94ZFQ== - dependencies: - lru-cache "^6.0.0" - -send@0.17.2: - version "0.17.2" - resolved "https://registry.yarnpkg.com/send/-/send-0.17.2.tgz#926622f76601c41808012c8bf1688fe3906f7820" - integrity sha512-UJYB6wFSJE3G00nEivR5rgWp8c2xXvJ3OPWPhmuteU0IKj8nKbG3DrjiOmLwpnHGYWAVwA69zmTm++YG0Hmwww== - dependencies: - debug "2.6.9" - depd "~1.1.2" - destroy "~1.0.4" - encodeurl "~1.0.2" - escape-html "~1.0.3" - etag "~1.8.1" - fresh "0.5.2" - http-errors "1.8.1" - mime "1.6.0" - ms "2.1.3" - on-finished "~2.3.0" - range-parser "~1.2.1" - statuses "~1.5.0" - -serialize-javascript@^6.0.0: - version "6.0.0" - resolved "https://registry.yarnpkg.com/serialize-javascript/-/serialize-javascript-6.0.0.tgz#efae5d88f45d7924141da8b5c3a7a7e663fefeb8" - integrity sha512-Qr3TosvguFt8ePWqsvRfrKyQXIiW+nGbYpy8XK24NQHE83caxWt+mIymTT19DGFbNWNLfEwsrkSmN64lVWB9ag== - dependencies: - randombytes "^2.1.0" - 
-serve-index@^1.9.1: - version "1.9.1" - resolved "https://registry.yarnpkg.com/serve-index/-/serve-index-1.9.1.tgz#d3768d69b1e7d82e5ce050fff5b453bea12a9239" - integrity sha1-03aNabHn2C5c4FD/9bRTvqEqkjk= - dependencies: - accepts "~1.3.4" - batch "0.6.1" - debug "2.6.9" - escape-html "~1.0.3" - http-errors "~1.6.2" - mime-types "~2.1.17" - parseurl "~1.3.2" - -serve-static@1.14.2: - version "1.14.2" - resolved "https://registry.yarnpkg.com/serve-static/-/serve-static-1.14.2.tgz#722d6294b1d62626d41b43a013ece4598d292bfa" - integrity sha512-+TMNA9AFxUEGuC0z2mevogSnn9MXKb4fa7ngeRMJaaGv8vTwnIEkKi+QGvPt33HSnf8pRS+WGM0EbMtCJLKMBQ== - dependencies: - encodeurl "~1.0.2" - escape-html "~1.0.3" - parseurl "~1.3.3" - send "0.17.2" - -setprototypeof@1.1.0: - version "1.1.0" - resolved "https://registry.yarnpkg.com/setprototypeof/-/setprototypeof-1.1.0.tgz#d0bd85536887b6fe7c0d818cb962d9d91c54e656" - integrity sha512-BvE/TwpZX4FXExxOxZyRGQQv651MSwmWKZGqvmPcRIjDqWub67kTKuIMx43cZZrS/cBBzwBcNDWoFxt2XEFIpQ== - -setprototypeof@1.2.0: - version "1.2.0" - resolved "https://registry.yarnpkg.com/setprototypeof/-/setprototypeof-1.2.0.tgz#66c9a24a73f9fc28cbe66b09fed3d33dcaf1b424" - integrity sha512-E5LDX7Wrp85Kil5bhZv46j8jOeboKq5JMmYM3gVGdGH8xFpPWXUMsNrlODCrkoxMEeNi/XZIwuRvY4XNwYMJpw== - -shallow-clone@^3.0.0: - version "3.0.1" - resolved "https://registry.yarnpkg.com/shallow-clone/-/shallow-clone-3.0.1.tgz#8f2981ad92531f55035b01fb230769a40e02efa3" - integrity sha512-/6KqX+GVUdqPuPPd2LxDDxzX6CAbjJehAAOKlNpqqUpAqPM6HeL8f+o3a+JsyGjn2lv0WY8UsTgUJjU9Ok55NA== - dependencies: - kind-of "^6.0.2" - -shallowequal@^1.1.0: - version "1.1.0" - resolved "https://registry.yarnpkg.com/shallowequal/-/shallowequal-1.1.0.tgz#188d521de95b9087404fd4dcb68b13df0ae4e7f8" - integrity sha512-y0m1JoUZSlPAjXVtPPW70aZWfIL/dSP7AFkRnniLCrK/8MDKog3TySTBmckD+RObVxH0v4Tox67+F14PdED2oQ== - -shebang-command@^2.0.0: - version "2.0.0" - resolved 
"https://registry.yarnpkg.com/shebang-command/-/shebang-command-2.0.0.tgz#ccd0af4f8835fbdc265b82461aaf0c36663f34ea" - integrity sha512-kHxr2zZpYtdmrN1qDjrrX/Z1rR1kG8Dx+gkpK1G4eXmvXswmcE1hTWBWYUzlraYw1/yZp6YuDY77YtvbN0dmDA== - dependencies: - shebang-regex "^3.0.0" - -shebang-regex@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/shebang-regex/-/shebang-regex-3.0.0.tgz#ae16f1644d873ecad843b0307b143362d4c42172" - integrity sha512-7++dFhtcx3353uBaq8DDR4NuxBetBzC7ZQOhmTQInHEd6bSrXdiEyzCvG07Z44UYdLShWUyXt5M/yhz8ekcb1A== - -signal-exit@^3.0.3: - version "3.0.7" - resolved "https://registry.yarnpkg.com/signal-exit/-/signal-exit-3.0.7.tgz#a9a1767f8af84155114eaabd73f99273c8f59ad9" - integrity sha512-wnD2ZE+l+SPC/uoS0vXeE9L1+0wuaMqKlfz9AMUo38JsyLSBWSFcHR1Rri62LZc12vLr1gb3jl7iwQhgwpAbGQ== - -slash@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/slash/-/slash-3.0.0.tgz#6539be870c165adbd5240220dbe361f1bc4d4634" - integrity sha512-g9Q1haeby36OSStwb4ntCGGGaKsaVSjQ68fBxoQcutl5fS1vuY18H3wSt3jFyFtrkx+Kz0V1G85A4MyAdDMi2Q== - -sockjs@^0.3.21: - version "0.3.24" - resolved "https://registry.yarnpkg.com/sockjs/-/sockjs-0.3.24.tgz#c9bc8995f33a111bea0395ec30aa3206bdb5ccce" - integrity sha512-GJgLTZ7vYb/JtPSSZ10hsOYIvEYsjbNU+zPdIHcUaWVNUEPivzxku31865sSSud0Da0W4lEeOPlmw93zLQchuQ== - dependencies: - faye-websocket "^0.11.3" - uuid "^8.3.2" - websocket-driver "^0.7.4" - -source-map-js@^1.0.2: - version "1.0.2" - resolved "https://registry.yarnpkg.com/source-map-js/-/source-map-js-1.0.2.tgz#adbc361d9c62df380125e7f161f71c826f1e490c" - integrity sha512-R0XvVJ9WusLiqTCEiGCmICCMplcCkIwwR11mOSD9CR5u+IXYdiseeEuXCVAjS54zqwkLcPNnmU4OeJ6tUrWhDw== - -source-map-support@~0.5.20: - version "0.5.21" - resolved "https://registry.yarnpkg.com/source-map-support/-/source-map-support-0.5.21.tgz#04fe7c7f9e1ed2d662233c28cb2b35b9f63f6e4f" - integrity sha512-uBHU3L3czsIyYXKX88fdrGovxdSCoTGDRZ6SYXtSRxLZUzHg5P/66Ht6uoUlHu9EZod+inXhKo3qQgwXUT/y1w== - dependencies: - buffer-from 
"^1.0.0" - source-map "^0.6.0" - -source-map@^0.6.0, source-map@^0.6.1, source-map@~0.6.0: - version "0.6.1" - resolved "https://registry.yarnpkg.com/source-map/-/source-map-0.6.1.tgz#74722af32e9614e9c287a8d0bbde48b5e2f1a263" - integrity sha512-UjgapumWlbMhkBgzT7Ykc5YXUT46F0iKu8SGXq0bcwP5dz/h0Plj6enJqjz1Zbq2l5WaqYnrVbwWOWMyF3F47g== - -source-map@~0.7.2: - version "0.7.3" - resolved "https://registry.yarnpkg.com/source-map/-/source-map-0.7.3.tgz#5302f8169031735226544092e64981f751750383" - integrity sha512-CkCj6giN3S+n9qrYiBTX5gystlENnRW5jZeNLHpe6aue+SrHcG5VYwujhW9s4dY31mEGsxBDrHR6oI69fTXsaQ== - -spdy-transport@^3.0.0: - version "3.0.0" - resolved "https://registry.yarnpkg.com/spdy-transport/-/spdy-transport-3.0.0.tgz#00d4863a6400ad75df93361a1608605e5dcdcf31" - integrity sha512-hsLVFE5SjA6TCisWeJXFKniGGOpBgMLmerfO2aCyCU5s7nJ/rpAepqmFifv/GCbSbueEeAJJnmSQ2rKC/g8Fcw== - dependencies: - debug "^4.1.0" - detect-node "^2.0.4" - hpack.js "^2.1.6" - obuf "^1.1.2" - readable-stream "^3.0.6" - wbuf "^1.7.3" - -spdy@^4.0.2: - version "4.0.2" - resolved "https://registry.yarnpkg.com/spdy/-/spdy-4.0.2.tgz#b74f466203a3eda452c02492b91fb9e84a27677b" - integrity sha512-r46gZQZQV+Kl9oItvl1JZZqJKGr+oEkB08A6BzkiR7593/7IbtuncXHd2YoYeTsG4157ZssMu9KYvUHLcjcDoA== - dependencies: - debug "^4.1.0" - handle-thing "^2.0.0" - http-deceiver "^1.2.7" - select-hose "^2.0.0" - spdy-transport "^3.0.0" - -"statuses@>= 1.4.0 < 2", "statuses@>= 1.5.0 < 2", statuses@~1.5.0: - version "1.5.0" - resolved "https://registry.yarnpkg.com/statuses/-/statuses-1.5.0.tgz#161c7dac177659fd9811f43771fa99381478628c" - integrity sha1-Fhx9rBd2Wf2YEfQ3cfqZOBR4Yow= - -string-convert@^0.2.0: - version "0.2.1" - resolved "https://registry.yarnpkg.com/string-convert/-/string-convert-0.2.1.tgz#6982cc3049fbb4cd85f8b24568b9d9bf39eeff97" - integrity sha1-aYLMMEn7tM2F+LJFaLnZvznu/5c= - -string_decoder@^1.1.1: - version "1.3.0" - resolved 
"https://registry.yarnpkg.com/string_decoder/-/string_decoder-1.3.0.tgz#42f114594a46cf1a8e30b0a84f56c78c3edac21e" - integrity sha512-hkRX8U1WjJFd8LsDJ2yQ/wWWxaopEsABU1XfkM8A+j0+85JAGppt16cr1Whg6KIbb4okU6Mql6BOj+uup/wKeA== - dependencies: - safe-buffer "~5.2.0" - -string_decoder@~1.1.1: - version "1.1.1" - resolved "https://registry.yarnpkg.com/string_decoder/-/string_decoder-1.1.1.tgz#9cf1611ba62685d7030ae9e4ba34149c3af03fc8" - integrity sha512-n/ShnvDi6FHbbVfviro+WojiFzv+s8MPMHBczVePfUpDJLwoLT0ht1l4YwBCbi8pJAveEEdnkHyPyTP/mzRfwg== - dependencies: - safe-buffer "~5.1.0" - -strip-ansi@^6.0.1: - version "6.0.1" - resolved "https://registry.yarnpkg.com/strip-ansi/-/strip-ansi-6.0.1.tgz#9e26c63d30f53443e9489495b2105d37b67a85d9" - integrity sha512-Y38VPSHcqkFrCpFnQ9vuSXmquuv5oXOKpGeT6aGrr3o3Gc9AlVa6JBfUSOCnbxGGZF+/0ooI7KrPuUSztUdU5A== - dependencies: - ansi-regex "^5.0.1" - -strip-ansi@^7.0.0: - version "7.0.1" - resolved "https://registry.yarnpkg.com/strip-ansi/-/strip-ansi-7.0.1.tgz#61740a08ce36b61e50e65653f07060d000975fb2" - integrity sha512-cXNxvT8dFNRVfhVME3JAe98mkXDYN2O1l7jmcwMnOslDeESg1rF/OZMtK0nRAhiari1unG5cD4jG3rapUAkLbw== - dependencies: - ansi-regex "^6.0.1" - -strip-final-newline@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/strip-final-newline/-/strip-final-newline-2.0.0.tgz#89b852fb2fcbe936f6f4b3187afb0a12c1ab58ad" - integrity sha512-BrpvfNAE3dcvq7ll3xVumzjKjZQ5tI1sEUIKr3Uoks0XUl45St3FlatVqef9prk4jRDzhW6WZg+3bk93y6pLjA== - -style-loader@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/style-loader/-/style-loader-2.0.0.tgz#9669602fd4690740eaaec137799a03addbbc393c" - integrity sha512-Z0gYUJmzZ6ZdRUqpg1r8GsaFKypE+3xAzuFeMuoHgjc9KZv3wMyCRjQIWEbhoFSq7+7yoHXySDJyyWQaPajeiQ== - dependencies: - loader-utils "^2.0.0" - schema-utils "^3.0.0" - -supports-color@^7.1.0: - version "7.2.0" - resolved "https://registry.yarnpkg.com/supports-color/-/supports-color-7.2.0.tgz#1b7dcdcb32b8138801b3e478ba6a51caa89648da" - integrity 
sha512-qpCAvRl9stuOHveKsn7HncJRvv501qIacKzQlO/+Lwxc9+0q2wLyv4Dfvt80/DPn2pqOBsJdDiogXGR9+OvwRw== - dependencies: - has-flag "^4.0.0" - -supports-color@^8.0.0: - version "8.1.1" - resolved "https://registry.yarnpkg.com/supports-color/-/supports-color-8.1.1.tgz#cd6fc17e28500cff56c1b86c0a7fd4a54a73005c" - integrity sha512-MpUEN2OodtUzxvKQl72cUF7RQ5EiHsGvSsVG0ia9c5RbWGL2CI4C7EpPS8UTBIplnlzZiNuV56w+FuNxy3ty2Q== - dependencies: - has-flag "^4.0.0" - -supports-preserve-symlinks-flag@^1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/supports-preserve-symlinks-flag/-/supports-preserve-symlinks-flag-1.0.0.tgz#6eda4bd344a3c94aea376d4cc31bc77311039e09" - integrity sha512-ot0WnXS9fgdkgIcePe6RHNk1WA8+muPa6cSjeR3V8K27q9BB1rTE3R1p7Hv0z1ZyAc8s6Vvv8DIyWf681MAt0w== - -tapable@^1.0.0: - version "1.1.3" - resolved "https://registry.yarnpkg.com/tapable/-/tapable-1.1.3.tgz#a1fccc06b58db61fd7a45da2da44f5f3a3e67ba2" - integrity sha512-4WK/bYZmj8xLr+HUCODHGF1ZFzsYffasLUgEiMBY4fgtltdO6B4WJtlSbPaDTLpYTcGVwM2qLnFTICEcNxs3kA== - -tapable@^2.0.0, tapable@^2.1.1, tapable@^2.2.0: - version "2.2.1" - resolved "https://registry.yarnpkg.com/tapable/-/tapable-2.2.1.tgz#1967a73ef4060a82f12ab96af86d52fdb76eeca0" - integrity sha512-GNzQvQTOIP6RyTfE2Qxb8ZVlNmw0n88vp1szwWRimP02mnTsx3Wtn5qRdqY9w2XduFNUgvOwhNnQsjwCp+kqaQ== - -terser-webpack-plugin@^5.1.3: - version "5.3.1" - resolved "https://registry.yarnpkg.com/terser-webpack-plugin/-/terser-webpack-plugin-5.3.1.tgz#0320dcc270ad5372c1e8993fabbd927929773e54" - integrity sha512-GvlZdT6wPQKbDNW/GDQzZFg/j4vKU96yl2q6mcUkzKOgW4gwf1Z8cZToUCrz31XHlPWH8MVb1r2tFtdDtTGJ7g== - dependencies: - jest-worker "^27.4.5" - schema-utils "^3.1.1" - serialize-javascript "^6.0.0" - source-map "^0.6.1" - terser "^5.7.2" - -terser@^5.10.0, terser@^5.7.2: - version "5.12.0" - resolved "https://registry.yarnpkg.com/terser/-/terser-5.12.0.tgz#728c6bff05f7d1dcb687d8eace0644802a9dae8a" - integrity 
sha512-R3AUhNBGWiFc77HXag+1fXpAxTAFRQTJemlJKjAgD9r8xXTpjNKqIXwHM/o7Rh+O0kUJtS3WQVdBeMKFk5sw9A== - dependencies: - acorn "^8.5.0" - commander "^2.20.0" - source-map "~0.7.2" - source-map-support "~0.5.20" - -thunky@^1.0.2: - version "1.1.0" - resolved "https://registry.yarnpkg.com/thunky/-/thunky-1.1.0.tgz#5abaf714a9405db0504732bbccd2cedd9ef9537d" - integrity sha512-eHY7nBftgThBqOyHGVN+l8gF0BucP09fMo0oO/Lb0w1OF80dJv+lDVpXG60WMQvkcxAkNybKsrEIE3ZtKGmPrA== - -tiny-warning@^1.0.2: - version "1.0.3" - resolved "https://registry.yarnpkg.com/tiny-warning/-/tiny-warning-1.0.3.tgz#94a30db453df4c643d0fd566060d60a875d84754" - integrity sha512-lBN9zLN/oAf68o3zNXYrdCt1kP8WsiGW8Oo2ka41b2IM5JL/S1CTyX1rW0mb/zSuJun0ZUrDxx4sqvYS2FWzPA== - -to-regex-range@^5.0.1: - version "5.0.1" - resolved "https://registry.yarnpkg.com/to-regex-range/-/to-regex-range-5.0.1.tgz#1648c44aae7c8d988a326018ed72f5b4dd0392e4" - integrity sha512-65P7iz6X5yEr1cwcgvQxbbIw7Uk3gOy5dIdtZ4rDveLqhrdJP+Li/Hx6tyK0NEb+2GCyneCMJiGqrADCSNk8sQ== - dependencies: - is-number "^7.0.0" - -toggle-selection@^1.0.6: - version "1.0.6" - resolved "https://registry.yarnpkg.com/toggle-selection/-/toggle-selection-1.0.6.tgz#6e45b1263f2017fa0acc7d89d78b15b8bf77da32" - integrity sha1-bkWxJj8gF/oKzH2J14sVuL932jI= - -toidentifier@1.0.1: - version "1.0.1" - resolved "https://registry.yarnpkg.com/toidentifier/-/toidentifier-1.0.1.tgz#3be34321a88a820ed1bd80dfaa33e479fbb8dd35" - integrity sha512-o5sSPKEkg/DIQNmH43V0/uerLrpzVedkUh8tGNvaeXpfpuwjKenlSox/2O/BTlZUtEe+JG7s5YhEz608PlAHRA== - -tr46@~0.0.3: - version "0.0.3" - resolved "https://registry.yarnpkg.com/tr46/-/tr46-0.0.3.tgz#8184fd347dac9cdc185992f3a6622e14b9d9ab6a" - integrity sha1-gYT9NH2snNwYWZLzpmIuFLnZq2o= - -ts-loader@^8.0.18: - version "8.3.0" - resolved "https://registry.yarnpkg.com/ts-loader/-/ts-loader-8.3.0.tgz#83360496d6f8004fab35825279132c93412edf33" - integrity sha512-MgGly4I6cStsJy27ViE32UoqxPTN9Xly4anxxVyaIWR+9BGxboV4EyJBGfR3RePV7Ksjj3rHmPZJeIt+7o4Vag== - dependencies: - 
chalk "^4.1.0" - enhanced-resolve "^4.0.0" - loader-utils "^2.0.0" - micromatch "^4.0.0" - semver "^7.3.4" - -tslib@^2.0.3: - version "2.3.1" - resolved "https://registry.yarnpkg.com/tslib/-/tslib-2.3.1.tgz#e8a335add5ceae51aa261d32a490158ef042ef01" - integrity sha512-77EbyPPpMz+FRFRuAFlWMtmgUWGe9UOG2Z25NqCwiIjRhOf5iKGuzSe5P2w1laq+FkRy4p+PCuVkJSGkzTEKVw== - -type-is@~1.6.18: - version "1.6.18" - resolved "https://registry.yarnpkg.com/type-is/-/type-is-1.6.18.tgz#4e552cd05df09467dcbc4ef739de89f2cf37c131" - integrity sha512-TkRKr9sUTxEH8MdfuCSP7VizJyzRNMjj2J2do2Jr3Kym598JVdEksuzPQCnlFPW4ky9Q+iA+ma9BGm06XQBy8g== - dependencies: - media-typer "0.3.0" - mime-types "~2.1.24" - -typescript@^4.0.3: - version "4.6.2" - resolved "https://registry.yarnpkg.com/typescript/-/typescript-4.6.2.tgz#fe12d2727b708f4eef40f51598b3398baa9611d4" - integrity sha512-HM/hFigTBHZhLXshn9sN37H085+hQGeJHJ/X7LpBWLID/fbc2acUMfU+lGD98X81sKP+pFa9f0DZmCwB9GnbAg== - -unpipe@1.0.0, unpipe@~1.0.0: - version "1.0.0" - resolved "https://registry.yarnpkg.com/unpipe/-/unpipe-1.0.0.tgz#b2bf4ee8514aae6165b4817829d21b2ef49904ec" - integrity sha1-sr9O6FFKrmFltIF4KdIbLvSZBOw= - -uri-js@^4.2.2: - version "4.4.1" - resolved "https://registry.yarnpkg.com/uri-js/-/uri-js-4.4.1.tgz#9b1a52595225859e55f669d928f88c6c57f2a77e" - integrity sha512-7rKUyy33Q1yc98pQ1DAmLtwX109F7TIfWlW1Ydo8Wl1ii1SeHieeh0HHfPeL2fMXK6z0s8ecKs9frCuLJvndBg== - dependencies: - punycode "^2.1.0" - -util-deprecate@^1.0.1, util-deprecate@^1.0.2, util-deprecate@~1.0.1: - version "1.0.2" - resolved "https://registry.yarnpkg.com/util-deprecate/-/util-deprecate-1.0.2.tgz#450d4dc9fa70de732762fbd2d4a28981419a0ccf" - integrity sha1-RQ1Nyfpw3nMnYvvS1KKJgUGaDM8= - -utila@~0.4: - version "0.4.0" - resolved "https://registry.yarnpkg.com/utila/-/utila-0.4.0.tgz#8a16a05d445657a3aea5eecc5b12a4fa5379772c" - integrity sha1-ihagXURWV6Oupe7MWxKk+lN5dyw= - -utils-merge@1.0.1: - version "1.0.1" - resolved 
"https://registry.yarnpkg.com/utils-merge/-/utils-merge-1.0.1.tgz#9f95710f50a267947b2ccc124741c1028427e713" - integrity sha1-n5VxD1CiZ5R7LMwSR0HBAoQn5xM= - -uuid@^8.3.2: - version "8.3.2" - resolved "https://registry.yarnpkg.com/uuid/-/uuid-8.3.2.tgz#80d5b5ced271bb9af6c445f21a1a04c606cefbe2" - integrity sha512-+NYs2QeMWy+GWFOEm9xnn6HCDp0l7QBD7ml8zLUmJ+93Q5NF0NocErnwkTkXVFNiX3/fpC6afS8Dhb/gz7R7eg== - -vary@~1.1.2: - version "1.1.2" - resolved "https://registry.yarnpkg.com/vary/-/vary-1.1.2.tgz#2299f02c6ded30d4a5961b0b9f74524a18f634fc" - integrity sha1-IpnwLG3tMNSllhsLn3RSShj2NPw= - -watchpack@^2.3.1: - version "2.3.1" - resolved "https://registry.yarnpkg.com/watchpack/-/watchpack-2.3.1.tgz#4200d9447b401156eeca7767ee610f8809bc9d25" - integrity sha512-x0t0JuydIo8qCNctdDrn1OzH/qDzk2+rdCOC3YzumZ42fiMqmQ7T3xQurykYMhYfHaPHTp4ZxAx2NfUo1K6QaA== - dependencies: - glob-to-regexp "^0.4.1" - graceful-fs "^4.1.2" - -wbuf@^1.1.0, wbuf@^1.7.3: - version "1.7.3" - resolved "https://registry.yarnpkg.com/wbuf/-/wbuf-1.7.3.tgz#c1d8d149316d3ea852848895cb6a0bfe887b87df" - integrity sha512-O84QOnr0icsbFGLS0O3bI5FswxzRr8/gHwWkDlQFskhSPryQXvrTMxjxGP4+iWYoauLoBvfDpkrOauZ+0iZpDA== - dependencies: - minimalistic-assert "^1.0.0" - -webidl-conversions@^3.0.0: - version "3.0.1" - resolved "https://registry.yarnpkg.com/webidl-conversions/-/webidl-conversions-3.0.1.tgz#24534275e2a7bc6be7bc86611cc16ae0a5654871" - integrity sha1-JFNCdeKnvGvnvIZhHMFq4KVlSHE= - -webpack-cli@^4.5.0: - version "4.9.2" - resolved "https://registry.yarnpkg.com/webpack-cli/-/webpack-cli-4.9.2.tgz#77c1adaea020c3f9e2db8aad8ea78d235c83659d" - integrity sha512-m3/AACnBBzK/kMTcxWHcZFPrw/eQuY4Df1TxvIWfWM2x7mRqBQCqKEd96oCUa9jkapLBaFfRce33eGDb4Pr7YQ== - dependencies: - "@discoveryjs/json-ext" "^0.5.0" - "@webpack-cli/configtest" "^1.1.1" - "@webpack-cli/info" "^1.4.1" - "@webpack-cli/serve" "^1.6.1" - colorette "^2.0.14" - commander "^7.0.0" - execa "^5.0.0" - fastest-levenshtein "^1.0.12" - import-local "^3.0.2" - interpret 
"^2.2.0" - rechoir "^0.7.0" - webpack-merge "^5.7.3" - -webpack-dev-middleware@^5.3.1: - version "5.3.1" - resolved "https://registry.yarnpkg.com/webpack-dev-middleware/-/webpack-dev-middleware-5.3.1.tgz#aa079a8dedd7e58bfeab358a9af7dab304cee57f" - integrity sha512-81EujCKkyles2wphtdrnPg/QqegC/AtqNH//mQkBYSMqwFVCQrxM6ktB2O/SPlZy7LqeEfTbV3cZARGQz6umhg== - dependencies: - colorette "^2.0.10" - memfs "^3.4.1" - mime-types "^2.1.31" - range-parser "^1.2.1" - schema-utils "^4.0.0" - -webpack-dev-server@^4.7.4: - version "4.7.4" - resolved "https://registry.yarnpkg.com/webpack-dev-server/-/webpack-dev-server-4.7.4.tgz#d0ef7da78224578384e795ac228d8efb63d5f945" - integrity sha512-nfdsb02Zi2qzkNmgtZjkrMOcXnYZ6FLKcQwpxT7MvmHKc+oTtDsBju8j+NMyAygZ9GW1jMEUpy3itHtqgEhe1A== - dependencies: - "@types/bonjour" "^3.5.9" - "@types/connect-history-api-fallback" "^1.3.5" - "@types/express" "^4.17.13" - "@types/serve-index" "^1.9.1" - "@types/sockjs" "^0.3.33" - "@types/ws" "^8.2.2" - ansi-html-community "^0.0.8" - bonjour "^3.5.0" - chokidar "^3.5.3" - colorette "^2.0.10" - compression "^1.7.4" - connect-history-api-fallback "^1.6.0" - default-gateway "^6.0.3" - del "^6.0.0" - express "^4.17.1" - graceful-fs "^4.2.6" - html-entities "^2.3.2" - http-proxy-middleware "^2.0.0" - ipaddr.js "^2.0.1" - open "^8.0.9" - p-retry "^4.5.0" - portfinder "^1.0.28" - schema-utils "^4.0.0" - selfsigned "^2.0.0" - serve-index "^1.9.1" - sockjs "^0.3.21" - spdy "^4.0.2" - strip-ansi "^7.0.0" - webpack-dev-middleware "^5.3.1" - ws "^8.4.2" - -webpack-merge@^5.7.3: - version "5.8.0" - resolved "https://registry.yarnpkg.com/webpack-merge/-/webpack-merge-5.8.0.tgz#2b39dbf22af87776ad744c390223731d30a68f61" - integrity sha512-/SaI7xY0831XwP6kzuwhKWVKDP9t1QY1h65lAFLbZqMPIuYcD9QAW4u9STIbU9kaJbPBB/geU/gLr1wDjOhQ+Q== - dependencies: - clone-deep "^4.0.1" - wildcard "^2.0.0" - -webpack-sources@^3.2.3: - version "3.2.3" - resolved 
"https://registry.yarnpkg.com/webpack-sources/-/webpack-sources-3.2.3.tgz#2d4daab8451fd4b240cc27055ff6a0c2ccea0cde" - integrity sha512-/DyMEOrDgLKKIG0fmvtz+4dUX/3Ghozwgm6iPp8KRhvn+eQf9+Q7GWxVNMk3+uCPWfdXYC4ExGBckIXdFEfH1w== - -webpack@^5.28.0: - version "5.70.0" - resolved "https://registry.yarnpkg.com/webpack/-/webpack-5.70.0.tgz#3461e6287a72b5e6e2f4872700bc8de0d7500e6d" - integrity sha512-ZMWWy8CeuTTjCxbeaQI21xSswseF2oNOwc70QSKNePvmxE7XW36i7vpBMYZFAUHPwQiEbNGCEYIOOlyRbdGmxw== - dependencies: - "@types/eslint-scope" "^3.7.3" - "@types/estree" "^0.0.51" - "@webassemblyjs/ast" "1.11.1" - "@webassemblyjs/wasm-edit" "1.11.1" - "@webassemblyjs/wasm-parser" "1.11.1" - acorn "^8.4.1" - acorn-import-assertions "^1.7.6" - browserslist "^4.14.5" - chrome-trace-event "^1.0.2" - enhanced-resolve "^5.9.2" - es-module-lexer "^0.9.0" - eslint-scope "5.1.1" - events "^3.2.0" - glob-to-regexp "^0.4.1" - graceful-fs "^4.2.9" - json-parse-better-errors "^1.0.2" - loader-runner "^4.2.0" - mime-types "^2.1.27" - neo-async "^2.6.2" - schema-utils "^3.1.0" - tapable "^2.1.1" - terser-webpack-plugin "^5.1.3" - watchpack "^2.3.1" - webpack-sources "^3.2.3" - -websocket-driver@>=0.5.1, websocket-driver@^0.7.4: - version "0.7.4" - resolved "https://registry.yarnpkg.com/websocket-driver/-/websocket-driver-0.7.4.tgz#89ad5295bbf64b480abcba31e4953aca706f5760" - integrity sha512-b17KeDIQVjvb0ssuSDF2cYXSg2iztliJ4B9WdsuB6J952qCPKmnVq4DyW5motImXHDC1cBT/1UezrJVsKw5zjg== - dependencies: - http-parser-js ">=0.5.1" - safe-buffer ">=5.1.0" - websocket-extensions ">=0.1.1" - -websocket-extensions@>=0.1.1: - version "0.1.4" - resolved "https://registry.yarnpkg.com/websocket-extensions/-/websocket-extensions-0.1.4.tgz#7f8473bc839dfd87608adb95d7eb075211578a42" - integrity sha512-OqedPIGOfsDlo31UNwYbCFMSaO9m9G/0faIHj5/dZFDMFqPTcx6UwqyOy3COEaEOg/9VsGIpdqn62W5KhoKSpg== - -whatwg-fetch@>=0.10.0: - version "3.6.2" - resolved 
"https://registry.yarnpkg.com/whatwg-fetch/-/whatwg-fetch-3.6.2.tgz#dced24f37f2624ed0281725d51d0e2e3fe677f8c" - integrity sha512-bJlen0FcuU/0EMLrdbJ7zOnW6ITZLrZMIarMUVmdKtsGvZna8vxKYaexICWPfZ8qwf9fzNq+UEIZrnSaApt6RA== - -whatwg-url@^5.0.0: - version "5.0.0" - resolved "https://registry.yarnpkg.com/whatwg-url/-/whatwg-url-5.0.0.tgz#966454e8765462e37644d3626f6742ce8b70965d" - integrity sha1-lmRU6HZUYuN2RNNib2dCzotwll0= - dependencies: - tr46 "~0.0.3" - webidl-conversions "^3.0.0" - -which@^2.0.1: - version "2.0.2" - resolved "https://registry.yarnpkg.com/which/-/which-2.0.2.tgz#7c6a8dd0a636a0327e10b59c9286eee93f3f51b1" - integrity sha512-BLI3Tl1TW3Pvl70l3yq3Y64i+awpwXqsGBYWkkqMtnbXgrMD+yj7rhW0kuEDxzJaYXGjEW5ogapKNMEKNMjibA== - dependencies: - isexe "^2.0.0" - -wildcard@^2.0.0: - version "2.0.0" - resolved "https://registry.yarnpkg.com/wildcard/-/wildcard-2.0.0.tgz#a77d20e5200c6faaac979e4b3aadc7b3dd7f8fec" - integrity sha512-JcKqAHLPxcdb9KM49dufGXn2x3ssnfjbcaQdLlfZsL9rH9wgDQjUtDxbo8NE0F6SFvydeu1VhZe7hZuHsB2/pw== - -wrappy@1: - version "1.0.2" - resolved "https://registry.yarnpkg.com/wrappy/-/wrappy-1.0.2.tgz#b5243d8f3ec1aa35f1364605bc0d1036e30ab69f" - integrity sha1-tSQ9jz7BqjXxNkYFvA0QNuMKtp8= - -ws@^8.4.2: - version "8.5.0" - resolved "https://registry.yarnpkg.com/ws/-/ws-8.5.0.tgz#bfb4be96600757fe5382de12c670dab984a1ed4f" - integrity sha512-BWX0SWVgLPzYwF8lTzEy1egjhS4S4OEAHfsO8o65WOVsrnSRGaSiUaa9e0ggGlkMTtBlmOpEXiie9RUcBO86qg== - -yallist@^4.0.0: - version "4.0.0" - resolved "https://registry.yarnpkg.com/yallist/-/yallist-4.0.0.tgz#9bb92790d9c0effec63be73519e11a35019a3a72" - integrity sha512-3wdGidZyq5PB084XLES5TpOSRA3wjXAlIWMhum2kRcv/41Sn2emQ0dycQW4uZXLejwKvg6EsvbdlVL+FYEct7A== diff --git a/plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_0/worker0.1623143089861.pt.trace.json.gz b/plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_0/worker0.1623143089861.pt.trace.json.gz similarity index 100% rename from 
plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_0/worker0.1623143089861.pt.trace.json.gz rename to plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_0/worker0.1623143089861.pt.trace.json.gz diff --git a/plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_0/worker0.1623143566756.pt.trace.json.gz b/plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_0/worker0.1623143566756.pt.trace.json.gz similarity index 100% rename from plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_0/worker0.1623143566756.pt.trace.json.gz rename to plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_0/worker0.1623143566756.pt.trace.json.gz diff --git a/plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_4/worker0.1623212756351.pt.trace.json.gz b/plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_4/worker0.1623212756351.pt.trace.json.gz similarity index 100% rename from plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_4/worker0.1623212756351.pt.trace.json.gz rename to plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_4/worker0.1623212756351.pt.trace.json.gz diff --git a/plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_4/worker0.1623213129365.pt.trace.json.gz b/plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_4/worker0.1623213129365.pt.trace.json.gz similarity index 100% rename from plugins/tensorboard-plugins/tb_plugin/samples/resnet50_num_workers_4/worker0.1623213129365.pt.trace.json.gz rename to plugins/tensorboard-plugins/tb_plugin/test/resources/resnet50_num_workers_4/worker0.1623213129365.pt.trace.json.gz diff --git a/plugins/tensorboard-plugins/tb_plugin/test/test_tensorboard_end2end.py b/plugins/tensorboard-plugins/tb_plugin/test/test_tensorboard_end2end.py index fae95b49050537b921e291a4771c63a6bff35690..46636d11801a739935b4f385c6ce548009d09916 
100644 --- a/plugins/tensorboard-plugins/tb_plugin/test/test_tensorboard_end2end.py +++ b/plugins/tensorboard-plugins/tb_plugin/test/test_tensorboard_end2end.py @@ -13,7 +13,7 @@ from urllib.error import HTTPError def get_samples_dir(): - return os.path.join(os.path.dirname(os.path.abspath(__file__)), '../samples') + return os.path.join(os.path.dirname(os.path.abspath(__file__)), 'resources') class TestEnd2End(unittest.TestCase): diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py index 2fbe631c1115a57168643bc95d35cf1c7e413383..f7b951e609e5c65895a6db82d391e8d584eb37c8 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py @@ -4,4 +4,4 @@ # Entry point for Pytorch TensorBoard plugin package. -__version__ = '0.4.0.9' +__version__ = '0.4.0.11' diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/azureblob.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/azureblob.py index b0ac49a655fd3d999ea80dfc3e6fa62e33fc5269..2fcd69fee8c24393458875635c17bd74a71b0fc4 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/azureblob.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/azureblob.py @@ -20,9 +20,9 @@ class AzureBlobSystem(RemotePath, BaseFileSystem): raise ImportError('azure-storage-blob must be installed for Azure Blob support.') self.connection_string = os.environ.get('AZURE_STORAGE_CONNECTION_STRING', None) - def exists(self, dirname): + def exists(self, filename): """Returns whether the path is a directory or not.""" - basename, parts = self.split_blob_path(dirname) + basename, parts = self.split_blob_path(filename) if basename is None or parts is None: return False if basename == '': @@ -31,10 +31,10 @@ class AzureBlobSystem(RemotePath, BaseFileSystem): else: return basename == parts[0] - def read(self, 
filename, binary_mode=False, size=None, continue_from=None): + def read(self, file, binary_mode=False, size=None, continue_from=None): """Reads contents of a file to a string.""" - logger.info('azure blob: starting reading file %s' % filename) - account, container, path = self.container_and_path(filename) + logger.info('azure blob: starting reading file %s' % file) + account, container, path = self.container_and_path(file) client = self.create_container_client(account, container) blob_client = client.get_blob_client(path) if not blob_client.exists(): @@ -47,7 +47,7 @@ class AzureBlobSystem(RemotePath, BaseFileSystem): continuation_token = downloader.size data = downloader.readall() - logger.info('azure blob: file %s download is done, size is %d' % (filename, len(data))) + logger.info('azure blob: file %s download is done, size is %d' % (file, len(data))) if binary_mode: return as_bytes(data), continuation_token else: @@ -122,7 +122,7 @@ class AzureBlobSystem(RemotePath, BaseFileSystem): items.append(item) return items - def makedirs(self, dirname): + def makedirs(self, path): """No need create directory since the upload blob will automatically create""" pass diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/file.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/file.py index 104146af464d301e0920eeae5794738ce8033bdf..9ef5d8485264f18426c18147663f2e1b9fb6900e 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/file.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/file.py @@ -16,6 +16,7 @@ The following functionalities are added after forking: import glob as py_glob import os import platform +import sys import tempfile from .. 
import utils @@ -25,24 +26,23 @@ from ..consts import MAX_FILE_SIZE, MAX_WINDOWS_PATH_LENGTH, MAX_LINUX_PATH_LENG logger = utils.get_logger() +S3_ENABLED = True try: import boto3 import botocore.exceptions - - S3_ENABLED = True except ImportError: S3_ENABLED = False +BLOB_ENABLED = True try: from azure.storage.blob import ContainerClient - BLOB_ENABLED = True except ImportError: BLOB_ENABLED = False +GS_ENABLED = True try: # Imports the Google Cloud client library from google.cloud import storage - GS_ENABLED = True except ImportError: GS_ENABLED = False @@ -95,16 +95,16 @@ class LocalFileSystem(LocalPath, BaseFileSystem): def exists(self, filename): return os.path.exists(filename) - def read(self, filename, binary_mode=False, size=None, continue_from=None): + def read(self, file, binary_mode=False, size=None, continue_from=None): mode = "rb" if binary_mode else "r" encoding = None if binary_mode else "utf8" - if not self.exists(filename): - raise FileNotFoundError(filename) + if not self.exists(file): + raise FileNotFoundError(file) offset = None if continue_from is not None: offset = continue_from.get("opaque_offset", None) - with open(filename, mode, encoding=encoding) as f: + with open(file, mode, encoding=encoding) as f: if offset is not None: f.seek(offset) data = f.read(size) @@ -200,10 +200,10 @@ class S3FileSystem(RemotePath, BaseFileSystem): return True return False - def read(self, filename, binary_mode=False, size=None, continue_from=None): + def read(self, file, binary_mode=False, size=None, continue_from=None): """Reads contents of a file to a string.""" s3 = boto3.resource("s3", endpoint_url=self._s3_endpoint) - bucket, path = self.bucket_and_path(filename) + bucket, path = self.bucket_and_path(file) args = {} # S3 use continuation tokens of the form: {byte_offset: number} @@ -218,7 +218,7 @@ class S3FileSystem(RemotePath, BaseFileSystem): if offset != 0 or endpoint != "": args["Range"] = "bytes={}-{}".format(offset, endpoint) - logger.info("s3: 
starting reading file %s" % filename) + logger.info("s3: starting reading file %s" % file) try: stream = s3.Object(bucket, path).get(**args)["Body"].read() except botocore.exceptions.ClientError as exc: @@ -240,7 +240,7 @@ class S3FileSystem(RemotePath, BaseFileSystem): raise logger.info("s3: file %s download is done, size is %d" % - (filename, len(stream))) + (file, len(stream))) # `stream` should contain raw bytes here (i.e., there has been neither decoding nor newline translation), # so the byte offset increases by the expected amount. continuation_token = {"byte_offset": (offset + len(stream))} @@ -320,14 +320,14 @@ class S3FileSystem(RemotePath, BaseFileSystem): keys.append(key) return keys - def makedirs(self, dirname): + def makedirs(self, path): """Creates a directory and all parent/intermediate directories.""" - if not self.exists(dirname): + if not self.exists(path): client = boto3.client("s3", endpoint_url=self._s3_endpoint) - bucket, path = self.bucket_and_path(dirname) - if not path.endswith("/"): - path += "/" - client.put_object(Body="", Bucket=bucket, Key=path) + bucket, dir_path = self.bucket_and_path(path) + if not dir_path.endswith("/"): + dir_path += "/" + client.put_object(Body="", Bucket=bucket, Key=dir_path) def stat(self, filename): """Returns file statistics for a given path.""" @@ -465,7 +465,7 @@ class File(object): if line and (line[-1] == "\n" or not self.buff): return line if not self.buff: - raise StopIteration() + return None else: index = self.buff.find("\n", self.buff_offset) if index != -1: @@ -480,7 +480,7 @@ class File(object): if line and (line[-1] == "\n" or not self.buff): return line if not self.buff: - raise StopIteration() + return None def next(self): return self.__next__() diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/gs.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/gs.py index d3a46877326b12a5e8be49a65cf4c90be8157311..8596bce2b892b7188155d05330a6356a83323eff 100644 --- 
a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/gs.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/io/gs.py @@ -16,14 +16,14 @@ class GoogleBlobSystem(RemotePath, BaseFileSystem): if not storage: raise ImportError('google-cloud-storage must be installed for Google Cloud Blob support.') - def exists(self, dirname): + def exists(self, filename): """Returns whether the path is a directory or not.""" - bucket_name, path = self.bucket_and_path(dirname) + bucket_name, path = self.bucket_and_path(filename) client = self.create_google_cloud_client() bucket = client.bucket(bucket_name) return bucket.blob(path).exists() - def read(self, filename, binary_mode=False, size=None, continue_from=None): + def read(self, file, binary_mode=False, size=None, continue_from=None): raise NotImplementedError def write(self, filename, file_content, binary_mode=False): @@ -62,7 +62,7 @@ class GoogleBlobSystem(RemotePath, BaseFileSystem): items.append(item) return items - def makedirs(self, dirname): + def makedirs(self, path): """No need create directory since the upload blob will automatically create""" pass diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/plugin.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/plugin.py index 4c7f36c34bb0cb6c0191962e9c956b2a8d41028a..2651f87c087a419c950f93b201606e7601a33a08 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/plugin.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/plugin.py @@ -46,6 +46,7 @@ def decorate_headers(func): headers = func(*args, **kwargs) headers.extend(TorchProfilerPlugin.headers) return headers + return wrapper @@ -343,14 +344,23 @@ class TorchProfilerPlugin(base_plugin.TBPlugin): end_ts = float(end_ts) for key in operator_memory_events: if start_ts is not None and end_ts is not None: - operator_memory_events[key] = [i for i in operator_memory_events[key] if - i[2] and start_ts <= i[2] <= end_ts] + operator_memory_events[key] = [ + 
i + for i in operator_memory_events[key] + if i[2] and start_ts <= i[2] <= end_ts + ] elif start_ts is not None: - operator_memory_events[key] = [i for i in operator_memory_events[key] if - i[2] and start_ts <= i[2]] + operator_memory_events[key] = [ + i + for i in operator_memory_events[key] + if i[2] and start_ts <= i[2] + ] elif end_ts is not None: - operator_memory_events[key] = [i for i in operator_memory_events[key] if - i[2] and end_ts >= i[2]] + operator_memory_events[key] = [ + i + for i in operator_memory_events[key] + if i[2] and end_ts >= i[2] + ] return self.respond_as_json(temp_memory_events, True) else: if start_ts is not None: @@ -472,9 +482,8 @@ class TorchProfilerPlugin(base_plugin.TBPlugin): def _monitor_runs(self): logger.info('Monitor runs begin') - + touched = set() try: - touched = set() while True: try: logger.debug('Scan run dir') diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/__init__.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/__init__.py index 9ca062abf58245753361a96890a2ee1ccdec42fb..59a0e64155546ce75e1c4607cf35c3144a28271f 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/__init__.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/__init__.py @@ -1,7 +1,6 @@ # ------------------------------------------------------------------------- # Copyright (c) Microsoft Corporation. All rights reserved. 
# -------------------------------------------------------------------------- +__all__ = ['RunLoader'] from .loader import RunLoader - -__all__ = ['RunLoader'] diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/communication.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/communication.py index 00f8dc98139d5bbb96daffb5989b9c3c660f2cbc..0afcdb11a66f89b8a448713bf140e3293db7e503 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/communication.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/communication.py @@ -59,7 +59,7 @@ def analyze_communication_nodes(comm_node_list: List[CommunicationNode])\ total_comm_stats[comm_node.name][0] += 1 bytes_one_value = 0 if comm_node.input_shape: - for i in range(len(comm_node.input_shape)): + for i, shape in enumerate(comm_node.input_shape): if comm_node.input_type[i] == 'long int': bytes_one_value = 8 elif comm_node.input_type[i] == 'float': @@ -76,7 +76,7 @@ def analyze_communication_nodes(comm_node_list: List[CommunicationNode])\ logger.warning('Found an unknown tensor type: {}'.format(comm_node.input_type[i])) bytes_one_value = 0 total_size = 1 - for size in comm_node.input_shape[i]: + for size in shape: total_size *= size total_comm_stats[comm_node.name][1] += total_size * bytes_one_value total_comm_stats[comm_node.name][2].extend(comm_node.kernel_ranges) diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py index f368adbab20fd4498a0fb36a63292af21c88499f..00544e635340c556d5346fc307bb29913c08929c 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py @@ -260,10 +260,10 @@ class RunProfileData(object): try: trace_json = json.loads(fout.getvalue()) logger.warning('Get JSONDecodeError: %s, Re-encode it to temp file' % 
e.msg) - json_reencode = True except JSONDecodeError: logger.error(f'File "{trace_path}" is not in a legal JSON format and will be skipped.') return trace_path, {} + json_reencode = True # work-around to remove the 'Record Window End' events to avoid the huge end timestamp if device_target == 'Ascend': diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/diffrun/tree.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/diffrun/tree.py index a164bd3d37390ba367f0d504910e45050227ffbf..c5cf5fad448122c74db46467cb0c70b8ce4f727e 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/diffrun/tree.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/diffrun/tree.py @@ -56,8 +56,9 @@ class DiffNode: def compare_operator_nodes( left_nodes: List[OperatorNode], right_nodes: List[OperatorNode]) -> Generator['DiffNode', None, None]: - '''Given two OperatorNode lists, find the DataLoader/Module/Backward/Optimizer node and create the child list DiffNode - ''' + """Given two OperatorNode lists, find the DataLoader/Module/Backward/Optimizer node and + create the child list DiffNode + """ right_keys = [(type(r), r.name) for r in right_nodes] # find matching points in the two list diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py index 3cd7ce9ff662a152cc9e1e4150bfe4d762e7a691..9b364e0dbba55e07b939690d45123bbf6dc6fe23 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py @@ -3,6 +3,7 @@ # ------------------------------------------------------------------------- import sys from collections import defaultdict +from dataclasses import dataclass from enum import IntEnum from typing import Dict, Iterable, List, Optional, Tuple @@ -31,11 +32,19 @@ class 
ProfileRole(IntEnum): Total = 8 +@dataclass +class NodeInfoParams: + event: DurationEvent + corrid_to_device: Dict[int, List[DeviceNode]] + corrid_to_runtime: Dict[int, RuntimeNode] + externalid_to_runtime: Dict[int, List[RuntimeNode]] + tid2list: Dict[int, List[OperatorNode]] + pl_tid2list: Dict[int, List[PLProfileNode]] + tid2zero_rt_list: Dict[int, List[RuntimeNode]] + + class NodeParserMixin: def __init__(self, *args, **kwargs): - """Please refer to https://stackoverflow.com/questions/9575409/calling-parent-class-init-with-multiple-inheritance-whats-the-right-way # noqa: E501 - to see the reason why we need call super().__init__ like this way - """ super().__init__(*args, **kwargs) self.communication_data: Dict[int, CommunicationNode] = {} @@ -68,14 +77,9 @@ class NodeParserMixin: for event in events: if event.type == EventTypes.MEMORY: continue - self._parse_node( - event, - corrid_to_device, - corrid_to_runtime, - externalid_to_runtime, - tid2list, - pl_tid2list, - tid2zero_rt_list) + params = NodeInfoParams(event, corrid_to_device, corrid_to_runtime, externalid_to_runtime, tid2list, + pl_tid2list, tid2zero_rt_list) + self._parse_node(params) if CommLibTypes.Nccl in self.comm_lib: for event in events: @@ -116,14 +120,14 @@ class NodeParserMixin: return comm_node is not None - def _parse_node(self, - event: DurationEvent, - corrid_to_device: Dict[int, List[DeviceNode]], - corrid_to_runtime: Dict[int, RuntimeNode], - externalid_to_runtime: Dict[int, List[RuntimeNode]], - tid2list: Dict[int, List[OperatorNode]], - pl_tid2list: Dict[int, List[PLProfileNode]], - tid2zero_rt_list: Dict[int, List[RuntimeNode]]): + def _parse_node(self, params: NodeInfoParams): + event = params.event + corrid_to_device = params.corrid_to_device + corrid_to_runtime = params.corrid_to_runtime + externalid_to_runtime = params.externalid_to_runtime + tid2list = params.tid2list + pl_tid2list = params.pl_tid2list + tid2zero_rt_list = params.tid2zero_rt_list corrid = event.correlation_id 
tid = event.tid if event.type in [EventTypes.KERNEL, EventTypes.MEMCPY, EventTypes.MEMSET]: @@ -226,8 +230,8 @@ class StepParser: self.steps.append((self.cpu_min_ts, self.cpu_max_ts)) self.steps_names.append('0') - for i in range(len(self.role_ranges)): - self.role_ranges[i] = merge_ranges(self.role_ranges[i]) + for i, role_range in enumerate(self.role_ranges): + self.role_ranges[i] = merge_ranges(role_range) def update_device_steps(self, runtime_node_list: List[RuntimeNode]): self._update_steps_duration(*self._find_device_steps(runtime_node_list)) @@ -362,9 +366,9 @@ class StepParser: # Change step time to device side on the condition that any step have device time. is_use_gpu = prev_step_end_time is not None if is_use_gpu: - for i_step in range(len(self.steps)): - step_start_time = max(prev_step_end_time, self.steps[i_step][0]) - step_end_time = self.steps[i_step][1] + for i_step, step in enumerate(self.steps): + step_start_time = max(prev_step_end_time, step[0]) + step_end_time = step[1] if steps_device[i_step][0] == sys.maxsize: # When step i_step has no device event. # Assign to step_start_time when kernel is behind host step end. 
step_end_time = max(step_end_time, step_start_time) @@ -402,7 +406,7 @@ class StepParser: class EventParser(NodeParserMixin, StepParser): def __init__(self): super().__init__() - self.comm_node_list: Dict[CommunicationNode] = None + self.comm_node_list: List[CommunicationNode] = None def parse(self, events: Iterable[BaseEvent], fwd_bwd_map: Dict[int, int]) -> Dict[int, List[OperatorNode]]: with utils.timing('EventParser: parse nodes'): @@ -439,10 +443,10 @@ class EventParser(NodeParserMixin, StepParser): header = f'[{ctx.tid}]' + '.'.join(ctx.name_stack[1:]) # omit the CallTreeRoot prefix_len = len(ctx.name_stack) * 4 - 4 - 1 if len(ctx.name_stack) > 1: - print(header) + logger.info(header) prefix = ' ' * prefix_len - print(prefix, node.name) - print(prefix, 'time:', node.start_time, '-->', node.end_time) + logger.info(prefix, node.name) + logger.info(prefix, 'time:', node.start_time, '-->', node.end_time) def push(node: OperatorNode): ctx.name_stack.append(node.name) diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/kernel_parser.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/kernel_parser.py index 838fc38ce60619977c3e096791241d7fc697562d..229251e60a90d5bf4fed514d5f175199b92d3870 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/kernel_parser.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/kernel_parser.py @@ -6,7 +6,7 @@ from typing import Optional import numpy as np import pandas as pd -from .tensor_core import TC_Allowlist +from .tensor_core import TcAllowlist from .trace import EventTypes @@ -19,7 +19,7 @@ class KernelParser: events = [vars(event) for event in events if event.type == EventTypes.KERNEL] events = pd.DataFrame(events) events = events.astype({'type': 'category', 'name': 'string'}, copy=False) - events['tc_used'] = events['name'].map(lambda name: name in TC_Allowlist) + events['tc_used'] = events['name'].map(lambda name: name in TcAllowlist) def 
weighted_avg(x: pd.Series): try: diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/memory_parser.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/memory_parser.py index 766782be271240dabffc76bbc389d8659e601299..64b78127a4c7a5675e5b2f71877754c541dde94f 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/memory_parser.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/memory_parser.py @@ -25,7 +25,7 @@ class MemoryMetrics(IntEnum): class MemoryRecord: def __init__(self, scope: str, pid: int, tid: int, ts: int, device_type: DeviceType, device_id: int, - address: int, bytes: int, total_allocated: float, total_reserved: float): + address: int, record_bytes: int, total_allocated: float, total_reserved: float): self.scope = scope self.tid = tid self.pid = pid @@ -33,7 +33,7 @@ class MemoryRecord: self.device_type = device_type self.device_id = device_id self.addr = address - self.bytes = bytes + self.bytes = record_bytes self.total_allocated = total_allocated self.total_reserved = total_reserved self.op_name: Optional[str] = None @@ -132,7 +132,7 @@ class MemorySnapshot: for i in range(self_metric_length, metric_length): memory_metrics_keyed_by_node[node][device][i] += metrics[i] - for tid, root in tid2tree.items(): + for _, root in tid2tree.items(): for child in root.children: traverse_node_memory(child) @@ -217,7 +217,8 @@ class MemoryParser: """In the loop, one pass will process one record. The basic logic is: It will search from the node that last visited since both the records and tree is ordered already 1. it current node contains the records, then find the exactly child which just embrace it. - 2. otherwise, find the parent node and set the child_index, so that the parent node could continue from previous visited node. # noqa: E501 + 2. otherwise, find the parent node and set the child_index, so that the parent node could continue from + previous visited node. 
# noqa: E501 3. if there is not any node contains the records, then all remaining records will be ignored. """ record = records[record_index] diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/module_op.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/module_op.py index 061a503b411bb900c6a405c0b97c8a07dd986a00..15f1e4ef93a5234cdf6273f9830ac1a6f3aeaa41 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/module_op.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/module_op.py @@ -260,10 +260,3 @@ def get_module_tree(tid2tree: Dict[int, OperatorNode]): traverse_node(child, None) return modules - - -def dump_modules(level: int, modules: Iterable[Union[Module, ModuleNode]]): - """testing purpose""" - for module in modules: - print(f"{' ' * level}{module.name.replace('nn.Module: ', '')}_{module.module_id}") - dump_modules(level + 1, module.children) diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/node.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/node.py index 80860e53661e9a554de6fa9b09e6f13057fca8bb..0528491c28752b0358d79e27168d055546bd0310 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/node.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/node.py @@ -6,7 +6,7 @@ from abc import ABC from typing import List, Optional, Tuple from .. 
import utils -from .tensor_core import TC_Allowlist, TC_OP_Allowlist +from .tensor_core import TcAllowlist, TcOpAllowlist from .trace import (DurationEvent, EventTypes, KernelEvent, ModuleEvent, OperatorEvent, PLProfileEvent, NcclOpNameSet, GlooOpNameSet) @@ -16,12 +16,12 @@ ExcludeOpName = ['DataParallel.forward', 'DistributedDataParallel.forward'] class BaseNode(ABC): - def __init__(self, name: str, start_time: int, end_time: int, type: str, tid: int, + def __init__(self, name: str, start_time: int, end_time: int, node_type: str, tid: int, external_id: Optional[int] = None): self.name = name self.start_time = start_time self.end_time = end_time - self.type = type + self.type = node_type self.tid = tid self.external_id = external_id # For consistency check. @@ -31,7 +31,7 @@ class BaseNode(ABC): kwargs['name'] = event.name kwargs['start_time'] = event.ts kwargs['end_time'] = event.ts + event.duration - kwargs['type'] = event.type + kwargs['node_type'] = event.type kwargs['tid'] = event.tid external_id = getattr(event, 'external_id', None) @@ -84,15 +84,18 @@ class OperatorNode(HostNode): self.callstack = callstack self.self_host_duration = self_host_duration self.self_device_duration = self_device_duration - # self.parent_node = None - self.tc_eligible = self.name in TC_OP_Allowlist + self.tc_eligible = self.name in TcOpAllowlist self.tc_self_duration = 0 # Time of TC kernels launched by this op excluding its children operators. self.tc_total_duration = 0 # Time of TC kernels launched by this op including its children operators. 
def fill_stats(self): + def sort_key(x): + if x.start_time and x.end_time: + return x.start_time, -x.end_time + else: + return sys.maxsize, -sys.maxsize - 1 self.children.sort(key=lambda x: (x.start_time, -x.end_time)) - self.runtimes.sort(key=lambda x: (x.start_time, -x.end_time) - if x.start_time and x.end_time else (sys.maxsize, -sys.maxsize - 1)) + self.runtimes.sort(key=sort_key) for child in self.children: child.fill_stats() @@ -273,7 +276,7 @@ class DeviceNode(BaseNode): self.block = block self.regs_per_thread = regs_per_thread self.shared_memory = shared_memory - self.tc_used = self.name in TC_Allowlist + self.tc_used = self.name in TcAllowlist self.device_id = device_id @classmethod @@ -306,7 +309,7 @@ def create_operator_node(event: OperatorEvent): def is_operator_node(node: BaseNode): - return bool(type(node) is OperatorNode and node.type == EventTypes.OPERATOR and node.name not in ExcludeOpName + return bool(isinstance(node, OperatorNode) and node.type == EventTypes.OPERATOR and node.name not in ExcludeOpName and not node.name.startswith("Optimizer.")) # exclude Optimizer.zero_grad diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_agg.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_agg.py index 08a3f0d7061dc332a78ec97a6ff085bf1840a47d..d6fdb5903d368e02c4ddb9fc3f29f536696e2a2e 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_agg.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_agg.py @@ -49,7 +49,6 @@ def aggregate_ops(op_list: List[OperatorNode], agg.self_device_duration += op.self_device_duration agg.tc_self_duration += op.tc_self_duration agg.tc_total_duration += op.tc_total_duration - return agg agg_dicts: List[Dict[str, OperatorAgg]] = [{} for _ in range(len(keys_func))] for op in op_list: diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_tree.py 
b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_tree.py index 55e264617d835fb5bf94819b329fdbd2ee1c53f6..fe919b29ced02efcea862f5e83ab52704f3f0d09 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_tree.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/op_tree.py @@ -68,9 +68,10 @@ class OpTreeBuilder: if main_tid: # only append the staled device nodes into main thread self.main_tid = op_list[0].tid - root_node = self._build_tree_internal(op_list, zero_rt_list, tid, staled_device_nodes, is_ascend) + root_node = OpTreeBuilder._build_tree_internal(op_list, zero_rt_list, tid, staled_device_nodes, + is_ascend) else: - root_node = self._build_tree_internal(op_list, zero_rt_list, tid, [], is_ascend) + root_node = OpTreeBuilder._build_tree_internal(op_list, zero_rt_list, tid, [], is_ascend) tid2tree[int(tid)] = root_node return tid2tree @@ -83,7 +84,8 @@ class OpTreeBuilder: # there are multiple tids backward_tid = self._find_backward_tid() tid2len = { - tid: root.end_time - root.start_time for tid, root in self.tid2tree.items() + tid: root.end_time - root.start_time + for tid, root in self.tid2tree.items() if tid != backward_tid or backward_tid is None } # get the maximum length as the main thread @@ -97,7 +99,8 @@ class OpTreeBuilder: return None - def _build_tree_internal(self, host_node_list, zero_rt_list, tid, staled_device_nodes, is_ascend): + @staticmethod + def _build_tree_internal(host_node_list, zero_rt_list, tid, staled_device_nodes, is_ascend): """host_node_list: list of OperatorNode and ProfilerStepNode. 
zero_rt_list: list of RuntimeNode with external_id=0.""" @@ -110,7 +113,7 @@ class OpTreeBuilder: name='dummy', start_time=None, end_time=None, - type=EventTypes.RUNTIME, + node_type=EventTypes.RUNTIME, tid=0, device_nodes=staled_device_nodes)) dummpy_rt[0].fill_stats() @@ -119,7 +122,7 @@ class OpTreeBuilder: name='CallTreeRoot', start_time=-sys.maxsize - 1, end_time=sys.maxsize, - type=EventTypes.PYTHON, + node_type=EventTypes.PYTHON, tid=tid, runtimes=zero_rt_list + dummpy_rt) # Give the list of RuntimeNode with external_id=0 to root node. node_stack.append(root_node) @@ -130,7 +133,6 @@ class OpTreeBuilder: if node.end_time <= tail_node.end_time or ( is_ascend and math.isclose(node.end_time, tail_node.end_time, rel_tol=1)): tail_node.children.append(node) - # node.parent_node = weakref.ref(tail_node) node_stack.append(node) else: logger.error('Error in input data: ranges on the same thread should not intersect!' @@ -274,7 +276,7 @@ class OpTreeBuilder: if isinstance(node, ModuleNode): backward_node = BackwardNode(name=node.name + '.backward', start_time=None, end_time=None, - type='backward', tid=0) + node_type='backward', tid=0) if parent is None: result.append(backward_node) else: diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/overall_parser.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/overall_parser.py index e12fbfd1cc502accee83fb44c52b94f8253c64ce..c646a33b89a673e1738fd38704516df8bfdfaade 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/overall_parser.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/overall_parser.py @@ -23,8 +23,8 @@ class OverallParser(object): @classmethod def create_from_statistics(cls, statistics: 'OverallParser.Statistics', total_duration: int): costs = [0.] 
* len(ProfileRole) - for i in range(len(statistics.cost_ranges)): - costs[i] = get_ranges_sum(statistics.cost_ranges[i]) + for i, cost_range in enumerate(statistics.cost_ranges): + costs[i] = get_ranges_sum(cost_range) costs[ProfileRole.Total] = total_duration return cls(costs) @@ -58,8 +58,8 @@ class OverallParser(object): def intersection_with_step(self, step: Tuple[int, int]): cost_ranges: List[List[Tuple[int, int]]] = [] step = [step] - for range in self.cost_ranges: - cost_ranges.append(intersection_ranges_lists(step, range)) + for cost_range in self.cost_ranges: + cost_ranges.append(intersection_ranges_lists(step, cost_range)) return OverallParser.Statistics(cost_ranges) @@ -77,6 +77,9 @@ class OverallParser(object): def aggregate(self, steps: List[Tuple[int, int]], role_ranges: List[List[Tuple[int, int]]]): logger.debug('Overall, statistics') + if len(steps) <= 0: + logger.error('Invalid steps number of 0') + return global_stats = OverallParser.Statistics.create_from_range(steps, role_ranges) if role_ranges[ProfileRole.Kernel]: comm_comp_overlap = intersection_ranges_lists( @@ -89,7 +92,7 @@ class OverallParser(object): for i, step in enumerate(steps): steps_stat = global_stats.intersection_with_step(step) self.steps_costs.append(OverallParser.Costs.create_from_statistics(steps_stat, step[1] - step[0])) - for cost_index in range(len(self.avg_costs.costs)): + for cost_index, _ in enumerate(self.avg_costs.costs): self.avg_costs.costs[cost_index] += self.steps_costs[i].costs[cost_index] comm_costs = OverallParser.StepCommunicationCosts() @@ -107,5 +110,5 @@ class OverallParser(object): self.communication_overlap.append(comm_costs) valid_steps = len(steps) - for i in range(len(self.avg_costs.costs)): + for i, _ in enumerate(self.avg_costs.costs): self.avg_costs.costs[i] /= valid_steps diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/run_generator.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/run_generator.py 
index 87cf8f3ca730a0c464d76bddd6ebf547b566274d..111dc34e81031a33ff9e0a2c03b0375522de24cf 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/run_generator.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/run_generator.py @@ -2,7 +2,6 @@ # Copyright (c) Microsoft Corporation. All rights reserved. # # Copyright(c) 2023 Huawei Technologies. -# All rights reserved # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -49,6 +48,140 @@ class RunGenerator(object): self.component_curve_data = {} self.process_data = {} + @staticmethod + def check_overlap_data(title): + # csv: step / compute time / communication_not_overlap / overlap / communication / free time + length = len(title) + if length < 5: + return [] + key = ["computing", "overlapped", "communication(not overlapped)", "free"] + get_key = list() + for j in key: + for i in range(length): + if j == title[i]: + get_key.append(i) + if len(get_key) < 4: + return [] + return get_key + + @staticmethod + def get_table_head(name: str, input_shape: str, call_stack: str, value: list): + if name is None: + return {} + temp = { + 'name': name, 'calls': 0, 'host_self_duration': 0, + 'host_total_duration': 0, 'device_self_duration': 0, 'device_total_duration': 0, + 'tc_self_ratio': 0, 'tc_total_ratio': 0, 'tc_eligible': 'Yes' + } + if input_shape is not None: + temp['input_shape'] = input_shape + if call_stack is not None: + temp['call_stack'] = call_stack + else: + temp['has_call_stack'] = False + else: + if call_stack is not None: + temp['call_stack'] = call_stack + else: + temp['has_call_stack'] = False + for vl in iter(value): + if 'has_call_stack' in temp and vl[2]: + temp['has_call_stack'] = True + temp['calls'] += 1 + temp['host_self_duration'] = round(temp['host_self_duration'] + vl[3], 2) + temp['host_total_duration'] = round(temp['host_total_duration'] + vl[4], 2) + 
temp['device_self_duration'] = round(temp['device_self_duration'] + vl[5], 2) + temp['device_total_duration'] = round(temp['device_total_duration'] + vl[6], 2) + temp['tc_self_ratio'] = round(temp['tc_self_ratio'] + vl[7], 2) + temp['tc_total_ratio'] = round(temp['tc_total_ratio'] + vl[8], 2) + temp['tc_eligible'] = 'Yes' if temp['tc_self_ratio'] > 0 or temp['tc_total_ratio'] > 0 else 'No' + temp['tc_self_ratio'] = 0 if temp['device_self_duration'] == 0 \ + else round(temp['tc_self_ratio'] / temp['device_self_duration'] * 100, 2) + temp['tc_total_ratio'] = 0 if temp['device_total_duration'] == 0 \ + else round(temp['tc_total_ratio'] / temp['device_total_duration'] * 100, 2) + return temp + + @staticmethod + def get_wait_table_by_ops(op, ops): + total_trans = 0 + total_synchronize = 0 + for key, data in op.items(): + if str(key) == "Total Op Info" and data.get("Communication Time Info"): + total_trans += float(data.get("Communication Time Info").get("Transit Time(ms)")) + total_synchronize += float(data.get("Communication Time Info").get("Synchronization Time(ms)")) + continue + k = re.sub(r'[0-9]+', ' ', key).split(" ")[0] + if k not in ops: + ops[k] = [0, 0, 0, 0] + ops[k][0] += 1 + for _, band in data.get("Communication Bandwidth Info").items(): + ops[k][1] += float(band.get("Transit Size(MB)")) + if data.get("Communication Time Info") is not None: + ops[k][2] += data.get("Communication Time Info").get("Elapse Time(ms)") + ops[k][3] += data.get("Communication Time Info").get("Transit Time(ms)") + return total_trans, total_synchronize + + @staticmethod + def trans_shape(shape: str): + result = list() + if ';' not in shape: + result.append('[' + shape.strip() + ']') + return '[' + ', '.join(result) + ']' + if len(shape.strip()) <= 1: + result.append('[]') + return '[' + ', '.join(result) + ']' + shape_spl = shape.split("\n") + for shape_div in iter(shape_spl): + result.append('[' + str(shape_div.replace(';', '')) + ']') + return '[' + ', '.join(result) + ']' + + 
@staticmethod + def get_process_peaks_and_devices_type(process_data: dict, memory_metric: str): + devices_type = [] + peaks = {} + for device in process_data: + devices_type.append(device) + reserved_list = process_data.get(device).get('Allocated') + if reserved_list is not None: + max_reserved = 0 + for array_value in reserved_list: + max_reserved = max(array_value[1], max_reserved) + peaks[device] = f'Peak Memory Usage: {max_reserved:.1f}{memory_metric}' + return devices_type, peaks + + @staticmethod + def get_pta_ge_peaks_and_devices_type(process_data: dict, memory_metric: str): + devices_type = [] + peaks = {} + for device in process_data: + devices_type.append(device) + peaks[device] = 'Reserved Peak Memory Usage:' + for component in process_data.get(device): + max_reserved = 0 + for array_value in process_data.get(device).get(component): + max_reserved = max(array_value[2], max_reserved) + peaks[device] += f' {component}-{max_reserved:.1f}{memory_metric} |' + return devices_type, peaks + + @staticmethod + def check_csv_columns(columns: list, column_idxs: dict): + column_exist_count = 0 + for idx, column in enumerate(columns): + if column in column_idxs: + column_idxs[column] = idx + column_exist_count += 1 + return column_idxs.values(), column_exist_count + + @staticmethod + def get_csv_data(path: str): + if path is None: + return [] + datas = [] + with open(path, encoding='utf-8-sig') as f: + for row in csv.reader(f, skipinitialspace=True): + datas.append(row) + return datas + def generate_run_profile(self): profile_run = RunProfile(self.worker, self.span) profile_run.is_pytorch_lightning = self.profile_data.is_pytorch_lightning @@ -85,7 +218,7 @@ class RunGenerator(object): profile_run.gpu_metrics = self.profile_data.gpu_metrics_parser.get_gpu_metrics() - gpu_infos = {gpu_id: RunGenerator._get_gpu_info(self.profile_data.device_props, gpu_id) + gpu_infos = {gpu_id: RunGenerator.get_gpu_info(self.profile_data.device_props, gpu_id) for gpu_id in 
self.profile_data.gpu_metrics_parser.gpu_ids} gpu_infos = {gpu_id: gpu_info for gpu_id, gpu_info in gpu_infos.items() if gpu_info is not None} @@ -140,11 +273,11 @@ class RunGenerator(object): def _npu_get_overlap(self): path = self.profile_data.distributed_csv_path overlap_by_steps: Dict[str, List[float]] = OrderedDict() - data = RunGenerator._get_csv_data(path) + data = RunGenerator.get_csv_data(path) if len(data) <= 1: return overlap_by_steps title = [x.lower() for x in data[0]] - title_name = RunGenerator._check_overlap_data(title) + title_name = RunGenerator.check_overlap_data(title) if not title_name: logger.error(f"Incomplete content of CSV file {path}.") return overlap_by_steps @@ -166,22 +299,6 @@ class RunGenerator(object): logger.error(f'File "{path}" has wrong data format in row {idx + 2} and will skip it.') return overlap_by_steps - @staticmethod - def _check_overlap_data(title): - # csv: step / compute time / communication_not_overlap / overlap / communication / free time - length = len(title) - if length < 5: - return [] - key = ["computing", "overlapped", "communication(not overlapped)", "free"] - get_key = list() - for j in key: - for i in range(length): - if j == title[i]: - get_key.append(i) - if len(get_key) < 4: - return [] - return get_key - def _npu_get_wait_table(self): path = self.profile_data.communication_json_path if not io.exists(path): @@ -216,9 +333,9 @@ class RunGenerator(object): collection_ops = data.get("collective") p2p_ops = data.get("p2p") try: - coll_total_trans, coll_total_synchronize = RunGenerator._get_wait_table_by_ops(collection_ops, - table_ops) - p2p_total_trans, p2p_total_synchronize = RunGenerator._get_wait_table_by_ops(p2p_ops, table_ops) + coll_total_trans, coll_total_synchronize = RunGenerator.get_wait_table_by_ops(collection_ops, + table_ops) + p2p_total_trans, p2p_total_synchronize = RunGenerator.get_wait_table_by_ops(p2p_ops, table_ops) except ValueError: logger.error(f'Time and size info must be number, please 
check file "{path}"') return wait_by_step, table_ops @@ -229,41 +346,21 @@ class RunGenerator(object): } return wait_by_step, table_ops - @staticmethod - def _get_wait_table_by_ops(op, ops): - total_trans = 0 - total_synchronize = 0 - for key, data in op.items(): - if str(key) == "Total Op Info" and data.get("Communication Time Info"): - total_trans += float(data.get("Communication Time Info").get("Transit Time(ms)")) - total_synchronize += float(data.get("Communication Time Info").get("Synchronization Time(ms)")) - continue - k = re.sub(r'[0-9]+', ' ', key).split(" ")[0] - if k not in ops: - ops[k] = [0, 0, 0, 0] - ops[k][0] += 1 - for _, band in data.get("Communication Bandwidth Info").items(): - ops[k][1] += float(band.get("Transit Size(MB)")) - if data.get("Communication Time Info") is not None: - ops[k][2] += data.get("Communication Time Info").get("Elapse Time(ms)") - ops[k][3] += data.get("Communication Time Info").get("Transit Time(ms)") - return total_trans, total_synchronize - def _get_operator_details_by_name(self): operator_by_name = defaultdict(list) operator_by_name_and_input_shapes = defaultdict(list) path = self.profile_data.operator_path - datas = RunGenerator._get_csv_data(path) + datas = RunGenerator.get_csv_data(path) if len(datas) <= 1: return operator_by_name, operator_by_name_and_input_shapes for idx, ls in enumerate(datas[1:]): try: temp: list = [ - ls[0], RunGenerator._trans_shape(str(ls[1])), ls[2], float(ls[3]), float(ls[4]), + ls[0], RunGenerator.trans_shape(str(ls[1])), ls[2], float(ls[3]), float(ls[4]), float(ls[5]), float(ls[6]), float(ls[7]), float(ls[8]) ] operator_by_name[ls[0]].append(temp) - key = "{}###{}".format(str(ls[0]), RunGenerator._trans_shape(str(ls[1]))) + key = "{}###{}".format(str(ls[0]), RunGenerator.trans_shape(str(ls[1]))) operator_by_name_and_input_shapes[key].append(temp) except (ValueError, IndexError): logger.error(f'File "{path}" has wrong data format in row {idx + 2} and will skip it.') @@ -313,9 +410,9 @@ 
class RunGenerator(object): if group_by_input_shape: name = name_key.split("###")[0] shape = name_key.split("###")[1] - result.append(RunGenerator._get_table_head(name, shape, None, values)) + result.append(RunGenerator.get_table_head(name, shape, None, values)) else: - result.append(RunGenerator._get_table_head(name_key, None, None, values)) + result.append(RunGenerator.get_table_head(name_key, None, None, values)) return result def _set_name_callstack_data(self, group_by_input_shape=False): @@ -350,24 +447,10 @@ class RunGenerator(object): 'data': [] } for callstack_key, value in values.items(): - table['data'].append(RunGenerator._get_table_head(name, shape, callstack_key, value)) + table['data'].append(RunGenerator.get_table_head(name, shape, callstack_key, value)) result[name_key] = table return result - @staticmethod - def _trans_shape(shape: str): - result = list() - if ';' not in shape: - result.append('[' + shape.strip() + ']') - return '[' + ', '.join(result) + ']' - if len(shape.strip()) <= 1: - result.append('[]') - return '[' + ', '.join(result) + ']' - shape_spl = shape.split("\n") - for shape_div in iter(shape_spl): - result.append('[' + str(shape_div.replace(';', '')) + ']') - return '[' + ', '.join(result) + ']' - def _get_call_stack_by_name(self): result = dict() name_callstack_data = self._set_name_callstack_data() @@ -384,47 +467,10 @@ class RunGenerator(object): 'data': [] } for callstack_key, value in values.items(): - table['data'].append(RunGenerator._get_table_head(name_key, None, callstack_key, value)) + table['data'].append(RunGenerator.get_table_head(name_key, None, callstack_key, value)) result[name_key] = table return result - @staticmethod - def _get_table_head(name: str, input_shape: str, call_stack: str, value: list): - if name is None: - return {} - temp = { - 'name': name, 'calls': 0, 'host_self_duration': 0, - 'host_total_duration': 0, 'device_self_duration': 0, 'device_total_duration': 0, - 'tc_self_ratio': 0, 'tc_total_ratio': 
0, 'tc_eligible': 'Yes' - } - if input_shape is not None: - temp['input_shape'] = input_shape - if call_stack is not None: - temp['call_stack'] = call_stack - else: - temp['has_call_stack'] = False - else: - if call_stack is not None: - temp['call_stack'] = call_stack - else: - temp['has_call_stack'] = False - for vl in iter(value): - if 'has_call_stack' in temp and vl[2]: - temp['has_call_stack'] = True - temp['calls'] += 1 - temp['host_self_duration'] = round(temp['host_self_duration'] + vl[3], 2) - temp['host_total_duration'] = round(temp['host_total_duration'] + vl[4], 2) - temp['device_self_duration'] = round(temp['device_self_duration'] + vl[5], 2) - temp['device_total_duration'] = round(temp['device_total_duration'] + vl[6], 2) - temp['tc_self_ratio'] = round(temp['tc_self_ratio'] + vl[7], 2) - temp['tc_total_ratio'] = round(temp['tc_total_ratio'] + vl[8], 2) - temp['tc_eligible'] = 'Yes' if temp['tc_self_ratio'] > 0 or temp['tc_total_ratio'] > 0 else 'No' - temp['tc_self_ratio'] = 0 if temp['device_self_duration'] == 0 \ - else round(temp['tc_self_ratio'] / temp['device_self_duration'] * 100, 2) - temp['tc_total_ratio'] = 0 if temp['device_total_duration'] == 0 \ - else round(temp['tc_total_ratio'] / temp['device_total_duration'] * 100, 2) - return temp - def _get_memory_event(self, peak_memory_events: dict): display_columns = ('Name', 'Size(KB)', 'Allocation Time(us)', 'Release Time(us)', 'Duration(us)') path = self.profile_data.memory_operator_path @@ -438,15 +484,16 @@ class RunGenerator(object): 'columns': [], 'rows': {} } - datas = RunGenerator._get_csv_data(path) + datas = RunGenerator.get_csv_data(path) if len(datas) < 1: return { 'operator': table, 'component': peak_memory_events } + device_type_form_idx = -1 for idx, column in enumerate(datas[0]): if column == 'Device Type': - self.device_type_form_idx = idx + device_type_form_idx = idx if column in display_columns: if column == 'Name': table['columns'].append({'name': column, 'type': 'string'}) @@ 
-457,11 +504,11 @@ class RunGenerator(object): table['columns'].append({'name': column.replace('(us)', '(ms)'), 'type': 'number'}) required_column_idxs = {key: -1 for key in display_columns} (name_idx, size_idx, allocation_idx, release_idx, duration_idx), column_exist_count = \ - RunGenerator._check_csv_columns(datas[0], required_column_idxs) - if column_exist_count < len(required_column_idxs): - logger.error('Required column is missing in file "operator_memory.csv"') + RunGenerator.check_csv_columns(datas[0], required_column_idxs) + if device_type_form_idx < 0 or column_exist_count < len(required_column_idxs): + raise ValueError('Required column is missing in file "operator_memory.csv"') for idx, ls in enumerate(datas[1:]): - device_type = ls[self.device_type_form_idx] + device_type = ls[device_type_form_idx] # convert time metric 'us' to 'ms' # some operators may not have the following columns try: @@ -489,8 +536,8 @@ class RunGenerator(object): time_metric: str = 'ms' memory_metric: str = 'MB' cano = Canonicalizer(time_metric, memory_metric) - process_devices_type, process_peaks = RunGenerator._get_process_peaks_and_devices_type(self.process_data, - memory_metric) + process_devices_type, process_peaks = RunGenerator.get_process_peaks_and_devices_type(self.process_data, + memory_metric) total_result = { 'metadata': { 'devices': process_devices_type, @@ -517,8 +564,8 @@ class RunGenerator(object): if len(total_result['columns'][device]) > 0: total_result['columns'][device].insert(0, {'name': f'Time ({cano.time_metric})', 'type': 'number', 'tooltip': 'Time since profiler starts.'}) - pta_ge_devices_type, pta_ge_peaks = RunGenerator._get_pta_ge_peaks_and_devices_type(self.component_curve_data, - memory_metric) + pta_ge_devices_type, pta_ge_peaks = RunGenerator.get_pta_ge_peaks_and_devices_type(self.component_curve_data, + memory_metric) component_curve_result = { 'metadata': { 'devices': pta_ge_devices_type, @@ -562,48 +609,11 @@ class RunGenerator(object): 'ptaGe': 
component_curve_result } - @staticmethod - def _get_process_peaks_and_devices_type(process_data: dict, memory_metric: str): - devices_type = [] - peaks = {} - for device in process_data: - devices_type.append(device) - reserved_list = process_data.get(device).get('Allocated') - if reserved_list is not None: - max_reserved = 0 - for array_value in reserved_list: - max_reserved = max(array_value[1], max_reserved) - peaks[device] = f'Peak Memory Usage: {max_reserved:.1f}{memory_metric}' - return devices_type, peaks - - @staticmethod - def _get_pta_ge_peaks_and_devices_type(process_data: dict, memory_metric: str): - devices_type = [] - peaks = {} - for device in process_data: - devices_type.append(device) - peaks[device] = 'Reserved Peak Memory Usage:' - for component in process_data.get(device): - max_reserved = 0 - for array_value in process_data.get(device).get(component): - max_reserved = max(array_value[2], max_reserved) - peaks[device] += f' {component}-{max_reserved:.1f}{memory_metric} |' - return devices_type, peaks - - @staticmethod - def _check_csv_columns(columns: list, column_idxs: dict): - column_exist_count = 0 - for idx, column in enumerate(columns): - if column in column_idxs: - column_idxs[column] = idx - column_exist_count += 1 - return column_idxs.values(), column_exist_count - def _handle_memory_data(self): process_data = defaultdict() pta_or_ge_data = defaultdict() path = self.profile_data.memory_curve_path - datas = RunGenerator._get_csv_data(path) + datas = RunGenerator.get_csv_data(path) required_column_idxs = { 'Component': -1, 'Device Type': -1, @@ -612,7 +622,7 @@ class RunGenerator(object): 'Total Allocated(MB)': -1 } (tag_type_idx, device_type_idx, time_idx, reserved_idx, allocated_idx), column_exist_count = \ - RunGenerator._check_csv_columns(datas[0], required_column_idxs) + RunGenerator.check_csv_columns(datas[0], required_column_idxs) if column_exist_count < len(required_column_idxs): logger.error('Required column is missing in file 
"memory_record.csv"') else: @@ -653,7 +663,7 @@ class RunGenerator(object): } peak_memory_rows = defaultdict(list) path = self.profile_data.memory_component_path - component_datas = RunGenerator._get_csv_data(path) + component_datas = RunGenerator.get_csv_data(path) if component_datas: required_column_idxs = { 'Component': -1, @@ -662,7 +672,7 @@ class RunGenerator(object): 'Device': -1 } (tag_type_idx, time_idx, reserved_idx, device_type_idx), column_exist_count = \ - RunGenerator._check_csv_columns(component_datas[0], required_column_idxs) + RunGenerator.check_csv_columns(component_datas[0], required_column_idxs) if column_exist_count < len(required_column_idxs): logger.error(f'Required column is missing in file "{path}"') else: @@ -708,14 +718,16 @@ class RunGenerator(object): '{}: {}us
' 'Percentage: {}%' '</div>
') - percentage = round(100 * part_cost / costs.costs[ProfileRole.Total], 2) + percentage = 0.0 if costs.costs[ProfileRole.Total] == 0 else round( + 100 * part_cost / costs.costs[ProfileRole.Total], 2) return format_str.format(step_name, costs.costs[ProfileRole.Total], part_name, part_cost, percentage) def build_avg_cost_dict(part_name: str, part_cost: float): + profiler_total_cost = self.profile_data.avg_costs.costs[ProfileRole.Total] cost_dict = {'name': part_name, 'description': '', 'value': round(part_cost), - 'extra': round(100 * part_cost / self.profile_data.avg_costs.costs[ProfileRole.Total], 2)} + 'extra': 0.0 if profiler_total_cost == 0 else round(100 * part_cost / profiler_total_cost, 2)} return cost_dict show_gpu = (self.profile_data.has_runtime @@ -734,8 +746,7 @@ class RunGenerator(object): data['steps']['columns'].extend(['DataLoader', 'CPU Exec', 'Other']) data['steps']['rows'] = [] - for i in range(len(self.profile_data.steps_costs)): - costs = self.profile_data.steps_costs[i] + for i, costs in enumerate(self.profile_data.steps_costs): step_name = self.profile_data.steps_names[i] row = [{'value': step_name}] if show_gpu: @@ -1063,14 +1074,14 @@ class RunGenerator(object): 'data': table } path = self.profile_data.kernel_file_path - datas = RunGenerator._get_csv_data(path) + datas = RunGenerator.get_csv_data(path) required_column_idxs = { 'Name': -1, 'Duration(us)': -1, 'Accelerator Core': -1 } (name_idx, duration_idx, core_type_idx), column_exist_count = \ - RunGenerator._check_csv_columns(datas[0], required_column_idxs) + RunGenerator.check_csv_columns(datas[0], required_column_idxs) if column_exist_count < 3: logger.error('Required column is missing in file "kernel_details.csv"') else: @@ -1084,16 +1095,6 @@ class RunGenerator(object): table['rows'] = datas[1:] return result - @staticmethod - def _get_csv_data(path: str): - if path is None: - return [] - datas = [] - with open(path, encoding='utf-8-sig') as f: - for row in csv.reader(f, 
skipinitialspace=True): - datas.append(row) - return datas - def _generate_tc_pie_npu(self): pie = {'columns': [{'type': 'string', 'name': 'name'}, {'type': 'number', 'name': 'value'}], 'rows': []} for key, val in self.accelerator_data.items(): @@ -1102,7 +1103,7 @@ class RunGenerator(object): return data @staticmethod - def _get_gpu_info(device_props, gpu_id): + def get_gpu_info(device_props, gpu_id): if (device_props is None) or (gpu_id >= len(device_props)) or (gpu_id < 0): return None @@ -1203,7 +1204,7 @@ class DistributedRunGenerator(object): process_id = 'Process ' + str(process_id) result[node][process_id] = OrderedDict() for used_device in data.used_devices: - gpu_info = RunGenerator._get_gpu_info(data.device_props, used_device) + gpu_info = RunGenerator.get_gpu_info(data.device_props, used_device) if gpu_info is not None: result[node][process_id]['GPU' + str(used_device)] = gpu_info @@ -1254,7 +1255,8 @@ class DistributedRunGenerator(object): round(costs.other, 3) ] steps_to_overlap['all'][data.worker] = [ - sum(x) for x in zip(steps_to_overlap['all'][data.worker], steps_to_overlap[step_name][data.worker]) + sum(x) + for x in zip(steps_to_overlap['all'][data.worker], steps_to_overlap[step_name][data.worker]) ] @staticmethod @@ -1267,7 +1269,8 @@ class DistributedRunGenerator(object): steps_to_overlap[k][data.worker] = list( [round(v[0] - v[1], 3), round(v[1], 3), round(v[2], 3), round(v[3], 3)]) steps_to_overlap['all'][data.worker] = [ - sum(x) for x in zip(steps_to_overlap['all'][data.worker], steps_to_overlap[k][data.worker]) + sum(x) + for x in zip(steps_to_overlap['all'][data.worker], steps_to_overlap[k][data.worker]) ] @staticmethod @@ -1283,7 +1286,8 @@ class DistributedRunGenerator(object): wait = round(v.get('Synchronize') * 1000, 3) # 1ms = 1000us steps_to_wait[k][data.worker] = list([trans, wait]) steps_to_wait['all'][data.worker] = [ - sum(x) for x in zip(steps_to_wait['all'][data.worker], steps_to_wait[k][data.worker]) + sum(x) + for x in 
zip(steps_to_wait['all'][data.worker], steps_to_wait[k][data.worker]) ] steps_to_wait['all'][data.worker] = [x / step_number for x in steps_to_wait['all'][data.worker]] @@ -1298,7 +1302,8 @@ class DistributedRunGenerator(object): round(comm_stats[0] - comm_stats[1], 3) ] steps_to_wait['all'][data.worker] = [ - sum(x) for x in zip(steps_to_wait['all'][data.worker], steps_to_wait[step][data.worker]) + sum(x) + for x in zip(steps_to_wait['all'][data.worker], steps_to_wait[step][data.worker]) ] steps_to_wait['all'][data.worker] = [int(x / step_number) for x in steps_to_wait['all'][data.worker]] diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/tensor_core.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/tensor_core.py index 501e2076c4696419a6d8a1fa4b39bec9c14f2c45..cc53ab217f0ee6f88817c51da6ba46da68df4e28 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/tensor_core.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/tensor_core.py @@ -1,13 +1,13 @@ # ------------------------------------------------------------------------- # Copyright (c) Microsoft Corporation. All rights reserved. # ------------------------------------------------------------------------- -class TC_Allowlist_Meta(type): - # Enable grammar sugar as 'v in TC_Allowlist'. +class TcAllowlistMeta(type): + # Enable grammar sugar as 'v in TcAllowlist'. 
def __contains__(cls, item): return cls.__contains__(item) -class TC_Allowlist(metaclass=TC_Allowlist_Meta): +class TcAllowlist(metaclass=TcAllowlistMeta): allowlist = ['h884', 's884', 'h1688', 's1688', 'hmma', 'i8816', '16816', 'dgrad_1x1_stride_2x2', 'first_layer_wgrad_kernel', 'conv1x1', 'conv2d_c1_k1', 'direct_group', 'xmma_implicit_gemm', @@ -23,7 +23,7 @@ class TC_Allowlist(metaclass=TC_Allowlist_Meta): return False -class TC_OP_Allowlist(metaclass=TC_Allowlist_Meta): +class TcOpAllowlist(metaclass=TcAllowlistMeta): allowlist = ['aten::_convolution', 'aten::conv1d', 'aten::conv2d', 'aten::conv3d', 'aten::conv_tbc', 'aten::conv_transpose1d', 'aten::conv_transpose2d', 'aten::conv_transpose3d', 'aten::convolution', 'aten::cudnn_convolution', 'aten::cudnn_convolution_transpose', diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/trace.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/trace.py index e76f8b18dd80a9f12a867c9395de6a96a39bc2c1..ea09f79666bd184956469f48fc7922854394940d 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/trace.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/trace.py @@ -1,13 +1,13 @@ # ------------------------------------------------------------------------- # Copyright (c) Microsoft Corporation. All rights reserved. # -------------------------------------------------------------------------- +__all__ = ['EventTypes', 'create_event'] + from enum import IntEnum from typing import Dict, Optional from .. 
import utils -__all__ = ['EventTypes', 'create_event'] - logger = utils.get_logger() NcclOpNameSet = ['nccl:broadcast', 'nccl:reduce', 'nccl:all_reduce', 'nccl:all_gather', 'nccl:reduce_scatter'] @@ -56,8 +56,8 @@ EventTypeMap = { class BaseEvent(object): - def __init__(self, type, data): - self.type: str = type + def __init__(self, event_type, data): + self.type: str = event_type self.name: str = data.get('name') self.ts: int = data.get('ts') self.pid: int = data.get('pid') @@ -66,8 +66,8 @@ class BaseEvent(object): class DurationEvent(BaseEvent): - def __init__(self, type, data): - super().__init__(type, data) + def __init__(self, event_type, data): + super().__init__(event_type, data) self.category: str = data.get('cat', '') self.duration: int = data.get('dur') @@ -79,8 +79,8 @@ class DurationEvent(BaseEvent): class KernelEvent(DurationEvent): - def __init__(self, type, data): - super().__init__(type, data) + def __init__(self, event_type, data): + super().__init__(event_type, data) self.occupancy = self.args.get('est. 
achieved occupancy %') self.blocks_per_sm = self.args.get('blocks per SM') self.grid = self.args.get('grid') @@ -91,8 +91,8 @@ class KernelEvent(DurationEvent): class OperatorEvent(DurationEvent): - def __init__(self, type, data): - super().__init__(type, data) + def __init__(self, event_type, data): + super().__init__(event_type, data) self.callstack = self.args.get('Call stack') self.input_type = self.args.get('Input type') @@ -111,8 +111,8 @@ class ProfilerStepEvent(OperatorEvent): class MemoryEvent(BaseEvent): - def __init__(self, type, data): - super().__init__(type, data) + def __init__(self, event_type, data): + super().__init__(event_type, data) self.scope: str = data.get('s', '') self.device_id: int = self.args.get('Device Id') dtype = self.args.get('Device Type') @@ -142,8 +142,8 @@ class MemoryEvent(BaseEvent): class PythonFunctionEvent(DurationEvent): - def __init__(self, type, data): - super().__init__(type, data) + def __init__(self, event_type, data): + super().__init__(event_type, data) self.python_id: int = self.args.get('Python id') self.python_parent_id: int = self.args.get('Python parent id') diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/static/trace_embedding.html b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/static/trace_embedding.html index 4ef8846583a9e4d28710ff4ec62a4263921c9804..462d2c395f81d932fbf0196ccc53f4b0ece6e93a 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/static/trace_embedding.html +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/static/trace_embedding.html @@ -11,7 +11,7 @@ found in the LICENSE file. 'use strict'; function onTraceViewerImportFail() { - document.addEventListener('DOMContentLoaded', function () { + document.addEventListener('DOMContentLoaded', () => { document.body.textContent = 'tracing/bin/trace_viewer_full.html is missing. ' + 'Run vulcanize_trace_viewer from $TRACE_VIEWER and reload.'; @@ -52,12 +52,11 @@ found in the LICENSE file. 
// warning. window.__hideTraceViewerPolyfillWarning = true; - window.addEventListener("message", event => { - const data = event.data || {} - console.log(data) - name = data.name || 'unknown' - onResult(data.data) - }) + window.addEventListener('message', event => { + const data = event.data || {}; + name = data.name || 'unknown'; + onResult(data.data); + }); function onResult(result) { model = new tr.Model(); @@ -78,7 +77,7 @@ found in the LICENSE file. overlay.visible = true; } - document.addEventListener('WebComponentsReady', function () { + document.addEventListener('WebComponentsReady', () => { const container = document.createElement('track-view-container'); container.id = 'track_view_container'; @@ -91,7 +90,7 @@ found in the LICENSE file. Polymer.dom(document.body).appendChild(viewer); if (window.parent) { - window.parent.postMessage({ msg: 'ready' }, window.origin) + window.parent.postMessage({ msg: 'ready' }, window.origin); } }); }()); diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/utils.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/utils.py index 21909d9c2dc2b170ad727ebb27f413bda29024a0..5991cf2b33d1e818e6876c8d7550fbb6c87cdaa3 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/utils.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/utils.py @@ -23,14 +23,15 @@ import math import os import time from contextlib import contextmanager -from math import pow from . 
import consts +predefined_logging_level = ('CRITICAL', 'ERROR', 'WARNING', 'INFO', 'DEBUG', 'NOTSET') + def get_logging_level(): log_level = os.environ.get('TORCH_PROFILER_LOG_LEVEL', 'INFO').upper() - if log_level not in logging._levelToName.values(): + if log_level not in predefined_logging_level: log_level = logging.getLevelName(logging.INFO) return log_level @@ -83,10 +84,10 @@ class Canonicalizer: } # raw memory is in bytes memory_metric_to_factor = { - 'B': pow(1024, 0), - 'KB': pow(1024, 1), - 'MB': pow(1024, 2), - 'GB': pow(1024, 3), + 'B': math.pow(1024, 0), + 'KB': math.pow(1024, 1), + 'MB': math.pow(1024, 2), + 'GB': math.pow(1024, 3), } # canonicalize the memory metric to a string @@ -124,7 +125,7 @@ class DisplayRounder: def __init__(self, ndigits): self.ndigits = ndigits - self.precision = pow(10, -ndigits) + self.precision = math.pow(10, -ndigits) def __call__(self, v: float): _v = abs(v) diff --git a/profiler/MANIFEST.in b/profiler/MANIFEST.in deleted file mode 100644 index 0550da458f399209a4002b47706e5d741c990af3..0000000000000000000000000000000000000000 --- a/profiler/MANIFEST.in +++ /dev/null @@ -1,7 +0,0 @@ -recursive-include profiler/advisor/ * -recursive-include profiler/cli/ * -recursive-include profiler/prof_common/ * -recursive-include profiler/compare_tools/ * -recursive-include profiler/cluster_analyse/ * -global-exclude */__pycache__/* -global-exclude *.pyc diff --git a/profiler/advisor/common/constant.py b/profiler/advisor/common/constant.py deleted file mode 100644 index dcaffee83df77ad192543a06acf40914a78a2c03..0000000000000000000000000000000000000000 --- a/profiler/advisor/common/constant.py +++ /dev/null @@ -1,154 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import stat - -# timeline -DEQUEUE = "Dequeue" -DEQUEUE_SEP = "@" -ATEN = "aten" -NPU = "npu" -ATEN_SEP = "::" -OPTIMIZER = "Optimizer" -OPTIMIZER_SEP = "#" -OPTIMIZER_STEP = "step" -ENQUEUE = "enqueue" -TORCH_TO_NPU = "torch_to_npu" -FREE = "free" -OP_COMPILE_NAME = "AscendCL@aclopCompileAndExecute" -OP_COMPILE_ID = "aclopCompileAndExecute" -SYNC_STREAM = "AscendCL@aclrtSynchronizeStream" -NODE_LAUNCH = "Node@launch" -MAX_OP_COMPILE_NUM = 20 -ACL_TO_NPU = "acl_to_npu" -TASK_TYPE = "Task Type" -CPU_OP = "cpu_op" -AI_CORE = "AI_CORE" -AI_CPU = "AI_CPU" -MIX_AIC = "MIX_AIC" -CALL_STACKS = "Call stack" -INPUT_DIMS = "Input Dims" -OP_SEP = "-" -ADVISOR_MAX_PROCESSES = 8 -ADVISOR_ANALYZE_PROCESSES = "ADVISOR_ANALYZE_PROCESSES" -TIMELINE_OP_STACKS_DATASET = "timeline_op_stacks_dataset" -TIMELINE_BACKWARD_NO_STACK = "Backward broadcast, without call stacks in profiling." -TIMELINE_ACL_TO_NPU_NO_STACK = "Incoming flow is 'acl_to_npu', without call stacks in profiling." -TIMELINE_BACKWARD_NO_STACK_CODE = -1 -TIMELINE_ACL_TO_NPU_NO_STACK_CODE = -2 -TIMELINE_FUSION_OPS_NO_STACK_FLAG = "NO STACK" -NO_STACK_REASON_MAP = { - TIMELINE_BACKWARD_NO_STACK_CODE: "Backward broadcast, without call stacks in profiling.", - TIMELINE_ACL_TO_NPU_NO_STACK_CODE: "Incoming flow is 'acl_to_npu', without call stacks in profiling." -} -AFFINITY_TRAINING_API = "Affinity training api" -TIMELINE_EMPTY_STACKS_PROMPT = "These APIs have no code stack. If parameter 'with_stack=False' while profiling, " \ - "please refer to {timeline_profiling_doc_url} to set 'with_stack=True'. 
" \ - "Otherwise, ignore following affinity APIs due to backward broadcast lack of stack." - -CLUSTER_ANALYSIS = "Cluster analysis" -SLOW_RANK_TIME_RATIO_THRESHOLD = 0.05 - -CANN_VERSION = "cann_version" -TORCH_VERSION = "torch_version" -PROFILING_TYPE = "profiling_type" -ANALYSIS_DIMENSIONS = "analysis_dimensions" - -PROFILER_METADATA = "profiler_metadata.json" - -TERMINAL_OUTPUT_HEADERS = ["No.", "Problem", "Description", "Suggestion"] -SKIP_ANALYZE_PROMPT = "Finish analysis, no optimization suggestions" -SKIP_QUERY_PROMPT = "Finish query operator stack, no operators" - -# operator output constant -OPERATOR_OUT_TOPK = 10 -OPERATOR_LIST_UNLIMIT = -1 - -DEFAULT_OPERATOR_TYPE = 'None_type' -DEFAULT_DURATION_ZERO = 0.0 - -ADVISOR_LOG_LEVEL = "ADVISOR_LOG_LEVEL" -DEFAULT_LOG_LEVEL = "INFO" -SUPPORTED_LOG_LEVEL = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"] - -RULE_BUCKET = "RULE-BUCKET" -CLOUD_RULE_REGION_CN_NORTH_9 = "cn-north-9" -CLOUD_RULE_REGION_CN_NORTH_7 = "cn-north-7" -CLOUD_RULE_REGION_CN_SOUTHWEST_2 = "cn-southwest-2" -CLOUD_RULE_REGION_LIST = [CLOUD_RULE_REGION_CN_NORTH_7, CLOUD_RULE_REGION_CN_NORTH_9, CLOUD_RULE_REGION_CN_SOUTHWEST_2] -INNER_REGION_LIST = [CLOUD_RULE_REGION_CN_NORTH_7] -DEFAULT_CLOUD_RULE_REGION = CLOUD_RULE_REGION_CN_SOUTHWEST_2 - -HTTP_PREFIXES = "http://" -HTTPS_PREFIXES = "https://" -COMMON_YAML_DIR = "modelarts/solution/ma_advisor_rules/" -COMMON_ENDPOINT_SUFFIX = "obs.{}.myhuaweicloud.com" -INNER_ENDPOINT_SUFFIX = "obs.{}.ulanqab.huawei.com" - -AICPU_RULES_YAML_NAME = "aicpu_rules.yaml" -FUSION_PASS_YAML_NAME = "op_fusion_pass.yaml" -TIMELINE_FUSION_OPS_YAML_NAME = "timeline_fusion_ops.yaml" -CLOUD_YAML_NAME_LIST = [AICPU_RULES_YAML_NAME, FUSION_PASS_YAML_NAME, TIMELINE_FUSION_OPS_YAML_NAME] - -MAX_RETRIES = 3 -TIMEOUT = 3 -DEPTH_LIMIT = 20 - -ADVISOR_RULE_PATH = "ADVISOR_RULE_PATH" -CLOUD_RULE_PATH = "rules/cloud/" -DEFAULT_RULE_PATH = "./rules/" - -TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID = -1 - -DEFAULT_TEMPLATE_HEADER = 
"Performance Optimization Suggestions" - -PT_PROF_SUFFIX = "ascend_pt" -ASCEND_PROFILER_OUTPUT = "ASCEND_PROFILER_OUTPUT" -COLLECTION_PATH = "collection_path" -CLUSTER_ANALYSIS_OUTPUT = "cluster_analysis_output" -KERNEL_DETAILS_CSV = "kernel_details.csv" -CLUSTER_STEP_TIME_CSV = "cluster_step_trace_time.csv" -CLUSTER_COMM_JSON = "cluster_communication.json" -COMMUNICATION_JSON = "communication.json" - -BOTTLENECK = "bottleneck" -DATA = "data" -ADVISOR_ANALYSIS_OUTPUT_DIR = "advisor_analysis_result" -DEFAULT_PROCESSES = 8 -CLUSTER_ANALYSIS_FILE_PATTERN = [ - r'profiler_info_\d+\.json', "step_trace_time.csv", "communication.json", "communication_matrix.json" -] -ANALYSIS_OUTPUT_PATH = "ANALYSIS_OUTPUT_PATH" -DEFAULT_RANK_FOR_PROFILING_ANALYSIS = 0 -PROFILER_INFO_FILE_PATTERN = r"profiler_info_(\d+)\.json" -DISABLE_STREAMINIG_READER = "DISABLE_STREAMINIG_READER" -FRAMEWORK_STACK_BLACK_LIST = ["torch", "torch_npu", "megatron", "deepspeed"] -DISABLE_STREAMING_READER = "DISABLE_STREAMING_READER" -MAX_FILE_SIZE = 10 ** 10 -MAX_NUM_PROCESSES = 4 -DEFAULT_STEP = "-1" -STEP_RANK_SEP = "_" - -MAX_READ_LINE_BYTES = 8196 * 1024 -MAX_READ_FILE_BYTES = 64 * 1024 * 1024 * 1024 -MAX_READ_DB_FILE_BYTES = 8 * 1024 * 1024 * 1024 - -WRITE_MODES = stat.S_IWUSR | stat.S_IRUSR | stat.S_IRGRP -WRITE_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_TRUNC - -DISABLE_PROFILING_COMPARISON = "DISABLE_PROFILING_COMPARISON" -FREE_DURATION_FOR_GC_ANALYSIS = "FREE_DURATION_FOR_GC_ANALYSIS" -DISABLE_AFFINITY_API = "DISABLE_AFFINITY_API" diff --git a/profiler/affinity_cpu_bind/bind_core.py b/profiler/affinity_cpu_bind/bind_core.py index 23d72c8b10975f1b3526d7b4a2a515702b32e701..0f27eb79a02d22a5cfc6ec7eb827f5c0cfb5a529 100644 --- a/profiler/affinity_cpu_bind/bind_core.py +++ b/profiler/affinity_cpu_bind/bind_core.py @@ -1,3 +1,18 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import subprocess import argparse import os @@ -6,6 +21,8 @@ import logging from datetime import datetime from datetime import timezone +logger = logging.getLogger("affinity_cpu_bind") + class PathManager: DATA_FILE_AUTHORITY = 0o640 @@ -48,7 +65,7 @@ class BindCoreManager(): self.args_parse() if not bind_core_manager.get_npu_info(): - print('[ERROR] Failed to get current npus info') + logger.error('Failed to get current npus info') exit() if not bind_core_manager.get_running_pid_on_npu(): exit() @@ -56,7 +73,7 @@ class BindCoreManager(): bind_core_manager.run_bind_core() def get_running_pid_on_npu(self) -> bool: - no_running_pids_on_npu_msg = '[INFO] Now there is no running process on all NPUs, stop bind cores' + no_running_pids_on_npu_msg = 'Now there is no running process on all NPUs, stop bind cores' logging.info('Begin to find running process on all NPUs') # get running process on NPUs for _ in range(self.find_running_pid_times): @@ -102,7 +119,7 @@ class BindCoreManager(): logging.info('Succeed to find running process %s on NPU %d', pids, npu_id) if_running_process = True if not if_running_process: - print(no_running_pids_on_npu_msg) + logger.info(no_running_pids_on_npu_msg) return if_running_process def get_npu_info(self) -> bool: @@ -129,14 +146,16 @@ class BindCoreManager(): p = subprocess.run(set_affinity_cpu_cmd.split(), shell=False, capture_output=True) logging.info(p.stdout.decode('utf-8')) except 
subprocess.CalledProcessError: - print('[ERROR] Failed to bind process {} on NPU {} with cpu cores list {}'.format(pid, npu, affinity_cpu)) + logger.error(f'Failed to bind process {pid} on NPU {npu} with cpu cores list {affinity_cpu}') logging.info('Succeed to bind process %s on NPU %d with cpu cores list %s', pid, npu, affinity_cpu) def args_parse(self): parser = argparse.ArgumentParser(description='This is a affinity cpu core bind script.') - parser.add_argument('-t', '--time', type=int, metavar='', help='Wait time before bind cores that you want to set. The unit is \'s\'.') - parser.add_argument('-app', '--application', metavar='', nargs='+', help='Training or inference command that you want to run.') + parser.add_argument('-t', '--time', type=int, metavar='', + help='Wait time before bind cores that you want to set. The unit is \'s\'.') + parser.add_argument('-app', '--application', metavar='', nargs='+', + help='Training or inference command that you want to run.') args = parser.parse_args() if args.application: application_cmd = ' '.join(args.application) @@ -148,7 +167,7 @@ class BindCoreManager(): args.time = 0 msg = f"Invalid parameter. The value of --time is not within the range " \ f"[0, {BindCoreManager.MAX_WAIT_TIME_BEFORE_BIND_CORE}]. --time has been set to 0 to continue." 
- print(f'[WARNING] {msg}') + logger.warning(msg) time.sleep(args.time) def _init_log_file(self): @@ -175,7 +194,8 @@ class BindCoreManager(): get_npu_info_cmd = 'npu-smi info -l' get_npu_info_process = subprocess.run(get_npu_info_cmd.split(), shell=False, capture_output=True) get_npu_id_cmd = 'grep ID' - get_npu_id_process = subprocess.run(get_npu_id_cmd.split(), shell=False, input=get_npu_info_process.stdout, capture_output=True) + get_npu_id_process = subprocess.run(get_npu_id_cmd.split(), shell=False, input=get_npu_info_process.stdout, + capture_output=True) res = get_npu_id_process.stdout.decode('utf-8').split() for i in res: if i.isdigit(): @@ -189,7 +209,8 @@ class BindCoreManager(): p = subprocess.run(get_npu_topo_cmd.split(), shell=False, capture_output=True) res = p.stdout.decode('utf-8').split() if not res: - print('[ERROR] Failed to run get npu affinity info, please check if driver version support cmd npu-smi info -t topo') + logger.error('Failed to run get npu affinity info, ' + 'please check if driver version support cmd npu-smi info -t topo') return False index = 0 @@ -205,10 +226,12 @@ class BindCoreManager(): cpus[1] = str(int(cpus[1]) + cpu_num_for_each_npu) affinity_cpus.append(cpus[0] + '-' + cpus[1]) if index < len(self.npu_id_list): - self.npu_affinity_cpu_dict[self.npu_id_list[index]] = ','.join(affinity_cpu for affinity_cpu in affinity_cpus) + self.npu_affinity_cpu_dict[self.npu_id_list[index]] = ','.join( + affinity_cpu for affinity_cpu in affinity_cpus) index += 1 else: - print('[ERROR] Get affinity_cpu_list for {} npus, more than real npu num: {}'.format(index + 1, len(self.npu_id_list))) + logger.error(f'Get affinity_cpu_list for {index + 1} npus, ' + f'more than real npu num: {len(self.npu_id_list)}') return False for k in self.npu_affinity_cpu_dict.keys(): @@ -217,12 +240,12 @@ class BindCoreManager(): if __name__ == '__main__': - print('[INFO] Begin to run bind-cores script...') + logger.info('Begin to run bind-cores script...') 
bind_core_manager = BindCoreManager() try: bind_core_manager.run() except Exception as exception: - print(f'[ERROR] {exception}') + logger.error(f"{exception}") - print('[INFO] End to run bind-cores script, the log is saved in {}'.format(bind_core_manager.log_file)) + logger.info(f'End to run bind-cores script, the log is saved in {bind_core_manager.log_file}') diff --git a/profiler/cluster_analyse/common_func/constant.py b/profiler/cluster_analyse/common_func/constant.py deleted file mode 100644 index f78f798b35e8cbc1029387bd9a3564f2bf516789..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/constant.py +++ /dev/null @@ -1,120 +0,0 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. -# All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- - -class Constant(object): - # dir name - FRAMEWORK_DIR = "FRAMEWORK" - CLUSTER_ANALYSIS_OUTPUT = "cluster_analysis_output" - SINGLE_OUTPUT = "ASCEND_PROFILER_OUTPUT" - COMM_JSON = "communication.json" - COMM_MATRIX_JSON = "communication_matrix.json" - STEP_TIME_CSV = "step_trace_time.csv" - KERNEL_DETAILS_CSV = "kernel_details.csv" - - # file authority - FILE_AUTHORITY = 0o640 - DIR_AUTHORITY = 0o750 - MAX_JSON_SIZE = 1024 * 1024 * 1024 * 10 - MAX_CSV_SIZE = 1024 * 1024 * 1024 * 5 - MAX_PATH_LENGTH = 4096 - MAX_READ_DB_FILE_BYTES = 1024 * 1024 * 1024 * 8 - - # communication - P2P = "p2p" - COLLECTIVE = "collective" - STEP_ID = "step_id" - RANK_ID = "rank_id" - GROUP_NAME = "group_name" - COMM_OP_TYPE = "comm_op_type" - COMM_OP_NAME = "comm_op_name" - COMM_OP_INFO = "comm_op_info" - TOTAL_OP_INFO = "Total Op Info" - COMMUNICATION_TIME_INFO = "Communication Time Info" - START_TIMESTAMP = "Start Timestamp(us)" - COMMUNICATION_BANDWIDTH_INFO = "Communication Bandwidth Info" - HCOM_SEND = "hcom_send" - HCOM_RECEIVE = "hcom_receive" - SYNCHRONIZATION_TIME_RATIO = "Synchronization Time Ratio" - SYNCHRONIZATION_TIME_MS = "Synchronization Time(ms)" - WAIT_TIME_RATIO = "Wait Time Ratio" - TRANSIT_TIME_MS = "Transit Time(ms)" - TRANSIT_SIZE_MB = "Transit Size(MB)" - SIZE_DISTRIBUTION = "Size Distribution" - WAIT_TIME_MS = "Wait Time(ms)" - OP_NAME = "Op Name" - BANDWIDTH_GB_S = "Bandwidth(GB/s)" - COMMUNICATION = "communication.json" - ELAPSE_TIME_MS = "Elapse Time(ms)" - IDLE_TIME_MS = "Idle Time(ms)" - LARGE_PACKET_RATIO = "Large Packet Ratio" - - # params - DATA_MAP = "data_map" - COLLECTIVE_GROUP = "collective_group" - COMMUNICATION_OPS = "communication_ops" - MATRIX_OPS = "matrix_ops" - COLLECTION_PATH = "collection_path" - CLUSTER_ANALYSIS_OUTPUT_PATH = "output_path" - COMMUNICATION_GROUP = "communication_group" - TRANSPORT_TYPE = "Transport Type" - COMM_DATA_DICT = "comm_data_dict" - DATA_TYPE = "data_type" - ANALYSIS_MODE = "analysis_mode" - - # step time - RANK = 
"rank" - STAGE = "stage" - - # epsilon - EPS = 1e-15 - - # file suffix - JSON_SUFFIX = ".json" - CSV_SUFFIX = ".csv" - - # result files type - TEXT = "text" - DB = "db" - INVALID = "invalid" - - # db name - DB_COMMUNICATION_ANALYZER = "analysis.db" - DB_CLUSTER_COMMUNICATION_ANALYZER = "cluster_analysis.db" - - # db tables - TABLE_COMM_ANALYZER_BANDWIDTH = "CommAnalyzerBandwidth" - TABLE_COMM_ANALYZER_TIME = "CommAnalyzerTime" - TABLE_COMM_ANALYZER_MATRIX = "CommAnalyzerMatrix" - TABLE_STEP_TRACE = "StepTraceTime" - TABLE_HOST_INFO = "HostInfo" - TABLE_RANK_DEVICE_MAP = "RankDeviceMap" - - # data config key - CONFIG = "config" - EXPER_CONFIG = "experimental_config" - EXPORT_TYPE = "_export_type" - - # metadata key - DISTRIBUTED_ARGS = "distributed_args" - - # mode - ALL = "all" - COMMUNICATION_TIME = "communication_time" - COMMUNICATION_MATRIX = "communication_matrix" - - STEP = "step" - - DATA_SIMPLIFICATION = "data_simplification" \ No newline at end of file diff --git a/profiler/cluster_analyse/common_func/empty_class.py b/profiler/cluster_analyse/common_func/empty_class.py deleted file mode 100644 index df100d156fa064cca4514260db0b2e843e217d09..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/empty_class.py +++ /dev/null @@ -1,20 +0,0 @@ -class EmptyClass: - - def __init__(self: any, info: str = "") -> None: - self._info = info - - @classmethod - def __bool__(cls: any) -> bool: - return False - - @classmethod - def __str__(cls: any) -> str: - return "" - - @property - def info(self: any) -> str: - return self._info - - @staticmethod - def is_empty() -> bool: - return True diff --git a/profiler/cluster_analyse/common_func/path_manager.py b/profiler/cluster_analyse/common_func/path_manager.py deleted file mode 100644 index 9030e041b80f897088517484683e53583ffd6014..0000000000000000000000000000000000000000 --- a/profiler/cluster_analyse/common_func/path_manager.py +++ /dev/null @@ -1,202 +0,0 @@ -# Copyright (c) 2023 Huawei 
Technologies Co., Ltd -# All rights reserved. -# -# Licensed under the BSD 3-Clause License (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# https://opensource.org/licenses/BSD-3-Clause -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import re -import shutil -import platform - - - -class PathManager: - MAX_PATH_LENGTH = 4096 - MAX_FILE_NAME_LENGTH = 255 - DATA_FILE_AUTHORITY = 0o640 - DATA_DIR_AUTHORITY = 0o750 - WINDOWS = "windows" - - @classmethod - def check_input_directory_path(cls, path: str): - """ - Function Description: - check whether the path is valid, some businesses can accept a path that does not exist, - so the function do not verify whether the path exists - Parameter: - path: the path to check, whether the incoming path is absolute or relative depends on the business - Exception Description: - when invalid data throw exception - """ - cls.input_path_common_check(path) - base_name = os.path.basename(path) - if os.path.isfile(path): - msg = f"Invalid input path which is a file path: {base_name}" - raise RuntimeError(msg) - - @classmethod - def check_input_file_path(cls, path: str): - """ - Function Description: - check whether the file path is valid, some businesses can accept a path that does not exist, - so the function do not verify whether the path exists - Parameter: - path: the file path to check, whether the incoming path is absolute or relative depends on the business - Exception Description: - when invalid data throw exception - """ - cls.input_path_common_check(path) - base_name = os.path.basename(path) - if os.path.isdir(path): - msg = f"Invalid input path which is a 
directory path: {base_name}" - raise RuntimeError(msg) - - @classmethod - def check_path_length(cls, path: str): - if len(path) > cls.MAX_PATH_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - path_split_list = path.split("/") - for path in path_split_list: - path_list = path.split("\\") - for name in path_list: - if len(name) > cls.MAX_FILE_NAME_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - - @classmethod - def input_path_common_check(cls, path: str): - if len(path) > cls.MAX_PATH_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - - if os.path.islink(path): - msg = f"Invalid input path which is a soft link." - raise RuntimeError(msg) - - pattern = r'(\.|:|\\|/|_|-|\s|[~0-9a-zA-Z\u4e00-\u9fa5])+' - if not re.fullmatch(pattern, path): - illegal_pattern = r'([^\.\:\\\/\_\-\s~0-9a-zA-Z\u4e00-\u9fa5])+' - invalid_obj = re.search(illegal_pattern, path).group() - msg = f"Invalid path which has illagal characters \"{invalid_obj}\"." - raise RuntimeError(msg) - - path_split_list = path.split("/") - for path in path_split_list: - path_list = path.split("\\") - for name in path_list: - if len(name) > cls.MAX_FILE_NAME_LENGTH: - raise RuntimeError("Length of input path exceeds the limit.") - - @classmethod - def check_path_owner_consistent(cls, path: str): - """ - Function Description: - check whether the path belong to process owner - Parameter: - path: the path to check - Exception Description: - when invalid path, prompt the user - """ - base_name = os.path.basename(path) - if not os.path.exists(path): - msg = f"Invalid path: {base_name}" - raise RuntimeError(msg) - if platform.system().lower() == cls.WINDOWS: - return - if os.stat(path).st_uid != os.getuid(): - check_msg = input("The path does not belong to you, do you want to continue? 
[y/n]") - if check_msg.lower() != "y": - raise RuntimeError("The user choose not to continue.") - - @classmethod - def check_path_writeable(cls, path): - """ - Function Description: - check whether the path is writable - Parameter: - path: the path to check - Exception Description: - when invalid data throw exception - """ - cls.check_path_owner_consistent(path) - if os.path.islink(path): - msg = f"Invalid path which is a soft link." - raise RuntimeError(msg) - base_name = os.path.basename(path) - if not os.access(path, os.W_OK): - msg = f"The path permission check failed: {base_name}" - raise RuntimeError(msg) - - @classmethod - def check_path_readable(cls, path): - """ - Function Description: - check whether the path is writable - Parameter: - path: the path to check - Exception Description: - when invalid data throw exception - """ - cls.check_path_owner_consistent(path) - if os.path.islink(path): - msg = f"Invalid path which is a soft link." - raise RuntimeError(msg) - base_name = os.path.basename(path) - if not os.access(path, os.R_OK): - msg = f"The path permission check failed: {base_name}" - raise RuntimeError(msg) - - @classmethod - def remove_path_safety(cls, path: str): - if not os.path.exists(path): - return - base_name = os.path.basename(path) - msg = f"Failed to remove path: {base_name}" - cls.check_path_writeable(path) - if os.path.islink(path): - raise RuntimeError(msg) - try: - shutil.rmtree(path) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def make_dir_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to make directory: {base_name}" - if os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - return - try: - os.makedirs(path, mode=cls.DATA_DIR_AUTHORITY) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def create_file_safety(cls, path: str): - base_name = os.path.basename(path) - msg = f"Failed to create file: {base_name}" - if 
os.path.islink(path): - raise RuntimeError(msg) - if os.path.exists(path): - return - try: - os.close(os.open(path, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY)) - except Exception as err: - raise RuntimeError(msg) from err - - @classmethod - def get_realpath(cls, path: str) -> str: - if os.path.islink(path): - msg = f"Invalid input path which is a soft link." - raise RuntimeError(msg) - return os.path.abspath(path) diff --git a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/memory_record_bean.py b/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/memory_record_bean.py deleted file mode 100644 index 50d14089fe95f2dbc8e97788d80e0644306f671e..0000000000000000000000000000000000000000 --- a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/memory_record_bean.py +++ /dev/null @@ -1,15 +0,0 @@ -from compare_backend.utils.common_func import convert_to_float - - -class MemoryRecordBean: - def __init__(self, data: dict): - self._data = data - self._total_reserved_mb = 0.0 - self.init() - - @property - def total_reserved_mb(self) -> float: - return convert_to_float(self._total_reserved_mb) - - def init(self): - self._total_reserved_mb = self._data.get("Total Reserved(MB)", 0) diff --git a/profiler/compare_tools/compare_backend/interface/overall_interface.py b/profiler/compare_tools/compare_backend/interface/overall_interface.py deleted file mode 100644 index fb549007f634610d1c954ef132c416a5c2606541..0000000000000000000000000000000000000000 --- a/profiler/compare_tools/compare_backend/interface/overall_interface.py +++ /dev/null @@ -1,13 +0,0 @@ -from compare_backend.comparator.overall_performance_comparator import OverallPerformanceComparator -from compare_backend.compare_bean.profiling_info import ProfilingInfo -from compare_backend.utils.constant import Constant - - -class OverallInterface: - def __init__(self, overall_data: dict): - self._overall_data = overall_data - - def run(self): - data = 
{Constant.BASE_DATA: self._overall_data.get(Constant.BASE_DATA).overall_metrics, - Constant.COMPARISON_DATA: self._overall_data.get(Constant.COMPARISON_DATA).overall_metrics} - return OverallPerformanceComparator(data, ProfilingInfo).generate_data() diff --git a/profiler/compare_tools/compare_backend/utils/constant.py b/profiler/compare_tools/compare_backend/utils/constant.py deleted file mode 100644 index 7d71a93f07fbf5611df729a2504e4ab6e0fee7a6..0000000000000000000000000000000000000000 --- a/profiler/compare_tools/compare_backend/utils/constant.py +++ /dev/null @@ -1,125 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" - - -class Constant(object): - GPU = "GPU" - NPU = "NPU" - NA = 'N/A' - LIMIT_KERNEL = 3 - MAX_PATH_LENGTH = 4096 - MAX_FLOW_CAT_LEN = 20 - MAX_OP_NAME_LEN = 200 - MAX_FILE_SIZE = 1024 * 1024 * 1024 * 5 - MAX_JSON_SIZE = 1024 * 1024 * 1024 * 10 - BYTE_TO_KB = 1024 - YELLOW_COLOR = "FFFF00" - GREEN_COLOR = "00FF00" - RED_COLOR = "FF0000" - BLUE_COLOR = "00BFFF" - LIGHT_BLUE_COLOR = "87CEFA" - US_TO_MS = 1000 - KB_TO_MB = 1024 - INVALID_VALUE = -1 - MILLISECONDS_TO_SECONDS = 10 ** 3 - MICROSECONDS_TO_SECONDS = 10 ** 6 - - # epsilon - EPS = 1e-15 - - # autority - FILE_AUTHORITY = 0o640 - DIR_AUTHORITY = 0o750 - - PROFILING_TYPE = "profiling type" - - # path - PROFILING_PATH = "profiling_path" - TRACE_PATH = "trace_path" - MEMORY_DATA_PATH = "memory_data_path" - ASCEND_OUTPUT_PATH = "ascend_output" - INFO_JSON_PATH = "info_path" - - # excel headers - BASE_PROFILING = 'Base Profiling: ' - COMPARISON_PROFILING = 'Comparison Profiling: ' - WAIT_TIME = "wait" - TRANSMIT_TIME = "transmit" - - # compare type - OPERATOR_COMPARE = "OperatorCompare" - MEMORY_COMPARE = "MemoryCompare" - API_COMPARE = "ApiCompare" - KERNEL_COMPARE = "KernelCompare" - - # sheet name - OPERATOR_SHEET = "OperatorCompare" - MEMORY_SHEET = "MemoryCompare" - OPERATOR_TOP_SHEET = "OperatorCompareStatistic" - MEMORY_TOP_SHEET = "MemoryCompareStatistic" - COMMUNICATION_SHEET = "CommunicationCompare" - API_SHEET = "ApiCompare" - KERNEL_SHEET = "KernelCompare" - - # table name - OPERATOR_TABLE = "OperatorCompare" - MEMORY_TABLE = "MemoryCompare" - OPERATOR_TOP_TABLE = "OperatorCompareStatistic" - MEMORY_TOP_TABLE = "MemoryCompareStatistic" - COMMUNICATION_TABLE = "CommunicationCompare" - PERFORMANCE_TABLE = "Model Profiling Time Distribution" - MODULE_TABLE = "ModuleCompare" - MODULE_TOP_TABLE = "ModuleCompareStatistic" - OVERALL_METRICS_TABLE = "OverallMetrics" - API_TABLE = "ApiCompare" - KERNEL_TABLE = "KernelCompare" - - # memory - SIZE = "Size(KB)" - TS = "ts" - ALLOCATION_TIME = "Allocation 
Time(us)" - RELEASE_TIME = "Release Time(us)" - NAME = "Name" - - OP_KEY = "op_name" - DEVICE_DUR = "dur" - - BASE_DATA = "base_data" - COMPARISON_DATA = "comparison_data" - OVERALL_METRICS = "overall_metrics" - TORCH_OP = "torch_op" - KERNEL_DICT = "kernel_dict" - MEMORY_LIST = "memory_list" - COMMUNICATION_DICT = "comm_dict" - - # compare type - OVERALL_COMPARE = "overall" - - BWD_LIST = ["bwd", "backward", "back", "grad"] - - CPU_OP_FA_MASK = ( - "flash_attention", "fusion_attention", "flashattn", "xformers_flash", "efficient_attention", "flash2attn" - ) - CPU_OP_CONV = "aten::conv" - CPU_OP_MATMUL_MASK = ("aten::addmm", "aten::bmm", "aten::mm", "aten::matmul") - KERNEL_CUBE_MASK = ("gemm", "conv", "cutlass", "wgrad", "gemvx") - KERNEL_TRANS_MASK = ("cast", "transdata", "transpose") - - IS_BWD = "is_bwd" - OPS = "ops" - - VOID_STEP = -1 diff --git a/profiler/compare_tools/compare_backend/utils/file_reader.py b/profiler/compare_tools/compare_backend/utils/file_reader.py deleted file mode 100644 index a9513add498abd33350b60b25e511ae545403a48..0000000000000000000000000000000000000000 --- a/profiler/compare_tools/compare_backend/utils/file_reader.py +++ /dev/null @@ -1,84 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" - -import csv -import json -import os -import logging - -from common_func.path_manager import PathManager -from compare_backend.utils.constant import Constant - - -logger = logging.getLogger() - - -class FileReader: - @classmethod - def read_trace_file(cls, file_path: str) -> any: - PathManager.check_path_readable(file_path) - if not os.path.isfile(file_path): - raise FileNotFoundError("File not exists.") - file_size = os.path.getsize(file_path) - if file_size <= 0: - return [] - if file_size > Constant.MAX_FILE_SIZE: - check_msg = input( - f"The file({file_path}) size exceeds the preset max value. Continue reading the file? [y/n]") - if check_msg.lower() != "y": - logger.warning("The user choose not to read the file: %s", file_path) - return [] - try: - with open(file_path, "rt") as file: - json_data = json.loads(file.read()) - except Exception as e: - msg = f"Can't read file: {file_path}" - raise RuntimeError(msg) from e - return json_data - - @classmethod - def read_csv_file(cls, file_path: str, bean_class: any = None) -> any: - PathManager.check_path_readable(file_path) - if not os.path.isfile(file_path): - raise FileNotFoundError("File not exists.") - file_size = os.path.getsize(file_path) - if file_size <= 0: - return [] - if file_size > Constant.MAX_FILE_SIZE: - check_msg = input( - f"The file({file_path}) size exceeds the preset max value. Continue reading the file? 
[y/n]") - if check_msg.lower() != "y": - print(f"[WARNING] The user choose not to read the file: {file_path}") - return [] - result_data = [] - try: - with open(file_path, newline="") as csv_file: - reader = csv.DictReader(csv_file) - for row in reader: - row_data = bean_class(row) if bean_class else row - result_data.append(row_data) - except Exception as e: - msg = f"Failed to read the file: {file_path}" - raise RuntimeError(msg) from e - return result_data - - @classmethod - def check_json_type(cls, file_path: str) -> str: - json_data = cls.read_trace_file(file_path) - if isinstance(json_data, dict): - return Constant.GPU - return Constant.NPU diff --git a/profiler/compare_tools/compare_backend/view/base_view.py b/profiler/compare_tools/compare_backend/view/base_view.py deleted file mode 100644 index d18980b7de2098b5a1015d14fbd1b5be91a23bfc..0000000000000000000000000000000000000000 --- a/profiler/compare_tools/compare_backend/view/base_view.py +++ /dev/null @@ -1,10 +0,0 @@ -from abc import ABC, abstractmethod - - -class BaseView(ABC): - def __init__(self, data_dict: dict): - self._data_dict = data_dict - - @abstractmethod - def generate_view(self): - raise NotImplementedError("Function generate_view need to be implemented.") diff --git a/profiler/compare_tools/img/OverallMetrics.png b/profiler/compare_tools/img/OverallMetrics.png deleted file mode 100644 index b130d3607344c983a9304440e38a45fe96a4bb56..0000000000000000000000000000000000000000 Binary files a/profiler/compare_tools/img/OverallMetrics.png and /dev/null differ diff --git a/profiler/advisor/analyzer/__init__.py b/profiler/example/__init__.py similarity index 100% rename from profiler/advisor/analyzer/__init__.py rename to profiler/example/__init__.py diff --git a/profiler/example/mstx_torch_plugin/README.md b/profiler/example/mstx_torch_plugin/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8f140ce17c176b64a6651350cb62621cce7c16b7 --- /dev/null +++ 
b/profiler/example/mstx_torch_plugin/README.md
@@ -0,0 +1,66 @@
+# mstx_torch_plugin
+
+The [collect and parse msprof_tx data](https://www.hiascend.com/document/detail/zh/canncommercial/80RC3/devaids/devtools/profiling/atlasprofiling_16_0033.html#ZH-CN_TOPIC_0000002081898541__section5940122172516) feature of the Ascend PyTorch Profiler already has built-in instrumentation for communication operators. To help users obtain timing data for more key stages without modifying their training code, mstx_torch_plugin adds built-in instrumentation to the Ascend PyTorch Profiler for four key stage functions: **dataloader**, **forward**, **step**, and **save_checkpoint**.
+
+**Constraints**
+
+PyTorch graph mode is not supported yet.
+
+**Usage**
+
+1. Download the mstx_torch_plugin wheel package.
+
+   Wheel package link: [mstx_torch_plugin](https://ptdbg.obs.myhuaweicloud.com/profiler/example/1.0/mstx_torch_plugin-1.0-py3-none-any.whl)
+
+2. Install mstx_torch_plugin:
+
+   ```bash
+   pip install mstx_torch_plugin-1.0-py3-none-any.whl
+   ```
+
+3. Import the package in the AI task script.
+
+   Make sure the import comes after `import torch` and `import torch_npu`:
+
+   ```python
+   import torch
+   import torch_npu
+
+   import mstx_torch_plugin
+   ```
+
+4. Enable torch_npu.profiler to collect the instrumentation data.
+
+   Turn on the msprof_tx switch; profiler_level can be configured to the level required for the collection:
+
+   ```python
+   import torch
+   import torch_npu
+
+   import mstx_torch_plugin
+   ...
+   experimental_config = torch_npu.profiler._ExperimentalConfig(
+       export_type=torch_npu.profiler.ExportType.Text,
+       profiler_level=torch_npu.profiler.ProfilerLevel.Level_none
+   )
+
+   with torch_npu.profiler.profile(
+           activities=[
+               torch_npu.profiler.ProfilerActivity.CPU,
+               torch_npu.profiler.ProfilerActivity.NPU
+           ],
+           schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
+           on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
+           experimental_config=experimental_config) as prof:
+       for step in range(steps):
+           train_one_step(step, steps, train_loader, model, optimizer, criterion)
+           prof.step()
+   ```
+
+**Collection Results**
+
+Open the collected performance data with the MindStudio Insight tool; the visualization looks as follows:
+
+![result](img/result.png)
+
+The figure above takes the dataloader function as an example; like mstx data, it is displayed in the upper-layer application data.
diff --git a/profiler/example/mstx_torch_plugin/__init__.py b/profiler/example/mstx_torch_plugin/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..bc98ed43a7a7cba299f8f47e7ef0ba43096c5293
--- /dev/null
+++ b/profiler/example/mstx_torch_plugin/__init__.py
@@ -0,0 +1,29 @@
+# Copyright (c) 2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import sys
+import logging
+from .mstx_torch_plugin import apply_mstx_patch

+logger = logging.getLogger()
+requirements_module_list = ['torch', 'torch_npu']
+
+enable_mstx_torch = True
+for module_name in requirements_module_list:
+    if module_name not in sys.modules:
+        enable_mstx_torch = False
+        logger.error(f"mstx_torch_plugin is not enabled. Please make sure {module_name} is imported before mstx_torch_plugin.")
+
+if enable_mstx_torch:
+    apply_mstx_patch()
diff --git a/profiler/example/mstx_torch_plugin/img/result.png b/profiler/example/mstx_torch_plugin/img/result.png
new file mode 100644
index 0000000000000000000000000000000000000000..85ea4353f08207b1a9d48491cdf737407a1af500
Binary files /dev/null and b/profiler/example/mstx_torch_plugin/img/result.png differ
diff --git a/profiler/example/mstx_torch_plugin/mstx_torch_plugin.py b/profiler/example/mstx_torch_plugin/mstx_torch_plugin.py
new file mode 100644
index 0000000000000000000000000000000000000000..ed22a3d0b7eed2ab0b457bb0a185061dacabc186
--- /dev/null
+++ b/profiler/example/mstx_torch_plugin/mstx_torch_plugin.py
@@ -0,0 +1,213 @@
+# Copyright (c) 2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import functools
+import re
+import site
+import torch
+from torch.nn import Module
+from torch.utils.data import DataLoader
+from torch.optim.optimizer import register_optimizer_step_post_hook
+
+original_forward_call = Module.__call__
+original_iter = DataLoader.__iter__
+original_save = torch.serialization.save
+original_singlenext = torch.utils.data.dataloader._SingleProcessDataLoaderIter.__next__
+original_multinext = torch.utils.data.dataloader._MultiProcessingDataLoaderIter.__next__
+origin_patch_step_function = torch.optim.Optimizer._patch_step_function
+
+
+def _check_directory_path_readable(path):
+    if not os.path.exists(path):
+        msg = f"The path does not exist: {path}"
+        raise RuntimeError(msg)
+    if os.path.islink(path):
+        msg = f"Invalid path which is a soft link: {path}"
+        raise RuntimeError(msg)
+    if not os.access(path, os.R_OK):
+        msg = f"The path permission check failed: {path}"
+        raise RuntimeError(msg)
+
+
+class MstxState:
+    def __init__(self):
+        self.module_dict = {}
+        self.is_outer_call = True
+        self.fp_range_id = None
+        self.dataloader_range_id = None
+        self.save_range_id = None
+        self.step_range_id = None
+        self.step_id = 0
+        self.last_optimizer_id = None
+
+    def add_module_dict(self, module):
+        self.module_dict[module] = [
+            sub_module
+            for _, sub_module in module.named_modules()
+            if sub_module != module
+        ]
+
+    def is_child_module(self, module):
+        return any(module in value for value in self.module_dict.values())
+
+mstx_state = MstxState()
+
+
+def _is_loss_module(module):
+    return isinstance(module, torch.nn.modules.loss._Loss)
+
+
+def _custom_forward_call(self, *args, **kwargs):
+    global mstx_state
+
+    if not torch.npu.is_initialized():
+        return original_forward_call(self, *args, **kwargs)
+
+    # the outermost module adds mstx range_start
+    if mstx_state.is_outer_call:
+        # skip the loss module and the recomputation process
+        if not mstx_state.is_child_module(self) and not _is_loss_module(self):
+            stream = torch.npu.current_stream()
+            mstx_state.fp_range_id = torch.npu.mstx.range_start("forward", stream)
+            mstx_state.add_module_dict(self)
+        mstx_state.is_outer_call = False
+        self.tx_visited = True
+
+    out_call = original_forward_call(self, *args, **kwargs)
+
+    # the outermost module adds mstx range_end
+    if hasattr(self, "tx_visited") and self.tx_visited:
+        mstx_state.is_outer_call = True
+        self.tx_visited = False
+        if not _is_loss_module(self) and mstx_state.fp_range_id is not None:
+            torch.npu.mstx.range_end(mstx_state.fp_range_id)
+            mstx_state.fp_range_id = None
+
+    return out_call
+
+
+def _custom_dataloader_iter(self):
+    global mstx_state
+
+    out_iter = original_iter(self)
+
+    def dataloader_wrapper(func):
+        def wrapper(*args, **kwargs):
+            mstx_state.dataloader_range_id = torch.npu.mstx.range_start("dataloader")
+            out = func(*args, **kwargs)
+            if mstx_state.dataloader_range_id is not None:
+                torch.npu.mstx.range_end(mstx_state.dataloader_range_id)
+                mstx_state.dataloader_range_id = None
+            return out
+
+        return wrapper
+
+    if self.num_workers == 0:
+        torch.utils.data.dataloader._SingleProcessDataLoaderIter.__next__ = dataloader_wrapper(original_singlenext)
+    else:
+        torch.utils.data.dataloader._MultiProcessingDataLoaderIter.__next__ = dataloader_wrapper(original_multinext)
+
+    return out_iter
+
+
+def _custom_save(func):
+    global mstx_state
+
+    @functools.wraps(func)
+    def save_wrapper(*args, **kwargs):
+        stream = torch.npu.current_stream()
+        mstx_state.save_range_id = torch.npu.mstx.range_start("save_checkpoint", stream)
+        out = func(*args, **kwargs)
+        if mstx_state.save_range_id is not None:
+            torch.npu.mstx.range_end(mstx_state.save_range_id)
+            mstx_state.save_range_id = None
+        return out
+
+    return save_wrapper
+
+
+def _step_hook(self, *args, **kwargs):
+    global mstx_state
+
+    if id(self) != mstx_state.last_optimizer_id:
+        return
+    stream = torch.npu.current_stream()
+    mstx_state.step_id += 1
+    if mstx_state.step_range_id is not None:
+        torch.npu.mstx.range_end(mstx_state.step_range_id)
+ mstx_state.step_range_id = torch.npu.mstx.range_start(f"step {mstx_state.step_id}", stream) + + +def _custom_step(optimizer: torch.optim.Optimizer): + global mstx_state + + origin_patch_step_function(optimizer) + mstx_state.last_optimizer_id = id(optimizer) + + +def _get_torch_npu_version_str(): + torch_npu_version_str = "" + site_packages = site.getsitepackages() + if site_packages and site_packages[0]: + path = site_packages[0] + version_path = os.path.join(path, "torch_npu", "version.py") + _check_directory_path_readable(version_path) + # example version info: "__version__ = '2.1.0.post11.xxxxxx'" + try: + with open(version_path, "r") as f: + for line in f: + if line.find("__version__") != -1: + torch_npu_version_str = line.strip().split("=")[-1][2:-1] + break + except Exception as e: + raise RuntimeError(f"Failed to open {version_path} to get torch npu version.") from e + return torch_npu_version_str + + +def _get_torch_npu_info(version_str: str): + # version info example: "2.1.0.post11.xxxxxx" + match = re.search(r"^(\d+\.\d+\.\d+)\.post(\d+)", version_str) + if match and len(match.groups()) == 2: + return match.group(1), match.group(2) + else: + return '', '' + + +def _check_pta_support_patch(): + pta_support_patch_version = { + "2.1.0": 10, + "2.3.1": 4, + "2.4.0": 2, + } + torch_npu_version_str = _get_torch_npu_version_str() + if not torch_npu_version_str: + raise RuntimeError("Failed to get torch_npu version info.") + torch_branch, torch_npu_version = _get_torch_npu_info(torch_npu_version_str) + if not torch_branch or not torch_npu_version or not torch_npu_version.isdigit(): + raise RuntimeError("Failed to get valid torch branch or torch_npu version.") + for branch, post_version in pta_support_patch_version.items(): + if torch_branch == branch and int(torch_npu_version) <= post_version: + return False + return True + + +def apply_mstx_patch(): + pta_support_patch = _check_pta_support_patch() + Module.__call__ = _custom_forward_call + if not 
pta_support_patch: + DataLoader.__iter__ = _custom_dataloader_iter + torch.serialization.save = _custom_save(original_save) + torch.optim.Optimizer._patch_step_function = _custom_step + register_optimizer_step_post_hook(_step_hook) diff --git a/profiler/example/setup.py b/profiler/example/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..6b14f5d65ff214f6467bbecedc766d7ae1289179 --- /dev/null +++ b/profiler/example/setup.py @@ -0,0 +1,28 @@ +#!/usr/bin/python +# -*- coding: utf-8 -*- +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +from setuptools import setup, find_packages + +setup( + name='mstx_torch_plugin', + version="1.0", + description='MindStudio Profiler Mstx Plugin For Pytorch', + long_description='mstx_torch_plugin provides lightweight data for dataloader, ' + 'forward, step and save_checkpoint.', + packages=find_packages(), + license='Apache License 2.0' +) \ No newline at end of file diff --git a/profiler/LICENSE b/profiler/msprof_analyze/LICENSE similarity index 100% rename from profiler/LICENSE rename to profiler/msprof_analyze/LICENSE diff --git a/profiler/msprof_analyze/MANIFEST.in b/profiler/msprof_analyze/MANIFEST.in new file mode 100644 index 0000000000000000000000000000000000000000..b4d096405c98ea1a906b8882418362d428cbf1b6 --- /dev/null +++ b/profiler/msprof_analyze/MANIFEST.in @@ -0,0 +1,7 @@ +recursive-include msprof_analyze/advisor/ * +recursive-include msprof_analyze/cli/ * +recursive-include msprof_analyze/prof_common/ * +recursive-include msprof_analyze/compare_tools/ * +recursive-include msprof_analyze/cluster_analyse/ * +global-exclude */__pycache__/* +global-exclude *.pyc diff --git a/profiler/OWNERS b/profiler/msprof_analyze/OWNERS similarity index 80% rename from profiler/OWNERS rename to profiler/msprof_analyze/OWNERS index 0c09fb8ce494186cdbc03cdc136edc1452ae1de0..864e7ecc649aab5a9eb5d6db1b33e9dd8a8882dc 100644 --- a/profiler/OWNERS +++ b/profiler/msprof_analyze/OWNERS @@ -6,6 +6,5 @@ approvers: - chenhao_1209 - feng123www reviewers: -- sunboquan -- stby - Seanesmhxocism +- wjchuee diff --git a/profiler/README.md b/profiler/msprof_analyze/README.md similarity index 47% rename from profiler/README.md rename to profiler/msprof_analyze/README.md index 4db89899f1a8d1e22c7adb5556deb4f09067b725..d39aea89a521eaae504d177fac8ac2c9c5982afb 100644 --- a/profiler/README.md +++ b/profiler/msprof_analyze/README.md @@ -1,12 +1,22 @@ # 性能工具 -MindStudio Training Tools工具针对训练&大模型场景,提供端到端性能调优工具:用户采集到性能数据后,由MindStudio Training Tools的性能工具提供统计、分析以及相关的调优建议。 +MindStudio 
Training Tools工具针对训练&大模型场景,提供端到端性能调优工具msprof-analyze:用户采集到性能数据后,由MindStudio Training Tools的性能工具msprof-analyze提供统计、分析以及相关的调优建议。 ## NPU性能数据采集 目前MindStudio Training Tools工具主要支持对Ascend PyTorch Profiler接口采集的性能数据进行分析,请参考官方文档:[Ascend PyTorch Profiler数据采集与分析](https://www.hiascend.com/document/detail/zh/canncommercial/80RC1/devaids/auxiliarydevtool/atlasprofiling_16_0006.html)。 -Ascend PyTorch Profiler接口支持AscendPyTorch 1.11.0或更高版本,支持的PyThon和CANN软件版本配套关系请参见“[安装PyTorch框架](https://www.hiascend.com/document/detail/zh/Pytorch/60RC1/configandinstg/instg/insg_0006.html)”。 +### 环境和依赖 + +- 硬件环境请参见《[昇腾产品形态说明](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F80RC22%2Fquickstart%2Fquickstart%2Fquickstart_18_0002.html)》。 +- 软件环境请参见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F80RC22%2Fsoftwareinst%2Finstg%2Finstg_0000.html%3FMode%3DPmIns%26OS%3DUbuntu%26Software%3DcannToolKit)》安装昇腾设备开发或运行环境,即toolkit软件包。 + +以上环境依赖请根据实际环境选择适配的版本。 + +### 版本配套说明 + +- Ascend PyTorch Profiler接口支持Ascend PyTorch 1.11.0或更高版本,支持的PyTorch和CANN以及PyTorch和Python软件版本配套关系请参见《[Ascend Extension for PyTorch插件](https://gitee.com/ascend/pytorch)》。 +- Ascend PyTorch Profiler接口支持的固件驱动版本与配套CANN软件支持的固件驱动版本相同,开发者可通过“[昇腾社区-固件与驱动](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fhardware%2Ffirmware-drivers%2Fcommunity%3Fproduct%3D2%26model%3D28%26cann%3D8.0.RC3.alpha003%26driver%3D1.0.25.alpha)”页面根据产品型号与CANN软件版本获取配套的固件与驱动。 ### 采集方式一:通过with语句进行采集 @@ -79,7 +89,7 @@ ascend pytorch profiler数据目录结构如下: |- * _ascend_pt ``` -## 工具安装 +## 安装 性能工具的安装方式包括:**pip安装**、**下载whl包安装**和**源代码编译安装**。 @@ -99,28 +109,32 @@ pip命令会自动安装最新的包及其配套依赖。 Successfully installed msprof-analyze-{version} ``` -#### 下载whl包安装 +### 下载whl包安装 1. 
whl包获取。 请通过下表链接下载profiler工具whl包。 - | profiler版本 | 发布日期 | 下载链接 | 校验码 | - |------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------| - | 1.3.0 | 2024-10-12 | [msprof_analyze-1.3.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.3.0/msprof_analyze-1.3.0-py3-none-any.whl) | 8b09758c6b5181bb656a95857c32852f898c370e7f1041e5a08e4f10d5004d48 | - | 1.2.5 | 2024-09-25 | [msprof_analyze-1.2.5-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.5/msprof_analyze-1.2.5-py3-none-any.whl) | aea8ae8deac07b5b4980bd2240da27d0eec93b9ace9ea9eb2e3a05ae9072018b | - | 1.2.4 | 2024-09-19 | [msprof_analyze-1.2.4-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.4/msprof_analyze-1.2.4-py3-none-any.whl) | 7c392e72c3347c4034fd3fdfcccb1f7936c24d9c3eb217e2cc05bae1347e5ab7 | - | 1.2.3 | 2024-08-29 | [msprof_analyze-1.2.3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.3/msprof_analyze-1.2.3-py3-none-any.whl) | 354a55747f64ba1ec6ee6fe0f05a53e84e1b403ee0341ec40cc216dd25fda14c | - | 1.2.2 | 2024-08-23 | [msprof_analyze-1.2.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.2/msprof_analyze-1.2.2-py3-none-any.whl) | ed92a8e4eaf5ada8a2b4079072ec0cc42501b1b1f2eb00c8fdcb077fecb4ae02 | - | 1.2.1 | 2024-08-14 | [msprof_analyze-1.2.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.1/msprof_analyze-1.2.1-py3-none-any.whl) | 7acd477417bfb3ea29029dadf175d019ad3212403b7e11dc1f87e84c2412c078 | - | 1.2.0 | 2024-07-25 | [msprof_analyze-1.2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.0/msprof_analyze-1.2.0-py3-none-any.whl) | 6a4366e3beca40b4a8305080e6e441d6ecafb5c05489e5905ac0265787555f37 | - | 1.1.2 | 2024-07-12 | 
[msprof_analyze-1.1.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.2/msprof_analyze-1.1.2-py3-none-any.whl) | af62125b1f9348bf491364e03af712fc6d0282ccee3fb07458bc9bbef82dacc6 | - | 1.1.1 | 2024-06-20 | [msprof_analyze-1.1.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.1/msprof_analyze-1.1.1-py3-none-any.whl) | 76aad967a3823151421153d368d4d2f8e5cfbcb356033575e0b8ec5acea8e5e4 | - | 1.1.0 | 2024-05-28 | [msprof_analyze-1.1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.0/msprof_analyze-1.1.0-py3-none-any.whl) | b339f70e7d1e45e81f289332ca64990a744d0e7ce6fdd84a8d82e814fa400698 | - | 1.0 | 2024-05-10 | [msprof_analyze-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.0/msprof_analyze-1.0-py3-none-any.whl) | 95b2f41c8c8e8afe4887b738c8cababcb4f412e1874483b6adae4a025fcbb7d4 | - +| profiler版本 | 发布日期 | 下载链接 | 校验码 | +|------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------| ------------------------------------------------------------ | +| 2.0.1 | 2025-02-28 | [msprof_analyze-2.0.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/2.0.1/msprof_analyze-2.0.1-py3-none-any.whl) | 82dfe2c779dbab9015f61d36ea0c32d832b6d182454b3f7db68e6c0ed49c0423 | +| 2.0.0 | 2025-02-08 | [msprof_analyze-2.0.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/2.0.0/msprof_analyze-2.0.0-py3-none-any.whl) | 8e44e5f3e7681c377bb2657a600ad9841d3bed11061ddd7844c30e8a97242101 | +| 1.3.4 | 2025-01-20 | [msprof_analyze-1.3.4-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.3.4/msprof_analyze-1.3.4-py3-none-any.whl) | 8de92188d1a97105fb14cadcb0875ccd5f66629ee3bb25f37178da1906f4cce2 | +| 1.3.3 | 2024-12-26 | 
[msprof_analyze-1.3.3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.3.3/msprof_analyze-1.3.3-py3-none-any.whl) | 27676f2eee636bd0c65243f81e292c7f9d30d7f985c772ac9cbaf10b54d3584e | +| 1.3.2 | 2024-12-20 | [msprof_analyze-1.3.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.3.2/msprof_analyze-1.3.2-py3-none-any.whl) | ceb227e751ec3a204135be13801f1deee6a66c347f1bb3cdaef596872874df06 | +| 1.3.1 | 2024-12-04 | [msprof_analyze-1.3.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.3.1/msprof_analyze-1.3.1-py3-none-any.whl) | eae5548804314110a649caae537f2c63320fc70ec41ce1167f67c1d674d8798e | +| 1.3.0 | 2024-10-12 | [msprof_analyze-1.3.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.3.0/msprof_analyze-1.3.0-py3-none-any.whl) | 8b09758c6b5181bb656a95857c32852f898c370e7f1041e5a08e4f10d5004d48 | +| 1.2.5 | 2024-09-25 | [msprof_analyze-1.2.5-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.5/msprof_analyze-1.2.5-py3-none-any.whl) | aea8ae8deac07b5b4980bd2240da27d0eec93b9ace9ea9eb2e3a05ae9072018b | +| 1.2.4 | 2024-09-19 | [msprof_analyze-1.2.4-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.4/msprof_analyze-1.2.4-py3-none-any.whl) | 7c392e72c3347c4034fd3fdfcccb1f7936c24d9c3eb217e2cc05bae1347e5ab7 | +| 1.2.3 | 2024-08-29 | [msprof_analyze-1.2.3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.3/msprof_analyze-1.2.3-py3-none-any.whl) | 354a55747f64ba1ec6ee6fe0f05a53e84e1b403ee0341ec40cc216dd25fda14c | +| 1.2.2 | 2024-08-23 | [msprof_analyze-1.2.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.2/msprof_analyze-1.2.2-py3-none-any.whl) | ed92a8e4eaf5ada8a2b4079072ec0cc42501b1b1f2eb00c8fdcb077fecb4ae02 | +| 1.2.1 | 2024-08-14 | [msprof_analyze-1.2.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.1/msprof_analyze-1.2.1-py3-none-any.whl) | 
7acd477417bfb3ea29029dadf175d019ad3212403b7e11dc1f87e84c2412c078 | +| 1.2.0 | 2024-07-25 | [msprof_analyze-1.2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.2.0/msprof_analyze-1.2.0-py3-none-any.whl) | 6a4366e3beca40b4a8305080e6e441d6ecafb5c05489e5905ac0265787555f37 | +| 1.1.2 | 2024-07-12 | [msprof_analyze-1.1.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.2/msprof_analyze-1.1.2-py3-none-any.whl) | af62125b1f9348bf491364e03af712fc6d0282ccee3fb07458bc9bbef82dacc6 | +| 1.1.1 | 2024-06-20 | [msprof_analyze-1.1.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.1/msprof_analyze-1.1.1-py3-none-any.whl) | 76aad967a3823151421153d368d4d2f8e5cfbcb356033575e0b8ec5acea8e5e4 | +| 1.1.0 | 2024-05-28 | [msprof_analyze-1.1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.0/msprof_analyze-1.1.0-py3-none-any.whl) | b339f70e7d1e45e81f289332ca64990a744d0e7ce6fdd84a8d82e814fa400698 | +| 1.0 | 2024-05-10 | [msprof_analyze-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.0/msprof_analyze-1.0-py3-none-any.whl) | 95b2f41c8c8e8afe4887b738c8cababcb4f412e1874483b6adae4a025fcbb7d4 | - 2. whl包校验。 1. 根据以上下载链接下载whl包到Linux安装环境。 @@ -154,7 +168,7 @@ Successfully installed msprof-analyze-{version} Successfully installed msprof_analyze-{version} ``` -#### 源代码编译安装 +### 源代码编译安装 1. 安装依赖。 @@ -173,11 +187,11 @@ Successfully installed msprof-analyze-{version} 3. 编译whl包。 ```bash - cd mstt/profiler - python3 setup.py bdist_wheel + cd mstt/profiler/msprof_analyze + pip3 install -r requirements.txt && python3 setup.py bdist_wheel ``` - 以上命令执行完成后在mstt/profiler/dist目录下生成性能工具whl安装包`msprof_analyze-{version}-py3-none-any.whl`。 + 以上命令执行完成后在mstt/profiler/msprof_analyze/dist目录下生成性能工具whl安装包`msprof_analyze-{version}-py3-none-any.whl`。 4. 
安装。 diff --git a/profiler/advisor/analyzer/cluster/__init__.py b/profiler/msprof_analyze/__init__.py similarity index 100% rename from profiler/advisor/analyzer/cluster/__init__.py rename to profiler/msprof_analyze/__init__.py diff --git a/profiler/advisor/README.md b/profiler/msprof_analyze/advisor/README.md similarity index 73% rename from profiler/advisor/README.md rename to profiler/msprof_analyze/advisor/README.md index c0148afc13dfd05851f8bd5f57d04eed2f89a7c5..befdf89fbe9542c69b5ac0e94d163e11f34c4fad 100644 --- a/profiler/advisor/README.md +++ b/profiler/msprof_analyze/advisor/README.md @@ -1,15 +1,14 @@ # advisor -msprof-analyze的advisor功能是将Ascend PyTorch Profiler或者msprof采集的性能数据进行分析,并输出性能调优建议。 +msprof-analyze的advisor功能是将Ascend PyTorch Profiler、msprof或者MindSpore Profiler采集的性能数据进行分析,并输出性能调优建议。 -性能数据采集方法请参见《[性能分析工具](https://www.hiascend.com/document/detail/zh/mindstudio/70RC1/mscommandtoolug/mscommandug/atlasprofiling_16_0001.html)》。 +Ascend PyTorch Profiler、msprof采集方法请参见《[性能调优工具](https://www.hiascend.com/document/detail/zh/mindstudio/70RC3/T&ITools/Profiling/atlasprofiling_16_0001.html)》,MindSpore Profiler采集方法请参见《[性能调试](https://www.mindspore.cn/mindinsight/docs/zh-CN/r2.3/performance_profiling_ascend.html)》。 ## 工具使用(命令行方式方式) ### 约束 -- 不支持对db格式文件分析。 -- 不支持分析MindSpore场景采集的性能数据。 +不支持对db格式文件分析。 ### 操作步骤 @@ -37,7 +36,7 @@ msprof-analyze的advisor功能是将Ascend PyTorch Profiler或者msprof采集的 以上命令更多参数介绍请参见“**命令详解**”。 - 单卡场景需要指定到性能数据文件`*_ascend_pt`目录;多卡或集群场景需要指定到`*_ascend_pt`目录的父目录层级。 + 单卡场景需要指定到性能数据文件`*_ascend_pt`或`*_ascend_ms`目录;多卡或集群场景需要指定到`*_ascend_pt`或`*_ascend_ms`目录的父目录层级。 3. 
查看结果。 @@ -83,30 +82,33 @@ msprof-analyze advisor命令行包含如下三个参数: 下表中字段为advisor的完整功能点,由all、computation和schedule控制启动。 -| dimension | mode | 参数释义 | -| ---------- |---------------------------------------| ------------------------------------ | -| overall | overall summary | 计算、通信、空闲等维度对性能数据进行拆解 | -| | environment_variable_analysis | 环境变量设置推荐 | -| cluster | slow rank | 慢卡识别 | -| | slow link | 慢链路识别 | -| computing | AICPU operator | AI CPU调优 | -| | Dynamic shape operator | 识别动态Shape算子 | -| | block dim | block dim算子调优 | -| | operator no bound | 算子瓶颈分析 | -| | fusion issue | 融合算子图调优 | -| | AI Core Frequency | AI Core算子降频分析 | -|communication| Packet analysis |通信小包检测 | -|| bandwidth contention analysis |通信计算带宽抢占检测 | -|| Communication retransmission analysis |通信重传检测 | -| scheduling | Affinity apis | 亲和API替换调优 | -| | Operator dispatch | 识别算子下发问题(路径3/路径5) | -| | SyncBatchNorm | BatchNorm同步检测 | -| | SynchronizeStream | 流同步检测 | -| | Slow dataloader | 异常dataloader检测 | -| | gc | 识别异常垃圾回收事件。需要Ascend PyTorch Profiler采集时开启experimental_config下的gc_delect_threshold功能 | -| memory | Memory | 识别异常的内存申请释放操作 | -| comparison | Kernel compare of Rank\* Step\* and Rank\* Step\* | 识别标杆和待比对性能数据的Kernel数据(无标杆场景是集群内部快慢卡的性能数据对比,有标杆场景是两个集群之间存在明显耗时差异的相同卡之间的性能数据对比) | -| | API compare of Rank\* Step\* and Rank\* Step\* | 识别标杆和待比对性能数据的API数据(无标杆场景是集群内部快慢卡的性能数据对比,有标杆场景是两个集群之间存在明显耗时差异的相同卡之间的性能数据对比) | +| dimension | mode | 参数释义 | 支持场景 | +| ---------- |---------------------------------------| ------------------------------------ | ------------------------------------ | +| overall | overall summary | 计算、通信、空闲等维度对性能数据进行拆解 | PyTorch、MindSpore | +| | Environment Variable Issues | 环境变量设置推荐 | PyTorch | +| | slow rank | 慢卡识别 | PyTorch、MindSpore | +| | slow link | 慢链路识别 | PyTorch、MindSpore | +| computation | AICPU Issues | AI CPU调优 | PyTorch、MindSpore | +| | Operator Dynamic Shape Issues | 识别动态Shape算子 | PyTorch | +| | AI Core Performance analysis | MatMul、FlashAttentionScore、AI_VECTOR_CORE和MIX_AIV类算子的性能分析 | PyTorch | +| | 
Block Dim | Block Dim算子调优 | PyTorch、MindSpore | +| | Operator No Bound Issues | 算子瓶颈分析 | PyTorch、MindSpore | +| | Fusion Issues | 融合算子图调优 | PyTorch、MindSpore | +| | AI Core Frequency Issues | AI Core算子降频分析 | PyTorch、MindSpore | +|communication| Packet Analysis |通信小包检测 |PyTorch、MindSpore | +|| Bandwidth Contention Analysis |通信计算带宽抢占检测 |PyTorch、MindSpore | +|| Communication Retransmission Analysis |通信重传检测 |PyTorch、MindSpore | +|| Byte Alignment Analysis |通信算子字节对齐检测,传输类型为SDMA的通信算子,数据量需要被512字节整除,保证传输带宽不会下降 |PyTorch、MindSpore | +| schedule | Affinity API Issues | 亲和API替换调优 | PyTorch、MindSpore | +| | Operator Dispatch Issues | 识别算子下发问题(路径3/路径5) | PyTorch | +| | SyncBatchNorm Issues | BatchNorm同步检测 | PyTorch、MindSpore | +| | Synchronize Stream Issues | 流同步检测 | PyTorch、MindSpore | +| | GC Analysis | 识别异常垃圾回收事件。需要Ascend PyTorch Profiler采集时开启experimental_config下的gc_delect_threshold功能 | PyTorch | +| | Fusible Operator Analysis | 检测具有Host瓶颈或者MTE瓶颈的算子序列,可用于代码优化或开发可融合算子 | PyTorch、MindSpore | +| dataloader | Slow Dataloader Issues | 异常dataloader检测 | PyTorch、MindSpore | +| memory | Memory Operator Issues | 识别异常的内存申请释放操作 | PyTorch、MindSpore | +| comparison | Kernel compare of Rank\* Step\* and Rank\* Step\* | 识别标杆和待比对性能数据的Kernel数据(无标杆场景是集群内部快慢卡的性能数据对比,有标杆场景是两个集群之间存在明显耗时差异的相同卡之间的性能数据对比) | PyTorch、MindSpore | +| | Api compare of Rank\* Step\* and Rank\* Step\* | 识别标杆和待比对性能数据的API数据(无标杆场景是集群内部快慢卡的性能数据对比,有标杆场景是两个集群之间存在明显耗时差异的相同卡之间的性能数据对比) | PyTorch | 集群场景时自动进行cluster和overall的environment_variable_analysis解析,单卡时自动进行overall解析。 @@ -115,19 +117,19 @@ msprof-analyze advisor命令行包含如下三个参数: - 总体性能瓶颈 ```bash - msprof-analyze advisor all -d {profiling_path} [-bp benchmark_profiling_path] [-o output_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [--debug] [-h] + msprof-analyze advisor all -d {profiling_path} [-bp benchmark_profiling_path] [-o output_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [--force] [--language language] [--debug] [-h] ``` - 计算瓶颈 ```bash - 
msprof-analyze advisor computation -d {profiling_path} [-o output_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [--debug] [-h] + msprof-analyze advisor computation -d {profiling_path} [-o output_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [--force] [--language language] [--debug] [-h] ``` - 调度瓶颈 ```bash - msprof-analyze advisor schedule -d {profiling_path} [-o output_path] [-cv cann_version] [-tv torch_version] [--debug] [-h] + msprof-analyze advisor schedule -d {profiling_path} [-o output_path] [-cv cann_version] [-tv torch_version] [--force] [--language language] [--debug] [-h] ``` #### 参数介绍 @@ -140,7 +142,9 @@ msprof-analyze advisor命令行包含如下三个参数: | -cv
--cann_version | 使用Profiling工具采集时对应的CANN软件版本。目前配套的兼容版本为“6.3.RC2”,“7.0.RC1”、“7.0.0”、“8.0.RC1”,此字段不填默认按“8.0.RC1”版本数据进行处理,其余版本采集的Profiling数据在分析时可能会导致不可知问题。可通过在环境中执行如下命令获取其version字段:`cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info` | 否 | | -tv
--torch_version | 运行环境的torch版本,默认为1.11.0,支持torch1.11.0和torch2.1.0,当运行环境torch版本为其他版本如torch1.11.3时,可以忽略小版本号差异选择相近的torch版本如1.11.0。 | 否 | | -pt
--profiling_type | 配置性能数据采集使用的Profiling工具类型。可取值:
ascend_pytorch_profiler:使用Ascend PyThon Profiler接口方式采集的性能数据时配置,默认值。
msprof:使用msprof命令行方式采集的性能数据时配置。功能完善中,暂不建议使用。
mslite:使用[Benchmark](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench)工具采集的性能数据时配置。不建议使用。
**schedule不支持该参数。** | 否 | -| --debug | 工具执行报错时可打开此开关,将会展示详细保存堆栈信息。 | 否 | +| --force | 强制执行advisor。配置后可强制跳过如下情况:
指定的目录、文件的用户属主不属于当前用户,忽略属主判断直接执行。
csv文件大于5G、json文件大于10G、db文件大于8G,忽略文件过大判断直接执行。
配置该参数表示开启强制执行,默认未配置表示关闭。 | 否 | +| -l
--language | 设置分析结果输出的语言,可取值:
cn:输出中文,默认值。
en:输出英文。 | 否 | +| --debug | 工具执行报错时可打开此开关,将会展示详细的堆栈信息。配置该参数表示开启Debug,默认未配置表示关闭。 | 否 | | -h,-H
--help | 在需要查询当前命令附属子命令或相关参数时,给出帮助建议。 | 否 | ### 报告解析(无标杆) @@ -207,7 +211,7 @@ memory模块分析内存的异常申请释放操作。 ![memory](./img/memory.png) -communication模块从通信维度进行分析,目前支持通信小算子检测。 +communication模块从通信维度进行分析,目前支持通信小包检测、通信计算带宽抢占检测、通信重传检测、通信算子字节对齐检测。 ![communication](./img/communication.png) @@ -227,11 +231,21 @@ communication模块从通信维度进行分析,目前支持通信小算子检 ![bandwidth](./img/bandwidth.png) -computation模块从device计算性能维度进行分析,能够识别AI CPU、计算bound、动态Shape、AI Core算子降频分析等问题并给出相应建议。此处不再详细展开,按照报告进行调优即可。示例如下: +通信算子字节对齐检测,传输类型为SDMA的通信算子,数据量需要被512字节整除,保证传输带宽不会下降。 + +![byte_alignment](./img/byte_alignment.png) + +computation模块从device计算性能维度进行分析,能够识别AI CPU、动态Shape、AI Core Performance analysis、Block Dim、算子瓶颈、融合算子图、AI Core算子降频分析等问题并给出相应建议。此处不再详细展开,按照报告进行调优即可。示例如下: ![computation_1](./img/computation_1.png) -上图中torch_npu.npu.set_compile_mode接口介绍请参见[torch_npu.npu.set_compile_mode](https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/apiref/apilist/ptaoplist_000880.html);AICPU算子替换样例可参考《[Samples of AI CPU Operator Replacement](https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md)》。 +![block_dim](./img/block_dim.png) + +![op_no_bound](./img/op_no_bound.png) + +![AI Core Performance analysis](./img/AI%20Core%20Performance%20analysis.png) + +上图中torch_npu.npu.set_compile_mode接口介绍请参见[torch_npu.npu.set_compile_mode](https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/apiref/apilist/ptaoplist_000880.html);AICPU算子替换样例可参考《[Samples of AI CPU Operator Replacement](https://gitee.com/ascend/mstt/blob/master/profiler/msprof_analyze/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md)》。 当存在pp stage(流水线并行)时,computation会按stage分析,每个stage就是一个流水线切分,比如0\~7卡为stage-0、8\~15卡为stage-1。 @@ -243,7 +257,22 @@ dataloader模块包含Slow Dataloader Issues,主要检测异常高耗时的dat 上图中的`pin_memory`(内存锁定)和`num_workers`(数据加载是子流程数量)参数为[数据加载优化](https://www.hiascend.com/document/detail/zh/Pytorch/60RC2/ptmoddevg/trainingmigrguide/performance_tuning_0019.html)使用。 -schedule模块包GC 
Analysis、含亲和API、aclOpCompile、syncBatchNorm、SynchronizeStream等多项检测。 +schedule模块包含GC Analysis、亲和API、aclOpCompile、SyncBatchNorm、SynchronizeStream和Fusible Operator Analysis等多项检测。 + +其中Fusible Operator Analysis解析结果仅打屏展示和保存在`mstt_advisor_{timestamp}.xlsx`文件中,包含“基于host瓶颈的算子序列分析”和“基于mte瓶颈的算子序列分析”页签,如下图: + +![Fusible Operator Analysis](./img/Fusible%20Operator%20Analysis.png) + +| 字段 | 说明 | +| ------------------ | ------------------------------------------------------------ | +| start index | 序列起始算子在kernel details.csv或op_summary.csv中索引位置(不包含表头,起始索引为0)。 | +| end index | 序列末尾算子在kernel details.csv或op_summary.csv中索引位置。 | +| total time(us) | 算子序列总耗时(包含算子间隙),单位us。 | +| execution time(us) | 序列中算子执行总耗时,单位us。 | +| mte time(us) | 序列中算子搬运总耗时,单位us。 | +| occurrences | 序列出现次数。 | +| mte bound | 是否为MTE瓶颈。 | +| host bound | 是否为Host瓶颈。 | 如下图示例,GC Analysis提示存在异常垃圾回收事件,用户可以通过有效的Python内存管理、使用`gc.set_threshold()`调整垃圾回收阈值、使用gc.disable()禁用gc等方法处理GC问题。 @@ -256,7 +285,7 @@ schedule模块包GC Analysis、含亲和API、aclOpCompile、syncBatchNorm、Syn - `gc.set_threshold(threshold0, thresholdl, threshold2)`:这个函数用于设置垃圾回收的阈值。垃圾回收器将所有对象分为三代(0代、1代和2代),每一代的对象在经历垃圾回收后会被移到下一代。`threshold0`控制第0代的垃圾回收频率,`threshold1`控制第1代的垃圾回收频率,`threshold2`控制第2代的垃圾回收频率。将`threshold0`设为0可以禁用垃圾回收。 - `gc.disable ()`:这个函数用于禁用自动垃圾回收。调用`gc.disable ()`后,垃圾回收器将不会自动运行,直到手动调用`gc.enable()`。 -如下图示例,Affinity API Issues提示存在可以替换的亲和API并给出对应的堆栈,用户可以根据堆栈找到需要修改的代码,并给出修改案例([API instruction](https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20Fused%20Operator%20API%20Replacement.md))。 +如下图示例,Affinity API Issues提示存在可以替换的亲和API并给出对应的堆栈,用户可以根据堆栈找到需要修改的代码,并给出修改案例([API instruction](https://gitee.com/ascend/mstt/blob/master/profiler/msprof_analyze/advisor/doc/Samples%20of%20Fused%20Operator%20API%20Replacement.md))。 ![schedule_3](./img/schedule_3.png) @@ -313,6 +342,8 @@ comparison模块内容如下图示例,识别标杆和待比对性能数据的K ## 工具使用(Jupyter Notebook方式) +MindSpore场景不支持该方式。 + Jupyter Notebook使用方式如下: 下列以Windows环境下执行为例介绍。 diff --git a/profiler/advisor/__init__.py 
b/profiler/msprof_analyze/advisor/__init__.py similarity index 100% rename from profiler/advisor/__init__.py rename to profiler/msprof_analyze/advisor/__init__.py diff --git a/profiler/advisor/advisor_backend/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/__init__.py diff --git a/profiler/advisor/advisor_backend/advice_base.py b/profiler/msprof_analyze/advisor/advisor_backend/advice_base.py similarity index 100% rename from profiler/advisor/advisor_backend/advice_base.py rename to profiler/msprof_analyze/advisor/advisor_backend/advice_base.py diff --git a/profiler/advisor/advisor_backend/advice_factory/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/advice_factory/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/advice_factory/__init__.py diff --git a/profiler/advisor/advisor_backend/advice_factory/advice_factory.py b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/advice_factory.py similarity index 95% rename from profiler/advisor/advisor_backend/advice_factory/advice_factory.py rename to profiler/msprof_analyze/advisor/advisor_backend/advice_factory/advice_factory.py index 1b4b0c1be4ba466588469728e5aaa0cc3a1422ac..4e31882451950d73df83d73965ed793b257124e9 100644 --- a/profiler/advisor/advisor_backend/advice_factory/advice_factory.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/advice_factory.py @@ -14,7 +14,7 @@ # limitations under the License. 
import os -from common_func.path_manager import PathManager +from msprof_analyze.prof_common.path_manager import PathManager class AdviceFactory: diff --git a/profiler/advisor/advisor_backend/advice_factory/cluster_advice_factory.py b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/cluster_advice_factory.py similarity index 65% rename from profiler/advisor/advisor_backend/advice_factory/cluster_advice_factory.py rename to profiler/msprof_analyze/advisor/advisor_backend/advice_factory/cluster_advice_factory.py index 6bb93f46704eb13fef14d070f891e350446829ea..c3575c0a7576bce190c6b38aa31d32924dfb8e38 100644 --- a/profiler/advisor/advisor_backend/advice_factory/cluster_advice_factory.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/cluster_advice_factory.py @@ -12,12 +12,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from advice_factory.advice_factory import AdviceFactory -from cluster_advice.slow_link_advice import SlowLinkAdvice -from cluster_advice.slow_rank_advice import SlowRankAdvice -from cluster_advice.cluster_pipeline_advice import ClusterPipelineAdvice -from cluster_advice.kernel_cluster_advice import KernelClusterAdvice -from common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.advice_factory.advice_factory import AdviceFactory +from msprof_analyze.advisor.advisor_backend.cluster_advice.slow_link_advice import SlowLinkAdvice +from msprof_analyze.advisor.advisor_backend.cluster_advice.slow_rank_advice import SlowRankAdvice +from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_pipeline_advice import ClusterPipelineAdvice +from msprof_analyze.advisor.advisor_backend.cluster_advice.kernel_cluster_advice import KernelClusterAdvice +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant class 
ClusterAdviceFactory(AdviceFactory): diff --git a/profiler/advisor/advisor_backend/advice_factory/compute_advice_factory.py b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/compute_advice_factory.py similarity index 73% rename from profiler/advisor/advisor_backend/advice_factory/compute_advice_factory.py rename to profiler/msprof_analyze/advisor/advisor_backend/advice_factory/compute_advice_factory.py index 336bef7dd8553eb82586d52260443a7d01e84ab0..026ea1675a1a923cc2c4328f492601e8f3a7c15d 100644 --- a/profiler/advisor/advisor_backend/advice_factory/compute_advice_factory.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/compute_advice_factory.py @@ -12,10 +12,10 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from common_func_advisor.constant import Constant -from advice_factory.advice_factory import AdviceFactory -from compute_advice.npu_fused_advice import NpuFusedAdvice -from compute_advice.npu_slow_advice import NpuSlowAdvice +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.advice_factory.advice_factory import AdviceFactory +from msprof_analyze.advisor.advisor_backend.compute_advice.npu_fused_advice import NpuFusedAdvice +from msprof_analyze.advisor.advisor_backend.compute_advice.npu_slow_advice import NpuSlowAdvice class ComputeAdviceFactory(AdviceFactory): diff --git a/profiler/advisor/advisor_backend/advice_factory/overall_advice_factory.py b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/overall_advice_factory.py similarity index 77% rename from profiler/advisor/advisor_backend/advice_factory/overall_advice_factory.py rename to profiler/msprof_analyze/advisor/advisor_backend/advice_factory/overall_advice_factory.py index 
baf80cc200f4c3cd1057b7fc28e750948a450cf1..9f4964aad4b303dcf071765eabf4b625f652b190 100644 --- a/profiler/advisor/advisor_backend/advice_factory/overall_advice_factory.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/overall_advice_factory.py @@ -12,9 +12,9 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from advice_factory.advice_factory import AdviceFactory -from common_func_advisor.constant import Constant -from overall_advice.overall_summary_advice import OverallSummaryAdvice +from msprof_analyze.advisor.advisor_backend.advice_factory.advice_factory import AdviceFactory +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.overall_advice.overall_summary_advice import OverallSummaryAdvice class OverallAdviceFactory(AdviceFactory): diff --git a/profiler/advisor/advisor_backend/advice_factory/timeline_advice_factory.py b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/timeline_advice_factory.py similarity index 73% rename from profiler/advisor/advisor_backend/advice_factory/timeline_advice_factory.py rename to profiler/msprof_analyze/advisor/advisor_backend/advice_factory/timeline_advice_factory.py index 44b352e95a7bb1007bc7373193603c2a0b9d8b6c..2051c019b19e2228c78f60b08317e6651f9c7ccd 100644 --- a/profiler/advisor/advisor_backend/advice_factory/timeline_advice_factory.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/advice_factory/timeline_advice_factory.py @@ -12,10 +12,10 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-from advice_factory.advice_factory import AdviceFactory -from common_func_advisor.constant import Constant -from timeline_advice.optimizer_advice import OptimizerAdvice -from timeline_advice.op_schedule_advice import OpScheduleAdvice +from msprof_analyze.advisor.advisor_backend.advice_factory.advice_factory import AdviceFactory +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.timeline_advice.optimizer_advice import OptimizerAdvice +from msprof_analyze.advisor.advisor_backend.timeline_advice.op_schedule_advice import OpScheduleAdvice class TimelineAdviceFactory(AdviceFactory): diff --git a/profiler/advisor/advisor_backend/cluster_advice/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/cluster_advice/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/__init__.py diff --git a/profiler/advisor/advisor_backend/cluster_advice/cluster_advice_base.py b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/cluster_advice_base.py similarity index 87% rename from profiler/advisor/advisor_backend/cluster_advice/cluster_advice_base.py rename to profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/cluster_advice_base.py index 2cc1eebebc921510e44e8869bd70c423995efeb0..5620f8e49ec54ccaad9f185fabe15d279e4c48f1 100644 --- a/profiler/advisor/advisor_backend/cluster_advice/cluster_advice_base.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/cluster_advice_base.py @@ -14,13 +14,14 @@ # limitations under the License. 
import os -import logging from abc import abstractmethod -from common_func.constant import Constant -from advice_base import AdviceBase -from cluster_analysis import Interface -logger = logging.getLogger() +from msprof_analyze.advisor.advisor_backend.advice_base import AdviceBase +from msprof_analyze.cluster_analyse.cluster_analysis import Interface +from msprof_analyze.advisor.advisor_backend.logger import Logger +from msprof_analyze.prof_common.constant import Constant + +logger = Logger() class ClusterAdviceBase(AdviceBase): @@ -67,4 +68,4 @@ class ClusterAdviceBase(AdviceBase): def output(self): """ output relevant data - """ \ No newline at end of file + """ diff --git a/profiler/advisor/advisor_backend/cluster_advice/cluster_pipeline_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/cluster_pipeline_advice.py similarity index 88% rename from profiler/advisor/advisor_backend/cluster_advice/cluster_pipeline_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/cluster_pipeline_advice.py index 7f8846f1d99e9bc81636df32d04148df99d12920..8db0f6fba4dc1ffc92faf1062f93add29b0dd11b 100644 --- a/profiler/advisor/advisor_backend/cluster_advice/cluster_pipeline_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/cluster_pipeline_advice.py @@ -13,25 +13,28 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+import multiprocessing import os import time -import multiprocessing -from typing import Dict -from typing import Optional -from typing import Deque -from typing import List -from typing import Tuple from collections import defaultdict from collections import deque -from decimal import Decimal from dataclasses import dataclass +from decimal import Decimal +from typing import Deque +from typing import Dict +from typing import List +from typing import Optional +from typing import Tuple -from common_func.file_manager import FileManager -from common_func_advisor.constant import Constant -from common_func_advisor.trace_view_preprocessor import FineTraceViewData -from common_func_advisor.trace_view_preprocessor import TraceViewPreProcessor -from cluster_advice.cluster_advice_base import ClusterAdviceBase -from cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor +from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_advice_base import ClusterAdviceBase +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.common_func_advisor.trace_view_preprocessor import FineTraceViewData +from msprof_analyze.advisor.advisor_backend.common_func_advisor.trace_view_preprocessor import TraceViewPreProcessor +from msprof_analyze.advisor.advisor_backend.logger import Logger +from msprof_analyze.cluster_analyse.cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor +from msprof_analyze.prof_common.file_manager import FileManager + +logger = Logger() @dataclass @@ -63,19 +66,6 @@ class PipelineTraceViewer: BP: BP_COLOR } - def _gen_trace_pair(self, name: str, start_ts: str, end_ts: str, pid: str, tid: str) -> Dict: - data = { - Constant.OP_NAME: name, - Constant.CNAME: self.COLORS.get(name, self.BUBBLE), - Constant.PH: Constant.PH_X, - Constant.PID: pid, - Constant.OP_TID: tid, - Constant.TS: start_ts, - Constant.DUR: str(Decimal(end_ts) - 
Decimal(start_ts)) - } - - return data - def gen_stage_bubble_trace_data(self, rank_id: int, timeslice_list: List[PipelineTimeSlice]) -> List[Dict]: """ generate stage bubble trace json data @@ -120,6 +110,19 @@ class PipelineTraceViewer: return trace_data + def _gen_trace_pair(self, name: str, start_ts: str, end_ts: str, pid: str, tid: str) -> Dict: + data = { + Constant.OP_NAME: name, + Constant.CNAME: self.COLORS.get(name, self.BUBBLE), + Constant.PH: Constant.PH_X, + Constant.PID: pid, + Constant.OP_TID: tid, + Constant.TS: start_ts, + Constant.DUR: str(Decimal(end_ts) - Decimal(start_ts)) + } + + return data + class ClusterPipelineAdvice(ClusterAdviceBase): BUBBLE = "Bubble" @@ -136,121 +139,6 @@ class ClusterPipelineAdvice(ClusterAdviceBase): self.cur_bottleneck = {} self.cur_advices = "" - def run(self) -> dict: - """ - Unified entrance interface - """ - self.rank_prof_dirs = self.get_rank_prof_dirs(self.rank_ids) - if not self.rank_prof_dirs: - print("[ERROR] No rank profiling data found, please check the rank ids or dir path.") - return {} - - self.process() - self.output() - self.identify_bottleneck() - return self.output_format_data - - def process(self) -> None: - """ - process all rank profiling data by using multi-process - """ - start_time = time.time() - print(f"[INFO] Start to process {len(self.rank_prof_dirs)} rank profiling data with {self.worker_num} workers.") - with multiprocessing.Pool(self.worker_num) as pool: - results = pool.map(self.work, self.rank_prof_dirs.items()) - - for (rank_id, _), (res, show_fp_bp) in zip(self.rank_prof_dirs.items(), results): - if show_fp_bp: - self.cur_data += PipelineTraceViewer().gen_fp_bp_trace_data(rank_id, res) - else: - self.cur_data += PipelineTraceViewer().gen_stage_bubble_trace_data(rank_id, res) - print(f"[INFO] Pipline view data process finished, cost {time.time() - start_time:.2f}s.") - - @staticmethod - def _align_trace_bound(results: List) -> None: - """ - align all rank trace bound for better 
visualization - """ - start_list, end_list = [], [] - for res in results: - start_list.append(res[0].start) - end_list.append(res[-1].end) - - # update all rank trace bound - for res in results: - res[0].start = min(start_list) - res[-1].end = max(end_list) - - def work(self, kv: Tuple[int, str]) -> Tuple[List[PipelineTimeSlice], bool]: - """ - single process worker function - """ - show_fp_bp = False - rank_id, rank_prof_dir = kv - print(f"[INFO] [Rank {rank_id}] Start to process rank profiling data.") - json_path = os.path.join(rank_prof_dir, Constant.ASCEND_PROFILER_OUTPUT, Constant.TRACE_VIEW_JSON) - fine_data = self.load_trace_view_data(json_path) - if not fine_data.hcom_ops or not fine_data.hcom_tids: - print(f"[ERROR] [Rank {rank_id}] No hcom send recv ops found, make sure the trace view data is pipeline " - f"parallel sense.") - return [], show_fp_bp - - timeslice_list = self.get_pipeline_timeslice(fine_data.hcom_ops, fine_data.hcom_tids, fine_data.min_ts, - fine_data.max_ts) - if not fine_data.fp_ops or not fine_data.bp_ops: - print(f"[INFO] [Rank {rank_id}] No frameWork data in trace view, only show stage and bubble.") - elif len(fine_data.hcom_tids) > 1: - print(f"[WARN] [Rank {rank_id}] More than one hcom tid found, only show stage and bubble.") - else: - print(f"[INFO] [Rank {rank_id}] Found frameWork data in trace view, show fp bp and bubble.") - bp_ops = self.get_fp_bp_bound_ops(fine_data) - self.update_stage_fp_bp(timeslice_list, bp_ops) - show_fp_bp = True - print(f"[INFO] [Rank {rank_id}] Rank profiling data process finished.") - - return timeslice_list, show_fp_bp - - def identify_bottleneck(self) -> None: - pass - - def output(self) -> None: - """ - output result - """ - self.cur_data.append( - { - Constant.OP_NAME: Constant.PROCESS_NAME, - Constant.PH: Constant.PH_META, - Constant.PID: self.PIPELINE_VIEW, - Constant.OP_TID: self.PIPELINE_VIEW, - Constant.ARGS: { - Constant.OP_NAME: self.PIPELINE_VIEW - } - } - ) - 
self.output_format_data[self.DATA] = self.cur_data - self.output_format_data[self.BOTTLENECK] = self.cur_bottleneck - self.output_format_data[self.ADVICE] = self.cur_advices - - def get_rank_prof_dirs(self, rank_ids: list) -> Dict[int, str]: - """ - get rank profiling directories by rank ids - """ - rank_prof_dirs = defaultdict(str) - prof_dirs = [] - for prof_dir in os.listdir(self.collection_path): - if prof_dir.endswith(Constant.PT_PROF_SUFFIX): - prof_dirs.append(os.path.join(self.collection_path, prof_dir)) - - data_map = PytorchDataPreprocessor(prof_dirs).get_data_map() - for rank_id in rank_ids: - if rank_id in data_map: - rank_prof_dirs[rank_id] = data_map[rank_id] - else: - print(f'[Warning] Rank {rank_id} not found in {self.collection_path}') - - return rank_prof_dirs - @staticmethod def load_trace_view_data(json_path) -> Optional[FineTraceViewData]: """ @@ -368,6 +256,126 @@ class ClusterPipelineAdvice(ClusterAdviceBase): return res + @staticmethod + def _align_trace_bound(results: List) -> None: + """ + align all rank trace bound for better visualization + """ + start_list, end_list = [], [] + for res in results: + start_list.append(res[0].start) + end_list.append(res[-1].end) + + # update all rank trace bound + for res in results: + res[0].start = min(start_list) + res[-1].end = max(end_list) + + def run(self) -> dict: + """ + Unified entrance interface + """ + self.rank_prof_dirs = self.get_rank_prof_dirs(self.rank_ids) + if not self.rank_prof_dirs: + logger.error("No rank profiling data found, please check the rank ids or dir path.") + return {} + + self.process() + self.output() + self.identify_bottleneck() + return self.output_format_data + + def process(self) -> None: + """ + process all rank profiling data by using multi-process + """ + start_time = time.time() + logger.info("Start to process %s rank profiling data with %s workers.", + str(len(self.rank_prof_dirs)), str(self.worker_num)) + with multiprocessing.Pool(self.worker_num) as pool: + 
results = pool.map(self.work, self.rank_prof_dirs.items()) + + for (rank_id, _), (res, show_fp_bp) in zip(self.rank_prof_dirs.items(), results): + if show_fp_bp: + self.cur_data += PipelineTraceViewer().gen_fp_bp_trace_data(rank_id, res) + else: + self.cur_data += PipelineTraceViewer().gen_stage_bubble_trace_data(rank_id, res) + time_cost = time.time() - start_time + logger.info("Pipeline view data process finished, cost %.2f s.", time_cost) + + def work(self, kv: Tuple[int, str]) -> Tuple[List[PipelineTimeSlice], bool]: + """ + single process worker function + """ + show_fp_bp = False + rank_id, rank_prof_dir = kv + logger.info("[Rank %s] Start to process rank profiling data.", str(rank_id)) + json_path = os.path.join(rank_prof_dir, Constant.ASCEND_PROFILER_OUTPUT, Constant.TRACE_VIEW_JSON) + fine_data = self.load_trace_view_data(json_path) + if not fine_data.hcom_ops or not fine_data.hcom_tids: + logger.error("[Rank %s] No hcom send recv ops found, make sure the trace view data is " + "from a pipeline parallel scenario.", str(rank_id)) + return [], show_fp_bp + + timeslice_list = self.get_pipeline_timeslice(fine_data.hcom_ops, fine_data.hcom_tids, fine_data.min_ts, + fine_data.max_ts) + if not fine_data.fp_ops or not fine_data.bp_ops: + logger.info("[Rank %s] No framework data in trace view, only show stage and bubble.", + str(rank_id)) + elif len(fine_data.hcom_tids) > 1: + logger.warning("[Rank %s] More than one hcom tid found, only show stage and bubble.", + str(rank_id)) + else: + logger.info("[Rank %s] Found framework data in trace view, show fp bp and bubble.", + str(rank_id)) + bp_ops = self.get_fp_bp_bound_ops(fine_data) + self.update_stage_fp_bp(timeslice_list, bp_ops) + show_fp_bp = True + logger.info("[Rank %s] Rank profiling data process finished.", str(rank_id)) + + return timeslice_list, show_fp_bp + + def identify_bottleneck(self) -> None: + pass + + def output(self) -> None: + """ + output result + """ + self.cur_data.append( + { + Constant.OP_NAME: 
Constant.PROCESS_NAME, + Constant.PH: Constant.PH_META, + Constant.PID: self.PIPELINE_VIEW, + Constant.OP_TID: self.PIPELINE_VIEW, + Constant.ARGS: { + Constant.OP_NAME: self.PIPELINE_VIEW + } + } + ) + self.output_format_data[self.DATA] = self.cur_data + self.output_format_data[self.BOTTLENECK] = self.cur_bottleneck + self.output_format_data[self.ADVICE] = self.cur_advices + + def get_rank_prof_dirs(self, rank_ids: list) -> Dict[int, str]: + """ + get rank profiling directories by rank ids + """ + rank_prof_dirs = defaultdict(str) + prof_dirs = [] + for prof_dir in os.listdir(self.collection_path): + if prof_dir.endswith(Constant.PT_PROF_SUFFIX): + prof_dirs.append(os.path.join(self.collection_path, prof_dir)) + + data_map = PytorchDataPreprocessor(prof_dirs).get_data_map() + for rank_id in rank_ids: + if rank_id in data_map: + rank_prof_dirs[rank_id] = data_map[rank_id] + else: + logger.warning('Rank %s not found in %s', str(rank_id), str(self.collection_path)) + + return rank_prof_dirs + def get_fp_bp_bound_ops(self, fine_data: FineTraceViewData) -> List[List[dict]]: """ get fp and bp bound ops by using double queue alternating pop algorithm and @@ -391,7 +399,7 @@ class ClusterPipelineAdvice(ClusterAdviceBase): timeslice_list = [] last_op_end = None if len(hcom_tids) > 1: - print("[WARN] More than one hcom tid found, default to show minimal tid pipeline view.") + logger.warning("More than one hcom tid found, default to show minimal tid pipeline view.") for op in hcom_ops: if op[Constant.OP_TID] == min(hcom_tids): diff --git a/profiler/advisor/advisor_backend/cluster_advice/kernel_cluster_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/kernel_cluster_advice.py similarity index 73% rename from profiler/advisor/advisor_backend/cluster_advice/kernel_cluster_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/kernel_cluster_advice.py index e7b334bbfe2cea02214e2900094c2c005f463432..a7d3a010959275ee0fd6e3be2af926f7fb46c3bb 
100644 --- a/profiler/advisor/advisor_backend/cluster_advice/kernel_cluster_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/kernel_cluster_advice.py @@ -1,11 +1,27 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os import pandas as pd -from common_func.path_manager import PathManager -from common_func.constant import Constant -from common_func_advisor.constant import Constant as AdvisorConstant -from cluster_advice.cluster_advice_base import ClusterAdviceBase +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant as AdvisorConstant +from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_advice_base import ClusterAdviceBase -from cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor +from msprof_analyze.cluster_analyse.cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.path_manager import PathManager + class KernelClusterAdvice(ClusterAdviceBase): COLUMNS_TO_GROUP = ["Name", "Input Shapes", "Input Data Types", "Output Shapes"] diff --git a/profiler/advisor/advisor_backend/cluster_advice/slow_link_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/slow_link_advice.py similarity index 88% rename from 
profiler/advisor/advisor_backend/cluster_advice/slow_link_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/slow_link_advice.py index 8d299326236461614afb152fbcdb62cc0fb61d94..6d2a0638913d759817b091a013d7fbce9df09f63 100644 --- a/profiler/advisor/advisor_backend/cluster_advice/slow_link_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/slow_link_advice.py @@ -13,11 +13,12 @@ # See the License for the specific language governing permissions and # limitations under the License. +import copy import os from collections import defaultdict -from common_func_advisor.constant import Constant -from common_func.file_manager import FileManager -from cluster_advice.cluster_advice_base import ClusterAdviceBase +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_advice_base import ClusterAdviceBase +from msprof_analyze.prof_common.file_manager import FileManager class SlowLinkAdvice(ClusterAdviceBase): @@ -35,12 +36,13 @@ class SlowLinkAdvice(ClusterAdviceBase): def __init__(self, collection_path: str, kwargs: dict = None): super().__init__(collection_path) - self.rank_bw_dict = defaultdict(lambda: { + default_value = { self.RDMA_TIME_MS: 0, self.RDMA_SIZE_MB: 0, self.SDMA_TIME_MS: 0, self.SDMA_SIZE_MB: 0, - }) + } + self.rank_bw_dict = defaultdict(lambda: copy.deepcopy(default_value)) @staticmethod def compute_ratio(dividend: float, divisor: float): @@ -65,9 +67,9 @@ class SlowLinkAdvice(ClusterAdviceBase): return self.output_format_data def process(self, communication_json: dict): - for comm_group, group_dict in communication_json.items(): - for step, step_dict in group_dict.items(): - for op, op_dict in step_dict.items(): + for _, group_dict in communication_json.items(): + for _, step_dict in group_dict.items(): + for _, op_dict in step_dict.items(): self.compute_bandwidth(op_dict) if self.rank_bw_dict: 
self.produce_bottleneck(self.RDMA_BANDWIDTH) @@ -88,7 +90,7 @@ class SlowLinkAdvice(ClusterAdviceBase): self.rank_bw_dict[rank][self.RDMA_SIZE_MB] += bw_dict.get(self.TRANSIT_SIZE) self.rank_bw_dict[rank][self.RDMA_TIME_MS] += bw_dict.get(self.TRANSIT_TIME) - for rank, rank_dict in self.rank_bw_dict.items(): + for rank, _ in self.rank_bw_dict.items(): self.rank_bw_dict[rank][self.RDMA_BANDWIDTH] = self.compute_ratio( self.rank_bw_dict[rank][self.RDMA_SIZE_MB], self.rank_bw_dict[rank][self.RDMA_TIME_MS]) self.rank_bw_dict[rank][self.SDMA_BANDWIDTH] = self.compute_ratio( diff --git a/profiler/advisor/advisor_backend/cluster_advice/slow_rank_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/slow_rank_advice.py similarity index 88% rename from profiler/advisor/advisor_backend/cluster_advice/slow_rank_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/slow_rank_advice.py index 4e789fb7fb688626df7e8f5b25b84e4955d6c2a3..182249e3a08a1e50d1178dbed9b6b17257fcbfd3 100644 --- a/profiler/advisor/advisor_backend/cluster_advice/slow_rank_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/cluster_advice/slow_rank_advice.py @@ -15,10 +15,11 @@ import os from collections import defaultdict -from common_func_advisor.constant import Constant -from common_func.file_manager import FileManager -from cluster_advice.cluster_advice_base import ClusterAdviceBase -from prof_bean_advisor.cluster_step_trace_time_bean import ClusterStepTraceTimeBean +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_advice_base import ClusterAdviceBase +from msprof_analyze.advisor.advisor_backend.prof_bean_advisor.cluster_step_trace_time_bean \ + import ClusterStepTraceTimeBean +from msprof_analyze.prof_common.file_manager import FileManager class SlowRankAdvice(ClusterAdviceBase): diff --git 
a/profiler/advisor/advisor_backend/common_func_advisor/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/common_func_advisor/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/__init__.py diff --git a/profiler/advisor/advisor_backend/common_func_advisor/constant.py b/profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/constant.py similarity index 62% rename from profiler/advisor/advisor_backend/common_func_advisor/constant.py rename to profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/constant.py index 46a7fb24c2dade75c157f18118f29233eb924b88..162a9fd2fdde15e02d2897106b43f52bca99bde1 100644 --- a/profiler/advisor/advisor_backend/common_func_advisor/constant.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/constant.py @@ -92,20 +92,17 @@ class CsvTitleV2(CsvTitle): class Constant: - DTYPE_SIZE_MAP = {"int8": 1, "uint8": 1, - "int16": 2, "uint16": 2, - "int32": 4, "uint32": 4, - "int64": 8, "uint64": 8, - "float16": 2, - "bfloat16": 2, - "bf16": 2, - "dt_bf16": 2, - "float32": 4, - "float": 4, - "float64": 8, - "complex64": 8, - "complex128": 16, - "bool": 1} + DTYPE_SIZE_MAP = { + "int8": 1, "uint8": 1, + "int16": 2, "uint16": 2, + "int32": 4, "uint32": 4, + "int64": 8, "uint64": 8, + "float16": 2, "bfloat16": 2, + "bf16": 2, "dt_bf16": 2, + "float32": 4, "float": 4, + "float64": 8, "complex64": 8, + "complex128": 16, "bool": 1 + } TP_THRESHOLD = 1150 MAX_INPUT_MODE_LEN = 30 MAX_INPUT_ADVICE_LEN = 30 @@ -173,35 +170,37 @@ class Constant: TRACE_VIEW_JSON = "trace_view.json" # pattern_dict key: pattern, value: pattern name - PATTERN_DICT = {("Add", "DropOutDoMask", "Add"): "bias_dropout_add", - ("BatchMatMul", "Mul", "Cast", "Mul", "MaskedFill", "SoftmaxV2", "Cast", "DropOutDoMask", - "AsStrided", "BatchMatMul", "Transpose"): "FA", - ("Transpose", "Transpose", 
"Transpose", "Mul", "Transpose", "BatchMatMulV2", "MaskedFill", - "Cast", "SoftmaxV2", "Cast", "DropOutDoMask", "BatchMatMulV2", "Transpose"): "FA", - ("Transpose", "BatchMatMulV2", "Transpose", "Transpose", "BatchMatMulV2", "ZerosLike", - "DropOutDoMask", "Cast", "SoftmaxGrad", "Cast", "MaskedFill", "BatchMatMulV2", - "BatchMatMulV2", "Mul"): "FA", - ("Cast", "Square", "ReduceMeanD", "Add", "Rsqrt", "Cast", "Cast", "Mul", "Cast", "Cast", - "Mul", "Cast"): "RMSNORM", - ("Cast", "LayerNorm", "Cast"): "LayerNorm", - ("Add", "LayerNorm"): "AddLayerNorm", - ("Add", "LayerNormV3"): "AddLayerNorm", - ("Gelu", "Add"): "GeluAdd", - ("Cast", "Square", "MemSet", "ReduceMean", "Add", "Rsqrt", "Mul", "Cast", "Mul"): "RMSNorm", - ("BatchMatMul", "RealDiv", "Add", "Maximum", "SoftmaxV2", "Cast", "BatchMatMul"): "FA", - ("BatchMatMulV2", "RealDiv", "Add", "Cast", "Maximum", "Cast", "SoftmaxV2", "AsStrided", - "BatchMatMulV2"): "FA", - ("BatchMatMulV2", "RealDiv", "Add", "Cast", "SoftmaxV2", "Cast", "BroadcastTo", - "BatchMatMulV2"): "FA", - ("Mul", "Slice", "Neg", "Slice", "ConcatD", "Cast", "Mul", "Add"): "RotaryMul", - ("Mul", "AsStrided", "Neg", "AsStrided", "ConcatD", "Mul", "Add"): "RotaryMul", - ("Mul", "Slice", "Neg", "Slice", "ConcatD", "Mul", "Add"): "RotaryMul", - ("MatMulV2", "Swish", "MatMulV2", "Mul", "MatMulV2"): "FFN", - ("Transpose", "Transpose", "GatherElement", "Transpose"): "GatherElement", - ("Slice", "Slice", "Swish", "Mul"): "torch_npu.npu_swiglu", - ("Cast", "Mul", "MaskedFill", "SoftmaxV2", "Cast"): "torch_npu.npu_scaled_masked_softmax", - ("Mul", "Slice", "Neg", "Slice", "ConcatD", "Mul"): "torch_npu.npu_rotary_mul", - ("Cast", "Square", "ReduceMeanD", "Add", "Rsqrt", "Mul", "Cast", "Mul"): "torch_npu.npu_rms_norm"} + PATTERN_DICT = { + ("Add", "DropOutDoMask", "Add"): "bias_dropout_add", + ("BatchMatMul", "Mul", "Cast", "Mul", "MaskedFill", "SoftmaxV2", "Cast", "DropOutDoMask", + "AsStrided", "BatchMatMul", "Transpose"): "FA", + ("Transpose", 
"Transpose", "Transpose", "Mul", "Transpose", "BatchMatMulV2", "MaskedFill", + "Cast", "SoftmaxV2", "Cast", "DropOutDoMask", "BatchMatMulV2", "Transpose"): "FA", + ("Transpose", "BatchMatMulV2", "Transpose", "Transpose", "BatchMatMulV2", "ZerosLike", + "DropOutDoMask", "Cast", "SoftmaxGrad", "Cast", "MaskedFill", "BatchMatMulV2", + "BatchMatMulV2", "Mul"): "FA", + ("Cast", "Square", "ReduceMeanD", "Add", "Rsqrt", "Cast", "Cast", "Mul", "Cast", "Cast", + "Mul", "Cast"): "RMSNORM", + ("Cast", "LayerNorm", "Cast"): "LayerNorm", + ("Add", "LayerNorm"): "AddLayerNorm", + ("Add", "LayerNormV3"): "AddLayerNorm", + ("Gelu", "Add"): "GeluAdd", + ("Cast", "Square", "MemSet", "ReduceMean", "Add", "Rsqrt", "Mul", "Cast", "Mul"): "RMSNorm", + ("BatchMatMul", "RealDiv", "Add", "Maximum", "SoftmaxV2", "Cast", "BatchMatMul"): "FA", + ("BatchMatMulV2", "RealDiv", "Add", "Cast", "Maximum", "Cast", "SoftmaxV2", "AsStrided", + "BatchMatMulV2"): "FA", + ("BatchMatMulV2", "RealDiv", "Add", "Cast", "SoftmaxV2", "Cast", "BroadcastTo", + "BatchMatMulV2"): "FA", + ("Mul", "Slice", "Neg", "Slice", "ConcatD", "Cast", "Mul", "Add"): "RotaryMul", + ("Mul", "AsStrided", "Neg", "AsStrided", "ConcatD", "Mul", "Add"): "RotaryMul", + ("Mul", "Slice", "Neg", "Slice", "ConcatD", "Mul", "Add"): "RotaryMul", + ("MatMulV2", "Swish", "MatMulV2", "Mul", "MatMulV2"): "FFN", + ("Transpose", "Transpose", "GatherElement", "Transpose"): "GatherElement", + ("Slice", "Slice", "Swish", "Mul"): "torch_npu.npu_swiglu", + ("Cast", "Mul", "MaskedFill", "SoftmaxV2", "Cast"): "torch_npu.npu_scaled_masked_softmax", + ("Mul", "Slice", "Neg", "Slice", "ConcatD", "Mul"): "torch_npu.npu_rotary_mul", + ("Cast", "Square", "ReduceMeanD", "Add", "Rsqrt", "Mul", "Cast", "Mul"): "torch_npu.npu_rms_norm" + } TITLE = CsvTitleV2 @classmethod @@ -215,7 +214,7 @@ class CoreType: AICPU = "AI_CPU" MIX_AIV = "MIX_AIV" MIX_AIC = "MIX_AIC" - HCCL = "HCCL" + HCCL = "COMMUNICATION" class PerfColor(Enum): diff --git 
a/profiler/advisor/advisor_backend/common_func_advisor/trace_view_json.py b/profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/trace_view_json.py similarity index 89% rename from profiler/advisor/advisor_backend/common_func_advisor/trace_view_json.py rename to profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/trace_view_json.py index 8171f06ee235fc02da715044b4d310087c36c102..5af97c785aa431f32056078dd6868b0c954f40c9 100644 --- a/profiler/advisor/advisor_backend/common_func_advisor/trace_view_json.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/trace_view_json.py @@ -12,7 +12,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import os + from abc import abstractmethod from dataclasses import dataclass from dataclasses import field @@ -21,7 +21,10 @@ from typing import List import pandas as pd -from common_func.file_manager import FileManager +from msprof_analyze.advisor.advisor_backend.logger import Logger +from msprof_analyze.prof_common.file_manager import FileManager + +logger = Logger() @dataclass @@ -35,7 +38,7 @@ class TraceObj: id: int = 0 ts: str = "" dur: float = 0.0 - args: dict = field(default='unknown') + args: dict = field(default_factory=dict) @abstractmethod def hash(self): @@ -93,21 +96,21 @@ class TraceViewJson: self.torch_2_npu_flow_events: Dict[str, FlowEvent] = dict() traces = FileManager.read_json_file(path) self._load_obj(traces) - + def get_call_stack(self, data: pd.DataFrame, index_id: int, ts_col: str) -> str: if ts_col not in data.columns.tolist(): - print("[ERROR] No {} col found in data columns.".format(ts_col)) + logger.error("No %s col found in data columns.", str(ts_col)) return "" row = data.loc[index_id] timestamp = row[ts_col] flow_event = self.get_torch_2_npu_flow_event(timestamp) if not flow_event.valid(): - print("[ERROR] Get flow event 
failed for pattern {}.".format(row['pattern'])) + logger.error("Get flow event failed for pattern %s.", str(row['pattern'])) return "" flow_event_s_key = flow_event.s_point_ts python_dur_events = self.get_python_dur_events_contain_ts(flow_event_s_key) if not python_dur_events: - print("[ERROR] No python dur event found for pattern {}.".format(row['pattern'])) + logger.error("No python dur event found for pattern %s.", str(row['pattern'])) return "" # 保持新老版本callstack兼容性 if python_dur_events[0].args.get("Call stack"): @@ -122,7 +125,7 @@ class TraceViewJson: def get_torch_2_npu_flow_event(self, end_time) -> FlowEvent: if not self.torch_2_npu_flow_events or not self.torch_2_npu_flow_events.get(end_time): - print("[ERROR] Find flow event failed for ts: {}".format(end_time)) + logger.error("Find flow event failed for ts: %s", str(end_time)) return FlowEvent() return self.torch_2_npu_flow_events.get(end_time) @@ -136,7 +139,7 @@ class TraceViewJson: def _load_obj(self, traces): self._load_format(traces) if not self._check_format(): - print("[ERROR] parse json failed for error format") + logger.error("Failed to parse json due to invalid format.") return self._load_duration_events(traces) self._load_torch_to_npu_flow_events(traces) @@ -147,13 +150,13 @@ class TraceViewJson: for check_process in check_processes: if check_process in self.processes: continue - print("[ERROR] {} process not found in json.".format(check_process)) + logger.error("%s process not found in json.", str(check_process)) return False return True # 加载pid, tid头 def _load_format(self, traces: List[Dict]): - for i, trace in enumerate(traces): + for trace in traces: if trace.get('name') == 'process_name': if not trace.get('args') or not trace.get('args').get('name') or not trace.get('pid'): continue @@ -172,7 +175,7 @@ class TraceViewJson: python_pid = self.processes.get("Python").pid cann_pid = self.processes.get("CANN").pid ascend_hardware_pid = self.processes.get("Ascend Hardware").pid - for i, 
trace in enumerate(traces): + for trace in traces: if trace.get('ph') != 'X': continue if not check_events(trace): @@ -192,7 +195,7 @@ class TraceViewJson: flow_events_table_by_id = dict() python_pid = self.processes.get("Python") - for i, trace in enumerate(traces): + for trace in traces: if trace.get('ph') != 's' and trace.get('ph') != 'f' and trace.get('pid') != python_pid: continue if not check_events(trace): diff --git a/profiler/advisor/advisor_backend/common_func_advisor/trace_view_preprocessor.py b/profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/trace_view_preprocessor.py similarity index 94% rename from profiler/advisor/advisor_backend/common_func_advisor/trace_view_preprocessor.py rename to profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/trace_view_preprocessor.py index 6eced27efd203e2e7b23e47cd33ab71ce8c4e240..782cf459eda6e471526d8cae924e1fecc272a1f2 100644 --- a/profiler/advisor/advisor_backend/common_func_advisor/trace_view_preprocessor.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/common_func_advisor/trace_view_preprocessor.py @@ -13,12 +13,16 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+ import re import sys from typing import Optional from dataclasses import dataclass -from common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.logger import Logger + +logger = Logger() @dataclass @@ -103,11 +107,11 @@ class TraceViewPreProcessor: """ check whether op is hcom send or recv op """ - # eg: hcom_BatchSendRecv__101_0_1 + # for example, hcom_BatchSendRecv__101_0_1 p1 = re.compile(r'^hcom_\w+SendRecv__\d+') - # eg: hcom_send__101_0_1 + # for example, hcom_send__101_0_1 p2 = re.compile(r'hcom_send__\d+') - # eg: hcom_receive__101_0_1 + # for example, hcom_receive__101_0_1 p3 = re.compile(r'hcom_receive__\d+') return bool(p1.match(op_name)) or bool(p2.match(op_name)) or bool(p3.match(op_name)) @@ -157,7 +161,7 @@ class TraceViewPreProcessor: preprocess raw data """ if not raw_data: - print("[ERROR] No raw data found in trace view data.") + logger.error("No raw data found in trace view data.") return None raw_fp_tids, raw_bp_tids, raw_hcom_tids = set(), set(), set() @@ -189,7 +193,7 @@ class TraceViewPreProcessor: fine_data.hcom_tids = list(raw_hcom_tids) if not unique_fp_tid or not unique_bp_tid: - print("[INFO] No fp or bp tid found in trace view data.") + logger.info("No fp or bp tid found in trace view data.") else: fine_data.fp_tid, fine_data.bp_tid = unique_fp_tid[0], unique_bp_tid[0] diff --git a/profiler/advisor/advisor_backend/compute_advice/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/compute_advice/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/compute_advice/__init__.py diff --git a/profiler/advisor/advisor_backend/compute_advice/compute_advice_base.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/compute_advice_base.py similarity index 82% rename from 
profiler/advisor/advisor_backend/compute_advice/compute_advice_base.py rename to profiler/msprof_analyze/advisor/advisor_backend/compute_advice/compute_advice_base.py index cafbafd8e28c162bc76edb2f77ebd0645fed552f..9b5a3a37685f18ac0b08cd8bcfcde12c7a5ec708 100644 --- a/profiler/advisor/advisor_backend/compute_advice/compute_advice_base.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/compute_advice_base.py @@ -16,9 +16,12 @@ from abc import abstractmethod from collections import defaultdict import os +import logging -from advice_base import AdviceBase -from common_func.file_manager import FileManager +from msprof_analyze.advisor.advisor_backend.advice_base import AdviceBase +from msprof_analyze.prof_common.file_manager import FileManager + +logger = logging.getLogger() class ComputeAdviceBase(AdviceBase): @@ -32,7 +35,7 @@ class ComputeAdviceBase(AdviceBase): self.kernel_details_path = "" self.has_preparse = False self.preparse_data = defaultdict(list) - self.call_stack = None + self.call_stack = False self.trace_view_path = "" def path_check(self): @@ -40,47 +43,42 @@ class ComputeAdviceBase(AdviceBase): check whether input path is valid """ if not os.path.exists(self.collection_path): - print("[ERROR] Path: {} is not exist.".format(self.collection_path)) + logger.error("Path: {} does not exist.".format(self.collection_path)) return False - if os.path.isdir(self.collection_path) and self.collection_path.endswith("ascend_pt"): + if os.path.isdir(self.collection_path) and \ + (self.collection_path.endswith("ascend_pt") or self.collection_path.endswith("ascend_ms")): self.kernel_details_path = os.path.join(self.collection_path, "ASCEND_PROFILER_OUTPUT", "kernel_details.csv") if not os.path.exists(self.kernel_details_path): - print("[ERROR] kernel_details.csv is not exist in the Path: {}.".format( + logger.error("kernel_details.csv does not exist in the path: {}.".format( os.path.join(self.collection_path, "ASCEND_PROFILER_OUTPUT"))) return False 
elif os.path.isfile(self.collection_path) and os.path.basename(self.collection_path) == "kernel_details.csv": self.kernel_details_path = self.collection_path else: - print("[ERROR] Please input ascend_pt or kernel_details.csv") + logger.error("Please input ascend_pt or kernel_details.csv") return False - print("[INFO] Start to analyse the target file: {}".format(self.kernel_details_path)) + logger.info("Start to analyse the target file: {}".format(self.kernel_details_path)) self.preparse() return True def has_callstack(self): - if self.call_stack is not None: - return self.call_stack profiler_info_json_path = "" for file in os.listdir(self.collection_path): if file.startswith("profiler_info"): profiler_info_json_path = os.path.join(self.collection_path, file) break if not profiler_info_json_path: - self.call_stack = False return self.call_stack self.trace_view_path = os.path.join(self.collection_path, self.ASCEND_PROFILER_OUTPUT, "trace_view.json") if not os.path.exists(profiler_info_json_path) or not os.path.exists(self.trace_view_path): - self.call_stack = False return self.call_stack info = FileManager.read_json_file(profiler_info_json_path) if not info.get("config") or not info.get("config").get("common_config") \ or not info.get("config").get("common_config").get("with_stack"): - self.call_stack = False return self.call_stack activities = info.get("config").get("common_config").get("activities") if not activities or "ProfilerActivity.CPU" not in activities: - self.call_stack = False return self.call_stack self.call_stack = info.get("config").get("common_config").get("with_stack") return self.call_stack diff --git a/profiler/advisor/advisor_backend/compute_advice/npu_fused/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/compute_advice/npu_fused/__init__.py rename to 
profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/__init__.py diff --git a/profiler/advisor/advisor_backend/compute_advice/npu_fused/csv_analyzer.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/csv_analyzer.py similarity index 95% rename from profiler/advisor/advisor_backend/compute_advice/npu_fused/csv_analyzer.py rename to profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/csv_analyzer.py index 8ae109856a3b8719d8d7fef2ff15ca7a1270eccc..9a63f8e860f5d34e127586ceb006524063970b81 100644 --- a/profiler/advisor/advisor_backend/compute_advice/npu_fused/csv_analyzer.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/csv_analyzer.py @@ -13,37 +13,16 @@ # See the License for the specific language governing permissions and # limitations under the License. -import multiprocessing - import pandas as pd -import numpy as np -from common_func.path_manager import PathManager -from common_func_advisor.constant import Constant -from .op_perf import OpPerfFactory +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant class CSVAnalyzer: def __init__(self, path) -> None: self._path = path - def process(self): - PathManager.check_path_readable(self._path) - df = pd.read_csv(self._path, dtype={"Start Time(us)": str}) - # 分析是否存在可融合的算子 - op_type_list = df["Type"].tolist() - duration_list = df["Duration(us)"].tolist() - start_times = df["Start Time(us)"].tolist() - # 去除末尾的\t分隔符 - start_times = [start_time[:-1] for start_time in start_times] - result_list = [] - for pattern in Constant.PATTERN_DICT.keys(): - result_list.extend(self.find_all_sub_lists(op_type_list, duration_list, start_times, pattern)) - data_frame = pd.DataFrame(result_list) - data_frame.columns = ["pattern_name", "pattern", "len", "count", "duration sum(us)", "op durations(us)", - "index", "first_timestamp"] - return 
data_frame - @staticmethod def find_all_sub_lists(op_type_list, duration_list, start_times, expect_sub_list): # 创建一个空字典,用来存储子列表和它们的出现次数和起始位置 @@ -81,3 +60,20 @@ class CSVAnalyzer: repeated_sublists.append([pattern_name, expect_sub_list, 0, 0, 0, 0, 0, 0]) # 返回所有重复的子列表 return repeated_sublists + + def process(self): + PathManager.check_path_readable(self._path) + df = pd.read_csv(self._path, dtype={"Start Time(us)": str}) + # Check whether any fusible operator patterns exist + op_type_list = df["Type"].tolist() + duration_list = df["Duration(us)"].tolist() + start_times = df["Start Time(us)"].tolist() + # Strip the trailing \t separator + start_times = [start_time[:-1] for start_time in start_times] + result_list = [] + for pattern in Constant.PATTERN_DICT.keys(): + result_list.extend(self.find_all_sub_lists(op_type_list, duration_list, start_times, pattern)) + data_frame = pd.DataFrame(result_list) + data_frame.columns = ["pattern_name", "pattern", "len", "count", "duration sum(us)", "op durations(us)", + "index", "first_timestamp"] + return data_frame diff --git a/profiler/advisor/advisor_backend/compute_advice/npu_fused/json_analyzer.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/json_analyzer.py similarity index 83% rename from profiler/advisor/advisor_backend/compute_advice/npu_fused/json_analyzer.py rename to profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/json_analyzer.py index fd2a72ffa39bfde1b3e59450c6d76f51d98110d9..d47d6d8ad34352edab2a2539dcb5b9a79580a399 100644 --- a/profiler/advisor/advisor_backend/compute_advice/npu_fused/json_analyzer.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/json_analyzer.py @@ -13,9 +13,13 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+import logging + import pandas as pd -from common_func_advisor.trace_view_json import TraceViewJson +from msprof_analyze.advisor.advisor_backend.common_func_advisor.trace_view_json import TraceViewJson + +logger = logging.getLogger() class JSONAnalyzer(object): @@ -28,18 +32,18 @@ class JSONAnalyzer(object): for i, row in data.iterrows(): if ts_col not in data.columns.tolist(): - print("[ERROR] No {} col found in data columns.".format(ts_col)) + logger.error("No {} col found in data columns.".format(ts_col)) return callstacks timestamp = row[ts_col] flow_event = trace_json.get_torch_2_npu_flow_event(timestamp) if not flow_event.valid(): - print("[ERROR] Get flow event failed for pattern {}.".format(row['pattern'])) + logger.error("Get flow event failed for pattern {}.".format(row['pattern'])) callstacks.loc[i] = "" continue flow_event_s_key = flow_event.s_point_ts python_dur_events = trace_json.get_python_dur_events_contain_ts(flow_event_s_key) if not python_dur_events: - print("[ERROR] No python dur event found for pattern {}.".format(row['pattern'])) + logger.error("No python dur event found for pattern {}.".format(row['pattern'])) callstacks.loc[i] = "" continue # 保持新老版本callstack兼容性 diff --git a/profiler/advisor/advisor_backend/compute_advice/npu_fused/op_perf.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/op_perf.py similarity index 89% rename from profiler/advisor/advisor_backend/compute_advice/npu_fused/op_perf.py rename to profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/op_perf.py index 7bcbed5a75807b57a55787c743cfaaff55a68589..b8f5ef42850d98fb5fc44acaf5289b0c61ab84d0 100644 --- a/profiler/advisor/advisor_backend/compute_advice/npu_fused/op_perf.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused/op_perf.py @@ -14,10 +14,13 @@ # limitations under the License. 
import functools from typing import Dict +import logging -from common_func_advisor.constant import Constant -from common_func_advisor.constant import CoreType -from common_func_advisor.constant import PerfColor +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import CoreType +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import PerfColor + +logger = logging.getLogger() class OpPerfFactory: @@ -129,7 +132,7 @@ class OpPerf: shapes = self.shape_to_tuple(shapes_str) dtypes = self.dtype_to_tuple(dtypes_str) if len(shapes) > len(dtypes): - print(f"[ERROR] The size of shape is greater than that of dtypes.") + logger.error("The size of shape is greater than that of dtypes.") return 0 if len(shapes) < len(dtypes): shapes = list(shapes) @@ -144,20 +147,22 @@ class OpPerf: def get_calc_size(self): # input and output bytes (MB) if not self.input_shapes or not self.output_shapes: - print("[ERROR] There is no tensor data, do not assess vector op performance.") + logger.error("There is no tensor data, do not assess vector op performance.") return 0 intput_size = self.get_size(self.input_shapes, self.input_data_types) output_size = self.get_size(self.output_shapes, self.output_data_types) return (intput_size + output_size) / (Constant.BYTE_UNIT_TRANS * Constant.BYTE_UNIT_TRANS) def get_throughput(self): - # throughput(GB/s) + # throughput bytes (GB/s) if not self.task_duration or abs(self.task_duration) < 1e-6: - print("[ERROR] There is no task_duration, do not assess vector op performance.") + logger.error("There is no task_duration, do not assess vector op performance.") return 0 - return self.row[Constant.TITLE.SIZE] / Constant.BYTE_UNIT_TRANS / self.task_duration * Constant.UNIT_TRANS * Constant.UNIT_TRANS + return (self.row[Constant.TITLE.SIZE] / + Constant.BYTE_UNIT_TRANS / self.task_duration * Constant.UNIT_TRANS * 
Constant.UNIT_TRANS) def get_perf_color(self): + row = self.row return PerfColor.WHITE def update(self): @@ -186,7 +191,7 @@ class CubeOpPerf(OpPerf): def get_perf_color(self) -> PerfColor: aic_mac_ratio = self.get_mac_ratio() if not aic_mac_ratio: - print("[WARNING] There is no aic_mac_ratio, do not assess cube op performance.") + logger.warning("There is no aic_mac_ratio, do not assess cube op performance.") return PerfColor.WHITE elif aic_mac_ratio < 0.6: return PerfColor.RED diff --git a/profiler/advisor/advisor_backend/compute_advice/npu_fused_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused_advice.py similarity index 85% rename from profiler/advisor/advisor_backend/compute_advice/npu_fused_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused_advice.py index fd5610bbbbb98d15fbab22bb646b2dd7de36ac3d..349ede99059ffa98e0fb52888679e01cb15d2917 100644 --- a/profiler/advisor/advisor_backend/compute_advice/npu_fused_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_fused_advice.py @@ -13,14 +13,16 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-import os +import logging from abc import ABC import pandas as pd -from compute_advice.compute_advice_base import ComputeAdviceBase -from compute_advice.npu_fused.csv_analyzer import CSVAnalyzer -from compute_advice.npu_fused.json_analyzer import JSONAnalyzer +from msprof_analyze.advisor.advisor_backend.compute_advice.compute_advice_base import ComputeAdviceBase +from msprof_analyze.advisor.advisor_backend.compute_advice.npu_fused.csv_analyzer import CSVAnalyzer +from msprof_analyze.advisor.advisor_backend.compute_advice.npu_fused.json_analyzer import JSONAnalyzer + +logger = logging.getLogger() class NpuFusedAdvice(ComputeAdviceBase, ABC): @@ -46,7 +48,7 @@ class NpuFusedAdvice(ComputeAdviceBase, ABC): all_pattern_data = all_pattern_data.sort_values(by='duration sum(us)', ascending=False) filter_data = all_pattern_data.get(all_pattern_data.get("duration sum(us)", 0) > 0) if not self.has_callstack(): - print("[Warning] No call stack info found, advice will be incomplete") + logger.warning("No call stack info found, advice will be incomplete") self.cur_data = filter_data else: json_analyzer = JSONAnalyzer(self.trace_view_path) diff --git a/profiler/advisor/advisor_backend/compute_advice/npu_slow_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_slow_advice.py similarity index 84% rename from profiler/advisor/advisor_backend/compute_advice/npu_slow_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_slow_advice.py index 48522cf55a4cfb3f89083c3ac69ec7b22b295195..5f2f123fb6867b6b8fc48e8049f6687c5c9369d9 100644 --- a/profiler/advisor/advisor_backend/compute_advice/npu_slow_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/compute_advice/npu_slow_advice.py @@ -15,15 +15,15 @@ from abc import ABC import os import multiprocessing +import logging import pandas as pd -from common_func.path_manager import PathManager -from compute_advice.compute_advice_base import ComputeAdviceBase -from 
compute_advice.npu_fused.op_perf import OpPerfFactory -from common_func_advisor.constant import Constant -from common_func_advisor.constant import PerfColor -from advisor_backend.common_func_advisor.trace_view_json import TraceViewJson +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.advisor.advisor_backend.compute_advice.compute_advice_base import ComputeAdviceBase +from msprof_analyze.advisor.advisor_backend.compute_advice.npu_fused.op_perf import OpPerfFactory +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant, PerfColor +from msprof_analyze.advisor.advisor_backend.common_func_advisor.trace_view_json import TraceViewJson class NpuSlowAdvice(ComputeAdviceBase, ABC): @@ -64,7 +64,7 @@ class NpuSlowAdvice(ComputeAdviceBase, ABC): def get_call_stack(self, data: pd.DataFrame, index_id: int, ts_col: str) -> str: if not self.has_callstack(): - print("There is no call stack info, please set 'with_stack=True'") + logging.warning("There is no call stack info, please set 'with_stack=True'") return "" trace_json = TraceViewJson(self.trace_view_path) return trace_json.get_call_stack(data, index_id, ts_col) diff --git a/profiler/advisor/advisor_backend/interface.py b/profiler/msprof_analyze/advisor/advisor_backend/interface.py similarity index 70% rename from profiler/advisor/advisor_backend/interface.py rename to profiler/msprof_analyze/advisor/advisor_backend/interface.py index deb68822ec4ac025e6cf647cd031415618cc415e..f4e84f64e5afb2f94225eccd1f9e7a0dd0091589 100644 --- a/profiler/advisor/advisor_backend/interface.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/interface.py @@ -13,19 +13,12 @@ # See the License for the specific language governing permissions and # limitations under the License. 
import os -import sys - -sys.path.append( - os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "advisor_backend")) -sys.path.append( - os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))), "compare_tools")) -sys.path.append( - os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))), "cluster_analyse")) -from common_func_advisor.constant import Constant -from advisor_backend.advice_factory.cluster_advice_factory import ClusterAdviceFactory -from advisor_backend.advice_factory.compute_advice_factory import ComputeAdviceFactory -from advisor_backend.advice_factory.timeline_advice_factory import TimelineAdviceFactory -from advisor_backend.advice_factory.overall_advice_factory import OverallAdviceFactory + +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.advice_factory.cluster_advice_factory import ClusterAdviceFactory +from msprof_analyze.advisor.advisor_backend.advice_factory.compute_advice_factory import ComputeAdviceFactory +from msprof_analyze.advisor.advisor_backend.advice_factory.timeline_advice_factory import TimelineAdviceFactory +from msprof_analyze.advisor.advisor_backend.advice_factory.overall_advice_factory import OverallAdviceFactory class Interface: diff --git a/profiler/msprof_analyze/advisor/advisor_backend/logger.py b/profiler/msprof_analyze/advisor/advisor_backend/logger.py new file mode 100644 index 0000000000000000000000000000000000000000..5fb8b5d407d1cf139be636eea6925d4d064d44f2 --- /dev/null +++ b/profiler/msprof_analyze/advisor/advisor_backend/logger.py @@ -0,0 +1,38 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import logging + + +class Logger: + def __init__(self): + if not hasattr(self, 'logger'): + self.logger = logging.getLogger('singleton_logger') + self.logger.setLevel(logging.INFO) + + def info(self, message): + self.logger.info(message) + + def debug(self, message): + self.logger.debug(message) + + def warning(self, message): + self.logger.warning(message) + + def error(self, message): + self.logger.error(message) + + def critical(self, message): + self.logger.critical(message) diff --git a/profiler/advisor/analyzer/communication/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/overall_advice/__init__.py similarity index 100% rename from profiler/advisor/analyzer/communication/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/overall_advice/__init__.py diff --git a/profiler/advisor/advisor_backend/overall_advice/overall_summary_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/overall_advice/overall_summary_advice.py similarity index 81% rename from profiler/advisor/advisor_backend/overall_advice/overall_summary_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/overall_advice/overall_summary_advice.py index f5bfc351f2820ac8d797798fd959577da8062ea4..979fcf7246eccafda9fbb1d9e3565ac89b9b7420 100644 --- a/profiler/advisor/advisor_backend/overall_advice/overall_summary_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/overall_advice/overall_summary_advice.py @@ -13,35 +13,17 @@ # See the License for the specific language governing permissions and # limitations under the License. 
import os +import logging -from advisor_backend.advice_base import AdviceBase -from compare_backend.utils.constant import Constant -from compare_interface.comparison_interface import ComparisonInterface +from msprof_analyze.compare_tools.compare_interface.comparison_interface import ComparisonInterface +from msprof_analyze.advisor.advisor_backend.advice_base import AdviceBase +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.prof_common.constant import Constant + +logger = logging.getLogger() class OverallSummaryAdvice(AdviceBase): - advice_map = { - "Computing Time": "if you want more detailed advice please use msprof-analyze advisor computation.", - "Uncovered Communication Time": "if you want more detailed advice, please use msprof-analyze advisor schedule.", - "Free Time": "if you want more detailed advice please use msprof-analyze advisor schedule." - } - time_name_map = { - "Computing Time": "computing", - "Uncovered Communication Time": "communication", - "Free Time": "free", - 'Cube Time(Num)': 'Cube Time', - 'Vector Time(Num)': 'Vector Time', - 'Flash Attention Time(Forward)(Num)': 'Flash Attention Time(Forward)', - 'Flash Attention Time(Backward)(Num)': 'Flash Attention Time(Backward)', - 'Other Time': "Other Computing Time", - 'SDMA Time(Num)': 'SDMA Time' - } - performance_time_dict = { - "Computing Time": ['Cube Time(Num)', 'Vector Time(Num)', 'Flash Attention Time(Forward)(Num)', - 'Flash Attention Time(Backward)(Num)', 'Other Time'], - "Uncovered Communication Time(Wait Time)": [], - "Free Time": ['SDMA Time(Num)'] - } def __init__(self, collection_path: str, kwargs: dict): super().__init__(collection_path) @@ -55,6 +37,11 @@ class OverallSummaryAdvice(AdviceBase): self._base_data = [] self._comparison_data = [] + self.prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) + self.advice_map = self.prompt_class.ADVICE_MAP + self.time_name_map = self.prompt_class.TIME_NAME_MAP + 
self.performance_time_dict = self.prompt_class.PERFORMANCE_TIME_DICT + @staticmethod def split_duration_and_num(time_value: str) -> tuple: split_data = time_value.split("s") # time value example: 0.229s(1756) @@ -68,7 +55,7 @@ class OverallSummaryAdvice(AdviceBase): try: duration = float(split_data[0]) except ValueError: - print(f"[WARNING] Invalid time value: {time_value}.") + logger.warning(f"Invalid time value: {time_value}.") return duration, num @staticmethod @@ -89,7 +76,7 @@ class OverallSummaryAdvice(AdviceBase): if os.path.exists(self.base_collection_path): self._has_base_collection = True else: - print(f"[WARNING] Invalid path which not exists: {self.base_collection_path}.") + logger.warning(f"Invalid path which not exists: {self.base_collection_path}.") return os.path.exists(self.collection_path) def process(self): diff --git a/profiler/advisor/advisor_backend/prof_bean_advisor/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/prof_bean_advisor/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/prof_bean_advisor/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/prof_bean_advisor/__init__.py diff --git a/profiler/advisor/advisor_backend/prof_bean_advisor/cluster_step_trace_time_bean.py b/profiler/msprof_analyze/advisor/advisor_backend/prof_bean_advisor/cluster_step_trace_time_bean.py similarity index 100% rename from profiler/advisor/advisor_backend/prof_bean_advisor/cluster_step_trace_time_bean.py rename to profiler/msprof_analyze/advisor/advisor_backend/prof_bean_advisor/cluster_step_trace_time_bean.py diff --git a/profiler/advisor/advisor_backend/timeline_advice/__init__.py b/profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/__init__.py similarity index 100% rename from profiler/advisor/advisor_backend/timeline_advice/__init__.py rename to profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/__init__.py diff --git 
a/profiler/advisor/advisor_backend/timeline_advice/op_schedule_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/op_schedule_advice.py similarity index 75% rename from profiler/advisor/advisor_backend/timeline_advice/op_schedule_advice.py rename to profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/op_schedule_advice.py index 9e492b2156c6faee6c023206f3cfc4f852eeb547..ac163a794863b3837deb006cf26dbc27579e3b74 100644 --- a/profiler/advisor/advisor_backend/timeline_advice/op_schedule_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/op_schedule_advice.py @@ -12,9 +12,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import logging from decimal import Decimal -from common_func_advisor.constant import Constant -from timeline_advice.timeline_advice_base import TimelineAdviceBase + +from msprof_analyze.advisor.advisor_backend.common_func_advisor.constant import Constant +from msprof_analyze.advisor.advisor_backend.timeline_advice.timeline_advice_base import TimelineAdviceBase + +logger = logging.getLogger() class OpScheduleAdvice(TimelineAdviceBase): @@ -33,10 +37,10 @@ class OpScheduleAdvice(TimelineAdviceBase): return self.output_format_data def process(self): - cpt_data = self.preparse_data[self.PREPARSE_TYPE.OVERLAP_CPT] - free_data = self.preparse_data[self.PREPARSE_TYPE.OVERLAP_FREE] + cpt_data = self.preparse_data[self.PreParseType.OVERLAP_CPT] + free_data = self.preparse_data[self.PreParseType.OVERLAP_FREE] if not cpt_data or not free_data: - print("[ERROR] Fail to find Overlap data.") + logger.error("Fail to find Overlap data.") return op_dur = [entry.get("dur", 0) for entry in cpt_data] @@ -44,7 +48,7 @@ class OpScheduleAdvice(TimelineAdviceBase): merge_data = list() merge_data.extend(cpt_data) merge_data.extend(free_data) - merge_data.sort(key=lambda x : 
Decimal(x.get("ts"))) + merge_data.sort(key=lambda x: Decimal(x.get("ts"))) idx = free_idx = 0 while idx < len(merge_data) and free_idx < len(op_free): entry = merge_data[idx] @@ -60,9 +64,10 @@ class OpScheduleAdvice(TimelineAdviceBase): if free_ratio < 0.2: return self.cur_bottleneck = f"NPU Utilication: {round(free_ratio * 100, 2)}%, " \ - f"NPU Free Utilization: {round(cpt_ratio * 100, 2)}%." - if len(self.preparse_data[self.PREPARSE_TYPE.SYNCHRONIZE]) > 1: - self.cur_advice = f"Device synchronize {len(self.preparse_data[self.PREPARSE_TYPE.SYNCHRONIZE])} times, " \ + f"NPU Free Utilization: {round(cpt_ratio * 100, 2)}%." + if len(self.preparse_data[self.PreParseType.SYNCHRONIZE]) > 1: + self.cur_advice = \ + f"Device synchronize {len(self.preparse_data[self.PreParseType.SYNCHRONIZE])} times, " \ "try to reduce synchronization statements to alleviate the bottleneck of operator delivery.\n" small_op_num = self.small_op_block(op_free, op_dur) small_op_ratio = small_op_num / len(op_dur) if op_dur else 0.0 @@ -77,9 +82,9 @@ class OpScheduleAdvice(TimelineAdviceBase): return small_op_num def get_ratio(self): - cpt_data = self.preparse_data[self.PREPARSE_TYPE.OVERLAP_CPT] - free_data = self.preparse_data[self.PREPARSE_TYPE.OVERLAP_FREE] - cmu_data = self.preparse_data[self.PREPARSE_TYPE.OVERLAP_CMU] + cpt_data = self.preparse_data[self.PreParseType.OVERLAP_CPT] + free_data = self.preparse_data[self.PreParseType.OVERLAP_FREE] + cmu_data = self.preparse_data[self.PreParseType.OVERLAP_CMU] cpt_time = sum([x.get("dur", 0) for x in cpt_data]) free_time = sum([x.get("dur", 0) for x in free_data]) cmu_time = sum([x.get("dur", 0) for x in cmu_data]) diff --git a/profiler/advisor/advisor_backend/timeline_advice/optimizer_advice.py b/profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/optimizer_advice.py similarity index 80% rename from profiler/advisor/advisor_backend/timeline_advice/optimizer_advice.py rename to 
profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/optimizer_advice.py index dee2e7ba563d0d00b4459333dffb4099dee9240a..6a01b1694eab10206eed3cd0e0e36f1dba8a61aa 100644 --- a/profiler/advisor/advisor_backend/timeline_advice/optimizer_advice.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/optimizer_advice.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -from timeline_advice.timeline_advice_base import TimelineAdviceBase +from msprof_analyze.advisor.advisor_backend.timeline_advice.timeline_advice_base import TimelineAdviceBase class OptimizerAdvice(TimelineAdviceBase): @@ -44,12 +44,14 @@ class OptimizerAdvice(TimelineAdviceBase): return self.output_format_data def process(self): - if not self.preparse_data[self.PREPARSE_TYPE.OPTIMIZER]: + if not self.preparse_data[self.PreParseType.OPTIMIZER]: return - self.cur_data = list(set([entry.get("name", None) for entry in self.preparse_data[self.PREPARSE_TYPE.OPTIMIZER]])) + self.cur_data = list(set([entry.get("name", None) \ + for entry in self.preparse_data[self.PreParseType.OPTIMIZER]])) for index, opt_name in enumerate(self.cur_data): - self.cur_advice += f"You can choose {self.OPTIMIZER_MAP.get(opt_name)} to replace the current Optimizer: {opt_name}." + self.cur_advice += \ + f"You can choose {self.OPTIMIZER_MAP.get(opt_name)} to replace the current Optimizer: {opt_name}." 
if index != len(self.cur_data) - 1: self.cur_advice += "\n" self.cur_bottleneck = self.cur_advice diff --git a/profiler/advisor/advisor_backend/timeline_advice/timeline_advice_base.py b/profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/timeline_advice_base.py similarity index 70% rename from profiler/advisor/advisor_backend/timeline_advice/timeline_advice_base.py rename to profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/timeline_advice_base.py index 4c7ac96cd22673741accd6bb2abb463566a2e652..36c5e1e2dea6dcd41f3199e1c542ae4eebde1663 100644 --- a/profiler/advisor/advisor_backend/timeline_advice/timeline_advice_base.py +++ b/profiler/msprof_analyze/advisor/advisor_backend/timeline_advice/timeline_advice_base.py @@ -13,17 +13,20 @@ # See the License for the specific language governing permissions and # limitations under the License. +import logging +import os from abc import abstractmethod from collections import defaultdict -import json -import os -from advice_base import AdviceBase -from common_func.file_manager import FileManager +from msprof_analyze.advisor.advisor_backend.advice_base import AdviceBase +from msprof_analyze.prof_common.file_manager import FileManager + +logger = logging.getLogger() +logger.setLevel(logging.INFO) class TimelineAdviceBase(AdviceBase): - class PREPARSE_TYPE: + class PreParseType: OPTIMIZER = 0 STEP = 1 OVERLAP_CPT = 2 @@ -40,9 +43,9 @@ class TimelineAdviceBase(AdviceBase): self.has_preparse = False self.preparse_data = defaultdict(list) self.entry_map = { - 'Computing': self.PREPARSE_TYPE.OVERLAP_CPT, - 'Free': self.PREPARSE_TYPE.OVERLAP_FREE, - 'AscendCL@aclrtSynchronizeDevice': self.PREPARSE_TYPE.SYNCHRONIZE + 'Computing': self.PreParseType.OVERLAP_CPT, + 'Free': self.PreParseType.OVERLAP_FREE, + 'AscendCL@aclrtSynchronizeDevice': self.PreParseType.SYNCHRONIZE } def path_check(self): @@ -50,19 +53,21 @@ class TimelineAdviceBase(AdviceBase): check whether input path is valid """ if not 
os.path.exists(self.collection_path): - print("[ERROR] Path: {} is not exist.".format(self.collection_path)) + logger.error("Path: %s does not exist.", str(self.collection_path)) return False - if os.path.isdir(self.collection_path) and self.collection_path.endswith("ascend_pt"): + if os.path.isdir(self.collection_path) and \ + (self.collection_path.endswith("ascend_pt") or self.collection_path.endswith("ascend_ms")): self.trace_view_path = os.path.join(self.collection_path, "ASCEND_PROFILER_OUTPUT", "trace_view.json") if not os.path.exists(self.trace_view_path): - print("[ERROR] trace_view.json is not exist in the Path: {}.".format(os.path.join(self.collection_path, "ASCEND_PROFILER_OUTPUT"))) + logger.error("trace_view.json does not exist in the path: %s.", + str(os.path.join(self.collection_path, "ASCEND_PROFILER_OUTPUT"))) return False elif os.path.isfile(self.collection_path) and os.path.basename(self.collection_path) == "trace_view.json": self.trace_view_path = self.collection_path else: - print("[ERROR] Please input ascend_pt or trace_view.json.") + logger.error("Please input ascend_pt or trace_view.json.") return False - print("[INFO] Start to analyse the target file: {}".format(self.trace_view_path)) + logger.info("Start to analyse the target file: %s", str(self.trace_view_path)) return True @abstractmethod @@ -91,9 +96,9 @@ class TimelineAdviceBase(AdviceBase): if not name: continue if name.startswith("Optimizer.step#") and name.endswith(".step"): - self.preparse_data[self.PREPARSE_TYPE.OPTIMIZER].append(entry) + self.preparse_data[self.PreParseType.OPTIMIZER].append(entry) elif name.startswith("ProfilerStep#"): - self.preparse_data[self.PREPARSE_TYPE.STEP].append(entry) + self.preparse_data[self.PreParseType.STEP].append(entry) elif name in self.entry_map: self.preparse_data[self.entry_map[name]].append(entry) self.has_preparse = True diff --git a/profiler/advisor/analyzer/communication/bandwidth/__init__.py
b/profiler/msprof_analyze/advisor/analyzer/__init__.py similarity index 100% rename from profiler/advisor/analyzer/communication/bandwidth/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/__init__.py diff --git a/profiler/advisor/analyzer/analyzer_controller.py b/profiler/msprof_analyze/advisor/analyzer/analyzer_controller.py similarity index 90% rename from profiler/advisor/analyzer/analyzer_controller.py rename to profiler/msprof_analyze/advisor/analyzer/analyzer_controller.py index 711035adfdf46b858fa7164d4244dc20e8e49eb7..bde9e5cd3454a85853a6fbcfbd0ade060ebc229b 100644 --- a/profiler/advisor/analyzer/analyzer_controller.py +++ b/profiler/msprof_analyze/advisor/analyzer/analyzer_controller.py @@ -24,23 +24,21 @@ from pathlib import Path import psutil -sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), "compare_tools")) -sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), "cluster_analyse")) - -from profiler.advisor.analyzer.cluster.slow_rank_analyzer import SlowRankAnalyzer -from profiler.advisor.analyzer.cluster.slow_link_analyzer import SlowLinkAnalyzer -from profiler.advisor.analyzer.computation.pp_stage_computation_analyzer import PPStageComputationAnalyzer -from profiler.advisor.analyzer.overall.overall_summary_analyzer import OverallSummaryAnalyzer -from profiler.advisor.config.config import Config -from profiler.advisor.common import constant as const -from profiler.advisor.common.analyzer_scopes import SupportedScopes -from profiler.advisor.common.async_analysis_status import AsyncAnalysisStatus -from profiler.advisor.common.enum_params_parser import EnumParamsParser -from profiler.advisor.utils.utils import Timer, safe_index_value, safe_division, safe_index, convert_to_int -from profiler.advisor.interface.interface import Interface -from profiler.cluster_analyse.cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor -from 
profiler.prof_common.path_manager import PathManager -from profiler.compare_tools.compare_backend.utils.constant import Constant as CompareConstant +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.advisor.analyzer.cluster.slow_rank_analyzer import SlowRankAnalyzer +from msprof_analyze.advisor.analyzer.cluster.slow_link_analyzer import SlowLinkAnalyzer +from msprof_analyze.advisor.analyzer.computation.pp_stage_computation_analyzer import PPStageComputationAnalyzer +from msprof_analyze.advisor.analyzer.overall.overall_summary_analyzer import OverallSummaryAnalyzer +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes +from msprof_analyze.advisor.common.async_analysis_status import AsyncAnalysisStatus +from msprof_analyze.advisor.common.enum_params_parser import EnumParamsParser +from msprof_analyze.advisor.utils.utils import Timer, safe_index_value, safe_division, safe_index, convert_to_int +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.cluster_analyse.cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor +from msprof_analyze.cluster_analyse.cluster_data_preprocess.mindspore_data_preprocessor import MindsporeDataPreprocessor +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.prof_common.constant import Constant # 以spawn模式启动多进程,避免fork主进程资源。如果主进程逻辑较为复杂,fork可能会导致异常。 mp.set_start_method("spawn", force=True) @@ -144,7 +142,7 @@ class AsyncParams: class AnalyzerController: CLUSTER_RANK_THRESHOLD = 2 - SDMA_SUPPORT_SCOPES = [SupportedScopes.BANDWIDTH_CONTENTION_DETECTION] + SDMA_SUPPORT_SCOPES = [SupportedScopes.BANDWIDTH_CONTENTION_DETECTION, SupportedScopes.BYTE_ALIGNMENT_DETECTION] RDMA_SUPPORT_SCOPES = [SupportedScopes.PACKET] COMMUNICATION_MAPPING = { SlowLinkAnalyzer.SDMA: SDMA_SUPPORT_SCOPES, @@ -154,6 +152,7 @@ class 
AnalyzerController: def __init__(self): self.dimensions = Interface.all_dimension self.kwargs = {} + self.args_manager = None self.slow_rank_analyzer = None self.slow_link_analyzer = None self.cluster_local_data_map = {} @@ -184,28 +183,6 @@ class AnalyzerController: return True - @staticmethod - def _whether_include_mindspore_prof(profiling_path): - # 暂不支持Mindspore数据,支持后可删除该限制 - ASCEND_MS = "ascend_ms" - - has_ascend_ms_dirs = False - for root, dirs, _ in os.walk(profiling_path): - if root.endswith(ASCEND_MS): - has_ascend_ms_dirs = True - break - for dir_name in dirs: - if dir_name.endswith(ASCEND_MS): - has_ascend_ms_dirs = True - break - if has_ascend_ms_dirs: - break - - if has_ascend_ms_dirs: - logger.error("Advisor does not support data from MindSpore now, existing dirs end with 'ascend_ms'") - return True - - return False @staticmethod def _get_step_rank_for_cluster_statistic_diff(target_cluster_statistic_data, benchmark_cluster_statistic_data, @@ -278,6 +255,8 @@ class AnalyzerController: def do_analysis(self, dimensions, **kwargs): pid = os.getpid() resp = {"id": pid} + self.args_manager = AdditionalArgsManager() + self.args_manager.init(kwargs) output_path = kwargs.get("output_path") AnalyzerController._set_analysis_process_priority(pid) @@ -291,7 +270,7 @@ class AnalyzerController: PathManager.check_input_directory_path(output_path) if os.path.exists(output_path): - PathManager.check_path_owner_consistent(output_path) + PathManager.check_path_owner_consistent([output_path]) else: PathManager.make_dir_safety(output_path) @@ -308,14 +287,15 @@ class AnalyzerController: def async_do_analysis(self, dimensions, **kwargs): """ Deploy a online service to start async analysis job, wrap this api by flask or tornado and so on, then could query the analysis status by restful api. - You can view file 'profiler/advisor/config/enum_parameters.yaml' to obtain detailed information for - all the args listed below. 
+ You can view file 'profiler/msprof_analyze/advisor/config/enum_parameters.yaml' to obtain detailed + information for all the args listed below. Args: dimensions: analysis dimension, normally set as Interface.all_dimension, support specific dimension analysis such as ['computation'] or ['computation', 'schedule'] cann_version: cann version of your runtime, inpact on the analysis of affinity api and AICPU operators - torch_version: torch version of your runtime, inpact on the analysis of affinity api + profiling_type: profiling type of your runtime + profiling_version: profiling version of your runtime, impact on the analysis of affinity api analysis_dimensions: can overwite dimensions. advisor_analyze_processes: number of processes to use while the training params pipeline parallel(pp) >1, can reduce the time of analysis. @@ -330,7 +310,9 @@ class AnalyzerController: >>> analyzer_controller = AnalyzerController() >>> analysis_kwargs = dict(advisor_analyze_processes=2, disable_profiling_comparison=True) >>> - >>> async_analysis_process = analyzer_controller.async_do_analysis(Interface.all_dimension, **analysis_kwargs) + >>> async_analysis_process = analyzer_controller.async_do_analysis( + >>> Interface.all_dimension, **analysis_kwargs) + >>> + >>> >>> >>> # query the job status every second >>> while True: @@ -393,9 +375,9 @@ class AnalyzerController: # kernel/api 比对 compare_profiling_list = [ dict(profiling_path=profiling_path, benchmark_profiling_path=benchmark_profiling_path, - compare_mode=CompareConstant.KERNEL_COMPARE), + compare_mode=Constant.KERNEL_COMPARE), dict(profiling_path=profiling_path, benchmark_profiling_path=benchmark_profiling_path, - compare_mode=CompareConstant.API_COMPARE) + compare_mode=Constant.API_COMPARE) ] job_list += self._profiling_comparison(compare_profiling_list) @@ -423,8 +405,8 @@ class AnalyzerController: return job_list def overall(self, profiling_path): - from profiler.advisor.analyzer.overall.environment_variable_analyzer import
EnvironmentVariabelAnalyzer - env_analyzer = EnvironmentVariabelAnalyzer(profiling_path) + from msprof_analyze.advisor.analyzer.overall.environment_variable_analyzer import EnvironmentVariableAnalyzer + env_analyzer = EnvironmentVariableAnalyzer(profiling_path) env_analyzer.optimize() if self._is_cluster: @@ -541,7 +523,7 @@ class AnalyzerController: benchmark_profiling_path=self._get_profiling_path_by_rank(profiling_path, fast_rank_id), step=slow_step, benchmark_step=fast_step, rank=slow_rank_id, benchmark_rank=fast_rank_id, - compare_mode=CompareConstant.API_COMPARE, + compare_mode=Constant.API_COMPARE, step_duration=self.slow_rank_analyzer.get_step_duration(slow_rank_id, slow_step)) job_list += self.schedule_analysis(**kwargs) @@ -632,8 +614,10 @@ class AnalyzerController: result_list = [] profiling_path = PathManager.get_realpath(self.kwargs.get("profiling_path")) benchmark_profiling_path = self.kwargs.get("benchmark_profiling_path") + PathManager.check_path_owner_consistent([profiling_path]) if benchmark_profiling_path: benchmark_profiling_path = PathManager.get_realpath(benchmark_profiling_path) + PathManager.check_path_owner_consistent([benchmark_profiling_path]) if not self._check_profiling_path_valid(profiling_path): error_msg = f"Got invalid argument '-d/--profiling_path' {profiling_path}, skip analysis" @@ -643,14 +627,6 @@ class AnalyzerController: logger.error(error_msg) return - # 暂不支持Mindspore数据,支持后可删除该限制 - if self._whether_include_mindspore_prof(profiling_path): - error_msg = f"Got *_ascend_ms dirs from {profiling_path}, skip analysis" - self._update_analysis_process_resp(pid, async_resp, error_msg=error_msg, - status_code=AsyncAnalysisStatus.FAILED_STATUS_CODE, - status=AsyncAnalysisStatus.FAILED) - logger.error(error_msg) - return if benchmark_profiling_path and not self._check_profiling_path_valid(benchmark_profiling_path): error_msg = (f"Got invalid argument '-bp/--benchmark_profiling_path' {benchmark_profiling_path}, " @@ -678,8 +654,8 @@ class 
AnalyzerController: if not self._is_cluster: job_list = self.single_rank_analysis(profiling_path, benchmark_profiling_path) else: - self.slow_rank_analyzer = SlowRankAnalyzer(profiling_path) - self.slow_link_analyzer = SlowLinkAnalyzer(profiling_path) + self.slow_rank_analyzer = SlowRankAnalyzer(profiling_path, output_path=self.kwargs.get("output_path")) + self.slow_link_analyzer = SlowLinkAnalyzer(profiling_path, output_path=self.kwargs.get("output_path")) job_list = self.do_cluster_analysis(profiling_path, benchmark_profiling_path) for i, (dimension, scope, interface, kwargs) in enumerate(job_list[::-1]): @@ -732,7 +708,7 @@ class AnalyzerController: def _profiling_comparison(self, compare_profiling_list): job_list = [] - disable_profiling_comparison = os.getenv(const.DISABLE_PROFILING_COMPARISON) + disable_profiling_comparison = os.getenv(Constant.DISABLE_PROFILING_COMPARISON) if disable_profiling_comparison is not None and disable_profiling_comparison.lower() == "true": logger.info( "Skip profiling comparison due to longer processing time due to env 'DISABLE_PROFILING_COMPARISON'") @@ -783,7 +759,7 @@ class AnalyzerController: if isinstance(target_cluster_analyzer, SlowRankAnalyzer): comparison_dims = [SlowRankAnalyzer.COMPUTE, SlowRankAnalyzer.FREE] - comparison_modes = [CompareConstant.KERNEL_COMPARE, CompareConstant.API_COMPARE] + comparison_modes = [Constant.KERNEL_COMPARE, Constant.API_COMPARE] elif isinstance(target_cluster_analyzer, SlowLinkAnalyzer): comparison_dims = [SlowLinkAnalyzer.SDMA_BANDWIDTH, SlowLinkAnalyzer.RDMA_BANDWIDTH] comparison_modes = [None, None] @@ -837,7 +813,16 @@ class AnalyzerController: return False path_list = [os.path.join(profiling_path, dir_name) for dir_name in os.listdir(profiling_path)] ascend_pt_dirs = [path for path in path_list if os.path.isdir(path) and path.endswith("ascend_pt")] - data_processor = PytorchDataPreprocessor(ascend_pt_dirs) + ascend_ms_dirs = [path for path in path_list if os.path.isdir(path) and 
path.endswith("ascend_ms")] + if ascend_ms_dirs and ascend_pt_dirs: + logger.error("Cannot analyze pytorch and mindspore data at the same time.") + return False + if not ascend_pt_dirs and not ascend_ms_dirs: + return False + if ascend_ms_dirs and not ascend_pt_dirs: + data_processor = MindsporeDataPreprocessor(ascend_ms_dirs) + elif ascend_pt_dirs and not ascend_ms_dirs: + data_processor = PytorchDataPreprocessor(ascend_pt_dirs) self.cluster_local_data_map[profiling_path] = data_processor.get_data_map() @@ -909,7 +894,7 @@ class AnalyzerController: benchmark_step=benchmark_step, profiling_path=self._get_profiling_path_by_rank(profiling_path, rank_id), benchmark_profiling_path=self._get_profiling_path_by_rank(profiling_path, benchmark_rank_id), - compare_mode=CompareConstant.KERNEL_COMPARE, + compare_mode=Constant.KERNEL_COMPARE, step_duration=self.slow_rank_analyzer.get_step_duration(rank_id, step) ) ) @@ -949,7 +934,7 @@ class AnalyzerController: kwargs = dict(profiling_path=self._get_profiling_path_by_rank(profiling_path, slow_rank_id), benchmark_profiling_path=self._get_profiling_path_by_rank(profiling_path, fast_rank_id), step=slow_step, benchmark_step=fast_step, rank=slow_rank_id, benchmark_rank=fast_rank_id, - compare_mode=CompareConstant.KERNEL_COMPARE, + compare_mode=Constant.KERNEL_COMPARE, step_duration=self.slow_rank_analyzer.get_step_duration(slow_rank_id, slow_step)) job_list += self.computation_analysis(**kwargs) diff --git a/profiler/advisor/analyzer/base_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/base_analyzer.py similarity index 42% rename from profiler/advisor/analyzer/base_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/base_analyzer.py index 38b7ea0be68683c0a24f2faab59d2b311917f5d5..ee7835cf539602a7226104f34413f922935906d9 100644 --- a/profiler/advisor/analyzer/base_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/base_analyzer.py @@ -13,24 +13,32 @@ # See the License for the specific language governing permissions and #
limitations under the License. import logging +import os from functools import wraps from typing import Dict, List, Union from abc import abstractmethod, ABCMeta -from profiler.advisor.common import constant -from profiler.advisor.common.enum_params_parser import EnumParamsParser -from profiler.advisor.common.version_control import VersionControl -from profiler.advisor.dataset.dataset import Dataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.utils.utils import safe_division +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.enum_params_parser import EnumParamsParser +from msprof_analyze.advisor.common.version_control import VersionControl +from msprof_analyze.advisor.dataset.dataset import Dataset +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.utils.utils import safe_division +from msprof_analyze.prof_common.file_manager import FileManager logger = logging.getLogger() +ASCEND_PT = "ascend_pt" +ASCEND_MS = "ascend_ms" +PROFILER_INFO_HEAD = "profiler_info_" +PROFILER_INFO_EXTENSION = ".json" +MS_VERSION = "ms_version" + class BaseAnalyzer(VersionControl, metaclass=ABCMeta): - _SUPPORT_VERSIONS = EnumParamsParser().get_options(constant.CANN_VERSION) + _SUPPORT_VERSIONS = EnumParamsParser().get_options(Constant.CANN_VERSION) ANALYZER_HIGH_PRIORITY_TIME_RATIO = 0.05 ANALYZER_MEDIUM_PRIORITY_TIME_RATIO = 0.03 @@ -38,11 +46,14 @@ class BaseAnalyzer(VersionControl, metaclass=ABCMeta): def __init__(self, collection_path, n_processes: int = 1, **kwargs): self.n_processes = n_processes - self.cann_version = 
kwargs.get(constant.CANN_VERSION, EnumParamsParser().get_default(constant.CANN_VERSION)) - self.torch_version = kwargs.get(constant.TORCH_VERSION, EnumParamsParser().get_default(constant.TORCH_VERSION)) - self.html_render = HTMLRender() - self.collection_path = collection_path self.kwargs = kwargs + self.collection_path = collection_path + self.output_path = kwargs.get("output_path", None) + self.cann_version = kwargs.get(Constant.CANN_VERSION, EnumParamsParser().get_default(Constant.CANN_VERSION)) + self.profiling_type = self.identify_profiling_type( + EnumParamsParser().get_options(Constant.PROFILING_TYPE_UNDER_LINE)) + self.profiling_version = self.identify_profiling_version() + self.html_render = HTMLRender() self.dataset_list: Dict[str, List[Dataset]] = {} self.init_dataset_list() self.result = OptimizeResult() @@ -79,7 +90,7 @@ class BaseAnalyzer(VersionControl, metaclass=ABCMeta): if data_key not in data: return None - logger.info("Enable analysis %s with %s", self.__class__.__name__, ",".join(data_list)) + logger.info("Start analysis %s with %s", self.__class__.__name__, ",".join(data_list)) return func(self, **kwargs) return wrapper @@ -94,6 +105,67 @@ class BaseAnalyzer(VersionControl, metaclass=ABCMeta): def get_priority(self, max_mem_op_dur): pass + def identify_profiling_type(self, profiling_type_list): + profiling_type = None + if self.collection_path.endswith(ASCEND_MS): + profiling_type = [elem for elem in profiling_type_list if Constant.MINDSPORE in elem][0] + elif self.collection_path.endswith(ASCEND_PT): + profiling_type = [elem for elem in profiling_type_list if Constant.PYTORCH in elem][0] + else: + for _, dirs, __ in os.walk(self.collection_path): + is_found_type = False + for direction in dirs: + if direction.endswith(ASCEND_MS): + profiling_type = [elem for elem in profiling_type_list if Constant.MINDSPORE in elem][0] + is_found_type = True + break + elif direction.endswith(ASCEND_PT): + profiling_type = [elem for elem in profiling_type_list 
if Constant.PYTORCH in elem][0] + is_found_type = True + break + if is_found_type: + break + if self.kwargs.get(Constant.PROFILING_TYPE_UNDER_LINE) and self.kwargs.get( + Constant.PROFILING_TYPE_UNDER_LINE) != profiling_type: + logger.warning("%s The input profiling type %s is inconsistent with the actual profiling type %s.", + self.__class__.__name__, self.kwargs.get(Constant.PROFILING_TYPE_UNDER_LINE), profiling_type) + if not profiling_type: + logger.warning("Unknown profiling type, the default value is set to pytorch.") + profiling_type = profiling_type_list[0] + return profiling_type + + def identify_profiling_version(self): + profiling_version = "" + if Constant.MINDSPORE in self.profiling_type: + ascend_dirs = [] + if self.collection_path.endswith(ASCEND_MS): + ascend_dirs.append(self.collection_path) + else: + for root, dirs, _ in os.walk(self.collection_path): + for direction in dirs: + if direction.endswith(ASCEND_MS): + ascend_dirs.append(os.path.join(root, direction)) + if ascend_dirs: + ascend_dir = ascend_dirs[0] + for file_name in os.listdir(ascend_dir): + if file_name.startswith(PROFILER_INFO_HEAD) and file_name.endswith(PROFILER_INFO_EXTENSION): + file_path = os.path.join(ascend_dir, file_name) + config = FileManager.read_json_file(file_path) + profiling_version = config.get(MS_VERSION, "") + break + if profiling_version and self.kwargs.get(Constant.MINDSPORE_VERSION): + if profiling_version != self.kwargs.get(Constant.MINDSPORE_VERSION): + logger.warning("%s The input version %s is inconsistent with the actual version %s.", + self.__class__.__name__, self.kwargs.get(Constant.MINDSPORE_VERSION), + profiling_version) + elif Constant.PYTORCH in self.profiling_type: + profiling_version = self.kwargs.get(Constant.TORCH_VERSION, + EnumParamsParser().get_default(Constant.TORCH_VERSION)) + if self.kwargs.get(Constant.TORCH_VERSION) and profiling_version != self.kwargs.get(Constant.TORCH_VERSION): + logger.warning("%s The input version %s is inconsistent with
the actual version %s.", + self.__class__.__name__, self.kwargs.get(Constant.TORCH_VERSION), profiling_version) + return profiling_version + def init_dataset_list(self) -> None: dataset_cls_list = self.dataset_cls_list if len(dataset_cls_list) == 0: diff --git a/profiler/advisor/analyzer/communication/contention/__init__.py b/profiler/msprof_analyze/advisor/analyzer/cluster/__init__.py similarity index 100% rename from profiler/advisor/analyzer/communication/contention/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/cluster/__init__.py diff --git a/profiler/advisor/analyzer/cluster/Communication_retransmission_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/cluster/communication_retransmission_analyzer.py similarity index 79% rename from profiler/advisor/analyzer/cluster/Communication_retransmission_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/cluster/communication_retransmission_analyzer.py index 3683ef1b44f8b6c571dd4d8fdce0d39882d342af..07ae0892661c82742c7f78d61061378880b84d18 100644 --- a/profiler/advisor/analyzer/cluster/Communication_retransmission_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/cluster/communication_retransmission_analyzer.py @@ -14,11 +14,12 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.analyzer.cluster.Communication_retransmission_checker import CommunicationRetransmissionChecker -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.cluster.communication_retransmission_checker import \ + CommunicationRetransmissionChecker +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset logger = logging.getLogger() diff --git a/profiler/advisor/analyzer/cluster/Communication_retransmission_checker.py b/profiler/msprof_analyze/advisor/analyzer/cluster/communication_retransmission_checker.py similarity index 86% rename from profiler/advisor/analyzer/cluster/Communication_retransmission_checker.py rename to profiler/msprof_analyze/advisor/analyzer/cluster/communication_retransmission_checker.py index c63fc12f27acfe1bf88c832b85e7746539143162..bd63da7ee9518224b8b7336aa5743620ead0628d 100644 --- a/profiler/advisor/analyzer/cluster/Communication_retransmission_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/cluster/communication_retransmission_checker.py @@ -16,11 +16,13 @@ import logging import os from typing import Dict, List from collections import defaultdict -from profiler.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.dataset.cluster.hccl_collection import HcclInfo 
+from msprof_analyze.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.dataset.cluster.hccl_collection import HcclInfo logger = logging.getLogger() @@ -99,11 +101,10 @@ class CommunicationRetransmissionChecker: """ make record for what and how to optimize """ - optimization_item = OptimizeItem("Communication retransmission analysis", self.desc, self.suggestions) + optimization_item = OptimizeItem(self.problem, self.desc, self.suggestions) result.add(OptimizeRecord(optimization_item)) - sub_table_name = \ - "Comm Retransmission Analysis" if not self.stage else f"Stage-{self.stage}: Comm Retransmission Analysis" + sub_table_name = BasePrompt.get_sub_table_name(self.problem, self.stage) result.add_detail(sub_table_name, headers=self.headers) for row in self.abnormal_rdma_list: @@ -120,14 +121,17 @@ class CommunicationRetransmissionChecker: ) def _init_rule(self): + language = AdditionalArgsManager().language syncbn_rule_path = os.path.join( os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), "rules", + language, "rdma_analysis.yaml" ) syncbn_rule = FileManager.read_yaml_file(syncbn_rule_path) - self.desc = syncbn_rule.get("problem") + self.problem = syncbn_rule.get("problem") + self.desc = syncbn_rule.get("description") self.min_retransmission_time = syncbn_rule.get("min_retransmission_time") self.solutions = syncbn_rule.get("solutions") diff --git a/profiler/advisor/analyzer/cluster/slow_link_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/cluster/slow_link_analyzer.py similarity index 77% rename from 
profiler/advisor/analyzer/cluster/slow_link_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/cluster/slow_link_analyzer.py index 259e5eb0c4255afc97aad83210b72a14b7285888..9c4416009e1035e1938cf5430a49b0383bbf47d7 100644 --- a/profiler/advisor/analyzer/cluster/slow_link_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/cluster/slow_link_analyzer.py @@ -13,16 +13,16 @@ # See the License for the specific language governing permissions and # limitations under the License. -from collections import defaultdict from typing import Dict, List import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.common import constant -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset -from profiler.advisor.utils.utils import safe_index_value, convert_to_int +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset +from msprof_analyze.advisor.utils.utils import safe_index_value, convert_to_int +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager logger = logging.getLogger() @@ -40,6 +40,7 @@ class SlowLinkAnalyzer(BaseAnalyzer): SDMA = "SDMA" RDMA = "RDMA" SLOW_LINK_ANALYSIS = "slow link" + SLOW_LINK_ANALYSIS_CN = "慢链路分析" RATIO_THRESHOLD = 0.05 dataset_cls_list = [ClusterCommunicationDataset] @@ -86,11 +87,19 @@ class SlowLinkAnalyzer(BaseAnalyzer): logger.info("The slow link (identified bottleneck) cannot provide a bottleneck \ because the analysis data is missing bandwidth information.") return - self.bottelneck += 
f'{link_type}: \n' \ - f' The average is {avg_bw}, \n' \ - f' while the maximum is {round(max(data_list), 3)}GB/s \n' \ - f' and the minimum is {round(min(data_list), 3)}GB/s. \n' \ - f' the difference is {round(max(data_list) - min(data_list), 3)}GB/s. \n' + language = AdditionalArgsManager().language + if language == "en": + self.bottelneck += f'{link_type}: \n' \ + f' The average is {avg_bw}, \n' \ + f' while the maximum is {round(max(data_list), 3)}GB/s \n' \ + f' and the minimum is {round(min(data_list), 3)}GB/s. \n' \ + f' the difference is {round(max(data_list) - min(data_list), 3)}GB/s. \n' + else: + self.bottelneck += f'{link_type}: \n' \ + f' 平均值是 {avg_bw}, \n' \ + f' 但最大值是 {round(max(data_list), 3)}GB/s ,\n' \ + f' 最小值是 {round(min(data_list), 3)}GB/s。\n' \ + f' 差距为 {round(max(data_list) - min(data_list), 3)}GB/s。 \n' def format_details(self): if not self.rank_bw_dict: @@ -105,7 +114,7 @@ class SlowLinkAnalyzer(BaseAnalyzer): data_list = [] for step_rank, rank_bw in self.rank_bw_dict.items(): - step_rank_list = list(map(convert_to_int, step_rank.split(constant.STEP_RANK_SEP))) + step_rank_list = list(map(convert_to_int, step_rank.split(Constant.STEP_RANK_SEP))) value_list = [rank_bw.get(i, 0) for i in headers] data_list.append(step_rank_list + value_list) data_list.sort(key=lambda x: (x[0], x[1])) # 按rank_id排序 @@ -119,8 +128,11 @@ class SlowLinkAnalyzer(BaseAnalyzer): """ make record for what and how to optimize """ + title = self.SLOW_LINK_ANALYSIS_CN + if AdditionalArgsManager().language == "en": + title = self.SLOW_LINK_ANALYSIS optimization_item = OptimizeItem( - SlowLinkAnalyzer.SLOW_LINK_ANALYSIS, + title, self.bottelneck, self.suggestion ) @@ -129,7 +141,7 @@ class SlowLinkAnalyzer(BaseAnalyzer): data_list = self.format_datas.get("data", []) headers = self.format_datas.get("headers", []) for data in data_list: - self.result.add_detail(SlowLinkAnalyzer.SLOW_LINK_ANALYSIS, headers, data) + self.result.add_detail(title, headers, data) def 
make_render(self, template_key="cluster"): result_for_html = { @@ -143,7 +155,8 @@ class SlowLinkAnalyzer(BaseAnalyzer): template_dir="templates", template_name="cluster_analysis.html", cann_version=self.cann_version, - torch_version=self.torch_version, + profiling_type=self.profiling_type, + profiling_version=self.profiling_version, result=result_for_html) def get_global_step_rank(self, bindwidth_type): @@ -181,7 +194,7 @@ class SlowLinkAnalyzer(BaseAnalyzer): min_bandwidth_rank_id = self.format_datas.get("data")[min_bandwidth_index][rank_id_index] if step_index is None: - max_bandwidth_step, min_bandwidth_step = constant.DEFAULT_STEP, constant.DEFAULT_STEP + max_bandwidth_step, min_bandwidth_step = Constant.DEFAULT_STEP, Constant.DEFAULT_STEP else: max_bandwidth_step = self.format_datas.get("data")[max_bandwidth_index][step_index] min_bandwidth_step = self.format_datas.get("data")[min_bandwidth_index][step_index] @@ -191,5 +204,5 @@ class SlowLinkAnalyzer(BaseAnalyzer): return global_step_rank - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): pass diff --git a/profiler/advisor/analyzer/cluster/slow_rank_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/cluster/slow_rank_analyzer.py similarity index 80% rename from profiler/advisor/analyzer/cluster/slow_rank_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/cluster/slow_rank_analyzer.py index 165dec7db7f3a6aa2fbb88654cf4590da09abcf9..b1d05c8b4562c23ad4993d3b620d1abf27453b1c 100644 --- a/profiler/advisor/analyzer/cluster/slow_rank_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/cluster/slow_rank_analyzer.py @@ -15,21 +15,25 @@ import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.common import constant -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.dataset.cluster.cluster_dataset import 
ClusterStepTraceTimeDataset -from profiler.advisor.utils.utils import safe_index_value, safe_division, convert_to_int, safe_index, convert_to_float +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.dataset.cluster.cluster_dataset import ClusterStepTraceTimeDataset +from msprof_analyze.advisor.utils.utils import safe_index_value, safe_division, convert_to_int, safe_index, \ + convert_to_float +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager logger = logging.getLogger() class SlowRankAnalyzer(BaseAnalyzer): SLOW_RANK_ANALYSIS = "slow rank" + SLOW_RANK_ANALYSIS_CN = "慢卡分析" RANK = "rank" RATIO_THRESHOLD = 0.05 BOTTLENECK_LIST = ['Computing', 'Communication', "Free"] + BOTTLENECK_LIST_CN = ['计算', '通信', "空闲"] dataset_cls_list = [ClusterStepTraceTimeDataset] COMPUTE = "compute(us)" FREE = "free(us)" @@ -65,7 +69,7 @@ class SlowRankAnalyzer(BaseAnalyzer): logger.error( "Slow rank analysis failed, " "please ensure file 'step_trace_time.csv' exists in your profiling directory %s", - constant.ASCEND_PROFILER_OUTPUT) + Constant.ASCEND_PROFILER_OUTPUT) return self.result self.process() self.make_record() @@ -80,24 +84,36 @@ class SlowRankAnalyzer(BaseAnalyzer): self.produce_bottleneck(self.step_trace_dict, i, mean_total_time) if not self.bottelneck: - self.bottelneck = "There is no slow rank issues" + language = AdditionalArgsManager().language + if language == "en": + self.bottelneck = "There is no slow rank issues" + else: + self.bottelneck = "没有慢节点问题" def produce_bottleneck(self, step_dict: dict, produce_type: int, mean_total_time: float): data_list = [data_tuple[produce_type] for rank_id, data_tuple in step_dict.items()] max_ratio = self.compute_max_gap_ratio(data_list, mean_total_time) if 
max_ratio > self.RATIO_THRESHOLD: - self.bottelneck += f'{self.BOTTLENECK_LIST[produce_type]} \n' \ - f' has some issues in the cluster, \n' \ - f' because the max difference of {self.BOTTLENECK_LIST[produce_type]} time \n' \ - f' has reached {round(max_ratio * mean_total_time / 1000, 3)}ms. \n' + language = AdditionalArgsManager().language + if language == "en": + self.bottelneck += f'{self.BOTTLENECK_LIST[produce_type]} \n' \ + f' has some issues in the cluster, \n' \ + f' because the max difference of {self.BOTTLENECK_LIST[produce_type]} time \n' \ + f' has reached {round(max_ratio * mean_total_time / 1000, 3)}ms. \n' + else: + self.bottelneck += f'集群中的{self.BOTTLENECK_LIST_CN[produce_type]}有问题, \n' \ + f'因为{self.BOTTLENECK_LIST_CN[produce_type]}时间的最大差距已经达到 \n' \ + f'{round(max_ratio * mean_total_time / 1000, 3)}ms。 \n' def make_record(self): """ make record for what and how to optimize """ - + title = self.SLOW_RANK_ANALYSIS_CN + if AdditionalArgsManager().language == "en": + title = self.SLOW_RANK_ANALYSIS optimization_item = OptimizeItem( - SlowRankAnalyzer.SLOW_RANK_ANALYSIS, + title, self.bottelneck, self.suggestion ) @@ -106,14 +122,14 @@ class SlowRankAnalyzer(BaseAnalyzer): data_list = self.format_datas.get("data", []) headers = self.format_datas.get("headers", []) for data in data_list: - self.result.add_detail(SlowRankAnalyzer.SLOW_RANK_ANALYSIS, headers, data) + self.result.add_detail(title, headers, data) def format_details(self): details_dict = {} headers = ["step", "rank_id", "compute(us)", "communication(us)", "free(us)"] data_list = [] for key, value in self.step_trace_dict.items(): - step, rank_id = key.split(constant.STEP_RANK_SEP) + step, rank_id = key.split(Constant.STEP_RANK_SEP) data_list.append([convert_to_int(step), convert_to_int(rank_id)] + value) if step and step not in self._steps: self._steps.add(step) @@ -134,7 +150,8 @@ class SlowRankAnalyzer(BaseAnalyzer): template_dir="templates", template_name="cluster_analysis.html", 
cann_version=self.cann_version, - torch_version=self.torch_version, + profiling_type=self.profiling_type, + profiling_version=self.profiling_version, result=result_for_html) def get_global_step_rank(self, dimension): @@ -167,7 +184,7 @@ class SlowRankAnalyzer(BaseAnalyzer): max_time_step = self.format_datas.get("data")[max_time_index][step_index] min_time_step = self.format_datas.get("data")[min_time_index][step_index] else: - max_time_step, min_time_step = constant.DEFAULT_STEP, constant.DEFAULT_STEP + max_time_step, min_time_step = Constant.DEFAULT_STEP, Constant.DEFAULT_STEP global_step_rank["maximum"] = {"rank_id": max_time_rank_id, "step": max_time_step} global_step_rank["minimum"] = {"rank_id": min_time_rank_id, "step": min_time_step} @@ -192,7 +209,7 @@ class SlowRankAnalyzer(BaseAnalyzer): if step_index is not None: step_list = [tuple_list[step_index] for tuple_list in self.format_datas.get("data")] else: - step_list = [constant.DEFAULT_STEP] * len(rank_list) + step_list = [Constant.DEFAULT_STEP] * len(rank_list) for index, stage in enumerate(self.stages): tmp_step_list, tmp_rank_list, tmp_time_list = [], [], [] @@ -260,5 +277,5 @@ class SlowRankAnalyzer(BaseAnalyzer): free_time = safe_index(safe_index(self.format_datas.get("data"), row_index, []), free_col_index, 0) return convert_to_float(compute_time) + convert_to_float(communicate_time) + convert_to_float(free_time) - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): pass diff --git a/profiler/advisor/analyzer/communication/environment/__init__.py b/profiler/msprof_analyze/advisor/analyzer/communication/__init__.py similarity index 100% rename from profiler/advisor/analyzer/communication/environment/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/communication/__init__.py diff --git a/profiler/advisor/analyzer/communication/packet/__init__.py b/profiler/msprof_analyze/advisor/analyzer/communication/alignment/__init__.py similarity index 100% rename from 
profiler/advisor/analyzer/communication/packet/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/communication/alignment/__init__.py diff --git a/profiler/msprof_analyze/advisor/analyzer/communication/alignment/byte_alignment_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/communication/alignment/byte_alignment_analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..0021365a4e29df51610780f8ee667dcd166d8612 --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/communication/alignment/byte_alignment_analyzer.py @@ -0,0 +1,52 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging + +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.analyzer.communication.alignment.byte_alignment_checker import ByteAlignmentChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.dataset.communication.hccl_detail_dataset import HcclDetailDataset +from msprof_analyze.advisor.result.result import OptimizeResult + +logger = logging.getLogger() + + +class ByteAlignmentAnalyzer(BaseAnalyzer): + dataset_cls_list = [ProfilingDataset] + requires_cluster_dataset = False + + def __init__(self, collection_path, n_processes: int = 1, **kwargs) -> None: + super().__init__(collection_path, n_processes, **kwargs) + profiling_key = ProfilingDataset.get_key() + self.profiling_dataset = self.get_first_data_by_key(self.dataset_list, profiling_key) + self.hccl_detail_dataset = HcclDetailDataset(self.profiling_dataset.msprof) + self.result = OptimizeResult() + self.html_render = HTMLRender() + self.html = None + + @BaseAnalyzer.check_data((ProfilingDataset.get_key(),)) + def optimize(self, **kwargs): + byte_alignment_checker = ByteAlignmentChecker(**kwargs) + byte_alignment_checker.check_alignment(self.hccl_detail_dataset) + if not byte_alignment_checker.byte_alignment_issue: + return self.result + byte_alignment_checker.make_record(self.result) + self.html = byte_alignment_checker.make_render(self.html_render, priority=self.get_priority()) + return self.result + + def get_priority(self, max_mem_op_dur=0): + return PriorityBackgroundColor.medium diff --git a/profiler/msprof_analyze/advisor/analyzer/communication/alignment/byte_alignment_checker.py b/profiler/msprof_analyze/advisor/analyzer/communication/alignment/byte_alignment_checker.py new file mode 100644 index
0000000000000000000000000000000000000000..4f8bf95091316417ea34ea408c3d5f318c58c1be --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/communication/alignment/byte_alignment_checker.py @@ -0,0 +1,156 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +import os +from typing import List +from msprof_analyze.advisor.dataset.communication.hccl_detail_dataset import HcclDetailDataset +from msprof_analyze.advisor.dataset.profiling.info_collection import HcclTask +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.utils.utils import safe_division +from msprof_analyze.prof_common.constant import Constant + +logger = logging.getLogger() + + +class ByteAlignmentChecker: + _CHECKER = "ByteAlignmentChecker" + _BASE_SIZE = 512 + _MIN_SIZE = 512 + _LOW_PRIORITY = 0.2 + _HIGH_PRIORITY = 0.7 + _TYPE = "SDMA" + + def __init__(self, **kwargs): + self.contention_issues = False + self.desc = "" + self.step_id = kwargs.get("step") + self.stage = None +
self.byte_alignment_issue = False + self.total_ops_dur = 0 + self.abnormal_ops_dur = 0 + self.abnormal_ops_count = 0 + self.min_size = 0 + self.topk = 0 + self.abnormal_ops = [] + self.suggestions = [] + self._init_rule() + self.headers = [ + "op name", "total size(Byte)", "duration(us)", "abnormal duration(us)", "bandwidth(GB/s)" + ] + + @staticmethod + def _calculate_bandwidth_gb_s(size, duration): + if abs(duration) < 1e-15: + bandwidth = 0 + else: + bandwidth = (size * Constant.COMMUNICATION_B_TO_GB) / (duration * Constant.US_TO_S) + return round(bandwidth, 4) + + def check_alignment(self, hccl_detail_dataset: HcclDetailDataset) -> None: + for hccl_op in hccl_detail_dataset.hccl_ops: + size, duration, abnormal_dur, flag = self._check_op([hccl_op.memcpy_tasks, hccl_op.reduce_inline_tasks]) + if flag: + self.abnormal_ops_count += 1 + self.abnormal_ops.append([hccl_op.op_name, size, round(duration, 4), round(abnormal_dur, 4), + self._calculate_bandwidth_gb_s(size, duration)]) + if self.abnormal_ops_count: + self.byte_alignment_issue = True + self.desc = self.desc.format(count=self.abnormal_ops_count) + + def make_record(self, result: OptimizeResult): + """ + make record for what and how to optimize + """ + optimization_item = OptimizeItem(self.problem, self.desc, self.suggestions) + result.add(OptimizeRecord(optimization_item)) + + sub_table_name = BasePrompt.get_sub_table_name(self.problem, self.stage) + result.add_detail(sub_table_name, headers=self.headers) + + for hccl_op in self.abnormal_ops: + result.add_detail(sub_table_name, detail=hccl_op) + + def make_render(self, html_render, **kwargs): + rank = kwargs.get("rank") + return html_render.render_template(key="communication", + template_dir="templates", + template_name="byte_alignment.html", + desc=self.desc, + solutions=self.solutions, + headers=self.headers, + datas=self.abnormal_ops[:min(self.topk, len(self.abnormal_ops))], + num=min(self.topk, len(self.abnormal_ops)), +
priority_background_color=self._get_priority(), + rank=rank) + + def _pre_check(self, task: HcclTask, type_): + """ + Check whether the operator meets the data volume alignment detection conditions. + """ + if task.transport_type != type_ or task.link_type == "ON_CHIP" or task.size <= self.min_size: + return False + return True + + def _check_op(self, op: List[List[HcclTask]]): + flag = False + size = 0 + duration = 0 + abnormal_dur = 0 + for tasks in op: + for task in tasks: + if not self._pre_check(task, self._TYPE): + continue + self.total_ops_dur += task.duration + size += task.size + duration += task.duration + if task.size % self._BASE_SIZE: + self.abnormal_ops_dur += task.duration + abnormal_dur += task.duration + flag = True + return [size, duration, abnormal_dur, flag] + + def _init_rule(self): + language = AdditionalArgsManager().language + rule_path = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), + "rules", + language, + "byte_alignment.yaml" + ) + + byte_alignment_rule = FileManager.read_yaml_file(rule_path) + self.problem = byte_alignment_rule.get("problem") + self.desc = byte_alignment_rule.get("description") + self.min_size = byte_alignment_rule.get("min_size", self._MIN_SIZE) + self.topk = byte_alignment_rule.get("top_num", 3) + self.solutions = byte_alignment_rule.get("solutions") + if not self.desc or not self.solutions or not isinstance(self.solutions, list): + raise RuntimeError("The configuration file of the byte alignment analyzer is abnormal. 
Please check.") + for solution in self.solutions: + for _, val in solution.items(): + self.suggestions.append(f"{val.get('desc')}") + + def _get_priority(self): + if safe_division(self.abnormal_ops_dur, self.total_ops_dur) < self._LOW_PRIORITY: + return PriorityBackgroundColor.low + elif safe_division(self.abnormal_ops_dur, self.total_ops_dur) >= self._HIGH_PRIORITY: + return PriorityBackgroundColor.high + else: + return PriorityBackgroundColor.medium diff --git a/profiler/advisor/analyzer/communication/retransmission/__init__.py b/profiler/msprof_analyze/advisor/analyzer/communication/bandwidth/__init__.py similarity index 100% rename from profiler/advisor/analyzer/communication/retransmission/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/communication/bandwidth/__init__.py diff --git a/profiler/advisor/analyzer/communication/base_communication_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/communication/base_communication_analyzer.py similarity index 90% rename from profiler/advisor/analyzer/communication/base_communication_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/communication/base_communication_analyzer.py index be97e07fc08ebc5096e6a7ae984f77570b24d399..5fbbf0c56dc204711eb37f47d11e67c65f9d3897 100644 --- a/profiler/advisor/analyzer/communication/base_communication_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/communication/base_communication_analyzer.py @@ -12,7 +12,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer class BaseCommunicationAnalyzer(BaseAnalyzer): diff --git a/profiler/advisor/analyzer/comparison/__init__.py b/profiler/msprof_analyze/advisor/analyzer/communication/contention/__init__.py similarity index 100% rename from profiler/advisor/analyzer/comparison/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/communication/contention/__init__.py diff --git a/profiler/advisor/analyzer/communication/contention/bandwidth_contention_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/communication/contention/bandwidth_contention_analyzer.py similarity index 75% rename from profiler/advisor/analyzer/communication/contention/bandwidth_contention_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/communication/contention/bandwidth_contention_analyzer.py index 46c8be7a482b38fb32234eca179b8cf89f992f32..377fff0c1141faed7ef1eadab19aa0c2fd6e56aa 100644 --- a/profiler/advisor/analyzer/communication/contention/bandwidth_contention_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/communication/contention/bandwidth_contention_analyzer.py @@ -14,13 +14,14 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.communication.base_communication_analyzer import BaseCommunicationAnalyzer -from profiler.advisor.analyzer.communication.contention.bandwidth_contention_checker import BandwidthContentionChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.communication.communication_dataset import CommunicationDataset -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset -from profiler.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.communication.base_communication_analyzer import BaseCommunicationAnalyzer +from msprof_analyze.advisor.analyzer.communication.contention.bandwidth_contention_checker import \ + BandwidthContentionChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.communication.communication_dataset import CommunicationDataset +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.result.result import OptimizeResult logger = logging.getLogger() @@ -51,6 +52,6 @@ class BandwidthContentionAnalyzer(BaseCommunicationAnalyzer): priority=self.get_priority()) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): # 提升1% ~ 3% return PriorityBackgroundColor.low diff --git a/profiler/advisor/analyzer/communication/contention/bandwidth_contention_checker.py b/profiler/msprof_analyze/advisor/analyzer/communication/contention/bandwidth_contention_checker.py similarity index 83% rename from profiler/advisor/analyzer/communication/contention/bandwidth_contention_checker.py rename to profiler/msprof_analyze/advisor/analyzer/communication/contention/bandwidth_contention_checker.py 
index e0b85592d995024ef392ccd1b44eeb3fa1ab00eb..2f2d2e5df832440021e9ded14dd49cc75cf471d1 100644 --- a/profiler/advisor/analyzer/communication/contention/bandwidth_contention_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/communication/contention/bandwidth_contention_checker.py @@ -15,14 +15,16 @@ import logging import os from typing import List -from profiler.advisor.dataset.communication.communication_dataset import CommunicationDataset -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.utils.utils import convert_to_float -from profiler.advisor.dataset.cluster.hccl_collection import HcclInfo -from profiler.advisor.dataset.profiling.info_collection import OpInfo +from msprof_analyze.advisor.dataset.communication.communication_dataset import CommunicationDataset +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.utils.utils import convert_to_float +from msprof_analyze.advisor.dataset.cluster.hccl_collection import HcclInfo +from msprof_analyze.advisor.dataset.profiling.info_collection import OpInfo logger = logging.getLogger() @@ -126,7 +129,7 @@ class BandwidthContentionChecker: else: if self.sdma_list[hccl_index].bandwidth < self.threshold:
self.abnormal_sdma_list.append(self.sdma_list[hccl_index]) - matmul_index += 1 + hccl_index += 1 if self.abnormal_sdma_list: self.contention_issues = True self.desc = self.desc.format(threshold=self.threshold) @@ -135,12 +138,12 @@ class BandwidthContentionChecker: """ make record for what and how to optimize """ - optimization_item = OptimizeItem("bandwidth contention analysis", self.desc, self.suggestions) + optimization_item = OptimizeItem(self.problem, self.desc, self.suggestions) result.add(OptimizeRecord(optimization_item)) - sub_table_name = "Bandwidth Contention Analysis" if not self.stage else f"Stage-{self.stage}: " \ - f"Bandwidth Contention Analysis" + sub_table_name = BasePrompt.get_sub_table_name(self.problem, self.stage) result.add_detail(sub_table_name, headers=self.headers) + for hccl_op in self.abnormal_sdma_list: result.add_detail(sub_table_name, detail=[hccl_op.name, round(hccl_op.dur, 4), round(hccl_op.bandwidth, 2)]) @@ -157,19 +160,22 @@ class BandwidthContentionChecker: priority_background_color=priority) def _init_rule(self): + language = AdditionalArgsManager().language contention_rule_path = os.path.join( os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), "rules", + language, "bandwidth_contention.yaml" ) contention_rule = FileManager.read_yaml_file(contention_rule_path) - self.desc = contention_rule.get("problem") + self.problem = contention_rule.get("problem") + self.desc = contention_rule.get("description") self.threshold = contention_rule.get("threshold", 0) * contention_rule.get("sdma_baseline", 0) self.contention_topk = contention_rule.get("top_num", 3) self.solutions = contention_rule.get("solutions") if not self.desc or not self.solutions or not isinstance(self.solutions, list): raise RuntimeError("The configuration file of the bandwidth contention analyzer is abnormal. 
Please check.") for solution in self.solutions: - for key, val in solution.items(): - self.suggestions.append(f"{key}, {val.get('desc')}") + for _, val in solution.items(): + self.suggestions.append(f"{val.get('desc')}") diff --git a/profiler/advisor/analyzer/computation/__init__.py b/profiler/msprof_analyze/advisor/analyzer/communication/environment/__init__.py similarity index 100% rename from profiler/advisor/analyzer/computation/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/communication/environment/__init__.py diff --git a/profiler/advisor/analyzer/computation/ai_core_freq/__init__.py b/profiler/msprof_analyze/advisor/analyzer/communication/packet/__init__.py similarity index 100% rename from profiler/advisor/analyzer/computation/ai_core_freq/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/communication/packet/__init__.py diff --git a/profiler/advisor/analyzer/communication/packet/packet_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/communication/packet/packet_analyzer.py similarity index 76% rename from profiler/advisor/analyzer/communication/packet/packet_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/communication/packet/packet_analyzer.py index 444b643a1c6914b36476145caa866be81fcf65a4..76a3590b39838b1d3712c1ec3920cedc6338face 100644 --- a/profiler/advisor/analyzer/communication/packet/packet_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/communication/packet/packet_analyzer.py @@ -14,12 +14,12 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.communication.base_communication_analyzer import BaseCommunicationAnalyzer -from profiler.advisor.analyzer.communication.packet.packet_checker import PacketChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.communication.communication_dataset import CommunicationDataset -from profiler.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.communication.base_communication_analyzer import BaseCommunicationAnalyzer +from msprof_analyze.advisor.analyzer.communication.packet.packet_checker import PacketChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.communication.communication_dataset import CommunicationDataset +from msprof_analyze.advisor.result.result import OptimizeResult logger = logging.getLogger() @@ -49,6 +49,6 @@ class PacketAnalyzer(BaseCommunicationAnalyzer): self.html = packet_checker.make_render(self.html_render, add_render_list, priority=self.get_priority()) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): # 提升1% ~ 3% return PriorityBackgroundColor.low diff --git a/profiler/advisor/analyzer/communication/packet/packet_checker.py b/profiler/msprof_analyze/advisor/analyzer/communication/packet/packet_checker.py similarity index 86% rename from profiler/advisor/analyzer/communication/packet/packet_checker.py rename to profiler/msprof_analyze/advisor/analyzer/communication/packet/packet_checker.py index 6ddf17c43fdddc7adfd98866bc8869206c4cf942..6f76b09ff9c547f65acb924e4ebc6df485699d81 100644 --- a/profiler/advisor/analyzer/communication/packet/packet_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/communication/packet/packet_checker.py 
@@ -14,11 +14,13 @@ # limitations under the License. import logging import os -from profiler.advisor.dataset.communication.communication_dataset import CommunicationDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.utils.utils import convert_to_float +from msprof_analyze.advisor.dataset.communication.communication_dataset import CommunicationDataset +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.utils.utils import convert_to_float logger = logging.getLogger() @@ -109,10 +112,11 @@ class PacketChecker: """ make record for what and how to optimize """ - optimization_item = OptimizeItem("Packet analysis", self.desc, self.suggestions) + optimization_item = OptimizeItem(self.problem, self.desc, self.suggestions) result.add(OptimizeRecord(optimization_item)) - sub_table_name = "Packet Analysis" if not self.stage else f"Stage-{self.stage}: Packet Analysis" + sub_table_name = BasePrompt.get_sub_table_name(self.problem, self.stage) + result.add_detail(sub_table_name, headers=self.headers) result.add_detail(sub_table_name, detail=self.small_packet_detail) @@ -128,14 +132,17 @@ class PacketChecker: priority_background_color=priority) def _init_rule(self): + language = AdditionalArgsManager().language syncbn_rule_path = os.path.join( os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), "rules", + language, "packet.yaml" )
syncbn_rule = FileManager.read_yaml_file(syncbn_rule_path) - self.desc = syncbn_rule.get("problem") + self.problem = syncbn_rule.get("problem") + self.desc = syncbn_rule.get("description") self.sdma_desc = syncbn_rule.get("sdma_problem") self.rdma_desc = syncbn_rule.get("rdma_problem") self.min_sdma_size = convert_to_float(syncbn_rule.get("min_sdma_size")) diff --git a/profiler/advisor/analyzer/computation/aicpu/__init__.py b/profiler/msprof_analyze/advisor/analyzer/communication/retransmission/__init__.py similarity index 100% rename from profiler/advisor/analyzer/computation/aicpu/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/communication/retransmission/__init__.py diff --git a/profiler/advisor/analyzer/communication/retransmission/communication_retransmission_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/communication/retransmission/communication_retransmission_analyzer.py similarity index 75% rename from profiler/advisor/analyzer/communication/retransmission/communication_retransmission_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/communication/retransmission/communication_retransmission_analyzer.py index 78cade900731926f6be2303fd8b9ac6072df35f7..73d798c5c4cfad06082f895664b811375c6d2780 100644 --- a/profiler/advisor/analyzer/communication/retransmission/communication_retransmission_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/communication/retransmission/communication_retransmission_analyzer.py @@ -14,13 +14,13 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.communication.base_communication_analyzer import BaseCommunicationAnalyzer -from profiler.advisor.analyzer.communication.retransmission.communication_retransmission_checker import \ +from msprof_analyze.advisor.analyzer.communication.base_communication_analyzer import BaseCommunicationAnalyzer +from msprof_analyze.advisor.analyzer.communication.retransmission.communication_retransmission_checker import \ CommunicationRetransmissionChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset -from profiler.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset +from msprof_analyze.advisor.result.result import OptimizeResult logger = logging.getLogger() @@ -47,6 +47,6 @@ class RDMARetransmissionAnalyzer(BaseCommunicationAnalyzer): self.html = rdma_checker.make_render(self.html_render, add_render_list, priority=self.get_priority()) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): # 单次重传最少4s,高优先级 return PriorityBackgroundColor.high diff --git a/profiler/advisor/analyzer/communication/retransmission/communication_retransmission_checker.py b/profiler/msprof_analyze/advisor/analyzer/communication/retransmission/communication_retransmission_checker.py similarity index 85% rename from profiler/advisor/analyzer/communication/retransmission/communication_retransmission_checker.py rename to profiler/msprof_analyze/advisor/analyzer/communication/retransmission/communication_retransmission_checker.py index 
577f7c23ff21dccaad138acd95ca9af20faa7bab..f27dbb13e8020b2f702bbaa7b81c9228bdf23a26 100644 --- a/profiler/advisor/analyzer/communication/retransmission/communication_retransmission_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/communication/retransmission/communication_retransmission_checker.py @@ -16,11 +16,14 @@ import logging import os from typing import Dict, List from collections import defaultdict -from profiler.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.dataset.cluster.hccl_collection import HcclInfo +from msprof_analyze.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.dataset.cluster.hccl_collection import HcclInfo +from msprof_analyze.prof_common.constant import Constant logger = logging.getLogger() @@ -74,6 +77,8 @@ class CommunicationRetransmissionChecker: """ for group_name, hccl_group_dict in hccl_dataset.hccl_dict.items(): for op_name, hccl_op_dict in hccl_group_dict.items(): + if op_name == Constant.TOTAL_OP_INFO: + continue for step_id, hccl_list in hccl_op_dict.items(): if self.step_id and step_id != self.step_id: # When a target step (self.step_id) is specified, skip non-target steps continue @@ -99,11 +104,10 @@ class CommunicationRetransmissionChecker: """ make record for what and how to optimize """ - optimization_item = OptimizeItem("Communication retransmission analysis", self.desc,
self.suggestions) + optimization_item = OptimizeItem(self.problem, self.desc, self.suggestions) result.add(OptimizeRecord(optimization_item)) - sub_table_name = \ - "Comm Retransmission Analysis" if not self.stage else f"Stage-{self.stage}: Comm Retransmission Analysis" + sub_table_name = BasePrompt.get_sub_table_name(self.problem, self.stage) result.add_detail(sub_table_name, headers=self.headers) for row in self.abnormal_rdma_list: @@ -121,14 +125,17 @@ class CommunicationRetransmissionChecker: priority_background_color=priority) def _init_rule(self): + language = AdditionalArgsManager().language syncbn_rule_path = os.path.join( os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), "rules", + language, "rdma_analysis.yaml" ) syncbn_rule = FileManager.read_yaml_file(syncbn_rule_path) - self.desc = syncbn_rule.get("problem") + self.problem = syncbn_rule.get("problem") + self.desc = syncbn_rule.get("description") self.min_retransmission_time = syncbn_rule.get("min_retransmission_time") self.solutions = syncbn_rule.get("solutions") diff --git a/profiler/advisor/analyzer/computation/bound/__init__.py b/profiler/msprof_analyze/advisor/analyzer/comparison/__init__.py similarity index 100% rename from profiler/advisor/analyzer/computation/bound/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/comparison/__init__.py diff --git a/profiler/advisor/analyzer/comparison/comparison_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/comparison/comparison_analyzer.py similarity index 84% rename from profiler/advisor/analyzer/comparison/comparison_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/comparison/comparison_analyzer.py index b333c1174863d84f709f418081c1352be3e605d1..cd84644d78345ab4bcba246dab976fd39760d399 100644 --- a/profiler/advisor/analyzer/comparison/comparison_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/comparison/comparison_analyzer.py @@ -14,10 +14,10 @@ # limitations under 
the License. import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.analyzer.comparison.comparison_checker import ComparisonChecker +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.analyzer.comparison.comparison_checker import ComparisonChecker +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.result.result import OptimizeResult logger = logging.getLogger() @@ -34,7 +34,7 @@ class ComparisonAnalyzer(BaseAnalyzer): self._optimize(**compare_profiling_path) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): pass def _optimize(self, profiling_path, benchmark_profiling_path, **kwargs): diff --git a/profiler/advisor/analyzer/comparison/comparison_checker.py b/profiler/msprof_analyze/advisor/analyzer/comparison/comparison_checker.py similarity index 83% rename from profiler/advisor/analyzer/comparison/comparison_checker.py rename to profiler/msprof_analyze/advisor/analyzer/comparison/comparison_checker.py index ad4cb83d33c43614c90e198a6b35a2dc1f301782..8e40bcc1cfe6914470d82c86f5b76980a5c16814 100644 --- a/profiler/advisor/analyzer/comparison/comparison_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/comparison/comparison_checker.py @@ -14,11 +14,11 @@ # limitations under the License. 
import logging -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.utils.utils import safe_index_value, convert_to_float, convert_to_int -from profiler.compare_tools.compare_backend.utils.constant import Constant as CompareConstant -from profiler.compare_tools.compare_interface.comparison_interface import ComparisonInterface +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.utils.utils import safe_index_value, convert_to_float, convert_to_int +from msprof_analyze.compare_tools.compare_interface.comparison_interface import ComparisonInterface +from msprof_analyze.prof_common.constant import Constant logger = logging.getLogger() @@ -28,8 +28,8 @@ class ComparisonChecker: SHOW_TOPK = 10 DIFF_AVG_RATIO = "Diff Avg Ratio" COMPARE_MODE_TO_DESC = { - CompareConstant.KERNEL_COMPARE: "Kernel compare", - CompareConstant.API_COMPARE: "Api compare", + Constant.KERNEL_COMPARE: "Kernel compare", + Constant.API_COMPARE: "Api compare", } def __init__(self, profiling_path, benchmark_profiling_path, step=None, benchmark_step=None, rank=None, @@ -48,7 +48,7 @@ class ComparisonChecker: @staticmethod def get_valid_step(step): - none_step = None + none_step = "" if step is None: return none_step if isinstance(step, (int, float)): @@ -67,10 +67,18 @@ class ComparisonChecker: if compare_mode is None: return self.compare_mode = compare_mode + if ("Api" in compare_mode) and self.benchmark_profiling_path.endswith("ascend_ms"): + logger.warning("The current compare mode %s does not support Mindspore.", compare_mode) + return compare_interface = ComparisonInterface(self.profiling_path, self.benchmark_profiling_path, self.step, - self.benchmark_step) + self.benchmark_step, + use_kernel_type=self.compare_mode == Constant.KERNEL_COMPARE) result = 
compare_interface.compare(self.compare_mode) - data = result.get(self.compare_mode, {}) + if self.compare_mode == Constant.KERNEL_COMPARE: + data = result.get(Constant.KERNEL_TYPE_COMPARE, {}) + else: + data = result.get(self.compare_mode, {}) + headers = data.get("headers", {}) rows = data.get("rows", []) format_headers = [] @@ -140,13 +148,13 @@ class ComparisonChecker: sheet_name = "" if self.rank is not None: sheet_name += f"Rank{self.rank}" - if self.step is not None: + if self.step: sheet_name += f" Step{self.step}" if sheet_name: sheet_name += " and " if self.benchmark_rank is not None: sheet_name += f"Rank{self.benchmark_rank}" - if self.benchmark_step is not None: + if self.benchmark_step: sheet_name += f" Step{self.benchmark_step}" if not sheet_name: sheet_name = "Target and Benchmark" diff --git a/profiler/advisor/analyzer/computation/op_compile/__init__.py b/profiler/msprof_analyze/advisor/analyzer/computation/__init__.py similarity index 100% rename from profiler/advisor/analyzer/computation/op_compile/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/computation/__init__.py diff --git a/profiler/advisor/analyzer/dataloader/__init__.py b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/__init__.py similarity index 100% rename from profiler/advisor/analyzer/dataloader/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/__init__.py diff --git a/profiler/advisor/analyzer/computation/ai_core_freq/ai_core_freq_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/ai_core_freq_analyzer.py similarity index 74% rename from profiler/advisor/analyzer/computation/ai_core_freq/ai_core_freq_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/ai_core_freq_analyzer.py index dd41df260eff3d03459e92d570923256816cb9f4..b4d97362592fcd4ed2e155cbc4078f8c87c3ad50 100644 --- 
a/profiler/advisor/analyzer/computation/ai_core_freq/ai_core_freq_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/ai_core_freq_analyzer.py @@ -14,14 +14,14 @@ # limitations under the License. import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.analyzer.computation.ai_core_freq.ai_core_freq_checker import AICoreFreqChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset -from profiler.advisor.dataset.profiling.device_info import DeviceInfoParser -from profiler.advisor.config.config import Config +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.computation.ai_core_freq.ai_core_freq_checker import AICoreFreqChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset +from msprof_analyze.advisor.dataset.profiling.device_info import DeviceInfoParser +from msprof_analyze.advisor.config.config import Config logger = logging.getLogger() @@ -53,5 +53,5 @@ class AICoreFreqAnalyzer(BaseAnalyzer): rank=kwargs.get("rank")) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): return PriorityBackgroundColor.high \ No newline at end of file diff --git a/profiler/advisor/analyzer/computation/ai_core_freq/ai_core_freq_checker.py b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/ai_core_freq_checker.py similarity index 78% rename from 
profiler/advisor/analyzer/computation/ai_core_freq/ai_core_freq_checker.py rename to profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/ai_core_freq_checker.py index f42b9514782e977c56a0f7776627beddbdebcd60..7e07e2f3b5f671922491329ea91a135e525cea00 100644 --- a/profiler/advisor/analyzer/computation/ai_core_freq/ai_core_freq_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_freq/ai_core_freq_checker.py @@ -14,11 +14,13 @@ # limitations under the License. import logging -from profiler.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.config.config import Config -from profiler.advisor.utils.utils import convert_to_float +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.utils.utils import convert_to_float +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager logger = logging.getLogger() @@ -73,18 +75,11 @@ class AICoreFreqChecker: if self.decrease_freq_ops: # Sort in descending order by total operator duration and frequency decrease ratio - self.decrease_freq_ops.sort(key = - lambda x: (x[self.TOTAL_DURATION_INDEX], x[self.DECREASE_FREQ_RATIO_INDEX]), - reverse = True) + self.decrease_freq_ops.sort( + key=lambda x: (x[self.TOTAL_DURATION_INDEX], x[self.DECREASE_FREQ_RATIO_INDEX]), reverse=True) if not self.ai_core_freq_issues: return - self.desc = (f"{len(self.decrease_freq_ops)} operators are found during frequency reduction, and the reduction " - f"ratio is larger than {self.DECREASE_FREQ_RATIO}.") - if self.rank: - self.desc = f"For rank
{self.rank}, " + self.desc.lower() - self.suggestions = "Please check the temperature or max power of your machine." - def make_record(self, result: OptimizeResult): """ make record for what and how to optimize @@ -92,11 +87,17 @@ class AICoreFreqChecker: if not self.ai_core_freq_issues: return self.ai_core_freq_issues - sheet_name = "AI Core Frequency" + prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) + + problem = prompt_class.PROBLEM if self.rank is not None: - sheet_name = f"rank {self.rank} AI Core Frequency".capitalize() + problem += prompt_class.RANK_ID.format(self.rank) + + self.desc = prompt_class.DESCRIPTION.format(len(self.decrease_freq_ops), self.DECREASE_FREQ_RATIO) + if self.rank: + self.desc = prompt_class.RANK_DESCRIPTION.format(self.rank) + self.desc.lower() - optimization_item = OptimizeItem(sheet_name, self.desc, [self.suggestions]) + optimization_item = OptimizeItem(problem, self.desc, [prompt_class.SUGGESTION]) result.add(OptimizeRecord(optimization_item)) self.headers = [ @@ -108,10 +109,10 @@ class AICoreFreqChecker: "Max frequency", "Min frequency", ] - result.add_detail(sheet_name, headers=self.headers) + result.add_detail(problem, headers=self.headers) for row in self.decrease_freq_ops: - result.add_detail(sheet_name, detail=row) + result.add_detail(problem, detail=row) return True def make_render(self, html_render, add_render_list=True, **kwargs): diff --git a/profiler/advisor/analyzer/graph_fusion/__init__.py b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/__init__.py similarity index 100% rename from profiler/advisor/analyzer/graph_fusion/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/__init__.py diff --git a/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/ai_core_performance_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/ai_core_performance_analyzer.py new file mode 100644 index 
0000000000000000000000000000000000000000..23ec775e275134e8a99336b005d9f8f198660245 --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/ai_core_performance_analyzer.py @@ -0,0 +1,53 @@ +# Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging + +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.analyzer.computation.ai_core_performance.ai_core_performance_checker import \ + AICorePerformanceChecker +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender + +logger = logging.getLogger() + + +class AICorePerformanceAnalyzer(BaseAnalyzer): + dataset_cls_list = [ProfilingDataset] + + def __init__(self, collection_path, n_processes: int = 1, **kwargs) -> None: + super().__init__(collection_path, n_processes, **kwargs) + profiling_key = ProfilingDataset.get_key() + self.profiling_dataset = self.get_first_data_by_key(self.dataset_list, profiling_key) + self.result = OptimizeResult() + self.html_render = HTMLRender() + self.html = None + + def optimize(self, **kwargs): + add_render_list = kwargs.get("add_render_list", True) + ai_core_perf_checker = 
AICorePerformanceChecker() + ai_core_perf_checker.data_filter(self.profiling_dataset) + if not ai_core_perf_checker.ai_core_performance_issues: + return self.result + ai_core_perf_checker.check_ai_core_performance(self.profiling_dataset) + ai_core_perf_checker.make_record(self.result) + self.html = ai_core_perf_checker.make_render(self.html_render, + add_render_list, + priority=self.get_priority(), + rank=kwargs.get("rank")) + return self.result + + def get_priority(self, max_mem_op_dur=None): + return PriorityBackgroundColor.low \ No newline at end of file diff --git a/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/ai_core_performance_checker.py b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/ai_core_performance_checker.py new file mode 100644 index 0000000000000000000000000000000000000000..fa62cd6f8958e28320d19e09d8ef1dae5609d03f --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/computation/ai_core_performance/ai_core_performance_checker.py @@ -0,0 +1,562 @@ +# Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging +import os +from functools import reduce +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager + +logger = logging.getLogger() + + +class AICorePerformanceChecker: + """ + operator performance checker + """ + _CHECKER = "AICorePerformanceChecker" + CUBE_OPERATOR_MEMORY_SIZE_MB = 100 + INNER_AXIS_256 = 256 + INNER_AXIS_128 = 128 + + def __init__(self): + self.result = dict() + self.ai_core_performance_issues = False + self._desc = "" + self.cube_dict = {} + self.fa_dict = {} + self.fa_list = [] + self.vector_dict = {} + self.load_aicore_perf_rules() + + @staticmethod + def get_operator_list(cube_dict, profiling_dataset): + operator_list = [] + for op in profiling_dataset.op_summary.op_list: + if op.op_name in cube_dict: + key = op.input_shapes[1:-1] + "-" + op.output_shapes[1:-1] + if key in cube_dict[op.op_name]: + operator_list.append(op) + return operator_list + + @staticmethod + def get_vector_list(profiling_dataset, vector_dict): + vector_list = [] + for op_name in vector_dict: + for shape in vector_dict[op_name]: + for operator in profiling_dataset.op_summary.op_list: + if operator.op_name == op_name and operator.input_shapes[1:-1] + "-" + operator.output_shapes[ + 1:-1] == shape: + vector_list.append(operator) + return vector_list + + @staticmethod + def safe_divide(numerator, denominator): + if denominator == 0: + logger.warning("Warning: Division by zero is not allowed.") + return None + return numerator / denominator + + @staticmethod + def memory_size(operator): + memory = 0 + input_shapes = operator.input_shapes[1:-1].split(";") + output_shapes = operator.output_shapes[1:-1] + for shapes in input_shapes: + if "," not 
in shapes and shapes != "": + # The extra one-dimensional input is the bias, so multiply by 2 in advance + memory += int(shapes) * 2 + continue + memory += reduce(lambda x, y: x * y, map(int, shapes.split(","))) + memory += reduce(lambda x, y: x * y, map(int, output_shapes.split(","))) + return memory * 2 / 1024 / 1024 + + def load_aicore_perf_rules(self): + language = AdditionalArgsManager().language + rule_path = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), + "rules", language, "aicore_performance.yaml" + ) + + if not os.path.exists(rule_path): + logger.warning("Skip analyzing AI Core performance issues, because %s does not exist.", rule_path) + + self.language = language + self.aicore_rules = FileManager.read_yaml_file(rule_path) + self._cube_problem = self.aicore_rules.get("cube_problem") + self._fa_problem = self.aicore_rules.get("fa_problem") + self._vector_problem = self.aicore_rules.get("vector_problem") + self._desc = self.aicore_rules.get("description") + self._bound_desc = self.aicore_rules.get("bound_description") + self._opti_desc = self.aicore_rules.get("optimization_description") + self._affinity_desc = self.aicore_rules.get("affinity_description") + self._cube_affinity_desc = self.aicore_rules.get("cube_affinity_desc") + self._fa_affinity_desc_head_dim_128 = self.aicore_rules.get("fa_affinity_desc_head_dim_128") + self._fa_affinity_desc_seq_len_128 = self.aicore_rules.get("fa_affinity_desc_seq_len_128") + self._fa_affinity_desc_head_dim_seq_len_128 = self.aicore_rules.get("fa_affinity_desc_head_dim_seq_len_128") + self._suggestion = self.aicore_rules.get("suggestion") + self._affinity_suggestion = self.aicore_rules.get("affinity_suggestion") + self._bound_suggestion = self.aicore_rules.get("bound_suggestion") + self._opti_suggestion = self.aicore_rules.get("optimization_suggestion") + self._operator_rules = {"cube_operators": self.aicore_rules.get("cube_operators"), + "fa_operators": self.aicore_rules.get("fa_operators"), + "vector_operators":
self.aicore_rules.get("vector_operators")} + + def data_filter(self, profiling_dataset: ProfilingDataset): + if not self.check_task_list(profiling_dataset): + return + + operator_list = profiling_dataset.op_summary.op_list + total_duration = sum(float(operator.task_duration) for operator in operator_list) + if (total_duration == 0): + return + cube_memory_dict, vector_type_dict = {}, {} + + for op in operator_list: + shapes = op.input_shapes[1:-1] + "-" + op.output_shapes[1:-1] + # preliminary filter cube operator + if op.task_type == "AI_CORE" and "matmul" in op.op_type.lower(): + cube_memory_dict.setdefault(op.op_name, {}).setdefault(shapes, 0) + cube_memory_dict[op.op_name][shapes] += self.memory_size(op) + continue + + # filter fa operator + if op.op_type == "FlashAttentionScore": + self.fa_dict.setdefault(op.op_name, set()).add(shapes) + self.fa_list.append(op) + elif op.op_type == "FlashAttentionScoreGrad": + self.fa_dict.setdefault(op.op_name, set()).add(shapes + "-grad") + self.fa_list.append(op) + + # preliminary filter vector operator + if op.task_type in ["AI_VECTOR_CORE", "MIX_AIV"]: + vector_type_dict.setdefault(op.op_type, set()).add(op) + + # filter cube operator + for op_name in cube_memory_dict: + for shapes in cube_memory_dict[op_name]: + if cube_memory_dict[op_name][shapes] >= self.CUBE_OPERATOR_MEMORY_SIZE_MB: + self.cube_dict.setdefault(op_name, set()).add(shapes) + + # filter vector operator + for op_type in vector_type_dict: + duration_group_by_time = sum(float(op.task_duration) for op in vector_type_dict[op_type]) + if (duration_group_by_time / total_duration) >= 0.01 or duration_group_by_time >= 1000000: + for op in vector_type_dict[op_type]: + shapes = op.input_shapes[1:-1] + "-" + op.output_shapes[1:-1] + self.vector_dict.setdefault(op.op_name, set()).add(shapes) + + if any([self.cube_dict, self.fa_dict, self.vector_dict]): + self.ai_core_performance_issues = True + + def check_ai_core_performance(self, promoting_dataset: 
ProfilingDataset): + for operator_type in ["cube", "fa", "vector"]: + try: + self.result[operator_type] = getattr(self, f"check_{operator_type}_operator")(promoting_dataset) + except (IndexError, ValueError, AttributeError) as e: + logger.warning(f"Failed to check ai core performance {operator_type} operator, {e}.") + self.result[operator_type] = [] + + if not any([self.result["cube"], self.result["fa"], self.result["vector"]]): + self.ai_core_performance_issues = False + + def check_cube_operator(self, profiling_dataset: ProfilingDataset): + cube_dict = self.cube_dict + suggestion = self._cube_affinity_desc + optimization_queue, bound_queue, affinity_queue = [], [], [] + operator_list = self.get_operator_list(cube_dict, profiling_dataset) + for op in cube_dict: + for shape in cube_dict[op]: + affinity_flag = self._check_cube_inner_axis(shape) + if not affinity_flag: + dtype, shape_duration = None, 0. + for operator in operator_list: + if (operator.op_name == op and + operator.input_shapes[1:-1] + "-" + operator.output_shapes[1:-1] == shape): + dtype = operator.input_data_types + shape_duration += float(operator.task_duration) + affinity_queue.append({"op_name": op, + "shape": shape.split("-")[0], + "dtype": dtype, + "duration": shape_duration, + "suggestion": suggestion}) + else: + shape_list = [] + for operator in operator_list: + if (operator.op_name == op and operator.input_shapes[1:-1] + "-" + + operator.output_shapes[1:-1] == shape): + shape_list.append(operator) + shape_duration = sum(float(operator.task_duration) for operator in shape_list) + dtype = shape_list[0].input_data_types if shape_list else None + bound, optimization = self.del_cube_operator_bound(shape_list) + if bound is None and optimization is None: + continue + if bound: + bound_queue.append({"op_name": op, + "shape": shape.split("-")[0], + "dtype": dtype, + "bound": bound, + "duration": shape_duration}) + else: + optimization_queue.append({"op_name": op, + "shape": shape.split("-")[0], + 
"dtype": dtype, + "optimization": round(optimization * 100, 2)}) + return [sorted(optimization_queue, key=lambda x: x["optimization"], reverse=True)[:5], + sorted(bound_queue, key=lambda x: x["duration"], reverse=True)[:5], + sorted(affinity_queue, key=lambda x: x["duration"], reverse=True)[:5]] + + def del_cube_operator_bound(self, shape_list): + bound, optimization, aic_mac_ratio, aic_mte2_ratio, length = "", 0., 0., 0., 0 + for operator in shape_list: + try: + aic_mac_ratio += float(operator.aic_mac_ratio) + aic_mte2_ratio += float(operator.aic_mte2_ratio) + length += 1 + except ValueError: + continue + aic_mac_ratio = self.safe_divide(aic_mac_ratio, length) + aic_mte2_ratio = self.safe_divide(aic_mte2_ratio, length) + if aic_mac_ratio is None or aic_mte2_ratio is None: + return None, None + aic_mac_ratio_rule, aic_mte2_ratio_rule = None, None + for operator_rule in self._operator_rules["cube_operators"]: + if operator_rule["target"] == "aic_mac_ratio": + aic_mac_ratio_rule = operator_rule + elif operator_rule["target"] == "aic_mte2_ratio": + aic_mte2_ratio_rule = operator_rule + if (aic_mac_ratio >= aic_mac_ratio_rule["threshold"] + and aic_mte2_ratio >= aic_mte2_ratio_rule["threshold"]): + bound = aic_mac_ratio_rule["bound"] + "_and_" + aic_mte2_ratio_rule["bound"] + "_bound" + elif aic_mac_ratio >= aic_mte2_ratio_rule["threshold"]: + bound = aic_mac_ratio_rule["bound"] + elif aic_mte2_ratio >= aic_mte2_ratio_rule["threshold"]: + bound = aic_mte2_ratio_rule["bound"] + else: + optimization = max(aic_mac_ratio_rule["threshold"] - aic_mac_ratio, + aic_mte2_ratio_rule["threshold"] - aic_mte2_ratio) + return bound, optimization + + def check_fa_operator(self, profiling_dataset: ProfilingDataset): + fa_list, fa_dict = self.fa_list, self.fa_dict + optimization_queue, bound_queue, affinity_queue = [], [], [] + # 不亲和算子筛选 + for op in fa_dict: + for shape in fa_dict[op]: + affinity_flag, dtype, shape_duration, suggestion = self._check_fa_inner_axis(fa_list, op, shape) + 
if affinity_flag: + # For operators without affinity, compute the duration and add them to affinity_queue + affinity_queue.append({"op_name": op, + "shape": shape.split("-")[0], + "dtype": dtype, + "suggestion": suggestion, + "duration": shape_duration}) + else: + # Handle bound operators and operators with optimization headroom + if len(shape.split("-")) > 2: + bound, optimization, dtype, shape_duration = self.del_fa_operator_bound_grad(op, shape, fa_list) + else: + bound, optimization, dtype, shape_duration = self.del_fa_operator_bound(op, shape, fa_list) + if bound is None and optimization is None: + continue + if bound: + bound_queue.append({"op_name": op, + "shape": shape.split("-")[0], + "dtype": dtype, + "bound": bound, + "duration": shape_duration}) + else: + optimization_queue.append({"op_name": op, + "shape": shape.split("-")[0], + "dtype": dtype, + "optimization": round(optimization * 100, 2)}) + + return [sorted(optimization_queue, key=lambda x: x["optimization"], reverse=True)[:5], + sorted(bound_queue, key=lambda x: x["duration"], reverse=True)[:5], + sorted(affinity_queue, key=lambda x: x["duration"], reverse=True)[:5]] + + def del_fa_operator_bound_grad(self, op, shape, fa_list): + aic_fixpipe_ratio, aic_mte2_ratio, shape_duration, optimization, length = 0., 0., 0., 0., 0 + bound, dtype = "", None + for operator in fa_list: + if (operator.op_name == op and + operator.input_shapes[1:-1] + "-" + + operator.output_shapes[1:-1] + "-grad" == shape): + try: + aic_fixpipe_ratio += float(operator.aic_fixpipe_ratio) + aic_mte2_ratio += float(operator.aic_mte2_ratio) + shape_duration += float(operator.task_duration) + dtype = operator.input_data_types + length += 1 + except ValueError: + continue + aic_fixpipe_ratio = self.safe_divide(aic_fixpipe_ratio, length) + aic_mte2_ratio = self.safe_divide(aic_mte2_ratio, length) + if aic_mte2_ratio is None or aic_fixpipe_ratio is None: + return None, None, None, None + aic_fixpipe_ratio_rule, aic_mte2_ratio_rule = None, None + for rule in self._operator_rules["fa_operators"]: + if rule["target"] == "aic_fixpipe_ratio": +
aic_fixpipe_ratio_rule = rule + elif rule["target"] == "aic_mte2_ratio": + aic_mte2_ratio_rule = rule + if (aic_mte2_ratio >= aic_mte2_ratio_rule["threshold"] and + aic_fixpipe_ratio >= aic_fixpipe_ratio_rule["threshold"]): + bound = aic_fixpipe_ratio_rule["bound"] + "_and_" + aic_mte2_ratio_rule["bound"] + "_bound" + elif aic_mte2_ratio >= aic_mte2_ratio_rule["threshold"]: + bound = aic_mte2_ratio_rule["bound"] + elif aic_fixpipe_ratio >= aic_fixpipe_ratio_rule["threshold"]: + bound = aic_fixpipe_ratio_rule["bound"] + else: + optimization = max(aic_fixpipe_ratio_rule["threshold"] - aic_fixpipe_ratio, + aic_mte2_ratio_rule["threshold"] - aic_mte2_ratio) + return bound, optimization, dtype, shape_duration + + def del_fa_operator_bound(self, op, shape, fa_list): + aiv_vec_ratio, aic_mte2_ratio, shape_duration, optimization, length = 0., 0., 0., 0., 0 + bound, dtype = "", None + for operator in fa_list: + if (operator.op_name == op and + operator.input_shapes[1:-1] + "-" + operator.output_shapes[1:-1] == shape): + try: + aiv_vec_ratio += float(operator.aiv_vec_ratio) + aic_mte2_ratio += float(operator.aic_mte2_ratio) + shape_duration += float(operator.task_duration) + dtype = operator.input_data_types + length += 1 + except ValueError: + continue + aiv_vec_ratio = self.safe_divide(aiv_vec_ratio, length) + aic_mte2_ratio = self.safe_divide(aic_mte2_ratio, length) + if aiv_vec_ratio is None or aic_mte2_ratio is None: + return None, None, None, None + aiv_vec_ratio_rule, aic_mte2_ratio_rule = None, None + for rule in self._operator_rules["fa_operators"]: + if rule["target"] == "aiv_vec_ratio": + aiv_vec_ratio_rule = rule + elif rule["target"] == "aic_mte2_ratio": + aic_mte2_ratio_rule = rule + if (aic_mte2_ratio >= aic_mte2_ratio_rule["threshold"] + and aiv_vec_ratio >= aiv_vec_ratio_rule["threshold"]): + bound = aic_mte2_ratio_rule["bound"] + "_and_" + aiv_vec_ratio_rule["bound"] + "_bound" + elif aic_mte2_ratio >= aic_mte2_ratio_rule["threshold"]: + bound = aic_mte2_ratio_rule["bound"] + elif aiv_vec_ratio >=
aiv_vec_ratio_rule["threshold"]: + bound = aiv_vec_ratio_rule["bound"] + else: + optimization = max(aiv_vec_ratio_rule["threshold"] - aiv_vec_ratio, + aic_mte2_ratio_rule["threshold"] - aic_mte2_ratio) + return bound, optimization, dtype, shape_duration + + def check_vector_operator(self, profiling_dataset: ProfilingDataset): + vector_dict = self.vector_dict + optimization_queue, bound_queue = [], [] + vector_list = self.get_vector_list(profiling_dataset, vector_dict) + for op_name in vector_dict: + for shape in vector_dict[op_name]: + aiv_vec_ratio, aiv_mte2_ratio, aiv_mte3_ratio, shape_duration = 0., 0., 0., 0. + length, dtype = 0, "" + for operator in vector_list: + if (operator.op_name == op_name and + operator.input_shapes[1:-1] + "-" + operator.output_shapes[1:-1] == shape): + try: + aiv_vec_ratio += float(operator.aiv_vec_ratio) + aiv_mte2_ratio += float(operator.aiv_mte2_ratio) + aiv_mte3_ratio += float(operator.aiv_mte3_ratio) + shape_duration += float(operator.task_duration) + dtype = operator.input_data_types + length += 1 + except ValueError: + continue + aiv_vec_ratio = self.safe_divide(aiv_vec_ratio, length) + aiv_mte2_ratio = self.safe_divide(aiv_mte2_ratio, length) + aiv_mte3_ratio = self.safe_divide(aiv_mte3_ratio, length) + if aiv_vec_ratio is None or aiv_mte2_ratio is None or aiv_mte3_ratio is None: + continue + bound, optimization = self.del_vector_operator_bound(aiv_mte2_ratio, aiv_mte3_ratio, aiv_vec_ratio) + if bound: + bound_queue.append({"op_name": op_name, + "shape": shape.split("-")[0], + "bound": bound, + "dtype": dtype, + "duration": shape_duration}) + else: + optimization_queue.append({"op_name": op_name, + "shape": shape.split("-")[0], + "dtype": dtype, + "optimization": round(optimization * 100, 2)}) + return [sorted(optimization_queue, key=lambda x: x["optimization"], reverse=True)[:5], + sorted(bound_queue, key=lambda x: x["duration"], reverse=True)[:5]] + + def del_vector_operator_bound(self, aiv_mte2_ratio, aiv_mte3_ratio, 
aiv_vec_ratio): + bound, optimization = "", 0 + aiv_vec_ratio_rule, aiv_mte2_ratio_rule, aiv_mte3_ratio_rule, total_rule = None, None, None, None + for operator_rule in self._operator_rules["vector_operators"]: + if operator_rule["target"] == "aiv_vec_ratio": + aiv_vec_ratio_rule = operator_rule + elif operator_rule["target"] == "aiv_mte2_ratio": + aiv_mte2_ratio_rule = operator_rule + elif operator_rule["target"] == "aiv_mte3_ratio": + aiv_mte3_ratio_rule = operator_rule + elif operator_rule["target"] == "total": + total_rule = operator_rule + if aiv_vec_ratio + aiv_mte2_ratio + aiv_mte3_ratio >= total_rule["threshold"]: + bound = total_rule["bound"] + elif aiv_mte2_ratio >= aiv_mte2_ratio_rule["threshold"]: + bound = aiv_mte2_ratio_rule["bound"] + elif aiv_mte3_ratio >= aiv_mte3_ratio_rule["threshold"]: + bound = aiv_mte3_ratio_rule["bound"] + elif aiv_vec_ratio >= aiv_vec_ratio_rule["threshold"]: + bound = aiv_vec_ratio_rule["bound"] + else: + optimization = max(aiv_vec_ratio_rule["threshold"] - aiv_vec_ratio, + aiv_mte2_ratio_rule["threshold"] - aiv_mte2_ratio, + aiv_mte3_ratio_rule["threshold"] - aiv_mte3_ratio) + return bound, optimization + + def draw_record(self, op_type: str, result: OptimizeResult): + suggestion_keys = ['opti', 'bound', 'affinity'] + desc = dict.fromkeys(suggestion_keys, "") + problem_map = { + 'cube': self._cube_problem, + 'fa': self._fa_problem, + 'vector': self._vector_problem + } + if op_type not in problem_map: + return + optimization_item = OptimizeItem(problem_map[op_type], self._desc, [self._suggestion]) + result.add(OptimizeRecord(optimization_item)) + headers = [ + "Type", + "Description and Suggestion", + ] + result.add_detail(problem_map[op_type], headers=headers) + for opti_issue in self.result[op_type][0]: + opti_sugg = self._opti_suggestion.format(**opti_issue) + desc["opti"] += opti_sugg + if desc["opti"]: + result.add_detail(problem_map[op_type], detail=[self._opti_desc, desc["opti"]]) + for bound_issue in 
self.result[op_type][1]: + bound_sugg = self._bound_suggestion.format(**bound_issue) + desc["bound"] += bound_sugg + if desc["bound"]: + result.add_detail(problem_map[op_type], detail=[self._bound_desc, desc["bound"]]) + if op_type == "vector":  # the vector type has no affinity suggestion + return + for affinity_issue in self.result[op_type][2]: + affinity_sugg = self._affinity_suggestion.format(**affinity_issue) + desc["affinity"] += affinity_sugg + if desc["affinity"]: + result.add_detail(problem_map[op_type], detail=[self._affinity_desc, desc["affinity"]]) + + def make_record(self, result: OptimizeResult): + """ + make record for what and how to optimize + """ + if not self.ai_core_performance_issues: + return self.ai_core_performance_issues + if any(self.result["cube"]): + self.draw_record("cube", result) + if any(self.result["fa"]): + self.draw_record("fa", result) + if any(self.result["vector"]): + self.draw_record("vector", result) + + return True + + def make_render(self, html_render, add_render_list=True, **kwargs): + if not self.ai_core_performance_issues: + return self.ai_core_performance_issues + + priority = kwargs.get("priority") + return html_render.render_template(key="computation", + template_dir="templates", + template_name="ai_core_performance.html", + format_result=self.result, + language=self.language, + add_render_list=add_render_list, + priority_background_color=priority, + rank=kwargs.get("rank")) + + def check_task_list(self, profiling_dataset: ProfilingDataset) -> bool: + if not hasattr(profiling_dataset, "op_summary"): + logger.warning("Skip %s checker because of not containing %s", self._CHECKER, "op summary") + return False + if not hasattr(profiling_dataset.op_summary, "op_list"): + logger.warning("Skip %s checker because of not containing %s", self._CHECKER, "op_list") + return False + if (not hasattr(profiling_dataset.op_summary.op_list[0], "input_shapes") or + not hasattr(profiling_dataset.op_summary.op_list[0], "input_data_types")): + logger.warning("Skip 
%s checker because of not containing input data", self._CHECKER) + return False + return True + + def _check_cube_inner_axis(self, shape): + # Check whether the inner axes of the input shape are multiples of 256 + shapes = shape.split("-")[0].split(";") + if (len(shape.split("-")[0].split(";")[0].split(","))) == 4: + # NZ format + b_axis, c_axis = int(shapes[0].split(",")[1]), int(shapes[0].split(",")[2]) + f_axis, g_axis = int(shapes[1].split(",")[1]), int(shapes[1].split(",")[2]) + return (b_axis * c_axis % self.INNER_AXIS_256 == 0) and (f_axis * g_axis % self.INNER_AXIS_256 == 0) + elif (len(shape.split("-")[0].split(";")[0].split(","))) == 2: + # ND format + l_axis, k_axis = int(shapes[0].split(",")[1]), int(shapes[1].split(",")[1]) + return (l_axis % self.INNER_AXIS_256 == 0) and (k_axis % self.INNER_AXIS_256 == 0) + else: + return False + + def _check_fa_inner_axis(self, fa_list, op, shape): + shape_duration = 0. + affinity_flag = False + dtype = None + suggestion = "" + if "varlen" in op.lower(): + # Handle variable-length operators; set affinity_flag to True when the inner axis is not affinity-friendly + inner_axis = int(shape.split("-")[0].split(";")[0].split(",")[2]) + if inner_axis % self.INNER_AXIS_128 != 0: + affinity_flag = True + suggestion = self._fa_affinity_desc_head_dim_128 + for operator in fa_list: + if (operator.op_name == op and + operator.input_shapes[1:-1] + "-" + operator.output_shapes[1:-1] == shape): + shape_duration += float(operator.task_duration) + dtype = operator.input_data_types + else: + # Handle fixed-length operators; set affinity_flag to True when head_dim or seq_len is not affinity-friendly + head_dim = 0 + seq_len = int(shape.split("-")[1].split(";")[0].split(",")[2]) + input_first_tensor = shape.split("-")[0].split(";")[0].split(",") + if len(input_first_tensor) == 3: + head_dim = int(input_first_tensor[2]) / int(shape.split("-")[1].split(";")[0].split(",")[1]) + else: + head_dim = int(input_first_tensor[3]) + if head_dim % self.INNER_AXIS_128 != 0 and seq_len % self.INNER_AXIS_128 != 0: + affinity_flag = True + suggestion = self._fa_affinity_desc_head_dim_seq_len_128 + elif head_dim % self.INNER_AXIS_128 != 0: + 
affinity_flag = True + suggestion = self._fa_affinity_desc_head_dim_128 + elif seq_len % self.INNER_AXIS_128 != 0: + affinity_flag = True + suggestion = self._fa_affinity_desc_seq_len_128 + if affinity_flag: + for operator in fa_list: + if (operator.op_name == op and + operator.input_shapes[1:-1] + "-" + + operator.output_shapes[1:-1] == shape): + shape_duration += float(operator.task_duration) + dtype = operator.input_data_types + return affinity_flag, dtype, shape_duration, suggestion diff --git a/profiler/advisor/analyzer/memory/__init__.py b/profiler/msprof_analyze/advisor/analyzer/computation/aicpu/__init__.py similarity index 100% rename from profiler/advisor/analyzer/memory/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/computation/aicpu/__init__.py diff --git a/profiler/advisor/analyzer/computation/aicpu/aicpu_checker.py b/profiler/msprof_analyze/advisor/analyzer/computation/aicpu/aicpu_checker.py similarity index 88% rename from profiler/advisor/analyzer/computation/aicpu/aicpu_checker.py rename to profiler/msprof_analyze/advisor/analyzer/computation/aicpu/aicpu_checker.py index 0c724f45aa2a65a40cb2fd53eebc84e930bd4646..bd3b18eb17937582d634b057dd0a4b3ef62115c8 100644 --- a/profiler/advisor/analyzer/computation/aicpu/aicpu_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/aicpu/aicpu_checker.py @@ -17,21 +17,19 @@ import os from functools import partial from typing import List, Dict, Optional -from profiler.advisor.analyzer.computation.operator_checker import OperatorChecker, logger -from profiler.advisor.analyzer.schedule.fusion_ops.timeline_api_stack_checker import OpStackFinder -from profiler.advisor.common import constant -from profiler.advisor.dataset.dataset import Dataset -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset -from profiler.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset -from profiler.cluster_analyse.common_func.file_manager import FileManager 
+from msprof_analyze.advisor.analyzer.computation.operator_checker import OperatorChecker, logger +from msprof_analyze.advisor.analyzer.schedule.fusion_ops.timeline_api_stack_checker import OpStackFinder +from msprof_analyze.advisor.dataset.dataset import Dataset +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.constant import Constant class AicpuChecker(OperatorChecker): _CHECKER = "aicpu operator" - _PROBLEM = "AICPU operator" _MIN_TASK_DURATION = 20 - _description = f"Some operators and task duration exceed {_MIN_TASK_DURATION} us, such as :\n" - _SUGGESTION: List[str] = ["Modify code to avoid aicpu operator"] STACK_INFO_ITEMS = "stack_info" SUGGESTION_INFO_ITEMS = "suggestions" _ITEMS = [ @@ -43,19 +41,28 @@ class AicpuChecker(OperatorChecker): super(AicpuChecker, self).__init__(cann_version=cann_version) self.aicpu_rules: Dict = {} self.aicpu_checker: Dict = {} - self.load_aicpu_rules() self.total_task_duration = 0.0 self.aicpu_task_duration = 0.0 + self.double_suggestion = None + self.load_aicpu_rules() - def load_aicpu_rules(self, rule_path="rules/aicpu_rules.yaml"): - if not os.path.isabs(rule_path): - rule_path = os.path.join(os.path.dirname(__file__), - "../../../", rule_path) + def load_aicpu_rules(self): + language = AdditionalArgsManager().language + rule_path = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), + "rules", + language, + "aicpu_rules.yaml" + ) if not os.path.exists(rule_path): logger.warning("Skip analyze aicpu issues, because %s does not exist.", rule_path) self.aicpu_rules = FileManager.read_yaml_file(rule_path) + self._problem = self.aicpu_rules.get("problem") + 
self._description = self.aicpu_rules.get("description").format(self._MIN_TASK_DURATION) + self._suggestion = [self.aicpu_rules.get("suggestion")] + self.double_suggestion = self.aicpu_rules.get("double_suggestion") self.filter_aicpu_rules(self.aicpu_rules) for checker_name, check_rule in self.aicpu_rules.items(): if not isinstance(check_rule, (list, dict,)): @@ -97,10 +104,10 @@ class AicpuChecker(OperatorChecker): data: Dict[str, Dataset] = {} event_dataset = ComputationAnalysisDataset(collection_path=profiling_data.collection_path, data=data, - task_type=constant.AI_CPU) + task_type=Constant.AI_CPU) # disable multiprocessing, avoid cost time of enable new process for light task - api_stack_finder.get_api_stack_by_op(event_dataset, op_name_list, constant.AI_CPU, + api_stack_finder.get_api_stack_by_op(event_dataset, op_name_list, Constant.AI_CPU, disable_multiprocess=True) return api_stack_finder.get_stack_record() @@ -153,8 +160,7 @@ class AicpuChecker(OperatorChecker): and op.op_name not in double_type_ai_cpu_operator): double_type_ai_cpu_operator.append(op.op_name) if bool(double_type_ai_cpu_operator): - self._SUGGESTION.append("Try to convert double type operator to float, such as {}".format( - ",".join(double_type_ai_cpu_operator))) + self._suggestion.append(self.double_suggestion.format(",".join(double_type_ai_cpu_operator))) return True def make_render(self, html_render, record, add_render_list=True, **kwargs): @@ -163,7 +169,7 @@ class AicpuChecker(OperatorChecker): template_dir="templates", template_name="operator_ai_cpu.html", format_result=self.format_operator_result(record, - constant.OPERATOR_LIST_UNLIMIT), + Constant.OPERATOR_LIST_UNLIMIT), add_render_list=add_render_list, priority_background_color=priority, rank=kwargs.get("rank")) @@ -197,7 +203,7 @@ class AicpuChecker(OperatorChecker): return format_result def group_by_list(self, op_list, op_key_list: List = None, - limit: int = constant.OPERATOR_LIST_UNLIMIT): + limit: int = 
Constant.OPERATOR_LIST_UNLIMIT): if op_list is None: op_list = [] if op_key_list is None: @@ -220,7 +226,7 @@ class AicpuChecker(OperatorChecker): return True def _check_operator(self, op_info) -> bool: - return op_info.task_type == constant.AI_CPU + return op_info.task_type == Constant.AI_CPU class BaserChecker: diff --git a/profiler/advisor/analyzer/overall/__init__.py b/profiler/msprof_analyze/advisor/analyzer/computation/bound/__init__.py similarity index 100% rename from profiler/advisor/analyzer/overall/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/computation/bound/__init__.py diff --git a/profiler/advisor/analyzer/computation/bound/block_dim_checker.py b/profiler/msprof_analyze/advisor/analyzer/computation/bound/block_dim_checker.py similarity index 78% rename from profiler/advisor/analyzer/computation/bound/block_dim_checker.py rename to profiler/msprof_analyze/advisor/analyzer/computation/bound/block_dim_checker.py index 6eef6f81310c9a186c57b340f163f691f7336d76..58599a734f2cbcf1b449aab3cd2fc8583fa38619 100644 --- a/profiler/advisor/analyzer/computation/bound/block_dim_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/bound/block_dim_checker.py @@ -15,10 +15,12 @@ import logging from typing import List -from profiler.advisor.analyzer.computation.operator_checker import OperatorChecker -from profiler.advisor.common import constant -from profiler.advisor.config.config import Config -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.analyzer.computation.operator_checker import OperatorChecker +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager logger = 
logging.getLogger() @@ -26,15 +28,22 @@ logger = logging.getLogger() class BlockDimChecker(OperatorChecker): _SUGGESTION: List[str] = [] _CHECKER = "block dim" - _PROBLEM = "block dim" _aicore_num = 0 _aiv_num = 0 - _description = "some operator does not make full use of {} ai core" _ITEMS = [ "op_name", "op_type", "task_type", "task_duration", "income", "block_dim", "mix_block_dim", "input_shapes", "input_data_types", "input_formats", "output_shapes", "output_data_types", "output_formats" ] + def __init__(self, cann_version): + super(BlockDimChecker, self).__init__(cann_version=cann_version) + self.prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) + + self._problem = self.prompt_class.PROBLEM + self._description = self.prompt_class.DESCRIPTION + self.aiv_num_desc = self.prompt_class.AIV_NUM_DESCRIPTION + self.top_duration_op_desc = self.prompt_class.TOP_DURATION_OP_DESCRIPTION + def pre_check(self, profiling_data) -> bool: return not self.is_dynamic_shape(profiling_data) @@ -44,7 +53,7 @@ class BlockDimChecker(OperatorChecker): template_dir="templates", template_name="operator_block_dim.html", format_result=self.format_operator_result(record, - constant.OPERATOR_OUT_TOPK), + Constant.OPERATOR_OUT_TOPK), add_render_list=add_render_list, priority_background_color=priority, rank=kwargs.get("rank")) @@ -82,11 +91,11 @@ class BlockDimChecker(OperatorChecker): self._aiv_num = int(Config().get_config("aiv_num")) except ValueError as e: logger.warning("get aiv_num failed, please check info.json: %s", e) + self._description = self._description.format(self._aicore_num) if self._aiv_num: - self._description += f" or {self._aiv_num} ai vector core" - self._description += f";\n Top-{OperatorChecker._MAX_TUNE_OP_NUM} operator of " \ - "task duration are as follows:\n" + self._description += self.aiv_num_desc.format(self._aiv_num) + self._description += self.top_duration_op_desc.format(OperatorChecker._MAX_TUNE_OP_NUM) return True def _check_operator(self, 
op_info) -> bool: diff --git a/profiler/advisor/analyzer/computation/bound/operator_bound_checker.py b/profiler/msprof_analyze/advisor/analyzer/computation/bound/operator_bound_checker.py similarity index 75% rename from profiler/advisor/analyzer/computation/bound/operator_bound_checker.py rename to profiler/msprof_analyze/advisor/analyzer/computation/bound/operator_bound_checker.py index 9ef64e546948945a76a9a3ea7a0a142bd94b2b4d..ffa5b9dac59c2319d868a6530669d4537c30729d 100644 --- a/profiler/advisor/analyzer/computation/bound/operator_bound_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/bound/operator_bound_checker.py @@ -15,11 +15,13 @@ import logging from typing import List -from profiler.advisor.analyzer.computation.operator_checker import OperatorChecker -from profiler.advisor.common import constant -from profiler.advisor.config.config import Config -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset -from profiler.advisor.utils.utils import to_percent +from msprof_analyze.advisor.analyzer.computation.operator_checker import OperatorChecker +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.utils.utils import to_percent logger = logging.getLogger() @@ -27,17 +29,19 @@ logger = logging.getLogger() class OperatorBoundChecker(OperatorChecker): _MIN_TASK_DURATION = 20 # min task duration 20us _CHECKER = "operator no bound" - _PROBLEM = "operator no bound" _SUGGESTION: List[str] = [] - _description = ( - f"There is no mte, cube, vector, scalar ratio is more than {to_percent(Config().operator_bound_ratio)};\n" + - f"Top task duration operators need to be tuned are as follows: 
\n") _ITEMS = [ "op_name", "op_type", "task_type", "task_duration", "vec_ratio", "mac_ratio", "scalar_ratio", "mte1_ratio", "mte2_ratio", "mte3_ratio", "block_dim", "input_shapes", "input_data_types", "input_formats", "output_shapes", "output_data_types", "output_formats" ] + def __init__(self, cann_version) -> None: + super().__init__(cann_version=cann_version) + self.prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) + self._problem = self.prompt_class.PROBLEM + self._description = self.prompt_class.DESCRIPTION.format(to_percent(Config().operator_bound_ratio)) + def pre_check(self, profiling_data) -> bool: return not self.is_dynamic_shape(profiling_data) @@ -47,7 +51,7 @@ class OperatorBoundChecker(OperatorChecker): template_dir="templates", template_name="operator_no_bound.html", format_result=self.format_operator_result(record, - constant.OPERATOR_OUT_TOPK), + Constant.OPERATOR_OUT_TOPK), add_render_list=add_render_list, priority_background_color=priority, rank=kwargs.get("rank")) diff --git a/profiler/advisor/analyzer/schedule/__init__.py b/profiler/msprof_analyze/advisor/analyzer/computation/op_compile/__init__.py similarity index 100% rename from profiler/advisor/analyzer/schedule/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/computation/op_compile/__init__.py diff --git a/profiler/advisor/analyzer/computation/op_compile/dynamic_shape_checker.py b/profiler/msprof_analyze/advisor/analyzer/computation/op_compile/dynamic_shape_checker.py similarity index 68% rename from profiler/advisor/analyzer/computation/op_compile/dynamic_shape_checker.py rename to profiler/msprof_analyze/advisor/analyzer/computation/op_compile/dynamic_shape_checker.py index 6ce417729adf821803faa8dc50d0e8e385dddd28..f247156b7026b69a11cf15fdb29476ab638e10a1 100644 --- a/profiler/advisor/analyzer/computation/op_compile/dynamic_shape_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/op_compile/dynamic_shape_checker.py @@ -16,30 +16,31 @@ 
import copy import logging from typing import List -from profiler.advisor.analyzer.computation.operator_checker import OperatorChecker -from profiler.advisor.common import constant -from profiler.advisor.config.config import Config -from profiler.advisor.dataset.profiling.info_collection import OpInfo -from profiler.advisor.result.item import OptimizeItem, StatisticsItem, OptimizeRecord +from msprof_analyze.advisor.analyzer.computation.operator_checker import OperatorChecker +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.dataset.profiling.info_collection import OpInfo +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.item import OptimizeItem, StatisticsItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager logger = logging.getLogger() class DynamicShapeChecker(OperatorChecker): - ENABLE_COMPILED_SUGGESTION = "1. Please try to set environment by execute `export HOST_CACHE_CAPACITY=20`.\n." \ - "2. 
Please place the following code at the entrance of the python script to disable jit compile.\n " \ - "Code: `torch_npu.npu.set_compile_mode(jit_compile=False);\n " \ - "torch_npu.npu.config.allow_internal_format = False`.\n" - _SUGGESTION: List[str] = [ENABLE_COMPILED_SUGGESTION] _CHECKER = "dynamic shape operator" - _PROBLEM = "Dynamic shape operator" - _description = f"Found all operators are dynamic shape" _op_list: List[OpInfo] = [] _tune_op_list: List[str] = [] # record op name to be tuned, and save to tune_ops_file.cfg _op_views: List = [] def __init__(self, cann_version) -> None: super().__init__(cann_version=cann_version) + self.prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) + self._problem = self.prompt_class.PROBLEM + self._description = self.prompt_class.DESCRIPTION + self.enable_compiled_suggestion = self.prompt_class.ENABLE_COMPILED_SUGGESTION + self._suggestion = [self.prompt_class.ENABLE_COMPILED_SUGGESTION] + self.release_suggestion = self.prompt_class.RELEASE_SUGGESTION def check(self, profiling_data) -> bool: return self.is_dynamic_shape(profiling_data) @@ -48,13 +49,12 @@ class DynamicShapeChecker(OperatorChecker): """ make record for what and how to optimize """ - if rank is not None: - self._PROBLEM = f"rank {rank} ".capitalize() + self._PROBLEM.lower() + self._problem = self.prompt_class.RANK_ID.format(rank) + self._problem.lower() optimization_item = OptimizeItem( - self._PROBLEM, + self._problem, self._description, - self._SUGGESTION + self._suggestion ) statistics_item = StatisticsItem("", "", 1) return OptimizeRecord(optimization_item, statistics_item) @@ -70,9 +70,8 @@ class DynamicShapeChecker(OperatorChecker): release_suggestion_list = [] for suggestion in optimization_item.suggestion: release_suggestion = copy.deepcopy(suggestion) - if release_suggestion == DynamicShapeChecker.ENABLE_COMPILED_SUGGESTION: - release_suggestion += \ - f"for details please refer to link : LINK" + if release_suggestion == 
self.enable_compiled_suggestion: + release_suggestion += self.release_suggestion.format(Config().enable_compiled_tune_url) + release_suggestion_list.append(release_suggestion.replace('\n', '<br>')) + format_result = {"record": record.__dict__, "suggestion": '<br>'.join(release_suggestion_list)} + return format_result diff --git a/profiler/advisor/analyzer/computation/operator_checker.py b/profiler/msprof_analyze/advisor/analyzer/computation/operator_checker.py similarity index 79% rename from profiler/advisor/analyzer/computation/operator_checker.py rename to profiler/msprof_analyze/advisor/analyzer/computation/operator_checker.py index 17be15b4eb547e1d2fce198fd0eeaef071b8ad31..4be0fc66ae8b8f75ca0518228cbdccde1a0d7c1e 100644 --- a/profiler/advisor/analyzer/computation/operator_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/operator_checker.py @@ -17,45 +17,47 @@ import logging from textwrap import fill from typing import List -from profiler.advisor.common import constant -from profiler.advisor.common.enum_params_parser import EnumParamsParser -from profiler.advisor.common.version_control import VersionControl -from profiler.advisor.config.config import Config -from profiler.advisor.dataset.profiling.info_collection import OpInfo -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset -from profiler.advisor.result.item import OptimizeItem, StatisticsItem, OptimizeRecord -from profiler.advisor.utils.utils import safe_division, convert_to_float +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.enum_params_parser import EnumParamsParser +from msprof_analyze.advisor.common.version_control import VersionControl +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.dataset.profiling.info_collection import OpInfo +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.result.item import OptimizeItem, StatisticsItem, OptimizeRecord +from msprof_analyze.advisor.utils.utils import safe_division, convert_to_float +from msprof_analyze.prof_common.additional_args_manager import 
AdditionalArgsManager logger = logging.getLogger() class OperatorChecker(VersionControl): - _SUPPORT_VERSIONS = EnumParamsParser().get_options(constant.CANN_VERSION) - _MAX_TUNE_OP_NUM = constant.OPERATOR_OUT_TOPK + _SUPPORT_VERSIONS = EnumParamsParser().get_options(Constant.CANN_VERSION) + _MAX_TUNE_OP_NUM = Constant.OPERATOR_OUT_TOPK _MIN_TASK_DURATION = 0 _MIN_TASK_DURATION_RATIO = 1.0 _MIN_TOTAL_DURATION_RATIO = 1.0 _CHECKER = str() - _PROBLEM = str() + _problem = str() _description = str() STACK_INFO_ITEMS = "" _ITEMS: List[str] = [] - _SUGGESTION: List[str] = [] + _suggestion: List[str] = [] SKIP_CHECK_MSG = "Skip %s checker because of not containing %s" _tune_op_info_list: List[OpInfo] = [] - PyTorch_OPERATOR_TUNE_SUGGESTION = f"Optimize operator by AOE, such as:\n" \ - f"'aoe --job_type=2 --model_path=$user_dump_path " \ - f"--tune_ops_file={Config().tune_ops_file}'\n" - MSLite_OPERATOR_TUNE_SUGGESTION = f"Optimize operator by AOE in mindspore lite framework, such as:\n" \ - f"converter_lite --fmk=ONNX --optimize=ascend_oriented --saveType=MINDIR " \ - f"--modelFile=$user_model.onnx --outputFile=user_model " \ - f"--configFile=./config.txt\n" def __init__(self, cann_version: str): self.cann_version = cann_version self._op_list: List[OpInfo] = [] self._tune_op_list: List[str] = [] + self.prompt_class = BasePrompt.get_prompt_class("OperatorChecker") + self.rank_id = self.prompt_class.RANK_ID + self.pytorch_op_tune_suggestion = self.prompt_class.PYTORCH_OPERATOR_TUNE_SUGGESTION + self.mslite_op_tune_suggestion = self.prompt_class.MSLITE_OPERATOR_TUNE_SUGGESTION + self.pytorch_release_suggestion = self.prompt_class.PYTORCH_RELEASE_SUGGESTION + self.mslite_release_suggestion = self.prompt_class.MSLITE_RELEASE_SUGGESTION + @staticmethod def get_ratio(op_info: OpInfo, attr: str) -> float: if not op_info.has_attr(attr): @@ -65,13 +67,12 @@ class OperatorChecker(VersionControl): return 0 return float(value) - @classmethod - def get_name(cls): + def get_name(self): 
""" get name of checker :return: checker name """ - return cls._PROBLEM + return self._problem def check(self, profiling_data: ProfilingDataset) -> bool: """ @@ -105,7 +106,7 @@ class OperatorChecker(VersionControl): self._op_list.sort(key=lambda x: float(x.get_attr("task_duration")), reverse=True) self._tune_op_info_list.sort(key=lambda x: float(x.get_attr("task_duration")), reverse=True) for op in self._op_list: - if op.op_name not in self._tune_op_list and len(self._tune_op_list) < constant.OPERATOR_OUT_TOPK: + if op.op_name not in self._tune_op_list and len(self._tune_op_list) < Constant.OPERATOR_OUT_TOPK: self._tune_op_list.append(op.op_name) return True return False @@ -118,7 +119,7 @@ class OperatorChecker(VersionControl): """ if rank is not None: - self._PROBLEM = f"rank {rank} ".capitalize() + self._PROBLEM.lower() + self._problem = self.rank_id.format(rank) + self._problem.lower() task_duration_list = [float(op_info.get_attr("task_duration")) for op_info in self._op_list @@ -128,36 +129,19 @@ class OperatorChecker(VersionControl): count = len(task_duration_list) statistics_item = StatisticsItem(total_task_duration, total_cost_time, count, self.get_incomes()) optimization_item = OptimizeItem( - self._PROBLEM, + self._problem, self._get_description(self._description, self.get_op_type_list(self._op_list)[:self._MAX_TUNE_OP_NUM]), - self._SUGGESTION + self._suggestion ) return OptimizeRecord(optimization_item, statistics_item) - def _get_description(self, description, op_type_list=None): - if not op_type_list: - return description - - desc_suffix = [] - for i, _ in enumerate(op_type_list): - if i % 3 == 0 and i != 0: - desc_suffix.append("\n") - - desc_suffix.append(f"{op_type_list[i]}") - - if i < len(op_type_list) - 1: - desc_suffix.append(", ") - - description += "".join(desc_suffix) - return description - def pre_check(self, profiling_data) -> bool: return True def is_dynamic_shape(self, profiling_database: ProfilingDataset) -> bool: cann800_major_version 
= 8 less_than_cann800_list = EnumParamsParser().get_options( - constant.CANN_VERSION, + Constant.CANN_VERSION, filter_func=lambda x: convert_to_float(x.split(".")[0]) < cann800_major_version ) # CANN 8.0.RC1 之前从 ge_info 中获取 op_state 属性,进行动态 shape 逻辑判断 @@ -196,17 +180,12 @@ class OperatorChecker(VersionControl): release_suggestion_list = [] for suggestion in optimization_item.suggestion: release_suggestion = copy.deepcopy(suggestion) - if release_suggestion == OperatorChecker.PyTorch_OPERATOR_TUNE_SUGGESTION: - release_suggestion += \ - (f"for details please refer to link : LINK") - elif release_suggestion == OperatorChecker.MSLite_OPERATOR_TUNE_SUGGESTION: - release_suggestion += \ - (f"\nThe config file for MSLite AOE usage is as follows:\n" \ - f"[ascend_context]\n" \ - f"aoe_mode=\"operator tuning\"\n" \ - f"--tune_ops_file={Config().tune_ops_file}\n" - f"\nFor details please refer to link : LINK") + if release_suggestion == self.pytorch_op_tune_suggestion: + release_suggestion += (self.pytorch_release_suggestion.format(Config().pytorch_aoe_operator_tune_url)) + elif release_suggestion == self.mslite_op_tune_suggestion: + release_suggestion += (self.mslite_release_suggestion.format( + Config().tune_ops_file, Config().mslite_infer_aoe_operator_tune_url)) + release_suggestion_list.append(release_suggestion.replace('\n', '
<br>')) format_result = { "record": record.__dict__, @@ -218,7 +197,7 @@ class OperatorChecker(VersionControl): return format_result def group_by(self, op_list, op_key="op_type", - limit: int = constant.OPERATOR_LIST_UNLIMIT): + limit: int = Constant.OPERATOR_LIST_UNLIMIT): """ group by Profiling.OpInfo's attribute key, then return top limit tuple by duration :param op_list: input a OpInfo list @@ -236,7 +215,7 @@ class OperatorChecker(VersionControl): if summary.get("total_duration"): summary["total_duration"] = float( summary["total_duration"]) + float( - op_info.get_attr("task_duration", constant.DEFAULT_DURATION_ZERO)) + op_info.get_attr("task_duration", Constant.DEFAULT_DURATION_ZERO)) if summary.get("counts"): summary["counts"] += 1 stack_info = op_info.get_attr("stack_info") @@ -248,9 +227,9 @@ class OperatorChecker(VersionControl): else: statistic[op_info.get_attr(op_key)] = {"summary": {}, "op_info_list": []} statistic[op_info.get_attr(op_key)]["summary"]["op_type"] = op_info.get_attr( - "op_type", constant.DEFAULT_OPERATOR_TYPE) + "op_type", Constant.DEFAULT_OPERATOR_TYPE) statistic[op_info.get_attr(op_key)]["summary"]["total_duration"] = float( - op_info.get_attr("task_duration", constant.DEFAULT_DURATION_ZERO)) + op_info.get_attr("task_duration", Constant.DEFAULT_DURATION_ZERO)) statistic[op_info.get_attr(op_key)]["summary"]["counts"] = 1 stack_info = op_info.get_attr("stack_info") if stack_info: @@ -321,10 +300,10 @@ class OperatorChecker(VersionControl): return details def format_suggestion_content(self, profiling_data: ProfilingDataset) -> None: - if profiling_data.PROF_TYPE == EnumParamsParser().profiling_type.ascend_pytorch_profiler: - self._SUGGESTION.append(self.PyTorch_OPERATOR_TUNE_SUGGESTION) - elif profiling_data.PROF_TYPE == EnumParamsParser.profiling_type.mslite: - self._SUGGESTION.append(self.MSLite_OPERATOR_TUNE_SUGGESTION) + if profiling_data.prof_type == EnumParamsParser().profiling_type.ascend_pytorch_profiler: + 
self._suggestion.append(self.pytorch_op_tune_suggestion) + elif profiling_data.prof_type == EnumParamsParser().profiling_type.mslite: + self._suggestion.append(self.mslite_op_tune_suggestion) def _check_data(self, profiling_data): return True @@ -339,4 +318,21 @@ class OperatorChecker(VersionControl): if not hasattr(data, "op_summary"): logger.warning(self.SKIP_CHECK_MSG, self._CHECKER, "op summary") return False - return True \ No newline at end of file + return True + + def _get_description(self, description, op_type_list=None): + if not op_type_list: + return description + + desc_suffix = [] + for i, _ in enumerate(op_type_list): + if i % 3 == 0 and i != 0: + desc_suffix.append("\n") + + desc_suffix.append(f"{op_type_list[i]}") + + if i < len(op_type_list) - 1: + desc_suffix.append(", ") + + description += "".join(desc_suffix) + return description \ No newline at end of file diff --git a/profiler/advisor/analyzer/computation/pp_stage_computation_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/computation/pp_stage_computation_analyzer.py similarity index 86% rename from profiler/advisor/analyzer/computation/pp_stage_computation_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/computation/pp_stage_computation_analyzer.py index 04971cab6782d14ace127a1002842fcb09c1e0ea..2780204b2064ed628ee686d91e82169818955eb7 100644 --- a/profiler/advisor/analyzer/computation/pp_stage_computation_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/pp_stage_computation_analyzer.py @@ -13,18 +13,16 @@ # See the License for the specific language governing permissions and # limitations under the License. 
import logging -import os from multiprocessing import Manager -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.common.analyzer_scopes import SupportedScopes -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.interface.interface import Interface -from profiler.advisor.utils.utils import ParallelJob, get_analyze_processes -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.common import constant as const +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.utils.utils import ParallelJob, get_analyze_processes +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord logger = logging.getLogger() @@ -69,7 +67,7 @@ class PPStageComputationAnalyzer(BaseAnalyzer): stages_rendered_html=list(self._stages_rendered_html), priority_background_color=PriorityBackgroundColor.high) - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): pass def _optimize(self, profiling_path, **kwargs): diff --git a/profiler/advisor/analyzer/computation/profiling_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/computation/profiling_analyzer.py similarity index 75% rename from profiler/advisor/analyzer/computation/profiling_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/computation/profiling_analyzer.py index 
6d525f303cc8c5971bda8a11d16d638ef3dcf2c3..665c1570457c4a1a618eeac82e09fc1205b78500 100644 --- a/profiler/advisor/analyzer/computation/profiling_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/computation/profiling_analyzer.py @@ -15,16 +15,16 @@ import logging from abc import ABC -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.analyzer.computation.aicpu.aicpu_checker import AicpuChecker -from profiler.advisor.analyzer.computation.bound.block_dim_checker import BlockDimChecker -from profiler.advisor.analyzer.computation.bound.operator_bound_checker import OperatorBoundChecker -from profiler.advisor.analyzer.computation.op_compile.dynamic_shape_checker import DynamicShapeChecker -from profiler.advisor.analyzer.computation.operator_checker import OperatorChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.computation.aicpu.aicpu_checker import AicpuChecker +from msprof_analyze.advisor.analyzer.computation.bound.block_dim_checker import BlockDimChecker +from msprof_analyze.advisor.analyzer.computation.bound.operator_bound_checker import OperatorBoundChecker +from msprof_analyze.advisor.analyzer.computation.op_compile.dynamic_shape_checker import DynamicShapeChecker +from msprof_analyze.advisor.analyzer.computation.operator_checker import OperatorChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import 
ProfilingDataset logger = logging.getLogger() @@ -78,7 +78,7 @@ class ProfilingAnalyzer(BaseAnalyzer, ABC): return self.result - def get_priority(self,max_mem_op_dur): + def get_priority(self, max_mem_op_dur): if "aicpu" not in max_mem_op_dur.__class__.__name__.lower(): return PriorityBackgroundColor.low @@ -92,6 +92,13 @@ class DynamicShapeAnalyzer(ProfilingAnalyzer): super().__init__(collection_path, **kwargs) self.checker = DynamicShapeChecker(self.cann_version) + @BaseAnalyzer.check_data((ProfilingDataset.get_key(),)) + def optimize(self, **kwargs) -> OptimizeResult: + if "mindspore" in self.profiling_type: + logger.info("The analyzer %s does not support MindSpore.", self.__class__.__name__) + return self.result + return super().optimize.__wrapped__(self, **kwargs) + class BlockDimAnalyzer(ProfilingAnalyzer): def __init__(self, collection_path, **kwargs) -> None: diff --git a/profiler/advisor/analyzer/schedule/dispatch/__init__.py b/profiler/msprof_analyze/advisor/analyzer/dataloader/__init__.py similarity index 100% rename from profiler/advisor/analyzer/schedule/dispatch/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/dataloader/__init__.py diff --git a/profiler/advisor/analyzer/dataloader/dataloader_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/dataloader/dataloader_analyzer.py similarity index 74% rename from profiler/advisor/analyzer/dataloader/dataloader_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/dataloader/dataloader_analyzer.py index 5c97773ef26247214814f490600e4a986890ec26..54471bab187f023f7456fad548d740f6a603b683 100644 --- a/profiler/advisor/analyzer/dataloader/dataloader_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/dataloader/dataloader_analyzer.py @@ -16,12 +16,12 @@ import logging from typing import List, Dict, Any -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from 
profiler.advisor.analyzer.dataloader.dataloader_checker import DataloaderChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.dataloader.dataloader_checker import DataloaderChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset logger = logging.getLogger() @@ -44,5 +44,5 @@ class DataloaderAnalyzer(BaseAnalyzer): dataloader_checker.make_render(self.html_render, priority=self.get_priority(), rank=kwargs.get("rank")) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): return PriorityBackgroundColor.high diff --git a/profiler/advisor/analyzer/dataloader/dataloader_checker.py b/profiler/msprof_analyze/advisor/analyzer/dataloader/dataloader_checker.py similarity index 89% rename from profiler/advisor/analyzer/dataloader/dataloader_checker.py rename to profiler/msprof_analyze/advisor/analyzer/dataloader/dataloader_checker.py index d4ba7713c7070e60a44dd47934350440eeb1f2f2..45efecc728680561264cdf4e0c9502aadc3b4140 100644 --- a/profiler/advisor/analyzer/dataloader/dataloader_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/dataloader/dataloader_checker.py @@ -17,10 +17,11 @@ import re import logging import yaml -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from 
profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager logger = logging.getLogger() @@ -80,9 +81,11 @@ class DataloaderChecker: rank=kwargs.get("rank")) def _init_rule(self): + language = AdditionalArgsManager().language dataloader_rule_path = os.path.join( os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), "rules", + language, "dataloader.yaml" ) dataloader_rule = FileManager.read_yaml_file(dataloader_rule_path) diff --git a/profiler/advisor/analyzer/schedule/free_event/__init__.py b/profiler/msprof_analyze/advisor/analyzer/graph_fusion/__init__.py similarity index 100% rename from profiler/advisor/analyzer/schedule/free_event/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/graph_fusion/__init__.py diff --git a/profiler/advisor/analyzer/graph_fusion/graph_fusion_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/graph_fusion/graph_fusion_analyzer.py similarity index 80% rename from profiler/advisor/analyzer/graph_fusion/graph_fusion_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/graph_fusion/graph_fusion_analyzer.py index b72e1316a452303020e08f16a80a28c11717115f..c3323aa020e9732c2aea8a1d81cbae6e6f805436 100644 --- a/profiler/advisor/analyzer/graph_fusion/graph_fusion_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/graph_fusion/graph_fusion_analyzer.py @@ -15,12 +15,12 @@ from typing import List from functools import partial -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.dataset.graph_dataset import 
GraphDataset -from profiler.advisor.analyzer.graph_fusion.graph_fusion_checker import GraphFusionRules -from profiler.advisor.dataset.profiling.profiling_dataset import ProfilingDataset -from profiler.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.dataset.graph_dataset import GraphDataset +from msprof_analyze.advisor.analyzer.graph_fusion.graph_fusion_checker import GraphFusionRules +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.display.html.render import HTMLRender class FusionOPAnalyzer(BaseAnalyzer): @@ -45,7 +45,7 @@ class FusionOPAnalyzer(BaseAnalyzer): kwargs.get("add_render_list")) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): pass def _check(self, graph_data: List[GraphDataset], profiling_data: List[ProfilingDataset] = None, diff --git a/profiler/advisor/analyzer/graph_fusion/graph_fusion_checker.py b/profiler/msprof_analyze/advisor/analyzer/graph_fusion/graph_fusion_checker.py similarity index 92% rename from profiler/advisor/analyzer/graph_fusion/graph_fusion_checker.py rename to profiler/msprof_analyze/advisor/analyzer/graph_fusion/graph_fusion_checker.py index 2cfde931a6116db41f1ed3bec2f17f64cd88ddeb..727c5a3770782332a611321e7e566f14fbe8a3ee 100644 --- a/profiler/advisor/analyzer/graph_fusion/graph_fusion_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/graph_fusion/graph_fusion_checker.py @@ -17,12 +17,13 @@ from typing import List from tqdm import tqdm -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord, StatisticsItem -from profiler.advisor.common.graph.graph import Graph -from profiler.advisor.common.graph.graph_parser import QueryGraphParser -from profiler.advisor.dataset.graph_dataset 
import GraphDataset -from profiler.advisor.common.graph.graph_match import find_isomorphisms +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord, StatisticsItem +from msprof_analyze.advisor.common.graph.graph import Graph +from msprof_analyze.advisor.common.graph.graph_parser import QueryGraphParser +from msprof_analyze.advisor.dataset.graph_dataset import GraphDataset +from msprof_analyze.advisor.common.graph.graph_match import find_isomorphisms logger = logging.getLogger() @@ -180,10 +181,11 @@ class GraphFusionRules: if not self.candidates: return + prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) optimization_item = OptimizeItem( - "fusion issue", - f"Found {len(self.candidates)} fusion issues", - ["Check fusion issues detail in mstt_advisor*.html"] + prompt_class.PROBLEM, + prompt_class.DESCRIPTION.format(len(self.candidates)), + [prompt_class.SUGGESTION] ) total_time = 0.0 for candidate in self.task_duration_list: diff --git a/profiler/advisor/analyzer/schedule/fusion_ops/__init__.py b/profiler/msprof_analyze/advisor/analyzer/memory/__init__.py similarity index 100% rename from profiler/advisor/analyzer/schedule/fusion_ops/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/memory/__init__.py diff --git a/profiler/advisor/analyzer/memory/memory_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/memory/memory_analyzer.py similarity index 72% rename from profiler/advisor/analyzer/memory/memory_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/memory/memory_analyzer.py index 939e2de90c634ee6cca584dca345111dce26bb7b..9fe274174c2463f7efe21d1904aba83560b5a524 100644 --- a/profiler/advisor/analyzer/memory/memory_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/memory/memory_analyzer.py @@ -14,12 +14,12 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.analyzer.memory.memory_checker import MemoryOpsChecker -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.memory.memory_checker import MemoryOpsChecker +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor logger = logging.getLogger() @@ -39,7 +39,8 @@ class MemoryAnalyzer(BaseAnalyzer): memory_checker = MemoryOpsChecker() memory_checker.check_memory_ops(self.dataset) memory_checker.make_record(self.result) - memory_checker.make_render(self.html_render, priority=self.get_priority(memory_checker.max_mem_op_dur), rank=kwargs.get("rank")) + memory_checker.make_render( + self.html_render, priority=self.get_priority(memory_checker.max_mem_op_dur), rank=kwargs.get("rank")) return self.result def get_priority(self, max_mem_op_dur): diff --git a/profiler/advisor/analyzer/memory/memory_checker.py b/profiler/msprof_analyze/advisor/analyzer/memory/memory_checker.py similarity index 91% rename from profiler/advisor/analyzer/memory/memory_checker.py rename to profiler/msprof_analyze/advisor/analyzer/memory/memory_checker.py index b446067ef7e6cd6ddfbd3c61f59bb49632c014c1..82bca84cd233aadf0e0744d2dab51c341a19e3cf 100644 --- a/profiler/advisor/analyzer/memory/memory_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/memory/memory_checker.py @@ -12,15 +12,13 
@@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import os import re import logging -import yaml -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset, MemCollector -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset, MemCollector +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord + logger = logging.getLogger() diff --git a/profiler/advisor/analyzer/schedule/gc/__init__.py b/profiler/msprof_analyze/advisor/analyzer/overall/__init__.py similarity index 100% rename from profiler/advisor/analyzer/schedule/gc/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/overall/__init__.py diff --git a/profiler/advisor/analyzer/overall/environment_variable_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/overall/environment_variable_analyzer.py similarity index 64% rename from profiler/advisor/analyzer/overall/environment_variable_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/overall/environment_variable_analyzer.py index c4468c36d0eded6b36ae265e239d95e1fdf2dbbb..c32b8e5ec2559f594d8054d8bb8fd8bae1cf3909 100644 --- a/profiler/advisor/analyzer/overall/environment_variable_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/overall/environment_variable_analyzer.py @@ -14,34 +14,40 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.prof_common.path_manager import PathManager -from profiler.advisor.dataset.environment_variable_dataset import EnvironmentVariableDataset -from profiler.advisor.analyzer.overall.environment_variable_checker import EnvironmentVariabelChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.advisor.dataset.environment_variable_dataset import EnvironmentVariableDataset +from msprof_analyze.advisor.analyzer.overall.environment_variable_checker import EnvironmentVariableChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +logger = logging.getLogger() -class EnvironmentVariabelAnalyzer(BaseAnalyzer): + +class EnvironmentVariableAnalyzer(BaseAnalyzer): dataset_cls_list = [EnvironmentVariableDataset] def __init__(self, collection_path: str, n_processes: int = 1, **kwargs): super().__init__(collection_path, n_processes, **kwargs) self.dataset = self.get_first_data_by_key(self.dataset_list, EnvironmentVariableDataset.get_key()) + @BaseAnalyzer.check_data((EnvironmentVariableDataset.get_key(),)) def optimize(self, **kwargs): + if "mindspore" in self.profiling_type: + logger.info("The analyzer %s does not support MindSpore.", self.__class__.__name__) + return self.result try: PathManager.check_input_directory_path(self.collection_path) except RuntimeError as e: logging.error("Invalid path: %s", str(e)) return self.result self.collection_path = PathManager.get_realpath(self.collection_path) - checker = EnvironmentVariabelChecker() + checker = EnvironmentVariableChecker() checker.format_env_suggest(self.dataset) checker.make_record(self.result) checker.make_render(self.html_render) return self.result - def get_priority(self): + def 
get_priority(self, max_mem_op_dur=None): return PriorityBackgroundColor.high def make_record(self): diff --git a/profiler/advisor/analyzer/overall/environment_variable_checker.py b/profiler/msprof_analyze/advisor/analyzer/overall/environment_variable_checker.py similarity index 69% rename from profiler/advisor/analyzer/overall/environment_variable_checker.py rename to profiler/msprof_analyze/advisor/analyzer/overall/environment_variable_checker.py index 25058a790cc10c03b658309590666e90a29b450e..05093b38969719b924a284cf04cb2250fdd7820b 100644 --- a/profiler/advisor/analyzer/overall/environment_variable_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/overall/environment_variable_checker.py @@ -14,25 +14,28 @@ # limitations under the License. import os -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem -from profiler.advisor.result.item import OptimizeRecord -from profiler.advisor.common.analyzer_scopes import SupportedScopes -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.utils.utils import convert_to_int +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem +from msprof_analyze.advisor.result.item import OptimizeRecord +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.utils.utils import convert_to_int_with_exception -class EnvironmentVariabelChecker: +class EnvironmentVariableChecker:
ENV_SUGGEST_CONDITION = { - "ASCEND_GLOBAL_LOG_LEVEL": lambda x: x != "" and convert_to_int(x) != 3, + "ASCEND_GLOBAL_LOG_LEVEL": lambda x: x != "" and convert_to_int_with_exception(x) != 3, "HCCL_RDMA_TC": lambda x: x != "", "HCCL_RDMA_SL": lambda x: x != "", - "ACLNN_CACHE_LIMIT": lambda x: x == "" or convert_to_int(x) < 10000, - "HOST_CACHE_CAPACITY": lambda x: x == "" or convert_to_int(x) == 0, - "ASCEND_ENHANCE_ENABLE": lambda x: convert_to_int(x) == 0, + "ACLNN_CACHE_LIMIT": lambda x: x == "" or convert_to_int_with_exception(x) < 10000, + "HOST_CACHE_CAPACITY": lambda x: x == "" or convert_to_int_with_exception(x) == 0, + "ASCEND_ENHANCE_ENABLE": lambda x: convert_to_int_with_exception(x) == 0, "PYTORCH_NPU_ALLOC_CONF": lambda x: isinstance(x, str) and "expandable_segments:True" not in x, - "ASCEND_LAUNCH_BLOCKING": lambda x: convert_to_int(x) != 1, + "ASCEND_LAUNCH_BLOCKING": lambda x: convert_to_int_with_exception(x) != 1, + "HCCL_ALGO": lambda x: x != "", } HEADERS = ["Environment", "Value", "Description", "Suggestion"] @@ -44,9 +48,11 @@ class EnvironmentVariabelChecker: @staticmethod def read_environment_info(): + language = AdditionalArgsManager().language environment_variable_info_path = os.path.join( os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), "rules", + language, "environment_variable_info.yaml" ) return FileManager.read_yaml_file(environment_variable_info_path) @@ -78,18 +84,17 @@ class EnvironmentVariabelChecker: def make_record(self, result: OptimizeResult): if not self.env_suggest_csv: return - desc = f"Describe and suggest the optimal environment variable settings" - suggestion = "Please set the optimal environment variable" - + + prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) optimization_item = OptimizeItem( - SupportedScopes.ENVIRONMENT_VARIABLE_ANALYSIS, - desc, - [suggestion] + prompt_class.PROBLEM, + prompt_class.DESCRIPTION, + [prompt_class.SUGGESTION] ) 
result.add(OptimizeRecord(optimization_item)) - result.add_detail(SupportedScopes.ENVIRONMENT_VARIABLE_ANALYSIS, headers=self.HEADERS) + result.add_detail(prompt_class.PROBLEM, headers=self.HEADERS) for env_suggest in self.env_suggest_csv: - result.add_detail(SupportedScopes.ENVIRONMENT_VARIABLE_ANALYSIS, detail=env_suggest) + result.add_detail(prompt_class.PROBLEM, detail=env_suggest) def make_render(self, html_render: HTMLRender): if not self.env_suggest_html: diff --git a/profiler/advisor/analyzer/overall/overall_summary_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/overall/overall_summary_analyzer.py similarity index 71% rename from profiler/advisor/analyzer/overall/overall_summary_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/overall/overall_summary_analyzer.py index 8a5982d3ce92f4401fd0537c08d0176b42a02471..1bfaf8d611964af8d3a23544d630eeddd116206b 100644 --- a/profiler/advisor/analyzer/overall/overall_summary_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/overall/overall_summary_analyzer.py @@ -15,48 +15,17 @@ import logging import os -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.result.result import OptimizeResult -from profiler.compare_tools.compare_backend.utils.constant import Constant -from profiler.compare_tools.compare_interface.comparison_interface import ComparisonInterface +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.compare_tools.compare_interface.comparison_interface import ComparisonInterface +from 
msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.constant import Constant class OverallSummaryAnalyzer(BaseAnalyzer): - OVERALL_SUMMARY_ANALYZER = "overall summary" - advice_map = { - "Computing Time": "if you want more detailed advice please go to mstt_advisor_*.html", - "Uncovered Communication Time": "if you want more detailed advice please go to mstt_advisor_*.html", - "Free Time": "if you want more detailed advice please go to mstt_advisor_*.html" - } - time_name_map = { - "Computing Time": "computing", - "Uncovered Communication Time": "communication", - "Free Time": "free", - 'Cube Time(Num)': 'Cube Time', - 'Vector Time(Num)': 'Vector Time', - 'Flash Attention Time(Forward)(Num)': 'Flash Attention Time(Forward)', - 'Flash Attention Time(Backward)(Num)': 'Flash Attention Time(Backward)', - 'Other Time': "Other Computing Time", - 'SDMA Time(Num)': 'SDMA Time' - } - performance_time_dict = { - "Computing Time": "computing_time_ms", - " -- Flash Attention": "fa_time_ms", - " -- Conv": "conv_time_ms", - " -- Matmul": "matmul_time_ms", - " -- Vector": "vector_time_ms", - " -- SDMA(Tensor Move)": "tensor_move_time_ms", - " -- Other Cube": "other_cube_time_ms", - "Uncovered Communication Time": "uncovered_communication_time_ms", - " -- Wait": "wait_time_ms", - " -- Transmit": "transmit_time_ms", - "Free Time": "free_time_ms", - " -- SDMA": "sdma_time_ms", - " -- Free": "free_ms", - "E2E Time": "e2e_time_ms" - } def __init__(self, collection_path: str, n_processes: int = 1, **kwargs): profile_path = get_profile_path(collection_path) @@ -74,6 +43,12 @@ class OverallSummaryAnalyzer(BaseAnalyzer): self.bottleneck_str = "" self.over_summary_analysis = {} + self.prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) + self.over_summary_analyzer = self.prompt_class.OVERALL_SUMMARY_ANALYZER + self.advice_map = self.prompt_class.ADVICE_MAP + self.time_name_map = 
self.prompt_class.TIME_NAME_MAP + self.performance_time_dict = self.prompt_class.PERFORMANCE_TIME_DICT + @staticmethod def calculate_ratio(dividend, divisor): if not divisor: @@ -82,11 +57,19 @@ class OverallSummaryAnalyzer(BaseAnalyzer): @staticmethod def get_time_category_dict(overall_dict: dict): - time_category_dict = { - "Computing Time": round(overall_dict.get('computing_time_ms', 0.0), 3), - "Uncovered Communication Time": round(overall_dict.get('uncovered_communication_time_ms', 0.0), 3), - "Free Time": round(overall_dict.get('free_time_ms', 0.0), 3) - } + language = AdditionalArgsManager().language + if language == "en": + time_category_dict = { + "Computing Time": round(overall_dict.get('computing_time_ms', 0.0), 3), + "Uncovered Communication Time": round(overall_dict.get('uncovered_communication_time_ms', 0.0), 3), + "Free Time": round(overall_dict.get('free_time_ms', 0.0), 3) + } + else: + time_category_dict = { + "计算时长": round(overall_dict.get('computing_time_ms', 0.0), 3), + "未被掩盖的通信时长": round(overall_dict.get('uncovered_communication_time_ms', 0.0), 3), + "空闲时长": round(overall_dict.get('free_time_ms', 0.0), 3) + } return time_category_dict def path_check(self): @@ -111,15 +94,26 @@ class OverallSummaryAnalyzer(BaseAnalyzer): overall_data = self.cur_data.get("overall_data") if not overall_data: return - e2e_time = '%.3f' % sum([data for data in overall_data.values()]) - overall_bottleneck = f"The Model E2E Time is {e2e_time}ms.\n" + e2e_time = round(sum([data for data in overall_data.values()]), 3) + + language = AdditionalArgsManager().language + if language == "en": + overall_bottleneck = f"The Model E2E Time is {e2e_time}ms.\n" + else: + overall_bottleneck = f"模型E2E的时间是{e2e_time}ms。\n" comparison_bottleneck = "" for time_type, time_value in overall_data.items(): # add overall bottleneck - overall_bottleneck += f" -- {time_type} is {time_value}ms\n" + if language == "en": + overall_bottleneck += f" -- {time_type} is {time_value}ms\n" + else: + 
overall_bottleneck += f"  -- {time_type}是{time_value}ms\n"
             if time_type == "Free Time" and self._is_minimal_profiling and self.calculate_ratio(time_value, e2e_time) > 0.1:
-                overall_bottleneck += "percentage of free time exceed the threshold 10%."
+                if language == "en":
+                    overall_bottleneck += "percentage of free time exceeds the threshold of 10%."
+                else:
+                    overall_bottleneck += "空闲时间的占比超过了10%的阈值。"
             if not self._has_benchmark_profiling:
                 continue
             # add comparison bottleneck
@@ -128,7 +122,10 @@
             ).get(time_type)
             if time_value > base_duration:
                 ratio = "{:.2%}".format(self.calculate_ratio(time_value - base_duration, base_duration))
-                comparison_bottleneck += f"{time_type} exceeds the benchmark by {ratio}\n"
+                if language == "en":
+                    comparison_bottleneck += f"{time_type} exceeds the benchmark by {ratio}\n"
+                else:
+                    comparison_bottleneck += f"{time_type}超过了基线{ratio}。\n"
         self.cur_bottleneck["overall_data"] = overall_bottleneck
         if comparison_bottleneck:
             self.cur_bottleneck["comparison_result"] = comparison_bottleneck
@@ -203,18 +200,18 @@ class OverallSummaryAnalyzer(BaseAnalyzer):
         if not self.bottleneck_str and not self.cur_advices:
             return
         optimization_item = OptimizeItem(
-            OverallSummaryAnalyzer.OVERALL_SUMMARY_ANALYZER,
+            self.over_summary_analyzer,
             self.bottleneck_str,
             self.cur_advices
         )
         self.result.add(OptimizeRecord(optimization_item))
 
         self.result.add_detail(
-            OverallSummaryAnalyzer.OVERALL_SUMMARY_ANALYZER,
+            self.over_summary_analyzer,
             headers=self.over_summary_analysis["headers"]
         )
         for data in self.over_summary_analysis["data"]:
-            self.result.add_detail(OverallSummaryAnalyzer.OVERALL_SUMMARY_ANALYZER, detail=data)
+            self.result.add_detail(self.over_summary_analyzer, detail=data)
 
     def make_render(self):
         if not self.bottleneck_str and not self.cur_advices:
             return
@@ -227,14 +224,15 @@ class OverallSummaryAnalyzer(BaseAnalyzer):
             "details": [self.over_summary_analysis]
         }
         self.html_render.render_template(key="overall",
-                                         
title=OverallSummaryAnalyzer.OVERALL_SUMMARY_ANALYZER, + title="Overall Summary", template_dir="templates", template_name="cluster_analysis.html", cann_version=self.cann_version, - torch_version=self.torch_version, + profiling_type=self.profiling_type, + profiling_version=self.profiling_version, result=result_for_html) - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): pass diff --git a/profiler/advisor/analyzer/schedule/syncbn/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/__init__.py similarity index 100% rename from profiler/advisor/analyzer/schedule/syncbn/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/__init__.py diff --git a/profiler/advisor/analyzer/schedule/synchronize_stream/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/__init__.py similarity index 100% rename from profiler/advisor/analyzer/schedule/synchronize_stream/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/__init__.py diff --git a/profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/conjectured_gc_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/conjectured_gc_analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..93f35930879f0689a85a1a607db65c66dbe14ade --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/conjectured_gc_analyzer.py @@ -0,0 +1,43 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.schedule.conjectured_gc.conjectured_gc_checker import ConjecturedGcChecker +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor + + +class ConjecturedGcAnalyzer(BaseAnalyzer): + dataset_cls_list = [ScheduleAnalysisDataset] + + def __init__(self, collection_path, **kwargs): + super().__init__(collection_path, **kwargs) + self.result = OptimizeResult() + self.html_render = HTMLRender() + key = ScheduleAnalysisDataset.get_key() + self.timeline_event_dataset = self.get_first_data_by_key(self.dataset_list, key) + + @BaseAnalyzer.check_data((ScheduleAnalysisDataset.get_key(),)) + def optimize(self, **kwargs): + gc_checker = ConjecturedGcChecker() + gc_checker.check_gc(self.timeline_event_dataset, rank=kwargs.get("rank"), stage=kwargs.get("stage")) + gc_checker.make_record(self.result) + gc_checker.make_render(self.html_render, priority=self.get_priority(), rank=kwargs.get("rank")) + return self.result + + def get_priority(self, max_mem_op_dur=0): + return PriorityBackgroundColor.medium diff --git a/profiler/advisor/analyzer/schedule/gc/gc_checker.py b/profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/conjectured_gc_checker.py similarity index 51% rename from profiler/advisor/analyzer/schedule/gc/gc_checker.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/conjectured_gc_checker.py index be1a60536774e849e425d3a9f0b724001274132f..05d2e79969752d411625bd0c805efdca0bcd9195 100644 --- a/profiler/advisor/analyzer/schedule/gc/gc_checker.py +++ 
b/profiler/msprof_analyze/advisor/analyzer/schedule/conjectured_gc/conjectured_gc_checker.py @@ -12,104 +12,126 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import logging -import math import os -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.utils.utils import convert_to_float, convert_to_int, safe_division -from profiler.advisor.common import constant as const -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.utils.utils import convert_to_float, convert_to_int, safe_division +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.file_manager import FileManager -logger = logging.getLogger() -class GcChecker: +class AbnormalGcStatistic: + def __init__(self): + self._count = 0 + self._duration = 0 + self._events = [] + + @property + def count(self): + return self._count + + @count.setter + def count(self, value): + self._count = value + + @property + def duration(self): + return self._duration + + @duration.setter + def duration(self, value): + self._duration = value + + @property + def events(self): + return self._events + + def export(self): + res = [] + for free_event in self.events: + res.append([round(convert_to_float(free_event.get("ts", 0)), 2), + round(convert_to_float(free_event.get("free time", 0)), 4)]) + return res + + +class 
ConjecturedGcChecker:
+    ACL_EVENT_DUR = "acl_event_dur"
+    ACL_EVENT_COUNT = "acl_event_count"
+    HEADERS = ["timestamp", "duration(us)"]
 
     def __init__(self):
         self.stage = None
         self.rank = None
         self.optimization_item = []
-        self.gc_issues = False
-        self.gc_problem_with_count = ""
         self.gc_problem_with_free = ""
         self.desc = ""
         self.suggestions = []
         self.solutions = None
         self.gc_threshold = 0
         self.gc_topk_num = 0
-        self.abnormal_gc_count = 0
-        self.abnormal_gc_duration = 0
-        self.abnormal_gc_list = []
-        self.headers = ["timestamp", "duration(us)"]
+        self.gc_statistic = AbnormalGcStatistic()
         self._init_rule()
 
     def check_gc(self, event_dataset: ScheduleAnalysisDataset, rank=None, stage=None):
         """
-        :Param event_dataset: dataset of timeline event
+        :param event_dataset: dataset of timeline events
+        :param rank: rank id
+        :param stage: stage of the model assigned to a specific computational device
         """
+        if event_dataset.gc_events:
+            return
+
         self.rank = rank
         self.stage = stage
 
         # 当用户cann和pta版本不支持采集gc信息时,通过timeline中的free和cann层acl事件 综合判断是否可能存在free
-        if not event_dataset.gc_events:
-            acl_events = getattr(event_dataset, "acl_events", [])
-            large_free_events = getattr(event_dataset, "large_free_events", [])
-            # 如果acl_events为空,则没有采集cann信息,不基于free+acl events进行gc分析
-            if acl_events and large_free_events:
-                free_event = self.get_free_events_include_gc(large_free_events, acl_events)
-                if not free_event:
-                    return
-                self.desc = self.gc_problem_with_free.format(free_duration_time=free_event.dur)
-
-            return
-
-        for gc_event in event_dataset.gc_events:
-            if convert_to_float(gc_event.dur) >= self.gc_threshold:
-                self.gc_issues = True
-                self.abnormal_gc_count += 1
-                self.abnormal_gc_duration += convert_to_float(gc_event.dur)
-                self.abnormal_gc_list.append([gc_event.ts, gc_event.dur])
-        self.abnormal_gc_duration = round(self.abnormal_gc_duration / 1000, 4)
-        self.abnormal_gc_list.sort(key=lambda x: x[1], reverse=True)
-        self.desc = self.gc_problem_with_count.format(gc_count=self.abnormal_gc_count,
-                                                      
gc_total_time=self.abnormal_gc_duration) + acl_events = getattr(event_dataset, "acl_events", []) + large_free_events = getattr(event_dataset, "large_free_events", []) + # 如果acl_events为空,则没有采集cann信息,不基于free+acl events进行gc分析 + if acl_events and large_free_events: + self.get_free_events_include_gc(large_free_events, acl_events) + if not self.gc_statistic.count: + return + self.desc = self.gc_problem_with_free.format(free_duration_time=self.gc_statistic.duration) def make_record(self, result: OptimizeResult): """ make record for what and how to optimize """ - if not self.gc_issues: + if not self.gc_statistic.count: return - self.optimization_item.append(OptimizeItem("GC", self.desc, self.suggestions)) - for optimization in self.optimization_item: - result.add(OptimizeRecord(optimization)) + self.optimization_item.append(OptimizeItem("Conjectured Gc", self.desc, self.suggestions)) + result.add(OptimizeRecord(self.optimization_item[-1])) + headers = self.HEADERS if self.rank is not None: - self.headers = ["Rank id"] + self.headers - sub_table_name = "GcAnalysis" if not self.stage else f"Stage-{self.stage}: GcAnalysis" - result.add_detail(sub_table_name, headers=self.headers) + headers = ["Rank id"] + headers + sub_table_name = "ConjecturedGcAnalysis" if not self.stage else f"Stage-{self.stage}: ConjecturedGcAnalysis" + result.add_detail(sub_table_name, headers=headers) - for row in self.abnormal_gc_list: + for row in self.gc_statistic.export(): if self.rank is not None: row = [self.rank] + row result.add_detail(sub_table_name, detail=row) def make_render(self, html_render, **kwargs): - if not self.gc_issues: + if not self.gc_statistic.count: return priority = kwargs.get("priority") rank = kwargs.get("rank") - show_num = min(self.gc_topk_num, self.abnormal_gc_count) + show_num = min(self.gc_topk_num, self.gc_statistic.count) html_render.render_template(key="schedule", template_dir="templates", template_name="gc.html", + title="Conjectured GC Analysis", desc=self.desc, 
solutions=self.solutions, - headers=self.headers, - datas=self.abnormal_gc_list[:show_num], + headers=self.HEADERS, + datas=self.gc_statistic.export()[:show_num], num=show_num, priority_background_color=priority, rank=rank) @@ -118,13 +140,13 @@ class GcChecker: free_event_index, acl_event_index = 0, 0 free_include_acl_events = {} - while free_event_index < len(large_free_events) and acl_event_index < len(acl_events): + while free_event_index < len(large_free_events): free_event = large_free_events[free_event_index] - free_event_name = f"{const.FREE}-{free_event_index}" + free_event_name = f"{Constant.FREE}-{free_event_index}" free_event_start_time = convert_to_float(free_event.ts) free_event_end_time = free_event_start_time + convert_to_float(free_event.dur) if free_event_name not in free_include_acl_events: - free_include_acl_events[free_event_name] = {} + free_include_acl_events[free_event_name] = {"ts": free_event.ts} while acl_event_index < len(acl_events): acl_event = acl_events[acl_event_index] @@ -137,13 +159,13 @@ class GcChecker: if acl_event_start_time > free_event_end_time: break - if "acl_event_count" not in free_include_acl_events[free_event_name]: - free_include_acl_events[free_event_name]["acl_event_count"] = 0 - free_include_acl_events[free_event_name]["acl_event_count"] += 1 + if self.ACL_EVENT_COUNT not in free_include_acl_events[free_event_name]: + free_include_acl_events[free_event_name][self.ACL_EVENT_COUNT] = 0 + free_include_acl_events[free_event_name][self.ACL_EVENT_COUNT] += 1 - if "acl_event_dur" not in free_include_acl_events[free_event_name]: - free_include_acl_events[free_event_name]["acl_event_dur"] = 0.0 - free_include_acl_events[free_event_name]["acl_event_dur"] += convert_to_float(acl_event.dur) + if self.ACL_EVENT_DUR not in free_include_acl_events[free_event_name]: + free_include_acl_events[free_event_name][self.ACL_EVENT_DUR] = 0.0 + free_include_acl_events[free_event_name][self.ACL_EVENT_DUR] += convert_to_float(acl_event.dur) 
acl_event_index += 1 @@ -152,29 +174,29 @@ class GcChecker: # 按free持续时间降序排列,优先判断持续时间最长的free event_indexs = range(len(large_free_events)) for index, free_event in sorted(zip(event_indexs, large_free_events), key=lambda x: x[1].dur, reverse=True): - - free_event_name = f"{const.FREE}-{index}" + free_event_name = f"{Constant.FREE}-{index}" free_duration = convert_to_float(free_event.dur) - acl_event_dur = free_include_acl_events.get(free_event_name, {}).get("acl_event_dur", 0.0) - acl_event_count = free_include_acl_events.get(free_event_name, {}).get("acl_event_count", 0) + free_include_acl_events[free_event_name]["free time"] = free_duration + acl_event_dur = free_include_acl_events.get(free_event_name, {}).get(self.ACL_EVENT_DUR, 0.0) + acl_event_count = free_include_acl_events.get(free_event_name, {}).get(self.ACL_EVENT_COUNT, 0) if safe_division(acl_event_dur, free_duration) < self.max_acl_event_time_ratio and safe_division( acl_event_count, free_duration) < self.max_acl_event_num_ratio: - self.gc_issues = True - return free_event - return {} + self.gc_statistic.count += 1 + self.gc_statistic.duration += free_duration + self.gc_statistic.events.append(free_include_acl_events.get(free_event_name, {})) def _init_rule(self): + language = AdditionalArgsManager().language gc_rule_path = os.path.join( os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), "rules", - "gc.yaml" + language, + "conjectured_gc.yaml" ) gc_rule = FileManager.read_yaml_file(gc_rule_path) - self.gc_threshold = convert_to_float(gc_rule.get("gc_threshold", 0)) self.gc_topk_num = convert_to_int(gc_rule.get("top_num", 0)) - self.gc_problem_with_count = gc_rule.get("gc_problem_with_count", "") self.gc_problem_with_free = gc_rule.get("gc_problem_with_free", "") self.max_acl_event_num_ratio = convert_to_float(gc_rule.get("max_acl_event_num_ratio")) self.max_acl_event_time_ratio = convert_to_float(gc_rule.get("max_acl_event_time_ratio")) diff --git 
a/profiler/advisor/common/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/dispatch/__init__.py similarity index 100% rename from profiler/advisor/common/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/dispatch/__init__.py diff --git a/profiler/advisor/analyzer/schedule/dispatch/timeline_op_dispatch_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/schedule/dispatch/timeline_op_dispatch_analyzer.py similarity index 73% rename from profiler/advisor/analyzer/schedule/dispatch/timeline_op_dispatch_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/dispatch/timeline_op_dispatch_analyzer.py index 126fe30176cf6ca0f1d7d3557c360f95af7b20be..c1669d018494c2c9757e656034482c499284021a 100644 --- a/profiler/advisor/analyzer/schedule/dispatch/timeline_op_dispatch_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/dispatch/timeline_op_dispatch_analyzer.py @@ -16,13 +16,15 @@ # limitations under the License. import logging -from profiler.advisor.common import constant as const -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.display.html.render import HTMLRender 
+from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager logger = logging.getLogger() @@ -43,12 +45,16 @@ class OpDispatchAnalyzer(BaseAnalyzer): self._issues_record = [] self.optimization_item = [] + @BaseAnalyzer.check_data((ScheduleAnalysisDataset.get_key(),)) def optimize(self, **kwargs): """ optimize operator :param data: input datasets :return: result """ + if "mindspore" in self.profiling_type: + logger.info("The analyzer %s does not support MindSpore.", self.__class__.__name__) + return self.result self.get_op_compile_info(self.dataset) self.make_record(self.result) self.make_render(self.html_render, rank=kwargs.get('rank')) @@ -60,11 +66,11 @@ class OpDispatchAnalyzer(BaseAnalyzer): """ if hasattr(event_dataset, "ops_compile"): self._op_compile = getattr(event_dataset, "ops_compile") - if not self._op_compile or self._op_compile.total_count < const.MAX_OP_COMPILE_NUM: + if not self._op_compile or self._op_compile.total_count < Constant.MAX_OP_COMPILE_NUM: return self._issues_record.append(['operator dispatch', - const.OP_COMPILE_ID, + Constant.OP_COMPILE_ID, self._op_compile.total_count, self._op_compile.total_time]) else: @@ -76,17 +82,19 @@ class OpDispatchAnalyzer(BaseAnalyzer): """ if not self._op_compile or len(self._issues_record) <= 0: return - desc = f"Found {self._op_compile.total_count} operator compile issues." - suggestion = ("Please place the following code at the entrance of the python script to disable jit compile. 
" \ - "Code: `torch_npu.npu.set_compile_mode(jit_compile=False); " - "torch_npu.npu.config.allow_internal_format = False`") - self.optimization_item.append(OptimizeItem("Operator dispatch", desc, [suggestion])) + + prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) + self.optimization_item.append(OptimizeItem( + prompt_class.PROBLEM, + prompt_class.DESCRIPTION.format(self._op_compile.total_count), + [prompt_class.SUGGESTION])) for optimization in self.optimization_item: result.add(OptimizeRecord(optimization)) + record_title = ["Issues", "op name", "counts", "total time"] - result.add_detail('operator dispatch', headers=record_title) + result.add_detail(prompt_class.PROBLEM, headers=record_title) for op_info in self._issues_record: - result.add_detail('operator dispatch', detail=op_info) + result.add_detail(prompt_class.PROBLEM, detail=op_info) def make_render(self, html_render, **kwargs): issues = [] @@ -109,7 +117,7 @@ class OpDispatchAnalyzer(BaseAnalyzer): priority_background_color=self.get_priority(), rank=kwargs.get("rank")) - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): step_duration = getattr(self.dataset, "step_duration", None) op_compile_total_dur = getattr(self._op_compile, "total_time", None) if step_duration is None or op_compile_total_dur is None: diff --git a/profiler/advisor/common/graph/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/free_event/__init__.py similarity index 100% rename from profiler/advisor/common/graph/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/free_event/__init__.py diff --git a/profiler/advisor/common/profiling/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/__init__.py similarity index 100% rename from profiler/advisor/common/profiling/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/__init__.py diff --git 
a/profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/fusible_operator_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/fusible_operator_analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..b7a9566ca2927b08894b6f236e181d7ef0ee68c6 --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/fusible_operator_analyzer.py @@ -0,0 +1,49 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging + +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.analyzer.schedule.fusible_ops.fusible_operator_checker import FusibleOperatorChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset +from msprof_analyze.advisor.result.result import OptimizeResult + +logger = logging.getLogger() + + +class FusibleOperatorAnalyzer(BaseAnalyzer): + dataset_cls_list = [ProfilingDataset] + + def __init__(self, collection_path, n_processes: int = 1, **kwargs) -> None: + super().__init__(collection_path, n_processes, **kwargs) + profiling_key = ProfilingDataset.get_key() + self.profiling_dataset = self.get_first_data_by_key(self.dataset_list, profiling_key) + self.result = OptimizeResult() + self.html_render = HTMLRender() + self.html = None + + def optimize(self, **kwargs): + add_render_list = kwargs.get("add_render_list", True) + fusible_operator_checker = FusibleOperatorChecker(**kwargs) + fusible_operator_checker.check_fusible_operator(self.profiling_dataset) + if not fusible_operator_checker.fusion_issues: + return self.result + fusible_operator_checker.make_record(self.result) + return self.result + + def get_priority(self, max_mem_op_dur=None): + # 提升1% ~ 3% + return PriorityBackgroundColor.low diff --git a/profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/fusible_operator_checker.py b/profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/fusible_operator_checker.py new file mode 100644 index 0000000000000000000000000000000000000000..9070a8036047f7976ca7e9a7ab81bd5bf9632af6 --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/fusible_ops/fusible_operator_checker.py @@ -0,0 +1,285 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import logging
+import os
+from typing import List
+from collections import OrderedDict
+from msprof_analyze.prof_common.constant import Constant
+from msprof_analyze.advisor.dataset.profiling.profiling_dataset import ProfilingDataset
+from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt
+from msprof_analyze.advisor.dataset.profiling.info_collection import OpInfo
+from msprof_analyze.advisor.result.result import OptimizeResult
+from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord
+from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager
+from msprof_analyze.prof_common.file_manager import FileManager
+from msprof_analyze.advisor.utils.utils import convert_to_float_with_warning
+from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor
+
+logger = logging.getLogger()
+
+
+class FusibleOperatorChecker:
+    _CHECKER = "FusibleOperatorChecker"
+    _KEYS = ['Name', 'Input Shapes', 'Output Shapes']
+    _RATIO_COLUMN = ['aic_mte2_ratio', 'aiv_mte2_ratio', 'aic_fixpipe_ratio', 'aiv_mte3_ratio']
+    _TOTAL_TIME_INDEX = 0
+    _NPU_TIME_INDEX = 1
+    _MTE_TIME_INDEX = 2
+    _COUNT_INDEX = 3
+    _MTE_FLAG_INDEX = 4
+    _HOST_FLAG_INDEX = 5
+    _HIGH_PRIORITY = 0.7
+    _LOW_PRIORITY = 0.3
+    _SPLITTER = '-'
+
+    def __init__(self, **kwargs):
+        self.fusion_issues = False
+        self.desc = ""
+        self.table_desc = ""
+        self.host_desc = ""
+        self.mte_desc = ""
+        self.problem = ""
+        self.mte_problem = ""
+        self.host_problem = ""
+        self.solutions = ""
+        self.min_length = 0
+        self.max_length = 0
+        self.host_threshold = 0
+        self.mte_threshold = 0
+        self.sequence_duration_threshold = 0
+        self.sequence_count_threshold = 0
+        self.topk = 0
+        self.step_duration = 0
+        self.step_id = kwargs.get("step")
+        self.stage = None
+        self.task_list = []
+        self.suggestions = []
+        self._init_rule()
+        self.headers = [
+            "start index", "end index", "total time(us)", "execution time(us)", "mte time(us)", "occurrences",
+            "mte bound", "host bound"
+        ]
+        self.index_dict: OrderedDict = OrderedDict()
+        self.host_details = []
+        self.mte_details = []
+
+    @staticmethod
+    def make_render(html_render, add_render_list=True, **kwargs):
+        return
+
+    @staticmethod
+    def get_mte_time(task: OpInfo):
+        return max(convert_to_float_with_warning(task.aic_mte2_time),
+                   convert_to_float_with_warning(task.aiv_mte2_time)) + max(
+            convert_to_float_with_warning(task.aic_fixpipe_time),
+            convert_to_float_with_warning(task.aiv_mte3_time))
+
+    @staticmethod
+    def check_hccl(task: OpInfo):
+        return (task.task_type in ["COMMUNICATION", "HCCL"] or
+                any(task.op_name.lower().startswith(item) for item in ["hcom", "lccl", "lcoc"]))
+
+    @staticmethod
+    def calculate_total_time(pre_timestamp, timestamp, duration):
+        total_time = (convert_to_float_with_warning(timestamp) + convert_to_float_with_warning(duration) -
+                      convert_to_float_with_warning(pre_timestamp))
+        if not total_time:
+            logger.warning("Total duration is zero.")
+            return 0, False
+        return total_time, True
+
+    def check_fusible_operator(self, profiling_dataset: ProfilingDataset):
+        if not self.check_tasks(profiling_dataset):
+            return
+        tasks = profiling_dataset.op_summary.op_list
+        result_dict = OrderedDict()
+        self.step_duration, _ = self.calculate_total_time(tasks[0].task_start_time, tasks[-1].task_start_time,
+                                                          tasks[-1].task_duration)
+        length = len(profiling_dataset.op_summary.op_list)
+        for index, task in 
enumerate(tasks): + if self.check_hccl(task): + continue + start_time = convert_to_float_with_warning(task.task_start_time) + key = self.generate_key(task) + duration = convert_to_float_with_warning(task.task_duration) + mte_time = self.get_mte_time(task) + aicore_time = convert_to_float_with_warning(task.aicore_time) + for i in range(1, self.max_length): + if i + index >= length: + break + new_task = tasks[i + index] + if self.check_hccl(new_task): + break + key = key + self._SPLITTER + self.generate_key(new_task) + duration = duration + convert_to_float_with_warning(new_task.task_duration) + mte_time += self.get_mte_time(new_task) + aicore_time += convert_to_float_with_warning(new_task.aicore_time) + total_time, _ = self.calculate_total_time(start_time, new_task.task_start_time, new_task.task_duration) + host_flag = duration / total_time < self.host_threshold if total_time else False + mte_flag = mte_time / aicore_time > self.mte_threshold if aicore_time else False + if not mte_flag and not host_flag or i < self.min_length: + continue + result = result_dict.get(key, (0, 0, 0, 0, False, False)) + result_dict[key] = ( + result[self._TOTAL_TIME_INDEX] + total_time, + result[self._NPU_TIME_INDEX] + duration, result[self._MTE_TIME_INDEX] + mte_time, + result[self._COUNT_INDEX] + 1, mte_flag, host_flag + ) + if key not in self.index_dict: + self.index_dict[key] = (index, i + index) + if result_dict: + self.post_processing(result_dict) + + def check_sequence_ratio(self, detail: List): + return detail[self._TOTAL_TIME_INDEX] / self.step_duration > self.sequence_duration_threshold + + def check_sequence_num(self, detail: List): + return detail[self._COUNT_INDEX] > self.sequence_count_threshold + + def check_bound(self, detail: List): + return self.check_sequence_ratio(detail) or detail[self._MTE_FLAG_INDEX] + + def post_processing(self, result_dict: OrderedDict): + result = OrderedDict() + base_sequence = None + record_task_name = None + for task_name, detail in 
result_dict.items(): + if self.check_sequence_num(detail) and (self.check_sequence_ratio(detail) or detail[self._MTE_FLAG_INDEX]): + if not base_sequence: + record_task_name = task_name + elif task_name.startswith(base_sequence) and detail[self._TOTAL_TIME_INDEX] > \ + result_dict[record_task_name][self._TOTAL_TIME_INDEX]: + record_task_name = task_name + else: + result[record_task_name] = result_dict[record_task_name] + record_task_name = task_name + base_sequence = task_name + if task_name not in result and self.check_sequence_num(detail) and self.check_bound(detail): + result[task_name] = result_dict[task_name] + wall_duration = 0 + npu_time = 0 + host_time = 0 + mte_time = 0 + result = OrderedDict(sorted(result.items(), key=lambda x: -x[1][self._TOTAL_TIME_INDEX])) + for task_name, detail in result.items(): + wall_duration += detail[0] + npu_time += detail[1] + host_time += detail[0] - detail[1] + mte_time += detail[2] + if not wall_duration: + continue + if detail[self._MTE_FLAG_INDEX]: + self.add_detail(task_name, self.mte_details, detail) + if detail[self._HOST_FLAG_INDEX]: + self.add_detail(task_name, self.host_details, detail) + if result: + self.fusion_issues = True + self.desc = self.desc.format(count=len(self.mte_details + self.host_details), + wall_duration=round(wall_duration / Constant.US_TO_MS, 3), + npu_time=round(npu_time / Constant.US_TO_MS, 3), + host_threshold=round(host_time / wall_duration, 3), + mte_threshold=round(mte_time / wall_duration, 3)) + + def add_detail(self, task_name: str, details: List, detail: List): + details.append([ + self.index_dict.get(task_name, (0, 0))[0], self.index_dict.get(task_name, (0, 0))[1], + round(detail[0], 2), round(detail[1], 2), round(detail[2], 2), detail[3], detail[4], detail[5] + ]) + + def generate_key(self, task): + return self._SPLITTER.join([task.op_name, task.input_shapes, task.output_shapes]) + + def compute_priority(self): + sequence_total_time = sum(detail[self._TOTAL_TIME_INDEX] for detail in 
self.host_details + self.mte_details) + if sequence_total_time / self.step_duration > self._HIGH_PRIORITY: + return PriorityBackgroundColor.high + elif sequence_total_time / self.step_duration < self._LOW_PRIORITY: + return PriorityBackgroundColor.low + else: + return PriorityBackgroundColor.medium + + def check_tasks(self, profiling_dataset: ProfilingDataset): + if not hasattr(profiling_dataset, "op_summary"): + logger.warning("Skip %s checker because it does not contain %s", self._CHECKER, "op summary") + return False + elif not hasattr(profiling_dataset.op_summary, "op_list"): + logger.warning("Skip %s checker because it does not contain %s", self._CHECKER, "op_list") + return False + elif not profiling_dataset.op_summary.op_list: + logger.warning("Skip %s checker because it does not contain tasks", self._CHECKER) + return False + tasks = profiling_dataset.op_summary.op_list + task = tasks[0] + step_duration, flag = self.calculate_total_time(tasks[0].task_start_time, tasks[-1].task_start_time, + tasks[-1].task_duration) + if not flag: + return False + for item in ["aic_mte2_time", "aiv_mte2_time", "aic_fixpipe_time", "aiv_mte3_time", "task_type"]: + if not hasattr(task, item): + logger.warning("kernel_details.csv (op_summary.csv) does not contain %s, skip operator sequence analysis.", + item) + return False + return True + + def make_record(self, result: OptimizeResult): + """ + make record for what and how to optimize + """ + optimization_item = OptimizeItem(self.problem, self.desc, self.suggestions) + result.add(OptimizeRecord(optimization_item)) + + sub_table_name = BasePrompt.get_sub_table_name(self.host_problem, self.stage) + annotation = self.table_desc.split("\n") + result.add_detail(sub_table_name, headers=self.headers) + result.add_detail(sub_table_name, detail=annotation) + for detail in self.host_details: + result.add_detail(sub_table_name, detail=detail) + + sub_table_name = BasePrompt.get_sub_table_name(self.mte_problem, self.stage) + 
result.add_detail(sub_table_name, headers=self.headers) + result.add_detail(sub_table_name, detail=annotation) + for detail in self.mte_details: + result.add_detail(sub_table_name, detail=detail) + + def _init_rule(self): + language = AdditionalArgsManager().language + fusion_rule_path = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), + "rules", + language, + "fusible_operator.yaml" + ) + + fusion_rule = FileManager.read_yaml_file(fusion_rule_path) + self.problem = fusion_rule.get("problem") + self.mte_problem = fusion_rule.get("mte_problem") + self.host_problem = fusion_rule.get("host_problem") + self.desc = fusion_rule.get("description") + self.table_desc = fusion_rule.get("table_description") + self.mte_desc = fusion_rule.get("mte_description") + self.host_desc = fusion_rule.get("host_description") + self.solutions = fusion_rule.get("solutions") + self.min_length = fusion_rule.get("min_length") + self.max_length = fusion_rule.get("max_length") + self.host_threshold = fusion_rule.get("host_threshold") + self.mte_threshold = fusion_rule.get("mte_threshold") + self.sequence_duration_threshold = fusion_rule.get("sequence_duration_threshold") + self.sequence_count_threshold = fusion_rule.get("sequence_count_threshold") + self.topk = fusion_rule.get("top_num") + if not self.desc or not self.solutions or not isinstance(self.solutions, list): + raise RuntimeError("The configuration file of the fusible operator analyzer is abnormal. 
Please check.") + for solution in self.solutions: + for _, val in solution.items(): + self.suggestions.append(f"{val.get('desc')}") diff --git a/profiler/advisor/common/timeline/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/__init__.py similarity index 100% rename from profiler/advisor/common/timeline/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/__init__.py diff --git a/profiler/advisor/analyzer/schedule/fusion_ops/fusion_ops_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/fusion_ops_analyzer.py similarity index 77% rename from profiler/advisor/analyzer/schedule/fusion_ops/fusion_ops_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/fusion_ops_analyzer.py index 7407823106ec6039605e87539e86e66f737e20f4..247088080b9a5e1b492889752405516a1a731f3e 100644 --- a/profiler/advisor/analyzer/schedule/fusion_ops/fusion_ops_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/fusion_ops_analyzer.py @@ -19,16 +19,16 @@ import re from tqdm import tqdm -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.common import constant as const -from profiler.advisor.common.analyzer_scopes import SupportedScopes -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.advisor.config.config import Config -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.utils.utils import format_timeline_result -from profiler.advisor.common.timeline.fusion_ops_db import init_timeline_ops_db -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.prof_common.constant 
import Constant +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.utils.utils import format_timeline_result +from msprof_analyze.advisor.common.timeline.fusion_ops_db import init_timeline_ops_db +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor logger = logging.getLogger() @@ -44,19 +44,21 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): key = ScheduleAnalysisDataset.get_key() self.timeline_event_dataset = self.get_first_data_by_key(self.dataset_list, key) - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): return PriorityBackgroundColor.low def optimize(self, **kwargs): - disable_affinity_api = os.getenv(const.DISABLE_AFFINITY_API) + disable_affinity_api = os.getenv(Constant.DISABLE_AFFINITY_API) if disable_affinity_api is not None and disable_affinity_api.lower() == "true": logger.info( "Skip affinity api analysis due to longer processing time due to env 'DISABLE_AFFINITY_API'") return self.result - for mode in [const.ATEN.lower(), const.OPTIMIZER.lower()]: + for mode in [Constant.ATEN.lower(), Constant.OPTIMIZER.lower()]: - for op_combined, npu_apis in tqdm(getattr(init_timeline_ops_db(self.cann_version, self.torch_version), + for op_combined, npu_apis in tqdm(getattr(init_timeline_ops_db(self.cann_version, + self.profiling_type, + self.profiling_version), f"_{mode}_op_api_map").items(), leave=False, ncols=100, desc="Scanning timeline for affinity apis"): for npu_api in npu_apis.split("/"): @@ -92,36 +94,30 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): """ if not self.matched_op_stacks: return + + prompt_class = BasePrompt.get_prompt_class(self.__class__.__name__) - desc = f"Found 
{len(format_timeline_result(self.matched_op_stacks))} apis to be replaced" \ - f" based on the runtime env cann-{self.cann_version} and torch-{self.torch_version}" - suggestion = "Please replace training api according to sub table 'Affinity training api'" + desc = prompt_class.DESCRIPTION.format(self.cann_version, self.profiling_version, + len(format_timeline_result(self.matched_op_stacks))) + suggestion = prompt_class.SUGGESTION if self.empty_stacks: - desc += ", but with no stack" - suggestion = const.TIMELINE_EMPTY_STACKS_PROMPT.format( - timeline_profiling_doc_url=Config().timeline_with_stack_doc_url - ) - - sheet_name = "Affinity apis" - optimization_item = OptimizeItem( - sheet_name, - desc, - [suggestion] - ) + desc += prompt_class.EMPTY_STACK_DESCRIPTION + suggestion = prompt_class.EMPTY_STACKS_SUGGESTION.format(Config().timeline_with_stack_doc_url) - self.result.add(OptimizeRecord(optimization_item)) + optimization_item = OptimizeItem(prompt_class.PROBLEM, desc, [suggestion]) + self.result.add(OptimizeRecord(optimization_item)) record_title = ["Affinity API", "Code stacks", "Stack called counts"] - self.result.add_detail(sheet_name, headers=record_title) + self.result.add_detail(prompt_class.PROBLEM, headers=record_title) for api_name, stacks_info in format_timeline_result(self.matched_op_stacks).items(): if not stacks_info: detail = [api_name, "null", "null"] - self.result.add_detail(sheet_name, detail=detail) + self.result.add_detail(prompt_class.PROBLEM, detail=detail) else: for stack in stacks_info: detail = [api_name, *stack] - self.result.add_detail(sheet_name, detail=detail) + self.result.add_detail(prompt_class.PROBLEM, detail=detail) def make_render(self, **kwargs): rank = kwargs.get("rank") @@ -131,7 +127,8 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): template_dir="templates", template_name="affinity_api.html", cann_version=self.cann_version, - torch_version=self.torch_version, + profiling_type=self.profiling_type, + 
profiling_version=self.profiling_version, empty_stacks=self.empty_stacks, with_stack_doc_url=Config().timeline_with_stack_doc_url, api_doc_url=Config().timeline_api_doc_url, @@ -148,7 +145,7 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): for op_rule, stack in op_stack.items(): if op_rule not in self.matched_op_stacks: self.matched_op_stacks[op_rule] = {} - if stack == const.TIMELINE_FUSION_OPS_NO_STACK_FLAG: + if stack == Constant.TIMELINE_FUSION_OPS_NO_STACK_FLAG: continue if stack not in self.matched_op_stacks[op_rule]: self.matched_op_stacks[op_rule][stack] = 0 @@ -162,7 +159,7 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): :Param npu_api: api of torch_npu, generally more efficient than torch api :Param mode: aten or dequeue or optimizer """ - op_list = ops.split(const.OP_SEP) + op_list = ops.split(Constant.OP_SEP) matched_op_index = set() api_ops_matched = False @@ -190,7 +187,7 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): :Param mode: aten or dequeue or optimizer """ matched_op_index = set() - total_op_name = "".join([f"{const.OP_SEP}{self._replace_op_name_prefix(event.name, mode)}{const.OP_SEP}" + total_op_name = "".join([f"{Constant.OP_SEP}{self._replace_op_name_prefix(event.name, mode)}{Constant.OP_SEP}" for event in getattr(event_dataset, mode)]) matched_pattern_index_tuple = [(x.start(0), x.end(0)) for x in re.finditer(op_rule_pattern, total_op_name)] @@ -211,7 +208,8 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): # by the regex index and then calculate the real index for matched fusion operators in event dataset for left, right in zip(total_ops_split_points, total_ops_split_points[1:]): matched_op_flag = True if (left, right) in matched_pattern_index_tuple else False - matched_ops_list = total_op_name[left: right].strip(const.OP_SEP).split(const.OP_SEP + const.OP_SEP) + matched_ops_list = \ + total_op_name[left: right].strip(Constant.OP_SEP).split(Constant.OP_SEP + Constant.OP_SEP) op_index.append([matched_op_flag, len(matched_ops_list)]) 
for i, _ in enumerate(op_index): if i > 0: @@ -243,10 +241,10 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): continue matched_op_rules.append(op_rule) - stack = event.args.get(const.CALL_STACKS) + stack = event.args.get(Constant.CALL_STACKS) if not stack: - logger.debug("Got empty '%s' for event %s", const.CALL_STACKS, event) + logger.debug("Got empty '%s' for event %s", Constant.CALL_STACKS, event) continue if self.empty_stacks and stack: @@ -256,17 +254,17 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): if matched_op_rules and not stack_record: for op_rule in matched_op_rules: - stack_record[op_rule] = const.TIMELINE_FUSION_OPS_NO_STACK_FLAG + stack_record[op_rule] = Constant.TIMELINE_FUSION_OPS_NO_STACK_FLAG return stack_record def _replace_op_name_prefix(self, event_name, mode): - if mode == const.DEQUEUE.lower(): - op_name_prefix = f"{const.DEQUEUE}{const.DEQUEUE_SEP}" - elif mode == const.ATEN: - op_name_prefix = f"{const.ATEN}{const.ATEN_SEP}" + if mode == Constant.DEQUEUE.lower(): + op_name_prefix = f"{Constant.DEQUEUE}{Constant.DEQUEUE_SEP}" + elif mode == Constant.ATEN: + op_name_prefix = f"{Constant.ATEN}{Constant.ATEN_SEP}" else: - op_name_prefix = f"{const.OPTIMIZER}.{const.OPTIMIZER_STEP}{const.OPTIMIZER_SEP}" + op_name_prefix = f"{Constant.OPTIMIZER}.{Constant.OPTIMIZER_STEP}{Constant.OPTIMIZER_SEP}" return event_name.replace(op_name_prefix, "") @@ -283,10 +281,10 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): return op_rule, enable_regex enable_regex = True - op_pattern_list = op_rule.split(const.OP_SEP) + op_pattern_list = op_rule.split(Constant.OP_SEP) format_op_pattern = "" for op_pattern in op_pattern_list: - matched_res = re.search(r'\((\w*?)\)', op_pattern) + matched_res = re.search(r'\((.*?)\)', op_pattern) ops_index_range = (matched_res.start() + 1, matched_res.end() - 1) if matched_res else ( 0, len(op_pattern)) @@ -294,7 +292,7 @@ class TimelineFusionOpsAnalyzer(BaseAnalyzer): op_names = op_pattern[ops_index_range[0]: 
ops_index_range[1]] tmp_op_names_record = [] for op_name in op_names.split("|"): - tmp_op_names_record.append(f"{const.OP_SEP}{op_name.strip(' ')}{const.OP_SEP}") + tmp_op_names_record.append(f"{Constant.OP_SEP}{op_name.strip(' ')}{Constant.OP_SEP}") op_suffix = op_pattern[ops_index_range[1] + 1:] op_names_format = f"({'|'.join(tmp_op_names_record)}){op_suffix}" diff --git a/profiler/advisor/analyzer/schedule/fusion_ops/timeline_api_stack_checker.py b/profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/timeline_api_stack_checker.py similarity index 71% rename from profiler/advisor/analyzer/schedule/fusion_ops/timeline_api_stack_checker.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/timeline_api_stack_checker.py index 126584c3d733401ee7a3481b794f46bf93fc51aa..51469a1f3c14b6584909e1e2e89bf3dc1cc02403 100644 --- a/profiler/advisor/analyzer/schedule/fusion_ops/timeline_api_stack_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/fusion_ops/timeline_api_stack_checker.py @@ -15,12 +15,13 @@ import logging from typing import List -from profiler.advisor.common import constant as const -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.utils.utils import get_analyze_processes, ParallelJob +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.utils.utils import get_analyze_processes, ParallelJob +from 
msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager logger = logging.getLogger() @@ -41,16 +42,16 @@ class OpStackFinder: dst_op_event = event_dataset.ops_with_stack.get(dst_op_event_key) if not dst_op_event: - return const.TIMELINE_BACKWARD_NO_STACK_CODE + return Constant.TIMELINE_BACKWARD_NO_STACK_CODE return int(dst_op_event.get("dataset_index")) @staticmethod def _query_index_by_acl_to_npu(acl_to_npu_event): if acl_to_npu_event: - return const.TIMELINE_ACL_TO_NPU_NO_STACK_CODE + return Constant.TIMELINE_ACL_TO_NPU_NO_STACK_CODE - return const.TIMELINE_BACKWARD_NO_STACK_CODE + return Constant.TIMELINE_BACKWARD_NO_STACK_CODE def get_api_stack_by_op(self, event_dataset: ComputationAnalysisDataset, op_name: List[str] = None, task_type: str = None, @@ -90,17 +91,33 @@ class OpStackFinder: if not self._stack_record: return - desc = f"Found {len(self._stack_record)} called stacks for" - if self.op_name and self.task_type: - desc += f" operators with name '{self.op_name}' with task type '{self.task_type}'" - elif self.op_name and not self.task_type: - desc += f" operators with name '{self.op_name}'" - elif self.task_type and not self.op_name: - desc += f" operators with task type '{self.task_type}'" + language = AdditionalArgsManager().language + if language == "en": + desc = f"Found {len(self._stack_record)} called stacks for" + if self.op_name and self.task_type: + desc += f" operators with name '{self.op_name}' with task type '{self.task_type}'" + elif self.op_name and not self.task_type: + desc += f" operators with name '{self.op_name}'" + elif self.task_type and not self.op_name: + desc += f" operators with task type '{self.task_type}'" + else: + desc += " all operators" + + suggestion = f"Please use command 'ma-advisor analyze profiling' to analyze operators" else: - desc += " all operators" + desc = f"发现以下{len(self._stack_record)}个算子的调用堆栈," + if self.op_name and self.task_type: + desc += f"任务类型为'{self.task_type}'的'{self.op_name}'算子" + 
elif self.op_name and not self.task_type: + desc += f"'{self.op_name}'算子" + elif self.task_type and not self.op_name: + desc += f"算子类型为'{self.task_type}'" + else: + desc += "包括全部算子" + + suggestion = f"请用命令'ma-advisor analyze profiling'分析算子" + - suggestion = f"Please use command 'ma-advisor analyze profiling' to analyze operators" optimization_item = OptimizeItem( "Operator stacks", desc, @@ -126,7 +143,7 @@ class OpStackFinder: def _get_api_stack_by_op(self, event_dataset: ComputationAnalysisDataset, op_name: str, task_type: str): for _, src_op_event in event_dataset.ops_with_task_type.items(): - op_task_type = src_op_event.get(const.TASK_TYPE) + op_task_type = src_op_event.get(Constant.TASK_TYPE) if not (src_op_event.name == op_name and op_task_type and op_task_type == task_type): continue @@ -163,8 +180,8 @@ class OpStackFinder: if task_type is not None: self._get_api_stack_by_op(event_dataset, op_name, task_type) else: - self._get_api_stack_by_op(event_dataset, op_name, const.AI_CORE) - self._get_api_stack_by_op(event_dataset, op_name, const.AI_CPU) + self._get_api_stack_by_op(event_dataset, op_name, Constant.AI_CORE) + self._get_api_stack_by_op(event_dataset, op_name, Constant.AI_CPU) def _format_stack_record(self): stack_list = [] @@ -176,13 +193,13 @@ class OpStackFinder: if index not in self.matched_index: return None event = TimelineEvent(event) - stack = event.args.get(const.CALL_STACKS) + stack = event.args.get(Constant.CALL_STACKS) - stack = stack if stack else const.NO_STACK_REASON_MAP.get(const.TIMELINE_BACKWARD_NO_STACK_CODE) + stack = stack if stack else Constant.NO_STACK_REASON_MAP.get(Constant.TIMELINE_BACKWARD_NO_STACK_CODE) for matched_op_info in self._task_id_record.get(index, []): self._stack_record.append([*matched_op_info, stack]) - for matched_op_info in self._task_id_record.get(const.TIMELINE_ACL_TO_NPU_NO_STACK_CODE, []): + for matched_op_info in self._task_id_record.get(Constant.TIMELINE_ACL_TO_NPU_NO_STACK_CODE, []): 
self._stack_record.append([*matched_op_info, - const.NO_STACK_REASON_MAP.get(const.TIMELINE_ACL_TO_NPU_NO_STACK_CODE)]) + Constant.NO_STACK_REASON_MAP.get(Constant.TIMELINE_ACL_TO_NPU_NO_STACK_CODE)]) return None diff --git a/profiler/advisor/config/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/gc/__init__.py similarity index 100% rename from profiler/advisor/config/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/gc/__init__.py diff --git a/profiler/advisor/analyzer/schedule/gc/gc_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/schedule/gc/gc_analyzer.py similarity index 68% rename from profiler/advisor/analyzer/schedule/gc/gc_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/gc/gc_analyzer.py index b59a8fc2e2a25428e34cb51917462f9f6162bc46..7e04a0597ffb555b15afd2a7d11f2eedb2ff5e06 100644 --- a/profiler/advisor/analyzer/schedule/gc/gc_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/gc/gc_analyzer.py @@ -14,12 +14,12 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.analyzer.schedule.gc.gc_checker import GcChecker -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.schedule.gc.gc_checker import GcChecker +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor logger = logging.getLogger() @@ -36,11 +36,14 @@ class GcAnalyzer(BaseAnalyzer): @BaseAnalyzer.check_data((ScheduleAnalysisDataset.get_key(),)) def optimize(self, **kwargs): + if "mindspore" in self.profiling_type: + logger.info("The analyzer %s does not support MindSpore.", self.__class__.__name__) + return self.result gc_checker = GcChecker() gc_checker.check_gc(self.timeline_event_dataset, rank=kwargs.get("rank"), stage=kwargs.get("stage")) gc_checker.make_record(self.result) gc_checker.make_render(self.html_render, priority=self.get_priority(), rank=kwargs.get("rank")) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): return PriorityBackgroundColor.medium diff --git a/profiler/msprof_analyze/advisor/analyzer/schedule/gc/gc_checker.py b/profiler/msprof_analyze/advisor/analyzer/schedule/gc/gc_checker.py new file mode 100644 index 0000000000000000000000000000000000000000..8e9f9139afd4eac8a80e6f27b34b399cebb1e65e --- /dev/null +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/gc/gc_checker.py @@ -0,0 +1,127 @@ +# 
Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +import os + +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.display.prompt.base_prompt import BasePrompt +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.utils.utils import convert_to_float, convert_to_int +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager + +logger = logging.getLogger() + + +class GcChecker: + + def __init__(self): + self.stage = None + self.rank = None + self.optimization_item = [] + self.gc_issues = False + self.gc_problem_with_count = "" + self.desc = "" + self.suggestions = [] + self.solutions = None + self.gc_threshold = 0 + self.gc_topk_num = 0 + self.abnormal_gc_count = 0 + self.abnormal_gc_duration = 0 + self.abnormal_gc_list = [] + self.headers = ["timestamp", "duration(us)"] + self._init_rule() + + def check_gc(self, event_dataset: ScheduleAnalysisDataset, rank=None, stage=None): + """ + :Param event_dataset: dataset of timeline event + """ + self.rank = rank + self.stage = stage + + # Skip this analyzer when the user's CANN and PTA versions do not support collecting GC events + if not event_dataset.gc_events: + return + + for gc_event in event_dataset.gc_events: + if 
convert_to_float(gc_event.dur) >= self.gc_threshold: + self.gc_issues = True + self.abnormal_gc_count += 1 + self.abnormal_gc_duration += convert_to_float(gc_event.dur) + self.abnormal_gc_list.append([gc_event.ts, gc_event.dur]) + self.abnormal_gc_duration = round(self.abnormal_gc_duration / 1000, 4) + self.abnormal_gc_list.sort(key=lambda x: x[1], reverse=True) + self.desc = self.gc_problem_with_count.format(gc_count=self.abnormal_gc_count, + gc_total_time=self.abnormal_gc_duration) + + def make_record(self, result: OptimizeResult): + """ + make record for what and how to optimize + """ + if not self.gc_issues: + return + + self.optimization_item.append(OptimizeItem(self.problem, self.desc, self.suggestions)) + for optimization in self.optimization_item: + result.add(OptimizeRecord(optimization)) + headers = self.headers + if self.rank is not None: + headers = ["Rank id"] + headers + + sub_table_name = BasePrompt.get_sub_table_name(self.problem, self.stage) + result.add_detail(sub_table_name, headers=headers) + + for row in self.abnormal_gc_list: + if self.rank is not None: + row = [self.rank] + row + result.add_detail(sub_table_name, detail=row) + + def make_render(self, html_render, **kwargs): + if not self.gc_issues: + return + priority = kwargs.get("priority") + rank = kwargs.get("rank") + show_num = min(self.gc_topk_num, self.abnormal_gc_count) + html_render.render_template(key="schedule", + template_dir="templates", + template_name="gc.html", + title="GC Analysis", + desc=self.desc, + solutions=self.solutions, + headers=self.headers, + datas=self.abnormal_gc_list[:show_num], + num=show_num, + priority_background_color=priority, + rank=rank) + + def _init_rule(self): + language = AdditionalArgsManager().language + gc_rule_path = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), + "rules", + language, + "gc.yaml" + ) + gc_rule = FileManager.read_yaml_file(gc_rule_path) + + self.problem = 
gc_rule.get("problem") + self.gc_threshold = convert_to_float(gc_rule.get("gc_threshold", 0)) + self.gc_topk_num = convert_to_int(gc_rule.get("top_num", 0)) + self.gc_problem_with_count = gc_rule.get("gc_problem_with_count", "") + self.solutions = gc_rule.get("solutions", []) + for solution in self.solutions: + for key, val in solution.items(): + self.suggestions.append(f"{key}, {val.get('desc')}") diff --git a/profiler/advisor/dataset/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/__init__.py similarity index 100% rename from profiler/advisor/dataset/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/__init__.py diff --git a/profiler/advisor/analyzer/schedule/syncbn/syncbn_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/syncbn_analyzer.py similarity index 72% rename from profiler/advisor/analyzer/schedule/syncbn/syncbn_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/syncbn_analyzer.py index 3d4679e258903cb71f1c26d5871376ae6df4d116..1e75d4e8969d57d54f55eb477165e6379664b817 100644 --- a/profiler/advisor/analyzer/schedule/syncbn/syncbn_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/syncbn_analyzer.py @@ -14,12 +14,12 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.analyzer.schedule.syncbn.syncbn_checker import SyncBNChecker -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.analyzer.schedule.syncbn.syncbn_checker import SyncBNChecker +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset logger = logging.getLogger() @@ -42,5 +42,5 @@ class SyncBNAnalyzer(BaseAnalyzer): syncbn_checker.make_render(self.html_render, priority=self.get_priority(), rank=kwargs.get("rank")) return self.result - def get_priority(self): + def get_priority(self, max_mem_op_dur=None): return PriorityBackgroundColor.high \ No newline at end of file diff --git a/profiler/advisor/analyzer/schedule/syncbn/syncbn_checker.py b/profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/syncbn_checker.py similarity index 87% rename from profiler/advisor/analyzer/schedule/syncbn/syncbn_checker.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/syncbn_checker.py index e4cf2430c50756046ad182b90772888b39e87d72..64bf40ee230fe2b5d4000deeda337fef124c9c35 100644 --- a/profiler/advisor/analyzer/schedule/syncbn/syncbn_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/syncbn/syncbn_checker.py @@ -1,89 +1,92 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import logging -import os - -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.cluster_analyse.common_func.file_manager import FileManager - -logger = logging.getLogger() - - -class SyncBNChecker: - - def __init__(self): - self.optimization_item = [] - self.syncbn_issues = False - self.desc = "" - self.suggestions = [] - self.solutions = None - self.max_syncbn_num = None - self._init_rule() - - def check_syncbn(self, event_dataset: ScheduleAnalysisDataset): - """ - :Param event_dataset: dataset of timeline event - """ - if not hasattr(event_dataset, "sync_batchnorm") or not getattr(event_dataset, "sync_batchnorm"): - logger.debug("Skip syncbn checker, because no syncbn found") - return - - syncbn_num = len(event_dataset.sync_batchnorm) - self.syncbn_issues = syncbn_num >= self.max_syncbn_num - self.desc = self.desc.format(syncbn_num=syncbn_num) - - def make_record(self, result: OptimizeResult): - """ - make record for what and how to optimize - """ - if not self.syncbn_issues: - return - - self.optimization_item.append(OptimizeItem("SyncBatchNorm", self.desc, self.suggestions)) - for optimization in self.optimization_item: - result.add(OptimizeRecord(optimization)) - - def make_render(self, html_render, **kwargs): - if not self.syncbn_issues: - 
return - - priority = kwargs.get("priority") - rank = kwargs.get("rank") - html_render.render_template(key="schedule", - template_dir="templates", - template_name="sync_batchnorm.html", - desc=self.desc, - solutions=self.solutions, - priority_background_color=priority, - rank=rank) - - def _init_rule(self): - syncbn_rule_path = os.path.join( - os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), - "rules", - "sync_batchnorm.yaml" - ) - - syncbn_rule = FileManager.read_yaml_file(syncbn_rule_path) - - self.max_syncbn_num = syncbn_rule.get("max_syncbn_num") - self.desc = syncbn_rule.get("problem") - - self.solutions = syncbn_rule.get("solutions") - for solution in self.solutions: - for key, val in solution.items(): - self.suggestions.append(f"{key}, {val.get('desc')}") +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging +import os + +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.file_manager import FileManager + +logger = logging.getLogger() + + +class SyncBNChecker: + + def __init__(self): + self.optimization_item = [] + self.syncbn_issues = False + self.desc = "" + self.suggestions = [] + self.solutions = None + self.max_syncbn_num = None + self._init_rule() + + def check_syncbn(self, event_dataset: ScheduleAnalysisDataset): + """ + :Param event_dataset: dataset of timeline event + """ + if not hasattr(event_dataset, "sync_batchnorm") or not getattr(event_dataset, "sync_batchnorm"): + logger.debug("Skip syncbn checker, because no syncbn found") + return + + syncbn_num = len(event_dataset.sync_batchnorm) + self.syncbn_issues = syncbn_num >= self.max_syncbn_num + self.desc = self.desc.format(syncbn_num=syncbn_num) + + def make_record(self, result: OptimizeResult): + """ + make record for what and how to optimize + """ + if not self.syncbn_issues: + return + + self.optimization_item.append(OptimizeItem("SyncBatchNorm", self.desc, self.suggestions)) + for optimization in self.optimization_item: + result.add(OptimizeRecord(optimization)) + + def make_render(self, html_render, **kwargs): + if not self.syncbn_issues: + return + + priority = kwargs.get("priority") + rank = kwargs.get("rank") + html_render.render_template(key="schedule", + template_dir="templates", + template_name="sync_batchnorm.html", + desc=self.desc, + solutions=self.solutions, + priority_background_color=priority, + rank=rank) + + def _init_rule(self): + language = AdditionalArgsManager().language + syncbn_rule_path = os.path.join( + 
os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), + "rules", + language, + "sync_batchnorm.yaml" + ) + + syncbn_rule = FileManager.read_yaml_file(syncbn_rule_path) + + self.max_syncbn_num = syncbn_rule.get("max_syncbn_num") + self.desc = syncbn_rule.get("problem") + + self.solutions = syncbn_rule.get("solutions") + for solution in self.solutions: + for key, val in solution.items(): + self.suggestions.append(f"{key}, {val.get('desc')}") diff --git a/profiler/advisor/dataset/ai_core_freq/__init__.py b/profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/__init__.py similarity index 100% rename from profiler/advisor/dataset/ai_core_freq/__init__.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/__init__.py diff --git a/profiler/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_analyzer.py b/profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_analyzer.py similarity index 74% rename from profiler/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_analyzer.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_analyzer.py index 2d3a601ac533b48c15fd44504cf56265a42335a7..ea095e1968f67d4762280bb4dfe180bddde4368e 100644 --- a/profiler/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_analyzer.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_analyzer.py @@ -14,11 +14,12 @@ # limitations under the License. 
import logging -from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.analyzer.schedule.synchronize_stream.synchronize_stream_checker import SynchronizeStreamChecker -from profiler.advisor.display.html.render import HTMLRender -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.analyzer.base_analyzer import BaseAnalyzer +from msprof_analyze.advisor.analyzer.schedule.synchronize_stream.synchronize_stream_checker import \ + SynchronizeStreamChecker +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.advisor.result.result import OptimizeResult logger = logging.getLogger() @@ -43,5 +44,5 @@ class SynchronizeStreamAnalyzer(BaseAnalyzer): rank=kwargs.get("rank")) return self.result - def get_priority(self, synchronize_stream_checker): - return synchronize_stream_checker.priority + def get_priority(self, max_mem_op_dur): + return max_mem_op_dur.priority diff --git a/profiler/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_checker.py b/profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_checker.py similarity index 82% rename from profiler/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_checker.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_checker.py index c27b398f46060c49fc385a0648b99b133d880fa6..1b9c074304d3f3f7b5e8db904df5344197bbb312 100644 --- a/profiler/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/synchronize_stream/synchronize_stream_checker.py @@ -1,129 +1,132 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import logging -import os - -from profiler.advisor.analyzer.schedule.timeline_base_checker import TimelineBaseChecker -from profiler.advisor.common import constant as const -from profiler.advisor.config.config import Config -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.display.html.priority_background_color import PriorityBackgroundColor -from profiler.advisor.result.result import OptimizeResult -from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.utils.utils import format_timeline_result, safe_division -from profiler.cluster_analyse.common_func.file_manager import FileManager - -logger = logging.getLogger() - - -class SynchronizeStreamChecker(TimelineBaseChecker): - - def __init__(self): - super().__init__(n_processes=1) - self.optimization_item = [] - self.synchronize_issues = False - self.desc = "" - self.suggestions = [] - self.solutions = [] - self.min_co_occurrence_ratio = 0 - self.priority = None - self._init_rule() - - def check_synchronize(self, event_dataset: ScheduleAnalysisDataset): - if not hasattr(event_dataset, "synchronize_stream") or not getattr(event_dataset, "synchronize_stream"): - logger.info("Skip synchronize stream checker, because no synchronize stream found") - return - - node_launch_num = 0 - co_occurrence_num = 0 - synchronize_num = 0 - synchronize_stream = event_dataset.synchronize_stream - 
for index, op in enumerate(synchronize_stream): - if op.name.startswith(const.NODE_LAUNCH): - node_launch_num += 1 - if op.name.startswith(const.SYNC_STREAM): - synchronize_num += 1 - - # 统计nodeLaunch 和 synchronizeStream 一前一后连续出现次数 - if index > 0 and synchronize_stream[index - 1].name.startswith(const.NODE_LAUNCH): - co_occurrence_num += 1 - - # 当共现次数很多时,则大概率设置了ASCEND_LAUNCH_BLOCKING环境变量 - co_occurrence_ratio = round(safe_division(co_occurrence_num, node_launch_num), 4) - if co_occurrence_ratio > self.min_co_occurrence_ratio: - self.synchronize_issues = True - - self.priority = self.get_priority() - - self.desc = self.desc.format(synchronize_num=synchronize_num, - node_launch_num=node_launch_num, - co_occur_ratio=co_occurrence_ratio) - - solutions = [] - for solution in solutions: - renderer_solution = {} - for key, val in solution.items(): - self.suggestions.append(f"{key}, {val.get('desc')}") - renderer_solution.update({key: val}) - self.solutions.append(renderer_solution) - - def make_record(self, result: OptimizeResult): - """ - make record for what and how to optimize - """ - if not self.synchronize_issues: - return - - self.optimization_item.append(OptimizeItem("SynchronizeStream", self.desc, self.suggestions)) - for optimization in self.optimization_item: - result.add(OptimizeRecord(optimization)) - - def make_render(self, html_render, **kwargs): - if not self.synchronize_issues: - return - priority = kwargs.get("priority") - rank = kwargs.get("rank") - format_result_for_html = format_timeline_result(dict(self.matched_op_stacks), dump_html=True) - html_render.render_template(key="schedule", - template_dir="templates", - template_name="synchronize_stream.html", - desc=self.desc, - solutions=self.solutions, - result=format_result_for_html, - with_stack_doc_url=Config().timeline_with_stack_doc_url, - empty_stacks=self.empty_stacks, - framework_black_list=self.framework_black_list, - priority_background_color=priority, - rank=rank) - - def get_priority(self): - 
return PriorityBackgroundColor.high - - def _init_rule(self): - synchronize_rule_path = os.path.join( - os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), - "rules", - "synchronize.yaml" - ) - - synchronize_rule = FileManager.read_yaml_file(synchronize_rule_path) - - self.min_co_occurrence_ratio = synchronize_rule.get("min_co_occurrence_ratio") - self.desc = synchronize_rule.get("problem") - - self.solutions = synchronize_rule.get("solutions") - for solution in self.solutions: - for key, val in solution.items(): - self.suggestions.append(f"{key}, {val.get('desc')}") +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging +import os + +from msprof_analyze.advisor.analyzer.schedule.timeline_base_checker import TimelineBaseChecker +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.display.html.priority_background_color import PriorityBackgroundColor +from msprof_analyze.advisor.result.result import OptimizeResult +from msprof_analyze.advisor.result.item import OptimizeItem, OptimizeRecord +from msprof_analyze.advisor.utils.utils import format_timeline_result, safe_division +from msprof_analyze.prof_common.file_manager import FileManager + +logger = logging.getLogger() + + +class SynchronizeStreamChecker(TimelineBaseChecker): + + def __init__(self): + super().__init__(n_processes=1) + self.optimization_item = [] + self.synchronize_issues = False + self.desc = "" + self.suggestions = [] + self.solutions = [] + self.min_co_occurrence_ratio = 0 + self.priority = None + self._init_rule() + + def check_synchronize(self, event_dataset: ScheduleAnalysisDataset): + if not hasattr(event_dataset, "synchronize_stream") or not getattr(event_dataset, "synchronize_stream"): + logger.info("Skip synchronize stream checker, because no synchronize stream found") + return + + node_launch_num = 0 + co_occurrence_num = 0 + synchronize_num = 0 + synchronize_stream = event_dataset.synchronize_stream + for index, op in enumerate(synchronize_stream): + if op.name.startswith(Constant.NODE_LAUNCH): + node_launch_num += 1 + if op.name.startswith(Constant.SYNC_STREAM): + synchronize_num += 1 + + # 统计nodeLaunch 和 synchronizeStream 一前一后连续出现次数 + if index > 0 and synchronize_stream[index - 1].name.startswith(Constant.NODE_LAUNCH): + co_occurrence_num += 1 + + # 当共现次数很多时,则大概率设置了ASCEND_LAUNCH_BLOCKING环境变量 + co_occurrence_ratio = 
round(safe_division(co_occurrence_num, node_launch_num), 4) + if co_occurrence_ratio > self.min_co_occurrence_ratio: + self.synchronize_issues = True + + self.priority = self.get_priority() + + self.desc = self.desc.format(synchronize_num=synchronize_num, + node_launch_num=node_launch_num, + co_occur_ratio=co_occurrence_ratio) + + solutions = [] + for solution in solutions: + renderer_solution = {} + for key, val in solution.items(): + self.suggestions.append(f"{key}, {val.get('desc')}") + renderer_solution.update({key: val}) + self.solutions.append(renderer_solution) + + def make_record(self, result: OptimizeResult): + """ + make record for what and how to optimize + """ + if not self.synchronize_issues: + return + + self.optimization_item.append(OptimizeItem("SynchronizeStream", self.desc, self.suggestions)) + for optimization in self.optimization_item: + result.add(OptimizeRecord(optimization)) + + def make_render(self, html_render, **kwargs): + if not self.synchronize_issues: + return + priority = kwargs.get("priority") + rank = kwargs.get("rank") + format_result_for_html = format_timeline_result(dict(self.matched_op_stacks), dump_html=True) + html_render.render_template(key="schedule", + template_dir="templates", + template_name="synchronize_stream.html", + desc=self.desc, + solutions=self.solutions, + result=format_result_for_html, + with_stack_doc_url=Config().timeline_with_stack_doc_url, + empty_stacks=self.empty_stacks, + framework_black_list=self.framework_black_list, + priority_background_color=priority, + rank=rank) + + def get_priority(self): + return PriorityBackgroundColor.high + + def _init_rule(self): + language = AdditionalArgsManager().language + synchronize_rule_path = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))), + "rules", + language, + "synchronize.yaml" + ) + + synchronize_rule = FileManager.read_yaml_file(synchronize_rule_path) + + self.min_co_occurrence_ratio = 
synchronize_rule.get("min_co_occurrence_ratio") + self.desc = synchronize_rule.get("problem") + + self.solutions = synchronize_rule.get("solutions") + for solution in self.solutions: + for key, val in solution.items(): + self.suggestions.append(f"{key}, {val.get('desc')}") diff --git a/profiler/advisor/analyzer/schedule/timeline_base_checker.py b/profiler/msprof_analyze/advisor/analyzer/schedule/timeline_base_checker.py similarity index 80% rename from profiler/advisor/analyzer/schedule/timeline_base_checker.py rename to profiler/msprof_analyze/advisor/analyzer/schedule/timeline_base_checker.py index 9ef492c1a24a4efafcee86d30f2201adb40264a6..649e41e5daa7597795d31abb5ed6468364cf49ff 100644 --- a/profiler/advisor/analyzer/schedule/timeline_base_checker.py +++ b/profiler/msprof_analyze/advisor/analyzer/schedule/timeline_base_checker.py @@ -12,14 +12,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-from abc import ABC, abstractmethod +from abc import ABC import multiprocessing import logging -from profiler.advisor.common import constant as const -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.result.result import OptimizeResult +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset logger = logging.getLogger() @@ -46,7 +45,7 @@ class TimelineBaseChecker(ABC): for op, stack in op_stack.items(): if op not in self.matched_op_stacks: self.matched_op_stacks[op] = {} - if stack == const.TIMELINE_FUSION_OPS_NO_STACK_FLAG: + if stack == Constant.TIMELINE_FUSION_OPS_NO_STACK_FLAG: continue if stack not in self.matched_op_stacks[op]: self.matched_op_stacks[op][stack] = 0 @@ -62,15 +61,15 @@ class TimelineBaseChecker(ABC): continue matched_ops.append(op) - stack = event.args.get(const.CALL_STACKS) + stack = event.args.get(Constant.CALL_STACKS) if not stack: - logger.debug("Got empty '%s' for event %s", const.CALL_STACKS, event) + logger.debug("Got empty '%s' for event %s", Constant.CALL_STACKS, event) continue if not self._is_keep_stack(stack): self.framework_black_list = True - logger.debug("Drop stack from framework %s", const.FRAMEWORK_STACK_BLACK_LIST) + logger.debug("Drop stack from framework %s", Constant.FRAMEWORK_STACK_BLACK_LIST) continue if self.empty_stacks and stack: @@ -80,7 +79,7 @@ class TimelineBaseChecker(ABC): if matched_ops and not stack_record: for op in matched_ops: - stack_record[op] = const.TIMELINE_FUSION_OPS_NO_STACK_FLAG + stack_record[op] = Constant.TIMELINE_FUSION_OPS_NO_STACK_FLAG return stack_record @@ -91,7 +90,7 @@ class TimelineBaseChecker(ABC): return False final_called_stack = stack_list[0] - for framework in const.FRAMEWORK_STACK_BLACK_LIST: + for 
framework in Constant.FRAMEWORK_STACK_BLACK_LIST: if framework in final_called_stack.split("/"): return False return True diff --git a/profiler/advisor/cluster_perf_analysis.ipynb b/profiler/msprof_analyze/advisor/cluster_perf_analysis.ipynb similarity index 99% rename from profiler/advisor/cluster_perf_analysis.ipynb rename to profiler/msprof_analyze/advisor/cluster_perf_analysis.ipynb index 7ee0b24e85467fe42205c5986095a7e66bf0a636..0fc37222f6829331f30ba6dd470faae475a1bac9 100644 --- a/profiler/advisor/cluster_perf_analysis.ipynb +++ b/profiler/msprof_analyze/advisor/cluster_perf_analysis.ipynb @@ -23,7 +23,7 @@ "metadata": {}, "outputs": [], "source": [ - "from profiler.advisor.interface.interface import Interface\n", + "from msprof_analyze.advisor.interface.interface import Interface\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from prettytable import PrettyTable, ALL\n", @@ -675,7 +675,7 @@ "metadata": {}, "outputs": [], "source": [ - "from advisor_backend.interface import Interface\n", + "from msprof_analyze.advisor.advisor_backend.interface import Interface\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] @@ -973,7 +973,7 @@ "5). 
生成的json文件可以在chrome trace中查看 \n", "\n", "示例图:\n", - "![pipeline_view](../../profiler/test/resource/pipeline_view.png)" + "![pipeline_view](../../profiler/msprof_analyze/test/resource/pipeline_view.png)" ] }, { diff --git a/profiler/advisor/dataset/cluster/__init__.py b/profiler/msprof_analyze/advisor/common/__init__.py similarity index 100% rename from profiler/advisor/dataset/cluster/__init__.py rename to profiler/msprof_analyze/advisor/common/__init__.py diff --git a/profiler/advisor/common/analyzer_scopes.py b/profiler/msprof_analyze/advisor/common/analyzer_scopes.py similarity index 83% rename from profiler/advisor/common/analyzer_scopes.py rename to profiler/msprof_analyze/advisor/common/analyzer_scopes.py index a07a6d5de72c01c7ea568599de917d66a0a89f70..6a6261c7b75e721c0a9df75f35ecb3cd2aa1e487 100644 --- a/profiler/advisor/common/analyzer_scopes.py +++ b/profiler/msprof_analyze/advisor/common/analyzer_scopes.py @@ -1,5 +1,4 @@ -# Copyright (c) 2024, Huawei Technologies Co., Ltd. -# All rights reserved. +# Copyright (c) Huawei Technologies Co., Ltd. 2024-2025. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -23,6 +22,7 @@ class SupportedScopes: COMMUNICATION_RETRANSMISSION_DETECTION = "communication_retransmission_analysis" PACKET = "packet_analysis" BANDWIDTH_CONTENTION_DETECTION = "bandwidth_contention_analysis" + BYTE_ALIGNMENT_DETECTION = "byte_alignment_analysis" OVER_ALL = "over_all" ENVIRONMENT_VARIABLE_ANALYSIS = "environment_variable_analysis" DYNAMIC_SHAPE_ANALYSIS = "dynamic_shape_analysis" @@ -37,4 +37,7 @@ class SupportedScopes: MEMORY = "memory" STAGE_COMPUTE = "stage_compute" GC_ANALYSIS = "gc_analysis" + FUSIBLE_OPERATOR_ANALYSIS = "fusible_operator_analysis" + CONJECTURED_GC_ANALYSIS = "conjectured_analysis" COMPARISON = "comparison" + AICORE_PERFORMANCE_ANALYSIS = "ai_core_performance_analysis" diff --git a/profiler/advisor/common/async_analysis_status.py b/profiler/msprof_analyze/advisor/common/async_analysis_status.py similarity index 96% rename from profiler/advisor/common/async_analysis_status.py rename to profiler/msprof_analyze/advisor/common/async_analysis_status.py index 3d9a5d7c102ef0a2a20fb4c84cfc00c6a8568899..98bb458105421b38395f745f2913311a24a5ce40 100644 --- a/profiler/advisor/common/async_analysis_status.py +++ b/profiler/msprof_analyze/advisor/common/async_analysis_status.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + class AsyncAnalysisStatus: diff --git a/profiler/advisor/common/enum_params_parser.py b/profiler/msprof_analyze/advisor/common/enum_params_parser.py similarity index 92% rename from profiler/advisor/common/enum_params_parser.py rename to profiler/msprof_analyze/advisor/common/enum_params_parser.py index ea41620a9c014e8a538b9ba13cbfb5f263e21d1d..ebf81ae38c249f4701e46f9a05b5cb9f86db635c 100644 --- a/profiler/advisor/common/enum_params_parser.py +++ b/profiler/msprof_analyze/advisor/common/enum_params_parser.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,14 +13,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" + import os import logging import typing -from profiler.advisor.common.timeline.event import AdvisorDict -from profiler.advisor.utils.utils import singleton -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.advisor.common.timeline.event import AdvisorDict +from msprof_analyze.advisor.utils.utils import singleton +from msprof_analyze.prof_common.file_manager import FileManager logger = logging.getLogger() diff --git a/profiler/advisor/dataset/communication/__init__.py b/profiler/msprof_analyze/advisor/common/graph/__init__.py similarity index 100% rename from profiler/advisor/dataset/communication/__init__.py rename to profiler/msprof_analyze/advisor/common/graph/__init__.py diff --git a/profiler/advisor/common/graph/graph.py b/profiler/msprof_analyze/advisor/common/graph/graph.py similarity index 97% rename from profiler/advisor/common/graph/graph.py rename to profiler/msprof_analyze/advisor/common/graph/graph.py index 
f86f5db7f2cec8f51d3daab6b9c6e9d22de44483..b237e8d59419ebf6a3c313d501fe829eace6d03e 100644 --- a/profiler/advisor/common/graph/graph.py +++ b/profiler/msprof_analyze/advisor/common/graph/graph.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,13 +13,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" + import logging from typing import Dict, List, Tuple, Callable, Any, Optional, Union import networkx as nx -from profiler.advisor.common.graph.graph_parser import HostGraphNode, QueryGraphNode +from msprof_analyze.advisor.common.graph.graph_parser import HostGraphNode, QueryGraphNode logger = logging.getLogger() diff --git a/profiler/advisor/common/graph/graph_match.py b/profiler/msprof_analyze/advisor/common/graph/graph_match.py similarity index 99% rename from profiler/advisor/common/graph/graph_match.py rename to profiler/msprof_analyze/advisor/common/graph/graph_match.py index fbf0a8abe8e049ccb6f9ff2baaa528e94cb3d7e2..1cf2fe170d2ab8d3e29429785c7b6398cc0dd964 100644 --- a/profiler/advisor/common/graph/graph_match.py +++ b/profiler/msprof_analyze/advisor/common/graph/graph_match.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import itertools import logging from functools import lru_cache diff --git a/profiler/advisor/common/graph/graph_parser.py b/profiler/msprof_analyze/advisor/common/graph/graph_parser.py similarity index 99% rename from profiler/advisor/common/graph/graph_parser.py rename to profiler/msprof_analyze/advisor/common/graph/graph_parser.py index a89cf738fff8b679219380e71435148c7f8aa216..5a35a971f07d1d54471aacbeeb44935d8a8cb347 100644 --- a/profiler/advisor/common/graph/graph_parser.py +++ b/profiler/msprof_analyze/advisor/common/graph/graph_parser.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import os import logging import itertools @@ -21,8 +21,8 @@ from collections import deque from dataclasses import dataclass from typing import List, Tuple, Dict -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.utils.file import FileOpen +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.utils.file import FileOpen logger = logging.getLogger() diff --git a/profiler/advisor/dataset/profiling/__init__.py b/profiler/msprof_analyze/advisor/common/profiling/__init__.py similarity index 100% rename from profiler/advisor/dataset/profiling/__init__.py rename to profiler/msprof_analyze/advisor/common/profiling/__init__.py diff --git a/profiler/advisor/common/profiling/ge_info.py b/profiler/msprof_analyze/advisor/common/profiling/ge_info.py similarity index 90% rename from profiler/advisor/common/profiling/ge_info.py rename to profiler/msprof_analyze/advisor/common/profiling/ge_info.py index 91642f967970fdf27f76754ee4bbd7f4ab4fcc50..f255684290e1935928ba741dec4cfdc55341cfe5 100644 --- a/profiler/advisor/common/profiling/ge_info.py +++ b/profiler/msprof_analyze/advisor/common/profiling/ge_info.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import logging import os from typing import Any, List @@ -21,9 +21,9 @@ from typing import Any, List from sqlalchemy import text from sqlalchemy.exc import SQLAlchemyError -from profiler.advisor.dataset.profiling.db_manager import ConnectionManager -from profiler.advisor.dataset.profiling.profiling_parser import ProfilingParser -from profiler.advisor.utils.utils import check_path_valid +from msprof_analyze.advisor.dataset.profiling.db_manager import ConnectionManager +from msprof_analyze.advisor.dataset.profiling.profiling_parser import ProfilingParser +from msprof_analyze.advisor.utils.utils import check_path_valid logger = logging.getLogger() diff --git a/profiler/advisor/common/profiling/msprof.py b/profiler/msprof_analyze/advisor/common/profiling/msprof.py similarity index 94% rename from profiler/advisor/common/profiling/msprof.py rename to profiler/msprof_analyze/advisor/common/profiling/msprof.py index 150d3f985973f2c79ccf5406c114932aef5008fd..e4d537ddc78d282ef9db69b1741abfb6cde6d906 100644 --- a/profiler/advisor/common/profiling/msprof.py +++ b/profiler/msprof_analyze/advisor/common/profiling/msprof.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,12 +13,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import logging from typing import Dict, List -from profiler.advisor.dataset.profiling.info_collection import TaskInfo -from profiler.advisor.dataset.profiling.profiling_parser import ProfilingParser +from msprof_analyze.advisor.dataset.profiling.info_collection import TaskInfo, HcclOp +from msprof_analyze.advisor.dataset.profiling.profiling_parser import ProfilingParser logger = logging.getLogger() @@ -54,6 +54,7 @@ class Msprof(ProfilingParser): def __init__(self, path: str) -> None: super().__init__(path) self._tasks: List[TaskInfo] = [] + self._hccl_tasks: List[HcclOp] = [] self._iteration_time = 0.0 self._model_id = None self._iteration_id = None @@ -146,6 +147,7 @@ class Msprof(ProfilingParser): self._max_time = max_time self._min_time = min_time if self._tasks: + self._tasks.sort(key=lambda x: x.start_time) return True return False diff --git a/profiler/advisor/common/profiling/op_summary.py b/profiler/msprof_analyze/advisor/common/profiling/op_summary.py similarity index 92% rename from profiler/advisor/common/profiling/op_summary.py rename to profiler/msprof_analyze/advisor/common/profiling/op_summary.py index c042509df96c0c8feacb39a56e6f73358cd5d8a9..f4659705bdf9645776d16d1b04da4a5746f109f0 100644 --- a/profiler/advisor/common/profiling/op_summary.py +++ b/profiler/msprof_analyze/advisor/common/profiling/op_summary.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,14 +13,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import logging from decimal import Decimal from typing import List, Any -from profiler.advisor.dataset.profiling.info_collection import OpInfo -from profiler.advisor.dataset.profiling.profiling_parser import ProfilingParser -from profiler.advisor.utils.utils import format_excel_title, lazy_property +from msprof_analyze.advisor.dataset.profiling.info_collection import OpInfo +from msprof_analyze.advisor.dataset.profiling.profiling_parser import ProfilingParser +from msprof_analyze.advisor.utils.utils import format_excel_title, lazy_property logger = logging.getLogger() diff --git a/profiler/advisor/common/profiling/tasktime.py b/profiler/msprof_analyze/advisor/common/profiling/tasktime.py similarity index 94% rename from profiler/advisor/common/profiling/tasktime.py rename to profiler/msprof_analyze/advisor/common/profiling/tasktime.py index 211800585a6b3385e41d009827ec675bfa9df560..2d0474be60266dd9b687e9221789790ca18bcef6 100644 --- a/profiler/advisor/common/profiling/tasktime.py +++ b/profiler/msprof_analyze/advisor/common/profiling/tasktime.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,12 +13,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import logging from typing import Dict, List -from profiler.advisor.dataset.profiling.info_collection import TaskInfo -from profiler.advisor.dataset.profiling.profiling_parser import ProfilingParser +from msprof_analyze.advisor.dataset.profiling.info_collection import TaskInfo +from msprof_analyze.advisor.dataset.profiling.profiling_parser import ProfilingParser logger = logging.getLogger() diff --git a/profiler/advisor/dataset/timeline_op_collector/__init__.py b/profiler/msprof_analyze/advisor/common/timeline/__init__.py similarity index 100% rename from profiler/advisor/dataset/timeline_op_collector/__init__.py rename to profiler/msprof_analyze/advisor/common/timeline/__init__.py diff --git a/profiler/advisor/common/timeline/event.py b/profiler/msprof_analyze/advisor/common/timeline/event.py similarity index 99% rename from profiler/advisor/common/timeline/event.py rename to profiler/msprof_analyze/advisor/common/timeline/event.py index 79ee63211c33515ce8bad1a3a537caa65ac86511..f6afb7359494e561fc2e403c29ed4481d12c7529 100644 --- a/profiler/advisor/common/timeline/event.py +++ b/profiler/msprof_analyze/advisor/common/timeline/event.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + from decimal import Decimal diff --git a/profiler/advisor/common/timeline/fusion_ops_db.py b/profiler/msprof_analyze/advisor/common/timeline/fusion_ops_db.py similarity index 83% rename from profiler/advisor/common/timeline/fusion_ops_db.py rename to profiler/msprof_analyze/advisor/common/timeline/fusion_ops_db.py index ad8b5981c72b12c213146b205d1f1d86dd408589..87812d8c73ceb02ba1fa4c7f5b7f28263efc3c0a 100644 --- a/profiler/advisor/common/timeline/fusion_ops_db.py +++ b/profiler/msprof_analyze/advisor/common/timeline/fusion_ops_db.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" + # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,49 +13,51 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import logging import os -from profiler.advisor.common import constant -from profiler.advisor.common.enum_params_parser import EnumParamsParser -from profiler.advisor.common.timeline.fusion_ops_rule import OpRule -from profiler.advisor.common.timeline.fusion_ops_rule_handler import TimelineOpRuleHandler -from profiler.advisor.utils.log import get_log_level -from profiler.advisor.utils.utils import get_file_path_by_walk -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.enum_params_parser import EnumParamsParser +from msprof_analyze.advisor.common.timeline.fusion_ops_rule import OpRule +from msprof_analyze.advisor.common.timeline.fusion_ops_rule_handler import TimelineOpRuleHandler +from msprof_analyze.advisor.utils.log import get_log_level +from msprof_analyze.advisor.utils.utils import get_file_path_by_walk +from msprof_analyze.prof_common.file_manager import FileManager logger = logging.getLogger() logger.setLevel(get_log_level()) -def init_timeline_ops_db(cann_version=None, torch_version=None): +def init_timeline_ops_db(cann_version=None, profiling_type=None, profiling_version=None): logger.debug("init operators database") - return FusionOperatorDB(cann_version=cann_version, torch_version=torch_version) + return FusionOperatorDB(cann_version=cann_version, + profiling_type=profiling_type, + profiling_version=profiling_version) def get_timeline_fusion_ops_yaml_path(): # 环境变量 ADVISOR_RULE_PATH 不为空且该路径存在, os.walk遍历其下文件, 若存在相应的规则文件则返回路径 - advisor_rule_path = os.getenv(constant.ADVISOR_RULE_PATH) + advisor_rule_path = os.getenv(Constant.ADVISOR_RULE_PATH) if advisor_rule_path and os.path.exists(advisor_rule_path): - specified_file_path = get_file_path_by_walk(advisor_rule_path, constant.TIMELINE_FUSION_OPS_YAML_NAME) + specified_file_path = get_file_path_by_walk(advisor_rule_path, Constant.TIMELINE_FUSION_OPS_YAML_NAME) if 
len(specified_file_path.strip()) and os.path.exists(specified_file_path): logger.debug("Successfully find The %s file which is specified by the environment variable: %s.", - specified_file_path, constant.ADVISOR_RULE_PATH) + specified_file_path, Constant.ADVISOR_RULE_PATH) return specified_file_path logger.warning("The %s does not exist in path: %s. Try to use cloud or default local YAML file.", - constant.TIMELINE_FUSION_OPS_YAML_NAME, os.path.normpath(advisor_rule_path)) + Constant.TIMELINE_FUSION_OPS_YAML_NAME, os.path.normpath(advisor_rule_path)) # 检查云文件默认保存路径文件夹下是否存在相应文件, 默认路径 ~/rules/cloud/ - cloud_file_path = os.path.join(os.path.expanduser("~"), constant.CLOUD_RULE_PATH, - constant.TIMELINE_FUSION_OPS_YAML_NAME) + cloud_file_path = os.path.join(os.path.expanduser("~"), Constant.CLOUD_RULE_PATH, + Constant.TIMELINE_FUSION_OPS_YAML_NAME) if os.path.exists(cloud_file_path): - logger.debug("Successfully find The cloud %s file in %s.", constant.TIMELINE_FUSION_OPS_YAML_NAME, + logger.debug("Successfully find The cloud %s file in %s.", Constant.TIMELINE_FUSION_OPS_YAML_NAME, cloud_file_path) return cloud_file_path # 检查本地默认文件 local_file_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), - constant.DEFAULT_RULE_PATH, constant.TIMELINE_FUSION_OPS_YAML_NAME) + Constant.DEFAULT_RULE_PATH, Constant.TIMELINE_FUSION_OPS_YAML_NAME) if not os.path.exists(local_file_path): # 若本地默认文件不存在, 则log异常信息并 logger.error("The default local YAML file does not exist. 
Please check the YAML file in the default path %s.", @@ -65,17 +67,18 @@ def get_timeline_fusion_ops_yaml_path(): class FusionOperatorDB: - def __init__(self, file_path=None, cann_version=None, torch_version=None): + def __init__(self, cann_version=None, profiling_type=None, profiling_version=None): self.timeline_fusion_ops_yaml_path = os.path.normpath(get_timeline_fusion_ops_yaml_path()) - - self.cann_version = cann_version or EnumParamsParser().get_default(constant.CANN_VERSION) - self.torch_version = torch_version or EnumParamsParser().get_default(constant.TORCH_VERSION) + self.cann_version = cann_version or EnumParamsParser().get_default(Constant.CANN_VERSION) + self.profiling_type = profiling_type or EnumParamsParser().get_default(Constant.PROFILING_TYPE_UNDER_LINE) + self.profiling_version = profiling_version or EnumParamsParser().get_default(Constant.PROFILING_TYPE_UNDER_LINE) self._supported_version_dict = {} self.is_empty = False self.timeline_op_rule_handler = TimelineOpRuleHandler() - self.fusion_operator = self._load_yaml(self.timeline_fusion_ops_yaml_path) + self.fusion_operator = self._load_yaml( + self.timeline_fusion_ops_yaml_path) if profiling_type == Constant.PYTORCH else {} self._dequeue_op_names = [] self._aten_op_names = [] @@ -110,9 +113,9 @@ class FusionOperatorDB: return self._optimizer_op_api_map def get_fusion_operator_with_unique_id(self, unique_id): - if unique_id == constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID: + if unique_id == Constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID: logger.warning("The specified unique id: %s is invalid.Please check whether the rule of the unique id " - "exists and modify the rule.", constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID) + "exists and modify the rule.", Constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID) return {} result_tmp_rule = self.timeline_op_rule_handler.get_tmp_timeline_op_rule_with_unique_id(unique_id) result_op_rule = OpRule(result_tmp_rule) @@ -126,7 +129,7 @@ class FusionOperatorDB: def 
regenerate_timeline_op_rule_with_version(self, cann_version=None, torch_version=None): cann_version = cann_version or self.cann_version - torch_version = torch_version or self.torch_version + torch_version = torch_version or self.profiling_version unique_id = self._get_unique_id_in_supported_version_dict(cann_version=cann_version, torch_version=torch_version) self.regenerate_timeline_op_rule_with_unique_id(unique_id) @@ -180,7 +183,7 @@ class FusionOperatorDB: if not is_version_supported: # 若规则库不支持当前版本, 则log警告信息 logger.warning("Unsupported versions: cann-%s and torch-%s, supported version list of ['cann', 'torch'] " - "is %s", self.cann_version, self.torch_version, self._supported_version_dict.values()) + "is %s", self.cann_version, self.profiling_version, self._supported_version_dict.values()) return is_version_supported def _is_version_supported_in_supported_version_dict(self, cann_version=None, torch_version=None): @@ -195,7 +198,7 @@ class FusionOperatorDB: for key_unique_id, supported_version in self._supported_version_dict.items(): if self._is_version_supported_in_versions(supported_version, cann_version, torch_version): return key_unique_id - return constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID + return Constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID def _is_version_supported_in_versions(self, supported_version, cann_version=None, torch_version=None): """校验当前cann版本和torch版本是否存在在规则库中的版本支持数组的元素中""" @@ -208,7 +211,7 @@ class FusionOperatorDB: torch_version_list = [torch_version_list] cann_version = cann_version or self.cann_version - torch_version = torch_version or self.torch_version + torch_version = torch_version or self.profiling_version if (cann_version in cann_version_list) and (torch_version in torch_version_list): return True @@ -216,9 +219,9 @@ class FusionOperatorDB: def _parse_db(self): """生成输出的规则库""" - self._parse(constant.ATEN) - self._parse(constant.DEQUEUE) - self._parse(constant.OPTIMIZER) + self._parse(Constant.ATEN) + 
self._parse(Constant.DEQUEUE) + self._parse(Constant.OPTIMIZER) def _parse(self, mode): """生成输出的规则库中指定部分, 如aten, Optimizer等""" @@ -252,7 +255,7 @@ class FusionOperatorDB: if not os.path.exists(file_path): logger.warning("Path: '%s' does not exist, please specific existed path of " "fusion operators yaml file by setting env '%s'", - os.path.abspath(file_path), constant.ADVISOR_RULE_PATH) + os.path.abspath(file_path), Constant.ADVISOR_RULE_PATH) self.is_empty = True return {} diff --git a/profiler/advisor/common/timeline/fusion_ops_rule.py b/profiler/msprof_analyze/advisor/common/timeline/fusion_ops_rule.py similarity index 98% rename from profiler/advisor/common/timeline/fusion_ops_rule.py rename to profiler/msprof_analyze/advisor/common/timeline/fusion_ops_rule.py index deee68edb9a92d0588f3f3c155a7b2595317a5c7..bf8a53207d403dff55d5ae7fe125bd0dd55d0913 100644 --- a/profiler/advisor/common/timeline/fusion_ops_rule.py +++ b/profiler/msprof_analyze/advisor/common/timeline/fusion_ops_rule.py @@ -2,7 +2,7 @@ import copy import logging -from profiler.advisor.utils.log import get_log_level +from msprof_analyze.advisor.utils.log import get_log_level logger = logging.getLogger() logger.setLevel(get_log_level()) diff --git a/profiler/advisor/common/timeline/fusion_ops_rule_handler.py b/profiler/msprof_analyze/advisor/common/timeline/fusion_ops_rule_handler.py similarity index 97% rename from profiler/advisor/common/timeline/fusion_ops_rule_handler.py rename to profiler/msprof_analyze/advisor/common/timeline/fusion_ops_rule_handler.py index b0558cca6d951ee057e538b5e4da6d9c2e78111b..808b095f2748fa509550d7548e20a26fb4eb2dfe 100644 --- a/profiler/advisor/common/timeline/fusion_ops_rule_handler.py +++ b/profiler/msprof_analyze/advisor/common/timeline/fusion_ops_rule_handler.py @@ -2,9 +2,9 @@ import copy import logging -from profiler.advisor.common import constant -from profiler.advisor.common.timeline.fusion_ops_rule import OpRule -from profiler.advisor.utils.log import 
get_log_level +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.timeline.fusion_ops_rule import OpRule +from msprof_analyze.advisor.utils.log import get_log_level logger = logging.getLogger() logger.setLevel(get_log_level()) @@ -189,5 +189,5 @@ class TimelineOpRuleHandler: logger.error("Advise to use a positive integer as the unique id of rules. " "Negative numbers: %s are not recommended to use as unique id. " "If specified invalid unique id: %s is used, an empty rule is returned by default.", - unique_id, constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID) + unique_id, Constant.TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID) return self._all_tmp_timeline_op_rule.get(unique_id) diff --git a/profiler/advisor/common/version_control.py b/profiler/msprof_analyze/advisor/common/version_control.py similarity index 99% rename from profiler/advisor/common/version_control.py rename to profiler/msprof_analyze/advisor/common/version_control.py index ec30b3be9d84532ff4e8829341dd2da4d3dfc49f..f04f097ca4d238f6917b2bb94dde0564a742fc84 100644 --- a/profiler/advisor/common/version_control.py +++ b/profiler/msprof_analyze/advisor/common/version_control.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- -""" +# # Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,7 +13,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" + import logging from typing import List diff --git a/profiler/advisor/computation_analysis.ipynb b/profiler/msprof_analyze/advisor/computation_analysis.ipynb similarity index 99% rename from profiler/advisor/computation_analysis.ipynb rename to profiler/msprof_analyze/advisor/computation_analysis.ipynb index 0d4aaadfadff05d1e11d4a9873ef7ce4ae2cfaa8..861cdddbc07bafa6740da1f1a00b5ba347eb7f4d 100644 --- a/profiler/advisor/computation_analysis.ipynb +++ b/profiler/msprof_analyze/advisor/computation_analysis.ipynb @@ -12,7 +12,7 @@ "\n", "from prettytable import PrettyTable, ALL\n", "from textwrap import fill\n", - "from profiler.advisor.interface.interface import Interface" + "from msprof_analyze.advisor.interface.interface import Interface" ] }, { @@ -237,7 +237,7 @@ "source": [ "from prettytable import PrettyTable, ALL\n", "from textwrap import fill\n", - "from profiler.advisor.interface.interface import Interface\n", + "from msprof_analyze.advisor.interface.interface import Interface\n", "\n", "\n", "# 配置profiling采集出来的数据,需要指定到的profiling目录是同一个工具采集的,并且需要采集l0级别以上\n", @@ -462,7 +462,7 @@ "source": [ "### AICPU问题识别\n", "AICPU问题主要为识别相关算子执行时跑到AICPU上计算,并没有利用到AI CORE的计算能力的场景,主要调优手段为修改相关代码来避免AICPU算子,可参见相关资料,来避免AICPU算子的问题:\n", - "https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md\n", + "https://gitee.com/ascend/mstt/blob/master/profiler/msprof_analyze/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md\n", "\n", "下列代码为样例,主要展示如何检测Dynamic Shape类型问题,并获取相关问题检测结果:" ] @@ -475,7 +475,7 @@ "source": [ "from prettytable import PrettyTable, ALL\n", "from textwrap import fill\n", - "from profiler.advisor.interface.interface import Interface\n", + "from msprof_analyze.advisor.interface.interface import Interface\n", "\n", "\n", "# 配置profiling采集出来的数据,需要指定到的profiling目录是同一个工具采集的,并且需要采集l0级别以上\n", diff --git a/profiler/advisor/display/__init__.py b/profiler/msprof_analyze/advisor/config/__init__.py similarity 
index 100% rename from profiler/advisor/display/__init__.py rename to profiler/msprof_analyze/advisor/config/__init__.py diff --git a/profiler/advisor/config/config.ini b/profiler/msprof_analyze/advisor/config/config.ini similarity index 70% rename from profiler/advisor/config/config.ini rename to profiler/msprof_analyze/advisor/config/config.ini index 08dd8f2d95af0b15d732450093d9acc170b237d7..211d3104e7db82b47300922ea3e5d44d0794333c 100644 --- a/profiler/advisor/config/config.ini +++ b/profiler/msprof_analyze/advisor/config/config.ini @@ -16,9 +16,9 @@ cn-north-9 = cnnorth9-modelarts-sdk cn-southwest-2 = cnsouthwest2-modelarts-sdk cn-north-7 = cnnorth7-modelarts-sdk [URL] -timeline_api_doc_url = https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20Fused%20Operator%20API%20Replacement.md -timeline_with_stack_doc_url = https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/modeldevpt/ptmigr/AImpug_0067.html -pytorch_aoe_operator_tune_url = https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/devtools/auxiliarydevtool/aoe_16_045.html -mslite_infer_aoe_operator_tune_url = https://www.mindspore.cn/lite/docs/en/master/use/cloud_infer/converter_tool_ascend.html#aoe-auto-tuning -enable_compiled_tune_url = https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/modeldevpt/ptmigr/AImpug_0059.html -ascend_profiler_url = https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/modeldevpt/ptmigr/AImpug_0067.html \ No newline at end of file +timeline_api_doc_url = https://gitee.com/ascend/mstt/blob/master/profiler/msprof_analyze/advisor/doc/Samples%20of%20Fused%20Operator%20API%20Replacement.md +timeline_with_stack_doc_url = https://www.hiascend.com/document/detail/zh/canncommercial/80RC2/devaids/auxiliarydevtool/atlasprofiling_16_0038.html +pytorch_aoe_operator_tune_url = https://www.hiascend.com/document/detail/zh/canncommercial/80RC2/devaids/auxiliarydevtool/aoe_16_043.html +mslite_infer_aoe_operator_tune_url = 
https://www.mindspore.cn/lite/docs/en/master/mindir/converter_tool_ascend.html#aoe-auto-tuning +enable_compiled_tune_url = https://www.hiascend.com/document/detail/zh/canncommercial/700/modeldevpt/ptmigr/AImpug_000060.html +ascend_profiler_url = https://www.hiascend.com/document/detail/zh/canncommercial/80RC2/devaids/auxiliarydevtool/atlasprofiling_16_0038.html \ No newline at end of file diff --git a/profiler/advisor/config/config.py b/profiler/msprof_analyze/advisor/config/config.py similarity index 84% rename from profiler/advisor/config/config.py rename to profiler/msprof_analyze/advisor/config/config.py index 2c074d6535b83b9d8e1a1a225dcfe06b6fbe9a1f..2ab735a28657caf1cf60fe357ef907c6b62e6811 100644 --- a/profiler/advisor/config/config.py +++ b/profiler/msprof_analyze/advisor/config/config.py @@ -1,13 +1,25 @@ -""" -advisor config -""" +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ import logging import os -from profiler.advisor.utils.utils import Timer -from profiler.advisor.utils.utils import singleton -from profiler.prof_common.utils import SafeConfigReader +from msprof_analyze.advisor.utils.utils import Timer +from msprof_analyze.advisor.utils.utils import singleton +from msprof_analyze.prof_common.utils import SafeConfigReader logger = logging.getLogger() @@ -46,11 +58,6 @@ class Config: os.path.join(self._work_path, f"operator_tuning_file_{Timer().strftime}.cfg")) self.log_path = None - def _normalize_path(self, file) -> str: - if not file.startswith("/"): - file = os.path.join(self._work_path, file) - return os.path.abspath(file) - @property def work_path(self) -> str: """ @@ -67,25 +74,6 @@ class Config: """ return self._root_path - def set_config(self, key, value) -> None: - """ - set config value - :param key: config key - :param value: config value - """ - setattr(self, key, value) - - def get_config(self, key) -> str: - """ - get value of config - :param key: config key - :return: config value - """ - try: - return getattr(self, key) - except AttributeError: - return "" - @property def analysis_result_file(self) -> str: """ @@ -158,14 +146,36 @@ class Config: except Exception: return "" + def set_config(self, key, value) -> None: + """ + set config value + :param key: config key + :param value: config value + """ + setattr(self, key, value) + + def get_config(self, key) -> str: + """ + get value of config + :param key: config key + :return: config value + """ + try: + return getattr(self, key) + except AttributeError: + return "" + def set_log_path(self, result_file: str, log_path: str = None): self.log_path = log_path if log_path is not None else os.path.join(self._work_path, "log") os.makedirs(self.log_path, exist_ok=True) - self.config._analysis_result_file = os.path.join(self.log_path, result_file) + self.config.set("ANALYSE", "analysis_result_file", os.path.join(self.log_path, result_file)) self._analysis_result_file = 
os.path.join(self.log_path, result_file) def remove_log(self): if self.log_path and os.path.isdir(self.log_path) and not os.listdir(self.log_path): os.rmdir(self.log_path) - + def _normalize_path(self, file) -> str: + if not file.startswith("/"): + file = os.path.join(self._work_path, file) + return os.path.abspath(file) diff --git a/profiler/advisor/config/enum_parameters.yaml b/profiler/msprof_analyze/advisor/config/enum_parameters.yaml similarity index 76% rename from profiler/advisor/config/enum_parameters.yaml rename to profiler/msprof_analyze/advisor/config/enum_parameters.yaml index b1a0548d480b8722609df40ae8e9331b5d3d34bd..678fe72b43c7f5b2fd66b3f38c3114cc9793cd50 100644 --- a/profiler/advisor/config/enum_parameters.yaml +++ b/profiler/msprof_analyze/advisor/config/enum_parameters.yaml @@ -6,7 +6,9 @@ arguments: - 7.0.RC1 - 7.0.0 - 8.0.RC1 - default: 8.0.RC1 + - 8.0.RC2 + - 8.0.0 + default: 8.0.0 torch_version: type: str @@ -14,7 +16,12 @@ arguments: - 1.11.0 - 2.1.0 default: 2.1.0 - + mindspore_version: + type: str + options: + - 2.3.0 + - 2.4.0 + default: 2.4.0 analysis_dimensions: type: list options: @@ -28,10 +35,11 @@ arguments: profiling_type: type: str options: - - ascend_pytorch_profiler + - pytorch - mslite - msprof - default: ascend_pytorch_profiler + - mindspore + default: pytorch envs: ADVISOR_ANALYZE_PROCESSES: diff --git a/profiler/advisor/config/profiling_data_version_config.yaml b/profiler/msprof_analyze/advisor/config/profiling_data_version_config.yaml similarity index 60% rename from profiler/advisor/config/profiling_data_version_config.yaml rename to profiler/msprof_analyze/advisor/config/profiling_data_version_config.yaml index b8c92fe074d3bf67a23214d18f6a2438be130314..1c82a8a6bcfbd7ae3d09cab4ca1852b875b2e64c 100644 --- a/profiler/advisor/config/profiling_data_version_config.yaml +++ b/profiler/msprof_analyze/advisor/config/profiling_data_version_config.yaml @@ -1,19 +1,43 @@ versions: + - version: 8.0.0 + dirs_pattern: + 
ASCEND_PROFILER_OUTPUT: [ op_summary, msprof ] + ^PROF_\d{6}_\d{17}_\w+$: + mindstudio_profiler_output: [ op_summary, msprof ] + class_attr: + op_summary: OpSummary + msprof: Msprof + file_attr: + msprof: [trace_view.json, '^msprof_\d{14}\.json$'] + op_summary: [ kernel_details.csv, '^op_summary_\d{14}\.csv$' ] + + - version: 8.0.RC2 + dirs_pattern: + ASCEND_PROFILER_OUTPUT: [ op_summary, msprof ] + ^PROF_\d{6}_\d{17}_\w+$: + mindstudio_profiler_output: [ op_summary, msprof ] + class_attr: + op_summary: OpSummary + msprof: Msprof + file_attr: + msprof: [trace_view.json, '^msprof_\d{14}\.json$'] + op_summary: [ kernel_details.csv, '^op_summary_\d{14}\.csv$' ] + - version: 8.0.RC1 dirs_pattern: - ASCEND_PROFILER_OUTPUT: [ op_summary ] + ASCEND_PROFILER_OUTPUT: [ op_summary, msprof ] ^PROF_\d{6}_\d{17}_\w+$: mindstudio_profiler_output: [ op_summary, msprof ] class_attr: op_summary: OpSummary msprof: Msprof file_attr: - msprof: ^msprof_\d{14}\.json$ + msprof: [trace_view.json, '^msprof_\d{14}\.json$'] op_summary: [ kernel_details.csv, '^op_summary_\d{14}\.csv$' ] - version: 7.0.0 dirs_pattern: - ASCEND_PROFILER_OUTPUT: [ op_summary ] + ASCEND_PROFILER_OUTPUT: [ op_summary, msprof ] ^PROF_\d{6}_\d{17}_\w+$: ^device_\d+$: summary: @@ -31,12 +55,12 @@ versions: file_attr: op_summary: [ kernel_details.csv, '^op_summary_\d+_\d+_\d{14}\.csv$'] task_time: ^task_time_\d+_\d+_\d{14}\.json$ - msprof: ^msprof_\d+_\d+_\d{14}\.json$ + msprof: [trace_view.json, '^msprof_\d+_\d+_\d{14}\.json$'] ge_info: ge_info.db - version: 7.0.RC1 dirs_pattern: - ASCEND_PROFILER_OUTPUT: [ op_summary ] + ASCEND_PROFILER_OUTPUT: [ op_summary, msprof ] ^PROF_\d{6}_\d{17}_\w+$: ^device_\d+$: summary: @@ -54,12 +78,12 @@ versions: file_attr: op_summary: [ kernel_details.csv, '^op_summary_\d+_\d+_\d+_\d{14}\.csv$'] task_time: ^task_time_\d+_\d+_\d+_\d{14}\.json$ - msprof: ^msprof_\d+_\d+_\d+_\d{14}\.json$ + msprof: [trace_view.json, '^msprof_\d+_\d+_\d+_\d{14}\.json$'] ge_info: ge_info.db - version: 
6.3.RC2 dirs_pattern: - ASCEND_PROFILER_OUTPUT: [ op_summary ] + ASCEND_PROFILER_OUTPUT: [ op_summary, msprof ] ^PROF_\d{6}_\d{17}_\w+$: ^device_\d+$: summary: @@ -77,5 +101,5 @@ versions: file_attr: op_summary: [ kernel_details.csv, '^op_summary_\d+_\d+\.csv$'] task_time: ^task_time_\d+_\d+\.json$ - msprof: ^msprof_\d+_\d+\.json$ + msprof: [trace_view.json, '^msprof_\d+_\d+\.json$'] ge_info: ge_info.db diff --git a/profiler/advisor/display/html/__init__.py b/profiler/msprof_analyze/advisor/dataset/__init__.py similarity index 100% rename from profiler/advisor/display/html/__init__.py rename to profiler/msprof_analyze/advisor/dataset/__init__.py diff --git a/profiler/advisor/display/html/templates/__init__.py b/profiler/msprof_analyze/advisor/dataset/ai_core_freq/__init__.py similarity index 100% rename from profiler/advisor/display/html/templates/__init__.py rename to profiler/msprof_analyze/advisor/dataset/ai_core_freq/__init__.py diff --git a/profiler/advisor/dataset/ai_core_freq/ai_core_freq_dataset.py b/profiler/msprof_analyze/advisor/dataset/ai_core_freq/ai_core_freq_dataset.py similarity index 92% rename from profiler/advisor/dataset/ai_core_freq/ai_core_freq_dataset.py rename to profiler/msprof_analyze/advisor/dataset/ai_core_freq/ai_core_freq_dataset.py index db31c1a0c5f37467c8191fdb2dc419b925ee4bd5..71b3e7410727f5c74f6e502c8c76787fcf3a4a0d 100644 --- a/profiler/advisor/dataset/ai_core_freq/ai_core_freq_dataset.py +++ b/profiler/msprof_analyze/advisor/dataset/ai_core_freq/ai_core_freq_dataset.py @@ -16,18 +16,12 @@ import json import logging import math -import os -import traceback - -import ijson -from tqdm import tqdm - -from profiler.advisor.common import constant as const -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.advisor.utils.utils import get_file_path_from_directory -from profiler.advisor.utils.utils import convert_to_float, parse_json_with_generator -from profiler.advisor.dataset.profiling.device_info import 
DeviceInfoParser -from profiler.advisor.config.config import Config + +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.advisor.utils.utils import get_file_path_from_directory +from msprof_analyze.advisor.utils.utils import convert_to_float, parse_json_with_generator +from msprof_analyze.advisor.dataset.profiling.device_info import DeviceInfoParser +from msprof_analyze.advisor.config.config import Config logger = logging.getLogger() diff --git a/profiler/advisor/interface/__init__.py b/profiler/msprof_analyze/advisor/dataset/cluster/__init__.py similarity index 100% rename from profiler/advisor/interface/__init__.py rename to profiler/msprof_analyze/advisor/dataset/cluster/__init__.py diff --git a/profiler/advisor/dataset/cluster/cluster_dataset.py b/profiler/msprof_analyze/advisor/dataset/cluster/cluster_dataset.py similarity index 81% rename from profiler/advisor/dataset/cluster/cluster_dataset.py rename to profiler/msprof_analyze/advisor/dataset/cluster/cluster_dataset.py index 66bf993a2f1f8f2798857a2389dd3468239e6a00..4489dde44621e5650f664cd8e28262f2df613c84 100644 --- a/profiler/advisor/dataset/cluster/cluster_dataset.py +++ b/profiler/msprof_analyze/advisor/dataset/cluster/cluster_dataset.py @@ -18,14 +18,13 @@ import os import re from collections import defaultdict -from profiler.advisor.dataset.dataset import Dataset -from profiler.advisor.utils.utils import singleton -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.common import constant as const -from profiler.cluster_analyse.common_func.constant import Constant -from profiler.cluster_analyse.cluster_analysis import Interface -from profiler.advisor.dataset.cluster.cluster_step_trace_time_bean import ClusterStepTraceTimeBean -from profiler.advisor.dataset.cluster.hccl_collection import HcclInfo +from msprof_analyze.advisor.dataset.dataset import Dataset +from msprof_analyze.advisor.utils.utils import singleton +from 
msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.cluster_analyse.cluster_analysis import Interface +from msprof_analyze.advisor.dataset.cluster.cluster_step_trace_time_bean import ClusterStepTraceTimeBean +from msprof_analyze.advisor.dataset.cluster.hccl_collection import HcclInfo logger = logging.getLogger() @@ -33,13 +32,13 @@ logger = logging.getLogger() class ClusterDataset(Dataset): def __init__(self, collection_path, data: dict, **kwargs) -> None: - super().__init__(collection_path, data) + super().__init__(collection_path, data, **kwargs) def is_cluster_analysis_output_exist(self): """ check whether input path is valid """ - for filename in os.listdir(self.collection_path): + for filename in os.listdir(self.output_path): if filename == 'cluster_analysis_output': logger.info("Cluster has been analyzed " "because of the existence of cluster analysis output directory.") @@ -51,8 +50,9 @@ class ClusterDataset(Dataset): if self.is_cluster_analysis_output_exist(): return parameter = { - Constant.COLLECTION_PATH: self.collection_path, - Constant.ANALYSIS_MODE: "all" + Constant.PROFILING_PATH: self.collection_path, + Constant.MODE: "all", + Constant.CLUSTER_ANALYSIS_OUTPUT_PATH: self.output_path } logger.info("cluster analysis is in the process, please wait...") try: @@ -61,7 +61,7 @@ class ClusterDataset(Dataset): raise ValueError(f"Cluster analyze backend failed:{e}") from e def load_csv_data(self, file_name, data_bean): - csv_path = os.path.join(self.collection_path, const.CLUSTER_ANALYSIS_OUTPUT, file_name) + csv_path = os.path.join(self.output_path, Constant.CLUSTER_ANALYSIS_OUTPUT, file_name) if not os.path.exists(csv_path): msg = "[ERROR] cluster_step_trace_time.csv doesn't exist, terminate analysis." 
raise RuntimeError(msg) @@ -69,7 +69,7 @@ class ClusterDataset(Dataset): return data def load_json_data(self, file_name): - json_path = os.path.join(self.collection_path, const.CLUSTER_ANALYSIS_OUTPUT, file_name) + json_path = os.path.join(self.output_path, Constant.CLUSTER_ANALYSIS_OUTPUT, file_name) if not os.path.exists(json_path): msg = "[ERROR] cluster_communication.json doesn't exist, terminate analysis." raise RuntimeError(msg) @@ -85,21 +85,21 @@ class ClusterStepTraceTimeDataset(ClusterDataset): def __init__(self, collection_path: str, data: dict, **kwargs): self._step_dict = defaultdict() self._stages = [] - super().__init__(collection_path, data) + super().__init__(collection_path, data, **kwargs) def format_data(self, step_data: list): step_dict = defaultdict(lambda: [0, 0, 0]) for step_bean in step_data: if step_bean.type == self.RANK: step_rank_record = [] - step = str(step_bean.step).replace(" ", "") or str(const.DEFAULT_STEP) + step = str(step_bean.step).replace(" ", "") or str(Constant.DEFAULT_STEP) rank = str(step_bean.index).replace(" ", "") if step: step_rank_record.append(step) if rank: step_rank_record.append(rank) - step_rank_index = const.STEP_RANK_SEP.join(step_rank_record) + step_rank_index = Constant.STEP_RANK_SEP.join(step_rank_record) step_dict[step_rank_index][0] += step_bean.compute step_dict[step_rank_index][1] += step_bean.communication step_dict[step_rank_index][2] += step_bean.free @@ -119,7 +119,7 @@ class ClusterStepTraceTimeDataset(ClusterDataset): def _parse(self): self.cluster_analyze() try: - step_data = self.load_csv_data(const.CLUSTER_STEP_TIME_CSV, ClusterStepTraceTimeBean) + step_data = self.load_csv_data(Constant.CLUSTER_STEP_TIME_CSV, ClusterStepTraceTimeBean) except RuntimeError as e: logger.error("捕获到异常:%s", e) self._step_dict = None @@ -145,7 +145,7 @@ class ClusterCommunicationDataset(ClusterDataset): def __init__(self, collection_path: str, data: dict, **kwargs): self.rank_bw_dict = 
defaultdict(self.create_rank_bw_dict) self.hccl_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(list))) - super().__init__(collection_path, data) + super().__init__(collection_path, data, **kwargs) @staticmethod def compute_ratio(dividend: float, divisor: float): @@ -155,7 +155,7 @@ class ClusterCommunicationDataset(ClusterDataset): return round(dividend / divisor, 4) def create_rank_bw_dict(self): - return{ + return { self.RDMA_TIME_MS: 0, self.RDMA_SIZE_MB: 0, self.SDMA_TIME_MS: 0, @@ -168,7 +168,7 @@ class ClusterCommunicationDataset(ClusterDataset): self.hccl_dict.setdefault(comm_group, defaultdict(lambda: defaultdict(list))) for step, step_dict in group_dict.items(): for op, op_dict in step_dict.items(): - self.compute_bandwidth(step.lower().lstrip("step") or str(const.DEFAULT_STEP), op_dict) + self.compute_bandwidth(step.lower().lstrip("step") or str(Constant.DEFAULT_STEP), op_dict) self.process_hccl_info(comm_group, step, op, op_dict) def process_hccl_info(self, group, step, op, op_dict): @@ -194,14 +194,14 @@ class ClusterCommunicationDataset(ClusterDataset): raise ValueError(msg) from e for comm_type, bw_dict in rank_dict.get(self.COMMUNICATION_BANDWIDTH_INFO, {}).items(): if comm_type == self.SDMA: - self.rank_bw_dict[f"{step}{const.STEP_RANK_SEP}{rank}"][self.SDMA_SIZE_MB] += \ + self.rank_bw_dict[f"{step}{Constant.STEP_RANK_SEP}{rank}"][self.SDMA_SIZE_MB] += \ bw_dict.get(self.TRANSIT_SIZE) - self.rank_bw_dict[f"{step}{const.STEP_RANK_SEP}{rank}"][self.SDMA_TIME_MS] += \ + self.rank_bw_dict[f"{step}{Constant.STEP_RANK_SEP}{rank}"][self.SDMA_TIME_MS] += \ bw_dict.get(self.TRANSIT_TIME) if comm_type == self.RDMA: - self.rank_bw_dict[f"{step}{const.STEP_RANK_SEP}{rank}"][self.RDMA_SIZE_MB] += \ + self.rank_bw_dict[f"{step}{Constant.STEP_RANK_SEP}{rank}"][self.RDMA_SIZE_MB] += \ bw_dict.get(self.TRANSIT_SIZE) - self.rank_bw_dict[f"{step}{const.STEP_RANK_SEP}{rank}"][self.RDMA_TIME_MS] += \ + 
self.rank_bw_dict[f"{step}{Constant.STEP_RANK_SEP}{rank}"][self.RDMA_TIME_MS] += \ bw_dict.get(self.TRANSIT_TIME) for step_rank in self.rank_bw_dict.keys(): @@ -216,7 +216,7 @@ class ClusterCommunicationDataset(ClusterDataset): def _parse(self): self.cluster_analyze() try: - communication_json = self.load_json_data(const.CLUSTER_COMM_JSON) + communication_json = self.load_json_data(Constant.CLUSTER_COMM_JSON) except RuntimeError as e: logger.error("捕获到异常:%s", e) self.rank_bw_dict = None diff --git a/profiler/advisor/dataset/cluster/cluster_step_trace_time_bean.py b/profiler/msprof_analyze/advisor/dataset/cluster/cluster_step_trace_time_bean.py similarity index 100% rename from profiler/advisor/dataset/cluster/cluster_step_trace_time_bean.py rename to profiler/msprof_analyze/advisor/dataset/cluster/cluster_step_trace_time_bean.py diff --git a/profiler/advisor/dataset/cluster/hccl_collection.py b/profiler/msprof_analyze/advisor/dataset/cluster/hccl_collection.py similarity index 100% rename from profiler/advisor/dataset/cluster/hccl_collection.py rename to profiler/msprof_analyze/advisor/dataset/cluster/hccl_collection.py diff --git a/profiler/advisor/result/__init__.py b/profiler/msprof_analyze/advisor/dataset/communication/__init__.py similarity index 100% rename from profiler/advisor/result/__init__.py rename to profiler/msprof_analyze/advisor/dataset/communication/__init__.py diff --git a/profiler/advisor/dataset/communication/communication_dataset.py b/profiler/msprof_analyze/advisor/dataset/communication/communication_dataset.py similarity index 85% rename from profiler/advisor/dataset/communication/communication_dataset.py rename to profiler/msprof_analyze/advisor/dataset/communication/communication_dataset.py index 01a72ef93044ad8f0afb5af4ee864a99c7865060..44efddccffbc4a0759f62bd5b53d732617f9f887 100644 --- a/profiler/advisor/dataset/communication/communication_dataset.py +++ b/profiler/msprof_analyze/advisor/dataset/communication/communication_dataset.py @@ 
-15,11 +15,11 @@ import logging import os from collections import defaultdict -from profiler.advisor.utils.utils import singleton -from profiler.advisor.common import constant as const -from profiler.cluster_analyse.common_func.file_manager import FileManager -from profiler.advisor.dataset.cluster.hccl_collection import HcclInfo -from profiler.advisor.utils.utils import CheckPathAccess +from msprof_analyze.advisor.utils.utils import singleton +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.advisor.dataset.cluster.hccl_collection import HcclInfo +from msprof_analyze.advisor.utils.utils import CheckPathAccess logger = logging.getLogger() @@ -27,14 +27,15 @@ logger = logging.getLogger() @singleton class CommunicationDataset: RANK = "rank" + hccl_dict = defaultdict(list) def __init__(self, collection_path, data: dict, **kwargs) -> None: self.timeline_dir = collection_path - if not self.timeline_dir.endswith("ascend_pt"): + if not self.timeline_dir.endswith("ascend_pt") and not self.timeline_dir.endswith("ascend_ms"): return self.timeline_data_list = self.get_file_path_from_directory( self.timeline_dir, - lambda file: file.endswith(const.COMMUNICATION_JSON) + lambda file: file.endswith(Constant.COMMUNICATION_JSON) ) self.hccl_dict = defaultdict(list) self.step = kwargs.get("step") @@ -67,7 +68,7 @@ class CommunicationDataset: logger.warning("Expected existed directory, but got %s", path) for root, _, files in os.walk(path): - if root.endswith("cluster_analysis_output"): + if os.path.basename(root) != "ASCEND_PROFILER_OUTPUT": continue for filename in files: filepath = os.path.join(root, filename) diff --git a/profiler/msprof_analyze/advisor/dataset/communication/hccl_detail_dataset.py b/profiler/msprof_analyze/advisor/dataset/communication/hccl_detail_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..a1d5425b5431b9dc8149b957ae8deb95a2f9295d --- 
/dev/null +++ b/profiler/msprof_analyze/advisor/dataset/communication/hccl_detail_dataset.py @@ -0,0 +1,96 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +from typing import List +from msprof_analyze.advisor.utils.utils import singleton +from msprof_analyze.advisor.common.profiling.msprof import Msprof +from msprof_analyze.advisor.dataset.profiling.info_collection import TaskInfo, HcclOp, HcclTask + +logger = logging.getLogger() + + +@singleton +class HcclDetailDataset: + RANK = "rank" + + def __init__(self, timeline_dataset: Msprof, **kwargs) -> None: + self.step = kwargs.get("step") + self._hccl_pid = -1 + self._current_hccl_op = None + self._hccl_ops: List[HcclOp] = [] + self._parse(timeline_dataset) + + @property + def hccl_ops(self): + return self._hccl_ops + + @staticmethod + def _get_hccl_pid(tasks: List[TaskInfo]): + for task in tasks: + if task.name == "process_name" and hasattr(task, "args") \ + and task.args.get("name", None) in ["Communication", "HCCL"]: + return task.pid + return -1 + + @staticmethod + def _get_tasks(timeline_dataset: Msprof): + if hasattr(timeline_dataset, 'tasks'): + return timeline_dataset.tasks + return [] + + @classmethod + def get_key(cls): + """ + get key of dataset + :return: key + """ + return cls.__module__.rsplit('.', maxsplit=1)[-1] + + def _parse(self, timeline_dataset: Msprof): + tasks = self._get_tasks(timeline_dataset) + 
self._hccl_pid = self._get_hccl_pid(tasks) + if self._hccl_pid == -1: + return + self._process(tasks) + + def _process(self, tasks: List[TaskInfo]): + task_handlers = { + "hcom": lambda sub_task: self._start_new_hccl_op(sub_task), + "Reduce": lambda sub_task: self._add_reduce_inline(sub_task), + "Memcpy": lambda sub_task: self._add_memcpy(sub_task) + } + + for task in tasks: + if task.pid == self._hccl_pid: + handler = task_handlers.get(task.name.split('_')[0]) + result = handler(task) if handler else None + if result is not None: + self._current_hccl_op = result + + if self._current_hccl_op: + self._hccl_ops.append(self._current_hccl_op) + + def _start_new_hccl_op(self, task: TaskInfo): + if self._current_hccl_op: + self._hccl_ops.append(self._current_hccl_op) + return HcclOp(task) + + def _add_reduce_inline(self, task: TaskInfo): + if self._current_hccl_op: + self._current_hccl_op.reduce_inline_tasks.append(HcclTask(task)) + + def _add_memcpy(self, task: TaskInfo): + if self._current_hccl_op: + self._current_hccl_op.memcpy_tasks.append(HcclTask(task)) diff --git a/profiler/advisor/dataset/dataset.py b/profiler/msprof_analyze/advisor/dataset/dataset.py similarity index 82% rename from profiler/advisor/dataset/dataset.py rename to profiler/msprof_analyze/advisor/dataset/dataset.py index becd3e6e88d89326b2dcdacdd58add2ee150c17b..3cc669480db99d26076fd76caabb50ac6fe59bdb 100644 --- a/profiler/advisor/dataset/dataset.py +++ b/profiler/msprof_analyze/advisor/dataset/dataset.py @@ -19,7 +19,7 @@ dataset module import logging import os -from profiler.advisor.config.config import Config +from msprof_analyze.advisor.config.config import Config logger = logging.getLogger() @@ -30,10 +30,13 @@ class Dataset: dataset base class """ - def __init__(self, collection_path, data=None) -> None: + def __init__(self, collection_path, data=None, **kwargs) -> None: if data is None: data = {} self.collection_path = os.path.abspath(os.path.join(Config().work_path, collection_path)) + 
self.output_path = kwargs.get("output_path", None) + if not self.output_path: + self.output_path = self.collection_path logger.debug("init %s with %s", self.__class__.__name__, self.collection_path) if self._parse(): key = self.get_key() @@ -42,7 +45,7 @@ class Dataset: data[key].append(self) @staticmethod - def _parse(): + def _parse(self): return None @classmethod diff --git a/profiler/advisor/dataset/environment_variable_dataset.py b/profiler/msprof_analyze/advisor/dataset/environment_variable_dataset.py similarity index 89% rename from profiler/advisor/dataset/environment_variable_dataset.py rename to profiler/msprof_analyze/advisor/dataset/environment_variable_dataset.py index 577273ffe8ae955ae8b33e1d871ef2f867aa3f71..6fa569ad9fad2149fdb35ebdf145a633f416031d 100644 --- a/profiler/advisor/dataset/environment_variable_dataset.py +++ b/profiler/msprof_analyze/advisor/dataset/environment_variable_dataset.py @@ -15,8 +15,8 @@ import os import logging -from profiler.advisor.common import constant -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.file_manager import FileManager class EnvironmentVariableDataset: @@ -29,7 +29,7 @@ class EnvironmentVariableDataset: def get_env_data_file(collection_path: str) -> str: for root, _, files in os.walk(collection_path): for file_name in files: - if file_name == constant.PROFILER_METADATA: + if file_name == Constant.PROFILER_METADATA: return os.path.join(root, file_name) return "" diff --git a/profiler/advisor/dataset/graph_dataset.py b/profiler/msprof_analyze/advisor/dataset/graph_dataset.py similarity index 87% rename from profiler/advisor/dataset/graph_dataset.py rename to profiler/msprof_analyze/advisor/dataset/graph_dataset.py index d02af46b3a71413d3871b765fb878d7b46a958d5..0e68312558de0018e3cd7285015e22662006edf2 100644 --- a/profiler/advisor/dataset/graph_dataset.py +++ 
b/profiler/msprof_analyze/advisor/dataset/graph_dataset.py @@ -16,10 +16,10 @@ import logging from typing import List -from profiler.advisor.dataset.dataset import Dataset -from profiler.advisor.common.graph.graph_parser import HostGraphParser -from profiler.advisor.common.graph.graph import Graph -from profiler.advisor.utils.utils import load_parameter, lazy_property, get_file_path_from_directory +from msprof_analyze.advisor.dataset.dataset import Dataset +from msprof_analyze.advisor.common.graph.graph_parser import HostGraphParser +from msprof_analyze.advisor.common.graph.graph import Graph +from msprof_analyze.advisor.utils.utils import load_parameter, lazy_property, get_file_path_from_directory logger = logging.getLogger() diff --git a/profiler/advisor/rules/__init__.py b/profiler/msprof_analyze/advisor/dataset/profiling/__init__.py similarity index 100% rename from profiler/advisor/rules/__init__.py rename to profiler/msprof_analyze/advisor/dataset/profiling/__init__.py diff --git a/profiler/advisor/dataset/profiling/builder_base.py b/profiler/msprof_analyze/advisor/dataset/profiling/builder_base.py similarity index 91% rename from profiler/advisor/dataset/profiling/builder_base.py rename to profiler/msprof_analyze/advisor/dataset/profiling/builder_base.py index 77bd926f72c5942e0c234e787cc59b3f6e572319..5b04399b1905fbe89a46ee95a7ee18d22b2cb524 100644 --- a/profiler/advisor/dataset/profiling/builder_base.py +++ b/profiler/msprof_analyze/advisor/dataset/profiling/builder_base.py @@ -19,8 +19,8 @@ profiling base import logging from typing import Dict, List -from profiler.advisor.dataset.profiling.profiling_parser import ProfilingParser -from profiler.advisor.utils.utils import join_prof_path +from msprof_analyze.advisor.dataset.profiling.profiling_parser import ProfilingParser +from msprof_analyze.advisor.utils.utils import join_prof_path logger = logging.getLogger() diff --git a/profiler/advisor/dataset/profiling/db_manager.py 
b/profiler/msprof_analyze/advisor/dataset/profiling/db_manager.py similarity index 100% rename from profiler/advisor/dataset/profiling/db_manager.py rename to profiler/msprof_analyze/advisor/dataset/profiling/db_manager.py diff --git a/profiler/advisor/dataset/profiling/device_info.py b/profiler/msprof_analyze/advisor/dataset/profiling/device_info.py similarity index 68% rename from profiler/advisor/dataset/profiling/device_info.py rename to profiler/msprof_analyze/advisor/dataset/profiling/device_info.py index abb0e6000c4b0a5b10517a5789f3bbb6c47a6aa6..b0226ac569512d82684071b9af5cd3d83c2202a3 100644 --- a/profiler/advisor/dataset/profiling/device_info.py +++ b/profiler/msprof_analyze/advisor/dataset/profiling/device_info.py @@ -1,12 +1,25 @@ +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ """ profiling info """ import json import logging -from profiler.advisor.config.config import Config -from profiler.advisor.utils.utils import get_file_path_from_directory -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.utils.utils import get_file_path_from_directory +from msprof_analyze.prof_common.file_manager import FileManager logger = logging.getLogger() @@ -23,19 +36,6 @@ class DeviceInfoParser: def __init__(self, path) -> None: self._path = path - def parse_data(self) -> bool: - """ - parse profiling data - :return: true for success or false - """ - file_list = get_file_path_from_directory(self._path, lambda x: x.startswith("info.json.")) - if not file_list: - return False - for info in file_list: - if self._parse(info): - return True - return False - @staticmethod def _parse(info_file: str) -> bool: if info_file.endswith("done"): @@ -61,3 +61,16 @@ class DeviceInfoParser: return True logger.error("No ai_core_num in json info file %s", info_file) return False + + def parse_data(self) -> bool: + """ + parse profiling data + :return: true for success or false + """ + file_list = get_file_path_from_directory(self._path, lambda x: x.startswith("info.json.")) + if not file_list: + return False + for info in file_list: + if self._parse(info): + return True + return False \ No newline at end of file diff --git a/profiler/advisor/dataset/profiling/info_collection.py b/profiler/msprof_analyze/advisor/dataset/profiling/info_collection.py similarity index 81% rename from profiler/advisor/dataset/profiling/info_collection.py rename to profiler/msprof_analyze/advisor/dataset/profiling/info_collection.py index a3810dd0fcb2feb59c769c6119cbed1335da6793..8540591c33a2355909bca44e9344388e7b1a0df0 100644 --- a/profiler/advisor/dataset/profiling/info_collection.py +++ b/profiler/msprof_analyze/advisor/dataset/profiling/info_collection.py @@ -18,8 +18,8 @@ profiling 
info """ import decimal import logging - -from profiler.advisor.utils.utils import lazy_property +from typing import List +from msprof_analyze.advisor.utils.utils import lazy_property logger = logging.getLogger() @@ -220,7 +220,7 @@ class TaskInfo: get pid :return: pid """ - return self._args.get("Task Type", "NA") + return self._args.get("task type", "NA") @property def start_time(self): @@ -260,7 +260,7 @@ class TaskInfo: get stream_id :return: steram id """ - return self._args.get("Stream Id", "NA") + return self._args.get("stream id", "NA") @property def task_id(self): @@ -268,7 +268,23 @@ class TaskInfo: get task id :return: task_id """ - return self._args.get("Task Id", "NA") + return self._args.get("task id", "NA") + + @property + def transport_type(self): + """ + get transport type + :return: transport_type + """ + return self._args.get("transport type", "NA") + + @property + def link_type(self): + """ + get link type + :return: link_type + """ + return self._args.get("link type", "NA") @property def args(self): @@ -284,3 +300,52 @@ class TaskInfo: get category of task """ return self._cat + + +class HcclOp: + MIN_SIZE = 512 + + def __init__(self, task: TaskInfo): + self.op_name = task.name + self.start = task.start_time + self.end = task.end_time + self.sdma_size = 0 + self.sdma_duration = 0 + self.rdma_size = 0 + self.rdma_duration = 0 + self.reduce_inline_tasks: List[HcclTask] = [] + self.memcpy_tasks: List[HcclTask] = [] + + +class HcclTask: + def __init__(self, task: TaskInfo): + self._start = task.start_time + self._end = task.end_time + self._duration = task.dur + self._size = task.args.get("size(Byte)", 0) + self._transport_type = task.transport_type + self._link_type = task.link_type + + @property + def start(self): + return self._start + + @property + def end(self): + return self._end + + @property + def duration(self): + return self._duration + + @property + def size(self): + return self._size + + @property + def transport_type(self): + return 
self._transport_type + + @property + def link_type(self): + return self._link_type diff --git a/profiler/advisor/dataset/profiling/profiling_dataset.py b/profiler/msprof_analyze/advisor/dataset/profiling/profiling_dataset.py similarity index 79% rename from profiler/advisor/dataset/profiling/profiling_dataset.py rename to profiler/msprof_analyze/advisor/dataset/profiling/profiling_dataset.py index 0db673e18871266316d2f5c3673021aee801d8ec..7981e4140f03eda4a07392f700b5432909c7f497 100644 --- a/profiler/advisor/dataset/profiling/profiling_dataset.py +++ b/profiler/msprof_analyze/advisor/dataset/profiling/profiling_dataset.py @@ -17,16 +17,16 @@ import logging import os import yaml -from profiler.advisor.common import constant -from profiler.advisor.common.profiling.ge_info import GeInfo -from profiler.advisor.common.profiling.msprof import Msprof -from profiler.advisor.common.profiling.op_summary import OpSummary -from profiler.advisor.common.profiling.tasktime import TaskTime -from profiler.advisor.common.enum_params_parser import EnumParamsParser -from profiler.advisor.dataset.dataset import Dataset -from profiler.advisor.dataset.profiling.device_info import DeviceInfoParser -from profiler.advisor.utils.utils import join_prof_path -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.profiling.ge_info import GeInfo +from msprof_analyze.advisor.common.profiling.msprof import Msprof +from msprof_analyze.advisor.common.profiling.op_summary import OpSummary +from msprof_analyze.advisor.common.profiling.tasktime import TaskTime +from msprof_analyze.advisor.common.enum_params_parser import EnumParamsParser +from msprof_analyze.advisor.dataset.dataset import Dataset +from msprof_analyze.advisor.dataset.profiling.device_info import DeviceInfoParser +from msprof_analyze.advisor.utils.utils import join_prof_path +from msprof_analyze.prof_common.file_manager 
import FileManager logger = logging.getLogger() @@ -36,15 +36,16 @@ class ProfilingDataset(Dataset): prof_type = "" def __init__(self, collection_path, data: dict, **kwargs) -> None: - self.cann_version = kwargs.get(constant.CANN_VERSION, EnumParamsParser().get_default(constant.CANN_VERSION)) - self.prof_type = kwargs.get(constant.PROFILING_TYPE, EnumParamsParser().get_default(constant.PROFILING_TYPE)) + self.cann_version = kwargs.get(Constant.CANN_VERSION, EnumParamsParser().get_default(Constant.CANN_VERSION)) + self.prof_type = kwargs.get( + Constant.PROFILING_TYPE_UNDER_LINE, EnumParamsParser().get_default(Constant.PROFILING_TYPE_UNDER_LINE)) self.patterns = self.parse_pattern() self.current_version_pattern = self.get_current_version_pattern() self._info = None super().__init__(collection_path, data) def build_from_pattern(self, dirs_pattern, current_path, depth): - if depth > constant.DEPTH_LIMIT: + if depth > Constant.DEPTH_LIMIT: logger.error("Recursion depth exceeds limit!") return depth += 1 @@ -65,7 +66,7 @@ class ProfilingDataset(Dataset): is_success = data_object.parse_data() if is_success: setattr(self, item, data_object) - else: + elif current_path: logger.info("Skip parse %s with file pattern %s from local path %s", self.current_version_pattern.get('class_attr').get(item), file_pattern_list, current_path diff --git a/profiler/advisor/dataset/profiling/profiling_parser.py b/profiler/msprof_analyze/advisor/dataset/profiling/profiling_parser.py similarity index 92% rename from profiler/advisor/dataset/profiling/profiling_parser.py rename to profiler/msprof_analyze/advisor/dataset/profiling/profiling_parser.py index 9f0f476de040ec78c3de39c82ff449b998525f41..016b914b84ede6eac2b6ddc6936066ba550290f6 100644 --- a/profiler/advisor/dataset/profiling/profiling_parser.py +++ b/profiler/msprof_analyze/advisor/dataset/profiling/profiling_parser.py @@ -14,14 +14,14 @@ # limitations under the License. 
import csv -import json import os import re +from abc import abstractmethod from typing import List, Dict -from profiler.advisor.dataset.profiling.info_collection import logger -from profiler.advisor.utils.utils import get_file_path_from_directory, SafeOpen, format_excel_title -from profiler.cluster_analyse.common_func.file_manager import FileManager +from msprof_analyze.advisor.dataset.profiling.info_collection import logger +from msprof_analyze.advisor.utils.utils import get_file_path_from_directory, SafeOpen, format_excel_title +from msprof_analyze.prof_common.file_manager import FileManager class ProfilingParser: @@ -82,8 +82,8 @@ class ProfilingParser: title_dict[idx] = title.replace(" ", "_") return title_dict - @staticmethod - def parse_from_file(file): + @abstractmethod + def parse_from_file(self, file: str): """ parse from file as a static method """ @@ -155,4 +155,4 @@ class ProfilingParser: except RuntimeError as error: logger.error("Parse json file %s failed : %s", file, error) return False - return True \ No newline at end of file + return True diff --git a/profiler/advisor/dataset/timeline_event_dataset.py b/profiler/msprof_analyze/advisor/dataset/timeline_event_dataset.py similarity index 92% rename from profiler/advisor/dataset/timeline_event_dataset.py rename to profiler/msprof_analyze/advisor/dataset/timeline_event_dataset.py index 2cc6890f91f801b63e0105892057fcf5b01ab789..418883f6ce94ce18a9dacc408c2e1b707f1970a1 100644 --- a/profiler/advisor/dataset/timeline_event_dataset.py +++ b/profiler/msprof_analyze/advisor/dataset/timeline_event_dataset.py @@ -13,7 +13,6 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-import inspect import logging import traceback from collections import OrderedDict @@ -21,10 +20,11 @@ from collections import OrderedDict import ijson from tqdm import tqdm -from profiler.advisor.common import constant as const -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.advisor.utils.utils import get_file_path_from_directory, check_path_valid, singleton, convert_to_float -from profiler.advisor.dataset.timeline_op_collector.timeline_op_collector import ( +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.advisor.utils.utils import get_file_path_from_directory, check_path_valid, singleton, \ + convert_to_float +from msprof_analyze.advisor.dataset.timeline_op_collector.timeline_op_collector import ( OpCompileCollector, SynchronizeStreamCollector, MemCollector, @@ -80,10 +80,10 @@ class BaseTimelineEventDataset: kwargs = {} if func_name == FrequencyCollector.__name__: ops_with_task_type = getattr(self, "ops_with_task_type", {}).values() - kwargs["ai_core_ops"] = [ - op for op in ops_with_task_type if - op.get(const.TASK_TYPE) in [const.AI_CORE, const.MIX_AIC] - ] + kwargs["ai_core_ops"] = [op + for op in ops_with_task_type if + op.get(Constant.TASK_TYPE) in [Constant.AI_CORE, Constant.MIX_AIC] + ] return kwargs def add_event(self, index, event): @@ -201,7 +201,7 @@ class ScheduleAnalysisDataset(BaseTimelineEventDataset): return for event in sorted(self.aten, key=lambda x: x.get("ts", -1)): - if event.name.startswith(const.ATEN): + if event.name.startswith(Constant.ATEN): if not formated_atens or not formated_atens[-1].ts_include(event): formated_atens.append(event) diff --git a/profiler/advisor/utils/__init__.py b/profiler/msprof_analyze/advisor/dataset/timeline_op_collector/__init__.py similarity index 100% rename from profiler/advisor/utils/__init__.py rename to 
profiler/msprof_analyze/advisor/dataset/timeline_op_collector/__init__.py diff --git a/profiler/advisor/dataset/timeline_op_collector/timeline_op_collector.py b/profiler/msprof_analyze/advisor/dataset/timeline_op_collector/timeline_op_collector.py similarity index 82% rename from profiler/advisor/dataset/timeline_op_collector/timeline_op_collector.py rename to profiler/msprof_analyze/advisor/dataset/timeline_op_collector/timeline_op_collector.py index 824f4bc7bdd2e46c90b4cc274995c5500cf87ac3..5def8948b52bb2bb3767c1410fa3b71619677d59 100644 --- a/profiler/advisor/dataset/timeline_op_collector/timeline_op_collector.py +++ b/profiler/msprof_analyze/advisor/dataset/timeline_op_collector/timeline_op_collector.py @@ -1,398 +1,416 @@ -import logging -import math -import os -from abc import abstractmethod, ABCMeta - -from profiler.advisor.common import constant as const -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.advisor.utils.utils import convert_to_float -from profiler.cluster_analyse.common_func.file_manager import FileManager - -logger = logging.getLogger() - - -class BaseOpCollector(metaclass=ABCMeta): - - def __init__(self): - self.attribute_to_dataset = {} - self.op_list = [] - self.require_filter_by_step = True - - @abstractmethod - def add_op(self): - """ add timeline event into self.op_list, and then will filter event in self.op_list by specific step - """ - pass - - @abstractmethod - def post_process(self): - """ convert self.op_list to required format like dict, set and so on and then record the final object into - self.attribute_to_dataset which used to set property of timeline event dataset - """ - pass - - -class StepCollector(BaseOpCollector): - KEY_WORD = "ProfilerStep" - - def __init__(self): - super().__init__() - self.require_filter_by_step = False - - def add_op(self, event): - if event.name.startswith(self.KEY_WORD): - self.op_list.append(event) - - def post_process(self, *args, **kwargs): - 
self.attribute_to_dataset["profiler_step"] = self.op_list - - -class OpCompileCollector(BaseOpCollector): - def __init__(self): - super().__init__() - self._total_op_compile_counter = 0 - self._total_op_compile_time = 0.0 - - @property - def total_time(self): - return self._total_op_compile_time - - @property - def total_count(self): - return self._total_op_compile_counter - - def is_empty(self): - return self._total_op_compile_counter == 0 - - def update(self, event: TimelineEvent): - self._total_op_compile_time += float(event.dur) - self._total_op_compile_counter += 1 - - def unset(self): - self._total_op_compile_counter = 0 - self._total_op_compile_time = 0.0 - - def add_op(self, event): - if event.name == const.OP_COMPILE_NAME or event.args.get("id") == const.OP_COMPILE_ID: - self.op_list.append(event) - - def post_process(self, target_op_list, **kwargs): - for op in target_op_list: - self.update(op) - - self.attribute_to_dataset["ops_compile"] = self - - -class SynchronizeStreamCollector(BaseOpCollector): - - def __init__(self): - super().__init__() - self.require_filter_by_step = False - - def add_op(self, event): - if event.name.startswith(const.SYNC_STREAM) or event.name.startswith(const.NODE_LAUNCH): - self.op_list.append(event) - - def post_process(self, *args, **kwargs): - self.op_list.sort(key=lambda x: x.ts) - - self.attribute_to_dataset["synchronize_stream"] = self.op_list - - -class MemCollector(BaseOpCollector): - MEMORY_OP_NAME = ["AscendCL@aclMallocMemInner", "AscendCL@aclrtFreePhysical", "AscendCL@aclrtFree"] - - def __init__(self): - super().__init__() - self.mem_op_info = {} - self.rule = self._load_rule() - - @staticmethod - def _load_rule(): - memory_rule_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), - "rules", - "memory.yaml") - - memory_rule = FileManager.read_yaml_file(memory_rule_path) - return memory_rule - - def add_op(self, event): - if event.name not in self.MEMORY_OP_NAME: - return 
- self.op_list.append(event) - - def post_process(self, target_op_list, **kwargs): - for op in target_op_list: - if op.name not in self.mem_op_info: - self.mem_op_info[op.name] = dict(count=0, total_dur=0) - self.mem_op_info[op.name]["count"] += 1 - self.mem_op_info[op.name]["total_dur"] += float(op.dur) - - self.attribute_to_dataset["memory_ops"] = self - - -class DataloaderCollector(BaseOpCollector): - key_word = "dataloader" - - def __init__(self): - super().__init__() - - def add_op(self, event): - if self.key_word in event.name.lower(): - self.op_list.append(TimelineEvent({ - "name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur, - "stack": event.args.get("Call stack") - })) - - def post_process(self, *args, **kwargs): - self.attribute_to_dataset["dataloader"] = self.op_list - - -class SyncBNCollector(BaseOpCollector): - key_word = "syncbatchnorm" - - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.name.lower() == self.key_word: - self.op_list.append(TimelineEvent({ - "name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur - })) - - def post_process(self, target_op_list, **kwargs): - self.attribute_to_dataset["sync_batchnorm"] = target_op_list - - -class AtenCollector(BaseOpCollector): - - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.name.lower().startswith(f"{const.ATEN}{const.ATEN_SEP}") or event.name.lower().startswith( - f"{const.NPU}{const.ATEN_SEP}"): - self._add_aten(event) - return - - # 检查cann层同步操作,根据时间窗口索引到host侧的aten算子并给出堆栈 - if event.name.startswith(const.SYNC_STREAM): - self._add_aten(event) - - def post_process(self, target_op_list, **kwargs): - self.attribute_to_dataset["aten"] = target_op_list - - def _add_aten(self, event: TimelineEvent): - self.op_list.append(TimelineEvent({ - "name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur - })) - - -class 
OptimizerCollector(BaseOpCollector): - - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.name.startswith(f"{const.OPTIMIZER}.{const.OPTIMIZER_STEP}{const.OPTIMIZER_SEP}"): - self.op_list.append(TimelineEvent( - {"name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur})) - - def post_process(self, target_op_list, **kwargs): - self.attribute_to_dataset["optimizer"] = target_op_list - - -class FrequencyCollector(BaseOpCollector): - KEY_WORD = "AI Core Freq" - - def __init__(self): - super().__init__() - self._previous_freq_index = -1 - - @staticmethod - def get_op_frequency(ai_core_ops, ai_core_freq): - ai_core_freq.sort(key=lambda x: float(x.ts)) - op_freq_record = {} - - op_index, freq_index = 0, 0 - while op_index < len(ai_core_ops) and freq_index < len(ai_core_freq): - op_event = ai_core_ops[op_index] - op_end_time = convert_to_float(op_event.ts) + convert_to_float(op_event.dur) - op_freq_list = [] - while freq_index < len(ai_core_freq): - freq_event = ai_core_freq[freq_index] - if convert_to_float(freq_event.end) < op_end_time: - op_freq_list.append(convert_to_float(freq_event.args.MHz)) - freq_index += 1 - continue - elif convert_to_float(freq_event.ts) < op_end_time: - if op_event.name not in op_freq_record: - op_freq_record[op_event.name] = {"count": 0, "dur": 0, "freq_list": []} - op_freq_record[op_event.name]["count"] += 1 - op_freq_record[op_event.name]["dur"] += convert_to_float(op_event.dur) - op_freq_list.append(convert_to_float(freq_event.args.MHz)) - op_freq_record[op_event.name]["freq_list"].append(min(op_freq_list)) - break - else: - break - - op_index += 1 - return op_freq_record - - def add_op(self, event): - if event.name == self.KEY_WORD: - if self._previous_freq_index != -1: - self.op_list[self._previous_freq_index]["end"] = event.get("ts", float(math.inf)) - self._previous_freq_index += 1 - event.setdefault("end", float(math.inf)) - self.op_list.append(event) - - def 
post_process(self, target_op_list, **kwargs): - ai_core_ops = kwargs.get("ai_core_ops", []) - if not ai_core_ops: - return - ai_core_ops.sort(key=lambda x: float(x.ts)) - op_freq = FrequencyCollector.get_op_frequency(ai_core_ops, target_op_list) - self.attribute_to_dataset["op_freq"] = op_freq - - -class SpecificTaskTypeOpCollector(BaseOpCollector): - - def __init__(self, op_type_list=None): - super().__init__() - self.op_type_list = op_type_list if op_type_list else [const.AI_CPU, const.AI_CORE, const.MIX_AIC] - - def add_op(self, event): - if event.args.get(const.TASK_TYPE) and event.args.get(const.TASK_TYPE) in self.op_type_list: - self.op_list.append( - TimelineEvent( - { - const.TASK_TYPE: event.args.get(const.TASK_TYPE), - "task_id": event.args.get("Task Id"), - "tid": event.tid, - "name": event.name, - "ts": str(event.ts), - "dur": str(event.dur) - } - ) - ) - - def post_process(self, target_op_list, **kwargs): - op_map = dict() - for op in target_op_list: - key = f"{op.name}-{op.ts}" - op_map[key] = op - - self.attribute_to_dataset["ops_with_task_type"] = op_map - self.attribute_to_dataset["task_op_names"] = list( - set([event_key.split("-")[0] for event_key in op_map.keys()])) - - -class TorchToNpuCollector(BaseOpCollector): - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.name.lower() == const.TORCH_TO_NPU: - self.op_list.append(TimelineEvent({"tid": event.tid, "ts": str(event.ts), "ph": event.ph, "id": event.id})) - - def post_process(self, target_op_list, **kwargs): - op_map = dict() - for op in target_op_list: - key = f"{op.ph}-{op.id}" - op_map[key] = op - - self.attribute_to_dataset["torch_to_npu"] = op_map - - -class AclToNpuCollector(BaseOpCollector): - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.name and event.ts and event.name == const.ACL_TO_NPU: - self.op_list.append(TimelineEvent({"ts": event.ts})) - - def post_process(self, target_op_list, **kwargs): - op_record = 
set(str(op.ts) for op in target_op_list) - self.attribute_to_dataset["acl_to_npu"] = op_record - - -class OpStackCollector(BaseOpCollector): - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.args.get(const.CALL_STACKS): - self.op_list.append( - TimelineEvent({"name": event.name, "dataset_index": event.dataset_index, "ts": event.ts})) - - def post_process(self, target_op_list, **kwargs): - op_map = dict() - for op in target_op_list: - op_map[str(op.ts)] = op - - self.attribute_to_dataset["ops_with_stack"] = op_map - - -class GcCollector(BaseOpCollector): - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.cat and isinstance(event.cat, str) and event.cat.lower() == "gc": - self.op_list.append(TimelineEvent( - {"name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur})) - - def post_process(self, target_op_list, **kwargs): - self.attribute_to_dataset["gc_events"] = self.op_list - - -class FreeEventsCollector(BaseOpCollector): - def __init__(self): - super().__init__() - - @staticmethod - def _load_rule(): - sync_stream_rule_path = os.path.join( - os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), - "rules", - "gc.yaml") - - gc_rule = FileManager.read_yaml_file(sync_stream_rule_path) - return gc_rule - - def add_op(self, event): - if event.name.lower() == const.FREE: - self.op_list.append(event) - - def post_process(self, target_op_list, **kwargs): - gc_rule = self._load_rule() - if os.getenv(const.FREE_DURATION_FOR_GC_ANALYSIS): - max_free_threshold = convert_to_float(os.getenv(const.FREE_DURATION_FOR_GC_ANALYSIS)) - else: - max_free_threshold = gc_rule.get("max_free_threshold") - - large_free_events = [] - - for op in target_op_list: - if convert_to_float(op.dur) > max_free_threshold: - large_free_events.append(op) - - large_free_events.sort(key=lambda x: convert_to_float(x.ts)) - self.attribute_to_dataset["large_free_events"] = 
large_free_events - - -class AclEventsCollector(BaseOpCollector): - ACL_EVENT_PREFIX = "AscendCL@" - - def __init__(self): - super().__init__() - - def add_op(self, event): - if event.name.startswith(self.ACL_EVENT_PREFIX): - self.op_list.append(event) - - def post_process(self, target_op_list, **kwargs): - target_op_list.sort(key=lambda x: convert_to_float(x.ts)) - self.attribute_to_dataset["acl_events"] = target_op_list +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+import logging
+import math
+import os
+from abc import abstractmethod, ABCMeta
+
+from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager
+from msprof_analyze.prof_common.constant import Constant
+from msprof_analyze.advisor.common.timeline.event import TimelineEvent
+from msprof_analyze.advisor.utils.utils import convert_to_float
+from msprof_analyze.prof_common.file_manager import FileManager
+
+logger = logging.getLogger()
+
+
+class BaseOpCollector(metaclass=ABCMeta):
+
+    def __init__(self):
+        self.attribute_to_dataset = {}
+        self.op_list = []
+        self.require_filter_by_step = True
+
+    @abstractmethod
+    def add_op(self, event):
+        """ add a timeline event into self.op_list; events in self.op_list will later be filtered by the target step
+        """
+        pass
+
+    @abstractmethod
+    def post_process(self, target_op_list, **kwargs):
+        """ convert self.op_list into the required format (e.g. dict or set) and record the final object into
+        self.attribute_to_dataset, which is used to set properties of the timeline event dataset
+        """
+        pass
+
+
+class StepCollector(BaseOpCollector):
+    KEY_WORD = "ProfilerStep"
+
+    def __init__(self):
+        super().__init__()
+        self.require_filter_by_step = False
+
+    def add_op(self, event):
+        if event.name.startswith(self.KEY_WORD):
+            self.op_list.append(event)
+
+    def post_process(self, *args, **kwargs):
+        self.attribute_to_dataset["profiler_step"] = self.op_list
+
+
+class OpCompileCollector(BaseOpCollector):
+    def __init__(self):
+        super().__init__()
+        self._total_op_compile_counter = 0
+        self._total_op_compile_time = 0.0
+
+    @property
+    def total_time(self):
+        return self._total_op_compile_time
+
+    @property
+    def total_count(self):
+        return self._total_op_compile_counter
+
+    def is_empty(self):
+        return self._total_op_compile_counter == 0
+
+    def update(self, event: TimelineEvent):
+        self._total_op_compile_time += float(event.dur)
+        self._total_op_compile_counter += 1
+
+    def unset(self):
+        self._total_op_compile_counter = 0
+ self._total_op_compile_time = 0.0 + + def add_op(self, event): + if event.name == Constant.OP_COMPILE_NAME or event.args.get("id") == Constant.OP_COMPILE_ID: + self.op_list.append(event) + + def post_process(self, target_op_list, **kwargs): + for op in target_op_list: + self.update(op) + + self.attribute_to_dataset["ops_compile"] = self + + +class SynchronizeStreamCollector(BaseOpCollector): + + def __init__(self): + super().__init__() + self.require_filter_by_step = False + + def add_op(self, event): + if event.name.startswith(Constant.SYNC_STREAM) or event.name.startswith(Constant.NODE_LAUNCH): + self.op_list.append(event) + + def post_process(self, *args, **kwargs): + self.op_list.sort(key=lambda x: x.ts) + + self.attribute_to_dataset["synchronize_stream"] = self.op_list + + +class MemCollector(BaseOpCollector): + MEMORY_OP_NAME = ["AscendCL@aclMallocMemInner", "AscendCL@aclrtFreePhysical", "AscendCL@aclrtFree"] + + def __init__(self): + super().__init__() + self.mem_op_info = {} + self.rule = self._load_rule() + + @staticmethod + def _load_rule(): + language = AdditionalArgsManager().language + memory_rule_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), + "rules", + language, + "memory.yaml") + + memory_rule = FileManager.read_yaml_file(memory_rule_path) + return memory_rule + + def add_op(self, event): + if event.name not in self.MEMORY_OP_NAME: + return + self.op_list.append(event) + + def post_process(self, target_op_list, **kwargs): + for op in target_op_list: + if op.name not in self.mem_op_info: + self.mem_op_info[op.name] = dict(count=0, total_dur=0) + self.mem_op_info[op.name]["count"] += 1 + self.mem_op_info[op.name]["total_dur"] += float(op.dur) + + self.attribute_to_dataset["memory_ops"] = self + + +class DataloaderCollector(BaseOpCollector): + key_word = "dataloader" + + def __init__(self): + super().__init__() + + def add_op(self, event): + if self.key_word in event.name.lower(): + 
self.op_list.append(TimelineEvent({
+                "name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur,
+                "stack": event.args.get("Call stack")
+            }))
+
+    def post_process(self, *args, **kwargs):
+        self.attribute_to_dataset["dataloader"] = self.op_list
+
+
+class SyncBNCollector(BaseOpCollector):
+    key_word = "syncbatchnorm"
+
+    def __init__(self):
+        super().__init__()
+
+    def add_op(self, event):
+        if event.name.lower() == self.key_word:
+            self.op_list.append(TimelineEvent({
+                "name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur
+            }))
+
+    def post_process(self, target_op_list, **kwargs):
+        self.attribute_to_dataset["sync_batchnorm"] = target_op_list
+
+
+class AtenCollector(BaseOpCollector):
+
+    def __init__(self):
+        super().__init__()
+
+    def add_op(self, event):
+        if event.name.lower().startswith(f"{Constant.ATEN}{Constant.ATEN_SEP}") or event.name.lower().startswith(
+                f"{Constant.NPU_LOWER}{Constant.ATEN_SEP}"):
+            self._add_aten(event)
+            return
+
+        # check CANN-layer synchronization ops: use the time window to index back to the host-side aten operators and report their stacks
+        if event.name.startswith(Constant.SYNC_STREAM):
+            self._add_aten(event)
+
+    def post_process(self, target_op_list, **kwargs):
+        self.attribute_to_dataset["aten"] = target_op_list
+
+    def _add_aten(self, event: TimelineEvent):
+        self.op_list.append(TimelineEvent({
+            "name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur
+        }))
+
+
+class OptimizerCollector(BaseOpCollector):
+
+    def __init__(self):
+        super().__init__()
+
+    def add_op(self, event):
+        if event.name.startswith(f"{Constant.OPTIMIZER}.{Constant.OPTIMIZER_STEP}{Constant.OPTIMIZER_SEP}"):
+            self.op_list.append(TimelineEvent(
+                {"name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur}))
+
+    def post_process(self, target_op_list, **kwargs):
+        self.attribute_to_dataset["optimizer"] = target_op_list
+
+
+class FrequencyCollector(BaseOpCollector):
+    KEY_WORD = "AI Core Freq"
+
+    def
__init__(self): + super().__init__() + self._previous_freq_index = -1 + + @staticmethod + def get_op_frequency(ai_core_ops, ai_core_freq): + ai_core_freq.sort(key=lambda x: float(x.ts)) + op_freq_record = {} + + op_index, freq_index = 0, 0 + while op_index < len(ai_core_ops) and freq_index < len(ai_core_freq): + op_event = ai_core_ops[op_index] + op_end_time = convert_to_float(op_event.ts) + convert_to_float(op_event.dur) + op_freq_list = [] + while freq_index < len(ai_core_freq): + freq_event = ai_core_freq[freq_index] + if convert_to_float(freq_event.end) < op_end_time: + op_freq_list.append(convert_to_float(freq_event.args.MHz)) + freq_index += 1 + continue + elif convert_to_float(freq_event.ts) < op_end_time: + if op_event.name not in op_freq_record: + op_freq_record[op_event.name] = {"count": 0, "dur": 0, "freq_list": []} + op_freq_record[op_event.name]["count"] += 1 + op_freq_record[op_event.name]["dur"] += convert_to_float(op_event.dur) + op_freq_list.append(convert_to_float(freq_event.args.MHz)) + op_freq_record[op_event.name]["freq_list"].append(min(op_freq_list)) + break + else: + break + + op_index += 1 + return op_freq_record + + def add_op(self, event): + if event.name == self.KEY_WORD: + if self._previous_freq_index != -1: + self.op_list[self._previous_freq_index]["end"] = event.get("ts", float(math.inf)) + self._previous_freq_index += 1 + event.setdefault("end", float(math.inf)) + self.op_list.append(event) + + def post_process(self, target_op_list, **kwargs): + ai_core_ops = kwargs.get("ai_core_ops", []) + if not ai_core_ops: + return + ai_core_ops.sort(key=lambda x: float(x.ts)) + op_freq = FrequencyCollector.get_op_frequency(ai_core_ops, target_op_list) + self.attribute_to_dataset["op_freq"] = op_freq + + +class SpecificTaskTypeOpCollector(BaseOpCollector): + + def __init__(self, op_type_list=None): + super().__init__() + self.op_type_list = op_type_list if op_type_list else [Constant.AI_CPU, Constant.AI_CORE, Constant.MIX_AIC] + + def 
add_op(self, event): + if event.args.get(Constant.TASK_TYPE) and event.args.get(Constant.TASK_TYPE) in self.op_type_list: + self.op_list.append( + TimelineEvent( + { + Constant.TASK_TYPE: event.args.get(Constant.TASK_TYPE), + "task_id": event.args.get("Task Id"), + "tid": event.tid, + "name": event.name, + "ts": str(event.ts), + "dur": str(event.dur) + } + ) + ) + + def post_process(self, target_op_list, **kwargs): + op_map = dict() + for op in target_op_list: + key = f"{op.name}-{op.ts}" + op_map[key] = op + + self.attribute_to_dataset["ops_with_task_type"] = op_map + self.attribute_to_dataset["task_op_names"] = list( + set([event_key.split("-")[0] for event_key in op_map.keys()])) + + +class TorchToNpuCollector(BaseOpCollector): + def __init__(self): + super().__init__() + + def add_op(self, event): + if event.name.lower() == Constant.TORCH_TO_NPU: + self.op_list.append(TimelineEvent({"tid": event.tid, "ts": str(event.ts), "ph": event.ph, "id": event.id})) + + def post_process(self, target_op_list, **kwargs): + op_map = dict() + for op in target_op_list: + key = f"{op.ph}-{op.id}" + op_map[key] = op + + self.attribute_to_dataset["torch_to_npu"] = op_map + + +class AclToNpuCollector(BaseOpCollector): + def __init__(self): + super().__init__() + + def add_op(self, event): + if event.name and event.ts and event.name == Constant.ACL_TO_NPU: + self.op_list.append(TimelineEvent({"ts": event.ts})) + + def post_process(self, target_op_list, **kwargs): + op_record = set(str(op.ts) for op in target_op_list) + self.attribute_to_dataset["acl_to_npu"] = op_record + + +class OpStackCollector(BaseOpCollector): + def __init__(self): + super().__init__() + + def add_op(self, event): + if event.args.get(Constant.CALL_STACKS): + self.op_list.append( + TimelineEvent({"name": event.name, "dataset_index": event.dataset_index, "ts": event.ts})) + + def post_process(self, target_op_list, **kwargs): + op_map = dict() + for op in target_op_list: + op_map[str(op.ts)] = op + + 
self.attribute_to_dataset["ops_with_stack"] = op_map + + +class GcCollector(BaseOpCollector): + def __init__(self): + super().__init__() + + def add_op(self, event): + if event.cat and isinstance(event.cat, str) and event.cat.lower() == "gc": + self.op_list.append(TimelineEvent( + {"name": event.name, "dataset_index": event.dataset_index, "ts": event.ts, "dur": event.dur})) + + def post_process(self, target_op_list, **kwargs): + self.attribute_to_dataset["gc_events"] = self.op_list + + +class FreeEventsCollector(BaseOpCollector): + def __init__(self): + super().__init__() + + @staticmethod + def _load_rule(): + language = AdditionalArgsManager().language + sync_stream_rule_path = os.path.join( + os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))), + "rules", + language, + "conjectured_gc.yaml") + + gc_rule = FileManager.read_yaml_file(sync_stream_rule_path) + return gc_rule + + def add_op(self, event): + if event.name.lower() == Constant.FREE: + self.op_list.append(event) + + def post_process(self, target_op_list, **kwargs): + gc_rule = self._load_rule() + if os.getenv(Constant.FREE_DURATION_FOR_GC_ANALYSIS): + max_free_threshold = convert_to_float(os.getenv(Constant.FREE_DURATION_FOR_GC_ANALYSIS)) + else: + max_free_threshold = gc_rule.get("max_free_threshold") + + large_free_events = [] + + for op in target_op_list: + if convert_to_float(op.dur) > max_free_threshold: + large_free_events.append(op) + + large_free_events.sort(key=lambda x: convert_to_float(x.ts)) + self.attribute_to_dataset["large_free_events"] = large_free_events + + +class AclEventsCollector(BaseOpCollector): + ACL_EVENT_PREFIX = "AscendCL@" + + def __init__(self): + super().__init__() + + def add_op(self, event): + if event.name.startswith(self.ACL_EVENT_PREFIX): + self.op_list.append(event) + + def post_process(self, target_op_list, **kwargs): + target_op_list.sort(key=lambda x: convert_to_float(x.ts)) + self.attribute_to_dataset["acl_events"] = target_op_list diff 
--git a/profiler/cli/__init__.py b/profiler/msprof_analyze/advisor/display/__init__.py similarity index 100% rename from profiler/cli/__init__.py rename to profiler/msprof_analyze/advisor/display/__init__.py diff --git a/profiler/cluster_analyse/cluster_kernels_analysis/__init__.py b/profiler/msprof_analyze/advisor/display/html/__init__.py similarity index 100% rename from profiler/cluster_analyse/cluster_kernels_analysis/__init__.py rename to profiler/msprof_analyze/advisor/display/html/__init__.py diff --git a/profiler/advisor/display/html/priority_background_color.py b/profiler/msprof_analyze/advisor/display/html/priority_background_color.py similarity index 100% rename from profiler/advisor/display/html/priority_background_color.py rename to profiler/msprof_analyze/advisor/display/html/priority_background_color.py diff --git a/profiler/advisor/display/html/render.py b/profiler/msprof_analyze/advisor/display/html/render.py similarity index 92% rename from profiler/advisor/display/html/render.py rename to profiler/msprof_analyze/advisor/display/html/render.py index d20df9a7601f94c6a276abc2f17dff5c717ebbf3..09211220677a0d7d50e8af52a27f068a85d0859b 100644 --- a/profiler/advisor/display/html/render.py +++ b/profiler/msprof_analyze/advisor/display/html/render.py @@ -19,10 +19,9 @@ from typing import List, Dict from collections import defaultdict, OrderedDict from jinja2 import Environment, FileSystemLoader -from profiler.advisor.common import constant - -from profiler.advisor.config.config import Config -from profiler.advisor.utils.utils import singleton, safe_write +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.config.config import Config +from msprof_analyze.advisor.utils.utils import singleton, safe_write logger = logging.getLogger() @@ -40,7 +39,7 @@ class HTMLRender: self.render_list = defaultdict(list) def render_html(self, template_dir: str = "templates", template_name: str = "main.html", - 
template_header=constant.DEFAULT_TEMPLATE_HEADER): + template_header=Constant.DEFAULT_TEMPLATE_HEADER): # 确保overall 和 comparison 在 performance problem analysis 之前 sorted_render_htmls = OrderedDict() @@ -98,5 +97,5 @@ class HTMLRender: "but got %s.", os.path.basename(save_path)) return - safe_write(self.html, save_path) + safe_write(self.html, save_path, encoding="UTF-8") logger.info("Save suggestion to %s.", save_path) diff --git a/profiler/cluster_analyse/cluster_utils/__init__.py b/profiler/msprof_analyze/advisor/display/html/templates/__init__.py similarity index 100% rename from profiler/cluster_analyse/cluster_utils/__init__.py rename to profiler/msprof_analyze/advisor/display/html/templates/__init__.py diff --git a/profiler/advisor/display/html/templates/affinity_api.html b/profiler/msprof_analyze/advisor/display/html/templates/affinity_api.html similarity index 94% rename from profiler/advisor/display/html/templates/affinity_api.html rename to profiler/msprof_analyze/advisor/display/html/templates/affinity_api.html index 7cd3d7ad33d0220c7aba055721eddf049161a0d8..b227afae9624cadb5ce753c80f1e8719040eacb4 100644 --- a/profiler/advisor/display/html/templates/affinity_api.html +++ b/profiler/msprof_analyze/advisor/display/html/templates/affinity_api.html @@ -8,14 +8,14 @@ The analysis results of following affinity APIs are based on runtime env cann-{{ cann_version }} and - torch-{{ torch_version }} + {{profiling_type}}-{{ profiling_type }}
{% if empty_stacks %} Suggestion: These APIs have no code stack. If parameter 'with_stack=False' was set while profiling, please refer to - Ascend PyTorch Profiler to set + Ascend Profiler to set 'with_stack=True'. Otherwise, ignore following affinity APIs due to backward broadcast lack of stack. {% endif %} diff --git a/profiler/advisor/display/html/templates/ai_core_frequency.html b/profiler/msprof_analyze/advisor/display/html/templates/ai_core_frequency.html similarity index 100% rename from profiler/advisor/display/html/templates/ai_core_frequency.html rename to profiler/msprof_analyze/advisor/display/html/templates/ai_core_frequency.html diff --git a/profiler/msprof_analyze/advisor/display/html/templates/ai_core_performance.html b/profiler/msprof_analyze/advisor/display/html/templates/ai_core_performance.html new file mode 100644 index 0000000000000000000000000000000000000000..77e5e0cb55200efdf5b854e03ac2844ddc631a8f --- /dev/null +++ b/profiler/msprof_analyze/advisor/display/html/templates/ai_core_performance.html @@ -0,0 +1,159 @@ +{% if format_result|length > 0 %} +
+

AI CORE Performance Analysis

+
+ {% if language == "cn" %} + {% set title_ns = namespace(type='类别', desc='描述及建议', opti_set='性能优化算子集合', bound_set='bound算子集合', affinity_set='不亲和算子集合', + opti_refer=' 参考性能优化空间: ', bound_refer=' bound类型为: ', affinity_refer=' 不亲和类型为: ', title_desc='算子相关分析,参考如下: ') %} + {% else %} + {% set title_ns = namespace(type='Type', desc='Description and Suggestion', opti_set='set of performance optimization operators', + bound_set='set of bound operators', affinity_set='set of unaffine operators', opti_refer=' refer to Performance Optimization Space: ', + bound_refer=' bound type: ', affinity_refer=' type of disaffinity: ', title_desc=' Operator related analysis, referenced below: ') %} + {% endif %} + {% if format_result.cube[0]|length + format_result.cube[1]|length + format_result.cube[2]|length > 0 %} + MatMul{{ title_ns.title_desc }} +
+
+ {content}
+ + + + + {% set opti_ns = namespace(total_opti='') %} + {% for opti in format_result.cube[0] %} + {% if not loop.first %} + {% set opti_ns.total_opti = opti_ns.total_opti ~ "
" ~ opti.op_name ~ " operator shape: " ~ opti.shape ~ " dtype: " ~ opti.dtype ~ title_ns.opti_refer ~ opti.optimization ~ "%" %} + {% else %} + {% set opti_ns.total_opti = opti.op_name ~ " operator shape: " ~ opti.shape ~ " dtype: " ~ opti.dtype ~ title_ns.opti_refer ~ opti.optimization ~ "%" %} + {% endif %} + {% endfor %} + {% if opti_ns.total_opti|length > 0 %} + + + + + {% endif %} + {% set bound_ns = namespace(total_bound='') %} + {% for bound in format_result.cube[1] %} + {% if not loop.first %} + {% set bound_ns.total_bound = bound_ns.total_bound ~ "
" ~ bound.op_name ~ " operator shape: " ~ bound.shape ~ " dtype: " ~ bound.dtype ~ title_ns.bound_refer ~ bound.bound %} + {% else %} + {% set bound_ns.total_bound = bound.op_name ~ " operator shape: " ~ bound.shape ~ " dtype: " ~ bound.dtype ~ title_ns.bound_refer ~ bound.bound %} + {% endif %} + {% endfor %} + {% if bound_ns.total_bound|length > 0 %} + + + + + {% endif %} + {% set affinity_ns = namespace(total_affinity='') %} + {% for affinity in format_result.cube[2] %} + {% if not loop.first %} + {% set affinity_ns.total_affinity = affinity_ns.total_affinity ~ "
" ~ affinity.op_name ~ " operator shape: " ~ affinity.shape ~ " dtype: " ~ affinity.dtype ~ title_ns.affinity_refer ~ affinity.suggestion %} + {% else %} + {% set affinity_ns.total_affinity = affinity.op_name ~ " operator shape: " ~ affinity.shape ~ " dtype: " ~ affinity.dtype ~ title_ns.affinity_refer ~ affinity.suggestion %} + {% endif %} + {% endfor %} + {% if affinity_ns.total_affinity|length > 0 %} + + + + + {% endif %} +
{{ title_ns.type }}{{ title_ns.desc }}
{{ title_ns.opti_set }}{{ opti_ns.total_opti | safe }}
{{ title_ns.bound_set }}{{ bound_ns.total_bound | safe }}
{{ title_ns.affinity_set }}{{ affinity_ns.total_affinity | safe }}
+ {% endif %} + + {% if format_result.fa[0]|length + format_result.fa[1]|length + format_result.fa[2]|length > 0 %} + FA{{ title_ns.title_desc }} +
+ + + + + + {% set opti_ns = namespace(total_opti='') %} + {% for opti in format_result.fa[0] %} + {% if not loop.first %} + {% set opti_ns.total_opti = opti_ns.total_opti ~ "
" ~ opti.op_name ~ " operator shape: " ~ opti.shape ~ " dtype: " ~ opti.dtype ~ title_ns.opti_refer ~ opti.optimization ~ "%" %} + {% else %} + {% set opti_ns.total_opti = opti.op_name ~ " operator shape: " ~ opti.shape ~ " dtype: " ~ opti.dtype ~ title_ns.opti_refer ~ opti.optimization ~ "%" %} + {% endif %} + {% endfor %} + {% if opti_ns.total_opti|length > 0 %} + + + + + {% endif %} + {% set bound_ns = namespace(total_bound='') %} + {% for bound in format_result.fa[1] %} + {% if not loop.first %} + {% set bound_ns.total_bound = bound_ns.total_bound ~ "
" ~ bound.op_name ~ " operator shape: " ~ bound.shape ~ " dtype: " ~ bound.dtype ~ title_ns.bound_refer ~ bound.bound %} + {% else %} + {% set bound_ns.total_bound = bound.op_name ~ " operator shape: " ~ bound.shape ~ " dtype: " ~ bound.dtype ~ title_ns.bound_refer ~ bound.bound %} + {% endif %} + {% endfor %} + {% if bound_ns.total_bound|length > 0 %} + + + + + {% endif %} + {% set affinity_ns = namespace(total_affinity='') %} + {% for affinity in format_result.fa[2] %} + {% if not loop.first %} + {% set affinity_ns.total_affinity = affinity_ns.total_affinity ~ "
" ~ affinity.op_name ~ " operator shape: " ~ affinity.shape ~ " dtype: " ~ affinity.dtype ~ title_ns.affinity_refer ~ affinity.suggestion %} + {% else %} + {% set affinity_ns.total_affinity = affinity.op_name ~ " operator shape: " ~ affinity.shape ~ " dtype: " ~ affinity.dtype ~ title_ns.affinity_refer ~ affinity.suggestion %} + {% endif %} + {% endfor %} + {% if affinity_ns.total_affinity|length > 0 %} + + + + + {% endif %} +
{{ title_ns.type }}{{ title_ns.desc }}
{{ title_ns.opti_set }}{{ opti_ns.total_opti | safe }}
{{ title_ns.bound_set }}{{ bound_ns.total_bound | safe }}
{{ title_ns.affinity_set }}{{ affinity_ns.total_affinity | safe }}
+ {% endif %} + + {% if format_result.vector[0]|length + format_result.vector[1]|length > 0 %} + Vector{{ title_ns.title_desc }} +
+ + + + + + {% set opti_ns = namespace(total_opti='') %} + {% for opti in format_result.vector[0] %} + {% if not loop.first %} + {% set opti_ns.total_opti = opti_ns.total_opti ~ "
" ~ opti.op_name ~ " operator shape: " ~ opti.shape ~ " dtype: " ~ opti.dtype ~ title_ns.opti_refer ~ opti.optimization ~ "%" %} + {% else %} + {% set opti_ns.total_opti = opti.op_name ~ " operator shape: " ~ opti.shape ~ " dtype: " ~ opti.dtype ~ title_ns.opti_refer ~ opti.optimization ~ "%" %} + {% endif %} + {% endfor %} + {% if opti_ns.total_opti|length > 0 %} + + + + + {% endif %} + {% set bound_ns = namespace(total_bound='') %} + {% for bound in format_result.vector[1] %} + {% if not loop.first %} + {% set bound_ns.total_bound = bound_ns.total_bound ~ "
" ~ bound.op_name ~ " operator shape: " ~ bound.shape ~ " dtype: " ~ bound.dtype ~ title_ns.bound_refer ~ bound.bound %} + {% else %} + {% set bound_ns.total_bound = bound.op_name ~ " operator shape: " ~ bound.shape ~ " dtype: " ~ bound.dtype ~ title_ns.bound_refer ~ bound.bound %} + {% endif %} + {% endfor %} + {% if bound_ns.total_bound|length > 0 %} + + + + + {% endif %} +
{{ title_ns.type }}{{ title_ns.desc }}
{{ title_ns.opti_set }}{{ opti_ns.total_opti | safe }}
{{ title_ns.bound_set }}{{ bound_ns.total_bound | safe }}
+ {% endif %} +
+ +{% endif %} \ No newline at end of file diff --git a/profiler/msprof_analyze/advisor/display/html/templates/byte_alignment.html b/profiler/msprof_analyze/advisor/display/html/templates/byte_alignment.html new file mode 100644 index 0000000000000000000000000000000000000000..5677dd5c1f8e3e519fb32e9f1abd0aafcdd7cf72 --- /dev/null +++ b/profiler/msprof_analyze/advisor/display/html/templates/byte_alignment.html @@ -0,0 +1,43 @@ + +
+

Byte Alignment Analysis

+
+ {% if rank is not none %} + Analysis of rank {{ rank|safe }}. + {% endif %} + {{ desc }} + + + + + {% for item in solutions %} + {% set rowloop = loop %} + {% for key, value in item.items() %} + + + + {% endfor %} + {% endfor %} +
Suggestions
{{ rowloop.index }}. {{ value.desc }}
+ {% if datas|safe %} + The details of top {{ num }} abnormal communication + operators are as follows: +

+ + + {% for header in headers %} + + {% endfor %} + + + {% for row in datas %} + + {% for element in row %} + + {% endfor %} + + {% endfor %} +
{{ header }}
{{ element|safe }}
+ {% endif %} +
+
diff --git a/profiler/advisor/display/html/templates/cluster_analysis.html b/profiler/msprof_analyze/advisor/display/html/templates/cluster_analysis.html similarity index 100% rename from profiler/advisor/display/html/templates/cluster_analysis.html rename to profiler/msprof_analyze/advisor/display/html/templates/cluster_analysis.html diff --git a/profiler/advisor/display/html/templates/communication_retransmission_analysis.html b/profiler/msprof_analyze/advisor/display/html/templates/communication_retransmission_analysis.html similarity index 100% rename from profiler/advisor/display/html/templates/communication_retransmission_analysis.html rename to profiler/msprof_analyze/advisor/display/html/templates/communication_retransmission_analysis.html diff --git a/profiler/advisor/display/html/templates/comparison.html b/profiler/msprof_analyze/advisor/display/html/templates/comparison.html similarity index 100% rename from profiler/advisor/display/html/templates/comparison.html rename to profiler/msprof_analyze/advisor/display/html/templates/comparison.html diff --git a/profiler/advisor/display/html/templates/compute_analysis.html b/profiler/msprof_analyze/advisor/display/html/templates/compute_analysis.html similarity index 100% rename from profiler/advisor/display/html/templates/compute_analysis.html rename to profiler/msprof_analyze/advisor/display/html/templates/compute_analysis.html diff --git a/profiler/advisor/display/html/templates/contention.html b/profiler/msprof_analyze/advisor/display/html/templates/contention.html similarity index 100% rename from profiler/advisor/display/html/templates/contention.html rename to profiler/msprof_analyze/advisor/display/html/templates/contention.html diff --git a/profiler/advisor/display/html/templates/environment_variable.html b/profiler/msprof_analyze/advisor/display/html/templates/environment_variable.html similarity index 100% rename from profiler/advisor/display/html/templates/environment_variable.html rename to 
profiler/msprof_analyze/advisor/display/html/templates/environment_variable.html diff --git a/profiler/msprof_analyze/advisor/display/html/templates/fusible_operator_analysis.html b/profiler/msprof_analyze/advisor/display/html/templates/fusible_operator_analysis.html new file mode 100644 index 0000000000000000000000000000000000000000..52bf05c2c0349bbfcbd54776fce21752c50646f5 --- /dev/null +++ b/profiler/msprof_analyze/advisor/display/html/templates/fusible_operator_analysis.html @@ -0,0 +1,65 @@ +
+

Fusible Operator Analysis

+
+ {{ desc }} + + + + + + + {% for item in solutions %} + {% set rowloop = loop %} + {% for key, value in item.items() %} + + + + + {% endfor %} + {% endfor %} +
Suggestions
{{ rowloop.index }}. {{ value.desc }}
+

+ {% for sub_desc in table_desc %} + {{ sub_desc }}
+ {% endfor %} +
+ {% if host_data %} + {{ host_desc }} + + + {% for header in headers %} + + {% endfor %} + + + {% for row in host_data %} + + {% for element in row %} + + {% endfor %} + + {% endfor %} +
{{ header }}
{{ element|safe }}
+ {% endif %} + {% if mte_data %} +

+ {{ mte_desc }} + + + {% for header in headers %} + + {% endfor %} + + + {% for row in mte_data %} + + {% for element in row %} + + {% endfor %} + + {% endfor %} +
{{ header }}
{{ element|safe }}
+ {% endif %} + +
+
diff --git a/profiler/advisor/display/html/templates/fusion.html b/profiler/msprof_analyze/advisor/display/html/templates/fusion.html similarity index 100% rename from profiler/advisor/display/html/templates/fusion.html rename to profiler/msprof_analyze/advisor/display/html/templates/fusion.html diff --git a/profiler/advisor/display/html/templates/gc.html b/profiler/msprof_analyze/advisor/display/html/templates/gc.html similarity index 96% rename from profiler/advisor/display/html/templates/gc.html rename to profiler/msprof_analyze/advisor/display/html/templates/gc.html index 205e1b3b9ede3282189864f116a9c650b59626df..236c0acaec38c158ab4d848b235da55867bbaa10 100644 --- a/profiler/advisor/display/html/templates/gc.html +++ b/profiler/msprof_analyze/advisor/display/html/templates/gc.html @@ -1,6 +1,6 @@
-

GC Analysis

+

{{ title }}

{% if rank is not none %} Analysis of rank {{ rank|safe }}. diff --git a/profiler/advisor/display/html/templates/main.html b/profiler/msprof_analyze/advisor/display/html/templates/main.html similarity index 99% rename from profiler/advisor/display/html/templates/main.html rename to profiler/msprof_analyze/advisor/display/html/templates/main.html index 9317abba543dacabf19d6b7967acf496e7aa8dc9..25db9caed36af1e5e8ea63d4d55f9b2284379fc1 100644 --- a/profiler/advisor/display/html/templates/main.html +++ b/profiler/msprof_analyze/advisor/display/html/templates/main.html @@ -1,6 +1,7 @@ + \"))\n", + "pd.set_option('display.max_columns', None)\n", + "pd.set_option('display.max_rows', None)\n", + "pyo.init_notebook_mode()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## 集群场景CANN层API统计分析\n", + "该分析脚本展示了集群场景的统计数据分析结果。需要注意以下几点:\n", + "1. 所有的时间信息单位是微秒(us);\n", + "2. Q1表示单个API耗时的25%分位数,最终结果取自所有卡的Q1值中最小值;\n", + "3. Q3表示单个API耗时的75%分位数,最终结果取自所有卡的Q3值中最大值;\n", + "4. 'minRank'展示了API最小耗时所在卡;\n", + "5. 
'maxRank'展示了API最大耗时所在卡。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"all_stats.csv\")\n", + "display(df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cluster_display.display_box(df, xaxis_title=\"name\", yaxis_title=\"duration (ns)\")\n", + "cluster_display.display_stats_scatter(df, xaxis_title=\"name\", yaxis_title=\"duration (ns)\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "per_rank_df = pd.read_csv(\"rank_stats.csv\")\n", + "cluster_display.display_stats_per_operation(per_rank_df, box=False, scatter=False)" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/cluster_display.py b/profiler/msprof_analyze/cluster_analyse/recipes/cluster_display.py new file mode 100644 index 0000000000000000000000000000000000000000..5a23a280fff9b3c0492f1c8cd2fac20824afb708 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/cluster_display.py @@ -0,0 +1,240 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
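The notebooks above read CSVs whose columns (`Q1(Us)`, `Median(Us)`, `Q3(Us)`, …) are produced by the repo's `describe_duration` helper. As a rough, self-contained sketch of that groupby-then-quantile shape — the function below, its column names, and the ns-to-us conversion are stand-in assumptions, not the actual implementation in `msprof_analyze.cluster_analyse.common_func.utils`:

```python
import pandas as pd

# Hypothetical stand-in for the repo's describe_duration helper: per-group
# counts plus duration quantiles, converted here from ns to us.
def describe_duration_sketch(grouped):
    stats = grouped.agg(["count", "min", "max", "sum"])
    stats["Q1(Us)"] = grouped.quantile(0.25) / 1000
    stats["Median(Us)"] = grouped.quantile(0.5) / 1000
    stats["Q3(Us)"] = grouped.quantile(0.75) / 1000
    return stats

df = pd.DataFrame({
    "OpType": ["MatMul", "MatMul", "Add", "Add"],
    "Duration": [2000.0, 4000.0, 1000.0, 3000.0],  # durations in ns
})
stats = describe_duration_sketch(df.groupby("OpType")["Duration"])
print(stats.loc["MatMul", "Median(Us)"])  # 3.0
```

This only illustrates the structure of the `all_stats.csv` / `rank_stats.csv` tables the notebooks load; the plotting helpers then map these quantile columns onto plotly box traces.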
+import logging + +import numpy as np +import pandas as pd +import plotly.graph_objects as go +from IPython.display import display, HTML +from ipywidgets import Dropdown, fixed, interact + +logger = logging.getLogger("cluster_display") + + +def get_stats_cols(df): + cols = df.columns.tolist() + q1 = "Q1(Us)" if "Q1(Us)" in cols else "Q1~" + q3 = "Q3(Us)" if "Q3(Us)" in cols else "Q3~" + med = "med(Us)" if "med(Us)" in cols else "med~" + std = "stdev" if "stdev" in cols else "stdev~" + return q1, q3, med, std + + +def display_box(df, x=None, **layout_args): + if x is None: + x = df.columns[0] + q1, q3, med, std = get_stats_cols(df) + fig = go.Figure() + fig.add_trace( + go.Box( + x=df[x], + q1=df[q1], + median=df[med], + q3=df[q3], + sd=df[std], + lowerfence=df["minRank"], + upperfence=df["maxRank"] + ) + ) + fig.update_layout(**layout_args) + fig.show() + + +def display_stats_scatter(df, x=None, **layout_args): + if x is None: + x = df.columns[0] + q1, q3, med, _ = get_stats_cols(df) + fig = go.Figure() + col_names = [q1, med, q3, "minRank", "maxRank"] + for name in col_names: + fig.add_trace( + go.Scatter( + x=df[x], + y=df[name], + name=name + ) + ) + fig.update_layout(**layout_args) + fig.show() + + +def display_table_per_rank(df): + if df.empty: + display(df) + return + + rank_groups = df.groupby("rank") + + def display_table(name): + rank_df = rank_groups.get_group(name) + rank_df = rank_df.drop(columns=["rank"]) + display(rank_df) + + dropdown = Dropdown( + options=rank_groups.groups.keys(), + description="rank:", + disabled=False, + ) + interact( + display_table, + name=dropdown + ) + + +def display_stats_per_operation(df, x=None, box=True, scatter=True, table=True, **layout_args): + if df.empty: + display(df) + return + + if x is None: + x = df.columns[0] + + op_groups = df.groupby(x) + + def display_graphs(name): + op_df = op_groups.get_group(name) + if table: + display(op_df.reset_index(drop=True).set_index("rank")) + if box: + display_box(op_df, 
x=op_df["rank"], **layout_args) + if scatter: + display_stats_scatter(op_df, x=op_df["rank"], **layout_args) + + operations = list(op_groups.groups.keys()) + + if len(operations) > 1: + dropdown = Dropdown( + options=operations, + description="Operation:", + disabled=False, + value=operations[1] + ) + interact( + display_graphs, + name=dropdown + ) + dropdown.value = operations[0] + else: + display_graphs(operations[0]) + + +def display_duration_boxplots(figs, stats_df: pd.DataFrame, orientation="v", title=None, + x_title="Names", y_title="Time", legend_title="Legend"): + mean_ds = stats_df.get("Mean(Us)", None) + min_ds = stats_df.get("Min(Us)", None) + max_ds = stats_df.get("Max(Us)", None) + q1_ds = stats_df.get("Q1(Us)", None) + median_ds = stats_df.get('Median(Us)', None) + q3_ds = stats_df.get('Q3(Us)', None) + display_boxplot(figs, stats_df.index, min_ds, q1_ds, median_ds, q3_ds, max_ds, mean_ds, + orientation=orientation, title=title, x_title=x_title, y_title=y_title, + legend_title=legend_title) + + +def display_boxplot(figs, x_axis, min_ds, q1_ds, median_ds, q3_ds, max_ds, mean_ds, orientation="v", + title=None, x_title=None, y_title="Time", legend_title="Legend"): + fig = go.Figure() + fig.add_trace( + go.Box( + x=x_axis, + lowerfence=min_ds, + q1=q1_ds, + median=median_ds, + q3=q3_ds, + upperfence=max_ds, + mean=mean_ds + ) + ) + fig.update_traces(orientation=orientation) + fig.update_layout( + xaxis_title=x_title, yaxis_title=y_title, legend_title=legend_title, + title=title, height=1024 + ) + fig.show() + if isinstance(figs, list): + figs.append(fig) + + +def display_graph(figs, x_axis, y_axes, title=None, + x_title=None, y_title=None, legend_title="Legend"): + if isinstance(y_axes, pd.DataFrame): + data = y_axes.set_index(x_axis) + elif isinstance(y_axes, dict): + data = pd.DataFrame(y_axes, index=x_axis) + elif isinstance(y_axes, pd.Series): + data = pd.DataFrame({"": y_axes}, index=x_axis) + elif isinstance(y_axes, np.ndarray): + data = 
pd.DataFrame({"": pd.Series(y_axes)}, index=x_axis) + else: + return + + fig = data.plot.line() + fig.update_layout( + title=title, xaxis_title=x_title, yaxis_title=y_title, legend_title=legend_title + ) + fig.show() + if isinstance(figs, list): + figs.append(fig) + + +def display_stats_per_rank_groups_combobox(rank_stats_gdf): + names = list(rank_stats_gdf.groups.keys()) + if len(names) > 1: + dropdown = Dropdown( + options=names, layout={"width": "max-content"}, value=names[1] + ) + interact( + __display_stats_per_rank_group, + selected=dropdown, + rank_stats_gdf=fixed(rank_stats_gdf) + ) + dropdown.value = names[0] + elif len(names) == 1: + __display_stats_per_rank_group(names[0], rank_stats_gdf) + else: + logger.info("cluster_display func:input rank_stats_gdf groups is null so no need to display") + + +def __display_stats_per_rank_group(selected, rank_stats_gdf): + df = rank_stats_gdf.get_group(selected) + df = df.reset_index(drop=True) + df = df.set_index(df["Rank"]) + display(df) + + figs = [] + display_duration_boxplots(figs, df, x_title="Ranks") + display_graph( + figs, + df.index, + df[["Q1(Us)", "Median(Us)", "Q3(Us)"]], + title="50% of Distribution", + x_title="Ranks" + ) + + +def display_stats_optional_combobox(options, display_func, args, description="Option:"): + if len(options) > 1: + dropdown = Dropdown( + options=options, layout={"width": "max-content"}, value=options[1], + description=description + ) + interact( + display_func, + selected=dropdown, + args=fixed(args) + ) + dropdown.value = options[0] + elif len(options) == 1: + display_func(options[0], args) diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/__init__.py b/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..7101187a2c2619f3b1c20dded14b433950b4c662 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/__init__.py @@ -0,0 +1,14 @@ +# 
Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/compute_op_sum.py b/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/compute_op_sum.py new file mode 100644 index 0000000000000000000000000000000000000000..528534be399e3ceacadbe7d1acf7294d7b3ff37d --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/compute_op_sum.py @@ -0,0 +1,120 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
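The `ComputeOpSum` recipe below registers an `--exclude_op_name` switch through argparse's `store_true` action, so the flag defaults to `False` and flips to `True` only when passed. A minimal standalone sketch of that behavior (the parser object here is illustrative; in the recipe the value is later read from `self._extra_args`):

```python
import argparse

# Illustrative parser only; the recipe wires this up via add_parser_argument.
parser = argparse.ArgumentParser(prog="compute_op_sum_sketch")
parser.add_argument("--exclude_op_name", default=False, action="store_true",
                    help="whether to exclude op_name in the SQL query")

default_args = parser.parse_args([])                      # flag absent -> False
flagged_args = parser.parse_args(["--exclude_op_name"])   # flag present -> True
print(default_args.exclude_op_name, flagged_args.exclude_op_name)  # False True
```

Skipping the op-name grouping this way avoids the heavier per-`OpName`/`InputShapes` SQL query when only per-`OpType` statistics are needed.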
+ +import os +import pandas as pd + +from msprof_analyze.cluster_analyse.common_func.utils import describe_duration +from msprof_analyze.cluster_analyse.recipes.base_recipe_analysis import BaseRecipeAnalysis +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_exports.compute_op_sum_export import ComputeOpSumExport +from msprof_analyze.prof_exports.compute_op_sum_export import ComputeOpSumExportExcludeOpName + +logger = get_logger() + + +class ComputeOpSum(BaseRecipeAnalysis): + TABLE_ALL_RANK_STATS = "ComputeOpAllRankStats" + TABLE_PER_RANK_STATS_BY_OPTYPE = "ComputeOpPerRankStatsByOpType" + TABLE_PER_RANK_STATS_BY_OPNAME = "ComputeOpPerRankStatsByOpName" + + EXCLUDE_OP_NAME = "exclude_op_name" + DEFAULT_SWITCH = False + + def __init__(self, params): + super().__init__(params) + logger.info("ComputeOpSum init.") + self.all_rank_stats = None + self.per_rank_stats_by_optype = None + self.per_rank_stats_by_opname = None + self.exclude_op_name = self._extra_args.get(self.EXCLUDE_OP_NAME, self.DEFAULT_SWITCH) + + @property + def base_dir(self): + return os.path.basename(os.path.dirname(__file__)) + + @classmethod + def add_parser_argument(cls, parser): + BaseRecipeAnalysis.add_parser_argument(parser) + parser.add_argument( + '--exclude_op_name', default=False, action='store_true', help='whether exclude op_name in the SQL query' + ) + + def reducer_func(self, mapper_res): + mapper_res = list(filter(lambda df: df is not None, mapper_res)) + if not mapper_res: + logger.error("Mapper data is None.") + return + # get per rank stats by optype + self.per_rank_stats_by_optype = pd.concat( + describe_duration(df.groupby(["OpType", "TaskType"])["Duration"]).assign(Rank=df["Rank"][0]) + for df in mapper_res + ) + self.per_rank_stats_by_optype.sort_values(by=["SumNs"], inplace=True, ascending=False) + + # get all rank stats by optype + all_op_data = pd.concat(mapper_res) + self.all_rank_stats 
= describe_duration(all_op_data.groupby(["OpType", "TaskType"])["Duration"]) + self.all_rank_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) + + if self.exclude_op_name: + return + # get per rank stats by opname + self.per_rank_stats_by_opname = pd.concat( + describe_duration(df.groupby(["OpName", "OpType", "TaskType", "InputShapes"])["Duration"]).assign( + Rank=df["Rank"][0]) for df in mapper_res) + self.per_rank_stats_by_opname.sort_values(by=["SumNs"], inplace=True, ascending=False) + + def run(self, context): + mapper_res = self.mapper_func(context) + self.reducer_func(mapper_res) + + if self._export_type == "db": + self.save_db() + elif self._export_type == "notebook": + self.save_notebook() + else: + logger.error("Unknown export type.") + + def save_notebook(self): + self.dump_data(self.all_rank_stats, "all_stats.csv") + self.dump_data(self.per_rank_stats_by_optype, "rank_stats_by_optype.csv") + if not self.exclude_op_name: + self.dump_data(self.per_rank_stats_by_opname, "rank_stats_by_opname.csv") + self.create_notebook("stats.ipynb") + self.add_helper_file("cluster_display.py") + + def save_db(self): + self.dump_data(self.all_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_ALL_RANK_STATS) + self.dump_data(self.per_rank_stats_by_optype, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, + self.TABLE_PER_RANK_STATS_BY_OPTYPE) + if not self.exclude_op_name: + self.dump_data(self.per_rank_stats_by_opname, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, + self.TABLE_PER_RANK_STATS_BY_OPNAME) + + def _mapper_func(self, data_map, analysis_class): + profiler_db_path = data_map.get(Constant.PROFILER_DB_PATH) + rank_id = data_map.get(Constant.RANK_ID) + step_range = data_map.get(Constant.STEP_RANGE) + if self.exclude_op_name: + df = ComputeOpSumExportExcludeOpName(profiler_db_path, analysis_class, step_range).read_export_db() + else: + df = ComputeOpSumExport(profiler_db_path, analysis_class, step_range).read_export_db() + if df is None or 
df.empty: + logger.warning(f"There is no stats data in {profiler_db_path}.") + return None + df["Rank"] = rank_id + return df diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/stats.ipynb b/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/stats.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..c6b8a487b249d6abde84d54063b21f0442c0a478 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/compute_op_sum/stats.ipynb @@ -0,0 +1,165 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Compute Op Summary\n", + "\n", + "集群场景计算类算子数据分析\n", + "\n", + "主要包含以下3个统计内容:\n", + "1. 按算子类型和任务类型分组的,整个集群计算类算子耗时的统计情况\n", + "2. 按算子类型和任务类型分组的,每个Rank上计算类算子的耗时情况\n", + "3. 按算子名称、任务类型、输入shape分组的,每个Rank上的计算类算子的耗时情况" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 数据准备" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import display, HTML\n", + "display(HTML(\"\"))\n", + "\n", + "import plotly.offline as pyo\n", + "\n", + "def is_lab_notebook():\n", + " import re\n", + " import psutil\n", + " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n", + "\n", + "if is_lab_notebook():\n", + " pyo.init_notebook_mode()\n", + "\n", + "import pandas as pd\n", + "pd.options.plotting.backend = \"plotly\"\n", + "pd.set_option(\"display.max_rows\", 100)\n", + "pd.set_option(\"display.width\", 1000)\n", + "\n", + "import cluster_display\n", + "\n", + "all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n", + "rank_stats_by_optype_df = pd.read_csv(\"rank_stats_by_optype.csv\", index_col=\"OpType\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 计算类算子耗时分析\n", + "\n", + "将整个集群所有Rank的计算类算子进行汇总,按算子类型和任务类型分类,统计分析耗时情况,时间单位为微秒(us)\n", + "\n", + "包含以下统计项:\n", + "- Count:算子数量\n", + "- Mean:平均耗时\n", + "- 

Std:标准差\n", + "- Min:最小值\n", + "- Q1:四分之一分位数\n", + "- Median:中位数\n", + "- Q3:四分之三分位数\n", + "- Max:最大值\n", + "- Sum:总耗时" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display(all_stats_df)\n", + "fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"OpType\")\n", + "fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"OpType\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 单个Rank的计算类算子基于算子类型的耗时分析\n", + "将集群内每个Rank的计算类算子进行汇总,按算子类型和任务类型分类,统计分析耗时情况,时间单位为微秒(us)\n", + "\n", + "包含以下统计项:\n", + "- Count:算子数量\n", + "- Mean:平均耗时\n", + "- Std:标准差\n", + "- Min:最小值\n", + "- Q1:四分之一分位数\n", + "- Median:中位数\n", + "- Q3:四分之三分位数\n", + "- Max:最大值\n", + "- Sum:总耗时" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rank_stats_gdf = rank_stats_by_optype_df.groupby(rank_stats_by_optype_df.index)\n", + "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 单个Rank的计算类算子基于算子名的耗时分析\n", + "提醒:添加--exclude_op_name后,以下内容不支持运行\n", + "\n", + "将集群内每个Rank的计算类算子进行汇总,按算子名称、任务类型、输入shape分类,统计分析耗时情况,时间单位为微秒(us)\n", + "\n", + "包含以下统计项:\n", + "- Count:算子数量\n", + "- Mean:平均耗时\n", + "- Std:标准差\n", + "- Min:最小值\n", + "- Q1:四分之一分位数\n", + "- Median:中位数\n", + "- Q3:四分之三分位数\n", + "- Max:最大值\n", + "- Sum:总耗时" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rank_stats_by_opname_df = pd.read_csv(\"rank_stats_by_opname.csv\", index_col=\"OpName\")\n", + "rank_stats_gdf = rank_stats_by_opname_df.groupby(rank_stats_by_opname_df.index)\n", + 
"cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/__init__.py b/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..7101187a2c2619f3b1c20dded14b433950b4c662 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/__init__.py @@ -0,0 +1,14 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/hccl_sum.py b/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/hccl_sum.py new file mode 100644 index 0000000000000000000000000000000000000000..84ff40ac7e5d78d6ea30127739e18dfd1654e2c0 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/hccl_sum.py @@ -0,0 +1,137 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import pandas as pd + +from msprof_analyze.cluster_analyse.common_func.utils import describe_duration +from msprof_analyze.cluster_analyse.recipes.base_recipe_analysis import BaseRecipeAnalysis +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_exports.hccl_sum_export import HcclSumExport + +logger = get_logger() + + +def double_hash(data): + prime = [29, 131] + hash_num = [0, 0] + for d in data: + hash_num[0] = (((hash_num[0] * prime[0]) & Constant.UINT32_MASK) + ord(d)) & Constant.UINT32_MASK + hash_num[1] = (((hash_num[1] * prime[1]) & Constant.UINT32_MASK) + ord(d)) & Constant.UINT32_MASK + + return str((hash_num[0] << Constant.UINT32_BITS) | hash_num[1]) + + +class HcclSum(BaseRecipeAnalysis): + TABLE_ALL_RANK_STATS = "HcclAllRankStats" + TABLE_PER_RANK_STATS = "HcclPerRankStats" + TABLE_TOP_OP_STATS = "HcclTopOpStats" + TABLE_GROUP_NAME_MAP = "HcclGroupNameMap" + + TOP_NUM = "top_num" + DEFAULT_TOP_NUM = 15 + + def __init__(self, params): + super().__init__(params) + logger.info("HcclSum init.") + self.per_rank_stats = None + self.all_rank_stats = None + self.group_name_map = None + self.top_op_stats = None + top_num = self._extra_args.get(self.TOP_NUM, self.DEFAULT_TOP_NUM) + self.top_num = int(top_num) if isinstance(top_num, str) and top_num.isdigit() else self.DEFAULT_TOP_NUM + + @property + def base_dir(self): + return os.path.basename(os.path.dirname(__file__)) + + @classmethod + def add_parser_argument(cls, parser): + 
parser.add_argument("--top_num", type=str, help="Duration cost top count", default=cls.DEFAULT_TOP_NUM) + + def reducer_func(self, mapper_res): + mapper_res = list(filter(lambda df: df is not None, mapper_res)) + if not mapper_res: + logger.error("Mapper data is None.") + return + self.per_rank_stats = pd.concat( + describe_duration(df.groupby("OpType")["Duration"]).assign(Rank=df["Rank"][0]) for df in mapper_res) + self.per_rank_stats.sort_values(by=["Rank"], inplace=True) + all_op_data = pd.concat(mapper_res) + self.all_rank_stats = describe_duration(all_op_data.groupby("OpType")["Duration"]) + grouped_op_stats = all_op_data.groupby("OpName") + self.top_op_stats = describe_duration(grouped_op_stats["Duration"]).nlargest(self.top_num, "MeanNs") + min_rank = [] + max_rank = [] + for op_name in self.top_op_stats.index: + df = grouped_op_stats.get_group(op_name) + min_rank.append(df[df["Duration"] == df["Duration"].min()]["Rank"].values[0]) + max_rank.append(df[df["Duration"] == df["Duration"].max()]["Rank"].values[0]) + self.top_op_stats["MinRank"] = min_rank + self.top_op_stats["MaxRank"] = max_rank + + grouped_group_name_stats = all_op_data.groupby("GroupName") + group_name_rank_map = grouped_group_name_stats.apply( + lambda x: ';'.join(map(str, x['Rank'].drop_duplicates().sort_index()))).sort_index() + self.group_name_map = pd.DataFrame( + data={ + "GroupId": [key[-3:] for key in map(double_hash, group_name_rank_map.keys())], + "Ranks": group_name_rank_map.values + }, + index=sorted(grouped_group_name_stats.groups.keys()) + ) + self.group_name_map.index.name = "GroupName" + self.group_name_map.sort_values("GroupId", inplace=True) + + def run(self, context): + if self.top_num <= 0: + logger.warning(f"HcclSum: top_num is set to an invalid value, " + f"it will be reset to the default value({self.DEFAULT_TOP_NUM}).") + self.top_num = self.DEFAULT_TOP_NUM + mapper_res = self.mapper_func(context) + self.reducer_func(mapper_res) + + if self._export_type == "db": + 
self.save_db() + elif self._export_type == "notebook": + self.save_notebook() + else: + logger.error("Unknown export type.") + + def save_notebook(self): + self.dump_data(self.all_rank_stats, "all_stats.csv") + self.dump_data(self.per_rank_stats, "rank_stats.csv") + self.dump_data(self.top_op_stats, "top_op_stats.csv") + self.dump_data(self.group_name_map, "group_name_map.csv") + self.create_notebook("stats.ipynb") + self.add_helper_file("cluster_display.py") + + def save_db(self): + self.dump_data(self.all_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_ALL_RANK_STATS) + self.dump_data(self.per_rank_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_PER_RANK_STATS) + self.dump_data(self.top_op_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_TOP_OP_STATS) + self.dump_data(self.group_name_map, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_GROUP_NAME_MAP) + + def _mapper_func(self, data_map, analysis_class): + profiler_db_path = data_map.get(Constant.PROFILER_DB_PATH) + rank_id = data_map.get(Constant.RANK_ID) + step_range = data_map.get(Constant.STEP_RANGE) + df = HcclSumExport(profiler_db_path, analysis_class, step_range).read_export_db() + if df is None or df.empty: + logger.warning(f"There is no stats data in {profiler_db_path}.") + return None + df["Rank"] = rank_id + return df diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/stats.ipynb b/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/stats.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..51a08a854b97161ba8e88ec94809b728582d6631 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/hccl_sum/stats.ipynb @@ -0,0 +1,162 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# COMMUNICATION Summary\n", + "\n", + "集群场景通信算子数据分析\n", + "\n", + "主要包含以下3个统计内容:\n", + "1. 按算子类型分组的,整个集群通信算子耗时的统计情况\n", + "2. 按算子类型分组的,每个Rank上通信算子的耗时情况\n", + "3. 
整个集群平均耗时最久的TOP通信算子" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 数据准备" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import display, HTML\n", + "display(HTML(\"\"))\n", + "\n", + "import plotly.offline as pyo\n", + "\n", + "def is_lab_notebook():\n", + " import re\n", + " import psutil\n", + " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n", + "\n", + "if is_lab_notebook():\n", + " pyo.init_notebook_mode()\n", + "\n", + "import pandas as pd\n", + "pd.options.plotting.backend = \"plotly\"\n", + "pd.set_option(\"display.max_rows\", 100)\n", + "pd.set_option(\"display.width\", 1000)\n", + "\n", + "import cluster_display\n", + "\n", + "all_stats_df = pd.read_csv(\"all_stats.csv\", index_col=\"OpType\")\n", + "rank_stats_df = pd.read_csv(\"rank_stats.csv\", index_col=\"OpType\")\n", + "top_op_stats_df = pd.read_csv(\"top_op_stats.csv\", index_col=\"OpName\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 集群通信算子耗时分析\n", + "\n", + "将整个集群所有Rank的通信算子进行汇总,按算子类型分类,统计分析耗时情况,时间单位为微秒(us)\n", + "\n", + "包含以下统计项:\n", + "- Count:算子数量\n", + "- Mean:平均耗时\n", + "- Std:标准差\n", + "- Min:最小值\n", + "- Q1:四分之一分位数\n", + "- Median:中位数\n", + "- Q3:四分之三分位数\n", + "- Max:最大值\n", + "- Sum:总耗时" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display(all_stats_df)\n", + "fig_all_rank = cluster_display.display_duration_boxplots(None, all_stats_df, x_title=\"Hccl OpType\")\n", + "fig_per_rank = cluster_display.display_graph(None, all_stats_df.index, all_stats_df[[\"Q1(Us)\", \"Median(Us)\", \"Q3(Us)\"]], title=\"50% of Distribution\", x_title=\"Hccl OpType\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 集群Rank通信算子耗时分析\n", + "\n", + "将集群内每个Rank的通信算子进行汇总,按算子类型分类,统计分析耗时情况,时间单位为微秒(us)\n", + "\n", + 
"包含以下统计项:\n", + "- Count:算子数量\n", + "- Mean:平均耗时\n", + "- Std:标准差\n", + "- Min:最小值\n", + "- Q1:四分之一分位数\n", + "- Median:中位数\n", + "- Q3:四分之三分位数\n", + "- Max:最大值\n", + "- Sum:总耗时" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rank_stats_gdf = rank_stats_df.groupby(rank_stats_df.index)\n", + "cluster_display.display_stats_per_rank_groups_combobox(rank_stats_gdf)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 集群TOP-N通信算子耗时分析\n", + "\n", + "统计集群内耗时最多的TOP-N通信算子,时间单位为微秒(us)\n", + "\n", + "包含以下统计项:\n", + "- Count:算子数量\n", + "- Mean:平均耗时\n", + "- Std:标准差\n", + "- Min:最小值\n", + "- Q1:四分之一分位数\n", + "- Median:中位数\n", + "- Q3:四分之三分位数\n", + "- Max:最大值\n", + "- Sum:总耗时\n", + "- MinRank:耗时最少算子所在的Rank\n", + "- MaxRank:耗时最长算子所在的Rank" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display(top_op_stats_df)\n", + "fig_top_op = cluster_display.display_duration_boxplots(None, top_op_stats_df, x_title=\"Hccl OpName\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/__init__.py b/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..7101187a2c2619f3b1c20dded14b433950b4c662 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/__init__.py @@ -0,0 +1,14 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/mstx_sum.py b/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/mstx_sum.py new file mode 100644 index 0000000000000000000000000000000000000000..db6aae0de869853ccc5debac729492bfc9853695 --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/mstx_sum.py @@ -0,0 +1,227 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
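> 审阅提示:本补丁新增的 mstx_sum.py 的核心逻辑之一,是在同一线程(tid)内将以 `_start`/`_stop` 结尾的 MSTX 打点消息配对(对应下方的 `handle_mark_data`)。下面给出一个独立可运行的简化示意,展示这种基于栈的 LIFO 配对思路;其中 `events`、`pair_marks` 等名称均为示例,并非补丁中的实现。

```python
# 简化示意(非补丁实现):按线程 tid 用栈配对 "<名称>_start"/"<名称>_stop" 打点。
START_SUFFIX = "_start"
STOP_SUFFIX = "_stop"


def pair_marks(events):
    """events 为按时间排序的 (tid, msg) 列表,返回 (tid, name, start_idx, stop_idx) 列表。"""
    open_marks = {}  # tid -> {name: [未匹配的 start 下标栈]}
    pairs = []
    for idx, (tid, msg) in enumerate(events):
        if msg.endswith(START_SUFFIX):
            name = msg[: -len(START_SUFFIX)]
            open_marks.setdefault(tid, {}).setdefault(name, []).append(idx)
        elif msg.endswith(STOP_SUFFIX):
            name = msg[: -len(STOP_SUFFIX)]
            stack = open_marks.get(tid, {}).get(name, [])
            if stack:  # LIFO:stop 关闭同名最近一次未关闭的 start
                pairs.append((tid, name, stack.pop(), idx))
            # 无匹配 start 的 stop 在此示意中直接忽略(补丁实现中会记录告警)
    return pairs


events = [(1, "step_start"), (1, "fwd_start"), (1, "fwd_stop"), (1, "step_stop")]
print(pair_marks(events))  # [(1, 'fwd', 1, 2), (1, 'step', 0, 3)]
```

嵌套打点(如 step 内套 fwd)因此能够正确闭合;未配对的消息在补丁实现中会按出现顺序汇总到告警日志里。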
+from collections import namedtuple + +import os +import pandas as pd + +from msprof_analyze.cluster_analyse.common_func.utils import describe_duration +from msprof_analyze.cluster_analyse.recipes.base_recipe_analysis import BaseRecipeAnalysis +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_exports.mstx_event_export import MstxMarkExport, MstxRangeExport +from msprof_analyze.prof_exports.mstx_step_export import MstxStepExport + +logger = get_logger() + +MarkInfo = namedtuple("MarkInfo", ["name", "framework_duration", "cann_duration", "device_duration", + "tid", "start_ns"]) + + +def format_mark_info(df: pd.DataFrame, start_idx, stop_idx, name) -> MarkInfo: + start_series = df.iloc[start_idx] + stop_series = df.iloc[stop_idx] + return MarkInfo( + name=name, + framework_duration=float(stop_series["framework_ts"] - start_series["framework_ts"]), + cann_duration=float(stop_series["cann_ts"] - start_series["cann_ts"]), + device_duration=float(stop_series["device_ts"] - start_series["device_ts"]), + tid=start_series["tid"], + start_ns=start_series["cann_ts"] + ) + + +def format_range_info(df: pd.DataFrame, idx, name) -> MarkInfo: + range_series = df.iloc[idx] + return MarkInfo( + name=name, + framework_duration=float(0), + cann_duration=float(range_series["cann_end_ts"] - range_series["cann_start_ts"]), + device_duration=float(range_series["device_end_ts"] - range_series["device_start_ts"]), + tid=range_series["tid"], + start_ns=range_series["cann_start_ts"] + ) + + +def rename_mark_msg_name(mstx_stats_df: pd.DataFrame): + msg_idx_counter = {} + for idx, mark_info in enumerate(mstx_stats_df.itertuples(index=False)): + msg_idx_counter.setdefault(mark_info.step_id, {}).setdefault(mark_info.name, []).append(idx) + for msg_dict in msg_idx_counter.values(): + for msg, idx_list in msg_dict.items(): + if len(idx_list) <= 1: + continue + for i, idx in enumerate(idx_list): + 
mstx_stats_df.loc[idx, 'name'] = f"{msg}_{i}" + + +def compute_step_id(mark_stat, step_stats_df: pd.DataFrame): + for step_info in step_stats_df.itertuples(index=False): + if step_info.start_ns <= mark_stat.start_ns <= step_info.end_ns: + return step_info.step_id + logger.warning(f"{mark_stat.name} is not in any step.") + return 0 + + +def format_columns(df: pd.DataFrame): + formatted_df = df.rename( + { + "framework_duration": "FrameworkDurationNs", + "cann_duration": "CannDurationNs", + "device_duration": "DeviceDurationNs", + "duration": "DurationNs", + "step_id": "StepId", + "tid": "Tid", + "name": "Name" + }, + axis="columns" + ) + cols = [col for col in formatted_df.columns if not col.endswith("_ns") and col not in {"Tid"}] + return formatted_df[cols] + + +def handle_mark_data(mark_df: pd.DataFrame, rank_id: int) -> list: + res = [] + mark_df["framework_ts"] = mark_df["framework_ts"].astype("int64") + mark_info = {} + mismatch_msg = [] + for idx, row in enumerate(mark_df.itertuples(index=False)): + if row.msg.endswith(MstxSum.START_SUFFIX): + msg = row.msg[:-len(MstxSum.START_SUFFIX)] + mark_info.setdefault(row.tid, {}).setdefault(msg, []).append(idx) + elif row.msg.endswith(MstxSum.STOP_SUFFIX): + msg = row.msg[:-len(MstxSum.STOP_SUFFIX)] + idx_list = mark_info.get(row.tid, {}).get(msg, []) + if not idx_list: + mismatch_msg.append((row.msg, idx)) + continue + start_idx = idx_list.pop() + res.append(format_mark_info(mark_df, start_idx, idx, msg)) + + # 统计未匹配上的mark信息 + for msg_info in mark_info.values(): + for msg, idx_list in msg_info.items(): + if not idx_list: + continue + mismatch_msg.extend((msg + MstxSum.START_SUFFIX, idx) for idx in idx_list) + if mismatch_msg: + mismatch_msg.sort(key=lambda msg: msg[1]) + logger.warning(f"The following mark messages do not match anyone in " + f"rank {rank_id}: {','.join(msg[0] for msg in mismatch_msg)}.") + + return res + + +def handle_range_data(range_df: pd.DataFrame) -> list: + res = [] + for idx, row in 
enumerate(range_df.itertuples(index=False)): + res.append(format_range_info(range_df, idx, row.msg)) + return res + + +class MstxSum(BaseRecipeAnalysis): + TABLE_FRAMEWORK_STATS = "MSTXAllFrameworkStats" + TABLE_CANN_STATS = "MSTXAllCannStats" + TABLE_DEVICE_STATS = "MSTXAllDeviceStats" + TABLE_MARK_STATS = "MSTXMarkStats" + + START_SUFFIX = "_start" + STOP_SUFFIX = "_stop" + + def __init__(self, params): + super().__init__(params) + logger.info("MstxSum init.") + self.mark_stats = None + self.all_fwk_stats = None + self.all_cann_stats = None + self.all_device_stats = None + + @property + def base_dir(self): + return os.path.basename(os.path.dirname(__file__)) + + def reducer_func(self, mapper_res): + mapper_res = list(filter(lambda df: df is not None, mapper_res)) + if not mapper_res: + logger.error("Mapper data is None.") + return + self.mark_stats = pd.concat(mapper_res) + all_fwk_stats = [] + all_cann_stats = [] + all_device_stats = [] + mark_step_df = self.mark_stats.groupby("StepId") + for step_id, df in mark_step_df: + name_gdf = df.groupby("Name") + fwk_stats = describe_duration(name_gdf["FrameworkDurationNs"]).assign(StepId=step_id) + fwk_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) + all_fwk_stats.append(fwk_stats) + cann_stats = describe_duration(name_gdf["CannDurationNs"]).assign(StepId=step_id) + cann_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) + all_cann_stats.append(cann_stats) + device_stats = describe_duration(name_gdf["DeviceDurationNs"]).assign(StepId=step_id) + device_stats.sort_values(by=["SumNs"], inplace=True, ascending=False) + all_device_stats.append(device_stats) + self.all_fwk_stats = pd.concat(all_fwk_stats) + self.all_cann_stats = pd.concat(all_cann_stats) + self.all_device_stats = pd.concat(all_device_stats) + + def run(self, context): + mapper_res = self.mapper_func(context) + self.reducer_func(mapper_res) + + if self._export_type == "db": + self.save_db() + elif self._export_type == "notebook": + 
self.save_notebook() + else: + logger.error("Unknown export type.") + + def save_notebook(self): + self.dump_data(self.mark_stats, "mark_stats.csv") + self.dump_data(self.all_fwk_stats, "all_fwk_stats.csv") + self.dump_data(self.all_cann_stats, "all_cann_stats.csv") + self.dump_data(self.all_device_stats, "all_device_stats.csv") + self.create_notebook("stats.ipynb") + self.add_helper_file("cluster_display.py") + + def save_db(self): + self.dump_data(self.mark_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_MARK_STATS) + self.dump_data(self.all_fwk_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_FRAMEWORK_STATS) + self.dump_data(self.all_cann_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_CANN_STATS) + self.dump_data(self.all_device_stats, Constant.DB_CLUSTER_COMMUNICATION_ANALYZER, self.TABLE_DEVICE_STATS) + + def _mapper_func(self, data_map, analysis_class): + profiler_db_path = data_map.get(Constant.PROFILER_DB_PATH) + rank_id = data_map.get(Constant.RANK_ID) + step_range = data_map.get(Constant.STEP_RANGE) + step_df = MstxStepExport(profiler_db_path, analysis_class, step_range).read_export_db() + if step_df is None or step_df.empty: + step_df = pd.DataFrame({"start_ns": [0], "end_ns": [float("inf")], "step_id": [0]}) + mark_df = MstxMarkExport(profiler_db_path, analysis_class, step_range).read_export_db() + range_df = MstxRangeExport(profiler_db_path, analysis_class, step_range).read_export_db() + mstx_res = [] + if not mark_df.empty: + mstx_res += handle_mark_data(mark_df, rank_id) + if not range_df.empty: + mstx_res += handle_range_data(range_df) + if not mstx_res: + logger.warning(f"There is no mstx data in {profiler_db_path}.") + return None + + mstx_stats_df = pd.DataFrame(mstx_res).assign(Rank=rank_id) + mstx_stats_df["step_id"] = mstx_stats_df.apply(compute_step_id, axis=1, step_stats_df=step_df) + rename_mark_msg_name(mstx_stats_df) + mstx_stats_df = format_columns(mstx_stats_df).set_index("Name", drop=True) + 
return mstx_stats_df diff --git a/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/stats.ipynb b/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/stats.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..84672bc72b97b02717c3a4110ab1b4dd827adafd --- /dev/null +++ b/profiler/msprof_analyze/cluster_analyse/recipes/mstx_sum/stats.ipynb @@ -0,0 +1,180 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# MSTX Summary\n", + "\n", + "集群场景MSTX打点数据分析\n", + "\n", + "主要包含以下2个统计内容:\n", + "1. 按Step分组的,整个集群MSTX打点数据的统计情况\n", + "2. 按Name分组的,每个Rank上MSTX打点数据的统计情况" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 数据准备" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import display, HTML\n", + "display(HTML(\"\"))\n", + "\n", + "import plotly.offline as pyo\n", + "\n", + "def is_lab_notebook():\n", + " import re\n", + " import psutil\n", + " return any(re.search('jupyter--lab-script', x) for x in psutil.Process().parent().cmdline())\n", + "\n", + "if is_lab_notebook():\n", + " pyo.init_notebook_mode()\n", + "\n", + "import pandas as pd\n", + "pd.options.plotting.backend = \"plotly\"\n", + "pd.set_option(\"display.max_rows\", 100)\n", + "pd.set_option(\"display.width\", 1000)\n", + "\n", + "import cluster_display\n", + "\n", + "all_fwk_stats_gdf = pd.read_csv(\"all_fwk_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n", + "all_cann_stats_gdf = pd.read_csv(\"all_cann_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n", + "all_device_stats_gdf = pd.read_csv(\"all_device_stats.csv\", index_col=\"Name\").groupby(\"StepId\")\n", + "mark_stats_df = pd.read_csv(\"mark_stats.csv\", index_col=\"Name\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 集群MSTX数据分析\n", + "\n", + "将整个集群所有Rank的MSTX数据进行汇总,按Step划分,统计分析耗时情况,时间单位为微秒(us)\n", + "打点数据分为三种:\n", + "1. 
框架侧耗时:Framework Time\n", + "2. Cann侧耗时:Cann Time\n", + "3. Device侧耗时:Device Time\n", + "\n", + "3种数据都包含以下统计项:\n", + "- Count:数量\n", + "- Mean:平均耗时\n", + "- Std:标准差\n", + "- Min:最小值\n", + "- Q1:四分之一分位数\n", + "- Median:中位数\n", + "- Q3:四分之三分位数\n", + "- Max:最大值\n", + "- Sum:总耗时" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def display_stats_mstx_step_combobox(selected, args):\n", + "    step = selected\n", + "    fwk_stats_gdf, cann_stats_gdf, device_stats_gdf = args\n", + "    fwk_df = fwk_stats_gdf.get_group(step)\n", + "    cann_df = cann_stats_gdf.get_group(step)\n", + "    device_df = device_stats_gdf.get_group(step)\n", + "    figs = []\n", + "    display(HTML(\"

Framework Time Stats

\"))\n", + " display(fwk_df)\n", + " cluster_display.display_duration_boxplots(figs, fwk_df, title=\"Framework Time\", x_title=\"Name\", y_title=\"Time\")\n", + " display(HTML(\"

Cann Time Stats

\"))\n", + " display(cann_df)\n", + " cluster_display.display_duration_boxplots(figs, cann_df, title=\"Cann Time\", x_title=\"Name\", y_title=\"Time\")\n", + " display(HTML(\"

Device Time Stats

\"))\n", + " display(device_df)\n", + " cluster_display.display_duration_boxplots(figs, device_df, title=\"Device Time\", x_title=\"Name\", y_title=\"Time\")\n", + "\n", + "steps = list(all_fwk_stats_gdf.groups.keys())\n", + "if steps:\n", + " cluster_display.display_stats_optional_combobox(steps, display_stats_mstx_step_combobox, \n", + " [all_fwk_stats_gdf, all_cann_stats_gdf, all_device_stats_gdf], \"Step:\")\n", + "else:\n", + " print(\"There is no step in stats, so no need to display\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 集群Rank MSTX数据分析\n", + "\n", + "将集群内每个Rank的MSTX数据进行汇总,按打点Name分类,统计分析耗时情况,时间单位为微秒(us)\n", + "\n", + "包含以下统计项:\n", + "- Name:打点名称\n", + "- FrameworkDuration(Us):框架侧耗时\n", + "- CannDuration(Us):Cann侧耗时\n", + "- DeviceDuration(Us):Device侧耗时\n", + "- Rank:Rank序号\n", + "- StepId:Step序号" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def display_mstx_duration_by_rank(selected, args):\n", + " mark_stats_gdf = args\n", + " df = mark_stats_gdf.get_group(selected).sort_values(\"Rank\")\n", + " display(df)\n", + " fwk_duration = []\n", + " cann_duration = []\n", + " device_duration = []\n", + " step_ids = []\n", + " for step_id, step_df in df.groupby(\"StepId\"):\n", + " fwk_duration.append((step_id, step_df[\"FrameworkDuration(Us)\"].values))\n", + " cann_duration.append((step_id, step_df[\"CannDuration(Us)\"].values))\n", + " device_duration.append((step_id, step_df[\"DeviceDuration(Us)\"].values))\n", + " step_ids.append(step_id)\n", + " fwk_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in fwk_duration], axis=1)\n", + " cann_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in cann_duration], axis=1)\n", + " device_df = pd.concat([pd.Series(dur, name=step_id) for step_id, dur in device_duration], axis=1)\n", + " figs = []\n", + " ranks = df[\"Rank\"].drop_duplicates()\n", + " cluster_display.display_graph(figs, 
ranks, fwk_df[step_ids],\n", + "                                  title=\"Framework Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", + "    cluster_display.display_graph(figs, ranks, cann_df[step_ids],\n", + "                                  title=\"Cann Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", + "    cluster_display.display_graph(figs, ranks, device_df[step_ids],\n", + "                                  title=\"Device Time\", x_title=\"Rank\", y_title=\"Time\", legend_title=\"Step\")\n", + "\n", + "mark_stats_gdf = mark_stats_df.groupby(mark_stats_df.index)\n", + "names = list(mark_stats_gdf.groups.keys())\n", + "if names:\n", + "    cluster_display.display_stats_optional_combobox(names, display_mstx_duration_by_rank, mark_stats_gdf, \"Name:\")\n", + "else:\n", + "    print(\"There is no mark name in stats, so no need to display\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/profiler/cluster_analyse/resources/.keep b/profiler/msprof_analyze/cluster_analyse/resources/.keep similarity index 100% rename from profiler/cluster_analyse/resources/.keep rename to profiler/msprof_analyze/cluster_analyse/resources/.keep diff --git a/profiler/compare_tools/README.md b/profiler/msprof_analyze/compare_tools/README.md similarity index 62% rename from profiler/compare_tools/README.md rename to profiler/msprof_analyze/compare_tools/README.md index 670d8da80af775f629b17143510f397baf900d82..ada936c90aea2916b6ddfec37b85e0735b8732a8 100644 --- a/profiler/compare_tools/README.md +++ b/profiler/msprof_analyze/compare_tools/README.md @@ -69,7 +69,7 @@ PyTorch Profiler采集结果数据目录结构如下: #### NPU性能数据采集 -通过Ascend PyTorch Profiler工具采集NPU的性能数据,采集参数配置与GPU基本一致,只需将GPU的性能数据采集代码中torch.profiler替换成torch_npu.profiler。,参考链接:[Profiling数据采集](https://gitee.com/ascend/mstt/tree/master/profiler)。 +通过Ascend PyTorch 
Profiler工具采集NPU的性能数据,采集参数配置与GPU基本一致,只需将GPU的性能数据采集代码中torch.profiler替换成torch_npu.profiler。参考链接:[Profiling数据采集](https://gitee.com/ascend/mstt/tree/master/profiler/msprof_analyze)。 Ascend PyTorch Profiler采集结果数据目录结构如下: @@ -78,6 +78,7 @@ Ascend PyTorch Profiler采集结果数据目录结构如下: |- * _ascend_pt |- ASCEND_PROFILER_OUTPUT |- kernel_details.csv + |- op_statistic.csv |- trace_view.json |- FRAMEWORK |- PROF_XXX @@ -98,6 +99,7 @@ MindSpore性能调试工具采集结果数据目录结构如下: |- profiler/{rank-*}_{timestamps}_ascend_ms |- ASCEND_PROFILER_OUTPUT |- kernel_details.csv + |- op_statistic.csv |- trace_view.json ``` @@ -119,8 +121,8 @@ MindSpore性能调试工具采集结果数据目录结构如下: msprof-analyze compare -d [比对性能数据文件所在路径] -bp [基准性能数据文件所在路径] --output_path=[比对结果文件存放路径] ``` - - --profiling_path或-d(必选):比对性能数据文件所在路径。可以指定以“ascend_pt”结尾的目录、ASCEND_PROFILER_OUTPUT目录或trace_view.json文件,指定trace_view.json无法显示算子的内存占用。 - - --benchmark_path或-bp(必选):基准性能数据文件所在路径。基准性能数据文件若以GPU为基准,指定到以".pt.trace"结尾的json文件;若以NPU不同版本为基准,指定文件与-d一致。 + - --profiling_path或-d(必选):比对性能数据文件所在路径。可以指定以“ascend_pt”或“ascend_ms”结尾的目录、ASCEND_PROFILER_OUTPUT目录或trace_view.json文件,指定trace_view.json无法显示算子的内存占用。 + - --benchmark_profiling_path或-bp(必选):基准性能数据文件所在路径。基准性能数据文件若以GPU为基准,指定到以“.pt.trace”结尾的json文件;若以NPU不同版本为基准,指定文件与-d一致。 - --output_path或-o(可选):性能比对结果存放的路径,默认保存在当前目录。 #### 脚本方式 @@ -129,13 +131,13 @@ MindSpore性能调试工具采集结果数据目录结构如下: ```bash # 进入mstt代码仓目录下的compare_tools目录 -cd mstt/profiler/compare_tools +cd mstt/profiler/msprof_analyze/compare_tools # 执行最简比对命令 python performance_compare.py [基准性能数据文件所在路径] [比对性能数据文件所在路径] --output_path=[比对结果文件存放路径] ``` - 基准性能数据文件所在路径(必选):若以GPU为基准,指定到以".pt.trace"结尾的json文件;若以NPU不同版本为基准,指定文件参考**比对性能数据文件所在路径**。 -- 比对性能数据文件所在路径(必选):可以指定以“ascend_pt”结尾的目录、ASCEND_PROFILER_OUTPUT目录或trace_view.json文件,指定trace_view.json无法显示算子的内存占用。 +- 比对性能数据文件所在路径(必选):可以指定以“ascend_pt”或“ascend_ms”结尾的目录、ASCEND_PROFILER_OUTPUT目录或trace_view.json文件,指定trace_view.json无法显示算子的内存占用。 - --output_path或-o(可选):性能比对结果存放的路径,默认保存在当前目录。 #### 通用参数说明 @@ -143,14 +145,15 @@ python 
performance_compare.py [基准性能数据文件所在路径] [比对性 | 参数名 | 说明 | 是否必选 | | ------------------------------ | ------------------------------------------------------------ | -------- | | --enable_profiling_compare | 开启总体性能比对。 | 否 | -| --enable_operator_compare | 开启算子性能比对。MindSpore场景暂不支持。该开关较耗时,建议只采集一个step的性能数据。 | 否 | +| --enable_operator_compare | 开启算子性能比对。MindSpore场景暂不支持。该开关较耗时,建议只采集一个step的性能数据。支持扩展参数请参见“**算子性能比对特有参数说明**”。 | 否 | | --enable_communication_compare | 开启通信性能比对。 | 否 | | --enable_memory_compare | 开启算子内存比对。MindSpore场景暂不支持。该开关较耗时,建议只采集一个step的性能数据。 | 否 | -| --enable_kernel_compare | 开启kernel性能比对。仅针对NPU与NPU比对的场景。需要使用性能数据中的kernel_details.csv文件。 | 否 | -| --enable_api_compare | 开启API性能比对。需要使用性能数据中的trace_view.csv文件。 | 否 | +| --enable_kernel_compare | 开启kernel性能比对。仅针对NPU与NPU比对的场景。支持扩展参数请参见“**kernel性能比对特有参数说明**”。 | 否 | +| --enable_api_compare | 开启API性能比对。MindSpore场景暂不支持。需要使用性能数据中的trace_view.csv文件。 | 否 | | --disable_details | 隐藏明细比对,只进行统计级比对。 | 否 | | --base_step | 基准性能数据step ID,配置后使用基准性能数据对应step的数据进行比对。为整数,需配置实际数据存在的step ID,默认未配置,比对所有性能数据,需要与--comparison_step同时配置。配置示例:--base_step=1。 | 否 | | --comparison_step | 比对性能数据step ID,配置后使用比对性能数据对应step的数据进行比对。为整数,需配置实际数据存在的step ID,默认未配置,比对所有性能数据,需要与--base_step同时配置。配置示例:--comparison_step=1。 | 否 | +| --force | 强制执行compare。配置后可强制跳过如下情况:
指定的目录、文件的用户属主不属于当前用户,忽略属主判断直接执行。
csv文件大于5G、json文件大于10G、db文件大于8G,忽略文件过大判断直接执行。
配置该参数表示开启强制执行,默认未配置表示关闭。 | 否 | 说明:以上开关均不设置的情况下,**工具默认开启所有的性能比对**,当用户设置了以上开关,则按照用户设置的开关进行性能比对,示例如下: @@ -168,16 +171,43 @@ python performance_compare.py [基准性能数据文件] [比对性能数据文 #### 算子性能比对特有参数说明 +MindSpore场景暂不支持。 + +--enable_operator_compare时支持。 + | 参数名 | 说明 | 是否必选 | | ----------------- | ------------------------------------------------------------ | -------- | | --gpu_flow_cat | 配置GPU trace中CPU侧算子与device kernel的连线标识,当GPU的Device Duration(us)均为0时设置。使用chrome://tracing打开GPU的json,右上角Flow events找到连线标识,将标识配置进该参数。使用示例:--gpu_flow_cat=async_gpu | 否 | | --use_input_shape | 开启算子精准匹配,默认关闭。使用示例:--use_input_shape | 否 | | --max_kernel_num | 设置CPU侧算子下发的最大kernel数量,当超过设定值时工具会自动往下找子算子,直至满足条件。默认仅比对最上层算子,粒度较粗;若想要更细粒度的算子比对,可设置该参数,参数值不得小于4,参数值设置越小,比对粒度越细。使用示例:--max_kernel_num=10 | 否 | | --op_name_map | 设置GPU与NPU等价的算子名称的映射关系,以字典形式存入。使用示例:--op_name_map={'Optimizer.step#SGD.step':'Optimizer.step#NpuFusedSGD.step'} | 否 | +| --disable_module | 算子性能比对。当前配置该参数时,无论是否采集module信息,均进行算子级别的比对。 | 否 | + +#### kernel性能比对特有参数说明 + +--enable_kernel_compare时支持。 + +| 参数名 | 说明 | 是否必选 | +| ----------------- | ------------------------------------------------------------ | -------- | +| --use_kernel_type | kernel比对模式,可取值:
true:代表使用op_statistic.csv为比对性能数据进行比对,输出简化的比对结果以及减少比对时间。
false:代表使用kernel_details.csv为比对性能数据进行比对,输出完整比对结果,默认值。 | 否 | + +#### 自定义比对算子 + +一般情况下compare功能按照默认配置的算子进行比对,若用户需要对特定算子的性能进行比对和分析,可以通过在[compare_config.ini](https://gitee.com/ascend/mstt/blob/master/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/compare_config.ini)文件中配置需要比对的算子名的识别关键词,之后再执行比对操作(msprof-analyze compare),比对结果在结果文件performance_comparison_result_{timestamp}.csv中呈现。 + +配置算子名的识别关键词为算子名称中的一部分,代表只要算子名称中包含该关键词,那么该算子会进行比对。 + +配置格式如下,算子名识别关键词之间用逗号隔开且名称为英文全小写: + +![config](img/config.PNG) + +上图中为compare_config.ini文件当前的默认配置,即默认进行如上类型算子的性能比对。 + +其中FA_MASK、CONV_MASK、MATMUL_MASK为GPU和NPU共有的上层应用operator的识别关键词,CUBE_MASK为底层GPU kernel cube识别的关键词,TRANS_MASK为底层NPU转换类kernel识别的关键词。 ## 比对结果说明 -MindSpore场景仅支持**总体性能**和**通信性能**的对比。 +MindSpore场景仅支持**总体性能**、**通信性能**和**kernel性能**的对比。 比对结果分为打屏和performance_comparison_result_{timestamp}.csv两种形式输出,其中打屏输出为概要信息,csv文件保存详细结果。 @@ -207,7 +237,7 @@ MindSpore场景仅支持**总体性能**和**通信性能**的对比。 | E2E Time(Not minimal profiling) | E2E总耗时,计算流端到端耗时。当存在Not minimal profiling时,表示该时间存在性能膨胀,会影响通信和调度耗时。 | | Other Time | AI CPU、DSA、TensorMove等其他算子耗时。 | -#### csv文件结果 +#### xlsx文件结果 总体性能比对结果在performance_comparison_result_*.xlsx中OverallMetrics的sheet页呈现时,示例如下: @@ -222,38 +252,44 @@ MindSpore场景仅支持**总体性能**和**通信性能**的对比。 | Duration Ratio | 执行耗时占E2E总耗时的比例。 | | Number | 计算算子的数量。 | -Index列字段说明: - -| 字段 | | | 说明 | -| ---------------------------- | :----------------- | ----------------------------------- | ------------------------------------------------------------ | -| Computing Time | | | 计算流耗时,计算流所有event耗时总和。如果有多条并发计算,计算流耗时对重叠部分只会计算一次。
NPU场景下,仅当采集性能数据的Level等级为L1及以上且aic_metrics取值为PipeUtilization时才可拆分出Computing Time的二级字段Flash Attention、Conv等。 | -| | Flash Attention | | Flash Attention算子。 | -| | | Flash Attention (Forward) (Cube) | Flash Attention前向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | -| | | Flash Attention (Forward) (Vector) | Flash Attention前向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | -| | | Flash Attention (Backward) (Cube) | Flash Attention反向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | -| | | Flash Attention (Backward) (Vector) | Flash Attention反向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | -| | Conv | | Conv算子。 | -| | | Conv (Forward) (Cube) | Conv前向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | -| | | Conv (Forward) (Vector) | Conv前向Vector算子。Conv前向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | -| | | Conv (Backward) (Cube) | Conv反向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | -| | | Conv (Backward) (Vector) | Conv反向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | -| | Matmul | | Matmul算子。 | -| | | Matmul (Cube) | Matmul算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | -| | | Matmul (Vector) | Matmul算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | -| | Paged Attention | | Paged Attention算子。 | -| | Vector | | Vector算子。 | -| | | Vector (Trans) | 转换类Vector算子,主要包含Cast、TransPose、TransData算子。(仅针对NPU数据) | -| | | Vector ( No Trans) | 非转换类Vector算子。 | -| | Cube | | 未识别出Flash Attention、Conv和Matmul的Cube算子。 | -| | SDMA (Tensor Move) | | 拷贝类任务。 | -| | Other | | AI CPU、DSA等其他算子。 | -| Uncovered Communication Time | | | 通信未掩盖耗时,包含卡间等待时间。 | -| | Wait | | 卡间同步等待耗时。(仅针对NPU数据) | -| | Transmit | | 通信传输耗时。 | -| Free Time | | | 调度耗时 = E2E耗时 - 算子耗时 - 通信不可掩盖耗时。Free的定义为Device侧既不在通信又不在计算的时间,因此包含拷贝时间(SDMA Time)。 | -| | SDMA | | NPU为除Tensor Move外的拷贝类任务,GPU为所有拷贝类任务。 | -| | Free | | 排除SDMA的空闲耗时。 | -| E2E Time | | | E2E总耗时,计算流端到端耗时。当存在Not minimal profiling时,表示该时间存在性能膨胀,会影响通信和调度耗时。 | +Index列完整字段说明: + +| 字段 | | | 说明 | +| ---------------------------- | :------------------ | ----------------------------------- | 
------------------------------------------------------------ | +| Computing Time | | | 计算流耗时,计算流所有event耗时总和。如果有多条并发计算,计算流耗时对重叠部分只会计算一次。
NPU场景下,仅当采集性能数据的Level等级为L1及以上且aic_metrics取值为PipeUtilization时才可拆分出Computing Time的二级字段Flash Attention、Conv等。 | +| | AllGatherMatmul | | AllGatherMatmul算子。MC²算子,仅为示例。 | +| | | Computing | AllGatherMatmul算子的计算算子。 | +| | | Communication | AllGatherMatmul算子的通信算子。 | +| | MatmulReduceScatter | | MatmulReduceScatter算子。MC²算子,仅为示例。 | +| | | Computing | MatmulReduceScatter算子的计算算子。 | +| | | Communication | MatmulReduceScatter算子的通信算子。 | +| | Flash Attention | | Flash Attention算子。 | +| | | Flash Attention (Forward) (Cube) | Flash Attention前向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | +| | | Flash Attention (Forward) (Vector) | Flash Attention前向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | +| | | Flash Attention (Backward) (Cube) | Flash Attention反向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | +| | | Flash Attention (Backward) (Vector) | Flash Attention反向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | +| | Conv | | Conv算子。 | +| | | Conv (Forward) (Cube) | Conv前向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | +| | | Conv (Forward) (Vector) | Conv前向Vector算子。Conv前向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | +| | | Conv (Backward) (Cube) | Conv反向算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | +| | | Conv (Backward) (Vector) | Conv反向算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | +| | Matmul | | Matmul算子。 | +| | | Matmul (Cube) | Matmul算子下发的所有Cube类Kernel,一般为执行该算子核心计算的算子。 | +| | | Matmul (Vector) | Matmul算子下发的所有Vector类Kernel,一般为插入的转换类算子,如TransData。 | +| | Paged Attention | | Paged Attention算子。 | +| | Vector | | Vector算子。 | +| | | Vector (Trans) | 转换类Vector算子,主要包含Cast、TransPose、TransData算子。(仅针对NPU数据) | +| | | Vector ( No Trans) | 非转换类Vector算子。 | +| | Cube | | 未识别出Flash Attention、Conv和Matmul的Cube算子。 | +| | SDMA (Tensor Move) | | 拷贝类任务。 | +| | Other | | AI CPU、DSA等其他算子。 | +| Uncovered Communication Time | | | 通信未掩盖耗时,包含卡间等待时间。 | +| | Wait | | 卡间同步等待耗时。(仅针对NPU数据) | +| | Transmit | | 通信传输耗时。 | +| Free Time | | | 调度耗时 = E2E耗时 - 算子耗时 - 通信不可掩盖耗时。Free的定义为Device侧既不在通信又不在计算的时间,因此包含拷贝时间(SDMA Time)。 | +| | SDMA | | NPU为除Tensor 
Move外的拷贝类任务,GPU为所有拷贝类任务。 | +| | Free | | 排除SDMA的空闲耗时。 | +| E2E Time | | | E2E总耗时,计算流端到端耗时。当存在Not minimal profiling时,表示该时间存在性能膨胀,会影响通信和调度耗时。 | 可以采取最简性能数据采集的方式来减少E2E耗时的性能膨胀,示例代码如下: @@ -301,7 +337,7 @@ MindSpore场景暂不支持。 - Module Level:Module的层级。 - Module Name:Module唯一标识名,如/ DynamicNet_0/ Linear_0。 - Operator Name:框架侧算子名,如aten::add。字段为[ TOTAL ]代表该module的总体情况。 -- Kernel Detail:算子详细信息。 +- Kernel Details:算子详细信息,包括:算子名、task id、task type、input shape、执行耗时。 - Device Self Time(ms):该模块调用的算子(排除子模块)在device侧执行的总耗时,单位ms。 - Number:该Module或算子被调用的次数。 - Device Total Time(ms):该模块调用的算子(包含子模块)在device侧执行的总耗时,单位ms。 @@ -317,7 +353,7 @@ ModuleCompare:模块及模块下算子比对的明细展示,可以查看每 - Module Level:Module的层级。 - Module Name:Module唯一标识名,如/ DynamicNet_0/ Linear_0。 - Operator Name:框架侧算子名,如aten::add。字段为[ TOTAL ]代表该module的总体情况。 -- Kernel Detail:算子详细信息。 +- Kernel Details:算子详细信息,包括:算子名、task id、task type、input shape、执行耗时。 - Device Self Time(us):该模块调用的算子(排除子模块)在device侧执行的总耗时,单位us。 - Device Total Time(us):该模块调用的算子(包含子模块)在device侧执行的总耗时,单位us。 - Device Total Time Diff(us):GPU与NPU的Device Total Time(us)差值。 @@ -365,18 +401,30 @@ MindSpore场景暂不支持。 仅针对NPU与NPU比对的场景。 -kernel比对结果在performance_comparison_result_*.xlsx中KernelCompare页呈现。 +- 当--use_kernel_type开关为false时,kernel比对结果在performance_comparison_result_*.xlsx中KernelCompare页呈现。 -按照Kernel(Kernel类型)和Input Shapes(输入Shape)分组统计,统计信息包括: + 按照Kernel Type(Kernel类型)和Input Shapes(输入Shape)分组统计,统计信息包括: -- Total Duration(us):总耗时,单位us。 -- Avg Duration(us):平均耗时,单位us。 -- Max Duration(us):最大耗时,单位us。 -- Min Duration(us):最小耗时,单位us。 -- Calls:调用次数。 + - Total Duration(us):总耗时,单位us。 + - Avg Duration(us):平均耗时,单位us。 + - Max Duration(us):最大耗时,单位us。 + - Min Duration(us):最小耗时,单位us。 + - Calls:调用次数。 + +- 当--use_kernel_type开关为true时,kernel比对结果在performance_comparison_result_*.xlsx中KernelTypeCompare页呈现。 + + 按照Kernel Type(Kernel类型)和Core Type(AI核类型)分组统计,统计信息包括: + + - Total Duration(us):总耗时,单位us。 + - Avg Duration(us):平均耗时,单位us。 + - Max Duration(us):最大耗时,单位us。 + - Min Duration(us):最小耗时,单位us。 + - Calls:调用次数。 ### 
API性能 +MindSpore场景暂不支持。 + API比对结果在performance_comparison_result_*.xlsx中ApiCompare页呈现。 按照api name(API名称)组统计,统计信息包括: diff --git a/profiler/cluster_analyse/communication_group/__init__.py b/profiler/msprof_analyze/compare_tools/__init__.py similarity index 100% rename from profiler/cluster_analyse/communication_group/__init__.py rename to profiler/msprof_analyze/compare_tools/__init__.py diff --git a/profiler/test/st/compare_tools/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/__init__.py similarity index 100% rename from profiler/test/st/compare_tools/__init__.py rename to profiler/msprof_analyze/compare_tools/compare_backend/__init__.py diff --git a/profiler/test/ut/advisor/advisor_backend/tools/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/__init__.py similarity index 100% rename from profiler/test/ut/advisor/advisor_backend/tools/__init__.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/__init__.py diff --git a/profiler/compare_tools/compare_backend/comparator/api_compare_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/api_compare_comparator.py similarity index 58% rename from profiler/compare_tools/compare_backend/comparator/api_compare_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/api_compare_comparator.py index bc5810068b04e04aa935ce252ffd127380dd855e..bea449255768f7d5c5ed597682e47fafad2f2802 100644 --- a/profiler/compare_tools/compare_backend/comparator/api_compare_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/api_compare_comparator.py @@ -1,6 +1,20 @@ -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.utils.constant import Constant -from compare_backend.utils.common_func import update_order_id +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.utils.common_func import update_order_id +from msprof_analyze.prof_common.constant import Constant class ApiCompareComparator(BaseComparator): diff --git a/profiler/compare_tools/compare_backend/comparator/base_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/base_comparator.py similarity index 54% rename from profiler/compare_tools/compare_backend/comparator/base_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/base_comparator.py index 8012dfae94440b7e17613f432770ec8b63ece431..49890167485bb7dc48998bfc7fa4ce73bf8f6b4f 100644 --- a/profiler/compare_tools/compare_backend/comparator/base_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/base_comparator.py @@ -1,3 +1,17 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. from abc import ABC, abstractmethod diff --git a/profiler/compare_tools/compare_backend/comparator/communication_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/communication_comparator.py similarity index 42% rename from profiler/compare_tools/compare_backend/comparator/communication_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/communication_comparator.py index f7580bec89a85b8d23e0ec878eda944d95e69f3f..7710d2cbb7fd1bc2c67088d59dfd6f9ad618fa7a 100644 --- a/profiler/compare_tools/compare_backend/comparator/communication_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/communication_comparator.py @@ -1,7 +1,22 @@ -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.compare_bean.communication_bean import CommunicationBean -from compare_backend.utils.constant import Constant -from compare_backend.utils.common_func import update_order_id +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.communication_bean import CommunicationBean +from msprof_analyze.compare_tools.compare_backend.utils.common_func import update_order_id + +from msprof_analyze.prof_common.constant import Constant class CommunicationComparator(BaseComparator): @@ -17,4 +32,3 @@ class CommunicationComparator(BaseComparator): for comm_name, comm_data in comparison_data.items(): self._rows.extend(CommunicationBean(comm_name, {}, comm_data).rows) update_order_id(self._rows) - diff --git a/profiler/compare_tools/compare_backend/comparator/kernel_compare_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/kernel_compare_comparator.py similarity index 63% rename from profiler/compare_tools/compare_backend/comparator/kernel_compare_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/kernel_compare_comparator.py index 13c0f776af60f7f250dc22b084cf251733f4c47d..89e1d3e0ada4982a8c32bd0325ec0e9cad537746 100644 --- a/profiler/compare_tools/compare_backend/comparator/kernel_compare_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/kernel_compare_comparator.py @@ -1,6 +1,20 @@ -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.utils.constant import Constant -from compare_backend.utils.common_func import update_order_id +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.utils.common_func import update_order_id +from msprof_analyze.prof_common.constant import Constant class KernelCompareComparator(BaseComparator): @@ -32,4 +46,4 @@ class KernelCompareComparator(BaseComparator): if comparison_aggregated_kernels: for _, comparison_data in comparison_aggregated_kernels.items(): self._rows.append(self._bean([], comparison_data).row) - update_order_id(self._rows) \ No newline at end of file + update_order_id(self._rows) diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/comparator/kernel_type_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/kernel_type_comparator.py new file mode 100644 index 0000000000000000000000000000000000000000..44e117c82f335a04d05c6c01534b57c5f6c33d01 --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/kernel_type_comparator.py @@ -0,0 +1,34 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.op_stastic_bean import OpStatisticBean +from msprof_analyze.compare_tools.compare_backend.utils.common_func import update_order_id +from msprof_analyze.prof_common.constant import Constant + + +class KernelTypeComparator(BaseComparator): + def __init__(self, origin_data: dict, bean: any): + super().__init__(origin_data, bean) + + def _compare(self): + base_kernels = self._origin_data.get(Constant.BASE_DATA, {}) + comparison_kernels = self._origin_data.get(Constant.COMPARISON_DATA, {}) + for key, base_kernel in base_kernels.items(): + comparison_kernel = comparison_kernels.pop(key, OpStatisticBean({})) + self._rows.append(self._bean(base_kernel, comparison_kernel).row) + for comparison_kernel in comparison_kernels.values(): + self._rows.append(self._bean(OpStatisticBean({}), comparison_kernel).row) + self._rows.sort(key=lambda x: x[-2], reverse=True) # order by diff column + update_order_id(self._rows) diff --git a/profiler/compare_tools/compare_backend/comparator/module_comparetor.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/module_comparetor.py similarity index 58% rename from profiler/compare_tools/compare_backend/comparator/module_comparetor.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/module_comparetor.py index 49c50b53c5a1b00bd17b7281d80b61d5011cb59a..11c701a55a223b5e42404e9841edb284c2bc3720 100644 --- a/profiler/compare_tools/compare_backend/comparator/module_comparetor.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/module_comparetor.py @@ -1,6 +1,23 @@ -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.utils.common_func import update_order_id -from 
compare_backend.utils.constant import Constant +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.utils.common_func import update_order_id +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger + +logger = get_logger() class ModuleComparator(BaseComparator): @@ -33,4 +50,4 @@ class ModuleComparator(BaseComparator): index += 1 update_order_id(self._rows) if not any(row[-1] != Constant.NA for row in self._rows): - print(f"[WARNING] If you want to see the operator's call stack, you must enable with_stack switch.") + logger.warning("If you want to see the operator's call stack, you must enable with_stack switch.") diff --git a/profiler/compare_tools/compare_backend/comparator/module_statistic_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/module_statistic_comparator.py similarity index 73% rename from profiler/compare_tools/compare_backend/comparator/module_statistic_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/module_statistic_comparator.py index e09108f3cbe3744068daf6c5316dc318aea53177..7fee4a361dfefce9ab351b1061ffdeddb831e5e9 100644 --- 
a/profiler/compare_tools/compare_backend/comparator/module_statistic_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/module_statistic_comparator.py @@ -1,7 +1,21 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. from collections import OrderedDict -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.utils.common_func import update_order_id +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.utils.common_func import update_order_id class ModuleStatisticComparator(BaseComparator): diff --git a/profiler/compare_tools/compare_backend/comparator/operator_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/operator_comparator.py similarity index 37% rename from profiler/compare_tools/compare_backend/comparator/operator_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/operator_comparator.py index cc475116cab59104a049689292f25f339a7285ce..1091da79c34bcbcf23cc7c4d4d38729362106a72 100644 --- a/profiler/compare_tools/compare_backend/comparator/operator_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/operator_comparator.py @@ -1,4 +1,18 @@ -from compare_backend.comparator.base_comparator import BaseComparator +# Copyright (c) 
2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator class OperatorComparator(BaseComparator): diff --git a/profiler/compare_tools/compare_backend/comparator/operator_statistic_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/operator_statistic_comparator.py similarity index 59% rename from profiler/compare_tools/compare_backend/comparator/operator_statistic_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/operator_statistic_comparator.py index 73aecf6f1283242311bcb0e848bd94f0f1afa377..db1db36f724aab6f84165459090b99d6be5c6d6a 100644 --- a/profiler/compare_tools/compare_backend/comparator/operator_statistic_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/operator_statistic_comparator.py @@ -1,5 +1,19 @@ -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.utils.common_func import update_order_id +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.utils.common_func import update_order_id class OperatorStatisticComparator(BaseComparator): diff --git a/profiler/compare_tools/compare_backend/comparator/overall_metrics_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/overall_metrics_comparator.py similarity index 86% rename from profiler/compare_tools/compare_backend/comparator/overall_metrics_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/overall_metrics_comparator.py index d438dc41d563b163d14a1b391b2ef4a301144dc0..8a80262c6abc91e4db399f993a502faff9fb4ace 100644 --- a/profiler/compare_tools/compare_backend/comparator/overall_metrics_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/overall_metrics_comparator.py @@ -14,9 +14,10 @@ # limitations under the License. 
from math import isclose -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig + +from msprof_analyze.prof_common.constant import Constant class OverallMetricsComparator(BaseComparator): diff --git a/profiler/compare_tools/compare_backend/comparator/overall_performance_comparator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/overall_performance_comparator.py similarity index 82% rename from profiler/compare_tools/compare_backend/comparator/overall_performance_comparator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparator/overall_performance_comparator.py index 09d8688cf231ba713a2f731c25e1da7d54aa5ddb..b1c11a71e726d83f1f828b9bd0c9c09226cec5f7 100644 --- a/profiler/compare_tools/compare_backend/comparator/overall_performance_comparator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparator/overall_performance_comparator.py @@ -1,5 +1,20 @@ -from compare_backend.comparator.base_comparator import BaseComparator -from compare_backend.utils.constant import Constant +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.comparator.base_comparator import BaseComparator + +from msprof_analyze.prof_common.constant import Constant class OverallPerformanceComparator(BaseComparator): @@ -64,14 +79,14 @@ class OverallPerformanceComparator(BaseComparator): else: comp_col.extend( [f'{comp_profiling_info.communication_not_overlapped: .3f}s({comp_profiling_info.wait_time:.3f}s)']) - if base_profiling_info.RDMA_bandwidth or comp_profiling_info.RDMA_bandwidth: + if base_profiling_info.rdma_bandwidth or comp_profiling_info.rdma_bandwidth: self._headers.extend(['RDMA Bandwidth']) - base_col.append(f'{base_profiling_info.RDMA_bandwidth:.3f}GB/s') - comp_col.append(f'{comp_profiling_info.RDMA_bandwidth:.3f}GB/s') - if base_profiling_info.SDMA_bandwidth or comp_profiling_info.SDMA_bandwidth: + base_col.append(f'{base_profiling_info.rdma_bandwidth:.3f}GB/s') + comp_col.append(f'{comp_profiling_info.rdma_bandwidth:.3f}GB/s') + if base_profiling_info.sdma_bandwidth or comp_profiling_info.sdma_bandwidth: self._headers.extend(['SDMA Bandwidth']) - base_col.append(f'{base_profiling_info.SDMA_bandwidth:.3f}GB/s') - comp_col.append(f'{comp_profiling_info.SDMA_bandwidth:.3f}GB/s') + base_col.append(f'{base_profiling_info.sdma_bandwidth:.3f}GB/s') + comp_col.append(f'{comp_profiling_info.sdma_bandwidth:.3f}GB/s') if base_profiling_info.sdma_time or comp_profiling_info.sdma_time: self._headers.append('SDMA Time(Num)') base_col.append(f'{base_profiling_info.sdma_time:.3f}s({base_profiling_info.sdma_num})') diff --git a/profiler/test/ut/advisor/common/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/__init__.py similarity index 100% rename from profiler/test/ut/advisor/common/__init__.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/__init__.py diff --git a/profiler/compare_tools/compare_backend/compare_bean/api_compare_bean.py 
b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/api_compare_bean.py similarity index 39% rename from profiler/compare_tools/compare_backend/compare_bean/api_compare_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/api_compare_bean.py index 4f3340828b20a2ec48ef53d76321794a6bb26b9b..faf98fb96d36bc08a38926dd0207b72a5f7b2c56 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/api_compare_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/api_compare_bean.py @@ -1,9 +1,26 @@ -from compare_backend.utils.common_func import calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig + +from msprof_analyze.prof_common.constant import Constant class ApiInfo: + __slots__ = ['_data_list', 'name', 'total_dur', 'self_time', 'avg_dur', 'number'] + def __init__(self, op_name: str, data_list: list): self._data_list = data_list self.name = op_name @@ -23,6 +40,7 @@ class ApiInfo: class ApiCompareBean: + __slots__ = ['_name', '_base_api', '_comparison_api'] TABLE_NAME = Constant.API_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -34,14 +52,17 @@ class ApiCompareBean: @property def row(self): - row = [None, self._name, - self._base_api.total_dur, self._base_api.self_time, self._base_api.avg_dur, self._base_api.number, - self._comparison_api.total_dur, self._comparison_api.self_time, - self._comparison_api.avg_dur, self._comparison_api.number] - diff_fields = [calculate_diff_ratio(self._base_api.total_dur, self._comparison_api.total_dur)[1], - calculate_diff_ratio(self._base_api.self_time, self._comparison_api.self_time)[1], - calculate_diff_ratio(self._base_api.avg_dur, self._comparison_api.avg_dur)[1], - calculate_diff_ratio(self._base_api.number, self._comparison_api.number)[1]] + row = [ + None, self._name, + self._base_api.total_dur, self._base_api.self_time, self._base_api.avg_dur, self._base_api.number, + self._comparison_api.total_dur, self._comparison_api.self_time, + self._comparison_api.avg_dur, self._comparison_api.number + ] + diff_fields = [ + calculate_diff_ratio(self._base_api.total_dur, self._comparison_api.total_dur)[1], + calculate_diff_ratio(self._base_api.self_time, self._comparison_api.self_time)[1], + calculate_diff_ratio(self._base_api.avg_dur, self._comparison_api.avg_dur)[1], + calculate_diff_ratio(self._base_api.number, self._comparison_api.number)[1] + ] row.extend(diff_fields) return row - 
diff --git a/profiler/compare_tools/compare_backend/compare_bean/communication_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/communication_bean.py similarity index 66% rename from profiler/compare_tools/compare_backend/compare_bean/communication_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/communication_bean.py index 94813193d69b4a1f92cc88dbd1eb31d6f96ff608..1801a75cb807959fa26addbee8a563be6ebab434 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/communication_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/communication_bean.py @@ -1,9 +1,24 @@ -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig -from compare_backend.utils.common_func import calculate_diff_ratio +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.prof_common.constant import Constant class CommunicationInfo: + __slots__ = ['comm_op_name', 'task_name', 'calls', 'total_duration', 'avg_duration', 'max_duration', 'min_duration'] def __init__(self, name: str, data_list: list, is_task: bool): self.comm_op_name = None @@ -24,6 +39,7 @@ class CommunicationInfo: class CommunicationBean: + __slots__ = ['_name', '_base_comm', '_comparison_comm'] TABLE_NAME = Constant.COMMUNICATION_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -62,10 +78,12 @@ class CommunicationBean: @classmethod def _get_row(cls, base_info: CommunicationInfo, comparison_info: CommunicationInfo, is_task: bool) -> list: - row = [None, base_info.comm_op_name, base_info.task_name, base_info.calls, base_info.total_duration, - base_info.avg_duration, base_info.max_duration, base_info.min_duration, comparison_info.comm_op_name, - comparison_info.task_name, comparison_info.calls, comparison_info.total_duration, - comparison_info.avg_duration, comparison_info.max_duration, comparison_info.min_duration] + row = [ + None, base_info.comm_op_name, base_info.task_name, base_info.calls, base_info.total_duration, + base_info.avg_duration, base_info.max_duration, base_info.min_duration, comparison_info.comm_op_name, + comparison_info.task_name, comparison_info.calls, comparison_info.total_duration, + comparison_info.avg_duration, comparison_info.max_duration, comparison_info.min_duration + ] diff_fields = [None, None] if is_task else calculate_diff_ratio(base_info.total_duration, comparison_info.total_duration) row.extend(diff_fields) diff --git a/profiler/compare_tools/compare_backend/compare_bean/kernel_compare_bean.py 
b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/kernel_compare_bean.py similarity index 52% rename from profiler/compare_tools/compare_backend/compare_bean/kernel_compare_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/kernel_compare_bean.py index 06d9783970ef5f56fbf6206981deb41f58f66848..3f64f28ec6c62b2c61548b7d2f189339670aea06 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/kernel_compare_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/kernel_compare_bean.py @@ -1,9 +1,25 @@ -from compare_backend.utils.common_func import calculate_diff_ratio, convert_to_float -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio, convert_to_float +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.prof_common.constant import Constant class KernelCompareInfo: + __slots__ = ['_kernel_type', '_input_shapes', '_total_dur', '_number', '_max_dur', '_min_dur'] + def __init__(self, data_list: list): self._kernel_type = None self._input_shapes = None @@ -27,29 +43,30 @@ class KernelCompareInfo: @property def input_shapes(self): return self._input_shapes - + @property def total_dur(self): return self._total_dur if self._total_dur else 0.0 - + @property def number(self): return self._number - + @property def max_dur(self): return self._max_dur - + @property def min_dur(self): return self._min_dur - + @property def avg_dur(self): return round(self._total_dur / self._number, 2) if self._total_dur and self._number else 0.0 class KernelCompareBean: + __slots__ = ['_base_kernel', '_comparison_kernel', '_kernel_type', '_input_shapes'] TABLE_NAME = Constant.KERNEL_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -64,12 +81,16 @@ class KernelCompareBean: @property def row(self): - row = [None, self._kernel_type, self._input_shapes, - self._base_kernel.total_dur, self._base_kernel.avg_dur, - self._base_kernel.max_dur, self._base_kernel.min_dur, self._base_kernel.number, - self._comparison_kernel.total_dur, self._comparison_kernel.avg_dur, - self._comparison_kernel.max_dur, self._comparison_kernel.min_dur, self._comparison_kernel.number] - diff_fields = [calculate_diff_ratio(self._base_kernel.total_dur, self._comparison_kernel.total_dur)[1], - calculate_diff_ratio(self._base_kernel.avg_dur, self._comparison_kernel.avg_dur)[1]] + row = [ + None, self._kernel_type, self._input_shapes, + self._base_kernel.total_dur, self._base_kernel.avg_dur, + self._base_kernel.max_dur, self._base_kernel.min_dur, 
self._base_kernel.number, + self._comparison_kernel.total_dur, self._comparison_kernel.avg_dur, + self._comparison_kernel.max_dur, self._comparison_kernel.min_dur, self._comparison_kernel.number + ] + diff_fields = [ + calculate_diff_ratio(self._base_kernel.total_dur, self._comparison_kernel.total_dur)[1], + calculate_diff_ratio(self._base_kernel.avg_dur, self._comparison_kernel.avg_dur)[1] + ] row.extend(diff_fields) - return row \ No newline at end of file + return row diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/kernel_type_compare_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/kernel_type_compare_bean.py new file mode 100644 index 0000000000000000000000000000000000000000..1434f42d5f6a0d0300010700cf638536c6b4b03e --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/kernel_type_compare_bean.py @@ -0,0 +1,40 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.prof_common.constant import Constant + + +class KernelTypeCompareBean: + TABLE_NAME = Constant.KERNEL_TYPE_TABLE + HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) + OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) + + def __init__(self, base_kernel, comparison_kernel): + self._base_kernel = base_kernel + self._comparison_kernel = comparison_kernel + + @property + def row(self): + kernel_type = self._base_kernel.kernel_type or self._comparison_kernel.kernel_type + core_type = self._base_kernel.core_type or self._comparison_kernel.core_type + return [ + None, kernel_type, core_type, self._base_kernel.total_dur, self._base_kernel.avg_dur, + self._base_kernel.max_dur, self._base_kernel.min_dur, self._base_kernel.calls, + self._comparison_kernel.total_dur, self._comparison_kernel.avg_dur, self._comparison_kernel.max_dur, + self._comparison_kernel.min_dur, self._comparison_kernel.calls, + calculate_diff_ratio(self._base_kernel.total_dur, self._comparison_kernel.total_dur)[1], + calculate_diff_ratio(self._base_kernel.avg_dur, self._comparison_kernel.avg_dur)[1] + ] diff --git a/profiler/compare_tools/compare_backend/compare_bean/memory_compare_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/memory_compare_bean.py similarity index 44% rename from profiler/compare_tools/compare_backend/compare_bean/memory_compare_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/memory_compare_bean.py index e1baa175311ae42765757feb8b13bbb3918c3727..7a269d6dbcca4393fcd937d3cc2ab69905f6bb2b 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/memory_compare_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/memory_compare_bean.py @@ -1,11 +1,26 @@ -from compare_backend.utils.common_func 
import calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig -from compare_backend.utils.torch_op_node import TorchOpNode -from compare_backend.utils.tree_builder import TreeBuilder +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.compare_tools.compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.prof_common.constant import Constant class MemoryCompareBean: + __slots__ = ['_index', '_base_op', '_comparison_op'] TABLE_NAME = Constant.MEMORY_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -17,16 +32,20 @@ class MemoryCompareBean: @property def row(self): - row = [self._index + 1, self._base_op.operator_name, self._base_op.input_shape, self._base_op.input_type, - self._base_op.memory_details, self._base_op.size, self._comparison_op.operator_name, - self._comparison_op.input_shape, self._comparison_op.input_type, self._comparison_op.memory_details, - self._comparison_op.size] + row = [ + self._index + 1, self._base_op.operator_name, 
self._base_op.input_shape, self._base_op.input_type, + self._base_op.memory_details, self._base_op.size, self._comparison_op.operator_name, + self._comparison_op.input_shape, self._comparison_op.input_type, self._comparison_op.memory_details, + self._comparison_op.size + ] diff_fields = calculate_diff_ratio(self._base_op.size, self._comparison_op.size) row.extend(diff_fields) return row class MemoryInfo: + __slots__ = ['operator_name', 'input_shape', 'input_type', 'size', 'memory_details', '_memory_list'] + def __init__(self, torch_op: TorchOpNode): self.operator_name = None self.input_shape = None diff --git a/profiler/compare_tools/compare_backend/compare_bean/memory_statistic_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/memory_statistic_bean.py similarity index 46% rename from profiler/compare_tools/compare_backend/compare_bean/memory_statistic_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/memory_statistic_bean.py index 9ccc2cb76da9158355aacb0994a1b66c0be97fb5..e4f7792b5eb3e12aa0c66f784cc827b7761bf901 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/memory_statistic_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/memory_statistic_bean.py @@ -1,10 +1,25 @@ -from compare_backend.utils.common_func import calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.tree_builder import TreeBuilder -from compare_backend.utils.excel_config import ExcelConfig +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.prof_common.constant import Constant class MemoryStatisticBean: + __slots__ = ['_name', '_base_info', '_comparison_info'] TABLE_NAME = Constant.MEMORY_TOP_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -16,14 +31,18 @@ class MemoryStatisticBean: @property def row(self): - row = [None, self._name, self._base_info.duration_ms, self._base_info.size_mb, self._base_info.number, - self._comparison_info.duration_ms, self._comparison_info.size_mb, self._comparison_info.number] + row = [ + None, self._name, self._base_info.duration_ms, self._base_info.size_mb, self._base_info.number, + self._comparison_info.duration_ms, self._comparison_info.size_mb, self._comparison_info.number + ] diff_fields = calculate_diff_ratio(self._base_info.size_mb, self._comparison_info.size_mb) row.extend(diff_fields) return row class MemoryStatisticInfo: + __slots__ = ['_data_list', 'duration_ms', 'size_mb', 'number'] + def __init__(self, data_list: list): self._data_list = data_list self.duration_ms = 0 diff --git a/profiler/compare_tools/compare_backend/compare_bean/module_compare_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/module_compare_bean.py similarity index 71% rename from 
profiler/compare_tools/compare_backend/compare_bean/module_compare_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/module_compare_bean.py index abfce00d83d6c1a914aa71481277e2dc1c195f17..107a2d79caae5c21cc6ef71761ba2447cc58cc11 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/module_compare_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/module_compare_bean.py @@ -1,12 +1,30 @@ -from compare_backend.utils.common_func import longest_common_subsequence_matching, calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig -from compare_backend.utils.module_node import ModuleNode -from compare_backend.utils.name_function import NameFunction -from compare_backend.utils.torch_op_node import TorchOpNode +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.common_func import ( + longest_common_subsequence_matching, + calculate_diff_ratio +) +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.utils.module_node import ModuleNode +from msprof_analyze.compare_tools.compare_backend.utils.name_function import NameFunction +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.prof_common.constant import Constant class ModuleCompareBean: + __slots__ = ['_base_module', '_comparison_module', 'module_class', 'module_level', 'module_name'] TABLE_NAME = Constant.MODULE_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -51,6 +69,9 @@ class ModuleCompareBean: class ModuleInfo: + __slots__ = ['module_class', 'module_level', 'module_name', 'device_self_time', 'device_total_time', + 'top_layer_ops', 'call_stack'] + def __init__(self, module: ModuleNode): self.module_class = "" self.module_level = "" @@ -70,6 +91,8 @@ class ModuleInfo: class OpInfo: + __slots__ = ['operator_name', 'kernel_details', 'device_self_time', 'call_stack'] + def __init__(self, operator: TorchOpNode): self.operator_name = "" self.kernel_details = "" diff --git a/profiler/compare_tools/compare_backend/compare_bean/module_statistic_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/module_statistic_bean.py similarity index 64% rename from profiler/compare_tools/compare_backend/compare_bean/module_statistic_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/module_statistic_bean.py index 97fc98bdd354e1ebe1fbb3fc44def4eaf3059235..2ca8b7e11946566038747cd8916d2a843daa4601 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/module_statistic_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/module_statistic_bean.py 
@@ -1,11 +1,26 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import re -from compare_backend.utils.common_func import calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.prof_common.constant import Constant class ModuleStatisticBean: + __slots__ = ['_module_name', '_module_class', '_module_level', '_base_info', '_comparison_info'] TABLE_NAME = Constant.MODULE_TOP_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -41,11 +56,13 @@ class ModuleStatisticBean: self._comparison_info.device_total_dur_ms) self_diff, _ = calculate_diff_ratio(self._base_info.device_self_dur_ms, self._comparison_info.device_self_dur_ms) -        row = [None, self._module_class, self._module_level, self._module_name, "[ TOTAL ]", None, -               self._base_info.device_self_dur_ms, self._base_info.number, self._base_info.device_total_dur_ms, -               None, self._comparison_info.device_self_dur_ms, self._comparison_info.number, -               self._comparison_info.device_total_dur_ms, total_diff, self_diff, -               total_ratio, self._base_info.call_stack, self._comparison_info.call_stack] +        row = [ +            None,
self._module_class, self._module_level, self._module_name, "[ TOTAL ]", None, + self._base_info.device_self_dur_ms, self._base_info.number, self._base_info.device_total_dur_ms, + None, self._comparison_info.device_self_dur_ms, self._comparison_info.number, + self._comparison_info.device_total_dur_ms, total_diff, self_diff, + total_ratio, self._base_info.call_stack, self._comparison_info.call_stack + ] return row def get_detail_rows(self): @@ -57,23 +74,29 @@ class ModuleStatisticBean: base_kernel_detals, com_kernel_details = self._get_kernel_detail_rows(base_dur_dict.get("detail", {}), com_dur_dict.get("detail", {})) self_diff, self_ratio = calculate_diff_ratio(sum(base_dur_list), sum(com_dur_list)) - row = [None, self._module_class, self._module_level, self._module_name, op_name, base_kernel_detals, - sum(base_dur_list), len(base_dur_list), None, com_kernel_details, sum(com_dur_list), - len(com_dur_list), None, None, self_diff, self_ratio, None, None] + row = [ + None, self._module_class, self._module_level, self._module_name, op_name, base_kernel_detals, + sum(base_dur_list), len(base_dur_list), None, com_kernel_details, sum(com_dur_list), + len(com_dur_list), None, None, self_diff, self_ratio, None, None + ] rows.append(row) for op_name, com_dur_dict in self._comparison_info.api_dict.items(): com_dur_list = com_dur_dict.get("total", []) base_kernel_detals, com_kernel_details = self._get_kernel_detail_rows({}, com_dur_dict.get("detail", {})) self_diff, self_ratio = calculate_diff_ratio(0, sum(com_dur_list)) - row = [None, self._module_class, self._module_level, self._module_name, op_name, base_kernel_detals, 0, 0, - None, com_kernel_details, sum(com_dur_list), len(com_dur_list), None, None, self_diff, - self_ratio, None, None] + row = [ + None, self._module_class, self._module_level, self._module_name, op_name, base_kernel_detals, 0, 0, + None, com_kernel_details, sum(com_dur_list), len(com_dur_list), None, None, self_diff, + self_ratio, None, None + ] 
rows.append(row) return rows class ModuleStatisticInfo: + __slots__ = ['_data_list', 'device_self_dur_ms', 'device_total_dur_ms', 'call_stack', 'number', 'api_dict'] + def __init__(self, data_list: list): self._data_list = data_list self.device_self_dur_ms = 0 diff --git a/profiler/compare_tools/compare_backend/compare_bean/operator_compare_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/operator_compare_bean.py similarity index 44% rename from profiler/compare_tools/compare_backend/compare_bean/operator_compare_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/operator_compare_bean.py index e7ecfedddd7c2f5dd33664b1556a7b0245e295d1..3770afc8445ee5234c59682b21a2a19c3186e379 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/operator_compare_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/operator_compare_bean.py @@ -1,11 +1,26 @@ -from compare_backend.utils.common_func import calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig -from compare_backend.utils.torch_op_node import TorchOpNode -from compare_backend.utils.tree_builder import TreeBuilder +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.compare_tools.compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.prof_common.constant import Constant class OperatorCompareBean: + __slots__ = ['_index', '_base_op', '_comparison_op'] TABLE_NAME = Constant.OPERATOR_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -17,16 +32,20 @@ class OperatorCompareBean: @property def row(self): - row = [self._index + 1, self._base_op.operator_name, self._base_op.input_shape, self._base_op.input_type, - self._base_op.kernel_details, self._base_op.device_dur, self._comparison_op.operator_name, - self._comparison_op.input_shape, self._comparison_op.input_type, self._comparison_op.kernel_details, - self._comparison_op.device_dur] + row = [ + self._index + 1, self._base_op.operator_name, self._base_op.input_shape, self._base_op.input_type, + self._base_op.kernel_details, self._base_op.device_dur, self._comparison_op.operator_name, + self._comparison_op.input_shape, self._comparison_op.input_type, self._comparison_op.kernel_details, + self._comparison_op.device_dur + ] diff_fields = calculate_diff_ratio(self._base_op.device_dur, self._comparison_op.device_dur) row.extend(diff_fields) return row class OperatorInfo: + __slots__ = ['operator_name', 'input_shape', 'input_type', 'device_dur', 'kernel_details', '_kernel_list'] + def __init__(self, torch_op: TorchOpNode): self.operator_name = None self.input_shape = None diff --git a/profiler/compare_tools/compare_backend/compare_bean/operator_statistic_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/operator_statistic_bean.py similarity index 45% rename from 
profiler/compare_tools/compare_backend/compare_bean/operator_statistic_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/operator_statistic_bean.py index 457ae55acbd275dcf3e2f3c584114af8b9d55d17..ab13584cb5e2ed59497a52bb8a387f6a5bab440b 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/operator_statistic_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/operator_statistic_bean.py @@ -1,10 +1,25 @@ -from compare_backend.utils.common_func import calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig -from compare_backend.utils.tree_builder import TreeBuilder +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.prof_common.constant import Constant class OperatorStatisticBean: + __slots__ = ['_name', '_base_info', '_comparison_info'] TABLE_NAME = Constant.OPERATOR_TOP_TABLE HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) @@ -16,14 +31,18 @@ class OperatorStatisticBean: @property def row(self): - row = [None, self._name, self._base_info.device_dur_ms, self._base_info.number, - self._comparison_info.device_dur_ms, self._comparison_info.number] + row = [ + None, self._name, self._base_info.device_dur_ms, self._base_info.number, + self._comparison_info.device_dur_ms, self._comparison_info.number + ] diff_fields = calculate_diff_ratio(self._base_info.device_dur_ms, self._comparison_info.device_dur_ms) row.extend(diff_fields) return row class OperatorStatisticInfo: + __slots__ = ['_data_list', 'device_dur_ms', 'number'] + def __init__(self, data_list: list): self._data_list = data_list self.device_dur_ms = 0 diff --git a/profiler/test/ut/compare_tools/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/__init__.py similarity index 100% rename from profiler/test/ut/compare_tools/__init__.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/__init__.py diff --git a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/compare_event.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/compare_event.py similarity index 67% rename from profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/compare_event.py rename to 
profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/compare_event.py index 463e82430896923a8d21c44ab9e6f9b952855a84..1f921334aa3615d838a1086302c5b52c17a0ec34 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/compare_event.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/compare_event.py @@ -1,10 +1,26 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
from decimal import Decimal -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.prof_common.constant import Constant class KernelEvent: + __slots__ = ['_event', '_device_type'] + def __init__(self, event: TraceEventBean, device_type: str): self._event = event self._device_type = device_type @@ -29,7 +45,8 @@ class KernelEvent: def kernel_details(self): if self._device_type == Constant.GPU: return f"{self.kernel_name} [duration: {self.device_dur}]\n" - return f"{self.kernel_name}, {self.task_id}, {self.task_type} [duration: {self.device_dur}]\n" + input_shape = f", [input shapes: {self._event.input_shapes}]" if self._event.input_shapes else "" + return f"{self.kernel_name}, {self.task_id}, {self.task_type}{input_shape} [duration: {self.device_dur}]\n" class MemoryEvent: diff --git a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py similarity index 70% rename from profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py index f29839724a64078e86eeedc59e14e50e2cf2655d..d9f6e519da264836673ae5a58fca64f3acd06109 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py @@ -1,13 +1,31 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import math from decimal import Decimal import pandas as pd -from compare_backend.utils.common_func import convert_to_float, convert_to_decimal -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.utils.common_func import convert_to_float, convert_to_decimal +from msprof_analyze.compare_tools.compare_backend.compare_config.compare_config import CompareConfig +from msprof_analyze.prof_common.constant import Constant class KernelDetailsBean: + __slots__ = ['_data', '_op_type', '_name', '_input_shapes', '_aiv_vec_time', '_aicore_time', '_mac_time', + '_duration', '_start_time', '_step_id'] + def __init__(self, data: dict): self._data = data self._op_type = "" @@ -66,11 +84,16 @@ class KernelDetailsBean: @property def end_time(self) -> Decimal: return self.start_time + convert_to_decimal(self._duration) - + @property def step_id(self) -> int: return int(self._step_id) if self._step_id else Constant.VOID_STEP + @property + def mc2_computing_time(self): + return (max(float(self._data.get("aic_mac_time(us)", 0)), float(self._data.get("aic_mte2_time(us)", 0))) + + float(self._data.get("aiv_time(us)", 0))) + def is_hide_op_pmu(self): if "mac_time(us)" in self._data.keys() or "aiv_vec_time(us)" in self._data.keys(): return False @@ -88,6 +111,11 @@ class KernelDetailsBean: return True return False + def is_invalid_op_type(self): + if pd.isna(self.op_type) or self.op_type == "N/A" 
or self.op_type == "": + return True + return False + def is_fa_bwd(self): return 'bwd' in self.op_type.lower() or 'grad' in self.op_type.lower() @@ -111,11 +139,14 @@ class KernelDetailsBean: return "pagedattention" in self.op_type.lower() def is_trans(self): - return any(trans_mask in self.name.lower() for trans_mask in Constant.KERNEL_TRANS_MASK) + return any(trans_mask in self.name.lower() for trans_mask in CompareConfig().trans_mask) def is_cube_kernel_cat(self): return self.mac_time > 0 or self.aicore_time > 0 + def is_mc2(self): + return self._name.lower() in CompareConfig().mc2_kernel + def init(self): self._op_type = self._data.get('Type', "") self._name = self._data.get('Name', "") diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/memory_record_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/memory_record_bean.py new file mode 100644 index 0000000000000000000000000000000000000000..e123d53fab53a5d4c371cb39026a8e2818ba32df --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/memory_record_bean.py @@ -0,0 +1,31 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.utils.common_func import convert_to_float + + +class MemoryRecordBean: + __slots__ = ['_data', '_total_reserved_mb'] + + def __init__(self, data: dict): + self._data = data + self._total_reserved_mb = 0.0 + self.init() + + @property + def total_reserved_mb(self) -> float: + return convert_to_float(self._total_reserved_mb) + + def init(self): + self._total_reserved_mb = self._data.get("Total Reserved(MB)", 0) diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/op_stastic_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/op_stastic_bean.py new file mode 100644 index 0000000000000000000000000000000000000000..14d7ff1370fc77f37525cddc5c1b3b1651637800 --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/op_stastic_bean.py @@ -0,0 +1,26 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.prof_common.utils import convert_to_float, convert_to_int + + +class OpStatisticBean: + def __init__(self, data: dict): + self.kernel_type = data.get("OP Type", "") + self.core_type = data.get("Core Type", "") + self.total_dur = convert_to_float(data.get("Total Time(us)", 0)) + self.avg_dur = convert_to_float(data.get("Avg Time(us)", 0)) + self.max_dur = convert_to_float(data.get("Max Time(us)", 0)) + self.min_dur = convert_to_float(data.get("Min Time(us)", 0)) + self.calls = convert_to_int(data.get("Count", 0)) diff --git a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/operator_memory_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/operator_memory_bean.py similarity index 57% rename from profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/operator_memory_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/operator_memory_bean.py index 254b8629cdc1941ac46da9b47419a4c675718375..71e5548fef439f031cab3004128dd1d2a744855a 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/operator_memory_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/operator_memory_bean.py @@ -1,9 +1,24 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
from decimal import Decimal -from compare_backend.utils.common_func import convert_to_float, convert_to_decimal +from msprof_analyze.compare_tools.compare_backend.utils.common_func import convert_to_float, convert_to_decimal class OperatorMemoryBean: + __slots__ = ['_data', '_name', '_size', '_allocation_time', '_release_time'] def __init__(self, data: dict): self._data = data diff --git a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py similarity index 69% rename from profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py index 26d5bc4478f30ce3e6b004449ee17cba6ff05a2e..9d813c23b63350ecc724dfea5cbdd36ac0579afd 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py @@ -1,13 +1,33 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
from decimal import Decimal -from compare_backend.utils.common_func import convert_to_float, convert_to_decimal -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.utils.common_func import convert_to_float, convert_to_decimal +from msprof_analyze.compare_tools.compare_backend.compare_config.compare_config import CompareConfig +from msprof_analyze.prof_common.constant import Constant class TraceEventBean: + _COMPUTE_TASK_TYPES = frozenset({'AI_CORE', 'MIX_AIC', 'MIX_AIV', 'AI_CPU', + 'AI_VECTOR_CORE', 'FFTS_PLUS'}) + _SDMA_TASK_TYPES = frozenset({'SDMA_SQE', 'PCIE_DMA_SQE'}) + __slots__ = ['_id', '_pid', '_tid', '_ts', '_dur', '_ph', '_cat', '_name', '_args', '_is_torch_op', + '_input_shape'] def __init__(self, event: dict): - self._event = event + self._id = None self._pid = 0 self._tid = 0 self._ts = Decimal(0) @@ -17,7 +37,9 @@ class TraceEventBean: self._name = "" self._args = {} self._is_torch_op = False - self.init() + self._input_shape = None + self.init(event) + del event @property def pid(self) -> int: @@ -57,7 +79,7 @@ class TraceEventBean: @property def id(self) -> str: - return self._event.get("id") + return self._id @property def stream_id(self) -> int: @@ -103,8 +125,8 @@ class TraceEventBean: return self._args.get("Addr") @property - def event(self) -> dict: - return self._event + def input_shapes(self): + return self._input_shape @property def is_torch_op(self) -> bool: @@ -114,6 +136,10 @@ class TraceEventBean: def is_torch_op(self, value: bool): self._is_torch_op = value + @input_shapes.setter + def input_shapes(self, value: str): + self._input_shape = value + @classmethod def is_sdma(cls): return False @@ -129,6 +155,13 @@ class TraceEventBean: """ return False + @classmethod + def is_mc2(cls) -> bool: + """ + GPU没有mc2算子,全部返回False + """ + return False + def is_m_mode(self) -> bool: return self._ph == "M" @@ -160,7 +193,7 @@ class TraceEventBean: return self._args.get("name", 
"").find("Communication") != -1 def is_hccl_process_name(self) -> bool: - return self.process_name == "HCCL" + return self.process_name in ["Communication", "HCCL"] def is_overlap_process_name(self) -> bool: return self.process_name == "Overlap Analysis" @@ -174,12 +207,12 @@ class TraceEventBean: def is_comm_not_overlap(self): return self._name == 'Communication(Not Overlapped)' - def is_dict(self): - return isinstance(self._event, dict) - def is_kernel_cat(self): return self.lower_cat == "kernel" + def is_memory_copy_cat(self): + return self.lower_cat == "gpu_memcpy" + def is_nccl_name(self): return self.lower_name.startswith("nccl") @@ -190,10 +223,10 @@ class TraceEventBean: return self.lower_name == '[memory]' and self.device_id >= 0 def is_compute_event(self): - return self.task_type in ('AI_CORE', 'MIX_AIC', 'MIX_AIV', 'AI_CPU', 'AI_VECTOR_CORE', 'FFTS_PLUS') + return self.task_type in self._COMPUTE_TASK_TYPES def is_sdma_event(self): - return self.task_type in ('SDMA_SQE', 'PCIE_DMA_SQE') + return self.task_type in self._SDMA_TASK_TYPES def is_event_wait(self): return self.task_type == 'EVENT_WAIT_SQE' @@ -226,19 +259,19 @@ class TraceEventBean: """ 这个类在cpu op和gpu中均有用到,这里是在cpu op阶段判断 """ - return any(cube_mask in self.lower_name for cube_mask in Constant.CPU_OP_FA_MASK) + return any(cube_mask in self.lower_name for cube_mask in CompareConfig().fa_mask) def is_conv_for_cpu_op(self) -> bool: """ 这个类在cpu op和gpu中均有用到,这里是在cpu op阶段判断 """ - return self.lower_name.startswith(Constant.CPU_OP_CONV) + return any(conv_mask in self.lower_name for conv_mask in CompareConfig().conv_mask) def is_matmul_for_cpu_op(self) -> bool: """ 这个类在cpu op和gpu中均有用到,这里是在cpu op阶段判断 """ - return any(bwd_mask in self.lower_name for bwd_mask in Constant.CPU_OP_MATMUL_MASK) + return any(bwd_mask in self.lower_name for bwd_mask in CompareConfig().mm_mask) def is_bwd_for_cpu_op(self) -> bool: """ @@ -250,18 +283,22 @@ class TraceEventBean: return self.is_matmul_for_cpu_op() or 
self.is_fa_for_cpu_op() or self.is_conv_for_cpu_op() def is_vector(self): - return not any(cube_mask in self.lower_name for cube_mask in Constant.KERNEL_CUBE_MASK) + return not any(cube_mask in self.lower_name for cube_mask in CompareConfig().cube_mask) def is_cube_kernel_cat(self): - return any(cube_mask in self.lower_name for cube_mask in Constant.KERNEL_CUBE_MASK) - - def init(self): - if isinstance(self._event, dict): - self._pid = self._event.get("pid", 0) - self._tid = self._event.get("tid", 0) - self._ts = self._event.get("ts", 0) - self._dur = self._event.get("dur", 0) - self._ph = self._event.get("ph", "") - self._cat = self._event.get("cat", "") - self._name = self._event.get("name", "") - self._args = self._event.get("args", {}) + return any(cube_mask in self.lower_name for cube_mask in CompareConfig().cube_mask) + + def is_c_core_sqe(self): + return self.name == "C_CORE_SQE" + + def init(self, event): + if isinstance(event, dict): + self._id = event.get("id") + self._pid = event.get("pid", 0) + self._tid = event.get("tid", 0) + self._ts = event.get("ts", 0) + self._dur = event.get("dur", 0) + self._ph = str(event.get("ph", "")) + self._cat = event.get("cat", "") + self._name = event.get("name", "") + self._args = event.get("args", {}) diff --git a/profiler/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py similarity index 67% rename from profiler/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py index 0c96c5f58c4426ecce32b7f888ffb412701815fb..059416ec15e54b7732eb40cbf8b0e6ef1227bf90 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py @@ -14,10 +14,10 @@ # limitations under the License. 
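Editor's note on the `TraceEventBean` changes above: per-call tuples become class-level `frozenset` constants for O(1) task-type membership, and `init(event)` copies out only the fields it needs rather than retaining the raw event dict. A reduced sketch of the pattern (toy class):

```python
class Event:
    _COMPUTE_TASK_TYPES = frozenset({'AI_CORE', 'MIX_AIC', 'MIX_AIV',
                                     'AI_CPU', 'AI_VECTOR_CORE', 'FFTS_PLUS'})
    __slots__ = ['_name', '_args']

    def __init__(self, event: dict):
        # Copy out only what is needed; the raw dict is not kept alive.
        self._name = event.get("name", "")
        self._args = event.get("args", {})

    @property
    def task_type(self):
        return self._args.get("Task Type")

    def is_compute_event(self):
        # Set membership instead of scanning a tuple on every call.
        return self.task_type in self._COMPUTE_TASK_TYPES


assert Event({"name": "MatMul", "args": {"Task Type": "AI_CORE"}}).is_compute_event()
assert not Event({"name": "x", "args": {}}).is_compute_event()
```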
from math import isclose -from compare_backend.compare_bean.profiling_info import ProfilingInfo -from compare_backend.utils.common_func import calculate_diff_ratio -from compare_backend.utils.constant import Constant -from compare_backend.utils.excel_config import ExcelConfig, CellFormatType +from msprof_analyze.compare_tools.compare_backend.compare_bean.profiling_info import ProfilingInfo +from msprof_analyze.compare_tools.compare_backend.utils.common_func import calculate_diff_ratio +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig, CellFormatType +from msprof_analyze.prof_common.constant import Constant class OverallMetricsBean: @@ -29,13 +29,43 @@ class OverallMetricsBean: self._base_data = OverallMetricsInfo(base_info).overall_metrics self._comparison_data = OverallMetricsInfo(comparison_info).overall_metrics if not any((base_info.is_not_minimal_profiling(), comparison_info.is_not_minimal_profiling())): - self.TABLE_NAME += ' (Minimal Prof)' + OverallMetricsBean.TABLE_NAME += ' (Minimal Prof)' @property def rows(self): rows_data = [] + rows_data.extend( + self._get_rows(self._base_data.get("before_mc2", {}), self._comparison_data.get("before_mc2", {}))) + + base_mc2_data = self._base_data.get("mc2", {}) + comparison_mc2_data = self._comparison_data.get("mc2", {}) + default_value = [0, 0, "/"] + for kernel_name, base_data in base_mc2_data.items(): + comparison_data = comparison_mc2_data.pop(kernel_name, {}) + self._append_data(rows_data, self._get_row_data(kernel_name, base_data.get("mc2", default_value), + comparison_data.get("mc2", default_value))) + self._append_data(rows_data, + self._get_row_data(ExcelConfig.MC2_COMPUTING_TIME, + base_data.get(ExcelConfig.MC2_COMPUTING_TIME, default_value), + comparison_data.get(ExcelConfig.MC2_COMPUTING_TIME, default_value))) + self._append_data(rows_data, + self._get_row_data(ExcelConfig.MC2_COMMUNICATION_TIME, + base_data.get(ExcelConfig.MC2_COMMUNICATION_TIME, default_value), + 
comparison_data.get(ExcelConfig.MC2_COMMUNICATION_TIME, + default_value))) + for kernel_name, comparison_data in comparison_mc2_data.items(): + self._append_data(rows_data, self._get_row_data(kernel_name, default_value, + comparison_data.get("mc2", default_value))) + self._append_data(rows_data, self._get_row_data(ExcelConfig.MC2_COMPUTING_TIME, default_value, + comparison_data.get(ExcelConfig.MC2_COMPUTING_TIME, + default_value))) + self._append_data(rows_data, self._get_row_data(ExcelConfig.MC2_COMMUNICATION_TIME, default_value, + comparison_data.get(ExcelConfig.MC2_COMMUNICATION_TIME, + default_value))) + rows_data.extend( self._get_rows(self._base_data.get("before_group", {}), self._comparison_data.get("before_group", {}))) + base_group_data = self._base_data.get("group", {}) comparison_group_data = self._comparison_data.get("group", {}) default_value = [0, 0, "/"] @@ -57,6 +87,7 @@ class OverallMetricsBean: comparison_data.get(ExcelConfig.WAIT, default_value))) self._append_data(rows_data, self._get_row_data(ExcelConfig.TRANSMIT, default_value, comparison_data.get(ExcelConfig.TRANSMIT, default_value))) + rows_data.extend( self._get_rows(self._base_data.get("after_group", {}), self._comparison_data.get("after_group", {}))) return rows_data @@ -89,6 +120,8 @@ class OverallMetricsBean: class OverallMetricsInfo: + __slots__ = ['_profiling_info', '_comm_group_list', '_overall_metrics_data'] + def __init__(self, profiling_info: ProfilingInfo): self._profiling_info = profiling_info self._comm_group_list = list(profiling_info.communication_group_time.keys()) @@ -247,7 +280,8 @@ class OverallMetricsInfo: self._profiling_info.fa_bwd_time - self._profiling_info.conv_fwd_time - self._profiling_info.conv_bwd_time - self._profiling_info.mm_total_time - self._profiling_info.vector_total_time - self._profiling_info.sdma_time_tensor_move - - self._profiling_info.other_cube_time - self._profiling_info.page_attention_time)) + self._profiling_info.other_cube_time - 
self._profiling_info.page_attention_time - + self._profiling_info.all_mc2_time)) return [other_time, other_time / self.e2e_time, "/"] @property @@ -287,32 +321,55 @@ class OverallMetricsInfo: return [self._profiling_info.get_transmit_time_by_group(group_name), self._profiling_info.get_transmit_time_by_group(group_name) / self.e2e_time, "/"] + def mc2_data_by_name(self, kernel_name: str): + return [self._profiling_info.get_mc2_time_by_name(kernel_name), + self._profiling_info.get_mc2_time_by_name(kernel_name) / self.e2e_time, + self._profiling_info.get_mc2_number_by_name(kernel_name)] + + def mc2_computing_data_by_name(self, kernel_name: str): + return [self._profiling_info.get_mc2_computing_time_by_name(kernel_name), + self._profiling_info.get_mc2_computing_time_by_name(kernel_name) / self.e2e_time, "/"] + + def mc2_communication_data_by_name(self, kernel_name: str): + return [self._profiling_info.get_mc2_communication_time_by_name(kernel_name), + self._profiling_info.get_mc2_communication_time_by_name(kernel_name) / self.e2e_time, "/"] + def _init_overall_metrics_data(self): - overall_metrics_data = {"before_group": { - ExcelConfig.COMPUTING: self.computing_data, - ExcelConfig.FA_FWD: self.fa_fwd_data, - ExcelConfig.FA_FWD_CUBE: self.fa_fwd_cube_data, - ExcelConfig.FA_FWD_VECTOR: self.fa_fwd_vector_data, - ExcelConfig.FA_BWD: self.fa_bwd_data, - ExcelConfig.FA_BWD_CUBE: self.fa_bwd_cube_data, - ExcelConfig.FA_BWD_VECTOR: self.fa_bwd_vector_data, - ExcelConfig.CONV_FWD: self.conv_fwd_data, - ExcelConfig.CONV_FWD_CUBE: self.conv_fwd_cube_data, - ExcelConfig.CONV_FWD_VECTOR: self.conv_fwd_vector_data, - ExcelConfig.CONV_BWD: self.conv_bwd_data, - ExcelConfig.CONV_BWD_CUBE: self.conv_bwd_cube_data, - ExcelConfig.CONV_BWD_VECTOR: self.conv_bwd_vector_data, - ExcelConfig.MM: self.mm_data, - ExcelConfig.MM_CUBE: self.mm_cube_data, - ExcelConfig.MM_VECTOR: self.mm_vector_data, - ExcelConfig.PA: self.pa_data, - ExcelConfig.VECTOR: self.vector_data, - 
ExcelConfig.VECTOR_TRANS: self.vector_trans_data, - ExcelConfig.VECTOR_NO_TRANS: self.vector_no_trans_data, - ExcelConfig.CUBE: self.cube_data, - ExcelConfig.SDMA_TM: self.sdma_tm_data, - ExcelConfig.OTHER: self.other_data, - ExcelConfig.COMMUNICATION_TIME: self.communication_data} + overall_metrics_data = { + "before_mc2": { + ExcelConfig.COMPUTING: self.computing_data + }, + "before_group": { + ExcelConfig.FA_FWD: self.fa_fwd_data, + ExcelConfig.FA_FWD_CUBE: self.fa_fwd_cube_data, + ExcelConfig.FA_FWD_VECTOR: self.fa_fwd_vector_data, + ExcelConfig.FA_BWD: self.fa_bwd_data, + ExcelConfig.FA_BWD_CUBE: self.fa_bwd_cube_data, + ExcelConfig.FA_BWD_VECTOR: self.fa_bwd_vector_data, + ExcelConfig.CONV_FWD: self.conv_fwd_data, + ExcelConfig.CONV_FWD_CUBE: self.conv_fwd_cube_data, + ExcelConfig.CONV_FWD_VECTOR: self.conv_fwd_vector_data, + ExcelConfig.CONV_BWD: self.conv_bwd_data, + ExcelConfig.CONV_BWD_CUBE: self.conv_bwd_cube_data, + ExcelConfig.CONV_BWD_VECTOR: self.conv_bwd_vector_data, + ExcelConfig.MM: self.mm_data, + ExcelConfig.MM_CUBE: self.mm_cube_data, + ExcelConfig.MM_VECTOR: self.mm_vector_data, + ExcelConfig.PA: self.pa_data, + ExcelConfig.VECTOR: self.vector_data, + ExcelConfig.VECTOR_TRANS: self.vector_trans_data, + ExcelConfig.VECTOR_NO_TRANS: self.vector_no_trans_data, + ExcelConfig.CUBE: self.cube_data, + ExcelConfig.SDMA_TM: self.sdma_tm_data, + ExcelConfig.OTHER: self.other_data, + ExcelConfig.COMMUNICATION_TIME: self.communication_data + }, + "after_group": { + ExcelConfig.FREE_TIME: self.free_time_data, + ExcelConfig.SDMA: self.sdma_data, + ExcelConfig.FREE: self.free_data, + ExcelConfig.E2E_TIME: self.e2e_time_data + } } if self._comm_group_list: for group_name in self._comm_group_list: @@ -323,10 +380,12 @@ class OverallMetricsInfo: ExcelConfig.WAIT: self.wait_data_by_group(group_name), ExcelConfig.TRANSMIT: self.transmit_data_by_group(group_name) } - overall_metrics_data["after_group"] = { - ExcelConfig.FREE_TIME: self.free_time_data, - 
ExcelConfig.SDMA: self.sdma_data, - ExcelConfig.FREE: self.free_data, - ExcelConfig.E2E_TIME: self.e2e_time_data - } + for kernel_name in self._profiling_info.mc2_time_dict.keys(): + mc2_name_index = f"\t{kernel_name}" + ExcelConfig.ROW_STYLE_MAP[mc2_name_index] = CellFormatType.LIGHT_BLUE_NORMAL + overall_metrics_data.setdefault("mc2", {})[mc2_name_index] = { + "mc2": self.mc2_data_by_name(kernel_name), + ExcelConfig.MC2_COMPUTING_TIME: self.mc2_computing_data_by_name(kernel_name), + ExcelConfig.MC2_COMMUNICATION_TIME: self.mc2_communication_data_by_name(kernel_name) + } return overall_metrics_data diff --git a/profiler/compare_tools/compare_backend/compare_bean/profiling_info.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/profiling_info.py similarity index 75% rename from profiler/compare_tools/compare_backend/compare_bean/profiling_info.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/profiling_info.py index 891ed2f49c10bf4ca34efc2ef5550e748313a114..bcbce59c016338195d49044508614307f2db1896 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/profiling_info.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_bean/profiling_info.py @@ -1,7 +1,33 @@ -from compare_backend.utils.constant import Constant +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
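Editor's note on the mc2 row generation above: `OverallMetricsBean.rows` pairs per-kernel entries from the base and comparison runs by `pop`-ping matching keys, then emits leftovers against a default value. A simplified sketch of that pairing logic (row shape reduced from the real `_get_row_data`):

```python
def pair_rows(base: dict, comparison: dict, default=(0, 0, "/")):
    """Pair per-kernel stats from two runs, filling gaps with a default."""
    comparison = dict(comparison)  # copy so pop() does not mutate the caller's dict
    rows = []
    for kernel, base_val in base.items():
        rows.append((kernel, base_val, comparison.pop(kernel, default)))
    for kernel, comp_val in comparison.items():  # kernels only in the comparison run
        rows.append((kernel, default, comp_val))
    return rows


rows = pair_rows({"AllGatherMatmul": (10, 0.1, 3)},
                 {"AllGatherMatmul": (12, 0.1, 3), "MatmulAllReduce": (5, 0.05, 1)})
assert rows == [("AllGatherMatmul", (10, 0.1, 3), (12, 0.1, 3)),
                ("MatmulAllReduce", (0, 0, "/"), (5, 0.05, 1))]
```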
+from msprof_analyze.prof_common.constant import Constant class ProfilingInfo: + __slots__ = ['profiling_type', 'other_time', 'lccl_num', 'compute_time', 'communication_not_overlapped', + 'wait_time', 'memory_used', 'e2e_time', 'scheduling_time', 'lccl_time', 'minimal_profiling', + 'hide_op_details', 'is_level0', 'fa_time_fwd_cube', 'fa_num_fwd_cube', 'fa_time_bwd_cube', + 'fa_num_bwd_cube', 'fa_time_fwd_vector', 'fa_num_fwd_vector', 'fa_time_bwd_vector', + 'fa_num_bwd_vector', + 'conv_time_fwd_cube', 'conv_num_fwd_cube', 'conv_time_bwd_cube', 'conv_num_bwd_cube', + 'conv_time_fwd_vector', 'conv_num_fwd_vector', 'conv_time_bwd_vector', 'conv_num_bwd_vector', + 'matmul_time_cube', 'matmul_num_cube', 'matmul_time_vector', 'matmul_num_vector', + 'page_attention_time', 'page_attention_num', 'vector_time_trans', 'vector_num_trans', + 'vector_time_notrans', 'vector_num_notrans', 'sdma_time_tensor_move', 'sdma_num_tensor_move', + 'sdma_time_stream', 'sdma_num_stream', 'other_cube_time', 'other_cube_num', 'rdma_bandwidth', + 'sdma_bandwidth', 'communication_group_time', 'mc2_time_dict'] TABLE_NAME = Constant.PERFORMANCE_TABLE HEADERS = [] OVERHEAD = [] @@ -60,8 +86,10 @@ class ProfilingInfo: self.other_cube_time = 0.0 self.other_cube_num = 0 - self.RDMA_bandwidth = 0.0 - self.SDMA_bandwidth = 0.0 + self.rdma_bandwidth = 0.0 + self.sdma_bandwidth = 0.0 + + self.mc2_time_dict = {} # 按group展示通信的卡间等待和传输耗时 self.communication_group_time = {} @@ -140,8 +168,8 @@ class ProfilingInfo: @property def cube_time(self): - return ( - self.matmul_time_cube + self.matmul_time_vector + self.other_cube_time) / Constant.MILLISECONDS_TO_SECONDS + return ((self.matmul_time_cube + self.matmul_time_vector + self.other_cube_time + self.all_mc2_time) + / Constant.MILLISECONDS_TO_SECONDS) @property def vec_time(self): @@ -203,12 +231,16 @@ class ProfilingInfo: def fa_time_bwd(self): return (self.fa_time_bwd_cube + self.fa_time_bwd_vector) / Constant.MILLISECONDS_TO_SECONDS + @property + def 
all_mc2_time(self): + return sum((self.get_mc2_time_by_name(kernel_name) for kernel_name in self.mc2_time_dict.keys())) + def calculate_other_time(self): self.other_time = max(0, (self.compute_time_ms - self.fa_fwd_time - self.fa_bwd_time - self.conv_fwd_time - self.conv_bwd_time - self.mm_total_time - - self.vector_total_time - self.sdma_time_tensor_move - + self.vector_total_time - self.sdma_time_tensor_move - self.all_mc2_time - self.other_cube_time - self.page_attention_time) / Constant.MILLISECONDS_TO_SECONDS) def calculate_schedule_time(self): @@ -308,11 +340,19 @@ class ProfilingInfo: def is_not_minimal_profiling(self) -> bool: return self.profiling_type == Constant.NPU and not self.minimal_profiling - def set_RDMA_bandwidth(self, bandwidth: float): - self.RDMA_bandwidth = bandwidth + def set_rdma_bandwidth(self, bandwidth: float): + self.rdma_bandwidth = bandwidth + + def set_sdma_bandwidth(self, bandwidth: float): + self.sdma_bandwidth = bandwidth - def set_SDMA_bandwidth(self, bandwidth: float): - self.SDMA_bandwidth = bandwidth + def update_mc2_info(self, kernel_name, mc2_time, computing_time, communication_time): + default_dict = {Constant.MC2_TIME: 0, Constant.MC2_COMPUTING: 0, Constant.MC2_COMMUNICATION: 0, + Constant.MC2_NUMBER: 0} + self.mc2_time_dict.setdefault(kernel_name, default_dict)[Constant.MC2_TIME] += mc2_time + self.mc2_time_dict.setdefault(kernel_name, default_dict)[Constant.MC2_COMPUTING] += computing_time + self.mc2_time_dict.setdefault(kernel_name, default_dict)[Constant.MC2_COMMUNICATION] += communication_time + self.mc2_time_dict.setdefault(kernel_name, default_dict)[Constant.MC2_NUMBER] += 1 def trans_time_to_s(self): # 新指标单位为ms @@ -349,3 +389,15 @@ class ProfilingInfo: def get_communication_time_by_group(self, group_name: str): return (self.communication_group_time.get(group_name, {}).get(Constant.WAIT_TIME, 0) + self.communication_group_time.get(group_name, {}).get(Constant.TRANSMIT_TIME, 0)) / 10 ** 3 + + def 
get_mc2_time_by_name(self, kernel_name: str): + return self.mc2_time_dict.get(kernel_name, {}).get(Constant.MC2_TIME, 0) / 10 ** 3 + + def get_mc2_computing_time_by_name(self, kernel_name: str): + return self.mc2_time_dict.get(kernel_name, {}).get(Constant.MC2_COMPUTING, 0) / 10 ** 3 + + def get_mc2_communication_time_by_name(self, kernel_name: str): + return self.mc2_time_dict.get(kernel_name, {}).get(Constant.MC2_COMMUNICATION, 0) / 10 ** 3 + + def get_mc2_number_by_name(self, kernel_name: str): + return self.mc2_time_dict.get(kernel_name, {}).get(Constant.MC2_NUMBER, 0) diff --git a/profiler/test/ut/compare_tools/compare_bean/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/__init__.py similarity index 100% rename from profiler/test/ut/compare_tools/compare_bean/__init__.py rename to profiler/msprof_analyze/compare_tools/compare_backend/compare_config/__init__.py diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/compare_config.ini b/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/compare_config.ini new file mode 100644 index 0000000000000000000000000000000000000000..a40082a6f9a53ee747ab1eb281b26103ae1e5970 --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/compare_config.ini @@ -0,0 +1,7 @@ +[OP_MASK] +FA_MASK = flash_attention,fusion_attention,flashattn,xformers_flash,efficient_attention,flash2attn +CONV_MASK = aten::conv +MATMUL_MASK = aten::addmm,aten::bmm,aten::mm,aten::matmul +CUBE_MASK = gemm,conv,cutlass,wgrad,gemvx +TRANS_MASK = cast,transdata,transpose +MC2_KERNEL = allgathermatmul,matmulreducescatter,matmulallreduce \ No newline at end of file diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/compare_config.py b/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/compare_config.py new file mode 100644 index 
0000000000000000000000000000000000000000..4d044a1fa010878e69e13d3a1fcd79a6e12053dc --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/compare_config/compare_config.py @@ -0,0 +1,64 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os + +from msprof_analyze.compare_tools.compare_backend.utils.singleton import Singleton +from msprof_analyze.prof_common.utils import SafeConfigReader + + +@Singleton +class CompareConfig: + _REQUIRED_SECTIONS = { + "OP_MASK": ["FA_MASK", "CONV_MASK", "MATMUL_MASK", "CUBE_MASK", "TRANS_MASK", "MC2_KERNEL"] + } + + def __init__(self, cls): + self.config_reader = SafeConfigReader( + os.path.join(os.path.dirname(os.path.abspath(os.path.join(__file__))), "compare_config.ini")) + self.config_reader.validate(self._REQUIRED_SECTIONS) + self.config = self.config_reader.get_config() + self._fa_mask = self.get_mask_by_key("FA_MASK") + self._conv_mask = self.get_mask_by_key("CONV_MASK") + self._mm_mask = self.get_mask_by_key("MATMUL_MASK") + self._cube_mask = self.get_mask_by_key("CUBE_MASK") + self._trans_mask = self.get_mask_by_key("TRANS_MASK") + self._mc2_kernel = self.get_mask_by_key("MC2_KERNEL") + + @property + def fa_mask(self): + return self._fa_mask + + @property + def conv_mask(self): + return self._conv_mask + + @property + def mm_mask(self): + return self._mm_mask + + @property + def cube_mask(self): + return self._cube_mask + + 
@property + def trans_mask(self): + return self._trans_mask + + @property + def mc2_kernel(self): + return self._mc2_kernel + + def get_mask_by_key(self, key): + return set((mask.strip().lower() for mask in self.config.get("OP_MASK", key).split(",") if mask.strip())) diff --git a/profiler/compare_tools/compare_backend/comparison_generator.py b/profiler/msprof_analyze/compare_tools/compare_backend/comparison_generator.py similarity index 62% rename from profiler/compare_tools/compare_backend/comparison_generator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/comparison_generator.py index 820ba75754e5e889bed3fc14c9a1aa763735bbbd..2c3c3d920ce3896336f396eb51d4df3920ded7d7 100644 --- a/profiler/compare_tools/compare_backend/comparison_generator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/comparison_generator.py @@ -12,15 +12,20 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
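Editor's note on `CompareConfig` above: the operator masks come from `compare_config.ini` via a `SafeConfigReader` helper that is not part of this diff. Using only the stdlib `configparser`, the normalisation done by `get_mask_by_key` can be sketched as:

```python
import configparser

INI_TEXT = """
[OP_MASK]
CONV_MASK = aten::conv
MATMUL_MASK = aten::addmm,aten::bmm,aten::mm,aten::matmul
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)


def get_mask_by_key(key: str) -> set:
    # Split on commas, trim whitespace, lowercase, drop empty entries.
    return {mask.strip().lower()
            for mask in config.get("OP_MASK", key).split(",") if mask.strip()}


assert get_mask_by_key("MATMUL_MASK") == {"aten::addmm", "aten::bmm",
                                          "aten::mm", "aten::matmul"}
# Membership then mirrors checks such as is_matmul_for_cpu_op():
name = "aten::mm(self, mat2)".lower()
assert any(mask in name for mask in get_mask_by_key("MATMUL_MASK"))
```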
-import logging -from compare_backend.generator.detail_performance_generator import DetailPerformanceGenerator -from compare_backend.generator.overall_performance_generator import OverallPerformanceGenerator -from compare_backend.interface.overall_interface import OverallInterface -from compare_backend.interface.compare_interface import CompareInterface -from compare_backend.profiling_parser.gpu_profiling_parser import GPUProfilingParser -from compare_backend.profiling_parser.npu_profiling_parser import NPUProfilingParser -from compare_backend.utils.constant import Constant -from compare_backend.utils.args_manager import ArgsManager +from msprof_analyze.compare_tools.compare_backend.generator.detail_performance_generator \ + import DetailPerformanceGenerator +from msprof_analyze.compare_tools.compare_backend.generator.overall_performance_generator \ + import OverallPerformanceGenerator +from msprof_analyze.compare_tools.compare_backend.interface.overall_interface import OverallInterface +from msprof_analyze.compare_tools.compare_backend.interface.compare_interface import CompareInterface +from msprof_analyze.compare_tools.compare_backend.profiling_parser.gpu_profiling_parser import GPUProfilingParser +from msprof_analyze.compare_tools.compare_backend.profiling_parser.npu_profiling_parser import NPUProfilingParser +from msprof_analyze.compare_tools.compare_backend.utils.args_manager import ArgsManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager +from msprof_analyze.prof_common.logger import get_logger + +logger = get_logger() class ComparisonGenerator: @@ -28,6 +33,7 @@ class ComparisonGenerator: INTERFACE_DICT = {Constant.OVERALL_COMPARE: OverallInterface} def __init__(self, args): + AdditionalArgsManager().init(args) self._args_manager = ArgsManager(args) self._data_dict = {} @@ -37,13 +43,13 @@ class ComparisonGenerator: self.load_data() 
self.generate_compare_result() except NotImplementedError as e: - logging.error("%s", e) + logger.error("%s", e) except RuntimeError as e: - logging.error("%s", e) + logger.error("%s", e) except FileNotFoundError as e: - logging.error("%s", e) + logger.error("%s", e) except Exception as e: - logging.error("%s", e) + logger.error("%s", e) def load_data(self): self._data_dict[Constant.BASE_DATA] = self.PARSER_DICT.get(self._args_manager.base_profiling_type)( @@ -60,14 +66,10 @@ class ComparisonGenerator: Constant.BASE_DATA: self._data_dict.get(Constant.BASE_DATA).overall_metrics, Constant.COMPARISON_DATA: self._data_dict.get(Constant.COMPARISON_DATA).overall_metrics, } - generator_list = [ - OverallPerformanceGenerator(overall_data, self._args_manager.args), - DetailPerformanceGenerator(self._data_dict, self._args_manager.args), - ] - for generator in generator_list: - generator.start() - for generator in generator_list: - generator.join() + overall_generator = OverallPerformanceGenerator(overall_data, self._args_manager.args) + overall_generator.start() + DetailPerformanceGenerator(self._data_dict, self._args_manager.args).run() + overall_generator.join() def run_interface(self, compare_type: str) -> dict: try: @@ -79,11 +81,11 @@ class ComparisonGenerator: return interface(self._data_dict).run() return CompareInterface(self._data_dict, self._args_manager).run() except NotImplementedError as e: - logging.error("%s", e) + logger.error("%s", e) except RuntimeError as e: - logging.error("%s", e) + logger.error("%s", e) except FileNotFoundError as e: - logging.error("%s", e) + logger.error("%s", e) except Exception as e: - logging.error("%s", e) + logger.error("%s", e) return {} diff --git a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/__init__.py similarity index 100% rename from profiler/test/ut/compare_tools/compare_bean/origin_data_bean/__init__.py rename to 
profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/__init__.py diff --git a/profiler/compare_tools/compare_backend/data_prepare/module_data_prepare.py b/profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/module_data_prepare.py similarity index 90% rename from profiler/compare_tools/compare_backend/data_prepare/module_data_prepare.py rename to profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/module_data_prepare.py index 116de8fb6df4be643bc627a4461e2b1eda686576..e8c56b26080c43780b81bffc2f0d61ccde5c3f64 100644 --- a/profiler/compare_tools/compare_backend/data_prepare/module_data_prepare.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/module_data_prepare.py @@ -15,11 +15,11 @@ import copy from queue import Queue -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.profiling_parser.base_profiling_parser import ProfilingResult -from compare_backend.utils.constant import Constant -from compare_backend.utils.module_node import ModuleNode -from compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.profiling_parser.base_profiling_parser import ProfilingResult +from msprof_analyze.compare_tools.compare_backend.utils.module_node import ModuleNode +from msprof_analyze.compare_tools.compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.prof_common.constant import Constant class ModuleDataPrepare: diff --git a/profiler/compare_tools/compare_backend/data_prepare/operator_data_prepare.py b/profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/operator_data_prepare.py similarity index 81% rename from profiler/compare_tools/compare_backend/data_prepare/operator_data_prepare.py rename to 
profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/operator_data_prepare.py index d5fb3990a94f6229c1a9dcddd7b239375e9720e3..b5da970126ca2d652875974cca616beada3814fb 100644 --- a/profiler/compare_tools/compare_backend/data_prepare/operator_data_prepare.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/operator_data_prepare.py @@ -12,10 +12,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import logging -from compare_backend.profiling_parser.base_profiling_parser import ProfilingResult -from compare_backend.utils.tree_builder import TreeBuilder -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.profiling_parser.base_profiling_parser import ProfilingResult +from msprof_analyze.compare_tools.compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger + +logger = get_logger() class OperatorDataPrepare: @@ -44,7 +46,7 @@ class OperatorDataPrepare: node_queue.extend(node.child_nodes) if not result_data: msg = f"There is no operator event data for step {self._specified_step_id}, " \ - "please check whether the data contains this step." + "please check whether the data contains this step." 
raise RuntimeError(msg) return result_data @@ -63,6 +65,6 @@ class OperatorDataPrepare: elif level1_node.is_step_profiler() and level1_node.get_step_id() == self._specified_step_id: result_data.extend(level1_node.child_nodes) if not result_data and self._specified_step_id != Constant.VOID_STEP: - logging.warning("[WARNING] There is no operator infomation for step %s, " - "please check whether the data contains this step.", self._specified_step_id) - return result_data \ No newline at end of file + logger.warning("[WARNING] There is no operator information for step %s, " + "please check whether the data contains this step.", self._specified_step_id) + return result_data diff --git a/profiler/compare_tools/compare_backend/data_prepare/sequence_pre_matching.py b/profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/sequence_pre_matching.py similarity index 94% rename from profiler/compare_tools/compare_backend/data_prepare/sequence_pre_matching.py rename to profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/sequence_pre_matching.py index c04d4c2b699e7dc5ac6d76e6aa4ef40229558c30..cdca93a92767f169f4f4c014ed18f5aa7d407a7a 100644 --- a/profiler/compare_tools/compare_backend/data_prepare/sequence_pre_matching.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/data_prepare/sequence_pre_matching.py @@ -14,12 +14,11 @@ # limitations under the License.
from collections import deque -from compare_backend.utils.name_function import NameFunction -from compare_backend.utils.common_func import longest_common_subsequence_matching -from compare_backend.utils.torch_op_node import TorchOpNode -from compare_backend.utils.module_node import ModuleNode - -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.utils.name_function import NameFunction +from msprof_analyze.compare_tools.compare_backend.utils.common_func import longest_common_subsequence_matching +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.compare_tools.compare_backend.utils.module_node import ModuleNode +from msprof_analyze.prof_common.constant import Constant class SequencePreMatching: @@ -92,7 +91,7 @@ class SequencePreMatching: base_index += 1 comparison_index += 1 while comparison_index < comparison_data_len: - result_data.extend(self._match_torch_op([], comparison_data[0].get(Constant.OPS, []))) + result_data.extend(self._match_torch_op([], comparison_data[comparison_index].get(Constant.OPS, []))) comparison_index += 1 return result_data diff --git a/profiler/test/ut/compare_tools/profiling_parser/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/disaggregate/__init__.py similarity index 100% rename from profiler/test/ut/compare_tools/profiling_parser/__init__.py rename to profiler/msprof_analyze/compare_tools/compare_backend/disaggregate/__init__.py diff --git a/profiler/compare_tools/compare_backend/disaggregate/overall_perf_interface.py b/profiler/msprof_analyze/compare_tools/compare_backend/disaggregate/overall_perf_interface.py similarity index 83% rename from profiler/compare_tools/compare_backend/disaggregate/overall_perf_interface.py rename to profiler/msprof_analyze/compare_tools/compare_backend/disaggregate/overall_perf_interface.py index 
7c3bb96cf8c941865f92450692ead0e964f1b91b..4dfbe559210e80c5072e1ed076253c483ef285e6 100644 --- a/profiler/compare_tools/compare_backend/disaggregate/overall_perf_interface.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/disaggregate/overall_perf_interface.py @@ -12,13 +12,15 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import logging -from common_func.path_manager import PathManager -from compare_backend.profiling_parser.gpu_profiling_parser import GPUProfilingParser -from compare_backend.profiling_parser.npu_profiling_parser import NPUProfilingParser -from compare_backend.utils.args_manager import ArgsManager -from compare_backend.utils.compare_args import Args -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.profiling_parser.gpu_profiling_parser import GPUProfilingParser +from msprof_analyze.compare_tools.compare_backend.profiling_parser.npu_profiling_parser import NPUProfilingParser +from msprof_analyze.compare_tools.compare_backend.utils.args_manager import ArgsManager +from msprof_analyze.compare_tools.compare_backend.utils.compare_args import Args +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_common.path_manager import PathManager + +logger = get_logger() class OverallPerfInterface: @@ -28,7 +30,7 @@ class OverallPerfInterface: self._profiling_path = profiling_path self._profiling_path_dict = {} self._result_data = {} - self._profiling_data = NPUProfilingParser() + self._profiling_data = None def run(self): try: @@ -36,13 +38,13 @@ class OverallPerfInterface: self._load_data() self._generate_result() except NotImplementedError as e: - logging.error("%s", e) + logger.error("%s", e) except RuntimeError as e: - logging.error("%s", e) + logger.error("%s", e) except 
FileNotFoundError as e: - logging.error("%s", e) + logger.error("%s", e) except Exception as e: - logging.error("%s", e) + logger.error("%s", e) return self._result_data def _check_path(self): diff --git a/profiler/test/ut/compare_tools/view/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/generator/__init__.py similarity index 100% rename from profiler/test/ut/compare_tools/view/__init__.py rename to profiler/msprof_analyze/compare_tools/compare_backend/generator/__init__.py diff --git a/profiler/compare_tools/compare_backend/generator/base_generator.py b/profiler/msprof_analyze/compare_tools/compare_backend/generator/base_generator.py similarity index 100% rename from profiler/compare_tools/compare_backend/generator/base_generator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/generator/base_generator.py diff --git a/profiler/compare_tools/compare_backend/generator/detail_performance_generator.py b/profiler/msprof_analyze/compare_tools/compare_backend/generator/detail_performance_generator.py similarity index 65% rename from profiler/compare_tools/compare_backend/generator/detail_performance_generator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/generator/detail_performance_generator.py index 0fba831c980c39129c7092fc56420b45798e3b07..6479d624fae6d3c7aa437d7f152dde7284b74f43 100644 --- a/profiler/compare_tools/compare_backend/generator/detail_performance_generator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/generator/detail_performance_generator.py @@ -13,42 +13,53 @@ # See the License for the specific language governing permissions and # limitations under the License. 
import os -import logging +from collections import OrderedDict from datetime import datetime -from compare_backend.comparator.communication_comparator import CommunicationComparator -from compare_backend.comparator.module_comparetor import ModuleComparator -from compare_backend.comparator.module_statistic_comparator import ModuleStatisticComparator -from compare_backend.comparator.operator_comparator import OperatorComparator -from compare_backend.comparator.operator_statistic_comparator import OperatorStatisticComparator -from compare_backend.comparator.api_compare_comparator import ApiCompareComparator -from compare_backend.comparator.kernel_compare_comparator import KernelCompareComparator -from compare_backend.comparator.overall_metrics_comparator import OverallMetricsComparator -from compare_backend.compare_bean.communication_bean import CommunicationBean -from compare_backend.compare_bean.memory_compare_bean import MemoryCompareBean -from compare_backend.compare_bean.memory_statistic_bean import MemoryStatisticBean -from compare_backend.compare_bean.module_compare_bean import ModuleCompareBean -from compare_backend.compare_bean.module_statistic_bean import ModuleStatisticBean -from compare_backend.compare_bean.operator_compare_bean import OperatorCompareBean -from compare_backend.compare_bean.operator_statistic_bean import OperatorStatisticBean -from compare_backend.compare_bean.api_compare_bean import ApiCompareBean -from compare_backend.compare_bean.kernel_compare_bean import KernelCompareBean -from compare_backend.compare_bean.overall_metrics_bean import OverallMetricsBean -from compare_backend.data_prepare.module_data_prepare import ModuleDataPrepare -from compare_backend.data_prepare.operator_data_prepare import OperatorDataPrepare -from compare_backend.generator.base_generator import BaseGenerator -from compare_backend.utils.constant import Constant -from compare_backend.view.excel_view import ExcelView +from 
msprof_analyze.compare_tools.compare_backend.comparator.communication_comparator import CommunicationComparator +from msprof_analyze.compare_tools.compare_backend.comparator.module_comparetor import ModuleComparator +from msprof_analyze.compare_tools.compare_backend.comparator.module_statistic_comparator \ + import ModuleStatisticComparator +from msprof_analyze.compare_tools.compare_backend.comparator.operator_comparator import OperatorComparator +from msprof_analyze.compare_tools.compare_backend.comparator.operator_statistic_comparator \ + import OperatorStatisticComparator +from msprof_analyze.compare_tools.compare_backend.comparator.api_compare_comparator import ApiCompareComparator +from msprof_analyze.compare_tools.compare_backend.comparator.kernel_compare_comparator import KernelCompareComparator +from msprof_analyze.compare_tools.compare_backend.comparator.overall_metrics_comparator import OverallMetricsComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.communication_bean import CommunicationBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.memory_compare_bean import MemoryCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.memory_statistic_bean import MemoryStatisticBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.module_compare_bean import ModuleCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.module_statistic_bean import ModuleStatisticBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.operator_compare_bean import OperatorCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.operator_statistic_bean import OperatorStatisticBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.api_compare_bean import ApiCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.kernel_compare_bean import KernelCompareBean +from 
msprof_analyze.compare_tools.compare_backend.compare_bean.overall_metrics_bean import OverallMetricsBean +from msprof_analyze.compare_tools.compare_backend.data_prepare.module_data_prepare import ModuleDataPrepare +from msprof_analyze.compare_tools.compare_backend.data_prepare.operator_data_prepare import OperatorDataPrepare +from msprof_analyze.compare_tools.compare_backend.comparator.kernel_type_comparator import KernelTypeComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.kernel_type_compare_bean import KernelTypeCompareBean +from msprof_analyze.compare_tools.compare_backend.view.excel_view import ExcelView +from msprof_analyze.compare_tools.compare_backend.data_prepare.sequence_pre_matching import SequencePreMatching +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger -from compare_backend.data_prepare.sequence_pre_matching import SequencePreMatching +logger = get_logger() -class DetailPerformanceGenerator(BaseGenerator): +class DetailPerformanceGenerator: def __init__(self, profiling_data_dict: dict, args: any): - super().__init__(profiling_data_dict, args) + self._profiling_data_dict = profiling_data_dict + self._args = args + self._result_data = OrderedDict() self._base_step_id = int(args.base_step) if args.base_step else Constant.VOID_STEP self._comparison_step_id = int(args.comparison_step) if args.comparison_step else Constant.VOID_STEP + def run(self): + self.compare() + self.generate_view() + def compare(self): enable_compare = [ self._args.enable_operator_compare, @@ -59,7 +70,7 @@ class DetailPerformanceGenerator(BaseGenerator): self._args.enable_profiling_compare, ] if any(enable_compare): - logging.info("Start to compare performance detail data, please wait.") + logger.info("Start to compare performance detail data, please wait.") comparator_list = self._create_comparator() else: comparator_list = [] @@ -73,7 +84,7 @@ class 
DetailPerformanceGenerator(BaseGenerator): file_name = "performance_comparison_result_{}.xlsx".format(datetime.utcnow().strftime("%Y%m%d%H%M%S")) result_file_path = os.path.abspath(os.path.join(dir_path, file_name)) ExcelView(self._result_data, result_file_path, self._args).generate_view() - logging.info("The comparison result file has been generated: %s", result_file_path) + logger.info("The comparison result file has been generated: %s", result_file_path) def _create_comparator(self): comparator_list = [] @@ -95,7 +106,7 @@ class DetailPerformanceGenerator(BaseGenerator): # 算子性能比对-module级 enable_operator_compare = False if self._args.enable_operator_compare: - module_compare_result = self._module_match() + module_compare_result = self._module_match() if not self._args.disable_module else [] if module_compare_result: comparator_list.append(ModuleStatisticComparator(module_compare_result, ModuleStatisticBean)) if not self._args.disable_details: @@ -140,7 +151,10 @@ class DetailPerformanceGenerator(BaseGenerator): Constant.BASE_DATA: self._profiling_data_dict.get(Constant.BASE_DATA).kernel_details, Constant.COMPARISON_DATA: self._profiling_data_dict.get(Constant.COMPARISON_DATA).kernel_details, } - comparator_list.append(KernelCompareComparator(kernel_compare_result, KernelCompareBean)) + if self._args.use_kernel_type: + comparator_list.append(KernelTypeComparator(kernel_compare_result, KernelTypeCompareBean)) + else: + comparator_list.append(KernelCompareComparator(kernel_compare_result, KernelCompareBean)) return comparator_list def _module_match(self): diff --git a/profiler/compare_tools/compare_backend/generator/overall_performance_generator.py b/profiler/msprof_analyze/compare_tools/compare_backend/generator/overall_performance_generator.py similarity index 68% rename from profiler/compare_tools/compare_backend/generator/overall_performance_generator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/generator/overall_performance_generator.py 
index 6bddf4f46892e528793adc099d6fe3bd2a2a509b..e144ef185c433543f1b05afd64a8125d0d1f26c4 100644 --- a/profiler/compare_tools/compare_backend/generator/overall_performance_generator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/generator/overall_performance_generator.py @@ -12,12 +12,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import logging +from msprof_analyze.compare_tools.compare_backend.comparator.overall_performance_comparator \ + import OverallPerformanceComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.profiling_info import ProfilingInfo +from msprof_analyze.compare_tools.compare_backend.generator.base_generator import BaseGenerator +from msprof_analyze.compare_tools.compare_backend.view.screen_view import ScreenView +from msprof_analyze.prof_common.logger import get_logger -from compare_backend.comparator.overall_performance_comparator import OverallPerformanceComparator -from compare_backend.compare_bean.profiling_info import ProfilingInfo -from compare_backend.generator.base_generator import BaseGenerator -from compare_backend.view.screen_view import ScreenView +logger = get_logger() class OverallPerformanceGenerator(BaseGenerator): @@ -33,6 +35,6 @@ class OverallPerformanceGenerator(BaseGenerator): if not self._result_data: return ScreenView(self._result_data).generate_view() - logging.info("The OverallMetrics sheet page is more comprehensive for the disaggregate of performance data, " + logger.info("The OverallMetrics sheet page is more comprehensive for the disaggregate of performance data, " "and it is recommended to view the overall performance comparison results from " "the performance_comparison_result_*.xlsx.") diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/interface/__init__.py 
b/profiler/msprof_analyze/compare_tools/compare_backend/interface/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/compare_tools/compare_backend/interface/compare_interface.py b/profiler/msprof_analyze/compare_tools/compare_backend/interface/compare_interface.py similarity index 43% rename from profiler/compare_tools/compare_backend/interface/compare_interface.py rename to profiler/msprof_analyze/compare_tools/compare_backend/interface/compare_interface.py index 67c5db67fb45c46ba36675931629b8958e6b7bf1..2a459c7621a22c94411a3b93a46ff8c877f6bdea 100644 --- a/profiler/compare_tools/compare_backend/interface/compare_interface.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/interface/compare_interface.py @@ -1,14 +1,29 @@ -from compare_backend.utils.constant import Constant +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.comparator.operator_comparator import OperatorComparator +from msprof_analyze.compare_tools.compare_backend.comparator.api_compare_comparator import ApiCompareComparator +from msprof_analyze.compare_tools.compare_backend.comparator.kernel_compare_comparator import KernelCompareComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.operator_compare_bean import OperatorCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.api_compare_bean import ApiCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.kernel_compare_bean import KernelCompareBean +from msprof_analyze.compare_tools.compare_backend.data_prepare.operator_data_prepare import OperatorDataPrepare +from msprof_analyze.compare_tools.compare_backend.data_prepare.sequence_pre_matching import SequencePreMatching +from msprof_analyze.compare_tools.compare_backend.comparator.kernel_type_comparator import KernelTypeComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.kernel_type_compare_bean import KernelTypeCompareBean +from msprof_analyze.prof_common.constant import Constant -from compare_backend.comparator.operator_comparator import OperatorComparator -from compare_backend.comparator.api_compare_comparator import ApiCompareComparator -from compare_backend.comparator.kernel_compare_comparator import KernelCompareComparator -from compare_backend.compare_bean.operator_compare_bean import OperatorCompareBean -from compare_backend.compare_bean.api_compare_bean import ApiCompareBean -from compare_backend.compare_bean.kernel_compare_bean import KernelCompareBean -from compare_backend.data_prepare.operator_data_prepare import OperatorDataPrepare -from compare_backend.utils.constant import Constant -from compare_backend.data_prepare.sequence_pre_matching import SequencePreMatching class CompareInterface: def __init__(self, data_dict: dict, args_manager: any): @@ -20,10 +35,13 @@ class 
CompareInterface: kernel_compare_result = { Constant.BASE_DATA: self._data_dict.get(Constant.BASE_DATA).kernel_details, Constant.COMPARISON_DATA: self._data_dict.get(Constant.COMPARISON_DATA).kernel_details} - return KernelCompareComparator(kernel_compare_result, KernelCompareBean).generate_data() + if self._args_manager.use_kernel_type: + return KernelTypeComparator(kernel_compare_result, KernelTypeCompareBean).generate_data() + else: + return KernelCompareComparator(kernel_compare_result, KernelCompareBean).generate_data() base_op_prepare = OperatorDataPrepare(self._data_dict.get(Constant.BASE_DATA), - self._args_manager.base_step) + self._args_manager.base_step) comparison_op_prepare = OperatorDataPrepare(self._data_dict.get(Constant.COMPARISON_DATA), self._args_manager.comparison_step) @@ -38,10 +56,10 @@ class CompareInterface: comparison_op_prepare.get_top_layer_ops()) return OperatorComparator(op_compare_result, OperatorCompareBean).generate_data() return {} - + def _operator_match(self, base_ops, comparison_ops): base_bwd_tid = self._data_dict.get(Constant.BASE_DATA).bwd_tid comparison_bwd_tid = self._data_dict.get(Constant.COMPARISON_DATA).bwd_tid - return SequencePreMatching(self._args_manager.args, base_bwd_tid, comparison_bwd_tid).run(SequencePreMatching.OP_TYPE, - base_ops, comparison_ops) - + return SequencePreMatching(self._args_manager.args, base_bwd_tid, comparison_bwd_tid).run( + SequencePreMatching.OP_TYPE, + base_ops, comparison_ops) diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/interface/overall_interface.py b/profiler/msprof_analyze/compare_tools/compare_backend/interface/overall_interface.py new file mode 100644 index 0000000000000000000000000000000000000000..dcd7c2b49e8b279cdef1d822dfc43e98366e3f3a --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/interface/overall_interface.py @@ -0,0 +1,28 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from msprof_analyze.compare_tools.compare_backend.comparator.overall_performance_comparator \ + import OverallPerformanceComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.profiling_info import ProfilingInfo +from msprof_analyze.prof_common.constant import Constant + + +class OverallInterface: + def __init__(self, overall_data: dict): + self._overall_data = overall_data + + def run(self): + data = {Constant.BASE_DATA: self._overall_data.get(Constant.BASE_DATA).overall_metrics, + Constant.COMPARISON_DATA: self._overall_data.get(Constant.COMPARISON_DATA).overall_metrics} + return OverallPerformanceComparator(data, ProfilingInfo).generate_data() diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py b/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py similarity index 83% rename from profiler/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py rename to profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py index 
2e5df19827d49a3765f64dac60d14a1a5b28c260..b3d9a29944fe7f722b2fc54ac0381f8c23c1af14 100644 --- a/profiler/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py @@ -1,15 +1,36 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. from abc import abstractmethod, ABC from decimal import Decimal -import logging -from compare_backend.compare_bean.origin_data_bean.compare_event import KernelEvent, MemoryEvent -from compare_backend.compare_bean.origin_data_bean.kernel_details_bean import KernelDetailsBean -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.compare_bean.profiling_info import ProfilingInfo -from compare_backend.utils.constant import Constant -from compare_backend.utils.file_reader import FileReader +import ijson -logger = logging.getLogger() +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.compare_event import ( + KernelEvent, + MemoryEvent +) +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.kernel_details_bean \ + import KernelDetailsBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.profiling_info import 
ProfilingInfo +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_common.path_manager import PathManager + +logger = get_logger() class ProfilingResult: @@ -57,13 +78,13 @@ class ProfilingResult: class BaseProfilingParser(ABC): + trace_event_item = {Constant.GPU: "traceEvents.item", Constant.NPU: "item"} def __init__(self, args: any, path_dict: dict, step_id: int = Constant.VOID_STEP): self._args = args self._profiling_type = path_dict.get(Constant.PROFILING_TYPE) self._profiling_path = path_dict.get(Constant.PROFILING_PATH) self._json_path = path_dict.get(Constant.TRACE_PATH) - self._trace_events = [] if self._profiling_path == Constant.NPU else {} self._enable_profiling_compare = args.enable_profiling_compare self._enable_operator_compare = args.enable_operator_compare self._enable_memory_compare = args.enable_memory_compare @@ -78,7 +99,6 @@ class BaseProfilingParser(ABC): self._all_kernels = {} self._comm_task_list = [] self._comm_list = [] - self._read_trace_event() self._cur_func_index = 0 self._categorize_performance_index = 0 self._cpu_cube_op = None @@ -122,14 +142,20 @@ class BaseProfilingParser(ABC): def _get_dispatch_func(self): raise NotImplementedError("Function _get_dispatch_func need to be implemented.") + @abstractmethod + def _calculate_mc2_communication_time(self, kernel: KernelDetailsBean): + raise NotImplementedError("Function _calculate_mc2_communication_time need to be implemented.") + def load_data(self) -> ProfilingResult: self._result_data.update_bwd_tid(self._bwd_tid) if self._step_id != Constant.VOID_STEP and self._profiling_type == Constant.GPU: msg = "[WARNING] step id is invalid in GPU data, please use this when comparing between NPU datas." 
raise RuntimeError(msg) - self._dispatch_events() - self._update_kernel_dict() - self._update_communication_dict() + if any((self._enable_profiling_compare, self._enable_operator_compare, self._enable_memory_compare, + self._enable_api_compare, self._enable_communication_compare)): + self._dispatch_events() + self._update_kernel_dict() + self._update_communication_dict() if self._enable_memory_compare: self._update_memory_list() if self._enable_profiling_compare: @@ -146,6 +172,11 @@ class BaseProfilingParser(ABC): if tk.is_sdma(): self._result_data.overall_metrics.update_sdma_tensor_move_info(tk.dur) return + if tk.is_mc2(): + communication_time = self._calculate_mc2_communication_time(tk) + computing_time = tk.mc2_computing_time + self._result_data.overall_metrics.update_mc2_info(tk.name, tk.dur, computing_time, communication_time) + return if flow_start_time: while self._categorize_performance_index < len(self.cpu_cube_op): cur_op = self.cpu_cube_op[self._categorize_performance_index] @@ -226,9 +257,7 @@ class BaseProfilingParser(ABC): if not self._dispatch_func: return index_list = list(range(0, len(self._dispatch_func))) * 2 - for event in self._trace_events: - if not event.is_dict(): - continue + for event in self._trace_event_generator(self._profiling_type): if event.is_m_mode(): continue self.__picking_event(event, index_list) @@ -314,6 +343,8 @@ class BaseProfilingParser(ABC): task_index += 1 def _check_result_data(self): + if self._json_path == self._profiling_path: + return if self._enable_operator_compare or self._enable_memory_compare or self._enable_api_compare: if not self._result_data.torch_op_data: logger.warning("Can't find any torch op in the file: %s", self._profiling_path) @@ -331,8 +362,10 @@ class BaseProfilingParser(ABC): "make sure that the profiling data is greater than level0 and " "aic_metrics=PipeUtilization.", self._profiling_path) - def _read_trace_event(self): - try: - self._trace_events = FileReader.read_trace_file(self._json_path) 
- except Exception: - print(f"[ERROR] Failed to read the file: {self._json_path}") + def _trace_event_generator(self, profiling_type): + PathManager.check_path_readable(self._json_path) + FileManager.check_file_size(self._json_path) + item = self.trace_event_item.get(profiling_type) + with open(self._json_path, 'r') as file: + for event in ijson.items(file, item): + yield TraceEventBean(event) diff --git a/profiler/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py b/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py similarity index 82% rename from profiler/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py rename to profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py index c2e5a148056ed9a598b1d0303c057927202b5939..f5a6e564c575941cdb2025e39d73eaea3fd8d5b3 100644 --- a/profiler/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py @@ -1,9 +1,26 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import sys from collections import defaultdict, Counter -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.profiling_parser.base_profiling_parser import BaseProfilingParser -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.profiling_parser.base_profiling_parser import BaseProfilingParser +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger + +logger = get_logger() class GPUProfilingParser(BaseProfilingParser): @@ -13,7 +30,6 @@ class GPUProfilingParser(BaseProfilingParser): def __init__(self, args: any, path_dict: dict, step_id: int = Constant.VOID_STEP): super().__init__(args, path_dict, step_id) - self._trace_events = [TraceEventBean(event) for event in self._trace_events.get("traceEvents", [])] self._flow_cat = (args.gpu_flow_cat,) if args.gpu_flow_cat else self.FLOW_CAT self._compute_stream_id = self._infer_compute_stream_id() self._marks = defaultdict(int) @@ -27,6 +43,9 @@ class GPUProfilingParser(BaseProfilingParser): def _update_kernel_details(self): pass + def _calculate_mc2_communication_time(self, kernel): + pass + def _update_memory_list(self): if not self._enable_memory_compare: return @@ -58,10 +77,11 @@ class GPUProfilingParser(BaseProfilingParser): def _calculate_performance_time(self): min_ts = sys.float_info.max max_ts = sys.float_info.min - self._trace_events.sort(key=lambda x: x.start_time) + kernels = list(self._all_kernels.values()) + kernels.sort(key=lambda x: x.start_time) flow_dict_new = self._get_flow_time_dict() computing_events = [] - for event in self._trace_events: + for event in kernels: if event.stream: min_ts = min(event.start_time, min_ts) max_ts = max(event.end_time, max_ts) @@ -120,14 +140,14 @@ class GPUProfilingParser(BaseProfilingParser):
return event.lower_cat in self.TORCH_OP_CAT def _is_kernel_event(self, event: TraceEventBean): - return event.is_kernel_cat() + return event.is_kernel_cat() or event.is_memory_copy_cat() def _is_flow_event(self, event: TraceEventBean): return event.lower_cat in self._flow_cat def __parse_memory_reserved(self): if not self._memory_events: - print("[INFO] Gpu profiling data doesn't contain memory info.") + logger.info("Gpu profiling data doesn't contain memory info.") return memory_used = max([event.total_reserved for event in self._memory_events]) / 1024 ** 3 self._result_data.overall_metrics.set_memory_used(memory_used) @@ -136,7 +156,7 @@ class GPUProfilingParser(BaseProfilingParser): func_set = set() if self._enable_memory_compare or self._enable_operator_compare or self._enable_profiling_compare: func_set.add(self._picking_torch_op_event) - if self._enable_communication_compare: + if self._enable_communication_compare or self._enable_profiling_compare: func_set.add(self._picking_kernel_event) if self._enable_operator_compare: func_set.add(self._picking_python_function_event) @@ -156,7 +176,7 @@ class GPUProfilingParser(BaseProfilingParser): if not self._enable_profiling_compare: return -1 kernel_stream_ids = [] - for event in self._trace_events: + for event in self._trace_event_generator(Constant.GPU): if event.is_kernel_except_nccl() and event.stream: kernel_stream_ids.append(event.stream) if not kernel_stream_ids: @@ -165,7 +185,7 @@ class GPUProfilingParser(BaseProfilingParser): return counter.most_common(1)[0][0] def _find_bwd_tid(self): - for event in self._trace_events: + for event in self._trace_event_generator(Constant.GPU): if event.is_fwdbwd() and event.is_flow_end(): self._bwd_tid = event.tid break diff --git a/profiler/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py b/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py similarity index 75% rename from 
profiler/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py rename to profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py index 8ffb2dadfd0c804beb5a96a65d6f796eba834b5d..6e7e9e1c1e5963303ceb2cd73abda2089d7ae069 100644 --- a/profiler/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py @@ -1,18 +1,35 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os import sys -import logging from math import ceil -from compare_backend.compare_bean.origin_data_bean.kernel_details_bean import KernelDetailsBean -from compare_backend.compare_bean.origin_data_bean.memory_record_bean import MemoryRecordBean -from compare_backend.compare_bean.origin_data_bean.operator_memory_bean import OperatorMemoryBean -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.profiling_parser.base_profiling_parser import BaseProfilingParser -from compare_backend.utils.constant import Constant -from compare_backend.utils.file_reader import FileReader +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.kernel_details_bean \ + import KernelDetailsBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.memory_record_bean \ + import MemoryRecordBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.operator_memory_bean \ + import OperatorMemoryBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.profiling_parser.base_profiling_parser import BaseProfilingParser +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.op_stastic_bean import OpStatisticBean +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.logger import get_logger - -logger = logging.getLogger() +logger = get_logger() class NPUProfilingParser(BaseProfilingParser): @@ -23,12 +40,13 @@ class NPUProfilingParser(BaseProfilingParser): def __init__(self, args: any, path_dict: dict, step_id: int = Constant.VOID_STEP): super().__init__(args, path_dict, step_id) + self._path_level = NPUProfilingParser._get_path_level(path_dict) self._operator_memory_path = 
os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "operator_memory.csv") self._memory_record_path = os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "memory_record.csv") self._kernel_detail_path = os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "kernel_details.csv") + self._op_statistic_path = os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "op_statistic.csv") self._communication_path = os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "communication.json") self._info_json_path = path_dict.get(Constant.INFO_JSON_PATH, "") - self._trace_events = [TraceEventBean(event) for event in self._trace_events] self._hccl_pid = None self._hccl_op_tid_list = [] self._kernel_pid = None @@ -38,8 +56,22 @@ class NPUProfilingParser(BaseProfilingParser): self._overlap_analysis = [] self._group_comm_tid_dict = {} self._hccl_tid_name_dict = {} + self._c_core_sqe_list = [] + self._c_core_sqe_index = 0 self._dispatch_func = self._get_dispatch_func() - self._filter_meta_id() + if any((self._enable_profiling_compare, self._enable_operator_compare, self._enable_memory_compare, + self._enable_api_compare, self._enable_communication_compare)): + self._filter_meta_id() + + @staticmethod + def _get_path_level(path_dict): + if not path_dict.get(Constant.PROFILING_PATH, ""): + return Constant.PROFILING_PATH + if path_dict.get(Constant.PROFILING_PATH, "") == path_dict.get(Constant.TRACE_PATH, ""): + return Constant.TRACE_PATH + if path_dict.get(Constant.PROFILING_PATH, "") == path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""): + return Constant.ASCEND_OUTPUT_PATH + return Constant.PROFILING_PATH @staticmethod def __calculate_overlap_time_with_uncovered_communication(uncovered_communication_events: list, events: list): @@ -62,6 +94,18 @@ class NPUProfilingParser(BaseProfilingParser): index += 1 return float(overlap_time) + @classmethod + def _read_csv_data(cls, file_path, bean): + data = [] + file_name = os.path.basename(file_path) + try: + 
data = FileManager.read_csv_file(file_path, bean) + except FileNotFoundError: + logger.warning("The file %s does not exist.", file_name) + except Exception: + logger.error("Failed to read %s.", file_name) + return data + def _get_dispatch_func(self): func_list = set() if self._enable_memory_compare or self._enable_operator_compare or self._enable_profiling_compare: @@ -86,18 +130,22 @@ class NPUProfilingParser(BaseProfilingParser): return list(func_list) def _update_kernel_details(self): - try: - kernel_details = FileReader.read_csv_file(self._kernel_detail_path, KernelDetailsBean) - except FileNotFoundError: - logger.warning("The file kernel_details.csv does not exist.") - except Exception: - logger.warning("Failed to read kernel_details.csv.") + if self._path_level == Constant.TRACE_PATH: + return + if self._args.use_kernel_type: + op_statistics = self._read_csv_data(self._op_statistic_path, OpStatisticBean) + if not op_statistics: + return + self._result_data.update_kernel_details( + {f"{kernel.kernel_type}-{kernel.core_type}": kernel for kernel in op_statistics}) return + + kernel_details = self._read_csv_data(self._kernel_detail_path, KernelDetailsBean) if not kernel_details: return kernels_dict = {} for kernel in kernel_details: - if kernel.is_invalid(): + if kernel.is_invalid_op_type(): continue if self._step_id != Constant.VOID_STEP and kernel.step_id != self._step_id: continue @@ -110,19 +158,14 @@ class NPUProfilingParser(BaseProfilingParser): " please check whether the data contains this step." 
raise RuntimeError(msg) else: - logger.warning("Failed to enable enable_kernel_compare, type of kernel_details.csv is null.") + logger.warning("Failed to enable enable_kernel_compare, kernel_details.csv lacks duration.") return self._result_data.update_kernel_details(kernels_dict) def _update_memory_list(self): - try: - memory_data = FileReader.read_csv_file(self._operator_memory_path, OperatorMemoryBean) - except FileNotFoundError: - logger.warning("The file operator_memory.csv does not exist.") - return - except Exception: - logger.error("Failed to read operator_memory.csv.") + if self._path_level == Constant.TRACE_PATH: return + memory_data = self._read_csv_data(self._operator_memory_path, OperatorMemoryBean) if memory_data: self._dequeue_data.sort(key=lambda x: x.start_time) for data in memory_data: @@ -143,6 +186,16 @@ class NPUProfilingParser(BaseProfilingParser): Constant.ALLOCATION_TIME: data.allocation_time, Constant.RELEASE_TIME: data.release_time}) + def _update_kernel_dict(self): + kernel_details = [] + if self._path_level != Constant.TRACE_PATH: + kernel_details = self._read_csv_data(self._kernel_detail_path, KernelDetailsBean) + input_shape_dict = {kernel.start_time: kernel.input_shapes for kernel in kernel_details} + for kernel in self._all_kernels.values(): + input_shape = input_shape_dict.get(kernel.start_time, "") + kernel.input_shapes = input_shape + super()._update_kernel_dict() + def __match_dequeue_data(self, ts_time: float) -> int: if not self._dequeue_data: return Constant.INVALID_VALUE @@ -157,10 +210,13 @@ class NPUProfilingParser(BaseProfilingParser): self._dequeue_data[left].end_time else Constant.INVALID_VALUE def _update_bandwidth(self): + if self._path_level == Constant.TRACE_PATH: + return try: - communication_json = FileReader.read_trace_file(self._communication_path) + communication_json = FileManager.read_json_file(self._communication_path) except FileNotFoundError: logger.warning("The file communication.json does not exist.") +
return except Exception: logger.error("Failed to read communication.json.") return @@ -183,11 +239,12 @@ class NPUProfilingParser(BaseProfilingParser): sdma_time_ms += sdma_info.get("Transit Time(ms)", 0) # unit: ms rdma_bandwidth = rdma_size_mb / rdma_time_ms if rdma_time_ms > 0 else 0 sdma_bandwidth = sdma_size_mb / sdma_time_ms if sdma_time_ms > 0 else 0 - self._result_data.overall_metrics.set_RDMA_bandwidth(rdma_bandwidth) - self._result_data.overall_metrics.set_SDMA_bandwidth(sdma_bandwidth) + self._result_data.overall_metrics.set_rdma_bandwidth(rdma_bandwidth) + self._result_data.overall_metrics.set_sdma_bandwidth(sdma_bandwidth) def _update_overall_metrics(self): - self.__parse_info_json() + if self._path_level == Constant.PROFILING_PATH: + self.__parse_info_json() self.__parse_mem_csv() self.__parse_kernel_csv() self.__add_lccl_time() @@ -267,6 +324,24 @@ class NPUProfilingParser(BaseProfilingParser): return True return False + def _calculate_mc2_communication_time(self, kernel: KernelDetailsBean): + sqe_data = [] + while self._c_core_sqe_index < len(self._c_core_sqe_list): + end_time = self._c_core_sqe_list[self._c_core_sqe_index].end_time + if end_time < kernel.start_time: + self._c_core_sqe_index += 1 + continue + if end_time <= kernel.end_time: + sqe_data.append(self._c_core_sqe_list[self._c_core_sqe_index]) + self._c_core_sqe_index += 1 + continue + break + communication_time = 0 + for i in range(0, len(sqe_data), 2): + if i + 1 < len(sqe_data): + communication_time += sqe_data[i + 1].end_time - sqe_data[i].end_time + return float(communication_time) + def _is_kernel_event(self, event: TraceEventBean): return event.pid == self._kernel_pid and event.is_x_mode() @@ -278,7 +353,7 @@ class NPUProfilingParser(BaseProfilingParser): def _filter_meta_id(self): thread_events, thread_sort_events = [], [] - for event in self._trace_events: + for event in self._trace_event_generator(Constant.NPU): if event.is_fwdbwd() and event.is_flow_end(): self._bwd_tid =
event.tid if not event.is_m_mode(): @@ -320,7 +395,7 @@ class NPUProfilingParser(BaseProfilingParser): def __parse_info_json(self): try: - json_data = FileReader.read_trace_file(self._info_json_path) + json_data = FileManager.read_json_file(self._info_json_path) except Exception: logger.error('Failed to read profiler_info.json.') return @@ -341,8 +416,10 @@ class NPUProfilingParser(BaseProfilingParser): self._result_data.overall_metrics.update_lccl_info(event.dur) def __parse_kernel_csv(self): + if self._path_level == Constant.TRACE_PATH: + return try: - kernel_details = FileReader.read_csv_file(self._kernel_detail_path, KernelDetailsBean) + kernel_details = self._read_csv_data(self._kernel_detail_path, KernelDetailsBean) except Exception: logger.error('Npu kernel details csv file is not available.') return @@ -353,17 +430,23 @@ class NPUProfilingParser(BaseProfilingParser): ordered_computing_events = sorted( ((flow_dict_new.get(kernel.start_time, 0), kernel) for kernel in kernel_details if not kernel.is_invalid()), key=lambda x: x[0]) + self._c_core_sqe_list = list(filter(lambda x: x.is_c_core_sqe(), self._all_kernels.values())) + self._c_core_sqe_list.sort(key=lambda x: x.start_time) for flow_start_time, event in ordered_computing_events: self.categorize_computing_performance_data(event, flow_start_time) def __parse_mem_csv(self): + if self._path_level == Constant.TRACE_PATH: + return try: - memory_record = FileReader.read_csv_file(self._memory_record_path, MemoryRecordBean) + memory_record = self._read_csv_data(self._memory_record_path, MemoryRecordBean) except FileNotFoundError: logger.warning('Npu memory record csv file is not available.') + return except Exception: logger.error('Load memory info failed.') - else: + return + if memory_record: memory_used = max([memory.total_reserved_mb for memory in memory_record]) / 1024 self._result_data.overall_metrics.set_memory_used(memory_used) diff --git 
a/profiler/msprof_analyze/compare_tools/compare_backend/utils/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/compare_tools/compare_backend/utils/args_manager.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/args_manager.py similarity index 74% rename from profiler/compare_tools/compare_backend/utils/args_manager.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/args_manager.py index bf36cc32bc3736ce63cb2eb353f6349804a2ace7..6ac463982b434901390b4440ef0bcde956d62362 100644 --- a/profiler/compare_tools/compare_backend/utils/args_manager.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/args_manager.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,29 +12,21 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" - import os.path import re -from common_func.path_manager import PathManager -from compare_backend.utils.constant import Constant -from compare_backend.utils.file_reader import FileReader - +from msprof_analyze.compare_tools.compare_backend.utils.singleton import Singleton +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_common.path_manager import PathManager -class Singleton(object): - def __init__(self, cls): - self._cls = cls - self._instance = {} - - def __call__(self, args=None): - if self._cls not in self._instance: - self._instance[self._cls] = self._cls(args) - return self._instance[self._cls] +logger = get_logger() @Singleton class ArgsManager: + __slots__ = ['_args', '_base_path_dict', '_comparison_path_dict', '_base_step', '_comparison_step'] def __init__(self, args: any): self._args = args @@ -99,15 +90,30 @@ class ArgsManager: @property def enable_api_compare(self): return self._args.enable_api_compare - + @property def enable_kernel_compare(self): return self._args.enable_kernel_compare + @property + def use_kernel_type(self): + return self._args.use_kernel_type + @classmethod - def check_profiling_path(cls, file_path: str): - PathManager.input_path_common_check(file_path) - PathManager.check_path_owner_consistent(file_path) + def check_profiling_path(cls, path_dict: dict): + PathManager.input_path_common_check(path_dict.get(Constant.PROFILING_PATH)) + path_list = [path_dict.get(Constant.PROFILING_PATH, "")] if path_dict.get( + Constant.PROFILING_TYPE) == Constant.GPU else [ + path_dict.get(Constant.PROFILING_PATH, ""), + path_dict.get(Constant.TRACE_PATH, ""), + path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), + path_dict.get(Constant.INFO_JSON_PATH, ""), + os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "operator_memory.csv"), + 
os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "memory_record.csv"), + os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "kernel_details.csv"), + os.path.join(path_dict.get(Constant.ASCEND_OUTPUT_PATH, ""), "communication.json") + ] + PathManager.check_path_owner_consistent(path_list) @classmethod def check_output_path(cls, output_path: str): @@ -115,27 +121,16 @@ class ArgsManager: PathManager.make_dir_safety(output_path) PathManager.check_path_writeable(output_path) - def get_step_args_with_validating(self): - if self._args.base_step and self._args.comparison_step: - if all([self._args.base_step.isdigit(), self._args.comparison_step.isdigit()]): - self._base_step = int(self._args.base_step) - self._comparison_step = int(self._args.comparison_step) - else: - msg = "Invalid param, base_step and comparison_step must be a number." - raise RuntimeError(msg) - elif any([self._args.base_step, self._args.comparison_step]): - msg = "Invalid param, base_step and comparison_step must be set at the same time." 
- raise RuntimeError(msg) - - def parse_profiling_path(self, file_path: str): - self.check_profiling_path(file_path) + @classmethod + def parse_profiling_path(cls, file_path: str): + PathManager.input_path_common_check(file_path) if os.path.isfile(file_path): (split_file_path, split_file_name) = os.path.split(file_path) (shot_name, extension) = os.path.splitext(split_file_name) if extension != ".json": msg = f"Invalid profiling path suffix: {file_path}" raise RuntimeError(msg) - json_type = FileReader.check_json_type(file_path) + json_type = FileManager.check_json_type(file_path) return { Constant.PROFILING_TYPE: json_type, Constant.PROFILING_PATH: file_path, Constant.TRACE_PATH: file_path } @@ -156,6 +151,18 @@ class ArgsManager: path_dict.update({Constant.INFO_JSON_PATH: os.path.join(file_path, dir_name)}) return path_dict + def get_step_args_with_validating(self): + if self._args.base_step and self._args.comparison_step: + if all([self._args.base_step.isdigit(), self._args.comparison_step.isdigit()]): + self._base_step = int(self._args.base_step) + self._comparison_step = int(self._args.comparison_step) + else: + msg = "Invalid param, base_step and comparison_step must be a number." + raise RuntimeError(msg) + elif any([self._args.base_step, self._args.comparison_step]): + msg = "Invalid param, base_step and comparison_step must be set at the same time." 
+ raise RuntimeError(msg) + def init(self): if self._args.max_kernel_num is not None and self._args.max_kernel_num <= Constant.LIMIT_KERNEL: msg = f"Invalid param, --max_kernel_num has to be greater than {Constant.LIMIT_KERNEL}" @@ -183,15 +190,15 @@ class ArgsManager: self._args.enable_communication_compare = True self._args.enable_api_compare = True self._args.enable_kernel_compare = True - + if not self._args.enable_kernel_compare and self._args.use_kernel_type: + logger.warning("The use_kernel_type parameter is invalid because it only takes effect " + "when enable_kernel_compare is enabled.") self.get_step_args_with_validating() - base_profiling_path = PathManager.get_realpath(self._args.base_profiling_path) - self.check_profiling_path(base_profiling_path) - self._base_path_dict = self.parse_profiling_path(base_profiling_path) - comparison_profiling_path = PathManager.get_realpath(self._args.comparison_profiling_path) - self.check_profiling_path(comparison_profiling_path) - self._comparison_path_dict = self.parse_profiling_path(comparison_profiling_path) - + self._base_path_dict = self.parse_profiling_path(PathManager.get_realpath(self._args.base_profiling_path)) + self.check_profiling_path(self._base_path_dict) + self._comparison_path_dict = self.parse_profiling_path( + PathManager.get_realpath(self._args.comparison_profiling_path)) + self.check_profiling_path(self._comparison_path_dict) if self._args.output_path: self.check_output_path(PathManager.get_realpath(self._args.output_path)) @@ -200,6 +207,8 @@ class ArgsManager: self._args.enable_operator_compare = False self._args.enable_api_compare = False self._args.enable_kernel_compare = False + self._args.enable_memory_compare = False + self._args.enable_communication_compare = False if compare_type == Constant.OVERALL_COMPARE: self._args.enable_profiling_compare = True elif compare_type == Constant.OPERATOR_COMPARE: @@ -210,4 +219,4 @@ class ArgsManager: self._args.enable_kernel_compare = True else: msg = 
f"Invalid compare_type: {compare_type}, please check it." - raise RuntimeError(msg) \ No newline at end of file + raise RuntimeError(msg) diff --git a/profiler/compare_tools/compare_backend/utils/common_func.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/common_func.py similarity index 73% rename from profiler/compare_tools/compare_backend/utils/common_func.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/common_func.py index ca06db8b24a6df39a0866c485670c3f9c1461e5e..ac9b4726eadb2cb5998ae10b4eaa995b6d7f0252 100644 --- a/profiler/compare_tools/compare_backend/utils/common_func.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/common_func.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,13 +12,13 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" - from decimal import Decimal -import logging +import numpy as np -logger = logging.getLogger() +from msprof_analyze.prof_common.logger import get_logger + +logger = get_logger() def calculate_diff_ratio(base_value: float, comparison_value: float): @@ -56,27 +55,24 @@ def convert_to_decimal(data: any) -> Decimal: def longest_common_subsequence_matching(base_ops: list, comparison_ops: list, name_func: any) -> list: if not comparison_ops: - result_data = [None] * len(base_ops) - for index, value in enumerate(base_ops): - result_data[index] = [value, None] - return result_data + return [[value, None] for value in base_ops] if not base_ops: - result_data = [None] * len(comparison_ops) - for index, value in enumerate(comparison_ops): - result_data[index] = [None, value] - return result_data + return [[None, value] for value in comparison_ops] comparison_len, base_len = len(comparison_ops), len(base_ops) if comparison_len * base_len > 50 * 10 ** 8: - print('[WARNING] The comparison time is expected to exceed 30 minutes, if you want to see the results quickly, ' - 'you can restart comparison task and turn on the switch --disable_details.') - dp_flag = set() # flag for only comparison op - pre_list = [0] * (base_len + 1) - cur_list = [0] * (base_len + 1) + logger.warning('The comparison time is expected to exceed 30 minutes, if you want to see the results quickly, ' + 'you can restart comparison task and turn on the switch --disable_details.') + + pre_list = np.zeros(base_len + 1, dtype=np.int32) + cur_list = np.zeros(base_len + 1, dtype=np.int32) - comparison_index = 1 all_base_data = [hash(name_func(op)) for op in base_ops] all_comparison_data = [hash(name_func(op)) for op in comparison_ops] + + dp_flag = BitMap((comparison_len + 1) * (base_len + 1)) # flag for only comparison op + + comparison_index = 1 for comparison_data in iter(all_comparison_data): base_index = 1 for base_data in all_base_data: @@ -117,3 +113,19 @@ def 
longest_common_subsequence_matching(base_ops: list, comparison_ops: list, na base_index -= 1 matched_op.reverse() return matched_op + + +class BitMap: + + def __init__(self, size): + self.size = size + # store the bit flags in a bytearray + self.bits = bytearray((size + 7) // 8) + + def __contains__(self, n: int) -> bool: + """Check whether the number is in the bitmap.""" + return bool(self.bits[n >> 3] & (1 << (n & 7))) + + def add(self, n: int): + """Add a number to the bitmap.""" + self.bits[n >> 3] |= 1 << (n & 7) # n >> 3 equals n // 8, n & 7 equals n % 8 diff --git a/profiler/compare_tools/compare_backend/utils/compare_args.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/compare_args.py similarity index 88% rename from profiler/compare_tools/compare_backend/utils/compare_args.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/compare_args.py index 5a8dadb8f6ad727fcc7c31e49a16e62aa4a0f18f..e16240a7f766e02068eae2d802cc869723498d27 100644 --- a/profiler/compare_tools/compare_backend/utils/compare_args.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/compare_args.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at #
-""" class Args: @@ -33,7 +31,8 @@ class Args: use_input_shape: bool = False, gpu_flow_cat: str = "", base_step: str = "", - comparison_step: str = ""): + comparison_step: str = "", + use_kernel_type: bool = False): self.base_profiling_path = base_profiling_path self.comparison_profiling_path = comparison_profiling_path self.enable_profiling_compare = enable_profiling_compare @@ -50,3 +49,4 @@ class Args: self.gpu_flow_cat = gpu_flow_cat self.base_step = base_step self.comparison_step = comparison_step + self.use_kernel_type = use_kernel_type diff --git a/profiler/compare_tools/compare_backend/utils/excel_config.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/excel_config.py similarity index 91% rename from profiler/compare_tools/compare_backend/utils/excel_config.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/excel_config.py index fb12f8a67c0515043d736ec5e8fd2508af719386..75b1c64b9eddaadbe55bebb020597e9d4c572a10 100644 --- a/profiler/compare_tools/compare_backend/utils/excel_config.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/excel_config.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,9 +12,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" - -from compare_backend.utils.constant import Constant +from msprof_analyze.prof_common.constant import Constant class CellFormatType: @@ -96,6 +93,8 @@ class ExcelConfig(object): DIFF_AVG_RATIO = "Diff Avg Ratio" DIFF_CALLS_RATIO = "Diff Calls Ratio" KERNEL = "Kernel" + KERNEL_TYPE = "Kernel Type" + CORE_TYPE = "Core Type" HEADERS = { Constant.OPERATOR_TABLE: [ @@ -235,7 +234,7 @@ class ExcelConfig(object): {"name": DIFF_AVG_RATIO, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, {"name": DIFF_CALLS_RATIO, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, ], - Constant.KERNEL_COMPARE: [ + Constant.KERNEL_TABLE: [ {"name": ORDER, "type": CellFormatType.DEFAULT, "width": 10}, {"name": KERNEL, "type": CellFormatType.BOLD_STR, "width": 30}, {"name": INPUT_SHAPE, "type": CellFormatType.DEFAULT, "width": 20}, @@ -251,6 +250,23 @@ class ExcelConfig(object): {"name": CALLS, "type": CellFormatType.DEFAULT, "width": 20}, {"name": DIFF_TOTAL_RATIO, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, {"name": DIFF_AVG_RATIO, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + ], + Constant.KERNEL_TYPE_TABLE: [ + {"name": ORDER, "type": CellFormatType.DEFAULT, "width": 10}, + {"name": KERNEL_TYPE, "type": CellFormatType.BOLD_STR, "width": 30}, + {"name": CORE_TYPE, "type": CellFormatType.DEFAULT, "width": 20}, + {"name": TOTAL_DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": AVG_DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": MAX_DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": MIN_DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": CALLS, "type": CellFormatType.DEFAULT, "width": 20}, + {"name": TOTAL_DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": AVG_DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": MAX_DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": MIN_DURATION, "type": 
CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": CALLS, "type": CellFormatType.DEFAULT, "width": 20}, + {"name": DIFF_TOTAL_RATIO, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": DIFF_AVG_RATIO, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, ] } @@ -261,13 +277,17 @@ class ExcelConfig(object): Constant.MODULE_TABLE: ["E1:H1", "I1:L1"], Constant.OVERALL_METRICS_TABLE: ["B1:D1", "E1:G1"], Constant.API_TABLE: ["C1:F1", "G1:J1"], - Constant.KERNEL_TABLE: ["D1:H1", "I1:M1"] + Constant.KERNEL_TABLE: ["D1:H1", "I1:M1"], + Constant.KERNEL_TYPE_TABLE: ["D1:H1", "I1:M1"] } # overall metrics index # computing time COMPUTING = "Computing Time" + MC2_COMMUNICATION_TIME = "\t\tCommunication" + MC2_COMPUTING_TIME = "\t\tComputing" + FA_FWD = "\tFlash Attention (Forward)" FA_FWD_CUBE = "\t\tFlash Attention (Forward) (Cube)" FA_FWD_VECTOR = "\t\tFlash Attention (Forward) (Vector)" diff --git a/profiler/compare_tools/compare_backend/utils/module_node.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/module_node.py similarity index 82% rename from profiler/compare_tools/compare_backend/utils/module_node.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/module_node.py index 6cb10a0747c7d6515b5fa336cc0a99126c46b718..98fa2ca1292874f5a4b3dd6903ae48552e52e1b6 100644 --- a/profiler/compare_tools/compare_backend/utils/module_node.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/module_node.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -13,35 +12,36 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" - import re from math import ceil -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode class ModuleNode: + __slots__ = ['_event', '_parent_node', '_child_nodes', '_module_level', '_kernel_self_list', '_kernel_total_list', + '_call_stack', '_root_torch_op_node', '_cur_torch_op_node'] ts = "ts" kernels = "kernels" + _call_stack_pool = {} def __init__(self, event: TraceEventBean, parent_node=None): self._event = event self._parent_node = parent_node self._child_nodes = [] - self._module_name = f"{parent_node.module_name}/{event.name}" if parent_node else event.name self._module_level = parent_node.module_level + 1 if parent_node else 1 self._kernel_self_list = [] self._kernel_total_list = [] - self._call_stack = f"{parent_node.call_stack};\n{event.name}" if parent_node and parent_node.call_stack \ + call_stack = f"{parent_node.call_stack};\n{event.name}" if parent_node and parent_node.call_stack \ else event.name + self._call_stack = self._call_stack_pool.setdefault(call_stack, call_stack) self._root_torch_op_node = TorchOpNode() self._cur_torch_op_node = self._root_torch_op_node @property def module_name(self): - return self._module_name + return f"{self._parent_node.module_name}/{self._event.name}" if self._parent_node else self._event.name @property def module_class(self): @@ -130,16 +130,16 @@ class ModuleNode: return None def reset_call_stack(self, call_stack): - self._call_stack = call_stack + self._call_stack = 
self._call_stack_pool.setdefault(call_stack, call_stack) def update_child_nodes(self, node): self._child_nodes.append(node) def update_kernel_list(self, ts, kernel_list: list): - self._update_kernel_self_list(ts, kernel_list) + self.update_kernel_self_list(ts, kernel_list) node = self while node.parent_node: - node._update_kernel_total_list(ts, kernel_list) + node.update_kernel_total_list(ts, kernel_list) node = node.parent_node def find_module_call(self, ts_time): @@ -181,8 +181,8 @@ class ModuleNode: top_node_list[cur_index].update_kernel_list(kernel_list) break - def _update_kernel_self_list(self, ts, kernel_list: list): + def update_kernel_self_list(self, ts, kernel_list: list): self._kernel_self_list.append({self.ts: ts, self.kernels: kernel_list}) - def _update_kernel_total_list(self, ts, kernel_list: list): + def update_kernel_total_list(self, ts, kernel_list: list): self._kernel_total_list.append({self.ts: ts, self.kernels: kernel_list}) diff --git a/profiler/compare_tools/compare_backend/utils/name_function.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/name_function.py similarity index 88% rename from profiler/compare_tools/compare_backend/utils/name_function.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/name_function.py index a36ee515cf2e9894e91e5ee17dfcf2b58176679e..f1e2e90d22bb6d86b73ee5c43315a7677741d1a8 100644 --- a/profiler/compare_tools/compare_backend/utils/name_function.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/name_function.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at # @@ -13,10 +12,8 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" - -from compare_backend.utils.module_node import ModuleNode -from compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.compare_tools.compare_backend.utils.module_node import ModuleNode +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode class NameFunction: diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/utils/singleton.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/singleton.py new file mode 100644 index 0000000000000000000000000000000000000000..1ce6dda94c4fad6ab3badd8fa95dc58b6bacb928 --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/singleton.py @@ -0,0 +1,25 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +class Singleton(object): + def __init__(self, cls): + self._cls = cls + self._instance = {} + + def __call__(self, args=None): + if self._cls not in self._instance: + self._instance[self._cls] = self._cls(args) + return self._instance[self._cls] diff --git a/profiler/compare_tools/compare_backend/utils/torch_op_node.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/torch_op_node.py similarity index 83% rename from profiler/compare_tools/compare_backend/utils/torch_op_node.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/torch_op_node.py index 693f55c4a66d1d0bdce01b9df7d8bcafd0ba5382..2b72ad3d990f1bb1a73aa071d359b32d269934b1 100644 --- a/profiler/compare_tools/compare_backend/utils/torch_op_node.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/torch_op_node.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,14 +12,14 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" - -from compare_backend.compare_bean.origin_data_bean.compare_event import MemoryEvent -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.compare_event import MemoryEvent +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.prof_common.constant import Constant class TorchOpNode: + __slots__ = ['_event', '_parent_node', '_child_nodes', '_kernel_list', '_kernel_num', '_memory_allocated_list'] + def __init__(self, event=TraceEventBean, parent_node=None): self._event = event self._parent_node = parent_node @@ -102,9 +101,9 @@ class TorchOpNode: self._kernel_list.extend(kernel_list) kernel_num = len(kernel_list) cur_node = self - while cur_node._parent_node: + while cur_node.parent: cur_node._kernel_num += kernel_num - cur_node = cur_node._parent_node + cur_node = cur_node.parent def update_kernel_list(self, kernel_list: list): if not kernel_list: diff --git a/profiler/compare_tools/compare_backend/utils/tree_builder.py b/profiler/msprof_analyze/compare_tools/compare_backend/utils/tree_builder.py similarity index 89% rename from profiler/compare_tools/compare_backend/utils/tree_builder.py rename to profiler/msprof_analyze/compare_tools/compare_backend/utils/tree_builder.py index 59bcc3066007d83a42d23d0735b8336e7ec4c74e..6872bed55f67df6e5736be9cdd0ade0d97f4f598 100644 --- a/profiler/compare_tools/compare_backend/utils/tree_builder.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/utils/tree_builder.py @@ -1,8 +1,7 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2024. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # @@ -13,13 +12,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -""" from queue import Queue -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.utils.module_node import ModuleNode -from compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.utils.module_node import ModuleNode +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode class TreeBuilder: diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/view/__init__.py b/profiler/msprof_analyze/compare_tools/compare_backend/view/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/msprof_analyze/compare_tools/compare_backend/view/base_view.py b/profiler/msprof_analyze/compare_tools/compare_backend/view/base_view.py new file mode 100644 index 0000000000000000000000000000000000000000..24f23dcffbf705431b97954d056e0069be75c3e7 --- /dev/null +++ b/profiler/msprof_analyze/compare_tools/compare_backend/view/base_view.py @@ -0,0 +1,24 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from abc import ABC, abstractmethod + + +class BaseView(ABC): + def __init__(self, data_dict: dict): + self._data_dict = data_dict + + @abstractmethod + def generate_view(self): + raise NotImplementedError("Function generate_view need to be implemented.") diff --git a/profiler/compare_tools/compare_backend/view/excel_view.py b/profiler/msprof_analyze/compare_tools/compare_backend/view/excel_view.py similarity index 38% rename from profiler/compare_tools/compare_backend/view/excel_view.py rename to profiler/msprof_analyze/compare_tools/compare_backend/view/excel_view.py index 73b82b1cd31d7e8207e34a040e484f6387fb8694..6a094fdf3df8828f0a89e3333517459257440d19 100644 --- a/profiler/compare_tools/compare_backend/view/excel_view.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/view/excel_view.py @@ -1,10 +1,24 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os from xlsxwriter import Workbook -from compare_backend.view.base_view import BaseView -from compare_backend.view.work_sheet_creator import WorkSheetCreator -from compare_backend.utils.constant import Constant +from msprof_analyze.compare_tools.compare_backend.view.base_view import BaseView +from msprof_analyze.compare_tools.compare_backend.view.work_sheet_creator import WorkSheetCreator +from msprof_analyze.prof_common.constant import Constant class ExcelView(BaseView): diff --git a/profiler/compare_tools/compare_backend/view/screen_view.py b/profiler/msprof_analyze/compare_tools/compare_backend/view/screen_view.py similarity index 42% rename from profiler/compare_tools/compare_backend/view/screen_view.py rename to profiler/msprof_analyze/compare_tools/compare_backend/view/screen_view.py index 150b36c6feda79cafacd7e4980624cd51e116912..5797d0bf8c4a64336d025043e41f5b7cec1e0054 100644 --- a/profiler/compare_tools/compare_backend/view/screen_view.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/view/screen_view.py @@ -1,6 +1,20 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
from prettytable import PrettyTable -from compare_backend.view.base_view import BaseView +from msprof_analyze.compare_tools.compare_backend.view.base_view import BaseView class ScreenView(BaseView): diff --git a/profiler/compare_tools/compare_backend/view/work_sheet_creator.py b/profiler/msprof_analyze/compare_tools/compare_backend/view/work_sheet_creator.py similarity index 85% rename from profiler/compare_tools/compare_backend/view/work_sheet_creator.py rename to profiler/msprof_analyze/compare_tools/compare_backend/view/work_sheet_creator.py index 58bad621b03f855933517ef9286047e23b5681ea..b73f6df97e81886e6034eae0dcdbe4c180f7995c 100644 --- a/profiler/compare_tools/compare_backend/view/work_sheet_creator.py +++ b/profiler/msprof_analyze/compare_tools/compare_backend/view/work_sheet_creator.py @@ -1,6 +1,20 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
from xlsxwriter import Workbook -from compare_backend.utils.excel_config import ExcelConfig, CellFormatType +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig, CellFormatType class WorkSheetCreator: diff --git a/profiler/msprof_analyze/compare_tools/compare_interface/__init__.py b/profiler/msprof_analyze/compare_tools/compare_interface/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/compare_tools/compare_interface/comparison_interface.py b/profiler/msprof_analyze/compare_tools/compare_interface/comparison_interface.py similarity index 31% rename from profiler/compare_tools/compare_interface/comparison_interface.py rename to profiler/msprof_analyze/compare_tools/compare_interface/comparison_interface.py index 936e5a7e8e71838cd5526718ea0113d0a6ff6475..70da42d20e250834477d811a94aa749fae380b33 100644 --- a/profiler/compare_tools/compare_interface/comparison_interface.py +++ b/profiler/msprof_analyze/compare_tools/compare_interface/comparison_interface.py @@ -1,31 +1,43 @@ -import sys -import os +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from msprof_analyze.compare_tools.compare_backend.comparison_generator import ComparisonGenerator
+from msprof_analyze.compare_tools.compare_backend.disaggregate.overall_perf_interface import OverallPerfInterface
+from msprof_analyze.compare_tools.compare_backend.utils.compare_args import Args
+from msprof_analyze.prof_common.constant import Constant
+from msprof_analyze.prof_common.analyze_dict import AnalyzeDict
+from msprof_analyze.prof_common.logger import get_logger
 
-sys.path.append(
-    os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))), "cluster_analyse"))
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
-from compare_backend.comparison_generator import ComparisonGenerator
-from compare_backend.disaggregate.overall_perf_interface import OverallPerfInterface
-from compare_backend.utils.compare_args import Args
-from compare_backend.utils.constant import Constant
+logger = get_logger()
 
 
 class ComparisonInterface:
     def __init__(self, base_profiling_path: str, comparison_profiling_path: str = "",
-                 base_step: str = "", comparison_step: str = ""):
+                 base_step: str = "", comparison_step: str = "", **kwargs):
         self.base_profiling_path = base_profiling_path
 
         if comparison_profiling_path:
             self._args = Args(base_profiling_path=base_profiling_path,
                               comparison_profiling_path=comparison_profiling_path,
                               base_step=base_step,
-                              comparison_step=comparison_step)
+                              comparison_step=comparison_step,
+                              use_kernel_type=kwargs.get("use_kernel_type", False))
 
     def compare(self, compare_type: str) -> dict:
-        return ComparisonGenerator(self._args).run_interface(compare_type)
+        return ComparisonGenerator(AnalyzeDict(vars(self._args))).run_interface(compare_type)
 
     def disaggregate_perf(self, compare_type: str) -> dict:
         if compare_type != Constant.OVERALL_COMPARE:
-            print('[ERROR] Invalid compare_type value: {compare_type} which not supported.')
+            logger.error(f'Invalid compare_type value: {compare_type}, which is not supported.')
return {} return OverallPerfInterface(self.base_profiling_path).run() diff --git a/profiler/msprof_analyze/compare_tools/img/OverallMetrics.png b/profiler/msprof_analyze/compare_tools/img/OverallMetrics.png new file mode 100644 index 0000000000000000000000000000000000000000..db092d45e88d0bea4a5a6ee1cb9b369fa2287ee3 Binary files /dev/null and b/profiler/msprof_analyze/compare_tools/img/OverallMetrics.png differ diff --git a/profiler/msprof_analyze/compare_tools/img/config.PNG b/profiler/msprof_analyze/compare_tools/img/config.PNG new file mode 100644 index 0000000000000000000000000000000000000000..01d83c80ef97b9dd4acf6009151c24bfd46b0785 Binary files /dev/null and b/profiler/msprof_analyze/compare_tools/img/config.PNG differ diff --git a/profiler/compare_tools/performance_compare.py b/profiler/msprof_analyze/compare_tools/performance_compare.py similarity index 32% rename from profiler/compare_tools/performance_compare.py rename to profiler/msprof_analyze/compare_tools/performance_compare.py index 419e2c2aff1728c167e760cebc2a6aa7973c34bf..7dc917ea40ec2e99628cd0dcbc2c452dabb7a95f 100644 --- a/profiler/compare_tools/performance_compare.py +++ b/profiler/msprof_analyze/compare_tools/performance_compare.py @@ -1,40 +1,75 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import argparse import ast import datetime import os.path import sys -sys.path.append( - os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "cluster_analyse")) +sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))) -from compare_backend.comparison_generator import ComparisonGenerator +from msprof_analyze.compare_tools.compare_backend.comparison_generator import ComparisonGenerator +from msprof_analyze.prof_common.analyze_dict import AnalyzeDict +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_common.path_manager import PathManager + +logger = get_logger() def main(): parser = argparse.ArgumentParser(description="Compare trace of GPU and NPU") - parser.add_argument("base_profiling_path", type=str, default='', help="Path of the profiling data") - parser.add_argument("comparison_profiling_path", type=str, default='', help="Path of the benchmark data") - parser.add_argument("--enable_profiling_compare", default=False, action='store_true', help="Enable overall performance comparison") - parser.add_argument("--enable_operator_compare", default=False, action='store_true', help="Enable operator performance comparison") - parser.add_argument("--enable_memory_compare", default=False, action='store_true', help="Enable operator memory comparison") - parser.add_argument("--enable_communication_compare", default=False, action='store_true', help="Enable communication performance comparison") - parser.add_argument("--enable_api_compare", default=False, action='store_true', help="Enable API performance comparison") - parser.add_argument("--enable_kernel_compare", default=False, action='store_true', help="Enable kernel performance comparison") + parser.add_argument("base_profiling_path", type=PathManager.expanduser_for_argumentparser, + default='', help="Path of the profiling data") + parser.add_argument("comparison_profiling_path", type=PathManager.expanduser_for_argumentparser, + 
default='', help="Path of the benchmark data") + parser.add_argument("--enable_profiling_compare", default=False, action='store_true', + help="Enable overall performance comparison") + parser.add_argument("--enable_operator_compare", default=False, action='store_true', + help="Enable operator performance comparison") + parser.add_argument("--enable_memory_compare", default=False, action='store_true', + help="Enable operator memory comparison") + parser.add_argument("--enable_communication_compare", default=False, action='store_true', + help="Enable communication performance comparison") + parser.add_argument("--enable_api_compare", default=False, action='store_true', + help="Enable API performance comparison") + parser.add_argument("--enable_kernel_compare", default=False, action='store_true', + help="Enable kernel performance comparison") parser.add_argument("--disable_details", default=False, action='store_true', help="Hide detailed comparison") - parser.add_argument('-o', "--output_path", type=str, default='', help="Path of comparison result") + parser.add_argument("--disable_module", default=False, action='store_true', help="Hide module comparison") + parser.add_argument('-o', "--output_path", type=PathManager.expanduser_for_argumentparser, + default='', help="Path of comparison result") parser.add_argument("--max_kernel_num", type=int, help="The number of kernels per torch op is limited.") parser.add_argument("--op_name_map", type=ast.literal_eval, default={}, help="The mapping of operator names equivalent to GPUs and NPUs in the form of dictionaries.") - parser.add_argument("--use_input_shape", default=False, action='store_true', help="Enable precise matching of operators") + parser.add_argument("--use_input_shape", default=False, action='store_true', + help="Enable precise matching of operators") parser.add_argument("--gpu_flow_cat", type=str, default='', help="Identifier of the GPU connection") parser.add_argument("--base_step", type=str, default='', 
help="Comparison step for performance data to be compared") - parser.add_argument("--comparison_step", type=str, default='', help="Comparison step for benchmark performance data") + parser.add_argument("--comparison_step", type=str, default='', + help="Comparison step for benchmark performance data") + parser.add_argument("--force", action='store_true', + help="Indicates whether to skip file size verification and owner verification") + parser.add_argument("--use_kernel_type", action='store_true', + help="Indicates whether kernel compare use op_statistic.csv") args = parser.parse_args() - ComparisonGenerator(args).run() + ComparisonGenerator(AnalyzeDict(vars(args))).run() + if __name__ == "__main__": - start_time = datetime.datetime.now() + start_time = datetime.datetime.utcnow() main() - end_time = datetime.datetime.now() - print(f'[INFO] The comparison task has been completed in a total time of {end_time - start_time}') + end_time = datetime.datetime.utcnow() + logger.info(f'The comparison task has been completed in a total time of {end_time - start_time}') diff --git a/profiler/config/config.ini b/profiler/msprof_analyze/config/config.ini similarity index 83% rename from profiler/config/config.ini rename to profiler/msprof_analyze/config/config.ini index decb27df53fcfb5ce427ebfe26bb807b900759d0..26b9d379a07ae34777209f6bce44ebc227a814e5 100644 --- a/profiler/config/config.ini +++ b/profiler/msprof_analyze/config/config.ini @@ -1,4 +1,4 @@ [URL] -msprof_analyze_url = https://gitee.com/ascend/mstt/tree/master/profiler +msprof_analyze_url = https://gitee.com/ascend/mstt/tree/master/profiler/msprof_analyze [EMAIL] ms_email = pmail_mindstudio@huawei.com \ No newline at end of file diff --git a/profiler/msprof_analyze/prof_common/__init__.py b/profiler/msprof_analyze/prof_common/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..c2764ec2a520567abc0c7d119b222f5fea7c3b72 --- /dev/null +++ b/profiler/msprof_analyze/prof_common/__init__.py @@ 
-0,0 +1,17 @@
+# Copyright (c) 2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import sys
+sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
diff --git a/profiler/msprof_analyze/prof_common/additional_args_manager.py b/profiler/msprof_analyze/prof_common/additional_args_manager.py
new file mode 100644
index 0000000000000000000000000000000000000000..9136b049bc087fd04e2fca9a55e93f0f59859fb0
--- /dev/null
+++ b/profiler/msprof_analyze/prof_common/additional_args_manager.py
@@ -0,0 +1,40 @@
+# Copyright (c) 2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Dict
+from msprof_analyze.advisor.utils.utils import singleton
+
+
+@singleton
+class AdditionalArgsManager:
+    def __init__(self):
+        self._args = None
+        self._language = "cn"
+        self._force = False
+
+    @property
+    def force(self):
+        return self._force
+
+    @property
+    def language(self):
+        return self._language
+
+    def init(self, args: Dict):
+        self._args = args
+        if self._args.get("force", None):
+            self._force = self._args.get("force", False)
+        if self._args.get("language", None):
+            self._language = self._args.get("language", "cn")
diff --git a/profiler/prof_common/analyze_dict.py b/profiler/msprof_analyze/prof_common/analyze_dict.py
similarity index 100%
rename from profiler/prof_common/analyze_dict.py
rename to profiler/msprof_analyze/prof_common/analyze_dict.py
diff --git a/profiler/msprof_analyze/prof_common/constant.py b/profiler/msprof_analyze/prof_common/constant.py
new file mode 100644
index 0000000000000000000000000000000000000000..c04e429321dba0175475a1f42bdecd9f76be45f5
--- /dev/null
+++ b/profiler/msprof_analyze/prof_common/constant.py
@@ -0,0 +1,436 @@
+# Copyright (c) 2024, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import stat + + +class Constant(object): + COLLECTION_PATH = "collection_path" + ANALYSIS_MODE = "analysis_mode" + MODE = "mode" + CONTEXT_SETTINGS = dict(help_option_names=['-H', '-h', '--help']) + + MAX_FILE_SIZE_5_GB = 1024 * 1024 * 1024 * 5 + + MODULE_EVENT = "module_event" + CPU_OP_EVENT = "op_event" + STEP_EVENT = "step_event" + TORCH_TO_NPU_FLOW = "torch_to_device" + KERNEL_EVENT = "kernel_event" + HCCL_EVENT = "hccl_event" + OVERLAP_ANALYSIS_EVENT = "overlap_event" + FWD_BWD_FLOW = "fwd_to_bwd" + NPU_ROOT_ID = "NPU" + BACKWARD_MODULE = "nn.Module: BACKWARD" + + FWD_OR_OPT = 0 + BACKWARD = 1 + INVALID_RETURN = -1 + + # dir name + FRAMEWORK_DIR = "FRAMEWORK" + CLUSTER_ANALYSIS_OUTPUT = "cluster_analysis_output" + SINGLE_OUTPUT = "ASCEND_PROFILER_OUTPUT" + COMM_JSON = "communication.json" + COMM_MATRIX_JSON = "communication_matrix.json" + STEP_TIME_CSV = "step_trace_time.csv" + KERNEL_DETAILS_CSV = "kernel_details.csv" + + # file authority + FILE_AUTHORITY = 0o640 + DIR_AUTHORITY = 0o750 + MAX_JSON_SIZE = 1024 * 1024 * 1024 * 10 + MAX_CSV_SIZE = 1024 * 1024 * 1024 * 5 + MAX_COMMON_SIZE = 1024 * 1024 * 1024 + MAX_TRACE_SIZE = 1024 * 1024 * 1024 * 5 + MAX_PATH_LENGTH = 4096 + MAX_READ_DB_FILE_BYTES = 1024 * 1024 * 1024 * 8 + + # communication + P2P = "p2p" + COLLECTIVE = "collective" + TOTAL = "total" + STEP_ID = "step_id" + RANK_ID = "rank_id" + GROUP_NAME = "group_name" + COMM_OP_TYPE = "comm_op_type" + COMM_OP_NAME = "comm_op_name" + COMM_OP_INFO = "comm_op_info" + TOTAL_OP_INFO = "Total Op Info" + COMMUNICATION_TIME_INFO = "Communication 
Time Info" + START_TIMESTAMP = "Start Timestamp(us)" + COMMUNICATION_BANDWIDTH_INFO = "Communication Bandwidth Info" + HCOM_SEND = "hcom_send" + HCOM_RECEIVE = "hcom_receive" + SYNCHRONIZATION_TIME_RATIO = "Synchronization Time Ratio" + SYNCHRONIZATION_TIME_MS = "Synchronization Time(ms)" + WAIT_TIME_RATIO = "Wait Time Ratio" + TRANSIT_TIME_MS = "Transit Time(ms)" + TRANSIT_SIZE_MB = "Transit Size(MB)" + SIZE_DISTRIBUTION = "Size Distribution" + WAIT_TIME_MS = "Wait Time(ms)" + OP_NAME = "Op Name" + BANDWIDTH_GB_S = "Bandwidth(GB/s)" + COMMUNICATION = "communication.json" + ELAPSE_TIME_MS = "Elapse Time(ms)" + IDLE_TIME_MS = "Idle Time(ms)" + LARGE_PACKET_RATIO = "Large Packet Ratio" + + # params + DATA_MAP = "data_map" + COLLECTIVE_GROUP = "collective_group" + COMMUNICATION_OPS = "communication_ops" + MATRIX_OPS = "matrix_ops" + CLUSTER_ANALYSIS_OUTPUT_PATH = "output_path" + COMMUNICATION_GROUP = "communication_group" + TRANSPORT_TYPE = "Transport Type" + COMM_DATA_DICT = "comm_data_dict" + DATA_TYPE = "data_type" + IS_MSPROF = "is_prof" + + # step time + RANK = "rank" + STAGE = "stage" + + # epsilon + EPS = 1e-15 + + # file suffix + JSON_SUFFIX = ".json" + CSV_SUFFIX = ".csv" + + # result files type + TEXT = "text" + DB = "db" + INVALID = "invalid" + + # db name + DB_COMMUNICATION_ANALYZER = "analysis.db" + DB_CLUSTER_COMMUNICATION_ANALYZER = "cluster_analysis.db" + + # db tables + TABLE_COMM_ANALYZER_BANDWIDTH = "CommAnalyzerBandwidth" + TABLE_COMM_ANALYZER_TIME = "CommAnalyzerTime" + TABLE_COMM_ANALYZER_MATRIX = "CommAnalyzerMatrix" + TABLE_STEP_TRACE = "StepTraceTime" + TABLE_HOST_INFO = "HostInfo" + TABLE_RANK_DEVICE_MAP = "RankDeviceMap" + TABLE_CLUSTER_BASE_INFO = "ClusterBaseInfo" + + # data config key + CONFIG = "config" + EXPER_CONFIG = "experimental_config" + EXPER_EXPORT_TYPE = "_export_type" + + # metadata key + DISTRIBUTED_ARGS = "distributed_args" + + # mode + ALL = "all" + COMMUNICATION_TIME = "communication_time" + COMMUNICATION_MATRIX = 
"communication_matrix" + + STEP = "step" + + DATA_SIMPLIFICATION = "data_simplification" + FORCE = "force" + + # compare tools + + GPU = "GPU" + NPU = "NPU" + NA = 'N/A' + LIMIT_KERNEL = 3 + MAX_FLOW_CAT_LEN = 20 + MAX_OP_NAME_LEN = 200 + MAX_FILE_SIZE = 1024 * 1024 * 1024 * 5 + BYTE_TO_KB = 1024 + YELLOW_COLOR = "FFFF00" + GREEN_COLOR = "00FF00" + RED_COLOR = "FF0000" + BLUE_COLOR = "00BFFF" + LIGHT_BLUE_COLOR = "87CEFA" + US_TO_MS = 1000 + KB_TO_MB = 1024 + INVALID_VALUE = -1 + MILLISECONDS_TO_SECONDS = 10 ** 3 + MICROSECONDS_TO_SECONDS = 10 ** 6 + + PROFILING_TYPE = "profiling type" + + # path + PROFILING_PATH = "profiling_path" + TRACE_PATH = "trace_path" + MEMORY_DATA_PATH = "memory_data_path" + ASCEND_OUTPUT_PATH = "ascend_output" + INFO_JSON_PATH = "info_path" + + # excel headers + BASE_PROFILING = 'Base Profiling: ' + COMPARISON_PROFILING = 'Comparison Profiling: ' + WAIT_TIME = "wait" + TRANSMIT_TIME = "transmit" + + # compare type + OPERATOR_COMPARE = "OperatorCompare" + MEMORY_COMPARE = "MemoryCompare" + API_COMPARE = "ApiCompare" + KERNEL_COMPARE = "KernelCompare" + KERNEL_TYPE_COMPARE = "KernelTypeCompare" + + # sheet name + OPERATOR_SHEET = "OperatorCompare" + MEMORY_SHEET = "MemoryCompare" + OPERATOR_TOP_SHEET = "OperatorCompareStatistic" + MEMORY_TOP_SHEET = "MemoryCompareStatistic" + COMMUNICATION_SHEET = "CommunicationCompare" + API_SHEET = "ApiCompare" + KERNEL_SHEET = "KernelCompare" + + # table name + OPERATOR_TABLE = "OperatorCompare" + MEMORY_TABLE = "MemoryCompare" + OPERATOR_TOP_TABLE = "OperatorCompareStatistic" + MEMORY_TOP_TABLE = "MemoryCompareStatistic" + COMMUNICATION_TABLE = "CommunicationCompare" + PERFORMANCE_TABLE = "Model Profiling Time Distribution" + MODULE_TABLE = "ModuleCompare" + MODULE_TOP_TABLE = "ModuleCompareStatistic" + OVERALL_METRICS_TABLE = "OverallMetrics" + API_TABLE = "ApiCompare" + KERNEL_TABLE = "KernelCompare" + KERNEL_TYPE_TABLE = "KernelTypeCompare" + + # memory + SIZE = "Size(KB)" + TS = "ts" + 
ALLOCATION_TIME = "Allocation Time(us)" + RELEASE_TIME = "Release Time(us)" + NAME = "Name" + + OP_KEY = "op_name" + DEVICE_DUR = "dur" + + BASE_DATA = "base_data" + COMPARISON_DATA = "comparison_data" + OVERALL_METRICS = "overall_metrics" + TORCH_OP = "torch_op" + KERNEL_DICT = "kernel_dict" + MEMORY_LIST = "memory_list" + COMMUNICATION_DICT = "comm_dict" + + # compare type + OVERALL_COMPARE = "overall" + + BWD_LIST = ["bwd", "backward", "back", "grad"] + + CPU_OP_FA_MASK = ( + "flash_attention", "fusion_attention", "flashattn", "xformers_flash", "efficient_attention", "flash2attn" + ) + CPU_OP_CONV = "aten::conv" + CPU_OP_MATMUL_MASK = ("aten::addmm", "aten::bmm", "aten::mm", "aten::matmul") + KERNEL_CUBE_MASK = ("gemm", "conv", "cutlass", "wgrad", "gemvx") + KERNEL_TRANS_MASK = ("cast", "transdata", "transpose") + + IS_BWD = "is_bwd" + OPS = "ops" + + VOID_STEP = -1 + + # advisor + + # timeline + DEQUEUE = "Dequeue" + DEQUEUE_SEP = "@" + ATEN = "aten" + NPU_LOWER = "npu" + ATEN_SEP = "::" + OPTIMIZER = "Optimizer" + OPTIMIZER_SEP = "#" + OPTIMIZER_STEP = "step" + ENQUEUE = "enqueue" + TORCH_TO_NPU = "torch_to_npu" + FREE = "free" + OP_COMPILE_NAME = "AscendCL@aclopCompileAndExecute" + OP_COMPILE_ID = "aclopCompileAndExecute" + SYNC_STREAM = "AscendCL@aclrtSynchronizeStream" + NODE_LAUNCH = "Node@launch" + MAX_OP_COMPILE_NUM = 20 + ACL_TO_NPU = "acl_to_npu" + TASK_TYPE = "Task Type" + CPU_OP = "cpu_op" + AI_CORE = "AI_CORE" + AI_CPU = "AI_CPU" + MIX_AIC = "MIX_AIC" + CALL_STACKS = "Call stack" + INPUT_DIMS = "Input Dims" + OP_SEP = "-" + ADVISOR_MAX_PROCESSES = 8 + ADVISOR_ANALYZE_PROCESSES = "ADVISOR_ANALYZE_PROCESSES" + TIMELINE_OP_STACKS_DATASET = "timeline_op_stacks_dataset" + TIMELINE_BACKWARD_NO_STACK = "Backward broadcast, without call stacks in profiling." + TIMELINE_ACL_TO_NPU_NO_STACK = "Incoming flow is 'acl_to_npu', without call stacks in profiling." 
+ TIMELINE_BACKWARD_NO_STACK_CODE = -1 + TIMELINE_ACL_TO_NPU_NO_STACK_CODE = -2 + TIMELINE_FUSION_OPS_NO_STACK_FLAG = "NO STACK" + NO_STACK_REASON_MAP = { + TIMELINE_BACKWARD_NO_STACK_CODE: "Backward broadcast, without call stacks in profiling.", + TIMELINE_ACL_TO_NPU_NO_STACK_CODE: "Incoming flow is 'acl_to_npu', without call stacks in profiling." + } + AFFINITY_TRAINING_API = "Affinity training api" + TIMELINE_EMPTY_STACKS_PROMPT = "These APIs have no code stack. If parameter 'with_stack=False' while profiling, " \ + "please refer to {timeline_profiling_doc_url} to set 'with_stack=True'. " \ + "Otherwise, ignore following affinity APIs due to backward broadcast lack of stack." + + CLUSTER_ANALYSIS = "Cluster analysis" + SLOW_RANK_TIME_RATIO_THRESHOLD = 0.05 + + CANN_VERSION = "cann_version" + TORCH_VERSION = "torch_version" + PROFILING_TYPE_UNDER_LINE = "profiling_type" + ANALYSIS_DIMENSIONS = "analysis_dimensions" + + PROFILER_METADATA = "profiler_metadata.json" + + TERMINAL_OUTPUT_HEADERS = ["No.", "Problem", "Description", "Suggestion"] + SKIP_ANALYZE_PROMPT = "Finish analysis, no optimization suggestions" + SKIP_QUERY_PROMPT = "Finish query operator stack, no operators" + + # operator output constant + OPERATOR_OUT_TOPK = 10 + OPERATOR_LIST_UNLIMIT = -1 + + DEFAULT_OPERATOR_TYPE = 'None_type' + DEFAULT_DURATION_ZERO = 0.0 + + ADVISOR_LOG_LEVEL = "ADVISOR_LOG_LEVEL" + DEFAULT_LOG_LEVEL = "INFO" + MSPROF_ANALYZE_LOG_LEVEL = "MSPROF_ANALYZE_LOG_LEVEL" + SUPPORTED_LOG_LEVEL = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"] + + RULE_BUCKET = "RULE-BUCKET" + CLOUD_RULE_REGION_CN_NORTH_9 = "cn-north-9" + CLOUD_RULE_REGION_CN_NORTH_7 = "cn-north-7" + CLOUD_RULE_REGION_CN_SOUTHWEST_2 = "cn-southwest-2" + CLOUD_RULE_REGION_LIST = [CLOUD_RULE_REGION_CN_NORTH_7, CLOUD_RULE_REGION_CN_NORTH_9, + CLOUD_RULE_REGION_CN_SOUTHWEST_2] + INNER_REGION_LIST = [CLOUD_RULE_REGION_CN_NORTH_7] + DEFAULT_CLOUD_RULE_REGION = CLOUD_RULE_REGION_CN_SOUTHWEST_2 + + HTTP_PREFIXES = 
"http://" + HTTPS_PREFIXES = "https://" + COMMON_YAML_DIR = "modelarts/solution/ma_advisor_rules/" + COMMON_ENDPOINT_SUFFIX = "obs.{}.myhuaweicloud.com" + INNER_ENDPOINT_SUFFIX = "obs.{}.ulanqab.huawei.com" + + AICPU_RULES_YAML_NAME = "aicpu_rules.yaml" + FUSION_PASS_YAML_NAME = "op_fusion_pass.yaml" + TIMELINE_FUSION_OPS_YAML_NAME = "timeline_fusion_ops.yaml" + CLOUD_YAML_NAME_LIST = [AICPU_RULES_YAML_NAME, FUSION_PASS_YAML_NAME, TIMELINE_FUSION_OPS_YAML_NAME] + + MAX_RETRIES = 3 + TIMEOUT = 3 + DEPTH_LIMIT = 20 + + ADVISOR_RULE_PATH = "ADVISOR_RULE_PATH" + CLOUD_RULE_PATH = "rules/cloud/" + DEFAULT_RULE_PATH = "./rules/" + + TIMELINE_FUSION_OPS_INVALID_UNIQUE_ID = -1 + + DEFAULT_TEMPLATE_HEADER = "Performance Optimization Suggestions" + + PT_PROF_SUFFIX = "ascend_pt" + ASCEND_PROFILER_OUTPUT = "ASCEND_PROFILER_OUTPUT" + CLUSTER_STEP_TIME_CSV = "cluster_step_trace_time.csv" + CLUSTER_COMM_JSON = "cluster_communication.json" + COMMUNICATION_JSON = "communication.json" + + BOTTLENECK = "bottleneck" + DATA = "data" + ADVISOR_ANALYSIS_OUTPUT_DIR = "advisor_analysis_result" + DEFAULT_PROCESSES = 8 + CLUSTER_ANALYSIS_FILE_PATTERN = [ + r'profiler_info_\d+\.json', "step_trace_time.csv", "communication.json", "communication_matrix.json" + ] + ANALYSIS_OUTPUT_PATH = "ANALYSIS_OUTPUT_PATH" + DEFAULT_RANK_FOR_PROFILING_ANALYSIS = 0 + PROFILER_INFO_FILE_PATTERN = r"profiler_info_(\d+)\.json" + DISABLE_STREAMINIG_READER = "DISABLE_STREAMINIG_READER" + FRAMEWORK_STACK_BLACK_LIST = ["torch", "torch_npu", "megatron", "deepspeed"] + DISABLE_STREAMING_READER = "DISABLE_STREAMING_READER" + MAX_NUM_PROCESSES = 4 + DEFAULT_STEP = "-1" + STEP_RANK_SEP = "_" + + MAX_READ_LINE_BYTES = 8196 * 1024 + MAX_READ_FILE_BYTES = 64 * 1024 * 1024 * 1024 + + # Unit Conversion + COMMUNICATION_B_TO_GB = 0.001 ** 3 + US_TO_S = 0.001 ** 2 + + WRITE_MODES = stat.S_IWUSR | stat.S_IRUSR | stat.S_IRGRP + WRITE_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_TRUNC + + DISABLE_PROFILING_COMPARISON = 
"DISABLE_PROFILING_COMPARISON" + FREE_DURATION_FOR_GC_ANALYSIS = "FREE_DURATION_FOR_GC_ANALYSIS" + DISABLE_AFFINITY_API = "DISABLE_AFFINITY_API" + + MINDSPORE_VERSION = "mindspore_version" + PYTORCH = "pytorch" + MINDSPORE = "mindspore" + + # node type + MODULE_TYPE = 0 + OPERATOR_TYPE = 1 + VIRTUAL_TYPE = 9 + + # json trace bar + NPU_BAR = "Ascend Hardware" + COMM_BAR = "Communication" + OVERLAP_BAR = "Overlap Analysis" + + # overlap_analysis event + COMPUTING_EVENT = "Computing" + FREE_EVENT = "Free" + UNCOVERED_COMMUNICATION_EVENT = "Communication(Not Overlapped)" + + MC2_TIME = "mc2" + MC2_COMPUTING = "mc2_p" + MC2_COMMUNICATION = "mc2_m" + MC2_NUMBER = "mc2_num" + + # recipe config + ANALYSIS = "analysis" + RECIPE_NAME = "recipe_name" + RECIPE_CLASS = "recipe_class" + PARALLEL_MODE = "parallel_mode" + MSPROF_ANALYZE_PATH = os.path.abspath(os.path.dirname(os.path.dirname(__file__))) + RECIPES_PATH = os.path.join(MSPROF_ANALYZE_PATH, 'cluster_analyse', 'recipes') + + CONCURRENT_MODE = "concurrent" + PROFILER_DB_PATH = "profiler_db_path" + ANALYSIS_DB_PATH = "analysis_db_path" + RANK_LIST = "rank_list" + EXPORT_TYPE = "export_type" + EXTRA_ARGS = "args" + STEP_RANGE = "step_range" + START_NS = "startNs" + END_NS = "endNs" + + # hccl_sum + UINT32_BITS = 32 + UINT32_MASK = 0xffffffff \ No newline at end of file diff --git a/profiler/msprof_analyze/prof_common/database_service.py b/profiler/msprof_analyze/prof_common/database_service.py new file mode 100644 index 0000000000000000000000000000000000000000..6b776d4d957a9491aeb5690cf456038c114c3590 --- /dev/null +++ b/profiler/msprof_analyze/prof_common/database_service.py @@ -0,0 +1,92 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import pandas as pd + +from msprof_analyze.prof_common.db_manager import DBManager +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_common.constant import Constant + +logger = get_logger() + + +class DatabaseService: + TABLE_TS_DICT = { + "TASK": "startNs", + "COMMUNICATION_OP": "startNs", + "CANN_API": "startNs", + "PYTORCH_API": "startNs", + "MSTX_EVENTS": "startNs", + "GC_RECORD": "startNs", + "ACC_PMU": "timestampNs", + "NIC": "timestampNs", + "RoCE": "timestampNs", + "LLC": "timestampNs", + "SAMPLE_PMU_TIMELINE": "timestampNs", + "NPU_MEM": "timestampNs", + "NPU_MODULE_MEM": "timestampNs", + "NPU_OP_MEM": "timestampNs", + "HBM": "timestampNs", + "DDR": "timestampNs", + "HCCS": "timestampNs", + "PCIE": "timestampNs", + "AICORE_FREQ": "timestampNs" + } + + def __init__(self, db_path, step_range): + self._db_path = db_path + self._step_range = step_range + self._table_info = {} + + def add_table_for_query(self, table_name: str, columns=None): + if not isinstance(table_name, str): + logger.error("Parameter table_name must be type of string.") + return + if columns is not None and not isinstance(columns, list): + logger.error("Parameter columns must be type of list.") + return + self._table_info[table_name] = columns + + def query_data(self): + result_data = {} + if not self._table_info: + return result_data + try: + conn, cursor = DBManager.create_connect_db(self._db_path) + except Exception as err: + logger.error(err) + return result_data + for table_name, columns in self._table_info.items(): + if not 
DBManager.judge_table_exists(cursor, table_name): + logger.warning(f"Table {table_name} does not exist in database {self._db_path}.") + continue + columns_str = "*" if not columns else ",".join(columns) + if table_name in self.TABLE_TS_DICT and self._step_range: + where_str = f"where {self.TABLE_TS_DICT.get(table_name)} >= {self._step_range.get(Constant.START_NS)}" \ + f" and {self.TABLE_TS_DICT.get(table_name)} <= {self._step_range.get(Constant.END_NS)}" + else: + where_str = "" + query_sql = f"select {columns_str} from {table_name} {where_str}" + try: + data = pd.read_sql(query_sql, conn) + result_data[table_name] = data + except Exception as err: + logger.error(err) + return result_data + try: + DBManager.destroy_db_connect(conn, cursor) + except Exception as err: + logger.error(err) + return result_data + return result_data diff --git a/profiler/cluster_analyse/common_func/db_manager.py b/profiler/msprof_analyze/prof_common/db_manager.py similarity index 74% rename from profiler/cluster_analyse/common_func/db_manager.py rename to profiler/msprof_analyze/prof_common/db_manager.py index 1aa7ed8740e4baa6fd5f04ec1674b20d584517c3..ac24ec8144f7a67c1796906d7e75ab25a7a7f71c 100644 --- a/profiler/cluster_analyse/common_func/db_manager.py +++ b/profiler/msprof_analyze/prof_common/db_manager.py @@ -1,4 +1,4 @@ -# Copyright (c) 2023, Huawei Technologies Co., Ltd. +# Copyright (c) 2024, Huawei Technologies Co., Ltd. # All rights reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -16,10 +16,15 @@ import os import sqlite3 -from common_func.constant import Constant -from common_func.empty_class import EmptyClass -from common_func.file_manager import check_db_path_valid -from common_func.tables_config import TablesConfig +from msprof_analyze.cluster_analyse.common_func.empty_class import EmptyClass +from msprof_analyze.cluster_analyse.common_func.tables_config import TablesConfig +from msprof_analyze.prof_common.sql_extention_func import SqlExtentionAggregateFunc +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.file_manager import check_db_path_valid +from msprof_analyze.prof_common.logger import get_logger + +logger = get_logger() + class DBManager: """ @@ -38,15 +43,21 @@ class DBManager: try: conn = sqlite3.connect(db_path) except sqlite3.Error as err: - print(f"[ERROR] {err}") + logger.error(err) return EmptyClass("empty conn"), EmptyClass("empty curs") try: + if mode == Constant.ANALYSIS: + try: + for func_name, params_count, class_name in SqlExtentionAggregateFunc: + conn.create_aggregate(func_name, params_count, class_name) + except sqlite3.Error as err: + logger.error(err) if isinstance(conn, sqlite3.Connection): curs = conn.cursor() os.chmod(db_path, Constant.FILE_AUTHORITY) return conn, curs except sqlite3.Error as err: - print(f"[ERROR] {err}") + logger.error(err) return EmptyClass("empty conn"), EmptyClass("empty curs") return EmptyClass("empty conn"), EmptyClass("empty curs") @@ -59,12 +70,12 @@ class DBManager: if isinstance(curs, sqlite3.Cursor): curs.close() except sqlite3.Error as err: - print(f"[ERROR] {err}") + logger.error(err) try: if isinstance(conn, sqlite3.Connection): conn.close() except sqlite3.Error as err: - print(f"[ERROR] {err}") + logger.error(err) @staticmethod def judge_table_exists(curs: any, table_name: str) -> any: @@ -77,7 +88,7 @@ class DBManager: curs.execute("select count(*) from sqlite_master 
where type='table' and name=?", (table_name,)) return curs.fetchone()[0] except sqlite3.Error as err: - print("[ERROR] {}".format(err)) + logger.error(err) return False @staticmethod @@ -97,6 +108,41 @@ class DBManager: return header_with_type_begin return "" + @staticmethod + def execute_sql(conn: any, sql: str, params: any = None) -> bool: + """ + execute sql + """ + try: + if isinstance(conn, sqlite3.Connection): + if params: + conn.cursor().execute(sql, params) + else: + conn.cursor().execute(sql) + conn.commit() + return True + except sqlite3.Error as err: + logger.error(err) + return False + logger.error("conn is invalid param") + return False + + @staticmethod + def executemany_sql(conn: any, sql: str, params: any) -> bool: + """ + execute many sql once + """ + try: + if isinstance(conn, sqlite3.Connection): + conn.cursor().executemany(sql, params) + conn.commit() + return True + except sqlite3.Error as err: + logger.error(err) + return False + logger.error("conn is invalid param") + return False + @classmethod def check_tables_in_db(cls, db_path: any, *tables: any) -> bool: if check_db_path_valid(db_path): @@ -132,52 +178,17 @@ class DBManager: conn, curs = cls.create_connect_db(db_path) if not (conn and curs): return 0 - sql = "SELECT COUNT(*) FROM pragma_table_info('{}')".format(table) + sql = f"PRAGMA table_info({table})" res = 0 try: curs.execute(sql) - res = curs.fetchone()[0] + res = len(curs.fetchall()) except sqlite3.Error as err: - print("[ERROR] {}".format(err)) + logger.error(err) finally: cls.destroy_db_connect(conn, curs) return res - @staticmethod - def execute_sql(conn: any, sql: str, params: any = None) -> bool: - """ - execute sql - """ - try: - if isinstance(conn, sqlite3.Connection): - if params: - conn.cursor().execute(sql, params) - else: - conn.cursor().execute(sql) - conn.commit() - return True - except sqlite3.Error as err: - print(f"[ERROR] {err}") - return False - print("[ERROR] conn is invalid param") - return False - - 
@staticmethod - def executemany_sql(conn: any, sql: str, params: any) -> bool: - """ - execute many sql once - """ - try: - if isinstance(conn, sqlite3.Connection): - conn.cursor().executemany(sql, params) - conn.commit() - return True - except sqlite3.Error as err: - print(f"[ERROR] {err}") - return False - print("[ERROR] conn is invalid param") - return False - @classmethod def fetch_all_data(cls: any, curs: any, sql: str, param: tuple = None, is_dict: bool = True) -> list: """ @@ -192,7 +203,7 @@ class DBManager: else: res = curs.execute(sql) except sqlite3.Error as err: - print(f"[ERROR] {err}") + logger.error(err) curs.row_factory = None return [] try: @@ -204,16 +215,40 @@ class DBManager: else: data += res if len(data) > cls.MAX_ROW_COUNT: - print("[WARRING] The records count in the table exceeds the limit!") + logger.warning("The records count in the table exceeds the limit!") if len(res) < cls.FETCH_SIZE: break return data except sqlite3.Error as err: - print(f"[ERROR] {err}") + logger.error(err) return [] finally: curs.row_factory = None + @classmethod + def insert_data_into_table(cls, conn: sqlite3.Connection, table_name: str, data: list) -> None: + """ + insert data into certain table + """ + index = 0 + if not data: + return + sql = "insert into {table_name} values ({value_form})".format( + table_name=table_name, value_form="?, " * (len(data[0]) - 1) + "?") + while index < len(data): + if not cls.executemany_sql(conn, sql, data[index:index + cls.INSERT_SIZE]): + raise RuntimeError("Failed to insert data into profiler db file.") + index += cls.INSERT_SIZE + + @classmethod + def insert_data_into_db(cls, db_path: str, table_name: str, data: list): + conn, curs = cls.create_connect_db(db_path) + if not (conn and curs): + logger.warning(f"Failed to connect to db file: {db_path}") + return + cls.insert_data_into_table(conn, table_name, data) + cls.destroy_db_connect(conn, curs) + class CustomizedDictFactory: @staticmethod diff --git 
a/profiler/cluster_analyse/common_func/file_manager.py b/profiler/msprof_analyze/prof_common/file_manager.py similarity index 63% rename from profiler/cluster_analyse/common_func/file_manager.py rename to profiler/msprof_analyze/prof_common/file_manager.py index 121c4e2d31d9905670445d5aef1de501c099e201..7329d1d9f3cd11588bf63300e581260205b400cb 100644 --- a/profiler/cluster_analyse/common_func/file_manager.py +++ b/profiler/msprof_analyze/prof_common/file_manager.py @@ -18,66 +18,101 @@ import csv import json import yaml +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager -from common_func.constant import Constant -from common_func.path_manager import PathManager +logger = get_logger() class FileManager: - DATA_FILE_AUTHORITY = 0o640 - DATA_DIR_AUTHORITY = 0o750 - @classmethod - def read_csv_file(cls, file_path: str, class_bean: any) -> list: + def read_json_file(cls, file_path: str) -> dict: PathManager.check_path_readable(file_path) base_name = os.path.basename(file_path) file_size = os.path.getsize(file_path) + result_data = {} + if file_size <= 0: + return result_data + if not AdditionalArgsManager().force and file_size > Constant.MAX_FILE_SIZE: + check_msg = input( + f"The file({file_path}) size exceeds the preset max value. Continue reading the file? 
[y/n]") + if check_msg.lower() != "y": + logger.warning("The user chose not to read the file: %s", file_path) + return result_data + try: + with open(file_path, "r") as json_file: + result_data = json.loads(json_file.read()) + except Exception as e: + raise RuntimeError(f"Failed to read the file: {base_name}") from e + return result_data + + @classmethod + def read_csv_file(cls, file_path: str, class_bean: any = None) -> list: + if not os.path.isfile(file_path): + raise FileNotFoundError("File does not exist.") + PathManager.check_path_readable(file_path) + file_size = os.path.getsize(file_path) if file_size <= 0: return [] - if file_size > Constant.MAX_CSV_SIZE: - raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.") + if not AdditionalArgsManager().force and file_size > Constant.MAX_FILE_SIZE: + check_msg = input( + f"The file({file_path}) size exceeds the preset max value. Continue reading the file? [y/n]") + if check_msg.lower() != "y": + logger.warning(f"The user chose not to read the file: {file_path}") + return [] result_data = [] try: with open(file_path, newline="") as csv_file: reader = csv.DictReader(csv_file) for row in reader: - result_data.append(class_bean(row)) + row_data = class_bean(row) if class_bean else row + result_data.append(row_data) except Exception as e: - raise RuntimeError(f"Failed to read the file: {base_name}") from e + msg = f"Failed to read the file: {file_path}" + raise RuntimeError(msg) from e return result_data @classmethod - def read_json_file(cls, file_path: str) -> dict: + def check_json_type(cls, file_path: str) -> str: + json_data = cls.read_json_file(file_path) + if isinstance(json_data, dict): + return Constant.GPU + return Constant.NPU + + @classmethod + def read_yaml_file(cls, file_path: str) -> dict: PathManager.check_path_readable(file_path) base_name = os.path.basename(file_path) file_size = os.path.getsize(file_path) if file_size <= 0: return {} - if file_size > Constant.MAX_JSON_SIZE: + if 
not AdditionalArgsManager().force and file_size > Constant.MAX_JSON_SIZE: raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.") + try: - with open(file_path, "r") as json_file: - result_data = json.loads(json_file.read()) + with open(file_path, "r", encoding="utf-8") as yaml_file: + result_data = yaml.safe_load(yaml_file) except Exception as e: - raise RuntimeError(f"Failed to read the file: {base_name}") from e + raise RuntimeError(f"Failed to read the file: {base_name}, reason is {str(e)}") from e return result_data @classmethod - def read_yaml_file(cls, file_path: str) -> dict: + def read_common_file(cls, file_path: str) -> str: PathManager.check_path_readable(file_path) base_name = os.path.basename(file_path) file_size = os.path.getsize(file_path) if file_size <= 0: - return {} - if file_size > Constant.MAX_JSON_SIZE: + raise RuntimeError(f"The file({base_name}) size is less than or equal to 0.") + if not AdditionalArgsManager().force and file_size > Constant.MAX_COMMON_SIZE: raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.") - try: - with open(file_path, "r", encoding="utf-8") as yaml_file: - result_data = yaml.safe_load(yaml_file) + with open(file_path, 'r') as f: + content = f.read() except Exception as e: raise RuntimeError(f"Failed to read the file: {base_name}, reason is {str(e)}") from e - return result_data + return content @classmethod def create_csv_file(cls, profiler_path: str, data: list, file_name: str, headers: list = None) -> None: @@ -90,7 +125,7 @@ class FileManager: PathManager.check_path_writeable(output_path) try: with os.fdopen( - os.open(output_file, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY), + os.open(output_file, os.O_WRONLY | os.O_CREAT, Constant.FILE_AUTHORITY), 'w', newline="" ) as file: writer = csv.writer(file) @@ -114,16 +149,20 @@ class FileManager: base_name = os.path.basename(output_file) try: with os.fdopen( - os.open(output_file, os.O_WRONLY | os.O_CREAT, 
cls.DATA_FILE_AUTHORITY), 'w' + os.open(output_file, os.O_WRONLY | os.O_CREAT, Constant.FILE_AUTHORITY), 'w' ) as file: file.write(json.dumps(data)) except Exception as e: raise RuntimeError(f"Can't create the file: {base_name}") from e @classmethod - def create_output_dir(cls, collection_path: str) -> None: + def create_output_dir(cls, collection_path: str, is_overwrite: bool = False) -> None: output_path = os.path.join( collection_path, Constant.CLUSTER_ANALYSIS_OUTPUT) + if is_overwrite: + if not os.path.exists(output_path): + PathManager.make_dir_safety(output_path) + return PathManager.remove_path_safety(output_path) PathManager.make_dir_safety(output_path) @@ -136,15 +175,16 @@ class FileManager: else: limit_size = Constant.MAX_JSON_SIZE file_size = os.path.getsize(file_path) - if file_size > limit_size: + if not AdditionalArgsManager().force and file_size > limit_size: raise RuntimeError(f"The file({base_name}) size exceeds the preset max value.") def check_db_path_valid(path: str, is_create: bool = False, max_size: int = Constant.MAX_READ_DB_FILE_BYTES) -> bool: if os.path.islink(path): - print(f'[ERROR] The db file path: {path} is link. Please check the path') + logger.error('The db file path: %s is link. Please check the path', path) return False if not is_create and os.path.exists(path) and os.path.getsize(path) > max_size: - print(f'[ERROR] The db file: {path} is too large to read. Please check the file') - return False + if not AdditionalArgsManager().force: + logger.error('The db file: %s is too large to read. Please check the file', path) + return False return True diff --git a/profiler/msprof_analyze/prof_common/logger.py b/profiler/msprof_analyze/prof_common/logger.py new file mode 100644 index 0000000000000000000000000000000000000000..d409727b4e26edcdcf6f996b1d556c80b3aa697e --- /dev/null +++ b/profiler/msprof_analyze/prof_common/logger.py @@ -0,0 +1,41 @@ +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. 
+# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +import os + +from msprof_analyze.prof_common.constant import Constant + + +def get_log_level(): + log_level = os.getenv(Constant.MSPROF_ANALYZE_LOG_LEVEL, Constant.DEFAULT_LOG_LEVEL).upper() + if not hasattr(logging, log_level): + raise AttributeError(f"module 'logging' has no attribute '{log_level}', " + f"supported log level: {', '.join(Constant.SUPPORTED_LOG_LEVEL)}") + return log_level + + +def get_logger() -> logging.Logger: + logger_name = "msprof-analyze" + if logger_name in logging.Logger.manager.loggerDict: + return logging.getLogger(logger_name) + + logger = logging.getLogger(logger_name) + logger.propagate = False + logger.setLevel(get_log_level()) + + handler = logging.StreamHandler() + formatter = logging.Formatter(fmt="[%(asctime)s][%(levelname)s] %(message)s", + datefmt='%Y-%m-%d %H:%M:%S') + handler.setFormatter(formatter) + logger.addHandler(handler) + return logger diff --git a/profiler/prof_common/path_manager.py b/profiler/msprof_analyze/prof_common/path_manager.py similarity index 82% rename from profiler/prof_common/path_manager.py rename to profiler/msprof_analyze/prof_common/path_manager.py index 291f4874016b19abd53c61f43c5f8abd7a5caa96..ed0a0554ebfbd1d8386539d7936c245747a3dd76 100644 --- a/profiler/prof_common/path_manager.py +++ b/profiler/msprof_analyze/prof_common/path_manager.py @@ -17,7 +17,8 @@ import re import shutil import platform -from .constant import Constant +from 
msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.additional_args_manager import AdditionalArgsManager class PathManager: @@ -87,8 +88,15 @@ msg = f"Invalid path which has illagal characters \"{invalid_obj}\"." raise RuntimeError(msg) + path_split_list = path.split("/") + for path in path_split_list: + path_list = path.split("\\") + for name in path_list: + if len(name) > cls.MAX_FILE_NAME_LENGTH: + raise RuntimeError("Length of input path exceeds the limit.") + @classmethod - def check_path_owner_consistent(cls, path: str): + def check_path_owner_consistent(cls, path_list: list): """ Function Description: check whether the path belong to process owner @@ -97,16 +105,16 @@ Exception Description: when invalid path, prompt the user """ - base_name = os.path.basename(path) - if not os.path.exists(path): - msg = f"Invalid path: {base_name}" - raise RuntimeError(msg) - if platform.system().lower() == cls.WINDOWS: + if platform.system().lower() == cls.WINDOWS or AdditionalArgsManager().force: return - if os.stat(path).st_uid != os.getuid(): - check_msg = input("The path does not belong to you, do you want to continue? [y/n]") - if check_msg.lower() != "y": - raise RuntimeError("The user choose not to continue.") + for path in path_list: + if not os.path.exists(path): + continue + if os.stat(path).st_uid != os.getuid(): + check_msg = input("The path does not belong to you, do you want to continue? [y/n]") + if check_msg.lower() != "y": + raise RuntimeError("The user chose not to continue.") + return @classmethod def check_path_writeable(cls, path): @@ -118,7 +126,6 @@ Exception Description: when invalid data throw exception """ - cls.check_path_owner_consistent(path) if os.path.islink(path): msg = f"Invalid path which is a soft link." 
raise RuntimeError(msg) @@ -137,7 +144,9 @@ class PathManager: Exception Description: when invalid data throw exception """ - cls.check_path_owner_consistent(path) + if not os.path.exists(path): + msg = f"The path does not exist: {path}" + raise FileNotFoundError(msg) if os.path.islink(path): msg = f"Invalid path which is a soft link." raise RuntimeError(msg) @@ -148,6 +157,8 @@ class PathManager: @classmethod def remove_path_safety(cls, path: str): + if not os.path.exists(path): + return base_name = os.path.basename(path) msg = f"Failed to remove path: {base_name}" cls.check_path_writeable(path) @@ -196,9 +207,20 @@ class PathManager: def check_file_size(cls, file_path: str): if not os.path.exists(file_path): raise FileNotFoundError(f"The file {file_path} does not exists.") + if AdditionalArgsManager().force: + return file_size = os.path.getsize(file_path) if file_size > Constant.MAX_FILE_SIZE_5_GB: check_msg = input( f"The file({file_path}) size exceeds the preset max value. Continue reading the file? [y/n]") if check_msg.lower() != "y": raise RuntimeError(f"[WARNING] The user choose not to read the file: {file_path}") + + @classmethod + def expanduser_for_cli(cls, ctx, parm, str_name: str): + return cls.expanduser_for_argumentparser(str_name) + + @classmethod + def expanduser_for_argumentparser(cls, str_name: str): + # None 对应 参数未赋值的场景 + return str_name if str_name is None else os.path.expanduser(str_name.lstrip('=')) diff --git a/profiler/msprof_analyze/prof_common/sql_extention_func.py b/profiler/msprof_analyze/prof_common/sql_extention_func.py new file mode 100644 index 0000000000000000000000000000000000000000..987a0d4365307704d6abf32575a48cc15c0fa33d --- /dev/null +++ b/profiler/msprof_analyze/prof_common/sql_extention_func.py @@ -0,0 +1,73 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np + + +class Median: + + def __init__(self) -> None: + self.data = [] + + def step(self, value) -> None: + self.data.append(value) + + def finalize(self): + return np.median(self.data) + + +class LowerQuartile: + + def __init__(self) -> None: + self.data = [] + + def step(self, value) -> None: + self.data.append(value) + + def finalize(self): + return np.quantile(self.data, 0.25) + + +class UpperQuartile: + + def __init__(self) -> None: + self.data = [] + + def step(self, value) -> None: + self.data.append(value) + + def finalize(self): + return np.quantile(self.data, 0.75) + + +class StandardDeviation: + + def __init__(self) -> None: + self.data = [] + + def step(self, value) -> None: + self.data.append(value) + + def finalize(self): + return np.std(self.data) + + +# func_name, params_count, class +SqlExtentionAggregateFunc = [ + ('median', 1, Median), + ('lower_quartile', 1, LowerQuartile), + ('upper_quartile', 1, UpperQuartile), + ('stdev', 1, StandardDeviation) +] diff --git a/profiler/prof_common/utils.py b/profiler/msprof_analyze/prof_common/utils.py similarity index 64% rename from profiler/prof_common/utils.py rename to profiler/msprof_analyze/prof_common/utils.py index 48cac267691d40e9726533a145a0a437439e1894..5c0832566330bc3ba219e2de654d4b3d7af46e52 100644 --- a/profiler/prof_common/utils.py +++ b/profiler/msprof_analyze/prof_common/utils.py @@ -1,12 +1,28 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
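The aggregate classes in `sql_extention_func.py` above each expose `step`/`finalize`, which is exactly the protocol `sqlite3.Connection.create_aggregate` expects, and the `SqlExtentionAggregateFunc` list bundles `(func_name, params_count, class)` for registration. The consuming code is not part of this hunk, so the registration loop below is a sketch of the presumed usage, with a trimmed-down `Median` inlined to keep it self-contained:

```python
import sqlite3

import numpy as np


class Median:
    """Same shape as the diff's Median aggregate: collect values, then reduce."""

    def __init__(self) -> None:
        self.data = []

    def step(self, value) -> None:
        self.data.append(value)

    def finalize(self):
        # cast so sqlite3 always receives a plain Python float
        return float(np.median(self.data))


# (func_name, params_count, class), mirroring SqlExtentionAggregateFunc
aggregate_funcs = [("median", 1, Median)]

conn = sqlite3.connect(":memory:")
for func_name, params_count, agg_class in aggregate_funcs:
    conn.create_aggregate(func_name, params_count, agg_class)

conn.execute("CREATE TABLE t(v REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (10.0,)])
row = conn.execute("SELECT median(v) FROM t").fetchone()
print(row[0])  # 2.0
```

Once registered this way, `median(...)`, `lower_quartile(...)`, `upper_quartile(...)` and `stdev(...)` can be used directly inside SQL, which is what the export queries later in this patch rely on.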
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import configparser -import logging +import os from email.utils import parseaddr from typing import Dict, List from urllib.parse import urlparse -from .path_manager import PathManager +from msprof_analyze.prof_common.logger import get_logger +from msprof_analyze.prof_common.path_manager import PathManager -logger = logging.getLogger() +logger = get_logger() class SafeConfigReader: @@ -20,6 +36,9 @@ class SafeConfigReader: self.read_config(config_file) def read_config(self, path): + if not os.path.exists(path): + msg = f"The config file {path} does not exists." 
+ raise FileNotFoundError(msg) PathManager.check_input_file_path(path) PathManager.check_path_readable(path) PathManager.check_file_size(path) @@ -64,3 +83,18 @@ def convert_to_float(num): except (ValueError, FloatingPointError): logger.error(f"Can not convert %s to float", num) return 0 + + +def convert_to_int(num): + try: + return int(num) + except (ValueError, NameError): + logger.error(f"Can not convert %s to int", num) + return 0 + + +def compute_ratio(dividend: float, divisor: float): + if abs(divisor) < 1e-15: + return 0 + else: + return round(dividend / divisor, 4) diff --git a/profiler/msprof_analyze/prof_exports/__init__.py b/profiler/msprof_analyze/prof_exports/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..7101187a2c2619f3b1c20dded14b433950b4c662 --- /dev/null +++ b/profiler/msprof_analyze/prof_exports/__init__.py @@ -0,0 +1,14 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/profiler/msprof_analyze/prof_exports/base_stats_export.py b/profiler/msprof_analyze/prof_exports/base_stats_export.py new file mode 100644 index 0000000000000000000000000000000000000000..6e0ff5e211e2c9d6f2ff73cef5a64bb43eb33936 --- /dev/null +++ b/profiler/msprof_analyze/prof_exports/base_stats_export.py @@ -0,0 +1,51 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
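The helpers added to `utils.py` share one convention: conversion failures are logged and collapsed to `0` instead of raising, and `compute_ratio` guards against near-zero divisors before rounding to four decimal places. A standalone sketch of that behavior (note: the sketch catches `TypeError`, since `int(None)` raises `TypeError` while the `NameError` listed in the diff cannot actually be raised by `int()` itself):

```python
def convert_to_int(num):
    # failed conversions fall back to 0 rather than propagating
    try:
        return int(num)
    except (ValueError, TypeError):
        return 0


def compute_ratio(dividend: float, divisor: float):
    # near-zero divisor short-circuits to 0 instead of raising ZeroDivisionError
    if abs(divisor) < 1e-15:
        return 0
    return round(dividend / divisor, 4)


print(compute_ratio(1, 3))    # 0.3333
print(compute_ratio(5, 0.0))  # 0
print(convert_to_int("12"))   # 12
print(convert_to_int("abc"))  # 0
```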
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import pandas as pd + +from msprof_analyze.prof_common.db_manager import DBManager +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.prof_common.logger import get_logger + +logger = get_logger() + + +class BaseStatsExport: + + def __init__(self, db_path, analysis_class, step_range): + self._db_path = db_path + self._analysis_class = analysis_class + self._step_range = step_range + self._query = None + + def get_query(self): + return self._query + + def read_export_db(self): + try: + if not self._db_path: + logger.error("db path is None.") + return None + query = self.get_query() + if query is None: + logger.error("query is None.") + return None + conn, cursor = DBManager.create_connect_db(self._db_path, Constant.ANALYSIS) + data = pd.read_sql(query, conn) + DBManager.destroy_db_connect(conn, cursor) + return data + except Exception as e: + logger.error(f"File {self._db_path} read failed error: {e}") + return None diff --git a/profiler/msprof_analyze/prof_exports/cann_api_sum_export.py b/profiler/msprof_analyze/prof_exports/cann_api_sum_export.py new file mode 100644 index 0000000000000000000000000000000000000000..0d3da94a001609cdbaed7d3f4646dc908d2b8c23 --- /dev/null +++ b/profiler/msprof_analyze/prof_exports/cann_api_sum_export.py @@ -0,0 +1,75 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. 
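Every exporter added under `prof_exports/` follows the same shape: the subclass only sets `self._query`, and `BaseStatsExport.read_export_db` runs that query through `pandas.read_sql`. The sketch below shows the pattern end to end against a throwaway database; it substitutes a plain `sqlite3.connect` for `DBManager.create_connect_db` (an assumption, since `DBManager` is outside this hunk), and `ApiCountExport` is a hypothetical subclass for illustration only:

```python
import os
import sqlite3

import pandas as pd


class BaseStatsExport:
    def __init__(self, db_path, analysis_class, step_range):
        self._db_path = db_path
        self._analysis_class = analysis_class
        self._step_range = step_range
        self._query = None

    def read_export_db(self):
        # stand-in for DBManager.create_connect_db / destroy_db_connect
        conn = sqlite3.connect(self._db_path)
        try:
            return pd.read_sql(self._query, conn)
        finally:
            conn.close()


class ApiCountExport(BaseStatsExport):  # hypothetical exporter
    def __init__(self, db_path, recipe_name, step_range):
        super().__init__(db_path, recipe_name, step_range)
        self._query = "SELECT name, COUNT(*) AS num FROM CANN_API GROUP BY name"


# build a throwaway profiling db, run the export, clean up
db_path = "demo_export.db"
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE CANN_API(name TEXT, startNs INTEGER, endNs INTEGER)")
conn.executemany("INSERT INTO CANN_API VALUES (?, ?, ?)",
                 [("aclopA", 0, 10), ("aclopA", 10, 30), ("aclopB", 30, 40)])
conn.commit()
conn.close()

df = ApiCountExport(db_path, "api_count", None).read_export_db()
os.remove(db_path)
print(dict(zip(df["name"], df["num"])))
```

The real base class additionally wraps this in error logging and returns `None` on failure, so callers must check the result before using the DataFrame.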
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from msprof_analyze.prof_exports.base_stats_export import BaseStatsExport +from msprof_analyze.prof_common.constant import Constant + +QUERY = """ +WITH + summary as ( + SELECT + name, + sum(endNs - startNs) AS duration, + count (*) AS num, + avg(endNs - startNs) AS avg_duration, + min(endNs - startNs) AS min_duration, + median(endNs - startNs) AS med_duration, + max(endNs - startNs) AS max_duration, + stdev(endNs - startNs) AS stdev_duration, + lower_quartile(endNs - startNs) AS lower_quartile_duration, + upper_quartile(endNs - startNs) AS upper_quartile_duration + FROM + CANN_API + {} + GROUP BY name + ), + totals AS ( + SELECT sum(duration) AS total + FROM summary + ) +SELECT + ids.value AS "name", + round(summary.duration * 100.0 / (SELECT total FROM totals), 2) AS "durationRatio", + summary.duration AS "totalTimeNs", + summary.num AS "totalCount", + round(summary.avg_duration, 1) AS "averageNs", + round(summary.min_duration, 1) AS "minNs", + round(summary.lower_quartile_duration, 1) AS "Q1Ns", + round(summary.med_duration, 1) AS "medNs", + round(summary.upper_quartile_duration, 1) AS "Q3Ns", + round(summary.max_duration, 1) AS "maxNs", + round(summary.stdev_duration, 1) AS "stdev" +FROM + summary +LEFT JOIN + STRING_IDS AS ids + ON ids.id == summary.name +ORDER BY 2 DESC; + """ + + +class CannApiSumExport(BaseStatsExport): + + def __init__(self, db_path, recipe_name, step_range): + super().__init__(db_path, 
recipe_name, step_range) + self._query = self.get_query_statement() + + def get_query_statement(self): + if self._step_range: + filter_statement = f"WHERE CANN_API.startNs >= {self._step_range.get(Constant.START_NS)} " \ + f"and CANN_API.startNs <= {self._step_range.get(Constant.END_NS)}" + else: + filter_statement = "" + return QUERY.format(filter_statement) diff --git a/profiler/msprof_analyze/prof_exports/compute_op_sum_export.py b/profiler/msprof_analyze/prof_exports/compute_op_sum_export.py new file mode 100644 index 0000000000000000000000000000000000000000..f337925dc36ff8e26c782ab1ea1c00618ebf271c --- /dev/null +++ b/profiler/msprof_analyze/prof_exports/compute_op_sum_export.py @@ -0,0 +1,95 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
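The `get_query_statement` methods in these exporters all build their step filter the same way: when a `step_range` dict is present, a `WHERE` clause on `startNs` is spliced into the query template via `str.format`; otherwise the placeholder collapses to an empty string. A minimal standalone sketch of that mechanism (string keys stand in for `Constant.START_NS` / `Constant.END_NS`):

```python
START_NS, END_NS = "startNs", "endNs"  # stand-ins for the Constant attributes

QUERY = "SELECT name FROM CANN_API {} GROUP BY name"


def build_query(step_range):
    if step_range:
        filter_statement = (f"WHERE CANN_API.startNs >= {step_range.get(START_NS)} "
                            f"and CANN_API.startNs <= {step_range.get(END_NS)}")
    else:
        filter_statement = ""
    return QUERY.format(filter_statement)


print(build_query(None))
print(build_query({"startNs": 100, "endNs": 200}))
```

Filtering on `startNs` alone means an API call that starts inside the step window but ends after it is still counted in full, which matches how all the exporters in this patch scope their statistics.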
+ +from msprof_analyze.prof_exports.base_stats_export import BaseStatsExport +from msprof_analyze.prof_common.constant import Constant + +QUERY = """ +SELECT + NAME_IDS.value AS "OpName", + OPTYPE_IDS.value AS "OpType", + TASKTYPE_IDS.value AS "TaskType", + INPUTSHAPES_IDS.value AS "InputShapes", + round(TASK.endNs - TASK.startNs) AS "Duration" +FROM + COMPUTE_TASK_INFO +LEFT JOIN TASK + ON TASK.globalTaskId == COMPUTE_TASK_INFO.globalTaskId +LEFT JOIN + STRING_IDS AS NAME_IDS + ON NAME_IDS.id == COMPUTE_TASK_INFO.name +LEFT JOIN + STRING_IDS AS OPTYPE_IDS + ON OPTYPE_IDS.id == COMPUTE_TASK_INFO.opType +LEFT JOIN + STRING_IDS AS TASKTYPE_IDS + ON TASKTYPE_IDS.id == COMPUTE_TASK_INFO.taskType +LEFT JOIN + STRING_IDS AS INPUTSHAPES_IDS + ON INPUTSHAPES_IDS.id == COMPUTE_TASK_INFO.inputShapes +{} + """ + +QUERY_EXCLUDE_OPNAME = """ +SELECT + OPTYPE_IDS.value AS "OpType", + TASKTYPE_IDS.value AS "TaskType", + INPUTSHAPES_IDS.value AS "InputShapes", + round(TASK.endNs - TASK.startNs) AS "Duration" +FROM + COMPUTE_TASK_INFO +LEFT JOIN TASK + ON TASK.globalTaskId == COMPUTE_TASK_INFO.globalTaskId +LEFT JOIN + STRING_IDS AS OPTYPE_IDS + ON OPTYPE_IDS.id == COMPUTE_TASK_INFO.opType +LEFT JOIN + STRING_IDS AS TASKTYPE_IDS + ON TASKTYPE_IDS.id == COMPUTE_TASK_INFO.taskType +LEFT JOIN + STRING_IDS AS INPUTSHAPES_IDS + ON INPUTSHAPES_IDS.id == COMPUTE_TASK_INFO.inputShapes +{} +""" + + +class ComputeOpSumExport(BaseStatsExport): + + def __init__(self, db_path, recipe_name, step_range): + super().__init__(db_path, recipe_name, step_range) + self._query = self.get_query_statement() + + def get_query_statement(self): + if self._step_range: + filter_statement = f"WHERE TASK.startNs >= {self._step_range.get(Constant.START_NS)} " \ + f"and TASK.startNs <= {self._step_range.get(Constant.END_NS)}" + else: + filter_statement = "" + return QUERY.format(filter_statement) + + +class ComputeOpSumExportExcludeOpName(BaseStatsExport): + + def __init__(self, db_path, recipe_name, step_range): 
+ super().__init__(db_path, recipe_name, step_range) + self._query = self.get_query_statement() + + def get_query_statement(self): + if self._step_range: + filter_statement = f"WHERE TASK.startNs >= {self._step_range.get(Constant.START_NS)} " \ + f"and TASK.startNs <= {self._step_range.get(Constant.END_NS)}" + else: + filter_statement = "" + return QUERY_EXCLUDE_OPNAME.format(filter_statement) diff --git a/profiler/cluster_analyse/analysis/analysis_facade.py b/profiler/msprof_analyze/prof_exports/hccl_sum_export.py similarity index 32% rename from profiler/cluster_analyse/analysis/analysis_facade.py rename to profiler/msprof_analyze/prof_exports/hccl_sum_export.py index 6a3fc052e8717d7a7043f589a462d5d44c12a344..c577d40c0f5ae1289d196bdd6d7cd306ebcbf01e 100644 --- a/profiler/cluster_analyse/analysis/analysis_facade.py +++ b/profiler/msprof_analyze/prof_exports/hccl_sum_export.py @@ -13,35 +13,40 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-from multiprocessing import Process +from msprof_analyze.prof_exports.base_stats_export import BaseStatsExport +from msprof_analyze.prof_common.constant import Constant -from analysis.communication_analysis import CommunicationAnalysis -from analysis.communication_analysis import CommunicationAnalysisOptimized -from analysis.comm_matrix_analysis import CommMatrixAnalysis -from analysis.comm_matrix_analysis import CommMatrixAnalysisOptimized -from analysis.step_trace_time_analysis import StepTraceTimeAnalysis -from analysis.host_info_analysis import HostInfoAnalysis -from common_func.constant import Constant +QUERY = """ +SELECT + NAME_IDS.value AS "OpName", + TYPE_IDS.value AS "OpType", + round(endNs - startNs) AS "Duration", + GROUP_NAME_IDS.value AS "GroupName" +FROM + COMMUNICATION_OP +LEFT JOIN + STRING_IDS AS TYPE_IDS + ON TYPE_IDS.id == COMMUNICATION_OP.opType +LEFT JOIN + STRING_IDS AS NAME_IDS + ON NAME_IDS.id == COMMUNICATION_OP.opName +LEFT JOIN + STRING_IDS AS GROUP_NAME_IDS + ON GROUP_NAME_IDS.id == COMMUNICATION_OP.groupName +{} + """ -class AnalysisFacade: - default_module = {CommunicationAnalysis, StepTraceTimeAnalysis, CommMatrixAnalysis, HostInfoAnalysis} - simplified_module = {CommunicationAnalysisOptimized, StepTraceTimeAnalysis, CommMatrixAnalysisOptimized, HostInfoAnalysis} +class HcclSumExport(BaseStatsExport): - def __init__(self, params: dict): - self.params = params + def __init__(self, db_path, recipe_name, step_range): + super().__init__(db_path, recipe_name, step_range) + self._query = self.get_query_statement() - def cluster_analyze(self): - # 多个profiler用多进程处理 - process_list = [] - if self.params.get(Constant.DATA_SIMPLIFICATION) and self.params.get(Constant.DATA_TYPE) == Constant.DB: - analysis_module = self.simplified_module + def get_query_statement(self): + if self._step_range: + filter_statement = f"WHERE COMMUNICATION_OP.startNs >= {self._step_range.get(Constant.START_NS)} " \ + f"and COMMUNICATION_OP.startNs <= 
{self._step_range.get(Constant.END_NS)}" else: - analysis_module = self.default_module - for analysis in analysis_module: - process = Process(target=analysis(self.params).run) - process.start() - process_list.append(process) - - for process in process_list: - process.join() + filter_statement = "" + return QUERY.format(filter_statement) diff --git a/profiler/msprof_analyze/prof_exports/mstx_event_export.py b/profiler/msprof_analyze/prof_exports/mstx_event_export.py new file mode 100644 index 0000000000000000000000000000000000000000..97c3813b7eb4e9eb805527984c5b7deabbbea823 --- /dev/null +++ b/profiler/msprof_analyze/prof_exports/mstx_event_export.py @@ -0,0 +1,107 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from msprof_analyze.prof_exports.base_stats_export import BaseStatsExport +from msprof_analyze.prof_common.constant import Constant + +MARK_QUERY = """ +WITH + FRAMEWORK_API AS ( + SELECT + PYTORCH_API.startNs, + CONNECTION_IDS.connectionId + FROM + PYTORCH_API + LEFT JOIN + CONNECTION_IDS + ON PYTORCH_API.connectionId == CONNECTION_IDS.id + {} + ) +SELECT + MSG_IDS.value AS "msg", + MSTX_EVENTS.startNs AS "cann_ts", + TASK.startNs AS "device_ts", + FRAMEWORK_API.startNs AS "framework_ts", + MSTX_EVENTS.globalTid AS "tid" +FROM + MSTX_EVENTS +LEFT JOIN + TASK + ON MSTX_EVENTS.connectionId == TASK.connectionId +LEFT JOIN + FRAMEWORK_API + ON MSTX_EVENTS.connectionId == FRAMEWORK_API.connectionId +LEFT JOIN + STRING_IDS AS MSG_IDS + ON MSTX_EVENTS.message == MSG_IDS.id +WHERE + MSTX_EVENTS.eventType == 3 {} +ORDER BY + MSTX_EVENTS.startNs + """ + + +class MstxMarkExport(BaseStatsExport): + + def __init__(self, db_path, recipe_name, step_range): + super().__init__(db_path, recipe_name, step_range) + self._query = self.get_query_statement() + + def get_query_statement(self): + if self._step_range: + filter_statement_1 = f"WHERE PYTORCH_API.startNs >= {self._step_range.get(Constant.START_NS)} " \ + f"AND PYTORCH_API.startNs <= {self._step_range.get(Constant.END_NS)}" + filter_statement_2 = f"AND MSTX_EVENTS.startNs >= {self._step_range.get(Constant.START_NS)} " \ + f"AND MSTX_EVENTS.startNs <= {self._step_range.get(Constant.END_NS)}" + else: + filter_statement_1, filter_statement_2 = "", "" + return MARK_QUERY.format(filter_statement_1, filter_statement_2) + + +RANGE_QUERY = ''' +SELECT + MSG_IDS.value AS "msg", + MSTX_EVENTS.startNs AS "cann_start_ts", + MSTX_EVENTS.endNs AS "cann_end_ts", + TASK.startNs AS "device_start_ts", + TASK.endNs AS "device_end_ts", + MSTX_EVENTS.globalTid AS "tid" +FROM + MSTX_EVENTS +LEFT JOIN + TASK + ON MSTX_EVENTS.connectionId == TASK.connectionId +LEFT JOIN + STRING_IDS AS MSG_IDS + ON MSTX_EVENTS.message == MSG_IDS.id +WHERE + 
MSTX_EVENTS.eventType == 2 {} +AND + MSTX_EVENTS.connectionId != 4294967295 +ORDER BY + MSTX_EVENTS.startNs + ''' + + +class MstxRangeExport(BaseStatsExport): + + def __init__(self, db_path, recipe_name, step_range): + super().__init__(db_path, recipe_name, step_range) + self._query = self.get_query_statement() + + def get_query_statement(self): + filter_statement = f"AND MSTX_EVENTS.startNs >= {self._step_range.get(Constant.START_NS)} AND " \ + f"MSTX_EVENTS.startNs <= {self._step_range.get(Constant.END_NS)}" if self._step_range else "" + return RANGE_QUERY.format(filter_statement) diff --git a/profiler/msprof_analyze/prof_exports/mstx_step_export.py b/profiler/msprof_analyze/prof_exports/mstx_step_export.py new file mode 100644 index 0000000000000000000000000000000000000000..c8aec91b7e5ce5fb29fffebeb8668fec723e3fa8 --- /dev/null +++ b/profiler/msprof_analyze/prof_exports/mstx_step_export.py @@ -0,0 +1,34 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
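The step windows selected from `STEP_TIME` (`step_id`, `start_ns`, `end_ns`) are what ultimately feed the `step_range` dict used by the other exporters. How a caller maps one to the other is not shown in this hunk, so the sketch below is an assumption about that glue, using hard-coded rows in place of a real query result:

```python
# (step_id, start_ns, end_ns) rows, as the STEP_TIME query would return them
rows = [
    (1, 0, 1_000_000),
    (2, 1_000_000, 2_500_000),
]


def step_range_for(step_id, step_rows):
    """Pick one step's window and shape it as the step_range dict."""
    for sid, start_ns, end_ns in step_rows:
        if sid == step_id:
            return {"startNs": start_ns, "endNs": end_ns}
    return None  # unknown step id: exporters then run unfiltered


print(step_range_for(2, rows))  # {'startNs': 1000000, 'endNs': 2500000}
```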
+ +from msprof_analyze.prof_exports.base_stats_export import BaseStatsExport + +QUERY = """ +SELECT + id AS "step_id", + startNs AS "start_ns", + endNs AS "end_ns" +FROM + STEP_TIME +ORDER BY + startNs + """ + + +class MstxStepExport(BaseStatsExport): + + def __init__(self, db_path, recipe_name, step_range): + super().__init__(db_path, recipe_name, step_range) + self._query = QUERY diff --git a/profiler/requirements.txt b/profiler/msprof_analyze/requirements.txt similarity index 100% rename from profiler/requirements.txt rename to profiler/msprof_analyze/requirements.txt diff --git a/profiler/requirements/build.txt b/profiler/msprof_analyze/requirements/build.txt similarity index 80% rename from profiler/requirements/build.txt rename to profiler/msprof_analyze/requirements/build.txt index 0732de1f7c7b1839f399530f17a3eb4237d34391..3ef20e787be3bad76de0ccde4dc3e3a1dbe63efb 100644 --- a/profiler/requirements/build.txt +++ b/profiler/msprof_analyze/requirements/build.txt @@ -7,10 +7,9 @@ tqdm prettytable ijson requests -xlsxwriter +xlsxwriter>=3.0.6 sqlalchemy urllib3<2.0 -bottleneck>=1.3.6 numpy<=1.26.4 pandas psutil \ No newline at end of file diff --git a/profiler/requirements/tests.txt b/profiler/msprof_analyze/requirements/tests.txt similarity index 83% rename from profiler/requirements/tests.txt rename to profiler/msprof_analyze/requirements/tests.txt index 8313304e687428a406a5962ff5aef4d16620c167..6ef8754a26b463cde07c99cff8679ae4494b0ff5 100644 --- a/profiler/requirements/tests.txt +++ b/profiler/msprof_analyze/requirements/tests.txt @@ -14,4 +14,6 @@ ijson requests xlsxwriter sqlalchemy -urllib3<2.0 \ No newline at end of file +urllib3<2.0 +beautifulsoup4 +openpyxl \ No newline at end of file diff --git a/profiler/setup.cfg b/profiler/msprof_analyze/setup.cfg similarity index 100% rename from profiler/setup.cfg rename to profiler/msprof_analyze/setup.cfg diff --git a/profiler/setup.py b/profiler/msprof_analyze/setup.py similarity index 47% rename from 
profiler/setup.py rename to profiler/msprof_analyze/setup.py index 8ca7240ef2e03e13568d0f2d2743d4132e957c13..72a13d66b4c2ebbda289caad62332fd4351d029e 100644 --- a/profiler/setup.py +++ b/profiler/msprof_analyze/setup.py @@ -1,11 +1,31 @@ #!/usr/bin/python # -*- coding: utf-8 -*- + +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import os +import sys + +sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) from setuptools import find_packages, setup # type: ignore -from prof_common.path_manager import PathManager -from prof_common.utils import SafeConfigReader +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.utils import SafeConfigReader extras = { "test": [ @@ -21,20 +41,17 @@ sections = { 'EMAIL': ['ms_email'] } -with open('requirements/build.txt', 'r') as f: - requires = f.read().splitlines() +requires = FileManager.read_common_file('requirements/build.txt').splitlines() -with open('requirements/tests.txt', 'r') as f: - tests_requires = f.read().splitlines() +tests_requires = FileManager.read_common_file('requirements/tests.txt').splitlines() tests_requires.extend(set(requires)) -with open('version.txt', 'r') as f: - version = f.read().strip() +version = FileManager.read_common_file('version.txt').strip() -config_file_path = "config/config.ini" 
-PathManager.check_input_file_path(config_file_path) -PathManager.check_file_size(config_file_path) -reader = SafeConfigReader(config_file_path) +CONFIG_FILE_PATH = "config/config.ini" +PathManager.check_input_file_path(CONFIG_FILE_PATH) +PathManager.check_file_size(CONFIG_FILE_PATH) +reader = SafeConfigReader(CONFIG_FILE_PATH) reader.validate(sections) config = reader.get_config() try: @@ -46,7 +63,10 @@ try: except Exception as e: raise RuntimeError("The configuration file is incomplete and not configured ms_email information.") from e -root_path = os.path.dirname(os.path.abspath(os.path.dirname(__file__))) +root_path = os.path.dirname(os.path.dirname(os.path.abspath(os.path.dirname(__file__)))) +msprof_analyze_path = os.path.abspath(os.path.dirname(__file__)) +child_packages = find_packages(msprof_analyze_path, exclude=["example"]) +msprof_analyze_packages = [f"msprof_analyze.{package}" for package in child_packages] setup( name="msprof-analyze", version=version, @@ -57,8 +77,9 @@ setup( url=url, author="MindStudio", author_email=author_email, - package_dir={"": root_path}, - packages=find_packages(root_path), + package_dir={"": root_path, + "msprof_analyze": msprof_analyze_path}, + packages=find_packages(root_path, exclude=["example"]) + msprof_analyze_packages, include_package_data=False, python_requires='>=3.7', install_requires=requires, @@ -67,7 +88,7 @@ setup( license='Apache License 2.0', entry_points=""" [console_scripts] - msprof-analyze=profiler.cli.entrance:msprof_analyze_cli + msprof-analyze=msprof_analyze.cli.entrance:msprof_analyze_cli """ ) diff --git a/profiler/msprof_analyze/test/__init__.py b/profiler/msprof_analyze/test/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/resource/advisor/cluster_analysis_output/cluster_communication.json b/profiler/msprof_analyze/test/resource/advisor/cluster_analysis_output/cluster_communication.json similarity 
index 100% rename from profiler/test/resource/advisor/cluster_analysis_output/cluster_communication.json rename to profiler/msprof_analyze/test/resource/advisor/cluster_analysis_output/cluster_communication.json diff --git a/profiler/test/resource/advisor/cluster_analysis_output/cluster_step_trace_time.csv b/profiler/msprof_analyze/test/resource/advisor/cluster_analysis_output/cluster_step_trace_time.csv similarity index 100% rename from profiler/test/resource/advisor/cluster_analysis_output/cluster_step_trace_time.csv rename to profiler/msprof_analyze/test/resource/advisor/cluster_analysis_output/cluster_step_trace_time.csv diff --git a/profiler/test/resource/event_list.json b/profiler/msprof_analyze/test/resource/event_list.json similarity index 100% rename from profiler/test/resource/event_list.json rename to profiler/msprof_analyze/test/resource/event_list.json diff --git a/profiler/test/resource/pipeline_view.png b/profiler/msprof_analyze/test/resource/pipeline_view.png similarity index 100% rename from profiler/test/resource/pipeline_view.png rename to profiler/msprof_analyze/test/resource/pipeline_view.png diff --git a/profiler/test/resource/test.csv b/profiler/msprof_analyze/test/resource/test.csv similarity index 100% rename from profiler/test/resource/test.csv rename to profiler/msprof_analyze/test/resource/test.csv diff --git a/profiler/msprof_analyze/test/run_st.py b/profiler/msprof_analyze/test/run_st.py new file mode 100644 index 0000000000000000000000000000000000000000..e15bf17a2f4bdad495d2b48be7304b98b416241f --- /dev/null +++ b/profiler/msprof_analyze/test/run_st.py @@ -0,0 +1,100 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import datetime +import logging +import os +import subprocess +import sys +import threading + +stop_print_thread = False + + +def print_stout(output): + while True: + line = output.readline().strip() + if line: + logging.info(line) + global stop_print_thread + if stop_print_thread: + break + + +def stop_stout_threads(thread_list): + global stop_print_thread + stop_print_thread = True + for stout_thread in thread_list: + if stout_thread.is_alive(): + stout_thread.join() + + +def start_st_process(module_name): + st_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), "st", module_name) + cmd = ["python3", "-m", "pytest", "-s", st_path] + process = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) + stout_thread = threading.Thread(target=print_stout, args=(process.stdout,)) + stout_thread.start() + return process, stout_thread + + +def stop_st_process(process_list): + for process in process_list: + if process.poll() is None: + process.terminate() + process.wait() + + +def run_st(): + timeout = 3600 + + modules = ["advisor", "cluster_analyse", "compare_tools"] + process_list = [] + thread_list = [] + for module in modules: + process, stout_thread = start_st_process(module) + process_list.append(process) + thread_list.append(stout_thread) + + success, failed = True, False + start_time = datetime.datetime.utcnow() + while process_list: + duration = datetime.datetime.utcnow() - start_time + if duration.total_seconds() >= timeout: + logging.error("run st use case timeout.") + stop_stout_threads(thread_list) + 
stop_st_process(process_list) + return failed + for process in process_list: + if process.poll() is None: + continue + if process.returncode == 0: + process_list.remove(process) + continue + stop_stout_threads(thread_list) + stop_st_process(process_list) + return failed + stop_stout_threads(thread_list) + return success + + +if __name__ == "__main__": + logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s') + st_success = run_st() + if st_success: + logging.info("run st successfully.") + sys.exit(0) + else: + logging.error("run st failed.") + sys.exit(1) diff --git a/profiler/test/run_ut.py b/profiler/msprof_analyze/test/run_ut.py similarity index 78% rename from profiler/test/run_ut.py rename to profiler/msprof_analyze/test/run_ut.py index 6ab208dc29e9d5feb2418f9243d395a1aabfa23b..891bf82de35c984d1bc4af07498c3ad2d9b283f4 100644 --- a/profiler/test/run_ut.py +++ b/profiler/msprof_analyze/test/run_ut.py @@ -1,9 +1,22 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
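`run_st.py` above pairs each `subprocess.Popen` with a reader thread so the parent can poll return codes while the child's stdout is continuously drained; without that, a chatty test process can fill the pipe buffer and block. A condensed sketch of the same pattern (one child, and iterating the stream instead of the original's global stop flag, since iteration ends naturally at EOF):

```python
import subprocess
import sys
import threading

captured = []


def print_stout(stream):
    # drain the child's stdout; the loop ends when the pipe hits EOF
    for line in stream:
        captured.append(line.strip())


process = subprocess.Popen(
    [sys.executable, "-c", "print('case 1 passed'); print('case 2 passed')"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
reader = threading.Thread(target=print_stout, args=(process.stdout,))
reader.start()

process.wait()   # safe: the reader keeps the pipe from filling up
reader.join()
print(captured)  # ['case 1 passed', 'case 2 passed']
```

The real script generalizes this to one process and one reader thread per test module, plus a one-hour timeout loop that terminates whatever is still running.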
import os import shutil import subprocess import sys - def set_python_path(): cluster_analyse_root = os.path.join( os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "cluster_analyse") diff --git a/profiler/msprof_analyze/test/st/__init__.py b/profiler/msprof_analyze/test/st/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/msprof_analyze/test/st/advisor/__init__.py b/profiler/msprof_analyze/test/st/advisor/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/msprof_analyze/test/st/advisor/test_advisor_cmd_single_ascend_pt_compare.py b/profiler/msprof_analyze/test/st/advisor/test_advisor_cmd_single_ascend_pt_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..a485d62188deaa661499b79b94c94ce990398fcd --- /dev/null +++ b/profiler/msprof_analyze/test/st/advisor/test_advisor_cmd_single_ascend_pt_compare.py @@ -0,0 +1,344 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import subprocess +import logging +from unittest import TestCase + +import math +import pandas as pd +from bs4 import BeautifulSoup + +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.advisor.utils import get_files, execute_cmd + + +class TestAdvisorCmdSingleAscendPtNoCompare(TestCase): + ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", + "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", "n122-122-067_12380_20240912033946038_ascend_pt") + COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_2", + "n122-120-121_12321_20240911113658382_ascend_pt") + OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "TestAdvisorCmdSingleAscendPtCompare") + ALL_OUTPUT_PATH = os.path.join(OUTPUT_PATH, "all") + RESULT_EXCEL = {} + RESULT_HTML = {} + COMMAND_SUCCESS = 0 + + def setup_class(self): + PathManager.make_dir_safety(self.ALL_OUTPUT_PATH) + cmd_all = ["msprof-analyze", "advisor", "all", "-d", self.BASE_PROFILING_PATH, "-bp", + self.COMPARISON_PROFILING_PATH, "-o", self.ALL_OUTPUT_PATH, "-l", "en", "--force"] + if execute_cmd(cmd_all) != self.COMMAND_SUCCESS or not os.path.exists(self.ALL_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [all] [bp] task failed.") + self.RESULT_HTML, self.RESULT_EXCEL = get_files(self.OUTPUT_PATH) + + def teardown_class(self): + PathManager.remove_path_safety(self.OUTPUT_PATH) + + def test_all_problems(self): + category = [ + "Kernel compare of Target and Benchmark", + "Byte Alignment Analysis", + "Bandwidth Contention Analysis", + "AICPU operator", + "Dynamic Shape Operator", + "FUSIBLE OPERATOR ANALYSIS", + "Affinity Apis", + "Operator Dispatch" + ] + + # True presents the attr is nan + description_len = [1, 1, 3, 2, 1, 1, 1, 1] + suggestion_len = [True, 1, 1, 2, 4, 2, 1, 1] + problem_count = [True, True, True, 2.0, 1.0, True, True, True] + total_time = 
[True, True, True, 57674709.54, True, True, True, True] + time_ratio = [True, True, True, 0.0, True, True, True, True] + income = [True, True, True, True, True, True, True, True] + income_ratio = [True, True, True, True, True, True, True, True] + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='problems', header=0) + except FileNotFoundError: + logging.error("File %s not found.", str(self.RESULT_EXCEL.get("all", None))) + return + + for index, row in df.iterrows(): + self.assertEqual(category[index], row["category"]) + self.assertEqual(description_len[index], len(row["description"].split("\n"))) + self.assertEqual(suggestion_len[index], isinstance(row["suggestion"], float) or + len(row["suggestion"].split("\n"))) + self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or row["problem count"])) + self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or + round(row["total_time(us)"], 2))) + self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2))) + self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2))) + self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or + round(row["income ratio"], 2))) + + def test_Byte_Alignment_Analysis(self): + op_name = [ + "hcom_broadcast__868_2_1", + "hcom_reduceScatter__511_1_1", + "hcom_allGather__511_2_1" + ] + + total_size = [ + 24274052, + 670986240, + 335493120 + ] + + duration = [ + 995.36, + 35724.8, + 17275.4 + ] + + abnormal_duration = [ + 995.36, + 35724.8, + 17275.4 + ] + + bandwidth = [ + 24.3872, + 18.7821, + 19.4203 + ] + + test_pattern = ["all"] + for pattern in test_pattern: + try: + df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='Byte Alignment Analysis', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None)) + return + + for index, row in df.iterrows(): + 
self.assertEqual(op_name[index], row["op name"]) + self.assertEqual(total_size[index], row["total size(Byte)"]) + self.assertEqual(duration[index], row["duration(us)"]) + self.assertEqual(abnormal_duration[index], row["abnormal duration(us)"]) + self.assertEqual(bandwidth[index], row["bandwidth(GB/s)"]) + + soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "Byte Alignment Analysis": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(str(op_name[row_index - 1]), row.find_all('td')[0].text) + self.assertEqual(str(total_size[row_index - 1]), row.find_all('td')[1].text) + self.assertEqual(str(round(duration[row_index - 1], 2)), row.find_all('td')[2].text) + self.assertEqual(str(round(abnormal_duration[row_index - 1], 2)), row.find_all('td')[3].text) + self.assertEqual(str(round(bandwidth[row_index - 1], 4)), row.find_all('td')[4].text) + + def test_all_bandwidth_contention_analysis(self): + bandwidth_contention_analysis = [ + "hcom_allGather__508_1_1", "hcom_allGather__508_4_1", "hcom_allGather__508_8_1", + "hcom_allGather__508_108_1", "hcom_allGather__508_112_1", "hcom_allGather__508_113_1", + "hcom_allGather__508_137_1", "hcom_allGather__508_141_1", "hcom_allGather__508_145_1", + "hcom_allGather__508_153_1", "hcom_allGather__508_157_1", "hcom_allGather__508_173_1", + "hcom_allGather__508_177_1", "hcom_allGather__508_181_1", "hcom_allGather__508_209_1", + "hcom_reduceScatter__868_261_1", "hcom_reduceScatter__868_266_1", "hcom_allGather__508_276_1", + "hcom_reduceScatter__508_283_1", "hcom_reduceScatter__508_291_1", "hcom_reduceScatter__508_299_1", + "hcom_reduceScatter__508_307_1", "hcom_allGather__508_308_1", "hcom_reduceScatter__508_315_1", + "hcom_reduceScatter__508_323_1", "hcom_reduceScatter__508_331_1", "hcom_reduceScatter__508_339_1", + 
"hcom_reduceScatter__508_347_1", "hcom_reduceScatter__508_355_1", "hcom_allGather__508_356_1", + "hcom_reduceScatter__508_363_1", "hcom_reduceScatter__508_371_1", "hcom_allGather__508_372_1", + "hcom_reduceScatter__508_379_1", "hcom_reduceScatter__508_387_1", "hcom_allGather__508_388_1", + "hcom_reduceScatter__508_395_1", "hcom_reduceScatter__508_403_1", "hcom_allGather__508_404_1", + "hcom_reduceScatter__508_411_1", "hcom_reduceScatter__508_419_1", "hcom_reduceScatter__508_427_1", + "hcom_reduceScatter__508_435_1", "hcom_reduceScatter__508_443_1", "hcom_reduceScatter__508_451_1", + "hcom_reduceScatter__508_459_1", "hcom_reduceScatter__508_467_1", "hcom_allGather__508_468_1", + "hcom_reduceScatter__508_475_1", "hcom_reduceScatter__508_483_1", "hcom_reduceScatter__508_491_1", + "hcom_reduceScatter__508_499_1", "hcom_reduceScatter__508_507_1", "hcom_reduceScatter__508_515_1", + "hcom_allGather__508_516_1", "hcom_reduceScatter__508_523_1", "hcom_reduceScatter__508_531_1", + "hcom_reduceScatter__508_539_1", "hcom_reduceScatter__508_547_1", "hcom_reduceScatter__508_555_1", + "hcom_reduceScatter__508_563_1", "hcom_reduceScatter__508_571_1", "hcom_reduceScatter__508_579_1", + "hcom_reduceScatter__508_587_1", "hcom_allGather__508_588_1", "hcom_reduceScatter__508_595_1", + "hcom_reduceScatter__508_603_1", "hcom_reduceScatter__508_611_1", "hcom_reduceScatter__508_619_1", + "hcom_reduceScatter__508_627_1", "hcom_reduceScatter__508_635_1", "hcom_reduceScatter__508_643_1", + "hcom_allGather__508_644_1", "hcom_reduceScatter__508_651_1", "hcom_reduceScatter__508_659_1", + "hcom_reduceScatter__508_667_1", "hcom_reduceScatter__508_675_1", "hcom_reduceScatter__508_683_1" + ] + duration = [ + 8.3454, 13.8113, 39.8263, 21.6036, 38.2598, 5.3913, 13.4007, 9.6871, 8.8002, 10.0535, 8.3423, 9.3205, + 11.3891, 9.473, 12.7247, 19.4176, 13.2621, 16.3541, 127.5414, 127.288, 126.6839, 129.0707, 11.8205, + 128.8378, + 130.0548, 128.3927, 124.9711, 128.0221, 122.8157, 11.7839, 127.0278, 123.3328, 
11.9078, 122.3141, 123.1837, + 11.2561, 123.8337, 127.5955, 11.5881, 123.0412, 128.4852, 122.3674, 127.1958, 127.5779, 129.6155, 127.2981, + 125.5495, 11.0916, 127.4827, 126.4632, 125.0414, 123.9187, 125.168, 127.1, 12.6763, 126.3728, 126.9693, + 127.677, 127.1439, 127.2013, 127.9102, 125.7989, 126.4961, 127.6573, 12.2088, 127.6283, 126.3803, 129.8238, + 126.2997, 127.4806, 129.2007, 127.2733, 12.0963, 126.8322, 127.5317, 126.482, 127.8283, 129.2951 + ] + bandwidth = [ + 5.49, 4.8, 5.99, 14.13, 3.24, 6.25, 8.52, 5.17, 5.34, 8.24, 5.43, 6.15, 9.79, 5.55, 4.39, 13.35, 13.14, + 3.61, 2.51, 2.88, 2.83, 3.07, 4.81, 2.55, 2.57, 2.73, 2.84, 2.44, 3.01, 4.95, 2.63, 3.06, 3.77, 2.88, 3.44, + 4.72, 2.91, 3.21, 4.47, 2.38, 2.31, 2.9, 4.26, 3.57, 2.31, 2.24, 2.81, 4.37, 2.67, 2.8, 2.74, 2.16, 2.79, + 2.88, 5.79, 2.75, 2.93, 2.88, 2.31, 2.72, 2.39, 2.6, 2.55, 2.58, 4.29, 2.69, 2.86, 2.09, 3.12, 2.31, 2.28, + 2.87, 6.97, 3.1, 2.35, 3.4, 2.61, 2.62 + ] + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='Bandwidth Contention Analysis', header=0) + except FileNotFoundError: + logging.error("File %s not found.", str(self.RESULT_EXCEL.get("all", None))) + return + + for index, row in df.iterrows(): + self.assertEqual(bandwidth_contention_analysis[index], row["op name"]) + self.assertEqual(duration[index], round(row["duration(ms)"], 4)) + self.assertEqual(bandwidth[index], round(row["bandwidth(GB/s)"], 2)) + + # wait repair bugs to check html + + def test_AICPU_operator(self): + op_name = ["aclnnPowTensorScalar_SquareAiCpu_Square", "aclnnEqScalar_EqualAiCpu_Equal"] + op_type = ["Square", "Equal"] + task_duration = [92.06, 90.72] + input_shapes = ["\"41\"", "\"41;\""] + input_data_types = ["INT64", "DOUBLE;DOUBLE"] + input_formats = ["FORMAT_ND", "FORMAT_ND;FORMAT_ND"] + output_shapes = ["\"41\"", "\"41\""] + output_data_types = ["INT64", "BOOL"] + output_formats = ["FORMAT_ND", "FORMAT_ND"] + stack_info = [True, True] + + t0_description = ["Square, Equal"] + 
t0_suggestion = ["aclnnEqScalar_EqualAiCpu_Equal"] + t0_elapsed_time = ["182.78"] + t0_time_ratio = ["0.0"] + t1_operator_type = ["Square"] + t1_counts = ["1"] + t1_elapsed_time = ["92.06"] + t2_operator_type = ["Equal"] + t2_counts = ["1"] + t2_elapsed_time = ["90.72"] + b_names = ["Square", "Suggestion 1:", "Equal", "Suggestion 1:"] + + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='AICPU operator', header=0) + except FileNotFoundError: + logging.error("File %s not found.", str(self.RESULT_EXCEL.get("all", None))) + return + + for index, row in df.iterrows(): + self.assertEqual(op_name[index], row["op_name"]) + self.assertEqual(op_type[index], row["op_type"]) + self.assertEqual(task_duration[index], round(row["task_duration"], 2)) + self.assertEqual(input_shapes[index], row["input_shapes"]) + self.assertEqual(input_data_types[index], row["input_data_types"]) + self.assertEqual(input_formats[index], row["input_formats"]) + self.assertEqual(output_shapes[index], row["output_shapes"]) + self.assertEqual(output_data_types[index], row["output_data_types"]) + self.assertEqual(output_formats[index], row["output_formats"]) + self.assertEqual(stack_info[index], math.isnan(row["stack_info"])) + + soup = BeautifulSoup(open(self.RESULT_HTML.get("all", None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "AICPU Issues": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t0_description[row_index - 1], + row.find_all('td')[0].text.split(":")[1].replace("\n", "")) + self.assertEqual(t0_suggestion[row_index - 1], row.find_all('td')[1].text.split(" ")[-1]) + self.assertEqual(t0_elapsed_time[row_index - 1], row.find_all('td')[2].text) + self.assertEqual(t0_time_ratio[row_index - 1], row.find_all('td')[3].text) + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + 
continue + self.assertEqual(t1_operator_type[row_index - 1], row.find_all('td')[0].text) + self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text) + for row_index, row in enumerate(table[2].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t2_operator_type[row_index - 1], row.find_all('td')[0].text) + self.assertEqual(t2_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t2_elapsed_time[row_index - 1], row.find_all('td')[2].text) + + b_contents = div_content.find_all('b') + for b_index, b_content in enumerate(b_contents): + self.assertEqual(b_names[b_index], b_content.text) + + def test_Affinity_API(self): + affinity_api = ["torch_npu.npu_confusion_transpose", "torch_npu.optim.NpuFusedAdamW"] + code_stacks = [True, True] + stack_called_counts = [True, True] + ignore_api = ["torch_npu.optim.NpuFusedAdamW", "torch_npu.npu_confusion_transpose"] + + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='Affinity Apis', header=0) + except FileNotFoundError: + logging.error("File %s not found.", str(self.RESULT_EXCEL.get("all", None))) + return + + for index, row in df.iterrows(): + self.assertEqual(affinity_api[index], row["Affinity API"]) + self.assertEqual(code_stacks[index], math.isnan(row["Code stacks"])) + self.assertEqual(stack_called_counts[index], math.isnan(row["Stack called counts"])) + + soup = BeautifulSoup(open(self.RESULT_HTML.get("all", None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "Affinity API Issues": + div_content = h2.next.next.next + self.assertEqual(ignore_api[0], div_content.contents[-2].contents[-2].text) + self.assertEqual(ignore_api[1], div_content.contents[-2].contents[-4].text) + + def test_operator_dispatch(self): + issues = ["operator dispatch"] + op_name = ["aclopCompileAndExecute"] + counts = [381] + total_time = [58486.7048] + + t0_description = ["381"] + 
t0_suggestion = ["torch_npu.npu.set_compile_mode(jit_compile=False)"] + t1_issue = ["aclopCompileAndExecute"] + t1_counts = ['381'] + t1_elapsed_time = ['58486.704798215804'] + + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='Operator Dispatch', header=0) + except FileNotFoundError: + logging.error("File %s not found.", str(self.RESULT_EXCEL.get("all", None))) + return + for index, row in df.iterrows(): + self.assertEqual(issues[index], row["Issues"]) + self.assertEqual(op_name[index], row["op name"]) + self.assertEqual(counts[index], row["counts"]) + self.assertEqual(total_time[index], round(row["total time"], 4)) + + soup = BeautifulSoup(open(self.RESULT_HTML.get("all", None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "Operator Dispatch Issues": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t0_description[row_index - 1], row.find_all('td')[0].text.split(' ')[1]) + self.assertEqual(t0_suggestion[row_index - 1], + row.find_all('td')[1].text.split('`')[1].split(';')[0]) + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t1_issue[row_index - 1], row.find_all('td')[0].text) + self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text) diff --git a/profiler/msprof_analyze/test/st/advisor/utils.py b/profiler/msprof_analyze/test/st/advisor/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..807695bedf64eddad5f960a795e513a1688f29a9 --- /dev/null +++ b/profiler/msprof_analyze/test/st/advisor/utils.py @@ -0,0 +1,65 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import re
+import logging
+import subprocess
+
+RE_EXCEL_MATCH_EXP = r"^mstt_advisor_\d{1,20}\.xlsx"
+RE_HTML_MATCH_EXP = r"^mstt_advisor_\d{1,20}\.html"
+COMMAND_SUCCESS = 0
+
+
+def execute_cmd(cmd):
+    logging.info('Execute command: %s', " ".join(cmd))
+    completed_process = subprocess.run(cmd, shell=False, stderr=subprocess.PIPE)
+    if completed_process.returncode != COMMAND_SUCCESS:
+        logging.error(completed_process.stderr.decode())
+    return completed_process.returncode
+
+
+def get_files(out_path):
+    dirs = os.listdir(out_path)
+    result_html = {}
+    result_excel = {}
+    for pattern in dirs:
+        files_out_path = os.path.join(out_path, pattern)
+        files = os.listdir(files_out_path)
+        newest_html_file = None
+        newest_excel_file = None
+        for file_name in files:
+            if re.match(RE_HTML_MATCH_EXP, file_name):
+                file_time = file_name.split(".")[0].split("_")[-1]
+                if not newest_html_file or file_time > newest_html_file.split(".")[0].split("_")[-1]:
+                    newest_html_file = file_name
+        if not newest_html_file:
+            logging.error("advisor [%s] result html was not found.", str(pattern))
+        log_dir = os.path.join(files_out_path, "log")
+        log_files = os.listdir(log_dir)
+        for file_name in log_files:
+            if re.match(RE_EXCEL_MATCH_EXP, file_name):
+                file_time = file_name.split(".")[0].split("_")[-1]
+                if not newest_excel_file or file_time > newest_excel_file.split(".")[0].split("_")[-1]:
+                    newest_excel_file = file_name
+        if not 
newest_excel_file:
+            logging.error("advisor [%s] result excel was not found.", str(pattern))
+
+        # the html timestamp should match the excel timestamp
+        if newest_html_file.split(".")[0].split("_")[-1] != newest_excel_file.split(".")[0].split("_")[-1]:
+            logging.error("advisor [%s] html file and excel file do not match.", str(pattern))
+
+        result_html[pattern] = os.path.join(files_out_path, newest_html_file)
+        result_excel[pattern] = os.path.join(log_dir, newest_excel_file)
+    return result_html, result_excel
diff --git a/profiler/msprof_analyze/test/st/cluster_analyse/__init__.py b/profiler/msprof_analyze/test/st/cluster_analyse/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_bandwidth_db.py b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_bandwidth_db.py
new file mode 100644
index 0000000000000000000000000000000000000000..00097bd9d3667df508e84fd4f54dbf68e72e41cb
--- /dev/null
+++ b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_bandwidth_db.py
@@ -0,0 +1,135 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+class ClusterCommunicationAnalyzerBandwidthDb: + def __init__(self, rank_set=None, step=None, rank_id=None, hccl_op_name=None, group_name=None, + band_type=None, transit_size=None, transit_time=None, bandwidth=None, large_packet_ratio=None, + package_size=None, count=None, total_duration=None): + self._rank_set = rank_set + self._step = step + self._rank_id = rank_id + self._hccl_op_name = hccl_op_name + self._group_name = group_name + self._band_type = band_type + self._transit_size = transit_size + self._transit_time = transit_time + self._bandwidth = bandwidth + self._large_packet_ratio = large_packet_ratio + self._package_size = package_size + self._count = count + self._total_duration = total_duration + + @property + def rank_set(self): + return self._rank_set + + @rank_set.setter + def rank_set(self, value): + self._rank_set = value + + @property + def step(self): + return self._step + + @step.setter + def step(self, value): + self._step = value + + @property + def rank_id(self): + return self._rank_id + + @rank_id.setter + def rank_id(self, value): + self._rank_id = value + + @property + def hccl_op_name(self): + return self._hccl_op_name + + @hccl_op_name.setter + def hccl_op_name(self, value): + self._hccl_op_name = value + + @property + def group_name(self): + return self._group_name + + @group_name.setter + def group_name(self, value): + self._group_name = value + + @property + def band_type(self): + return self._band_type + + @band_type.setter + def band_type(self, value): + self._band_type = value + + @property + def transit_size(self): + return self._transit_size + + @transit_size.setter + def transit_size(self, value): + self._transit_size = value + + @property + def transit_time(self): + return self._transit_time + + @transit_time.setter + def transit_time(self, value): + self._transit_time = value + + @property + def bandwidth(self): + return self._bandwidth + + @bandwidth.setter + def bandwidth(self, value): + self._bandwidth = value + + 
@property + def large_packet_ratio(self): + return self._large_packet_ratio + + @large_packet_ratio.setter + def large_packet_ratio(self, value): + self._large_packet_ratio = value + + @property + def package_size(self): + return self._package_size + + @package_size.setter + def package_size(self, value): + self._package_size = value + + @property + def count(self): + return self._count + + @count.setter + def count(self, value): + self._count = value + + @property + def total_duration(self): + return self._total_duration + + @total_duration.setter + def total_duration(self, value): + self._total_duration = value diff --git a/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_matrix_db.py b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_matrix_db.py new file mode 100644 index 0000000000000000000000000000000000000000..9665a6c8b40ef6b20e169209ac314eeba3246684 --- /dev/null +++ b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_matrix_db.py @@ -0,0 +1,131 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Cluster communication matrix db class """ + + +class ClusterCommunicationAnalyzerMatrixDb: + def __init__(self, + rank_set=None, + step=None, + hccl_op_name=None, + group_name=None, + src_rank=None, + dst_rank=None, + transit_size=None, + transit_time=None, + bandwidth=None, + transport_type=None, + op_name=None): + self._rank_set = rank_set + self._step = step + self._hccl_op_name = hccl_op_name + self._group_name = group_name + self._src_rank = src_rank + self._dst_rank = dst_rank + self._transit_size = transit_size + self._transit_time = transit_time + self._bandwidth = bandwidth + self._transport_type = transport_type + self._op_name = op_name + + @property + def rank_set(self): + return self._rank_set + + @rank_set.setter + def rank_set(self, value): + self._rank_set = value + + @property + def step(self): + return self._step + + @step.setter + def step(self, value): + self._step = value + + @property + def hccl_op_name(self): + return self._hccl_op_name + + @hccl_op_name.setter + def hccl_op_name(self, value): + self._hccl_op_name = value + + # group_name property + @property + def group_name(self): + return self._group_name + + @group_name.setter + def group_name(self, value): + self._group_name = value + + @property + def src_rank(self): + return self._src_rank + + @src_rank.setter + def src_rank(self, value): + self._src_rank = value + + @property + def dst_rank(self): + return self._dst_rank + + @dst_rank.setter + def dst_rank(self, value): + self._dst_rank = value + + @property + def transit_size(self): + return self._transit_size + + @transit_size.setter + def transit_size(self, value): + self._transit_size = value + + @property + def transit_time(self): + return self._transit_time + + @transit_time.setter + def transit_time(self, value): + self._transit_time = value + + @property + def bandwidth(self): + return self._bandwidth + + @bandwidth.setter + def bandwidth(self, value): + self._bandwidth = value + + @property + def transport_type(self): + 
return self._transport_type + + @transport_type.setter + def transport_type(self, value): + self._transport_type = value + + # op_name property + @property + def op_name(self): + return self._op_name + + @op_name.setter + def op_name(self, value): + self._op_name = value diff --git a/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_time_db.py b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_time_db.py new file mode 100644 index 0000000000000000000000000000000000000000..36aae143f4d09f7af70c46e8b813afe9871833d1 --- /dev/null +++ b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_communication_analyzer_time_db.py @@ -0,0 +1,135 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+class ClusterCommunicationAnalyzerTime: + def __init__(self, rank_set=None, step=None, rank_id=None, hccl_op_name=None, group_name=None, + start_timestamp=None, elapsed_time=None, transit_time=None, wait_time=None, + synchronization_time=None, idle_time=None, synchronization_time_ratio=None, wait_time_ratio=None): + self._rank_set = rank_set + self._step = step + self._rank_id = rank_id + self._hccl_op_name = hccl_op_name + self._group_name = group_name + self._start_timestamp = start_timestamp + self._elapsed_time = elapsed_time + self._transit_time = transit_time + self._wait_time = wait_time + self._synchronization_time = synchronization_time + self._idle_time = idle_time + self._synchronization_time_ratio = synchronization_time_ratio + self._wait_time_ratio = wait_time_ratio + + @property + def rank_set(self): + return self._rank_set + + @rank_set.setter + def rank_set(self, value): + self._rank_set = value + + @property + def step(self): + return self._step + + @step.setter + def step(self, value): + self._step = value + + @property + def rank_id(self): + return self._rank_id + + @rank_id.setter + def rank_id(self, value): + self._rank_id = value + + @property + def hccl_op_name(self): + return self._hccl_op_name + + @hccl_op_name.setter + def hccl_op_name(self, value): + self._hccl_op_name = value + + @property + def group_name(self): + return self._group_name + + @group_name.setter + def group_name(self, value): + self._group_name = value + + @property + def start_timestamp(self): + return self._start_timestamp + + @start_timestamp.setter + def start_timestamp(self, value): + self._start_timestamp = value + + @property + def elapsed_time(self): + return self._elapsed_time + + @elapsed_time.setter + def elapsed_time(self, value): + self._elapsed_time = value + + @property + def transit_time(self): + return self._transit_time + + @transit_time.setter + def transit_time(self, value): + self._transit_time = value + + @property + def wait_time(self): + return 
self._wait_time + + @wait_time.setter + def wait_time(self, value): + self._wait_time = value + + @property + def synchronization_time(self): + return self._synchronization_time + + @synchronization_time.setter + def synchronization_time(self, value): + self._synchronization_time = value + + @property + def idle_time(self): + return self._idle_time + + @idle_time.setter + def idle_time(self, value): + self._idle_time = value + + @property + def synchronization_time_ratio(self): + return self._synchronization_time_ratio + + @synchronization_time_ratio.setter + def synchronization_time_ratio(self, value): + self._synchronization_time_ratio = value + + @property + def wait_time_ratio(self): + return self._wait_time_ratio + + @wait_time_ratio.setter + def wait_time_ratio(self, value): + self._wait_time_ratio = value diff --git a/profiler/msprof_analyze/test/st/cluster_analyse/cluster_step_trace_time_db.py b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_step_trace_time_db.py new file mode 100644 index 0000000000000000000000000000000000000000..eb6d562896a043ffa8e5b4838e5f7dce546f679f --- /dev/null +++ b/profiler/msprof_analyze/test/st/cluster_analyse/cluster_step_trace_time_db.py @@ -0,0 +1,169 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Cluster step trace time class """ + + +class ClusterStepTraceTimeDb: + def __init__(self, + step=None, + type=None, + index=None, + computing=None, + communication_not_overlapped=None, + overlapped=None, + communication=None, + free=None, + stage=None, + bubble=None, + communication_not_overlapped_and_exclude_receive=None, + preparing=None, + dp_index=None, + pp_index=None, + tp_index=None): + self._step = step + self._type = type + self._index = index + self._computing = computing + self._communication_not_overlapped = communication_not_overlapped + self._overlapped = overlapped + self._communication = communication + self._free = free + self._stage = stage + self._bubble = bubble + self._communication_not_overlapped_and_exclude_receive = communication_not_overlapped_and_exclude_receive + self._preparing = preparing + self._dp_index = dp_index + self._pp_index = pp_index + self._tp_index = tp_index + + @property + def step(self): + return self._step + + @step.setter + def step(self, value): + self._step = value + + @property + def type(self): + return self._type + + @type.setter + def type(self, value): + self._type = value + + @property + def index(self): + return self._index + + @index.setter + def index(self, value): + self._index = value + + @property + def computing(self): + return self._computing + + @computing.setter + def computing(self, value): + self._computing = value + + @property + def communication_not_overlapped(self): + return self._communication_not_overlapped + + @communication_not_overlapped.setter + def communication_not_overlapped(self, value): + self._communication_not_overlapped = value + + @property + def overlapped(self): + return self._overlapped + + @overlapped.setter + def overlapped(self, value): + self._overlapped = value + + @property + def communication(self): + return self._communication + + @communication.setter + def communication(self, value): + self._communication = value + + @property + def free(self): + return self._free 
+ + @free.setter + def free(self, value): + self._free = value + + @property + def stage(self): + return self._stage + + @stage.setter + def stage(self, value): + self._stage = value + + @property + def bubble(self): + return self._bubble + + @bubble.setter + def bubble(self, value): + self._bubble = value + + @property + def communication_not_overlapped_and_exclude_receive(self): + return self._communication_not_overlapped_and_exclude_receive + + @communication_not_overlapped_and_exclude_receive.setter + def communication_not_overlapped_and_exclude_receive(self, value): + self._communication_not_overlapped_and_exclude_receive = value + + @property + def preparing(self): + return self._preparing + + @preparing.setter + def preparing(self, value): + self._preparing = value + + @property + def dp_index(self): + return self._dp_index + + @dp_index.setter + def dp_index(self, value): + self._dp_index = value + + @property + def pp_index(self): + return self._pp_index + + @pp_index.setter + def pp_index(self, value): + self._pp_index = value + + @property + def tp_index(self): + return self._tp_index + + @tp_index.setter + def tp_index(self, value): + self._tp_index = value \ No newline at end of file diff --git a/profiler/msprof_analyze/test/st/cluster_analyse/test_cluster_analyse_pytorch_db.py b/profiler/msprof_analyze/test/st/cluster_analyse/test_cluster_analyse_pytorch_db.py new file mode 100644 index 0000000000000000000000000000000000000000..bbc07adebfb5fd96da6f61cdd9a24e107b3653c8 --- /dev/null +++ b/profiler/msprof_analyze/test/st/cluster_analyse/test_cluster_analyse_pytorch_db.py @@ -0,0 +1,204 @@ +# Copyright (c) 2024-2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Test cluster analyse pytorch db""" +import os + +from unittest import TestCase + +import pandas as pd + +from msprof_analyze.test.st.utils import execute_cmd, select_count, select_by_query +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.cluster_analyse.cluster_communication_analyzer_bandwidth_db \ + import ClusterCommunicationAnalyzerBandwidthDb +from msprof_analyze.test.st.cluster_analyse.cluster_communication_analyzer_matrix_db \ + import ClusterCommunicationAnalyzerMatrixDb +from msprof_analyze.test.st.cluster_analyse.cluster_communication_analyzer_time_db \ + import ClusterCommunicationAnalyzerTime +from msprof_analyze.test.st.cluster_analyse.cluster_step_trace_time_db import ClusterStepTraceTimeDb + + +class TestClusterAnalysePytorchDb(TestCase): + """ + Test cluster analyse pytorch db + """ + ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", + "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data/") + CLUSTER_PATH = os.path.join(ST_DATA_PATH, "cluster_data_2_db") + db_path = "" + STEP_TRACE_TIME_PATH = os.path.join(ST_DATA_PATH, "cluster_data_2_db", "cluster_analysis_output_text", + "cluster_analysis_output", "cluster_step_trace_time.csv") + COMMUNICATION_MATRIX_PATH = os.path.join(ST_DATA_PATH, "cluster_data_2_db", "cluster_analysis_output_text", + "cluster_analysis_output", "cluster_communication_matrix.json") + COMMUNICATION_PATH = os.path.join(ST_DATA_PATH, "cluster_data_2_db", "cluster_analysis_output_text", + 
"cluster_analysis_output", "cluster_communication.json") + COMMAND_SUCCESS = 0 + + def setup_class(self): + # generate db data + PathManager.make_dir_safety(self.ST_DATA_PATH) + cmd = ["msprof-analyze", "cluster", "-d", self.CLUSTER_PATH, "-m", "all", + "--output_path", self.ST_DATA_PATH, "--force"] + if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.ST_DATA_PATH): + self.fail("pytorch db cluster analyse task failed.") + self.db_path = os.path.join(self.ST_DATA_PATH, "cluster_analysis_output", "cluster_analysis.db") + + def teardown_class(self): + # Delete db Data + PathManager.remove_path_safety(os.path.join(self.ST_DATA_PATH, "cluster_analysis_output")) + + def test_msprof_analyze_text_db_trace_time_compare(self): + """ + Test case to compare the cluster step trace time from text file and database. + """ + df = pd.read_csv(self.STEP_TRACE_TIME_PATH) + query_count = "SELECT count(*) FROM ClusterStepTraceTime" + self.assertEqual(len(df), select_count(self.db_path, query_count), + "Cluster step trace time count wrong.") + query = "SELECT * FROM ClusterStepTraceTime where type= 'rank' and [index] = 7" + db_cluster_step_trace_time = select_by_query(self.db_path, query, ClusterStepTraceTimeDb) + text_cluster_step_trace_time = ClusterStepTraceTimeDb(*df.iloc[0]) + self.assertEqual(text_cluster_step_trace_time.type, db_cluster_step_trace_time.type, + "Cluster step trace time db vs text 'type' property wrong.") + self.assertEqual(text_cluster_step_trace_time.index, db_cluster_step_trace_time.index, + "Cluster step trace time db vs text 'index' property wrong.") + self.assertEqual(round(text_cluster_step_trace_time.computing), round(db_cluster_step_trace_time.computing), + "Cluster step trace time db vs text 'computing' property wrong.") + self.assertEqual(int(text_cluster_step_trace_time.communication_not_overlapped) + 1, + round(db_cluster_step_trace_time.communication_not_overlapped), + "Cluster step trace time db vs text 
'communication_not_overlapped' property wrong.") + self.assertEqual(round(text_cluster_step_trace_time.overlapped), round(db_cluster_step_trace_time.overlapped), + "Cluster step trace time db vs text 'overlapped' property wrong.") + self.assertEqual(round(text_cluster_step_trace_time.communication), + round(db_cluster_step_trace_time.communication), + "Cluster step trace time db vs text 'communication' property wrong.") + self.assertEqual(round(text_cluster_step_trace_time.free), round(db_cluster_step_trace_time.free), + "Cluster step trace time db vs text 'free' property wrong.") + self.assertEqual(round(text_cluster_step_trace_time.stage), round(db_cluster_step_trace_time.stage), + "Cluster step trace time db vs text 'stage' property wrong.") + self.assertEqual(round(text_cluster_step_trace_time.bubble), round(db_cluster_step_trace_time.bubble), + "Cluster step trace time db vs text 'bubble' property wrong.") + self.assertEqual(int(text_cluster_step_trace_time.communication_not_overlapped_and_exclude_receive) + 1, + round(db_cluster_step_trace_time.communication_not_overlapped_and_exclude_receive), + "Cluster step trace time db vs text 'communication_not_overlapped_and_exclude_receive' " + "property wrong.") + + def test_msprof_analyze_text_db_communication_analyzer_matrix_compare(self): + """ + Test case to compare the cluster communication matrix from text file and database. 
+ """ + query = ("SELECT * FROM ClusterCommAnalyzerMatrix WHERE hccl_op_name = 'Total Op Info' and src_rank = 7 " + "and group_name = '15244899533746605158' and dst_rank = 4 and step = 'step' and " + "rank_set = '(4, 5, 6, 7)'") + db_cluster_communication_analyzer_matrix = select_by_query(self.db_path, query, + ClusterCommunicationAnalyzerMatrixDb) + query_count = ("SELECT count(*) FROM ClusterCommAnalyzerMatrix WHERE hccl_op_name = 'Total Op Info' and " + "group_name = '15244899533746605158'") + communication_matrix_json = FileManager.read_json_file(self.COMMUNICATION_MATRIX_PATH) + self.assertEqual(select_count(self.db_path, query_count), + len(communication_matrix_json.get('(4, 5, 6, 7)') + .get('step').get('Total Op Info')), + "Cluster communication matrix db vs text count wrong.") + text_cluster_communication_matrix = (communication_matrix_json.get('(4, 5, 6, 7)').get('step') + .get('Total Op Info').get('7-4')) + self.assertEqual(text_cluster_communication_matrix.get('Transport Type'), + db_cluster_communication_analyzer_matrix.transport_type, + "Cluster communication matrix db vs text 'Transport Type' property wrong.") + self.assertEqual(round(text_cluster_communication_matrix.get('Transit Time(ms)')), + round(db_cluster_communication_analyzer_matrix.transit_time), + "Cluster communication matrix db vs text 'Transit Time' property wrong.") + self.assertEqual(round(text_cluster_communication_matrix.get('Transit Size(MB)')), + round(db_cluster_communication_analyzer_matrix.transit_size), + "Cluster communication matrix db vs text 'Transit Size' property wrong.") + self.assertEqual(round(text_cluster_communication_matrix.get('Bandwidth(GB/s)')), + round(db_cluster_communication_analyzer_matrix.bandwidth), + "Cluster communication matrix db vs text 'Bandwidth' property wrong.") + + def test_msprof_analyze_text_db_communication_analyzer_bandWidth_compare(self): + """ + Test case to compare the cluster bandWidth from text file and database. 
+ """ + query = ("SELECT * FROM ClusterCommAnalyzerBandwidth WHERE hccl_op_name = 'Total Op Info' and rank_id = 7 " + "and step = 'step' and band_type = 'HCCS' and package_size = '3.372891' and rank_set = '(4, 5, 6, 7)'") + db_cluster_communication_analyzer_band_width = select_by_query(self.db_path, query, + ClusterCommunicationAnalyzerBandwidthDb) + query_count = ("SELECT count(*) FROM ClusterCommAnalyzerBandwidth WHERE hccl_op_name = 'Total Op Info' and " + "rank_set = '(4, 5, 6, 7)' and rank_id = 7 and band_type = 'HCCS'") + communication_json = FileManager.read_json_file(self.COMMUNICATION_PATH) + self.assertEqual(select_count(self.db_path, query_count), + len(communication_json.get('(4, 5, 6, 7)') + .get('step').get('Total Op Info').get('7').get('Communication Bandwidth Info') + .get('HCCS').get('Size Distribution')), + "Cluster communication bandWidth db vs text count wrong.") + text_cluster_communication_band_width = (communication_json.get('(4, 5, 6, 7)').get('step') + .get('Total Op Info').get('7').get('Communication Bandwidth Info') + .get('HCCS')) + self.assertEqual(round(text_cluster_communication_band_width.get('Transit Time(ms)')), + round(db_cluster_communication_analyzer_band_width.transit_time), + "Cluster communication bandWidth db vs text 'Transport Time' property wrong.") + self.assertEqual(round(text_cluster_communication_band_width.get('Transit Size(MB)')), + round(db_cluster_communication_analyzer_band_width.transit_size), + "Cluster communication bandWidth db vs text 'Transit Size(MB)' property wrong.") + self.assertEqual(round(text_cluster_communication_band_width.get('Bandwidth(GB/s)')), + round(db_cluster_communication_analyzer_band_width.bandwidth), + "Cluster communication bandWidth db vs text 'Bandwidth(GB/s)' property wrong.") + self.assertEqual(round(text_cluster_communication_band_width.get('Size Distribution').get('3.372891')[0]), + round(db_cluster_communication_analyzer_band_width.count), + "Cluster communication bandWidth db vs 
text 'count' property wrong.") + total_duration = text_cluster_communication_band_width.get('Size Distribution').get('3.372891')[1] + self.assertEqual(f"{round(total_duration, 2):.2f}", + f"{db_cluster_communication_analyzer_band_width.total_duration:.2f}", + "Cluster communication bandWidth db vs text 'total duration' property wrong.") + + def test_msprof_analyze_text_db_communication_analyzer_time_compare(self): + """ + Test case to compare the cluster time from text file and database. + """ + query = ("SELECT * FROM ClusterCommAnalyzerTime WHERE hccl_op_name = 'Total Op Info' and rank_id = 0 " + "and step = 'step' and rank_set = '(0, 1, 2, 3)'") + db_cluster_communication_analyzer_time = select_by_query(self.db_path, query, + ClusterCommunicationAnalyzerTime) + query_count = ("SELECT count(*) FROM ClusterCommAnalyzerTime WHERE hccl_op_name = 'Total Op Info' and " + "rank_set = '(0, 1, 2, 3)'") + communication_json = FileManager.read_json_file(self.COMMUNICATION_PATH) + self.assertEqual(select_count(self.db_path, query_count), + len(communication_json.get('(0, 1, 2, 3)') + .get('step').get('Total Op Info')), + "Cluster communication time db vs text count wrong.") + text_cluster_communication_analyzer_time = (communication_json.get('(0, 1, 2, 3)').get('step') + .get('Total Op Info').get('0').get('Communication Time Info')) + self.assertEqual(round(text_cluster_communication_analyzer_time.get('Elapse Time(ms)')), + round(db_cluster_communication_analyzer_time.elapsed_time), + "Cluster communication time db vs text 'Elapse Time(ms)' property wrong.") + self.assertEqual(round(text_cluster_communication_analyzer_time.get('Transit Time(ms)')), + round(db_cluster_communication_analyzer_time.transit_time), + "Cluster communication time db vs text 'Transit Time(ms)' property wrong.") + self.assertEqual(round(text_cluster_communication_analyzer_time.get('Wait Time(ms)')), + round(db_cluster_communication_analyzer_time.wait_time), + "Cluster communication time db vs text 
'Wait Time(ms)' property wrong.") + self.assertEqual(round(text_cluster_communication_analyzer_time.get('Synchronization Time(ms)')), + round(db_cluster_communication_analyzer_time.synchronization_time), + "Cluster communication time db vs text 'Synchronization Time(ms)' property wrong.") + self.assertEqual(round(text_cluster_communication_analyzer_time.get('Idle Time(ms)')), + round(db_cluster_communication_analyzer_time.idle_time), + "Cluster communication time db vs text 'Idle Time(ms)' property wrong.") + self.assertEqual(round(text_cluster_communication_analyzer_time.get('Wait Time Ratio')), + round(db_cluster_communication_analyzer_time.wait_time_ratio), + "Cluster communication time db vs text 'Wait Time Ratio' property wrong.") + self.assertEqual(round(text_cluster_communication_analyzer_time.get('Synchronization Time Ratio')), + round(db_cluster_communication_analyzer_time.synchronization_time_ratio), + "Cluster communication time db vs text 'Synchronization Time Ratio' property wrong.") + diff --git a/profiler/test/st/cluster_analyse/test_cluster_analyse_pytorch_text.py b/profiler/msprof_analyze/test/st/cluster_analyse/test_cluster_analyse_pytorch_text.py similarity index 84% rename from profiler/test/st/cluster_analyse/test_cluster_analyse_pytorch_text.py rename to profiler/msprof_analyze/test/st/cluster_analyse/test_cluster_analyse_pytorch_text.py index 2003706467ca4122ad95f509fa0f5c08e15d7705..d6f8d470109bf21a33bfde683bafb0f41660dcb0 100644 --- a/profiler/test/st/cluster_analyse/test_cluster_analyse_pytorch_text.py +++ b/profiler/msprof_analyze/test/st/cluster_analyse/test_cluster_analyse_pytorch_text.py @@ -1,10 +1,24 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os import json import logging import subprocess import pandas as pd from unittest import TestCase -from profiler.prof_common.path_manager import PathManager +from msprof_analyze.prof_common.path_manager import PathManager class TestClusterAnalyseCmdPytorchText(TestCase): @@ -91,16 +105,20 @@ class TestClusterAnalyseCmdPytorchText(TestCase): def run_cmd(self, mode): cmd = ["msprof-analyze", "cluster", "-d", self.CLUSTER_PATH, "-m", mode, - "--output_path", self.OUTPUT_PATH] - completed_process = subprocess.run(cmd, capture_output=True, shell=False) + "--output_path", self.OUTPUT_PATH, "--force"] + completed_process = subprocess.run(cmd, shell=False, stderr=subprocess.PIPE) + if completed_process.returncode != self.COMMAND_SUCCESS: + logging.error(completed_process.stderr.decode()) if (completed_process.returncode != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_DATA)): self.assertEqual(False, True, msg="pytorch text cluster analyse task failed.") def run_py3(self, mode): cmd = ["python3", self.CLUSTER_ANALYSE, "-d", self.CLUSTER_PATH, "-m", mode, - "--output_path", self.OUTPUT_PATH] - completed_process = subprocess.run(cmd, capture_output=True, shell=False) + "--output_path", self.OUTPUT_PATH, "--force"] + completed_process = subprocess.run(cmd, shell=False, stderr=subprocess.PIPE) + if completed_process.returncode != self.COMMAND_SUCCESS: + logging.error(completed_process.stderr.decode()) if (completed_process.returncode != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_DATA)): self.assertEqual(False, True, msg="pytorch text cluster 
analyse task failed.") diff --git a/profiler/msprof_analyze/test/st/compare_tools/__init__.py b/profiler/msprof_analyze/test/st/compare_tools/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_api_compare.py b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_api_compare.py similarity index 72% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_api_compare.py rename to profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_api_compare.py index 9212f82e2bc4784bfba0ef1d2bc7257b1d7f4b19..ad388ee6d4b1e0e0ec61166c5890f9c5125ae7df 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_api_compare.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_api_compare.py @@ -1,10 +1,24 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpuEnableApiCompare(TestCase): @@ -22,7 +36,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableApiCompare(TestCase): def setup_class(self): PathManager.make_dir_safety(self.OUTPUT_PATH) cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_api_compare", "-o", self.OUTPUT_PATH] + "--enable_api_compare", "-o", self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="enable api compare comparison task failed.") if not check_result_file(self.OUTPUT_PATH): diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_communication_compare.py b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_communication_compare.py similarity index 79% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_communication_compare.py rename to profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_communication_compare.py index 8804a0ccaf05c124259597fc234794b5f011a943..18d8a4b35f412bf2a5d81d16f846374295ae5c1e 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_communication_compare.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_communication_compare.py @@ -1,11 +1,25 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os from typing import List from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpuEnableCommunicationCompare(TestCase): @@ -23,7 +37,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableCommunicationCompare(TestCase): def setup_class(self): PathManager.make_dir_safety(self.OUTPUT_PATH) cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_communication_compare", "-o", self.OUTPUT_PATH] + "--enable_communication_compare", "-o", self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="enable communication compare comparison task failed.") if not check_result_file(self.OUTPUT_PATH): diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_kernel_compare.py b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_kernel_compare.py similarity index 73% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_kernel_compare.py rename to 
profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_kernel_compare.py index 0aa996a5c2575ca992a02f96bc970712eb6c3992..5082ee07d9df3f50110e8ed6063c60df0c6fb0ab 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_kernel_compare.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_kernel_compare.py @@ -1,10 +1,24 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpuEnableKernelCompare(TestCase): @@ -22,7 +36,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableKernelCompare(TestCase): def setup_class(self): PathManager.make_dir_safety(self.OUTPUT_PATH) cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_kernel_compare", "-o", self.OUTPUT_PATH] + "--enable_kernel_compare", "-o", self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="enable kernel compare comparison task failed.") if not check_result_file(self.OUTPUT_PATH): @@ -37,5 +51,5 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableKernelCompare(TestCase): "Max Duration(us)", "Min Duration(us)", "Calls", "Total Duration(us).1", "Avg Duration(us).1", "Max Duration(us).1", "Min Duration(us).1", "Calls.1", "Diff Total Ratio", "Diff Avg Ratio"] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="KernelCompare", header=2) - self.assertEqual(len(df), 703, msg="pytorch npu vs npu kernel compare results quantity is wrong") + self.assertEqual(len(df), 709, msg="pytorch npu vs npu kernel compare results quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu kernel compare results headers is wrong") diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_memory_compare.py b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_memory_compare.py similarity index 67% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_memory_compare.py rename to 
profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_memory_compare.py index 1d93e77cdc46844365224d6e0f5dcacb71947b43..6535af01e52c1996ea0857767711854f731d11b5 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_memory_compare.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_memory_compare.py @@ -1,18 +1,33 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpuEnableMemoryCompare(TestCase): ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") - BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", "n122-122-067_12380_20240912033946038_ascend_pt") - COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", - "n122-122-067_12380_20240912033946038_ascend_pt") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1333345_20241105122131111_ascend_pt") + COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1632305_20241105124759292_ascend_pt") OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "CompareToolsCmdPytorchNpuVsNpuEnableMemoryCompare") RESULT_EXCEL = "" @@ -22,7 +37,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableMemoryCompare(TestCase): def setup_class(self): PathManager.make_dir_safety(self.OUTPUT_PATH) cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_memory_compare", "--disable_details", "-o", self.OUTPUT_PATH] + "--enable_memory_compare", "--disable_details", "-o", self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="enable memory compare comparison task failed.") if not check_result_file(self.OUTPUT_PATH): @@ -37,7 +52,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableMemoryCompare(TestCase): "Base Operator Number", "Comparison Allocated Duration(ms)", "Comparison Allocated Memory(MB)", "Comparison Operator Number", 
"Diff Memory(MB)", "Diff Ratio"] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="MemoryCompareStatistic", header=2) - self.assertEqual(len(df), 141, msg="pytorch npu vs npu memory compare results 'MemoryCompareStatistic'" + self.assertEqual(len(df), 139, msg="pytorch npu vs npu memory compare results 'MemoryCompareStatistic'" "quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu memory compare results 'MemoryCompareStatistic'" diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_operator_compare.py b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_operator_compare.py similarity index 83% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_operator_compare.py rename to profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_operator_compare.py index 6c2e9392e46cb17747e6c156bdf7e3dfab895d24..19211e9ae70bc3416aa6ced71d016374ca7a7775 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_operator_compare.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_operator_compare.py @@ -1,19 +1,34 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") - BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", "n122-122-067_12380_20240912033946038_ascend_pt") - COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", - "n122-122-067_12380_20240912033946038_ascend_pt") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1333345_20241105122131111_ascend_pt") + COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1632305_20241105124759292_ascend_pt") OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "CompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare") RESULT_EXCEL = "" @@ -36,7 +51,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): PathManager.make_dir_safety(self.ONM_OUTPUT_PATH) # 1. no params compare cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_operator_compare", "--disable_details", "-o", self.OUTPUT_PATH] + "--enable_operator_compare", "--disable_details", "-o", self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="enable operator compare comparison task failed.") if not check_result_file(self.OUTPUT_PATH): @@ -45,7 +60,8 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): # 2. 
use_input_shape compare is_cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_operator_compare", "--disable_details", "--use_input_shape", "-o", self.IS_OUTPUT_PATH] + "--enable_operator_compare", "--disable_details", "--use_input_shape", "-o", self.IS_OUTPUT_PATH, + "--force"] if execute_cmd(is_cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.IS_OUTPUT_PATH): self.assertEqual(False, True, msg="enable use input shape operator compare comparison task failed.") if not check_result_file(self.IS_OUTPUT_PATH): @@ -55,7 +71,8 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): # 3. max_kernel_num compare mkn_cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_operator_compare", "--disable_details", "--max_kernel_num=10", "-o", self.MKN_OUTPUT_PATH] + "--enable_operator_compare", "--disable_details", "--max_kernel_num=10", "-o", self.MKN_OUTPUT_PATH, + "--force"] if execute_cmd(mkn_cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.MKN_OUTPUT_PATH): self.assertEqual(False, True, msg="enable max kernel num operator compare comparison task failed.") if not check_result_file(self.MKN_OUTPUT_PATH): @@ -66,7 +83,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): # 4. 
op_name_map compare onm_cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, "--enable_operator_compare", "--disable_details", "--op_name_map={'aten':'to','aten':'item'}", "-o", - self.ONM_OUTPUT_PATH] + self.ONM_OUTPUT_PATH, "--force"] if execute_cmd(onm_cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.ONM_OUTPUT_PATH): self.assertEqual(False, True, msg="enable op name map operator compare comparison task failed.") if not check_result_file(self.ONM_OUTPUT_PATH): @@ -84,7 +101,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): headers = ["Top", "Operator Name", "Base Device Duration(ms)", "Base Operator Number", "Comparison Device Duration(ms)", "Comparison Operator Number", "Diff Duration(ms)", "Diff Ratio"] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="OperatorCompareStatistic", header=2) - self.assertEqual(len(df), 141, msg="pytorch npu vs npu operator compare results 'OperatorCompareStatistic'" + self.assertEqual(len(df), 139, msg="pytorch npu vs npu operator compare results 'OperatorCompareStatistic'" "quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu operator compare results " "'OperatorCompareStatistic' headers is wrong") @@ -93,7 +110,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): headers = ["Top", "Operator Name", "Base Device Duration(ms)", "Base Operator Number", "Comparison Device Duration(ms)", "Comparison Operator Number", "Diff Duration(ms)", "Diff Ratio"] df = pd.read_excel(self.IS_RESULT_EXCEL, sheet_name="OperatorCompareStatistic", header=2) - self.assertEqual(len(df), 141, msg="pytorch npu vs npu use input shape operator compare results " + self.assertEqual(len(df), 139, msg="pytorch npu vs npu use input shape operator compare results " "'OperatorCompareStatistic' quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu use input shape operator compare 
" "results 'OperatorCompareStatistic' headers is wrong") @@ -102,7 +119,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): headers = ["Top", "Operator Name", "Base Device Duration(ms)", "Base Operator Number", "Comparison Device Duration(ms)", "Comparison Operator Number", "Diff Duration(ms)", "Diff Ratio"] df = pd.read_excel(self.MKN_RESULT_EXCEL, sheet_name="OperatorCompareStatistic", header=2) - self.assertEqual(len(df), 141, msg="pytorch npu vs npu use input shape operator compare results " + self.assertEqual(len(df), 139, msg="pytorch npu vs npu use input shape operator compare results " "'OperatorCompareStatistic' quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu use input shape operator compare " "results 'OperatorCompareStatistic' headers is wrong") @@ -111,7 +128,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableOperatorCompare(TestCase): headers = ["Top", "Operator Name", "Base Device Duration(ms)", "Base Operator Number", "Comparison Device Duration(ms)", "Comparison Operator Number", "Diff Duration(ms)", "Diff Ratio"] df = pd.read_excel(self.ONM_RESULT_EXCEL, sheet_name="OperatorCompareStatistic", header=2) - self.assertEqual(len(df), 141, msg="pytorch npu vs npu use input shape operator compare results " + self.assertEqual(len(df), 139, msg="pytorch npu vs npu use input shape operator compare results " "'OperatorCompareStatistic' quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu use input shape operator compare " "results 'OperatorCompareStatistic' headers is wrong") diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_profiling.py b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_profiling.py similarity index 58% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_profiling.py rename to 
profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_profiling.py index 97e5f683bda3b02eeab8c332198237fdb57c2421..07fd66d517ab0784c289453f9e6aca44be27026b 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_profiling.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_npu_enable_profiling.py @@ -1,19 +1,34 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os from typing import List from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpuEnableProfiling(TestCase): ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") - BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", "n122-122-067_12380_20240912033946038_ascend_pt") - COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", - "n122-122-067_12380_20240912033946038_ascend_pt") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1333345_20241105122131111_ascend_pt") + COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1632305_20241105124759292_ascend_pt") OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "CompareToolsCmdPytorchNpuVsNpuEnableProfiling") RESULT_EXCEL = "" @@ -22,7 +37,7 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableProfiling(TestCase): def setup_class(self): PathManager.make_dir_safety(self.OUTPUT_PATH) cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--enable_profiling_compare", "-o", self.OUTPUT_PATH] + "--enable_profiling_compare", "-o", self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="enable profiling comparison task failed.") if not check_result_file(self.OUTPUT_PATH): @@ -34,14 +49,13 @@ class TestCompareToolsCmdPytorchNpuVsNpuEnableProfiling(TestCase): def test_overall_metrics(self): duration_exp: List[float] = [ - 14474.86, 1194.01, 1194.01, 10442.62, 10402.34, 40.28, 2821.57, 444.33, 2377.24, 16.47, - 0.18, 
23922.06, 4604.98, 9.35, 4595.63, 128.38, 127.35, 1.03, 93.30, 1.27, 92.03, 146.49, 0.10, - 146.39, 23310.81, 23310.81, 170.82, 0.20, 170.62, 373.72, 373.72, 38770.64 + 1725.15, 11.98, 11.98, 31.78, 31.78, 756.62, 705.49, 51.13, 879.83, 66.23, 813.60, 32.90, 12.04, 520.82, + 307.11, 13.12, 293.99, 0.00, 0.00, 207.81, 0.01, 207.80, 5.87, 0.01, 5.86, 0.03, 0.03, 2897.92, 2897.92, + 5143.89 ] - diff_exp: List[float] = [0.37, 0.03, 0.03, 0.27, 0.27, 0.00, 0.07, 0.01, 0.06, 0.00, - 0.00, 0.62, 0.12, 0.00, 0.12, 0.00, 0.00, 0.00, 0.00, 0.00, - 0.00, 0.00, 0.00, 0.00, 0.60, 0.60, 0.00, 0.00, 0.00, 0.01, - 0.01, 1.00] + diff_exp: List[float] = [ + 0.34, 0.00, 0.00, 0.01, 0.01, 0.15, 0.14, 0.01, 0.17, 0.01, 0.16, 0.01, 0.00, 0.10, 0.06, 0.00, 0.06, 0.00, + 0.00, 0.04, 0.00, 0.04, 0.00, 0.00, 0.00, 0.00, 0.00, 0.56, 0.56, 1.00] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="OverallMetrics", header=2) for index, row in df.iterrows(): self.assertEqual(duration_exp[index], round(row["Duration(ms)"], 2), diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu.py b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu.py similarity index 79% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu.py rename to profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu.py index b1ecb0421a32147bf60884fadc2d6799a955fc75..67be541ccc11d1e5780c4295fe7df15f22d47a13 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu.py @@ -1,10 +1,24 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): @@ -20,7 +34,7 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): def setup_class(self): PathManager.make_dir_safety(self.OUTPUT_PATH) cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, "-o", - self.OUTPUT_PATH] + self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="comparison task failed.") if not check_result_file(self.OUTPUT_PATH): @@ -55,5 +69,5 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): "Min Duration(us)", "Calls", "Total Duration(us).1", "Avg Duration(us).1", "Max Duration(us).1", "Min Duration(us).1", "Calls.1", "Diff Total Ratio", "Diff Avg Ratio"] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="KernelCompare", header=2) - self.assertEqual(len(df), 704, msg="pytorch npu vs npu compare results quantity is wrong") + self.assertEqual(len(df), 710, msg="pytorch npu vs npu compare results quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu compare results headers is wrong") diff --git a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu_step.py 
b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu_step.py similarity index 74% rename from profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu_step.py rename to profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu_step.py index b6830c32958a168221234a638ad086cd957c2681..a6c9efb2ddc30fafba6433d05120ed80d51db110 100644 --- a/profiler/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu_step.py +++ b/profiler/msprof_analyze/test/st/compare_tools/test_compare_tools_cmd_pytorch_npu_vs_npu_step.py @@ -1,19 +1,34 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os from typing import List from unittest import TestCase import pandas as pd -from profiler.prof_common.path_manager import PathManager -from .utils import execute_cmd, check_result_file +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.utils import execute_cmd, check_result_file class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") - BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", "n122-122-067_12380_20240912033946038_ascend_pt") - COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", - "n122-122-067_12380_20240912033946038_ascend_pt") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1333345_20241105122131111_ascend_pt") + COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_4", + "n122-197-168_1632305_20241105124759292_ascend_pt") OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "CompareToolsCmdPytorchNpuVsNpuStep") RESULT_EXCEL = "" COMMAND_SUCCESS = 0 @@ -21,7 +36,7 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): def setup_class(self): PathManager.make_dir_safety(self.OUTPUT_PATH) cmd = ["msprof-analyze", "compare", "-d", self.COMPARISON_PROFILING_PATH, "-bp", self.BASE_PROFILING_PATH, - "--base_step=5", "--comparison_step=5", "--disable_details", "-o", self.OUTPUT_PATH] + "--base_step=5", "--comparison_step=5", "--disable_details", "-o", self.OUTPUT_PATH, "--force"] if execute_cmd(cmd) != self.COMMAND_SUCCESS or not os.path.exists(self.OUTPUT_PATH): self.assertEqual(False, True, msg="step comparison task failed.") if not check_result_file(self.OUTPUT_PATH): @@ -33,9 +48,9 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): def test_overall_metrics(self): duration_exp = [ - 14474.86, 1194.01, 1194.01, 10442.62, 10402.34, 40.28, 2821.57, 444.33, 2377.24, 16.47, 0.18, 
23922.06, - 4604.98, 9.35, 4595.63, 128.38, 127.35, 1.03, 93.30, 1.27, 92.03, 146.49, 0.10, 146.39, 23310.81, 23310.81, - 170.82, 0.20, 170.62, 373.72, 373.72, 38770.64 + 1725.15, 11.98, 11.98, 31.78, 31.78, 756.62, 705.49, 51.13, 879.83, 66.23, 813.60, 32.90, 12.04, 520.82, + 307.11, 13.12, 293.99, 0.00, 0.00, 207.81, 0.01, 207.80, 5.87, 0.01, 5.86, 0.03, 0.03, 2897.92, 2897.92, + 5143.89 ] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="OverallMetrics", header=2) for index, row in df.iterrows(): @@ -49,14 +64,13 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): "Min Duration(us)", "Calls", "Total Duration(us).1", "Avg Duration(us).1", "Max Duration(us).1", "Min Duration(us).1", "Calls.1", "Diff Total Ratio", "Diff Avg Ratio"] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="KernelCompare", header=2) - self.assertEqual(len(df), 703, msg="pytorch npu vs npu step compare results quantity is wrong") + self.assertEqual(len(df), 498, msg="pytorch npu vs npu step compare results quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu step compare results headers is wrong") def test_communication_compare(self): total_duration: List[float] = [ - 9354.86, 1046.68, 9191.52, 24.74, 27743477.86, 12418099.90, 23832.90, 28928712.46, 18411.66, 2939703.28, - 327934.12, 17074.96, 77.58, 931489.92, 2894.42, 75.46, 80.86, 15119087.00, 3594561.44, 12963.36, - 12692002.20, 6180907.46, 9859.70 + 351054.85, 7.22, 400355.22, 80.52, 590652.54, 8.96, 25518.15, 67.62, 49.08, 389.39, 41357.87, 15.68, + 80.18, 144247.88, 867518.01, 4973.28, 336.97, 91039.10, 30.74 ] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="CommunicationCompare", header=2) for index, row in df.iterrows(): @@ -68,7 +82,7 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): headers = ["Top", "Operator Name", "Base Device Duration(ms)", "Base Operator Number", "Comparison Device Duration(ms)", "Comparison Operator Number", "Diff Duration(ms)", "Diff Ratio"] df = 
pd.read_excel(self.RESULT_EXCEL, sheet_name="OperatorCompareStatistic", header=2) - self.assertEqual(len(df), 141, msg="pytorch npu vs npu operator compare results 'OperatorCompareStatistic'" + self.assertEqual(len(df), 139, msg="pytorch npu vs npu operator compare results 'OperatorCompareStatistic'" "quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu operator compare results 'OperatorCompareStatistic'" @@ -79,7 +93,7 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): "Base Operator Number", "Comparison Allocated Duration(ms)", "Comparison Allocated Memory(MB)", "Comparison Operator Number", "Diff Memory(MB)", "Diff Ratio"] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="MemoryCompareStatistic", header=2) - self.assertEqual(len(df), 141, msg="pytorch npu vs npu memory compare results 'MemoryCompareStatistic'" + self.assertEqual(len(df), 139, msg="pytorch npu vs npu memory compare results 'MemoryCompareStatistic'" "quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu memory compare results " "'MemoryCompareStatistic' headers is wrong") @@ -89,5 +103,5 @@ class TestCompareToolsCmdPytorchNpuVsNpu(TestCase): "Calls", "Total Duration(ms).1", "Self Time(ms).1", "Avg Duration(ms).1", "Calls.1", "Diff Total Ratio", "Diff Self Ratio", "Diff Avg Ratio", "Diff Calls Ratio"] df = pd.read_excel(self.RESULT_EXCEL, sheet_name="ApiCompare", header=2) - self.assertEqual(len(df), 311, msg="pytorch npu vs npu api compare results quantity is wrong") + self.assertEqual(len(df), 310, msg="pytorch npu vs npu api compare results quantity is wrong") self.assertEqual(headers, df.columns.tolist(), msg="pytorch npu vs npu api compare results headers is wrong") diff --git a/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_cluster_ascend_pt_compare.py b/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_cluster_ascend_pt_compare.py new file mode 100644 index 
0000000000000000000000000000000000000000..0a5eb062b10c6eca56084762df1293ce872bc927 --- /dev/null +++ b/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_cluster_ascend_pt_compare.py @@ -0,0 +1,397 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import logging +from unittest import TestCase + +import math +import pandas as pd +from bs4 import BeautifulSoup + +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.advisor.utils import get_files, execute_cmd + + +class TestAdvisorCmdClusterAscendPtCompare(TestCase): + ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", + "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_2") + COMPARISON_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_1") + OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "TestAdvisorCmdClusterAscendPtCompare") + ALL_OUTPUT_PATH = os.path.join(OUTPUT_PATH,"all") + RESULT_EXCEL = {} + RESULT_HTML = {} + COMMAND_SUCCESS = 0 + + def setup_class(self): + PathManager.make_dir_safety(self.ALL_OUTPUT_PATH) + cmd_all = ["msprof-analyze", "advisor", "all" ,"-d", self.BASE_PROFILING_PATH, "-bp", + self.COMPARISON_PROFILING_PATH, "-o", self.ALL_OUTPUT_PATH, "-l", "en", "--force"] + if execute_cmd(cmd_all) != self.COMMAND_SUCCESS or not 
os.path.exists(self.ALL_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [all] [bp] task failed.") + self.RESULT_HTML,self.RESULT_EXCEL = get_files(self.OUTPUT_PATH) + + def teardown_class(self): + PathManager.remove_path_safety(self.OUTPUT_PATH) + + def test_all_problems(self): + category = [ + "slow rank", + "slow link", + "byte alignment analysis", + "Rank 5 dynamic shape operator", + "Rank 5 aicpu operator", + "Operator dispatch" + ] + + + # True presents the attr is nan + description_len = [1, 11, 1, 1, 2, 1] + suggestion_len = [True, True, 1, 5, 2, 1] + problem_count = [True, True, True, 1.0, 2.0, True] + total_time = [True, True, True, True, 87845894.04, True] + time_ratio = [True, True, True, True, 0.0, True] + income = [True, True, True, True, True, True] + income_ratio = [True, True, True, True, True, True] + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all",None), sheet_name='problems', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("all",None)) + return + + for index, row in df.iterrows(): + self.assertEqual(category[index], row["category"]) + self.assertEqual(description_len[index], len(row["description"].split("\n"))) + self.assertEqual(suggestion_len[index], (isinstance(row["suggestion"],float) or + len(row["suggestion"].split("\n")))) + self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or row["problem count"])) + self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or + round(row["total_time(us)"], 2))) + self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2))) + self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2))) + self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or + round(row["income ratio"], 2))) + + def test_slow_rank(self): + step = [-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1] + rank_id = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] + 
compute_us = [ + 14302466.71,14308948.28,14311412.46, + 14242056.74,14972627.53,14974042.28, + 14964095.87,14945901.57,13878006.78, + 13826069.33,13853184.69,13910409.81, + 13612993.03,13669912.48,13779569.04, + 13826274.64 + ] + communication_us = [ + 50636595.62,50670520.26,50698886.74, + 50741670.92,50257498.54,50286645.51, + 50294747.07,50289541.49,51211928.02, + 51161276.14,51187346.34,51169195.18, + 51544052.84,51556067.16,51374012.81, + 51425588.65 + ] + free_us = [ + 682939.022,634478.74,609248.76, + 645123.76,396550.744,377863.438, + 363100.664,377397.078,537692.3, + 568293.626,525516.858,549405.358, + 458639.564,400809.38,396422.698, + 367782.33 + ] + + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all",None), sheet_name='slow rank', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("all",None)) + return + + for index, row in df.iterrows(): + self.assertEqual(step[index], row["step"]) + self.assertEqual(rank_id[index], row["rank_id"]) + self.assertEqual(compute_us[index], round(row["compute(us)"],2)) + self.assertEqual(communication_us[index], round(row["communication(us)"], 2)) + self.assertEqual(free_us[index], round(row["free(us)"],3)) + + soup = BeautifulSoup(open(self.RESULT_HTML.get("all",None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "slow rank": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(str(step[row_index - 1]), row.find_all('td')[0].text) + self.assertEqual(str(rank_id[row_index - 1]), row.find_all('td')[1].text) + self.assertEqual(str(compute_us[row_index - 1]), row.find_all('td')[2].text) + self.assertEqual(str(communication_us[row_index - 1]), row.find_all('td')[3].text) + self.assertEqual(str(round(free_us[row_index - 1],2)), row.find_all('td')[4].text) + + def test_slow_link(self): + step = 
[-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1] + rank_id = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] + rdma_bandwidth = [ + 4.7729, 4.8102, 4.7966, + 4.8765, 4.7619, 4.7982, + 4.781, 4.8631, 4.7027, + 4.7044, 4.6912, 4.7522, + 4.7046, 4.7058, 4.6952, + 4.7541 + ] + rdma_size = [ + 892743.4811,892743.4811,892743.4811, + 892743.4811,892743.4811,892743.4811, + 892743.4811,892743.4811,892743.4811, + 892743.4811,892743.4811,892743.4811, + 892743.4811,892743.4811,892743.4811, + 892743.4811 + ] + rdma_time = [ + 187043.9406, 185595.2189, 186121.0769, + 183071.016, 187477.765, 186059.6148, + 186726.3076, 183574.5527, 189837.9228, + 189769.2996, 190299.7641, 187859.376, + 189760.073, 189710.3648, 190138.7748, + 187784.9608 + ] + sdma_bandwidth = [ + 17.6709, 17.6923, 17.4552, + 17.3868, 18.2276, 18.305, + 18.2818, 18.3001, 18.3065, + 18.321, 18.2769, 18.3395, + 17.4277, 17.3756, 17.7028, + 17.6966 + ] + sdma_size = [ + 1975312.412, 1975321.29, 1975321.29, + 1975321.314, 1975227.481, 1975231.641, + 1975231.641, 1975231.665, 1975319.579, + 1975328.855, 1975328.855, 1975328.879, + 1975354.942, 1975366.182, 1975366.182, + 1975366.207 + ] + sdma_time = [ + 111783.3614, 111648.6661, 113165.2345, + 113610.0823, 108364.7268, 107906.9403, + 108043.8853, 107935.3805, 107902.3439, + 107817.7491, 108077.6098, 107708.7801, + 113345.8656, 113686.0851, 111585.2068, + 111624.1395 + ] + + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all",None), sheet_name='slow link', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("all",None)) + return + + for index, row in df.iterrows(): + self.assertEqual(step[index], row["step"]) + self.assertEqual(rank_id[index], row["rank_id"]) + self.assertEqual(rdma_bandwidth[index], round(row["RDMA bandwidth(GB/s)"], 4)) + self.assertEqual(rdma_size[index], round(row["RDMA size(mb)"], 4)) + self.assertEqual(rdma_time[index], round(row["RDMA time(ms)"], 4)) + self.assertEqual(sdma_bandwidth[index], 
round(row["SDMA bandwidth(GB/s)"], 4)) + self.assertEqual(sdma_size[index], round(row["SDMA size(mb)"], 3)) + self.assertEqual(sdma_time[index], round(row["SDMA time(ms)"], 4)) + + soup = BeautifulSoup(open(self.RESULT_HTML.get("all",None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "slow link": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(str(step[row_index - 1]), row.find_all('td')[0].text) + self.assertEqual(str(rank_id[row_index - 1]), row.find_all('td')[1].text) + self.assertEqual(str(round(rdma_bandwidth[row_index - 1],2)), row.find_all('td')[2].text) + self.assertEqual(str(round(rdma_size[row_index - 1],2)), row.find_all('td')[3].text) + self.assertEqual(str(round(rdma_time[row_index - 1],2)), row.find_all('td')[4].text) + self.assertEqual(str(round(sdma_bandwidth[row_index - 1],2)), row.find_all('td')[5].text) + self.assertEqual(str(round(sdma_size[row_index - 1],2)), row.find_all('td')[6].text) + self.assertEqual(str(round(sdma_time[row_index - 1],2)), row.find_all('td')[7].text) + + def test_Byte_Alignment_Analysis(self): + op_name = [ + "hcom_broadcast__275_2_1", "hcom_allReduce__275_237_1", + "hcom_allReduce__275_238_1", "hcom_allReduce__275_239_1", + "hcom_reduceScatter__063_1_1", "hcom_allGather__063_2_1" + ] + + total_size = [ + 41816518, 262120, + 262120, 262120, + 670986240, 335493120 + ] + + duration = [ + 1656.7, 14.9, + 14.46, 14.58, + 35449.52, 17285 + ] + + abnormal_duration = [ + 1656.7, 14.9, + 14.46, 14.58, + 35449.52, 17285 + ] + + bandwidth = [ + 25.2409, 17.5919, + 18.1272, 17.9781, + 18.9279, 19.4095 + ] + + test_pattern = ["all"] + for pattern in test_pattern: + try: + df = pd.read_excel(self.RESULT_EXCEL.get(pattern,None), sheet_name='Byte Alignment Analysis', header=0) + except FileNotFoundError: + logging.error("File %s not found.", 
self.RESULT_EXCEL.get(pattern,None)) + return + + for index, row in df.iterrows(): + self.assertEqual(op_name[index], row["op name"]) + self.assertEqual(total_size[index], row["total size(Byte)"]) + self.assertEqual(duration[index], row["duration(us)"]) + self.assertEqual(abnormal_duration[index], row["abnormal duration(us)"]) + self.assertEqual(bandwidth[index], row["bandwidth(GB/s)"]) + + soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern,None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "Byte Alignment Analysis": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(str(op_name[row_index - 1]), row.find_all('td')[0].text) + self.assertEqual(str(total_size[row_index - 1]), row.find_all('td')[1].text) + self.assertEqual(str(duration[row_index - 1]), row.find_all('td')[2].text) + self.assertEqual(str(abnormal_duration[row_index - 1]), row.find_all('td')[3].text) + self.assertEqual(str(bandwidth[row_index - 1]), row.find_all('td')[4].text) + + def test_aicpu_operator(self): + op_name = [ + "aclnnEqScalar_EqualAiCpu_Equal", + "aclnnPowTensorScalar_SquareAiCpu_Square" + ] + op_type = ["Equal","Square"] + task_duration = [85.502,74.862] + input_shapes = ["\"17;\"","\"17\""] + input_data_types = ["DOUBLE;DOUBLE","INT64"] + input_formats = ["FORMAT_ND;FORMAT_ND","FORMAT_ND"] + output_shapes = ["\"17\"","\"17\""] + output_data_types = ["BOOL","INT64"] + output_formats = ["FORMAT_ND","FORMAT_ND"] + stack_info = [True, True] + + t0_description = ["Square, Equal"] + t0_suggestion = ["aclnnEqScalar_EqualAiCpu_Equal"] + t0_elapsed_time = ["160.36"] + t0_time_ratio = ["0.0"] + t1_operator_type = ["Equal"] + t1_counts = ["1"] + t1_elapsed_time = ["85.5"] + t2_operator_type = ["Square"] + t2_counts = ["1"] + t2_elapsed_time = ["74.86"] + + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all",None), sheet_name='Rank 5 
AICPU operator', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("all",None)) + return + + for index, row in df.iterrows(): + self.assertEqual(op_name[index], row["op_name"]) + self.assertEqual(op_type[index], row["op_type"]) + self.assertEqual(task_duration[index], row["task_duration"]) + self.assertEqual(input_shapes[index], row["input_shapes"]) + self.assertEqual(input_data_types[index], row["input_data_types"]) + self.assertEqual(input_formats[index], row["input_formats"]) + self.assertEqual(output_shapes[index], row["output_shapes"]) + self.assertEqual(output_data_types[index], row["output_data_types"]) + self.assertEqual(output_formats[index], row["output_formats"]) + self.assertEqual(stack_info[index], math.isnan(row["stack_info"])) + + soup = BeautifulSoup(open(self.RESULT_HTML.get("all",None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "AICPU Issues": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t0_description[row_index - 1], + row.find_all('td')[0].text.split(":")[1].replace("\n", "")) + self.assertEqual(t0_suggestion[row_index - 1], row.find_all('td')[1].text.split(" ")[-1]) + self.assertEqual(t0_elapsed_time[row_index - 1], row.find_all('td')[2].text) + self.assertEqual(t0_time_ratio[row_index - 1], row.find_all('td')[3].text) + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t1_operator_type[row_index - 1], row.find_all('td')[0].text) + self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text) + for row_index, row in enumerate(table[2].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t2_operator_type[row_index - 1], row.find_all('td')[0].text) + 
self.assertEqual(t2_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t2_elapsed_time[row_index - 1], row.find_all('td')[2].text) + + def test_operator_dispatch(self): + issues = ["operator dispatch"] + op_name = ["aclopCompileAndExecute"] + counts = [381] + total_time = [64611.0511] + + t0_description = ["381"] + t0_suggestion = ["torch_npu.npu.set_compile_mode(jit_compile=False)"] + t1_issue = ["aclopCompileAndExecute"] + t1_counts = ['381'] + t1_elapsed_time = ['64611.05109720859'] + + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all",None), sheet_name='operator dispatch', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("all",None)) + return + + for index, row in df.iterrows(): + self.assertEqual(issues[index], row["Issues"]) + self.assertEqual(op_name[index], row["op name"]) + self.assertEqual(counts[index], row["counts"]) + self.assertEqual(total_time[index], round(row["total time"], 4)) + + soup = BeautifulSoup(open(self.RESULT_HTML.get("all",None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "Operator Dispatch Issues": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t0_description[row_index - 1], row.find_all('td')[0].text.split(' ')[1]) + self.assertEqual(t0_suggestion[row_index - 1], + row.find_all('td')[1].text.split('`')[1].split(';')[0]) + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t1_issue[row_index - 1], row.find_all('td')[0].text) + self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text) \ No newline at end of file diff --git a/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_cluster_ascend_pt_no_compare.py 
b/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_cluster_ascend_pt_no_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..e747886f737819aa705e055efc991518d2f36610 --- /dev/null +++ b/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_cluster_ascend_pt_no_compare.py @@ -0,0 +1,483 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +from unittest import TestCase +import logging + +import math +import pandas as pd +from bs4 import BeautifulSoup + +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.advisor.utils import get_files, execute_cmd + + +class TestAdvisorCmdClusterAscendPtNoCompare(TestCase): + ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", + "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_2") + OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "TestAdvisorCmdClusterAscendPtNoCompare") + ALL_OUTPUT_PATH = os.path.join(OUTPUT_PATH, "all") + COMPUTATION_OUTPUT_PATH = os.path.join(OUTPUT_PATH, "computation") + SCHEDULE_OUTPUT_PATH = os.path.join(OUTPUT_PATH, "schedule") + RESULT_EXCEL = {} + RESULT_HTML = {} + COMMAND_SUCCESS = 0 + + def setup_class(self): + PathManager.make_dir_safety(self.ALL_OUTPUT_PATH) + PathManager.make_dir_safety(self.COMPUTATION_OUTPUT_PATH) + 
PathManager.make_dir_safety(self.SCHEDULE_OUTPUT_PATH) + cmd_all = ["msprof-analyze", "advisor", "all", "-d", self.BASE_PROFILING_PATH, "-o", self.ALL_OUTPUT_PATH, + "-l", "en", "--force"] + if execute_cmd(cmd_all) != self.COMMAND_SUCCESS or not os.path.exists(self.ALL_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [all] task failed.") + cmd_computation = ["msprof-analyze", "advisor", "computation", "-d", self.BASE_PROFILING_PATH, "-o", + self.COMPUTATION_OUTPUT_PATH, "-l", "en", "--force"] + if execute_cmd(cmd_computation) != self.COMMAND_SUCCESS or not os.path.exists(self.COMPUTATION_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [computation] task failed.") + cmd_schedule = ["msprof-analyze", "advisor", "schedule", "-d", self.BASE_PROFILING_PATH, "-o", + self.SCHEDULE_OUTPUT_PATH, "-l", "en", "--force"] + if execute_cmd(cmd_schedule) != self.COMMAND_SUCCESS or not os.path.exists( + self.SCHEDULE_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [schedule] task failed.") + + self.RESULT_HTML, self.RESULT_EXCEL = get_files(self.OUTPUT_PATH) + + def teardown_class(self): + PathManager.remove_path_safety(self.OUTPUT_PATH) + + def test_all_problems(self): + category = [ + "slow rank", + "slow link", + "byte alignment analysis", + "Kernel compare of Rank5 and Rank12", + "Rank 5 dynamic shape operator", + "Rank 5 aicpu operator", + "Operator dispatch" + ] + + # True presents the attr is nan + description_len = [1, 11, 1, 1, 1, 2, 1] + suggestion_len = [True, True, 1, True, 5, 2, 1] + problem_count = [True, True, True, True, 1.0, 2.0, True] + total_time = [True, True, True, True, True, 87845894.04, True] + time_ratio = [True, True, True, True, True, 0.0, True] + income = [True, True, True, True, True, True, True] + income_ratio = [True, True, True, True, True, True, True] + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='problems', header=0) + except FileNotFoundError: + logging.error("File %s not found.", 
self.RESULT_EXCEL.get("all", None)) + return + + for index, row in df.iterrows(): + self.assertEqual(category[index], row["category"]) + self.assertEqual(description_len[index], len(row["description"].split("\n"))) + self.assertEqual(suggestion_len[index], (isinstance(row["suggestion"], float) or + len(row["suggestion"].split("\n")))) + self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or row["problem count"])) + self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or + round(row["total_time(us)"], 2))) + self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2))) + self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2))) + self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or + round(row["income ratio"], 2))) + + def test_computation_problems(self): + category = [ + "slow rank", + "slow link", + "Kernel compare of Rank5 and Rank12", + "Rank 5 dynamic shape operator", + "Rank 5 aicpu operator", + ] + + # True presents the attr is nan + description_len = [1, 11, 1, 1, 2] + suggestion_len = [True, True, True, 5, 2] + problem_count = [True, True, True, 1.0, 2.0] + total_time = [True, True, True, True, 87845894.04] + time_ratio = [True, True, True, True, 0.0] + income = [True, True, True, True, True] + income_ratio = [True, True, True, True, True] + try: + df = pd.read_excel(self.RESULT_EXCEL.get("computation", None), sheet_name='problems', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("computation", None)) + return + + for index, row in df.iterrows(): + self.assertEqual(category[index], row["category"]) + self.assertEqual(description_len[index], len(row["description"].split("\n"))) + self.assertEqual(suggestion_len[index], (isinstance(row["suggestion"], float) or + len(row["suggestion"].split("\n")))) + self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or 
row["problem count"])) + self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or + round(row["total_time(us)"], 2))) + self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2))) + self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2))) + self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or + round(row["income ratio"], 2))) + + def test_schedule_problems(self): + category = [ + "slow rank", + "slow link", + "Operator dispatch" + ] + + # True presents the attr is nan + description_len = [1, 11, 1] + suggestion_len = [True, True, 1] + problem_count = [True, True, True] + total_time = [True, True, True] + time_ratio = [True, True, True] + income = [True, True, True] + income_ratio = [True, True, True] + try: + df = pd.read_excel(self.RESULT_EXCEL.get("schedule", None), sheet_name='problems', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("schedule", None)) + return + + for index, row in df.iterrows(): + self.assertEqual(category[index], row["category"]) + self.assertEqual(description_len[index], len(row["description"].split("\n"))) + self.assertEqual(suggestion_len[index], (isinstance(row["suggestion"], float) or + len(row["suggestion"].split("\n")))) + self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or row["problem count"])) + self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or + round(row["total_time(us)"], 2))) + self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2))) + self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2))) + self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or + round(row["income ratio"], 2))) + + def test_slow_rank(self): + step = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1] + rank_id = [0, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15] + compute_us = [ + 14302466.71, 14308948.28, 14311412.46, + 14242056.74, 14972627.53, 14974042.28, + 14964095.87, 14945901.57, 13878006.78, + 13826069.33, 13853184.69, 13910409.81, + 13612993.03, 13669912.48, 13779569.04, + 13826274.64 + ] + communication_us = [ + 50636595.62, 50670520.26, 50698886.74, + 50741670.92, 50257498.54, 50286645.51, + 50294747.07, 50289541.49, 51211928.02, + 51161276.14, 51187346.34, 51169195.18, + 51544052.84, 51556067.16, 51374012.81, + 51425588.65 + ] + free_us = [ + 682939.022, 634478.74, 609248.76, + 645123.76, 396550.744, 377863.438, + 363100.664, 377397.078, 537692.3, + 568293.626, 525516.858, 549405.358, + 458639.564, 400809.38, 396422.698, + 367782.33 + ] + + test_pattern = ["all", "computation", "schedule"] + for pattern in test_pattern: + try: + df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='slow rank', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None)) + return + + for index, row in df.iterrows(): + self.assertEqual(step[index], row["step"]) + self.assertEqual(rank_id[index], row["rank_id"]) + self.assertEqual(compute_us[index], round(row["compute(us)"], 2)) + self.assertEqual(communication_us[index], round(row["communication(us)"], 2)) + self.assertEqual(free_us[index], round(row["free(us)"], 3)) + + soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "slow rank": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(str(step[row_index - 1]), row.find_all('td')[0].text) + self.assertEqual(str(rank_id[row_index - 1]), row.find_all('td')[1].text) + self.assertEqual(str(compute_us[row_index - 1]), row.find_all('td')[2].text) + self.assertEqual(str(communication_us[row_index - 1]), 
row.find_all('td')[3].text) + self.assertEqual(str(round(free_us[row_index - 1], 2)), row.find_all('td')[4].text) + + def test_slow_link(self): + step = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1] + rank_id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] + rdma_bandwidth = [ + 4.7729, 4.8102, 4.7966, + 4.8765, 4.7619, 4.7982, + 4.781, 4.8631, 4.7027, + 4.7044, 4.6912, 4.7522, + 4.7046, 4.7058, 4.6952, + 4.7541 + ] + rdma_size = [ + 892743.4811, 892743.4811, 892743.4811, + 892743.4811, 892743.4811, 892743.4811, + 892743.4811, 892743.4811, 892743.4811, + 892743.4811, 892743.4811, 892743.4811, + 892743.4811, 892743.4811, 892743.4811, + 892743.4811 + ] + rdma_time = [ + 187043.9406, 185595.2189, 186121.0769, + 183071.016, 187477.765, 186059.6148, + 186726.3076, 183574.5527, 189837.9228, + 189769.2996, 190299.7641, 187859.376, + 189760.073, 189710.3648, 190138.7748, + 187784.9608 + ] + sdma_bandwidth = [ + 17.6709, 17.6923, 17.4552, + 17.3868, 18.2276, 18.305, + 18.2818, 18.3001, 18.3065, + 18.321, 18.2769, 18.3395, + 17.4277, 17.3756, 17.7028, + 17.6966 + ] + sdma_size = [ + 1975312.412, 1975321.29, 1975321.29, + 1975321.314, 1975227.481, 1975231.641, + 1975231.641, 1975231.665, 1975319.579, + 1975328.855, 1975328.855, 1975328.879, + 1975354.942, 1975366.182, 1975366.182, + 1975366.207 + ] + sdma_time = [ + 111783.3614, 111648.6661, 113165.2345, + 113610.0823, 108364.7268, 107906.9403, + 108043.8853, 107935.3805, 107902.3439, + 107817.7491, 108077.6098, 107708.7801, + 113345.8656, 113686.0851, 111585.2068, + 111624.1395 + ] + + test_pattern = ["all", "computation", "schedule"] + for pattern in test_pattern: + try: + df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='slow link', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None)) + return + + for index, row in df.iterrows(): + self.assertEqual(step[index], row["step"]) + self.assertEqual(rank_id[index], 
row["rank_id"]) + self.assertEqual(rdma_bandwidth[index], round(row["RDMA bandwidth(GB/s)"], 4)) + self.assertEqual(rdma_size[index], round(row["RDMA size(mb)"], 4)) + self.assertEqual(rdma_time[index], round(row["RDMA time(ms)"], 4)) + self.assertEqual(sdma_bandwidth[index], round(row["SDMA bandwidth(GB/s)"], 4)) + self.assertEqual(sdma_size[index], round(row["SDMA size(mb)"], 3)) + self.assertEqual(sdma_time[index], round(row["SDMA time(ms)"], 4)) + + soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "slow link": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(str(step[row_index - 1]), row.find_all('td')[0].text) + self.assertEqual(str(rank_id[row_index - 1]), row.find_all('td')[1].text) + self.assertEqual(str(round(rdma_bandwidth[row_index - 1], 2)), row.find_all('td')[2].text) + self.assertEqual(str(round(rdma_size[row_index - 1], 2)), row.find_all('td')[3].text) + self.assertEqual(str(round(rdma_time[row_index - 1], 2)), row.find_all('td')[4].text) + self.assertEqual(str(round(sdma_bandwidth[row_index - 1], 2)), row.find_all('td')[5].text) + self.assertEqual(str(round(sdma_size[row_index - 1], 2)), row.find_all('td')[6].text) + self.assertEqual(str(round(sdma_time[row_index - 1], 2)), row.find_all('td')[7].text) + + def test_Byte_Alignment_Analysis(self): + op_name = [ + "hcom_broadcast__275_2_1", "hcom_allReduce__275_237_1", + "hcom_allReduce__275_238_1", "hcom_allReduce__275_239_1", + "hcom_reduceScatter__063_1_1", "hcom_allGather__063_2_1" + ] + + total_size = [ + 41816518, 262120, + 262120, 262120, + 670986240, 335493120 + ] + + duration = [ + 1656.7, 14.9, + 14.46, 14.58, + 35449.52, 17285 + ] + + abnormal_duration = [ + 1656.7, 14.9, + 14.46, 14.58, + 35449.52, 17285 + ] + + bandwidth = [ + 25.2409, 17.5919, + 18.1272, 17.9781, + 
18.9279, 19.4095 + ] + + test_pattern = ["all"] + for pattern in test_pattern: + try: + df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='Byte Alignment Analysis', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None)) + return + + for index, row in df.iterrows(): + self.assertEqual(op_name[index], row["op name"]) + self.assertEqual(total_size[index], row["total size(Byte)"]) + self.assertEqual(duration[index], row["duration(us)"]) + self.assertEqual(abnormal_duration[index], row["abnormal duration(us)"]) + self.assertEqual(bandwidth[index], row["bandwidth(GB/s)"]) + + soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "Byte Alignment Analysis": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(str(op_name[row_index - 1]), row.find_all('td')[0].text) + self.assertEqual(str(total_size[row_index - 1]), row.find_all('td')[1].text) + self.assertEqual(str(round(duration[row_index - 1], 2)), row.find_all('td')[2].text) + self.assertEqual(str(round(abnormal_duration[row_index - 1], 2)), row.find_all('td')[3].text) + self.assertEqual(str(round(bandwidth[row_index - 1], 4)), row.find_all('td')[4].text) + + def test_aicpu_operator(self): + op_name = [ + "aclnnEqScalar_EqualAiCpu_Equal", + "aclnnPowTensorScalar_SquareAiCpu_Square" + ] + op_type = ["Equal", "Square"] + task_duration = [85.502, 74.862] + input_shapes = ["\"17;\"", "\"17\""] + input_data_types = ["DOUBLE;DOUBLE", "INT64"] + input_formats = ["FORMAT_ND;FORMAT_ND", "FORMAT_ND"] + output_shapes = ["\"17\"", "\"17\""] + output_data_types = ["BOOL", "INT64"] + output_formats = ["FORMAT_ND", "FORMAT_ND"] + stack_info = [True, True] + + t0_description = ["Square, Equal"] + t0_suggestion = ["aclnnEqScalar_EqualAiCpu_Equal"] + 
t0_elapsed_time = ["160.36"] + t0_time_ratio = ["0.0"] + t1_operator_type = ["Equal"] + t1_counts = ["1"] + t1_elapsed_time = ["85.5"] + t2_operator_type = ["Square"] + t2_counts = ["1"] + t2_elapsed_time = ["74.86"] + + test_pattern = ["all", "computation"] + for pattern in test_pattern: + try: + df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='Rank 5 AICPU operator', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None)) + return + + for index, row in df.iterrows(): + self.assertEqual(op_name[index], row["op_name"]) + self.assertEqual(op_type[index], row["op_type"]) + self.assertEqual(task_duration[index], row["task_duration"]) + self.assertEqual(input_shapes[index], row["input_shapes"]) + self.assertEqual(input_data_types[index], row["input_data_types"]) + self.assertEqual(input_formats[index], row["input_formats"]) + self.assertEqual(output_shapes[index], row["output_shapes"]) + self.assertEqual(output_data_types[index], row["output_data_types"]) + self.assertEqual(output_formats[index], row["output_formats"]) + self.assertEqual(stack_info[index], math.isnan(row["stack_info"])) + + soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "AICPU Issues": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t0_description[row_index - 1], + row.find_all('td')[0].text.split(":")[1].replace("\n", "")) + self.assertEqual(t0_suggestion[row_index - 1], row.find_all('td')[1].text.split(" ")[-1]) + self.assertEqual(t0_elapsed_time[row_index - 1], row.find_all('td')[2].text) + self.assertEqual(t0_time_ratio[row_index - 1], row.find_all('td')[3].text) + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t1_operator_type[row_index - 
1], row.find_all('td')[0].text) + self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text) + for row_index, row in enumerate(table[2].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t2_operator_type[row_index - 1], row.find_all('td')[0].text) + self.assertEqual(t2_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t2_elapsed_time[row_index - 1], row.find_all('td')[2].text) + + def test_operator_dispatch(self): + issues = ["operator dispatch"] + op_name = ["aclopCompileAndExecute"] + counts = [381] + total_time = [64611.0511] + + t0_description = ["381"] + t0_suggestion = ["torch_npu.npu.set_compile_mode(jit_compile=False)"] + t1_issue = ["aclopCompileAndExecute"] + t1_counts = ['381'] + t1_elapsed_time = ['64611.05109720859'] + + test_pattern = ["all", "schedule"] + for pattern in test_pattern: + df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='operator dispatch', header=0) + for index, row in df.iterrows(): + self.assertEqual(issues[index], row["Issues"]) + self.assertEqual(op_name[index], row["op name"]) + self.assertEqual(counts[index], row["counts"]) + self.assertEqual(total_time[index], round(row["total time"], 4)) + + soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser') + for h2 in soup.find_all('h2'): + if h2.contents[0] == "Operator Dispatch Issues": + div_content = h2.next.next.next + table = div_content.find_all('table') + for row_index, row in enumerate(table[0].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t0_description[row_index - 1], row.find_all('td')[0].text.split(' ')[1]) + self.assertEqual(t0_suggestion[row_index - 1], + row.find_all('td')[1].text.split('`')[1].split(';')[0]) + for row_index, row in enumerate(table[1].find_all('tr')): + if row_index == 0: + continue + self.assertEqual(t1_issue[row_index - 1], row.find_all('td')[0].text) + 
self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text) + self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text) diff --git a/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_single_ascend_pt_no_compare.py b/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_single_ascend_pt_no_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..f8d4c03ecbcac152174a033270308598d24c3998 --- /dev/null +++ b/profiler/msprof_analyze/test/st/dev/test_advisor_cmd_single_ascend_pt_no_compare.py @@ -0,0 +1,414 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import subprocess +import logging +from unittest import TestCase + +import math +import pandas as pd +from bs4 import BeautifulSoup + +from msprof_analyze.prof_common.path_manager import PathManager +from msprof_analyze.test.st.advisor.utils import get_files, execute_cmd + + +class TestAdvisorCmdSingleAscendPtNoCompare(TestCase): + ST_DATA_PATH = os.getenv("MSTT_PROFILER_ST_DATA_PATH", + "/home/dcs-50/smoke_project_for_msprof_analyze/mstt_profiler/st_data") + BASE_PROFILING_PATH = os.path.join(ST_DATA_PATH, "cluster_data_3", "n122-122-067_12380_20240912033946038_ascend_pt") + OUTPUT_PATH = os.path.join(os.path.abspath(os.path.dirname(__file__)), "TestAdvisorCmdSingleAscendPtNoCompare") + ALL_OUTPUT_PATH = os.path.join(OUTPUT_PATH, "all") + COMPUTATION_OUTPUT_PATH = os.path.join(OUTPUT_PATH, "computation") + SCHEDULE_OUTPUT_PATH = os.path.join(OUTPUT_PATH, "schedule") + RESULT_EXCEL = {} + RESULT_HTML = {} + COMMAND_SUCCESS = 0 + + def setup_class(self): + PathManager.make_dir_safety(self.ALL_OUTPUT_PATH) + PathManager.make_dir_safety(self.COMPUTATION_OUTPUT_PATH) + PathManager.make_dir_safety(self.SCHEDULE_OUTPUT_PATH) + cmd_all = ["msprof-analyze", "advisor", "all", "-d", self.BASE_PROFILING_PATH, "-o", self.ALL_OUTPUT_PATH, + "-l", "en", "--force"] + if execute_cmd(cmd_all) != self.COMMAND_SUCCESS or not os.path.exists(self.ALL_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [all] task failed.") + + cmd_computation = ["msprof-analyze", "advisor", "computation", "-d", self.BASE_PROFILING_PATH, "-o", + self.COMPUTATION_OUTPUT_PATH, "-l", "en", "--force"] + if execute_cmd(cmd_computation) != self.COMMAND_SUCCESS or not os.path.exists(self.COMPUTATION_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [computation] task failed.") + + cmd_schedule = ["msprof-analyze", "advisor", "schedule", "-d", self.BASE_PROFILING_PATH, "-o", + self.SCHEDULE_OUTPUT_PATH, "-l", "en", "--force"] + if execute_cmd(cmd_schedule) != self.COMMAND_SUCCESS or not 
os.path.exists(self.SCHEDULE_OUTPUT_PATH): + self.assertTrue(False, msg="advisor [schedule] task failed.") + + self.RESULT_HTML, self.RESULT_EXCEL = get_files(self.OUTPUT_PATH) + + def teardown_class(self): + PathManager.remove_path_safety(self.OUTPUT_PATH) + + def test_all_problems(self): + category = [ + "overall summary", + "byte alignment analysis", + "bandwidth contention analysis", + "AICPU operator", + "Dynamic shape operator", + "Affinity apis", + "Operator dispatch" + ] + + # True presents the attr is nan + description_len = [6, 1, 3, 2, 1, 1, 1] + suggestion_len = [True, 1, 1, 2, 5, 1, 1] + problem_count = [True, True, True, 2.0, 1.0, True, True] + total_time = [True, True, True, 57674709.54, True, True, True] + time_ratio = [True, True, True, 0.0, True, True, True] + income = [True, True, True, True, True, True, True] + income_ratio = [True, True, True, True, True, True, True] + try: + df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='problems', header=0) + except FileNotFoundError: + logging.error("File %s not found.", self.RESULT_EXCEL.get("all", None)) + return + + for index, row in df.iterrows(): + self.assertEqual(category[index], row["category"]) + self.assertEqual(description_len[index], len(row["description"].split("\n"))) + self.assertEqual(suggestion_len[index], isinstance(row["suggestion"], float) or + len(row["suggestion"].split("\n"))) + self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or row["problem count"])) + self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or + round(row["total_time(us)"], 2))) + self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2))) + self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2))) + self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or + round(row["income ratio"], 2))) + + def test_computation_problems(self): + category = [ + "overall summary", 
+            "AICPU operator",
+            "Dynamic shape operator",
+        ]
+
+        # True represents that the attr is nan
+        description_len = [6, 2, 1]
+        suggestion_len = [True, 2, 5]
+        problem_count = [True, 2.0, 1.0]
+        total_time = [True, 57674709.54, True]
+        time_ratio = [True, 0.0, True]
+        income = [True, True, True]
+        income_ratio = [True, True, True]
+        try:
+            df = pd.read_excel(self.RESULT_EXCEL.get("computation", None), sheet_name='problems', header=0)
+        except FileNotFoundError:
+            logging.error("File %s not found.", self.RESULT_EXCEL.get("computation", None))
+            return
+
+        for index, row in df.iterrows():
+            self.assertEqual(category[index], row["category"])
+            self.assertEqual(description_len[index], len(row["description"].split("\n")))
+            self.assertEqual(suggestion_len[index], (isinstance(row["suggestion"], float) or
+                                                     len(row["suggestion"].split("\n"))))
+            self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or row["problem count"]))
+            self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or
+                                                 round(row["total_time(us)"], 2)))
+            self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2)))
+            self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2)))
+            self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or
+                                                   round(row["income ratio"], 2)))
+
+    def test_schedule_problems(self):
+        category = [
+            "overall summary",
+            "Affinity apis",
+            "Operator dispatch"
+        ]
+
+        # True represents that the attr is nan
+        description_len = [6, 1, 1]
+        suggestion_len = [True, 1, 1]
+        problem_count = [True, True, True]
+        total_time = [True, True, True]
+        time_ratio = [True, True, True]
+        income = [True, True, True]
+        income_ratio = [True, True, True]
+        try:
+            df = pd.read_excel(self.RESULT_EXCEL.get("schedule", None), sheet_name='problems', header=0)
+        except FileNotFoundError:
+            logging.error("File %s not found.", self.RESULT_EXCEL.get("schedule", None))
+            return
+
+        for index, row in df.iterrows():
+            self.assertEqual(category[index], row["category"])
+            self.assertEqual(description_len[index], len(row["description"].split("\n")))
+            self.assertEqual(suggestion_len[index], (isinstance(row["suggestion"], float) or
+                                                     len(row["suggestion"].split("\n"))))
+            self.assertEqual(problem_count[index], (math.isnan(row["problem count"]) or row["problem count"]))
+            self.assertEqual(total_time[index], (math.isnan(row["total_time(us)"]) or
+                                                 round(row["total_time(us)"], 2)))
+            self.assertEqual(time_ratio[index], (math.isnan(row["time ratio"]) or round(row["time ratio"], 2)))
+            self.assertEqual(income[index], (math.isnan(row["income(us)"]) or round(row["income(us)"], 2)))
+            self.assertEqual(income_ratio[index], (math.isnan(row["income ratio"]) or
+                                                   round(row["income ratio"], 2)))
+
+    def test_overall_summary(self):
+        performance_index = [
+            "Computing Time", " -- Flash Attention",
+            " -- Conv", " -- Matmul",
+            " -- Vector", " -- SDMA(Tensor Move)",
+            " -- Other Cube", "Uncovered Communication Time",
+            " -- Wait", " -- Transmit",
+            "Free Time", " -- SDMA",
+            " -- Free", "E2E Time"
+        ]
+        duration = [14474.856, 1194.014, 0.000, 10442.616, 2821.569, 16.473, 0.0, 23922.059, 138.275, 23783.785,
+                    373.722, 0.000, 373.722, 38770.637]
+        duration_ratio = ["37.33%", "3.08%", "0.00%", "26.93%", "7.28%", "0.04%", "0.00%", "61.70%", "0.36%",
+                          "61.34%", "0.96%", "0.00%", "0.96%", "100.00%"]
+
+        test_pattern = ["all", "computation", "schedule"]
+        for pattern in test_pattern:
+            try:
+                df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='overall summary', header=0)
+            except FileNotFoundError:
+                logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None))
+                return
+
+            for index, row in df.iterrows():
+                self.assertEqual(performance_index[index], row["Performance Index"])
+                self.assertEqual(duration[index], row["Duration(ms)"])
+                self.assertEqual(duration_ratio[index], row["Duration Ratio"])
+
+            soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser')
+            for h2 in soup.find_all('h2'):
+                if h2.contents[0] == "overall summary":
+                    div_content = h2.next.next.next
+                    table = div_content.find_all('table')
+                    for row_index, row in enumerate(table[0].find_all('tr')):
+                        if row_index == 0:
+                            continue
+                        self.assertEqual(performance_index[row_index - 1], row.find_all('td')[0].text)
+                        self.assertEqual("{:.3f}".format(duration[row_index - 1]), row.find_all('td')[1].text)
+                        self.assertEqual(duration_ratio[row_index - 1], row.find_all('td')[2].text)
+
+    def test_all_bandwidth_contention_analysis(self):
+        bandwidth_contention_analysis = [
+            "hcom_allGather__508_1_1", "hcom_allGather__508_4_1", "hcom_allGather__508_8_1",
+            "hcom_allGather__508_108_1", "hcom_allGather__508_112_1", "hcom_allGather__508_113_1",
+            "hcom_allGather__508_137_1", "hcom_allGather__508_141_1", "hcom_allGather__508_145_1",
+            "hcom_allGather__508_153_1", "hcom_allGather__508_157_1", "hcom_allGather__508_173_1",
+            "hcom_allGather__508_177_1", "hcom_allGather__508_181_1", "hcom_allGather__508_209_1",
+            "hcom_reduceScatter__868_261_1", "hcom_reduceScatter__868_266_1", "hcom_allGather__508_276_1",
+            "hcom_reduceScatter__508_283_1", "hcom_reduceScatter__508_291_1", "hcom_reduceScatter__508_299_1",
+            "hcom_reduceScatter__508_307_1", "hcom_allGather__508_308_1", "hcom_reduceScatter__508_315_1",
+            "hcom_reduceScatter__508_323_1", "hcom_reduceScatter__508_331_1", "hcom_reduceScatter__508_339_1",
+            "hcom_reduceScatter__508_347_1", "hcom_reduceScatter__508_355_1", "hcom_allGather__508_356_1",
+            "hcom_reduceScatter__508_363_1", "hcom_reduceScatter__508_371_1", "hcom_allGather__508_372_1",
+            "hcom_reduceScatter__508_379_1", "hcom_reduceScatter__508_387_1", "hcom_allGather__508_388_1",
+            "hcom_reduceScatter__508_395_1", "hcom_reduceScatter__508_403_1", "hcom_allGather__508_404_1",
+            "hcom_reduceScatter__508_411_1", "hcom_reduceScatter__508_419_1", "hcom_reduceScatter__508_427_1",
+            "hcom_reduceScatter__508_435_1", "hcom_reduceScatter__508_443_1", "hcom_reduceScatter__508_451_1",
+            "hcom_reduceScatter__508_459_1", "hcom_reduceScatter__508_467_1", "hcom_allGather__508_468_1",
+            "hcom_reduceScatter__508_475_1", "hcom_reduceScatter__508_483_1", "hcom_reduceScatter__508_491_1",
+            "hcom_reduceScatter__508_499_1", "hcom_reduceScatter__508_507_1", "hcom_reduceScatter__508_515_1",
+            "hcom_allGather__508_516_1", "hcom_reduceScatter__508_523_1", "hcom_reduceScatter__508_531_1",
+            "hcom_reduceScatter__508_539_1", "hcom_reduceScatter__508_547_1", "hcom_reduceScatter__508_555_1",
+            "hcom_reduceScatter__508_563_1", "hcom_reduceScatter__508_571_1", "hcom_reduceScatter__508_579_1",
+            "hcom_reduceScatter__508_587_1", "hcom_allGather__508_588_1", "hcom_reduceScatter__508_595_1",
+            "hcom_reduceScatter__508_603_1", "hcom_reduceScatter__508_611_1", "hcom_reduceScatter__508_619_1",
+            "hcom_reduceScatter__508_627_1", "hcom_reduceScatter__508_635_1", "hcom_reduceScatter__508_643_1",
+            "hcom_allGather__508_644_1", "hcom_reduceScatter__508_651_1", "hcom_reduceScatter__508_659_1",
+            "hcom_reduceScatter__508_667_1", "hcom_reduceScatter__508_675_1", "hcom_reduceScatter__508_683_1"
+        ]
+        duration = [
+            8.3454, 13.8113, 39.8263, 21.6036, 38.2598, 5.3913, 13.4007, 9.6871, 8.8002, 10.0535, 8.3423, 9.3205,
+            11.3891,
+            9.473, 12.7247, 19.4176, 13.2621, 16.3541, 127.5414, 127.288, 126.6839, 129.0707, 11.8205, 128.8378,
+            130.0548,
+            128.3927, 124.9711, 128.0221, 122.8157, 11.7839, 127.0278, 123.3328, 11.9078, 122.3141, 123.1837, 11.2561,
+            123.8337, 127.5955, 11.5881, 123.0412, 128.4852, 122.3674, 127.1958, 127.5779, 129.6155, 127.2981, 125.5495,
+            11.0916, 127.4827, 126.4632, 125.0414, 123.9187, 125.168, 127.1, 12.6763, 126.3728, 126.9693, 127.677,
+            127.1439, 127.2013, 127.9102, 125.7989, 126.4961, 127.6573, 12.2088, 127.6283, 126.3803, 129.8238, 126.2997,
+            127.4806, 129.2007, 127.2733, 12.0963, 126.8322, 127.5317, 126.482, 127.8283, 129.2951
+        ]
+        bandwidth = [
+            5.49, 4.8, 5.99, 14.13, 3.24, 6.25, 8.52, 5.17, 5.34, 8.24, 5.43, 6.15, 9.79, 5.55, 4.39, 13.35, 13.14,
+            3.61, 2.51,
+            2.88, 2.83, 3.07, 4.81, 2.55, 2.57, 2.73, 2.84, 2.44, 3.01, 4.95, 2.63, 3.06, 3.77, 2.88, 3.44, 4.72, 2.91,
+            3.21,
+            4.47, 2.38, 2.31, 2.9, 4.26, 3.57, 2.31, 2.24, 2.81, 4.37, 2.67, 2.8, 2.74, 2.16, 2.79, 2.88, 5.79, 2.75,
+            2.93, 2.88,
+            2.31, 2.72, 2.39, 2.6, 2.55, 2.58, 4.29, 2.69, 2.86, 2.09, 3.12, 2.31, 2.28, 2.87, 6.97, 3.1, 2.35, 3.4,
+            2.61, 2.62
+        ]
+        try:
+            df = pd.read_excel(self.RESULT_EXCEL.get("all", None), sheet_name='Bandwidth Contention Analysis', header=0)
+        except FileNotFoundError:
+            logging.error("File %s not found.", self.RESULT_EXCEL.get("all", None))
+            return
+
+        for index, row in df.iterrows():
+            self.assertEqual(bandwidth_contention_analysis[index], row["op name"])
+            self.assertEqual(duration[index], round(row["duration(ms)"], 4))
+            self.assertEqual(bandwidth[index], round(row["bandwidth(GB/s)"], 2))
+
+        # TODO: check the html report after the known bugs are repaired
+
+    def test_AICPU_operator(self):
+        op_name = ["aclnnPowTensorScalar_SquareAiCpu_Square", "aclnnEqScalar_EqualAiCpu_Equal"]
+        op_type = ["Square", "Equal"]
+        task_duration = [92.06, 90.72]
+        input_shapes = ["\"41\"", "\"41;\""]
+        input_data_types = ["INT64", "DOUBLE;DOUBLE"]
+        input_formats = ["FORMAT_ND", "FORMAT_ND;FORMAT_ND"]
+        output_shapes = ["\"41\"", "\"41\""]
+        output_data_types = ["INT64", "BOOL"]
+        output_formats = ["FORMAT_ND", "FORMAT_ND"]
+        stack_info = [True, True]
+
+        t0_description = ["Square, Equal"]
+        t0_suggestion = ["aclnnEqScalar_EqualAiCpu_Equal"]
+        t0_elapsed_time = ["182.78"]
+        t0_time_ratio = ["0.0"]
+        t1_operator_type = ["Square"]
+        t1_counts = ["1"]
+        t1_elapsed_time = ["92.06"]
+        t2_operator_type = ["Equal"]
+        t2_counts = ["1"]
+        t2_elapsed_time = ["90.72"]
+        b_names = ["Square", "Suggestion 1:", "Equal", "Suggestion 1:"]
+
+        test_pattern = ["all", "computation"]
+        for pattern in test_pattern:
+            try:
+                df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='AICPU operator', header=0)
+            except FileNotFoundError:
+                logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None))
+                return
+
+            for index, row in df.iterrows():
+                self.assertEqual(op_name[index], row["op_name"])
+                self.assertEqual(op_type[index], row["op_type"])
+                self.assertEqual(task_duration[index], round(row["task_duration"], 2))
+                self.assertEqual(input_shapes[index], row["input_shapes"])
+                self.assertEqual(input_data_types[index], row["input_data_types"])
+                self.assertEqual(input_formats[index], row["input_formats"])
+                self.assertEqual(output_shapes[index], row["output_shapes"])
+                self.assertEqual(output_data_types[index], row["output_data_types"])
+                self.assertEqual(output_formats[index], row["output_formats"])
+                self.assertEqual(stack_info[index], math.isnan(row["stack_info"]))
+
+            soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser')
+            for h2 in soup.find_all('h2'):
+                if h2.contents[0] == "AICPU Issues":
+                    div_content = h2.next.next.next
+                    table = div_content.find_all('table')
+                    for row_index, row in enumerate(table[0].find_all('tr')):
+                        if row_index == 0:
+                            continue
+                        self.assertEqual(t0_description[row_index - 1],
+                                         row.find_all('td')[0].text.split(":")[1].replace("\n", ""))
+                        self.assertEqual(t0_suggestion[row_index - 1], row.find_all('td')[1].text.split(" ")[-1])
+                        self.assertEqual(t0_elapsed_time[row_index - 1], row.find_all('td')[2].text)
+                        self.assertEqual(t0_time_ratio[row_index - 1], row.find_all('td')[3].text)
+                    for row_index, row in enumerate(table[1].find_all('tr')):
+                        if row_index == 0:
+                            continue
+                        self.assertEqual(t1_operator_type[row_index - 1], row.find_all('td')[0].text)
+                        self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text)
+                        self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text)
+                    for row_index, row in enumerate(table[2].find_all('tr')):
+                        if row_index == 0:
+                            continue
+                        self.assertEqual(t2_operator_type[row_index - 1], row.find_all('td')[0].text)
+                        self.assertEqual(t2_counts[row_index - 1], row.find_all('td')[1].text)
+                        self.assertEqual(t2_elapsed_time[row_index - 1], row.find_all('td')[2].text)
+
+                    b_contents = div_content.find_all('b')
+                    for b_index, b_content in enumerate(b_contents):
+                        self.assertEqual(b_names[b_index], b_content.text)
+
+    def test_Affinity_API(self):
+        affinity_api = ["torch_npu.npu_confusion_transpose", "torch_npu.optim.NpuFusedAdamW"]
+        code_stacks = [True, True]
+        stack_called_counts = [True, True]
+        ignore_api = ["torch_npu.optim.NpuFusedAdamW", "torch_npu.npu_confusion_transpose"]
+
+        test_pattern = ["all", "schedule"]
+        for pattern in test_pattern:
+            try:
+                df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='Affinity apis', header=0)
+            except FileNotFoundError:
+                logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None))
+                return
+
+            for index, row in df.iterrows():
+                self.assertEqual(affinity_api[index], row["Affinity API"])
+                self.assertEqual(code_stacks[index], math.isnan(row["Code stacks"]))
+                self.assertEqual(stack_called_counts[index], math.isnan(row["Stack called counts"]))
+
+            soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser')
+            for h2 in soup.find_all('h2'):
+                if h2.contents[0] == "Affinity API Issues":
+                    div_content = h2.next.next.next
+                    self.assertEqual(ignore_api[0], div_content.contents[-2].contents[-2].text)
+                    self.assertEqual(ignore_api[1], div_content.contents[-2].contents[-4].text)
+
+    def test_operator_dispatch(self):
+        issues = ["operator dispatch"]
+        op_name = ["aclopCompileAndExecute"]
+        counts = [381]
+        total_time = [58486.7048]
+
+        t0_description = ["381"]
+        t0_suggestion = ["torch_npu.npu.set_compile_mode(jit_compile=False)"]
+        t1_issue = ["aclopCompileAndExecute"]
+        t1_counts = ['381']
+        t1_elapsed_time = ['58486.704798215804']
+
+        test_pattern = ["all", "schedule"]
+        for pattern in test_pattern:
+            try:
+                df = pd.read_excel(self.RESULT_EXCEL.get(pattern, None), sheet_name='operator dispatch', header=0)
+            except FileNotFoundError:
+                logging.error("File %s not found.", self.RESULT_EXCEL.get(pattern, None))
+                return
+            for index, row in df.iterrows():
+                self.assertEqual(issues[index], row["Issues"])
+                self.assertEqual(op_name[index], row["op name"])
+                self.assertEqual(counts[index], row["counts"])
+                self.assertEqual(total_time[index], round(row["total time"], 4))
+
+            soup = BeautifulSoup(open(self.RESULT_HTML.get(pattern, None)), 'html.parser')
+            for h2 in soup.find_all('h2'):
+                if h2.contents[0] == "Operator Dispatch Issues":
+                    div_content = h2.next.next.next
+                    table = div_content.find_all('table')
+                    for row_index, row in enumerate(table[0].find_all('tr')):
+                        if row_index == 0:
+                            continue
+                        self.assertEqual(t0_description[row_index - 1], row.find_all('td')[0].text.split(' ')[1])
+                        self.assertEqual(t0_suggestion[row_index - 1],
+                                         row.find_all('td')[1].text.split('`')[1].split(';')[0])
+                    for row_index, row in enumerate(table[1].find_all('tr')):
+                        if row_index == 0:
+                            continue
+                        self.assertEqual(t1_issue[row_index - 1], row.find_all('td')[0].text)
+                        self.assertEqual(t1_counts[row_index - 1], row.find_all('td')[1].text)
+                        self.assertEqual(t1_elapsed_time[row_index - 1], row.find_all('td')[2].text)
diff --git a/profiler/msprof_analyze/test/st/utils.py b/profiler/msprof_analyze/test/st/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..434b32c43d7b3b7757f9478ce79aeea60b9c0a32
--- /dev/null
+++ b/profiler/msprof_analyze/test/st/utils.py
@@ -0,0 +1,104 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import subprocess
+import os
+import re
+import logging
+import sqlite3
+
+COMMAND_SUCCESS = 0
+
+
+def execute_cmd(cmd):
+    logging.info('Execute command: %s', " ".join(cmd))
+    completed_process = subprocess.run(cmd, shell=False, stderr=subprocess.PIPE)
+    if completed_process.returncode != COMMAND_SUCCESS:
+        logging.error(completed_process.stderr.decode())
+    return completed_process.returncode
+
+
+def execute_script(cmd):
+    logging.info('Execute command: %s', " ".join(cmd))
+    process = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    while process.poll() is None:
+        line = process.stdout.readline().strip()
+        if line:
+            logging.debug(line)
+    return process.returncode
+
+
+def check_result_file(out_path):
+    files = os.listdir(out_path)
+    newest_file = None
+    re_match_exp = r"^performance_comparison_result_\d{1,20}\.xlsx"
+    for file_name in files:
+        if re.match(re_match_exp, file_name):
+            file_time = file_name.split(".")[0].split("_")[-1]
+            if not newest_file or file_time > newest_file.split(".")[0].split("_")[-1]:
+                newest_file = file_name
+
+    return newest_file
+
+
+def select_count(db_path: str, query: str):
+    """
+    Execute a SQL query to count the number of records in the database.
+    """
+    conn, cursor = create_connect_db(db_path)
+    cursor.execute(query)
+    count = cursor.fetchone()
+    destroy_db_connect(conn, cursor)
+    return count[0]
+
+
+def select_by_query(db_path: str, query: str, db_class):
+    """
+    Execute a SQL query and return the first record as an instance of db_class.
+    """
+    conn, cursor = create_connect_db(db_path)
+    cursor.execute(query)
+    rows = cursor.fetchall()
+    dbs = [db_class(*row) for row in rows]
+    destroy_db_connect(conn, cursor)
+    return dbs[0]
+
+
+def create_connect_db(db_file: str) -> tuple:
+    """
+    Create a connection to the SQLite database.
+    """
+    try:
+        conn = sqlite3.connect(db_file)
+        curs = conn.cursor()
+        return conn, curs
+    except sqlite3.Error as e:
+        logging.error("Unable to connect to database: %s", e)
+        return None, None
+
+
+def destroy_db_connect(conn, curs) -> None:
+    """
+    Close the database connection and cursor.
+    """
+    try:
+        if isinstance(curs, sqlite3.Cursor):
+            curs.close()
+    except sqlite3.Error as err:
+        logging.error("%s", err)
+    try:
+        if isinstance(conn, sqlite3.Connection):
+            conn.close()
+    except sqlite3.Error as err:
+        logging.error("%s", err)
diff --git a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_advice_base.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_advice_base.py
similarity index 69%
rename from profiler/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_advice_base.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_advice_base.py
index 6235c06efb37a553ddb61a60a7a705fdb603edea..a86cd57af1d1992675308295ba2eaf5f60303b31 100644
--- a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_advice_base.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_advice_base.py
@@ -1,10 +1,24 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import os
 import shutil
 import unittest
 from unittest import mock
 from unittest.mock import MagicMock
 
-from advisor_backend.cluster_advice.cluster_advice_base import ClusterAdviceBase
+from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_advice_base import ClusterAdviceBase
 
 
 class MockChildClusterAdvice(ClusterAdviceBase):
@@ -56,7 +70,8 @@ class TestClusterAdviceBase(unittest.TestCase):
 
     def test_cluster_analyze_normal(self):
         mock_inst = MockChildClusterAdvice(self.tmp_dir)
-        with mock.patch("advisor_backend.cluster_advice.cluster_advice_base.Interface") as mock_if:
+        with mock.patch("msprof_analyze.advisor.advisor_backend.cluster_advice."
+                        "cluster_advice_base.Interface") as mock_if:
             mock_if_inst = mock_if.return_value
             mock_if_inst.run = MagicMock(name="run")
             mock_inst.cluster_analyze()
@@ -65,7 +80,8 @@ class TestClusterAdviceBase(unittest.TestCase):
     def test_cluster_analyze_abnormal(self):
         mock_inst = MockChildClusterAdvice(self.tmp_dir)
         with self.assertRaises(ValueError):
-            with mock.patch("advisor_backend.cluster_advice.cluster_advice_base.Interface") as mock_if:
+            with mock.patch("msprof_analyze.advisor.advisor_backend.cluster_advice."
+                            "cluster_advice_base.Interface") as mock_if:
                 mock_if_inst = mock_if.return_value
                 mock_if_inst.run = mock.Mock(name="run", side_effect=Exception('Error!'))
                 mock_inst.cluster_analyze()
diff --git a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_pipeline_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_pipeline_advice.py
similarity index 87%
rename from profiler/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_pipeline_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_pipeline_advice.py
index a6c62dca67430b9eb821eec4b01ed91bcdec20d2..8f6d2a6a50329bfabe1fd0d92d50f20571379fb2 100644
--- a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_pipeline_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_cluster_pipeline_advice.py
@@ -1,18 +1,32 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import unittest
 from unittest import mock
 from collections import deque
 from collections import defaultdict
 
-from advisor_backend.cluster_advice.cluster_pipeline_advice import ClusterPipelineAdvice
-from advisor_backend.cluster_advice.cluster_pipeline_advice import FineTraceViewData
-from advisor_backend.cluster_advice.cluster_pipeline_advice import PipelineTimeSlice
-from advisor_backend.cluster_advice.cluster_pipeline_advice import PipelineTraceViewer
+from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_pipeline_advice import ClusterPipelineAdvice
+from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_pipeline_advice import FineTraceViewData
+from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_pipeline_advice import PipelineTimeSlice
+from msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_pipeline_advice import PipelineTraceViewer
 
 
 class TestClusterPipelineAdvice(unittest.TestCase):
 
     def test_load_trace_view_data_should_return_none_when_input_json_empty(self):
-        with mock.patch('common_func.file_manager.FileManager.read_json_file', return_value=None):
+        with mock.patch('msprof_analyze.prof_common.file_manager.FileManager.read_json_file', return_value=None):
             advice = ClusterPipelineAdvice('./tmp_dir', {})
             self.assertEqual(advice.load_trace_view_data('test'), None)
 
@@ -62,7 +76,7 @@ class TestClusterPipelineAdvice(unittest.TestCase):
             npu_ops_ts_dur={"15": 16, "17": 18},
             torch_to_npu_links=[torch_to_npu_link],
         )
-        with mock.patch('common_func.file_manager.FileManager.read_json_file', return_value=raw_data):
+        with mock.patch('msprof_analyze.prof_common.file_manager.FileManager.read_json_file', return_value=raw_data):
             advice = ClusterPipelineAdvice('./tmp_dir', {})
             check_res = advice.load_trace_view_data('test')
             self.assertEqual(check_res, except_res)
@@ -105,7 +119,8 @@ class TestClusterPipelineAdvice(unittest.TestCase):
         bp_op2 = {"ph": "X", "name": "autogard::add", "ts": str(2000000000 - 100), "dur": 2000, "tid": 3,
                   "pid": 1, "args": {}}
         res_bp_ops = [(bp_op1, bp_op2)]
-        with mock.patch('advisor_backend.cluster_advice.cluster_pipeline_advice.ClusterPipelineAdvice.double_queue_pop',
+        with mock.patch('msprof_analyze.advisor.advisor_backend.cluster_advice.cluster_pipeline_advice.'
+                        'ClusterPipelineAdvice.double_queue_pop',
                         return_value=(None, res_bp_ops)):
             advice = ClusterPipelineAdvice('./tmp_dir', {})
             res_check = advice.get_fp_bp_bound_ops(fine_data)
diff --git a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_kernel_cluster_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_kernel_cluster_advice.py
similarity index 86%
rename from profiler/test/ut/advisor/advisor_backend/cluster_advice/test_kernel_cluster_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_kernel_cluster_advice.py
index 0509b197cafae23e4f53ed0b300c1934a66b1197..f114b2eb7387fe179eebf86cd0526fe6721e7198 100644
--- a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_kernel_cluster_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_kernel_cluster_advice.py
@@ -1,3 +1,17 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import os
 import stat
 import shutil
@@ -5,8 +19,8 @@ import unittest
 from unittest import mock
 from unittest.mock import MagicMock
 
-from common_func.constant import Constant
-from advisor_backend.cluster_advice.kernel_cluster_advice import KernelClusterAdvice
+from msprof_analyze.prof_common.constant import Constant
+from msprof_analyze.advisor.advisor_backend.cluster_advice.kernel_cluster_advice import KernelClusterAdvice
 
 
 class TestClusterAdviceBase(unittest.TestCase):
diff --git a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_slow_link_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_slow_link_advice.py
similarity index 73%
rename from profiler/test/ut/advisor/advisor_backend/cluster_advice/test_slow_link_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_slow_link_advice.py
index bf283cda5c338fe606620e2e9d1453a913c9f25f..3a0b3de50fb1a9896d88a22915991274b6466992 100644
--- a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_slow_link_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_slow_link_advice.py
@@ -1,6 +1,20 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import unittest
 
-from advisor_backend.cluster_advice.slow_link_advice import SlowLinkAdvice
+from msprof_analyze.advisor.advisor_backend.cluster_advice.slow_link_advice import SlowLinkAdvice
 
 
 class TestSlowLinkAdvice(unittest.TestCase):
diff --git a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_slow_rank_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_slow_rank_advice.py
similarity index 70%
rename from profiler/test/ut/advisor/advisor_backend/cluster_advice/test_slow_rank_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_slow_rank_advice.py
index 6a45553e1eb9522acc778575e104eb8903b0f7be..8c196faba0708ad8be2f23445c645d268b16642e 100644
--- a/profiler/test/ut/advisor/advisor_backend/cluster_advice/test_slow_rank_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/cluster_advice/test_slow_rank_advice.py
@@ -1,6 +1,20 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import unittest
 
-from advisor_backend.cluster_advice.slow_rank_advice import SlowRankAdvice
+from msprof_analyze.advisor.advisor_backend.cluster_advice.slow_rank_advice import SlowRankAdvice
 
 
 class TestSlowRankAdvice(unittest.TestCase):
diff --git a/profiler/test/ut/advisor/advisor_backend/compute_advice/kernel_details.csv b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/kernel_details.csv
similarity index 100%
rename from profiler/test/ut/advisor/advisor_backend/compute_advice/kernel_details.csv
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/kernel_details.csv
diff --git a/profiler/test/ut/advisor/advisor_backend/compute_advice/test_npu_slow_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/test_npu_slow_advice.py
similarity index 93%
rename from profiler/test/ut/advisor/advisor_backend/compute_advice/test_npu_slow_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/test_npu_slow_advice.py
index 8830d495992cfcd2c26024863f8b644d5b4c6902..bfbb17df5bc5c8b6626764a1f963be883560a111 100644
--- a/profiler/test/ut/advisor/advisor_backend/compute_advice/test_npu_slow_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/test_npu_slow_advice.py
@@ -1,3 +1,17 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import json
 import os
 import shutil
@@ -5,8 +19,8 @@ import stat
 import csv
 import unittest
 
-from advisor_backend.interface import Interface
-from advisor_backend.compute_advice.npu_slow_advice import NpuSlowAdvice
+from msprof_analyze.advisor.advisor_backend.interface import Interface
+from msprof_analyze.advisor.advisor_backend.compute_advice.npu_slow_advice import NpuSlowAdvice
 
 
 class TestNpuSlowAdvice(unittest.TestCase):
@@ -15,18 +29,6 @@ class TestNpuSlowAdvice(unittest.TestCase):
     interface = None
     err_interface = None
 
-    def tearDown(self):
-        if os.path.exists(TestNpuSlowAdvice.ASCEND_PT_DIR):
-            shutil.rmtree(TestNpuSlowAdvice.ASCEND_PT_DIR)
-
-    def setUp(self):
-        if os.path.exists(TestNpuSlowAdvice.ASCEND_PT_DIR):
-            shutil.rmtree(TestNpuSlowAdvice.ASCEND_PT_DIR)
-        if not os.path.exists(TestNpuSlowAdvice.ASCEND_PT_DIR):
-            os.makedirs(TestNpuSlowAdvice.ASCEND_PT_DIR)
-        if not os.path.exists(TestNpuSlowAdvice.OUTPUT_DIR):
-            os.makedirs(TestNpuSlowAdvice.OUTPUT_DIR)
-
     @classmethod
     def get_basic_trace_view(cls):
         # Python pid
@@ -172,6 +174,18 @@ class TestNpuSlowAdvice(unittest.TestCase):
             csv_writer.writerow(csv_row8)
             csv_writer.writerow(csv_row9)
 
+    def setUp(self):
+        if os.path.exists(TestNpuSlowAdvice.ASCEND_PT_DIR):
+            shutil.rmtree(TestNpuSlowAdvice.ASCEND_PT_DIR)
+        if not os.path.exists(TestNpuSlowAdvice.ASCEND_PT_DIR):
+            os.makedirs(TestNpuSlowAdvice.ASCEND_PT_DIR)
+        if not os.path.exists(TestNpuSlowAdvice.OUTPUT_DIR):
+            os.makedirs(TestNpuSlowAdvice.OUTPUT_DIR)
+
+    def tearDown(self):
+        if os.path.exists(TestNpuSlowAdvice.ASCEND_PT_DIR):
+            shutil.rmtree(TestNpuSlowAdvice.ASCEND_PT_DIR)
+
     def test_run_should_return_empty_when_ascend_pt_path_not_exist(self):
         interface = Interface("")
         data = interface.get_data('compute', 'npu_slow')
@@ -201,7 +215,6 @@ class TestNpuSlowAdvice(unittest.TestCase):
         call_stack = NpuSlowAdvice(self.ASCEND_PT_DIR).get_call_stack(data, index_id=0, ts_col="Start Time(us)")
         self.assertEqual(9, len(data))
         self.assertEqual(2, len(slow_op_data))
-        print(call_stack)
         call_stack_res = "/root/torch/module.py\n" \
                          "/root/test/slice.py(116)"
         self.assertEqual(call_stack_res, call_stack)
@@ -217,7 +230,6 @@ class TestNpuSlowAdvice(unittest.TestCase):
         call_stack = NpuSlowAdvice(self.ASCEND_PT_DIR).get_call_stack(data, index_id=0, ts_col="Start Time(us)")
         self.assertEqual(9, len(data))
         self.assertEqual(2, len(slow_op_data))
-        print(call_stack)
         call_stack_res = "/root/test/slice.py(116)\n\r\n" \
                          "/root/torch/module.py"
         self.assertEqual(call_stack_res, call_stack)
diff --git a/profiler/test/ut/advisor/advisor_backend/compute_advice/test_npufused_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/test_npufused_advice.py
similarity index 93%
rename from profiler/test/ut/advisor/advisor_backend/compute_advice/test_npufused_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/test_npufused_advice.py
index 90c9515cb2da3f8f80d13f8848d6947550577686..4670882c05faed698bdd35842c385a0c4f02e2d0 100644
--- a/profiler/test/ut/advisor/advisor_backend/compute_advice/test_npufused_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/compute_advice/test_npufused_advice.py
@@ -1,12 +1,25 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import json
 import os
 import shutil
 import stat
 import csv
 import unittest
-import pytest
 
-from advisor_backend.interface import Interface
+from msprof_analyze.advisor.advisor_backend.interface import Interface
 
 
 class TestComputeAdvice(unittest.TestCase):
@@ -15,18 +28,6 @@ class TestComputeAdvice(unittest.TestCase):
     interface = None
     err_interface = None
 
-    def tearDown(self):
-        if os.path.exists(TestComputeAdvice.TMP_DIR):
-            shutil.rmtree(TestComputeAdvice.TMP_DIR)
-
-    def setUp(self):
-        if os.path.exists(TestComputeAdvice.TMP_DIR):
-            shutil.rmtree(TestComputeAdvice.TMP_DIR)
-        if not os.path.exists(TestComputeAdvice.TMP_DIR):
-            os.makedirs(TestComputeAdvice.TMP_DIR)
-        if not os.path.exists(TestComputeAdvice.OUTPUT_DIR):
-            os.makedirs(TestComputeAdvice.OUTPUT_DIR)
-
     @classmethod
     def get_basic_trace_view(cls):
         # Python pid
@@ -146,6 +147,18 @@ class TestComputeAdvice(unittest.TestCase):
             csv_writer.writerow(csv_row4)
             csv_writer.writerow(csv_row5)
 
+    def setUp(self):
+        if os.path.exists(TestComputeAdvice.TMP_DIR):
+            shutil.rmtree(TestComputeAdvice.TMP_DIR)
+        if not os.path.exists(TestComputeAdvice.TMP_DIR):
+            os.makedirs(TestComputeAdvice.TMP_DIR)
+        if not os.path.exists(TestComputeAdvice.OUTPUT_DIR):
+            os.makedirs(TestComputeAdvice.OUTPUT_DIR)
+
+    def tearDown(self):
+        if os.path.exists(TestComputeAdvice.TMP_DIR):
+            shutil.rmtree(TestComputeAdvice.TMP_DIR)
+
     def test_run_should_return_empty_when_ascend_pt_path_not_exist(self):
         interface = Interface("")
         data = interface.get_data('compute', 'npu_fused')
diff --git a/profiler/test/ut/advisor/advisor_backend/prof_bean_advisor/test_cluster_step_trace_time_bean.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/prof_bean_advisor/test_cluster_step_trace_time_bean.py
similarity index 67%
rename from profiler/test/ut/advisor/advisor_backend/prof_bean_advisor/test_cluster_step_trace_time_bean.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/prof_bean_advisor/test_cluster_step_trace_time_bean.py
index 7b141ae08865c36af83ae65afd2dd713c9e473a6..d63176dc1d399c5f94703c3aa8a2a522143bcfc7 100644
--- a/profiler/test/ut/advisor/advisor_backend/prof_bean_advisor/test_cluster_step_trace_time_bean.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/prof_bean_advisor/test_cluster_step_trace_time_bean.py
@@ -1,12 +1,24 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import os
 import stat
 import shutil
 import unittest
-from unittest import mock
-from unittest.mock import MagicMock
 
-from common_func.constant import Constant
-from advisor_backend.prof_bean_advisor.cluster_step_trace_time_bean import ClusterStepTraceTimeBean
+from msprof_analyze.advisor.advisor_backend.prof_bean_advisor.cluster_step_trace_time_bean \
+    import ClusterStepTraceTimeBean
 
 
 class TestClusterStepTraceTimeBean(unittest.TestCase):
diff --git a/profiler/test/ut/advisor/advisor_backend/timeline_advice/test_opsche_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/test_opsche_advice.py
similarity index 55%
rename from profiler/test/ut/advisor/advisor_backend/timeline_advice/test_opsche_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/test_opsche_advice.py
index 00024746ccbe1119730181ac92df5e181b29fa86..5854ffa15ef29f5ce528a9e7d12d2429e6f30dd4 100644
--- a/profiler/test/ut/advisor/advisor_backend/timeline_advice/test_opsche_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/test_opsche_advice.py
@@ -1,11 +1,21 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import os
-import shutil
-import stat
-import json
 import unittest
-import pytest
 
-from advisor_backend.interface import Interface
+from msprof_analyze.advisor.advisor_backend.interface import Interface
 
 
 class TestOpScheAdvice(unittest.TestCase):
diff --git a/profiler/test/ut/advisor/advisor_backend/timeline_advice/test_optimizer_advice.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/test_optimizer_advice.py
similarity index 77%
rename from profiler/test/ut/advisor/advisor_backend/timeline_advice/test_optimizer_advice.py
rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/test_optimizer_advice.py
index de9fbcb5ca9122d04c28299dc997c701a31e7962..3ed142f7f7a966c6ca0aa7f797bcb3282c214fb8 100644
--- a/profiler/test/ut/advisor/advisor_backend/timeline_advice/test_optimizer_advice.py
+++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/test_optimizer_advice.py
@@ -1,11 +1,24 @@
+# Copyright (c) 2025, Huawei Technologies Co., Ltd.
+# All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os import shutil import stat import json import unittest -import pytest -from advisor_backend.interface import Interface +from msprof_analyze.advisor.advisor_backend.interface import Interface class TestOptimizerAdvice(unittest.TestCase): diff --git a/profiler/test/ut/advisor/advisor_backend/timeline_advice/trace_view.json b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/trace_view.json similarity index 100% rename from profiler/test/ut/advisor/advisor_backend/timeline_advice/trace_view.json rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/timeline_advice/trace_view.json diff --git a/profiler/msprof_analyze/test/ut/advisor/advisor_backend/tools/__init__.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/tools/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/ut/advisor/advisor_backend/tools/tool.py b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/tools/tool.py similarity index 53% rename from profiler/test/ut/advisor/advisor_backend/tools/tool.py rename to profiler/msprof_analyze/test/ut/advisor/advisor_backend/tools/tool.py index 6c6f690d3a1b091a67babbd1708dc8076fbb26a3..2de5b1e3da5d6ba200df0aba296c9d32ae961ecc 100644 --- a/profiler/test/ut/advisor/advisor_backend/tools/tool.py +++ b/profiler/msprof_analyze/test/ut/advisor/advisor_backend/tools/tool.py @@ -1,3 +1,17 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os import re import shutil diff --git a/profiler/msprof_analyze/test/ut/advisor/common/__init__.py b/profiler/msprof_analyze/test/ut/advisor/common/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/ut/advisor/common/test_enum_params_parser.py b/profiler/msprof_analyze/test/ut/advisor/common/test_enum_params_parser.py similarity index 72% rename from profiler/test/ut/advisor/common/test_enum_params_parser.py rename to profiler/msprof_analyze/test/ut/advisor/common/test_enum_params_parser.py index 8e5ddb680444c944898201bb58f7b71a520b924b..5d11af12781eb5845d82e5415253fa93be99cf6b 100644 --- a/profiler/test/ut/advisor/common/test_enum_params_parser.py +++ b/profiler/msprof_analyze/test/ut/advisor/common/test_enum_params_parser.py @@ -1,13 +1,21 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. import unittest -import sys -import os -profiler_root_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(__file__))))) -sys.path.append(os.path.join(profiler_root_dir, "compare_tools")) -sys.path.append(os.path.join(profiler_root_dir, "cluster_analyse")) - -from profiler.advisor.common.enum_params_parser import EnumParamsParser -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env +from msprof_analyze.advisor.common.enum_params_parser import EnumParamsParser +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env class TestEnumParamsParser(unittest.TestCase): @@ -17,7 +25,7 @@ class TestEnumParamsParser(unittest.TestCase): def setUp(self) -> None: self.enum_params_parser = EnumParamsParser() - self.argument_keys = sorted(["cann_version", "torch_version", "analysis_dimensions", "profiling_type"]) + self.argument_keys = sorted(["cann_version", "torch_version", "analysis_dimensions", "profiling_type", "mindspore_version"]) self.env_keys = ["ADVISOR_ANALYZE_PROCESSES", "DISABLE_PROFILING_COMPARISON", "DISABLE_AFFINITY_API"] def test_get_keys(self): diff --git a/profiler/test/ut/advisor/communication_advice/test_bandwidth_contention_advice.py b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_bandwidth_contention_advice.py similarity index 91% rename from profiler/test/ut/advisor/communication_advice/test_bandwidth_contention_advice.py rename to profiler/msprof_analyze/test/ut/advisor/communication_advice/test_bandwidth_contention_advice.py index fd1c40d8c40cad63736230e08b5b21c2b16009d9..dbe153027baca7e5c8e8b8424fd04181ee60b3f4 100644 --- a/profiler/test/ut/advisor/communication_advice/test_bandwidth_contention_advice.py +++ b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_bandwidth_contention_advice.py @@ -1,11 +1,26 @@ +# Copyright (c) 2025, 
Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os import shutil import stat import json import csv import unittest -from profiler.advisor.interface.interface import Interface -from profiler.advisor.common.analyzer_scopes import SupportedScopes + +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes class TestBandwidthContentionAdvice(unittest.TestCase): @@ -212,6 +227,6 @@ class TestBandwidthContentionAdvice(unittest.TestCase): dimension = Interface.COMMUNICATION scope = SupportedScopes.BANDWIDTH_CONTENTION_DETECTION result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path=self.TMP_DIR) - self.assertEqual(2, len(result.data.get("Bandwidth Contention Analysis", []))) - self.assertEqual(1, len(result.data.get("Bandwidth Contention Analysis", []).get('data'))) + self.assertEqual(2, len(result.data.get("带宽分析", []))) + self.assertEqual(1, len(result.data.get("带宽分析", []).get('data'))) result.clear() diff --git a/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_byte_alignment_analyzer.py b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_byte_alignment_analyzer.py new file mode 100644 index 0000000000000000000000000000000000000000..9b4f02b41701d69a6cb539427df79f66bf24d143 --- /dev/null +++ 
b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_byte_alignment_analyzer.py @@ -0,0 +1,138 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import shutil +import stat +import json +import unittest + +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes + + +class TestByteAlignmentAnalyzer(unittest.TestCase): + TMP_DIR = "./ascend_pt" + OUTPUT_DIR = "./ascend_pt/ASCEND_PROFILER_OUTPUT" + interface = None + err_interface = None + + def setUp(self): + if os.path.exists(TestByteAlignmentAnalyzer.TMP_DIR): + shutil.rmtree(TestByteAlignmentAnalyzer.TMP_DIR) + if not os.path.exists(TestByteAlignmentAnalyzer.TMP_DIR): + os.makedirs(TestByteAlignmentAnalyzer.TMP_DIR) + if not os.path.exists(TestByteAlignmentAnalyzer.OUTPUT_DIR): + os.makedirs(TestByteAlignmentAnalyzer.OUTPUT_DIR) + self.clear_htmls() + + def tearDown(self): + if os.path.exists(TestByteAlignmentAnalyzer.TMP_DIR): + shutil.rmtree(TestByteAlignmentAnalyzer.TMP_DIR) + self.clear_htmls() + + @classmethod + def clear_htmls(cls): + current_path = os.path.dirname(os.path.abspath(__file__)) + for filename in os.listdir(current_path): + # Check whether the file name starts with "mstt" + if filename.startswith("mstt"): + # Build the file's full path + file_path = os.path.join(current_path, filename) + # Delete the file + os.remove(file_path) + + @classmethod + def 
create_trace_view(cls): + # Python pid + py_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 1, "args": {"name": "Python"}} + # ascend pid + ascend_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 4, "args": {"name": "Ascend Hardware"}} + # ascend pid + cann_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 5, "args": {"name": "HCCL"}} + # hccl ops + hccl_event1 = { + "name": "hcom_broadcast__661_0_1", "pid": 5, "tid": 0, "ts": "1723545784535521.354", + "dur": 40.3, "args": {"connection_id": 64349, "model id": 4294967295, "data_type": "INT64", + "alg_type": "RING-RING", "count": 256}, "ph": "X" + } + # python ops + mem_event1 = { + "name": "Memcpy", "pid": 5, "tid": 1, "ts": "1723545784535549.654", "dur": 1.26, + "args": { + "notify_id": "18446744073709551615", "duration estimated(us)": 0.6530569948186529, + "stream id": 5, "task id": 8342, "context id": 15, "task type": "Memcpy", "src rank": 0, + "dst rank": 1, "transport type": "SDMA", "size(Byte)": 3024, "data type": "INVALID_TYPE", + "link type": "HCCS", "bandwidth(GB/s)": 0.8126984126984127, "model id": 4294967295 + }, "ph": "X" + } + hccl_event2 = { + "name": "hcom_broadcast__661_1_1", "pid": 5, "tid": 0, "ts": "1723545784535812.974", + "dur": 38.18, "args": { + "connection_id": 64366, "model id": 4294967295, "data_type": "INT64", + "alg_type": "RING-RING", "count": 256}, "ph": "X" + } + reduce_event2 = { + "name": "Reduce_inline", "pid": 5, "tid": 1, "ts": "1723545784535814.854", "dur": 0.6, + "args": { + "notify_id": "18446744073709551615", "duration estimated(us)": 0.7061139896373057, + "stream id": 5, "task id": 8346, "context id": 1, "task type": "Reduce_inline", + "src rank": 0, "dst rank": 0, "transport type": "SDMA", "size(Byte)": 3048, + "data type": "INVALID_TYPE", "link type": "HCCS", + "bandwidth(GB/s)": 3.4133333333333336, "model id": 4294967295 + }, "ph": "X" + } + hccl_event3 = { + "name": "hcom_broadcast__661_2_1", "pid": 5, "tid": 0, "ts": 
"1723545784536062.654", + "dur": 39.06, "args": { + "connection_id": 64398, "model id": 4294967295, "data_type": "FP32", "alg_type": "RING-RING", + "count": 256 + }, "ph": "X" + } + mem_event2 = { + "name": "Memcpy", "pid": 5, "tid": 1, "ts": "1723545784536090.214", "dur": 1.26, + "args": { + "notify_id": "18446744073709551615", "duration estimated(us)": 0.6265284974093264, + "stream id": 5, "task id": 8350, "context id": 15, "task type": "Memcpy", "src rank": 0, + "dst rank": 1, "transport type": "SDMA", "size(Byte)": 512, "data type": "INVALID_TYPE", + "link type": "PCIE", "bandwidth(GB/s)": 0.40634920634920635, "model id": 4294967295 + }, "ph": "X" + } + reduce_event3 = { + "name": "Reduce_inline", "pid": 5, "tid": 1, "ts": "1723545784536309.474", "dur": 0.58, + "args": { + "notify_id": "18446744073709551615", "duration estimated(us)": 0.7061139896373057, + "stream id": 5, "task id": 8354, "context id": 1, "task type": "Reduce_inline", + "src rank": 0, "dst rank": 0, "transport type": "SDMA", "size(Byte)": 3048, + "data type": "INVALID_TYPE", "link type": "HCCS", + "bandwidth(GB/s)": 3.5310344827586206, "model id": 4294967295 + }, "ph": "X" + } + raw_data = [ + py_pid_data, ascend_pid_data, cann_pid_data, hccl_event1, mem_event1, hccl_event2, reduce_event2, + hccl_event3, mem_event2, reduce_event3 + ] + with os.fdopen(os.open(f"{TestByteAlignmentAnalyzer.OUTPUT_DIR}/trace_view.json", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w') as fp: + fp.write(json.dumps(raw_data)) + + def test_run_should_run_success_when_communication_ops_not_aligned(self): + self.create_trace_view() + interface = Interface(profiling_path=self.TMP_DIR) + dimension = Interface.COMMUNICATION + scope = SupportedScopes.BYTE_ALIGNMENT_DETECTION + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path=self.TMP_DIR) + self.assertEqual(2, len(result.data.get("字节对齐分析", []))) + self.assertEqual(3, len(result.data.get("字节对齐分析", 
[]).get('data'))) + result.clear() diff --git a/profiler/test/ut/advisor/communication_advice/test_packet_advice.py b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_packet_advice.py similarity index 88% rename from profiler/test/ut/advisor/communication_advice/test_packet_advice.py rename to profiler/msprof_analyze/test/ut/advisor/communication_advice/test_packet_advice.py index 9459ccfbe71560f78cb4bd1cf983df578b75bce6..34c3e2acb6405f47779936f07a0758986b32790a 100644 --- a/profiler/test/ut/advisor/communication_advice/test_packet_advice.py +++ b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_packet_advice.py @@ -1,11 +1,25 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os import shutil import stat import json - import unittest -from profiler.advisor.interface.interface import Interface -from profiler.advisor.common.analyzer_scopes import SupportedScopes + +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes class TestPacketAdvice(unittest.TestCase): @@ -170,6 +184,6 @@ class TestPacketAdvice(unittest.TestCase): dimension = Interface.COMMUNICATION scope = SupportedScopes.PACKET result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path=self.TMP_DIR) - self.assertEqual(2, len(result.data.get("Packet Analysis", []))) - self.assertEqual(1, len(result.data.get("Packet Analysis", []).get('data'))) + self.assertEqual(2, len(result.data.get("包分析", []))) + self.assertEqual(1, len(result.data.get("包分析", []).get('data'))) result.clear() diff --git a/profiler/test/ut/advisor/communication_advice/test_rdma_retransmission_advice.py b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_rdma_retransmission_advice.py similarity index 88% rename from profiler/test/ut/advisor/communication_advice/test_rdma_retransmission_advice.py rename to profiler/msprof_analyze/test/ut/advisor/communication_advice/test_rdma_retransmission_advice.py index 35a04d41568b40889a28a64bccbff70fd2e3f9b5..aec8bd6b032eb9b139626ecaf5129e35ed6128c8 100644 --- a/profiler/test/ut/advisor/communication_advice/test_rdma_retransmission_advice.py +++ b/profiler/msprof_analyze/test/ut/advisor/communication_advice/test_rdma_retransmission_advice.py @@ -1,11 +1,25 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os import shutil import stat import json - import unittest -from profiler.advisor.interface.interface import Interface -from profiler.advisor.common.analyzer_scopes import SupportedScopes + +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes class TestRdmaAdvice(unittest.TestCase): @@ -165,6 +179,6 @@ class TestRdmaAdvice(unittest.TestCase): dimension = Interface.COMMUNICATION scope = SupportedScopes.COMMUNICATION_RETRANSMISSION_DETECTION result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path=self.TMP_DIR) - self.assertEqual(2, len(result.data.get("Comm Retransmission Analysis", []))) - self.assertEqual(2, len(result.data.get("Comm Retransmission Analysis", []).get('data'))) + self.assertEqual(2, len(result.data.get("通信重传分析", []))) + self.assertEqual(2, len(result.data.get("通信重传分析", []).get('data'))) result.clear() diff --git a/profiler/msprof_analyze/test/ut/advisor/compute_advice/data/kernel_details.csv b/profiler/msprof_analyze/test/ut/advisor/compute_advice/data/kernel_details.csv new file mode 100644 index 0000000000000000000000000000000000000000..8a255e939ae2ff4e781c7a356b342815838e2ff3 --- /dev/null +++ b/profiler/msprof_analyze/test/ut/advisor/compute_advice/data/kernel_details.csv @@ -0,0 +1,30 @@ +Step Id,Model ID,Task ID,Stream ID,Name,Type,OP State,Accelerator Core,Start Time(us),Duration(us),Wait Time(us),Block Dim,Mix Block Dim,HF32 Eligible,Input Shapes,Input Data Types,Input Formats,Output Shapes,Output 
Data Types,Output Formats,Context ID,aicore_time(us),aic_total_cycles,aic_mac_time(us),aic_mac_ratio,aic_scalar_time(us),aic_scalar_ratio,aic_mte1_time(us),aic_mte1_ratio,aic_mte2_time(us),aic_mte2_ratio,aic_fixpipe_time(us),aic_fixpipe_ratio,aic_icache_miss_rate,aiv_time(us),aiv_total_cycles,aiv_vec_time(us),aiv_vec_ratio,aiv_scalar_time(us),aiv_scalar_ratio,aiv_mte2_time(us),aiv_mte2_ratio,aiv_mte3_time(us),aiv_mte3_ratio,aiv_icache_miss_rate,cube_utilization(%) +19,4294967295,61653,2,aclnnMatmul_MatMulCommon_MatMulV2,MatMulV2,dynamic,AI_CORE,"1736413971558972.912 ",185.504,1.087,16,0,NO,"""81920,4096;8192,512""",DT_BF16;DT_BF16,ND;ND,"""4096,512""",DT_BF16,ND,N/A,183.87,5295467,151.425,0.824,88.03,0.479,119.148,0.648,177.314,0.964,5.736,0.031,0.001,0,0,0,0,0,0,0,0,0,0,0,79.295 +19,4294967295,61669,2,aclnnMatmul_MatMulV3Common_MatMulV3,MatMulV3,dynamic,AI_CORE,"1736413971560588.764 ",501.17,2.2,20,0,NO,"""81920,1536;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,478.701,17233251,356.349,0.744,118.087,0.247,296.009,0.618,452.112,0.944,35.833,0.075,0.001,0,0,0,0,0,0,0,0,0,0,0,95.517 +19,4294967295,61694,2,aclnnMatmul_MatMulCommon_MatMulV2,MatMulV2,dynamic,AI_CORE,"1736413971565213.257 ",186.823,1.178,16,0,NO,"""81920,4096;8192,512""",DT_BF16;DT_BF16,ND;ND,"""4096,512""",DT_BF16,ND,N/A,183.728,5291376,151.502,0.825,87.902,0.478,118.519,0.645,177.654,0.967,5.773,0.031,0.001,0,0,0,0,0,0,0,0,0,0,0,78.675 +19,4294967295,61710,2,aclnnMatmul_MatMulV3Common_MatMulV3,MatMulV3,dynamic,AI_CORE,"1736413971566843.489 ",516.991,2.33,20,0,NO,"""81920,1536;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,491.775,17703905,356.249,0.724,118.59,0.241,295.046,0.6,463.696,0.943,37.671,0.077,0.001,0,0,0,0,0,0,0,0,0,0,0,95.123 +19,4294967295,61735,2,aclnnMatmul_MatMulCommon_MatMulV2,MatMulV2,dynamic,AI_CORE,"1736413971571596.404 
",187.724,0.766,16,0,NO,"""81920,4096;8192,512""",DT_BF16;DT_BF16,ND;ND,"""4096,512""",DT_BF16,ND,N/A,184.904,5325221,151.489,0.819,87.893,0.475,118.63,0.642,178.815,0.967,5.77,0.031,0.001,0,0,0,0,0,0,0,0,0,0,0,78.798 +19,4294967295,61751,2,aclnnMatmul_MatMulV3Common_MatMulV3,MatMulV3,dynamic,AI_CORE,"1736413971573223.437 ",514.87,2.15,20,0,NO,"""81920,1536;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,486.931,17529512,356.117,0.731,118.847,0.244,295.529,0.607,457.002,0.939,37.938,0.078,0.001,0,0,0,0,0,0,0,0,0,0,0,94.574 +19,4294967295,61776,2,aclnnMatmul_MatMulCommon_MatMulV2,MatMulV2,dynamic,AI_CORE,"1736413971577931.851 ",190.544,1.367,16,0,NO,"""81920,4096;8192,512""",DT_BF16;DT_BF16,ND;ND,"""4096,512""",DT_BF16,ND,N/A,187.073,5387702,151.741,0.811,87.935,0.47,117.467,0.628,181.043,0.968,5.803,0.031,0.001,0,0,0,0,0,0,0,0,0,0,0,78.543 +19,4294967295,61792,2,aclnnMatmul_MatMulV3Common_MatMulV3,MatMulV3,dynamic,AI_CORE,"1736413971579566.403 ",504.071,2.28,20,0,NO,"""81920,1536;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,485.542,17479517,356.283,0.734,117.755,0.243,296.421,0.61,455.064,0.937,37.75,0.078,0.001,0,0,0,0,0,0,0,0,0,0,0,96.324 +19,4294967295,13792,2,aclnnMatmul_MatMulV3Common_MatMulV5,MatMulV3,dynamic,AI_CORE,"1736413974248200.543 ",521.31,2.22,20,0,NO,"""8192,15365;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,499.234,17972434,356.364,0.714,117.639,0.236,295.58,0.592,471.784,0.945,35.825,0.072,0.001,0,0,0,0,0,0,0,0,0,0,0,95.765 +19,4294967295,13792,2,aclnnMatmul_MatMulV3Common_MatMulV5,MatMulV3,dynamic,AI_CORE,"1736413974248200.543 ",521.31,2.22,20,0,NO,"""8192,15365;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,499.234,17972434,356.364,0.714,117.639,0.236,295.58,0.592,471.784,0.945,35.825,0.072,0.001,0,0,0,0,0,0,0,0,0,0,0,95.765 +19,4294967295,13792,2,aclnnMatmul_MatMulV3Common_MatMulV5,MatMulV3,dynamic,AI_CORE,"1736413974248200.543 
",521.31,2.22,20,0,NO,"""8192,15365;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,499.234,17972434,356.364,0.714,117.639,0.236,295.58,0.592,471.784,0.945,35.825,0.072,0.001,0,0,0,0,0,0,0,0,0,0,0,95.765 +19,4294967295,13792,2,aclnnMatmul_MatMulV3Common_MatMulV5,MatMulV3,dynamic,AI_CORE,"1736413974248200.543 ",521.31,2.22,20,0,NO,"""8192,15365;8192,4096""",DT_BF16;DT_BF16,ND;ND,"""1536,4096""",DT_BF16,ND,N/A,499.234,17972434,356.364,0.714,117.639,0.236,295.58,0.592,471.784,0.945,35.825,0.072,0.001,0,0,0,0,0,0,0,0,0,0,0,95.765 +19,4294967295,60679,2,aclnnFlashAttentionScore_FlashAttentionScore_FlashAttentionScore,FlashAttentionScore,dynamic,MIX_AIC,"1736413971411629.128 ",410.188,1.53,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;;;;4096,4096;;;;;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;UINT8;DT_BF16;BOOL;INT64;INT64;INT64;INT64;INT64,NCL;NCL;NCL;ND;ND;ND;ND;ND;ND;ND;ND;ND,"""2,4,4096,8;2,4,4096,8;;4096,2,512""",FLOAT;FLOAT;DT_BF16;DT_BF16,ND;ND;ND;ND,0,366.147,13181275,129.055,0.352,352.275,0.962,108.364,0.296,172.86,0.872,216.141,0.59,0.003,365.782,26336326,228.687,0.625,137.979,0.377,118.603,0.324,71.448,0.195,0.013,89.263 +19,4294967295,60707,2,aclnnFlashAttentionScore_FlashAttentionScore_FlashAttentionScore,FlashAttentionScore,dynamic,MIX_AIC,"1736413971415611.468 ",406.128,1.279,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;;;;4096,4096;;;;;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;UINT8;DT_BF16;BOOL;INT64;INT64;INT64;INT64;INT64,NCL;NCL;NCL;ND;ND;ND;ND;ND;ND;ND;ND;ND,"""2,4,4096,8;2,4,4096,8;;4096,2,512""",FLOAT;FLOAT;DT_BF16;DT_BF16,ND;ND;ND;ND,0,358.77,12915719,128.96,0.359,345.096,0.962,108.337,0.302,168.284,0.869,209.057,0.583,0.003,358.308,25798146,228.693,0.638,137.809,0.385,108.679,0.303,70.099,0.196,0.013,88.339 +19,4294967295,60735,2,aclnnFlashAttentionScore_FlashAttentionScore_FlashAttentionScore,FlashAttentionScore,dynamic,MIX_AIC,"1736413971420248.800 
",407.008,0.84,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;;;;4096,4096;;;;;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;UINT8;DT_BF16;BOOL;INT64;INT64;INT64;INT64;INT64,NCL;NCL;NCL;ND;ND;ND;ND;ND;ND;ND;ND;ND,"""2,4,4096,8;2,4,4096,8;;4096,2,512""",FLOAT;FLOAT;DT_BF16;DT_BF16,ND;ND;ND;ND,0,359.702,12949284,128.975,0.359,346.306,0.963,108.43,0.301,166.899,0.864,209.018,0.581,0.003,359.274,25867705,228.693,0.637,138.438,0.385,107.723,0.3,70.146,0.195,0.013,88.377 +19,4294967295,60763,2,aclnnFlashAttentionScore_FlashAttentionScore_FlashAttentionScore,FlashAttentionScore,dynamic,MIX_AIC,"1736413971424592.447 ",405.228,1.35,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;;;;4096,4096;;;;;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;UINT8;DT_BF16;BOOL;INT64;INT64;INT64;INT64;INT64,NCL;NCL;NCL;ND;ND;ND;ND;ND;ND;ND;ND;ND,"""2,4,4096,8;2,4,4096,8;;4096,2,512""",FLOAT;FLOAT;DT_BF16;DT_BF16,ND;ND;ND;ND,0,359.793,12952532,128.923,0.358,345.768,0.961,108.411,0.301,167.379,0.865,208.79,0.58,0.003,359.294,25869164,228.691,0.637,138.411,0.385,107.868,0.3,70.163,0.195,0.013,88.788 +19,4294967295,61655,2,aclnnFlashAttentionScoreGrad_FlashAttentionScoreGrad_FlashAttentionScoreGrad,FlashAttentionScoreGrad,dynamic,MIX_AIC,"1736413971559180.676 ",762.215,1.37,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;4096,2,512;4096,4096;2,4,4096,8;2,4,4096,8;;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;BOOL;FLOAT;FLOAT;DT_BF16;DT_BF16;INT64,NCL;NCL;NCL;NCL;ND;NCHW;NCHW;ND;NCL;ND,"""4096,2,512;4096,2,512;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16,ND;ND;ND;ND,0,755.664,27203907,344.023,0.455,592.472,0.784,266.388,0.353,397.091,0.525,589.726,0.525,0.004,755.04,54362915,318.452,0.422,184.623,0.245,206.78,0.274,152.973,0.203,0.006,99.141 +19,4294967295,61696,2,aclnnFlashAttentionScoreGrad_FlashAttentionScoreGrad_FlashAttentionScoreGrad,FlashAttentionScoreGrad,dynamic,MIX_AIC,"1736413971565420.821 
",763.215,1.189,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;4096,2,512;4096,4096;2,4,4096,8;2,4,4096,8;;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;BOOL;FLOAT;FLOAT;DT_BF16;DT_BF16;INT64,NCL;NCL;NCL;NCL;ND;NCHW;NCHW;ND;NCL;ND,"""4096,2,512;4096,2,512;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16,ND;ND;ND;ND,0,757.83,27281885,344.047,0.454,595.954,0.786,266.123,0.351,389.105,0.513,576.226,0.513,0.004,757.046,54507345,318.443,0.421,188.292,0.249,200.176,0.264,162.113,0.214,0.006,99.294 +19,4294967295,61737,2,aclnnFlashAttentionScoreGrad_FlashAttentionScoreGrad_FlashAttentionScoreGrad,FlashAttentionScoreGrad,dynamic,MIX_AIC,"1736413971571804.228 ",757.095,0.88,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;4096,2,512;4096,4096;2,4,4096,8;2,4,4096,8;;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;BOOL;FLOAT;FLOAT;DT_BF16;DT_BF16;INT64,NCL;NCL;NCL;NCL;ND;NCHW;NCHW;ND;NCL;ND,"""4096,2,512;4096,2,512;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16,ND;ND;ND;ND,0,750.605,27021778,343.983,0.458,586.708,0.782,266.304,0.355,392.522,0.523,584.432,0.523,0.004,749.913,53993736,318.436,0.425,188.508,0.251,207.668,0.277,152.634,0.204,0.006,99.143 +19,4294967295,61778,2,aclnnFlashAttentionScoreGrad_FlashAttentionScoreGrad_FlashAttentionScoreGrad,FlashAttentionScoreGrad,dynamic,MIX_AIC,"1736413971578144.095 ",755.915,1.22,20,40,NO,"""4096,2,512;4096,2,512;4096,2,512;4096,2,512;4096,4096;2,4,4096,8;2,4,4096,8;;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;BOOL;FLOAT;FLOAT;DT_BF16;DT_BF16;INT64,NCL;NCL;NCL;NCL;ND;NCHW;NCHW;ND;NCL;ND,"""4096,2,512;4096,2,512;4096,2,512;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16,ND;ND;ND;ND,0,750.152,27005467,344.115,0.459,579.317,0.772,266.08,0.355,398.019,0.531,587.37,0.531,0.004,749.348,53953058,318.444,0.425,186.908,0.249,207.068,0.276,151.329,0.202,0.006,99.238 +19,4294967295,60763,2,aclnnFlashAttentionScore_FlashAttentionScore_FlashAttentionScore_varlen,FlashAttentionScore,dynamic,MIX_AIC,"1736413971424592.447 
",405.228,1.35,20,40,NO,"""4096,2,511;4096,2,512;4096,2,512;;;;4096,4096;;;;;""",DT_BF16;DT_BF16;DT_BF16;DT_BF16;UINT8;DT_BF16;BOOL;INT64;INT64;INT64;INT64;INT64,NCL;NCL;NCL;ND;ND;ND;ND;ND;ND;ND;ND;ND,"""2,3,4096,8;2,4,4096,8;;4096,2,512""",FLOAT;FLOAT;DT_BF16;DT_BF16,ND;ND;ND;ND,0,359.793,12952532,128.923,0.358,345.768,0.961,108.411,0.301,167.379,0.465,208.79,0.58,0.003,359.294,25869164,228.691,0.637,138.411,0.385,107.868,0.3,70.163,0.195,0.013,88.788 +19,4294967295,60683,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413971412768.871 ",26.78,0.485,40,0,NO,"""512,2,4096;512,2,4096""",DT_BF16;DT_BF16,NCL;NCL,"""512,2,4096""",DT_BF16,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,24.19,1741674,5.986,0.247,1.352,0.056,20.363,0.842,3.195,0.132,0.027,0 +19,4294967295,60690,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413971414677.549 ",31.201,0.664,40,0,NO,"""512,2,4096;512,2,4096""",DT_BF16;DT_BF16,NCL;NCL,"""512,2,4096""",DT_BF16,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,28.617,2060443,5.986,0.209,1.444,0.05,25.005,0.874,3.336,0.117,0.026,0 +19,4294967295,60711,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413971416743.250 ",27.021,1.246,40,0,NO,"""512,2,4096;512,2,4096""",DT_BF16;DT_BF16,NCL;NCL,"""512,2,4096""",DT_BF16,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,24.304,1749862,5.986,0.246,1.258,0.052,20.424,0.84,3.23,0.133,0.027,0 +19,4294967295,60718,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413971419318.962 ",25.08,0.984,40,0,NO,"""512,2,4096;512,2,4096""",DT_BF16;DT_BF16,NCL;NCL,"""512,2,4096""",DT_BF16,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,22.47,1617840,5.989,0.267,2.009,0.089,18.809,0.837,3.191,0.142,0.024,0 +19,4294967295,13907,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413974268377.206 ",1.38,31.48,1,0,NO,""";""",FLOAT;FLOAT,ND;ND,"""""",FLOAT,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,0.883,1589,0.027,0.03,0.265,0.3,0.18,0.204,0.108,0.123,0.182,0 +19,4294967295,13910,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413974268502.128 
",1.46,17.48,1,0,NO,""";""",FLOAT;FLOAT,ND;ND,"""""",FLOAT,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,0.948,1706,0.027,0.028,0.276,0.291,0.217,0.229,0.127,0.134,0.174,0 +19,4294967295,13913,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413974268605.410 ",1.5,0.09,1,0,NO,""";""",FLOAT;FLOAT,ND;ND,"""""",FLOAT,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,0.96,1728,0.027,0.028,0.268,0.28,0.221,0.23,0.132,0.137,0.145,0 +19,4294967295,13916,2,aclnnAdd_AddAiCore_Add,Add,dynamic,AI_VECTOR_CORE,"1736413974268747.953 ",1.58,28.28,1,0,NO,""";""",FLOAT;FLOAT,ND;ND,"""""",FLOAT,ND,N/A,0,0,0,0,0,0,0,0,0,0,0,0,0,1.107,1993,0.027,0.024,0.426,0.384,0.201,0.181,0.118,0.106,0.162,0 \ No newline at end of file diff --git a/profiler/msprof_analyze/test/ut/advisor/compute_advice/test_ai_core_performance_advice.py b/profiler/msprof_analyze/test/ut/advisor/compute_advice/test_ai_core_performance_advice.py new file mode 100644 index 0000000000000000000000000000000000000000..c8196f5eefdee0c1f3819916b261a002017ba987 --- /dev/null +++ b/profiler/msprof_analyze/test/ut/advisor/compute_advice/test_ai_core_performance_advice.py @@ -0,0 +1,85 @@ +# Copyright (c) Huawei Technologies Co., Ltd. 2025. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os
+import shutil
+
+import unittest
+from msprof_analyze.advisor.interface.interface import Interface
+from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes
+
+
+class TestAICorePerformanceAdvice(unittest.TestCase):
+    TMP_DIR = "./ascend_pt"
+    OUTPUT_DIR = "./ascend_pt/ASCEND_PROFILER_OUTPUT"
+    interface = None
+    err_interface = None
+
+    @classmethod
+    def clear_htmls(cls):
+        current_path = os.path.dirname(os.path.abspath(__file__))
+        for filename in os.listdir(current_path):
+            # Check whether the filename starts with "mstt"
+            if filename.startswith("mstt"):
+                # Build the file's full path
+                file_path = os.path.join(current_path, filename)
+                # Remove the file
+                os.remove(file_path)
+
+    @classmethod
+    def copy_kernel_details(cls, path):
+        # Define source and destination paths
+        source_csv_path = os.path.join(os.path.dirname(__file__), 'data', path)
+        destination_csv_path = f"{TestAICorePerformanceAdvice.OUTPUT_DIR}/kernel_details.csv"
+
+        # Check if source CSV file exists
+        if not os.path.exists(source_csv_path):
+            raise FileNotFoundError(f"test data file not found:{source_csv_path}")
+
+        # Ensure the output directory exists
+        if not os.path.exists(TestAICorePerformanceAdvice.OUTPUT_DIR):
+            os.makedirs(TestAICorePerformanceAdvice.OUTPUT_DIR)
+
+        # Copy the CSV file from source to destination
+        shutil.copyfile(source_csv_path, destination_csv_path)
+
+    def tearDown(self):
+        if os.path.exists(TestAICorePerformanceAdvice.TMP_DIR):
+            shutil.rmtree(TestAICorePerformanceAdvice.TMP_DIR)
+        self.clear_htmls()
+
+    def setUp(self):
+        if os.path.exists(TestAICorePerformanceAdvice.TMP_DIR):
+            shutil.rmtree(TestAICorePerformanceAdvice.TMP_DIR)
+        if not os.path.exists(TestAICorePerformanceAdvice.TMP_DIR):
+            os.makedirs(TestAICorePerformanceAdvice.TMP_DIR)
+        if not os.path.exists(TestAICorePerformanceAdvice.OUTPUT_DIR):
+            os.makedirs(TestAICorePerformanceAdvice.OUTPUT_DIR)
+        self.clear_htmls()
+
+    def test_ai_core_performance_total(self):
+        file_path = "kernel_details.csv"
+        self.copy_kernel_details(file_path)
+ interface = Interface(profiling_path=self.TMP_DIR) + dimension = Interface.COMPUTATION + scope = SupportedScopes.AICORE_PERFORMANCE_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path=self.TMP_DIR) + self.assertLess(1, len(result.data.get("Cube算子性能分析").get("data")[0])) + self.assertLess(1, len(result.data.get("Cube算子性能分析").get("data")[1])) + self.assertLess(1, len(result.data.get("Cube算子性能分析").get("data")[2])) + self.assertLess(1, len(result.data.get("FA算子性能分析").get("data")[0])) + self.assertLess(1, len(result.data.get("FA算子性能分析").get("data")[1])) + self.assertLess(1, len(result.data.get("FA算子性能分析").get("data")[2])) + self.assertLess(1, len(result.data.get("Vector算子性能分析").get("data")[0])) + self.assertLess(1, len(result.data.get("Vector算子性能分析").get("data")[1])) + result.clear() \ No newline at end of file diff --git a/profiler/test/ut/advisor/compute_advice/test_frequency_advice.py b/profiler/msprof_analyze/test/ut/advisor/compute_advice/test_frequency_advice.py similarity index 84% rename from profiler/test/ut/advisor/compute_advice/test_frequency_advice.py rename to profiler/msprof_analyze/test/ut/advisor/compute_advice/test_frequency_advice.py index 8de2df25d9a7c5d4954919ade19d14a2220b37c1..f316032e4ae70a96fbec4cadf105c4436771d09f 100644 --- a/profiler/test/ut/advisor/compute_advice/test_frequency_advice.py +++ b/profiler/msprof_analyze/test/ut/advisor/compute_advice/test_frequency_advice.py @@ -1,12 +1,26 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import os import shutil import stat import json - import unittest -from profiler.advisor.interface.interface import Interface -from profiler.advisor.common.analyzer_scopes import SupportedScopes -from profiler.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset + +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes +from msprof_analyze.advisor.dataset.timeline_event_dataset import ComputationAnalysisDataset class TestFrequencyAdvice(unittest.TestCase): @@ -16,22 +30,6 @@ class TestFrequencyAdvice(unittest.TestCase): interface = None err_interface = None - def tearDown(self): - if os.path.exists(TestFrequencyAdvice.TMP_DIR): - shutil.rmtree(TestFrequencyAdvice.TMP_DIR) - self.clear_htmls() - - def setUp(self): - if os.path.exists(TestFrequencyAdvice.TMP_DIR): - shutil.rmtree(TestFrequencyAdvice.TMP_DIR) - if not os.path.exists(TestFrequencyAdvice.TMP_DIR): - os.makedirs(TestFrequencyAdvice.TMP_DIR) - if not os.path.exists(TestFrequencyAdvice.OUTPUT_DIR): - os.makedirs(TestFrequencyAdvice.OUTPUT_DIR) - if not os.path.exists(TestFrequencyAdvice.DEVICE_DIR): - os.makedirs(TestFrequencyAdvice.DEVICE_DIR) - self.clear_htmls() - @classmethod def clear_htmls(cls): current_path = os.path.dirname(os.path.abspath(__file__)) @@ -91,7 +89,7 @@ class TestFrequencyAdvice(unittest.TestCase): fp.write(json.dumps(info)) @classmethod - def create_non_910B_trace_view(cls): + def create_non_910A2_trace_view(cls): basic_info = cls.get_basic_trace_view() # python ops @@ 
-108,7 +106,7 @@ class TestFrequencyAdvice(unittest.TestCase): fp.write(json.dumps(raw_data)) @classmethod - def create_910B_trace_view(cls): + def create_910A2_trace_view(cls): basic_info = cls.get_basic_trace_view() # python ops @@ -124,9 +122,25 @@ class TestFrequencyAdvice(unittest.TestCase): os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w') as fp: fp.write(json.dumps(raw_data)) + def setUp(self): + if os.path.exists(TestFrequencyAdvice.TMP_DIR): + shutil.rmtree(TestFrequencyAdvice.TMP_DIR) + if not os.path.exists(TestFrequencyAdvice.TMP_DIR): + os.makedirs(TestFrequencyAdvice.TMP_DIR) + if not os.path.exists(TestFrequencyAdvice.OUTPUT_DIR): + os.makedirs(TestFrequencyAdvice.OUTPUT_DIR) + if not os.path.exists(TestFrequencyAdvice.DEVICE_DIR): + os.makedirs(TestFrequencyAdvice.DEVICE_DIR) + self.clear_htmls() + + def tearDown(self): + if os.path.exists(TestFrequencyAdvice.TMP_DIR): + shutil.rmtree(TestFrequencyAdvice.TMP_DIR) + self.clear_htmls() + def test_run_should_run_success_when_msprof_not_contain_frequency_data(self): self.create_info_json() - self.create_non_910B_trace_view() + self.create_non_910A2_trace_view() interface = Interface(profiling_path=self.TMP_DIR) dimension = "computation" scope = SupportedScopes.FREQ_ANALYSIS @@ -137,11 +151,11 @@ class TestFrequencyAdvice(unittest.TestCase): def test_run_should_run_success_when_trace_view_contain_frequency_data(self): self.create_info_json() - self.create_910B_trace_view() + self.create_910A2_trace_view() interface = Interface(profiling_path=self.TMP_DIR) dimension = "computation" scope = SupportedScopes.FREQ_ANALYSIS result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path=self.TMP_DIR) - self.assertEqual(2, len(result.data.get("AI Core Frequency", dict()).get("data", []))) + self.assertEqual(2, len(result.data.get("AIcore频率", dict()).get("data", []))) result.clear() ComputationAnalysisDataset.reset_all_instances() diff --git 
a/profiler/test/ut/advisor/compute_advice/test_pp_stage_computation_analyzer.py b/profiler/msprof_analyze/test/ut/advisor/compute_advice/test_pp_stage_computation_analyzer.py similarity index 68% rename from profiler/test/ut/advisor/compute_advice/test_pp_stage_computation_analyzer.py rename to profiler/msprof_analyze/test/ut/advisor/compute_advice/test_pp_stage_computation_analyzer.py index a8be1bf197156968d8a8113973ad556a6f7f3215..6bfe49e19aa760d1ee9f7d4ca09192eebaec79d9 100644 --- a/profiler/test/ut/advisor/compute_advice/test_pp_stage_computation_analyzer.py +++ b/profiler/msprof_analyze/test/ut/advisor/compute_advice/test_pp_stage_computation_analyzer.py @@ -1,9 +1,23 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import unittest import copy import os -from profiler.advisor.analyzer.computation.pp_stage_computation_analyzer import PPStageComputationAnalyzer -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env +from msprof_analyze.advisor.analyzer.computation.pp_stage_computation_analyzer import PPStageComputationAnalyzer +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env mock_profiling_path = os.path.realpath(__file__) @@ -43,7 +57,7 @@ class TestPPStageComputationAnalyzer(unittest.TestCase): pp_stage_computation_analyzer._merge_multiprocess_result() data = dict(pp_stage_computation_analyzer.result.data) - problems = data.get("problems", {}).get("data", []) + problems = data.get("问题综述", {}).get("data", []) self.assertEqual(len(problems), self.rank_num) for i in range(self.rank_num): self.assertTrue(f"rank {i} ai cpu issues" in data) diff --git a/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_conjectured_gc_advice.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_conjectured_gc_advice.py new file mode 100644 index 0000000000000000000000000000000000000000..ad5b76c84758f9e3fe5b37e76a5b785f1e21a0d7 --- /dev/null +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_conjectured_gc_advice.py @@ -0,0 +1,162 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import shutil +import stat +import json +import multiprocessing +import unittest + +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes + + +class TestCompatibleGcAdvice(unittest.TestCase): + TMP_DIR = "./ascend_pt" + OUTPUT_DIR = "./ascend_pt/ASCEND_PROFILER_OUTPUT" + interface = None + + @staticmethod + def run_should_run_success_when_trace_view_not_contain_gc_events(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.CONJECTURED_GC_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data.get("ConjecturedGcAnalysis", [])) == 0 + result.clear() + + @staticmethod + def run_should_run_success_when_trace_view_contain_gc_events(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.CONJECTURED_GC_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data.get("ConjecturedGcAnalysis", {}).get("data", [])) == 2 + result.clear() + + @classmethod + def create_common_events(cls): + # Python pid + py_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 1, "args": {"name": "Python"}} + # ascend pid + ascend_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 4, "args": {"name": "Ascend Hardware"}} + # cann pid + cann_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 5, "args": {"name": "CANN"}} + # ascend hardware ops + ah_event1 = { + "ph": "X", "name": "Slice1", "ts": "1699529623106750", "dur": 100, "tid": 3, "pid": 4, + "args": {"Task Type": "AI_CORE"} + } + ah_event2 = { + "ph": "X", "name": "Matmul", "ts": "1699529623106888", "dur": 80, "tid": 3, "pid": 4, + "args": {"Task Type": "AI_CORE"} + } + # free event + free_event1 = { + "name": "Free", "pid": 
4139906593, "tid": 3, "ts": "1723545784434032.646", "dur": 500000.58, "ph": "X" + } + free_event2 = { + "name": "Free", "pid": 4139906593, "tid": 3, "ts": "1723545784984032.326", "dur": 200000.76, "ph": "X" + } + return [py_pid_data, ascend_pid_data, cann_pid_data, ah_event1, ah_event2, free_event1, free_event2] + + @classmethod + def create_trace_view_with_gc_events(cls): + # acl apis + api_event1 = { + "name": "AscendCL@aclCreateDataBuffer", "pid": 4139906273, "tid": 4042877, "ts": "1723545784534032.000", + "dur": 20, "args": {"Thread Id": 4042877, "Mode": "ACL_OP", "level": "acl", "id": "aclCreateDataBuffer", + "item_id": "0", "connection_id": 63899}, "ph": "X" + } + api_event2 = { + "name": "AscendCL@aclCreateTensorDesc", "pid": 4139906273, "tid": 4042877, "ts": "1723545784556032.450", + "dur": 40, "args": {"Thread Id": 4042877, "Mode": "ACL_OP", "level": "acl", "id": "aclCreateTensorDesc", + "item_id": "0", "connection_id": 63900}, "ph": "X" + } + api_event3 = { + "name": "AscendCL@opCompile", "pid": 4139906273, "tid": 4044446, "ts": "1723545784572032.870", + "dur": 150.36, + "args": {"Thread Id": 4044446, "Mode": "ACL_OP", "level": "acl", "id": "opCompile", "item_id": "0", + "connection_id": 63992}, "ph": "X" + } + + raw_data = [*cls.create_common_events(), api_event1, api_event2, api_event3] + with os.fdopen(os.open(f"{TestCompatibleGcAdvice.OUTPUT_DIR}/trace_view.json", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w') as fp: + fp.write(json.dumps(raw_data)) + + @classmethod + def create_trace_view_without_gc_events(cls): + # acl apis + api_event1 = { + "name": "AscendCL@aclCreateDataBuffer", "pid": 4139906273, "tid": 4042877, "ts": "1723545784534032.000", + "dur": 200000, "args": {"Thread Id": 4042877, "Mode": "ACL_OP", "level": "acl", "id": "aclCreateDataBuffer", + "item_id": "0", "connection_id": 63899}, "ph": "X" + } + api_event2 = { + "name": "AscendCL@aclCreateTensorDesc", "pid": 4139906273, "tid": 4042877, "ts": "1723545784556032.450", + 
"dur": 400000, "args": {"Thread Id": 4042877, "Mode": "ACL_OP", "level": "acl", "id": "aclCreateTensorDesc", + "item_id": "0", "connection_id": 63900}, "ph": "X" + } + api_event3 = { + "name": "AscendCL@opCompile", "pid": 4139906273, "tid": 4044446, "ts": "1723545784992032.870", + "dur": 1500000.36, "args": {"Thread Id": 4044446, "Mode": "ACL_OP", "level": "acl", "id": "opCompile", + "item_id": "0", "connection_id": 63992}, "ph": "X" + } + + raw_data = [*cls.create_common_events(), api_event1, api_event2, api_event3] + with os.fdopen(os.open(f"{TestCompatibleGcAdvice.OUTPUT_DIR}/trace_view.json", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w') as fp: + fp.write(json.dumps(raw_data)) + + @classmethod + def clear_htmls(cls): + current_path = os.path.dirname(os.path.abspath(__file__)) + for filename in os.listdir(current_path): + # 检查文件是否以“att”开头 + if filename.startswith("mstt"): + # 构建文件的完整路径 + file_path = os.path.join(current_path, filename) + # 删除文件 + os.remove(file_path) + + def tearDown(self): + if os.path.exists(TestCompatibleGcAdvice.TMP_DIR): + shutil.rmtree(TestCompatibleGcAdvice.TMP_DIR) + self.clear_htmls() + + def setUp(self): + if os.path.exists(TestCompatibleGcAdvice.TMP_DIR): + shutil.rmtree(TestCompatibleGcAdvice.TMP_DIR) + if not os.path.exists(TestCompatibleGcAdvice.TMP_DIR): + os.makedirs(TestCompatibleGcAdvice.TMP_DIR) + if not os.path.exists(TestCompatibleGcAdvice.OUTPUT_DIR): + os.makedirs(TestCompatibleGcAdvice.OUTPUT_DIR) + self.clear_htmls() + + def test_run_should_run_success_when_trace_view_contain_gc_events(self): + self.create_trace_view_with_gc_events() + new_process = multiprocessing.Process( + target=self.run_should_run_success_when_trace_view_contain_gc_events) + new_process.start() + new_process.join() + + def test_run_should_run_success_when_trace_view_not_contain_gc_events(self): + self.create_trace_view_without_gc_events() + new_process = multiprocessing.Process( + 
target=self.run_should_run_success_when_trace_view_not_contain_gc_events) + new_process.start() + new_process.join() + diff --git a/profiler/test/ut/advisor/timeline_advice/test_dataloader_checker.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_dataloader_checker.py similarity index 70% rename from profiler/test/ut/advisor/timeline_advice/test_dataloader_checker.py rename to profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_dataloader_checker.py index f2bd2695786379b93c6ec181b4e38b2bd5b89e82..5ad2b5b26423939e5f9a7c9c853a6571ff6e8b88 100644 --- a/profiler/test/ut/advisor/timeline_advice/test_dataloader_checker.py +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_dataloader_checker.py @@ -1,11 +1,24 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import unittest import os -import sys import yaml -from profiler.advisor.analyzer.dataloader.dataloader_checker import DataloaderChecker -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env +from msprof_analyze.advisor.analyzer.dataloader.dataloader_checker import DataloaderChecker +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env class TestDataloaderChecker(unittest.TestCase): @@ -16,7 +29,7 @@ class TestDataloaderChecker(unittest.TestCase): def setUp(self) -> None: rule_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname( os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))), - "advisor", "rules", "dataloader.yaml") + "advisor", "rules", "cn", "dataloader.yaml") with open(rule_path, "rb") as file: self.rule = yaml.safe_load(file) diff --git a/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_fusible_operator_advice.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_fusible_operator_advice.py new file mode 100644 index 0000000000000000000000000000000000000000..09d5f296809dae33c1efd70a966f93990ed11ca9 --- /dev/null +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_fusible_operator_advice.py @@ -0,0 +1,214 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import os +import shutil +import stat +import csv +import multiprocessing +import unittest +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes + + +class TestFusibleOperatorAdvice(unittest.TestCase): + TMP_DIR = "./ascend_pt" + OUTPUT_DIR = "./ascend_pt/ASCEND_PROFILER_OUTPUT" + interface = None + + @staticmethod + def run_should_run_success_when_kernel_details_not_contain_fusible_operators(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.FUSIBLE_OPERATOR_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data) == 0 + result.clear() + + @staticmethod + def run_should_run_success_when_kernel_details_contain_host_bound(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.FUSIBLE_OPERATOR_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data.get("基于host瓶颈的算子序列分析", {}).get("data", [])) == 3 + result.clear() + + @staticmethod + def run_should_run_success_when_kernel_details_contain_mte_bound(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.FUSIBLE_OPERATOR_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data.get("基于mte瓶颈的算子序列分析", {}).get("data", [])) == 4 + result.clear() + + @staticmethod + def run_should_run_success_when_kernel_details_contain_mte_and_host_bound(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.FUSIBLE_OPERATOR_ANALYSIS + result = 
interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data.get("基于mte瓶颈的算子序列分析", {}).get("data", [])) == 3 + assert len(result.data.get("基于host瓶颈的算子序列分析", {}).get("data", [])) == 3 + result.clear() + + @classmethod + def create_kernel_details_without_bound(cls): + # create csv files + csv_header = [ + 'Name', 'Type', 'Accelerator Core', 'Start Time(us)', 'Duration(us)', 'aicore_time(us)', + 'aic_mte2_time(us)', 'aic_fixpipe_time(us)', 'aiv_mte2_time(us)', 'aiv_mte3_time(us)', "Input Shapes", + "Output Shapes" + ] + csv_row1 = ['MatMul56', 'MatMul', 'AI_CORE', "0\t", 10, 8, 2, 0, 0, 0, 0, "1;1", "2;2"] + csv_row2 = ['Add2', 'Add', 'AI_VECTOR_CORE', "13\t", 5, 3, 1, 0, 0, 0, 0, "1;1", "3;3"] + csv_row3 = ['MatMul57', 'MatMul', 'AI_CORE', "19\t", 12, 9, 2, 0, 0, 0, 0, "1;1", "4;4"] + csv_row4 = ['Add1', 'Add', 'AI_CORE', "33\t", 3.14, 2.56, 1, 0, 0, 0, 0, "1;1", "4;4"] + + with os.fdopen(os.open(f"{TestFusibleOperatorAdvice.OUTPUT_DIR}/kernel_details.csv", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w', newline='') as fp: + csv_writer = csv.writer(fp) + csv_writer.writerow(csv_header) + csv_writer.writerow(csv_row1) + csv_writer.writerow(csv_row2) + csv_writer.writerow(csv_row3) + csv_writer.writerow(csv_row4) + + @classmethod + def create_kernel_details_with_mte_bound(cls): + # create csv files + csv_header = [ + 'Name', 'Type', 'Accelerator Core', 'Start Time(us)', 'Duration(us)', 'aicore_time(us)', + 'aic_mte2_time(us)', 'aic_fixpipe_time(us)', 'aiv_mte2_time(us)', 'aiv_mte3_time(us)', "Input Shapes", + "Output Shapes" + ] + csv_row1 = ['MatMul56', 'MatMul', 'AI_CORE', "0\t", 10, 8, 7, 0, 0, 0, 0, "1;1", "2;2"] + csv_row2 = ['Add2', 'Add', 'AI_VECTOR_CORE', "13\t", 5, 3, 2, 0, 0, 0, 0, "1;1", "21;2"] + csv_row3 = ['MatMul57', 'MatMul', 'AI_CORE', "19\t", 12, 9, 8, 0, 0, 0, 0, "1;1", "23;2"] + csv_row4 = ['Add1', 'Add', 'AI_CORE', "33\t", 3.14, 2.56, 1, 0, 0, 0, 0, "1;1", "24;2"] + + 
with os.fdopen(os.open(f"{TestFusibleOperatorAdvice.OUTPUT_DIR}/kernel_details.csv", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w', newline='') as fp: + csv_writer = csv.writer(fp) + csv_writer.writerow(csv_header) + for _ in range(7): + csv_writer.writerow(csv_row1) + csv_writer.writerow(csv_row2) + csv_writer.writerow(csv_row3) + csv_writer.writerow(csv_row4) + + @classmethod + def create_kernel_details_with_host_bound(cls): + # create csv files + csv_header = [ + 'Name', 'Type', 'Accelerator Core', 'Start Time(us)', 'Duration(us)', 'aicore_time(us)', + 'aic_mte2_time(us)', 'aic_fixpipe_time(us)', 'aiv_mte2_time(us)', 'aiv_mte3_time(us)', "Input Shapes", + "Output Shapes" + ] + csv_row1 = ['MatMul56', 'MatMul', 'AI_CORE', "0\t", 20, 18, 7, 0, 0, 0, 0, "1;1", "2;2"] + csv_row2 = ['Add2', 'Add', 'AI_VECTOR_CORE', "83\t", 25, 13, 2, 0, 0, 0, 0, "1;1", "21;2"] + csv_row3 = ['MatMul57', 'MatMul', 'AI_CORE', "169\t", 12, 9, 8, 0, 0, 0, 0, "1;1", "23;2"] + csv_row4 = ['Add1', 'Add', 'AI_CORE', "183\t", 3.14, 2.56, 1, 0, 0, 0, 0, "1;1", "24;2"] + csv_row5 = ['hcom_allreduce', 'allreduce', "HCCL", "233\t", 3.14, 2.56, 1, 0, 0, 0, 0, "1;1", "24;2"] + + with os.fdopen(os.open(f"{TestFusibleOperatorAdvice.OUTPUT_DIR}/kernel_details.csv", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w', newline='') as fp: + csv_writer = csv.writer(fp) + csv_writer.writerow(csv_header) + for _ in range(7): + csv_writer.writerow(csv_row1) + csv_writer.writerow(csv_row2) + csv_writer.writerow(csv_row3) + csv_writer.writerow(csv_row4) + csv_writer.writerow(csv_row5) + + @classmethod + def create_kernel_details_with_host_and_mte_bound(cls): + # create csv files + csv_header = [ + 'Name', 'Type', 'Accelerator Core', 'Start Time(us)', 'Duration(us)', 'aicore_time(us)', + 'aic_mte2_time(us)', 'aic_fixpipe_time(us)', 'aiv_mte2_time(us)', 'aiv_mte3_time(us)', "Input Shapes", + "Output Shapes" + ] + csv_row1 = ['MatMul56', 'MatMul', 'AI_CORE', "0\t", 20, 18, 17, 0, 0, 0, 0, 
"1;1", "2;2"] + csv_row2 = ['Add2', 'Add', 'AI_VECTOR_CORE', "83\t", 25, 13, 12, 0, 0, 0, 0, "1;1", "21;2"] + csv_row3 = ['MatMul57', 'MatMul', 'AI_CORE', "169\t", 12, 9, 8, 0, 0, 0, 0, "1;1", "23;2"] + csv_row4 = ['Add1', 'Add', 'AI_CORE', "183\t", 3.14, 2.56, 1, 0, 0, 0, 0, "1;1", "24;2"] + csv_row5 = ['hcom_allreduce', 'allreduce', "HCCL", "233\t", 3.14, 2.56, 1, 0, 0, 0, 0, "1;1", "24;2"] + + with os.fdopen(os.open(f"{TestFusibleOperatorAdvice.OUTPUT_DIR}/kernel_details.csv", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w', newline='') as fp: + csv_writer = csv.writer(fp) + csv_writer.writerow(csv_header) + for _ in range(7): + csv_writer.writerow(csv_row1) + csv_writer.writerow(csv_row2) + csv_writer.writerow(csv_row3) + csv_writer.writerow(csv_row4) + csv_writer.writerow(csv_row5) + + @classmethod + def clear_htmls(cls): + current_path = os.path.dirname(os.path.abspath(__file__)) + for filename in os.listdir(current_path): + # 检查文件是否以“att”开头 + if filename.startswith("mstt"): + # 构建文件的完整路径 + file_path = os.path.join(current_path, filename) + # 删除文件 + os.remove(file_path) + + def tearDown(self): + if os.path.exists(TestFusibleOperatorAdvice.TMP_DIR): + shutil.rmtree(TestFusibleOperatorAdvice.TMP_DIR) + self.clear_htmls() + + def setUp(self): + if os.path.exists(TestFusibleOperatorAdvice.TMP_DIR): + shutil.rmtree(TestFusibleOperatorAdvice.TMP_DIR) + if not os.path.exists(TestFusibleOperatorAdvice.TMP_DIR): + os.makedirs(TestFusibleOperatorAdvice.TMP_DIR) + if not os.path.exists(TestFusibleOperatorAdvice.OUTPUT_DIR): + os.makedirs(TestFusibleOperatorAdvice.OUTPUT_DIR) + self.clear_htmls() + + def test_run_should_run_success_when_kernel_details_not_contain_fusible_operators(self): + self.create_kernel_details_without_bound() + new_process = multiprocessing.Process( + target=self.run_should_run_success_when_kernel_details_not_contain_fusible_operators) + new_process.start() + new_process.join() + + def 
test_run_should_run_success_when_kernel_details_contain_mte_bound(self): + self.create_kernel_details_with_mte_bound() + new_process = multiprocessing.Process( + target=self.run_should_run_success_when_kernel_details_contain_mte_bound) + new_process.start() + new_process.join() + + def test_run_should_run_success_when_kernel_details_contain_host_bound(self): + self.create_kernel_details_with_host_bound() + new_process = multiprocessing.Process( + target=self.run_should_run_success_when_kernel_details_contain_host_bound) + new_process.start() + new_process.join() + + def test_run_should_run_success_when_kernel_details_contain_mte_and_host_bound(self): + self.create_kernel_details_with_host_and_mte_bound() + new_process = multiprocessing.Process( + target=self.run_should_run_success_when_kernel_details_contain_mte_and_host_bound) + new_process.start() + new_process.join() + diff --git a/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_gc_checker.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_gc_checker.py new file mode 100644 index 0000000000000000000000000000000000000000..b570813b1e28fd9a00b0d6fafb64c3cdc2f2722e --- /dev/null +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_gc_checker.py @@ -0,0 +1,140 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os +import shutil +import stat +import json +import multiprocessing +import unittest + +from msprof_analyze.advisor.interface.interface import Interface +from msprof_analyze.advisor.common.analyzer_scopes import SupportedScopes + + +class TestGcAdvice(unittest.TestCase): + TMP_DIR = "./ascend_pt" + OUTPUT_DIR = "./ascend_pt/ASCEND_PROFILER_OUTPUT" + interface = None + + @staticmethod + def run_should_run_success_when_trace_view_not_contain_gc_events(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.GC_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data.get("GC分析", [])) == 0 + result.clear() + + @staticmethod + def run_should_run_success_when_trace_view_contain_gc_events(): + interface = Interface(profiling_path="./ascend_pt") + dimension = "schedule" + scope = SupportedScopes.GC_ANALYSIS + result = interface.get_result(dimension, scope, render_html=1, output_dict=False, profiling_path="./ascend_pt") + assert len(result.data.get("GC分析", {}).get("data", [])) == 2 + result.clear() + + @classmethod + def clear_htmls(cls): + current_path = os.path.dirname(os.path.abspath(__file__)) + for filename in os.listdir(current_path): + # Check whether the file name starts with "mstt" + if filename.startswith("mstt"): + # Build the file's full path + file_path = os.path.join(current_path, filename) + # Delete the file + os.remove(file_path) + + @classmethod + def create_common_events(cls): + # Python pid + py_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 1, "args": {"name": "Python"}} + # ascend pid + ascend_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 4, "args": {"name": "Ascend Hardware"}} + # CANN pid + cann_pid_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 5, "args": {"name": "CANN"}} + # ascend hardware ops + ah_event1 = { + "ph": "X", "name": "Slice1", "ts": "1699529623106750", "dur": 100, "tid": 3, "pid": 4, + "args": 
{"Task Type": "AI_CORE"} + } + ah_event2 = { + "ph": "X", "name": "Slice2", "ts": "1699529623106888", "dur": 80, "tid": 3, "pid": 4, + "args": {"Task Type": "AI_CORE"} + } + # flow event + flow_event_s = {"ph": "s", "name": "link1", "id": 1, "tid": 3, "pid": 1, "ts": "200", "args": {}} + flow_event_e = {"ph": "f", "name": "link1", "id": 1, "tid": 3, "pid": 1, "ts": "1699529623106750", "args": {}} + return [py_pid_data, ascend_pid_data, cann_pid_data, ah_event1, ah_event2, flow_event_s, flow_event_e] + + + @classmethod + def create_trace_view_with_gc_events(cls): + + # Python GC pid + py_gc_data = {"ph": "M", "name": "process_name", "tid": 0, "pid": 2, "args": {"name": "Python GC"}} + + gc_event1 = { + "ph": "X", "name": "GC", "ts": "1699529622103750", "dur": 1500, "tid": 3, "pid": 4, "cat": "GC", + "args": {} + } + gc_event2 = { + "ph": "X", "name": "GC", "ts": "1699529623104750", "dur": 50, "tid": 3, "pid": 4, "cat": "GC", + "args": {} + } + gc_event3 = { + "ph": "X", "name": "GC", "ts": "1699529623105750", "dur": 50000, "tid": 3, "pid": 4, "cat": "GC", + "args": {} + } + + + raw_data = [*cls.create_common_events(), py_gc_data, gc_event1, gc_event2, gc_event3] + with os.fdopen(os.open(f"{TestGcAdvice.OUTPUT_DIR}/trace_view.json", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w') as fp: + fp.write(json.dumps(raw_data)) + + @classmethod + def create_trace_view_without_gc_events(cls): + raw_data = cls.create_common_events() + with os.fdopen(os.open(f"{TestGcAdvice.OUTPUT_DIR}/trace_view.json", + os.O_WRONLY | os.O_CREAT, stat.S_IWUSR | stat.S_IRUSR), 'w') as fp: + fp.write(json.dumps(raw_data)) + + def tearDown(self): + if os.path.exists(TestGcAdvice.TMP_DIR): + shutil.rmtree(TestGcAdvice.TMP_DIR) + self.clear_htmls() + + def setUp(self): + if os.path.exists(TestGcAdvice.TMP_DIR): + shutil.rmtree(TestGcAdvice.TMP_DIR) + if not os.path.exists(TestGcAdvice.TMP_DIR): + os.makedirs(TestGcAdvice.TMP_DIR) + if not os.path.exists(TestGcAdvice.OUTPUT_DIR): + 
os.makedirs(TestGcAdvice.OUTPUT_DIR) + self.clear_htmls() + + def test_run_should_run_success_when_trace_view_contain_gc_events(self): + self.create_trace_view_with_gc_events() + new_process = multiprocessing.Process(target=self.run_should_run_success_when_trace_view_contain_gc_events) + new_process.start() + new_process.join() + + def test_run_should_run_success_when_trace_view_not_contain_gc_events(self): + self.create_trace_view_without_gc_events() + new_process = multiprocessing.Process(target=self.run_should_run_success_when_trace_view_not_contain_gc_events) + new_process.start() + new_process.join() + diff --git a/profiler/test/ut/advisor/timeline_advice/test_memory_op_checker.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_memory_op_checker.py similarity index 66% rename from profiler/test/ut/advisor/timeline_advice/test_memory_op_checker.py rename to profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_memory_op_checker.py index a5326b9893dbab7f0b7d44860a0d0598f762d3da..e267bbf9f6c1895c735f3fd90a12843913ec94f7 100644 --- a/profiler/test/ut/advisor/timeline_advice/test_memory_op_checker.py +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_memory_op_checker.py @@ -1,62 +1,75 @@ -import unittest -import os -import sys -import yaml - -from profiler.advisor.analyzer.memory.memory_checker import MemoryOpsChecker -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env - - -class TestMemOpChecker(unittest.TestCase): - @classmethod - def tearDownClass(cls) -> None: - recover_env() - - def setUp(self) -> None: - rule_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname( - os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))), - "advisor", "rules", "memory.yaml") - - with open(rule_path, "rb") as file: - self.rule = yaml.safe_load(file) - - def test_no_mem_op(self): - dataset = self._get_mock_dataset(1, 
is_empty_dataset=True) - - checker = MemoryOpsChecker() - checker.check_memory_ops(dataset) - self.assertFalse(checker.memory_issues) - - def test_mem_op_not_reach_threshold(self): - dataset = self._get_mock_dataset(1, is_empty_dataset=False) - - checker = MemoryOpsChecker() - checker.check_memory_ops(dataset) - self.assertFalse(checker.memory_issues) - - def test_mem_op_reach_threshold(self): - dataset = self._get_mock_dataset(1, 1000000, is_empty_dataset=False) - - checker = MemoryOpsChecker() - checker.check_memory_ops(dataset) - self.assertTrue(checker.memory_issues) - - def _get_mock_dataset(self, mem_op_num, mem_op_total_dur=1000, is_empty_dataset=False): - dataset = TimelineEvent() - if is_empty_dataset: - return dataset - - mem_op_info = TimelineEvent() - for i in range(mem_op_num): - mem_op_info[f"mock_mem_op_{i}"] = TimelineEvent({"total_dur": mem_op_total_dur, "count": 10}) - - dataset["memory_ops"] = TimelineEvent({"mem_op_info": mem_op_info, "rule": TimelineEvent(self.rule)}) - return dataset - - -if __name__ == '__main__': - tester = TestMemOpChecker() - tester.test_no_mem_op() - tester.test_mem_op_not_reach_threshold() +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import unittest +import os +import yaml + +from msprof_analyze.advisor.analyzer.memory.memory_checker import MemoryOpsChecker +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env + + +class TestMemOpChecker(unittest.TestCase): + @classmethod + def tearDownClass(cls) -> None: + recover_env() + + def setUp(self) -> None: + rule_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname( + os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))), + "advisor", "rules", "cn", "memory.yaml") + + with open(rule_path, "rb") as file: + self.rule = yaml.safe_load(file) + + def test_no_mem_op(self): + dataset = self._get_mock_dataset(1, is_empty_dataset=True) + + checker = MemoryOpsChecker() + checker.check_memory_ops(dataset) + self.assertFalse(checker.memory_issues) + + def test_mem_op_not_reach_threshold(self): + dataset = self._get_mock_dataset(1, is_empty_dataset=False) + + checker = MemoryOpsChecker() + checker.check_memory_ops(dataset) + self.assertFalse(checker.memory_issues) + + def test_mem_op_reach_threshold(self): + dataset = self._get_mock_dataset(1, 1000000, is_empty_dataset=False) + + checker = MemoryOpsChecker() + checker.check_memory_ops(dataset) + self.assertTrue(checker.memory_issues) + + def _get_mock_dataset(self, mem_op_num, mem_op_total_dur=1000, is_empty_dataset=False): + dataset = TimelineEvent() + if is_empty_dataset: + return dataset + + mem_op_info = TimelineEvent() + for i in range(mem_op_num): + mem_op_info[f"mock_mem_op_{i}"] = TimelineEvent({"total_dur": mem_op_total_dur, "count": 10}) + + dataset["memory_ops"] = TimelineEvent({"mem_op_info": mem_op_info, "rule": TimelineEvent(self.rule)}) + return dataset + + +if __name__ == '__main__': + tester = TestMemOpChecker() + tester.test_no_mem_op() + tester.test_mem_op_not_reach_threshold() tester.test_mem_op_reach_threshold() \ No newline at end of file diff --git 
a/profiler/test/ut/advisor/timeline_advice/test_syncbn_checker.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_syncbn_checker.py similarity index 66% rename from profiler/test/ut/advisor/timeline_advice/test_syncbn_checker.py rename to profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_syncbn_checker.py index ecd4ee6ccae8690fdf40c0a595c3843b24574a83..0859bc1866727fe5c832686f64e4ffe5efe05512 100644 --- a/profiler/test/ut/advisor/timeline_advice/test_syncbn_checker.py +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_syncbn_checker.py @@ -1,11 +1,24 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import unittest import os -import sys import yaml -from profiler.advisor.analyzer.schedule.syncbn.syncbn_checker import SyncBNChecker -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env +from msprof_analyze.advisor.analyzer.schedule.syncbn.syncbn_checker import SyncBNChecker +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env class TestSyncBNChecker(unittest.TestCase): @@ -16,7 +29,7 @@ class TestSyncBNChecker(unittest.TestCase): def setUp(self) -> None: rule_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname( os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))), - "advisor", "rules", "sync_batchnorm.yaml") + "advisor", "rules", "cn", "sync_batchnorm.yaml") with open(rule_path, "rb") as file: self.rule = yaml.safe_load(file) diff --git a/profiler/test/ut/advisor/timeline_advice/test_synchronize_stream.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_synchronize_stream.py similarity index 59% rename from profiler/test/ut/advisor/timeline_advice/test_synchronize_stream.py rename to profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_synchronize_stream.py index 674d149640d59e44c8865fb0d3135dd999608701..0a379b8b8d8646fb1ddc12f83b0645978dfe7945 100644 --- a/profiler/test/ut/advisor/timeline_advice/test_synchronize_stream.py +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_synchronize_stream.py @@ -1,13 +1,26 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import unittest import os -import sys import yaml -from profiler.advisor.analyzer.schedule.synchronize_stream.synchronize_stream_checker import SynchronizeStreamChecker -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.advisor.common import constant as const -from profiler.advisor.utils.utils import safe_division -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env +from msprof_analyze.advisor.analyzer.schedule.synchronize_stream.synchronize_stream_checker import SynchronizeStreamChecker +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.prof_common.constant import Constant +from msprof_analyze.advisor.utils.utils import safe_division +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env class TestSynchronizeChecker(unittest.TestCase): @@ -18,7 +31,7 @@ class TestSynchronizeChecker(unittest.TestCase): def setUp(self) -> None: rule_path = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname( os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))), - "advisor", "rules", "synchronize.yaml") + "advisor", "rules", "cn", "synchronize.yaml") with open(rule_path, "rb") as file: self.rule = yaml.safe_load(file) @@ -49,13 +62,13 @@ class TestSynchronizeChecker(unittest.TestCase): if is_empty_dataset: return dataset - co_occurrence_event_list = [TimelineEvent(dict(name=const.NODE_LAUNCH)), - TimelineEvent(dict(name=const.SYNC_STREAM))] * co_occurrence_num + co_occurrence_event_list = 
[TimelineEvent(dict(name=Constant.NODE_LAUNCH)), + TimelineEvent(dict(name=Constant.SYNC_STREAM))] * co_occurrence_num - synchronize_stream_event_list = [TimelineEvent(dict(name=const.SYNC_STREAM))] * ( + synchronize_stream_event_list = [TimelineEvent(dict(name=Constant.SYNC_STREAM))] * ( total_synchronize_stream_num - co_occurrence_num) - node_launch_event_list = [TimelineEvent(dict(name=const.NODE_LAUNCH))] * ( + node_launch_event_list = [TimelineEvent(dict(name=Constant.NODE_LAUNCH))] * ( total_node_launch_num - co_occurrence_num) dataset[ diff --git a/profiler/test/ut/advisor/timeline_advice/test_timeline_op_collector.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_timeline_op_collector.py similarity index 86% rename from profiler/test/ut/advisor/timeline_advice/test_timeline_op_collector.py rename to profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_timeline_op_collector.py index cb9121d4cf4532c32e59bb1674801d788a7cd612..edef567259f8778896e6f3d3291fb4649664aecc 100644 --- a/profiler/test/ut/advisor/timeline_advice/test_timeline_op_collector.py +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_timeline_op_collector.py @@ -1,9 +1,20 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import unittest -import os -import sys -import yaml -from profiler.advisor.dataset.timeline_op_collector.timeline_op_collector import ( +from msprof_analyze.advisor.dataset.timeline_op_collector.timeline_op_collector import ( OpCompileCollector, SynchronizeStreamCollector, MemCollector, @@ -18,8 +29,8 @@ from profiler.advisor.dataset.timeline_op_collector.timeline_op_collector import OpStackCollector, StepCollector ) -from profiler.advisor.common.timeline.event import TimelineEvent -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env +from msprof_analyze.advisor.common.timeline.event import TimelineEvent +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env class TestTimelineOpCollector(unittest.TestCase): diff --git a/profiler/test/ut/advisor/timeline_advice/test_timeline_op_compile_checker.py b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_timeline_op_compile_checker.py similarity index 55% rename from profiler/test/ut/advisor/timeline_advice/test_timeline_op_compile_checker.py rename to profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_timeline_op_compile_checker.py index 3ebdcddad3dcd174263e20387a7d6a416c819372..85223328e8b8fde050cc3ff2a83ad81ec0fdfd57 100644 --- a/profiler/test/ut/advisor/timeline_advice/test_timeline_op_compile_checker.py +++ b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/test_timeline_op_compile_checker.py @@ -1,3 +1,17 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. import unittest import os import sys @@ -6,11 +20,11 @@ work_path = os.path.dirname(os.path.dirname( os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))))) sys.path.insert(0, work_path) from unittest.mock import patch -from profiler.advisor.analyzer.schedule import dispatch -from profiler.advisor.analyzer.schedule.dispatch.timeline_op_dispatch_analyzer import OpDispatchAnalyzer -from profiler.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset -from profiler.advisor.display.html.render import HTMLRender -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env +from msprof_analyze.advisor.analyzer.schedule import dispatch +from msprof_analyze.advisor.analyzer.schedule.dispatch.timeline_op_dispatch_analyzer import OpDispatchAnalyzer +from msprof_analyze.advisor.dataset.timeline_event_dataset import ScheduleAnalysisDataset +from msprof_analyze.advisor.display.html.render import HTMLRender +from msprof_analyze.test.ut.advisor.advisor_backend.tools.tool import recover_env class TestOperatorDispatchAnalyzer(unittest.TestCase): @@ -18,7 +32,7 @@ class TestOperatorDispatchAnalyzer(unittest.TestCase): def tearDownClass(cls) -> None: recover_env() - @patch("profiler.advisor.common.constant.MAX_OP_COMPILE_NUM", 5) + @patch("msprof_analyze.prof_common.constant.Constant.MAX_OP_COMPILE_NUM", 5) def test_ops_dispatch_analyzer(self): kwargs = {"analysis_mode": "all"} data_root_dir = os.path.dirname(os.path.realpath(__file__)) @@ -26,9 +40,9 @@ class TestOperatorDispatchAnalyzer(unittest.TestCase): results = op_dispatch_analyzer.optimize(**kwargs) self.assertTrue(results.page_dict) - self.assertIsNotNone(results.sheet_recorder.sheet_data.get("operator dispatch")) + self.assertIsNotNone(results.sheet_recorder.sheet_data.get("算子下发")) - 
@patch("profiler.advisor.common.constant.MAX_OP_COMPILE_NUM", 5) + @patch("msprof_analyze.prof_common.constant.Constant.MAX_OP_COMPILE_NUM", 5) def test_ops_dispatch_make_render(self): kwargs = {"analysis_mode": "timeline"} data_root_dir = os.path.dirname(os.path.realpath(__file__)) diff --git a/profiler/test/ut/advisor/timeline_advice/trace_view.json b/profiler/msprof_analyze/test/ut/advisor/timeline_advice/trace_view.json similarity index 100% rename from profiler/test/ut/advisor/timeline_advice/trace_view.json rename to profiler/msprof_analyze/test/ut/advisor/timeline_advice/trace_view.json diff --git a/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_data_preprocess/test_pytorch_data_preprocessor.py b/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_data_preprocess/test_pytorch_data_preprocessor.py new file mode 100644 index 0000000000000000000000000000000000000000..ee159cca792cb53d9504accef54be6ed93f9a980 --- /dev/null +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_data_preprocess/test_pytorch_data_preprocessor.py @@ -0,0 +1,68 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import os +import shutil +import unittest +from unittest import mock + +from msprof_analyze.cluster_analyse.cluster_data_preprocess.pytorch_data_preprocessor import PytorchDataPreprocessor + + +class TestPytorchDataPreprocessor(unittest.TestCase): + DIR_PATH = os.path.join(os.path.dirname(__file__), 'DT_CLUSTER_PREPROCESS') + + def setUp(self) -> None: + if os.path.exists(self.DIR_PATH): + shutil.rmtree(self.DIR_PATH) + os.makedirs(os.path.join(self.DIR_PATH, 'worker1_11111111_ascend_pt')) + open(os.path.join(self.DIR_PATH, 'worker1_11111111_ascend_pt', 'profiler_info_1.json'), 'w') + os.makedirs(os.path.join(self.DIR_PATH, 'worker2_11111112_ascend_pt')) + open(os.path.join(self.DIR_PATH, 'worker2_11111112_ascend_pt', 'profiler_info_2.json'), 'w') + os.makedirs(os.path.join(self.DIR_PATH, 'single_worker_11111111_ascend_pt')) + open(os.path.join(self.DIR_PATH, 'single_worker_11111111_ascend_pt', 'profiler_info.json'), 'w') + os.makedirs(os.path.join(self.DIR_PATH, 'worker1_11111112_ascend_pt')) + open(os.path.join(self.DIR_PATH, 'worker1_11111112_ascend_pt', 'profiler_info_1.json'), 'w') + os.makedirs(os.path.join(self.DIR_PATH, 'worker2_11111113_ascend_pt')) + open(os.path.join(self.DIR_PATH, 'worker2_11111113_ascend_pt', 'profiler_info_2.json'), 'w') + self.dirs = [os.path.join(self.DIR_PATH, filename) for filename in os.listdir(self.DIR_PATH)] + + def tearDown(self) -> None: + shutil.rmtree(self.DIR_PATH) + + def test_get_data_map_when_given_normal_input_expect_dict(self): + res = PytorchDataPreprocessor(self.dirs).get_data_map() + self.assertIsInstance(res, dict) + + def test_get_rank_id_when_given_cluster_rank_1_dirs_expect_rank_1(self): + check = PytorchDataPreprocessor(self.dirs) + ret = check.get_rank_id(os.path.join(self.DIR_PATH, 'worker1_11111111_ascend_pt')) + self.assertEqual(ret, 1) + + def test_get_rank_id_when_single_device_not_cluster_expect_rank_minus1(self): + check = PytorchDataPreprocessor(self.dirs) + ret = 
check.get_rank_id(os.path.join(self.DIR_PATH, 'single_worker_11111111_ascend_pt')) + self.assertEqual(ret, -1) + + def test_get_data_map_given_cluster_files_expect_rank_12(self): + check = PytorchDataPreprocessor(self.dirs) + with mock.patch("msprof_analyze.prof_common.file_manager.FileManager.read_json_file", + return_value={}): + ret = check.get_data_map() + self.assertIn(1, ret.keys()) + self.assertIn(2, ret.keys()) + self.assertIn(os.path.join(self.DIR_PATH, 'worker1_11111111_ascend_pt'), ret.values()) + self.assertIn(os.path.join(self.DIR_PATH, 'worker2_11111112_ascend_pt'), ret.values()) diff --git a/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_data_preprocess/test_step_trace_time_analysis.py b/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_data_preprocess/test_step_trace_time_analysis.py new file mode 100644 index 0000000000000000000000000000000000000000..bd8e2da21f7421e3686293367b02428197386f6e --- /dev/null +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_data_preprocess/test_step_trace_time_analysis.py @@ -0,0 +1,81 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest + +from msprof_analyze.cluster_analyse.analysis.step_trace_time_analysis import StepTraceTimeAnalysis +from msprof_analyze.cluster_analyse.prof_bean.step_trace_time_bean import StepTraceTimeBean +from msprof_analyze.prof_common.constant import Constant + + +class TestStepTraceTimeAnalysis(unittest.TestCase): + DIR_PATH = '' + + def test_get_max_data_row_when_given_data_return_max_rows(self): + check = StepTraceTimeAnalysis({}) + ls = [ + [1, 3, 5, 7, 10], + [2, 4, 6, 8, 11], + [1000, -1, -1, -1, -1] + ] + ret = check.get_max_data_row(ls) + self.assertEqual([1000, 4, 6, 8, 11], ret) + + def test_get_max_data_when_given_row_single_ls_return_this_row(self): + check = StepTraceTimeAnalysis({}) + ls = [ + [1, 3, 5, 7, 10] + ] + ret = check.get_max_data_row(ls) + self.assertEqual([1, 3, 5, 7, 10], ret) + + def test_analyze_step_time_when_give_normal_expect_stage(self): + check = StepTraceTimeAnalysis({}) + check.data_type = Constant.TEXT + check.step_time_dict = { + 0: [ + StepTraceTimeBean({"Step": 0, "time1": 1, "time2": 2}), + StepTraceTimeBean({"Step": 1, "time1": 1, "time2": 2}), + ], + 1: [ + StepTraceTimeBean({"Step": 0, "time1": 10, "time2": 20}), + StepTraceTimeBean({"Step": 1, "time1": 10, "time2": 20}) + ] + } + check.communication_group = {Constant.P2P: [[0, 1]]} + check.analyze_step_time() + self.assertIn([0, 'stage', (0, 1), 10.0, 20.0], check.step_data_list) + + def test_analyze_step_time_when_given_none_step_expect_stage_and_rank_row(self): + check = StepTraceTimeAnalysis({}) + check.data_type = Constant.TEXT + check.step_time_dict = { + 0: [ + StepTraceTimeBean({"Step": None, "time1": 1, "time2": 2}) + ], + 1: [ + StepTraceTimeBean({"Step": None, "time1": 10, "time2": 20}), + ], + 2: [ + StepTraceTimeBean({"Step": None, "time1": 2, "time2": 3}), + ], + 3: [ + StepTraceTimeBean({"Step": None, "time1": 1, "time2": 1}), + ], + } + check.communication_group = {Constant.P2P: [[0, 1], [2, 3]]} + check.analyze_step_time() + 
self.assertIn([None, 'stage', (2, 3), 2.0, 3.0], check.step_data_list) + self.assertIn([None, 'rank', 0, 1.0, 2.0], check.step_data_list) \ No newline at end of file diff --git a/profiler/test/ut/cluster_analyse/cluster_utils/test_parallel_strategy_calculator.py b/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_utils/test_parallel_strategy_calculator.py similarity index 70% rename from profiler/test/ut/cluster_analyse/cluster_utils/test_parallel_strategy_calculator.py rename to profiler/msprof_analyze/test/ut/cluster_analyse/cluster_utils/test_parallel_strategy_calculator.py index 2eb8b300ab5faee326448712e5eb35d6725f466f..7707fd761aba717df508a407af254292dda5969f 100644 --- a/profiler/test/ut/cluster_analyse/cluster_utils/test_parallel_strategy_calculator.py +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/cluster_utils/test_parallel_strategy_calculator.py @@ -1,6 +1,20 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import unittest -from cluster_utils.parallel_strategy_calculator import ParallelStrategyCalculator +from msprof_analyze.cluster_analyse.cluster_utils.parallel_strategy_calculator import ParallelStrategyCalculator class TestParallelStrategyCalculator(unittest.TestCase): diff --git a/profiler/test/ut/cluster_analyse/common_func/test_file_manager.py b/profiler/msprof_analyze/test/ut/cluster_analyse/common_func/test_file_manager.py similarity index 79% rename from profiler/test/ut/cluster_analyse/common_func/test_file_manager.py rename to profiler/msprof_analyze/test/ut/cluster_analyse/common_func/test_file_manager.py index 5f73b20244ee8b9d922a78779b1b98737b6882df..c919e71ceea14867e1e8675f63f1408bd34497a9 100644 --- a/profiler/test/ut/cluster_analyse/common_func/test_file_manager.py +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/common_func/test_file_manager.py @@ -1,3 +1,17 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os import shutil import stat @@ -5,8 +19,8 @@ import json import unittest import pytest -from common_func.file_manager import FileManager -from prof_bean.step_trace_time_bean import StepTraceTimeBean +from msprof_analyze.prof_common.file_manager import FileManager +from msprof_analyze.cluster_analyse.prof_bean.step_trace_time_bean import StepTraceTimeBean class TestFileManager(unittest.TestCase): diff --git a/profiler/test/ut/cluster_analyse/common_func/test_path_manager.py b/profiler/msprof_analyze/test/ut/cluster_analyse/common_func/test_path_manager.py similarity index 83% rename from profiler/test/ut/cluster_analyse/common_func/test_path_manager.py rename to profiler/msprof_analyze/test/ut/cluster_analyse/common_func/test_path_manager.py index 0510cf4a7936b04fdc6f76cd36fb51cd6c12af98..1ffb456fde0f6bb6335a27cd559008a40a132f6b 100644 --- a/profiler/test/ut/cluster_analyse/common_func/test_path_manager.py +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/common_func/test_path_manager.py @@ -1,9 +1,23 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import unittest import os import time import pytest -from common_func.path_manager import PathManager +from msprof_analyze.prof_common.path_manager import PathManager PATH_DIR = "resource" @@ -46,7 +60,7 @@ class TestPathManager(unittest.TestCase): PathManager.input_path_common_check(PATH_FILE) def test_check_path_owner_consistent(self): - PathManager.check_path_owner_consistent(PATH_DIR) + PathManager.check_path_owner_consistent([PATH_DIR]) def test_check_path_writeable(self): link_name = "test_link" + str(time.time()) diff --git a/profiler/msprof_analyze/test/ut/cluster_analyse/communication_group/test_communication_group_generator.py b/profiler/msprof_analyze/test/ut/cluster_analyse/communication_group/test_communication_group_generator.py new file mode 100644 index 0000000000000000000000000000000000000000..517327b81117f100a1bf0f71edbe9ebd45ef605e --- /dev/null +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/communication_group/test_communication_group_generator.py @@ -0,0 +1,113 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest +from unittest import mock + +from msprof_analyze.cluster_analyse.communication_group.communication_group_generator import CommunicationGroupGenerator +from msprof_analyze.prof_common.constant import Constant + + +class TestCommunicationGroupGenerator(unittest.TestCase): + DIR_PATH = '' + PARAMS = { + Constant.DATA_SIMPLIFICATION: "ORIGINAL", + Constant.DATA_TYPE: Constant.TEXT + } + + def test_generate_p2p_communication_when_given_group_1p_return_1p2p(self): + check = CommunicationGroupGenerator(self.PARAMS).processor + check.collective_group_dict = { + 'group1': {0} + } + with mock.patch("msprof_analyze.prof_common.file_manager.FileManager.read_json_file", + return_value=True): + check.generate_p2p_communication_group() + ret = {0} + self.assertEqual(ret, set(check.communication_group[Constant.P2P][0])) + + def test_generate_p2p_communication_when_given_group_8p_return_correct_value(self): + check = CommunicationGroupGenerator(self.PARAMS).processor + check.collective_group_dict = { + 'group1': {1, 2, 3, 4}, + 'group2': {5, 6, 7, 8}, + } + with mock.patch("msprof_analyze.prof_common.file_manager.FileManager.read_json_file", + return_value=True): + check.generate_p2p_communication_group() + ret_a = {1, 2, 3, 4} + ret_b = {5, 6, 7, 8} + self.assertEqual(ret_a, set(check.communication_group[Constant.P2P][0])) + self.assertEqual(ret_b, set(check.communication_group[Constant.P2P][1])) + + def test_generate_p2p_communication_when_given_group_16p_expect_4_group(self): + check = CommunicationGroupGenerator(self.PARAMS).processor + check.collective_group_dict = { + 'group1': {0, 1}, + 'group2': {0, 2}, + 'group3': {2, 3}, + 'group4': {3, 1}, + 'group5': {4, 5}, + 'group6': {4, 6}, + 'group7': {5, 7}, + 'group8': {6, 7}, + 'group9': {8, 9}, + 'group10': {8, 10}, + 'group11': {11, 10}, + 'group12': {11, 9}, + 'group13': {12, 13}, + 'group14': {12, 14}, + 'group15': {15, 13}, + 'group16': {15, 14} + } + with 
mock.patch("msprof_analyze.prof_common.file_manager.FileManager.read_json_file", + return_value=True): + check.generate_p2p_communication_group() + ret_a = {0, 1, 2, 3} + ret_b = {4, 5, 6, 7} + ret_c = {8, 9, 10, 11} + ret_d = {12, 13, 14, 15} + self.assertEqual(ret_a, set(check.communication_group[Constant.P2P][0])) + self.assertEqual(ret_b, set(check.communication_group[Constant.P2P][1])) + self.assertEqual(ret_c, set(check.communication_group[Constant.P2P][2])) + self.assertEqual(ret_d, set(check.communication_group[Constant.P2P][3])) + + def test_generate_p2p_communication_group_when_given_repeat_group_expect_2_group(self): + check = CommunicationGroupGenerator(self.PARAMS).processor + check.collective_group_dict = { + 'group1': {0, 1, 2, 3}, + 'group2': {0, 1, 2, 3}, + 'group3': {0, 1, 2, 3}, + 'group4': {0, 1, 2, 3}, + 'group5': {3, 2, 4, 5}, + 'group6': {4, 5, 6, 7}, + 'group7': {4, 5, 6, 7}, + 'group8': {4, 5, 6, 7}, + 'group9': {8, 9, 11, 10}, + 'group10': {8, 9, 11, 10}, + 'group11': {11, 10, 12, 13}, + 'group12': {11, 10, 12, 13}, + 'group13': {11, 10, 12, 13}, + 'group14': {12, 13, 14, 15}, + 'group15': {12, 13, 14, 15}, + 'group16': {12, 13, 14, 15} + } + with mock.patch("msprof_analyze.prof_common.file_manager.FileManager.read_json_file", + return_value=True): + check.generate_p2p_communication_group() + ret_a = {0, 1, 2, 3, 4, 5, 6, 7} + ret_b = {8, 9, 10, 11, 12, 13, 14, 15} + self.assertEqual(ret_a, set(check.communication_group[Constant.P2P][0])) + self.assertEqual(ret_b, set(check.communication_group[Constant.P2P][1])) diff --git a/profiler/test/ut/cluster_analyse/prof_bean/test_step_trace_time_bean.py b/profiler/msprof_analyze/test/ut/cluster_analyse/prof_bean/test_step_trace_time_bean.py similarity index 33% rename from profiler/test/ut/cluster_analyse/prof_bean/test_step_trace_time_bean.py rename to profiler/msprof_analyze/test/ut/cluster_analyse/prof_bean/test_step_trace_time_bean.py index 
e369df48421f0bcc50017e6f03583771c29ea076..f7d81543b277db0f371e8b6c63d78697ccc4a091 100644 --- a/profiler/test/ut/cluster_analyse/prof_bean/test_step_trace_time_bean.py +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/prof_bean/test_step_trace_time_bean.py @@ -1,6 +1,20 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import unittest -from prof_bean.step_trace_time_bean import StepTraceTimeBean +from msprof_analyze.cluster_analyse.prof_bean.step_trace_time_bean import StepTraceTimeBean class TestStepTraceTimeBean(unittest.TestCase): diff --git a/profiler/msprof_analyze/test/ut/cluster_analyse/recipes/test_compute_op_sum.py b/profiler/msprof_analyze/test/ut/cluster_analyse/recipes/test_compute_op_sum.py new file mode 100644 index 0000000000000000000000000000000000000000..314a213bce44d83dbedb1cfed221722467d5411e --- /dev/null +++ b/profiler/msprof_analyze/test/ut/cluster_analyse/recipes/test_compute_op_sum.py @@ -0,0 +1,69 @@ +# Copyright (c) 2025, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest +import pandas as pd + +from msprof_analyze.cluster_analyse.recipes.compute_op_sum.compute_op_sum import ComputeOpSum +from msprof_analyze.prof_common.constant import Constant + + +class TestComputeOpSum(unittest.TestCase): + PARAMS = { + Constant.COLLECTION_PATH: "/data", + Constant.DATA_MAP: {}, + Constant.DATA_TYPE: Constant.DB, + Constant.CLUSTER_ANALYSIS_OUTPUT_PATH: "./test_compute_op_sum", + Constant.RECIPE_NAME: "ComputeOpSum", + Constant.RECIPE_CLASS: ComputeOpSum, + Constant.PARALLEL_MODE: Constant.CONCURRENT_MODE, + Constant.EXPORT_TYPE: Constant.DB, + ComputeOpSum.RANK_LIST: Constant.ALL, + } + + def test_reducer_func_when_exclude_op_name_switch_on_given_all_dataframe(self): + df = pd.DataFrame({ + "OpType": ["ZerosLike", "Cast", "Slice"], + "TaskType": ["AI_VECTOR_CORE", "AI_VECTOR_CORE", "AI_VECTOR_CORE"], + "InputShapes": ["1903865856", "4,1025", "4,1025;2;2"], + "Duration": [2553091.0, 3020.0, 2440.0], + "Rank": [0, 0, 0] + }) + params = {Constant.EXTRA_ARGS: ["--exclude_op_name"]} + params.update(self.PARAMS) + recipe = ComputeOpSum(params) + recipe.reducer_func([df]) + self.assertEqual(recipe.all_rank_stats.shape, (3, 9)) + self.assertEqual(recipe.per_rank_stats_by_optype.shape, (3, 10)) + self.assertIsNone(recipe.per_rank_stats_by_opname, None) + + + def test_reducer_func_when_exclude_op_name_switch_off_given_all_dataframe(self): + df = pd.DataFrame({ + "OpName": ["aclnnInplaceZero_ZerosLikeAiCore_ZerosLike", "aclnnCast_CastAiCore_Cast", + "aclnnInplaceCopy_SliceAiCore_Slice"], + "OpType": ["ZerosLike", "Cast", 
"Slice"], + "TaskType": ["AI_VECTOR_CORE", "AI_VECTOR_CORE", "AI_VECTOR_CORE"], + "InputShapes": ["1903865856", "4,1025", "4,1025;2;2"], + "Duration": [2553091.0, 3020.0, 2440.0], + "Rank": [0, 0, 0] + }) + params = {} + params.update(self.PARAMS) + recipe = ComputeOpSum(params) + recipe.reducer_func([df]) + self.assertEqual(recipe.all_rank_stats.shape, (3, 9)) + self.assertEqual(recipe.per_rank_stats_by_optype.shape, (3, 10)) + self.assertEqual(recipe.per_rank_stats_by_opname.shape, (3, 10)) \ No newline at end of file diff --git a/profiler/msprof_analyze/test/ut/compare_tools/__init__.py b/profiler/msprof_analyze/test/ut/compare_tools/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/ut/compare_tools/comparator/test_communication_comparator.py b/profiler/msprof_analyze/test/ut/compare_tools/comparator/test_communication_comparator.py similarity index 90% rename from profiler/test/ut/compare_tools/comparator/test_communication_comparator.py rename to profiler/msprof_analyze/test/ut/compare_tools/comparator/test_communication_comparator.py index 3cd1884c226c802f91e8af88bf46759dbb6d5be3..f44b7b3ddf01998217949887505f71375c933e42 100644 --- a/profiler/test/ut/compare_tools/comparator/test_communication_comparator.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/comparator/test_communication_comparator.py @@ -1,7 +1,7 @@ import unittest -from compare_backend.comparator.communication_comparator import CommunicationComparator -from compare_backend.compare_bean.communication_bean import CommunicationBean +from msprof_analyze.compare_tools.compare_backend.comparator.communication_comparator import CommunicationComparator +from msprof_analyze.compare_tools.compare_backend.compare_bean.communication_bean import CommunicationBean class TestCommunicationComparator(unittest.TestCase): diff --git a/profiler/test/ut/compare_tools/comparator/test_operator_comparator.py 
b/profiler/msprof_analyze/test/ut/compare_tools/comparator/test_operator_comparator.py similarity index 88% rename from profiler/test/ut/compare_tools/comparator/test_operator_comparator.py rename to profiler/msprof_analyze/test/ut/compare_tools/comparator/test_operator_comparator.py index cb51b5756c4a72f85b0a67d3da3e0864d54312e0..cff56a146c8ac2e62ff2059371251b0cdaa14004 100644 --- a/profiler/test/ut/compare_tools/comparator/test_operator_comparator.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/comparator/test_operator_comparator.py @@ -1,6 +1,6 @@ import unittest -from compare_backend.comparator.operator_comparator import OperatorComparator +from msprof_analyze.compare_tools.compare_backend.comparator.operator_comparator import OperatorComparator class MockBean: diff --git a/profiler/test/ut/compare_tools/comparator/test_operator_statistic_comparator.py b/profiler/msprof_analyze/test/ut/compare_tools/comparator/test_operator_statistic_comparator.py similarity index 86% rename from profiler/test/ut/compare_tools/comparator/test_operator_statistic_comparator.py rename to profiler/msprof_analyze/test/ut/compare_tools/comparator/test_operator_statistic_comparator.py index 133fa197f14a5bd0a4ca9d7280b096962190a066..a123a98bffa44804154e6bc1caaa5fa07f38093a 100644 --- a/profiler/test/ut/compare_tools/comparator/test_operator_statistic_comparator.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/comparator/test_operator_statistic_comparator.py @@ -1,7 +1,8 @@ import unittest from unittest.mock import patch -from compare_backend.comparator.operator_statistic_comparator import OperatorStatisticComparator +from msprof_analyze.compare_tools.compare_backend.comparator.operator_statistic_comparator \ + import OperatorStatisticComparator class MockBean: @@ -24,7 +25,8 @@ class TestOperatorStatisticComparator(unittest.TestCase): base_dict = {"add": [1], "matmul": [1]} comparison_dict = {"add": [1], "reduce": [1]} with patch( - 
"compare_backend.comparator.operator_statistic_comparator.OperatorStatisticComparator._group_by_op_name", + "msprof_analyze.compare_tools.compare_backend.comparator.operator_statistic_comparator." + "OperatorStatisticComparator._group_by_op_name", return_value=(base_dict, comparison_dict)): comparator = OperatorStatisticComparator({1: 2}, MockBean) comparator._compare() diff --git a/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/__init__.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/__init__.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_compare_event.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_compare_event.py similarity index 77% rename from profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_compare_event.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_compare_event.py index 771e1993398704774b9cb8a5b48c350f0a73b5bd..8b2c40dcd4a05a7966816e5ab1e50da84fcc34ac 100644 --- a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_compare_event.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_compare_event.py @@ -1,7 +1,10 @@ import unittest -from compare_backend.compare_bean.origin_data_bean.compare_event import KernelEvent, MemoryEvent -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.compare_event import ( + KernelEvent, + 
MemoryEvent +) +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean class TestKernelEvent(unittest.TestCase): diff --git a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py similarity index 94% rename from profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py index 869ee85570febe1d5db7c1a5aa6e89ac8392078d..94ff68eceb9b4682e67df455185fde634ea12b36 100644 --- a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py @@ -1,6 +1,7 @@ import unittest -from compare_backend.compare_bean.origin_data_bean.kernel_details_bean import KernelDetailsBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.kernel_details_bean \ + import KernelDetailsBean class TestKernelDetailsBean(unittest.TestCase): diff --git a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_memory_record_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_memory_record_bean.py similarity index 67% rename from profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_memory_record_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_memory_record_bean.py index 3ec34ffbaa53d0369716e4df6a5633bae7eb28c1..2488105bec8cb8b41b14c7b21f4ab0e7d6742cb0 100644 --- a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_memory_record_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_memory_record_bean.py @@ -1,6 +1,7 @@ import unittest 
-from compare_backend.compare_bean.origin_data_bean.memory_record_bean import MemoryRecordBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.memory_record_bean \ + import MemoryRecordBean class TestMemoryRecordBean(unittest.TestCase): diff --git a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_operator_memory_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_operator_memory_bean.py similarity index 84% rename from profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_operator_memory_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_operator_memory_bean.py index 027b620e81fc6c9fa5d4694258f69cccd21a7c91..750da0b9be8f4a0d0da918b8f46577c336f005d8 100644 --- a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_operator_memory_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_operator_memory_bean.py @@ -1,6 +1,7 @@ import unittest -from compare_backend.compare_bean.origin_data_bean.operator_memory_bean import OperatorMemoryBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.operator_memory_bean \ + import OperatorMemoryBean class TestOperatorMemoryBean(unittest.TestCase): diff --git a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_trace_event_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_trace_event_bean.py similarity index 96% rename from profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_trace_event_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_trace_event_bean.py index 07a41c7e747092abce27025fc88ac13ae14bc565..8dbb9ac5d4f586953fc891fa005c10ddf86b6c43 100644 --- a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_trace_event_bean.py +++ 
b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/origin_data_bean/test_trace_event_bean.py @@ -1,6 +1,6 @@ import unittest -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean class TestTraceEventBean(unittest.TestCase): @@ -74,10 +74,6 @@ class TestTraceEventBean(unittest.TestCase): self.assertTrue(TraceEventBean({"name": "Communication(Not Overlapped)"}).is_comm_not_overlap()) self.assertFalse(TraceEventBean({"name": "add"}).is_comm_not_overlap()) - def test_is_dict(self): - self.assertTrue(TraceEventBean({}).is_dict()) - self.assertFalse(TraceEventBean([]).is_dict()) - def test_is_kernel_cat(self): self.assertTrue(TraceEventBean({"cat": "Kernel"}).is_kernel_cat()) self.assertFalse(TraceEventBean({"cat": "cpu_op"}).is_kernel_cat()) diff --git a/profiler/test/ut/compare_tools/compare_bean/test_communication_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_communication_bean.py similarity index 87% rename from profiler/test/ut/compare_tools/compare_bean/test_communication_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_communication_bean.py index 2b5ee568287b47a9c8ee4b4f7c5989b42883a4d9..4c14de0a43ce5b7f057887313e5c8c0dfe4530d6 100644 --- a/profiler/test/ut/compare_tools/compare_bean/test_communication_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_communication_bean.py @@ -1,6 +1,6 @@ import unittest -from compare_backend.compare_bean.communication_bean import CommunicationBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.communication_bean import CommunicationBean class TestCommunicationBean(unittest.TestCase): diff --git a/profiler/test/ut/compare_tools/compare_bean/test_memory_compare_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_memory_compare_bean.py 
similarity index 65% rename from profiler/test/ut/compare_tools/compare_bean/test_memory_compare_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_memory_compare_bean.py index f275c2261af172c99b97a4cc4775a3942ec4a967..39efc1ac88d9cba9a4466edd660742b2c702419f 100644 --- a/profiler/test/ut/compare_tools/compare_bean/test_memory_compare_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_memory_compare_bean.py @@ -1,7 +1,7 @@ import unittest from unittest.mock import patch -from compare_backend.compare_bean.memory_compare_bean import MemoryCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.memory_compare_bean import MemoryCompareBean class MockNode: @@ -22,18 +22,21 @@ class TestMemoryCompareBean(unittest.TestCase): def test_row_when_valid_data(self): result = [2, self.name, None, None, 'add', 8, self.name, None, None, 'add', 8, 0, 1.0] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", return_value=[MockMemory(8)]): + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", + return_value=[MockMemory(8)]): mem = MemoryCompareBean(1, MockNode(self.name), MockNode(self.name)) self.assertEqual(mem.row, result) def test_row_when_invalid_base_data(self): result = [2, None, None, None, "", 0, self.name, None, None, 'add', 8, 8, float("inf")] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", return_value=[MockMemory(8)]): + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", + return_value=[MockMemory(8)]): mem = MemoryCompareBean(1, None, MockNode(self.name)) self.assertEqual(mem.row, result) def test_row_when_invalid_comparison_data(self): result = [2, self.name, None, None, 'add', 8, None, None, None, '', 0, -8, 0] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", return_value=[MockMemory(8)]): + 
with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", + return_value=[MockMemory(8)]): mem = MemoryCompareBean(1, MockNode(self.name), None) self.assertEqual(mem.row, result) diff --git a/profiler/test/ut/compare_tools/compare_bean/test_memory_statistic_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_memory_statistic_bean.py similarity index 72% rename from profiler/test/ut/compare_tools/compare_bean/test_memory_statistic_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_memory_statistic_bean.py index 40bed2160baa6ee2136791f6fb9d26f0bb0c30bc..48a4f0096b62c7f5c015c342c8af61997d8cf4ef 100644 --- a/profiler/test/ut/compare_tools/compare_bean/test_memory_statistic_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_memory_statistic_bean.py @@ -1,7 +1,7 @@ import unittest from unittest.mock import patch -from compare_backend.compare_bean.memory_statistic_bean import MemoryStatisticBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.memory_statistic_bean import MemoryStatisticBean class MockMemory: @@ -15,21 +15,21 @@ class TestMemoryStatisticBean(unittest.TestCase): def test_row_when_valid_data(self): result = [None, self.name, 8.0, 40.0, 2, 4.0, 20.0, 1, -20.0, 0.5] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", return_value=[MockMemory(10240, 2000), MockMemory(10240, 2000)]): bean = MemoryStatisticBean(self.name, [1, 1], [1]) self.assertEqual(bean.row, result) def test_row_when_invalid_base_data(self): result = [None, self.name, 0, 0, 0, 4.0, 20.0, 1, 20.0, float("inf")] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", 
return_value=[MockMemory(10240, 2000), MockMemory(10240, 2000)]): bean = MemoryStatisticBean(self.name, [], [1]) self.assertEqual(bean.row, result) def test_row_when_invalid_comparison_data(self): result = [None, self.name, 8.0, 40.0, 2, 0, 0, 0, -40.0, 0] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_memory", return_value=[MockMemory(10240, 2000), MockMemory(10240, 2000)]): bean = MemoryStatisticBean(self.name, [1, 1], []) self.assertEqual(bean.row, result) diff --git a/profiler/test/ut/compare_tools/compare_bean/test_operator_compare_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_operator_compare_bean.py similarity index 65% rename from profiler/test/ut/compare_tools/compare_bean/test_operator_compare_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_operator_compare_bean.py index ab8dec156f9589ccf516cda2c7ecd3ecf854f963..15099a5f54208b6c3e31147b1d04578ea0c29f84 100644 --- a/profiler/test/ut/compare_tools/compare_bean/test_operator_compare_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_operator_compare_bean.py @@ -1,7 +1,7 @@ import unittest from unittest.mock import patch -from compare_backend.compare_bean.operator_compare_bean import OperatorCompareBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.operator_compare_bean import OperatorCompareBean class MockNode: @@ -22,18 +22,21 @@ class TestOperatorCompareBean(unittest.TestCase): def test_row_when_valid_data(self): result = [2, self.name, None, None, 'add', 8, self.name, None, None, 'add', 8, 0, 1.0] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", return_value=[MockKernel(8)]): + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", + return_value=[MockKernel(8)]): op = 
OperatorCompareBean(1, MockNode(self.name), MockNode(self.name)) self.assertEqual(op.row, result) def test_row_when_invalid_base_data(self): result = [2, None, None, None, "", 0, self.name, None, None, 'add', 8, 8, float("inf")] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", return_value=[MockKernel(8)]): + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", + return_value=[MockKernel(8)]): op = OperatorCompareBean(1, None, MockNode(self.name)) self.assertEqual(op.row, result) def test_row_when_invalid_comparison_data(self): result = [2, self.name, None, None, 'add', 8, None, None, None, '', 0, -8, 0] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", return_value=[MockKernel(8)]): + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", + return_value=[MockKernel(8)]): op = OperatorCompareBean(1, MockNode(self.name), None) self.assertEqual(op.row, result) diff --git a/profiler/test/ut/compare_tools/compare_bean/test_operator_statistic_bean.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_operator_statistic_bean.py similarity index 70% rename from profiler/test/ut/compare_tools/compare_bean/test_operator_statistic_bean.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_operator_statistic_bean.py index 4c4e338ce593a588316d2461eac03fd4e9677233..5293d428468678add6c7829888a69d6a2e409260 100644 --- a/profiler/test/ut/compare_tools/compare_bean/test_operator_statistic_bean.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_operator_statistic_bean.py @@ -1,7 +1,7 @@ import unittest from unittest.mock import patch -from compare_backend.compare_bean.operator_statistic_bean import OperatorStatisticBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.operator_statistic_bean import OperatorStatisticBean class MockKernel: @@ 
-14,21 +14,21 @@ class TestOperatorStatisticBean(unittest.TestCase): def test_row_when_valid_data(self): result = [None, self.name, 8.0, 2, 4.0, 1, -4.0, 0.5] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", return_value=[MockKernel(2000), MockKernel(2000)]): bean = OperatorStatisticBean(self.name, [1, 1], [1]) self.assertEqual(bean.row, result) def test_row_when_invalid_base_data(self): result = [None, self.name, 0, 0, 4.0, 1, 4.0, float("inf")] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", return_value=[MockKernel(2000), MockKernel(2000)]): bean = OperatorStatisticBean(self.name, [], [1]) self.assertEqual(bean.row, result) def test_row_when_invalid_comparison_data(self): result = [None, self.name, 8.0, 2, 0, 0, -8.0, 0] - with patch("compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", + with patch("msprof_analyze.compare_tools.compare_backend.utils.tree_builder.TreeBuilder.get_total_kernels", return_value=[MockKernel(2000), MockKernel(2000)]): bean = OperatorStatisticBean(self.name, [1, 1], []) self.assertEqual(bean.row, result) diff --git a/profiler/test/ut/compare_tools/compare_bean/test_profiling_info.py b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_profiling_info.py similarity index 97% rename from profiler/test/ut/compare_tools/compare_bean/test_profiling_info.py rename to profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_profiling_info.py index 59525f18f96236a7e0383d08721629318f690f1b..b408f734a3c575c4e6dd25f4a826b4897c7acf04 100644 --- a/profiler/test/ut/compare_tools/compare_bean/test_profiling_info.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/compare_bean/test_profiling_info.py @@ -1,6 +1,6 @@ import unittest -from 
compare_backend.compare_bean.profiling_info import ProfilingInfo +from msprof_analyze.compare_tools.compare_backend.compare_bean.profiling_info import ProfilingInfo class TestProfilingInfo(unittest.TestCase): diff --git a/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/__init__.py b/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/ut/compare_tools/profiling_parser/test_base_profiling_parser.py b/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_base_profiling_parser.py similarity index 77% rename from profiler/test/ut/compare_tools/profiling_parser/test_base_profiling_parser.py rename to profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_base_profiling_parser.py index b78c59f1f70634a4aa63efdbe5d83f6692d9efae..de2ef0a46800bade04cf955cee68ce6f82acd274 100644 --- a/profiler/test/ut/compare_tools/profiling_parser/test_base_profiling_parser.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_base_profiling_parser.py @@ -1,8 +1,11 @@ import unittest from unittest.mock import patch -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.profiling_parser.base_profiling_parser import BaseProfilingParser, ProfilingResult +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.profiling_parser.base_profiling_parser import ( + BaseProfilingParser, + ProfilingResult +) class ProfilingParser(BaseProfilingParser): @@ -53,6 +56,9 @@ class ProfilingParser(BaseProfilingParser): def _get_dispatch_func(self): pass + def _calculate_mc2_communication_time(self): + pass + class MockEvent: def __init__(self, pid, tid, ts, ph="M"): @@ -75,15 +81,16 @@ class MockEvent: def 
start_time(self): return self.ts + @staticmethod + def is_nccl_name(): + return False + def is_flow_start(self): return self.ph == "s" def is_flow_end(self): return self.ph == "f" - def is_nccl_name(self): - return False - class TestBaseProfilingParser(unittest.TestCase): flow_dict = {1: {"start": MockEvent(1, 2, 12), "end": MockEvent(2, 3, 21)}, @@ -96,70 +103,72 @@ class TestBaseProfilingParser(unittest.TestCase): def test_picking_torch_op_event(self): event = MockEvent(1, 2, 3) - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init({}, {}) self.assertTrue(parser._picking_torch_op_event(event)) def test_picking_kernel_event(self): event = MockEvent(1, 2, 3) - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init({}, {}) self.assertTrue(parser._picking_kernel_event(event)) def test_picking_flow_event(self): events = [MockEvent(1, 2, 3, "s"), MockEvent(1, 2, 3, "f")] - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init({}, {}) for event in events: self.assertTrue(parser._picking_flow_event(event)) def test_update_kernel_dict_when_valid_input(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init(self.flow_dict, self.all_kernels) parser._update_kernel_dict() self.assertEqual(len(parser._result_data.kernel_dict.get(12)), 2) def test_update_kernel_dict_when_without_kernels_return_null(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init(self.flow_dict, {}) parser._update_kernel_dict() self.assertEqual(len(parser._result_data.kernel_dict), 0) def test_update_kernel_dict_when_without_flow_return_null(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init({}, self.all_kernels) parser._update_kernel_dict() self.assertEqual(len(parser._result_data.kernel_dict), 0) def test_check_result_data(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init(self.flow_dict, self.all_kernels) parser._check_result_data() def test_load_data_when_valid_input(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init(self.flow_dict, self.all_kernels) result_data = parser.load_data() self.assertEqual(len(result_data.kernel_dict.get(12)), 2) - def test_read_trace_event_when_invalid_json_path(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): - parser = ProfilingParser() - parser.init({}, {}) - parser._read_trace_event() - self.assertEqual(parser._trace_events, []) - def test_update_communication_dict(self): result = {'allreduce': {'comm_list': [2.0], 'comm_task': {'notify_wait': [1.0]}}} - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"): parser = ProfilingParser() parser.init({}, {}) parser._comm_task_list = [TraceEventBean(event) for event in self.task_events] diff --git a/profiler/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py b/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py similarity index 73% rename from profiler/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py rename to profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py index 25293d64a2c371002e6c9624f4fa6c10c592c13b..9e4d76bf97823bde34987a6bfb9e385fde6f581f 100644 --- a/profiler/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py @@ -2,9 +2,9 @@ import unittest from collections import defaultdict from unittest.mock import patch -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.profiling_parser.base_profiling_parser import ProfilingResult -from 
compare_backend.profiling_parser.gpu_profiling_parser import GPUProfilingParser +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.profiling_parser.base_profiling_parser import ProfilingResult +from msprof_analyze.compare_tools.compare_backend.profiling_parser.gpu_profiling_parser import GPUProfilingParser class TestGpuProfilingParser(unittest.TestCase): @@ -52,8 +52,10 @@ class TestGpuProfilingParser(unittest.TestCase): other_event = {"ph": "X", "name": "other", "pid": 1, "tid": 1, "ts": 6, "dur": 1} def test_update_memory_list_when_valid_input(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.gpu_profiling_parser.GPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "gpu_profiling_parser.GPUProfilingParser.__init__", return_value=None): res = GPUProfilingParser({}, {}) res._enable_memory_compare = True @@ -64,12 +66,18 @@ class TestGpuProfilingParser(unittest.TestCase): self.assertEqual(res._result_data.memory_list[0].memory_details, ", (1, 2), [duration: 1.0], [size: 0.5]\n") def test_calculate_performance_time_when_valid_input(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.gpu_profiling_parser.GPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "gpu_profiling_parser.GPUProfilingParser.__init__", return_value=None): res = GPUProfilingParser({}, {}) res._profiling_type = "GPU" - res._trace_events = [TraceEventBean(event) for event in self.trace_events] + res._all_kernels = {} + for event in self.trace_events: + event_name = event.get("name") + event_ts = str(event.get("ts")) + res._all_kernels[event_name + event_ts] = TraceEventBean(event) res._result_data = ProfilingResult("GPU") res._compute_stream_id = 3 res._flow_dict = {} @@ -86,8 +94,10 @@ class TestGpuProfilingParser(unittest.TestCase): self.assertEqual(res._result_data.overall_metrics.compute_time, 7) def test_picking_memory_event_when_valid_input(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.gpu_profiling_parser.GPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "gpu_profiling_parser.GPUProfilingParser.__init__", return_value=None): res = GPUProfilingParser({}, {}) res._memory_events = [] @@ -98,8 +108,10 @@ class TestGpuProfilingParser(unittest.TestCase): def test_is_torch_op_event_when_valid_input(self): event_list = [{"cat": "cpu_op"}, {"cat": "user_annotation"}, {"cat": "cuda_runtime"}, {"cat": "operator"}] - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.gpu_profiling_parser.GPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "gpu_profiling_parser.GPUProfilingParser.__init__", return_value=None): res = GPUProfilingParser({}, {}) for event in event_list: @@ -111,8 +123,10 @@ class TestGpuProfilingParser(unittest.TestCase): def test_is_kernel_event_when_valid_input(self): event_list1 = [{"cat": "kernel", "name": "matmul"}, {"cat": "kernel", "name": "nccl_reduce"}] event_list2 = [{"cat": "async", "name": "nccl_reduce"}, {"cat": "cpu_op", "name": "aten::to"}] - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.gpu_profiling_parser.GPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "gpu_profiling_parser.GPUProfilingParser.__init__", return_value=None): res = GPUProfilingParser({}, {}) for event in event_list1: @@ -123,8 +137,10 @@ class TestGpuProfilingParser(unittest.TestCase): self.assertFalse(result) def test_is_flow_event_when_valid_input(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.gpu_profiling_parser.GPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "gpu_profiling_parser.GPUProfilingParser.__init__", return_value=None): res = GPUProfilingParser({}, {}) res._flow_cat = ("async_gpu",) diff --git a/profiler/test/ut/compare_tools/profiling_parser/test_npu_profiling_parser.py b/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_npu_profiling_parser.py similarity index 60% rename from profiler/test/ut/compare_tools/profiling_parser/test_npu_profiling_parser.py rename to profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_npu_profiling_parser.py index 3d9ff4512d4d8c85a438bbed141219fed390d3e5..8f33065c16f2d7a00ab08cec6305c0f237a66252 100644 --- a/profiler/test/ut/compare_tools/profiling_parser/test_npu_profiling_parser.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/profiling_parser/test_npu_profiling_parser.py @@ -1,10 +1,11 @@ import unittest from unittest.mock import patch -from compare_backend.compare_bean.origin_data_bean.operator_memory_bean import OperatorMemoryBean -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.profiling_parser.base_profiling_parser import ProfilingResult -from compare_backend.profiling_parser.npu_profiling_parser import NPUProfilingParser +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.operator_memory_bean \ + import OperatorMemoryBean +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.profiling_parser.base_profiling_parser import ProfilingResult +from msprof_analyze.compare_tools.compare_backend.profiling_parser.npu_profiling_parser import NPUProfilingParser class TestNPUProfilingParser(unittest.TestCase): @@ -21,11 +22,14 @@ class TestNPUProfilingParser(unittest.TestCase): {"ph": "M", "name": "thread_sort_index", "pid": 7, "tid": 3, "args": {"sort_index": 0}}] def test_update_memory_list_when_invalid_path(self): - with 
patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None): res = NPUProfilingParser({}, {}) res._operator_memory_path = "" + res._path_level='' res._update_memory_list() def test_update_memory_list_when_valid_data(self): @@ -35,22 +39,29 @@ class TestNPUProfilingParser(unittest.TestCase): OperatorMemoryBean({"Name": "cann::add", "Size(KB)": 512, "Allocation Time(us)": 2, "Release Time(us)": 4}), OperatorMemoryBean( {"Name": "aten::add", "Size(KB)": 512, "Allocation Time(us)": 7, "Release Time(us)": 10})] - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None), \ - patch("compare_backend.utils.file_reader.FileReader.read_csv_file", return_value=memory_data): + patch("msprof_analyze.prof_common.file_manager.FileManager.read_csv_file", + return_value=memory_data): res = NPUProfilingParser({}, {}) + res._path_level='' res._operator_memory_path = "" res._enqueue_dict = {} res._dequeue_data = [TraceEventBean(event) for event in self.dequeue_events] res._result_data = ProfilingResult("NPU") res._update_memory_list() + self.assertEqual(len(res._result_data.memory_list), 3) self.assertEqual(res._result_data.memory_list[0].duration, 2) def test_picking_hccl_event(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None): res = NPUProfilingParser({}, {}) res._hccl_pid = 7 @@ -64,8 +75,10 @@ class TestNPUProfilingParser(unittest.TestCase): self.assertEqual(len(res._comm_list), 1) def test_picking_task_queue_data(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None): res = NPUProfilingParser({}, {}) res._enqueue_dict = {} @@ -80,8 +93,10 @@ class TestNPUProfilingParser(unittest.TestCase): self.assertEqual(len(res._dequeue_data), 1) def test_picking_overlap_analysis_data(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None): res = NPUProfilingParser({}, {}) res._overlap_analysis = [] @@ -94,8 +109,10 @@ class TestNPUProfilingParser(unittest.TestCase): self.assertFalse(result) def test_is_kernel_event(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None): res = NPUProfilingParser({}, {}) res._kernel_pid = 5 @@ -104,27 +121,36 @@ class TestNPUProfilingParser(unittest.TestCase): self.assertFalse(res._is_kernel_event(TraceEventBean({"pid": 1, "ph": "x"}))) def test_is_flow_event(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." 
+ "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None): res = NPUProfilingParser({}, {}) self.assertTrue(res._is_flow_event(TraceEventBean({"cat": "async_npu"}))) self.assertFalse(res._is_flow_event(TraceEventBean({"cat": "async"}))) def test_is_torch_op_event(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "npu_profiling_parser.NPUProfilingParser.__init__", return_value=None): res = NPUProfilingParser({}, {}) self.assertTrue(res._is_torch_op_event(TraceEventBean({"cat": "cpu_op"}))) self.assertFalse(res._is_torch_op_event(TraceEventBean({"cat": "async"}))) def test_filter_meta_id(self): - with patch("compare_backend.profiling_parser.base_profiling_parser.BaseProfilingParser.__init__"), \ - patch("compare_backend.profiling_parser.npu_profiling_parser.NPUProfilingParser.__init__", - return_value=None): + with patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "base_profiling_parser.BaseProfilingParser.__init__"), \ + patch("msprof_analyze.compare_tools.compare_backend.profiling_parser." + "npu_profiling_parser.NPUProfilingParser.__init__", + return_value=None), \ + patch( + "compare_backend.profiling_parser.npu_profiling_parser.BaseProfilingParser." 
+ "_trace_event_generator", + return_value=(TraceEventBean(event) for event in self.meta_events)): res = NPUProfilingParser({}, {}) - res._trace_events = [TraceEventBean(event) for event in self.meta_events] res._hccl_op_tid_list = [] res._hccl_tid_name_dict = {} res._group_comm_tid_dict = {} diff --git a/profiler/test/ut/compare_tools/utils/test_name_function.py b/profiler/msprof_analyze/test/ut/compare_tools/utils/test_name_function.py similarity index 79% rename from profiler/test/ut/compare_tools/utils/test_name_function.py rename to profiler/msprof_analyze/test/ut/compare_tools/utils/test_name_function.py index 2903f9838bc1bf093ac78238076c56756e77cf07..99b90051a456a0f531ae074cd19b55e53188f70b 100644 --- a/profiler/test/ut/compare_tools/utils/test_name_function.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/utils/test_name_function.py @@ -1,8 +1,8 @@ import unittest -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.utils.name_function import NameFunction -from compare_backend.utils.torch_op_node import TorchOpNode +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.utils.name_function import NameFunction +from msprof_analyze.compare_tools.compare_backend.utils.torch_op_node import TorchOpNode class Args: diff --git a/profiler/test/ut/compare_tools/utils/test_tree_builder.py b/profiler/msprof_analyze/test/ut/compare_tools/utils/test_tree_builder.py similarity index 80% rename from profiler/test/ut/compare_tools/utils/test_tree_builder.py rename to profiler/msprof_analyze/test/ut/compare_tools/utils/test_tree_builder.py index 326a424d3dd9a36d158816ba73ffcf260ac583d9..389d11c5f3a0f0455e5bcdd7ca94fd6bb4c1cb4f 100644 --- a/profiler/test/ut/compare_tools/utils/test_tree_builder.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/utils/test_tree_builder.py @@ -1,8 +1,8 @@ import 
unittest -from compare_backend.compare_bean.origin_data_bean.compare_event import MemoryEvent -from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean -from compare_backend.utils.tree_builder import TreeBuilder +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.compare_event import MemoryEvent +from msprof_analyze.compare_tools.compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean +from msprof_analyze.compare_tools.compare_backend.utils.tree_builder import TreeBuilder class TestUtils(unittest.TestCase): diff --git a/profiler/msprof_analyze/test/ut/compare_tools/view/__init__.py b/profiler/msprof_analyze/test/ut/compare_tools/view/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/profiler/test/ut/compare_tools/view/test_excel_view.py b/profiler/msprof_analyze/test/ut/compare_tools/view/test_excel_view.py similarity index 64% rename from profiler/test/ut/compare_tools/view/test_excel_view.py rename to profiler/msprof_analyze/test/ut/compare_tools/view/test_excel_view.py index aa500c4242d567f3f588c7702202570f5f17d276..20357a886b52d4adae709513f6c78a958f447530 100644 --- a/profiler/test/ut/compare_tools/view/test_excel_view.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/view/test_excel_view.py @@ -2,7 +2,7 @@ import os import unittest from unittest.mock import patch -from compare_backend.view.excel_view import ExcelView +from msprof_analyze.compare_tools.compare_backend.view.excel_view import ExcelView class TestExcelView(unittest.TestCase): @@ -14,5 +14,6 @@ class TestExcelView(unittest.TestCase): os.remove(self.file_path) def test_generate_view(self): - with patch("compare_backend.view.work_sheet_creator.WorkSheetCreator.create_sheet"): + with patch("msprof_analyze.compare_tools.compare_backend.view.work_sheet_creator." 
+ "WorkSheetCreator.create_sheet"): ExcelView({"table1": {}, "table2": {}}, self.file_path, {}).generate_view() diff --git a/profiler/test/ut/compare_tools/view/test_screen_view.py b/profiler/msprof_analyze/test/ut/compare_tools/view/test_screen_view.py similarity index 72% rename from profiler/test/ut/compare_tools/view/test_screen_view.py rename to profiler/msprof_analyze/test/ut/compare_tools/view/test_screen_view.py index caa25e396e4dae801305cc39f78e657e1bc9601a..027cebbb9ca3384cd4072afd40cfd667488d37d7 100644 --- a/profiler/test/ut/compare_tools/view/test_screen_view.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/view/test_screen_view.py @@ -1,6 +1,6 @@ import unittest -from compare_backend.view.screen_view import ScreenView +from msprof_analyze.compare_tools.compare_backend.view.screen_view import ScreenView class TestScreenView(unittest.TestCase): diff --git a/profiler/test/ut/compare_tools/view/test_worker_sheet_creator.py b/profiler/msprof_analyze/test/ut/compare_tools/view/test_worker_sheet_creator.py similarity index 89% rename from profiler/test/ut/compare_tools/view/test_worker_sheet_creator.py rename to profiler/msprof_analyze/test/ut/compare_tools/view/test_worker_sheet_creator.py index 1e80931ff5d014adf57012e22e5850e8c75af68c..587cccca986f88769eef18f38162ea92eaa09871 100644 --- a/profiler/test/ut/compare_tools/view/test_worker_sheet_creator.py +++ b/profiler/msprof_analyze/test/ut/compare_tools/view/test_worker_sheet_creator.py @@ -4,8 +4,8 @@ import unittest import pandas as pd from xlsxwriter import Workbook -from compare_backend.utils.excel_config import ExcelConfig -from compare_backend.view.work_sheet_creator import WorkSheetCreator +from msprof_analyze.compare_tools.compare_backend.utils.excel_config import ExcelConfig +from msprof_analyze.compare_tools.compare_backend.view.work_sheet_creator import WorkSheetCreator class TestWorkerSheetCreator(unittest.TestCase): diff --git a/profiler/msprof_analyze/version.txt 
b/profiler/msprof_analyze/version.txt new file mode 100644 index 0000000000000000000000000000000000000000..10bf840ed530af123660f5edb1544264d8f2def4 --- /dev/null +++ b/profiler/msprof_analyze/version.txt @@ -0,0 +1 @@ +2.0.1 \ No newline at end of file diff --git a/profiler/test/run_st.py b/profiler/test/run_st.py deleted file mode 100644 index 83f6b2d468439d9c3ea46f8ea5d7f97ee9315e8b..0000000000000000000000000000000000000000 --- a/profiler/test/run_st.py +++ /dev/null @@ -1,53 +0,0 @@ -import logging -import os -import subprocess -import sys -import threading - -stop_thread = False - - -def print_stout(output): - while True: - line = output.readline().strip() - if line: - print(line) - global stop_thread - if stop_thread: - break - - -def run_st(): - st_status = False - timeout = 3600 - global stop_thread - - st_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), "st/") - cmd = ["python3", "-m", "pytest", "-s", st_path] - process = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - stout_thread = threading.Thread(target=print_stout, args=(process.stdout,)) - stout_thread.start() - - try: - process.wait(timeout=timeout) - except subprocess.TimeoutExpired: - process.kill() - stop_thread = True - logging.error("run st use case timeout.") - return st_status - stop_thread = True - if process.returncode == 0: - st_status = True - logging.info("run st successfully.") - else: - logging.error("run st failed.") - - return st_status - - -if __name__ == "__main__": - st_success = run_st() - if st_success: - sys.exit(0) - else: - sys.exit(1) diff --git a/profiler/test/st/compare_tools/utils.py b/profiler/test/st/compare_tools/utils.py deleted file mode 100644 index aaaf004d21cb9681d26c4365ee0c1070e5a431d4..0000000000000000000000000000000000000000 --- a/profiler/test/st/compare_tools/utils.py +++ /dev/null @@ -1,33 +0,0 @@ -import subprocess -import os -import re -import logging - - -def execute_cmd(cmd): - logging.info('Execute 
command:%s' % " ".join(cmd)) - completed_process = subprocess.run(cmd, capture_output=True, shell=False, check=True) - return completed_process.returncode - - -def execute_script(cmd): - logging.info('Execute command:%s' % " ".join(cmd)) - process = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - while process.poll() is None: - line = process.stdout.readline().strip() - if line: - print(line) - return process.returncode - - -def check_result_file(out_path): - files = os.listdir(out_path) - newest_file = None - re_match_exp = r"^performance_comparison_result_\d{1,20}\.xlsx" - for file_name in files: - if re.match(re_match_exp, file_name): - file_time = file_name.split(".")[0].split("_")[-1] - if not newest_file or file_time > newest_file.split(".")[0].split("_")[-1]: - newest_file = file_name - - return newest_file diff --git a/profiler/test/ut/advisor/timeline_advice/test_gc_checker.py b/profiler/test/ut/advisor/timeline_advice/test_gc_checker.py deleted file mode 100644 index 112847ac85e9ab645695e26af61f0bcb04bd0ec9..0000000000000000000000000000000000000000 --- a/profiler/test/ut/advisor/timeline_advice/test_gc_checker.py +++ /dev/null @@ -1,30 +0,0 @@ -import unittest - -from profiler.advisor.analyzer.schedule.gc.gc_checker import GcChecker -from profiler.test.ut.advisor.advisor_backend.tools.tool import recover_env -from profiler.advisor.common.timeline.event import TimelineEvent - - -class TestGcChecker(unittest.TestCase): - @classmethod - def tearDownClass(cls) -> None: - recover_env() - - def test_no_synchronize_stream(self): - checker = GcChecker() - - large_free_events = [ - TimelineEvent(dict(ts=1, dur=10)), TimelineEvent(dict(ts=20, dur=100)), TimelineEvent(dict(ts=200, dur=10)) - ] - - checker.max_acl_event_time_ratio = 0.02 - checker.max_acl_event_num_ratio = 0.02 - acl_events = [TimelineEvent(dict(ts=i, dur=0.1)) for i in range(1, 10)] + \ - [TimelineEvent(dict(ts=i, dur=0.1)) for i in range(20, 21)] + \ - 
[TimelineEvent(dict(ts=i, dur=0.1)) for i in range(200, 210)] - free_event = checker.get_free_events_include_gc(large_free_events, acl_events) - self.assertEqual(free_event, TimelineEvent(dict(ts=20, dur=100))) - - checker.max_acl_event_num_ratio = 0.001 - free_event = checker.get_free_events_include_gc(large_free_events, acl_events) - self.assertEqual(free_event, {}) diff --git a/profiler/test/ut/compare_tools/utils/test_file_reader.py b/profiler/test/ut/compare_tools/utils/test_file_reader.py deleted file mode 100644 index de7e13f9e539a1f2003d440d391651928e6f5f2d..0000000000000000000000000000000000000000 --- a/profiler/test/ut/compare_tools/utils/test_file_reader.py +++ /dev/null @@ -1,19 +0,0 @@ -import unittest - -from compare_backend.utils.file_reader import FileReader -from compare_backend.utils.constant import Constant - - -class TestFileReader(unittest.TestCase): - - def test_read_trace_file(self): - json_data = FileReader.read_trace_file("resource/event_list.json") - self.assertEqual(len(json_data), 2) - - def test_read_csv_file(self): - csv = FileReader.read_csv_file("resource/test.csv") - self.assertEqual(len(csv), 8) - - def test_check_json_type(self): - t = FileReader.check_json_type("resource/event_list.json") - self.assertEqual(t, Constant.NPU) diff --git a/profiler/version.txt b/profiler/version.txt deleted file mode 100644 index 589268e6fedb18e0dcdb97be4f19d569c5878d2b..0000000000000000000000000000000000000000 --- a/profiler/version.txt +++ /dev/null @@ -1 +0,0 @@ -1.3.0 \ No newline at end of file diff --git a/sample/README.md b/sample/README.md index 8e555f4870d2c39fc5cabad3092d1c17f60d3dfa..15238cb9f3815d6fecb0c743e6f826d2abc2988b 100644 --- a/sample/README.md +++ b/sample/README.md @@ -8,10 +8,19 @@ 说明:该sample目录中,每个最小目录就是一个完整的样例工程。这些样例工程本身可能以为依赖的不同存在差异。 ## 依赖说明 -安装CANN包,并使能环境变量,并确保```ASCEND_HOME_PATH```生效,可以在CANN包安装目录下使能: -``` -source set_env.sh -``` +- 
硬件环境请参见《[昇腾产品形态说明](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F80RC22%2Fquickstart%2Fquickstart%2Fquickstart_18_0002.html)》。 +- 软件环境请参见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F80RC22%2Fsoftwareinst%2Finstg%2Finstg_0000.html%3FMode%3DPmIns%26OS%3DUbuntu%26Software%3DcannToolKit)》安装昇腾设备开发或运行环境,即toolkit软件包。 + +以上环境依赖请根据实际环境选择适配的版本。 + +### 版本配套 +| 条件 | 要求 | +|---|---| +| CANN版本 | >=8.0.RC1.alpha001 | +| 硬件要求 | Atlas 800T A2 训练服务器| + +- 支持AscendPyTorch 1.11.0或更高版本,支持的PyTorch和CANN以及PyTorch和Python软件版本配套关系请参见《[Ascend Extension for PyTorch插件](https://gitee.com/ascend/pytorch)》。 +- 固件驱动版本与配套CANN软件支持的固件驱动版本相同,开发者可通过“[昇腾社区-固件与驱动](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fhardware%2Ffirmware-drivers%2Fcommunity%3Fproduct%3D2%26model%3D28%26cann%3D8.0.RC3.alpha003%26driver%3D1.0.25.alpha)”页面根据产品型号与CANN软件版本获取配套的固件与驱动。 ## 目录介绍 整体目录结构如下: @@ -91,7 +100,7 @@ mssanitizer ./*.fatbin # 默认进行memcheck检查 ``` LINK_LIBS := -L${ASCEND_HOME_PATH}/lib64 -lruntime -lascendcl -lstdc++ 修改为: - LINK_LIBS := -L${ASCEND_HOME_PATH}/lib64 -L${ASCEND_HOME_PATH}/tools/simulator/${SOC_VERSION}/lib/ -lruntime_camodel -lascendcl -lstdc++ # 需要添加libruntime_camodel的依赖路径, SOC_VERSION 使用npu-smi info查询NPU Name + LINK_LIBS := -L${ASCEND_HOME_PATH}/lib64 -L${ASCEND_HOME_PATH}/tools/simulator/${SOC_VERSION}/lib/ -lruntime_camodel -lascendcl -lstdc++ # 需要添加libruntime_camodel的依赖路径, SOC_VERSION 通过使用npu-smi info命令进行查询,获取Chip Name信息。实际配置值 为AscendChip Name,例如Chip Name取值为xxxyy,实际配置值为Ascendxxxyy。当Ascendxxxyy为代码样例路径时,需要配置ascendxxxyy。 ``` + 调试信息增强: ``` diff --git "a/\345\205\254\347\275\221URL\350\257\264\346\230\216.md" "b/\345\205\254\347\275\221URL\350\257\264\346\230\216.md" index 67064cd045c9b65da7bbf9bbce5378cb47dabcad..c78d206c1a47d0e39555574ac78b111cc0d37c53 100644 --- "a/\345\205\254\347\275\221URL\350\257\264\346\230\216.md" +++ 
"b/\345\205\254\347\275\221URL\350\257\264\346\230\216.md" @@ -2,13 +2,13 @@ | 软件类型 | 软件名 | 路径 | 类型 | 内容 | 用途说明 | |------|----------------------------------------------------|------------------------------------------|------|------------------------------------------------------------------------------------------------------------|--------------------| -| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/advisor/config/config.ini | 公网地址 | https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20Fused%20Operator%20API%20Replacement.md" | Advisor优化手段参考示例 | -| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/advisor/config/config.ini | 公网地址 | https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/modeldevpt/ptmigr/AImpug_0067.html | Advisor优化手段参考示例 | -| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/advisor/config/config.ini | 公网地址 | https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/devtools/auxiliarydevtool/aoe_16_045.html | Advisor优化手段参考示例 | -| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/advisor/config/config.ini | 公网地址 | https://www.mindspore.cn/lite/docs/en/master/use/cloud_infer/converter_tool_ascend.html#aoe-auto-tuning | Advisor优化手段参考示例 | -| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/advisor/config/config.ini | 公网地址 | https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/modeldevpt/ptmigr/AImpug_0059.html | Advisor优化手段参考示例 | -| 开源软件 | MindStudio Training Tools - msprof-analyze | /profiler/config/config.ini | 公网地址 | https://gitee.com/ascend/mstt/tree/master/profiler | msprof-analyze工具地址 | -| 开源软件 | MindStudio Training Tools - msprof-analyze | /profiler/LICENSE | 公网地址 | http://www.apache.org/licenses/LICENSE-2.0 | 开源软件协议地址 | -| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/advisor/rules/aicpu_rules.ymal | 公网地址 | 
https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md | AI CPU 算子替换样例 | -| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/advisor/rules/environment_variable_info.yaml | 公网地址 | https://support.huawei.com/enterprise/zh/doc/EDOC1100371278/5eeeed85?idPath=23710424 | 组网指南 | -| 开源软件 | MindStudio Training Tools - msprof-analyze | /profiler/config/config.ini | 公网地址 | pmail_mindstudio@huawei.com | 公网邮箱 | +| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/msprof_analyze/advisor/config/config.ini | 公网地址 | https://gitee.com/ascend/mstt/blob/master/profiler/msprof_analyze/advisor/doc/Samples%20of%20Fused%20Operator%20API%20Replacement.md" | Advisor优化手段参考示例 | +| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/msprof_analyze/advisor/config/config.ini | 公网地址 | https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/modeldevpt/ptmigr/AImpug_0067.html | Advisor优化手段参考示例 | +| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/msprof_analyze/advisor/config/config.ini | 公网地址 | https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/devtools/auxiliarydevtool/aoe_16_045.html | Advisor优化手段参考示例 | +| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/msprof_analyze/advisor/config/config.ini | 公网地址 | https://www.mindspore.cn/lite/docs/en/master/use/cloud_infer/converter_tool_ascend.html#aoe-auto-tuning | Advisor优化手段参考示例 | +| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/msprof_analyze/advisor/config/config.ini | 公网地址 | https://www.hiascend.com/document/detail/zh/canncommercial/70RC1/modeldevpt/ptmigr/AImpug_0059.html | Advisor优化手段参考示例 | +| 开源软件 | MindStudio Training Tools - msprof-analyze | /profiler/msprof_analyze/config/config.ini | 公网地址 | https://gitee.com/ascend/mstt/tree/master/profiler/msprof_analyze | msprof-analyze工具地址 | +| 开源软件 | MindStudio Training Tools - msprof-analyze | 
/profiler/msprof_analyze/LICENSE | 公网地址 | http://www.apache.org/licenses/LICENSE-2.0 | 开源软件协议地址 | +| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/msprof_analyze/advisor/rules/aicpu_rules.ymal | 公网地址 | https://gitee.com/ascend/mstt/blob/master/profiler/msprof_analyze/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md | AI CPU 算子替换样例 | +| 开源软件 | MindStudio Training Tools - msprof-analyze advisor | /profiler/msprof_analyze/advisor/rules/environment_variable_info.yaml | 公网地址 | https://support.huawei.com/enterprise/zh/doc/EDOC1100371278/5eeeed85?idPath=23710424 | 组网指南 | +| 开源软件 | MindStudio Training Tools - msprof-analyze | /profiler/msprof_analyze/config/config.ini | 公网地址 | pmail_mindstudio@huawei.com | 公网邮箱 |
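The sample/README hunk above documents the `SOC_VERSION` naming rule: take the Chip Name reported by `npu-smi info`, prefix it with `Ascend` (Chip Name `xxxyy` becomes `Ascendxxxyy`), and use the lowercase prefix form `ascendxxxyy` when the value serves as a sample-code path. A minimal sketch of that rule, assuming only the prefix casing changes while the Chip Name itself is kept verbatim (the helper name is illustrative, not from the repo):

```python
def derive_soc_version(chip_name: str, for_path: bool = False) -> str:
    """Derive the SOC_VERSION value from an npu-smi Chip Name.

    Per the README note: prepend "Ascend" to the Chip Name; when the
    value is used as a sample-code path, the lowercase prefix "ascend"
    is required instead. (Assumption: only the prefix casing differs.)
    """
    prefix = "ascend" if for_path else "Ascend"
    return prefix + chip_name


# Example from the README's own placeholder chip name "xxxyy":
print(derive_soc_version("xxxyy"))                 # -> Ascendxxxyy
print(derive_soc_version("xxxyy", for_path=True))  # -> ascendxxxyy
```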