diff --git a/OWNERS b/OWNERS index 7b721dd643e3399d29ff649e2f76182de72421a4..2e949debf181a6e75fdb5b1e1e091ce7a39c7e69 100644 --- a/OWNERS +++ b/OWNERS @@ -13,35 +13,14 @@ approvers: - kun_8 - binghamhuang reviewers: -- leo920320 -- wo-wenjie -- ma-dongfang -- wuyulong11 -- alysongirl -- wangchao285 -- brightlyking -- chenhao_1209 -- feng123www -- zhang-mingyu-0813 -- snowflakephoenix -- Seanesmhxocism -- augboost -- fanxiaotong1995 -- sunboquan -- kun_8 -- Martin-M -- ly-qianxiao -- yang-minghai22 -- hu-xiao-bo - lv-kaimeng - litian_drinksnow -- blian -- cycoe -- machj -- zhengweifeng6 -- gong-siwei -- uniteone - binghamhuang -- wjchuee -- zhou-xianqi -- stby11 \ No newline at end of file +- wo-wenjie +- ly-qianxiao +- leo920320 +- sunboquan +- stby +- Seanesmhxocism +- TAJh +- czr9775 \ No newline at end of file diff --git a/README.md b/README.md index d76740bca40ef1bf43686fda32582dd2a9e53832..dd25d20158d7a42bec57efc931d3fad5e838a73b 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ 原Ascend Training Tools工具更名为MindStudio Training Tools,MindStudio训练工具链。变更计划如下: 1. 2024.06.25本代码仓名称变更为mstt。 -2. 2024.07.25 URL变更为[https://gitee.com/ascend/mstt](https://gitee.com/ascend/mstt),原始URL将不再维护。 +2. 2024.07.04 URL变更为[https://gitee.com/ascend/mstt](https://gitee.com/ascend/mstt),原始URL仍然可用,但建议使用新URL。 # MindStudio Training Tools @@ -14,49 +14,49 @@ MindStudio Training Tools,MindStudio训练工具链。针对训练&大模型 ## 使用说明 -### [分析迁移工具](https://gitee.com/ascend/att/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D) +### [分析迁移工具](https://gitee.com/ascend/mstt/wikis/工具介绍/分析迁移工具/分析迁移工具介绍) -1. [脚本分析工具](https://gitee.com/ascend/att/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E5%88%86%E6%9E%90%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) +1. [脚本分析工具](https://gitee.com/ascend/mstt/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E5%88%86%E6%9E%90%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) 脚本分析工具提供分析脚本,帮助用户在执行迁移操作前,分析基于GPU平台的PyTorch训练脚本中算子、三方库套件、亲和API分析以及动态shape的支持情况。 -2. [(推荐)自动迁移工具](https://gitee.com/ascend/att/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E8%87%AA%E5%8A%A8%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) +2. [(推荐)自动迁移工具](https://gitee.com/ascend/mstt/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E8%87%AA%E5%8A%A8%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) 自动迁移只需在训练脚本中导入库代码即可完成模型脚本迁移,使用方式较简单,且修改内容最少。 -3. [脚本迁移工具](https://gitee.com/ascend/att/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E8%84%9A%E6%9C%AC%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) +3. [脚本迁移工具](https://gitee.com/ascend/mstt/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E8%84%9A%E6%9C%AC%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) 脚本迁移工具提供后端命令行用于将GPU上训练的PyTorch脚本迁移至NPU上,得到新的训练脚本用于训练。 -4. [训推一体权重转换工具](https://gitee.com/Ascend/att/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E8%AE%AD%E6%8E%A8%E4%B8%80%E4%BD%93%E6%9D%83%E9%87%8D%E8%BD%AC%E6%8D%A2%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) +4. 
[训推一体权重转换工具](https://gitee.com/Ascend/mstt/wikis/%E5%B7%A5%E5%85%B7%E4%BB%8B%E7%BB%8D/%E5%88%86%E6%9E%90%E8%BF%81%E7%A7%BB%E5%B7%A5%E5%85%B7/%E8%AE%AD%E6%8E%A8%E4%B8%80%E4%BD%93%E6%9D%83%E9%87%8D%E8%BD%AC%E6%8D%A2%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8%E6%8C%87%E5%AF%BC) 训推一体权重转换工具,支持在GPU和NPU上训练好的模型转成加速推理支持的格式。 -### [精度工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools) +### [精度工具](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools) -1. [api_accuracy_checker(Ascend模型精度预检工具)](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker) +1. [api_accuracy_checker(Ascend模型精度预检工具)](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/api_accuracy_checker) 在昇腾NPU上扫描用户训练模型中所有API,进行API复现,给出精度情况的诊断和分析。 -2. [ptdbg_ascend(PyTorch精度工具)](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend) +2. [ptdbg_ascend(PyTorch精度工具)](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/ptdbg_ascend) 进行PyTorch整网API粒度的数据dump、精度比对和溢出检测,从而定位PyTorch训练场景下的精度问题。 -### [性能工具](https://gitee.com/ascend/att/tree/master/profiler) +### [性能工具](https://gitee.com/ascend/mstt/tree/master/profiler) -1. [compare_tools(性能比对工具)](https://gitee.com/ascend/att/tree/master/profiler/compare_tools) +1. [compare_tools(性能比对工具)](https://gitee.com/ascend/mstt/tree/master/profiler/compare_tools) 提供NPU与GPU性能拆解功能以及算子、通信、内存性能的比对功能。 -2. [cluster_analyse(集群分析工具)](https://gitee.com/ascend/att/tree/master/profiler/cluster_analyse) +2. [cluster_analyse(集群分析工具)](https://gitee.com/ascend/mstt/tree/master/profiler/cluster_analyse) - 提供多机多卡的集群分析能力(基于通信域的通信分析和迭代耗时分析), 当前需要配合Ascend Insight的集群分析功能使用。 + 提供多机多卡的集群分析能力(基于通信域的通信分析和迭代耗时分析), 当前需要配合MindStudio Insight的集群分析功能使用。 -3. [affinity_cpu_bind (亲和性cpu绑核工具) ](https://gitee.com/ascend/att/tree/master/profiler/affinity_cpu_bind) +3. [affinity_cpu_bind (亲和性cpu绑核工具) ](https://gitee.com/ascend/mstt/tree/master/profiler/affinity_cpu_bind) 提供亲和性CPU绑核能力,改善host_bound调度问题。 -### [Tensorboard](https://gitee.com/ascend/att/tree/master/plugins/tensorboard-plugins/tb_plugin) +### [Tensorboard](https://gitee.com/ascend/mstt/tree/master/plugins/tensorboard-plugins/tb_plugin) Tensorboard支持NPU性能数据可视化插件PyTorch Profiler TensorBoard NPU Plugin。 @@ -93,4 +93,4 @@ mstt仓每年发布4个版本,每个版本都将对应一个分支;以v6.0 ## 版本过渡提示 -当前版本预检和ptdbg维护到2024/09/30,准备于2024/09/30下线,相关目录att/debug/accuracy_tools/api_accuracy_checker和att/debug/accuracy_tools/ptdbg_ascend将于2024/09/30删除。新版本的预检和ptdbg已经合到att/debug/accuracy_tools/atat目录下。 +当前版本预检和ptdbg维护到2024/09/30,准备于2024/09/30下线,相关目录mstt/debug/accuracy_tools/api_accuracy_checker和mstt/debug/accuracy_tools/ptdbg_ascend将于2024/09/30删除。新版本的预检和ptdbg已经合到mstt/debug/accuracy_tools/atat目录下。 diff --git a/debug/accuracy_tools/atat/__init__.py b/debug/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/__init__.py rename to debug/__init__.py diff --git a/debug/accuracy_tools/LICENSE b/debug/accuracy_tools/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..261eeb9e9f8b2b4b0d119366dda99c6fd7d35c64 --- /dev/null +++ b/debug/accuracy_tools/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. 
+ + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. 
This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/debug/accuracy_tools/MANIFEST.in b/debug/accuracy_tools/MANIFEST.in index a0aeb46bbc6b35a3ecb0015e137fadf595a65c4d..7997215ffdb2071277645bf47c520db304b1bd98 100644 --- a/debug/accuracy_tools/MANIFEST.in +++ b/debug/accuracy_tools/MANIFEST.in @@ -1,5 +1,5 @@ -recursive-include ptdbg_ascend/src/python/ptdbg_ascend/ *.py -recursive-include ptdbg_ascend/src/python/ptdbg_ascend/ *.yaml -recursive-include ptdbg_ascend/src/python/ptdbg_ascend/ *.template -recursive-include atat/ * -recursive-exclude api_accuracy_checker/test * +include README.md +include LICENSE +recursive-include msprobe * +recursive-exclude msprobe/test * + diff --git a/debug/accuracy_tools/README.md b/debug/accuracy_tools/README.md index 0d4ea25e3e85bbad3c5c091449630f62dfdf842c..962736908fed2a2fecc2b68ac723ad7516a7a678 100644 --- a/debug/accuracy_tools/README.md +++ b/debug/accuracy_tools/README.md @@ -8,13 +8,13 @@ NPU上训练的网络存在精度问题,精度指标(loss或者具体的评 | 工具名称 | 说明 | | ------------------------------------------------------------ | ------------------------------------------------------------ | -| [api_accuracy_checker(Ascend模型精度预检工具)](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker) | 在昇腾NPU上扫描用户训练模型中所有API,进行API复现,给出精度情况的诊断和分析。 | -| [ptdbg_ascend(PyTorch精度工具)](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend) | 进行PyTorch整网API粒度的数据dump、精度比对和溢出检测,从而定位PyTorch训练场景下的精度问题。 | +| [api_accuracy_checker(Ascend模型精度预检工具)](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/api_accuracy_checker) | 在昇腾NPU上扫描用户训练模型中所有API,进行API复现,给出精度情况的诊断和分析。 | +| [ptdbg_ascend(PyTorch精度工具)](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/ptdbg_ascend) | 进行PyTorch整网API粒度的数据dump、精度比对和溢出检测,从而定位PyTorch训练场景下的精度问题。 | ### 场景介绍 -**Ascend模型精度预检工具**会对全网每一个API根据其实际训练中的shape、dtype和数值范围生成随机的输入,对比它与标杆的输出差异,并指出输出差异过大不符合精度标准的API。该工具检查单API精度问题准确率超过80%,对比一般dump比对方法减少落盘数据量99%以上。具体使用请参见《[Ascend模型精度预检工具](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/api_accuracy_checker/README.md)》 +**Ascend模型精度预检工具**会对全网每一个API根据其实际训练中的shape、dtype和数值范围生成随机的输入,对比它与标杆的输出差异,并指出输出差异过大不符合精度标准的API。该工具检查单API精度问题准确率超过80%,对比一般dump比对方法减少落盘数据量99%以上。具体使用请参见《[Ascend模型精度预检工具](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/api_accuracy_checker/README.md)》 -**PyTorch精度工具精度比对功能**可以对NPU整网API数据进行与CPU或GPU标杆数据的精度比对,从而检测精度问题。具体来说,dump统计量、分段dump、模块化dump,通讯算子dump等功能可以用较轻的数据量实现不同侧重的精度比对,从而定位精度问题。具体使用请参见《[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)》。 +**PyTorch精度工具精度比对功能**可以对NPU整网API数据进行与CPU或GPU标杆数据的精度比对,从而检测精度问题。具体来说,dump统计量、分段dump、模块化dump,通讯算子dump等功能可以用较轻的数据量实现不同侧重的精度比对,从而定位精度问题。具体使用请参见《[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)》。 -**PyTorch精度工具溢出检测功能**是在判断训练网络可能存在溢出现象时,例如某个step的loss突然变成inf nan,或者混精场景下loss_scale不断减小,可以通过ptdbg_ascend的精度检测工具检测API的溢出情况。具体使用请参见《[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)》。 \ No newline at end of file +**PyTorch精度工具溢出检测功能**是在判断训练网络可能存在溢出现象时,例如某个step的loss突然变成inf nan,或者混精场景下loss_scale不断减小,可以通过ptdbg_ascend的精度检测工具检测API的溢出情况。具体使用请参见《[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)》。 \ No newline at end of file diff --git a/debug/accuracy_tools/README_POC.md b/debug/accuracy_tools/README_POC.md deleted file mode 100644 index 00b4818832ca26bf49c7144df2ed706ce4e2c150..0000000000000000000000000000000000000000 --- 
a/debug/accuracy_tools/README_POC.md +++ /dev/null @@ -1,69 +0,0 @@ -# 精度工具 - -本手册主要介绍精度预检工具和ptdbg_ascend精度工具合一软件包的安装和工具命令行使用指导。 - -## 工具安装 - -精度工具合一软件包名称:`ascend_training_accuracy_tools-{version}-py3-none-any.whl` - -1. whl包获取。 - - 请通过下表链接下载ptdbg_ascend精度工具whl包。 - - | 版本 | 发布日期 | 支持PyTorch版本 | 下载链接 | 校验码 | - | ----- | ---------- | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | - | 0.0.1 | 2024-03-15 | 1.11.0/2.0/2.1 | [ascend_training_accuracy_tools-0.0.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.1-py3-none-any.whl) | 5801510d4e827e4859bc9a5aca021e4d30c2ea42d60a4c8ad0c2baab1b7782c9 | - -2. whl包校验。 - - 1. 根据以上下载链接下载whl包到Linux安装环境。 - - 2. 进入whl包所在目录,执行如下命令。 - - ```bash - sha256sum {name}.whl - ``` - - {name}为whl包名称。 - - 若回显呈现对应版本whl包一致的**校验码**,则表示下载了正确的ptdbg_ascend精度工具whl安装包。示例如下: - - ```bash - sha256sum ascend_training_accuracy_tools-0.0.1-py3-none-any.whl - 5801510d4e827e4859bc9a5aca021e4d30c2ea42d60a4c8ad0c2baab1b7782c9 *ascend_training_accuracy_tools-0.0.1-py3-none-any.whl - ``` - -3. 执行如下命令进行安装。 - - ```bash - pip3 install ./ascend_training_accuracy_tools-{version}-py3-none-any.whl - ``` - - 若为覆盖安装,请在命令行末尾增加“--force-reinstall”参数强制安装,例如: - - ```bash - pip3 install ./ascend_training_accuracy_tools-{version}-py3-none-any.whl --force-reinstall - ``` - - 提示如下信息则表示安装成功。 - - ```bash - Successfully installed ascend_training_accuracy_tools-{version} - ``` - - -## 工具使用 - -安装精度工具合一软件包后,精度工具支持使用命令行启动各种功能(除ptdbg_ascend工具的dump和精度比对操作)。命令格式如下: - -```bash -atat [-h] parse run_ut multi_run_ut benchmark_compare run_overflow_check -``` - -| 参数 | 说明 | -| ------------------ | ------------------------------------------------------------ | -| parse | ptdbg_ascend.parse数据解析功能入口,执行atat parse命令后进入parse交互式界面,更多参数请参见《[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)》的“ptdbg_ascend.parse数据解析功能”。 | -| run_ut | 预检工具run_ut功能,可以通过atat run_ut命令执行精度预检操作,更多参数请参见《[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)》的“执行预检”。 | -| multi_run_ut | 预检工具multi_run_ut功能,可以通过atat multi_run_ut命令执行多线程预检操作,更多参数请参见《[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)》的“multi_run_ut多线程预检”。 | -| benchmark_compare | 预检工具预检结果比对功能,可以通过atat benchmark_compare命令执行预检结果比对操作,更多参数请参见《[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)》的“multi_run_ut多线程预检”。 | -| run_overflow_check | 溢出解析工具,可以通过atat run_overflow_check命令执行溢出API解析操作,更多参数请参见《[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)》的“溢出解析工具”。 | \ No newline at end of file diff --git a/debug/accuracy_tools/api_accuracy_checker/README.md b/debug/accuracy_tools/api_accuracy_checker/README.md index d8d4b531212e8758a59e2119670219a559ec5d85..2d3c01831bb8b17f964ece5ebca044010eb499e1 100644 --- a/debug/accuracy_tools/api_accuracy_checker/README.md +++ b/debug/accuracy_tools/api_accuracy_checker/README.md @@ -2,7 +2,7 @@ ## 版本过渡提示 -当前版本预检维护到2024/09/30,准备于2024/09/30下线,相关目录att/debug/accuracy_tools/api_accuracy_checker将于2024/09/30删除。新版本的预检已经合到att/debug/accuracy_tools/atat目录下。 +当前版本预检维护到2024/09/30,准备于2024/09/30下线,相关目录mstt/debug/accuracy_tools/api_accuracy_checker将于2024/09/30删除。新版本的预检已经合到mstt/debug/accuracy_tools/msprobe目录下。 
Ascend模型精度预检工具能在昇腾NPU上扫描用户训练模型中所有API,输出精度情况的诊断和分析。工具通过dump模型中所有的API前反向信息;构造相应的API单元测试,将NPU输出与标杆(CPU高精度)比对,从而计算对应的精度指标,该过程称为run_ut;将NPU环境下dump的预检数据拷贝至GPU环境,同样执行run_ut;最后通过新精度标准比对法将NPU和GPU的预检结果进行比对,从而找出NPU中存在精度问题的API。 @@ -37,11 +37,9 @@ Ascend模型精度预检工具能在昇腾NPU上扫描用户训练模型中所 export PYTHONPATH=$PYTHONPATH:$MSTT_HOME/debug/accuracy_tools/ ``` -2. 安装依赖。 +2. 使用pip命令安装einops、numpy、pandas、PyYAML、rich、torch、tqdm、Twisted、openpyxl依赖。 - ```bash - pip3 install tqdm rich pyyaml pandas einops - ``` + 若环境中已安装部分依赖,不需要重复安装。 ## 预检操作 @@ -75,7 +73,7 @@ Ascend模型精度预检工具能在昇腾NPU上扫描用户训练模型中所 首先,需要开启torch.utils.data.dataloader加载数据,操作如下: ```bash - cd att/debug/accuracy_tools/api_accuracy_checker + cd mstt/debug/accuracy_tools/api_accuracy_checker vi config.yaml # 修改enable_dataloader参数值为True ``` @@ -201,7 +199,7 @@ run_ut预检操作包括如下场景: python run_ut.py -forward ./forward_info_0.json -backward ./backward_info_0.json -save_error_data ``` - 数据默认会存盘到'./ut_error_data{timestamp}'路径下(相对于启动run_ut的路径),有需要的话,用户可以通过修改att/debug/accuracy_tools/api_accuracy_checker目录下,config.yaml文件的error_data_path参数来配置保存路径,详见“**config.yaml文件说明**”。。 + 数据默认会存盘到'./ut_error_data{timestamp}'路径下(相对于启动run_ut的路径),有需要的话,用户可以通过修改mstt/debug/accuracy_tools/api_accuracy_checker目录下,config.yaml文件的error_data_path参数来配置保存路径,详见“**config.yaml文件说明**”。。 3. (可选)如果dump的数据为真实数据,那么需要指定真实数据路径,例如: @@ -251,13 +249,13 @@ python run_ut.py -forward ./forward_info_0.json -backward ./backward_info_0.json run_ut过程同样支持API预检白名单,操作方式如下: -修改att/debug/accuracy_tools/api_accuracy_checker目录下config.yaml文件的white_list参数,配置需要预检的API名称,详见“config.yaml文件说明”。 +修改mstt/debug/accuracy_tools/api_accuracy_checker目录下config.yaml文件的white_list参数,配置需要预检的API名称,详见“config.yaml文件说明”。 ### config.yaml文件说明 config.yaml文件可以通过配置参数来控制dump和run_ut操作的真实数据模式以及白名单等功能。 -文件路径为:att/debug/accuracy_tools/api_accuracy_checker/config.yaml +文件路径为:mstt/debug/accuracy_tools/api_accuracy_checker/config.yaml | 参数名称 | 说明 | 是否必选 | | ----------------- | ------------------------------------------------------------ | -------- | @@ -419,7 +417,7 @@ Forward Test Success和Backward Test Success是否通过测试是由`api_precisi # 溢出解析工具 -针对训练过程中的溢出检测场景(参见[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)中的"溢出检测场景"进行溢出检测dump),对于输入正常但输出存在溢出的API,会在训练执行目录下将溢出的API信息按照前向和反向分类,dump并保存为`forward_info_{pid}.json`,前向过程溢出的API可通过该工具对`forward_info_{pid}.json`进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 +针对训练过程中的溢出检测场景(参见[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)中的"溢出检测场景"进行溢出检测dump),对于输入正常但输出存在溢出的API,会在训练执行目录下将溢出的API信息按照前向和反向分类,dump并保存为`forward_info_{pid}.json`,前向过程溢出的API可通过该工具对`forward_info_{pid}.json`进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 工具支持PyTorch版本:1.11.0/2.0/2.1/2.2。 @@ -433,15 +431,13 @@ Forward Test Success和Backward Test Success是否通过测试是由`api_precisi export PYTHONPATH=$PYTHONPATH:$MSTT_HOME/debug/accuracy_tools/ ``` - 安装依赖: + 使用pip命令安装einops、numpy、pandas、PyYAML、rich、torch、tqdm、Twisted依赖。 - ```bash - pip3 install tqdm rich pyyaml pandas einops - ``` + 若环境中已安装部分依赖,不需要重复安装。 2. 
执行溢出API解析操作 - **forward_info_0.json为[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)中的"溢出检测场景"执行溢出检测dump时生成,而不是精度预检工具生成。** + **forward_info_0.json为[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)中的"溢出检测场景"执行溢出检测dump时生成,而不是精度预检工具生成。** ```bash cd $MSTT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut diff --git a/debug/accuracy_tools/api_accuracy_checker/api_accuracy_checker_online.md b/debug/accuracy_tools/api_accuracy_checker/api_accuracy_checker_online.md index 404d7f7e31764577b75b62d9c8e487a15c877570..9fd985751513dea127a6397dd81c0803bd948b6f 100644 --- a/debug/accuracy_tools/api_accuracy_checker/api_accuracy_checker_online.md +++ b/debug/accuracy_tools/api_accuracy_checker/api_accuracy_checker_online.md @@ -21,7 +21,7 @@ 安装完成预检工具后,需要分别在GPU和NPU环境下配置config.yaml文件。其中需要重点关注文件中的is_online、is_benchmark_device、host和port参数的配置,保障在线预检时GPU和NPU两台设备间的通信正常。 -文件路径为:att/debug/accuracy_tools/api_accuracy_checker/config.yaml +文件路径为:mstt/debug/accuracy_tools/api_accuracy_checker/config.yaml | 参数名称 | 说明 | 是否必选 | | ------------------- | ------------------------------------------------------------ | -------- | diff --git a/debug/accuracy_tools/api_accuracy_checker/common/utils.py b/debug/accuracy_tools/api_accuracy_checker/common/utils.py index 5af80a1fff26f683adaf88b78c187c290821231f..76d117afb491a887c93d0d34d872be48ef3aa166 100644 --- a/debug/accuracy_tools/api_accuracy_checker/common/utils.py +++ b/debug/accuracy_tools/api_accuracy_checker/common/utils.py @@ -108,7 +108,7 @@ class Const: VERSION_MESSAGE = """The current version of api_precision_checker will be deprecated on September 30, 2024. The att/debug/accuracy_tools/api_accuracy_checker directory will be deleted on September 30, 2024. 
- Please use the api_precision_checker in the att/debug/accuracy_tools/atat directory.""" + Please use the api_precision_checker in the att/debug/accuracy_tools/msprobe directory.""" class CompareConst: diff --git a/debug/accuracy_tools/api_accuracy_checker/compare/compare.py b/debug/accuracy_tools/api_accuracy_checker/compare/compare.py index c7b175cd4913e2072acca36d53c2952bb0eadce0..1b796359045f24e5ce1ce620c1002b5e7dcff782 100644 --- a/debug/accuracy_tools/api_accuracy_checker/compare/compare.py +++ b/debug/accuracy_tools/api_accuracy_checker/compare/compare.py @@ -1,13 +1,10 @@ # 进行比对及结果展示 import os -import csv from collections import namedtuple import torch import numpy as np -from rich.table import Table -from rich.console import Console -from api_accuracy_checker.common.utils import get_json_contents, write_csv, print_warn_log, Const +from api_accuracy_checker.common.utils import get_json_contents, write_csv, print_info_log, Const from api_accuracy_checker.compare.compare_utils import CompareConst, check_dtype_comparable, DETAIL_TEST_ROWS, \ precision_configs, BENCHMARK_COMPARE_SUPPORT_LIST, AbsoluteStandardApi, BinaryStandardApi, ULPStandardApi, \ ThousandthStandardApi, apis_threshold @@ -17,7 +14,6 @@ from api_accuracy_checker.compare.algorithm import get_rmse, get_error_balance, get_small_value_err_ratio, get_finite_and_infinite_mask, get_small_value_mask, check_inf_nan_value, \ check_small_value, check_norm_value, get_abs_bench_with_eps, get_ulp_err from api_accuracy_checker.common.config import msCheckerConfig -from ptdbg_ascend.src.python.ptdbg_ascend.common.file_check_util import FileOpen ResultInfo = namedtuple('ResultInfo', ['full_api_name', 'fwd_success_status', 'bwd_success_status', @@ -49,83 +45,13 @@ class Comparator: else: self.stack_info = None - self.test_result_cnt = { - "success_num": 0, "warning_num": 0, "error_num": 0, - "forward_fail_num": 0, "backward_fail_num": 0, "forward_and_backward_fail_num": 0, - "total_num": 0, "total_skip_num": 0 - } - @staticmethod def get_path_from_rank(rank, path_list, path_pattern): return path_list[-1] if len(path_list) == 1 else path_pattern.format(rank) - def print_pretest_result(self): - for save_path in self.save_path_list: - self.get_statistics_from_result_csv(save_path) - total_tests = self.test_result_cnt.get("total_num", 0) - if total_tests != 0: - passing_rate = "{:.2%}".format(self.test_result_cnt.get("success_num", 0) / total_tests) - else: - passing_rate = "0%" - - print_warn_log("The follwing tables will be deprecated in the future." 
- "The following results are for reference only.") - console = Console() - table_total = Table( - show_header=True, title="Overall Statistics", show_lines=True, width=75 - ) - table_total.add_column("Result") - table_total.add_column("Statistics") - table_total.add_row("[green]Pass[/green]", str(self.test_result_cnt.get("success_num", 0))) - table_total.add_row("[yellow]Warning[/yellow]", str(self.test_result_cnt.get("warning_num", 0))) - table_total.add_row("[red]Error[/red]", str(self.test_result_cnt.get("error_num", 0))) - table_total.add_row("Passing Rate", passing_rate) - table_total.add_row("Skip Tests", str(self.test_result_cnt.get("total_skip_num", 0))) - - table_detail = Table( - show_header=True, title="Detail Statistics", show_lines=True, width=75 - ) - table_detail.add_column("Result") - table_detail.add_column("Statistics") - table_detail.add_row("Forward Error", str(self.test_result_cnt.get("forward_fail_num", 0))) - table_detail.add_row("Backward Error", str(self.test_result_cnt.get("backward_fail_num", 0))) - table_detail.add_row("Both Forward & Backward Error", str(self.test_result_cnt.get("forward_and_backward_fail_num", 0))) - - console.print(table_total) - console.print(table_detail) - - def get_statistics_from_result_csv(self, save_path): - checklist = [CompareConst.PASS, CompareConst.ERROR, CompareConst.WARNING, CompareConst.SPACE, CompareConst.SKIP, "skip"] - with FileOpen(save_path, 'r') as file: - reader = csv.reader(file) - result_csv_rows = [row for row in reader] - result_csv_name = os.path.basename(save_path) - for item in result_csv_rows[1:]: - if not isinstance(item, list) or len(item) < 3: - raise ValueError("The number of columns in %s is incorrect" % result_csv_name) - if not all(item[i] and item[i] in checklist for i in (1, 2)): - raise ValueError( - "The value in the 2nd or 3rd column of %s is wrong, it must be pass, error, warning, skip, or SPACE" - % result_csv_name) - column1 = item[1] - column2 = item[2] - if column1.upper() == CompareConst.SKIP: - self.test_result_cnt["total_skip_num"] += 1 - continue - self.test_result_cnt["total_num"] += 1 - if column1 == CompareConst.PASS and column2 in [CompareConst.PASS, CompareConst.SPACE, CompareConst.SKIP]: - self.test_result_cnt['success_num'] += 1 - elif column1 == CompareConst.ERROR and column2 == CompareConst.ERROR: - self.test_result_cnt['forward_and_backward_fail_num'] += 1 - self.test_result_cnt['error_num'] += 1 - elif column1 == CompareConst.ERROR: - self.test_result_cnt['forward_fail_num'] += 1 - self.test_result_cnt['error_num'] += 1 - elif column2 == CompareConst.ERROR: - self.test_result_cnt['backward_fail_num'] += 1 - self.test_result_cnt['error_num'] += 1 - elif column1 == CompareConst.WARNING or column2 == CompareConst.WARNING: - self.test_result_cnt['warning_num'] += 1 + @staticmethod + def print_pretest_result(): + print_info_log("Successfully completed run_ut/multi_run_ut.") def write_csv_title(self): summary_test_rows = [[self.COLUMN_API_NAME, self.COLUMN_FORWARD_SUCCESS, diff --git a/debug/accuracy_tools/atat/core/file_check_util.py b/debug/accuracy_tools/atat/core/file_check_util.py deleted file mode 100644 index b10cdd61049ad9a87e91d910e89b121557a58a7f..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/core/file_check_util.py +++ /dev/null @@ -1,319 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved. 
-# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -import os -import re - -from .log import print_warn_log, print_error_log - - -class FileCheckConst: - """ - Class for file check const - """ - READ_ABLE = "read" - WRITE_ABLE = "write" - READ_WRITE_ABLE = "read and write" - DIRECTORY_LENGTH = 4096 - FILE_NAME_LENGTH = 255 - FILE_VALID_PATTERN = r"^[a-zA-Z0-9_.:/-]+$" - PKL_SUFFIX = ".pkl" - NUMPY_SUFFIX = ".npy" - JSON_SUFFIX = ".json" - PT_SUFFIX = ".pt" - CSV_SUFFIX = ".csv" - YAML_SUFFIX = ".yaml" - MAX_PKL_SIZE = 1 * 1024 * 1024 * 1024 - MAX_NUMPY_SIZE = 10 * 1024 * 1024 * 1024 - MAX_JSON_SIZE = 1 * 1024 * 1024 * 1024 - MAX_PT_SIZE = 10 * 1024 * 1024 * 1024 - MAX_CSV_SIZE = 1 * 1024 * 1024 * 1024 - MAX_YAML_SIZE = 10 * 1024 * 1024 - DIR = "dir" - FILE = "file" - DATA_DIR_AUTHORITY = 0o750 - DATA_FILE_AUTHORITY = 0o640 - FILE_SIZE_DICT = { - PKL_SUFFIX: MAX_PKL_SIZE, - NUMPY_SUFFIX: MAX_NUMPY_SIZE, - JSON_SUFFIX: MAX_JSON_SIZE, - PT_SUFFIX: MAX_PT_SIZE, - CSV_SUFFIX: MAX_CSV_SIZE, - YAML_SUFFIX: MAX_YAML_SIZE - } - - -class FileCheckException(Exception): - """ - Class for File Check Exception - """ - NONE_ERROR = 0 - INVALID_PATH_ERROR = 1 - INVALID_FILE_TYPE_ERROR = 2 - INVALID_PARAM_ERROR = 3 - INVALID_PERMISSION_ERROR = 3 - - def __init__(self, code, error_info: str = ""): - super(FileCheckException, self).__init__() - self.code = code - self.error_info = error_info - - def __str__(self): - return self.error_info - - -class FileChecker: - """ - The class for check file. - - Attributes: - file_path: The file or dictionary path to be verified. 
- path_type: file or dictionary - ability(str): FileCheckConst.WRITE_ABLE or FileCheckConst.READ_ABLE to set file has writability or readability - file_type(str): The correct file type for file - """ - def __init__(self, file_path, path_type, ability=None, file_type=None, is_script=True): - self.file_path = file_path - self.path_type = self._check_path_type(path_type) - self.ability = ability - self.file_type = file_type - self.is_script = is_script - - @staticmethod - def _check_path_type(path_type): - if path_type not in [FileCheckConst.DIR, FileCheckConst.FILE]: - print_error_log(f'The path_type must be {FileCheckConst.DIR} or {FileCheckConst.FILE}.') - raise FileCheckException(FileCheckException.INVALID_PARAM_ERROR) - return path_type - - def common_check(self): - """ - 功能:用户校验基本文件权限:软连接、文件长度、是否存在、读写权限、文件属组、文件特殊字符 - 注意:文件后缀的合法性,非通用操作,可使用其他独立接口实现 - """ - check_path_exists(self.file_path) - check_link(self.file_path) - self.file_path = os.path.realpath(self.file_path) - check_path_length(self.file_path) - check_path_type(self.file_path, self.path_type) - self.check_path_ability() - if self.is_script: - check_path_owner_consistent(self.file_path) - check_path_pattern_vaild(self.file_path) - check_common_file_size(self.file_path) - check_file_suffix(self.file_path, self.file_type) - return self.file_path - - def check_path_ability(self): - if self.ability == FileCheckConst.WRITE_ABLE: - check_path_writability(self.file_path) - if self.ability == FileCheckConst.READ_ABLE: - check_path_readability(self.file_path) - if self.ability == FileCheckConst.READ_WRITE_ABLE: - check_path_readability(self.file_path) - check_path_writability(self.file_path) - - -class FileOpen: - """ - The class for open file by a safe way. - - Attributes: - file_path: The file or dictionary path to be opened. 
- mode(str): The file open mode - """ - SUPPORT_READ_MODE = ["r", "rb"] - SUPPORT_WRITE_MODE = ["w", "wb", "a", "ab"] - SUPPORT_READ_WRITE_MODE = ["r+", "rb+", "w+", "wb+", "a+", "ab+"] - - def __init__(self, file_path, mode, encoding='utf-8'): - self.file_path = file_path - self.mode = mode - self.encoding = encoding - self._handle = None - - def __enter__(self): - self.check_file_path() - binary_mode = "b" - if binary_mode not in self.mode: - self._handle = open(self.file_path, self.mode, encoding=self.encoding) - else: - self._handle = open(self.file_path, self.mode) - return self._handle - - def __exit__(self, exc_type, exc_val, exc_tb): - if self._handle: - self._handle.close() - - def check_file_path(self): - support_mode = self.SUPPORT_READ_MODE + self.SUPPORT_WRITE_MODE + self.SUPPORT_READ_WRITE_MODE - if self.mode not in support_mode: - print_error_log("File open not support %s mode" % self.mode) - raise FileCheckException(FileCheckException.INVALID_PARAM_ERROR) - check_link(self.file_path) - self.file_path = os.path.realpath(self.file_path) - check_path_length(self.file_path) - self.check_ability_and_owner() - check_path_pattern_vaild(self.file_path) - if os.path.exists(self.file_path): - check_common_file_size(self.file_path) - - def check_ability_and_owner(self): - if self.mode in self.SUPPORT_READ_MODE: - check_path_exists(self.file_path) - check_path_readability(self.file_path) - check_path_owner_consistent(self.file_path) - if self.mode in self.SUPPORT_WRITE_MODE and os.path.exists(self.file_path): - check_path_writability(self.file_path) - check_path_owner_consistent(self.file_path) - if self.mode in self.SUPPORT_READ_WRITE_MODE and os.path.exists(self.file_path): - check_path_readability(self.file_path) - check_path_writability(self.file_path) - check_path_owner_consistent(self.file_path) - - -def check_link(path): - abs_path = os.path.abspath(path) - if os.path.islink(abs_path): - print_error_log('The file path {} is a soft link.'.format(path)) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_path_length(path, name_length=None): - file_max_name_length = name_length if name_length else FileCheckConst.FILE_NAME_LENGTH - if len(path) > FileCheckConst.DIRECTORY_LENGTH or \ - len(os.path.basename(path)) > file_max_name_length: - print_error_log('The file path length exceeds limit.') - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_path_exists(path): - if not os.path.exists(path): - print_error_log('The file path %s does not exist.' % path) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_path_readability(path): - if not os.access(path, os.R_OK): - print_error_log('The file path %s is not readable.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_path_writability(path): - if not os.access(path, os.W_OK): - print_error_log('The file path %s is not writable.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_path_executable(path): - if not os.access(path, os.X_OK): - print_error_log('The file path %s is not executable.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_other_user_writable(path): - st = os.stat(path) - if st.st_mode & 0o002: - _user_interactive_confirm( - 'The file path %s may be insecure because other users have write permissions. ' - 'Do you want to continue?' 
% path) - - -def _user_interactive_confirm(message): - while True: - check_message = input(message + " Enter 'c' to continue or enter 'e' to exit: ") - if check_message == "c": - break - elif check_message == "e": - print_warn_log("User canceled.") - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - else: - print("Input is error, please enter 'c' or 'e'.") - - -def check_path_owner_consistent(path): - file_owner = os.stat(path).st_uid - if file_owner != os.getuid(): - print_error_log('The file path %s may be insecure because is does not belong to you.' % path) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) - - -def check_path_pattern_vaild(path): - if not re.match(FileCheckConst.FILE_VALID_PATTERN, path): - print_error_log('The file path {} contains special characters.'.format(path)) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) - - -def check_file_size(file_path, max_size): - file_size = os.path.getsize(file_path) - if file_size >= max_size: - _user_interactive_confirm(f'The size of file path {file_path} exceeds {max_size} bytes.' - f'Do you want to continue?') - - -def check_common_file_size(file_path): - if os.path.isfile(file_path): - for suffix, max_size in FileCheckConst.FILE_SIZE_DICT.items(): - if file_path.endswith(suffix): - check_file_size(file_path, max_size) - break - - -def check_file_suffix(file_path, file_suffix): - if file_suffix: - if not file_path.endswith(file_suffix): - print_error_log(f"The {file_path} should be a {file_suffix} file!") - raise FileCheckException(FileCheckException.INVALID_FILE_TYPE_ERROR) - - -def check_path_type(file_path, file_type): - if file_type == FileCheckConst.FILE: - if not os.path.isfile(file_path): - print_error_log(f"The {file_path} should be a file!") - raise FileCheckException(FileCheckException.INVALID_FILE_TYPE_ERROR) - if file_type == FileCheckConst.DIR: - if not os.path.isdir(file_path): - print_error_log(f"The {file_path} should be a dictionary!") - raise FileCheckException(FileCheckException.INVALID_FILE_TYPE_ERROR) - - -def create_directory(dir_path): - """ - Function Description: - creating a directory with specified permissions - Parameter: - dir_path: directory path - Exception Description: - when invalid data throw exception - """ - dir_path = os.path.realpath(dir_path) - try: - os.makedirs(dir_path, mode=FileCheckConst.DATA_DIR_AUTHORITY, exist_ok=True) - except OSError as ex: - print_error_log( - 'Failed to create {}.Please check the path permission or disk space .{}'.format(dir_path, str(ex))) - raise FileCheckException(FileCheckException.INVALID_PATH_ERROR) from ex - - -def change_mode(path, mode): - if not os.path.exists(path) or os.path.islink(path): - return - try: - os.chmod(path, mode) - except PermissionError as ex: - print_error_log('Failed to change {} authority. 
{}'.format(path, str(ex))) - raise FileCheckException(FileCheckException.INVALID_PERMISSION_ERROR) from ex - diff --git a/debug/accuracy_tools/atat/mindspore/__init__.py b/debug/accuracy_tools/atat/mindspore/__init__.py deleted file mode 100644 index bb3f93567542e93ff913edf3daabcd3aedb91ee3..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/mindspore/__init__.py +++ /dev/null @@ -1 +0,0 @@ -from atat.mindspore.debugger.precision_debugger import PrecisionDebugger diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/README.md b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/README.md deleted file mode 100644 index 7738501db87b1cacbc9eb96687bf09aed3a5ed68..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/README.md +++ /dev/null @@ -1,528 +0,0 @@ -# Ascend模型精度预检工具 - -Ascend模型精度预检工具能在昇腾NPU上扫描用户训练模型中所有API,输出精度情况的诊断和分析。工具会提取模型中所有的API前反向信息,构造相应的API单元测试,将NPU输出与标杆(CPU高精度)比对,从而检测出精度有问题的API;另外工具还可以通过新精度标准比对法,从而确认NPU和GPU各自运行时的精度哪一方更接近标杆(CPU高精度)。 - -**新精度标准比对法**:依据新精度标准,对不同的API采取不同的比对算法进行比对(包括绝对阈值法,标杆比对法和二进制一致法),最终给定预检判定结果。 - -**真实数据模式**:精度预检工具支持随机生成模式和真实数据模式,即在预检dump时可以选择由工具构造随机数进行输入获得dump数据或选择获取真实输入数据进行预检dump操作;随机生成模式执行效率高,可以快速获得结果,但数据精度低,只能大致判断精度问题;真实数据模式执行效率略低于随机生成模式,但是数据精度高,可以准确判断精度问题。 - -工具支持PyTorch版本:1.11.0/2.0/2.1。 - -## 工具特性 - -1. 落盘数据小。 -2. 不依赖标杆侧GPU训练资源,本地即可完成预检(新精度标准比对法除外)。 -3. 支持随机生成模式和真实数据模式。 -4. 单API测试,排除整网中的累计误差问题。 - -## 预检流程 - -精度预检可以分为:标准模式(直接进行NPU vs CPU高精度的预检比对操作)和新精度标准比对法(将NPU vs CPU高精度的预检比对结果和GPU vs CPU高精度的预检比对结果进行比对汇总),两种模式操作流程如下。 - -### 标准模式 - -1. 在NPU环境下安装预检工具。详见“**工具安装**”。 -2. 在NPU环境下dump预检数据。详见“**dump预检数据**”。 -3. NPU环境下执行run_ut。详见“**run_ut预检操作**”。 -4. 查看“**预检结果**”。 - -### 新精度标准比对法 - -1. 在NPU和GPU环境下分别安装预检工具。详见“**工具安装**”。 -2. 在NPU环境下dump预检数据(使用msCheckerConfig.update_config开启真实数据模式)。详见“**dump预检数据**”。 -3. 将NPU环境下dump的预检数据拷贝至GPU环境。 -4. 在NPU和GPU环境下分别执行run_ut。详见“**run_ut预检操作**”。 -5. 将NPU和GPU执行run_ut生成的`accuracy_checking_details_{timestamp}.csv`结果文件拷贝至同一环境下。 -6. 运行api_precision_compare.py。详见“**预检结果比对**”。 - -## 工具安装 - -1. 将att仓代码下载到本地,并配置环境变量。假设下载后att仓路径为 $ATT_HOME,环境变量应配置为: - - ```bash - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - ``` - -2. 安装依赖tqdm、rich、pyyaml、pandas - - ```bash - pip3 install tqdm rich pyyaml pandas - ``` - -## 预检操作 - -### dump预检数据 - -#### dump操作 - -在训练脚本(如main.py)中加入以下代码导入工具dump模块,启动训练即可自动抓取网络所有API信息。 - -- 若训练脚本中的代码不是通过torch.utils.data.dataloader来加载数据或在部分流水并行、张量并行场景下,工具的开关无法在每张卡上自动打开,导致多卡训练dump结果只有一组json,那么需要在训练代码中添加打开工具开关的调用。 - - 在训练代码中添加数据dump操作如下: - - ```Python - import api_accuracy_checker.dump as DP - - # 需要先修改enable_dataloader参数值为False - # 关闭torch.utils.data.dataloader加载数据时,下列代码须在训练step代码内添加 - DP.dump.start() # 开启工具dump模块 - - ... - - DP.dump.stop() # 控制dump结束 - DP.dump.step() # 在DP.dump.stop()后加入DP.dump.step()即可指定需要dump的step - ``` - - 上述代码要添加在迭代内,如对于[ModelLink](https://gitee.com/ascend/ModelLink)的LLAMA2-7B可以添加在training.py中train函数的iteration循环内。之后工具会适配这个场景开关的自动打开。 - -- 如果训练脚本是通过torch.utils.data.dataloader方式加载数据。 - - 首先,需要开启torch.utils.data.dataloader加载数据,操作如下: - - ```bash - cd att/debug/accuracy_tools/api_accuracy_checker - vi config.yaml - # 修改enable_dataloader参数值为True - ``` - - 其次,在训练脚本中加入以下代码导入工具dump模块,启动训练即可自动抓取网络所有API信息。 - - ```python - import api_accuracy_checker.dump - ``` - - 工具默认抓取训练的**第二个迭代**并且在第二个迭代后会报错退出训练进程,可通过target_iter参数配置。 - - **报错信息如下,这个报错仅用于停止训练,属于正常现象**: - - ```bash - Exception: Model pretest: exit after iteration 1. 
- ``` - - 若报错信息不一致,可能是由于服务器的其他错误信息覆盖导致,可以尝试查找报错信息中的Exception。 - -dump信息默认会存盘到“./step1”路径下(相对于启动训练的路径),包括: - -- forward_info_{pid}.json:前向API信息文件。 -- backward_info_{pid}.json:反向API信息文件。 -- stack_info_{pid}.json:调用栈信息文件。 - -forward_info与stack_info中的key值一一对应,用户可根据forward_info中API的key在stack_info中查询到其调用栈及代码行位置。 - -若有需要,用户可以通过msCheckerConfig.update_config来配置dump路径以及开启**真实数据模式**、指定dump某个step或配置**API dump白名单**,详见“**msCheckerConfig.update_config**”。 - -#### 真实数据模式 - -预检工具默认为随机数据模式,如果想要完全复刻整网的API运行情况,可以使用真实数据模式,添加以下代码即可: - -```python -from api_accuracy_checker.dump import msCheckerConfig -msCheckerConfig.update_config(real_data=True) -``` - -#### API dump白名单 - -精度预检工具可以对指定API进行预检操作,可以在dump时的训练脚本中直接添加白名单参数,只dump指定的API数据,示例代码如下: - -```python -from api_accuracy_checker.dump import msCheckerConfig -msCheckerConfig.update_config(white_list=["conv1d", "conv2d"]) -``` - -配置的API名称须存在于[support_wrap_ops.yaml](./hook_module/support_wrap_ops.yaml)文件下。 - -#### 工具支持的API列表 - -预检工具维护固定的API支持列表,若需要删除或增加dump的API,可以在[support_wrap_ops.yaml](./hook_module/support_wrap_ops.yaml)文件内手动修改,如下示例: - -```bash -functional: # functional为算子类别,找到对应的类别,在该类别下按照下列格式删除或添加API - - conv1d - - conv2d - - conv3d -``` - -#### msCheckerConfig.update_config - -**功能说明** - -配置精度预检dump时的属性。 - -可选配置。 - -**函数原型** - -```python -msCheckerConfig.update_config(dump_path="./", real_data=False, target_iter=[1], white_list=[], enable_dataloader=False) -``` - -**参数说明** - -| 参数名称 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| dump_path | 设置dump路径,默认为当前目录。若指定目录不存在,则自动创建。 | 否 | -| real_data | 真实数据模式,可取值True或False,默认为False,表示随机数据模式,配置为True后开启真实数据模式,dump信息增加forward_real_data和backward_real_data目录,目录下保存每个API输入的具体数值。 | 否 | -| target_iter | 指定dump某个step的数据,默认为[1],须指定为训练脚本中存在的step。target_iter为list格式,可配置逐个step,例如:target_iter=[0,1,2];也可以配置step范围,例如:target_iter=list(range(0,9)),表示dump第0到第8个step。 | 否 | -| white_list | API dump白名单,指定dump具体API数据,也可以直接配置预检的API白名单,详细请参见“**API预检白名单**”。参数示例:white_list=["conv1d", "conv2d"]。默认未配置白名单,即dump全量API数据。 | 否 | -| enable_dataloader | 自动dump数据开关,可取值True(开启)、False(关闭),默认关闭。 | 否 | - -### run_ut预检操作 - -完成“dump预检数据”后,仅仅获取了API的输入数据,为了得到NPU vs CPU高精度(标杆)的预检比对结果和GPU vs CPU高精度(标杆)的预检比对结果,还需要进行run_ut操作。 - -run_ut预检操作包括如下场景: - -- 使用run_ut.py执行预检:run_ut.py适用于数据量较小的单卡场景。 -- 使用multi_run_ut.py执行多线程预检:multi_run_ut.py适用于数据量较大的大模型场景。 - -#### 使用run_ut.py执行预检 - -1. 
将API信息输入给run_ut模块运行精度检测并比对,运行如下命令: - - ```bash - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_ut.py -forward ./forward_info_0.json -backward ./backward_info_0.json - ``` - - 某些场景下(如推理),可以不指定backward_info_0.json,不影响预检功能。 - - | 参数名称 | 说明 | 是否必选 | - | -------------------------------- | ------------------------------------------------------------ | ---------------------------------- | - | -forward或--forward_input_file | 指定前向API信息文件forward_info_{pid}.json。 | 是 | - | -backward或--backward_input_file | 指定反向API信息文件backward_info_{pid}.json。 | 否 | - | -save_error_data | 保存精度未达标的API输入输出数据。 | 否 | - | -o或--out_path | 指定run_ut执行结果存盘路径,默认“./”(相对于run_ut的路径)。 | 否 | - | -j或--jit_compile | 开启jit编译。 | 否 | - | -d或--device | 指定Device ID,选择UT代码运行所在的卡,默认值为0。 | 否 | - | -csv_path或--result_csv_path | 指定本次运行中断时生成的`accuracy_checking_result_{timestamp}.csv`文件路径,执行run_ut中断时,若想从中断处继续执行,配置此参数即可。需要指定为上次中断的`accuracy_checking_result_{timestamp}.csv`文件。详见“**断点续检**”。 | run_ut操作中断后继续执行场景下必选 | - | -real_data_path | 指定run_ut操作的真实数据路径。真实数据dump模式通过**msCheckerConfig.update_config**接口的real_data参数开启。指定绝对路径为forward_real_data和backward_real_data目录的父目录。 | dump的数据为真实数据下必选 | - | -f或--filter_api | 过滤模型中除最大值和最小值以外其他参数和结构相同的API。适用于模型较大且重复API较多的场景。 | 否 | - - run_ut执行结果包括`accuracy_checking_result_{timestamp}.csv`和`accuracy_checking_details_{timestamp}.csv`两个文件。`accuracy_checking_result_{timestamp}.csv`是API粒度的,标明每个API是否通过测试。建议用户先查看`accuracy_checking_result_{timestamp}.csv`文件,对于其中没有通过测试的或者特定感兴趣的API,根据其API name字段在`accuracy_checking_details_{timestamp}.csv`中查询其各个输出的达标情况以及比较指标。详细介绍请参见“**预检结果**”。 - -2. (可选)如果需要保存比对不达标的输入和输出数据,可以在run_ut执行命令结尾添加-save_error_data,例如: - - ```bash - python run_ut.py -forward ./forward_info_0.json -backward ./backward_info_0.json -save_error_data - ``` - - 数据默认会存盘到'./ut_error_data{timestamp}'路径下(相对于启动run_ut的路径),有需要的话,用户可以通过修改att/debug/accuracy_tools/api_accuracy_checker目录下,config.yaml文件的error_data_path参数来配置保存路径,详见“config.yaml文件说明”。。 - -3. 
(可选)如果dump的数据为真实数据,那么需要指定真实数据路径,例如: - - ```bash - python run_ut.py -forward ./forward_info_0.json -backward ./backward_info_0.json -real_data_path /home/xxx/ut/real_data - ``` - -#### 使用multi_run_ut.py执行多线程预检 - -multi_run_ut.py脚本,可以并行执行多个run_ut操作,从而降低预检耗时。 - -命令示例如下: - -```bash -cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut -python multi_run_ut.py -forward ./forward_info_0.json -backward ./backward_info_0.json -n 32 -d 0 1 2 3 -``` - -某些场景下(如推理),可以不指定backward_info_0.json,不影响预检功能。 - -| 参数名称 | 说明 | 是否必选 | -| -------------------------------- | ------------------------------------------------------------ | ---------------------------------- | -| -forward或--forward_input_file | 指定前向API信息文件forward_info_{pid}.json。 | 是 | -| -backward或--backward_input_file | 指定反向API信息文件backward_info_{pid}.json。 | 否 | -| -save_error_data | 保存精度未达标的API输入输出数据。 | 否 | -| -o或--out_path | 指定run_ut执行结果存盘路径,默认“./”(相对于run_ut的路径)。 | 否 | -| -j或--jit_compile | 开启jit编译。 | 否 | -| -n | 同时执行run_ut线程的数量,默认为8,最大支持64,但每个Device最大支持8个线程,当指定多个线程和多个Device时,则线程数在每张卡上均分。 | 否 | -| -d或--device | 指定Device ID,选择UT代码运行所在的卡,默认值为0,支持同时指定0~7,共8个Device。 | 否 | -| -csv_path或--result_csv_path | 指定本次运行中断时生成的`accuracy_checking_result_{timestamp}.csv`文件路径,执行run_ut中断时,若想从中断处继续执行,配置此参数即可。需要指定为上次中断的`accuracy_checking_result_{timestamp}.csv`文件。详见“**断点续检**”。 | run_ut操作中断后继续执行场景下必选 | -| -real_data_path | 指定run_ut操作的真实数据路径。真实数据dump模式通过**msCheckerConfig.update_config**接口的real_data参数开启。指定绝对路径为forward_real_data和backward_real_data目录的父目录。 | dump的数据为真实数据下必选 | -| -f或--filter_api | 过滤模型中除最大值和最小值以外其他参数和结构相同的API。适用于模型较大且重复API较多的场景。 | 否 | - -#### 断点续检 - -精度预检run_ut过程中,若因环境、数据量过大等原因导致预检进程中断,那么当用户解决这些问题后,重新执行run_ut操作,可以通过断点续检操作继续前面未完成的预检,会在-csv_path指定的`accuracy_checking_result_{timestamp}.csv`文件以及对应的`accuracy_checking_details_{timestamp}.csv`文件中继续写入后续的结果,不会重新创建结果文件。 - -须指定为上次预检中断的`accuracy_checking_result_{timestamp}.csv`文件。请勿修改`accuracy_checking_result_{timestamp}.csv`和`accuracy_checking_details_{timestamp}.csv`文件名,包括时间戳,否则断点续检会因无法识别到文件名而失败。 - -断点续检操作通过如下命令执行: - -```bash -python run_ut.py -forward ./forward_info_0.json -backward ./backward_info_0.json -csv_path /home/xxx/ut/accuracy_checking_result_{timestamp}.csv -``` - -#### API预检白名单 - -run_ut过程同样支持API预检白名单,操作方式如下: - -修改att/debug/accuracy_tools/api_accuracy_checker目录下config.yaml文件的white_list参数,配置需要预检的API名称,详见“config.yaml文件说明”。 - -### config.yaml文件说明 - -config.yaml文件可以通过配置参数来控制dump和run_ut操作的真实数据模式以及白名单等功能。 - -文件路径为:att/debug/accuracy_tools/api_accuracy_checker/config.yaml - -| 参数名称 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| dump_path | 设置dump路径,默认为当前目录。若指定目录不存在,则自动创建。 | 否 | -| real_data | 真实数据模式,可取值True或False,默认为False,表示随机数据模式,配置为True后开启真实数据模式,dump信息增加forward_real_data和backward_real_data目录,目录下保存每个API输入的具体数值。 | 否 | -| enable_dataloader | 自动dump数据开关,可取值True(开启)、False(关闭),默认关闭。 | 否 | -| target_iter | 指定dump某个step的数据,默认为[1],须指定为训练脚本中存在的step。target_iter为list格式,可配置逐个step,例如:target_iter=[0,1,2];也可以配置step范围,例如:target_iter=list(range(0,9)),表示dump第0到第8个step。 | 否 | -| white_list | API dump白名单,指定dump具体API数据,也可以直接配置预检的API白名单,详细请参见“**API预检白名单**”。参数示例:white_list=["conv1d", "conv2d"]。默认未配置白名单,即dump全量API数据。 | 否 | -| error_data_path | 配置保存精度未达标的API输入输出数据路径。 | 否 | -| jit_compile | 开启jit编译。 | 否 | -| precision | 浮点数表示位数,默认取小数点后14位。 | 否 | - -## 预检结果 - -精度预检生成的`accuracy_checking_result_{timestamp}.csv`和`accuracy_checking_details_{timestamp}.csv`文件示例如下: - -可以通过先查看`accuracy_checking_result_{timestamp}.csv`文件的Forward Test Success和Backward Test 
Success,判断是否存在未通过测试的API,再查看`accuracy_checking_details_{timestamp}.csv`文件的API详细达标情况,API达标情况介绍请参见“**API预检指标**”。 - -`accuracy_checking_result_{timestamp}.csv` - -![891a3bd8_12631423](img/accuracy_checking_result.png) - -| 字段 | 含义 | -| --------------------- | ------------------------------------------------------------ | -| API name | API名称。 | -| Forward Test Success | 前向API是否通过测试,pass为通过,warning为待观察,error为错误。 | -| Backward Test Success | 反向API是否通过测试,pass为通过,warning为待观察,error为错误,如果是空白的话代表该API没有反向输出。 | -| Message | 提示信息。 | - -Forward Test Success和Backward Test Success是否通过测试是由`accuracy_checking_details_{timestamp}.csv`中的余弦相似度、最大绝对误差、双百双千双万指标判定结果决定的。 - -需要注意的是`accuracy_checking_details_{timestamp}.csv`中可能存在一个API的前向(反向)有多个输出,那么每个输出记录一行,而在`accuracy_checking_result_{timestamp}.csv`中的结果需要该API的所有结果均为pass才能标记为TRUE,否则标记FALSE或WARING。 - -`accuracy_checking_details_{timestamp}.csv` - -![f07237b1_12631423](img/accuracy_checking_details.png) - -| 字段 | 含义 | -| ---------------- | ------------------------------------------------------------ | -| API name | NPU或GPU下的API名称。 | -| Bench Dtype | 标杆数据的API数据类型。 | -| Device Dtype | NPU或GPU数据的API数据类型。 | -| Shape | API的Shape信息。 | -| 余弦相似度 | NPU或GPU数据与标杆数据的余弦相似度。 | -| 最大绝对误差 | NPU或GPU数据与标杆数据的最大绝对误差。 | -| 双百指标 | 双百精度指标。是指NPU或GPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差小于百分之一的个数占总元素个数的比例。测试通过标准为相对误差大于百分之一的个数占总元素个数的比例小于百分之一。 | -| 双千指标 | 双千精度指标。是指NPU或GPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差小于千分之一的个数占总元素个数的比例。测试通过标准为相对误差大于千分之一的个数占总元素个数的比例小于千分之一。 | -| 双万指标 | 双万精度指标。是指NPU或GPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差小于万分之一的个数占总元素个数的比例。测试通过标准为相对误差大于万分之一的个数占总元素个数的比例小于万分之一。 | -| 二进制一致错误率 | NPU或GPU数据中每个Tensor精度不一致的数值的数量与Tensor中数值数量的比值。只有数据是builtin类型(bool、int、float、str)、torch.bool和torch的int类型才会展示。 | -| 误差均衡性 | NPU或GPU数据与标杆数据精度差的上下浮动情况。 | -| 均方根误差 | NPU或GPU数据与标杆数据的均方根误差。 | -| 小值域错误占比 | NPU或GPU Tensor中与标杆的绝对误差大于错误阈值的小值在小值域(小值的总数量)中的占比。判断为小值以及绝对误差的错误阈值见“**小值域阈值**”。 | -| 相对误差最大值 | NPU或GPU数据与标杆数据相对误差的最大值。 | -| 相对误差平均值 | NPU或GPU数据与标杆数据相对误差的平均值。 | -| inf/nan错误率 | NPU与标杆inf/nan计算不一致的元素个数占总元素的个数比例。 | -| 相对误差错误率 | NPU与标杆的正常值计算相对误差,其大于错误阈值的元素个数占正常值元素个数的比例。 | -| 绝对误差错误率 | NPU与标杆的小值计算绝对误差,其大于错误阈值的元素个数占小值元素个数的比例。 | -| Status | API预检通过状态,pass表示通过测试,error表示未通过,warning表示测试未通过双千或双万精度指标,SKIP表示该API的某个参数的反向不要计算梯度,所以没有任何计算过程,其他信息均为空。 | -| message | 提示信息。 | - -### 小值域阈值 - -判定为小值的阈值为: - -- torch.float32:e-6 -- torch.float16:e-3 -- torch.bfloat16:e-3 - -小值域的绝对误差阈值为: - -- torch.float32:e-9 -- torch.float16:e-5 -- torch.bfloat16:e-5 - -### API预检指标 - -API预检指标是通过对`accuracy_checking_details_{timestamp}.csv`中的余弦相似度、最大绝对误差双百、双千、双万精度指标的数值进行判断,得出该API是否符合精度标准的参考指标。 - -API预检通过测试,则在`accuracy_checking_details_{timestamp}.csv`文件中的“Status”列标记“pass”,否则标记“error”或“warning”,详细规则如下: - -1. 余弦相似度 > 0.99:≤ 0.99为不达标,标记“error”,> 0.99达标,进行下一步; -2. 最大绝对误差 < 0.001:< 0.001达标,标记“pass”,≥ 0.001为不达标,进行下一步; -3. 双百、双千、双万精度指标: - - 对于float16和bfloat16数据:双百指标不通过,标记“error”;双百指标通过,双千指标不通过,标记“warning”;双百、双千指标均通过,标记“pass”。 - - 对于float32和float64数据:双千指标不通过,标记“error”;双千指标通过,双万指标不通过,标记“warning”;双千、双万指标均通过,标记“pass”。 - -4. 
-
-## 预检结果比对
-
-该步骤仅在使用新精度标准比对法时需要执行,需要同时获取NPU和GPU环境下run_ut操作的预检结果`accuracy_checking_details_{timestamp}.csv`文件。执行如下命令进行NPU和GPU预检结果的比对:
-
-```bash
-cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/compare
-python api_precision_compare.py -npu /home/xxx/npu/accuracy_checking_details_{timestamp}.csv -gpu /home/xxx/gpu/accuracy_checking_details_{timestamp}.csv -o /home/xxx/
-```
-
-| 参数名称 | 说明 | 是否必选 |
-| -------------------- | ------------------------------------------------------------ | -------- |
-| -npu或--npu_csv_path | NPU预检结果`accuracy_checking_details_{timestamp}.csv`文件路径。默认从当前目录下识别该文件。 | 否 |
-| -gpu或--gpu_csv_path | GPU预检结果`accuracy_checking_details_{timestamp}.csv`文件路径。默认从当前目录下识别该文件。 | 否 |
-| -o或--out_path | 指定api_precision_compare.py执行结果存盘路径,默认为当前目录。 | 否 |
-
-执行完成后输出`api_precision_compare_result_{timestamp}.csv`和`api_precision_compare_details_{timestamp}.csv`文件。文件示例如下:
-
-可以通过先查看`api_precision_compare_result_{timestamp}.csv`文件的Forward Test Success和Backward Test Success,判断是否存在未通过测试的API,再查看`api_precision_compare_details_{timestamp}.csv`文件的API详细达标情况。
-
-`api_precision_compare_result_{timestamp}.csv`
-
-![api_precision_compare_result](img/api_precision_compare_result.png)
-
-| 字段 | 含义 |
-| --------------------- | ------------------------------------------------------------ |
-| API name | API名称。 |
-| Forward Test Success | 前向API是否通过测试,pass为通过,warning为待观察,error为错误,skip表示该API的数据类型不支持使用新精度标准进行比对,如float64。 |
-| Backward Test Success | 反向API是否通过测试,pass为通过,warning为待观察,error为错误,如果是空白的话代表该API没有反向输出,skip表示该API的数据类型不支持使用新精度标准进行比对,如float64。 |
-| Message | 提示信息。 |
-
-Forward Test Success和Backward Test Success是否通过测试是由`api_precision_compare_details_{timestamp}.csv`中的各个指标判定结果决定的。需要注意的是`api_precision_compare_details_{timestamp}.csv`中可能存在一个API的前向(反向)有多个输出,那么每个输出记录一行,而在`api_precision_compare_result_{timestamp}.csv`中的结果需要该API的所有结果均为pass才能标记为TRUE,否则标记FALSE或WARNING。
-
-`api_precision_compare_details_{timestamp}.csv`
-
-![api_precision_compare_details](img/api_precision_compare_details.png)
-
-| 字段 | 含义 |
-| ------------------------ | ------------------------------------------------------------ |
-| API name | NPU或GPU下的API名称。 |
-| 小值域错误比值 | NPU与CPU的小值域的错误比率/GPU与CPU的小值域的错误比率。 |
-| 小值域错误判定结果 | 小值域错误比值小于等于1标记为pass,1~2之间标记为warning,大于2标记为error。 |
-| 均方根误差比值 | NPU与CPU的均方根误差/GPU与CPU的均方根误差。 |
-| 均方根误差判定结果 | 均方根误差比值小于等于1标记为pass,1~2之间标记为warning,大于2标记为error。 |
-| 相对误差最大值比值 | NPU与CPU的相对误差最大值/GPU与CPU的相对误差最大值。 |
-| 相对误差最大值判定结果 | 相对误差最大值比值小于等于1标记为pass,1~10之间标记为warning,大于10标记为error。 |
-| 相对误差平均值比值 | NPU与CPU的相对误差的平均值/GPU与CPU的相对误差的平均值。 |
-| 相对误差平均值判定结果 | 相对误差平均值比值小于等于1标记为pass,1~2之间标记为warning,大于2标记为error。 |
-| 误差均衡性比值 | NPU与CPU的误差均衡性/GPU与CPU的误差均衡性。 |
-| 误差均衡性判定结果 | 误差均衡性比值小于等于1标记为pass,1~2之间标记为warning,大于2标记为error。该字段暂不参与api_precision_compare_result的结果判定。 |
-| inf/nan错误率 | NPU与标杆inf/nan计算不一致的元素个数占总元素的个数比例。 |
-| inf/nan判定结果 | inf/nan错误率判定结果,等于0标记为pass,其余情况标记为error。 |
-| 相对误差错误率 | NPU与标杆的正常值计算相对误差,其大于错误阈值的元素个数占正常值元素个数的比例。 |
-| 相对误差判定结果 | 相对误差错误率判定结果,等于0标记为pass,其余情况标记为error。 |
-| 绝对误差错误率 | NPU与标杆的小值计算绝对误差,其大于错误阈值的元素个数占小值元素个数的比例。 |
-| 绝对误差判定结果 | 绝对误差错误率判定结果,等于0标记为pass,其余情况标记为error。 |
-| 二进制一致错误率 | NPU或GPU数据中每个Tensor精度不一致的数值的数量与Tensor中数值数量的比值。只有数据是builtin类型(bool、int、float、str)、torch.bool和torch的int类型或者在新精度标准中使用二进制一致算法进行比对的API才会展示。 |
-| 二进制一致错误率判定结果 | 二进制一致错误率判定结果,等于0标记为pass,其余情况标记为error。 |
-| 比对结果 | 综合所有指标的最终结果。如果比对指标中有error,则标记为error;有warning,则标记为warning;否则标记为pass。 |
-| 比对算法 | API使用的比对算法,为标杆比对法、二进制一致法和绝对阈值法中的一种。 |
-| Message | 提示信息。当前提示该API比对结果为error或warning时对应不符合标准的指标。 |
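-
-下面用一段示意性的Python代码说明上表中“比值类”与“错误率类”指标的分级方式;函数名与阈值参数均为示例写法,并非比对脚本的实际实现:
-
-```python
-def judge_ratio(ratio, warning_bound=2.0):
-    """比值类指标:小于等于1为pass,1到warning_bound之间为warning,更大为error(示意)。"""
-    if ratio <= 1.0:
-        return "pass"
-    return "warning" if ratio <= warning_bound else "error"
-
-def judge_error_rate(rate):
-    """错误率类指标(inf/nan、相对误差、绝对误差错误率):等于0为pass,否则为error(示意)。"""
-    return "pass" if rate == 0 else "error"
-
-# 示例:均方根误差比值等使用warning_bound=2,相对误差最大值比值使用warning_bound=10
-print(judge_ratio(1.5))          # warning
-print(judge_ratio(12.0, 10.0))   # error
-print(judge_error_rate(0.0))     # pass
-```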
-
-# 溢出解析工具
-
-针对训练过程中的溢出检测场景(参见[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)中的"溢出检测场景"进行溢出检测dump),对于输入正常但输出存在溢出的API,会在训练执行目录下将溢出的API信息按照前向和反向分类,dump并保存为`forward_info_{pid}.json`。前向过程溢出的API可通过该工具对`forward_info_{pid}.json`进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。
-
-工具支持PyTorch版本:1.8.1/1.11.0/2.0/2.1。
-
-若溢出检测场景dump结果生成`forward_info_{pid}.json`文件,则使用本工具进行解析。操作步骤如下:
-
-1. 安装预检工具
-
-   将att仓代码下载到本地,并配置环境变量。假设下载后att仓路径为 $ATT_HOME,环境变量应配置为
-
-   ```bash
-   export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/
-   ```
-
-   安装依赖tqdm、rich、pyyaml
-
-   ```bash
-   pip3 install tqdm rich pyyaml
-   ```
-
-2. 执行溢出API解析操作
-
-   **forward_info_0.json为[ptdbg_ascend精度工具功能说明](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)中的"溢出检测场景"执行溢出检测dump时生成,而不是精度预检工具生成。**
-
-   ```bash
-   cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut
-   python run_overflow_check.py -forward ./forward_info_0.json
-   ```
-
-   | 参数名称 | 说明 | 是否必选 |
-   | ------------------------------ | -------------------------------------------------- | -------- |
-   | -forward或--forward_input_file | 指定前向API信息文件forward_info_{pid}.json。 | 是 |
-   | -j或--jit_compile | 开启jit编译。 | 否 |
-   | -d或--device | 指定Device ID,选择UT代码运行所在的卡,默认值为0。 | 否 |
-
-   反向过程溢出的API暂不支持该功能。
-
-
-具体参数解释请参见“**Ascend模型精度预检工具**”。
-
-# FAQ
-
-1. 预检工具在dump和run_ut的过程中,是否需要同时开启或关闭jit编译(jit_compile)?
-
-   答:是。
-
-2. 预检工具对于type_as这类涉及数据类型转换操作的API,是否具有参考性?
-
-   答:由于这类API在CPU侧存在精度先提升后下降的操作,因此对这类API的预检结果参考价值有限。
-
-3. run ut过程中出现报错:ERROR:Got unsupported ScalarType BFloat16
-
-   答:请使用最新版本的工具。
-
-4. Dropout算子,CPU和NPU的随机应该不一样,为什么结果比对是一致的?
-
-   答:这个结果是正常的,工具对该算子有特殊处理,只判定输出中为0的元素所占比例是否与设定的p值大致相当。
-
-5. 为什么浮点型数据bench和CPU的dtype不一致?
-
-   答:对于fp16的数据,CPU会提升一个精度,用fp32去计算,这是与算子侧对齐的精度结论,CPU用更高精度去计算会更接近真实值。
-
-6. 添加预检工具后截取操作报错:`IndexError: too many indices for tensor of dimension x` 或 `TypeError: len() of a 0-d tensor`。
-
-   答:注释工具目录api_accuracy_checker/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- __getitem__`,工具会跳过dump该API。如果是需要dump的关键位置API也可以考虑根据报错堆栈信息注释引发报错的类型检查。
-
-7. 添加预检工具后F.gelu触发ValueError报错:`activation_func must be F.gelu`等。
-
-   答:注释工具目录api_accuracy_checker/hook_module/support_wrap_ops.yaml文件中functional:下的`- gelu`,工具会跳过dump该API。如果是需要dump的关键位置API也可以考虑根据报错堆栈信息注释引发报错的类型检查。
-
-8. 添加预检工具后触发AsStrided算子相关的报错,或者编译相关的报错,如:`Failed to compile Op [AsStrided]`。
-
-   答:注释工具目录api_accuracy_checker/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- t`和`- transpose`。
-
-9. Tensor 魔法函数具体对应什么操作?
- - 答: - - | Tensor魔法函数 | 具体操作 | - | --------------- | ---------------- | - | `__add__` | + | - | `__and__` | & | - | `__bool__` | 返回Tensor布尔值 | - | `__div__` | / | - | `__eq__` | == | - | `__ge__` | >= | - | `__gt__` | > | - | `__iadd__` | += | - | `__iand__` | &= | - | `__idiv__` | /= | - | `__ifloordiv__` | //= | - | `__ilshift__` | <<= | - | `__imod__` | %= | - | `__imul__` | *= | - | `__ior__` | \|= | - | `__irshift__` | >>= | - | `__isub__` | -= | - | `__ixor__` | ^= | - | `__lshift__` | << | - | `__matmul__` | 矩阵乘法 | - | `__mod__` | % | - | `__mul__` | * | - | `__nonzero__` | 同`__bool__` | - | `__or__` | \| | - | `__radd__` | +(反向) | - | `__rmul__` | *(反向) | - | `__rshift__` | >> | - | `__sub__` | - | - | `__truediv__` | 同`__div__` | - | `__xor__` | ^ | - diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/utils.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/utils.py deleted file mode 100644 index 91cfc1e06d5bbac70024b50dae4de6e0d9037330..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/utils.py +++ /dev/null @@ -1,654 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -import collections -import json -import os -import random -import re -import stat -import subprocess -import sys -import time -import csv -from datetime import datetime, timezone - -import numpy as np -import torch - -try: - import torch_npu -except ImportError: - IS_GPU = True -else: - IS_GPU = False - -from ...common.file_check import FileCheckConst, FileChecker, FileOpen -from ...common import file_check as file_check_util - -torch_without_guard_version_list = ['2.1'] -for version in torch_without_guard_version_list: - if torch.__version__.startswith(version): - torch_without_guard_version = True - break - else: - torch_without_guard_version = False -if not IS_GPU and not torch_without_guard_version: - from torch_npu.utils.device_guard import torch_device_guard as torch_npu_device_guard - - -class Const: - """ - Class for const - """ - SEP = '.' - DIRECTORY_LENGTH = 4096 - FILE_NAME_LENGTH = 255 - FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' - MODEL_TYPE = ['.onnx', '.pb', '.om'] - SEMICOLON = ";" - COLON = ":" - EQUAL = "=" - COMMA = "," - DOT = "." 
- DUMP_RATIO_MAX = 100 - SUMMERY_DATA_NUMS = 256 - ONE_HUNDRED_MB = 100 * 1024 * 1024 - FLOAT_EPSILON = np.finfo(float).eps - SUPPORT_DUMP_MODE = ['api', 'acl'] - ON = 'ON' - OFF = 'OFF' - BACKWARD = 'backward' - FORWARD = 'forward' - FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] - BOOL_TYPE = [bool, np.uint8] - INT_TYPE = [np.int32, np.int64] - NPU = 'NPU' - DISTRIBUTED = 'Distributed' - - # dump mode - ALL = "all" - LIST = "list" - RANGE = "range" - STACK = "stack" - ACL = "acl" - API_LIST = "api_list" - API_STACK = "api_stack" - DUMP_MODE = [ALL, LIST, RANGE, STACK, ACL, API_LIST, API_STACK] - - WRITE_FLAGS = os.O_WRONLY | os.O_CREAT - WRITE_MODES = stat.S_IWUSR | stat.S_IRUSR - - RAISE_PRECISION = { - torch.float16: torch.float32, - torch.bfloat16: torch.float32, - torch.float32: torch.float64 - } - CONVERT = { - "int32_to_int64": ["torch.int32", "torch.int64"], - } - - CONVERT_API = { - "int32_to_int64": ["cross_entropy"] - } - - -class CompareConst: - """ - Class for compare module const - """ - # compare result column name - NPU_NAME = "NPU Name" - BENCH_NAME = "Bench Name" - NPU_DTYPE = "NPU Tensor Dtype" - BENCH_DTYPE = "Bench Tensor Dtype" - NPU_SHAPE = "NPU Tensor Shape" - BENCH_SHAPE = "Bench Tensor Shape" - NPU_MAX = "NPU max" - NPU_MIN = "NPU min" - NPU_MEAN = "NPU mean" - BENCH_MAX = "Bench max" - BENCH_MIN = "Bench min" - BENCH_MEAN = "Bench mean" - COSINE = "Cosine" - MAX_ABS_ERR = "MaxAbsErr" - ACCURACY = "Accuracy Reached or Not" - STACK = "NPU_Stack_Info" - ERROR_MESSAGE = "Err_message" - - # compare result data - NAN = 'Nan' - SHAPE_UNMATCH = 'shape unmatched' - DTYPE_UNMATCH = 'dtype unmatched' - - # accuracy standards - COS_THRESHOLD = 0.99 - MAX_ABS_ERR_THRESHOLD = 0.001 - COS_MAX_THRESHOLD = 0.9 - MAX_ABS_ERR_MAX_THRESHOLD = 1 - ACCURACY_CHECK_YES = "Yes" - ACCURACY_CHECK_NO = "No" - ACCURACY_CHECK_UNMATCH = "Unmatched" - - # error message - NO_BENCH = "No bench data matched." 
- - -class VersionCheck: - """ - Class for TorchVersion - """ - V1_8 = "1.8" - V1_11 = "1.11" - - @staticmethod - def check_torch_version(version): - torch_version = torch.__version__ - if torch_version.startswith(version): - return True - else: - return False - - -class CompareException(Exception): - """ - Class for Accuracy Compare Exception - """ - NONE_ERROR = 0 - INVALID_PATH_ERROR = 1 - OPEN_FILE_ERROR = 2 - CLOSE_FILE_ERROR = 3 - READ_FILE_ERROR = 4 - WRITE_FILE_ERROR = 5 - INVALID_FILE_ERROR = 6 - PERMISSION_ERROR = 7 - INDEX_OUT_OF_BOUNDS_ERROR = 8 - NO_DUMP_FILE_ERROR = 9 - INVALID_DATA_ERROR = 10 - INVALID_PARAM_ERROR = 11 - INVALID_DUMP_RATIO = 12 - INVALID_DUMP_FILE = 13 - UNKNOWN_ERROR = 14 - INVALID_DUMP_MODE = 15 - PARSE_FILE_ERROR = 16 - INVALID_COMPARE_MODE = 17 - - def __init__(self, code, error_info: str = ""): - super(CompareException, self).__init__() - self.code = code - self.error_info = error_info - - def __str__(self): - return self.error_info - - -class DumpException(CompareException): - pass - - -def read_json(file): - with FileOpen(file, 'r') as f: - obj = json.load(f) - return obj - - -def write_csv(data, filepath): - with FileOpen(filepath, 'a', encoding='utf-8-sig') as f: - writer = csv.writer(f) - writer.writerows(data) - - -def _print_log(level, msg, end='\n'): - current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(time.time()))) - pid = os.getgid() - print(current_time + "(" + str(pid) + ")-[" + level + "]" + msg, end=end) - sys.stdout.flush() - - -def print_info_log(info_msg, end='\n'): - """ - Function Description: - print info log. - Parameter: - info_msg: the info message. - """ - _print_log("INFO", info_msg, end=end) - - -def print_error_log(error_msg): - """ - Function Description: - print error log. - Parameter: - error_msg: the error message. - """ - _print_log("ERROR", error_msg) - - -def print_warn_log(warn_msg): - """ - Function Description: - print warn log. - Parameter: - warn_msg: the warning message. - """ - _print_log("WARNING", warn_msg) - - -def check_mode_valid(mode): - if mode not in Const.DUMP_MODE: - msg = "Current mode '%s' is not supported. Please use the field in %s" % \ - (mode, Const.DUMP_MODE) - raise CompareException(CompareException.INVALID_DUMP_MODE, msg) - - -def check_object_type(check_object, allow_type): - """ - Function Description: - Check if the object belongs to a certain data type - Parameter: - check_object: the object to be checked - allow_type: legal data type - Exception Description: - when invalid data throw exception - """ - if not isinstance(check_object, allow_type): - print_error_log(f"{check_object} not of {allow_type} type") - raise CompareException(CompareException.INVALID_DATA_ERROR) - - -def check_file_or_directory_path(path, isdir=False): - """ - Function Description: - check whether the path is valid - Parameter: - path: the path to check - isdir: the path is dir or file - Exception Description: - when invalid data throw exception - """ - if isdir: - if not os.path.exists(path): - print_error_log('The path {} is not exist.'.format(path)) - raise CompareException(CompareException.INVALID_PATH_ERROR) - - if not os.path.isdir(path): - print_error_log('The path {} is not a directory.'.format(path)) - raise CompareException(CompareException.INVALID_PATH_ERROR) - - if not os.access(path, os.W_OK): - print_error_log( - 'The path {} does not have permission to write. 
Please check the path permission'.format(path)) - raise CompareException(CompareException.INVALID_PATH_ERROR) - else: - if not os.path.isfile(path): - print_error_log('{} is an invalid file or non-exist.'.format(path)) - raise CompareException(CompareException.INVALID_PATH_ERROR) - - if not os.access(path, os.R_OK): - print_error_log( - 'The path {} does not have permission to read. Please check the path permission'.format(path)) - raise CompareException(CompareException.INVALID_PATH_ERROR) - - -def _check_pkl(pkl_file_handle, file_name): - tensor_line = pkl_file_handle.readline() - if len(tensor_line) == 0: - print_error_log("dump file {} have empty line!".format(file_name)) - raise CompareException(CompareException.INVALID_DUMP_FILE) - pkl_file_handle.seek(0, 0) - - -def check_file_mode(npu_pkl, bench_pkl, stack_mode): - npu_pkl_name = os.path.split(npu_pkl)[-1] - bench_pkl_name = os.path.split(bench_pkl)[-1] - - if not npu_pkl_name.startswith("api_stack") and not bench_pkl_name.startswith("api_stack"): - if stack_mode: - print_error_log("The current file does not contain stack information, please turn off the stack_mode") - raise CompareException(CompareException.INVALID_COMPARE_MODE) - elif npu_pkl_name.startswith("api_stack") and bench_pkl_name.startswith("api_stack"): - if not stack_mode: - print_error_log("The current file contains stack information, please turn on the stack_mode") - raise CompareException(CompareException.INVALID_COMPARE_MODE) - else: - print_error_log("The dump mode of the two files is not same, please check the dump files") - raise CompareException(CompareException.INVALID_COMPARE_MODE) - - -def check_file_size(input_file, max_size): - try: - file_size = os.path.getsize(input_file) - except OSError as os_error: - print_error_log('Failed to open "%s". %s' % (input_file, str(os_error))) - raise CompareException(CompareException.INVALID_FILE_ERROR) from os_error - if file_size > max_size: - print_error_log('The size (%d) of %s exceeds (%d) bytes, tools not support.' - % (file_size, input_file, max_size)) - raise CompareException(CompareException.INVALID_FILE_ERROR) - - -def get_dump_data_path(dump_dir): - """ - Function Description: - traverse directories and obtain the absolute path of dump data - Parameter: - dump_dir: dump data directory - Return Value: - dump data path,file is exist or file is not exist - """ - dump_data_path = None - file_is_exist = False - - check_file_or_directory_path(dump_dir, True) - for dir_path, sub_paths, files in os.walk(dump_dir): - if len(files) != 0: - dump_data_path = dir_path - file_is_exist = True - break - dump_data_path = dir_path - return dump_data_path, file_is_exist - - -def modify_dump_path(dump_path, mode): - if mode == Const.ALL: - return dump_path - file_name = os.path.split(dump_path) - mode_file_name = mode + "_" + file_name[-1] - return os.path.join(file_name[0], mode_file_name) - - -def create_directory(dir_path): - """ - Function Description: - creating a directory with specified permissions in a thread-safe manner - Parameter: - dir_path: directory path - Exception Description: - when invalid data throw exception - """ - try: - os.makedirs(dir_path, mode=FileCheckConst.DATA_DIR_AUTHORITY, exist_ok=True) - except OSError as ex: - print_error_log( - 'Failed to create {}. Please check the path permission or disk space. 
{}'.format(dir_path, str(ex))) - raise CompareException(CompareException.INVALID_PATH_ERROR) from ex - - -def execute_command(cmd): - """ - Function Description: - run the following command - Parameter: - cmd: command - Exception Description: - when invalid command throw exception - """ - print_info_log('Execute command:%s' % cmd) - process = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - while process.poll() is None: - line = process.stdout.readline() - line = line.strip() - if line: - print(line) - if process.returncode != 0: - print_error_log('Failed to execute command:%s' % " ".join(cmd)) - raise CompareException(CompareException.INVALID_DATA_ERROR) - - -def save_numpy_data(file_path, data): - """ - save_numpy_data - """ - if not os.path.exists(os.path.dirname(file_path)): - os.makedirs(os.path.dirname(file_path)) - np.save(file_path, data) - - -def parse_arg_value(values): - """ - parse dynamic arg value of atc cmdline - """ - value_list = [] - for item in values.split(Const.SEMICOLON): - value_list.append(parse_value_by_comma(item)) - return value_list - - -def parse_value_by_comma(value): - """ - parse value by comma, like '1,2,4,8' - """ - value_list = [] - value_str_list = value.split(Const.COMMA) - for value_str in value_str_list: - value_str = value_str.strip() - if value_str.isdigit() or value_str == '-1': - value_list.append(int(value_str)) - else: - print_error_log("please check your input shape.") - raise CompareException(CompareException.INVALID_PARAM_ERROR) - return value_list - - -def get_data_len_by_shape(shape): - data_len = 1 - for item in shape: - if item == -1: - print_error_log("please check your input shape, one dim in shape is -1.") - return -1 - data_len = data_len * item - return data_len - - -def add_time_as_suffix(name): - return '{}_{}.csv'.format(name, time.strftime("%Y%m%d%H%M%S", time.localtime(time.time()))) - - -def get_time(): - return datetime.now(tz=timezone.utc).strftime("%Y%m%d_%H%M%S") - - -def format_value(value): - return '{:.6f}'.format(value) - - -def torch_device_guard(func): - if IS_GPU or torch_without_guard_version: - return func - # Parse args/kwargs matched torch.device objects - - @torch_npu_device_guard - def wrapper(*args, **kwargs): - return func(*args, **kwargs) - return wrapper - - -def seed_all(seed=1234, mode=False): - random.seed(seed) - os.environ['PYTHONHASHSEED'] = str(seed) - np.random.seed(seed) - torch.manual_seed(seed) - torch.use_deterministic_algorithms(mode) - if IS_GPU: - torch.cuda.manual_seed_all(seed) - torch.cuda.manual_seed(seed) - torch.backends.cudnn.deterministic = True - torch.backends.cudnn.enable = False - torch.backends.cudnn.benchmark = False - else: - torch_npu.npu.manual_seed_all(seed) - torch_npu.npu.manual_seed(seed) - - -def get_process_rank(model): - print_info_log("Rank id is not provided. Trying to get the rank id of the model.") - try: - device = next(model.parameters()).device - except StopIteration: - print_warn_log('There is no parameter in the model. Fail to get rank id.') - return 0, False - if device.type == 'cpu': - print_warn_log("Warning: the debugger is unable to get the rank id. " - "This may cause the dumpped data to be corrupted in the " - "case of distributed training. (You may ignore this if you are using only one card.) 
" - "Transfer the model to npu or gpu before register_hook() to avoid this warning.") - return 0, False - else: - return device.index, True - - -def get_json_contents(file_path): - ops = get_file_content_bytes(file_path) - try: - json_obj = json.loads(ops) - except ValueError as error: - print_error_log('Failed to load "%s". %s' % (file_path, str(error))) - raise CompareException(CompareException.INVALID_FILE_ERROR) from error - if not isinstance(json_obj, dict): - print_error_log('Json file %s, content is not a dictionary!' % file_path) - raise CompareException(CompareException.INVALID_FILE_ERROR) - return json_obj - - -def get_file_content_bytes(file): - with FileOpen(file, 'rb') as file_handle: - return file_handle.read() - - -def islink(path): - path = os.path.abspath(path) - return os.path.islink(path) - - -class SoftlinkCheckException(Exception): - pass - - -MAX_JSON_FILE_SIZE = 10 * 1024 ** 2 -LINUX_FILE_NAME_LENGTH_LIMIT = 200 - - -def check_path_length_valid(path): - path = os.path.realpath(path) - return len(os.path.basename(path)) <= LINUX_FILE_NAME_LENGTH_LIMIT - - -def check_path_pattern_valid(path): - pattern = re.compile(r'(\.|/|:|_|-|\s|[~0-9a-zA-Z])+') - if not pattern.fullmatch(path): - raise ValueError('Only the following characters are allowed in the path: A-Z a-z 0-9 - _ . / :') - - -def check_input_file_valid(input_path, max_file_size=MAX_JSON_FILE_SIZE): - if islink(input_path): - raise SoftlinkCheckException("Input path doesn't support soft link.") - - input_path = os.path.realpath(input_path) - if not os.path.exists(input_path): - raise ValueError('Input file %s does not exist!' % input_path) - - if not os.access(input_path, os.R_OK): - raise PermissionError('Input file %s is not readable!' % input_path) - - if not check_path_length_valid(input_path): - raise ValueError("The real path or file_name of input is too long.") - - check_path_pattern_valid(input_path) - - if os.path.getsize(input_path) > max_file_size: - raise ValueError(f'The file is too large, exceeds {max_file_size // 1024 ** 2}MB') - - -def check_need_convert(api_name): - convert_type = None - for key, value in Const.CONVERT_API.items(): - if api_name not in value: - continue - else: - convert_type = key - return convert_type - - -def api_info_preprocess(api_name, api_info_dict): - """ - Function Description: - Preprocesses the API information. - Parameter: - api_name: Name of the API. - api_info_dict: argument of the API. - Return api_info_dict: - convert_type: Type of conversion. - api_info_dict: Processed argument of the API. - """ - convert_type = check_need_convert(api_name) - if api_name == 'cross_entropy': - api_info_dict = cross_entropy_process(api_info_dict) - return convert_type, api_info_dict - - -def cross_entropy_process(api_info_dict): - """ - Function Description: - Preprocesses the cross_entropy API information. - Parameter: - api_info_dict: argument of the API. - Return api_info_dict: - api_info_dict: Processed argument of the API. 
- """ - if 'args' in api_info_dict and len(api_info_dict['args']) > 1 and 'Min' in api_info_dict['args'][1]: - if api_info_dict['args'][1]['Min'] <= 0: - # The second argument in cross_entropy should be -100 or not less than 0 - api_info_dict['args'][1]['Min'] = 0 - return api_info_dict - - -def initialize_save_path(save_path, dir_name): - data_path = os.path.join(save_path, dir_name) - if os.path.exists(data_path): - print_warn_log(f"{data_path} already exists, it will be overwritten") - else: - os.mkdir(data_path, mode=FileCheckConst.DATA_DIR_AUTHORITY) - data_path_checker = FileChecker(data_path, FileCheckConst.DIR) - data_path_checker.common_check() - - -def write_pt(file_path, tensor): - if os.path.exists(file_path): - raise ValueError(f"File {file_path} already exists") - torch.save(tensor, file_path) - full_path = os.path.realpath(file_path) - file_check_util.change_mode(full_path, FileCheckConst.DATA_FILE_AUTHORITY) - return full_path - - -def get_real_data_path(file_path): - targets = ['forward_real_data', 'backward_real_data', 'ut_error_data\d+'] - pattern = re.compile(r'({})'.format('|'.join(targets))) - match = pattern.search(file_path) - if match: - target_index = match.start() - target_path = file_path[target_index:] - return target_path - else: - raise DumpException(DumpException.INVALID_PATH_ERROR) - - -def get_full_data_path(data_path, real_data_path): - if not data_path: - return data_path - full_data_path = os.path.join(real_data_path, data_path) - return os.path.realpath(full_data_path) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_threshold.yaml b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_threshold.yaml deleted file mode 100644 index 0684bd8e9129653b6b69afcf43ab19207006801f..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_threshold.yaml +++ /dev/null @@ -1,390 +0,0 @@ -mul: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -mul_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__mul__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__imul__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__rmul__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -add: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -add_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 
- small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__add__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__iadd__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__radd__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -div: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -div_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__div__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__idiv__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -divide: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -divide_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -leaky_relu: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -leaky_relu_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -prelu: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -reciprocal: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -reciprocal_: - torch.float32: - rtol: 0.000001 - small_value: 
0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -rsqrt: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -rsqrt_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -square: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -square_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -sub: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -sub_: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -rsub: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__isub__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 -__sub__: - torch.float32: - rtol: 0.000001 - small_value: 0.000001 - small_value_atol: 0.000001 - torch.float16: - rtol: 0.001 - small_value: 0.001 - small_value_atol: 0.001 - torch.bfloat16: - rtol: 0.004 - small_value: 0.001 - small_value_atol: 0.001 diff --git "a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/doc/API Accuracy Checker\351\242\204\346\243\200\345\267\245\345\205\267\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" "b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/doc/API Accuracy Checker\351\242\204\346\243\200\345\267\245\345\205\267\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" deleted file mode 100644 index 740f72589a034476586c342d9709b05ea44a93d3..0000000000000000000000000000000000000000 --- "a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/doc/API Accuracy Checker\351\242\204\346\243\200\345\267\245\345\205\267\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" +++ /dev/null @@ -1,64 +0,0 @@ -# API Accuracy Checker预检工具标准性能基线报告 - -## 环境信息 - -NPU:Atlas A2 训练系列产品 - -CPU: - -![输入图片说明](https://foruda.gitee.com/images/1707274376423595920/8d725bef_10012209.png) - -Torch:2.1.0 - -CANN:8.0.T2 - 
-除上述环境信息影响性能外,API的数量、种类以及Shape都会对性能产生影响,因此本次选取指定网络进行测试。 - -## 多进程使用说明 - -1. 因预检工具run ut会在NPU和CPU上分别运行每个API的计算,开启多进程后会将指定总进程数平均分配给指定的NPU处理。经测试多进程数量需控制在每张卡不超过8个进程,8卡总计不超过63个进程。建议大模型场景下使用8卡56个进程。 -2. 进程数过多可能会造成环境的内存占用过高导致环境崩溃或NPU上out of memeory,若发生此类情况请减少总进程数。 -3. 因子进程拉起需要额外耗时,小模型场景下不建议开过多进程,过多进程性能提升可能并不明显。 -4. 若发生上述情况导致运行中断,可以使用断点续训功能减少进程数后重新运行。 - -## 模型信息和性能基线 - -以下场景的性能基线测试数据均为多次测试后取平均值,因此实际运行时性能数据可能会根据环境状态稍有浮动。 - -### YOLOV5 - -API:442个,主要数据类型:FLOAT32 - -单进程run_ut耗时:3m55s - -单卡8进程耗时:2m11s - -当API数量较少时,多进程计算性能提升不明显,因为拉起子进程需要额外耗时,此场景下不建议开过多进程。 - -### GPT-3 - -NUM_LAYER:1,API:170个, 主要数据类型:FLOAT16 - -单进程run_ut耗时:10m22s - -单卡8进程耗时:3m50s - -4卡16进程耗时:1m50s - -### GPT-3 - -NUM_LAYER:8,API:16782个,主要数据类型:FLOAT16 - -单进程run_ut耗时:大于2天(未跑完) - -8卡56个进程耗时:1h33m - -当API数量很多时多进程下性能提升明显,可以将天级的运行时长缩短至小时级。 - -### GLM - -API:6035个,主要数据类型:FLOAT16 - -单进程run_ut耗时:大于2天(未跑完) - -8卡56个进程耗时:2h40m diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/__init__.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/__init__.py deleted file mode 100644 index c9602292b85f753fd132634b98c74c76460997b0..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/__init__.py +++ /dev/null @@ -1 +0,0 @@ -__all__ = ['set_dump_switch'] diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/api_info.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/api_info.py deleted file mode 100644 index 7452cec74e80c812902341ef2af13d3f29c5f10c..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/api_info.py +++ /dev/null @@ -1,237 +0,0 @@ -# 定义API INFO,保存基本信息,用于后续结构体的落盘,注意考虑random场景及真实数据场景 -import os -import inspect -import torch -import numpy as np -from ..common.config import msCheckerConfig -from ..common.utils import print_error_log, write_pt, create_directory, DumpException, \ - get_real_data_path -from ...common.file_check import check_path_before_create - - -def get_tensor_extremum(data, operator): - if data.dtype is torch.bool: - if data.numel() == 0: - return False, False - if operator == 'max': - return True in data, True in data - elif operator == 'min': - return False not in data, False not in data - data_clone = data.float().clone().detach() - if operator == 'max': - max_result = torch._C._VariableFunctionsClass.max(data_clone).item() - if np.isinf(max_result) or np.isnan(max_result): - return handle_tensor_extremum_nan_inf(data_clone, operator), max_result - else: - return max_result, max_result - else: - min_result = torch._C._VariableFunctionsClass.min(data_clone).item() - if np.isinf(min_result) or np.isnan(min_result): - return handle_tensor_extremum_nan_inf(data_clone, operator), min_result - else: - return min_result, min_result - - -def handle_tensor_extremum_nan_inf(data_clone, operator): - data_nan = torch._C._VariableFunctionsClass.isnan(data_clone) - if int(torch._C._VariableFunctionsClass.sum(data_nan)) == data_clone.numel(): - return float('nan') - finite_mask = torch._C._VariableFunctionsClass.isfinite(data_clone) - if int(torch._C._VariableFunctionsClass.sum(finite_mask)) > 0: - finite_values = data_clone[finite_mask] - return torch._C._VariableFunctionsClass.max(finite_values).item() if operator == 'max' else \ - torch._C._VariableFunctionsClass.min(finite_values).item() - else: - data_no_nan = data_clone[~data_nan] - return torch._C._VariableFunctionsClass.max(data_no_nan).item() if operator == 'max' else \ - 
torch._C._VariableFunctionsClass.min(data_no_nan).item() - - -def get_type_name(name): - left = name.index("'") - right = name.rindex("'") - return name[left + 1: right] - - -def transfer_types(data, dtype): - if 'int' in dtype or 'bool' in dtype: - return int(data) - else: - return float(data) - - -def is_builtin_class(element): - return element is None or isinstance(element, (bool, int, float, str, slice)) - - -def analyze_device_in_kwargs(element): - single_arg = {} - single_arg.update({'type': 'torch.device'}) - if not isinstance(element, str): - if hasattr(element, "index"): - device_value = element.type + ":" + str(element.index) - else: - device_value = element.type - single_arg.update({'value': device_value}) - else: - single_arg.update({'value': element}) - return single_arg - - -def analyze_dtype_in_kwargs(element): - single_arg = {} - single_arg.update({'type': 'torch.dtype'}) - single_arg.update({'value': str(element)}) - return single_arg - - -class APIInfo: - def __init__(self, api_name, save_path, is_save_data=False): - self.api_name = api_name - self.torch_object_key = {'device': analyze_device_in_kwargs, 'dtype': analyze_dtype_in_kwargs} - self.rank = os.getpid() - self.is_save_data = is_save_data - self.save_path = save_path - self.args_num = 0 - - @staticmethod - def get_full_save_path(save_path, dir_name, contain_step=False): - if contain_step: - from calibrator.pytorch.api_accuracy_checker.dump.dump import DumpUtil - step_dir = "step" + str(DumpUtil.call_num - 1 if msCheckerConfig.enable_dataloader else DumpUtil.call_num) - rank_dir = f"rank{os.getpid()}" - return os.path.join(save_path, step_dir, dir_name, rank_dir) - else: - return os.path.join(save_path, dir_name) - - def analyze_element(self, element): - if isinstance(element, (list, tuple)): - out = [] - for item in element: - out.append(self.analyze_element(item)) - return out - - if isinstance(element, dict): - out_dict = {} - for key, value in element.items(): - if key in self.torch_object_key.keys(): - fun = self.torch_object_key[key] - out_dict[key] = fun(value) - else: - out_dict[key] = self.analyze_element(value) - return out_dict - - converted_numpy, numpy_type = self._convert_numpy_to_builtin(element) - if converted_numpy is not element: - return self._analyze_numpy(converted_numpy, numpy_type) - - if isinstance(element, torch.Tensor): - return self._analyze_tensor(element) - - if is_builtin_class(element): - return self._analyze_builtin(element) - - msg = f"Type {type(element)} is unsupported at analyze_element" - print_error_log(msg) - raise DumpException(DumpException.INVALID_DATA_ERROR) - - def _analyze_tensor(self, arg): - single_arg = {} - if not self.is_save_data: - single_arg.update({'type': 'torch.Tensor'}) - single_arg.update({'dtype': str(arg.dtype)}) - single_arg.update({'shape': arg.shape}) - max_handle, max_origin = get_tensor_extremum(arg, 'max') - single_arg.update({'Max': transfer_types(max_handle, str(arg.dtype))}) - single_arg.update({'Max_origin': transfer_types(max_origin, str(arg.dtype))}) - min_handle, min_origin = get_tensor_extremum(arg, 'min') - single_arg.update({'Min': transfer_types(min_handle, str(arg.dtype))}) - single_arg.update({'Min_origin': transfer_types(min_origin, str(arg.dtype))}) - single_arg.update({'requires_grad': arg.requires_grad}) - else: - api_args = self.api_name + '.' 
+ str(self.args_num) - check_path_before_create(self.save_path) - create_directory(self.save_path) - file_path = os.path.join(self.save_path, f'{api_args}.pt') - pt_path = write_pt(file_path, arg.contiguous().cpu().detach()) - self.args_num += 1 - real_data_path = get_real_data_path(pt_path) - single_arg.update({'type': 'torch.Tensor'}) - single_arg.update({'datapath': real_data_path}) - single_arg.update({'requires_grad': arg.requires_grad}) - return single_arg - - def _analyze_builtin(self, arg): - single_arg = {} - if self.is_save_data: - self.args_num += 1 - if isinstance(arg, slice): - single_arg.update({'type': "slice"}) - single_arg.update({'value': [arg.start, arg.stop, arg.step]}) - else: - single_arg.update({'type': get_type_name(str(type(arg)))}) - single_arg.update({'value': arg}) - return single_arg - - def _analyze_numpy(self, value, numpy_type): - single_arg = {} - if self.is_save_data: - self.args_num += 1 - single_arg.update({'type': numpy_type}) - single_arg.update({'value': value}) - return single_arg - - def _convert_numpy_to_builtin(self, arg): - type_mapping = { - np.integer: int, - np.floating: float, - np.bool_: bool, - np.complexfloating: complex, - np.str_: str, - np.bytes_: bytes, - np.unicode_: str - } - for numpy_type, builtin_type in type_mapping.items(): - if isinstance(arg, numpy_type): - return builtin_type(arg), get_type_name(str(type(arg))) - return arg, '' - - -class ForwardAPIInfo(APIInfo): - def __init__(self, name, args, kwargs): - super().__init__(name, - self.get_full_save_path(msCheckerConfig.dump_path, 'forward_real_data', contain_step=True), - is_save_data=msCheckerConfig.real_data) - self.api_info_struct = {} - self.stack_info_struct = {} - self.analyze_api_input(args, kwargs) - self.analyze_api_call_stack() - - def analyze_api_input(self, args, kwargs): - args_info_list = self.analyze_element(args) - kwargs_info_dict = self.analyze_element(kwargs) - self.api_info_struct = {self.api_name: {"args": args_info_list, "kwargs": kwargs_info_dict}} - - def analyze_api_call_stack(self): - stack_str = [] - for (_, path, line, func, code, _) in inspect.stack()[3:]: - if not code: - continue - stack_line = " ".join([ - "File", ", ".join([path, " ".join(["line", str(line)]), " ".join(["in", func]), - " ".join(["\n", code[0].strip()])])]) - stack_str.append(stack_line) - self.stack_info_struct = {self.api_name: stack_str} - - -class BackwardAPIInfo(APIInfo): - def __init__(self, name, grads): - super().__init__(name, - self.get_full_save_path(msCheckerConfig.dump_path, 'backward_real_data', contain_step=True), - is_save_data=msCheckerConfig.real_data) - self.grad_info_struct = {} - self.analyze_api_input(grads) - - def analyze_api_input(self, grads): - grads_info_list = self.analyze_element(grads) - self.grad_info_struct = {self.api_name: grads_info_list} diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/dump.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/dump.py deleted file mode 100644 index b20378fd45d322e1e2e4a61031c8c1fa240ca5a0..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/dump.py +++ /dev/null @@ -1,109 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" - -from .api_info import ForwardAPIInfo, BackwardAPIInfo -from .info_dump import write_api_info_json, initialize_output_json -from ..common.utils import print_error_log, CompareException, print_info_log -from ..hook_module.register_hook import initialize_hook -from ..common.config import msCheckerConfig - - -def set_dump_switch(switch): - if switch not in ["ON", "OFF"]: - print_error_log("Please set switch with 'ON' or 'OFF'.") - raise CompareException(CompareException.INVALID_PARAM_ERROR) - if switch == "ON": - initialize_hook(pretest_hook) - initialize_output_json() - DumpUtil.set_dump_switch(switch) - - -def check_dataloader_status(): - if msCheckerConfig.enable_dataloader: - error_info = ("If you want to use this function, set enable_dataloader " - "in the accuracy_tools/api_accuracy_check/config.yaml " - "to False first") - raise CompareException(CompareException.INVALID_PARAM_ERROR, error_info) - - -def start(): - check_dataloader_status() - if not DumpUtil.get_dump_switch(): - DumpUtil.incr_iter_num_maybe_exit() - - -def stop(): - check_dataloader_status() - DumpUtil.set_dump_switch("OFF") - - -def step(): - check_dataloader_status() - DumpUtil.call_num += 1 - - -class DumpUtil(object): - dump_switch = None - call_num = 0 - - @staticmethod - def set_dump_switch(switch): - DumpUtil.dump_switch = switch - - @staticmethod - def get_dump_switch(): - return DumpUtil.dump_switch == "ON" - - @staticmethod - def incr_iter_num_maybe_exit(): - if DumpUtil.call_num in msCheckerConfig.target_iter: - set_dump_switch("ON") - elif DumpUtil.call_num > max(msCheckerConfig.target_iter): - raise Exception("Model pretest: exit after iteration {}".format(DumpUtil.call_num - 1)) - else: - set_dump_switch("OFF") - - -class DumpConst: - delimiter = '*' - forward = 'forward' - backward = 'backward' - - -def pretest_info_dump(name, out_feat, module, phase): - if not DumpUtil.get_dump_switch(): - return - if phase == DumpConst.forward: - api_info = ForwardAPIInfo(name, module.input_args, module.input_kwargs) - elif phase == DumpConst.backward: - api_info = BackwardAPIInfo(name, out_feat) - else: - msg = "Unexpected training phase {}.".format(phase) - print_error_log(msg) - raise NotImplementedError(msg) - print_info_log(f"tools is dumping api: {name}" + " " * 10, end='\r') - write_api_info_json(api_info) - - -def pretest_hook(name, phase): - def pretest_info_dump_hook(module, in_feat, out_feat): - pretest_info_dump(name, out_feat, module, phase) - if hasattr(module, "input_args"): - del module.input_args - if hasattr(module, "input_kwargs"): - del module.input_kwargs - return pretest_info_dump_hook diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/dump_scope.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/dump_scope.py deleted file mode 100644 index ac78fa8ccae9f5935d919b62ec72ed588b290a9f..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/dump_scope.py +++ /dev/null @@ -1,22 +0,0 @@ -# dump范围控制 -import torch -from torch.utils.data.dataloader import _BaseDataLoaderIter -from ..dump.dump import 
DumpUtil -from ..common.config import msCheckerConfig - - -def iter_tracer(original_next): - def func_wrapper(*args, **kwargs): - if msCheckerConfig.enable_dataloader: - DumpUtil.dump_switch = "OFF" - result = original_next(*args, **kwargs) - DumpUtil.incr_iter_num_maybe_exit() - DumpUtil.call_num += 1 - return result - else: - return original_next(*args, **kwargs) - return func_wrapper - -original_next_method = _BaseDataLoaderIter.__next__ - -_BaseDataLoaderIter.__next__ = iter_tracer(original_next_method) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/info_dump.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/info_dump.py deleted file mode 100644 index 31165077165c724f0e10ad0e279f5a59593cfd48..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/info_dump.py +++ /dev/null @@ -1,72 +0,0 @@ -import fcntl -import json -import os -import threading -import multiprocessing - -from ..dump.api_info import ForwardAPIInfo, BackwardAPIInfo -from ..common.utils import check_file_or_directory_path, create_directory -from ...common.file_check import check_path_before_create -from ...common.file_check import FileOpen, FileCheckConst, FileChecker, change_mode -from ..common.config import msCheckerConfig - - -lock = threading.Lock() -proc_lock = multiprocessing.Lock() - - -def write_api_info_json(api_info): - from ..dump.dump import DumpUtil - dump_path = msCheckerConfig.dump_path - dump_path = os.path.join(msCheckerConfig.dump_path, "step" + str((DumpUtil.call_num - 1) if msCheckerConfig.enable_dataloader else DumpUtil.call_num)) - check_path_before_create(dump_path) - create_directory(dump_path) - rank = api_info.rank - if isinstance(api_info, ForwardAPIInfo): - file_path = os.path.join(dump_path, f'forward_info_{rank}.json') - stack_file_path = os.path.join(dump_path, f'stack_info_{rank}.json') - write_json(file_path, api_info.api_info_struct) - write_json(stack_file_path, api_info.stack_info_struct, indent=4) - - elif isinstance(api_info, BackwardAPIInfo): - file_path = os.path.join(dump_path, f'backward_info_{rank}.json') - write_json(file_path, api_info.grad_info_struct) - else: - raise ValueError(f"Invalid api_info type {type(api_info)}") - - -def write_json(file_path, data, indent=None): - check_file_or_directory_path(os.path.dirname(file_path), True) - with proc_lock, lock, FileOpen(file_path, 'a+') as f: - fcntl.flock(f, fcntl.LOCK_EX) - try: - f.seek(0, os.SEEK_END) - current_position = f.tell() - if current_position > 0: - f.seek(current_position - 1, os.SEEK_SET) - f.truncate() - if f.tell() > 3: - f.seek(f.tell() - 1, os.SEEK_SET) - f.truncate() - f.write(',\n') - f.write(json.dumps(data, indent=indent)[1:-1] + '\n}') - else: - change_mode(file_path, FileCheckConst.DATA_FILE_AUTHORITY) - f.write('{\n' + json.dumps(data, indent=indent)[1:] + '\n') - except Exception as e: - raise ValueError(f"Json save failed:{e}") from e - finally: - fcntl.flock(f, fcntl.LOCK_UN) - - -def initialize_output_json(): - dump_path = msCheckerConfig.dump_path - check_path_before_create(dump_path) - create_directory(dump_path) - dump_path_checker = FileChecker(dump_path, FileCheckConst.DIR) - dump_path = dump_path_checker.common_check() - files = ['forward_info.json', 'backward_info.json', 'stack_info.json'] - for file in files: - file_path = os.path.join(dump_path, file) - if os.path.exists(file_path): - raise ValueError(f"file {file_path} already exists, please remove it first or use a new 
dump path") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/hook_module.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/hook_module.py deleted file mode 100644 index 02d5fa5500e470a158b980ff889ab4d7a7ec25bf..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/hook_module.py +++ /dev/null @@ -1,113 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" - - -import functools - -import torch -import torch.nn as nn -import torch.utils.hooks as full_hooks - -module_count = {} -g_stop_hook = False - - -class HOOKModule(nn.Module): - - def __init__(self, hook) -> None: - super(HOOKModule, self).__init__() - self.has_overflow = False - self.input_args = tuple() - self.input_kwargs = dict() - self._enable_hook = True - prefix = "" - if hasattr(self, "prefix_op_name_"): - prefix = self.prefix_op_name_ - - if prefix not in module_count: - module_count[prefix] = 1 - prefix += '0' - else: - module_count[prefix] += 1 - prefix = prefix + str(module_count[prefix] - 1) - - self.register_forward_hook(hook(prefix, "forward")) - self.register_backward_hook(hook(prefix, "backward")) - - def __call__(self, *inputs, **kwargs): - changed = False - global g_stop_hook - if g_stop_hook: - self._enable_hook = False - else: - g_stop_hook = True - changed = True - result = self._call_func(*inputs, **kwargs) - if changed: - g_stop_hook = False - return result - - def _call_func(self, *inputs, **kwargs): - if self._enable_hook: - full_backward_hooks, non_full_backward_hooks = [], [] - if len(self._backward_hooks) > 0: - full_backward_hooks, non_full_backward_hooks = self._get_backward_hooks() - for hook in self._forward_pre_hooks.values(): - result = hook(self, inputs) - if result is not None: - if not isinstance(result, tuple): - result = (result,) - inputs = result - bw_hook = None - if len(full_backward_hooks) > 0: - bw_hook = full_hooks.BackwardHook(self, full_backward_hooks) - inputs = bw_hook.setup_input_hook(inputs) - self.input_args = inputs - self.input_kwargs = kwargs - if torch._C._get_tracing_state(): - result = self._slow_forward(*inputs, **kwargs) - else: - result = self.forward(*inputs, **kwargs) - for hook in self._forward_hooks.values(): - hook_result = hook(self, inputs, result) - if hook_result is not None: - result = hook_result - if bw_hook: - result = bw_hook.setup_output_hook(result) - if len(non_full_backward_hooks) > 0: - var = result - while not isinstance(var, torch.Tensor): - if isinstance(var, dict): - var = next((v for v in var.values() if isinstance(v, torch.Tensor))) - elif isinstance(var, (list, tuple)): - if var: - var = var[0] - else: - return result - else: - return result - grad_fn = var.grad_fn - if grad_fn is not None: - for hook in non_full_backward_hooks: - wrapper = functools.partial(hook, self) - functools.update_wrapper(wrapper, 
hook) - grad_fn.register_hook(wrapper) - self._maybe_warn_non_full_backward_hook(inputs, result, grad_fn) - return result - else: - forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.forward) - return forward_call(*inputs, **kwargs) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_functional.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_functional.py deleted file mode 100644 index 056c1d047eb592f0006e3632eaa5597eba5630da..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_functional.py +++ /dev/null @@ -1,63 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" - -import torch - -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard -from ..common.config import msCheckerConfig - -for f in dir(torch.nn.functional): - locals().update({f: getattr(torch.nn.functional, f)}) - - -def get_functional_ops(): - global WrapFunctionalOps - _all_functional_ops = dir(torch.nn.functional) - if msCheckerConfig.white_list: - return set(WrapFunctionalOps) & set(_all_functional_ops) & set(msCheckerConfig.white_list) - else: - return set(WrapFunctionalOps) & set(_all_functional_ops) - - -class HOOKFunctionalOP(object): - pass - - -class FunctionalOPTemplate(HOOKModule): - def __init__(self, op_name, hook, need_hook=True): - self.op_name_ = op_name - self.prefix_op_name_ = "Functional*" + str(op_name) + "*" - if need_hook: - super().__init__(hook) - - @torch_device_guard - def forward(self, *args, **kwargs): - return eval(self.op_name_)(*args, **kwargs) - - -def wrap_functional_op(op_name, hook): - def functional_op_template(*args, **kwargs): - return FunctionalOPTemplate(op_name, hook)(*args, **kwargs) - - return functional_op_template - - -def wrap_functional_ops_and_bind(hook): - _functional_ops = get_functional_ops() - for op_name in _functional_ops: - setattr(HOOKFunctionalOP, "wrap_" + op_name, wrap_functional_op(op_name, hook)) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_torch.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_torch.py deleted file mode 100644 index aab245b5d21daff0e0ea44e4073333c6854f95ac..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_torch.py +++ /dev/null @@ -1,105 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" - -import torch - -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard -from ..common.config import msCheckerConfig - - -def get_torch_ops(): - global WrapTorchOps - _torch_ops = dir(torch._C._VariableFunctionsClass) - if msCheckerConfig.white_list: - return set(WrapTorchOps) & set(_torch_ops) & set(msCheckerConfig.white_list) - else: - return set(WrapTorchOps) & set(_torch_ops) - - -class HOOKTorchOP(object): - pass - - -class TorchOPTemplate(HOOKModule): - - def __init__(self, op_name, hook, need_hook=True): - self.op_name_ = op_name - self.prefix_op_name_ = "Torch*" + str(op_name) + "*" - if need_hook: - super().__init__(hook) - - def input_param_need_adapt(self): - special_op_list = ["broadcast_tensors", "block_diag"] - for item in special_op_list: - if item in self.op_name_: - return True - return False - - def einsum_adapt(self, *args): - if len(args) < 2: - raise ValueError('einsum(): must specify the equation string and at least one operand, ' - 'or at least one operand and its subscripts list') - equation = None - operands = None - if isinstance(args[0], torch.Tensor): - def parse_subscript(n: int) -> str: - if n == Ellipsis: - return '...' - if n >= 0 and n < 26: - return chr(ord('A') + n) - if n >= 26 and n < 52: - return chr(ord('a') + n - 26) - raise ValueError('einsum(): subscript in subscript list is not within the valid range [0, 52]') - equation = ','.join(''.join(parse_subscript(script) for script in arg) for arg in args[1::2]) - - if len(args) % 2 == 1: - equation += '->' + ''.join(parse_subscript(script) for script in args[-1]) - operands = args[:-1:2] - else: - operands = args[::2] - else: - equation = args[0] - operands = args[1:] - - if len(operands) == 1 and isinstance(operands[0], (list, tuple)): - _operands = operands[0] - return self.einsum_adapt(equation, *_operands) - return equation, operands - - @torch_device_guard - def forward(self, *args, **kwargs): - if self.input_param_need_adapt(): - return getattr(torch._C._VariableFunctionsClass, str(self.op_name_))(args, **kwargs) - else: - if self.op_name_ == 'einsum': - args = self.einsum_adapt(*args) - return getattr(torch._C._VariableFunctionsClass, str(self.op_name_))(*args, **kwargs) - - -def wrap_torch_op(op_name, hook): - - def torch_op_template(*args, **kwargs): - return TorchOPTemplate(op_name, hook)(*args, **kwargs) - - return torch_op_template - - -def wrap_torch_ops_and_bind(hook): - _torch_ops = get_torch_ops() - for op_name in _torch_ops: - setattr(HOOKTorchOP, "wrap_" + op_name, wrap_torch_op(op_name, hook)) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/accuracy_checking_details.png b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/accuracy_checking_details.png deleted file mode 100644 index ddc4fb348ee55197459c7303b0817853e201ace4..0000000000000000000000000000000000000000 Binary files a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/accuracy_checking_details.png and /dev/null differ diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/accuracy_checking_result.png 
b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/accuracy_checking_result.png deleted file mode 100644 index aa0b29d8d057ff806d5f5e82a35c5ce085dee1f3..0000000000000000000000000000000000000000 Binary files a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/accuracy_checking_result.png and /dev/null differ diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/api_precision_compare_details.png b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/api_precision_compare_details.png deleted file mode 100644 index c3fd909a8d187fd6a725c7f3cc6798989d3fa0cf..0000000000000000000000000000000000000000 Binary files a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/api_precision_compare_details.png and /dev/null differ diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/api_precision_compare_result.png b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/api_precision_compare_result.png deleted file mode 100644 index 2b95897031441408f6a88185e3cda36e4fea8049..0000000000000000000000000000000000000000 Binary files a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/img/api_precision_compare_result.png and /dev/null differ diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/.keep b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/.keep deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/resources/forward.json b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/resources/forward.json deleted file mode 100644 index f938f352460a87222bdb5346873904cb420996cc..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/resources/forward.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "Functional*silu*0": {"args": [{"type": "torch.Tensor", "dtype": "torch.float32", "shape": [2, 2560, 24, 24], "Max": 5.7421875, "Max_origin": 5.7421875, "Min": -5.125, "Min_origin": -5.125, "requires_grad": true}], "kwargs" :{"inplace": {"type": "bool", "value": false}}} -} \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_ut.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_ut.py deleted file mode 100644 index c73949697941d84782c4983aa484c06b1a7cbcc2..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_ut.py +++ /dev/null @@ -1,41 +0,0 @@ -import os -import shutil -import subprocess -import sys - -def run_ut(): - cur_dir = os.path.realpath(os.path.dirname(__file__)) - top_dir = os.path.realpath(os.path.dirname(cur_dir)) - ut_path = os.path.join(cur_dir, "ut/") - src_dir = top_dir - report_dir = os.path.join(cur_dir, "report") - - if os.path.exists(report_dir): - shutil.rmtree(report_dir) - - os.makedirs(report_dir) - - cmd = ["python3", "-m", "pytest", ut_path, "--junitxml=" + report_dir + "/final.xml", - "--cov=" + src_dir, "--cov-branch", "--cov-report=xml:" + report_dir + "/coverage.xml"] - - result_ut = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - - while result_ut.poll() is None: - line = result_ut.stdout.readline().strip() - if line: - print(line) - - ut_flag = False - if result_ut.returncode == 0: - ut_flag = True - print("run ut successfully.") - else: - print("run ut failed.") - - return ut_flag - -if __name__=="__main__": - if run_ut(): - 
sys.exit(0) - else: - sys.exit(1) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/test_common_utils.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/test_common_utils.py deleted file mode 100644 index 5f25e81c09783eeb8c682fd33d3178b99352f6e0..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/test_common_utils.py +++ /dev/null @@ -1,124 +0,0 @@ -import unittest -import os -import numpy as np -import torch -from api_accuracy_checker.common.utils import * - -class TestUtils(unittest.TestCase): - - def test_read_json(self): - test_dict = {"key": "value"} - with open('test.json', 'w') as f: - json.dump(test_dict, f) - self.assertEqual(read_json('test.json'), test_dict) - os.remove('test.json') - - def test_write_csv(self): - test_data = [["name", "age"], ["Alice", "20"], ["Bob", "30"]] - write_csv(test_data, 'test.csv') - with open('test.csv', 'r', encoding='utf-8-sig') as f: - reader = csv.reader(f) - for i, row in enumerate(reader): - self.assertEqual(row, test_data[i]) - os.remove('test.csv') - - def test_print_info_log(self): - try: - print_info_log("Test message") - except Exception as e: - self.fail(f"print_info_log raised exception {e}") - - def test_check_mode_valid(self): - try: - check_mode_valid(Const.ALL) - except Exception as e: - self.fail(f"check_mode_valid raised exception {e}") - - def test_check_object_type(self): - try: - check_object_type(123, int) - except Exception as e: - self.fail(f"check_object_type raised exception {e}") - - def test_check_file_or_directory_path(self): - try: - check_file_or_directory_path(__file__) - except Exception as e: - self.fail(f"check_file_or_directory_path raised exception {e}") - - def test_get_dump_data_path(self): - path, exist = get_dump_data_path(os.path.dirname(__file__)) - self.assertTrue(exist) - - def test_create_directory(self): - create_directory('test_dir') - self.assertTrue(os.path.exists('test_dir')) - os.rmdir('test_dir') - - def test_execute_command(self): - execute_command(['echo', 'Hello, World!']) - - def test_parse_arg_value(self): - values = "1,2,3;4,5,6" - expected_result = [[1, 2, 3], [4, 5, 6]] - self.assertEqual(parse_arg_value(values), expected_result) - - def test_parse_value_by_comma(self): - value = "1,2,3" - expected_result = [1, 2, 3] - self.assertEqual(parse_value_by_comma(value), expected_result) - - def test_get_data_len_by_shape(self): - shape = [2, 3, 4] - expected_result = 24 - self.assertEqual(get_data_len_by_shape(shape), expected_result) - - def test_add_time_as_suffix(self): - name = "test" - result = add_time_as_suffix(name) - self.assertTrue(result.startswith(name)) - - def test_get_time(self): - result = get_time() - self.assertTrue(isinstance(result, str)) - - def test_format_value(self): - value = 123.456789 - expected_result = '123.456789' - self.assertEqual(format_value(value), expected_result) - - def test_seed_all(self): - seed_all(1234) - - def test_get_process_rank(self): - model = torch.nn.Linear(10, 10) - rank, _ = get_process_rank(model) - self.assertEqual(rank, 0) - - def test_get_json_contents(self): - test_dict = {"key": "value"} - with open('test.json', 'w') as f: - json.dump(test_dict, f) - self.assertEqual(get_json_contents('test.json'), test_dict) - os.remove('test.json') - - def test_get_file_content_bytes(self): - with open('test.txt', 'w') as f: - f.write("Hello, World!") - self.assertEqual(get_file_content_bytes('test.txt'), b"Hello, World!") - 
os.remove('test.txt') - - def test_islink(self): - self.assertFalse(islink(__file__)) - - def test_check_path_length_valid(self): - self.assertTrue(check_path_length_valid(__file__)) - - def test_check_path_pattern_valid(self): - self.assertIsNone(check_path_pattern_valid(__file__)) - - def test_check_input_file_valid(self): - self.assertIsNone(check_input_file_valid(__file__)) - - def test_check_need_convert(self): - self.assertIsNone(check_need_convert("unknown_api")) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/test_config.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/test_config.py deleted file mode 100644 index a68057dfb41ca38ba79e1daa992a8f51ce4d64e4..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/test_config.py +++ /dev/null @@ -1,21 +0,0 @@ -import unittest -import os -from api_accuracy_checker.common.config import Config - -class TestConfig(unittest.TestCase): - def setUp(self): - cur_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))) - yaml_path = os.path.join(cur_path, "config.yaml") - self.yaml_file = yaml_path - self.config = Config(self.yaml_file) - - def test_validate(self): - self.assertEqual(self.config.validate('dump_path', '/path/to/dump'), '/path/to/dump') - - with self.assertRaises(ValueError): - self.config.validate('dump_path', 123) - - - def test_update_config(self): - self.config.update_config(dump_path='/new/path/to/dump') - self.assertEqual(self.config.dump_path, '/new/path/to/dump') diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_algorithm.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_algorithm.py deleted file mode 100644 index 90e18d166f56f98b8c1e1f80f2ae28dab7db67d3..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_algorithm.py +++ /dev/null @@ -1,32 +0,0 @@ -import unittest -import numpy as np -import torch -from api_accuracy_checker.compare import compare as cmp -from api_accuracy_checker.compare import algorithm as alg - -class TestAlgorithmMethods(unittest.TestCase): - - def test_get_max_abs_err(self): - b_value = np.array([1.0, 2.0, 3.0]) - n_value = np.array([1.0, 2.0, 3.0]) - abs_err = np.abs(b_value - n_value) - self.assertEqual(alg.get_max_abs_err(abs_err), (0.0, True)) - - def test_get_rel_err_ratio_thousandth(self): - b_value = np.array([1.0, 2.0, 3.0]) - n_value = np.array([1.0, 2.0, 3.0]) - abs_err = np.abs(b_value - n_value) - rel_err = alg.get_rel_err_origin(abs_err, b_value) - self.assertEqual(alg.get_rel_err_ratio(rel_err, 0.001), (1.0, True)) - - def test_get_rel_err_ratio_ten_thousandth(self): - b_value = np.array([1.0, 2.0, 3.0]) - n_value = np.array([1.0, 2.0, 3.0]) - abs_err = np.abs(b_value - n_value) - rel_err = alg.get_rel_err_origin(abs_err, b_value) - self.assertEqual(alg.get_rel_err_ratio(rel_err, 0.0001), (1.0, True)) - - def test_cosine_sim(self): - cpu_output = np.array([1.0, 2.0, 3.0]) - npu_output = np.array([1.0, 2.0, 3.0]) - self.assertEqual(alg.cosine_sim(cpu_output, npu_output), (1.0, True, '')) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_api_info.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_api_info.py deleted file mode 100644 index 2c03d56e722decc424052367dfe9700ba3df94ce..0000000000000000000000000000000000000000 --- 
a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_api_info.py +++ /dev/null @@ -1,131 +0,0 @@ -import os -import shutil -import unittest -import torch -import numpy as np -from api_accuracy_checker.dump.api_info import APIInfo, ForwardAPIInfo, BackwardAPIInfo, transfer_types, \ - get_tensor_extremum, get_type_name, is_builtin_class, analyze_device_in_kwargs, analyze_dtype_in_kwargs -from api_accuracy_checker.common.config import msCheckerConfig - - -class TestAPIInfo(unittest.TestCase): - def setUp(self): - if os.path.exists('./step-1'): - shutil.rmtree('./step-1') - self.api = APIInfo("test_api", APIInfo.get_full_save_path("./", "forward_real_data", True), True) - - def test_analyze_element(self): - element = [1, 2, 3] - result = self.api.analyze_element(element) - self.assertEqual(result, - [{'type': 'int', 'value': 1}, {'type': 'int', 'value': 2}, {'type': 'int', 'value': 3}]) - - def test_analyze_tensor(self): - tensor = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True) - result = self.api._analyze_tensor(tensor) - self.assertEqual(result.get('type'), 'torch.Tensor') - self.assertTrue(result.get('requires_grad')) - datapath = result.get('datapath') - self.assertTrue(datapath.startswith('forward_real_data') or datapath.startswith('backward_real_data')) - - def test_analyze_builtin(self): - arg = slice(1, 10, 2) - result = self.api._analyze_builtin(arg) - self.assertEqual(result, {'type': 'slice', 'value': [1, 10, 2]}) - - def test_transfer_types(self): - data = 10 - dtype = 'int' - result = transfer_types(data, dtype) - self.assertEqual(result, 10) - - def test_is_builtin_class(self): - element = 10 - result = is_builtin_class(element) - self.assertTrue(result) - - def test_analyze_device_in_kwargs(self): - element = torch.device('cuda:0') - result = analyze_device_in_kwargs(element) - self.assertEqual(result, {'type': 'torch.device', 'value': 'cuda:0'}) - - def test_analyze_dtype_in_kwargs(self): - element = torch.float32 - result = analyze_dtype_in_kwargs(element) - self.assertEqual(result, {'type': 'torch.dtype', 'value': 'torch.float32'}) - - def test_get_tensor_extremum(self): - data = torch.tensor([1, 2, 3]) - result_max, result_max_origin = get_tensor_extremum(data, 'max') - result_min, result_min_origin = get_tensor_extremum(data, 'min') - self.assertEqual(result_max, 3) - self.assertEqual(result_min, 1) - self.assertEqual(result_max_origin, 3) - self.assertEqual(result_min_origin, 1) - - data = torch.tensor([1, float("inf"), 2, 3]) - result_max, result_max_origin = get_tensor_extremum(data, 'max') - result_min, result_min_origin = get_tensor_extremum(data, 'min') - self.assertEqual(result_max, 3) - self.assertEqual(result_min, 1) - self.assertEqual(result_max_origin, float("inf")) - self.assertEqual(result_min_origin, 1) - - data = torch.tensor([1, float("-inf"), 2, 3]) - result_max, result_max_origin = get_tensor_extremum(data, 'max') - result_min, result_min_origin = get_tensor_extremum(data, 'min') - self.assertEqual(result_max, 3) - self.assertEqual(result_min, 1) - self.assertEqual(result_max_origin, 3) - self.assertEqual(result_min_origin, float("-inf")) - - data = torch.tensor([1, float("inf"), float("nan"), 3]) - result_max, result_max_origin = get_tensor_extremum(data, 'max') - result_min, result_min_origin = get_tensor_extremum(data, 'min') - self.assertEqual(result_max, 3) - self.assertEqual(result_min, 1) - self.assertTrue(np.isnan(result_max_origin)) - self.assertTrue(np.isnan(result_min_origin)) - - data = 
torch.tensor([float("inf"), float("nan")]) - result_max, result_max_origin = get_tensor_extremum(data, 'max') - result_min, result_min_origin = get_tensor_extremum(data, 'min') - self.assertEqual(result_max, float("inf")) - self.assertEqual(result_min, float("inf")) - self.assertTrue(np.isnan(result_max_origin)) - self.assertTrue(np.isnan(result_min_origin)) - - data = torch.tensor([float("nan"), float("nan")]) - result_max, result_max_origin = get_tensor_extremum(data, 'max') - result_min, result_min_origin = get_tensor_extremum(data, 'min') - self.assertTrue(np.isnan(result_max)) - self.assertTrue(np.isnan(result_min)) - self.assertTrue(np.isnan(result_max_origin)) - self.assertTrue(np.isnan(result_min_origin)) - - def test_get_type_name(self): - name = "" - result = get_type_name(name) - self.assertEqual(result, 'int') - - def test_ForwardAPIInfo(self): - forward_api_info = ForwardAPIInfo("test_forward_api", [1, 2, 3], {"a": 1, "b": 2}) - self.assertEqual(forward_api_info.api_name, "test_forward_api") - self.assertEqual(forward_api_info.save_path, - APIInfo.get_full_save_path(msCheckerConfig.dump_path, 'forward_real_data', True)) - self.assertEqual(forward_api_info.api_info_struct, {"test_forward_api": { - "args": [{'type': 'int', 'value': 1}, {'type': 'int', 'value': 2}, {'type': 'int', 'value': 3}, ], - "kwargs": {'a': {'type': 'int', 'value': 1}, 'b': {'type': 'int', 'value': 2}}}}) - - def test_BackwardAPIInfo(self): - backward_api_info = BackwardAPIInfo("test_backward_api", [1, 2, 3]) - self.assertEqual(backward_api_info.api_name, "test_backward_api") - self.assertEqual(backward_api_info.save_path, - APIInfo.get_full_save_path(msCheckerConfig.dump_path, 'backward_real_data', True)) - self.assertEqual(backward_api_info.grad_info_struct, { - "test_backward_api": [{'type': 'int', 'value': 1}, {'type': 'int', 'value': 2}, - {'type': 'int', 'value': 3}]}) - - def tearDown(self): - if os.path.exists('./step-1'): - shutil.rmtree('./step-1') diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_dump.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_dump.py deleted file mode 100644 index 655e624e809a5cceb406b9fce9df4e4f89efb4ee..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_dump.py +++ /dev/null @@ -1,32 +0,0 @@ -import unittest -from api_accuracy_checker.dump.dump import * - -class TestDumpUtil(unittest.TestCase): - def test_set_dump_switch(self): - set_dump_switch("ON") - self.assertEqual(DumpUtil.dump_switch, "ON") - set_dump_switch("OFF") - self.assertEqual(DumpUtil.dump_switch, "OFF") - - def test_get_dump_switch(self): - DumpUtil.dump_switch = "ON" - self.assertTrue(DumpUtil.get_dump_switch()) - DumpUtil.dump_switch = "OFF" - self.assertFalse(DumpUtil.get_dump_switch()) - - def test_incr_iter_num_maybe_exit(self): - msCheckerConfig.target_iter = [5] - msCheckerConfig.enable_dataloader = True - - DumpUtil.call_num = 6 - with self.assertRaises(Exception): - DumpUtil.incr_iter_num_maybe_exit() - - DumpUtil.call_num = 4 - DumpUtil.incr_iter_num_maybe_exit() - self.assertEqual(DumpUtil.dump_switch, "OFF") - - msCheckerConfig.enable_dataloader = False - DumpUtil.call_num = 5 - DumpUtil.incr_iter_num_maybe_exit() - self.assertEqual(DumpUtil.dump_switch, "ON") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_dump_scope.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_dump_scope.py deleted file 
mode 100644 index 7712552abe49d757a07bcbbd746038ed22d4027b..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_dump_scope.py +++ /dev/null @@ -1,23 +0,0 @@ -import unittest -from api_accuracy_checker.dump.dump_scope import iter_tracer -from api_accuracy_checker.dump.dump import DumpUtil - - -class TestDumpScope(unittest.TestCase): - def test_iter_tracer(self): - DumpUtil.call_num = 0 - - def dummy_func(): - return "Hello, World!" - - wrapped_func = iter_tracer(dummy_func) - result = wrapped_func() - self.assertEqual(DumpUtil.dump_switch, "OFF") - self.assertEqual(result, "Hello, World!") - - def another_dummy_func(): - return 123 - wrapped_func = iter_tracer(another_dummy_func) - result = wrapped_func() - self.assertEqual(DumpUtil.dump_switch, "OFF") - self.assertEqual(result, 123) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_info_dump.py b/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_info_dump.py deleted file mode 100644 index 45e57f2c389292e9226039f56b83966941c603ca..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/test_info_dump.py +++ /dev/null @@ -1,28 +0,0 @@ -import unittest -import os -from unittest.mock import patch -from api_accuracy_checker.dump.api_info import APIInfo, BackwardAPIInfo -from api_accuracy_checker.dump.info_dump import write_api_info_json - - -class TestInfoDump(unittest.TestCase): - - def test_write_api_info_json_backward(self): - api_info = BackwardAPIInfo("test_backward_api", [1, 2, 3]) - with patch('api_accuracy_checker.dump.info_dump.write_json') as mock_write_json: - write_api_info_json(api_info) - rank = os.getpid() - mock_write_json.assert_called_with(f'./step0/backward_info_{rank}.json', api_info.grad_info_struct) - - def test_write_api_info_json_invalid_type(self): - api_info = APIInfo("test_api", APIInfo.get_full_save_path("save_path", "forward_real_data", contain_step=True), - is_save_data=True) - with self.assertRaises(ValueError): - write_api_info_json(api_info) - - def tearDown(self): - rank = os.getpid() - files = [f'./step0/backward_info_{rank}.json'] - for file in files: - if os.path.exists(file): - os.remove(file) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/common/log.py b/debug/accuracy_tools/atat/pytorch/common/log.py deleted file mode 100644 index fab5aca45c08af7253dedf8ee13db10b271683da..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/common/log.py +++ /dev/null @@ -1,59 +0,0 @@ -import os -import time -import sys -from .utils import get_rank_if_initialized - - -def on_rank_0(func): - def func_rank_0(*args, **kwargs): - current_rank = get_rank_if_initialized() - if current_rank is None or current_rank == 0: - return func(*args, **kwargs) - - return func_rank_0 - - -def _print_log(level, msg, end='\n'): - current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(time.time()))) - pid = os.getpid() - full_msg = current_time + "(" + str(pid) + ")-[" + level + "]" + msg - current_rank = get_rank_if_initialized() - if current_rank is not None: - full_msg = f"[rank {current_rank}]-" + full_msg - print(full_msg, end=end) - sys.stdout.flush() - - -def print_info_log(info_msg, end='\n'): - """ - Function Description: - print info log. - Parameter: - info_msg: the info message. 
- """ - _print_log("INFO", info_msg, end=end) - - -def print_error_log(error_msg): - """ - Function Description: - print error log. - Parameter: - error_msg: the error message. - """ - _print_log("ERROR", error_msg) - - -def print_warn_log(warn_msg): - """ - Function Description: - print warn log. - Parameter: - warn_msg: the warning message. - """ - _print_log("WARNING", warn_msg) - - -print_info_log_rank_0 = on_rank_0(print_info_log) -print_warn_log_rank_0 = on_rank_0(print_warn_log) -print_error_log_rank_0 = on_rank_0(print_error_log) diff --git a/debug/accuracy_tools/atat/pytorch/common/recursive.py b/debug/accuracy_tools/atat/pytorch/common/recursive.py deleted file mode 100644 index c8a19a63117d332b138fec4d38d7efa20f7ddebe..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/common/recursive.py +++ /dev/null @@ -1,28 +0,0 @@ -import torch -import numpy as np -from .log import print_warn_log - -_recursive_key_stack = [] -special_type = (torch.device, torch.dtype, torch.Size, torch.Tensor, np.integer, np.floating, np.bool_, np.complexfloating, \ - np.str_, np.byte, np.unicode_, bool, int, float, str, slice) -def recursive_apply_transform(args, transform): - global _recursive_key_stack - if isinstance(args, special_type): - arg_transform = transform(args, _recursive_key_stack) - return arg_transform - elif isinstance(args, (list, tuple)): - transform_result = [] - for i, arg in enumerate(args): - _recursive_key_stack.append(str(i)) - transform_result.append(recursive_apply_transform(arg, transform)) - _recursive_key_stack.pop() - return type(args)(transform_result) - elif isinstance(args, dict): - transform_result = {} - for k, arg in args.items(): - _recursive_key_stack.append(str(k)) - transform_result[k] = recursive_apply_transform(arg, transform) - _recursive_key_stack.pop() - return transform_result - elif args is not None: - print_warn_log(f"Data type {type(args)} is not supported.") diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl.png b/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl.png deleted file mode 100644 index c64e9380c6d9c01bb2ad18c81e430ead0800bb7d..0000000000000000000000000000000000000000 Binary files a/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl.png and /dev/null differ diff --git a/debug/accuracy_tools/atat/pytorch/dump/dump.py b/debug/accuracy_tools/atat/pytorch/dump/dump.py deleted file mode 100644 index 64652bdaec5bc1d5de5d740c2c23de474a27d5fa..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/dump/dump.py +++ /dev/null @@ -1,455 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" -# Copyright (C) 2019-2020. Huawei Technologies Co., Ltd. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" - -import inspect -import json -import os -import threading -from pathlib import Path - -import numpy as np -import torch - -try: - import torch_npu -except ImportError: - is_gpu = True -else: - is_gpu = False - -from atat.core.utils import (print_warn_log, Const, print_info_log, modify_dump_path, check_inplace_op, CompareConst, - print_error_log) -from atat.core.file_check_util import FileOpen, change_mode, FileCheckConst -from atat.pytorch.common.utils import get_md5_for_tensor -from ..dump.utils import check_writable -from .utils import (DumpUtil, check_if_in_api_list, make_dump_data_dir, get_tensor_rank, create_dirs_if_not_exist, - CompareException, check_single_rank_folder) - - -forward_init_status = False -backward_init_status = False - -thread_lock = threading.Lock() -pkl_name = "" -rank = os.getpid() + 100000 -multi_output_apis = ["_sort_", "npu_flash_attention"] -module_count = {} - - -class APIList(list): - threshold = 1000 - - def __init__(self, *args): - self.dump_count = 0 - self.pkl_mode_changed = False - super().__init__(*args) - - def flush(self): - pkl_path = get_pkl_file_path() - if len(self) == 0 or pkl_path == "": - return - with FileOpen(pkl_path, 'a') as f: - try: - f.write('\n'.join(json.dumps(item) for item in self)) - f.write('\n') - except IOError as ex: - raise Exception("write to disk failed") from ex - self.dump_count += 1 - print_info_log(f"write {len(self)} items to {pkl_path} the {self.dump_count} time") - if not self.pkl_mode_changed: - change_mode(pkl_path, FileCheckConst.DATA_FILE_AUTHORITY) - self.pkl_mode_changed = True - self.clear() - - def append(self, data): - list.append(self, data) - if len(self) >= APIList.threshold: - self.flush() - - -api_list = APIList() - - -class DataInfo(object): - def __init__(self, save_data, summary_data, dtype, shape, md5=None): - if md5 is None: - md5 = [] - self.save_data = save_data - self.summary_data = summary_data - self.dtype = dtype - self.shape = shape - self.md5 = md5 - - -def get_not_float_tensor_info(data): - if DumpUtil.summary_mode == "md5": - return DataInfo([], [], str(data.dtype), tuple(data.shape), get_md5_for_tensor(data)) - if data.numel() == 0 or data.dtype == torch.bool: - tensor_max = [] - tensor_min = [] - tensor_mean = [] - elif len(data.shape) == 0: - item = data.float().item() - tensor_max = item - tensor_min = item - tensor_mean = item - else: - tensor_max = torch._C._VariableFunctionsClass.max(data).float().item() - tensor_min = torch._C._VariableFunctionsClass.min(data).float().item() - tensor_mean = torch._C._VariableFunctionsClass.mean(data.float()).float().item() - return get_tensor_data_info(data, tensor_max, tensor_min, tensor_mean, CompareConst.NAN) - - -def get_scalar_data_info(data): - summary_data = [data, data, data, data] - return DataInfo(data, summary_data, str(type(data)), str([])) - - -def get_float_tensor_info(data): - if DumpUtil.summary_mode == "md5": - return DataInfo([], [], str(data.dtype), tuple(data.shape), get_md5_for_tensor(data)) - tensor_max = torch._C._VariableFunctionsClass.max(data).float().item() - tensor_min = torch._C._VariableFunctionsClass.min(data).float().item() - tensor_mean = torch._C._VariableFunctionsClass.mean(data).float().item() - tensor_norm = torch._C._VariableFunctionsClass.norm(data).float().item() - return get_tensor_data_info(data, tensor_max, tensor_min, tensor_mean, tensor_norm) - - -def get_tensor_data_info(data, *tensor_args): - summary_data = [] - summary_data.extend([*tensor_args]) - if DumpUtil.summary_mode == "all": - 
saved_tensor = data.contiguous().cpu().detach() - if data.dtype == torch.bfloat16: - saved_numpy = saved_tensor.to(torch.float32).numpy() - else: - saved_numpy = saved_tensor.numpy() - return DataInfo(saved_numpy, summary_data, str(data.dtype), tuple(data.shape)) - return DataInfo([], summary_data, str(data.dtype), tuple(data.shape)) - - -def dump_tensor(x, prefix, dump_step): - if isinstance(x, (tuple, list)) and x: - for i, item in enumerate(x): - dump_tensor(item, "{}.{}".format(prefix, i), dump_step) - return - elif isinstance(x, torch.Tensor): - if x.is_meta: - print_info_log(f"Meta tensor {prefix} is skipped.") - return - x_clone = x.clone().detach() - if x_clone.numel() == 0 or len(x_clone.shape) == 0 or not x_clone.is_floating_point(): - if DumpUtil.dump_filter_switch == Const.OFF: - data_info = get_not_float_tensor_info(x_clone) - dump_data_by_rank_count(dump_step, prefix, data_info) - else: - return - else: - data_info = get_float_tensor_info(x_clone) - dump_data_by_rank_count(dump_step, prefix, data_info) - - elif DumpUtil.dump_filter_switch == Const.OFF: - if isinstance(x, bool) or isinstance(x, int) or isinstance(x, float): - data_info = get_scalar_data_info(x) - dump_data_by_rank_count(dump_step, prefix, data_info) - - -def append_pkl_data(dump_step, prefix, data_info): - global api_list - thread_lock.acquire() - api_list.append([prefix, dump_step, data_info.md5, data_info.dtype, data_info.shape, data_info.summary_data]) - thread_lock.release() - - -def dump_data(prefix, data_info): - if DumpUtil.summary_mode != "all": - return - output_path = os.path.join(DumpUtil.dump_data_dir, f'{prefix}.npy') - try: - np.save(output_path, data_info.save_data) - change_mode(output_path, FileCheckConst.DATA_FILE_AUTHORITY) - except Exception as e: - print_warn_log("Dump data failed, error: {}".format(e)) - - -def thread_dump_data(prefix, data_info): - DumpUtil.dump_thread_pool.submit(dump_data, prefix, data_info) - - -def dump_data_by_rank_count(dump_step, prefix, data_info): - print_info_log(f"ptdbg is analyzing rank{rank} api: {prefix}" + " " * 10, end='\r') - if DumpUtil.is_single_rank and DumpUtil.dump_thread_pool: - thread_dump_data(prefix, data_info) - else: - dump_data(prefix, data_info) - append_pkl_data(dump_step, prefix, data_info) - - -def dump_stack_info(name_template): - if check_inplace_op(name_template) and Const.PRE_FORWARD in name_template: - return - - stack_str = [] - try: - for (_, path, line, func, code, _) in inspect.stack()[4:]: - if code: - stack_line = [path, str(line), func, code[0].strip() if code else code] - else: - stack_line = [path, str(line), func, code] - stack_str.append(stack_line) - except Exception as e: - print_warn_log("Dump stack info failed, error: {}".format(e)) - stack_str.append('') - - prefix = name_template.format("stack_info") - if DumpUtil.dump_switch_mode in Const.DUMP_MODE: - complement_set = set(['forward', 'backward', 'input', 'output']) - set(DumpUtil.dump_mode) - if not any(mode in prefix for mode in complement_set): - api_list.append([prefix, stack_str]) - else: - api_list.append([prefix, stack_str]) - - -def dump_api_tensor(dump_step, in_feat, name_template, out_feat): - if check_inplace_op(name_template): - if Const.PRE_FORWARD in name_template: - name_template = name_template.replace(Const.PRE_FORWARD, Const.FORWARD) - else: - if Const.BACKWARD in name_template and Const.BACKWARD in DumpUtil.dump_mode: - return - elif Const.BACKWARD not in name_template and Const.FORWARD in DumpUtil.dump_mode: - if "output" in DumpUtil.dump_mode: - 
dump_tensor(in_feat, name_template.format("output"), dump_step) - if "input" in DumpUtil.dump_mode: - return - - if Const.BACKWARD in name_template and Const.BACKWARD in DumpUtil.dump_mode: - if 'input' in DumpUtil.dump_mode: - dump_tensor(out_feat, name_template.format("input"), dump_step) - if 'output' in DumpUtil.dump_mode: - dump_tensor(in_feat, name_template.format("output"), dump_step) - elif Const.BACKWARD not in name_template and Const.FORWARD in DumpUtil.dump_mode: - if 'input' in DumpUtil.dump_mode: - dump_tensor(in_feat, name_template.format("input"), dump_step) - if 'output' in DumpUtil.dump_mode: - dump_tensor(out_feat, name_template.format("output"), dump_step) - - -def rename_(): - global rank - global pkl_name - if rank is not None and pkl_name is not None: - dir_name = os.path.join(DumpUtil.dump_root, "step{}".format(DumpUtil.iter_num), "rank{}".format(os.getpid() + 100000)) - new_name = os.path.join(DumpUtil.dump_root, "step{}".format(DumpUtil.iter_num), "rank{}".format(rank)) - if not os.path.exists(new_name) and os.path.exists(dir_name): - _, file_name = os.path.split(pkl_name) - os.rename(dir_name, new_name) - pkl_name = os.path.join(new_name, file_name) - - -def dump_acc_cmp(name, in_feat, out_feat, dump_step, module): - if not DumpUtil.get_dump_switch(): - return - if DumpUtil.dump_switch_mode == Const.API_LIST and not check_if_in_api_list(name): - return - if DumpUtil.dump_switch_mode in [Const.LIST, Const.ACL, Const.RANGE, Const.STACK] and not DumpUtil.check_switch_scope(name): - return - dump_file = DumpUtil.get_dump_path() - dump_file = modify_dump_path(dump_file, DumpUtil.dump_switch_mode) - global rank - dump_dir, dump_filename = os.path.split(dump_file) - dump_dir = os.path.join(dump_dir, "step{}".format(DumpUtil.iter_num)) - if not os.path.exists(dump_dir): - Path(dump_dir).mkdir(mode=FileCheckConst.DATA_DIR_AUTHORITY, exist_ok=True) - dump_file = os.path.join(dump_dir, dump_filename) - rank_this = get_tensor_rank(in_feat, out_feat) - DumpUtil.dump_root = os.path.dirname(DumpUtil.dump_path) - if rank_this is not None and rank != rank_this: - rank = rank_this - rename_() - if not DumpUtil.dump_init_enable: - if '.pkl' in dump_filename: - npy_dir = dump_filename[:-4] - else: - npy_dir = dump_filename - DumpUtil.dump_data_dir = os.path.join(DumpUtil.dump_root, "step{}".format(DumpUtil.iter_num), "rank{}".format(rank), npy_dir) - if DumpUtil.target_rank is not None: - if rank != DumpUtil.target_rank: - return - dump_file = create_dirs_if_not_exist(rank, dump_file) - global pkl_name - pkl_name = dump_file - if DumpUtil.dump_init_enable: - DumpUtil.dump_init_enable = False - DumpUtil.dump_data_dir = make_dump_data_dir(dump_file) \ - if DumpUtil.dump_switch_mode not in [Const.STACK, Const.ACL] and DumpUtil.summary_mode == "all" else "" - if os.path.exists(dump_file) and not os.path.isdir(dump_file): - check_writable(dump_file) - try: - os.remove(dump_file) - except FileNotFoundError as e: - print_warn_log("The file does not exist, error: {}".format(e)) - - name_prefix = name - name_template = f"{name_prefix}" + "_{}" - if DumpUtil.is_single_rank is None: - DumpUtil.is_single_rank = check_single_rank_folder(dump_dir) - if DumpUtil.dump_switch_mode in [Const.ALL, Const.API_LIST]: - dump_api_tensor(dump_step, in_feat, name_template, out_feat) - elif DumpUtil.dump_switch_mode == Const.API_STACK: - dump_api_tensor(dump_step, in_feat, name_template, out_feat) - dump_stack_info(name_template) - else: - if DumpUtil.dump_switch_mode == Const.ACL: - acl_dump(module, name, 
name_prefix) - elif DumpUtil.dump_switch_mode != Const.STACK: - dump_api_tensor(dump_step, in_feat, name_template, out_feat) - dump_stack_info(name_template) - - -def acl_dump(module, module_name, name_prefix): - if name_prefix in DumpUtil.backward_input: - dump_mode_backward_acl_dump(module, module_name, DumpUtil.backward_input.get(name_prefix)) - else: - forward_acl_dump(module, module_name) - - -def Op_Need_Trigger(module_name): - if 'Tensor.__getitem__.' in module_name: - return True - return False - - -def forward_acl_dump(module, module_name): - global forward_init_status - global backward_init_status - if not forward_init_status and not backward_init_status: - forward_init_status = True - torch_npu.npu.synchronize() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(DumpUtil.dump_config) - torch_npu.npu.synchronize() - if Op_Need_Trigger(module_name): - module.forward(*module.input_args, **module.input_kwargs).cpu() - else: - module.forward(*module.input_args, **module.input_kwargs) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - torch_npu.npu.synchronize() - del module.input_args - del module.input_kwargs - forward_init_status = False - print_info_log("Dump %s op file." % module_name) - - -def acl_backward_dump_status(output, grad, module_name): - if isinstance(output, torch.Tensor): - output.backward(grad, retain_graph=True) - return True - - for api_name in multi_output_apis: - if api_name in module_name: - output[0].backward(grad, retain_graph=True) - return True - return False - - -def dump_mode_backward_acl_dump(module, module_name, grad_path): - global forward_init_status - global backward_init_status - module_name = module_name.replace(Const.FORWARD, Const.BACKWARD) - if not forward_init_status and not backward_init_status: - forward_init_status = True - module.input_args = list(module.input_args) - for i, data in enumerate(module.input_args): - if isinstance(data, torch.Tensor) and data.grad_fn: - module.input_args[i] = data.detach().requires_grad_() - output = module.forward(*module.input_args, **module.input_kwargs) - grad = torch.tensor(np.load(grad_path)).to("npu").requires_grad_() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(DumpUtil.dump_config) - torch_npu.npu.synchronize() - if not acl_backward_dump_status(output, grad, module_name): - print_warn_log("The output of {} is not of tensor type and cannot be automatically derived. " - "you can manually construct a single API backward case for ACL dump.".format(module_name)) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - del module.input_args - del module.input_kwargs - forward_init_status = False - print_info_log("Dump %s op file." 
% module_name) - - -def module_count_func(name, name_template): - module_name = name.split("_")[-3] - if Const.FORWARD in name_template: - if module_name not in module_count: - module_count[module_name] = [0, [0]] - else: - if module_count[module_name][-1] and \ - module_count[module_name][0] != module_count[module_name][-1][-1]: - module_count[module_name][-1].pop() - module_count[module_name][0] += 1 - module_count[module_name][-1].append(module_count[module_name][0]) - index = module_count[module_name][0] - else: - backward_stack = module_count[module_name][-1] if module_name in module_count else [] - if not backward_stack: - print_warn_log("The backward stack of {} is empty.".format(module_name)) - index = "abnormal" - else: - index = backward_stack.pop() - return index - - -def acc_cmp_dump(name, **kwargs): - dump_step = kwargs.get('dump_step', 1) - pid = kwargs.get('pid') - name_template = name - if not pid: - return RuntimeError("Not get the specified process pid.") - - def acc_cmp_hook(module, in_feat, out_feat=None): - nonlocal name, name_template - if "_{}_" in name_template: - try: - index = module_count_func(name, name_template) - except IndexError as e: - print_error_log(f"Get module {name_template} index failed.") - raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from e - name = name_template.format(index) - if pid == os.getpid(): - dump_acc_cmp(name, in_feat, out_feat, dump_step, module) - if hasattr(module, "input_args"): - del module.input_args - if hasattr(module, "input_kwargs"): - del module.input_kwargs - - return acc_cmp_hook - - -def write_to_disk(): - api_list.flush() - - -def get_pkl_file_path(): - return pkl_name - - -def reset_module_count(): - global module_count - module_count = {} diff --git a/debug/accuracy_tools/atat/pytorch/dump/utils.py b/debug/accuracy_tools/atat/pytorch/dump/utils.py deleted file mode 100644 index 8e58f35606a4a4f9cf9e7ae732beeedb7777cdef..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/dump/utils.py +++ /dev/null @@ -1,357 +0,0 @@ -import os -import re -import shutil -from pathlib import Path -import torch -import torch.distributed as dist - -from atat.core.utils import print_error_log, CompareException, DumpException, Const, get_time, print_info_log, \ - check_mode_valid, check_switch_valid, check_dump_mode_valid, check_summary_only_valid, generate_compare_script, \ - check_file_valid, make_dump_path_if_not_exists, check_path_before_create, check_summary_mode_valid -from atat.core.file_check_util import FileChecker, FileCheckConst, check_path_length, check_path_pattern_vaild -from atat.pytorch.common.utils import check_is_npu - -from ..dump import dump - -dump_count = 0 -range_begin_flag, range_end_flag = False, False - - -def check_list_or_acl_mode(name_prefix): - global dump_count - for item in DumpUtil.dump_switch_scope: - if name_prefix.startswith(item): - dump_count = dump_count + 1 - return True - return False - - -def check_range_mode(name_prefix): - global range_begin_flag - global range_end_flag - if name_prefix.startswith(DumpUtil.dump_switch_scope[0]): - range_begin_flag = True - return True - if name_prefix.startswith(DumpUtil.dump_switch_scope[1]): - range_end_flag = True - return True - if range_begin_flag and not range_end_flag: - return True - return False - - -def check_stack_mode(name_prefix): - if len(DumpUtil.dump_switch_scope) == 0: - return True - elif len(DumpUtil.dump_switch_scope) == 1: - return name_prefix.startswith(DumpUtil.dump_switch_scope[0]) - elif 
len(DumpUtil.dump_switch_scope) == 2: - return check_range_mode(name_prefix) - else: - print_error_log("dump scope is invalid, Please set the scope mode in" - " set_dump_switch with 'all', 'list', 'range', 'stack', 'acl', 'api_list'!") - return False - - -class DumpConfig: - def __init__(self, mode=None, scope=None, api_list=None, filter_switch=None, dump_mode=None, summary_only=False, summary_mode="all"): - self.mode = mode - self.scope = scope - self.api_list = api_list - self.filter_switch = filter_switch - self.dump_mode = dump_mode - self.summary_only = summary_only - self.summary_mode = summary_mode - - -class DumpUtil(object): - dump_root = None - dump_data_dir = None - dump_path = None - dump_switch = None - dump_switch_mode = Const.ALL # all, api_stack, list, stack... - dump_switch_scope = [] - dump_init_enable = False - dump_api_list = [] - dump_filter_switch = None - dump_mode = ['forward', 'backward', 'input', 'output'] - backward_input = {} - dump_dir_tag = 'ptdbg_dump' - dump_config = None - dataloader_iter = 0 - target_iter = None - iter_num = 0 - target_rank = None - summary_only = False - need_replicate = False - summary_mode = "all" - is_single_rank = None - dump_thread_pool = None - - - @staticmethod - def set_dump_path(save_path): - DumpUtil.dump_path = save_path - DumpUtil.dump_init_enable = True - - @staticmethod - def set_acl_config(acl_config): - if not acl_config: - raise ValueError("acl_config must be configured when mode is 'acl'") - acl_config_checker = FileChecker(acl_config, FileCheckConst.FILE, FileCheckConst.READ_ABLE, - FileCheckConst.JSON_SUFFIX) - acl_config = acl_config_checker.common_check() - DumpUtil.dump_config = acl_config - - @staticmethod - def set_dump_switch(switch, dump_config): - DumpUtil.dump_switch = switch - if dump_config.mode is not None: - DumpUtil.dump_switch_mode = dump_config.mode - DumpUtil.dump_init_enable = True - if dump_config.scope is not None: - DumpUtil.dump_switch_scope = dump_config.scope - if dump_config.api_list is not None: - DumpUtil.dump_api_list = [api.lower() for api in dump_config.api_list] - if dump_config.filter_switch is not None: - DumpUtil.dump_filter_switch = dump_config.filter_switch - if dump_config.dump_mode is not None: - DumpUtil.dump_mode = dump_config.dump_mode if isinstance(dump_config.dump_mode, list) else [dump_config.dump_mode] - - if dump_config.mode == Const.ACL: - DumpUtil.dump_switch_scope = [api_name.replace("backward", "forward") for api_name in dump_config.scope] - - DumpUtil.summary_only = dump_config.summary_only - DumpUtil.summary_mode = dump_config.summary_mode - - check_mapper = { - Const.LIST: check_list_or_acl_mode, - Const.ACL: check_list_or_acl_mode, - Const.RANGE: check_range_mode, - Const.STACK: check_stack_mode - } - - @staticmethod - def check_switch_scope(name_prefix): - if DumpUtil.dump_switch_mode in DumpUtil.check_mapper: - check_func = DumpUtil.check_mapper[DumpUtil.dump_switch_mode] - return check_func(name_prefix) - return False - - @staticmethod - def get_dump_path(): - if DumpUtil.dump_path: - return DumpUtil.dump_path - - if DumpUtil.dump_switch_mode == Const.ALL: - raise RuntimeError("get_dump_path: the file path is empty," - " you must use set_dump_path to set a valid dump path!!!") - else: - dir_path = os.path.realpath("./") - dump_file_name = "scope_dump_{}_{}_{}.pkl".format( - DumpUtil.dump_switch_mode, DumpUtil.dump_switch_scope[0], get_time()) - DumpUtil.dump_path = os.path.join(dir_path, dump_file_name) - return DumpUtil.dump_path - - @staticmethod - def 
get_dump_switch(): - return DumpUtil.dump_switch == "ON" - - -def set_dump_path(fpath=None, dump_tag='ptdbg_dump'): - fpath = load_env_dump_path(fpath) - check_file_valid(fpath) - if not re.match(Const.FILE_PATTERN, dump_tag): - print_error_log('The file path {} contains special characters.'.format(dump_tag)) - raise CompareException(CompareException.INVALID_PATH_ERROR) - real_path = os.path.realpath(fpath) - make_dump_path_if_not_exists(real_path) - fpath_checker = FileChecker(real_path, FileCheckConst.DIR, FileCheckConst.WRITE_ABLE) - fpath_checker.common_check() - DumpUtil.set_dump_path(real_path) - DumpUtil.dump_dir_tag = dump_tag - - -def get_tensor_rank(in_feat, out_feat): - if dist.is_initialized(): - return dist.get_rank() - - def get_tensor_rank_single(x): - if isinstance(x, (list, tuple)): - if len(x) > 0: - return get_tensor_rank_single(x[0]) - return None - elif isinstance(x, torch.Tensor): - device = x.device - if device.type == 'cpu': - return None - else: - return device.index - return None - in_rank = get_tensor_rank_single(in_feat) - if in_rank is None: - out_rank = get_tensor_rank_single(out_feat) - if out_rank is None: - return None - return out_rank - return in_rank - - -def create_dirs_if_not_exist(rank, dump_file): - dump_path, file_name = os.path.split(dump_file) - rank_dir = os.path.join(dump_path, f"rank{rank}") - dump_file = os.path.join(rank_dir, file_name) - if not os.path.isdir(rank_dir): - check_path_pattern_vaild(dump_file) - check_path_length(dump_file, name_length=200) - Path(rank_dir).mkdir(mode=FileCheckConst.DATA_DIR_AUTHORITY, exist_ok=True) - return dump_file - - -def generate_dump_path_str(): - if DumpUtil.dump_switch_mode == 'acl': - if DumpUtil.dump_config == '': - print_error_log("Please provide dump config for register hook before turning on dump switch!") - raise DumpException(DumpException.NONE_ERROR) - dump_path = f"according to dump config {DumpUtil.dump_config}" - else: - dump_dir, dump_file = os.path.split(DumpUtil.dump_path) - if not dump_file.endswith(".pkl"): - dump_dir = DumpUtil.dump_path - dump_path = f"to {dump_dir}" - return dump_path - - -def set_dump_switch(switch, mode=Const.ALL, scope=None, api_list=None, filter_switch=Const.OFF, dump_mode=None, - summary_only=False): - if scope is None: - scope = [] - if api_list is None: - api_list = [] - if dump_mode is None: - dump_mode = [Const.ALL] - check_switch_valid(switch) - if not DumpUtil.dump_path: - set_dump_path() - dump_config = DumpConfig(summary_only=summary_only) - DumpUtil.set_dump_switch(switch, dump_config) - dump_path_str = generate_dump_path_str() - if switch == "OFF": - dump.write_to_disk() - if check_is_npu() and DumpUtil.dump_switch_mode in [Const.ALL, Const.API_STACK, Const.LIST, Const.RANGE, Const.API_LIST]: - generate_compare_script(DumpUtil.dump_data_dir, dump.get_pkl_file_path(), DumpUtil.dump_switch_mode) - set_dump_switch_print_info(switch, mode, dump_path_str) - set_dump_switch_config(mode=mode, scope=scope, api_list=api_list, filter_switch=filter_switch, dump_mode=dump_mode, - summary_only=summary_only) - - -def set_dump_switch_config(mode=Const.ALL, scope=None, api_list=None, filter_switch=Const.OFF, dump_mode=None, - summary_only=False, summary_mode="all"): - if scope is None: - scope = [] - if api_list is None: - api_list = [] - if dump_mode is None: - dump_mode = [Const.ALL] - try: - check_summary_mode_valid(summary_mode) - check_mode_valid(mode, scope, api_list) - check_switch_valid(filter_switch) - dump_mode = check_dump_mode_valid(dump_mode) - 
summary_only = check_summary_only_valid(summary_only) - except (CompareException, AssertionError) as err: - print_error_log(str(err)) - raise CompareException(CompareException.INVALID_PARAM_ERROR) from err - switch = DumpUtil.dump_switch - dump_config = DumpConfig(mode, scope, api_list, filter_switch, dump_mode, summary_only, summary_mode) - DumpUtil.set_dump_switch("OFF", dump_config) - DumpUtil.dump_switch = switch - - -def set_dump_switch_print_info(switch, mode, dump_path_str): - global dump_count - if switch == "ON": - print_info_log(f"Dump switch is turned on. Dump data will be saved {dump_path_str}. ") - if mode == Const.LIST: - dump_count = 0 - else: - print_info_log(f"Dump switch is turned off. ") - if mode == Const.LIST: - print_info_log("The number of matched dump is {}".format(dump_count)) - - -def check_if_in_api_list(name): - if not DumpUtil.dump_api_list: - return False - for api in DumpUtil.dump_api_list: - if api.lower() in name.lower(): - return True - return False - - -def set_backward_input(backward_input): - for index, api_name in enumerate(DumpUtil.dump_switch_scope): - DumpUtil.backward_input[api_name] = backward_input[index] - - -def make_dump_data_dir(dump_file_name): - dump_path, file_name = os.path.split(os.path.realpath(dump_file_name)) - name_body, name_extension = os.path.splitext(file_name) - output_dir = os.path.join(dump_path, f"{name_body}") - check_path_before_create(output_dir) - if not os.path.exists(output_dir): - Path(output_dir).mkdir(mode=0o750, exist_ok=True) - else: - shutil.rmtree(output_dir, ignore_errors=True) - Path(output_dir).mkdir(mode=0o750, exist_ok=True) - return output_dir - - -def make_dump_dirs(): - dump_file_name, dump_file_name_body = "dump.pkl", "dump" - dump_root_dir = load_env_dump_path(DumpUtil.dump_path) - tag_dir = os.path.join(dump_root_dir, DumpUtil.dump_dir_tag) - check_path_length(tag_dir) - check_path_pattern_vaild(tag_dir) - Path(tag_dir).mkdir(mode=0o750, parents=True, exist_ok=True) - DumpUtil.dump_dir = tag_dir - dump_file_path = os.path.join(tag_dir, dump_file_name) - DumpUtil.set_dump_path(dump_file_path) - - -def check_writable(dump_file): - if not os.access(dump_file, os.W_OK): - print_error_log( - 'The path {} does not have permission to write. Please check the path permission'.format( - dump_file)) - raise DumpException(DumpException.INVALID_PATH_ERROR) - - -def load_env_dump_path(dump_path): - if not dump_path: - dump_path = os.getenv(Const.ASCEND_WORK_PATH) - if dump_path: - try: - dump_path = os.path.join(str(dump_path), Const.DUMP_DIR) - except TypeError as err: - print_error_log("Generating dump path from environment variables ASCEND_WORK_PATH failed.") - raise DumpException(DumpException.INVALID_PATH_ERROR) from err - else: - print_error_log("Dump path is None, you can configure it in the following ways:\n" - "1. Configure set_dump_path function.\n" - "2. Configure the dump_path parameter of PrecisionDebugger.\n" - "3. 
Set environment variables ASCEND_WORK_PATH.") - raise DumpException(DumpException.INVALID_PATH_ERROR) - return dump_path - - -def check_single_rank_folder(dump_path): - rank_folder_pattern = re.compile(r'^rank\d+$') - rank_folder_count = 0 - for item in os.listdir(dump_path): - full_path = os.path.join(dump_path, item) - if os.path.isdir(full_path) and rank_folder_pattern.match(item): - rank_folder_count += 1 - if rank_folder_count > 1: - return False - return rank_folder_count == 1 diff --git a/debug/accuracy_tools/atat/pytorch/functional/__init__.py b/debug/accuracy_tools/atat/pytorch/functional/__init__.py deleted file mode 100644 index 12e530d4c950f6bab9d6fe48861954ca0061e33d..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/functional/__init__.py +++ /dev/null @@ -1,4 +0,0 @@ -from .repair import build_repair -from .scope import build_scope -from .step_post_process import build_step_post_process -from .data_collector import build_data_collector diff --git a/debug/accuracy_tools/atat/pytorch/functional/repair.py b/debug/accuracy_tools/atat/pytorch/functional/repair.py deleted file mode 100644 index aed8326424f5e171a9a71d21cfeb48db6fb26fb3..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/functional/repair.py +++ /dev/null @@ -1,90 +0,0 @@ -from abc import ABC, abstractmethod - -import torch - -from .scope import build_scope, ListScope, BaseScope -from ..common.exceptions import RepairException -from ..common import recursive_apply_transform, print_info_log_rank_0 - - -def build_repair(config): - if config.repair_type is None: - return None - elif config.repair_type == RepairAPI.ToCPU: - return RepairAPI_toCPU(config) - elif config.repair_type == RepairAPI.RaisePrecision: - return RepairAPI_raise(config) - else: - raise RepairException(RepairException.InvalidRepairType, f"精度修复类型" - f"须配置为'{RepairAPI.ToCPU}'或'{RepairAPI.RaisePrecision}," - f"实际配置为{config.repair_type}") - - -class RepairAPI(ABC): - ToCPU = "cpu" - RaisePrecision = "raise" - - def __init__(self, config): - self.config = config - self.scope = build_scope(ListScope, config.repair_scope, config.repair_api_str) - self.saved, self.towards = "None", "None" - - def check_name_and_module_type(self, name, module_type): - if module_type == BaseScope.Module_Type_Module: - return False - if not self.scope.check(name): - return False - return True - - def convert(self, name, module_type, args, kwargs): - is_target = self.check_name_and_module_type(name, module_type) - if is_target: - args = recursive_apply_transform(args, self.fx) - kwargs = recursive_apply_transform(kwargs, self.fx) - print_info_log_rank_0(f"[msProbe] convert inputs of {name} to " - f"{self.towards}.") - return args, kwargs - - def invert(self, name, module_type, out_feat): - is_target = self.check_name_and_module_type(name, module_type) - if is_target: - out_feat = recursive_apply_transform(out_feat, self.inv_fx) - print_info_log_rank_0(f"[msProbe] convert outputs of {name} back to "\ - f"{self.saved}.") - return out_feat - - -class RepairAPI_toCPU(RepairAPI): - def fx(self, arg, _): - if isinstance(arg, torch.Tensor): - self.saved = arg.device - self.towards = torch.device("cpu") - return arg.cpu() - return arg - - def inv_fx(self, arg, _): - if isinstance(arg, torch.Tensor): - return arg.to(self.saved) - return arg - - -class RepairAPI_raise(RepairAPI): - raise_dtype_map = { - torch.bfloat16: torch.float32, - torch.float16: torch.float32 - } - - def fx(self, arg, _): - if isinstance(arg, torch.Tensor): 
- self.saved = arg.dtype - self.towards = RepairAPI_raise.raise_dtype_map.get(self.saved) - # bug: nested input may be of various dtypes. which to save and invert? - return arg.to(self.towards) - return arg - - def inv_fx(self, arg, _): - if isinstance(arg, torch.Tensor): - return arg.to(self.saved) - return arg - - diff --git a/debug/accuracy_tools/atat/pytorch/functional/step_post_process.py b/debug/accuracy_tools/atat/pytorch/functional/step_post_process.py deleted file mode 100644 index 7f0d3459326f04691a0041c120bf4efc676f8bc1..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/functional/step_post_process.py +++ /dev/null @@ -1,43 +0,0 @@ -from abc import ABC, abstractmethod -from ..common.exceptions import StepException - - -def run_parallel_ut(config): - pass - - -def compare_distrbuted(config): - pass - - -def build_step_post_process(config): - if not config.on_step_end: - return None - if config.on_step_end == StepPostProcess.SingleAPICheck: - return SingleAPICheck(config) - elif config.on_step_end == StepPostProcess.Compare: - return AutoCompare(config) - else: - raise StepException(StepException.InvalidPostProcess, f"step后处理须配置为" - f"'{StepPostProcess.SingleAPICheck}'或'{StepPostProcess.Compare}'," - f"实际配置为{config.on_step_end}") - - -class StepPostProcess(ABC): - SingleAPICheck = 'single_api_check' - Compare = 'compare' - - -class SingleAPICheck: - def __init__(self, config): - self.config = config - - def run(self): - run_parallel_ut(self.config) - -class AutoCompare: - def __init__(self, config): - self.config = config - - def run(self): - compare_distrbuted(self.config.bench_dump_path, self.config.dump_path) diff --git a/debug/accuracy_tools/atat/pytorch/overflow_check/__init__.py b/debug/accuracy_tools/atat/pytorch/overflow_check/__init__.py deleted file mode 100644 index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000 diff --git a/debug/accuracy_tools/atat/pytorch/overflow_check/info_dump.py b/debug/accuracy_tools/atat/pytorch/overflow_check/info_dump.py deleted file mode 100644 index 161e9f23f0fb7c3a9c09bb5e7697eb9a7dfaef15..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/overflow_check/info_dump.py +++ /dev/null @@ -1,252 +0,0 @@ -import inspect -import fcntl -import os -import threading - -import json -import numpy as np -import torch - -from atat.core.file_check_util import FileOpen, FileCheckConst, change_mode -from atat.core.utils import get_time -from ..common.utils import print_error_log - - -special_torch_object = ["memory_format"] -lock = threading.Lock() - - -def write_npy(file_path, tensor): - saved_tensor = tensor.contiguous().cpu().detach() - if tensor.dtype == torch.bfloat16: - saved_numpy = saved_tensor.to(torch.float32).numpy() - else: - saved_numpy = saved_tensor.numpy() - if os.path.exists(file_path): - raise ValueError(f"File {file_path} already exists") - np.save(file_path, saved_numpy) - full_path = os.path.abspath(file_path) - return full_path - - -class APIInfo: - def __init__(self, api_name, is_forward, save_real_data=False): - self.rank = os.getpid() - self.api_name = api_name - self.save_real_data = save_real_data - self.torch_object_key = {'device': self.analyze_device_in_kwargs, 'dtype': self.analyze_dtype_in_kwargs} - self.is_forward = is_forward - self.args_num = 0 - - def analyze_element(self, element): - if isinstance(element, (list, tuple)): - out = [] - for item in element: - out.append(self.analyze_element(item)) - return out - elif 
isinstance(element, dict): - out_dict = {} - for key, value in element.items(): - if key in self.torch_object_key.keys(): - fun = self.torch_object_key[key] - out_dict[key] = fun(value) - elif key in special_torch_object: - continue - else: - out_dict[key] = self.analyze_element(value) - return out_dict - elif isinstance(element, torch.Tensor): - out_tensor = self.analyze_tensor(element, self.save_real_data) - return out_tensor - elif self.is_builtin_class(element): - out_builtin = self.analyze_builtin(element) - return out_builtin - else: - msg = f"Type {type(element)} is unsupported at analyze_element" - print_error_log(msg) - - raise NotImplementedError(msg) - - def analyze_tensor(self, arg, save_real_data): - single_arg = {} - if not save_real_data: - single_arg.update({'type': 'torch.Tensor'}) - single_arg.update({'dtype': str(arg.dtype)}) - single_arg.update({'shape': arg.shape}) - single_arg.update({'Max': self.transfer_types(self.get_tensor_extremum(arg, 'max'), str(arg.dtype))}) - single_arg.update({'Min': self.transfer_types(self.get_tensor_extremum(arg, 'min'), str(arg.dtype))}) - single_arg.update({'requires_grad': arg.requires_grad}) - - else: - dump_path = "./" - api_args = self.api_name + '.' + str(self.args_num) - rank = arg.device.index - if self.is_forward: - forward_real_data_path = os.path.join(dump_path, "forward_real_data_" + get_time(), f"rank{rank}") - if not os.path.exists(forward_real_data_path): - os.makedirs(forward_real_data_path, 0o755) - - file_path = os.path.join(forward_real_data_path, f'{api_args}.npy') - else: - backward_real_data_path = os.path.join(dump_path, "backward_real_data_" + get_time(), f"rank{rank}") - if not os.path.exists(backward_real_data_path): - os.makedirs(backward_real_data_path, 0o755) - file_path = os.path.join(backward_real_data_path, f'{api_args}.npy') - self.args_num += 1 - npy_path = write_npy(file_path, arg) - single_arg.update({'type': 'torch.Tensor'}) - single_arg.update({'datapath': npy_path}) - single_arg.update({'requires_grad': arg.requires_grad}) - return single_arg - - def analyze_builtin(self, arg): - single_arg = {} - if isinstance(arg, slice): - single_arg.update({'type': "slice"}) - single_arg.update({'value': [arg.start, arg.stop, arg.step]}) - else: - single_arg.update({'type': self.get_type_name(str(type(arg)))}) - single_arg.update({'value': arg}) - return single_arg - - def transfer_types(self, data, dtype): - if 'int' in dtype or 'bool' in dtype: - return int(data) - else: - return float(data) - - def is_builtin_class(self, element): - if element is None or isinstance(element, (bool, int, float, str, slice)): - return True - return False - - def analyze_device_in_kwargs(self, element): - single_arg = {} - single_arg.update({'type': 'torch.device'}) - if not isinstance(element, str): - - if hasattr(element, "index"): - device_value = element.type + ":" + str(element.index) - single_arg.update({'value': device_value}) - else: - device_value = element.type - else: - single_arg.update({'value': element}) - return single_arg - - def analyze_dtype_in_kwargs(self, element): - single_arg = {} - single_arg.update({'type': 'torch.dtype'}) - single_arg.update({'value': str(element)}) - return single_arg - - def get_tensor_extremum(self, data, operator): - if data.dtype is torch.bool: - if operator == 'max': - return True in data - elif operator == 'min': - return False not in data - if operator == 'max': - return torch._C._VariableFunctionsClass.max(data).item() - else: - return 
torch._C._VariableFunctionsClass.min(data).item() - - def get_type_name(self, name): - - left = name.index("'") - right = name.rindex("'") - return name[left + 1: right] - - -class ForwardAPIInfo(APIInfo): - def __init__(self, name, save_real_data, args, kwargs): - super().__init__(name, is_forward=True, save_real_data=save_real_data) - self.analyze_api_input(args, kwargs) - self.analyze_api_call_stack() - - def analyze_api_input(self, args, kwargs): - args_info_list = self.analyze_element(args) - kwargs_info_dict = self.analyze_element(kwargs) - self.api_info_struct = {self.api_name: {"args": args_info_list, "kwargs": kwargs_info_dict}} - - def analyze_api_call_stack(self): - stack_str = [] - for (_, path, line, func, code, _) in inspect.stack()[3:]: - if not code: - continue - stack_line = " ".join([ - "File", ", ".join([path, " ".join(["line", str(line)]), " ".join(["in", func]), - " ".join(["\n", code[0].strip()])])]) - stack_str.append(stack_line) - self.stack_info_struct = {self.api_name: stack_str} - - -class BackwardAPIInfo(APIInfo): - def __init__(self, name, grads): - super().__init__(name, is_forward=False) - self.analyze_api_input(grads) - - def analyze_api_input(self, grads): - grads_info_list = self.analyze_element(grads) - self.grad_info_struct = {self.api_name: grads_info_list} - - -def write_api_info_json(api_info): - dump_path = "./" - rank = api_info.rank - if isinstance(api_info, ForwardAPIInfo): - file_path = os.path.join(dump_path, f'forward_info_{rank}.json') - stack_file_path = os.path.join(dump_path, f'stack_info_{rank}.json') - write_json(file_path, api_info.api_info_struct) - write_json(stack_file_path, api_info.stack_info_struct, indent=4) - - elif isinstance(api_info, BackwardAPIInfo): - file_path = os.path.join(dump_path, f'backward_info_{rank}.json') - write_json(file_path, api_info.grad_info_struct) - else: - raise ValueError(f"Invalid api_info type {type(api_info)}") - - -def write_json(file_path, data, indent=None): - if not os.path.exists(file_path): - with FileOpen(file_path, 'w') as f: - f.write("{\n}") - change_mode(file_path, FileCheckConst.DATA_FILE_AUTHORITY) - lock.acquire() - with FileOpen(file_path, 'a+') as f: - fcntl.flock(f, fcntl.LOCK_EX) - try: - f.seek(0, os.SEEK_END) - f.seek(f.tell() - 1, os.SEEK_SET) - f.truncate() - if f.tell() > 3: - f.seek(f.tell() - 1, os.SEEK_SET) - f.truncate() - f.write(',\n') - f.write(json.dumps(data, indent=indent)[1:-1] + '\n}') - except Exception as e: - raise ValueError(f"Json save failed:{e}") from e - finally: - fcntl.flock(f, fcntl.LOCK_UN) - lock.release() - - -def initialize_output_json(): - dump_path = os.path.realpath("./") - files = ['forward_info.json', 'backward_info.json', 'stack_info.json'] - - forward_real_data_path = os.path.join(dump_path, 'forward_real_data') - if os.path.exists(forward_real_data_path): - raise ValueError(f"file {forward_real_data_path} already exists, please remove it first") - else: - os.mkdir(forward_real_data_path, mode=0o750) - - backward_real_data_path = os.path.join(dump_path, 'backward_real_data') - if os.path.exists(backward_real_data_path): - raise ValueError(f"file {backward_real_data_path} already exists, please remove it first") - else: - os.mkdir(backward_real_data_path, mode=0o750) - for file in files: - file_path = os.path.join(dump_path, file) - if os.path.exists(file_path): - raise ValueError(f"file {file_path} already exists, please remove it first or use a new dump path") \ No newline at end of file diff --git 
a/debug/accuracy_tools/atat/pytorch/overflow_check/overflow_check.py b/debug/accuracy_tools/atat/pytorch/overflow_check/overflow_check.py deleted file mode 100644 index f8f9926b6cd2bab4a347260e0126f551297aec8b..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/overflow_check/overflow_check.py +++ /dev/null @@ -1,190 +0,0 @@ -import os -from pathlib import Path - -import torch - -try: - import torch_npu -except ImportError: - is_gpu = True -else: - is_gpu = False - -from atat.core.file_check_util import FileCheckConst -from atat.core.utils import print_warn_log, get_time, print_info_log -from ..dump.dump import forward_init_status, forward_acl_dump -from .utils import OverFlowUtil, dump_overflow, check_overflow_npu, clear_overflow_npu -from ..dump.utils import DumpUtil, Const, get_tensor_rank, create_dirs_if_not_exist, check_single_rank_folder -from .info_dump import write_api_info_json, ForwardAPIInfo, BackwardAPIInfo -from ..dump import dump - -backward_init_status = False -api_overflow = [] -forward_api_info = {} -backward_api_info = {} -FORWARD_REAL_DATA_PATH = os.path.join('./', 'forward_real_data') -BACKWARD_REAL_DATA_PATH = os.path.join('./', 'backward_real_data') -rank = os.getpid() -pkl_name = '' - - -def check_overflow_environment(pid): - if not OverFlowUtil.get_overflow_check_switch(): - return False - if pid != os.getpid(): - return False - if is_gpu: - print_warn_log("Overflow detection is not supported in the GPU environment.") - return False - global backward_init_status - if backward_init_status or forward_init_status: - return False - return True - - -def check_data_overflow(x): - if isinstance(x, (tuple, list)) and x: - for i, item in enumerate(x): - if True == check_data_overflow(item): - return True - return False - else: - if isinstance(x, torch.Tensor) and x.numel() != 0 and x.dtype != torch.bool: - if x.is_meta: - return False - if len(x.shape) == 0: - tensor_max = x.cpu().detach().float().numpy().tolist() - tensor_min = tensor_max - else: - tensor_max = torch._C._VariableFunctionsClass.max(x).cpu().detach().float().numpy().tolist() - tensor_min = torch._C._VariableFunctionsClass.min(x).cpu().detach().float().numpy().tolist() - # inf - if tensor_max == float('inf') or tensor_min == float('-inf'): - return True - if x.dtype in [torch.float16, torch.float32, torch.bfloat16] and \ - (tensor_max == torch.finfo(x.dtype).max or tensor_min == torch.finfo(x.dtype).min): - return True - # nan - elif tensor_max != tensor_max or tensor_min != tensor_min: - return True - else: - return False - elif isinstance(x, bool) or isinstance(x, int) or isinstance(x, float): - if x == float('inf') or x == float('-inf') or x != x: - return True - else: - return False - else: - return False - - -def check_path(apis, path): - return any(api in path for api in apis) - - -def overflow_check(name, **kwargs): - overflow_nums = OverFlowUtil.overflow_nums - pid = kwargs.get('pid') - dump_mode = DumpUtil.dump_switch_mode - if not pid: - return RuntimeError("Not get the specified process pid.") - - def overflowcheck_hook(module, in_feat, out_feat=None): - if not check_overflow_environment(pid): - return - dump_file = DumpUtil.get_dump_path() - global rank - dump_dir, dump_filename = os.path.split(dump_file) - dump_dir = os.path.join(dump_dir, "step{}".format(DumpUtil.iter_num)) - if not os.path.exists(dump_dir): - Path(dump_dir).mkdir(mode=FileCheckConst.DATA_DIR_AUTHORITY, exist_ok=True) - if DumpUtil.is_single_rank is None: - DumpUtil.is_single_rank = 
check_single_rank_folder(dump_dir) - dump_file = os.path.join(dump_dir, dump_filename) - rank_this = get_tensor_rank(in_feat, out_feat) - DumpUtil.dump_root = os.path.dirname(DumpUtil.dump_path) - if rank_this is not None and rank != rank_this: - rank = rank_this - dump.rename_() - if DumpUtil.target_rank is not None: - if rank != DumpUtil.target_rank: - return - dump_path = create_dirs_if_not_exist(rank, dump_file) - global pkl_name - pkl_name = dump_path - dump_dir = os.path.split(dump_path)[0] - global api_overflow - global forward_api_info - global backward_api_info - - module_name = name - if hasattr(torch_npu._C, '_npu_is_support_inf_nan') and torch_npu._C._npu_is_support_inf_nan(): - # backward API endwith backward - if module_name.endswith(Const.BACKWARD): - check_feat = in_feat - else: - check_feat = out_feat - module.has_overflow = check_data_overflow(check_feat) - else: - module.has_overflow = check_overflow_npu() - if not module.has_overflow: - if hasattr(module, 'input_args'): - del module.input_args - if hasattr(module, 'input_kwargs'): - del module.input_kwargs - if module.has_overflow and OverFlowUtil.check_overflow_dump_times(overflow_nums): - if overflow_type_judge(in_feat, out_feat, module_name) and DumpUtil.need_replicate: - if module_name.endswith(Const.FORWARD): - forward_api_info.update({name: ForwardAPIInfo(name, True, module.input_args, module.input_kwargs)}) - api_overflow.append(module_name) - else: - api_overflow.append(module_name.replace("backward", "forward")) - backward_api_info.update({name: BackwardAPIInfo(name, out_feat)}) - OverFlowUtil.inc_overflow_dump_times() - dump_file_name = os.path.join(dump_dir, - "{}_{}.pkl".format(module_name, OverFlowUtil.real_overflow_dump_times)) - dump_overflow(module_name, in_feat, out_feat, dump_file_name) - dump.pkl_name = dump_file_name - - print_warn_log("[overflow {} times]: module name :'{}' is overflow and dump file is saved in '{}'." - .format(OverFlowUtil.real_overflow_dump_times, module_name, - os.path.realpath(dump_file_name))) - if dump_mode == "acl": - acl_dump(module, module_name) - dump.write_to_disk() - # clear overflow flag for the next check - clear_overflow_npu() - if not OverFlowUtil.check_overflow_dump_times(overflow_nums): - for key in forward_api_info: - write_api_info_json(forward_api_info[key]) - for key in backward_api_info: - write_api_info_json(backward_api_info[key]) - raise ValueError("[overflow {} times]: dump file is saved in '{}'." - .format(OverFlowUtil.real_overflow_dump_times, os.path.realpath(dump_file_name))) - - def overflow_type_judge(in_feat, out_feat, module_name): - if module_name.endswith(Const.BACKWARD): - check_feat = out_feat - else: - check_feat = in_feat - if check_data_overflow(check_feat): - print_warn_log("module name :'{}' is overflow and its inputs already has an overflow, so you need " - "to go back to find where the overflow started.".format(module_name)) - return False - elif not check_data_overflow(in_feat) and not check_data_overflow(out_feat): - print_warn_log("module name :'{}' is overflow and its inputs and outputs do not overflow, " - "so this is a process overflow".format(module_name)) - return False - else: - print_warn_log("module name :'{}' is overflow. Its input is normal and its output " - "is overflow.".format(module_name)) - return True - - def acl_dump(module, module_name): - if "forward" in module_name: - forward_acl_dump(module, module_name) - if "backward" in module_name: - print_info_log("The overflow is caused by backward operator {}. 
" - "You can use reverse acl dump(mode='acl') to get operator dump data.".format(module_name)) - - return overflowcheck_hook diff --git a/debug/accuracy_tools/atat/pytorch/overflow_check/utils.py b/debug/accuracy_tools/atat/pytorch/overflow_check/utils.py deleted file mode 100644 index d254d5845505fb2ae0c41c56ac9a0e1d9225ba87..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/atat/pytorch/overflow_check/utils.py +++ /dev/null @@ -1,114 +0,0 @@ -import os -import torch - -try: - import torch_npu -except ImportError: - is_gpu = True -else: - is_gpu = False - -from atat.core.utils import check_switch_valid, check_inplace_op, OverflowConst -from ..common.utils import Const -from ..dump.dump import dump_stack_info, get_scalar_data_info, dump_data_by_rank_count, \ - get_not_float_tensor_info, get_float_tensor_info -from ..dump.utils import DumpUtil, make_dump_data_dir - - -class OverFlowUtil(object): - overflow_check_switch = None - overflow_filter_switch = Const.OFF - real_overflow_dump_times = 0 - overflow_nums = 1 - - @staticmethod - def set_overflow_check_switch(switch, filter_switch): - OverFlowUtil.overflow_check_switch = switch - OverFlowUtil.overflow_filter_switch = filter_switch - - @staticmethod - def get_overflow_check_switch(): - if OverFlowUtil.overflow_check_switch is None: - return True - return OverFlowUtil.overflow_check_switch == "ON" - - @staticmethod - def inc_overflow_dump_times(): - OverFlowUtil.real_overflow_dump_times += 1 - - @staticmethod - def check_overflow_dump_times(need_dump_times): - if need_dump_times == -1: - return True - return OverFlowUtil.real_overflow_dump_times < need_dump_times - - -def set_overflow_check_switch(switch, filter_switch=Const.OFF): - check_switch_valid(switch) - check_switch_valid(filter_switch) - - OverFlowUtil.set_overflow_check_switch(switch, filter_switch) - - -def dump_overflow(module_name, in_feat, out_feat, dump_file): - name_template = f"{module_name}" + "_{}" - DumpUtil.dump_data_dir = make_dump_data_dir(dump_file) - dump_stack_info(name_template) - if check_inplace_op(name_template): - if Const.PRE_FORWARD in name_template: - name_template = name_template.replace(Const.PRE_FORWARD, Const.FORWARD) - else: - _dump_tensor_completely(in_feat, name_template.format("output")) - return - - if "forward" in name_template: - _dump_tensor_completely(in_feat, name_template.format("input")) - _dump_tensor_completely(out_feat, name_template.format("output")) - else: - _dump_tensor_completely(in_feat, name_template.format("output")) - _dump_tensor_completely(out_feat, name_template.format("input")) - - -def _dump_tensor_completely(x, prefix): - dump_flag = Const.DUMP_RATIO_MAX + 1 - if isinstance(x, (tuple, list)) and x: - for i, item in enumerate(x): - _dump_tensor_completely(item, "{}.{}".format(prefix, i)) - elif isinstance(x, torch.Tensor): - if x.numel() == 0 or len(x.shape) == 0 or not x.is_floating_point(): - if OverFlowUtil.overflow_filter_switch == Const.OFF: - data_info = get_not_float_tensor_info(x) - dump_data_by_rank_count(dump_flag, prefix, data_info) - else: - data_info = get_float_tensor_info(x) - dump_data_by_rank_count(dump_flag, prefix, data_info) - - elif OverFlowUtil.overflow_filter_switch == Const.OFF: - if isinstance(x, bool) or isinstance(x, int) or isinstance(x, float): - data_info = get_scalar_data_info(x) - dump_data_by_rank_count(dump_flag, prefix, data_info) - - -def overflow_debug_mode_enalbe(): - overflow_mode = os.getenv(OverflowConst.OVERFLOW_DEBUG_MODE_ENABLE, Const.ENV_DISABLE) - return 
overflow_mode == Const.ENV_ENABLE - - -def check_overflow_npu(): - if overflow_debug_mode_enalbe(): - float_status = torch.zeros(8).npu() - result = torch_npu.npu_get_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) - if (result.cpu()[0] != 0): - return True - else: - return False - else: - return torch_npu._C._check_overflow_npu() - - -def clear_overflow_npu(): - if overflow_debug_mode_enalbe(): - float_status = torch.zeros(8).npu() - torch_npu.npu_clear_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) - else: - torch_npu._C._clear_overflow_npu() \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/README.md b/debug/accuracy_tools/grad_tool/README.md index ddb281242d748f7370a62e64246ed3d988303382..a7929ca8187e35bb22fd27ac6845b4dbed4f9039 100644 --- a/debug/accuracy_tools/grad_tool/README.md +++ b/debug/accuracy_tools/grad_tool/README.md @@ -20,11 +20,9 @@ export PYTHONPATH=$PYTHONPATH:$MSTT_HOME/debug/accuracy_tools/ ``` -2. 安装依赖 +2. 使用pip命令安装matplotlib、mindspore、numpy、pandas、PyYAML、torch、tqdm依赖。 - ```bash - pip3 install pandas pyyaml tqdm matplotlib - ``` + 若环境中已安装部分依赖,不需要重复安装。 ## 使用方式 @@ -56,7 +54,7 @@ **不同级别的level的导出数据** -- PyTorch不同level数据 +- PyTorch/MindSpore动态图不同level数据 | 级别 | 特征数据表头 | 是否有方向数据 | | ---- | ------------------------------------------------------------ | -------------- | @@ -64,7 +62,7 @@ | L1 | ("param_name", "max", "min", "norm", "shape") | 是 | | L2 | ("param_name", *intervals, "=0", "max", "min", "norm", "shape") | 是 | -- MindSpore不同level数据 +- MindSpore静态图不同level数据 | 级别 | 特征数据表头 | 是否有方向数据 | | ---- | ------------------------------------------------------------ | -------------- | @@ -267,3 +265,6 @@ GradComparator.compare_distributed(grad_output_path1, # FAQ +**比对时报错:TypeError: can't convert xla:0 device type tensor to numpy. 
Use Tensor.cpu() to copy the tensor to host memory first.** + +表示torch.load的map_location参数没生效,tensor没被加载到cpu上,可能由torch和cann包不匹配导致 \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/common/base_comparator.py b/debug/accuracy_tools/grad_tool/common/base_comparator.py index b5dc45b20cc41422838ef66c3bc4ac9690d18277..d3254ae71f9a8fccb8608088462c4733c166814d 100644 --- a/debug/accuracy_tools/grad_tool/common/base_comparator.py +++ b/debug/accuracy_tools/grad_tool/common/base_comparator.py @@ -12,6 +12,24 @@ from grad_tool.common.utils import write_csv, check_file_or_directory_path, prin class BaseComparator(ABC): + @staticmethod + def _get_grad_weight_order(path1, path2): + for summary_file in os.listdir(path1): + if not summary_file.endswith(".csv"): + continue + if not os.path.exists(os.path.join(path2, summary_file)): + continue + summary_csv = pd.read_csv(os.path.join(path1, summary_file)) + return summary_csv["param_name"] + raise RuntimeError("no matched grad_summary.csv for comparison, please dump data in same configuration") + + @staticmethod + def _get_name_matched_grad_file(param_name, grad_files): + for grad_file in grad_files: + if param_name == grad_file[:grad_file.rfind('.')]: + return grad_file + raise RuntimeError("no matched grad_file for comparison, please dump data in same configuration") + @classmethod def compare_distributed(cls, path1: str, path2: str, output_dir: str): ranks = cls._get_matched_dirs(path1, path2, "rank") @@ -22,9 +40,9 @@ class BaseComparator(ABC): create_directory(output_dir) for rank in tqdm(ranks, desc="rank"): print_info_log(f"now comparing rank {rank}:") - cls.compare(os.path.join(path1, f"rank_{rank}"), - os.path.join(path2, f"rank_{rank}"), - os.path.join(output_dir, f"rank_{rank}")) + cls.compare(os.path.join(path1, f"rank{rank}"), + os.path.join(path2, f"rank{rank}"), + os.path.join(output_dir, f"rank{rank}")) @classmethod def compare(cls, path1: str, path2: str, output_dir: str): @@ -41,15 +59,15 @@ class BaseComparator(ABC): check_file_or_directory_path(path1, file_type=GradConst.DIR) check_file_or_directory_path(path2, file_type=GradConst.DIR) dirs = [] - for dirname in os.listdir(path1): - splits = dirname.split('_') - if not splits or splits[0] != dir_prefix or not splits[1].isdigit(): + for dir_name in os.listdir(path1): + index = dir_name.replace(dir_prefix, "", 1) + if not dir_name.startswith(dir_prefix) or not index.isdigit(): continue - folder2 = os.path.join(path2, dirname) + folder2 = os.path.join(path2, dir_name) if not os.path.isdir(folder2): continue - dirs.append(int(splits[1])) + dirs.append(int(index)) dirs = sorted(dirs) return dirs @@ -72,24 +90,6 @@ class BaseComparator(ABC): head_tuple = tuple(['step'] + [str(step) for step in steps]) write_csv(os.path.join(output_dir, "similarities.csv"), [[key] + value], head_tuple) - @staticmethod - def _get_grad_weight_order(path1, path2): - for summary_file in os.listdir(path1): - if not summary_file.endswith(".csv"): - continue - if not os.path.exists(os.path.join(path2, summary_file)): - continue - summary_csv = pd.read_csv(os.path.join(path1, summary_file)) - return summary_csv["param_name"] - raise RuntimeError("no matched grad_summary.csv for comparison, please dump data in same configuration") - - @staticmethod - def _get_name_matched_grad_file(param_name, grad_files): - for grad_file in grad_files: - if param_name == grad_file[:grad_file.rfind('.')]: - return grad_file - raise RuntimeError("no matched grad_file for comparison, please dump data in same 
configuration") - @classmethod def _calculate_separated_similarities(cls, path1, path2, steps): similarities = {} @@ -101,8 +101,8 @@ class BaseComparator(ABC): total_count_summary = 0 for grad_name in grad_weight_order: grad_file = cls._get_name_matched_grad_file(grad_name, grad_files) - grad1 = os.path.join(path1, f"step_{step}", grad_file) - grad2 = os.path.join(path2, f"step_{step}", grad_file) + grad1 = os.path.join(path1, f"step{step}", grad_file) + grad2 = os.path.join(path2, f"step{step}", grad_file) same_count, total_count = cls._calculate_similarity(grad1, grad2) same_count_summary += same_count total_count_summary += total_count @@ -124,8 +124,8 @@ class BaseComparator(ABC): @classmethod def _get_matched_grad_files(cls, path1: str, path2: str, step: int): - path1 = os.path.join(path1, f"step_{step}") - path2 = os.path.join(path2, f"step_{step}") + path1 = os.path.join(path1, f"step{step}") + path2 = os.path.join(path2, f"step{step}") check_file_or_directory_path(path1, file_type=GradConst.DIR) check_file_or_directory_path(path2, file_type=GradConst.DIR) grad_files = [] diff --git a/debug/accuracy_tools/grad_tool/common/constant.py b/debug/accuracy_tools/grad_tool/common/constant.py index 902f54f5e65d4503c3197930491ea2882b2d42b8..d569d47c16df4be26c29e057eb8ab1936fa2ef2d 100644 --- a/debug/accuracy_tools/grad_tool/common/constant.py +++ b/debug/accuracy_tools/grad_tool/common/constant.py @@ -46,3 +46,11 @@ class GradConst: STEP_FINISH = "step_finish" SUMMARY = "summary" + + # csv header entry + MD5 = "MD5" + DISTRIBUTION = "distribution" + SHAPE = "shape" + MAX = "max" + MIN = "min" + NORM = "norm" \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/common/utils.py b/debug/accuracy_tools/grad_tool/common/utils.py index b63e95578a13b54a7de5f9ca3886d10b72418e0a..cdce3fda7e38940c293849b63f139ce96c5d1d55 100644 --- a/debug/accuracy_tools/grad_tool/common/utils.py +++ b/debug/accuracy_tools/grad_tool/common/utils.py @@ -211,3 +211,12 @@ def create_directory(dir_path): os.makedirs(dir_path, mode=GradConst.DATA_DIR_AUTHORITY, exist_ok=True) except OSError as ex: raise RuntimeError("Failed to create directory. Please check the path permission or disk space.") from ex + +def change_mode(path, mode): + check_path_exists(path) + check_link(path) + try: + os.chmod(path, mode) + except PermissionError as ex: + print_error_log(f'Failed to change {path} authority. 
{str(ex)}') + raise ex diff --git a/debug/accuracy_tools/grad_tool/grad_ms/global_context.py b/debug/accuracy_tools/grad_tool/grad_ms/global_context.py index 02d1f744543f34889cde7237b9ec4925c86f0c56..d44bea52c78922d6f04c335311988149c5154a5e 100644 --- a/debug/accuracy_tools/grad_tool/grad_ms/global_context.py +++ b/debug/accuracy_tools/grad_tool/grad_ms/global_context.py @@ -14,6 +14,8 @@ class GlobalContext: _setting = { GradConst.LEVEL: GradConst.LEVEL0, GradConst.PARAM_LIST: None, + GradConst.STEP: None, + GradConst.RANK: None, GradConst.CURRENT_STEP: 0, GradConst.BOUNDS: [-1., 0., 1.], GradConst.OUTPUT_PATH: "./grad_stat" @@ -33,6 +35,8 @@ class GlobalContext: print_warn_log("Invalid level set in config yaml file, use L0 instead.") self._set_input_list(config_dict, GradConst.PARAM_LIST, str) self._set_input_list(config_dict, GradConst.BOUNDS, float) + self._set_input_list(config_dict, GradConst.STEP, int) + self._set_input_list(config_dict, GradConst.RANK, int) output_path = config_dict.get(GradConst.OUTPUT_PATH) if output_path: try: @@ -55,6 +59,14 @@ class GlobalContext: def update_step(self): self._setting[GradConst.CURRENT_STEP] += 1 + def step_need_dump(self, step): + dump_step_list = self.get_context(GradConst.STEP) + return (not dump_step_list) or (step in dump_step_list) + + def rank_need_dump(self, rank): + dump_rank_list = self.get_context(GradConst.RANK) + return (not dump_rank_list) or (rank in dump_rank_list) + def _set_input_list(self, config_dict: Dict, name: str, dtype: Union[int, str, float]): value = config_dict.get(name) if dtype == int: @@ -72,5 +84,4 @@ class GlobalContext: else: print_warn_log(f"{name} is None or not a list with valid items, use default value.") - grad_context = GlobalContext() diff --git a/debug/accuracy_tools/grad_tool/grad_ms/grad_analyzer.py b/debug/accuracy_tools/grad_tool/grad_ms/grad_analyzer.py index 963a37f86d3ce4c85caf336e95f87b99955c3905..75280b3194451778b66ce517ca67ae996e853a42 100644 --- a/debug/accuracy_tools/grad_tool/grad_ms/grad_analyzer.py +++ b/debug/accuracy_tools/grad_tool/grad_ms/grad_analyzer.py @@ -78,6 +78,8 @@ class CSVGenerator(Process): self.level = GradConst.LEVEL0 self.cache_list = ListCache() self.current_step = None + self.stop_event = None + self.last_finish = False self.bounds = [-0.1, 0.0, 0.1], def init(self, context: GlobalContext): diff --git a/debug/accuracy_tools/grad_tool/grad_ms/grad_stat_csv.py b/debug/accuracy_tools/grad_tool/grad_ms/grad_stat_csv.py new file mode 100644 index 0000000000000000000000000000000000000000..11c2fc820557d4ec2e7e48631ec0e4fe5bfa795d --- /dev/null +++ b/debug/accuracy_tools/grad_tool/grad_ms/grad_stat_csv.py @@ -0,0 +1,130 @@ +from abc import ABC, abstractmethod +import hashlib + +import mindspore +from mindspore import ops, Tensor +from grad_tool.common.constant import GradConst + + +class CsvInput: + def __init__(self, param_name, grad, bounds): + self.param_name = param_name + self.grad = grad + self.bounds = bounds + +class GradStatCsv: + csv = {} + + @staticmethod + def get_csv_header(level, csv_input): + header = ["param_name"] + for key in level["header"]: + header.extend(GradStatCsv.csv[key].generate_csv_header(csv_input)) + return header + + @staticmethod + def get_csv_line(level, csv_input): + line = [csv_input.param_name] + for key in level["header"]: + line.extend(GradStatCsv.csv[key].generate_csv_content(csv_input)) + return line + + +def register_csv_item(key, cls=None): + if cls is None: + # 无参数时,返回装饰器函数 + return lambda cls: register_csv_item(key, cls) + 
GradStatCsv.csv[key] = cls + return cls + + +class CsvItem(ABC): + @staticmethod + @abstractmethod + def generate_csv_header(csv_input): + pass + + @staticmethod + @abstractmethod + def generate_csv_content(csv_input): + pass + + +@register_csv_item(GradConst.MD5) +class CsvMd5(CsvItem): + def generate_csv_header(csv_input): + return ["MD5"] + + def generate_csv_content(csv_input): + grad = csv_input.grad + tensor_bytes = grad.float().numpy().tobytes() + md5_hash = hashlib.md5(tensor_bytes) + return [md5_hash.hexdigest()] + + +@register_csv_item(GradConst.DISTRIBUTION) +class CsvDistribution(CsvItem): + def generate_csv_header(csv_input): + bounds = csv_input.bounds + intervals = [] + for i, _ in enumerate(bounds): + if i == 0: + intervals.append(f"(-inf, {bounds[i]}]") + else: + intervals.append(f"({bounds[i-1]}, {bounds[i]}]") + intervals.extend([f"({bounds[-1]}, inf)", "=0"]) + return intervals + + def generate_csv_content(csv_input): + grad = csv_input.grad + bounds = csv_input.bounds + if grad.dtype == mindspore.bfloat16: + grad = grad.to(mindspore.float32) + element_num = grad.numel() + grad_equal_0_num = (grad == 0).sum().item() + bucketsize_result = ops.bucketize(grad.float(), bounds) + bucketsize_result = bucketsize_result.astype(mindspore.int8) + interval_nums = [(bucketsize_result == i).sum().item() for i in range(len(bounds) + 1)] + interval_nums.append(grad_equal_0_num) + return_list = [x / element_num if element_num != 0 else 0 for x in interval_nums] + return return_list + + +@register_csv_item(GradConst.MAX) +class CsvMax(CsvItem): + def generate_csv_header(csv_input): + return ["max"] + + def generate_csv_content(csv_input): + grad = csv_input.grad + return [ops.amax(grad).float().numpy().tolist()] + + +@register_csv_item(GradConst.MIN) +class CsvMin(CsvItem): + def generate_csv_header(csv_input): + return ["min"] + + def generate_csv_content(csv_input): + grad = csv_input.grad + return [ops.amin(grad).float().numpy().tolist()] + + +@register_csv_item(GradConst.NORM) +class CsvNorm(CsvItem): + def generate_csv_header(csv_input): + return ["norm"] + + def generate_csv_content(csv_input): + grad = csv_input.grad + return [ops.norm(grad).float().numpy().tolist()] + + +@register_csv_item(GradConst.SHAPE) +class CsvShape(CsvItem): + def generate_csv_header(csv_input): + return ["shape"] + + def generate_csv_content(csv_input): + grad = csv_input.grad + return [list(grad.shape)] \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/grad_ms/hook.py b/debug/accuracy_tools/grad_tool/grad_ms/hook.py index ceadfee61426ce54587ecfeb0f6bab0e2e0e2537..f0d4798182de15549ed00d829b89e9be72705698 100644 --- a/debug/accuracy_tools/grad_tool/grad_ms/hook.py +++ b/debug/accuracy_tools/grad_tool/grad_ms/hook.py @@ -1,4 +1,4 @@ -from functools import wraps + import os import shutil @@ -10,38 +10,82 @@ from mindspore.common.parameter import Parameter from mindspore.common.initializer import initializer from grad_tool.common.constant import GradConst -from grad_tool.common.utils import print_warn_log +from grad_tool.common.utils import print_warn_log, write_csv from grad_tool.grad_ms.global_context import grad_context from grad_tool.grad_ms.grad_analyzer import grad_dump, get_rank_id from grad_tool.grad_ms.grad_analyzer import csv_generator +from grad_tool.grad_ms.grad_stat_csv import GradStatCsv, CsvInput +from grad_tool.grad_ms.utils import save_grad_direction, get_adapted_level +class HookInput: -def hook_optimizer(opt: Optimizer): - func = opt.construct - g_names = [param.name 
for param in opt._parameters] - param_list = grad_context.get_context(GradConst.PARAM_LIST) - rank_id = get_rank_id() - output_path = grad_context.get_context(GradConst.OUTPUT_PATH) - dump_dir = f"{output_path}/rank_{rank_id}/Dump/" - save_dir = f"{output_path}/rank_{rank_id}/" - step_finish_flag = f"{output_path}/rank_{rank_id}/Dump/{GradConst.STEP_FINISH}" - if os.path.exists(save_dir): - print_warn_log(f"Delete existing path {save_dir}.") - shutil.rmtree(save_dir) - level = grad_context.get_context(GradConst.LEVEL) - bounds = grad_context.get_context(GradConst.BOUNDS) + ''' + HookInput is a class wrapping all the variables used for hooking optimizer + ''' + + def __init__(self, opt) -> None: + self.func = opt.construct + self.g_names = [param.name for param in opt._parameters] + self.param_list = grad_context.get_context(GradConst.PARAM_LIST) + self.rank_id = get_rank_id() + output_path = grad_context.get_context(GradConst.OUTPUT_PATH) + self.dump_dir = os.path.join(output_path, f"rank_{self.rank_id}", "Dump") + self.save_dir = os.path.join(output_path, f"rank_{self.rank_id}") + self.step_finish_flag = os.path.join(self.dump_dir, GradConst.STEP_FINISH) + if os.path.exists(self.save_dir): + print_warn_log(f"Delete existing path {self.save_dir}.") + shutil.rmtree(self.save_dir) + self.level = grad_context.get_context(GradConst.LEVEL) + self.bounds = grad_context.get_context(GradConst.BOUNDS) + self.mode = mindspore.get_context("mode") +def hook_graph_mode_optimizer(opt, hook_input): @jit def new_construct(self, gradients): for index, grad_value in enumerate(gradients): - if param_list and g_names[index] not in param_list: + if hook_input.param_list and hook_input.g_names[index] not in hook_input.param_list: continue - grad_dump(dump_dir, g_names[index], self.dump_step, grad_value, level, bounds) - ms.ops.TensorDump()(step_finish_flag, self.dump_step) + grad_dump(hook_input.dump_dir, hook_input.g_names[index], self.dump_step, + grad_value, hook_input.level, hook_input.bounds) + ms.ops.TensorDump()(hook_input.step_finish_flag, self.dump_step) self.assignadd(self.dump_step, self.global_step_increase_tensor) - out = func(gradients) + out = hook_input.func(gradients) return out opt.dump_step = Parameter(initializer(0, [1], ms.int32), name="dump_step") opt.construct = new_construct.__get__(opt, type(opt)) csv_generator.start() + +def hook_pynative_optimizer(opt, hook_input): + level_adapted = get_adapted_level(hook_input.level) + + def hook_fn(cell, input): + gradients, = input + cur_step = grad_context.get_context(GradConst.CURRENT_STEP) + if grad_context.step_need_dump(cur_step) and grad_context.rank_need_dump(hook_input.rank_id): + output_lines = [] + for index, grad_value in enumerate(gradients): + param_name = hook_input.g_names[index] + if hook_input.param_list and param_name not in hook_input.param_list: + continue + csv_input = CsvInput(param_name, grad_value, hook_input.bounds) + grad_info = GradStatCsv.get_csv_line(level_adapted, csv_input) + output_lines.append(grad_info) + if level_adapted["have_grad_direction"]: + save_grad_direction(param_name, grad_value, os.path.join(hook_input.save_dir, f'step_{cur_step}')) + output_csv_path = os.path.join(hook_input.save_dir, f"grad_summary_{cur_step}.csv") + dummy_csv_input = CsvInput(None, None, hook_input.bounds) + write_csv(output_csv_path, output_lines, + GradStatCsv.get_csv_header(level_adapted, dummy_csv_input)) + grad_context.update_step() + + opt.register_forward_pre_hook(hook_fn) + + +def hook_optimizer(opt: Optimizer): + hook_input 
= HookInput(opt) + + if hook_input.mode == mindspore.GRAPH_MODE: + hook_graph_mode_optimizer(opt, hook_input) + else: + hook_pynative_optimizer(opt, hook_input) \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/grad_ms/utils.py b/debug/accuracy_tools/grad_tool/grad_ms/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..23703f28208e49b3d5e74202d14aedb6ef0f983e --- /dev/null +++ b/debug/accuracy_tools/grad_tool/grad_ms/utils.py @@ -0,0 +1,42 @@ +import os + +import numpy as np +import mindspore +from grad_tool.common.constant import GradConst +from grad_tool.common.utils import print_warn_log, create_directory, change_mode, check_file_or_directory_path + +level_adp = { + "L0": { + "header": [GradConst.MD5, GradConst.MAX, GradConst.MIN, GradConst.NORM, GradConst.SHAPE], + "have_grad_direction": False + }, + "L1": { + "header": [GradConst.MAX, GradConst.MIN, GradConst.NORM, GradConst.SHAPE], + "have_grad_direction": True + }, + "L2": { + "header": [GradConst.DISTRIBUTION, GradConst.MAX, GradConst.MIN, GradConst.NORM, GradConst.SHAPE], + "have_grad_direction": True + }, + } + +def save_grad_direction(param_name, grad, save_path): + if not os.path.exists(save_path): + create_directory(save_path) + save_filepath = os.path.join(save_path, f"{param_name}.npy") + check_file_or_directory_path(save_filepath) + + if grad.dtype == mindspore.bfloat16: + grad = grad.to(mindspore.float32) + grad_direction_tensor = grad > 0 + grad_direction_ndarray = grad_direction_tensor.numpy() + + np.save(save_filepath, grad_direction_ndarray) + change_mode(save_filepath, 0o640) + +def get_adapted_level(level: str): + if level == GradConst.LEVEL3: + print_warn_log(f"In mindpsore pynative mode, only 'L0', 'L1' and 'L2' are supported, use L0 instead") + level = GradConst.LEVEL0 + level_adapted = level_adp.get(level) + return level_adapted \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py b/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py index 19c5b32bf8bdb94373bf73784a9e608cc4fbfd83..f3079e622c29eefe20a6e1fdc9d372002b596610 100644 --- a/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py +++ b/debug/accuracy_tools/grad_tool/grad_pt/grad_monitor.py @@ -4,19 +4,35 @@ from collections import defaultdict import torch from torch.optim.optimizer import register_optimizer_step_pre_hook from grad_tool.common.base_monitor import BaseMonitor -from grad_tool.grad_pt.level_adapter import Level, LevelAdapter from grad_tool.grad_pt.grad_stat_csv import GradStatCsv from grad_tool.common.utils import check_numeral_list_ascend, data_in_list_target, \ - write_csv, print_info_log, create_directory, print_warn_log -from grad_tool.grad_pt.utils import get_rank_id, print_rank_0 + write_csv, print_info_log, create_directory, print_warn_log, change_mode +from grad_tool.grad_pt.utils import get_rank_id, print_rank_0, GradConst class PtGradientMonitor(BaseMonitor): default_bounds = [-10, -1, -0.1, -0.01, -0.001, 0, 0.001, 0.01, 0.1, 1, 10] + level_adp = { + "L0": { + "header": [GradConst.md5, GradConst.max, GradConst.min, GradConst.norm, GradConst.shape], + "have_grad_direction": False + }, + "L1": { + "header": [GradConst.max, GradConst.min, GradConst.norm, GradConst.shape], + "have_grad_direction": True + }, + "L2": { + "header": [GradConst.distribution, GradConst.max, GradConst.min, GradConst.norm, GradConst.shape], + "have_grad_direction": True + }, + } def __init__(self, config_filepath): super(PtGradientMonitor, 
self).__init__(config_filepath) - self._level_adp: Level = LevelAdapter.level_adapter(self.config.get("level")) + level = self.config.get("level") + if level not in PtGradientMonitor.level_adp: + raise Exception(f"level is valid, not in {PtGradientMonitor.level_adp.keys()}") + self._level_adp = PtGradientMonitor.level_adp[level] self._param_list = self.config.get('param_list') self._target_ranks = self.config.get("rank") print_info_log(f"target rank {self._target_ranks}") @@ -38,6 +54,16 @@ class PtGradientMonitor(BaseMonitor): def output_path(self): return self._output_path + @staticmethod + def save_grad_direction(param_name, grad, save_path): + if not os.path.exists(save_path): + create_directory(save_path) + param_grad = grad.clone().detach() + is_positive = param_grad > 0 + save_filepath = os.path.join(save_path, f"{param_name}.pt") + torch.save(is_positive, save_filepath) + change_mode(save_filepath, 0o640) + def monitor(self, model): print_rank_0("> parameter names:") for name, param in model.named_parameters(): @@ -66,17 +92,14 @@ class PtGradientMonitor(BaseMonitor): if grad is None: print_info_log(f"grad is None: {param_name}") continue - grad_info = GradStatCsv.generate_csv_line( - level=self._level_adp, - param_name=param_name, - grad=grad, - bounds=self._bounds) + grad_info = GradStatCsv.generate_csv_line(param_name, self._level_adp, grad, self._bounds) output_lines.append(grad_info) - self._level_adp.save_grad_direction(param_name, grad, - f'{self._output_path}/rank_{self._rank}/step_{self._step}') - output_path = os.path.join(self._output_path, f"rank_{getattr(self, '_rank')}", + if self._level_adp["have_grad_direction"]: + PtGradientMonitor.save_grad_direction(param_name, grad, + f'{self._output_path}/rank{self._rank}/step{self._step}') + output_path = os.path.join(self._output_path, f"rank{getattr(self, '_rank')}", f"grad_summary_{self._step}.csv") write_csv(output_path, output_lines, - GradStatCsv.generate_csv_header(level=self._level_adp, bounds=self._bounds)) + GradStatCsv.generate_csv_header(self._level_adp, self._bounds)) register_optimizer_step_pre_hook(optimizer_pre_step_hook) diff --git a/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py b/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py index 442b763b47717a461e9dc1a147dca03cbabda488..48e9bff0aac458272a1b1f62c710b3649b89c25d 100644 --- a/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py +++ b/debug/accuracy_tools/grad_tool/grad_pt/grad_stat_csv.py @@ -1,104 +1,127 @@ +from abc import ABC, abstractmethod +from collections import namedtuple import hashlib import torch -from grad_tool.grad_pt.level_adapter import Level +from grad_tool.grad_pt.utils import GradConst - -class GradExtremeOps: - @staticmethod - def tensor_max(tensor): - return torch._C._VariableFunctionsClass.max(tensor).cpu().detach().float().numpy().tolist() - - @staticmethod - def tensor_min(tensor): - return torch._C._VariableFunctionsClass.min(tensor).cpu().detach().float().numpy().tolist() - - @staticmethod - def tensor_norm(tensor): - return torch._C._VariableFunctionsClass.norm(tensor).cpu().detach().float().numpy().tolist() - - -class GradExtremes: - extremes = { - "max": GradExtremeOps.tensor_max, - "min": GradExtremeOps.tensor_min, - "norm": GradExtremeOps.tensor_norm - } - - -class GradStatOps: - @staticmethod - def md5_header(**kwargs): - level: Level = kwargs.get("level") - return level.MD5_header() - - @staticmethod - def intervals_header(**kwargs): - level: Level = kwargs.get("level") - bounds = kwargs.get("bounds") - 
return level.intervals_header(bounds) - - @staticmethod - def extremes_header(**kwargs): - return GradExtremes.extremes.keys() - - @staticmethod - def shape_header(**kwargs): - return ["shape"] - - @staticmethod - def md5_content(**kwargs): - grad = kwargs.get("grad") - level: Level = kwargs.get("level") - return level.MD5_content(grad) - - @staticmethod - def count_distribution(**kwargs): - level: Level = kwargs.get("level") - grad = kwargs.get("grad") - bounds = kwargs.get("bounds") - return level.count_grad_distribution(grad, bounds) - - @staticmethod - def extremes_content(**kwargs): - grad = kwargs.get("grad") - return [f(grad) for f in GradExtremes.extremes.values()] - - @staticmethod - def shape_content(**kwargs): - grad = kwargs.get("grad") - return [list(grad.shape)] +CSV_header_input = namedtuple("CSV_header_input", ["bounds"]) +CSV_content_input = namedtuple("CSV_content_input", ["grad", "bounds"]) class GradStatCsv: - CSV = { - "MD5": { - "header": GradStatOps.md5_header, - "content": GradStatOps.md5_content - }, - "distribution": { - "header": GradStatOps.intervals_header, - "content": GradStatOps.count_distribution - }, - "extremes": { - "header": GradStatOps.extremes_header, - "content": GradStatOps.extremes_content - }, - "shape": { - "header": GradStatOps.shape_header, - "content": GradStatOps.shape_content - }, - } + csv = {} @staticmethod - def generate_csv_header(**kwargs): + def generate_csv_header(level, bounds): header = ["param_name"] - for func in GradStatCsv.CSV.values(): - header.extend(func["header"](**kwargs)) + for key in level["header"]: + csv_header_input = CSV_header_input(bounds=bounds) + header.extend(GradStatCsv.csv[key].generate_csv_header(csv_header_input)) return header @staticmethod - def generate_csv_line(**kwargs): - line = [kwargs.get("param_name")] - for func in GradStatCsv.CSV.values(): - line.extend(func["content"](**kwargs)) + def generate_csv_line(param_name, level, grad, bounds): + line = [param_name] + for key in level["header"]: + csv_content_input = CSV_content_input(grad=grad, bounds=bounds) + line.extend(GradStatCsv.csv[key].generate_csv_content(csv_content_input)) return line + + +def register_csv_item(key, cls=None): + if cls is None: + # 无参数时,返回装饰器函数 + return lambda cls: register_csv_item(key, cls) + GradStatCsv.csv[key] = cls + return cls + + +class CsvItem(ABC): + @abstractmethod + def generate_csv_header(csv_header_input): + pass + + @abstractmethod + def generate_csv_content(csv_content_input): + pass + + +@register_csv_item(GradConst.md5) +class CSV_md5(CsvItem): + def generate_csv_header(csv_header_input): + return ["MD5"] + + def generate_csv_content(csv_content_input): + grad = csv_content_input.grad + tensor_bytes = grad.cpu().detach().float().numpy().tobytes() + md5_hash = hashlib.md5(tensor_bytes) + return [md5_hash.hexdigest()] + + +@register_csv_item(GradConst.distribution) +class CSV_distribution(CsvItem): + def generate_csv_header(csv_header_input): + bounds = csv_header_input.bounds + intervals = [] + for i, _ in enumerate(bounds): + if i == 0: + intervals.append(f"(-inf, {bounds[i]}]") + else: + intervals.append(f"({bounds[i-1]}, {bounds[i]}]") + intervals.extend([f"({bounds[-1]}, inf)", "=0"]) + return intervals + + def generate_csv_content(csv_content_input): + grad = csv_content_input.grad + bounds = csv_content_input.bounds + grad = grad.cpu().detach() + if grad.dtype == torch.bfloat16: + grad = grad.to(torch.float32) + element_num = grad.numel() + grad_equal_0_num = (grad == 0).sum().item() + bound = 
torch.Tensor(bounds) + bucketsize_result = torch.bucketize(grad, bound) + interval_nums = [(bucketsize_result == i).sum().item() for i in range(len(bound) + 1)] + interval_nums.append(grad_equal_0_num) + return_list = [x / element_num if element_num != 0 else 0 for x in interval_nums] + return return_list + + +@register_csv_item(GradConst.max) +class CSV_max(CsvItem): + def generate_csv_header(csv_header_input): + return ["max"] + + def generate_csv_content(csv_content_input): + grad = csv_content_input.grad + return [torch.max(grad).cpu().detach().float().numpy().tolist()] + + +@register_csv_item(GradConst.min) +class CSV_max(CsvItem): + def generate_csv_header(csv_header_input): + return ["min"] + + def generate_csv_content(csv_content_input): + grad = csv_content_input.grad + return [torch.min(grad).cpu().detach().float().numpy().tolist()] + + +@register_csv_item(GradConst.norm) +class CSV_max(CsvItem): + def generate_csv_header(csv_header_input): + return ["norm"] + + def generate_csv_content(csv_content_input): + grad = csv_content_input.grad + return [torch.norm(grad).cpu().detach().float().numpy().tolist()] + + +@register_csv_item(GradConst.shape) +class CSV_shape(CsvItem): + def generate_csv_header(csv_header_input): + return ["shape"] + + def generate_csv_content(csv_content_input): + grad = csv_content_input.grad + return [list(grad.shape)] \ No newline at end of file diff --git a/debug/accuracy_tools/grad_tool/grad_pt/level_adapter.py b/debug/accuracy_tools/grad_tool/grad_pt/level_adapter.py deleted file mode 100644 index 520b0ce0cd8623cd7ca91956f80e97608a407b9b..0000000000000000000000000000000000000000 --- a/debug/accuracy_tools/grad_tool/grad_pt/level_adapter.py +++ /dev/null @@ -1,133 +0,0 @@ -import os -import hashlib -from abc import ABC, abstractmethod -import torch -from grad_tool.common.utils import print_info_log, create_directory - - -class LevelOps: - @staticmethod - def intervals_header(bounds): - intervals = [] - for i, _ in enumerate(bounds): - if i == 0: - intervals.append(f"(-inf, {bounds[i]}]") - else: - intervals.append(f"({bounds[i-1]}, {bounds[i]}]") - intervals.extend([f"({bounds[-1]}, inf)", "=0"]) - return intervals - - @staticmethod - def count_grad_distribution(grad, bounds): - grad = grad.cpu().detach() - if grad.dtype == torch.bfloat16: - grad = grad.to(torch.float32) - element_num = grad.numel() - grad_equal_0_num = (grad == 0).sum().item() - bound = torch.Tensor(bounds) - bucketsize_result = torch.bucketize(grad, bound) - interval_nums = [(bucketsize_result == i).sum().item() for i in range(len(bound) + 1)] - interval_nums.append(grad_equal_0_num) - return_list = [x / element_num if element_num != 0 else 0 for x in interval_nums] - return return_list - - @staticmethod - def save_grad_direction(param_name, grad, save_path): - if not os.path.exists(save_path): - create_directory(save_path) - param_grad = grad.clone().detach() - is_positive = param_grad > 0 - torch.save(is_positive, f'{save_path}/{param_name}.pt') - - @staticmethod - def MD5_content(grad): - tensor_bytes = grad.cpu().detach().float().numpy().tobytes() - md5_hash = hashlib.md5(tensor_bytes) - return [md5_hash.hexdigest()] - - @staticmethod - def MD5_header(): - return ["MD5"] - - -class Level(ABC): - @abstractmethod - def save_grad_direction(self, param_name, grad, save_path): - pass - - @abstractmethod - def count_grad_distribution(self, grad, bounds) -> list: - pass - - @abstractmethod - def intervals_header(self, bounds) -> list: - pass - - @abstractmethod - def MD5_content(self, 
grad) -> list: - pass - - @abstractmethod - def MD5_header(self) -> list: - pass - - -class Level_0(Level): - def save_grad_direction(self, param_name, grad, save_path): - pass - - def count_grad_distribution(self, grad, bounds): - return [] - - def intervals_header(self, bounds): - return [] - - def MD5_content(self, grad): - return LevelOps.MD5_content(grad) - - def MD5_header(self): - return LevelOps.MD5_header() - - -class Level_1(Level): - def save_grad_direction(self, param_name, grad, save_path): - LevelOps.save_grad_direction(param_name, grad, save_path) - - def count_grad_distribution(self, grad, bounds): - return [] - - def intervals_header(self, bounds): - return [] - - def MD5_content(self, grad): - return [] - - def MD5_header(self): - return [] - - -class Level_2(Level): - def save_grad_direction(self, param_name, grad, save_path): - LevelOps.save_grad_direction(param_name, grad, save_path) - - def count_grad_distribution(self, grad, bounds): - return LevelOps.count_grad_distribution(grad, bounds) - - def intervals_header(self, bounds): - return LevelOps.intervals_header(bounds) - - def MD5_content(self, grad): - return [] - - def MD5_header(self): - return [] - - -class LevelAdapter: - levels = {"L0": Level_0, "L1": Level_1, "L2": Level_2} - - @staticmethod - def level_adapter(level): - if level not in LevelAdapter.levels: - raise Exception(f"level is valid, not in {LevelAdapter.levels.keys()}") - return LevelAdapter.levels[level]() diff --git a/debug/accuracy_tools/grad_tool/grad_pt/utils.py b/debug/accuracy_tools/grad_tool/grad_pt/utils.py index cbccab7b480c6d38fea7c22b6779913f47e91a22..bfa7c158f1399657dc3f601e69ecc3cc8d725f5a 100644 --- a/debug/accuracy_tools/grad_tool/grad_pt/utils.py +++ b/debug/accuracy_tools/grad_tool/grad_pt/utils.py @@ -15,3 +15,11 @@ def print_rank_0(message): print(message) else: print(message) + +class GradConst: + md5 = "MD5" + distribution = "distribution" + shape = "shape" + max = "max" + min = "min" + norm = "norm" diff --git a/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py b/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py index 48a23c887e02c14d021d4191cb2fdae2bf9dbc33..4669da1c4d199c3007f36ae2b0f57802a1577be2 100644 --- a/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py +++ b/debug/accuracy_tools/grad_tool/test/ut/test_grad_csv.py @@ -3,7 +3,7 @@ import unittest import os import torch from grad_tool.grad_pt.grad_stat_csv import GradStatCsv -from grad_tool.grad_pt.level_adapter import LevelAdapter +from grad_tool.grad_pt.grad_monitor import PtGradientMonitor grad_tensor = torch.tensor([[-2, 2], [0.2, 0.3]]) @@ -12,39 +12,27 @@ grad_tensor = torch.tensor([[-2, 2], [0.2, 0.3]]) class TestGradCSV(unittest.TestCase): def test_level_L0_header(self): self.assertEqual(['param_name', 'MD5', 'max', 'min', 'norm', 'shape'], - GradStatCsv.generate_csv_header(level=LevelAdapter.level_adapter("L0"), bounds=[-1, 0, 1])) + GradStatCsv.generate_csv_header(PtGradientMonitor.level_adp["L0"], [-1, 0, 1])) def test_level_L1_header(self): self.assertEqual(['param_name', 'max', 'min', 'norm', 'shape'], - GradStatCsv.generate_csv_header(level=LevelAdapter.level_adapter("L1"), bounds=[-1, 0, 1])) + GradStatCsv.generate_csv_header(PtGradientMonitor.level_adp["L1"], [-1, 0, 1])) def test_level_L2_header(self): self.assertEqual(['param_name', '(-inf, -1]', '(-1, 0]', '(0, 1]', '(1, inf)', '=0', 'max', 'min', 'norm', 'shape'], - GradStatCsv.generate_csv_header(level=LevelAdapter.level_adapter("L2"), bounds=[-1, 0, 1])) + 
GradStatCsv.generate_csv_header(PtGradientMonitor.level_adp["L2"], [-1, 0, 1])) def test_level_L0_content(self): - generated_csv_line = GradStatCsv.generate_csv_line( - level=LevelAdapter.level_adapter("L0"), - param_name="model.conv2d", - grad=grad_tensor, - bounds=[-1, 0, 1]) + generated_csv_line = GradStatCsv.generate_csv_line("model.conv2d", PtGradientMonitor.level_adp["L0"], grad_tensor, [-1, 0, 1]) self.assertEqual(['model.conv2d', '678a6c7d9d9716682b56fda097d0936c', 2.0, -2.0, 2.851315498352051, [2, 2]], generated_csv_line) def test_level_L1_content(self): - generated_csv_line = GradStatCsv.generate_csv_line( - level=LevelAdapter.level_adapter("L1"), - param_name="model.conv2d", - grad=grad_tensor, - bounds=[-1, 0, 1]) + generated_csv_line = GradStatCsv.generate_csv_line("model.conv2d", PtGradientMonitor.level_adp["L1"], grad_tensor, [-1, 0, 1]) self.assertEqual(['model.conv2d', 2.0, -2.0, 2.851315498352051, [2, 2]], generated_csv_line) def test_level_L2_content(self): - generated_csv_line = GradStatCsv.generate_csv_line( - level=LevelAdapter.level_adapter("L2"), - param_name="model.conv2d", - grad=grad_tensor, - bounds=[-1, 0, 1]) + generated_csv_line = GradStatCsv.generate_csv_line("model.conv2d", PtGradientMonitor.level_adp["L2"], grad_tensor, [-1, 0, 1]) self.assertEqual(['model.conv2d', 0.25, 0.0, 0.5, 0.25, 0.0, 2.0, -2.0, 2.851315498352051, [2, 2]], generated_csv_line) diff --git a/debug/accuracy_tools/kj600/README.md b/debug/accuracy_tools/kj600/README.md index 39c7bc6aa3754027338510f0bb9d1be6c328e17d..1782e58bec0404092bb8c6784a8235c68f536ac9 100644 --- a/debug/accuracy_tools/kj600/README.md +++ b/debug/accuracy_tools/kj600/README.md @@ -8,12 +8,15 @@ ### 1. 安装依赖 -| 依赖软件 | +| 依赖软件 | |-------------| | torch | | torch_npu | | torchvision | | tensorboard | +| matplotlib | +| sqlalchemy | +| pymysql | ### 2. 安装 kj600 @@ -27,8 +30,8 @@ pip install git+https://gitee.com/xiangsen2/kj600.git ``` git clone https://gitee.com/xiangsen2/kj600.git -cd KJ600 -pip install -e . +cd kj600 +pip install . ``` # 快速上手 @@ -48,10 +51,10 @@ pip install -e . "xy_distribution": true, "mv_distribution": true, "wg_distribution": true, - "mg_direction": true, "cc_distribution": {"enable":true, "cc_codeline":[]}, "alert": { - "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}] + "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}], + "inform": {"recipient": "database", "connection_str": "mysql+pymysql://username:password@host:port/database"} }, "ops": ["min", "max", "norm", "zeros", "id"], "eps": 1e-8 @@ -75,9 +78,8 @@ pip install -e . |"xy_distribution"| 可选 | 若为true则会监控指定module(targets中指定)的输入输出张量。 默认为false。| |"mv_distribution"| 可选 | 若为true则会监控指定模块中的参数的优化器状态, 默认为false。需要在TrainerMon构造函数正确指定opt_ty. 目前只支持megatron的混合精度优化器以及megatron的分布式优化器。 Deepspeed的分布式优化器实现暂不支持。 | |"wg_distribution"| 可选 | 若为true则会监控指定模块的参数梯度, 默认为false。 | -|"alert"| 必选 | 指定自动报警的异常检测机制及其相应的阈值。目前实现的异常检测是AnomalyTurbulence。 如果统计标量超出历史均值的指定浮动范围(threshold指定, 0.5意味着上浮或者下浮50%)。 目前报警是在控制台打印, 未来会实现发邮件和写数据库。| -|"mg_direction"| 可选 | 若为true则会统计adam优化器的一阶矩($m_{t-1}$)和当前梯度($g_t$)符号一致的参数比例。| -|"cc_distribution"| 可选 | 其中“enable”字段控制开关;“cc_codeline”字段指定监控的代码行,如:"train.py\\[23\\]",默认为空列表,不特别指定。"cc_log_only"字段控制是否监控数据,为true时,仅记录调用到的算子及其调用栈。| +|"alert"| 必选 | · "rules": 指定自动报警的异常检测机制及其相应的阈值。目前实现的异常检测是AnomalyTurbulence。 如果统计标量超出历史均值的指定浮动范围(threshold指定, 0.5意味着上浮或者下浮50%)则在控制台打印报警信息。
· "inform": 自动报警需要的配置,若想关闭自动报警删掉inform的配置即可。其中"recipient"指定自动报警的通知方式,可选值为"database"或"email",默认为"database"。
- 若"recipient"为"database",则需要指定"connection_str"字段,即数据库的连接URL,默认为{"recipient":"database", "connection_str": "mysql+pymysql://username:password@host:port/database"},若有特殊字符需要转义。
- 若"recipient"为"email",则需要指定"send_email_address"-发送方邮箱地址,"receive_email_address"-接收方邮箱地址,"send_email_username"-发送方邮箱用户名,"send_email_password"-发送方邮箱密码,"smtp_server"-发送方邮箱对应的SMTP服务器,"smtp_port"-发送方邮箱对应的SMTP端口号。默认为:
{"recipient":"email", send_email_address": "sender@huawei.com", "receive_email_address": "receiver@huawei.com", "send_email_username": "username", "send_email_password": "******", "smtp_server": "smtpscn.huawei.com", "smtp_port": "587"}| +|"cc_distribution"| 可选 | 其中"enable"字段控制通信监控模块的开关;需要监控通信算子时,务必尽量早地实例化`TrainerMon`, 因为监控通过劫持原始func后挂hook实现,部分加速库初始化时会保存原始function,避免监控失效。"cc_codeline"字段指定监控的代码行,如:`train.py\\[23\\]`,默认为空列表,不特别指定;"cc_pre_hook"字段控制是否监控通信前的数据; 模块会在第二个optimize.step之前打印通信日志,包括通信api的调用栈、输入dtype、通信group。 "cc_log_only"为true时,仅打印日志,不监控通信的输入输出,并在打印后中断训练。可以根据通信日志设置"cc_codeline",规避与训练过程不相关的通信,比如一些时间、metrics的同步。| |"ops"| 可选 |与ur_distribution、xy_distribution、mv_distribution、wg_distribution、mg_direction、cc_distribution配合,监控所选张量的min、max、norm、zeros值。其中,zeros代表监控所选张量的元素小于eps的比例,id代表监控所选的非张量本身,默认为[]。| |"eps"| 可选 |若ops里包含"zeros"则需要配置,默认为1e-8。| @@ -109,7 +111,7 @@ pip install -e . 2. 在训练器中加入代码,开启kj600训练监控。 - 例如在ModelLink/pretrain_gpt.py的model_provider GPTModel构造后加入以下代码: + 例如在ModelLink/pretrain_gpt.py的model_provider GPTModel构造后加入以下代码, **注意优化器类型opt_ty** : ``` from kj600.module_hook import TrainerMon @@ -154,12 +156,11 @@ tensorboard --logdir=$KJ600_OUTPUT_DIR ``` ssh -N -L localhost:6006:localhost:6006 your_username@remote_server_address ``` -## 高级用法 -### 有效秩(有内存和速度问题) -在工具配置文件中加入"params_effrank":"权重矩阵参数名" -"params_effrank": ["language_model.encoder.layers.0.self_attention.query_key_value.weight"] -## 公开接口 +# 高级用法 +TBD + +# 公开接口 **接口说明** diff --git a/debug/accuracy_tools/kj600/kj600/anomaly_detect.py b/debug/accuracy_tools/kj600/kj600/anomaly_detect.py index 24f37a7a84aa9ea2bf260259080510ba0e27733a..cbd7b6daa2f0d9b0a9b28016993e836ee07df72d 100644 --- a/debug/accuracy_tools/kj600/kj600/anomaly_detect.py +++ b/debug/accuracy_tools/kj600/kj600/anomaly_detect.py @@ -4,10 +4,11 @@ from typing import List import sys from torch.utils.tensorboard import SummaryWriter from collections import defaultdict +from kj600.utils import print_info_log class ScanRule(ABC): def apply(self, history, cur): - raise NotImplemented("abstract method apply is not implemented") + raise NotImplementedError("abstract method apply is not implemented") class AnomalyTurbulence(ScanRule): name = "AnomalyTurbulence" @@ -59,15 +60,13 @@ class bcolors: UNDERLINE = '\033[4m' class SummaryWriterWithAD(SummaryWriter): - def __init__(self, path, ad_rules, anomaly_inform=False): + def __init__(self, path, ad_rules, job_id, anomaly_inform=False): super().__init__(path) self.tag2scalars = defaultdict(list) self.ad_rules = ad_rules + self.job_id = job_id self.anomaly_inform = anomaly_inform - def _ad(self, scalar_value, history): - return AnomalyScanner.scan(self.ad_rules, history, cur=scalar_value) - def add_scalar(self, tag, scalar_value, global_step=None, walltime=None, new_style=False, double_precision=False): new_avg = avg = scalar_value if tag in self.tag2scalars: @@ -77,8 +76,11 @@ class SummaryWriterWithAD(SummaryWriter): self.tag2scalars[tag].append((scalar_value, new_avg)) detected, rule_name = self._ad(scalar_value, history=avg) if detected: - print(f"{bcolors.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step {global_step}.{bcolors.ENDC}") + print_info_log(f"{bcolors.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step {global_step}.{bcolors.ENDC}") exception_message = f"{bcolors.WARNING}> Rule {rule_name} reports anomaly signal in {tag} at step {global_step}.{bcolors.ENDC}" if self.anomaly_inform: - self.anomaly_inform.run(exception_message) + self.anomaly_inform.run(exception_message, self.job_id) 
return super().add_scalar(tag, scalar_value, global_step, walltime, new_style, double_precision) + + def _ad(self, scalar_value, history): + return AnomalyScanner.scan(self.ad_rules, history, cur=scalar_value) diff --git a/debug/accuracy_tools/kj600/kj600/anomaly_inform.py b/debug/accuracy_tools/kj600/kj600/anomaly_inform.py index 0bdafdaf827e5ac6658bccc0de83294d9f313602..301ac769217943a36e5d4cbe06033c828e5c675e 100644 --- a/debug/accuracy_tools/kj600/kj600/anomaly_inform.py +++ b/debug/accuracy_tools/kj600/kj600/anomaly_inform.py @@ -3,6 +3,9 @@ from email.mime.text import MIMEText import sqlite3 from datetime import datetime, timedelta +from kj600.database import Database, ExceptionMessage + + # define class InformRegistry to get inform_sub_class class AnomalyInformFactory: @staticmethod @@ -22,15 +25,15 @@ class AnomalyInform: self.time = 0 self.current_time = 0 - def inform_fun(self, exception_message_list): + def inform_fun(self, exception_message_list, job_id): pass - def run(self, exception_message): + def run(self, exception_message, job_id): if self.time != 0 and self.current_time == 0: self.current_time = datetime.now() if self.time == 0 or ((self.current_time - self.time) > timedelta(minutes=self.interval_time)): self.exception_message_list.append(exception_message) - self.inform_fun(self.exception_message_list) + self.inform_fun(self.exception_message_list, job_id) self.exception_message_list = [] self.time = datetime.now() elif (self.current_time - self.time) <= timedelta(minutes=self.interval_time): @@ -41,35 +44,33 @@ class DatabaseInform(AnomalyInform): def __init__(self, **kwargs): super().__init__(**kwargs) self.interval_time = 2 + self.database = Database(self.inform_args.get("connection_str", None)) + self.database.create_table() - def inform_fun(self, exception_message_list): - with sqlite3.connect(self.inform_args['connection_str']) as conn: - cursor = conn.cursor() - cursor.execute('''CREATE TABLE IF NOT EXISTS exceptions( - id INTEGER PRIMARY KEY, - message TEXT - )''') - now_time = datetime.now() - for exception_message in exception_message_list: - exception_message = f"Current time is :{now_time}" + exception_message - cursor.execute("INSERT INTO exceptions (message) VALUES (?)",(exception_message,)) + def inform_fun(self, exception_message_list, job_id): + save_list = [] + for exception_message in exception_message_list: + item = {'job_id': job_id, 'message': exception_message, 'create_time': datetime.now()} + save_list.append(ExceptionMessage(**item)) + self.database.insert_batch(save_list) class EmailInform(AnomalyInform): def __init__(self, **kwargs): super().__init__(**kwargs) self.interval_time = 10 - - def inform_fun(self, exception_message_list): + + def inform_fun(self, exception_message_list, job_id): subject = "Exception Detected in Your Program" text = f"{len(exception_message_list)} exception was detected in your program:\n\n" for exception_message in exception_message_list: - text += exception_message + '\n' + text += f"{job_id}: {exception_message}\n" message = MIMEText(text, "plain") message["Subject"] = subject - message["From"] = self.inform_args['email'] - message["To"] = self.inform_args['email'] + message["From"] = self.inform_args.get('send_email_address', None) + message["To"] = self.inform_args.get('receive_email_address', None) - with smtplib.SMTP(self.inform_args['smtp_server_name'], self.inform_args.get('smtp_number', 587)) as server: + with smtplib.SMTP(self.inform_args.get('smtp_server', None), self.inform_args.get('smtp_port', 587)) 
as server: server.starttls() - server.login(self.inform_args['id'], self.inform_args['password']) - server.sendmail(self.inform_args['email'], self.inform_args['email'], message.as_string()) + server.login(self.inform_args.get('send_email_username', None), self.inform_args.get('send_email_password', None)) + server.sendmail(self.inform_args.get('send_email_address', None), + self.inform_args.get('receive_email_address', None), message.as_string()) diff --git a/debug/accuracy_tools/kj600/kj600/database.py b/debug/accuracy_tools/kj600/kj600/database.py new file mode 100644 index 0000000000000000000000000000000000000000..ce02ab7429d066dc76c127eefab5c1f6720d612c --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/database.py @@ -0,0 +1,55 @@ +from sqlalchemy import create_engine +from sqlalchemy.orm import sessionmaker +from sqlalchemy.ext.declarative import declarative_base +from sqlalchemy import Column, Integer, String, DateTime +from pymysql.err import OperationalError + +Base = declarative_base() + + +class ExceptionMessage(Base): + __tablename__ = 'exception_message' + + id = Column(Integer, primary_key=True) + job_id = Column(String(40), index=True, nullable=False) + message = Column(String(255)) + create_time = Column(DateTime, nullable=False) + + def __repr__(self): + return ' None: - self.indata = {} - self.outdata = {} + self.data = {} + + @staticmethod + def _agg(data): + aggregated_data = {} + for op, tag2tensorlist in data.items(): + aggregated_data[op] = {} + for tag, tensorlist in tag2tensorlist.items(): + aggregated_data[op][tag] = op_aggregate(op, tensorlist) + return aggregated_data def reset(self): - self.indata = {} - self.outdata = {} + self.data = {} + + def aggregate(self): + self.data = self._agg(self.data) class TrainerMon: - @staticmethod - def set_wrapped_optimizer(_wrapped_optimizer): - MixPrecsionOptimizerMon.set_wrapped_optimizer(_wrapped_optimizer) + tensor_metrics = TensorMetrics() # opt_ty: "Megatron_Float16OptimizerWithFloat16Params" or "Megatron_DistributedOptimizer" def __init__(self, config_file_path, params_have_main_grad=True, opt_ty=None) -> None: @@ -82,6 +89,13 @@ class TrainerMon: self.xy_distribution = self.config.get('xy_distribution', False) if not self.xy_distribution: print_rank_0("> module input/output input_grad/output_grad is not monitored. ") + + # backward hook cause megatron-lm pipeline parallel schedule assert exception. + # TBD: backward hook cause output tensor is view of some base tensor. root cause invesigation pending. + self.forward_only = self.config.get('forward_only', False) + if self.forward_only: + print_rank_0("> only module forward is monitored. ") + self.ur_distribution = self.config.get('ur_distribution', False) if not self.ur_distribution: print_rank_0("> update vector and ratio vector of adam is not monitored. 
") @@ -97,11 +111,13 @@ class TrainerMon: self.cc_distribution = self.config.get("cc_distribution", {}) if not self.cc_distribution.get('enable', False): print_rank_0("> cc operator is not monitored.") + self.cc_log_only = False else: self.cc_codeline = self.cc_distribution.get('cc_codeline', []) self.cc_log_only = self.cc_distribution.get('cc_log_only', False) self.cc_logged_stack = defaultdict(set) - api_register.initialize_hook(partial(create_hook, context=self.cc_context, monitor=self)) + self.cc_pre_hook = self.cc_distribution.get('cc_pre_hook', False) + api_register.initialize_hook(*create_hooks(context=self.cc_context, monitor=self)) api_register.redirect_api() alert_setting = self.config.get('alert', {"rules":[]}) @@ -116,9 +132,9 @@ class TrainerMon: if dist.is_initialized(): if (dist.get_rank() in self.module_rank_list) or len(self.module_rank_list) == 0: self.summary_writer = SummaryWriterWithAD( - os.path.join(output_base_dir, f"{cur_time}-rank{dist.get_rank()}-{unique_id}"), self.alert_rules, anomaly_inform) + os.path.join(output_base_dir, f"{cur_time}-rank{dist.get_rank()}-{unique_id}"), self.alert_rules, unique_id, anomaly_inform) else: - self.summary_writer = SummaryWriterWithAD(os.path.join(output_base_dir, f"{cur_time}-{unique_id}"), self.alert_rules, anomaly_inform) + self.summary_writer = SummaryWriterWithAD(os.path.join(output_base_dir, f"{cur_time}-{unique_id}"), self.alert_rules, unique_id, anomaly_inform) # A HeatmapVisualizer instance is associated with an image self.update_heatmap_visualizer = defaultdict(HeatmapVisualizer) self.ratio_heatmap_visualizer = defaultdict(HeatmapVisualizer) @@ -129,8 +145,10 @@ class TrainerMon: self.mix_precision_optimizer_mon = OptimizerMonFactory.create_optimizer_mon(opt_ty) if opt_ty is None: - assert not self.ur_distribution, "ur_distribution cannot be enabled with unknown optimizer." - assert not self.mv_distribution, "mv_distribution cannot be enabled with unknown optimizer." 
+ if self.ur_distribution: + raise Exception("ur_distribution cannot be enabled with unknown optimizer.") + if self.mv_distribution: + raise Exception("mv_distribution cannot be enabled with unknown optimizer.") self.print_struct = self.config.get("print_struct", False) self.struct_printed = False self.module_struct = {} @@ -139,105 +157,19 @@ class TrainerMon: def __del__(self): if hasattr(self, "summary_writer"): self.summary_writer.close() + + @staticmethod + def set_wrapped_optimizer(_wrapped_optimizer): + MixPrecsionOptimizerMon.set_wrapped_optimizer(_wrapped_optimizer) - def _smallest_rank_print(self, msg): + @staticmethod + def adhoc_check(target_tensor:torch.tensor, module_name:str, tensor_name:str, rank_list, ops_list): + rank = None if dist.is_initialized(): - if dist.get_rank() == min(self.module_rank_list): - print_info_log(msg) - else: - print_info_log(msg) - - def _hook_module(self, target_names, module: torch.nn.Module, fwd_or_bkd): - if '_modules' not in module.__dict__: - # nothing to hook - return 0 - - def fwd_hook_fun(module, module_input, module_output): - context: ModuleHookContext = self.module_fwd_hook_context_by_module[module] - if self.print_struct: - self.module_struct[context.module_name].update( - {"input": f"{get_param_struct(module_input)}", "output": f"{get_param_struct(module_output)}"}) + rank = dist.get_rank() + if (rank not in rank_list) and len(rank_list) != 0: return - if not self.xy_distribution: - return - if not context.format_by_arg: - context.set_format_by_arg('input', self.config['targets']) - context.set_format_by_arg('output', self.config['targets']) - if not context.verified: - if not context.ignore_in: - context.focused_in_col = validate_config_spec(context.format_by_arg['input'], module_input, context.module_name, 'input') - context.focused_out_col = validate_config_spec(context.format_by_arg['output'], module_output, context.module_name, 'output') - context.verified = True - # expect output be tensor type - tbtag_tensor_map = {} - if not context.ignore_in: - cared_input = module_input if context.focused_in_col is None else module_input[context.focused_in_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input', cared_input)) - cared_output = module_output if context.focused_out_col is None else module_output[context.focused_out_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output', cared_output)) - metric_dict = {} - for metric_name in self.ops: - metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) - if context.micro_step == 0 and context.actv: - print_warn_log( - f"actv context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. 
Now clear it.") - context.actv.clear() - context.actv.append(metric_dict) - - context.micro_step += 1 - if context.micro_step == self.micro_batch_number: - context.micro_step = 0 - context.step += 1 - return - - def bwd_hook_fun(module, input_grad, output_grad): - context: ModuleHookContext = self.module_bwd_hook_context_by_module[module] - if self.print_struct: - self.module_struct[context.module_name].update( - {"input_grad": f"{get_param_struct(input_grad)}", "output_grad": f"{get_param_struct(output_grad)}"}) - return - if not self.xy_distribution: - return - if not context.format_by_arg: - context.set_format_by_arg('input_grad', self.config['targets']) - context.set_format_by_arg('output_grad', self.config['targets']) - if not context.verified: - if not context.ignore_in: - context.focused_in_col = validate_config_spec(context.format_by_arg['input_grad'], input_grad, context.module_name, 'input_grad') - context.focused_out_col = validate_config_spec(context.format_by_arg['output_grad'], output_grad, context.module_name, 'output_grad') - context.verified = True - - tbtag_tensor_map = {} - if not context.ignore_in: - cared_input_grad = input_grad if context.focused_in_col is None else input_grad[context.focused_in_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input_grad', cared_input_grad)) - cared_output_grad = output_grad if context.focused_out_col is None else output_grad[context.focused_out_col] - tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output_grad', cared_output_grad)) - metric_dict = {} - for metric_name in self.ops: - metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) - if context.micro_step == 0 and context.actvgrad: - print_warn_log(f"actvgrad context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. 
Now clear it.") - context.actvgrad.clear() - context.actvgrad.append(metric_dict) - - context.micro_step += 1 - if context.micro_step == self.micro_batch_number: - context.micro_step = 0 - context.step += 1 - return - - hooked_count = 0 - for name, submodule in module.named_modules(): - self.module_struct[name] = {} - if name in target_names: - submodule.register_forward_hook(fwd_hook_fun) - self.module_fwd_hook_context_by_module[submodule] = ModuleHookContext(name) - submodule.register_full_backward_hook(bwd_hook_fun) - self.module_bwd_hook_context_by_module[submodule] = ModuleHookContext(name) - print_rank_0(f"> {name} is monitored successfully") - hooked_count += 1 - return hooked_count + TrainerMon.tensor_metrics.stat_insert(target_tensor, ops_list, module_name, tensor_name, rank) def hook_modules(self, model:torch.nn.Module, grad_acc_steps): # fwd=0, bkd=1 @@ -287,17 +219,16 @@ class TrainerMon: def generate_cc_metrics(self, cc_name, cc_tensor): metrics = defaultdict(dict) rank = dist.get_rank() if dist.is_initialized() else None - for op, tag2tensor in cc_tensor.indata.items(): - for tag, tensor in tag2tensor.items(): - key = get_summary_writer_tag_name(cc_name, tag, rank) - metrics[op].update({key: tensor}) - for op, tag2tensor in cc_tensor.outdata.items(): + for op, tag2tensor in cc_tensor.data.items(): for tag, tensor in tag2tensor.items(): key = get_summary_writer_tag_name(cc_name, tag, rank) metrics[op].update({key: tensor}) cc_tensor.reset() return metrics + def write_adhoc_check(self, step): + TrainerMon.tensor_metrics.flush(self.summary_writer) + def write_xy_tb(self, step): if not self.xy_distribution: return @@ -322,11 +253,12 @@ class TrainerMon: if self.print_struct and not all(value == {} for value in self.module_struct.values()) and not self.struct_printed: self._smallest_rank_print("> module struct:") self._smallest_rank_print(json.dumps(self.module_struct, indent=4)) + self.struct_printed = True if not self.cc_log_only: raise Exception("exit after first step when print model struct") if self.cc_log_only and context.step > 0: self._smallest_rank_print("> Used communication ops and corresponding stack") - self._smallest_rank_print(json.dumps({k:list(v) for k,v in self.cc_logged_stack.items()}, indent=4)) + self._smallest_rank_print(json.dumps({k:[i.split(';') for i in v] for k,v in self.cc_logged_stack.items()}, indent=4)) raise Exception("exit after first step when print cc stack") @@ -362,11 +294,11 @@ class TrainerMon: metric_dict = {} for metric_name in self.ops: metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) - if self.cc_distribution: - for k, c in self.cc_context.items(): - cc_metrics = self.generate_cc_metrics(k, c) - for op, m in cc_metrics.items(): - metric_dict[op].update(m) + for k, c in self.cc_context.items(): + c.aggregate() + cc_metrics = self.generate_cc_metrics(k, c) + for op, m in cc_metrics.items(): + metric_dict[op].update(m) if not metric_dict: return context.metric_list.append(metric_dict) @@ -377,6 +309,7 @@ class TrainerMon: rank = dist.get_rank() if dist.is_initialized() else None self.write_xy_tb(context.step) + self.write_adhoc_check(context.step) if self.ur_distribution: for param_name, _ in context.param_adam_update.items(): @@ -396,3 +329,107 @@ class TrainerMon: register_optimizer_step_pre_hook(optimizer_pre_step_hook) register_optimizer_step_post_hook(optimizer_post_step_hook) return + + def _smallest_rank_print(self, msg): + if dist.is_initialized(): + if self.module_rank_list: + if dist.get_rank() == 
min(self.module_rank_list): + print_info_log(msg) + else: + if dist.get_rank() == 0: + print_info_log(msg) + else: + print_info_log(msg) + + def _hook_module(self, target_names, module: torch.nn.Module, fwd_or_bkd): + if '_modules' not in module.__dict__: + # nothing to hook + return 0 + + def fwd_hook_fun(module, module_input, module_output): + context: ModuleHookContext = self.module_fwd_hook_context_by_module[module] + if self.print_struct: + self.module_struct[context.module_name].update( + {"input": f"{get_param_struct(module_input)}", "output": f"{get_param_struct(module_output)}"}) + return + if not self.xy_distribution: + return + if not context.format_by_arg: + context.set_format_by_arg('input', self.config['targets']) + context.set_format_by_arg('output', self.config['targets']) + if not context.verified: + if not context.ignore_in: + context.focused_in_col = validate_config_spec(context.format_by_arg['input'], module_input, context.module_name, 'input') + context.focused_out_col = validate_config_spec(context.format_by_arg['output'], module_output, context.module_name, 'output') + context.verified = True + # expect output be tensor type + tbtag_tensor_map = {} + if not context.ignore_in: + cared_input = module_input if context.focused_in_col is None else module_input[context.focused_in_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input', cared_input)) + cared_output = module_output if context.focused_out_col is None else module_output[context.focused_out_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output', cared_output)) + metric_dict = {} + for metric_name in self.ops: + metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) + if context.micro_step == 0 and context.actv: + print_warn_log( + f"actv context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. 
Now clear it.") + context.actv.clear() + context.actv.append(metric_dict) + + context.micro_step += 1 + if context.micro_step == self.micro_batch_number: + context.micro_step = 0 + context.step += 1 + return + + def bwd_hook_fun(module, input_grad, output_grad): + context: ModuleHookContext = self.module_bwd_hook_context_by_module[module] + if self.print_struct: + self.module_struct[context.module_name].update( + {"input_grad": f"{get_param_struct(input_grad)}", "output_grad": f"{get_param_struct(output_grad)}"}) + return + if not self.xy_distribution: + return + if not context.format_by_arg: + context.set_format_by_arg('input_grad', self.config['targets']) + context.set_format_by_arg('output_grad', self.config['targets']) + if not context.verified: + if not context.ignore_in: + context.focused_in_col = validate_config_spec(context.format_by_arg['input_grad'], input_grad, context.module_name, 'input_grad') + context.focused_out_col = validate_config_spec(context.format_by_arg['output_grad'], output_grad, context.module_name, 'output_grad') + context.verified = True + + tbtag_tensor_map = {} + if not context.ignore_in: + cared_input_grad = input_grad if context.focused_in_col is None else input_grad[context.focused_in_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'input_grad', cared_input_grad)) + cared_output_grad = output_grad if context.focused_out_col is None else output_grad[context.focused_out_col] + tbtag_tensor_map.update(self.build_tbtag_tensor_map(context.module_name, 'output_grad', cared_output_grad)) + metric_dict = {} + for metric_name in self.ops: + metric_dict[metric_name] = get_metrics(metric_name, tbtag_tensor_map, self.eps) + if context.micro_step == 0 and context.actvgrad: + print_warn_log(f"actvgrad context of {context.module_name} is not empty when first micro_step, maybe something wrong happened. 
Now clear it.") + context.actvgrad.clear() + context.actvgrad.append(metric_dict) + + context.micro_step += 1 + if context.micro_step == self.micro_batch_number: + context.micro_step = 0 + context.step += 1 + return + + hooked_count = 0 + for name, submodule in module.named_modules(): + self.module_struct[name] = {} + if name in target_names: + submodule.register_forward_hook(fwd_hook_fun) + self.module_fwd_hook_context_by_module[submodule] = ModuleHookContext(name) + if not self.forward_only: + submodule.register_full_backward_hook(bwd_hook_fun) + self.module_bwd_hook_context_by_module[submodule] = ModuleHookContext(name) + print_rank_0(f"> {name} is monitored successfully") + hooked_count += 1 + return hooked_count diff --git a/debug/accuracy_tools/kj600/kj600/module_metric.py b/debug/accuracy_tools/kj600/kj600/module_metric.py index d42749d2be3a230c56aa28a4e983dd6d989c003e..e09536b072cf7953e6b6106420936416d4264d0e 100644 --- a/debug/accuracy_tools/kj600/kj600/module_metric.py +++ b/debug/accuracy_tools/kj600/kj600/module_metric.py @@ -1,7 +1,7 @@ import math import statistics -from kj600.features import square_sum, get_max, get_min, get_zeros +from kj600.features import square_sum, get_max, get_min, get_zeros, get_nans, get_norm def get_summary_writer_tag_name(module_or_param_name:str, tag:str, rank): @@ -23,23 +23,45 @@ def register_config_metric(key, cls=None): config_metric_registry[key] = cls return cls +class TensorMetrics: + def __init__(self) -> None: + self.metrics = {} #tensor_tag --> [] + self.cur_idx = {} + + fun_map = {"norm": get_norm, "max": get_max, "min": get_min} + #get stats and insert into metrics dictionary + def stat_insert(self, tensor, stat_ops, module_name, tensor_name, rank, eps=1e-8): + prefix = get_summary_writer_tag_name(module_name, tensor_name, rank) + for stat_op in stat_ops: + y = TensorMetrics.fun_map[stat_op](tensor) + key = f"{prefix}_{stat_op}" + if key not in self.metrics: + self.metrics[key] = [] + self.cur_idx[key] = 0 + self.metrics[key].append(y) + + def flush(self, tb_writer): + for key, metric_list in self.metrics.items(): + start = self.cur_idx[key] + for v in metric_list[start:]: + tb_writer.add_scalar(key, v.item(), global_step=self.cur_idx[key]) + self.cur_idx[key] += 1 class Metric(object): @staticmethod def get_metric_value(tensor, eps): pass + @staticmethod + def metric_tensorboard(metric_name, summary_writer, metric_value, step): + pass + def get_metrics(self, tag2tensor: dict, eps): metrics_dict = {} for tag, tensor in tag2tensor.items(): metrics_dict[tag] = self.get_metric_value(tensor, eps) return metrics_dict - @staticmethod - def metric_tensorboard(metric_name, summary_writer, metric_value, step): - pass - - @register_config_metric("min") class MinMetric(Metric): @staticmethod @@ -91,6 +113,17 @@ class ZerosMetric(Metric): zeros_value = statistics.mean([item[metric_name][key].item() for item in metric_value]) summary_writer.add_scalar(f'{key}_zeros', zeros_value, step) +@register_config_metric("nans") +class NaNsMetric(Metric): + @staticmethod + def get_metric_value(t, eps): + return get_nans(t) + + @staticmethod + def metric_tensorboard(metric_name, summary_writer, metric_value, step): + for key in metric_value[0][metric_name].keys(): + nans_value = sum([v[metric_name][key].item() for v in metric_value]) + summary_writer.add_scalar(f'{key}_nans', nans_value, step) @register_config_metric("id") class IdentMetric(Metric): @@ -114,7 +147,7 @@ def get_metrics(metric_name, tag2tensor, eps): fun_metric = 
config_metric_registry[metric_name] return fun_metric().get_metrics(tag2tensor, eps) except KeyError as e: - raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") + raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") from e def write_metrics_tensorboard(metric_name, summary_writer, metric_value, step): @@ -122,4 +155,4 @@ def write_metrics_tensorboard(metric_name, summary_writer, metric_value, step): fun_metric = config_metric_registry[metric_name] return fun_metric.metric_tensorboard(metric_name, summary_writer, metric_value, step) except KeyError as e: - raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") + raise ValueError(f"Not supported this metric, expected metric: {config_metric_registry.keys()}, actual metric: {metric_name}") from e diff --git a/debug/accuracy_tools/kj600/kj600/optimizer_collect.py b/debug/accuracy_tools/kj600/kj600/optimizer_collect.py index dfb473ca074f135d809e32c85a8fc9b4047da4d3..285f17ca6dc6a00814b0847c7d203524d8a8caa6 100644 --- a/debug/accuracy_tools/kj600/kj600/optimizer_collect.py +++ b/debug/accuracy_tools/kj600/kj600/optimizer_collect.py @@ -4,7 +4,6 @@ import torch.distributed as dist from kj600.visualizer import HeatmapVisualizer - def print_rank_0(message, debug=False, force=False): if dist.is_initialized(): if dist.get_rank() == 0: @@ -16,12 +15,23 @@ def print_rank_0(message, debug=False, force=False): class MixPrecsionOptimizerMon: wrapped_optimizer = None + def __init__(self) -> None: + self.fp16_to_fp32_param = {} + @staticmethod def set_wrapped_optimizer(_wrapped_optimizer): MixPrecsionOptimizerMon.wrapped_optimizer = _wrapped_optimizer - def __init__(self) -> None: - self.fp16_to_fp32_param = {} + # parameter tensors we want to monitor and their names are in params2name_dict + # base_optimizer is pytorch optimizer, wrapped_optimizer is a normal object with base_optimizer + def fetch_mv(self, monitor, torch_opt, params2name): + mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer + + if not self.fp16_to_fp32_param and mix_prec_opt is not None: + for fp16_group, fp32_group in zip(mix_prec_opt.float16_groups, mix_prec_opt.fp32_from_float16_groups): + for fp16_param, fp32_param in zip(fp16_group, fp32_group): + self.fp16_to_fp32_param[fp16_param] = fp32_param + return self._fetch_mv_in_adam(params2name, torch_opt, monitor) def _fetch_mv_in_adam(self, params2name, torch_opt, monitor): exp_avg_dict = defaultdict(float) @@ -48,22 +58,13 @@ class MixPrecsionOptimizerMon: monitor.ratio_heatmap_visualizer[name].pre_cal(ratio_dict[name]) return exp_avg_dict, exp_avg_sq_dict, update_dict, ratio_dict - # parameter tensors we want to monitor and their names are in params2name_dict - # base_optimizer is pytorch optimizer, wrapped_optimizer is a normal object with base_optimizer - def fetch_mv(self, monitor, torch_opt, params2name): - mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer - - if not self.fp16_to_fp32_param and mix_prec_opt is not None: - for fp16_group, fp32_group in zip(mix_prec_opt.float16_groups, mix_prec_opt.fp32_from_float16_groups): - for fp16_param, fp32_param in zip(fp16_group, fp32_group): - self.fp16_to_fp32_param[fp16_param] = fp32_param - return self._fetch_mv_in_adam(params2name, torch_opt, monitor) class MegatronDistributedOptimizerMon(MixPrecsionOptimizerMon): def fetch_mv(self, monitor, torch_opt, 
params2name): mix_prec_opt = MixPrecsionOptimizerMon.wrapped_optimizer - assert hasattr(mix_prec_opt, "model_float16_groups") and hasattr(mix_prec_opt, "shard_fp32_from_float16_groups"), \ - "megatron distributed optimizer should have model_float16_groups and shard_fp32_from_float16_groups, if not, please check megatron-lm version" + if not (hasattr(mix_prec_opt, "model_float16_groups") and hasattr(mix_prec_opt, "shard_fp32_from_float16_groups")): + raise Exception("megatron distributed optimizer should have model_float16_groups and shard_fp32_from_float16_groups, \ + if not, please check megatron-lm version") if not self.fp16_to_fp32_param and mix_prec_opt is not None: for fp16_group, shard_fp32_group in zip(mix_prec_opt.model_float16_groups, mix_prec_opt.shard_fp32_from_float16_groups): for fp16_param, shard_fp32_param in zip(fp16_group, shard_fp32_group): @@ -71,10 +72,12 @@ class MegatronDistributedOptimizerMon(MixPrecsionOptimizerMon): return self._fetch_mv_in_adam(params2name, torch_opt, monitor) + class DummyOptimizerMon(MixPrecsionOptimizerMon): def fetch_mv(self, monitor, torch_opt, params2name): return None, None, None, None + class OptimizerMonFactory: @staticmethod def create_optimizer_mon(opt_ty:str): @@ -82,6 +85,6 @@ class OptimizerMonFactory: return MixPrecsionOptimizerMon() if opt_ty == "Megatron_DistributedOptimizer": return MegatronDistributedOptimizerMon() - if opt_ty == None or opt_ty == "unknown": + if opt_ty is None or opt_ty == "unknown": return DummyOptimizerMon() - assert opt_ty != None, "opt_ty should be Megatron_Float16OptimizerWithFloat16Params or Megatron_DistributedOptimizer or None or unknown" \ No newline at end of file + raise Exception("opt_ty should be Megatron_Float16OptimizerWithFloat16Params or Megatron_DistributedOptimizer or None or unknown") diff --git a/debug/accuracy_tools/kj600/kj600/unittest/cc_utils.py b/debug/accuracy_tools/kj600/kj600/unittest/cc_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..aa1ff688ec1c417204fc067e636535fef8a35bc0 --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/cc_utils.py @@ -0,0 +1,83 @@ +import os +from functools import partial +import torch +from torch import distributed as dist +from torch import nn +try: + import torch_npu + BACKEND = 'hccl' + DEVICE = 'npu' +except: + BACKEND = 'nccl' + DEVICE = 'cuda' + +from kj600.features import square_sum, get_max, get_min, get_zeros +from kj600.module_hook import CommunicationContext + + +OP_FUNCS = { + "min": get_min, + "max": get_max, + "norm": square_sum, + "zeros": partial(get_zeros, eps=1e-8) +} + +def ddp_setup(rank, world_size): + os.environ["MASTER_ADDR"] = "localhost" + os.environ["MASTER_PORT"] = "12346" + dist.init_process_group(backend=BACKEND, rank=rank, world_size=world_size) + +def reset_context(context): + if isinstance(context, CommunicationContext): + context.reset() + elif isinstance(context, dict): + for op, v in context.items(): + v.reset() + +def wrap_reset(func): + def reset_and_test(*args, **kwargs): + print(f"testing {func.__name__}") + reset_context(args[0]) + res = func(*args, **kwargs) + return res + + return reset_and_test + +def assert_empty(data): + assert len(data) == 0, f'data is not empty as expected' + +def assert_nonempty(data): + assert len(data) != 0, f'data is empty' + +def assert_equal(a, b, rank, op_name=None, tag=None): + if a.dim() == 0: + assert a==b, f'inequal in rank {rank}: {a}, {b}, {op_name}, {tag}' + else: + assert torch.equal(a,b), f'inequal in rank {rank}: {a},{b}' + +def 
assert_inequal(a, b, rank): + if a.dim() == 0: + assert a!=b, f'equal in rank {rank}: {a},{b}' + else: + assert not torch.equal(a,b), f'equal in rank {rank}: {a},{b}' + +def assert_context(data, src, rank): + if len(src) == 0: + assert_empty(data) + else: + assert_nonempty(data) + + for op_name, tensors in data.items(): + for tag, tensor in tensors.items(): + prefix, idx = tag.split('_') + idx = int(idx) + assert_equal(tensor, OP_FUNCS[op_name](src[prefix][idx]), rank, op_name, tag) + + +class Model(nn.Module): + def __init__(self): + super(Model, self).__init__() + self.layer = nn.Linear(2,2) + + def forward(self, x): + return self.layer(x) \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/config_basic_functions.json b/debug/accuracy_tools/kj600/kj600/unittest/config_basic_functions.json new file mode 100644 index 0000000000000000000000000000000000000000..6ce01d653dcfd288e4955b71aefd609056fd38e9 --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/config_basic_functions.json @@ -0,0 +1,17 @@ +{ + "targets": { + "fc": {"input": "tuple[1]:0", "output": "tensor", "input_grad": "tuple[1]:0", "output_grad": "tuple[1]:0"} + }, + "module_ranks": [], + "ur_distribution": true, + "xy_distribution": true, + "mv_distribution": true, + "wg_distribution": true, + "mg_direction": true, + "cc_distribution": {"enable":true, "cc_codeline":[]}, + "alert": { + "rules": [{"rule_name": "AnomalyTurbulence", "args": {"threshold": 0.5}}] + }, + "eps": 1e-8, + "ops": ["min", "max", "norm", "zeros", "id"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/config_cc.json b/debug/accuracy_tools/kj600/kj600/unittest/config_cc.json new file mode 100644 index 0000000000000000000000000000000000000000..a4667ce6fea8052831ddde3fb879402a30f4e946 --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/config_cc.json @@ -0,0 +1,7 @@ +{ + "targets": { + "foo": {} + }, + "cc_distribution": {"enable": true, "cc_pre_hook":true}, + "ops":["max","min","norm","zeros"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/config_cc_codeline_ranks.json b/debug/accuracy_tools/kj600/kj600/unittest/config_cc_codeline_ranks.json new file mode 100644 index 0000000000000000000000000000000000000000..720fbb9dd0ee639a412c4a7e62b3a6a73fce227d --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/config_cc_codeline_ranks.json @@ -0,0 +1,8 @@ +{ + "targets": { + "foo": {} + }, + "cc_distribution": {"enable": true, "cc_codeline":["kj600/unittest/test_cc_codeline_ranks.py\\[19\\]"]}, + "module_ranks": [1], + "ops":["max","min","norm","zeros"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/config_cc_logonly.json b/debug/accuracy_tools/kj600/kj600/unittest/config_cc_logonly.json new file mode 100644 index 0000000000000000000000000000000000000000..51e619fc2d87ffb4a52575ac53ccf0921eb78cce --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/config_cc_logonly.json @@ -0,0 +1,8 @@ +{ + "targets": { + "foo": {} + }, + "cc_distribution": {"enable": true, "cc_log_only": true}, + "module_ranks": [0,1], + "ops":["max","min","norm","zeros"] +} \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/expected_cc_log.json b/debug/accuracy_tools/kj600/kj600/unittest/expected_cc_log.json new file mode 100644 index 0000000000000000000000000000000000000000..8f2edd7ecdb373242f40ae938ca9a880a45e3264 --- /dev/null +++ 
b/debug/accuracy_tools/kj600/kj600/unittest/expected_cc_log.json @@ -0,0 +1,20 @@ +{ + "all_gather": [ + [ + "|torch.float32||", + "0|1", + "/home/jovyan/workspace/kj_dev/kj600/unittest/test_cc_log_only.py[18] test_all_gather", + "/home/jovyan/workspace/kj_dev/kj600/unittest/test_cc_log_only.py[40] main", + "[1] " + ] + ], + "all_reduce": [ + [ + "torch.float32|||", + "0|1", + "/home/jovyan/workspace/kj_dev/kj600/unittest/test_cc_log_only.py[23] test_all_reduce", + "/home/jovyan/workspace/kj_dev/kj600/unittest/test_cc_log_only.py[41] main", + "[1] " + ] + ] +} \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_anomaly_inform.py b/debug/accuracy_tools/kj600/kj600/unittest/test_anomaly_inform.py new file mode 100644 index 0000000000000000000000000000000000000000..1ad76b919eb5fcce24539bf777d368047e6458c4 --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_anomaly_inform.py @@ -0,0 +1,26 @@ +import uuid +import unittest + +from kj600.anomaly_inform import AnomalyInformFactory + + +class TestAnomalyInform(unittest.TestCase): + def test_database_inform(self): + inform_args = {"inform": {"recipient": "database", "connection_str": "mysql+pymysql://username:password@host:port/database"}} + anomaly_inform = AnomalyInformFactory.create_informer(**inform_args["inform"]) + exception_message = '\x1b[93m> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0.self_attention.query_key_value.weight/0/exp_avg_sq_min at step 49.\x1b[0m' + job_id = str(uuid.uuid4()) + anomaly_inform.run(exception_message, job_id) + + def test_email_inform(self): + inform_args = {"inform": {"recipient": "email", "send_email_address": "xueyuqing@huawei.com", "receive_email_address": "xueyuqing@huawei.com", + "send_email_username": "x30021831", "send_email_password": "********", + "smtp_server": "smtpscn.huawei.com", "smtp_port": "587"}} + anomaly_inform = AnomalyInformFactory.create_informer(**inform_args["inform"]) + exception_message = '\x1b[93m> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0.self_attention.query_key_value.weight/0/exp_avg_sq_min at step 49.\x1b[0m' + job_id = str(uuid.uuid4()) + anomaly_inform.run(exception_message, job_id) + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_basic_functions.py b/debug/accuracy_tools/kj600/kj600/unittest/test_basic_functions.py new file mode 100644 index 0000000000000000000000000000000000000000..b7cdd3385b575b231702411fb01ebca2b67613bf --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_basic_functions.py @@ -0,0 +1,149 @@ +import unittest +import shutil +import torch +import json +import os +try: + import torch_npu + device = torch.device('npu:0') +except ModuleNotFoundError: + device = torch.device('cpu') +from kj600.module_hook import TrainerMon + +from tensorboard.backend.event_processing.event_accumulator import EventAccumulator + +class Model(torch.nn.Module): + def __init__(self): + super().__init__() + self.fc = torch.nn.Linear(784, 2) + self.relu = torch.nn.ReLU() + + def forward(self, x): + return self.relu(self.fc(x)) + +class ToyDataset(torch.utils.data.Dataset): + def __init__(self): + self.data = torch.randn(16, 784, requires_grad=True) + self.labels = torch.randint(low=0, high=8, size=(16,)) + def __len__(self): + return len(self.labels) + def __getitem__(self, idx): + return self.data[idx].to(device), self.labels[idx].to(device) +def get_file_path(): + output_dir = 
os.environ.get("KJ600_OUTPUT_DIR") + for root1, dirs, files in os.walk(output_dir): + for root2, dir, file in os.walk(os.path.join(root1, dirs[-1])): + return os.path.join(root2, file[0]) + +def get_config(): + os.environ["KJ600_OUTPUT_DIR"] = "./test_kj600_output" + with open("config_basic_functions.json", 'r') as file: + config_test = json.load(file) + return config_test +def get_tensorbaord(event_file_path): + tensorboard = EventAccumulator(event_file_path) + tensorboard.Reload() + tags = tensorboard.Tags() + scalers_tag = [] + for tag in tags['scalars']: + tag = tag.split('/') + scalers_tag.append(tag[1]) + images_tag = [] + for tag in tags['images']: + tag = tag.split('/') + images_tag.append(tag[1]) + return scalers_tag, images_tag + +def clean_output(): + folder_path = os.environ.get("KJ600_OUTPUT_DIR") + if os.path.exists(folder_path): + shutil.rmtree(folder_path) + +def train(): + model = Model().to(device=device) + hooker = TrainerMon('config_basic_functions.json', False, + opt_ty="Megatron_Float16OptimizerWithFloat16Params") # or opt_ty=Megatron_DistributedOptimizer + hooker.hook_modules(model=model, grad_acc_steps=1) + + train_ds = ToyDataset() + train_loader = torch.utils.data.DataLoader(train_ds, shuffle=True, batch_size=2) + + optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001) + + for (inputs, targets) in train_loader: + optimizer.zero_grad() + # inputs and param torch.float32 -> torch.float16 + inputs = inputs.half() + for param in model.parameters(): + param.data = param.data.half() + # outputs torch.float32 + outputs = model(inputs) + output = outputs[0] + targets = targets.float() + # loss torch.float16 -> torch.float32 + loss = torch.nn.functional.cross_entropy(output, targets) + + loss.backward() + optimizer.step() + +class TestKj600(unittest.TestCase): + def __init__(self, method_name: str) -> None: + super(TestKj600, self).__init__(method_name) + self.config_test = get_config() + self.event_file_path = None + self.scalers_tag = None + self.images_tag = None + + @classmethod + def setUpClass(cls): + train() + + def setUp(self): + self.config_test = get_config() + self.event_file_path = get_file_path() + self.scalers_tag, self.images_tag = get_tensorbaord(self.event_file_path) + + def test_ops(self): + if self.config_test["ops"]: + for op in self.config_test.get("ops"): + if op == "id": + assert any(op in item for item in self.scalers_tag) == self.config_test.get('mg_direction'), f"{op} in ops did not take effect" + else: + assert any(op in item for item in self.scalers_tag), f"{op} in ops did not take effect" + print("ops has taken effect") + + def test_ur_distribution(self): + if self.config_test.get("ur_distribution"): + assert any('adam_update' in item for item in self.images_tag) and any( + 'adam_ratio' in item for item in self.images_tag), "ur_distribution did not take effect" + print("ur_distribution has taken effect") + + def test_xy_distribution(self): + if self.config_test.get("xy_distribution"): + assert any('input' in item for item in self.scalers_tag) and any( + 'output' in item for item in self.scalers_tag), "xy_distribution did not take effect" + print("xy_distribution has taken effect") + + def test_mv_distribution(self): + if self.config_test.get("mv_distribution"): + assert any('exp_avg' in item for item in self.scalers_tag) and any( + 'exp_avg_sq' in item for item in self.scalers_tag), "mv_distribution did not take effect" + print("mv_distribution has taken effect") + + def test_mg_direction(self): + if self.config_test.get("mg_direction"): 
+ assert any('mg_direction' in item for item in self.scalers_tag), "mg_direction did not take effect" + print("mg_direction has taken effect") + + def test_wg_distribution(self): + if self.config_test.get("wg_distribution"): + assert any('weight' in item for item in self.scalers_tag), "wg_distribution did not take effect" + print("wg_distribution has taken effect") + + @classmethod + def tearDownClass(cls) -> None: + clean_output() + + +if __name__ == "__main__": + unittest.main() diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_cc.py b/debug/accuracy_tools/kj600/kj600/unittest/test_cc.py new file mode 100644 index 0000000000000000000000000000000000000000..b5e92417a41e09539a00761e743670ba1b409ff7 --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_cc.py @@ -0,0 +1,260 @@ +import sys +sys.path.append(".") +import time +import torch +from torch import nn +from torch import distributed as dist +import torch.multiprocessing as mp +from kj600.module_hook import TrainerMon +from kj600.unittest.cc_utils import * + +DEBUG = False +DIM = 2 +DTYPE = torch.float16 + +# 采集数据正确 +# 通信结果正确 + +def test_broadcast(context, rank, async_op): + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + local_a = a.clone() + src = 0 + work = dist.broadcast(a, src, dist.group.WORLD, async_op) + if work: + work.wait() + context.aggregate() + if rank == src: + assert_context(context.data, {'pre':[local_a], 'post':[a]}, rank) + assert torch.equal(local_a, a), f"{local_a}, {a}" + else: + src_tensor = torch.tensor([src+1, src+1], dtype=DTYPE, device=f'{DEVICE}:{rank}') + assert_context(context.data, {'pre': [local_a], 'post':[src_tensor]}, rank) + assert_equal(src_tensor, a, rank) + +@wrap_reset +def test_gather(context, rank, world_size, async_op): + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + dst = 0 + if rank == dst: + data = [torch.zeros_like(a) for _ in range(world_size)] + else: + data = None + work = dist.gather(a, data, dst, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + context.aggregate() + if rank == dst: + assert_context(context.data, {'pre':[a, torch.zeros(world_size, 2, dtype=DTYPE)], 'post':[a, torch.stack(data)]}, rank) + for i in range(world_size): + local_a = torch.tensor([i+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + assert_equal(data[i], local_a, rank) + + +@wrap_reset +def test_all_gather(context, rank, world_size, async_op): + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + data = [torch.zeros_like(a, dtype=DTYPE) for _ in range(world_size)] + work = dist.all_gather(data, a, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + context.aggregate() + assert_context(context.data, {'pre':[torch.zeros(world_size, DIM, dtype=DTYPE), a], 'post':[torch.stack(data), a]}, rank) + assert_equal(data[rank], a, rank) + +@wrap_reset +def test_all_gather_into_tensor(context, rank, world_size, async_op): + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + # concatenation + data = torch.zeros(world_size * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + res = torch.tensor([[i+1] for i in range(world_size)], dtype=DTYPE, device=f'{DEVICE}:{rank}').repeat(1, DIM) + work = dist.all_gather_into_tensor(data, a, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + context.aggregate() + assert_context(context.data, {'pre': [torch.zeros(world_size * DIM, dtype=DTYPE), a], 'post': [data, a]}, rank) + assert_equal(data, res.flatten(), 
rank) + + context.reset() + # concatenation + data = torch.zeros(world_size, DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + work = dist.all_gather_into_tensor(data, a, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + + context.aggregate() + assert_context(context.data, {'pre': [torch.zeros(world_size, DIM, dtype=DTYPE), a], 'post': [data, a]}, rank) + assert_equal(data, res, rank) + +@wrap_reset +def test_reduce(context, rank, world_size, async_op): + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + local_a = a.clone() + dst = 0 + work = dist.reduce(a, dst, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + context.aggregate() + total = sum([i+1 for i in range(world_size)]) + res = torch.tensor([total] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + if rank == dst: + assert_context(context.data, {'pre':[local_a], 'post':[res]}, rank) + assert_equal(res, a, rank) + else: + assert_context(context.data, {'pre':[a], 'post':[a]}, rank) + assert_equal(local_a, a, rank) + +@wrap_reset +def test_all_reduce(context, rank, world_size, async_op): + repeat = 2 + for _ in range(repeat): # test aggregate + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + local_a = a.clone() + if rank == 0: + time.sleep(6) + work = dist.all_reduce(a, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + context.aggregate() + total = sum([i+1 for i in range(world_size)]) + res = torch.tensor([total] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + assert_context(context.data, {'pre': [local_a.repeat(repeat)],'post': [res.repeat(repeat)]}, rank) + assert_equal(res, a, rank) + + +@wrap_reset +def test_reduce_scatter(context, rank, world_size, async_op): + a = torch.tensor([rank+1, rank+1], dtype=DTYPE, device=f'{DEVICE}:{rank}') + output = torch.zeros_like(a) + data = [a*(i+1) for i in range(world_size)] + work = dist.reduce_scatter(output, data, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + context.aggregate() + total = sum([i+1 for i in range(world_size)]) + res = (rank+1) * torch.tensor([total, total], dtype=DTYPE, device=f'{DEVICE}:{rank}') + assert_context(context.data,{'pre': [torch.zeros_like(a), torch.stack(data)], 'post':[output, torch.stack(data)]}, rank) + assert_equal(res, output, rank) + + +@wrap_reset +def test_reduce_scatter_tensor(context, rank, world_size, async_op): + a = torch.tensor([rank+1] * DIM * world_size, dtype=DTYPE, device=f'{DEVICE}:{rank}') + output = torch.zeros(DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + work = dist.reduce_scatter_tensor(output, a, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + context.aggregate() + total = sum([i+1 for i in range(world_size)]) + res = torch.tensor([total] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + assert_context(context.data,{'pre': [torch.zeros_like(a, dtype=DTYPE, device=f'{DEVICE}:{rank}'), a], 'post':[output, a]}, rank) + assert_equal(res, output, rank) + +@wrap_reset +def test_scatter(context, rank, world_size, async_op): + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + local_a = a.clone() + src = 0 + if rank == src: + scatter_list = [10*torch.tensor([i+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') for i in range(world_size)] + else: + scatter_list = None + work = dist.scatter(a, scatter_list, src, group=dist.group.WORLD, async_op=async_op) + if work: + work.wait() + 
context.aggregate() + if rank == src: + assert_context(context.data, {'pre': [local_a, torch.stack(scatter_list)], 'post': [a, torch.stack(scatter_list)]}, rank) + else: + assert_context(context.data, {'pre': [local_a], 'post': [a]}, rank) + assert_equal(a, 10*torch.tensor([(rank+1)] * DIM ,dtype=DTYPE, device=f'{DEVICE}:{rank}'), rank) + +## point2point +@wrap_reset +def test_send_recv(context, rank, world_size, async_op): + """send from rank 0 to rank world_size-1""" + if world_size<2: + return + a = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + local_a = a.clone() + src = 0 + dst = world_size-1 + if rank == src: + dist.send(a, dst, group=dist.group. + WORLD) + context['send'].aggregate() + assert_context(context['send'].data, {'pre': [local_a], 'post': [a]}, rank) + assert_equal(a, local_a, rank) + if rank == dst: + src_tensor = torch.tensor([src+1, src+1], dtype=DTYPE, device=f'{DEVICE}:{rank}') + dist.recv(a, src, group=dist.group. + WORLD) + context['recv'].aggregate() + assert_context(context['recv'].data, {'pre':[local_a], 'post': [a]}, rank) + assert_equal(a, src_tensor, rank) + +@wrap_reset +def test_batch_isend_irecv(context, rank, world_size, async_op): + send_tensor = torch.tensor([rank+1] * DIM, dtype=DTYPE, device=f'{DEVICE}:{rank}') + recv_tensor = torch.zeros_like(send_tensor) + send_op = dist.P2POp(dist.isend, send_tensor, (rank + 1)%world_size) + recv_op = dist.P2POp(dist.irecv, recv_tensor, (rank - 1 + world_size)%world_size) + reqs = dist.batch_isend_irecv([send_op, recv_op]) + for req in reqs: + req.wait() + context.aggregate() + assert_context(context.data, {'pre': [torch.stack([send_tensor, torch.zeros_like(send_tensor)])], 'post':[torch.stack([send_tensor, recv_tensor])]}, rank) + assert_equal( recv_tensor, torch.tensor([(rank - 1 + world_size)%world_size + 1] * DIM, device=f'{DEVICE}:{rank}'), rank) + +def test_all(monitor, rank, world_size, async_op): + cc_context = monitor.cc_context + + test_send_recv(cc_context, rank, world_size, async_op) + test_broadcast(cc_context['broadcast'], rank, async_op) + test_gather(cc_context['gather'], rank, world_size, async_op) + test_all_gather(cc_context['all_gather'], rank, world_size, async_op) + test_all_gather_into_tensor(cc_context['all_gather_into_tensor'], rank, world_size, async_op) + test_reduce(cc_context['reduce'], rank, world_size, async_op) + test_all_reduce(cc_context['all_reduce'], rank, world_size, async_op) + test_reduce_scatter(cc_context['reduce_scatter'], rank, world_size, async_op) + test_reduce_scatter_tensor(cc_context['reduce_scatter_tensor'], rank, world_size, async_op) + test_scatter(cc_context['scatter'], rank, world_size, async_op) + test_batch_isend_irecv(cc_context['batch_isend_irecv'], rank, world_size, async_op) + + +def main(rank, world_size): + + ddp_setup(rank, world_size) + if rank == 0 and DEBUG: + import debugpy + debugpy.listen(5678) + debugpy.wait_for_client() + steps = 2 + + net = Model() + monitor = TrainerMon("kj600/unittest/config_cc.json", opt_ty="Megatron_Float16OptimizerWithFloat16Params") + # monitor = None + # monitor.hook_optimizer() # to enable tb + optimizer = torch.optim.Adam(net.parameters()) + for step in range(steps): + print('setp: ', step) + test_all(monitor, rank, world_size, False) + test_all(monitor, rank, world_size, True) + optimizer.step() + + +class Model(nn.Module): + def __init__(self): + super(Model, self).__init__() + self.layer = nn.Linear(2,2) + + def forward(self, x): + return self.layer(x) + +if __name__ == '__main__': + if 
len(sys.argv)>1: + DEBUG = sys.argv[1] + world_size=4 + torch.manual_seed(1234) + mp.spawn(main, args=(world_size,), nprocs=world_size) + + \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_cc_codeline_ranks.py b/debug/accuracy_tools/kj600/kj600/unittest/test_cc_codeline_ranks.py new file mode 100644 index 0000000000000000000000000000000000000000..d635441e155736340648b31dc1eab3d61d03f2fd --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_cc_codeline_ranks.py @@ -0,0 +1,52 @@ +import sys +sys.path.append(".") +import torch +from torch import distributed as dist +import torch.multiprocessing as mp +from kj600.module_hook import TrainerMon +from kj600.unittest.cc_utils import * + +@wrap_reset +def test_all_gather(context, rank, target_rank, world_size, async_op): + a = torch.tensor([rank+1, rank+1], dtype=torch.float32, device=f'{DEVICE}:{rank}') + data = [torch.empty_like(a) for _ in range(world_size)] + dist.all_gather(data, a, group=dist.group.WORLD, async_op=async_op) + assert_context(context.data, {}, rank) + +@wrap_reset +def test_all_reduce(context, rank, target_rank, world_size, async_op): + a = torch.tensor([rank+1, rank+1], dtype=torch.float32, device=f'{DEVICE}:{rank}') + dist.all_reduce(a, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) + total = sum([i+1 for i in range(world_size)]) + sum_reduced = torch.tensor([total, total], dtype=torch.float32, device=f'{DEVICE}:{rank}') + context.aggregate() + if rank in target_rank: + assert_context(context.data, {"post": [sum_reduced]}, rank) + else: + assert_context(context.data, {}, rank) + +def main(rank, world_size): + + ddp_setup(rank, world_size) + steps = 2 + async_op = False + + net = Model() + monitor = TrainerMon("kj600/unittest/config_cc_codeline_ranks.json") + target_rank = monitor.module_rank_list + # monitor = None + # monitor.hook_optimizer() # to enable tb + optimizer = torch.optim.Adam(net.parameters()) + cc_context = monitor.cc_context + for step in range(steps): + print('setp: ', step) + test_all_gather(cc_context['all_gather'], rank, target_rank, world_size, async_op) + test_all_reduce(cc_context['all_reduce'], rank, target_rank, world_size, async_op) + optimizer.step() + +if __name__ == '__main__': + world_size=2 + torch.manual_seed(1234) + mp.spawn(main, args=(world_size,), nprocs=world_size) + + \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_cc_log_only.py b/debug/accuracy_tools/kj600/kj600/unittest/test_cc_log_only.py new file mode 100644 index 0000000000000000000000000000000000000000..d7508d4af51d0549105a92eea3b7ff717924aea4 --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_cc_log_only.py @@ -0,0 +1,55 @@ +import os +import sys +sys.path.append(".") +import json +import torch +from torch import distributed as dist +import torch.multiprocessing as mp +from kj600.module_hook import TrainerMon +from kj600.unittest.cc_utils import * + + +with open(os.path.join(os.path.dirname(__file__), 'expected_cc_log.json')) as f: + EXPECTED = json.load(f) + +def test_all_gather(context, rank, world_size, async_op): + a = torch.tensor([rank+1, rank+1], dtype=torch.float32, device=f'{DEVICE}:{rank}') + data = [torch.empty_like(a) for _ in range(world_size)] + dist.all_gather(data, a, group=dist.group.WORLD, async_op=async_op) + assert_context(context.data, {}, rank) + +def test_all_reduce(context, rank, world_size, async_op): + a = torch.tensor([rank+1, rank+1], dtype=torch.float32, 
device=f'{DEVICE}:{rank}') + dist.all_reduce(a, op=dist.ReduceOp.SUM, group=dist.group.WORLD, async_op=async_op) + assert_context(context.data, {}, rank) + + +def main(rank, world_size): + ddp_setup(rank, world_size) + steps = 3 + async_op = False + + net = Model() + monitor = TrainerMon("kj600/unittest/config_cc_logonly.json") + monitor.hook_optimizer() # to enable tb + optimizer = torch.optim.Adam(net.parameters()) + cc_context = monitor.cc_context + try: + for step in range(steps): + print('step: ', step) + test_all_gather(cc_context['all_gather'], rank, world_size, async_op) + test_all_reduce(cc_context['all_reduce'], rank, world_size, async_op) + optimizer.step() + except Exception as e: + assert step == 1 + assert e.__str__() == "exit after first step when print cc stack", e + for k in EXPECTED.keys(): + assert [';'.join(stack) for stack in EXPECTED[k]] == list(monitor.cc_logged_stack[k]) + + +if __name__ == '__main__': + world_size=2 + torch.manual_seed(1234) + mp.spawn(main, args=(world_size,), nprocs=world_size) + + \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_database.py b/debug/accuracy_tools/kj600/kj600/unittest/test_database.py new file mode 100644 index 0000000000000000000000000000000000000000..a9046d9c07ebb1cdf3f16490554f7223d893e51e --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_database.py @@ -0,0 +1,42 @@ +import unittest +import uuid +from datetime import datetime +from unittest import TestCase + +from sqlalchemy import inspect + +from kj600.database import Database, ExceptionMessage + + +class TestDatabase(TestCase): + def __init__(self, method_name: str): + super(TestDatabase, self).__init__(method_name) + self.db = Database('mysql+pymysql://username:password@host:port/database') + + def test_create_table(self): + self.db.create_table() + inspect_ = inspect(self.db.engine) + table_names = inspect_.get_table_names() + print(table_names) + self.assertIn("exception_message", table_names) + + def test_insert_batch(self): + self.db.create_table() + job_id = str(uuid.uuid4()) + print(job_id) + save_list = [] + exception_message_list = [ + '> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0/1/input_zeros at step 1.', + '> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0.input_norm.weight/0/exp_avg_min at step 2.', + '> Rule AnomalyTurbulence reports anomaly signal in language_model.encoder.layers.0.input_norm.weight/1/exp_avg_min at step 2.'] + for exception_message in exception_message_list: + item = {'job_id': job_id, 'message': exception_message, 'create_time': datetime.now()} + save_list.append(ExceptionMessage(**item)) + self.db.insert_batch(save_list) + find_by_job_id = self.db.find_by_job_id(job_id) + exception_messages = [item.message for item in find_by_job_id] + self.assertEqual(exception_messages, exception_message_list) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_features.py b/debug/accuracy_tools/kj600/kj600/unittest/test_features.py new file mode 100644 index 0000000000000000000000000000000000000000..bc8c6dd71ab4e0bf708cf3d97d02dab3a2ded9cc --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_features.py @@ -0,0 +1,33 @@ +import unittest +import torch +import torch.nn as nn +import torch_npu +from kj600.features import eff_rank + + +class TestFeatureCalculation(unittest.TestCase): + def test_effective_rank(self): + param = torch.randn(10, 10).npu() + rank = 
eff_rank(param) + self.assertTrue(rank.item() >= 1) + + def test_lambda_max(self): + pass + # input_dim = 10 + # hidden_dim = 100 + # output_dim = 1 + # num_samples = 100 + # X = torch.randn(num_samples, input_dim) + # network = nn.Sequential( + # nn.Linear(input_dim, hidden_dim), + # nn.ReLU(), + # nn.Linear(hidden_dim, output_dim) + # ) + # Y = network(X) + # Y.backward() + # for name, param in network.named_parameters(): + # lm = lambda_max(param) + + +if __name__ == "__main__": + unittest.main() \ No newline at end of file diff --git a/debug/accuracy_tools/kj600/kj600/unittest/test_module_hook.py b/debug/accuracy_tools/kj600/kj600/unittest/test_module_hook.py new file mode 100644 index 0000000000000000000000000000000000000000..f81312691d35825fad05b7ed04db352bc96b2c20 --- /dev/null +++ b/debug/accuracy_tools/kj600/kj600/unittest/test_module_hook.py @@ -0,0 +1,84 @@ +import sys +sys.path.append('./') +import argparse +import torch +try: + import torch_npu + device = torch.device('npu:0') +except ModuleNotFoundError: + device = torch.device('cpu') +import torch.nn.functional as F +from kj600.module_hook import TrainerMon # Modify PYTHONPATH to import TrainerMon +#from hook_api import reg_grad_hook, reg_grad_one_hook, reg_module_backward_hook, reg_module_forward_hook +#from torch.cuda.amp import GradScaler + +# from torch.npu.amp import GradScaler + + +# from ptdbg_ascend import PrecisionDebugger as PD +# from monitor import GradientMonitor + +# print(torch_npu.__version__) + +#debugger = PD(dump_path="./dump/", hook_name="dump", step=[1, 2, 3], enable_dataloader=False) +#debugger.configure_hook(mode="list", scope=["optim_Adam_step"], ) + +parser = argparse.ArgumentParser(prog="kj600 debug", description="kj600 sample code", epilog="") +parser.add_argument("-o", "--out_dir", type=str, default=".") +args = parser.parse_args() +DTYPE = torch.float32 + + +class Model(torch.nn.Module): + def __init__(self): + super().__init__() + self.fc = torch.nn.Linear(784, 10, dtype=DTYPE) + self.relu = torch.nn.ReLU() + + def forward(self, x): + return self.relu(self.fc(x).type(DTYPE)) + + +net = Model().to(device=device) + +config = { + "targets": { + "fc": {"input": "tuple[2]:0", "output": "tensor::"}, + "relu": {"input": "..", "output": ".."} + } +} +# reg_grad_hook(net, hook_factory=hook_factory, config=config) +# reg_grad_one_hook(net, hook=monitor_hook, config=config) +# net.fc.register_forward_hook(get_actv_hook("fc")) +# reg_module_forward_hook(net, module_fwd_hook, config) +# reg_module_backward_hook(net, module_bwd_hook, config) +optimizer = torch.optim.Adam(net.parameters(), lr=0.0001) + +hooker = TrainerMon('./kj600/unittest/config_1.json', opt_ty = 'Megatron_Float16OptimizerWithFloat16Params') +hooker.hook_modules(model=net, global_batch_size=2, dp=1, micro_batch_size=2, fwd_or_bkd=0, params_have_main_grad=False) +# hooker.hook_optimizer(optimizer) + + +class ToyDataset(torch.utils.data.Dataset): + def __init__(self): + self.data = torch.randn(16, 784, dtype=DTYPE, requires_grad=True) + self.labels = torch.randint(low=0, high=9, size=(16,)) + + def __len__(self): + return len(self.labels) + + def __getitem__(self, idx): + return self.data[idx].to(device), self.labels[idx].to(device) + +train_ds = ToyDataset() +train_loader = torch.utils.data.DataLoader(train_ds, shuffle=True, batch_size=2) + + +# scaler = GradScaler() +for (inputs, labels) in train_loader: + optimizer.zero_grad() + outputs = net(inputs) + loss = F.cross_entropy(outputs, labels) + + loss.backward() + optimizer.step() diff --git 
a/debug/accuracy_tools/kj600/pyproject.toml b/debug/accuracy_tools/kj600/pyproject.toml index 8ab303b81b8a53424fc4815849f3662134445b5f..5df968563345dd07ed477ec73b967b63c6e812a6 100644 --- a/debug/accuracy_tools/kj600/pyproject.toml +++ b/debug/accuracy_tools/kj600/pyproject.toml @@ -9,6 +9,10 @@ dependencies = [ "torch", "torch_npu", "torchvision", + "tensorboard", + "matplotlib", + "sqlalchemy", + "pymysql" ] [tool.setuptools.packages] diff --git a/debug/accuracy_tools/atat/README.md b/debug/accuracy_tools/msprobe/README.md similarity index 40% rename from debug/accuracy_tools/atat/README.md rename to debug/accuracy_tools/msprobe/README.md index e7e485e4f336fa9af98062b328f4e85f7fac77f8..1e8c1a1f08dfafceefa3ff917f4f202164a951b0 100644 --- a/debug/accuracy_tools/atat/README.md +++ b/debug/accuracy_tools/msprobe/README.md @@ -1,22 +1,37 @@ # MindStudio精度调试工具 -MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat,是MindStudio Training Tools工具链下精度调试部分的工具包。主要包括精度预检和精度比对等子工具,当前适配场景包括PyTorch和MindSpore。 +MindStudio精度调试工具(MindStudio Probe),简称msprobe,是MindStudio Training Tools工具链下精度调试部分的工具包。主要包括精度预检和精度比对等子工具,当前适配场景包括PyTorch和MindSpore。 ## 工具安装 -精度工具合一软件包名称:`ascend_training_accuracy_tools-{version}-py3-none-any.whl` +精度工具合一软件包名称:`mindstudio_probe-{version}-py3-none-any.whl` -1. whl包获取。 +### pip安装 + ```shell + pip install mindstudio-probe + ``` + 说明 + 1. 使用`pip install mindstudio-probe==版本号`可安装指定版本的包 + 2. pip命令会自动安装包及其依赖 + 3. 安装成功后,日志会显示`Successfully installed mindstudio-probe-版本号` + +### 下载whl包安装 +1. 使用pip命令安装numpy、openpyxl、pandas、PyYAML、rich、torch、tqdm依赖。 + + 若环境中已安装部分依赖,不需要重复安装。 + +2. whl包获取。 请通过下表链接下载工具whl包。 - | 版本 | 发布日期 | 支持PyTorch版本 | 下载链接 | 校验码 | - | ----- | ---------- | ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | - | 0.0.3 | 2024-06-11 | 1.11.0/2.0/2.1/2.2 | [ascend_training_accuracy_tools-0.0.3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.3-py3-none-any.whl) | f46d9714704859e2d67861a65bbb3c76b0a250cf6e238b978b5b959ab1fe125a | - | 0.0.2 | 2024-05-23 | 1.11.0/2.0/2.1/2.2 | [ascend_training_accuracy_tools-0.0.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.2-py3-none-any.whl) | 2e35809bde559e9c4d2f16a02ccde779ed9e436bb65fded0b7ebaf6ac2c88d93 | - | 0.0.1 | 2024-03-15 | 1.11.0/2.0/2.1 | [ascend_training_accuracy_tools-0.0.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.1-py3-none-any.whl) | 5801510d4e827e4859bc9a5aca021e4d30c2ea42d60a4c8ad0c2baab1b7782c9 | + | 版本 | 发布日期 | 支持PyTorch版本 | 下载链接 | 校验码 | + | ----- | ---------- | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | + | 1.0 | 2024-07-09 | 2.0/2.1/2.2 | [ascend_training_accuracy_tools-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/1.0/ascend_training_accuracy_tools-1.0-py3-none-any.whl) | 5016dfe886c5d340ec6f60a959673355855f313c91f100680da814efb49f8e81 | + | 0.0.3 | 2024-06-11 | 2.0/2.1/2.2 | [ascend_training_accuracy_tools-0.0.3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.3-py3-none-any.whl) | f46d9714704859e2d67861a65bbb3c76b0a250cf6e238b978b5b959ab1fe125a | + | 0.0.2 | 2024-05-23 | 2.0/2.1/2.2 | 
[ascend_training_accuracy_tools-0.0.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.2-py3-none-any.whl) | 2e35809bde559e9c4d2f16a02ccde779ed9e436bb65fded0b7ebaf6ac2c88d93 | + | 0.0.1 | 2024-03-15 | 2.0/2.1 | [ascend_training_accuracy_tools-0.0.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/att/0.0/ascend_training_accuracy_tools-0.0.1-py3-none-any.whl) | 5801510d4e827e4859bc9a5aca021e4d30c2ea42d60a4c8ad0c2baab1b7782c9 | -2. whl包校验。 +3. whl包校验。 1. 根据以上下载链接下载whl包到Linux安装环境。 @@ -35,28 +50,51 @@ MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat, 5801510d4e827e4859bc9a5aca021e4d30c2ea42d60a4c8ad0c2baab1b7782c9 *ascend_training_accuracy_tools-0.0.1-py3-none-any.whl ``` -3. 执行如下命令进行安装。 +4. 执行如下命令进行安装。 ```bash - pip3 install ./ascend_training_accuracy_tools-{version}-py3-none-any.whl + pip3 install ./mindstudio_probe-{version}-py3-none-any.whl ``` 若为覆盖安装,请在命令行末尾增加“--force-reinstall”参数强制安装,例如: ```bash - pip3 install ./ascend_training_accuracy_tools-{version}-py3-none-any.whl --force-reinstall + pip3 install ./mindstudio_probe-{version}-py3-none-any.whl --force-reinstall ``` 提示如下信息则表示安装成功。 ```bash - Successfully installed ascend_training_accuracy_tools-{version} + Successfully installed mindstudio_probe-{version} ``` +### 从源码安装 +1. 克隆或者下载项目源代码 + + ```shell + git clone https://gitee.com/ascend/mstt.git + cd debug/accuracy_tools + ``` + +2. 安装setuptools和wheel + + ```shell + pip install setuptools wheel + ``` + +3. 安装msprobe + + ```shell + python setup.py install + ``` + 提示出现如下信息则表示源码安装成功。 + ```shell + Finished processing dependencies for mindstudio-probe=={version} + ``` ## 工具使用 -安装atat工具后,可以按照如下思路选择合适的子工具进行精度调试: +安装msprobe工具后,可以按照如下思路选择合适的子工具进行精度调试: 1. 判断框架场景。 @@ -90,7 +128,7 @@ MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat, 溢出解析是在执行精度数据dump时,配置了溢出检测dump,那么对于输入正常但输出存在溢出的API,可以判断是否为正常溢出。 - PyTorch场景:详见[PyTorch_溢出解析工具](./pytorch/doc/run_overflow_check.md)。(暂不支持) + PyTorch场景:详见[PyTorch_溢出解析工具](./pytorch/doc/run_overflow_check.md)。 MindSpore场景:暂不支持。 @@ -102,18 +140,38 @@ MindStudio精度调试工具(ascend_training_accuracy_tools),简称atat, MindSpore场景:暂不支持。 -上述流程中的工具均为atat工具的子工具,使用相同的命令行,格式如下: +上述流程中的工具均为msprobe工具的子工具,使用相同的命令行,格式如下: + +精度预检工具 + +```bash +msprobe -f run_ut [-h] +``` + +```bash +msprobe -f multi_run_ut [-h] +``` ```bash -atat [-h] -f parse run_ut multi_run_ut api_precision_compare run_overflow_check +msprobe -f api_precision_compare [-h] ``` -| 参数 | 说明 | -| ---- | ---------------------------------------- | -| -f | 框架,当前支持配置为pytorch和mindspore。 | -| -h | 帮助信息。 | +溢出解析工具 + +```bash +msprobe -f run_overflow_check [-h] +``` + +数据解析工具 + +```bash +msprobe -f parse [-h] +``` -其他参数在上述对应的工具手册中详细介绍。 +| 参数 | 说明 | +| ---- | ------------------------------------------------------ | +| -f | 框架,请按所使用框架配置,当前支持pytorch或mindspore。 | +| -h | 帮助信息。 | ## 贡献 diff --git a/debug/accuracy_tools/atat/mindspore/debugger/__init__.py b/debug/accuracy_tools/msprobe/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/mindspore/debugger/__init__.py rename to debug/accuracy_tools/msprobe/__init__.py diff --git a/debug/accuracy_tools/atat/config/README.md b/debug/accuracy_tools/msprobe/config/README.md similarity index 83% rename from debug/accuracy_tools/atat/config/README.md rename to debug/accuracy_tools/msprobe/config/README.md index 66429b54fc5e716bec8c70c932232546bae1b55e..7b91bd26f16d42cce1a848e22b8fa46b61553dbd 100644 --- a/debug/accuracy_tools/atat/config/README.md +++ b/debug/accuracy_tools/msprobe/config/README.md @@ -12,7 +12,8 @@ | 
dump_path | 设置dump数据目录路径,str类型。配置示例:"dump_path": "./dump_path"。MindSpore场景仅支持绝对路径。 | 是 | | rank | 指定对某张卡上的数据进行dump,list[int]类型,默认未配置(表示dump所有卡的数据),应配置为大于等于0的整数,且须配置实际可用的Rank ID。配置示例:"rank": [1]。
对于PyTorch场景,Rank ID从0开始计数,最大取值为所有节点可用卡总数-1;若所配置的值大于实际训练所运行卡的Rank ID,则dump数据为空。例如当前环境Rank ID为0到7,而实际训练只运行0到3卡,此时若配置Rank ID为4或不存在的10等其他值,dump数据为空。
对于MindSpore场景,所有节点的Rank ID均从0开始计数,最大取值为每个节点可用卡总数-1,config.json配置一次rank参数对所有节点同时生效。 | 否 | | step | 指定dump某个step的数据,list[int]类型。默认未配置,表示dump所有step数据。dump特定step时,须指定为训练脚本中存在的step。step为list格式,可配置逐个step,例如:"step": [0,1,2]。 | 否 | -| level | dump级别,str类型,根据不同级别dump不同数据。可取值"L0"(dump module模块级精度数据,仅PyTorch场景支持,使用背景详见“**模块级精度数据dump说明**”)、"L1"(dump API级精度数据,默认值)、"L2"(dump kernel级精度数据)、"mix"(dump module模块级和API级精度数据,即"L0"+"L1",仅PyTorch场景支持)。配置示例:"level": "L1"。 | 否 | +| level | dump级别,str类型,根据不同级别dump不同数据。可取值"L0"(dump module模块级精度数据,仅PyTorch场景支持,使用背景详见“**模块级精度数据dump说明**”)、"L1"(dump API级精度数据,默认值)、"L2"(dump kernel级精度数据,须配置acl_config参数)、"mix"(dump module模块级和API级精度数据,即"L0"+"L1",仅PyTorch场景支持)。配置示例:"level": "L1"。 | 否 | +| acl_config | kernel dump的配置文件,str类型。level取"L2"时,该参数必选;level为其他值时,该参数不选。参数示例:acl_config='./acl_config.json'。acl_config.json配置文件详细介绍请参见“**acl_config.json配置文件说明**”。 | 否 | | seed | 随机种子数,int类型,默认值为:1234,仅PyTorch场景支持。通过固定随机数保证模型的输入或输出一致,可固定的随机数详见“**固定随机数范围**”。配置示例:"seed": 1234。 | 否 | | is_deterministic | 确定性计算模式,bool类型,仅PyTorch场景支持。可取值true(开启)或false(关闭),默认关闭。配置示例:"is_deterministic": true。
即使在相同的硬件和输入下,API多次执行的结果也可能不同;开启确定性计算用于保证相同硬件和输入下API多次执行的结果一致。
确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启。
rnn类算子、ReduceSum、ReduceMean等算子可能与确定性计算存在冲突,若开启确定性计算后多次执行的结果不相同,则考虑存在这些算子。 | 否 | | enable_dataloader | 自动控制开关,bool类型,仅PyTorch场景支持。可取值true(开启)或false(关闭),默认为false。配置为True后自动识别step参数指定的迭代,并在该迭代执行完成后退出训练,此时start、stop和step函数可不配置,开启该开关要求训练脚本是通过torch.utils.data.dataloader方式加载数据。仅支持PyTorch单卡训练使用,分布式训练场景下存在数据dump不全问题,**下个版本即将废弃该功能**。 | 否 | @@ -250,6 +251,109 @@ task配置为free_benchmark时,开启**无标杆比对**,在NPU环境下通 模块指的是继承自nn.Module类模块,通常情况下这类模块就是一个小模型,可以被视为一个整体,dump数据时以模块为粒度进行dump。 +### acl_config.json配置文件说明 + +#### [config.json](./config.json)配置示例 + +当level取"L2"时,须配置acl_config参数,并指定acl_config.json文件(用于指定L2 kernel级dump的配置),此时config.json文件配置示例如下: + +- 前向kernel dump配置示例: + + "scope"配置为前向API名称,仅支持配置一个API。 + + ```json + { + "task": "tensor", + "dump_path": "/home/data_dump", + "level": "L2", + "rank": [0], + "step": [0], + "is_deterministic": false, + "tensor": { + "scope": ["Tensor.__mul__.10.forward"], + "list":[], + "data_mode": ["all"], + "backward_input": [""], + "file_format": "npy" + }, + "acl_config": "acl_config.json" + } + ``` + +- 反向kernel dump配置示例: + + 执行反向kernel dump前需要先使用工具dump该API的反向输入,保存pt文件,在"backward_input"参数中传入该pt文件路径。 + + "scope"配置为反向API名称,仅支持配置一个API。 + + ```json + { + "task": "tensor", + "dump_path": "/home/data_dump", + "level": "L2", + "rank": [0], + "step": [0], + "is_deterministic": false, + "tensor": { + "scope": ["Tensor.__mul__.10.backward"], + "list":[], + "data_mode": ["all"], + "backward_input": ["Tensor.__mul__.10.backward.input.0.pt"], + "file_format": "npy" + }, + "acl_config": "acl_config.json" + } + ``` + +#### acl_config.json配置示例 + +acl_config.json文件须自行创建,配置示例如下: + +``` +{ + "dump": + { + "dump_list":[], + "dump_path":"./dump/output", + "dump_mode":"all", + "dump_op_switch":"on" + } +} +``` + +**acl_config.json参数说明** + +| 字段名 | 说明 | +| -------------- | ------------------------------------------------------------ | +| dump_list | 待dump数据的API模型。为空,无需配置。 | +| dump_path | dump数据文件存储到运行环境的目录,主要用于指定kernel dump数据路径。支持配置绝对路径或相对路径。dump_path须为已存在目录。 | +| dump_mode | dump数据模式,配置如下: output:dump API的输出数据。默认值。 input:dump API的输入数据。 all:dump API的输入、输出数据。 | +| dump_op_switch | 单API模型dump数据开关,配置如下:
off:关闭单API模型dump,默认值。
on:开启单API模型dump。 | + +**dump目录说明** + +配置register_hook的dump_config后,采集的dump数据会在{dump_path}/{time}/{deviceid}/{model_id}目录下生成,例如“/home/HwHiAiUser/output/20200808163566/0/0” + +``` +├── 20230131172437 +│ └── 1 +│ ├── 0 +│ │ ├── Add.Add.45.0.1675157077183551 +│ │ ├── Cast.trans_Cast_0.31.0.1675157077159449 +│ │ ├── Cast.trans_Cast_5.43.0.1675157077180129 +│ │ ├── MatMul.MatMul.39.0.1675157077172961 +│ │ ├── Mul.Mul.29.0.1675157077155731 +│ │ ├── NPUAllocFloatStatus.NPUAllocFloatStatus.24.0.1675157077145262 +│ │ ├── TransData.trans_TransData_1.33.0.1675157077162791 +│ │ └── TransData.trans_TransData_4.41.0.1675157077176648 +│ ├── 1701737061 +│ │ └── Cast.trans_Cast_2.35.0.1675157077166214 +│ ├── 25 +│ │ └── NPUClearFloatStatus.NPUClearFloatStatus.26.0.1675157077150342 +│ └── 68 +│ └── TransData.trans_TransData_3.37.0.1675157077169473 +``` + ### 固定随机数范围 仅PyTorch场景支持。 @@ -290,4 +394,4 @@ train_loader = torch.utils.data.DataLoader( 关闭dropout: -在使用from ptdbg import *后,工具会自动将torch.nn.functional.dropout、torch.nn.functional.dropout2d、torch.nn.functional.dropout3d、torch.nn.Dropout、torch.nn.Dropout2d、torch.nn.Dropout3d的接口参数p置为0。 +在使用from msprobe.pytorch import PrecisionDebugger后,工具会自动将torch.nn.functional.dropout、torch.nn.functional.dropout2d、torch.nn.functional.dropout3d、torch.nn.Dropout、torch.nn.Dropout2d、torch.nn.Dropout3d的接口参数p置为0。 diff --git a/debug/accuracy_tools/atat/config/config.json b/debug/accuracy_tools/msprobe/config/config.json similarity index 80% rename from debug/accuracy_tools/atat/config/config.json rename to debug/accuracy_tools/msprobe/config/config.json index ba13898090c802ed814694da70e5c415222f6c35..c6077b75aef9c0087ed71161f1f637a41ab8441a 100644 --- a/debug/accuracy_tools/atat/config/config.json +++ b/debug/accuracy_tools/msprobe/config/config.json @@ -1,6 +1,6 @@ { "task": "statistics", - "dump_path": "", + "dump_path": "./dump_path", "rank": [], "step": [], "level": "L1", @@ -24,5 +24,10 @@ "overflow_check": { "overflow_nums": 1, "check_mode":"all" + }, + "run_ut": { + "white_list": [], + "black_list": [], + "error_data_path": "./" } } \ No newline at end of file diff --git a/debug/accuracy_tools/atat/config/img/free_benchmark.png b/debug/accuracy_tools/msprobe/config/img/free_benchmark.png similarity index 100% rename from debug/accuracy_tools/atat/config/img/free_benchmark.png rename to debug/accuracy_tools/msprobe/config/img/free_benchmark.png diff --git a/debug/accuracy_tools/msprobe/core/common/const.py b/debug/accuracy_tools/msprobe/core/common/const.py new file mode 100644 index 0000000000000000000000000000000000000000..5d9cdcdb47c9e2a0a076cd1948a6ecdaded830ce --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/common/const.py @@ -0,0 +1,248 @@ +import os +import stat +import numpy as np + +class Const: + """ + Class for const + """ + SEP = "." 
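+    # separator used when composing dump entry names, e.g. "Tensor.__mul__.10.forward"
+    # (the same dot-separated form shown in the kernel dump config examples above)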
+ REGEX_PREFIX_MAX_LENGTH = 20 + REGEX_PREFIX_PATTERN = r"^[a-zA-Z0-9_-]+$" + FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' + COMMA = "," + FLOAT_EPSILON = np.finfo(float).eps + OFF = 'OFF' + BACKWARD = 'backward' + FORWARD = 'forward' + DEFAULT_LIST = [] + DEFAULT_PATH = './' + WHITE_LIST = 'white_list' + BLACK_LIST = 'black_list' + + # dump mode + ALL = "all" + LIST = "list" + RANGE = "range" + STACK = "stack" + ACL = "acl" + API_LIST = "api_list" + API_STACK = "api_stack" + DUMP_MODE = [ALL, LIST, RANGE, STACK, ACL, API_LIST, API_STACK] + AUTO = "auto" + ONLINE_DUMP_MODE = [ALL, LIST, AUTO, OFF] + SUMMARY = "summary" + MD5 = "md5" + SUMMARY_MODE = [ALL, SUMMARY, MD5] + + WRITE_FLAGS = os.O_WRONLY | os.O_CREAT + WRITE_MODES = stat.S_IWUSR | stat.S_IRUSR + OVERWRITE_FLAGS = os.O_WRONLY | os.O_CREAT | os.O_TRUNC + + PKL_SUFFIX = ".pkl" + NUMPY_SUFFIX = ".npy" + ONE_GB = 1073741824 # 1 * 1024 * 1024 * 1024 + TEN_GB = 10737418240 # 10 * 1024 * 1024 * 1024 + FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' + DISTRIBUTED_PREFIX_LENGTH = 60 + # env dump path + KWARGS = 'kwargs' + INPUT = 'input' + OUTPUT = 'output' + INPUT_ARGS = 'input_args' + INPUT_KWARGS = 'input_kwargs' + GRAD_INPUT = 'grad_input' + GRAD_OUTPUT = 'grad_output' + START = "start" + STOP = "stop" + ENV_ENABLE = "1" + ENV_DISABLE = "0" + MAX_SEED_VALUE = 4294967295 # 2**32 - 1 + TASK_LIST = ["tensor", "statistics", "overflow_check", "free_benchmark", "run_ut"] + LEVEL_LIST = ["L0", "L1", "L2", "mix"] + STATISTICS = "statistics" + TENSOR = "tensor" + OVERFLOW_CHECK = "overflow_check" + FREE_BENCHMARK = "free_benchmark" + RUN_UT = "run_ut" + ATTR_NAME_PREFIX = "wrap_" + KERNEL_DUMP = "kernel_dump" + DATA = "data" + PT_FRAMEWORK = "pytorch" + MS_FRAMEWORK = "mindspore" + DIRECTORY_LENGTH = 4096 + FILE_NAME_LENGTH = 255 + FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] + BOOL_TYPE = [bool, np.uint8] + INT_TYPE = [np.int32, np.int64] + NPU = 'NPU' + DISTRIBUTED = 'Distributed' + + INPLACE_LIST = [ + "broadcast", "all_reduce", "reduce", "all_gather", "gather", "scatter", "reduce_scatter", + "_reduce_scatter_base", "_all_gather_base", "send", "recv", "irecv", "isend", "all_to_all_single" + ] + + CONVERT = { + "int32_to_int64": ["torch.int32", "torch.int64"], + } + + CONVERT_API = { + "int32_to_int64": ["cross_entropy"] + } + +class CompareConst: + """ + Class for compare module const + """ + SPACE = " " + # compare result column name + NPU_NAME = "NPU Name" + BENCH_NAME = "Bench Name" + NPU_DTYPE = "NPU Dtype" + BENCH_DTYPE = "Bench Dtype" + NPU_SHAPE = "NPU Tensor Shape" + BENCH_SHAPE = "Bench Tensor Shape" + NPU_MAX = "NPU max" + NPU_MIN = "NPU min" + NPU_MEAN = "NPU mean" + NPU_NORM = "NPU l2norm" + BENCH_MAX = "Bench max" + BENCH_MIN = "Bench min" + BENCH_MEAN = "Bench mean" + BENCH_NORM = "Bench l2norm" + MAX_DIFF = "Max diff" + MIN_DIFF = "Min diff" + MEAN_DIFF = "Mean diff" + NORM_DIFF = "L2norm diff" + COSINE = "Cosine" + MAX_ABS_ERR = "MaxAbsErr" + MAX_RELATIVE_ERR = "MaxRelativeErr" + MIN_RELATIVE_ERR = "MinRelativeErr" + MEAN_RELATIVE_ERR = "MeanRelativeErr" + NORM_RELATIVE_ERR = "NormRelativeErr" + ACCURACY = "Accuracy Reached or Not" + STACK = "NPU_Stack_Info" + DATA_NAME = "Data_name" + ERROR_MESSAGE = "Err_message" + ONE_THOUSANDTH_ERR_RATIO = "One Thousandth Err Ratio" + FIVE_THOUSANDTHS_ERR_RATIO = "Five Thousandths Err Ratio" + NPU_MD5 = "NPU MD5" + BENCH_MD5 = "BENCH MD5" + RESULT = "Result" + + COMPARE_RESULT_HEADER = [ + NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, 
BENCH_SHAPE, COSINE, MAX_ABS_ERR, MAX_RELATIVE_ERR, + ONE_THOUSANDTH_ERR_RATIO, FIVE_THOUSANDTHS_ERR_RATIO, + NPU_MAX, NPU_MIN, NPU_MEAN, NPU_NORM, BENCH_MAX, BENCH_MIN, BENCH_MEAN, BENCH_NORM, ACCURACY, ERROR_MESSAGE + ] + + SUMMARY_COMPARE_RESULT_HEADER = [ + NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, MAX_DIFF, MIN_DIFF, MEAN_DIFF, NORM_DIFF, + MAX_RELATIVE_ERR, MIN_RELATIVE_ERR, MEAN_RELATIVE_ERR, NORM_RELATIVE_ERR, + NPU_MAX, NPU_MIN, NPU_MEAN, NPU_NORM, BENCH_MAX, BENCH_MIN, BENCH_MEAN, BENCH_NORM, RESULT, ERROR_MESSAGE + ] + + MD5_COMPARE_RESULT_HEADER = [ + NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, NPU_MD5, BENCH_MD5, RESULT + ] + + # compare standard + HUNDRED_RATIO_THRESHOLD = 0.01 + THOUSAND_RATIO_THRESHOLD = 0.001 + FIVE_THOUSAND_RATIO_THRESHOLD = 0.005 + TEN_THOUSAND_RATIO_THRESHOLD = 0.0001 + COSINE_THRESHOLD = 0.9999 + ULP_FLOAT32_THRESHOLD = 32 + ULP_FLOAT16_THRESHOLD = 1 + + # compare result data + READ_NONE = 'No data' + NONE = 'None' + SHAPE_UNMATCH = 'shape unmatched' + DIFF = 'Different' + UNSUPPORTED = 'unsupported' + NAN = 'Nan' + PASS = 'pass' + WARNING = 'Warning' + ERROR = 'error' + SKIP = 'SKIP' + BFLOAT16_MIN = -3.3895313892515355e+38 + BFLOAT16_MAX = 3.3895313892515355e+38 + BFLOAT16_EPS = 3.90625e-3 # 2 ** -8 + + # accuracy standards + COS_THRESHOLD = 0.99 + MAX_ABS_ERR_THRESHOLD = 0.001 + COS_MAX_THRESHOLD = 0.9 + MAX_ABS_ERR_MAX_THRESHOLD = 1 + ACCURACY_CHECK_YES = "Yes" + ACCURACY_CHECK_NO = "No" + ACCURACY_CHECK_UNMATCH = "Unmatched" + + # error message + NO_BENCH = "No bench data matched." + + # compare const + FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble] + + # highlight xlsx color const + RED = "FFFF0000" + YELLOW = "FFFF00" + BLUE = "0000FF" + + # highlight rules const + OVERFLOW_LIST = ['nan\t', 'inf\t', '-inf\t', 'nan', 'inf', '-inf'] + MAX_DIFF_RED = 1e+10 + ORDER_MAGNITUDE_DIFF_YELLOW = 1 + ONE_THOUSAND_ERROR_IN_RED = 0.9 + ONE_THOUSAND_ERROR_OUT_RED = 0.6 + ONE_THOUSAND_ERROR_DIFF_YELLOW = 0.1 + COSINE_DIFF_YELLOW = 0.1 + MAX_RELATIVE_OUT_RED = 0.5 + MAX_RELATIVE_OUT_YELLOW = 0.1 + MAX_RELATIVE_IN_YELLOW = 0.01 + +class FileCheckConst: + """ + Class for file check const + """ + READ_ABLE = "read" + WRITE_ABLE = "write" + READ_WRITE_ABLE = "read and write" + DIRECTORY_LENGTH = 4096 + FILE_NAME_LENGTH = 255 + FILE_VALID_PATTERN = r"^[a-zA-Z0-9_.:/-]+$" + FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' + PKL_SUFFIX = ".pkl" + NUMPY_SUFFIX = ".npy" + JSON_SUFFIX = ".json" + PT_SUFFIX = ".pt" + CSV_SUFFIX = ".csv" + YAML_SUFFIX = ".yaml" + MAX_PKL_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 + MAX_NUMPY_SIZE = 10737418240 # 10 * 1024 * 1024 * 1024 + MAX_JSON_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 + MAX_PT_SIZE = 10737418240 # 10 * 1024 * 1024 * 1024 + MAX_CSV_SIZE = 1073741824 # 1 * 1024 * 1024 * 1024 + MAX_YAML_SIZE = 1048576 # 10 * 1024 * 1024 + DIR = "dir" + FILE = "file" + DATA_DIR_AUTHORITY = 0o750 + DATA_FILE_AUTHORITY = 0o640 + FILE_SIZE_DICT = { + PKL_SUFFIX: MAX_PKL_SIZE, + NUMPY_SUFFIX: MAX_NUMPY_SIZE, + JSON_SUFFIX: MAX_JSON_SIZE, + PT_SUFFIX: MAX_PT_SIZE, + CSV_SUFFIX: MAX_CSV_SIZE, + YAML_SUFFIX: MAX_YAML_SIZE + } + +class OverflowConst: + """ + Class for Overflow + """ + OVERFLOW_DEBUG_MODE_ENABLE = "OVERFLOW_DEBUG_MODE_ENABLE" + OVERFLOW_ORIGINAL_MODE = 0 + OVERFLOW_DEBUG_MODE = 1 diff --git a/debug/accuracy_tools/atat/pytorch/common/exceptions.py b/debug/accuracy_tools/msprobe/core/common/exceptions.py similarity index 42% rename from 
debug/accuracy_tools/atat/pytorch/common/exceptions.py rename to debug/accuracy_tools/msprobe/core/common/exceptions.py index 17733b5bfd5f4b8ffcb3cb3602e3f5f54fdef97d..ea61f8cd58fe057ba6836dd1ed368d52adedeb18 100644 --- a/debug/accuracy_tools/atat/pytorch/common/exceptions.py +++ b/debug/accuracy_tools/msprobe/core/common/exceptions.py @@ -1,19 +1,20 @@ - class CodedException(Exception): def __init__(self, code, error_info=''): + super().__init__() + self.code = code self.error_info = self.err_strs.get(code) + error_info def __str__(self): return self.error_info -class MsaccException(CodedException): +class MsprobeException(CodedException): INVALID_PARAM_ERROR = 0 OVERFLOW_NUMS_ERROR = 1 - + err_strs = { - INVALID_PARAM_ERROR: "[msacc] 无效参数: ", - OVERFLOW_NUMS_ERROR: "[msacc] 超过预设溢出次数 当前溢出次数:" + INVALID_PARAM_ERROR: "[msprobe] 无效参数: ", + OVERFLOW_NUMS_ERROR: "[msprobe] 超过预设溢出次数 当前溢出次数:" } @@ -26,12 +27,12 @@ class FileCheckException(CodedException): FILE_TOO_LARGE_ERROR = 5 err_strs = { - SOFT_LINK_ERROR: "[msacc] 检测到软链接: ", - FILE_PERMISSION_ERROR: "[msacc] 文件权限错误: ", - INVALID_FILE_ERROR: "[msacc] 无效文件: ", - ILLEGAL_PATH_ERROR: "[msacc] 非法文件路径: ", - ILLEGAL_PARAM_ERROR: "[msacc] 非法打开方式: ", - FILE_TOO_LARGE_ERROR: "[msacc] 文件过大: " + SOFT_LINK_ERROR: "[msprobe] 检测到软链接: ", + FILE_PERMISSION_ERROR: "[msprobe] 文件权限错误: ", + INVALID_FILE_ERROR: "[msprobe] 无效文件: ", + ILLEGAL_PATH_ERROR: "[msprobe] 非法文件路径: ", + ILLEGAL_PARAM_ERROR: "[msprobe] 非法打开方式: ", + FILE_TOO_LARGE_ERROR: "[msprobe] 文件过大: " } @@ -39,8 +40,8 @@ class ParseJsonException(CodedException): UnexpectedNameStruct = 0 InvalidDumpJson = 1 err_strs = { - UnexpectedNameStruct: "[msacc] Unexpected name in json: ", - InvalidDumpJson: "[msacc] json格式不正确: ", + UnexpectedNameStruct: "[msprobe] Unexpected name in json: ", + InvalidDumpJson: "[msprobe] json格式不正确: ", } @@ -49,27 +50,39 @@ class ScopeException(CodedException): InvalidScope = 1 ArgConflict = 2 err_strs = { - InvalidApiStr: "[msacc] Invalid api_list: ", - InvalidScope: "[msacc] Invalid scope: ", - ArgConflict: "[msacc] Scope and api_list conflict: ", + InvalidApiStr: "[msprobe] Invalid api_list: ", + InvalidScope: "[msprobe] Invalid scope: ", + ArgConflict: "[msprobe] Scope and api_list conflict: ", } class RepairException(CodedException): InvalidRepairType = 0 err_strs = { - InvalidRepairType: "[msacc] Invalid repair_type: " + InvalidRepairType: "[msprobe] Invalid repair_type: " } class StepException(CodedException): InvalidPostProcess = 0 err_strs = { - InvalidPostProcess: "[msacc] 错误的step后处理配置: ", + InvalidPostProcess: "[msprobe] 错误的step后处理配置: ", } + class FreeBenchmarkException(CodedException): UnsupportedType = 0 + InvalidGrad = 1 err_strs = { - UnsupportedType: "[msacc] Free benchmark get unsupported type: " - } \ No newline at end of file + UnsupportedType: "[msprobe] Free benchmark get unsupported type: ", + InvalidGrad: "[msprobe] Free benchmark gradient invalid: ", + } + + +class DistributedNotInitializedError(Exception): + def __init__(self, msg): + super().__init__() + self.msg = msg + + def __str__(self): + return self.msg diff --git a/debug/accuracy_tools/atat/pytorch/common/file_check.py b/debug/accuracy_tools/msprobe/core/common/file_check.py similarity index 79% rename from debug/accuracy_tools/atat/pytorch/common/file_check.py rename to debug/accuracy_tools/msprobe/core/common/file_check.py index 3204652583b9bce5ac874b5a178fb83926856660..36896cfbc19b29f1fcaef04228aac37dc29c8416 100644 --- a/debug/accuracy_tools/atat/pytorch/common/file_check.py +++ 
b/debug/accuracy_tools/msprobe/core/common/file_check.py @@ -17,45 +17,9 @@ import os import re -from .log import print_error_log, print_warn_log -from .exceptions import FileCheckException -from .utils import Const - - -class FileCheckConst: - """ - Class for file check const - """ - READ_ABLE = "read" - WRITE_ABLE = "write" - READ_WRITE_ABLE = "read and write" - DIRECTORY_LENGTH = 4096 - FILE_NAME_LENGTH = 255 - FILE_VALID_PATTERN = r"^[a-zA-Z0-9_.:/-]+$" - PKL_SUFFIX = ".pkl" - NUMPY_SUFFIX = ".npy" - JSON_SUFFIX = ".json" - PT_SUFFIX = ".pt" - CSV_SUFFIX = ".csv" - YAML_SUFFIX = ".yaml" - MAX_PKL_SIZE = 1 * 1024 * 1024 * 1024 - MAX_NUMPY_SIZE = 10 * 1024 * 1024 * 1024 - MAX_JSON_SIZE = 1 * 1024 * 1024 * 1024 - MAX_PT_SIZE = 10 * 1024 * 1024 * 1024 - MAX_CSV_SIZE = 1 * 1024 * 1024 * 1024 - MAX_YAML_SIZE = 10 * 1024 * 1024 - DIR = "dir" - FILE = "file" - DATA_DIR_AUTHORITY = 0o750 - DATA_FILE_AUTHORITY = 0o640 - FILE_SIZE_DICT = { - PKL_SUFFIX: MAX_PKL_SIZE, - NUMPY_SUFFIX: MAX_NUMPY_SIZE, - JSON_SUFFIX: MAX_JSON_SIZE, - PT_SUFFIX: MAX_PT_SIZE, - CSV_SUFFIX: MAX_CSV_SIZE, - YAML_SUFFIX: MAX_YAML_SIZE - } +from msprobe.core.common.log import logger +from msprobe.core.common.exceptions import FileCheckException +from msprobe.core.common.const import FileCheckConst class FileChecker: @@ -78,7 +42,7 @@ class FileChecker: @staticmethod def _check_path_type(path_type): if path_type not in [FileCheckConst.DIR, FileCheckConst.FILE]: - print_error_log(f'The path_type must be {FileCheckConst.DIR} or {FileCheckConst.FILE}.') + logger.error(f'The path_type must be {FileCheckConst.DIR} or {FileCheckConst.FILE}.') raise FileCheckException(FileCheckException.ILLEGAL_PARAM_ERROR) return path_type @@ -144,7 +108,7 @@ class FileOpen: def check_file_path(self): support_mode = self.SUPPORT_READ_MODE + self.SUPPORT_WRITE_MODE + self.SUPPORT_READ_WRITE_MODE if self.mode not in support_mode: - print_error_log("File open not support %s mode" % self.mode) + logger.error("File open not support %s mode" % self.mode) raise FileCheckException(FileCheckException.ILLEGAL_PARAM_ERROR) check_link(self.file_path) self.file_path = os.path.realpath(self.file_path) @@ -171,7 +135,7 @@ class FileOpen: def check_link(path): abs_path = os.path.abspath(path) if os.path.islink(abs_path): - print_error_log('The file path {} is a soft link.'.format(path)) + logger.error('The file path {} is a soft link.'.format(path)) raise FileCheckException(FileCheckException.SOFT_LINK_ERROR) @@ -179,58 +143,58 @@ def check_path_length(path, name_length=None): file_max_name_length = name_length if name_length else FileCheckConst.FILE_NAME_LENGTH if len(path) > FileCheckConst.DIRECTORY_LENGTH or \ len(os.path.basename(path)) > file_max_name_length: - print_error_log('The file path length exceeds limit.') + logger.error('The file path length exceeds limit.') raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) def check_path_exists(path): if not os.path.exists(path): - print_error_log('The file path %s does not exist.' % path) + logger.error('The file path %s does not exist.' % path) raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) def check_path_readability(path): if not os.access(path, os.R_OK): - print_error_log('The file path %s is not readable.' % path) + logger.error('The file path %s is not readable.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_writability(path): if not os.access(path, os.W_OK): - print_error_log('The file path %s is not writable.' 
% path) + logger.error('The file path %s is not writable.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_executable(path): if not os.access(path, os.X_OK): - print_error_log('The file path %s is not executable.' % path) + logger.error('The file path %s is not executable.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_other_user_writable(path): st = os.stat(path) if st.st_mode & 0o002: - print_error_log('The file path %s may be insecure because other users have write permissions. ' % path) + logger.error('The file path %s may be insecure because other users have write permissions. ' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_owner_consistent(path): file_owner = os.stat(path).st_uid if file_owner != os.getuid(): - print_error_log('The file path %s may be insecure because is does not belong to you.' % path) + logger.error('The file path %s may be insecure because is does not belong to you.' % path) raise FileCheckException(FileCheckException.FILE_PERMISSION_ERROR) def check_path_pattern_vaild(path): if not re.match(FileCheckConst.FILE_VALID_PATTERN, path): - print_error_log('The file path {} contains special characters.'.format(path)) + logger.error('The file path %s contains special characters.' %(path)) raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR) def check_file_size(file_path, max_size): file_size = os.path.getsize(file_path) if file_size >= max_size: - print_error_log(f'The size of file path {file_path} exceeds {max_size} bytes.') + logger.error(f'The size of file path {file_path} exceeds {max_size} bytes.') raise FileCheckException(FileCheckException.FILE_TOO_LARGE_ERROR) @@ -245,18 +209,18 @@ def check_common_file_size(file_path): def check_file_suffix(file_path, file_suffix): if file_suffix: if not file_path.endswith(file_suffix): - print_error_log(f"The {file_path} should be a {file_suffix} file!") + logger.error(f"The {file_path} should be a {file_suffix} file!") raise FileCheckException(FileCheckException.INVALID_FILE_ERROR) def check_path_type(file_path, file_type): if file_type == FileCheckConst.FILE: if not os.path.isfile(file_path): - print_error_log(f"The {file_path} should be a file!") + logger.error(f"The {file_path} should be a file!") raise FileCheckException(FileCheckException.INVALID_FILE_ERROR) if file_type == FileCheckConst.DIR: if not os.path.isdir(file_path): - print_error_log(f"The {file_path} should be a dictionary!") + logger.error(f"The {file_path} should be a dictionary!") raise FileCheckException(FileCheckException.INVALID_FILE_ERROR) @@ -281,7 +245,7 @@ def check_path_before_create(path): if path_len_exceeds_limit(path): raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR, 'The file path length exceeds limit.') - if not re.match(Const.FILE_PATTERN, os.path.realpath(path)): + if not re.match(FileCheckConst.FILE_PATTERN, os.path.realpath(path)): raise FileCheckException(FileCheckException.ILLEGAL_PATH_ERROR, 'The file path {} contains special characters.'.format(path)) @@ -298,4 +262,4 @@ def change_mode(path, mode): def path_len_exceeds_limit(file_path): return len(os.path.realpath(file_path)) > FileCheckConst.DIRECTORY_LENGTH or \ - len(os.path.basename(file_path)) > FileCheckConst.FILE_NAME_LENGTH + len(os.path.basename(file_path)) > FileCheckConst.FILE_NAME_LENGTH \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/core/common/log.py b/debug/accuracy_tools/msprobe/core/common/log.py 
new file mode 100644 index 0000000000000000000000000000000000000000..f31dad64d1ed5f16ef5946f668a00d6b02a04a5a --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/common/log.py @@ -0,0 +1,55 @@ +import os +import time +import sys + +class BaseLogger: + def __init__(self): + self.warning_level = "WARNING" + self.error_level = "ERROR" + self.info_level = "INFO" + self.rank = None + + @staticmethod + def _print_log(level, msg, end='\n'): + current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + pid = os.getpid() + full_msg = f"{current_time} ({pid}) [{level}] {msg}" + print(full_msg, end=end) + sys.stdout.flush() + + def get_rank(self): + return self.rank + + def info(self, msg): + self._print_log(self.info_level, msg) + + def error(self, msg): + self._print_log(self.error_level, msg) + + def warning(self, msg): + self._print_log(self.warning_level, msg) + + def on_rank_0(self, func): + def func_rank_0(*args, **kwargs): + current_rank = self.get_rank() + if current_rank is None or current_rank == 0: + return func(*args, **kwargs) + else: + return None + return func_rank_0 + + def info_on_rank_0(self, msg): + return self.on_rank_0(self.info)(msg) + + def error_on_rank_0(self, msg): + return self.on_rank_0(self.error)(msg) + + def warning_on_rank_0(self, msg): + return self.on_rank_0(self.warning)(msg) + + def error_log_with_exp(self, msg, exception): + self.error(msg) + raise exception + + +logger = BaseLogger() \ No newline at end of file diff --git a/debug/accuracy_tools/atat/core/utils.py b/debug/accuracy_tools/msprobe/core/common/utils.py similarity index 65% rename from debug/accuracy_tools/atat/core/utils.py rename to debug/accuracy_tools/msprobe/core/common/utils.py index e3a30579a80bd8ce1f2da9ef7b9c28d796a6acd9..32aba8d8af48cb23b75af293b3ea6367841bc255 100644 --- a/debug/accuracy_tools/atat/core/utils.py +++ b/debug/accuracy_tools/msprobe/core/common/utils.py @@ -20,195 +20,21 @@ import re import shutil import stat import subprocess -import sys import time import json -from json.decoder import JSONDecodeError from datetime import datetime, timezone from pathlib import Path import numpy as np -from .file_check_util import FileOpen, FileChecker, FileCheckConst +from msprobe.core.common.file_check import FileOpen, FileChecker +from msprobe.core.common.const import Const, FileCheckConst, CompareConst, OverflowConst +from msprobe.core.common.log import logger device = collections.namedtuple('device', ['type', 'index']) prefixes = ['api_stack', 'list', 'range', 'acl'] -class Const: - """ - Class for const - """ - MODEL_TYPE = ['.onnx', '.pb', '.om'] - DIM_PATTERN = r"^(-?[0-9]+)(,-?[0-9]+)*" - REGEX_PREFIX_MAX_LENGTH = 20 - REGEX_PREFIX_PATTERN = r"^[a-zA-Z0-9_-]+$" - SEMICOLON = ";" - COLON = ":" - EQUAL = "=" - COMMA = "," - DOT = "." 
- DUMP_RATIO_MAX = 100 - SUMMERY_DATA_NUMS = 256 - FLOAT_EPSILON = np.finfo(float).eps - SUPPORT_DUMP_MODE = ['api', 'acl'] - ON = 'ON' - OFF = 'OFF' - BACKWARD = 'backward' - FORWARD = 'forward' - PRE_FORWARD = "pre_forward" - - # dump mode - ALL = "all" - LIST = "list" - RANGE = "range" - STACK = "stack" - ACL = "acl" - API_LIST = "api_list" - API_STACK = "api_stack" - DUMP_MODE = [ALL, LIST, RANGE, STACK, ACL, API_LIST, API_STACK] - AUTO = "auto" - ONLINE_DUMP_MODE = [ALL, LIST, AUTO, OFF] - SUMMARY = "summary" - MD5 = "md5" - SUMMARY_MODE = [ALL, SUMMARY, MD5] - - WRITE_FLAGS = os.O_WRONLY | os.O_CREAT - WRITE_MODES = stat.S_IWUSR | stat.S_IRUSR - - PKL_SUFFIX = ".pkl" - NUMPY_SUFFIX = ".npy" - ONE_GB = 1 * 1024 * 1024 * 1024 - TEN_GB = 10 * 1024 * 1024 * 1024 - FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' - FILE_NAME_LENGTH = 255 - DIRECTORY_LENGTH = 4096 - DISTRIBUTED_PREFIX_LENGTH = 60 - SUMMARY_COLUMN_NUM = 6 - STACK_COLUMN_NUM = 2 - # env dump path - ASCEND_WORK_PATH = "ASCEND_WORK_PATH" - DUMP_DIR = "dump_data" - - ENV_ENABLE = "1" - ENV_DISABLE = "0" - - MAX_SEED_VALUE = 2**32 - 1 - - INPLACE_LIST = ["broadcast", "all_reduce", "reduce", "all_gather", "gather", "scatter", "reduce_scatter", - "_reduce_scatter_base", "_all_gather_base"] - - TASK_LIST = ["tensor", "statistics", "overflow_check", "free_benchmark"] - LEVEL_LIST = ["L0", "L1", "L2", "mix"] - STATISTICS = "statistics" - TENSOR = "tensor" - OVERFLOW_CHECK = "overflow_check" - FREE_BENCHMARK = "free_benchmark" - -class CompareConst: - """ - Class for compare module const - """ - # compare result column name - NPU_NAME = "NPU Name" - BENCH_NAME = "Bench Name" - NPU_DTYPE = "NPU Dtype" - BENCH_DTYPE = "Bench Dtype" - NPU_SHAPE = "NPU Tensor Shape" - BENCH_SHAPE = "Bench Tensor Shape" - NPU_MAX = "NPU max" - NPU_MIN = "NPU min" - NPU_MEAN = "NPU mean" - NPU_NORM = "NPU l2norm" - BENCH_MAX = "Bench max" - BENCH_MIN = "Bench min" - BENCH_MEAN = "Bench mean" - BENCH_NORM = "Bench l2norm" - MAX_DIFF = "Max diff" - MIN_DIFF = "Min diff" - MEAN_DIFF = "Mean diff" - NORM_DIFF = "L2norm diff" - COSINE = "Cosine" - MAX_ABS_ERR = "MaxAbsErr" - MAX_RELATIVE_ERR = "MaxRelativeErr" - MIN_RELATIVE_ERR = "MinRelativeErr" - MEAN_RELATIVE_ERR = "MeanRelativeErr" - NORM_RELATIVE_ERR = "NormRelativeErr" - ACCURACY = "Accuracy Reached or Not" - STACK = "NPU_Stack_Info" - DATA_NAME = "Data_name" - ERROR_MESSAGE = "Err_message" - ONE_THOUSANDTH_ERR_RATIO = "One Thousandth Err Ratio" - FIVE_THOUSANDTHS_ERR_RATIO = "Five Thousandths Err Ratio" - NPU_MD5 = "NPU MD5" - BENCH_MD5 = "BENCH MD5" - RESULT = "Result" - - COMPARE_RESULT_HEADER = [ - NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, COSINE, MAX_ABS_ERR, MAX_RELATIVE_ERR, - ONE_THOUSANDTH_ERR_RATIO, FIVE_THOUSANDTHS_ERR_RATIO, - NPU_MAX, NPU_MIN, NPU_MEAN, NPU_NORM, BENCH_MAX, BENCH_MIN, BENCH_MEAN, BENCH_NORM, ACCURACY, ERROR_MESSAGE - ] - - SUMMARY_COMPARE_RESULT_HEADER = [ - NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, MAX_DIFF, MIN_DIFF, MEAN_DIFF, NORM_DIFF, - MAX_RELATIVE_ERR, MIN_RELATIVE_ERR, MEAN_RELATIVE_ERR, NORM_RELATIVE_ERR, - NPU_MAX, NPU_MIN, NPU_MEAN, NPU_NORM, BENCH_MAX, BENCH_MIN, BENCH_MEAN, BENCH_NORM, RESULT, ERROR_MESSAGE - ] - - MD5_COMPARE_RESULT_HEADER = [ - NPU_NAME, BENCH_NAME, NPU_DTYPE, BENCH_DTYPE, NPU_SHAPE, BENCH_SHAPE, NPU_MD5, BENCH_MD5, RESULT - ] - - # compare standard - THOUSAND_RATIO_THRESHOLD = 0.001 - FIVE_THOUSAND_RATIO_THRESHOLD = 0.005 - COSINE_THRESHOLD = 0.9999 - - # compare result data - READ_NONE = 'No data' 
- NAN = 'Nan' - NONE = 'None' - SHAPE_UNMATCH = 'shape unmatched' - DTYPE_UNMATCH = 'dtype unmatched' - PASS = 'Pass' - WARNING = 'Warning' - DIFF = 'Different' - UNSUPPORTED = 'unsupported' - - # accuracy standards - COS_THRESHOLD = 0.99 - MAX_ABS_ERR_THRESHOLD = 0.001 - COS_MAX_THRESHOLD = 0.9 - MAX_ABS_ERR_MAX_THRESHOLD = 1 - ACCURACY_CHECK_YES = "Yes" - ACCURACY_CHECK_NO = "No" - ACCURACY_CHECK_UNMATCH = "Unmatched" - - # error message - NO_BENCH = "No bench data matched." - - # compare const - FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble] - - # highlight xlsx color const - RED = "FFFF0000" - YELLOW = "FFFF00" - BLUE = "0000FF" - - # highlight rules const - OVERFLOW_LIST = ['nan\t', 'inf\t', '-inf\t', 'nan', 'inf', '-inf'] - MAX_DIFF_RED = 1e+10 - ORDER_MAGNITUDE_DIFF_YELLOW = 1 - ONE_THOUSAND_ERROR_IN_RED = 0.9 - ONE_THOUSAND_ERROR_OUT_RED = 0.6 - ONE_THOUSAND_ERROR_DIFF_YELLOW = 0.1 - COSINE_DIFF_YELLOW = 0.1 - MAX_RELATIVE_OUT_RED = 0.5 - MAX_RELATIVE_OUT_YELLOW = 0.1 - MAX_RELATIVE_IN_YELLOW = 0.01 - - class CompareException(Exception): """ Class for Accuracy Compare Exception @@ -235,7 +61,6 @@ class CompareException(Exception): INVALID_SUMMARY_MODE = 19 INVALID_TASK_ERROR = 20 - def __init__(self, code, error_info: str = ""): super(CompareException, self).__init__() self.code = code @@ -249,63 +74,17 @@ class DumpException(CompareException): pass -class OverflowConst: - """ - Class for Overflow - """ - OVERFLOW_DEBUG_MODE_ENABLE = "OVERFLOW_DEBUG_MODE_ENABLE" - OVERFLOW_ORIGINAL_MODE = 0 - OVERFLOW_DEBUG_MODE = 1 - - def make_dump_path_if_not_exists(dump_path): if not os.path.exists(dump_path): try: Path(dump_path).mkdir(mode=0o750, exist_ok=True, parents=True) except OSError as ex: - print_error_log( + logger.error( 'Failed to create {}.Please check the path permission or disk space .{}'.format(dump_path, str(ex))) raise CompareException(CompareException.INVALID_PATH_ERROR) from ex else: if not os.path.isdir(dump_path): - print_error_log('{} already exists and is not a directory.'.format(dump_path)) - - -def _print_log(level, msg, end='\n'): - current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(time.time()))) - pid = os.getgid() - print(current_time + "(" + str(pid) + ")-[" + level + "]" + msg, end=end) - sys.stdout.flush() - - -def print_info_log(info_msg, end='\n'): - """ - Function Description: - print info log. - Parameter: - info_msg: the info message. - """ - _print_log("INFO", info_msg, end=end) - - -def print_error_log(error_msg): - """ - Function Description: - print error log. - Parameter: - error_msg: the error message. - """ - _print_log("ERROR", error_msg) - - -def print_warn_log(warn_msg): - """ - Function Description: - print warn log. - Parameter: - warn_msg: the warning message. 
- """ - _print_log("WARNING", warn_msg) + logger.error('{} already exists and is not a directory.'.format(dump_path)) def check_mode_valid(mode, scope=None, api_list=None): @@ -337,13 +116,13 @@ def check_mode_valid(mode, scope=None, api_list=None): def check_switch_valid(switch): if switch not in ["ON", "OFF"]: - print_error_log("Please set switch with 'ON' or 'OFF'.") + logger.error("Please set switch with 'ON' or 'OFF'.") raise CompareException(CompareException.INVALID_PARAM_ERROR) def check_dump_mode_valid(dump_mode): if not isinstance(dump_mode, list): - print_warn_log("Please set dump_mode as a list.") + logger.warning("Please set dump_mode as a list.") dump_mode = [dump_mode] if not all(mode in ["all", "forward", "backward", "input", "output"] for mode in dump_mode): raise ValueError("Please set dump_mode as a list containing one or more of the following: 'all', 'forward', 'backward', 'input', 'output'.") @@ -364,14 +143,14 @@ def check_summary_mode_valid(summary_mode): def check_summary_only_valid(summary_only): if not isinstance(summary_only, bool): - print_error_log("Params summary_only only support True or False.") + logger.error("Params summary_only only support True or False.") raise CompareException(CompareException.INVALID_PARAM_ERROR) return summary_only def check_compare_param(input_parma, output_path, stack_mode=False, summary_compare=False, md5_compare=False): if not (isinstance(input_parma, dict) and isinstance(output_path, str)): - print_error_log("Invalid input parameters") + logger.error("Invalid input parameters") raise CompareException(CompareException.INVALID_PARAM_ERROR) check_file_or_directory_path(input_parma.get("npu_json_path"), False) check_file_or_directory_path(input_parma.get("bench_json_path"), False) @@ -388,7 +167,7 @@ def check_compare_param(input_parma, output_path, stack_mode=False, summary_comp def check_configuration_param(stack_mode=False, auto_analyze=True, fuzzy_match=False): if not (isinstance(stack_mode, bool) and isinstance(auto_analyze, bool) and isinstance(fuzzy_match, bool)): - print_error_log("Invalid input parameters which should be only bool type.") + logger.error("Invalid input parameters which should be only bool type.") raise CompareException(CompareException.INVALID_PARAM_ERROR) @@ -416,7 +195,7 @@ def is_starts_with(string, prefix_list): def _check_json(json_file_handle, file_name): tensor_line = json_file_handle.readline() if not tensor_line: - print_error_log("dump file {} have empty line!".format(file_name)) + logger.error("dump file {} have empty line!".format(file_name)) raise CompareException(CompareException.INVALID_DUMP_FILE) json_file_handle.seek(0, 0) @@ -431,10 +210,10 @@ def check_file_size(input_file, max_size): try: file_size = os.path.getsize(input_file) except OSError as os_error: - print_error_log('Failed to open "%s". %s' % (input_file, str(os_error))) + logger.error('Failed to open "%s". %s' % (input_file, str(os_error))) raise CompareException(CompareException.INVALID_FILE_ERROR) from os_error if file_size > max_size: - print_error_log('The size (%d) of %s exceeds (%d) bytes, tools not support.' + logger.error('The size (%d) of %s exceeds (%d) bytes, tools not support.' % (file_size, input_file, max_size)) raise CompareException(CompareException.INVALID_FILE_ERROR) @@ -474,7 +253,7 @@ def remove_path(path): else: shutil.rmtree(path) except PermissionError as err: - print_error_log("Failed to delete {}. Please check the permission.".format(path)) + logger.error("Failed to delete {}. 
Please check the permission.".format(path)) raise CompareException(CompareException.INVALID_PATH_ERROR) from err @@ -491,7 +270,7 @@ def get_dump_data_path(dump_dir): file_is_exist = False check_file_or_directory_path(dump_dir, True) - for dir_path, sub_paths, files in os.walk(dump_dir): + for dir_path, _, files in os.walk(dump_dir): if len(files) != 0: dump_data_path = dir_path file_is_exist = True @@ -500,14 +279,6 @@ def get_dump_data_path(dump_dir): return dump_data_path, file_is_exist -def modify_dump_path(dump_path, mode): - if mode == Const.ALL: - return dump_path - file_name = os.path.split(dump_path) - mode_file_name = mode + "_" + file_name[-1] - return os.path.join(file_name[0], mode_file_name) - - def create_directory(dir_path): """ Function Description: @@ -521,7 +292,7 @@ def create_directory(dir_path): try: os.makedirs(dir_path, mode=0o700) except OSError as ex: - print_error_log( + logger.error( 'Failed to create {}.Please check the path permission or disk space .{}'.format(dir_path, str(ex))) raise CompareException(CompareException.INVALID_PATH_ERROR) from ex @@ -535,7 +306,7 @@ def execute_command(cmd): Exception Description: when invalid command throw exception """ - print_info_log('Execute command:%s' % cmd) + logger.info('Execute command:%s' % cmd) process = subprocess.Popen(cmd, shell=False, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) while process.poll() is None: line = process.stdout.readline() @@ -543,7 +314,7 @@ def execute_command(cmd): if line: print(line) if process.returncode != 0: - print_error_log('Failed to execute command:%s' % " ".join(cmd)) + logger.error('Failed to execute command:%s' % " ".join(cmd)) raise CompareException(CompareException.INVALID_DATA_ERROR) @@ -567,7 +338,7 @@ def parse_value_by_comma(value): if value_str.isdigit() or value_str == '-1': value_list.append(int(value_str)) else: - print_error_log("please check your input shape.") + logger.error("please check your input shape.") raise CompareException(CompareException.INVALID_PARAM_ERROR) return value_list @@ -576,7 +347,7 @@ def get_data_len_by_shape(shape): data_len = 1 for item in shape: if item == -1: - print_error_log("please check your input shape, one dim in shape is -1.") + logger.error("please check your input shape, one dim in shape is -1.") return -1 data_len = data_len * item return data_len @@ -601,25 +372,25 @@ def format_value(value): def check_seed_all(seed, mode): if isinstance(seed, int): if seed < 0 or seed > Const.MAX_SEED_VALUE: - print_error_log(f"Seed must be between 0 and {Const.MAX_SEED_VALUE}.") + logger.error(f"Seed must be between 0 and {Const.MAX_SEED_VALUE}.") raise CompareException(CompareException.INVALID_PARAM_ERROR) else: - print_error_log(f"Seed must be integer.") + logger.error(f"Seed must be integer.") raise CompareException(CompareException.INVALID_PARAM_ERROR) if not isinstance(mode, bool): - print_error_log(f"seed_all mode must be bool.") + logger.error(f"seed_all mode must be bool.") raise CompareException(CompareException.INVALID_PARAM_ERROR) def get_process_rank(model): - print_info_log("Rank id is not provided. Trying to get the rank id of the model.") + logger.info("Rank id is not provided. Trying to get the rank id of the model.") try: local_device = next(model.parameters()).device except StopIteration: - print_warn_log('There is no parameter in the model. Fail to get rank id.') + logger.warning('There is no parameter in the model. 
Fail to get rank id.') return 0, False if local_device.type == 'cpu': - print_warn_log("Warning: the debugger is unable to get the rank id. " + logger.warning("Warning: the debugger is unable to get the rank id. " "This may cause the dumpped data to be corrupted in the " "case of distributed training. (You may ignore this if you are using only one card.) " "Transfer the model to npu or gpu before register_hook() to avoid this warning.") @@ -640,50 +411,50 @@ def generate_compare_script(dump_path, pkl_file_path, dump_switch_mode): code_temp = ftemp.read() fout.write(code_temp % (pkl_file_path, dump_path, is_api_stack)) except OSError: - print_error_log(f"Failed to open file. Please check file {template_path} or path {pkl_dir}.") + logger.error(f"Failed to open file. Please check file {template_path} or path {pkl_dir}.") - print_info_log(f"Generate compare script successfully which is {compare_script_path}.") + logger.info(f"Generate compare script successfully which is {compare_script_path}.") def check_file_valid(file_path): if os.path.islink(file_path): - print_error_log('The file path {} is a soft link.'.format(file_path)) + logger.error('The file path {} is a soft link.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) if len(os.path.realpath(file_path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(file_path)) > \ Const.FILE_NAME_LENGTH: - print_error_log('The file path length exceeds limit.') + logger.error('The file path length exceeds limit.') raise CompareException(CompareException.INVALID_PATH_ERROR) if not re.match(Const.FILE_PATTERN, os.path.realpath(file_path)): - print_error_log('The file path {} contains special characters.'.format(file_path)) + logger.error('The file path {} contains special characters.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) if os.path.isfile(file_path): file_size = os.path.getsize(file_path) if file_path.endswith(Const.PKL_SUFFIX) and file_size > Const.ONE_GB: - print_error_log('The file {} size is greater than 1GB.'.format(file_path)) + logger.error('The file {} size is greater than 1GB.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) if file_path.endswith(Const.NUMPY_SUFFIX) and file_size > Const.TEN_GB: - print_error_log('The file {} size is greater than 10GB.'.format(file_path)) + logger.error('The file {} size is greater than 10GB.'.format(file_path)) raise CompareException(CompareException.INVALID_PATH_ERROR) def check_path_before_create(path): if len(os.path.realpath(path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(path)) > \ Const.FILE_NAME_LENGTH: - print_error_log('The file path length exceeds limit.') + logger.error('The file path length exceeds limit.') raise CompareException(CompareException.INVALID_PATH_ERROR) if not re.match(Const.FILE_PATTERN, os.path.realpath(path)): - print_error_log('The file path {} contains special characters.'.format(path)) + logger.error('The file path {} contains special characters.'.format(path)) raise CompareException(CompareException.INVALID_PATH_ERROR) def check_inplace_op(prefix): if len(prefix) > Const.DISTRIBUTED_PREFIX_LENGTH: return False - match_op = re.findall(r"Distributed_(.+?)_\d", prefix) + match_op = re.findall(r"Distributed\.(.+?)\.\d", prefix) op_name = match_op[0] if match_op else None return op_name in Const.INPLACE_LIST @@ -704,14 +475,14 @@ def task_dumppath_get(input_param): npu_json_path = input_param.get("npu_json_path", None) bench_json_path = input_param.get("bench_json_path", 
None) if not npu_json_path or not bench_json_path: - print_error_log(f"Please check the json path is valid.") + logger.error(f"Please check the json path is valid.") raise CompareException(CompareException.INVALID_PATH_ERROR) with FileOpen(npu_json_path, 'r') as npu_f: npu_json_data = json.load(npu_f) with FileOpen(bench_json_path, 'r') as bench_f: bench_json_data = json.load(bench_f) if npu_json_data['task'] != bench_json_data['task']: - print_error_log(f"Please check the dump task is consistent.") + logger.error(f"Please check the dump task is consistent.") raise CompareException(CompareException.INVALID_TASK_ERROR) if npu_json_data['task'] == Const.TENSOR: summary_compare = False @@ -723,7 +494,7 @@ def task_dumppath_get(input_param): else: summary_compare = True else: - print_error_log(f"Compare is not required for overflow_check or free_benchmark.") + logger.error(f"Compare is not required for overflow_check or free_benchmark.") raise CompareException(CompareException.INVALID_TASK_ERROR) input_param['npu_dump_data_dir'] = npu_json_data['dump_data_dir'] input_param['bench_dump_data_dir'] = bench_json_data['dump_data_dir'] @@ -736,6 +507,10 @@ def get_header_index(header_name, summary_compare=False): else: header = CompareConst.COMPARE_RESULT_HEADER[:] if header_name not in header: - print_error_log(f"{header_name} not in data name") + logger.error(f"{header_name} not in data name") raise CompareException(CompareException.INVALID_PARAM_ERROR) return header.index(header_name) + + +def convert_tuple(data): + return data if isinstance(data, tuple) else (data, ) diff --git a/debug/accuracy_tools/atat/core/common_config.py b/debug/accuracy_tools/msprobe/core/common_config.py similarity index 52% rename from debug/accuracy_tools/atat/core/common_config.py rename to debug/accuracy_tools/msprobe/core/common_config.py index ee045d3c520f9418191daaedec2830e8f9248435..ed38eba008bf3020e7459ff80fa65e7e2eddb5cb 100644 --- a/debug/accuracy_tools/atat/core/common_config.py +++ b/debug/accuracy_tools/msprobe/core/common_config.py @@ -1,7 +1,8 @@ -from .utils import Const +from msprobe.core.common.const import Const +from msprobe.core.common.log import logger +from msprobe.core.common.exceptions import MsprobeException -# 公共配置类 class CommonConfig: def __init__(self, json_config): self.task = json_config.get('task') @@ -17,22 +18,25 @@ class CommonConfig: def _check_config(self): if self.task and self.task not in Const.TASK_LIST: - raise Exception("task is invalid") + logger.error_log_with_exp( + "task is invalid, it should be one of {}".format(Const.TASK_LIST), MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if self.rank is not None and not isinstance(self.rank, list): - raise Exception("rank is invalid") + logger.error_log_with_exp("rank is invalid, it should be a list", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if self.step is not None and not isinstance(self.step, list): - raise Exception("step is invalid") + logger.error_log_with_exp("step is invalid, it should be a list", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if self.level and self.level not in Const.LEVEL_LIST: - raise Exception("level is invalid") + logger.error_log_with_exp( + "level is invalid, it should be one of {}".format(Const.LEVEL_LIST), MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if self.seed is not None and not isinstance(self.seed, int): - raise Exception("seed is invalid") + logger.error_log_with_exp("seed is invalid, it should be an integer", 
MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if not isinstance(self.is_deterministic, bool): - raise Exception("is_deterministic is invalid") + logger.error_log_with_exp( + "is_deterministic is invalid, it should be a boolean", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if not isinstance(self.enable_dataloader, bool): - raise Exception("enable_dataloader is invalid") + logger.error_log_with_exp( + "enable_dataloader is invalid, it should be a boolean", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) -# 基础配置类 class BaseConfig: def __init__(self, json_config): self.scope = json_config.get('scope') @@ -46,9 +50,9 @@ class BaseConfig: def check_config(self): if self.scope is not None and not isinstance(self.scope, list): - raise Exception("scope is invalid") + logger.error_log_with_exp("scope is invalid, it should be a list", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if self.list is not None and not isinstance(self.list, list): - raise Exception("list is invalid") + logger.error_log_with_exp("list is invalid, it should be a list", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) if self.data_mode is not None and not isinstance(self.data_mode, list): - raise Exception("data_mode is invalid") - \ No newline at end of file + logger.error_log_with_exp("data_mode is invalid, it should be a list", MsprobeException(MsprobeException.INVALID_PARAM_ERROR)) + diff --git a/debug/accuracy_tools/atat/pytorch/functional/data_collector.py b/debug/accuracy_tools/msprobe/core/data_dump/data_collector.py similarity index 55% rename from debug/accuracy_tools/atat/pytorch/functional/data_collector.py rename to debug/accuracy_tools/msprobe/core/data_dump/data_collector.py index 7964c955db64682a1726b131118d5b53e9d17c8a..800a2b81c2f565a2d5b19c0014fafacb08cdccf6 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/data_collector.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_collector.py @@ -1,18 +1,11 @@ -import os -import torch -from ..module_processer import ModuleProcesser -from .scope import build_scope, ListScope -from .json_writer import DataWriter -from ..common.log import print_info_log, print_warn_log -from ..common.utils import Const -from .data_processor import build_data_processor, DataProcessor -try: - import torch_npu -except ImportError: - pass +import os -forward_init_status = False +from msprobe.core.data_dump.scope import build_scope, ListScope +from msprobe.core.data_dump.json_writer import DataWriter +from msprobe.core.common.log import logger +from msprobe.core.common.const import Const +from msprobe.core.data_dump.data_processor.factory import DataProcessorFactory def build_data_collector(config): @@ -20,29 +13,21 @@ def build_data_collector(config): class DataCollector: - overflow_task = "overflow_check" - tensor_task = "tensor" - freebenchmark_task = "free_benchmark" multi_output_apis = ["_sort_", "npu_flash_attention"] - tasks_need_tensor_data = [overflow_task, tensor_task, freebenchmark_task] + tasks_need_tensor_data = [Const.OVERFLOW_CHECK, Const.TENSOR, Const.FREE_BENCHMARK] level_without_construct = ["L1", "L2"] def __init__(self, config): self.config = config self.data_writer = DataWriter() - self.data_processor = build_data_processor(config, self.data_writer) + self.data_processor = DataProcessorFactory.create_processor(self.config, self.data_writer) + self.module_processor = DataProcessorFactory.get_module_processor(self.config.framework) if self.config.framework == Const.PT_FRAMEWORK else None self.module_count = {} 
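Note on the __init__ above: it swaps the old build_data_processor branching for a registry lookup on DataProcessorFactory, keyed by (framework, task); the factory itself appears later in this patch. A minimal, self-contained sketch of that lookup follows — DemoProcessor and the literal framework/task strings are illustrative only, not msprobe APIs:

_registry = {}

def register_processor(framework, task, processor_class):
    # same idea as DataProcessorFactory.register_processor: one class per (framework, task) key
    _registry[(framework, task)] = processor_class

def create_processor(framework, task, config, data_writer):
    processor_class = _registry.get((framework, task))
    if not processor_class:
        raise ValueError(f"Processor not found for framework: {framework}, task: {task}")
    return processor_class(config, data_writer)

class DemoProcessor:  # stands in for e.g. PytorchStatisticsDataProcessor
    def __init__(self, config, data_writer):
        self.config, self.data_writer = config, data_writer

register_processor("pytorch", "statistics", DemoProcessor)
processor = create_processor("pytorch", "statistics", config=None, data_writer=None)
assert isinstance(processor, DemoProcessor)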
- if config.task == DataCollector.freebenchmark_task: + if self.config.task == Const.FREE_BENCHMARK: self.scope = build_scope(ListScope, self.config.scope, self.config.list) else: self.scope = build_scope(None, self.config.scope, self.config.list) - def if_return_forward_new_output(self): - return self.data_processor.if_return_forward_new_output() - - def get_forward_new_output(self): - return self.data_processor.get_forward_new_output() - @property def dump_data_dir(self): return self.data_writer.dump_tensor_data_dir @@ -50,6 +35,20 @@ class DataCollector: @property def dump_file_path(self): return self.data_writer.dump_file_path + + @staticmethod + def check_scope_and_pid(scope, name, pid): + return (not scope or scope.check(name)) and pid == os.getpid() + + @staticmethod + def is_inplace(module): + return getattr(module, "op_is_inplace", False) + + def if_return_forward_new_output(self): + return self.data_processor.if_return_forward_new_output() + + def get_forward_new_output(self): + return self.data_processor.get_forward_new_output() def visit_and_clear_overflow_status(self, api_or_module_name): self.data_processor.visit_and_clear_overflow_status(api_or_module_name) @@ -58,7 +57,7 @@ class DataCollector: self.data_writer.write_json() def update_data(self, data_info, msg=''): - if self.config.task == DataProcessor.overflow: + if self.config.task == Const.OVERFLOW_CHECK: if self.data_processor.has_overflow: self.data_writer.update_data(data_info) msg += "Overflow detected." @@ -68,21 +67,13 @@ class DataCollector: self.data_writer.update_data(data_info) return msg - @staticmethod - def check_scope_and_pid(scope, name, pid): - return (not scope or scope.check(name)) and pid == os.getpid() - - @staticmethod - def is_inplace(module): - return getattr(module, "op_is_inplace", False) - def pre_forward_data_collect(self, name, module, pid, module_input_output): - backward_name = name.replace("forward", "backward") + backward_name = name.replace(Const.FORWARD, Const.BACKWARD) if self.check_scope_and_pid(self.scope, backward_name, pid): self.data_processor.analyze_pre_forward(backward_name, module, module_input_output) if not self.is_inplace(module): return - print_info_log(f"API {name} is inplace.") + logger.info(f"API {name} is inplace.") if self.check_scope_and_pid(self.scope, name, pid): data_info = self.data_processor.analyze_pre_forward_inplace(name, module_input_output) self.update_data(data_info) @@ -92,14 +83,12 @@ class DataCollector: if not self.check_scope_and_pid(self.scope, name, pid): return - if self.config.level == "L2": - self.acl_dump(module, module_input_output, name) - return - if not self.is_inplace(module): data_info = self.data_processor.analyze_forward(name, module, module_input_output) else: data_info = self.data_processor.analyze_forward_inplace(name, module_input_output) + if self.config.level == "L2": + return self.data_writer.update_stack(self.data_processor.analyze_api_call_stack(name)) self.handle_data(name, data_info) @@ -113,14 +102,14 @@ class DataCollector: def update_construct(self, name): if self.config.level not in DataCollector.level_without_construct: - self.data_writer.update_construct({name: ModuleProcesser.api_parent_node}) - self.data_writer.update_construct(ModuleProcesser.module_node) + self.data_writer.update_construct({name: self.module_processor.api_parent_node}) + self.data_writer.update_construct(self.module_processor.module_node) def handle_data(self, name, data_info): msg = f"msProbe is collecting data on {name}. 
" if data_info: msg = self.update_data(data_info, msg) - print_info_log(msg) + logger.info(msg) self.data_writer.flush_data_when_buffer_is_full() def module_count_func(self, name, name_template): @@ -146,65 +135,6 @@ class DataCollector: def update_dump_paths(self, *args): self.data_writer.update_dump_paths(*args) self.data_writer.initialize_json_file(task=self.config.task, level=self.config.level) - + def update_iter(self, current_iter): self.data_processor.update_iter(current_iter) - - def acl_dump(self, module, module_input_output, module_name): - if self.config.is_forward_acl_dump: - self.forward_acl_dump(module, module_input_output, module_name) - else: - self.dump_mode_backward_acl_dump(module, module_input_output, module_name) - - def op_need_trigger(self, module_name): - if 'Tensor___getitem___' in module_name: - return True - return False - - def forward_acl_dump(self, module, module_input_output, module_name): - global forward_init_status - if not forward_init_status: - forward_init_status = True - torch_npu.npu.synchronize() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(self.config.acl_config) - torch_npu.npu.synchronize() - if self.op_need_trigger(module_name): - module.forward(*module_input_output.args, **module_input_output.kwargs).cpu() - else: - module.forward(*module_input_output.args, **module_input_output.kwargs) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - torch_npu.npu.synchronize() - forward_init_status = False - print_info_log("Dump %s op file." % module_name) - - def acl_backward_dump_status(self, output, grad, module_name): - if isinstance(output, torch.Tensor): - output.backward(grad, retain_graph=True) - return True - - for api_name in DataCollector.multi_output_apis: - if api_name in module_name: - output[0].backward(grad, retain_graph=True) - return True - return False - - def dump_mode_backward_acl_dump(self, module, module_input_output, module_name): - global forward_init_status - grad_path = self.config.backward_input.get(module_name) - if not forward_init_status: - forward_init_status = True - output = module.forward(*module_input_output.args, **module_input_output.kwargs) - grad = torch.load(grad_path).to("npu").requires_grad_() - torch_npu.npu.init_dump() - torch_npu.npu.set_dump(self.config.acl_config) - torch_npu.npu.synchronize() - if not self.acl_backward_dump_status(output, grad, module_name): - print_warn_log("The output of {} is not of tensor type and cannot be automatically derived. " - "you can manually construct a single API backward case for ACL dump.".format( - module_name)) - torch_npu.npu.synchronize() - torch_npu.npu.finalize_dump() - forward_init_status = False - print_info_log("Dump %s op file." 
% module_name) diff --git a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/base.py b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/base.py new file mode 100644 index 0000000000000000000000000000000000000000..430d13634c75fbb2d851bff3908b6fe968263388 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/base.py @@ -0,0 +1,245 @@ +import os +import inspect +from dataclasses import dataclass +from typing import Tuple, Dict, Optional, Any +import numpy as np +from msprobe.core.common.log import logger +from msprobe.core.common.utils import convert_tuple +from msprobe.core.common.const import Const + + +@dataclass +class ModuleForwardInputsOutputs: + args: Optional[Tuple] + kwargs: Optional[Dict] + output: Any + + @property + def args_tuple(self): + return convert_tuple(self.args) + + @property + def output_tuple(self): + return convert_tuple(self.output) + + def concat_args_and_kwargs(self): + args = self.args + tuple(self.kwargs.values()) + return args + + +@dataclass +class ModuleBackwardInputsOutputs: + grad_output: Optional[Tuple] + grad_input: Optional[Tuple] + + @property + def grad_input_tuple(self): + return convert_tuple(self.grad_input) + + @property + def grad_output_tuple(self): + return convert_tuple(self.grad_output) + + +class TensorStatInfo: + def __init__(self, max_val=None, min_val=None, mean_val=None, norm_val=None): + self.max = max_val + self.min = min_val + self.mean = mean_val + self.norm = norm_val + + +class BaseDataProcessor: + _recursive_key_stack = [] + special_type = (np.integer, np.floating, np.bool_, np.complexfloating, np.str_, np.byte, np.unicode_, + bool, int, float, str, slice) + + def __init__(self, config, data_writer): + self.data_writer = data_writer + self.config = config + self.api_info_struct = {} + self.stack_info_struct = {} + self.current_api_or_module_name = None + self.api_data_category = None + self.has_overflow = False + self.current_iter = 0 + self._return_forward_new_output = False + self._forward_new_output = None + + @property + def data_path(self): + return self.data_writer.dump_tensor_data_dir + + @staticmethod + def analyze_api_call_stack(name): + stack_str = [] + for (_, path, line, func, code, _) in inspect.stack()[5:]: + if not code: + continue + stack_line = " ".join([ + "File", ", ".join([ + path, + " ".join(["line", str(line)]), + " ".join(["in", func]), + " ".join(["\n", code[0].strip()]) + ]) + ]) + stack_str.append(stack_line) + stack_info_struct = {name: stack_str} + return stack_info_struct + + @staticmethod + def _convert_numpy_to_builtin(arg): + type_mapping = { + np.integer: int, + np.floating: float, + np.bool_: bool, + np.complexfloating: complex, + np.str_: str, + np.byte: bytes, + np.unicode_: str + } + for numpy_type, builtin_type in type_mapping.items(): + if isinstance(arg, numpy_type): + return builtin_type(arg), type(arg).__name__ + return arg, '' + + @staticmethod + def _analyze_numpy(value, numpy_type): + return {"type": numpy_type, "value": value} + + @staticmethod + def _analyze_builtin(arg): + single_arg = {} + if isinstance(arg, slice): + single_arg.update({"type": "slice"}) + single_arg.update({"value": [arg.start, arg.stop, arg.step]}) + else: + single_arg.update({"type": type(arg).__name__}) + single_arg.update({"value": arg}) + return single_arg + + @classmethod + def get_special_types(cls): + return cls.special_type + + @classmethod + def recursive_apply_transform(cls, args, transform): + if isinstance(args, cls.get_special_types()): + arg_transform = 
transform(args, cls._recursive_key_stack) + return arg_transform + elif isinstance(args, (list, tuple)): + result_list = [] + for i, arg in enumerate(args): + cls._recursive_key_stack.append(str(i)) + result_list.append(cls.recursive_apply_transform(arg, transform)) + cls._recursive_key_stack.pop() + return type(args)(result_list) + elif isinstance(args, dict): + result_dict = {} + for k, arg in args.items(): + cls._recursive_key_stack.append(str(k)) + result_dict[k] = cls.recursive_apply_transform(arg, transform) + cls._recursive_key_stack.pop() + return result_dict + elif args is not None: + logger.warning(f"Data type {type(args)} is not supported.") + return None + else: + return None + + def if_return_forward_new_output(self): + return self._return_forward_new_output + + def get_forward_new_output(self): + self._return_forward_new_output = False + return self._forward_new_output + + def update_iter(self, current_iter): + self.current_iter = current_iter + + def visit_and_clear_overflow_status(self, api_or_module_name): + if self.current_api_or_module_name != api_or_module_name: + self.current_api_or_module_name = api_or_module_name + self.has_overflow = False + + def is_dump_for_data_mode(self, forward_backward, input_output): + """ + Compare the parameters with data_mode to determine whether to dump. + + Args: + forward_backward(str): The forward or backward mode to check. + input_output(str): The input or output mode to check. + + Return: + bool: True if the parameters are in data_mode or data_mode is all, False otherwise. + """ + return (Const.ALL in self.config.data_mode or + forward_backward in self.config.data_mode or + input_output in self.config.data_mode) + + def analyze_pre_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + pass + + def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + api_info_struct = {} + if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): # check whether data_mode contains forward or input + api_info_struct[name] = {} + self.api_data_category = Const.INPUT + args_info_list = self.analyze_element(module_input_output.args_tuple) + api_info_struct[name][Const.INPUT_ARGS] = args_info_list + self.api_data_category = Const.KWARGS + kwargs_info_list = self.analyze_element(module_input_output.kwargs) + api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list + + if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): # check whether data_mode contains forward or output + api_info_struct[name] = api_info_struct.get(name, {}) + self.api_data_category = Const.OUTPUT + output_info_list = self.analyze_element(module_input_output.output_tuple) + api_info_struct[name][Const.OUTPUT] = output_info_list + return api_info_struct + + def analyze_pre_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): + api_info_struct = {} + if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): + api_info_struct[name] = {} + self.api_data_category = Const.INPUT + args_info_list = self.analyze_element(module_input_output.args_tuple) + api_info_struct[name][Const.INPUT_ARGS] = args_info_list + self.api_data_category = Const.KWARGS + kwargs_info_list = self.analyze_element(module_input_output.kwargs) + api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list + return api_info_struct + + def analyze_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): + concat_args = module_input_output.concat_args_and_kwargs() + api_info_struct = {} + if 
self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): + api_info_struct[name] = {} + self.api_data_category = Const.OUTPUT + output_info_list = self.analyze_element(concat_args) + api_info_struct[name][Const.OUTPUT] = output_info_list + return api_info_struct + + def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): + api_info_struct = {} + if self.is_dump_for_data_mode(Const.BACKWARD, Const.OUTPUT): + api_info_struct[name] = {} + self.api_data_category = Const.OUTPUT + input_info_list = self.analyze_element(module_input_output.grad_input_tuple) + api_info_struct[name][Const.GRAD_INPUT] = input_info_list + + if self.is_dump_for_data_mode(Const.BACKWARD, Const.INPUT): + api_info_struct[name] = api_info_struct.get(name, {}) + self.api_data_category = Const.INPUT + output_info_list = self.analyze_element(module_input_output.grad_output_tuple) + api_info_struct[name][Const.GRAD_OUTPUT] = output_info_list + + return api_info_struct + + def get_save_file_path(self, suffix): + file_format = "pt" if self.config.framework == Const.PT_FRAMEWORK else "npy" + dump_data_name = (self.current_api_or_module_name + Const.SEP + self.api_data_category + Const.SEP + + suffix + Const.SEP + file_format) + file_path = os.path.join(self.data_writer.dump_tensor_data_dir, dump_data_name) + return dump_data_name, file_path \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/core/data_dump/data_processor/factory.py b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/factory.py new file mode 100644 index 0000000000000000000000000000000000000000..2c536ba577a4bceb211a78ebfcb84ae07ba18813 --- /dev/null +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/factory.py @@ -0,0 +1,61 @@ +from msprobe.core.common.const import Const + + +class DataProcessorFactory: + _data_processor = {} + _module_processor = {} + + @classmethod + def register_processor(cls, framework, task, processor_class): + key = (framework, task) + cls._data_processor[key] = processor_class + + @classmethod + def register_module_processor(cls, framework, processor_class): + cls._module_processor[framework] = processor_class + + @classmethod + def get_module_processor(cls, framework): + processor_class = cls._module_processor.get(framework) + if not processor_class: + raise ValueError(f"ModuleProcesser not found for framework: {framework}") + return processor_class + + @classmethod + def create_processor(cls, config, data_writer): + cls.register_processors(config.framework) + task = Const.KERNEL_DUMP if config.level == "L2" else config.task + key = (config.framework, task) + processor_class = cls._data_processor.get(key) + if not processor_class: + raise ValueError(f"Processor not found for framework: {config.framework}, task: {config.task}") + return processor_class(config, data_writer) + + @classmethod + def register_processors(cls, framework): + if framework == Const.PT_FRAMEWORK: + from .pytorch_processor import ( + StatisticsDataProcessor as PytorchStatisticsDataProcessor, + TensorDataProcessor as PytorchTensorDataProcessor, + OverflowCheckDataProcessor as PytorchOverflowCheckDataProcessor, + FreeBenchmarkDataProcessor as PytorchFreeBenchmarkDataProcessor, + KernelDumpDataProcessor as PytorchKernelDumpDataProcessor + ) + from ....pytorch.module_processer import ModuleProcesser + cls.register_processor(Const.PT_FRAMEWORK, Const.STATISTICS, PytorchStatisticsDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.TENSOR, PytorchTensorDataProcessor) + 
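As a side note on create_processor above: it resolves the task from the level first, so level "L2" always selects the kernel-dump processor regardless of the configured task. A standalone sketch of that selection, with an assumed constant value and a hypothetical stand-in config (not the real msprobe config object):

KERNEL_DUMP = "kernel_dump"  # assumed value; the patch only references Const.KERNEL_DUMP

class DemoConfig:  # hypothetical stand-in for the parsed msprobe config
    framework = "pytorch"
    task = "statistics"
    level = "L2"

config = DemoConfig()
task = KERNEL_DUMP if config.level == "L2" else config.task  # mirrors the check in create_processor
assert task == KERNEL_DUMP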
cls.register_processor(Const.PT_FRAMEWORK, Const.OVERFLOW_CHECK, PytorchOverflowCheckDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.FREE_BENCHMARK, PytorchFreeBenchmarkDataProcessor) + cls.register_processor(Const.PT_FRAMEWORK, Const.KERNEL_DUMP, PytorchKernelDumpDataProcessor) + cls.register_module_processor(Const.PT_FRAMEWORK, ModuleProcesser) + elif framework == Const.MS_FRAMEWORK: + from .mindspore_processor import ( + StatisticsDataProcessor as MindsporeStatisticsDataProcessor, + TensorDataProcessor as MindsporeTensorDataProcessor, + OverflowCheckDataProcessor as MindsporeOverflowCheckDataProcessor, + FreeBenchmarkDataProcessor as MindsporeFreeBenchmarkDataProcessor + ) + cls.register_processor(Const.MS_FRAMEWORK, Const.STATISTICS, MindsporeStatisticsDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.TENSOR, MindsporeTensorDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.OVERFLOW_CHECK, MindsporeOverflowCheckDataProcessor) + cls.register_processor(Const.MS_FRAMEWORK, Const.FREE_BENCHMARK, MindsporeFreeBenchmarkDataProcessor) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/functional/data_processor.py b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/pytorch_processor.py similarity index 37% rename from debug/accuracy_tools/atat/pytorch/functional/data_processor.py rename to debug/accuracy_tools/msprobe/core/data_dump/data_processor/pytorch_processor.py index 1ef1b79acb2172daa3bc85d11ffe4049d1bca942..f307909a416c9a372386335dfb38077bbeb2b019 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/data_processor.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/data_processor/pytorch_processor.py @@ -1,115 +1,33 @@ -import torch -import zlib -import numpy as np import os -import inspect -from dataclasses import dataclass, asdict -import torch_npu -from typing import Tuple, List, Dict, Optional, Union -from ..common.exceptions import MsaccException -from ..common.file_check import path_len_exceeds_limit, change_mode, FileCheckConst -from ..common.log import print_warn_log -from ..common.utils import Const -from ..common import recursive_apply_transform -from ..functional. 
json_writer import DataWriter -from ..free_benchmark import FreeBenchmarkCheck, UnequalRow - -bits_for_overflow = 8 - -def build_data_processor(config, data_writer): - if config.task == DataProcessor.full: - return FullTensorDataProcessor(config, data_writer) - elif config.task == DataProcessor.summary: - return DataProcessor(config, data_writer) - elif config.task == DataProcessor.overflow: - return OverflowTensorDataProcessor(config, data_writer) - elif config.task == DataProcessor.free_benchmark: - return FreeBenchmarkDataProcessor(config, data_writer) - else: - raise MsaccException(MsaccException.INVALID_PARAM_ERROR, - "task should be in [{}, {}, {}, {}]".format( - DataProcessor.full, - DataProcessor.summary, - DataProcessor.overflow, - DataProcessor.free_benchmark - )) - - -@dataclass -class ModuleForwardInputsOutputs: - args: Optional[Tuple] - kwargs: Optional[Dict] - output: Union[Tuple, torch.Tensor] - - @property - def args_tuple(self): - if not isinstance(self.args, tuple): - return (self.args, ) - else: - return self.args - - @property - def output_tuple(self): - if not isinstance(self.output, tuple): - return (self.output, ) - else: - return self.output - - def concat_args_and_kwargs(self): - args = self.args + tuple(self.kwargs.values()) - return args - - -@dataclass -class ModuleBackwardInputsOutputs: - grad_output: Optional[Tuple] - grad_input: Optional[Tuple] +import zlib +from dataclasses import asdict +from typing import List - @property - def grad_input_tuple(self): - if not isinstance(self.grad_input, tuple): - return (self.grad_input, ) - else: - return self.grad_input +import numpy as np +import torch +from msprobe.core.common.exceptions import MsprobeException +from msprobe.core.common.file_check import path_len_exceeds_limit, change_mode +from msprobe.core.common.log import logger +from msprobe.core.common.const import Const, OverflowConst, FileCheckConst +from msprobe.core.data_dump.data_processor.base import BaseDataProcessor, ModuleBackwardInputsOutputs, \ + ModuleForwardInputsOutputs, TensorStatInfo +from msprobe.pytorch.free_benchmark import FreeBenchmarkCheck, UnequalRow - @property - def grad_output_tuple(self): - if not isinstance(self.grad_output, tuple): - return (self.grad_output, ) - else: - return self.grad_output +try: + import torch_npu +except ImportError: + pass -class DataProcessor: - full = "tensor" - summary = "statistics" - overflow = "overflow_check" - free_benchmark = "free_benchmark" +class PytorchDataProcessor(BaseDataProcessor): + pytorch_special_type = (torch.device, torch.dtype, torch.Size, torch.Tensor) def __init__(self, config, data_writer): - self.data_writer = data_writer - self.api_info_struct = {} - self.stack_info_struct = {} + super().__init__(config, data_writer) self.torch_object_key = { "device": self.analyze_device_in_kwargs, "dtype": self.analyze_dtype_in_kwargs } - self.current_api_or_module_name = None - self.config = config - self.api_data_category = None - self.has_overflow = False - self.current_iter = 0 - - # 需要对forward的output进行更改 - self._return_forward_new_output = False - self._forward_new_output = None - - def if_return_forward_new_output(self): - return self._return_forward_new_output - - def get_forward_new_output(self): - self._return_forward_new_output = False - return self._forward_new_output @staticmethod def get_md5_for_tensor(x): @@ -135,279 +53,90 @@ class DataProcessor: @staticmethod def analyze_dtype_in_kwargs(element): - single_arg = {} - single_arg.update({"type": "torch.dtype"}) - 
single_arg.update({"value": str(element)}) - return single_arg + return {"type": "torch.dtype", "value": str(element)} @staticmethod - def _convert_numpy_to_builtin(arg): - type_mapping = { - np.integer: int, - np.floating: float, - np.bool_: bool, - np.complexfloating: complex, - np.str_: str, - np.byte: bytes, - np.unicode_: str - } - for numpy_type, builtin_type in type_mapping.items(): - if isinstance(arg, numpy_type): - return builtin_type(arg), type(arg).__name__ - return arg, '' - - def update_iter(self, current_iter): - self.current_iter = current_iter - - def visit_and_clear_overflow_status(self, api_or_module_name): - if self.current_api_or_module_name != api_or_module_name: - self.current_api_or_module_name = api_or_module_name - self.has_overflow = False - - def _analyze_numpy(self, value, numpy_type): - single_arg = {} - single_arg.update({"type": numpy_type}) - single_arg.update({"value": value}) - return single_arg - - def get_stat_info(self, data): + def get_stat_info(data): + tensor_stat = TensorStatInfo() if data.is_meta: - return + return tensor_stat data_clone = data.detach() if data_clone.numel() == 0: - tensor_max = None - tensor_min = None - tensor_mean = None - tensor_norm = None + return tensor_stat elif data_clone.dtype == torch.bool: - tensor_max = True in data_clone - tensor_min = False not in data_clone - tensor_mean = None - tensor_norm = None - elif not len(data_clone.shape): - tensor_max = data_clone.item() - tensor_min = tensor_max - tensor_mean = tensor_max - tensor_norm = tensor_max + tensor_stat.max = True in data_clone + tensor_stat.min = False not in data_clone + elif not data_clone.shape: + tensor_stat.max = tensor_stat.min = tensor_stat.mean = tensor_stat.norm = data_clone.item() else: - if not data_clone.is_floating_point(): + if not data_clone.is_floating_point() or data_clone.dtype == torch.float64: data_clone = data_clone.float() - tensor_max = torch._C._VariableFunctionsClass.max(data_clone).item() - tensor_min = torch._C._VariableFunctionsClass.min(data_clone).item() - tensor_mean = torch._C._VariableFunctionsClass.mean(data_clone).item() - tensor_norm = torch._C._VariableFunctionsClass.norm(data_clone).item() - - return tensor_max, tensor_min, tensor_mean, tensor_norm - - def _analyze_builtin(self, arg): - single_arg = {} - if isinstance(arg, slice): - single_arg.update({"type": "slice"}) - # slice参数中可能存在tensor类型,json序列化,需要转换为python数值类型 - values = [ - value if not isinstance(value, torch.Tensor) else value.item() - for value in [arg.start, arg.stop, arg.step] - ] - single_arg.update({"value": values}) - else: - single_arg.update({"type": type(arg).__name__}) - single_arg.update({"value": arg}) - return single_arg - - def _analyze_torch_size(self, arg): - single_arg = {} - single_arg.update({"type": "torch.Size"}) - single_arg.update({"value": list(arg)}) - return single_arg - - def is_dump_for_data_mode(self, forward_backward, input_output): - """ - Compare the parameters with data_mode to determine whether to dump. - - Args: - forward_backward(str): The forward or backward mode to check. - input_output(str): The input or output mode to check. - - Return: - bool: True if the parameters are in data_mode or data_mode is all, False otherwise. 
- """ - return (Const.ALL in self.config.data_mode or - forward_backward in self.config.data_mode or - input_output in self.config.data_mode) - + tensor_stat.max = torch._C._VariableFunctionsClass.max(data_clone).item() + tensor_stat.min = torch._C._VariableFunctionsClass.min(data_clone).item() + tensor_stat.mean = torch._C._VariableFunctionsClass.mean(data_clone).item() + tensor_stat.norm = torch._C._VariableFunctionsClass.norm(data_clone).item() + return tensor_stat + @staticmethod - def handle_tensor_extremum_nan_inf(data_clone, operator): - data_nan = torch._C._VariableFunctionsClass.isnan(data_clone) - if int(torch._C._VariableFunctionsClass.sum(data_nan)) == data_clone.numel(): - return float('nan') - finite_mask = torch._C._VariableFunctionsClass.isfinite(data_clone) - if int(torch._C._VariableFunctionsClass.sum(finite_mask)) > 0: - finite_values = data_clone[finite_mask] - return torch._C._VariableFunctionsClass.max(finite_values).item() if operator == 'max' else \ - torch._C._VariableFunctionsClass.min(finite_values).item() - else: - data_no_nan = data_clone[~data_nan] - return torch._C._VariableFunctionsClass.max(data_no_nan).item() if operator == 'max' else \ - torch._C._VariableFunctionsClass.min(data_no_nan).item() + def _analyze_torch_size(arg): + return {"type": "torch.Size", "value": list(arg)} - def _analyze_maybe_overflow_tensor(self, tensor_json, tensor): - data_clone = tensor.detach() - if hasattr(torch_npu._C, '_npu_is_support_inf_nan') and torch_npu._C._npu_is_support_inf_nan(): - if tensor_json['Max'] is None: - return - if np.isinf(tensor_json['Max']) or np.isnan(tensor_json['Max']): - tensor_json['Max_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "max") - self.has_overflow = True - if np.isinf(tensor_json['Min']) or np.isnan(tensor_json['Min']): - tensor_json['Min_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "min") - self.has_overflow = True - else: - self.has_overflow = check_overflow_npu() - if self.has_overflow: - clear_overflow_npu() - - def _analyze_tensor(self, tensor, suffix): - tensor_max, tensor_min, tensor_mean, tensor_norm = self.get_stat_info(tensor) - - tensor_json = {} - tensor_json.update({'type': 'torch.Tensor'}) - tensor_json.update({'dtype': str(tensor.dtype)}) - tensor_json.update({"shape": tensor.shape}) - tensor_json.update({"Max": tensor_max}) - tensor_json.update({"Min": tensor_min}) - self._analyze_maybe_overflow_tensor(tensor_json, tensor) - tensor_json.update({"Mean": tensor_mean}) - tensor_json.update({"Norm": tensor_norm}) - tensor_json.update({"requires_grad": tensor.requires_grad}) - if self.config.summary_mode == "md5": - tensor_md5 = self.get_md5_for_tensor(tensor) - tensor_json.update({"md5": tensor_md5}) - - return tensor_json + @classmethod + def get_special_types(cls): + return super().get_special_types() + cls.pytorch_special_type def analyze_single_element(self, element, suffix_stack): if suffix_stack and suffix_stack[-1] in self.torch_object_key: return self.torch_object_key[suffix_stack[-1]](element) - if isinstance(element, torch.Size): return self._analyze_torch_size(element) - converted_numpy, numpy_type = self._convert_numpy_to_builtin(element) if converted_numpy is not element: return self._analyze_numpy(converted_numpy, numpy_type) - if isinstance(element, torch.Tensor): return self._analyze_tensor(element, Const.SEP.join(suffix_stack)) - if isinstance(element, (bool, int, float, str, slice)): return self._analyze_builtin(element) + return None def analyze_element(self, 
element): - return recursive_apply_transform(element, self.analyze_single_element) + return self.recursive_apply_transform(element, self.analyze_single_element) - @staticmethod - def analyze_api_call_stack(name): - stack_str = [] - for (_, path, line, func, code, _) in inspect.stack()[5:]: - if not code: - continue - stack_line = " ".join([ - "File", ", ".join([ - path, - " ".join(["line", str(line)]), - " ".join(["in", func]), - " ".join(["\n", code[0].strip()]) - ]) - ]) - stack_str.append(stack_line) - stack_info_struct = {name: stack_str} - return stack_info_struct - - def analyze_pre_forward(self, name, module, - module_input_output: ModuleForwardInputsOutputs): - pass - - def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): - api_info_struct = {} - if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): # check whether data_mode contains forward or input - api_info_struct[name] = {} - self.api_data_category = Const.INPUT - args_info_list = self.analyze_element(module_input_output.args_tuple) - api_info_struct[name][Const.INPUT_ARGS] = args_info_list - - self.api_data_category = Const.KWARGS - kwargs_info_list = self.analyze_element(module_input_output.kwargs) - api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list - - if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): # check whether data_mode contains forward or output - api_info_struct[name] = api_info_struct.get(name, {}) - self.api_data_category = Const.OUTPUT - output_info_list = self.analyze_element(module_input_output.output_tuple) - api_info_struct[name][Const.OUTPUT] = output_info_list - - return api_info_struct - - def analyze_pre_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): - api_info_struct = {} - if self.is_dump_for_data_mode(Const.FORWARD, Const.INPUT): - api_info_struct[name] = {} - self.api_data_category = Const.INPUT - args_info_list = self.analyze_element(module_input_output.args_tuple) - api_info_struct[name][Const.INPUT_ARGS] = args_info_list - - self.api_data_category = Const.KWARGS - kwargs_info_list = self.analyze_element(module_input_output.kwargs) - api_info_struct[name][Const.INPUT_KWARGS] = kwargs_info_list - - return api_info_struct - - def analyze_forward_inplace(self, name, module_input_output: ModuleForwardInputsOutputs): - concat_args = module_input_output.concat_args_and_kwargs() - api_info_struct = {} - if self.is_dump_for_data_mode(Const.FORWARD, Const.OUTPUT): - api_info_struct[name] = {} - self.api_data_category = Const.OUTPUT - output_info_list = self.analyze_element(concat_args) - api_info_struct[name][Const.OUTPUT] = output_info_list - - return api_info_struct - - - def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): - api_info_struct = {} - if self.is_dump_for_data_mode(Const.BACKWARD, Const.OUTPUT): - api_info_struct[name] = {} - self.api_data_category = Const.OUTPUT - input_info_list = self.analyze_element(module_input_output.grad_input_tuple) - api_info_struct[name][Const.GRAD_INPUT] = input_info_list - - if self.is_dump_for_data_mode(Const.BACKWARD, Const.INPUT): - api_info_struct[name] = api_info_struct.get(name, {}) - self.api_data_category = Const.INPUT - output_info_list = self.analyze_element(module_input_output.grad_output_tuple) - api_info_struct[name][Const.GRAD_OUTPUT] = output_info_list + def _analyze_tensor(self, tensor, suffix): + tensor_stat = self.get_stat_info(tensor) + tensor_json = {} + tensor_json.update({'type': 'torch.Tensor'}) + 
tensor_json.update({'dtype': str(tensor.dtype)}) + tensor_json.update({"shape": tensor.shape}) + tensor_json.update({"Max": tensor_stat.max}) + tensor_json.update({"Min": tensor_stat.min}) + tensor_json.update({"Mean": tensor_stat.mean}) + tensor_json.update({"Norm": tensor_stat.norm}) + tensor_json.update({"requires_grad": tensor.requires_grad}) + if self.config.summary_mode == "md5": + tensor_md5 = self.get_md5_for_tensor(tensor) + tensor_json.update({"md5": tensor_md5}) + return tensor_json - return api_info_struct +class StatisticsDataProcessor(PytorchDataProcessor): + pass -class FullTensorDataProcessor(DataProcessor): +class TensorDataProcessor(PytorchDataProcessor): def _analyze_tensor(self, tensor, suffix): - self.data_path = self.data_writer.dump_tensor_data_dir - dump_data_name = (self.current_api_or_module_name + Const.SEP + self.api_data_category + Const.SEP + - suffix + ".pt") - file_path = os.path.join(self.data_writer.dump_tensor_data_dir, dump_data_name) + dump_data_name, file_path = self.get_save_file_path(suffix) if not path_len_exceeds_limit(file_path): torch.save(tensor, file_path) change_mode(file_path, FileCheckConst.DATA_FILE_AUTHORITY) else: - print_warn_log(f'The file path {file_path} length exceeds limit.') + logger.warning(f'The file path {file_path} length exceeds limit.') single_arg = super()._analyze_tensor(tensor, suffix) single_arg.update({"data_name": dump_data_name}) return single_arg -class OverflowTensorDataProcessor(DataProcessor): +class OverflowCheckDataProcessor(PytorchDataProcessor): __slots__ = ["cached_tensors_and_file_paths"] def __init__(self, config, data_writer): @@ -415,29 +144,35 @@ class OverflowTensorDataProcessor(DataProcessor): self.cached_tensors_and_file_paths = {} self.real_overflow_dump_times = 0 self.overflow_nums = config.overflow_num + self.bits_for_overflow = 8 - def _analyze_tensor(self, tensor, suffix): - self.data_path = self.data_writer.dump_tensor_data_dir - dump_data_name = (self.current_api_or_module_name + Const.SEP + self.api_data_category + Const.SEP + - suffix + ".pt") - file_path = os.path.join(self.data_writer.dump_tensor_data_dir, dump_data_name) - if not path_len_exceeds_limit(file_path): - self.cached_tensors_and_file_paths.update({file_path: tensor}) + @staticmethod + def overflow_debug_mode_enable(): + overflow_mode = os.getenv(OverflowConst.OVERFLOW_DEBUG_MODE_ENABLE, Const.ENV_DISABLE) + return overflow_mode == Const.ENV_ENABLE + + @staticmethod + def handle_tensor_extremum_nan_inf(data_clone, operator): + data_nan = torch._C._VariableFunctionsClass.isnan(data_clone) + if int(torch._C._VariableFunctionsClass.sum(data_nan)) == data_clone.numel(): + return float('nan') + finite_mask = torch._C._VariableFunctionsClass.isfinite(data_clone) + if int(torch._C._VariableFunctionsClass.sum(finite_mask)) > 0: + finite_values = data_clone[finite_mask] + return torch._C._VariableFunctionsClass.max(finite_values).item() if operator == 'max' else \ + torch._C._VariableFunctionsClass.min(finite_values).item() else: - print_warn_log(f'The file path {file_path} length exceeds limit.') - single_arg = super()._analyze_tensor(tensor, suffix) - single_arg.update({"data_name": dump_data_name}) - return single_arg + data_no_nan = data_clone[~data_nan] + return torch._C._VariableFunctionsClass.max(data_no_nan).item() if operator == 'max' else \ + torch._C._VariableFunctionsClass.min(data_no_nan).item() - def analyze_forward(self, name, module, - module_input_output: ModuleForwardInputsOutputs): + def analyze_forward(self, name, 
module, module_input_output: ModuleForwardInputsOutputs): self.has_overflow = False api_info_struct = super().analyze_forward(name, module, module_input_output) self.maybe_save_overflow_data_and_check_overflow_times() return api_info_struct if self.has_overflow else None - def analyze_backward(self, name, module, - module_input_output: ModuleBackwardInputsOutputs): + def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): self.has_overflow = False api_info_struct = super().analyze_backward(name, module, module_input_output) self.maybe_save_overflow_data_and_check_overflow_times() @@ -456,22 +191,68 @@ class OverflowTensorDataProcessor(DataProcessor): if self.overflow_nums == -1: return if self.real_overflow_dump_times >= self.overflow_nums: - raise MsaccException(MsaccException.OVERFLOW_NUMS_ERROR, - str(self.real_overflow_dump_times)) + raise MsprobeException(MsprobeException.OVERFLOW_NUMS_ERROR, str(self.real_overflow_dump_times)) + + def check_overflow_npu(self): + if self.overflow_debug_mode_enable(): + float_status = torch.zeros(self.bits_for_overflow).npu() + result = torch_npu.npu_get_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) + if result.cpu()[0] != 0: + return True + else: + return False + else: + return torch_npu._C._check_overflow_npu() + + def clear_overflow_npu(self): + if self.overflow_debug_mode_enable(): + float_status = torch.zeros(self.bits_for_overflow).npu() + torch_npu.npu_clear_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) + else: + torch_npu._C._clear_overflow_npu() + + def _analyze_maybe_overflow_tensor(self, tensor_json, tensor): + data_clone = tensor.detach() + if hasattr(torch_npu._C, '_npu_is_support_inf_nan') and torch_npu._C._npu_is_support_inf_nan(): + if tensor_json['Max'] is None: + return + if np.isinf(tensor_json['Max']) or np.isnan(tensor_json['Max']): + tensor_json['Max_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "max") + self.has_overflow = True + if np.isinf(tensor_json['Min']) or np.isnan(tensor_json['Min']): + tensor_json['Min_except_inf_nan'] = self.handle_tensor_extremum_nan_inf(data_clone, "min") + self.has_overflow = True + else: + self.has_overflow = self.check_overflow_npu() + if self.has_overflow: + self.clear_overflow_npu() + + def _analyze_tensor(self, tensor, suffix): + dump_data_name, file_path = self.get_save_file_path(suffix) + if not path_len_exceeds_limit(file_path): + self.cached_tensors_and_file_paths.update({file_path: tensor}) + else: + logger.warning(f'The file path {file_path} length exceeds limit.') + single_arg = super()._analyze_tensor(tensor, suffix) + self._analyze_maybe_overflow_tensor(single_arg, tensor) + single_arg.update({"data_name": dump_data_name}) + return single_arg -class FreeBenchmarkDataProcessor(DataProcessor): +class FreeBenchmarkDataProcessor(PytorchDataProcessor): def __init__(self, config, data_writer): super().__init__(config, data_writer) self.checker = FreeBenchmarkCheck(config=config) + self._return_forward_new_output = None + self._forward_new_output = None def update_iter(self, current_iter): - self.current_iter = current_iter + super().update_iter(current_iter) self.checker.update_iter(current_iter) def update_unequal_rows(self, unequal_rows: List[UnequalRow]): - if len(unequal_rows) == 0: + if not unequal_rows: return for row in unequal_rows: data_dict = asdict(row) @@ -482,11 +263,8 @@ class FreeBenchmarkDataProcessor(DataProcessor): ) return - def analyze_pre_forward(self, name, module, - 
module_input_output: ModuleForwardInputsOutputs): - args = module_input_output.args - kwargs = module_input_output.kwargs - self.checker.pre_forward(name, module, self, args, kwargs) + def analyze_pre_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): + self.checker.pre_forward(name, module, self, module_input_output.args, module_input_output.kwargs) def analyze_forward(self, name, module, module_input_output: ModuleForwardInputsOutputs): new_output, unequal_rows = self.checker.forward( @@ -495,45 +273,74 @@ class FreeBenchmarkDataProcessor(DataProcessor): module_input_output.args, module_input_output.kwargs, module_input_output.output, - ) + ) self.update_unequal_rows(unequal_rows) if self.checker.if_fix(): self._return_forward_new_output = True self._forward_new_output = new_output - return None def analyze_backward(self, name, module, module_input_output: ModuleBackwardInputsOutputs): self.checker.backward(name, module, module_input_output.grad_output) - return None +class KernelDumpDataProcessor(PytorchDataProcessor): + forward_init_status = False + multi_output_apis = ["_sort_", "npu_flash_attention"] -def overflow_debug_mode_enable(): - overflow_mode = os.getenv(OverflowConst.OVERFLOW_DEBUG_MODE_ENABLE, Const.ENV_DISABLE) - return overflow_mode == Const.ENV_ENABLE + def __init__(self, config, data_writer): + super().__init__(config, data_writer) -def check_overflow_npu(): - if overflow_debug_mode_enable(): - float_status = torch.zeros(bits_for_overflow).npu() - result = torch_npu.npu_get_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) - if (result.cpu()[0] != 0): - return True + def analyze_forward(self, name, module, module_input_output): + if self.config.is_forward_acl_dump: + self.forward_acl_dump(name, module, module_input_output) else: - return False - else: - return torch_npu._C._check_overflow_npu() - -def clear_overflow_npu(): - if overflow_debug_mode_enable(): - float_status = torch.zeros(bits_for_overflow).npu() - torch_npu.npu_clear_float_status(float_status, OverflowConst.OVERFLOW_DEBUG_MODE) - else: - torch_npu._C._clear_overflow_npu() - -class OverflowConst: - """ - Class for Overflow - """ - OVERFLOW_DEBUG_MODE_ENABLE = "OVERFLOW_DEBUG_MODE_ENABLE" - OVERFLOW_ORIGINAL_MODE = 0 - OVERFLOW_DEBUG_MODE = 1 + self.dump_mode_backward_acl_dump(name, module, module_input_output) + + def forward_acl_dump(self, name, module, module_input_output): + if not KernelDumpDataProcessor.forward_init_status: + KernelDumpDataProcessor.forward_init_status = True + torch_npu.npu.synchronize() + torch_npu.npu.init_dump() + torch_npu.npu.set_dump(self.config.acl_config) + torch_npu.npu.synchronize() + if self.op_need_trigger(name): + module.forward(*module_input_output.args, **module_input_output.kwargs).cpu() + else: + module.forward(*module_input_output.args, **module_input_output.kwargs) + torch_npu.npu.synchronize() + torch_npu.npu.finalize_dump() + torch_npu.npu.synchronize() + KernelDumpDataProcessor.forward_init_status = False + logger.info("Dump %s op file." 
% name) + + def acl_backward_dump_status(self, output, grad, module_name): + if isinstance(output, torch.Tensor): + output.backward(grad, retain_graph=True) + return True + + for api_name in KernelDumpDataProcessor.multi_output_apis: + if api_name in module_name: + output[0].backward(grad, retain_graph=True) + return True + return False + + def dump_mode_backward_acl_dump(self, name, module, module_input_output): + grad_path = self.config.backward_input.get(name) + if not KernelDumpDataProcessor.forward_init_status: + KernelDumpDataProcessor.forward_init_status = True + output = module.forward(*module_input_output.args, **module_input_output.kwargs) + grad = torch.load(grad_path).to("npu").requires_grad_() + torch_npu.npu.init_dump() + torch_npu.npu.set_dump(self.config.acl_config) + torch_npu.npu.synchronize() + if not self.acl_backward_dump_status(output, grad, name): + logger.warning("The output of {} is not of tensor type and cannot be automatically derived. " + "you can manually construct a single API backward case for ACL dump.".format( + name)) + torch_npu.npu.synchronize() + torch_npu.npu.finalize_dump() + KernelDumpDataProcessor.forward_init_status = False + logger.info("Dump %s op file." % name) + + def op_need_trigger(self, module_name): + return 'Tensor.__getitem__.' in module_name diff --git a/debug/accuracy_tools/atat/pytorch/functional/json_writer.py b/debug/accuracy_tools/msprobe/core/data_dump/json_writer.py similarity index 79% rename from debug/accuracy_tools/atat/pytorch/functional/json_writer.py rename to debug/accuracy_tools/msprobe/core/data_dump/json_writer.py index 0fee3aa9731aa79c2f0e5857fb8596a86e86b6d7..c4b7fc11ec4a9a082a701327775406ee193247e0 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/json_writer.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/json_writer.py @@ -1,16 +1,15 @@ import os import csv -from pathlib import Path +import fcntl import json -from ..common.file_check import FileCheckConst, change_mode -from ..common.log import print_info_log_rank_0 -from ..common.utils import Const +from pathlib import Path +from msprobe.core.common.file_check import change_mode +from msprobe.core.common.log import logger +from msprobe.core.common.const import Const, FileCheckConst -class DataWriter: # TODO: UT - # dump_json_name = "dump.json" - # stack_json_name = "stack.json" - # construct_json_name = "construct.json" + +class DataWriter: def __init__(self, init_json=None) -> None: self.dump_count = 0 @@ -18,15 +17,29 @@ class DataWriter: # TODO: UT self.dump_file_path = None # os.path.join(dump_dir, DataWriter.dump_json_name) self.stack_file_path = None # os.path.join(dump_dir, DataWriter.stack_json_name) self.construct_file_path = None # os.path.join(dump_dir, DataWriter.construct_json_name) - self.free_benchmark_file_path = None + self.free_benchmark_file_path = None self.dump_tensor_data_dir = None self.buffer_size = 1000 - self.cache_data = {"data": {}} + self.cache_data = {Const.DATA: {}} self.cache_stack = {} self.cache_construct = {} + @staticmethod + def write_data_to_csv(result: list, result_header: tuple, file_path: str): + if not result: + return + is_exists = os.path.exists(file_path) + append = "a+" if is_exists else "w+" + with os.fdopen( + os.open(file_path, Const.WRITE_FLAGS, FileCheckConst.DATA_FILE_AUTHORITY), append, newline="" + ) as csv_file: + spawn_writer = csv.writer(csv_file) + if not is_exists: + spawn_writer.writerow(result_header) + spawn_writer.writerows([result,]) + def initialize_json_file(self, **kwargs): - 
kwargs.update({"dump_data_dir": self.dump_tensor_data_dir, "data": {}}) + kwargs.update({"dump_data_dir": self.dump_tensor_data_dir, Const.DATA: {}}) with os.fdopen( os.open(self.dump_file_path, Const.OVERWRITE_FLAGS, FileCheckConst.DATA_FILE_AUTHORITY), 'w' ) as f: @@ -42,7 +55,8 @@ class DataWriter: # TODO: UT Path(self.construct_file_path).touch() change_mode(self.construct_file_path, FileCheckConst.DATA_FILE_AUTHORITY) - def update_dump_paths(self, dump_file_path, stack_file_path, construct_file_path, dump_data_dir, free_benchmark_file_path): + def update_dump_paths(self, dump_file_path, stack_file_path, construct_file_path, dump_data_dir, + free_benchmark_file_path): self.dump_file_path = dump_file_path self.stack_file_path = stack_file_path self.construct_file_path = construct_file_path @@ -51,13 +65,13 @@ class DataWriter: # TODO: UT def update_data(self, new_data): key = next(iter(new_data.keys())) # assert len(new_data.keys()) == 1 - if key in self.cache_data["data"]: - self.cache_data["data"][key].update(new_data[key]) + if key in self.cache_data[Const.DATA]: + self.cache_data[Const.DATA][key].update(new_data[key]) else: - self.cache_data["data"].update(new_data) + self.cache_data[Const.DATA].update(new_data) def flush_data_when_buffer_is_full(self): - if len(self.cache_data["data"]) >= self.buffer_size: + if len(self.cache_data[Const.DATA]) >= self.buffer_size: self.write_data_json(self.dump_file_path) def update_stack(self, new_data): @@ -67,8 +81,7 @@ class DataWriter: # TODO: UT self.cache_construct.update(new_data) def write_data_json(self, file_path): - import fcntl - print_info_log_rank_0(f"dump.json is at {os.path.dirname(os.path.dirname(file_path))}. ") + logger.info(f"dump.json is at {os.path.dirname(os.path.dirname(file_path))}. 
") if Path(file_path).exists() and os.path.getsize(file_path) > 0: with open(file_path, "r+") as f: fcntl.flock(f, fcntl.LOCK_EX) @@ -77,23 +90,21 @@ class DataWriter: # TODO: UT else: self.init_json['data_path'] = self.dump_tensor_data_dir data_to_write = self.init_json - data_to_write['data'].update(self.cache_data['data']) + data_to_write[Const.DATA].update(self.cache_data[Const.DATA]) with open(file_path, 'w+') as f: fcntl.flock(f, fcntl.LOCK_EX) json.dump(data_to_write, f, indent=1) fcntl.flock(f, fcntl.LOCK_UN) - self.cache_data["data"].clear() + self.cache_data[Const.DATA].clear() def write_stack_info_json(self, file_path): - import fcntl with open(file_path, 'w+') as f: fcntl.flock(f, fcntl.LOCK_EX) json.dump(self.cache_stack, f, indent=1) fcntl.flock(f, fcntl.LOCK_UN) def write_construct_info_json(self, file_path): - import fcntl with open(file_path, 'w+') as f: fcntl.flock(f, fcntl.LOCK_EX) json.dump(self.cache_construct, f, indent=1) @@ -103,18 +114,3 @@ class DataWriter: # TODO: UT self.write_data_json(self.dump_file_path) self.write_stack_info_json(self.stack_file_path) self.write_construct_info_json(self.construct_file_path) - - @staticmethod - def write_data_to_csv(result: list, result_header: tuple, file_path: str): - if len(result) == 0: - return - is_exists = os.path.exists(file_path) - append = "a+" if is_exists else "w+" - with os.fdopen( - os.open(file_path, Const.WRITE_FLAGS, FileCheckConst.DATA_FILE_AUTHORITY), append, newline="" - ) as csv_file: - spawn_writer = csv.writer(csv_file) - if not is_exists: - spawn_writer.writerow(result_header) - spawn_writer.writerows([result,]) - \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/functional/scope.py b/debug/accuracy_tools/msprobe/core/data_dump/scope.py similarity index 95% rename from debug/accuracy_tools/atat/pytorch/functional/scope.py rename to debug/accuracy_tools/msprobe/core/data_dump/scope.py index e557b876b1b00beef60dd623175374ad20d6a287..1d74c3e461ac4b0005e4fdd40ae1fe2c12bb1c4e 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/scope.py +++ b/debug/accuracy_tools/msprobe/core/data_dump/scope.py @@ -1,11 +1,15 @@ from abc import ABC, abstractmethod -from ..common.exceptions import ScopeException -from ..common.utils import Const +from msprobe.core.common.exceptions import ScopeException +from msprobe.core.common.const import Const -def build_scope(scope_class, scope=[], api_list=[]): +def build_scope(scope_class, scope=None, api_list=None): if not scope and not api_list: return None + if scope is None: + scope = [] + if api_list is None: + api_list = [] if scope_class: return scope_class(scope, api_list) return build_range_scope_according_to_scope_name(scope, api_list) @@ -30,6 +34,11 @@ class BaseScope(ABC): Module_Type_Module = "Module" Module_Type_API = "api" + def __init__(self, scope, api_list): + scope, api_list = self.rectify_args(scope, api_list) + self.scope = scope + self.api_list = api_list + @staticmethod def rectify_args(scope, api_list): if not isinstance(api_list, list): @@ -51,10 +60,9 @@ class BaseScope(ABC): f"scope列表元素要求类型为字符串,实际类型为{type(s)}.") return scope, api_list - def __init__(self, scope, api_list): - scope, api_list = self.rectify_args(scope, api_list) - self.scope = scope - self.api_list = api_list + @abstractmethod + def check(self, name): + pass def check_api_list(self, api_name): if not self.api_list: @@ -62,10 +70,7 @@ class BaseScope(ABC): for api_str in self.api_list: if api_str in api_name: return True - - @abstractmethod - def check(self, 
name): - pass + return False class ListScope(BaseScope): @@ -83,6 +88,13 @@ class ListScope(BaseScope): class RangeScope(BaseScope, ABC): + + def __init__(self, *args): + super().__init__(*args) + self.in_scope = False + self.is_valid = self.check_scope_is_valid() + + @staticmethod def rectify_args(scope, api_list): scope, api_list = super(RangeScope, RangeScope).rectify_args(scope, api_list) @@ -99,11 +111,6 @@ class RangeScope(BaseScope, ABC): def check_scope_is_valid(self): pass - def __init__(self, *args): - super().__init__(*args) - self.in_scope = False - self.is_valid = self.check_scope_is_valid() - def begin_module(self, module_name): pass @@ -169,6 +176,3 @@ class ModuleRangeScope(RangeScope): if not self.scope or self.in_scope: return self.check_api_list(module_name) return False - - - diff --git a/debug/accuracy_tools/msprobe/mindspore/__init__.py b/debug/accuracy_tools/msprobe/mindspore/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..3bf42d1e39feef616e6eb2cc296a099b0bddfd98 --- /dev/null +++ b/debug/accuracy_tools/msprobe/mindspore/__init__.py @@ -0,0 +1 @@ +from msprobe.mindspore.debugger.precision_debugger import PrecisionDebugger diff --git a/debug/accuracy_tools/atat/mindspore/dump/__init__.py b/debug/accuracy_tools/msprobe/mindspore/debugger/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/mindspore/dump/__init__.py rename to debug/accuracy_tools/msprobe/mindspore/debugger/__init__.py diff --git a/debug/accuracy_tools/atat/mindspore/debugger/debugger_config.py b/debug/accuracy_tools/msprobe/mindspore/debugger/debugger_config.py similarity index 100% rename from debug/accuracy_tools/atat/mindspore/debugger/debugger_config.py rename to debug/accuracy_tools/msprobe/mindspore/debugger/debugger_config.py diff --git a/debug/accuracy_tools/atat/mindspore/debugger/precision_debugger.py b/debug/accuracy_tools/msprobe/mindspore/debugger/precision_debugger.py similarity index 82% rename from debug/accuracy_tools/atat/mindspore/debugger/precision_debugger.py rename to debug/accuracy_tools/msprobe/mindspore/debugger/precision_debugger.py index 0099074762f0746c1bd8341047f37b3e5fe08855..358d0d6f7d3ab15f5a619da1f7977160cf3f1682 100644 --- a/debug/accuracy_tools/atat/mindspore/debugger/precision_debugger.py +++ b/debug/accuracy_tools/msprobe/mindspore/debugger/precision_debugger.py @@ -1,7 +1,7 @@ import os -from atat.mindspore.ms_config import parse_json_config -from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.mindspore.task_handler_factory import TaskHandlerFactory +from msprobe.mindspore.ms_config import parse_json_config +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.task_handler_factory import TaskHandlerFactory class PrecisionDebugger: diff --git a/debug/accuracy_tools/atat/mindspore/doc/dump.md b/debug/accuracy_tools/msprobe/mindspore/doc/dump.md similarity index 72% rename from debug/accuracy_tools/atat/mindspore/doc/dump.md rename to debug/accuracy_tools/msprobe/mindspore/doc/dump.md index 34529f580a7b2cb4961a2c992949cab89c15115e..425d0683a268ebdcaf54a4f70b5e448bb1233f3c 100644 --- a/debug/accuracy_tools/atat/mindspore/doc/dump.md +++ b/debug/accuracy_tools/msprobe/mindspore/doc/dump.md @@ -1,8 +1,8 @@ # **精度数据采集** -atat工具主要通过在训练脚本内添加dump接口并启动训练的方式来采集精度数据。 +msprobe工具主要通过在训练脚本内添加dump接口并启动训练的方式来采集精度数据。 -执行dump操作需要安装atat工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 +执行dump操作需要安装msprobe工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 ## 
dump接口介绍 @@ -12,7 +12,7 @@ atat工具主要通过在训练脚本内添加dump接口并启动训练的方式 通过加载dump配置文件的方式来确定dump操作的详细配置。 -可以在from atat.mindspore import PrecisionDebugger和模型初始化之间的任意位置添加该接口。 +可以在from msprobe.mindspore import PrecisionDebugger和模型初始化之间的任意位置添加该接口。 **原型** @@ -24,7 +24,7 @@ PrecisionDebugger(config_path=None) | 参数名 | 说明 | 是否必选 | | ----------- | ------------------------------------------------------------ | -------- | -| config_path | 指定dump配置文件路径,String类型。参数示例:"./config.json"。未配置该路径时,默认使用../../config目录下的config.json文件的默认配置。config.json文件可以配置更多参数,若需要进行更多场景的精度数据dump,建议配置[config.json](../../config/config.json)文件。 | 否 | +| config_path | 指定dump配置文件路径,String类型。参数示例:"./config.json"。未配置该路径时,默认使用[config.json](../../config)文件的默认配置。config.json文件可以配置更多参数,若需要进行更多场景的精度数据dump,建议配置[config.json](../../config/config.json)文件。 | 否 | ### start函数 @@ -43,7 +43,7 @@ debugger.start() ## 示例代码 ```Python -from atat.mindspore import PrecisionDebugger +from msprobe.mindspore import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json") # 请勿将以上初始化流程插入到循环代码中 # 下面代码也可以用PrecisionDebugger.start() diff --git a/debug/accuracy_tools/atat/mindspore/overflow_check/__init__.py b/debug/accuracy_tools/msprobe/mindspore/dump/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/mindspore/overflow_check/__init__.py rename to debug/accuracy_tools/msprobe/mindspore/dump/__init__.py diff --git a/debug/accuracy_tools/atat/mindspore/dump/api_kbk_dump.py b/debug/accuracy_tools/msprobe/mindspore/dump/api_kbk_dump.py similarity index 89% rename from debug/accuracy_tools/atat/mindspore/dump/api_kbk_dump.py rename to debug/accuracy_tools/msprobe/mindspore/dump/api_kbk_dump.py index b0f80f40e553a8b136144f515015d0f94c635f5d..5c7af45d79060c00ce198f19a589d46bacf1f756 100644 --- a/debug/accuracy_tools/atat/mindspore/dump/api_kbk_dump.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/api_kbk_dump.py @@ -1,9 +1,9 @@ import os import json -from atat.core.utils import make_dump_path_if_not_exists -from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.core.log import print_info_log -from atat.core.file_check_util import FileOpen +from msprobe.core.common.utils import make_dump_path_if_not_exists +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.core.common.log import logger +from msprobe.core.common.file_check import FileOpen class ApiKbkDump: @@ -48,7 +48,7 @@ class ApiKbkDump: json_path = os.path.join(json_path, "api_kbk_dump.json") with FileOpen(json_path, 'w') as f: json.dump(self.dump_json, f) - print_info_log(json_path + " has been created.") + logger.info(json_path + " has been created.") os.environ["GRAPH_OP_RUN"] = "1" os.environ["MINDSPORE_DUMP_CONFIG"] = json_path if "MS_ACL_DUMP_CFG_PATH" in os.environ: diff --git a/debug/accuracy_tools/atat/mindspore/dump/dump_tool_factory.py b/debug/accuracy_tools/msprobe/mindspore/dump/dump_tool_factory.py similarity index 82% rename from debug/accuracy_tools/atat/mindspore/dump/dump_tool_factory.py rename to debug/accuracy_tools/msprobe/mindspore/dump/dump_tool_factory.py index ab534edc243dfd5f44688358fe4ca8edb6a8a12d..2c4579b0e75fe1573f387f696c3d9e4efd4945e3 100644 --- a/debug/accuracy_tools/atat/mindspore/dump/dump_tool_factory.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/dump_tool_factory.py @@ -1,6 +1,6 @@ -from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.mindspore.dump.api_kbk_dump import ApiKbkDump -from atat.mindspore.dump.kernel_graph_dump import KernelGraphDump +from 
msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.api_kbk_dump import ApiKbkDump +from msprobe.mindspore.dump.kernel_graph_dump import KernelGraphDump class DumpToolFactory: diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/support_wrap_ops.yaml b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml similarity index 53% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/support_wrap_ops.yaml rename to debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml index acd4cc0e6e658dd4278f6a67c4f0e8fc288efde6..089f444b6181f0623c8029926c4808ab22ae27ca 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/support_wrap_ops.yaml +++ b/debug/accuracy_tools/msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml @@ -1,999 +1,925 @@ -# Copyright (c) 2023 Huawei Technologies Co., Ltd -# All rights reserved. +# Copyright 2024 Huawei Technologies Co., Ltd # -# Licensed under the BSD 3-Clause License (the "License"); +# Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # -# https://opensource.org/licenses/BSD-3-Clause +# http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +# ============================================================================ # List of ops that register hooks -functional: - - conv1d - - conv2d - - conv3d - - conv_transpose1d - - conv_transpose2d - - conv_transpose3d - - conv_tbc + +ops: + - adaptive_avg_pool1d + - adaptive_avg_pool2d + - adaptive_avg_pool3d + - adaptive_max_pool1d + - adaptive_max_pool2d - avg_pool1d - avg_pool2d - avg_pool3d - - fractional_max_pool2d_with_indices - - fractional_max_pool2d - - fractional_max_pool3d_with_indices + - batch_norm + - bias_add + - ctc_greedy_decoder + - conv1d + - conv2d + - conv3d + - deformable_conv2d + - dense + - dropout + - dropout1d + - dropout2d + - dropout3d + - flatten + - fold - fractional_max_pool3d - - max_pool1d_with_indices - - max_pool1d - - max_pool2d_with_indices + - lp_pool1d + - lp_pool2d + - lrn - max_pool2d - - max_pool3d_with_indices - max_pool3d - max_unpool1d - max_unpool2d - max_unpool3d - - lp_pool2d - - lp_pool1d - - adaptive_max_pool1d_with_indices - - adaptive_max_pool1d - - adaptive_max_pool2d_with_indices - - adaptive_max_pool2d - - adaptive_max_pool3d_with_indices - - adaptive_max_pool3d - - adaptive_avg_pool1d - - adaptive_avg_pool2d - - adaptive_avg_pool3d - - dropout - - alpha_dropout - - dropout2d - - dropout3d - - feature_alpha_dropout - - threshold - - threshold_ - - relu - - relu_ + - unfold + - binary_cross_entropy + - binary_cross_entropy_with_logits + - cosine_embedding_loss + - cross_entropy + - ctc_loss + - gaussian_nll_loss + - hinge_embedding_loss + - huber_loss + - kl_div + - l1_loss + - margin_ranking_loss + - mse_loss + - multi_margin_loss + - multilabel_margin_loss + - multilabel_soft_margin_loss + - nll_loss + - smooth_l1_loss + - triplet_margin_loss + - elu + - fast_gelu + - gelu - glu + - gumbel_softmax + - hardshrink + - hardsigmoid + - hardswish - hardtanh - - hardtanh_ - - relu6 - - elu - - 
elu_ - - selu - - selu_ - - celu - - celu_ - leaky_relu - - leaky_relu_ + - log_softmax + - logsigmoid + - mish - prelu + - relu + - relu6 + - celu - rrelu - - rrelu_ - - logsigmoid - - gelu - - hardshrink - - tanhshrink - - softsign - - softplus - - softmin + - selu + - sigmoid + - silu - softmax - - gumbel_softmax - - log_softmax + - softmin - softshrink + - softsign - tanh - - sigmoid - - hardsigmoid - - linear - - bilinear - - silu - - hardswish - - embedding - - embedding_bag - - batch_norm - - instance_norm - - layer_norm - - group_norm - - local_response_norm - - ctc_loss - - nll_loss - - poisson_nll_loss - - gaussian_nll_loss - - kl_div - - cross_entropy - - binary_cross_entropy - - binary_cross_entropy_with_logits - - smooth_l1_loss - - l1_loss - - mse_loss - - margin_ranking_loss - - hinge_embedding_loss - - multilabel_margin_loss - - soft_margin_loss - - multilabel_soft_margin_loss - - cosine_embedding_loss - - multi_margin_loss + - threshold + - cdist + - dist + - pdist + - choice_with_mask + - random_categorical + - log_uniform_candidate_sampler + - uniform_candidate_sampler + - affine_grid + - bounding_box_decode + - bounding_box_encode + - col2im + - check_valid + - crop_and_resize + - grid_sample + - interpolate + - iou + - pad + - padding - pixel_shuffle - pixel_unshuffle - - channel_shuffle - upsample - - interpolate - - upsample_nearest - - upsample_bilinear - - grid_sample - - affine_grid - - pad - - pairwise_distance - - pdist - - cosine_similarity - - one_hot - - triplet_margin_loss - - triplet_margin_with_distance_loss - - normalize - - unfold - - fold - - multi_head_attention_forward - -tensor: - - __add__ - - __and__ - - __bool__ - - __div__ - - __eq__ - - __ge__ - - __gt__ - - __iadd__ - - __iand__ - - __idiv__ - - __ifloordiv__ - - __ilshift__ - - __imod__ - - __imul__ - - __ior__ - - __irshift__ - - __isub__ - - __ixor__ - - __lshift__ - - __matmul__ - - __mod__ - - __mul__ - - __nonzero__ - - __or__ - - __radd__ - - __rmul__ - - __rshift__ - - __sub__ - - __truediv__ - - __xor__ - abs - - abs_ - absolute - - absolute_ + - accumulate_n - acos - - acos_ + - arccos - acosh - - acosh_ - add - - add_ - - addbmm - - addbmm_ - addcdiv - - addcdiv_ - addcmul - - addcmul_ - - addmm - - addmm_ - addmv - - addmv_ - - addr - - addr_ - - align_as - - align_to - - all - - allclose - - amax - - amin + - addn - angle - - any - - arccos - - arccos_ - arccosh - - arccosh_ - arcsin - - arcsin_ - arcsinh - - arcsinh_ - arctan - - arctan_ - arctanh - - arctanh_ - - argmax - - argmin - - argsort + - arctan2 - asin - - asin_ - asinh - - asinh_ - atan - atan2 - - atan2_ - - atan_ - atanh - - atanh_ - - baddbmm - - baddbmm_ - - bernoulli - - bernoulli_ - - bincount + - atleast_1d + - atleast_2d + - atleast_3d + - bessel_i0 + - bessel_i0e + - bessel_i1 + - bessel_i1e + - bessel_j0 + - bessel_j1 + - bessel_k0 + - bessel_k0e + - bessel_k1 + - bessel_k1e + - bessel_y0 + - bessel_y1 - bitwise_and - - bitwise_and_ - - bitwise_not - - bitwise_not_ + - bitwise_left_shift - bitwise_or - - bitwise_or_ + - bitwise_right_shift - bitwise_xor - - bitwise_xor_ - - bmm - - broadcast_to - - cauchy_ - ceil - - ceil_ - - cholesky - - chunk - clamp - - cholesky_solve - - cholesky_inverse - - clamp_ - - clamp_max - - clamp_max_ - clip - - clamp_min - - clamp_min_ - - clip_ + - combinations - copysign - - copysign_ - cos - - cos_ - cosh - - cosh_ - - count_nonzero - - cummax - - cummin - - cumprod - - cumprod_ - - cumsum - - cumsum_ - - deg2rad - - deg2rad_ - - det - - diag + - cosine_similarity + - cov - 
diag_embed - - diagflat - - diagonal - diff - - dist + - deg2rad - digamma - - digamma_ - div - - div_ - divide - - divide_ - - dot - - eig - - eq - - eq_ - erf - - equal - - erf_ - erfc - - erfc_ - erfinv - - erfinv_ - exp - exp2 - - exp2_ - expm1 - - exp_ - - expm1_ - - exponential_ - - fill_ - - fix - - fill_diagonal_ - - fix_ - - flip - - fliplr - - flatten - - flipud - - float_power - - float_power_ - floor - - floor_ - - floor_divide - - floor_divide_ - - fmax - - fmin + - floor_div + - floor_mod + - float_power - fmod - - fmod_ - frac - - frac_ - - gather - gcd - - gcd_ - - ge - - ge_ - - geometric_ - - geqrf - - ger - - greater - - greater_ - - gt - - gt_ - - greater_equal - - greater_equal_ - - hardshrink - - heaviside - - heaviside_ - - histc - hypot - - hypot_ - igamma - - igamma_ - igammac - - igammac_ - - index_add - - index_add_ - - inverse - - index_copy - - index_copy_ - - index_fill - - index_fill_ - - index_put - - index_put_ - - inner - - index_select - - isclose - - isfinite - - isinf - - isnan - - isneginf - - isposinf - - isreal - - kron - - kthvalue + - imag + - i0 + - inv + - invert - lcm - - lcm_ - ldexp - - ldexp_ - - le - - le_ - lerp - - lerp_ - - where - - less - - less_ - - less_equal - - less_equal_ - - lgamma - - lgamma_ - log + - log2 - log10 - - log10_ - log1p - - log1p_ - - log2 - - log2_ - - log_ - - log_normal_ - - log_softmax - - logcumsumexp - - logdet - logaddexp - logaddexp2 - logical_and - - logical_and_ - logical_not - - logit - - logical_not_ - logical_or - - logical_or_ - logical_xor - - logical_xor_ - - logit_ - - logsumexp - - lstsq - - lt - - lt_ - - lu_solve - - map2_ - - map_ - - masked_fill - - matmul - - masked_fill_ - - masked_scatter - - masked_scatter_ - - masked_select - - matrix_exp - - max - - maximum - - mean - - matrix_power - - median - - min - - minimum - - mm - - mode - - msort + - logit - mul - - mul_ - - multinomial - multiply - - multiply_ - - mv - mvlgamma - - mvlgamma_ - - nansum - - narrow - - narrow_copy - - ne - - ne_ - neg - - neg_ - negative - - negative_ - - nonzero - - normal_ - - not_equal - - not_equal_ - - permute - - pinverse + - nextafter + - polar - polygamma + - positive - pow - - pow_ - - polygamma_ - - prelu - - prod - - put_ - rad2deg - - rad2deg_ - ravel - real - reciprocal - - reciprocal_ - - relu - - relu_ - remainder - - repeat_interleave - - reshape - - remainder_ - - renorm - - renorm_ - - repeat - - reshape_as - - resize_ - - resize_as_ - - roll - rot90 - round - - round_ - rsqrt - - rsqrt_ - - scatter - - scatter_ - - scatter_add - - scatter_add_ - - select - sgn - - sgn_ - - sigmoid - - sigmoid_ - sign - - sign_ - signbit - sin - - sin_ - sinc - - sinc_ - sinh - - sinh_ - - slogdet - - smm - - softmax - - solve - - sort - - split_with_sizes - sqrt - - sqrt_ - square - - square_ - - squeeze - - squeeze_ - - sspaddmm - - std - sub - - sub_ - - sum - - sum_to_size - - svd - - symeig + - subtract - t - - t_ - - take - tan - - tan_ - - tanh - - tanh_ + - tanhshrink + - trapz + - tril_indices + - triu_indices + - true_divide + - trunc + - truncate_div + - truncate_mod + - xdivy + - xlogy + - zeta + - all + - amax + - amin + - aminmax + - any + - argmax + - argmin + - cummax + - cummin + - cumprod + - cumsum + - fmax + - histc + - logsumexp + - max + - mean + - median + - min + - norm + - prod + - std + - std_mean + - var + - var_mean + - argsort + - approximate_equal + - equal + - ge + - greater + - greater_equal + - gt + - intopk + - isclose + - isfinite + - isinf + - isnan + - isneginf + - isposinf + 
- isreal + - is_complex + - le + - less + - less_equal + - lt + - maximum + - minimum + - msort + - ne + - not_equal + - searchsorted + - topk + - bmm + - addbmm + - addmm + - baddbmm + - addr + - adjoint + - cholesky + - cholesky_solve + - batch_dot + - dot + - eig + - inner + - inverse + - geqrf + - ger + - kron + - lu_solve + - lu_unpack + - matmul + - matrix_solve + - matrix_band_part + - matrix_diag + - matrix_diag_part + - matrix_set_diag + - mm + - mv + - outer + - orgqr + - ormqr + - pinv + - svd + - tensor_dot + - logdet + - slogdet + - qr + - trace + - bartlett_window + - blackman_window + - hamming_window + - hann_window + - kaiser_window + - eye + - fill + - full + - full_like + - linspace + - logspace + - one_hot + - arange + - range + - heaviside + - bernoulli + - gamma + - laplace + - multinomial + - multinomial_with_replacement + - rand + - rand_like + - randint + - randint_like + - randn + - randn_like + - random_gamma + - random_poisson + - randperm + - standard_laplace + - standard_normal + - uniform + - argwhere + - batch_to_space_nd + - bincount + - block_diag + - broadcast_to + - cat + - channel_shuffle + - chunk + - column_stack + - concat + - conj + - count_nonzero + - deepcopy + - diag + - diagflat + - diagonal + - dyn_shape + - dsplit + - dstack + - einsum + - expand + - expand_dims + - flip + - fliplr + - flipud + - gather_d + - gather_elements + - gather_nd + - hsplit + - hstack + - index_add + - index_fill + - index_select + - inplace_add + - inplace_index_add + - inplace_sub + - inplace_update + - masked_fill + - masked_select + - meshgrid + - moveaxis + - movedim + - narrow + - nan_to_num + - nansum + - normal + - nonzero + - population_count + - rank + - repeat_elements + - repeat_interleave + - reshape + - reverse + - reverse_sequence + - roll + - scatter + - scatter_nd + - select + - sequence_mask + - shuffle + - size + - slice + - sort + - space_to_batch_nd + - sparse_segment_mean + - split + - squeeze + - stack + - strided_slice + - sum + - swapaxes + - swapdims + - tensor_scatter_add + - tensor_scatter_div + - tensor_scatter_max + - tensor_scatter_min + - tensor_scatter_mul + - tensor_scatter_sub + - tensor_scatter_elements - tensor_split - tile - - topk - - transpose - - transpose_ - - triangular_solve - tril - - tril_ - triu - - true_divide - - triu_ - - true_divide_ - - trunc - - trunc_ - - type_as + - transpose - unbind - - unflatten - - unfold - - unsafe_chunk + - unique + - unique_consecutive + - unique_with_pad + - unsorted_segment_max + - unsorted_segment_min + - unsorted_segment_prod + - unsorted_segment_sum - unsqueeze - - unsafe_split - - unsafe_split_with_sizes - - var - - vdot - - unsqueeze_ - - view_as - - xlogy - - xlogy_ + - unstack + - view_as_real + - vsplit + - vstack + - where + - cross + - renorm + - is_tensor + - scalar_cast + - scalar_to_tensor + - tuple_to_array + - clip_by_global_norm + - clip_by_value + - assign + - assign_add + - assign_sub + - scatter_add + - scatter_div + - scatter_max + - scatter_min + - scatter_mul + - scatter_nd_add + - scatter_nd_div + - scatter_nd_max + - scatter_nd_min + - scatter_nd_mul + - scatter_nd_sub + - scatter_update + - derivative + - jet -torch: - - _adaptive_avg_pool2d - - _add_relu - - _add_relu_ - - _aminmax - - _batch_norm_impl_index - - _convolution +tensor: + - __abs__ + - __add__ + - __and__ + - __bool__ + - __eq__ + - __ge__ + - __gt__ + - __iadd__ + - __ifloordiv__ + - __imatmul__ + - __imod__ + - __imul__ + - __isub__ + - __le__ + - __lt__ + - __matmul__ + - __mod__ + - __mul__ + - 
__ne__ + - __neg__ + - __or__ + - __pow__ + - __radd__ + - __rmatmul__ + - __rmod__ + - __rmul__ + - __rpow__ + - __rsub__ + - __sub__ + - __truediv__ + - __xor__ - abs - - abs_ - absolute - acos - - acos_ - acosh - - acosh_ - - adaptive_avg_pool1d - - adaptive_max_pool1d - add - addbmm - addcdiv - addcmul - addmm - addmv - - addmv_ - addr - - amax - - affine_grid_generator - - align_tensors - all - - alpha_dropout + - amax - amin - - alpha_dropout_ - - angle - any - - arange - arccos - - arccos_ - arccosh - - arccosh_ + - argmax + - angle - arcsin - - arcsin_ - arcsinh - - arcsinh_ - arctan - - arctan_ - arctanh - - arctanh_ - - argmax - argmin - argsort - asin - - asin_ - asinh - - asinh_ - atan - atan2 - - atan_ - atanh - - atanh_ - - atleast_1d - - atleast_2d - - atleast_3d - - avg_pool1d - baddbmm - - bartlett_window - - batch_norm_backward_elemt - - batch_norm_backward_reduce - - batch_norm_elemt - - batch_norm_gather_stats - - batch_norm_gather_stats_with_counts - bernoulli - - batch_norm_stats - - batch_norm_update_stats - - bilinear - bincount - - binomial - - binary_cross_entropy_with_logits - bitwise_and - - bitwise_not - bitwise_or - bitwise_xor - - blackman_window - - block_diag - bmm - - broadcast_tensors + - bool - broadcast_to - - cartesian_prod - - cat - - cdist - ceil - - ceil_ - - celu - - celu_ - - chain_matmul - - channel_shuffle - - cholesky - - cholesky_inverse - cholesky_solve - - choose_qparams_optimized - - chunk + - cholesky - clamp - - clamp_ - - clamp_max - - clamp_max_ - - clamp_min - - clamp_min_ - clip - - clip_ - - clone - - column_stack - - combinations - - constant_pad_nd - - conv1d - - conv2d - - conv3d - - conv_tbc - - conv_transpose1d - - conv_transpose2d - - conv_transpose3d - - cos - - convolution + - conj - copysign - - cos_ + - cos - cosh - - cosh_ - - cosine_embedding_loss - - cosine_similarity - - count_nonzero - cross - - ctc_loss - cummax - cummin - cumprod - cumsum - deg2rad - - deg2rad_ - - det - diag - - diag_embed - - diff - diagflat - - diagonal + - diff - digamma - - dist - div - divide - - dot - - dropout - - dropout_ - - dsmm - - dstack - - eig - - einsum - - embedding - - embedding_bag - - embedding_renorm_ - - eq - equal - erf - - erf_ - erfc - - erfc_ - erfinv - exp - - exp2 - - exp2_ - - exp_ + - expand_as - expm1 - - expm1_ - - eye - - feature_dropout - - feature_alpha_dropout - - feature_alpha_dropout_ - - feature_dropout_ - - fix - - fill_ - - fix_ - - flatten - flip - fliplr - flipud - float_power - floor - - floor_ - - floor_divide - - fmax - - fmin - fmod - frac - - frac_ - - full - - frobenius_norm - - full_like - - gather - - gcd - - gcd_ + - gather_elements - ge - geqrf - ger - greater - greater_equal - - grid_sampler - - grid_sampler_2d - - group_norm - - grid_sampler_3d - - gru - - gru_cell - gt - - hamming_window - - hann_window + - half - hardshrink - heaviside - - hinge_embedding_loss - histc - - hsmm - - hspmm - - hstack - hypot + - i0 - igamma - igammac + - imag - index_add - - index_copy - - inner - index_fill - index_put - - index_put_ - index_select - - instance_norm + - inner + - int + - inverse - isclose - isfinite - isinf - isnan + - is_complex + - is_signed - isneginf - isposinf - - istft - - kaiser_window - - kl_div - - kron - - kthvalue - - layer_norm + - isreal - lcm - - lcm_ - ldexp - - ldexp_ - le - lerp - less - less_equal - - lgamma - - linspace - log - log10 - - log10_ - log1p - - log1p_ - log2 - - log2_ - - log_softmax - - log_ - logaddexp - logaddexp2 - - logcumsumexp - logdet - logical_and - 
logical_not - logical_or - logical_xor - logit - - logit_ - - logspace - logsumexp - - lstm - - lstm_cell - - lstsq + - long - lt - - lu_solve - masked_fill - - margin_ranking_loss - masked_scatter - masked_select - - matrix_exp - matmul - - matrix_power - - matrix_rank - max - - max_pool1d - - max_pool2d - - max_pool1d_with_indices - - max_pool3d - maximum - mean - median - min - minimum - - mm - - mode - moveaxis - movedim - msort - - mul - multinomial - multiply - - mv - mvlgamma - nan_to_num - - nan_to_num_ - - nanmedian - nansum - narrow - - native_batch_norm - - native_group_norm - - narrow_copy - - native_layer_norm - - native_norm - ne - neg - negative - - neg_ - - negative_ + - nelement + - new_ones + - new_zeros - nextafter + - norm - nonzero - - norm_except_dim - - normal - not_equal - - nuclear_norm - - pairwise_distance - - pdist - - pinverse - - pixel_shuffle - - pixel_unshuffle - - poisson - - poisson_nll_loss - - polar - - polygamma + - ormqr + - permute - pow - - prelu - prod - - rad2deg - - promote_types - - rad2deg_ - - range + - qr - ravel - real - reciprocal - - relu - - reciprocal_ - - relu_ - remainder - renorm + - rad2deg + - tile - repeat_interleave - reshape - - resize_as_ - - roll - - rot90 + - reshape - round - - round_ - - rrelu - - rrelu_ + - rot90 - rsqrt - - row_stack - - rsqrt_ - - rsub - - saddmm - - scalar_tensor + - sum_to_size - scatter - - select - - scatter_add - - searchsorted - - selu - - selu_ - sgn + - short - sigmoid - - sigmoid_ - sign - signbit - sin - - sin_ - sinc - - sinc_ - sinh - - sinh_ - slogdet - - smm - - softmax - - solve - sort - - sparse_coo_tensor - - square - - split_with_sizes - - spmm + - split - sqrt - - sqrt_ - - square_ + - square - squeeze - - sspaddmm - - stack - std - - std_mean - - sub - subtract - - sum + - subtract - svd - swapaxes - swapdims - - symeig - t - take - tan - - tan_ - tanh - - tanh_ - - tensordot - - tensor_split - - threshold - - threshold_ + - trace + - swapaxes - tile + - to - topk - - transpose - - trapz - - triangular_solve - tril - - tril_indices - - triplet_margin_loss - - triu - - triu_indices + - tensor_split + - transpose - true_divide - trunc - - trunc_ + - unbind - unique_consecutive + - unsqueeze + - var + - view + - where - xlogy - - unbind - - unique_dim - - unsafe_chunk - - unsafe_split - - vander + - from_numpy + - std + - take - var - - vdot - - unsafe_split_with_sizes - - unsqueeze - - var_mean - - vstack + - all + - any + - copy + - diagonal + - flatten + - resize + - sum + +mint.ops: + - abs + - absolute_import + - add + - add_ex + - all + - any + - any_ex + - arange + - argmax + - avg_pool2d + - baddbmm + - baddbmm_ex + - batch_norm + - binary_cross_entropy_with_logits + - bitwise_and + - bitwise_or + - bitwise_xor + - bmm + - broadcast_to + - cat + - cat_ex + - ceil + - chunk + - clamp + - conv2d + - conv_transpose2d + - cos + - cross + - cummax + - cummin + - cumsum + - div + - divide + - dropout + - embedding + - eq + - erf + - erfinv + - exp + - flatten + - flip + - flip_ex + - fold + - full + - functional + - gather + - gelu + - greater + - grid_sample + - group_norm + - gt + - index_select + - interpolate + - isclose + - isfinite + - layer_norm + - le + - leaky_relu + - less + - less_equal + - linear + - linspace + - log + - logical_and + - logical_not + - logical_or + - lt + - masked_select + - matmul + - max + - max_pool2d + - maximum + - mean + - mean_ex + - min + - minimum + - mul + - ne + - neg + - negative + - nn + - nonzero + - normal + - one_hot + - ones + - ones_ex + - 
ones_like + - pad + - permute + - permute_ex + - pow + - prod + - reciprocal + - relu + - remainder + - repeat_interleave + - rsqrt + - scatter + - scatter_add + - searchsorted + - sigmoid + - silu + - sin + - softmax + - softplus + - sort + - split + - sqrt + - sqrt_ex + - square + - stack + - sub + - sub_ex + - sum + - tanh + - tile + - topk + - tril + - triu + - unfold + - unique - where - - xlogy_ + - xlogy + - zeros + - zeros_ex + - zeros_like + +mint.nn: + - Dropout + - Embedding + - Fold + - LayerNorm + - Linear + - MaxPool2d + - Unfold + - Upsample + +mint.nn.functional: + - absolute_import + - avg_pool2d + - batch_norm + - batch_norm_ex + - bce_with_logits + - binary_cross_entropy_with_logits + - conv_transpose2d + - dense + - dropout + - embedding + - fold + - gelu + - grid_sample + - group_norm + - interpolate + - layer_norm + - leaky_relu + - linear + - max_pool2d + - max_pool2d_ex + - normal + - one_hot + - one_hot_ext + - pad + - relu + - sigmoid + - silu + - softmax + - softmax_ex + - softplus + - tanh + - unfold diff --git a/debug/accuracy_tools/atat/mindspore/dump/kernel_graph_dump.py b/debug/accuracy_tools/msprobe/mindspore/dump/kernel_graph_dump.py similarity index 90% rename from debug/accuracy_tools/atat/mindspore/dump/kernel_graph_dump.py rename to debug/accuracy_tools/msprobe/mindspore/dump/kernel_graph_dump.py index f8a10ec1b1f690931871895a47014d44594ac80a..8320ee0906458734b29b9b911351739fefb77163 100644 --- a/debug/accuracy_tools/atat/mindspore/dump/kernel_graph_dump.py +++ b/debug/accuracy_tools/msprobe/mindspore/dump/kernel_graph_dump.py @@ -1,9 +1,9 @@ import os import json -from atat.core.utils import make_dump_path_if_not_exists -from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.core.log import print_info_log -from atat.core.file_check_util import FileOpen +from msprobe.core.common.utils import make_dump_path_if_not_exists +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.core.common.log import logger +from msprobe.core.common.file_check import FileOpen class KernelGraphDump: @@ -49,7 +49,7 @@ class KernelGraphDump: json_path = os.path.join(json_path, "kernel_graph_dump.json") with FileOpen(json_path, 'w') as f: json.dump(self.dump_json, f) - print_info_log(json_path + " has been created.") + logger.info(json_path + " has been created.") os.environ["MINDSPORE_DUMP_CONFIG"] = json_path if self.dump_json["common_dump_settings"]["dump_mode"] == 0: if self.dump_json["common_dump_settings"]["iteration"] != "all" or \ diff --git a/debug/accuracy_tools/atat/mindspore/ms_config.py b/debug/accuracy_tools/msprobe/mindspore/ms_config.py similarity index 95% rename from debug/accuracy_tools/atat/mindspore/ms_config.py rename to debug/accuracy_tools/msprobe/mindspore/ms_config.py index 0d846c4771caca64443e170d580268ffbbdeff8e..2b390ae9e4274d636688829878502b6f7d919544 100644 --- a/debug/accuracy_tools/atat/mindspore/ms_config.py +++ b/debug/accuracy_tools/msprobe/mindspore/ms_config.py @@ -1,6 +1,6 @@ import json -from atat.core.common_config import CommonConfig, BaseConfig -from atat.core.file_check_util import FileOpen +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.core.common.file_check import FileOpen class TensorConfig(BaseConfig): diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/__init__.py b/debug/accuracy_tools/msprobe/mindspore/overflow_check/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/__init__.py 
rename to debug/accuracy_tools/msprobe/mindspore/overflow_check/__init__.py diff --git a/debug/accuracy_tools/atat/mindspore/overflow_check/kernel_graph_overflow_check.py b/debug/accuracy_tools/msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py similarity index 84% rename from debug/accuracy_tools/atat/mindspore/overflow_check/kernel_graph_overflow_check.py rename to debug/accuracy_tools/msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py index 5ef005e59e8839e19f9af600c168343251580936..6640608735d98c17c1b544b58183224a1cd4ba55 100644 --- a/debug/accuracy_tools/atat/mindspore/overflow_check/kernel_graph_overflow_check.py +++ b/debug/accuracy_tools/msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py @@ -1,9 +1,9 @@ import os import json -from atat.core.utils import make_dump_path_if_not_exists -from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.core.log import print_warn_log, print_info_log -from atat.core.file_check_util import FileOpen +from msprobe.core.common.utils import make_dump_path_if_not_exists +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.core.common.log import logger +from msprobe.core.common.file_check import FileOpen class KernelGraphOverflowCheck: @@ -23,7 +23,7 @@ class KernelGraphOverflowCheck: self.dump_json["common_dump_settings"]["path"] = config.dump_path if len(config.step) > 0: - print_warn_log("Step would change to all in this task.") + logger.warning("Step would change to all in this task.") if len(config.rank) > 0: self.dump_json["common_dump_settings"]["support_device"] = config.rank if config.check_mode == "aicore": @@ -39,7 +39,7 @@ class KernelGraphOverflowCheck: json_path = os.path.join(json_path, "kernel_graph_overflow_check.json") with FileOpen(json_path, 'w') as f: json.dump(self.dump_json, f) - print_info_log(json_path + " has been created.") + logger.info(json_path + " has been created.") os.environ["MINDSPORE_DUMP_CONFIG"] = json_path if "MS_ACL_DUMP_CFG_PATH" in os.environ: del os.environ["MS_ACL_DUMP_CFG_PATH"] diff --git a/debug/accuracy_tools/atat/mindspore/overflow_check/overflow_check_tool_factory.py b/debug/accuracy_tools/msprobe/mindspore/overflow_check/overflow_check_tool_factory.py similarity index 81% rename from debug/accuracy_tools/atat/mindspore/overflow_check/overflow_check_tool_factory.py rename to debug/accuracy_tools/msprobe/mindspore/overflow_check/overflow_check_tool_factory.py index fe53359be1ba1ecb73fb84138228415f68e1c2ce..d809c714211aa34f41588bba332b55ed808b5376 100644 --- a/debug/accuracy_tools/atat/mindspore/overflow_check/overflow_check_tool_factory.py +++ b/debug/accuracy_tools/msprobe/mindspore/overflow_check/overflow_check_tool_factory.py @@ -1,5 +1,5 @@ -from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.mindspore.overflow_check.kernel_graph_overflow_check import KernelGraphOverflowCheck +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.overflow_check.kernel_graph_overflow_check import KernelGraphOverflowCheck class OverflowCheckToolFactory: diff --git a/debug/accuracy_tools/atat/mindspore/task_handler_factory.py b/debug/accuracy_tools/msprobe/mindspore/task_handler_factory.py similarity index 68% rename from debug/accuracy_tools/atat/mindspore/task_handler_factory.py rename to debug/accuracy_tools/msprobe/mindspore/task_handler_factory.py index 4f80e4e89c92156762ea0e4c4ed3302cc5c31f5f..7b7e6fd889c775a4491e824c1f73e6021cb99350 100644 --- 
a/debug/accuracy_tools/atat/mindspore/task_handler_factory.py +++ b/debug/accuracy_tools/msprobe/mindspore/task_handler_factory.py @@ -1,6 +1,6 @@ -from atat.mindspore.debugger.debugger_config import DebuggerConfig -from atat.mindspore.dump.dump_tool_factory import DumpToolFactory -from atat.mindspore.overflow_check.overflow_check_tool_factory import OverflowCheckToolFactory +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.dump_tool_factory import DumpToolFactory +from msprobe.mindspore.overflow_check.overflow_check_tool_factory import OverflowCheckToolFactory class TaskHandlerFactory: diff --git a/debug/accuracy_tools/atat/atat.py b/debug/accuracy_tools/msprobe/msprobe.py similarity index 74% rename from debug/accuracy_tools/atat/atat.py rename to debug/accuracy_tools/msprobe/msprobe.py index 799200ae41c76ac41be8e467910c19a772f9db74..698165b6150eabf63457cc23f4d80e7f58a5b423 100644 --- a/debug/accuracy_tools/atat/atat.py +++ b/debug/accuracy_tools/msprobe/msprobe.py @@ -15,19 +15,21 @@ import argparse import sys -from atat.pytorch.api_accuracy_checker.run_ut.run_ut import _run_ut_parser, run_ut_command -from ptdbg_ascend.src.python.ptdbg_ascend.parse_tool.cli import parse as cli_parse -from atat.pytorch.api_accuracy_checker.run_ut.multi_run_ut import prepare_config, run_parallel_ut -from atat.pytorch.api_accuracy_checker.compare.api_precision_compare import _api_precision_compare_parser, _api_precision_compare_command -from atat.pytorch.api_accuracy_checker.run_ut.run_overflow_check import _run_overflow_check_parser, _run_overflow_check_command +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import _run_ut_parser, run_ut_command +from msprobe.pytorch.parse_tool.cli import parse as cli_parse +from msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut import prepare_config, run_parallel_ut +from msprobe.pytorch.api_accuracy_checker.compare.api_precision_compare import _api_precision_compare_parser, \ + _api_precision_compare_command +from msprobe.pytorch.api_accuracy_checker.run_ut.run_overflow_check import _run_overflow_check_parser, \ + _run_overflow_check_command def main(): parser = argparse.ArgumentParser( formatter_class=argparse.RawDescriptionHelpFormatter, - description="atat(ascend training accuracy tools), [Powered by MindStudio].\n" - "Providing one-site accuracy difference debugging toolkit for training on Ascend Devices.\n" - f"For any issue, refer README.md first", + description="msprobe(mindstudio probe), [Powered by MindStudio].\n" + "Providing one-site accuracy difference debugging toolkit for training on Ascend Devices.\n" + f"For any issue, refer README.md first", ) parser.set_defaults(print_help=parser.print_help) parser.add_argument('-f', '--framework', required=True, choices=['pytorch'], @@ -62,4 +64,4 @@ def main(): if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/debug/accuracy_tools/atat/pytorch/__init__.py b/debug/accuracy_tools/msprobe/pytorch/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/advisor/advisor.py b/debug/accuracy_tools/msprobe/pytorch/advisor/advisor.py similarity index 83% rename from debug/accuracy_tools/atat/pytorch/advisor/advisor.py rename to debug/accuracy_tools/msprobe/pytorch/advisor/advisor.py index 5ae692a998d933af01b6d8d0ebf60896fd685886..b178664d9e37f7d6cafdca58218b75909ab9cfcc 100644 --- 
a/debug/accuracy_tools/atat/pytorch/advisor/advisor.py +++ b/debug/accuracy_tools/msprobe/pytorch/advisor/advisor.py @@ -16,13 +16,13 @@ """ import os -import pandas as pd - -from .advisor_result import AdvisorResult -from .advisor_const import AdvisorConst -from ...core.utils import CompareException, CompareConst, Const, print_info_log, print_warn_log, print_error_log -from ...core.file_check_util import FileChecker, FileCheckConst +from msprobe.pytorch.advisor.advisor_result import AdvisorResult +from msprobe.pytorch.advisor.advisor_const import AdvisorConst +from msprobe.pytorch.common.log import logger +from msprobe.core.common.utils import CompareException +from msprobe.core.common.file_check import FileChecker +from msprobe.core.common.const import Const, CompareConst, FileCheckConst class Advisor: """ @@ -32,40 +32,7 @@ class Advisor: def __init__(self, input_data, out_path=""): self.input_data = input_data self.out_path = os.path.realpath(out_path) - - def _parse_input_data(self): - data_columns = self.input_data.columns.values - if {CompareConst.ACCURACY, CompareConst.NPU_NAME}.issubset(data_columns): - self.file_type = Const.ALL - elif {CompareConst.RESULT, CompareConst.NPU_MD5}.issubset(data_columns): - self.file_type = Const.MD5 - elif {CompareConst.MAX_DIFF, CompareConst.RESULT}.issubset(data_columns): - self.file_type = Const.SUMMARY - else: - print_error_log('Compare result does not meet the required conditions.') - raise CompareException(CompareException.INVALID_DATA_ERROR) - df = self.input_data.reset_index() - return df - - def _check_path_vaild(self): - out_path_checker = FileChecker(self.out_path, FileCheckConst.DIR, FileCheckConst.WRITE_ABLE) - out_path_checker.common_check() - - def gen_advisor_message(self, node_name): - if AdvisorConst.FORWARD in node_name: - if AdvisorConst.INPUT in node_name: - message = AdvisorConst.FORWARD_INPUT_SUGGEST - else: - message = AdvisorConst.FORWARD_OUTPUT_SUGGEST - message = self.deterministic_advisor(message, node_name) - else: - if AdvisorConst.INPUT in node_name: - message = AdvisorConst.BACKWARD_INPUT_SUGGEST - else: - message = AdvisorConst.BACKWARD_OUTPUT_SUGGEST - message = self.deterministic_advisor(message, node_name) - message = self.batch_norm_advisor(message, node_name) - return message + self.file_type = None @staticmethod def deterministic_advisor(message, node_name): @@ -82,15 +49,16 @@ class Advisor: def analyze_unmatched(self, analyze_data): if self.file_type == Const.ALL: - accuracy_unmatched = analyze_data[analyze_data[CompareConst.ACCURACY] == CompareConst.ACCURACY_CHECK_UNMATCH] + accuracy_unmatched = analyze_data[ + analyze_data[CompareConst.ACCURACY] == CompareConst.ACCURACY_CHECK_UNMATCH] else: - accuracy_unmatched = analyze_data[(analyze_data[CompareConst.NPU_SHAPE] == CompareConst.NAN) | + accuracy_unmatched = analyze_data[(analyze_data[CompareConst.NPU_SHAPE] == CompareConst.NAN) | (analyze_data[CompareConst.BENCH_SHAPE] == CompareConst.NAN)] num_unmatch = len(accuracy_unmatched) if num_unmatch != 0: for i in range(len(accuracy_unmatched)): item = accuracy_unmatched.iloc[i] - print_warn_log("The tensor name matches but the shape or dtype does not match: {}" + logger.warning("The tensor name matches but the shape or dtype does not match: {}" .format(item[CompareConst.NPU_NAME])) def gen_advisor_result(self, pd_data): @@ -98,14 +66,30 @@ class Advisor: node_name = first_failing_data[CompareConst.NPU_NAME] index = first_failing_data['index'] message = self.gen_advisor_message(node_name) - print_warn_log("Find 
%s accuracy not reached, the line is %s" % (node_name, index)) + logger.warning("Find %s accuracy not reached, the line is %s" % (node_name, index)) result = AdvisorResult(node_name, index, message) return result + def gen_advisor_message(self, node_name): + if AdvisorConst.FORWARD in node_name: + if AdvisorConst.INPUT in node_name: + message = AdvisorConst.FORWARD_INPUT_SUGGEST + else: + message = AdvisorConst.FORWARD_OUTPUT_SUGGEST + message = self.deterministic_advisor(message, node_name) + else: + if AdvisorConst.INPUT in node_name: + message = AdvisorConst.BACKWARD_INPUT_SUGGEST + else: + message = AdvisorConst.BACKWARD_OUTPUT_SUGGEST + message = self.deterministic_advisor(message, node_name) + message = self.batch_norm_advisor(message, node_name) + return message + def analysis(self): self._check_path_vaild() analyze_data = self._parse_input_data() - print_info_log("Start analyzing the comparison result: %s" % self.file_type) + logger.info("Start analyzing the comparison result: %s" % self.file_type) self.analyze_unmatched(analyze_data) if self.file_type == Const.ALL: failing_data = analyze_data[analyze_data[CompareConst.ACCURACY] == CompareConst.ACCURACY_CHECK_NO] @@ -114,9 +98,27 @@ class Advisor: elif self.file_type == Const.SUMMARY: failing_data = analyze_data[analyze_data[CompareConst.RESULT] == CompareConst.WARNING] if failing_data.empty: - print_info_log("All data from api input/output accuracy reached") + logger.info("All data from api input/output accuracy reached") result = AdvisorResult(AdvisorConst.NO_ERROR_API, AdvisorConst.NO_ERROR_API, AdvisorConst.NO_ERR_SUGGEST) else: result = self.gen_advisor_result(failing_data) message_list = result.print_advisor_log() result.gen_summary_file(self.out_path, message_list) + + def _parse_input_data(self): + data_columns = self.input_data.columns.values + if {CompareConst.ACCURACY, CompareConst.NPU_NAME}.issubset(data_columns): + self.file_type = Const.ALL + elif {CompareConst.RESULT, CompareConst.NPU_MD5}.issubset(data_columns): + self.file_type = Const.MD5 + elif {CompareConst.MAX_DIFF, CompareConst.RESULT}.issubset(data_columns): + self.file_type = Const.SUMMARY + else: + logger.error('Compare result does not meet the required conditions.') + raise CompareException(CompareException.INVALID_DATA_ERROR) + df = self.input_data.reset_index() + return df + + def _check_path_vaild(self): + out_path_checker = FileChecker(self.out_path, FileCheckConst.DIR, FileCheckConst.WRITE_ABLE) + out_path_checker.common_check() diff --git a/debug/accuracy_tools/atat/pytorch/advisor/advisor_const.py b/debug/accuracy_tools/msprobe/pytorch/advisor/advisor_const.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/advisor/advisor_const.py rename to debug/accuracy_tools/msprobe/pytorch/advisor/advisor_const.py diff --git a/debug/accuracy_tools/atat/pytorch/advisor/advisor_result.py b/debug/accuracy_tools/msprobe/pytorch/advisor/advisor_result.py similarity index 79% rename from debug/accuracy_tools/atat/pytorch/advisor/advisor_result.py rename to debug/accuracy_tools/msprobe/pytorch/advisor/advisor_result.py index f8a16d2a7067d7ef2fa0746e32258f9da17624df..456f542e1f5bf867aa3db6a88e36dd03f8b581dc 100644 --- a/debug/accuracy_tools/atat/pytorch/advisor/advisor_result.py +++ b/debug/accuracy_tools/msprobe/pytorch/advisor/advisor_result.py @@ -17,9 +17,10 @@ import os import time -from .advisor_const import AdvisorConst -from ...core.utils import Const, print_info_log, print_error_log -from ...core.file_check_util import FileCheckConst, 
change_mode +from msprobe.pytorch.advisor.advisor_const import AdvisorConst +from msprobe.pytorch.common.log import logger +from msprobe.core.common.const import Const, FileCheckConst +from msprobe.core.common.file_check import change_mode class AdvisorResult: @@ -43,15 +44,15 @@ class AdvisorResult: output_file.writelines(message_list) change_mode(result_file, FileCheckConst.DATA_FILE_AUTHORITY) except IOError as io_error: - print_error_log("Failed to save %s, the reason is %s." % (result_file, io_error)) + logger.error("Failed to save %s, the reason is %s." % (result_file, io_error)) else: - print_info_log("The advisor summary is saved in: %s" % result_file) + logger.info("The advisor summary is saved in: %s" % result_file) def print_advisor_log(self): - print_info_log("The summary of the expert advice is as follows: ") + logger.info("The summary of the expert advice is as follows: ") message_list = [AdvisorConst.LINE + AdvisorConst.COLON + str(self.line), AdvisorConst.SUSPECT_NODES + AdvisorConst.COLON + self.suspect_node, AdvisorConst.ADVISOR_SUGGEST + AdvisorConst.COLON + self.advisor_message] for message in message_list: - print_info_log(message) + logger.info(message) return message_list diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/.keep b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/.keep similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/.keep rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/.keep diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/__init__.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/.keep b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/.keep similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/.keep rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/.keep diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/__init__.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/config.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/config.py similarity index 32% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/config.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/config.py index dd6607a81ec00ce635ffae6e41b4b9d18e090827..760e7c862dba5440412f5ee27d0345d1a17d2c5d 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/common/config.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/config.py @@ -1,10 +1,8 @@ import os import yaml -from ..common.utils import check_file_or_directory_path -from ..hook_module.utils import WrapFunctionalOps, WrapTensorOps, WrapTorchOps -from ...common.file_check import FileOpen - -WrapApi = set(WrapFunctionalOps) | set(WrapTensorOps) | set(WrapTorchOps) +from msprobe.pytorch.api_accuracy_checker.common.utils import 
check_file_or_directory_path +from msprobe.core.common.file_check import FileOpen +from msprobe.pytorch.pt_config import RunUTConfig class Config: @@ -14,63 +12,35 @@ class Config: config = yaml.safe_load(file) self.config = {key: self.validate(key, value) for key, value in config.items()} - def validate(self, key, value): + def __getattr__(self, item): + return self.config[item] + + def __str__(self): + return '\n'.join(f"{key}={value}" for key, value in self.config.items()) + + @staticmethod + def validate(key, value): validators = { - 'dump_path': str, - 'real_data': bool, - 'enable_dataloader': bool, - 'target_iter': list, 'white_list': list, + 'black_list': list, 'error_data_path': str, - 'jit_compile': bool, 'precision': int } if key not in validators: raise ValueError(f"{key} must be one of {validators.keys()}") if not isinstance(value, validators.get(key)): raise ValueError(f"{key} must be {validators[key].__name__} type") - if key == 'target_iter': - if not isinstance(value, list): - raise ValueError("target_iter must be a list type") - if any(isinstance(i, bool) for i in value): - raise ValueError("target_iter cannot contain boolean values") - if not all(isinstance(i, int) for i in value): - raise ValueError("All elements in target_iter must be of int type") - if any(i < 0 for i in value): - raise ValueError("All elements in target_iter must be greater than or equal to 0") if key == 'precision' and value < 0: raise ValueError("precision must be greater than 0") if key == 'white_list': - if not isinstance(value, list): - raise ValueError("white_list must be a list type") - if not all(isinstance(i, str) for i in value): - raise ValueError("All elements in white_list must be of str type") - invalid_api = [i for i in value if i not in WrapApi] - if invalid_api: - raise ValueError(f"{', '.join(invalid_api)} is not in support_wrap_ops.yaml, please check the white_list") + RunUTConfig.check_filter_list_config(key, value) + if key == 'black_list': + RunUTConfig.check_filter_list_config(key, value) + if key == 'error_data_path': + RunUTConfig.check_error_data_path_config(value) return value - def __getattr__(self, item): - return self.config[item] - - def __str__(self): - return '\n'.join(f"{key}={value}" for key, value in self.config.items()) - - def update_config(self, dump_path=None, real_data=None, target_iter=None, white_list=None, enable_dataloader=None): - args = { - "dump_path": dump_path if dump_path else self.config.get("dump_path", './'), - "real_data": real_data if real_data else self.config.get("real_data", False), - "target_iter": target_iter if target_iter else self.config.get("target_iter", [1]), - "white_list": white_list if white_list else self.config.get("white_list", []), - "enable_dataloader": enable_dataloader if enable_dataloader else self.config.get("enable_dataloader", False) - } - for key, value in args.items(): - if key in self.config: - self.config[key] = self.validate(key, value) - else: - raise ValueError(f"Invalid key '{key}'") - cur_path = os.path.dirname(os.path.dirname(os.path.realpath(__file__))) yaml_path = os.path.join(cur_path, "config.yaml") -msCheckerConfig = Config(yaml_path) \ No newline at end of file +msCheckerConfig = Config(yaml_path) diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/utils.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..b6e8932960c6ce15c65c83874720bd7d24f19909 --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/common/utils.py @@ -0,0 +1,225 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +import json +import os +import re +import csv + +import torch + +try: + import torch_npu +except ImportError: + IS_GPU = True +else: + IS_GPU = False + +from msprobe.pytorch.common.log import logger +from msprobe.core.common.file_check import FileChecker, FileOpen, change_mode, create_directory +from msprobe.core.common.const import Const, FileCheckConst +from msprobe.core.common.utils import CompareException + + +class DumpException(CompareException): + pass + + +def write_csv(data, filepath): + with FileOpen(filepath, 'a', encoding='utf-8-sig') as f: + writer = csv.writer(f) + writer.writerows(data) + + +def check_object_type(check_object, allow_type): + """ + Function Description: + Check if the object belongs to a certain data type + Parameter: + check_object: the object to be checked + allow_type: legal data type + Exception Description: + when invalid data throw exception + """ + if not isinstance(check_object, allow_type): + logger.error(f"{check_object} not of {allow_type} type") + raise CompareException(CompareException.INVALID_DATA_ERROR) + + +def check_file_or_directory_path(path, isdir=False): + """ + Function Description: + check whether the path is valid + Parameter: + path: the path to check + isdir: the path is dir or file + Exception Description: + when invalid data throw exception + """ + if isdir: + if not os.path.exists(path): + logger.error('The path {} is not exist.'.format(path)) + raise CompareException(CompareException.INVALID_PATH_ERROR) + + if not os.path.isdir(path): + logger.error('The path {} is not a directory.'.format(path)) + raise CompareException(CompareException.INVALID_PATH_ERROR) + + if not os.access(path, os.W_OK): + logger.error( + 'The path {} does not have permission to write. Please check the path permission'.format(path)) + raise CompareException(CompareException.INVALID_PATH_ERROR) + else: + if not os.path.isfile(path): + logger.error('{} is an invalid file or non-exist.'.format(path)) + raise CompareException(CompareException.INVALID_PATH_ERROR) + + if not os.access(path, os.R_OK): + logger.error( + 'The path {} does not have permission to read. Please check the path permission'.format(path)) + raise CompareException(CompareException.INVALID_PATH_ERROR) + + +def get_json_contents(file_path): + ops = get_file_content_bytes(file_path) + try: + json_obj = json.loads(ops) + except ValueError as error: + logger.error('Failed to load "%s". %s' % (file_path, str(error))) + raise CompareException(CompareException.INVALID_FILE_ERROR) from error + if not isinstance(json_obj, dict): + logger.error('Json file %s, content is not a dictionary!' 
% file_path) + raise CompareException(CompareException.INVALID_FILE_ERROR) + return json_obj + + +def get_file_content_bytes(file): + with FileOpen(file, 'rb') as file_handle: + return file_handle.read() + + +class SoftlinkCheckException(Exception): + pass + + +def check_need_convert(api_name): + convert_type = None + for key, value in Const.CONVERT_API.items(): + if api_name not in value: + continue + else: + convert_type = key + return convert_type + + +def api_info_preprocess(api_name, api_info_dict): + """ + Function Description: + Preprocesses the API information. + Parameter: + api_name: Name of the API. + api_info_dict: argument of the API. + Return api_info_dict: + convert_type: Type of conversion. + api_info_dict: Processed argument of the API. + """ + convert_type = check_need_convert(api_name) + if api_name == 'cross_entropy': + api_info_dict = cross_entropy_process(api_info_dict) + return convert_type, api_info_dict + + +def cross_entropy_process(api_info_dict): + """ + Function Description: + Preprocesses the cross_entropy API information. + Parameter: + api_info_dict: argument of the API. + Return api_info_dict: + api_info_dict: Processed argument of the API. + """ + if 'args' in api_info_dict and len(api_info_dict['args']) > 1 and 'Min' in api_info_dict['args'][1]: + if api_info_dict['args'][1]['Min'] <= 0: + # The second argument in cross_entropy should be -100 or not less than 0 + api_info_dict['args'][1]['Min'] = 0 + return api_info_dict + + +def initialize_save_path(save_path, dir_name): + data_path = os.path.join(save_path, dir_name) + if os.path.exists(data_path): + logger.warning(f"{data_path} already exists, it will be overwritten") + else: + os.mkdir(data_path, mode=FileCheckConst.DATA_DIR_AUTHORITY) + data_path_checker = FileChecker(data_path, FileCheckConst.DIR) + data_path_checker.common_check() + return data_path + + +def write_pt(file_path, tensor): + if os.path.exists(file_path): + raise ValueError(f"File {file_path} already exists") + torch.save(tensor, file_path) + full_path = os.path.realpath(file_path) + change_mode(full_path, FileCheckConst.DATA_FILE_AUTHORITY) + return full_path + + +def get_real_data_path(file_path): + targets = ['forward_real_data', 'backward_real_data', 'ut_error_data\d+'] + pattern = re.compile(r'({})'.format('|'.join(targets))) + match = pattern.search(file_path) + if match: + target_index = match.start() + target_path = file_path[target_index:] + return target_path + else: + raise DumpException(DumpException.INVALID_PATH_ERROR) + + +def get_full_data_path(data_path, real_data_path): + if not data_path: + return data_path + full_data_path = os.path.join(real_data_path, data_path) + return os.path.realpath(full_data_path) + + +class UtDataProcessor: + def __init__(self, save_path): + self.save_path = save_path + self.index = 0 + + def save_tensors_in_element(self, api_name, element): + self.index = 0 + self._save_recursive(api_name, element) + + def _save_recursive(self, api_name, element): + if isinstance(element, torch.Tensor): + api_args = api_name + Const.SEP + str(self.index) + create_directory(self.save_path) + file_path = os.path.join(self.save_path, f'{api_args}.pt') + write_pt(file_path, element.contiguous().cpu().detach()) + self.index += 1 + elif element is None or isinstance(element, (bool, int, float, str, slice)): + self.index += 1 + elif isinstance(element, (list, tuple)): + for item in element: + self._save_recursive(api_name, item) + elif isinstance(element, dict): + for value in element.values(): + 
self._save_recursive(api_name, value) + else: + self.index += 1 diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/__init__.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/algorithm.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/algorithm.py similarity index 88% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/algorithm.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/algorithm.py index 7983709f14bcca72a0cb29c453198396561681b1..1bb19cc048e88c9353a19069031bb8acfdae05e9 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/algorithm.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/algorithm.py @@ -1,7 +1,12 @@ # 定义比对算法及比对标准 import torch import numpy as np -from .compare_utils import CompareConst + +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import ULP_PARAMETERS +from msprobe.core.common.const import CompareConst + + +DEFAULT_THRESHOLD = 1 #cos @@ -188,3 +193,24 @@ def check_norm_value(normal_value_mask, rel_err, rtol): err_mask = np.logical_and(err_mask, normal_value_mask) err_cnt = np.sum(err_mask) return 0 if np.sum(normal_value_mask) == 0 else err_cnt / np.sum(normal_value_mask) + + +def get_ulp_err(bench_output, device_output, dtype): + parameters = ULP_PARAMETERS.get(dtype) + min_eb = parameters.get('min_eb', DEFAULT_THRESHOLD)[0] + exponent_num = parameters.get('exponent_num', DEFAULT_THRESHOLD)[0] + abs_bench = np.abs(bench_output) + eb = np.where(abs_bench == 0, 0, np.floor(np.log2(abs_bench))) + eb = np.maximum(eb, min_eb) + + if dtype == torch.float32: + ulp_err = calc_ulp_err(bench_output, device_output, eb, exponent_num, np.float64) + else: + ulp_err = calc_ulp_err(bench_output, device_output, eb, exponent_num, np.float32) + ulp_err = np.abs(ulp_err) + return ulp_err + + +def calc_ulp_err(bench_output, device_output, eb, exponent_num, data_type): + return (device_output.astype(data_type) - bench_output).astype(data_type) * \ + np.exp2(-eb + exponent_num).astype(data_type) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_compare.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py similarity index 54% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_compare.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py index 6a544de21a01321774c21aaa90397e3ea80fe7be..73bf7c2b8ebd59af8e31b9a7e9ad534f11717340 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py @@ -1,24 +1,36 @@ import argparse +import math import os import sys -import math from collections import namedtuple + +import torch import pandas as pd -from ..common.utils import print_info_log, print_warn_log, print_error_log, write_csv, \ - CompareException, create_directory -from ..common.config import msCheckerConfig -from ..compare.compare_utils import CompareConst, API_PRECISION_COMPARE_RESULT_FILE_NAME, \ +from 
msprobe.pytorch.api_accuracy_checker.common.utils import write_csv +from msprobe.pytorch.api_accuracy_checker.common.config import msCheckerConfig +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import API_PRECISION_COMPARE_RESULT_FILE_NAME, \ API_PRECISION_COMPARE_DETAILS_FILE_NAME, BENCHMARK_COMPARE_SUPPORT_LIST, API_PRECISION_COMPARE_UNSUPPORT_LIST, \ - ApiPrecisionCompareColumn, AbsoluteStandardApi, BinaryStandardApi, BINARY_COMPARE_UNSUPPORT_LIST, \ - convert_str_to_float, CompareMessage -from ..compare.compare_column import ApiPrecisionOutputColumn -from ..run_ut.run_ut import get_validated_result_csv_path -from ...common.file_check import FileCheckConst, FileChecker, change_mode, check_path_before_create + ApiPrecisionCompareColumn, AbsoluteStandardApi, BinaryStandardApi, ULPStandardApi, ThousandthStandardApi, \ + BINARY_COMPARE_UNSUPPORT_LIST, ULP_COMPARE_SUPPORT_LIST, convert_str_to_float, CompareMessage, is_inf_or_nan, \ + check_inf_or_nan +from msprobe.pytorch.api_accuracy_checker.compare.compare_column import ApiPrecisionOutputColumn +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import get_validated_result_csv_path +from msprobe.core.common.file_check import FileChecker, change_mode, check_path_before_create, create_directory +from msprobe.pytorch.common.log import logger +from msprobe.core.common.utils import CompareException +from msprobe.core.common.const import CompareConst, FileCheckConst CompareConfig = namedtuple('CompareConfig', ['npu_csv_path', 'gpu_csv_path', 'result_csv_path', 'details_csv_path']) +BenchmarkInf_Nan_Consistency = namedtuple('BenchmarkInf_Nan_Consistency', ['small_value_inf_nan_consistency', + 'rmse_inf_nan_consistency', + 'max_rel_inf_nan_consistency', + 'mean_rel_inf_nan_consistency', + 'eb_inf_nan_consistency']) unsupported_message = 'This data type does not support benchmark compare.' 
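# Illustrative sketch (reference only): the benchmark comparison below grades each NPU
# statistic by its ratio to the corresponding GPU statistic and then checks that ratio
# against per-algorithm thresholds (see benchmark_algorithms_thresholds and
# Standard._calc_ratio further down). This standalone rendition keeps only the ratio idea;
# the error threshold of 2 matches the small-value entry in this file, the warning
# threshold of 1.5 is a placeholder, all names are illustrative, and the real _calc_ratio
# additionally tracks inf/nan consistency between the two sides.
import math


def ratio_or_default(npu_value, gpu_value, default=10000.0):
    # npu / gpu, with the degenerate gpu == 0 cases handled explicitly
    if math.isclose(gpu_value, 0.0):
        return 1.0 if math.isclose(npu_value, 0.0) else default
    return abs(npu_value / gpu_value)


def grade(ratio, error_threshold=2.0, warning_threshold=1.5):
    if ratio > error_threshold:
        return 'error'
    if ratio > warning_threshold:
        return 'warning'
    return 'pass'


if __name__ == '__main__':
    print(grade(ratio_or_default(3.2e-4, 1.1e-4)))   # ratio ~2.9  -> 'error'
    print(grade(ratio_or_default(1.2e-4, 1.1e-4)))   # ratio ~1.09 -> 'pass'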
+DEFAULT_THRESHOLD = 1 + benchmark_algorithms_thresholds = { 'small_value': { 'error_threshold': 2, @@ -62,7 +74,38 @@ benchmark_message = { } -class BenchmarkStandard: +class Standard: + @staticmethod + def _calc_ratio(column_name, x, y, default_value): + ''' + 计算npu侧和gpu侧统计量的比值 + 输入: + column_name:统计量名称 + x:npu侧统计量 + y:gpu侧统计量 + default:当x不接近0,y接近0,设置的比值默认值 + 输出: + ratio:统计量x和y的比值 + inf_nan_consistency:不出现inf或nan时为True,出现inf或nan时必须同时为inf或-inf或nan才为True,否则为False + message:当出现inf或nan时的提示信息 + ''' + x, y = convert_str_to_float(x), convert_str_to_float(y) + + if is_inf_or_nan(x) or is_inf_or_nan(y): + return check_inf_or_nan(x, y, column_name) + + inf_nan_consistency = True + message = "" + if math.isclose(y, 0.0): + if math.isclose(x, 0.0): + return 1.0, inf_nan_consistency, message + else: + return default_value, inf_nan_consistency, message + else: + return abs(x / y), inf_nan_consistency, message + + +class BenchmarkStandard(Standard): def __init__(self, api_name, npu_precision, gpu_precision): self.api_name = api_name self.npu_precision = npu_precision @@ -79,19 +122,42 @@ class BenchmarkStandard: self.eb_status = CompareConst.PASS self.check_result_list = [] self.final_result = CompareConst.PASS + self.compare_message = "" def __str__(self): return "%s" % (self.api_name) + @staticmethod + def _get_status(ratio, algorithm): + if math.isnan(ratio) or math.isinf(ratio): + return CompareConst.PASS + error_threshold = benchmark_algorithms_thresholds.get(algorithm, {}).get('error_threshold', DEFAULT_THRESHOLD) + warning_threshold = benchmark_algorithms_thresholds.get(algorithm, {}).get('warning_threshold', + DEFAULT_THRESHOLD) + if ratio > error_threshold: + return CompareConst.ERROR + elif ratio > warning_threshold: + return CompareConst.WARNING + return CompareConst.PASS + def get_result(self): - self._compare_ratio() - self.small_value_err_status = self._get_status(self.small_value_err_ratio, 'small_value') + inf_nan_consistency = self._compare_ratio() + small_value_inf_nan_consistency = inf_nan_consistency.small_value_inf_nan_consistency + rmse_inf_nan_consistency = inf_nan_consistency.rmse_inf_nan_consistency + max_rel_inf_nan_consistency = inf_nan_consistency.max_rel_inf_nan_consistency + mean_rel_inf_nan_consistency = inf_nan_consistency.mean_rel_inf_nan_consistency + eb_inf_nan_consistency = inf_nan_consistency.eb_inf_nan_consistency + self.small_value_err_status = self._get_status(self.small_value_err_ratio, 'small_value') if \ + small_value_inf_nan_consistency else CompareConst.ERROR self.check_result_list.append(self.small_value_err_status) - self.rmse_status = self._get_status(self.rmse_ratio, 'rmse') + self.rmse_status = self._get_status(self.rmse_ratio, 'rmse') if rmse_inf_nan_consistency \ + else CompareConst.ERROR self.check_result_list.append(self.rmse_status) - self.max_rel_err_status = self._get_status(self.max_rel_err_ratio, 'max_rel_err') + self.max_rel_err_status = self._get_status(self.max_rel_err_ratio, 'max_rel_err') if max_rel_inf_nan_consistency \ + else CompareConst.ERROR self.check_result_list.append(self.max_rel_err_status) - self.mean_rel_err_status = self._get_status(self.mean_rel_err_ratio, 'mean_rel_err') + self.mean_rel_err_status = self._get_status(self.mean_rel_err_ratio, 'mean_rel_err') if mean_rel_inf_nan_consistency \ + else CompareConst.ERROR self.check_result_list.append(self.mean_rel_err_status) self.eb_status = self._get_status(self.eb_ratio, 'eb') if CompareConst.ERROR in self.check_result_list: @@ -99,43 +165,94 @@ class BenchmarkStandard: elif 
CompareConst.WARNING in self.check_result_list: self.final_result = CompareConst.WARNING - def _compare_ratio(self): - self.small_value_err_ratio = self._calc_ratio( - self.npu_precision.get(ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE), - self.gpu_precision.get(ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE), 10000.0) - self.rmse_ratio = self._calc_ratio(self.npu_precision.get(ApiPrecisionCompareColumn.RMSE), - self.gpu_precision.get(ApiPrecisionCompareColumn.RMSE), 10000.0) - self.max_rel_err_ratio = self._calc_ratio(self.npu_precision.get(ApiPrecisionCompareColumn.MAX_REL_ERR), - self.gpu_precision.get(ApiPrecisionCompareColumn.MAX_REL_ERR), - 10000.0) - self.mean_rel_err_ratio = self._calc_ratio(self.npu_precision.get(ApiPrecisionCompareColumn.MEAN_REL_ERR), - self.gpu_precision.get(ApiPrecisionCompareColumn.MEAN_REL_ERR), - 10000.0) - self.eb_ratio = self._calc_ratio(self.npu_precision.get(ApiPrecisionCompareColumn.EB), - self.gpu_precision.get(ApiPrecisionCompareColumn.EB), 10000.0) - def to_column_value(self): return [self.small_value_err_ratio, self.small_value_err_status, self.rmse_ratio, self.rmse_status, self.max_rel_err_ratio, self.max_rel_err_status, self.mean_rel_err_ratio, self.mean_rel_err_status, self.eb_ratio, self.eb_status] - @staticmethod - def _get_status(ratio, algorithm): - error_threshold = benchmark_algorithms_thresholds.get(algorithm).get('error_threshold') - warning_threshold = benchmark_algorithms_thresholds.get(algorithm).get('warning_threshold') - if ratio > error_threshold: - return CompareConst.ERROR - elif ratio > warning_threshold: - return CompareConst.WARNING - return CompareConst.PASS + def _compare_ratio(self): + + self.small_value_err_ratio, small_value_inf_nan_consistency, small_value_message = self._calc_ratio( + ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE, + self.npu_precision.get(ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE), + self.gpu_precision.get(ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE), 10000.0) + self.compare_message += small_value_message + self.rmse_ratio, rmse_inf_nan_consistency, rmse_message = self._calc_ratio(ApiPrecisionCompareColumn.RMSE, + self.npu_precision.get(ApiPrecisionCompareColumn.RMSE), + self.gpu_precision.get(ApiPrecisionCompareColumn.RMSE), 10000.0) + self.compare_message += rmse_message + self.max_rel_err_ratio, max_rel_inf_nan_consistency, max_rel_message = self._calc_ratio( + ApiPrecisionCompareColumn.MAX_REL_ERR, + self.npu_precision.get(ApiPrecisionCompareColumn.MAX_REL_ERR), + self.gpu_precision.get(ApiPrecisionCompareColumn.MAX_REL_ERR), 10000.0) + self.compare_message += max_rel_message + self.mean_rel_err_ratio, mean_rel_inf_nan_consistency, mean_rel_message = self._calc_ratio(ApiPrecisionCompareColumn.MEAN_REL_ERR, + self.npu_precision.get(ApiPrecisionCompareColumn.MEAN_REL_ERR), + self.gpu_precision.get(ApiPrecisionCompareColumn.MEAN_REL_ERR), 10000.0) + self.compare_message += mean_rel_message + self.eb_ratio, eb_inf_nan_consistency, eb_message = self._calc_ratio(ApiPrecisionCompareColumn.EB, + self.npu_precision.get(ApiPrecisionCompareColumn.EB), + self.gpu_precision.get(ApiPrecisionCompareColumn.EB), 10000.0) + self.compare_message += eb_message + + return BenchmarkInf_Nan_Consistency(small_value_inf_nan_consistency, rmse_inf_nan_consistency, + max_rel_inf_nan_consistency, mean_rel_inf_nan_consistency, eb_inf_nan_consistency) + + +class ULPStandard(Standard): + def __init__(self, api_name, npu_precision, gpu_precision): + self.api_name = api_name + self.npu_precision = npu_precision 
+ self.gpu_precision = gpu_precision + self.mean_ulp_err = 0 + self.ulp_err_proportion = 0 + self.ulp_err_proportion_ratio = 1 + self.ulp_err_status = CompareConst.PASS + self.compare_message = "" - @staticmethod - def _calc_ratio(x, y, default_value=1.0): - x, y = convert_str_to_float(x), convert_str_to_float(y) - if math.isclose(y, 0.0): - return 1.0 if math.isclose(x, 0.0) else default_value + def __str__(self): + return f"{self.api_name}" + + def get_result(self): + self.mean_ulp_err = convert_str_to_float(self.npu_precision.get(ApiPrecisionCompareColumn.MEAN_ULP_ERR)) + gpu_mean_ulp_err = convert_str_to_float(self.gpu_precision.get(ApiPrecisionCompareColumn.MEAN_ULP_ERR)) + inf_nan_consistency = True + if is_inf_or_nan(self.mean_ulp_err) or is_inf_or_nan(gpu_mean_ulp_err): + _, inf_nan_consistency, message = check_inf_or_nan(self.mean_ulp_err, gpu_mean_ulp_err, + ApiPrecisionCompareColumn.MEAN_ULP_ERR) + self.compare_message += message + self.ulp_err_proportion = convert_str_to_float( + self.npu_precision.get(ApiPrecisionCompareColumn.ULP_ERR_PROPORTION)) + self.ulp_err_proportion_ratio, ulp_inf_nan_consistency, message = self._calc_ratio( + ApiPrecisionCompareColumn.ULP_ERR_PROPORTION, + self.npu_precision.get(ApiPrecisionCompareColumn.ULP_ERR_PROPORTION), + self.gpu_precision.get(ApiPrecisionCompareColumn.ULP_ERR_PROPORTION), 10000.0) + inf_nan_consistency = inf_nan_consistency and ulp_inf_nan_consistency + self.compare_message += message + if inf_nan_consistency: + self.ulp_err_status = self._get_ulp_status(self.npu_precision.get(ApiPrecisionCompareColumn.DEVICE_DTYPE)) else: - return abs(x / y) + self.ulp_err_status = CompareConst.ERROR + + def _get_ulp_status(self, dtype): + if dtype == torch.float32: + if self.mean_ulp_err < 64: + return CompareConst.PASS + elif self.ulp_err_proportion < 0.05: + return CompareConst.PASS + elif self.ulp_err_proportion_ratio < 1: + return CompareConst.PASS + else: + self.compare_message += "ERROR: ULP误差不满足标准\n" + return CompareConst.ERROR + else: + if self.ulp_err_proportion < 0.001: + return CompareConst.PASS + elif self.ulp_err_proportion_ratio < 1: + return CompareConst.PASS + else: + self.compare_message += "ERROR: ULP误差不满足标准\n" + return CompareConst.ERROR def write_detail_csv(content, save_path): @@ -147,18 +264,18 @@ def write_detail_csv(content, save_path): def api_precision_compare(config): - print_info_log("Start compare task") - print_info_log(f"Compare task result will be saved in {config.result_csv_path}") - print_info_log(f"Compare task detail will be saved in {config.details_csv_path}") + logger.info("Start compare task") + logger.info(f"Compare task result will be saved in {config.result_csv_path}") + logger.info(f"Compare task detail will be saved in {config.details_csv_path}") try: npu_data = pd.read_csv(config.npu_csv_path) except Exception as err: - print_error_log(f"Open npu csv Error: %s" % str(err)) + logger.error(f"Open npu csv Error: %s" % str(err)) check_csv_columns(npu_data.columns, "npu_csv") try: gpu_data = pd.read_csv(config.gpu_csv_path) except Exception as err: - print_error_log(f"Open gpu csv Error: %s" % str(err)) + logger.error(f"Open gpu csv Error: %s" % str(err)) check_csv_columns(gpu_data.columns, "gpu_csv") detail_csv_title = [ApiPrecisionCompareColumn.get_detail_csv_title()] result_csv_title = [ApiPrecisionCompareColumn.get_result_csv_title()] @@ -167,7 +284,7 @@ def api_precision_compare(config): try: analyse_csv(npu_data, gpu_data, config) except Exception as err: - print_error_log(f"Analyse csv Error: %s" 
% str(err)) + logger.error(f"Analyse csv Error: %s" % str(err)) change_mode(config.result_csv_path, FileCheckConst.DATA_FILE_AUTHORITY) change_mode(config.details_csv_path, FileCheckConst.DATA_FILE_AUTHORITY) @@ -182,26 +299,37 @@ def analyse_csv(npu_data, gpu_data, config): row_gpu = gpu_data[gpu_data[ApiPrecisionCompareColumn.API_NAME] == full_api_name_with_direction_status] _, api_name, _, direction_status, _, _ = full_api_name_with_direction_status.split(".") if row_gpu.empty: - print_warn_log(f'This API : {full_api_name_with_direction_status} does not exist in the GPU data.') + logger.warning(f'This API : {full_api_name_with_direction_status} does not exist in the GPU data.') continue if len(row_gpu) > 1: msg = f'This API : {full_api_name_with_direction_status} has multiple records in the GPU data.' raise CompareException(CompareException.INVALID_DATA_ERROR, msg) row_gpu = row_gpu.iloc[0] + new_status = CompareConst.SPACE # 当前API的输出为空(例如反向过程中requires_grad=False),跳过比对 if row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE].isspace(): - continue - new_status = CompareConst.SPACE - compare_column.api_name = full_api_name_with_direction_status - if row_npu[ - ApiPrecisionCompareColumn.DEVICE_DTYPE] not in BINARY_COMPARE_UNSUPPORT_LIST or api_name in BinaryStandardApi: - new_status = record_binary_consistency_result(api_name, compare_column, row_npu) - elif api_name in AbsoluteStandardApi: - new_status = record_absolute_threshold_result(compare_column, row_npu) - elif row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] in BENCHMARK_COMPARE_SUPPORT_LIST: - bs = BenchmarkStandard(full_api_name_with_direction_status, row_npu, row_gpu) - new_status = record_benchmark_compare_result(compare_column, bs) - write_detail_csv(compare_column.to_column_value(), config.details_csv_path) + compare_column.api_name = full_api_name_with_direction_status + compare_column.compare_result = CompareConst.SKIP + compare_column.compare_message = row_npu[ApiPrecisionCompareColumn.MESSAGE] + new_status = CompareConst.SKIP + write_detail_csv(compare_column.to_column_value(), config.details_csv_path) + else: + compare_column.api_name = full_api_name_with_direction_status + if api_name in ThousandthStandardApi: + new_status = record_thousandth_threshold_result(compare_column, row_npu) + elif row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] not in BINARY_COMPARE_UNSUPPORT_LIST or \ + api_name in BinaryStandardApi: + new_status = record_binary_consistency_result(api_name, compare_column, row_npu) + elif api_name in AbsoluteStandardApi: + new_status = record_absolute_threshold_result(compare_column, row_npu) + elif api_name in ULPStandardApi and \ + row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] in ULP_COMPARE_SUPPORT_LIST: + us = ULPStandard(full_api_name_with_direction_status, row_npu, row_gpu) + new_status = record_ulp_compare_result(compare_column, us) + elif row_npu[ApiPrecisionCompareColumn.DEVICE_DTYPE] in BENCHMARK_COMPARE_SUPPORT_LIST: + bs = BenchmarkStandard(full_api_name_with_direction_status, row_npu, row_gpu) + new_status = record_benchmark_compare_result(compare_column, bs) + write_detail_csv(compare_column.to_column_value(), config.details_csv_path) if last_api_name is not None and api_name != last_api_name: if last_api_dtype in API_PRECISION_COMPARE_UNSUPPORT_LIST: @@ -229,7 +357,7 @@ def analyse_csv(npu_data, gpu_data, config): elif direction_status == 'backward': backward_status.append(new_status) else: - print_error_log(f"Invalid direction status: {direction_status}") + logger.error(f"Invalid direction status: 
{direction_status}") if last_api_name is not None: if last_api_dtype in API_PRECISION_COMPARE_UNSUPPORT_LIST: @@ -274,6 +402,8 @@ def get_absolute_threshold_result(row_npu): def get_api_checker_result(status): if not status: return CompareConst.SPACE + if all(item == CompareConst.SKIP for item in status): + return CompareConst.SKIP for const in (CompareConst.ERROR, CompareConst.WARNING): if const in status: return const @@ -284,7 +414,7 @@ def check_csv_columns(columns, csv_type): required_columns = ApiPrecisionCompareColumn.to_required_columns() missing_columns = [column for column in required_columns if column not in columns] if missing_columns: - msg = f"The followint columns {','.join(missing_columns)} are missing in{csv_type}" + msg = f"The following columns {','.join(missing_columns)} are missing in{csv_type}" raise CompareException(CompareException.INVALID_DATA_ERROR, msg) @@ -337,11 +467,39 @@ def record_benchmark_compare_result(compare_column, bs): compare_column.eb_status = bs.eb_status compare_column.compare_result = bs.final_result compare_column.compare_algorithm = "标杆比对法" - message = '' + compare_column.compare_message = bs.compare_message for status_attr, messages in benchmark_message.items(): status_value = getattr(compare_column, status_attr) if status_value in messages: - message += messages[status_value] + compare_column.compare_message += messages[status_value] + return compare_column.compare_result + + +def record_ulp_compare_result(compare_column, us): + us.get_result() + compare_column.mean_ulp_err = us.mean_ulp_err + compare_column.ulp_err_proportion = us.ulp_err_proportion + compare_column.ulp_err_proportion_ratio = us.ulp_err_proportion_ratio + compare_column.ulp_err_status = us.ulp_err_status + compare_column.compare_result = us.ulp_err_status + compare_column.compare_algorithm = "ULP误差比对法" + compare_column.compare_message = us.compare_message + return compare_column.compare_result + + +def check_thousandth_rate(thousandth_rate): + return CompareConst.PASS if convert_str_to_float(thousandth_rate) >= 0.999 else CompareConst.ERROR + + +def record_thousandth_threshold_result(compare_column, row_npu): + new_status = check_thousandth_rate(row_npu[ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH]) + compare_column.rel_err_thousandth = row_npu[ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH] + compare_column.rel_err_thousandth_status = new_status + compare_column.compare_result = new_status + compare_column.compare_algorithm = "双千指标法" + message = '' + if compare_column.rel_err_thousandth_status == CompareConst.ERROR: + message += "ERROR: 双千指标不达标\n" compare_column.compare_message = message return compare_column.compare_result @@ -384,4 +542,4 @@ def _api_precision_compare_parser(parser): if __name__ == '__main__': _api_precision_compare() - print_info_log("Compare task completed.") + logger.info("Compare task completed.") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml similarity index 86% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml index efba9c5c02bbcc094b75ce2497d830789744b143..91170bd525b6cb8e7cb531a3c6e88888f9a628eb 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml +++ 
b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml @@ -106,3 +106,28 @@ BinaryCompareStandard: - tril_ - triu - triu_ + - type_as + +ULPStandard: + - __matmul__ + - addbmm + - addbmm_ + - addmm + - addmm_ + - baddbmm + - baddbmm_ + - bilinear + - bmm + - chain_matmul + - hspmm + - linear + - matmul + - mm + - mv + - smm + - sspaddmm + +ThousandthStandard: + - conv1d + - conv2d + \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_threshold.yaml b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_threshold.yaml new file mode 100644 index 0000000000000000000000000000000000000000..cac98d014ff0da3bd7302099c72e86dab79ed018 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/api_precision_threshold.yaml @@ -0,0 +1,390 @@ +mul: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +mul_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__mul__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__imul__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__rmul__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +add: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +add_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__add__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__iadd__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__radd__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +div: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + 
small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +div_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__div__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__idiv__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +divide: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +divide_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +leaky_relu: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +leaky_relu_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +prelu: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +reciprocal: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +reciprocal_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +rsqrt: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +rsqrt_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +square: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +square_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 
1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +sub: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +sub_: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +rsub: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__isub__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 +__sub__: + torch.float32: + rtol: 1.0e-6 + small_value: 1.0e-6 + small_value_atol: 1.0e-9 + torch.float16: + rtol: 1.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 + torch.bfloat16: + rtol: 4.0e-3 + small_value: 1.0e-3 + small_value_atol: 1.0e-5 diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare.py similarity index 67% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare.py index 67aa69e209b7bdbdfb7a5db937bf8e5af8d1b8c8..ee49588288efc0a33c086913cc5624059de82272 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare.py @@ -1,20 +1,29 @@ # 进行比对及结果展示 import os -import csv +from collections import namedtuple import torch import numpy as np -from rich.table import Table -from rich.console import Console -from ..common.utils import get_json_contents, write_csv, print_warn_log, Const -from ..compare.compare_utils import CompareConst, check_dtype_comparable, DETAIL_TEST_ROWS, \ - precision_configs, BENCHMARK_COMPARE_SUPPORT_LIST, AbsoluteStandardApi, BinaryStandardApi, apis_threshold -from ..compare.compare_column import CompareColumn -from ..compare.algorithm import get_rmse, get_error_balance, get_max_rel_err, get_mean_rel_err, \ - get_rel_err, get_abs_err, get_max_abs_err, get_rel_err_ratio, cosine_sim, get_rel_err_origin, \ +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.api_accuracy_checker.common.utils import get_json_contents, write_csv +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import check_dtype_comparable, \ + DETAIL_TEST_ROWS, precision_configs, BENCHMARK_COMPARE_SUPPORT_LIST, AbsoluteStandardApi, BinaryStandardApi, \ + ULPStandardApi, ThousandthStandardApi, apis_threshold +from msprobe.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn +from msprobe.pytorch.api_accuracy_checker.compare.algorithm import get_rmse, get_error_balance, get_max_rel_err, \ + get_mean_rel_err, get_rel_err, get_abs_err, get_max_abs_err, get_rel_err_ratio, cosine_sim, get_rel_err_origin, \ get_small_value_err_ratio, get_finite_and_infinite_mask, get_small_value_mask, 
check_inf_nan_value, \ - check_small_value, check_norm_value, get_abs_bench_with_eps -from ..common.config import msCheckerConfig -from ...common.file_check import FileOpen + check_small_value, check_norm_value, get_abs_bench_with_eps, get_ulp_err +from msprobe.pytorch.api_accuracy_checker.common.config import msCheckerConfig +from msprobe.core.common.const import Const, CompareConst + + +ResultInfo = namedtuple('ResultInfo', ['full_api_name', 'fwd_success_status', 'bwd_success_status', + 'fwd_compare_alg_results', 'bwd_compare_alg_results', 'rank']) + + +INDEX_TEST_RESULT__GROUP = 3 +INDEX_FIRST_GROUP = 0 +INDEX_MESSAGE = -1 class Comparator: @@ -34,85 +43,48 @@ class Comparator: else: self.stack_info = None - self.test_result_cnt = { - "forward_fail_num": 0, "backward_fail_num": 0, "forward_and_backward_fail_num": 0, "success_num": 0, - "total_num": 0, "forward_or_backward_fail_num": 0 - } - - def print_pretest_result(self): - self.get_statistics_from_result_csv() - total_tests = self.test_result_cnt.get("total_num", 0) - if total_tests != 0: - passing_rate = "{:.2%}".format(self.test_result_cnt.get("success_num", 0) / total_tests) + @staticmethod + def print_pretest_result(): + logger.info("Successfully completed run_ut/multi_run_ut.") + + @staticmethod + def _compare_dropout(bench_output, device_output): + tensor_num = bench_output.numel() + if tensor_num >= 100: + if abs((bench_output == 0).sum() - (device_output == 0).cpu().sum()) / tensor_num < 0.1: + return CompareConst.PASS, 1 + else: + return CompareConst.ERROR, 0 else: - passing_rate = "0%" - - print_warn_log("The follwing tables will be deprecated in the future." - "The following results are for reference only.") - console = Console() - table_total = Table( - show_header=True, title="Overall Statistics", show_lines=True, width=75 - ) - table_total.add_column("Result") - table_total.add_column("Statistics") - table_total.add_row("[green]Pass[/green]", str(self.test_result_cnt.get("success_num", 0))) - table_total.add_row("[yellow]Warning[/yellow]", str(self.test_result_cnt.get("warning_num", 0))) - table_total.add_row("[red]Error[/red]", str(self.test_result_cnt.get("error_num", 0))) - table_total.add_row("Passing Rate", passing_rate) - table_total.add_row("Skip Tests", str(self.test_result_cnt.get("total_skip_num", 0))) - - table_detail = Table( - show_header=True, title="Detail Statistics", show_lines=True, width=75 - ) - table_detail.add_column("Result") - table_detail.add_column("Statistics") - table_detail.add_row("Forward Error", str(self.test_result_cnt.get("forward_fail_num", 0))) - table_detail.add_row("Backward Error", str(self.test_result_cnt.get("backward_fail_num", 0))) - table_detail.add_row("Both Forward & Backward Error", str(self.test_result_cnt.get("forward_and_backward_fail_num", 0))) + return CompareConst.PASS, 1 - console.print(table_total) - console.print(table_detail) + @staticmethod + def _compare_builtin_type(bench_output, device_output, compare_column): + if not isinstance(bench_output, (bool, int, float, str)): + return CompareConst.PASS, compare_column, "" + if bench_output != device_output: + return CompareConst.ERROR, compare_column, "" + compare_column.error_rate = 0 + return CompareConst.PASS, compare_column, "" - def get_statistics_from_result_csv(self): - checklist = [CompareConst.PASS, CompareConst.ERROR, CompareConst.WARNING, CompareConst.SPACE, CompareConst.SKIP, "skip"] - self.test_result_cnt = { - "success_num": 0, "warning_num": 0, "error_num": 0, - "forward_fail_num": 0, 
"backward_fail_num": 0, "forward_and_backward_fail_num": 0, - "total_num": 0, "total_skip_num": 0 - } - with FileOpen(self.save_path, 'r') as file: - reader = csv.reader(file) - result_csv_rows = [row for row in reader] - result_csv_name = os.path.basename(self.save_path) - for item in result_csv_rows[1:]: - if not isinstance(item, list) or len(item) < 3: - raise ValueError("The number of columns in %s is incorrect" % result_csv_name) - if not all(item[i] and item[i] in checklist for i in (1, 2)): - raise ValueError( - "The value in the 2nd or 3rd column of %s is wrong, it must be pass, error, warning, skip, or SPACE" - % result_csv_name) - column1 = item[1] - column2 = item[2] - if column1.upper() == CompareConst.SKIP: - self.test_result_cnt["total_skip_num"] += 1 - continue - self.test_result_cnt["total_num"] += 1 - if column1 == CompareConst.PASS and column2 in [CompareConst.PASS, CompareConst.SPACE]: - self.test_result_cnt['success_num'] += 1 - elif column1 == CompareConst.ERROR and column2 == CompareConst.ERROR: - self.test_result_cnt['forward_and_backward_fail_num'] += 1 - self.test_result_cnt['error_num'] += 1 - elif column1 == CompareConst.ERROR: - self.test_result_cnt['forward_fail_num'] += 1 - self.test_result_cnt['error_num'] += 1 - elif column2 == CompareConst.ERROR: - self.test_result_cnt['backward_fail_num'] += 1 - self.test_result_cnt['error_num'] += 1 - elif column1 == CompareConst.WARNING or column2 == CompareConst.WARNING: - self.test_result_cnt['warning_num'] += 1 + @staticmethod + def _compare_bool_tensor(bench_output, device_output): + error_nums = (bench_output != device_output).sum() + if bench_output.size == 0: + return CompareConst.NAN, CompareConst.ERROR, "There is not bench calculation result." + error_rate = float(error_nums / bench_output.size) + result = CompareConst.PASS if error_rate == 0 else CompareConst.ERROR + return error_rate, result, "" + + @staticmethod + def _get_absolute_threshold_attribute(api_name, dtype): + small_value_threshold = apis_threshold.get(api_name).get(dtype).get('small_value') + small_value_atol = apis_threshold.get(api_name).get(dtype).get('small_value_atol') + rtol = apis_threshold.get(api_name).get(dtype).get('rtol') + return small_value_threshold, small_value_atol, rtol def write_csv_title(self): - summary_test_rows = [[self.COLUMN_API_NAME, self.COLUMN_FORWARD_SUCCESS, + summary_test_rows = [[self.COLUMN_API_NAME, self.COLUMN_FORWARD_SUCCESS, self.COLUMN_BACKWARD_SUCCESS, "Message"]] if not os.path.exists(self.save_path): write_csv(summary_test_rows, self.save_path) @@ -125,9 +97,9 @@ class Comparator: test_rows[0].append(self.COLUMN_STACK_INFO) name = test_result[0] - df_row = list(test_result[:3]) - if test_result[1] == "SKIP" or test_result[2] == "SKIP": - df_row.append(test_result[3]) + df_row = list(test_result[:INDEX_TEST_RESULT__GROUP]) + if test_result[1] == "SKIP": + df_row.append(test_result[INDEX_TEST_RESULT__GROUP][INDEX_FIRST_GROUP][INDEX_MESSAGE]) if self.stack_info: stack_info = "\n".join(self.stack_info[name]) df_row.append(stack_info) @@ -143,46 +115,65 @@ class Comparator: if isinstance(fwd_result, list): for i, test_subject in enumerate(fwd_result): subject = subject_prefix + ".forward.output." 
+ str(i) - test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision) + test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision) if isinstance(item, float) else item for item in test_subject] test_rows.append([subject] + list(test_subject)) if isinstance(bwd_result, list): for i, test_subject in enumerate(bwd_result): subject = subject_prefix + ".backward.output." + str(i) - test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision) + test_subject = ["{:.{}f}".format(item, msCheckerConfig.precision) if isinstance(item, float) else item for item in test_subject] test_rows.append([subject] + list(test_subject)) write_csv(test_rows, self.detail_save_path) - def record_results(self, *args): + def record_results(self, args): self.write_summary_csv(args) self.write_detail_csv(args) - def compare_output(self, full_api_name, bench_output, device_output, bench_grad=None, npu_grad=None): + def compare_output(self, full_api_name, data_info): _, api_name, _ = full_api_name.split(Const.SEP) - compare_func = self._compare_dropout if "dropout" in full_api_name else self._compare_core_wrapper - fwd_success_status, fwd_compare_alg_results = compare_func(api_name, bench_output, device_output) - if not (bench_grad and npu_grad): + bench_output, device_output = data_info.bench_output, data_info.device_output + bench_grad, device_grad = data_info.bench_grad, data_info.device_grad + backward_message = data_info.backward_message + if "dropout" in full_api_name: + fwd_success_status, fwd_compare_alg_results = self._compare_dropout(bench_output, device_output) + else: + fwd_success_status, fwd_compare_alg_results = self._compare_core_wrapper(api_name, bench_output, + device_output) + if not (bench_grad and device_grad): bwd_success_status, bwd_compare_alg_results = (CompareConst.SPACE, []) else: if "dropout" in full_api_name: - bwd_success_status, bwd_compare_alg_results = compare_func(api_name, bench_grad[0], npu_grad[0]) + bwd_success_status, bwd_compare_alg_results = self._compare_dropout(bench_grad[0], device_grad[0]) else: - bwd_success_status, bwd_compare_alg_results = compare_func(api_name, bench_grad, npu_grad) - self.record_results(full_api_name, fwd_success_status, bwd_success_status if bwd_compare_alg_results is not None else CompareConst.SPACE, fwd_compare_alg_results, bwd_compare_alg_results) + bwd_success_status, bwd_compare_alg_results = self._compare_core_wrapper(api_name, bench_grad, + device_grad) + if backward_message: + backward_column = CompareColumn() + bwd_compare_alg_results = [backward_column.to_column_value(CompareConst.SKIP, backward_message)] + else: + bwd_success_status = bwd_success_status if bwd_compare_alg_results is not None else CompareConst.SPACE + result_info = ResultInfo(full_api_name, + fwd_success_status, + bwd_success_status, + fwd_compare_alg_results, + bwd_compare_alg_results, + data_info.rank) + self.record_results(result_info) return fwd_success_status == CompareConst.PASS, bwd_success_status == CompareConst.PASS \ - or bwd_success_status == CompareConst.SPACE + or bwd_success_status == CompareConst.SPACE def _compare_core_wrapper(self, api_name, bench_output, device_output): detailed_result_total = [] test_final_success = CompareConst.PASS if isinstance(bench_output, (list, tuple)): status, compare_result, message = [], [], [] - if len(bench_output) != len(device_output): + if len(bench_output) > len(device_output): status = [CompareConst.ERROR] message = ["bench and npu output structure is different."] else: + device_output = 
device_output[:len(bench_output)] for b_out_i, n_out_i in zip(bench_output, device_output): status_i, compare_result_i, message_i = self._compare_core(api_name, b_out_i, n_out_i) status.append(status_i) @@ -214,7 +205,7 @@ class Comparator: if b_keys != n_keys: return CompareConst.ERROR, compare_column, "bench and npu output dict keys are different." else: - status, compare_result, message = self._compare_core(api_name, list(bench_output.values()), + status, compare_result, message = self._compare_core(api_name, list(bench_output.values()), list(device_output.values())) elif isinstance(bench_output, torch.Tensor): copy_bench_out = bench_output.detach().clone() @@ -223,7 +214,7 @@ class Comparator: compare_column.npu_type = str(copy_device_output.dtype) compare_column.shape = tuple(device_output.shape) status, compare_result, message = self._compare_torch_tensor(api_name, copy_bench_out, copy_device_output, - compare_column) + compare_column) elif isinstance(bench_output, (bool, int, float, str)): compare_column.bench_type = str(type(bench_output)) compare_column.npu_type = str(type(device_output)) @@ -231,7 +222,7 @@ class Comparator: elif bench_output is None: return CompareConst.SKIP, compare_column, "Bench output is None, skip this test." else: - return CompareConst.PASS, compare_column, + return CompareConst.PASS, compare_column, "Unexpected output type in compare_core: {}".format(type(bench_output)) return status, compare_result, message @@ -247,28 +238,32 @@ class Comparator: device_output = device_output.cpu().numpy() if cpu_shape != npu_shape: return CompareConst.ERROR, compare_column, f"The shape of bench{str(cpu_shape)} " \ - f"and npu{str(npu_shape)} not equal." + f"and npu{str(npu_shape)} not equal." if not check_dtype_comparable(bench_output, device_output): return CompareConst.ERROR, compare_column, f"Bench out dtype is {bench_output.dtype} but " \ - f"npu output dtype is {device_output.dtype}, cannot compare." + f"npu output dtype is {device_output.dtype}, cannot compare." message = "" - if bench_output.dtype in [bool, np.uint8, np.int8, np.int16, np.uint16, np.uint32, np.int32, + if bench_output.dtype in [bool, np.uint8, np.int8, np.int16, np.uint16, np.uint32, np.int32, np.int64, np.uint64]: message += f"Compare algorithm is not supported for {bench_output.dtype} data. " \ - f"Only judged by Error Rate." + f"Only judged by Error Rate." 
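# Illustrative sketch (reference only): for bool/integer outputs the comparison falls back
# to a pure error-rate check, counting the elements that differ and passing only when none
# do, which is what _compare_bool_tensor() implements. Self-contained numpy version; the
# function name and return convention here are illustrative.
import numpy as np


def error_rate_check(bench, device):
    bench, device = np.asarray(bench), np.asarray(device)
    if bench.size == 0:
        return float('nan'), 'error'   # no benchmark result to judge against
    error_rate = float((bench != device).sum() / bench.size)
    return error_rate, ('pass' if error_rate == 0 else 'error')


if __name__ == '__main__':
    print(error_rate_check(np.array([1, 2, 3]), np.array([1, 2, 4])))   # (0.333..., 'error')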
err_rate, status, msg = self._compare_bool_tensor(bench_output, device_output) message += msg + "\n" compare_column.error_rate = err_rate return status, compare_column, message else: - status, compare_column, message = self._compare_float_tensor(api_name, bench_output, device_output, + status, compare_column, message = self._compare_float_tensor(api_name, bench_output, device_output, compare_column, npu_dtype) return status, compare_column, message - + def _compare_float_tensor(self, api_name, bench_output, device_output, compare_column, dtype): message = "" abs_bench, abs_bench_with_eps = get_abs_bench_with_eps(bench_output, dtype) abs_err = get_abs_err(bench_output, device_output) + rel_err_orign = get_rel_err_origin(abs_err, abs_bench_with_eps) + if api_name in ThousandthStandardApi: + thousand_res, thousand_status = get_rel_err_ratio(rel_err_orign, CompareConst.THOUSAND_RATIO_THRESHOLD) + compare_column.rel_err_thousandth = thousand_res if str(dtype) in BENCHMARK_COMPARE_SUPPORT_LIST: both_finite_mask, inf_nan_mask = get_finite_and_infinite_mask(bench_output, device_output) if api_name in BinaryStandardApi: @@ -280,17 +275,33 @@ class Comparator: rel_err = abs_err / abs_bench_with_eps small_value_mask = get_small_value_mask(abs_bench, both_finite_mask, small_value_threshold) normal_value_mask = np.logical_and(both_finite_mask, np.logical_not(small_value_mask)) - compare_column.inf_nan_error_ratio = check_inf_nan_value(inf_nan_mask, bench_output, device_output, dtype, rtol) + compare_column.inf_nan_error_ratio = check_inf_nan_value(inf_nan_mask, bench_output, device_output, + dtype, rtol) compare_column.rel_err_ratio = check_norm_value(normal_value_mask, rel_err, rtol) compare_column.abs_err_ratio = check_small_value(abs_err, small_value_mask, small_value_atol) + elif api_name in ULPStandardApi: + if bench_output.size == 0: + compare_column.max_ulp_error = 0 + compare_column.mean_ulp_error = 0 + compare_column.ulp_error_proportion = 0 + else: + ulp_err = get_ulp_err(bench_output, device_output, dtype) + compare_column.max_ulp_error = np.max(ulp_err) + compare_column.mean_ulp_error = np.mean(ulp_err) + if dtype == torch.float32: + compare_column.ulp_error_proportion = np.sum(ulp_err > CompareConst.ULP_FLOAT32_THRESHOLD) / bench_output.size + else: + compare_column.ulp_error_proportion = np.sum(ulp_err > CompareConst.ULP_FLOAT16_THRESHOLD) / bench_output.size else: - dtype_config = precision_configs.get(dtype) + dtype_config = precision_configs.get(dtype) small_value_mask = get_small_value_mask(abs_bench, both_finite_mask, dtype_config['small_value'][0]) abs_err_greater_mask = np.greater(abs_err, dtype_config['small_value_atol'][0]) compare_column.small_value_err_ratio = get_small_value_err_ratio(small_value_mask, abs_err_greater_mask) rel_err = get_rel_err(abs_err, abs_bench_with_eps, small_value_mask, inf_nan_mask) compare_column.RMSE = get_rmse(abs_err, np.logical_or(inf_nan_mask, small_value_mask)) compare_column.EB = get_error_balance(bench_output, device_output) + if rel_err.size == 0: + return CompareConst.ERROR, compare_column, "Relative error result list is empty." 
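# Illustrative sketch (reference only): _compare_float_tensor() applies relative-error
# checks in tiers (0.01 / 0.001 / 0.0001), looking at how many elements keep their
# relative error below the tier's threshold; the "dual-thousandth" standard, for example,
# requires at least 99.9% of elements to stay under 0.001. numpy-only rendition; the
# function name and the epsilon guard are illustrative.
import numpy as np


def rel_err_ratio(bench, device, threshold, eps=np.finfo(np.float32).eps):
    bench = np.asarray(bench, dtype=np.float64)
    device = np.asarray(device, dtype=np.float64)
    rel_err = np.abs(device - bench) / (np.abs(bench) + eps)
    return float(np.mean(rel_err < threshold))


if __name__ == '__main__':
    bench = np.linspace(1.0, 2.0, 1000)
    device = bench * (1.0 + 2e-4)                 # uniform 0.02% deviation
    print(rel_err_ratio(bench, device, 0.001))    # 1.0 -> passes the thousandth tier
    print(rel_err_ratio(bench, device, 0.0001))   # 0.0 -> fails the ten-thousandth tier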
compare_column.Max_rel_error = get_max_rel_err(rel_err) compare_column.Mean_rel_error = get_mean_rel_err(rel_err) @@ -307,14 +318,13 @@ class Comparator: message += "Max abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n" return CompareConst.PASS, compare_column, message - rel_err_orign = get_rel_err_origin(abs_err, abs_bench_with_eps) if dtype in [torch.float16, torch.bfloat16]: - hundred_res, hundred_status = get_rel_err_ratio(rel_err_orign, 0.01) + hundred_res, hundred_status = get_rel_err_ratio(rel_err_orign, CompareConst.HUNDRED_RATIO_THRESHOLD) compare_column.rel_err_hundredth = hundred_res if not hundred_status: message += "Relative error is greater than 0.01, consider as error, skip other check and set to SPACE.\n" return CompareConst.ERROR, compare_column, message - thousand_res, thousand_status = get_rel_err_ratio(rel_err_orign, 0.001) + thousand_res, thousand_status = get_rel_err_ratio(rel_err_orign, CompareConst.THOUSAND_RATIO_THRESHOLD) compare_column.rel_err_thousandth = thousand_res if dtype in [torch.float16, torch.bfloat16]: if thousand_status: @@ -322,7 +332,7 @@ class Comparator: return CompareConst.PASS, compare_column, message message += "Relative error is greater than 0.001, consider as warning, skip other check and set to SPACE.\n" return CompareConst.WARNING, compare_column, message - ten_thousand_res, ten_thousand_status = get_rel_err_ratio(rel_err_orign, 0.0001) + ten_thousand_res, ten_thousand_status = get_rel_err_ratio(rel_err_orign, CompareConst.TEN_THOUSAND_RATIO_THRESHOLD) compare_column.rel_err_ten_thousandth = ten_thousand_res if dtype in [torch.float32, torch.float64]: if not thousand_status: @@ -333,40 +343,3 @@ class Comparator: return CompareConst.WARNING, compare_column, message message += "Relative error is less than 0.0001, consider as pass.\n" return CompareConst.PASS, compare_column, message - - @staticmethod - def _compare_dropout(api_name, bench_output, device_output): - tensor_num = bench_output.numel() - if tensor_num >= 100: - if abs((bench_output == 0).sum() - (device_output == 0).cpu().sum()) / tensor_num < 0.1: - return CompareConst.PASS, 1 - else: - return CompareConst.ERROR, 0 - else: - return CompareConst.PASS, 1 - - @staticmethod - def _compare_builtin_type(bench_output, device_output, compare_column): - if not isinstance(bench_output, (bool, int, float, str)): - return CompareConst.PASS, compare_column, "" - if bench_output != device_output: - return CompareConst.ERROR, compare_column, "" - compare_column.error_rate = 0 - return CompareConst.PASS, compare_column, "" - - - @staticmethod - def _compare_bool_tensor(bench_output, device_output): - error_nums = (bench_output != device_output).sum() - if bench_output.size == 0: - return CompareConst.NAN, CompareConst.ERROR, "There is not bench calculation result." 
- error_rate = float(error_nums / bench_output.size) - result = CompareConst.PASS if error_rate == 0 else CompareConst.ERROR - return error_rate, result, "" - - @staticmethod - def _get_absolute_threshold_attribute(api_name, dtype): - small_value_threshold = apis_threshold.get(api_name).get(dtype).get('small_value') - small_value_atol = apis_threshold.get(api_name).get(dtype).get('small_value_atol') - rtol = apis_threshold.get(api_name).get(dtype).get('rtol') - return small_value_threshold, small_value_atol, rtol diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_column.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_column.py similarity index 75% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_column.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_column.py index 97cf8226bd1ea6c9a668abd91719fd2662b5183b..fb6d5dcc0f1c8b67ec2b67e8b419e8407cdc8d6d 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_column.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_column.py @@ -1,4 +1,4 @@ -from .compare_utils import CompareConst +from msprobe.core.common.const import CompareConst class CompareColumn: @@ -20,12 +20,16 @@ class CompareColumn: self.inf_nan_error_ratio = CompareConst.SPACE self.rel_err_ratio = CompareConst.SPACE self.abs_err_ratio = CompareConst.SPACE + self.max_ulp_error = CompareConst.SPACE + self.mean_ulp_error = CompareConst.SPACE + self.ulp_error_proportion = CompareConst.SPACE def to_column_value(self, is_pass, message): return [self.bench_type, self.npu_type, self.shape, self.cosine_sim, self.max_abs_err, self.rel_err_hundredth, self.rel_err_thousandth, self.rel_err_ten_thousandth, self.error_rate, self.EB, self.RMSE, self.small_value_err_ratio, self.Max_rel_error, self.Mean_rel_error, self.inf_nan_error_ratio, - self.rel_err_ratio, self.abs_err_ratio, is_pass, message] + self.rel_err_ratio, self.abs_err_ratio, self.max_ulp_error, self.mean_ulp_error, + self.ulp_error_proportion, is_pass, message] class ApiPrecisionOutputColumn: @@ -49,6 +53,12 @@ class ApiPrecisionOutputColumn: self.abs_err_ratio_status = CompareConst.SPACE self.error_rate = CompareConst.SPACE self.error_rate_status = CompareConst.SPACE + self.mean_ulp_err = CompareConst.SPACE + self.ulp_err_proportion = CompareConst.SPACE + self.ulp_err_proportion_ratio = CompareConst.SPACE + self.ulp_err_status = CompareConst.SPACE + self.rel_err_thousandth = CompareConst.SPACE + self.rel_err_thousandth_status = CompareConst.SPACE self.compare_result = CompareConst.SPACE self.compare_algorithm = CompareConst.SPACE self.compare_message = CompareConst.SPACE @@ -58,6 +68,7 @@ class ApiPrecisionOutputColumn: self.rmse_status, self.max_rel_err_ratio, self.max_rel_err_status, self.mean_rel_err_ratio, self.mean_rel_err_status, self.eb_ratio, self.eb_status, self.inf_nan_error_ratio, self.inf_nan_error_ratio_status, self.rel_err_ratio, self.rel_err_ratio_status, self.abs_err_ratio, - self.abs_err_ratio_status, self.error_rate, self.error_rate_status, self.compare_result, - self.compare_algorithm, self.compare_message] + self.abs_err_ratio_status, self.error_rate, self.error_rate_status, self.mean_ulp_err, + self.ulp_err_proportion, self.ulp_err_proportion_ratio, self.ulp_err_status, self.rel_err_thousandth, + self.rel_err_thousandth_status, self.compare_result, self.compare_algorithm, self.compare_message] \ No newline at end of file 
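With the three ULP fields appended, the row produced by CompareColumn.to_column_value and the header list in DETAIL_TEST_ROWS must stay aligned positionally. A small usage sketch, assuming the msprobe package from this patch is importable and using arbitrary field values:

from msprobe.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn

column = CompareColumn()
column.max_ulp_error = 3.0
column.mean_ulp_error = 0.7
column.ulp_error_proportion = 0.01
row = column.to_column_value(is_pass="pass", message="")
# the last five entries are: max_ulp_error, mean_ulp_error, ulp_error_proportion, is_pass, message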
diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_utils.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py similarity index 70% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_utils.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py index 5511da724446187e2dd886448bf6b26ea7b7b369..5c7e86ff36cbe027efdc20e4dc6cbdbf4b98b808 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/compare/compare_utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py @@ -1,10 +1,14 @@ import time import os +import math + import numpy as np import torch import yaml -from ..common.utils import Const, print_warn_log, CompareException -from ...common.file_check import FileOpen +from msprobe.core.common.utils import CompareException +from msprobe.core.common.const import Const +from msprobe.pytorch.common.log import logger +from msprobe.core.common.file_check import FileOpen current_time = time.strftime("%Y%m%d%H%M%S") @@ -12,6 +16,7 @@ API_PRECISION_COMPARE_RESULT_FILE_NAME = "api_precision_compare_result_" + curre API_PRECISION_COMPARE_DETAILS_FILE_NAME = "api_precision_compare_details_" + current_time + ".csv" BENCHMARK_COMPARE_SUPPORT_LIST = ['torch.float16', 'torch.bfloat16', 'torch.float32'] API_PRECISION_COMPARE_UNSUPPORT_LIST = ['torch.float64', 'torch.complex64', 'torch.complex128'] +ULP_COMPARE_SUPPORT_LIST = ['torch.float16', 'torch.bfloat16', 'torch.float32'] BINARY_COMPARE_UNSUPPORT_LIST = BENCHMARK_COMPARE_SUPPORT_LIST + API_PRECISION_COMPARE_UNSUPPORT_LIST @@ -21,6 +26,8 @@ with FileOpen(standard_yaml_path, 'r') as f: Apis = yaml.safe_load(f) AbsoluteStandardApi = Apis.get('AbsoluteThreshStandard') BinaryStandardApi = Apis.get('BinaryCompareStandard') + ULPStandardApi = Apis.get('ULPStandard') + ThousandthStandardApi = Apis.get('ThousandthStandard') threshold_yaml_path = os.path.join(cur_path, "api_precision_threshold.yaml") @@ -44,6 +51,9 @@ DETAIL_TEST_ROWS = [[ "inf/nan错误率", "相对误差错误率", "绝对误差错误率", + "ULP误差最大值", + "ULP误差平均值", + "ULP误差大于阈值占比", "Status", "Message" ]] @@ -77,20 +87,33 @@ precision_configs = { } -class CompareConst: - NAN = np.nan - NA = "N/A" - PASS = 'pass' - WARNING = 'warning' - ERROR = 'error' - SKIP = 'SKIP' - TRUE = 'TRUE' - FALSE = 'FALSE' - BFLOAT16_MIN = -3.3895313892515355e+38 - BFLOAT16_MAX = 3.3895313892515355e+38 - BFLOAT16_EPS = 2 ** -8 - SPACE = " " - +ULP_PARAMETERS = { + torch.float16 : { + 'min_eb' : [ + -14 + ], + 'exponent_num' : [ + 10 + ] + }, + torch.bfloat16 : { + 'min_eb' : [ + -126 + ], + 'exponent_num' : [ + 7 + ] + }, + torch.float32 : { + 'min_eb' : [ + -126 + ], + 'exponent_num' : [ + 23 + ] + } +} + class ApiPrecisionCompareColumn: API_NAME = 'API Name' @@ -118,6 +141,12 @@ class ApiPrecisionCompareColumn: REL_ERR_RATIO_STATUS = '相对误差判定结果' ABS_ERR_RATIO = '绝对误差错误率' ABS_ERR_RATIO_STATUS = '绝对误差判定结果' + MEAN_ULP_ERR = 'ULP误差平均值' + ULP_ERR_PROPORTION = 'ULP误差大于阈值占比' + ULP_ERR_PROPORTION_RATIO = 'ULP误差大于阈值占比比值' + ULP_ERR_STATUS = 'ULP误差判定结果' + REL_ERR_THOUSANDTH = '双千指标' + REL_ERR_THOUSANDTH_STATUS = '双千指标判定结果' FINAL_RESULT = '比对结果' ALGORITHM = '比对算法' FORWWARD_STATUS = 'Forward Test Success' @@ -130,11 +159,13 @@ class ApiPrecisionCompareColumn: ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATE, ApiPrecisionCompareColumn.RMSE, ApiPrecisionCompareColumn.MAX_REL_ERR, ApiPrecisionCompareColumn.MEAN_REL_ERR, ApiPrecisionCompareColumn.EB, 
ApiPrecisionCompareColumn.ERROR_RATE, ApiPrecisionCompareColumn.INF_NAN_ERROR_RATIO, - ApiPrecisionCompareColumn.REL_ERR_RATIO, ApiPrecisionCompareColumn.ABS_ERR_RATIO] + ApiPrecisionCompareColumn.REL_ERR_RATIO, ApiPrecisionCompareColumn.ABS_ERR_RATIO, + ApiPrecisionCompareColumn.MEAN_ULP_ERR, ApiPrecisionCompareColumn.ULP_ERR_PROPORTION, + ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH] @staticmethod def get_detail_csv_title(): - return [ApiPrecisionCompareColumn.API_NAME, + return [ApiPrecisionCompareColumn.API_NAME, ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_RATIO, ApiPrecisionCompareColumn.SMALL_VALUE_ERROR_STATUS, ApiPrecisionCompareColumn.RMSE_RATIO, ApiPrecisionCompareColumn.RMSE_STATUS, ApiPrecisionCompareColumn.MAX_REL_ERR_RATIO, ApiPrecisionCompareColumn.MAX_REL_ERR_STATUS, @@ -144,6 +175,9 @@ class ApiPrecisionCompareColumn: ApiPrecisionCompareColumn.REL_ERR_RATIO, ApiPrecisionCompareColumn.REL_ERR_RATIO_STATUS, ApiPrecisionCompareColumn.ABS_ERR_RATIO, ApiPrecisionCompareColumn.ABS_ERR_RATIO_STATUS, ApiPrecisionCompareColumn.ERROR_RATE, ApiPrecisionCompareColumn.ERROR_RATE_STATUS, + ApiPrecisionCompareColumn.MEAN_ULP_ERR, ApiPrecisionCompareColumn.ULP_ERR_PROPORTION, + ApiPrecisionCompareColumn.ULP_ERR_PROPORTION_RATIO, ApiPrecisionCompareColumn.ULP_ERR_STATUS, + ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH, ApiPrecisionCompareColumn.REL_ERR_THOUSANDTH_STATUS, ApiPrecisionCompareColumn.FINAL_RESULT, ApiPrecisionCompareColumn.ALGORITHM, ApiPrecisionCompareColumn.MESSAGE] @staticmethod @@ -170,7 +204,7 @@ def check_dtype_comparable(x, y): if y.dtype in Const.INT_TYPE: return True return False - print_warn_log(f"Compare: Unexpected dtype {x.dtype}, {y.dtype}") + logger.warning(f"Compare: Unexpected dtype {x.dtype}, {y.dtype}") return False @@ -180,11 +214,36 @@ def convert_str_to_float(input_data): raise CompareException(CompareException.INVALID_DATA_ERROR, msg) try: float_data = float(input_data) - if str(float_data) in ('inf', '-inf', 'nan'): - msg = 'ERROR: Input data is either "inf", "-inf", "nan"' - raise CompareException(CompareException.INVALID_DATA_ERROR, msg) return float_data except ValueError as e: msg = 'ERROR: Input data cannot be converted to float' raise CompareException(CompareException.INVALID_DATA_ERROR, msg) from e - \ No newline at end of file + + +def is_inf_or_nan(x): + return math.isnan(x) or math.isinf(x) + + +def handle_infinity(x, y, column_name): + if math.isinf(x) and math.isinf(y): + if x == y: + return float("nan"), True, f"{column_name}同为同号inf或nan\n" + else: + return float("nan"), False, f"{column_name}inf或nan不一致\n" + else: + return float("nan"), False, f"{column_name}inf或nan不一致\n" + + +def handle_nan(x, y, column_name): + if math.isnan(x) and math.isnan(y): + return float("nan"), True, f"{column_name}同为同号inf或nan\n" + else: + return float("nan"), False, f"{column_name}inf或nan不一致\n" + + +def check_inf_or_nan(x, y, column_name): + if math.isinf(x) or math.isinf(y): + return handle_infinity(x, y, column_name) + else: + return handle_nan(x, y, column_name) + \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/config.yaml b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/config.yaml similarity index 35% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/config.yaml rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/config.yaml index e2582c4539c9408102d3496242651cedeeefeb22..2dac535dc0501f6e47f0cdcc48bd88e1d73ab0dd 100644 --- 
a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/config.yaml +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/config.yaml @@ -1,9 +1,5 @@ -dump_path: './' -real_data: False -enable_dataloader: False -target_iter: [1] white_list: [] +black_list: [] error_data_path: './' -jit_compile: True precision: 14 \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/.keep b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/.keep similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/dump/.keep rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/.keep diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/__init__.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/data_generate.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py similarity index 86% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/data_generate.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py index 723fb8ec6680ab65770007c5ab90b5f8428db2ac..f495cd673d714fa715abd74403ab30be8b68aa1b 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/data_generate.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py @@ -20,8 +20,11 @@ import math import torch import numpy -from ..common.utils import Const, check_file_or_directory_path, check_object_type, print_warn_log, \ - print_error_log, get_full_data_path, CompareException +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import hf_32_standard_api +from msprobe.pytorch.api_accuracy_checker.common.utils import check_file_or_directory_path, check_object_type, \ + get_full_data_path, CompareException +from msprobe.pytorch.common.log import logger +from msprobe.core.common.const import Const TORCH_TYPE = ["torch.device", "torch.dtype"] TENSOR_DATA_LIST = ["torch.Tensor", "torch.nn.parameter.Parameter"] @@ -32,12 +35,13 @@ NUMPY_TYPE = ["numpy.int8", "numpy.int16", "numpy.int32", "numpy.int64", "numpy. "numpy.complex128", "numpy.complex256", "numpy.bool_", "numpy.string_", "numpy.bytes_", "numpy.unicode_"] -def gen_data(info, need_grad, convert_type, real_data_path=None): +def gen_data(info, api_name, need_grad, convert_type, real_data_path=None): """ Function Description: Based on arg basic information, generate arg data Parameter: info: arg basic information. Dict + api_name: API name need_grad: set Tensor grad for backward convert_type: convert ori_type to dist_type flag. 
""" @@ -50,6 +54,8 @@ def gen_data(info, need_grad, convert_type, real_data_path=None): data = gen_real_tensor(data_path, convert_type) else: data = gen_random_tensor(info, convert_type) + if api_name in hf_32_standard_api and data.dtype == torch.float32: + data = fp32_to_hf32_to_fp32(data) if info.get('requires_grad') and need_grad: data.requires_grad_(True) temp_data = data * 1 @@ -62,7 +68,7 @@ def gen_data(info, need_grad, convert_type, real_data_path=None): try: data = eval(data_type)(data) except Exception as err: - print_error_log("Failed to convert the type to numpy: %s" % str(err)) + logger.error("Failed to convert the type to numpy: %s" % str(err)) elif data_type == "torch.Size": data = torch.Size(info.get("value")) else: @@ -123,6 +129,17 @@ def gen_random_tensor(info, convert_type): return data +def fp32_to_hf32_to_fp32(input_tensor): + # 将输入的float32 tensor转为hf32 tensor,再转为float32 tensor + input_np = input_tensor.detach().numpy() + input_int = input_np.view(numpy.int32) + input_int = numpy.right_shift(numpy.right_shift(input_int, 11) + 1, 1) + input_int = numpy.left_shift(input_int, 12) + input_fp32 = input_int.view(numpy.float32) + input_hf32 = torch.from_numpy(input_fp32) + return input_hf32 + + def gen_common_tensor(low_info, high_info, shape, data_dtype, convert_type): """ Function Description: @@ -170,7 +187,7 @@ def gen_common_tensor(low_info, high_info, shape, data_dtype, convert_type): low, high = int(low), int(high) tensor = torch.randint(low, high + 1, shape, dtype=eval(data_dtype)) else: - print_error_log('Dtype is not supported: ' + data_dtype) + logger.error('Dtype is not supported: ' + data_dtype) raise NotImplementedError() if tensor.nelement() == 0: return tensor @@ -211,12 +228,13 @@ def gen_bool_tensor(low, high, shape): return data -def gen_args(args_info, need_grad=True, convert_type=None, real_data_path=None): +def gen_args(args_info, api_name, need_grad=True, convert_type=None, real_data_path=None): """ Function Description: Based on API basic information, generate input parameters: args, for API forward running Parameter: api_info: API basic information. List + api_name: API name need_grad: set Tensor grad for backward convert_type: convert ori_type to dist_type flag. real_data_path: the root directory for storing real data. @@ -225,13 +243,13 @@ def gen_args(args_info, need_grad=True, convert_type=None, real_data_path=None): args_result = [] for arg in args_info: if isinstance(arg, (list, tuple)): - data = gen_args(arg, need_grad, convert_type, real_data_path) + data = gen_args(arg, api_name, need_grad, convert_type, real_data_path) elif isinstance(arg, dict): - data = gen_data(arg, need_grad, convert_type, real_data_path) + data = gen_data(arg, api_name, need_grad, convert_type, real_data_path) elif arg is None: data = None else: - print_warn_log(f'Warning: {arg} is not supported') + logger.warning(f'Warning: {arg} is not supported') raise NotImplementedError() args_result.append(data) return args_result @@ -287,12 +305,13 @@ def gen_list_kwargs(kwargs_item_value, convert_type, real_data_path=None): return kwargs_item_result -def gen_api_params(api_info, need_grad=True, convert_type=None, real_data_path=None): +def gen_api_params(api_info, api_name, need_grad=True, convert_type=None, real_data_path=None): """ Function Description: Based on API basic information, generate input parameters: args, kwargs, for API forward running Parameter: api_info: API basic information. 
Dict + api_name: API name need_grad: set grad for backward convert_type: convert ori_type to dist_type flag. """ @@ -302,8 +321,8 @@ def gen_api_params(api_info, need_grad=True, convert_type=None, real_data_path=N raise CompareException(CompareException.INVALID_PARAM_ERROR, error_info) kwargs_params = gen_kwargs(api_info, convert_type, real_data_path) if api_info.get("input_args"): - args_params = gen_args(api_info.get("input_args"), need_grad, convert_type, real_data_path) + args_params = gen_args(api_info.get("input_args"), api_name, need_grad, convert_type, real_data_path) else: - print_warn_log(f'Warning: No args in {api_info} ') + logger.warning(f'Warning: No args in {api_info} ') args_params = [] return args_params, kwargs_params diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py similarity index 82% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py index bbb7c6f5ef2c75832c0b7a6e22e9d1ccf0624d8b..9c96a52d8bd63ea2af18ce21ee95bf8834525983 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py @@ -9,12 +9,14 @@ import threading from collections import namedtuple from itertools import cycle from tqdm import tqdm -from ...common import parse_json_info_forward_backward -from ...common.file_check import FileCheckConst, FileChecker, check_file_suffix, check_link, FileOpen -from ..compare.compare import Comparator -from .run_ut import _run_ut_parser, get_validated_result_csv_path, get_validated_details_csv_path, preprocess_forward_content -from ..common.utils import print_error_log, print_warn_log, print_info_log, create_directory -from ...common.file_check import check_path_before_create +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import _run_ut_parser, get_validated_result_csv_path, \ + get_validated_details_csv_path, preprocess_forward_content +from msprobe.pytorch.api_accuracy_checker.compare.compare import Comparator +from msprobe.pytorch.common import parse_json_info_forward_backward +from msprobe.core.common.file_check import FileChecker, check_file_suffix, check_link, FileOpen, \ + check_path_before_create, create_directory +from msprobe.pytorch.common.log import logger +from msprobe.core.common.const import FileCheckConst def split_json_file(input_file, num_splits, filter_api): @@ -56,7 +58,7 @@ def split_json_file(input_file, num_splits, filter_api): def signal_handler(signum, frame): - print_warn_log(f'Signal handler called with signal {signum}') + logger.warning(f'Signal handler called with signal {signum}') raise KeyboardInterrupt() @@ -73,8 +75,8 @@ def run_parallel_ut(config): processes = [] device_id_cycle = cycle(config.device_id) if config.save_error_data_flag: - print_info_log("UT task error datas will be saved") - print_info_log(f"Starting parallel UT with {config.num_splits} processes") + logger.info("UT task error datas will be saved") + logger.info(f"Starting parallel UT with {config.num_splits} processes") progress_bar = tqdm(total=config.total_items, desc="Total items", unit="items") def create_cmd(api_info, dev_id): @@ -104,7 +106,7 @@ def run_parallel_ut(config): print(output, end='') sys.stdout.flush() except ValueError as e: - print_warn_log(f"An error occurred while reading 
subprocess output: {e}") + logger.warning(f"An error occurred while reading subprocess output: {e}") def update_progress_bar(progress_bar, result_csv_path): while any(process.poll() is None for process in processes): @@ -113,9 +115,9 @@ def run_parallel_ut(config): completed_items = len(result_file.readlines()) - 1 progress_bar.update(completed_items - progress_bar.n) except FileNotFoundError: - print_warn_log(f"Result CSV file not found: {result_csv_path}.") + logger.warning(f"Result CSV file not found: {result_csv_path}.") except Exception as e: - print_error_log(f"An unexpected error occurred while reading result CSV: {e}") + logger.error(f"An unexpected error occurred while reading result CSV: {e}") time.sleep(1) for api_info in config.api_files: @@ -140,27 +142,27 @@ def run_parallel_ut(config): try: os.remove(file) except FileNotFoundError: - print_warn_log(f"File not found and could not be deleted: {file}") + logger.warning(f"File not found and could not be deleted: {file}") try: for process in processes: process.communicate(timeout=None) except KeyboardInterrupt: - print_warn_log("Interrupted by user, terminating processes and cleaning up...") + logger.warning("Interrupted by user, terminating processes and cleaning up...") except Exception as e: - print_error_log(f"An unexpected error occurred: {e}") + logger.error(f"An unexpected error occurred: {e}") finally: if progress_bar.n < config.total_items: - print_warn_log("The UT task has not been completed. The parameter '-csv_path' along with the path to the result CSV file will be utilized to resume the UT task.") + logger.warning("The UT task has not been completed. The parameter '-csv_path' along with the path to the result CSV file will be utilized to resume the UT task.") clean_up() progress_bar_thread.join() try: comparator = Comparator(config.result_csv_path, config.result_csv_path, False) comparator.print_pretest_result() except FileNotFoundError as e: - print_error_log(f"Error: {e}") + logger.error(f"Error: {e}") except Exception as e: - print_error_log(f"An unexpected error occurred: {e}") + logger.error(f"An unexpected error occurred: {e}") def prepare_config(args): @@ -181,8 +183,8 @@ def prepare_config(args): else: result_csv_path = get_validated_result_csv_path(args.result_csv_path, 'result') details_csv_path = get_validated_details_csv_path(result_csv_path) - print_info_log(f"UT task result will be saved in {result_csv_path}") - print_info_log(f"UT task details will be saved in {details_csv_path}") + logger.info(f"UT task result will be saved in {result_csv_path}") + logger.info(f"UT task details will be saved in {details_csv_path}") return ParallelUTConfig(split_files, out_path, args.num_splits, args.save_error_data, args.jit_compile, args.device_id, result_csv_path, total_items, args.real_data_path) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py similarity index 72% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py index bea882f75076655998227baa6e4d3b4708074f08..732745ee8ca14b0665e2beb22da72cfd856164d1 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py @@ -1,12 +1,21 @@ import argparse import os import 
sys -import torch_npu + +try: + import torch_npu +except ImportError: + is_gpu = True +else: + is_gpu = False import torch from tqdm import tqdm -from ..run_ut.run_ut import exec_api, generate_device_params, get_api_info -from ..common.utils import print_info_log, print_warn_log, get_json_contents, print_error_log -from ...common.file_check import check_link +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import exec_api, generate_device_params, get_api_info +from msprobe.pytorch.api_accuracy_checker.common.utils import get_json_contents +from msprobe.core.common.file_check import check_link +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.common.parse_json import parse_json_info_forward_backward +from msprobe.core.common.const import Const def check_tensor_overflow(x): @@ -36,7 +45,7 @@ def check_tensor_overflow(x): def check_data_overflow(x): if isinstance(x, (tuple, list)) and x: - for i, item in enumerate(x): + for _, item in enumerate(x): if check_data_overflow(item): return True return False @@ -45,30 +54,29 @@ def check_data_overflow(x): def run_overflow_check(forward_file): - print_info_log("start UT test") - forward_content = get_json_contents(forward_file) + logger.info("start UT test") + forward_content, _, real_data_path = parse_json_info_forward_backward(forward_file) for api_full_name, api_info_dict in tqdm(forward_content.items()): try: - run_torch_api(api_full_name, api_info_dict) + run_torch_api(api_full_name, api_info_dict, real_data_path) except Exception as err: - api_name = api_full_name.split("_", 1)[1].rsplit("_", 2)[0] + _, api_name, _ = api_full_name.split(Const.SEP) if "not implemented for 'Half'" in str(err): - print_warn_log(f"API {api_name} not support half tensor in CPU, please add {api_name} to CONVERT_API " + logger.warning(f"API {api_name} not support half tensor in CPU, please add {api_name} to CONVERT_API " f"'fp16_to_fp32' list in accuracy_tools/api_accuracy_check/common/utils.py file.") elif "expected scalar type Long" in str(err): - print_warn_log(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " + logger.warning(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " f"'int32_to_int64' list in accuracy_tools/api_accuracy_check/common/utils.py file.") else: - print_error_log(f"Run {api_full_name} UT Error: %s" % str(err)) + logger.error(f"Run {api_full_name} UT Error: %s" % str(err)) -def run_torch_api(api_full_name, api_info_dict): +def run_torch_api(api_full_name, api_info_dict, real_data_path): torch.npu.clear_npu_overflow_flag() - api_type = api_full_name.split(".")[0] - api_name = api_full_name.split(".", 1)[1].rsplit(".", 2)[0] - args, kwargs, need_grad = get_api_info(api_info_dict, api_name, real_data_path='') + api_type, api_name, _ = api_full_name.split(Const.SEP) + args, kwargs, need_grad = get_api_info(api_info_dict, api_name, real_data_path) if not need_grad: - print_warn_log("%s function with out=... arguments don't support automatic differentiation, skip backward." + logger.warning("%s function with out=... arguments don't support automatic differentiation, skip backward." 
% api_full_name) npu_args, npu_kwargs = generate_device_params(args, kwargs, False, api_name) if kwargs.get("device"): @@ -78,9 +86,9 @@ def run_torch_api(api_full_name, api_info_dict): cpu_overflow = check_data_overflow(out) npu_overflow = torch_npu.npu.utils.npu_check_overflow(npu_out) if cpu_overflow == npu_overflow: - print_warn_log("The %s overflow is a normal overflow." % api_full_name) + logger.warning("The %s overflow is a normal overflow." % api_full_name) else: - print_warn_log("The %s overflow is an abnormal overflow." % api_full_name) + logger.warning("The %s overflow is an abnormal overflow." % api_full_name) return @@ -111,11 +119,11 @@ def _run_overflow_check_command(args): try: torch.npu.set_device(npu_device) except Exception as error: - print_error_log(f"Set NPU device id failed. device id is: {args.device_id}") + logger.error(f"Set NPU device id failed. device id is: {args.device_id}") raise NotImplementedError from error run_overflow_check(api_info) if __name__ == '__main__': _run_overflow_check() - print_info_log("UT task completed.") + logger.info("UT task completed.") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_ut.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py similarity index 66% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_ut.py rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py index 3186913e9482de5427f8b41685255e0e4cc1f140..30994f709444c4d479f5c807289be6e6bb58e25b 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/run_ut.py +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py @@ -6,6 +6,7 @@ import sys import time import gc from collections import namedtuple + try: import torch_npu except ImportError: @@ -17,28 +18,39 @@ else: import torch from tqdm import tqdm -from atat.pytorch.api_accuracy_checker.run_ut.data_generate import gen_api_params, gen_args -from atat.pytorch.api_accuracy_checker.common.utils import print_info_log, print_warn_log, get_json_contents, \ - api_info_preprocess, print_error_log, initialize_save_path, Const, create_directory -from atat.pytorch.api_accuracy_checker.compare.compare import Comparator -from atat.pytorch.api_accuracy_checker.hook_module.wrap_tensor import TensorOPTemplate -from atat.pytorch.api_accuracy_checker.hook_module.wrap_functional import FunctionalOPTemplate -from atat.pytorch.api_accuracy_checker.hook_module.wrap_torch import TorchOPTemplate -from atat.pytorch.api_accuracy_checker.common.config import msCheckerConfig -from atat.pytorch.api_accuracy_checker.dump.api_info import APIInfo -from atat.pytorch.common.parse_json import parse_json_info_forward_backward -from atat.pytorch.common.file_check import check_path_before_create -from atat.pytorch.common.file_check import FileOpen, FileCheckConst, FileChecker, \ - change_mode, check_file_suffix, check_link +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut_utils import Backward_Message, hf_32_standard_api +from msprobe.pytorch.api_accuracy_checker.run_ut.data_generate import gen_api_params, gen_args +from msprobe.pytorch.api_accuracy_checker.common.utils import get_json_contents, api_info_preprocess, \ + initialize_save_path, UtDataProcessor +from msprobe.pytorch.api_accuracy_checker.compare.compare import Comparator +from msprobe.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn +from msprobe.pytorch.hook_module.wrap_tensor import TensorOPTemplate +from 
msprobe.pytorch.hook_module.wrap_functional import FunctionalOPTemplate +from msprobe.pytorch.hook_module.wrap_torch import TorchOPTemplate +from msprobe.pytorch.api_accuracy_checker.common.config import msCheckerConfig +from msprobe.pytorch.common.parse_json import parse_json_info_forward_backward +from msprobe.core.common.file_check import FileOpen, FileChecker, \ + change_mode, check_file_suffix, check_link, check_path_before_create, create_directory +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.pt_config import parse_json_config +from msprobe.core.common.const import Const, FileCheckConst, CompareConst current_time = time.strftime("%Y%m%d%H%M%S") UT_ERROR_DATA_DIR = 'ut_error_data' + current_time RESULT_FILE_NAME = "accuracy_checking_result_" + current_time + ".csv" DETAILS_FILE_NAME = "accuracy_checking_details_" + current_time + ".csv" RunUTConfig = namedtuple('RunUTConfig', ['forward_content', 'backward_content', 'result_csv_path', 'details_csv_path', - 'save_error_data', 'is_continue_run_ut', 'real_data_path']) + 'save_error_data', 'is_continue_run_ut', 'real_data_path', 'white_list', + 'black_list', 'error_data_path']) not_backward_list = ['repeat_interleave'] not_detach_set = {'resize_', 'resize_as_', 'set_', 'transpose_', 't_', 'squeeze_', 'unsqueeze_'} +not_raise_dtype_set = {'type_as'} + +RAISE_PRECISION = { + torch.float16: torch.float32, + torch.bfloat16: torch.float32, + torch.float32: torch.float64 +} tqdm_params = { 'smoothing': 0, # 平滑进度条的预计剩余时间,取值范围0到1 @@ -73,8 +85,19 @@ def deal_detach(arg, to_detach=True): return arg.detach() if to_detach else arg -def deal_dtype(arg, raise_dtype=None): - if raise_dtype is None or arg.dtype not in Const.RAISE_PRECISION or raise_dtype == arg.dtype: +def raise_bench_data_dtype(api_name, arg, raise_dtype=None): + ''' + 将标杆数据的dtype转换为raise_dtype + 输入: + api_name:api名称 + arg:标杆输入 + raise_dtype:需要转换的dtype + 输出: + arg: 转换dtype的标杆输入 + ''' + if api_name in hf_32_standard_api and arg.dtype == torch.float32: + return arg + if raise_dtype is None or arg.dtype not in RAISE_PRECISION or raise_dtype == arg.dtype: return arg return arg.type(raise_dtype) @@ -108,18 +131,19 @@ def generate_cpu_params(input_args, input_kwargs, need_backward, api_name): return type(arg_in)(recursive_arg_to_cpu(arg, to_detach, raise_dtype=raise_dtype) for arg in arg_in) elif isinstance(arg_in, torch.Tensor): if need_backward and arg_in.requires_grad: - arg_in = deal_detach(deal_dtype(arg_in.clone(), raise_dtype), to_detach).requires_grad_() + arg_in = deal_detach(raise_bench_data_dtype( + api_name, arg_in.clone(), raise_dtype=raise_dtype), to_detach).requires_grad_() temp_arg_in = arg_in * 1 arg_in = temp_arg_in.type_as(arg_in) arg_in.retain_grad() return arg_in else: - return deal_detach(deal_dtype(arg_in.clone(), raise_dtype=raise_dtype), to_detach) + return deal_detach(raise_bench_data_dtype(api_name, arg_in.clone(), raise_dtype=raise_dtype), to_detach) else: return arg_in def is_tensor_with_raise_precision(arg_in, check_kwargs=False): - if arg_in.dtype in Const.RAISE_PRECISION: + if arg_in.dtype in RAISE_PRECISION: return True if check_kwargs and arg_in.dtype in [torch.half, torch.bfloat16]: return True @@ -138,10 +162,11 @@ def generate_cpu_params(input_args, input_kwargs, need_backward, api_name): need_raise_dtypes = recursive_find_dtypes(input_args) need_raise_dtypes.update(recursive_find_dtypes(input_kwargs, check_kwargs=True)) if len(need_raise_dtypes) == 1: - raise_dtype = Const.RAISE_PRECISION.get(need_raise_dtypes.pop(), torch.float32) + 
raise_dtype = RAISE_PRECISION.get(need_raise_dtypes.pop(), torch.float32) elif len(need_raise_dtypes) >= 2: raise_dtype = torch.float32 + raise_dtype = None if api_name in not_raise_dtype_set else raise_dtype is_detach = api_name not in not_detach_set cpu_args = recursive_arg_to_cpu(input_args, is_detach, raise_dtype=raise_dtype) cpu_kwargs = {key: recursive_arg_to_cpu(value, key != "out" and is_detach, raise_dtype=raise_dtype) for key, value in input_kwargs.items()} @@ -149,43 +174,41 @@ def generate_cpu_params(input_args, input_kwargs, need_backward, api_name): def run_ut(config): - print_info_log("start UT test") - print_info_log(f"UT task result will be saved in {config.result_csv_path}") - print_info_log(f"UT task details will be saved in {config.details_csv_path}") + logger.info("start UT test") + logger.info(f"UT task result will be saved in {config.result_csv_path}") + logger.info(f"UT task details will be saved in {config.details_csv_path}") if config.save_error_data: - error_data_path = os.path.abspath(os.path.join(msCheckerConfig.error_data_path, UT_ERROR_DATA_DIR)) - print_info_log(f"UT task error_datas will be saved in {error_data_path}") + logger.info(f"UT task error_datas will be saved in {config.error_data_path}") compare = Comparator(config.result_csv_path, config.details_csv_path, config.is_continue_run_ut) with FileOpen(config.result_csv_path, 'r') as file: csv_reader = csv.reader(file) next(csv_reader) api_name_set = {row[0] for row in csv_reader} - for i, (api_full_name, api_info_dict) in enumerate(tqdm(config.forward_content.items(), **tqdm_params)): + for _, (api_full_name, api_info_dict) in enumerate(tqdm(config.forward_content.items(), **tqdm_params)): if api_full_name in api_name_set: continue - if is_unsupported_api(api_full_name): # TODO run_ut does not support to the npu fusion api and distributed api + if is_unsupported_api(api_full_name): # TODO run_ut does not support to the npu fusion api and distributed api continue + [_, api_name, _] = api_full_name.split(Const.SEP) try: - if msCheckerConfig.white_list: - [_, api_name, _] = api_full_name.split(Const.SEP) - if api_name not in set(msCheckerConfig.white_list): - continue + if config.black_list and api_name in config.black_list: + continue + if config.white_list and api_name not in config.white_list: + continue data_info = run_torch_api(api_full_name, config.real_data_path, config.backward_content, api_info_dict) - is_fwd_success, is_bwd_success = compare.compare_output(api_full_name, - data_info.bench_out, - data_info.device_out, - data_info.bench_grad_out, - data_info.device_grad_out) + is_fwd_success, is_bwd_success = compare.compare_output(api_full_name, data_info) if config.save_error_data: - do_save_error_data(api_full_name, data_info, is_fwd_success, is_bwd_success) + do_save_error_data(api_full_name, data_info, config.error_data_path, is_fwd_success, is_bwd_success) except Exception as err: - [_, api_name, _] = api_full_name.split(Const.SEP) if "expected scalar type Long" in str(err): - print_warn_log(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " + logger.warning(f"API {api_name} not support int32 tensor in CPU, please add {api_name} to CONVERT_API " f"'int32_to_int64' list in accuracy_tools/api_accuracy_check/common/utils.py file.") else: - print_error_log(f"Run {api_full_name} UT Error: %s" % str(err)) - compare.write_summary_csv((api_full_name, "SKIP", "SKIP", str(err))) + logger.error(f"Run {api_full_name} UT Error: %s" % str(err)) + err_column = 
CompareColumn() + fwd_compare_alg_results = err_column.to_column_value(CompareConst.SKIP, str(err)) + result_info = (api_full_name, CompareConst.SKIP, CompareConst.SKIP, [fwd_compare_alg_results], None, 0) + compare.record_results(result_info) finally: if is_gpu: torch.cuda.empty_cache() @@ -201,35 +224,37 @@ def is_unsupported_api(api_name): split_name = api_name.split(Const.SEP)[0] flag = split_name in [Const.NPU, Const.DISTRIBUTED] if flag: - print_info_log(f"{split_name} api is not supported for run ut. SKIP.") + logger.info(f"{split_name} api is not supported for run ut. SKIP.") return flag -def do_save_error_data(api_full_name, data_info, is_fwd_success, is_bwd_success): +def do_save_error_data(api_full_name, data_info, error_data_path, is_fwd_success, is_bwd_success): if not is_fwd_success or not is_bwd_success: + processor = UtDataProcessor(error_data_path) for element in data_info.in_fwd_data_list: - UtAPIInfo(api_full_name + '.forward.input', element) - UtAPIInfo(api_full_name + '.forward.output.bench', data_info.bench_out) - UtAPIInfo(api_full_name + '.forward.output.device', data_info.device_out) - UtAPIInfo(api_full_name + '.backward.input', data_info.grad_in) - UtAPIInfo(api_full_name + '.backward.output.bench', data_info.bench_grad_out) - UtAPIInfo(api_full_name + '.backward.output.device', data_info.device_grad_out) + processor.save_tensors_in_element(api_full_name + '.forward.input', element) + processor.save_tensors_in_element(api_full_name + '.forward.output.bench', data_info.bench_output) + processor.save_tensors_in_element(api_full_name + '.forward.output.device', data_info.device_output) + processor.save_tensors_in_element(api_full_name + '.backward.input', data_info.grad_in) + processor.save_tensors_in_element(api_full_name + '.backward.output.bench', data_info.bench_grad) + processor.save_tensors_in_element(api_full_name + '.backward.output.device', data_info.device_grad) def run_torch_api(api_full_name, real_data_path, backward_content, api_info_dict): in_fwd_data_list = [] + backward_message = '' [api_type, api_name, _] = api_full_name.split(Const.SEP) args, kwargs, need_grad = get_api_info(api_info_dict, api_name, real_data_path) in_fwd_data_list.append(args) in_fwd_data_list.append(kwargs) need_backward = api_full_name in backward_content if not need_grad: - print_warn_log("%s function with out=... arguments don't support automatic differentiation, skip backward." - % api_full_name) + logger.warning("%s %s" % (api_full_name, Backward_Message.UNSUPPORT_BACKWARD_MESSAGE)) + backward_message += Backward_Message.UNSUPPORT_BACKWARD_MESSAGE if api_name in not_backward_list: need_grad = False - print_warn_log( - "%s function backward result is None, skip backward." 
% api_full_name) + logger.warning("%s %s" % (api_full_name, Backward_Message.NO_BACKWARD_RESULT_MESSAGE)) + backward_message += Backward_Message.NO_BACKWARD_RESULT_MESSAGE need_backward = need_backward and need_grad if kwargs.get("device"): del kwargs["device"] @@ -248,17 +273,17 @@ def run_torch_api(api_full_name, real_data_path, backward_content, api_info_dict grad_index = grad_input_index.get('grad_index') if need_backward: - backward_args = backward_content[api_full_name].get("grad_output") - grad = gen_args(backward_args, real_data_path=real_data_path)[0] - bench_grad, _ = generate_cpu_params(grad, {}, False, api_name) - bench_grad_out = run_backward(cpu_args, bench_grad, grad_index, out) - device_grad = grad.clone().detach().to(current_device) - device_grad_out = run_backward(device_args, device_grad, grad_index, device_out) + if need_to_backward(grad_index, out): + backward_args = backward_content[api_full_name].get("grad_output") + grad = gen_args(backward_args, api_name, real_data_path=real_data_path)[0] + bench_grad, _ = generate_cpu_params(grad, {}, False, api_name) + bench_grad_out = run_backward(cpu_args, bench_grad, grad_index, out) + device_grad = grad.clone().detach().to(current_device) + device_grad_out = run_backward(device_args, device_grad, grad_index, device_out) + else: + backward_message += Backward_Message.MULTIPLE_BACKWARD_MESSAGE - if grad_index is not None: - return UtDataInfo(bench_grad_out, device_grad_out, device_out[grad_index], out[grad_index], bench_grad, - in_fwd_data_list) - return UtDataInfo(bench_grad_out, device_grad_out, device_out, out, bench_grad, in_fwd_data_list) + return UtDataInfo(bench_grad_out, device_grad_out, device_out, out, bench_grad, in_fwd_data_list, backward_message) def get_api_info(api_info_dict, api_name, real_data_path): @@ -266,15 +291,19 @@ def get_api_info(api_info_dict, api_name, real_data_path): need_grad = True if api_info_dict.get("input_kwargs") and "out" in api_info_dict.get("input_kwargs"): need_grad = False - args, kwargs = gen_api_params(api_info_dict, need_grad, convert_type, real_data_path) + args, kwargs = gen_api_params(api_info_dict, api_name, need_grad, convert_type, real_data_path) return args, kwargs, need_grad +def need_to_backward(grad_index, out): + if grad_index is None and isinstance(out, (list, tuple)): + return False + return True + + def run_backward(args, grad, grad_index, out): if grad_index is not None: out[grad_index].backward(grad) - elif isinstance(out, (list, tuple)): - raise NotImplementedError("Multiple backward is not supported.") else: out.backward(grad) args_grad = [] @@ -286,14 +315,14 @@ def run_backward(args, grad, grad_index, out): return grad_out -def initialize_save_error_data(): - error_data_path = msCheckerConfig.error_data_path +def initialize_save_error_data(error_data_path): check_path_before_create(error_data_path) create_directory(error_data_path) - error_data_path_checker = FileChecker(msCheckerConfig.error_data_path, FileCheckConst.DIR, + error_data_path_checker = FileChecker(error_data_path, FileCheckConst.DIR, ability=FileCheckConst.WRITE_ABLE) error_data_path = error_data_path_checker.common_check() - initialize_save_path(error_data_path, UT_ERROR_DATA_DIR) + error_data_path =initialize_save_path(error_data_path, UT_ERROR_DATA_DIR) + return error_data_path def get_validated_result_csv_path(result_csv_path, mode): @@ -356,34 +385,45 @@ def _run_ut_parser(parser): required=False) parser.add_argument("-f", "--filter_api", dest="filter_api", action="store_true", help=" Whether 
to filter the api in the api_info_file.", required=False) + parser.add_argument("-config", "--config_path", dest="config_path", default="", type=str, + help=" The path of config.json", required=False) def preprocess_forward_content(forward_content): processed_content = {} base_keys_variants = {} + arg_cache = {} + for key, value in forward_content.items(): base_key = key.rsplit(Const.SEP, 1)[0] - new_args = value['args'] - new_kwargs = value['kwargs'] - filtered_new_args = [{k: v for k, v in arg.items() if k not in ['Max', 'Min']} for arg in new_args if isinstance(arg, dict)] - if base_key in base_keys_variants: + + if key not in arg_cache: + filtered_new_args = [ + {k: v for k, v in arg.items() if k not in ['Max', 'Min']} + for arg in value['args'] if isinstance(arg, dict) + ] + arg_cache[key] = (filtered_new_args, value['kwargs']) + + filtered_new_args, new_kwargs = arg_cache[key] + + if base_key not in base_keys_variants: + processed_content[key] = value + base_keys_variants[base_key] = {key} + else: is_duplicate = False for variant in base_keys_variants.get(base_key, []): try: - existing_args = processed_content[variant].get('args', []) - existing_kwargs = processed_content[variant].get('kwargs', {}) - filtered_existing_args = [{k: v for k, v in arg.items() if k not in ['Max', 'Min']} for arg in existing_args if isinstance(arg, dict)] + existing_args, existing_kwargs = arg_cache.get(variant) except KeyError as e: - print_error_log(f"KeyError: {e} when processing {key}") - if filtered_existing_args == filtered_new_args and existing_kwargs == new_kwargs: + logger.error(f"KeyError: {e} when processing {key}") + if existing_args == filtered_new_args and existing_kwargs == new_kwargs: is_duplicate = True break + if not is_duplicate: processed_content[key] = value - base_keys_variants[base_key].append(key) - else: - processed_content[key] = value - base_keys_variants[base_key] = [key] + base_keys_variants[base_key].add(key) + return processed_content @@ -405,7 +445,7 @@ def run_ut_command(args): else: torch.npu.set_device(used_device) except Exception as error: - print_error_log(f"Set device id failed. device id is: {args.device_id}") + logger.error(f"Set device id failed. 
device id is: {args.device_id}") raise NotImplementedError from error check_link(args.api_info_file) api_info = os.path.realpath(args.api_info_file) @@ -418,42 +458,47 @@ def run_ut_command(args): save_error_data = args.save_error_data forward_content, backward_content, real_data_path = parse_json_info_forward_backward(api_info) if args.filter_api: + logger.info("Start filtering the api in the forward_input_file.") forward_content = preprocess_forward_content(forward_content) + logger.info("Finish filtering the api in the forward_input_file.") result_csv_path = os.path.join(out_path, RESULT_FILE_NAME) details_csv_path = os.path.join(out_path, DETAILS_FILE_NAME) if args.result_csv_path: result_csv_path = get_validated_result_csv_path(args.result_csv_path, 'result') details_csv_path = get_validated_details_csv_path(result_csv_path) + white_list = msCheckerConfig.white_list + black_list = msCheckerConfig.black_list + error_data_path = msCheckerConfig.error_data_path + if args.config_path: + _, task_config = parse_json_config(args.config_path, Const.RUN_UT) + white_list = task_config.white_list + black_list = task_config.black_list + error_data_path = task_config.error_data_path if save_error_data: if args.result_csv_path: time_info = result_csv_path.split('.')[0].split('_')[-1] global UT_ERROR_DATA_DIR UT_ERROR_DATA_DIR = 'ut_error_data' + time_info - initialize_save_error_data() + error_data_path = initialize_save_error_data(error_data_path) run_ut_config = RunUTConfig(forward_content, backward_content, result_csv_path, details_csv_path, save_error_data, - args.result_csv_path, real_data_path) + args.result_csv_path, real_data_path, set(white_list), set(black_list), error_data_path) run_ut(run_ut_config) class UtDataInfo: - def __init__(self, bench_grad_out, device_grad_out, device_out, bench_out, grad_in, in_fwd_data_list): - self.bench_grad_out = bench_grad_out - self.device_grad_out = device_grad_out - self.device_out = device_out - self.bench_out = bench_out + def __init__(self, bench_grad, device_grad, device_output, bench_output, grad_in, in_fwd_data_list, + backward_message, rank=0): + self.bench_grad = bench_grad + self.device_grad = device_grad + self.device_output = device_output + self.bench_output = bench_output self.grad_in = grad_in self.in_fwd_data_list = in_fwd_data_list - - -class UtAPIInfo(APIInfo): - def __init__(self, api_name, element): - super().__init__(api_name, - save_path=self.get_full_save_path(msCheckerConfig.error_data_path, UT_ERROR_DATA_DIR), - is_save_data=True) - self.analyze_element(element) + self.backward_message = backward_message + self.rank = rank if __name__ == '__main__': _run_ut() - print_info_log("UT task completed.") + logger.info("UT task completed.") diff --git a/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d78642f214f904183af73f18e726ee8ad72ef2b4 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py @@ -0,0 +1,7 @@ +hf_32_standard_api = ["conv1d", "conv2d"] + + +class Backward_Message: + MULTIPLE_BACKWARD_MESSAGE = "Multiple backward is not supported." + UNSUPPORT_BACKWARD_MESSAGE = "function with out=... arguments don't support automatic differentiation, skip backward." + NO_BACKWARD_RESULT_MESSAGE = "function backward result is None, skip backward." 
\ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/torch_ut_setting.json b/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/torch_ut_setting.json similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/run_ut/torch_ut_setting.json rename to debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/run_ut/torch_ut_setting.json diff --git a/debug/accuracy_tools/atat/pytorch/common/__init__.py b/debug/accuracy_tools/msprobe/pytorch/common/__init__.py similarity index 38% rename from debug/accuracy_tools/atat/pytorch/common/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/common/__init__.py index b391e103115498a2c2cf8b78f48168822517be73..8283aa502195cf2cf3cdac1260414b2dbc36a6dd 100644 --- a/debug/accuracy_tools/atat/pytorch/common/__init__.py +++ b/debug/accuracy_tools/msprobe/pytorch/common/__init__.py @@ -1,4 +1,2 @@ -from .recursive import recursive_apply_transform -from .log import print_error_log_rank_0, print_info_log_rank_0, print_warn_log_rank_0 from .parse_json import parse_json_info_forward_backward from .utils import seed_all diff --git a/debug/accuracy_tools/atat/pytorch/common/compare_script.template b/debug/accuracy_tools/msprobe/pytorch/common/compare_script.template similarity index 100% rename from debug/accuracy_tools/atat/pytorch/common/compare_script.template rename to debug/accuracy_tools/msprobe/pytorch/common/compare_script.template diff --git a/debug/accuracy_tools/msprobe/pytorch/common/log.py b/debug/accuracy_tools/msprobe/pytorch/common/log.py new file mode 100644 index 0000000000000000000000000000000000000000..cea518fa47b977ad08b5fded3401a63bf3a29d03 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/common/log.py @@ -0,0 +1,32 @@ +import os +import time +import sys +from msprobe.pytorch.common.utils import get_rank_if_initialized +from msprobe.core.common.log import BaseLogger +from msprobe.core.common.exceptions import DistributedNotInitializedError + + +class PyTorchLogger(BaseLogger): + def __init__(self): + super().__init__() + + def get_rank(self): + try: + current_rank = get_rank_if_initialized() + except DistributedNotInitializedError: + current_rank = None + return current_rank + + def _print_log(self, level, msg, end='\n'): + current_rank = self.get_rank() + current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + pid = os.getpid() + if current_rank is not None: + full_msg = f"{current_time} ({pid}) [rank {current_rank}] [{level}] {msg}" + else: + full_msg = f"{current_time} ({pid}) [{level}] {msg}" + print(full_msg, end=end) + sys.stdout.flush() + + +logger = PyTorchLogger() \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/common/parse_json.py b/debug/accuracy_tools/msprobe/pytorch/common/parse_json.py similarity index 95% rename from debug/accuracy_tools/atat/pytorch/common/parse_json.py rename to debug/accuracy_tools/msprobe/pytorch/common/parse_json.py index dc594c4cf818ed0bedd8d3997e2848d9fe123a17..22f798798679742031378338a5bcc4356efbc4cd 100644 --- a/debug/accuracy_tools/atat/pytorch/common/parse_json.py +++ b/debug/accuracy_tools/msprobe/pytorch/common/parse_json.py @@ -1,5 +1,5 @@ import json -from .exceptions import ParseJsonException +from msprobe.core.common.exceptions import ParseJsonException def parse_json_info_forward_backward(json_path): diff --git a/debug/accuracy_tools/atat/pytorch/common/utils.py b/debug/accuracy_tools/msprobe/pytorch/common/utils.py similarity index 86% rename from 
debug/accuracy_tools/atat/pytorch/common/utils.py rename to debug/accuracy_tools/msprobe/pytorch/common/utils.py index e88d506b2c340f9b6141c2e0bb775a693d61a16c..acc1de105148899a46fe381b94cb7772f26d3ac8 100644 --- a/debug/accuracy_tools/atat/pytorch/common/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/common/utils.py @@ -15,12 +15,13 @@ # limitations under the License. """ import os -import re import random import stat import torch import numpy as np from functools import wraps +from msprobe.core.common.exceptions import DistributedNotInitializedError + try: import torch_npu except ImportError: @@ -28,8 +29,7 @@ except ImportError: else: is_gpu = False - -torch_without_guard_version_list = ['2.1'] +torch_without_guard_version_list = ['2.1', '2.2'] for version in torch_without_guard_version_list: if torch.__version__.startswith(version): torch_without_guard_version = True @@ -93,9 +93,13 @@ def torch_device_guard(func): def get_rank_if_initialized(): + """ + return rank id if it is initialized or raise Exception: DistributedNotInitializedError + """ if torch.distributed.is_initialized(): return torch.distributed.get_rank() - return None + else: + raise DistributedNotInitializedError("torch distributed environment is not initialized") def seed_all(seed=1234, mode=False): @@ -145,6 +149,8 @@ class Const: GRAD_OUTPUT = 'grad_output' START = "start" STOP = "stop" + MAX = 'Max' + MIN = 'Min' # dump mode ALL = "all" @@ -178,6 +184,7 @@ class Const: # env dump path ASCEND_WORK_PATH = "ASCEND_WORK_PATH" DUMP_DIR = "dump_data" + DATA = "data" ENV_ENABLE = "1" ENV_DISABLE = "0" @@ -192,4 +199,25 @@ class Const: STATISTICS = "statistics" TENSOR = "tensor" OVERFLOW_CHECK = "overflow_check" - FREE_BENCHMARK = "free_benchmark" \ No newline at end of file + FREE_BENCHMARK = "free_benchmark" + + ATTR_NAME_PREFIX = "wrap_" + + FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] + BOOL_TYPE = [bool, np.uint8] + INT_TYPE = [np.int32, np.int64] + NPU = 'NPU' + DISTRIBUTED = 'Distributed' + + RAISE_PRECISION = { + torch.float16: torch.float32, + torch.bfloat16: torch.float32, + torch.float32: torch.float64 + } + CONVERT = { + "int32_to_int64": ["torch.int32", "torch.int64"], + } + + CONVERT_API = { + "int32_to_int64": ["cross_entropy"] + } diff --git a/debug/accuracy_tools/atat/pytorch/compare/acc_compare.py b/debug/accuracy_tools/msprobe/pytorch/compare/acc_compare.py similarity index 82% rename from debug/accuracy_tools/atat/pytorch/compare/acc_compare.py rename to debug/accuracy_tools/msprobe/pytorch/compare/acc_compare.py index bd903ef2d3a8eea72416280f918546c31eed0779..e214910566e6af920f48a8f2c4c4f7cb47b36b10 100644 --- a/debug/accuracy_tools/atat/pytorch/compare/acc_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/compare/acc_compare.py @@ -18,25 +18,26 @@ import json import multiprocessing import os.path -import stat import sys -import math import torch - import numpy as np import pandas as pd import openpyxl from openpyxl.styles import PatternFill from collections import namedtuple +from dataclasses import dataclass -from .match import graph_mapping -from .highlight import HighlightRules, get_header_index -from .npy_compare import compare_ops_apply, get_error_type, reshape_value, get_relative_err, get_error_message -from ..advisor.advisor import Advisor -from ...core.utils import check_compare_param, add_time_with_xlsx, CompareException, CompareConst, \ - format_value, check_file_not_exists, check_configuration_param, task_dumppath_get, 
print_info_log, \ - print_warn_log, print_error_log, Const -from ...core.file_check_util import FileChecker, FileCheckConst, change_mode, FileOpen, create_directory +from msprobe.pytorch.compare.match import graph_mapping +from msprobe.pytorch.compare.highlight import HighlightRules, get_header_index +from msprobe.pytorch.compare.npy_compare import compare_ops_apply, get_error_type, reshape_value, get_relative_err, \ + get_error_message +from msprobe.pytorch.advisor.advisor import Advisor +from msprobe.pytorch.common.log import logger +from msprobe.core.common.utils import check_compare_param, add_time_with_xlsx, CompareException, \ + format_value, check_file_not_exists, check_configuration_param, task_dumppath_get +from msprobe.core.common.file_check import FileChecker, change_mode, FileOpen, create_directory +from msprobe.core.common.const import Const, CompareConst, FileCheckConst +from msprobe.core.common.exceptions import FileCheckException def check_graph_mode(a_op_name, b_op_name): @@ -60,7 +61,7 @@ def check_op(npu_dict, bench_dict, fuzzy_match): try: is_match = fuzzy_check_op(a_op_name, b_op_name) except Exception as err: - print_warn_log("%s and %s can not fuzzy match." % (a_op_name, b_op_name)) + logger.warning("%s and %s can not fuzzy match." % (a_op_name, b_op_name)) is_match = False return is_match and struct_match @@ -125,7 +126,7 @@ def fuzzy_check_name(npu_name, bench_name): def rename_api(npu_name, process): npu_split = npu_name.split(process) torch_func_index, in_out = npu_split[0], npu_split[1] - torch_func_split = torch_func_index.rsplit("_", 2) + torch_func_split = torch_func_index.rsplit(Const.SEP, 2) torch_func = str(torch_func_split[0]) + str(in_out) return torch_func @@ -139,7 +140,7 @@ def merge_tensor(tensor_list, summary_compare, md5_compare): op_dict["summary"] = [] op_dict["stack_info"] = [] - all_mode_bool = summary_compare == False and md5_compare == False + all_mode_bool = not (summary_compare or md5_compare) if all_mode_bool: op_dict["data_name"] = [] @@ -192,7 +193,7 @@ def get_accuracy(result, n_dict, b_dict, summary_compare=False, md5_compare=Fals bench_stack_info = b_dict.get("stack_info", None) has_stack = npu_stack_info and bench_stack_info - all_mode_bool = summary_compare == False and md5_compare == False + all_mode_bool = not (summary_compare or md5_compare) if all_mode_bool: npu_data_name = n_dict.get("data_name", None) bench_data_name = b_dict.get("data_name", None) @@ -206,7 +207,8 @@ def get_accuracy(result, n_dict, b_dict, summary_compare=False, md5_compare=Fals err_msg = "" if md5_compare: result_item = [n_name, b_name, n_struct[0], b_struct[0], n_struct[1], b_struct[1], - n_struct[2], b_struct[2], CompareConst.PASS if n_struct[2] == b_struct[2] else CompareConst.DIFF] + n_struct[2], b_struct[2], + CompareConst.PASS if n_struct[2] == b_struct[2] else CompareConst.DIFF] if has_stack and index == 0 and key == "input_struct": result_item.extend(npu_stack_info) else: @@ -233,7 +235,7 @@ def get_accuracy(result, n_dict, b_dict, summary_compare=False, md5_compare=Fals if isinstance(npu_val, (float, int)) and isinstance(bench_val, (float, int)): diff = npu_val - bench_val if bench_val != 0: - relative = str(abs((diff/bench_val) * 100)) + '%' + relative = str(abs((diff / bench_val) * 100)) + '%' else: relative = "N/A" result_item[start_idx + i] = diff @@ -245,7 +247,9 @@ def get_accuracy(result, n_dict, b_dict, summary_compare=False, md5_compare=Fals result_item[start_idx + i] = CompareConst.NONE accuracy_check = CompareConst.WARNING if 
warning_flag else "" err_msg += "Need double check api accuracy." if warning_flag else "" - result_item[start_idx:] = [f'{str(x)}\t' if str(x) in ('inf', '-inf', 'nan') else x for x in result_item[start_idx:]] + for i in range(start_idx, len(result_item)): + if str(result_item[i]) in ('inf', '-inf', 'nan'): + result_item[i] = f'{result_item[i]}\t' result_item.append(accuracy_check if summary_compare else CompareConst.ACCURACY_CHECK_YES) result_item.append(err_msg) @@ -305,7 +309,7 @@ def _do_multi_process(input_parma, result_df): result_df = _handle_multi_process(compare_ops, input_parma, result_df, multiprocessing.Manager().RLock()) return result_df except ValueError as e: - print_error_log('result dataframe is not found.') + logger.error('result dataframe is not found.') raise CompareException(CompareException.INVALID_DATA_ERROR) from e @@ -313,19 +317,17 @@ def read_dump_data(result_df): try: npu_dump_name_list = result_df.iloc[0:, 0].tolist() npu_dump_tensor_list = result_df.iloc[0:, -1].tolist() - # bench_dump_name_list = csv_pd.iloc[0:, 1].tolist() op_name_mapping_dict = {} for index, _ in enumerate(npu_dump_name_list): npu_dump_name = npu_dump_name_list[index] npu_dump_tensor = npu_dump_tensor_list[index] - # bench_dump_name = bench_dump_name_list[index] op_name_mapping_dict[npu_dump_name] = [npu_dump_tensor, npu_dump_tensor] return op_name_mapping_dict except ValueError as e: - print_error_log('result dataframe is not found.') + logger.error('result dataframe is not found.') raise CompareException(CompareException.INVALID_DATA_ERROR) from e except IndexError as e: - print_error_log('result dataframe elements can not be access.') + logger.error('result dataframe elements can not be access.') raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from e @@ -343,17 +345,17 @@ def _handle_multi_process(func, input_parma, result_df, lock): pool = multiprocessing.Pool(process_num) def err_call(args): - print_error_log('multiprocess compare failed! Reason: {}'.format(args)) + logger.error('multiprocess compare failed! 
Reason: {}'.format(args)) try: pool.terminate() except OSError as e: - print_error_log("pool terminate failed") + logger.error("pool terminate failed") for process_idx, df_chunk in enumerate(df_chunks): idx = df_chunk_size * process_idx result = pool.apply_async(func, - args=(idx, op_name_mapping_dict, df_chunk, lock, input_parma), - error_callback=err_call) + args=(idx, op_name_mapping_dict, df_chunk, lock, input_parma), + error_callback=err_call) results.append(result) final_results = [r.get() for r in results] pool.close() @@ -372,39 +374,73 @@ def compare_ops(idx, dump_path_dict, result_df, lock, input_parma): for i in range(len(result_df)): op_name = result_df.iloc[i, 0] if is_print_compare_log: - print("start compare: {}".format(op_name)) - cos_sim, max_abs_err, max_relative_err, one_thousand_err_ratio, five_thousand_err_ratio, err_msg = compare_by_op(op_name, dump_path_dict, input_parma) + logger.info("start compare: {}".format(op_name)) + cos_sim, max_abs_err, max_relative_err, one_thousand_err_ratio, five_thousand_err_ratio, err_msg = compare_by_op( + op_name, dump_path_dict, input_parma) if is_print_compare_log: - print("[{}] Compare result: cosine {}, max_abs_err {}, max_relative_err {}, {}, one_thousand_err_ratio {}, five_thousand_err_ratio {}".format(op_name, cos_sim, max_abs_err, max_relative_err, err_msg, one_thousand_err_ratio, five_thousand_err_ratio)) + logger.info( + "[{}] Compare result: cosine {}, max_abs_err {}, max_relative_err {}, {}, one_thousand_err_ratio {}, " + "five_thousand_err_ratio {}".format(op_name, cos_sim, max_abs_err, max_relative_err, err_msg, + one_thousand_err_ratio, five_thousand_err_ratio)) cos_result.append(cos_sim) max_err_result.append(max_abs_err) max_relative_err_result.append(max_relative_err) err_mess.append(err_msg) one_thousand_err_ratio_result.append(one_thousand_err_ratio) five_thousand_err_ratio_result.append(five_thousand_err_ratio) - result_df = _save_cmp_result(idx, cos_result, max_err_result, max_relative_err_result, err_mess, one_thousand_err_ratio_result, - five_thousand_err_ratio_result, result_df, lock) - return result_df + + cr = ComparisonResult( + cos_result=cos_result, + max_err_result=max_err_result, + max_relative_err_result=max_relative_err_result, + err_msgs=err_mess, + one_thousand_err_ratio_result=one_thousand_err_ratio_result, + five_thousand_err_ratio_result=five_thousand_err_ratio_result + ) + + return _save_cmp_result(idx, cr, result_df, lock) -def _save_cmp_result(idx, cos_result, max_err_result, max_relative_err_result, err_msg, one_thousand_err_ratio_result, five_thousand_err_ratio_result, result_df, lock): +@dataclass +class ComparisonResult: + cos_result: list + max_err_result: list + max_relative_err_result: list + err_msgs: list + one_thousand_err_ratio_result: list + five_thousand_err_ratio_result: list + + +def _save_cmp_result(offset, result: ComparisonResult, result_df, lock): + """ + Save comparison results into the result DataFrame with thread safety. 
+ Args: + offset: offset for index + result: data struct of ComparisonResult + result_df: result of DataFrame + lock: thread lock + + Returns: + comparison results in DataFrame + """ + lock.acquire() try: - for i, _ in enumerate(cos_result): - process_index = i + idx - result_df.loc[process_index, CompareConst.COSINE] = cos_result[i] - result_df.loc[process_index, CompareConst.MAX_ABS_ERR] = max_err_result[i] - result_df.loc[process_index, CompareConst.MAX_RELATIVE_ERR] = max_relative_err_result[i] - result_df.loc[process_index, CompareConst.ERROR_MESSAGE] = err_msg[i] - result_df.loc[process_index, CompareConst.ACCURACY] = check_accuracy(cos_result[i], max_err_result[i]) - result_df.loc[process_index, CompareConst.ONE_THOUSANDTH_ERR_RATIO] = one_thousand_err_ratio_result[i] - result_df.loc[process_index, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO] = five_thousand_err_ratio_result[i] + for i, _ in enumerate(result.cos_result): + process_index = i + offset + result_df.loc[process_index, CompareConst.COSINE] = result.cos_result[i] + result_df.loc[process_index, CompareConst.MAX_ABS_ERR] = result.max_err_result[i] + result_df.loc[process_index, CompareConst.MAX_RELATIVE_ERR] = result.max_relative_err_result[i] + result_df.loc[process_index, CompareConst.ERROR_MESSAGE] = result.err_msgs[i] + result_df.loc[process_index, CompareConst.ACCURACY] = check_accuracy(result.cos_result[i], result.max_err_result[i]) + result_df.loc[process_index, CompareConst.ONE_THOUSANDTH_ERR_RATIO] = result.one_thousand_err_ratio_result[i] + result_df.loc[process_index, CompareConst.FIVE_THOUSANDTHS_ERR_RATIO] = result.five_thousand_err_ratio_result[i] return result_df except ValueError as e: - print_error_log('result dataframe is not found.') + logger.error('result dataframe is not found.') raise CompareException(CompareException.INVALID_DATA_ERROR) from e except IndexError as e: - print_error_log('result dataframe elements can not be access.') + logger.error('result dataframe elements can not be access.') raise CompareException(CompareException.INDEX_OUT_OF_BOUNDS_ERROR) from e finally: lock.release() @@ -420,7 +456,7 @@ def check_accuracy(cos, max_abs_err): try: cos, max_abs_err = float(cos), float(max_abs_err) except ValueError: - print_warn_log("Cosine or MaxAbsErr can not get float value.") + logger.warning("Cosine or MaxAbsErr can not get float value.") return CompareConst.NONE if cos < CompareConst.COS_THRESHOLD and max_abs_err > CompareConst.MAX_ABS_ERR_THRESHOLD: return CompareConst.ACCURACY_CHECK_NO @@ -434,7 +470,10 @@ def read_npy_data(dir_path, file_name): path_checker = FileChecker(data_path, FileCheckConst.FILE, FileCheckConst.READ_ABLE, FileCheckConst.PT_SUFFIX, False) data_path = path_checker.common_check() - data_value = torch.load(data_path, map_location=torch.device('cpu')).detach().numpy() + data_value = torch.load(data_path, map_location=torch.device('cpu')).detach() # detach for less memory + if data_value.dtype == torch.bfloat16: + data_value = data_value.to(torch.float32) + data_value = data_value.numpy() return data_value @@ -453,6 +492,10 @@ def compare_by_op(op_name, op_name_mapping_dict, input_parma): error_file = error.filename n_value, b_value = CompareConst.READ_NONE, CompareConst.READ_NONE error_flag = True + except FileCheckerException: + error_file = data_name + n_value, b_value = CompareConst.READ_NONE, CompareConst.READ_NONE + error_flag = True n_value, b_value, error_flag = get_error_type(n_value, b_value, error_flag) if not error_flag: @@ -473,7 +516,11 @@ def 
handle_inf_nan(n_value, b_value): b_inf = np.isinf(b_value) n_nan = np.isnan(n_value) b_nan = np.isnan(b_value) - if np.any(n_inf) or np.any(b_inf) or np.any(n_nan) or np.any(b_nan): + + # merge boolean expressions + any_inf = np.any(n_inf) or np.any(b_inf) + any_nan = np.any(n_nan) or np.any(b_nan) + if any_inf or any_nan: if np.array_equal(n_inf, b_inf) and np.array_equal(n_nan, b_nan): n_value[n_inf] = 0 b_value[b_inf] = 0 @@ -484,8 +531,10 @@ def handle_inf_nan(n_value, b_value): return n_value, b_value -def find_error_rows(result, last_len, n_num_input, highlight_dict, summary_compare=False): +def find_error_rows(result, last_len, n_num_input, highlight_dict, summary_compare=False, md5_compare=False): """找到单个API中需要高亮的行""" + if md5_compare: + return npu_max_index = get_header_index('NPU max', summary_compare) bench_max_index = get_header_index('Bench max', summary_compare) max_diff_index = get_header_index('Max diff' if summary_compare else 'MaxAbsErr', summary_compare) @@ -510,9 +559,9 @@ def find_error_rows(result, last_len, n_num_input, highlight_dict, summary_compa continue if not isinstance(api_out[npu_max_index], (float, int)) \ or not isinstance(api_out[bench_max_index], (float, int)) \ - or not isinstance(api_out[max_diff_index],(float, int)): + or not isinstance(api_out[max_diff_index], (float, int)): continue - for m, api_in in enumerate(result[0:n_num_input]): + for _, api_in in enumerate(result[0:n_num_input]): if not isinstance(api_in[npu_max_index], (float, int)) \ or not isinstance(api_in[bench_max_index], (float, int)) \ or not isinstance(api_in[max_diff_index], (float, int)): @@ -541,7 +590,7 @@ def get_name_and_state(name): return api_name, state -def find_compare_result_error_rows(result_df, highlight_dict, summary_compare): +def find_compare_result_error_rows(result_df, highlight_dict, summary_compare, md5_compare): """将dataframe根据API分组,并找到有误差的算子用于高亮""" result = result_df.values start, input_num, output_num, end = 0, 0, 0, len(result_df) @@ -558,7 +607,8 @@ def find_compare_result_error_rows(result_df, highlight_dict, summary_compare): num, last_state = 1, state else: output_num = num - find_error_rows(result[start:start + input_num + output_num], start, input_num, highlight_dict, summary_compare) + find_error_rows(result[start:start + input_num + output_num], start, input_num, highlight_dict, + summary_compare, md5_compare) num, last_api_name, last_state = 1, api_name, state start += input_num + output_num input_num, output_num = 1, 0 @@ -569,12 +619,12 @@ def find_compare_result_error_rows(result_df, highlight_dict, summary_compare): input_num = num else: output_num = num - find_error_rows(result[start:start + input_num + output_num], start, input_num, highlight_dict, summary_compare) + find_error_rows(result[start:start + input_num + output_num], start, input_num, highlight_dict, summary_compare, md5_compare) def highlight_rows_xlsx(result_df, highlight_dict, file_path): """Write and highlight results in Excel""" - print_info_log('Compare result is %s' % file_path) + logger.info('Compare result is %s' % file_path) wb = openpyxl.Workbook() ws = wb.active @@ -607,16 +657,40 @@ def compare(input_parma, output_path, stack_mode=False, auto_analyze=True, create_directory(output_path) check_compare_param(input_parma, output_path, stack_mode, summary_compare, md5_compare) except CompareException as error: - print_error_log('Compare failed. Please check the arguments and do it again!') + logger.error('Compare failed. 
Please check the arguments and do it again!') sys.exit(error.code) compare_core(input_parma, output_path, stack_mode=stack_mode, auto_analyze=auto_analyze, fuzzy_match=fuzzy_match, summary_compare=summary_compare, md5_compare=md5_compare) -def compare_core(input_parma, output_path, stack_mode=False, auto_analyze=True, - suffix='', fuzzy_match=False, summary_compare=False, md5_compare=False): - print_info_log("Please check whether the input data belongs to you. If not, there may be security risks.") +def compare_core(input_parma, output_path, **kwargs): + """ + Compares data from multiple JSON files and generates a comparison report. + + Args: + input_parma (dict): A dictionary containing paths to JSON files ("npu_json_path", "bench_json_path", + "stack_json_path"). + output_path (str): The path where the output Excel report will be saved. + **kwargs: Additional keyword arguments including: + - stack_mode (bool, optional): Enables stack mode comparison. Defaults to False. + - auto_analyze (bool, optional): If True, triggers automatic analysis after comparison. Defaults to True. + - suffix (str, optional): Suffix to append to the output file name. Defaults to ''. + - fuzzy_match (bool, optional): Enables fuzzy matching during comparison. Defaults to False. + - summary_compare (bool, optional): Enables summary comparison mode. Defaults to False. + - md5_compare (bool, optional): Enables MD5 comparison. Defaults to False. + + Returns: + """ + # get kwargs or set default value + stack_mode = kwargs.get('stack_mode', False) + auto_analyze = kwargs.get('auto_analyze', True) + suffix = kwargs.get('suffix', '') + fuzzy_match = kwargs.get('fuzzy_match', False) + summary_compare = kwargs.get('summary_compare', False) + md5_compare = kwargs.get('md5_compare', False) + + logger.info("Please check whether the input data belongs to you. 
If not, there may be security risks.") file_name = add_time_with_xlsx("compare_result" + suffix) file_path = os.path.join(os.path.realpath(output_path), file_name) check_file_not_exists(file_path) @@ -630,7 +704,7 @@ def compare_core(input_parma, output_path, stack_mode=False, auto_analyze=True, if not md5_compare and not summary_compare: result_df = _do_multi_process(input_parma, result_df) - find_compare_result_error_rows(result_df, highlight_dict, summary_compare) + find_compare_result_error_rows(result_df, highlight_dict, summary_compare, md5_compare) highlight_rows_xlsx(result_df, highlight_dict, file_path) if auto_analyze: advisor = Advisor(result_df, output_path) @@ -639,7 +713,7 @@ def compare_core(input_parma, output_path, stack_mode=False, auto_analyze=True, def parse(pkl_file, module_name_prefix): if not isinstance(module_name_prefix, str): - print_error_log("The parameter:module_name_prefix is not a string.") + logger.error("The parameter:module_name_prefix is not a string.") raise CompareException(CompareException.INVALID_PARAM_ERROR) with FileOpen(pkl_file, "r") as f: done = False @@ -658,29 +732,33 @@ def parse(pkl_file, module_name_prefix): continue if info_prefix.find("stack_info") != -1: - print("\nTrace back({}):".format(msg[0])) + logger.info("\nTrace back({}):".format(msg[0])) for item in reversed(msg[1]): - print(" File \"{}\", line {}, in {}".format(item[0], item[1], item[2])) - print(" {}".format(item[3])) + logger.info(" File \"{}\", line {}, in {}".format(item[0], item[1], item[2])) + logger.info(" {}".format(item[3])) continue if len(msg) > 5: summary_info = " [{}][dtype: {}][shape: {}][max: {}][min: {}][mean: {}]" \ .format(msg[0], msg[3], msg[4], msg[5][0], msg[5][1], msg[5][2]) if not title_printed: - print("\nStatistic Info:") + logger.info("\nStatistic Info:") title_printed = True - print(summary_info) + logger.info(summary_info) -def op_item_parse(item, op_name, index, item_list=[], top_bool=True): - if item == None or (isinstance(item, dict) and len(item) == 0): +def op_item_parse(item, op_name, index, item_list=None, top_bool=True): + if item_list is None: + item_list = [] + if item is None or (isinstance(item, dict) and not item): if not top_bool: - tmp = {'full_op_name': op_name + '.' + str(index), 'Max': None, 'Min': None, 'Mean': None, 'Norm': None, 'dtype': None, 'shape': None, 'md5': None, 'data_name': '-1'} + tmp = {'full_op_name': op_name + '.' 
+ str(index), 'Max': None, 'Min': None, 'Mean': None, 'Norm': None, + 'dtype': None, 'shape': None, 'md5': None, 'data_name': '-1'} else: - tmp = {'full_op_name': op_name + '.0', 'Max': None, 'Min': None, 'Mean': None, 'Norm': None, 'dtype': None, 'shape': None, 'md5': None, 'data_name': '-1'} + tmp = {'full_op_name': op_name + '.0', 'Max': None, 'Min': None, 'Mean': None, 'Norm': None, 'dtype': None, + 'shape': None, 'md5': None, 'data_name': '-1'} item_list.append(tmp) return item_list - if index == None: + if index is None: if isinstance(item, dict): full_op_name = op_name + '.0' else: @@ -730,8 +808,8 @@ def op_item_parse(item, op_name, index, item_list=[], top_bool=True): else: resolve_api_special_parameters(item, full_op_name, item_list) else: - for j in range(len(item)): - op_item_parse(item[j], full_op_name, j, top_bool=False) + for j, item_spec in enumerate(item): + op_item_parse(item_spec, full_op_name, j, item_list=item_list, top_bool=False) return item_list @@ -811,7 +889,7 @@ def compare_process(file_handles, stack_mode, fuzzy_match, summary_compare=False stack_json_data = json.load(stack_json_handle) if fuzzy_match: - print_warn_log("This task uses fuzzy matching, which may affect the accuracy of the comparison.") + logger.warning("This task uses fuzzy matching, which may affect the accuracy of the comparison.") npu_ops_queue = [] bench_ops_queue = [] @@ -862,8 +940,10 @@ def compare_process(file_handles, stack_mode, fuzzy_match, summary_compare=False except StopIteration: read_err_bench = False - if len(npu_ops_queue) == 0 or len(bench_ops_queue) == 0 or ( - len(npu_ops_queue) == last_npu_ops_len and len(bench_ops_queue) == last_bench_ops_len): + # merge all boolean expressions + both_empty = not npu_ops_queue and not bench_ops_queue + no_change = (len(npu_ops_queue) == last_npu_ops_len) and (len(bench_ops_queue) == last_bench_ops_len) + if both_empty or no_change: continue n_match_point, b_match_point = match_op(npu_ops_queue, bench_ops_queue, fuzzy_match) @@ -889,7 +969,7 @@ def compare_process(file_handles, stack_mode, fuzzy_match, summary_compare=False else: header = CompareConst.COMPARE_RESULT_HEADER[:] - all_mode_bool = summary_compare == False and md5_compare == False + all_mode_bool = not (summary_compare or md5_compare) if stack_mode: if all_mode_bool: header.append(CompareConst.STACK) diff --git a/debug/accuracy_tools/atat/pytorch/compare/distributed_compare.py b/debug/accuracy_tools/msprobe/pytorch/compare/distributed_compare.py similarity index 88% rename from debug/accuracy_tools/atat/pytorch/compare/distributed_compare.py rename to debug/accuracy_tools/msprobe/pytorch/compare/distributed_compare.py index 09d40b214d5bc2ae67480d78c9255d9e0326567a..0298eca9e7e1fbf60d1f6f781bb76a170b4fd29f 100644 --- a/debug/accuracy_tools/atat/pytorch/compare/distributed_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/compare/distributed_compare.py @@ -17,10 +17,11 @@ import os import sys import re -from ...core.utils import print_error_log, CompareException, check_compare_param, \ +from msprobe.core.common.utils import CompareException, check_compare_param, \ check_configuration_param, task_dumppath_get, check_file_or_directory_path, check_regex_prefix_format_valid -from .acc_compare import compare_core -from ...core.file_check_util import create_directory +from msprobe.pytorch.compare.acc_compare import compare_core +from msprobe.core.common.file_check import create_directory +from msprobe.pytorch.common.log import logger def compare_distributed(npu_dump_dir, 
bench_dump_dir, output_path, **kwargs): @@ -46,7 +47,7 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): pattern = re.compile(rf'^{prefix}(?:0|[0-9][1-9]*)?$') for name in contents: if not pattern.match(name): - print_error_log( + logger.error( f"dump_dir contains '{name}'. Expected '{prefix}'. This name is not in the format of dump " f"output. Please check and delete irrelevant files in {dump_dir} and try again." ) @@ -66,12 +67,12 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): # Provide robustness on invalid directory inputs if not json_path: - print_error_log(f'No file is found in dump dir {dirname}. ') + logger.error(f'No file is found in dump dir {dirname}. ') raise CompareException(CompareException.NO_DUMP_FILE_ERROR) return json_path if kwargs.get('suffix'): - print_error_log("Argument 'suffix' is not supported for compare_distributed.") + logger.error("Argument 'suffix' is not supported for compare_distributed.") raise CompareException(CompareException.INVALID_PARAM_ERROR) stack_mode = kwargs.get('stack_mode', False) auto_analyze = kwargs.get('auto_analyze', True) @@ -80,7 +81,7 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): npu_ranks = sorted(check_and_return_dir_contents(npu_dump_dir, 'rank')) bench_ranks = sorted(check_and_return_dir_contents(bench_dump_dir, 'rank')) if len(npu_ranks) != len(bench_ranks): - print_error_log('The number of ranks in the two runs are different. ' + logger.error('The number of ranks in the two runs are different. ' 'Unable to match the ranks. Please use another folder to compare ' 'or use compare() api and manually match the ranks.') raise CompareException(CompareException.INVALID_PATH_ERROR) @@ -104,7 +105,7 @@ def compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs): create_directory(output_path) check_compare_param(dump_result_param, output_path, stack_mode=stack_mode, summary_compare=summary_compare) except CompareException as error: - print_error_log('Compare failed. Please check the arguments and do it again!') + logger.error('Compare failed. 
Please check the arguments and do it again!') sys.exit(error.code) compare_core(dump_result_param, output_path, suffix=f'_{nr}-{br}', summary_compare=summary_compare, md5_compare=md5_compare, **kwargs) diff --git a/debug/accuracy_tools/atat/pytorch/compare/highlight.py b/debug/accuracy_tools/msprobe/pytorch/compare/highlight.py similarity index 97% rename from debug/accuracy_tools/atat/pytorch/compare/highlight.py rename to debug/accuracy_tools/msprobe/pytorch/compare/highlight.py index fdc11303003d20d2d89e2667355b2193a3a9db48..82f0022f8b5d4a0c6472b749d4937bfe39ef8a86 100644 --- a/debug/accuracy_tools/atat/pytorch/compare/highlight.py +++ b/debug/accuracy_tools/msprobe/pytorch/compare/highlight.py @@ -1,7 +1,8 @@ import math import abc import numpy as np -from ...core.utils import CompareConst, get_header_index +from msprobe.core.common.utils import get_header_index +from msprobe.core.common.const import CompareConst class HighlightCheck(abc.ABC): diff --git a/debug/accuracy_tools/atat/pytorch/compare/mapping.yaml b/debug/accuracy_tools/msprobe/pytorch/compare/mapping.yaml similarity index 100% rename from debug/accuracy_tools/atat/pytorch/compare/mapping.yaml rename to debug/accuracy_tools/msprobe/pytorch/compare/mapping.yaml diff --git a/debug/accuracy_tools/atat/pytorch/compare/match.py b/debug/accuracy_tools/msprobe/pytorch/compare/match.py similarity index 91% rename from debug/accuracy_tools/atat/pytorch/compare/match.py rename to debug/accuracy_tools/msprobe/pytorch/compare/match.py index 51fb2fb6666756d39db9003b87ef7c3a71b4080b..6347d8887c85427fcb556eecb5cd4a7302166969 100644 --- a/debug/accuracy_tools/atat/pytorch/compare/match.py +++ b/debug/accuracy_tools/msprobe/pytorch/compare/match.py @@ -1,7 +1,7 @@ import os import yaml -from ...core.file_check_util import FileOpen -from ...core.utils import CompareException +from msprobe.core.common.file_check import FileOpen +from msprobe.core.common.utils import CompareException class AtenIrMapping(): diff --git a/debug/accuracy_tools/atat/pytorch/compare/npy_compare.py b/debug/accuracy_tools/msprobe/pytorch/compare/npy_compare.py similarity index 96% rename from debug/accuracy_tools/atat/pytorch/compare/npy_compare.py rename to debug/accuracy_tools/msprobe/pytorch/compare/npy_compare.py index f16a807fefdb01189eaeed46d206e66c8a813601..5a0feb4cd4a63b6f2ab680c9e9a0f0e92b594e2e 100644 --- a/debug/accuracy_tools/atat/pytorch/compare/npy_compare.py +++ b/debug/accuracy_tools/msprobe/pytorch/compare/npy_compare.py @@ -1,6 +1,8 @@ import abc import numpy as np -from ...core.utils import CompareConst, Const, print_warn_log, format_value +from msprobe.core.common.utils import format_value +from msprobe.core.common.const import Const, CompareConst +from msprobe.pytorch.common.log import logger def handle_inf_nan(n_value, b_value): @@ -69,7 +71,7 @@ def get_error_message(n_value, b_value, op_name, error_flag, error_file=None): if not n_value.shape: return "This is type of scalar data, can not compare." if n_value.dtype != b_value.dtype: - print_warn_log("Dtype of NPU and bench Tensor do not match: {}".format(op_name)) + logger.warning("Dtype of NPU and bench Tensor do not match: {}".format(op_name)) return "Dtype of NPU and bench Tensor do not match." 
return "" @@ -196,6 +198,8 @@ class GetThousandErrRatio(TensorComparisonBasic): return CompareConst.NAN, "" if relative_err is None: relative_err = get_relative_err(n_value, b_value) + if not np.size(relative_err): + return CompareConst.NAN, "" return format_value(np.sum(relative_err < CompareConst.THOUSAND_RATIO_THRESHOLD) / np.size(relative_err)), "" @@ -216,6 +220,8 @@ class GetFiveThousandErrRatio(TensorComparisonBasic): return CompareConst.NAN, "" if relative_err is None: relative_err = get_relative_err(n_value, b_value) + if not np.size(relative_err): + return CompareConst.NAN, "" return format_value(np.sum(relative_err < CompareConst.FIVE_THOUSAND_RATIO_THRESHOLD) / np.size(relative_err)), "" diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/__init__.py b/debug/accuracy_tools/msprobe/pytorch/debugger/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/debugger/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/debugger/debugger_config.py b/debug/accuracy_tools/msprobe/pytorch/debugger/debugger_config.py similarity index 85% rename from debug/accuracy_tools/atat/pytorch/debugger/debugger_config.py rename to debug/accuracy_tools/msprobe/pytorch/debugger/debugger_config.py index 9fc97332f64df89a056cda1b0bfdc730c2cdfa29..cfc588e1e9705b5d2286c69aea8dfe8704115de3 100644 --- a/debug/accuracy_tools/atat/pytorch/debugger/debugger_config.py +++ b/debug/accuracy_tools/msprobe/pytorch/debugger/debugger_config.py @@ -1,5 +1,7 @@ -from ..common import print_warn_log_rank_0, seed_all -from ...core.utils import Const +from msprobe.pytorch.common import seed_all +from msprobe.pytorch.common.log import logger +from msprobe.core.common.const import Const + class DebuggerConfig: def __init__(self, common_config, task_config, task, dump_path, level): @@ -20,12 +22,9 @@ class DebuggerConfig: self.is_forward_acl_dump = True self.summary_mode = task_config.summary_mode if task_config.summary_mode else Const.STATISTICS self.overflow_num = task_config.overflow_num if task_config.overflow_num else 1 - self.repair_scope = None - self.repair_api_str = None - self.on_step_end = None - self.repair_type = None - - if self.task == "free_benchmark": + self.framework = Const.PT_FRAMEWORK + + if self.task == Const.FREE_BENCHMARK: self.fuzz_device = task_config.fuzz_device if task_config.fuzz_device else 'npu' self.handler_type = task_config.handler_type if task_config.handler_type else 'check' self.pert_mode = task_config.pert_mode if task_config.pert_mode else 'improve_precision' @@ -47,9 +46,8 @@ class DebuggerConfig: raise ValueError("backward_input must be configured when scope contains 'backward'") if Const.BACKWARD in self.scope[0]: self.is_forward_acl_dump = False - for index in range(len(self.scope)): - # Do this replace operation to let the acl backward dump can be done in forward hook. - self.scope[index] = self.scope[index].replace(Const.BACKWARD, Const.FORWARD) + for index, scope_spec in enumerate(self.scope): + self.scope[index] = scope_spec.replace(Const.BACKWARD, Const.FORWARD) self.backward_input[self.scope[index]] = self.backward_input_list[index] seed_all(self.seed, self.is_deterministic) @@ -67,22 +65,22 @@ class DebuggerConfig: self._check_step() return True + def check_model(self, model): + if self.level in ["L0", "mix"] and not model: + raise Exception( + f"For level {self.level}, PrecisionDebugger must receive a model argument." 
+ ) + def _check_rank(self): if self.rank: for rank_id in self.rank: if not isinstance(rank_id, int) or rank_id < 0: raise ValueError(f"rank {self.rank} must be an integer and greater than or equal to 0.") else: - print_warn_log_rank_0(f"Rank argument is provided. Only rank {self.rank} data will be dumpped.") + logger.warning_on_rank_0(f"Rank argument is provided. Only rank {self.rank} data will be dumpped.") def _check_step(self): if self.step: for s in self.step: if not isinstance(s, int) or s < 0: raise ValueError(f"step element {s} must be an integer and greater than or equal to 0.") - - def check_model(self, model): - if self.level in ["L0", "mix"] and not model: - raise Exception( - f"For level {self.level}, PrecisionDebugger must receive a model argument.", - ) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/debugger/precision_debugger.py b/debug/accuracy_tools/msprobe/pytorch/debugger/precision_debugger.py similarity index 75% rename from debug/accuracy_tools/atat/pytorch/debugger/precision_debugger.py rename to debug/accuracy_tools/msprobe/pytorch/debugger/precision_debugger.py index e0ffa4e4d6ebec80209097fe8d3a5716439bd939..e28e588c5f9e1b56be1bdb024bfad8df44b67f86 100644 --- a/debug/accuracy_tools/atat/pytorch/debugger/precision_debugger.py +++ b/debug/accuracy_tools/msprobe/pytorch/debugger/precision_debugger.py @@ -1,10 +1,10 @@ import torch from torch.utils.data import dataloader -from .debugger_config import DebuggerConfig -from ..service import Service -from ..common import print_warn_log_rank_0 -from ..pt_config import parse_json_config -from ..common.exceptions import MsaccException +from msprobe.pytorch.debugger.debugger_config import DebuggerConfig +from msprobe.pytorch.service import Service +from msprobe.pytorch.common.log import logger +from msprobe.pytorch.pt_config import parse_json_config +from msprobe.core.common.exceptions import MsprobeException class PrecisionDebugger: @@ -39,16 +39,28 @@ class PrecisionDebugger: self.service = Service(self.config) self.enable_dataloader = self.config.enable_dataloader if self.enable_dataloader: - print_warn_log_rank_0("The enable_dataloader feature will be deprecated in the future.") + logger.warning_on_rank_0("The enable_dataloader feature will be deprecated in the future.") dataloader._BaseDataLoaderIter.__next__ = iter_tracer(dataloader._BaseDataLoaderIter.__next__) + @property + def instance(self): + return self._instance + + @staticmethod + def check_model_valid(model): + if not model or isinstance(model, torch.nn.Module): + return model + raise MsprobeException( + MsprobeException.INVALID_PARAM_ERROR, "model 参数必须是torch.nn.Module类型。" + ) + @classmethod def start(cls): instance = cls._instance if not instance: raise Exception("No instance of PrecisionDebugger found.") if instance.enable_dataloader: - print_warn_log_rank_0("DataLoader is enabled, start() skipped.") + logger.warning_on_rank_0("DataLoader is enabled, start() skipped.") else: instance.service.start(instance.model) @@ -58,7 +70,7 @@ class PrecisionDebugger: if not instance: raise Exception("PrecisionDebugger instance is not created.") if instance.enable_dataloader: - print_warn_log_rank_0("DataLoader is enabled, stop() skipped.") + logger.warning_on_rank_0("DataLoader is enabled, stop() skipped.") else: instance.service.stop() @@ -68,18 +80,10 @@ class PrecisionDebugger: raise Exception("PrecisionDebugger instance is not created.") cls._instance.service.step() - @staticmethod - def check_model_valid(model): - if not model or 
isinstance(model, torch.nn.Module): - return model - raise MsaccException( - MsaccException.INVALID_PARAM_ERROR, "model 参数必须是torch.nn.Module类型。" - ) - def iter_tracer(func): def func_wrapper(*args, **kwargs): - debugger_instance = PrecisionDebugger._instance + debugger_instance = PrecisionDebugger.instance debugger_instance.enable_dataloader = False if not debugger_instance.service.first_start: debugger_instance.stop() diff --git a/debug/accuracy_tools/atat/pytorch/doc/FAQ.md b/debug/accuracy_tools/msprobe/pytorch/doc/FAQ.md similarity index 58% rename from debug/accuracy_tools/atat/pytorch/doc/FAQ.md rename to debug/accuracy_tools/msprobe/pytorch/doc/FAQ.md index daaa79abd956f7a585b6d76a45812c4e7b4fc6ae..8d12a72928ee4d9977b1db05a72f2189b2edd3c1 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/FAQ.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/FAQ.md @@ -22,15 +22,15 @@ 6. 添加预检工具后截取操作报错:`IndexError: too many indices for tensor of dimension x` 或 `TypeError: len() of a 0-d tensor`。 - 答:注释工具目录api_accuracy_checker/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- __getitem__`,工具会跳过dump该API。如果是需要dump的关键位置API也可以考虑根据报错堆栈信息注释引发报错的类型检查。 + 答:注释工具目录mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- __getitem__`,工具会跳过dump该API。如果是需要dump的关键位置API也可以考虑根据报错堆栈信息注释引发报错的类型检查。 7. 添加预检工具后F.gelu触发ValueError报错:`activation_func must be F.gelu`等。 - 答:注释工具目录api_accuracy_checker/hook_module/support_wrap_ops.yaml文件中functional:下的的`- gelu`,工具会跳过dump该API。如果是需要dump的关键位置API也可以考虑根据报错堆栈信息注释引发报错的类型检查。 + 答:注释工具目录mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml文件中functional:下的的`- gelu`,工具会跳过dump该API。如果是需要dump的关键位置API也可以考虑根据报错堆栈信息注释引发报错的类型检查。 8. 添加预检工具后触发AsStrided算子相关的报错,或者编译相关的报错,如:`Failed to compile Op [AsStrided]`。 - 答:注释工具目录api_accuracy_checker/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- t`和`- transpose`。 + 答:注释工具目录mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- t`和`- transpose`。 9. Tensor 魔法函数具体对应什么操作? @@ -75,7 +75,7 @@ ### dump指定融合算子 -dump指定操作当前支持dump指定融合算子的输入输出,需要在att/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/hook_module/support_wrap_ops.yaml中添加,比如以下代码段调用的softmax融合算子 +dump指定操作当前支持dump指定融合算子的输入输出,需要在mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml中添加,比如以下代码段调用的softmax融合算子 ``` def npu_forward_fused_softmax(self, input_, mask): @@ -95,7 +95,7 @@ def npu_forward_fused_softmax(self, input_, mask): ### 1. 在同一个目录多次执行dump会冲突吗? -会,同一个目录多次dump,会覆盖上一次结果,可以使用dump_tag参数修改dump目录名称。 +会,同一个目录多次dump,会覆盖上一次结果,可以使用dump_path参数修改dump目录。 ### 2. 如何dump算子级的数据? @@ -111,7 +111,7 @@ torch版本和硬件差异属于正常情况。 **故障现象** -使用atat工具时,报错: error code: EI0006。 +使用msprobe工具时,报错: error code: EI0006。 **故障原因** @@ -136,67 +136,58 @@ torch.npu.set_device('npu:0') torch.npu.set_device(f'npu:{rank}') ``` -如果运行精度比对功能遇到这个报错,尝试安装最新版本的atat。 +如果运行精度比对功能遇到这个报错,尝试安装最新版本的msprobe。 -### 4. 运行compare.py时报错:json.decoder.JSONDecodeError: Extra data: line 1 column 37(char 36) - -遇到这种情况,先更新工具版本为最新版本,再重新运行训练代码dump数据,再用新的dump数据进行精度比对,如果最新版本未能解决问题,请联系atat工具开发人员。 - -### 5. AssertionError: assert set(WrapTensorOps) <= set(_tensor_ops) - -遇到这种情况,先检查安装的torch版本,建议先更新工具版本为2.2以上,版本2.2的工具支持torch1.8、1.11和2.0 - -### 6. dump得到的VF_lstm_99_forward_input.1.0.npy、VF_lstm_99_forward_input.1.1.npy类似的数据是否正常? +### 4. dump得到的VF_lstm_99_forward_input.1.0.npy、VF_lstm_99_forward_input.1.1.npy类似的数据是否正常? 带1.0/1.1/1.2后缀的npy是正常现象,例如当输入数据为[[tensor1, tensor2, tensor3]]会生成这样的后缀。 -### 8. 
进行compare报错:The current file contains stack information, please turn on the stack_mode +### 5. 进行compare报错:The current file contains stack information, please turn on the stack_mode 在比对脚本中,设置stack_mode=True,例如: ``` -from ptdbg_ascend import * +from msprobe.pytorch import compare dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v2.0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v2.0/rank0/api_stack_dump.pkl", -"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v2.0/rank0/api_stack_dump", -"bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v2.0/rank0/api_stack_dump", +"npu_json_path": "./npu_dump/dump.json", +"bench_json_path": "./gpu_dump/dump.json", +"stack_json_path": "./npu_dump/stack.json", "is_print_compare_log": True } -compare(dump_result_param, "./output", stack_mode=True) +compare(dump_result_param, output_path="./output", stack_mode=True) ``` -### 9. dump指定反向API的kernel级别的数据报错:NameError:name 'torch_npu' is not defined +### 6. dump指定反向API的kernel级别的数据报错:NameError:name 'torch_npu' is not defined - 如果是npu环境,请安装torch_npu; -- 如果是gpu环境,暂不支持dump指定API的ACL级别的数据 +- 如果是gpu环境,暂不支持dump指定API的kernel级别的数据 -### 10. 配置dump_path后,使用工具报错:[ERROR]The file path /home/xxx/dump contains special characters +### 7. 配置dump_path后,使用工具报错:[ERROR]The file path /home/xxx/dump contains special characters - 请检查你设置的dump绝对路径是否包含特殊字符,确保路径名只包含大小写字母、数字、下划线、斜杠、点和短横线 -- 注意,如果你执行脚本的路径为/home/abc++/,你设置的dump_path="./dump",工具实际校验的路径为绝对路径/home/abc++/dump,++为特殊字符,会引发本条报错 +- 注意,如果执行脚本的路径为/home/abc++/,设置的dump_path="./dump",工具实际校验的路径为绝对路径/home/abc++/dump,++为特殊字符,会引发本条报错 -### 11. 无法dump matmul权重的反向梯度数据 +### 8. 无法dump matmul权重的反向梯度数据 - matmul期望的输入是二维,当输入不是二维时,会将输入通过view操作展成二维,再进行matmul运算,因此在反向求导时,backward_hook能拿到的是UnsafeViewBackward这步操作里面数据的梯度信息,取不到MmBackward这步操作里面数据的梯度信息,即权重的反向梯度数据。 - 典型的例子有,当linear的输入不是二维,且无bias时,会调用output = input.matmul(weight.t()),因此拿不到linear层的weight的反向梯度数据。 -### 12. dump.json文件中的某些api的dtype类型为float16,但是读取此api的npy文件显示的dtype类型为float32 +### 9. dump.json文件中的某些api的dtype类型为float16,但是读取此api的npy文件显示的dtype类型为float32 -- atat工具在dump数据时需要将原始数据从npu to cpu上再转换为numpy类型,npu to cpu的逻辑和gpu to cpu是保持一致的,都存在dtype可能从float16变为float32类型的情况,如果出现dtype不一致的问题,最终dump数据的dtype以pkl文件为准。 +- msprobe工具在dump数据时需要将原始数据从npu to cpu上再转换为numpy类型,npu to cpu的逻辑和gpu to cpu是保持一致的,都存在dtype可能从float16变为float32类型的情况,如果出现dtype不一致的问题,最终dump数据的dtype以pkl文件为准。 -### 13. 使用dataloader后raise异常Exception: ptdbg: exit after iteration [x, x, x] +### 10. 使用dataloader后raise异常Exception("msprobe: exit after iteration {}". format(max(self.config.step)) - 正常现象,dataloader通过raise结束程序,堆栈信息可忽略。 -### 14. 添加atat工具后截取操作报错:`IndexError: too many indices for tensor of dimension x` 或 `TypeError: len() of a 0-d tensor`。 +### 11. 添加msprobe工具后截取操作报错:`IndexError: too many indices for tensor of dimension x` 或 `TypeError: len() of a 0-d tensor`。 -- 注释工具目录atat/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- __getitem__`,工具会跳过dump该API。如果是需要dump的关键位置api也可以考虑根据报错堆栈信息注释引发报错的类型检查。 +- 注释工具目录mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- __getitem__`,工具会跳过dump该API。如果是需要dump的关键位置API也可以考虑根据报错堆栈信息注释引发报错的类型检查。 -### 15. 添加atat工具后F.gelu触发ValueError报错:`activation_func must be F.gelu`等。 +### 12. 
添加msprobe工具后F.gelu触发ValueError报错:`activation_func must be F.gelu`等。 -- 注释工具目录atat/hook_module/support_wrap_ops.yaml文件中functional:下的的`- gelu`,工具会跳过dump该API。如果是需要dump的关键位置api也可以考虑根据报错堆栈信息注释引发报错的类型检查。 +- 注释工具目录mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml文件中functional:下的的`- gelu`,工具会跳过dump该API。如果是需要dump的关键位置api也可以考虑根据报错堆栈信息注释引发报错的类型检查。 -### 16. 添加atat工具后触发AsStrided算子相关的报错,或者编译相关的报错,如:`Failed to compile Op [AsStrided]`。 +### 13. 添加msprobe工具后触发AsStrided算子相关的报错,或者编译相关的报错,如:`Failed to compile Op [AsStrided]`。 -- 注释工具目录atat/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- t`和`- transpose`。 +- 注释工具目录mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml文件中Tensor:下的`- t`和`- transpose`。 diff --git a/debug/accuracy_tools/atat/pytorch/doc/api_accuracy_checker.md b/debug/accuracy_tools/msprobe/pytorch/doc/api_accuracy_checker.md similarity index 88% rename from debug/accuracy_tools/atat/pytorch/doc/api_accuracy_checker.md rename to debug/accuracy_tools/msprobe/pytorch/doc/api_accuracy_checker.md index 7004c25e6daddb0c17e5caf82f9fa0318fd5425c..b3ed4a9e24c39e3eacd539964a42197410de4682 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/api_accuracy_checker.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/api_accuracy_checker.md @@ -8,7 +8,7 @@ **真实数据模式**:精度预检工具支持随机生成模式和真实数据模式,即在预检dump时可以选择由工具构造随机数进行输入获得dump数据或选择获取真实输入数据进行预检dump操作;随机生成模式执行效率高,可以快速获得结果,但数据精度低,只能大致判断精度问题;真实数据模式执行效率略低于随机生成模式,但是数据精度高,可以准确判断精度问题。 -**工具支持PyTorch版本**:1.11.0/2.0/2.1/2.2。 +**工具支持PyTorch版本**:2.0/2.1/2.2。 **工具特性** @@ -20,8 +20,8 @@ 精度预检操作流程如下: -1. 在NPU和GPU环境下分别安装atat工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 -2. 在NPU训练脚本内添加atat工具dump接口PrecisionDebugger采集待预检数据。详见《[精度数据采集](./dump.md)》。 +1. 在NPU和GPU环境下分别安装msprobe工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 +2. 在NPU训练脚本内添加msprobe工具dump接口PrecisionDebugger采集待预检数据。详见《[精度数据采集](./dump.md)》。 3. 将NPU环境下dump的预检数据拷贝至GPU环境。 4. 在NPU和GPU环境下分别执行run_ut,生成结果用于最终api_precision_compare操作的输入。详见“**run_ut预检操作**”。 5. 将NPU和GPU执行run_ut生成的`accuracy_checking_details_{timestamp}.csv`结果文件拷贝至同一环境下。 @@ -43,11 +43,9 @@ run_ut预检操作包括如下场景: 1. 将API信息输入给run_ut模块运行精度检测并比对,运行如下命令: ```bash - atat -f pytorch run_ut -api_info ./dump.json + msprobe -f pytorch run_ut -api_info ./dump.json ``` - 某些场景下(如推理),可以不指定backward_info_0.json,不影响预检功能。 - | 参数名称 | 说明 | 是否必选 | | ---------------------------- | ------------------------------------------------------------ | ---------------------------------- | | -api_info或--api_info_file | 指定API信息文件dump.json。 | 是 | @@ -63,10 +61,10 @@ run_ut预检操作包括如下场景: 2. 
(可选)如果需要保存比对不达标的输入和输出数据,可以在run_ut执行命令结尾添加-save_error_data,例如: ```bash - atat -f pytorch run_ut -api_info ./dump.json -save_error_data + msprobe -f pytorch run_ut -api_info ./dump.json -save_error_data ``` - 数据默认会存盘到'./ut_error_data{timestamp}'路径下(相对于启动run_ut的路径),有需要的话,用户可以通过修改att/debug/accuracy_tools/api_accuracy_checker目录下,config.yaml文件的error_data_path参数来配置保存路径,详见“config.yaml文件说明”。 + 数据默认会存盘到'./ut_error_data{timestamp}'路径下(相对于启动run_ut的路径),有需要的话,用户可以通过修改mstt/debug/accuracy_tools/api_accuracy_checker目录下,config.yaml文件的error_data_path参数来配置保存路径,详见“config.yaml文件说明”。 #### 使用multi_run_ut.py执行多线程预检 @@ -75,11 +73,9 @@ multi_run_ut.py脚本,可以并行执行多个run_ut操作,从而降低预 命令示例如下: ```bash -atat -f pytorch multi_run_ut -api_info ./dump.json -n 32 -d 0 1 2 3 +msprobe -f pytorch multi_run_ut -api_info ./dump.json -n 32 -d 0 1 2 3 ``` -某些场景下(如推理),可以不指定backward_info_0.json,不影响预检功能。 - | 参数名称 | 说明 | 是否必选 | | ---------------------------- | ------------------------------------------------------------ | ---------------------------------- | | -api_info或--api_info_file | 指定API信息文件dump.json。 | 是 | @@ -100,31 +96,26 @@ atat -f pytorch multi_run_ut -api_info ./dump.json -n 32 -d 0 1 2 3 断点续检操作通过如下命令执行: ```bash -atat -f pytorch run_ut -api_info ./dump.json -csv_path /home/xxx/ut/accuracy_checking_result_{timestamp}.csv +msprobe -f pytorch run_ut -api_info ./dump.json -csv_path /home/xxx/ut/accuracy_checking_result_{timestamp}.csv ``` #### API预检白名单 run_ut过程支持API预检白名单,操作方式如下: -修改att/debug/accuracy_tools/api_accuracy_checker目录下config.yaml文件的white_list参数,配置需要预检的API名称,详见“config.yaml文件说明”。 +修改mstt/debug/accuracy_tools/api_accuracy_checker目录下config.yaml文件的white_list参数,配置需要预检的API名称,详见“config.yaml文件说明”。 ### config.yaml文件说明 -config.yaml文件可以通过配置参数来控制dump和run_ut操作的真实数据模式以及白名单等功能。 +config.yaml文件可以通过配置参数来控制dump和run_ut操作的白名单等功能。 -文件路径为:att/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/config.yaml +文件路径为:mstt/debug/accuracy_tools/msprobe/pytorch/api_accuracy_checker/config.yaml -| 参数名称 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| dump_path | 设置dump路径,默认为当前目录。若指定目录不存在,则自动创建。 | 否 | -| real_data | 真实数据模式,可取值True或False,默认为False,表示随机数据模式,配置为True后开启真实数据模式,dump信息增加forward_real_data和backward_real_data目录,目录下保存每个API输入的具体数值。 | 否 | -| enable_dataloader | 自动dump数据开关,可取值True(开启)、False(关闭),默认关闭。 | 否 | -| target_iter | 指定dump某个step的数据,默认为[1],须指定为训练脚本中存在的step。target_iter为list格式,可配置逐个step,例如:target_iter=[0,1,2];也可以配置step范围,例如:target_iter=list(range(0,9)),表示dump第0到第8个step。 | 否 | -| white_list | API dump白名单,指定dump具体API数据,也可以直接配置预检的API白名单,详细请参见“**API预检白名单**”。参数示例:white_list=["conv1d", "conv2d"]。默认未配置白名单,即dump全量API数据。 | 否 | -| error_data_path | 配置保存精度未达标的API输入输出数据路径。 | 否 | -| jit_compile | 开启jit编译。 | 否 | -| precision | 浮点数表示位数,默认取小数点后14位。 | 否 | +| 参数名称 | 说明 | 是否必选 | +| --------------- | ------------------------------------------------------------ | -------- | +| white_list | API dump白名单,指定dump具体API数据,也可以直接配置预检的API白名单,详细请参见“**API预检白名单**”。参数示例:white_list=["conv1d", "conv2d"]。默认未配置白名单,即dump全量API数据。 | 否 | +| error_data_path | 配置保存精度未达标的API输入输出数据路径。 | 否 | +| precision | 浮点数表示位数,默认取小数点后14位。 | 否 | ## 预检结果 @@ -212,7 +203,7 @@ API预检通过测试,则在`accuracy_checking_details_{timestamp}.csv`文件 需要同时获取NPU和GPU环境下run_ut操作的预检结果`accuracy_checking_details_{timestamp}.csv`文件。执行如下命令进行NPU和GPU预检结果的比对: ```bash -atat -f pytorch api_precision_compare -npu /home/xxx/npu/accuracy_checking_details_{timestamp}.csv -gpu /home/xxx/gpu/accuracy_checking_details_{timestamp}.csv -o /home/xxx/ +msprobe -f pytorch 
api_precision_compare -npu /home/xxx/npu/accuracy_checking_details_{timestamp}.csv -gpu /home/xxx/gpu/accuracy_checking_details_{timestamp}.csv -o /home/xxx/ ``` | 参数名称 | 说明 | 是否必选 | diff --git "a/debug/accuracy_tools/atat/pytorch/doc/atat\347\262\276\345\272\246\345\267\245\345\205\267\346\225\260\346\215\256dump\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" "b/debug/accuracy_tools/msprobe/pytorch/doc/atat\347\262\276\345\272\246\345\267\245\345\205\267\346\225\260\346\215\256dump\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" similarity index 97% rename from "debug/accuracy_tools/atat/pytorch/doc/atat\347\262\276\345\272\246\345\267\245\345\205\267\346\225\260\346\215\256dump\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" rename to "debug/accuracy_tools/msprobe/pytorch/doc/atat\347\262\276\345\272\246\345\267\245\345\205\267\346\225\260\346\215\256dump\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" index ed175ff30172a54d8d4868097599ab8518b45e4f..c9db3ae78d7d47330cf6cddcc66c741c77a63514 100644 --- "a/debug/accuracy_tools/atat/pytorch/doc/atat\347\262\276\345\272\246\345\267\245\345\205\267\346\225\260\346\215\256dump\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" +++ "b/debug/accuracy_tools/msprobe/pytorch/doc/atat\347\262\276\345\272\246\345\267\245\345\205\267\346\225\260\346\215\256dump\346\240\207\345\207\206\346\200\247\350\203\275\345\237\272\347\272\277\346\212\245\345\221\212.md" @@ -1,4 +1,4 @@ -# atat精度工具标准性能基线报告 +# msprobe精度工具标准性能基线报告 ## 环境信息 @@ -16,7 +16,7 @@ CANN:8.0.T2 ## 模型信息和性能基线 -大模型在使用atat工具dump数据时,建议先简化模型层数,减少dump数据量。 +大模型在使用msprobe工具dump数据时,建议先简化模型层数,减少dump数据量。 以下场景的性能基线测试数据均为多次测试后取平均值,因此实际运行时性能数据可能会根据环境状态稍有浮动。 diff --git a/debug/accuracy_tools/atat/pytorch/doc/dump.md b/debug/accuracy_tools/msprobe/pytorch/doc/dump.md similarity index 77% rename from debug/accuracy_tools/atat/pytorch/doc/dump.md rename to debug/accuracy_tools/msprobe/pytorch/doc/dump.md index 44a4d09341dabef64306ee2c7ec7463a4fb367d4..7d0763b6848d0f2154c8757859ed136cb90d7f18 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/dump.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/dump.md @@ -1,8 +1,8 @@ # **精度数据采集** -atat工具主要通过在训练脚本内添加dump接口并启动训练的方式来采集精度数据。 +msprobe工具主要通过在训练脚本内添加dump接口并启动训练的方式来采集精度数据。 -执行dump操作需要安装atat工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 +执行dump操作需要安装msprobe工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 ## dump接口介绍 @@ -12,7 +12,7 @@ atat工具主要通过在训练脚本内添加dump接口并启动训练的方式 通过加载dump配置文件的方式来确定dump操作的详细配置。 -可以在from atat.pytorch import PrecisionDebugger和模型初始化之间的任意位置添加该接口。 +可以在from msprobe.pytorch import PrecisionDebugger和模型初始化之间的任意位置添加该接口。 **原型** @@ -20,15 +20,15 @@ atat工具主要通过在训练脚本内添加dump接口并启动训练的方式 PrecisionDebugger(config_path=None, task=None, dump_path=None, level=None, model=None, step=None) ``` -说明:上述参数除config_path和model外,其他参数均在[config.json](../../config)文件中可配,此处的参数优先级高于config.json文件中的配置,而config.json文件可以配置更多参数,若需要进行更多场景的精度数据dump,建议配置[config.json](../../config)文件。 +说明:上述参数除config_path和model外,其他参数均在[config.json](../../config)文件中可配,此处的参数优先级高于[config.json](../../config)文件中的配置,而config.json文件可以配置更多参数,若需要进行更多场景的精度数据dump,建议配置[config.json](../../config)文件。 **参数说明** | 参数名 | 说明 | 是否必选 | | ----------- | ------------------------------------------------------------ | -------- | -| config_path | 
指定dump配置文件路径,String类型。参数示例:"./config.json"。未配置该路径时,默认使用../../config目录下的config.json文件的默认配置。 | 否 | +| config_path | 指定dump配置文件路径,String类型。参数示例:"./config.json"。未配置该路径时,默认使用[config.json](../../config)文件的默认配置。 | 否 | | task | dump的任务类型,String类型。可取值"statistics"(仅dump API统计信息)、"tensor"(dump API统计信息和完全复刻整网的API运行情况的真实数据)、"overflow_check"(溢出检测),默认未配置,取"statistics",参数示例:task="tensor"。 | 否 | -| dump_path | 设置dump数据目录路径,String类型。参数示例:dump_path="./dump_path"。 | 是 | +| dump_path | 设置dump数据目录路径,String类型。参数示例:dump_path="./dump_path"。 | 否 | | level | dump级别,根据不同级别dump不同数据,String类型。可取值:
"L0":dump module模块级精度数据,仅PyTorch场景支持”。
"L1":dump API级精度数据,默认值。
"L2":dump kernel级精度数据。
"mix":dump module模块级和API级精度数据。
配置示例:level="L1"。 | 否 | | model | 指定具体的torch.nn.Module,默认未配置,level配置为"L0"或"mix"时必须配置该参数。配置示例参见“**model配置代码示例**”。 | 否 | | step | 指定dump某个step的数据,list[int]类型。默认未配置,表示dump所有step数据。dump特定step时,须指定为训练脚本中存在的step。step为list格式,可配置逐个step,例如:step=[0,1,2]。 | 否 | @@ -44,32 +44,33 @@ import torch import torch.nn as nn import torch_npu import torch.nn.functional as F -from atat.pytorch import PrecisionDebugger +from msprobe.pytorch import PrecisionDebugger torch.npu.set_device("npu:0") #定义一个简单的网络 -class ModuleOP(nn.Module) -def __init__(self) -> None: - super().__init__() - self.linear_1 = nn.Linear(in_features=8,out_features=4) - self.linear_2 = nn.Linear(in_features=4,out_features=2) -def forward(self,x): - x1 = self.linear_1(x) - x2 = self.linear_2(x1) - r1 = F.relu(x2) - return r1 +class ModuleOP(nn.Module): + def __init__(self) -> None: + super().__init__() + self.linear_1 = nn.Linear(in_features=8,out_features=4) + self.linear_2 = nn.Linear(in_features=4,out_features=2) + + def forward(self,x): + x1 = self.linear_1(x) + x2 = self.linear_2(x1) + r1 = F.relu(x2) + return r1 if __name__ == "__main__" -module = ModuleOP() - -#注册工具 -debugger = PrecisionDebugger('./config.json',model=module) -debugger.start() -x = torch.randn(10,8) -out = module(x) -loss = out.sum() -loss.backward() -debugger.stop() + module = ModuleOP() + + #注册工具 + debugger = PrecisionDebugger('./config.json',model=module) + debugger.start() + x = torch.randn(10,8) + out = module(x) + loss = out.sum() + loss.backward() + debugger.stop() ``` ### start函数 @@ -123,7 +124,7 @@ debugger.step() ## 示例代码 ```Python -from atat.pytorch import PrecisionDebugger +from msprobe.pytorch import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json", dump_path="./dump_path") # 请勿将以上初始化流程插入到循环代码中 @@ -174,9 +175,9 @@ dump结果目录结构示例如下: │ ├── step2 ``` -dump过程中,pt文件在对应算子或者模块被执行后就会落盘,而json文件则需要在正常执行PrecisionDebugger.stop()或set_dump_switch("OFF")后才会被落盘保存,异常的程序终止会保存终止前被执行算子的相关pt文件,但是不会生成json文件。 +dump过程中,pt文件在对应算子或者模块被执行后就会落盘,而json文件则需要在正常执行PrecisionDebugger.stop()后才会被落盘保存,异常的程序终止会保存终止前被执行算子的相关pt文件,但是不会生成json文件。 -其中`dump_{version}`为默认命名,debugger方式dump不支持修改该文件夹名称;rank为设备上各卡的ID,每张卡上dump的数据会生成对应dump目录。 +其中rank为设备上各卡的ID,每张卡上dump的数据会生成对应dump目录。 pt文件保存的前缀和PyTorch对应关系如下: @@ -192,7 +193,7 @@ pt文件保存的前缀和PyTorch对应关系如下: ## 工具支持的API列表 -atat工具维护固定的API支持列表,若需要删除或增加dump的API,可以在atat/pytorch/hook_module/support_wrap_ops.yaml文件内手动修改,如下示例: +msprobe工具维护固定的API支持列表,若需要删除或增加dump的API,可以在msprobe/pytorch/hook_module/support_wrap_ops.yaml文件内手动修改,如下示例: ```Python functional: # functional为算子类别,找到对应的类别,在该类别下按照下列格式删除或添加API diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_1.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_1.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_1.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_1.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_2.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_2.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_2.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_2.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_3.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_3.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_3.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_3.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_4.png 
b/debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_4.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/BLOOM-7B_4.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/BLOOM-7B_4.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_1.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_1.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_1.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_1.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_2.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_2.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_2.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_2.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_3.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_3.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_3.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_3.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_4.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_4.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_4.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_4.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_5.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_5.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_5.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_5.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_6.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_6.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_6.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_6.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_7.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_7.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_7.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_7.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_8.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_8.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/GPT-3_8.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/GPT-3_8.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/YOLOV5S_1.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/YOLOV5S_1.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/YOLOV5S_1.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/YOLOV5S_1.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/YOLOV5S_2.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/YOLOV5S_2.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/YOLOV5S_2.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/YOLOV5S_2.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/accuracy_checking_details.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/accuracy_checking_details.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/accuracy_checking_details.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/accuracy_checking_details.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/accuracy_checking_result.png 
b/debug/accuracy_tools/msprobe/pytorch/doc/img/accuracy_checking_result.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/accuracy_checking_result.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/accuracy_checking_result.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/api_precision_compare_details.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/api_precision_compare_details.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/api_precision_compare_details.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/api_precision_compare_details.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/api_precision_compare_result.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/api_precision_compare_result.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/api_precision_compare_result.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/api_precision_compare_result.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/auto_analyze_log.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/auto_analyze_log.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/auto_analyze_log.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/auto_analyze_log.png diff --git a/debug/accuracy_tools/msprobe/pytorch/doc/img/compare_result_pkl.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/compare_result_pkl.png new file mode 100644 index 0000000000000000000000000000000000000000..863708bf6daf46985328f0dc42d48f0a5b849af5 Binary files /dev/null and b/debug/accuracy_tools/msprobe/pytorch/doc/img/compare_result_pkl.png differ diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl_md5.png.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/compare_result_pkl_md5.png.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/compare_result_pkl_md5.png.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/compare_result_pkl_md5.png.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/cpu_info.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/cpu_info.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/cpu_info.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/cpu_info.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/img/module_compare.png b/debug/accuracy_tools/msprobe/pytorch/doc/img/module_compare.png similarity index 100% rename from debug/accuracy_tools/atat/pytorch/doc/img/module_compare.png rename to debug/accuracy_tools/msprobe/pytorch/doc/img/module_compare.png diff --git a/debug/accuracy_tools/atat/pytorch/doc/parse_tool.md b/debug/accuracy_tools/msprobe/pytorch/doc/parse_tool.md similarity index 98% rename from debug/accuracy_tools/atat/pytorch/doc/parse_tool.md rename to debug/accuracy_tools/msprobe/pytorch/doc/parse_tool.md index 23000912910e8f95b4cb74c7983961918bd9a513..81efa10fa3ec4307603e24e9599c5a00367462d4 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/parse_tool.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/parse_tool.md @@ -6,10 +6,10 @@ ## 进入parse交互式界面 -安装atat工具后(详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节),可以通过使用命令 **atat -f pytorch parse** 进入交互式界面,如下所示: +安装msprobe工具后(详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节),可以通过使用命令 **msprobe -f pytorch parse** 进入交互式界面,如下所示: ```bash -atat -f pytorch parse +msprobe -f pytorch parse Parse >>> ``` @@ -23,7 +23,7 @@ Parse >>> 
Ctrl+C可以退出parse交互式界面。不退出parse交互式界面若需要执行非该界面下的内置Shell命令,且命令与parse交互式界面命令冲突时,非该界面命令需要使用run命令,在相关命令前加上run前缀,如下示例: ```bash -atat -f pytorch parse +msprobe -f pytorch parse Parse >>> run vim cli.py Parse >>> vim cli.py ``` diff --git a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_compare.md b/debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_compare.md similarity index 76% rename from debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_compare.md rename to debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_compare.md index 9beda3b02f2d72383a2bcaa4c20bcd9c5b8ba971..4bd05c73e21c4491ee8286366a6b987a60ee69ae 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_compare.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_compare.md @@ -36,7 +36,7 @@ compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) | -------------- | ------------------------------------------------------------ | -------- | | npu_dump_dir | 配置NPU环境下的dump目录。dump数据目录须指定到step级。参数示例:'./npu_dump/step0'。数据类型:str。 | 是 | | bench_dump_dir | 配置CPU、GPU或NPU环境下的dump目录。参数示例:'./gpu_dump/step0'。数据类型:str。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.csv`。数据类型:str。 | 是 | +| output_path | 配置比对结果文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.xlsx`。数据类型:str。 | 是 | | **kwargs | 支持compare的所有可选参数。 | 否 | **函数示例** @@ -44,7 +44,7 @@ compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) 创建比对脚本,例如compare_distributed.py,拷贝如下代码,具体参数请根据实际环境修改。 ```Python -from atat.pytorch import * +from msprobe.pytorch import * compare_distributed('./npu_dump/step0', './gpu_dump/step0', './output') ``` @@ -67,7 +67,7 @@ compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_mat | 参数名 | 说明 | 是否必选 | | ------------ | ------------------------------------------------------------ | -------- | | input_param | 配置dump数据文件及目录。数据类型:dict。配置参数包括:
"npu_json_path":指定NPU dump目录下的dump.json文件。参数示例:"npu_json_path": "./npu_dump/dump.json"。必选。
"bench_json_path":指定CPU、GPU或NPU dump目录下的dump.json文件。参数示例:"bench_json_path": "./gpu_dump/dump.json"。必选。
"stack_json_path":指定NPU dump目录下的stack.json文件。参数示例:"stack_json_path": "./npu_dump/stack.json"。可选。
"is_print_compare_log":配置是否开启日志打屏。可取值True或False。可选。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。数据类型:str。 | 是 | +| output_path | 配置比对结果文件存盘目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.xlsx`。数据类型:str。 | 是 | | stack_mode | 配置stack_mode的开关。仅当配置"stack_json_path"需要开启。可取值True或False,参数示例:stack_mode=True,默认为False。数据类型:bool。 | 否 | | auto_analyze | 自动精度分析,开启后工具自动针对比对结果进行分析,识别到第一个精度不达标节点(在比对结果文件中的“Accuracy Reached or Not”列显示为No),并给出问题可能产生的原因(打屏展示并生成advisor_{timestamp}.txt文件)。可取值True或False,参数示例:auto_analyze=False,默认为True。数据类型:bool。 | 否 | | fuzzy_match | 模糊匹配。开启后,对于网络中同一层级且命名仅调用次数不同的API,可匹配并进行比对。可取值True或False,参数示例:fuzzy_match=True,默认为False。数据类型:bool。 | 否 | @@ -77,7 +77,7 @@ compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_mat 单机单卡场景下创建比对脚本,例如compare.py,拷贝如下代码,具体参数请根据实际环境修改。 ```Python -from atat.pytorch import compare +from msprobe.pytorch import compare dump_result_param={ "npu_json_path": "./npu_dump/dump.json", "bench_json_path": "./gpu_dump/dump.json", @@ -96,7 +96,7 @@ compare(dump_result_param, output_path="./output", stack_mode=True) 以compare.py为例。 ```Python -from atat.pytorch import compare +from msprobe.pytorch import compare dump_result_param={ "npu_json_path": "./npu_dump/dump.json", "bench_json_path": "./gpu_dump/dump.json", @@ -108,7 +108,7 @@ compare(dump_result_param, output_path="./output", stack_mode=True) **比对结果** -数据量比对同样生成`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.csv`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.csv`主要有如下两种情况: +数据量比对同样生成`compare_result_{timestamp}.xlsx`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.xlsx`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.xlsx`主要有如下两种情况: - "summary_mode": "statistics"时比对dump.json文件: @@ -122,13 +122,28 @@ compare(dump_result_param, output_path="./output", stack_mode=True) 上图是对dump.json文件中NPU及标杆API的MD5信息进行比对,判断API数据的完整性,文件中记录NPU及标杆API的基本信息和MD5信息,其中需要关注Result列,包含结果:Pass(表示NPU与标杆的MD5值一致,即API数据完整);Different(表示NPU与标杆的MD5值不一致,即API数据不完全一致,可以通过NPU_Stack_Info列API调用栈查询该API的详细信息);Nan(表示MD5信息数据没有匹配上)。 -## 计算精度评价指标 +## 比对结果分析 -通过计算精度评价指标可以直接从精度比对结果文件中找出不符合精度标准的算子。 +PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度评价指标判断API在运行时是否存在精度问题。 -PyTorch精度比对是以CPU或GPU的计算结果为标杆,计算Cosine(余弦相似度)、MaxAbsErr(最大绝对误差)和MaxRelativeErr(最大相对误差),根据这两个结果判断API在运行时是否存在精度问题。 +- `advisor_{timestamp}.txt`文件中给出了可能存在精度问题的API的专家建议,可直接打开查看。 -计算精度评价指标: +- `compare_result_{timestamp}.xlsx`文件列出了所有执行精度比对的API详细信息和比对结果,如下示例: + + ![compare_result](https://gitee.com/cai-weiwei1989/att_ptdbg/raw/master/debug/accuracy_tools/ptdbg_ascend/doc/img/compare_result.png) + + 可以从该结果文件中进行“**判断计算精度达标情况**”、“**计算精度评价指标分析**”以及“**异常信息识别**”等分析动作。 + +### **判断计算精度达标情况** + +精度比对结果`compare_result_{timestamp}.xlsx`文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: + +1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 +2. Cosine < 0.9,精度不达标,标记为“No”。 +3. MaxAbsError > 1,精度不达标,标记为“No”。 +4. 其余情况下记为精度达标,标记为“Yes”。 + +### **计算精度评价指标分析** 1. Cosine:通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。在计算中可能会存在nan,主要由于可能会出现其中一个向量为0。 @@ -140,12 +155,20 @@ PyTorch精度比对是以CPU或GPU的计算结果为标杆,计算Cosine(余 4. One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指NPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 -精度比对结果csv文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: +### **异常信息识别** -1. 
Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 -2. Cosine < 0.9,精度不达标,标记为“No”。 -3. MaxAbsError > 1,精度不达标,标记为“No”。 -4. 其余情况下记为精度达标,标记为“Yes”。 +精度比对结果`compare_result_{timestamp}.xlsx`文件中对于存在异常信息的API会进行高亮处理: + +- 红色可能出现的情况有: + - NPU max或NPU min信息中存在nan/inf + - Max diff存在大于1e+10的值 + - 统计数据中output的Max diff除以max(0.01, Bench max) > 0.5 + - 真实数据中One Thousandth Err Ratio的input > 0.9同时output < 0.6 +- 黄色可能出现的情况有: + - Max diff的input与output都大于1,同时output比input大一个数量级以上 + - 统计数据Max diff除以max(0.01, Bench max)的output > 0.1同时input < 0.01 + - 真实数据One Thousandth Err Ratio的input - output > 0.1 + - 真实数据Cosine的input - output > 0.1 # FAQ diff --git a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_overview.md b/debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_overview.md similarity index 81% rename from debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_overview.md rename to debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_overview.md index 708d90b3487c47249c5f6a8b0f37671e8918e7e2..019451454877eddd3cf6e59cc1eef1c48fcf2a3c 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_overview.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_overview.md @@ -4,7 +4,7 @@ 在PyTorch训练网络,对同一模型或API调试过程中,遇到API相关的计算精度问题,定位时费时费力。 -atat的精度比对工具,用来进行PyTorch整网API粒度的数据dump、精度比对和溢出检测,从而定位PyTorch训练场景下的精度问题。 +msprobe的精度比对工具,用来进行PyTorch整网API粒度的数据dump、精度比对和溢出检测,从而定位PyTorch训练场景下的精度问题。 **使用场景** @@ -42,17 +42,17 @@ atat的精度比对工具,用来进行PyTorch整网API粒度的数据dump、 1. 准备CPU或GPU训练工程。 -2. 在环境下安装atat工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 +2. 在环境下安装msprobe工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 -3. 在训练脚本内添加atat工具dump接口PrecisionDebugger采集标杆数据。详见《[精度数据采集](./dump.md)》。 +3. 在训练脚本内添加msprobe工具dump接口PrecisionDebugger采集标杆数据。详见《[精度数据采集](./dump.md)》。 4. 执行训练dump数据。 5. 将CPU或GPU训练工程迁移为NPU训练工程。详见《[PyTorch模型迁移调优指南](https://www.hiascend.com/document/detail/zh/Pytorch/60RC1/ptmoddevg/trainingmigrguide/PT_LMTMOG_0003.html)》。 -6. 在NPU环境下安装atat工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 +6. 在NPU环境下安装msprobe工具。详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 -7. 在NPU训练脚本内添加atat工具dump接口PrecisionDebugger采集标杆数据。详见《[精度数据采集](./dump.md)》。 +7. 在NPU训练脚本内添加msprobe工具dump接口PrecisionDebugger采集标杆数据。详见《[精度数据采集](./dump.md)》。 8. NPU环境下执行训练dump数据。 diff --git a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_quickstart.md b/debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_quickstart.md similarity index 93% rename from debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_quickstart.md rename to debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_quickstart.md index ae6e3b0b4bbad4796b0332ee8a41b3ae14e5f94e..4b6ac9de2f075ad17ff8a594bc40df4c36692f5f 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/ptdbg_ascend_quickstart.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/ptdbg_ascend_quickstart.md @@ -1,8 +1,8 @@ # **精度比对工具** -本文主要介绍atat的精度比对工具的快速入门和场景化示例。 +本文主要介绍msprobe的精度比对工具的快速入门和场景化示例。 -本文介绍的操作需要安装atat工具,详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 +本文介绍的操作需要安装msprobe工具,详见《[MindStudio精度调试工具](../../README.md)》的“工具安装”章节。 本文介绍的操作主要是精度数据dump和精度比对,详细操作指导可参考《[精度数据采集](./dump.md)》和《[CPU或GPU与NPU精度数据比对](./ptdbg_ascend.md)》。 @@ -51,12 +51,12 @@ PyTorch训练场景的精度问题分析建议参考以下思路进行精度比 } ``` -2. 在训练脚本内添加atat工具,dump整网数据。 +2. 
在训练脚本内添加msprobe工具,dump整网数据。 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): ```python - from atat.pytorch import PrecisionDebugger + from msprobe.pytorch import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json", dump_path="./npu_dump") # 请勿将以上初始化流程插入到循环代码中 @@ -82,7 +82,7 @@ PyTorch训练场景的精度问题分析建议参考以下思路进行精度比 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: ```python - from atat.pytorch import compare + from msprobe.pytorch import compare dump_result_param={ "npu_json_path": "./npu_dump/dump.json", "bench_json_path": "./gpu_dump/dump.json", @@ -98,7 +98,7 @@ PyTorch训练场景的精度问题分析建议参考以下思路进行精度比 python3 compare.py ``` - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` + 在output目录下生成结果文件,包括:`compare_result_{timestamp}.xlsx`和`advisor_{timestamp}.txt` 4. 找出存在问题的API。 @@ -106,7 +106,7 @@ python3 compare.py ![auto_analyze_log](img/auto_analyze_log.png) - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 + 2. 根据第2步结果文件`compare_result_{timestamp}.xlsx`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 5. (可选)重新比对。 @@ -140,10 +140,10 @@ python3 compare.py } ``` -2. 在NPU训练脚本内添加atat工具,执行溢出检测dump。 +2. 在NPU训练脚本内添加msprobe工具,执行溢出检测dump。 ```python - from atat.pytorch import PrecisionDebugger + from msprobe.pytorch import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json", dump_path="./npu_dump") # 请勿将以上初始化流程插入到循环代码中 @@ -171,7 +171,7 @@ python3 compare.py 溢出解析工具执行命令如下: ```bash - atat -f pytorch run_overflow_check -api_info ./dump.json + msprobe -f pytorch run_overflow_check -api_info ./dump.json ``` 反向过程溢出的API暂不支持精度预检功能。 @@ -200,7 +200,7 @@ python3 compare.py 1. 创建比对脚本,例如compare_distributed.py,拷贝如下代码。 ```python - from atat.pytorch import * + from msprobe.pytorch import * compare_distributed('./npu_dump/step0', './gpu_dump/step0', './output') ``` @@ -219,7 +219,7 @@ python3 compare.py 多卡一般为多进程,须保证每个进程都正确调用PrecisionDebugger,或把PrecisionDebugger插入到import语句后,如: ```python -from atat.pytorch import PrecisionDebugger +from msprobe.pytorch import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json", dump_path="./npu_dump") ``` @@ -339,10 +339,10 @@ debugger = PrecisionDebugger(config_path="./config.json", dump_path="./npu_dump" } ``` -2. 在训练脚本内添加atat工具,dump整网数据。 +2. 在训练脚本内添加msprobe工具,dump整网数据。 ```python - from atat.pytorch import PrecisionDebugger + from msprobe.pytorch import PrecisionDebugger debugger = PrecisionDebugger(config_path="./config.json", dump_path="./npu_dump") # 请勿将以上初始化流程插入到循环代码中 diff --git a/debug/accuracy_tools/atat/pytorch/doc/run_overflow_check.md b/debug/accuracy_tools/msprobe/pytorch/doc/run_overflow_check.md similarity index 95% rename from debug/accuracy_tools/atat/pytorch/doc/run_overflow_check.md rename to debug/accuracy_tools/msprobe/pytorch/doc/run_overflow_check.md index 1bdc4f354cfaf0bfbdf701baa7dfb05f3771e30b..b8c9c3b4c292886e2ef8229ec421a244ae38f92e 100644 --- a/debug/accuracy_tools/atat/pytorch/doc/run_overflow_check.md +++ b/debug/accuracy_tools/msprobe/pytorch/doc/run_overflow_check.md @@ -13,7 +13,7 @@ 2. 
执行溢出API解析操作。 ```bash - atat -f pytorch run_overflow_check -api_info ./dump.json + msprobe -f pytorch run_overflow_check -api_info ./dump.json ``` | 参数名称 | 说明 | 是否必选 | diff --git "a/debug/accuracy_tools/msprobe/pytorch/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" "b/debug/accuracy_tools/msprobe/pytorch/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" new file mode 100644 index 0000000000000000000000000000000000000000..05bebaf0a22d8d5d7886ec9937b32b4755caf872 --- /dev/null +++ "b/debug/accuracy_tools/msprobe/pytorch/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" @@ -0,0 +1,90 @@ +# **PyTorch NPU在线精度比对工具使用指南** + +PyTorch NPU在线精度比对是ptdbg_ascend工具实现在PyTorch训练过程中直接完成精度比对并输出比对结果的功能。 + +在线精度比对实现的是NPU与CPU之间的精度比对。 + +## PyTorch NPU在线精度比对总体流程 + +1. 准备NPU训练工程。 + +2. 在NPU环境下安装ptdbg_ascend工具,参见《[PyTorch精度工具](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 + +3. 在训练脚本内插入ptdbg_ascend工具在线精度比对接口。 + +4. 执行训练并获取在线精度比对NPU和CPU分别执行后的精度比对结果。 + +5. 比对结果分析。 + +## PyTorch NPU在线精度比对 +### 总体说明 +- 本节主要介绍NPU精度比对所需要的函数以及示例。 +- 在线精度比对工具通过截获PyTorch框架中部分Aten Ir及其输入输出,并将输入数据转到CPU执行,最后将NPU和CPU的执行结果进行精度比对得到比对结果。 + +### 约束 + +- Pytorch 只支持2.0及其以上版本。 +- 只支持Aten Ir级在线精度比对,所有Aten Ir可以通过dir(torch.ops.aten)查看,其中部分IR不支持在线比对:Aten Ir无对应CPU实现、NPU和CPU同AtenIR实现逻辑不一致,导致同输入不同输出。 +- 正反向不支持同时在线精度比对,不支持跨step在线精度比对。 + + +### 场景示例 +1. 在NPU训练脚本中添加在线精度比对接口,示例如下: + + ```python + from msprobe.pytorch.common.utils import seed_all + from msprobe.pytorch.online_dispatch import PtdbgDispatch + + # 在main函数开始前固定随机数 + seed_all() + + + ... + + # 在需要调试精度的正向或反向代码前设置 + # 正向示例 + with PtdbgDispatch(dump_mode="auto", dump_path="/home/dump"): + output = model_cpu(inputs) + # 反向示例 + with PtdbgDispatch(dump_mode="auto", dump_path="/home/dump"): + loss.backward() + ``` + +2. 执行训练。 + +3. 找出精度不达标的Aten IR。 + + 执行过程中会打屏Failed,Failed在比对结果csv中的Accuracy Reached or Not列标记为No,并在Dump目录下存盘精度不达标Aten IR的输入输出。 + ![图片说明](http://image.huawei.com/tiny-lts/v1/images/d83d564e337e80c7cfb557ca3600d0d4_1689x178.png@900-0-90-f.png) + +### 计算精度评价指标 + +1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标; +2. Cosine < 0.9,精度不达标; +3. MaxAbsError > 1,精度不达标。 + +### 在线精度比对参数设置说明 + +| 参数名称 | 说明 | 是否必选 | +| -------- |-------------------------------------------------------------------------------------------------| -------- | +| dump_mode| dump模式,可取值"all"、"list"、"auto"、"OFF",默认值为OFF(表示不Dump数据)。 | 否 | +| api_list | dump范围,dump_mode="list"时设置,需要Dump Aten Ir API名称,默认为None,Aten Ir API名称可以通过dir(torch.ops.aten)查看。 | 否 | +| dump_path| dump文件生成的路径。 | 是 | +| tag | 传入tag字符串,成为dump文件夹名一部分,默认为None。 | 否 | +| process_num | 多进程并发数,默认为0。 | 否 | +| debug | debug信息打印,默认为False。 | 否 | +### dump数据存盘说明 +dump数据存盘目录名格式:`msprobe_tag_rankid_{timestamp}`。 + +子目录下包含1个比对结果csv文件、cpu和npudump数据目录,npu目录下包含Aten IR在NPU上的输入输出的dump数据,由于CPU的输入是直接使用NPU的输入执行,因此cpu目录下只包含执行输出的dump数据。 + +```bash +msprobe_rank4_20230911170521 +├── compare_result_rank4_20230911170521.csv +├── cpu +│   ├── native_batch_norm_backward_10_output.0.npy +│ ............ +└── npu + ├── native_batch_norm_backward_10_input.0.npy + ............ 
+``` diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/__init__.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/__init__.py similarity index 39% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/__init__.py index 3ffe161cba432405e2dc8d98f9be89053b58849d..d234898c0df158308b070e4a9147c9dff0b67c8d 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/__init__.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/__init__.py @@ -1,6 +1,6 @@ -from atat.pytorch.common import print_warn_log_rank_0, print_info_log_rank_0 -from atat.pytorch.common.exceptions import FreeBenchmarkException -from atat.pytorch.common.utils import Const +from msprobe.core.common.log import logger +from msprobe.core.common.exceptions import FreeBenchmarkException +from msprobe.core.common.const import Const from .main import FreeBenchmarkCheck from .common.params import UnequalRow diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/__init__.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/constant.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/constant.py similarity index 90% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/common/constant.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/constant.py index 9b72437f2280ca44a20fc5e370f1cfd9b9ea3ac4..e737e7b21796c186206d03c62eb9a9a309014133 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/constant.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/constant.py @@ -2,8 +2,8 @@ from typing import Dict import numpy as np import torch -from atat.pytorch.free_benchmark.common.enums import FuzzThreshold -from atat.pytorch.free_benchmark.common.params import BenchmarkThd +from msprobe.pytorch.free_benchmark.common.enums import FuzzThreshold +from msprobe.pytorch.free_benchmark.common.params import BenchmarkThd class CommonField: @@ -13,6 +13,7 @@ class CommonField: REQUIRES_GRAD = "requires_grad" HOLD_PLACE = "hold_place" DISTRIBUTED_OP = "torch.distributed" + GRADSAVER = "grad_saver" class ThresholdConfig: diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/counter.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/counter.py similarity index 97% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/common/counter.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/counter.py index 186b75c71aeaf71fc2adab7ec38c7f00f6b7fdb7..b2f8c81f3a4ea57d712e49b0b58fc77747797323 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/counter.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/counter.py @@ -1,5 +1,5 @@ from collections import defaultdict -from atat.pytorch.free_benchmark.common.constant import ThresholdConfig +from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig class PreheatCounter: diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/enums.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/enums.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/common/enums.py rename to 
debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/enums.py diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/params.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/params.py similarity index 91% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/common/params.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/params.py index c5dfefb43f856383af93068840e9e48e1590c431..bbfc245a635322f0cde6951663d4d76a009ee66a 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/params.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/params.py @@ -1,15 +1,14 @@ -from abc import ABC from dataclasses import dataclass from typing import Any, Callable, Dict, List, Optional, Tuple import torch -from atat.pytorch.free_benchmark import Const, print_warn_log_rank_0 -from atat.pytorch.free_benchmark.common.enums import ( +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.enums import ( DeviceType, FuzzLevel, PerturbationMode, ) -from atat.pytorch.free_benchmark.common.utils import Tools +from msprobe.pytorch.free_benchmark.common.utils import Tools @dataclass @@ -78,8 +77,8 @@ def data_pre_deal(name, func, args, kwargs): index = check_args_type(args) data_params.valid_input_index = index if index == -1: - print_warn_log_rank_0( - f"[atat] Free benchmark: 无标杆工具不支持当前算子的输入类型 {name}." + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: 无标杆工具不支持当前算子的输入类型 {name}." ) return data_params diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/utils.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/utils.py similarity index 98% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/common/utils.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/utils.py index 24d25967635b3dcfd1da89e1f54d3282fa1181ed..ddcbd9d0f5ca013b542c189edbd2b813807b86da 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/common/utils.py @@ -1,5 +1,5 @@ import torch -from atat.pytorch.free_benchmark.common.enums import DeviceType +from msprobe.pytorch.free_benchmark.common.enums import DeviceType class Tools: diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/grad_saver.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/grad_saver.py similarity index 68% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/compare/grad_saver.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/grad_saver.py index a8752656ed72bc21773aca2bb06d4e69d96a5c4b..6781a1c2fc4c7fd348d330a49352e2f6195e8a71 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/grad_saver.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/grad_saver.py @@ -1,12 +1,12 @@ import torch -from atat.pytorch.free_benchmark import print_info_log_rank_0, print_warn_log_rank_0 -from atat.pytorch.free_benchmark.common.params import DataParams, HandlerParams -from atat.pytorch.free_benchmark.common.constant import CommonField -from atat.pytorch.free_benchmark.common.utils import Tools -from atat.pytorch.free_benchmark.result_handlers.handler_factory import ( +from msprobe.core.common.exceptions import FreeBenchmarkException +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import CommonField +from msprobe.pytorch.free_benchmark.common.params import DataParams, HandlerParams +from 
msprobe.pytorch.free_benchmark.perturbed_layers.layer_factory import LayerFactory +from msprobe.pytorch.free_benchmark.result_handlers.handler_factory import ( FuzzHandlerFactory, ) -from atat.pytorch.free_benchmark.perturbed_layers.layer_factory import LayerFactory class GradSaver: @@ -39,9 +39,20 @@ class GradSaver: handler, grad, perturbed_grad, index=input_index ) data_processor.update_unequal_rows(handler.get_unequal_rows()) + except IndexError: + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: grad index out of range. api:{self.handler_params.api_name}." + f"index:{new_grad_index}, perturbation grad len {len(self.perturbed_grad_input)}" + ) + return grad + except FreeBenchmarkException as e: + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: grad input check error: {e}" + ) + return grad except Exception as e: - print_warn_log_rank_0( - f"[atat] Free benchmark: grad compara error: {e}" + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: grad compare error: {e}" ) return grad return grad @@ -65,34 +76,30 @@ class GradSaver: self.data_params.original_result = self.origin_grad_input handler.handle(self.data_params) except Exception as e: - print_warn_log_rank_0( - f"[atat] Free benchmark: compare two vjp failed: api:{self.handler_params.api_name}." + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: compare two vjp failed: api:{self.handler_params.api_name}." f"{e}" ) + # 在扰动前后输出对比后释放输出的引用 + self.data_params.perturbed_result = None + self.data_params.original_result = None def check_grad_input(self, origin_grad, new_grad_index): if self.perturbed_grad_input is None: - print_info_log_rank_0( - f"[atat] Free benchmark: grad not exsits : {self.api_name}." + raise FreeBenchmarkException( + FreeBenchmarkException.InvalidGrad, + f"grad not exists : {self.api_name}." ) - return None - try: - with torch.no_grad(): - perturbed_grad = self.perturbed_grad_input[new_grad_index].to( - origin_grad.device - ) - except IndexError: - print_warn_log_rank_0( - f"[atat] Free benchmark: grad index out of range. api:{self.handler_params.api_name}." - f"index:{new_grad_index}, perturbation grad len {len(self.perturbed_grad_input)}" + with torch.no_grad(): + perturbed_grad = self.perturbed_grad_input[new_grad_index].to( + origin_grad.device ) - return None if origin_grad.shape != perturbed_grad.shape: - print_warn_log_rank_0( - f"[atat] Free benchmark: grad shapes are unconsistent. api:{self.handler_params.api_name}." + raise FreeBenchmarkException( + FreeBenchmarkException.InvalidGrad, + f"grad shapes are inconsistent. api:{self.handler_params.api_name}." 
f"origin:{origin_grad.shape}, perturbation: {perturbed_grad.shape}" ) - return None return perturbed_grad def cache_backward_input(self, backward_input_list): @@ -117,12 +124,12 @@ class GradSaver: for object_ in self.backward_input: if isinstance(object_, dict) and CommonField.FUZZ_TENSOR in object_.keys(): tensor_ = torch.tensor( - object_.get(CommonField.FUZZ_TENSOR).data, - dtype=object_.get(CommonField.FUZZ_TENSOR).dtype, - device=object_.get(CommonField.DEVICE), - requires_grad=object_.get(CommonField.REQUIRES_GRAD), - ) - + object_.get(CommonField.FUZZ_TENSOR).data, + dtype=object_.get(CommonField.FUZZ_TENSOR).dtype, + device=object_.get(CommonField.DEVICE), + requires_grad=object_.get(CommonField.REQUIRES_GRAD), + ) + if tensor_.requires_grad: inner_args_tmp.append(CommonField.HOLD_PLACE) need_grad_tensors.append(tensor_) @@ -167,6 +174,10 @@ class GradSaver: self.handler_params.pert_mode, ) layer.handle(self.data_params) - self.perturbed_grad_input = tuple( - [x.cpu() for x in self.data_params.perturbed_result] - ) + # 在计算扰动输出之后,释放输入的引用 + self.data_params.args = None + # 确定扰动成功后,才会暂存 + if self.data_params.perturbed_result: + self.perturbed_grad_input = tuple( + [x.cpu() for x in self.data_params.perturbed_result] + ) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/single_benchmark.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/single_benchmark.py similarity index 89% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/compare/single_benchmark.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/single_benchmark.py index ed834c468ba6f15437da4479a3e2b3257fd7b6c1..59239fcd004fb3472d5c8f53305e692fec00adee 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/compare/single_benchmark.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/compare/single_benchmark.py @@ -1,9 +1,9 @@ -import torch import math -from atat.pytorch.free_benchmark import print_warn_log_rank_0 -from atat.pytorch.free_benchmark.common.utils import TorchC -from atat.pytorch.free_benchmark.common.constant import ThresholdConfig +import torch +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig +from msprobe.pytorch.free_benchmark.common.utils import TorchC class SingleCompare: @@ -13,6 +13,45 @@ class SingleCompare: self.eb = None self.threshold = None + @staticmethod + def filter_overflow(tensor) -> int: + inf_num = TorchC.sum(TorchC.isinf(tensor)) + nan_num = TorchC.sum(TorchC.isnan(tensor)) + return inf_num + nan_num + + @staticmethod + def replace_inf_or_nan(tensor): + finite_mask = TorchC.isfinite(tensor) + inf_or_nan_mask = TorchC.logical_not(finite_mask) + inf_or_nan_num = TorchC.sum(inf_or_nan_mask).item() + if inf_or_nan_num > 0: + tensor[inf_or_nan_mask] = 1 + return tensor + + @staticmethod + def compare_float_seq(actual, golden): + return math.isclose(actual, golden) + + @staticmethod + def compare_other_seq(actual, golden): + return actual == golden + + def compare_dict_seq(self, actual, golden): + if len(actual) != len(golden): + return False + for key, value in golden.items(): + if not self.compare_seq(value, actual.get(key)): + return False + return True + + def compare_list_seq(self, actual, golden): + if len(actual) != len(golden): + return False + for index_, value in enumerate(golden): + if not self.compare_seq(value, actual[index_]): + return False + return True + def compare_seq(self, actual, golden): if isinstance(golden, torch.Tensor): 
return self.compare_tensor_seq(actual, golden) @@ -30,7 +69,7 @@ class SingleCompare: actual.dtype, ThresholdConfig.BENCHMARK_THD_DICT.get(torch.float32) ) if self.filter_overflow(golden) > 0: - print_warn_log_rank_0("[atat] Free Benchmark: inf and nan" + logger.warning_on_rank_0("[msprobe] Free Benchmark: inf and nan" "in golden tensor is not supported.") return True actual = self.replace_inf_or_nan(actual) @@ -45,7 +84,6 @@ class SingleCompare: return False return True - def _cal_compare_metrics(self, actual, golden): diff_value = TorchC.subtract(actual, golden) diff_abs = TorchC.abs(diff_value) @@ -62,42 +100,5 @@ class SingleCompare: # 获取误差均衡性 divided = TorchC.where( TorchC.ge(TorchC.abs(golden), self.threshold.small_value), golden_abs, 1 - ) + ) self.eb = TorchC.mean(TorchC.div(diff_value, divided)) - - def compare_dict_seq(self, actual, golden): - if len(actual) != len(golden): - return False - for key, value in golden.items(): - if not self.compare_seq(value, actual.get(key)): - return False - return True - - def compare_list_seq(self, actual, golden): - if len(actual) != len(golden): - return False - for index_, value in enumerate(golden): - if not self.compare_seq(value, actual[index_]): - return False - return True - - def compare_float_seq(self, actual, golden): - return math.isclose(actual, golden) - - def compare_other_seq(self, actual, golden): - return actual == golden - - @staticmethod - def filter_overflow(tensor) -> int: - inf_num = TorchC.sum(TorchC.isinf(tensor)) - nan_num = TorchC.sum(TorchC.isnan(tensor)) - return inf_num + nan_num - - @staticmethod - def replace_inf_or_nan(tensor): - finite_mask = TorchC.isfinite(tensor) - inf_or_nan_mask = TorchC.logical_not(finite_mask) - inf_or_nan_num = TorchC.sum(inf_or_nan_mask).item() - if inf_or_nan_num > 0: - tensor[inf_or_nan_mask] = 1 - return tensor diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/main.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/main.py similarity index 74% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/main.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/main.py index c2e0005181d967ed8437e3047f9d967b1370d4e3..971776d1326409c8878849e7b09a4614ffbc16f5 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/main.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/main.py @@ -1,19 +1,19 @@ -import importlib from abc import ABC import torch -from atat.pytorch.free_benchmark import Const, print_warn_log_rank_0 - -from atat.pytorch.free_benchmark.common.params import data_pre_deal, make_handler_params -from atat.pytorch.free_benchmark.common.enums import ( - PerturbationMode, - FuzzLevel, +from msprobe.core.common.const import Const +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import CommonField +from msprobe.pytorch.free_benchmark.common.enums import ( DeviceType, - HandlerType + FuzzLevel, + HandlerType, + PerturbationMode, ) -from atat.pytorch.free_benchmark.compare.grad_saver import GradSaver -from atat.pytorch.free_benchmark.perturbed_layers.layer_factory import LayerFactory -from atat.pytorch.free_benchmark.result_handlers.handler_factory import ( +from msprobe.pytorch.free_benchmark.common.params import data_pre_deal, make_handler_params +from msprobe.pytorch.free_benchmark.compare.grad_saver import GradSaver +from msprobe.pytorch.free_benchmark.perturbed_layers.layer_factory import LayerFactory +from msprobe.pytorch.free_benchmark.result_handlers.handler_factory import ( 
FuzzHandlerFactory, ) @@ -50,7 +50,7 @@ class FreeBenchmarkCheck(ABC): grad_saver.kwargs = kwargs grad_saver.register_compare_func_for_inputs(args, data_processor) grad_saver.cache_backward_input(args) - setattr(module, "grad_saver", grad_saver) + setattr(module, CommonField.GRADSAVER, grad_saver) def forward(self, name, module, args, kwargs, output): if not self.config.fuzz_stage == Const.FORWARD: @@ -71,17 +71,17 @@ class FreeBenchmarkCheck(ABC): handler_params = make_handler_params(name, self.config, self.current_iter) handler = FuzzHandlerFactory.create(handler_params) handler.handle(data_params) - return output, handler.get_unequal_rows() + return data_params.perturbed_result, handler.get_unequal_rows() def backward(self, name, module, grad_output): if not self.config.fuzz_stage == Const.BACKWARD: return try: - grad_saver = getattr(module, "grad_saver") + grad_saver = getattr(module, CommonField.GRADSAVER) except AttributeError: - print_warn_log_rank_0( - f"[atat] Free benchmark: get grad saver failed. api_name:{name}" + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: get grad saver failed. api_name:{name}" ) return @@ -96,7 +96,7 @@ class FreeBenchmarkCheck(ABC): _new_grad_output, need_grad_tensors, _inner_args ) except Exception as e: - print_warn_log_rank_0( - f"[atat] Free benchmark: grad vjp calculate failed. api_name:{name} error: {e}" + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: grad vjp calculate failed. api_name:{name} error: {e}" ) return diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/__init__.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/common/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/base_layer.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/base_layer.py similarity index 78% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/base_layer.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/base_layer.py index aa572fd8e8dc8b62493dfa1fecc587b934c83a99..f64a201d5efa007ff4ed848eafd2ab6db535a2f5 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/base_layer.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/base_layer.py @@ -1,7 +1,7 @@ from abc import ABC, abstractmethod from typing import Any -from atat.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.common.params import DataParams class BaseLayer(ABC): diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/layer_factory.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/layer_factory.py similarity index 62% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/layer_factory.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/layer_factory.py index 0d09438ce04132c9c5c301d758dc06818805082e..0ea9107aa84c2633435fe616891f5386b17de423 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/layer_factory.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/layer_factory.py @@ -1,15 +1,15 @@ -from atat.pytorch.free_benchmark import FreeBenchmarkException -from 
atat.pytorch.free_benchmark.common.enums import DeviceType, PerturbationMode -from atat.pytorch.free_benchmark.perturbed_layers.npu.improve_precision import ( +from msprobe.pytorch.free_benchmark import FreeBenchmarkException +from msprobe.pytorch.free_benchmark.common.enums import DeviceType, PerturbationMode +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.improve_precision import ( ImprovePrecisionLayer, ) -from atat.pytorch.free_benchmark.perturbed_layers.npu.add_noise import AddNoiseLayer -from atat.pytorch.free_benchmark.perturbed_layers.npu.bit_noise import BitNoiseLayer -from atat.pytorch.free_benchmark.perturbed_layers.npu.no_change import NoChangeLayer -from atat.pytorch.free_benchmark.perturbed_layers.npu.change_value import ( +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.add_noise import AddNoiseLayer +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.bit_noise import BitNoiseLayer +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.no_change import NoChangeLayer +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.change_value import ( ChangeValueLayer, ) -from atat.pytorch.free_benchmark.perturbed_layers.run_cpu import CpuLayer +from msprobe.pytorch.free_benchmark.perturbed_layers.run_cpu import CpuLayer class LayerFactory: diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/__init__.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py similarity index 72% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py index d03dbe931d91e5ed91b70c7b2b8fe1fb8f1342fa..a18ef1c51bd342c9b3ab5ffecf14c307e9be5527 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py @@ -1,19 +1,48 @@ import torch -from atat.pytorch.free_benchmark import ( - print_info_log_rank_0, - print_warn_log_rank_0, -) -from atat.pytorch.free_benchmark.common.constant import ThresholdConfig -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.common.utils import TorchC -from atat.pytorch.free_benchmark.common.enums import PerturbationMode -from atat.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig +from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.common.utils import TorchC +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( NpuBaseLayer, ) class AddNoiseLayer(NpuBaseLayer): + def add_noise(self, tensor_obj): + if isinstance(tensor_obj, torch.Tensor): + self.perturbed_value = ThresholdConfig.PERTURBATION_VALUE_DICT.get( + tensor_obj.dtype + ) + if not self.pre_check(tensor_obj): + return tensor_obj + noise = 
self._get_noise(tensor_obj) + result = TorchC.where( + TorchC.gt(TorchC.abs(tensor_obj), self.perturbed_value ** 0.5), + TorchC.add(noise, tensor_obj), + tensor_obj, + ).to(tensor_obj.dtype) + self.is_added = True + return result + if isinstance(tensor_obj, dict): + return {key: self.add_noise(value) for key, value in tensor_obj.items()} + if isinstance(tensor_obj, (tuple, list)): + return type(tensor_obj)([self.add_noise(value) for value in tensor_obj]) + return tensor_obj + + def handle(self, params: DataParams) -> torch.Any: + """ + 对输入添加扰动并返回 + """ + logger.info_on_rank_0( + f"[msprobe] Free benchmark: Perturbation is " + f"{PerturbationMode.ADD_NOISE} of {self.api_name}." + ) + params.perturbed_value = self.add_noise(params.args[params.valid_input_index]) + return self.perturbed_result(params) + def _get_noise(self, tensor_obj): dtype = tensor_obj.dtype device = str(tensor_obj.device) @@ -30,14 +59,14 @@ class AddNoiseLayer(NpuBaseLayer): 判断是否需要添加扰动 """ if not self.perturbed_value: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.api_name}, " + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.api_name}, " f"dtype unsupported. Cancel perturbation." ) return False if tensor_obj.numel() == 0: - print_warn_log_rank_0( - f"[atat] Free benchmark: For {self.api_name}, tensor shape must > 0." + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: For {self.api_name}, tensor shape must > 0." f" Cancel adding noise." ) return False @@ -47,47 +76,15 @@ class AddNoiseLayer(NpuBaseLayer): try: max_val = TorchC.max(TorchC.abs(tensor_obj)).item() except Exception: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.api_name}, " + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.api_name}, " f"when calculate maximun value, tensor is changed to float32." ) max_val = TorchC.max(TorchC.abs(tensor_obj.to(torch.float32))).item() if max_val < abs_tol: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.api_name}, " + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.api_name}, " f"Maximun value is less than the minimun threshold. Cancel add noise." ) return False return True - - def add_noise(self, tensor_obj): - if isinstance(tensor_obj, torch.Tensor): - self.perturbed_value = ThresholdConfig.PERTURBATION_VALUE_DICT.get( - tensor_obj.dtype - ) - if not self.pre_check(tensor_obj): - return tensor_obj - noise = self._get_noise(tensor_obj) - result = TorchC.where( - TorchC.gt(TorchC.abs(tensor_obj), self.perturbed_value**0.5), - TorchC.add(noise, tensor_obj), - tensor_obj, - ).to(tensor_obj.dtype) - self.is_added = True - return result - if isinstance(tensor_obj, dict): - return {key: self.add_noise(value) for key, value in tensor_obj.items()} - if isinstance(tensor_obj, (tuple, list)): - return type(tensor_obj)([self.add_noise(value) for value in tensor_obj]) - return tensor_obj - - def handle(self, params: DataParams) -> torch.Any: - """ - 对输入添加扰动并返回 - """ - print_info_log_rank_0( - f"[atat] Free benchmark: Perturbation is " - f"{PerturbationMode.ADD_NOISE} of {self.api_name}." 
- ) - params.perturbed_value = self.add_noise(params.args[params.valid_input_index]) - return self.perturbed_result(params) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py similarity index 76% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py index 72d04af412067882826ea402ed6fa00490bce348..45dea7b93a5c7628b24bf0470af10af355a7742f 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py @@ -1,13 +1,10 @@ import torch -from atat.pytorch.free_benchmark import ( - print_info_log_rank_0, - print_warn_log_rank_0, -) -from atat.pytorch.free_benchmark.common.constant import ThresholdConfig -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.common.utils import TorchC -from atat.pytorch.free_benchmark.common.enums import PerturbationMode -from atat.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig +from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.common.utils import TorchC +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( NpuBaseLayer, ) @@ -19,50 +16,6 @@ class BitNoiseLayer(NpuBaseLayer): self.bit_tail: int = 1 self.bit_type = None - def _check_details(self, tensor_obj): - """ - 判断是否需要添加扰动, bit翻转 - """ - if not self.bit_type: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.api_name}, " - f"dtype unsupported. Cancel perturbation." - ) - return False - if tensor_obj.numel() == 0: - print_warn_log_rank_0( - f"[atat] Free benchmark: For {self.api_name}, tensor shape must > 0" - f" Cancel adding noise." - ) - return False - abs_tol = ThresholdConfig.ABS_TOL_VALUE_DICT.get( - tensor_obj.dtype, ThresholdConfig.NOISE_INPUT_LOWER_BOUND - ) - try: - max_val = TorchC.max(TorchC.abs(tensor_obj)).item() - except Exception: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.api_name}, " - f"when calculate maximun value, tensor is changed to float32." - ) - max_val = TorchC.max(TorchC.abs(tensor_obj.to(torch.float32))).item() - if max_val < abs_tol: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.api_name}, " - f"Maximun value is less than the minimun threshold. Cancel add noise." - ) - return False - return True - - def _set_perturbation_bit(self, tensor_obj): - """ - 根据不同浮点数确定不同位数扰动值 - """ - bit_len_type = ThresholdConfig.PERTURBATION_BIT_DICT.get(tensor_obj.dtype) - if bit_len_type: - self.bit_tail = 1 - self.bit_type = bit_len_type - def add_bit_noise(self, tensor_obj): """ 对输入添加噪声 @@ -99,9 +52,53 @@ class BitNoiseLayer(NpuBaseLayer): """ 对输入添加扰动并返回 """ - print_info_log_rank_0( - f"[atat] Free benchmark: Perturbation is " + logger.info_on_rank_0( + f"[msprobe] Free benchmark: Perturbation is " f"{PerturbationMode.BIT_NOISE} of {self.api_name}." 
) params.perturbed_value = self.add_bit_noise(params.args[params.valid_input_index]) return self.perturbed_result(params) + + def _check_details(self, tensor_obj): + """ + 判断是否需要添加扰动, bit翻转 + """ + if not self.bit_type: + logger.info_on_rank_0( + f"[msprobe] Free Benchmark: For {self.api_name}, " + f"dtype unsupported. Cancel perturbation." + ) + return False + if tensor_obj.numel() == 0: + logger.warning_on_rank_0( + f"[msprobe] Free benchmark: For {self.api_name}, tensor shape must > 0" + f" Cancel adding noise." + ) + return False + abs_tol = ThresholdConfig.ABS_TOL_VALUE_DICT.get( + tensor_obj.dtype, ThresholdConfig.NOISE_INPUT_LOWER_BOUND + ) + try: + max_val = TorchC.max(TorchC.abs(tensor_obj)).item() + except Exception: + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.api_name}, " + f"when calculate maximun value, tensor is changed to float32." + ) + max_val = TorchC.max(TorchC.abs(tensor_obj.to(torch.float32))).item() + if max_val < abs_tol: + logger.info_on_rank_0( + f"[msprobe] Free Benchmark: For {self.api_name}, " + f"Maximun value is less than the minimun threshold. Cancel add noise." + ) + return False + return True + + def _set_perturbation_bit(self, tensor_obj): + """ + 根据不同浮点数确定不同位数扰动值 + """ + bit_len_type = ThresholdConfig.PERTURBATION_BIT_DICT.get(tensor_obj.dtype) + if bit_len_type: + self.bit_tail = 1 + self.bit_type = bit_len_type diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/change_value.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py similarity index 77% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/change_value.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py index ab91bcb7eeea00085318a21c20bb9f03d69b8908..91085d57a68b4841b2e04453c05c41a2903477c3 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/change_value.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py @@ -1,9 +1,9 @@ import torch -from atat.pytorch.free_benchmark import print_warn_log_rank_0, print_info_log_rank_0 -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.common.utils import TorchC -from atat.pytorch.free_benchmark.common.enums import PerturbationMode -from atat.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.common.utils import TorchC +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( NpuBaseLayer, ) @@ -14,18 +14,6 @@ class ChangeValueLayer(NpuBaseLayer): self.head: int = 0 self.tail: int = -1 - def _check_details(self, tensor_obj): - """ - 判断是否需要添加扰动, 首尾值交换 - """ - if tensor_obj.size(0) < 2: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.api_name}, " - f"size 0 must greater than 1. Cancel change value." 
- ) - return False - return True - def change_value(self, tensor_obj): """ 交换张量首尾 @@ -42,7 +30,7 @@ class ChangeValueLayer(NpuBaseLayer): temp_last = TorchC.clone(new_tensor[self.tail][self.tail]) new_tensor[self.head][self.head] = temp_last new_tensor[self.tail][self.tail] = temp_first - + self.is_added = True return new_tensor if isinstance(tensor_obj, dict): @@ -55,9 +43,21 @@ class ChangeValueLayer(NpuBaseLayer): """ 对输入添加扰动并返回 """ - print_info_log_rank_0( - f"[atat] Free benchmark: Perturbation is " + logger.info_on_rank_0( + f"[msprobe] Free benchmark: Perturbation is " f"{PerturbationMode.CHANGE_VALUE} of {self.api_name}." ) params.perturbed_value = self.change_value(params.args[params.valid_input_index]) return self.perturbed_result(params) + + def _check_details(self, tensor_obj): + """ + 判断是否需要添加扰动, 首尾值交换 + """ + if tensor_obj.size(0) < 2: + logger.info_on_rank_0( + f"[msprobe] Free Benchmark: For {self.api_name}, " + f"size 0 must greater than 1. Cancel change value." + ) + return False + return True diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py similarity index 75% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py index fb126972c6853b81d24db8138880601f9a3af21a..ad6d8b8989d6983f81a9a2d58798a26d4ccc45c1 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py @@ -1,33 +1,16 @@ import torch -from atat.pytorch.free_benchmark import Const, print_info_log_rank_0 -from atat.pytorch.free_benchmark.common.constant import CommonField -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.common.enums import PerturbationMode -from atat.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( +from msprobe.core.common.const import Const +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import CommonField +from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( NpuBaseLayer, ) class ImprovePrecisionLayer(NpuBaseLayer): - def _set_improve_valus(self, inputs): - # TODO why - if inputs.dtype in [torch.float16, torch.bfloat16]: - self.perturbed_value = torch.float32 - - def _change_dtype(self, inputs): - if hasattr(inputs, CommonField.DEVICE): - device = inputs.device - if device is CommonField.META: - new_inputs = inputs.to( - device=CommonField.META, dtype=self.perturbed_value - ) - else: - new_inputs = inputs.to(dtype=self.perturbed_value).to(device) - else: - new_inputs = inputs.to(dtype=self.perturbed_value) - return new_inputs - def improve_tensor_precision(self, tensor_obj): if ( isinstance(tensor_obj, torch.Tensor) @@ -36,6 +19,7 @@ class ImprovePrecisionLayer(NpuBaseLayer): ): self._set_improve_valus(tensor_obj) tensor_obj = self._change_dtype(tensor_obj) + self.is_added = True return tensor_obj if isinstance(tensor_obj, dict): return { @@ -49,8 +33,8 @@ class ImprovePrecisionLayer(NpuBaseLayer): return tensor_obj def handle(self, params: DataParams) -> 
torch.Any: - print_info_log_rank_0( - f"[atat] Free benchmark: Perturbation is " + logger.info_on_rank_0( + f"[msprobe] Free benchmark: Perturbation is " f"{PerturbationMode.IMPROVE_PRECISION} of {self.api_name}." ) new_args = self.improve_tensor_precision(params.args) @@ -58,7 +42,27 @@ class ImprovePrecisionLayer(NpuBaseLayer): new_kwargs = {} else: new_kwargs = self.improve_tensor_precision(params.kwargs) + # 如果输入中全为高精度、应跳过二次执行、减少多余显存引用 + if not self.is_added: + return params.perturbed_result if "inplace" in new_kwargs: new_kwargs["inplace"] = False params.perturbed_result = params.origin_func(*new_args, **new_kwargs) return params.perturbed_result + + def _set_improve_valus(self, inputs): + if inputs.dtype in [torch.float16, torch.bfloat16]: + self.perturbed_value = torch.float32 + + def _change_dtype(self, inputs): + if hasattr(inputs, CommonField.DEVICE): + device = inputs.device + if device is CommonField.META: + new_inputs = inputs.to( + device=CommonField.META, dtype=self.perturbed_value + ) + else: + new_inputs = inputs.to(dtype=self.perturbed_value).to(device) + else: + new_inputs = inputs.to(dtype=self.perturbed_value) + return new_inputs diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/no_change.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/no_change.py similarity index 61% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/no_change.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/no_change.py index 7ec5870fb72db30101f41a8ec057bf95d94da9b3..a69c56002a205a518a6929835591859f63b800ff 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/no_change.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/no_change.py @@ -1,8 +1,8 @@ import torch -from atat.pytorch.free_benchmark import print_info_log_rank_0 -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.common.enums import PerturbationMode -from atat.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.enums import PerturbationMode +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.perturbed_layers.npu.npu_base_layser import ( NpuBaseLayer, ) @@ -16,13 +16,12 @@ class NoChangeLayer(NpuBaseLayer): self.is_added = True return tensor_obj - def handle(self, params: DataParams) -> torch.Any: """ 对输入添加扰动并返回 """ - print_info_log_rank_0( - f"[atat] Free benchmark: Perturbation is " + logger.info_on_rank_0( + f"[msprobe] Free benchmark: Perturbation is " f"{PerturbationMode.NO_CHANGE} of {self.api_name}." 
) params.perturbed_value = self.no_change(params.args[params.valid_input_index]) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/npu_base_layser.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/npu_base_layser.py similarity index 79% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/npu_base_layser.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/npu_base_layser.py index ca502365e1b1b4ae0b37e2ecc48bff3b203f765c..1a859481475bc9963a5d3b96389061a257a1a759 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/npu_base_layser.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/npu/npu_base_layser.py @@ -1,10 +1,9 @@ from abc import abstractmethod from typing import Any + import torch -from atat.pytorch.free_benchmark.common.constant import CommonField, ThresholdConfig -from atat.pytorch.free_benchmark.common.utils import TorchC -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.perturbed_layers.base_layer import BaseLayer +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.perturbed_layers.base_layer import BaseLayer class NpuBaseLayer(BaseLayer): @@ -13,13 +12,22 @@ class NpuBaseLayer(BaseLayer): self.perturbed_value = None # 扰动的元素 self.is_added = False # 标记当前算子输入是否调整 + @staticmethod + def perturbed_result(params: DataParams) -> Any: + args_front = params.args[: params.valid_input_index] + args_rear = params.args[params.valid_input_index + 1:] + # 此处会将有inplace属性的算子换为非inplace + if "inplace" in params.kwargs: + params.kwargs["inplace"] = False + params.perturbed_result = params.origin_func( + *args_front, params.perturbed_value, *args_rear, **params.kwargs + ) + return params.perturbed_result + @abstractmethod def handle(self, params: DataParams) -> Any: pass - def _check_details(self, tensor_obj): - return True - def pre_check(self, tensor_obj): """ 检查张量是否符合标准(float类型且最大值大于对应精度最小值) @@ -33,14 +41,5 @@ class NpuBaseLayer(BaseLayer): return False return True - @staticmethod - def perturbed_result(params: DataParams) -> Any: - args_front = params.args[: params.valid_input_index] - args_rear = params.args[params.valid_input_index + 1 :] - # 此处会将有inplace属性的算子换为非inplace - if "inplace" in params.kwargs: - params.kwargs["inplace"] = False - params.perturbed_result = params.origin_func( - *args_front, params.perturbed_value, *args_rear, **params.kwargs - ) - return params.perturbed_result + def _check_details(self, tensor_obj): + return True diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/run_cpu.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/run_cpu.py similarity index 49% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/run_cpu.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/run_cpu.py index 387f9447fd29276e3c43bcdabf0e8a3a05b8ecec..d34ac976537d794a05255a32de8d54de2dbac5d3 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/run_cpu.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/perturbed_layers/run_cpu.py @@ -1,17 +1,17 @@ import torch -from atat.pytorch.free_benchmark import print_info_log_rank_0 -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.common.utils import Tools -from 
atat.pytorch.free_benchmark.common.enums import DeviceType -from atat.pytorch.free_benchmark.perturbed_layers.base_layer import BaseLayer +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.common.utils import Tools +from msprobe.pytorch.free_benchmark.common.enums import DeviceType +from msprobe.pytorch.free_benchmark.perturbed_layers.base_layer import BaseLayer class CpuLayer(BaseLayer): def handle(self, params: DataParams) -> torch.Any: - print_info_log_rank_0( - f"[atat] Free benchmark: Perturbation is to_cpu of {self.api_name}." + logger.info_on_rank_0( + f"[msprobe] Free benchmark: Perturbation is to_cpu of {self.api_name}." ) new_args = Tools.convert_device_and_dtype(params.args, DeviceType.CPU, change_dtype=True) new_kwargs = Tools.convert_device_and_dtype(params.kwargs, DeviceType.CPU, change_dtype=True) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/__init__.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/dump/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/base_handler.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/base_handler.py similarity index 73% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/base_handler.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/base_handler.py index 1d59ef9fc3adc2f90a7145d825ce597e209758e4..1728b096f5b0e5aa9c5a95c51b7d0591b561b008 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/base_handler.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/base_handler.py @@ -3,18 +3,20 @@ from abc import ABC, abstractmethod from typing import Any, Optional, Tuple import torch -from atat.pytorch.free_benchmark import ( - Const, - print_warn_log_rank_0, -) -from atat.pytorch.free_benchmark.common.utils import TorchC -from atat.pytorch.free_benchmark.common.constant import ThresholdConfig -from atat.pytorch.free_benchmark.common.enums import ( +from msprobe.core.common.const import Const +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig +from msprobe.pytorch.free_benchmark.common.enums import ( FuzzThreshold, NormType, PerturbationMode, ) -from atat.pytorch.free_benchmark.common.params import DataParams, HandlerParams, make_unequal_row +from msprobe.pytorch.free_benchmark.common.params import ( + DataParams, + HandlerParams, + make_unequal_row, +) +from msprobe.pytorch.free_benchmark.common.utils import Tools, TorchC class FuzzHandler(ABC): @@ -41,52 +43,45 @@ class FuzzHandler(ABC): abs_tol, ) - def get_ratio_from_specific_norm( - self, origin_output, perturbed_output, norm_type, abs_tol - ): - if norm_type == NormType.ENDLESS_NORM: - return self.get_endless_norm(origin_output, perturbed_output, abs_tol) - return ThresholdConfig.COMP_CONSISTENT - @staticmethod def convert_overflow_ratio_to_consistent(ratio): if math.isnan(ratio) or math.isinf(ratio): return ThresholdConfig.COMP_CONSISTENT return ratio + @abstractmethod + def get_threshold(self, dtype): + pass + + @abstractmethod + def handle(self, data_params: DataParams) -> Any: + pass + + def 
get_ratio_from_specific_norm( + self, origin_output, perturbed_output, norm_type, abs_tol + ): + if norm_type == NormType.ENDLESS_NORM: + return self.get_endless_norm(origin_output, perturbed_output, abs_tol) + return ThresholdConfig.COMP_CONSISTENT + def get_endless_norm(self, origin_output, perturbed_output, abs_tol): - try: - ratio_tensor1 = TorchC.where( - TorchC.gt(TorchC.abs(perturbed_output), abs_tol), - TorchC.div( - TorchC.abs(origin_output), - TorchC.add(TorchC.abs(perturbed_output), abs_tol), - ), - 1, - ) - ratio_tensor2 = TorchC.where( - TorchC.gt(TorchC.abs(origin_output), abs_tol), - TorchC.div( - TorchC.abs(perturbed_output), - TorchC.add(TorchC.abs(origin_output), abs_tol), - ), - 1, - ) - except: - ratio_tensor1 = TorchC.where( - TorchC.gt(TorchC.abs(perturbed_output.to(torch.float32)), abs_tol), - TorchC.div( - origin_output.to(torch.float32), perturbed_output.to(torch.float32) - ), - 1, - ) - ratio_tensor2 = TorchC.where( - TorchC.gt(TorchC.abs(origin_output.to(torch.float32)), abs_tol), - TorchC.div( - perturbed_output.to(torch.float32), origin_output.to(torch.float32) - ), - 1, - ) + ratio_tensor1 = TorchC.where( + TorchC.gt(TorchC.abs(perturbed_output), abs_tol), + TorchC.div( + TorchC.abs(origin_output), + TorchC.add(TorchC.abs(perturbed_output), abs_tol), + ), + 1, + ) + ratio_tensor2 = TorchC.where( + TorchC.gt(TorchC.abs(origin_output), abs_tol), + TorchC.div( + TorchC.abs(perturbed_output), + TorchC.add(TorchC.abs(origin_output), abs_tol), + ), + 1, + ) + norm1 = self.convert_overflow_ratio_to_consistent( TorchC.max(ratio_tensor1).item() ) @@ -108,8 +103,8 @@ class FuzzHandler(ABC): origin_output, perturbed_output ) except Exception as e: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.params.api_name}, " + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.params.api_name}, " f"when computing ratio," f" y1 or y2 dtype is not supported {e}" ) @@ -117,42 +112,32 @@ class FuzzHandler(ABC): if self.params.fuzz_stage == Const.BACKWARD: abs_tol = ThresholdConfig.BACKWARD_OUTPUT_LOWER_BOUND else: - abs_tol = abs_tol**0.5 + abs_tol = abs_tol ** 0.5 return self.get_ratio_from_specific_norm( origin_output, perturbed_output, norm_type, abs_tol ) - @abstractmethod - def get_threshold(self, dtype): - pass - - def _get_default_threshold(self, dtype): - if self.params.pert_mode == PerturbationMode.NO_CHANGE: - threshold = ThresholdConfig.COMP_CONSISTENT - else: - threshold = ThresholdConfig.DTYPE_PER_THD.get( - dtype, ThresholdConfig.DTYPE_PER_THD.get(torch.float32) - ) - return threshold - def npu_compare( - self, origin_output, perturbed_output + self, origin_output, perturbed_output ) -> Tuple[bool, Optional[float]]: if isinstance(perturbed_output, int): return origin_output == perturbed_output, None elif isinstance(perturbed_output, float): + if perturbed_output == 0: + origin_output += FuzzThreshold.F32_THD + perturbed_output += FuzzThreshold.F32_THD return ( math.isclose(origin_output, perturbed_output), origin_output / perturbed_output, ) elif not isinstance(perturbed_output, torch.Tensor): - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.params.api_name} " + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.params.api_name} " f"The compare for output type {type(perturbed_output)} is not supported" ) - threshold = self.get_threshold(origin_output.dtype) + threshold = self.get_threshold(Tools.get_first_tensor_dtype(origin_output)) ratio = self.ratio_calculate( origin_output, perturbed_output, 
norm_type=NormType.ENDLESS_NORM
         )
@@ -190,7 +175,7 @@ class FuzzHandler(ABC):
                     max_fuzz_ratio if ratio is None else max(max_fuzz_ratio, ratio)
                 )
                 data_params.is_consistent = (
-                    is_consistent and data_params.is_consistent
+                    is_consistent and data_params.is_consistent
                 )
                 if not is_consistent and data_params.grad_unequal_flag:
                     self.unequal_rows.append(
@@ -199,15 +184,20 @@ class FuzzHandler(ABC):
                         )
                     )
         except Exception as e:
-            print_warn_log_rank_0(
-                f"[atat] Free Benchmark: For {self.params.api_name}, "
+            logger.warning_on_rank_0(
+                f"[msprobe] Free Benchmark: For {self.params.api_name}, "
                 f"when comparing the result, exception raised: {e}"
            )
         return npu_consistent, max_fuzz_ratio

-    @abstractmethod
-    def handle(self, data_params: DataParams) -> Any:
-        pass
-
     def get_unequal_rows(self):
         return self.unequal_rows
+
+    def _get_default_threshold(self, dtype):
+        if self.params.pert_mode == PerturbationMode.NO_CHANGE:
+            threshold = ThresholdConfig.COMP_CONSISTENT
+        else:
+            threshold = ThresholdConfig.DTYPE_PER_THD.get(
+                dtype, ThresholdConfig.DTYPE_PER_THD.get(torch.float32)
+            )
+        return threshold
diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/check_handler.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/check_handler.py
similarity index 65%
rename from debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/check_handler.py
rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/check_handler.py
index 2f590855f1b96e0a6475c87c9b3dfdafd0288332..c16284eb07beda10a38755dc54349c8835ada37a 100644
--- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/check_handler.py
+++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/check_handler.py
@@ -1,16 +1,14 @@
 from typing import Any

-import torch
-from atat.pytorch.free_benchmark import print_warn_log_rank_0
-from atat.pytorch.free_benchmark.common.enums import DeviceType
-from atat.pytorch.free_benchmark.compare.single_benchmark import SingleCompare
-from atat.pytorch.free_benchmark.common.params import DataParams, make_unequal_row
-from atat.pytorch.free_benchmark.common.utils import Tools
-from atat.pytorch.free_benchmark.result_handlers.base_handler import FuzzHandler
+from msprobe.pytorch.free_benchmark import logger
+from msprobe.pytorch.free_benchmark.common.enums import DeviceType
+from msprobe.pytorch.free_benchmark.common.params import DataParams, make_unequal_row
+from msprobe.pytorch.free_benchmark.common.utils import Tools
+from msprobe.pytorch.free_benchmark.compare.single_benchmark import SingleCompare
+from msprobe.pytorch.free_benchmark.result_handlers.base_handler import FuzzHandler


 class CheckerHandler(FuzzHandler):
-    @staticmethod
     def other_compare(self, data_params: DataParams) -> bool:
         is_consistent = SingleCompare().compare_seq(
             data_params.original_result, data_params.perturbed_result
@@ -34,8 +32,8 @@ class CheckerHandler(FuzzHandler):
             else:
                 self.other_compare(data_params)
         except Exception as e:
-            print_warn_log_rank_0(
-                f"[atat] Free Benchmark: For {self.params.api_name}, "
+            logger.warning_on_rank_0(
+                f"[msprobe] Free Benchmark: For {self.params.api_name}, "
                 f"when comparing the result, exception raised: {e}"
             )
         return data_params.original_result
diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/fix_handler.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/fix_handler.py
similarity index 56%
rename from 
debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/fix_handler.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/fix_handler.py index 789e2653aa0eafc3619fbe3bd192b49dee643a1d..a1d90035e847abb26c0838635666ff0425853513 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/fix_handler.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/fix_handler.py @@ -1,9 +1,9 @@ from typing import Any -from atat.pytorch.free_benchmark.common.params import DataParams -from atat.pytorch.free_benchmark.common.utils import Tools -from atat.pytorch.free_benchmark.result_handlers.base_handler import FuzzHandler -from atat.pytorch.free_benchmark import print_warn_log_rank_0 +from msprobe.pytorch.free_benchmark.common.params import DataParams +from msprobe.pytorch.free_benchmark.common.utils import Tools +from msprobe.pytorch.free_benchmark.result_handlers.base_handler import FuzzHandler +from msprobe.pytorch.free_benchmark import logger class FixHandler(FuzzHandler): @@ -17,8 +17,8 @@ class FixHandler(FuzzHandler): data_params.original_result, data_params.perturbed_result ) except Exception as e: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.params.api_name} " + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.params.api_name} " f"Fix output failed. " ) return data_params.original_result \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/handler_factory.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/handler_factory.py similarity index 58% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/handler_factory.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/handler_factory.py index 50f791d81eeb25f8a50a6b4044dbc8e6e09e6a1e..5ee968c6a86728786f526660594fbb6de4ce18ee 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/handler_factory.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/handler_factory.py @@ -1,11 +1,10 @@ -from atat.pytorch.free_benchmark import FreeBenchmarkException -from atat.pytorch.free_benchmark.common.constant import PreheatConfig -from atat.pytorch.free_benchmark.common.utils import Tools -from atat.pytorch.free_benchmark.common.enums import HandlerType -from atat.pytorch.free_benchmark.common.params import HandlerParams -from atat.pytorch.free_benchmark.result_handlers.check_handler import CheckerHandler -from atat.pytorch.free_benchmark.result_handlers.preheat_handler import PreheatHandler -from atat.pytorch.free_benchmark.result_handlers.fix_handler import FixHandler +from msprobe.pytorch.free_benchmark import FreeBenchmarkException +from msprobe.pytorch.free_benchmark.common.constant import PreheatConfig +from msprobe.pytorch.free_benchmark.common.enums import HandlerType +from msprobe.pytorch.free_benchmark.common.params import HandlerParams +from msprobe.pytorch.free_benchmark.result_handlers.check_handler import CheckerHandler +from msprobe.pytorch.free_benchmark.result_handlers.preheat_handler import PreheatHandler +from msprobe.pytorch.free_benchmark.result_handlers.fix_handler import FixHandler class FuzzHandlerFactory: diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/preheat_handler.py b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py similarity index 85% rename from 
debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/preheat_handler.py rename to debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py index b8ff3bccf00c2dbe699159b4f77da86c75ae4062..d78e4303620f3ca73522cc9188452ffce0de2b12 100644 --- a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/preheat_handler.py +++ b/debug/accuracy_tools/msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py @@ -1,16 +1,14 @@ +import math from typing import Any -import torch -import math -from atat.pytorch.free_benchmark import print_info_log_rank_0, print_warn_log_rank_0 -from atat.pytorch.free_benchmark.common.constant import ThresholdConfig -from atat.pytorch.free_benchmark.common.enums import DeviceType -from atat.pytorch.free_benchmark.common.params import DataParams, make_unequal_row -from atat.pytorch.free_benchmark.common.utils import Tools -from atat.pytorch.free_benchmark.compare.single_benchmark import SingleCompare -from atat.pytorch.free_benchmark.common.counter import preheat_counter -from atat.pytorch.free_benchmark.result_handlers.base_handler import FuzzHandler -from atat.pytorch.free_benchmark.common.params import HandlerParams +from msprobe.pytorch.free_benchmark import logger +from msprobe.pytorch.free_benchmark.common.constant import ThresholdConfig +from msprobe.pytorch.free_benchmark.common.counter import preheat_counter +from msprobe.pytorch.free_benchmark.common.enums import DeviceType +from msprobe.pytorch.free_benchmark.common.params import DataParams, HandlerParams +from msprobe.pytorch.free_benchmark.common.utils import Tools +from msprobe.pytorch.free_benchmark.compare.single_benchmark import SingleCompare +from msprobe.pytorch.free_benchmark.result_handlers.base_handler import FuzzHandler class PreheatHandler(FuzzHandler): @@ -22,14 +20,83 @@ class PreheatHandler(FuzzHandler): def get_threshold(self, dtype): return preheat_counter.get_api_thd(self.pure_name, dtype) + def compare_npu_and_cpu(self, data_params: DataParams): + args = Tools.convert_device_and_dtype( + data_params.args, DeviceType.CPU, change_dtype=True + ) + kwargs = Tools.convert_device_and_dtype( + data_params.kwargs, DeviceType.CPU, change_dtype=True + ) + cpu_result = data_params.origin_func(*args, **kwargs) + return SingleCompare().compare_seq(data_params.original_result, cpu_result) + + def preheat(self, max_fuzz_ratio, cpu_consistent, first_dtype): + # 存储当前step所有输出比值和对应npu\cpu比对结果 + preheat_counter.update_preheat_record( + self.pure_name, + first_dtype, + (max_fuzz_ratio, cpu_consistent), + ) + if self._need_adjust_threshold(): + self._adjust_threshold() + + def handle(self, data_params: DataParams) -> Any: + + if isinstance(data_params.perturbed_result, bool) or not Tools.is_float_tensor( + data_params.perturbed_result + ): + return data_params.original_result + + if self.params.step == 0: + preheat_counter.add_one_step_used_api(self.pure_name) + return data_params.original_result + + # 如果当前api,step需要预热 + npu_consistent, max_fuzz_ratio = self.cmp_output_npu(data_params) + data_params.is_consistent = npu_consistent + + preheat_counter.check_step(self.params.step) + + if self.params.preheat_config.get("preheat_step") <= self.params.step: + return data_params.original_result + + if not data_params.grad_unequal_flag: + data_params.grad_unequal_flag = True + data_params.is_consistent = False + return data_params.original_result + preheat_counter.add_api_called_time(self.pure_name) + + if not self._is_take_a_sample(): + return 
data_params.original_result + + cpu_consistent = True + try: + cpu_consistent = self.compare_npu_and_cpu(data_params) + except Exception as e: + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.params.api_name}, " + f"when campare to cpu exception raise {e}" + ) + try: + first_dtype = Tools.get_first_tensor_dtype(data_params.original_result) + except RuntimeError: + logger.warning_on_rank_0( + f"[msprobe] Free Benchmark: For {self.params.api_name}, " + f"the output sequence does not contain tensors." + ) + if preheat_counter.get_api_preheat(self.pure_name, str(first_dtype)): + self.preheat(max_fuzz_ratio, cpu_consistent, first_dtype) + + return data_params.original_result + def _is_take_a_sample(self) -> bool: need_sample_set = self._get_need_sample_set() curr_called_seq = preheat_counter.get_api_called_time(self.pure_name) res = curr_called_seq in need_sample_set if res: total_count = preheat_counter.get_one_step_used_api(self.pure_name) - print_info_log_rank_0( - f"[atat] Free benchmark: preheat sample in step{self.params.step}" + logger.info_on_rank_0( + f"[msprobe] Free benchmark: preheat sample in step{self.params.step}" f"api_name {self.params.api_name}, " f"curr_called_seq: {curr_called_seq}/{total_count}" ) @@ -61,17 +128,6 @@ class PreheatHandler(FuzzHandler): need_sample_set.add(count) return need_sample_set - - def compare_npu_and_cpu(self, data_params: DataParams): - args = Tools.convert_device_and_dtype( - data_params.args, DeviceType.CPU, change_dtype=True - ) - kwargs = Tools.convert_device_and_dtype( - data_params.kwargs, DeviceType.CPU, change_dtype=True - ) - cpu_result = data_params.origin_func(*args, **kwargs) - return SingleCompare().compare_seq(data_params.original_result, cpu_result) - def _need_adjust_threshold(self) -> bool: sample_count_per_step = self._get_sample_count_per_step() sampled_time = preheat_counter.get_api_sample_time(self.pure_name) @@ -112,63 +168,3 @@ class PreheatHandler(FuzzHandler): preheat_counter.update_api_thd( self.pure_name, dtype_str, new_thd, threshold ) - - def preheat(self, max_fuzz_ratio, cpu_consistent, first_dtype): - # 存储当前step所有输出比值和对应npu\cpu比对结果 - preheat_counter.update_preheat_record( - self.pure_name, - first_dtype, - (max_fuzz_ratio, cpu_consistent), - ) - if self._need_adjust_threshold(): - self._adjust_threshold() - - def handle(self, data_params: DataParams) -> Any: - - if isinstance(data_params.perturbed_result, bool) or not Tools.is_float_tensor( - data_params.perturbed_result - ): - return data_params.original_result - - if self.params.step == 0: - preheat_counter.add_one_step_used_api(self.pure_name) - return data_params.original_result - - # 如果当前api,step需要预热 - npu_consistent, max_fuzz_ratio = self.cmp_output_npu(data_params) - data_params.is_consistent = npu_consistent - - preheat_counter.check_step(self.params.step) - - if self.params.preheat_config.get("preheat_step") <= self.params.step: - return data_params.original_result - - if not data_params.grad_unequal_flag: - data_params.grad_unequal_flag = True - data_params.is_consistent = False - return data_params.original_result - preheat_counter.add_api_called_time(self.pure_name) - - - if not self._is_take_a_sample(): - return data_params.original_result - - cpu_consistent = True - try: - cpu_consistent = self.compare_npu_and_cpu(data_params) - except Exception as e: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.params.api_name}, " - f"when campare to cpu exception raise {e}" - ) - try: - first_dtype = 
Tools.get_first_tensor_dtype(data_params.perturbed_result) - except RuntimeError: - print_warn_log_rank_0( - f"[atat] Free Benchmark: For {self.params.api_name}, " - f"the output sequence does not contain tensors." - ) - if preheat_counter.get_api_preheat(self.pure_name, str(first_dtype)): - self.preheat(max_fuzz_ratio, cpu_consistent, first_dtype) - - return data_params.original_result diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/__init__.py b/debug/accuracy_tools/msprobe/pytorch/functional/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/functional/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/.keep b/debug/accuracy_tools/msprobe/pytorch/functional/data_processor.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/.keep rename to debug/accuracy_tools/msprobe/pytorch/functional/data_processor.py diff --git a/debug/accuracy_tools/atat/pytorch/functional/dump_module.py b/debug/accuracy_tools/msprobe/pytorch/functional/dump_module.py similarity index 64% rename from debug/accuracy_tools/atat/pytorch/functional/dump_module.py rename to debug/accuracy_tools/msprobe/pytorch/functional/dump_module.py index fed73ad5374178fac01180bb905468b9e7c747fa..efb95c3369f6cda2f883d70a86261e0232535f86 100644 --- a/debug/accuracy_tools/atat/pytorch/functional/dump_module.py +++ b/debug/accuracy_tools/msprobe/pytorch/functional/dump_module.py @@ -1,20 +1,21 @@ import torch.nn as nn -from atat.core.utils import print_error_log, DumpException -from .scope import BaseScope -from ..common.utils import Const -from ..hook_module.api_registry import api_register -from ..debugger.precision_debugger import PrecisionDebugger +from msprobe.pytorch.common.log import logger +from msprobe.core.common.const import Const +from msprobe.pytorch.hook_module.api_registry import api_register +from msprobe.pytorch.debugger.precision_debugger import PrecisionDebugger +from msprobe.core.common.exceptions import MsprobeException +from msprobe.core.data_dump.scope import BaseScope module_count = {} def module_dump(module, dump_name): if not isinstance(module, nn.Module): - print_error_log("The parameter:module in module_dump is not a Module subclass.") - raise DumpException(DumpException.INVALID_PARAM_ERROR) + logger.error("The parameter:module in module_dump is not a Module subclass.") + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR) if not isinstance(dump_name, str): - print_error_log("The parameter:dump_name in module_dump is not a str type.") - raise DumpException(DumpException.INVALID_PARAM_ERROR) + logger.error("The parameter:dump_name in module_dump is not a str type.") + raise MsprobeException(MsprobeException.INVALID_PARAM_ERROR) api_register.api_originality() if dump_name not in module_count: module_count[dump_name] = 0 diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/__init__.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/hook_module/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/api_registry.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/api_registry.py similarity index 85% rename from debug/accuracy_tools/atat/pytorch/hook_module/api_registry.py 
rename to debug/accuracy_tools/msprobe/pytorch/hook_module/api_registry.py index 003a8699cd750a424bf989ae9d1b3fac78f76650..f75201eafcda40c61b2c5c3da710d6cfb06719b8 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/api_registry.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/api_registry.py @@ -17,14 +17,17 @@ import torch import torch.distributed as dist -from . import wrap_torch, wrap_functional, wrap_tensor, wrap_vf, wrap_distributed, wrap_aten -from .wrap_torch import get_torch_ops -from .wrap_functional import get_functional_ops -from .wrap_tensor import get_tensor_ops -from .wrap_vf import get_vf_ops -from .wrap_distributed import get_distributed_ops -from .wrap_aten import get_aten_ops -from ..common.utils import torch_without_guard_version, npu_distributed_api, is_gpu + +from msprobe.pytorch.hook_module import wrap_torch, wrap_functional, wrap_tensor, wrap_vf, wrap_distributed, wrap_aten +from msprobe.pytorch.hook_module.wrap_aten import get_aten_ops +from msprobe.pytorch.hook_module.wrap_distributed import get_distributed_ops +from msprobe.pytorch.hook_module.wrap_functional import get_functional_ops +from msprobe.pytorch.hook_module.wrap_tensor import get_tensor_ops +from msprobe.pytorch.hook_module.wrap_torch import get_torch_ops +from msprobe.pytorch.hook_module.wrap_vf import get_vf_ops +from msprobe.pytorch.common.utils import torch_without_guard_version, npu_distributed_api, is_gpu +from msprobe.core.common.const import Const + torch_version_above_2 = torch.__version__.split('+')[0] > '2.0' if not is_gpu: @@ -108,19 +111,19 @@ class ApiRegistry: self.store_ori_attr(torch.Tensor, get_tensor_ops(), self.tensor_ori_attr) wrap_tensor.wrap_tensor_ops_and_bind(hook) for attr_name in dir(wrap_tensor.HOOKTensor): - if attr_name.startswith("wrap_"): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): self.tensor_hook_attr[attr_name[5:]] = getattr(wrap_tensor.HOOKTensor, attr_name) self.store_ori_attr(torch, get_torch_ops(), self.torch_ori_attr) wrap_torch.wrap_torch_ops_and_bind(hook) for attr_name in dir(wrap_torch.HOOKTorchOP): - if attr_name.startswith("wrap_"): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): self.torch_hook_attr[attr_name[5:]] = getattr(wrap_torch.HOOKTorchOP, attr_name) self.store_ori_attr(torch.nn.functional, get_functional_ops(), self.functional_ori_attr) wrap_functional.wrap_functional_ops_and_bind(hook) for attr_name in dir(wrap_functional.HOOKFunctionalOP): - if attr_name.startswith("wrap_"): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): self.functional_hook_attr[attr_name[5:]] = getattr(wrap_functional.HOOKFunctionalOP, attr_name) self.store_ori_attr(dist, get_distributed_ops(), self.distributed_ori_attr) @@ -128,9 +131,9 @@ class ApiRegistry: if not is_gpu and not torch_without_guard_version: self.store_ori_attr(torch_npu.distributed, npu_distributed_api, self.npu_distributed_ori_attr) for attr_name in dir(wrap_distributed.HOOKDistributedOP): - if attr_name.startswith("wrap_"): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): self.distributed_hook_attr[attr_name[5:]] = getattr(wrap_distributed.HOOKDistributedOP, attr_name) - if not is_gpu and not torch_without_guard_version and attr_name[5:] in npu_distributed_api: + if not is_gpu and not torch_without_guard_version and attr_name[5:] in npu_distributed_api: self.npu_distributed_hook_attr[attr_name[5:]] = getattr(wrap_distributed.HOOKDistributedOP, attr_name) @@ -138,20 +141,20 @@ class ApiRegistry: self.store_ori_attr(torch.ops.aten, get_aten_ops(), self.aten_ori_attr) 
wrap_aten.wrap_aten_ops_and_bind(hook) for attr_name in dir(wrap_aten.HOOKAtenOP): - if attr_name.startswith("wrap_"): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): self.aten_hook_attr[attr_name[5:]] = getattr(wrap_aten.HOOKAtenOP, attr_name) self.store_ori_attr(torch._VF, get_vf_ops(), self.vf_ori_attr) wrap_vf.wrap_vf_ops_and_bind(hook) for attr_name in dir(wrap_vf.HOOKVfOP): - if attr_name.startswith("wrap_"): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): self.vf_hook_attr[attr_name[5:]] = getattr(wrap_vf.HOOKVfOP, attr_name) if not is_gpu: self.store_ori_attr(torch_npu, get_npu_ops(), self.torch_npu_ori_attr) wrap_npu_custom.wrap_npu_ops_and_bind(hook) for attr_name in dir(wrap_npu_custom.HOOKNpuOP): - if attr_name.startswith("wrap_"): + if attr_name.startswith(Const.ATTR_NAME_PREFIX): self.torch_npu_hook_attr[attr_name[5:]] = getattr(wrap_npu_custom.HOOKNpuOP, attr_name) diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/hook_module.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/hook_module.py similarity index 99% rename from debug/accuracy_tools/atat/pytorch/hook_module/hook_module.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/hook_module.py index ae4a7abdab12e46fcf25e7594b9016ca347599bc..6693a09d02866a30ce2c806117e3ab6157c100f3 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/hook_module.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/hook_module.py @@ -20,7 +20,8 @@ import threading import torch import torch.nn as nn import torch.utils.hooks as full_hooks -from ..common.utils import Const +from msprobe.core.common.const import Const + class HOOKModule(nn.Module): module_count = {} diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/support_wrap_ops.yaml b/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml similarity index 100% rename from debug/accuracy_tools/atat/pytorch/hook_module/support_wrap_ops.yaml rename to debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/utils.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/utils.py similarity index 85% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/utils.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/utils.py index 6641807f929babeed3af30cf14b043d1e4f7913c..c1e581675fa0995549fcfcd5521cf9759180c3d9 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/utils.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/utils.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at @@ -18,7 +18,7 @@ import os import yaml -from ...common.file_check import FileOpen +from msprobe.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") @@ -26,4 +26,4 @@ with FileOpen(yaml_path, 'r') as f: Ops = yaml.safe_load(f) WrapFunctionalOps = Ops.get('functional') WrapTensorOps = Ops.get('tensor') - WrapTorchOps = Ops.get('torch') \ No newline at end of file + WrapTorchOps = Ops.get('torch') diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_aten.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_aten.py similarity index 92% rename from debug/accuracy_tools/atat/pytorch/hook_module/wrap_aten.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_aten.py index 8666287095bbe12f7e9d5f314cff1db75d74a108..4617e4854fcbb3b7ac60536886b74387cb01d99b 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_aten.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_aten.py @@ -20,9 +20,10 @@ import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.file_check import FileOpen +from msprobe.pytorch.hook_module.hook_module import HOOKModule +from msprobe.pytorch.common.utils import torch_device_guard +from msprobe.core.common.const import Const +from msprobe.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) @@ -79,13 +80,13 @@ class AtenOPPacketTemplate(): else: return attr - def overloads(self): - return self.opPacket.overloads() - @torch_device_guard def __call__(self, *args, **kwargs): return AtenOPTemplate(self.opPacket, self.hook)(*args, **kwargs) + def overloads(self): + return self.opPacket.overloads() + def wrap_aten_op(op, hook): return AtenOPPacketTemplate(op, hook) diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_distributed.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py similarity index 91% rename from debug/accuracy_tools/atat/pytorch/hook_module/wrap_distributed.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py index 68ce83c16b8414f43e61b1a667f8cb7c27899a10..6cf425441cc381652ddca4b203ac7a2b4161a116 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_distributed.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_distributed.py @@ -20,9 +20,10 @@ from functools import wraps import torch.distributed as dist import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.file_check import FileOpen +from msprobe.pytorch.hook_module.hook_module import HOOKModule +from msprobe.pytorch.common.utils import torch_device_guard +from msprobe.core.common.const import Const +from msprobe.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_functional.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_functional.py similarity index 90% rename from debug/accuracy_tools/atat/pytorch/hook_module/wrap_functional.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_functional.py index 46f25efe664fca2bff917b93e3e0632398bdc74e..fd7610ca8fc8089f427a91bed174055882e7207f 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_functional.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_functional.py @@ -20,15 
+20,16 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.log import print_info_log_rank_0 -from ..common.file_check import FileOpen +from msprobe.pytorch.hook_module.hook_module import HOOKModule +from msprobe.pytorch.common.utils import torch_device_guard +from msprobe.core.common.const import Const +from msprobe.pytorch.common.log import logger +from msprobe.core.common.file_check import FileOpen def remove_dropout(): if torch.__version__ > "1.8": - print_info_log_rank_0("For precision comparison, the probability p in the dropout method is set to 0.") + logger.info_on_rank_0("For precision comparison, the probability p in the dropout method is set to 0.") import torch.nn.functional as F from torch import _VF from torch.overrides import has_torch_function_unary, handle_torch_function @@ -83,10 +84,11 @@ class HOOKFunctionalOP(object): class FunctionalOPTemplate(HOOKModule): - def __init__(self, op_name, hook): + def __init__(self, op_name, hook, need_hook=True): self.op_name_ = op_name self.prefix_op_name_ = "Functional" + Const.SEP + str(op_name) + Const.SEP - super().__init__(hook) + if need_hook: + super().__init__(hook) @torch_device_guard def forward(self, *args, **kwargs): diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_npu_custom.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_npu_custom.py similarity index 89% rename from debug/accuracy_tools/atat/pytorch/hook_module/wrap_npu_custom.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_npu_custom.py index e910e609c8379e0c66239755c3ec2a44953ef1ec..992713bce57b0f73b5d7144f4fb3a04726f70468 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_npu_custom.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_npu_custom.py @@ -20,9 +20,10 @@ import torch import torch_npu import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, torch_without_guard_version, Const -from ..common.file_check import FileOpen +from msprobe.pytorch.hook_module.hook_module import HOOKModule +from msprobe.pytorch.common.utils import torch_device_guard, torch_without_guard_version +from msprobe.core.common.const import Const +from msprobe.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_tensor.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py similarity index 83% rename from debug/accuracy_tools/atat/pytorch/hook_module/wrap_tensor.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py index 6b49826ab4712d440b4933651eb6b7eab950d023..3e26ae3beda5341df76eb1f3fdea68e43193f983 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_tensor.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_tensor.py @@ -20,9 +20,10 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, parameter_adapter, Const -from ..common.file_check import FileOpen +from msprobe.pytorch.hook_module.hook_module import HOOKModule +from msprobe.pytorch.common.utils import torch_device_guard, parameter_adapter +from msprobe.core.common.const import Const +from msprobe.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") @@ 
-45,10 +46,11 @@ class HOOKTensor(object): class TensorOPTemplate(HOOKModule): - def __init__(self, op_name, hook): + def __init__(self, op_name, hook, need_hook=True): self.op_name_ = op_name self.prefix_op_name_ = "Tensor" + Const.SEP + str(op_name) + Const.SEP - super().__init__(hook) + if need_hook: + super().__init__(hook) @torch_device_guard @parameter_adapter diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_torch.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_torch.py similarity index 87% rename from debug/accuracy_tools/atat/pytorch/hook_module/wrap_torch.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_torch.py index 889512e9c0c64d9d05dc19cbc30e542c6e5b577c..486ddda4919b1abefa35aec8ed21659c06c4588d 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_torch.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_torch.py @@ -20,9 +20,10 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.file_check import FileOpen +from msprobe.pytorch.hook_module.hook_module import HOOKModule +from msprobe.pytorch.common.utils import torch_device_guard +from msprobe.core.common.const import Const +from msprobe.core.common.file_check import FileOpen cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") @@ -62,10 +63,11 @@ class HOOKTorchOP(object): class TorchOPTemplate(HOOKModule): - def __init__(self, op_name, hook): + def __init__(self, op_name, hook, need_hook=True): self.op_name_ = op_name self.prefix_op_name_ = "Torch" + Const.SEP + str(op_name) + Const.SEP - super().__init__(hook) + if need_hook: + super().__init__(hook) @torch_device_guard def forward(self, *args, **kwargs): diff --git a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_vf.py b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_vf.py similarity index 87% rename from debug/accuracy_tools/atat/pytorch/hook_module/wrap_vf.py rename to debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_vf.py index 08d47308e077981e65193eea71874d4f9432c6c0..d78beb2a6ad790ab3ad897bf819e74a234520e8c 100644 --- a/debug/accuracy_tools/atat/pytorch/hook_module/wrap_vf.py +++ b/debug/accuracy_tools/msprobe/pytorch/hook_module/wrap_vf.py @@ -20,9 +20,10 @@ import os import torch import yaml -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard, Const -from ..common.file_check import FileOpen +from msprobe.pytorch.hook_module.hook_module import HOOKModule +from msprobe.core.common.file_check import FileOpen +from msprobe.pytorch.common.utils import torch_device_guard +from msprobe.core.common.const import Const cur_path = os.path.dirname(os.path.realpath(__file__)) yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml") @@ -32,8 +33,6 @@ with FileOpen(yaml_path, 'r') as f: def get_vf_ops(): global WrapVfOps - # _all_functional_ops = dir(torch.nn.functional) - # assert set(WrapFunctionalOps) <= set(_all_functional_ops) return WrapVfOps diff --git a/debug/accuracy_tools/atat/pytorch/module_processer.py b/debug/accuracy_tools/msprobe/pytorch/module_processer.py similarity index 97% rename from debug/accuracy_tools/atat/pytorch/module_processer.py rename to debug/accuracy_tools/msprobe/pytorch/module_processer.py index fda3d37bc92360fc104d761e78b13cfc793995bc..422d36d6ac79b983262624b5bbbcbbd3cc52839c 100644 --- a/debug/accuracy_tools/atat/pytorch/module_processer.py +++ 
b/debug/accuracy_tools/msprobe/pytorch/module_processer.py @@ -1,8 +1,8 @@ from functools import wraps import torch from torch.utils.hooks import BackwardHook -from .functional.scope import ModuleRangeScope -from .common.utils import Const +from msprobe.core.common.const import Const +from msprobe.core.data_dump.scope import ModuleRangeScope class ModuleProcesser: diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/__init__.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..54c15d2dfc6d2ab80ef082d2d0c653c5e2625f59 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2024-2024 Huawei Technologies Co., Ltd. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from signal import signal, SIGPIPE, SIG_DFL +from .dispatch import PtdbgDispatch +signal(SIGPIPE, SIG_DFL) + + +__all__ = ["PtdbgDispatch"] diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/compare.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/compare.py new file mode 100644 index 0000000000000000000000000000000000000000..048ab3f901c49870c706958da5cdd5d549c475cf --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/compare.py @@ -0,0 +1,236 @@ +# 进行比对及结果展示 +import os +import sys +import csv +import json +from collections import namedtuple +from rich.table import Table +from rich.console import Console +from .single_compare import single_benchmark_compare_wrap +from .utils import DispatchException +from msprobe.core.common.const import CompareConst +from msprobe.core.common.file_check import FileOpen +from msprobe.pytorch.common.log import logger +from msprobe.core.common.utils import CompareException + +ELEMENT_NUM_THRESHOLD = 100 +ZERO_NUM_THRESHOLD = 0.1 +FLOAT_PRECISION = 14 + +ResultInfo = namedtuple('ResultInfo', ['api_name', 'is_fwd_success', 'is_bwd_success', + 'fwd_compare_alg_results', 'bwd_compare_alg_results']) + +def get_file_content_bytes(file): + with FileOpen(file, 'rb') as file_handle: + return file_handle.read() + + +def get_json_contents(file_path): + ops = get_file_content_bytes(file_path) + try: + json_obj = json.loads(ops) + except ValueError as error: + logger.error('Failed to load "%s". %s' % (file_path, str(error))) + raise CompareException(CompareException.INVALID_FILE_ERROR) from error + if not isinstance(json_obj, dict): + logger.error('Json file %s, content is not a dictionary!' 
% file_path) + raise CompareException(CompareException.INVALID_FILE_ERROR) + return json_obj + + +def write_csv(data, filepath): + with FileOpen(filepath, 'a', encoding='utf-8-sig') as f: + writer = csv.writer(f) + writer.writerows(data) + + +class Saver: + # consts for result csv + COLUMN_API_NAME = "API name" + COLUMN_FORWARD_SUCCESS = "Forward Test Success" + COLUMN_BACKWARD_SUCCESS = "Backward Test Success" + COLUMN_STACK_INFO = "Traceback callstack info" + + def __init__(self, save_path, detail_save_path, stack_info): + self.save_path = save_path + self.detail_save_path = detail_save_path + self.stack_info = stack_info + + self.test_result_cnt = { + "forward_fail_num": 0, "backward_fail_num": 0, "forward_and_backward_fail_num": 0, "success_num": 0, + "total_num": 0, "forward_or_backward_fail_num": 0 + } + + def write_csv_title(self): + summary_test_rows = [[self.COLUMN_API_NAME, self.COLUMN_FORWARD_SUCCESS, self.COLUMN_BACKWARD_SUCCESS, "Message"]] + write_csv(summary_test_rows, self.save_path) + + detail_test_rows = [[ + "Npu Name", "Bench Dtype", "NPU Dtype", "Shape", + "error_balance", "max_abs_diff", "max_abs_idx", + "max_rel_diff", "max_rel_idx", "eb_thd", + "error_thd", "Status","Message" + ]] + write_csv(detail_test_rows, self.detail_save_path) + + def print_pretest_result(self): + self.get_statistics_from_result_csv() + if self.test_result_cnt.get("total_num") != 0: + passing_rate = str(self.test_result_cnt.get("success_num") / + (self.test_result_cnt.get("total_num") + sys.float_info.epsilon)) + else: + passing_rate = "0" + + console = Console() + table_total = Table( + show_header=True, title="Overall Statistics", show_lines=True, width=75 + ) + table_total.add_column("Result") + table_total.add_column("Statistics") + table_total.add_row("[green]Pass[/green]", str(self.test_result_cnt.get("success_num"))) + table_total.add_row("[red]Fail[/red]", str(self.test_result_cnt.get("forward_and_backward_fail_num") + + self.test_result_cnt.get("forward_or_backward_fail_num"))) + table_total.add_row("Passing Rate", passing_rate) + + table_detail = Table( + show_header=True, title="Detail Statistics", show_lines=True, width=75 + ) + table_detail.add_column("Result") + table_detail.add_column("Statistics") + table_detail.add_row("Only Forward Fail", str(self.test_result_cnt.get("forward_fail_num"))) + table_detail.add_row("Only Backward Fail", str(self.test_result_cnt.get("backward_fail_num"))) + table_detail.add_row( + "Both Forward & Backward Fail", str(self.test_result_cnt.get("forward_and_backward_fail_num"))) + + console.print(table_total) + console.print(table_detail) + + def get_statistics_from_result_csv(self): + checklist = [CompareConst.TRUE, CompareConst.FALSE, CompareConst.NA, CompareConst.SKIP] + with FileOpen(self.save_path, 'r') as file: + reader = csv.reader(file) + result_csv_rows = [row for row in reader] + result_csv_name = os.path.basename(self.save_path) + for item in result_csv_rows[1:]: + if not isinstance(item, list) or len(item) < 3: + raise ValueError("The number of columns in %s is incorrect" % result_csv_name) + if not all(item[i] and item[i].upper() in checklist for i in (1, 2)): + raise ValueError( + "The value in the 2nd or 3rd column of %s is wrong, it must be TRUE, FALSE, SKIP or N/A" + % result_csv_name) + column1 = item[1].upper() + column2 = item[2].upper() + if column1 == CompareConst.SKIP: + continue + self.test_result_cnt["total_num"] += 1 + if column1 == CompareConst.TRUE and column2 in [CompareConst.TRUE, 'N/A']: + 
self.test_result_cnt['success_num'] += 1 + elif column1 == CompareConst.FALSE and column2 == CompareConst.FALSE: + self.test_result_cnt['forward_and_backward_fail_num'] += 1 + elif column1 == CompareConst.FALSE: + self.test_result_cnt['forward_fail_num'] += 1 + self.test_result_cnt['forward_or_backward_fail_num'] += 1 + else: + self.test_result_cnt['backward_fail_num'] += 1 + self.test_result_cnt['forward_or_backward_fail_num'] += 1 + + def write_summary_csv(self, test_result): + test_rows = [] + if self.stack_info: + test_rows[0].append(self.COLUMN_STACK_INFO) + + name = test_result.api_name + df_row = [test_result.api_name, test_result.is_fwd_success, test_result.is_bwd_success] + if test_result.is_fwd_success == "SKIP" or test_result.is_bwd_success == "SKIP": + df_row.append(test_result.fwd_compare_alg_results) + if self.stack_info: + stack_info = "\n".join(self.stack_info[name]) + df_row.append(stack_info) + test_rows.append(df_row) + write_csv(test_rows, self.save_path) + + def write_detail_csv(self, test_result): + def get_rows_from_list(result, name, sub_prefix): + rows = [] + if isinstance(result, list): + for i, test_subject in enumerate(result): + subject = sub_prefix + "." + name + ".output." + str(i) + test_subject = ["{:.{}f}".format(item, FLOAT_PRECISION) if isinstance(item, float) else item for + item in test_subject] + rows.append([subject] + list(test_subject)) + return rows + + test_rows = [] + subject_prefix = test_result.api_name + fwd_result = test_result.fwd_compare_alg_results + bwd_result = test_result.bwd_compare_alg_results + + test_rows.extend(get_rows_from_list(fwd_result, "forward", subject_prefix)) + test_rows.extend(get_rows_from_list(bwd_result, "backward", subject_prefix)) + + write_csv(test_rows, self.detail_save_path) + + def record_results(self, result_info): + self.write_summary_csv(result_info) + self.write_detail_csv(result_info) + + +class Comparator: + + def __init__(self, result_csv_path, details_csv_path, is_continue_run_ut, stack_info_json_path=None): + self.save_path = result_csv_path + self.detail_save_path = details_csv_path + if stack_info_json_path: + self.stack_info = get_json_contents(stack_info_json_path) + else: + self.stack_info = None + self.saver = Saver(result_csv_path, details_csv_path, self.stack_info) + + if is_continue_run_ut and not os.path.exists(self.save_path) and not os.path.exists(self.detail_save_path): + self.saver.write_csv_title() + + @staticmethod + def _compare_core_wrapper(bench_out, npu_out): + detailed_result_total = [] + test_final_success = True + status, details = single_benchmark_compare_wrap(npu_out, bench_out) + if not isinstance(status, list): + detailed_result_total.append(details) + test_final_success = status + else: + for item, item_status in enumerate(status): + detailed_result_total.append(details.get(item, 'key does not exist')) + if not item_status: + test_final_success = False + return test_final_success, detailed_result_total + + @staticmethod + def _compare_dropout(bench_out, npu_out): + tensor_num = bench_out.numel() + if tensor_num >= ELEMENT_NUM_THRESHOLD: + if abs((bench_out == 0).sum() - (npu_out == 0).cpu().sum()) / tensor_num < ZERO_NUM_THRESHOLD: + return True, 1 + else: + return False, 0 + else: + return True, 1 + + def compare_output(self, api_name, bench_out, npu_out, bench_grad=None, npu_grad=None): + if "dropout" in api_name: + is_fwd_success, fwd_compare_alg_results = self._compare_dropout(bench_out, npu_out) + else: + is_fwd_success, fwd_compare_alg_results = 
self._compare_core_wrapper(bench_out, npu_out) + if bench_grad and npu_grad: + if "dropout" in api_name: + is_bwd_success, bwd_compare_alg_results = self._compare_dropout(bench_grad[0], npu_grad[0]) + else: + is_bwd_success, bwd_compare_alg_results = self._compare_core_wrapper(bench_grad, npu_grad) + else: + is_bwd_success, bwd_compare_alg_results = True, None + if is_bwd_success and bwd_compare_alg_results is None: + self.saver.record_results(ResultInfo(api_name, is_fwd_success, CompareConst.NAN, fwd_compare_alg_results, + bwd_compare_alg_results)) + else: + self.saver.record_results(ResultInfo(api_name, is_fwd_success, is_bwd_success, fwd_compare_alg_results, + bwd_compare_alg_results)) + return is_fwd_success, is_bwd_success diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dispatch.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dispatch.py new file mode 100644 index 0000000000000000000000000000000000000000..898df30b99d0fa5ebffb46e05ff7247d19d1f859 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dispatch.py @@ -0,0 +1,274 @@ +import os +import time +import json +from pathlib import Path +from multiprocessing import Manager, Pool + +import yaml +import torch + +from torch.utils._python_dispatch import TorchDispatchMode + +try: + import torch_npu +except ImportError: + is_npu = False +else: + is_npu = True + +from .dump_compare import dispatch_workflow, dispatch_multiprocess, error_call, TimeStatistics, \ + DispatchRunParam, DisPatchDataInfo +from .utils import get_callstack, data_to_cpu, logger_debug, logger_error, logger_warn, logger_logo, get_sys_info, \ + DispatchException +from .compare import Comparator +from msprobe.core.common.file_check import FileOpen +from msprobe.core.common.utils import check_file_or_directory_path, check_path_before_create +from msprobe.core.common.const import Const, CompareConst + +current_time = time.strftime("%Y%m%d%H%M%S") +RESULT_FILE_NAME = "accuracy_checking_result_" + current_time + ".csv" +DETAILS_FILE_NAME = "accuracy_checking_details_" + current_time + ".csv" + + +class PtdbgDispatch(TorchDispatchMode): + def __init__(self, dump_mode=Const.OFF, api_list=None, debug=False, dump_path=None, tag=None, process_num=0): + super(PtdbgDispatch, self).__init__() + logger_logo() + if not is_npu: + logger_error("Please confirm you run environment installed torch_npu!") + return + if dump_path is None: + logger_error("Please set dump_path when dump_mode is config!") + check_file_or_directory_path(dump_path, True) + + self.device_id = torch_npu._C._npu_getDevice() + self.dump_mode = dump_mode + self.dump_api_list = api_list + self.debug_flag = debug + self.api_index = 0 + self.single_api_index_dict = {} + self.device_dump_path_cpu = None + self.device_dump_path_npu = None + self.all_summery = [] + self.call_stack_list = [] + self.process_num = process_num + self.filter_dump_api() + self.check_param() + dir_name = self.get_dir_name(tag) + self.root_path = os.path.join(os.path.realpath(dump_path), dir_name) + self.root_cpu_path = os.path.join(self.root_path, f'cpu') + self.root_npu_path = os.path.join(self.root_path, f'npu') + check_path_before_create(self.root_cpu_path) + check_path_before_create(self.root_npu_path) + Path(self.root_cpu_path).mkdir(mode=0o750, parents=True, exist_ok=True) + Path(self.root_npu_path).mkdir(mode=0o750, parents=True, exist_ok=True) + + self.result_csv_path = os.path.join(self.root_path, RESULT_FILE_NAME) + self.detail_csv_path = os.path.join(self.root_path, DETAILS_FILE_NAME) + 
self.comparator = Comparator(self.result_csv_path, self.detail_csv_path, False) + + self.aten_ops_blacklist = [] + self.npu_adjust_autogard = [] + yaml_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "torch_ops_config.yaml") + self.load_yaml_file(yaml_path) + + self.lock = None + if process_num > 0: + self.pool = Pool(process_num) + if debug: + logger_debug(f'Main pid:{os.getpid()} device:{self.device_id} dump_list:{self.dump_api_list} ' + f'dump_mode:{self.dump_mode} cpu_path[{self.root_cpu_path}], npu_path[{self.root_npu_path}], ' + f'process[{process_num}]') + + def __exit__(self, exc_type, exc_val, exc_tb): + super().__exit__(exc_type, exc_val, exc_tb) + + if not is_npu: + return + logger_debug(f'start write compare csv: Rank[{self.device_id}], Pid[{os.getpid()}') + + if self.process_num > 0: + self.pool.close() + self.pool.join() + summery_path = os.path.join(self.root_cpu_path, f'summary.json') + if not os.path.exists(summery_path): + logger_error("Please check train log, An exception may have occurred!") + return + check_file_or_directory_path(summery_path, False) + fp_handle = open(summery_path, "r") + while True: + json_line_data = fp_handle.readline() + if json_line_data == '\n': + continue + if len(json_line_data) == 0: + break + msg = json.loads(json_line_data) + self.all_summery[msg[0]] = msg[1] + fp_handle.close() + + if self.debug_flag: + input_num = 0 + output_num = 0 + total_num = 0 + + for list_data in self.all_summery: + for data in list_data: + logger_debug(f'summery: Device[{self.device_id}], Pid[{os.getpid()}], Data[{data}]') + if "_input" in data[CompareConst.NPU_NAME]: + input_num = input_num + 1 + if "_output" in data[CompareConst.NPU_NAME]: + output_num = output_num + 1 + total_num = total_num + 1 + logger_debug(f'Dispatch exit: Device[{self.device_id}], Pid[{os.getpid()} Input[{input_num}] ' + f'Output[{output_num}] Total[{total_num}] API_Total[{self.api_index}]]') + + def __torch_dispatch__(self, func, types, args=(), kwargs=None): + if not is_npu: + logger_error("Please confirm you run environment installed torch_npu!") + return func(*args, **kwargs) + + func_name_split_list = func.__name__.split(".") + aten_api = func_name_split_list[0] + try: + aten_api_overload_name = func_name_split_list[1] + except IndexError: + logger_error(f"Please check the func name {func.__name__}!") + return func(*args, **kwargs) + + self.enable_autogard(aten_api) + if aten_api in self.aten_ops_blacklist: + npu_out = func(*args, **kwargs) + return npu_out + + call_stack = get_callstack() + self.call_stack_list.append(call_stack) + self.api_index += 1 + if aten_api not in self.single_api_index_dict: + self.single_api_index_dict[aten_api] = 1 + else: + self.single_api_index_dict[aten_api] += 1 + + run_param = self.get_run_param(aten_api, func.__name__, aten_api_overload_name) + + if self.debug_flag: + logger_debug(f'Dispatch Info: Rank[{self.device_id}], Pid[{os.getpid()}], Func[{func.__name__}], ' + f'Name[{run_param.aten_api}_{run_param.single_api_index}], ' + f'Count[{self.api_index}], Sys[{get_sys_info()}]') + + cpu_args = [] + cpu_kwargs = [] + data_to_cpu(args, 0, cpu_args) + data_to_cpu(kwargs, 0, cpu_kwargs) + cpu_args = cpu_args[0] + cpu_kwargs = cpu_kwargs[0] + + with TimeStatistics("NPU RUN", run_param): + npu_out = func(*args, **kwargs) + npu_out_cpu = [] + data_to_cpu(npu_out, 0, npu_out_cpu) + npu_out_cpu = npu_out_cpu[0] + + with TimeStatistics("CPU RUN", run_param): + cpu_out = func(*cpu_args, **cpu_kwargs) + + if isinstance(cpu_out, torch.Tensor) and 
cpu_out.dtype in [torch.bfloat16, torch.float16, torch.half]: + cpu_out = cpu_out.float() + + if self.process_num == 0: + self.all_summery.append([]) + data_info = DisPatchDataInfo(cpu_args, cpu_kwargs, self.all_summery, func, npu_out_cpu, cpu_out, self.lock) + dispatch_workflow(run_param, data_info) + else: + self.lock.acquire() + self.all_summery.append([]) + self.lock.release() + run_param.process_flag = True + if self.check_fun(func, run_param): + data_info = DisPatchDataInfo(cpu_args, cpu_kwargs, self.all_summery, None, npu_out_cpu, cpu_out, + self.lock) + self.pool.apply_async(func=dispatch_multiprocess, args=(run_param, data_info), + error_callback=error_call) + else: + logger_error("can not get correct function please set process_num=0") + return npu_out + + @staticmethod + def check_fun(func, run_param): + if hasattr(torch.ops.aten, run_param.aten_api): + aten_func = getattr(torch.ops.aten, run_param.aten_api) + if hasattr(aten_func, run_param.aten_api_overload_name): + aten_overload_func = getattr(aten_func, run_param.aten_api_overload_name) + if id(aten_overload_func) == id(func): + run_param.func_namespace = "aten" + return True + return False + + def get_dir_name(self, tag): + # guarantee file uniqueness + time.sleep(1) + time_now = time.strftime("%Y%m%d%H%M%S", time.localtime(time.time())) + if tag is None or not isinstance(tag, str): + logger_warn('There is not tag or the type of tag is not string.') + dir_name = f'msprobe_rank{self.device_id}_{time_now}' + else: + dir_name = f'msprobe_{tag}_rank{self.device_id}_{time_now}' + return dir_name + + def load_yaml_file(self, file_path): + with FileOpen(file_path, 'r') as f: + yaml_file = yaml.safe_load(f) + self.aten_ops_blacklist = yaml_file.get('aten_ops_blacklist') + self.npu_adjust_autogard = yaml_file.get('npu_adjust_autogard') + + def filter_dump_api(self): + if self.dump_mode != Const.LIST or not self.dump_api_list: + self.dump_api_list = [] + return + aten_api_list = dir(torch.ops.aten) + dump_api_list = [] + for aten_api in self.dump_api_list: + if aten_api in aten_api_list: + dump_api_list.append(aten_api) + else: + logger_warn(f'{aten_api} is not aten api will not dump, please refer to torch.ops.aten') + self.dump_api_list = dump_api_list + + def get_run_param(self, aten_api, func_name, aten_api_overload_name): + run_param = DispatchRunParam(self.debug_flag, self.device_id, self.root_npu_path, self.root_cpu_path, + self.process_num, self.comparator) + run_param.dump_flag, run_param.auto_dump_flag = self.get_dump_flag(aten_api) + run_param.func_name = func_name + run_param.aten_api = aten_api + run_param.aten_api_overload_name = aten_api_overload_name + run_param.single_api_index = self.single_api_index_dict[aten_api] + run_param.api_index = self.api_index + return run_param + + def get_dump_flag(self, aten_api): + dump_flag = False + auto_dump_flag = False + if self.dump_mode == Const.ALL: + dump_flag = True + if self.dump_mode == Const.LIST and aten_api in self.dump_api_list: + dump_flag = True + if self.dump_mode == Const.AUTO: + auto_dump_flag = True + return dump_flag, auto_dump_flag + + def check_param(self): + if self.dump_mode not in Const.ONLINE_DUMP_MODE: + logger_error('The parameter "dump mode" can only be one of {}.'.format(Const.ONLINE_DUMP_MODE)) + raise DispatchException(DispatchException.INVALID_PARAMETER) + if not isinstance(self.dump_api_list, list): + logger_error('The type of parameter "api_list" can only be list.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + if not 
isinstance(self.debug_flag, bool): + logger_error('The type of parameter "debug" can only be bool.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + if not isinstance(self.process_num, int) or self.process_num < 0: + logger_error('The type of parameter "process_num" can only be int and it should not be less than 0.') + raise DispatchException(DispatchException.INVALID_PARAMETER) + + def enable_autogard(self, aten_api): + if aten_api in self.npu_adjust_autogard: + torch._C._dispatch_tls_set_dispatch_key_excluded(torch._C.DispatchKey.AutogradFunctionality, False) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dump_compare.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dump_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..f83b6fc9f00b7b1fa9ac4baa89632c9d43a04e4c --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/dump_compare.py @@ -0,0 +1,186 @@ +import os +import json +import copy +from datetime import datetime, timezone + +import pandas as pd +import torch +from .utils import np_save_data, logger_debug, logger_error, logger_warn, logger_user, COLOR_RED, COLOR_GREEN, \ + COLOR_RESET, CSV_COLUMN_NAME +from msprobe.core.common.file_check import FileOpen, change_mode +from msprobe.core.common.const import CompareConst, FileCheckConst, Const +from msprobe.pytorch.common.log import logger + +class DispatchRunParam: + def __init__(self, debug_flag, device_id, root_npu_path, root_cpu_path, process_num, comparator): + # static parameters are initialized by constructors, and dynamic parameters are constructed at run time + self.debug_flag = debug_flag + self.device_id = device_id + self.root_npu_path = root_npu_path + self.root_cpu_path = root_cpu_path + self.process_num = process_num + self.process_flag = False + self.func_name = None + self.func_namespace = None + self.aten_api = None + self.aten_api_overload_name = None + self.single_api_index = None + self.api_index = None + self.dump_flag = None + self.auto_dump_flag = None + self.comparator = comparator + + +class DisPatchDataInfo: + def __init__(self, cpu_args, cpu_kwargs, all_summery, func, npu_out_cpu, cpu_out, lock): + self.cpu_args = cpu_args + self.cpu_kwargs = cpu_kwargs + self.all_summery = all_summery + self.func = func + self.npu_out_cpu = npu_out_cpu + self.cpu_out = cpu_out + self.lock = lock + + +class TimeStatistics: + def __init__(self, name_tag, run_param, timeout=5): + self.debug = run_param.debug_flag + if self.debug: + self.fun = run_param.func_name + self.device = run_param.device_id + self.process = run_param.process_num + self.index = run_param.single_api_index + self.tag = name_tag + self.timeout = timeout + self.time = None + + def __enter__(self): + if self.debug: + self.time = datetime.now(tz=timezone.utc) + logger_debug(f'Time[{self.tag}]-ENTER: Dev[{self.device}], Pid[{os.getpid()}], Fun[{self.fun}], ' \ + f'Id[{self.index}]') + + def __exit__(self, exc_type, exc_val, exc_tb): + if self.debug: + cost_time = datetime.now(tz=timezone.utc) - self.time + time_cost = f'Time[{self.tag}]-EXIT: Dev[{self.device}], Pid[{os.getpid()}], Fun[{self.fun}], ' \ + f'Id[{self.index}], time[{cost_time}]' + hot_time_cost = "Hotspot " + time_cost + + if cost_time.total_seconds() > self.timeout: + logger_debug(hot_time_cost) + else: + logger_debug(time_cost) + + +def support_basic_type(data): + if isinstance(data, (bool, int, float, torch.Tensor)): + return True + return False + + +def dump_data(data, prefix, 
dump_path): + if isinstance(data, (tuple, list)) and data: + for i, item in enumerate(data): + dump_data(item, "{}.{}".format(prefix, i), dump_path) + return + elif support_basic_type(data): + if isinstance(data, torch.Tensor) and data.is_meta: + return + # dump data may greater than summery_list collect + np_save_data(data, prefix, dump_path) + + +def save_temp_summery(api_index, single_api_summery, path, lock): + summery_path = os.path.join(path, f'summery.json') + lock.acquire() + with FileOpen(summery_path, "a") as f: + json.dump([api_index, single_api_summery], f) + f.write('\n') + lock.release() + + +def dispatch_workflow(run_param: DispatchRunParam, data_info: DisPatchDataInfo): + cpu_args, cpu_kwargs = data_info.cpu_args, data_info.cpu_kwargs + all_summery, func = data_info.all_summery, data_info.func + npu_out_cpu, cpu_out, lock = data_info.npu_out_cpu, data_info.cpu_out, data_info.lock + single_api_summery = [] + + prefix_input = f'{run_param.aten_api}_{run_param.single_api_index}_input' + prefix_output = f'{run_param.aten_api}_{run_param.single_api_index}_output' + + accuracy_reached = False + with TimeStatistics("COMPARE OUTPUT", run_param): + run_param.comparator.compare_output(prefix_output, cpu_out, npu_out_cpu, None, None) + + # user set dump or auto mode will dump + if run_param.dump_flag or (run_param.auto_dump_flag and not accuracy_reached): + with TimeStatistics("DUMP INPUT", run_param): + dump_data(cpu_args, prefix_input, run_param.root_npu_path) + if len(cpu_kwargs) > 0: + for k, v in cpu_kwargs.items(): + kwargs_prefix_name = prefix_input + f'_{k}' + dump_data(v, kwargs_prefix_name, run_param.root_npu_path) + + with TimeStatistics("DUMP OUTPUT", run_param): + dump_data(cpu_out, prefix_output, run_param.root_cpu_path) + dump_data(npu_out_cpu, prefix_output, run_param.root_npu_path) + + if run_param.process_num == 0: + all_summery[run_param.api_index - 1] = copy.deepcopy(single_api_summery) + else: + save_temp_summery(run_param.api_index - 1, single_api_summery, run_param.root_cpu_path, lock) + + +def get_torch_func(run_param): + if hasattr(torch.ops, run_param.func_namespace): + ops_func = getattr(torch.ops, run_param.func_namespace) + if hasattr(ops_func, run_param.aten_api): + ops_aten_func = getattr(ops_func, run_param.aten_api) + if hasattr(ops_aten_func, run_param.aten_api_overload_name): + ops_aten_overlaod_func = getattr(ops_aten_func, run_param.aten_api_overload_name) + return ops_aten_overlaod_func + return None + + +def dispatch_multiprocess(run_param, dispatch_data_info): + torch_func = get_torch_func(run_param) + if torch_func is None: + logger.error(f'can not find suitable call api:{run_param.aten_api}') + else: + dispatch_data_info.func = torch_func + dispatch_workflow(run_param, dispatch_data_info) + + +def error_call(err): + logger.error(f'multiprocess {err}') + + +def save_csv(all_summery, call_stack_list, csv_path): + df = pd.DataFrame(columns=CSV_COLUMN_NAME) + + for index, list_data in enumerate(all_summery): + for data in list_data: + csv_row_data = {CompareConst.NPU_NAME: data[CompareConst.NPU_NAME], + CompareConst.BENCH_NAME: data[CompareConst.BENCH_NAME], + CompareConst.NPU_DTYPE: data[CompareConst.NPU_DTYPE], + CompareConst.BENCH_DTYPE: data[CompareConst.BENCH_DTYPE], + CompareConst.NPU_SHAPE: data[CompareConst.NPU_SHAPE], + CompareConst.BENCH_SHAPE: data[CompareConst.BENCH_SHAPE], + CompareConst.NPU_MAX: data[CompareConst.NPU_MAX], + CompareConst.NPU_MIN: data[CompareConst.NPU_MIN], + CompareConst.NPU_MEAN: data[CompareConst.NPU_MEAN], + 
CompareConst.BENCH_MAX: data[CompareConst.BENCH_MAX], + CompareConst.BENCH_MIN: data[CompareConst.BENCH_MIN], + CompareConst.BENCH_MEAN: data[CompareConst.BENCH_MEAN], + CompareConst.COSINE: data[CompareConst.COSINE], + CompareConst.MAX_ABS_ERR: data[CompareConst.MAX_ABS_ERR], + CompareConst.MAX_RELATIVE_ERR: data[CompareConst.MAX_RELATIVE_ERR], + CompareConst.ACCURACY: data[CompareConst.ACCURACY], + CompareConst.STACK: call_stack_list[index], + CompareConst.ERROR_MESSAGE: data[CompareConst.ERROR_MESSAGE]} + row_df = pd.DataFrame.from_dict(csv_row_data, orient='index').T + df = pd.concat([df, row_df]) + + df.to_csv(csv_path, index=False) + change_mode(csv_path, FileCheckConst.DATA_FILE_AUTHORITY) diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/single_compare.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/single_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..aa0afa4e4f2a87f70590816e4ebf28ed6c079937 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/single_compare.py @@ -0,0 +1,391 @@ +import logging +from functools import wraps +import torch +from prettytable import PrettyTable +from collections import namedtuple +from .utils import logger_user, logger_debug + +def func_log_wrapper(): + def _out_wrapper(func): + @wraps(func) + def _in_wrapper(*kargs, **kwargs): + logger_debug("start to run: {}".format(func.__name__)) + x = func(*kargs, **kwargs) + logger_debug("end to run: {}".format(func.__name__)) + return x + + return _in_wrapper + + return _out_wrapper + + +class SingleBenchmarkCompareStandard: + def __init__(self, high_precision=True): + self.high_precision = high_precision + self.small_value = 1.0 + self.error_thd = {torch.float16: [2 ** -11, 2 ** -7], + torch.bfloat16: [2 ** -8, 2 ** -6], + torch.float32: [2 ** -14, 2 ** -11], + torch.float64: [2 ** -14, 2 ** -11]} + self.eb_thd = {torch.float16: 2 ** -10, + torch.bfloat16: 2 ** -7, + torch.float32: 2 ** -14, + torch.float64: 2 ** -14} + + def get_error_thd(self, dtype): + if dtype in self.error_thd.keys(): + if dtype == torch.float64: + logging.warning("the output data of fp64 uses the same standard as fp32.") + return self.error_thd.get(dtype)[0] if self.high_precision else self.error_thd.get(dtype)[1] + logging.error( + "Single benchmark compare only supports floating point " + "in fp16, bf16, fp32. 
" + ) + return None + + def get_eb_thd(self, dtype): + if dtype in self.eb_thd.keys(): + return self.eb_thd.get(dtype) + return None + + +class SingleBenchmarkAccuracyResult: + def __init__( + self, + result=True, + error_balance=None, + max_abs_diff=None, + max_abs_idx=None, + max_rel_diff=None, + max_rel_idx=None + ): + self.result = result + self.error_balance = error_balance + self.max_abs_diff = max_abs_diff + self.max_abs_idx = max_abs_idx + self.max_rel_diff = max_rel_diff + self.max_rel_idx = max_rel_idx + + def get_result(self, eb_thd, error_thd): + if ( + self.error_balance > eb_thd + or self.max_abs_diff > error_thd + or self.max_rel_diff > error_thd + ): + self.result = False + else: + self.result = True + + +class SingleBenchmarkAccuracyCompare: + @classmethod + @func_log_wrapper() + def check_output_size(cls, npu_out, bench_out): + acc_result = None + if npu_out.numel() == 0 and bench_out.nuimel() == 0: + info = ( + "The npu_output is [], and it is same as benchmark_output, " + "the result of data_compare is Pass" + ) + logging.debug(info) + acc_result = SingleBenchmarkAccuracyResult(result=True) + + if npu_out.size() != bench_out.size(): + error_info = ( + f"the size of npu output[{npu_out.size()}] and" + f"benchmark[{bench_out.size()}] is not equal" + ) + + logging.error(error_info) + acc_result = SingleBenchmarkAccuracyResult(result=False) + return acc_result + + @classmethod + @func_log_wrapper() + def check_output_invalid_value(cls, output): + has_nan = torch.isnan(output).any() + has_inf = torch.isinf(output).any() + return has_nan or has_inf + + @classmethod + @func_log_wrapper() + def precision_compare_for_case(cls, npu_out, bench_out, benchmark_standard: SingleBenchmarkCompareStandard): + error_thd = None + eb_thd = None + acc_result = cls.check_output_size(npu_out, bench_out) + CompareResultInfo = namedtuple("CompareResultInfo", + ['accuracy_result', 'error_threshold', 'eb_threshold', 'failed_information']) + + if acc_result: + failed_info = "比对数据的shape不一致" + return CompareResultInfo(acc_result, error_thd, eb_thd, failed_info) + + if cls.check_output_invalid_value(bench_out): + logging.info("The benchmark result contains nan/inf value. ") + failed_info = "标杆结果存在nan值或inf值, 依照单标杆标准该用例通过" + acc_result = SingleBenchmarkAccuracyResult(result=True) + return CompareResultInfo(acc_result, error_thd, eb_thd, failed_info) + + if cls.check_output_invalid_value(npu_out): + logging.info("The NPU result contains nan/inf value. 
") + failed_info = "NPU结果存在nan值或inf值, 依照单标杆标准该用例不通过" + acc_result = SingleBenchmarkAccuracyResult(result=False) + return CompareResultInfo(acc_result, error_thd, eb_thd, failed_info) + + data_type = npu_out.dtype + if data_type not in [torch.float16, torch.float32, torch.float64, torch.bfloat16]: + acc_result = cls.compute_binary_diff(npu_out, bench_out) + else: + error_thd = benchmark_standard.get_error_thd(data_type) + eb_thd = benchmark_standard.get_eb_thd(data_type) + if error_thd is None: + logging.error( + "single benchmark not support the comparison of %s", str(data_type) + ) + acc_result = SingleBenchmarkAccuracyResult(result=False) + else: + if npu_out.dtype in [torch.float16, torch.bfloat16] and bench_out.dtype in [torch.float32]: + npu_out = npu_out.to(torch.float32) + error_balance = cls.compute_error_balance(npu_out, bench_out, benchmark_standard) + max_abs_diff, max_abs_idx = cls.compute_abs_diff(npu_out, bench_out, error_thd, benchmark_standard) + max_rel_diff, max_rel_idx = cls.compute_rel_diff(npu_out, bench_out, error_thd, benchmark_standard) + acc_result = SingleBenchmarkAccuracyResult( + error_balance=error_balance, + max_abs_diff=max_abs_diff, + max_abs_idx=max_abs_idx, + max_rel_diff=max_rel_diff, + max_rel_idx=max_rel_idx + ) + acc_result.get_result(eb_thd, error_thd) + return CompareResultInfo(acc_result, error_thd, eb_thd, None) + return None + + @classmethod + @func_log_wrapper() + def compute_binary_diff(cls, npu_out, bench_out): + result = torch.equal(npu_out, bench_out) + if result: + logger_user("二进制精度比对通过, 无需单标杆比对法验证") + return SingleBenchmarkAccuracyResult(result=result, max_abs_diff=0, max_rel_diff=0, error_balance=0) + + @classmethod + @func_log_wrapper() + def compute_error_balance(cls, npu_out, bench_out, benchmark_standard: SingleBenchmarkCompareStandard): + ones = torch.ones_like(npu_out) + zeros = torch.zeros_like(npu_out) + abs_mask_idx = torch.where(torch.abs(bench_out) < benchmark_standard.small_value, ones, zeros) + abs_mask_idx = abs_mask_idx.type(torch.bool) + diff_value = torch.subtract(npu_out, bench_out) + diff_value_rel = diff_value / (torch.abs(bench_out) + torch.finfo(torch.float).eps ) + rel_and_abs = torch.where(abs_mask_idx, diff_value, diff_value_rel) + eb_float = float(torch.mean(rel_and_abs)) + return eb_float + + @classmethod + @func_log_wrapper() + def compute_abs_diff(cls, npu_out, bench_out, error_thd, benchmark_standard: SingleBenchmarkCompareStandard): + max_abs_diff = 0 + max_abs_idx = None + + ones = torch.ones_like(npu_out) + zeros = torch.zeros_like(npu_out) + diff_value = torch.subtract(npu_out, bench_out) + diff_abs = torch.abs(diff_value) + abs_mask_idx = torch.where(torch.abs(bench_out) < benchmark_standard.small_value, ones, zeros) + abs_err_idx = torch.where(diff_abs > error_thd, ones, zeros) + abs_err_idx = abs_err_idx * abs_mask_idx + abs_err = diff_abs[torch.where(abs_err_idx == 1)] + + if len(abs_err) > 0: + err_for_max = torch.where(abs_err_idx == 1, diff_abs, zeros) + logging.debug("err_for_max for abs %s", err_for_max) + max_abs_idx = torch.argmax(err_for_max) + max_abs_diff = diff_abs[max_abs_idx] + elif torch.sum(abs_mask_idx) > 0: + err_for_max = torch.where(abs_mask_idx == 1, diff_abs, zeros) + logging.debug("error_for_max for abs %s", err_for_max) + max_abs_idx = torch.argmax(err_for_max) + if err_for_max.max() != 0: + max_abs_diff = diff_abs[max_abs_idx] + return (float(max_abs_diff), int(max_abs_idx) if torch.is_tensor(max_abs_idx) else max_abs_idx) + + @classmethod + @func_log_wrapper() + def 
compute_rel_diff(cls, npu_out, bench_out, error_thd, benchmark_standard: SingleBenchmarkCompareStandard): + max_rel_diff = 0 + max_rel_idx = None + + ones = torch.ones_like(npu_out) + zeros = torch.zeros_like(npu_out) + diff_value = torch.subtract(npu_out, bench_out) + diff_abs = torch.abs(diff_value) + + rel_mask_idx = torch.where(torch.abs(bench_out) >= benchmark_standard.small_value, ones, zeros) + rel_err = diff_abs / (torch.abs(bench_out) + torch.finfo(torch.float).eps ) + diff_rel = rel_err + rel_err_idx = torch.where(rel_err > error_thd, ones, zeros) + rel_err_idx = rel_err_idx * rel_mask_idx + rel_err = rel_err[torch.where(rel_err_idx == 1)] + if len(rel_err) > 0: + err_for_max = torch.where(rel_err_idx == 1, diff_rel, zeros) + logging.debug("error_for_max for rel %s", err_for_max) + max_rel_idx = torch.argmax(err_for_max) + max_rel_diff = diff_rel[max_rel_idx] + elif torch.sum(rel_mask_idx > 0): + err_for_max = torch.where(rel_mask_idx == 1, diff_rel, zeros) + logging.debug("err_for_max for rel %s", err_for_max) + max_rel_idx = torch.argmax(err_for_max) + if torch.sum(err_for_max) != 0: + max_rel_diff = diff_rel[max_rel_idx] + return (float(max_rel_diff), int(max_rel_idx) if torch.is_tensor(max_rel_idx) else max_rel_idx) + + +class SingleBenchSummary: + def __init__(self, precision_result: SingleBenchmarkAccuracyResult, npu_dtype=None, + bench_dtype=None, shape=None, error_thd=None, eb_thd=None, failed_info=None): + self.npu_dtype = npu_dtype + self.bench_dtype = bench_dtype + self.shape = shape + self.result = precision_result.result + self.error_balance = precision_result.error_balance + self.max_abs_diff = precision_result.max_abs_diff + self.max_abs_idx = precision_result.max_abs_idx + self.max_rel_diff = precision_result.max_rel_diff + self.max_rel_idx = precision_result.max_rel_idx + self.eb_thd = eb_thd + self.error_thd = error_thd + self.failed_info = failed_info + + def get_check_result(self): + if self.result: + return "PASS" + else: + return "FAILED" + + def get_result_msg(self): + result_str = "" + if self.failed_info: + return self.failed_info + + if self.result: + result_str += "误差均衡性EB: %s <= 阈值%s\n" % (self.error_balance, self.eb_thd) + result_str += "最大绝对误差: %s <= 阈值%s\n" % (self.max_abs_diff, self.error_thd) + result_str += "最大相对误差: %s <= 阈值%s\n" % (self.max_rel_diff, self.error_thd) + else: + if self.error_balance > self.eb_thd: + result_str += "误差均衡性EB超过阈值%s: EB = %s\n" % ( + self.eb_thd, + self.error_balance, + ) + if self.max_abs_diff > self.error_thd: + result_str += "小值域最大绝对误差超过阈值%s: idx = %s, 绝对误差 = %s\n" % ( + self.error_thd, + self.max_abs_idx, + self.max_abs_diff + ) + if self.max_rel_diff > self.error_thd: + result_str += "大值域最大相对误差超过阈值%s: idx = %s, 相对误差 = %s\n" % ( + self.error_thd, + self.max_rel_idx, + self.max_rel_diff, + ) + return result_str + + def print_detail_table(self): + table = PrettyTable() + table.title = "Single Benchmark Metrics Info" + table.field_names = ["Index", "Result", "Threshold"] + table.add_row(["error_balance", self.error_balance, self.eb_thd]) + table.add_row(["max_abs_diff", self.max_abs_diff, self.error_thd]) + table.add_row(["max_abs_idx", self.max_abs_idx, "-"]) + table.add_row(["max_rel_diff", self.max_rel_diff, self.error_thd]) + table.add_row(["max_rel_idx", self.max_rel_idx, "-"]) + + logger_user(table) + + def to_column_value(self): + return [self.bench_dtype, self.npu_dtype, self.shape, self.error_balance, + self.max_abs_diff, self.max_abs_idx, self.max_rel_diff, self.max_rel_idx, + self.eb_thd, self.error_thd, 
self.result, self.failed_info]
+
+
+def single_benchmark_compare(npu_out: torch.Tensor, bench_out: torch.Tensor, high_precision: bool = True):
+    benchmark_standard = SingleBenchmarkCompareStandard(high_precision)
+    npu_out = npu_out.flatten()
+    bench_out = bench_out.flatten()
+
+    compare_results = SingleBenchmarkAccuracyCompare.precision_compare_for_case(npu_out, bench_out, benchmark_standard)
+    (
+        precision_result,
+        error_thd,
+        eb_thd,
+        failed_info
+    ) = (compare_results.accuracy_result, compare_results.error_threshold,
+         compare_results.eb_threshold, compare_results.failed_information)
+
+    summary = SingleBenchSummary(precision_result, str(npu_out.dtype), str(bench_out.dtype), tuple(npu_out.shape), error_thd, eb_thd, failed_info)
+    result = summary.result
+    details = summary.to_column_value()
+    return result, details
+
+
+def calc_status_details_list_tuple(npu_out, bench_out, high_precision, summary):
+    status, details = [], []
+    if len(bench_out) != len(npu_out):
+        summary.result = False
+        summary.failed_info = "bench and npu output structure is different."
+        return False, summary.to_column_value()
+    for b_out_i, n_out_i in zip(bench_out, npu_out):
+        status_i, details_i = single_benchmark_compare_wrap(n_out_i, b_out_i, high_precision)
+        status.append(status_i)
+        details.append(details_i)
+    return status, details
+
+
+def calc_status_details_dict(npu_out, bench_out, high_precision, summary):
+    b_keys, n_keys = set(bench_out.keys()), set(npu_out.keys())
+    if b_keys != n_keys:
+        summary.result = False
+        summary.failed_info = "bench and npu_output dict keys are different."
+        return False, summary.to_column_value()
+    else:
+        # compare the two dicts value by value; the wrapper expects (npu, bench) order
+        status, details = single_benchmark_compare_wrap(list(npu_out.values()), list(bench_out.values()), high_precision)
+        return status, details
+
+
+def calc_status_details_tensor(npu_out, bench_out, high_precision, summary):
+    return single_benchmark_compare(npu_out, bench_out, high_precision)
+
+
+def calc_status_details_builtin(npu_out, bench_out, high_precision, summary):
+    # high_precision is accepted but unused so that all handlers share one signature
+    summary.bench_dtype = str(type(bench_out))
+    summary.npu_dtype = str(type(npu_out))
+    status = bench_out == npu_out
+    summary.result = status
+    return status, summary.to_column_value()
+
+
+def calc_status_details_none(npu_out, bench_out, high_precision, summary):
+    summary.result = True
+    summary.failed_info = "Output is None."
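+    # A None output counts as a pass; failed_info records the reason in the returned detail row.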
+ return True, summary.to_column_value() + + +def single_benchmark_compare_wrap(npu_output: torch.Tensor, bench_output: torch.Tensor, high_precision=True): + type_method_dict = { + (list, tuple): calc_status_details_list_tuple, + dict: calc_status_details_dict, + torch.Tensor: calc_status_details_tensor, + (bool, int, float, str): calc_status_details_builtin, + None: calc_status_details_none, + } + + result = SingleBenchmarkAccuracyResult(result=True) + bench_summary = SingleBenchSummary(result) + for type1, func in type_method_dict.items(): + if isinstance(bench_output, type1): + return func(npu_output, bench_output, high_precision, bench_summary) + + bench_summary.result = True + bench_summary.failed_info = "Unexpected output type: {}".format(type(bench_output)) + return True, bench_summary.to_column_value() diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/torch_ops_config.yaml b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/torch_ops_config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..789ae2a7a7b8a3bc05ea6a073cbbf2875de2bc59 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/torch_ops_config.yaml @@ -0,0 +1,50 @@ +aten_ops_blacklist: + - _cudnn_rnn + - _local_scalar_dense + - _pin_memory + - _to_copy + - _unsafe_view + - clone + - contiguous + - copy_ + - cudnn_batch_norm + - cudnn_batch_norm_backward + - detach + - empty + - index_put_ + - lift_fresh + - max_pool2d_with_indices_backward # shape unmatch + - native_batch_norm_backward + - new_empty + - new_empty_strided + - new_full + - new_ones + - new_zeros + - ones + - ones_like + - permute + - rand + - rand_like + - randint + - randint_like + - randn + - randn_like + - randperm + - scalar_tensor + - select + - to + - transpose + - unbind + - view + - zero + - zero_ + - zeros + - zeros_like + +npu_adjust_autogard: + - adaptive_avg_pool2d + - batch_norm + - log_softmax + - nll_loss + - to + \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/pytorch/online_dispatch/utils.py b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..fec3e0b00746c653089d763c052d2b1c350a6886 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/online_dispatch/utils.py @@ -0,0 +1,187 @@ +import os +import inspect +import logging +import psutil +import torch +import numpy as np + +try: + import torch_npu +except ImportError: + pta_cpu_device = None +else: + pta_cpu_device = torch.device("cpu") + +from msprobe.core.common.const import CompareConst, FileCheckConst +from msprobe.core.common.file_check import change_mode + +cpu_device = torch._C.device("cpu") +COLOR_RED = '\033[31m' +COLOR_GREEN = '\033[32m' +COLOR_YELLOW = '\033[33m' +COLOR_BLUE = '\033[34m' +COLOR_PURPLE = '\033[35m' +COLOR_CYAN = '\033[36m' +COLOR_GRAY = '\033[37m' +COLOR_RESET = '\033[0m' + +COMPARE_LOGO = ''' + _ _ + ___ _ __ | (_)_ __ ___ ___ ___ _ __ ___ _ __ __ _ _ __ ___ + / _ \\| '_ \\| | | '_ \\ / _ \\ / __/ _ \\| '_ ` _ \\| '_ \\ / _` | '__/ _ \\ +| (_) | | | | | | | | | __/ | (_| (_) | | | | | | |_) | (_| | | | __/ + \\___/|_| |_|_|_|_| |_|\\___| \\___\\___/|_| |_| |_| .__/ \\__,_|_| \\___| + |_| +''' + +CSV_COLUMN_NAME = [CompareConst.NPU_NAME, + CompareConst.BENCH_NAME, + CompareConst.NPU_DTYPE, + CompareConst.BENCH_DTYPE, + CompareConst.NPU_SHAPE, + CompareConst.BENCH_SHAPE, + CompareConst.NPU_MAX, + CompareConst.NPU_MIN, + CompareConst.NPU_MEAN, + CompareConst.BENCH_MAX, + CompareConst.BENCH_MIN, + 
CompareConst.BENCH_MEAN, + CompareConst.COSINE, + CompareConst.MAX_ABS_ERR, + CompareConst.MAX_RELATIVE_ERR, + CompareConst.ACCURACY, + CompareConst.STACK, + CompareConst.ERROR_MESSAGE] + +FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] +BOOL_TYPE = [bool, np.uint8] +INT_TYPE = [np.int32, np.int64] + + +def get_callstack(): + callstack = [] + for (_, path, line, func, code, _) in inspect.stack()[2:]: + if code: + stack_line = [path, str(line), func, code[0].strip() if code else code] + else: + stack_line = [path, str(line), func, code] + callstack.append(stack_line) + return callstack + + +def np_save_data(data, file_name, data_path): + try: + if hasattr(data, "numpy"): + data = data.numpy() + dump_path = os.path.join(data_path, f'{file_name}.npy') + np.save(dump_path, data) + change_mode(dump_path, FileCheckConst.DATA_FILE_AUTHORITY) + except Exception as e: + logger_error("save numpy failed, error: {}".format(e)) + finally: + pass + + +def data_to_cpu(data, deep, data_cpu): + global cpu_device + list_cpu = [] + if isinstance(data, torch.Tensor): + if data.device == cpu_device or data.device == pta_cpu_device: + tensor_copy = data.clone().detach() + else: + tensor_copy = data.cpu().detach() + if tensor_copy.dtype in [torch.float16, torch.half, torch.bfloat16]: + tensor_copy = tensor_copy.float() + + if deep == 0: + data_cpu.append(tensor_copy) + return tensor_copy + elif isinstance(data, list): + for v in data: + list_cpu.append(data_to_cpu(v, deep + 1, data_cpu)) + if deep == 0: + data_cpu.append(list_cpu) + return list_cpu + elif isinstance(data, tuple): + for v in data: + list_cpu.append(data_to_cpu(v, deep + 1, data_cpu)) + tuple_cpu = tuple(list_cpu) + if deep == 0: + data_cpu.append(tuple_cpu) + return tuple_cpu + elif isinstance(data, dict): + dict_cpu = {} + for k, v in data.items(): + dict_cpu[k] = data_to_cpu(v, deep + 1, data_cpu) + if deep == 0: + data_cpu.append(dict_cpu) + return dict_cpu + elif isinstance(data, torch._C.device): + return cpu_device + else: + if deep == 0: + data_cpu.append(data) + return data + + +def get_mp_logger(): + logger = logging.getLogger(__name__) + if not logger.handlers: + logger.setLevel(logging.INFO) + handler = logging.StreamHandler() + formatter = logging.Formatter('%(asctime)s %(message)s') + logger.propagate = True + handler.setFormatter(formatter) + logger.addHandler(handler) + return logger.info + + +def logger_debug(mesg): + logger = get_mp_logger() + logger(f'DEBUG ' + mesg) + + +def logger_info(mesg): + logger = get_mp_logger() + logger(f'INFO ' + mesg) + + +def logger_warn(mesg): + logger = get_mp_logger() + logger(f'{COLOR_YELLOW}WARNING {mesg} {COLOR_RESET}') + + +def logger_error(mesg): + logger = get_mp_logger() + logger(f'{COLOR_RED}ERROR {mesg} {COLOR_RESET}') + + +def logger_user(mesg): + logger = get_mp_logger() + logger(mesg) + + +def logger_logo(): + logger_user(f'{COLOR_CYAN}{COMPARE_LOGO} {COLOR_RESET}') + + +def get_sys_info(): + mem = psutil.virtual_memory() + cpu_percent = psutil.cpu_percent(interval=1) + sys_info = f'Total: {mem.total / 1024 / 1024:.2f}MB ' \ + f'Free: {mem.available / 1024 / 1024:.2f} MB ' \ + f'Used: {mem.used / 1024 / 1024:.2f} MB ' \ + f'CPU: {cpu_percent}% ' + return sys_info + + +class DispatchException(Exception): + INVALID_PARAMETER = 0 + + def __init__(self, err_code, err_msg=""): + super(DispatchException, self).__init__() + self.err_code = err_code + self.err_msg = err_msg + + def __str__(self): + return self.err_msg diff --git 
a/debug/accuracy_tools/msprobe/pytorch/parse.py b/debug/accuracy_tools/msprobe/pytorch/parse.py new file mode 100644 index 0000000000000000000000000000000000000000..efd3d4a2ddb807e23ba346b6c80ef344d560050d --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse.py @@ -0,0 +1,4 @@ +from msprobe.pytorch.parse_tool import cli + +if __name__ == '__main__': + cli.parse() diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/__init__.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/parse_tool/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/register_hook.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/cli.py similarity index 40% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/register_hook.py rename to debug/accuracy_tools/msprobe/pytorch/parse_tool/cli.py index eee0d6c5d665470fbeaf49938cbbed1693c5f623..500e8eef6846b1209817083992a5e3630381b7dc 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/register_hook.py +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/cli.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -14,24 +14,19 @@ # See the License for the specific language governing permissions and # limitations under the License. 
""" -import torch +from msprobe.pytorch.parse_tool.lib.interactive_cli import InteractiveCli +from msprobe.pytorch.common.log import logger -from api_accuracy_checker.hook_module import wrap_torch, wrap_functional, wrap_tensor +def _run_interactive_cli(cli=None): + logger.info("Interactive command mode") + if not cli: + cli = InteractiveCli() + try: + cli.cmdloop(intro="Start Parsing........") + except KeyboardInterrupt: + logger.info("Exit parsing.......") -def initialize_hook(hook): - wrap_tensor.wrap_tensor_ops_and_bind(hook) - for attr_name in dir(wrap_tensor.HOOKTensor): - if attr_name.startswith("wrap_"): - setattr(torch.Tensor, attr_name[5:], getattr(wrap_tensor.HOOKTensor, attr_name)) - - wrap_torch.wrap_torch_ops_and_bind(hook) - for attr_name in dir(wrap_torch.HOOKTorchOP): - if attr_name.startswith("wrap_"): - setattr(torch, attr_name[5:], getattr(wrap_torch.HOOKTorchOP, attr_name)) - - wrap_functional.wrap_functional_ops_and_bind(hook) - for attr_name in dir(wrap_functional.HOOKFunctionalOP): - if attr_name.startswith("wrap_"): - setattr(torch.nn.functional, attr_name[5:], getattr(wrap_functional.HOOKFunctionalOP, attr_name)) +def parse(): + _run_interactive_cli() diff --git a/debug/accuracy_tools/atat/pytorch/debugger/__init__.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/debugger/__init__.py rename to debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/__init__.py diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/compare.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/compare.py new file mode 100644 index 0000000000000000000000000000000000000000..85c4cde4d10c85266bb7404f2dae315b4d620107 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/compare.py @@ -0,0 +1,259 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""
+
+import os
+import time
+import numpy as np
+from collections import namedtuple
+from msprobe.pytorch.parse_tool.lib.utils import Util
+from msprobe.pytorch.parse_tool.lib.config import Const
+from msprobe.pytorch.parse_tool.lib.parse_exception import ParseException
+
+
+class Compare:
+    def __init__(self):
+        self.util = Util()
+        self.log = self.util.log
+        self.vector_compare_result = {}
+
+    def npu_vs_npu_compare(self, my_dump_path, golden_dump_path, result_dir, msaccucmp_path):
+        self.log.info("Start Compare ...............")
+        self.compare_vector(my_dump_path, golden_dump_path, result_dir, msaccucmp_path)
+        self.log.info("Compare finished!!")
+
+    def compare_vector(self, my_dump_path, golden_dump_path, result_dir, msaccucmp_path):
+        self.util.create_dir(result_dir)
+        self.util.check_path_valid(result_dir)
+        call_msaccucmp = self.util.check_msaccucmp(msaccucmp_path)
+        cmd = '%s %s compare -m %s -g %s -out %s' % (
+            self.util.python, call_msaccucmp, my_dump_path, golden_dump_path, result_dir
+        )
+        return self.util.execute_command(cmd)
+
+    def convert_dump_to_npy(self, dump_file, data_format, output, msaccucmp_path):
+        dump_file = self.util.path_strip(dump_file)
+        file_name = ""
+        if os.path.isfile(dump_file):
+            self.log.info("Convert file is: %s", dump_file)
+            file_name = os.path.basename(dump_file)
+        elif os.path.isdir(dump_file):
+            self.log.info("Convert all files in path: %s", dump_file)
+            file_name = ""
+        output = output if output else Const.DUMP_CONVERT_DIR
+        convert = self.convert(dump_file, data_format, output, msaccucmp_path)
+        if convert == 0:
+            convert_files = self.util.list_convert_files(output, file_name)
+
+            summary_txt = ["SrcFile: %s" % dump_file]
+            for convert_file in convert_files.values():
+                summary_txt.append(" - %s" % convert_file.file_name)
+            self.log.info("Transfer result is saved in : %s", os.path.realpath(output))
+            self.util.print_panel("\n".join(summary_txt))
+
+    def convert(self, dump_file, data_format, output, msaccucmp_path):
+        self.util.create_dir(output)
+        self.util.check_path_valid(output)
+        call_msaccucmp = self.util.check_msaccucmp(msaccucmp_path)
+        if data_format:
+            cmd = '%s %s convert -d %s -out %s -f %s' % (
+                self.util.python, call_msaccucmp, dump_file, output, data_format
+            )
+        else:
+            cmd = '%s %s convert -d %s -out %s' % (
+                self.util.python, call_msaccucmp, dump_file, output
+            )
+        return self.util.execute_command(cmd)
+
+    def compare_data(self, args):
+        """Compare data"""
+        # args: (left_path, right_path, save_txt, rtol, atol, diff_count)
+        (left, right, save_txt, rl, al, diff_count) = args
+        if left is None or right is None:
+            raise ParseException("invalid input or output")
+        try:
+            left_data = np.load(left)
+            right_data = np.load(right)
+        except UnicodeError as e:
+            self.log.error("%s %s" % ("UnicodeError", str(e)))
+            self.log.warning("Please check the npy file")
+            raise ParseException(ParseException.PARSE_UNICODE_ERROR) from e
+        except IOError as e:
+            self.log.error("Failed to load npy %s or %s." 
% (left, right)) + raise ParseException(ParseException.PARSE_LOAD_NPY_ERROR) from e + + # save to txt + if save_txt: + self.util.save_npy_to_txt(left_data, left + ".txt") + self.util.save_npy_to_txt(right_data, right + ".txt") + # compare data + (total_cnt, all_close, cos_sim, err_percent) = self.do_compare_data(left_data, right_data, rl, al, diff_count) + content = ['Left:', ' ├─ NpyFile: %s' % left] + if save_txt: + content.append(' ├─ TxtFile: [green]%s.txt[/green]' % left) + content.append(' └─ NpySpec: [yellow]%s[/yellow]' % self.util.gen_npy_info_txt(left_data)) + content.append('Right:') + content.append(' ├─ NpyFile: %s' % right) + if save_txt: + content.append(' ├─ TxtFile: [green]%s.txt[/green]' % right) + content.append(' └─ NpySpec: [yellow]%s[/yellow]' % self.util.gen_npy_info_txt(right_data)) + content.append('NumCnt: %s' % total_cnt) + content.append('AllClose: %s' % all_close) + content.append('CosSim: %s' % cos_sim) + content.append('ErrorPer: %s (rl= %s, al= %s)' % (err_percent, rl, al)) + self.util.print_panel("\n".join(content)) + + def do_compare_data(self, left, right, rl=0.001, al=0.001, diff_count=20): + data_left = left.astype(np.float32) + data_right = right.astype(np.float32) + shape_left = data_left.shape + shape_right = data_right.shape + if shape_left != shape_right: + self.log.warning("Data shape not equal: %s vs %s", data_left.shape, data_right.shape) + data_left = data_left.reshape(-1) + data_right = data_right.reshape(-1) + if data_left.shape[0] != data_right.shape[0]: + self.log.warning("Data size not equal: %s vs %s", data_left.shape, data_right.shape) + if data_left.shape[0] < data_right.shape[0]: + data_left = np.pad(data_left, (0, data_right.shape[0] - data_left.shape[0]), 'constant') + else: + data_right = np.pad(data_right, (0, data_left.shape[0] - data_right.shape[0]), 'constant') + all_close = np.allclose(data_left, data_right, atol=al, rtol=rl) + np.seterr(divide='raise') + cos_sim = np.dot(data_left, data_right) / ( + np.sqrt(np.dot(data_left, data_left)) * np.sqrt(np.dot(data_right, data_right))) + err_cnt = 0 + total_cnt = data_left.shape[0] + diff_table_columns = ['Index', 'Left', 'Right', 'Diff'] + err_table = self.util.create_table("Error Item Table", diff_table_columns) + top_table = self.util.create_table("Top Item Table", diff_table_columns) + for i in range(total_cnt): + abs_diff = abs(data_left[i] - data_right[i]) + if i < diff_count: + top_table.add_row(str(i), str(data_left[i]), str(data_right[i]), str(abs_diff)) + if abs_diff > (al + rl * abs(data_right[i])): + if err_cnt < diff_count: + err_table.add_row(str(i), str(data_left[i]), str(data_right[i]), str(abs_diff)) + err_cnt += 1 + if total_cnt == 0: + err_percent = float(0) + else: + err_percent = float(err_cnt / total_cnt) + self.util.print(self.util.create_columns([err_table, top_table])) + do_compare_data_result = namedtuple('do_compare_data_result', ['cnt', 'close', 'cos', 'err']) + res = do_compare_data_result(total_cnt, all_close, cos_sim, err_percent) + return res + + def compare_npy(self, file, bench_file, output_path): + data = np.load(file) + bench_data = np.load(bench_file) + shape, dtype = data.shape, data.dtype + bench_shape, bench_dtype = bench_data.shape, bench_data.dtype + filename = os.path.basename(file) + bench_filename = os.path.basename(bench_file) + if shape != bench_shape or dtype != bench_dtype: + self.log.error( + "Shape or dtype between two npy files is inconsistent. Please check the two files." 
+ "File 1: %s, file 2: %s", file, bench_file) + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + md5_consistency = False + if self.util.get_md5_for_numpy(data) == self.util.get_md5_for_numpy(bench_data): + md5_consistency = True + data_mean = np.mean(data) + bench_data_mean = np.mean(bench_data) + abs_error = np.abs(data - bench_data) + bench_data = self.util.deal_with_value_if_has_zero(bench_data) + rel_error = np.abs(abs_error / bench_data) + abs_diff_max = abs_error.max() + rel_diff_max = np.max(rel_error) + compare_result = [[filename, bench_filename, data_mean, bench_data_mean, md5_consistency, abs_diff_max, + rel_diff_max]] + self.util.write_csv(compare_result, output_path) + + def compare_all_file_in_directory(self, my_dump_dir, golden_dump_dir, output_path): + if not (self.util.is_subdir_count_equal(my_dump_dir, golden_dump_dir) + and self.util.check_npy_files_valid_in_dir(my_dump_dir) + and self.util.check_npy_files_valid_in_dir(golden_dump_dir)): + self.log.error( + "Top level(Npy files level) directory structure is inconsistent. Please check the two directory.") + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + my_npy_files = self.util.get_sorted_files_names(my_dump_dir) + golden_npy_files = self.util.get_sorted_files_names(golden_dump_dir) + for my_npy_file_name, golden_npy_file_name in zip(my_npy_files, golden_npy_files): + my_npy_path = os.path.join(my_dump_dir, my_npy_file_name) + golden_npy_path = os.path.join(golden_dump_dir, golden_npy_file_name) + self.compare_npy(my_npy_path, golden_npy_path, output_path) + + def compare_timestamp_directory(self, my_dump_dir, golden_dump_dir, output_path): + if not self.util.is_subdir_count_equal(my_dump_dir, golden_dump_dir): + self.log.error( + "Second level(Timestamp level) directory structure is inconsistent. Please check the two directory.") + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + my_ordered_subdirs = self.util.get_sorted_subdirectories_names(my_dump_dir) + golden_ordered_subdirs = self.util.get_sorted_subdirectories_names(golden_dump_dir) + for my_subdir_name, golden_subdir_name in zip(my_ordered_subdirs, golden_ordered_subdirs): + my_subdir_path = os.path.join(my_dump_dir, my_subdir_name) + golden_subdir_path = os.path.join(golden_dump_dir, golden_subdir_name) + self.compare_all_file_in_directory(my_subdir_path, golden_subdir_path, output_path) + + def compare_converted_dir(self, my_dump_dir, golden_dump_dir, output_dir): + if not self.util.is_subdir_count_equal(my_dump_dir, golden_dump_dir): + self.log.error( + "Top level(Opname level) directory structure is inconsistent. Please check the two directory.") + return + timestamp = int(time.time()) + output_file_name = f"batch_compare_{timestamp}.csv" + output_path = os.path.join(output_dir, output_file_name) + title_rows = [[ + "NPU File Name", + "Bench File Name", + "Mean", + "Bench Mean", + "Md5 Consistency", + "Max Abs Error", + "Max Relative Error" + ]] + self.util.write_csv(title_rows, output_path) + + my_ordered_subdirs = self.util.get_sorted_subdirectories_names(my_dump_dir) + golden_ordered_subdirs = self.util.get_sorted_subdirectories_names(golden_dump_dir) + for my_subdir_name, golden_subdir_name in zip(my_ordered_subdirs, golden_ordered_subdirs): + if not my_subdir_name == golden_subdir_name: + self.log.error( + "Top level(Opname level) directory structure is inconsistent. 
Please check the two directory.") + self.util.deal_with_dir_or_file_inconsistency(output_path) + return + my_subdir_path = os.path.join(my_dump_dir, my_subdir_name) + golden_subdir_path = os.path.join(golden_dump_dir, golden_subdir_name) + self.compare_timestamp_directory(my_subdir_path, golden_subdir_path, output_path) + self.util.change_filemode_safe(output_path) + self.log.info("Compare result is saved in : %s", output_path) + + def convert_api_dir_to_npy(self, dump_dir, param, output_dir, msaccucmp_path): + dump_dir = self.util.path_strip(dump_dir) + for root, _, files in os.walk(dump_dir): + for file in files: + file_path = os.path.join(root, file) + file_name = os.path.basename(file_path) + parts = file_name.split(".") + if len(parts) < 5: + continue + op_name = parts[1] + timestamp = parts[-1] + output_path = os.path.join(output_dir, op_name, timestamp) + self.convert_dump_to_npy(file_path, param, output_path, msaccucmp_path) diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/config.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/config.py new file mode 100644 index 0000000000000000000000000000000000000000..a745ff46f08a28c39c989a5d8dce4ff5cf475ee5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/config.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import os +import numpy as np + + +class Const: + + MS_ACCU_CMP_PATH = '/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py' + MS_ACCU_CMP_FILE_NAME = 'msaccucmp.py' + ROOT_DIR = "" + LOG_LEVEL = "NOTSET" + DATA_ROOT_DIR = os.path.join(ROOT_DIR, 'parse_data') + DUMP_CONVERT_DIR = os.path.join(DATA_ROOT_DIR, 'dump_convert') + COMPARE_DIR = os.path.join(DATA_ROOT_DIR, 'compare_result') + BATCH_DUMP_CONVERT_DIR = os.path.join(DATA_ROOT_DIR, 'batch_dump_convert') + BATCH_COMPARE_DIR = os.path.join(DATA_ROOT_DIR, 'batch_compare_result') + OFFLINE_DUMP_CONVERT_PATTERN = \ + r"^([A-Za-z0-9_-]+)\.([A-Za-z0-9_-]+)\.([0-9]+)(\.[0-9]+)?\.([0-9]{1,255})" \ + r"\.([a-z]+)\.([0-9]{1,255})(\.[x0-9]+)?\.npy$" + NUMPY_PATTERN = r".*\.npy$" + NPY_SUFFIX = ".npy" + PKL_SUFFIX = ".pkl" + DIRECTORY_LENGTH = 4096 + FILE_NAME_LENGTH = 255 + FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$' + ONE_GB = 1 * 1024 * 1024 * 1024 + TEN_GB = 10 * 1024 * 1024 * 1024 + FLOAT_TYPE = [np.half, np.single, float, np.double, np.float64, np.longdouble, np.float32, np.float16] + HEADER = r""" ____ + / __ \____ ______________ + / /_/ / __ `/ ___/ ___/ _ \ + / ____/ /_/ / / (__ ) __/ + /_/ \__,_/_/ /____/\___/ + + """ diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/file_desc.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/file_desc.py new file mode 100644 index 0000000000000000000000000000000000000000..14ba27277168bc110b38287afbba957b69f8cdff --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/file_desc.py @@ -0,0 +1,31 @@ +# coding=utf-8 +import os + + +class FileDesc(object): + def __init__(self, file_name, dir_path, timestamp=-1): + self.file_name = file_name + self.dir_path = dir_path + self.path = os.path.join(dir_path, file_name) + self.timestamp = timestamp + self.idx = 0 + if self.timestamp == -1: + self.timestamp = os.path.getmtime(self.path) + + +class NpuDumpFileDesc(FileDesc): + def __init__(self, file_name, dir_path, timestamp, op_name, op_type, task_id, stream_id=0): + super(NpuDumpFileDesc, self).__init__(file_name, dir_path, timestamp) + self.op_name = op_name + self.op_type = op_type + self.task_id = task_id + stream_id = 0 if stream_id is None else int(stream_id) + self.stream_id = stream_id + self.idx = dir_path.split(os.sep)[-1] + + +class DumpDecodeFileDesc(NpuDumpFileDesc): + def __init__(self, file_name, dir_path, timestamp, op_name, op_type, task_id, anchor_type, anchor_idx): + super(DumpDecodeFileDesc, self).__init__(file_name, dir_path, timestamp, op_name, op_type, task_id) + self.type = anchor_type + self.idx = anchor_idx diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/interactive_cli.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/interactive_cli.py new file mode 100644 index 0000000000000000000000000000000000000000..1ea7dd30153e458b758dc0a79779b54a25fe8289 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/interactive_cli.py @@ -0,0 +1,102 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +import cmd +import argparse +from msprobe.pytorch.parse_tool.lib.parse_tool import ParseTool +from msprobe.pytorch.parse_tool.lib.utils import Util +from msprobe.pytorch.parse_tool.lib.config import Const +from msprobe.pytorch.parse_tool.lib.parse_exception import catch_exception + + +class InteractiveCli(cmd.Cmd): + def __init__(self): + super().__init__() + self.prompt = "Parse >>> " + self.parse_tool = ParseTool() + self.util = Util() + self.util.print_panel(Const.HEADER) + self.prepare() + + @staticmethod + def _parse_argv(line, insert=None): + argv = line.split() if line != "" else [] + if "-h" in argv: + return argv + if insert is not None and len(argv) and argv[0] != insert: + argv.insert(0, insert) + return argv + + def prepare(self): + self.parse_tool.prepare() + + @catch_exception + def default(self, line=""): + self.util.execute_command(line) + return False + + @catch_exception + def do_run(self, line=""): + self.util.execute_command(line) + + @catch_exception + def do_vc(self, line=""): + parser = argparse.ArgumentParser() + parser.add_argument( + "-m", "--my_dump_path", dest="my_dump_path", default=None, + help=" my dump path, the data compared with golden data", + required=True + ) + parser.add_argument( + "-g", "--golden_dump_path", dest="golden_dump_path", default=None, + help=" the golden dump data path", + required=True + ) + parser.add_argument( + "-out", "--output_path", dest="output_path", default=None, + help=" the output path", + required=False + ) + parser.add_argument( + "-cmp_path", "--msaccucmp_path", dest="msaccucmp_path", default=None, + help=" the msaccucmp.py file path", + required=False + ) + args = parser.parse_args(self._parse_argv(line)) + self.util.check_path_valid(args.my_dump_path) + self.util.check_path_valid(args.golden_dump_path) + self.util.check_files_in_path(args.my_dump_path) + self.util.check_files_in_path(args.golden_dump_path) + if self.util.dir_contains_only(args.my_dump_path, ".npy") and \ + self.util.dir_contains_only(args.golden_dump_path, ".npy"): + self.parse_tool.do_compare_converted_dir(args) + else: + self.parse_tool.do_vector_compare(args) + + def do_dc(self, line=""): + self.parse_tool.do_convert_dump(self._parse_argv(line)) + + def do_pt(self, line=""): + self.parse_tool.do_print_data(self._parse_argv(line)) + + def do_pk(self, line=""): + self.parse_tool.do_parse_pkl(self._parse_argv(line)) + + def do_cn(self, line=''): + self.parse_tool.do_compare_data(self._parse_argv(line)) + + def do_cad(self, line=''): + self.parse_tool.do_convert_api_dir(self._parse_argv(line)) diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_exception.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_exception.py new file mode 100644 index 0000000000000000000000000000000000000000..7525230cedc7ff11d4112a55998c6414e8f09217 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_exception.py @@ -0,0 +1,54 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. 
+# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +import logging +from msprobe.core.common.exceptions import FileCheckException + + +class ParseException(Exception): + + PARSE_INVALID_PATH_ERROR = 0 + PARSE_NO_FILE_ERROR = 1 + PARSE_NO_MODULE_ERROR = 2 + PARSE_INVALID_DATA_ERROR = 3 + PARSE_INVALID_FILE_FORMAT_ERROR = 4 + PARSE_UNICODE_ERROR = 5 + PARSE_JSONDECODE_ERROR = 6 + PARSE_MSACCUCMP_ERROR = 7 + PARSE_LOAD_NPY_ERROR = 8 + PARSE_INVALID_PARAM_ERROR = 9 + + def __init__(self, code, error_info=""): + super(ParseException, self).__init__() + self.error_info = error_info + self.code = code + + +def catch_exception(func): + def inner(*args, **kwargs): + log = logging.getLogger() + line = args[-1] if len(args) == 2 else "" + result = None + try: + result = func(*args, **kwargs) + except OSError: + log.error("%s: command not found" % line) + except ParseException: + log.error("Command execution failed") + except FileCheckException: + log.error("Command execution failed") + return result + return inner diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_tool.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_tool.py new file mode 100644 index 0000000000000000000000000000000000000000..9a47dc54cf9e15d65eb5fb8d3bae54358777051c --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/parse_tool.py @@ -0,0 +1,158 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
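`catch_exception` above turns known failures into log messages so a bad command does not kill the interactive shell. Below is a self-contained sketch of the same pattern; `FakeParseException` and `do_pt` are simplified stand-ins, not the real classes.

```python
import logging

logging.basicConfig(level=logging.INFO)


class FakeParseException(Exception):
    """Simplified stand-in for ParseException."""
    PARSE_INVALID_PATH_ERROR = 0

    def __init__(self, code, error_info=""):
        super().__init__()
        self.code = code
        self.error_info = error_info


def catch_exception(func):
    # Same idea as the decorator above: log known failures and return None
    # so the interactive loop keeps running.
    def inner(*args, **kwargs):
        log = logging.getLogger()
        try:
            return func(*args, **kwargs)
        except FakeParseException:
            log.error("Command execution failed")
        return None
    return inner


@catch_exception
def do_pt(line=""):
    raise FakeParseException(FakeParseException.PARSE_INVALID_PATH_ERROR, "bad path")


do_pt("-n missing.npy")  # logs "Command execution failed" instead of raising
```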
+""" +import argparse +import os +from collections import namedtuple + +from msprobe.pytorch.parse_tool.lib.config import Const +from msprobe.pytorch.parse_tool.lib.utils import Util +from msprobe.pytorch.parse_tool.lib.compare import Compare +from msprobe.pytorch.parse_tool.lib.visualization import Visualization +from msprobe.pytorch.parse_tool.lib.parse_exception import catch_exception, ParseException + + +class ParseTool: + def __init__(self): + self.util = Util() + self.compare = Compare() + self.visual = Visualization() + + @catch_exception + def prepare(self): + self.util.create_dir(Const.DATA_ROOT_DIR) + + @catch_exception + def do_vector_compare(self, args): + if not args.output_path: + result_dir = os.path.join(Const.COMPARE_DIR) + else: + result_dir = args.output_path + my_dump_path = args.my_dump_path + golden_dump_path = args.golden_dump_path + if not os.path.isdir(my_dump_path) or not os.path.isdir(golden_dump_path): + self.util.log.error("Please enter a directory not a file") + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + msaccucmp_path = self.util.path_strip(args.msaccucmp_path) if args.msaccucmp_path else Const.MS_ACCU_CMP_PATH + self.util.check_path_valid(msaccucmp_path) + self.util.check_executable_file(msaccucmp_path) + self.compare.npu_vs_npu_compare(my_dump_path, golden_dump_path, result_dir, msaccucmp_path) + + @catch_exception + def do_convert_dump(self, argv=None): + parser = argparse.ArgumentParser() + parser.add_argument( + '-n', '--name', dest='path', default=None, required=True, help='dump file or dump file directory') + parser.add_argument( + '-f', '--format', dest='format', default=None, required=False, help='target format') + parser.add_argument( + '-out', '--output_path', dest='output_path', required=False, default=None, help='output path') + parser.add_argument( + "-cmp_path", "--msaccucmp_path", dest="msaccucmp_path", default=None, + help=" the msaccucmp.py file path", required=False) + args = parser.parse_args(argv) + self.util.check_path_valid(args.path) + self.util.check_files_in_path(args.path) + msaccucmp_path = self.util.path_strip(args.msaccucmp_path) if args.msaccucmp_path else Const.MS_ACCU_CMP_PATH + self.util.check_path_valid(msaccucmp_path) + self.util.check_executable_file(msaccucmp_path) + if args.format: + self.util.check_str_param(args.format) + self.compare.convert_dump_to_npy(args.path, args.format, args.output_path, msaccucmp_path) + + @catch_exception + def do_print_data(self, argv=None): + """print tensor data""" + parser = argparse.ArgumentParser() + parser.add_argument('-n', '--name', dest='path', default=None, required=True, help='File name') + args = parser.parse_args(argv) + self.visual.print_npy_data(args.path) + + @catch_exception + def do_parse_pkl(self, argv=None): + parser = argparse.ArgumentParser() + parser.add_argument( + '-f', '--file', dest='file_name', default=None, required=True, help='PKL file path') + parser.add_argument( + '-n', '--name', dest='api_name', default=None, required=True, help='API name') + args = parser.parse_args(argv) + self.visual.parse_pkl(args.file_name, args.api_name) + + @catch_exception + def do_compare_data(self, argv): + """compare two tensor""" + parser = argparse.ArgumentParser() + parser.add_argument( + "-m", "--my_dump_path", dest="my_dump_path", default=None, + help=" my dump path, the data compared with golden data", + required=True + ) + parser.add_argument( + "-g", "--golden_dump_path", dest="golden_dump_path", default=None, + help=" the golden dump data path", + 
required=True + ) + parser.add_argument('-p', '--print', dest='count', default=20, type=int, help='print err data num') + parser.add_argument('-s', '--save', dest='save', action='store_true', help='save data in txt format') + parser.add_argument('-al', '--atol', dest='atol', default=0.001, type=float, help='set rtol') + parser.add_argument('-rl', '--rtol', dest='rtol', default=0.001, type=float, help='set atol') + args = parser.parse_args(argv) + self.util.check_path_valid(args.my_dump_path) + self.util.check_path_valid(args.golden_dump_path) + self.util.check_path_format(args.my_dump_path, Const.NPY_SUFFIX) + self.util.check_path_format(args.golden_dump_path, Const.NPY_SUFFIX) + compare_data_args = namedtuple('compare_data_args', ['my_dump_path', 'golden_dump_path', 'save', 'rtol', 'atol', 'count']) + compare_data_args.__new__.__defaults__ = (False, 0.001, 0.001, 20) + res = compare_data_args(args.my_dump_path, args.golden_dump_path, args.save, args.rtol, args.atol, args.count) + self.compare.compare_data(res) + + @catch_exception + def do_compare_converted_dir(self, args): + """compare two dir""" + my_dump_dir = self.util.path_strip(args.my_dump_path) + golden_dump_dir = self.util.path_strip(args.golden_dump_path) + if my_dump_dir == golden_dump_dir: + self.util.log.error("My directory path and golden directory path is same. Please check parameter" + " '-m' and '-g'.") + raise ParseException("My directory path and golden directory path is same.") + output_path = self.util.path_strip(args.output_path) if args.output_path else Const.BATCH_COMPARE_DIR + if not os.path.isdir(output_path): + os.makedirs(output_path, mode=0o750) + self.compare.compare_converted_dir(my_dump_dir, golden_dump_dir, output_path) + + @catch_exception + def do_convert_api_dir(self, argv=None): + parser = argparse.ArgumentParser() + parser.add_argument( + "-m", "--my_dump_path", dest="my_dump_path", default=None, + help=" my dump path, the data need to convert to npy files.", + required=True + ) + parser.add_argument( + '-out', '--output_path', dest='output_path', required=False, default=None, help='output path') + parser.add_argument( + "-asc", "--msaccucmp_path", dest="msaccucmp_path", default=None, + help=" the msaccucmp.py file path", required=False) + args = parser.parse_args(argv) + self.util.check_path_valid(args.my_dump_path) + self.util.check_files_in_path(args.my_dump_path) + output_path = self.util.path_strip(args.output_path) if args.output_path else \ + os.path.join(Const.BATCH_DUMP_CONVERT_DIR, self.util.localtime_str()) + msaccucmp_path = self.util.path_strip( + args.msaccucmp_path) if args.msaccucmp_path else Const.MS_ACCU_CMP_PATH + self.util.check_path_valid(msaccucmp_path) + self.util.check_executable_file(msaccucmp_path) + self.compare.convert_api_dir_to_npy(args.my_dump_path, None, output_path, msaccucmp_path) diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/utils.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..266e93fb3e727191a9e8a5307d436e3bff82fe6e --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/utils.py @@ -0,0 +1,367 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
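`do_compare_data` above packs the parsed CLI options into a `namedtuple` and sets trailing defaults through `__new__.__defaults__` before handing the record to the compare backend. A minimal sketch of that packing, with hypothetical `.npy` file names:

```python
from collections import namedtuple

# Same packing pattern as do_compare_data; trailing defaults cover save, rtol, atol, count.
CompareDataArgs = namedtuple(
    "CompareDataArgs",
    ["my_dump_path", "golden_dump_path", "save", "rtol", "atol", "count"]
)
CompareDataArgs.__new__.__defaults__ = (False, 0.001, 0.001, 20)

args = CompareDataArgs("my_op.output.0.npy", "golden_op.output.0.npy")
print(args.save, args.rtol, args.atol, args.count)   # False 0.001 0.001 20

# Explicit values override the defaults, as a `cn -m ... -g ... -p 50 -s` call would.
args = CompareDataArgs("my_op.output.0.npy", "golden_op.output.0.npy", save=True, count=50)
print(args.save, args.count)                         # True 50
```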
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +import logging +import os +import io +import re +import sys +import subprocess +import hashlib +import csv +import time +import numpy as np +from collections import namedtuple +from msprobe.pytorch.parse_tool.lib.config import Const +from msprobe.pytorch.parse_tool.lib.file_desc import DumpDecodeFileDesc, FileDesc +from msprobe.pytorch.parse_tool.lib.parse_exception import ParseException +from msprobe.core.common.file_check import change_mode, check_other_user_writable,\ + check_path_executable, check_path_owner_consistent +from msprobe.core.common.const import FileCheckConst +from msprobe.core.common.file_check import FileOpen +from msprobe.core.common.utils import check_file_or_directory_path +from msprobe.pytorch.common.log import logger + + +try: + from rich.traceback import install + from rich.panel import Panel + from rich.table import Table + from rich import print as rich_print + from rich.columns import Columns + + install() +except ImportError as err: + install = None + Panel = None + Table = None + Columns = None + rich_print = None + logger.warning( + "Failed to import rich, Some features may not be available. Please run 'pip install rich' to fix it.") + + +class Util: + def __init__(self): + self.ms_accu_cmp = None + logging.basicConfig( + level=Const.LOG_LEVEL, + format="%(asctime)s (%(process)d) -[%(levelname)s]%(message)s", + datefmt="%Y-%m-%d %H:%M:%S" + ) + self.log = logging.getLogger() + self.python = sys.executable + + @staticmethod + def print(content): + rich_print(content) + + @staticmethod + def path_strip(path): + return path.strip("'").strip('"') + + @staticmethod + def check_executable_file(path): + check_path_owner_consistent(path) + check_other_user_writable(path) + check_path_executable(path) + + @staticmethod + def get_subdir_count(self, directory): + subdir_count = 0 + for _, dirs, _ in os.walk(directory): + subdir_count += len(dirs) + break + return subdir_count + + @staticmethod + def get_subfiles_count(self, directory): + file_count = 0 + for _, _, files in os.walk(directory): + file_count += len(files) + return file_count + + @staticmethod + def get_sorted_subdirectories_names(self, directory): + subdirectories = [] + for item in os.listdir(directory): + item_path = os.path.join(directory, item) + if os.path.isdir(item_path): + subdirectories.append(item) + return sorted(subdirectories) + + @staticmethod + def get_sorted_files_names(self, directory): + files = [] + for item in os.listdir(directory): + item_path = os.path.join(directory, item) + if os.path.isfile(item_path): + files.append(item) + return sorted(files) + + @staticmethod + def check_npy_files_valid_in_dir(self, dir_path): + for file_name in os.listdir(dir_path): + file_path = os.path.join(dir_path, file_name) + check_file_or_directory_path(file_path) + _, file_extension = os.path.splitext(file_path) + if not file_extension == '.npy': + return False + return True + + @staticmethod + def get_md5_for_numpy(self, obj): + np_bytes = obj.tobytes() + md5_hash = hashlib.md5(np_bytes) + return md5_hash.hexdigest() + + @staticmethod + def write_csv(self, data, filepath): + 
need_change_mode = False + if not os.path.exists(filepath): + need_change_mode = True + with FileOpen(filepath, 'a') as f: + writer = csv.writer(f) + writer.writerows(data) + if need_change_mode: + change_mode(filepath, FileCheckConst.DATA_FILE_AUTHORITY) + + @staticmethod + def deal_with_dir_or_file_inconsistency(self, output_path): + if os.path.exists(output_path): + os.remove(output_path) + raise ParseException("Inconsistent directory structure or file.") + + @staticmethod + def deal_with_value_if_has_zero(self, data): + if data.dtype in Const.FLOAT_TYPE: + zero_mask = (data == 0) + # 给0的地方加上eps防止除0 + data[zero_mask] += np.finfo(data.dtype).eps + else: + # int type + float eps 会报错,所以这里要强转 + data = data.astype(float) + zero_mask = (data == 0) + data[zero_mask] += np.finfo(float).eps + return data + + @staticmethod + def dir_contains_only(self, path, endfix): + for _, _, files in os.walk(path): + for file in files: + if not file.endswith(endfix): + return False + return True + + @staticmethod + def localtime_str(self): + return time.strftime("%Y%m%d%H%M%S", time.localtime()) + + @staticmethod + def change_filemode_safe(self, path): + change_mode(path, FileCheckConst.DATA_FILE_AUTHORITY) + + @staticmethod + def _gen_npu_dump_convert_file_info(name, match, dir_path): + return DumpDecodeFileDesc(name, dir_path, int(match.groups()[-4]), op_name=match.group(2), + op_type=match.group(1), task_id=int(match.group(3)), anchor_type=match.groups()[-3], + anchor_idx=int(match.groups()[-2])) + + @staticmethod + def _gen_numpy_file_info(name, math, dir_path): + return FileDesc(name, dir_path) + + def execute_command(self, cmd): + if not cmd: + self.log.error("Commond is None") + return -1 + self.log.debug("[RUN CMD]: %s", cmd) + cmd = cmd.split(" ") + complete_process = subprocess.run(cmd, shell=False) + return complete_process.returncode + + def print_panel(self, content, title='', fit=True): + if not Panel: + self.print(content) + return + if fit: + self.print(Panel.fit(content, title=title)) + else: + self.print(Panel(content, title=title)) + + def check_msaccucmp(self, target_file): + if os.path.split(target_file)[-1] != Const.MS_ACCU_CMP_FILE_NAME: + self.log.error( + "Check msaccucmp failed in dir %s. 
This is not a correct msaccucmp file" % target_file) + raise ParseException(ParseException.PARSE_MSACCUCMP_ERROR) + result = subprocess.run( + [self.python, target_file, "--help"], stdout=subprocess.PIPE) + if result.returncode == 0: + self.log.info("Check [%s] success.", target_file) + else: + self.log.error("Check msaccucmp failed in dir %s" % target_file) + self.log.error("Please specify a valid msaccucmp.py path or install the cann package") + raise ParseException(ParseException.PARSE_MSACCUCMP_ERROR) + return target_file + + def create_dir(self, path): + path = self.path_strip(path) + if os.path.exists(path): + return + self.check_path_name(path) + try: + os.makedirs(path, mode=FileCheckConst.DATA_DIR_AUTHORITY) + except OSError as e: + self.log.error("Failed to create %s.", path) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) from e + + def gen_npy_info_txt(self, source_data): + (shape, dtype, max_data, min_data, mean) = \ + self.npy_info(source_data) + return \ + '[Shape: %s] [Dtype: %s] [Max: %s] [Min: %s] [Mean: %s]' % (shape, dtype, max_data, min_data, mean) + + def save_npy_to_txt(self, data, dst_file='', align=0): + if os.path.exists(dst_file): + self.log.info("Dst file %s exists, will not save new one.", dst_file) + return + shape = data.shape + data = data.flatten() + if align == 0: + align = 1 if len(shape) == 0 else shape[-1] + elif data.size % align != 0: + pad_array = np.zeros((align - data.size % align,)) + data = np.append(data, pad_array) + np.savetxt(dst_file, data.reshape((-1, align)), delimiter=' ', fmt='%g') + change_mode(dst_file, FileCheckConst.DATA_FILE_AUTHORITY) + + def list_convert_files(self, path, external_pattern=""): + return self.list_file_with_pattern( + path, Const.OFFLINE_DUMP_CONVERT_PATTERN, external_pattern, self._gen_npu_dump_convert_file_info + ) + + def list_numpy_files(self, path, extern_pattern=''): + return self.list_file_with_pattern(path, Const.NUMPY_PATTERN, extern_pattern, + self._gen_numpy_file_info) + + def create_columns(self, content): + if not Columns: + self.log.error("No module named rich, please install it") + raise ParseException(ParseException.PARSE_NO_MODULE_ERROR) + return Columns(content) + + def create_table(self, title, columns): + if not Table: + self.log.error("No module named rich, please install it and restart parse tool") + raise ParseException(ParseException.PARSE_NO_MODULE_ERROR) + table = Table(title=title) + for column_name in columns: + table.add_column(column_name, overflow='fold') + return table + + def check_path_valid(self, path): + path = self.path_strip(path) + if not path or not os.path.exists(path): + self.log.error("The path %s does not exist." 
% path) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + if os.path.islink(path): + self.log.error('The file path {} is a soft link.'.format(path)) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + if len(os.path.realpath(path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(path)) > \ + Const.FILE_NAME_LENGTH: + self.log.error('The file path length exceeds limit.') + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + if not re.match(Const.FILE_PATTERN, os.path.realpath(path)): + self.log.error('The file path {} contains special characters.'.format(path)) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + if os.path.isfile(path): + file_size = os.path.getsize(path) + if path.endswith(Const.PKL_SUFFIX) and file_size > Const.ONE_GB: + self.log.error('The file {} size is greater than 1GB.'.format(path)) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + if path.endswith(Const.NPY_SUFFIX) and file_size > Const.TEN_GB: + self.log.error('The file {} size is greater than 10GB.'.format(path)) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + return True + + def check_files_in_path(self, path): + if os.path.isdir(path) and len(os.listdir(path)) == 0: + self.log.error("No files in %s." % path) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + + def npy_info(self, source_data): + if isinstance(source_data, np.ndarray): + data = source_data + else: + self.log.error("Invalid data, data is not ndarray") + raise ParseException(ParseException.PARSE_INVALID_DATA_ERROR) + if data.dtype == 'object': + self.log.error("Invalid data, data is object.") + raise ParseException(ParseException.PARSE_INVALID_DATA_ERROR) + if np.size(data) == 0: + self.log.error("Invalid data, data is empty") + raise ParseException(ParseException.PARSE_INVALID_DATA_ERROR) + npu_info_result = namedtuple('npu_info_result', ['shape', 'dtype', 'max', 'min', 'mean']) + res = npu_info_result(data.shape, data.dtype, data.max(), data.min(), data.mean()) + return res + + def list_file_with_pattern(self, path, pattern, extern_pattern, gen_info_func): + self.check_path_valid(path) + file_list = {} + re_pattern = re.compile(pattern) + for dir_path, _, file_names in os.walk(path, followlinks=True): + for name in file_names: + match = re_pattern.match(name) + if not match: + continue + if extern_pattern != '' and not re.match(extern_pattern, name): + continue + file_list[name] = gen_info_func(name, match, dir_path) + return file_list + + def check_path_format(self, path, suffix): + if os.path.isfile(path): + if not path.endswith(suffix): + self.log.error("%s is not a %s file." 
% (path, suffix)) + raise ParseException(ParseException.PARSE_INVALID_FILE_FORMAT_ERROR) + elif os.path.isdir(path): + self.log.error("Please specify a single file path") + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + else: + self.log.error("The file path %s is invalid" % path) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + + def check_path_name(self, path): + if len(os.path.realpath(path)) > Const.DIRECTORY_LENGTH or len(os.path.basename(path)) > \ + Const.FILE_NAME_LENGTH: + self.log.error('The file path length exceeds limit.') + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + if not re.match(Const.FILE_PATTERN, os.path.realpath(path)): + self.log.error('The file path {} contains special characters.'.format(path)) + raise ParseException(ParseException.PARSE_INVALID_PATH_ERROR) + + def check_str_param(self, param): + if len(param) > Const.FILE_NAME_LENGTH: + self.log.error('The parameter length exceeds limit') + raise ParseException(ParseException.PARSE_INVALID_PARAM_ERROR) + if not re.match(Const.FILE_PATTERN, param): + self.log.error('The parameter {} contains special characters.'.format(param)) + raise ParseException(ParseException.PARSE_INVALID_PARAM_ERROR) + + def is_subdir_count_equal(self, dir1, dir2): + dir1_count = self.get_subdir_count(dir1) + dir2_count = self.get_subdir_count(dir2) + return dir1_count == dir2_count diff --git a/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/visualization.py b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/visualization.py new file mode 100644 index 0000000000000000000000000000000000000000..5e37b58d0b9fae8ad4e69ec58a4498dae9bd33b3 --- /dev/null +++ b/debug/accuracy_tools/msprobe/pytorch/parse_tool/lib/visualization.py @@ -0,0 +1,90 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
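`check_path_valid` and `check_path_name` above enforce the `DIRECTORY_LENGTH`, `FILE_NAME_LENGTH` and `FILE_PATTERN` limits defined in `Const`. The reduced sketch below mirrors only the name checks and assumes the current working directory itself contains no disallowed characters; the sample paths are hypothetical.

```python
import os
import re

# Limits copied from Const; the checks mirror Util.check_path_name.
DIRECTORY_LENGTH = 4096
FILE_NAME_LENGTH = 255
FILE_PATTERN = r'^[a-zA-Z0-9_./-]+$'


def is_path_name_valid(path):
    real_path = os.path.realpath(path)
    if len(real_path) > DIRECTORY_LENGTH or len(os.path.basename(path)) > FILE_NAME_LENGTH:
        return False   # path or file name too long
    if not re.match(FILE_PATTERN, real_path):
        return False   # characters outside [a-zA-Z0-9_./-] are rejected
    return True


print(is_path_name_valid("./parse_data/dump_convert/op.output.0.npy"))  # True
print(is_path_name_valid("bad name with spaces.npy"))                   # False
```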
+""" +import json +import numpy as np + +from msprobe.pytorch.parse_tool.lib.config import Const +from msprobe.pytorch.parse_tool.lib.utils import Util +from msprobe.pytorch.parse_tool.lib.parse_exception import ParseException +from msprobe.core.common.file_check import FileOpen + + +class Visualization: + def __init__(self): + self.util = Util() + + def print_npy_summary(self, target_file): + try: + np_data = np.load(target_file, allow_pickle=True) + except UnicodeError as e: + self.util.log.error("%s %s" % ("UnicodeError", str(e))) + self.util.log.warning("Please check the npy file") + raise ParseException(ParseException.PARSE_UNICODE_ERROR) from e + table = self.util.create_table('', ['Index', 'Data']) + flatten_data = np_data.flatten() + tablesize = 8 + for i in range(min(16, int(np.ceil(flatten_data.size / tablesize)))): + last_idx = min(flatten_data.size, i * tablesize + tablesize) + table.add_row(str(i * tablesize), ' '.join(flatten_data[i * tablesize: last_idx].astype('str').tolist())) + summary = ['[yellow]%s[/yellow]' % self.util.gen_npy_info_txt(np_data), 'Path: %s' % target_file, + "TextFile: %s.txt" % target_file] + self.util.print_panel(self.util.create_columns([table, "\n".join(summary)]), target_file) + self.util.save_npy_to_txt(np_data, target_file + ".txt") + + def print_npy_data(self, file_name): + file_name = self.util.path_strip(file_name) + self.util.check_path_valid(file_name) + self.util.check_path_format(file_name, Const.NPY_SUFFIX) + return self.print_npy_summary(file_name) + + def parse_pkl(self, path, api_name): + path = self.util.path_strip(path) + self.util.check_path_valid(path) + self.util.check_path_format(path, Const.PKL_SUFFIX) + self.util.check_str_param(api_name) + with FileOpen(path, "r") as pkl_handle: + title_printed = False + while True: + pkl_line = pkl_handle.readline() + if pkl_line == '\n': + continue + if len(pkl_line) == 0: + break + try: + msg = json.loads(pkl_line) + except json.JSONDecodeError as e: + self.util.log.error("%s %s in line %s" % ("JSONDecodeError", str(e), pkl_line)) + self.util.log.warning("Please check the pkl file") + raise ParseException(ParseException.PARSE_JSONDECODE_ERROR) from e + info_prefix = msg[0] + if not info_prefix.startswith(api_name): + continue + if info_prefix.find("stack_info") != -1 and len(msg) == 2: + self.util.log.info("\nTrace back({}):".format(msg[0])) + if msg[1] and len(msg[1]) > 4: + for item in reversed(msg[1]): + self.util.log.info(" File \"{}\", line {}, in {}".format(item[0], item[1], item[2])) + self.util.log.info(" {}".format(item[3])) + continue + if len(msg) > 5 and len(msg[5]) >= 3: + summery_info = " [{}][dtype: {}][shape: {}][max: {}][min: {}][mean: {}]" \ + .format(msg[0], msg[3], msg[4], msg[5][0], msg[5][1], msg[5][2]) + if not title_printed: + self.util.log.info("\nStatistic Info:") + title_printed = True + self.util.log.info(summery_info) + pkl_handle.close() diff --git a/debug/accuracy_tools/atat/pytorch/pt_config.py b/debug/accuracy_tools/msprobe/pytorch/pt_config.py similarity index 56% rename from debug/accuracy_tools/atat/pytorch/pt_config.py rename to debug/accuracy_tools/msprobe/pytorch/pt_config.py index a0691915cffc93b4a4505b2453560043b44cdc40..a3d765f3a4dd8787c4cd5ebd78ef4f80b5559c5d 100644 --- a/debug/accuracy_tools/atat/pytorch/pt_config.py +++ b/debug/accuracy_tools/msprobe/pytorch/pt_config.py @@ -1,11 +1,12 @@ -import os import json -from ..core.common_config import CommonConfig, BaseConfig -from ..core.utils import Const -from ..core.file_check_util import FileOpen 
+import os + +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.core.common.file_check import FileOpen +from msprobe.core.common.const import Const +from msprobe.pytorch.hook_module.utils import WrapFunctionalOps, WrapTensorOps, WrapTorchOps -#特定任务配置类 class TensorConfig(BaseConfig): def __init__(self, json_config): super().__init__(json_config) @@ -26,7 +27,7 @@ class StatisticsConfig(BaseConfig): def _check_summary_mode(self): if self.summary_mode and self.summary_mode not in ["statistics", "md5"]: raise Exception("summary_mode is invalid") - + class OverflowCheckConfig(BaseConfig): def __init__(self, json_config): @@ -34,13 +35,14 @@ class OverflowCheckConfig(BaseConfig): self.overflow_num = json_config.get("overflow_nums") self.check_mode = json_config.get("check_mode") self.check_overflow_config() - + def check_overflow_config(self): if self.overflow_num is not None and not isinstance(self.overflow_num, int): raise Exception("overflow_num is invalid") if self.check_mode is not None and self.check_mode not in ["all", "aicore", "atomic"]: raise Exception("check_mode is invalid") - + + class FreeBenchmarkCheckConfig(BaseConfig): def __init__(self, json_config): super().__init__(json_config) @@ -55,23 +57,59 @@ class FreeBenchmarkCheckConfig(BaseConfig): self.check_freebenchmark_config() def check_freebenchmark_config(self): - if self.if_preheat and self.handler_type == "fix": + if self.if_preheat and self.handler_type == "fix": raise Exception("Preheating is not supported in fix handler type") + if self.preheat_step and self.preheat_step == 0: + raise Exception("preheat_step cannot be 0") + + +class RunUTConfig(BaseConfig): + WrapApi = set(WrapFunctionalOps) | set(WrapTensorOps) | set(WrapTorchOps) + def __init__(self, json_config): + super().__init__(json_config) + self.white_list = json_config.get("white_list", Const.DEFAULT_LIST) + self.black_list = json_config.get("black_list", Const.DEFAULT_LIST) + self.error_data_path = json_config.get("error_data_path", Const.DEFAULT_PATH) + self.check_run_ut_config() + + @classmethod + def check_filter_list_config(cls, key, filter_list): + if not isinstance(filter_list, list): + raise Exception("%s must be a list type" % key) + if not all(isinstance(item, str) for item in filter_list): + raise Exception("All elements in %s must be string type" % key) + invalid_api = [item for item in filter_list if item not in cls.WrapApi] + if invalid_api: + raise Exception("Invalid api in %s: %s" % (key, invalid_api)) + + @classmethod + def check_error_data_path_config(cls, error_data_path): + if not os.path.exists(error_data_path): + raise Exception("error_data_path: %s does not exist" % error_data_path) + + def check_run_ut_config(self): + RunUTConfig.check_filter_list_config(Const.WHITE_LIST, self.white_list) + RunUTConfig.check_filter_list_config(Const.BLACK_LIST, self.black_list) + RunUTConfig.check_error_data_path_config(self.error_data_path) + def parse_task_config(task, json_config): default_dic = {} if task == Const.TENSOR: - config_dic = json_config.get(Const.TENSOR) if json_config.get(Const.TENSOR) else default_dic + config_dic = json_config.get(Const.TENSOR, default_dic) return TensorConfig(config_dic) elif task == Const.STATISTICS: - config_dic = json_config.get(Const.STATISTICS) if json_config.get(Const.STATISTICS) else default_dic + config_dic = json_config.get(Const.STATISTICS, default_dic) return StatisticsConfig(config_dic) elif task == Const.OVERFLOW_CHECK: - config_dic = json_config.get(Const.OVERFLOW_CHECK) if 
json_config.get(Const.OVERFLOW_CHECK) else default_dic + config_dic = json_config.get(Const.OVERFLOW_CHECK, default_dic) return OverflowCheckConfig(config_dic) elif task == Const.FREE_BENCHMARK: - config_dic = json_config.get(Const.FREE_BENCHMARK) if json_config.get(Const.FREE_BENCHMARK) else default_dic + config_dic = json_config.get(Const.FREE_BENCHMARK, default_dic) return FreeBenchmarkCheckConfig(config_dic) + elif task == Const.RUN_UT: + config_dic = json_config.get(Const.RUN_UT, default_dic) + return RunUTConfig(config_dic) else: return StatisticsConfig(default_dic) @@ -87,4 +125,4 @@ def parse_json_config(json_file_path, task): task_config = parse_task_config(task, json_config) else: task_config = parse_task_config(common_config.task, json_config) - return common_config, task_config \ No newline at end of file + return common_config, task_config diff --git a/debug/accuracy_tools/atat/pytorch/service.py b/debug/accuracy_tools/msprobe/pytorch/service.py similarity index 72% rename from debug/accuracy_tools/atat/pytorch/service.py rename to debug/accuracy_tools/msprobe/pytorch/service.py index 9c079aedebeec26120f27318bab1f1cdc8cf99cb..3a7b63623045e95143dc2a47c8f04b723edb684d 100644 --- a/debug/accuracy_tools/atat/pytorch/service.py +++ b/debug/accuracy_tools/msprobe/pytorch/service.py @@ -1,71 +1,60 @@ +import functools import os from pathlib import Path -import functools -import torch -from .functional import build_repair, build_data_collector, build_step_post_process -from .functional.scope import BaseScope -from .common.utils import get_rank_if_initialized, is_gpu, Const -from .common.file_check import FileChecker, FileCheckConst, check_path_before_create -from .common import print_info_log_rank_0 -from .hook_module.api_registry import api_register -from .hook_module import remove_dropout -from .functional.data_processor import ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs -from .module_processer import ModuleProcesser +from msprobe.pytorch.common.log import logger +from msprobe.core.common.file_check import FileChecker, check_path_before_create +from msprobe.core.common.const import Const, FileCheckConst +from msprobe.core.common.exceptions import DistributedNotInitializedError, MsprobeException +from msprobe.core.data_dump.data_collector import build_data_collector +from msprobe.core.data_dump.scope import BaseScope +from msprobe.core.data_dump.data_processor.base import ModuleForwardInputsOutputs, ModuleBackwardInputsOutputs +from msprobe.pytorch.common.utils import get_rank_if_initialized +from msprobe.pytorch.module_processer import ModuleProcesser +from msprobe.pytorch.hook_module import remove_dropout +from msprobe.pytorch.hook_module.api_registry import api_register -class Service: - make_dir_flag = True - REGISTER_HOOK_KWARGS = ["overflow_nums", "dump_mode", "dump_config"] +class Service: def __init__(self, config): self.model = None self.config = config self.data_collector = build_data_collector(config) self.module_processor = ModuleProcesser(self.data_collector.scope) - self.repair = build_repair(config) - self.step_post_process = build_step_post_process(config) self.switch = False self.current_iter = 0 self.first_start = True self.current_rank = None - self.first_touch_dir = True + self.dump_iter_dir = None def build_hook(self, module_type, name): - def pre_hook(repair, api_or_module_name, module, args, kwargs): - nonlocal module_type, pid + def pre_hook(api_or_module_name, module, args, kwargs): if module_type == BaseScope.Module_Type_Module: 
api_or_module_name = module.mindstudio_reserved_name self.data_collector.visit_and_clear_overflow_status(api_or_module_name) if not self.switch: return args, kwargs - if repair: - args, kwargs = repair.convert(api_or_module_name, module_type, args, kwargs) if self.data_collector: module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=None) self.data_collector.pre_forward_data_collect(api_or_module_name, module, pid, module_input_output) return args, kwargs - def forward_hook(repair, api_or_module_name, module, args, kwargs, output): - nonlocal module_type, pid + def forward_hook(api_or_module_name, module, args, kwargs, output): if module_type == BaseScope.Module_Type_Module: api_or_module_name = module.mindstudio_reserved_name self.data_collector.visit_and_clear_overflow_status(api_or_module_name) if not self.switch: - return + return None if self.data_collector: module_input_output = ModuleForwardInputsOutputs(args=args, kwargs=kwargs, output=output) self.data_collector.forward_data_collect(api_or_module_name, module, pid, module_input_output) if self.data_collector.if_return_forward_new_output(): return self.data_collector.get_forward_new_output() - if repair: - output = repair.invert(api_or_module_name, module_type, output) - return output - def backward_hook(repair, api_or_module_name, module, grad_input, grad_output): - nonlocal module_type, pid + def backward_hook(api_or_module_name, module, grad_input, grad_output): if module_type == BaseScope.Module_Type_Module: api_or_module_name = module.mindstudio_reserved_name self.data_collector.visit_and_clear_overflow_status(api_or_module_name) @@ -79,35 +68,37 @@ class Service: pid = os.getpid() forward_name_template = name + Const.FORWARD backward_name_template = name + Const.BACKWARD - pre_forward_hook = functools.partial(pre_hook, self.repair, forward_name_template) - forward_hook = functools.partial(forward_hook, self.repair, forward_name_template) - backward_hook = functools.partial(backward_hook, None, backward_name_template) + pre_forward_hook = functools.partial(pre_hook, forward_name_template) + forward_hook = functools.partial(forward_hook, forward_name_template) + backward_hook = functools.partial(backward_hook, backward_name_template) return pre_forward_hook, forward_hook, backward_hook def step(self): self.current_iter += 1 - if self.step_post_process: - self.step_post_process() self.data_collector.update_iter(self.current_iter) def start(self, model): self.model = model if self.config.step and self.current_iter > max(self.config.step): self.stop() - raise Exception("atat: exit after iteration {}".format(max(self.config.step))) + raise Exception("msprobe: exit after iteration {}".format(max(self.config.step))) if self.config.step and self.current_iter not in self.config.step: return if self.first_start: - self.current_rank = get_rank_if_initialized() + try: + self.current_rank = get_rank_if_initialized() + except DistributedNotInitializedError: + self.current_rank = None + if self.config.rank and self.current_rank not in self.config.rank: return self.register_hook_new() self.first_start = False self.switch = True - print_info_log_rank_0(f"Dump switch is turned on at step {self.current_iter}. ") + logger.info_on_rank_0(f"Dump switch is turned on at step {self.current_iter}. 
") if self.config.level != "L2": self.create_dirs() - print_info_log_rank_0(f"Dump data will be saved in {self.dump_iter_dir}.") + logger.info_on_rank_0(f"Dump data will be saved in {self.dump_iter_dir}.") def stop(self): if self.config.level == "L2": @@ -144,19 +135,15 @@ class Service: dump_file_path, stack_file_path, construct_file_path, dump_data_dir, free_benchmark_file_path) def register_hook_new(self): - hook_name = self.config.task - - if "overflow_check" in hook_name and not is_gpu: - pass - - print_info_log_rank_0("The {} hook function is successfully mounted to the model.".format(hook_name)) + logger.info_on_rank_0("The {} hook function is successfully mounted to the model.".format(self.config.task)) if self.config.level in ["L0", "mix"]: - assert self.model is not None - print_info_log_rank_0("The init dump mode is enabled, and the module dump function will not be available") + if self.model is None: + logger.error_log_with_exp("The model is None.", MsprobeException.INVALID_PARAM_ERROR) + logger.info_on_rank_0("The init dump mode is enabled, and the module dump function will not be available") for name, module in self.model.named_modules(): if module == self.model: continue - prefix = BaseScope.Module_Type_Module + Const.SEP + name + Const.SEP +\ + prefix = BaseScope.Module_Type_Module + Const.SEP + name + Const.SEP + \ module.__class__.__name__ + Const.SEP pre_forward_hook, forward_hook, backward_hook = self.build_hook(BaseScope.Module_Type_Module, prefix) @@ -176,5 +163,5 @@ class Service: api_register.initialize_hook(functools.partial(self.build_hook, BaseScope.Module_Type_API)) api_register.api_modularity() - if Const.STATISTICS in hook_name or Const.TENSOR in hook_name: + if Const.STATISTICS == self.config.task or Const.TENSOR == self.config.task: remove_dropout() diff --git a/debug/accuracy_tools/msprobe/test/core_ut/common/test_utils.py b/debug/accuracy_tools/msprobe/test/core_ut/common/test_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..edd3eb53dccf453f1d3efde7189dfadcd6dee000 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/common/test_utils.py @@ -0,0 +1,345 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2022-2023. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +import os +import uuid + +from unittest import TestCase +from unittest.mock import patch, MagicMock, mock_open + +from msprobe.core.common.log import logger +from msprobe.core.common.const import Const +from msprobe.core.common.utils import (CompareException, + check_seed_all, + check_inplace_op, + make_dump_path_if_not_exists, + check_mode_valid, + check_switch_valid, + check_dump_mode_valid, + check_summary_mode_valid, + check_summary_only_valid, + check_file_or_directory_path, + check_compare_param, + check_configuration_param, + is_starts_with, + _check_json, + check_json_file, + check_file_size, + check_regex_prefix_format_valid, + get_dump_data_path, + task_dumppath_get) +from msprobe.core.common.file_check import FileCheckConst + + +class TestUtils(TestCase): + @patch.object(logger, "error") + def test_check_seed_all(self, mock_error): + self.assertIsNone(check_seed_all(1234, True)) + self.assertIsNone(check_seed_all(0, True)) + self.assertIsNone(check_seed_all(Const.MAX_SEED_VALUE, True)) + + with self.assertRaises(CompareException) as context: + check_seed_all(-1, True) + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with(f"Seed must be between 0 and {Const.MAX_SEED_VALUE}.") + + with self.assertRaises(CompareException) as context: + check_seed_all(Const.MAX_SEED_VALUE + 1, True) + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with(f"Seed must be between 0 and {Const.MAX_SEED_VALUE}.") + + with self.assertRaises(CompareException) as context: + check_seed_all("1234", True) + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with("Seed must be integer.") + + with self.assertRaises(CompareException) as context: + check_seed_all(1234, 1) + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with("seed_all mode must be bool.") + + def test_check_inplace_op(self): + test_prefix_1 = "Distributed.broadcast.0.forward.input.0" + self.assertTrue(check_inplace_op(test_prefix_1)) + test_prefix_2 = "Distributed_broadcast_0_forward_input_0" + self.assertFalse(check_inplace_op(test_prefix_2)) + test_prefix_3 = "Torch.sum.0.backward.output.0" + self.assertFalse(check_inplace_op(test_prefix_3)) + + @patch.object(logger, "error") + def test_make_dump_path_if_not_exists(self, mock_error): + file_path = os.path.realpath(__file__) + dirname = os.path.dirname(file_path) + str(uuid.uuid4()) + + def test_mkdir(self, **kwargs): + raise OSError + + if not os.path.exists(dirname): + with patch("msprobe.core.common.utils.Path.mkdir", new=test_mkdir): + with self.assertRaises(CompareException) as context: + make_dump_path_if_not_exists(dirname) + self.assertEqual(context.exception.code, CompareException.INVALID_PATH_ERROR) + + make_dump_path_if_not_exists(file_path) + mock_error.assert_called_with(f"{file_path} already exists and is not a directory.") + + def test_check_mode_valid(self): + with self.assertRaises(ValueError) as context: + check_mode_valid("all", scope="scope") + self.assertEqual(str(context.exception), "scope param set invalid, it's must be a list.") + + with self.assertRaises(ValueError) as context: + check_mode_valid("all", api_list="api_list") + self.assertEqual(str(context.exception), "api_list param set invalid, it's must be a list.") + + mode = "all_list" + with self.assertRaises(CompareException) as context: + check_mode_valid(mode) + 
self.assertEqual(context.exception.code, CompareException.INVALID_DUMP_MODE) + self.assertEqual(str(context.exception), + f"Current mode '{mode}' is not supported. Please use the field in {Const.DUMP_MODE}") + + mode = "list" + with self.assertRaises(ValueError) as context: + check_mode_valid(mode) + self.assertEqual(str(context.exception), + "set_dump_switch, scope param set invalid, it's should not be an empty list.") + + @patch.object(logger, "error") + def test_check_switch_valid(self, mock_error): + with self.assertRaises(CompareException) as context: + check_switch_valid("Close") + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with("Please set switch with 'ON' or 'OFF'.") + + @patch.object(logger, "warning") + def test_check_dump_mode_valid(self, mock_warning): + dump_mode = check_dump_mode_valid("all") + mock_warning.assert_called_with("Please set dump_mode as a list.") + self.assertEqual(dump_mode, ["forward", "backward", "input", "output"]) + + with self.assertRaises(ValueError) as context: + check_dump_mode_valid("all_forward") + self.assertEqual(str(context.exception), + "Please set dump_mode as a list containing one or more of the following: " + + "'all', 'forward', 'backward', 'input', 'output'.") + + def test_check_summary_mode_valid(self): + with self.assertRaises(CompareException) as context: + check_summary_mode_valid("MD5") + self.assertEqual(context.exception.code, CompareException.INVALID_SUMMARY_MODE) + self.assertEqual(str(context.exception), "The summary_mode is not valid") + + @patch.object(logger, "error") + def test_check_summary_only_valid(self, mock_error): + summary_only = check_summary_only_valid(True) + self.assertTrue(summary_only) + + with self.assertRaises(CompareException) as context: + check_summary_only_valid("True") + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with("Params summary_only only support True or False.") + + def test_check_file_or_directory_path(self): + class TestFileChecker: + file_path = "" + path_type = "" + ability = "" + checked = False + + def __init__(self, file_path, path_type, ability=None): + TestFileChecker.file_path = file_path + TestFileChecker.path_type = path_type + TestFileChecker.ability = ability + + def common_check(self): + TestFileChecker.checked = True + + file_path = os.path.realpath(__file__) + dirname = os.path.dirname(file_path) + + with patch("msprobe.core.common.utils.FileChecker", new=TestFileChecker): + check_file_or_directory_path(file_path, isdir=False) + self.assertTrue(TestFileChecker.checked) + self.assertEqual(TestFileChecker.file_path, file_path) + self.assertEqual(TestFileChecker.path_type, FileCheckConst.FILE) + self.assertEqual(TestFileChecker.ability, FileCheckConst.READ_ABLE) + + TestFileChecker.checked = False + with patch("msprobe.core.common.utils.FileChecker", new=TestFileChecker): + check_file_or_directory_path(dirname, isdir=True) + self.assertTrue(TestFileChecker.checked) + self.assertEqual(TestFileChecker.file_path, dirname) + self.assertEqual(TestFileChecker.path_type, FileCheckConst.DIR) + self.assertEqual(TestFileChecker.ability, FileCheckConst.WRITE_ABLE) + + @patch.object(logger, "error") + def test_check_compare_param(self, mock_error): + params = { + "npu_json_path": "npu_json_path", + "bench_json_path": "bench_json_path", + "stack_json_path": "stack_json_path", + "npu_dump_data_dir": "npu_dump_data_dir", + "bench_dump_data_dir": "bench_dump_data_dir" + } + + 
call_args = [ + ("npu_json_path", False), + ("bench_json_path", False), + ("stack_json_path", False), + ("npu_dump_data_dir", True), + ("bench_dump_data_dir", True), + ("output_path", True), + ("npu_json_path", False), + ("bench_json_path", False), + ("stack_json_path", False), + ("output_path", True) + ] + + with self.assertRaises(CompareException) as context: + check_compare_param("npu_json_path", "output_path") + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with("Invalid input parameters") + + mock_check_file_or_directory_path = MagicMock() + mock_check_json_file = MagicMock() + with patch("msprobe.core.common.utils.FileOpen", mock_open(read_data="")), \ + patch("msprobe.core.common.utils.check_json_file", new=mock_check_json_file), \ + patch("msprobe.core.common.utils.check_file_or_directory_path", new=mock_check_file_or_directory_path): + check_compare_param(params, "output_path") + check_compare_param(params, "output_path", summary_compare=False, md5_compare=True) + for i in range(len(call_args)): + self.assertEqual(mock_check_file_or_directory_path.call_args_list[i][0], call_args[i]) + self.assertEqual(len(mock_check_json_file.call_args[0]), 4) + self.assertEqual(mock_check_json_file.call_args[0][0], params) + + @patch.object(logger, "error") + def test_check_configuration_param(self, mock_error): + with self.assertRaises(CompareException) as context: + check_configuration_param(stack_mode="False", auto_analyze=True, fuzzy_match=False) + self.assertEqual(context.exception.code, CompareException.INVALID_PARAM_ERROR) + mock_error.assert_called_with("Invalid input parameters which should be only bool type.") + + def test_is_starts_with(self): + string = "input_slot0" + self.assertFalse(is_starts_with(string, [])) + self.assertFalse(is_starts_with("", ["input"])) + self.assertFalse(is_starts_with(string, ["output"])) + self.assertTrue(is_starts_with(string, ["input", "output"])) + + @patch.object(logger, "error") + def test__check_json(self, mock_error): + class TestOpen: + def __init__(self, string): + self.string = string + + def readline(self): + return self.string + + def seek(self, begin, end): + self.string = str(begin) + "_" + str(end) + + with self.assertRaises(CompareException) as context: + _check_json(TestOpen(""), "test.json") + self.assertEqual(context.exception.code, CompareException.INVALID_DUMP_FILE) + mock_error.assert_called_with("dump file test.json have empty line!") + + handler = TestOpen("jons file\n") + _check_json(handler, "test.json") + self.assertEqual(handler.string, "0_0") + + @patch("msprobe.core.common.utils._check_json") + def test_check_json_file(self, _mock_check_json): + input_param = { + "npu_json_path": "npu_json_path", + "bench_json_path": "bench_json_path", + "stack_json_path": "stack_json_path" + } + check_json_file(input_param, "npu_json", "bench_json", "stack_json") + self.assertEqual(_mock_check_json.call_args_list[0][0], ("npu_json", "npu_json_path")) + self.assertEqual(_mock_check_json.call_args_list[1][0], ("bench_json", "bench_json_path")) + self.assertEqual(_mock_check_json.call_args_list[2][0], ("stack_json", "stack_json_path")) + + @patch.object(logger, "error") + def test_check_file_size(self, mock_error): + with patch("msprobe.core.common.utils.os.path.getsize", return_value=120): + with self.assertRaises(CompareException) as context: + check_file_size("input_file", 100) + self.assertEqual(context.exception.code, CompareException.INVALID_FILE_ERROR) + 
mock_error.assert_called_with("The size (120) of input_file exceeds (100) bytes, tools not support.") + + def test_check_regex_prefix_format_valid(self): + prefix = "A" * 21 + with self.assertRaises(ValueError) as context: + check_regex_prefix_format_valid(prefix) + self.assertEqual(str(context.exception), f"Maximum length of prefix is {Const.REGEX_PREFIX_MAX_LENGTH}, " + f"while current length is {len(prefix)}") + + prefix = "(prefix)" + with self.assertRaises(ValueError) as context: + check_regex_prefix_format_valid(prefix) + self.assertEqual(str(context.exception), f"prefix contains invalid characters, " + f"prefix pattern {Const.REGEX_PREFIX_PATTERN}") + + @patch("msprobe.core.common.utils.check_file_or_directory_path") + def test_get_dump_data_path(self, mock_check_file_or_directory_path): + file_path = os.path.realpath(__file__) + dirname = os.path.dirname(file_path) + + dump_data_path, file_is_exist = get_dump_data_path(dirname) + self.assertEqual(mock_check_file_or_directory_path.call_args[0], (dirname, True)) + self.assertEqual(dump_data_path, dirname) + self.assertTrue(file_is_exist) + + @patch.object(logger, "error") + def test_task_dumppath_get(self, mock_error): + input_param = { + "npu_json_path": None, + "bench_json_path": "bench_json_path" + } + npu_json = { + "task": Const.TENSOR, + "dump_data_dir": "dump_data_dir", + "data": "data" + } + + with self.assertRaises(CompareException) as context: + task_dumppath_get(input_param) + self.assertEqual(context.exception.code, CompareException.INVALID_PATH_ERROR) + mock_error.assert_called_with("Please check the json path is valid.") + + input_param["npu_json_path"] = "npu_json_path" + with patch("msprobe.core.common.utils.FileOpen", mock_open(read_data="")), \ + patch("msprobe.core.common.utils.json.load", return_value=npu_json): + summary_compare, md5_compare = task_dumppath_get(input_param) + self.assertFalse(summary_compare) + self.assertFalse(md5_compare) + + npu_json["task"] = Const.STATISTICS + with patch("msprobe.core.common.utils.FileOpen", mock_open(read_data="")), \ + patch("msprobe.core.common.utils.json.load", return_value=npu_json), \ + patch("msprobe.core.common.utils.md5_find", return_value=True): + summary_compare, md5_compare = task_dumppath_get(input_param) + self.assertFalse(summary_compare) + self.assertTrue(md5_compare) + + npu_json["task"] = Const.OVERFLOW_CHECK + with patch("msprobe.core.common.utils.FileOpen", mock_open(read_data="")), \ + patch("msprobe.core.common.utils.json.load", return_value=npu_json): + with self.assertRaises(CompareException) as context: + task_dumppath_get(input_param) + self.assertEqual(context.exception.code, CompareException.INVALID_TASK_ERROR) + mock_error.assert_called_with("Compare is not required for overflow_check or free_benchmark.") diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_data_collector.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_data_collector.py new file mode 100644 index 0000000000000000000000000000000000000000..eedbe5be7e0360d7874439357419510cbde73b71 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_data_collector.py @@ -0,0 +1,47 @@ +import unittest +from unittest.mock import patch, mock_open, MagicMock + +from msprobe.core.common.utils import Const +from msprobe.core.data_dump.data_collector import DataCollector +from msprobe.pytorch.debugger.debugger_config import DebuggerConfig +from msprobe.pytorch.pt_config import parse_json_config + + +class TestDataCollector(unittest.TestCase): + def 
setUp(self): + mock_json_data = { + "dump_path": "./ut_dump", + } + with patch("msprobe.pytorch.pt_config.FileOpen", mock_open(read_data='')), \ + patch("msprobe.pytorch.pt_config.json.load", return_value=mock_json_data): + common_config, task_config = parse_json_config("./config.json", Const.STATISTICS) + config = DebuggerConfig(common_config, task_config, Const.STATISTICS, "./ut_dump", "L1") + self.data_collector = DataCollector(config) + + def test_update_data(self): + self.data_collector.config.task = Const.OVERFLOW_CHECK + self.data_collector.data_processor.has_overflow = True + with patch("msprobe.core.data_dump.json_writer.DataWriter.update_data", return_value=None): + result1 = self.data_collector.update_data("test message", "test1:") + self.assertEqual(result1, "test1:Overflow detected.") + + self.data_collector.data_processor.has_overflow = False + result2 = self.data_collector.update_data("test message", "test2:") + self.assertEqual(result2, "test2:No Overflow, OK.") + + self.data_collector.config.task = Const.STATISTICS + self.data_collector.data_processor.has_overflow = True + with patch("msprobe.core.data_dump.json_writer.DataWriter.update_data", return_value=None): + result3 = self.data_collector.update_data("test message", "test3") + self.assertEqual(result3, "test3") + + def test_pre_forward_data_collect(self): + self.data_collector.check_scope_and_pid = MagicMock(return_value=False) + self.data_collector.is_inplace = MagicMock(return_value=False) + self.data_collector.data_processor.analyze_pre_forward = MagicMock() + name = "TestModule.forward" + pid = 123 + + self.data_collector.pre_forward_data_collect(name, None, pid, None) + self.data_collector.check_scope_and_pid.assert_called_once_with( + self.data_collector.scope, "TestModule.backward", 123) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_json_writer.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_json_writer.py new file mode 100644 index 0000000000000000000000000000000000000000..cfb1b3d551aa225d165b6620b2bb3de906ce70ea --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_json_writer.py @@ -0,0 +1,183 @@ +import unittest +from msprobe.core.data_dump.json_writer import DataWriter + +import os +import csv +from msprobe.core.common.file_check import FileOpen +from msprobe.core.common import utils +from pathlib import Path +import json + +class TestDataWriter(unittest.TestCase): + def test_write_data_to_csv(self): + cur_path = os.path.dirname(os.path.realpath(__file__)) + file_path = os.path.join(cur_path, "test.csv") + + if os.path.exists(file_path): + utils.remove_path(file_path) + + data = {"A":"1", "B":"2", "C":"3"} + result = data.values() + header = data.keys() + DataWriter.write_data_to_csv(result, header, file_path) + with FileOpen(file_path, "r") as f: + reader = csv.DictReader(f) + column_first = [row for row in reader][0] + self.assertEqual(data, column_first) + + + + + data = {"A":"4", "B":"5", "C":"6"} + result = data.values() + header = data.keys() + DataWriter.write_data_to_csv(result, header, file_path) + with FileOpen(file_path, "r") as f: + reader = csv.DictReader(f) + column_last = [row for row in reader][-1] + self.assertEqual(data, column_last) + + utils.remove_path(file_path) + + def test_initialize_json_file(self): + cur_path = os.path.dirname(os.path.realpath(__file__)) + dump_tensor_data_dir = os.path.join(cur_path, "dump_tensor_data.json") + dump_file_path = os.path.join(cur_path, "dump_file.json") + stack_file_path = 
os.path.join(cur_path, "stack_file.json") + construct_file_path = os.path.join(cur_path, "construct_file.json") + if not os.path.exists(stack_file_path): + Path(stack_file_path).touch() + if not os.path.exists(construct_file_path): + Path(construct_file_path).touch() + + test = DataWriter() + test.stack_file_path = stack_file_path + test.dump_file_path = dump_file_path + test.dump_tensor_data_dir = dump_tensor_data_dir + test.construct_file_path = construct_file_path + + test.initialize_json_file() + + with open(dump_file_path) as f: + load_data = json.load(f) + result = {"dump_data_dir": dump_tensor_data_dir, "data": {}} + self.assertEqual(result, load_data) + is_exist_1 = os.path.exists(test.stack_file_path) + self.assertTrue(is_exist_1) + os.access(test.stack_file_path, os.R_OK) + os.access(test.stack_file_path, os.W_OK) + is_exist_2 = os.path.exists(test.construct_file_path) + self.assertTrue(is_exist_2) + os.access(test.construct_file_path, os.R_OK) + os.access(test.construct_file_path, os.W_OK) + + os.remove(construct_file_path) + os.remove(stack_file_path) + os.remove(dump_file_path) + + def test_update_dump_paths(self): + test = DataWriter() + self.assertTrue(test.dump_file_path == None) + + cur_path = os.path.dirname(os.path.realpath(__file__)) + test_path = os.path.join(cur_path, "test1.json") + + test.update_dump_paths(test_path, test_path, test_path, test_path, test_path) + self.assertTrue(test.dump_file_path == test_path) + self.assertTrue(test.stack_file_path == test_path) + self.assertTrue(test.construct_file_path == test_path) + self.assertTrue(test.dump_tensor_data_dir == test_path) + self.assertTrue(test.free_benchmark_file_path == test_path) + + def test_update_data(self): + data = {"A":"1", "B":"2", "C":{"D":"2"}} + test = DataWriter() + test.cache_data["data"]["test_1"] = True + test.cache_data["data"]["test_2"] = False + + test.update_data(data) + self.assertEqual(test.cache_data["data"]["A"], "1") + + new_data = {"C":{"F":3}} + test.update_data(new_data) + self.assertEqual(test.cache_data["data"]["C"]["F"], 3) + + + def test_flush_data_when_buffer_is_full_and_test_write_data_json(self): + data = {"A":"1", "B":"2", "data":{}} + test = DataWriter() + test.buffer_size = 1 + test.cache_data["data"] = {"A":"1", "B":"2", "C":"3"} + + self.assertTrue(len(test.cache_data["data"]) >= test.buffer_size) + cur_path = os.path.dirname(os.path.realpath(__file__)) + dump_tensor_data_dir = os.path.join(cur_path, "dump_tensor_data.json") + dump_file_path = os.path.join(cur_path, "dump_file.json") + stack_file_path = os.path.join(cur_path, "stack_file.json") + construct_file_path = os.path.join(cur_path, "construct_file.json") + + test.dump_file_path = dump_file_path + test.dump_tensor_data_dir = dump_tensor_data_dir + + with open(dump_file_path, "w") as f: + dump_data = json.dumps(data) + f.write(dump_data) + + test.flush_data_when_buffer_is_full() + + with open(dump_file_path, "r") as f: + new_data = json.load(f) + + data.update({"data": {"A":"1", "B":"2", "C":"3"}}) + self.assertEqual(new_data, data) + + self.assertTrue(test.cache_data["data"] == {}) + os.remove(dump_file_path) + + + def test_update_stack(self): + data = {"A":"1", "B":"2", "data":{}} + test = DataWriter() + test.update_stack(data) + self.assertEqual(test.cache_stack, data) + + def test_update_construct(self): + data = {"A":"1", "B":"2", "data":{}} + test = DataWriter() + test.update_construct(data) + self.assertEqual(test.cache_construct, data) + + def test_write_stack_info_json(self): + test = DataWriter() + data = 
{"A":"1", "B":"2", "data":{}} + test.cache_stack = data + + cur_path = os.path.dirname(os.path.realpath(__file__)) + file_path = os.path.join(cur_path, "dump.json") + + test.write_stack_info_json(file_path) + + with open(file_path, "r") as f: + load_result = json.load(f) + try: + self.assertEqual(load_result, data) + finally: + os.remove(file_path) + + + def test_write_construct_info_json(self): + test = DataWriter() + data = {"A":"1", "B":"2", "data":{}} + test.cache_construct = data + + cur_path = os.path.dirname(os.path.realpath(__file__)) + file_path = os.path.join(cur_path, "dump.json") + + test.write_construct_info_json(file_path) + + with open(file_path, "r") as f: + load_result = json.load(f) + try: + self.assertEqual(load_result, data) + finally: + os.remove(file_path) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_scope.py b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_scope.py new file mode 100644 index 0000000000000000000000000000000000000000..1989fd0a95a5894b012aa916cdf44c625de27b1a --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/data_dump/test_scope.py @@ -0,0 +1,151 @@ +import unittest +from unittest.mock import MagicMock + +from msprobe.core.common.exceptions import ScopeException +from msprobe.core.data_dump.scope import ( + build_scope, + build_range_scope_according_to_scope_name, + BaseScope, + ListScope, + RangeScope, + APIRangeScope, + ModuleRangeScope +) + + +class TestBuildScope(unittest.TestCase): + def test_build_scope(self): + scope_class = MagicMock() + result1 = build_scope(scope_class, None, None) + self.assertEqual(result1, None) + + api_list = ['api1', 'api2'] + result2 = build_scope(scope_class, None, api_list) + self.assertEqual(result2, scope_class.return_value) + + def test_build_range_scope_according_to_scope_name(self): + result = build_range_scope_according_to_scope_name([], []) + self.assertIsInstance(result, APIRangeScope) + + +class TestBaseScope(unittest.TestCase): + def test_rectify_args(self): + scope = [] + api_list = "invalid_api_list" + with self.assertRaises(ScopeException) as context: + BaseScope.rectify_args(scope, api_list) + self.assertEqual(context.exception.code, ScopeException.InvalidApiStr) + + api_list = [1, 2, 3] + with self.assertRaises(ScopeException) as context: + BaseScope.rectify_args(scope, api_list) + self.assertEqual(context.exception.code, ScopeException.InvalidApiStr) + + scope = "module1" + api_list = [] + + expected_scope = ["module1"] + expected_api_list = [] + result_scope, result_api_list = BaseScope.rectify_args(scope, api_list) + self.assertEqual(result_scope, expected_scope) + self.assertEqual(result_api_list, expected_api_list) + + scope = 123 + api_list = [] + with self.assertRaises(ScopeException) as context: + BaseScope.rectify_args(scope, api_list) + self.assertEqual(context.exception.code, ScopeException.InvalidScope) + + scope = ["module1", 2, "module3"] + api_list = [] + with self.assertRaises(ScopeException) as context: + BaseScope.rectify_args(scope, api_list) + self.assertEqual(context.exception.code, ScopeException.InvalidScope) + + +class TestListScope(unittest.TestCase): + def test_rectify_args(self): + scope = ["module1"] + api_list = ["api1"] + with self.assertRaises(ScopeException) as context: + ListScope.rectify_args(scope, api_list) + self.assertEqual(context.exception.code, ScopeException.ArgConflict) + + def test_check(self): + list_scope = ListScope([], []) + module_name = "module1" + result = list_scope.check(module_name) + 
self.assertTrue(result) + + list_scope = ListScope(["module1"], []) + module_name = "module1" + result = list_scope.check(module_name) + self.assertTrue(result) + + list_scope = ListScope(["module1"], []) + module_name = "module2" + result = list_scope.check(module_name) + self.assertFalse(result) + + +class TestRangeScope(unittest.TestCase): + def test_rectify_args(self): + scope = ["module1", "module2", "module3"] + with self.assertRaises(ScopeException) as context: + RangeScope.rectify_args(scope, []) + self.assertEqual(context.exception.code, ScopeException.InvalidScope) + + scope = ["module1"] + expected_scope = ["module1", "module1"] + result_scope, result_api_list = RangeScope.rectify_args(scope, []) + self.assertEqual(result_scope, expected_scope) + + +class TestAPIRangeScope(unittest.TestCase): + def test_check_scope_is_valid(self): + api_range_scope = APIRangeScope([], []) + result = api_range_scope.check_scope_is_valid() + self.assertTrue(result) + + def test_check(self): + api_range_scope = APIRangeScope([], []) + api_name = "api1" + result = api_range_scope.check(api_name) + self.assertTrue(result) + + +class TestModuleRangeScope(unittest.TestCase): + def test_check_scope_is_valid(self): + module_range_scope = ModuleRangeScope([], []) + result = module_range_scope.check_scope_is_valid() + self.assertTrue(result) + + def test_begin_module(self): + module_range_scope = ModuleRangeScope(["module1", "module2"], []) + module_name = "module1" + module_range_scope.begin_module(module_name) + self.assertTrue(module_range_scope.in_scope) + + module_range_scope = ModuleRangeScope(["module1", "module2"], []) + module_name = "module3" + module_range_scope.begin_module(module_name) + self.assertFalse(module_range_scope.in_scope) + + def test_end_module(self): + module_range_scope = ModuleRangeScope(["module1", "module2"], []) + module_name = "module2" + module_range_scope.in_scope = True + module_range_scope.end_module(module_name) + self.assertFalse(module_range_scope.in_scope) + + module_range_scope = ModuleRangeScope(["module1", "module2"], []) + module_name = "module3" + module_range_scope.in_scope = True + module_range_scope.end_module(module_name) + self.assertTrue(module_range_scope.in_scope) + + def test_check(self): + module_range_scope = ModuleRangeScope([], []) + module_name = "module1" + result = module_range_scope.check(module_name) + self.assertTrue(result) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/test_common_config.py b/debug/accuracy_tools/msprobe/test/core_ut/test_common_config.py new file mode 100644 index 0000000000000000000000000000000000000000..06c7378ed36f764a21c42431d5155d02a2a003b8 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/test_common_config.py @@ -0,0 +1,152 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +from unittest import TestCase +from unittest.mock import patch + +from msprobe.core.common.log import logger +from msprobe.core.common.const import Const +from msprobe.core.common.exceptions import MsprobeException +from msprobe.core.common_config import CommonConfig, BaseConfig + + +class TestCommonConfig(TestCase): + @patch.object(logger, "error_log_with_exp") + def test_common_config(self, mock_error_log_with_exp): + json_config = dict() + + common_config = CommonConfig(json_config) + self.assertIsNone(common_config.task) + self.assertIsNone(common_config.dump_path) + self.assertIsNone(common_config.rank) + self.assertIsNone(common_config.step) + self.assertIsNone(common_config.level) + self.assertIsNone(common_config.seed) + self.assertIsNone(common_config.acl_config) + self.assertFalse(common_config.is_deterministic) + self.assertFalse(common_config.enable_dataloader) + + json_config.update({"task": "md5"}) + CommonConfig(json_config) + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "task is invalid, it should be one of {}".format(Const.TASK_LIST)) + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"task": Const.TENSOR}) + json_config.update({"rank": 0}) + CommonConfig(json_config) + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "rank is invalid, it should be a list") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"task": Const.TENSOR}) + json_config.update({"rank": [0]}) + json_config.update({"step": 0}) + CommonConfig(json_config) + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "step is invalid, it should be a list") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"task": Const.TENSOR}) + json_config.update({"rank": [0]}) + json_config.update({"step": [0]}) + json_config.update({"level": "L3"}) + CommonConfig(json_config) + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "level is invalid, it should be one of {}".format(Const.LEVEL_LIST)) + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"task": Const.TENSOR}) + json_config.update({"rank": [0]}) + json_config.update({"step": [0]}) + json_config.update({"level": "L0"}) + json_config.update({"seed": "1234"}) + CommonConfig(json_config) + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "seed is invalid, it should be an integer") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"task": Const.TENSOR}) + json_config.update({"rank": [0]}) + json_config.update({"step": [0]}) + json_config.update({"level": "L0"}) + json_config.update({"seed": 1234}) + json_config.update({"is_deterministic": "ENABLE"}) + CommonConfig(json_config) + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "is_deterministic is invalid, it should be a boolean") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"task": Const.TENSOR}) + json_config.update({"rank": [0]}) + json_config.update({"step": [0]}) + json_config.update({"level": 
"L0"}) + json_config.update({"seed": 1234}) + json_config.update({"is_deterministic": True}) + json_config.update({"enable_dataloader": "ENABLE"}) + CommonConfig(json_config) + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "enable_dataloader is invalid, it should be a boolean") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + @patch.object(logger, "error_log_with_exp") + def test_base_config(self, mock_error_log_with_exp): + json_config = dict() + + base_config = BaseConfig(json_config) + base_config.check_config() + self.assertIsNone(base_config.scope) + self.assertIsNone(base_config.list) + self.assertIsNone(base_config.data_mode) + self.assertIsNone(base_config.backward_input) + self.assertIsNone(base_config.file_format) + self.assertIsNone(base_config.summary_mode) + self.assertIsNone(base_config.overflow_num) + self.assertIsNone(base_config.check_mode) + + json_config.update({"scope": "Tensor_Add"}) + base_config = BaseConfig(json_config) + base_config.check_config() + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "scope is invalid, it should be a list") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"scope": ["Tensor_Add"]}) + json_config.update({"list": "Tensor_Add"}) + base_config = BaseConfig(json_config) + base_config.check_config() + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "list is invalid, it should be a list") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) + + json_config.update({"scope": ["Tensor_Add"]}) + json_config.update({"list": ["Tensor_Add"]}) + json_config.update({"data_mode": "all"}) + base_config = BaseConfig(json_config) + base_config.check_config() + self.assertEqual(mock_error_log_with_exp.call_args[0][0], + "data_mode is invalid, it should be a list") + self.assertEqual(str(mock_error_log_with_exp.call_args[0][1]), + MsprobeException.err_strs.get(MsprobeException.INVALID_PARAM_ERROR)) diff --git a/debug/accuracy_tools/msprobe/test/core_ut/test_file_check.py b/debug/accuracy_tools/msprobe/test/core_ut/test_file_check.py new file mode 100644 index 0000000000000000000000000000000000000000..ecdf3da9fedfcc283e3a7887c74b4e46c3d0aae5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/test_file_check.py @@ -0,0 +1,218 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +import os + +from unittest import TestCase +from unittest.mock import patch, MagicMock + +from msprobe.core.common.log import logger +from msprobe.core.common.const import FileCheckConst +from msprobe.core.common.exceptions import FileCheckException +from msprobe.core.common.file_check import (check_link, + check_path_length, + check_path_exists, + check_path_readability, + check_path_writability, + check_path_executable, + check_other_user_writable, + check_path_owner_consistent, + check_path_pattern_vaild, + check_file_size, + check_common_file_size, + check_file_suffix, + check_path_type) + + +class TestFileCheckUtil(TestCase): + @patch.object(logger, "error") + def test_check_link(self, mock_logger_error): + with patch("msprobe.core.common.file_check.os.path.islink", return_value=True): + with self.assertRaises(FileCheckException) as context: + check_link("link_path") + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.SOFT_LINK_ERROR)) + mock_logger_error.assert_called_with("The file path link_path is a soft link.") + + @patch.object(logger, "error") + def test_check_path_length(self, mock_logger_error): + path = "P" * (FileCheckConst.DIRECTORY_LENGTH + 1) + with self.assertRaises(FileCheckException) as context: + check_path_length(path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.ILLEGAL_PATH_ERROR)) + mock_logger_error.assert_called_with("The file path length exceeds limit.") + + path = "P" * (FileCheckConst.FILE_NAME_LENGTH + 1) + with self.assertRaises(FileCheckException) as context: + check_path_length(path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.ILLEGAL_PATH_ERROR)) + mock_logger_error.assert_called_with("The file path length exceeds limit.") + + path = "P" * (FileCheckConst.FILE_NAME_LENGTH - 5) + with self.assertRaises(FileCheckException) as context: + check_path_length(path, name_length=FileCheckConst.FILE_NAME_LENGTH - 6) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.ILLEGAL_PATH_ERROR)) + mock_logger_error.assert_called_with("The file path length exceeds limit.") + + @patch.object(logger, "error") + def test_check_path_exists(self, mock_logger_error): + with patch("msprobe.core.common.file_check.os.path.exists", return_value=False): + with self.assertRaises(FileCheckException) as context: + check_path_exists("file_path") + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.ILLEGAL_PATH_ERROR)) + mock_logger_error.assert_called_with("The file path file_path does not exist.") + + @patch.object(logger, "error") + def test_check_path_readability(self, mock_logger_error): + path = "file_path" + with patch("msprobe.core.common.file_check.os.access", return_value=False): + with self.assertRaises(FileCheckException) as context: + check_path_readability(path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.FILE_PERMISSION_ERROR)) + mock_logger_error.assert_called_with(f"The file path {path} is not readable.") + + mock_access = MagicMock() + mock_access.return_value = True + with patch("msprobe.core.common.file_check.os.access", new=mock_access): + check_path_readability(path) + self.assertEqual(mock_access.call_args[0], (path, os.R_OK)) + + @patch.object(logger, "error") + def test_check_path_writability(self, mock_logger_error): + path = "file_path" + with 
patch("msprobe.core.common.file_check.os.access", return_value=False): + with self.assertRaises(FileCheckException) as context: + check_path_writability(path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.FILE_PERMISSION_ERROR)) + mock_logger_error.assert_called_with(f"The file path {path} is not writable.") + + mock_access = MagicMock() + mock_access.return_value = True + with patch("msprobe.core.common.file_check.os.access", new=mock_access): + check_path_writability(path) + self.assertEqual(mock_access.call_args[0], (path, os.W_OK)) + + @patch.object(logger, "error") + def test_check_path_executable(self, mock_logger_error): + path = "file_path" + with patch("msprobe.core.common.file_check.os.access", return_value=False): + with self.assertRaises(FileCheckException) as context: + check_path_executable(path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.FILE_PERMISSION_ERROR)) + mock_logger_error.assert_called_with(f"The file path {path} is not executable.") + + mock_access = MagicMock() + mock_access.return_value = True + with patch("msprobe.core.common.file_check.os.access", new=mock_access): + check_path_executable(path) + self.assertEqual(mock_access.call_args[0], (path, os.X_OK)) + + @patch.object(logger, "error") + def test_check_other_user_writable(self, mock_logger_error): + class TestStat: + def __init__(self, mode): + self.st_mode = mode + + path = "file_path" + mock_stat = TestStat(0o002) + with patch("msprobe.core.common.file_check.os.stat", return_value=mock_stat): + with self.assertRaises(FileCheckException) as context: + check_other_user_writable(path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.FILE_PERMISSION_ERROR)) + mock_logger_error.assert_called_with(f"The file path {path} may be insecure " + "because other users have write permissions. 
") + + @patch.object(logger, "error") + def test_check_path_owner_consistent(self, mock_logger_error): + file_path = os.path.realpath(__file__) + file_owner = os.stat(file_path).st_uid + with patch("msprobe.core.common.file_check.os.getuid", return_value=file_owner+1): + with self.assertRaises(FileCheckException) as context: + check_path_owner_consistent(file_path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.FILE_PERMISSION_ERROR)) + mock_logger_error.assert_called_with(f"The file path {file_path} may be insecure " + "because is does not belong to you.") + + @patch.object(logger, "error") + def test_check_path_pattern_vaild(self, mock_logger_error): + path = "path" + mock_re_match = MagicMock() + mock_re_match.return_value = False + with patch("msprobe.core.common.file_check.re.match", new=mock_re_match): + with self.assertRaises(FileCheckException) as context: + check_path_pattern_vaild(path) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.ILLEGAL_PATH_ERROR)) + mock_logger_error.assert_called_with(f"The file path {path} contains special characters.") + mock_re_match.assert_called_with(FileCheckConst.FILE_VALID_PATTERN, path) + + @patch.object(logger, "error") + def test_check_file_size(self, mock_logger_error): + file_path = os.path.realpath(__file__) + file_size = os.path.getsize(file_path) + max_size = file_size + with self.assertRaises(FileCheckException) as context: + check_file_size(file_path, max_size) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.FILE_TOO_LARGE_ERROR)) + mock_logger_error.assert_called_with(f"The size of file path {file_path} exceeds {max_size} bytes.") + + def test_check_common_file_size(self): + mock_check_file_size = MagicMock() + with patch("msprobe.core.common.file_check.os.path.isfile", return_value=True), \ + patch("msprobe.core.common.file_check.check_file_size", new=mock_check_file_size): + for suffix, max_size in FileCheckConst.FILE_SIZE_DICT.items(): + check_common_file_size(suffix) + mock_check_file_size.assert_called_with(suffix, max_size) + + @patch.object(logger, "error") + def test_check_file_suffix(self, mock_logger_error): + file_path = "file_path" + suffix = "suffix" + with self.assertRaises(FileCheckException) as context: + check_file_suffix(file_path, suffix) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.INVALID_FILE_ERROR)) + mock_logger_error.assert_called_with(f"The {file_path} should be a {suffix} file!") + + @patch.object(logger, "error") + def test_check_path_type(self, mock_logger_error): + file_path = "file_path" + + with patch("msprobe.core.common.file_check.os.path.isfile", return_value=False), \ + patch("msprobe.core.common.file_check.os.path.isdir", return_value=True): + with self.assertRaises(FileCheckException) as context: + check_path_type(file_path, FileCheckConst.FILE) + self.assertEqual(str(context.exception), + FileCheckException.err_strs.get(FileCheckException.INVALID_FILE_ERROR)) + mock_logger_error.assert_called_with(f"The {file_path} should be a file!") + + with patch("msprobe.core.common.file_check.os.path.isfile", return_value=True), \ + patch("msprobe.core.common.file_check.os.path.isdir", return_value=False): + with self.assertRaises(FileCheckException) as context: + check_path_type(file_path, FileCheckConst.DIR) + self.assertEqual(str(context.exception), + 
FileCheckException.err_strs.get(FileCheckException.INVALID_FILE_ERROR)) + mock_logger_error.assert_called_with(f"The {file_path} should be a dictionary!") diff --git a/debug/accuracy_tools/msprobe/test/core_ut/test_log.py b/debug/accuracy_tools/msprobe/test/core_ut/test_log.py new file mode 100644 index 0000000000000000000000000000000000000000..1687c48d025a72a9237a561cc3bea0dd473f6fbd --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/core_ut/test_log.py @@ -0,0 +1,109 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +from unittest import TestCase +from unittest.mock import patch, MagicMock + +from msprobe.core.common.log import BaseLogger, logger + + +class TestLog(TestCase): + @patch("msprobe.core.common.log.print") + def test__print_log(self, mock_print): + logger._print_log("level", "msg") + self.assertIn("[level] msg", mock_print.call_args[0][0]) + self.assertEqual("\n", mock_print.call_args[1].get("end")) + + logger._print_log("level", "msg", end="end") + self.assertIn("[level] msg", mock_print.call_args[0][0]) + self.assertEqual("end", mock_print.call_args[1].get("end")) + + @patch.object(BaseLogger, "_print_log") + def test_print_info_log(self, mock__print_log): + logger.info("info_msg") + mock__print_log.assert_called_with("INFO", "info_msg") + + @patch.object(BaseLogger, "_print_log") + def test_print_warn_log(self, mock__print_log): + logger.warning("warn_msg") + mock__print_log.assert_called_with("WARNING", "warn_msg") + + @patch.object(BaseLogger, "_print_log") + def test_print_error_log(self, mock__print_log): + logger.error("error_msg") + mock__print_log.assert_called_with("ERROR", "error_msg") + + @patch.object(BaseLogger, "error") + def test_error_log_with_exp(self, mock_error): + with self.assertRaises(Exception) as context: + logger.error_log_with_exp("msg", Exception("Exception")) + self.assertEqual(str(context.exception), "Exception") + mock_error.assert_called_with("msg") + + @patch.object(BaseLogger, "get_rank") + def test_on_rank_0(self, mock_get_rank): + mock_func = MagicMock() + func_rank_0 = logger.on_rank_0(mock_func) + + mock_get_rank.return_value = 1 + func_rank_0() + mock_func.assert_not_called() + + mock_get_rank.return_value = 0 + func_rank_0() + mock_func.assert_called() + + mock_func = MagicMock() + func_rank_0 = logger.on_rank_0(mock_func) + mock_get_rank.return_value = None + func_rank_0() + mock_func.assert_called() + + @patch.object(BaseLogger, "get_rank") + def test_info_on_rank_0(self, mock_get_rank): + mock_print = MagicMock() + with patch("msprobe.core.common.log.print", new=mock_print): + mock_get_rank.return_value = 0 + logger.info_on_rank_0("msg") + self.assertIn("[INFO] msg", mock_print.call_args[0][0]) + + mock_get_rank.return_value = 1 + logger.info_on_rank_0("msg") + mock_print.assert_called_once() + + @patch.object(BaseLogger, "get_rank") + def test_error_on_rank_0(self, mock_get_rank): + mock_print = 
MagicMock() + with patch("msprobe.core.common.log.print", new=mock_print): + mock_get_rank.return_value = 0 + logger.error_on_rank_0("msg") + self.assertIn("[ERROR] msg", mock_print.call_args[0][0]) + + mock_get_rank.return_value = 1 + logger.error_on_rank_0("msg") + mock_print.assert_called_once() + + @patch.object(BaseLogger, "get_rank") + def test_warning_on_rank_0(self, mock_get_rank): + mock_print = MagicMock() + with patch("msprobe.core.common.log.print", new=mock_print): + mock_get_rank.return_value = 0 + logger.warning_on_rank_0("msg") + self.assertIn("[WARNING] msg", mock_print.call_args[0][0]) + + mock_get_rank.return_value = 1 + logger.warning_on_rank_0("msg") + mock_print.assert_called_once() diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_api_kbk_dump.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_api_kbk_dump.py new file mode 100644 index 0000000000000000000000000000000000000000..7411018ff08507f0ab867b6394aa1c08b5f26469 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_api_kbk_dump.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +import os + +from unittest import TestCase +from unittest.mock import patch + +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.api_kbk_dump import ApiKbkDump + + +class TestApiKbkDump(TestCase): + + def test_handle(self): + json_config = { + "task": "statistics", + "dump_path": "/absolute_path", + "rank": [], + "step": [0, 2], + "level": "L1" + } + + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + config = DebuggerConfig(common_config, task_config) + dumper = ApiKbkDump(config) + self.assertEqual(dumper.dump_json["common_dump_settings"]["iteration"], "0|2") + + os.environ["MS_ACL_DUMP_CFG_PATH"] = "path" + with patch("msprobe.mindspore.dump.api_kbk_dump.make_dump_path_if_not_exists"), \ + patch("msprobe.mindspore.dump.api_kbk_dump.FileOpen"), \ + patch("msprobe.mindspore.dump.api_kbk_dump.json.dump"), \ + patch("msprobe.mindspore.dump.api_kbk_dump.logger.info"): + dumper.handle() + self.assertEqual(os.environ.get("GRAPH_OP_RUN"), "1") + self.assertEqual(os.environ.get("MS_ACL_DUMP_CFG_PATH"), None) diff --git a/debug/accuracy_tools/atat/core/log.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_debugger_config.py similarity index 35% rename from debug/accuracy_tools/atat/core/log.py rename to debug/accuracy_tools/msprobe/test/mindspore_ut/test_debugger_config.py index b9ac8f5edfb18286aff317b5440bb99a92dd2486..54bc1393aa6d9c5c120590de01fd9644706be2d2 100644 --- a/debug/accuracy_tools/atat/core/log.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_debugger_config.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ -# Copyright (C) 2024. Huawei Technologies Co., Ltd. 
All rights reserved. +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -14,43 +14,29 @@ # See the License for the specific language governing permissions and # limitations under the License. """ -import os -import time -import sys - - -def _print_log(level, msg, end='\n'): - current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(time.time()))) - pid = os.getgid() - print(current_time + "(" + str(pid) + ")-[" + level + "]" + msg, end=end) - sys.stdout.flush() - - -def print_info_log(info_msg, end='\n'): - """ - Function Description: - print info log. - Parameter: - info_msg: the info message. - """ - _print_log("INFO", info_msg, end=end) - - -def print_error_log(error_msg): - """ - Function Description: - print error log. - Parameter: - error_msg: the error message. - """ - _print_log("ERROR", error_msg) - - -def print_warn_log(warn_msg): - """ - Function Description: - print warn log. - Parameter: - warn_msg: the warning message. - """ - _print_log("WARNING", warn_msg) \ No newline at end of file +from unittest import TestCase + +from msprobe.core.common.const import Const +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig + + +class TestDebuggerConfig(TestCase): + def test_init(self): + json_config = { + "dump_path": "/absolute_path", + "rank": [], + "step": [], + "level": "L1" + } + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + debugger_config = DebuggerConfig(common_config, task_config) + self.assertEqual(debugger_config.task, Const.STATISTICS) + self.assertEqual(debugger_config.file_format, "npy") + self.assertEqual(debugger_config.check_mode, "all") + + common_config.dump_path = "./path" + with self.assertRaises(Exception) as context: + DebuggerConfig(common_config, task_config) + self.assertEqual(str(context.exception), "Dump path must be absolute path.") diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_tensor.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_dump_tool_factory.py similarity index 31% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_tensor.py rename to debug/accuracy_tools/msprobe/test/mindspore_ut/test_dump_tool_factory.py index f7791cdc9ac8e2084fc63d76e3819e137f4ea9d7..fb88d7bbbf328b0b8a61b11d41808b756881510e 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/hook_module/wrap_tensor.py +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_dump_tool_factory.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ -# Copyright (C) 2023-2023. Huawei Technologies Co., Ltd. All rights reserved. +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at @@ -14,51 +14,38 @@ # See the License for the specific language governing permissions and # limitations under the License. 
""" +from unittest import TestCase -import torch +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.dump_tool_factory import DumpToolFactory -from .hook_module import HOOKModule -from ..common.utils import torch_device_guard -from ..common.config import msCheckerConfig -from ...common.utils import parameter_adapter +class TestDumpToolFactory(TestCase): -def get_tensor_ops(): - global WrapTensorOps - _tensor_ops = dir(torch._C._TensorBase) - if msCheckerConfig.white_list: - return set(WrapTensorOps) & set(_tensor_ops) & set(msCheckerConfig.white_list) - else: - return set(WrapTensorOps) & set(_tensor_ops) + def test_create(self): + json_config = { + "task": "statistics", + "dump_path": "/absolute_path", + "rank": [], + "step": [0, 2], + "level": "L1" + } + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + config = DebuggerConfig(common_config, task_config) -class HOOKTensor(object): - pass + config.level = "module" + with self.assertRaises(Exception) as context: + DumpToolFactory.create(config) + self.assertEqual(str(context.exception), "valid level is needed.") + config.level = "cell" + with self.assertRaises(Exception) as context: + DumpToolFactory.create(config) + self.assertEqual(str(context.exception), "Cell dump in not supported now.") -class TensorOPTemplate(HOOKModule): - - def __init__(self, op_name, hook, need_hook=True): - self.op_name_ = op_name - self.prefix_op_name_ = "Tensor*" + str(op_name) + "*" - if need_hook: - super().__init__(hook) - - @torch_device_guard - @parameter_adapter - def forward(self, *args, **kwargs): - return getattr(torch._C._TensorBase, str(self.op_name_))(*args, **kwargs) - - -def wrap_tensor_op(op_name, hook): - - def tensor_op_template(*args, **kwargs): - return TensorOPTemplate(op_name, hook)(*args, **kwargs) - - return tensor_op_template - - -def wrap_tensor_ops_and_bind(hook): - _tensor_ops = get_tensor_ops() - for op_name in _tensor_ops: - setattr(HOOKTensor, "wrap_" + str(op_name), wrap_tensor_op(op_name, hook)) + config.level = "kernel" + dumper = DumpToolFactory.create(config) + self.assertEqual(dumper.dump_json["common_dump_settings"]["net_name"], "Net") diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_kernel_graph_dump.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_kernel_graph_dump.py new file mode 100644 index 0000000000000000000000000000000000000000..e691a2c7edde2feb1f2c1d60fbba275724bb9092 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_kernel_graph_dump.py @@ -0,0 +1,66 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +import os + +from unittest import TestCase +from unittest.mock import patch + +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.kernel_graph_dump import KernelGraphDump + + +class TestKernelGraphDump(TestCase): + + def test_handle(self): + json_config = { + "task": "tensor", + "dump_path": "/absolute_path", + "rank": [], + "step": [0, 2], + "level": "L2" + } + + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + task_config.data_mode = ["output"] + task_config.file_format = "bin" + config = DebuggerConfig(common_config, task_config) + dumper = KernelGraphDump(config) + self.assertEqual(dumper.dump_json["common_dump_settings"]["iteration"], "0|2") + self.assertEqual(dumper.dump_json["common_dump_settings"]["file_format"], "bin") + self.assertEqual(dumper.dump_json["common_dump_settings"]["input_output"], 2) + + with patch("msprobe.mindspore.dump.kernel_graph_dump.make_dump_path_if_not_exists"), \ + patch("msprobe.mindspore.dump.kernel_graph_dump.FileOpen"), \ + patch("msprobe.mindspore.dump.kernel_graph_dump.json.dump"), \ + patch("msprobe.mindspore.dump.kernel_graph_dump.logger.info"): + + os.environ["GRAPH_OP_RUN"] = "1" + with self.assertRaises(Exception) as context: + dumper.handle() + self.assertEqual(str(context.exception), "Must run in graph mode, not kbk mode") + if "GRAPH_OP_RUN" in os.environ: + del os.environ["GRAPH_OP_RUN"] + + dumper.handle() + self.assertIn("kernel_graph_dump.json", os.environ.get("MS_ACL_DUMP_CFG_PATH")) + + if "MINDSPORE_DUMP_CONFIG" in os.environ: + del os.environ["MINDSPORE_DUMP_CONFIG"] + if "MS_ACL_DUMP_CFG_PATH" in os.environ: + del os.environ["MS_ACL_DUMP_CFG_PATH"] diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_kernel_graph_overflow_check.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_kernel_graph_overflow_check.py new file mode 100644 index 0000000000000000000000000000000000000000..a93fab021ab59beff9016895a1748eb942274e30 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_kernel_graph_overflow_check.py @@ -0,0 +1,63 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +import os + +from unittest import TestCase +from unittest.mock import patch + +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.overflow_check.kernel_graph_overflow_check import KernelGraphOverflowCheck + + +class TestKernelGraphOverflowCheck(TestCase): + + def test_handle(self): + json_config = { + "task": "overflow_check", + "dump_path": "/absolute_path", + "rank": [], + "step": [], + "level": "L2" + } + + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + task_config.check_mode = "atomic" + config = DebuggerConfig(common_config, task_config) + checker = KernelGraphOverflowCheck(config) + self.assertEqual(checker.dump_json["common_dump_settings"]["op_debug_mode"], 2) + + os.environ["MS_ACL_DUMP_CFG_PATH"] = "path" + with patch("msprobe.mindspore.overflow_check.kernel_graph_overflow_check.make_dump_path_if_not_exists"), \ + patch("msprobe.mindspore.overflow_check.kernel_graph_overflow_check.FileOpen"), \ + patch("msprobe.mindspore.overflow_check.kernel_graph_overflow_check.json.dump"), \ + patch("msprobe.mindspore.overflow_check.kernel_graph_overflow_check.logger.info"): + + os.environ["GRAPH_OP_RUN"] = "1" + with self.assertRaises(Exception) as context: + checker.handle() + self.assertEqual(str(context.exception), "Must run in graph mode, not kbk mode") + if "GRAPH_OP_RUN" in os.environ: + del os.environ["GRAPH_OP_RUN"] + + checker.handle() + self.assertIn("kernel_graph_overflow_check.json", os.environ.get("MINDSPORE_DUMP_CONFIG")) + self.assertEqual(os.environ.get("MS_ACL_DUMP_CFG_PATH"), None) + + if "MINDSPORE_DUMP_CONFIG" in os.environ: + del os.environ["MINDSPORE_DUMP_CONFIG"] diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_config.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_config.py new file mode 100644 index 0000000000000000000000000000000000000000..673386afb5d5862e4437d23081792dd006c930cf --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_ms_config.py @@ -0,0 +1,69 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +from unittest import TestCase +from unittest.mock import patch, mock_open + +from msprobe.core.common.const import Const +from msprobe.mindspore.ms_config import (parse_json_config, parse_task_config, + TensorConfig, StatisticsConfig, OverflowCheck) + + +class TestMsConfig(TestCase): + def test_parse_json_config(self): + mock_json_data = { + "dump_path": "./dump/", + "rank": [], + "step": [], + "level": "L1", + "seed": 1234, + "statistics": { + "scope": [], + "list": [], + "data_mode": ["all"], + "summary_mode": "statistics" + } + } + with patch("msprobe.mindspore.ms_config.FileOpen", mock_open(read_data='')), \ + patch("msprobe.mindspore.ms_config.json.load", return_value=mock_json_data): + common_config, task_config = parse_json_config("./config.json") + self.assertEqual(common_config.task, Const.STATISTICS) + self.assertEqual(task_config.data_mode, ["all"]) + + with self.assertRaises(Exception) as context: + parse_json_config(None) + self.assertEqual(str(context.exception), "json file path is None") + + def test_parse_task_config(self): + mock_json_config = { + "tensor": None, + "statistics": None, + "overflow_check": None, + "free_benchmark": None + } + + task_config = parse_task_config("tensor", mock_json_config) + self.assertTrue(isinstance(task_config, TensorConfig)) + + task_config = parse_task_config("statistics", mock_json_config) + self.assertTrue(isinstance(task_config, StatisticsConfig)) + + task_config = parse_task_config("overflow_check", mock_json_config) + self.assertTrue(isinstance(task_config, OverflowCheck)) + + with self.assertRaises(Exception) as context: + parse_task_config("free_benchmark", mock_json_config) + self.assertEqual(str(context.exception), "task is invalid.") diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_overflow_check_tool_factory.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_overflow_check_tool_factory.py new file mode 100644 index 0000000000000000000000000000000000000000..47da051d4fdd1d9b65ef8c6092a7b05a3e2263b6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_overflow_check_tool_factory.py @@ -0,0 +1,51 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +from unittest import TestCase + +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.overflow_check.overflow_check_tool_factory import OverflowCheckToolFactory + + +class TestOverflowCheckToolFactory(TestCase): + + def test_create(self): + json_config = { + "task": "overflow_check", + "dump_path": "/absolute_path", + "rank": [], + "step": [], + "level": "L2" + } + + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + config = DebuggerConfig(common_config, task_config) + + config.level = "module" + with self.assertRaises(Exception) as context: + OverflowCheckToolFactory.create(config) + self.assertEqual(str(context.exception), "valid level is needed.") + + config.level = "cell" + with self.assertRaises(Exception) as context: + OverflowCheckToolFactory.create(config) + self.assertEqual(str(context.exception), "Overflow check in not supported in this mode.") + + config.level = "kernel" + dumper = OverflowCheckToolFactory.create(config) + self.assertEqual(dumper.dump_json["common_dump_settings"]["file_format"], "npy") diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_precision_debugger.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_precision_debugger.py new file mode 100644 index 0000000000000000000000000000000000000000..b33167dc7b7c8c35647b8eed5d5760b9a05ae974 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_precision_debugger.py @@ -0,0 +1,56 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +from unittest import TestCase +from unittest.mock import patch + +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.debugger.precision_debugger import PrecisionDebugger + + +class TestPrecisionDebugger(TestCase): + def test_start(self): + class Handler: + called = False + + def handle(self): + Handler.called = True + + json_config = { + "task": "statistics", + "dump_path": "/absolute_path", + "rank": [], + "step": [], + "level": "L1" + } + + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + handler = Handler() + + with patch("msprobe.mindspore.debugger.precision_debugger.parse_json_config", + return_value=[common_config, task_config]), \ + patch("msprobe.mindspore.debugger.precision_debugger.TaskHandlerFactory.create", return_value=handler): + debugger = PrecisionDebugger() + debugger.start() + self.assertTrue(isinstance(debugger.config, DebuggerConfig)) + self.assertTrue(Handler.called) + + PrecisionDebugger._instance = None + with self.assertRaises(Exception) as context: + debugger.start() + self.assertEqual(str(context.exception), "No instance of PrecisionDebugger found.") diff --git a/debug/accuracy_tools/msprobe/test/mindspore_ut/test_task_handler_factory.py b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_task_handler_factory.py new file mode 100644 index 0000000000000000000000000000000000000000..41be7b1db6c7d723aaeec1607f564ac3d772b404 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/mindspore_ut/test_task_handler_factory.py @@ -0,0 +1,58 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +# Copyright (C) 2024-2024. Huawei Technologies Co., Ltd. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +from unittest import TestCase +from unittest.mock import patch + +from msprobe.core.common_config import CommonConfig, BaseConfig +from msprobe.mindspore.debugger.debugger_config import DebuggerConfig +from msprobe.mindspore.dump.kernel_graph_dump import KernelGraphDump +from msprobe.mindspore.task_handler_factory import TaskHandlerFactory + + +class TestTaskHandlerFactory(TestCase): + + def test_create(self): + class HandlerFactory: + def create(self): + return None + + tasks = {"statistics": HandlerFactory} + + json_config = { + "task": "statistics", + "dump_path": "/absolute_path", + "rank": [], + "step": [], + "level": "L2" + } + + common_config = CommonConfig(json_config) + task_config = BaseConfig(json_config) + config = DebuggerConfig(common_config, task_config) + + handler = TaskHandlerFactory.create(config) + self.assertTrue(isinstance(handler, KernelGraphDump)) + + with patch("msprobe.mindspore.task_handler_factory.TaskHandlerFactory.tasks", new=tasks): + with self.assertRaises(Exception) as context: + TaskHandlerFactory.create(config) + self.assertEqual(str(context.exception), "Can not find task handler") + + config.task = "free_benchmark" + with self.assertRaises(Exception) as context: + TaskHandlerFactory.create(config) + self.assertEqual(str(context.exception), "valid task is needed.") diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/advisor/test_advisor.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/advisor/test_advisor.py new file mode 100644 index 0000000000000000000000000000000000000000..176b80068f70e60a06a6eed77b23b35e8b48a50d --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/advisor/test_advisor.py @@ -0,0 +1,83 @@ +import difflib +import os +import shutil +import unittest +import logging +from unittest.mock import patch + +import pandas + +from msprobe.pytorch.advisor.advisor import Advisor +from msprobe.pytorch.advisor.advisor_const import AdvisorConst + + +class TestAdvisor(unittest.TestCase): + + def setUp(self): + self.base_test_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) + self.input_dir = os.path.join(self.base_test_dir, 'resources') + self.output_path = os.path.abspath(os.path.join(self.base_test_dir, 'test_output')) + + os.makedirs(self.output_path, mode=0o700, exist_ok=True) + self.has_error = False + + self.input_data = pandas.read_csv(os.path.join(self.input_dir, 'compare_result_20230703104808.csv')) + self.advisor = Advisor(self.input_data, self.output_path) + + def tearDown(self) -> None: + shutil.rmtree(self.output_path, ignore_errors=True) + + @patch("os.path.realpath") + def test_init(self, mock_realpath): + mock_realpath.return_value = 'real_output_path' + adv = Advisor(self.input_data, self.output_path) + self.assertEqual(adv.out_path, 'real_output_path') + + def test_deterministic_advisor_when_api_in_need_determ_api(self): + msg = self.advisor.deterministic_advisor('', 'Functional.layer_norm.0.forward_input.0') + self.assertEqual(msg, AdvisorConst.DETERMINISTIC_SUGGEST) + + def test_deterministic_advisor_when_api_not_in_need_determ_api(self): + mock_message = 'mock message' + msg = self.advisor.deterministic_advisor(mock_message, 'Functional.linear.0.forward_input.0') + self.assertEqual(msg, mock_message) + + def test_batch_norm_advisor(self): + mock_message = 'mocked batch norm advisor message' + msg1 = self.advisor.batch_norm_advisor(mock_message, AdvisorConst.FUNC_BATCH_NORM + '' + + AdvisorConst.FORWARD_INPUT_1) + msg2 = self.advisor.batch_norm_advisor(mock_message, 
'Functional.linear.0.forward_output.1') + self.assertEqual(msg1, AdvisorConst.BATCH_NORM_SUGGEST) + self.assertEqual(msg2, mock_message) + + def test_gen_advisor_message(self): + self.assertIn(AdvisorConst.FORWARD_OUTPUT_SUGGEST, self.advisor.gen_advisor_message( + 'Functional.linear.0.forward_output.1')) + self.assertIn(AdvisorConst.BACKWARD_INPUT_SUGGEST, self.advisor.gen_advisor_message( + 'Functional.linear.0.backward_input.1')) + + def test_advisor_summary_file(self): + self.advisor.analysis() + filenames = os.listdir(self.output_path) + for filename in filenames: + filename = os.path.join(self.output_path, filename) + self.result_check(os.path.join(self.input_dir, 'advisor.txt'), filename) + self.assertFalse(self.has_error) + + def result_check(self, standard_file, output_file): + with open(standard_file, 'r', encoding='utf-8') as st_file: + standard_content = st_file.read().splitlines() + with open(output_file, 'r', encoding='utf-8') as out_file: + output_content = out_file.read().splitlines() + result = list(difflib.unified_diff(standard_content, output_content, n=0)) + if result: + logging.basicConfig(level=logging.INFO) + logging.info('\n\n-------------------------------------------------------------------------') + logging.error(f'[ERROR] {output_file.replace(self.output_path, "")} advisor summary are inconsistent.') + logging.error('\n'.join(result)) + logging.info('\n\n-------------------------------------------------------------------------') + self.has_error = True + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_common_utils.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_common_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..56d100f0a1b570c0c0a1753db3c79aacea7b76ac --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_common_utils.py @@ -0,0 +1,108 @@ +import unittest +from unittest.mock import patch + +from msprobe.pytorch.api_accuracy_checker.common.utils import * + + +class TestUtils(unittest.TestCase): + + @patch('msprobe.pytorch.api_accuracy_checker.common.utils.get_file_content_bytes') + def test_get_json_contents_should_raise_exception(self, mock_get_file_content_bytes): + mock_get_file_content_bytes.return_value = 'not a dict' + with self.assertRaises(CompareException) as ce: + get_json_contents('') + self.assertEqual(ce.exception.code, CompareException.INVALID_FILE_ERROR) + + def test_get_json_contents_should_return_json_obj(self): + test_dict = {"key": "value"} + file_name = 'test.json' + + fd = os.open(file_name, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644) + with os.fdopen(fd, 'w') as f: + json.dump(test_dict, f) + self.assertEqual(get_json_contents(file_name), test_dict) + os.remove(file_name) + + def test_write_csv(self): + test_file_name = 'test.csv' + test_data = [["name", "age"], ["Alice", "20"], ["Bob", "30"]] + write_csv(test_data, 'test.csv') + with open(test_file_name, 'r', encoding='utf-8-sig') as f: + reader = csv.reader(f) + for i, row in enumerate(reader): + self.assertEqual(row, test_data[i]) + os.remove(test_file_name) + + def test_check_need_convert(self): + self.assertEqual(check_need_convert('cross_entropy'), 'int32_to_int64') + self.assertIsNone(check_need_convert('linear')) + + def test_check_object_type(self): + try: + check_object_type(123, int) + except Exception as e: + self.fail(f"check_object_type raised exception {e}") + + def 
test_check_file_or_directory_path(self): + try: + check_file_or_directory_path(__file__) + except Exception as e: + self.fail(f"check_file_or_directory_path raised exception {e}") + + def test_create_directory(self): + test_dir_name = 'test_dir' + create_directory(test_dir_name) + self.assertTrue(os.path.exists(test_dir_name)) + os.rmdir(test_dir_name) + + def test_get_file_content_bytes(self): + fd = os.open('test.txt', os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644) + with os.fdopen(fd, 'w') as f: + f.write("Hello, World!") + self.assertEqual(get_file_content_bytes('test.txt'), b"Hello, World!") + os.remove('test.txt') + + @patch('os.path.exists') + def test_check_file_or_dir_path_should_raise_exe_when_dir_path_not_existed(self, mock_path_exists): + mock_path_exists.return_value = False + with self.assertRaises(CompareException) as ce: + check_file_or_directory_path('', isdir=True) + self.assertEqual(ce.exception.code, CompareException.INVALID_PATH_ERROR) + + @patch('os.path.exists') + @patch('os.path.isdir') + @patch('os.access') + def test_check_file_or_dir_path_should_pass_when_path_is_dir(self, mock_os_access, mock_path_is_dir, + mock_path_exists): + mock_os_access.return_value = True + mock_path_is_dir.return_value = True + mock_path_exists.return_value = True + check_file_or_directory_path('', isdir=True) + + @patch('os.path.isfile') + @patch('os.access') + def test_check_file_or_dir_path_should_raise_exe_when_file_not_access(self, mock_os_access, mock_path_is_file): + mock_os_access.return_value = False + mock_path_is_file.return_value = True + with self.assertRaises(CompareException) as ce: + check_file_or_directory_path('', isdir=False) + self.assertEqual(ce.exception.code, CompareException.INVALID_PATH_ERROR) + + def test_check_file_or_dir_path_should_pass_when_path_is_file(self): + with unittest.mock.patch('os.path.isfile', return_value=True), \ + unittest.mock.patch('os.access', return_value=True): + check_file_or_directory_path('', isdir=False) + + def test_api_info_preprocess_no_conversion_needed(self): + api_name = 'linear' + original_api_info = {'key': 'value'} + convert_type, processed_api_info = api_info_preprocess(api_name, original_api_info.copy()) + self.assertIsNone(convert_type) + self.assertEqual(original_api_info, processed_api_info) + + def test_api_info_preprocess_cross_entropy_positive(self): + api_name = 'cross_entropy' + api_info = {'args': [{'Name': 'logit'}, {'Name': 'labels', 'Min': 1}]} + convert_type, processed_api_info = api_info_preprocess(api_name, api_info.copy()) + self.assertEqual(convert_type, 'int32_to_int64') + self.assertEqual(processed_api_info, api_info) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_config.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_config.py new file mode 100644 index 0000000000000000000000000000000000000000..35fc6164763e685d09e737e7f85bec33623ec111 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/common/test_config.py @@ -0,0 +1,39 @@ +import unittest +import os +from unittest.mock import patch + +from msprobe.pytorch.api_accuracy_checker.common.config import Config + + +class TestConfig(unittest.TestCase): + def setUp(self): + self.base_test_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__))))) + self.input_dir = os.path.join(self.base_test_dir, 'resources') + self.yaml_file = os.path.join(self.input_dir, "config.yaml") + self.cfg = Config(self.yaml_file) + + 
def test_validate_valid_data(self): + for key, val in self.cfg.config.items(): + validated_type = self.cfg.validate(key, val) + self.assertEqual(validated_type, val) + + def test_validate_should_raise_when_invalid_type(self): + with self.assertRaises(ValueError): + self.cfg.validate('error_data_path', True) + + def test_validate_should_raise_when_invalid_key(self): + with self.assertRaises(ValueError): + self.cfg.validate('invalid_key', 'mock_value') + + def test_validate_precision(self): + self.assertEqual(self.cfg.validate('precision', 1), 1) + + with self.assertRaises(ValueError): + self.cfg.validate('precision', -1) + + def test_validate_white_list(self): + validate_white_list = ['conv1d', 'max_pool1d', 'dropout', '__add__'] + self.assertEqual(self.cfg.validate('white_list', validate_white_list), validate_white_list) + + with self.assertRaises(Exception): + self.cfg.validate('white_list', ['invalid_api1', 'invalid_api2']) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_algorithm.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_algorithm.py new file mode 100644 index 0000000000000000000000000000000000000000..35a8b9f1fa52f689905986ca477c8a7077a084da --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_algorithm.py @@ -0,0 +1,112 @@ +import unittest + +import numpy as np + +from msprobe.pytorch.api_accuracy_checker.compare import algorithm as alg + + +class TestAlgorithmMethods(unittest.TestCase): + + def setUp(self): + self.bench_data = np.array([1.0, 1.0, 9.0], dtype=np.float16) + self.device_data = np.array([5.0, 2.0, 1.0], dtype=np.float16) + self.abs_err = np.abs(self.device_data - self.bench_data) + self.rel_err_origin = np.abs(self.abs_err / self.bench_data) + eps = np.finfo(self.bench_data.dtype).eps + self.abs_bench = np.abs(self.bench_data) + self.abs_bench_with_eps = self.abs_bench + eps + self.rel_err = self.abs_err / self.abs_bench_with_eps + + def test_cosine_sim(self): + cpu_output = np.array([1.0, 2.0, 3.0]) + npu_output = np.array([1.0, 2.0, 3.0]) + self.assertEqual(alg.cosine_sim(cpu_output, npu_output), (1.0, True, '')) + + def test_get_rmse(self): + inf_nan_mask = [False, False, False] + self.assertAlmostEqual(alg.get_rmse(self.abs_err, inf_nan_mask), 5.196, 3) + + def test_get_error_balance(self): + self.assertEqual(alg.get_error_balance(self.bench_data, self.device_data), 1 / 3) + + def test_get_small_value_err_ratio(self): + small_value_mask = [True, True, True, False, True] + abs_err_greater_mask = [False, True, True, True, False] + self.assertEqual(alg.get_small_value_err_ratio(small_value_mask, abs_err_greater_mask), 0.5) + + def get_rel_err(self): + eps = np.finfo(self.bench_data.dtype).eps + abs_bench = np.abs(self.bench_data) + abs_bench_with_eps = abs_bench + eps + small_value_mask = [False, False, False] + inf_nan_mask = [False, False, False] + rel_err = self.abs_err / abs_bench_with_eps + self.assertListEqual(list(alg.get_rel_err(self.abs_err, abs_bench_with_eps, small_value_mask, inf_nan_mask)), + list(rel_err)) + + def test_get_abs_err(self): + self.assertListEqual(list(alg.get_abs_err(self.bench_data, self.device_data)), [4.0, 1.0, 8.0]) + + def test_get_rel_err_origin(self): + self.assertListEqual(list(alg.get_rel_err_origin(self.abs_err, self.bench_data)), list(self.rel_err_origin)) + + def test_get_max_abs_err(self): + self.assertEqual(alg.get_max_abs_err(self.abs_err), (8.0, False)) + + def test_get_max_rel_err(self): + 
self.assertAlmostEqual(alg.get_max_rel_err(self.rel_err), 3.996, 3) + + def test_get_mean_rel_err(self): + self.assertAlmostEqual(alg.get_mean_rel_err(self.rel_err), 1.961, 3) + + def test_get_rel_err_ratio_thousandth(self): + b_value = np.array([1.0, 2.0, 3.0]) + n_value = np.array([1.0, 2.0, 3.0]) + abs_err = np.abs(b_value - n_value) + rel_err = alg.get_rel_err_origin(abs_err, b_value) + self.assertEqual(alg.get_rel_err_ratio(rel_err, 0.001), (1.0, True)) + + def test_get_rel_err_ratio_ten_thousandth(self): + b_value = np.array([1.0, 2.0, 3.0]) + n_value = np.array([1.0, 2.0, 3.0]) + abs_err = np.abs(b_value - n_value) + rel_err = alg.get_rel_err_origin(abs_err, b_value) + self.assertEqual(alg.get_rel_err_ratio(rel_err, 0.0001), (1.0, True)) + + def test_get_finite_and_infinite_mask(self): + both_finite_mask, inf_nan_mask = alg.get_finite_and_infinite_mask(self.bench_data, self.device_data) + self.assertListEqual(list(both_finite_mask), [True, True, True]) + self.assertListEqual(list(inf_nan_mask), [False, False, False]) + + def test_get_small_value_mask(self): + b_value = np.array([1e-7, 1.0, 2e-6], dtype=np.float16) + abs_bench = np.abs(b_value) + both_finite_mask = [True, True, True] + small_value_mask = alg.get_small_value_mask(abs_bench, both_finite_mask, 1e-3) + self.assertListEqual(list(small_value_mask), [True, False, True]) + + def test_get_abs_bench_with_eps(self): + abs_bench, abs_bench_with_eps = alg.get_abs_bench_with_eps(self.bench_data, np.float16) + self.assertListEqual(list(abs_bench), list(self.abs_bench)) + self.assertListEqual(list(abs_bench_with_eps), list(self.abs_bench_with_eps)) + + def test_check_inf_nan_value(self): + both_finite_mask, inf_nan_mask = alg.get_finite_and_infinite_mask(self.bench_data, self.device_data) + self.assertEqual(alg.check_inf_nan_value(inf_nan_mask, self.bench_data, self.device_data, np.float16, 0.001), 0) + + def test_check_small_value(self): + a_value = np.array([1e-7, 1.0, 2e-6], dtype=np.float16) + b_value = np.array([1e-7, 1.0, 2e-6], dtype=np.float16) + abs_bench = np.abs(b_value) + both_finite_mask = [True, True, True] + abs_err = abs(a_value - b_value) + small_value_mask = alg.get_small_value_mask(abs_bench, both_finite_mask, 1e-3) + self.assertEqual(alg.check_small_value(abs_err, small_value_mask, 0.001), 0) + + def test_check_norm_value(self): + both_finite_mask, inf_nan_mask = alg.get_finite_and_infinite_mask(self.bench_data, self.device_data) + small_value_mask = alg.get_small_value_mask(self.abs_bench, both_finite_mask, 1e-3) + normal_value_mask = np.logical_and(both_finite_mask, np.logical_not(small_value_mask)) + print(normal_value_mask) + print(self.rel_err) + self.assertEqual(alg.check_norm_value(normal_value_mask, self.rel_err, 0.001), 1) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..540460d0896532bbe242c3eb30b3ee945bae9571 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_api_precision_compare.py @@ -0,0 +1,77 @@ +import unittest + +import pandas as pd + +from msprobe.pytorch.api_accuracy_checker.compare.api_precision_compare import ( + CompareConfig, + BenchmarkStandard, + check_csv_columns, + check_error_rate, + get_api_checker_result, +) +from msprobe.core.common.const import CompareConst + + +class 
TestApiPrecisionCompare(unittest.TestCase): + + def setUp(self): + # Setup paths and mock data + self.config = CompareConfig( + npu_csv_path='mock_npu.csv', + gpu_csv_path='mock_gpu.csv', + result_csv_path='result.csv', + details_csv_path='details.csv' + ) + + self.npu_data = pd.DataFrame({ + 'API_NAME': ['api1.forward', 'api1.backward'], + 'DEVICE_DTYPE': ['float32', 'float32'], + 'ERROR_RATE': ['0', '0.1'], + 'SMALL_VALUE_ERROR_RATE': ['0.01', '0.02'], + 'RMSE': ['0.1', '0.2'], + 'MAX_REL_ERR': ['0.1', '0.2'], + 'MEAN_REL_ERR': ['0.1', '0.2'], + 'EB': ['0.1', '0.2'] + }) + + self.gpu_data = pd.DataFrame({ + 'API_NAME': ['api1.forward', 'api1.backward'], + 'DEVICE_DTYPE': ['float32', 'float32'], + 'ERROR_RATE': ['0', '0'], + 'SMALL_VALUE_ERROR_RATE': ['0.01', '0.01'], + 'RMSE': ['0.1', '0.1'], + 'MAX_REL_ERR': ['0.1', '0.1'], + 'MEAN_REL_ERR': ['0.1', '0.1'], + 'EB': ['0.1', '0.1'] + }) + + def test_benchmark_standard_calc_ratio(self): + column_name = "TEST_COLUMN" + default_value = 0 + result = BenchmarkStandard._calc_ratio(column_name, '2', '1', default_value) + self.assertEqual(result[0], 2.0) + + result = BenchmarkStandard._calc_ratio(column_name, '0', '0', default_value) + self.assertEqual(result[0], 1.0) + + def test_check_csv_columns(self): + with self.assertRaises(Exception): + check_csv_columns([], 'test_csv') + + def test_check_error_rate(self): + result = check_error_rate('0') + self.assertEqual(result, CompareConst.PASS) + + result = check_error_rate('0.1') + self.assertEqual(result, CompareConst.ERROR) + + def test_get_api_checker_result(self): + result = get_api_checker_result([CompareConst.PASS, CompareConst.ERROR]) + self.assertEqual(result, CompareConst.ERROR) + + result = get_api_checker_result([CompareConst.PASS, CompareConst.PASS]) + self.assertEqual(result, CompareConst.PASS) + + +if __name__ == '__main__': + unittest.main() diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py similarity index 61% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_compare.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py index 4ce73ce550dfc5d5cd21246dbc2756a6024f6fea..e1e6d51de292cf4d8b617ab73db67ff4920bfac3 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_compare.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare.py @@ -7,13 +7,14 @@ import unittest import numpy as np import torch.nn.functional -from api_accuracy_checker.compare.compare import Comparator -from api_accuracy_checker.compare.compare_column import CompareColumn +from msprobe.pytorch.api_accuracy_checker.compare.compare import Comparator +from msprobe.pytorch.api_accuracy_checker.compare.compare_column import CompareColumn +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import UtDataInfo current_time = time.strftime("%Y%m%d%H%M%S") RESULT_FILE_NAME = "accuracy_checking_result_" + current_time + ".csv" DETAILS_FILE_NAME = "accuracy_checking_details_" + current_time + '.csv' -base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) +base_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) class TestCompare(unittest.TestCase): @@ -30,10 +31,10 @@ class TestCompare(unittest.TestCase): shutil.rmtree(self.output_path) def test_compare_dropout(self): - dummmy_input = 
torch.randn(100, 100) - bench_out = torch.nn.functional.dropout2d(dummmy_input, 0.3) - npu_out = torch.nn.functional.dropout2d(dummmy_input, 0.3) - self.assertTrue(self.compare._compare_dropout("api", bench_out, npu_out)) + dummy_input = torch.randn(100, 100) + bench_out = torch.nn.functional.dropout2d(dummy_input, 0.3) + npu_out = torch.nn.functional.dropout2d(dummy_input, 0.3) + self.assertTrue(self.compare._compare_dropout(bench_out, npu_out)) def test_compare_core_wrapper(self): dummy_input = torch.randn(100, 100) @@ -41,14 +42,14 @@ class TestCompare(unittest.TestCase): test_final_success, detailed_result_total = self.compare._compare_core_wrapper("api", bench_out, npu_out) actual_cosine_similarity = detailed_result_total[0][3] # 设置一个小的公差值 - tolerance = 1e-4 + tolerance = 1e-4 # 判断实际的余弦相似度值是否在预期值的公差范围内 self.assertTrue(np.isclose(actual_cosine_similarity, 1.0, atol=tolerance)) # 对其他值进行比较,确保它们符合预期 detailed_result_total[0][3] = 1.0 self.assertEqual(detailed_result_total, [['torch.float32', 'torch.float32', (100, 100), 1.0, 0.0, ' ', ' ', ' ', - ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', 'pass', - '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n']]) + ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', ' ', ' ', ' ', 'pass', + '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n']]) self.assertTrue(test_final_success) bench_out, npu_out = [dummy_input, dummy_input], [dummy_input, dummy_input] @@ -61,51 +62,64 @@ class TestCompare(unittest.TestCase): detailed_result_total[1][3] = 1.0 self.assertTrue(test_final_success) self.assertEqual(detailed_result_total, [['torch.float32', 'torch.float32', (100, 100), 1.0, 0.0, ' ', ' ', ' ', - ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', 'pass', - '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n'], - ['torch.float32', 'torch.float32', (100, 100), 1.0, 0.0, ' ', ' ', ' ', ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', - ' ', 'pass', '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n']]) + ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', ' ', ' ', ' ', 'pass', + '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n'], + ['torch.float32', 'torch.float32', (100, 100), 1.0, 0.0, ' ', ' ', ' ', + ' ', 0.0, 0.0, 0, 0.0, 0.0, ' ', ' ', ' ', ' ', ' ', ' ', 'pass', + '\nMax abs error is less than 0.001, consider as pass, skip other check and set to SPACE.\n']]) def test_compare_output(self): bench_out, npu_out = torch.randn(100, 100), torch.randn(100, 100) bench_grad, npu_grad = [torch.randn(100, 100)], [torch.randn(100, 100)] - api_name = 'Functional*conv2d*0' - is_fwd_success, is_bwd_success = self.compare.compare_output(api_name, bench_out, npu_out, bench_grad, npu_grad) + api_name = 'Functional.conv2d.0' + data_info = UtDataInfo(bench_grad, npu_grad, bench_out, npu_out, None, None, None) + is_fwd_success, is_bwd_success = self.compare.compare_output(api_name, data_info) self.assertFalse(is_fwd_success) - self.assertFalse(is_bwd_success) + # is_bwd_success should be checked dummy_input = torch.randn(100, 100) bench_out, npu_out = dummy_input, dummy_input - is_fwd_success, is_bwd_success = self.compare.compare_output(api_name, bench_out, npu_out) + data_info = UtDataInfo(None, None, bench_out, npu_out, None, None, None) + is_fwd_success, is_bwd_success = self.compare.compare_output(api_name, data_info) self.assertTrue(is_fwd_success) self.assertTrue(is_bwd_success) def 
test_record_results(self): - args = ('Functional*conv2d*0', False, 'N/A', [['torch.float64', 'torch.float32', (32, 64, 112, 112), 1.0, + args = ('Functional.conv2d.0', False, 'N/A', [['torch.float64', 'torch.float32', (32, 64, 112, 112), 1.0, 0.012798667686, 'N/A', 0.81631212311, 0.159979121213, 'N/A', - 'error', '\n']], None) - self.compare.record_results(*args) + 'error', '\n']], None, 0) + self.compare.record_results(args) with open(self.details_csv_path, 'r') as file: csv_reader = csv.reader(file) next(csv_reader) api_name_list = [row[0] for row in csv_reader] - self.assertEqual(api_name_list[0], 'Functional*conv2d*0.forward.output.0') - + self.assertEqual(api_name_list[0], 'Functional.conv2d.0.forward.output.0') + def test_compare_torch_tensor(self): cpu_output = torch.Tensor([1.0, 2.0, 3.0]) npu_output = torch.Tensor([1.0, 2.0, 3.0]) compare_column = CompareColumn() - status, compare_column, message = self.compare._compare_torch_tensor("api", cpu_output, npu_output, compare_column) + status, compare_column, message = self.compare._compare_torch_tensor("api", cpu_output, npu_output, + compare_column) self.assertEqual(status, "pass") def test_compare_bool_tensor(self): cpu_output = np.array([True, False, True]) npu_output = np.array([True, False, True]) self.assertEqual(self.compare._compare_bool_tensor(cpu_output, npu_output), (0.0, 'pass', '')) - + def test_compare_builtin_type(self): compare_column = CompareColumn() bench_out = 1 npu_out = 1 status, compare_result, message = self.compare._compare_builtin_type(bench_out, npu_out, compare_column) self.assertEqual((status, compare_result.error_rate, message), ('pass', 0, '')) + + def test_compare_float_tensor(self): + cpu_output = torch.Tensor([1.0, 2.0, 3.0]) + npu_output = torch.Tensor([1.0, 2.0, 3.0]) + compare_column = CompareColumn() + status, compare_column, message = self.compare._compare_float_tensor("api", cpu_output.numpy(), + npu_output.numpy(), + compare_column, npu_output.dtype) + self.assertEqual(status, "pass") diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare_column.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare_column.py new file mode 100644 index 0000000000000000000000000000000000000000..782321868a8cbcae9ffed3b215ca068968b1c0ae --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare_column.py @@ -0,0 +1,10 @@ +import unittest + +from msprobe.pytorch.api_accuracy_checker.compare.compare_column import ApiPrecisionOutputColumn + + +class TestCompareColumns(unittest.TestCase): + + def test_api_precision_output_column(self): + col = ApiPrecisionOutputColumn() + self.assertIsInstance(col.to_column_value(), list) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_compare_utils.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare_utils.py similarity index 53% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_compare_utils.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare_utils.py index 4e83c0643ef452c28d11c02bbbc2fee359a1ea2e..ac9c974ea3ecf6a835ce448d754582d435548ed8 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/compare/test_compare_utils.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/compare/test_compare_utils.py @@ -1,6 +1,10 @@ import unittest + import numpy as 
np -from api_accuracy_checker.compare.compare_utils import CompareConst, check_dtype_comparable + +from msprobe.pytorch.api_accuracy_checker.common.utils import CompareException +from msprobe.pytorch.api_accuracy_checker.compare.compare_utils import check_dtype_comparable, convert_str_to_float + class TestCompareUtils(unittest.TestCase): def test_check_dtype_comparable(self): @@ -23,3 +27,17 @@ class TestCompareUtils(unittest.TestCase): x = np.array([1, 2, 3], dtype=np.int32) y = np.array([True, False, True], dtype=np.bool_) self.assertFalse(check_dtype_comparable(x, y)) + + def test_convert_str_to_float_when_valid_float(self): + self.assertEqual(convert_str_to_float("123.45"), 123.45) + + def test_convert_str_to_float_when_valid_int(self): + self.assertEqual(convert_str_to_float("123.0"), 123.0) + + def test_convert_str_to_float_when_valid_int_with_spaces(self): + self.assertEqual(convert_str_to_float(" 123.0 "), 123.0) + + def test_convert_str_to_float_when_empty_string(self): + with self.assertRaises(CompareException) as cm: + convert_str_to_float('') + self.assertEqual(cm.exception.code, CompareException.INVALID_DATA_ERROR) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/dump.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/dump.json new file mode 100644 index 0000000000000000000000000000000000000000..974cf3317414dde67cf75d92398a209c43664284 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/dump.json @@ -0,0 +1,179 @@ +{ + "task": "statistics", + "level": "mix", + "dump_data_dir": null, + "data": { + "Tensor.__mul__.7.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 256 + ], + "Max": 1.3955078125, + "Min": -1.443359375, + "Mean": -0.00013697147369384766, + "Norm": 318.5, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 1, + 1, + 256 + ], + "Max": 1.0, + "Min": -1.0, + "Mean": 0.214599609375, + "Norm": 547.0, + "requires_grad": false + } + ], + "input_kwargs": {}, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 256 + ], + "Max": 1.3564453125, + "Min": -1.443359375, + "Mean": -0.0014209747314453125, + "Norm": 240.125, + "requires_grad": true + } + ] + }, + "Torch.chunk.4.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 256 + ], + "Max": 1.3955078125, + "Min": -1.443359375, + "Mean": -0.00013697147369384766, + "Norm": 318.5, + "requires_grad": true + }, + { + "type": "int", + "value": 2 + } + ], + "input_kwargs": { + "dim": { + "type": "int", + "value": -1 + } + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 128 + ], + "Max": 1.3720703125, + "Min": -1.3759765625, + "Mean": 0.0015316009521484375, + "Norm": 226.25, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 128 + ], + "Max": 1.3955078125, + "Min": -1.443359375, + "Mean": -0.0018053054809570312, + "Norm": 224.375, + "requires_grad": true + } + ] + }, + "Torch.cat.8.forward": { + "input_args": [ + [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 128 + ], + "Max": 1.443359375, + "Min": -1.3955078125, + "Mean": 0.0018053054809570312, + "Norm": 224.375, + "requires_grad": true + }, + { + "type": 
"torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 128 + ], + "Max": 1.3720703125, + "Min": -1.3759765625, + "Mean": 0.0015316009521484375, + "Norm": 226.25, + "requires_grad": true + } + ] + ], + "input_kwargs": { + "dim": { + "type": "int", + "value": -1 + } + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 256 + ], + "Max": 1.443359375, + "Min": -1.3955078125, + "Mean": 0.0016689300537109375, + "Norm": 318.5, + "requires_grad": true + } + ] + } + } +} \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/forward.json b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/forward.json new file mode 100644 index 0000000000000000000000000000000000000000..dff6e546bea38576b180af887d3fc196d25ad20b --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/forward.json @@ -0,0 +1,63 @@ +{ + "Torch.chunk.4.forward": { + "input_args": [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 256 + ], + "Max": 1.3955078125, + "Min": -1.443359375, + "Mean": -0.00013697147369384766, + "Norm": 318.5, + "requires_grad": true + }, + { + "type": "int", + "value": 2 + } + ], + "input_kwargs": { + "dim": { + "type": "int", + "value": -1 + } + }, + "output": [ + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 128 + ], + "Max": 1.3720703125, + "Min": -1.3759765625, + "Mean": 0.0015316009521484375, + "Norm": 226.25, + "requires_grad": true + }, + { + "type": "torch.Tensor", + "dtype": "torch.float16", + "shape": [ + 2048, + 2, + 1, + 128 + ], + "Max": 1.3955078125, + "Min": -1.443359375, + "Mean": -0.0018053054809570312, + "Norm": 224.375, + "requires_grad": true + } + ] + } +} \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_data_generate.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_data_generate.py similarity index 61% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_data_generate.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_data_generate.py index b98f84d516404665b5c3284f1e03f14eedddac55..f664dad197f6bbaaed3d574f657552377b176dec 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_data_generate.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_data_generate.py @@ -1,20 +1,21 @@ # coding=utf-8 -import unittest -import numpy as np import os +import unittest import copy -from api_accuracy_checker.run_ut.data_generate import * -from api_accuracy_checker.common.utils import get_json_contents -base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) -forward_file = os.path.join(base_dir, "../resources/forward.json") +from msprobe.pytorch.api_accuracy_checker.run_ut.data_generate import * +from msprobe.pytorch.api_accuracy_checker.common.utils import get_json_contents + +base_dir = os.path.dirname(os.path.realpath(__file__)) +forward_file = os.path.join(base_dir, "forward.json") forward_content = get_json_contents(forward_file) -for api_full_name, api_info_dict in forward_content.items(): - api_full_name = api_full_name - api_info_dict = api_info_dict +for key, value in forward_content.items(): + api_full_name = key + api_info_dict = value + +max_value = 1.3945078125 
+min_value = -1.444359375 -max_value = 5.7421875 -min_value = -5.125 class TestDataGenerateMethods(unittest.TestCase): def test_gen_api_params(self): @@ -22,56 +23,57 @@ class TestDataGenerateMethods(unittest.TestCase): args_params, kwargs_params = gen_api_params(api_info, True, None, None) max_diff = abs(args_params[0].max() - max_value) min_diff = abs(args_params[0].min() - min_value) - self.assertEqual(len(args_params), 1) - self.assertEqual(args_params[0].dtype, torch.float32) + self.assertEqual(len(args_params), 2) + self.assertEqual(args_params[0].dtype, torch.float16) + self.assertEqual(args_params[1], 2) self.assertLessEqual(max_diff, 0.001) self.assertLessEqual(min_diff, 0.001) - self.assertEqual(args_params[0].shape, torch.Size([2, 2560, 24, 24])) - self.assertEqual(kwargs_params, {'inplace': False}) + self.assertEqual(args_params[0].shape, torch.Size([2048, 2, 1, 256])) + self.assertEqual(kwargs_params, {'dim': -1}) def test_gen_args(self): - args_result = gen_args(api_info_dict.get('args'), real_data_path=None) + args_result = gen_args(api_info_dict.get('input_args'), "conv2d") max_diff = abs(args_result[0].max() - max_value) min_diff = abs(args_result[0].min() - min_value) - self.assertEqual(len(args_result), 1) - self.assertEqual(args_result[0].dtype, torch.float32) + self.assertEqual(len(args_result), 2) + self.assertEqual(args_result[0].dtype, torch.float16) self.assertLessEqual(max_diff, 0.001) self.assertLessEqual(min_diff, 0.001) - self.assertEqual(args_result[0].shape, torch.Size([2, 2560, 24, 24])) + self.assertEqual(args_result[0].shape, torch.Size([2048, 2, 1, 256])) def test_gen_data(self): - data = gen_data(api_info_dict.get('args')[0], True, None, None) + data = gen_data(api_info_dict.get('input_args')[0], "conv2d", True, None, None) max_diff = abs(data.max() - max_value) min_diff = abs(data.min() - min_value) - self.assertEqual(data.dtype, torch.float32) + self.assertEqual(data.dtype, torch.float16) self.assertEqual(data.requires_grad, True) self.assertLessEqual(max_diff, 0.001) self.assertLessEqual(min_diff, 0.001) - self.assertEqual(data.shape, torch.Size([2, 2560, 24, 24])) + self.assertEqual(data.shape, torch.Size([2048, 2, 1, 256])) def test_gen_kwargs(self): api_info = copy.deepcopy(api_info_dict) kwargs_params = gen_kwargs(api_info, None) - self.assertEqual(kwargs_params, {'inplace': False}) - + self.assertEqual(kwargs_params, {'dim': -1}) + def test_gen_kwargs_2(self): k_dict = {"inplace": {"type": "bool", "value": "False"}} for key, value in k_dict.items(): gen_torch_kwargs(k_dict, key, value) self.assertEqual(k_dict, {'inplace': False}) - + def test_gen_random_tensor(self): - data = gen_random_tensor(api_info_dict.get('args')[0], None) + data = gen_random_tensor(api_info_dict.get('input_args')[0], None) max_diff = abs(data.max() - max_value) min_diff = abs(data.min() - min_value) - self.assertEqual(data.dtype, torch.float32) + self.assertEqual(data.dtype, torch.float16) self.assertEqual(data.requires_grad, False) self.assertLessEqual(max_diff, 0.001) self.assertLessEqual(min_diff, 0.001) - self.assertEqual(data.shape, torch.Size([2, 2560, 24, 24])) - + self.assertEqual(data.shape, torch.Size([2048, 2, 1, 256])) + def test_gen_common_tensor(self): - info = api_info_dict.get('args')[0] + info = api_info_dict.get('input_args')[0] low, high = info.get('Min'), info.get('Max') low_origin, high_origin = info.get('Min_origin'), info.get('Max_origin') low_info = [low, low_origin] @@ -81,15 +83,15 @@ class TestDataGenerateMethods(unittest.TestCase): data = 
gen_common_tensor(low_info, high_info, shape, data_dtype, None) max_diff = abs(data.max() - max_value) min_diff = abs(data.min() - min_value) - self.assertEqual(data.dtype, torch.float32) + self.assertEqual(data.dtype, torch.float16) self.assertEqual(data.requires_grad, False) self.assertLessEqual(max_diff, 0.001) self.assertLessEqual(min_diff, 0.001) - self.assertEqual(data.shape, torch.Size([2, 2560, 24, 24])) - + self.assertEqual(data.shape, torch.Size([2048, 2, 1, 256])) + def test_gen_bool_tensor(self): - info = {"type": "torch.Tensor", "dtype": "torch.bool", "shape": [1, 1, 160, 256], \ - "Max": 1, "Min": 0, "requires_grad": False} + info = {"type": "torch.Tensor", "dtype": "torch.bool", "shape": [1, 1, 160, 256], "Max": 1, "Min": 0, + "requires_grad": False} low, high = info.get("Min"), info.get("Max") shape = tuple(info.get("shape")) data = gen_bool_tensor(low, high, shape) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_multi_run_ut.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_multi_run_ut.py similarity index 62% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_multi_run_ut.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_multi_run_ut.py index 315b16127972103dffdfe89c941d330c6962305d..771e0423804253fa5e974a65f35119ddc7b32c15 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_multi_run_ut.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_multi_run_ut.py @@ -1,34 +1,33 @@ +import os +import glob import unittest +import logging from unittest.mock import patch, mock_open, MagicMock import json import signal -from api_accuracy_checker.run_ut.multi_run_ut import split_json_file, signal_handler, run_parallel_ut, prepare_config, main, ParallelUTConfig +from msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut import split_json_file, signal_handler, run_parallel_ut, \ + prepare_config, main, ParallelUTConfig class TestMultiRunUT(unittest.TestCase): def setUp(self): - self.test_json_file = 'test_file.json' - self.test_data = {'key1': 'TRUE', 'key2': 'TRUE', 'key3': 'TRUE'} + self.test_json_file = os.path.join(os.path.dirname(os.path.realpath(__file__)), "dump.json") + self.test_data = {'data': {'key1': 'TRUE', 'key2': 'TRUE', 'key3': 'TRUE'}} self.test_json_content = json.dumps(self.test_data) self.forward_split_files_content = [ {'key1': 'TRUE', 'key2': 'TRUE'}, {'key3': 'TRUE', 'key4': 'TRUE'} ] - @patch('api_accuracy_checker.run_ut.multi_run_ut.FileOpen') + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut.FileOpen') def test_split_json_file(self, mock_FileOpen): mock_FileOpen.return_value.__enter__.return_value = mock_open(read_data=self.test_json_content).return_value num_splits = 2 split_files, total_items = split_json_file(self.test_json_file, num_splits, False) self.assertEqual(len(split_files), num_splits) - self.assertEqual(total_items, len(self.test_data)) + self.assertEqual(total_items, len(self.test_data.get('data'))) - @patch('api_accuracy_checker.run_ut.multi_run_ut.print_warn_log') - def test_signal_handler(self, mock_print_warn_log): - with self.assertRaises(KeyboardInterrupt): - signal_handler(signal.SIGINT, None) - mock_print_warn_log.assert_called() @patch('subprocess.Popen') @patch('os.path.exists', return_value=True) @@ -41,8 +40,7 @@ class TestMultiRunUT(unittest.TestCase): mock_popen.return_value = mock_process config = 
ParallelUTConfig( - forward_files=['forward_split1.json', 'forward_split2.json'], - backward_files=[None, None], + api_files=['test.json'], out_path='./', num_splits=2, save_error_data_flag=True, @@ -65,17 +63,18 @@ class TestMultiRunUT(unittest.TestCase): @patch('os.remove') @patch('os.path.realpath', side_effect=lambda x: x) - @patch('api_accuracy_checker.run_ut.multi_run_ut.check_link') - @patch('api_accuracy_checker.run_ut.multi_run_ut.check_file_suffix') - @patch('api_accuracy_checker.run_ut.multi_run_ut.FileChecker') - @patch('api_accuracy_checker.run_ut.multi_run_ut.split_json_file', return_value=(['forward_split1.json', 'forward_split2.json'], 2)) - def test_prepare_config(self, mock_split_json_file, mock_FileChecker, mock_check_file_suffix, mock_check_link, mock_realpath, mock_remove): + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut.check_link') + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut.check_file_suffix') + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut.FileChecker') + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut.split_json_file', + return_value=(['forward_split1.json', 'forward_split2.json'], 2)) + def test_prepare_config(self, mock_split_json_file, mock_FileChecker, mock_check_file_suffix, mock_check_link, + mock_realpath, mock_remove): mock_FileChecker_instance = MagicMock() mock_FileChecker_instance.common_check.return_value = './' mock_FileChecker.return_value = mock_FileChecker_instance args = MagicMock() - args.forward_input_file = 'forward.json' - args.backward_input_file = None + args.api_info = 'forward.json' args.out_path = './' args.num_splits = 2 args.save_error_data = True @@ -90,14 +89,27 @@ class TestMultiRunUT(unittest.TestCase): self.assertTrue(config.save_error_data_flag) self.assertFalse(config.jit_compile_flag) self.assertEqual(config.device_id, [0, 1]) - self.assertEqual(len(config.forward_files), 2) self.assertEqual(config.total_items, 2) + @patch('argparse.ArgumentParser.parse_args') - @patch('api_accuracy_checker.run_ut.multi_run_ut.prepare_config') - @patch('api_accuracy_checker.run_ut.multi_run_ut.run_parallel_ut') + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut.prepare_config') + @patch('msprobe.pytorch.api_accuracy_checker.run_ut.multi_run_ut.run_parallel_ut') def test_main(self, mock_run_parallel_ut, mock_prepare_config, mock_parse_args): main() mock_parse_args.assert_called() mock_prepare_config.assert_called() - mock_run_parallel_ut.assert_called() \ No newline at end of file + mock_run_parallel_ut.assert_called() + + def tearDown(self): + current_directory = os.getcwd() + pattern = os.path.join(current_directory, 'accuracy_checking_*') + files = glob.glob(pattern) + + for file in files: + try: + os.remove(file) + logging.info(f"Deleted file: {file}") + except Exception as e: + logging.error(f"Failed to delete file {file}: {e}") + diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_run_ut.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut.py similarity index 59% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_run_ut.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut.py index fdcc1cfddeb38d4fca0d2a67a09147b571b35def..bc643794ab692ff5d21bcf412450f845d01f662d 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/run_ut/test_run_ut.py +++ 
b/debug/accuracy_tools/msprobe/test/pytorch_ut/api_accuracy_checker/run_ut/test_run_ut.py @@ -2,69 +2,71 @@ import os import copy import unittest -from unittest.mock import patch, DEFAULT import torch -from api_accuracy_checker.run_ut.run_ut import * -from api_accuracy_checker.common.utils import get_json_contents +from unittest.mock import patch, DEFAULT +from msprobe.pytorch.api_accuracy_checker.run_ut.run_ut import * +from msprobe.pytorch.api_accuracy_checker.common.utils import get_json_contents -base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) -forward_file = os.path.join(base_dir, "../resources/forward.json") +base_dir = os.path.dirname(os.path.realpath(__file__)) +forward_file = os.path.join(base_dir, "forward.json") forward_content = get_json_contents(forward_file) for api_full_name, api_info_dict in forward_content.items(): api_full_name = api_full_name api_info_dict = api_info_dict - + + class TestRunUtMethods(unittest.TestCase): def test_exec_api(self): api_info = copy.deepcopy(api_info_dict) - [api_type, api_name, _] = api_full_name.split("*") + + [api_type, api_name, _, _] = api_full_name.split(".") args, kwargs, need_grad = get_api_info(api_info, api_name, None) cpu_args, cpu_kwargs = generate_cpu_params(args, kwargs, True, '') out = exec_api(api_type, api_name, cpu_args, cpu_kwargs) - self.assertEqual(out.dtype, torch.float64) - self.assertTrue(out.requires_grad) - self.assertEqual(out.shape, torch.Size([2, 2560, 24, 24])) + self.assertEqual(out[0].dtype, torch.float32) + self.assertTrue(out[0].requires_grad) + self.assertEqual(out[0].shape, torch.Size([2048, 2, 1, 128])) def test_generate_device_params(self): mock_tensor = torch.rand([2, 2560, 24, 24], dtype=torch.float32, requires_grad=True) - - with patch.multiple('torch.Tensor', - to=DEFAULT, - clone=DEFAULT, - detach=DEFAULT, - requires_grad_=DEFAULT, - type_as=DEFAULT, - retain_grad=DEFAULT) as mocks: + + with patch.multiple('torch.Tensor', + to=DEFAULT, + clone=DEFAULT, + detach=DEFAULT, + requires_grad_=DEFAULT, + type_as=DEFAULT, + retain_grad=DEFAULT) as mocks: mocks['clone'].return_value = mock_tensor mocks['detach'].return_value = mock_tensor mocks['requires_grad_'].return_value = mock_tensor mocks['type_as'].return_value = mock_tensor mocks['retain_grad'].return_value = None mocks['to'].return_value = mock_tensor - + device_args, device_kwargs = generate_device_params([mock_tensor], {'inplace': False}, True, '') self.assertEqual(len(device_args), 1) self.assertEqual(device_args[0].dtype, torch.float32) self.assertTrue(device_args[0].requires_grad) self.assertEqual(device_args[0].shape, torch.Size([2, 2560, 24, 24])) self.assertEqual(device_kwargs, {'inplace': False}) - + def test_generate_cpu_params(self): api_info = copy.deepcopy(api_info_dict) - [api_type, api_name, _] = api_full_name.split("*") + [api_type, api_name, _, _] = api_full_name.split(".") args, kwargs, need_grad = get_api_info(api_info, api_name, None) cpu_args, cpu_kwargs = generate_cpu_params(args, kwargs, True, '') - self.assertEqual(len(cpu_args), 1) - self.assertEqual(cpu_args[0].dtype, torch.float64) + self.assertEqual(len(cpu_args), 2) + self.assertEqual(cpu_args[0].dtype, torch.float32) self.assertTrue(cpu_args[0].requires_grad) - self.assertEqual(cpu_args[0].shape, torch.Size([2, 2560, 24, 24])) - self.assertEqual(cpu_kwargs, {'inplace': False}) - + self.assertEqual(cpu_args[0].shape, torch.Size([2048, 2, 1, 256])) + self.assertEqual(cpu_kwargs, {'dim': -1}) + def test_UtDataInfo(self): - data_info = UtDataInfo(None, 
None, None, None, None, None) - self.assertIsNone(data_info.bench_grad_out) - self.assertIsNone(data_info.device_grad_out) - self.assertIsNone(data_info.device_out) - self.assertIsNone(data_info.bench_out) + data_info = UtDataInfo(None, None, None, None, None, None, None) + self.assertIsNone(data_info.bench_grad) + self.assertIsNone(data_info.device_grad) + self.assertIsNone(data_info.device_output) + self.assertIsNone(data_info.bench_output) self.assertIsNone(data_info.grad_in) self.assertIsNone(data_info.in_fwd_data_list) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_acc_compare.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_acc_compare.py new file mode 100644 index 0000000000000000000000000000000000000000..288e259c0aae104a62054af3813b7831ec7722f7 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_acc_compare.py @@ -0,0 +1,267 @@ +# coding=utf-8 +import unittest +import pandas as pd +from msprobe.pytorch.compare import acc_compare as compare + +npu_dict = {'op_name': ['Functional_conv2d_0_forward_input.0', 'Functional_conv2d_0_forward_input.1', + 'Functional_conv2d_0_forward_input.2', 'Functional_conv2d_0_forward_output'], + 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), + ('torch.float32', [16])], + 'output_struct': [('torch.float32', [1, 16, 28, 28])], + 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], + [0.19919930398464203, -0.19974489510059357, 0.006269412115216255], + [0.19734230637550354, -0.18177609145641327, 0.007903944700956345], + [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} + +bench_dict = {'op_name': ['Functional_conv2d_0_forward_input.0', 'Functional_conv2d_0_forward_input.1', + 'Functional_conv2d_0_forward_input.2', 'Functional_conv2d_0_forward_output'], + 'input_struct': [('torch.float32', [1, 1, 28, 28]), ('torch.float32', [16, 1, 5, 5]), + ('torch.float32', [16])], + 'output_struct': [('torch.float32', [1, 16, 28, 28])], + 'summary': [[3.029174327850342, -2.926689624786377, -0.06619918346405029], + [0.19919930398464203, -0.19974489510059357, 0.006269412115216255], + [0.19734230637550354, -0.18177609145641327, 0.007903944700956345], + [2.1166646480560303, -2.190781354904175, -0.003579073818400502]], 'stack_info': []} + +tensor_list = [ + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], 'Max': 0.33033010363578796, + 'Min': -0.331031858921051,'Mean': -0.030964046716690063, 'Norm': 2.2533628940582275, 'requires_grad': True, + 'full_op_name': 'Tensor.add_.0.forward_input.0'}, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, + 'Norm': 0.02844562754034996, 'requires_grad': False, 'full_op_name': 'Tensor.add_.0.forward_input.1'}, + {'full_op_name': 'Tensor.add_.0.forward_input.alpha.0', 'dtype': "", "shape": '[]', 'md5': None, + 'Max': -0.1, 'Min': -0.1, 'Mean': -0.1, 'Norm': -0.1, 'data_name': '-1'}, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, + 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_.0.forward_output.0'} +] + +result_op_dict = {'op_name': ['Tensor.add_.0.forward_input.0', 'Tensor.add_.0.forward_input.1', + 'Tensor.add_.0.forward_input.alpha.0', 'Tensor.add_.0.forward_output.0'], + 'input_struct': 
[('torch.float32', [16, 1, 3, 3]), ('torch.float32', [16, 1, 3, 3]), + ("", '[]')], + 'output_struct': [('torch.float32', [16, 1, 3, 3])], + 'summary': [[0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275], + [0.003992878366261721, -0.008102823048830032, -0.0002002553956117481, 0.02844562754034996], + [-0.1, -0.1, -0.1, -0.1], + [0.33033010363578796, -0.331031858921051, -0.030964046716690063, 2.2533628940582275]], + 'stack_info': []} + +o_result = [ + ['Functional_conv2d_0_forward_input.0', 'Functional_conv2d_0_forward_input.0', 'torch.float32', 'torch.float32', + [1, 1, 28, 28], [1, 1, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 3.029174327850342, -2.926689624786377, + -0.06619918346405029, 3.029174327850342, -2.926689624786377, -0.06619918346405029, '', '', 'None'], + ['Functional_conv2d_0_forward_input.1', 'Functional_conv2d_0_forward_input.1', 'torch.float32', 'torch.float32', + [16, 1, 5, 5], [16, 1, 5, 5], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19919930398464203, -0.19974489510059357, + 0.006269412115216255, 0.19919930398464203, -0.19974489510059357, 0.006269412115216255, '', '', 'None'], + ['Functional_conv2d_0_forward_input.2', 'Functional_conv2d_0_forward_input.2', 'torch.float32', 'torch.float32', + [16], [16], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, + 0.19734230637550354, -0.18177609145641327, 0.007903944700956345, '', '', 'None'], + ['Functional_conv2d_0_forward_output', 'Functional_conv2d_0_forward_output', 'torch.float32', 'torch.float32', + [1, 16, 28, 28], [1, 16, 28, 28], 0.0, 0.0, 0.0, ' ', '0.0%', '0.0%', '0.0%', ' ', 2.1166646480560303, -2.190781354904175, + -0.003579073818400502, 2.1166646480560303, -2.190781354904175, -0.003579073818400502, '', '', 'None']] + +npu_dict_aten = {'op_name': ['Aten__native_batch_norm_legit_functional.default_0_forward_input.0', + 'Aten__native_batch_norm_legit_functional.default_0_forward_input.1', + 'Aten__native_batch_norm_legit_functional.default_0_forward_input.2', + 'Aten__native_batch_norm_legit_functional.default_0_forward_input.3', + 'Aten__native_batch_norm_legit_functional.default_0_forward_input.4', + 'Aten__native_batch_norm_legit_functional.default_0_forward_output.0', + 'Aten__native_batch_norm_legit_functional.default_0_forward_output.1', + 'Aten__native_batch_norm_legit_functional.default_0_forward_output.2', + 'Aten__native_batch_norm_legit_functional.default_0_forward_output.3', + 'Aten__native_batch_norm_legit_functional.default_0_forward_output.4'], + 'input_struct': [('torch.float16', [256, 256, 14, 14]), ('torch.float32', [256]), + ('torch.float32', [256]), ('torch.float32', [256]), ('torch.float32', [256])], + 'output_struct': [('torch.float16', [256, 256, 14, 14]), ('torch.float32', [256]), + ('torch.float32', [256]), ('torch.float32', [256]), ('torch.float32', [256])], + 'summary': [[139.625, -127.5625, -0.0103607177734375], + [2.5276029109954834, -2.1788690090179443, -0.0008259844034910202], + [2.472219944000244, -2.845968723297119, -0.008756577968597412], + [2.763145923614502, -3.398397922515869, -0.052132632583379745], + [2.673110008239746, -3.149275064468384, 0.01613386906683445], + [13.5546875, -10.640625, -0.008758544921875], + [0.30550330877304077, -0.24485322833061218, -0.010361209511756897], + [623.9192504882812, 432.96826171875, 520.2276611328125], + [2.4797861576080322, -3.055997371673584, -0.04795549064874649], + [61.7945556640625, 42.59713363647461, 52.03831481933594]]} + 
+bench_dict_functional = { + 'op_name': ['Functional_batch_norm_0_forward_input.0', 'Functional_batch_norm_0_forward_input.1', + 'Functional_batch_norm_0_forward_input.2', 'Functional_batch_norm_0_forward_input.3', + 'Functional_batch_norm_0_forward_input.4', 'Functional_batch_norm_0_forward_output'], + 'input_struct': [('torch.float32', [256, 256, 14, 14]), ('torch.float32', [256]), ('torch.float32', [256]), + ('torch.float32', [256]), ('torch.float32', [256])], + 'output_struct': [('torch.float32', [256, 256, 14, 14])], + 'summary': [[3.061628818511963, -3.22507381439209, 3.634914173744619e-05], + [0.0005779837374575436, -0.0006301702815108001, 3.634906533989124e-06], + [0.9338104128837585, 0.9277191162109375, 0.930335283279419], + [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], + [5.397906303405762, -5.796811580657959, 2.5283952709287405e-10]] +} + +aten_result = [ + ['Aten__native_batch_norm_legit_functional.default_0_forward_input.0', 'Functional_batch_norm_0_forward_input.0', + 'torch.float16', 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 136.56337118148804, -124.33742618560791, + -0.010397066915174946, ' ', '4460.480981749501%', '3855.335826136584%', '28603.33536971545%', ' ', 139.625, + -127.5625, -0.0103607177734375, 3.061628818511963, -3.22507381439209, 3.634914173744619e-05, 'Warning', + 'Need double check api accuracy.', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_input.1', 'Functional_batch_norm_0_forward_input.1', + 'torch.float32', 'torch.float32', [256], [256], 2.527024927258026, -2.1782388387364335, -0.0008296193100250093, + ' ', '437213.84590749856%', '345658.76916858414%', '22823.676544842117%', ' ', 2.5276029109954834, + -2.1788690090179443, -0.0008259844034910202, 0.0005779837374575436, -0.0006301702815108001, 3.634906533989124e-06, + 'Warning', 'Need double check api accuracy.', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_input.2', 'Functional_batch_norm_0_forward_input.2', + 'torch.float32', 'torch.float32', [256], [256], 1.5384095311164856, -3.7736878395080566, -0.9390918612480164, ' ', + '164.74538192025793%', '406.7705163736246%', '100.94122819224167%', ' ', 2.472219944000244, -2.845968723297119, + -0.008756577968597412, 0.9338104128837585, 0.9277191162109375, 0.930335283279419, 'Warning', + 'Need double check api accuracy.', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_input.3', 'Functional_batch_norm_0_forward_input.3', + 'torch.float32', 'torch.float32', [256], [256], 1.763145923614502, -4.398397922515869, -1.0521326325833797, ' ', + '176.3145923614502%', '439.8397922515869%', '105.21326325833797%', ' ', 2.763145923614502, -3.398397922515869, + -0.052132632583379745, 1.0, 1.0, 1.0, 'Warning', 'Need double check api accuracy.', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_input.4', 'Functional_batch_norm_0_forward_input.4', + 'torch.float32', 'torch.float32', [256], [256], 2.673110008239746, -3.149275064468384, 0.01613386906683445, ' ', + 'N/A', 'N/A', 'N/A', ' ', 2.673110008239746, -3.149275064468384, 0.01613386906683445, 0.0, 0.0, 0.0, 'Warning', + 'Need double check api accuracy.', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_output.0', 'Functional_batch_norm_0_forward_output', + 'torch.float16', 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 8.156781196594238, -4.843813419342041, + -0.008758545174714527, ' ', '151.11009228611078%', '83.55995967687207%', '3464072756.115108%', ' ', 13.5546875, + -10.640625, 
-0.008758544921875, 5.397906303405762, -5.796811580657959, 2.5283952709287405e-10, 'Warning', + 'Need double check api accuracy.', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_output.1', 'Nan', 'torch.float32', 'Nan', [256], 'Nan', + ' ', ' ', ' ', ' ', ' ', 0.30550330877304077, -0.24485322833061218, -0.010361209511756897, 'Nan', 'Nan', 'Nan', + 'Yes', '', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_output.2', 'Nan', 'torch.float32', 'Nan', [256], 'Nan', + ' ', ' ', ' ', ' ', ' ', 623.9192504882812, 432.96826171875, 520.2276611328125, 'Nan', 'Nan', 'Nan', + 'Yes', '', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_output.3', 'Nan', 'torch.float32', 'Nan', [256], 'Nan', + ' ', ' ', ' ', ' ', ' ', 2.4797861576080322, -3.055997371673584, -0.04795549064874649, 'Nan', 'Nan', 'Nan', + 'Yes', '', 'None'], + ['Aten__native_batch_norm_legit_functional.default_0_forward_output.4', 'Nan', 'torch.float32', 'Nan', [256], 'Nan', + ' ', ' ', ' ', ' ', ' ', 61.7945556640625, 42.59713363647461, 52.03831481933594, 'Nan', 'Nan', 'Nan', + 'Yes', '', 'None']] + +highlight_dict = {'red_rows': [], 'yellow_rows': []} + +num_0, num_1, num_2, num_3 = 0, 1, 2, 3 +summary_line_input = ['Functional_batch_norm_0_forward_input.0', 'Functional_batch_norm_0_forward_input.0', + 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.01, 0, 0, 0, 1, 1, 1, 1, 1.01, 1, 1, 1, + 'Yes', ''] +summary_line_1 = ['Functional_batch_norm_0_forward_output.0', 'Functional_batch_norm_0_forward_output.0', + 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 10, 0, 0, 0, 2, 0, 1, 1, 1, 1, 1, 1, + 'Warning', ''] +summary_line_2 = ['Functional_batch_norm_0_forward_output.1', 'Functional_batch_norm_0_forward_output.1', + 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.02, 0, 0, 0, 0.12, 0, 1, 1, 0.1, 1, 1, 1, + 'Warning', ''] +summary_line_3 = ['Functional_batch_norm_0_forward_output.2', 'Functional_batch_norm_0_forward_output.2', + 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0, 0, 0, 0, 2, 0, 1, 1, 1, 1, 1, 1, + 'Warning', ''] +line_input = ['Functional_batch_norm_0_forward_input.0', 'Functional_batch_norm_0_forward_input.0', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 1, 1, 1, 0.95, 1, 1, 1, 1, 1, 1.01, 1, 1, 1, + 'Yes', ''] +line_1 = ['Functional_batch_norm_0_forward_output.0', 'Functional_batch_norm_0_forward_output.0', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.8, 1, 1, 0.59, 1, 'nan', 0, 1, 1, 19, 1, 1, 1, + 'Warning', ''] +line_2 = ['Functional_batch_norm_0_forward_output.1', 'Functional_batch_norm_0_forward_output.1', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.9, 1, 1, 0.8, 1, 0, 0.12, 0, 1, 1, 0.1, 1, 1, 1, + 'Warning', ''] +line_3 = ['Functional_batch_norm_0_forward_output.2', 'Functional_batch_norm_0_forward_output.2', 'torch.float16', + 'torch.float32', [256, 256, 14, 14], [256, 256, 14, 14], 0.8, 1.1e+10, 1, 0.85, 1, 9, 0.12, 0, 1, 1, 0.1, 1, + 1, 1, 'Warning', ''] + +op_data = { + 'input_args': [{'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.33033010363578796, 'Min': -0.331031858921051,'Mean': -0.030964046716690063, + 'Norm': 2.2533628940582275, 'requires_grad': True}, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 
'Mean': -0.0002002553956117481, + 'Norm': 0.02844562754034996, 'requires_grad': False}], + 'input_kwargs': {'alpha': {'type': 'float', 'value': -0.1}}, + 'output': [{'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.33033010363578796, 'Min': -0.331031858921051,'Mean': -0.030964046716690063, + 'Norm': 2.2533628940582275, 'requires_grad': True}]} + +op_name = "Tensor.add_0.0.forward" + +op_result = [ + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, + 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_0.0.forward_input.0'}, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.003992878366261721, 'Min': -0.008102823048830032, 'Mean': -0.0002002553956117481, + 'Norm': 0.02844562754034996, 'requires_grad': False, 'full_op_name': 'Tensor.add_0.0.forward_input.1'}, + {'full_op_name': 'Tensor.add_0.0.forward_input.alpha.0', 'dtype': "", 'shape': '[]', 'md5': None, + 'Max': -0.1, 'Min': -0.1, 'Mean': -0.1, 'Norm': -0.1, 'data_name': '-1'}, + {'type': 'torch.Tensor', 'dtype': 'torch.float32', 'shape': [16, 1, 3, 3], + 'Max': 0.33033010363578796, 'Min': -0.331031858921051, 'Mean': -0.030964046716690063, + 'Norm': 2.2533628940582275, 'requires_grad': True, 'full_op_name': 'Tensor.add_0.0.forward_output.0'}] + + +class TestUtilsMethods(unittest.TestCase): + + def test_check_graph_mode(self): + op1 = "Aten" + op2 = "torch" + self.assertTrue(compare.check_graph_mode(op1, op2)) + self.assertTrue(compare.check_graph_mode(op2, op1)) + self.assertFalse(compare.check_graph_mode(op1, op1)) + self.assertFalse(compare.check_graph_mode(op2, op2)) + + def test_check_op(self): + fuzzy_match = False + result = compare.check_op(npu_dict, bench_dict, fuzzy_match) + self.assertEqual(result, True) + + def test_merge_tensor(self): + op_dict = compare.merge_tensor(tensor_list, True, False) + self.assertEqual(op_dict, result_op_dict) + + def test_read_op(self): + result = compare.read_op(op_data, op_name) + self.assertEqual(result, op_result) + + def test_match_op(self): + fuzzy_match = False + a, b = compare.match_op([npu_dict], [bench_dict], fuzzy_match) + self.assertEqual(a, 0) + self.assertEqual(b, 0) + + def test_get_accuracy(self): + result = [] + compare.get_accuracy(result, npu_dict, bench_dict, highlight_dict) + self.assertEqual(result, o_result) + + def test_get_accuracy_graph_mode(self): + result = [] + compare.get_accuracy(result, npu_dict_aten, bench_dict_functional, highlight_dict) + self.assertEqual(result, aten_result) + + def test_find_error_rows(self): + summary_result = [summary_line_input, summary_line_1, summary_line_2, summary_line_3] + highlight_dict = {'red_rows': [], 'yellow_rows': []} + compare.find_error_rows(summary_result, 0, 1, highlight_dict, summary_compare=True) + self.assertEqual(highlight_dict, {'red_rows': [], 'yellow_rows': []}) + + def test_find_compare_result_error_rows(self): + result = [line_input, line_1, line_2, line_3] + result_df = pd.DataFrame(result) + highlight_dict = {'red_rows': [], 'yellow_rows': []} + compare.find_compare_result_error_rows(result_df, highlight_dict, False, False) + self.assertEqual(highlight_dict, {'red_rows': [num_1, num_3], 'yellow_rows': [num_2]}) + + def test_rename_api(self): + test_name_1 = "Distributed.broadcast.0.forward.input.0" + expect_name_1 = "Distributed.broadcast.input.0" + actual_name_1 = compare.rename_api(test_name_1, "forward") 
+ self.assertEqual(actual_name_1, expect_name_1) + + test_name_2 = "Torch.sum.0.backward.output.0" + expect_name_2 = "Torch.sum.output.0" + actual_name_2 = compare.rename_api(test_name_2, "backward") + self.assertEqual(actual_name_2, expect_name_2) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_match.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_match.py new file mode 100644 index 0000000000000000000000000000000000000000..ac28e994e9c8e77f8ae675fec3322eaf64a64321 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/compare/test_match.py @@ -0,0 +1,20 @@ +# coding=utf-8 +import unittest +from msprobe.pytorch.compare import match + + +class TestMatch(unittest.TestCase): + def test_graph_mapping(self): + op1 = "Aten_convolution_1_forward_0.input.0" + op2 = "Torch_conv2d_0_forward_0.input.0" + op3 = "Torch_batch_norm_0_forward_0.input.0" + op4 = "Aten_convolution.default_1_forward_0.input.0" + op5 = "Aten_foo_1_forward_0.input.0" + self.assertTrue(match.graph_mapping.match(op1, op2)) + self.assertTrue(match.graph_mapping.match(op2, op1)) + self.assertTrue(match.graph_mapping.match(op4, op2)) + self.assertTrue(match.graph_mapping.match(op2, op4)) + self.assertFalse(match.graph_mapping.match(op1, op3)) + self.assertFalse(match.graph_mapping.match(op3, op1)) + self.assertFalse(match.graph_mapping.match(op5, op2)) + self.assertFalse(match.graph_mapping.match(op2, op5)) diff --git a/debug/accuracy_tools/test/pytorch/free_benchmark/test_perturbed_layser.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/perturbed_layers/test_perturbed_layser.py similarity index 83% rename from debug/accuracy_tools/test/pytorch/free_benchmark/test_perturbed_layser.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/perturbed_layers/test_perturbed_layser.py index 7fea7fa8e095fece43c408e07bde970206fcccc0..ad9eb5cd0ed5b2eaaff9745c8af9ced8dc1ab883 100644 --- a/debug/accuracy_tools/test/pytorch/free_benchmark/test_perturbed_layser.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/perturbed_layers/test_perturbed_layser.py @@ -1,10 +1,10 @@ from unittest import TestCase import torch -from atat.pytorch.common.utils import Const -from atat.pytorch.free_benchmark.common.enums import DeviceType, PerturbationMode -from atat.pytorch.free_benchmark.common.params import data_pre_deal -from atat.pytorch.free_benchmark.perturbed_layers.layer_factory import LayerFactory +from msprobe.core.common.const import Const +from msprobe.pytorch.free_benchmark.common.enums import DeviceType, PerturbationMode +from msprobe.pytorch.free_benchmark.common.params import data_pre_deal +from msprobe.pytorch.free_benchmark.perturbed_layers.layer_factory import LayerFactory class TestPerturbedLayer(TestCase): @@ -90,3 +90,16 @@ class TestPerturbedLayer(TestCase): self.assertEqual(Perturbed_value[0], 4096.0000000000) self.assertEqual(Perturbed_value[1], 16777218) self.assertEqual(Perturbed_value[2], 1e-38) + + # 对于输入张量,add_noise扰动因子对大于极小值的部分增加一个小值 + def test_add_noise_layer(self): + api_name = "addnoise.0.forward" + inputs = torch.as_tensor( + [1e-1, 1e-2], dtype=torch.bfloat16 + ) + layer = LayerFactory.create( + api_name, DeviceType.NPU, PerturbationMode.ADD_NOISE + ) + Perturbed_value = layer.add_noise(inputs) + self.assertEqual(Perturbed_value[0], 1e-1+1e-4) + self.assertEqual(Perturbed_value[1], 1e-2) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/result_handlers/test_result_handler.py 
b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/result_handlers/test_result_handler.py new file mode 100644 index 0000000000000000000000000000000000000000..399efeb42d7cd7e7e34dd472cd8a9d82c26a5b5e --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/result_handlers/test_result_handler.py @@ -0,0 +1,121 @@ +from abc import ABC +from unittest import TestCase + +import torch +from msprobe.core.common.const import Const +from msprobe.pytorch.free_benchmark.common.constant import PreheatConfig, ThresholdConfig +from msprobe.pytorch.free_benchmark.common.counter import preheat_counter +from msprobe.pytorch.free_benchmark.common.enums import ( + DeviceType, + FuzzLevel, + HandlerType, + PerturbationMode, +) +from msprobe.pytorch.free_benchmark.common.params import DataParams, make_handler_params +from msprobe.pytorch.free_benchmark.result_handlers.handler_factory import ( + FuzzHandlerFactory, +) + + +class Config(ABC): + """ + 用以提供参数配置 + """ + def __init__(self, handler_type, preheat_config): + self.fuzz_stage = Const.FORWARD + self.handler_type = handler_type + self.fuzz_device = DeviceType.NPU + self.fuzz_level = FuzzLevel.BASE_LEVEL + self.pert_mode = PerturbationMode.IMPROVE_PRECISION + self.preheat_config = preheat_config + + +class TestFuzzHandler(TestCase): + + def setUp(self) -> None: + origin_inputs = [ + torch.as_tensor([3.01, 3.02], dtype=torch.float16), + torch.as_tensor([0.02, 0.02], dtype=torch.float16), + ] + # 将输入乘以一个大于误差阈值1.002的值,模拟二次执行出现误差 + perturbed_inputs = [ + (value * 1.0021).to(torch.float32).to("cpu") for value in origin_inputs + ] + origin_output = torch.add(*origin_inputs) + perturbed_output = torch.add(*perturbed_inputs) + # 实例有问题的data对象 + self.data_params = DataParams( + args=origin_inputs, + kwargs={}, + original_result=origin_output, + perturbed_result=perturbed_output, + origin_func=torch.add, + ) + self.api_name = "add.0.forward" + self.step = 0 + + def test_result_handler_check(self): + # 对于check处理类,扰动前后输出不一致的情况会有UnequalRow对象生成 + for _ in range(2): + config = Config( + HandlerType.CHECK, {PreheatConfig.IF_PREHEAT: False} + ) + handler_params = make_handler_params(self.api_name, config, self.step) + handler = FuzzHandlerFactory.create(handler_params) + handler.handle(self.data_params) + self.assertEqual( + len(handler.get_unequal_rows()), 1 + ) + + def test_result_handler_fix(self): + # 对于fix处理类,扰动后输出会替代原始输出, dtype和原始输出一致,但值为新输出值 + config = Config( + HandlerType.FIX, {PreheatConfig.IF_PREHEAT: False} + ) + handler_params = make_handler_params(self.api_name, config, self.step) + handler = FuzzHandlerFactory.create(handler_params) + result = handler.handle(self.data_params) + self.assertEqual(result.dtype, torch.float16) + self.assertEqual(result.device, self.data_params.original_result.device) + self.assertAlmostEqual( + result[0], self.data_params.perturbed_result.to(torch.float16)[0] + ) + self.assertAlmostEqual( + result[1], self.data_params.perturbed_result.to(torch.float16)[1] + ) + + def test_result_handler_preheat(self): + # 对于preheat处理类,在预热阶段后的阈值会根据CPU调整 + config = Config( + HandlerType.CHECK, + { + PreheatConfig.IF_PREHEAT: True, + PreheatConfig.PREHEAT_STEP: 4, + PreheatConfig.MAX_SAMPLE: 3 + } + ) + for _ in range(3): + handler_params = make_handler_params(self.api_name, config, 0) + handler = FuzzHandlerFactory.create(handler_params) + handler.handle(self.data_params) + # 通过preheat_counter的数据可以判断预热是否正常执行,这里第一个step会记录api执行次数 + self.assertEqual(preheat_counter.get_one_step_used_api("add"), 3) + for step in 
range(1, 4): + for _ in range(3): + handler_params = make_handler_params(self.api_name, config, step) + handler = FuzzHandlerFactory.create(handler_params) + handler.handle(self.data_params) + # call time记录当前step api的调用次数 + self.assertEqual(preheat_counter.get_api_called_time("add"), 3) + # 对于3个step最多采样三次的预热设置,sample time应该每次采样一例 + self.assertEqual(preheat_counter.get_api_sample_time("add"), 1) + # 预热阶段,api阈值应该在两个阈值超参之间 + api_threshold = preheat_counter.get_api_thd("add", "torch.float16") + self.assertLessEqual( + api_threshold, + ThresholdConfig.PREHEAT_INITIAL_THD + ) + self.assertGreaterEqual( + api_threshold, + ThresholdConfig.DTYPE_PER_THD[torch.float16] + ) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/test_main.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/test_main.py new file mode 100644 index 0000000000000000000000000000000000000000..4498a2af7054edd89aa6fae6a057a489216794b6 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/free_benchmark/test_main.py @@ -0,0 +1,101 @@ +import functools +from abc import ABC +from unittest import TestCase + +import torch +import torch.nn as nn +from msprobe.core.common.const import Const +from msprobe.pytorch.free_benchmark import FreeBenchmarkCheck +from msprobe.pytorch.free_benchmark.common.constant import CommonField, PreheatConfig +from msprobe.pytorch.free_benchmark.common.enums import ( + DeviceType, + FuzzLevel, + HandlerType, + PerturbationMode, +) + + +class Config(ABC): + """ + 用以提供参数配置 + """ + + def __init__(self, fuzz_stage, handler_type): + self.fuzz_stage = fuzz_stage + self.handler_type = handler_type + self.fuzz_device = DeviceType.NPU + self.fuzz_level = FuzzLevel.BASE_LEVEL + self.pert_mode = PerturbationMode.IMPROVE_PRECISION + self.preheat_config = {PreheatConfig.IF_PREHEAT: False} + + +class WrapMul(nn.Module): + """ + 用nn.module包装mul算子, 在forward中调用torch.mul + """ + + def __init__(self, op_name) -> None: + super().__init__() + self.op_name = op_name + + def forward(self, *args, **kwargs): + return torch.mul(*args, **kwargs) + + +class UnequalDataProcessor(ABC): + """ + 接口类, 处理检测不一致结果 + """ + + def __init__(self) -> None: + super().__init__() + self.unequal_rows = [] + + def update_unequal_rows(self, unequal_rows): + self.unequal_rows.append(unequal_rows) + + +class TestInterface(TestCase): + def setUp(self): + self.api_name = "Torch.mul.0" + + def testForwardFix(self): + # 对于前向接口,在forward钩子中开启FIX,返回结果给hook的输出 + config = Config(Const.FORWARD, HandlerType.FIX) + checker = FreeBenchmarkCheck(config) + # 执行算子前向 + x = torch.randn(2, 3).to(torch.float16) + y = torch.randn(2, 3).to(torch.float16) + mul_module = WrapMul(self.api_name) + out = mul_module(x, y) + # 模拟forward hook中调用无标杆前向检测接口 + result, _ = checker.forward( + self.api_name, + mul_module, + args=(x, y), + kwargs={}, + output=out, + ) + self.assertEqual(result.dtype, torch.float32) + + def testBackwardCheck(self): + # 对于反向接口,在pre forward时暂存input, 然后在backward后进行对比 + config = Config(Const.BACKWARD, HandlerType.CHECK) + checker = FreeBenchmarkCheck(config) + processor = UnequalDataProcessor() + # 初始化输入输出 + x = torch.tensor([2, 3], dtype=torch.float16, requires_grad=True) + y = torch.tensor([2, 3], dtype=torch.float16, requires_grad=True) + grad_output = torch.tensor([1,1], dtype=torch.float16) + backward_name = Const.SEP.join([self.api_name, Const.BACKWARD]) + # 执行前向生成grad saver实例 + mul_module = WrapMul(self.api_name) + checker.pre_forward(backward_name, mul_module, processor, (x, y), {}) + # 执行算子前向和反向, 
并反向获取扰动后grad_input + out = mul_module(x, y) + checker.backward(backward_name, mul_module, grad_output) + out.backward(torch.ones_like(out)) + # module是否添加暂存器, 其中反向钩子执行扰动后grad_input是否正确 + self.assertTrue(hasattr(mul_module, CommonField.GRADSAVER)) + grad_saver = getattr(mul_module, CommonField.GRADSAVER) + self.assertEqual(grad_saver.perturbed_grad_input[0][0], 2) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/functional/test_dump_module.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/functional/test_dump_module.py new file mode 100644 index 0000000000000000000000000000000000000000..d67adf2f91292391ff01d450bb5647524f6fc9c4 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/functional/test_dump_module.py @@ -0,0 +1,15 @@ +import unittest + +import torch.nn as nn +from msprobe.pytorch import PrecisionDebugger +from msprobe.pytorch.functional.dump_module import module_dump, module_count + + +class TestDumpModule(unittest.TestCase): + def setUp(self): + self.module = nn.Linear(in_features=8, out_features=4) + + def test_module_dump(self): + PrecisionDebugger(dump_path="./dump") + module_dump(self.module, "TestModule") + self.assertTrue("TestModule" in module_count) diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_api_registry.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_api_registry.py new file mode 100644 index 0000000000000000000000000000000000000000..837ad23df76be2a012a7408dab4879847937f229 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_api_registry.py @@ -0,0 +1,130 @@ +import unittest +from msprobe.pytorch.hook_module.api_registry import ApiRegistry, torch_version_above_2, is_gpu + + +class TestApiRegistry(unittest.TestCase): + + def test_store_ori_attr(self): + class A(): + a1 = 1 + class B(): + a = A() + b1 = 1 + b2 = 2 + + api_list = ["a.a1", "b1", "b2"] + expect_output = {"a.a1":1, "b1":1, "b2":2} + actual_output = dict() + ApiRegistry.store_ori_attr(B, api_list, actual_output) + self.assertEqual(actual_output, expect_output) + + + def test_set_api_attr(self): + class A(): + a1 = 1 + class B(): + a = A().__class__ + b1 = 1 + + attr_dict = {"a.a2":2, "b2":2, "b3":3} + ApiRegistry.set_api_attr(B, attr_dict) + + for k, v in attr_dict.items(): + if '.' 
in k: + sub_module_name, sub_op = k.rsplit('.', 1) + sub_module = getattr(B, sub_module_name, None) + + self.assertEqual(getattr(sub_module, sub_op), v) + else: + self.assertEqual(getattr(B, k), v) + + def test_api_modularity(self): + + import torch + import torch.distributed as dist + #import torch_npu #门禁没有安装torch_npu + from msprobe.pytorch.hook_module.api_registry import torch_without_guard_version, npu_distributed_api, is_gpu, torch_version_above_2 + + + + reg = ApiRegistry() + attr_dict = {"b2":2, "b3":3} + reg.tensor_hook_attr = attr_dict + reg.torch_hook_attr = attr_dict + reg.functional_hook_attr = attr_dict + reg.distributed_hook_attr = attr_dict + reg.npu_distributed_hook_attr = attr_dict + reg.aten_hook_attr = attr_dict + reg.vf_hook_attr = attr_dict + reg.torch_npu_hook_attr = attr_dict + + reg.api_modularity() + self.assertEqual(torch.Tensor.b2, 2) + + self.assertEqual(torch.b2, 2) + self.assertEqual(torch.nn.functional.b2, 2) + self.assertEqual(dist.b2, 2) + self.assertEqual(dist.distributed_c10d.b2, 2) + #if not is_gpu and not torch_without_guard_version: + #self.assertEqual(torch_npu.distributed.b2, 2) + #self.assertEqual(torch_npu.distributed.distributed_c10d.b2, 2) + if torch_version_above_2: + self.assertEqual(torch.ops.aten.b2, 2) + self.assertEqual(torch._VF.b2, 2) + #if not is_gpu: + #self.assertEqual(torch_npu.b2, 2) + + + def test_api_originality(self): + import torch + import torch.distributed as dist + #import torch_npu #门禁没有安装torch_npu + from msprobe.pytorch.hook_module.api_registry import torch_without_guard_version, npu_distributed_api, is_gpu, torch_version_above_2 + + + + reg = ApiRegistry() + attr_dict = {"b2":2, "b3":3} + reg.tensor_hook_attr = attr_dict + reg.torch_hook_attr = attr_dict + reg.functional_hook_attr = attr_dict + reg.distributed_hook_attr = attr_dict + reg.npu_distributed_hook_attr = attr_dict + reg.aten_hook_attr = attr_dict + reg.vf_hook_attr = attr_dict + reg.torch_npu_hook_attr = attr_dict + + reg.api_originality() + self.assertEqual(torch.Tensor.b2, 2) + + self.assertEqual(torch.b2, 2) + self.assertEqual(torch.nn.functional.b2, 2) + self.assertEqual(dist.b2, 2) + self.assertEqual(dist.distributed_c10d.b2, 2) + #if not is_gpu and not torch_without_guard_version: + #self.assertEqual(torch_npu.distributed.b2, 2) + #self.assertEqual(torch_npu.distributed.distributed_c10d.b2, 2) + if torch_version_above_2: + self.assertEqual(torch.ops.aten.b2, 2) + self.assertEqual(torch._VF.b2, 2) + #if not is_gpu: + #self.assertEqual(torch_npu.b2, 2) + + def test_initialize_hook(self): + def hook_test(): + pass + + reg = ApiRegistry() + reg.initialize_hook(hook_test) + empty_list = [] + self.assertFalse(empty_list==reg.tensor_hook_attr) + self.assertFalse(empty_list==reg.torch_hook_attr) + self.assertFalse(empty_list==reg.functional_hook_attr) + self.assertFalse(empty_list==reg.distributed_hook_attr) + self.assertFalse(empty_list==reg.npu_distributed_hook_attr) + if torch_version_above_2: + #print(True) + self.assertFalse(empty_list==reg.aten_hook_attr) + if not is_gpu: + #print(True) + self.assertFalse(empty_list==reg.torch_npu_hook_attr) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_hook_module.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_hook_module.py new file mode 100644 index 0000000000000000000000000000000000000000..50783e5d736c024b03f20008ad6b72882eddcd87 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_hook_module.py @@ -0,0 +1,42 @@ 
+import unittest +from unittest.mock import patch, Mock + +from msprobe.pytorch.hook_module.hook_module import HOOKModule + +class TestHookModule(unittest.TestCase): + def test_call_1(self): + def forward_pre_hook(): + return "result_input", "result_kwargs" + def forward_hook(): + return 2 + def backward_hook(): + pass + + def hook(prefix): + return forward_pre_hook, forward_hook, backward_hook + HOOKModule.prefix_op_name_ = "123" + test = HOOKModule(hook) + test._call_func = Mock(return_value=1) + result = test() + self.assertEqual(result, 1) + + def test_call_2(self): + def forward_pre_hook(nope, input, kwargs): + return input, kwargs + def forward_hook(nope, input, kwargs, result): + return input + def backward_hook(): + pass + + def hook(prefix): + return forward_pre_hook, forward_hook, backward_hook + HOOKModule.prefix_op_name_ = "123" + input = 2 + test = HOOKModule(hook) + + def temp_forward(*input, **kwargs): + return input + + test.forward = Mock(return_value=1) + result = test(input) + self.assertEqual(result, (input, )) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_aten.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_aten.py new file mode 100644 index 0000000000000000000000000000000000000000..4940b07cb0d8e9d2283db3daebc910a7fdcd6ce9 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_aten.py @@ -0,0 +1,65 @@ +import unittest +import torch +from msprobe.pytorch.hook_module.wrap_aten import AtenOPTemplate, AtenOPPacketTemplate + + +def hook(name): + def forward_pre_hook(nope, input, kwargs): + return input, kwargs + def forward_hook(nope, input, kwargs, result): + return 2 + def backward_hook(): + pass + + return forward_pre_hook, forward_hook, backward_hook + + + +class TestWrapAten(unittest.TestCase): + def setUp(self): + self.aten_op = AtenOPPacketTemplate(torch.ops.aten.convolution, hook) + + def test_atenop_attribute(self): + if torch.__version__.split("+")[0] <= '2.0': + return + self.setUp() + self.assertEqual(self.aten_op.default.op, torch.ops.aten.convolution.default) + self.assertEqual(self.aten_op.out.op, torch.ops.aten.convolution.out) + + def test_atenop_forward(self): + if torch.__version__.split("+")[0] <= '2.0': + return + self.setUp() + image = torch.randn(4, 3, 24, 24) + kernel = torch.randn(10, 3, 3, 3) + functional_out = torch.nn.functional.conv2d(image, kernel, stride=[1, 1], + padding=[1, 1], dilation=[1, 1], groups=1, bias=None) + aten_out = self.aten_op(image, kernel, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1) + self.assertTrue(aten_out == 2) + + def test_atenop_overload_forward(self): + if torch.__version__.split("+")[0] <= '2.0': + return + self.setUp() + image = torch.randn(4, 3, 24, 24) + kernel = torch.randn(10, 3, 3, 3) + functional_out = torch.nn.functional.conv2d(image, kernel, stride=[1, 1], + padding=[1, 1], dilation=[1, 1], groups=1, bias=None) + aten_out = self.aten_op(image, kernel, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1) + self.assertTrue(aten_out == 2) + + def test_atenop_nonattr(self): + if torch.__version__.split("+")[0] <= '2.0': + return + self.setUp() + self.assertRaises(AttributeError, getattr, self.aten_op, "foo") + + def test_atenop_overloads(self): + if torch.__version__.split("+")[0] <= '2.0': + return + self.setUp() + self.assertEqual(self.aten_op.overloads(), self.aten_op.opPacket.overloads()) + + + + \ No newline at end of file diff --git 
a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_distributed.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_distributed.py new file mode 100644 index 0000000000000000000000000000000000000000..9a375e45bfcdc93ac36fb9d44a79f50fea7932d5 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_distributed.py @@ -0,0 +1,35 @@ +import unittest +import torch.distributed as dist +from msprobe.pytorch.hook_module.wrap_distributed import * + +class TestWrapDistributed(unittest.TestCase): + def hook(name, prefix): + def forward_pre_hook(nope, input, kwargs): + return input, kwargs + def forward_hook(nope, input, kwargs, result): + return 2 + def backward_hook(): + pass + return forward_pre_hook, forward_hook, backward_hook + + def test_get_distributed_ops(self): + ops = get_distributed_ops() + self.assertIsInstance(ops, set) + + def test_DistributedOPTemplate(self): + self.setUp() + op_name = 'all_reduce' + if op_name in get_distributed_ops(): + op = DistributedOPTemplate(op_name, self.hook) + self.assertEqual(op.op_name_, op_name) + + def test_wrap_distributed_op(self): + op_name = 'all_reduce' + if op_name in get_distributed_ops(): + wrapped_op = wrap_distributed_op(op_name, self.hook) + self.assertTrue(callable(wrapped_op)) + + def test_wrap_distributed_ops_and_bind(self): + wrap_distributed_ops_and_bind(self.hook) + for op_name in get_distributed_ops(): + self.assertTrue(hasattr(HOOKDistributedOP, "wrap_" + str(op_name))) \ No newline at end of file diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_functional.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_functional.py similarity index 59% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_functional.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_functional.py index 37058e77fd87e697b7dd7fde5e94b78d01a2cb89..f43b8ea6cb98dd1947811b7b0641439b225b51ec 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_functional.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_functional.py @@ -1,10 +1,15 @@ -# coding=utf-8 import unittest import torch -from api_accuracy_checker.hook_module import wrap_functional as wf +from msprobe.pytorch.hook_module import wrap_functional as wf class TestWrapFunctional(unittest.TestCase): + def test_remove_dropout(self): + input_tensor = torch.randn(20, 16) + wf.remove_dropout() + output_tensor = torch.nn.functional.dropout(input_tensor) + self.assertTrue(torch.equal(input_tensor, output_tensor)) + def test_get_functional_ops(self): expected_ops = {'relu', 'sigmoid', 'softmax'} actual_ops = wf.get_functional_ops() diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_tensor.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_tensor.py similarity index 58% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_tensor.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_tensor.py index bfae3c72771510b141abf9204723bfe48bfa8de3..61f76b0ca0a59ee680ff40991fa9cba5e42d869d 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_tensor.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_tensor.py @@ -1,13 +1,19 @@ -# coding=utf-8 import unittest 
import torch import yaml -from api_accuracy_checker.hook_module.wrap_tensor import get_tensor_ops, HOOKTensor, TensorOPTemplate, wrap_tensor_op, wrap_tensor_ops_and_bind +from msprobe.pytorch.hook_module.wrap_tensor import get_tensor_ops, HOOKTensor, TensorOPTemplate, wrap_tensor_op, wrap_tensor_ops_and_bind class TestWrapTensor(unittest.TestCase): - def hook(self, a, b): - return + def hook(name, prefix): + def forward_pre_hook(nope, input, kwargs): + return input, kwargs + def forward_hook(nope, input, kwargs, result): + return 2 + def backward_hook(): + pass + return forward_pre_hook, forward_hook, backward_hook + def test_get_tensor_ops(self): result = get_tensor_ops() self.assertIsInstance(result, set) @@ -18,7 +24,7 @@ class TestWrapTensor(unittest.TestCase): def test_TensorOPTemplate(self): tensor_op_template = TensorOPTemplate('add', self.hook) - self.assertEqual(tensor_op_template.op_name_, 'add') + self.assertTrue(tensor_op_template.op_name_, 'add') def test_wrap_tensor_op(self): wrapped_op = wrap_tensor_op('add', self.hook) diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_torch.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_torch.py similarity index 51% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_torch.py rename to debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_torch.py index 40cef939adfd06158eb543c07b3d682e29d6cdab..e1a3e77983d80e7c0519e30afbb592311550e794 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/ut/hook_module/test_wrap_torch.py +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_torch.py @@ -1,37 +1,43 @@ -# coding=utf-8 import unittest import torch import yaml -from api_accuracy_checker.hook_module.wrap_torch import * +from msprobe.pytorch.hook_module.wrap_torch import * class TestWrapTorch(unittest.TestCase): + def hook(name, prefix): + def forward_pre_hook(nope, input, kwargs): + return input, kwargs + def forward_hook(nope, input, kwargs, result): + return 2 + def backward_hook(): + pass + return forward_pre_hook, forward_hook, backward_hook + def setUp(self): - self.op_name = 'add' - self.torch_op = wrap_torch_op(self.op_name, self.hook) - def hook(self, a, b): - return + self.op_name = 'add' + self.torch_op = wrap_torch_op(self.op_name, self.hook) def test_get_torch_ops(self): + self.setUp() ops = get_torch_ops() self.assertIsInstance(ops, set) self.assertIn(self.op_name, ops) def test_TorchOPTemplate(self): + self.setUp() template = TorchOPTemplate(self.op_name, self.hook) - self.assertEqual(template.op_name_, self.op_name) - self.assertEqual(template.prefix_op_name_, "Torch*" + str(self.op_name) + "*") - - def test_input_param_need_adapt(self): - template = TorchOPTemplate(self.op_name, self.hook) - self.assertFalse(template.input_param_need_adapt()) + self.assertEqual(template.op_name_, self.op_name) + self.assertEqual(template.prefix_op_name_, "Torch." 
+ str(self.op_name) + ".") def test_forward(self): + self.setUp() template = TorchOPTemplate(self.op_name, self.hook) result = template.forward(torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])) - torch.testing.assert_allclose(result, torch.tensor([5, 7, 9])) + torch.testing.assert_close(result, torch.tensor([5, 7, 9])) def test_wrap_torch_ops_and_bind(self): + self.setUp() wrap_torch_ops_and_bind(self.hook) self.assertTrue(hasattr(HOOKTorchOP, "wrap_" + self.op_name)) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_vf.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_vf.py new file mode 100644 index 0000000000000000000000000000000000000000..98efb4bc5b8a30284fe820124e48af7f487d1c54 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/hook_module/test_wrap_vf.py @@ -0,0 +1,11 @@ +import unittest +import torch +from msprobe.pytorch.hook_module import wrap_vf + +class TestWrapVF(unittest.TestCase): + def setUp(self): + self.hook = lambda x: x + + def test_get_vf_ops(self): + ops = wrap_vf.get_vf_ops() + self.assertIsInstance(ops, list) \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_config.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_config.py new file mode 100644 index 0000000000000000000000000000000000000000..c344f0b66b010e0f4a432e5c14738372c7990349 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_pt_config.py @@ -0,0 +1,84 @@ +from unittest import TestCase +from unittest.mock import patch, mock_open + +from msprobe.core.common.const import Const +from msprobe.pytorch.pt_config import parse_json_config, parse_task_config + + +class TestPtConfig(TestCase): + def test_parse_json_config(self): + mock_json_data = { + "task": "statistics", + "dump_path": "./dump/", + "rank": [], + "step": [], + "level": "L1", + "seed": 1234, + "statistics": { + "scope": [], + "list": [], + "data_mode": ["all"], + }, + "tensor": { + "file_format": "npy" + } + } + with patch("msprobe.pytorch.pt_config.os.path.join", return_value="/path/config.json"), \ + patch("msprobe.pytorch.pt_config.FileOpen", mock_open(read_data='')), \ + patch("msprobe.pytorch.pt_config.json.load", return_value=mock_json_data): + common_config, task_config = parse_json_config(None, None) + self.assertEqual(common_config.task, Const.STATISTICS) + self.assertEqual(task_config.data_mode, ["all"]) + + with patch("msprobe.pytorch.pt_config.os.path.join", return_value="/path/config.json"), \ + patch("msprobe.pytorch.pt_config.FileOpen", mock_open(read_data='')), \ + patch("msprobe.pytorch.pt_config.json.load", return_value=mock_json_data): + common_config, task_config = parse_json_config(None, Const.TENSOR) + self.assertEqual(common_config.task, Const.STATISTICS) + self.assertEqual(task_config.file_format, "npy") + + def test_parse_task_config(self): + overflow_check_config = { + "overflow_check": { + "overflow_nums": 1, + "check_mode": "all" + } + } + result = parse_task_config(Const.OVERFLOW_CHECK, overflow_check_config) + self.assertEqual(result.overflow_num, 1) + self.assertEqual(result.check_mode, "all") + + free_benchmark_config = { + "free_benchmark": { + "scope": [], + "list": ["conv2d"], + "fuzz_device": "npu", + "pert_mode": "improve_precision", + "handler_type": "check", + "fuzz_level": "L1", + "fuzz_stage": "forward", + "if_preheat": False, + "preheat_step": 15, + "max_sample": 20 + } + } + result = parse_task_config(Const.FREE_BENCHMARK, free_benchmark_config) + 
self.assertEqual(result.pert_mode, "improve_precision") + self.assertEqual(result.handler_type, "check") + self.assertEqual(result.preheat_step, 15) + self.assertEqual(result.max_sample, 20) + + run_ut_config = { + "run_ut": { + "white_list": ["conv2d"], + "black_list": ["matmul"], + "error_data_path": '/home/dump_path' + + } + } + with patch('os.path.exists', return_value=True) as mocked_exists: + result = parse_task_config(Const.RUN_UT, run_ut_config) + self.assertEqual(result.white_list, ["conv2d"]) + self.assertEqual(result.black_list, ["matmul"]) + self.assertEqual(result.error_data_path, '/home/dump_path') + mocked_exists.assert_called_once_with('/home/dump_path') diff --git a/debug/accuracy_tools/msprobe/test/pytorch_ut/test_service.py b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_service.py new file mode 100644 index 0000000000000000000000000000000000000000..c09b6abcb693a048e360ed0d783a0c817c76b06f --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/pytorch_ut/test_service.py @@ -0,0 +1,59 @@ +import unittest +from unittest.mock import patch, mock_open + +import torch.nn as nn +from msprobe.core.common.utils import Const +from msprobe.pytorch.debugger.debugger_config import DebuggerConfig +from msprobe.pytorch.pt_config import parse_json_config +from msprobe.pytorch.service import Service + + +class TestService(unittest.TestCase): + def setUp(self): + mock_json_data = { + "dump_path": "./dump/", + } + with patch("msprobe.pytorch.pt_config.FileOpen", mock_open(read_data='')), \ + patch("msprobe.pytorch.pt_config.json.load", return_value=mock_json_data): + common_config, task_config = parse_json_config("./config.json", Const.STATISTICS) + self.config = DebuggerConfig(common_config, task_config, Const.STATISTICS, "./ut_dump", "L1") + self.service = Service(self.config) + + def test_start(self): + with patch("msprobe.pytorch.service.get_rank_if_initialized", return_value=0), \ + patch("msprobe.pytorch.service.Service.create_dirs", return_value=None): + self.service.start(None) + self.assertEqual(self.service.current_rank, 0) + + def test_stop_and_step(self): + with patch("msprobe.core.data_dump.data_collector.DataCollector.write_json", return_value=None): + self.service.stop() + self.assertFalse(self.service.switch) + + self.service.step() + self.assertEqual(self.service.current_iter, 1) + + def test_register_hook_new(self): + class TestModule(nn.Module): + def __init__(self) -> None: + super().__init__() + self.linear = nn.Linear(in_features=8, out_features=4) + + def forward(self, x): + x = self.linear(x) + return x + + self.service.model = TestModule() + self.config.level = "L0" + with patch("msprobe.pytorch.service.logger.info_on_rank_0") as mock_logger, \ + patch("msprobe.pytorch.service.remove_dropout", return_value=None): + self.service.register_hook_new() + self.assertEqual(mock_logger.call_count, 2) + + def test_create_dirs(self): + with patch("msprobe.pytorch.service.Path.mkdir", return_value=None), \ + patch("msprobe.core.common.file_check.FileChecker.common_check", return_value=None), \ + patch("msprobe.core.data_dump.data_collector.DataCollector.update_dump_paths", + return_value=None): + self.service.create_dirs() + self.assertEqual(self.service.dump_iter_dir, "./ut_dump/step0") diff --git a/debug/accuracy_tools/msprobe/test/resources/advisor.txt b/debug/accuracy_tools/msprobe/test/resources/advisor.txt new file mode 100644 index 0000000000000000000000000000000000000000..5c4825e28ebde12b43ad7e46bf05820929c88f8d --- /dev/null +++ 
b/debug/accuracy_tools/msprobe/test/resources/advisor.txt @@ -0,0 +1,3 @@ +Line: NA +Suspect Nodes: NA +Expert Advice: All data in comparison result meets the accuracy requirements. diff --git a/debug/accuracy_tools/msprobe/test/resources/compare_result_20230703104808.csv b/debug/accuracy_tools/msprobe/test/resources/compare_result_20230703104808.csv new file mode 100644 index 0000000000000000000000000000000000000000..a7742ff3fd0863fa157dbabebee252aea6b70888 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/compare_result_20230703104808.csv @@ -0,0 +1,9 @@ +NPU Name,Bench Name,NPU Tensor Dtype,Bench Tensor Dtype,NPU Tensor Shape,Bench Tensor Shape,Cosine,MaxAbsErr,NPU max,NPU min,NPU mean,Bench max,Bench min,Bench mean,Accuracy Reached or Not,Err_message +Functional_linear_0_forward_input.0,Functional_linear_0_forward_input.0,torch.float32,torch.float32,"[3, 2]","[3, 2]",1.0,0.000000,1.948258399963379,-1.0052297115325928,-0.2003595232963562,1.948258399963379,-1.0052297115325928,-0.2003595232963562,Yes, +Functional_linear_0_forward_input.1,Functional_linear_0_forward_input.1,torch.float32,torch.float32,"[3, 2]","[3, 2]",1.0,0.000000,0.28375449776649475,-0.6661239266395569,-0.2789986729621887,0.28375449776649475,-0.6661239266395569,-0.2789986729621887,Yes, +Functional_linear_0_forward_input.2,Functional_linear_0_forward_input.2,torch.float32,torch.float32,[3],[3],1.0,0.000000,0.2457989901304245,-0.6338542103767395,-0.14437106251716614,0.2457989901304245,-0.6338542103767395,-0.14437106251716614,Yes, +Functional_linear_0_forward_output,Functional_linear_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,-0.8729169964790344,0.16790540516376495,0.8278868794441223,-0.8729169964790344,0.16790540516376495,Yes, +Torch_relu_0_forward_input.0,Torch_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,-0.8729169964790344,0.16790540516376495,0.8278868794441223,-0.8729169964790344,0.16790540516376495,Yes, +Torch_relu_0_forward_output,Torch_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,0.0,0.31367552280426025,0.8278868794441223,0.0,0.31367552280426025,Yes, +Functional_relu_0_forward_input.0,Functional_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,-0.8729169964790344,0.16790540516376495,0.8278868794441223,-0.8729169964790344,0.16790540516376495,Yes, +Functional_relu_0_forward_output,Functional_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1.0,0.000000,0.8278868794441223,0.0,0.31367552280426025,0.8278868794441223,0.0,0.31367552280426025,Yes, diff --git a/debug/accuracy_tools/msprobe/test/resources/compare_result_without_accuracy.csv b/debug/accuracy_tools/msprobe/test/resources/compare_result_without_accuracy.csv new file mode 100644 index 0000000000000000000000000000000000000000..404af78ec03f497f91dc7fcfc7c6ab0e855e7e7b --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/compare_result_without_accuracy.csv @@ -0,0 +1,9 @@ +NPU Name,Bench Name,NPU Tensor Dtype,Bench Tensor Dtype,NPU Tensor Shape,Bench Tensor Shape,Cosine,MaxAbsErr,NPU max,NPU min,NPU mean,Bench max,Bench min,Bench mean,Accuracy Reached or Not,Err_message +,Functional_linear_0_forward_input.0,torch.float32,torch.float32,"[3, 2]","[3, 2]",1,0,1.9482584,-1.005229712,-0.200359523,1.9482584,-1.005229712,-0.200359523,, +,Functional_linear_0_forward_input.1,torch.float32,torch.float32,"[3, 2]","[3, 
2]",1,0,0.283754498,-0.666123927,-0.278998673,0.283754498,-0.666123927,-0.278998673,, +,Functional_linear_0_forward_input.2,torch.float32,torch.float32,[3],[3],1,0,0.24579899,-0.63385421,-0.144371063,0.24579899,-0.63385421,-0.144371063,, +,Functional_linear_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,-0.872916996,0.167905405,0.827886879,-0.872916996,0.167905405,, +,Torch_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,-0.872916996,0.167905405,0.827886879,-0.872916996,0.167905405,, +,Torch_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,0,0.313675523,0.827886879,0,0.313675523,, +,Functional_relu_0_forward_input.0,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,-0.872916996,0.167905405,0.827886879,-0.872916996,0.167905405,, +,Functional_relu_0_forward_output,torch.float32,torch.float32,"[3, 3]","[3, 3]",1,0,0.827886879,0,0.313675523,0.827886879,0,0.313675523,, diff --git a/debug/accuracy_tools/msprobe/test/resources/config.yaml b/debug/accuracy_tools/msprobe/test/resources/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..1744c9cf4a8aa0c034157edfbe80c8083a87ad9c --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/config.yaml @@ -0,0 +1,3 @@ +white_list: [] +error_data_path: './' +precision: 14 \ No newline at end of file diff --git a/debug/accuracy_tools/msprobe/test/resources/npu_test.pkl b/debug/accuracy_tools/msprobe/test/resources/npu_test.pkl new file mode 100644 index 0000000000000000000000000000000000000000..2e00b07b7c97e9cdb497bc63dd7eef8063388807 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/resources/npu_test.pkl @@ -0,0 +1,8 @@ +["Functional_linear_0_forward_input.0", 1, [], "torch.float32", [3, 2], [1.948258399963379, -1.0052297115325928, -0.2003595232963562]] +["Functional_linear_0_forward_input.1", 1, [], "torch.float32", [3, 2], [0.28375449776649475, -0.6661239266395569, -0.2789986729621887]] +["Functional_linear_0_forward_input.2", 1, [], "torch.float32", [3], [0.2457989901304245, -0.6338542103767395, -0.14437106251716614]] +["Functional_linear_0_forward_output", 1, [], "torch.float32", [3, 3], [0.8278868794441223, -0.8729169964790344, 0.16790540516376495]] +["Torch_relu_0_forward_input.0", 1, [], "torch.float32", [3, 3], [0.8278868794441223, -0.8729169964790344, 0.16790540516376495]] +["Torch_relu_0_forward_output", 1, [], "torch.float32", [3, 3], [0.8278868794441223, 0.0, 0.31367552280426025]] +["Functional_relu_0_forward_input.0", 1, [], "torch.float32", [3, 3], [0.8278868794441223, -0.8729169964790344, 0.16790540516376495]] +["Functional_relu_0_forward_output", 1, [], "torch.float32", [3, 3], [0.8278868794441223, 0.0, 0.31367552280426025]] diff --git a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_test.sh b/debug/accuracy_tools/msprobe/test/run_test.sh similarity index 47% rename from debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_test.sh rename to debug/accuracy_tools/msprobe/test/run_test.sh index fdd00c6021c9827a68e005616b1b4d916e63e995..1bf0ccb77131f039407ce398f4f7907b794fe26f 100644 --- a/debug/accuracy_tools/atat/pytorch/api_accuracy_checker/test/run_test.sh +++ b/debug/accuracy_tools/msprobe/test/run_test.sh @@ -4,27 +4,26 @@ TOP_DIR=${CUR_DIR}/.. TEST_DIR=${TOP_DIR}/"test" SRC_DIR=${TOP_DIR}/../ -clean() { - cd ${TEST_DIR} - - if [ -e ${TEST_DIR}/"report" ]; then - rm -r ${TEST_DIR}/"report" - echo "remove last ut_report successfully." 
+install_pytest() { + if ! pip show pytest &> /dev/null; then + echo "pytest not found, trying to install..." + pip install pytest fi + if ! pip show pytest-cov &> /dev/null; then + echo "pytest-cov not found, trying to install..." + pip install pytest-cov + fi } run_ut() { + install_pytest + export PYTHONPATH=${SRC_DIR}:${PYTHONPATH} python3 run_ut.py } main() { - clean - if [ "$1"x == "clean"x ]; then - return 0 - fi - cd ${TEST_DIR} && run_ut } diff --git a/debug/accuracy_tools/msprobe/test/run_ut.py b/debug/accuracy_tools/msprobe/test/run_ut.py new file mode 100644 index 0000000000000000000000000000000000000000..8ea81ccca719952bdb8a6603b998902df94a53fb --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/run_ut.py @@ -0,0 +1,58 @@ +import os +import shutil +import subprocess +import sys + +from msprobe.core.common.log import logger + + +def run_ut(): + cur_dir = os.path.realpath(os.path.dirname(__file__)) + ut_path = cur_dir + cov_dir = os.path.dirname(cur_dir) + report_dir = os.path.join(cur_dir, "report") + final_xml_path = os.path.join(report_dir, "final.xml") + cov_report_path = os.path.join(report_dir, "coverage.xml") + + if os.path.exists(report_dir): + shutil.rmtree(report_dir) + os.makedirs(report_dir) + + pytest_cmd = [ + "python3", "-m", "pytest", + ut_path, + f"--junitxml={final_xml_path}", + f"--cov={cov_dir}", + "--cov-branch", + f"--cov-report=xml:{cov_report_path}", + ] + + try: + with subprocess.Popen( + pytest_cmd, + shell=False, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + ) as proc: + for line in proc.stdout: + logger.info(line.strip()) + + proc.wait() + + if proc.returncode == 0: + logger.info("Unit tests executed successfully.") + return True + else: + logger.error("Unit tests execution failed.") + return False + except Exception as e: + logger.error(f"An error occurred during test execution: {e}") + return False + + +if __name__ == "__main__": + if run_ut(): + sys.exit(0) + else: + sys.exit(1) diff --git a/debug/accuracy_tools/msprobe/test/test_module_processer.py b/debug/accuracy_tools/msprobe/test/test_module_processer.py new file mode 100644 index 0000000000000000000000000000000000000000..448c35f0554551884dc690a71aef6bc8141e9a39 --- /dev/null +++ b/debug/accuracy_tools/msprobe/test/test_module_processer.py @@ -0,0 +1,64 @@ +import unittest +from msprobe.pytorch.module_processer import ModuleProcesser +from msprobe.pytorch.common.utils import Const + +import torch + +class TestModuleProcesser(unittest.TestCase): + def test_filter_tensor_and_tuple(self): + def func(nope, x): + return x * 2 + + result_1 = ModuleProcesser.filter_tensor_and_tuple(func)(None, torch.tensor([1])) + self.assertEqual(result_1, torch.tensor([2])) + + result_2 = ModuleProcesser.filter_tensor_and_tuple(func)(None, "test") + self.assertEqual(result_2, "test") + + def test_clone_return_value_and_test_clone_if_tensor(self): + def func(x): + return x + + input = torch.tensor([1]) + input_tuple = (torch.tensor([1]), torch.tensor([2])) + input_list = [torch.tensor([1]), torch.tensor([2])] + input_dict = {"A":torch.tensor([1]), "B":torch.tensor([2])} + + result = ModuleProcesser.clone_return_value(func)(input) + result[0] = 2 + self.assertNotEqual(result, input) + + result_tuple = ModuleProcesser.clone_return_value(func)(input_tuple) + result_tuple[0][0] = 2 + self.assertNotEqual(result_tuple, input_tuple) + + result_list = ModuleProcesser.clone_return_value(func)(input_list) + result_list[0][0] = 2 + self.assertNotEqual(result_list, input_list) + + result_dict = 
ModuleProcesser.clone_return_value(func)(input_dict) + result_dict["A"][0] = 2 + self.assertNotEqual(result_dict, input_dict) + + + def test_node_hook(self): + empty_list = [] + test = ModuleProcesser(None) + pre_hook = test.node_hook("test", Const.START) + self.assertIsNotNone(pre_hook) + end_hook = test.node_hook("test", "stop") + self.assertIsNotNone(end_hook) + + class A(): + pass + pre_hook(A, None, None) + self.assertIn("test", test.module_count) + self.assertFalse(test.module_stack==empty_list) + + def test_module_count_func(self): + test = ModuleProcesser(None) + self.assertEqual(test.module_count, {}) + + module_name = "nope" + test.module_count_func(module_name) + self.assertEqual(test.module_count["nope"], 0) \ No newline at end of file diff --git a/debug/accuracy_tools/ptdbg_ascend/CMakeLists.txt b/debug/accuracy_tools/ptdbg_ascend/CMakeLists.txt index 70b2b0ff918ca2f734b3fead87e372880d213d8a..9f29d8cfca18c42420ffee2425981a0d290889a8 100644 --- a/debug/accuracy_tools/ptdbg_ascend/CMakeLists.txt +++ b/debug/accuracy_tools/ptdbg_ascend/CMakeLists.txt @@ -16,4 +16,4 @@ add_custom_target(ptdbg_ascend ALL VERBATIM ) -install(CODE "execute_process(COMMAND ${PYTHON_BIN_PATH} -m pip install ${CMAKE_BINARY_DIR}/ptdbg_ascend/dist/ptdbg_ascend-6.0.T4-py3-none-any.whl --upgrade)") +install(CODE "execute_process(COMMAND ${PYTHON_BIN_PATH} -m pip install ${CMAKE_BINARY_DIR}/ptdbg_ascend/dist/ptdbg_ascend-6.0-py3-none-any.whl --upgrade)") diff --git a/debug/accuracy_tools/ptdbg_ascend/README.md b/debug/accuracy_tools/ptdbg_ascend/README.md index 2fcbf89f29c65978cd941651fa0ec290b95f90b7..d6238ef72bd0db351fd1bac405301925ee7f4a86 100644 --- a/debug/accuracy_tools/ptdbg_ascend/README.md +++ b/debug/accuracy_tools/ptdbg_ascend/README.md @@ -2,29 +2,30 @@ ## 版本过渡提示 -当前版本ptdbg维护到2024/09/30,准备于2024/09/30下线,相关目录att/debug/accuracy_tools/ptdbg_ascend将于2024/09/30删除。新版本ptdbg已经合到att/debug/accuracy_tools/atat目录下。 +当前版本ptdbg维护到2024/09/30,准备于2024/09/30下线,相关目录mstt/debug/accuracy_tools/ptdbg_ascend将于2024/09/30删除。新版本ptdbg已经合到mstt/debug/accuracy_tools/msprobe目录下。 ## 快速安装 进行PyTorch精度比对需要将ptdbg_ascend精度工具分别安装在CPU或GPU环境以及NPU环境下。 -1. whl包获取。 +1. 使用pip命令安装wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch依赖。 + + 若环境中已安装部分依赖,不需要重复安装。 + +2. 
whl包获取。 请通过下表链接下载ptdbg_ascend精度工具whl包,推荐下载最新版本。 | ptdbg_ascend版本 | 发布日期 | 支持PyTorch版本 | 下载链接 | 参考指南 | 校验码 | | ---------------- | ---------- | -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | - | 6.0.T4 | 2024-06-11 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T4-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T4-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T4](doc/ptdbg_ascend精度工具功能说明_v6.0.T3.md) | 138d78497476c10b1b27239814bdfb5ce78ea8c01a8544a95fffbf10fb166221 | - | 6.0.T3 | 2024-05-25 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T3-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T3](doc/ptdbg_ascend精度工具功能说明_v6.0.T3.md) | f417f18e3ff52d2e15f9cadeea9931017bf9521b4f34fb657e013cead6c6bd31 | - | 6.0.T2 | 2024-05-09 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T2-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T2](doc/ptdbg_ascend精度工具功能说明_v6.0.T2.md) | ca173e73d3908aa69cb10c8a1bb4e2b38f6488d3ceb5cca2877cae1500c7729d | - | 6.0.T1 | 2024-04-25 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T1-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T1](doc/ptdbg_ascend精度工具功能说明_v6.0.T1.md) | 40aeaad94c8d446b5e3229989527fad0715ea9d103cf46305832ee21d362ae50 | + | 6.0 | 2024-07-09 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0](doc/ptdbg_ascend精度工具功能说明_v6.0.md) | 48a2862dc82d13c8a3fb176545f9f18c228d0438e968d2fd50d7cb9a371a272f | | 5.0 | 2024-04-11 | 1.11.0/2.0/2.1 | [ptdbg_ascend-v5.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/5.0/ptdbg_ascend-5.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v5.0](doc/ptdbg_ascend精度工具功能说明_v5.0.md) | 15ce1fb598781a9a03c7e8a28b1a9c400b52562c806c35649e929115cbe8b4f4 | | 4.0 | 2023-11-23 | 1.11.0/2.0/2.1 | [ptdbg_ascend-4.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/4.0/ptdbg_ascend-4.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v4.0](doc/ptdbg_ascend精度工具功能说明_v4.0.md) | ba7ff7a1acffb1a2fab02fea76b6f957b2868bc6b66d72365622f6a8950406c6 | | 3.0 | 2023-10-16 | 1.8.1/1.11.0/2.0/2.1 | [ptdbg_ascend-3.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/3.0/ptdbg_ascend-3.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v3.0](doc/ptdbg_ascend精度工具功能说明_v3.0.md) | eb177ec795f8ae8b0c937a3cf543914f535bb64c76ba2e520fc6f0456ff6740b | | 2.0 | 2023-7-07 | 1.8.1/1.11.0/2.0 | [ptdbg_ascend-2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/2.0/ptdbg_ascend-2.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v2.0](doc/ptdbg_ascend精度工具功能说明_v2.0.md) | 85e046f133f0f40ed660337ce8207249b1dac47ac668910625bea49809f31d66 | | 1.0 | 2023-3-30 | 1.8.1/1.11.0 | [ptdbg_ascend-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/1.0/ptdbg_ascend-1.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v1.0](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend%E7%B2%BE%E5%BA%A6%E5%B7%A5%E5%85%B7%E5%8A%9F%E8%83%BD%E8%AF%B4%E6%98%8E_v1.0.md) 
| 0559e12ba7accf80d182f227698163ee0de88bf86b1e9cd9f33b16fdead14759 | - -2. whl包校验。 + +3. whl包校验。 1. 根据以上下载链接下载whl包到Linux安装环境。 @@ -43,7 +44,7 @@ ba7ff7a1acffb1a2fab02fea76b6f957b2868bc6b66d72365622f6a8950406c6 *ptdbg_ascend-4.0-py3-none-any.whl ``` -3. whl包安装。 +4. whl包安装。 执行如下命令进行安装。 @@ -107,7 +108,7 @@ ptdbg_ascend为PyTorch精度工具,用来进行PyTorch整网API粒度的数据 ### 环境准备 -- 通过pip安装环境依赖wheel、numpy、pandas(1.3.5及以上版本)和pyyaml。 +- 通过pip安装环境依赖wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch、torch_npu。 - ptdbg_ascend与PyTorch有严格的版本配套关系,使用工具前,您需要确保已经正确安装了PyTorch v1.11.0、PyTorch v2.0.0或PyTorch v2.1.0版本: - CPU或GPU环境:请至[PyTorch官网](https://www.pytorch.org)下载并安装。 - NPU环境:请参见《[CANN软件安装指南](https://www.hiascend.com/document/detail/zh/canncommercial/63RC1/envdeployment/instg/instg_000002.html)》“安装开发环境 > 在昇腾设备上安装 > 安装深度学习框架 > 安装PyTorch”章节进行安装。 @@ -120,23 +121,24 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 #### 下载whl包安装 -1. whl包获取。 +1. 使用pip命令安装wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch依赖。 + + 若环境中已安装部分依赖,不需要重复安装。 + +2. whl包获取。 请通过下表链接下载ptdbg_ascend精度工具whl包,推荐下载最新版本。 | ptdbg_ascend版本 | 发布日期 | 支持PyTorch版本 | 下载链接 | 参考指南 | 校验码 | | ---------------- | ---------- | -------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | - | 6.0.T4 | 2024-06-11 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T4-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T4-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T4](doc/ptdbg_ascend精度工具功能说明_v6.0.T3.md) | 138d78497476c10b1b27239814bdfb5ce78ea8c01a8544a95fffbf10fb166221 | - | 6.0.T3 | 2024-05-25 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T3-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T3](doc/ptdbg_ascend精度工具功能说明_v6.0.T3.md) | f417f18e3ff52d2e15f9cadeea9931017bf9521b4f34fb657e013cead6c6bd31 | - | 6.0.T2 | 2024-05-09 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T2-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T2](doc/ptdbg_ascend精度工具功能说明_v6.0.T2.md) | ca173e73d3908aa69cb10c8a1bb4e2b38f6488d3ceb5cca2877cae1500c7729d | - | 6.0.T1 | 2024-04-25 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0.T1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0.T1-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0.T1](doc/ptdbg_ascend精度工具功能说明_v6.0.T1.md) | 40aeaad94c8d446b5e3229989527fad0715ea9d103cf46305832ee21d362ae50 | + | 6.0 | 2024-07-09 | 1.11.0/2.0/2.1/2.2 | [ptdbg_ascend-v6.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/6.0/ptdbg_ascend-6.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v6.0](doc/ptdbg_ascend精度工具功能说明_v6.0.md) | 48a2862dc82d13c8a3fb176545f9f18c228d0438e968d2fd50d7cb9a371a272f | | 5.0 | 2024-04-11 | 1.11.0/2.0/2.1 | [ptdbg_ascend-v5.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/5.0/ptdbg_ascend-5.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v5.0](doc/ptdbg_ascend精度工具功能说明_v5.0.md) | 15ce1fb598781a9a03c7e8a28b1a9c400b52562c806c35649e929115cbe8b4f4 | | 4.0 | 2023-11-23 | 1.11.0/2.0/2.1 | [ptdbg_ascend-4.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/4.0/ptdbg_ascend-4.0-py3-none-any.whl) | 
[ptdbg_ascend精度工具功能说明_v4.0](doc/ptdbg_ascend精度工具功能说明_v4.0.md) | ba7ff7a1acffb1a2fab02fea76b6f957b2868bc6b66d72365622f6a8950406c6 | | 3.0 | 2023-10-16 | 1.8.1/1.11.0/2.0/2.1 | [ptdbg_ascend-3.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/3.0/ptdbg_ascend-3.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v3.0](doc/ptdbg_ascend精度工具功能说明_v3.0.md) | eb177ec795f8ae8b0c937a3cf543914f535bb64c76ba2e520fc6f0456ff6740b | | 2.0 | 2023-7-07 | 1.8.1/1.11.0/2.0 | [ptdbg_ascend-2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/2.0/ptdbg_ascend-2.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v2.0](doc/ptdbg_ascend精度工具功能说明_v2.0.md) | 85e046f133f0f40ed660337ce8207249b1dac47ac668910625bea49809f31d66 | | 1.0 | 2023-3-30 | 1.8.1/1.11.0 | [ptdbg_ascend-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/package/ptdbg_ascend/1.0/ptdbg_ascend-1.0-py3-none-any.whl) | [ptdbg_ascend精度工具功能说明_v1.0](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend精度工具功能说明_v1.0.md) | 0559e12ba7accf80d182f227698163ee0de88bf86b1e9cd9f33b16fdead14759 | - -2. whl包校验。 + +3. whl包校验。 1. 根据以上下载链接下载whl包到Linux安装环境。 @@ -155,7 +157,7 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 ba7ff7a1acffb1a2fab02fea76b6f957b2868bc6b66d72365622f6a8950406c6 *ptdbg_ascend-4.0-py3-none-any.whl ``` -3. whl包安装。 +4. whl包安装。 执行如下命令进行安装。 @@ -177,24 +179,20 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 #### 源代码编译安装 -1. 安装依赖。 +1. 使用pip命令安装wheel、numpy、openpyxl、pandas(1.3.5及以上版本)、psutil、pytest、PyYAML、rich、setuptools、torch依赖。 - 编译前需要安装wheel。 - - ```bash - pip3 install wheel - ``` + 若环境中已安装部分依赖,不需要重复安装。 2. 下载源码。 ```bash - git clone https://gitee.com/ascend/att.git + git clone https://gitee.com/ascend/mstt.git ``` 3. 配置安装环境。 ```bash - cd att/debug/accuracy_tools/ptdbg_ascend + cd mstt/debug/accuracy_tools/ptdbg_ascend bash ./configure ``` @@ -243,12 +241,12 @@ ptdbg_ascend精度工具的安装方式包括:**下载whl包安装**和**源 6. 安装。 执行如下命令进行ptdbg_ascend安装。 - + ```bash pip3 install ./ptdbg_ascend/dist/ptdbg_ascend-{version}-py3-none-any.whl --upgrade --force-reinstall ``` -完成ptdbg_ascend安装后,可以进行PyTorch精度数据的dump和、比对和溢出检测等操作,详细介绍请参见《[PyTorch精度工具使用指南](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)》。 +完成ptdbg_ascend安装后,可以进行PyTorch精度数据的dump和、比对和溢出检测等操作,详细介绍请参见《[PyTorch精度工具使用指南](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/ptdbg_ascend/doc)》。 ## 贡献 diff --git a/debug/accuracy_tools/ptdbg_ascend/RELEASE.md b/debug/accuracy_tools/ptdbg_ascend/RELEASE.md index 088c4d9e1f900f4b49dbf7935529e05a0cd2eaf5..27ffbd425c3d10f02d6244c5721539aa6132947c 100644 --- a/debug/accuracy_tools/ptdbg_ascend/RELEASE.md +++ b/debug/accuracy_tools/ptdbg_ascend/RELEASE.md @@ -1,4 +1,4 @@ -# Release 6.0.T4 +# Release 6.0 This is the initial release of Pytorch precision compare tools which was designed by the researchers and engineers in Huawei Technologies Co.,Ltd. \ No newline at end of file diff --git a/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md b/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md index bf96c42cdf647fc4fbf973728c6312c472e92d3b..45612989692c2e008567c8d4865c738ea93065e2 100644 --- a/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md +++ b/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md @@ -2,13 +2,13 @@ ## 工具使用 ### 1. 
环境变量方式导入ptdbg_ascend -当需要使用export att/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/common的目录下,手动添加一个version.py,并加上以下版本号信息,其中‘3.4’为当前ptdbg_ascend的版本 +当需要使用export mstt/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/common的目录下,手动添加一个version.py,并加上以下版本号信息,其中‘3.4’为当前ptdbg_ascend的版本 ``` __version__ = '3.4' ``` ### 2. dump指定融合算子 -dump指定操作当前支持dump指定融合算子的输入输出,需要在att/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/hook_module/support_wrap_ops.yaml中添加,比如以下代码段调用的softmax融合算子 +dump指定操作当前支持dump指定融合算子的输入输出,需要在mstt/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/hook_module/support_wrap_ops.yaml中添加,比如以下代码段调用的softmax融合算子 ``` def npu_forward_fused_softmax(self, input_, mask): resl = torch_npu.npu_scaled_masked_softmax(input_, mask, self.scale, False) diff --git "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T1.md" "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T1.md" deleted file mode 100644 index 56e5b64ebd041f09ded63072b377867651d6f80a..0000000000000000000000000000000000000000 --- "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T1.md" +++ /dev/null @@ -1,2203 +0,0 @@ -# **PyTorch精度工具使用指南** - -本文主要介绍PyTorch精度工具ptdbg_ascend的使用以及精度比对场景示例。 - -ptdbg_ascend工具的原理及安装请参见《[PyTorch精度工具](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 - -ptdbg_ascend工具主要支持PyTorch API精度数据dump、溢出检测、精度比对以及parse数据解析功能。其中dump和溢出检测功能支持使用debugger和register_hook方式进行精度数据的dump和溢出检测,推荐使用debugger方式。 - -## PyTorch精度比对总体流程 - -1. 准备CPU或GPU训练工程。 - -2. 在环境下安装ptdbg_ascend工具。 - -3. 在训练脚本内插入ptdbg_ascend工具dump接口。 - -4. 执行训练dump数据。 - -5. 将CPU或GPU训练工程迁移为NPU训练工程。 - - 请参见《[PyTorch模型迁移和训练指南](https://www.hiascend.com/document/detail/zh/canncommercial/63RC1/modeldevpt/ptmigr/ptmigr_0001.html)》。 - -6. 在NPU环境下安装ptdbg_ascend工具。 - -7. 在NPU训练脚本内插入ptdbg_ascend工具dump接口。 - -8. NPU环境下执行训练dump数据。 - -9. 创建并配置精度比对脚本,例如compare.py。 - -10. 执行CPU或GPU dump与NPU dump数据的精度比对。 - -11. 比对结果分析。 - -## 快速入门(debugger方式) - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**debugger方式dump和溢出检测**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 单卡场景精度比对 - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 - - 不推荐使用整网dump比对,若模型数据庞大(比如达到T级别),整网dump可能导致磁盘不足,需要预留足够的存储空间,或者分多次dump。 - -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 - -3. 范围比对:对不符合精度标准的API重新dump详细信息。 - -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 - -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 - -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -2. 
比对整网数据。 - - 第1步中的NPU dump数据目录为npu_dump,假设GPU dump数据目录为gpu_dump;dump将生成pkl数据文件api_stack_dump.pkl和npy数据目录api_stack_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice)。 - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch_batch_normal的堆栈信息及数据统计信息 - parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", "Torch_batch_normal_1_forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - - - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor_permute_1_forward"], acl_config='./dump.json') - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - debugger.configure_hook(mode="acl", scope=["Functional_conv2d_1_backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景 - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 
在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定前向API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional_conv2d_1_backward"], acl_config="./dump.json", backward_input=["./overflow_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in 'xxxxx.pkl'. - ``` - - 其中xxx times为用户设置的次数,xxxxx.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. 
SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## 场景化示例 - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**CPU或GPU及NPU精度数据dump**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 多卡场景精度比对 - -精度工具支持多卡场景的精度比对,多卡场景的dump步骤与单卡场景完全一致,请参见“**单卡场景精度比对**”章节,不同的是多卡数据精度比对时需要使用“compare_distributed”函数进行比对。 - -**大模型场景下dump推荐使用debugger方式的手动模式。** - -如下示例: - -说明:多机多卡场景需要每个设备单独执行比对操作。 - -假设NPU dump npy数据目录为npu_dump/ptdbg_dump_v4.0,GPU dump npy数据目录为gpu_dump/ptdbg_dump_v4.0。 - -1. 创建比对脚本,例如compare_distributed.py,拷贝如下代码。 - - ```python - from ptdbg_ascend import * - compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') - ``` - - dump数据目录须指定到step级。 - -2. 执行比对: - - ```bash - python3 compare_distributed.py - ``` - -两次运行须用相同数量的卡,传入`compare_distributed`的两个文件夹下须有相同个数的rank文件夹,且不包含其他无关文件,否则将无法比对。 - -**多卡set_dump_path注意事项** - -多卡一般为多进程,须保证每个进程都正确调用PrecisionDebugger或set_dump_path,或把PrecisionDebugger或set_dump_path插入到import语句后,如: - -```python -from ptdbg_ascend import * -debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) -``` - -或 - -```python -from ptdbg_ascend import * -seed_all() -set_dump_path('./dump_resnet') -``` - -如此可保证set_dump_path在每个进程都被调用。 - -**多卡register_hook注意事项** - -register_hook需要在set_dump_path之后调用,也需要在每个进程上被调用,建议在搬运模型数据到卡之后调用。识别方法如下: - -- 找到训练代码中遍历epoch的for循环或遍历数据集的for循环,把register_hook放到循环开始前即可。 -- 找到训练代码中调用DDP或者DistributedDataParallel的代码行,把register_hook放到该代码行所在的代码块之后。 -- 若代码中均无以上两种情况,需要保证register_hook在模型定义之后插入,并配置rank参数。rank参数获取rank_id请参见“**[rank_id获取方法](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/rank_id获取方法.md)**”。 - -### NPU vs NPU精度比对 - -对于NPU vs NPU场景,是针对同一模型,进行迭代(模型、API版本升级或设备硬件升级)时存在的精度下降问题,对比相同模型在迭代前后版本的API计算数值,进行问题定位。 - -一般情况下迭代涉及NPU自定义算子,因此,可以仅dump NPU自定义算子进行比对。比对精度问题分析请参见“**单卡场景精度比对**”章节。 - -工具当前支持dump NPU自定义算子如下: - -| 序号 | NPU自定义算子 | -| :--- | ----------------------------------------------- | -| 1 | torch_npu.one_ | -| 2 | torch_npu.npu_sort_v2 | -| 3 | torch_npu.npu_transpose | -| 4 | torch_npu.npu_broadcast | -| 5 | torch_npu.npu_dtype_cast | -| 6 | torch_npu.empty_with_format | -| 7 | torch_npu.npu_one_hot | -| 8 | torch_npu.npu_stride_add | -| 9 | torch_npu.npu_ps_roi_pooling | -| 10 | torch_npu.npu_roi_align | -| 11 | torch_npu.npu_nms_v4 | -| 12 | torch_npu.npu_iou | -| 13 | torch_npu.npu_nms_with_mask | -| 14 | torch_npu.npu_pad | -| 15 | torch_npu.npu_bounding_box_encode | -| 16 | torch_npu.npu_bounding_box_decode | -| 17 | torch_npu.npu_batch_nms | -| 18 | torch_npu.npu_slice | -| 19 | torch_npu._npu_dropout | -| 20 | torch_npu.npu_indexing | -| 21 | torch_npu.npu_ifmr | -| 22 | torch_npu.npu_max | -| 23 | torch_npu.npu_scatter | -| 24 | torch_npu.npu_layer_norm_eval | -| 25 | torch_npu.npu_alloc_float_status | -| 26 | torch_npu.npu_confusion_transpose | -| 27 | torch_npu.npu_bmmV2 | -| 28 | torch_npu.fast_gelu | -| 29 | torch_npu.npu_sub_sample | -| 30 | torch_npu.npu_deformable_conv2d | -| 31 | torch_npu.npu_mish | -| 32 | torch_npu.npu_anchor_response_flags | -| 33 | torch_npu.npu_yolo_boxes_encode | -| 34 | torch_npu.npu_grid_assign_positive | -| 35 | torch_npu.npu_normalize_batch | -| 36 | torch_npu.npu_masked_fill_range | -| 37 | torch_npu.npu_linear | -| 38 | torch_npu.npu_bert_apply_adam | -| 39 | torch_npu.npu_giou | -| 40 | torch_npu.npu_ciou | -| 41 | torch_npu.npu_diou | -| 42 | torch_npu.npu_sign_bits_pack | -| 43 | torch_npu.npu_sign_bits_unpack | -| 44 | 
torch_npu.npu_flash_attention | -| 45 | torch_npu.npu_scaled_masked_softmax | -| 46 | torch_npu.npu_rotary_mul | -| 47 | torch_npu.npu_roi_align | -| 48 | torch_npu.npu_roi_alignbk | -| 49 | torch_npu.npu_ptiou | -| 50 | torch_npu.npu_fusion_attention | -| 51 | torch_npu.npu_dropout_with_add_softmax | -| 52 | torch_npu.npu_random_choice_with_mask | -| 53 | torch_npu.npu_rotated_iou | -| 54 | torch_npu.npu_conv2d | -| 55 | torch_npu.npu_conv3d | -| 56 | torch_npu.npu_softmax_cross_entropy_with_logits | -| 57 | torch_npu.npu_all_gather_base_mm | -| 58 | torch_npu.npu_swiglu | -| 59 | torch_npu.npu_rms_norm | -| 60 | torch_npu.npu_mm_reduce_scatter_base | -| 61 | torch_npu.npu_mm_all_reduce_base | -| 62 | torch_npu.npu_conv_transpose2d | -| 63 | torch_npu.npu_convolution | -| 64 | torch_npu.npu_convolution_transpose | -| 65 | torch_npu.npu_min | -| 66 | torch_npu.npu_nms_rotated | -| 67 | torch_npu.npu_reshape | -| 68 | torch_npu.npu_rotated_box_decode | -| 69 | torch_npu.npu_rotated_box_encode | -| 70 | torch_npu.npu_rotated_overlaps | -| 71 | torch_npu.npu_silu | -| 72 | torch_npu.npu_fused_attention_score | -| 73 | torch_npu.npu_multi_head_attention | -| 74 | torch_npu.npu_gru | -| 75 | torch_npu.npu_incre_flash_attention | -| 76 | torch_npu.npu_prompt_flash_attention | -| 77 | torch_npu.npu_lstm | -| 78 | torch_npu.npu_apply_adam | - -### 通信API的数据dump - -通信类API数据可以使用全量dump方式获取,若只dump通信类API数据,可以使用如下示例: - -```python -debugger.configure_hook(mode="api_list", api_list=["distributed"]) -``` - -或 - -```python -set_dump_switch("ON", mode="api_list", api_list=["distributed"]) -``` - -通信类API支持列表: - -| 序号 | Distributed | -| :--- | -------------------- | -| 1 | send | -| 2 | recv | -| 3 | broadcast | -| 4 | all_reduce | -| 5 | reduce | -| 6 | all_gather | -| 7 | gather | -| 8 | isend | -| 9 | irecv | -| 10 | scatter | -| 11 | reduce_scatter | -| 12 | _reduce_scatter_base | -| 13 | _all_gather_base | - -### 单卡场景精度比对(register_hook方式) - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 -3. 范围比对:对不符合精度标准的API重新dump。 -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - - # 在main函数开始前固定随机数 - seed_all() - - # 配置dump数据目录路径和名称 - set_dump_path("./npu_dump", dump_tag='all') - - # 注册dump回调函数 - register_hook(model, acc_cmp_dump) - - ... - - # 在第一个迭代开始的位置开启dump和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="api_stack", filter_switch="OFF") - - ... - - # 在第一个迭代结束的位置关闭dump - set_dump_switch("OFF") - ``` - -2. 比对整网数据。 - - 第1步中的NPU dump数据文件为npu_dump.pkl,假设NPU dump npy数据目录为npu_dump,GPU dump数据文件为gpu_dump.pkl,GPU dump npy数据目录为gpu_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 
根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice) - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch_batch_normal的堆栈信息及数据统计信息 - parse("./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", "Torch_batch_normal_1_forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - - - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='forward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定前向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Tensor_permute_1_forward"], filter_switch="OFF") - - ... - - set_dump_switch("OFF") - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='backward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"], filter_switch="OFF") - set_backward_input(["./npu_dump/all_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - - ... - - set_dump_switch("OFF") - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景(register_hook方式) - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check, overflow_nums=3) - - ... - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # dump指定API的ACL级别溢出数据 - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - - # 在期望溢出检测的step位置开始前打开溢出检测开关 - set_overflow_check_switch("ON") - - ... - - # 在step结束的位置关闭溢出检测开关 - set_overflow_check_switch("OFF") - - ... - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check) - - ... - ``` - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... 
- # dump指定反向API的ACL级别溢出数据 - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"]) - set_backward_input(["./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump/Functional_conv2d_1_backward_input.0.npy"]) - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in 'xxxxx.pkl'. - ``` - - 其中xxx times为用户设置的次数,xxxxx.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过 [Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## debugger方式dump和溢出检测(推荐) - -### PrecisionDebugger模块 - -**功能说明** - -PrecisionDebugger模块包含dump和溢出检测功能的总体配置项。可以指定dump目录,设置dump或溢出检测功能,指定dump的卡和迭代。 - -可以在from ptdbg_ascend import *和模型初始化之间的任意位置添加该模块。 - -**原型** - -```python -PrecisionDebugger(dump_path=None, hook_name=None, rank=None, step=[], enable_dataloader=False, model=None): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| dump_path | 设置dump数据目录路径,参数示例:"./dump_path"。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。
当**configure_hook**函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录名称添加mode参数值为前缀,详情请参见“**dump数据存盘说明**”。
未配置dump_path时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时dump数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,dump_path和环境变量需要二选一。 | 否 | -| hook_name | dump模式,可取值"dump"和"overflow_check",表示dump和溢出检测功能,二选一。参数示例:hook_name="dump"。数据类型:str。 | 是 | -| rank | 指定对某张卡上的数据进行dump或溢出检测,默认未配置(表示dump所有卡的数据),须根据实际卡的Rank ID配置。应配置为大于0的正整数,且须根据实际卡的Rank ID配置,若所配置的值大于实际训练所运行的卡的Rank ID,则dump数据为空,比如当前环境Rank ID为0到7,实际训练运行0到3卡,此时若配置Rank ID为4或不存在的10等其他值,此时dump数据为空。数据类型:int。 | 否 | -| step | 指定dump某个step的数据,默认未配置,表示dump所有step数据。dump特定step时,须指定为训练脚本中存在的step。step为list格式,可配置逐个step,例如:step=[0,1,2];也可以配置step范围,例如:step=list(range(0,9)),表示dump第0到第8个step。数据类型:List[int]。 | 否 | -| enable_dataloader | 自动控制开关,可取值True(开启)或False(关闭),默认为False。配置为True后自动识别dump step参数指定的迭代,并在该迭代执行完成后退出训练,此时start和stop函数可不配置,开启该开关要求训练脚本是通过torch.utils.data.dataloader方式加载数据;配置为False则需要配置start和stop函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()。数据类型:bool。 | 否 | -| model | 开启init dump模式,传入网络模型实例化的对象,配置该参数后,dump操作仅dump网络中init方法里调用的方法(nn.Module类),不会对所有API进行dump。参数示例: model=net,net为网络模型实例化的对象名称。默认未配置。
配置该参数时,PrecisionDebugger模块请在模型实例化之后调用。数据类型:torch.nn.Module。
该模式不支持“溢出检测”、”ACL级别数据dump“和“模块级精度数据dump”。此模式下dump文件名前缀为网络中定义的模块名或层名。 | 否 | - -#### init dump模式示例代码和数据落盘说明 - -**示例代码** - -```python -import os -import torch -import torch.nn as nn -import torch_npu -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") - - -class Net(nn.Module): - - def __init__(self): - super(Net, self).__init__() - self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2) - self.relu1 = nn.ReLU() - self.bn1 = nn.BatchNorm2d(16) - - def forward(self, x): - x = self.conv1(x) - x = self.bn1(x) - output = self.relu1(x) - return output - -if __name__ == "__main__": - net = Net().npu() - # model参数传入net, 开启init dump 功能 - debugger = PrecisionDebugger(dump_path="./dump", hook_name="dump", model=net) - debugger.configure_hook(mode="api_stack") - debugger.start() - x = torch.randn(1, 1, 28, 28).npu() - out = net(x) - loss = out.sum() - loss.backward() - debugger.stop() -``` - -**落盘数据说明** - -该模式下dump数据命名格式为:`{Layer_name}_{Module_name}_{call_num}_{forward/backward}_{input/output}.npy` - -``` -# 按照上述用例代码进行dump,落盘数据命名示例如下: -conv1_Conv2d_0_forward_input.0.npy -conv1_Conv2d_0_forward_output.npy -relu1_ReLU_0_forward_input.0.npy -....... -bn1_BatchNorm2d_0_backward_output.2.npy -``` - -### configure_hook函数(可选) - -**功能说明** - -设置dump范围。 - -建议在**PrecisionDebugger**模块与模型初始化之间的任意位置添加,不添加此函数时默认使用mode="api_stack" dump整网数据。 - -**原型** - -dump: - -```python -debugger.configure_hook(mode="api_stack", scope=[], api_list=[], filter_switch="OFF", acl_config=None, backward_input=[], input_output_mode=["all"], summary_only=False, summary_mode="all") -``` - -溢出检测: - -```python -debugger.configure_hook(mode=None, acl_config=None, overflow_nums=1, need_replicate=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"api_stack"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围,mode="api_list"时,需要配置api_list=[],其他模式有需要时配置scope=[]。参数示例:scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"(表示开启过滤,即不dump)或"OFF"(表示关闭过滤)。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| acl_config | acl dump的配置文件。mode="acl"时,该参数必选;mode为其他值时,该参数不选。参数示例:acl_config='./dump.json'。dump.json配置文件详细介绍请参见“**dump.json配置文件说明**”。数据类型:str。 | 否 | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional_conv2d_1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional_conv2d_1、backward和input字段的.npy文件。数据类型:str。 | 否 | -| input_output_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例input_output_mode=["backward"]或input_output_mode=["forward", "backward"]。默认为["all"],即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:list。 | 否 | -| summary_only | dump npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | -| summary_mode | 控制dump文件输出的模式,可取值md5(dump仅输出包含md5值的pkl文件,用于验证数据的完整性)、summary(dump仅输出包含API统计信息的pkl文件)、all(dump输出包含API统计信息的pkl文件以及具体的npy文件),参数示例:summary_mode="md5",默认为"all"。summary_only=True时,不允许配置该参数。数据类型:str。 | 否 | -| overflow_nums | 
控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| need_replicate | 过程dump数据生成开关,执行溢出检测时,dump目录下会生成forward_real_data和backward_real_data的过程dump数据目录,可取值True(生成)或False(不生成),默认不生成。数据类型:bool。 | 否 | - -**函数示例** - -configure_hook可配置多种dump模式,示例如下: - -说明: - -以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -以下仅为该函数配置示例,完整代码请参见“**示例代码**”章节。 - -- 示例1:dump指定API列表 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="list", scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward", "Torch_relu_3_backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="range", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="stack", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor_permute_1_forward"], acl_config="./dump.json") - ``` - -- 示例5:dump指定反向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional_conv2d_1_backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - ``` - -- 示例6:dump指定某一类API的API级别输入输出数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例7:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例8: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例9:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(input_output_mode=["backward"]) - ``` - -- 示例10:仅dump pkl文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(summary_only=True) - ``` - -- 示例11:溢出检测dump - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=1) - ``` - - dump执行时会在**PrecisionDebugger**模块的dump_path参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例11:dump溢出API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### start函数(可选) - -**功能说明** - 
-dump或溢出检测启动函数。 - -在模型初始化之后的任意位置添加。 - -**原型** - -```python -debugger.start() -``` - -该函数为类函数,可以使用debugger.start()也可以使用PrecisionDebugger.start()。 - -### stop函数(可选) - -**功能说明** - -dump或溢出检测停止函数。 - -在**start**函数之后的任意位置添加。 - -**原型** - -```python -debugger.stop() -``` - -该函数为类函数,可以使用debugger.stop()也可以使用PrecisionDebugger.stop()。 - -### 示例代码(自动模式) - -**需要保证用户训练代码是通过torch.utils.data.dataloader方式加载数据。** - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -### 示例代码(手动模式) - -一般情况下使用自动模式可以快速方便进行dump操作,但个别大模型可能在部分卡的训练操作中没有调用dataloader,这会导致自动模式无法dump指定迭代的数据,此时需要关闭自动模式手动在迭代前后插入start()和stop()函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()以标识dump结束。 - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -## register_hook方式dump和溢出检测 - -### 总体说明 - -- 本节主要介绍CPU或GPU及NPU精度数据dump和溢出检测所需要的函数以及示例。 - -- ptdbg_ascend工具默认情况下仅dump PyTorch模型的API输入输出数据进行精度比对,若在比对结果中发现某个API下可能存在ACL的精度问题,那么可以选择dump该API的ACL级别数据进行精度分析。 - -- 某些torch api的输出不是Tensor类型的数据。对于此类API的反向过程进行ACL dump,工具会在运行日志中给出对应的Warning(is not of tensor type and cannot be automatically derived)提示。如若想要进行该类API反向ACL dump,可以通过手动构建单API用例的方式进行ACL dump,具体用例可参见“**[反向ACL dump用例说明](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/%E5%8F%8D%E5%90%91ACL%20dump%E7%94%A8%E4%BE%8B%E8%AF%B4%E6%98%8E.md)**”。 - -- 工具性能:dump数据量较小时(小于5G),参考dump速度0.1GB/s;dump数据量较大时,参考dump速度0.2GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。Dump速度的计算方式:Dump数据量/(单个step添加Dump耗时-原始单个step耗时)。 - -### 约束 -- 进行CPU或GPU数据dump时,请安装torch包而非torch_npu包,避免工具无法识别使用场景,导致失败。 - -- TASK_QUEUE_ENABLE环境变量会导致API下发和执行异步进行,因此在ACL dump前需要将TASK_QUEUE_ENABLE关闭,即export TASK_QUEUE_ENABLE=0。 - -- 不建议在PyTorch训练脚本中同时添加dump接口和性能数据采集(如Ascend PyThon Profiler)接口,二者可能相互影响导致数据不准确。 - -### seed_all - -**功能说明** - -固定随机数。通过固定随机数保证模型的输入或输出一致。在训练主函数开始前调用,避免随机数固定不全。 - -使用form ptdbg import *后自动导入该函数,代码无需再次添加,若需要修改随机数种子和确定性计算模式,则需要通过添加该函数修改。 - -**函数原型** - -```python -seed_all(seed=1234, mode=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------ | ------------------------------------------------------------ | -------- | -| seed | 随机数种子。参数示例:seed=1000。默认值为:1234。数据类型:int。 | 否 | -| mode | 确定性计算模式。可配置True或False。参数示例:mode=True。默认为False。数据类型:bool。
未开启确定性计算时,即使硬件和输入相同,API多次执行的结果也可能不同;开启确定性计算可保证在相同的硬件和输入下,API多次执行的结果一致。
确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启。
rnn类算子、ReduceSum、ReduceMean等算子可能与确定性计算存在冲突,若开启确定性计算后多次执行的结果不相同,则考虑存在这些算子。 | 否 | - -**函数示例** - -seed_all函数的随机数种子,取默认值即可,无须配置;第二个参数默认关闭,不开启确定性计算时也无须配置。 - -- 示例1:仅固定随机数,不开启确定性计算 - - ```python - seed_all() - ``` - -- 示例2:固定随机数,开启确定性计算 - - ```python - seed_all(mode=True) - ``` - -**固定随机数范围** - -seed_all函数可固定随机数的范围如下表。 - -| API | 固定随机数 | -| ---------------------------------------- | --------------------------- | -| os.environ['PYTHONHASHSEED'] = str(seed) | 禁止Python中的hash随机化 | -| random.seed(seed) | 设置random随机生成器的种子 | -| np.random.seed(seed) | 设置numpy中随机生成器的种子 | -| torch.manual_seed(seed) | 设置当前CPU的随机种子 | -| torch.cuda.manual_seed(seed) | 设置当前GPU的随机种子 | -| torch.cuda.manual_seed_all(seed) | 设置所有GPU的随机种子 | -| torch_npu.npu.manual_seed(seed) | 设置当前NPU的随机种子 | -| torch_npu.npu.manual_seed_all(seed) | 设置所有NPU的随机种子 | -| torch.backends.cudnn.enable=False | 关闭cuDNN | -| torch.backends.cudnn.benchmark=False | cuDNN确定性地选择算法 | -| torch.backends.cudnn.deterministic=True | cuDNN仅使用确定性的卷积算法 | - -需要保证CPU或GPU以及NPU的模型输入完全一致,dump数据的比对才有意义,seed_all并不能保证模型输入完全一致,如下表所示场景需要保证输入的一致性。 - -| 场景 | 固定方法 | -| --------------- | ------------- | -| 数据集的shuffle | 关闭shuffle。 | -| dropout | 关闭dropout。 | - -关闭shuffle示例: - -```python -train_loader = torch.utils.data.DataLoader( - train_dataset, - batch_size = batch_size, - shuffle = False, - num_workers = num_workers -) -``` - -关闭dropout: - -在使用from ptdbg import *后,工具会自动将torch.nn.functional.dropout、torch.nn.functional.dropout2d、torch.nn.functional.dropout3d、torch.nn.Dropout、torch.nn.Dropout2d、torch.nn.Dropout3d的接口参数p置为0。 - -### set_dump_path - -**功能说明** - -设置数据保存目录。建议在seed_all函数之后调用且需要保证训练进程能够调用该函数;多卡时须保证每个进程都能调用该函数。 - -**函数原型** - -```python -set_dump_path(fpath=None, dump_tag='ptdbg_dump') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| fpath | 设置数据目录路径。参数示例:'./dump_path'。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。
当set_dump_switch函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录名称添加mode参数值为前缀,详情请参见“**dump数据存盘说明**”。
未配置fpath时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,fpath和环境变量需要二选一。 | 否 | -| dump_tag | 设置数据目录名称。参数示例:dump_tag='dump_conv2d'。默认数据目录命名为ptdbg_dump_{version}。数据类型:str。
{version}为当前安装ptdbg_ascend工具版本。目录结构参见“**dump数据存盘说明**”。
配置该参数会将生成的`ptdbg_dump_{version}`目录名称变更为dump_tag配置的值,如`dump_conv2d_{version}`。 | 否 | - -**函数示例** - -- 示例1:设置数据目录路径 - - ```python - set_dump_path('./dump_path') - ``` - -- 示例2:设置数据目录名称 - - ```python - set_dump_path('./dump_path', dump_tag='dump_conv2d') - ``` - - -若以相同的数据目录多次dump,则会因同名导致覆盖;多次dump建议配置不同的dump_tag。 - -### register_hook - -**功能说明** - -注册工具钩子函数。在set_dump_path之后调用。 - -dump操作必选。 - -**函数原型** - -```python -register_hook(model, hook, overflow_nums=overflow_nums, dump_mode=dump_mode, dump_config=dump_config_file) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| model | 传入网络模型实例化的对象。参数示例: model=net,net为网络模型实例化的对象名称。数据类型:torch.nn.Module。 | 是 | -| hook | 注册工具的dump和溢出检测钩子。可取值overflow_check(表示溢出检测)和acc_cmp_dump(表示dump数据),二选一。数据类型:Callable。 | 是 | -| overflow_nums | 控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| dump_mode | 控制针对溢出API的dump模式,可取值"acl"或"api"。配置acl时,表示dump ACL级别的溢出数据,此时set_dump_path参数不生效,dump数据目录由dump_config的.json文件配置。参数示例:dump_mode="acl"。默认不配置,即dump API级别的溢出数据。数据类型:str。 | 否 | -| dump_config | acl dump的配置文件。dump_mode="acl"时,该参数必选;dump_mode="api"时,该参数不选。参数示例:dump_config='./dump.json'。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:注册工具钩子函数 - - ```python - register_hook(model, acc_cmp_dump) - ``` - -- 示例2:dump指定API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - ``` - - 需要配置set_dump_switch的mode="acl"以及scope指定为前向或反向API,请参见“**set_dump_switch”**的示例。 - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置dump数据目录。 - -- 示例3:溢出检测dump - - ```python - register_hook(model, overflow_check, overflow_nums=3) - ``` - - dump执行时会在set_dump_path的fpath参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例4:dump指定API的ACL级别溢出数据 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### set_dump_switch - -**功能说明** - -设置dump范围。建议在register_hook函数之后的脚本内任意位置插入,但进行精度问题排查建议参照“场景化示例 > 单卡场景精度比对”章节的顺序,先从第一个迭代开始的位置调用并dump整网数据。 - -dump操作必选。 - -**函数原型** - -```python -def set_dump_switch(switch, mode="all", scope=[], api_list=[], filter_switch="OFF", dump_mode=["all"], summary_only=False): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| --------------- | ------------------------------------------------------------ | -------- | -| switch | dump开关。可取值"ON"或"OFF"。须在选定dump开始的位置配置set_dump_switch("ON");dump结束的位置设置set_dump_switch("OFF")。数据类型:str。 | 是 | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"all"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围。参数示例:scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| dump_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例dump_mode=["backward"]或dump_mode=["forward", "backward"]。默认为all,即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:List[str]。 | 否 | -| summary_only | dump 
npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | - -**推荐配置** - -```python -set_dump_switch("ON", mode="api_stack", filter_switch="OFF") -``` - -开启dump数据和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量。 - -**函数示例** - -set_dump_switch可配置多种dump模式,示例如下: - -说明:以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -- 示例1:dump指定API列表 - - ```python - set_dump_switch("ON", mode="list", scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward", "Torch_relu_3_backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - set_dump_switch("ON", mode="range", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - set_dump_switch("ON", mode="stack", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Tensor_permute_1_forward"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件。 - -- 示例4:dump指定反向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"]) - set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional_conv2d_1_backward_input.0.npy"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件,并通过set_backward_input设置反向API输入的.npy文件。 - -- 示例5:dump指定某一类API的API级别输入输出数据 - - ```python - set_dump_switch("ON", mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例6:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - set_dump_switch("ON", mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例7: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - set_dump_switch("ON", filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例8:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - set_dump_switch("ON", dump_mode=["backward"]) - ``` - -- 示例9:仅dump pkl文件 - - ```python - set_dump_switch("ON", summary_only=True) - ``` - -以上示例均需要在结束dump的位置插入set_dump_switch("OFF")。 - -set_dump_switch配置mode为all或api_stack时,结束dump后,在dump目录下会自动生成compare_data.py比对脚本模板,示例如下: - -```python -from ptdbg_ascend import compare - -pkl_path = "%s" -dump_data_dir = "%s" - -dump_path_param = { - "npu_pkl_path": , - "bench_pkl_path": , - "npu_dump_data_dir": , - "bench_dump_data_dir": , - "is_print_compare_log": True -} - -compare(dump_path_param, output_path="", stack_mode="%s") -``` - -pkl_path和dump_data_dir字段会自动识别pkl和dump目录的路径,用户需要判断当前dump的环境是NPU、CPU或GPU,并将pkl_path和dump_data_dir字段填入下方dump_path_param函数对应的字段中,例如当前设备为NPU,那么填写方式如下: - -```python -from ptdbg_ascend import compare - -pkl_path = "%s" -dump_data_dir = "%s" - -dump_path_param = { - "npu_pkl_path": pkl_path, - "bench_pkl_path": , - "npu_dump_data_dir": dump_data_dir, - "bench_dump_data_dir": , - "is_print_compare_log": True -} - -compare(dump_path_param, output_path="", stack_mode="%s") -``` - -此时,另一侧数据的路径,需要用户另外识别并填入。 - -### set_overflow_check_switch - -**功能说明** - -置溢出检测范围。默认不配置该函数,全量进行溢出检测。 - -仅支持NPU环境。 - -**函数原型** - -```python -set_overflow_check_switch(switch, filter_switch='OFF') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| switch, | 
检测开关。可取值"ON"或"OFF"。如果只在特定的step溢出检测,则在期望溢出检测的step位置开始前插入set_overflow_check_switch("ON"),在step结束的位置插入set_overflow_check_switch("OFF")。数据类型:str。 | 是 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:指定范围溢出检测 - - ```python - register_hook(model, overflow_check) - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,dump执行时会在当前目录自动生成ptdbg_dump_{version}目录,保存溢出数据。 - -- 示例2:前向API的ACL级别范围溢出检测 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置溢出数据目录。 - -### set_backward_input - -**功能说明** - -设置反向ACL级别dump时需要的反向输入的.npy文件。 - -**函数原型** - -```python -set_backward_input(backward_input) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional_conv2d_1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional_conv2d_1、backward和input字段的.npy文件。数据类型:str。 | 是 | - -**函数示例** - -```python -register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') -set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"]) -set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional_conv2d_1_backward_input.0.npy"]) -``` - -## dump.json配置文件说明 - -**dump.json配置示例** - -```python -{ - "dump": - { - "dump_list":[], - "dump_path":"./dump/output", - "dump_mode":"all", - "dump_op_switch":"on" - } -} -``` - -**dump.json参数说明** - -| 字段名 | 说明 | -| -------------- | ------------------------------------------------------------ | -| dump_list | 待dump数据的API模型。为空,无需配置。 | -| dump_path | dump数据文件存储到运行环境的目录,主要用于指定ACL dump数据路径。支持配置绝对路径或相对路径。dump_path须为已存在目录。 | -| dump_mode | dump数据模式,配置如下:
- output:dump API的输出数据。默认值。
- input:dump API的输入数据。
- all:dump API的输入、输出数据。 | -| dump_op_switch | 单API模型dump数据开关,配置如下: * off:关闭单API模型dump,默认值。 * on:开启单API模型dump。 | - -**dump目录说明** - -配置register_hook的dump_config后,采集的dump数据会在{dump_path}/{time}/{deviceid}/{model_id}目录下生成,例如“/home/HwHiAiUser/output/20200808163566/0/0” - -```bash -├── 20230131172437 -│   └── 1 -│   ├── 0 -│   │   ├── Add.Add.45.0.1675157077183551 -│   │   ├── Cast.trans_Cast_0.31.0.1675157077159449 -│   │   ├── Cast.trans_Cast_5.43.0.1675157077180129 -│   │   ├── MatMul.MatMul.39.0.1675157077172961 -│   │   ├── Mul.Mul.29.0.1675157077155731 -│   │   ├── NPUAllocFloatStatus.NPUAllocFloatStatus.24.0.1675157077145262 -│   │   ├── TransData.trans_TransData_1.33.0.1675157077162791 -│   │   └── TransData.trans_TransData_4.41.0.1675157077176648 -│   ├── 1701737061 -│   │   └── Cast.trans_Cast_2.35.0.1675157077166214 -│   ├── 25 -│   │   └── NPUClearFloatStatus.NPUClearFloatStatus.26.0.1675157077150342 -│   └── 68 -│   └── TransData.trans_TransData_3.37.0.1675157077169473 -``` - -## 模块级精度数据dump - -### 总体说明 - -大模型场景下,通常不是简单的利用自动迁移能力实现GPU到NPU的训练脚本迁移,而是会对NPU网络进行一系列针对性的适配,因此,常常会造成迁移后的NPU模型存在部分子结构不能与GPU原始模型完全对应。模型结构不一致导致API调用类型及数量不一致,若直接按照API粒度进行精度数据dump和比对,则无法完全比对所有的API。 - -本节介绍的功能是对模型中的大粒度模块进行数据dump,使其比对时,对于无法以API粒度比对的模块可以直接以模块粒度进行比对。 - -模块指的是继承自nn.Module类模块,通常情况下这类模块就是一个小模型,可以被视为一个整体,dump数据时以模块为粒度进行dump。 - -### module_dump - -**功能说明** - -开启模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump(module, module_name) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------- | ------------------------------------------------------------ | -------- | -| module | 网络中实例化好的nn.Module类对象。数据类型:torch.nn.Module。 | 是 | -| module_name | 用户自定义的该model名称。主要用于dump数据文件的命名,便于在比对时识别模块级数据。数据类型:str。 | 是 | - -### module_dump_end - -**功能说明** - -结束模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump_end() -``` - -### 示例代码 - -```python -# 根据需要import包 -import os -import torch -import torch.nn as nn -import torch_npu -import torch.nn.functional as F -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") -# 定义一个简单的网络 -class ModuleOP(nn.Module): - def __init__(self) -> None: - super().__init__() - self.linear_1 = nn.Linear(in_features=8, out_features=4) - self.linear_2 = nn.Linear(in_features=4, out_features=2) - def forward(self, x): - x1 = self.linear_1(x) - x2 = self.linear_2(x1) - r1 = F.relu(x2) - return r1 - -if __name__ == "__main__": - module = ModuleOP() - - # 注册工具 - pdbg = PrecisionDebugger("./dump_data/npu", hook_name="dump") - pdbg.start() - - x = torch.randn(10, 8) - module_dump(module, "MyModuleOP") # 开启模块级精度数据dump - out = module(x) - module_dump_end() # 结束模块级精度数据dump - loss = out.sum() - loss.backward() - pdbg.stop() -``` - -## dump数据存盘说明 - -dump结果目录结构示例如下: - -```bash -├── dump_path -│ └── ptdbg_dump_{version} -│ ├── step0 -│ | ├── rank0 -│ | │ ├── dump -| | | | ├── Tensor_permute_1_forward.npy -| | | | ├── MyModule_0_forward_input.npy # 开启模块级精度数据dump时存在模块级的dump数据文件 -| | | | ... -| | | | └── Fcuntion_linear_5_backward_output.npy -│ | │ └── dump.pkl -│ | ├── rank1 -| | | ├── dump -| | | | └── ... -| | | └── dump.pkl -│ | ├── ... -│ | | -| | └── rank7 -│ ├── step1 -│ | ├── ... 
-│ ├── step2 -``` - -dump过程中,npy文件在对应算子或者模块被执行后就会落盘,而pkl文件则需要在正常执行PrecisionDebugger.stop()或set_dump_switch("OFF")后才会被落盘保存,异常的程序终止会保存终止前被执行算子的相关npy文件,但是不会生成pkl文件。 - -其中`ptdbg_dump_{version}`为默认命名,debugger方式dump不支持修改该文件夹名称,使用set_dump_path函数则支持通过dump_tag参数修改文件夹名称;rank为设备上各卡的ID,每张卡上dump的数据会生成对应dump目录。 - -**精度比对dump场景** - -精度比对dump场景的结果如下: - -* dump.pkl文件:包含dump数据的API名称(命名格式为:`{api_type}_{api_name}_{API调用次数}_{前向反向}_{input/output}.{参数序号}`)、dtype、 shape、各数据的max、min、mean、L2norm统计信息以及当配置summary_mode="md5"时的md5数据。 - - 其中,“参数序号”表示该API下的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个子参数;L2norm表示2范数(平方根)。 - -* dump目录:目录下为npy格式的dump数据。 - - npy文件保存的前缀和PyTorch对应关系如下 - - | 前缀 | Torch模块 | - | ----------- | ------------------- | - | Tensor | torch.Tensor | - | Torch | torch | - | Functional | torch.nn.functional | - | NPU | NPU亲和算子 | - | VF | torch._VF | - | Aten | torch.ops.aten | - | Distributed | torch.distributed | - -当configure_hook或set_dump_switch配置mode参数(例如:mode="api_stack" )时,dump结果的文件名会添加api_stack前缀,dump结果如下: - -* api_stack_dump.pkl -* api_stack_dump目录 - -**溢出检测dump场景** - -PrecisionDebugger模块的hook_name参数或register_hook函数设置了overflow_check时,检测API溢出,dump结果的文件名格式为:`{api_type}_{api_name}_{API调用次数}_{前向反向}_{当前溢出次数}`,dump结果示例如下: - -* `Tensor_add_1_forward_1.pkl` -* `Tensor_add_1_forward_1`目录 - -## 工具支持的API列表 - -ptdbug_ascend工具维护固定的API支持列表,若需要删除或增加dump的API,可以在[support_wrap_ops.yaml](../src/python/ptdbg_ascend/hook_module/support_wrap_ops.yaml)文件内手动修改,如下示例: - -```bash -functional: # functional为算子类别,找到对应的类别,在该类别下按照下列格式删除或添加API - - conv1d - - conv2d - - conv3d -``` - -## CPU或GPU与NPU精度数据比对 - -### 总体说明 - -- 本节主要介绍CPU或GPU与NPU精度数据比对的函数以及示例。 - -- 比对函数均通过单独创建精度比对脚本执行,可支持单卡和多卡场景的精度数据比对。 - -- 工具性能:比对数据量较小时(参考值单份文件小于10GB),参考比对速度0.1GB/s;比对数据量较大时,参考比对速度0.3GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。比对速度的计算方式:两份比对文件大小/比对耗时。 - -### 约束 - -- NPU自研API,在CPU或GPU若没有对应的API,该API的dump数据不比对。 - -- NPU与CPU或GPU的计算结果误差可能会随着模型的执行不断累积,最终会出现同一个API因为输入的数据差异较大而无法比对的情况。 - -- CPU或GPU与NPU中两个相同的API会因为调用次数不同导致无法比对或比对到错误的API,不影响整体运行,该API忽略。 - -### compare_distributed - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,支持单卡和多卡,可同时比对多卡的dump数据。多机场景需要每个设备单独执行比对操作。可自动检索和匹配对应卡和进程所dump的数据文件,再调用compare进行比对。单机单卡时与compare函数二选一。 - -**函数原型** - -```python -compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| npu_dump_dir | 配置NPU环境下的dump目录。dump数据目录须指定到step级。参数示例:'./npu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| bench_dump_dir | 配置CPU、GPU或NPU环境下的dump目录。参数示例:'./gpu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.csv`。数据类型:str。 | 是 | -| **kwargs | 支持compare的所有可选参数。 | 否 | - -**函数示例** - -创建比对脚本,例如compare_distributed.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') -``` - -dump数据目录须指定到step级。 - -### compare - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,仅支持单机单卡。 - -**函数原型** - -```python -compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_match=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------ | 
------------------------------------------------------------ | -------- | -| input_param | 配置dump数据文件及目录。数据类型:dict。配置参数包括:
- "npu_pkl_path":指定NPU dump目录下的.pkl文件。参数示例:"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
- "bench_pkl_path":指定CPU、GPU或NPU dump目录下的.pkl文件。参数示例:"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
- "npu_dump_data_dir":"指定NPU dump目录下的dump数据目录。参数示例:"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
- "bench_dump_data_dir":"指定CPU、GPU或NPU dump目录下的dump数据目录。参数示例:"npu_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
- "is_print_compare_log":配置是否开启日志打屏。可取值True或False。可选。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。数据类型:str。 | 是 | -| stack_mode | 配置stack_mode的开关。仅当dump数据时配置debugger.configure_hook或set_dump_switch的mode="api_stack"时需要开启。可取值True或False,参数示例:stack_mode=True,默认为False。数据类型:bool。 | 否 | -| auto_analyze | 自动精度分析,开启后工具自动针对比对结果进行分析,识别到第一个精度不达标节点(在比对结果文件中的“Accuracy Reached or Not”列显示为No),并给出问题可能产生的原因(打屏展示并生成advisor_{timestamp}.txt文件)。可取值True或False,参数示例:auto_analyze=False,默认为True。数据类型:bool。 | 否 | -| fuzzy_match | 模糊匹配。开启后,对于网络中同一层级且命名仅调用次数不同的API,可匹配并进行比对。可取值True或False,参数示例:fuzzy_match=True,默认为False。数据类型:bool。 | 否 | - -**函数示例** - -单机单卡场景下创建比对脚本,例如compare.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output", stack_mode=True) -``` - -### pkl文件比对 - -若使用**compare**或**compare_distributed**函数创建的比对脚本中,input_param参数只配置了npu_pkl_path和bench_pkl_path或使用summary_only、summary_mode(取值为md5或summary)方式dump时,可以进行pkl文件的比对,此时比对dump.pkl文件中的统计信息,开启后的比对结果文件生成Max diff、Min diff、Mean diff和L2norm diff,表示NPU dump数据中API的输入或输出与标杆数据输入或输出的最大值、最小值、平均值以及L2范数的差。可以通过该值判断API是否存在精度问题:当某个API的输入和输出的Max diff、Min diff、Mean diff和L2norm diff均为0或无限趋于0,那么可以判断该API无精度问题,反之则可能存在精度问题。 - -**比对脚本示例** - -以compare.py为例。 - -```python -from ptdbg_ascend import compare -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output", stack_mode=True) -``` - -**比对结果** - -pkl文件比对同样生成`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.csv`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.csv`主要有如下两种情况: - -- configure_hook配置summary_only=True、summary_mode=summary或不配置前面两个参数直接比对pkl文件: - - ![compare_result_pkl](./img/compare_result_pkl.png) - - 上图是对pkl文件中NPU及标杆API的统计信息进行比对,判断可能存在精度问题的API,文件中记录NPU及标杆API的基本信息和统计信息,其中需要关注Result列,包含结果:Waring(NPU与标杆统计信息的比对中存在相对误差大于0.5,则需要重点检查该API);为空(相对误差小于等于0.5,可以不需要重点关注,但不代表不存在精度问题);Nan(表示统计信息数据没有匹配上)。 - -- configure_hook配置summary_mode=md5: - - ![compare_result_pkl_md5.png](./img/compare_result_pkl_md5.png.png) - - 上图是对pkl文件中NPU及标杆API的MD5信息进行比对,判断API数据的完整性,文件中记录NPU及标杆API的基本信息和MD5信息,其中需要关注Result列,包含结果:Pass(表示NPU与标杆的MD5值一致,即API数据完整);Different(表示NPU与标杆的MD5值不一致,即API数据不完全一致,可以通过NPU_Stack_Info列API调用栈查询该API的详细信息);Nan(表示MD5信息数据没有匹配上)。 - -### parse - -**功能说明** - -解析并提取dump信息中的堆栈信息及数据统计信息。 - -**函数原型** - -```python -parse(pkl_file, module_name_prefix) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------------ | ------------------------------------------------------------ | -------- | -| pkl_file | 指定dump数据文件中的pkl文件名。参数示例:"./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl"。数据类型:str。 | 是 | -| module_name_prefix | 指定待提取的API接口前缀。参数示例:"Torch_norm_1_forward"。数据类型:str。 | 是 | -**函数示例** - -创建堆栈信息及数据统计信息提取脚本,例如parse.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl", "Torch_batch_normal_1_forward") -``` - -### 计算精度评价指标 - 
-PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度评价指标判断API在运行时是否存在精度问题。 - -计算精度评价指标: - -1. Cosine:通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。在计算中可能会存在nan,主要由于可能会出现其中一个向量为0。 - -2. MaxAbsErr:当最大绝对误差越接近0表示其计算的误差越小,实际可接受阈值为小于0.001。 - -3. MaxRelativeErr:当最大相对误差越接近0表示其计算的误差越小。 - - 当dump数据中存在0或Nan时,比对结果中最大相对误差则出现inf或Nan的情况,属于正常现象。 - -4. One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指NPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 - -精度比对结果csv文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: - -1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 -2. Cosine < 0.9,精度不达标,标记为“No”。 -3. MaxAbsError > 1,精度不达标,标记为“No”。 -5. 其余情况下记为精度达标,标记为“Yes”。 - -## ptdbg_ascend.parse数据解析功能 - -ptdbg_ascend.parse为命令行交互式界面解析工具,提供更多的数据解析功能并且展示结果。 - -使用场景:本工具主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -### 进入parse交互式界面 - -安装ptdbg_ascend工具后,可以通过使用命令 **python -m ptdbg_ascend.parse** 进入交互式界面,如下所示: - -```bash -python -m ptdbg_ascend.parse -Parse >>> -``` - -可在parse的界面中执行Shell命令,以及如下场景的相关解析命令: - -- 支持指定ACL层级算子数据比对。 -- 支持指定ACL层级算子数据转换及展示。 -- 支持交互式指定pkl文件中API对应dump数据查看。 -- 支持API进行可选层级比对和打印(统计级和像素级)。 - -Ctrl+C可以退出parse交互式界面。不退出parse交互式界面若需要执行非该界面下的内置Shell命令,且命令与parse交互式界面命令冲突时,非该界面命令需要使用run命令,在相关命令前加上run前缀,如下示例: - -```bash -python -m ptdbg_ascend.parse -Parse >>> run vim cli.py -Parse >>> vim cli.py -``` - -以上各场景详细介绍请参见下文章节。 - -### ACL层级算子数据批量转换 - -本功能会将原有待比对dump数据目录下的dump数据按照算子名和时间戳进行梳理并分类,之后再将dump数据转为为npy文件。 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下比对命令进行数据转换。 - -```bash -cad -m my_dump_path [-out output_path] [-asc msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| -m | 待转换ACL dump数据目录。需要指定到ACL dump数据的deviceid级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_convert。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -asc | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py。 | 否 | - -**示例** - -```bash -# 传入待比对数据目录 -Parse >>> cad -m /home/xxx/my_dump_path/20000124003856/0 -# 转换结果打印 -...... -╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -# 转换前的dump文件 -│ SrcFile: /home/xxx/my_dump_path/20000124003856/0/272/TransData.trans_TransData_22.112.21.948645536672764 │ -# 转换后的npy文件 -│ - TransData.trans_TransData_22.112.21.948645536672764.output.0.npy │ -│ - TransData.trans_TransData_22.112.21.948645536672764.input.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -...... -[INFO] The comparison result have been written to "./parse_data/acl_batch_convert". -``` - -输出结果: - -原dump数据目录: - -```bash -├── /home/xxx/my_dump_path/20000124003856/0/ -│ ├── 272 -│ │ ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} -│ │ ... -│ ├── 512 -│ ... -``` - -转换后: - -```bash -├── ./parse_data/acl_batch_convert/{timestamp} -│ ├── {op_name1} -│ │ ├── {timestamp1} -│ │ | ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input/output}.{参数序号}.npy -│ │ | │ ... -│ │ ├── {timestamp2} -│ │ | ... -│ ├── {op_name2} -│ ├── ... 
-``` - -### ACL层级算子数据比对 - -本功能主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -本功能支持批量比对,若需要进行批量比对,需要先将两份待比对的NPU ACL层级dump数据进行“**ACL层级算子数据批量转换**”,可以使两份数据更好的匹配;若直接进行dump数据的比对,建议只比对单个dump数据文件。 - -输入以下比对命令进行数据比对。 - -```bash -vc -m my_dump_path -g golden_dump_path [-out output_path] [-cmp_path msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -m | 待比对ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -g | 标杆ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_comapre。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -输出结果:batch_compare_{timestamp}.csv文件。 - -**示例** - -```bash -# 传入待比对数据目录以及标杆数据目录 -Parse >>> vc -m ./my_dump_path -g ./golden_data_path -[INFO]Compare result is saved in : parse_data/acl_batch_comapre/batch_compare_1707271118.csv -``` - -### ACL算子数据的npy转换 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下转换命令进行数据转换, 将ACL级别dump数据转为npy文件。 - -```bash -dc -n file_name/file_path [-f format] [-out output_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -n | 需转换的dump数据文件或dump数据文件目录。 | 是 | -| -f | 开启format转换,指定该参数时需要配置format格式。当前内置的Format转换支持如下类型:
FRACTAL_NZ转换成NCHW<br>FRACTAL_NZ转换成NHWC<br>FRACTAL_NZ转换成ND<br>HWCN转换成FRACTAL_Z<br>HWCN转换成NCHW<br>HWCN转换成NHWC<br>NC1HWC0转换成HWCN<br>NC1HWC0转换成NCHW<br>NC1HWC0转换成NHWC<br>NCHW转换成FRACTAL_Z<br>NCHW转换成NHWC<br>NHWC转换成FRACTAL_Z<br>NHWC转换成HWCN<br>NHWC转换成NCHW<br>
NDC1HWC0转换成NCDHW | 否 | -| -out | 结果输出目录。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -[^]: 若传入单个dump文件,则转换单个文件,若传入dump文件目录则转换目录下所有dump文件。 - -- 输出结果:npy文件。 -- 若指定-out参数需要用户传入输出路径,并且路径需要已存在。 -- 若未指定输出目录, 则比对结束后将结果保存在默认目录 “./parse_data/convert_result”中,比对结束后会打印log提示输出结果存放路径及转换结果。 - -- 输入以下命令,展示npy数据统计信息。 - - ```bash - pt -n file_path - ``` - - | 参数名称 | 说明 | 是否必选 | - | -------- | ------------- | -------- | - | -n | npy文件路径。 | 是 | - - 打印统计信息:shape, dtype, max, min和mean。默认在npy文件路径下将该数据保存为txt文件。 - -**示例1** - -```bash -# 传入需转换的dump文件目录 -Parse >>> dc -n ./dump_data/ -...... -# 转换结果 -╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ SrcFile: ./dump_data/ -│ - Add.fp32_vars_add_2fp32_vars_Relu_9.31.5.1636595794731103.input.0.npy │ -│ - Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.output.0.npy │ -│ - Add.fp32_vars_add_2fp32_vars_Relu_9.31.5.1636595794731103.input.1.npy │ -│ - Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.1.npy │ -│ - Add.fp32_vars_add_3fp32_vars_Relu_12.40.5.1636595794846124.input.1.npy │ -│ - Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.0.npy │ -│ - Add.fp32_vars_add_3fp32_vars_Relu_12.40.5.1636595794846124.input.0.npy │ -│ - Add.fp32_vars_add_2fp32_vars_Relu_9.31.5.1636595794731103.output.0.npy │ -│ - Add.fp32_vars_add_3fp32_vars_Relu_12.40.5.1636595794846124.output.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -**示例2** - -```bash -# 查看某个dump数据块的数据信息 -# 默认会将数据中的tensor保存成 txt -Parse >>> pt -n ./parse_data/dump_convert/Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.output.0.npy -...... -# 打印统计信息 -[Shape: (1, 16, 56, 56, 16)] [Dtype: float16] [Max: 452.0] [Min: -408.5] [Mean: -3.809] -Path: ./parse_data/dump_convert/Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.0.npy -TextFile:./parse_data/dump_convert/Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.0.npy.txt -``` - -### pkl文件中指定API的dump数据信息查看 - -输入以下命令,解析并输出pkl文件中指定api的统计信息。 - -```bash -pk -f pkl_path -n api_name -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------- | -------- | -| -f | 指定pkl文件路径。 | 是 | -| -n | 指定API名称。 | 是 | - -- 输出结果:打印统计信息(shape, dtype, max和min mean)。 -- 若pkl文件中存在相应的堆栈信息,则会打印堆栈信息。 - -**示例** - -```bash -# 传入pkl文件及api名称 -Parse >>> pk -f ./torch_dump/ptdbg_v3.2/rank0/api_stack_dump.pkl -n Functional_conv2d_0_forward -...... 
-# 打印统计信息及堆栈(pkl文件不包含堆栈则不会打印堆栈) - -Statistic Info: - [Functional_conv2d_0_forward_input.0][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 1.576936960220337][min: -0.9757485389709473][mean: 0.4961632490158081] - [Functional_conv2d_0_forward_input.1][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 0.20064473152160645][min: -0.47102075815200806][mean: -0.20796933770179749] - [Functional_conv2d_0_forward_input.2][dtype: torch.float32][shape: [2]][max: 0.17380613088607788][min: -0.16853803396224976][mean: 0.0026340484619140625] - [Functional_conv2d_0_forward_output][dtype: torch.float32][shape: [2, 2, 1, 1]][max: 0.02364911139011383][min: -1.762906551361084][mean: -0.6710853576660156] -``` - -### API可选层级比对 - -输入以下命令, 进行统计级和像素级比对。 - -```bash -cn -m my_data*.npy -g gloden*.npy [-p num] [-al atol] [-rl rtol] -``` - -- 统计级比对:对tensor整体进行余弦值及相对误差的计算。 -- 像素级比对:对输入的两个npy文件进行逐元素比对。若两个tensor对应元素的相对误差或绝对误差大于**误差阈值**(-al和-rl配置)则被标记为错误数据。 - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------------------------------------- | -------- | -| -m | 待比对数据。 | 是 | -| -g | 标杆数据。 | 是 | -| -p | 设置比对结束后打印错误元素的个数,默认值20。 | 否 | -| -al | 判定数据存在精度问题的绝对误差阈值,默认0.001。 | 否 | -| -rl | 判定数据存在精度问题的相对误差阈值,默认0.001。 | 否 | -| -s | 将npy文件保存成txt文件,用于查看,默认开启。 | 否 | - -输出结果: - -- 统计级比对结果。 -- 两个文件的统计信息(shape, dtype, max, min和mean)。 -- 错误数据打印表格。 - -**示例** - -```bash -# 对比两个tensor的数据 -Parse >>> cn -m Add.InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.323.1619494134703053.output.0.npy -g InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.0.1619492699305998.npy -p 10 -s -al 0.002 -rl 0.005 - Error Item Table Top Item Table -┏━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┏━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ -┃ Index ┃ Left ┃ Right ┃ Diff ┃ ┃ Index ┃ Left ┃ Right ┃ Diff ┃ -┡━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ ┡━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ -│ 155 │ 0.024600908 │ 0.022271132 │ 0.002329776 │ │ 0 │ -0.9206961 │ -0.9222216 │ 0.0015255213 │ -│ 247 │ 0.015752593 │ 0.017937578 │ 0.0021849852 │ │ 1 │ -0.6416973 │ -0.64051837 │ 0.0011789203 │ -│ 282 │ -0.0101207765 │ -0.007852031 │ 0.0022687456 │ │ 2 │ -0.35383835 │ -0.35433492 │ 0.0004965663 │ -│ 292 │ 0.019581757 │ 0.02240482 │ 0.0028230622 │ │ 3 │ -0.18851271 │ -0.18883198 │ 0.00031927228 │ -│ 640 │ -0.06593232 │ -0.06874806 │ 0.0028157383 │ │ 4 │ -0.43508735 │ -0.43534422 │ 0.00025686622 │ -│ 1420 │ 0.09293677 │ 0.09586689 │ 0.0029301196 │ │ 5 │ 1.4447614 │ 1.4466647 │ 0.0019032955 │ -│ 1462 │ -0.085207745 │ -0.088047795 │ 0.0028400496 │ │ 6 │ -0.3455438 │ -0.3444429 │ 0.0011008978 │ -│ 1891 │ -0.03433288 │ -0.036525503 │ 0.002192624 │ │ 7 │ -0.6560242 │ -0.6564579 │ 0.0004336834 │ -│ 2033 │ 0.06828873 │ 0.07139922 │ 0.0031104907 │ │ 8 │ -2.6964858 │ -2.6975214 │ 0.0010356903 │ -│ 2246 │ -0.06376442 │ -0.06121233 │ 0.002552092 │ │ 9 │ -0.73746175 │ -0.73650354 │ 0.00095820427 │ -└───────┴───────────────┴──────────────┴──────────────┘ └───────┴─────────────┴─────────────┴───────────────┘ -╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ Left: | -│ |- NpyFile: ./dump/temp/decode/Add.InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.323.1619494134703053.output.0.npy | -│ |- TxtFile: ./dump/temp/decode/Add.InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.323.1619494134703053.output.0.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.846897] [Min: -8.368301] [Mean: -0.72565556] | -│ DstFile: │ -│ |- NpyFile: 
./dump/cpu/InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.0.1619492699305998.npy | -│ |- TxtFile: ./dump/cpu/InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.0.1619492699305998.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.8425903] [Min: -8.374472] [Mean: -0.7256237] │ -│ NumCnt: 655360 │ -│ AllClose: False │ -│ CosSim: 0.99999493 │ -│ ErrorPer: 0.023504638671875 (rl= 0.005, al= 0.002) │ -╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -## FAQ - -[FAQ](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md) diff --git "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T2.md" "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T2.md" deleted file mode 100644 index 56e5b64ebd041f09ded63072b377867651d6f80a..0000000000000000000000000000000000000000 --- "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T2.md" +++ /dev/null @@ -1,2203 +0,0 @@ -# **PyTorch精度工具使用指南** - -本文主要介绍PyTorch精度工具ptdbg_ascend的使用以及精度比对场景示例。 - -ptdbg_ascend工具的原理及安装请参见《[PyTorch精度工具](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 - -ptdbg_ascend工具主要支持PyTorch API精度数据dump、溢出检测、精度比对以及parse数据解析功能。其中dump和溢出检测功能支持使用debugger和register_hook方式进行精度数据的dump和溢出检测,推荐使用debugger方式。 - -## PyTorch精度比对总体流程 - -1. 准备CPU或GPU训练工程。 - -2. 在环境下安装ptdbg_ascend工具。 - -3. 在训练脚本内插入ptdbg_ascend工具dump接口。 - -4. 执行训练dump数据。 - -5. 将CPU或GPU训练工程迁移为NPU训练工程。 - - 请参见《[PyTorch模型迁移和训练指南](https://www.hiascend.com/document/detail/zh/canncommercial/63RC1/modeldevpt/ptmigr/ptmigr_0001.html)》。 - -6. 在NPU环境下安装ptdbg_ascend工具。 - -7. 在NPU训练脚本内插入ptdbg_ascend工具dump接口。 - -8. NPU环境下执行训练dump数据。 - -9. 创建并配置精度比对脚本,例如compare.py。 - -10. 执行CPU或GPU dump与NPU dump数据的精度比对。 - -11. 比对结果分析。 - -## 快速入门(debugger方式) - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**debugger方式dump和溢出检测**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 单卡场景精度比对 - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 - - 不推荐使用整网dump比对,若模型数据庞大(比如达到T级别),整网dump可能导致磁盘不足,需要预留足够的存储空间,或者分多次dump。 - -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 - -3. 范围比对:对不符合精度标准的API重新dump详细信息。 - -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 - -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 - -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -2. 
比对整网数据。 - - 第1步中的NPU dump数据目录为npu_dump,假设GPU dump数据目录为gpu_dump;dump将生成pkl数据文件api_stack_dump.pkl和npy数据目录api_stack_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice)。 - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch_batch_normal的堆栈信息及数据统计信息 - parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", "Torch_batch_normal_1_forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - - - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor_permute_1_forward"], acl_config='./dump.json') - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - debugger.configure_hook(mode="acl", scope=["Functional_conv2d_1_backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景 - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 
在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定前向API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional_conv2d_1_backward"], acl_config="./dump.json", backward_input=["./overflow_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in 'xxxxx.pkl'. - ``` - - 其中xxx times为用户设置的次数,xxxxx.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. 
SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## 场景化示例 - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**CPU或GPU及NPU精度数据dump**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 多卡场景精度比对 - -精度工具支持多卡场景的精度比对,多卡场景的dump步骤与单卡场景完全一致,请参见“**单卡场景精度比对**”章节,不同的是多卡数据精度比对时需要使用“compare_distributed”函数进行比对。 - -**大模型场景下dump推荐使用debugger方式的手动模式。** - -如下示例: - -说明:多机多卡场景需要每个设备单独执行比对操作。 - -假设NPU dump npy数据目录为npu_dump/ptdbg_dump_v4.0,GPU dump npy数据目录为gpu_dump/ptdbg_dump_v4.0。 - -1. 创建比对脚本,例如compare_distributed.py,拷贝如下代码。 - - ```python - from ptdbg_ascend import * - compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') - ``` - - dump数据目录须指定到step级。 - -2. 执行比对: - - ```bash - python3 compare_distributed.py - ``` - -两次运行须用相同数量的卡,传入`compare_distributed`的两个文件夹下须有相同个数的rank文件夹,且不包含其他无关文件,否则将无法比对。 - -**多卡set_dump_path注意事项** - -多卡一般为多进程,须保证每个进程都正确调用PrecisionDebugger或set_dump_path,或把PrecisionDebugger或set_dump_path插入到import语句后,如: - -```python -from ptdbg_ascend import * -debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) -``` - -或 - -```python -from ptdbg_ascend import * -seed_all() -set_dump_path('./dump_resnet') -``` - -如此可保证set_dump_path在每个进程都被调用。 - -**多卡register_hook注意事项** - -register_hook需要在set_dump_path之后调用,也需要在每个进程上被调用,建议在搬运模型数据到卡之后调用。识别方法如下: - -- 找到训练代码中遍历epoch的for循环或遍历数据集的for循环,把register_hook放到循环开始前即可。 -- 找到训练代码中调用DDP或者DistributedDataParallel的代码行,把register_hook放到该代码行所在的代码块之后。 -- 若代码中均无以上两种情况,需要保证register_hook在模型定义之后插入,并配置rank参数。rank参数获取rank_id请参见“**[rank_id获取方法](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/rank_id获取方法.md)**”。 - -### NPU vs NPU精度比对 - -对于NPU vs NPU场景,是针对同一模型,进行迭代(模型、API版本升级或设备硬件升级)时存在的精度下降问题,对比相同模型在迭代前后版本的API计算数值,进行问题定位。 - -一般情况下迭代涉及NPU自定义算子,因此,可以仅dump NPU自定义算子进行比对。比对精度问题分析请参见“**单卡场景精度比对**”章节。 - -工具当前支持dump NPU自定义算子如下: - -| 序号 | NPU自定义算子 | -| :--- | ----------------------------------------------- | -| 1 | torch_npu.one_ | -| 2 | torch_npu.npu_sort_v2 | -| 3 | torch_npu.npu_transpose | -| 4 | torch_npu.npu_broadcast | -| 5 | torch_npu.npu_dtype_cast | -| 6 | torch_npu.empty_with_format | -| 7 | torch_npu.npu_one_hot | -| 8 | torch_npu.npu_stride_add | -| 9 | torch_npu.npu_ps_roi_pooling | -| 10 | torch_npu.npu_roi_align | -| 11 | torch_npu.npu_nms_v4 | -| 12 | torch_npu.npu_iou | -| 13 | torch_npu.npu_nms_with_mask | -| 14 | torch_npu.npu_pad | -| 15 | torch_npu.npu_bounding_box_encode | -| 16 | torch_npu.npu_bounding_box_decode | -| 17 | torch_npu.npu_batch_nms | -| 18 | torch_npu.npu_slice | -| 19 | torch_npu._npu_dropout | -| 20 | torch_npu.npu_indexing | -| 21 | torch_npu.npu_ifmr | -| 22 | torch_npu.npu_max | -| 23 | torch_npu.npu_scatter | -| 24 | torch_npu.npu_layer_norm_eval | -| 25 | torch_npu.npu_alloc_float_status | -| 26 | torch_npu.npu_confusion_transpose | -| 27 | torch_npu.npu_bmmV2 | -| 28 | torch_npu.fast_gelu | -| 29 | torch_npu.npu_sub_sample | -| 30 | torch_npu.npu_deformable_conv2d | -| 31 | torch_npu.npu_mish | -| 32 | torch_npu.npu_anchor_response_flags | -| 33 | torch_npu.npu_yolo_boxes_encode | -| 34 | torch_npu.npu_grid_assign_positive | -| 35 | torch_npu.npu_normalize_batch | -| 36 | torch_npu.npu_masked_fill_range | -| 37 | torch_npu.npu_linear | -| 38 | torch_npu.npu_bert_apply_adam | -| 39 | torch_npu.npu_giou | -| 40 | torch_npu.npu_ciou | -| 41 | torch_npu.npu_diou | -| 42 | torch_npu.npu_sign_bits_pack | -| 43 | torch_npu.npu_sign_bits_unpack | -| 44 | 
torch_npu.npu_flash_attention | -| 45 | torch_npu.npu_scaled_masked_softmax | -| 46 | torch_npu.npu_rotary_mul | -| 47 | torch_npu.npu_roi_align | -| 48 | torch_npu.npu_roi_alignbk | -| 49 | torch_npu.npu_ptiou | -| 50 | torch_npu.npu_fusion_attention | -| 51 | torch_npu.npu_dropout_with_add_softmax | -| 52 | torch_npu.npu_random_choice_with_mask | -| 53 | torch_npu.npu_rotated_iou | -| 54 | torch_npu.npu_conv2d | -| 55 | torch_npu.npu_conv3d | -| 56 | torch_npu.npu_softmax_cross_entropy_with_logits | -| 57 | torch_npu.npu_all_gather_base_mm | -| 58 | torch_npu.npu_swiglu | -| 59 | torch_npu.npu_rms_norm | -| 60 | torch_npu.npu_mm_reduce_scatter_base | -| 61 | torch_npu.npu_mm_all_reduce_base | -| 62 | torch_npu.npu_conv_transpose2d | -| 63 | torch_npu.npu_convolution | -| 64 | torch_npu.npu_convolution_transpose | -| 65 | torch_npu.npu_min | -| 66 | torch_npu.npu_nms_rotated | -| 67 | torch_npu.npu_reshape | -| 68 | torch_npu.npu_rotated_box_decode | -| 69 | torch_npu.npu_rotated_box_encode | -| 70 | torch_npu.npu_rotated_overlaps | -| 71 | torch_npu.npu_silu | -| 72 | torch_npu.npu_fused_attention_score | -| 73 | torch_npu.npu_multi_head_attention | -| 74 | torch_npu.npu_gru | -| 75 | torch_npu.npu_incre_flash_attention | -| 76 | torch_npu.npu_prompt_flash_attention | -| 77 | torch_npu.npu_lstm | -| 78 | torch_npu.npu_apply_adam | - -### 通信API的数据dump - -通信类API数据可以使用全量dump方式获取,若只dump通信类API数据,可以使用如下示例: - -```python -debugger.configure_hook(mode="api_list", api_list=["distributed"]) -``` - -或 - -```python -set_dump_switch("ON", mode="api_list", api_list=["distributed"]) -``` - -通信类API支持列表: - -| 序号 | Distributed | -| :--- | -------------------- | -| 1 | send | -| 2 | recv | -| 3 | broadcast | -| 4 | all_reduce | -| 5 | reduce | -| 6 | all_gather | -| 7 | gather | -| 8 | isend | -| 9 | irecv | -| 10 | scatter | -| 11 | reduce_scatter | -| 12 | _reduce_scatter_base | -| 13 | _all_gather_base | - -### 单卡场景精度比对(register_hook方式) - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 -3. 范围比对:对不符合精度标准的API重新dump。 -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - - # 在main函数开始前固定随机数 - seed_all() - - # 配置dump数据目录路径和名称 - set_dump_path("./npu_dump", dump_tag='all') - - # 注册dump回调函数 - register_hook(model, acc_cmp_dump) - - ... - - # 在第一个迭代开始的位置开启dump和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="api_stack", filter_switch="OFF") - - ... - - # 在第一个迭代结束的位置关闭dump - set_dump_switch("OFF") - ``` - -2. 比对整网数据。 - - 第1步中的NPU dump数据文件为npu_dump.pkl,假设NPU dump npy数据目录为npu_dump,GPU dump数据文件为gpu_dump.pkl,GPU dump npy数据目录为gpu_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 
根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice) - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch_batch_normal的堆栈信息及数据统计信息 - parse("./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", "Torch_batch_normal_1_forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - - - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='forward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定前向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Tensor_permute_1_forward"], filter_switch="OFF") - - ... - - set_dump_switch("OFF") - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='backward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"], filter_switch="OFF") - set_backward_input(["./npu_dump/all_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - - ... - - set_dump_switch("OFF") - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景(register_hook方式) - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check, overflow_nums=3) - - ... - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # dump指定API的ACL级别溢出数据 - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - - # 在期望溢出检测的step位置开始前打开溢出检测开关 - set_overflow_check_switch("ON") - - ... - - # 在step结束的位置关闭溢出检测开关 - set_overflow_check_switch("OFF") - - ... - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check) - - ... - ``` - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... 
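-      # 说明:此处复用dump钩子(acc_cmp_dump)并开启acl模式,对检测到溢出的反向API重新落盘ACL级别数据;
-      # backward_input为此前dump得到的该反向API输入npy文件(参见set_backward_input说明)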
- # dump指定反向API的ACL级别溢出数据 - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"]) - set_backward_input(["./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump/Functional_conv2d_1_backward_input.0.npy"]) - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in 'xxxxx.pkl'. - ``` - - 其中xxx times为用户设置的次数,xxxxx.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过 [Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## debugger方式dump和溢出检测(推荐) - -### PrecisionDebugger模块 - -**功能说明** - -PrecisionDebugger模块包含dump和溢出检测功能的总体配置项。可以指定dump目录,设置dump或溢出检测功能,指定dump的卡和迭代。 - -可以在from ptdbg_ascend import *和模型初始化之间的任意位置添加该模块。 - -**原型** - -```python -PrecisionDebugger(dump_path=None, hook_name=None, rank=None, step=[], enable_dataloader=False, model=None): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| dump_path | 设置dump数据目录路径,参数示例:"./dump_path"。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。<br>当**configure_hook**函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录名称添加mode参数值为前缀,详情请参见“**dump数据存盘说明**”。<br>
未配置dump_path时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时dump数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,dump_path和环境变量需要二选一。 | 否 | -| hook_name | dump模式,可取值"dump"和"overflow_check",表示dump和溢出检测功能,二选一。参数示例:hook_name="dump"。数据类型:str。 | 是 | -| rank | 指定对某张卡上的数据进行dump或溢出检测,默认未配置(表示dump所有卡的数据),须根据实际卡的Rank ID配置。应配置为大于0的正整数,且须根据实际卡的Rank ID配置,若所配置的值大于实际训练所运行的卡的Rank ID,则dump数据为空,比如当前环境Rank ID为0到7,实际训练运行0到3卡,此时若配置Rank ID为4或不存在的10等其他值,此时dump数据为空。数据类型:int。 | 否 | -| step | 指定dump某个step的数据,默认未配置,表示dump所有step数据。dump特定step时,须指定为训练脚本中存在的step。step为list格式,可配置逐个step,例如:step=[0,1,2];也可以配置step范围,例如:step=list(range(0,9)),表示dump第0到第8个step。数据类型:List[int]。 | 否 | -| enable_dataloader | 自动控制开关,可取值True(开启)或False(关闭),默认为False。配置为True后自动识别dump step参数指定的迭代,并在该迭代执行完成后退出训练,此时start和stop函数可不配置,开启该开关要求训练脚本是通过torch.utils.data.dataloader方式加载数据;配置为False则需要配置start和stop函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()。数据类型:bool。 | 否 | -| model | 开启init dump模式,传入网络模型实例化的对象,配置该参数后,dump操作仅dump网络中init方法里调用的方法(nn.Module类),不会对所有API进行dump。参数示例: model=net,net为网络模型实例化的对象名称。默认未配置。
配置该参数时,PrecisionDebugger模块请在模型实例化之后调用。数据类型:torch.nn.Module。
该模式不支持“溢出检测”、”ACL级别数据dump“和“模块级精度数据dump”。此模式下dump文件名前缀为网络中定义的模块名或层名。 | 否 | - -#### init dump模式示例代码和数据落盘说明 - -**示例代码** - -```python -import os -import torch -import torch.nn as nn -import torch_npu -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") - - -class Net(nn.Module): - - def __init__(self): - super(Net, self).__init__() - self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2) - self.relu1 = nn.ReLU() - self.bn1 = nn.BatchNorm2d(16) - - def forward(self, x): - x = self.conv1(x) - x = self.bn1(x) - output = self.relu1(x) - return output - -if __name__ == "__main__": - net = Net().npu() - # model参数传入net, 开启init dump 功能 - debugger = PrecisionDebugger(dump_path="./dump", hook_name="dump", model=net) - debugger.configure_hook(mode="api_stack") - debugger.start() - x = torch.randn(1, 1, 28, 28).npu() - out = net(x) - loss = out.sum() - loss.backward() - debugger.stop() -``` - -**落盘数据说明** - -该模式下dump数据命名格式为:`{Layer_name}_{Module_name}_{call_num}_{forward/backward}_{input/output}.npy` - -``` -# 按照上述用例代码进行dump,落盘数据命名示例如下: -conv1_Conv2d_0_forward_input.0.npy -conv1_Conv2d_0_forward_output.npy -relu1_ReLU_0_forward_input.0.npy -....... -bn1_BatchNorm2d_0_backward_output.2.npy -``` - -### configure_hook函数(可选) - -**功能说明** - -设置dump范围。 - -建议在**PrecisionDebugger**模块与模型初始化之间的任意位置添加,不添加此函数时默认使用mode="api_stack" dump整网数据。 - -**原型** - -dump: - -```python -debugger.configure_hook(mode="api_stack", scope=[], api_list=[], filter_switch="OFF", acl_config=None, backward_input=[], input_output_mode=["all"], summary_only=False, summary_mode="all") -``` - -溢出检测: - -```python -debugger.configure_hook(mode=None, acl_config=None, overflow_nums=1, need_replicate=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"api_stack"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围,mode="api_list"时,需要配置api_list=[],其他模式有需要时配置scope=[]。参数示例:scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"(表示开启过滤,即不dump)或"OFF"(表示关闭过滤)。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| acl_config | acl dump的配置文件。mode="acl"时,该参数必选;mode为其他值时,该参数不选。参数示例:acl_config='./dump.json'。dump.json配置文件详细介绍请参见“**dump.json配置文件说明**”。数据类型:str。 | 否 | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional_conv2d_1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional_conv2d_1、backward和input字段的.npy文件。数据类型:str。 | 否 | -| input_output_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例input_output_mode=["backward"]或input_output_mode=["forward", "backward"]。默认为["all"],即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:list。 | 否 | -| summary_only | dump npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | -| summary_mode | 控制dump文件输出的模式,可取值md5(dump仅输出包含md5值的pkl文件,用于验证数据的完整性)、summary(dump仅输出包含API统计信息的pkl文件)、all(dump输出包含API统计信息的pkl文件以及具体的npy文件),参数示例:summary_mode="md5",默认为"all"。summary_only=True时,不允许配置该参数。数据类型:str。 | 否 | -| overflow_nums | 
控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| need_replicate | 过程dump数据生成开关,执行溢出检测时,dump目录下会生成forward_real_data和backward_real_data的过程dump数据目录,可取值True(生成)或False(不生成),默认不生成。数据类型:bool。 | 否 | - -**函数示例** - -configure_hook可配置多种dump模式,示例如下: - -说明: - -以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -以下仅为该函数配置示例,完整代码请参见“**示例代码**”章节。 - -- 示例1:dump指定API列表 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="list", scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward", "Torch_relu_3_backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="range", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="stack", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor_permute_1_forward"], acl_config="./dump.json") - ``` - -- 示例5:dump指定反向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional_conv2d_1_backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional_conv2d_1_backward_input.0.npy"]) - ``` - -- 示例6:dump指定某一类API的API级别输入输出数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例7:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例8: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例9:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(input_output_mode=["backward"]) - ``` - -- 示例10:仅dump pkl文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(summary_only=True) - ``` - -- 示例11:溢出检测dump - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=1) - ``` - - dump执行时会在**PrecisionDebugger**模块的dump_path参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例11:dump溢出API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### start函数(可选) - -**功能说明** - 
-dump或溢出检测启动函数。 - -在模型初始化之后的任意位置添加。 - -**原型** - -```python -debugger.start() -``` - -该函数为类函数,可以使用debugger.start()也可以使用PrecisionDebugger.start()。 - -### stop函数(可选) - -**功能说明** - -dump或溢出检测停止函数。 - -在**start**函数之后的任意位置添加。 - -**原型** - -```python -debugger.stop() -``` - -该函数为类函数,可以使用debugger.stop()也可以使用PrecisionDebugger.stop()。 - -### 示例代码(自动模式) - -**需要保证用户训练代码是通过torch.utils.data.dataloader方式加载数据。** - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -### 示例代码(手动模式) - -一般情况下使用自动模式可以快速方便进行dump操作,但个别大模型可能在部分卡的训练操作中没有调用dataloader,这会导致自动模式无法dump指定迭代的数据,此时需要关闭自动模式手动在迭代前后插入start()和stop()函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()以标识dump结束。 - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -## register_hook方式dump和溢出检测 - -### 总体说明 - -- 本节主要介绍CPU或GPU及NPU精度数据dump和溢出检测所需要的函数以及示例。 - -- ptdbg_ascend工具默认情况下仅dump PyTorch模型的API输入输出数据进行精度比对,若在比对结果中发现某个API下可能存在ACL的精度问题,那么可以选择dump该API的ACL级别数据进行精度分析。 - -- 某些torch api的输出不是Tensor类型的数据。对于此类API的反向过程进行ACL dump,工具会在运行日志中给出对应的Warning(is not of tensor type and cannot be automatically derived)提示。如若想要进行该类API反向ACL dump,可以通过手动构建单API用例的方式进行ACL dump,具体用例可参见“**[反向ACL dump用例说明](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/%E5%8F%8D%E5%90%91ACL%20dump%E7%94%A8%E4%BE%8B%E8%AF%B4%E6%98%8E.md)**”。 - -- 工具性能:dump数据量较小时(小于5G),参考dump速度0.1GB/s;dump数据量较大时,参考dump速度0.2GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。Dump速度的计算方式:Dump数据量/(单个step添加Dump耗时-原始单个step耗时)。 - -### 约束 -- 进行CPU或GPU数据dump时,请安装torch包而非torch_npu包,避免工具无法识别使用场景,导致失败。 - -- TASK_QUEUE_ENABLE环境变量会导致API下发和执行异步进行,因此在ACL dump前需要将TASK_QUEUE_ENABLE关闭,即export TASK_QUEUE_ENABLE=0。 - -- 不建议在PyTorch训练脚本中同时添加dump接口和性能数据采集(如Ascend PyThon Profiler)接口,二者可能相互影响导致数据不准确。 - -### seed_all - -**功能说明** - -固定随机数。通过固定随机数保证模型的输入或输出一致。在训练主函数开始前调用,避免随机数固定不全。 - -使用form ptdbg import *后自动导入该函数,代码无需再次添加,若需要修改随机数种子和确定性计算模式,则需要通过添加该函数修改。 - -**函数原型** - -```python -seed_all(seed=1234, mode=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------ | ------------------------------------------------------------ | -------- | -| seed | 随机数种子。参数示例:seed=1000。默认值为:1234。数据类型:int。 | 否 | -| mode | 确定性计算模式。可配置True或False。参数示例:mode=True。默认为False。数据类型:bool。
即使在相同的硬件和输入下,API多次执行的结果也可能不同,开启确定性计算是为了保证在相同的硬件和输入下,API多次执行的结果相同。<br>确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启。<br>
rnn类算子、ReduceSum、ReduceMean等算子可能与确定性计算存在冲突,若开启确定性计算后多次执行的结果不相同,则考虑存在这些算子。 | 否 | - -**函数示例** - -seed_all函数的随机数种子,取默认值即可,无须配置;第二个参数默认关闭,不开启确定性计算时也无须配置。 - -- 示例1:仅固定随机数,不开启确定性计算 - - ```python - seed_all() - ``` - -- 示例2:固定随机数,开启确定性计算 - - ```python - seed_all(mode=True) - ``` - -**固定随机数范围** - -seed_all函数可固定随机数的范围如下表。 - -| API | 固定随机数 | -| ---------------------------------------- | --------------------------- | -| os.environ['PYTHONHASHSEED'] = str(seed) | 禁止Python中的hash随机化 | -| random.seed(seed) | 设置random随机生成器的种子 | -| np.random.seed(seed) | 设置numpy中随机生成器的种子 | -| torch.manual_seed(seed) | 设置当前CPU的随机种子 | -| torch.cuda.manual_seed(seed) | 设置当前GPU的随机种子 | -| torch.cuda.manual_seed_all(seed) | 设置所有GPU的随机种子 | -| torch_npu.npu.manual_seed(seed) | 设置当前NPU的随机种子 | -| torch_npu.npu.manual_seed_all(seed) | 设置所有NPU的随机种子 | -| torch.backends.cudnn.enable=False | 关闭cuDNN | -| torch.backends.cudnn.benchmark=False | cuDNN确定性地选择算法 | -| torch.backends.cudnn.deterministic=True | cuDNN仅使用确定性的卷积算法 | - -需要保证CPU或GPU以及NPU的模型输入完全一致,dump数据的比对才有意义,seed_all并不能保证模型输入完全一致,如下表所示场景需要保证输入的一致性。 - -| 场景 | 固定方法 | -| --------------- | ------------- | -| 数据集的shuffle | 关闭shuffle。 | -| dropout | 关闭dropout。 | - -关闭shuffle示例: - -```python -train_loader = torch.utils.data.DataLoader( - train_dataset, - batch_size = batch_size, - shuffle = False, - num_workers = num_workers -) -``` - -关闭dropout: - -在使用from ptdbg import *后,工具会自动将torch.nn.functional.dropout、torch.nn.functional.dropout2d、torch.nn.functional.dropout3d、torch.nn.Dropout、torch.nn.Dropout2d、torch.nn.Dropout3d的接口参数p置为0。 - -### set_dump_path - -**功能说明** - -设置数据保存目录。建议在seed_all函数之后调用且需要保证训练进程能够调用该函数;多卡时须保证每个进程都能调用该函数。 - -**函数原型** - -```python -set_dump_path(fpath=None, dump_tag='ptdbg_dump') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| fpath | 设置数据目录路径。参数示例:'./dump_path'。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。<br>当set_dump_switch函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录名称添加mode参数值为前缀,详情请参见“**dump数据存盘说明**”。<br>
未配置fpath时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,fpath和环境变量需要二选一。 | 否 | -| dump_tag | 设置数据目录名称。参数示例:dump_tag='dump_conv2d'。默认数据目录命名为ptdbg_dump_{version}。数据类型:str。
{version}为当前安装ptdbg_ascend工具版本。目录结构参见“**dump数据存盘说明**”。
配置该参数会将生成的`ptdbg_dump_{version}`目录名称变更为dump_tag配置的值,如`dump_conv2d_{version}`。 | 否 | - -**函数示例** - -- 示例1:设置数据目录路径 - - ```python - set_dump_path('./dump_path') - ``` - -- 示例2:设置数据目录名称 - - ```python - set_dump_path('./dump_path', dump_tag='dump_conv2d') - ``` - - -若以相同的数据目录多次dump,则会因同名导致覆盖;多次dump建议配置不同的dump_tag。 - -### register_hook - -**功能说明** - -注册工具钩子函数。在set_dump_path之后调用。 - -dump操作必选。 - -**函数原型** - -```python -register_hook(model, hook, overflow_nums=overflow_nums, dump_mode=dump_mode, dump_config=dump_config_file) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| model | 传入网络模型实例化的对象。参数示例: model=net,net为网络模型实例化的对象名称。数据类型:torch.nn.Module。 | 是 | -| hook | 注册工具的dump和溢出检测钩子。可取值overflow_check(表示溢出检测)和acc_cmp_dump(表示dump数据),二选一。数据类型:Callable。 | 是 | -| overflow_nums | 控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| dump_mode | 控制针对溢出API的dump模式,可取值"acl"或"api"。配置acl时,表示dump ACL级别的溢出数据,此时set_dump_path参数不生效,dump数据目录由dump_config的.json文件配置。参数示例:dump_mode="acl"。默认不配置,即dump API级别的溢出数据。数据类型:str。 | 否 | -| dump_config | acl dump的配置文件。dump_mode="acl"时,该参数必选;dump_mode="api"时,该参数不选。参数示例:dump_config='./dump.json'。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:注册工具钩子函数 - - ```python - register_hook(model, acc_cmp_dump) - ``` - -- 示例2:dump指定API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - ``` - - 需要配置set_dump_switch的mode="acl"以及scope指定为前向或反向API,请参见“**set_dump_switch”**的示例。 - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置dump数据目录。 - -- 示例3:溢出检测dump - - ```python - register_hook(model, overflow_check, overflow_nums=3) - ``` - - dump执行时会在set_dump_path的fpath参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例4:dump指定API的ACL级别溢出数据 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### set_dump_switch - -**功能说明** - -设置dump范围。建议在register_hook函数之后的脚本内任意位置插入,但进行精度问题排查建议参照“场景化示例 > 单卡场景精度比对”章节的顺序,先从第一个迭代开始的位置调用并dump整网数据。 - -dump操作必选。 - -**函数原型** - -```python -def set_dump_switch(switch, mode="all", scope=[], api_list=[], filter_switch="OFF", dump_mode=["all"], summary_only=False): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| --------------- | ------------------------------------------------------------ | -------- | -| switch | dump开关。可取值"ON"或"OFF"。须在选定dump开始的位置配置set_dump_switch("ON");dump结束的位置设置set_dump_switch("OFF")。数据类型:str。 | 是 | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"all"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围。参数示例:scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| dump_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例dump_mode=["backward"]或dump_mode=["forward", "backward"]。默认为all,即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:List[str]。 | 否 | -| summary_only | dump 
npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | - -**推荐配置** - -```python -set_dump_switch("ON", mode="api_stack", filter_switch="OFF") -``` - -开启dump数据和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量。 - -**函数示例** - -set_dump_switch可配置多种dump模式,示例如下: - -说明:以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -- 示例1:dump指定API列表 - - ```python - set_dump_switch("ON", mode="list", scope=["Tensor_permute_1_forward", "Tensor_transpose_2_forward", "Torch_relu_3_backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - set_dump_switch("ON", mode="range", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - set_dump_switch("ON", mode="stack", scope=["Tensor_abs_1_forward", "Tensor_transpose_3_forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Tensor_permute_1_forward"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件。 - -- 示例4:dump指定反向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"]) - set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional_conv2d_1_backward_input.0.npy"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件,并通过set_backward_input设置反向API输入的.npy文件。 - -- 示例5:dump指定某一类API的API级别输入输出数据 - - ```python - set_dump_switch("ON", mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例6:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - set_dump_switch("ON", mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例7: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - set_dump_switch("ON", filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例8:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - set_dump_switch("ON", dump_mode=["backward"]) - ``` - -- 示例9:仅dump pkl文件 - - ```python - set_dump_switch("ON", summary_only=True) - ``` - -以上示例均需要在结束dump的位置插入set_dump_switch("OFF")。 - -set_dump_switch配置mode为all或api_stack时,结束dump后,在dump目录下会自动生成compare_data.py比对脚本模板,示例如下: - -```python -from ptdbg_ascend import compare - -pkl_path = "%s" -dump_data_dir = "%s" - -dump_path_param = { - "npu_pkl_path": , - "bench_pkl_path": , - "npu_dump_data_dir": , - "bench_dump_data_dir": , - "is_print_compare_log": True -} - -compare(dump_path_param, output_path="", stack_mode="%s") -``` - -pkl_path和dump_data_dir字段会自动识别pkl和dump目录的路径,用户需要判断当前dump的环境是NPU、CPU或GPU,并将pkl_path和dump_data_dir字段填入下方dump_path_param函数对应的字段中,例如当前设备为NPU,那么填写方式如下: - -```python -from ptdbg_ascend import compare - -pkl_path = "%s" -dump_data_dir = "%s" - -dump_path_param = { - "npu_pkl_path": pkl_path, - "bench_pkl_path": , - "npu_dump_data_dir": dump_data_dir, - "bench_dump_data_dir": , - "is_print_compare_log": True -} - -compare(dump_path_param, output_path="", stack_mode="%s") -``` - -此时,另一侧数据的路径,需要用户另外识别并填入。 - -### set_overflow_check_switch - -**功能说明** - -置溢出检测范围。默认不配置该函数,全量进行溢出检测。 - -仅支持NPU环境。 - -**函数原型** - -```python -set_overflow_check_switch(switch, filter_switch='OFF') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| switch, | 
检测开关。可取值"ON"或"OFF"。如果只在特定的step溢出检测,则在期望溢出检测的step位置开始前插入set_overflow_check_switch("ON"),在step结束的位置插入set_overflow_check_switch("OFF")。数据类型:str。 | 是 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:指定范围溢出检测 - - ```python - register_hook(model, overflow_check) - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,dump执行时会在当前目录自动生成ptdbg_dump_{version}目录,保存溢出数据。 - -- 示例2:前向API的ACL级别范围溢出检测 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置溢出数据目录。 - -### set_backward_input - -**功能说明** - -设置反向ACL级别dump时需要的反向输入的.npy文件。 - -**函数原型** - -```python -set_backward_input(backward_input) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional_conv2d_1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional_conv2d_1、backward和input字段的.npy文件。数据类型:str。 | 是 | - -**函数示例** - -```python -register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') -set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"]) -set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional_conv2d_1_backward_input.0.npy"]) -``` - -## dump.json配置文件说明 - -**dump.json配置示例** - -```python -{ - "dump": - { - "dump_list":[], - "dump_path":"./dump/output", - "dump_mode":"all", - "dump_op_switch":"on" - } -} -``` - -**dump.json参数说明** - -| 字段名 | 说明 | -| -------------- | ------------------------------------------------------------ | -| dump_list | 待dump数据的API模型。为空,无需配置。 | -| dump_path | dump数据文件存储到运行环境的目录,主要用于指定ACL dump数据路径。支持配置绝对路径或相对路径。dump_path须为已存在目录。 | -| dump_mode | dump数据模式,配置如下:
- output:dump API的输出数据。默认值。<br>- input:dump API的输入数据。<br>
- all:dump API的输入、输出数据。 | -| dump_op_switch | 单API模型dump数据开关,配置如下: * off:关闭单API模型dump,默认值。 * on:开启单API模型dump。 | - -**dump目录说明** - -配置register_hook的dump_config后,采集的dump数据会在{dump_path}/{time}/{deviceid}/{model_id}目录下生成,例如“/home/HwHiAiUser/output/20200808163566/0/0” - -```bash -├── 20230131172437 -│   └── 1 -│   ├── 0 -│   │   ├── Add.Add.45.0.1675157077183551 -│   │   ├── Cast.trans_Cast_0.31.0.1675157077159449 -│   │   ├── Cast.trans_Cast_5.43.0.1675157077180129 -│   │   ├── MatMul.MatMul.39.0.1675157077172961 -│   │   ├── Mul.Mul.29.0.1675157077155731 -│   │   ├── NPUAllocFloatStatus.NPUAllocFloatStatus.24.0.1675157077145262 -│   │   ├── TransData.trans_TransData_1.33.0.1675157077162791 -│   │   └── TransData.trans_TransData_4.41.0.1675157077176648 -│   ├── 1701737061 -│   │   └── Cast.trans_Cast_2.35.0.1675157077166214 -│   ├── 25 -│   │   └── NPUClearFloatStatus.NPUClearFloatStatus.26.0.1675157077150342 -│   └── 68 -│   └── TransData.trans_TransData_3.37.0.1675157077169473 -``` - -## 模块级精度数据dump - -### 总体说明 - -大模型场景下,通常不是简单的利用自动迁移能力实现GPU到NPU的训练脚本迁移,而是会对NPU网络进行一系列针对性的适配,因此,常常会造成迁移后的NPU模型存在部分子结构不能与GPU原始模型完全对应。模型结构不一致导致API调用类型及数量不一致,若直接按照API粒度进行精度数据dump和比对,则无法完全比对所有的API。 - -本节介绍的功能是对模型中的大粒度模块进行数据dump,使其比对时,对于无法以API粒度比对的模块可以直接以模块粒度进行比对。 - -模块指的是继承自nn.Module类模块,通常情况下这类模块就是一个小模型,可以被视为一个整体,dump数据时以模块为粒度进行dump。 - -### module_dump - -**功能说明** - -开启模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump(module, module_name) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------- | ------------------------------------------------------------ | -------- | -| module | 网络中实例化好的nn.Module类对象。数据类型:torch.nn.Module。 | 是 | -| module_name | 用户自定义的该model名称。主要用于dump数据文件的命名,便于在比对时识别模块级数据。数据类型:str。 | 是 | - -### module_dump_end - -**功能说明** - -结束模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump_end() -``` - -### 示例代码 - -```python -# 根据需要import包 -import os -import torch -import torch.nn as nn -import torch_npu -import torch.nn.functional as F -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") -# 定义一个简单的网络 -class ModuleOP(nn.Module): - def __init__(self) -> None: - super().__init__() - self.linear_1 = nn.Linear(in_features=8, out_features=4) - self.linear_2 = nn.Linear(in_features=4, out_features=2) - def forward(self, x): - x1 = self.linear_1(x) - x2 = self.linear_2(x1) - r1 = F.relu(x2) - return r1 - -if __name__ == "__main__": - module = ModuleOP() - - # 注册工具 - pdbg = PrecisionDebugger("./dump_data/npu", hook_name="dump") - pdbg.start() - - x = torch.randn(10, 8) - module_dump(module, "MyModuleOP") # 开启模块级精度数据dump - out = module(x) - module_dump_end() # 结束模块级精度数据dump - loss = out.sum() - loss.backward() - pdbg.stop() -``` - -## dump数据存盘说明 - -dump结果目录结构示例如下: - -```bash -├── dump_path -│ └── ptdbg_dump_{version} -│ ├── step0 -│ | ├── rank0 -│ | │ ├── dump -| | | | ├── Tensor_permute_1_forward.npy -| | | | ├── MyModule_0_forward_input.npy # 开启模块级精度数据dump时存在模块级的dump数据文件 -| | | | ... -| | | | └── Fcuntion_linear_5_backward_output.npy -│ | │ └── dump.pkl -│ | ├── rank1 -| | | ├── dump -| | | | └── ... -| | | └── dump.pkl -│ | ├── ... -│ | | -| | └── rank7 -│ ├── step1 -│ | ├── ... 
-│ ├── step2 -``` - -dump过程中,npy文件在对应算子或者模块被执行后就会落盘,而pkl文件则需要在正常执行PrecisionDebugger.stop()或set_dump_switch("OFF")后才会被落盘保存,异常的程序终止会保存终止前被执行算子的相关npy文件,但是不会生成pkl文件。 - -其中`ptdbg_dump_{version}`为默认命名,debugger方式dump不支持修改该文件夹名称,使用set_dump_path函数则支持通过dump_tag参数修改文件夹名称;rank为设备上各卡的ID,每张卡上dump的数据会生成对应dump目录。 - -**精度比对dump场景** - -精度比对dump场景的结果如下: - -* dump.pkl文件:包含dump数据的API名称(命名格式为:`{api_type}_{api_name}_{API调用次数}_{前向反向}_{input/output}.{参数序号}`)、dtype、 shape、各数据的max、min、mean、L2norm统计信息以及当配置summary_mode="md5"时的md5数据。 - - 其中,“参数序号”表示该API下的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个子参数;L2norm表示2范数(平方根)。 - -* dump目录:目录下为npy格式的dump数据。 - - npy文件保存的前缀和PyTorch对应关系如下 - - | 前缀 | Torch模块 | - | ----------- | ------------------- | - | Tensor | torch.Tensor | - | Torch | torch | - | Functional | torch.nn.functional | - | NPU | NPU亲和算子 | - | VF | torch._VF | - | Aten | torch.ops.aten | - | Distributed | torch.distributed | - -当configure_hook或set_dump_switch配置mode参数(例如:mode="api_stack" )时,dump结果的文件名会添加api_stack前缀,dump结果如下: - -* api_stack_dump.pkl -* api_stack_dump目录 - -**溢出检测dump场景** - -PrecisionDebugger模块的hook_name参数或register_hook函数设置了overflow_check时,检测API溢出,dump结果的文件名格式为:`{api_type}_{api_name}_{API调用次数}_{前向反向}_{当前溢出次数}`,dump结果示例如下: - -* `Tensor_add_1_forward_1.pkl` -* `Tensor_add_1_forward_1`目录 - -## 工具支持的API列表 - -ptdbug_ascend工具维护固定的API支持列表,若需要删除或增加dump的API,可以在[support_wrap_ops.yaml](../src/python/ptdbg_ascend/hook_module/support_wrap_ops.yaml)文件内手动修改,如下示例: - -```bash -functional: # functional为算子类别,找到对应的类别,在该类别下按照下列格式删除或添加API - - conv1d - - conv2d - - conv3d -``` - -## CPU或GPU与NPU精度数据比对 - -### 总体说明 - -- 本节主要介绍CPU或GPU与NPU精度数据比对的函数以及示例。 - -- 比对函数均通过单独创建精度比对脚本执行,可支持单卡和多卡场景的精度数据比对。 - -- 工具性能:比对数据量较小时(参考值单份文件小于10GB),参考比对速度0.1GB/s;比对数据量较大时,参考比对速度0.3GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。比对速度的计算方式:两份比对文件大小/比对耗时。 - -### 约束 - -- NPU自研API,在CPU或GPU若没有对应的API,该API的dump数据不比对。 - -- NPU与CPU或GPU的计算结果误差可能会随着模型的执行不断累积,最终会出现同一个API因为输入的数据差异较大而无法比对的情况。 - -- CPU或GPU与NPU中两个相同的API会因为调用次数不同导致无法比对或比对到错误的API,不影响整体运行,该API忽略。 - -### compare_distributed - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,支持单卡和多卡,可同时比对多卡的dump数据。多机场景需要每个设备单独执行比对操作。可自动检索和匹配对应卡和进程所dump的数据文件,再调用compare进行比对。单机单卡时与compare函数二选一。 - -**函数原型** - -```python -compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| npu_dump_dir | 配置NPU环境下的dump目录。dump数据目录须指定到step级。参数示例:'./npu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| bench_dump_dir | 配置CPU、GPU或NPU环境下的dump目录。参数示例:'./gpu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.csv`。数据类型:str。 | 是 | -| **kwargs | 支持compare的所有可选参数。 | 否 | - -**函数示例** - -创建比对脚本,例如compare_distributed.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') -``` - -dump数据目录须指定到step级。 - -### compare - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,仅支持单机单卡。 - -**函数原型** - -```python -compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_match=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------ | 
------------------------------------------------------------ | -------- | -| input_param | 配置dump数据文件及目录。数据类型:dict。配置参数包括:
- "npu_pkl_path":指定NPU dump目录下的.pkl文件。参数示例:"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
- "bench_pkl_path":指定CPU、GPU或NPU dump目录下的.pkl文件。参数示例:"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
- "npu_dump_data_dir":指定NPU dump目录下的dump数据目录。参数示例:"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
- "bench_dump_data_dir":指定CPU、GPU或NPU dump目录下的dump数据目录。参数示例:"bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
- "is_print_compare_log":配置是否开启日志打屏。可取值True或False。可选。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。数据类型:str。 | 是 | -| stack_mode | 配置stack_mode的开关。仅当dump数据时配置debugger.configure_hook或set_dump_switch的mode="api_stack"时需要开启。可取值True或False,参数示例:stack_mode=True,默认为False。数据类型:bool。 | 否 | -| auto_analyze | 自动精度分析,开启后工具自动针对比对结果进行分析,识别到第一个精度不达标节点(在比对结果文件中的“Accuracy Reached or Not”列显示为No),并给出问题可能产生的原因(打屏展示并生成advisor_{timestamp}.txt文件)。可取值True或False,参数示例:auto_analyze=False,默认为True。数据类型:bool。 | 否 | -| fuzzy_match | 模糊匹配。开启后,对于网络中同一层级且命名仅调用次数不同的API,可匹配并进行比对。可取值True或False,参数示例:fuzzy_match=True,默认为False。数据类型:bool。 | 否 | - -**函数示例** - -单机单卡场景下创建比对脚本,例如compare.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output", stack_mode=True) -``` - -### pkl文件比对 - -若使用**compare**或**compare_distributed**函数创建的比对脚本中,input_param参数只配置了npu_pkl_path和bench_pkl_path或使用summary_only、summary_mode(取值为md5或summary)方式dump时,可以进行pkl文件的比对,此时比对dump.pkl文件中的统计信息,开启后的比对结果文件生成Max diff、Min diff、Mean diff和L2norm diff,表示NPU dump数据中API的输入或输出与标杆数据输入或输出的最大值、最小值、平均值以及L2范数的差。可以通过该值判断API是否存在精度问题:当某个API的输入和输出的Max diff、Min diff、Mean diff和L2norm diff均为0或无限趋于0,那么可以判断该API无精度问题,反之则可能存在精度问题。 - -**比对脚本示例** - -以compare.py为例。 - -```python -from ptdbg_ascend import compare -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output", stack_mode=True) -``` - -**比对结果** - -pkl文件比对同样生成`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.csv`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.csv`主要有如下两种情况: - -- configure_hook配置summary_only=True、summary_mode=summary或不配置前面两个参数直接比对pkl文件: - - ![compare_result_pkl](./img/compare_result_pkl.png) - - 上图是对pkl文件中NPU及标杆API的统计信息进行比对,判断可能存在精度问题的API,文件中记录NPU及标杆API的基本信息和统计信息,其中需要关注Result列,包含结果:Waring(NPU与标杆统计信息的比对中存在相对误差大于0.5,则需要重点检查该API);为空(相对误差小于等于0.5,可以不需要重点关注,但不代表不存在精度问题);Nan(表示统计信息数据没有匹配上)。 - -- configure_hook配置summary_mode=md5: - - ![compare_result_pkl_md5.png](./img/compare_result_pkl_md5.png.png) - - 上图是对pkl文件中NPU及标杆API的MD5信息进行比对,判断API数据的完整性,文件中记录NPU及标杆API的基本信息和MD5信息,其中需要关注Result列,包含结果:Pass(表示NPU与标杆的MD5值一致,即API数据完整);Different(表示NPU与标杆的MD5值不一致,即API数据不完全一致,可以通过NPU_Stack_Info列API调用栈查询该API的详细信息);Nan(表示MD5信息数据没有匹配上)。 - -### parse - -**功能说明** - -解析并提取dump信息中的堆栈信息及数据统计信息。 - -**函数原型** - -```python -parse(pkl_file, module_name_prefix) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------------ | ------------------------------------------------------------ | -------- | -| pkl_file | 指定dump数据文件中的pkl文件名。参数示例:"./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl"。数据类型:str。 | 是 | -| module_name_prefix | 指定待提取的API接口前缀。参数示例:"Torch_norm_1_forward"。数据类型:str。 | 是 | -**函数示例** - -创建堆栈信息及数据统计信息提取脚本,例如parse.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl", "Torch_batch_normal_1_forward") -``` - -### 计算精度评价指标 - 
-PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度评价指标判断API在运行时是否存在精度问题。 - -计算精度评价指标: - -1. Cosine:通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。在计算中可能会存在nan,主要由于可能会出现其中一个向量为0。 - -2. MaxAbsErr:当最大绝对误差越接近0表示其计算的误差越小,实际可接受阈值为小于0.001。 - -3. MaxRelativeErr:当最大相对误差越接近0表示其计算的误差越小。 - - 当dump数据中存在0或Nan时,比对结果中最大相对误差则出现inf或Nan的情况,属于正常现象。 - -4. One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指NPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 - -精度比对结果csv文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: - -1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 -2. Cosine < 0.9,精度不达标,标记为“No”。 -3. MaxAbsError > 1,精度不达标,标记为“No”。 -5. 其余情况下记为精度达标,标记为“Yes”。 - -## ptdbg_ascend.parse数据解析功能 - -ptdbg_ascend.parse为命令行交互式界面解析工具,提供更多的数据解析功能并且展示结果。 - -使用场景:本工具主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -### 进入parse交互式界面 - -安装ptdbg_ascend工具后,可以通过使用命令 **python -m ptdbg_ascend.parse** 进入交互式界面,如下所示: - -```bash -python -m ptdbg_ascend.parse -Parse >>> -``` - -可在parse的界面中执行Shell命令,以及如下场景的相关解析命令: - -- 支持指定ACL层级算子数据比对。 -- 支持指定ACL层级算子数据转换及展示。 -- 支持交互式指定pkl文件中API对应dump数据查看。 -- 支持API进行可选层级比对和打印(统计级和像素级)。 - -Ctrl+C可以退出parse交互式界面。不退出parse交互式界面若需要执行非该界面下的内置Shell命令,且命令与parse交互式界面命令冲突时,非该界面命令需要使用run命令,在相关命令前加上run前缀,如下示例: - -```bash -python -m ptdbg_ascend.parse -Parse >>> run vim cli.py -Parse >>> vim cli.py -``` - -以上各场景详细介绍请参见下文章节。 - -### ACL层级算子数据批量转换 - -本功能会将原有待比对dump数据目录下的dump数据按照算子名和时间戳进行梳理并分类,之后再将dump数据转为为npy文件。 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下比对命令进行数据转换。 - -```bash -cad -m my_dump_path [-out output_path] [-asc msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| -m | 待转换ACL dump数据目录。需要指定到ACL dump数据的deviceid级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_convert。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -asc | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py。 | 否 | - -**示例** - -```bash -# 传入待比对数据目录 -Parse >>> cad -m /home/xxx/my_dump_path/20000124003856/0 -# 转换结果打印 -...... -╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -# 转换前的dump文件 -│ SrcFile: /home/xxx/my_dump_path/20000124003856/0/272/TransData.trans_TransData_22.112.21.948645536672764 │ -# 转换后的npy文件 -│ - TransData.trans_TransData_22.112.21.948645536672764.output.0.npy │ -│ - TransData.trans_TransData_22.112.21.948645536672764.input.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -...... -[INFO] The comparison result have been written to "./parse_data/acl_batch_convert". -``` - -输出结果: - -原dump数据目录: - -```bash -├── /home/xxx/my_dump_path/20000124003856/0/ -│ ├── 272 -│ │ ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} -│ │ ... -│ ├── 512 -│ ... -``` - -转换后: - -```bash -├── ./parse_data/acl_batch_convert/{timestamp} -│ ├── {op_name1} -│ │ ├── {timestamp1} -│ │ | ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input/output}.{参数序号}.npy -│ │ | │ ... -│ │ ├── {timestamp2} -│ │ | ... -│ ├── {op_name2} -│ ├── ... 
-``` - -### ACL层级算子数据比对 - -本功能主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -本功能支持批量比对,若需要进行批量比对,需要先将两份待比对的NPU ACL层级dump数据进行“**ACL层级算子数据批量转换**”,可以使两份数据更好的匹配;若直接进行dump数据的比对,建议只比对单个dump数据文件。 - -输入以下比对命令进行数据比对。 - -```bash -vc -m my_dump_path -g golden_dump_path [-out output_path] [-cmp_path msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -m | 待比对ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -g | 标杆ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_comapre。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -输出结果:batch_compare_{timestamp}.csv文件。 - -**示例** - -```bash -# 传入待比对数据目录以及标杆数据目录 -Parse >>> vc -m ./my_dump_path -g ./golden_data_path -[INFO]Compare result is saved in : parse_data/acl_batch_comapre/batch_compare_1707271118.csv -``` - -### ACL算子数据的npy转换 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下转换命令进行数据转换, 将ACL级别dump数据转为npy文件。 - -```bash -dc -n file_name/file_path [-f format] [-out output_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -n | 需转换的dump数据文件或dump数据文件目录。 | 是 | -| -f | 开启format转换,指定该参数时需要配置format格式。当前内置的Format转换支持如下类型:
FRACTAL_NZ转换成NCHW
FRACTAL_NZ转换成NHWC
FRACTAL_NZ转换成ND
HWCN转换成FRACTAL_Z
HWCN转换成NCHW
HWCN转换成NHWC
NC1HWC0转换成HWCN
NC1HWC0转换成NCHW
NC1HWC0转换成NHWC
NCHW转换成FRACTAL_Z
NCHW转换成NHWC
NHWC转换成FRACTAL_Z
NHWC转换成HWCN
NHWC转换成NCHW
NDC1HWC0转换成NCDHW | 否 | -| -out | 结果输出目录。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -[^]: 若传入单个dump文件,则转换单个文件,若传入dump文件目录则转换目录下所有dump文件。 - -- 输出结果:npy文件。 -- 若指定-out参数需要用户传入输出路径,并且路径需要已存在。 -- 若未指定输出目录, 则比对结束后将结果保存在默认目录 “./parse_data/convert_result”中,比对结束后会打印log提示输出结果存放路径及转换结果。 - -- 输入以下命令,展示npy数据统计信息。 - - ```bash - pt -n file_path - ``` - - | 参数名称 | 说明 | 是否必选 | - | -------- | ------------- | -------- | - | -n | npy文件路径。 | 是 | - - 打印统计信息:shape, dtype, max, min和mean。默认在npy文件路径下将该数据保存为txt文件。 - -**示例1** - -```bash -# 传入需转换的dump文件目录 -Parse >>> dc -n ./dump_data/ -...... -# 转换结果 -╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ SrcFile: ./dump_data/ -│ - Add.fp32_vars_add_2fp32_vars_Relu_9.31.5.1636595794731103.input.0.npy │ -│ - Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.output.0.npy │ -│ - Add.fp32_vars_add_2fp32_vars_Relu_9.31.5.1636595794731103.input.1.npy │ -│ - Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.1.npy │ -│ - Add.fp32_vars_add_3fp32_vars_Relu_12.40.5.1636595794846124.input.1.npy │ -│ - Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.0.npy │ -│ - Add.fp32_vars_add_3fp32_vars_Relu_12.40.5.1636595794846124.input.0.npy │ -│ - Add.fp32_vars_add_2fp32_vars_Relu_9.31.5.1636595794731103.output.0.npy │ -│ - Add.fp32_vars_add_3fp32_vars_Relu_12.40.5.1636595794846124.output.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -**示例2** - -```bash -# 查看某个dump数据块的数据信息 -# 默认会将数据中的tensor保存成 txt -Parse >>> pt -n ./parse_data/dump_convert/Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.output.0.npy -...... -# 打印统计信息 -[Shape: (1, 16, 56, 56, 16)] [Dtype: float16] [Max: 452.0] [Min: -408.5] [Mean: -3.809] -Path: ./parse_data/dump_convert/Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.0.npy -TextFile:./parse_data/dump_convert/Add.fp32_vars_add_1fp32_vars_Relu_6.24.5.1636595794631347.input.0.npy.txt -``` - -### pkl文件中指定API的dump数据信息查看 - -输入以下命令,解析并输出pkl文件中指定api的统计信息。 - -```bash -pk -f pkl_path -n api_name -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------- | -------- | -| -f | 指定pkl文件路径。 | 是 | -| -n | 指定API名称。 | 是 | - -- 输出结果:打印统计信息(shape, dtype, max和min mean)。 -- 若pkl文件中存在相应的堆栈信息,则会打印堆栈信息。 - -**示例** - -```bash -# 传入pkl文件及api名称 -Parse >>> pk -f ./torch_dump/ptdbg_v3.2/rank0/api_stack_dump.pkl -n Functional_conv2d_0_forward -...... 
-# 打印统计信息及堆栈(pkl文件不包含堆栈则不会打印堆栈) - -Statistic Info: - [Functional_conv2d_0_forward_input.0][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 1.576936960220337][min: -0.9757485389709473][mean: 0.4961632490158081] - [Functional_conv2d_0_forward_input.1][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 0.20064473152160645][min: -0.47102075815200806][mean: -0.20796933770179749] - [Functional_conv2d_0_forward_input.2][dtype: torch.float32][shape: [2]][max: 0.17380613088607788][min: -0.16853803396224976][mean: 0.0026340484619140625] - [Functional_conv2d_0_forward_output][dtype: torch.float32][shape: [2, 2, 1, 1]][max: 0.02364911139011383][min: -1.762906551361084][mean: -0.6710853576660156] -``` - -### API可选层级比对 - -输入以下命令, 进行统计级和像素级比对。 - -```bash -cn -m my_data*.npy -g gloden*.npy [-p num] [-al atol] [-rl rtol] -``` - -- 统计级比对:对tensor整体进行余弦值及相对误差的计算。 -- 像素级比对:对输入的两个npy文件进行逐元素比对。若两个tensor对应元素的相对误差或绝对误差大于**误差阈值**(-al和-rl配置)则被标记为错误数据。 - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------------------------------------- | -------- | -| -m | 待比对数据。 | 是 | -| -g | 标杆数据。 | 是 | -| -p | 设置比对结束后打印错误元素的个数,默认值20。 | 否 | -| -al | 判定数据存在精度问题的绝对误差阈值,默认0.001。 | 否 | -| -rl | 判定数据存在精度问题的相对误差阈值,默认0.001。 | 否 | -| -s | 将npy文件保存成txt文件,用于查看,默认开启。 | 否 | - -输出结果: - -- 统计级比对结果。 -- 两个文件的统计信息(shape, dtype, max, min和mean)。 -- 错误数据打印表格。 - -**示例** - -```bash -# 对比两个tensor的数据 -Parse >>> cn -m Add.InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.323.1619494134703053.output.0.npy -g InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.0.1619492699305998.npy -p 10 -s -al 0.002 -rl 0.005 - Error Item Table Top Item Table -┏━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┏━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ -┃ Index ┃ Left ┃ Right ┃ Diff ┃ ┃ Index ┃ Left ┃ Right ┃ Diff ┃ -┡━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ ┡━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ -│ 155 │ 0.024600908 │ 0.022271132 │ 0.002329776 │ │ 0 │ -0.9206961 │ -0.9222216 │ 0.0015255213 │ -│ 247 │ 0.015752593 │ 0.017937578 │ 0.0021849852 │ │ 1 │ -0.6416973 │ -0.64051837 │ 0.0011789203 │ -│ 282 │ -0.0101207765 │ -0.007852031 │ 0.0022687456 │ │ 2 │ -0.35383835 │ -0.35433492 │ 0.0004965663 │ -│ 292 │ 0.019581757 │ 0.02240482 │ 0.0028230622 │ │ 3 │ -0.18851271 │ -0.18883198 │ 0.00031927228 │ -│ 640 │ -0.06593232 │ -0.06874806 │ 0.0028157383 │ │ 4 │ -0.43508735 │ -0.43534422 │ 0.00025686622 │ -│ 1420 │ 0.09293677 │ 0.09586689 │ 0.0029301196 │ │ 5 │ 1.4447614 │ 1.4466647 │ 0.0019032955 │ -│ 1462 │ -0.085207745 │ -0.088047795 │ 0.0028400496 │ │ 6 │ -0.3455438 │ -0.3444429 │ 0.0011008978 │ -│ 1891 │ -0.03433288 │ -0.036525503 │ 0.002192624 │ │ 7 │ -0.6560242 │ -0.6564579 │ 0.0004336834 │ -│ 2033 │ 0.06828873 │ 0.07139922 │ 0.0031104907 │ │ 8 │ -2.6964858 │ -2.6975214 │ 0.0010356903 │ -│ 2246 │ -0.06376442 │ -0.06121233 │ 0.002552092 │ │ 9 │ -0.73746175 │ -0.73650354 │ 0.00095820427 │ -└───────┴───────────────┴──────────────┴──────────────┘ └───────┴─────────────┴─────────────┴───────────────┘ -╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ Left: | -│ |- NpyFile: ./dump/temp/decode/Add.InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.323.1619494134703053.output.0.npy | -│ |- TxtFile: ./dump/temp/decode/Add.InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.323.1619494134703053.output.0.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.846897] [Min: -8.368301] [Mean: -0.72565556] | -│ DstFile: │ -│ |- NpyFile: 
./dump/cpu/InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.0.1619492699305998.npy | -│ |- TxtFile: ./dump/cpu/InceptionV3_InceptionV3_Mixed_7a_Branch_0_add_3.0.1619492699305998.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.8425903] [Min: -8.374472] [Mean: -0.7256237] │ -│ NumCnt: 655360 │ -│ AllClose: False │ -│ CosSim: 0.99999493 │ -│ ErrorPer: 0.023504638671875 (rl= 0.005, al= 0.002) │ -╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -## FAQ - -[FAQ](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md) diff --git "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T3.md" "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T3.md" deleted file mode 100644 index af73a56849588c1a080962f00249700aee9a3630..0000000000000000000000000000000000000000 --- "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T3.md" +++ /dev/null @@ -1,2301 +0,0 @@ -# **PyTorch精度工具使用指南** - -本文主要介绍PyTorch精度工具ptdbg_ascend的使用以及精度比对场景示例。 - -ptdbg_ascend工具的原理及安装请参见《[PyTorch精度工具](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 - -ptdbg_ascend工具主要支持PyTorch API精度数据dump、溢出检测、精度比对以及parse数据解析功能。其中dump和溢出检测功能支持使用debugger和register_hook方式进行精度数据的dump和溢出检测,推荐使用debugger方式。 - -## PyTorch精度比对总体流程 - -1. 准备CPU或GPU训练工程。 - -2. 在环境下安装ptdbg_ascend工具。 - -3. 在训练脚本内插入ptdbg_ascend工具dump接口。 - -4. 执行训练dump数据。 - -5. 将CPU或GPU训练工程迁移为NPU训练工程。 - - 请参见《[PyTorch模型迁移和训练指南](https://www.hiascend.com/document/detail/zh/canncommercial/63RC1/modeldevpt/ptmigr/ptmigr_0001.html)》。 - -6. 在NPU环境下安装ptdbg_ascend工具。 - -7. 在NPU训练脚本内插入ptdbg_ascend工具dump接口。 - -8. NPU环境下执行训练dump数据。 - -9. 创建并配置精度比对脚本,例如compare.py。 - -10. 执行CPU或GPU dump与NPU dump数据的精度比对。 - -11. 比对结果分析。 - -## 快速入门(debugger方式) - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**debugger方式dump和溢出检测**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 单卡场景精度比对 - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 - - 对于模型数据庞大(比如达到T级别)的场景,不推荐直接dump整网比对,整网dump可能导致磁盘不足,需要预留足够的存储空间或者分多次dump。 - -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 - -3. 范围比对:对不符合精度标准的API重新dump详细信息。 - -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 - -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 - -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -2. 
比对整网数据。 - - 第1步中的NPU dump数据目录为npu_dump,假设GPU dump数据目录为gpu_dump;dump将生成pkl数据文件api_stack_dump.pkl和npy数据目录api_stack_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice)。 - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch.batch.normal的堆栈信息及数据统计信息 - parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", "Torch.batch.normal.1.forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - - - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor.permute.1.forward"], acl_config='./dump.json') - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - debugger.configure_hook(mode="acl", scope=["Functional.conv2d.1.backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward_input.0.npy"]) - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景 - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 
在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定前向API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional.conv2d.1.backward"], acl_config="./dump.json", backward_input=["./overflow_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward_input.0.npy"]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in '*.pkl'. - ``` - - 其中xxx times为用户设置的次数,*.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. 
SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## 场景化示例 - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**CPU或GPU及NPU精度数据dump**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 多卡场景精度比对 - -精度工具支持多卡场景的精度比对,多卡场景的dump步骤与单卡场景完全一致,请参见“**单卡场景精度比对**”章节,不同的是多卡数据精度比对时需要使用“compare_distributed”函数进行比对。 - -**大模型场景下dump推荐使用debugger方式的手动模式。** - -如下示例: - -说明:多机多卡场景需要每个设备单独执行比对操作。 - -假设NPU dump npy数据目录为npu_dump/ptdbg_dump_v4.0,GPU dump npy数据目录为gpu_dump/ptdbg_dump_v4.0。 - -1. 创建比对脚本,例如compare_distributed.py,拷贝如下代码。 - - ```python - from ptdbg_ascend import * - compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') - ``` - - dump数据目录须指定到step级。 - -2. 执行比对: - - ```bash - python3 compare_distributed.py - ``` - -两次运行须用相同数量的卡,传入`compare_distributed`的两个文件夹下须有相同个数的rank文件夹,且不包含其他无关文件,否则将无法比对。 - -**多卡set_dump_path注意事项** - -多卡一般为多进程,须保证每个进程都正确调用PrecisionDebugger或set_dump_path,或把PrecisionDebugger或set_dump_path插入到import语句后,如: - -```python -from ptdbg_ascend import * -debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) -``` - -或 - -```python -from ptdbg_ascend import * -seed_all() -set_dump_path('./dump_resnet') -``` - -如此可保证set_dump_path在每个进程都被调用。 - -**多卡register_hook注意事项** - -register_hook需要在set_dump_path之后调用,也需要在每个进程上被调用,建议在搬运模型数据到卡之后调用。识别方法如下: - -- 找到训练代码中遍历epoch的for循环或遍历数据集的for循环,把register_hook放到循环开始前即可。 -- 找到训练代码中调用DDP或者DistributedDataParallel的代码行,把register_hook放到该代码行所在的代码块之后。 -- 若代码中均无以上两种情况,需要保证register_hook在模型定义之后插入,并配置rank参数。rank参数获取rank_id请参见“**[rank_id获取方法](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/rank_id获取方法.md)**”。 - -### NPU vs NPU精度比对 - -对于NPU vs NPU场景,是针对同一模型,进行迭代(模型、API版本升级或设备硬件升级)时存在的精度下降问题,对比相同模型在迭代前后版本的API计算数值,进行问题定位。 - -一般情况下迭代涉及NPU自定义算子,因此,可以仅dump NPU自定义算子进行比对。比对精度问题分析请参见“**单卡场景精度比对**”章节。 - -工具当前支持dump NPU自定义算子如下: - -| 序号 | NPU自定义算子 | -| :--- | ----------------------------------------------- | -| 1 | torch_npu.one_ | -| 2 | torch_npu.npu_sort_v2 | -| 3 | torch_npu.npu_transpose | -| 4 | torch_npu.npu_broadcast | -| 5 | torch_npu.npu_dtype_cast | -| 6 | torch_npu.empty_with_format | -| 7 | torch_npu.npu_one_hot | -| 8 | torch_npu.npu_stride_add | -| 9 | torch_npu.npu_ps_roi_pooling | -| 10 | torch_npu.npu_roi_align | -| 11 | torch_npu.npu_nms_v4 | -| 12 | torch_npu.npu_iou | -| 13 | torch_npu.npu_nms_with_mask | -| 14 | torch_npu.npu_pad | -| 15 | torch_npu.npu_bounding_box_encode | -| 16 | torch_npu.npu_bounding_box_decode | -| 17 | torch_npu.npu_batch_nms | -| 18 | torch_npu.npu_slice | -| 19 | torch_npu._npu_dropout | -| 20 | torch_npu.npu_indexing | -| 21 | torch_npu.npu_ifmr | -| 22 | torch_npu.npu_max | -| 23 | torch_npu.npu_scatter | -| 24 | torch_npu.npu_layer_norm_eval | -| 25 | torch_npu.npu_alloc_float_status | -| 26 | torch_npu.npu_confusion_transpose | -| 27 | torch_npu.npu_bmmV2 | -| 28 | torch_npu.fast_gelu | -| 29 | torch_npu.npu_sub_sample | -| 30 | torch_npu.npu_deformable_conv2d | -| 31 | torch_npu.npu_mish | -| 32 | torch_npu.npu_anchor_response_flags | -| 33 | torch_npu.npu_yolo_boxes_encode | -| 34 | torch_npu.npu_grid_assign_positive | -| 35 | torch_npu.npu_normalize_batch | -| 36 | torch_npu.npu_masked_fill_range | -| 37 | torch_npu.npu_linear | -| 38 | torch_npu.npu_bert_apply_adam | -| 39 | torch_npu.npu_giou | -| 40 | torch_npu.npu_ciou | -| 41 | torch_npu.npu_diou | -| 42 | torch_npu.npu_sign_bits_pack | -| 43 | torch_npu.npu_sign_bits_unpack | -| 44 | 
torch_npu.npu_flash_attention | -| 45 | torch_npu.npu_scaled_masked_softmax | -| 46 | torch_npu.npu_rotary_mul | -| 47 | torch_npu.npu_roi_align | -| 48 | torch_npu.npu_roi_alignbk | -| 49 | torch_npu.npu_ptiou | -| 50 | torch_npu.npu_fusion_attention | -| 51 | torch_npu.npu_dropout_with_add_softmax | -| 52 | torch_npu.npu_random_choice_with_mask | -| 53 | torch_npu.npu_rotated_iou | -| 54 | torch_npu.npu_conv2d | -| 55 | torch_npu.npu_conv3d | -| 56 | torch_npu.npu_softmax_cross_entropy_with_logits | -| 57 | torch_npu.npu_all_gather_base_mm | -| 58 | torch_npu.npu_swiglu | -| 59 | torch_npu.npu_rms_norm | -| 60 | torch_npu.npu_mm_reduce_scatter_base | -| 61 | torch_npu.npu_mm_all_reduce_base | -| 62 | torch_npu.npu_conv_transpose2d | -| 63 | torch_npu.npu_convolution | -| 64 | torch_npu.npu_convolution_transpose | -| 65 | torch_npu.npu_min | -| 66 | torch_npu.npu_nms_rotated | -| 67 | torch_npu.npu_reshape | -| 68 | torch_npu.npu_rotated_box_decode | -| 69 | torch_npu.npu_rotated_box_encode | -| 70 | torch_npu.npu_rotated_overlaps | -| 71 | torch_npu.npu_silu | -| 72 | torch_npu.npu_fused_attention_score | -| 73 | torch_npu.npu_multi_head_attention | -| 74 | torch_npu.npu_gru | -| 75 | torch_npu.npu_incre_flash_attention | -| 76 | torch_npu.npu_prompt_flash_attention | -| 77 | torch_npu.npu_lstm | -| 78 | torch_npu.npu_apply_adam | - -### 通信API的数据dump - -通信类API数据可以使用全量dump方式获取,若只dump通信类API数据,可以使用如下示例: - -```python -debugger.configure_hook(mode="api_list", api_list=["distributed"]) -``` - -或 - -```python -set_dump_switch("ON", mode="api_list", api_list=["distributed"]) -``` - -通信类API支持列表: - -| 序号 | Distributed | -| :--- | -------------------- | -| 1 | send | -| 2 | recv | -| 3 | broadcast | -| 4 | all_reduce | -| 5 | reduce | -| 6 | all_gather | -| 7 | gather | -| 8 | isend | -| 9 | irecv | -| 10 | scatter | -| 11 | reduce_scatter | -| 12 | _reduce_scatter_base | -| 13 | _all_gather_base | - -### 单卡场景精度比对(register_hook方式) - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 -3. 范围比对:对不符合精度标准的API重新dump。 -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - - # 在main函数开始前固定随机数 - seed_all() - - # 配置dump数据目录路径和名称 - set_dump_path("./npu_dump", dump_tag='all') - - # 注册dump回调函数 - register_hook(model, acc_cmp_dump) - - ... - - # 在第一个迭代开始的位置开启dump和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="api_stack", filter_switch="OFF") - - ... - - # 在第一个迭代结束的位置关闭dump - set_dump_switch("OFF") - ``` - -2. 比对整网数据。 - - 第1步中的NPU dump数据文件为npu_dump.pkl,假设NPU dump npy数据目录为npu_dump,GPU dump数据文件为gpu_dump.pkl,GPU dump npy数据目录为gpu_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 
根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice) - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch.batch.normal的堆栈信息及数据统计信息 - parse("./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", "Torch.batch.normal.1.forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='forward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定前向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Tensor.permute.1.forward"], filter_switch="OFF") - - ... - - set_dump_switch("OFF") - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='backward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"], filter_switch="OFF") - set_backward_input(["./npu_dump/all_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward.input.0.npy"]) - - ... - - set_dump_switch("OFF") - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景(register_hook方式) - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check, overflow_nums=3) - - ... - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # dump指定API的ACL级别溢出数据 - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - - # 在期望溢出检测的step位置开始前打开溢出检测开关 - set_overflow_check_switch("ON") - - ... - - # 在step结束的位置关闭溢出检测开关 - set_overflow_check_switch("OFF") - - ... - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check) - - ... - ``` - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... 
- # dump指定反向API的ACL级别溢出数据 - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"]) - set_backward_input(["./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump/Functional.conv2d.1.backward.input.0.npy"]) - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in '*.pkl'. - ``` - - 其中xxx times为用户设置的次数,*.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过 [Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## debugger方式dump和溢出检测(推荐) - -### PrecisionDebugger模块 - -**功能说明** - -PrecisionDebugger模块包含dump和溢出检测功能的总体配置项。可以指定dump目录,设置dump或溢出检测功能,指定dump的卡和迭代。 - -可以在from ptdbg_ascend import *和模型初始化之间的任意位置添加该模块。 - -**原型** - -```python -PrecisionDebugger(dump_path=None, hook_name=None, rank=None, step=[], enable_dataloader=False, model=None): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| dump_path | 设置dump数据目录路径,参数示例:"./dump_path"。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。
当**configure_hook**函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录名称添加mode参数值为前缀,详情请参见“**dump数据存盘说明**”。
未配置dump_path时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时dump数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,dump_path和环境变量需要二选一。 | 否 | -| hook_name | dump模式,可取值"dump"和"overflow_check",表示dump和溢出检测功能,二选一。参数示例:hook_name="dump"。数据类型:str。 | 是 | -| rank | 指定对某张卡上的数据进行dump或溢出检测,默认未配置(表示dump所有卡的数据),须根据实际卡的Rank ID配置。应配置为大于0的正整数,且须根据实际卡的Rank ID配置,若所配置的值大于实际训练所运行的卡的Rank ID,则dump数据为空,比如当前环境Rank ID为0到7,实际训练运行0到3卡,此时若配置Rank ID为4或不存在的10等其他值,此时dump数据为空。数据类型:int。 | 否 | -| step | 指定dump某个step的数据,默认未配置,表示dump所有step数据。dump特定step时,须指定为训练脚本中存在的step。step为list格式,可配置逐个step,例如:step=[0,1,2];也可以配置step范围,例如:step=list(range(0,9)),表示dump第0到第8个step。数据类型:List[int]。 | 否 | -| enable_dataloader | 自动控制开关,可取值True(开启)或False(关闭),默认为False。配置为True后自动识别dump step参数指定的迭代,并在该迭代执行完成后退出训练,此时start和stop函数可不配置,开启该开关要求训练脚本是通过torch.utils.data.dataloader方式加载数据;配置为False则需要配置start和stop函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()。数据类型:bool。 | 否 | -| model | 开启init dump模式,传入网络模型实例化的对象,配置该参数后,dump操作仅dump网络中init方法里调用的方法(nn.Module类),不会对所有API进行dump。参数示例: model=net,net为网络模型实例化的对象名称。默认未配置。
配置该参数时,PrecisionDebugger模块请在模型实例化之后调用。数据类型:torch.nn.Module。
该模式不支持“溢出检测”、”ACL级别数据dump“和“模块级精度数据dump”。此模式下dump文件名前缀为网络中定义的模块名或层名。 | 否 | - -#### init dump模式示例代码和数据落盘说明 - -**示例代码** - -```python -import os -import torch -import torch.nn as nn -import torch_npu -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") - - -class Net(nn.Module): - - def __init__(self): - super(Net, self).__init__() - self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2) - self.relu1 = nn.ReLU() - self.bn1 = nn.BatchNorm2d(16) - - def forward(self, x): - x = self.conv1(x) - x = self.bn1(x) - output = self.relu1(x) - return output - -if __name__ == "__main__": - net = Net().npu() - # model参数传入net, 开启init dump 功能 - debugger = PrecisionDebugger(dump_path="./dump", hook_name="dump", model=net) - debugger.configure_hook(mode="api_stack") - debugger.start() - x = torch.randn(1, 1, 28, 28).npu() - out = net(x) - loss = out.sum() - loss.backward() - debugger.stop() -``` - -**落盘数据说明** - -该模式下dump数据命名格式为:`{Layer_name}.{Module_name}.{call_num}.{forward/backward}.{input/output}.npy` - -``` -# 按照上述用例代码进行dump,落盘数据命名示例如下: -conv1.Conv2d.0.forward.input.0.npy -conv1.Conv2d.0.forward.output.npy -relu1.ReLU.0.forward.input.0.npy -....... -bn1.BatchNorm2d.0.backward.output.2.npy -``` - -### configure_hook函数(可选) - -**功能说明** - -设置dump范围。 - -建议在**PrecisionDebugger**模块与模型初始化之间的任意位置添加,不添加此函数时默认使用mode="api_stack" dump整网数据。 - -**原型** - -dump: - -```python -debugger.configure_hook(mode="api_stack", scope=[], api_list=[], filter_switch="OFF", acl_config=None, backward_input=[], input_output_mode=["all"], summary_only=False, summary_mode="all") -``` - -溢出检测: - -```python -debugger.configure_hook(mode=None, acl_config=None, overflow_nums=1, need_replicate=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"api_stack"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围,mode="api_list"时,需要配置api_list=[],其他模式有需要时配置scope=[]。参数示例:scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"(表示开启过滤,即不dump)或"OFF"(表示关闭过滤)。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| acl_config | acl dump的配置文件。mode="acl"时,该参数必选;mode为其他值时,该参数不选。参数示例:acl_config='./dump.json'。dump.json配置文件详细介绍请参见“**dump.json配置文件说明**”。数据类型:str。 | 否 | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional.conv2d.1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional.conv2d.1、backward和input字段的.npy文件。数据类型:str。 | 否 | -| input_output_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例input_output_mode=["backward"]或input_output_mode=["forward", "backward"]。默认为["all"],即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:list。 | 否 | -| summary_only | dump npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | -| summary_mode | 控制dump文件输出的模式,可取值md5(dump仅输出包含md5值的pkl文件,用于验证数据的完整性)、summary(dump仅输出包含API统计信息的pkl文件)、all(dump输出包含API统计信息的pkl文件以及具体的npy文件),参数示例:summary_mode="md5",默认为"all"。summary_only=True时,不允许配置该参数。数据类型:str。 | 否 | -| overflow_nums | 
控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| need_replicate | 过程dump数据生成开关,执行溢出检测时,dump目录下会生成forward_real_data和backward_real_data的过程dump数据目录,可取值True(生成)或False(不生成),默认不生成。数据类型:bool。 | 否 | - -**函数示例** - -configure_hook可配置多种dump模式,示例如下: - -说明: - -以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -以下仅为该函数配置示例,完整代码请参见“**示例代码**”章节。 - -- 示例1:dump指定API列表 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="list", scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="range", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="stack", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor.permute.1.forward"], acl_config="./dump.json") - ``` - -- 示例5:dump指定反向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional.conv2d.1.backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward.input.0.npy"]) - ``` - -- 示例6:dump指定某一类API的API级别输入输出数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例7:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例8: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例9:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(input_output_mode=["backward"]) - ``` - -- 示例10:仅dump pkl文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(summary_only=True) - ``` - -- 示例11:溢出检测dump - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=1) - ``` - - dump执行时会在**PrecisionDebugger**模块的dump_path参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例11:dump溢出API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### start函数(可选) - -**功能说明** - 
-dump或溢出检测启动函数。 - -在模型初始化之后的任意位置添加。 - -**原型** - -```python -debugger.start() -``` - -该函数为类函数,可以使用debugger.start()也可以使用PrecisionDebugger.start()。 - -### stop函数(可选) - -**功能说明** - -dump或溢出检测停止函数。 - -在**start**函数之后的任意位置添加。 - -**原型** - -```python -debugger.stop() -``` - -该函数为类函数,可以使用debugger.stop()也可以使用PrecisionDebugger.stop()。 - -### 示例代码(自动模式) - -**需要保证用户训练代码是通过torch.utils.data.dataloader方式加载数据。** - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -### 示例代码(手动模式) - -一般情况下使用自动模式可以快速方便进行dump操作,但个别大模型可能在部分卡的训练操作中没有调用dataloader,这会导致自动模式无法dump指定迭代的数据,此时需要关闭自动模式手动在迭代前后插入start()和stop()函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()以标识dump结束。 - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -## register_hook方式dump和溢出检测 - -### 总体说明 - -- 本节主要介绍CPU或GPU及NPU精度数据dump和溢出检测所需要的函数以及示例。 - -- ptdbg_ascend工具默认情况下仅dump PyTorch模型的API输入输出数据进行精度比对,若在比对结果中发现某个API下可能存在ACL的精度问题,那么可以选择dump该API的ACL级别数据进行精度分析。 - -- 某些torch api的输出不是Tensor类型的数据。对于此类API的反向过程进行ACL dump,工具会在运行日志中给出对应的Warning(is not of tensor type and cannot be automatically derived)提示。如若想要进行该类API反向ACL dump,可以通过手动构建单API用例的方式进行ACL dump,具体用例可参见“**[反向ACL dump用例说明](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/%E5%8F%8D%E5%90%91ACL%20dump%E7%94%A8%E4%BE%8B%E8%AF%B4%E6%98%8E.md)**”。 - -- 工具性能:dump数据量较小时(小于5G),参考dump速度0.1GB/s;dump数据量较大时,参考dump速度0.2GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。Dump速度的计算方式:Dump数据量/(单个step添加Dump耗时-原始单个step耗时)。 - -### 约束 -- 进行CPU或GPU数据dump时,请安装torch包而非torch_npu包,避免工具无法识别使用场景,导致失败。 - -- TASK_QUEUE_ENABLE环境变量会导致API下发和执行异步进行,因此在ACL dump前需要将TASK_QUEUE_ENABLE关闭,即export TASK_QUEUE_ENABLE=0。 - -- 不建议在PyTorch训练脚本中同时添加dump接口和性能数据采集(如Ascend PyThon Profiler)接口,二者可能相互影响导致数据不准确。 - -### seed_all - -**功能说明** - -固定随机数。通过固定随机数保证模型的输入或输出一致。在训练主函数开始前调用,避免随机数固定不全。 - -使用form ptdbg import *后自动导入该函数,代码无需再次添加,若需要修改随机数种子和确定性计算模式,则需要通过添加该函数修改。 - -**函数原型** - -```python -seed_all(seed=1234, mode=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------ | ------------------------------------------------------------ | -------- | -| seed | 随机数种子。参数示例:seed=1000。默认值为:1234。数据类型:int。 | 否 | -| mode | 确定性计算模式。可配置True或False。参数示例:mode=True。默认为False。数据类型:bool。
即使在相同的硬件和输入下,API多次执行的结果也可能不同,开启确定性计算是为了保证在相同的硬件和输入下,API多次执行的结果相同。
确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启。
rnn类算子、ReduceSum、ReduceMean等算子可能与确定性计算存在冲突,若开启确定性计算后多次执行的结果不相同,则考虑存在这些算子。 | 否 | - -**函数示例** - -seed_all函数的随机数种子,取默认值即可,无须配置;第二个参数默认关闭,不开启确定性计算时也无须配置。 - -- 示例1:仅固定随机数,不开启确定性计算 - - ```python - seed_all() - ``` - -- 示例2:固定随机数,开启确定性计算 - - ```python - seed_all(mode=True) - ``` - -**固定随机数范围** - -seed_all函数可固定随机数的范围如下表。 - -| API | 固定随机数 | -| ---------------------------------------- | --------------------------- | -| os.environ['PYTHONHASHSEED'] = str(seed) | 禁止Python中的hash随机化 | -| random.seed(seed) | 设置random随机生成器的种子 | -| np.random.seed(seed) | 设置numpy中随机生成器的种子 | -| torch.manual_seed(seed) | 设置当前CPU的随机种子 | -| torch.cuda.manual_seed(seed) | 设置当前GPU的随机种子 | -| torch.cuda.manual_seed_all(seed) | 设置所有GPU的随机种子 | -| torch_npu.npu.manual_seed(seed) | 设置当前NPU的随机种子 | -| torch_npu.npu.manual_seed_all(seed) | 设置所有NPU的随机种子 | -| torch.backends.cudnn.enable=False | 关闭cuDNN | -| torch.backends.cudnn.benchmark=False | cuDNN确定性地选择算法 | -| torch.backends.cudnn.deterministic=True | cuDNN仅使用确定性的卷积算法 | - -需要保证CPU或GPU以及NPU的模型输入完全一致,dump数据的比对才有意义,seed_all并不能保证模型输入完全一致,如下表所示场景需要保证输入的一致性。 - -| 场景 | 固定方法 | -| --------------- | ------------- | -| 数据集的shuffle | 关闭shuffle。 | -| dropout | 关闭dropout。 | - -关闭shuffle示例: - -```python -train_loader = torch.utils.data.DataLoader( - train_dataset, - batch_size = batch_size, - shuffle = False, - num_workers = num_workers -) -``` - -关闭dropout: - -在使用from ptdbg import *后,工具会自动将torch.nn.functional.dropout、torch.nn.functional.dropout2d、torch.nn.functional.dropout3d、torch.nn.Dropout、torch.nn.Dropout2d、torch.nn.Dropout3d的接口参数p置为0。 - -### set_dump_path - -**功能说明** - -设置数据保存目录。建议在seed_all函数之后调用且需要保证训练进程能够调用该函数;多卡时须保证每个进程都能调用该函数。 - -**函数原型** - -```python -set_dump_path(fpath=None, dump_tag='ptdbg_dump') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| fpath | 设置数据目录路径。参数示例:'./dump_path'。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。
当set_dump_switch函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录名称添加mode参数值为前缀,详情请参见“**dump数据存盘说明**”。
未配置fpath时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,fpath和环境变量需要二选一。 | 否 | -| dump_tag | 设置数据目录名称。参数示例:dump_tag='dump_conv2d'。默认数据目录命名为ptdbg_dump_{version}。数据类型:str。
{version}为当前安装ptdbg_ascend工具版本。目录结构参见“**dump数据存盘说明**”。
配置该参数会将生成的`ptdbg_dump_{version}`目录名称变更为dump_tag配置的值,如`dump_conv2d_{version}`。 | 否 | - -**函数示例** - -- 示例1:设置数据目录路径 - - ```python - set_dump_path('./dump_path') - ``` - -- 示例2:设置数据目录名称 - - ```python - set_dump_path('./dump_path', dump_tag='dump_conv2d') - ``` - - -若以相同的数据目录多次dump,则会因同名导致覆盖;多次dump建议配置不同的dump_tag。 - -### register_hook - -**功能说明** - -注册工具钩子函数。在set_dump_path之后调用。 - -dump操作必选。 - -**函数原型** - -```python -register_hook(model, hook, overflow_nums=overflow_nums, dump_mode=dump_mode, dump_config=dump_config_file) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| model | 传入网络模型实例化的对象。参数示例: model=net,net为网络模型实例化的对象名称。数据类型:torch.nn.Module。 | 是 | -| hook | 注册工具的dump和溢出检测钩子。可取值overflow_check(表示溢出检测)和acc_cmp_dump(表示dump数据),二选一。数据类型:Callable。 | 是 | -| overflow_nums | 控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| dump_mode | 控制针对溢出API的dump模式,可取值"acl"或"api"。配置acl时,表示dump ACL级别的溢出数据,此时set_dump_path参数不生效,dump数据目录由dump_config的.json文件配置。参数示例:dump_mode="acl"。默认不配置,即dump API级别的溢出数据。数据类型:str。 | 否 | -| dump_config | acl dump的配置文件。dump_mode="acl"时,该参数必选;dump_mode="api"时,该参数不选。参数示例:dump_config='./dump.json'。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:注册工具钩子函数 - - ```python - register_hook(model, acc_cmp_dump) - ``` - -- 示例2:dump指定API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - ``` - - 需要配置set_dump_switch的mode="acl"以及scope指定为前向或反向API,请参见“**set_dump_switch”**的示例。 - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置dump数据目录。 - -- 示例3:溢出检测dump - - ```python - register_hook(model, overflow_check, overflow_nums=3) - ``` - - dump执行时会在set_dump_path的fpath参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例4:dump指定API的ACL级别溢出数据 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### set_dump_switch - -**功能说明** - -设置dump范围。建议在register_hook函数之后的脚本内任意位置插入,但进行精度问题排查建议参照“场景化示例 > 单卡场景精度比对”章节的顺序,先从第一个迭代开始的位置调用并dump整网数据。 - -dump操作必选。 - -**函数原型** - -```python -def set_dump_switch(switch, mode="all", scope=[], api_list=[], filter_switch="OFF", dump_mode=["all"], summary_only=False): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| --------------- | ------------------------------------------------------------ | -------- | -| switch | dump开关。可取值"ON"或"OFF"。须在选定dump开始的位置配置set_dump_switch("ON");dump结束的位置设置set_dump_switch("OFF")。数据类型:str。 | 是 | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"all"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围。参数示例:scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| dump_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例dump_mode=["backward"]或dump_mode=["forward", "backward"]。默认为all,即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:List[str]。 | 否 | -| summary_only | dump 
npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | - -**推荐配置** - -```python -set_dump_switch("ON", mode="api_stack", filter_switch="OFF") -``` - -开启dump数据和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量。 - -**函数示例** - -set_dump_switch可配置多种dump模式,示例如下: - -说明:以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -- 示例1:dump指定API列表 - - ```python - set_dump_switch("ON", mode="list", scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - set_dump_switch("ON", mode="range", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - set_dump_switch("ON", mode="stack", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Tensor.permute.1.forward"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件。 - -- 示例4:dump指定反向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"]) - set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional.conv2d.1.backward.input.0.npy"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件,并通过set_backward_input设置反向API输入的.npy文件。 - -- 示例5:dump指定某一类API的API级别输入输出数据 - - ```python - set_dump_switch("ON", mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例6:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - set_dump_switch("ON", mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例7: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - set_dump_switch("ON", filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例8:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - set_dump_switch("ON", dump_mode=["backward"]) - ``` - -- 示例9:仅dump pkl文件 - - ```python - set_dump_switch("ON", summary_only=True) - ``` - -以上示例均需要在结束dump的位置插入set_dump_switch("OFF")。 - -set_dump_switch配置mode为all或api_stack时,结束dump后,在dump目录下会自动生成compare_data.py比对脚本模板,示例如下: - -```python -from ptdbg_ascend import compare -from ptdbg_ascend.common.file_check_util import FileChecker -import argparse -import os.path - -pkl_path = "%s" -dump_data_dir = "%s" - -parser = argparse.ArgumentParser(description="compare data") -parser.add_argument("--npu_pkl_path", type=str, default=pkl_path, help="npu保存数据的pkl路径") -parser.add_argument("--bench_pkl_path", type=str, default=pkl_path, help="对比数据的pkl路径") -parser.add_argument("--output_path", type=str, default="./", help="导出对比数据的路径") - -args = parser.parse_args() -npu_pkl_path = args.npu_pkl_path -bench_pkl_path = args.bench_pkl_path -output_path = args.output_path - -suffix = ".pkl" -npu_path_checker = FileChecker(npu_pkl_path, "file", "read", suffix) -npu_path_checker.common_check() -bench_path_checker = FileChecker(bench_pkl_path, "file", "read", suffix) -bench_path_checker.common_check() - -npu_dump_data_dir = npu_pkl_path[:-len(suffix)] -bench_dump_data_dir = bench_pkl_path[:-len(suffix)] -if not os.path.exists(npu_dump_data_dir) or not os.path.exists(bench_dump_data_dir): - npu_dump_data_dir = "" - bench_dump_data_dir = "" - -dump_path_param = { - "npu_pkl_path": npu_pkl_path, - "bench_pkl_path": bench_pkl_path, - "npu_dump_data_dir": 
npu_dump_data_dir, - "bench_dump_data_dir": bench_dump_data_dir, - "is_print_compare_log": True -} - -compare(dump_path_param, output_path=output_path, stack_mode=%s) -``` - -compare_data.py比对脚本模板可以直接使用命令行配置比对参数,不需要通过编辑compare_data.py文件来修改,示例如下: - -```bash -python3 compare_data.py --npu_pkl_path "./npu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --bench_pkl_path "./gpu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --output_path "./output_path" -``` - -该命令行支持--npu_pkl_path、--bench_pkl_path和--output三个比对参数,其中pkl_path两个参数配置后,脚本可以自动识别同级目录下的dump_data目录,若同级目录下不存在dump_data目录,则直接执行“**pkl文件比对**”。仅ptdbg_ascend 6.0或更高版本支持比对命令行配置比对参数。更多介绍请参见“**执行比对操作**”。 - -### set_overflow_check_switch - -**功能说明** - -置溢出检测范围。默认不配置该函数,全量进行溢出检测。 - -仅支持NPU环境。 - -**函数原型** - -```python -set_overflow_check_switch(switch, filter_switch='OFF') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| switch, | 检测开关。可取值"ON"或"OFF"。如果只在特定的step溢出检测,则在期望溢出检测的step位置开始前插入set_overflow_check_switch("ON"),在step结束的位置插入set_overflow_check_switch("OFF")。数据类型:str。 | 是 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:指定范围溢出检测 - - ```python - register_hook(model, overflow_check) - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,dump执行时会在当前目录自动生成ptdbg_dump_{version}目录,保存溢出数据。 - -- 示例2:前向API的ACL级别范围溢出检测 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置溢出数据目录。 - -### set_backward_input - -**功能说明** - -设置反向ACL级别dump时需要的反向输入的.npy文件。 - -**函数原型** - -```python -set_backward_input(backward_input) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional.conv2d.1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional.conv2d.1、backward和input字段的.npy文件。数据类型:str。 | 是 | - -**函数示例** - -```python -register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') -set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"]) -set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional.conv2d.1.backward.input.0.npy"]) -``` - -## dump.json配置文件说明 - -**dump.json配置示例** - -```python -{ - "dump": - { - "dump_list":[], - "dump_path":"./dump/output", - "dump_mode":"all", - "dump_op_switch":"on" - } -} -``` - -**dump.json参数说明** - -| 字段名 | 说明 | -| -------------- | ------------------------------------------------------------ | -| dump_list | 待dump数据的API模型。为空,无需配置。 | -| dump_path | dump数据文件存储到运行环境的目录,主要用于指定ACL dump数据路径。支持配置绝对路径或相对路径。dump_path须为已存在目录。 | -| dump_mode | dump数据模式,配置如下:
output:dump API的输出数据。默认值。
input:dump API的输入数据。
all:dump API的输入、输出数据。 | -| dump_op_switch | 单API模型dump数据开关,配置如下: * off:关闭单API模型dump,默认值。 * on:开启单API模型dump。 | - -**dump目录说明** - -配置register_hook的dump_config后,采集的dump数据会在{dump_path}/{time}/{deviceid}/{model_id}目录下生成,例如“/home/HwHiAiUser/output/20200808163566/0/0” - -```bash -├── 20230131172437 -│   └── 1 -│   ├── 0 -│   │   ├── Add.Add.45.0.1675157077183551 -│   │   ├── Cast.trans_Cast_0.31.0.1675157077159449 -│   │   ├── Cast.trans_Cast_5.43.0.1675157077180129 -│   │   ├── MatMul.MatMul.39.0.1675157077172961 -│   │   ├── Mul.Mul.29.0.1675157077155731 -│   │   ├── NPUAllocFloatStatus.NPUAllocFloatStatus.24.0.1675157077145262 -│   │   ├── TransData.trans_TransData_1.33.0.1675157077162791 -│   │   └── TransData.trans_TransData_4.41.0.1675157077176648 -│   ├── 1701737061 -│   │   └── Cast.trans_Cast_2.35.0.1675157077166214 -│   ├── 25 -│   │   └── NPUClearFloatStatus.NPUClearFloatStatus.26.0.1675157077150342 -│   └── 68 -│   └── TransData.trans_TransData_3.37.0.1675157077169473 -``` - -## 模块级精度数据dump - -### 总体说明 - -大模型场景下,通常不是简单的利用自动迁移能力实现GPU到NPU的训练脚本迁移,而是会对NPU网络进行一系列针对性的适配,因此,常常会造成迁移后的NPU模型存在部分子结构不能与GPU原始模型完全对应。模型结构不一致导致API调用类型及数量不一致,若直接按照API粒度进行精度数据dump和比对,则无法完全比对所有的API。 - -本节介绍的功能是对模型中的大粒度模块进行数据dump,使其比对时,对于无法以API粒度比对的模块可以直接以模块粒度进行比对。 - -模块指的是继承自nn.Module类模块,通常情况下这类模块就是一个小模型,可以被视为一个整体,dump数据时以模块为粒度进行dump。 - -### module_dump - -**功能说明** - -开启模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump(module, module_name) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------- | ------------------------------------------------------------ | -------- | -| module | 网络中实例化好的nn.Module类对象。数据类型:torch.nn.Module。 | 是 | -| module_name | 用户自定义的该model名称。主要用于dump数据文件的命名,便于在比对时识别模块级数据。数据类型:str。 | 是 | - -### module_dump_end - -**功能说明** - -结束模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump_end() -``` - -### 示例代码 - -```python -# 根据需要import包 -import os -import torch -import torch.nn as nn -import torch_npu -import torch.nn.functional as F -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") -# 定义一个简单的网络 -class ModuleOP(nn.Module): - def __init__(self) -> None: - super().__init__() - self.linear_1 = nn.Linear(in_features=8, out_features=4) - self.linear_2 = nn.Linear(in_features=4, out_features=2) - def forward(self, x): - x1 = self.linear_1(x) - x2 = self.linear_2(x1) - r1 = F.relu(x2) - return r1 - -if __name__ == "__main__": - module = ModuleOP() - - # 注册工具 - pdbg = PrecisionDebugger("./dump_data/npu", hook_name="dump") - pdbg.start() - - x = torch.randn(10, 8) - module_dump(module, "MyModuleOP") # 开启模块级精度数据dump - out = module(x) - module_dump_end() # 结束模块级精度数据dump - loss = out.sum() - loss.backward() - pdbg.stop() -``` - -## dump数据存盘说明 - -dump结果目录结构示例如下: - -```bash -├── dump_path -│ └── ptdbg_dump_{version} -│ ├── step0 -│ | ├── rank0 -│ | │ ├── dump -| | | | ├── Tensor.permute.1.forward.npy -| | | | ├── MyModule.0.forward.input.npy # 开启模块级精度数据dump时存在模块级的dump数据文件 -| | | | ... -| | | | └── Fcuntion.linear.5.backward.output.npy -│ | │ └── dump.pkl -│ | ├── rank1 -| | | ├── dump -| | | | └── ... -| | | └── dump.pkl -│ | ├── ... -│ | | -| | └── rank7 -│ ├── step1 -│ | ├── ... 
-│ ├── step2 -``` - -dump过程中,npy文件在对应算子或者模块被执行后就会落盘,而pkl文件则需要在正常执行PrecisionDebugger.stop()或set_dump_switch("OFF")后才会被落盘保存,异常的程序终止会保存终止前被执行算子的相关npy文件,但是不会生成pkl文件。 - -其中`ptdbg_dump_{version}`为默认命名,debugger方式dump不支持修改该文件夹名称,使用set_dump_path函数则支持通过dump_tag参数修改文件夹名称;rank为设备上各卡的ID,每张卡上dump的数据会生成对应dump目录。 - -**精度比对dump场景** - -精度比对dump场景的结果如下: - -* dump.pkl文件:包含dump数据的API名称(命名格式为:`{api_type}.{api_name}.{API调用次数}.{前向反向}.{input/output}.{参数序号}`)、dtype、 shape、各数据的max、min、mean、L2norm统计信息以及当配置summary_mode="md5"时的md5数据。 - - 其中,“参数序号”表示该API下的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个子参数;L2norm表示2范数(平方根)。 - -* dump目录:目录下为npy格式的dump数据。 - - npy文件保存的前缀和PyTorch对应关系如下 - - | 前缀 | Torch模块 | - | ----------- | ------------------- | - | Tensor | torch.Tensor | - | Torch | torch | - | Functional | torch.nn.functional | - | NPU | NPU亲和算子 | - | VF | torch._VF | - | Aten | torch.ops.aten | - | Distributed | torch.distributed | - -当configure_hook或set_dump_switch配置mode参数(例如:mode="api_stack" )时,dump结果的文件名会添加api_stack前缀,dump结果如下: - -* api_stack_dump.pkl -* api_stack_dump目录 - -**溢出检测dump场景** - -PrecisionDebugger模块的hook_name参数或register_hook函数设置了overflow_check时,检测API溢出,dump结果的文件名格式为:`{api_type}.{api_name}.{API调用次数}.{前向反向}.{当前溢出次数}`,dump结果示例如下: - -* `Tensor_add_1_forward_1.pkl` -* `Tensor_add_1_forward_1`目录 - -## 工具支持的API列表 - -ptdbug_ascend工具维护固定的API支持列表,若需要删除或增加dump的API,可以在[support_wrap_ops.yaml](../src/python/ptdbg_ascend/hook_module/support_wrap_ops.yaml)文件内手动修改,如下示例: - -```bash -functional: # functional为算子类别,找到对应的类别,在该类别下按照下列格式删除或添加API - - conv1d - - conv2d - - conv3d -``` - -## CPU或GPU与NPU精度数据比对 - -### 总体说明 - -- 本节主要介绍CPU或GPU与NPU精度数据比对的函数以及示例。 - -- 比对函数均通过单独创建精度比对脚本执行,可支持单卡和多卡场景的精度数据比对。 - -- 工具性能:比对数据量较小时(参考值单份文件小于10GB),参考比对速度0.1GB/s;比对数据量较大时,参考比对速度0.3GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。比对速度的计算方式:两份比对文件大小/比对耗时。 - -### 约束 - -- NPU自研API,在CPU或GPU若没有对应的API,该API的dump数据不比对。 - -- NPU与CPU或GPU的计算结果误差可能会随着模型的执行不断累积,最终会出现同一个API因为输入的数据差异较大而无法比对的情况。 - -- CPU或GPU与NPU中两个相同的API会因为调用次数不同导致无法比对或比对到错误的API,不影响整体运行,该API忽略。 - -### compare_distributed - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,支持单卡和多卡,可同时比对多卡的dump数据。多机场景需要每个设备单独执行比对操作。可自动检索和匹配对应卡和进程所dump的数据文件,再调用compare进行比对。单机单卡时与compare函数二选一。 - -**函数原型** - -```python -compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| npu_dump_dir | 配置NPU环境下的dump目录。dump数据目录须指定到step级。参数示例:'./npu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| bench_dump_dir | 配置CPU、GPU或NPU环境下的dump目录。参数示例:'./gpu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.csv`。数据类型:str。 | 是 | -| **kwargs | 支持compare的所有可选参数。 | 否 | - -**函数示例** - -创建比对脚本,例如compare_distributed.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') -``` - -dump数据目录须指定到step级。 - -### compare - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,仅支持单机单卡。 - -**函数原型** - -```python -compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_match=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------ | 
------------------------------------------------------------ | -------- | -| input_param | 配置dump数据文件及目录。数据类型:dict。配置参数包括:
"npu_pkl_path":指定NPU dump目录下的.pkl文件。参数示例:"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
"bench_pkl_path":指定CPU、GPU或NPU dump目录下的.pkl文件。参数示例:"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
"npu_dump_data_dir":"指定NPU dump目录下的dump数据目录。参数示例:"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
"bench_dump_data_dir":"指定CPU、GPU或NPU dump目录下的dump数据目录。参数示例:"npu_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
"is_print_compare_log":配置是否开启日志打屏。可取值True或False。可选。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。参数示例:"./output_path",默认为"./"。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。数据类型:str。 | 否 | -| stack_mode | 配置stack_mode的开关。仅当dump数据时配置debugger.configure_hook或set_dump_switch的mode="api_stack"时需要开启。可取值True或False,参数示例:stack_mode=True,默认为False。数据类型:bool。 | 否 | -| auto_analyze | 自动精度分析,开启后工具自动针对比对结果进行分析,识别到第一个精度不达标节点(在比对结果文件中的“Accuracy Reached or Not”列显示为No),并给出问题可能产生的原因(打屏展示并生成advisor_{timestamp}.txt文件)。可取值True或False,参数示例:auto_analyze=False,默认为True。数据类型:bool。 | 否 | -| fuzzy_match | 模糊匹配。开启后,对于网络中同一层级且命名仅调用次数不同的API,可匹配并进行比对。可取值True或False,参数示例:fuzzy_match=True,默认为False。数据类型:bool。 | 否 | - -**函数示例** - -单机单卡场景下创建比对脚本,例如compare.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import compare -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output_path", stack_mode=True) -``` - -### pkl文件比对 - -若使用**compare**或**compare_distributed**函数创建的比对脚本中,input_param参数只配置了npu_pkl_path和bench_pkl_path或使用summary_only、summary_mode(取值为md5或summary)方式dump时,可以进行pkl文件的比对,此时比对dump.pkl文件中的统计信息,开启后的比对结果文件生成Max diff、Min diff、Mean diff和L2norm diff,表示NPU dump数据中API的输入或输出与标杆数据输入或输出的最大值、最小值、平均值以及L2范数的差。可以通过该值判断API是否存在精度问题:当某个API的输入和输出的Max diff、Min diff、Mean diff和L2norm diff均为0或无限趋于0,那么可以判断该API无精度问题,反之则可能存在精度问题。 - -**比对脚本示例** - -以compare.py为例。 - -```python -from ptdbg_ascend import compare -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output_path", stack_mode=True) -``` - -**比对结果** - -pkl文件比对同样生成`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.csv`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.csv`主要有如下两种情况: - -- configure_hook配置summary_only=True、summary_mode=summary或不配置前面两个参数直接比对pkl文件: - - ![compare_result_pkl](./img/compare_result_pkl.png) - - 上图是对pkl文件中NPU及标杆API的统计信息进行比对,判断可能存在精度问题的API,文件中记录NPU及标杆API的基本信息和统计信息,其中需要关注Result列,包含结果:Waring(NPU与标杆统计信息的比对中存在相对误差大于0.5,则需要重点检查该API);为空(相对误差小于等于0.5,可以不需要重点关注,但不代表不存在精度问题);Nan(表示统计信息数据没有匹配上)。 - -- configure_hook配置summary_mode=md5: - - ![compare_result_pkl_md5.png](./img/compare_result_pkl_md5.png.png) - - 上图是对pkl文件中NPU及标杆API的MD5信息进行比对,判断API数据的完整性,文件中记录NPU及标杆API的基本信息和MD5信息,其中需要关注Result列,包含结果:Pass(表示NPU与标杆的MD5值一致,即API数据完整);Different(表示NPU与标杆的MD5值不一致,即API数据不完全一致,可以通过NPU_Stack_Info列API调用栈查询该API的详细信息);Nan(表示MD5信息数据没有匹配上)。 - -### parse - -**功能说明** - -解析并提取dump信息中的堆栈信息及数据统计信息。 - -**函数原型** - -```python -parse(pkl_file, module_name_prefix) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------------ | ------------------------------------------------------------ | -------- | -| pkl_file | 指定dump数据文件中的pkl文件名。参数示例:"./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl"。数据类型:str。 | 是 | -| module_name_prefix | 指定待提取的API接口前缀。参数示例:"Torch.norm.1.forward"。数据类型:str。 | 是 | - -**函数示例** - -创建堆栈信息及数据统计信息提取脚本,例如parse.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl", "Torch.batch.normal.1.forward") 
-``` - -### 执行比对操作 - -比对操作通过执行比对脚本启动,根据不同的比对脚本分为如下场景: - -- dump数据时自动生成比对脚本模板,脚本名为compare_data.py,该脚本模板也可以直接手动创建: - - ```python - from ptdbg_ascend import compare - from ptdbg_ascend.common.file_check_util import FileChecker - import argparse - import os.path - - pkl_path = "%s" - dump_data_dir = "%s" - - parser = argparse.ArgumentParser(description="compare data") - parser.add_argument("--npu_pkl_path", type=str, default=pkl_path, help="npu保存数据的pkl路径") - parser.add_argument("--bench_pkl_path", type=str, default=pkl_path, help="对比数据的pkl路径") - parser.add_argument("--output_path", type=str, default="./", help="导出对比数据的路径") - - args = parser.parse_args() - npu_pkl_path = args.npu_pkl_path - bench_pkl_path = args.bench_pkl_path - output_path = args.output_path - - suffix = ".pkl" - npu_path_checker = FileChecker(npu_pkl_path, "file", "read", suffix) - npu_path_checker.common_check() - bench_path_checker = FileChecker(bench_pkl_path, "file", "read", suffix) - bench_path_checker.common_check() - - npu_dump_data_dir = npu_pkl_path[:-len(suffix)] - bench_dump_data_dir = bench_pkl_path[:-len(suffix)] - if not os.path.exists(npu_dump_data_dir) or not os.path.exists(bench_dump_data_dir): - npu_dump_data_dir = "" - bench_dump_data_dir = "" - - dump_path_param = { - "npu_pkl_path": npu_pkl_path, - "bench_pkl_path": bench_pkl_path, - "npu_dump_data_dir": npu_dump_data_dir, - "bench_dump_data_dir": bench_dump_data_dir, - "is_print_compare_log": True - } - - compare(dump_path_param, output_path=output_path, stack_mode=%s) - ``` - - 执行如下命令启动比对操作: - - ```bash - python3 compare_data.py --npu_pkl_path "npu_pkl_path" --bench_pkl_path "bench_pkl_path" --output_path "output_path" - ``` - - 命令行示例:python3 compare_data.py --npu_pkl_path "./npu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --bench_pkl_path "./gpu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --output_path "./output" - - - 该命令行支持--npu_pkl_path、--bench_pkl_path和--output三个**命令行比对参数**,其中pkl_path两个参数配置后,脚本可以自动识别同级目录下的dump_data目录,若同级目录下不存在dump_data目录,则直接执行“**pkl文件比对**”。 - - **命令行比对参数**的优先级高于compare.py比对脚本内的参数,配置命令行比对参数后,不需要通过编辑compare_data.py文件来修改比对参数。 - - **命令行比对参数**均为可选,但若未配置pkl_path两个参数,则需要在比对脚本中配置。 - - 仅ptdbg_ascend 6.0或更高版本支持**命令行比对参数**。 - - | 参数 | 说明 | 是否必选 | - | ---------------- | ------------------------------------------------------------ | -------- | - | --npu_pkl_path | 指定NPU dump目录下的.pkl文件。参数示例:--npu_pkl_path "./npu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl"。 | 否 | - | --bench_pkl_path | 指定CPU、GPU或NPU dump目录下的.pkl文件。参数示例:--bench_pkl_path "./gpu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" | 否 | - | --output_path | 配置比对结果csv文件存盘目录。参数示例:--output_path "./output",默认为"./"。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。 | 否 | - -- 手动创建比对脚本,自定义脚本名为compare.py: - - ```python - from ptdbg_ascend import compare - dump_result_param={ - "npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, output_path="./output_path", stack_mode=True) - ``` - - 执行如下命令启动比对操作: - - ```bash - python3 compare.py - ``` - -### 计算精度评价指标 - -PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度评价指标判断API在运行时是否存在精度问题。 - -计算精度评价指标: - -1. Cosine:通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。在计算中可能会存在nan,主要由于可能会出现其中一个向量为0。 - -2. 
MaxAbsErr:当最大绝对误差越接近0表示其计算的误差越小,实际可接受阈值为小于0.001。 - -3. MaxRelativeErr:当最大相对误差越接近0表示其计算的误差越小。 - - 当dump数据中存在0或Nan时,比对结果中最大相对误差则出现inf或Nan的情况,属于正常现象。 - -4. One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指NPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 - -精度比对结果csv文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: - -1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 -2. Cosine < 0.9,精度不达标,标记为“No”。 -3. MaxAbsError > 1,精度不达标,标记为“No”。 -5. 其余情况下记为精度达标,标记为“Yes”。 - -## ptdbg_ascend.parse数据解析功能 - -ptdbg_ascend.parse为命令行交互式界面解析工具,提供更多的数据解析功能并且展示结果。 - -使用场景:本工具主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -### 进入parse交互式界面 - -安装ptdbg_ascend工具后,可以通过使用命令 **python -m ptdbg_ascend.parse** 进入交互式界面,如下所示: - -```bash -python -m ptdbg_ascend.parse -Parse >>> -``` - -可在parse的界面中执行Shell命令,以及如下场景的相关解析命令: - -- 支持指定ACL层级算子数据比对。 -- 支持指定ACL层级算子数据转换及展示。 -- 支持交互式指定pkl文件中API对应dump数据查看。 -- 支持API进行可选层级比对和打印(统计级和像素级)。 - -Ctrl+C可以退出parse交互式界面。不退出parse交互式界面若需要执行非该界面下的内置Shell命令,且命令与parse交互式界面命令冲突时,非该界面命令需要使用run命令,在相关命令前加上run前缀,如下示例: - -```bash -python -m ptdbg_ascend.parse -Parse >>> run vim cli.py -Parse >>> vim cli.py -``` - -以上各场景详细介绍请参见下文章节。 - -### ACL层级算子数据批量转换 - -本功能会将原有待比对dump数据目录下的dump数据按照算子名和时间戳进行梳理并分类,之后再将dump数据转为为npy文件。 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下比对命令进行数据转换。 - -```bash -cad -m my_dump_path [-out output_path] [-asc msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| -m | 待转换ACL dump数据目录。需要指定到ACL dump数据的deviceid级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_convert。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -asc | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py。 | 否 | - -**示例** - -```bash -# 传入待比对数据目录 -Parse >>> cad -m /home/xxx/my_dump_path/20000124003856/0 -# 转换结果打印 -...... -╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -# 转换前的dump文件 -│ SrcFile: /home/xxx/my_dump_path/20000124003856/0/272/TransData.trans_TransData_22.112.21.948645536672764 │ -# 转换后的npy文件 -│ - TransData.trans_TransData_22.112.21.948645536672764.output.0.npy │ -│ - TransData.trans_TransData_22.112.21.948645536672764.input.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -...... -[INFO] The comparison result have been written to "./parse_data/acl_batch_convert". -``` - -输出结果: - -原dump数据目录: - -```bash -├── /home/xxx/my_dump_path/20000124003856/0/ -│ ├── 272 -│ │ ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} -│ │ ... -│ ├── 512 -│ ... -``` - -转换后: - -```bash -├── ./parse_data/acl_batch_convert/{timestamp} -│ ├── {op_name1} -│ │ ├── {timestamp1} -│ │ | ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input/output}.{参数序号}.npy -│ │ | │ ... -│ │ ├── {timestamp2} -│ │ | ... -│ ├── {op_name2} -│ ├── ... 
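-# 转换结果按算子名({op_name})与时间戳({timestamp})分层归类,npy文件名在原dump文件名后追加{input/output}.{参数序号}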
-``` - -### ACL层级算子数据比对 - -本功能主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -本功能支持批量比对,若需要进行批量比对,需要先将两份待比对的NPU ACL层级dump数据进行“**ACL层级算子数据批量转换**”,可以使两份数据更好的匹配;若直接进行dump数据的比对,建议只比对单个dump数据文件。 - -输入以下比对命令进行数据比对。 - -```bash -vc -m my_dump_path -g golden_dump_path [-out output_path] [-cmp_path msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -m | 待比对ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -g | 标杆ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_comapre。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -输出结果:batch_compare_{timestamp}.csv文件。 - -**示例** - -```bash -# 传入待比对数据目录以及标杆数据目录 -Parse >>> vc -m ./my_dump_path -g ./golden_data_path -[INFO]Compare result is saved in : parse_data/acl_batch_comapre/batch_compare_1707271118.csv -``` - -### ACL算子数据的npy转换 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下转换命令进行数据转换, 将ACL级别dump数据转为npy文件。 - -```bash -dc -n file_name/file_path [-f format] [-out output_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -n | 需转换的dump数据文件或dump数据文件目录。 | 是 | -| -f | 开启format转换,指定该参数时需要配置format格式。当前内置的Format转换支持如下类型:
FRACTAL_NZ转换成NCHW
FRACTAL_NZ转换成NHWC
FRACTAL_NZ转换成ND
HWCN转换成FRACTAL_Z
HWCN转换成NCHW
HWCN转换成NHWC
NC1HWC0转换成HWCN
NC1HWC0转换成NCHW
NC1HWC0转换成NHWC
NCHW转换成FRACTAL_Z
NCHW转换成NHWC
NHWC转换成FRACTAL_Z
NHWC转换成HWCN
NHWC转换成NCHW
NDC1HWC0转换成NCDHW | 否 | -| -out | 结果输出目录。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -[^]: 若传入单个dump文件,则转换单个文件,若传入dump文件目录则转换目录下所有dump文件。 - -- 输出结果:npy文件。 -- 若指定-out参数需要用户传入输出路径,并且路径需要已存在。 -- 若未指定输出目录, 则比对结束后将结果保存在默认目录 “./parse_data/convert_result”中,比对结束后会打印log提示输出结果存放路径及转换结果。 - -- 输入以下命令,展示npy数据统计信息。 - - ```bash - pt -n file_path - ``` - - | 参数名称 | 说明 | 是否必选 | - | -------- | ------------- | -------- | - | -n | npy文件路径。 | 是 | - - 打印统计信息:shape, dtype, max, min和mean。默认在npy文件路径下将该数据保存为txt文件。 - -**示例1** - -```bash -# 传入需转换的dump文件目录 -Parse >>> dc -n ./dump_data/ -...... -# 转换结果 -╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ SrcFile: ./dump_data/ -│ - Add.fp32.vars.add.2fp32.vars.Relu.9.31.5.1636595794731103.input.0.npy │ -│ - Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.output.0.npy │ -│ - Add.fp32.vars.add.2fp32.vars.Relu.9.31.5.1636595794731103.input.1.npy │ -│ - Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.1.npy │ -│ - Add.fp32.vars.add.3fp32.vars.Relu.12.40.5.1636595794846124.input.1.npy │ -│ - Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.0.npy │ -│ - Add.fp32.vars.add.3fp32.vars.Relu.12.40.5.1636595794846124.input.0.npy │ -│ - Add.fp32.vars.add.2fp32.vars.Relu.9.31.5.1636595794731103.output.0.npy │ -│ - Add.fp32.vars.add.3fp32.vars.Relu.12.40.5.1636595794846124.output.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -**示例2** - -```bash -# 查看某个dump数据块的数据信息 -# 默认会将数据中的tensor保存成 txt -Parse >>> pt -n ./parse_data/dump_convert/Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.output.0.npy -...... -# 打印统计信息 -[Shape: (1, 16, 56, 56, 16)] [Dtype: float16] [Max: 452.0] [Min: -408.5] [Mean: -3.809] -Path: ./parse_data/dump_convert/Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.0.npy -TextFile:./parse_data/dump_convert/Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.0.npy.txt -``` - -### pkl文件中指定API的dump数据信息查看 - -输入以下命令,解析并输出pkl文件中指定api的统计信息。 - -```bash -pk -f pkl_path -n api_name -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------- | -------- | -| -f | 指定pkl文件路径。 | 是 | -| -n | 指定API名称。 | 是 | - -- 输出结果:打印统计信息(shape, dtype, max和min mean)。 -- 若pkl文件中存在相应的堆栈信息,则会打印堆栈信息。 - -**示例** - -```bash -# 传入pkl文件及api名称 -Parse >>> pk -f ./torch_dump/ptdbg_v3.2/rank0/api_stack_dump.pkl -n Functional.conv2d.0.forward -...... 
-# 打印统计信息及堆栈(pkl文件不包含堆栈则不会打印堆栈) - -Statistic Info: - [Functional.conv2d.0.forward.input.0][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 1.576936960220337][min: -0.9757485389709473][mean: 0.4961632490158081] - [Functional.conv2d.0.forward.input.1][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 0.20064473152160645][min: -0.47102075815200806][mean: -0.20796933770179749] - [Functional.conv2d.0.forward.input.2][dtype: torch.float32][shape: [2]][max: 0.17380613088607788][min: -0.16853803396224976][mean: 0.0026340484619140625] - [Functional.conv2d.0.forward.output][dtype: torch.float32][shape: [2, 2, 1, 1]][max: 0.02364911139011383][min: -1.762906551361084][mean: -0.6710853576660156] -``` - -### API可选层级比对 - -输入以下命令, 进行统计级和像素级比对。 - -```bash -cn -m my_data*.npy -g gloden*.npy [-p num] [-al atol] [-rl rtol] -``` - -- 统计级比对:对tensor整体进行余弦值及相对误差的计算。 -- 像素级比对:对输入的两个npy文件进行逐元素比对。若两个tensor对应元素的相对误差或绝对误差大于**误差阈值**(-al和-rl配置)则被标记为错误数据。 - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------------------------------------- | -------- | -| -m | 待比对数据。 | 是 | -| -g | 标杆数据。 | 是 | -| -p | 设置比对结束后打印错误元素的个数,默认值20。 | 否 | -| -al | 判定数据存在精度问题的绝对误差阈值,默认0.001。 | 否 | -| -rl | 判定数据存在精度问题的相对误差阈值,默认0.001。 | 否 | -| -s | 将npy文件保存成txt文件,用于查看,默认开启。 | 否 | - -输出结果: - -- 统计级比对结果。 -- 两个文件的统计信息(shape, dtype, max, min和mean)。 -- 错误数据打印表格。 - -**示例** - -```bash -# 对比两个tensor的数据 -Parse >>> cn -m Add.InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.323.1619494134703053.output.0.npy -g InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.0.1619492699305998.npy -p 10 -s -al 0.002 -rl 0.005 - Error Item Table Top Item Table -┏━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┏━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ -┃ Index ┃ Left ┃ Right ┃ Diff ┃ ┃ Index ┃ Left ┃ Right ┃ Diff ┃ -┡━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ ┡━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ -│ 155 │ 0.024600908 │ 0.022271132 │ 0.002329776 │ │ 0 │ -0.9206961 │ -0.9222216 │ 0.0015255213 │ -│ 247 │ 0.015752593 │ 0.017937578 │ 0.0021849852 │ │ 1 │ -0.6416973 │ -0.64051837 │ 0.0011789203 │ -│ 282 │ -0.0101207765 │ -0.007852031 │ 0.0022687456 │ │ 2 │ -0.35383835 │ -0.35433492 │ 0.0004965663 │ -│ 292 │ 0.019581757 │ 0.02240482 │ 0.0028230622 │ │ 3 │ -0.18851271 │ -0.18883198 │ 0.00031927228 │ -│ 640 │ -0.06593232 │ -0.06874806 │ 0.0028157383 │ │ 4 │ -0.43508735 │ -0.43534422 │ 0.00025686622 │ -│ 1420 │ 0.09293677 │ 0.09586689 │ 0.0029301196 │ │ 5 │ 1.4447614 │ 1.4466647 │ 0.0019032955 │ -│ 1462 │ -0.085207745 │ -0.088047795 │ 0.0028400496 │ │ 6 │ -0.3455438 │ -0.3444429 │ 0.0011008978 │ -│ 1891 │ -0.03433288 │ -0.036525503 │ 0.002192624 │ │ 7 │ -0.6560242 │ -0.6564579 │ 0.0004336834 │ -│ 2033 │ 0.06828873 │ 0.07139922 │ 0.0031104907 │ │ 8 │ -2.6964858 │ -2.6975214 │ 0.0010356903 │ -│ 2246 │ -0.06376442 │ -0.06121233 │ 0.002552092 │ │ 9 │ -0.73746175 │ -0.73650354 │ 0.00095820427 │ -└───────┴───────────────┴──────────────┴──────────────┘ └───────┴─────────────┴─────────────┴───────────────┘ -╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ Left: | -│ |- NpyFile: ./dump/temp/decode/Add.InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.323.1619494134703053.output.0.npy | -│ |- TxtFile: ./dump/temp/decode/Add.InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.323.1619494134703053.output.0.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.846897] [Min: -8.368301] [Mean: -0.72565556] | -│ DstFile: │ -│ |- NpyFile: 
./dump/cpu/InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.0.1619492699305998.npy | -│ |- TxtFile: ./dump/cpu/InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.0.1619492699305998.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.8425903] [Min: -8.374472] [Mean: -0.7256237] │ -│ NumCnt: 655360 │ -│ AllClose: False │ -│ CosSim: 0.99999493 │ -│ ErrorPer: 0.023504638671875 (rl= 0.005, al= 0.002) │ -╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -## FAQ - -[FAQ](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md) diff --git "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T4.md" "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T4.md" deleted file mode 100644 index af73a56849588c1a080962f00249700aee9a3630..0000000000000000000000000000000000000000 --- "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.T4.md" +++ /dev/null @@ -1,2301 +0,0 @@ -# **PyTorch精度工具使用指南** - -本文主要介绍PyTorch精度工具ptdbg_ascend的使用以及精度比对场景示例。 - -ptdbg_ascend工具的原理及安装请参见《[PyTorch精度工具](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 - -ptdbg_ascend工具主要支持PyTorch API精度数据dump、溢出检测、精度比对以及parse数据解析功能。其中dump和溢出检测功能支持使用debugger和register_hook方式进行精度数据的dump和溢出检测,推荐使用debugger方式。 - -## PyTorch精度比对总体流程 - -1. 准备CPU或GPU训练工程。 - -2. 在环境下安装ptdbg_ascend工具。 - -3. 在训练脚本内插入ptdbg_ascend工具dump接口。 - -4. 执行训练dump数据。 - -5. 将CPU或GPU训练工程迁移为NPU训练工程。 - - 请参见《[PyTorch模型迁移和训练指南](https://www.hiascend.com/document/detail/zh/canncommercial/63RC1/modeldevpt/ptmigr/ptmigr_0001.html)》。 - -6. 在NPU环境下安装ptdbg_ascend工具。 - -7. 在NPU训练脚本内插入ptdbg_ascend工具dump接口。 - -8. NPU环境下执行训练dump数据。 - -9. 创建并配置精度比对脚本,例如compare.py。 - -10. 执行CPU或GPU dump与NPU dump数据的精度比对。 - -11. 比对结果分析。 - -## 快速入门(debugger方式) - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**debugger方式dump和溢出检测**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 单卡场景精度比对 - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 - - 对于模型数据庞大(比如达到T级别)的场景,不推荐直接dump整网比对,整网dump可能导致磁盘不足,需要预留足够的存储空间或者分多次dump。 - -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 - -3. 范围比对:对不符合精度标准的API重新dump详细信息。 - -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 - -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 - -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -2. 
比对整网数据。 - - 第1步中的NPU dump数据目录为npu_dump,假设GPU dump数据目录为gpu_dump;dump将生成pkl数据文件api_stack_dump.pkl和npy数据目录api_stack_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice)。 - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch.batch.normal的堆栈信息及数据统计信息 - parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", "Torch.batch.normal.1.forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - - - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor.permute.1.forward"], acl_config='./dump.json') - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - debugger.configure_hook(mode="acl", scope=["Functional.conv2d.1.backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward_input.0.npy"]) - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景 - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 
在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定前向API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=-1) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./overflow_dump", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional.conv2d.1.backward"], acl_config="./dump.json", backward_input=["./overflow_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward_input.0.npy"]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in '*.pkl'. - ``` - - 其中xxx times为用户设置的次数,*.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. 
SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## 场景化示例 - -本章节主要介绍通过ptdbg_ascend工具进行精度比对和分析,主要使用“**CPU或GPU及NPU精度数据dump**”和“**CPU或GPU与NPU精度数据比对**”章节中介绍的ptdbg_ascend工具接口。 - -### 多卡场景精度比对 - -精度工具支持多卡场景的精度比对,多卡场景的dump步骤与单卡场景完全一致,请参见“**单卡场景精度比对**”章节,不同的是多卡数据精度比对时需要使用“compare_distributed”函数进行比对。 - -**大模型场景下dump推荐使用debugger方式的手动模式。** - -如下示例: - -说明:多机多卡场景需要每个设备单独执行比对操作。 - -假设NPU dump npy数据目录为npu_dump/ptdbg_dump_v4.0,GPU dump npy数据目录为gpu_dump/ptdbg_dump_v4.0。 - -1. 创建比对脚本,例如compare_distributed.py,拷贝如下代码。 - - ```python - from ptdbg_ascend import * - compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') - ``` - - dump数据目录须指定到step级。 - -2. 执行比对: - - ```bash - python3 compare_distributed.py - ``` - -两次运行须用相同数量的卡,传入`compare_distributed`的两个文件夹下须有相同个数的rank文件夹,且不包含其他无关文件,否则将无法比对。 - -**多卡set_dump_path注意事项** - -多卡一般为多进程,须保证每个进程都正确调用PrecisionDebugger或set_dump_path,或把PrecisionDebugger或set_dump_path插入到import语句后,如: - -```python -from ptdbg_ascend import * -debugger = PrecisionDebugger(dump_path="./npu_dump", hook_name="dump", step=[0]) -``` - -或 - -```python -from ptdbg_ascend import * -seed_all() -set_dump_path('./dump_resnet') -``` - -如此可保证set_dump_path在每个进程都被调用。 - -**多卡register_hook注意事项** - -register_hook需要在set_dump_path之后调用,也需要在每个进程上被调用,建议在搬运模型数据到卡之后调用。识别方法如下: - -- 找到训练代码中遍历epoch的for循环或遍历数据集的for循环,把register_hook放到循环开始前即可。 -- 找到训练代码中调用DDP或者DistributedDataParallel的代码行,把register_hook放到该代码行所在的代码块之后。 -- 若代码中均无以上两种情况,需要保证register_hook在模型定义之后插入,并配置rank参数。rank参数获取rank_id请参见“**[rank_id获取方法](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/rank_id获取方法.md)**”。 - -### NPU vs NPU精度比对 - -对于NPU vs NPU场景,是针对同一模型,进行迭代(模型、API版本升级或设备硬件升级)时存在的精度下降问题,对比相同模型在迭代前后版本的API计算数值,进行问题定位。 - -一般情况下迭代涉及NPU自定义算子,因此,可以仅dump NPU自定义算子进行比对。比对精度问题分析请参见“**单卡场景精度比对**”章节。 - -工具当前支持dump NPU自定义算子如下: - -| 序号 | NPU自定义算子 | -| :--- | ----------------------------------------------- | -| 1 | torch_npu.one_ | -| 2 | torch_npu.npu_sort_v2 | -| 3 | torch_npu.npu_transpose | -| 4 | torch_npu.npu_broadcast | -| 5 | torch_npu.npu_dtype_cast | -| 6 | torch_npu.empty_with_format | -| 7 | torch_npu.npu_one_hot | -| 8 | torch_npu.npu_stride_add | -| 9 | torch_npu.npu_ps_roi_pooling | -| 10 | torch_npu.npu_roi_align | -| 11 | torch_npu.npu_nms_v4 | -| 12 | torch_npu.npu_iou | -| 13 | torch_npu.npu_nms_with_mask | -| 14 | torch_npu.npu_pad | -| 15 | torch_npu.npu_bounding_box_encode | -| 16 | torch_npu.npu_bounding_box_decode | -| 17 | torch_npu.npu_batch_nms | -| 18 | torch_npu.npu_slice | -| 19 | torch_npu._npu_dropout | -| 20 | torch_npu.npu_indexing | -| 21 | torch_npu.npu_ifmr | -| 22 | torch_npu.npu_max | -| 23 | torch_npu.npu_scatter | -| 24 | torch_npu.npu_layer_norm_eval | -| 25 | torch_npu.npu_alloc_float_status | -| 26 | torch_npu.npu_confusion_transpose | -| 27 | torch_npu.npu_bmmV2 | -| 28 | torch_npu.fast_gelu | -| 29 | torch_npu.npu_sub_sample | -| 30 | torch_npu.npu_deformable_conv2d | -| 31 | torch_npu.npu_mish | -| 32 | torch_npu.npu_anchor_response_flags | -| 33 | torch_npu.npu_yolo_boxes_encode | -| 34 | torch_npu.npu_grid_assign_positive | -| 35 | torch_npu.npu_normalize_batch | -| 36 | torch_npu.npu_masked_fill_range | -| 37 | torch_npu.npu_linear | -| 38 | torch_npu.npu_bert_apply_adam | -| 39 | torch_npu.npu_giou | -| 40 | torch_npu.npu_ciou | -| 41 | torch_npu.npu_diou | -| 42 | torch_npu.npu_sign_bits_pack | -| 43 | torch_npu.npu_sign_bits_unpack | -| 44 | 
torch_npu.npu_flash_attention | -| 45 | torch_npu.npu_scaled_masked_softmax | -| 46 | torch_npu.npu_rotary_mul | -| 47 | torch_npu.npu_roi_align | -| 48 | torch_npu.npu_roi_alignbk | -| 49 | torch_npu.npu_ptiou | -| 50 | torch_npu.npu_fusion_attention | -| 51 | torch_npu.npu_dropout_with_add_softmax | -| 52 | torch_npu.npu_random_choice_with_mask | -| 53 | torch_npu.npu_rotated_iou | -| 54 | torch_npu.npu_conv2d | -| 55 | torch_npu.npu_conv3d | -| 56 | torch_npu.npu_softmax_cross_entropy_with_logits | -| 57 | torch_npu.npu_all_gather_base_mm | -| 58 | torch_npu.npu_swiglu | -| 59 | torch_npu.npu_rms_norm | -| 60 | torch_npu.npu_mm_reduce_scatter_base | -| 61 | torch_npu.npu_mm_all_reduce_base | -| 62 | torch_npu.npu_conv_transpose2d | -| 63 | torch_npu.npu_convolution | -| 64 | torch_npu.npu_convolution_transpose | -| 65 | torch_npu.npu_min | -| 66 | torch_npu.npu_nms_rotated | -| 67 | torch_npu.npu_reshape | -| 68 | torch_npu.npu_rotated_box_decode | -| 69 | torch_npu.npu_rotated_box_encode | -| 70 | torch_npu.npu_rotated_overlaps | -| 71 | torch_npu.npu_silu | -| 72 | torch_npu.npu_fused_attention_score | -| 73 | torch_npu.npu_multi_head_attention | -| 74 | torch_npu.npu_gru | -| 75 | torch_npu.npu_incre_flash_attention | -| 76 | torch_npu.npu_prompt_flash_attention | -| 77 | torch_npu.npu_lstm | -| 78 | torch_npu.npu_apply_adam | - -### 通信API的数据dump - -通信类API数据可以使用全量dump方式获取,若只dump通信类API数据,可以使用如下示例: - -```python -debugger.configure_hook(mode="api_list", api_list=["distributed"]) -``` - -或 - -```python -set_dump_switch("ON", mode="api_list", api_list=["distributed"]) -``` - -通信类API支持列表: - -| 序号 | Distributed | -| :--- | -------------------- | -| 1 | send | -| 2 | recv | -| 3 | broadcast | -| 4 | all_reduce | -| 5 | reduce | -| 6 | all_gather | -| 7 | gather | -| 8 | isend | -| 9 | irecv | -| 10 | scatter | -| 11 | reduce_scatter | -| 12 | _reduce_scatter_base | -| 13 | _all_gather_base | - -### 单卡场景精度比对(register_hook方式) - -**精度分析建议** - -PyTorch训练场景的精度问题分析建议参考以下思路进行精度比对和比对结果分析: - -1. 整网比对:dump整网数据并进行精度比对,初步定位异常范围。 -2. 缩小范围:根据Accuracy Reached or Not找出不符合精度标准的API。 -3. 范围比对:对不符合精度标准的API重新dump。 -4. 分析原因并优化:分析API精度不符合标准的原因并进行优化调整。 -5. 整网比对:重新进行整网比对,判断优化后的API是否已符合精度标准以及是否出现新的精度问题。 -6. 重复1~5步,直到不存在精度问题为止。 - -**精度分析示例** - -1. dump整网数据。 - - 分别dump CPU或GPU以及NPU数据,在PyTorch训练脚本插入dump接口,示例代码如下(下面以NPU为例,CPU或GPU dump基本相同): - - ```python - from ptdbg_ascend import * - - # 在main函数开始前固定随机数 - seed_all() - - # 配置dump数据目录路径和名称 - set_dump_path("./npu_dump", dump_tag='all') - - # 注册dump回调函数 - register_hook(model, acc_cmp_dump) - - ... - - # 在第一个迭代开始的位置开启dump和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="api_stack", filter_switch="OFF") - - ... - - # 在第一个迭代结束的位置关闭dump - set_dump_switch("OFF") - ``` - -2. 比对整网数据。 - - 第1步中的NPU dump数据文件为npu_dump.pkl,假设NPU dump npy数据目录为npu_dump,GPU dump数据文件为gpu_dump.pkl,GPU dump npy数据目录为gpu_dump。 - - 创建并配置精度比对脚本,以创建compare.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - dump_result_param={ - "npu_pkl_path": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/all_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/all_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, "./output", stack_mode=True) - ``` - - 执行比对: - - ```bash - python3 compare.py - ``` - - 在output目录下生成结果文件,包括:`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt` - -3. 找出存在问题的API。 - - 1. 
根据`advisor_{timestamp}.txt`或打屏信息的提示,可找到存在精度问题的算子(Suspect Nodes)和专家建议(Expert Advice) - - ![auto_analyze_log](img/auto_analyze_log.png) - - 2. 根据第2步结果文件`compare_result_{timestamp}.csv`中的Accuracy Reached or No字段显示为NO的API,针对该API执行后续比对操作,分析该API存在的精度问题。 - -4. (可选)提取指定API的堆栈信息和dump数据统计信息。 - - 通过parse接口可以清晰的显示特定API的堆栈信息和dump数据统计信息,结合堆栈信息分析代码中可能存在的精度问题。 - - 创建并配置提取脚本,以创建parse.py为例,示例代码如下: - - ```python - from ptdbg_ascend import * - - # 提取dump信息中第1次调用的API:Torch.batch.normal的堆栈信息及数据统计信息 - parse("./npu_dump/all_v4.0/step0/rank0/api_stack_dump.pkl", "Torch.batch.normal.1.forward") - ``` - - 执行提取: - - ```bash - python3 parse.py - ``` - -5. (可选)指定API对其底层ACL数据进行dump。 - - - dump指定前向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='forward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定前向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Tensor.permute.1.forward"], filter_switch="OFF") - - ... - - set_dump_switch("OFF") - ``` - - - dump指定反向API的ACL级别数据 - - ```python - from ptdbg_ascend import * - - # 固定随机数,开启确定性计算 - seed_all(mode=True) - set_dump_path("./dump_path", dump_tag='backward') - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - - # dump指定反向API的ACL级别数据、bool和整型的tensor以及浮点、bool和整型的标量 - set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"], filter_switch="OFF") - set_backward_input(["./npu_dump/all_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward.input.0.npy"]) - - ... - - set_dump_switch("OFF") - ``` - -6. (可选)重新比对。 - - 根据第4或5步的dump数据重新配置compare.py并执行比对,可以对单API模型进行问题复现。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 - -### 溢出检测场景(register_hook方式) - -溢出检测是针对NPU的PyTorch API,检测是否存在溢出的情况。当前仅支持识别aicore浮点溢出。 - -溢出检测原理:针对溢出阶段,开启acl dump模式,重新对溢出阶段执行,落盘数据。 - -建议按照如下步骤操作: - -1. 在NPU环境下安装ptdbg_ascend工具。 - -2. 在NPU训练脚本内插入ptdbg_ascend工具溢出检测接口。 - - - 示例1:全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check, overflow_nums=3) - - ... - ``` - - 多卡使用时各卡单独计算溢出次数。 - - - 示例2:dump指定API的ACL级别溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # dump指定API的ACL级别溢出数据 - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - - # 在期望溢出检测的step位置开始前打开溢出检测开关 - set_overflow_check_switch("ON") - - ... - - # 在step结束的位置关闭溢出检测开关 - set_overflow_check_switch("OFF") - - ... - ``` - - - 示例3:dump指定反向API的ACL级别的溢出数据 - - 1. 进行全量溢出检测 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... - # 设置检测到3次溢出后退出训练 - register_hook(model, overflow_check) - - ... - ``` - - 2. dump指定反向API的ACL级别的溢出数据 - - ```python - from ptdbg_ascend import * - seed_all() - # 配置溢出数据目录路径和名称 - set_dump_path("./overflow_dump") - ... 
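-        # 注意:set_backward_input传入的.npy文件为此前首次dump训练得到的该反向API输入数据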
- # dump指定反向API的ACL级别溢出数据 - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"]) - set_backward_input(["./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump/Functional.conv2d.1.backward.input.0.npy"]) - ``` - - 针对前向溢出API,可以通过overflow_nums,配置允许的溢出次数,并将每次溢出API的全部ACL数据dump下来,到达指定溢出次数后停止,停止后会看到堆栈打印包含如下字段。 - - ```bash - ValueError: [overflow xxx times]: dump file is saved in '*.pkl'. - ``` - - 其中xxx times为用户设置的次数,*.pkl为文件生成路径。 - -3. NPU环境下执行训练dump溢出数据。 - - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过 [Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 - - 精度预检工具执行命令如下: - - ```bash - # 下载att代码仓后执行如下命令 - export PYTHONPATH=$PYTHONPATH:$ATT_HOME/debug/accuracy_tools/ - cd $ATT_HOME/debug/accuracy_tools/api_accuracy_checker/run_ut - python run_overflow_check.py -forward ./forward_info_0.json - ``` - - 反向过程溢出的API暂不支持精度预检功能。 - - 当重复执行溢出检测dump操作时,需要删除上一次dump目录下的溢出检测dump数据,否则将因重名而报错。 - -**注意事项** - -* dump_mode="acl"场景下,会增加npu的内存消耗,请谨慎开启。 -* 部分API存在调用嵌套关系,比如functional.batch_norm实际调用torch.batch_norm,该场景会影响acl init初始化多次,导致功能异常。 -* 混合精度动态loss scale场景下,正常训练会有"Gradient overflow. SKipping step"日志,添加溢出检测后日志消失,可以通过设置环境变量export OVERFLOW_DEBUG_MODE_ENABLE=1,并将register_hook位置调整amp.initialize之前解决。此功能需要cann包配套支持,不支持版本执行报错EZ3003。 - -## debugger方式dump和溢出检测(推荐) - -### PrecisionDebugger模块 - -**功能说明** - -PrecisionDebugger模块包含dump和溢出检测功能的总体配置项。可以指定dump目录,设置dump或溢出检测功能,指定dump的卡和迭代。 - -可以在from ptdbg_ascend import *和模型初始化之间的任意位置添加该模块。 - -**原型** - -```python -PrecisionDebugger(dump_path=None, hook_name=None, rank=None, step=[], enable_dataloader=False, model=None): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| dump_path | 设置dump数据目录路径,参数示例:"./dump_path"。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。
当**configure_hook**函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录的名称会以mode参数值作为前缀,详情请参见“**dump数据存盘说明**”。
未配置dump_path时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时dump数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,dump_path和环境变量需要二选一。 | 否 | -| hook_name | dump模式,可取值"dump"和"overflow_check",表示dump和溢出检测功能,二选一。参数示例:hook_name="dump"。数据类型:str。 | 是 | -| rank | 指定对某张卡上的数据进行dump或溢出检测,默认未配置(表示dump所有卡的数据),须根据实际卡的Rank ID配置。应配置为大于0的正整数,且须根据实际卡的Rank ID配置,若所配置的值大于实际训练所运行的卡的Rank ID,则dump数据为空,比如当前环境Rank ID为0到7,实际训练运行0到3卡,此时若配置Rank ID为4或不存在的10等其他值,此时dump数据为空。数据类型:int。 | 否 | -| step | 指定dump某个step的数据,默认未配置,表示dump所有step数据。dump特定step时,须指定为训练脚本中存在的step。step为list格式,可配置逐个step,例如:step=[0,1,2];也可以配置step范围,例如:step=list(range(0,9)),表示dump第0到第8个step。数据类型:List[int]。 | 否 | -| enable_dataloader | 自动控制开关,可取值True(开启)或False(关闭),默认为False。配置为True后自动识别dump step参数指定的迭代,并在该迭代执行完成后退出训练,此时start和stop函数可不配置,开启该开关要求训练脚本是通过torch.utils.data.dataloader方式加载数据;配置为False则需要配置start和stop函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()。数据类型:bool。 | 否 | -| model | 开启init dump模式,传入网络模型实例化的对象,配置该参数后,dump操作仅dump网络中init方法里调用的方法(nn.Module类),不会对所有API进行dump。参数示例: model=net,net为网络模型实例化的对象名称。默认未配置。
配置该参数时,PrecisionDebugger模块请在模型实例化之后调用。数据类型:torch.nn.Module。
该模式不支持“溢出检测”、”ACL级别数据dump“和“模块级精度数据dump”。此模式下dump文件名前缀为网络中定义的模块名或层名。 | 否 | - -#### init dump模式示例代码和数据落盘说明 - -**示例代码** - -```python -import os -import torch -import torch.nn as nn -import torch_npu -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") - - -class Net(nn.Module): - - def __init__(self): - super(Net, self).__init__() - self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2) - self.relu1 = nn.ReLU() - self.bn1 = nn.BatchNorm2d(16) - - def forward(self, x): - x = self.conv1(x) - x = self.bn1(x) - output = self.relu1(x) - return output - -if __name__ == "__main__": - net = Net().npu() - # model参数传入net, 开启init dump 功能 - debugger = PrecisionDebugger(dump_path="./dump", hook_name="dump", model=net) - debugger.configure_hook(mode="api_stack") - debugger.start() - x = torch.randn(1, 1, 28, 28).npu() - out = net(x) - loss = out.sum() - loss.backward() - debugger.stop() -``` - -**落盘数据说明** - -该模式下dump数据命名格式为:`{Layer_name}.{Module_name}.{call_num}.{forward/backward}.{input/output}.npy` - -``` -# 按照上述用例代码进行dump,落盘数据命名示例如下: -conv1.Conv2d.0.forward.input.0.npy -conv1.Conv2d.0.forward.output.npy -relu1.ReLU.0.forward.input.0.npy -....... -bn1.BatchNorm2d.0.backward.output.2.npy -``` - -### configure_hook函数(可选) - -**功能说明** - -设置dump范围。 - -建议在**PrecisionDebugger**模块与模型初始化之间的任意位置添加,不添加此函数时默认使用mode="api_stack" dump整网数据。 - -**原型** - -dump: - -```python -debugger.configure_hook(mode="api_stack", scope=[], api_list=[], filter_switch="OFF", acl_config=None, backward_input=[], input_output_mode=["all"], summary_only=False, summary_mode="all") -``` - -溢出检测: - -```python -debugger.configure_hook(mode=None, acl_config=None, overflow_nums=1, need_replicate=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------------- | ------------------------------------------------------------ | -------- | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"api_stack"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围,mode="api_list"时,需要配置api_list=[],其他模式有需要时配置scope=[]。参数示例:scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"(表示开启过滤,即不dump)或"OFF"(表示关闭过滤)。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| acl_config | acl dump的配置文件。mode="acl"时,该参数必选;mode为其他值时,该参数不选。参数示例:acl_config='./dump.json'。dump.json配置文件详细介绍请参见“**dump.json配置文件说明**”。数据类型:str。 | 否 | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional.conv2d.1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional.conv2d.1、backward和input字段的.npy文件。数据类型:str。 | 否 | -| input_output_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例input_output_mode=["backward"]或input_output_mode=["forward", "backward"]。默认为["all"],即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:list。 | 否 | -| summary_only | dump npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | -| summary_mode | 控制dump文件输出的模式,可取值md5(dump仅输出包含md5值的pkl文件,用于验证数据的完整性)、summary(dump仅输出包含API统计信息的pkl文件)、all(dump输出包含API统计信息的pkl文件以及具体的npy文件),参数示例:summary_mode="md5",默认为"all"。summary_only=True时,不允许配置该参数。数据类型:str。 | 否 | -| overflow_nums | 
控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| need_replicate | 过程dump数据生成开关,执行溢出检测时,dump目录下会生成forward_real_data和backward_real_data的过程dump数据目录,可取值True(生成)或False(不生成),默认不生成。数据类型:bool。 | 否 | - -**函数示例** - -configure_hook可配置多种dump模式,示例如下: - -说明: - -以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -以下仅为该函数配置示例,完整代码请参见“**示例代码**”章节。 - -- 示例1:dump指定API列表 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="list", scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="range", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="stack", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Tensor.permute.1.forward"], acl_config="./dump.json") - ``` - -- 示例5:dump指定反向API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="acl", scope=["Functional.conv2d.1.backward"], acl_config="./dump.json", backward_input=["./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump/Functional.conv2d.1.backward.input.0.npy"]) - ``` - -- 示例6:dump指定某一类API的API级别输入输出数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例7:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例8: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例9:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(input_output_mode=["backward"]) - ``` - -- 示例10:仅dump pkl文件 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - debugger.configure_hook(summary_only=True) - ``` - -- 示例11:溢出检测dump - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(overflow_nums=1) - ``` - - dump执行时会在**PrecisionDebugger**模块的dump_path参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例11:dump溢出API的ACL级别数据 - - ```python - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - debugger.configure_hook(mode="acl", acl_config="./dump.json") - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### start函数(可选) - -**功能说明** - 
-dump或溢出检测启动函数。 - -在模型初始化之后的任意位置添加。 - -**原型** - -```python -debugger.start() -``` - -该函数为类函数,可以使用debugger.start()也可以使用PrecisionDebugger.start()。 - -### stop函数(可选) - -**功能说明** - -dump或溢出检测停止函数。 - -在**start**函数之后的任意位置添加。 - -**原型** - -```python -debugger.stop() -``` - -该函数为类函数,可以使用debugger.stop()也可以使用PrecisionDebugger.stop()。 - -### 示例代码(自动模式) - -**需要保证用户训练代码是通过torch.utils.data.dataloader方式加载数据。** - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0,2], enable_dataloader=True) - # 请勿将以上初始化流程插入到循环代码中 - ``` - -### 示例代码(手动模式) - -一般情况下使用自动模式可以快速方便进行dump操作,但个别大模型可能在部分卡的训练操作中没有调用dataloader,这会导致自动模式无法dump指定迭代的数据,此时需要关闭自动模式手动在迭代前后插入start()和stop()函数,并在最后一个stop函数后或一个step结束的位置添加debugger.step()以标识dump结束。 - -- 示例1:开启dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="dump", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -- 示例2:开启溢出检测dump - - ```python - from ptdbg_ascend import * - debugger = PrecisionDebugger(dump_path="./dump_path", hook_name="overflow_check", step=[0]) - # 请勿将以上初始化流程插入到循环代码中 - - # 模型初始化 - # 下面代码也可以用PrecisionDebugger.start()和PrecisionDebugger.stop() - debugger.start() - - # 需要dump的代码片段1 - - debugger.stop() - debugger.start() - - # 需要dump的代码片段2 - - debugger.stop() - debugger.step() - ``` - -## register_hook方式dump和溢出检测 - -### 总体说明 - -- 本节主要介绍CPU或GPU及NPU精度数据dump和溢出检测所需要的函数以及示例。 - -- ptdbg_ascend工具默认情况下仅dump PyTorch模型的API输入输出数据进行精度比对,若在比对结果中发现某个API下可能存在ACL的精度问题,那么可以选择dump该API的ACL级别数据进行精度分析。 - -- 某些torch api的输出不是Tensor类型的数据。对于此类API的反向过程进行ACL dump,工具会在运行日志中给出对应的Warning(is not of tensor type and cannot be automatically derived)提示。如若想要进行该类API反向ACL dump,可以通过手动构建单API用例的方式进行ACL dump,具体用例可参见“**[反向ACL dump用例说明](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/%E5%8F%8D%E5%90%91ACL%20dump%E7%94%A8%E4%BE%8B%E8%AF%B4%E6%98%8E.md)**”。 - -- 工具性能:dump数据量较小时(小于5G),参考dump速度0.1GB/s;dump数据量较大时,参考dump速度0.2GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。Dump速度的计算方式:Dump数据量/(单个step添加Dump耗时-原始单个step耗时)。 - -### 约束 -- 进行CPU或GPU数据dump时,请安装torch包而非torch_npu包,避免工具无法识别使用场景,导致失败。 - -- TASK_QUEUE_ENABLE环境变量会导致API下发和执行异步进行,因此在ACL dump前需要将TASK_QUEUE_ENABLE关闭,即export TASK_QUEUE_ENABLE=0。 - -- 不建议在PyTorch训练脚本中同时添加dump接口和性能数据采集(如Ascend PyThon Profiler)接口,二者可能相互影响导致数据不准确。 - -### seed_all - -**功能说明** - -固定随机数。通过固定随机数保证模型的输入或输出一致。在训练主函数开始前调用,避免随机数固定不全。 - -使用form ptdbg import *后自动导入该函数,代码无需再次添加,若需要修改随机数种子和确定性计算模式,则需要通过添加该函数修改。 - -**函数原型** - -```python -seed_all(seed=1234, mode=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------ | ------------------------------------------------------------ | -------- | -| seed | 随机数种子。参数示例:seed=1000。默认值为:1234。数据类型:int。 | 否 | -| mode | 确定性计算模式。可配置True或False。参数示例:mode=True。默认为False。数据类型:bool。
默认情况下,即使硬件和输入相同,API多次执行的结果也可能不同;开启确定性计算后,可保证相同硬件和输入下API多次执行的结果一致。
确定性计算会导致API执行性能降低,建议在发现模型多次执行结果不同的情况下开启。
rnn类算子、ReduceSum、ReduceMean等算子可能与确定性计算存在冲突,若开启确定性计算后多次执行的结果不相同,则考虑存在这些算子。 | 否 | - -**函数示例** - -seed_all函数的随机数种子,取默认值即可,无须配置;第二个参数默认关闭,不开启确定性计算时也无须配置。 - -- 示例1:仅固定随机数,不开启确定性计算 - - ```python - seed_all() - ``` - -- 示例2:固定随机数,开启确定性计算 - - ```python - seed_all(mode=True) - ``` - -**固定随机数范围** - -seed_all函数可固定随机数的范围如下表。 - -| API | 固定随机数 | -| ---------------------------------------- | --------------------------- | -| os.environ['PYTHONHASHSEED'] = str(seed) | 禁止Python中的hash随机化 | -| random.seed(seed) | 设置random随机生成器的种子 | -| np.random.seed(seed) | 设置numpy中随机生成器的种子 | -| torch.manual_seed(seed) | 设置当前CPU的随机种子 | -| torch.cuda.manual_seed(seed) | 设置当前GPU的随机种子 | -| torch.cuda.manual_seed_all(seed) | 设置所有GPU的随机种子 | -| torch_npu.npu.manual_seed(seed) | 设置当前NPU的随机种子 | -| torch_npu.npu.manual_seed_all(seed) | 设置所有NPU的随机种子 | -| torch.backends.cudnn.enable=False | 关闭cuDNN | -| torch.backends.cudnn.benchmark=False | cuDNN确定性地选择算法 | -| torch.backends.cudnn.deterministic=True | cuDNN仅使用确定性的卷积算法 | - -需要保证CPU或GPU以及NPU的模型输入完全一致,dump数据的比对才有意义,seed_all并不能保证模型输入完全一致,如下表所示场景需要保证输入的一致性。 - -| 场景 | 固定方法 | -| --------------- | ------------- | -| 数据集的shuffle | 关闭shuffle。 | -| dropout | 关闭dropout。 | - -关闭shuffle示例: - -```python -train_loader = torch.utils.data.DataLoader( - train_dataset, - batch_size = batch_size, - shuffle = False, - num_workers = num_workers -) -``` - -关闭dropout: - -在使用from ptdbg import *后,工具会自动将torch.nn.functional.dropout、torch.nn.functional.dropout2d、torch.nn.functional.dropout3d、torch.nn.Dropout、torch.nn.Dropout2d、torch.nn.Dropout3d的接口参数p置为0。 - -### set_dump_path - -**功能说明** - -设置数据保存目录。建议在seed_all函数之后调用且需要保证训练进程能够调用该函数;多卡时须保证每个进程都能调用该函数。 - -**函数原型** - -```python -set_dump_path(fpath=None, dump_tag='ptdbg_dump') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| fpath | 设置数据目录路径。参数示例:'./dump_path'。数据类型:str。
默认在dump_path目录下生成`ptdbg_dump_{version}`目录,并在该目录下生成`dump.pkl`文件以及`dump`数据文件保存目录。
当set_dump_switch函数配置了mode参数时,`dump.pkl`文件以及`dump`数据文件保存目录名称添加mode参数值为前缀,详情请参见“**dump数据存盘说明**”。
未配置fpath时,也可以通过环境变量ASCEND_WORK_PATH配置dump路径,此时数据将落盘在${ASCEND_WORK_PATH}/dump_data下,自定义配置dump_path优先级高于环境变量,fpath和环境变量需要二选一。 | 否 | -| dump_tag | 设置数据目录名称。参数示例:dump_tag='dump_conv2d'。默认数据目录命名为ptdbg_dump_{version}。数据类型:str。
{version}为当前安装ptdbg_ascend工具版本。目录结构参见“**dump数据存盘说明**”。
配置该参数会将生成的`ptdbg_dump_{version}`目录名称变更为dump_tag配置的值,如`dump_conv2d_{version}`。 | 否 | - -**函数示例** - -- 示例1:设置数据目录路径 - - ```python - set_dump_path('./dump_path') - ``` - -- 示例2:设置数据目录名称 - - ```python - set_dump_path('./dump_path', dump_tag='dump_conv2d') - ``` - - -若以相同的数据目录多次dump,则会因同名导致覆盖;多次dump建议配置不同的dump_tag。 - -### register_hook - -**功能说明** - -注册工具钩子函数。在set_dump_path之后调用。 - -dump操作必选。 - -**函数原型** - -```python -register_hook(model, hook, overflow_nums=overflow_nums, dump_mode=dump_mode, dump_config=dump_config_file) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| model | 传入网络模型实例化的对象。参数示例: model=net,net为网络模型实例化的对象名称。数据类型:torch.nn.Module。 | 是 | -| hook | 注册工具的dump和溢出检测钩子。可取值overflow_check(表示溢出检测)和acc_cmp_dump(表示dump数据),二选一。数据类型:Callable。 | 是 | -| overflow_nums | 控制溢出次数,表示第N次溢出时,停止训练,过程中检测到溢出API对应ACL数据均dump。参数示例:overflow_nums=3。配置overflow_check时可配置,默认不配置,即检测到1次溢出,训练停止,配置为-1时,表示持续检测溢出直到训练结束。数据类型:int。 | 否 | -| dump_mode | 控制针对溢出API的dump模式,可取值"acl"或"api"。配置acl时,表示dump ACL级别的溢出数据,此时set_dump_path参数不生效,dump数据目录由dump_config的.json文件配置。参数示例:dump_mode="acl"。默认不配置,即dump API级别的溢出数据。数据类型:str。 | 否 | -| dump_config | acl dump的配置文件。dump_mode="acl"时,该参数必选;dump_mode="api"时,该参数不选。参数示例:dump_config='./dump.json'。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:注册工具钩子函数 - - ```python - register_hook(model, acc_cmp_dump) - ``` - -- 示例2:dump指定API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - ``` - - 需要配置set_dump_switch的mode="acl"以及scope指定为前向或反向API,请参见“**set_dump_switch”**的示例。 - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置dump数据目录。 - -- 示例3:溢出检测dump - - ```python - register_hook(model, overflow_check, overflow_nums=3) - ``` - - dump执行时会在set_dump_path的fpath参数指定的目录下生成ptdbg_dump_{version}目录,保存溢出数据。 - - 多卡场景时,需要检测到至少有一张卡溢出次数达到overflow_nums时,训练结束。 - - 仅支持NPU环境。 - -- 示例4:dump指定API的ACL级别溢出数据 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - ``` - - 该场景会在原有数据基础上,额外在dump.json文件配置的dump_path目录下生成一份ACL算子数据,该数据可通过“**ptdbg_ascend.parse**”工具进行解析。 - - 仅支持NPU环境。 - -### set_dump_switch - -**功能说明** - -设置dump范围。建议在register_hook函数之后的脚本内任意位置插入,但进行精度问题排查建议参照“场景化示例 > 单卡场景精度比对”章节的顺序,先从第一个迭代开始的位置调用并dump整网数据。 - -dump操作必选。 - -**函数原型** - -```python -def set_dump_switch(switch, mode="all", scope=[], api_list=[], filter_switch="OFF", dump_mode=["all"], summary_only=False): -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| --------------- | ------------------------------------------------------------ | -------- | -| switch | dump开关。可取值"ON"或"OFF"。须在选定dump开始的位置配置set_dump_switch("ON");dump结束的位置设置set_dump_switch("OFF")。数据类型:str。 | 是 | -| mode | dump模式。可取值"all"、"list"、"range"、"stack"、"acl"、"api_list"、"api_stack",各参数含义请参见本节的“**函数示例**”。参数示例:mode="list"。默认为"all"。该参数配置值将作为dump数据文件名的前缀,详情请参见“**dump数据存盘说明**”。数据类型:str。 | 否 | -| scope或api_list | dump范围。根据model配置的模式选择dump的API范围。参数示例:scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward"]、api_list=["relu"]。默认为空。数据类型:List[str]。 | 否 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | -| dump_mode | dump数据过滤。可取值"all"、"forward"、"backward"、"input"和"output",表示仅保存dump的数据中文件名包含"forward"、"backward"、"input"和"output"的前向、反向、输入或输出的.npy文件。参数示例dump_mode=["backward"]或dump_mode=["forward", "backward"]。默认为all,即保存所有dump的数据。除了all参数只能单独配置外,其他参数可以自由组合。数据类型:List[str]。 | 否 | -| summary_only | dump 
npy文件过滤,可取值True或False,配置为True后仅dump保存API统计信息的pkl文件,参数示例:summary_only=False,默认为False。数据类型:bool。 | 否 | - -**推荐配置** - -```python -set_dump_switch("ON", mode="api_stack", filter_switch="OFF") -``` - -开启dump数据和堆栈模式,同时为保证数据完整性开启dump bool和整型的tensor以及浮点、bool和整型的标量。 - -**函数示例** - -set_dump_switch可配置多种dump模式,示例如下: - -说明:以下均以dump部分API数据为例,API名可以从首次dump整网数据的结果csv文件中的NPU Name或Bench Name列获取。 - -- 示例1:dump指定API列表 - - ```python - set_dump_switch("ON", mode="list", scope=["Tensor.permute.1.forward", "Tensor.transpose.2.forward", "Torch.relu.3.backward"]) - ``` - -- 示例2:dump指定范围 - - ```python - set_dump_switch("ON", mode="range", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例3:STACK模式,只dump堆栈信息 - - ```python - set_dump_switch("ON", mode="stack", scope=["Tensor.abs.1.forward", "Tensor.transpose.3.forward"]) - ``` - -- 示例4:dump指定前向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Tensor.permute.1.forward"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件。 - -- 示例4:dump指定反向API的ACL级别数据 - - ```python - register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') - set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"]) - set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional.conv2d.1.backward.input.0.npy"]) - ``` - - 需要配置register_hook的dump_mode='acl'和dump_config配置文件,并通过set_backward_input设置反向API输入的.npy文件。 - -- 示例5:dump指定某一类API的API级别输入输出数据 - - ```python - set_dump_switch("ON", mode="api_list", api_list=["relu"]) - ``` - - mode="api_list"时不配置scope。 - -- 示例6:dump全部API级别输入输出数据以及相应堆栈信息 - - ```python - set_dump_switch("ON", mode="api_stack") - ``` - - mode="api_stack"时不配置scope。 - -- 示例7: dump全部API级别输入输出数据并包含bool和整型的tensor以及浮点、bool和整型的标量,配置为OFF,会dump bool和整型数据 - - ```python - set_dump_switch("ON", filter_switch="OFF") - ``` - - 配置filter_switch="OFF"同时也可以配置mode、scope和api_list,除dump ACL级别数据。 - -- 示例8:仅保存dump的数据文件名包含“backward”的反向.npy文件 - - ```python - set_dump_switch("ON", dump_mode=["backward"]) - ``` - -- 示例9:仅dump pkl文件 - - ```python - set_dump_switch("ON", summary_only=True) - ``` - -以上示例均需要在结束dump的位置插入set_dump_switch("OFF")。 - -set_dump_switch配置mode为all或api_stack时,结束dump后,在dump目录下会自动生成compare_data.py比对脚本模板,示例如下: - -```python -from ptdbg_ascend import compare -from ptdbg_ascend.common.file_check_util import FileChecker -import argparse -import os.path - -pkl_path = "%s" -dump_data_dir = "%s" - -parser = argparse.ArgumentParser(description="compare data") -parser.add_argument("--npu_pkl_path", type=str, default=pkl_path, help="npu保存数据的pkl路径") -parser.add_argument("--bench_pkl_path", type=str, default=pkl_path, help="对比数据的pkl路径") -parser.add_argument("--output_path", type=str, default="./", help="导出对比数据的路径") - -args = parser.parse_args() -npu_pkl_path = args.npu_pkl_path -bench_pkl_path = args.bench_pkl_path -output_path = args.output_path - -suffix = ".pkl" -npu_path_checker = FileChecker(npu_pkl_path, "file", "read", suffix) -npu_path_checker.common_check() -bench_path_checker = FileChecker(bench_pkl_path, "file", "read", suffix) -bench_path_checker.common_check() - -npu_dump_data_dir = npu_pkl_path[:-len(suffix)] -bench_dump_data_dir = bench_pkl_path[:-len(suffix)] -if not os.path.exists(npu_dump_data_dir) or not os.path.exists(bench_dump_data_dir): - npu_dump_data_dir = "" - bench_dump_data_dir = "" - -dump_path_param = { - "npu_pkl_path": npu_pkl_path, - "bench_pkl_path": bench_pkl_path, - "npu_dump_data_dir": 
npu_dump_data_dir, - "bench_dump_data_dir": bench_dump_data_dir, - "is_print_compare_log": True -} - -compare(dump_path_param, output_path=output_path, stack_mode=%s) -``` - -compare_data.py比对脚本模板可以直接使用命令行配置比对参数,不需要通过编辑compare_data.py文件来修改,示例如下: - -```bash -python3 compare_data.py --npu_pkl_path "./npu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --bench_pkl_path "./gpu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --output_path "./output_path" -``` - -该命令行支持--npu_pkl_path、--bench_pkl_path和--output三个比对参数,其中pkl_path两个参数配置后,脚本可以自动识别同级目录下的dump_data目录,若同级目录下不存在dump_data目录,则直接执行“**pkl文件比对**”。仅ptdbg_ascend 6.0或更高版本支持比对命令行配置比对参数。更多介绍请参见“**执行比对操作**”。 - -### set_overflow_check_switch - -**功能说明** - -置溢出检测范围。默认不配置该函数,全量进行溢出检测。 - -仅支持NPU环境。 - -**函数原型** - -```python -set_overflow_check_switch(switch, filter_switch='OFF') -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------- | ------------------------------------------------------------ | -------- | -| switch, | 检测开关。可取值"ON"或"OFF"。如果只在特定的step溢出检测,则在期望溢出检测的step位置开始前插入set_overflow_check_switch("ON"),在step结束的位置插入set_overflow_check_switch("OFF")。数据类型:str。 | 是 | -| filter_switch | dump bool和整型的tensor以及浮点、bool和整型的标量的过滤开关。可取值"ON"或"OFF"。参数示例:filter_switch="ON"。默认不配置,即filter_switch="OFF",表示dump上述数据。数据类型:str。 | 否 | - -**函数示例** - -- 示例1:指定范围溢出检测 - - ```python - register_hook(model, overflow_check) - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,dump执行时会在当前目录自动生成ptdbg_dump_{version}目录,保存溢出数据。 - -- 示例2:前向API的ACL级别范围溢出检测 - - ```python - register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json') - set_overflow_check_switch("ON") - - ... - - set_overflow_check_switch("OFF") - ``` - - 该场景set_dump_path不生效,由dump_config中的dump.json文件配置溢出数据目录。 - -### set_backward_input - -**功能说明** - -设置反向ACL级别dump时需要的反向输入的.npy文件。 - -**函数原型** - -```python -set_backward_input(backward_input) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| backward_input | 该输入文件为首次运行训练dump得到反向API输入的.npy文件。例如若需要dump Functional.conv2d.1 API的反向过程的输入输出,则需要在dump目录下查找命名包含Functional.conv2d.1、backward和input字段的.npy文件。数据类型:str。 | 是 | - -**函数示例** - -```python -register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json') -set_dump_switch("ON", mode="acl", scope=["Functional.conv2d.1.backward"]) -set_backward_input(["./npu_dump/dump_conv2d_v4.0/step0/rank0/dump/Functional.conv2d.1.backward.input.0.npy"]) -``` - -## dump.json配置文件说明 - -**dump.json配置示例** - -```python -{ - "dump": - { - "dump_list":[], - "dump_path":"./dump/output", - "dump_mode":"all", - "dump_op_switch":"on" - } -} -``` - -**dump.json参数说明** - -| 字段名 | 说明 | -| -------------- | ------------------------------------------------------------ | -| dump_list | 待dump数据的API模型。为空,无需配置。 | -| dump_path | dump数据文件存储到运行环境的目录,主要用于指定ACL dump数据路径。支持配置绝对路径或相对路径。dump_path须为已存在目录。 | -| dump_mode | dump数据模式,配置如下:
output:dump API的输出数据。默认值。
input:dump API的输入数据。
all:dump API的输入、输出数据。 | -| dump_op_switch | 单API模型dump数据开关,配置如下: * off:关闭单API模型dump,默认值。 * on:开启单API模型dump。 | - -**dump目录说明** - -配置register_hook的dump_config后,采集的dump数据会在{dump_path}/{time}/{deviceid}/{model_id}目录下生成,例如“/home/HwHiAiUser/output/20200808163566/0/0” - -```bash -├── 20230131172437 -│   └── 1 -│   ├── 0 -│   │   ├── Add.Add.45.0.1675157077183551 -│   │   ├── Cast.trans_Cast_0.31.0.1675157077159449 -│   │   ├── Cast.trans_Cast_5.43.0.1675157077180129 -│   │   ├── MatMul.MatMul.39.0.1675157077172961 -│   │   ├── Mul.Mul.29.0.1675157077155731 -│   │   ├── NPUAllocFloatStatus.NPUAllocFloatStatus.24.0.1675157077145262 -│   │   ├── TransData.trans_TransData_1.33.0.1675157077162791 -│   │   └── TransData.trans_TransData_4.41.0.1675157077176648 -│   ├── 1701737061 -│   │   └── Cast.trans_Cast_2.35.0.1675157077166214 -│   ├── 25 -│   │   └── NPUClearFloatStatus.NPUClearFloatStatus.26.0.1675157077150342 -│   └── 68 -│   └── TransData.trans_TransData_3.37.0.1675157077169473 -``` - -## 模块级精度数据dump - -### 总体说明 - -大模型场景下,通常不是简单的利用自动迁移能力实现GPU到NPU的训练脚本迁移,而是会对NPU网络进行一系列针对性的适配,因此,常常会造成迁移后的NPU模型存在部分子结构不能与GPU原始模型完全对应。模型结构不一致导致API调用类型及数量不一致,若直接按照API粒度进行精度数据dump和比对,则无法完全比对所有的API。 - -本节介绍的功能是对模型中的大粒度模块进行数据dump,使其比对时,对于无法以API粒度比对的模块可以直接以模块粒度进行比对。 - -模块指的是继承自nn.Module类模块,通常情况下这类模块就是一个小模型,可以被视为一个整体,dump数据时以模块为粒度进行dump。 - -### module_dump - -**功能说明** - -开启模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump(module, module_name) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ----------- | ------------------------------------------------------------ | -------- | -| module | 网络中实例化好的nn.Module类对象。数据类型:torch.nn.Module。 | 是 | -| module_name | 用户自定义的该model名称。主要用于dump数据文件的命名,便于在比对时识别模块级数据。数据类型:str。 | 是 | - -### module_dump_end - -**功能说明** - -结束模块级精度数据dump。 - -模块级精度数据dump时必选。 - -**函数原型** - -```python -module_dump_end() -``` - -### 示例代码 - -```python -# 根据需要import包 -import os -import torch -import torch.nn as nn -import torch_npu -import torch.nn.functional as F -from ptdbg_ascend import * - -torch.npu.set_device("npu:0") -# 定义一个简单的网络 -class ModuleOP(nn.Module): - def __init__(self) -> None: - super().__init__() - self.linear_1 = nn.Linear(in_features=8, out_features=4) - self.linear_2 = nn.Linear(in_features=4, out_features=2) - def forward(self, x): - x1 = self.linear_1(x) - x2 = self.linear_2(x1) - r1 = F.relu(x2) - return r1 - -if __name__ == "__main__": - module = ModuleOP() - - # 注册工具 - pdbg = PrecisionDebugger("./dump_data/npu", hook_name="dump") - pdbg.start() - - x = torch.randn(10, 8) - module_dump(module, "MyModuleOP") # 开启模块级精度数据dump - out = module(x) - module_dump_end() # 结束模块级精度数据dump - loss = out.sum() - loss.backward() - pdbg.stop() -``` - -## dump数据存盘说明 - -dump结果目录结构示例如下: - -```bash -├── dump_path -│ └── ptdbg_dump_{version} -│ ├── step0 -│ | ├── rank0 -│ | │ ├── dump -| | | | ├── Tensor.permute.1.forward.npy -| | | | ├── MyModule.0.forward.input.npy # 开启模块级精度数据dump时存在模块级的dump数据文件 -| | | | ... -| | | | └── Fcuntion.linear.5.backward.output.npy -│ | │ └── dump.pkl -│ | ├── rank1 -| | | ├── dump -| | | | └── ... -| | | └── dump.pkl -│ | ├── ... -│ | | -| | └── rank7 -│ ├── step1 -│ | ├── ... 
-│ ├── step2 -``` - -dump过程中,npy文件在对应算子或者模块被执行后就会落盘,而pkl文件则需要在正常执行PrecisionDebugger.stop()或set_dump_switch("OFF")后才会被落盘保存,异常的程序终止会保存终止前被执行算子的相关npy文件,但是不会生成pkl文件。 - -其中`ptdbg_dump_{version}`为默认命名,debugger方式dump不支持修改该文件夹名称,使用set_dump_path函数则支持通过dump_tag参数修改文件夹名称;rank为设备上各卡的ID,每张卡上dump的数据会生成对应dump目录。 - -**精度比对dump场景** - -精度比对dump场景的结果如下: - -* dump.pkl文件:包含dump数据的API名称(命名格式为:`{api_type}.{api_name}.{API调用次数}.{前向反向}.{input/output}.{参数序号}`)、dtype、 shape、各数据的max、min、mean、L2norm统计信息以及当配置summary_mode="md5"时的md5数据。 - - 其中,“参数序号”表示该API下的第n个参数,例如1,则为第一个参数,若该参数为list格式,则根据list继续排序,例如1.1,表示该API的第1个参数的第1个子参数;L2norm表示2范数(平方根)。 - -* dump目录:目录下为npy格式的dump数据。 - - npy文件保存的前缀和PyTorch对应关系如下 - - | 前缀 | Torch模块 | - | ----------- | ------------------- | - | Tensor | torch.Tensor | - | Torch | torch | - | Functional | torch.nn.functional | - | NPU | NPU亲和算子 | - | VF | torch._VF | - | Aten | torch.ops.aten | - | Distributed | torch.distributed | - -当configure_hook或set_dump_switch配置mode参数(例如:mode="api_stack" )时,dump结果的文件名会添加api_stack前缀,dump结果如下: - -* api_stack_dump.pkl -* api_stack_dump目录 - -**溢出检测dump场景** - -PrecisionDebugger模块的hook_name参数或register_hook函数设置了overflow_check时,检测API溢出,dump结果的文件名格式为:`{api_type}.{api_name}.{API调用次数}.{前向反向}.{当前溢出次数}`,dump结果示例如下: - -* `Tensor_add_1_forward_1.pkl` -* `Tensor_add_1_forward_1`目录 - -## 工具支持的API列表 - -ptdbug_ascend工具维护固定的API支持列表,若需要删除或增加dump的API,可以在[support_wrap_ops.yaml](../src/python/ptdbg_ascend/hook_module/support_wrap_ops.yaml)文件内手动修改,如下示例: - -```bash -functional: # functional为算子类别,找到对应的类别,在该类别下按照下列格式删除或添加API - - conv1d - - conv2d - - conv3d -``` - -## CPU或GPU与NPU精度数据比对 - -### 总体说明 - -- 本节主要介绍CPU或GPU与NPU精度数据比对的函数以及示例。 - -- 比对函数均通过单独创建精度比对脚本执行,可支持单卡和多卡场景的精度数据比对。 - -- 工具性能:比对数据量较小时(参考值单份文件小于10GB),参考比对速度0.1GB/s;比对数据量较大时,参考比对速度0.3GB/s。 - 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 - - 用户环境性能弱于标准约束或非独占使用的比对速度酌情向下浮动。比对速度的计算方式:两份比对文件大小/比对耗时。 - -### 约束 - -- NPU自研API,在CPU或GPU若没有对应的API,该API的dump数据不比对。 - -- NPU与CPU或GPU的计算结果误差可能会随着模型的执行不断累积,最终会出现同一个API因为输入的数据差异较大而无法比对的情况。 - -- CPU或GPU与NPU中两个相同的API会因为调用次数不同导致无法比对或比对到错误的API,不影响整体运行,该API忽略。 - -### compare_distributed - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,支持单卡和多卡,可同时比对多卡的dump数据。多机场景需要每个设备单独执行比对操作。可自动检索和匹配对应卡和进程所dump的数据文件,再调用compare进行比对。单机单卡时与compare函数二选一。 - -**函数原型** - -```python -compare_distributed(npu_dump_dir, bench_dump_dir, output_path, **kwargs) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| -------------- | ------------------------------------------------------------ | -------- | -| npu_dump_dir | 配置NPU环境下的dump目录。dump数据目录须指定到step级。参数示例:'./npu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| bench_dump_dir | 配置CPU、GPU或NPU环境下的dump目录。参数示例:'./gpu_dump/ptdbg_dump_v4.0/step0'。register_hook方式可通过set_dump_path函数的dump_tag参数修改该目录名称。数据类型:str。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。需要预先创建output_path目录。参数示例:'./output'。文件名称基于时间戳自动生成,格式为:`compare_result_rank{npu_ID}-rank{cpu/gpu/npu_ID}_{timestamp}.csv`。数据类型:str。 | 是 | -| **kwargs | 支持compare的所有可选参数。 | 否 | - -**函数示例** - -创建比对脚本,例如compare_distributed.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -compare_distributed('./npu_dump/ptdbg_dump_v4.0/step0', './gpu_dump/ptdbg_dump_v4.0/step0', './output') -``` - -dump数据目录须指定到step级。 - -### compare - -**功能说明** - -将CPU或GPU与NPU的dump文件进行比对,仅支持单机单卡。 - -**函数原型** - -```python -compare(input_param, output_path, stack_mode=False, auto_analyze=True, fuzzy_match=False) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------ | 
------------------------------------------------------------ | -------- | -| input_param  | 配置dump数据文件及目录。数据类型:dict。配置参数包括:
"npu_pkl_path":指定NPU dump目录下的.pkl文件。参数示例:"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
"bench_pkl_path":指定CPU、GPU或NPU dump目录下的.pkl文件。参数示例:"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl"。必选。
"npu_dump_data_dir":指定NPU dump目录下的dump数据目录。参数示例:"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
"bench_dump_data_dir":指定CPU、GPU或NPU dump目录下的dump数据目录。参数示例:"bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump"。可选,仅比对pkl文件时不选。
"is_print_compare_log":配置是否开启日志打屏。可取值True或False。可选。 | 是 | -| output_path | 配置比对结果csv文件存盘目录。参数示例:"./output_path",默认为"./"。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。数据类型:str。 | 否 | -| stack_mode | 配置stack_mode的开关。仅当dump数据时配置debugger.configure_hook或set_dump_switch的mode="api_stack"时需要开启。可取值True或False,参数示例:stack_mode=True,默认为False。数据类型:bool。 | 否 | -| auto_analyze | 自动精度分析,开启后工具自动针对比对结果进行分析,识别到第一个精度不达标节点(在比对结果文件中的“Accuracy Reached or Not”列显示为No),并给出问题可能产生的原因(打屏展示并生成advisor_{timestamp}.txt文件)。可取值True或False,参数示例:auto_analyze=False,默认为True。数据类型:bool。 | 否 | -| fuzzy_match | 模糊匹配。开启后,对于网络中同一层级且命名仅调用次数不同的API,可匹配并进行比对。可取值True或False,参数示例:fuzzy_match=True,默认为False。数据类型:bool。 | 否 | - -**函数示例** - -单机单卡场景下创建比对脚本,例如compare.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import compare -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output_path", stack_mode=True) -``` - -### pkl文件比对 - -若使用**compare**或**compare_distributed**函数创建的比对脚本中,input_param参数只配置了npu_pkl_path和bench_pkl_path或使用summary_only、summary_mode(取值为md5或summary)方式dump时,可以进行pkl文件的比对,此时比对dump.pkl文件中的统计信息,开启后的比对结果文件生成Max diff、Min diff、Mean diff和L2norm diff,表示NPU dump数据中API的输入或输出与标杆数据输入或输出的最大值、最小值、平均值以及L2范数的差。可以通过该值判断API是否存在精度问题:当某个API的输入和输出的Max diff、Min diff、Mean diff和L2norm diff均为0或无限趋于0,那么可以判断该API无精度问题,反之则可能存在精度问题。 - -**比对脚本示例** - -以compare.py为例。 - -```python -from ptdbg_ascend import compare -dump_result_param={ -"npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", -"is_print_compare_log": True -} -compare(dump_result_param, output_path="./output_path", stack_mode=True) -``` - -**比对结果** - -pkl文件比对同样生成`compare_result_{timestamp}.csv`和`advisor_{timestamp}.txt`文件。其中`advisor_{timestamp}.txt`主要对`compare_result_{timestamp}.csv`中可能存在精度问题(Result为Waring)的API提出定位建议;`compare_result_{timestamp}.csv`主要有如下两种情况: - -- configure_hook配置summary_only=True、summary_mode=summary或不配置前面两个参数直接比对pkl文件: - - ![compare_result_pkl](./img/compare_result_pkl.png) - - 上图是对pkl文件中NPU及标杆API的统计信息进行比对,判断可能存在精度问题的API,文件中记录NPU及标杆API的基本信息和统计信息,其中需要关注Result列,包含结果:Waring(NPU与标杆统计信息的比对中存在相对误差大于0.5,则需要重点检查该API);为空(相对误差小于等于0.5,可以不需要重点关注,但不代表不存在精度问题);Nan(表示统计信息数据没有匹配上)。 - -- configure_hook配置summary_mode=md5: - - ![compare_result_pkl_md5.png](./img/compare_result_pkl_md5.png.png) - - 上图是对pkl文件中NPU及标杆API的MD5信息进行比对,判断API数据的完整性,文件中记录NPU及标杆API的基本信息和MD5信息,其中需要关注Result列,包含结果:Pass(表示NPU与标杆的MD5值一致,即API数据完整);Different(表示NPU与标杆的MD5值不一致,即API数据不完全一致,可以通过NPU_Stack_Info列API调用栈查询该API的详细信息);Nan(表示MD5信息数据没有匹配上)。 - -### parse - -**功能说明** - -解析并提取dump信息中的堆栈信息及数据统计信息。 - -**函数原型** - -```python -parse(pkl_file, module_name_prefix) -``` - -**参数说明** - -| 参数名 | 说明 | 是否必选 | -| ------------------ | ------------------------------------------------------------ | -------- | -| pkl_file | 指定dump数据文件中的pkl文件名。参数示例:"./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl"。数据类型:str。 | 是 | -| module_name_prefix | 指定待提取的API接口前缀。参数示例:"Torch.norm.1.forward"。数据类型:str。 | 是 | - -**函数示例** - -创建堆栈信息及数据统计信息提取脚本,例如parse.py,拷贝如下代码,具体参数请根据实际环境修改。 - -```python -from ptdbg_ascend import * -parse("./npu_dump/ptdbg_dump_v4.0/step0/rank0/dump.pkl", "Torch.batch.normal.1.forward") 
-``` - -### 执行比对操作 - -比对操作通过执行比对脚本启动,根据不同的比对脚本分为如下场景: - -- dump数据时自动生成比对脚本模板,脚本名为compare_data.py,该脚本模板也可以直接手动创建: - - ```python - from ptdbg_ascend import compare - from ptdbg_ascend.common.file_check_util import FileChecker - import argparse - import os.path - - pkl_path = "%s" - dump_data_dir = "%s" - - parser = argparse.ArgumentParser(description="compare data") - parser.add_argument("--npu_pkl_path", type=str, default=pkl_path, help="npu保存数据的pkl路径") - parser.add_argument("--bench_pkl_path", type=str, default=pkl_path, help="对比数据的pkl路径") - parser.add_argument("--output_path", type=str, default="./", help="导出对比数据的路径") - - args = parser.parse_args() - npu_pkl_path = args.npu_pkl_path - bench_pkl_path = args.bench_pkl_path - output_path = args.output_path - - suffix = ".pkl" - npu_path_checker = FileChecker(npu_pkl_path, "file", "read", suffix) - npu_path_checker.common_check() - bench_path_checker = FileChecker(bench_pkl_path, "file", "read", suffix) - bench_path_checker.common_check() - - npu_dump_data_dir = npu_pkl_path[:-len(suffix)] - bench_dump_data_dir = bench_pkl_path[:-len(suffix)] - if not os.path.exists(npu_dump_data_dir) or not os.path.exists(bench_dump_data_dir): - npu_dump_data_dir = "" - bench_dump_data_dir = "" - - dump_path_param = { - "npu_pkl_path": npu_pkl_path, - "bench_pkl_path": bench_pkl_path, - "npu_dump_data_dir": npu_dump_data_dir, - "bench_dump_data_dir": bench_dump_data_dir, - "is_print_compare_log": True - } - - compare(dump_path_param, output_path=output_path, stack_mode=%s) - ``` - - 执行如下命令启动比对操作: - - ```bash - python3 compare_data.py --npu_pkl_path "npu_pkl_path" --bench_pkl_path "bench_pkl_path" --output_path "output_path" - ``` - - 命令行示例:python3 compare_data.py --npu_pkl_path "./npu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --bench_pkl_path "./gpu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" --output_path "./output" - - - 该命令行支持--npu_pkl_path、--bench_pkl_path和--output三个**命令行比对参数**,其中pkl_path两个参数配置后,脚本可以自动识别同级目录下的dump_data目录,若同级目录下不存在dump_data目录,则直接执行“**pkl文件比对**”。 - - **命令行比对参数**的优先级高于compare.py比对脚本内的参数,配置命令行比对参数后,不需要通过编辑compare_data.py文件来修改比对参数。 - - **命令行比对参数**均为可选,但若未配置pkl_path两个参数,则需要在比对脚本中配置。 - - 仅ptdbg_ascend 6.0或更高版本支持**命令行比对参数**。 - - | 参数 | 说明 | 是否必选 | - | ---------------- | ------------------------------------------------------------ | -------- | - | --npu_pkl_path | 指定NPU dump目录下的.pkl文件。参数示例:--npu_pkl_path "./npu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl"。 | 否 | - | --bench_pkl_path | 指定CPU、GPU或NPU dump目录下的.pkl文件。参数示例:--bench_pkl_path "./gpu_dump/ptdbg_dump_v6.0/step0/rank0/api_stack_dump.pkl" | 否 | - | --output_path | 配置比对结果csv文件存盘目录。参数示例:--output_path "./output",默认为"./"。文件名称基于时间戳自动生成,格式为:`compare_result_{timestamp}.csv`。 | 否 | - -- 手动创建比对脚本,自定义脚本名为compare.py: - - ```python - from ptdbg_ascend import compare - dump_result_param={ - "npu_pkl_path": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "bench_pkl_path": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump.pkl", - "npu_dump_data_dir": "./npu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "bench_dump_data_dir": "./gpu_dump/ptdbg_dump_v4.0/step0/rank0/api_stack_dump", - "is_print_compare_log": True - } - compare(dump_result_param, output_path="./output_path", stack_mode=True) - ``` - - 执行如下命令启动比对操作: - - ```bash - python3 compare.py - ``` - -### 计算精度评价指标 - -PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度评价指标判断API在运行时是否存在精度问题。 - -计算精度评价指标: - -1. Cosine:通过计算两个向量的余弦值来判断其相似度,数值越接近于1说明计算出的两个张量越相似,实际可接受阈值为大于0.99。在计算中可能会存在nan,主要由于可能会出现其中一个向量为0。 - -2. 
MaxAbsErr:当最大绝对误差越接近0表示其计算的误差越小,实际可接受阈值为小于0.001。 - -3. MaxRelativeErr:当最大相对误差越接近0表示其计算的误差越小。 - - 当dump数据中存在0或Nan时,比对结果中最大相对误差则出现inf或Nan的情况,属于正常现象。 - -4. One Thousandth Err Ratio(双千分之一)、Five Thousandths Err Ratio(双千分之五)精度指标:是指NPU的Tensor中的元素逐个与对应的标杆数据对比,相对误差大于千分之一、千分之五的比例占总元素个数的比例小于千分之一、千分之五。该数据仅作为精度下降趋势的参考,并不参与计算精度是否通过的判定。 - -精度比对结果csv文件中只需要通过Accuracy Reached or Not来判断计算精度是否达标,判断标准如下: - -1. Cosine < 0.99 且 MaxAbsError > 0.001时,精度不达标,标记为“No”。 -2. Cosine < 0.9,精度不达标,标记为“No”。 -3. MaxAbsError > 1,精度不达标,标记为“No”。 -5. 其余情况下记为精度达标,标记为“Yes”。 - -## ptdbg_ascend.parse数据解析功能 - -ptdbg_ascend.parse为命令行交互式界面解析工具,提供更多的数据解析功能并且展示结果。 - -使用场景:本工具主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -### 进入parse交互式界面 - -安装ptdbg_ascend工具后,可以通过使用命令 **python -m ptdbg_ascend.parse** 进入交互式界面,如下所示: - -```bash -python -m ptdbg_ascend.parse -Parse >>> -``` - -可在parse的界面中执行Shell命令,以及如下场景的相关解析命令: - -- 支持指定ACL层级算子数据比对。 -- 支持指定ACL层级算子数据转换及展示。 -- 支持交互式指定pkl文件中API对应dump数据查看。 -- 支持API进行可选层级比对和打印(统计级和像素级)。 - -Ctrl+C可以退出parse交互式界面。不退出parse交互式界面若需要执行非该界面下的内置Shell命令,且命令与parse交互式界面命令冲突时,非该界面命令需要使用run命令,在相关命令前加上run前缀,如下示例: - -```bash -python -m ptdbg_ascend.parse -Parse >>> run vim cli.py -Parse >>> vim cli.py -``` - -以上各场景详细介绍请参见下文章节。 - -### ACL层级算子数据批量转换 - -本功能会将原有待比对dump数据目录下的dump数据按照算子名和时间戳进行梳理并分类,之后再将dump数据转为为npy文件。 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下比对命令进行数据转换。 - -```bash -cad -m my_dump_path [-out output_path] [-asc msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ------------------------------------------------------------ | -------- | -| -m | 待转换ACL dump数据目录。需要指定到ACL dump数据的deviceid级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_convert。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -asc | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py。 | 否 | - -**示例** - -```bash -# 传入待比对数据目录 -Parse >>> cad -m /home/xxx/my_dump_path/20000124003856/0 -# 转换结果打印 -...... -╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -# 转换前的dump文件 -│ SrcFile: /home/xxx/my_dump_path/20000124003856/0/272/TransData.trans_TransData_22.112.21.948645536672764 │ -# 转换后的npy文件 -│ - TransData.trans_TransData_22.112.21.948645536672764.output.0.npy │ -│ - TransData.trans_TransData_22.112.21.948645536672764.input.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -...... -[INFO] The comparison result have been written to "./parse_data/acl_batch_convert". -``` - -输出结果: - -原dump数据目录: - -```bash -├── /home/xxx/my_dump_path/20000124003856/0/ -│ ├── 272 -│ │ ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} -│ │ ... -│ ├── 512 -│ ... -``` - -转换后: - -```bash -├── ./parse_data/acl_batch_convert/{timestamp} -│ ├── {op_name1} -│ │ ├── {timestamp1} -│ │ | ├── {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.{input/output}.{参数序号}.npy -│ │ | │ ... -│ │ ├── {timestamp2} -│ │ | ... -│ ├── {op_name2} -│ ├── ... 
-``` - -### ACL层级算子数据比对 - -本功能主要用于比对前后两次NPU ACL层级dump数据的一致性。 - -本功能支持批量比对,若需要进行批量比对,需要先将两份待比对的NPU ACL层级dump数据进行“**ACL层级算子数据批量转换**”,可以使两份数据更好的匹配;若直接进行dump数据的比对,建议只比对单个dump数据文件。 - -输入以下比对命令进行数据比对。 - -```bash -vc -m my_dump_path -g golden_dump_path [-out output_path] [-cmp_path msaccucmp_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -m | 待比对ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -g | 标杆ACL dump数据目录。如果比对单个算子,需要指定到ACL dump数据的model_id级目录;如果批量比对,则指定到cad转换后的timestamp级目录。 | 是 | -| -out | 结果输出目录,须指定已存在的目录,默认为./parse_data/acl_batch_comapre。未指定时保存在默认路径下,比对结束后会打印log提示输出结果存放路径。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -输出结果:batch_compare_{timestamp}.csv文件。 - -**示例** - -```bash -# 传入待比对数据目录以及标杆数据目录 -Parse >>> vc -m ./my_dump_path -g ./golden_data_path -[INFO]Compare result is saved in : parse_data/acl_batch_comapre/batch_compare_1707271118.csv -``` - -### ACL算子数据的npy转换 - -依赖:CANN包中的msaccucmp工具,需要安装Ascend-CANN-toolkit,详见《[CANN 软件安装指南](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2Fcanncommercial%2F700%2Fenvdeployment%2Finstg%2Finstg_0001.html)》。 - -输入以下转换命令进行数据转换, 将ACL级别dump数据转为npy文件。 - -```bash -dc -n file_name/file_path [-f format] [-out output_path] -``` - -| 参数名称 | 说明 | 是否必选 | -| --------- | ------------------------------------------------------------ | -------- | -| -n | 需转换的dump数据文件或dump数据文件目录。 | 是 | -| -f | 开启format转换,指定该参数时需要配置format格式。当前内置的Format转换支持如下类型:
FRACTAL_NZ转换成NCHW
FRACTAL_NZ转换成NHWC
FRACTAL_NZ转换成ND
HWCN转换成FRACTAL_Z
HWCN转换成NCHW
HWCN转换成NHWC
NC1HWC0转换成HWCN
NC1HWC0转换成NCHW
NC1HWC0转换成NHWC
NCHW转换成FRACTAL_Z
NCHW转换成NHWC
NHWC转换成FRACTAL_Z
NHWC转换成HWCN
NHWC转换成NCHW
NDC1HWC0转换成NCDHW | 否 | -| -out | 结果输出目录。 | 否 | -| -cmp_path | 指定msaccucmp路径,默认路径为:/usr/local/Ascend/ascend-toolkit/latest/tools/operator_cmp/compare/msaccucmp.py | 否 | - -[^]: 若传入单个dump文件,则转换单个文件,若传入dump文件目录则转换目录下所有dump文件。 - -- 输出结果:npy文件。 -- 若指定-out参数需要用户传入输出路径,并且路径需要已存在。 -- 若未指定输出目录, 则比对结束后将结果保存在默认目录 “./parse_data/convert_result”中,比对结束后会打印log提示输出结果存放路径及转换结果。 - -- 输入以下命令,展示npy数据统计信息。 - - ```bash - pt -n file_path - ``` - - | 参数名称 | 说明 | 是否必选 | - | -------- | ------------- | -------- | - | -n | npy文件路径。 | 是 | - - 打印统计信息:shape, dtype, max, min和mean。默认在npy文件路径下将该数据保存为txt文件。 - -**示例1** - -```bash -# 传入需转换的dump文件目录 -Parse >>> dc -n ./dump_data/ -...... -# 转换结果 -╭──────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ SrcFile: ./dump_data/ -│ - Add.fp32.vars.add.2fp32.vars.Relu.9.31.5.1636595794731103.input.0.npy │ -│ - Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.output.0.npy │ -│ - Add.fp32.vars.add.2fp32.vars.Relu.9.31.5.1636595794731103.input.1.npy │ -│ - Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.1.npy │ -│ - Add.fp32.vars.add.3fp32.vars.Relu.12.40.5.1636595794846124.input.1.npy │ -│ - Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.0.npy │ -│ - Add.fp32.vars.add.3fp32.vars.Relu.12.40.5.1636595794846124.input.0.npy │ -│ - Add.fp32.vars.add.2fp32.vars.Relu.9.31.5.1636595794731103.output.0.npy │ -│ - Add.fp32.vars.add.3fp32.vars.Relu.12.40.5.1636595794846124.output.0.npy │ -╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -**示例2** - -```bash -# 查看某个dump数据块的数据信息 -# 默认会将数据中的tensor保存成 txt -Parse >>> pt -n ./parse_data/dump_convert/Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.output.0.npy -...... -# 打印统计信息 -[Shape: (1, 16, 56, 56, 16)] [Dtype: float16] [Max: 452.0] [Min: -408.5] [Mean: -3.809] -Path: ./parse_data/dump_convert/Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.0.npy -TextFile:./parse_data/dump_convert/Add.fp32.vars.add.1fp32.vars.Relu.6.24.5.1636595794631347.input.0.npy.txt -``` - -### pkl文件中指定API的dump数据信息查看 - -输入以下命令,解析并输出pkl文件中指定api的统计信息。 - -```bash -pk -f pkl_path -n api_name -``` - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------- | -------- | -| -f | 指定pkl文件路径。 | 是 | -| -n | 指定API名称。 | 是 | - -- 输出结果:打印统计信息(shape, dtype, max和min mean)。 -- 若pkl文件中存在相应的堆栈信息,则会打印堆栈信息。 - -**示例** - -```bash -# 传入pkl文件及api名称 -Parse >>> pk -f ./torch_dump/ptdbg_v3.2/rank0/api_stack_dump.pkl -n Functional.conv2d.0.forward -...... 
-# 打印统计信息及堆栈(pkl文件不包含堆栈则不会打印堆栈) - -Statistic Info: - [Functional.conv2d.0.forward.input.0][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 1.576936960220337][min: -0.9757485389709473][mean: 0.4961632490158081] - [Functional.conv2d.0.forward.input.1][dtype: torch.float32][shape: [2, 1, 2, 2]][max: 0.20064473152160645][min: -0.47102075815200806][mean: -0.20796933770179749] - [Functional.conv2d.0.forward.input.2][dtype: torch.float32][shape: [2]][max: 0.17380613088607788][min: -0.16853803396224976][mean: 0.0026340484619140625] - [Functional.conv2d.0.forward.output][dtype: torch.float32][shape: [2, 2, 1, 1]][max: 0.02364911139011383][min: -1.762906551361084][mean: -0.6710853576660156] -``` - -### API可选层级比对 - -输入以下命令, 进行统计级和像素级比对。 - -```bash -cn -m my_data*.npy -g gloden*.npy [-p num] [-al atol] [-rl rtol] -``` - -- 统计级比对:对tensor整体进行余弦值及相对误差的计算。 -- 像素级比对:对输入的两个npy文件进行逐元素比对。若两个tensor对应元素的相对误差或绝对误差大于**误差阈值**(-al和-rl配置)则被标记为错误数据。 - -| 参数名称 | 说明 | 是否必选 | -| -------- | ----------------------------------------------- | -------- | -| -m | 待比对数据。 | 是 | -| -g | 标杆数据。 | 是 | -| -p | 设置比对结束后打印错误元素的个数,默认值20。 | 否 | -| -al | 判定数据存在精度问题的绝对误差阈值,默认0.001。 | 否 | -| -rl | 判定数据存在精度问题的相对误差阈值,默认0.001。 | 否 | -| -s | 将npy文件保存成txt文件,用于查看,默认开启。 | 否 | - -输出结果: - -- 统计级比对结果。 -- 两个文件的统计信息(shape, dtype, max, min和mean)。 -- 错误数据打印表格。 - -**示例** - -```bash -# 对比两个tensor的数据 -Parse >>> cn -m Add.InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.323.1619494134703053.output.0.npy -g InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.0.1619492699305998.npy -p 10 -s -al 0.002 -rl 0.005 - Error Item Table Top Item Table -┏━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┏━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ -┃ Index ┃ Left ┃ Right ┃ Diff ┃ ┃ Index ┃ Left ┃ Right ┃ Diff ┃ -┡━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩ ┡━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ -│ 155 │ 0.024600908 │ 0.022271132 │ 0.002329776 │ │ 0 │ -0.9206961 │ -0.9222216 │ 0.0015255213 │ -│ 247 │ 0.015752593 │ 0.017937578 │ 0.0021849852 │ │ 1 │ -0.6416973 │ -0.64051837 │ 0.0011789203 │ -│ 282 │ -0.0101207765 │ -0.007852031 │ 0.0022687456 │ │ 2 │ -0.35383835 │ -0.35433492 │ 0.0004965663 │ -│ 292 │ 0.019581757 │ 0.02240482 │ 0.0028230622 │ │ 3 │ -0.18851271 │ -0.18883198 │ 0.00031927228 │ -│ 640 │ -0.06593232 │ -0.06874806 │ 0.0028157383 │ │ 4 │ -0.43508735 │ -0.43534422 │ 0.00025686622 │ -│ 1420 │ 0.09293677 │ 0.09586689 │ 0.0029301196 │ │ 5 │ 1.4447614 │ 1.4466647 │ 0.0019032955 │ -│ 1462 │ -0.085207745 │ -0.088047795 │ 0.0028400496 │ │ 6 │ -0.3455438 │ -0.3444429 │ 0.0011008978 │ -│ 1891 │ -0.03433288 │ -0.036525503 │ 0.002192624 │ │ 7 │ -0.6560242 │ -0.6564579 │ 0.0004336834 │ -│ 2033 │ 0.06828873 │ 0.07139922 │ 0.0031104907 │ │ 8 │ -2.6964858 │ -2.6975214 │ 0.0010356903 │ -│ 2246 │ -0.06376442 │ -0.06121233 │ 0.002552092 │ │ 9 │ -0.73746175 │ -0.73650354 │ 0.00095820427 │ -└───────┴───────────────┴──────────────┴──────────────┘ └───────┴─────────────┴─────────────┴───────────────┘ -╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ -│ Left: | -│ |- NpyFile: ./dump/temp/decode/Add.InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.323.1619494134703053.output.0.npy | -│ |- TxtFile: ./dump/temp/decode/Add.InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.323.1619494134703053.output.0.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.846897] [Min: -8.368301] [Mean: -0.72565556] | -│ DstFile: │ -│ |- NpyFile: 
./dump/cpu/InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.0.1619492699305998.npy | -│ |- TxtFile: ./dump/cpu/InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.0.1619492699305998.npy.txt | -│ |- NpySpec: [Shape: (32, 8, 8, 320)] [Dtype: float32] [Max: 5.8425903] [Min: -8.374472] [Mean: -0.7256237] │ -│ NumCnt: 655360 │ -│ AllClose: False │ -│ CosSim: 0.99999493 │ -│ ErrorPer: 0.023504638671875 (rl= 0.005, al= 0.002) │ -╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ -``` - -## FAQ - -[FAQ](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md) diff --git "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" index 94cbc7671bc8841705398b029fc27662801a4e04..09d608b676d7a02a59abbbdede4bda413b1152bd 100644 --- "a/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" +++ "b/debug/accuracy_tools/ptdbg_ascend/doc/ptdbg_ascend\347\262\276\345\272\246\345\267\245\345\205\267\345\212\237\350\203\275\350\257\264\346\230\216_v6.0.md" @@ -2,7 +2,7 @@ 本文主要介绍PyTorch精度工具ptdbg_ascend的使用以及精度比对场景示例。 -ptdbg_ascend工具的原理及安装请参见《[PyTorch精度工具](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 +ptdbg_ascend工具的原理及安装请参见《[PyTorch精度工具](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 ptdbg_ascend工具主要支持PyTorch API精度数据dump、溢出检测、精度比对以及parse数据解析功能。其中dump和溢出检测功能支持使用debugger和register_hook方式进行精度数据的dump和溢出检测,推荐使用debugger方式。 @@ -313,7 +313,7 @@ PyTorch训练场景的精度问题分析建议参考以下思路进行精度比 3. NPU环境下执行训练dump溢出数据。 - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过[Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 + 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过[Ascend模型精度预检工具](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 精度预检工具执行命令如下: @@ -392,7 +392,7 @@ register_hook需要在set_dump_path之后调用,也需要在每个进程上被 - 找到训练代码中遍历epoch的for循环或遍历数据集的for循环,把register_hook放到循环开始前即可。 - 找到训练代码中调用DDP或者DistributedDataParallel的代码行,把register_hook放到该代码行所在的代码块之后。 -- 若代码中均无以上两种情况,需要保证register_hook在模型定义之后插入,并配置rank参数。rank参数获取rank_id请参见“**[rank_id获取方法](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/rank_id获取方法.md)**”。 +- 若代码中均无以上两种情况,需要保证register_hook在模型定义之后插入,并配置rank参数。rank参数获取rank_id请参见“**[rank_id获取方法](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/rank_id获取方法.md)**”。 ### NPU vs NPU精度比对 @@ -748,7 +748,7 @@ PyTorch训练场景的精度问题分析建议参考以下思路进行精度比 3. 
NPU环境下执行训练dump溢出数据。 - 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过 [Ascend模型精度预检工具](https://gitee.com/ascend/att/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 + 针对输入正常但输出存在溢出的API,会训练执行目录下将溢出的API信息dump并保存为`forward_info_{pid}.json`和`backward_info_{pid}.json`,通过 [Ascend模型精度预检工具](https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/api_accuracy_checker)对json文件进行解析,输出溢出API为正常溢出还是非正常溢出,从而帮助用户快速判断。 精度预检工具执行命令如下: @@ -1105,7 +1105,7 @@ debugger.stop() - ptdbg_ascend工具默认情况下仅dump PyTorch模型的API输入输出数据进行精度比对,若在比对结果中发现某个API下可能存在ACL的精度问题,那么可以选择dump该API的ACL级别数据进行精度分析。 -- 某些torch api的输出不是Tensor类型的数据。对于此类API的反向过程进行ACL dump,工具会在运行日志中给出对应的Warning(is not of tensor type and cannot be automatically derived)提示。如若想要进行该类API反向ACL dump,可以通过手动构建单API用例的方式进行ACL dump,具体用例可参见“**[反向ACL dump用例说明](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/%E5%8F%8D%E5%90%91ACL%20dump%E7%94%A8%E4%BE%8B%E8%AF%B4%E6%98%8E.md)**”。 +- 某些torch api的输出不是Tensor类型的数据。对于此类API的反向过程进行ACL dump,工具会在运行日志中给出对应的Warning(is not of tensor type and cannot be automatically derived)提示。如若想要进行该类API反向ACL dump,可以通过手动构建单API用例的方式进行ACL dump,具体用例可参见“**[反向ACL dump用例说明](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/%E5%8F%8D%E5%90%91ACL%20dump%E7%94%A8%E4%BE%8B%E8%AF%B4%E6%98%8E.md)**”。 - 工具性能:dump数据量较小时(小于5G),参考dump速度0.1GB/s;dump数据量较大时,参考dump速度0.2GB/s。 推荐环境配置:独占环境,CPU核心数192,固态硬盘(IO速度参考:固态硬盘 > 500MB/s,机械硬盘60 ~ 170MB/s)。 @@ -2032,14 +2032,14 @@ PyTorch精度比对是以CPU或GPU的计算结果为标杆,通过计算精度 - 红色可能出现的情况有: - NPU max或NPU min信息中存在nan/inf - Max diff存在大于1e+10的值 - - 统计数据中Max diff除以Bench max > 0.5 + - 统计数据中output的Max diff除以max(0.01, Bench max) > 0.5 - 真实数据中One Thousandth Err Ratio的input > 0.9同时output < 0.6 - 黄色可能出现的情况有: - - Max diff的输出与输入存在数量级差 - - 统计数据Max diff除以Bench max的input > 0.1同时input < 0.01 - - 真实数据One Thousandth Err Ratio的output - input > 0.1 - - 真实数据Cosine的output - input > 0.1 + - Max diff的input与output都大于1,同时output比input大一个数量级以上 + - 统计数据Max diff除以max(0.01, Bench max)的output > 0.1同时input < 0.01 + - 真实数据One Thousandth Err Ratio的input - output > 0.1 + - 真实数据Cosine的input - output > 0.1 ## ptdbg_ascend.parse数据解析功能 @@ -2329,4 +2329,4 @@ Parse >>> cn -m Add.InceptionV3.InceptionV3.Mixed.7a.Branch.0.add.3.323.16194941 ## FAQ -[FAQ](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md) +[FAQ](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/ptdbg_ascend/doc/FAQ.md) diff --git "a/debug/accuracy_tools/ptdbg_ascend/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" "b/debug/accuracy_tools/ptdbg_ascend/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" index 317c99573c9c44a97c27a57282a7201e98aedd7b..dc479707430b8bb6cd8951d88353d74c941e655f 100644 --- "a/debug/accuracy_tools/ptdbg_ascend/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" +++ "b/debug/accuracy_tools/ptdbg_ascend/doc/\345\234\250\347\272\277\347\262\276\345\272\246\346\257\224\345\257\271.md" @@ -8,7 +8,7 @@ PyTorch NPU在线精度比对是ptdbg_ascend工具实现在PyTorch训练过程 1. 准备NPU训练工程。 -2. 在NPU环境下安装ptdbg_ascend工具,参见《[PyTorch精度工具](https://gitee.com/ascend/att/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 +2. 在NPU环境下安装ptdbg_ascend工具,参见《[PyTorch精度工具](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/ptdbg_ascend/README.md)》。 3. 
在训练脚本内插入ptdbg_ascend工具在线精度比对接口。 diff --git a/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/common/utils.py b/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/common/utils.py index c800f31dbca6a7f857854e0c0733fec5ca27888c..aaa57968e9d78fcd1ac3308f16c6847f9ec65924 100644 --- a/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/common/utils.py +++ b/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/common/utils.py @@ -126,7 +126,7 @@ class Const: # version message tips VERSION_MESSAGE = """The current version of ptdbg will be deprecated on September 30, 2024. The att/debug/accuracy_tools/ptdbg_ascend directory will be deleted on September 30, 2024. - Please use the ptdbg in the att/debug/accuracy_tools/atat directory.""" + Please use the ptdbg in the att/debug/accuracy_tools/msprobe directory.""" class CompareConst: """ @@ -857,7 +857,7 @@ def check_path_before_create(path): def check_inplace_op(prefix): if len(prefix) > Const.DISTRIBUTED_PREFIX_LENGTH: return False - match_op = re.findall(r"Distributed_(.+?)_\d", prefix) + match_op = re.findall(r"Distributed\.(.+?)\.\d", prefix) op_name = match_op[0] if match_op else None return op_name in Const.INPLACE_LIST diff --git a/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/compare/acc_compare.py b/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/compare/acc_compare.py index e4ef7ab70de59a9663fbff1935973f42cb0a48dc..a6f7959512d2a15360ac33dfacfbeb4103fbbaf2 100644 --- a/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/compare/acc_compare.py +++ b/debug/accuracy_tools/ptdbg_ascend/src/python/ptdbg_ascend/compare/acc_compare.py @@ -226,7 +226,7 @@ def fuzzy_check_name(npu_name, bench_name): def rename_api(npu_name, process): npu_split = npu_name.split(process) torch_func_index, in_out = npu_split[0], npu_split[1] - torch_func_split = torch_func_index.rsplit("_", 2) + torch_func_split = torch_func_index.rsplit(".", 2) torch_func = str(torch_func_split[0]) + str(in_out) return torch_func diff --git a/debug/accuracy_tools/ptdbg_ascend/src/python/setup.py b/debug/accuracy_tools/ptdbg_ascend/src/python/setup.py index 0dcf513c3acae1f1cd8d6b09fe4a761e6bd5f730..ab21e7f68c0cab3c77a358d0e5fd386466960200 100644 --- a/debug/accuracy_tools/ptdbg_ascend/src/python/setup.py +++ b/debug/accuracy_tools/ptdbg_ascend/src/python/setup.py @@ -20,7 +20,7 @@ import stat from pathlib import Path import setuptools -VERSION = '6.0.T4' +VERSION = '6.0' def generate_ptdbg_ascend_version(): diff --git a/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py b/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py index be6b480657b88d4f7852e236c4215d62c5995ef5..c5ee7ff9d50dfaefbe94e2ccaf8b281433570c92 100644 --- a/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py +++ b/debug/accuracy_tools/ptdbg_ascend/test/ut/compare/test_acc_compare.py @@ -355,4 +355,17 @@ class TestUtilsMethods(unittest.TestCase): result_df = pd.DataFrame(result) highlight_dict = {'red_rows': [], 'yellow_rows': []} compare.find_compare_result_error_rows(result_df, highlight_dict) - self.assertEqual(highlight_dict, {'red_rows': [num_1, num_3], 'yellow_rows': [num_2]}) \ No newline at end of file + self.assertEqual(highlight_dict, {'red_rows': [num_1, num_3], 'yellow_rows': [num_2]}) + + def test_rename_api(self): + test_name_1 = "Distributed.broadcast.0.forward.input.0" + expect_name_1 = "Distributed.broadcast.input.0" + actual_name_1 = compare.rename_api(test_name_1, "forward") + 
self.assertEqual(actual_name_1, expect_name_1) + + test_name_2 = "Torch.sum.0.backward.output.0" + expect_name_2 = "Torch.sum.output.0" + actual_name_2 = compare.rename_api(test_name_2, "backward") + self.assertEqual(actual_name_2, expect_name_2) + + \ No newline at end of file diff --git a/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py b/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py index 4c91a6928c02c4a1e9eec5b21a4dc43d65ee631b..7a74eeb369c47a6d6ad7752bd75dd481039af2a2 100644 --- a/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py +++ b/debug/accuracy_tools/ptdbg_ascend/test/ut/test_common_util.py @@ -65,6 +65,14 @@ class TestCommonUtilsMethods(unittest.TestCase): dump_path = "/usr/dump" mode = "api_stack" self.assertEqual(common.modify_dump_path(dump_path, mode), "/usr/api_stack_dump") + + def test_check_inplace_op(self): + test_prefix_1 = "Distributed.broadcast.0.forward.input.0" + self.assertTrue(common.check_inplace_op(test_prefix_1)) + test_prefix_2 = "Distributed_broadcast_0_forward_input_0" + self.assertFalse(common.check_inplace_op(test_prefix_2)) + test_prefix_3 = "Torch.sum.0.backward.output.0" + self.assertFalse(common.check_inplace_op(test_prefix_3)) def test_create_directory(self): pass diff --git a/debug/accuracy_tools/setup.py b/debug/accuracy_tools/setup.py index f1579a7e416e946e7f76ae2f78cc05d112cfc22d..4e0eaa1f3754f150006a8d656dc26d81b6ccea1a 100644 --- a/debug/accuracy_tools/setup.py +++ b/debug/accuracy_tools/setup.py @@ -1,6 +1,3 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -""" # Copyright (C) 2024. Huawei Technologies Co., Ltd. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,27 +10,59 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-""" -from setuptools import setup, find_packages +import setuptools -setup( - name='ascend_training_accuracy_tools', - version='0.0.3', - description='This is a pytorch precision comparison tools', - long_description='This is a pytorch precision comparison tools, include ptdbg and api accuracy checker', - packages=find_packages(), - install_requires=[ - "wheel", - "numpy", - "pandas >= 1.3.5", - "pyyaml", - "rich", - "tqdm" - ], + +__version__ = '1.0.0' + +INSTALL_REQUIRED = [ + "wheel", + "numpy", + "pandas >= 1.3.5", + "pyyaml", + "rich", + "tqdm", + "openpyxl" +] + +EXCLUDE_PKGS = [ + "api_accuracy_checker*", + "grad_tool*", + "kj600*", + "ptdbg_ascend*", + "msprobe.test*", +] + +setuptools.setup( + name="mindstudio-probe", + version=__version__, + description="Pytorch Ascend Probe Utils", + long_description="MindStudio-Probe is a set of tools for diagnosing and improving model accuracy on Ascend NPU, " + "including API acc checker, ptdbg, grad tool etc.", + url="https://gitee.com/ascend/mstt/tree/master/debug/accuracy_tools/msprobe", + author="Ascend Team", + author_email="pmail_mindstudio@huawei.com", + packages=setuptools.find_namespace_packages(exclude=EXCLUDE_PKGS, include=["msprobe", "msprobe*"]), include_package_data=True, + python_requires=">=3.6.2", + install_requires=INSTALL_REQUIRED, + classifiers=[ + 'Intended Audience :: Developers', + 'Intended Audience :: Education', + 'Intended Audience :: Science/Research', + 'Programming Language :: Python :: 3', + 'Topic :: Scientific/Engineering', + 'Topic :: Scientific/Engineering :: Mathematics', + 'Topic :: Scientific/Engineering :: Artificial Intelligence', + 'Topic :: Software Development', + 'Topic :: Software Development :: Libraries', + 'Topic :: Software Development :: Libraries :: Python Modules', + ], + license='Apache License 2.0', + keywords='pytorch msprobe ascend', ext_modules=[], zip_safe=False, entry_points={ - 'console_scripts' : ['atat=atat.atat:main'], - },) \ No newline at end of file + 'console_scripts': ['msprobe=msprobe.msprobe:main'], + },) diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py index f9c3e88c388718994d9741f01a003f4ecc4e2a2f..fd7b265cfa7d67023075ec8d9bc59ed85f4e0f15 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/__init__.py @@ -4,4 +4,4 @@ # Entry point for Pytorch TensorBoard plugin package. 
-__version__ = '0.4.0.5' +__version__ = '0.4.0.8' diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py index 9961ae7cf9eb3144da8f1ac78e551a56ca4f27b8..d6f9bb245eb2d170cb4a63e7f912a9c69932e28b 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/data.py @@ -64,13 +64,17 @@ class RunProfileData(object): fwd_bwd_events = [] if trace_body is not None: for data in trace_body: + if data.get('ts') is not None: + try: + self.profiler_start_ts = min(self.profiler_start_ts, float(data.get('ts'))) + except ValueError: + logger.warning(f'The operator {data.get("name")} has wrong "ts" format, expected a number.') if data.get('cat') == 'forward_backward': fwd_bwd_events.append(data) else: event = trace.create_event(data, self.is_pytorch_lightning) if event is not None: event.ts = float(event.ts) - self.profiler_start_ts = min(self.profiler_start_ts, event.ts) self.events.append(event) self.events.sort(key=lambda e: e.ts) diff --git a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py index 061db7a4e0d46f7a56c70e0953128c30a243499e..3cd7ce9ff662a152cc9e1e4150bfe4d762e7a691 100644 --- a/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py +++ b/plugins/tensorboard-plugins/tb_plugin/torch_tb_profiler/profiler/event_parser.py @@ -88,10 +88,6 @@ class NodeParserMixin: runtime_nodes = externalid_to_runtime.pop(op.external_id, []) if runtime_nodes: op.runtimes.extend(runtime_nodes) - for ext_id in externalid_to_runtime: - if ext_id != 0: - logger.warning("{} Runtime with external id {} don't correlate to any operator!".format( - len(externalid_to_runtime[ext_id]), ext_id)) if len(corrid_to_device) > 0: node_count_dict = defaultdict(int) @@ -138,13 +134,6 @@ class NodeParserMixin: if rt_node.device_nodes is None: rt_node.device_nodes = [] rt_node.device_nodes.append(device_node) - - # Check the external_id - if rt_node.external_id != device_node.external_id: - logger.warning( - 'Runtime and Device-op have same correlation id %s but with different external id!' 
- ' (runtime external_id, device external_id): (%s, %s)' % - (corrid, rt_node.external_id, device_node.external_id)) else: corrid_to_device[corrid].append(device_node) self.device_node_list.append(device_node) diff --git a/profiler/README.md b/profiler/README.md index 74b378e69c0913fee1212b079de419dd30bfb1c4..1669e3524e54bb78e6f4f09f597d2399196ff950 100644 --- a/profiler/README.md +++ b/profiler/README.md @@ -91,6 +91,7 @@ ascend pytorch profiler数据目录结构如下: | profiler版本 | 发布日期 | 下载链接 | 校验码 | | ------------ | ---------- | ------------------------------------------------------------ | ------------------------------------------------------------ | + | 1.1.2 | 2024-07-12 | [msprof_analyze-1.1.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.2/msprof_analyze-1.1.2-py3-none-any.whl) | af62125b1f9348bf491364e03af712fc6d0282ccee3fb07458bc9bbef82dacc6 | | 1.1.1 | 2024-06-20 | [msprof_analyze-1.1.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.1/msprof_analyze-1.1.1-py3-none-any.whl) | 76aad967a3823151421153d368d4d2f8e5cfbcb356033575e0b8ec5acea8e5e4 | | 1.1.0 | 2024-05-28 | [msprof_analyze-1.1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.1.0/msprof_analyze-1.1.0-py3-none-any.whl) | b339f70e7d1e45e81f289332ca64990a744d0e7ce6fdd84a8d82e814fa400698 | | 1.0 | 2024-05-10 | [msprof_analyze-1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/profiler/package/1.0/msprof_analyze-1.0-py3-none-any.whl) | 95b2f41c8c8e8afe4887b738c8cababcb4f412e1874483b6adae4a025fcbb7d4 | @@ -124,12 +125,6 @@ ascend pytorch profiler数据目录结构如下: pip3 install ./msprof_analyze-{version}-py3-none-any.whl ``` - 若为覆盖安装,请在命令行末尾增加“--force-reinstall”参数强制安装,例如: - - ```bash - pip3 install ./msprof_analyze-{version}-py3-none-any.whl --force-reinstall - ``` - 提示如下信息则表示安装成功。 ```bash @@ -149,17 +144,17 @@ ascend pytorch profiler数据目录结构如下: 2. 下载源码。 ```bash - git clone https://gitee.com/ascend/att.git + git clone https://gitee.com/ascend/mstt.git ``` 3. 编译whl包。 ```bash - cd att/profiler + cd mstt/profiler python3 setup.py bdist_wheel ``` - 以上命令执行完成后在att/profiler/dist目录下生成性能工具whl安装包`msprof_analyze-{version}-py3-none-any.whl`。 + 以上命令执行完成后在mstt/profiler/dist目录下生成性能工具whl安装包`msprof_analyze-{version}-py3-none-any.whl`。 4. 安装。 @@ -167,25 +162,45 @@ ascend pytorch profiler数据目录结构如下: ```bash cd dist - pip3 install ./msprof_analyze-{version}-py3-none-any.whl --force-reinstall + pip3 install ./msprof_analyze-{version}-py3-none-any.whl + ``` + +## 卸载和更新 + +若需要更新工具,请先卸载旧版本后再重新安装新版本,如下操作: + +1. 卸载 + + ```bash + pip3 uninstall msprof-analyze + ``` + +2. 
更新 + + ```bash + pip3 install ./msprof_analyze-{version}-py3-none-any.whl ``` ## 工具使用 ```bash -msprof-analyze advisor [-h] [-v] +msprof-analyze advisor [-h] ``` ```bash -msprof-analyze compare [-h] [-v] +msprof-analyze compare [-h] ``` ```bash -msprof-analyze cluster [-h] [-v] +msprof-analyze cluster [-h] ``` ```bash -msprof-analyze auto-completion [-h] [-v] +msprof-analyze auto-completion [-h] +``` + +``` +msprof-analyze [-h] [-v] ``` | 参数 | 说明 | diff --git a/profiler/advisor/README.md b/profiler/advisor/README.md index 303f3148b5f2325130252d2a424f2a82d48fdf9f..c650f40b3ea8ef48b3c7644e279b00a1cb99f29a 100644 --- a/profiler/advisor/README.md +++ b/profiler/advisor/README.md @@ -92,31 +92,31 @@ msprof-analyze的advisor功能是将Ascend PyTorch Profiler或者msprof采集的 - 总体性能瓶颈 ```bash - msprof-analyze advisor all -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [-d] [-h] + msprof-analyze advisor all -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [--debug] [-h] ``` - 计算瓶颈 ```bash - msprof-analyze advisor computation -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-pt profiling_type] [-d] [-h] + msprof-analyze advisor computation -d {profiling_path} [-cv cann_version] [-tv torch_version] [-pt profiling_type] [--debug] [-h] ``` - 调度瓶颈 ```bash - msprof-analyze advisor schedule -d {profiling_path} [-bp benchmark_profiling_path] [-cv cann_version] [-tv torch_version] [-d] [-h] + msprof-analyze advisor schedule -d {profiling_path} [-cv cann_version] [-tv torch_version] [--debug] [-h] ``` #### 参数介绍 | 参数 | 说明 | 是否必选 | | ---------------------------------- | ------------------------------------------------------------ | -------- | -| -d
--profiling_path | 性能数据所在目录。性能数据通过Profiling工具采集获取。请确保性能数据采集时配置“aic-metrics”参数为“PipeUtilization”,“aicpu”参数为“on”。advisor依赖Profiling工具解析后的timeline数据、summary数据以及info.json*文件,请确保指定的“profiling_dir”目录下存在以上文件。 | 是 | +| -d
--profiling_path | 性能数据文件或目录所在路径,Ascend PyTorch Profiler采集场景指定为`*_ascend_pt`性能数据结果目录,其他场景指定为`PROF_XXX`性能数据结果目录。建议通过Ascend PyTorch Profiler获取性能数据。
advisor依赖Profiling工具解析后的timeline数据(.json)、summary(.csv)数据以及info.json*文件,请确保指定的“profiling_path”目录下存在以上文件。 | 是 | | -bp
--benchmark_profiling_path | 基准性能数据所在目录,用于性能比对。性能数据通过Profiling工具采集获取。
**computation和schedule不支持该参数。** | 否 | | -cv
--cann_version | 使用Profiling工具采集时对应的CANN软件版本,可通过在环境中执行如下命令获取其version字段,目前配套的兼容版本为“6.3.RC2”,“7.0.RC1”、“7.0.0”、“8.0.RC1”,此字段不填默认按“8.0.RC1”版本数据进行处理,其余版本采集的Profiling数据在分析时可能会导致不可知问题:`cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info` | 否 | | -tv
--torch_version | 运行环境的torch版本,默认为1.11.0,支持torch1.11.0和torch2.1.0,当运行环境torch版本为其他版本如torch1.11.3时,可以忽略小版本号差异选择相近的torch版本如1.11.0。 | 否 | -| -pt
--profiling_type | 配置性能数据采集使用的Profiling工具类型。可取值:
ascend_pytorch_profiler:使用Ascend PyThon Profiler接口方式采集的性能数据时配置,默认值。
msprof:使用msprof命令行方式采集的性能数据时配置。
mslite:使用[Benchmark](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench)工具采集的性能数据时配置。
**schedule不支持该参数。** | 否 | -| -D
--debug | 工具执行报错时可打开此开关,将会展示详细保存堆栈信息。 | 否 | +| -pt
--profiling_type | 配置性能数据采集使用的Profiling工具类型。可取值:
ascend_pytorch_profiler:使用Ascend PyTorch Profiler接口方式采集的性能数据时配置,默认值。
msprof:使用msprof命令行方式采集的性能数据时配置。功能完善中,暂不建议使用。
mslite:使用[Benchmark](https://gitee.com/ascend/tools/tree/master/ais-bench_workload/tool/ais_bench)工具采集的性能数据时配置。不建议使用。
**schedule不支持该参数。** | 否 | +| --debug | 工具执行报错时可打开此开关,将会展示详细的堆栈信息。 | 否 | | -h,-H
--help | 在需要查询当前命令附属子命令或相关参数时,给出帮助建议。 | 否 | ### 报告解析 @@ -173,18 +173,18 @@ Jupyter Notebook使用方式如下: 2. 在环境下安装mstt工具。 ``` - git clone https://gitee.com/ascend/att.git + git clone https://gitee.com/ascend/mstt.git ``` 安装环境下保存Ascend PyTorch Profiler采集的性能数据。 -3. 进入att\profiler\advisor目录执行如下命令启动Jupyter Notebook工具。 +3. 进入mstt\profiler\advisor目录执行如下命令启动Jupyter Notebook工具。 ```bash jupyter notebook ``` - 执行成功则自动启动浏览器读取att\profiler\advisor目录,如下示例: + 执行成功则自动启动浏览器读取mstt\profiler\advisor目录,如下示例: ![jupyter_report](./img/jupyter_report.PNG) diff --git a/profiler/advisor/analyzer/base_analyzer.py b/profiler/advisor/analyzer/base_analyzer.py index 5f4bd3202cd2071088f25564a7d4b14144a34826..e0e17320b3309ed24cfc7f45d6b09f73501be7da 100644 --- a/profiler/advisor/analyzer/base_analyzer.py +++ b/profiler/advisor/analyzer/base_analyzer.py @@ -1,3 +1,17 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. import logging from functools import wraps from typing import Dict, List, Union diff --git a/profiler/advisor/analyzer/cluster/slow_link_analyser.py b/profiler/advisor/analyzer/cluster/slow_link_analyser.py index 846b79a50f31abb8445a0e5c2e82aaaf3c8ee23d..0b585cbc7c5f136b15cd9eb035ea2dac5caa9e4e 100644 --- a/profiler/advisor/analyzer/cluster/slow_link_analyser.py +++ b/profiler/advisor/analyzer/cluster/slow_link_analyser.py @@ -19,7 +19,7 @@ from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer from profiler.advisor.common import constant from profiler.advisor.result.result import OptimizeResult from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataSet +from profiler.advisor.dataset.cluster.cluster_dataset import ClusterCommunicationDataset class SlowLinkAnalyzer(BaseAnalyzer): @@ -35,11 +35,11 @@ class SlowLinkAnalyzer(BaseAnalyzer): SDMA = "SDMA" RDMA = "RDMA" SLOW_LINK_ANALYSIS = "slow_link_analysis" - dataset_cls_list = [ClusterCommunicationDataSet] + dataset_cls_list = [ClusterCommunicationDataset] def __init__(self, collection_path, n_processes: int = 1, **kwargs): super().__init__(collection_path, n_processes, **kwargs) - key = ClusterCommunicationDataSet.get_key() + key = ClusterCommunicationDataset.get_key() self.communication_data_class = self.get_first_data_by_key(self.dataset_list, key) self.rank_bw_dict = self.communication_data_class.get_data() self.result = OptimizeResult() @@ -49,8 +49,9 @@ class SlowLinkAnalyzer(BaseAnalyzer): def optimize(self, **kwargs): if self.rank_bw_dict is None: - print("slow_link 分析失败,原因是数据加载失败,请检查你的cluster_analysis_outpu文件夹, \ - 如不关心这类数据请忽略") + print("Slow link analysis failed due to data loading failure. \ + Please check your cluster_analysis_output folder. 
\ + If you are not concerned about this type of data, please ignore this message.") return self.result self.process() self.format_datas = self.format_details() @@ -65,8 +66,11 @@ class SlowLinkAnalyzer(BaseAnalyzer): def produce_bottleneck(self, link_type: str): data_list = [rank_dict.get(link_type, 0) for rank_id, rank_dict in self.rank_bw_dict.items()] - avg_bw = round(sum(data_list) / len(data_list), 3) - if avg_bw == 0: + if len(data_list) > 0: + avg_bw = round(sum(data_list) / len(data_list), 3) + else: + print("The slow link (identified bottleneck) cannot provide a bottleneck \ + because the analysis data is missing bandwidth information.") return self.bottelneck += f'{link_type}: \n' \ f' The average is {avg_bw}, \n' \ diff --git a/profiler/advisor/analyzer/cluster/slow_rank_analyser.py b/profiler/advisor/analyzer/cluster/slow_rank_analyser.py index 4215b514a215a2a350571746ff9cb90c3c9956eb..f439b31f7736ee4777d5ef10bf968738a76ae1b3 100644 --- a/profiler/advisor/analyzer/cluster/slow_rank_analyser.py +++ b/profiler/advisor/analyzer/cluster/slow_rank_analyser.py @@ -19,7 +19,7 @@ from profiler.advisor.analyzer.base_analyzer import BaseAnalyzer from profiler.advisor.common import constant from profiler.advisor.result.result import OptimizeResult from profiler.advisor.result.item import OptimizeItem, OptimizeRecord -from profiler.advisor.dataset.cluster.cluster_dataset import ClusterStepTraceTimeDataSet +from profiler.advisor.dataset.cluster.cluster_dataset import ClusterStepTraceTimeDataset class SlowRankAnalyzer(BaseAnalyzer): @@ -27,11 +27,11 @@ class SlowRankAnalyzer(BaseAnalyzer): RANK = "rank" RATIO_THRESHOLD = 0.05 BOTTLENECK_LIST = ['Computing', 'Communication', "Free"] - dataset_cls_list = [ClusterStepTraceTimeDataSet] + dataset_cls_list = [ClusterStepTraceTimeDataset] def __init__(self, collection_path, n_processes: int = 1, **kwargs): super().__init__(collection_path, n_processes, **kwargs) - key = ClusterStepTraceTimeDataSet.get_key() + key = ClusterStepTraceTimeDataset.get_key() self.step_trace_class = self.get_first_data_by_key(self.dataset_list, key) self.step_trace_dict = self.step_trace_class.get_data() self.result = OptimizeResult() @@ -81,7 +81,7 @@ class SlowRankAnalyzer(BaseAnalyzer): def format_details(self): details_dict = {} - headers = ["rank_id", "compute", "communication", "free"] + headers = ["rank_id", "compute(us)", "communication(us)", "free(us)"] data_list = [] for key,value in self.step_trace_dict.items(): data_list.append([key] + value) diff --git a/profiler/advisor/analyzer/computation/bound/block_dim_checker.py b/profiler/advisor/analyzer/computation/bound/block_dim_checker.py index a7d7ddd93c70e59dc0d10318fdac06fdc581f70c..7a873c65635fcc8f2ebb35c8d317de09d78da491 100644 --- a/profiler/advisor/analyzer/computation/bound/block_dim_checker.py +++ b/profiler/advisor/analyzer/computation/bound/block_dim_checker.py @@ -1,5 +1,4 @@ import logging - from typing import List from profiler.advisor.analyzer.computation.operator_checker import OperatorChecker diff --git a/profiler/advisor/analyzer/computation/operator_checker.py b/profiler/advisor/analyzer/computation/operator_checker.py index 0f47650943a7355b494bd766214d10526c46c0fa..64618b56a8df7f380277e99ae7ca47cd69d24648 100644 --- a/profiler/advisor/analyzer/computation/operator_checker.py +++ b/profiler/advisor/analyzer/computation/operator_checker.py @@ -118,7 +118,7 @@ class OperatorChecker(VersionControl): def is_dynamic_shape(self, profiling_database: ProfilingDataset) -> bool: less_than_cann800_list = 
[constant.CANN_VERSION_C30, constant.CANN_VERSION_C13, constant.CANN_VERSION_C15] - # CANN 8.0.0 之前从 ge_info 中获取 op_state 属性,进行动态 shape 逻辑判断 + # CANN 8.0.RC1 之前从 ge_info 中获取 op_state 属性,进行动态 shape 逻辑判断 if self.cann_version in less_than_cann800_list: if hasattr(profiling_database, "ge_info"): ge_info = profiling_database.ge_info @@ -131,7 +131,7 @@ class OperatorChecker(VersionControl): "To enable dynamic shape check, please try to set data_simplification=False in experimental_config.\n" "More details please refer to link : %s", constant.ASCEND_PROFILER_URL) else: - # CANN 8.0.0 之后 op_state 属性从 op_summary 文件中获取 + # CANN 8.0.RC1 之后 op_state 属性从 op_summary 文件中获取 if hasattr(profiling_database, "op_summary"): static_shape_operators = profiling_database.op_summary.get_static_shape_operators() if len(static_shape_operators) == 0: diff --git a/profiler/advisor/common/constant.py b/profiler/advisor/common/constant.py index 40aaac94b1c1e7f88a56c8a5b0d15e8814b9f61d..697430ee6cabad8c055176a3368a8b4a25e977ab 100644 --- a/profiler/advisor/common/constant.py +++ b/profiler/advisor/common/constant.py @@ -47,7 +47,8 @@ NO_STACK_REASON_MAP = { TIMELINE_BACKWARD_NO_STACK_CODE: "Backward broadcast, without call stacks in profiling.", TIMELINE_ACL_TO_NPU_NO_STACK_CODE: "Incoming flow is 'acl_to_npu', without call stacks in profiling." } -TIMELINE_API_DOC_URL = "https://support.huaweicloud.com/bestpractice-modelarts/modelarts_10_2516.html" +TIMELINE_API_DOC_URL = "https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc \ + /Samples%20of%20Fused%20Operator%20API%20Replacement.md" AFFINITY_TRAINING_API = "Affinity training api" TIMELINE_WITH_STACK_DOC_URL = "https://www.hiascend.com/document/detail/zh/canncommercial/" \ "70RC1/modeldevpt/ptmigr/AImpug_0067.html" @@ -69,7 +70,7 @@ SLOW_RANK_TIME_RATIO_THRESHOLD = 0.05 CANN_VERSION_C30 = '6.3.RC2' CANN_VERSION_C13 = '7.0.RC1' CANN_VERSION_C15 = '7.0.0' -CANN_VERSION_C17 = '8.0.0' +CANN_VERSION_C17 = '8.0.RC1' SUPPORTED_CANN_VERSION = [CANN_VERSION_C30, CANN_VERSION_C13, CANN_VERSION_C15, CANN_VERSION_C17] DEFAULT_CANN_VERSION = CANN_VERSION_C17 ASCEND_PYTORCH_PROFILER = "ascend_pytorch_profiler" diff --git a/profiler/advisor/common/timeline/event.py b/profiler/advisor/common/timeline/event.py index 6001ac88722e5a77daba1c960e8ccfd6894889e6..e24d983a02ff19fef5d6ae476f7c2f55bd9c8f85 100644 --- a/profiler/advisor/common/timeline/event.py +++ b/profiler/advisor/common/timeline/event.py @@ -1,3 +1,4 @@ +from decimal import Decimal class AdvisorDict(dict): def __getstate__(self): return self.__dict__ @@ -18,6 +19,6 @@ class AdvisorDict(dict): class TimelineEvent(AdvisorDict): def ts_include(self, event): - - return float(self.ts) <= float(event.ts) and float(self.ts) + float(self.dur) >= float(event.ts) + float( + return Decimal(self.ts) <= Decimal(event.ts) and Decimal(self.ts) + Decimal(self.dur) >= Decimal( + event.ts) + Decimal( event.dur) \ No newline at end of file diff --git a/profiler/advisor/computation_analysis.ipynb b/profiler/advisor/computation_analysis.ipynb index 15d8618bd9f32dccbee214c0a79f2be6863314cb..0d4aaadfadff05d1e11d4a9873ef7ce4ae2cfaa8 100644 --- a/profiler/advisor/computation_analysis.ipynb +++ b/profiler/advisor/computation_analysis.ipynb @@ -44,7 +44,7 @@ "outputs": [], "source": [ "# 查询computation相关是否存在block dim问题\n", - "# 如果profiling数据采集自非8.0.0的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", + "# 如果profiling数据采集自非8.0.RC1的CANN版本,需要在训练/推理环境中执行: 'cat 
CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", "block_dim_result = interface.get_result(\"computation\", \"block_dim_analysis\", cann_version=\"7.0.RC1\")" ] }, @@ -252,7 +252,7 @@ "outputs": [], "source": [ "# 查询computation相关是否存在operator no bound问题\n", - "# 如果profiling数据采集自非8.0.0的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", + "# 如果profiling数据采集自非8.0.RC1的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", "operator_no_bound_result = interface.get_result(\"computation\", \"operator_no_bound_analysis\", cann_version=\"7.0.RC1\")" ] }, @@ -462,7 +462,7 @@ "source": [ "### AICPU问题识别\n", "AICPU问题主要为识别相关算子执行时跑到AICPU上计算,并没有利用到AI CORE的计算能力的场景,主要调优手段为修改相关代码来避免AICPU算子,可参见相关资料,来避免AICPU算子的问题:\n", - "https://support.huaweicloud.com/bestpractice-modelarts/modelarts_10_2517.html\n", + "https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md\n", "\n", "下列代码为样例,主要展示如何检测Dynamic Shape类型问题,并获取相关问题检测结果:" ] @@ -499,7 +499,7 @@ ], "source": [ "# 查询computation相关是否存在aicpu问题\n", - "# 如果profiling数据采集自非8.0.0的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", + "# 如果profiling数据采集自非8.0.RC1的CANN版本,需要在训练/推理环境中执行: 'cat CANN安装目录/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info'命令查看version\n", "aicpu_result = interface.get_result(\"computation\", \"aicpu_analysis\")" ] }, diff --git a/profiler/advisor/config/profiling_data_version_config.yaml b/profiler/advisor/config/profiling_data_version_config.yaml index f73aecd3baf18e06981ef4d4b0db7d6faadd419a..4ef76105a07c28c5072c4bbfe20fd39a938038b7 100644 --- a/profiler/advisor/config/profiling_data_version_config.yaml +++ b/profiler/advisor/config/profiling_data_version_config.yaml @@ -1,5 +1,5 @@ versions: - - version: 8.0.0 + - version: 8.0.RC1 dirs_pattern: ^PROF_\d{6}_\d{17}_\w+$: mindstudio_profiler_output: diff --git a/profiler/advisor/dataset/cluster/cluster_dataset.py b/profiler/advisor/dataset/cluster/cluster_dataset.py index 09fda2d4dcf2df2f05abb0007befb5c5c36ef824..e1163f1cdd84265eb5cc5e356753cad5fa663339 100644 --- a/profiler/advisor/dataset/cluster/cluster_dataset.py +++ b/profiler/advisor/dataset/cluster/cluster_dataset.py @@ -25,9 +25,9 @@ class ClusterDataset(Dataset): """ for file in os.listdir(self.collection_path): if file == 'cluster_analysis_output': - print("[INFO]Cluster has been analyzed " - "because of the existence of cluster analysis output directory.") - print("[INFO]Skip Cluster analyze backend.") + logger.info("[INFO]Cluster has been analyzed " + "because of the existence of cluster analysis output directory.") + logger.info("[INFO]Skip Cluster analyze backend.") return True return False @@ -62,7 +62,7 @@ class ClusterDataset(Dataset): @singleton -class ClusterStepTraceTimeDataSet(ClusterDataset): +class ClusterStepTraceTimeDataset(ClusterDataset): RANK = "rank" def __init__(self, collection_path: str, data: dict, **kwargs): @@ -77,10 +77,10 @@ class ClusterStepTraceTimeDataSet(ClusterDataset): print("捕获到异常:", e) self._step_dict = None return False - self._step_dict = self.formate_data(step_data) + self._step_dict = self.format_data(step_data) return True - def formate_data(self, step_data: list): + def format_data(self, step_data: list): step_dict = defaultdict(lambda: [0, 0, 0]) for step_bean in step_data: if step_bean.type == self.RANK: @@ 
-94,7 +94,7 @@ class ClusterStepTraceTimeDataSet(ClusterDataset): @singleton -class ClusterCommunicationDataSet(ClusterDataset): +class ClusterCommunicationDataset(ClusterDataset): RDMA_TIME_MS = "RDMA time(ms)" RDMA_SIZE_MB = "RDMA size(mb)" SDMA_TIME_MS = "SDMA time(ms)" diff --git a/profiler/advisor/dataset/timeline_event_dataset.py b/profiler/advisor/dataset/timeline_event_dataset.py index 94b6fdfef78c044e37e24772699ed7ea67b0da30..d3889e4458fad8b34b5d811d152e255638999294 100644 --- a/profiler/advisor/dataset/timeline_event_dataset.py +++ b/profiler/advisor/dataset/timeline_event_dataset.py @@ -9,6 +9,7 @@ from profiler.advisor.common import constant as const from profiler.advisor.common.timeline.event import TimelineEvent from profiler.advisor.utils.utils import get_file_path_from_directory from profiler.advisor.utils.utils import singleton +from profiler.cluster_analyse.common_func.file_manager import FileManager logger = logging.getLogger() @@ -121,13 +122,13 @@ class TimelineEventDataset(Dataset): def parse_data_with_generator(self, func): result = [] try: - with open(self.timeline_data_list[0], "r") as f: - for i, event in tqdm(enumerate(ijson.items(f, "item")), - leave=False, ncols=100, desc="Building dataset for timeline analysis", - total=self.dataset_len): - func_res = func(index=i, event=event) - if func_res is not None: - result.append(func_res) + json_content = FileManager.read_json_file(self.timeline_data_list[0]) + for i, event in tqdm(enumerate(json_content), leave=False, ncols=100, + desc="Building dataset for timeline analysis", + total=self.dataset_len): + func_res = func(index=i, event=event) + if func_res: + result.append(func_res) except Exception as e: logger.warning("Error %s while parsing file %s, continue to timeline analysis", e, self.timeline_data_list[0]) diff --git a/profiler/advisor/display/html/render.py b/profiler/advisor/display/html/render.py index 8ea7c9e0fc22c7da71a673e399fcfc231fbf1453..3984fa8f34f0858a7281c9b51caaa43a170baf86 100644 --- a/profiler/advisor/display/html/render.py +++ b/profiler/advisor/display/html/render.py @@ -1,6 +1,7 @@ import os import logging from typing import List, Dict +from collections import defaultdict from jinja2 import Environment, FileSystemLoader from profiler.advisor.common import constant @@ -15,7 +16,7 @@ logger = logging.getLogger() class HTMLRender: def __init__(self): self.html = "" - self.render_list: Dict[str, List] = {} + self.render_list = defaultdict(list) def render_html(self, template_dir: str = "templates", template_name: str = "main.html", template_header=constant.DEFAULT_TEMPLATE_HEADER): @@ -30,8 +31,6 @@ class HTMLRender: autoescape=True) template = env.get_template(template_name) rendered_html = template.render(**kwargs) - if key not in self.render_list: - self.render_list[key] = [] self.render_list[key].append(rendered_html) return rendered_html diff --git "a/profiler/advisor/doc/AI CPU \347\256\227\345\255\220\346\233\277\346\215\242\346\240\267\344\276\213.md" b/profiler/advisor/doc/Samples of AI CPU Operator Replacement.md similarity index 100% rename from "profiler/advisor/doc/AI CPU \347\256\227\345\255\220\346\233\277\346\215\242\346\240\267\344\276\213.md" rename to profiler/advisor/doc/Samples of AI CPU Operator Replacement.md diff --git "a/profiler/advisor/doc/\346\230\207\350\205\276\350\277\201\347\247\273\350\236\215\345\220\210\347\256\227\345\255\220API\346\233\277\346\215\242\346\240\267\344\276\213.md" b/profiler/advisor/doc/Samples of Fused Operator API Replacement.md similarity 
index 100% rename from "profiler/advisor/doc/\346\230\207\350\205\276\350\277\201\347\247\273\350\236\215\345\220\210\347\256\227\345\255\220API\346\233\277\346\215\242\346\240\267\344\276\213.md" rename to profiler/advisor/doc/Samples of Fused Operator API Replacement.md diff --git a/profiler/advisor/fusion_operators_api_analysis.ipynb b/profiler/advisor/fusion_operators_api_analysis.ipynb index dcc71ba3c139f630c07545340e61c66b1f29d929..ac758f562f13c9dd7466279aac73002c0e68da55 100644 --- a/profiler/advisor/fusion_operators_api_analysis.ipynb +++ b/profiler/advisor/fusion_operators_api_analysis.ipynb @@ -81,7 +81,7 @@ " \n", " \n", " timeline_fusion_ops\n", - " Found 2 apis to be replaced based on the runtime env cann-8.0.0 and torch-2.1.0\n", + " Found 2 apis to be replaced based on the runtime env cann-8.0.RC1 and torch-2.1.0\n", " 1. Please replace training api according to sub table 'Affinity training api'\n", " \n", " \n", @@ -91,7 +91,7 @@ "+---------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------+\n", "| problem | description | suggestion |\n", "+---------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------+\n", - "| timeline_fusion_ops | Found 2 apis to be replaced based on the runtime env cann-8.0.0 and torch-2.1.0 | 1. Please replace training api according to sub table 'Affinity training api' |\n", + "| timeline_fusion_ops | Found 2 apis to be replaced based on the runtime env cann-8.0.RC1 and torch-2.1.0 | 1. Please replace training api according to sub table 'Affinity training api' |\n", "+---------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------+" ] }, diff --git a/profiler/advisor/interface/interface.py b/profiler/advisor/interface/interface.py index 231f595d70b7e9dd6ee436153dc24259cfef640b..59bfee77f60c24194cc3f392fc9c557d0f1ed70a 100644 --- a/profiler/advisor/interface/interface.py +++ b/profiler/advisor/interface/interface.py @@ -17,15 +17,15 @@ from profiler.advisor.analyzer.schedule.dispatch.timeline_op_dispatch_analyzer i class Interface: supported_analyzer = { "schedule": OrderedDict({ - SupportedScopes.TIMELINE_FUSION_OPS: TimelineFusionOpsAnalyzer + SupportedScopes.TIMELINE_FUSION_OPS: TimelineFusionOpsAnalyzer, + SupportedScopes.TIMELINE_OP_DISPATCH: OpDispatchAnalyzer }), "computation": OrderedDict({ SupportedScopes.DYNAMIC_SHAPE_ANALYSIS: DynamicShapeAnalyzer, SupportedScopes.AICPU_ANALYSIS: AicpuAnalyzer, SupportedScopes.OPERATOR_NO_BOUND_ANALYSIS: OperatorBoundAnalyzer, SupportedScopes.BLOCK_DIM_ANALYSIS: BlockDimAnalyzer, - SupportedScopes.GRAPH: FusionOPAnalyzer, - SupportedScopes.TIMELINE_OP_DISPATCH: OpDispatchAnalyzer + SupportedScopes.GRAPH: FusionOPAnalyzer }), "communication": OrderedDict(), "overall": OrderedDict({SupportedScopes.OVER_ALL: OverallSummaryAnalyzer}), diff --git a/profiler/advisor/rules/aicpu_rules.yaml b/profiler/advisor/rules/aicpu_rules.yaml index 9313700c800d337eaea18f5a634521710f09e465..58e6eef163204ea1b5efbb5148770948bd4afdad 100644 --- a/profiler/advisor/rules/aicpu_rules.yaml +++ b/profiler/advisor/rules/aicpu_rules.yaml @@ -1,5 +1,5 @@ DataTypeSuggeation: &DataTypeSuggeation "Data type {} in {} operator may cause AICPU issues, Try to convert to {} if possible." 
-AICPU_DOC_URL: &AICPU_DOC_URL "https://support.huaweicloud.com/bestpractice-modelarts/modelarts_10_2517.html" +AICPU_DOC_URL: &AICPU_DOC_URL "https://gitee.com/ascend/mstt/blob/master/profiler/advisor/doc/Samples%20of%20AI%20CPU%20Operator%20Replacement.md" CommonChecker: - DataTypeChecker: @@ -46,7 +46,7 @@ CommonChecker: suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ __ALL__ ] ignore_type: [ cast, tensorequal, equal, nonzero, mul ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8, int16, complex64, complex128 ] @@ -54,28 +54,28 @@ CommonChecker: suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ cast ] input: [ float, float32, float16, bool, int32, uint32, int64, uint64, uint8, dt_bf16 ] output: [ float, float32, float16, bool, int32, uint32, int64, uint64, uint8, dt_bf16 ] suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ tensorequal ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int8, uint8 ] output: [ bool ] suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ equal ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8 ] output: [ bool ] suggestion: *DataTypeSuggeation - DataTypeChecker: - cann_version: [8.0.0, 7.0.0] + cann_version: [8.0.RC1, 7.0.0] op_type: [ mul ] input: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8, complex64 ] output: [ float, float32, float16, dt_bf16, float64, bool, int32, int64, int8, uint8, complex64 ] diff --git a/profiler/advisor/rules/timeline_fusion_ops.yaml b/profiler/advisor/rules/timeline_fusion_ops.yaml index 10c12ff18dd8792e24a89c6d5fbb7ed87f643a9d..8207465dc4a5c5ddbb1cc934ef95951493c4bacb 100644 --- a/profiler/advisor/rules/timeline_fusion_ops.yaml +++ b/profiler/advisor/rules/timeline_fusion_ops.yaml @@ -49,7 +49,7 @@ "(slice|chunk)-sigmoid-mul-mul", "(slice|chunk)-mul-sigmoid-mul", "(slice|chunk)-mul-mul-sigmoid" ] -- cann_version: 8.0.0 +- cann_version: 8.0.RC1 torch_version: [1.11.0, 2.1.0] unique_id: 3 inherit_unique_id: 2 diff --git a/profiler/advisor/version.py b/profiler/advisor/version.py index 1a95cc3c0f93f49a2aaacf483770462d09961ff9..67d04140866a3df8ecb8484451c476006da2671d 100644 --- a/profiler/advisor/version.py +++ b/profiler/advisor/version.py @@ -30,9 +30,9 @@ def print_version_callback(ctx, param, value): # NOQA if not value or ctx.resilient_parsing: return - click.echo('Version {}'.format(get_package_version("att_advisor"))) + click.echo('Version {}'.format(get_package_version("msprof-analyze"))) ctx.exit() def cli_version(): - return get_package_version("att_advisor") + return get_package_version("msprof-analyze") diff --git a/profiler/cli/analyze_cli.py b/profiler/cli/analyze_cli.py index 7bd7f1722517edc2e8177d3b88af06a6217cf5f2..2e173dc87086a1335f49f9685b928708089aa1ea 100644 --- a/profiler/cli/analyze_cli.py +++ b/profiler/cli/analyze_cli.py @@ -58,7 +58,8 @@ def analyze_cli(**kwargs): @analyze_cli.command(context_settings=CONTEXT_SETTINGS, name="all", - short_help='Analyze timeline, operators and graph.') + short_help='Analyze timeline fusion operators, operators and graph,\ + operators dispatching and cluster.') @click.option('--profiling_path', '-d', 'profiling_path', type=click.Path(), required=True, 
help='Directory of profiling data') @click.option('--benchmark_profiling_path', '-bp', 'benchmark_profiling_path', type=click.Path(), @@ -93,7 +94,7 @@ def analyze_all(**kwargs) -> None: @analyze_cli.command(context_settings=CONTEXT_SETTINGS, name="schedule", - short_help='Analyze timeline, operators and graph.') + short_help='Analyze operators dispatching and timeline fusion operators.') @click.option('--profiling_path', '-d', 'profiling_path', type=click.Path(), required=True, help='Directory of profiling data') @click.option('--cann_version', '-cv', 'cann_version', @@ -112,7 +113,7 @@ def analyze_schedule(**kwargs) -> None: @analyze_cli.command(context_settings=CONTEXT_SETTINGS, name="computation", - short_help='Analyze timeline, operators and graph.') + short_help='Analyze operators and graph.') @click.option('--profiling_path', '-d', 'profiling_path', type=click.Path(), required=True, help='Directory of profiling data') @click.option('--cann_version', '-cv', 'cann_version', diff --git a/profiler/cluster_analyse/README.md b/profiler/cluster_analyse/README.md index 989122375db65d32047555d29dd25ed9289764ea..fdd43ca965fe17edf9b565d05bf12cb68bff8d71 100644 --- a/profiler/cluster_analyse/README.md +++ b/profiler/cluster_analyse/README.md @@ -2,7 +2,7 @@ cluster_analyse(集群分析工具)是在集群场景下,通过此工具来进行集群数据的分析,当前主要对基于通信域的迭代内耗时分析、通信时间分析以及通信矩阵分析为主, 从而定位慢卡、慢节点以及慢链路问题。 ## 性能数据采集 -当前集群调优工具主要支持Ascend PyTorch Profiler采集方式下的集群数据。采集方式参考:[Profiling数据采集](https://gitee.com/ascend/att/tree/master/profiler),此工具只需要通过Ascend PyTorch Porfiler工具采集NPU的性能数据即可。 +当前集群调优工具主要支持Ascend PyTorch Profiler采集方式下的集群数据。采集方式参考:[Profiling数据采集](https://gitee.com/ascend/mstt/tree/master/profiler),此工具只需要通过Ascend PyTorch Porfiler工具采集NPU的性能数据即可。 我们要求至少是L1级别的数据。 ```python @@ -86,7 +86,7 @@ experimental_config = torch_npu.profiler._ExperimentalConfig( ### 交付件 -集群分析工具的交付件通过Ascend Insight工具展示,详见《[MindStudio Ascend Insight用户指南](https://www.hiascend.com/document/detail/zh/mindstudio/70RC1/GUI-baseddevelopmenttool/msascendinsightug/AscendInsight_0002.html)》。 +集群分析工具的交付件通过MindStudio Insight工具展示,详见《[MindStudio Insight用户指南](https://www.hiascend.com/document/detail/zh/mindstudio/70RC2/GUI-baseddevelopmenttool/msascendinsightug/AscendInsight_0002.html)》。 #### cluster_step_trace_time.csv @@ -156,25 +156,25 @@ L列:Preparing,指迭代开始到首个计算或通信算子运行的时间 #### cluster_analysis.db -解析analysis.db或ascend_pytorch_profiler_{rank_id}.db生成的交付件,根据数据解析模式不同而解析不同的数据,可以使用Ascend Insight工具展示。 +解析analysis.db或ascend_pytorch_profiler_{rank_id}.db生成的交付件,根据数据解析模式不同而解析不同的数据,可以使用MindStudio Insight工具展示。 #### stats.ipynb - 数据解析模式为cann_api_sum时生成,保存在cluster_analysis_output/CannApiSum目录下。 - 可使用jupyter notebook工具或Ascend Insight工具打开,主要展示集群API耗时信息。 + 可使用jupyter notebook工具或MindStudio Insight工具打开,主要展示集群API耗时信息。 - 数据解析模式为compute_op_sum时生成,保存在cluster_analysis_output/ComputeOpSum目录下。 - 可使用jupyter notebook工具或Ascend Insight工具打开,主要展示集群计算算子耗时分析(将集群所有计算算子进行汇总并以图表展示),集群Rank计算算子耗时分析(将每个Rank的计算算子进行各自汇总)。 + 可使用jupyter notebook工具或MindStudio Insight工具打开,主要展示集群计算算子耗时分析(将集群所有计算算子进行汇总并以图表展示),集群Rank计算算子耗时分析(将每个Rank的计算算子进行各自汇总)。 - 数据解析模式为hccl_sum时生成,保存在cluster_analysis_output/HcclSum目录下。 - 可使用jupyter notebook工具或Ascend Insight工具打开,主要展示集群通信算子耗时分析(将集群所有通信算子进行汇总并以图表展示),集群Rank通信算子耗时分析(将每个Rank的通信算子进行各自汇总)、Top通信算子信息展示。 + 可使用jupyter notebook工具或MindStudio Insight工具打开,主要展示集群通信算子耗时分析(将集群所有通信算子进行汇总并以图表展示),集群Rank通信算子耗时分析(将每个Rank的通信算子进行各自汇总)、Top通信算子信息展示。 - 数据解析模式为mstx_sum时生成,保存在cluster_analysis_output/MstxSum目录下。 - 可使用jupyter notebook工具或Ascend Insight工具打开,主要展示集群场景mstx打点信息,分为框架侧、CANN侧和Device侧三部分的打点信息。 + 可使用jupyter 
notebook工具或MindStudio Insight工具打开,主要展示集群场景mstx打点信息,分为框架侧、CANN侧和Device侧三部分的打点信息。 diff --git a/profiler/compare_tools/README.md b/profiler/compare_tools/README.md index f7e24214341478df266cd5c59ba91152fa968a74..78ea5d8971722ec7d4f2b3ba624e66aa0bc33076 100644 --- a/profiler/compare_tools/README.md +++ b/profiler/compare_tools/README.md @@ -69,7 +69,7 @@ PyTorch Profiler采集结果数据目录结构如下: #### NPU性能数据采集 -通过Ascend PyTorch Profiler工具采集NPU的性能数据,采集参数配置与GPU基本一致,只需将GPU的性能数据采集代码中torch.profiler替换成torch_npu.profiler。,参考链接:[Profiling数据采集](https://gitee.com/ascend/att/tree/master/profiler)。 +通过Ascend PyTorch Profiler工具采集NPU的性能数据,采集参数配置与GPU基本一致,只需将GPU的性能数据采集代码中torch.profiler替换成torch_npu.profiler。,参考链接:[Profiling数据采集](https://gitee.com/ascend/mstt/tree/master/profiler)。 Ascend PyTorch Profiler采集结果数据目录结构如下: @@ -213,7 +213,7 @@ activities配置仅采集NPU数据,不配置experimental_config参数以及其 - 当Computing Time耗时增大,分析**算子性能**。 - 当Uncovered Communication Time耗时增大,分析**通信性能**,若通信性能分析没有劣化的通信算子,代表通信与计算的并行度较差,继续进行NPU的集群性能分析。 -- 当Mem Usage增大,分析**算子内存**,若没有明显占用较大的算子,则代表算子内存申请量没有差异,问题在于内存的释放(持有时间过久),可以使用tensorboard或ascend insight继续进行NPU内存的分析。 +- 当Mem Usage增大,分析**算子内存**,若没有明显占用较大的算子,则代表算子内存申请量没有差异,问题在于内存的释放(持有时间过久),可以使用TensorBoard或MindStudio insight继续进行NPU内存的分析。 ### 算子性能 diff --git a/profiler/compare_tools/compare_backend/comparator/base_comparator.py b/profiler/compare_tools/compare_backend/comparator/base_comparator.py index 330fb871ee19b9bac1c0dfff4cae5648ebeedf1c..8012dfae94440b7e17613f432770ec8b63ece431 100644 --- a/profiler/compare_tools/compare_backend/comparator/base_comparator.py +++ b/profiler/compare_tools/compare_backend/comparator/base_comparator.py @@ -21,4 +21,4 @@ class BaseComparator(ABC): @abstractmethod def _compare(self): - raise NotImplementedError("Function _compare need to be implemented.") + raise NotImplementedError("Function _compare need to be implemented.") \ No newline at end of file diff --git a/profiler/compare_tools/compare_backend/comparator/overall_metrics_comparator.py b/profiler/compare_tools/compare_backend/comparator/overall_metrics_comparator.py new file mode 100644 index 0000000000000000000000000000000000000000..d438dc41d563b163d14a1b391b2ef4a301144dc0 --- /dev/null +++ b/profiler/compare_tools/compare_backend/comparator/overall_metrics_comparator.py @@ -0,0 +1,50 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from math import isclose + +from compare_backend.comparator.base_comparator import BaseComparator +from compare_backend.utils.constant import Constant +from compare_backend.utils.excel_config import ExcelConfig + + +class OverallMetricsComparator(BaseComparator): + + def __init__(self, origin_data: dict, bean: any): + super().__init__(origin_data, bean) + self._row_style = [] + + @property + def base_info(self): + return self._origin_data.get(Constant.BASE_DATA) + + @property + def comp_info(self): + return self._origin_data.get(Constant.COMPARISON_DATA) + + def generate_data(self) -> dict: + self._compare() + return {self._sheet_name: { + "headers": self._headers, + "rows": self._rows, + "overhead": self._overhead, + "row_style": self._row_style + }} + + def _compare(self): + if isclose(self.base_info.e2e_time_ms, 0) or isclose(self.comp_info.e2e_time_ms, 0): + return + self._rows.extend(self._bean(self.base_info, self.comp_info).rows) + for row in self._rows: + self._row_style.append(ExcelConfig.ROW_STYLE_MAP.get(row[0], {})) # index 0 for metric index name diff --git a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py b/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py index 122009b9045074c908c33dc50fffd36f03eb4ff9..9c4825c0e8e0503127e6c450042cf784e73d9974 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py +++ b/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/kernel_details_bean.py @@ -1,8 +1,9 @@ import math +from decimal import Decimal import pandas as pd -from compare_backend.utils.common_func import convert_to_float +from compare_backend.utils.common_func import convert_to_float, convert_to_decimal from compare_backend.utils.constant import Constant @@ -12,8 +13,10 @@ class KernelDetailsBean: self._op_type = "" self._name = "" self._aiv_vec_time = 0.0 + self._aicore_time = 0.0 self._mac_time = 0.0 self._duration = 0.0 + self._start_time = Decimal("0") self.init() @property @@ -30,6 +33,12 @@ class KernelDetailsBean: return float("nan") return convert_to_float(self._aiv_vec_time) + @property + def aicore_time(self) -> float: + if self._aicore_time == "" or self._aicore_time == "N/A": + return float("nan") + return convert_to_float(self._aicore_time) + @property def mac_time(self) -> float: if self._mac_time == "" or self._mac_time == "N/A": @@ -40,6 +49,18 @@ class KernelDetailsBean: def duration(self) -> float: return convert_to_float(self._duration) + @property + def dur(self) -> float: + return convert_to_float(self._duration) + + @property + def start_time(self) -> Decimal: + return convert_to_decimal(self._start_time) + + @property + def end_time(self) -> Decimal: + return self.start_time + convert_to_decimal(self._duration) + def is_hide_op_pmu(self): if "mac_time(us)" in self._data.keys() or "aiv_vec_time(us)" in self._data.keys(): return False @@ -66,7 +87,7 @@ class KernelDetailsBean: def is_flash_attention(self): return "flashattention" in self.op_type.lower() - def is_cube(self): + def is_matmul(self): return "matmul" in self.op_type.lower() def is_conv(self): @@ -79,9 +100,17 @@ class KernelDetailsBean: def is_page_attention(self): return "pagedattention" in self.op_type.lower() + def is_trans(self): + return any(trans_mask in self.name.lower() for trans_mask in Constant.KERNEL_TRANS_MASK) + + def is_cube_kernel_cat(self): + return self.mac_time > 0 or self.aicore_time > 0 + def init(self): self._op_type = 
self._data.get('Type', "") self._name = self._data.get('Name', "") self._aiv_vec_time = self._data.get('aiv_vec_time(us)', "") + self._aicore_time = self._data.get("aicore_time(us)", "") self._mac_time = self._data.get('mac_time(us)', "") self._duration = self._data.get('Duration(us)', 0) + self._start_time = Decimal(self._data.get("Start Time(us)", "0")) diff --git a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py b/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py index cef6bb071243264c792e74f562e058ca1d8df7a1..245b51d105e1a4a872475da76682683859450401 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py +++ b/profiler/compare_tools/compare_backend/compare_bean/origin_data_bean/trace_event_bean.py @@ -114,6 +114,21 @@ class TraceEventBean: def is_torch_op(self, value: bool): self._is_torch_op = value + @classmethod + def is_sdma(cls): + return False + + @classmethod + def is_page_attention(cls): + return False + + @classmethod + def is_trans(cls) -> bool: + """ + 暂时没找到GPU判断trans的方法,暂时都是notrans + """ + return False + def is_m_mode(self) -> bool: return self._ph == "M" @@ -199,11 +214,44 @@ class TraceEventBean: self._name = name def is_conv(self): - return self.name.lower().startswith("aten::conv") + return self.lower_name.startswith("aten::conv") def is_lccl(self): return self.lower_name == "kernel_aivec" + def is_fa_for_cpu_op(self) -> bool: + """ + 这个类在cpu op和gpu中均有用到,这里是在cpu op阶段判断 + """ + return any(cube_mask in self.lower_name for cube_mask in Constant.CPU_OP_FA_MASK) + + def is_conv_for_cpu_op(self) -> bool: + """ + 这个类在cpu op和gpu中均有用到,这里是在cpu op阶段判断 + """ + return self.lower_name.startswith(Constant.CPU_OP_CONV) + + def is_matmul_for_cpu_op(self) -> bool: + """ + 这个类在cpu op和gpu中均有用到,这里是在cpu op阶段判断 + """ + return any(bwd_mask in self.lower_name for bwd_mask in Constant.CPU_OP_MATMUL_MASK) + + def is_bwd_for_cpu_op(self) -> bool: + """ + 这个类在cpu op和gpu中均有用到,这里是在cpu op阶段判断 + """ + return any(bwd_mask in self.lower_name for bwd_mask in Constant.BWD_LIST) + + def is_cpu_cube_op(self) -> bool: + return self.is_matmul_for_cpu_op() or self.is_fa_for_cpu_op() or self.is_conv_for_cpu_op() + + def is_vector(self): + return not any(cube_mask in self.lower_name for cube_mask in Constant.KERNEL_CUBE_MASK) + + def is_cube_kernel_cat(self): + return any(cube_mask in self.lower_name for cube_mask in Constant.KERNEL_CUBE_MASK) + def init(self): if isinstance(self._event, dict): self._pid = self._event.get("pid", 0) diff --git a/profiler/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py b/profiler/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py new file mode 100644 index 0000000000000000000000000000000000000000..544f8f5234d71eef52b9188e045c83baa3c70f20 --- /dev/null +++ b/profiler/compare_tools/compare_backend/compare_bean/overall_metrics_bean.py @@ -0,0 +1,255 @@ +# Copyright (c) 2024, Huawei Technologies Co., Ltd. +# All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +from math import isclose + +from compare_backend.compare_bean.profiling_info import ProfilingInfo +from compare_backend.utils.common_func import calculate_diff_ratio +from compare_backend.utils.constant import Constant +from compare_backend.utils.excel_config import ExcelConfig + + +class OverallMetricsBean: + TABLE_NAME = Constant.OVERALL_METRICS_TABLE + HEADERS = ExcelConfig.HEADERS.get(TABLE_NAME) + OVERHEAD = ExcelConfig.OVERHEAD.get(TABLE_NAME) + + def __init__(self, base_info: ProfilingInfo, comparison_info: ProfilingInfo): + self._base_data = OverallMetricsInfo(base_info).overall_metrics + self._comparison_data = OverallMetricsInfo(comparison_info).overall_metrics + + @property + def rows(self): + rows_data = [] + for index, base_data in self._base_data.items(): + comparison_data = self._comparison_data.get(index) + row = self.get_row_data(index, base_data, comparison_data) + if row: + rows_data.append(row) + return rows_data + + @staticmethod + def get_row_data(index, base_data, comparison_data): + if isclose(base_data[0], 0) and isclose(comparison_data[0], 0): + return [] + row_data = [index] + row_data.extend(base_data) + row_data.extend(comparison_data) + row_data.extend(calculate_diff_ratio(base_data[0], comparison_data[0])) + return row_data + + +class OverallMetricsInfo: + def __init__(self, profiling_info: ProfilingInfo): + self._profiling_info = profiling_info + self._overall_metrics_data_map = { + ExcelConfig.COMPUTING: self.computing_data, + ExcelConfig.FA: self.fa_data, + ExcelConfig.FA_FWD_CUBE: self.fa_fwd_cube_data, + ExcelConfig.FA_FWD_VECTOR: self.fa_fwd_vector_data, + ExcelConfig.FA_BWD_CUBE: self.fa_bwd_cube_data, + ExcelConfig.FA_BWD_VECTOR: self.fa_bwd_vector_data, + ExcelConfig.CONV: self.conv_data, + ExcelConfig.CONV_FWD_CUBE: self.conv_fwd_cube_data, + ExcelConfig.CONV_FWD_VECTOR: self.conv_fwd_vector_data, + ExcelConfig.CONV_BWD_CUBE: self.conv_bwd_cube_data, + ExcelConfig.CONV_BWD_VECTOR: self.conv_bwd_vector_data, + ExcelConfig.MM: self.mm_data, + ExcelConfig.MM_CUBE: self.mm_cube_data, + ExcelConfig.MM_VECTOR: self.mm_vector_data, + ExcelConfig.PA: self.pa_data, + ExcelConfig.VECTOR: self.vector_data, + ExcelConfig.VECTOR_TRANS: self.vector_trans_data, + ExcelConfig.VECTOR_NO_TRANS: self.vector_no_trans_data, + ExcelConfig.CUBE: self.cube_data, + ExcelConfig.SDMA_TM: self.sdma_tm_data, + ExcelConfig.OTHER: self.other_data, + ExcelConfig.COMMUNICATION_TIME: self.communication_data, + ExcelConfig.WAIT: self.wait_data, + ExcelConfig.TRANSMIT: self.transmit_data, + ExcelConfig.FREE_TIME: self.free_time_data, + ExcelConfig.SDMA: self.sdma_data, + ExcelConfig.FREE: self.free_data, + ExcelConfig.E2E_TIME: self.e2e_time_data + } + + @property + def overall_metrics(self): + return self._overall_metrics_data_map + + @property + def computing_data(self): + return [self._profiling_info.compute_time_ms, + self._profiling_info.compute_time_ms / self._profiling_info.e2e_time_ms, + sum((self._profiling_info.fa_total_num, self._profiling_info.conv_total_num, + self._profiling_info.mm_total_num, self._profiling_info.vector_total_num, + self._profiling_info.sdma_num_tensor_move, self._profiling_info.other_cube_num, + self._profiling_info.page_attention_num))] + + @property + def fa_data(self): + return [self._profiling_info.fa_total_time, + self._profiling_info.fa_total_time / self._profiling_info.e2e_time_ms, + self._profiling_info.fa_total_num] + + @property + def 
fa_fwd_cube_data(self): + return [self._profiling_info.fa_time_fwd_cube, + self._profiling_info.fa_time_fwd_cube / self._profiling_info.e2e_time_ms, + self._profiling_info.fa_num_fwd_cube] + + @property + def fa_fwd_vector_data(self): + return [self._profiling_info.fa_time_fwd_vector, + self._profiling_info.fa_time_fwd_vector / self._profiling_info.e2e_time_ms, + self._profiling_info.fa_num_fwd_vector] + + @property + def fa_bwd_cube_data(self): + return [self._profiling_info.fa_time_bwd_cube, + self._profiling_info.fa_time_bwd_cube / self._profiling_info.e2e_time_ms, + self._profiling_info.fa_num_bwd_cube] + + @property + def fa_bwd_vector_data(self): + return [self._profiling_info.fa_time_bwd_vector, + self._profiling_info.fa_time_bwd_vector / self._profiling_info.e2e_time_ms, + self._profiling_info.fa_num_bwd_vector] + + @property + def conv_data(self): + return [self._profiling_info.conv_total_time, + self._profiling_info.conv_total_time / self._profiling_info.e2e_time_ms, + self._profiling_info.conv_total_num] + + @property + def conv_fwd_cube_data(self): + return [self._profiling_info.conv_time_fwd_cube, + self._profiling_info.conv_time_fwd_cube / self._profiling_info.e2e_time_ms, + self._profiling_info.conv_num_fwd_cube] + + @property + def conv_fwd_vector_data(self): + return [self._profiling_info.conv_time_fwd_vector, + self._profiling_info.conv_time_fwd_vector / self._profiling_info.e2e_time_ms, + self._profiling_info.conv_num_fwd_vector] + + @property + def conv_bwd_cube_data(self): + return [self._profiling_info.conv_time_bwd_cube, + self._profiling_info.conv_time_bwd_cube / self._profiling_info.e2e_time_ms, + self._profiling_info.conv_num_bwd_cube] + + @property + def conv_bwd_vector_data(self): + return [self._profiling_info.conv_time_bwd_vector, + self._profiling_info.conv_time_bwd_vector / self._profiling_info.e2e_time_ms, + self._profiling_info.conv_num_bwd_vector] + + @property + def mm_data(self): + return [self._profiling_info.mm_total_time, + self._profiling_info.mm_total_time / self._profiling_info.e2e_time_ms, + self._profiling_info.mm_total_num] + + @property + def mm_cube_data(self): + return [self._profiling_info.matmul_time_cube, + self._profiling_info.matmul_time_cube / self._profiling_info.e2e_time_ms, + self._profiling_info.matmul_num_cube] + + @property + def mm_vector_data(self): + return [self._profiling_info.matmul_time_vector, + self._profiling_info.matmul_time_vector / self._profiling_info.e2e_time_ms, + self._profiling_info.matmul_num_vector] + + @property + def pa_data(self): + return [self._profiling_info.page_attention_time, + self._profiling_info.page_attention_time / self._profiling_info.e2e_time_ms, + self._profiling_info.page_attention_num] + + @property + def vector_data(self): + return [self._profiling_info.vector_total_time, + self._profiling_info.vector_total_time / self._profiling_info.e2e_time_ms, + self._profiling_info.vector_total_num] + + @property + def vector_trans_data(self): + return [self._profiling_info.vector_time_trans, + self._profiling_info.vector_time_trans / self._profiling_info.e2e_time_ms, + self._profiling_info.vector_num_trans] + + @property + def vector_no_trans_data(self): + return [self._profiling_info.vector_time_notrans, + self._profiling_info.vector_time_notrans / self._profiling_info.e2e_time_ms, + self._profiling_info.vector_num_notrans] + + @property + def cube_data(self): + return [self._profiling_info.other_cube_time, + self._profiling_info.other_cube_time / self._profiling_info.e2e_time_ms, + 
self._profiling_info.other_cube_num] + + @property + def sdma_tm_data(self): + return [self._profiling_info.sdma_time_tensor_move, + self._profiling_info.sdma_time_tensor_move / self._profiling_info.e2e_time_ms, + self._profiling_info.sdma_num_tensor_move] + + @property + def other_data(self): + other_time = max((0, + self._profiling_info.compute_time_ms - self._profiling_info.fa_total_time - + self._profiling_info.conv_total_time - self._profiling_info.mm_total_time - + self._profiling_info.vector_total_time - self._profiling_info.sdma_time_tensor_move - + self._profiling_info.other_cube_time - self._profiling_info.page_attention_time)) + return [other_time, other_time / self._profiling_info.e2e_time_ms, "/"] + + @property + def communication_data(self): + return [self._profiling_info.communication_not_overlapped_ms, + self._profiling_info.communication_not_overlapped_ms / self._profiling_info.e2e_time_ms, "/"] + + @property + def wait_data(self): + return [self._profiling_info.wait_time_ms, + self._profiling_info.wait_time_ms / self._profiling_info.e2e_time_ms, "/"] + + @property + def transmit_data(self): + return [self._profiling_info.transmit_time_ms, + self._profiling_info.transmit_time_ms / self._profiling_info.e2e_time_ms, "/"] + + @property + def free_time_data(self): + return [self._profiling_info.free_time_ms, + self._profiling_info.free_time_ms / self._profiling_info.e2e_time_ms, "/"] + + @property + def sdma_data(self): + return [self._profiling_info.sdma_time_stream, + self._profiling_info.sdma_time_stream / self._profiling_info.e2e_time_ms, "/"] + + @property + def free_data(self): + free = self._profiling_info.free_time_ms - self._profiling_info.sdma_time_stream + return [free, free / self._profiling_info.e2e_time_ms, "/"] + + @property + def e2e_time_data(self): + return [self._profiling_info.e2e_time_ms, 1, "/"] diff --git a/profiler/compare_tools/compare_backend/compare_bean/profiling_info.py b/profiler/compare_tools/compare_backend/compare_bean/profiling_info.py index e5d9bf26e985330d830ba6e01f62525fe88e43ea..e0a80a4d30d0feda38d4290667df6620855d8562 100644 --- a/profiler/compare_tools/compare_backend/compare_bean/profiling_info.py +++ b/profiler/compare_tools/compare_backend/compare_bean/profiling_info.py @@ -37,6 +37,105 @@ class ProfilingInfo: self.hide_op_details = False self.is_level0 = False + # 性能拆解新指标 + self.fa_time_fwd_cube = 0.0 + self.fa_num_fwd_cube = 0 + self.fa_time_bwd_cube = 0.0 + self.fa_num_bwd_cube = 0 + self.fa_time_fwd_vector = 0.0 + self.fa_num_fwd_vector = 0 + self.fa_time_bwd_vector = 0.0 + self.fa_num_bwd_vector = 0 + + self.conv_time_fwd_cube = 0.0 + self.conv_num_fwd_cube = 0 + self.conv_time_bwd_cube = 0.0 + self.conv_num_bwd_cube = 0 + self.conv_time_fwd_vector = 0.0 + self.conv_num_fwd_vector = 0 + self.conv_time_bwd_vector = 0.0 + self.conv_num_bwd_vector = 0 + + self.matmul_time_cube = 0.0 + self.matmul_num_cube = 0 + self.matmul_time_vector = 0.0 + self.matmul_num_vector = 0 + + self.page_attention_time = 0.0 + self.page_attention_num = 0 + + self.vector_time_trans = 0.0 + self.vector_num_trans = 0 + self.vector_time_notrans = 0.0 + self.vector_num_notrans = 0 + + self.sdma_time_tensor_move = 0.0 + self.sdma_num_tensor_move = 0 + self.sdma_time_stream = 0.0 + self.sdma_num_stream = 0 + + self.other_cube_time = 0.0 + self.other_cube_num = 0 + + @property + def e2e_time_ms(self): + return self.e2e_time * 10 ** 3 + + @property + def compute_time_ms(self): + return self.compute_time * 10 ** 3 + + @property + def free_time_ms(self): + return 
self.scheduling_time * 10 ** 3 + + @property + def communication_not_overlapped_ms(self): + return self.communication_not_overlapped * 10 ** 3 + + @property + def wait_time_ms(self): + return self.wait_time * 10 ** 3 + + @property + def transmit_time_ms(self): + return (self.communication_not_overlapped - self.wait_time) * 10 ** 3 + + @property + def fa_total_time(self): + return sum((self.fa_time_fwd_cube, self.fa_time_fwd_vector, self.fa_time_bwd_cube, self.fa_time_bwd_vector)) + + @property + def fa_total_num(self): + return sum((self.fa_num_fwd_cube, self.fa_num_fwd_vector, self.fa_num_bwd_cube, self.fa_num_bwd_vector)) + + @property + def conv_total_time(self): + return sum( + (self.conv_time_fwd_cube, self.conv_time_fwd_vector, self.conv_time_bwd_cube, + self.conv_time_bwd_vector)) + + @property + def conv_total_num(self): + return sum((self.conv_num_fwd_cube, self.conv_num_fwd_vector, self.conv_num_bwd_cube, + self.conv_num_bwd_vector)) + + @property + def mm_total_time(self): + return sum((self.matmul_time_cube, self.matmul_time_vector)) + + @property + def mm_total_num(self): + return sum((self.matmul_num_cube, self.matmul_num_vector)) + + @property + def vector_total_time(self): + return sum((self.vector_time_trans, self.vector_time_notrans)) + + @property + def vector_total_num(self): + return sum((self.vector_num_trans, self.vector_num_notrans)) + def trans_time_to_s(self): self.cube_time = self.cube_time / 10 ** 6 self.other_time = self.other_time / 10 ** 6 @@ -54,6 +153,24 @@ class ProfilingInfo: self.conv_time_fwd = self.conv_time_fwd / 10 ** 6 self.conv_time_bwd = self.conv_time_bwd / 10 ** 6 + # 新指标单位为ms + self.fa_time_fwd_cube /= 10 ** 3 + self.fa_time_bwd_cube /= 10 ** 3 + self.fa_time_fwd_vector /= 10 ** 3 + self.fa_time_bwd_vector /= 10 ** 3 + self.conv_time_fwd_cube /= 10 ** 3 + self.conv_time_bwd_cube /= 10 ** 3 + self.conv_time_fwd_vector /= 10 ** 3 + self.conv_time_bwd_vector /= 10 ** 3 + self.matmul_time_cube /= 10 ** 3 + self.matmul_time_vector /= 10 ** 3 + self.vector_time_trans /= 10 ** 3 + self.vector_time_notrans /= 10 ** 3 + self.sdma_time_tensor_move /= 10 ** 3 + self.sdma_time_stream /= 10 ** 3 + self.page_attention_time /= 10 ** 3 + self.other_cube_time /= 10 ** 3 + def calculate_other_time(self): self.other_time = max( [0, self.compute_time - self.cube_time - self.fa_time_fwd - self.fa_time_bwd - @@ -64,8 +181,7 @@ class ProfilingInfo: - self.conv_time_fwd - self.conv_time_bwd def calculate_schedule_time(self): - self.scheduling_time = (self.e2e_time - self.compute_time - self.lccl_time \ - - self.communication_not_overlapped) + self.scheduling_time = (self.e2e_time - self.compute_time - self.lccl_time - self.communication_not_overlapped) def update_fa_fwd_info(self, time: float): self.fa_time_fwd += time @@ -75,6 +191,30 @@ class ProfilingInfo: self.fa_time_bwd += time self.fa_num_bwd += 1 + def update_fa_fwd_cube_info(self, time: float): + self.fa_time_fwd_cube += time + self.fa_num_fwd_cube += 1 + + def update_fa_bwd_cube_info(self, time: float): + self.fa_time_bwd_cube += time + self.fa_num_bwd_cube += 1 + + def update_fa_fwd_vector_info(self, time: float): + self.fa_time_fwd_vector += time + self.fa_num_fwd_vector += 1 + + def update_fa_bwd_vector_info(self, time: float): + self.fa_time_bwd_vector += time + self.fa_num_bwd_vector += 1 + + def update_sdma_tensor_move_info(self, time: float): + self.sdma_time_tensor_move += time + self.sdma_num_tensor_move += 1 + + def update_sdma_stream_info(self, time: float, num: int = 1): + self.sdma_time_stream 
+= time + self.sdma_num_stream += num + def update_pa_info(self, time: float): self.pa_time += time self.pa_num += 1 @@ -91,6 +231,42 @@ class ProfilingInfo: self.conv_time_bwd += time self.conv_num_bwd += 1 + def update_conv_bwd_cube_info(self, time: float): + self.conv_time_bwd_cube += time + self.conv_num_bwd_cube += 1 + + def update_conv_fwd_cube_info(self, time: float): + self.conv_time_fwd_cube += time + self.conv_num_fwd_cube += 1 + + def update_conv_bwd_vector_info(self, time: float): + self.conv_time_bwd_vector += time + self.conv_num_bwd_vector += 1 + + def update_conv_fwd_vector_info(self, time: float): + self.conv_time_fwd_vector += time + self.conv_num_fwd_vector += 1 + + def update_matmul_cube_info(self, time: float): + self.matmul_time_cube += time + self.matmul_num_cube += 1 + + def update_matmul_vector_info(self, time: float): + self.matmul_time_vector += time + self.matmul_num_vector += 1 + + def update_page_attention_info(self, time: float): + self.page_attention_time += time + self.page_attention_num += 1 + + def update_vector_trans_info(self, time: float): + self.vector_time_trans += time + self.vector_num_trans += 1 + + def update_vector_notrans_info(self, time: float): + self.vector_time_notrans += time + self.vector_num_notrans += 1 + def update_sdma_info(self, time: float, num: int = 1): self.sdma_time += time self.sdma_num += num @@ -103,6 +279,10 @@ class ProfilingInfo: self.vec_time += time self.vec_num += 1 + def update_other_cube_info(self, time: float): + self.other_cube_time += time + self.other_cube_num += 1 + def set_compute_time(self, time: float): self.compute_time = time diff --git a/profiler/compare_tools/compare_backend/generator/detail_performance_generator.py b/profiler/compare_tools/compare_backend/generator/detail_performance_generator.py index 5b93d888a4b093a6509438ec6a3c916a50b48e9a..292e312815452c78fbc71fcca9860f887b38d8d4 100644 --- a/profiler/compare_tools/compare_backend/generator/detail_performance_generator.py +++ b/profiler/compare_tools/compare_backend/generator/detail_performance_generator.py @@ -8,6 +8,7 @@ from compare_backend.comparator.module_comparetor import ModuleComparator from compare_backend.comparator.module_statistic_comparator import ModuleStatisticComparator from compare_backend.comparator.operator_comparator import OperatorComparator from compare_backend.comparator.operator_statistic_comparator import OperatorStatisticComparator +from compare_backend.comparator.overall_metrics_comparator import OverallMetricsComparator from compare_backend.compare_bean.communication_bean import CommunicationBean from compare_backend.compare_bean.memory_compare_bean import MemoryCompareBean from compare_backend.compare_bean.memory_statistic_bean import MemoryStatisticBean @@ -15,6 +16,7 @@ from compare_backend.compare_bean.module_compare_bean import ModuleCompareBean from compare_backend.compare_bean.module_statistic_bean import ModuleStatisticBean from compare_backend.compare_bean.operator_compare_bean import OperatorCompareBean from compare_backend.compare_bean.operator_statistic_bean import OperatorStatisticBean +from compare_backend.compare_bean.overall_metrics_bean import OverallMetricsBean from compare_backend.data_prepare.module_data_prepare import ModuleDataPrepare from compare_backend.data_prepare.operator_data_prepare import OperatorDataPrepare from compare_backend.generator.base_generator import BaseGenerator @@ -41,8 +43,16 @@ class DetailPerformanceGenerator(BaseGenerator): self._args.enable_communication_compare: 
print("[INFO] Start to compare performance detail data, please wait.") comparator_list = self._create_comparator() - for comparator in comparator_list: - self._result_data.update(comparator.generate_data()) + else: + comparator_list = [] + if self._args.enable_profiling_compare: + overall_data = {Constant.BASE_DATA: self._profiling_data_dict.get(Constant.BASE_DATA).overall_metrics, + Constant.COMPARISON_DATA: self._profiling_data_dict.get( + Constant.COMPARISON_DATA).overall_metrics} + # overall 数据在最前面 + comparator_list.insert(0, OverallMetricsComparator(overall_data, OverallMetricsBean)) + for comparator in comparator_list: + self._result_data.update(comparator.generate_data()) def generate_view(self): if not self._result_data: @@ -57,6 +67,7 @@ class DetailPerformanceGenerator(BaseGenerator): comparator_list = [] op_compare_result = [] + if self._args.enable_operator_compare: module_compare_result = self.match_nn_module() if self._profiling_data_dict.get( Constant.BASE_DATA).python_function_data and self._profiling_data_dict.get( diff --git a/profiler/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py b/profiler/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py index 2127ff5e75e23e98f0debb0dfdafbeb01930c082..6ee07a65696e4c482b80238a66b0564f2c29e8f0 100644 --- a/profiler/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py +++ b/profiler/compare_tools/compare_backend/profiling_parser/base_profiling_parser.py @@ -2,6 +2,7 @@ from abc import abstractmethod, ABC from decimal import Decimal from compare_backend.compare_bean.origin_data_bean.compare_event import KernelEvent, MemoryEvent +from compare_backend.compare_bean.origin_data_bean.kernel_details_bean import KernelDetailsBean from compare_backend.compare_bean.origin_data_bean.trace_event_bean import TraceEventBean from compare_backend.compare_bean.profiling_info import ProfilingInfo from compare_backend.utils.constant import Constant @@ -66,6 +67,18 @@ class BaseProfilingParser(ABC): self._comm_list = [] self._read_trace_event() self._cur_func_index = 0 + self._categorize_performance_index = 0 + self._cpu_cube_op = None + self._bwd_tid = None + + @property + def cpu_cube_op(self): + if self._cpu_cube_op is not None: + return self._cpu_cube_op + cpu_cube_op = [op for op in self._result_data.torch_op_data if op.is_cpu_cube_op()] + cpu_cube_op.sort(key=lambda x: x.start_time) + self._cpu_cube_op = cpu_cube_op + return self._cpu_cube_op @abstractmethod def _update_memory_list(self): @@ -102,6 +115,90 @@ class BaseProfilingParser(ABC): self._check_result_data() return self._result_data + def categorize_computing_performance_data(self, tk: (TraceEventBean, KernelDetailsBean), flow_dict_new: dict): + if tk.is_page_attention(): + self._result_data.overall_metrics.update_page_attention_info(tk.dur) + return + if tk.is_sdma(): + self._result_data.overall_metrics.update_sdma_tensor_move_info(tk.dur) + return + flow_start_time = flow_dict_new.get(tk.start_time) + if flow_start_time: + while self._categorize_performance_index < len(self.cpu_cube_op): + cur_op = self.cpu_cube_op[self._categorize_performance_index] + if cur_op.end_time < flow_start_time: + self._categorize_performance_index += 1 + continue + if cur_op.start_time <= flow_start_time: + self._categorize_cube_performance_data(cur_op, tk) + return + break + if self._profiling_type == Constant.NPU: + # 缺失torch至npu连线的算子,判断fa/conv/matmul使用kernel_details.csv的op_type字段 + if tk.is_flash_attention(): + if tk.is_fa_bwd(): + 
self._result_data.overall_metrics.update_fa_bwd_cube_info(tk.dur) + else: + self._result_data.overall_metrics.update_fa_fwd_cube_info(tk.dur) + return + elif tk.is_conv(): + if tk.is_conv_bwd(): + self._result_data.overall_metrics.update_conv_bwd_cube_info(tk.dur) + else: + self._result_data.overall_metrics.update_conv_fwd_cube_info(tk.dur) + return + elif tk.is_matmul(): + self._result_data.overall_metrics.update_matmul_cube_info(tk.dur) + return + if tk.is_cube_kernel_cat(): + self._result_data.overall_metrics.update_other_cube_info(tk.dur) + elif tk.is_trans(): + self._result_data.overall_metrics.update_vector_trans_info(tk.dur) + else: + self._result_data.overall_metrics.update_vector_notrans_info(tk.dur) + + def _categorize_cube_performance_data(self, cpu_op: TraceEventBean, tk: (TraceEventBean, KernelDetailsBean)): + """ + 判断fa/conv/matmul/vector使用cpu_op + """ + if cpu_op.is_fa_for_cpu_op(): + if self._is_backward(cpu_op): + if tk.is_cube_kernel_cat(): + self._result_data.overall_metrics.update_fa_bwd_cube_info(tk.dur) + else: + self._result_data.overall_metrics.update_fa_bwd_vector_info(tk.dur) + else: + if tk.is_cube_kernel_cat(): + self._result_data.overall_metrics.update_fa_fwd_cube_info(tk.dur) + else: + self._result_data.overall_metrics.update_fa_fwd_vector_info(tk.dur) + elif cpu_op.is_conv_for_cpu_op(): + if self._is_backward(cpu_op): + if tk.is_cube_kernel_cat(): + self._result_data.overall_metrics.update_conv_bwd_cube_info(tk.dur) + else: + self._result_data.overall_metrics.update_conv_bwd_vector_info(tk.dur) + else: + if tk.is_cube_kernel_cat(): + self._result_data.overall_metrics.update_conv_fwd_cube_info(tk.dur) + else: + self._result_data.overall_metrics.update_conv_fwd_vector_info(tk.dur) + elif cpu_op.is_matmul_for_cpu_op(): # matmul + if tk.is_cube_kernel_cat(): + self._result_data.overall_metrics.update_matmul_cube_info(tk.dur) + else: + self._result_data.overall_metrics.update_matmul_vector_info(tk.dur) + + def _is_backward(self, event: TraceEventBean): + return event.tid == self._bwd_tid or event.is_bwd_for_cpu_op() + + def _get_flow_time_dict(self): + return { + flow_event["end"].start_time: flow_event["start"].start_time + for flow_event in self._flow_dict.values() + if flow_event.get("end") and flow_event.get("start") + } + def _dispatch_events(self): if not self._dispatch_func: return diff --git a/profiler/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py b/profiler/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py index c4089aec9bdcb35b80ae9ff9121fcd75bde3a63e..7b1ae1a5a12ac1547123f5822e63069d719a18a6 100644 --- a/profiler/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py +++ b/profiler/compare_tools/compare_backend/profiling_parser/gpu_profiling_parser.py @@ -20,6 +20,7 @@ class GPUProfilingParser(BaseProfilingParser): self._compute_stream_id = self._infer_compute_stream_id() self._marks = defaultdict(int) self._aten_index = 0 + self._find_bwd_tid() @classmethod def __is_flash_attention(cls, name: str): @@ -30,10 +31,7 @@ class GPUProfilingParser(BaseProfilingParser): @classmethod def __is_sdma_time(cls, name: str): - for mark in cls.SDMA_MARK_LIST: - if mark in name.lower(): - return True - return False + return any(mask in name.lower() for mask in cls.SDMA_MARK_LIST) def _update_memory_list(self): if not self._enable_memory_compare: @@ -68,19 +66,15 @@ class GPUProfilingParser(BaseProfilingParser): min_ts = sys.float_info.max max_ts = sys.float_info.min self._trace_events.sort(key=lambda x: 
x.start_time) - aten_events = list(filter(lambda x: x.name.startswith("aten::"), self._trace_events)) - flow_dict_new = {} - for flow_event in self._flow_dict.values(): - start_event = flow_event.get("start") - end_event = flow_event.get("end") - if start_event and end_event: - flow_dict_new[end_event.start_time] = start_event.start_time + aten_events = [event for event in self._trace_events if event.name.startswith("aten::")] + flow_dict_new = self._get_flow_time_dict() for event in self._trace_events: if event.stream: min_ts = min(event.start_time, min_ts) max_ts = max(event.end_time, max_ts) if event.stream == self._compute_stream_id and self.__is_sdma_time(event.name): self._result_data.overall_metrics.update_sdma_info(event.dur) + self._result_data.overall_metrics.update_sdma_stream_info(event.dur) continue if not event.is_kernel_cat(): continue @@ -88,6 +82,7 @@ class GPUProfilingParser(BaseProfilingParser): if event.is_nccl_name(): continue self.__add_compute_time(event, aten_events, flow_dict_new) + self.categorize_computing_performance_data(event, flow_dict_new) self._aten_events = None self._result_data.overall_metrics.set_e2e_time(float(max_ts - min_ts)) self.__add_compute_and_overlap_time() @@ -162,7 +157,7 @@ class GPUProfilingParser(BaseProfilingParser): def _get_dispatch_func(self): func_set = set() - if self._enable_memory_compare or self._enable_operator_compare: + if self._enable_memory_compare or self._enable_operator_compare or self._enable_profiling_compare: func_set.add(self._picking_torch_op_event) if self._enable_communication_compare: func_set.add(self._picking_kernel_event) @@ -174,6 +169,8 @@ class GPUProfilingParser(BaseProfilingParser): func_set.add(self._picking_flow_event) if self._enable_memory_compare or self._enable_profiling_compare: func_set.add(self._picking_memory_event) + if self._enable_profiling_compare: + func_set.add(self._picking_flow_event) return list(func_set) def _infer_compute_stream_id(self): @@ -187,3 +184,9 @@ class GPUProfilingParser(BaseProfilingParser): raise RuntimeError('[ERROR] The profiling data does not contain kernel running data.') counter = Counter(kernel_stream_ids) return counter.most_common(1)[0][0] + + def _find_bwd_tid(self): + for event in self._trace_events: + if event.is_fwdbwd() and event.is_flow_end(): + self._bwd_tid = event.tid + break diff --git a/profiler/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py b/profiler/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py index 70ce44b44eb419196dc479dc30ae0b1e4a1136cb..457a3b6be5e6a93a9a5e2a78d028096895b6ba56 100644 --- a/profiler/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py +++ b/profiler/compare_tools/compare_backend/profiling_parser/npu_profiling_parser.py @@ -36,7 +36,7 @@ class NPUProfilingParser(BaseProfilingParser): def _get_dispatch_func(self): func_list = set() - if self._enable_memory_compare or self._enable_operator_compare: + if self._enable_memory_compare or self._enable_operator_compare or self._enable_profiling_compare: func_list.add(self._picking_torch_op_event) if self._enable_operator_compare or self._args.max_kernel_num: func_list.add(self._picking_kernel_event) @@ -52,6 +52,7 @@ class NPUProfilingParser(BaseProfilingParser): func_list.add(self._picking_overlap_analysis_data) func_list.add(self._picking_kernel_event) func_list.add(self._picking_hccl_event) + func_list.add(self._picking_flow_event) return list(func_list) def _update_memory_list(self): @@ -205,6 +206,8 @@ class 
NPUProfilingParser(BaseProfilingParser): def _filter_meta_id(self): for event in self._trace_events: + if event.is_fwdbwd() and event.is_flow_end(): + self._bwd_tid = event.tid if not event.is_process_meta(): continue if event.is_hccl_process_name(): @@ -244,17 +247,7 @@ class NPUProfilingParser(BaseProfilingParser): self._result_data.overall_metrics.update_lccl_info(event.dur) def __parse_kernel_csv(self): - try: - kernel_details = FileReader.read_csv_file(self._kernel_detail_path, KernelDetailsBean) - except Exception: - print('[WARNING] Npu kernel details csv file is not available.') - return - if not kernel_details or kernel_details[0].is_hide_op_pmu(): - self._result_data.overall_metrics.hide_op_details = True - return - for kernel in kernel_details: - if kernel.is_invalid(): - continue + def __screen_data(kernel: KernelDetailsBean): if kernel.is_flash_attention(): if kernel.is_fa_bwd(): self._result_data.overall_metrics.update_fa_bwd_info(kernel.duration) @@ -265,7 +258,7 @@ class NPUProfilingParser(BaseProfilingParser): self._result_data.overall_metrics.update_conv_bwd_info(kernel.duration) else: self._result_data.overall_metrics.update_conv_fwd_info(kernel.duration) - elif kernel.is_cube(): + elif kernel.is_matmul(): self._result_data.overall_metrics.update_cube_info(kernel.duration) elif kernel.is_sdma(): self._result_data.overall_metrics.update_sdma_info(kernel.duration) @@ -276,6 +269,22 @@ class NPUProfilingParser(BaseProfilingParser): else: self._result_data.overall_metrics.update_cube_info(kernel.duration) + try: + kernel_details = FileReader.read_csv_file(self._kernel_detail_path, KernelDetailsBean) + except Exception: + print('[WARNING] Npu kernel details csv file is not available.') + return + if not kernel_details or kernel_details[0].is_hide_op_pmu(): + self._result_data.overall_metrics.hide_op_details = True + return + flow_dict_new = self._get_flow_time_dict() + kernel_details.sort(key=lambda x: x.start_time) + for kernel in kernel_details: + if kernel.is_invalid(): + continue + __screen_data(kernel) + self.categorize_computing_performance_data(kernel, flow_dict_new) + def __parse_mem_csv(self): try: memory_record = FileReader.read_csv_file(self._memory_record_path, MemoryRecordBean) @@ -321,3 +330,4 @@ class NPUProfilingParser(BaseProfilingParser): for stream in compute_stream: dur_list = sdma_dict.get(stream, []) self._result_data.overall_metrics.update_sdma_info(sum(dur_list), len(dur_list)) + self._result_data.overall_metrics.update_sdma_stream_info(sum(dur_list), len(dur_list)) diff --git a/profiler/compare_tools/compare_backend/utils/constant.py b/profiler/compare_tools/compare_backend/utils/constant.py index 1b77b214c85f6733e36298e119e43a778fd7969f..e2002588024fa2a701874ffa381590b0830d2fab 100644 --- a/profiler/compare_tools/compare_backend/utils/constant.py +++ b/profiler/compare_tools/compare_backend/utils/constant.py @@ -11,6 +11,7 @@ class Constant(object): GREEN_COLOR = "00FF00" RED_COLOR = "FF0000" BLUE_COLOR = "00BFFF" + LIGHT_BLUE_COLOR = "87CEFA" US_TO_MS = 1000 KB_TO_MB = 1024 INVALID_VALUE = -1 @@ -55,6 +56,7 @@ class Constant(object): PERFORMANCE_TABLE = "Model Profiling Time Distribution" MODULE_TABLE = "ModuleCompare" MODULE_TOP_TABLE = "ModuleCompareStatistic" + OVERALL_METRICS_TABLE = "OverallMetrics" # memory SIZE = "Size(KB)" @@ -74,7 +76,13 @@ class Constant(object): MEMORY_LIST = "memory_list" COMMUNICATION_DICT = "comm_dict" - #compare type + # compare type OVERALL_COMPARE = "overall" BWD_LIST = ["bwd", "backward", "back"] + + CPU_OP_FA_MASK 
= ("flash_attention", "fusion_attention", "flashattn", "xformers_flash", "efficient_attention") + CPU_OP_CONV = "aten::conv" + CPU_OP_MATMUL_MASK = ("aten::addmm", "aten::bmm", "aten::mm", "aten::matmul") + KERNEL_CUBE_MASK = ("gemm", "conv", "cutlass", "wgrad") + KERNEL_TRANS_MASK = ("cast", "transdata", "transpose") diff --git a/profiler/compare_tools/compare_backend/utils/excel_config.py b/profiler/compare_tools/compare_backend/utils/excel_config.py index 306abcdfec6e62f24977b989258ad190a90c9bd7..ae808863e77118358800c0fce5de2a3b763ec5e4 100644 --- a/profiler/compare_tools/compare_backend/utils/excel_config.py +++ b/profiler/compare_tools/compare_backend/utils/excel_config.py @@ -18,6 +18,8 @@ class CellFormatType: 'valign': 'vcenter', 'bold': True, 'border': True} # 绿色背景,加粗 YELLOW_BOLD = {"font_name": "Arial", 'font_size': 11, 'fg_color': Constant.YELLOW_COLOR, 'align': 'left', 'valign': 'vcenter', 'bold': True, 'border': True} # 黄色背景,加粗 + BLUE_NORMAL = {'fg_color': Constant.BLUE_COLOR} # 蓝色背景,主要用于行样式 + LIGHT_BLUE_NORMAL = {'fg_color': Constant.LIGHT_BLUE_COLOR} # 淡蓝色背景,主要用于行样式 class ExcelConfig(object): @@ -65,6 +67,10 @@ class ExcelConfig(object): MODULE_LEVEL = "Module Level" BASE_CALL_STACK = "Base Call Stack" COMPARISON_CALL_STACK = "Comparison Call Stack" + INDEX = "Index" + DURATION = "Duration(ms)" + DURATION_RATIO = "Duration Ratio" + DIFF_DUR_MS = "Diff Duration(ms)" HEADERS = { Constant.OPERATOR_TABLE: [ @@ -176,10 +182,81 @@ class ExcelConfig(object): {"name": DIFF_TOTAL_RATIO, "type": CellFormatType.DEFAULT_RATIO, "width": 15}, {"name": BASE_CALL_STACK, "type": CellFormatType.DEFAULT, "width": 30}, {"name": COMPARISON_CALL_STACK, "type": CellFormatType.DEFAULT, "width": 30} + ], + Constant.OVERALL_METRICS_TABLE: [ + {"name": INDEX, "type": CellFormatType.DEFAULT, "width": 40}, + {"name": DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": DURATION_RATIO, "type": CellFormatType.DEFAULT_RATIO, "width": 20}, + {"name": NUMBER, "type": CellFormatType.DEFAULT, "width": 10}, + {"name": DURATION, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": DURATION_RATIO, "type": CellFormatType.DEFAULT_RATIO, "width": 20}, + {"name": NUMBER, "type": CellFormatType.DEFAULT, "width": 10}, + {"name": DIFF_DUR_MS, "type": CellFormatType.DEFAULT_FLOAT, "width": 20}, + {"name": DIFF_RATIO, "type": CellFormatType.DEFAULT_RATIO, "width": 10}, + ] } OVERHEAD = {Constant.OPERATOR_TABLE: ["B1:F1", "G1:K1"], Constant.MEMORY_TABLE: ["B1:F1", "G1:K1"], Constant.COMMUNICATION_TABLE: ["B1:H1", "I1:O1"], Constant.OPERATOR_TOP_TABLE: ["C1:D1", "E1:F1"], Constant.MEMORY_TOP_TABLE: ["C1:E1", "F1:H1"], Constant.MODULE_TOP_TABLE: ["F1:I1", "J1:M1"], - Constant.MODULE_TABLE: ["E1:H1", "I1:L1"]} + Constant.MODULE_TABLE: ["E1:H1", "I1:L1"], + Constant.OVERALL_METRICS_TABLE: ["B1:D1", "E1:G1"]} + + # overall metrics index + # computing time + COMPUTING = "Computing Time" + + FA = "\tFlash Attention" + FA_FWD_CUBE = "\t\tFlash Attention (Forward) (Cube)" + FA_FWD_VECTOR = "\t\tFlash Attention (Forward) (Vector)" + FA_BWD_CUBE = "\t\tFlash Attention (Backward) (Cube)" + FA_BWD_VECTOR = "\t\tFlash Attention (Backward) (Vector)" + + CONV = "\tConv" + CONV_FWD_CUBE = "\t\tConv (Forward) (Cube)" + CONV_FWD_VECTOR = "\t\tConv (Forward) (Vector)" + CONV_BWD_CUBE = "\t\tConv (Backward) (Cube)" + CONV_BWD_VECTOR = "\t\tConv (Backward) (Vector)" + + MM = "\tMatmul" + MM_CUBE = "\t\tMatmul (Cube)" + MM_VECTOR = "\t\tMatmul (Vector)" + + PA = "\tPage Attention" + + VECTOR = "\tVector" + 
VECTOR_TRANS = "\t\tVector (Trans)" + VECTOR_NO_TRANS = "\t\tVector (No Trans)" + + CUBE = "\tCube" + SDMA_TM = "\tSDMA (Tensor Move)" + OTHER = "\tOther" + + # communication time + COMMUNICATION_TIME = "Uncovered Communication Time" + WAIT = "\tWait" + TRANSMIT = "\tTransmit" + + # free time + FREE_TIME = "Free Time" + SDMA = "\tSDMA" + FREE = "\tFree" + + # e2e time + E2E_TIME = "E2E Time" + + ROW_STYLE_MAP = { + COMPUTING: CellFormatType.BLUE_NORMAL, + COMMUNICATION_TIME: CellFormatType.BLUE_NORMAL, + FREE_TIME: CellFormatType.BLUE_NORMAL, + E2E_TIME: CellFormatType.BLUE_NORMAL, + FA: CellFormatType.LIGHT_BLUE_NORMAL, + CONV: CellFormatType.LIGHT_BLUE_NORMAL, + MM: CellFormatType.LIGHT_BLUE_NORMAL, + PA: CellFormatType.LIGHT_BLUE_NORMAL, + VECTOR: CellFormatType.LIGHT_BLUE_NORMAL, + CUBE: CellFormatType.LIGHT_BLUE_NORMAL, + SDMA_TM: CellFormatType.LIGHT_BLUE_NORMAL, + OTHER: CellFormatType.LIGHT_BLUE_NORMAL + } diff --git a/profiler/compare_tools/compare_backend/view/work_sheet_creator.py b/profiler/compare_tools/compare_backend/view/work_sheet_creator.py index 7a33168da377ae77ab64fff0886e09eef065b4e2..dffb7549fcd92af6b14ee81b019eba86996d9369 100644 --- a/profiler/compare_tools/compare_backend/view/work_sheet_creator.py +++ b/profiler/compare_tools/compare_backend/view/work_sheet_creator.py @@ -20,7 +20,10 @@ class WorkSheetCreator: return self._work_sheet = self._work_book.add_worksheet(self._sheet_name) self._write_headers() - self._write_data() + if "row_style" in self._data: + self._write_data_with_row_style() + else: + self._write_data() def _write_headers(self): base_header_format = self._work_book.add_format(CellFormatType.GREEN_BOLD) @@ -43,7 +46,7 @@ class WorkSheetCreator: col_id = self._col_ids[index] self._work_sheet.set_column(f"{col_id}:{col_id}", header.get("width")) self._work_sheet.write(f"{col_id}{self._row_id}", header.get("name"), header_format) - self._field_format[index] = self._work_book.add_format(header.get("type")) + self._field_format[index] = header.get("type") if header.get("name") in (ExcelConfig.DIFF_RATIO, ExcelConfig.DIFF_TOTAL_RATIO): self._diff_ratio_index = index self._row_id += 1 @@ -52,7 +55,27 @@ class WorkSheetCreator: red_ratio_format = self._work_book.add_format(CellFormatType.RED_RATIO) for data in self._data.get("rows"): for index, cell_data in enumerate(data): - cell_format = self._field_format.get(index) + cell_format = self._work_book.add_format(self._field_format.get(index)) + if index == self._diff_ratio_index and cell_data and cell_data > 1: + cell_format = red_ratio_format + cell_data = "INF" if cell_data == float('inf') else cell_data + self._work_sheet.write(f"{self._col_ids[index]}{self._row_id}", cell_data, cell_format) + self._row_id += 1 + + def _write_data_with_row_style(self): + """ + 带行样式及缩进的sheet + """ + red_ratio_format = self._work_book.add_format(CellFormatType.RED_RATIO) + rows = self._data.get("rows") + row_style = self._data.get("row_style") # 行样式 + + for data, row_style in zip(rows, row_style): + for index, cell_data in enumerate(data): + cell_style = {**self._field_format.get(index), **row_style} + if index == 0: # 0 for Index field + cell_style["indent"] = cell_data.count("\t") + cell_format = self._work_book.add_format(cell_style) if index == self._diff_ratio_index and cell_data and cell_data > 1: cell_format = red_ratio_format cell_data = "INF" if cell_data == float('inf') else cell_data diff --git a/profiler/merge_profiling_timeline/README.md b/profiler/merge_profiling_timeline/README.md index 
5075f6bc2fcc8bf04b435562f28a50229b92362e..907a39a6e79c0e93753997b1985bdcc565071066 100644 --- a/profiler/merge_profiling_timeline/README.md +++ b/profiler/merge_profiling_timeline/README.md @@ -7,7 +7,7 @@ merge_profiling_timeline(合并大json工具)支持合并Profiling的timelin ### 性能数据采集 -使用Ascend PyTorch Profiler或者E2E性能采集工具采集性能数据,E2E profiling将被废弃,不建议使用。Ascend PyTorch Profiler采集方式参考:[Profiling数据采集](https://gitee.com/ascend/att/tree/master/profiler)。将采集到的所有节点的性能数据拷贝到当前环境同一目录下,以下假设数据在/home/test/cann_profiling下。 +使用Ascend PyTorch Profiler或者E2E性能采集工具采集性能数据,E2E profiling将被废弃,不建议使用。Ascend PyTorch Profiler采集方式参考:[Profiling数据采集](https://gitee.com/ascend/mstt/tree/master/profiler)。将采集到的所有节点的性能数据拷贝到当前环境同一目录下,以下假设数据在/home/test/cann_profiling下。 E2E Profiling数据目录结构示例如下: diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/common/__init__.py b/profiler/module_visualization/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/common/__init__.py rename to profiler/module_visualization/__init__.py diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/__init__.py b/profiler/module_visualization/graph/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/__init__.py rename to profiler/module_visualization/graph/__init__.py diff --git a/profiler/module_visualization/graph/prof_node.py b/profiler/module_visualization/graph/prof_node.py new file mode 100644 index 0000000000000000000000000000000000000000..cfcdabbb991d2abb86f31e5a5866e788cf9a3c6e --- /dev/null +++ b/profiler/module_visualization/graph/prof_node.py @@ -0,0 +1,90 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
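+# ProfNode is the tree node behind the module visualization graph: it keeps the kernels linked to a
+# module through torch-to-npu flows, reports host/device self vs. total durations, and its `info`
+# dict is what ProfGraphExport later serializes into the exported .vis JSON.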
+from profiler.prof_common.constant import Constant +from profiler.prof_common.base_node import BaseNode +from profiler.prof_common.trace_event_bean import TraceEventBean + + +class ProfNode(BaseNode): + MODULE_TYPE = 1 + + def __init__(self, event: TraceEventBean, parent_node=None): + super().__init__(event, parent_node) + self._kernel_total_list = [] + + @property + def node_id(self): + return self._event.unique_id + + @property + def total_kernels(self): + return self._kernel_total_list + + @property + def host_total_dur(self): + if self.is_root_node: + return sum((node.host_total_dur for node in self.child_nodes)) + return self._event.dur + + @property + def host_self_dur(self): + return self.host_total_dur - sum((node.host_total_dur for node in self.child_nodes)) + + @property + def device_total_dur(self): + if self.is_root_node: + return sum((node.device_total_dur for node in self.child_nodes)) + return sum((kernel.dur for kernel in self._kernel_total_list)) + + @property + def device_self_dur(self): + return self.device_total_dur - sum((node.device_total_dur for node in self.child_nodes)) + + @property + def input_data(self) -> dict: + data = {} + input_dim = self._event.args.get("Input Dims") + if input_dim: + data["Input Dims"] = input_dim + input_type = self._event.args.get("Input type") + if input_type: + data["Input type"] = input_type + return data + + @property + def data(self): + return {"Input Data": self.input_data, + "Host Self Duration(us)": round(self.host_self_dur, 2), + "Host Total Duration(us)": round(self.host_total_dur, 2), + "Device Self Duration(us)": round(self.device_self_dur, 2), + "Device Total Duration(us)": round(self.device_total_dur, 2)} + + @property + def info(self): + return {"id": self.node_id, + "node_type": self.MODULE_TYPE, + "data": self.data, + "upnode": self.parent_node.node_id if self.parent_node else "None", + "subnodes": [node.node_id for node in iter(self.child_nodes)]} + + @property + def is_root_node(self): + return self.node_id == Constant.NPU_ROOT_ID + + def update_child_nodes(self, node): + self._child_nodes.append(node) + + def update_kernel_total_list(self, kernel_list: list): + self._kernel_total_list.extend(kernel_list) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/__init__.py b/profiler/module_visualization/graph_build/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/perturbed_layers/npu/__init__.py rename to profiler/module_visualization/graph_build/__init__.py diff --git a/profiler/module_visualization/graph_build/fwd_module_node.py b/profiler/module_visualization/graph_build/fwd_module_node.py new file mode 100644 index 0000000000000000000000000000000000000000..34d7ab829649f482c97fb489ac0399d3a876c100 --- /dev/null +++ b/profiler/module_visualization/graph_build/fwd_module_node.py @@ -0,0 +1,29 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
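The self/total duration split implemented by ProfNode above can be pictured with a small standalone sketch (plain floats rather than the repository's node classes): a module's total time is its own interval, while its self time excludes whatever its child modules account for; device time follows the same rule over the kernels attached through flows.

parent_host_total = 10.0            # host duration of the module event itself
child_host_totals = [3.0, 2.5]      # host durations of its nn.Module children
parent_host_self = parent_host_total - sum(child_host_totals)   # 4.5 us spent directly in the module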
+from profiler.prof_common.base_node import BaseNode +from profiler.prof_common.trace_event_bean import TraceEventBean + + +class FwdModuleNode(BaseNode): + def __init__(self, event: TraceEventBean, parent_node=None): + super().__init__(event, parent_node) + self._bwd_op_list = [] + + @property + def bwd_op_list(self): + return self._bwd_op_list + + def update_bwd_op(self, bwd_op_list: list): + self._bwd_op_list.extend(bwd_op_list) diff --git a/profiler/module_visualization/graph_build/prof_graph_builder.py b/profiler/module_visualization/graph_build/prof_graph_builder.py new file mode 100644 index 0000000000000000000000000000000000000000..83331b6250211e32399b05cabf19a293759a3741 --- /dev/null +++ b/profiler/module_visualization/graph_build/prof_graph_builder.py @@ -0,0 +1,115 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from profiler.module_visualization.graph.prof_node import ProfNode +from profiler.module_visualization.graph_build.fwd_module_node import FwdModuleNode +from profiler.prof_common.tree_builder import TreeBuilder +from profiler.prof_common.trace_event_bean import TraceEventBean +from profiler.prof_common.constant import Constant +from profiler.module_visualization.prof_parse.prof_data_pre_process import ProfDataPreProcess + + +class ProfGraphBuilder: + def __init__(self, prof_data_path: str): + self._prof_data_path = prof_data_path + self._prof_data = {} + + @classmethod + def _create_event_bean_from_ops(cls, op_list: list, name: str) -> TraceEventBean: + min_start = min((op.start_time for op in iter(op_list))) + max_end = max((op.end_time for op in iter(op_list))) + # 以反向算子的区间作为反向module的区间范围,为了module包含算子,做了+1 +2处理 + return TraceEventBean({"ts": min_start - 1, "dur": float(max_end - min_start) + 2, "name": name}) + + @classmethod + def _trans_flow_to_dict(cls, flow_events: dict, end_events: list) -> dict: + end_event_dict = {} + for event in end_events: + end_event_dict[event.start_time] = event + result_data = {} + for flow in flow_events.values(): + start_point = flow.get("start") + end_point = flow.get("end") + if not start_point or not end_point: + continue + end_event = end_event_dict.get(end_point.start_time) + if end_event: + result_data.setdefault(start_point.start_time, []).append(end_event) + return result_data + + def build_graph(self): + self._prof_data = ProfDataPreProcess(self._prof_data_path).run() + all_data = [*self._prof_data.get(Constant.MODULE_EVENT, []), + *self.find_bwd_module(), + *self._prof_data.get(Constant.CPU_OP_EVENT, [])] + all_data.sort(key=lambda x: x.start_time) + name_dict = {} + for event in all_data: + order_id = name_dict.get(event.name, 0) + event.set_id(f"{event.name}_{order_id}") + name_dict[event.name] = order_id + 1 + root_node = TreeBuilder.build_tree(all_data, ProfNode, TraceEventBean({}, Constant.NPU_ROOT_ID)) + kernel_flow_dict = self._trans_flow_to_dict(self._prof_data.get(Constant.TORCH_TO_NPU_FLOW, {}), + self._prof_data.get(Constant.KERNEL_EVENT, [])) + 
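+        # Walk down from the root: every node on the path whose interval contains the flow start time
+        # receives the linked kernels, so device time accumulates on each enclosing module level.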
for start_time, kernels in kernel_flow_dict.items(): + matched_node = root_node.binary_search(start_time) + while matched_node != Constant.INVALID_RETURN: + matched_node.update_kernel_total_list(kernels) + matched_node = matched_node.binary_search(start_time) + all_data = root_node.find_all_child_nodes() + all_data.append(root_node) + return all_data + + def find_bwd_module(self) -> list: + bwd_module_list = [] + fwdbwd_flow = self._prof_data.get(Constant.FWD_BWD_FLOW, {}) + module_list = self._prof_data.get(Constant.MODULE_EVENT, []) + cpu_op_list = self._prof_data.get(Constant.CPU_OP_EVENT, []) + if not fwdbwd_flow or not module_list or not cpu_op_list: + return bwd_module_list + fwd_tid = module_list[0].tid + bwd_tid = fwd_tid + for end_point in (flow.get("end") for flow in fwdbwd_flow.values()): + if end_point: + bwd_tid = end_point.tid + break + if fwd_tid == bwd_tid: + return bwd_module_list + # 将每一个反向包成一个module,名字叫“nn.Module: BACKWARD_0” + cpu_op_list.sort(key=lambda x: x.start_time) + pre_status = Constant.FWD_OR_OPT + bwd_op_list = [] + for op in cpu_op_list: + if op.tid == bwd_tid: + bwd_op_list.append(op) + pre_status = Constant.BACKWARD + elif pre_status == Constant.BACKWARD: + bwd_module_list.append(self._create_event_bean_from_ops(bwd_op_list, "nn.Module: BACKWARD")) + bwd_op_list.clear() + pre_status = Constant.FWD_OR_OPT + + # 通过连线匹配正向module,构建出反向的整体module关系 + root_node = TreeBuilder.build_tree(module_list, FwdModuleNode, TraceEventBean({})) + fwdbwd_flow_dict = self._trans_flow_to_dict(fwdbwd_flow, cpu_op_list) + for start_time, end_events in fwdbwd_flow_dict.items(): + matched_node = root_node.binary_search(start_time) + while matched_node != Constant.INVALID_RETURN: + matched_node.update_bwd_op(end_events) + matched_node = matched_node.binary_search(start_time) + all_nodes = root_node.find_all_child_nodes() + for module_node in all_nodes: + if module_node.bwd_op_list: + bwd_module_list.append( + self._create_event_bean_from_ops(module_node.bwd_op_list, f"{module_node.name} [BACKWARD]")) + return bwd_module_list diff --git a/profiler/module_visualization/prof_graph_export.py b/profiler/module_visualization/prof_graph_export.py new file mode 100644 index 0000000000000000000000000000000000000000..d336e97f7419b53d011fa4c043948c60afa5174d --- /dev/null +++ b/profiler/module_visualization/prof_graph_export.py @@ -0,0 +1,39 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
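The backward-module synthesis in find_bwd_module above can be illustrated with a standalone sketch (hypothetical (tid, start, end) tuples instead of the repository's event beans): consecutive ops on the backward thread are folded into one synthetic "nn.Module: BACKWARD" interval, padded by the same -1/+1 margin so the ops stay inside it.

bwd_tid = 2
cpu_ops = [(1, 0.0, 1.0), (2, 1.5, 2.0), (2, 2.1, 3.0), (1, 3.5, 4.0), (2, 4.2, 5.0)]  # sorted by start
bwd_modules = []
run = []
for tid, start, end in cpu_ops:
    if tid == bwd_tid:
        run.append((start, end))
    elif run:
        bwd_modules.append((run[0][0] - 1, run[-1][1] + 1, "nn.Module: BACKWARD"))
        run = []
if run:
    bwd_modules.append((run[0][0] - 1, run[-1][1] + 1, "nn.Module: BACKWARD"))
# bwd_modules -> [(0.5, 4.0, 'nn.Module: BACKWARD'), (3.2, 6.0, 'nn.Module: BACKWARD')]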
+import logging +from datetime import datetime + +from profiler.prof_common.constant import Constant +from profiler.prof_common.file_reader import FileReader +from profiler.prof_common.path_manager import PathManager +from profiler.module_visualization.graph_build.prof_graph_builder import ProfGraphBuilder + + +class ProfGraphExport: + @staticmethod + def export_to_json(prof_data_path: str, output_path: str): + logging.basicConfig(format="%(asctime)s - %(levelname)s - %(message)s") + try: + PathManager.input_path_common_check(prof_data_path) + PathManager.check_input_directory_path(output_path) + PathManager.make_dir_safety(output_path) + all_nodes = ProfGraphBuilder(prof_data_path).build_graph() + result_data = {"root": Constant.NPU_ROOT_ID, "node": {}} + for node in all_nodes: + result_data["node"][node.node_id] = node.info + file_name = "prof_graph_json_{}.vis".format(datetime.utcnow().strftime("%Y%m%d%H%M%S%f")[:-3]) + FileReader.write_json_file(output_path, result_data, file_name) + except RuntimeError as err: + logging.error(err) diff --git a/debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/__init__.py b/profiler/module_visualization/prof_parse/__init__.py similarity index 100% rename from debug/accuracy_tools/atat/pytorch/free_benchmark/result_handlers/__init__.py rename to profiler/module_visualization/prof_parse/__init__.py diff --git a/profiler/module_visualization/prof_parse/prof_data_pre_process.py b/profiler/module_visualization/prof_parse/prof_data_pre_process.py new file mode 100644 index 0000000000000000000000000000000000000000..9dc820e4ca560f816b7738243197b90f1adb8c25 --- /dev/null +++ b/profiler/module_visualization/prof_parse/prof_data_pre_process.py @@ -0,0 +1,102 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os + +from profiler.prof_common.file_reader import FileReader +from profiler.prof_common.constant import Constant +from profiler.prof_common.trace_event_bean import TraceEventBean + + +class ProfDataPreProcess: + def __init__(self, prof_data_path: str): + self._prof_data_path = prof_data_path + self._trace_path = "" + self._kernel_pid = None + self._result_data = {Constant.CPU_OP_EVENT: [], Constant.MODULE_EVENT: [], Constant.KERNEL_EVENT: [], + Constant.TORCH_TO_NPU_FLOW: {}, Constant.FWD_BWD_FLOW: {}} + + def run(self) -> dict: + self._check_trace_path() + self._parse_trace_events() + self._check_result_data() + return self._result_data + + def _check_trace_path(self): + if os.path.isfile(self._prof_data_path): + (split_file_path, split_file_name) = os.path.split(self._prof_data_path) + (shot_name, extension) = os.path.splitext(split_file_name) + if extension != ".json": + msg = f"Invalid profiling path suffix: {self._prof_data_path}. " \ + f"You should input in a json file path, such as trace_view.json." 
+ raise RuntimeError(msg) + self._trace_path = self._prof_data_path + return + ascend_output = os.path.join(self._prof_data_path, "ASCEND_PROFILER_OUTPUT") + profiler_output = ascend_output if os.path.isdir(ascend_output) else self._prof_data_path + json_path = os.path.join(profiler_output, "trace_view.json") + if not os.path.isfile(json_path): + msg = f"Invalid profiling path: {self._prof_data_path}. The data path should be the " \ + f"folder that ends with the ascend_pt collected by the Ascend PyTorch Profiler." + raise RuntimeError(msg) + self._trace_path = json_path + + def _parse_trace_events(self): + trace_data = FileReader.read_json_file(self._trace_path) + self._check_trace_data(trace_data) + iter_trace_data = iter(trace_data) + for event in iter_trace_data: + bean = TraceEventBean(event) + if bean.is_optimizer(): + self._result_data[Constant.MODULE_EVENT].append(bean) + elif bean.is_cpu_op(): + if not bean.is_step(): + self._result_data[Constant.CPU_OP_EVENT].append(bean) + elif bean.is_nn_module(): + self._result_data[Constant.MODULE_EVENT].append(bean) + elif bean.is_torch_to_npu(): + if bean.is_flow_start(): + self._result_data[Constant.TORCH_TO_NPU_FLOW].setdefault(bean.id, {})["start"] = bean + else: + self._result_data[Constant.TORCH_TO_NPU_FLOW].setdefault(bean.id, {})["end"] = bean + elif bean.is_fwd_bwd_flow(): + if bean.is_flow_start(): + self._result_data[Constant.FWD_BWD_FLOW].setdefault(bean.id, {})["start"] = bean + else: + self._result_data[Constant.FWD_BWD_FLOW].setdefault(bean.id, {})["end"] = bean + elif bean.is_kernel_event(self._kernel_pid): + self._result_data[Constant.KERNEL_EVENT].append(bean) + + def _check_trace_data(self, trace_data): + if not isinstance(trace_data, list): + msg = f"Invalid profiling data path, this feature only supports performance data " \ + f"collected by Ascend PyTorch Profiler." + raise RuntimeError(msg) + iter_trace_data = iter(trace_data) + for event in iter_trace_data: + bean = TraceEventBean(event) + if bean.is_npu_process(): + self._kernel_pid = bean.pid + break + if self._kernel_pid is None: + msg = f"There is no operator on the NPU side for this data, please check whether the NPU switch is enabled." + raise RuntimeError(msg) + + def _check_result_data(self): + if not self._result_data.get(Constant.CPU_OP_EVENT): + msg = f"This data does not have any aten operator, please make sure to enable the CPU switch." + raise RuntimeError(msg) + if not self._result_data.get(Constant.MODULE_EVENT): + msg = f"This data does not collect any modules, please make sure to turn on the with_stack switch." + raise RuntimeError(msg) diff --git a/profiler/prof_common/base_node.py b/profiler/prof_common/base_node.py new file mode 100644 index 0000000000000000000000000000000000000000..b7cd6780003f9e0e5c58495ac43a893214e68beb --- /dev/null +++ b/profiler/prof_common/base_node.py @@ -0,0 +1,78 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
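A condensed, standalone view of the event routing ProfDataPreProcess performs above (plain dicts instead of TraceEventBean; the real parser additionally collects torch-to-npu and fwd-bwd flow pairs and skips ProfilerStep# markers):

trace = [
    {"ph": "M", "name": "process_name", "pid": 7, "args": {"name": "Ascend Hardware"}},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": "10", "dur": 5},
    {"ph": "X", "cat": "python_function", "name": "nn.Module: Linear_0", "ts": "8", "dur": 20},
    {"ph": "X", "pid": 7, "name": "MatMul", "ts": "30", "dur": 4},
]
kernel_pid = next(e["pid"] for e in trace
                  if e["ph"] == "M" and e["name"] == "process_name"
                  and e.get("args", {}).get("name") == "Ascend Hardware")
cpu_ops = [e for e in trace if e.get("cat") == "cpu_op"]
modules = [e for e in trace if e.get("cat") == "python_function" and e["name"].lower().startswith("nn.module")]
kernels = [e for e in trace if e["ph"] == "X" and e.get("pid") == kernel_pid]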
+from math import ceil +from queue import Queue + +from decimal import Decimal + +from profiler.prof_common.constant import Constant +from profiler.prof_common.trace_event_bean import TraceEventBean + + +class BaseNode: + def __init__(self, event: TraceEventBean, parent_node=None): + self._event = event + self._parent_node = parent_node + self._child_nodes = [] + + @property + def parent_node(self): + return self._parent_node + + @property + def child_nodes(self): + return self._child_nodes + + @property + def name(self): + return self._event.name + + @property + def start_time(self) -> Decimal: + return self._event.start_time + + @property + def end_time(self) -> Decimal: + return self._event.end_time + + def update_child_nodes(self, node): + self._child_nodes.append(node) + + def binary_search(self, ts_time): + if not self.child_nodes: + return Constant.INVALID_RETURN + right = len(self.child_nodes) - 1 + left = 0 + while right > left: + mid = left + ceil((right - left) / 2) + if ts_time >= self.child_nodes[mid].start_time: + left = mid + else: + right = mid - 1 + if self.child_nodes[left].start_time < ts_time < self.child_nodes[left].end_time: + return self.child_nodes[left] + return Constant.INVALID_RETURN + + def find_all_child_nodes(self) -> list: + result_data = [] + node_queue = Queue() + for child_node in self.child_nodes: + node_queue.put(child_node) + while not node_queue.empty(): + tree_node = node_queue.get() + result_data.append(tree_node) + for child_node in tree_node.child_nodes: + node_queue.put(child_node) + return result_data diff --git a/profiler/prof_common/constant.py b/profiler/prof_common/constant.py index 5789b89cb1a248977b64839339395acc5288b2ab..87bc51b56bc71c2a70e35a6b08aa4de7bd521f1d 100644 --- a/profiler/prof_common/constant.py +++ b/profiler/prof_common/constant.py @@ -15,4 +15,17 @@ class Constant(object): COLLECTION_PATH = "collection_path" ANALYSIS_MODE = "analysis_mode" - CONTEXT_SETTINGS = dict(help_option_names=['-H', '-h', '--help']) \ No newline at end of file + CONTEXT_SETTINGS = dict(help_option_names=['-H', '-h', '--help']) + + MAX_FILE_SIZE_5_GB = 1024 * 1024 * 1024 * 5 + + MODULE_EVENT = "module_event" + CPU_OP_EVENT = "op_event" + TORCH_TO_NPU_FLOW = "torch_to_device" + KERNEL_EVENT = "kernel_event" + FWD_BWD_FLOW = "fwd_to_bwd" + NPU_ROOT_ID = "NPU" + + FWD_OR_OPT = 0 + BACKWARD = 1 + INVALID_RETURN = -1 diff --git a/profiler/prof_common/file_reader.py b/profiler/prof_common/file_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..d8a9c8fb4d6599edf46973f8e93aa708903ff007 --- /dev/null +++ b/profiler/prof_common/file_reader.py @@ -0,0 +1,59 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
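BaseNode.binary_search above returns the child whose time range strictly contains the queried timestamp, or Constant.INVALID_RETURN when the timestamp falls between children. The same search over plain (start, end) tuples, as a standalone sketch:

from math import ceil

def containing_child(children, ts):      # children sorted by start time, non-overlapping
    if not children:
        return -1
    left, right = 0, len(children) - 1
    while right > left:
        mid = left + ceil((right - left) / 2)
        if ts >= children[mid][0]:
            left = mid
        else:
            right = mid - 1
    start, end = children[left]
    return left if start < ts < end else -1

containing_child([(0, 4), (5, 9), (12, 20)], 7)    # -> 1
containing_child([(0, 4), (5, 9), (12, 20)], 10)   # -> -1, falls in the gap between children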
+import json +import logging +import os + +from profiler.prof_common.path_manager import PathManager +from profiler.prof_common.constant import Constant + + +class FileReader: + DATA_FILE_AUTHORITY = 0o640 + DATA_DIR_AUTHORITY = 0o750 + + @classmethod + def read_json_file(cls, file_path: str) -> any: + PathManager.check_path_readable(file_path) + if not os.path.isfile(file_path): + raise FileNotFoundError("File not exists.") + file_size = os.path.getsize(file_path) + if file_size <= 0: + return [] + if file_size > Constant.MAX_FILE_SIZE_5_GB: + msg = f"The file({file_path}) size exceeds the preset max value, failed to read the file." + raise RuntimeError(msg) + try: + with open(file_path, "rt") as file: + json_data = json.loads(file.read()) + except Exception as e: + msg = f"Can't read file: {file_path}" + raise RuntimeError(msg) from e + return json_data + + @classmethod + def write_json_file(cls, output_path: str, data: dict, file_name: str, format_json: bool = False) -> None: + if not data: + return + output_file = os.path.join(output_path, file_name) + PathManager.check_path_writeable(output_path) + try: + with os.fdopen( + os.open(output_file, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY), 'w' + ) as file: + indent = 4 if format_json else None + file.write(json.dumps(data, indent=indent)) + except Exception as e: + raise RuntimeError(f"Can't create the file: {output_path}") from e diff --git a/profiler/prof_common/path_manager.py b/profiler/prof_common/path_manager.py new file mode 100644 index 0000000000000000000000000000000000000000..3e41b8b50aca42ba33071b2661966d221102e106 --- /dev/null +++ b/profiler/prof_common/path_manager.py @@ -0,0 +1,191 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
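Typical round trip through the FileReader helper above, assuming the output directory already exists and is writable by the current user (paths are placeholders):

from profiler.prof_common.file_reader import FileReader

FileReader.write_json_file("/home/test/graph_out", {"root": "NPU"}, "demo.vis", format_json=True)
data = FileReader.read_json_file("/home/test/graph_out/demo.vis")   # -> {"root": "NPU"}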
+import os +import re +import shutil +import platform + + +class PathManager: + MAX_PATH_LENGTH = 4096 + MAX_FILE_NAME_LENGTH = 255 + DATA_FILE_AUTHORITY = 0o640 + DATA_DIR_AUTHORITY = 0o750 + WINDOWS = "windows" + + @classmethod + def check_input_directory_path(cls, path: str): + """ + Function Description: + check whether the path is valid, some businesses can accept a path that does not exist, + so the function do not verify whether the path exists + Parameter: + path: the path to check, whether the incoming path is absolute or relative depends on the business + Exception Description: + when invalid data throw exception + """ + cls.input_path_common_check(path) + base_name = os.path.basename(path) + if os.path.isfile(path): + msg = f"Invalid input path which is a file path: {base_name}" + raise RuntimeError(msg) + + @classmethod + def check_input_file_path(cls, path: str): + """ + Function Description: + check whether the file path is valid, some businesses can accept a path that does not exist, + so the function do not verify whether the path exists + Parameter: + path: the file path to check, whether the incoming path is absolute or relative depends on the business + Exception Description: + when invalid data throw exception + """ + cls.input_path_common_check(path) + base_name = os.path.basename(path) + if os.path.isdir(path): + msg = f"Invalid input path which is a directory path: {base_name}" + raise RuntimeError(msg) + + @classmethod + def check_path_length(cls, path: str): + if len(path) > cls.MAX_PATH_LENGTH: + raise RuntimeError("Length of input path exceeds the limit.") + path_split_list = path.split("/") + for path in path_split_list: + path_list = path.split("\\") + for name in path_list: + if len(name) > cls.MAX_FILE_NAME_LENGTH: + raise RuntimeError("Length of input path exceeds the limit.") + + @classmethod + def input_path_common_check(cls, path: str): + cls.check_path_length(path) + + if os.path.islink(path): + msg = f"Invalid input path which is a soft link." + raise RuntimeError(msg) + + if platform.system().lower() == cls.WINDOWS: + pattern = r'(\.|:|\\|/|_|-|\s|[~0-9a-zA-Z\u4e00-\u9fa5])+' + else: + pattern = r'(\.|/|_|-|\s|[~0-9a-zA-Z])+' + if not re.fullmatch(pattern, path): + msg = f"Invalid input path." + raise RuntimeError(msg) + + @classmethod + def check_path_owner_consistent(cls, path: str): + """ + Function Description: + check whether the path belong to process owner + Parameter: + path: the path to check + Exception Description: + when invalid path, prompt the user + """ + base_name = os.path.basename(path) + if not os.path.exists(path): + msg = f"Invalid path: {base_name}" + raise RuntimeError(msg) + if platform.system().lower() == cls.WINDOWS: + return + if os.stat(path).st_uid != os.getuid(): + check_msg = input("The path does not belong to you, do you want to continue? [y/n]") + if check_msg.lower() != "y": + raise RuntimeError("The user choose not to continue.") + + @classmethod + def check_path_writeable(cls, path): + """ + Function Description: + check whether the path is writable + Parameter: + path: the path to check + Exception Description: + when invalid data throw exception + """ + cls.check_path_owner_consistent(path) + if os.path.islink(path): + msg = f"Invalid path which is a soft link." 
+ raise RuntimeError(msg) + base_name = os.path.basename(path) + if not os.access(path, os.W_OK): + msg = f"The path permission check failed: {base_name}" + raise RuntimeError(msg) + + @classmethod + def check_path_readable(cls, path): + """ + Function Description: + check whether the path is writable + Parameter: + path: the path to check + Exception Description: + when invalid data throw exception + """ + cls.check_path_owner_consistent(path) + if os.path.islink(path): + msg = f"Invalid path which is a soft link." + raise RuntimeError(msg) + base_name = os.path.basename(path) + if not os.access(path, os.R_OK): + msg = f"The path permission check failed: {base_name}" + raise RuntimeError(msg) + + @classmethod + def remove_path_safety(cls, path: str): + base_name = os.path.basename(path) + msg = f"Failed to remove path: {base_name}" + if os.path.islink(path): + raise RuntimeError(msg) + if os.path.exists(path): + try: + shutil.rmtree(path) + except Exception as err: + raise RuntimeError(msg) from err + + @classmethod + def make_dir_safety(cls, path: str): + base_name = os.path.basename(path) + msg = f"Failed to make directory: {base_name}" + if os.path.islink(path): + raise RuntimeError(msg) + if os.path.exists(path): + return + try: + os.makedirs(path, mode=cls.DATA_DIR_AUTHORITY) + except Exception as err: + raise RuntimeError(msg) from err + + @classmethod + def create_file_safety(cls, path: str): + base_name = os.path.basename(path) + msg = f"Failed to create file: {base_name}" + if os.path.islink(path): + raise RuntimeError(msg) + if os.path.exists(path): + return + try: + os.close(os.open(path, os.O_WRONLY | os.O_CREAT, cls.DATA_FILE_AUTHORITY)) + except Exception as err: + raise RuntimeError(msg) from err + + @classmethod + def get_realpath(cls, path: str) -> str: + if os.path.islink(path): + msg = f"Invalid input path which is a soft link." + raise RuntimeError(msg) + return os.path.realpath(path) diff --git a/profiler/prof_common/trace_event_bean.py b/profiler/prof_common/trace_event_bean.py new file mode 100644 index 0000000000000000000000000000000000000000..2d4b96e4f6aa84ce225531da89085ba4a07335a5 --- /dev/null +++ b/profiler/prof_common/trace_event_bean.py @@ -0,0 +1,69 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
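The checks above are meant to be chained before touching the filesystem, mirroring the sequence ProfGraphExport.export_to_json uses (paths are placeholders):

from profiler.prof_common.path_manager import PathManager

prof_path = "/home/test/ascend_pt"
out_path = "/home/test/graph_output"
PathManager.input_path_common_check(prof_path)    # length, soft-link and character-whitelist checks
PathManager.check_input_directory_path(out_path)  # must not point at an existing file
PathManager.make_dir_safety(out_path)             # created with mode 0o750 when missing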
+from decimal import Decimal + +from profiler.prof_common.utils import convert_to_decimal +from profiler.prof_common.analyze_dict import AnalyzeDict + + +class TraceEventBean(AnalyzeDict): + def __init__(self, data: dict, unique_id: int = None): + super().__init__(data) + self._id = unique_id + + @property + def unique_id(self): + return self._id + + @property + def start_time(self) -> Decimal: + return convert_to_decimal(self.ts) + + @property + def end_time(self) -> Decimal: + return self.start_time + convert_to_decimal(self.dur) + + def set_id(self, name_id): + self._id = name_id + + def is_cpu_op(self): + return self.cat == "cpu_op" + + def is_optimizer(self): + return self.cat == "cpu_op" and self.name.lower().startswith("optimizer") + + def is_nn_module(self): + return self.cat == "python_function" and self.name.lower().startswith("nn.module") + + def is_step(self): + return self.name.lower().startswith("profilerstep#") + + def is_torch_to_npu(self): + return self.cat == "async_npu" + + def is_fwd_bwd_flow(self): + return self.cat == "fwdbwd" + + def is_flow_start(self): + return self.ph == "s" + + def is_flow_end(self): + return self.ph == "f" + + def is_kernel_event(self, kernel_pid): + return self.ph == "X" and self.pid == kernel_pid + + def is_npu_process(self): + return self.ph == "M" and self.name == "process_name" and self.args.get("name", "") == "Ascend Hardware" diff --git a/profiler/prof_common/tree_builder.py b/profiler/prof_common/tree_builder.py new file mode 100644 index 0000000000000000000000000000000000000000..b7d3e1baf6aa48c480124056ced422178f8fe7a2 --- /dev/null +++ b/profiler/prof_common/tree_builder.py @@ -0,0 +1,33 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from profiler.prof_common.trace_event_bean import TraceEventBean + + +class TreeBuilder: + @staticmethod + def build_tree(event_list: list, node_class: any, root_bean: any): + root_node = node_class(root_bean) + event_list.sort(key=lambda x: x.start_time) + last_node = root_node + for event in event_list: + while last_node: + if last_node != root_node and event.start_time > last_node.end_time: + last_node = last_node.parent_node + continue + tree_node = node_class(event, last_node) + last_node.update_child_nodes(tree_node) + last_node = tree_node + break + return root_node diff --git a/profiler/prof_common/utils.py b/profiler/prof_common/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..a9db41ad0b8d9dd91132959fd5b583f5711d88db --- /dev/null +++ b/profiler/prof_common/utils.py @@ -0,0 +1,25 @@ +# Copyright (c) 2024 Huawei Technologies Co., Ltd +# All rights reserved. +# +# Licensed under the BSD 3-Clause License (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +from decimal import Decimal + + +def convert_to_decimal(data: any) -> Decimal: + try: + decimal_value = Decimal(data) + except Exception: + logging.error('Invalid profiling data which failed to convert data to decimal.') + return 0.0 + return decimal_value diff --git a/profiler/requirements/build.txt b/profiler/requirements/build.txt index c750ff83dedf2f6b6823f45a747c95d395e1ccb5..cacefda04c77fcda126ffb60966e8a632957a44b 100644 --- a/profiler/requirements/build.txt +++ b/profiler/requirements/build.txt @@ -9,4 +9,6 @@ ijson requests xlsxwriter sqlalchemy -urllib3<2.0 \ No newline at end of file +urllib3<2.0 +numpy +pandas \ No newline at end of file diff --git a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py b/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py index 7abf8da647d0fe218a0946fae8208a1fb38af0c0..869ee85570febe1d5db7c1a5aa6e89ac8392078d 100644 --- a/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py +++ b/profiler/test/ut/compare_tools/compare_bean/origin_data_bean/test_kernel_details_bean.py @@ -47,5 +47,5 @@ class TestKernelDetailsBean(unittest.TestCase): self.assertFalse(self.kernel_bean2.is_flash_attention()) def test_is_cube(self): - self.assertTrue(self.kernel_bean2.is_cube()) - self.assertFalse(self.kernel_bean3.is_cube()) + self.assertTrue(self.kernel_bean2.is_matmul()) + self.assertFalse(self.kernel_bean3.is_matmul()) diff --git a/profiler/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py b/profiler/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py index 04468721504b1e1133b659a4d497c4ef86ed0414..d7cb3d0588a3e13097d2429a92f283b6c3eaf4b8 100644 --- a/profiler/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py +++ b/profiler/test/ut/compare_tools/profiling_parser/test_gpu_profiling_parser.py @@ -68,6 +68,7 @@ class TestGpuProfilingParser(unittest.TestCase): patch("compare_backend.profiling_parser.gpu_profiling_parser.GPUProfilingParser.__init__", return_value=None): res = GPUProfilingParser({}, {}) + res._profiling_type = "GPU" res._trace_events = [TraceEventBean(event) for event in self.trace_events] res._result_data = ProfilingResult("GPU") res._compute_stream_id = 3 diff --git a/profiler/version.txt b/profiler/version.txt index 8cfbc905b39f65131ba18e561d236557fbdc52cc..8428158dc5bd08a490b652db38f90e08cb471d25 100644 --- a/profiler/version.txt +++ b/profiler/version.txt @@ -1 +1 @@ -1.1.1 \ No newline at end of file +1.1.2 \ No newline at end of file
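Finally, the TreeBuilder/BaseNode pair added under profiler/prof_common builds the module tree purely by interval nesting: events are sorted by start time and each one becomes a child of the innermost event that is still open. A minimal standalone sketch of that rule with plain (name, start, end, children) tuples instead of TraceEventBean/BaseNode:

events = [("step", 0, 100), ("fwd_module", 5, 40), ("aten::mm", 10, 20), ("aten::relu", 25, 30), ("bwd_module", 50, 90)]
root = ("root", None, None, [])
stack = [root]
for name, start, end in sorted(events, key=lambda e: e[1]):
    while len(stack) > 1 and start > stack[-1][2]:
        stack.pop()                    # the last event has already ended, climb back towards the root
    node = (name, start, end, [])
    stack[-1][3].append(node)          # attach to the innermost event that is still open
    stack.append(node)
# root children: [step]; step nests fwd_module (with aten::mm and aten::relu) and bwd_module.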