diff --git "a/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274.md" "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274.md"
index c0692d6cd88f520248f24714463b2ecfad2aec72..7fa534a20eb2674e1a2ab5e80ea0530150713e47 100644
--- "a/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274.md"
+++ "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274.md"
@@ -629,9 +629,10 @@ bs16: 310/t4=2078.536/1234.940=1.68倍
 
 ### 2.2 Model conversion guide
 
-[Export the onnx file](#211-导出onnx文件)
-[Model conversion topics](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/tree/master/Ascend-PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/%E4%B8%93%E9%A2%98%E6%A1%88%E4%BE%8B/%E5%8A%9F%E8%83%BD%E6%89%93%E9%80%9A)
-[Visualizing an om model with MindStudio](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/blob/master/Ascend-PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/%E4%B8%93%E9%A2%98%E6%A1%88%E4%BE%8B/%E7%9B%B8%E5%85%B3%E5%B7%A5%E5%85%B7/MindStudio%E5%B7%A5%E5%85%B7%E5%8F%AF%E8%A7%86%E5%8C%96om%E6%A8%A1%E5%9E%8B%E6%95%99%E7%A8%8B.docx)
+- Exporting onnx
+
+  [Export the onnx file](#211-导出onnx文件)
+  [Model conversion topics](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/tree/master/Ascend-PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/%E4%B8%93%E9%A2%98%E6%A1%88%E4%BE%8B/%E5%8A%9F%E8%83%BD%E6%89%93%E9%80%9A)
 
 ### 2.3 Accuracy debugging guide
 
@@ -662,48 +663,50 @@ bs16: 310/t4=2078.536/1234.940=1.68倍
 
 ### 2.4 Performance tuning guide
 
 - Performance analysis with the profiling tool
-```
-Create the file /home/HwHiAiUser/test/run with the following content:
-#! /bin/bash
-export install_path=/usr/local/Ascend/ascend-toolkit/latest
-export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
-export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH
-export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH
-export ASCEND_OPP_PATH=${install_path}/opp
-./benchmark -round=50 -om_path=/home/HwHiAiUser/test/efficientnet-b0_bs1.om -device_id=0 -batch_size=1
-Then run the following commands:
-chmod 777 /home/HwHiAiUser/test/run
-cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/bin
-./msprof --output=/home/HwHiAiUser/test --application=/home/HwHiAiUser/test/run --sys-hardware-mem=on --sys-cpu-profiling=on --sys-profiling=on --sys-pid-profiling=on --sys-io-profiling=on --dvpp-profiling=on
-cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/profiler_tool/analysis/msprof/
-python3.7 msprof.py import -dir /home/HwHiAiUser/test/<generated profiling dir>
-python3.7 msprof.py export summary -dir /home/HwHiAiUser/test/<generated profiling dir>
-python3.7 msprof.py export timeline -dir /home/HwHiAiUser/test/<generated profiling dir> --iteration-id 1
-Open chrome://tracing/ in Chrome's address bar and load the dumped data to view the trace
-```
-op_statistic_0_1.csv aggregates the total time and percentage of each operator type in the model; op_summary_0_1.csv contains the aicore time of every individual operator
-For details on using the profiling tool, see [CANN 5.0.1 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100191944?idPath=23710424%7C251366513%7C22892968%7C251168373)
-
-- Example
-Inception-V3 did not meet its performance target. Analyzing it with the profiling tool, the operator statistics can be seen in the output csv file:
-```
-Model Name OP Type Core Type Count Total Time(us) Min Time(us) Avg Time(us) Max Time(us) Ratio(%)
-inception_v3_bs16 TransData AI Core 22 399586.005 20.883 18163 105754.996 46.091391
-inception_v3_bs16 PadV3D AI Core 9 377928.343 14787.287 41992.038 102073.381 43.593226
-inception_v3_bs16 Conv2D AI Core 94 54804.676 201.195 583.028 3338.536 6.321602
-inception_v3_bs16 Pooling AI Core 13 27901.298 411.091 2146.253 4397.026 3.218355
-inception_v3_bs16 Mul AI Core 3 1964.518 572.027 654.839 714.319 0.226603
-inception_v3_bs16 ConcatD AI Core 3 1628.841 253.224 542.947 1111.872 0.187883
-inception_v3_bs16 GatherV2D AI Core 3 1284.729 335.778 428.243 594.11 0.148191
-inception_v3_bs16 Cast AI Core 2 1237.338 20.258 618.669 1217.08 0.142724
-inception_v3_bs16 AvgPool AI Core 1 460.62 460.62 460.62 460.62 0.053132
-inception_v3_bs16 MatMulV2 AI Core 1 126.037 126.037 126.037 126.037 0.014538
-inception_v3_bs16 Flatten AI Core 1 20.415 20.415 20.415 20.415 0.002355
-```
-Profiling also reports the aicore time of each individual operator. Inspecting the onnx graph with netron alongside these numbers shows that the pad operators, and the transdata operators surrounding them, account for most of the runtime. Analysis shows that the pad can instead be expressed through the pad attribute of the following averagepool, saving a large amount of time, so the PadV3D and Pooling operators were graph-fused. op_summary_0_1.csv shows that each individual TransData operator already has a very short aicore time, so TransData offers no optimization headroom in this model.
-
-[Initial performance optimization case](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/blob/master/Ascend-PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86-Xxx%E6%A8%A1%E5%9E%8B%E6%B5%8B%E8%AF%95%E6%8A%A5%E5%91%8A.docx)
-[Performance tuning topics](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/tree/master/Ascend-PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/%E4%B8%93%E9%A2%98%E6%A1%88%E4%BE%8B/%E6%80%A7%E8%83%BD%E8%B0%83%E4%BC%98)
+  ```
+  Create the file /home/HwHiAiUser/test/run with the following content:
+  #! /bin/bash
+  export install_path=/usr/local/Ascend/ascend-toolkit/latest
+  export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
+  export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH
+  export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH
+  export ASCEND_OPP_PATH=${install_path}/opp
+  ./benchmark -round=50 -om_path=/home/HwHiAiUser/test/efficientnet-b0_bs1.om -device_id=0 -batch_size=1
+  Then run the following commands:
+  chmod 777 /home/HwHiAiUser/test/run
+  cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/bin
+  ./msprof --output=/home/HwHiAiUser/test --application=/home/HwHiAiUser/test/run --sys-hardware-mem=on --sys-cpu-profiling=on --sys-profiling=on --sys-pid-profiling=on --sys-io-profiling=on --dvpp-profiling=on
+  cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/profiler_tool/analysis/msprof/
+  python3.7 msprof.py import -dir /home/HwHiAiUser/test/<generated profiling dir>
+  python3.7 msprof.py export summary -dir /home/HwHiAiUser/test/<generated profiling dir>
+  python3.7 msprof.py export timeline -dir /home/HwHiAiUser/test/<generated profiling dir> --iteration-id 1
+  Open chrome://tracing/ in Chrome's address bar and load the dumped data to view the trace
+  ```
+  op_statistic_0_1.csv aggregates the total time and percentage of each operator type in the model; op_summary_0_1.csv contains the aicore time of every individual operator
+  For details on using the profiling tool, see [CANN 5.0.1 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100191944?idPath=23710424%7C251366513%7C22892968%7C251168373)
+
+- Performance optimization example
+  Inception-V3 did not meet its performance target. Analyzing it with the profiling tool, the operator statistics can be seen in the output csv file:
+  ```
+  Model Name OP Type Core Type Count Total Time(us) Min Time(us) Avg Time(us) Max Time(us) Ratio(%)
+  inception_v3_bs16 TransData AI Core 22 399586.005 20.883 18163 105754.996 46.091391
+  inception_v3_bs16 PadV3D AI Core 9 377928.343 14787.287 41992.038 102073.381 43.593226
+  inception_v3_bs16 Conv2D AI Core 94 54804.676 201.195 583.028 3338.536 6.321602
+  inception_v3_bs16 Pooling AI Core 13 27901.298 411.091 2146.253 4397.026 3.218355
+  inception_v3_bs16 Mul AI Core 3 1964.518 572.027 654.839 714.319 0.226603
+  inception_v3_bs16 ConcatD AI Core 3 1628.841 253.224 542.947 1111.872 0.187883
+  inception_v3_bs16 GatherV2D AI Core 3 1284.729 335.778 428.243 594.11 0.148191
+  inception_v3_bs16 Cast AI Core 2 1237.338 20.258 618.669 1217.08 0.142724
+  inception_v3_bs16 AvgPool AI Core 1 460.62 460.62 460.62 460.62 0.053132
+  inception_v3_bs16 MatMulV2 AI Core 1 126.037 126.037 126.037 126.037 0.014538
+  inception_v3_bs16 Flatten AI Core 1 20.415 20.415 20.415 20.415 0.002355
+  ```
+  Profiling also reports the aicore time of each individual operator. Inspecting the onnx graph with netron alongside these numbers shows that the pad operators, and the transdata operators surrounding them, account for most of the runtime. Analysis shows that the pad can instead be expressed through the pad attribute of the following averagepool, saving a large amount of time, so the PadV3D and Pooling operators were graph-fused. op_summary_0_1.csv shows that each individual TransData operator already has a very short aicore time, so TransData offers no optimization headroom in this model.
+
+- Performance optimization cases
+
+  [Initial performance optimization case](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/blob/master/Ascend-PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86-Xxx%E6%A8%A1%E5%9E%8B%E6%B5%8B%E8%AF%95%E6%8A%A5%E5%91%8A.docx)
+  [Performance tuning topics](https://gitee.com/wangjiangben_hw/ascend-pytorch-crowdintelligence-doc/tree/master/Ascend-PyTorch%E7%A6%BB%E7%BA%BF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/%E4%B8%93%E9%A2%98%E6%A1%88%E4%BE%8B/%E6%80%A7%E8%83%BD%E8%B0%83%E4%BC%98)
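The per-type statistics the new text attributes to op_statistic_0_1.csv can be reproduced from per-operator rows such as those in op_summary_0_1.csv. Below is a minimal sketch of that aggregation in Python; the CSV column names (`Op Type`, `Task Duration(us)`) and the sample rows are assumptions for illustration, not the exact msprof schema.

```python
import csv
import io

# Hypothetical sample in the shape of an op_summary-style export:
# one row per operator instance, with its aicore duration.
SAMPLE = """Op Type,Task Duration(us)
TransData,105754.996
TransData,20.883
Conv2D,583.028
"""

def summarize(csv_text):
    """Sum durations per op type and compute each type's share of total time.

    Returns a list of (op_type, total_us, ratio_percent) tuples, sorted so
    the most expensive op types (the optimization targets) come first.
    """
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        op = row["Op Type"]
        totals[op] = totals.get(op, 0.0) + float(row["Task Duration(us)"])
    grand_total = sum(totals.values())
    return sorted(
        ((op, t, 100.0 * t / grand_total) for op, t in totals.items()),
        key=lambda item: item[1],
        reverse=True,
    )

if __name__ == "__main__":
    for op, total, ratio in summarize(SAMPLE):
        print(f"{op:10s} {total:12.3f} {ratio:6.2f}%")
```

Ranking types by their share of total time is exactly how TransData (46%) and PadV3D (44%) stand out as fusion candidates in the Inception-V3 example above.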