diff --git "a/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" "b/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" index f725a660f17687db2ff87b6c9728f678f1b9eb6e..07743b014fb122ea05b9063e42e105ef57496e6f 100644 --- "a/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" +++ "b/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" @@ -91,6 +91,8 @@ npu单颗芯片吞吐率乘以4颗大于gpu T4吞吐率则认为性能达标 >安装CANN包:./Ascend-cann-toolkit-\{version\}-linux-x86_64.run --install --quiet > >解压Ascend-cann-benchmark_\{version\}-Linux-x86_64.zip,获取benchmark工具与脚本 + > + >若报无HwHiAiUser用户则执行useradd HwHiAiUser,安装固件若报Not a physical-machine, firmware upgrade does not support.则不必安装固件,若报错ls: cannot access '/usr/local/Ascend/ascend-toolkit/5.0.1/x86_64-linux/toolkit/python/site-packages/bin': No such file or directory则export PATH=/usr/local/python3.7.5/bin:¥PATH;export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:¥LD_LIBRARY_PATH。安装后需要重启。 ### 2.2 深度学习框架与第三方库 @@ -125,13 +127,15 @@ opencv-python == 4.2.0.34 ### 3.1 华为云昇腾modelzone里Pytorch模型端到端推理网址 -当前已完成端到端推理模型放在[ModelZoo](https://www.huaweicloud.com/ascend/resources/modelzoo),包含模型端到端推理说明,代码与操作完整流程,下面的实例仅给出用于说明问题的代码片段,该页面过滤条件框中搜索atc可以看到这些模型 +当前已完成端到端推理模型放在[ModelZoo](https://ascend.huawei.com/zh/#/software/modelzoo),包含模型端到端推理说明,代码与操作完整流程,下面的实例仅给出用于说明问题的代码片段,该页面过滤条件框中搜索atc可以看到这些模型 一些典型模型的链接如下 -1. [ResNeXt-50](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/2ca8ac26aeac461c85e7b04f17aa201a) -2. [Inception-V3](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/132f32e409b44aac8951f58ca073b780) -3. [EfficientNet-b0](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/75026a6edf604ec0bc5d16d220328646) -4. [YoloV3](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/36ea401e0d844f549da2693c6289ad89) +1. [ResNeXt-50](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/2ca8ac26aeac461c85e7b04f17aa201a) +2. [Inception-V3](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/132f32e409b44aac8951f58ca073b780) +3. [Inception-V4](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/75eb32c2a2d94c4db743983504f83a06) +4. [EfficientNet-b0](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/75026a6edf604ec0bc5d16d220328646) +5. [YoloV3](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/36ea401e0d844f549da2693c6289ad89) +... ### 3.2 端到端推理实例 @@ -140,7 +144,7 @@ opencv-python == 4.2.0.34 1.pth模型转换为om模型 PyTorch训练得到的pth模型文件不能直接转换为om模型文件,因此先将pth文件转化为onnx模型文件,再由onnx转化为离线om模型文件 -1)基于PyTorch框架的模型代码与pth文件可以从开源[github链接](https://github.com/lukemelas/EfficientNet-PyTorch)获取,有些模型使用resize使用双线性模式训练的性能不达标,需要修改为resize使用最近邻模式重新训练,通过以下步骤得到onnx模型文件: +1)基于PyTorch框架的模型代码与pth文件可以从开源[github网址](https://github.com/lukemelas/EfficientNet-PyTorch)获取,有些模型使用resize使用双线性模式训练的性能不达标,需要修改为resize使用最近邻模式重新训练,通过以下步骤得到onnx模型文件: - [下载pth文件](https://github.com/lukemelas/EfficientNet-PyTorch/releases/download/1.0/efficientnet-b0-355c32eb.pth) - 参考github网址说明安装efficientnet_pytorch ``` @@ -529,7 +533,7 @@ gpu T4是4个device并行执行的结果,mean是时延(tensorrt的时延是b ``` 以root用户运行ada:kill -9 $(pidof ada) && /usr/local/Ascend/driver/tools/ada ... -编辑/home/HwHiAiUser/test/run文件 +新建/home/HwHiAiUser/test/run文件: #! 
/bin/bash export install_path=/usr/local/Ascend/ascend-toolkit/latest export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH @@ -545,9 +549,7 @@ python3.7.5 hiprof.pyc --ip_address=本机ip --result_dir=/root/out --profiling_ - CANN C20 版本profiling使用方法 ``` -以root用户运行ada:kill -9 $(pidof ada) && /usr/local/Ascend/driver/tools/ada -... -编辑/home/HwHiAiUser/test/run文件 +新建/home/HwHiAiUser/test/run文件: #! /bin/bash export install_path=/usr/local/Ascend/ascend-toolkit/latest export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH @@ -557,9 +559,9 @@ export ASCEND_OPP_PATH=${install_path}/opp ./benchmark -round=50 -om_path=/home/HwHiAiUser/test/efficientnet-b0_bs1.om -device_id=0 -batch_size=1 ... chmod 777 /home/HwHiAiUser/test/run -cd /usr/local/Ascend/ascend-toolkit/20.2.rc1/x86_64-linux/toolkit/tools/profiler/bin +cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/bin ./msprof --output=/home/HwHiAiUser/test --application=/home/HwHiAiUser/test/run --sys-hardware-mem=on --sys-cpu-profiling=on --sys-profiling=on --sys-pid-profiling=on --sys-io-profiling=on --dvpp-profiling=on -cd /usr/local/Ascend/ascend-toolkit/20.2.rc1/x86_64-linux/toolkit/tools/profiler/profiler_tool/analysis/msprof/ +cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/profiler_tool/analysis/msprof/ python3.7 msprof.pyc import -dir /home/HwHiAiUser/test/生成的profiling目录 python3.7 msprof.pyc export summary -dir /home/HwHiAiUser/test/生成的profiling目录 ``` @@ -1087,13 +1089,14 @@ inception_v3_bs16 MatMulV2 AI Core 1 126.037 126.037 12 inception_v3_bs16 Flatten AI Core 1 20.415 20.415 20.415 20.415 0.002355 ``` -2.算子融合 -从profiling结果看出,pad和pad前后的transdata耗时很长,经过分析pad的功能可以由其后的averagepool中的pad属性完成,可以节约大量时间,于是进行padV3D和pooling算子的graph融合 +2.算子融合 +profiling也会统计每个算子耗时,结合使用netron查看onnx模型结构图,可以看出pad和pad前后的transdata耗时很长,经过分析pad的功能可以由其后的averagepool中的pad属性完成,可以节约大量时间,于是进行padV3D和pooling算子的graph融合 参考前面提到的《CANN V100R020C10 图融合和UB融合规则参考 (推理) 01》 ### 4.5 maskrcnn端到端推理指导 -https://gitee.com/ascend/modelzoo/wikis +[基于开源mmdetection预训练的maskrcnn_Onnx模型端到端推理指导.md](https://gitee.com/pengyeqing/ascend-pytorch-crowdintelligence-doc/blob/master/onnx%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/benchmark/cv/segmentation/%E5%9F%BA%E4%BA%8E%E5%BC%80%E6%BA%90mmdetection%E9%A2%84%E8%AE%AD%E7%BB%83%E7%9A%84maskrcnn_Onnx%E6%A8%A1%E5%9E%8B%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC.md) +[基于detectron2训练的npu权重的maskrcnn_Onnx模型端到端推理指导.md](https://gitee.com/pengyeqing/ascend-pytorch-crowdintelligence-doc/blob/master/onnx%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/benchmark/cv/segmentation/%E5%9F%BA%E4%BA%8Edetectron2%E8%AE%AD%E7%BB%83%E7%9A%84npu%E6%9D%83%E9%87%8D%E7%9A%84maskrcnn_Onnx%E6%A8%A1%E5%9E%8B%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC.md) ## 5 深度学习指导 ### 5.1 书籍推荐 @@ -1131,12 +1134,14 @@ https://gitee.com/ascend/modelzoo/wikis >![](public_sys-resources/icon-note.gif) **说明:** > **机器周均使用率过低且项目无故无进展时,华为方将有权回收算力资源,由此造成交付延期由使用者自己承担。** +> **请勿随意更改密码,更改密码带来的风险由更改者承担。** +> **请勿随意更新驱动等系统相关软件,有需要请及时联系华为方支持人员。** - 机器申请 - GPU - - 由于GPU资源紧张,请提前做好资源申请,每个模型按3个工作日作为调测时间,每个老师需一次性租借其名下所有模型,若无法按期归还,请提前和华为方支撑者做好沟通 + - 由于GPU资源紧张,请提前做好资源申请,每个模型按3个工作日作为调测时间,原则上每次调测不少于2个模型,每个模型不可重复申请调试。若无法按期归还,请提前和华为方支撑者做好沟通 - NPU - - 每个模型调测人员至少分配一张NPU用于模型调测,请向华为方申请调配的NPU资源 + - 每个模型调测人员至少分配一张NPU用于模型调测,请向华为方申请动态调配的NPU资源 - 磁盘使用 - / 下是系统目录 - /home 是可使用的数据盘目录 @@ -1156,26 +1161,131 @@ 
https://gitee.com/ascend/modelzoo/wikis ascend benchmark工具纯推理测的npu单颗device吞吐率乘以4颗大于TensorRT工具测的gpu T4吞吐率则认为性能达标 - 脚本: 代码符合pep8规范; + 脚本命名格式需统一,文件名含模型名时模型名用小写,模型名含多个字符串时用-连接; xxx_pth2onnx.py中不能使用从网络下载权重pth文件的代码,xxx_pth2onnx.py应有输入输出参数,输入是本地权重pth文件,输出是生成的onnx模型文件名; xxx_pth_preprocess.py与xxx_pth_postprocess.py尽可能只引用numpy,Pillow,torch,pycocotools等基础库,如不要因mmdetection框架的数据处理与精度评测部分封装了这些基础库的操作,为调用这些简单接口,前后处理脚本就依赖mmdetection; 不同模型的脚本与代码部分处理流程有相似性,尽量整合成通用的脚本与代码。 - - 推理步骤: - 需要提供端到端推理的操作过程 + - 推理过程: + 需要提供端到端推理过程中执行的命令等 - 关键问题总结: - 需要提供端到端推理遇到的关键问题的简要调试过程 + 需要提供端到端推理遇到的关键问题的简要调试过程,至少包含模型转换要点,精度调试,性能优化 说明: ``` - 对于性能不达标的模型,优化是学生能做的尽量做,比如用ascend atc的相关优化选项尝试一下,尝试使用最近邻替换双线性的resize重新训练,降低图片分辨率等,然后profiling分析定位引起性能下降的原因,具体到引起性能下降的算子,并在交付文档中写明问题原因与简要的定位过程,涉及到atc算子代码的修改由华为方做。 - 工作量为简单模型2-3个工作日,复杂模型5-7个工作日,个别难度大的模型12个工作日。 + 1.需要测试batch1,4,8,16,32的精度与性能 + 2.对于性能不达标的模型,需要进行如下工作: + 1)用ascend atc的相关优化选项尝试一下,尝试使用最近邻替换双线性的resize重新训练,降低图片分辨率等使性能达标。 + 2)对于算子导致的性能问题,需要使用profiling分析定位引起性能下降的原因,具体到引起性能下降的算子。优先修改模型代码以使其选择性能好的npu算子替换性能差的npu算子使性能达标,然后在modelzoo上提issue,等修复版本发布后再重测性能,继续优化。 + 3)需要交付profiling性能数据,对经过上述方法性能可以达标的模型,在交付文档中写明问题原因与达标需要执行的操作;对经过上述方法性能仍不达标的模型,在交付文档中写明问题原因与简要的定位过程。 + 3.工作量为简单模型2-3个工作日,复杂模型5-10个工作日,个别难度大的模型15-20个工作日。 ``` - 交付件 - - 交付件参考:[ResNeXt Onnx端到端推理指导.docx](https://gitee.com/ascend/modelzoo/wikis) + - 交付件参考:[ResNeXt50_Onnx模型端到端推理指导.md](https://gitee.com/ascend/modelzoo/tree/master/built-in/ACL_PyTorch/Benchmark/cv/classification/ResNext50) - 最终交付件: - 包含以上交付标准的模型名称 Onnx端到端推理指导.docx + 包含以上交付标准的模型名称_Onnx端到端推理指导.md - 最终交付形式: - gitee网址:https://gitee.com/ascend/modelzoo/tree/master/contrib/onnx_infer/ + gitee网址:https://gitee.com/ascend/modelzoo/tree/master/contrib/ACL_PyTorch commit信息格式:【高校贡献-学校学院名称】【Onnx-模型名称】模型名称 Onnx端到端推理 + 模型命名风格为大驼峰,模型名含多个字符串时使用横杠或下划线连接,当上下文用横杠时模型名用下划线连接,否则用横杠连接 + 对于batch1与batch16,npu性能均高于T4性能1.2倍的模型,放在benchmark目录下,1-1.2倍对应official目录,低于1倍放在research目录 + +- gitee仓PR贡献流程 + - fork [modelzoo](https://gitee.com/ascend/modelzoo) 到个人仓 + - 提交代码到个人仓 + - 签署cla [link](https://clasign.osinfra.cn/sign/Z2l0ZWUlMkZhc2NlbmQ=) + - 选择 Sign Individual CLA + - 若已提交PR,但忘记签署,可在签署CLA后再评论内评论 ```/check-cla``` 重新校验 + - 依据文件夹名称及目录规范整理代码,完成自验,使用PR内容模板进行PR,审查人员请指定 王姜奔(wangjiangben_hw) + - PR后,华为方会进行代码检视,并对PR进行验证,请关注PR的评论并及时修改 + - 最终验收完成后合入主干 +- gitee仓验收使用脚本(请自验)、PR内容模板 + - 验收使用脚本(请自验) + >![](public_sys-resources/icon-note.gif) + **说明:** + > **提交前请确保自验通过!确保直接执行以下脚本就可运行!** + + ```shell script + + # pth是否能正确转换为om + bash scripts/pth2om.sh + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + + ``` + - PR内容模板 + - PR示例链接 https://gitee.com/ascend/modelzoo/pulls/887 + - PR名称 + - 【高校贡献-${学校学院名称}】【Pytorch离线推理-${模型名称}】${PR内容摘要} + - 举例说明:【高校贡献-华为大学昇腾学院】【Pytorch离线推理-ResNeXt50】初次提交。 + ``` + + + + **What type of PR is this?** + > /kind task + + **What does this PR do / why do we need it**: + # 简述你这次的PR的详情 + + | 模型 | 官网精度 | 310精度 | t4性能 | 310性能 | + | :------: | :------: | :------: | :------: | :------: | + | ResNeXt50 bs1 | top1:77.62% top5:93.70% | top1:77.62% top5:93.69% | 763.044fps | 1497.252fps | + | ResNeXt50 bs16 | top1:77.62% top5:93.70% | top1:77.62% top5:93.69% | 1234.940fps | 2096.376fps | + + # 自验报告 + ```shell + # 第X次验收测试 + # 验收结果 OK / Failed + # 验收环境: A + K / CANN R20C20TR5 + # 关联issue: + + # pth是否能正确转换为om + bash scripts/pth2om.sh + # 验收结果: OK / Failed + # 备注: 
成功生成om,无运行报错,报错日志xx 等 + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + # 验收结果: OK / Failed + # 备注: 目标精度top1:77.62% top5:93.70%;bs1,bs16验收精度top1:77.62% top5:93.69%;精度下降不超过1%;无运行报错,报错日志xx 等 + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:1497.252FPS bs16:2096.376FPS;无运行报错,报错日志xx 等 + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:763.044FPS bs16:1234.940FPS;无运行报错,报错日志xx 等 + # 310性能需要超过t4 + + ``` + - 示例链接 https://gitee.com/ascend/modelzoo/pulls/836#note_4750681 + + **Which issue(s) this PR fixes**: + # 用于后期issue关联的pr + + Fixes # + + **Special notes for your reviewers**: + # 在reviewer检视时你想要和他说的 + + ``` diff --git "a/AscendPyTorch\346\250\241\345\236\213\346\216\250\347\220\206\344\274\227\346\231\272\351\252\214\346\224\266\346\214\207\345\215\227.md" "b/AscendPyTorch\346\250\241\345\236\213\346\216\250\347\220\206\344\274\227\346\231\272\351\252\214\346\224\266\346\214\207\345\215\227.md" new file mode 100644 index 0000000000000000000000000000000000000000..bdae5f21510330a1a7769f18729737fdf5b44cfe --- /dev/null +++ "b/AscendPyTorch\346\250\241\345\236\213\346\216\250\347\220\206\344\274\227\346\231\272\351\252\214\346\224\266\346\214\207\345\215\227.md" @@ -0,0 +1,72 @@ +# Ascend PyTorch 模型推理众智验收指南 + +1. 先上gitee管理平台,将验收目标调整至验收状态 +2. 检查PR内容,文件夹路径和文件结构 + - PR末班和文件路径结构都在下面附件里有详细说明,请仔细check +3. 按照验收脚本在交付文件夹下进行验收 + + ```shell script + + # pth是否能正确转换为om + bash scripts/pth2om.sh + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + + ``` + + - 验收过程中遇到问题,如是一些路径或者打字错误的问题,先修复继续执行 + - 每次验收都需要对验收脚本中的所有未验收脚本进行验收,不要因某一项验收失败而阻塞后续验收工作 +4. 验收反馈 + - 验收后,使用验收报告模板,在评论区反馈验收结果 + ```shell + # 第X次验收测试 + # 验收结果 OK / Failed + # 验收环境: A + K / CANN R20C20TR5 + # 关联issue: + + # pth是否能正确转换为om + bash scripts/pth2om.sh + # 验收结果: OK / Failed + # 备注: 成功生成om,无运行报错,报错日志xx 等 + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + # 验收结果: OK / Failed + # 备注: 目标精度top1:77.62% top5:93.70%;bs1,bs16验收精度top1:77.62% top5:93.69%;精度下降不超过1%;无运行报错,报错日志xx 等 + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:1497.252FPS bs16:2096.376FPS;无运行报错,报错日志xx 等 + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:763.044FPS bs16:1234.940FPS;无运行报错,报错日志xx 等 + # 310性能需要超过t4 + + ``` + - 示例链接 https://gitee.com/ascend/modelzoo/pulls/836#note_4814643 +5. 
验收完成后,上gitee管理平台,将验收目标调整至完成状态 + + + + +- 关联issue模板 (负责人请关联相应的学生,若无法关联,请关联验收者) + ``` + 【Pytorch模型推理众智测试验收】【第x次回归测试】 xxx模型 验收不通过 + + 贴上验收报告 + + ``` + - 示例链接 https://gitee.com/ascend/modelzoo/issues/I3FI5L?from=project-issue + + + + \ No newline at end of file diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/ResNeXt50_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/ResNeXt50_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" new file mode 100644 index 0000000000000000000000000000000000000000..11de7d289beeae1e836017f5d59a41d3961726eb --- /dev/null +++ "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/ResNeXt50_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" @@ -0,0 +1,605 @@ +# ResNeXt50 Onnx模型端到端推理指导 +- [1 模型概述](#1-模型概述) + - [1.1 论文地址](#11-论文地址) + - [1.2 代码地址](#12-代码地址) +- [2 环境说明](#2-环境说明) + - [2.1 深度学习框架](#21-深度学习框架) + - [2.2 python第三方库](#22-python第三方库) +- [3 模型转换](#3-模型转换) + - [3.1 pth转onnx模型](#31-pth转onnx模型) + - [3.2 onnx转om模型](#32-onnx转om模型) +- [4 数据集预处理](#4-数据集预处理) + - [4.1 数据集获取](#41-数据集获取) + - [4.2 数据集预处理](#42-数据集预处理) + - [4.3 生成数据集信息文件](#43-生成数据集信息文件) +- [5 离线推理](#5-离线推理) + - [5.1 benchmark工具概述](#51-benchmark工具概述) + - [5.2 离线推理](#52-离线推理) +- [6 精度对比](#6-精度对比) + - [6.1 离线推理TopN精度统计](#61-离线推理TopN精度统计) + - [6.2 开源TopN精度](#62-开源TopN精度) + - [6.3 精度对比](#63-精度对比) +- [7 性能对比](#7-性能对比) + - [7.1 npu性能数据](#71-npu性能数据) + - [7.2 T4性能数据](#72-T4性能数据) + - [7.3 性能对比](#73-性能对比) + + + +## 1 模型概述 + +- **[论文地址](#11-论文地址)** + +- **[代码地址](#12-代码地址)** + +### 1.1 论文地址 +[ResNeXt50论文](https://arxiv.org/abs/1611.05431) +本文提出了一个简单的,高度模型化的针对图像分类问题的网络结构。本文的网络是通过重复堆叠building block组成的,这些building block整合了一系列具有相同拓扑结构的变体(transformations)。本文提出的简单的设计思路可以生成一种同质的,多分支的结构。这种方法产生了一个新的维度,作者将其称为基(变体的数量,the 
size of the set of transformations)。在ImageNet-1K数据集上,作者可以在保证模型复杂度的限制条件下,通过提升基的大小来提高模型的准确率。更重要的是,相比于更深和更宽的网络,提升基的大小更加有效。作者将本文的模型命名为ResNeXt,本模型在ILSVRC2016上取得了第二名。本文还在ImageNet-5K和COCO数据集上进行了实验,结果均表明ResNeXt的性能比ResNet好。 + +### 1.2 代码地址 +[ResNeXt50代码](https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py) + +## 2 环境说明 + +- **[深度学习框架](#21-深度学习框架)** + +- **[python第三方库](#22-python第三方库)** + +### 2.1 深度学习框架 +``` +pytorch == 1.6.0 +torchvision == 0.7.0 +onnx == 1.7.0 +``` + +### 2.2 python第三方库 + +``` +numpy == 1.18.5 +Pillow == 7.2.0 +``` + +**说明:** +> X86架构:pytorch,torchvision和onnx可以通过官方下载whl包安装,其它可以通过pip3.7 install 包名 安装 +> +> Arm架构:pytorch,torchvision和onnx可以通过源码编译安装,其它可以通过pip3.7 install 包名 安装 + +## 3 模型转换 + +- **[pth转onnx模型](#31-pth转onnx模型)** + +- **[onnx转om模型](#32-onnx转om模型)** + +### 3.1 pth转onnx模型 + +1.下载pth权重文件 +[ResNeXt50预训练pth权重文件](https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth) +文件md5sum: 1d6611049e6ef03f1d6afa11f6f9023e +2.编写pth2onnx脚本resnext50_pth2onnx.py +```python +import sys +import torch +import torch.onnx +import torchvision.models as models + +def pth2onnx(input_file, output_file): + model = models.resnext50_32x4d(pretrained=False) + checkpoint = torch.load(input_file, map_location=None) + model.load_state_dict(checkpoint) + + model.eval() + input_names = ["image"] + output_names = ["class"] + dynamic_axes = {'image': {0: '-1'}, 'class': {0: '-1'}} + dummy_input = torch.randn(1, 3, 224, 224) + torch.onnx.export(model, dummy_input, output_file, input_names = input_names, dynamic_axes = dynamic_axes, output_names = output_names, verbose=True, opset_version=11) + +if __name__ == "__main__": + input_file = sys.argv[1] + output_file = sys.argv[2] + pth2onnx(input_file, output_file) +``` + + **说明:** +>注意目前ATC支持的onnx算子版本为11 + +3.执行pth2onnx脚本,生成onnx模型文件 +``` +python3 resnext50_pth2onnx.py resnext50_32x4d-7cdf4587.pth resnext50.onnx +``` + +### 3.2 onnx转om模型 + +1.设置环境变量 +``` +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考[CANN V100R020C10 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164868?idPath=23710424%7C251366513%7C22892968%7C251168373) +``` +atc --framework=5 --model=./resnext50.onnx --input_format=NCHW --input_shape="image:16,3,224,224" --output=resnext50_bs16 --log=debug --soc_version=Ascend310 +``` + +## 4 数据集预处理 + +- **[数据集获取](#41-数据集获取)** + +- **[数据集预处理](#42-数据集预处理)** + +- **[生成数据集信息文件](#43-生成数据集信息文件)** + +### 4.1 数据集获取 +该模型使用[ImageNet官网](http://www.image-net.org)的5万张验证集进行测试,图片与标签分别存放在datasets/ImageNet/val_union与datasets/ImageNet/val_label.txt。 + +### 4.2 数据集预处理 +1.预处理脚本imagenet_torch_preprocess.py +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +from PIL import Image +import numpy as np +import multiprocessing + + +model_config = { + 'resnet': { + 'resize': 256, + 'centercrop': 224, + 'mean': [0.485, 0.456, 0.406], + 'std': [0.229, 0.224, 0.225], + }, + 'inceptionv3': { + 'resize': 342, + 'centercrop': 299, + 'mean': [0.485, 0.456, 0.406], + 'std': [0.229, 0.224, 0.225], + }, + 'inceptionv4': { + 'resize': 342, + 'centercrop': 299, + 'mean': [0.5, 0.5, 0.5], + 'std': [0.5, 0.5, 0.5], + }, +} + + +def center_crop(img, output_size): + if isinstance(output_size, int): + output_size = (int(output_size), int(output_size)) + image_width, image_height = img.size + crop_height, crop_width = output_size + crop_top = int(round((image_height - crop_height) / 2.)) + crop_left = int(round((image_width - crop_width) / 2.)) + return img.crop((crop_left, crop_top, crop_left + crop_width, crop_top + crop_height)) + + +def resize(img, size, interpolation=Image.BILINEAR): + if isinstance(size, int): + w, h = img.size + if (w <= h and w == size) or (h <= w and h == size): + return img + if w < h: + ow = size + oh = int(size * h / w) + return img.resize((ow, oh), interpolation) + else: + oh = size + ow = int(size * w / h) + return img.resize((ow, oh), interpolation) + else: + return img.resize(size[::-1], interpolation) + + +def gen_input_bin(mode_type, file_batches, batch): + i = 0 + for file in file_batches[batch]: + i = i + 1 + print("batch", batch, file, "===", i) + + # RGBA to RGB + image = Image.open(os.path.join(src_path, file)).convert('RGB') + image = resize(image, model_config[mode_type]['resize']) # Resize + image = center_crop(image, model_config[mode_type]['centercrop']) # CenterCrop + img = np.array(image, dtype=np.float32) + img = img.transpose(2, 0, 1) # ToTensor: HWC -> CHW + img = img / 255. # ToTensor: div 255 + img -= np.array(model_config[mode_type]['mean'], dtype=np.float32)[:, None, None] # Normalize: mean + img /= np.array(model_config[mode_type]['std'], dtype=np.float32)[:, None, None] # Normalize: std + img.tofile(os.path.join(save_path, file.split('.')[0] + ".bin")) + + +def preprocess(mode_type, src_path, save_path): + files = os.listdir(src_path) + file_batches = [files[i:i + 500] for i in range(0, 50000, 500) if files[i:i + 500] != []] + thread_pool = multiprocessing.Pool(len(file_batches)) + for batch in range(len(file_batches)): + thread_pool.apply_async(gen_input_bin, args=(mode_type, file_batches, batch)) + thread_pool.close() + thread_pool.join() + print("in thread, except will not report! 
please ensure bin files generated.") + + +if __name__ == '__main__': + if len(sys.argv) < 4: + raise Exception("usage: python3 xxx.py [model_type] [src_path] [save_path]") + mode_type = sys.argv[1] + src_path = sys.argv[2] + save_path = sys.argv[3] + src_path = os.path.realpath(src_path) + save_path = os.path.realpath(save_path) + if mode_type not in model_config: + model_type_help = "model type: " + for key in model_config.keys(): + model_type_help += key + model_type_help += ' ' + raise Exception(model_type_help) + if not os.path.isdir(save_path): + os.makedirs(os.path.realpath(save_path)) + preprocess(mode_type, src_path, save_path) +``` +2.执行预处理脚本,生成数据集预处理后的bin文件 +``` +python3 imagenet_torch_preprocess.py datasets/ImageNet/val_union ./prep_dataset +``` +### 4.3 生成数据集信息文件 +1.生成数据集信息文件脚本get_info.py +```python +import os +import sys +import cv2 +from glob import glob + + +def get_bin_info(file_path, info_name, width, height): + bin_images = glob(os.path.join(file_path, '*.bin')) + with open(info_name, 'w') as file: + for index, img in enumerate(bin_images): + content = ' '.join([str(index), img, width, height]) + file.write(content) + file.write('\n') + + +def get_jpg_info(file_path, info_name): + extensions = ['jpg', 'jpeg', 'JPG', 'JPEG'] + image_names = [] + for extension in extensions: + image_names.append(glob(os.path.join(file_path, '*.' + extension))) + with open(info_name, 'w') as file: + for image_name in image_names: + if len(image_name) == 0: + continue + else: + for index, img in enumerate(image_name): + img_cv = cv2.imread(img) + shape = img_cv.shape + width, height = shape[1], shape[0] + content = ' '.join([str(index), img, str(width), str(height)]) + file.write(content) + file.write('\n') + + +if __name__ == '__main__': + file_type = sys.argv[1] + file_path = sys.argv[2] + info_name = sys.argv[3] + if file_type == 'bin': + width = sys.argv[4] + height = sys.argv[5] + assert len(sys.argv) == 6, 'The number of input parameters must be equal to 5' + get_bin_info(file_path, info_name, width, height) + elif file_type == 'jpg': + assert len(sys.argv) == 4, 'The number of input parameters must be equal to 3' + get_jpg_info(file_path, info_name) +``` +2.执行生成数据集信息脚本,生成数据集信息文件 +``` +python3 get_info.py bin ./prep_dataset ./resnext50_prep_bin.info 224 224 +``` +第一个参数为模型输入的类型,第二个参数为生成的bin文件路径,第三个为输出的info文件,后面为宽高信息 +## 5 离线推理 + +- **[benchmark工具概述](#51-benchmark工具概述)** + +- **[离线推理](#52-离线推理)** + +### 5.1 benchmark工具概述 + +benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考[CANN V100R020C10 推理benchmark工具用户指南 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164874?idPath=23710424%7C251366513%7C22892968%7C251168373) +### 5.2 离线推理 +1.设置环境变量 +``` +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.执行离线推理 +``` +./benchmark -model_type=vision -device_id=0 -batch_size=16 -om_path=resnext50_bs16.om -input_text_path=./resnext50_prep_bin.info -input_width=224 -input_height=224 -output_binary=False -useDvpp=False +``` +输出结果默认保存在当前目录result/dumpOutput_devicex,模型只有一个名为class的输出,shape为bs * 
1000,数据类型为FP32,对应1000个分类的预测结果,每个输入对应的输出对应一个_x.bin文件。 + +## 6 精度对比 + +- **[离线推理TopN精度](#61-离线推理TopN精度)** +- **[开源TopN精度](#62-开源TopN精度)** +- **[精度对比](#63-精度对比)** + +### 6.1 离线推理TopN精度统计 + +后处理统计TopN精度 +```python +import os +import sys +import json +import numpy as np +import time + +np.set_printoptions(threshold=sys.maxsize) + +LABEL_FILE = "HiAI_label.json" + + +def gen_file_name(img_name): + full_name = img_name.split('/')[-1] + index = full_name.rfind('.') + return full_name[:index] + + +def cre_groundtruth_dict(gtfile_path): + """ + :param filename: file contains the imagename and label number + :return: dictionary key imagename, value is label number + """ + img_gt_dict = {} + for gtfile in os.listdir(gtfile_path): + if (gtfile != LABEL_FILE): + with open(os.path.join(gtfile_path, gtfile), 'r') as f: + gt = json.load(f) + ret = gt["image"]["annotations"][0]["category_id"] + img_gt_dict[gen_file_name(gtfile)] = ret + return img_gt_dict + + +def cre_groundtruth_dict_fromtxt(gtfile_path): + """ + :param filename: file contains the imagename and label number + :return: dictionary key imagename, value is label number + """ + img_gt_dict = {} + with open(gtfile_path, 'r')as f: + for line in f.readlines(): + temp = line.strip().split(" ") + imgName = temp[0].split(".")[0] + imgLab = temp[1] + img_gt_dict[imgName] = imgLab + return img_gt_dict + + +def load_statistical_predict_result(filepath): + """ + function: + the prediction esult file data extraction + input: + result file:filepath + output: + n_label:numble of label + data_vec: the probabilitie of prediction in the 1000 + :return: probabilities, numble of label, in_type, color + """ + with open(filepath, 'r')as f: + data = f.readline() + temp = data.strip().split(" ") + n_label = len(temp) + if data == '': + n_label = 0 + data_vec = np.zeros((n_label), dtype=np.float32) + in_type = '' + color = '' + if n_label == 0: + in_type = f.readline() + color = f.readline() + else: + for ind, prob in enumerate(temp): + data_vec[ind] = np.float32(prob) + return data_vec, n_label, in_type, color + + +def create_visualization_statistical_result(prediction_file_path, + result_store_path, json_file_name, + img_gt_dict, topn=5): + """ + :param prediction_file_path: + :param result_store_path: + :param json_file_name: + :param img_gt_dict: + :param topn: + :return: + """ + writer = open(os.path.join(result_store_path, json_file_name), 'w') + table_dict = {} + table_dict["title"] = "Overall statistical evaluation" + table_dict["value"] = [] + + count = 0 + resCnt = 0 + n_labels = 0 + count_hit = np.zeros(topn) + for tfile_name in os.listdir(prediction_file_path): + count += 1 + temp = tfile_name.split('.')[0] + index = temp.rfind('_') + img_name = temp[:index] + filepath = os.path.join(prediction_file_path, tfile_name) + ret = load_statistical_predict_result(filepath) + prediction = ret[0] + n_labels = ret[1] + sort_index = np.argsort(-prediction) + gt = img_gt_dict[img_name] + if (n_labels == 1000): + realLabel = int(gt) + elif (n_labels == 1001): + realLabel = int(gt) + 1 + else: + realLabel = int(gt) + + resCnt = min(len(sort_index), topn) + for i in range(resCnt): + if (str(realLabel) == str(sort_index[i])): + count_hit[i] += 1 + break + + if 'value' not in table_dict.keys(): + print("the item value does not exist!") + else: + table_dict["value"].extend( + [{"key": "Number of images", "value": str(count)}, + {"key": "Number of classes", "value": str(n_labels)}]) + if count == 0: + accuracy = 0 + else: + accuracy = np.cumsum(count_hit) / count + for i in 
range(resCnt): + table_dict["value"].append({"key": "Top" + str(i + 1) + " accuracy", + "value": str( + round(accuracy[i] * 100, 2)) + '%'}) + json.dump(table_dict, writer) + writer.close() + + +if __name__ == '__main__': + start = time.time() + try: + # txt file path + folder_davinci_target = sys.argv[1] + # annotation files path, "val_label.txt" + annotation_file_path = sys.argv[2] + # the path to store the results json path + result_json_path = sys.argv[3] + # result json file name + json_file_name = sys.argv[4] + except IndexError: + print("Stopped!") + exit(1) + + if not (os.path.exists(folder_davinci_target)): + print("target file folder does not exist.") + + if not (os.path.exists(annotation_file_path)): + print("Ground truth file does not exist.") + + if not (os.path.exists(result_json_path)): + print("Result folder doesn't exist.") + + img_label_dict = cre_groundtruth_dict_fromtxt(annotation_file_path) + create_visualization_statistical_result(folder_davinci_target, + result_json_path, json_file_name, + img_label_dict, topn=5) + + elapsed = (time.time() - start) + print("Time used:", elapsed) +``` +调用vision_metric_ImageNet.py脚本推理结果与label比对,可以获得Accuracy Top5数据,结果保存在result.json中。 +``` +python3 vision_metric_ImageNet.py result/dumpOutput_device0/ dataset/ImageNet/val_label.txt ./ result.json +``` +第一个为benchmark输出目录,第二个为数据集配套标签,第三个是生成文件的保存目录,第四个是生成的文件名。 +查看输出结果: +``` +{"title": "Overall statistical evaluation", "value": [{"key": "Number of images", "value": "50000"}, {"key": "Number of classes", "value": "1000"}, {"key": "Top1 accuracy", "value": "77.62%"}, {"key": "Top2 accuracy", "value": "87.42%"}, {"key": "Top3 accuracy", "value": "90.79%"}, {"key": "Top4 accuracy", "value": "92.56%"}, {"key": "Top5 accuracy", "value": "93.69%"}] +``` +### 6.2 开源TopN精度 +[torchvision官网精度](https://pytorch.org/vision/stable/models.html) +``` +Model Acc@1 Acc@5 +ResNeXt-50-32x4d 77.618 93.698 +``` +### 6.3 精度对比 +将得到的om离线模型推理TopN精度与该模型github代码仓上公布的精度对比,精度下降在1%范围之内,故精度达标。 + +## 7 性能对比 + +- **[npu性能数据](#71-npu性能数据)** +- **[T4性能数据](#72-T4性能数据)** +- **[性能对比](#73-性能对比)** + +### 7.1 npu性能数据 +batch1的性能: + 测试npu性能要确保device空闲,使用npu-smi info命令可查看device是否在运行其它推理任务 +``` +./benchmark -round=50 -om_path=resnext50_bs1.om -device_id=0 -batch_size=1 +``` +执行50次纯推理取均值,统计吞吐率与其倒数时延(benchmark的时延是单个数据的推理时间),npu性能是一个device执行的结果 +``` +[INFO] Dataset number: 49 finished cost 2.635ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_resnext50_bs1_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 374.313samples/s, ave_latency: 2.67914ms +``` +batch16的性能: +``` +./benchmark -round=50 -om_path=resnext50_bs16.om -device_id=0 -batch_size=16 +``` +``` +[INFO] Dataset number: 49 finished cost 30.514ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_resnext50_bs16_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 524.094samples/s, ave_latency: 1.9101ms +``` +### 7.2 T4性能数据 +batch1性能: +在T4机器上安装开源TensorRT +``` +cd /usr/local/TensorRT-7.2.2.3/targets/x86_64-linux-gnu/bin/ +./trtexec --onnx=resnext50.onnx --fp16 --shapes=image:1x3x224x224 --threads +``` +gpu T4是4个device并行执行的结果,mean是时延(tensorrt的时延是batch个数据的推理时间),即吞吐率的倒数乘以batch +``` +[03/24/2021-03:54:47] [I] GPU Compute +[03/24/2021-03:54:47] [I] min: 1.26575 ms +[03/24/2021-03:54:47] [I] max: 4.41528 ms +[03/24/2021-03:54:47] [I] mean: 1.31054 ms +[03/24/2021-03:54:47] [I] median: 1.30151 ms +[03/24/2021-03:54:47] [I] percentile: 1.40723 ms at 99% 
+[03/24/2021-03:54:47] [I] total compute time: 2.9972 s +``` +batch16性能: +``` +./trtexec --onnx=resnext50.onnx --fp16 --shapes=image:16x3x224x224 --threads +``` +``` +[03/24/2021-03:57:22] [I] GPU Compute +[03/24/2021-03:57:22] [I] min: 12.5645 ms +[03/24/2021-03:57:22] [I] max: 14.8437 ms +[03/24/2021-03:57:22] [I] mean: 12.9561 ms +[03/24/2021-03:57:22] [I] median: 12.8541 ms +[03/24/2021-03:57:22] [I] percentile: 14.8377 ms at 99% +[03/24/2021-03:57:22] [I] total compute time: 3.03173 s +``` +### 7.3 性能对比 +batch1:2.67914/4 < 1.31054/1 +batch16:1.9101/4 < 12.9561/16 +npu的吞吐率乘4比T4的吞吐率大,即npu的时延除4比T4的时延除以batch小,故npu性能高于T4性能,性能达标。 +对于batch1与batch16,npu性能均高于T4性能1.2倍,该模型放在benchmark/cv/classification目录下。 + + diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216detectron2\350\256\255\347\273\203\347\232\204npu\346\235\203\351\207\215\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216detectron2\350\256\255\347\273\203\347\232\204npu\346\235\203\351\207\215\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" new file mode 100644 index 0000000000000000000000000000000000000000..e365eaa88addc2895b241fce3ebb2eb2817d3d75 --- /dev/null +++ "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216detectron2\350\256\255\347\273\203\347\232\204npu\346\235\203\351\207\215\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" @@ -0,0 +1,1227 @@ +# 基于detectron2训练的npu权重的maskrcnn Onnx模型端到端推理指导 +- [1 模型概述](#1-模型概述) + - [1.1 论文地址](#11-论文地址) + - [1.2 代码地址](#12-代码地址) +- [2 环境说明](#2-环境说明) + - [2.1 深度学习框架](#21-深度学习框架) + - [2.2 python第三方库](#22-python第三方库) +- [3 模型转换](#3-模型转换) + - [3.1 pth转onnx模型](#31-pth转onnx模型) + - [3.2 onnx转om模型](#32-onnx转om模型) +- [4 数据集预处理](#4-数据集预处理) + - [4.1 数据集获取](#41-数据集获取) + - [4.2 数据集预处理](#42-数据集预处理) + - [4.3 生成数据集信息文件](#43-生成数据集信息文件) +- [5 离线推理](#5-离线推理) + - [5.1 benchmark工具概述](#51-benchmark工具概述) + - [5.2 离线推理](#52-离线推理) +- [6 精度对比](#6-精度对比) + - [6.1 离线推理精度统计](#61-离线推理精度统计) + - [6.2 开源精度](#62-开源精度) + - [6.3 精度对比](#63-精度对比) +- [7 性能对比](#7-性能对比) + - [7.1 npu性能数据](#71-npu性能数据) + - [7.2 T4性能数据](#72-T4性能数据) + - [7.3 性能对比](#73-性能对比) + + + +## 1 模型概述 + +- **[论文地址](#11-论文地址)** + +- **[代码地址](#12-代码地址)** + +### 1.1 论文地址 +[maskrcnn论文](https://arxiv.org/abs/1703.06870) +论文提出了一个简单、灵活、通用的目标实例分割框架Mask R-CNN。这个框架可同时做目标检测、实例分割。实例分割的实现就是在faster r-cnn的基础上加了一个可以预测目标掩膜(mask)的分支。只比Faster r-cnn慢一点,5fps。很容易拓展到其他任务如:关键点检测。18年在coco的目标检测、实例分割、人体关键点检测都取得了最优成绩。 + +### 1.2 代码地址 +[cpu,gpu版detectron2框架maskrcnn代码](https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md) + 
+[npu版detectron2框架maskrcnn代码](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch) + +## 2 环境说明 + +- **[深度学习框架](#21-深度学习框架)** + +- **[python第三方库](#22-python第三方库)** + +### 2.1 深度学习框架 +``` +pytorch == 1.8.0 +torchvision == 0.9.0 +onnx == 1.8.0 +``` + +**注意:** +> 转onnx的环境上pytorch需要安装1.8.0版本 +> + +### 2.2 python第三方库 + +``` +numpy == 1.18.5 +opencv-python == 4.2.0.34 +``` + +**说明:** +> X86架构:opencv,pytorch,torchvision和onnx可以通过官方下载whl包安装,其它可以通过pip3.7 install 包名 安装 +> +> Arm架构:opencv,pytorch,torchvision和onnx可以通过源码编译安装,其它可以通过pip3.7 install 包名 安装 + +## 3 模型转换 + +- **[pth转onnx模型](#31-pth转onnx模型)** + +- **[onnx转om模型](#32-onnx转om模型)** + +detectron2暂支持pytorch1.8导出pytorch框架的onnx,npu权重可以使用开源的detectron2加载,因此基于pytorch1.8与开源detectron2导出含npu权重的onnx。atc暂不支持动态shape小算子,可以使用大颗粒算子替换这些小算子规避,这些小算子可以在转onnx时的verbose打印中找到其对应的python代码,从而根据功能用大颗粒算子替换,onnx能推导出变量正确的shape与算子属性正确即可,变量实际的数值无关紧要,因此这些大算子函数的功能实现无关紧要,因包含自定义算子需要去掉对onnx模型的校验。 + +### 3.1 pth转onnx模型 + +1.获取pth权重文件 +[maskrcnn基于detectron2预训练的npu权重文件](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch) +文件md5sum: b95f35f051012a02875220482a568c3b +2.下载detectron2源码并安装 +```shell +git clone https://github.com/facebookresearch/detectron2 +python3.7 -m pip install -e detectron2 +``` + + **说明:** +> 安装所需的依赖说明请参考detectron2/INSTALL.md +> +> 重装pytorch后需要rm -rf detectron2/build/ **/*.so再重装detectron2 + +3.detectron2代码迁移,参见maskrcnn_detectron2.diff: +```diff +diff --git a/detectron2/layers/__init__.py b/detectron2/layers/__init__.py +index c8bd1fb..f5fa9ea 100644 +--- a/detectron2/layers/__init__.py ++++ b/detectron2/layers/__init__.py +@@ -2,7 +2,7 @@ + from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm + from .deform_conv import DeformConv, ModulatedDeformConv + from .mask_ops import paste_masks_in_image +-from .nms import batched_nms, batched_nms_rotated, nms, nms_rotated ++from .nms import batched_nms, batch_nms_op, batched_nms_rotated, nms, nms_rotated + from .roi_align import ROIAlign, roi_align + from .roi_align_rotated import ROIAlignRotated, roi_align_rotated + from .shape_spec import ShapeSpec +diff --git a/detectron2/layers/nms.py b/detectron2/layers/nms.py +index ac14d45..22efb24 100644 +--- a/detectron2/layers/nms.py ++++ b/detectron2/layers/nms.py +@@ -15,6 +15,56 @@ if TORCH_VERSION < (1, 7): + else: + nms_rotated_func = torch.ops.detectron2.nms_rotated + ++class BatchNMSOp(torch.autograd.Function): ++ @staticmethod ++ def forward(ctx, bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (batch, N, C, 4). ++ scores (torch.Tensor): scores in shape (batch, N, C). 
++ return: ++ nmsed_boxes: (1, N, 4) ++ nmsed_scores: (1, N) ++ nmsed_classes: (1, N) ++ nmsed_num: (1,) ++ """ ++ ++ # Phony implementation for onnx export ++ nmsed_boxes = bboxes[:, :max_total_size, 0, :] ++ nmsed_scores = scores[:, :max_total_size, 0] ++ nmsed_classes = torch.arange(max_total_size, dtype=torch.long) ++ nmsed_num = torch.Tensor([max_total_size]) ++ ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++ @staticmethod ++ def symbolic(g, bboxes, scores, score_thr, iou_thr, max_size_p_class, max_t_size): ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = g.op('BatchMultiClassNMS', ++ bboxes, scores, score_threshold_f=score_thr, iou_threshold_f=iou_thr, ++ max_size_per_class_i=max_size_p_class, max_total_size_i=max_t_size, outputs=4) ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++def batch_nms_op(bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (N, 4). ++ scores (torch.Tensor): scores in shape (N, ). ++ """ ++ ++ num_classes = bboxes.shape[1].numpy() // 4 ++ if bboxes.dtype == torch.float32: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4).half() ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1).half() ++ else: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4) ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1) ++ ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = BatchNMSOp.apply(bboxes, scores, ++ score_threshold, iou_threshold, max_size_per_class, max_total_size) ++ nmsed_boxes = nmsed_boxes.float() ++ nmsed_scores = nmsed_scores.float() ++ nmsed_classes = nmsed_classes.long() ++ dets = torch.cat((nmsed_boxes.reshape((max_total_size, 4)), nmsed_scores.reshape((max_total_size, 1))), -1) ++ labels = nmsed_classes.reshape((max_total_size, )) ++ return dets, labels + + def batched_nms( + boxes: torch.Tensor, scores: torch.Tensor, idxs: torch.Tensor, iou_threshold: float +diff --git a/detectron2/modeling/box_regression.py b/detectron2/modeling/box_regression.py +index 12be000..074f3e3 100644 +--- a/detectron2/modeling/box_regression.py ++++ b/detectron2/modeling/box_regression.py +@@ -87,20 +87,33 @@ class Box2BoxTransform(object): + deltas = deltas.float() # ensure fp32 for decoding precision + boxes = boxes.to(deltas.dtype) + +- widths = boxes[:, 2] - boxes[:, 0] +- heights = boxes[:, 3] - boxes[:, 1] +- ctr_x = boxes[:, 0] + 0.5 * widths +- ctr_y = boxes[:, 1] + 0.5 * heights ++ boxes_prof = boxes.permute(1, 0) ++ widths = boxes_prof[2, :] - boxes_prof[0, :] ++ heights = boxes_prof[3, :] - boxes_prof[1, :] ++ ctr_x = boxes_prof[0, :] + 0.5 * widths ++ ctr_y = boxes_prof[1, :] + 0.5 * heights + + wx, wy, ww, wh = self.weights +- dx = deltas[:, 0::4] / wx ++ '''dx = deltas[:, 0::4] / wx + dy = deltas[:, 1::4] / wy + dw = deltas[:, 2::4] / ww +- dh = deltas[:, 3::4] / wh ++ dh = deltas[:, 3::4] / wh''' ++ denorm_deltas = deltas ++ if denorm_deltas.shape[1] > 4: ++ denorm_deltas = denorm_deltas.view(-1, 80, 4) ++ dx = denorm_deltas[:, :, 0:1:].view(-1, 80) / wx ++ dy = denorm_deltas[:, :, 1:2:].view(-1, 80) / wy ++ dw = denorm_deltas[:, :, 2:3:].view(-1, 80) / ww ++ dh = denorm_deltas[:, :, 3:4:].view(-1, 80) / wh ++ else: ++ dx = denorm_deltas[:, 0:1:] / wx ++ dy = denorm_deltas[:, 1:2:] / wy ++ dw = denorm_deltas[:, 2:3:] / ww ++ dh = denorm_deltas[:, 3:4:] / wh + + # Prevent sending too large values into torch.exp() +- dw = torch.clamp(dw, max=self.scale_clamp) +- dh = torch.clamp(dh, max=self.scale_clamp) ++ 
dw = torch.clamp(dw, min=-float('inf'), max=self.scale_clamp) ++ dh = torch.clamp(dh, min=-float('inf'), max=self.scale_clamp) + + pred_ctr_x = dx * widths[:, None] + ctr_x[:, None] + pred_ctr_y = dy * heights[:, None] + ctr_y[:, None] +diff --git a/detectron2/modeling/meta_arch/rcnn.py b/detectron2/modeling/meta_arch/rcnn.py +index e5f66d1..1bbba71 100644 +--- a/detectron2/modeling/meta_arch/rcnn.py ++++ b/detectron2/modeling/meta_arch/rcnn.py +@@ -199,8 +199,9 @@ class GeneralizedRCNN(nn.Module): + """ + assert not self.training + +- images = self.preprocess_image(batched_inputs) +- features = self.backbone(images.tensor) ++ # images = self.preprocess_image(batched_inputs) ++ images = batched_inputs ++ features = self.backbone(images) + + if detected_instances is None: + if self.proposal_generator is not None: +diff --git a/detectron2/modeling/poolers.py b/detectron2/modeling/poolers.py +index e5d72ab..7c0dd2f 100644 +--- a/detectron2/modeling/poolers.py ++++ b/detectron2/modeling/poolers.py +@@ -94,6 +94,31 @@ def convert_boxes_to_pooler_format(box_lists: List[Boxes]): + + return pooler_fmt_boxes + ++import torch.onnx.symbolic_helper as sym_help ++ ++class RoiExtractor(torch.autograd.Function): ++ @staticmethod ++ def forward(self, f0, f1, f2, f3, rois, aligned=0, finest_scale=56, pooled_height=7, pooled_width=7, ++ pool_mode='avg', roi_scale_factor=0, sample_num=0, spatial_scale=[0.25, 0.125, 0.0625, 0.03125]): ++ """ ++ feats (torch.Tensor): feats in shape (batch, 256, H, W). ++ rois (torch.Tensor): rois in shape (k, 5). ++ return: ++ roi_feats (torch.Tensor): (k, 256, pooled_width, pooled_width) ++ """ ++ ++ # phony implementation for shape inference ++ k = rois.size()[0] ++ roi_feats = torch.ones(k, 256, pooled_height, pooled_width) ++ return roi_feats ++ ++ @staticmethod ++ def symbolic(g, f0, f1, f2, f3, rois, aligned=0, finest_scale=56, pooled_height=7, pooled_width=7): ++ # TODO: support tensor list type for feats ++ #f_tensors = sym_help._unpack_list(feats) ++ roi_feats = g.op('RoiExtractor', f0, f1, f2, f3, rois, aligned_i=0, finest_scale_i=56, pooled_height_i=pooled_height, pooled_width_i=pooled_width, ++ pool_mode_s='avg', roi_scale_factor_i=0, sample_num_i=0, spatial_scale_f=[0.25, 0.125, 0.0625, 0.03125], outputs=1) ++ return roi_feats + + class ROIPooler(nn.Module): + """ +@@ -202,6 +227,12 @@ class ROIPooler(nn.Module): + A tensor of shape (M, C, output_size, output_size) where M is the total number of + boxes aggregated over all N batch images and C is the number of channels in `x`. 
+ """ ++ if torch.onnx.is_in_onnx_export(): ++ output_size = self.output_size[0] ++ pooler_fmt_boxes = convert_boxes_to_pooler_format(box_lists) ++ roi_feats = RoiExtractor.apply(x[0], x[1], x[2], x[3], pooler_fmt_boxes, 0, 56, output_size, output_size) ++ return roi_feats ++ + num_level_assignments = len(self.level_poolers) + + assert isinstance(x, list) and isinstance( +diff --git a/detectron2/modeling/proposal_generator/proposal_utils.py b/detectron2/modeling/proposal_generator/proposal_utils.py +index 9c10436..b3437a7 100644 +--- a/detectron2/modeling/proposal_generator/proposal_utils.py ++++ b/detectron2/modeling/proposal_generator/proposal_utils.py +@@ -4,7 +4,7 @@ import math + from typing import List, Tuple + import torch + +-from detectron2.layers import batched_nms, cat ++from detectron2.layers import batch_nms_op, cat + from detectron2.structures import Boxes, Instances + from detectron2.utils.env import TORCH_VERSION + +@@ -68,15 +68,19 @@ def find_top_rpn_proposals( + for level_id, (proposals_i, logits_i) in enumerate(zip(proposals, pred_objectness_logits)): + Hi_Wi_A = logits_i.shape[1] + if isinstance(Hi_Wi_A, torch.Tensor): # it's a tensor in tracing +- num_proposals_i = torch.clamp(Hi_Wi_A, max=pre_nms_topk) ++ num_proposals_i = torch.clamp(Hi_Wi_A, min=0, max=pre_nms_topk) + else: + num_proposals_i = min(Hi_Wi_A, pre_nms_topk) + + # sort is faster than topk: https://github.com/pytorch/pytorch/issues/22812 +- # topk_scores_i, topk_idx = logits_i.topk(num_proposals_i, dim=1) +- logits_i, idx = logits_i.sort(descending=True, dim=1) ++ num_proposals_i = num_proposals_i.item() ++ logits_i = logits_i.reshape(logits_i.size(1)) ++ topk_scores_i, topk_idx = torch.topk(logits_i, num_proposals_i) ++ topk_scores_i = topk_scores_i.reshape(1, topk_scores_i.size(0)) ++ topk_idx = topk_idx.reshape(1, topk_idx.size(0)) ++ '''logits_i, idx = logits_i.sort(descending=True, dim=1) + topk_scores_i = logits_i.narrow(1, 0, num_proposals_i) +- topk_idx = idx.narrow(1, 0, num_proposals_i) ++ topk_idx = idx.narrow(1, 0, num_proposals_i)''' + + # each is N x topk + topk_proposals_i = proposals_i[batch_idx[:, None], topk_idx] # N x topk x 4 +@@ -108,7 +112,7 @@ def find_top_rpn_proposals( + lvl = lvl[valid_mask] + boxes.clip(image_size) + +- # filter empty boxes ++ '''# filter empty boxes + keep = boxes.nonempty(threshold=min_box_size) + if _is_tracing() or keep.sum().item() != len(boxes): + boxes, scores_per_img, lvl = boxes[keep], scores_per_img[keep], lvl[keep] +@@ -126,7 +130,14 @@ def find_top_rpn_proposals( + res = Instances(image_size) + res.proposal_boxes = boxes[keep] + res.objectness_logits = scores_per_img[keep] ++ results.append(res)''' ++ ++ dets, labels = batch_nms_op(boxes.tensor, scores_per_img, 0, nms_thresh, post_nms_topk, post_nms_topk) ++ res = Instances(image_size) ++ res.proposal_boxes = Boxes(dets[:, :4]) ++ res.objectness_logits = dets[:, 4] + results.append(res) ++ + return results + + +diff --git a/detectron2/modeling/proposal_generator/rpn.py b/detectron2/modeling/proposal_generator/rpn.py +index 1675377..77d9f26 100644 +--- a/detectron2/modeling/proposal_generator/rpn.py ++++ b/detectron2/modeling/proposal_generator/rpn.py +@@ -434,7 +434,7 @@ class RPN(nn.Module): + else: + losses = {} + proposals = self.predict_proposals( +- anchors, pred_objectness_logits, pred_anchor_deltas, images.image_sizes ++ anchors, pred_objectness_logits, pred_anchor_deltas, [(1344, 1344)] + ) + return proposals, losses + +@@ -485,7 +485,8 @@ class RPN(nn.Module): + B = anchors_i.tensor.size(1) 
+ pred_anchor_deltas_i = pred_anchor_deltas_i.reshape(-1, B) + # Expand anchors to shape (N*Hi*Wi*A, B) +- anchors_i = anchors_i.tensor.unsqueeze(0).expand(N, -1, -1).reshape(-1, B) ++ s = torch.zeros(N, anchors_i.tensor.unsqueeze(0).size(1), anchors_i.tensor.unsqueeze(0).size(2)) ++ anchors_i = anchors_i.tensor.unsqueeze(0).expand_as(s).reshape(-1, B) + proposals_i = self.box2box_transform.apply_deltas(pred_anchor_deltas_i, anchors_i) + # Append feature map proposals with shape (N, Hi*Wi*A, B) + proposals.append(proposals_i.view(N, -1, B)) +diff --git a/detectron2/modeling/roi_heads/fast_rcnn.py b/detectron2/modeling/roi_heads/fast_rcnn.py +index 348f6a0..87c7cd3 100644 +--- a/detectron2/modeling/roi_heads/fast_rcnn.py ++++ b/detectron2/modeling/roi_heads/fast_rcnn.py +@@ -7,7 +7,7 @@ from torch import nn + from torch.nn import functional as F + + from detectron2.config import configurable +-from detectron2.layers import ShapeSpec, batched_nms, cat, cross_entropy, nonzero_tuple ++from detectron2.layers import ShapeSpec, batch_nms_op, cat, cross_entropy, nonzero_tuple + from detectron2.modeling.box_regression import Box2BoxTransform + from detectron2.structures import Boxes, Instances + from detectron2.utils.events import get_event_storage +@@ -144,7 +144,7 @@ def fast_rcnn_inference_single_image( + # Convert to Boxes to use the `clip` function ... + boxes = Boxes(boxes.reshape(-1, 4)) + boxes.clip(image_shape) +- boxes = boxes.tensor.view(-1, num_bbox_reg_classes, 4) # R x C x 4 ++ boxes = boxes.tensor.view(-1, num_bbox_reg_classes.item(), 4) # R x C x 4 + + # 1. Filter results based on detection scores. It can make NMS more efficient + # by filtering out low-confidence detections. +@@ -152,7 +152,7 @@ def fast_rcnn_inference_single_image( + # R' x 2. First column contains indices of the R predictions; + # Second column contains indices of classes. 
+ filter_inds = filter_mask.nonzero() +- if num_bbox_reg_classes == 1: ++ '''if num_bbox_reg_classes == 1: + boxes = boxes[filter_inds[:, 0], 0] + else: + boxes = boxes[filter_mask] +@@ -167,7 +167,14 @@ def fast_rcnn_inference_single_image( + result = Instances(image_shape) + result.pred_boxes = Boxes(boxes) + result.scores = scores +- result.pred_classes = filter_inds[:, 1] ++ result.pred_classes = filter_inds[:, 1]''' ++ ++ dets, labels = batch_nms_op(boxes, scores, score_thresh, nms_thresh, topk_per_image, topk_per_image) ++ result = Instances(image_shape) ++ result.pred_boxes = Boxes(dets[:, :4]) ++ result.scores = dets.permute(1, 0)[4, :] ++ result.pred_classes = labels ++ + return result, filter_inds[:, 0] + + +diff --git a/detectron2/modeling/roi_heads/mask_head.py b/detectron2/modeling/roi_heads/mask_head.py +index 5ac5c4b..f81b96b 100644 +--- a/detectron2/modeling/roi_heads/mask_head.py ++++ b/detectron2/modeling/roi_heads/mask_head.py +@@ -142,7 +142,9 @@ def mask_rcnn_inference(pred_mask_logits: torch.Tensor, pred_instances: List[Ins + num_masks = pred_mask_logits.shape[0] + class_pred = cat([i.pred_classes for i in pred_instances]) + indices = torch.arange(num_masks, device=class_pred.device) +- mask_probs_pred = pred_mask_logits[indices, class_pred][:, None].sigmoid() ++ print(indices,class_pred) ++ # mask_probs_pred = pred_mask_logits[indices, class_pred][:, None].sigmoid() ++ mask_probs_pred = pred_mask_logits.sigmoid() + # mask_probs_pred.shape: (B, 1, Hmask, Wmask) + + num_boxes_per_image = [len(i) for i in pred_instances] +diff --git a/detectron2/structures/boxes.py b/detectron2/structures/boxes.py +index 57f862a..bad473b 100644 +--- a/detectron2/structures/boxes.py ++++ b/detectron2/structures/boxes.py +@@ -202,10 +202,11 @@ class Boxes: + """ + assert torch.isfinite(self.tensor).all(), "Box tensor contains infinite or NaN!" + h, w = box_size +- x1 = self.tensor[:, 0].clamp(min=0, max=w) +- y1 = self.tensor[:, 1].clamp(min=0, max=h) +- x2 = self.tensor[:, 2].clamp(min=0, max=w) +- y2 = self.tensor[:, 3].clamp(min=0, max=h) ++ boxes_prof = self.tensor.permute(1, 0) ++ x1 = boxes_prof[0, :].clamp(min=0, max=w) ++ y1 = boxes_prof[1, :].clamp(min=0, max=h) ++ x2 = boxes_prof[2, :].clamp(min=0, max=w) ++ y2 = boxes_prof[3, :].clamp(min=0, max=h) + self.tensor = torch.stack((x1, y1, x2, y2), dim=-1) + + def nonempty(self, threshold: float = 0.0) -> torch.Tensor: +diff --git a/tools/deploy/export_model.py b/tools/deploy/export_model.py +index fe2fe30..22145b7 100755 +--- a/tools/deploy/export_model.py ++++ b/tools/deploy/export_model.py +@@ -77,6 +77,28 @@ def export_scripting(torch_model): + # TODO inference in Python now missing postprocessing glue code + return None + ++from typing import Dict, Tuple ++import numpy ++from detectron2.structures import ImageList ++def preprocess_image(batched_inputs: Tuple[Dict[str, torch.Tensor]]): ++ """ ++ Normalize, pad and batch the input images. 
++ """ ++ images = [x["image"].to('cpu') for x in batched_inputs] ++ images = [(x - numpy.array([[[103.530]], [[116.280]], [[123.675]]])) / numpy.array([[[1.]], [[1.]], [[1.]]]) for x in images] ++ import torch.nn.functional as F ++ image = torch.zeros(0, 1344, 1344) ++ for i in range(images[0].size(0)): ++ img = images[0][i] ++ img = img.expand((1, 1, img.size(0), img.size(1))) ++ img = img.to(dtype=torch.float32) ++ img = F.interpolate(img, size=(int(1344), int(1344)), mode='bilinear', align_corners=False) ++ img = img[0][0] ++ img = img.unsqueeze(0) ++ image = torch.cat((image, img)) ++ images = [image] ++ images = ImageList.from_tensors(images, 32) ++ return images + + # experimental. API not yet final + def export_tracing(torch_model, inputs): +@@ -84,6 +106,8 @@ def export_tracing(torch_model, inputs): + image = inputs[0]["image"] + inputs = [{"image": image}] # remove other unused keys + ++ inputs = preprocess_image(inputs).tensor.to(torch.float32) ++ image = inputs + if isinstance(torch_model, GeneralizedRCNN): + + def inference(model, inputs): +@@ -104,7 +128,7 @@ def export_tracing(torch_model, inputs): + elif args.format == "onnx": + # NOTE onnx export currently failing in pytorch + with PathManager.open(os.path.join(args.output, "model.onnx"), "wb") as f: +- torch.onnx.export(traceable_model, (image,), f) ++ torch.onnx.export(traceable_model, (image,), f, opset_version=11, verbose=True) + logger.info("Inputs schema: " + str(traceable_model.inputs_schema)) + logger.info("Outputs schema: " + str(traceable_model.outputs_schema)) + + +``` + **修改依据:** +> 1.slice,topk算子问题导致pre_nms_topk未生效,atc转换报错,修改参见maskrcnn_detectron2.diff +> 2.expand会引入where动态算子因此用expand_as替换 +> 3.slice跑在aicpu有错误,所以改为dx = denorm_deltas[:, :, 0:1:].view(-1, 80) / wx,使其运行在aicore上 +> 4.atc转换时根据日志中报错的算子在转onnx时的verbose打印中找到其对应的python代码,然后找到规避方法解决,具体修改参见maskrcnn_detectron2.diff +> 5.其它地方的修改原因参见精度调试与性能优化 + + +通过打补丁的方式修改detectron2: +```shell +cd detectron2 +patch -p1 < ../maskrcnn_detectron2.diff +cd .. 
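+# 补充校验示例(非原始交付步骤,假设detectron2目录为前文git clone所得):
+# 打补丁后可在detectron2目录下执行 git diff --stat 查看被修改的文件,确认与maskrcnn_detectron2.diff的改动范围一致
+# 若因detectron2版本差异导致补丁部分应用失败,patch会生成对应的*.rej文件,可据此手动合入剩余修改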
+``` +4.修改pytorch代码去除导出onnx时进行检查 +将/usr/local/python3.7.5/lib/python3.7/site-packages/torch/onnx/utils.py文件的_check_onnx_proto(proto)改为pass + +5.准备coco2017验证集,数据集获取参见本文第四章第一节 +在当前目录按结构构造数据集:datasets/coco目录下有annotations与val2017,annotations目录存放coco数据集的instances_val2017.json,val2017目录存放coco数据集的5000张验证图片。 +或者修改detectron2/detectron2/data/datasets/builtin.py为_root = os.getenv("DETECTRON2_DATASETS", "/opt/npu/dataset/")指定coco数据集所在的目录/opt/npu/dataset/。 + +6.运行如下命令,在output目录生成model.onnx +```shell +python3.7 detectron2/tools/deploy/export_model.py --config-file detectron2/configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml --output ./output --export-method tracing --format onnx MODEL.WEIGHTS model_final.pth MODEL.DEVICE cpu + +mv output/model.onnx model_py1.8.onnx +``` + +### 3.2 onnx转om模型 + +1.设置环境变量 +```shell +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考[CANN V100R020C10 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164868?idPath=23710424%7C251366513%7C22892968%7C251168373),需要指定输出节点以去除无用输出,使用netron开源可视化工具查看具体的输出节点名: +```shell +atc --model=model_py1.8.onnx --framework=5 --output=maskrcnn_detectron2_npu --input_format=NCHW --input_shape="0:1,3,1344,1344" --out_nodes="Cast_1673:0;Gather_1676:0;Reshape_1667:0;Slice_1706:0" --log=debug --soc_version=Ascend310 +``` + +## 4 数据集预处理 + +- **[数据集获取](#41-数据集获取)** + +- **[数据集预处理](#42-数据集预处理)** + +- **[生成数据集信息文件](#43-生成数据集信息文件)** + +### 4.1 数据集获取 +该模型使用[COCO官网](https://cocodataset.org/#download)的coco2017的5千张验证集进行测试,图片与标签分别存放在/opt/npu/dataset/coco/val2017/与/opt/npu/dataset/coco/annotations/instances_val2017.json。 + +### 4.2 数据集预处理 +1.预处理脚本maskrcnn_pth_preprocess_detectron2.py +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import argparse +import numpy as np +import cv2 +import torch +import multiprocessing + +def resize(img, size): + old_h = img.shape[0] + old_w = img.shape[1] + scale_ratio = 800 / min(old_w, old_h) + new_w = int(np.floor(old_w * scale_ratio)) + new_h = int(np.floor(old_h * scale_ratio)) + if max(new_h, new_w) > 1333: + scale = 1333 / max(new_h, new_w) + new_h = new_h * scale + new_w = new_w * scale + new_w = int(new_w + 0.5) + new_h = int(new_h + 0.5) + resized_img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR) + return resized_img + +def gen_input_bin(file_batches, batch): + i = 0 + for file in file_batches[batch]: + i = i + 1 + print("batch", batch, file, "===", i) + + image = cv2.imread(os.path.join(flags.image_src_path, file), cv2.IMREAD_COLOR) + image = resize(image, (800, 1333)) + mean = np.array([103.53, 116.28, 123.675], dtype=np.float32) + std = np.array([1., 1., 1.], dtype=np.float32) + img = image.copy().astype(np.float32) + mean = np.float64(mean.reshape(1, -1)) + std = 1 / np.float64(std.reshape(1, -1)) + cv2.subtract(img, mean, img) + cv2.multiply(img, std, img) + img = cv2.copyMakeBorder(img, 0, flags.model_input_height - img.shape[0], 0, flags.model_input_width - img.shape[1], cv2.BORDER_CONSTANT, value=0) + #os.makedirs('./paded_jpg/', exist_ok=True) + #cv2.imwrite('./paded_jpg/' + file.split('.')[0] + '.jpg', img) + img = img.transpose(2, 0, 1) + img.tofile(os.path.join(flags.bin_file_path, file.split('.')[0] + ".bin")) + +def preprocess(src_path, save_path): + files = os.listdir(src_path) + file_batches = [files[i:i + 100] for i in range(0, 5000, 100) if files[i:i + 100] != []] + thread_pool = multiprocessing.Pool(len(file_batches)) + for batch in range(len(file_batches)): + thread_pool.apply_async(gen_input_bin, args=(file_batches, batch)) + thread_pool.close() + thread_pool.join() + print("in thread, except will not report! please ensure bin files generated.") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='preprocess of MaskRCNN PyTorch model') + parser.add_argument("--image_src_path", default="./coco2017/", help='image of dataset') + parser.add_argument("--bin_file_path", default="./coco2017_bin/", help='Preprocessed image buffer') + parser.add_argument("--model_input_height", default=1344, type=int, help='input tensor height') + parser.add_argument("--model_input_width", default=1344, type=int, help='input tensor width') + flags = parser.parse_args() + if not os.path.exists(flags.bin_file_path): + os.makedirs(flags.bin_file_path) + preprocess(flags.image_src_path, flags.bin_file_path) +``` +2.执行预处理脚本,生成数据集预处理后的bin文件 +```shell +python3.7 maskrcnn_pth_preprocess_detectron2.py --image_src_path=/opt/npu/dataset/coco/val2017 --bin_file_path=val2017_bin --model_input_height=1344 --model_input_width=1344 +``` +### 4.3 生成数据集信息文件 +1.生成数据集信息文件脚本get_info.py +```python +import os +import sys +import cv2 +from glob import glob + + +def get_bin_info(file_path, info_name, width, height): + bin_images = glob(os.path.join(file_path, '*.bin')) + with open(info_name, 'w') as file: + for index, img in enumerate(bin_images): + content = ' '.join([str(index), img, width, height]) + file.write(content) + file.write('\n') + + +def get_jpg_info(file_path, info_name): + extensions = ['jpg', 'jpeg', 'JPG', 'JPEG'] + image_names = [] + for extension in extensions: + image_names.append(glob(os.path.join(file_path, '*.' 
+ extension))) + with open(info_name, 'w') as file: + for image_name in image_names: + if len(image_name) == 0: + continue + else: + for index, img in enumerate(image_name): + img_cv = cv2.imread(img) + shape = img_cv.shape + width, height = shape[1], shape[0] + content = ' '.join([str(index), img, str(width), str(height)]) + file.write(content) + file.write('\n') + + +if __name__ == '__main__': + file_type = sys.argv[1] + file_path = sys.argv[2] + info_name = sys.argv[3] + if file_type == 'bin': + width = sys.argv[4] + height = sys.argv[5] + assert len(sys.argv) == 6, 'The number of input parameters must be equal to 5' + get_bin_info(file_path, info_name, width, height) + elif file_type == 'jpg': + assert len(sys.argv) == 4, 'The number of input parameters must be equal to 3' + get_jpg_info(file_path, info_name) +``` +2.执行生成数据集信息脚本,生成数据集信息文件 +```shell +python3.7 get_info.py bin val2017_bin maskrcnn.info 1344 1344 +``` +第一个参数为模型输入的类型,第二个参数为生成的bin文件路径,第三个为输出的info文件,后面为宽高信息 +## 5 离线推理 + +- **[benchmark工具概述](#51-benchmark工具概述)** + +- **[离线推理](#52-离线推理)** + +### 5.1 benchmark工具概述 + +benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考[CANN V100R020C10 推理benchmark工具用户指南 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164874?idPath=23710424%7C251366513%7C22892968%7C251168373) +### 5.2 离线推理 +1.设置环境变量 +```shell +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.执行离线推理 +```shell +./benchmark.x86_64 -model_type=vision -om_path=maskrcnn_detectron2_npu.om -device_id=0 -batch_size=1 -input_text_path=maskrcnn.info -input_width=1344 -input_height=1344 -useDvpp=false -output_binary=true +``` +输出结果默认保存在当前目录result/dumpOutput_device0,模型有四个输出,每个输入对应的输出对应四个_x.bin文件 +``` +输出 shape 数据类型 数据含义 +output1 100 * 4 FP32 boxes +output2 100 * 1 FP32 scores +output3 100 * 1 INT64 labels +output4 100 * 80 * 28 * 28 FP32 masks +``` + +## 6 精度对比 + +- **[离线推理精度](#61-离线推理精度)** +- **[开源精度](#62-开源精度)** +- **[精度对比](#63-精度对比)** + +### 6.1 离线推理精度统计 + +后处理统计map精度 +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
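+
+# 补充注释(非原脚本内容): 后处理读取benchmark输出的四个bin文件(boxes 100x4、scores 100x1、
+# labels 100x1、masks 100x80x28x28),用paste_masks_in_image还原mask,再按预处理的缩放比例把box与mask
+# 映射回原图尺寸,写出每张图的检测结果txt,最后调用detectron2的COCOEvaluator统计map精度。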
+ +import os +import argparse +import cv2 +import numpy as np + +def postprocess_bboxes(bboxes, image_size, net_input_width, net_input_height): + org_w = image_size[0] + org_h = image_size[1] + + scale = 800 / min(org_w, org_h) + new_w = int(np.floor(org_w * scale)) + new_h = int(np.floor(org_h * scale)) + if max(new_h, new_w) > 1333: + scale = 1333 / max(new_h, new_w) * scale + + bboxes[:, 0] = (bboxes[:, 0]) / scale + bboxes[:, 1] = (bboxes[:, 1]) / scale + bboxes[:, 2] = (bboxes[:, 2]) / scale + bboxes[:, 3] = (bboxes[:, 3]) / scale + + return bboxes + +def postprocess_masks(masks, image_size, net_input_width, net_input_height): + org_w = image_size[0] + org_h = image_size[1] + + scale = 800 / min(org_w, org_h) + new_w = int(np.floor(org_w * scale)) + new_h = int(np.floor(org_h * scale)) + if max(new_h, new_w) > 1333: + scale = 1333 / max(new_h, new_w) * scale + + pad_w = net_input_width - org_w * scale + pad_h = net_input_height - org_h * scale + top = 0 + left = 0 + hs = int(net_input_height - pad_h) + ws = int(net_input_width - pad_w) + + masks = masks.to(dtype=torch.float32) + res_append = torch.zeros(0, org_h, org_w) + if torch.cuda.is_available(): + res_append = res_append.to(device='cuda') + for i in range(masks.size(0)): + mask = masks[i][0][top:hs, left:ws] + mask = mask.expand((1, 1, mask.size(0), mask.size(1))) + mask = F.interpolate(mask, size=(int(org_h), int(org_w)), mode='bilinear', align_corners=False) + mask = mask[0][0] + mask = mask.unsqueeze(0) + res_append = torch.cat((res_append, mask)) + + return res_append[:, None] + +import pickle +def save_variable(v, filename): + f = open(filename, 'wb') + pickle.dump(v, f) + f.close() +def load_variavle(filename): + f = open(filename, 'rb') + r = pickle.load(f) + f.close() + return r + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument("--test_annotation", default="./origin_pictures.info") + parser.add_argument("--bin_data_path", default="./result/dumpOutput_device0/") + parser.add_argument("--det_results_path", default="./detection-results/") + parser.add_argument("--net_out_num", type=int, default=4) + parser.add_argument("--net_input_width", type=int, default=1344) + parser.add_argument("--net_input_height", type=int, default=1344) + parser.add_argument("--ifShowDetObj", action="store_true", help="if input the para means True, neither False.") + flags = parser.parse_args() + + img_size_dict = dict() + with open(flags.test_annotation)as f: + for line in f.readlines(): + temp = line.split(" ") + img_file_path = temp[1] + img_name = temp[1].split("/")[-1].split(".")[0] + img_width = int(temp[2]) + img_height = int(temp[3]) + img_size_dict[img_name] = (img_width, img_height, img_file_path) + + bin_path = flags.bin_data_path + det_results_path = flags.det_results_path + os.makedirs(det_results_path, exist_ok=True) + total_img = set([name[:name.rfind('_')] for name in os.listdir(bin_path) if "bin" in name]) + + import torch + from torchvision.models.detection.roi_heads import paste_masks_in_image + import torch.nn.functional as F + from detectron2.evaluation import COCOEvaluator + from detectron2.structures import Boxes, Instances + from detectron2.data import DatasetCatalog, MetadataCatalog + import logging + logging.basicConfig(level=logging.INFO) + evaluator = COCOEvaluator('coco_2017_val') + evaluator.reset() + coco_class_map = {id:name for id, name in enumerate(MetadataCatalog.get('coco_2017_val').thing_classes)} + results = [] + + cnt = 0 + for bin_file in sorted(total_img): + cnt = cnt 
+ 1 + print(cnt - 1, bin_file) + path_base = os.path.join(bin_path, bin_file) + res_buff = [] + for num in range(1, flags.net_out_num + 1): + if os.path.exists(path_base + "_" + str(num) + ".bin"): + if num == 1: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + buf = np.reshape(buf, [100, 4]) + elif num == 2: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + buf = np.reshape(buf, [100, 1]) + elif num == 3: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="int64") + buf = np.reshape(buf, [100, 1]) + elif num == 4: + bboxes = np.fromfile(path_base + "_" + str(num - 3) + ".bin", dtype="float32") + bboxes = np.reshape(bboxes, [100, 4]) + bboxes = torch.from_numpy(bboxes) + scores = np.fromfile(path_base + "_" + str(num - 2) + ".bin", dtype="float32") + scores = np.reshape(scores, [100, 1]) + scores = torch.from_numpy(scores) + labels = np.fromfile(path_base + "_" + str(num - 1) + ".bin", dtype="int64") + labels = np.reshape(labels, [100, 1]) + labels = torch.from_numpy(labels) + mask_pred = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + mask_pred = np.reshape(mask_pred, [100, 80, 28, 28]) + mask_pred = torch.from_numpy(mask_pred) + + org_img_size = img_size_dict[bin_file][:2] + result = Instances((org_img_size[1], org_img_size[0])) + + if torch.cuda.is_available(): + mask_pred = mask_pred.to(device='cuda') + img_shape = (flags.net_input_height, flags.net_input_width) + mask_pred = mask_pred[range(len(mask_pred)), labels[:, 0]][:, None] + masks = paste_masks_in_image(mask_pred, bboxes[:, :4], img_shape) + masks = masks >= 0.5 + masks = postprocess_masks(masks, img_size_dict[bin_file], flags.net_input_width, flags.net_input_height) + if torch.cuda.is_available(): + masks = masks.cpu() + masks = masks.squeeze(1) + result.pred_masks = masks + + '''masks = masks.numpy() + img = masks[0] + from PIL import Image + for j in range(len(masks)): + mask = masks[j] + mask = mask.astype(bool) + img[mask] = img[mask] + 1 + imag = Image.fromarray((img * 255).astype(np.uint8)) + imag.save(os.path.join('.', bin_file + '.png'))''' + + predbox = postprocess_bboxes(bboxes, org_img_size, flags.net_input_height, flags.net_input_width) + result.pred_boxes = Boxes(predbox) + result.scores = scores.reshape([100]) + result.pred_classes = labels.reshape([100]) + + results.append({"instances": result}) + + res_buff.append(buf) + else: + print("[ERROR] file not exist", path_base + "_" + str(num) + ".bin") + + current_img_size = img_size_dict[bin_file] + res_bboxes = np.concatenate(res_buff, axis=1) + predbox = postprocess_bboxes(res_bboxes, current_img_size, flags.net_input_width, flags.net_input_height) + + if flags.ifShowDetObj == True: + imgCur = cv2.imread(current_img_size[2]) + + det_results_str = '' + for idx, class_idx in enumerate(predbox[:, 5]): + if float(predbox[idx][4]) < float(0.05): + #if float(predbox[idx][4]) < float(0): + continue + if class_idx < 0 or class_idx > 80: + continue + + class_name = coco_class_map[int(class_idx)] + det_results_str += "{} {} {} {} {} {}\n".format(class_name, str(predbox[idx][4]), predbox[idx][0], + predbox[idx][1], predbox[idx][2], predbox[idx][3]) + + if flags.ifShowDetObj == True: + imgCur = cv2.rectangle(imgCur, (int(predbox[idx][0]), int(predbox[idx][1])), (int(predbox[idx][2]), int(predbox[idx][3])), (0,255,0), 2) + imgCur = cv2.putText(imgCur, class_name, (int(predbox[idx][0]), int(predbox[idx][1])), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1) + #imgCur = cv2.putText(imgCur, 
str(predbox[idx][4]), (int(predbox[idx][0]), int(predbox[idx][1])),cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1) + + if flags.ifShowDetObj == True: + cv2.imwrite(os.path.join(det_results_path, bin_file +'.jpg'), imgCur, [int(cv2.IMWRITE_JPEG_QUALITY), 70]) + + det_results_file = os.path.join(det_results_path, bin_file + ".txt") + with open(det_results_file, "w") as detf: + detf.write(det_results_str) + + #save_variable(results, './results.txt') + #results = load_variavle('./results.txt') + inputs = DatasetCatalog.get('coco_2017_val')[:5000] + evaluator.process(inputs, results) + evaluator.evaluate() +``` +调用maskrcnn_pth_postprocess_detectron2.py评测map精度: +```shell +python3.7 get_info.py jpg /opt/npu/dataset/coco/val2017 maskrcnn_jpeg.info + +python3.7 maskrcnn_pth_postprocess_detectron2.py --bin_data_path=./result/dumpOutput_device0/ --test_annotation=maskrcnn_jpeg.info --det_results_path=./ret_npuinfer/ --net_out_num=4 --net_input_height=1344 --net_input_width=1344 --ifShowDetObj +``` +第一个参数为benchmark推理结果,第二个为原始图片信息文件,第三个为后处理输出结果,第四个为网络输出个数,第五六个为网络高宽,第七个为是否将box画在图上显示 +执行完后会打印出精度: +``` +INFO:detectron2.data.datasets.coco:Loaded 5000 images in COCO format from /opt/npu/dataset/coco/annotations/instances_val2017.json +INFO:detectron2.evaluation.coco_evaluation:Preparing results for COCO format ... +INFO:detectron2.evaluation.coco_evaluation:Evaluating predictions with unofficial COCO API... +Loading and preparing results... +DONE (t=2.16s) +creating index... +index created! +INFO:detectron2.evaluation.fast_eval_api:Evaluate annotation type *bbox* +INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.evaluate() finished in 21.80 seconds. +INFO:detectron2.evaluation.fast_eval_api:Accumulating evaluation results... +INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.accumulate() finished in 2.61 seconds. 
+Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.326 +Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.536 +Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.349 +Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.179 +Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.366 +Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.432 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.282 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.444 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.465 +Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.269 +Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.508 +Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.609 +INFO:detectron2.evaluation.coco_evaluation:Evaluation results for bbox: +| AP | AP50 | AP75 | APs | APm | APl | +|:------:|:------:|:------:|:------:|:------:|:------:| +| 32.586 | 53.634 | 34.852 | 17.862 | 36.613 | 43.174 | +INFO:detectron2.evaluation.coco_evaluation:Per-category bbox AP: +| category | AP | category | AP | category | AP | +|:--------------|:-------|:-------------|:-------|:---------------|:-------| +| person | 48.933 | bicycle | 24.620 | car | 37.483 | +| motorcycle | 33.410 | airplane | 50.975 | bus | 54.898 | +| train | 51.864 | truck | 26.716 | boat | 20.755 | +| traffic light | 20.305 | fire hydrant | 58.144 | stop sign | 58.833 | +| parking meter | 41.813 | bench | 17.210 | bird | 29.444 | +| cat | 57.738 | dog | 52.853 | horse | 51.333 | +| sheep | 40.341 | cow | 41.568 | elephant | 56.160 | +| bear | 63.240 | zebra | 59.121 | giraffe | 57.166 | +| backpack | 11.226 | umbrella | 29.385 | handbag | 8.685 | +| tie | 24.923 | suitcase | 27.242 | frisbee | 53.933 | +| skis | 16.987 | snowboard | 24.268 | sports ball | 40.009 | +| kite | 34.285 | baseball bat | 17.073 | baseball glove | 25.865 | +| skateboard | 39.694 | surfboard | 28.035 | tennis racket | 37.552 | +| bottle | 30.593 | wine glass | 26.470 | cup | 33.779 | +| fork | 19.335 | knife | 11.024 | spoon | 8.761 | +| bowl | 33.928 | banana | 18.034 | apple | 15.394 | +| sandwich | 27.732 | orange | 26.546 | broccoli | 19.022 | +| carrot | 15.449 | hot dog | 25.118 | pizza | 44.402 | +| donut | 35.096 | cake | 23.876 | chair | 18.866 | +| couch | 32.443 | potted plant | 18.701 | bed | 33.585 | +| dining table | 20.164 | toilet | 46.354 | tv | 48.705 | +| laptop | 50.107 | mouse | 47.597 | remote | 20.899 | +| keyboard | 40.454 | cell phone | 28.115 | microwave | 43.190 | +| oven | 25.974 | toaster | 13.432 | sink | 27.114 | +| refrigerator | 42.467 | book | 10.420 | clock | 44.894 | +| vase | 30.559 | scissors | 25.719 | teddy bear | 36.704 | +| hair drier | 0.000 | toothbrush | 11.796 | | | +``` + + **精度调试:** +> 1.根据代码语义RoiExtractor参数finest_scale不是224而是56 +> 2.因gather算子处理-1会导致每张图的第一个score为0,故maskrcnn_detectron2.diff中已将dets[:, -1]改为dets[:, 4] +> 3.单张图调试 +> ``` +> demo.py分数改为0.05,defaults.py MIN_SIZE_TEST与MAX_SIZE_TEST改为1344: +> python3.7 demo.py --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml --input 000000252219_1344x1344.jpg --opts MODEL.WEIGHTS ../../model_final.pth MODEL.DEVICE cpu +> 说明: +> 比较pth的rpn与om的rpn输出前提是detectron2/config/defaults.py的_C.INPUT.MIN_SIZE_TEST与_C.INPUT.MAX_SIZE_TEST要改为1344,并且注意因为000000252219_1344x1344.jpg 
是等比例缩放四边加pad的处理结果,因此pth推理时等价于先进行了pad然后再进行标准化的,因此图片tensor边缘是负均值。开始误认为预处理与mmdetection相同因此SIZE_TEST的值与000000252219_1344x1344.jpg缩放是按上述方式处理的,经此与后面的调试步骤发现预处理与mmdetection不同。om算子输出与开源pth推理时变量的打印值对比,找到输出不对的算子,发现前处理均值方差不同于mmdetection框架,且是BGR序。 +> ``` +> 4.精度调试 +> ``` +> 对开源代码预处理与参数修改,使得cpu,gpu版的pth推理达到npu版代码的pth推理精度,参见本文第七章第二节T4精度数据的diff文件与执行精度测评的命令。 +> 说明: +> 1.查看npu固定1344,1344的前处理方式(缩放加pad) +> from torchvision import utils as vutils +> vutils.save_image(images.tensor, 'test.jpg') +> FIX_SHAPE->./detectron2/data/dataset_mapper.py->ResizeShortestEdge,最短边800最大1333。 +> 2.cpu与gpu开源代码推理pth精度与npu代码推理pth差2到3个点,npu代码(基于detectron2 v0.2.1)更改roi_align.py为开源的代码后推理发现pth精度下降2到3个点,最终发现是aligned参数问题,注意插件缺陷导致om中设置该参数未能生效。 +> ``` + + +### 6.2 开源精度 +[官网精度](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch) + +参考[npu版detectron2框架的maskrcnn](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch),安装依赖PyTorch(NPU版本)与设置环境变量,在npu上执行推理,测得npu精度如下: +```shell +python3.7 -m pip install -e Faster_Mask_RCNN_for_PyTorch +cd Faster_Mask_RCNN_for_PyTorch +修改eval.sh的配置文件与权重文件分别为mask_rcnn_R_101_FPN_3x.yaml与model_final.pth,删除mask_rcnn_R_101_FPN_3x.yaml的SOLVER和DATALOADER配置,datasets/coco下面放置coco2017验证集图片与标签(参考本文第三章第一节步骤五) +./eval.sh +``` +``` +Task: bbox +AP,AP50,AP75,APs,APm,APl +33.0103,53.5686,35.5192,17.8069,36.9325,44.0201 +Task: segm +AP,AP50,AP75,APs,APm,APl +30.3271,50.4665,31.8223,12.9573,33.0375,44.8537 +``` +### 6.3 精度对比 +om推理box map精度为0.326,npu推理box map精度为0.330,npu输出400个框精度更高点但性能较低,精度下降在1个点之内,因此可视为精度达标 + +## 7 性能对比 + +- **[npu性能数据](#71-npu性能数据)** +- **[T4性能数据](#72-T4性能数据)** +- **[性能对比](#73-性能对比)** + +### 7.1 npu性能数据 +batch1的性能: + 测试npu性能要确保device空闲,使用npu-smi info命令可查看device是否在运行其它推理任务 +``` +./benchmark.x86_64 -round=20 -om_path=maskrcnn_detectron2_npu.om -device_id=0 -batch_size=1 +``` +执行20次纯推理取均值,统计吞吐率与其倒数时延(benchmark的时延是单个数据的推理时间),npu性能是一个device执行的结果 +``` +[INFO] Dataset number: 19 finished cost 439.142ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_maskrcnn_detectron2_npu_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 2.27773samples/s, ave_latency: 440.813ms +---------------------------------------------------------------- +``` +maskrcnn detectron2不支持多batch + + **性能优化:** +> 查看profiling导出的op_statistic_0_1.csv算子总体耗时统计发现gather算子耗时最多,然后查看profiling导出的task_time_0_1.csv找到具体哪些gather算子耗时最多,通过导出onnx的verbose打印找到具体算子对应的代码,因gather算子计算最后一个轴会很耗时,因此通过转置后计算0轴规避,比如maskrcnn_detectron2.diff文件中的如下修改: +> ``` +> boxes_prof = boxes.permute(1, 0) +> widths = boxes_prof[2, :] - boxes_prof[0, :] +> ``` +> + + +### 7.2 T4性能数据 +batch1性能: +onnx包含自定义算子,因此不能使用开源TensorRT测试性能数据,故在T4机器上使用pth在线推理测试性能数据 + +依据npu版代码修改cpu,gpu版detectron2,参见maskrcnn_pth_npu.diff: +```diff +diff --git a/detectron2/data/dataset_mapper.py b/detectron2/data/dataset_mapper.py +index 0e77851..0d03c08 100644 +--- a/detectron2/data/dataset_mapper.py ++++ b/detectron2/data/dataset_mapper.py +@@ -4,6 +4,7 @@ import logging + import numpy as np + from typing import List, Optional, Union + import torch ++from torch.nn import functional as F + + from detectron2.config import configurable + +@@ -133,6 +134,7 @@ class DatasetMapper: + + aug_input = T.AugInput(image, sem_seg=sem_seg_gt) + transforms = self.augmentations(aug_input) ++ print(self.augmentations,transforms) + image, sem_seg_gt = aug_input.image, aug_input.sem_seg + + image_shape = image.shape[:2] # h, w +@@ 
-140,6 +142,20 @@ class DatasetMapper: + # but not efficient on large generic data structures due to the use of pickle & mp.Queue. + # Therefore it's important to use torch.Tensor. + dataset_dict["image"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1))) ++ ++ size_divisibility = 32 ++ pad_value = 0 ++ pixel_mean = torch.Tensor([103.53, 116.28, 123.675]).view(-1, 1, 1) ++ pixel_std = torch.Tensor([1.0, 1.0, 1.0]).view(-1, 1, 1) ++ images = (dataset_dict["image"] - pixel_mean) / pixel_std ++ dataset_dict["image_size"] = tuple(images.shape[-2:]) ++ batch_shape = (3, 1344, 1344) ++ padding_size = [0, batch_shape[-1] - images.shape[-1], ++ 0, batch_shape[-2] - images.shape[-2]] ++ padded = F.pad(images, padding_size, value=pad_value) ++ batched_imgs = padded.unsqueeze_(0) ++ dataset_dict["image_preprocess"] = batched_imgs.contiguous() ++ + if sem_seg_gt is not None: + dataset_dict["sem_seg"] = torch.as_tensor(sem_seg_gt.astype("long")) + +diff --git a/detectron2/layers/roi_align.py b/detectron2/layers/roi_align.py +index bcbf5f4..23b138d 100644 +--- a/detectron2/layers/roi_align.py ++++ b/detectron2/layers/roi_align.py +@@ -38,7 +38,7 @@ class ROIAlign(nn.Module): + self.output_size = output_size + self.spatial_scale = spatial_scale + self.sampling_ratio = sampling_ratio +- self.aligned = aligned ++ self.aligned = False + + from torchvision import __version__ + +diff --git a/detectron2/modeling/meta_arch/rcnn.py b/detectron2/modeling/meta_arch/rcnn.py +index e5f66d1..b9ffa66 100644 +--- a/detectron2/modeling/meta_arch/rcnn.py ++++ b/detectron2/modeling/meta_arch/rcnn.py +@@ -202,6 +202,9 @@ class GeneralizedRCNN(nn.Module): + images = self.preprocess_image(batched_inputs) + features = self.backbone(images.tensor) + ++ #from torchvision import utils as vutils ++ #vutils.save_image(images.tensor, 'test.jpg') ++ print(features['p2'].shape) + if detected_instances is None: + if self.proposal_generator is not None: + proposals, _ = self.proposal_generator(images, features, None) +@@ -224,10 +227,14 @@ class GeneralizedRCNN(nn.Module): + """ + Normalize, pad and batch the input images. + """ +- images = [x["image"].to(self.device) for x in batched_inputs] ++ '''images = [x["image"].to(self.device) for x in batched_inputs] + images = [(x - self.pixel_mean) / self.pixel_std for x in images] + images = ImageList.from_tensors(images, self.backbone.size_divisibility) +- return images ++ return images''' ++ images = [x["image_preprocess"].to(device=self.device) for x in batched_inputs] ++ images = torch.cat(images, dim=0) ++ image_sizes = [x["image_size"] for x in batched_inputs] ++ return ImageList(images, image_sizes) + + @staticmethod + def _postprocess(instances, batched_inputs: Tuple[Dict[str, torch.Tensor]], image_sizes): +diff --git a/detectron2/modeling/postprocessing.py b/detectron2/modeling/postprocessing.py +index f42e77c..909923a 100644 +--- a/detectron2/modeling/postprocessing.py ++++ b/detectron2/modeling/postprocessing.py +@@ -55,6 +55,7 @@ def detector_postprocess( + output_boxes = None + assert output_boxes is not None, "Predictions must contain boxes!" 
+ ++ print(scale_x, scale_y) + output_boxes.scale(scale_x, scale_y) + output_boxes.clip(results.image_size) + + +``` +测评T4精度与性能: +```shell +git clone https://github.com/facebookresearch/detectron2 +python3.7 -m pip install -e detectron2 +cd detectron2 +patch -p1 < ../maskrcnn_pth_npu.diff +cd tools +mkdir datasets +cp -rf ../../datasets/coco datasets/(数据集构造参考本文第三章第一节步骤五) +python3.7 train_net.py --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml --eval-only MODEL.WEIGHTS ../../model_final.pth MODEL.DEVICE cuda:0 +``` +``` +Inference done 4993/5000. 0.2937 s / img. +``` + +### 7.3 性能对比 +310单卡4个device,benchmark测试的是一个device。T4一个设备相当于4个device,测试的是整个设备。benchmark时延是吞吐率的倒数,T4时延是吞吐率的倒数乘以batch。对于batch1,440.73ms / 4 * 1 < 0.2937s,即npu性能超过T4 +对于batch1,npu性能均高于T4性能1.2倍,该模型放在benchmark/cv/segmentation目录下 + + diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216\345\274\200\346\272\220mmdetection\351\242\204\350\256\255\347\273\203\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216\345\274\200\346\272\220mmdetection\351\242\204\350\256\255\347\273\203\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" new file mode 100644 index 0000000000000000000000000000000000000000..dfb1a8f4e361c421ba6e74ce55cfe816e0e313b6 --- /dev/null +++ "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216\345\274\200\346\272\220mmdetection\351\242\204\350\256\255\347\273\203\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" @@ -0,0 +1,1041 @@ +# 基于开源mmdetection预训练的maskrcnn Onnx模型端到端推理指导 +- [1 模型概述](#1-模型概述) + - [1.1 论文地址](#11-论文地址) + - [1.2 代码地址](#12-代码地址) +- [2 环境说明](#2-环境说明) + - [2.1 深度学习框架](#21-深度学习框架) + - [2.2 python第三方库](#22-python第三方库) +- [3 模型转换](#3-模型转换) + - [3.1 pth转onnx模型](#31-pth转onnx模型) + - [3.2 onnx转om模型](#32-onnx转om模型) +- [4 数据集预处理](#4-数据集预处理) + - [4.1 数据集获取](#41-数据集获取) + - [4.2 数据集预处理](#42-数据集预处理) + - [4.3 生成数据集信息文件](#43-生成数据集信息文件) +- [5 离线推理](#5-离线推理) + - [5.1 benchmark工具概述](#51-benchmark工具概述) + - [5.2 离线推理](#52-离线推理) +- [6 精度对比](#6-精度对比) + - [6.1 离线推理精度统计](#61-离线推理精度统计) + - [6.2 开源精度](#62-开源精度) + - [6.3 精度对比](#63-精度对比) +- [7 性能对比](#7-性能对比) + - [7.1 npu性能数据](#71-npu性能数据) + - [7.2 T4性能数据](#72-T4性能数据) + - [7.3 性能对比](#73-性能对比) + + + +## 1 模型概述 + +- **[论文地址](#11-论文地址)** + +- **[代码地址](#12-代码地址)** + +### 1.1 论文地址 +[maskrcnn论文](https://arxiv.org/abs/1703.06870) +论文提出了一个简单、灵活、通用的目标实例分割框架Mask R-CNN。这个框架可同时做目标检测、实例分割。实例分割的实现就是在faster r-cnn的基础上加了一个可以预测目标掩膜(mask)的分支。只比Faster r-cnn慢一点,5fps。很容易拓展到其他任务如:关键点检测。18年在coco的目标检测、实例分割、人体关键点检测都取得了最优成绩。 + +### 1.2 代码地址 +[mmdetection框架maskrcnn代码](https://github.com/open-mmlab/mmdetection/tree/master/configs/mask_rcnn) + +## 2 环境说明 + +- **[深度学习框架](#21-深度学习框架)** + +- **[python第三方库](#22-python第三方库)** + +### 2.1 深度学习框架 +``` +pytorch == 1.8.0 +torchvision == 0.9.0 +onnx == 1.8.0 +``` + +### 2.2 python第三方库 + +``` +numpy == 1.18.5 +opencv-python == 4.2.0.34 +``` + +**说明:** +> X86架构:opencv,pytorch,torchvision和onnx可以通过官方下载whl包安装,其它可以通过pip3.7 install 包名 安装 +> +> 
Arm架构:opencv,pytorch,torchvision和onnx可以通过源码编译安装,其它可以通过pip3.7 install 包名 安装 + +## 3 模型转换 + +- **[pth转onnx模型](#31-pth转onnx模型)** + +- **[onnx转om模型](#32-onnx转om模型)** + +atc暂不支持动态shape小算子,可以使用大颗粒算子替换这些小算子规避,这些小算子可以在转onnx时的verbose打印中找到其对应的python代码,从而根据功能用大颗粒算子替换,onnx能推导出变量正确的shape与算子属性正确即可,变量实际的数值无关紧要,因此这些大算子函数的功能实现无关紧要,因包含自定义算子需要去掉对onnx模型的校验。 + +### 3.1 pth转onnx模型 + +1.获取pth权重文件 +[maskrcnn基于detectron2预训练的npu权重文件](http://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_fpn_1x_coco/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth) +文件md5sum: f4ee3c5911537f454045395d2f708954 +2.mmdetection源码安装 +```shell +git clone https://github.com/open-mmlab/mmcv +cd mmcv +MMCV_WITH_OPS=1 pip3.7 install -e . +cd .. +git clone https://github.com/open-mmlab/mmdetection +cd mmdetection +pip3.7 install -r requirements/build.txt +python3.7 setup.py develop +``` + + **说明:** +> 安装所需的依赖说明请参考mmdetection/docs/get_started.md +> + +3.转原始onnx +```shell +python3.7 tools/deployment/pytorch2onnx.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --output-file mask_rcnn_r50_fpn_1x_coco.onnx --input-img demo/demo.jpg --test-img tests/data/color.jpg --shape 800 1216 --show --verify --simplify +若报错参考:https://github.com/open-mmlab/mmdetection/issues/4548 +``` +4.修改mmdetection代码,参见maskrcnn_mmdetection.diff: +```diff +diff --git a/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py b/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py +index e9eb357..f72cef7 100644 +--- a/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py ++++ b/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py +@@ -168,13 +168,31 @@ def delta2bbox(rois, + [0.0000, 0.3161, 4.1945, 0.6839], + [5.0000, 5.0000, 5.0000, 5.0000]]) + """ +- means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(1) // 4) +- stds = deltas.new_tensor(stds).view(1, -1).repeat(1, deltas.size(1) // 4) ++ ++ # fix shape for means and stds when exporting onnx ++ if torch.onnx.is_in_onnx_export(): ++ means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(1).numpy() // 4) ++ stds = deltas.new_tensor(stds).view(1, -1).repeat(1, deltas.size(1).numpy() // 4) ++ else: ++ means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(1) // 4) ++ stds = deltas.new_tensor(stds).view(1, -1).repeat(1, deltas.size(1) // 4) + denorm_deltas = deltas * stds + means +- dx = denorm_deltas[:, 0::4] +- dy = denorm_deltas[:, 1::4] +- dw = denorm_deltas[:, 2::4] +- dh = denorm_deltas[:, 3::4] ++ # dx = denorm_deltas[:, 0::4] ++ # dy = denorm_deltas[:, 1::4] ++ # dw = denorm_deltas[:, 2::4] ++ # dh = denorm_deltas[:, 3::4] ++ if denorm_deltas.shape[1] > 4: ++ denorm_deltas = denorm_deltas.view(-1, 80, 4) ++ dx = denorm_deltas[:, :, 0:1:].view(-1, 80) ++ dy = denorm_deltas[:, :, 1:2:].view(-1, 80) ++ dw = denorm_deltas[:, :, 2:3:].view(-1, 80) ++ dh = denorm_deltas[:, :, 3:4:].view(-1, 80) ++ else: ++ dx = denorm_deltas[:, 0:1:] ++ dy = denorm_deltas[:, 1:2:] ++ dw = denorm_deltas[:, 2:3:] ++ dh = denorm_deltas[:, 3:4:] ++ + max_ratio = np.abs(np.log(wh_ratio_clip)) + dw = dw.clamp(min=-max_ratio, max=max_ratio) + dh = dh.clamp(min=-max_ratio, max=max_ratio) +diff --git a/mmdet/core/post_processing/bbox_nms.py b/mmdet/core/post_processing/bbox_nms.py +index c43aea9..e99f5d8 100644 +--- a/mmdet/core/post_processing/bbox_nms.py ++++ b/mmdet/core/post_processing/bbox_nms.py +@@ -4,6 +4,59 @@ from mmcv.ops.nms import batched_nms + from mmdet.core.bbox.iou_calculators import bbox_overlaps + + ++class BatchNMSOp(torch.autograd.Function): ++ @staticmethod 
++ def forward(ctx, bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (batch, N, C, 4). ++ scores (torch.Tensor): scores in shape (batch, N, C). ++ return: ++ nmsed_boxes: (1, N, 4) ++ nmsed_scores: (1, N) ++ nmsed_classes: (1, N) ++ nmsed_num: (1,) ++ """ ++ ++ # Phony implementation for onnx export ++ nmsed_boxes = bboxes[:, :max_total_size, 0, :] ++ nmsed_scores = scores[:, :max_total_size, 0] ++ nmsed_classes = torch.arange(max_total_size, dtype=torch.long) ++ nmsed_num = torch.Tensor([max_total_size]) ++ ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++ @staticmethod ++ def symbolic(g, bboxes, scores, score_thr, iou_thr, max_size_p_class, max_t_size): ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = g.op('BatchMultiClassNMS', ++ bboxes, scores, score_threshold_f=score_thr, iou_threshold_f=iou_thr, ++ max_size_per_class_i=max_size_p_class, max_total_size_i=max_t_size, outputs=4) ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++def batch_nms_op(bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (N, 4). ++ scores (torch.Tensor): scores in shape (N, ). ++ """ ++ ++ num_classes = bboxes.shape[1].numpy() // 4 ++ if bboxes.dtype == torch.float32: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4).half() ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1).half() ++ else: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4) ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1) ++ ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = BatchNMSOp.apply(bboxes, scores, ++ score_threshold, iou_threshold, max_size_per_class, max_total_size) ++ nmsed_boxes = nmsed_boxes.float() ++ nmsed_scores = nmsed_scores.float() ++ nmsed_classes = nmsed_classes.long() ++ dets = torch.cat((nmsed_boxes.reshape((max_total_size, 4)), nmsed_scores.reshape((max_total_size, 1))), -1) ++ dets = dets.reshape((max_total_size, 5)) ++ labels = nmsed_classes.reshape((max_total_size, )) ++ return dets, labels ++ ++ + def multiclass_nms(multi_bboxes, + multi_scores, + score_thr, +@@ -40,7 +93,17 @@ def multiclass_nms(multi_bboxes, + multi_scores.size(0), num_classes, 4) + + scores = multi_scores[:, :-1] ++ # multiply score_factor after threshold to preserve more bboxes, improve ++ # mAP by 1% for YOLOv3 ++ if score_factors is not None: ++ # expand the shape to match original shape of score ++ score_factors = score_factors.view(-1, 1).expand( ++ multi_scores.size(0), num_classes) ++ score_factors = score_factors.reshape(-1) ++ scores = scores * score_factors + ++ # cpu and gpu ++ ''' + labels = torch.arange(num_classes, dtype=torch.long) + labels = labels.view(1, -1).expand_as(scores) + +@@ -80,7 +143,11 @@ def multiclass_nms(multi_bboxes, + return dets, labels[keep], keep + else: + return dets, labels[keep] ++ ''' + ++ # npu ++ dets, labels = batch_nms_op(bboxes, scores, score_thr, nms_cfg.get("iou_threshold"), max_num, max_num) ++ return dets, labels + + def fast_nms(multi_bboxes, + multi_scores, +diff --git a/mmdet/models/dense_heads/rpn_head.py b/mmdet/models/dense_heads/rpn_head.py +index f565d1a..3c29386 100644 +--- a/mmdet/models/dense_heads/rpn_head.py ++++ b/mmdet/models/dense_heads/rpn_head.py +@@ -9,6 +9,57 @@ from .anchor_head import AnchorHead + from .rpn_test_mixin import RPNTestMixin + + ++class BatchNMSOp(torch.autograd.Function): ++ @staticmethod ++ def forward(ctx, bboxes, 
scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (batch, N, C, 4). ++ scores (torch.Tensor): scores in shape (batch, N, C). ++ return: ++ nmsed_boxes: (1, N, 4) ++ nmsed_scores: (1, N) ++ nmsed_classes: (1, N) ++ nmsed_num: (1,) ++ """ ++ ++ # Phony implementation for onnx export ++ nmsed_boxes = bboxes[:, :max_total_size, 0, :] ++ nmsed_scores = scores[:, :max_total_size, 0] ++ nmsed_classes = torch.arange(max_total_size, dtype=torch.long) ++ nmsed_num = torch.Tensor([max_total_size]) ++ ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++ @staticmethod ++ def symbolic(g, bboxes, scores, score_thr, iou_thr, max_size_p_class, max_t_size): ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = g.op('BatchMultiClassNMS', ++ bboxes, scores, score_threshold_f=score_thr, iou_threshold_f=iou_thr, ++ max_size_per_class_i=max_size_p_class, max_total_size_i=max_t_size, outputs=4) ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++def batch_nms_op(bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (N, 4). ++ scores (torch.Tensor): scores in shape (N, ). ++ """ ++ ++ num_classes = bboxes.shape[1].numpy() // 4 ++ if bboxes.dtype == torch.float32: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4).half() ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1).half() ++ else: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4) ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1) ++ ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = BatchNMSOp.apply(bboxes, scores, ++ score_threshold, iou_threshold, max_size_per_class, max_total_size) ++ nmsed_boxes = nmsed_boxes.float() ++ nmsed_scores = nmsed_scores.float() ++ nmsed_classes = nmsed_classes.long() ++ dets = torch.cat((nmsed_boxes.reshape((max_total_size, 4)), nmsed_scores.reshape((max_total_size, 1))), -1) ++ labels = nmsed_classes.reshape((max_total_size, )) ++ return dets, labels ++ + @HEADS.register_module() + class RPNHead(RPNTestMixin, AnchorHead): + """RPN head. 
+@@ -132,9 +183,14 @@ class RPNHead(RPNTestMixin, AnchorHead): + if cfg.nms_pre > 0 and scores.shape[0] > cfg.nms_pre: + # sort is faster than topk + # _, topk_inds = scores.topk(cfg.nms_pre) +- ranked_scores, rank_inds = scores.sort(descending=True) +- topk_inds = rank_inds[:cfg.nms_pre] +- scores = ranked_scores[:cfg.nms_pre] ++ # onnx uses topk to sort, this is simpler for onnx export ++ if torch.onnx.is_in_onnx_export(): ++ scores, topk_inds = torch.topk(scores, cfg.nms_pre) ++ else: ++ ranked_scores, rank_inds = scores.sort(descending=True) ++ topk_inds = rank_inds[:cfg.nms_pre] ++ scores = ranked_scores[:cfg.nms_pre] ++ + rpn_bbox_pred = rpn_bbox_pred[topk_inds, :] + anchors = anchors[topk_inds, :] + mlvl_scores.append(scores) +@@ -164,5 +220,12 @@ class RPNHead(RPNTestMixin, AnchorHead): + + # TODO: remove the hard coded nms type + nms_cfg = dict(type='nms', iou_threshold=cfg.nms_thr) ++ # cpu and gpu return ++ ''' + dets, keep = batched_nms(proposals, scores, ids, nms_cfg) + return dets[:cfg.nms_post] ++ ''' ++ ++ # npu return ++ dets, labels = batch_nms_op(proposals, scores, 0.0, nms_cfg.get("iou_threshold"), cfg.nms_post, cfg.nms_post) ++ return dets +diff --git a/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py b/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py +index 0cba3cd..a965e53 100644 +--- a/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py ++++ b/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py +@@ -199,11 +199,11 @@ class FCNMaskHead(nn.Module): + # TODO: Remove after F.grid_sample is supported. + from torchvision.models.detection.roi_heads \ + import paste_masks_in_image +- masks = paste_masks_in_image(mask_pred, bboxes, ori_shape[:2]) ++ '''masks = paste_masks_in_image(mask_pred, bboxes, ori_shape[:2]) + thr = rcnn_test_cfg.get('mask_thr_binary', 0) + if thr > 0: +- masks = masks >= thr +- return masks ++ masks = masks >= thr''' ++ return mask_pred + + N = len(mask_pred) + # The actual implementation split the input into chunks, +diff --git a/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py b/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py +index c0eebc4..63605c5 100644 +--- a/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py ++++ b/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py +@@ -4,6 +4,31 @@ from mmcv.runner import force_fp32 + from mmdet.models.builder import ROI_EXTRACTORS + from .base_roi_extractor import BaseRoIExtractor + ++import torch.onnx.symbolic_helper as sym_help ++ ++class RoiExtractor(torch.autograd.Function): ++ @staticmethod ++ def forward(self, f0, f1, f2, f3, rois, aligned=1, finest_scale=56, pooled_height=7, pooled_width=7, ++ pool_mode='avg', roi_scale_factor=0, sample_num=0, spatial_scale=[0.25, 0.125, 0.0625, 0.03125]): ++ """ ++ feats (torch.Tensor): feats in shape (batch, 256, H, W). ++ rois (torch.Tensor): rois in shape (k, 5). 
++ return: ++ roi_feats (torch.Tensor): (k, 256, pooled_width, pooled_width) ++ """ ++ ++ # phony implementation for shape inference ++ k = rois.size()[0] ++ roi_feats = torch.ones(k, 256, pooled_height, pooled_width) ++ return roi_feats ++ ++ @staticmethod ++ def symbolic(g, f0, f1, f2, f3, rois, aligned=1, finest_scale=56, pooled_height=7, pooled_width=7): ++ # TODO: support tensor list type for feats ++ #f_tensors = sym_help._unpack_list(feats) ++ roi_feats = g.op('RoiExtractor', f0, f1, f2, f3, rois, aligned_i=1, finest_scale_i=56, pooled_height_i=pooled_height, pooled_width_i=pooled_width, ++ pool_mode_s='avg', roi_scale_factor_i=0, sample_num_i=0, spatial_scale_f=[0.25, 0.125, 0.0625, 0.03125], outputs=1) ++ return roi_feats + + @ROI_EXTRACTORS.register_module() + class SingleRoIExtractor(BaseRoIExtractor): +@@ -52,6 +77,14 @@ class SingleRoIExtractor(BaseRoIExtractor): + + @force_fp32(apply_to=('feats', ), out_fp16=True) + def forward(self, feats, rois, roi_scale_factor=None): ++ # Work around to export onnx for npu ++ if torch.onnx.is_in_onnx_export(): ++ out_size = self.roi_layers[0].output_size ++ roi_feats = RoiExtractor.apply(feats[0], feats[1], feats[2], feats[3], rois, 1, 56, out_size[0], out_size[1]) ++ # roi_feats = RoiExtractor.apply(list(feats), rois) ++ return roi_feats ++ ++ + """Forward function.""" + out_size = self.roi_layers[0].output_size + num_levels = len(feats) +diff --git a/tools/deployment/pytorch2onnx.py b/tools/deployment/pytorch2onnx.py +index 1305a79..c79e9fb 100644 +--- a/tools/deployment/pytorch2onnx.py ++++ b/tools/deployment/pytorch2onnx.py +@@ -48,7 +48,7 @@ def pytorch2onnx(config_path, + input_names=['input'], + output_names=output_names, + export_params=True, +- keep_initializers_as_inputs=True, ++ #keep_initializers_as_inputs=True, + do_constant_folding=True, + verbose=show, + opset_version=opset_version) + +``` + **修改依据:** +> 1.atc暂不支持if与nonzero动态小算子,这两小算子是bbox_nms.py与single_level_roi_extractor.py的大功能nms与roi引入的(rpn_head.py中的nms虽然没有引入不支持算子但也需要替换,否则后面会出现E19014: Op[ReduceMax_505]'s attribute axes is invalid which is empty),因此使用npu的nms与roi大算子代替这部分大功能。loop算子暂无合适替换方法,由于它在网络最后一部分,因此可将其与后面的部分放到后处理 +> 2. 
atc转换报错E11019: Op[Conv_0]'s input[1] is not linked,因此注释掉tools/deployment/pytorch2onnx.py中export函数的keep_initializers_as_inputs=True,
+> 3.动态shape算子导致atc转换出现未知错误,atc日志debug显示Unknown shape op Tile output shape range is unknown, set its size -1,在转onnx时的verbose打印中找到该算子对应的python代码行,利用numpy()将means和std的shape固定下来,参见maskrcnn_mmdetection.diff
+> 4.slice跑在aicpu有错误,所以改为dx = denorm_deltas[:, :, 0:1:].view(-1, 80),使其运行在aicore上
+> 5.atc转换Concat一对多算子会改变其名字,故添加dets = dets.reshape((max_total_size, 5)),使得Concat后添加了一个冗余的Reshape算子作为输出节点
+> 6.atc转换时计算mask的RoiExtractor算子报错,打开--log=debug输出日志,查看strace -f cmd的打印/root/ascend/log/plog/…找到日志存放路径,发现(14,14)导致cube内存不够用
+> 7.atc转换时根据日志中报错的算子在转onnx时的verbose打印中找到其对应的python代码,然后找到规避方法解决,具体修改参见maskrcnn_mmdetection.diff
+> 8.其它地方的修改原因参见精度调试
+
+
+通过打补丁的方式修改mmdetection:
+```shell
+patch -p1 < ./maskrcnn_mmdetection.diff
+```
+5.修改pytorch代码,去除导出onnx时的模型检查
+将/usr/local/python3.7.5/lib/python3.7/site-packages/torch/onnx/utils.py文件的_check_onnx_proto(proto)改为pass
+
+6.运行如下命令,生成含有npu自定义算子的onnx:
+```shell
+python3.7 tools/deployment/pytorch2onnx.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --output-file mask_rcnn_r50_fpn_1x_coco.onnx --input-img demo/demo.jpg --shape 800 1216
+```
+7.经过修改后导出的onnx因添加了自定义算子而无法使用onnx的infer shape,需要手动固定resize算子的shape:可先用未经修改的源代码导出原始onnx并做simplify(即添加--simplify参数,参见转原始onnx命令),再用netron可视化工具查看该onnx中各resize节点的具体大小
+```python
+import sys
+import onnx
+from onnx import helper
+
+input_model=sys.argv[1]
+output_model=sys.argv[2]
+model = onnx.load(input_model)
+# onnx.checker.check_model(model)
+
+model_nodes = model.graph.node
+def getNodeByName(nodes, name: str):
+    for n in nodes:
+        if n.name == name:
+            return n
+    return -1
+
+# fix shape for resize, 对原始onnx使用simplifier后,使用netron可视化工具可以查看该onnx中resize的大小
+sizes1 = onnx.helper.make_tensor('size1', onnx.TensorProto.INT32, [4], [1, 256, 50, 76])
+sizes2 = onnx.helper.make_tensor('size2', onnx.TensorProto.INT32, [4], [1, 256, 100, 152])
+sizes3 = onnx.helper.make_tensor('size3', onnx.TensorProto.INT32, [4], [1, 256, 200, 304])
+model.graph.initializer.append(sizes1)
+model.graph.initializer.append(sizes2)
+model.graph.initializer.append(sizes3)
+getNodeByName(model_nodes, 'Resize_141').input[3] = "size1"
+getNodeByName(model_nodes, 'Resize_161').input[3] = "size2"
+getNodeByName(model_nodes, 'Resize_181').input[3] = "size3"
+
+print("Mask R-CNN onnx adapted to ATC")
+onnx.save(model, output_model)
+```
+```shell
+python3.7 fix_onnx_shape.py mask_rcnn_r50_fpn_1x_coco.onnx mask_rcnn_r50_fpn_1x_coco_fix.onnx
+```
+
+### 3.2 onnx转om模型
+
+1.设置环境变量
+```shell
+export install_path=/usr/local/Ascend/ascend-toolkit/latest
+export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
+export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH
+export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH
+export ASCEND_OPP_PATH=${install_path}/opp
+export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/
+```
+2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考[CANN V100R020C10 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164868?idPath=23710424%7C251366513%7C22892968%7C251168373),需要指定输出节点以去除无用输出,节点序号可能会因网络结构不同而不同,使用netron开源可视化工具查看具体的输出节点名:
+```shell
+atc --framework=5 --model=./mask_rcnn_r50_fpn_1x_coco_fix.onnx --output=mask_rcnn_r50_fpn_1x_coco_bs1 --out_nodes="Reshape_574:0;Reshape_576:0;Sigmoid_604:0" --input_format=NCHW 
--input_shape="input:1,3,800,1216" --log=debug --soc_version=Ascend310 +``` + +## 4 数据集预处理 + +- **[数据集获取](#41-数据集获取)** + +- **[数据集预处理](#42-数据集预处理)** + +- **[生成数据集信息文件](#43-生成数据集信息文件)** + +### 4.1 数据集获取 +该模型使用[COCO官网](https://cocodataset.org/#download)的coco2017的5千张验证集进行测试,图片与标签分别存放在/opt/npu/dataset/coco/val2017/与/opt/npu/dataset/coco/annotations/instances_val2017.json。 + +### 4.2 数据集预处理 +1.预处理脚本maskrcnn_pth_preprocess.py +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse +import numpy as np +import cv2 +import mmcv +import torch +import multiprocessing + +def resize(img, size): + old_h = img.shape[0] + old_w = img.shape[1] + scale_ratio = min(size[0] / old_w, size[1] / old_h) + new_w = int(np.floor(old_w * scale_ratio)) + new_h = int(np.floor(old_h * scale_ratio)) + resized_img = mmcv.imresize(img, (new_w, new_h), backend='cv2') + return resized_img + +def gen_input_bin(file_batches, batch): + i = 0 + for file in file_batches[batch]: + i = i + 1 + print("batch", batch, file, "===", i) + + image = mmcv.imread(os.path.join(flags.image_src_path, file), backend='cv2') + #image = mmcv.imrescale(image, (flags.model_input_width, flags.model_input_height)) + image = resize(image, (flags.model_input_width, flags.model_input_height)) + mean = np.array([123.675, 116.28, 103.53], dtype=np.float32) + std = np.array([58.395, 57.12, 57.375], dtype=np.float32) + image = mmcv.imnormalize(image, mean, std) + h = image.shape[0] + w = image.shape[1] + pad_left = (flags.model_input_width - w) // 2 + pad_top = (flags.model_input_height - h) // 2 + pad_right = flags.model_input_width - pad_left - w + pad_bottom = flags.model_input_height - pad_top - h + image = mmcv.impad(image, padding=(pad_left, pad_top, pad_right, pad_bottom), pad_val=0) + #mmcv.imwrite(image, './paded_jpg/' + file.split('.')[0] + '.jpg') + image = image.transpose(2, 0, 1) + image.tofile(os.path.join(flags.bin_file_path, file.split('.')[0] + ".bin")) + +def preprocess(src_path, save_path): + files = os.listdir(src_path) + file_batches = [files[i:i + 100] for i in range(0, 5000, 100) if files[i:i + 100] != []] + thread_pool = multiprocessing.Pool(len(file_batches)) + for batch in range(len(file_batches)): + thread_pool.apply_async(gen_input_bin, args=(file_batches, batch)) + thread_pool.close() + thread_pool.join() + print("in thread, except will not report! 
please ensure bin files generated.") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='preprocess of MaskRCNN PyTorch model') + parser.add_argument("--image_src_path", default="./coco2017/", help='image of dataset') + parser.add_argument("--bin_file_path", default="./coco2017_bin/", help='Preprocessed image buffer') + parser.add_argument("--model_input_height", default=800, type=int, help='input tensor height') + parser.add_argument("--model_input_width", default=1216, type=int, help='input tensor width') + flags = parser.parse_args() + if not os.path.exists(flags.bin_file_path): + os.makedirs(flags.bin_file_path) + preprocess(flags.image_src_path, flags.bin_file_path) +``` +2.执行预处理脚本,生成数据集预处理后的bin文件 +```shell +python3.7 maskrcnn_pth_preprocess.py --image_src_path=/opt/npu/dataset/coco/val2017 --bin_file_path=val2017_bin --model_input_height=800 --model_input_width=1216 +``` +### 4.3 生成数据集信息文件 +1.生成数据集信息文件脚本get_info.py +```python +import os +import sys +import cv2 +from glob import glob + + +def get_bin_info(file_path, info_name, width, height): + bin_images = glob(os.path.join(file_path, '*.bin')) + with open(info_name, 'w') as file: + for index, img in enumerate(bin_images): + content = ' '.join([str(index), img, width, height]) + file.write(content) + file.write('\n') + + +def get_jpg_info(file_path, info_name): + extensions = ['jpg', 'jpeg', 'JPG', 'JPEG'] + image_names = [] + for extension in extensions: + image_names.append(glob(os.path.join(file_path, '*.' + extension))) + with open(info_name, 'w') as file: + for image_name in image_names: + if len(image_name) == 0: + continue + else: + for index, img in enumerate(image_name): + img_cv = cv2.imread(img) + shape = img_cv.shape + width, height = shape[1], shape[0] + content = ' '.join([str(index), img, str(width), str(height)]) + file.write(content) + file.write('\n') + + +if __name__ == '__main__': + file_type = sys.argv[1] + file_path = sys.argv[2] + info_name = sys.argv[3] + if file_type == 'bin': + width = sys.argv[4] + height = sys.argv[5] + assert len(sys.argv) == 6, 'The number of input parameters must be equal to 5' + get_bin_info(file_path, info_name, width, height) + elif file_type == 'jpg': + assert len(sys.argv) == 4, 'The number of input parameters must be equal to 3' + get_jpg_info(file_path, info_name) +``` +2.执行生成数据集信息脚本,生成数据集信息文件 +```shell +python3.7 get_info.py bin val2017_bin maskrcnn.info 1216 800 +``` +第一个参数为模型输入的类型,第二个参数为生成的bin文件路径,第三个为输出的info文件,后面为宽高信息 +## 5 离线推理 + +- **[benchmark工具概述](#51-benchmark工具概述)** + +- **[离线推理](#52-离线推理)** + +### 5.1 benchmark工具概述 + +benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考[CANN V100R020C10 推理benchmark工具用户指南 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164874?idPath=23710424%7C251366513%7C22892968%7C251168373) +### 5.2 离线推理 +1.设置环境变量 +```shell +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.执行离线推理 +```shell +./benchmark.x86_64 -model_type=vision -om_path=mask_rcnn_r50_fpn_1x_coco_bs1.om -device_id=0 -batch_size=1 -input_text_path=maskrcnn.info -input_width=1216 
-input_height=800 -useDvpp=false -output_binary=true +``` + **注意:** +> label是int64,benchmark输出非二进制时会将float转为0 +> + +输出结果默认保存在当前目录result/dumpOutput_device0,模型有三个输出,每个输入对应的输出对应三个_x.bin文件 +``` +输出 shape 数据类型 数据含义 +output1 100 * 5 FP32 boxes and scores +output3 100 * 1 INT64 labels +output4 100 * 80 * 28 * 28 FP32 masks +``` + +## 6 精度对比 + +- **[离线推理精度](#61-离线推理精度)** +- **[开源精度](#62-开源精度)** +- **[精度对比](#63-精度对比)** + +### 6.1 离线推理精度统计 + +后处理统计map精度 +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse +import cv2 +import numpy as np + +def postprocess_bboxes(bboxes, image_size, net_input_width, net_input_height): + w = image_size[0] + h = image_size[1] + scale = min(net_input_width / w, net_input_height / h) + + pad_w = net_input_width - w * scale + pad_h = net_input_height - h * scale + pad_left = pad_w // 2 + pad_top = pad_h // 2 + + bboxes[:, 0] = (bboxes[:, 0] - pad_left) / scale + bboxes[:, 1] = (bboxes[:, 1] - pad_top) / scale + bboxes[:, 2] = (bboxes[:, 2] - pad_left) / scale + bboxes[:, 3] = (bboxes[:, 3] - pad_top) / scale + + return bboxes + +def postprocess_masks(masks, image_size, net_input_width, net_input_height): + w = image_size[0] + h = image_size[1] + scale = min(net_input_width / w, net_input_height / h) + + pad_w = net_input_width - w * scale + pad_h = net_input_height - h * scale + pad_left = pad_w // 2 + pad_top = pad_h // 2 + + if pad_top < 0: + pad_top = 0 + if pad_left < 0: + pad_left = 0 + top = int(pad_top) + left = int(pad_left) + hs = int(pad_top + net_input_height - pad_h) + ws = int(pad_left + net_input_width - pad_w) + masks = masks.to(dtype=torch.float32) + res_append = torch.zeros(0, h, w) + if torch.cuda.is_available(): + res_append = res_append.to(device='cuda') + for i in range(masks.size(0)): + mask = masks[i][0][top:hs, left:ws] + mask = mask.expand((1, 1, mask.size(0), mask.size(1))) + mask = F.interpolate(mask, size=(int(h), int(w)), mode='bilinear', align_corners=False) + mask = mask[0][0] + mask = mask.unsqueeze(0) + res_append = torch.cat((res_append, mask)) + + return res_append[:, None] + +import pickle +def save_variable(v, filename): + f = open(filename, 'wb') + pickle.dump(v, f) + f.close() +def load_variavle(filename): + f = open(filename, 'rb') + r = pickle.load(f) + f.close() + return r + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument("--test_annotation", default="./origin_pictures.info") + parser.add_argument("--bin_data_path", default="./result/dumpOutput_device0/") + parser.add_argument("--det_results_path", default="./detection-results/") + parser.add_argument("--net_out_num", type=int, default=3) + parser.add_argument("--net_input_width", type=int, default=1216) + parser.add_argument("--net_input_height", type=int, default=800) + parser.add_argument("--ifShowDetObj", action="store_true", help="if input the para means True, neither False.") + flags = parser.parse_args() + + img_size_dict = dict() + with 
open(flags.test_annotation)as f: + for line in f.readlines(): + temp = line.split(" ") + img_file_path = temp[1] + img_name = temp[1].split("/")[-1].split(".")[0] + img_width = int(temp[2]) + img_height = int(temp[3]) + img_size_dict[img_name] = (img_width, img_height, img_file_path) + + bin_path = flags.bin_data_path + det_results_path = flags.det_results_path + os.makedirs(det_results_path, exist_ok=True) + #total_img = set([name[:name.rfind('_')] for name in os.listdir(bin_path) if "bin" in name]) + + import glob + import torch + from torchvision.models.detection.roi_heads import paste_masks_in_image + import torch.nn.functional as F + from mmdet.core import bbox2result + from mmdet.core import encode_mask_results + from mmdet.datasets import CocoDataset + coco_dataset = CocoDataset(ann_file='/opt/npu/dataset/coco/annotations/instances_val2017.json', pipeline=[]) + coco_class_map = {id:name for id, name in enumerate(coco_dataset.CLASSES)} + #print(dir(coco_dataset)) + results = [] + + cnt = 0 + #for bin_file in sorted(total_img): + for ids in coco_dataset.img_ids: + cnt = cnt + 1 + bin_file = glob.glob(bin_path + '/*0' + str(ids) + '_1.bin')[0] + bin_file = bin_file[bin_file.rfind('/') + 1:] + bin_file = bin_file[:bin_file.rfind('_')] + print(cnt - 1, bin_file) + path_base = os.path.join(bin_path, bin_file) + res_buff = [] + bbox_results = [] + cls_segms = [] + for num in range(1, flags.net_out_num + 1): + if os.path.exists(path_base + "_" + str(num) + ".bin"): + if num == 1: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + buf = np.reshape(buf, [100, 5]) + elif num == 2: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="int64") + buf = np.reshape(buf, [100, 1]) + elif num == 3: + bboxes = np.fromfile(path_base + "_" + str(num - 2) + ".bin", dtype="float32") + bboxes = np.reshape(bboxes, [100, 5]) + bboxes = torch.from_numpy(bboxes) + labels = np.fromfile(path_base + "_" + str(num - 1) + ".bin", dtype="int64") + labels = np.reshape(labels, [100, 1]) + labels = torch.from_numpy(labels) + mask_pred = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + mask_pred = np.reshape(mask_pred, [100, 80, 28, 28]) + mask_pred = torch.from_numpy(mask_pred) + + if torch.cuda.is_available(): + mask_pred = mask_pred.to(device='cuda') + + img_shape = (flags.net_input_height, flags.net_input_width) + mask_pred = mask_pred[range(len(mask_pred)), labels[:, 0]][:, None] + masks = paste_masks_in_image(mask_pred, bboxes[:, :4], img_shape) + masks = masks >= 0.5 + + masks = postprocess_masks(masks, img_size_dict[bin_file], flags.net_input_width, flags.net_input_height) + if torch.cuda.is_available(): + masks = masks.cpu() + '''masks = masks.numpy() + img = masks[0].squeeze() + from PIL import Image + for j in range(len(masks)): + mask = masks[j].squeeze() + mask = mask.astype(bool) + img[mask] = img[mask] + 1 + imag = Image.fromarray((img * 255).astype(np.uint8)) + imag.save(os.path.join('.', bin_file + '.png'))''' + + cls_segms = [[] for _ in range(80)] + for i in range(len(masks)): + cls_segms[labels[i][0]].append(masks[i][0].numpy()) + + bboxes = postprocess_bboxes(bboxes, img_size_dict[bin_file], flags.net_input_width, flags.net_input_height) + bbox_results = [bbox2result(bboxes, labels[:, 0], 80)] + res_buff.append(buf) + else: + print("[ERROR] file not exist", path_base + "_" + str(num) + ".bin") + + result = list(zip(bbox_results, [cls_segms])) + result = [(bbox_results, encode_mask_results(mask_results)) for bbox_results, mask_results in result] + 
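+        # 补充注释:上一步已把单张图片的box结果与RLE编码后的mask结果组织成
+        # mmdetection评测接口所需的(bbox_results, encoded_masks)格式,下面将其累积到
+        # results列表,循环结束后统一交给coco_dataset.evaluate统计bbox与segm的map精度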
results.extend(result) + + current_img_size = img_size_dict[bin_file] + res_bboxes = np.concatenate(res_buff, axis=1) + predbox = postprocess_bboxes(res_bboxes, current_img_size, flags.net_input_width, flags.net_input_height) + + if flags.ifShowDetObj == True: + imgCur = cv2.imread(current_img_size[2]) + + det_results_str = '' + for idx, class_idx in enumerate(predbox[:, 5]): + if float(predbox[idx][4]) < float(0.05): + continue + if class_idx < 0 or class_idx > 80: + continue + + class_name = coco_class_map[int(class_idx)] + det_results_str += "{} {} {} {} {} {}\n".format(class_name, str(predbox[idx][4]), predbox[idx][0], + predbox[idx][1], predbox[idx][2], predbox[idx][3]) + if flags.ifShowDetObj == True: + imgCur = cv2.rectangle(imgCur, (int(predbox[idx][0]), int(predbox[idx][1])), (int(predbox[idx][2]), int(predbox[idx][3])), (0,255,0), 2) + imgCur = cv2.putText(imgCur, class_name, (int(predbox[idx][0]), int(predbox[idx][1])), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1) + + if flags.ifShowDetObj == True: + cv2.imwrite(os.path.join(det_results_path, bin_file +'.jpg'), imgCur, [int(cv2.IMWRITE_JPEG_QUALITY), 70]) + + det_results_file = os.path.join(det_results_path, bin_file + ".txt") + with open(det_results_file, "w") as detf: + detf.write(det_results_str) + + save_variable(results, './results.txt') + #results = load_variavle('./results.txt') + eval_results = coco_dataset.evaluate(results, metric=['bbox', 'segm'], classwise=True) +``` +调用maskrcnn_pth_postprocess.py评测map精度: +```shell +python3.7 get_info.py jpg /opt/npu/dataset/coco/val2017 maskrcnn_jpeg.info + +python3.7 maskrcnn_pth_postprocess.py --bin_data_path=./result/dumpOutput_device0/ --test_annotation=maskrcnn_jpeg.info --det_results_path=./ret_npuinfer/ --net_out_num=3 --net_input_height=800 --net_input_width=1216 --ifShowDetObj +``` +第一个参数为benchmark推理结果,第二个为原始图片信息文件,第三个为后处理输出结果,第四个为网络输出个数,第五六个为网络高宽,第七个为是否将box画在图上显示 +执行完后会打印出精度: +``` +Evaluating bbox... +Loading and preparing results... +DONE (t=8.57s) +creating index... +index created! +Running per image evaluation... +Evaluate annotation type *bbox* +DONE (t=103.05s). +Accumulating evaluation results... +DONE (t=26.62s). 
+Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.377 +Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.584 +Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.411 +Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.211 +Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.411 +Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.500 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.515 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.515 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.515 +Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.319 +Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.556 +Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.656 + ++---------------+-------+--------------+-------+----------------+-------+ +| category | AP | category | AP | category | AP | ++---------------+-------+--------------+-------+----------------+-------+ +| person | 0.517 | bicycle | 0.296 | car | 0.411 | +| motorcycle | 0.392 | airplane | 0.588 | bus | 0.603 | +| train | 0.576 | truck | 0.332 | boat | 0.254 | +| traffic light | 0.253 | fire hydrant | 0.627 | stop sign | 0.624 | +| parking meter | 0.431 | bench | 0.224 | bird | 0.335 | +| cat | 0.588 | dog | 0.544 | horse | 0.527 | +| sheep | 0.473 | cow | 0.515 | elephant | 0.597 | +| bear | 0.616 | zebra | 0.627 | giraffe | 0.623 | +| backpack | 0.132 | umbrella | 0.347 | handbag | 0.119 | +| tie | 0.306 | suitcase | 0.368 | frisbee | 0.634 | +| skis | 0.214 | snowboard | 0.286 | sports ball | 0.398 | +| kite | 0.375 | baseball bat | 0.215 | baseball glove | 0.333 | +| skateboard | 0.455 | surfboard | 0.340 | tennis racket | 0.417 | +| bottle | 0.365 | wine glass | 0.325 | cup | 0.400 | +| fork | 0.259 | knife | 0.139 | spoon | 0.108 | +| bowl | 0.395 | banana | 0.217 | apple | 0.200 | +| sandwich | 0.322 | orange | 0.289 | broccoli | 0.214 | +| carrot | 0.199 | hot dog | 0.277 | pizza | 0.478 | +| donut | 0.397 | cake | 0.353 | chair | 0.245 | +| couch | 0.371 | potted plant | 0.243 | bed | 0.398 | +| dining table | 0.228 | toilet | 0.557 | tv | 0.542 | +| laptop | 0.547 | mouse | 0.572 | remote | 0.260 | +| keyboard | 0.491 | cell phone | 0.325 | microwave | 0.531 | +| oven | 0.300 | toaster | 0.467 | sink | 0.330 | +| refrigerator | 0.511 | book | 0.146 | clock | 0.481 | +| vase | 0.336 | scissors | 0.249 | teddy bear | 0.431 | +| hair drier | 0.013 | toothbrush | 0.145 | None | None | ++---------------+-------+--------------+-------+----------------+-------+ +``` + + **精度调试:** +> 1.因为在线推理前处理图片是一定格式的动态分辨率,所以onnx将分辨率固定为1216x800会导致精度下降些,改为1216x1216可以提升精度,使得mask的精度与开源相比下降在1%之内 +> 2.单图调试 +> ``` +> python3.7 tools/test.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --eval bbox segm --show +>python3.7 tools/deployment/pytorch2onnx.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --output-file mask_rcnn_r50_fpn_1x_coco.onnx --input-img 000000397133_1216x800.jpg --shape 800 1216 --show --verify --simplify +> 说明: +> 1.参考开源精度测评工具,以精度达标的pth为基准,添加打印弄明白关键点代码含义。可以得到导出原始onnx时,paste_masks_in_image 前需要添加mask_pred = mask_pred[range(len(mask_pred)), labels][:, None],onnx显示mask才与pth一致。 +> 
2.将图片经过缩放添加pad后导出的原始onnx作为精度基准,发现原始onnx的mask_pred作为输出时形状是(100,80,28,28),而更换自定义算子后导出的onnx输出形状是(100,80,14,14),因此通过添加打印与对比发现计算mask的RoiExtractor的(pooled_height, pooled_width)配置是(14,14)而不应该是默认的(7,7)。将om推理RoiExtractor的输入变量使用pickle模块保存起来,然后在源代码中加载数据到这些变量,查看原始onnx的图片显示结果可以验证是RoiExtractor的问题 +> 3.800x1216不是pth模型固定的高宽,在build_from_cfg添加print(obj_cls)发现./mmdet/models/detectors/base.py的BaseDetector,推断模型的输入大小是变化的 +> 4.至于查看函数调用关系,可以在代码中故意构造错误,python运行出错时会打印调用栈 +> ``` + + +### 6.2 开源精度 +[官网精度](http://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_fpn_1x_coco/mask_rcnn_r50_fpn_1x_coco_20200205_050542.log.json) + +``` +{"mode": "val", "epoch": 12, "iter": 7330, "lr": 0.0002, "bbox_mAP": 0.382, "bbox_mAP_50": 0.588, "bbox_mAP_75": 0.414, "bbox_mAP_s": 0.219, "bbox_mAP_m": 0.409, "bbox_mAP_l": 0.495, "bbox_mAP_copypaste": "0.382 0.588 0.414 0.219 0.409 0.495", "segm_mAP": 0.347, "segm_mAP_50": 0.557, "segm_mAP_75": 0.372, "segm_mAP_s": 0.183, "segm_mAP_m": 0.374, "segm_mAP_l": 0.472, "segm_mAP_copypaste": "0.347 0.557 0.372 0.183 0.374 0.472"} +``` +### 6.3 精度对比 +om推理box map50精度为0.584,开源box map50精度为0.588,精度下降在1%之内,因此可视为精度达标 +om推理segm map50精度为0.553,开源segm map50精度为0.557,精度下降在1%之内,因此可视为精度达标 + +## 7 性能对比 + +- **[npu性能数据](#71-npu性能数据)** +- **[T4性能数据](#72-T4性能数据)** +- **[性能对比](#73-性能对比)** + +### 7.1 npu性能数据 +batch1的性能: + 测试npu性能要确保device空闲,使用npu-smi info命令可查看device是否在运行其它推理任务 +``` +./benchmark.x86_64 -round=20 -om_path=mask_rcnn_r50_fpn_1x_coco_bs1.om -device_id=0 -batch_size=1 +``` +执行20次纯推理取均值,统计吞吐率与其倒数时延(benchmark的时延是单个数据的推理时间),npu性能是一个device执行的结果 +``` +[INFO] Dataset number: 19 finished cost 512.331ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_mask_rcnn_r50_fpn_1x_coco_bs1_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 1.95202samples/s, ave_latency: 512.318ms +---------------------------------------------------------------- +``` +maskrcnn mmdetection不支持多batch + + **性能优化:** +> 生成多batch模型需要修改源码,否则atc转化的多batch模型推理出的数据不对,多batch性能没有提升 +> + + +### 7.2 T4性能数据 +batch1性能: +onnx包含自定义算子,因此不能使用开源TensorRT测试性能数据,故在T4机器上使用pth在线推理测试性能数据 + +测评T4精度与性能: +```shell +git clone https://github.com/open-mmlab/mmcv +cd mmcv +MMCV_WITH_OPS=1 pip3.7 install -e . +cd .. 
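+# 补充说明:以上为源码编译安装mmcv(MMCV_WITH_OPS=1会编译其算子扩展),
+# 下面继续安装mmdetection、下载pth权重,并用tools/test.py在T4上做在线推理,
+# 以同时获得开源精度与T4性能数据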
+git clone https://github.com/open-mmlab/mmdetection
+cd mmdetection
+pip3.7 install -r requirements/build.txt
+python3.7 setup.py develop
+wget http://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_fpn_1x_coco/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth
+在当前目录按结构构造数据集:data/coco目录下有annotations与val2017,annotations目录存放coco数据集的instances_val2017.json,val2017目录存放coco数据集的5000张验证图片。
+python3.7 tools/test.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --eval bbox segm
+```
+```
+6.0 task/s
+```
+
+### 7.3 性能对比
+310单卡4个device,benchmark测试的是一个device。T4一个设备相当于4个device,测试的是整个设备。benchmark时延是吞吐率的倒数,T4时延是吞吐率的倒数乘以batch。
+对于batch1,1.95202 * 4 = 7.808 > 6.0,即npu性能超过T4,约为T4性能的1.3倍,超过T4性能的1.2倍,性能达标。该模型放在benchmark/cv/segmentation目录下
+
+
diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/nlp/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/nlp/.keep"
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/official/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/official/.keep"
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/research/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/research/.keep"
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391