diff --git "a/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" "b/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" index f725a660f17687db2ff87b6c9728f678f1b9eb6e..07743b014fb122ea05b9063e42e105ef57496e6f 100644 --- "a/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" +++ "b/AscendPyTorch\346\250\241\345\236\213\344\274\227\346\231\272\346\226\207\346\241\243-\347\246\273\347\272\277\346\216\250\347\220\206.md" @@ -91,6 +91,8 @@ npu单颗芯片吞吐率乘以4颗大于gpu T4吞吐率则认为性能达标 >安装CANN包:./Ascend-cann-toolkit-\{version\}-linux-x86_64.run --install --quiet > >解压Ascend-cann-benchmark_\{version\}-Linux-x86_64.zip,获取benchmark工具与脚本 + > + >若报无HwHiAiUser用户则执行useradd HwHiAiUser,安装固件若报Not a physical-machine, firmware upgrade does not support.则不必安装固件,若报错ls: cannot access '/usr/local/Ascend/ascend-toolkit/5.0.1/x86_64-linux/toolkit/python/site-packages/bin': No such file or directory则export PATH=/usr/local/python3.7.5/bin:¥PATH;export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:¥LD_LIBRARY_PATH。安装后需要重启。 ### 2.2 深度学习框架与第三方库 @@ -125,13 +127,15 @@ opencv-python == 4.2.0.34 ### 3.1 华为云昇腾modelzone里Pytorch模型端到端推理网址 -当前已完成端到端推理模型放在[ModelZoo](https://www.huaweicloud.com/ascend/resources/modelzoo),包含模型端到端推理说明,代码与操作完整流程,下面的实例仅给出用于说明问题的代码片段,该页面过滤条件框中搜索atc可以看到这些模型 +当前已完成端到端推理模型放在[ModelZoo](https://ascend.huawei.com/zh/#/software/modelzoo),包含模型端到端推理说明,代码与操作完整流程,下面的实例仅给出用于说明问题的代码片段,该页面过滤条件框中搜索atc可以看到这些模型 一些典型模型的链接如下 -1. [ResNeXt-50](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/2ca8ac26aeac461c85e7b04f17aa201a) -2. [Inception-V3](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/132f32e409b44aac8951f58ca073b780) -3. [EfficientNet-b0](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/75026a6edf604ec0bc5d16d220328646) -4. [YoloV3](https://www.huaweicloud.com/ascend/resources/modelzoo/Models/36ea401e0d844f549da2693c6289ad89) +1. [ResNeXt-50](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/2ca8ac26aeac461c85e7b04f17aa201a) +2. [Inception-V3](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/132f32e409b44aac8951f58ca073b780) +3. [Inception-V4](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/75eb32c2a2d94c4db743983504f83a06) +4. [EfficientNet-b0](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/75026a6edf604ec0bc5d16d220328646) +5. [YoloV3](https://ascend.huawei.com/zh/#/software/modelzoo/detail/1/36ea401e0d844f549da2693c6289ad89) +... ### 3.2 端到端推理实例 @@ -140,7 +144,7 @@ opencv-python == 4.2.0.34 1.pth模型转换为om模型 PyTorch训练得到的pth模型文件不能直接转换为om模型文件,因此先将pth文件转化为onnx模型文件,再由onnx转化为离线om模型文件 -1)基于PyTorch框架的模型代码与pth文件可以从开源[github链接](https://github.com/lukemelas/EfficientNet-PyTorch)获取,有些模型使用resize使用双线性模式训练的性能不达标,需要修改为resize使用最近邻模式重新训练,通过以下步骤得到onnx模型文件: +1)基于PyTorch框架的模型代码与pth文件可以从开源[github网址](https://github.com/lukemelas/EfficientNet-PyTorch)获取,有些模型使用resize使用双线性模式训练的性能不达标,需要修改为resize使用最近邻模式重新训练,通过以下步骤得到onnx模型文件: - [下载pth文件](https://github.com/lukemelas/EfficientNet-PyTorch/releases/download/1.0/efficientnet-b0-355c32eb.pth) - 参考github网址说明安装efficientnet_pytorch ``` @@ -529,7 +533,7 @@ gpu T4是4个device并行执行的结果,mean是时延(tensorrt的时延是b ``` 以root用户运行ada:kill -9 $(pidof ada) && /usr/local/Ascend/driver/tools/ada ... -编辑/home/HwHiAiUser/test/run文件 +新建/home/HwHiAiUser/test/run文件: #! 
/bin/bash export install_path=/usr/local/Ascend/ascend-toolkit/latest export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH @@ -545,9 +549,7 @@ python3.7.5 hiprof.pyc --ip_address=本机ip --result_dir=/root/out --profiling_ - CANN C20 版本profiling使用方法 ``` -以root用户运行ada:kill -9 $(pidof ada) && /usr/local/Ascend/driver/tools/ada -... -编辑/home/HwHiAiUser/test/run文件 +新建/home/HwHiAiUser/test/run文件: #! /bin/bash export install_path=/usr/local/Ascend/ascend-toolkit/latest export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH @@ -557,9 +559,9 @@ export ASCEND_OPP_PATH=${install_path}/opp ./benchmark -round=50 -om_path=/home/HwHiAiUser/test/efficientnet-b0_bs1.om -device_id=0 -batch_size=1 ... chmod 777 /home/HwHiAiUser/test/run -cd /usr/local/Ascend/ascend-toolkit/20.2.rc1/x86_64-linux/toolkit/tools/profiler/bin +cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/bin ./msprof --output=/home/HwHiAiUser/test --application=/home/HwHiAiUser/test/run --sys-hardware-mem=on --sys-cpu-profiling=on --sys-profiling=on --sys-pid-profiling=on --sys-io-profiling=on --dvpp-profiling=on -cd /usr/local/Ascend/ascend-toolkit/20.2.rc1/x86_64-linux/toolkit/tools/profiler/profiler_tool/analysis/msprof/ +cd /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/toolkit/tools/profiler/profiler_tool/analysis/msprof/ python3.7 msprof.pyc import -dir /home/HwHiAiUser/test/生成的profiling目录 python3.7 msprof.pyc export summary -dir /home/HwHiAiUser/test/生成的profiling目录 ``` @@ -1087,13 +1089,14 @@ inception_v3_bs16 MatMulV2 AI Core 1 126.037 126.037 12 inception_v3_bs16 Flatten AI Core 1 20.415 20.415 20.415 20.415 0.002355 ``` -2.算子融合 -从profiling结果看出,pad和pad前后的transdata耗时很长,经过分析pad的功能可以由其后的averagepool中的pad属性完成,可以节约大量时间,于是进行padV3D和pooling算子的graph融合 +2.算子融合 +profiling也会统计每个算子耗时,结合使用netron查看onnx模型结构图,可以看出pad和pad前后的transdata耗时很长,经过分析pad的功能可以由其后的averagepool中的pad属性完成,可以节约大量时间,于是进行padV3D和pooling算子的graph融合 参考前面提到的《CANN V100R020C10 图融合和UB融合规则参考 (推理) 01》 ### 4.5 maskrcnn端到端推理指导 -https://gitee.com/ascend/modelzoo/wikis +[基于开源mmdetection预训练的maskrcnn_Onnx模型端到端推理指导.md](https://gitee.com/pengyeqing/ascend-pytorch-crowdintelligence-doc/blob/master/onnx%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/benchmark/cv/segmentation/%E5%9F%BA%E4%BA%8E%E5%BC%80%E6%BA%90mmdetection%E9%A2%84%E8%AE%AD%E7%BB%83%E7%9A%84maskrcnn_Onnx%E6%A8%A1%E5%9E%8B%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC.md) +[基于detectron2训练的npu权重的maskrcnn_Onnx模型端到端推理指导.md](https://gitee.com/pengyeqing/ascend-pytorch-crowdintelligence-doc/blob/master/onnx%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC/benchmark/cv/segmentation/%E5%9F%BA%E4%BA%8Edetectron2%E8%AE%AD%E7%BB%83%E7%9A%84npu%E6%9D%83%E9%87%8D%E7%9A%84maskrcnn_Onnx%E6%A8%A1%E5%9E%8B%E7%AB%AF%E5%88%B0%E7%AB%AF%E6%8E%A8%E7%90%86%E6%8C%87%E5%AF%BC.md) ## 5 深度学习指导 ### 5.1 书籍推荐 @@ -1131,12 +1134,14 @@ https://gitee.com/ascend/modelzoo/wikis >![](public_sys-resources/icon-note.gif) **说明:** > **机器周均使用率过低且项目无故无进展时,华为方将有权回收算力资源,由此造成交付延期由使用者自己承担。** +> **请勿随意更改密码,更改密码带来的风险由更改者承担。** +> **请勿随意更新驱动等系统相关软件,有需要请及时联系华为方支持人员。** - 机器申请 - GPU - - 由于GPU资源紧张,请提前做好资源申请,每个模型按3个工作日作为调测时间,每个老师需一次性租借其名下所有模型,若无法按期归还,请提前和华为方支撑者做好沟通 + - 由于GPU资源紧张,请提前做好资源申请,每个模型按3个工作日作为调测时间,原则上每次调测不少于2个模型,每个模型不可重复申请调试。若无法按期归还,请提前和华为方支撑者做好沟通 - NPU - - 每个模型调测人员至少分配一张NPU用于模型调测,请向华为方申请调配的NPU资源 + - 每个模型调测人员至少分配一张NPU用于模型调测,请向华为方申请动态调配的NPU资源 - 磁盘使用 - / 下是系统目录 - /home 是可使用的数据盘目录 @@ -1156,26 +1161,131 @@ 
https://gitee.com/ascend/modelzoo/wikis ascend benchmark工具纯推理测的npu单颗device吞吐率乘以4颗大于TensorRT工具测的gpu T4吞吐率则认为性能达标 - 脚本: 代码符合pep8规范; + 脚本命名格式需统一,文件名含模型名时模型名用小写,模型名含多个字符串时用-连接; xxx_pth2onnx.py中不能使用从网络下载权重pth文件的代码,xxx_pth2onnx.py应有输入输出参数,输入是本地权重pth文件,输出是生成的onnx模型文件名; xxx_pth_preprocess.py与xxx_pth_postprocess.py尽可能只引用numpy,Pillow,torch,pycocotools等基础库,如不要因mmdetection框架的数据处理与精度评测部分封装了这些基础库的操作,为调用这些简单接口,前后处理脚本就依赖mmdetection; 不同模型的脚本与代码部分处理流程有相似性,尽量整合成通用的脚本与代码。 - - 推理步骤: - 需要提供端到端推理的操作过程 + - 推理过程: + 需要提供端到端推理过程中执行的命令等 - 关键问题总结: - 需要提供端到端推理遇到的关键问题的简要调试过程 + 需要提供端到端推理遇到的关键问题的简要调试过程,至少包含模型转换要点,精度调试,性能优化 说明: ``` - 对于性能不达标的模型,优化是学生能做的尽量做,比如用ascend atc的相关优化选项尝试一下,尝试使用最近邻替换双线性的resize重新训练,降低图片分辨率等,然后profiling分析定位引起性能下降的原因,具体到引起性能下降的算子,并在交付文档中写明问题原因与简要的定位过程,涉及到atc算子代码的修改由华为方做。 - 工作量为简单模型2-3个工作日,复杂模型5-7个工作日,个别难度大的模型12个工作日。 + 1.需要测试batch1,4,8,16,32的精度与性能 + 2.对于性能不达标的模型,需要进行如下工作: + 1)用ascend atc的相关优化选项尝试一下,尝试使用最近邻替换双线性的resize重新训练,降低图片分辨率等使性能达标。 + 2)对于算子导致的性能问题,需要使用profiling分析定位引起性能下降的原因,具体到引起性能下降的算子。优先修改模型代码以使其选择性能好的npu算子替换性能差的npu算子使性能达标,然后在modelzoo上提issue,等修复版本发布后再重测性能,继续优化。 + 3)需要交付profiling性能数据,对经过上述方法性能可以达标的模型,在交付文档中写明问题原因与达标需要执行的操作;对经过上述方法性能仍不达标的模型,在交付文档中写明问题原因与简要的定位过程。 + 3.工作量为简单模型2-3个工作日,复杂模型5-10个工作日,个别难度大的模型15-20个工作日。 ``` - 交付件 - - 交付件参考:[ResNeXt Onnx端到端推理指导.docx](https://gitee.com/ascend/modelzoo/wikis) + - 交付件参考:[ResNeXt50_Onnx模型端到端推理指导.md](https://gitee.com/ascend/modelzoo/tree/master/built-in/ACL_PyTorch/Benchmark/cv/classification/ResNext50) - 最终交付件: - 包含以上交付标准的模型名称 Onnx端到端推理指导.docx + 包含以上交付标准的模型名称_Onnx端到端推理指导.md - 最终交付形式: - gitee网址:https://gitee.com/ascend/modelzoo/tree/master/contrib/onnx_infer/ + gitee网址:https://gitee.com/ascend/modelzoo/tree/master/contrib/ACL_PyTorch commit信息格式:【高校贡献-学校学院名称】【Onnx-模型名称】模型名称 Onnx端到端推理 + 模型命名风格为大驼峰,模型名含多个字符串时使用横杠或下划线连接,当上下文用横杠时模型名用下划线连接,否则用横杠连接 + 对于batch1与batch16,npu性能均高于T4性能1.2倍的模型,放在benchmark目录下,1-1.2倍对应official目录,低于1倍放在research目录 + +- gitee仓PR贡献流程 + - fork [modelzoo](https://gitee.com/ascend/modelzoo) 到个人仓 + - 提交代码到个人仓 + - 签署cla [link](https://clasign.osinfra.cn/sign/Z2l0ZWUlMkZhc2NlbmQ=) + - 选择 Sign Individual CLA + - 若已提交PR,但忘记签署,可在签署CLA后再评论内评论 ```/check-cla``` 重新校验 + - 依据文件夹名称及目录规范整理代码,完成自验,使用PR内容模板进行PR,审查人员请指定 王姜奔(wangjiangben_hw) + - PR后,华为方会进行代码检视,并对PR进行验证,请关注PR的评论并及时修改 + - 最终验收完成后合入主干 +- gitee仓验收使用脚本(请自验)、PR内容模板 + - 验收使用脚本(请自验) + >![](public_sys-resources/icon-note.gif) + **说明:** + > **提交前请确保自验通过!确保直接执行以下脚本就可运行!** + + ```shell script + + # pth是否能正确转换为om + bash scripts/pth2om.sh + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + + ``` + - PR内容模板 + - PR示例链接 https://gitee.com/ascend/modelzoo/pulls/887 + - PR名称 + - 【高校贡献-${学校学院名称}】【Pytorch离线推理-${模型名称}】${PR内容摘要} + - 举例说明:【高校贡献-华为大学昇腾学院】【Pytorch离线推理-ResNeXt50】初次提交。 + ``` + + + + **What type of PR is this?** + > /kind task + + **What does this PR do / why do we need it**: + # 简述你这次的PR的详情 + + | 模型 | 官网精度 | 310精度 | t4性能 | 310性能 | + | :------: | :------: | :------: | :------: | :------: | + | ResNeXt50 bs1 | top1:77.62% top5:93.70% | top1:77.62% top5:93.69% | 763.044fps | 1497.252fps | + | ResNeXt50 bs16 | top1:77.62% top5:93.70% | top1:77.62% top5:93.69% | 1234.940fps | 2096.376fps | + + # 自验报告 + ```shell + # 第X次验收测试 + # 验收结果 OK / Failed + # 验收环境: A + K / CANN R20C20TR5 + # 关联issue: + + # pth是否能正确转换为om + bash scripts/pth2om.sh + # 验收结果: OK / Failed + # 备注: 
成功生成om,无运行报错,报错日志xx 等 + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + # 验收结果: OK / Failed + # 备注: 目标精度top1:77.62% top5:93.70%;bs1,bs16验收精度top1:77.62% top5:93.69%;精度下降不超过1%;无运行报错,报错日志xx 等 + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:1497.252FPS bs16:2096.376FPS;无运行报错,报错日志xx 等 + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:763.044FPS bs16:1234.940FPS;无运行报错,报错日志xx 等 + # 310性能需要超过t4 + + ``` + - 示例链接 https://gitee.com/ascend/modelzoo/pulls/836#note_4750681 + + **Which issue(s) this PR fixes**: + # 用于后期issue关联的pr + + Fixes # + + **Special notes for your reviewers**: + # 在reviewer检视时你想要和他说的 + + ``` diff --git "a/AscendPyTorch\346\250\241\345\236\213\346\216\250\347\220\206\344\274\227\346\231\272\351\252\214\346\224\266\346\214\207\345\215\227.md" "b/AscendPyTorch\346\250\241\345\236\213\346\216\250\347\220\206\344\274\227\346\231\272\351\252\214\346\224\266\346\214\207\345\215\227.md" new file mode 100644 index 0000000000000000000000000000000000000000..bdae5f21510330a1a7769f18729737fdf5b44cfe --- /dev/null +++ "b/AscendPyTorch\346\250\241\345\236\213\346\216\250\347\220\206\344\274\227\346\231\272\351\252\214\346\224\266\346\214\207\345\215\227.md" @@ -0,0 +1,72 @@ +# Ascend PyTorch 模型推理众智验收指南 + +1. 先上gitee管理平台,将验收目标调整至验收状态 +2. 检查PR内容,文件夹路径和文件结构 + - PR末班和文件路径结构都在下面附件里有详细说明,请仔细check +3. 按照验收脚本在交付文件夹下进行验收 + + ```shell script + + # pth是否能正确转换为om + bash scripts/pth2om.sh + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + + ``` + + - 验收过程中遇到问题,如是一些路径或者打字错误的问题,先修复继续执行 + - 每次验收都需要对验收脚本中的所有未验收脚本进行验收,不要因某一项验收失败而阻塞后续验收工作 +4. 验收反馈 + - 验收后,使用验收报告模板,在评论区反馈验收结果 + ```shell + # 第X次验收测试 + # 验收结果 OK / Failed + # 验收环境: A + K / CANN R20C20TR5 + # 关联issue: + + # pth是否能正确转换为om + bash scripts/pth2om.sh + # 验收结果: OK / Failed + # 备注: 成功生成om,无运行报错,报错日志xx 等 + + # 精度数据是否达标(需要显示官网精度与om模型的精度) + bash scripts/eval_acc.sh + # 验收结果: OK / Failed + # 备注: 目标精度top1:77.62% top5:93.70%;bs1,bs16验收精度top1:77.62% top5:93.69%;精度下降不超过1%;无运行报错,报错日志xx 等 + + # npu性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,性能数据以单卡吞吐率为标准) + bash scripts/perform_310.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:1497.252FPS bs16:2096.376FPS;无运行报错,报错日志xx 等 + # 在t4环境测试性能数据(如果模型支持多batch,测试bs1与bs16,否则只测试bs1,如果导出的onnx模型因含自定义算子等不能离线推理,则在t4上测试pytorch模型的在线推理性能,性能数据以单卡吞吐率为标准) + bash scripts/perform_t4.sh + # 验收结果: OK / Failed + # 备注: 验收测试性能bs1:763.044FPS bs16:1234.940FPS;无运行报错,报错日志xx 等 + # 310性能需要超过t4 + + ``` + - 示例链接 https://gitee.com/ascend/modelzoo/pulls/836#note_4814643 +5. 
验收完成后,上gitee管理平台,将验收目标调整至完成状态 + + + + +- 关联issue模板 (负责人请关联相应的学生,若无法关联,请关联验收者) + ``` + 【Pytorch模型推理众智测试验收】【第x次回归测试】 xxx模型 验收不通过 + + 贴上验收报告 + + ``` + - 示例链接 https://gitee.com/ascend/modelzoo/issues/I3FI5L?from=project-issue + + + + \ No newline at end of file diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/ResNeXt50_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/ResNeXt50_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" new file mode 100644 index 0000000000000000000000000000000000000000..11de7d289beeae1e836017f5d59a41d3961726eb --- /dev/null +++ "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/classification/ResNeXt50_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" @@ -0,0 +1,605 @@ +# ResNeXt50 Onnx模型端到端推理指导 +- [1 模型概述](#1-模型概述) + - [1.1 论文地址](#11-论文地址) + - [1.2 代码地址](#12-代码地址) +- [2 环境说明](#2-环境说明) + - [2.1 深度学习框架](#21-深度学习框架) + - [2.2 python第三方库](#22-python第三方库) +- [3 模型转换](#3-模型转换) + - [3.1 pth转onnx模型](#31-pth转onnx模型) + - [3.2 onnx转om模型](#32-onnx转om模型) +- [4 数据集预处理](#4-数据集预处理) + - [4.1 数据集获取](#41-数据集获取) + - [4.2 数据集预处理](#42-数据集预处理) + - [4.3 生成数据集信息文件](#43-生成数据集信息文件) +- [5 离线推理](#5-离线推理) + - [5.1 benchmark工具概述](#51-benchmark工具概述) + - [5.2 离线推理](#52-离线推理) +- [6 精度对比](#6-精度对比) + - [6.1 离线推理TopN精度统计](#61-离线推理TopN精度统计) + - [6.2 开源TopN精度](#62-开源TopN精度) + - [6.3 精度对比](#63-精度对比) +- [7 性能对比](#7-性能对比) + - [7.1 npu性能数据](#71-npu性能数据) + - [7.2 T4性能数据](#72-T4性能数据) + - [7.3 性能对比](#73-性能对比) + + + +## 1 模型概述 + +- **[论文地址](#11-论文地址)** + +- **[代码地址](#12-代码地址)** + +### 1.1 论文地址 +[ResNeXt50论文](https://arxiv.org/abs/1611.05431) +本文提出了一个简单的,高度模型化的针对图像分类问题的网络结构。本文的网络是通过重复堆叠building block组成的,这些building block整合了一系列具有相同拓扑结构的变体(transformations)。本文提出的简单的设计思路可以生成一种同质的,多分支的结构。这种方法产生了一个新的维度,作者将其称为基(变体的数量,the 
size of the set of transformations)。在ImageNet-1K数据集上,作者可以在保证模型复杂度的限制条件下,通过提升基的大小来提高模型的准确率。更重要的是,相比于更深和更宽的网络,提升基的大小更加有效。作者将本文的模型命名为ResNeXt,本模型在ILSVRC2016上取得了第二名。本文还在ImageNet-5K和COCO数据集上进行了实验,结果均表明ResNeXt的性能比ResNet好。 + +### 1.2 代码地址 +[ResNeXt50代码](https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py) + +## 2 环境说明 + +- **[深度学习框架](#21-深度学习框架)** + +- **[python第三方库](#22-python第三方库)** + +### 2.1 深度学习框架 +``` +pytorch == 1.6.0 +torchvision == 0.7.0 +onnx == 1.7.0 +``` + +### 2.2 python第三方库 + +``` +numpy == 1.18.5 +Pillow == 7.2.0 +``` + +**说明:** +> X86架构:pytorch,torchvision和onnx可以通过官方下载whl包安装,其它可以通过pip3.7 install 包名 安装 +> +> Arm架构:pytorch,torchvision和onnx可以通过源码编译安装,其它可以通过pip3.7 install 包名 安装 + +## 3 模型转换 + +- **[pth转onnx模型](#31-pth转onnx模型)** + +- **[onnx转om模型](#32-onnx转om模型)** + +### 3.1 pth转onnx模型 + +1.下载pth权重文件 +[ResNeXt50预训练pth权重文件](https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth) +文件md5sum: 1d6611049e6ef03f1d6afa11f6f9023e +2.编写pth2onnx脚本resnext50_pth2onnx.py +```python +import sys +import torch +import torch.onnx +import torchvision.models as models + +def pth2onnx(input_file, output_file): + model = models.resnext50_32x4d(pretrained=False) + checkpoint = torch.load(input_file, map_location=None) + model.load_state_dict(checkpoint) + + model.eval() + input_names = ["image"] + output_names = ["class"] + dynamic_axes = {'image': {0: '-1'}, 'class': {0: '-1'}} + dummy_input = torch.randn(1, 3, 224, 224) + torch.onnx.export(model, dummy_input, output_file, input_names = input_names, dynamic_axes = dynamic_axes, output_names = output_names, verbose=True, opset_version=11) + +if __name__ == "__main__": + input_file = sys.argv[1] + output_file = sys.argv[2] + pth2onnx(input_file, output_file) +``` + + **说明:** +>注意目前ATC支持的onnx算子版本为11 + +3.执行pth2onnx脚本,生成onnx模型文件 +``` +python3 resnext50_pth2onnx.py resnext50_32x4d-7cdf4587.pth resnext50.onnx +``` + +### 3.2 onnx转om模型 + +1.设置环境变量 +``` +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考[CANN V100R020C10 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164868?idPath=23710424%7C251366513%7C22892968%7C251168373) +``` +atc --framework=5 --model=./resnext50.onnx --input_format=NCHW --input_shape="image:16,3,224,224" --output=resnext50_bs16 --log=debug --soc_version=Ascend310 +``` + +## 4 数据集预处理 + +- **[数据集获取](#41-数据集获取)** + +- **[数据集预处理](#42-数据集预处理)** + +- **[生成数据集信息文件](#43-生成数据集信息文件)** + +### 4.1 数据集获取 +该模型使用[ImageNet官网](http://www.image-net.org)的5万张验证集进行测试,图片与标签分别存放在datasets/ImageNet/val_union与datasets/ImageNet/val_label.txt。 + +### 4.2 数据集预处理 +1.预处理脚本imagenet_torch_preprocess.py +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +from PIL import Image +import numpy as np +import multiprocessing + + +model_config = { + 'resnet': { + 'resize': 256, + 'centercrop': 224, + 'mean': [0.485, 0.456, 0.406], + 'std': [0.229, 0.224, 0.225], + }, + 'inceptionv3': { + 'resize': 342, + 'centercrop': 299, + 'mean': [0.485, 0.456, 0.406], + 'std': [0.229, 0.224, 0.225], + }, + 'inceptionv4': { + 'resize': 342, + 'centercrop': 299, + 'mean': [0.5, 0.5, 0.5], + 'std': [0.5, 0.5, 0.5], + }, +} + + +def center_crop(img, output_size): + if isinstance(output_size, int): + output_size = (int(output_size), int(output_size)) + image_width, image_height = img.size + crop_height, crop_width = output_size + crop_top = int(round((image_height - crop_height) / 2.)) + crop_left = int(round((image_width - crop_width) / 2.)) + return img.crop((crop_left, crop_top, crop_left + crop_width, crop_top + crop_height)) + + +def resize(img, size, interpolation=Image.BILINEAR): + if isinstance(size, int): + w, h = img.size + if (w <= h and w == size) or (h <= w and h == size): + return img + if w < h: + ow = size + oh = int(size * h / w) + return img.resize((ow, oh), interpolation) + else: + oh = size + ow = int(size * w / h) + return img.resize((ow, oh), interpolation) + else: + return img.resize(size[::-1], interpolation) + + +def gen_input_bin(mode_type, file_batches, batch): + i = 0 + for file in file_batches[batch]: + i = i + 1 + print("batch", batch, file, "===", i) + + # RGBA to RGB + image = Image.open(os.path.join(src_path, file)).convert('RGB') + image = resize(image, model_config[mode_type]['resize']) # Resize + image = center_crop(image, model_config[mode_type]['centercrop']) # CenterCrop + img = np.array(image, dtype=np.float32) + img = img.transpose(2, 0, 1) # ToTensor: HWC -> CHW + img = img / 255. # ToTensor: div 255 + img -= np.array(model_config[mode_type]['mean'], dtype=np.float32)[:, None, None] # Normalize: mean + img /= np.array(model_config[mode_type]['std'], dtype=np.float32)[:, None, None] # Normalize: std + img.tofile(os.path.join(save_path, file.split('.')[0] + ".bin")) + + +def preprocess(mode_type, src_path, save_path): + files = os.listdir(src_path) + file_batches = [files[i:i + 500] for i in range(0, 50000, 500) if files[i:i + 500] != []] + thread_pool = multiprocessing.Pool(len(file_batches)) + for batch in range(len(file_batches)): + thread_pool.apply_async(gen_input_bin, args=(mode_type, file_batches, batch)) + thread_pool.close() + thread_pool.join() + print("in thread, except will not report! 
please ensure bin files generated.") + + +if __name__ == '__main__': + if len(sys.argv) < 4: + raise Exception("usage: python3 xxx.py [model_type] [src_path] [save_path]") + mode_type = sys.argv[1] + src_path = sys.argv[2] + save_path = sys.argv[3] + src_path = os.path.realpath(src_path) + save_path = os.path.realpath(save_path) + if mode_type not in model_config: + model_type_help = "model type: " + for key in model_config.keys(): + model_type_help += key + model_type_help += ' ' + raise Exception(model_type_help) + if not os.path.isdir(save_path): + os.makedirs(os.path.realpath(save_path)) + preprocess(mode_type, src_path, save_path) +``` +2.执行预处理脚本,生成数据集预处理后的bin文件 +``` +python3 imagenet_torch_preprocess.py datasets/ImageNet/val_union ./prep_dataset +``` +### 4.3 生成数据集信息文件 +1.生成数据集信息文件脚本get_info.py +```python +import os +import sys +import cv2 +from glob import glob + + +def get_bin_info(file_path, info_name, width, height): + bin_images = glob(os.path.join(file_path, '*.bin')) + with open(info_name, 'w') as file: + for index, img in enumerate(bin_images): + content = ' '.join([str(index), img, width, height]) + file.write(content) + file.write('\n') + + +def get_jpg_info(file_path, info_name): + extensions = ['jpg', 'jpeg', 'JPG', 'JPEG'] + image_names = [] + for extension in extensions: + image_names.append(glob(os.path.join(file_path, '*.' + extension))) + with open(info_name, 'w') as file: + for image_name in image_names: + if len(image_name) == 0: + continue + else: + for index, img in enumerate(image_name): + img_cv = cv2.imread(img) + shape = img_cv.shape + width, height = shape[1], shape[0] + content = ' '.join([str(index), img, str(width), str(height)]) + file.write(content) + file.write('\n') + + +if __name__ == '__main__': + file_type = sys.argv[1] + file_path = sys.argv[2] + info_name = sys.argv[3] + if file_type == 'bin': + width = sys.argv[4] + height = sys.argv[5] + assert len(sys.argv) == 6, 'The number of input parameters must be equal to 5' + get_bin_info(file_path, info_name, width, height) + elif file_type == 'jpg': + assert len(sys.argv) == 4, 'The number of input parameters must be equal to 3' + get_jpg_info(file_path, info_name) +``` +2.执行生成数据集信息脚本,生成数据集信息文件 +``` +python3 get_info.py bin ./prep_dataset ./resnext50_prep_bin.info 224 224 +``` +第一个参数为模型输入的类型,第二个参数为生成的bin文件路径,第三个为输出的info文件,后面为宽高信息 +## 5 离线推理 + +- **[benchmark工具概述](#51-benchmark工具概述)** + +- **[离线推理](#52-离线推理)** + +### 5.1 benchmark工具概述 + +benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考[CANN V100R020C10 推理benchmark工具用户指南 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164874?idPath=23710424%7C251366513%7C22892968%7C251168373) +### 5.2 离线推理 +1.设置环境变量 +``` +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.执行离线推理 +``` +./benchmark -model_type=vision -device_id=0 -batch_size=16 -om_path=resnext50_bs16.om -input_text_path=./resnext50_prep_bin.info -input_width=224 -input_height=224 -output_binary=False -useDvpp=False +``` +输出结果默认保存在当前目录result/dumpOutput_devicex,模型只有一个名为class的输出,shape为bs * 
1000,数据类型为FP32,对应1000个分类的预测结果,每个输入对应的输出对应一个_x.bin文件。 + +## 6 精度对比 + +- **[离线推理TopN精度](#61-离线推理TopN精度)** +- **[开源TopN精度](#62-开源TopN精度)** +- **[精度对比](#63-精度对比)** + +### 6.1 离线推理TopN精度统计 + +后处理统计TopN精度 +```python +import os +import sys +import json +import numpy as np +import time + +np.set_printoptions(threshold=sys.maxsize) + +LABEL_FILE = "HiAI_label.json" + + +def gen_file_name(img_name): + full_name = img_name.split('/')[-1] + index = full_name.rfind('.') + return full_name[:index] + + +def cre_groundtruth_dict(gtfile_path): + """ + :param filename: file contains the imagename and label number + :return: dictionary key imagename, value is label number + """ + img_gt_dict = {} + for gtfile in os.listdir(gtfile_path): + if (gtfile != LABEL_FILE): + with open(os.path.join(gtfile_path, gtfile), 'r') as f: + gt = json.load(f) + ret = gt["image"]["annotations"][0]["category_id"] + img_gt_dict[gen_file_name(gtfile)] = ret + return img_gt_dict + + +def cre_groundtruth_dict_fromtxt(gtfile_path): + """ + :param filename: file contains the imagename and label number + :return: dictionary key imagename, value is label number + """ + img_gt_dict = {} + with open(gtfile_path, 'r')as f: + for line in f.readlines(): + temp = line.strip().split(" ") + imgName = temp[0].split(".")[0] + imgLab = temp[1] + img_gt_dict[imgName] = imgLab + return img_gt_dict + + +def load_statistical_predict_result(filepath): + """ + function: + the prediction esult file data extraction + input: + result file:filepath + output: + n_label:numble of label + data_vec: the probabilitie of prediction in the 1000 + :return: probabilities, numble of label, in_type, color + """ + with open(filepath, 'r')as f: + data = f.readline() + temp = data.strip().split(" ") + n_label = len(temp) + if data == '': + n_label = 0 + data_vec = np.zeros((n_label), dtype=np.float32) + in_type = '' + color = '' + if n_label == 0: + in_type = f.readline() + color = f.readline() + else: + for ind, prob in enumerate(temp): + data_vec[ind] = np.float32(prob) + return data_vec, n_label, in_type, color + + +def create_visualization_statistical_result(prediction_file_path, + result_store_path, json_file_name, + img_gt_dict, topn=5): + """ + :param prediction_file_path: + :param result_store_path: + :param json_file_name: + :param img_gt_dict: + :param topn: + :return: + """ + writer = open(os.path.join(result_store_path, json_file_name), 'w') + table_dict = {} + table_dict["title"] = "Overall statistical evaluation" + table_dict["value"] = [] + + count = 0 + resCnt = 0 + n_labels = 0 + count_hit = np.zeros(topn) + for tfile_name in os.listdir(prediction_file_path): + count += 1 + temp = tfile_name.split('.')[0] + index = temp.rfind('_') + img_name = temp[:index] + filepath = os.path.join(prediction_file_path, tfile_name) + ret = load_statistical_predict_result(filepath) + prediction = ret[0] + n_labels = ret[1] + sort_index = np.argsort(-prediction) + gt = img_gt_dict[img_name] + if (n_labels == 1000): + realLabel = int(gt) + elif (n_labels == 1001): + realLabel = int(gt) + 1 + else: + realLabel = int(gt) + + resCnt = min(len(sort_index), topn) + for i in range(resCnt): + if (str(realLabel) == str(sort_index[i])): + count_hit[i] += 1 + break + + if 'value' not in table_dict.keys(): + print("the item value does not exist!") + else: + table_dict["value"].extend( + [{"key": "Number of images", "value": str(count)}, + {"key": "Number of classes", "value": str(n_labels)}]) + if count == 0: + accuracy = 0 + else: + accuracy = np.cumsum(count_hit) / count + for i in 
range(resCnt): + table_dict["value"].append({"key": "Top" + str(i + 1) + " accuracy", + "value": str( + round(accuracy[i] * 100, 2)) + '%'}) + json.dump(table_dict, writer) + writer.close() + + +if __name__ == '__main__': + start = time.time() + try: + # txt file path + folder_davinci_target = sys.argv[1] + # annotation files path, "val_label.txt" + annotation_file_path = sys.argv[2] + # the path to store the results json path + result_json_path = sys.argv[3] + # result json file name + json_file_name = sys.argv[4] + except IndexError: + print("Stopped!") + exit(1) + + if not (os.path.exists(folder_davinci_target)): + print("target file folder does not exist.") + + if not (os.path.exists(annotation_file_path)): + print("Ground truth file does not exist.") + + if not (os.path.exists(result_json_path)): + print("Result folder doesn't exist.") + + img_label_dict = cre_groundtruth_dict_fromtxt(annotation_file_path) + create_visualization_statistical_result(folder_davinci_target, + result_json_path, json_file_name, + img_label_dict, topn=5) + + elapsed = (time.time() - start) + print("Time used:", elapsed) +``` +调用vision_metric_ImageNet.py脚本推理结果与label比对,可以获得Accuracy Top5数据,结果保存在result.json中。 +``` +python3 vision_metric_ImageNet.py result/dumpOutput_device0/ dataset/ImageNet/val_label.txt ./ result.json +``` +第一个为benchmark输出目录,第二个为数据集配套标签,第三个是生成文件的保存目录,第四个是生成的文件名。 +查看输出结果: +``` +{"title": "Overall statistical evaluation", "value": [{"key": "Number of images", "value": "50000"}, {"key": "Number of classes", "value": "1000"}, {"key": "Top1 accuracy", "value": "77.62%"}, {"key": "Top2 accuracy", "value": "87.42%"}, {"key": "Top3 accuracy", "value": "90.79%"}, {"key": "Top4 accuracy", "value": "92.56%"}, {"key": "Top5 accuracy", "value": "93.69%"}] +``` +### 6.2 开源TopN精度 +[torchvision官网精度](https://pytorch.org/vision/stable/models.html) +``` +Model Acc@1 Acc@5 +ResNeXt-50-32x4d 77.618 93.698 +``` +### 6.3 精度对比 +将得到的om离线模型推理TopN精度与该模型github代码仓上公布的精度对比,精度下降在1%范围之内,故精度达标。 + +## 7 性能对比 + +- **[npu性能数据](#71-npu性能数据)** +- **[T4性能数据](#72-T4性能数据)** +- **[性能对比](#73-性能对比)** + +### 7.1 npu性能数据 +batch1的性能: + 测试npu性能要确保device空闲,使用npu-smi info命令可查看device是否在运行其它推理任务 +``` +./benchmark -round=50 -om_path=resnext50_bs1.om -device_id=0 -batch_size=1 +``` +执行50次纯推理取均值,统计吞吐率与其倒数时延(benchmark的时延是单个数据的推理时间),npu性能是一个device执行的结果 +``` +[INFO] Dataset number: 49 finished cost 2.635ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_resnext50_bs1_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 374.313samples/s, ave_latency: 2.67914ms +``` +batch16的性能: +``` +./benchmark -round=50 -om_path=resnext50_bs16.om -device_id=0 -batch_size=16 +``` +``` +[INFO] Dataset number: 49 finished cost 30.514ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_resnext50_bs16_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 524.094samples/s, ave_latency: 1.9101ms +``` +### 7.2 T4性能数据 +batch1性能: +在T4机器上安装开源TensorRT +``` +cd /usr/local/TensorRT-7.2.2.3/targets/x86_64-linux-gnu/bin/ +./trtexec --onnx=resnext50.onnx --fp16 --shapes=image:1x3x224x224 --threads +``` +gpu T4是4个device并行执行的结果,mean是时延(tensorrt的时延是batch个数据的推理时间),即吞吐率的倒数乘以batch +``` +[03/24/2021-03:54:47] [I] GPU Compute +[03/24/2021-03:54:47] [I] min: 1.26575 ms +[03/24/2021-03:54:47] [I] max: 4.41528 ms +[03/24/2021-03:54:47] [I] mean: 1.31054 ms +[03/24/2021-03:54:47] [I] median: 1.30151 ms +[03/24/2021-03:54:47] [I] percentile: 1.40723 ms at 99% 
+[03/24/2021-03:54:47] [I] total compute time: 2.9972 s +``` +batch16性能: +``` +./trtexec --onnx=resnext50.onnx --fp16 --shapes=image:16x3x224x224 --threads +``` +``` +[03/24/2021-03:57:22] [I] GPU Compute +[03/24/2021-03:57:22] [I] min: 12.5645 ms +[03/24/2021-03:57:22] [I] max: 14.8437 ms +[03/24/2021-03:57:22] [I] mean: 12.9561 ms +[03/24/2021-03:57:22] [I] median: 12.8541 ms +[03/24/2021-03:57:22] [I] percentile: 14.8377 ms at 99% +[03/24/2021-03:57:22] [I] total compute time: 3.03173 s +``` +### 7.3 性能对比 +batch1:2.67914/4 < 1.31054/1 +batch16:1.9101/4 < 12.9561/16 +npu的吞吐率乘4比T4的吞吐率大,即npu的时延除4比T4的时延除以batch小,故npu性能高于T4性能,性能达标。 +对于batch1与batch16,npu性能均高于T4性能1.2倍,该模型放在benchmark/cv/classification目录下。 + + diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/.keep" new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216detectron2\350\256\255\347\273\203\347\232\204npu\346\235\203\351\207\215\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216detectron2\350\256\255\347\273\203\347\232\204npu\346\235\203\351\207\215\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" new file mode 100644 index 0000000000000000000000000000000000000000..e365eaa88addc2895b241fce3ebb2eb2817d3d75 --- /dev/null +++ "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216detectron2\350\256\255\347\273\203\347\232\204npu\346\235\203\351\207\215\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" @@ -0,0 +1,1227 @@ +# 基于detectron2训练的npu权重的maskrcnn Onnx模型端到端推理指导 +- [1 模型概述](#1-模型概述) + - [1.1 论文地址](#11-论文地址) + - [1.2 代码地址](#12-代码地址) +- [2 环境说明](#2-环境说明) + - [2.1 深度学习框架](#21-深度学习框架) + - [2.2 python第三方库](#22-python第三方库) +- [3 模型转换](#3-模型转换) + - [3.1 pth转onnx模型](#31-pth转onnx模型) + - [3.2 onnx转om模型](#32-onnx转om模型) +- [4 数据集预处理](#4-数据集预处理) + - [4.1 数据集获取](#41-数据集获取) + - [4.2 数据集预处理](#42-数据集预处理) + - [4.3 生成数据集信息文件](#43-生成数据集信息文件) +- [5 离线推理](#5-离线推理) + - [5.1 benchmark工具概述](#51-benchmark工具概述) + - [5.2 离线推理](#52-离线推理) +- [6 精度对比](#6-精度对比) + - [6.1 离线推理精度统计](#61-离线推理精度统计) + - [6.2 开源精度](#62-开源精度) + - [6.3 精度对比](#63-精度对比) +- [7 性能对比](#7-性能对比) + - [7.1 npu性能数据](#71-npu性能数据) + - [7.2 T4性能数据](#72-T4性能数据) + - [7.3 性能对比](#73-性能对比) + + + +## 1 模型概述 + +- **[论文地址](#11-论文地址)** + +- **[代码地址](#12-代码地址)** + +### 1.1 论文地址 +[maskrcnn论文](https://arxiv.org/abs/1703.06870) +论文提出了一个简单、灵活、通用的目标实例分割框架Mask R-CNN。这个框架可同时做目标检测、实例分割。实例分割的实现就是在faster r-cnn的基础上加了一个可以预测目标掩膜(mask)的分支。只比Faster r-cnn慢一点,5fps。很容易拓展到其他任务如:关键点检测。18年在coco的目标检测、实例分割、人体关键点检测都取得了最优成绩。 + +### 1.2 代码地址 +[cpu,gpu版detectron2框架maskrcnn代码](https://github.com/facebookresearch/detectron2/blob/master/MODEL_ZOO.md) + 
+[npu版detectron2框架maskrcnn代码](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch) + +## 2 环境说明 + +- **[深度学习框架](#21-深度学习框架)** + +- **[python第三方库](#22-python第三方库)** + +### 2.1 深度学习框架 +``` +pytorch == 1.8.0 +torchvision == 0.9.0 +onnx == 1.8.0 +``` + +**注意:** +> 转onnx的环境上pytorch需要安装1.8.0版本 +> + +### 2.2 python第三方库 + +``` +numpy == 1.18.5 +opencv-python == 4.2.0.34 +``` + +**说明:** +> X86架构:opencv,pytorch,torchvision和onnx可以通过官方下载whl包安装,其它可以通过pip3.7 install 包名 安装 +> +> Arm架构:opencv,pytorch,torchvision和onnx可以通过源码编译安装,其它可以通过pip3.7 install 包名 安装 + +## 3 模型转换 + +- **[pth转onnx模型](#31-pth转onnx模型)** + +- **[onnx转om模型](#32-onnx转om模型)** + +detectron2暂支持pytorch1.8导出pytorch框架的onnx,npu权重可以使用开源的detectron2加载,因此基于pytorch1.8与开源detectron2导出含npu权重的onnx。atc暂不支持动态shape小算子,可以使用大颗粒算子替换这些小算子规避,这些小算子可以在转onnx时的verbose打印中找到其对应的python代码,从而根据功能用大颗粒算子替换,onnx能推导出变量正确的shape与算子属性正确即可,变量实际的数值无关紧要,因此这些大算子函数的功能实现无关紧要,因包含自定义算子需要去掉对onnx模型的校验。 + +### 3.1 pth转onnx模型 + +1.获取pth权重文件 +[maskrcnn基于detectron2预训练的npu权重文件](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch) +文件md5sum: b95f35f051012a02875220482a568c3b +2.下载detectron2源码并安装 +```shell +git clone https://github.com/facebookresearch/detectron2 +python3.7 -m pip install -e detectron2 +``` + + **说明:** +> 安装所需的依赖说明请参考detectron2/INSTALL.md +> +> 重装pytorch后需要rm -rf detectron2/build/ **/*.so再重装detectron2 + +3.detectron2代码迁移,参见maskrcnn_detectron2.diff: +```diff +diff --git a/detectron2/layers/__init__.py b/detectron2/layers/__init__.py +index c8bd1fb..f5fa9ea 100644 +--- a/detectron2/layers/__init__.py ++++ b/detectron2/layers/__init__.py +@@ -2,7 +2,7 @@ + from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm + from .deform_conv import DeformConv, ModulatedDeformConv + from .mask_ops import paste_masks_in_image +-from .nms import batched_nms, batched_nms_rotated, nms, nms_rotated ++from .nms import batched_nms, batch_nms_op, batched_nms_rotated, nms, nms_rotated + from .roi_align import ROIAlign, roi_align + from .roi_align_rotated import ROIAlignRotated, roi_align_rotated + from .shape_spec import ShapeSpec +diff --git a/detectron2/layers/nms.py b/detectron2/layers/nms.py +index ac14d45..22efb24 100644 +--- a/detectron2/layers/nms.py ++++ b/detectron2/layers/nms.py +@@ -15,6 +15,56 @@ if TORCH_VERSION < (1, 7): + else: + nms_rotated_func = torch.ops.detectron2.nms_rotated + ++class BatchNMSOp(torch.autograd.Function): ++ @staticmethod ++ def forward(ctx, bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (batch, N, C, 4). ++ scores (torch.Tensor): scores in shape (batch, N, C). 
++ return: ++ nmsed_boxes: (1, N, 4) ++ nmsed_scores: (1, N) ++ nmsed_classes: (1, N) ++ nmsed_num: (1,) ++ """ ++ ++ # Phony implementation for onnx export ++ nmsed_boxes = bboxes[:, :max_total_size, 0, :] ++ nmsed_scores = scores[:, :max_total_size, 0] ++ nmsed_classes = torch.arange(max_total_size, dtype=torch.long) ++ nmsed_num = torch.Tensor([max_total_size]) ++ ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++ @staticmethod ++ def symbolic(g, bboxes, scores, score_thr, iou_thr, max_size_p_class, max_t_size): ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = g.op('BatchMultiClassNMS', ++ bboxes, scores, score_threshold_f=score_thr, iou_threshold_f=iou_thr, ++ max_size_per_class_i=max_size_p_class, max_total_size_i=max_t_size, outputs=4) ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++def batch_nms_op(bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (N, 4). ++ scores (torch.Tensor): scores in shape (N, ). ++ """ ++ ++ num_classes = bboxes.shape[1].numpy() // 4 ++ if bboxes.dtype == torch.float32: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4).half() ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1).half() ++ else: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4) ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1) ++ ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = BatchNMSOp.apply(bboxes, scores, ++ score_threshold, iou_threshold, max_size_per_class, max_total_size) ++ nmsed_boxes = nmsed_boxes.float() ++ nmsed_scores = nmsed_scores.float() ++ nmsed_classes = nmsed_classes.long() ++ dets = torch.cat((nmsed_boxes.reshape((max_total_size, 4)), nmsed_scores.reshape((max_total_size, 1))), -1) ++ labels = nmsed_classes.reshape((max_total_size, )) ++ return dets, labels + + def batched_nms( + boxes: torch.Tensor, scores: torch.Tensor, idxs: torch.Tensor, iou_threshold: float +diff --git a/detectron2/modeling/box_regression.py b/detectron2/modeling/box_regression.py +index 12be000..074f3e3 100644 +--- a/detectron2/modeling/box_regression.py ++++ b/detectron2/modeling/box_regression.py +@@ -87,20 +87,33 @@ class Box2BoxTransform(object): + deltas = deltas.float() # ensure fp32 for decoding precision + boxes = boxes.to(deltas.dtype) + +- widths = boxes[:, 2] - boxes[:, 0] +- heights = boxes[:, 3] - boxes[:, 1] +- ctr_x = boxes[:, 0] + 0.5 * widths +- ctr_y = boxes[:, 1] + 0.5 * heights ++ boxes_prof = boxes.permute(1, 0) ++ widths = boxes_prof[2, :] - boxes_prof[0, :] ++ heights = boxes_prof[3, :] - boxes_prof[1, :] ++ ctr_x = boxes_prof[0, :] + 0.5 * widths ++ ctr_y = boxes_prof[1, :] + 0.5 * heights + + wx, wy, ww, wh = self.weights +- dx = deltas[:, 0::4] / wx ++ '''dx = deltas[:, 0::4] / wx + dy = deltas[:, 1::4] / wy + dw = deltas[:, 2::4] / ww +- dh = deltas[:, 3::4] / wh ++ dh = deltas[:, 3::4] / wh''' ++ denorm_deltas = deltas ++ if denorm_deltas.shape[1] > 4: ++ denorm_deltas = denorm_deltas.view(-1, 80, 4) ++ dx = denorm_deltas[:, :, 0:1:].view(-1, 80) / wx ++ dy = denorm_deltas[:, :, 1:2:].view(-1, 80) / wy ++ dw = denorm_deltas[:, :, 2:3:].view(-1, 80) / ww ++ dh = denorm_deltas[:, :, 3:4:].view(-1, 80) / wh ++ else: ++ dx = denorm_deltas[:, 0:1:] / wx ++ dy = denorm_deltas[:, 1:2:] / wy ++ dw = denorm_deltas[:, 2:3:] / ww ++ dh = denorm_deltas[:, 3:4:] / wh + + # Prevent sending too large values into torch.exp() +- dw = torch.clamp(dw, max=self.scale_clamp) +- dh = torch.clamp(dh, max=self.scale_clamp) ++ 
dw = torch.clamp(dw, min=-float('inf'), max=self.scale_clamp) ++ dh = torch.clamp(dh, min=-float('inf'), max=self.scale_clamp) + + pred_ctr_x = dx * widths[:, None] + ctr_x[:, None] + pred_ctr_y = dy * heights[:, None] + ctr_y[:, None] +diff --git a/detectron2/modeling/meta_arch/rcnn.py b/detectron2/modeling/meta_arch/rcnn.py +index e5f66d1..1bbba71 100644 +--- a/detectron2/modeling/meta_arch/rcnn.py ++++ b/detectron2/modeling/meta_arch/rcnn.py +@@ -199,8 +199,9 @@ class GeneralizedRCNN(nn.Module): + """ + assert not self.training + +- images = self.preprocess_image(batched_inputs) +- features = self.backbone(images.tensor) ++ # images = self.preprocess_image(batched_inputs) ++ images = batched_inputs ++ features = self.backbone(images) + + if detected_instances is None: + if self.proposal_generator is not None: +diff --git a/detectron2/modeling/poolers.py b/detectron2/modeling/poolers.py +index e5d72ab..7c0dd2f 100644 +--- a/detectron2/modeling/poolers.py ++++ b/detectron2/modeling/poolers.py +@@ -94,6 +94,31 @@ def convert_boxes_to_pooler_format(box_lists: List[Boxes]): + + return pooler_fmt_boxes + ++import torch.onnx.symbolic_helper as sym_help ++ ++class RoiExtractor(torch.autograd.Function): ++ @staticmethod ++ def forward(self, f0, f1, f2, f3, rois, aligned=0, finest_scale=56, pooled_height=7, pooled_width=7, ++ pool_mode='avg', roi_scale_factor=0, sample_num=0, spatial_scale=[0.25, 0.125, 0.0625, 0.03125]): ++ """ ++ feats (torch.Tensor): feats in shape (batch, 256, H, W). ++ rois (torch.Tensor): rois in shape (k, 5). ++ return: ++ roi_feats (torch.Tensor): (k, 256, pooled_width, pooled_width) ++ """ ++ ++ # phony implementation for shape inference ++ k = rois.size()[0] ++ roi_feats = torch.ones(k, 256, pooled_height, pooled_width) ++ return roi_feats ++ ++ @staticmethod ++ def symbolic(g, f0, f1, f2, f3, rois, aligned=0, finest_scale=56, pooled_height=7, pooled_width=7): ++ # TODO: support tensor list type for feats ++ #f_tensors = sym_help._unpack_list(feats) ++ roi_feats = g.op('RoiExtractor', f0, f1, f2, f3, rois, aligned_i=0, finest_scale_i=56, pooled_height_i=pooled_height, pooled_width_i=pooled_width, ++ pool_mode_s='avg', roi_scale_factor_i=0, sample_num_i=0, spatial_scale_f=[0.25, 0.125, 0.0625, 0.03125], outputs=1) ++ return roi_feats + + class ROIPooler(nn.Module): + """ +@@ -202,6 +227,12 @@ class ROIPooler(nn.Module): + A tensor of shape (M, C, output_size, output_size) where M is the total number of + boxes aggregated over all N batch images and C is the number of channels in `x`. 
+ """ ++ if torch.onnx.is_in_onnx_export(): ++ output_size = self.output_size[0] ++ pooler_fmt_boxes = convert_boxes_to_pooler_format(box_lists) ++ roi_feats = RoiExtractor.apply(x[0], x[1], x[2], x[3], pooler_fmt_boxes, 0, 56, output_size, output_size) ++ return roi_feats ++ + num_level_assignments = len(self.level_poolers) + + assert isinstance(x, list) and isinstance( +diff --git a/detectron2/modeling/proposal_generator/proposal_utils.py b/detectron2/modeling/proposal_generator/proposal_utils.py +index 9c10436..b3437a7 100644 +--- a/detectron2/modeling/proposal_generator/proposal_utils.py ++++ b/detectron2/modeling/proposal_generator/proposal_utils.py +@@ -4,7 +4,7 @@ import math + from typing import List, Tuple + import torch + +-from detectron2.layers import batched_nms, cat ++from detectron2.layers import batch_nms_op, cat + from detectron2.structures import Boxes, Instances + from detectron2.utils.env import TORCH_VERSION + +@@ -68,15 +68,19 @@ def find_top_rpn_proposals( + for level_id, (proposals_i, logits_i) in enumerate(zip(proposals, pred_objectness_logits)): + Hi_Wi_A = logits_i.shape[1] + if isinstance(Hi_Wi_A, torch.Tensor): # it's a tensor in tracing +- num_proposals_i = torch.clamp(Hi_Wi_A, max=pre_nms_topk) ++ num_proposals_i = torch.clamp(Hi_Wi_A, min=0, max=pre_nms_topk) + else: + num_proposals_i = min(Hi_Wi_A, pre_nms_topk) + + # sort is faster than topk: https://github.com/pytorch/pytorch/issues/22812 +- # topk_scores_i, topk_idx = logits_i.topk(num_proposals_i, dim=1) +- logits_i, idx = logits_i.sort(descending=True, dim=1) ++ num_proposals_i = num_proposals_i.item() ++ logits_i = logits_i.reshape(logits_i.size(1)) ++ topk_scores_i, topk_idx = torch.topk(logits_i, num_proposals_i) ++ topk_scores_i = topk_scores_i.reshape(1, topk_scores_i.size(0)) ++ topk_idx = topk_idx.reshape(1, topk_idx.size(0)) ++ '''logits_i, idx = logits_i.sort(descending=True, dim=1) + topk_scores_i = logits_i.narrow(1, 0, num_proposals_i) +- topk_idx = idx.narrow(1, 0, num_proposals_i) ++ topk_idx = idx.narrow(1, 0, num_proposals_i)''' + + # each is N x topk + topk_proposals_i = proposals_i[batch_idx[:, None], topk_idx] # N x topk x 4 +@@ -108,7 +112,7 @@ def find_top_rpn_proposals( + lvl = lvl[valid_mask] + boxes.clip(image_size) + +- # filter empty boxes ++ '''# filter empty boxes + keep = boxes.nonempty(threshold=min_box_size) + if _is_tracing() or keep.sum().item() != len(boxes): + boxes, scores_per_img, lvl = boxes[keep], scores_per_img[keep], lvl[keep] +@@ -126,7 +130,14 @@ def find_top_rpn_proposals( + res = Instances(image_size) + res.proposal_boxes = boxes[keep] + res.objectness_logits = scores_per_img[keep] ++ results.append(res)''' ++ ++ dets, labels = batch_nms_op(boxes.tensor, scores_per_img, 0, nms_thresh, post_nms_topk, post_nms_topk) ++ res = Instances(image_size) ++ res.proposal_boxes = Boxes(dets[:, :4]) ++ res.objectness_logits = dets[:, 4] + results.append(res) ++ + return results + + +diff --git a/detectron2/modeling/proposal_generator/rpn.py b/detectron2/modeling/proposal_generator/rpn.py +index 1675377..77d9f26 100644 +--- a/detectron2/modeling/proposal_generator/rpn.py ++++ b/detectron2/modeling/proposal_generator/rpn.py +@@ -434,7 +434,7 @@ class RPN(nn.Module): + else: + losses = {} + proposals = self.predict_proposals( +- anchors, pred_objectness_logits, pred_anchor_deltas, images.image_sizes ++ anchors, pred_objectness_logits, pred_anchor_deltas, [(1344, 1344)] + ) + return proposals, losses + +@@ -485,7 +485,8 @@ class RPN(nn.Module): + B = anchors_i.tensor.size(1) 
+ pred_anchor_deltas_i = pred_anchor_deltas_i.reshape(-1, B) + # Expand anchors to shape (N*Hi*Wi*A, B) +- anchors_i = anchors_i.tensor.unsqueeze(0).expand(N, -1, -1).reshape(-1, B) ++ s = torch.zeros(N, anchors_i.tensor.unsqueeze(0).size(1), anchors_i.tensor.unsqueeze(0).size(2)) ++ anchors_i = anchors_i.tensor.unsqueeze(0).expand_as(s).reshape(-1, B) + proposals_i = self.box2box_transform.apply_deltas(pred_anchor_deltas_i, anchors_i) + # Append feature map proposals with shape (N, Hi*Wi*A, B) + proposals.append(proposals_i.view(N, -1, B)) +diff --git a/detectron2/modeling/roi_heads/fast_rcnn.py b/detectron2/modeling/roi_heads/fast_rcnn.py +index 348f6a0..87c7cd3 100644 +--- a/detectron2/modeling/roi_heads/fast_rcnn.py ++++ b/detectron2/modeling/roi_heads/fast_rcnn.py +@@ -7,7 +7,7 @@ from torch import nn + from torch.nn import functional as F + + from detectron2.config import configurable +-from detectron2.layers import ShapeSpec, batched_nms, cat, cross_entropy, nonzero_tuple ++from detectron2.layers import ShapeSpec, batch_nms_op, cat, cross_entropy, nonzero_tuple + from detectron2.modeling.box_regression import Box2BoxTransform + from detectron2.structures import Boxes, Instances + from detectron2.utils.events import get_event_storage +@@ -144,7 +144,7 @@ def fast_rcnn_inference_single_image( + # Convert to Boxes to use the `clip` function ... + boxes = Boxes(boxes.reshape(-1, 4)) + boxes.clip(image_shape) +- boxes = boxes.tensor.view(-1, num_bbox_reg_classes, 4) # R x C x 4 ++ boxes = boxes.tensor.view(-1, num_bbox_reg_classes.item(), 4) # R x C x 4 + + # 1. Filter results based on detection scores. It can make NMS more efficient + # by filtering out low-confidence detections. +@@ -152,7 +152,7 @@ def fast_rcnn_inference_single_image( + # R' x 2. First column contains indices of the R predictions; + # Second column contains indices of classes. 
+ filter_inds = filter_mask.nonzero() +- if num_bbox_reg_classes == 1: ++ '''if num_bbox_reg_classes == 1: + boxes = boxes[filter_inds[:, 0], 0] + else: + boxes = boxes[filter_mask] +@@ -167,7 +167,14 @@ def fast_rcnn_inference_single_image( + result = Instances(image_shape) + result.pred_boxes = Boxes(boxes) + result.scores = scores +- result.pred_classes = filter_inds[:, 1] ++ result.pred_classes = filter_inds[:, 1]''' ++ ++ dets, labels = batch_nms_op(boxes, scores, score_thresh, nms_thresh, topk_per_image, topk_per_image) ++ result = Instances(image_shape) ++ result.pred_boxes = Boxes(dets[:, :4]) ++ result.scores = dets.permute(1, 0)[4, :] ++ result.pred_classes = labels ++ + return result, filter_inds[:, 0] + + +diff --git a/detectron2/modeling/roi_heads/mask_head.py b/detectron2/modeling/roi_heads/mask_head.py +index 5ac5c4b..f81b96b 100644 +--- a/detectron2/modeling/roi_heads/mask_head.py ++++ b/detectron2/modeling/roi_heads/mask_head.py +@@ -142,7 +142,9 @@ def mask_rcnn_inference(pred_mask_logits: torch.Tensor, pred_instances: List[Ins + num_masks = pred_mask_logits.shape[0] + class_pred = cat([i.pred_classes for i in pred_instances]) + indices = torch.arange(num_masks, device=class_pred.device) +- mask_probs_pred = pred_mask_logits[indices, class_pred][:, None].sigmoid() ++ print(indices,class_pred) ++ # mask_probs_pred = pred_mask_logits[indices, class_pred][:, None].sigmoid() ++ mask_probs_pred = pred_mask_logits.sigmoid() + # mask_probs_pred.shape: (B, 1, Hmask, Wmask) + + num_boxes_per_image = [len(i) for i in pred_instances] +diff --git a/detectron2/structures/boxes.py b/detectron2/structures/boxes.py +index 57f862a..bad473b 100644 +--- a/detectron2/structures/boxes.py ++++ b/detectron2/structures/boxes.py +@@ -202,10 +202,11 @@ class Boxes: + """ + assert torch.isfinite(self.tensor).all(), "Box tensor contains infinite or NaN!" + h, w = box_size +- x1 = self.tensor[:, 0].clamp(min=0, max=w) +- y1 = self.tensor[:, 1].clamp(min=0, max=h) +- x2 = self.tensor[:, 2].clamp(min=0, max=w) +- y2 = self.tensor[:, 3].clamp(min=0, max=h) ++ boxes_prof = self.tensor.permute(1, 0) ++ x1 = boxes_prof[0, :].clamp(min=0, max=w) ++ y1 = boxes_prof[1, :].clamp(min=0, max=h) ++ x2 = boxes_prof[2, :].clamp(min=0, max=w) ++ y2 = boxes_prof[3, :].clamp(min=0, max=h) + self.tensor = torch.stack((x1, y1, x2, y2), dim=-1) + + def nonempty(self, threshold: float = 0.0) -> torch.Tensor: +diff --git a/tools/deploy/export_model.py b/tools/deploy/export_model.py +index fe2fe30..22145b7 100755 +--- a/tools/deploy/export_model.py ++++ b/tools/deploy/export_model.py +@@ -77,6 +77,28 @@ def export_scripting(torch_model): + # TODO inference in Python now missing postprocessing glue code + return None + ++from typing import Dict, Tuple ++import numpy ++from detectron2.structures import ImageList ++def preprocess_image(batched_inputs: Tuple[Dict[str, torch.Tensor]]): ++ """ ++ Normalize, pad and batch the input images. 
++ """ ++ images = [x["image"].to('cpu') for x in batched_inputs] ++ images = [(x - numpy.array([[[103.530]], [[116.280]], [[123.675]]])) / numpy.array([[[1.]], [[1.]], [[1.]]]) for x in images] ++ import torch.nn.functional as F ++ image = torch.zeros(0, 1344, 1344) ++ for i in range(images[0].size(0)): ++ img = images[0][i] ++ img = img.expand((1, 1, img.size(0), img.size(1))) ++ img = img.to(dtype=torch.float32) ++ img = F.interpolate(img, size=(int(1344), int(1344)), mode='bilinear', align_corners=False) ++ img = img[0][0] ++ img = img.unsqueeze(0) ++ image = torch.cat((image, img)) ++ images = [image] ++ images = ImageList.from_tensors(images, 32) ++ return images + + # experimental. API not yet final + def export_tracing(torch_model, inputs): +@@ -84,6 +106,8 @@ def export_tracing(torch_model, inputs): + image = inputs[0]["image"] + inputs = [{"image": image}] # remove other unused keys + ++ inputs = preprocess_image(inputs).tensor.to(torch.float32) ++ image = inputs + if isinstance(torch_model, GeneralizedRCNN): + + def inference(model, inputs): +@@ -104,7 +128,7 @@ def export_tracing(torch_model, inputs): + elif args.format == "onnx": + # NOTE onnx export currently failing in pytorch + with PathManager.open(os.path.join(args.output, "model.onnx"), "wb") as f: +- torch.onnx.export(traceable_model, (image,), f) ++ torch.onnx.export(traceable_model, (image,), f, opset_version=11, verbose=True) + logger.info("Inputs schema: " + str(traceable_model.inputs_schema)) + logger.info("Outputs schema: " + str(traceable_model.outputs_schema)) + + +``` + **修改依据:** +> 1.slice,topk算子问题导致pre_nms_topk未生效,atc转换报错,修改参见maskrcnn_detectron2.diff +> 2.expand会引入where动态算子因此用expand_as替换 +> 3.slice跑在aicpu有错误,所以改为dx = denorm_deltas[:, :, 0:1:].view(-1, 80) / wx,使其运行在aicore上 +> 4.atc转换时根据日志中报错的算子在转onnx时的verbose打印中找到其对应的python代码,然后找到规避方法解决,具体修改参见maskrcnn_detectron2.diff +> 5.其它地方的修改原因参见精度调试与性能优化 + + +通过打补丁的方式修改detectron2: +```shell +cd detectron2 +patch -p1 < ../maskrcnn_detectron2.diff +cd .. 
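+# 补充校验示例(非原始交付步骤,假设detectron2目录为前文git clone所得):
+# 打补丁后可在detectron2目录下执行 git diff --stat 查看被修改的文件,确认与maskrcnn_detectron2.diff的改动范围一致
+# 若因detectron2版本差异导致补丁部分应用失败,patch会生成对应的*.rej文件,可据此手动合入剩余修改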
+``` +4.修改pytorch代码去除导出onnx时进行检查 +将/usr/local/python3.7.5/lib/python3.7/site-packages/torch/onnx/utils.py文件的_check_onnx_proto(proto)改为pass + +5.准备coco2017验证集,数据集获取参见本文第四章第一节 +在当前目录按结构构造数据集:datasets/coco目录下有annotations与val2017,annotations目录存放coco数据集的instances_val2017.json,val2017目录存放coco数据集的5000张验证图片。 +或者修改detectron2/detectron2/data/datasets/builtin.py为_root = os.getenv("DETECTRON2_DATASETS", "/opt/npu/dataset/")指定coco数据集所在的目录/opt/npu/dataset/。 + +6.运行如下命令,在output目录生成model.onnx +```shell +python3.7 detectron2/tools/deploy/export_model.py --config-file detectron2/configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml --output ./output --export-method tracing --format onnx MODEL.WEIGHTS model_final.pth MODEL.DEVICE cpu + +mv output/model.onnx model_py1.8.onnx +``` + +### 3.2 onnx转om模型 + +1.设置环境变量 +```shell +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考[CANN V100R020C10 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164868?idPath=23710424%7C251366513%7C22892968%7C251168373),需要指定输出节点以去除无用输出,使用netron开源可视化工具查看具体的输出节点名: +```shell +atc --model=model_py1.8.onnx --framework=5 --output=maskrcnn_detectron2_npu --input_format=NCHW --input_shape="0:1,3,1344,1344" --out_nodes="Cast_1673:0;Gather_1676:0;Reshape_1667:0;Slice_1706:0" --log=debug --soc_version=Ascend310 +``` + +## 4 数据集预处理 + +- **[数据集获取](#41-数据集获取)** + +- **[数据集预处理](#42-数据集预处理)** + +- **[生成数据集信息文件](#43-生成数据集信息文件)** + +### 4.1 数据集获取 +该模型使用[COCO官网](https://cocodataset.org/#download)的coco2017的5千张验证集进行测试,图片与标签分别存放在/opt/npu/dataset/coco/val2017/与/opt/npu/dataset/coco/annotations/instances_val2017.json。 + +### 4.2 数据集预处理 +1.预处理脚本maskrcnn_pth_preprocess_detectron2.py +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import argparse +import numpy as np +import cv2 +import torch +import multiprocessing + +def resize(img, size): + old_h = img.shape[0] + old_w = img.shape[1] + scale_ratio = 800 / min(old_w, old_h) + new_w = int(np.floor(old_w * scale_ratio)) + new_h = int(np.floor(old_h * scale_ratio)) + if max(new_h, new_w) > 1333: + scale = 1333 / max(new_h, new_w) + new_h = new_h * scale + new_w = new_w * scale + new_w = int(new_w + 0.5) + new_h = int(new_h + 0.5) + resized_img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR) + return resized_img + +def gen_input_bin(file_batches, batch): + i = 0 + for file in file_batches[batch]: + i = i + 1 + print("batch", batch, file, "===", i) + + image = cv2.imread(os.path.join(flags.image_src_path, file), cv2.IMREAD_COLOR) + image = resize(image, (800, 1333)) + mean = np.array([103.53, 116.28, 123.675], dtype=np.float32) + std = np.array([1., 1., 1.], dtype=np.float32) + img = image.copy().astype(np.float32) + mean = np.float64(mean.reshape(1, -1)) + std = 1 / np.float64(std.reshape(1, -1)) + cv2.subtract(img, mean, img) + cv2.multiply(img, std, img) + img = cv2.copyMakeBorder(img, 0, flags.model_input_height - img.shape[0], 0, flags.model_input_width - img.shape[1], cv2.BORDER_CONSTANT, value=0) + #os.makedirs('./paded_jpg/', exist_ok=True) + #cv2.imwrite('./paded_jpg/' + file.split('.')[0] + '.jpg', img) + img = img.transpose(2, 0, 1) + img.tofile(os.path.join(flags.bin_file_path, file.split('.')[0] + ".bin")) + +def preprocess(src_path, save_path): + files = os.listdir(src_path) + file_batches = [files[i:i + 100] for i in range(0, 5000, 100) if files[i:i + 100] != []] + thread_pool = multiprocessing.Pool(len(file_batches)) + for batch in range(len(file_batches)): + thread_pool.apply_async(gen_input_bin, args=(file_batches, batch)) + thread_pool.close() + thread_pool.join() + print("in thread, except will not report! please ensure bin files generated.") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='preprocess of MaskRCNN PyTorch model') + parser.add_argument("--image_src_path", default="./coco2017/", help='image of dataset') + parser.add_argument("--bin_file_path", default="./coco2017_bin/", help='Preprocessed image buffer') + parser.add_argument("--model_input_height", default=1344, type=int, help='input tensor height') + parser.add_argument("--model_input_width", default=1344, type=int, help='input tensor width') + flags = parser.parse_args() + if not os.path.exists(flags.bin_file_path): + os.makedirs(flags.bin_file_path) + preprocess(flags.image_src_path, flags.bin_file_path) +``` +2.执行预处理脚本,生成数据集预处理后的bin文件 +```shell +python3.7 maskrcnn_pth_preprocess_detectron2.py --image_src_path=/opt/npu/dataset/coco/val2017 --bin_file_path=val2017_bin --model_input_height=1344 --model_input_width=1344 +``` +### 4.3 生成数据集信息文件 +1.生成数据集信息文件脚本get_info.py +```python +import os +import sys +import cv2 +from glob import glob + + +def get_bin_info(file_path, info_name, width, height): + bin_images = glob(os.path.join(file_path, '*.bin')) + with open(info_name, 'w') as file: + for index, img in enumerate(bin_images): + content = ' '.join([str(index), img, width, height]) + file.write(content) + file.write('\n') + + +def get_jpg_info(file_path, info_name): + extensions = ['jpg', 'jpeg', 'JPG', 'JPEG'] + image_names = [] + for extension in extensions: + image_names.append(glob(os.path.join(file_path, '*.' 
+ extension))) + with open(info_name, 'w') as file: + for image_name in image_names: + if len(image_name) == 0: + continue + else: + for index, img in enumerate(image_name): + img_cv = cv2.imread(img) + shape = img_cv.shape + width, height = shape[1], shape[0] + content = ' '.join([str(index), img, str(width), str(height)]) + file.write(content) + file.write('\n') + + +if __name__ == '__main__': + file_type = sys.argv[1] + file_path = sys.argv[2] + info_name = sys.argv[3] + if file_type == 'bin': + width = sys.argv[4] + height = sys.argv[5] + assert len(sys.argv) == 6, 'The number of input parameters must be equal to 5' + get_bin_info(file_path, info_name, width, height) + elif file_type == 'jpg': + assert len(sys.argv) == 4, 'The number of input parameters must be equal to 3' + get_jpg_info(file_path, info_name) +``` +2.执行生成数据集信息脚本,生成数据集信息文件 +```shell +python3.7 get_info.py bin val2017_bin maskrcnn.info 1344 1344 +``` +第一个参数为模型输入的类型,第二个参数为生成的bin文件路径,第三个为输出的info文件,后面为宽高信息 +## 5 离线推理 + +- **[benchmark工具概述](#51-benchmark工具概述)** + +- **[离线推理](#52-离线推理)** + +### 5.1 benchmark工具概述 + +benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考[CANN V100R020C10 推理benchmark工具用户指南 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164874?idPath=23710424%7C251366513%7C22892968%7C251168373) +### 5.2 离线推理 +1.设置环境变量 +```shell +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.执行离线推理 +```shell +./benchmark.x86_64 -model_type=vision -om_path=maskrcnn_detectron2_npu.om -device_id=0 -batch_size=1 -input_text_path=maskrcnn.info -input_width=1344 -input_height=1344 -useDvpp=false -output_binary=true +``` +输出结果默认保存在当前目录result/dumpOutput_device0,模型有四个输出,每个输入对应的输出对应四个_x.bin文件 +``` +输出 shape 数据类型 数据含义 +output1 100 * 4 FP32 boxes +output2 100 * 1 FP32 scores +output3 100 * 1 INT64 labels +output4 100 * 80 * 28 * 28 FP32 masks +``` + +## 6 精度对比 + +- **[离线推理精度](#61-离线推理精度)** +- **[开源精度](#62-开源精度)** +- **[精度对比](#63-精度对比)** + +### 6.1 离线推理精度统计 + +后处理统计map精度 +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
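+
+# 补充注释(非原脚本内容): 后处理读取benchmark输出的四个bin文件(boxes 100x4、scores 100x1、
+# labels 100x1、masks 100x80x28x28),用paste_masks_in_image还原mask,再按预处理的缩放比例把box与mask
+# 映射回原图尺寸,写出每张图的检测结果txt,最后调用detectron2的COCOEvaluator统计map精度。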
+ +import os +import argparse +import cv2 +import numpy as np + +def postprocess_bboxes(bboxes, image_size, net_input_width, net_input_height): + org_w = image_size[0] + org_h = image_size[1] + + scale = 800 / min(org_w, org_h) + new_w = int(np.floor(org_w * scale)) + new_h = int(np.floor(org_h * scale)) + if max(new_h, new_w) > 1333: + scale = 1333 / max(new_h, new_w) * scale + + bboxes[:, 0] = (bboxes[:, 0]) / scale + bboxes[:, 1] = (bboxes[:, 1]) / scale + bboxes[:, 2] = (bboxes[:, 2]) / scale + bboxes[:, 3] = (bboxes[:, 3]) / scale + + return bboxes + +def postprocess_masks(masks, image_size, net_input_width, net_input_height): + org_w = image_size[0] + org_h = image_size[1] + + scale = 800 / min(org_w, org_h) + new_w = int(np.floor(org_w * scale)) + new_h = int(np.floor(org_h * scale)) + if max(new_h, new_w) > 1333: + scale = 1333 / max(new_h, new_w) * scale + + pad_w = net_input_width - org_w * scale + pad_h = net_input_height - org_h * scale + top = 0 + left = 0 + hs = int(net_input_height - pad_h) + ws = int(net_input_width - pad_w) + + masks = masks.to(dtype=torch.float32) + res_append = torch.zeros(0, org_h, org_w) + if torch.cuda.is_available(): + res_append = res_append.to(device='cuda') + for i in range(masks.size(0)): + mask = masks[i][0][top:hs, left:ws] + mask = mask.expand((1, 1, mask.size(0), mask.size(1))) + mask = F.interpolate(mask, size=(int(org_h), int(org_w)), mode='bilinear', align_corners=False) + mask = mask[0][0] + mask = mask.unsqueeze(0) + res_append = torch.cat((res_append, mask)) + + return res_append[:, None] + +import pickle +def save_variable(v, filename): + f = open(filename, 'wb') + pickle.dump(v, f) + f.close() +def load_variavle(filename): + f = open(filename, 'rb') + r = pickle.load(f) + f.close() + return r + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument("--test_annotation", default="./origin_pictures.info") + parser.add_argument("--bin_data_path", default="./result/dumpOutput_device0/") + parser.add_argument("--det_results_path", default="./detection-results/") + parser.add_argument("--net_out_num", type=int, default=4) + parser.add_argument("--net_input_width", type=int, default=1344) + parser.add_argument("--net_input_height", type=int, default=1344) + parser.add_argument("--ifShowDetObj", action="store_true", help="if input the para means True, neither False.") + flags = parser.parse_args() + + img_size_dict = dict() + with open(flags.test_annotation)as f: + for line in f.readlines(): + temp = line.split(" ") + img_file_path = temp[1] + img_name = temp[1].split("/")[-1].split(".")[0] + img_width = int(temp[2]) + img_height = int(temp[3]) + img_size_dict[img_name] = (img_width, img_height, img_file_path) + + bin_path = flags.bin_data_path + det_results_path = flags.det_results_path + os.makedirs(det_results_path, exist_ok=True) + total_img = set([name[:name.rfind('_')] for name in os.listdir(bin_path) if "bin" in name]) + + import torch + from torchvision.models.detection.roi_heads import paste_masks_in_image + import torch.nn.functional as F + from detectron2.evaluation import COCOEvaluator + from detectron2.structures import Boxes, Instances + from detectron2.data import DatasetCatalog, MetadataCatalog + import logging + logging.basicConfig(level=logging.INFO) + evaluator = COCOEvaluator('coco_2017_val') + evaluator.reset() + coco_class_map = {id:name for id, name in enumerate(MetadataCatalog.get('coco_2017_val').thing_classes)} + results = [] + + cnt = 0 + for bin_file in sorted(total_img): + cnt = cnt 
+ 1 + print(cnt - 1, bin_file) + path_base = os.path.join(bin_path, bin_file) + res_buff = [] + for num in range(1, flags.net_out_num + 1): + if os.path.exists(path_base + "_" + str(num) + ".bin"): + if num == 1: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + buf = np.reshape(buf, [100, 4]) + elif num == 2: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + buf = np.reshape(buf, [100, 1]) + elif num == 3: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="int64") + buf = np.reshape(buf, [100, 1]) + elif num == 4: + bboxes = np.fromfile(path_base + "_" + str(num - 3) + ".bin", dtype="float32") + bboxes = np.reshape(bboxes, [100, 4]) + bboxes = torch.from_numpy(bboxes) + scores = np.fromfile(path_base + "_" + str(num - 2) + ".bin", dtype="float32") + scores = np.reshape(scores, [100, 1]) + scores = torch.from_numpy(scores) + labels = np.fromfile(path_base + "_" + str(num - 1) + ".bin", dtype="int64") + labels = np.reshape(labels, [100, 1]) + labels = torch.from_numpy(labels) + mask_pred = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + mask_pred = np.reshape(mask_pred, [100, 80, 28, 28]) + mask_pred = torch.from_numpy(mask_pred) + + org_img_size = img_size_dict[bin_file][:2] + result = Instances((org_img_size[1], org_img_size[0])) + + if torch.cuda.is_available(): + mask_pred = mask_pred.to(device='cuda') + img_shape = (flags.net_input_height, flags.net_input_width) + mask_pred = mask_pred[range(len(mask_pred)), labels[:, 0]][:, None] + masks = paste_masks_in_image(mask_pred, bboxes[:, :4], img_shape) + masks = masks >= 0.5 + masks = postprocess_masks(masks, img_size_dict[bin_file], flags.net_input_width, flags.net_input_height) + if torch.cuda.is_available(): + masks = masks.cpu() + masks = masks.squeeze(1) + result.pred_masks = masks + + '''masks = masks.numpy() + img = masks[0] + from PIL import Image + for j in range(len(masks)): + mask = masks[j] + mask = mask.astype(bool) + img[mask] = img[mask] + 1 + imag = Image.fromarray((img * 255).astype(np.uint8)) + imag.save(os.path.join('.', bin_file + '.png'))''' + + predbox = postprocess_bboxes(bboxes, org_img_size, flags.net_input_height, flags.net_input_width) + result.pred_boxes = Boxes(predbox) + result.scores = scores.reshape([100]) + result.pred_classes = labels.reshape([100]) + + results.append({"instances": result}) + + res_buff.append(buf) + else: + print("[ERROR] file not exist", path_base + "_" + str(num) + ".bin") + + current_img_size = img_size_dict[bin_file] + res_bboxes = np.concatenate(res_buff, axis=1) + predbox = postprocess_bboxes(res_bboxes, current_img_size, flags.net_input_width, flags.net_input_height) + + if flags.ifShowDetObj == True: + imgCur = cv2.imread(current_img_size[2]) + + det_results_str = '' + for idx, class_idx in enumerate(predbox[:, 5]): + if float(predbox[idx][4]) < float(0.05): + #if float(predbox[idx][4]) < float(0): + continue + if class_idx < 0 or class_idx > 80: + continue + + class_name = coco_class_map[int(class_idx)] + det_results_str += "{} {} {} {} {} {}\n".format(class_name, str(predbox[idx][4]), predbox[idx][0], + predbox[idx][1], predbox[idx][2], predbox[idx][3]) + + if flags.ifShowDetObj == True: + imgCur = cv2.rectangle(imgCur, (int(predbox[idx][0]), int(predbox[idx][1])), (int(predbox[idx][2]), int(predbox[idx][3])), (0,255,0), 2) + imgCur = cv2.putText(imgCur, class_name, (int(predbox[idx][0]), int(predbox[idx][1])), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1) + #imgCur = cv2.putText(imgCur, 
str(predbox[idx][4]), (int(predbox[idx][0]), int(predbox[idx][1])),cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1) + + if flags.ifShowDetObj == True: + cv2.imwrite(os.path.join(det_results_path, bin_file +'.jpg'), imgCur, [int(cv2.IMWRITE_JPEG_QUALITY), 70]) + + det_results_file = os.path.join(det_results_path, bin_file + ".txt") + with open(det_results_file, "w") as detf: + detf.write(det_results_str) + + #save_variable(results, './results.txt') + #results = load_variavle('./results.txt') + inputs = DatasetCatalog.get('coco_2017_val')[:5000] + evaluator.process(inputs, results) + evaluator.evaluate() +``` +调用maskrcnn_pth_postprocess_detectron2.py评测map精度: +```shell +python3.7 get_info.py jpg /opt/npu/dataset/coco/val2017 maskrcnn_jpeg.info + +python3.7 maskrcnn_pth_postprocess_detectron2.py --bin_data_path=./result/dumpOutput_device0/ --test_annotation=maskrcnn_jpeg.info --det_results_path=./ret_npuinfer/ --net_out_num=4 --net_input_height=1344 --net_input_width=1344 --ifShowDetObj +``` +第一个参数为benchmark推理结果,第二个为原始图片信息文件,第三个为后处理输出结果,第四个为网络输出个数,第五六个为网络高宽,第七个为是否将box画在图上显示 +执行完后会打印出精度: +``` +INFO:detectron2.data.datasets.coco:Loaded 5000 images in COCO format from /opt/npu/dataset/coco/annotations/instances_val2017.json +INFO:detectron2.evaluation.coco_evaluation:Preparing results for COCO format ... +INFO:detectron2.evaluation.coco_evaluation:Evaluating predictions with unofficial COCO API... +Loading and preparing results... +DONE (t=2.16s) +creating index... +index created! +INFO:detectron2.evaluation.fast_eval_api:Evaluate annotation type *bbox* +INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.evaluate() finished in 21.80 seconds. +INFO:detectron2.evaluation.fast_eval_api:Accumulating evaluation results... +INFO:detectron2.evaluation.fast_eval_api:COCOeval_opt.accumulate() finished in 2.61 seconds. 
+Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.326 +Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.536 +Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.349 +Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.179 +Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.366 +Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.432 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.282 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.444 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.465 +Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.269 +Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.508 +Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.609 +INFO:detectron2.evaluation.coco_evaluation:Evaluation results for bbox: +| AP | AP50 | AP75 | APs | APm | APl | +|:------:|:------:|:------:|:------:|:------:|:------:| +| 32.586 | 53.634 | 34.852 | 17.862 | 36.613 | 43.174 | +INFO:detectron2.evaluation.coco_evaluation:Per-category bbox AP: +| category | AP | category | AP | category | AP | +|:--------------|:-------|:-------------|:-------|:---------------|:-------| +| person | 48.933 | bicycle | 24.620 | car | 37.483 | +| motorcycle | 33.410 | airplane | 50.975 | bus | 54.898 | +| train | 51.864 | truck | 26.716 | boat | 20.755 | +| traffic light | 20.305 | fire hydrant | 58.144 | stop sign | 58.833 | +| parking meter | 41.813 | bench | 17.210 | bird | 29.444 | +| cat | 57.738 | dog | 52.853 | horse | 51.333 | +| sheep | 40.341 | cow | 41.568 | elephant | 56.160 | +| bear | 63.240 | zebra | 59.121 | giraffe | 57.166 | +| backpack | 11.226 | umbrella | 29.385 | handbag | 8.685 | +| tie | 24.923 | suitcase | 27.242 | frisbee | 53.933 | +| skis | 16.987 | snowboard | 24.268 | sports ball | 40.009 | +| kite | 34.285 | baseball bat | 17.073 | baseball glove | 25.865 | +| skateboard | 39.694 | surfboard | 28.035 | tennis racket | 37.552 | +| bottle | 30.593 | wine glass | 26.470 | cup | 33.779 | +| fork | 19.335 | knife | 11.024 | spoon | 8.761 | +| bowl | 33.928 | banana | 18.034 | apple | 15.394 | +| sandwich | 27.732 | orange | 26.546 | broccoli | 19.022 | +| carrot | 15.449 | hot dog | 25.118 | pizza | 44.402 | +| donut | 35.096 | cake | 23.876 | chair | 18.866 | +| couch | 32.443 | potted plant | 18.701 | bed | 33.585 | +| dining table | 20.164 | toilet | 46.354 | tv | 48.705 | +| laptop | 50.107 | mouse | 47.597 | remote | 20.899 | +| keyboard | 40.454 | cell phone | 28.115 | microwave | 43.190 | +| oven | 25.974 | toaster | 13.432 | sink | 27.114 | +| refrigerator | 42.467 | book | 10.420 | clock | 44.894 | +| vase | 30.559 | scissors | 25.719 | teddy bear | 36.704 | +| hair drier | 0.000 | toothbrush | 11.796 | | | +``` + + **精度调试:** +> 1.根据代码语义RoiExtractor参数finest_scale不是224而是56 +> 2.因gather算子处理-1会导致每张图的第一个score为0,故maskrcnn_detectron2.diff中已将dets[:, -1]改为dets[:, 4] +> 3.单张图调试 +> ``` +> demo.py分数改为0.05,defaults.py MIN_SIZE_TEST与MAX_SIZE_TEST改为1344: +> python3.7 demo.py --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml --input 000000252219_1344x1344.jpg --opts MODEL.WEIGHTS ../../model_final.pth MODEL.DEVICE cpu +> 说明: +> 比较pth的rpn与om的rpn输出前提是detectron2/config/defaults.py的_C.INPUT.MIN_SIZE_TEST与_C.INPUT.MAX_SIZE_TEST要改为1344,并且注意因为000000252219_1344x1344.jpg 
是等比例缩放四边加pad的处理结果,因此pth推理时等价于先进行了pad然后再进行标准化的,因此图片tensor边缘是负均值。开始误认为预处理与mmdetection相同因此SIZE_TEST的值与000000252219_1344x1344.jpg缩放是按上述方式处理的,经此与后面的调试步骤发现预处理与mmdetection不同。om算子输出与开源pth推理时变量的打印值对比,找到输出不对的算子,发现前处理均值方差不同于mmdetection框架,且是BGR序。 +> ``` +> 4.精度调试 +> ``` +> 对开源代码预处理与参数修改,使得cpu,gpu版的pth推理达到npu版代码的pth推理精度,参见本文第七章第二节T4精度数据的diff文件与执行精度测评的命令。 +> 说明: +> 1.查看npu固定1344,1344的前处理方式(缩放加pad) +> from torchvision import utils as vutils +> vutils.save_image(images.tensor, 'test.jpg') +> FIX_SHAPE->./detectron2/data/dataset_mapper.py->ResizeShortestEdge,最短边800最大1333。 +> 2.cpu与gpu开源代码推理pth精度与npu代码推理pth差2到3个点,npu代码(基于detectron2 v0.2.1)更改roi_align.py为开源的代码后推理发现pth精度下降2到3个点,最终发现是aligned参数问题,注意插件缺陷导致om中设置该参数未能生效。 +> ``` + + +### 6.2 开源精度 +[官网精度](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch) + +参考[npu版detectron2框架的maskrcnn](https://gitee.com/ascend/modelzoo/tree/master/built-in/PyTorch/Official/cv/image_object_detection/Faster_Mask_RCNN_for_PyTorch),安装依赖PyTorch(NPU版本)与设置环境变量,在npu上执行推理,测得npu精度如下: +```shell +python3.7 -m pip install -e Faster_Mask_RCNN_for_PyTorch +cd Faster_Mask_RCNN_for_PyTorch +修改eval.sh的配置文件与权重文件分别为mask_rcnn_R_101_FPN_3x.yaml与model_final.pth,删除mask_rcnn_R_101_FPN_3x.yaml的SOLVER和DATALOADER配置,datasets/coco下面放置coco2017验证集图片与标签(参考本文第三章第一节步骤五) +./eval.sh +``` +``` +Task: bbox +AP,AP50,AP75,APs,APm,APl +33.0103,53.5686,35.5192,17.8069,36.9325,44.0201 +Task: segm +AP,AP50,AP75,APs,APm,APl +30.3271,50.4665,31.8223,12.9573,33.0375,44.8537 +``` +### 6.3 精度对比 +om推理box map精度为0.326,npu推理box map精度为0.330,npu输出400个框精度更高点但性能较低,精度下降在1个点之内,因此可视为精度达标 + +## 7 性能对比 + +- **[npu性能数据](#71-npu性能数据)** +- **[T4性能数据](#72-T4性能数据)** +- **[性能对比](#73-性能对比)** + +### 7.1 npu性能数据 +batch1的性能: + 测试npu性能要确保device空闲,使用npu-smi info命令可查看device是否在运行其它推理任务 +``` +./benchmark.x86_64 -round=20 -om_path=maskrcnn_detectron2_npu.om -device_id=0 -batch_size=1 +``` +执行20次纯推理取均值,统计吞吐率与其倒数时延(benchmark的时延是单个数据的推理时间),npu性能是一个device执行的结果 +``` +[INFO] Dataset number: 19 finished cost 439.142ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_maskrcnn_detectron2_npu_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 2.27773samples/s, ave_latency: 440.813ms +---------------------------------------------------------------- +``` +maskrcnn detectron2不支持多batch + + **性能优化:** +> 查看profiling导出的op_statistic_0_1.csv算子总体耗时统计发现gather算子耗时最多,然后查看profiling导出的task_time_0_1.csv找到具体哪些gather算子耗时最多,通过导出onnx的verbose打印找到具体算子对应的代码,因gather算子计算最后一个轴会很耗时,因此通过转置后计算0轴规避,比如maskrcnn_detectron2.diff文件中的如下修改: +> ``` +> boxes_prof = boxes.permute(1, 0) +> widths = boxes_prof[2, :] - boxes_prof[0, :] +> ``` +> + + +### 7.2 T4性能数据 +batch1性能: +onnx包含自定义算子,因此不能使用开源TensorRT测试性能数据,故在T4机器上使用pth在线推理测试性能数据 + +依据npu版代码修改cpu,gpu版detectron2,参见maskrcnn_pth_npu.diff: +```diff +diff --git a/detectron2/data/dataset_mapper.py b/detectron2/data/dataset_mapper.py +index 0e77851..0d03c08 100644 +--- a/detectron2/data/dataset_mapper.py ++++ b/detectron2/data/dataset_mapper.py +@@ -4,6 +4,7 @@ import logging + import numpy as np + from typing import List, Optional, Union + import torch ++from torch.nn import functional as F + + from detectron2.config import configurable + +@@ -133,6 +134,7 @@ class DatasetMapper: + + aug_input = T.AugInput(image, sem_seg=sem_seg_gt) + transforms = self.augmentations(aug_input) ++ print(self.augmentations,transforms) + image, sem_seg_gt = aug_input.image, aug_input.sem_seg + + image_shape = image.shape[:2] # h, w +@@ 
-140,6 +142,20 @@ class DatasetMapper: + # but not efficient on large generic data structures due to the use of pickle & mp.Queue. + # Therefore it's important to use torch.Tensor. + dataset_dict["image"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1))) ++ ++ size_divisibility = 32 ++ pad_value = 0 ++ pixel_mean = torch.Tensor([103.53, 116.28, 123.675]).view(-1, 1, 1) ++ pixel_std = torch.Tensor([1.0, 1.0, 1.0]).view(-1, 1, 1) ++ images = (dataset_dict["image"] - pixel_mean) / pixel_std ++ dataset_dict["image_size"] = tuple(images.shape[-2:]) ++ batch_shape = (3, 1344, 1344) ++ padding_size = [0, batch_shape[-1] - images.shape[-1], ++ 0, batch_shape[-2] - images.shape[-2]] ++ padded = F.pad(images, padding_size, value=pad_value) ++ batched_imgs = padded.unsqueeze_(0) ++ dataset_dict["image_preprocess"] = batched_imgs.contiguous() ++ + if sem_seg_gt is not None: + dataset_dict["sem_seg"] = torch.as_tensor(sem_seg_gt.astype("long")) + +diff --git a/detectron2/layers/roi_align.py b/detectron2/layers/roi_align.py +index bcbf5f4..23b138d 100644 +--- a/detectron2/layers/roi_align.py ++++ b/detectron2/layers/roi_align.py +@@ -38,7 +38,7 @@ class ROIAlign(nn.Module): + self.output_size = output_size + self.spatial_scale = spatial_scale + self.sampling_ratio = sampling_ratio +- self.aligned = aligned ++ self.aligned = False + + from torchvision import __version__ + +diff --git a/detectron2/modeling/meta_arch/rcnn.py b/detectron2/modeling/meta_arch/rcnn.py +index e5f66d1..b9ffa66 100644 +--- a/detectron2/modeling/meta_arch/rcnn.py ++++ b/detectron2/modeling/meta_arch/rcnn.py +@@ -202,6 +202,9 @@ class GeneralizedRCNN(nn.Module): + images = self.preprocess_image(batched_inputs) + features = self.backbone(images.tensor) + ++ #from torchvision import utils as vutils ++ #vutils.save_image(images.tensor, 'test.jpg') ++ print(features['p2'].shape) + if detected_instances is None: + if self.proposal_generator is not None: + proposals, _ = self.proposal_generator(images, features, None) +@@ -224,10 +227,14 @@ class GeneralizedRCNN(nn.Module): + """ + Normalize, pad and batch the input images. + """ +- images = [x["image"].to(self.device) for x in batched_inputs] ++ '''images = [x["image"].to(self.device) for x in batched_inputs] + images = [(x - self.pixel_mean) / self.pixel_std for x in images] + images = ImageList.from_tensors(images, self.backbone.size_divisibility) +- return images ++ return images''' ++ images = [x["image_preprocess"].to(device=self.device) for x in batched_inputs] ++ images = torch.cat(images, dim=0) ++ image_sizes = [x["image_size"] for x in batched_inputs] ++ return ImageList(images, image_sizes) + + @staticmethod + def _postprocess(instances, batched_inputs: Tuple[Dict[str, torch.Tensor]], image_sizes): +diff --git a/detectron2/modeling/postprocessing.py b/detectron2/modeling/postprocessing.py +index f42e77c..909923a 100644 +--- a/detectron2/modeling/postprocessing.py ++++ b/detectron2/modeling/postprocessing.py +@@ -55,6 +55,7 @@ def detector_postprocess( + output_boxes = None + assert output_boxes is not None, "Predictions must contain boxes!" 
+ ++ print(scale_x, scale_y) + output_boxes.scale(scale_x, scale_y) + output_boxes.clip(results.image_size) + + +``` +测评T4精度与性能: +```shell +git clone https://github.com/facebookresearch/detectron2 +python3.7 -m pip install -e detectron2 +cd detectron2 +patch -p1 < ../maskrcnn_pth_npu.diff +cd tools +mkdir datasets +cp -rf ../../datasets/coco datasets/(数据集构造参考本文第三章第一节步骤五) +python3.7 train_net.py --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml --eval-only MODEL.WEIGHTS ../../model_final.pth MODEL.DEVICE cuda:0 +``` +``` +Inference done 4993/5000. 0.2937 s / img. +``` + +### 7.3 性能对比 +310单卡4个device,benchmark测试的是一个device。T4一个设备相当于4个device,测试的是整个设备。benchmark时延是吞吐率的倒数,T4时延是吞吐率的倒数乘以batch。对于batch1,440.73ms / 4 * 1 < 0.2937s,即npu性能超过T4 +对于batch1,npu性能均高于T4性能1.2倍,该模型放在benchmark/cv/segmentation目录下 + + diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216\345\274\200\346\272\220mmdetection\351\242\204\350\256\255\347\273\203\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216\345\274\200\346\272\220mmdetection\351\242\204\350\256\255\347\273\203\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" new file mode 100644 index 0000000000000000000000000000000000000000..dfb1a8f4e361c421ba6e74ce55cfe816e0e313b6 --- /dev/null +++ "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/cv/segmentation/\345\237\272\344\272\216\345\274\200\346\272\220mmdetection\351\242\204\350\256\255\347\273\203\347\232\204maskrcnn_Onnx\346\250\241\345\236\213\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274.md" @@ -0,0 +1,1041 @@ +# 基于开源mmdetection预训练的maskrcnn Onnx模型端到端推理指导 +- [1 模型概述](#1-模型概述) + - [1.1 论文地址](#11-论文地址) + - [1.2 代码地址](#12-代码地址) +- [2 环境说明](#2-环境说明) + - [2.1 深度学习框架](#21-深度学习框架) + - [2.2 python第三方库](#22-python第三方库) +- [3 模型转换](#3-模型转换) + - [3.1 pth转onnx模型](#31-pth转onnx模型) + - [3.2 onnx转om模型](#32-onnx转om模型) +- [4 数据集预处理](#4-数据集预处理) + - [4.1 数据集获取](#41-数据集获取) + - [4.2 数据集预处理](#42-数据集预处理) + - [4.3 生成数据集信息文件](#43-生成数据集信息文件) +- [5 离线推理](#5-离线推理) + - [5.1 benchmark工具概述](#51-benchmark工具概述) + - [5.2 离线推理](#52-离线推理) +- [6 精度对比](#6-精度对比) + - [6.1 离线推理精度统计](#61-离线推理精度统计) + - [6.2 开源精度](#62-开源精度) + - [6.3 精度对比](#63-精度对比) +- [7 性能对比](#7-性能对比) + - [7.1 npu性能数据](#71-npu性能数据) + - [7.2 T4性能数据](#72-T4性能数据) + - [7.3 性能对比](#73-性能对比) + + + +## 1 模型概述 + +- **[论文地址](#11-论文地址)** + +- **[代码地址](#12-代码地址)** + +### 1.1 论文地址 +[maskrcnn论文](https://arxiv.org/abs/1703.06870) +论文提出了一个简单、灵活、通用的目标实例分割框架Mask R-CNN。这个框架可同时做目标检测、实例分割。实例分割的实现就是在faster r-cnn的基础上加了一个可以预测目标掩膜(mask)的分支。只比Faster r-cnn慢一点,5fps。很容易拓展到其他任务如:关键点检测。18年在coco的目标检测、实例分割、人体关键点检测都取得了最优成绩。 + +### 1.2 代码地址 +[mmdetection框架maskrcnn代码](https://github.com/open-mmlab/mmdetection/tree/master/configs/mask_rcnn) + +## 2 环境说明 + +- **[深度学习框架](#21-深度学习框架)** + +- **[python第三方库](#22-python第三方库)** + +### 2.1 深度学习框架 +``` +pytorch == 1.8.0 +torchvision == 0.9.0 +onnx == 1.8.0 +``` + +### 2.2 python第三方库 + +``` +numpy == 1.18.5 +opencv-python == 4.2.0.34 +``` + +**说明:** +> X86架构:opencv,pytorch,torchvision和onnx可以通过官方下载whl包安装,其它可以通过pip3.7 install 包名 安装 +> +> 
Arm架构:opencv,pytorch,torchvision和onnx可以通过源码编译安装,其它可以通过pip3.7 install 包名 安装 + +## 3 模型转换 + +- **[pth转onnx模型](#31-pth转onnx模型)** + +- **[onnx转om模型](#32-onnx转om模型)** + +atc暂不支持动态shape小算子,可以使用大颗粒算子替换这些小算子规避,这些小算子可以在转onnx时的verbose打印中找到其对应的python代码,从而根据功能用大颗粒算子替换,onnx能推导出变量正确的shape与算子属性正确即可,变量实际的数值无关紧要,因此这些大算子函数的功能实现无关紧要,因包含自定义算子需要去掉对onnx模型的校验。 + +### 3.1 pth转onnx模型 + +1.获取pth权重文件 +[maskrcnn基于detectron2预训练的npu权重文件](http://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_fpn_1x_coco/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth) +文件md5sum: f4ee3c5911537f454045395d2f708954 +2.mmdetection源码安装 +```shell +git clone https://github.com/open-mmlab/mmcv +cd mmcv +MMCV_WITH_OPS=1 pip3.7 install -e . +cd .. +git clone https://github.com/open-mmlab/mmdetection +cd mmdetection +pip3.7 install -r requirements/build.txt +python3.7 setup.py develop +``` + + **说明:** +> 安装所需的依赖说明请参考mmdetection/docs/get_started.md +> + +3.转原始onnx +```shell +python3.7 tools/deployment/pytorch2onnx.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --output-file mask_rcnn_r50_fpn_1x_coco.onnx --input-img demo/demo.jpg --test-img tests/data/color.jpg --shape 800 1216 --show --verify --simplify +若报错参考:https://github.com/open-mmlab/mmdetection/issues/4548 +``` +4.修改mmdetection代码,参见maskrcnn_mmdetection.diff: +```diff +diff --git a/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py b/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py +index e9eb357..f72cef7 100644 +--- a/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py ++++ b/mmdet/core/bbox/coder/delta_xywh_bbox_coder.py +@@ -168,13 +168,31 @@ def delta2bbox(rois, + [0.0000, 0.3161, 4.1945, 0.6839], + [5.0000, 5.0000, 5.0000, 5.0000]]) + """ +- means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(1) // 4) +- stds = deltas.new_tensor(stds).view(1, -1).repeat(1, deltas.size(1) // 4) ++ ++ # fix shape for means and stds when exporting onnx ++ if torch.onnx.is_in_onnx_export(): ++ means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(1).numpy() // 4) ++ stds = deltas.new_tensor(stds).view(1, -1).repeat(1, deltas.size(1).numpy() // 4) ++ else: ++ means = deltas.new_tensor(means).view(1, -1).repeat(1, deltas.size(1) // 4) ++ stds = deltas.new_tensor(stds).view(1, -1).repeat(1, deltas.size(1) // 4) + denorm_deltas = deltas * stds + means +- dx = denorm_deltas[:, 0::4] +- dy = denorm_deltas[:, 1::4] +- dw = denorm_deltas[:, 2::4] +- dh = denorm_deltas[:, 3::4] ++ # dx = denorm_deltas[:, 0::4] ++ # dy = denorm_deltas[:, 1::4] ++ # dw = denorm_deltas[:, 2::4] ++ # dh = denorm_deltas[:, 3::4] ++ if denorm_deltas.shape[1] > 4: ++ denorm_deltas = denorm_deltas.view(-1, 80, 4) ++ dx = denorm_deltas[:, :, 0:1:].view(-1, 80) ++ dy = denorm_deltas[:, :, 1:2:].view(-1, 80) ++ dw = denorm_deltas[:, :, 2:3:].view(-1, 80) ++ dh = denorm_deltas[:, :, 3:4:].view(-1, 80) ++ else: ++ dx = denorm_deltas[:, 0:1:] ++ dy = denorm_deltas[:, 1:2:] ++ dw = denorm_deltas[:, 2:3:] ++ dh = denorm_deltas[:, 3:4:] ++ + max_ratio = np.abs(np.log(wh_ratio_clip)) + dw = dw.clamp(min=-max_ratio, max=max_ratio) + dh = dh.clamp(min=-max_ratio, max=max_ratio) +diff --git a/mmdet/core/post_processing/bbox_nms.py b/mmdet/core/post_processing/bbox_nms.py +index c43aea9..e99f5d8 100644 +--- a/mmdet/core/post_processing/bbox_nms.py ++++ b/mmdet/core/post_processing/bbox_nms.py +@@ -4,6 +4,59 @@ from mmcv.ops.nms import batched_nms + from mmdet.core.bbox.iou_calculators import bbox_overlaps + + ++class BatchNMSOp(torch.autograd.Function): ++ @staticmethod 
++ def forward(ctx, bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (batch, N, C, 4). ++ scores (torch.Tensor): scores in shape (batch, N, C). ++ return: ++ nmsed_boxes: (1, N, 4) ++ nmsed_scores: (1, N) ++ nmsed_classes: (1, N) ++ nmsed_num: (1,) ++ """ ++ ++ # Phony implementation for onnx export ++ nmsed_boxes = bboxes[:, :max_total_size, 0, :] ++ nmsed_scores = scores[:, :max_total_size, 0] ++ nmsed_classes = torch.arange(max_total_size, dtype=torch.long) ++ nmsed_num = torch.Tensor([max_total_size]) ++ ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++ @staticmethod ++ def symbolic(g, bboxes, scores, score_thr, iou_thr, max_size_p_class, max_t_size): ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = g.op('BatchMultiClassNMS', ++ bboxes, scores, score_threshold_f=score_thr, iou_threshold_f=iou_thr, ++ max_size_per_class_i=max_size_p_class, max_total_size_i=max_t_size, outputs=4) ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++def batch_nms_op(bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (N, 4). ++ scores (torch.Tensor): scores in shape (N, ). ++ """ ++ ++ num_classes = bboxes.shape[1].numpy() // 4 ++ if bboxes.dtype == torch.float32: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4).half() ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1).half() ++ else: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4) ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1) ++ ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = BatchNMSOp.apply(bboxes, scores, ++ score_threshold, iou_threshold, max_size_per_class, max_total_size) ++ nmsed_boxes = nmsed_boxes.float() ++ nmsed_scores = nmsed_scores.float() ++ nmsed_classes = nmsed_classes.long() ++ dets = torch.cat((nmsed_boxes.reshape((max_total_size, 4)), nmsed_scores.reshape((max_total_size, 1))), -1) ++ dets = dets.reshape((max_total_size, 5)) ++ labels = nmsed_classes.reshape((max_total_size, )) ++ return dets, labels ++ ++ + def multiclass_nms(multi_bboxes, + multi_scores, + score_thr, +@@ -40,7 +93,17 @@ def multiclass_nms(multi_bboxes, + multi_scores.size(0), num_classes, 4) + + scores = multi_scores[:, :-1] ++ # multiply score_factor after threshold to preserve more bboxes, improve ++ # mAP by 1% for YOLOv3 ++ if score_factors is not None: ++ # expand the shape to match original shape of score ++ score_factors = score_factors.view(-1, 1).expand( ++ multi_scores.size(0), num_classes) ++ score_factors = score_factors.reshape(-1) ++ scores = scores * score_factors + ++ # cpu and gpu ++ ''' + labels = torch.arange(num_classes, dtype=torch.long) + labels = labels.view(1, -1).expand_as(scores) + +@@ -80,7 +143,11 @@ def multiclass_nms(multi_bboxes, + return dets, labels[keep], keep + else: + return dets, labels[keep] ++ ''' + ++ # npu ++ dets, labels = batch_nms_op(bboxes, scores, score_thr, nms_cfg.get("iou_threshold"), max_num, max_num) ++ return dets, labels + + def fast_nms(multi_bboxes, + multi_scores, +diff --git a/mmdet/models/dense_heads/rpn_head.py b/mmdet/models/dense_heads/rpn_head.py +index f565d1a..3c29386 100644 +--- a/mmdet/models/dense_heads/rpn_head.py ++++ b/mmdet/models/dense_heads/rpn_head.py +@@ -9,6 +9,57 @@ from .anchor_head import AnchorHead + from .rpn_test_mixin import RPNTestMixin + + ++class BatchNMSOp(torch.autograd.Function): ++ @staticmethod ++ def forward(ctx, bboxes, 
scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (batch, N, C, 4). ++ scores (torch.Tensor): scores in shape (batch, N, C). ++ return: ++ nmsed_boxes: (1, N, 4) ++ nmsed_scores: (1, N) ++ nmsed_classes: (1, N) ++ nmsed_num: (1,) ++ """ ++ ++ # Phony implementation for onnx export ++ nmsed_boxes = bboxes[:, :max_total_size, 0, :] ++ nmsed_scores = scores[:, :max_total_size, 0] ++ nmsed_classes = torch.arange(max_total_size, dtype=torch.long) ++ nmsed_num = torch.Tensor([max_total_size]) ++ ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++ @staticmethod ++ def symbolic(g, bboxes, scores, score_thr, iou_thr, max_size_p_class, max_t_size): ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = g.op('BatchMultiClassNMS', ++ bboxes, scores, score_threshold_f=score_thr, iou_threshold_f=iou_thr, ++ max_size_per_class_i=max_size_p_class, max_total_size_i=max_t_size, outputs=4) ++ return nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num ++ ++def batch_nms_op(bboxes, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size): ++ """ ++ boxes (torch.Tensor): boxes in shape (N, 4). ++ scores (torch.Tensor): scores in shape (N, ). ++ """ ++ ++ num_classes = bboxes.shape[1].numpy() // 4 ++ if bboxes.dtype == torch.float32: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4).half() ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1).half() ++ else: ++ bboxes = bboxes.reshape(1, bboxes.shape[0].numpy(), -1, 4) ++ scores = scores.reshape(1, scores.shape[0].numpy(), -1) ++ ++ nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = BatchNMSOp.apply(bboxes, scores, ++ score_threshold, iou_threshold, max_size_per_class, max_total_size) ++ nmsed_boxes = nmsed_boxes.float() ++ nmsed_scores = nmsed_scores.float() ++ nmsed_classes = nmsed_classes.long() ++ dets = torch.cat((nmsed_boxes.reshape((max_total_size, 4)), nmsed_scores.reshape((max_total_size, 1))), -1) ++ labels = nmsed_classes.reshape((max_total_size, )) ++ return dets, labels ++ + @HEADS.register_module() + class RPNHead(RPNTestMixin, AnchorHead): + """RPN head. 
+@@ -132,9 +183,14 @@ class RPNHead(RPNTestMixin, AnchorHead): + if cfg.nms_pre > 0 and scores.shape[0] > cfg.nms_pre: + # sort is faster than topk + # _, topk_inds = scores.topk(cfg.nms_pre) +- ranked_scores, rank_inds = scores.sort(descending=True) +- topk_inds = rank_inds[:cfg.nms_pre] +- scores = ranked_scores[:cfg.nms_pre] ++ # onnx uses topk to sort, this is simpler for onnx export ++ if torch.onnx.is_in_onnx_export(): ++ scores, topk_inds = torch.topk(scores, cfg.nms_pre) ++ else: ++ ranked_scores, rank_inds = scores.sort(descending=True) ++ topk_inds = rank_inds[:cfg.nms_pre] ++ scores = ranked_scores[:cfg.nms_pre] ++ + rpn_bbox_pred = rpn_bbox_pred[topk_inds, :] + anchors = anchors[topk_inds, :] + mlvl_scores.append(scores) +@@ -164,5 +220,12 @@ class RPNHead(RPNTestMixin, AnchorHead): + + # TODO: remove the hard coded nms type + nms_cfg = dict(type='nms', iou_threshold=cfg.nms_thr) ++ # cpu and gpu return ++ ''' + dets, keep = batched_nms(proposals, scores, ids, nms_cfg) + return dets[:cfg.nms_post] ++ ''' ++ ++ # npu return ++ dets, labels = batch_nms_op(proposals, scores, 0.0, nms_cfg.get("iou_threshold"), cfg.nms_post, cfg.nms_post) ++ return dets +diff --git a/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py b/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py +index 0cba3cd..a965e53 100644 +--- a/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py ++++ b/mmdet/models/roi_heads/mask_heads/fcn_mask_head.py +@@ -199,11 +199,11 @@ class FCNMaskHead(nn.Module): + # TODO: Remove after F.grid_sample is supported. + from torchvision.models.detection.roi_heads \ + import paste_masks_in_image +- masks = paste_masks_in_image(mask_pred, bboxes, ori_shape[:2]) ++ '''masks = paste_masks_in_image(mask_pred, bboxes, ori_shape[:2]) + thr = rcnn_test_cfg.get('mask_thr_binary', 0) + if thr > 0: +- masks = masks >= thr +- return masks ++ masks = masks >= thr''' ++ return mask_pred + + N = len(mask_pred) + # The actual implementation split the input into chunks, +diff --git a/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py b/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py +index c0eebc4..63605c5 100644 +--- a/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py ++++ b/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py +@@ -4,6 +4,31 @@ from mmcv.runner import force_fp32 + from mmdet.models.builder import ROI_EXTRACTORS + from .base_roi_extractor import BaseRoIExtractor + ++import torch.onnx.symbolic_helper as sym_help ++ ++class RoiExtractor(torch.autograd.Function): ++ @staticmethod ++ def forward(self, f0, f1, f2, f3, rois, aligned=1, finest_scale=56, pooled_height=7, pooled_width=7, ++ pool_mode='avg', roi_scale_factor=0, sample_num=0, spatial_scale=[0.25, 0.125, 0.0625, 0.03125]): ++ """ ++ feats (torch.Tensor): feats in shape (batch, 256, H, W). ++ rois (torch.Tensor): rois in shape (k, 5). 
++ return: ++ roi_feats (torch.Tensor): (k, 256, pooled_width, pooled_width) ++ """ ++ ++ # phony implementation for shape inference ++ k = rois.size()[0] ++ roi_feats = torch.ones(k, 256, pooled_height, pooled_width) ++ return roi_feats ++ ++ @staticmethod ++ def symbolic(g, f0, f1, f2, f3, rois, aligned=1, finest_scale=56, pooled_height=7, pooled_width=7): ++ # TODO: support tensor list type for feats ++ #f_tensors = sym_help._unpack_list(feats) ++ roi_feats = g.op('RoiExtractor', f0, f1, f2, f3, rois, aligned_i=1, finest_scale_i=56, pooled_height_i=pooled_height, pooled_width_i=pooled_width, ++ pool_mode_s='avg', roi_scale_factor_i=0, sample_num_i=0, spatial_scale_f=[0.25, 0.125, 0.0625, 0.03125], outputs=1) ++ return roi_feats + + @ROI_EXTRACTORS.register_module() + class SingleRoIExtractor(BaseRoIExtractor): +@@ -52,6 +77,14 @@ class SingleRoIExtractor(BaseRoIExtractor): + + @force_fp32(apply_to=('feats', ), out_fp16=True) + def forward(self, feats, rois, roi_scale_factor=None): ++ # Work around to export onnx for npu ++ if torch.onnx.is_in_onnx_export(): ++ out_size = self.roi_layers[0].output_size ++ roi_feats = RoiExtractor.apply(feats[0], feats[1], feats[2], feats[3], rois, 1, 56, out_size[0], out_size[1]) ++ # roi_feats = RoiExtractor.apply(list(feats), rois) ++ return roi_feats ++ ++ + """Forward function.""" + out_size = self.roi_layers[0].output_size + num_levels = len(feats) +diff --git a/tools/deployment/pytorch2onnx.py b/tools/deployment/pytorch2onnx.py +index 1305a79..c79e9fb 100644 +--- a/tools/deployment/pytorch2onnx.py ++++ b/tools/deployment/pytorch2onnx.py +@@ -48,7 +48,7 @@ def pytorch2onnx(config_path, + input_names=['input'], + output_names=output_names, + export_params=True, +- keep_initializers_as_inputs=True, ++ #keep_initializers_as_inputs=True, + do_constant_folding=True, + verbose=show, + opset_version=opset_version) + +``` + **修改依据:** +> 1.atc暂不支持if与nonzero动态小算子,这两小算子是bbox_nms.py与single_level_roi_extractor.py的大功能nms与roi引入的(rpn_head.py中的nms虽然没有引入不支持算子但也需要替换,否则后面会出现E19014: Op[ReduceMax_505]'s attribute axes is invalid which is empty),因此使用npu的nms与roi大算子代替这部分大功能。loop算子暂无合适替换方法,由于它在网络最后一部分,因此可将其与后面的部分放到后处理 +> 2. 
atc转换报错E11019: Op[Conv_0]'s input[1] is not linked,因此注释掉tools/deployment/pytorch2onnx.py中export函数的keep_initializers_as_inputs=True,
+> 3.动态shape算子导致atc转换出现未知错误,atc日志debug显示Unknown shape op Tile output shape range is unknown, set its size -1,在转onnx时的verbose打印中找到该算子对应的python代码行,利用numpy()将means和std的shape固定下来,参见maskrcnn_mmdetection.diff
+> 4.slice跑在aicpu有错误,所以改为dx = denorm_deltas[:, :, 0:1:].view(-1, 80),使其运行在aicore上
+> 5.atc转换Concat一对多算子会改变其名字,故添加dets = dets.reshape((max_total_size, 5)),使得Concat后添加了一个冗余的Reshape算子作为输出节点
+> 6.atc转换时计算mask的RoiExtractor算子报错,打开--log=debug输出日志,查看strace -f cmd的打印/root/ascend/log/plog/…找到日志存放路径,发现(14,14)导致cube内存不够用
+> 7.atc转换时根据日志中报错的算子在转onnx时的verbose打印中找到其对应的python代码,然后找到规避方法解决,具体修改参见maskrcnn_mmdetection.diff
+> 8.其它地方的修改原因参见精度调试
+
+
+通过打补丁的方式修改mmdetection:
+```shell
+patch -p1 < ./maskrcnn_mmdetection.diff
+```
+5.修改pytorch代码,去除导出onnx时的模型检查
+将/usr/local/python3.7.5/lib/python3.7/site-packages/torch/onnx/utils.py文件的_check_onnx_proto(proto)改为pass
+
+6.运行如下命令,生成含有npu自定义算子的onnx:
+```shell
+python3.7 tools/deployment/pytorch2onnx.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --output-file mask_rcnn_r50_fpn_1x_coco.onnx --input-img demo/demo.jpg --shape 800 1216
+```
+7.经过修改后导出的onnx因添加了自定义算子而无法使用onnx的infer shape,需要手动固定resize算子的shape:可先用未经修改的源代码导出原始onnx并做simplify(即添加--simplify参数,参见转原始onnx命令),再用netron可视化工具查看该onnx中各resize节点的具体大小
+```python
+import sys
+import onnx
+from onnx import helper
+
+input_model=sys.argv[1]
+output_model=sys.argv[2]
+model = onnx.load(input_model)
+# onnx.checker.check_model(model)
+
+model_nodes = model.graph.node
+def getNodeByName(nodes, name: str):
+    for n in nodes:
+        if n.name == name:
+            return n
+    return -1
+
+# fix shape for resize, 对原始onnx使用simplifier后,使用netron可视化工具可以查看该onnx中resize的大小
+sizes1 = onnx.helper.make_tensor('size1', onnx.TensorProto.INT32, [4], [1, 256, 50, 76])
+sizes2 = onnx.helper.make_tensor('size2', onnx.TensorProto.INT32, [4], [1, 256, 100, 152])
+sizes3 = onnx.helper.make_tensor('size3', onnx.TensorProto.INT32, [4], [1, 256, 200, 304])
+model.graph.initializer.append(sizes1)
+model.graph.initializer.append(sizes2)
+model.graph.initializer.append(sizes3)
+getNodeByName(model_nodes, 'Resize_141').input[3] = "size1"
+getNodeByName(model_nodes, 'Resize_161').input[3] = "size2"
+getNodeByName(model_nodes, 'Resize_181').input[3] = "size3"
+
+print("Mask R-CNN onnx adapted to ATC")
+onnx.save(model, output_model)
+```
+```shell
+python3.7 fix_onnx_shape.py mask_rcnn_r50_fpn_1x_coco.onnx mask_rcnn_r50_fpn_1x_coco_fix.onnx
+```
+
+### 3.2 onnx转om模型
+
+1.设置环境变量
+```shell
+export install_path=/usr/local/Ascend/ascend-toolkit/latest
+export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH
+export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH
+export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH
+export ASCEND_OPP_PATH=${install_path}/opp
+export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/
+```
+2.使用atc将onnx模型转换为om模型文件,工具使用方法可以参考[CANN V100R020C10 开发辅助工具指南 (推理) 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164868?idPath=23710424%7C251366513%7C22892968%7C251168373),需要指定输出节点以去除无用输出,节点序号可能会因网络结构不同而不同,使用netron开源可视化工具查看具体的输出节点名:
+```shell
+atc --framework=5 --model=./mask_rcnn_r50_fpn_1x_coco_fix.onnx --output=mask_rcnn_r50_fpn_1x_coco_bs1 --out_nodes="Reshape_574:0;Reshape_576:0;Sigmoid_604:0" --input_format=NCHW 
--input_shape="input:1,3,800,1216" --log=debug --soc_version=Ascend310 +``` + +## 4 数据集预处理 + +- **[数据集获取](#41-数据集获取)** + +- **[数据集预处理](#42-数据集预处理)** + +- **[生成数据集信息文件](#43-生成数据集信息文件)** + +### 4.1 数据集获取 +该模型使用[COCO官网](https://cocodataset.org/#download)的coco2017的5千张验证集进行测试,图片与标签分别存放在/opt/npu/dataset/coco/val2017/与/opt/npu/dataset/coco/annotations/instances_val2017.json。 + +### 4.2 数据集预处理 +1.预处理脚本maskrcnn_pth_preprocess.py +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse +import numpy as np +import cv2 +import mmcv +import torch +import multiprocessing + +def resize(img, size): + old_h = img.shape[0] + old_w = img.shape[1] + scale_ratio = min(size[0] / old_w, size[1] / old_h) + new_w = int(np.floor(old_w * scale_ratio)) + new_h = int(np.floor(old_h * scale_ratio)) + resized_img = mmcv.imresize(img, (new_w, new_h), backend='cv2') + return resized_img + +def gen_input_bin(file_batches, batch): + i = 0 + for file in file_batches[batch]: + i = i + 1 + print("batch", batch, file, "===", i) + + image = mmcv.imread(os.path.join(flags.image_src_path, file), backend='cv2') + #image = mmcv.imrescale(image, (flags.model_input_width, flags.model_input_height)) + image = resize(image, (flags.model_input_width, flags.model_input_height)) + mean = np.array([123.675, 116.28, 103.53], dtype=np.float32) + std = np.array([58.395, 57.12, 57.375], dtype=np.float32) + image = mmcv.imnormalize(image, mean, std) + h = image.shape[0] + w = image.shape[1] + pad_left = (flags.model_input_width - w) // 2 + pad_top = (flags.model_input_height - h) // 2 + pad_right = flags.model_input_width - pad_left - w + pad_bottom = flags.model_input_height - pad_top - h + image = mmcv.impad(image, padding=(pad_left, pad_top, pad_right, pad_bottom), pad_val=0) + #mmcv.imwrite(image, './paded_jpg/' + file.split('.')[0] + '.jpg') + image = image.transpose(2, 0, 1) + image.tofile(os.path.join(flags.bin_file_path, file.split('.')[0] + ".bin")) + +def preprocess(src_path, save_path): + files = os.listdir(src_path) + file_batches = [files[i:i + 100] for i in range(0, 5000, 100) if files[i:i + 100] != []] + thread_pool = multiprocessing.Pool(len(file_batches)) + for batch in range(len(file_batches)): + thread_pool.apply_async(gen_input_bin, args=(file_batches, batch)) + thread_pool.close() + thread_pool.join() + print("in thread, except will not report! 
please ensure bin files generated.") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='preprocess of MaskRCNN PyTorch model') + parser.add_argument("--image_src_path", default="./coco2017/", help='image of dataset') + parser.add_argument("--bin_file_path", default="./coco2017_bin/", help='Preprocessed image buffer') + parser.add_argument("--model_input_height", default=800, type=int, help='input tensor height') + parser.add_argument("--model_input_width", default=1216, type=int, help='input tensor width') + flags = parser.parse_args() + if not os.path.exists(flags.bin_file_path): + os.makedirs(flags.bin_file_path) + preprocess(flags.image_src_path, flags.bin_file_path) +``` +2.执行预处理脚本,生成数据集预处理后的bin文件 +```shell +python3.7 maskrcnn_pth_preprocess.py --image_src_path=/opt/npu/dataset/coco/val2017 --bin_file_path=val2017_bin --model_input_height=800 --model_input_width=1216 +``` +### 4.3 生成数据集信息文件 +1.生成数据集信息文件脚本get_info.py +```python +import os +import sys +import cv2 +from glob import glob + + +def get_bin_info(file_path, info_name, width, height): + bin_images = glob(os.path.join(file_path, '*.bin')) + with open(info_name, 'w') as file: + for index, img in enumerate(bin_images): + content = ' '.join([str(index), img, width, height]) + file.write(content) + file.write('\n') + + +def get_jpg_info(file_path, info_name): + extensions = ['jpg', 'jpeg', 'JPG', 'JPEG'] + image_names = [] + for extension in extensions: + image_names.append(glob(os.path.join(file_path, '*.' + extension))) + with open(info_name, 'w') as file: + for image_name in image_names: + if len(image_name) == 0: + continue + else: + for index, img in enumerate(image_name): + img_cv = cv2.imread(img) + shape = img_cv.shape + width, height = shape[1], shape[0] + content = ' '.join([str(index), img, str(width), str(height)]) + file.write(content) + file.write('\n') + + +if __name__ == '__main__': + file_type = sys.argv[1] + file_path = sys.argv[2] + info_name = sys.argv[3] + if file_type == 'bin': + width = sys.argv[4] + height = sys.argv[5] + assert len(sys.argv) == 6, 'The number of input parameters must be equal to 5' + get_bin_info(file_path, info_name, width, height) + elif file_type == 'jpg': + assert len(sys.argv) == 4, 'The number of input parameters must be equal to 3' + get_jpg_info(file_path, info_name) +``` +2.执行生成数据集信息脚本,生成数据集信息文件 +```shell +python3.7 get_info.py bin val2017_bin maskrcnn.info 1216 800 +``` +第一个参数为模型输入的类型,第二个参数为生成的bin文件路径,第三个为输出的info文件,后面为宽高信息 +## 5 离线推理 + +- **[benchmark工具概述](#51-benchmark工具概述)** + +- **[离线推理](#52-离线推理)** + +### 5.1 benchmark工具概述 + +benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考[CANN V100R020C10 推理benchmark工具用户指南 01](https://support.huawei.com/enterprise/zh/doc/EDOC1100164874?idPath=23710424%7C251366513%7C22892968%7C251168373) +### 5.2 离线推理 +1.设置环境变量 +```shell +export install_path=/usr/local/Ascend/ascend-toolkit/latest +export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH +export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH +export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH +export ASCEND_OPP_PATH=${install_path}/opp +export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/ +``` +2.执行离线推理 +```shell +./benchmark.x86_64 -model_type=vision -om_path=mask_rcnn_r50_fpn_1x_coco_bs1.om -device_id=0 -batch_size=1 -input_text_path=maskrcnn.info -input_width=1216 
-input_height=800 -useDvpp=false -output_binary=true +``` + **注意:** +> label是int64,benchmark输出非二进制时会将float转为0 +> + +输出结果默认保存在当前目录result/dumpOutput_device0,模型有三个输出,每个输入对应的输出对应三个_x.bin文件 +``` +输出 shape 数据类型 数据含义 +output1 100 * 5 FP32 boxes and scores +output3 100 * 1 INT64 labels +output4 100 * 80 * 28 * 28 FP32 masks +``` + +## 6 精度对比 + +- **[离线推理精度](#61-离线推理精度)** +- **[开源精度](#62-开源精度)** +- **[精度对比](#63-精度对比)** + +### 6.1 离线推理精度统计 + +后处理统计map精度 +```python +# Copyright 2020 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import argparse +import cv2 +import numpy as np + +def postprocess_bboxes(bboxes, image_size, net_input_width, net_input_height): + w = image_size[0] + h = image_size[1] + scale = min(net_input_width / w, net_input_height / h) + + pad_w = net_input_width - w * scale + pad_h = net_input_height - h * scale + pad_left = pad_w // 2 + pad_top = pad_h // 2 + + bboxes[:, 0] = (bboxes[:, 0] - pad_left) / scale + bboxes[:, 1] = (bboxes[:, 1] - pad_top) / scale + bboxes[:, 2] = (bboxes[:, 2] - pad_left) / scale + bboxes[:, 3] = (bboxes[:, 3] - pad_top) / scale + + return bboxes + +def postprocess_masks(masks, image_size, net_input_width, net_input_height): + w = image_size[0] + h = image_size[1] + scale = min(net_input_width / w, net_input_height / h) + + pad_w = net_input_width - w * scale + pad_h = net_input_height - h * scale + pad_left = pad_w // 2 + pad_top = pad_h // 2 + + if pad_top < 0: + pad_top = 0 + if pad_left < 0: + pad_left = 0 + top = int(pad_top) + left = int(pad_left) + hs = int(pad_top + net_input_height - pad_h) + ws = int(pad_left + net_input_width - pad_w) + masks = masks.to(dtype=torch.float32) + res_append = torch.zeros(0, h, w) + if torch.cuda.is_available(): + res_append = res_append.to(device='cuda') + for i in range(masks.size(0)): + mask = masks[i][0][top:hs, left:ws] + mask = mask.expand((1, 1, mask.size(0), mask.size(1))) + mask = F.interpolate(mask, size=(int(h), int(w)), mode='bilinear', align_corners=False) + mask = mask[0][0] + mask = mask.unsqueeze(0) + res_append = torch.cat((res_append, mask)) + + return res_append[:, None] + +import pickle +def save_variable(v, filename): + f = open(filename, 'wb') + pickle.dump(v, f) + f.close() +def load_variavle(filename): + f = open(filename, 'rb') + r = pickle.load(f) + f.close() + return r + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument("--test_annotation", default="./origin_pictures.info") + parser.add_argument("--bin_data_path", default="./result/dumpOutput_device0/") + parser.add_argument("--det_results_path", default="./detection-results/") + parser.add_argument("--net_out_num", type=int, default=3) + parser.add_argument("--net_input_width", type=int, default=1216) + parser.add_argument("--net_input_height", type=int, default=800) + parser.add_argument("--ifShowDetObj", action="store_true", help="if input the para means True, neither False.") + flags = parser.parse_args() + + img_size_dict = dict() + with 
open(flags.test_annotation)as f: + for line in f.readlines(): + temp = line.split(" ") + img_file_path = temp[1] + img_name = temp[1].split("/")[-1].split(".")[0] + img_width = int(temp[2]) + img_height = int(temp[3]) + img_size_dict[img_name] = (img_width, img_height, img_file_path) + + bin_path = flags.bin_data_path + det_results_path = flags.det_results_path + os.makedirs(det_results_path, exist_ok=True) + #total_img = set([name[:name.rfind('_')] for name in os.listdir(bin_path) if "bin" in name]) + + import glob + import torch + from torchvision.models.detection.roi_heads import paste_masks_in_image + import torch.nn.functional as F + from mmdet.core import bbox2result + from mmdet.core import encode_mask_results + from mmdet.datasets import CocoDataset + coco_dataset = CocoDataset(ann_file='/opt/npu/dataset/coco/annotations/instances_val2017.json', pipeline=[]) + coco_class_map = {id:name for id, name in enumerate(coco_dataset.CLASSES)} + #print(dir(coco_dataset)) + results = [] + + cnt = 0 + #for bin_file in sorted(total_img): + for ids in coco_dataset.img_ids: + cnt = cnt + 1 + bin_file = glob.glob(bin_path + '/*0' + str(ids) + '_1.bin')[0] + bin_file = bin_file[bin_file.rfind('/') + 1:] + bin_file = bin_file[:bin_file.rfind('_')] + print(cnt - 1, bin_file) + path_base = os.path.join(bin_path, bin_file) + res_buff = [] + bbox_results = [] + cls_segms = [] + for num in range(1, flags.net_out_num + 1): + if os.path.exists(path_base + "_" + str(num) + ".bin"): + if num == 1: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + buf = np.reshape(buf, [100, 5]) + elif num == 2: + buf = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="int64") + buf = np.reshape(buf, [100, 1]) + elif num == 3: + bboxes = np.fromfile(path_base + "_" + str(num - 2) + ".bin", dtype="float32") + bboxes = np.reshape(bboxes, [100, 5]) + bboxes = torch.from_numpy(bboxes) + labels = np.fromfile(path_base + "_" + str(num - 1) + ".bin", dtype="int64") + labels = np.reshape(labels, [100, 1]) + labels = torch.from_numpy(labels) + mask_pred = np.fromfile(path_base + "_" + str(num) + ".bin", dtype="float32") + mask_pred = np.reshape(mask_pred, [100, 80, 28, 28]) + mask_pred = torch.from_numpy(mask_pred) + + if torch.cuda.is_available(): + mask_pred = mask_pred.to(device='cuda') + + img_shape = (flags.net_input_height, flags.net_input_width) + mask_pred = mask_pred[range(len(mask_pred)), labels[:, 0]][:, None] + masks = paste_masks_in_image(mask_pred, bboxes[:, :4], img_shape) + masks = masks >= 0.5 + + masks = postprocess_masks(masks, img_size_dict[bin_file], flags.net_input_width, flags.net_input_height) + if torch.cuda.is_available(): + masks = masks.cpu() + '''masks = masks.numpy() + img = masks[0].squeeze() + from PIL import Image + for j in range(len(masks)): + mask = masks[j].squeeze() + mask = mask.astype(bool) + img[mask] = img[mask] + 1 + imag = Image.fromarray((img * 255).astype(np.uint8)) + imag.save(os.path.join('.', bin_file + '.png'))''' + + cls_segms = [[] for _ in range(80)] + for i in range(len(masks)): + cls_segms[labels[i][0]].append(masks[i][0].numpy()) + + bboxes = postprocess_bboxes(bboxes, img_size_dict[bin_file], flags.net_input_width, flags.net_input_height) + bbox_results = [bbox2result(bboxes, labels[:, 0], 80)] + res_buff.append(buf) + else: + print("[ERROR] file not exist", path_base + "_" + str(num) + ".bin") + + result = list(zip(bbox_results, [cls_segms])) + result = [(bbox_results, encode_mask_results(mask_results)) for bbox_results, mask_results in result] + 
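+        # 补充注释:上一步已把单张图片的box结果与RLE编码后的mask结果组织成
+        # mmdetection评测接口所需的(bbox_results, encoded_masks)格式,下面将其累积到
+        # results列表,循环结束后统一交给coco_dataset.evaluate统计bbox与segm的map精度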
results.extend(result) + + current_img_size = img_size_dict[bin_file] + res_bboxes = np.concatenate(res_buff, axis=1) + predbox = postprocess_bboxes(res_bboxes, current_img_size, flags.net_input_width, flags.net_input_height) + + if flags.ifShowDetObj == True: + imgCur = cv2.imread(current_img_size[2]) + + det_results_str = '' + for idx, class_idx in enumerate(predbox[:, 5]): + if float(predbox[idx][4]) < float(0.05): + continue + if class_idx < 0 or class_idx > 80: + continue + + class_name = coco_class_map[int(class_idx)] + det_results_str += "{} {} {} {} {} {}\n".format(class_name, str(predbox[idx][4]), predbox[idx][0], + predbox[idx][1], predbox[idx][2], predbox[idx][3]) + if flags.ifShowDetObj == True: + imgCur = cv2.rectangle(imgCur, (int(predbox[idx][0]), int(predbox[idx][1])), (int(predbox[idx][2]), int(predbox[idx][3])), (0,255,0), 2) + imgCur = cv2.putText(imgCur, class_name, (int(predbox[idx][0]), int(predbox[idx][1])), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1) + + if flags.ifShowDetObj == True: + cv2.imwrite(os.path.join(det_results_path, bin_file +'.jpg'), imgCur, [int(cv2.IMWRITE_JPEG_QUALITY), 70]) + + det_results_file = os.path.join(det_results_path, bin_file + ".txt") + with open(det_results_file, "w") as detf: + detf.write(det_results_str) + + save_variable(results, './results.txt') + #results = load_variavle('./results.txt') + eval_results = coco_dataset.evaluate(results, metric=['bbox', 'segm'], classwise=True) +``` +调用maskrcnn_pth_postprocess.py评测map精度: +```shell +python3.7 get_info.py jpg /opt/npu/dataset/coco/val2017 maskrcnn_jpeg.info + +python3.7 maskrcnn_pth_postprocess.py --bin_data_path=./result/dumpOutput_device0/ --test_annotation=maskrcnn_jpeg.info --det_results_path=./ret_npuinfer/ --net_out_num=3 --net_input_height=800 --net_input_width=1216 --ifShowDetObj +``` +第一个参数为benchmark推理结果,第二个为原始图片信息文件,第三个为后处理输出结果,第四个为网络输出个数,第五六个为网络高宽,第七个为是否将box画在图上显示 +执行完后会打印出精度: +``` +Evaluating bbox... +Loading and preparing results... +DONE (t=8.57s) +creating index... +index created! +Running per image evaluation... +Evaluate annotation type *bbox* +DONE (t=103.05s). +Accumulating evaluation results... +DONE (t=26.62s). 
+Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.377 +Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.584 +Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.411 +Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.211 +Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.411 +Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.500 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.515 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.515 +Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.515 +Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.319 +Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.556 +Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.656 + ++---------------+-------+--------------+-------+----------------+-------+ +| category | AP | category | AP | category | AP | ++---------------+-------+--------------+-------+----------------+-------+ +| person | 0.517 | bicycle | 0.296 | car | 0.411 | +| motorcycle | 0.392 | airplane | 0.588 | bus | 0.603 | +| train | 0.576 | truck | 0.332 | boat | 0.254 | +| traffic light | 0.253 | fire hydrant | 0.627 | stop sign | 0.624 | +| parking meter | 0.431 | bench | 0.224 | bird | 0.335 | +| cat | 0.588 | dog | 0.544 | horse | 0.527 | +| sheep | 0.473 | cow | 0.515 | elephant | 0.597 | +| bear | 0.616 | zebra | 0.627 | giraffe | 0.623 | +| backpack | 0.132 | umbrella | 0.347 | handbag | 0.119 | +| tie | 0.306 | suitcase | 0.368 | frisbee | 0.634 | +| skis | 0.214 | snowboard | 0.286 | sports ball | 0.398 | +| kite | 0.375 | baseball bat | 0.215 | baseball glove | 0.333 | +| skateboard | 0.455 | surfboard | 0.340 | tennis racket | 0.417 | +| bottle | 0.365 | wine glass | 0.325 | cup | 0.400 | +| fork | 0.259 | knife | 0.139 | spoon | 0.108 | +| bowl | 0.395 | banana | 0.217 | apple | 0.200 | +| sandwich | 0.322 | orange | 0.289 | broccoli | 0.214 | +| carrot | 0.199 | hot dog | 0.277 | pizza | 0.478 | +| donut | 0.397 | cake | 0.353 | chair | 0.245 | +| couch | 0.371 | potted plant | 0.243 | bed | 0.398 | +| dining table | 0.228 | toilet | 0.557 | tv | 0.542 | +| laptop | 0.547 | mouse | 0.572 | remote | 0.260 | +| keyboard | 0.491 | cell phone | 0.325 | microwave | 0.531 | +| oven | 0.300 | toaster | 0.467 | sink | 0.330 | +| refrigerator | 0.511 | book | 0.146 | clock | 0.481 | +| vase | 0.336 | scissors | 0.249 | teddy bear | 0.431 | +| hair drier | 0.013 | toothbrush | 0.145 | None | None | ++---------------+-------+--------------+-------+----------------+-------+ +``` + + **精度调试:** +> 1.因为在线推理前处理图片是一定格式的动态分辨率,所以onnx将分辨率固定为1216x800会导致精度下降些,改为1216x1216可以提升精度,使得mask的精度与开源相比下降在1%之内 +> 2.单图调试 +> ``` +> python3.7 tools/test.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --eval bbox segm --show +>python3.7 tools/deployment/pytorch2onnx.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --output-file mask_rcnn_r50_fpn_1x_coco.onnx --input-img 000000397133_1216x800.jpg --shape 800 1216 --show --verify --simplify +> 说明: +> 1.参考开源精度测评工具,以精度达标的pth为基准,添加打印弄明白关键点代码含义。可以得到导出原始onnx时,paste_masks_in_image 前需要添加mask_pred = mask_pred[range(len(mask_pred)), labels][:, None],onnx显示mask才与pth一致。 +> 
2.将图片经过缩放添加pad后导出的原始onnx作为精度基准,发现原始onnx的mask_pred作为输出时形状是(100,80,28,28),而更换自定义算子后导出的onnx输出形状是(100,80,14,14),因此通过添加打印与对比发现计算mask的RoiExtractor的(pooled_height, pooled_width)配置是(14,14)而不应该是默认的(7,7)。将om推理RoiExtractor的输入变量使用pickle模块保存起来,然后在源代码中加载数据到这些变量,查看原始onnx的图片显示结果可以验证是RoiExtractor的问题 +> 3.800x1216不是pth模型固定的高宽,在build_from_cfg添加print(obj_cls)发现./mmdet/models/detectors/base.py的BaseDetector,推断模型的输入大小是变化的 +> 4.至于查看函数调用关系,可以在代码中故意构造错误,python运行出错时会打印调用栈 +> ``` + + +### 6.2 开源精度 +[官网精度](http://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_fpn_1x_coco/mask_rcnn_r50_fpn_1x_coco_20200205_050542.log.json) + +``` +{"mode": "val", "epoch": 12, "iter": 7330, "lr": 0.0002, "bbox_mAP": 0.382, "bbox_mAP_50": 0.588, "bbox_mAP_75": 0.414, "bbox_mAP_s": 0.219, "bbox_mAP_m": 0.409, "bbox_mAP_l": 0.495, "bbox_mAP_copypaste": "0.382 0.588 0.414 0.219 0.409 0.495", "segm_mAP": 0.347, "segm_mAP_50": 0.557, "segm_mAP_75": 0.372, "segm_mAP_s": 0.183, "segm_mAP_m": 0.374, "segm_mAP_l": 0.472, "segm_mAP_copypaste": "0.347 0.557 0.372 0.183 0.374 0.472"} +``` +### 6.3 精度对比 +om推理box map50精度为0.584,开源box map50精度为0.588,精度下降在1%之内,因此可视为精度达标 +om推理segm map50精度为0.553,开源segm map50精度为0.557,精度下降在1%之内,因此可视为精度达标 + +## 7 性能对比 + +- **[npu性能数据](#71-npu性能数据)** +- **[T4性能数据](#72-T4性能数据)** +- **[性能对比](#73-性能对比)** + +### 7.1 npu性能数据 +batch1的性能: + 测试npu性能要确保device空闲,使用npu-smi info命令可查看device是否在运行其它推理任务 +``` +./benchmark.x86_64 -round=20 -om_path=mask_rcnn_r50_fpn_1x_coco_bs1.om -device_id=0 -batch_size=1 +``` +执行20次纯推理取均值,统计吞吐率与其倒数时延(benchmark的时延是单个数据的推理时间),npu性能是一个device执行的结果 +``` +[INFO] Dataset number: 19 finished cost 512.331ms +[INFO] PureInfer result saved in ./result/PureInfer_perf_of_mask_rcnn_r50_fpn_1x_coco_bs1_in_device_0.txt +-----------------PureInfer Performance Summary------------------ +[INFO] ave_throughputRate: 1.95202samples/s, ave_latency: 512.318ms +---------------------------------------------------------------- +``` +maskrcnn mmdetection不支持多batch + + **性能优化:** +> 生成多batch模型需要修改源码,否则atc转化的多batch模型推理出的数据不对,多batch性能没有提升 +> + + +### 7.2 T4性能数据 +batch1性能: +onnx包含自定义算子,因此不能使用开源TensorRT测试性能数据,故在T4机器上使用pth在线推理测试性能数据 + +测评T4精度与性能: +```shell +git clone https://github.com/open-mmlab/mmcv +cd mmcv +MMCV_WITH_OPS=1 pip3.7 install -e . +cd .. 
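+# 补充说明:以上为源码编译安装mmcv(MMCV_WITH_OPS=1会编译其算子扩展),
+# 下面继续安装mmdetection、下载pth权重,并用tools/test.py在T4上做在线推理,
+# 以同时获得开源精度与T4性能数据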
+git clone https://github.com/open-mmlab/mmdetection
+cd mmdetection
+pip3.7 install -r requirements/build.txt
+python3.7 setup.py develop
+wget http://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_fpn_1x_coco/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth
+在当前目录按结构构造数据集:data/coco目录下有annotations与val2017,annotations目录存放coco数据集的instances_val2017.json,val2017目录存放coco数据集的5000张验证图片。
+python3.7 tools/test.py configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py ./mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth --eval bbox segm
+```
+```
+6.0 task/s
+```
+
+### 7.3 性能对比
+310单卡4个device,benchmark测试的是一个device。T4一个设备相当于4个device,测试的是整个设备。benchmark时延是吞吐率的倒数,T4时延是吞吐率的倒数乘以batch。
+对于batch1,1.95202 * 4 = 7.808 > 6.0,即npu性能超过T4,约为T4性能的1.3倍,超过T4性能的1.2倍,性能达标。该模型放在benchmark/cv/segmentation目录下
+
+
diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/nlp/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/benchmark/nlp/.keep"
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/official/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/official/.keep"
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git "a/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/research/.keep" "b/onnx\347\253\257\345\210\260\347\253\257\346\216\250\347\220\206\346\214\207\345\257\274/research/.keep"
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391