diff --git a/ACL_PyTorch/contrib/cv/segmentation/VNet/README.md b/ACL_PyTorch/contrib/cv/segmentation/VNet/README.md index 5544cbbee431e24895e97fc4d6fc0f1436432fac..7f72d072353fcb2de081ea84609aeed2c1667dd4 100644 --- a/ACL_PyTorch/contrib/cv/segmentation/VNet/README.md +++ b/ACL_PyTorch/contrib/cv/segmentation/VNet/README.md @@ -21,9 +21,10 @@ - [6.2 开源精度](#62-开源精度) - [6.3 精度对比](#63-精度对比) - [7 性能对比](#7-性能对比) - - [7.1 npu性能数据](#71-npu性能数据) - - [7.2 T4性能数据](#72-t4性能数据) - - [7.3 性能对比](#73-性能对比) + - [7.1 310性能数据](#71-310性能数据) + - [7.2 710性能数据](#72-710性能数据) + - [7.3 T4性能数据](#73-T4性能数据) + - [7.4 性能对比](#74-性能对比) @@ -50,10 +51,10 @@ commit_id:a00c8ea16bcaea2bddf73b2bf506796f70077687 ### 2.1 深度学习框架 ``` -CANN 5.0.3.alpha002 -pytorch >= 1.5.0 -torchvision >= 0.6.0 -onnx >= 1.7.0 +CANN 5.1.RC1 +pytorch = 1.5.0 +torchvision = 0.6.0 +onnx = 1.7.0 ``` ### 2.2 python第三方库 @@ -143,16 +144,23 @@ python3.7 gen_dataset_info.py bin ./prep_bin ./vnet_prep_bin.info 80 80 ## 5 离线推理 - **[benchmark工具概述](#51-benchmark工具概述)** - - **[离线推理](#52-离线推理)** ### 5.1 benchmark工具概述 benchmark工具为华为自研的模型推理工具,支持多种模型的离线推理,能够迅速统计出模型在Ascend310上的性能,支持真实数据和纯推理两种模式,配合后处理脚本,可以实现诸多模型的端到端过程,获取工具及使用方法可以参考[CANN 5.0.1 推理benchmark工具用户指南 01] +获取推理benchmark工具软件包:解压后获取benchmark工具运行脚本benchmark.{arch}和scripts目录,该目录下包含各种模型处理脚本,包括模型预处理脚本、模型后处理脚本、精度统计脚本等。 + +获取地址:https://support.huawei.com/enterprise/zh/ascend-computing/cann-pid-251168373/software + +Ascend-cann-benchmark_{version}_Linux-{arch}.zip + +{version}为软件包的版本号;{arch}为CPU架构,请用户根据实际需要获取对应的软件包。 + ### 5.2 离线推理 1.设置环境变量 ``` -source env.sh +source /usr/local/Ascend/ascend-toolkit/latest/set_env.sh ``` 2.执行离线推理 ``` @@ -175,11 +183,15 @@ source env.sh python3.7 vnet_postprocess.py result/dumpOutput_device0 ./vnet.pytorch/luna16/normalized_lung_mask ./vnet.pytorch/test_uids.txt ``` 第一个为benchmark输出目录,第二个为真值所在目录,第三个为测试集样本的序列号。 -查看输出结果: +310精度测试结果: +``` +Test set: Error: 2497889/439091200 (0.5689%) ``` -Error rate: 2479051/439091200 (0.5646%) +710精度测试结果: ``` 
-经过对bs1与bs16的om测试,本模型batch1的精度与batch16的精度没有差别,精度数据均如上。 +Test set: Error: 2485695/439091200 (0.5661%) +``` +经过对batchsize为1/4/8/16/32/64的om测试,精度数据均如上。 ### 6.2 开源精度 [原代码仓公布精度](https://github.com/mattmacy/vnet.pytorch/blob/master/README.md) @@ -194,132 +206,175 @@ VNet 0.355% ## 7 性能对比 -- **[npu性能数据](#71-npu性能数据)** -- **[T4性能数据](#72-T4性能数据)** -- **[性能对比](#73-性能对比)** +- **[310性能数据](#71-310性能数据)** +- **[710性能数据](#72-710性能数据)** +- **[T4性能数据](#73-T4性能数据)** +- **[性能对比](#74-性能对比)** -### 7.1 npu性能数据 +### 7.1 310性能数据 1.benchmark工具在整个数据集上推理获得性能数据 batch1的性能,benchmark工具在整个数据集上推理后生成result/perf_vision_batchsize_1_device_0.txt: ``` -[e2e] throughputRate: 5.70609, latency: 187869 -[data read] throughputRate: 225.606, moduleLatency: 4.43251 -[preprocess] throughputRate: 53.7844, moduleLatency: 18.5928 -[inference] throughputRate: 5.75202, Interface throughputRate: 6.10496, moduleLatency: 173.468 -[postprocess] throughputRate: 5.75712, moduleLatency: 173.698 +[e2e] throughputRate: 7.44924, latency: 143907 +[data read] throughputRate: 159.324, moduleLatency: 6.27652 +[preprocess] throughputRate: 68.514, moduleLatency: 14.5956 +[inference] throughputRate: 7.52821, Interface throughputRate: 7.91715, moduleLatency: 132.521 +[postprocess] throughputRate: 7.53499, moduleLatency: 132.714 ``` -Interface throughputRate: 6.10496,6.10496x4=24.41984既是batch1 310单卡吞吐率 -batch16的性能,benchmark工具在整个数据集上推理后生成result/perf_vision_batchsize_16_device_1.txt: +batch1:Interface throughputRate: 7.91715 +batch4:Interface throughputRate: 8.5008 +batch8:Interface throughputRate: 8.00694 +batch16:Interface throughputRate: 8.11015 +batch32:Interface throughputRate: 7.91441 + +2.执行parse脚本,计算单卡吞吐率 ``` -[e2e] throughputRate: 6.24092, latency: 171769 -[data read] throughputRate: 377.232, moduleLatency: 2.65089 -[preprocess] throughputRate: 61.2764, moduleLatency: 16.3195 -[inference] throughputRate: 6.2793, Interface throughputRate: 6.49396, moduleLatency: 159.033 -[postprocess] throughputRate: 0.398022, moduleLatency: 
2512.42 +python parse.py result/perf_vision_batchsize_1_device_0.txt ``` -Interface throughputRate: 6.49396,6.49396x4=25.97584既是batch16 310单卡吞吐率 -batch4性能: +batch1_310吞吐率为31.6686fps +batch4_310吞吐率为34.0032fps +batch8_310吞吐率为32.02776fps +batch16_310吞吐率为32.4406fps +batch32_310吞吐率为31.65764fps + +### 7.2 710性能数据 + +batch1的性能,benchmark工具在整个数据集上推理后生成result/perf_vision_batchsize_1_device_0.txt: ``` -[e2e] throughputRate: 6.38643, latency: 167856 -[data read] throughputRate: 220.829, moduleLatency: 4.52839 -[preprocess] throughputRate: 59.272, moduleLatency: 16.8714 -[inference] throughputRate: 6.42624, Interface throughputRate: 6.67466, moduleLatency: 155.341 -[postprocess] throughputRate: 1.61227, moduleLatency: 620.245 +[e2e] throughputRate: 49.5601, latency: 21630.3 +[data read] throughputRate: 643.612, moduleLatency: 1.55373 +[preprocess] throughputRate: 348.108, moduleLatency: 2.87267 +[inference] throughputRate: 51.1218, Interface throughputRate: 65.5303, moduleLatency: 19.3435 +[postprocess] throughputRate: 51.1467, moduleLatency: 19.5516 ``` -batch4 310单卡吞吐率:6.67466x4=26.69864fps -batch8性能: +batch1:Interface throughputRate: 65.5303 ,710吞吐率为65.5303fps +batch4:Interface throughputRate: 64.5802 ,710吞吐率为64.5802fps +batch8:Interface throughputRate: 64.3861 ,710吞吐率为64.3861fps +batch16:Interface throughputRate: 63.617 ,710吞吐率为63.617fps +batch32:Interface throughputRate: 59.7592 ,710吞吐率为59.7592fps +batch64:Interface throughputRate: 61.1219 ,710吞吐率为61.1219fps + +### 7.3 T4性能数据 +在装有T4卡的服务器上测试gpu性能,测试过程请确保卡没有运行其他任务,TensorRT版本:7.2.3.4,cuda版本:11.0,cudnn版本:8.2 +batch1性能: ``` -[e2e] throughputRate: 6.17056, latency: 173728 -[data read] throughputRate: 216.73, moduleLatency: 4.61403 -[preprocess] throughputRate: 57.3928, moduleLatency: 17.4238 -[inference] throughputRate: 6.20835, Interface throughputRate: 6.41992, moduleLatency: 160.848 -[postprocess] throughputRate: 0.781576, moduleLatency: 1279.47 +trtexec --onnx=vnet.onnx --fp16 --shapes=actual_input_1:1x1x64x80x80 --threads 
``` -batch8 310单卡吞吐率:6.41992x4=25.67968fps -batch32性能: + ``` -[e2e] throughputRate: 6.09413, latency: 175907 -[data read] throughputRate: 183.187, moduleLatency: 5.45889 -[preprocess] throughputRate: 49.9254, moduleLatency: 20.0299 -[inference] throughputRate: 6.15986, Interface throughputRate: 6.35151, moduleLatency: 162.051 -[postprocess] throughputRate: 0.200903, moduleLatency: 4977.52 +[04/29/2022-14:15:41] [I] GPU Compute +[04/29/2022-14:15:41] [I] min: 90.6819 ms +[04/29/2022-14:15:41] [I] max: 92.8173 ms +[04/29/2022-14:15:41] [I] mean: 91.687 ms +[04/29/2022-14:15:41] [I] median: 91.8387 ms +[04/29/2022-14:15:41] [I] percentile: 92.8173 ms at 99% +[04/29/2022-14:15:41] [I] total compute time: 3.11736 s + ``` -batch32 310单卡吞吐率:6.35151x4=25.40604fps +batch1 t4单卡吞吐率:1000/(91.687/1)=10.90667fps -### 7.2 T4性能数据 -在装有T4卡的服务器上测试gpu性能,测试过程请确保卡没有运行其他任务,TensorRT版本:7.2.3.4,cuda版本:11.0,cudnn版本:8.2 -batch1性能: +batch4性能: ``` -trtexec --onnx=vnet.onnx --fp16 --shapes=actual_input_1:1x1x64x80x80 --threads +trtexec --onnx=vnet.onnx --fp16 --shapes=actual_input_1:4x1x64x80x80 --threads ``` -gpu T4是4个device并行执行的结果,mean是时延(tensorrt的时延是batch个数据的推理时间),即吞吐率的倒数乘以batch + ``` -[09/17/2021-15:39:40] [I] GPU Compute -[09/17/2021-15:39:40] [I] min: 92.4146 ms -[09/17/2021-15:39:40] [I] max: 103.909 ms -[09/17/2021-15:39:40] [I] mean: 97.0678 ms -[09/17/2021-15:39:40] [I] median: 96.9087 ms -[09/17/2021-15:39:40] [I] percentile: 103.909 ms at 99% -[09/17/2021-15:39:40] [I] total compute time: 3.20324 s +[04/29/2022-14:27:39] [I] GPU Compute +[04/29/2022-14:27:39] [I] min: 358.297 ms +[04/29/2022-14:27:39] [I] max: 366.323 ms +[04/29/2022-14:27:39] [I] mean: 360.984 ms +[04/29/2022-14:27:39] [I] median: 360.4 ms +[04/29/2022-14:27:39] [I] percentile: 366.323 ms at 99% +[04/29/2022-14:27:39] [I] total compute time: 3.60984 s + ``` -batch1 t4单卡吞吐率:1000/(96.9087/1)=10.31899fps +batch4 t4单卡吞吐率:1000/(360.984/4)=11.08082fps -batch16性能: +batch8性能: ``` -trtexec --onnx=nested_unet.onnx --fp16 
--shapes=actual_input_1:16x3x96x96 --threads +trtexec --onnx=vnet.onnx --fp16 --shapes=actual_input_1:8x1x64x80x80 --threads ``` + ``` -[09/17/2021-16:11:37] [I] GPU Compute -[09/17/2021-16:11:37] [I] min: 1574.28 ms -[09/17/2021-16:11:37] [I] max: 1576.2 ms -[09/17/2021-16:11:37] [I] mean: 1575.22 ms -[09/17/2021-16:11:37] [I] median: 1574.94 ms -[09/17/2021-16:11:37] [I] percentile: 1576.2 ms at 99% -[09/17/2021-16:11:37] [I] total compute time: 15.7522 s +[04/29/2022-14:36:16] [I] GPU Compute +[04/29/2022-14:36:16] [I] min: 810.815 ms +[04/29/2022-14:36:16] [I] max: 817.788 ms +[04/29/2022-14:36:16] [I] mean: 813.193 ms +[04/29/2022-14:36:16] [I] median: 813.153 ms +[04/29/2022-14:36:16] [I] percentile: 817.788 ms at 99% +[04/29/2022-14:36:16] [I] total compute time: 8.13194 s + ``` -batch16 t4单卡吞吐率:1000/(1575.22/16)=10.15731fps +batch8 t4单卡吞吐率:1000/(813.193/8)=9.83776fps -batch4性能: +batch16性能: ``` -[09/17/2021-15:44:51] [I] GPU Compute -[09/17/2021-15:44:51] [I] min: 361.722 ms -[09/17/2021-15:44:51] [I] max: 375.435 ms -[09/17/2021-15:44:51] [I] mean: 365.263 ms -[09/17/2021-15:44:51] [I] median: 363.615 ms -[09/17/2021-15:44:51] [I] percentile: 375.435 ms at 99% -[09/17/2021-15:44:51] [I] total compute time: 3.65263 s +trtexec --onnx=vnet.onnx --fp16 --shapes=actual_input_1:16x1x64x80x80 --threads ``` -batch4 t4单卡吞吐率:1000/(365.263/4)=10.95101fps -batch8性能: ``` -[09/17/2021-15:52:50] [I] GPU Compute -[09/17/2021-15:52:50] [I] min: 796.131 ms -[09/17/2021-15:52:50] [I] max: 802.935 ms -[09/17/2021-15:52:50] [I] mean: 798.473 ms -[09/17/2021-15:52:50] [I] median: 798.262 ms -[09/17/2021-15:52:50] [I] percentile: 802.935 ms at 99% -[09/17/2021-15:52:50] [I] total compute time: 7.98473 s +[04/29/2022-14:45:58] [I] GPU Compute +[04/29/2022-14:45:58] [I] min: 1561.08 ms +[04/29/2022-14:45:58] [I] max: 1566.75 ms +[04/29/2022-14:45:58] [I] mean: 1563.66 ms +[04/29/2022-14:45:58] [I] median: 1563.35 ms +[04/29/2022-14:45:58] [I] percentile: 1566.75 ms at 99% 
+[04/29/2022-14:45:58] [I] total compute time: 15.6366 s + + ``` -batch8 t4单卡吞吐率:1000/(798.473/8)=10.01912fps +batch16 t4单卡吞吐率:1000/(1563.66/16)=10.23240fps batch32性能: ``` -[09/17/2021-16:29:35] [I] GPU Compute -[09/17/2021-16:29:35] [I] min: 3382.94 ms -[09/17/2021-16:29:35] [I] max: 3395.54 ms -[09/17/2021-16:29:35] [I] mean: 3389.83 ms -[09/17/2021-16:29:35] [I] median: 3390.36 ms -[09/17/2021-16:29:35] [I] percentile: 3395.54 ms at 99% -[09/17/2021-16:29:35] [I] total compute time: 33.8983 s -``` -batch32 t4单卡吞吐率:1000/(3389.83/32)=9.44fps - -### 7.3 性能对比 -batch1:6.10496x4 > 1000x1/(96.9087/1) -batch16:6.49396x4 > 1000x1/(1575.22/16) -310单个device的吞吐率乘4即单卡吞吐率比T4单卡的吞吐率大,故310性能高于T4性能,性能达标。 -对于batch1与batch16,310性能均高于T4性能1.2倍,该模型放在ACL_PyTorch/Benchmark/cv/segmentation目录下。 +trtexec --onnx=vnet.onnx --fp16 --shapes=actual_input_1:32x1x64x80x80 --threads +``` + +``` +[04/29/2022-15:08:59] [I] GPU Compute +[04/29/2022-15:08:59] [I] min: 3571.51 ms +[04/29/2022-15:08:59] [I] max: 6799 ms +[04/29/2022-15:08:59] [I] mean: 5932.02 ms +[04/29/2022-15:08:59] [I] median: 6416.47 ms +[04/29/2022-15:08:59] [I] percentile: 6799 ms at 99% +[04/29/2022-15:08:59] [I] total compute time: 59.3202 s + + +``` +batch32 t4单卡吞吐率:1000/(5932.02/32)=5.39445fps + +batch64性能: +``` +trtexec --onnx=vnet.onnx --fp16 --shapes=actual_input_1:64x1x64x80x80 --threads +``` + +``` +[04/29/2022-16:19:59] [I] GPU Compute +[04/29/2022-16:19:59] [I] min: 12874.2 ms +[04/29/2022-16:19:59] [I] max: 13251.7 ms +[04/29/2022-16:19:59] [I] mean: 13051.9 ms +[04/29/2022-16:19:59] [I] median: 13051.4 ms +[04/29/2022-16:19:59] [I] percentile: 13251.7 ms at 99% +[04/29/2022-16:19:59] [I] total compute time: 130.519 s + +``` +batch64 t4单卡吞吐率:1000/(13051.4/64)=4.90369fps + +### 7.4 性能对比 + +310 710 T4性能对比如下(benchmark推理工具) +| batch | 310 | 710 | T4 | 710/310 | 710/T4 | +|-------|----------|---------|----------|---------|----------| +| 1 | 31.6686 | 65.5303 | 10.90667 | 2.06925 | 6.00828 | +| 4 | 34.0032 | 64.5802 | 
11.08082 | 1.89924 | 5.82811 | +| 8 | 32.02776 | 64.3861 | 9.83776 | 2.01032 | 6.54479 | +| 16 | 32.4406 | 63.617 | 10.23240 | 1.96103 | 6.21721 | +| 32 | 31.65764 | 59.7592 | 5.39445 | 1.88767 | 11.07790 | +| 64 | - | 61.1219 | 4.90369 | - | 12.46447 | +| | | | | | | +| 最优 | 34.0032 | 65.5303 | 11.08082 | | | + +对于所有batchsize,710性能均高于310性能1.2倍,同时710性能均高于T4性能1.6倍,性能达标。 **性能优化:** >没有遇到性能不达标的问题,故不需要进行性能优化 diff --git a/ACL_PyTorch/contrib/cv/segmentation/VNet/env.sh b/ACL_PyTorch/contrib/cv/segmentation/VNet/env.sh deleted file mode 100644 index 005de02039e70c32c74850be4d84753ad80ad532..0000000000000000000000000000000000000000 --- a/ACL_PyTorch/contrib/cv/segmentation/VNet/env.sh +++ /dev/null @@ -1,7 +0,0 @@ -# 配置环境变量 -export install_path=/usr/local/Ascend/ascend-toolkit/latest -export PATH=/usr/local/python3.7.5/bin:${install_path}/atc/ccec_compiler/bin:${install_path}/atc/bin:$PATH -export PYTHONPATH=${install_path}/atc/python/site-packages:$PYTHONPATH -export LD_LIBRARY_PATH=${install_path}/atc/lib64:${install_path}/acllib/lib64:$LD_LIBRARY_PATH -export ASCEND_OPP_PATH=${install_path}/opp -export REPEAT_TUNE=true \ No newline at end of file diff --git a/ACL_PyTorch/contrib/cv/segmentation/VNet/parse.py b/ACL_PyTorch/contrib/cv/segmentation/VNet/parse.py new file mode 100644 index 0000000000000000000000000000000000000000..197b7e10f2a762848c16322d11058b7f210a84a8 --- /dev/null +++ b/ACL_PyTorch/contrib/cv/segmentation/VNet/parse.py @@ -0,0 +1,31 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import json +import re + +if __name__ == '__main__': + if sys.argv[1].endswith('.json'): + result_json = sys.argv[1] + with open(result_json, 'r') as f: + content = f.read() + print(content) + elif sys.argv[1].endswith('.txt'): + result_txt = sys.argv[1] + with open(result_txt, 'r') as f: + content = f.read() + txt_data_list = [i.strip() for i in re.findall(r':(.*?),', content.replace('\n', ',') + ',')] + fps = float(txt_data_list[7].replace('samples/s', '')) + print('310 bs{} fps:{}'.format(result_txt.split('_')[3], fps))
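Note on the README throughput figures: the per-card numbers in section 7 appear to follow two small formulas. The 310 figure is the benchmark `Interface throughputRate` multiplied by 4 (for example 7.91715 x 4 = 31.6686), and the T4 figure is 1000 / (latency_ms / batch), since the trtexec latency covers one whole batch. The following is a minimal sketch of that arithmetic, assuming four devices per 310 card as the removed README text states; the helper names are hypothetical and are not part of `parse.py`.

```python
# Sketch only: these helpers are hypothetical and do not exist in parse.py.

def card_fps_310(interface_throughput: float, devices_per_card: int = 4) -> float:
    """Per-card throughput of a 310 card.

    Assumes the card exposes `devices_per_card` devices and that the
    benchmark's "Interface throughputRate" is measured on one device.
    """
    return interface_throughput * devices_per_card


def t4_fps(latency_ms: float, batch_size: int) -> float:
    """Throughput in samples/s from a trtexec GPU compute latency.

    Assumes `latency_ms` is the time to process one full batch.
    """
    return 1000.0 / (latency_ms / batch_size)


if __name__ == "__main__":
    # Batch 1 on 310, using the Interface throughputRate from the README.
    print("310 bs1 fps:", round(card_fps_310(7.91715), 4))
    # Batch 4 on T4, using the mean latency from the trtexec log.
    print("T4 bs4 fps:", round(t4_fps(360.984, 4), 5))
```

The 710 figures in the README are reported without the factor of 4, so `devices_per_card=1` would reproduce them under the same assumption.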