diff --git "a/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-FAQ.md" "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-FAQ.md"
index 614d750dea40ce89f72e936ce7aaf53db9396644..e74e7ae560fee7f3a0cbe35e285cb0a56a717648 100644
--- "a/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-FAQ.md"
+++ "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/PyTorch\347\246\273\347\272\277\346\216\250\347\220\206-FAQ.md"
@@ -4,9 +4,11 @@
 - [2 OM model conversion issues](#2-om模型转换问题汇总)
     - [2.1 How to view `ONNX/om/pbtxt` models](#21-如何查看-onnxompbtxt-模型)
     - [2.2 `Exporting the operator {opname} to ONNX opset version {version} is not supported.`](#22-exporting-the-operator-opname-to-onnx-opset-version-version-is-not-supported)
+    - [2.3 Workaround for Resize not supporting 5D input](#23-resize不支持5d输入的解决方案)
 - [3 OM offline inference failures](#3-om离线推理失败问题汇总)
     - [3.1 atc command or Ascend dynamic libraries not found](#31-找不到atc命令或找不到ascend动态库)
-    - [Common model inference tool errors and solutions](#模型推理工具常见的错误解决方案)
+    - [3.2 Common model inference tool errors and solutions](#32-模型推理工具常见的错误解决方案)
+    - [3.3 Differences between msame and benchmark in multi-batch inference](#33-msame和benchmark在多batch下推理区别)
 - [4 Common accuracy debugging issues](#4-精度调试常见问题)
 - [5 Common performance optimization issues](#5-性能优化常见问题)
     - [5.1 How to use AIPP to improve performance](#51-如何使用aipp进行性能提升)
@@ -45,6 +47,68 @@
     #x = F.adaptive_avg_pool2d(input, output_size=bin_size)
     x = adaptive_avg_pool_op(input, (bin_size, bin_size)) # replaces the line above
     ```
+## 2.3 Workaround for Resize not supporting 5D input
+- Symptom
+
+    Converting the model to OM with atc fails. The root cause is a Resize node with 5D input, which leaves the downstream operators with incorrect shapes.
+    ```shell
+    # Error message
+    E89999: Inner Error!
+    op[Concat_114], the input shape dims should be equal except merge axis.[FUNC:ConcatInferShapeCommon][FILE:split_combination_ops.cc][LINE:705]
+    Call InferShapeAndType for node:Concat_114(ConcatD) failed[FUNC:Infer][FILE:infershape_pass.cc][LINE:117]
+    process pass InferShapePass on node:Concat_114 failed, ret:4294967295[FUNC:RunPassesOnNode][FILE:base_pass.cc][LINE:433]
+    build graph failed, graph id:0, ret:1343242270[FUNC:BuildModel][FILE:ge_generator.cc][LINE:1327]
+    ```
+- Cause
+
+    CANN does not currently support Resize with 5D input, so the graph has to be modified to work around it. The graph before and after the modification is shown below.
+    ![FAQ001](./images/FAQ001.png)
+- Solution
+
+    Use the [MagicONNX tool](https://gitee.com/Ronnie_zheng/MagicONNX/tree/master) to modify the graph. The code below handles this specific case; for more features, see the [tutorial](https://gitee.com/Ronnie_zheng/MagicONNX/blob/master/docs/tutorials.md) and the [API reference](https://gitee.com/Ronnie_zheng/MagicONNX/blob/master/docs/operations.md).
+    ```python
+    import numpy as np
+    from magiconnx import OnnxGraph
+
+
+    def modify(path):
+        graph = OnnxGraph(path)
+        resizes = graph.get_nodes("Resize")
+        shapes = [[[1, 512*4, 4, 8], [1, 512*4, 8, 16], [1, 512, 4, 8*16], [1, 512, 8, 8*16], [1, 512, 8, 8, 16]],
+                  [[1, 256*8, 8, 16], [1, 256*8, 16, 32], [1, 256, 8, 16*32], [1, 256, 16, 16*32], [1, 256, 16, 16, 32]],
+                  [[1, 128*16, 16, 32], [1, 128*16, 32, 64], [1, 128, 16, 32*64], [1, 128, 32, 32*64], [1, 128, 32, 32, 64]]]
+        for idx, node in enumerate(resizes):  # shapes[idx] holds the five target shapes for this Resize
+            reshape1 = graph.add_node(f'Reshape_{node.name}', 'Reshape')  # 5D -> 4D: fold depth into channels
+            graph.add_initializer(f'shape_{node.name}', np.array(shapes[idx][0]))
+            reshape1.inputs = [node.inputs[0], f'shape_{node.name}']
+            reshape1.outputs = [f'Reshape_{node.name}']
+
+            graph[node.inputs[-1]].value = np.array(shapes[idx][1])  # original Resize now scales H and W only
+            out_name = node.outputs[0]
+            node.set_input(0, f'Reshape_{node.name}')
+            node.set_output(0, f'{node.name}_reshape')
+
+            reshape2 = graph.add_node(f'Reshape2_{node.name}', 'Reshape')  # regroup: depth becomes a spatial axis, H*W is flattened
+            graph.add_initializer(f'shape2_{node.name}', np.array(shapes[idx][2]))
+            reshape2.inputs = [f'{node.name}_reshape', f'shape2_{node.name}']
+            reshape2.outputs = [f'Reshape2_{node.name}_out']
+
+            resize2 = graph.add_node(f'Resize2_{node.name}', 'Resize')  # second Resize scales depth
+            graph.add_initializer(f'size_{node.name}', np.array(shapes[idx][3]))
+            resize2.inputs = [f'Reshape2_{node.name}_out', node.inputs[1], node.inputs[1], f'size_{node.name}']
+            resize2.outputs = [f'Resize2_{node.name}']
+
+            reshape3 = graph.add_node(f'Reshape3_{node.name}', 'Reshape')  # back to the 5D output shape
+            graph.add_initializer(f'shape3_{node.name}', np.array(shapes[idx][4]))
+            reshape3.inputs = [f'Resize2_{node.name}', f'shape3_{node.name}']
+            reshape3.outputs = [out_name]
+
+        graph.save('modify.onnx')
+
+    if __name__ == "__main__":
+        modify('src.onnx')
+    ```
+
 # 3 OM offline inference failures
 ## 3.1 atc command or Ascend dynamic libraries not found
 - Symptom
@@ -68,7 +132,7 @@
     export ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
     ```
 
-## Common model inference tool errors and solutions
+## 3.2 Common model inference tool errors and solutions
 The default model inference tool is currently benchmark. Common errors and some solutions are listed below:
 
 - Symptom:
@@ -86,6 +150,30 @@
     - This is usually caused by the format of the input_image_path/input_text_path input. For example, NLP models often take multiple inputs, which can lead to input-order or input-name problems.
     - The benchmark tool currently supports these model types: image, NLP, YOLO detection, search, semantic understanding, and translation. Some input types are not supported, such as 3D input (e.g. video understanding or point clouds); for those, use the [msame tool](https://gitee.com/ascend/tools/tree/master/msame) for inference.
 
+## 3.3 Differences between msame and benchmark in multi-batch inference
+- Symptom
+
+    Inference with msame reports the following errors
+    ```shell
+    [ERROR] GE(7672,msame):2021-11-16-12:57:02.335.376 [davinci_model.cc:3466]7672 CheckUserAndModelSize: ErrorNo: 145000(Parameter invalid.) [EXEC][DEFAULT][Check][Param] input size:150528 from user add align:64 < op_size:2408448 in model, model_id:1
+    [ERROR] GE(7672,msame):2021-11-16-12:57:02.335.386 [davinci_model.cc:3546]7672 UpdateIoTaskArgs: ErrorNo: 145000(Parameter invalid.) [EXEC][DEFAULT][Call][CheckInputAndModelSize] failed, op[image]
+    [ERROR] GE(7672,msame):2021-11-16-12:57:02.335.395 [davinci_model.cc:3483]7672 CopyModelData: ErrorNo: 145000(Parameter invalid.) [EXEC][DEFAULT][Call][UpdateIoTaskArgs] [ZCPY] Update input data to model:1 failed.
+    [ERROR] GE(7672,msame):2021-11-16-12:57:02.335.406 [davinci_model.cc:3921]7672 NnExecute: ErrorNo: 4294967295(failed) [EXEC][DEFAULT][Copy][ModelData] failed. model id: 1
+    [ERROR] GE(7672,msame):2021-11-16-12:57:02.335.414 [graph_loader.cc:283]7672 ExecuteModel: ErrorNo: 145000(Parameter invalid.) [EXEC][DEFAULT][Execute][Model] failed, model_id:1.
+    [ERROR] ASCENDCL(7672,msame):2021-11-16-12:57:02.335.428 [model.cpp:824]7672 ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[145000], modelId[1]
+    [ERROR] ASCENDCL(7672,msame):2021-11-16-12:57:02.335.446 [model.cpp:869]7672 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[1] execute failed, result[145000]
+    [ERROR] execute model failed, modelId is 1
+    [ERROR] model execute failed
+    [ERROR] Sample process failed
+    ```
+ - Cause
+
+    In multi-batch mode with `batch_size=16`, the benchmark tool reads 16 bin files per inference, so each bin file holds one image;
+    the msame tool reads only one bin file per inference (it has no `batch_size` command-line option), so in multi-batch mode a single bin file must hold 16 images. This is why the log reports an input size of 150528 against a model op_size of 2408448 (16 x 150528).
+ - Solution
+
+    During preprocessing, save 16 images into a single bin file.
+
 # 4 Common accuracy debugging issues
 
 # 5 Common performance optimization issues
diff --git "a/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/images/FAQ001.png" "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/images/FAQ001.png"
new file mode 100644
index 0000000000000000000000000000000000000000..0c396ece29c3a7daaffe1c0aeba1adfd4a475d5a
Binary files /dev/null and "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/images/FAQ001.png" differ
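
The fix described in section 3.3 of the FAQ (packing 16 images into one bin file for msame) could be sketched roughly as follows. This is a minimal illustration, not part of the patch: the function name `pack_bins`, the directory arguments, and the `batch_XXXXX.bin` output naming are all hypothetical. It assumes each source bin already holds exactly one preprocessed sample in the layout the model expects, and that a batched input is simply those samples concatenated along the batch axis.

```python
from pathlib import Path


def pack_bins(src_dir, dst_dir, batch_size=16):
    """Concatenate every `batch_size` single-sample .bin files into one batch .bin file.

    Hypothetical helper: files are taken in sorted name order, and a trailing
    group smaller than `batch_size` is skipped rather than padded.
    """
    src_files = sorted(Path(src_dir).glob("*.bin"))
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    written = []
    full = len(src_files) - len(src_files) % batch_size  # number of files in complete groups
    for start in range(0, full, batch_size):
        group = src_files[start:start + batch_size]
        out = dst / f"batch_{start // batch_size:05d}.bin"
        # Raw byte concatenation keeps each sample contiguous, sample 0 first.
        out.write_bytes(b"".join(f.read_bytes() for f in group))
        written.append(out)
    return written
```

With the sizes from the log above, 16 source files of 150528 bytes each would yield one output file of 2408448 bytes, which matches the op_size the model expects. Skipping the incomplete final group is a design choice made here for simplicity; padding it with repeated or zero samples is an alternative if every image must be evaluated.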