diff --git a/figures/model_faq48_fig1_0819.png b/figures/model_faq48_fig1_0819.png
new file mode 100644
index 0000000000000000000000000000000000000000..0afa6e00125ce9197a0c60c48978c7172a8bb654
Binary files /dev/null and b/figures/model_faq48_fig1_0819.png differ
diff --git a/figures/model_faq48_fig2_0819.png b/figures/model_faq48_fig2_0819.png
new file mode 100644
index 0000000000000000000000000000000000000000..17dd52ae7dc8ce26db4671365245da047307f88c
Binary files /dev/null and b/figures/model_faq48_fig2_0819.png differ
diff --git a/figures/model_faq48_fig3_0819.png b/figures/model_faq48_fig3_0819.png
new file mode 100644
index 0000000000000000000000000000000000000000..e86c242d7f99f22ceb65d7cd415899b31bde8582
Binary files /dev/null and b/figures/model_faq48_fig3_0819.png differ
diff --git "a/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md" "b/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
index 7c4ed6dafd2827f6d53df673a8b1a7e34ea49e0f..719b0058842ff4e631def440f090a7fcd5428870 100644
--- "a/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
+++ "b/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
@@ -51,6 +51,7 @@
  - [FAQ45: DCNv2 operator with mixed precision fails during model training: excepted scalar type float but found half](#faq45模型训练过程dcnv2算子混合精度报错excepted-scalar-type-float-but-found-half)
  - [FAQ46: PadV3D operator fails during model training: constant_values value mismatches](#faq46模型训练过程中算子padv3d报错constant_values-value-mismatches)
  - [FAQ47: Python-layer error during model training: Expected isFloatingType(grad[i].scalar_type()) to be true, but got false](#faq47模型训练过程中报Expected-isFloatingType(grad[i].scalar_type())-to-be-true-but-got-false)
+ - [FAQ48: aicpu operator error during model training](#faq48模型训练中aicpu算子报错)
 - [2.2 FAQ on distributed runs of NPU models](#22-npu模型分布式运行常见问题faq)
  - [FAQ1: Error "host not found." during distributed model training](#faq1在模型分布式训练时遇到报错-host-not-found)
  - [FAQ2: Loss in eval mode is extremely large (over 10,000) when running the model](#faq2在模型运行时遇到eval模式下loss值特别大过万)
@@ -974,6 +975,20 @@ The DCNv2 installation version uses this link as an example: https://github.com/jinfagang/DCNv2_latest
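 A standalone gather operator was then built from the model's data, and the problem was reproduced. The conclusion is that the input to the gather operator is invalid, yet gather on the NPU still computes without complaint, so the error only surfaces in the backward pass.
+### FAQ48: aicpu operator error during model training: Aicpu kernel execute failed
+- Symptom
+The error message printed on screen is shown below:
+![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq48_fig1_0819.png)
+This shows that an aicpu operator failed. The host-side log then shows the following:
+![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq48_fig2_0819.png)
+
+- Cause analysis
+The host-side log shows that the failing aicpu operator is index_put. The suspected cause is an incorrect data type for the index values: the aicpu operator is implemented the same way as the CPU operator, which takes int64 indices.
+- Solution
+Add instrumentation to locate the function that raises the error, then move this operator to the CPU for computation; it fails there as well. Printing the dtype of the input indices shows int32. After converting the indices to int64, the error no longer occurs.
+
+ ![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq48_fig3_0819.png)
+
 ## [2.2 FAQ on distributed runs of NPU models](#22-NPU模型分布式运行常见问题FAQ)
 ### FAQ1: Error "host not found." during distributed model training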