diff --git a/figures/model_faq47_fig1_0812.jpg b/figures/model_faq47_fig1_0812.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..1c4520b0be9bbda6ab333c596ebc284e5ae790d6
Binary files /dev/null and b/figures/model_faq47_fig1_0812.jpg differ
diff --git a/figures/model_faq47_fig2_0812.png b/figures/model_faq47_fig2_0812.png
new file mode 100644
index 0000000000000000000000000000000000000000..299bc47ad065400f2b10c0614a098822e625d5c7
Binary files /dev/null and b/figures/model_faq47_fig2_0812.png differ
diff --git a/figures/model_faq47_fig3_0812.png b/figures/model_faq47_fig3_0812.png
new file mode 100644
index 0000000000000000000000000000000000000000..b43d3f53e977045d7f018fdbf63a1742ef40251f
Binary files /dev/null and b/figures/model_faq47_fig3_0812.png differ
diff --git "a/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md" "b/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
index 3922a841f0630b7cb76c5746025570aa133bc8db..947532b05fc9ab893a4bb60d8395704c613209f9 100644
--- "a/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
+++ "b/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
@@ -52,6 +52,7 @@
 - [FAQ44. Training starts with no errors or log output and a long wait](#faq44模型刚开始训练没有报错或日志信息等待时间久)
 - [FAQ45. DCNv2 operator with mixed precision fails during training: excepted scalar type float but found half](#faq45模型训练过程dcnv2算子混合精度报错excepted-scalar-type-float-but-found-half)
 - [FAQ46. PadV3D operator fails during training: constant_values value mismatches](#faq46模型训练过程中算子padv3d报错constant_values-value-mismatches)
+ - [FAQ47. Python-level error during training: Expected isFloatingType(grad[i].scalar_type()) to be true, but got false](#faq47模型训练过程中报Expected-isFloatingType(grad[i].scalar_type())-to-be-true-but-got-false)
 - [2.2 FAQ for distributed NPU model runs](#22-npu模型分布式运行常见问题faq)
 - [FAQ1. Error "host not found" during distributed training](#faq1在模型分布式训练时遇到报错-host-not-found)
 - [FAQ2. Loss is extremely large (over 10,000) in eval mode](#faq2在模型运行时遇到eval模式下loss值特别大过万)
@@ -959,6 +960,21 @@ The DCNv2 installation uses this repository as an example: https://github.com/jinfagang/DCNv2_latest
 Compute the pad parameters by hand, remove the F.pad() call, and fold the padding into the convolution; the two are equivalent.
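 ![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq46_fig2_0803.png)
+
+### FAQ47. Error during model training: Expected isFloatingType(grad[i].scalar_type()) to be true, but got false
+- Symptom
+The console prints the following error:
+![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq47_fig1_0812.jpg)
+The output shows that the failure occurs in the backward pass and is raised by an aicpu operator. The host-side log then shows:
+![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq47_fig2_0812.png)
+
+- Root cause analysis
+The host-side log shows that the failing aicpu operator is GatherElements. Searching the PyTorch source for places where the backward pass calls gather, derivatives.yaml indicates that only the backward of scatter_ and scatter_add_ calls gather. The model's forward code calls neither scatter_ nor scatter_add_, but it does call gather, so the error likely originates in the gather call of the forward pass.
+- Solution
+Move every gather call in the model to the CPU. The problem then reproduces, and the host-side log shows:
+![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq47_fig3_0812.png)
+
+Next, build a standalone gather test from the model's data; the problem reproduces there as well. The conclusion is that the input to gather is invalid, yet gather on the NPU computes without reporting an error, so the failure only surfaces later in the backward pass.
 ## [2.2 FAQ for distributed NPU model runs](#22-NPU模型分布式运行常见问题FAQ)
 ### FAQ1. Error "host not found" during distributed training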