diff --git "a/AscendPytorch\346\250\241\345\236\213\344\274\227\346\231\272FAQ.md" "b/AscendPytorch\346\250\241\345\236\213\344\274\227\346\231\272FAQ.md"
index b607ca83900a055218b6991483ad953a0778470e..d3e549beb52bfd2c9cf5e20686d01c2d4b0877e1 100644
--- "a/AscendPytorch\346\250\241\345\236\213\344\274\227\346\231\272FAQ.md"
+++ "b/AscendPytorch\346\250\241\345\236\213\344\274\227\346\231\272FAQ.md"
@@ -34,6 +34,7 @@
     - [FAQ28: Problem loading a pth file during model inference.](#faq28模型推理时加载pth出现问题)
     - [FAQ29: Installing the toolkit 5.0.1 upgrade package fails with an error on multiple environments.](#faq29多个环境都遇到了安装升级501的toolkit包安装时报错的问题)
     - [FAQ30: Workaround for AlexNet dropout accuracy not meeting the target.](#faq30alexnet-dropout-精度不达标规避方法)
+    - [FAQ42: The EmbeddingDenseGrad operator reports an error after the model runs for 400+ steps](#faq42模型跑400多个step算子EmbeddingDenseGrad报错)
   - [2.2 FAQ on common problems with distributed NPU model runs](#22-npu模型分布式运行常见问题faq)
     - [FAQ1: "host not found" error during distributed model training.](#faq1在模型分布式训练时遇到报错-host-not-found)
     - [FAQ2: Loss value is extremely large (over 10,000) in eval mode while the model runs.](#faq2在模型运行时遇到eval模式下loss值特别大过万)
@@ -870,7 +871,23 @@ Wrong Python version and wrong execution location
 
 Refer to the following adaptation code:
 
 ![](https://gitee.com/zwx5317131/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq41_0625_fig2.PNG)
+### FAQ42: The EmbeddingDenseGrad operator reports an error after the model runs for 400+ steps
+- Symptom
+![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq42_0712_fig1.png)
+The name of the operator that failed appears slightly further down in the log:
+![](https://gitee.com/xiaxia3/ascend-pytorch-crowdintelligence-doc/raw/master/figures/model_faq42_0712_fig2.png)
+- Cause analysis
+
+If an operator's input contains NaN or INF values, the memory overflow error shown above occurs.
+
+- Solution
+
+While the accuracy of the whole network does not yet meet the target, work around the problem by moving the EmbeddingDenseGrad operator to the CPU with `to`. Once the whole-network accuracy meets the target, move it back to the NPU with `to` and test again.
+
+- Further notes:
+
+When the log shows `The DDR address of the MTE instruction is out of range`, it usually indicates an operator computation overflow error.
 
 ## [2.2 FAQ on common problems with distributed NPU model runs](#22-NPU模型分布式运行常见问题FAQ)
 
diff --git a/figures/model_faq42_0712_fig1.png b/figures/model_faq42_0712_fig1.png
new file mode 100644
index 0000000000000000000000000000000000000000..a3a48301e36d90119c3b87bc932768b46eba7c34
Binary files /dev/null and b/figures/model_faq42_0712_fig1.png differ
diff --git a/figures/model_faq42_0712_fig2.png b/figures/model_faq42_0712_fig2.png
new file mode 100644
index 0000000000000000000000000000000000000000..defc4c9b8db676569999138230e637c72cce6f9a
Binary files /dev/null and b/figures/model_faq42_0712_fig2.png differ
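
Note on FAQ42 in the diff above: the entry attributes the failure to NaN or INF values in an operator's input. A minimal sketch of how such values could be located before they reach the operator is given here. The helper name `find_non_finite` and the sample input names are hypothetical and are not part of the FAQ. The sketch uses plain Python lists so that it runs without extra dependencies; with PyTorch tensors, the analogous check would be `torch.isfinite(tensor).all()`.

```python
import math


def find_non_finite(named_inputs):
    """Return {name: [positions]} for every input that holds NaN or Inf.

    `named_inputs` maps a (hypothetical) input name to a flat list of floats.
    Inputs that contain only finite values are left out of the result.
    """
    bad = {}
    for name, values in named_inputs.items():
        # math.isfinite is False for NaN, +Inf and -Inf.
        positions = [i for i, v in enumerate(values) if not math.isfinite(v)]
        if positions:
            bad[name] = positions
    return bad


# Illustrative data only; the names do not come from the FAQ.
inputs = {
    "grad_output": [0.1, float("nan"), 0.3],
    "weights": [1.0, 2.0, 3.0],
    "scores": [float("inf"), -1.0],
}

report = find_non_finite(inputs)
print(report)
```

Running a check of this kind on the inputs of the failing step narrows down where the non-finite values first appear, which helps decide whether the CPU workaround from the FAQ is needed.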