diff --git "a/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md" "b/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
index 7c4ed6dafd2827f6d53df673a8b1a7e34ea49e0f..7cd2438d6b3ca93d957c7b5a569a197032397c86 100644
--- "a/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
+++ "b/pytorch-train-guide/Pytorch\350\256\255\347\273\203-FAQ.md"
@@ -51,6 +51,7 @@
     - [FAQ45. DCNv2 operator with mixed precision errors during model training: expected scalar type float but found half](#faq45模型训练过程dcnv2算子混合精度报错excepted-scalar-type-float-but-found-half)
     - [FAQ46. PadV3D operator error during model training: constant_values value mismatches](#faq46模型训练过程中算子padv3d报错constant_values-value-mismatches)
     - [FAQ47. Python-level error during model training: Expected isFloatingType(grad[i].scalar_type()) to be true, but got false](#faq47模型训练过程中报Expected-isFloatingType(grad[i].scalar_type())-to-be-true-but-got-false)
+    - [FAQ48. Installing h5py](#faq48h5py安装方法)
   - [2.2 FAQ: Common Issues in Distributed NPU Model Runs](#22-npu模型分布式运行常见问题faq)
     - [FAQ1. Error during distributed model training: host not found.](#faq1在模型分布式训练时遇到报错-host-not-found)
     - [FAQ2. During model runs, the loss in eval mode is extremely large (over 10,000).](#faq2在模型运行时遇到eval模式下loss值特别大过万)
@@ -974,6 +975,34 @@ The DCNv2 version installed follows this link as an example: https://github.com/jinfagang/DCNv2_latest
 
 Then a standalone gather operator test was built from the model's data, and the problem reproduced. The conclusion: the input to the gather operator is invalid, yet the NPU gather computes the forward pass normally, so the error only surfaces in the backward pass.
 
+### FAQ48. Installing h5py
+```
+#######################################
+# Install h5py
+#######################################
+# Download the HDF5 source package
+wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.5/src/hdf5-1.10.5.tar.gz --no-check-certificate
+# Extract the archive
+tar -zxvf hdf5-1.10.5.tar.gz
+# Enter the source directory and build
+cd hdf5-1.10.5/
+./configure --prefix=/usr/include/hdf5
+make install
+
+# Set environment variables
+export CPATH="/usr/include/hdf5/include/:/usr/include/hdf5/lib/"
+
+# Create symlinks for the shared libraries
+ln -s /usr/include/hdf5/lib/libhdf5.so /usr/lib/libhdf5.so
+ln -s /usr/include/hdf5/lib/libhdf5_hl.so /usr/lib/libhdf5_hl.so
+
+# Install the packages h5py depends on
+pip3.7 install pkgconfig Cython
+# Install h5py
+pip3.7 install h5py
+
+```
+
 ## [2.2 FAQ: Common Issues in Distributed NPU Model Runs](#22-NPU模型分布式运行常见问题FAQ)
 
 ### FAQ1. Error during distributed model training: host not found.