diff --git "a/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/\344\270\223\351\242\230\346\241\210\344\276\213/\347\233\270\345\205\263\345\267\245\345\205\267/AMCT/README.md" "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/\344\270\223\351\242\230\346\241\210\344\276\213/\347\233\270\345\205\263\345\267\245\345\205\267/AMCT/README.md"
new file mode 100644
index 0000000000000000000000000000000000000000..22246408995cea1589de5a7fdffb57333ccf5a5e
--- /dev/null
+++ "b/Ascend-PyTorch\347\246\273\347\272\277\346\216\250\347\220\206\346\214\207\345\257\274/\344\270\223\351\242\230\346\241\210\344\276\213/\347\233\270\345\205\263\345\267\245\345\205\267/AMCT/README.md"
@@ -0,0 +1,147 @@
+# Case study: using the amct tool to resolve an out-of-memory error in the HRNet-OCR model
+
+[TOC]
+
+## Model background
+
+HRNet is a semantic-segmentation network. Its input shape and weights are currently so large that offline inference with the om model fails with an out-of-memory error.
+
+## Symptom
+The HRNet-OCR model has already been converted to om, but om inference with the benchmark or msame tool fails. Running it directly reports the following errors
+```shell
+[INFO][VisionPreProcess] Create stream SUCCESS
+[INFO][VisionPreProcess] Init SUCCESS
+[INFO][Preprocess] Init SUCCESS
+[INFO][DataManager] Init SUCCESS
+[ERROR][Inference] aclmdlLoadFromFile failed!
+[ERROR][Inference] Load model failed!
+[ERROR] moduleInstance.inferenceInstance init failed.
+[INFO][VisionPreProcess] DeInit SUCCESS
+[INFO][Preprocess] DeInit SUCCESS
+[INFO][DataManager] DeInit SUCCESS
+[INFO][Inference] DeInit SUCCESS
+```
+With the debug log level enabled and the output redirected to a file, the debug log file shows the following errors
+```shell
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.201.588 [ascend][curpid: 27294, 27294][drv][devmm][devmm_ioctl_advise 166] Ioctl device error! ptr=0x108200000000, count=5799738880, advise=0x8c, device=1.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.202.544 [ascend][curpid: 27294, 27294][drv][devmm][devmm_ioctl_alloc_and_advise 205] advise mem error! ret=6
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.202.555 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_heap_alloc_device 421] devmm_ioctl_alloc error. ptr=0x108200000000.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.202.563 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_set_alloced_mem_struct 101] alloc ptr err, ptr=0x1.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.203.008 [ascend][curpid: 27294, 27294][drv][devmm][devmm_alloc_from_base_heap 131] alloc phy mem from base heap err=0x1, va:0x108200000000, size:5799738880,5799738880.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.203.019 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_free_check_and_get_pg 329] va(0x108200000000) is not alloced, pg is already in buddy,pfn(520),order(3),flags(1)
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.203.027 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_heap_free 521] addr not alloced, addr=0x108200000000,start=0x100000000000,end=0x17ffffffffff
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.203.035 [ascend][curpid: 27294, 27294][drv][devmm][devmm_alloc_managed 134] heap_alloc_managed out of memory, pp=0x1, bytesize=5799738880.
+[DEBUG] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.203.043 [ascend][curpid: 27294, 27294][drv][devmm][halMemAlloc 1347]alloc ptr=0, size=5799738880, virt_type=1, devid=1, phy_page=1, flag=0x1424401.
+[ERROR] RUNTIME(27294,benchmark.x86_64):2021-11-25-03:24:56.203.065 [npu_driver.cc:691]27294 DevMemAllocHugePageManaged:[LOAD][DEFAULT][driver interface] halMemAlloc failed: device_id=1, size=5799738880, type=2, env_type=3, drvRetCode=6!
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.643.031 [ascend][curpid: 27294, 27294][drv][devmm][devmm_ioctl_advise 166] Ioctl device error! ptr=0x108200000000, count=5799738880, advise=0x88, device=1.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:57.836.213 [ascend][curpid: 27294, 27294][drv][devmm][devmm_ioctl_alloc_and_advise 205] advise mem error! ret=6
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:57.836.268 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_heap_alloc_device 421] devmm_ioctl_alloc error. ptr=0x108200000000.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:57.836.283 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_set_alloced_mem_struct 101] alloc ptr err, ptr=0x1.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:58.051.730 [ascend][curpid: 27294, 27294][drv][devmm][devmm_alloc_from_base_heap 131] alloc phy mem from base heap err=0x1, va:0x108200000000, size:5799738880,5799738880.
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:58.051.792 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_free_check_and_get_pg 329] va(0x108200000000) is not alloced, pg is already in buddy,pfn(520),order(3),flags(1)
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:58.051.811 [ascend][curpid: 27294, 27294][drv][devmm][devmm_virt_heap_free 521] addr not alloced, addr=0x108200000000,start=0x100000000000,end=0x17ffffffffff
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:58.051.827 [ascend][curpid: 27294, 27294][drv][devmm][devmm_alloc_managed 134] heap_alloc_managed out of memory, pp=0x1, bytesize=5799738880.
+[DEBUG] DRV(27294,benchmark.x86_64):2021-11-25-03:24:58.051.844 [ascend][curpid: 27294, 27294][drv][devmm][halMemAlloc 1347]alloc ptr=0, size=5799738880, virt_type=1, devid=1, phy_page=0, flag=0x1404401.
+[ERROR] RUNTIME(27294,benchmark.x86_64):2021-11-25-03:24:58.051.920 [npu_driver.cc:726]27294 DevMemAllocManaged:[LOAD][DEFAULT][driver interface] halMemAlloc failed: size=5799738880, deviceId=1, type=2, env_type=3, drvRetCode=6!
+[ERROR] RUNTIME(27294,benchmark.x86_64):2021-11-25-03:24:58.052.011 [npu_driver.cc:802]27294 DevMemAllocOnline:[LOAD][DEFAULT]DevMemAlloc huge page failed: deviceId=1, type=2, size=5799738880, retCode=117571606!
+[ERROR] RUNTIME(27294,benchmark.x86_64):2021-11-25-03:24:58.052.046 [logger.cc:349]27294 DevMalloc:[LOAD][DEFAULT]Device malloc failed, size=5799738880, type=2.
+[ERROR] RUNTIME(27294,benchmark.x86_64):2021-11-25-03:24:58.052.122 [api_c.cc:801]27294 rtMalloc:[LOAD][DEFAULT]ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
+[ERROR] RUNTIME(27294,benchmark.x86_64):2021-11-25-03:24:58.052.158 [error_message_manage.cc:41]27294 ReportFuncErrorReason:[LOAD][DEFAULT]rtMalloc execute failed, reason=[driver error:out of memory]
+[ERROR] GE(27294,benchmark.x86_64):2021-11-25-03:24:58.052.242 [graph_mem_allocator.cc:52]27294 MallocMemory: ErrorNo: 1343225860(Internal errors) [LOAD][DEFAULT][Malloc][Memory] failed, device_id = 1, size= 5799738880
+[ERROR] GE(27294,benchmark.x86_64):2021-11-25-03:24:58.052.279 [davinci_model.cc:358]27294 InitFeatureMapAndP2PMem: ErrorNo: 245000(Memory allocation error.) [LOAD][DEFAULT][Alloc][Memory] for feature map failed. size:5799738880, model_id:1
+[ERROR] GE(27294,benchmark.x86_64):2021-11-25-03:24:58.052.325 [model_manager.cc:1249]27294 LoadModelOffline: ErrorNo: 4294967295(failed) [LOAD][DEFAULT][Init][DavinciModel] failed, ret:245000.
+```
+
+## Debugging approach
+
+### Locating the problem
+
+The log shows that a feature map allocation of more than 5G (5799738880 bytes) exhausted the device memory, which caused the error
+```shell
+[ERROR] DRV(27294,benchmark.x86_64):2021-11-25-03:24:56.203.035 [ascend][curpid: 27294, 27294][drv][devmm][devmm_alloc_managed 134] heap_alloc_managed out of memory, pp=0x1, bytesize=5799738880.
+```
+
+Use the npu-smi tool to check the total memory on the device side and see whether the request exceeds it. For details of the query command, see [Querying memory information of all chips](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/2c6df3af#ZH-CN_TOPIC_0000001117437354)
+```shell
+root@8c11674f3fef:# npu-smi info -t memory -i 2848
+        NPU ID                         : 2848
+        Chip Count                     : 1
+
+        Chip ID                        : 0
+        Capacity(MB)                   : 8192
+        Clock Speed(MHz)               : 1600
+```
+
+In short, the device has 8G of memory in total. After subtracting what the device OS itself occupies, the om has less than 8G available.
+
+
+Next, confirm how much memory this om actually uses
+```shell
+# Generate logs with the following environment variables, then search them to verify
+export ASCEND_GLOBAL_LOG_LEVEL=0 # set the debug log level
+export ASCEND_SLOG_PRINT_TO_STDOUT=1 # print logs to the screen
+/usr/local/Ascend/driver/tools/msnpureport -e enable # enable event logs
+benchmark -model_type=vision .... > zl_debug.log
+```
+```shell
+# Query how much memory the benchmark/msame tool requested for inputs and outputs
+root@8c11674f3fef:# grep "start to execute aclrtMalloc, size =" zl_debug.log -n
+147501:[INFO] ASCENDCL(57775,benchmark):2021-12-01-02:07:53.900.932 [memory.cpp:32]57796 aclrtMalloc: start to execute aclrtMalloc, size = 25165824
+147511:[INFO] ASCENDCL(57775,benchmark):2021-12-01-02:07:53.920.797 [memory.cpp:32]57799 aclrtMalloc: start to execute aclrtMalloc, size = 25165824
+147521:[INFO] ASCENDCL(57775,benchmark):2021-12-01-02:07:53.951.382 [memory.cpp:32]57799 aclrtMalloc: start to execute aclrtMalloc, size = 637534208
+147528:[INFO] ASCENDCL(57775,benchmark):2021-12-01-02:07:53.962.108 [memory.cpp:32]57799 aclrtMalloc: start to execute aclrtMalloc, size = 159383552
+147535:[INFO] ASCENDCL(57775,benchmark):2021-12-01-02:07:53.964.940 [memory.cpp:32]57799 aclrtMalloc: start to execute aclrtMalloc, size = 8388608
+147542:[INFO] ASCENDCL(57775,benchmark):2021-12-01-02:07:53.965.327 [memory.cpp:32]57799 aclrtMalloc: start to execute aclrtMalloc, size = 39845888
+147549:[INFO] ASCENDCL(57775,benchmark):2021-12-01-02:07:53.966.245 [memory.cpp:32]57799 aclrtMalloc: start to execute aclrtMalloc, size = 2097152
+# Query how much memory loading the om takes
+root@8c11674f3fef:# grep "\[IMAS\]InitFeatureMapAndP2PMem graph_" zl_debug.log -n
+3239:[EVENT] GE(57775,benchmark):2021-12-01-02:07:51.772.113 [davinci_model.cc:344]57775 InitFeatureMapAndP2PMem:[IMAS]InitFeatureMapAndP2PMem graph_0 MallocMemory type[F] memaddr[108200000000] mem_size[5799738880]
+```
+A rough estimate shows that the acl tool had already successfully allocated roughly 1G and the om takes 5.8G, about 6.8G in total. Adding the memory used by the device OS and by other parts of the CANN layer, the next allocation request from acl no longer fits and fails.
+```shell
+# acl fails when calling aclrtMalloc to request 2097184 bytes
+[ERROR] RUNTIME(57775,benchmark):2021-12-01-02:07:54.023.862 [logger.cc:349]57799 DevMalloc:Device malloc failed, size=2097184, type=1024.
+[ERROR] RUNTIME(57775,benchmark):2021-12-01-02:07:54.023.914 [api_c.cc:799]57799 rtMalloc:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
+[ERROR] RUNTIME(57775,benchmark):2021-12-01-02:07:54.023.924 [error_message_manage.cc:41]57799 ReportFuncErrorReason:rtMalloc execute failed, reason=[driver error:out of memory]
+[ERROR] ASCENDCL(57775,benchmark):2021-12-01-02:07:54.023.946 [memory.cpp:62]57799 aclrtMalloc: alloc device memory failed, runtime result = 207001
+```
+
+### Approach to a fix
+The hardware has only this much memory, and the model's input shape is fairly large (`1*3*1024*2048`), so the weights are large as well. The option available for now is to quantize the model with the amct tool.
+
+amct quantizes the original onnx model, converting the fp32 weights to int8 weights. This saves space and keeps both the size and the memory footprint of the om produced by atc small. In terms of the analysis above, it addresses the **om takes 5.8G** problem.
+
+
+## Solution
+
+- Downloading and installing the AMCT tool
+  > Make sure the onnxruntime version is 1.5.2 or 1.6.0; otherwise the installation below fails
+  Download the AMCT package from [CANN 商用版](https://www.hiascend.com/software/cann/commercial) (the CANN commercial edition)
+  ```shell
+  tar -xzf Ascend-cann-amct_{version}_linux-{arch}.tar.gz
+  cd amct_onnx
+  pip install amct_onnx-0.2.2-py3-none-linux_x86_64.whl
+  tar -xzf amct_onnx_op.tar.gz
+  cd amct_onnx_op
+  python setup.py build
+  ```
+  `python setup.py build` downloads some header files from a github repository. If an unstable network makes this step fail, download them manually and move them into the **inc** directory. The files are
+  ```shell
+  onnxruntime_cxx_api.h
+  onnxruntime_cxx_inline.h
+  onnxruntime_c_api.h
+  onnxruntime_session_options_config_keys.h
+  ```
+- Quick usage of the AMCT tool
+  For detailed usage of the AMCT tool, see [昇腾模型压缩工具使用指南(ONNX)](https://support.huawei.com/enterprise/zh/doc/EDOC1100219269/8dd67cf6) (the Ascend model compression tool user guide for ONNX). Only a quick usage example is given here.
+  ```shell
+  amct_onnx calibration --model ./HRnet_static.onnx --save_path ./HRnet --input_shape "image:1,3,1024,2048" --data_types "float32" --data_dir "bin_dir"
+  ```
+
+- Continuing with the usual follow-up steps
+  Convert the quantized onnx produced by amct_onnx to om, then carry out accuracy and performance tuning as usual.
+
+
+## Summary of the out-of-memory problem
+
+Out-of-memory failures are rare, so when one does occur, generate detailed debug logs and debugging files and analyze them carefully. Once the hardware limit is confirmed, choose a solution that fits the specific business scenario. Quantization is a common approach.