From e36af153033464da8d9dfa5bca1d04151b259ed8 Mon Sep 17 00:00:00 2001
From: changzherui
Date: Tue, 4 Aug 2020 00:18:11 +0800
Subject: [PATCH] modify checkpoint note

---
 tutorials/source_en/use/saving_and_loading_model_parameters.md | 2 ++
 .../source_zh_cn/use/saving_and_loading_model_parameters.md    | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/tutorials/source_en/use/saving_and_loading_model_parameters.md b/tutorials/source_en/use/saving_and_loading_model_parameters.md
index 9b2cc2625b..c75b3362ee 100644
--- a/tutorials/source_en/use/saving_and_loading_model_parameters.md
+++ b/tutorials/source_en/use/saving_and_loading_model_parameters.md
@@ -76,6 +76,8 @@ MindSpore adds underscores (_) and digits at the end of the user-defined prefix
 
 For example, `resnet50_3-2_32.ckpt` indicates the CheckPoint file generated during the 32nd step of the second epoch after the script is executed for the third time.
 
+> - When a single saved model parameter is large (more than 64 MB), saving fails because of Protobuf's own limit on data size. The limit can be lifted by setting the environment variable `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`.
+> - When running distributed parallel training tasks, each process needs to set a different `directory` parameter so that its CheckPoint files are saved to a separate directory, preventing read/write conflicts between files.
 
 ### CheckPoint Configuration Policies
 
diff --git a/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md b/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md
index 7ebf822df2..5e1941c899 100644
--- a/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md
+++ b/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md
@@ -76,7 +76,8 @@ MindSpore为方便用户区分每次生成的文件,会在用户定义的前
 
 例:`resnet50_3-2_32.ckpt` 表示运行第3次脚本生成的第2个epoch的第32个step的CheckPoint文件。
 
-> 当保存的单个模型参数较大时(超过64M),会因为Protobuf自身对数据大小的限制,导致保存失败。这时可通过设置环境变量`PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`解除限制。
+> - 当保存的单个模型参数较大时(超过64M),会因为Protobuf自身对数据大小的限制,导致保存失败。这时可通过设置环境变量`PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`解除限制。
+> - 当执行分布式并行训练任务时,每个进程需要设置不同`directory`参数,用以保存CheckPoint文件到不同的目录,以防文件发生读写错乱。
 
 ### CheckPoint配置策略
-- 
Gitee
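The two notes this patch adds can be sketched in plain Python. This is a minimal illustration, not MindSpore code: the helper `rank_checkpoint_dir` and the `rank_{n}` directory layout are assumptions chosen for this example, and the MindSpore callback usage is shown only as a hedged comment.

```python
import os

# Lift Protobuf's limit on serialized data size by forcing its pure-Python
# implementation. This must be set before the CheckPoint file is saved
# (ideally at the very top of the training script).
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"


def rank_checkpoint_dir(base_dir: str, rank_id: int) -> str:
    """Return a per-process directory (e.g. base_dir/rank_0) so that
    distributed training processes never write CheckPoint files into
    the same location. The 'rank_{n}' naming is illustrative."""
    return os.path.join(base_dir, "rank_{}".format(rank_id))


# Hypothetical use with MindSpore's checkpoint callback (not executed here):
#   from mindspore.train.callback import ModelCheckpoint, CheckpointConfig
#   from mindspore.communication.management import get_rank
#   ckpt_cb = ModelCheckpoint(prefix="resnet50",
#                             directory=rank_checkpoint_dir("./ckpt", get_rank()),
#                             config=CheckpointConfig())
```

With this layout, process 0 writes to `./ckpt/rank_0`, process 1 to `./ckpt/rank_1`, and so on, so no two processes read or write the same CheckPoint files.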