diff --git a/tutorials/source_en/use/saving_and_loading_model_parameters.md b/tutorials/source_en/use/saving_and_loading_model_parameters.md
index 9b2cc2625b9de393f2d27cf3c84b5f59397936b8..c75b3362ee4f50658be069d4d2a9f883615c2012 100644
--- a/tutorials/source_en/use/saving_and_loading_model_parameters.md
+++ b/tutorials/source_en/use/saving_and_loading_model_parameters.md
@@ -76,6 +76,8 @@ MindSpore adds underscores (_) and digits at the end of the user-defined prefix
 
 For example, `resnet50_3-2_32.ckpt` indicates the CheckPoint file generated during the 32th step of the second epoch after the script is executed for the third time.
 
+> - When a single model parameter to be saved is large (more than 64M), saving fails because of Protobuf's own limit on data size. In this case, the limit can be lifted by setting the environment variable `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`.
+> - When running distributed parallel training tasks, each process needs to set a different `directory` parameter so that the CheckPoint files are saved to different directories, preventing read/write conflicts between files.
 
 ### CheckPoint Configuration Policies
diff --git a/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md b/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md
index 7ebf822df2e4333984497e0a1e3dba9f32fea5e7..5e1941c899ad8c5e6d765a8a7ac2d477de409904 100644
--- a/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md
+++ b/tutorials/source_zh_cn/use/saving_and_loading_model_parameters.md
@@ -76,7 +76,8 @@ MindSpore为方便用户区分每次生成的文件，会在用户定义的前
 
 例：`resnet50_3-2_32.ckpt` 表示运行第3次脚本生成的第2个epoch的第32个step的CheckPoint文件。
 
-> 当保存的单个模型参数较大时（超过64M），会因为Protobuf自身对数据大小的限制，导致保存失败。这时可通过设置环境变量`PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`解除限制。
+> - 当保存的单个模型参数较大时（超过64M），会因为Protobuf自身对数据大小的限制，导致保存失败。这时可通过设置环境变量`PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python`解除限制。
+> - 当执行分布式并行训练任务时，每个进程需要设置不同`directory`参数，用以保存CheckPoint文件到不同的目录，以防文件发生读写错乱。
 
 ### CheckPoint配置策略
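The distributed-training note added above leaves the per-process `directory` setup to the reader. Below is a minimal, hypothetical sketch of how it might look with MindSpore's `CheckpointConfig`/`ModelCheckpoint` callbacks and `get_rank()`; the prefix, step counts, and directory layout are illustrative choices, not values prescribed by the tutorial.

```python
import os

from mindspore.communication.management import init, get_rank
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig

# For very large single parameters (> 64M), the Protobuf limit is usually lifted
# in the launching shell, before Python starts:
#   export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

init()                # initialize the distributed communication backend
rank_id = get_rank()  # unique id of this process within the training job

# Give every process its own directory so CheckPoint files never collide.
ckpt_dir = os.path.join('./checkpoints', 'rank_{}'.format(rank_id))

config_ck = CheckpointConfig(save_checkpoint_steps=32, keep_checkpoint_max=10)
ckpt_cb = ModelCheckpoint(prefix='resnet50', directory=ckpt_dir, config=config_ck)

# Later, pass the callback to training, for example:
#   model.train(epoch_size, dataset, callbacks=[ckpt_cb])
```

With such a layout, each rank writes files like `./checkpoints/rank_0/resnet50_3-2_32.ckpt`, following the naming scheme described earlier in the section.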