From 571de4bbca8f9f694ca1434c830b647fe61b89d6 Mon Sep 17 00:00:00 2001
From: simson
Date: Fri, 11 Dec 2020 17:40:30 +0800
Subject: [PATCH] add docs of operation cache

---
 docs/programming_guide/source_en/run.md    | 3 +++
 docs/programming_guide/source_zh_cn/run.md | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/docs/programming_guide/source_en/run.md b/docs/programming_guide/source_en/run.md
index 85749389cc..2c868a5e0c 100644
--- a/docs/programming_guide/source_en/run.md
+++ b/docs/programming_guide/source_en/run.md
@@ -103,6 +103,9 @@ The [Model API](https://www.mindspore.cn/doc/api_python/en/master/mindspore/mind
 
 You can transfer the initialized Model APIs such as the network, loss function, and optimizer as required. You can also configure amp_level to implement mixed precision and configure metrics to implement model evaluation.
 
+> Executing a network model produces a `kernel_meta` directory in the current directory, which stores the cache files of all operators compiled during execution. If you execute the same model again, or a model with only minor differences, MindSpore automatically reuses the compatible cached operators in `kernel_meta` to reduce the compilation time of the whole model, which significantly improves performance.
+Please note that when running a model on multiple devices, deleting the operator cache on only some of the devices may cause the inter-device wait time to be exceeded, because only some of the devices need to recompile their operators. To avoid this, you can set the environment variable `HCCL_CONNECT_TIMEOUT` to a reasonable wait time. However, this approach takes as long as deleting all the caches and recompiling.
+
 ### Executing a Training Model
 
 Call the train API of Model to implement training.
diff --git a/docs/programming_guide/source_zh_cn/run.md b/docs/programming_guide/source_zh_cn/run.md
index 776b9e19c7..77e0bd01e6 100644
--- a/docs/programming_guide/source_zh_cn/run.md
+++ b/docs/programming_guide/source_zh_cn/run.md
@@ -109,6 +109,9 @@ MindSpore's [Model API](https://www.mindspore.cn/doc/api_python/zh-CN/master/m
 
 You can pass in the network, loss function, optimizer, and so on to initialize the Model API as needed. You can also configure amp_level to implement mixed precision and configure metrics to implement model evaluation.
 
+> Executing a network model generates a `kernel_meta` directory under the execution directory, which stores the operator cache files (including `.o` and `.json` files) produced when the network is compiled. If you execute the same network model again, or one with only partial changes, MindSpore automatically reuses the cached operator files in `kernel_meta`, significantly reducing network compilation time and improving execution performance.
+Please note that in a multi-device run, if you delete the operator cache files under `kernel_meta` on only some of the devices and then execute the same network model again, the devices that do not need to recompile operators may time out while waiting, causing the execution to fail. In this case, you can set the environment variable HCCL_CONNECT_TIMEOUT, i.e. the inter-device wait time, to avoid the failure; however, this approach takes as long as deleting all the caches and recompiling.
+
 ### Executing a Training Model
 
 Call the train API of Model to implement training.
-- 
Gitee
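The two remedies the note describes can be sketched from the shell. This is a minimal example, assuming the model is launched from the current directory; the timeout value `600` (seconds) is an illustrative choice, not a documented default:

```shell
# Option 1: clear the operator cache on ALL devices, not just some, so that
# every device recompiles and none times out waiting for the others.
rm -rf ./kernel_meta

# Option 2: keep the partial cache but raise the inter-device wait time so
# the devices that skip recompilation wait long enough for the rest.
export HCCL_CONNECT_TIMEOUT=600
```

As the note points out, Option 2 still pays the full recompilation time on the devices whose cache was deleted, so clearing the cache everywhere is usually the simpler choice.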