diff --git a/docs/mindformers/docs/source_en/feature/dataset.md b/docs/mindformers/docs/source_en/feature/dataset.md
index 4b252da969796d8c7998eb37dd7c9e0e17e9e79f..55b0997ad8ebfad0c88ba9f08e0179b050f9c524 100644
--- a/docs/mindformers/docs/source_en/feature/dataset.md
+++ b/docs/mindformers/docs/source_en/feature/dataset.md
@@ -99,7 +99,7 @@ The following example demonstrates how to convert the `wikitext-103` dataset int
 Take [LlamaTokenizerFast](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer_config.json) and the [vocab file](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer.json) in the [DeepSeek-V3 repository](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base) as an example. If the corresponding repository is not available locally, the configuration file (tokenizer_config.json) and vocab file (tokenizer.json) need to be downloaded to a local path, for example /path/to/huggingface/tokenizer. Execute the following command to preprocess the dataset:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type HuggingFaceTokenizer \
@@ -109,7 +109,7 @@ The following example demonstrates how to convert the `wikitext-103` dataset int
 Take the external tokenizer class [Llama3Tokenizer](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/llama3_1_tokenizer.py) as an example: make sure the **local** MindSpore Transformers repository contains 'research/llama3_1/llama3_1_tokenizer.py', then execute the following command to preprocess the dataset:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type AutoRegister \
diff --git a/docs/mindformers/docs/source_zh_cn/feature/dataset.md b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
index 47dbe009d24f2836b6c7032ae756a2ca16545acb..2b3133127e5c7f3cae4d7c7ceeada075288c084b 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/dataset.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
@@ -96,7 +96,7 @@ MindSpore Transformers提供了数据预处理脚本[preprocess_indexed_dataset.
 以[Deepseek-V3仓库](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base)中的[LlamaTokenizerFast](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer_config.json)和[词表](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer.json)为例。如果本地不存在对应仓库,需要将配置文件(tokenizer_config.json)和词表文件(tokenizer.json)手动下载到本地目录,假设为/path/to/huggingface/tokenizer。执行如下命令处理数据集:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type HuggingFaceTokenizer \
@@ -106,7 +106,7 @@ MindSpore Transformers提供了数据预处理脚本[preprocess_indexed_dataset.
 以外部tokenizer类[Llama3Tokenizer](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/llama3_1_tokenizer.py)为例,确保**本地**mindformers仓库下存在'research/llama3_1/llama3_1_tokenizer.py',执行如下命令处理数据集:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type AutoRegister \
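
As a quick check of the path rename above, the relocated script can be exercised from the root of a local MindSpore Transformers checkout. The sketch below is not part of the patch: it assumes the checkout root as the working directory and shows only the flags visible in the hunks; the trailing tokenizer-specific arguments are elided by the diff context and should be copied from the full dataset.md.

```shell
# Minimal sketch, assuming the current directory is the root of a local
# MindSpore Transformers checkout. First confirm the script exists at the
# relocated path introduced by this patch:
ls toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py

# Then invoke it with the flags visible in the hunks. The command is
# incomplete as written: the remaining tokenizer arguments are elided by
# the diff context and must be taken from dataset.md.
python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
 --input /path/data.json \
 --output-prefix /path/megatron_data \
 --tokenizer-type HuggingFaceTokenizer
```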
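
Since the same command appears in both the English and Chinese pages, a reasonable follow-up check (hypothetical, not part of the patch) is to grep the docs tree for any remaining references to the old script location:

```shell
# Hypothetical follow-up check: list any docs that still point at the
# pre-rename script path after this patch is applied.
grep -rn "mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py" docs/
```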