diff --git a/docs/mindformers/docs/source_en/feature/dataset.md b/docs/mindformers/docs/source_en/feature/dataset.md
index 4b252da969796d8c7998eb37dd7c9e0e17e9e79f..55b0997ad8ebfad0c88ba9f08e0179b050f9c524 100644
--- a/docs/mindformers/docs/source_en/feature/dataset.md
+++ b/docs/mindformers/docs/source_en/feature/dataset.md
@@ -99,7 +99,7 @@ The following example demonstrates how to convert the `wikitext-103` dataset int
 Take [LlamaTokenizerFast](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer_config.json) and the [vocab file](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer.json) in the [DeepSeek-V3 repository](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base) as an example. If the corresponding repository is not available locally, the configuration file (tokenizer_config.json) and vocab file (tokenizer.json) need to be downloaded to a local path, for example /path/to/huggingface/tokenizer. Execute the following command to preprocess the dataset:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type HuggingFaceTokenizer \
@@ -109,7 +109,7 @@ The following example demonstrates how to convert the `wikitext-103` dataset int
 Take the external tokenizer class [Llama3Tokenizer](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/llama3_1_tokenizer.py) as an example: make sure the **local** MindSpore Transformers repository contains 'research/llama3_1/llama3_1_tokenizer.py', then execute the following command to preprocess the dataset:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type AutoRegister \
diff --git a/docs/mindformers/docs/source_zh_cn/feature/dataset.md b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
index 47dbe009d24f2836b6c7032ae756a2ca16545acb..2b3133127e5c7f3cae4d7c7ceeada075288c084b 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/dataset.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
@@ -96,7 +96,7 @@ MindSpore Transformers提供了数据预处理脚本[preprocess_indexed_dataset.
 以[Deepseek-V3仓库](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base)中的[LlamaTokenizerFast](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer_config.json)和[词表](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/tokenizer.json)为例。如果本地不存在对应仓库,需要将配置文件(tokenizer_config.json)和词表文件(tokenizer.json)手动下载到本地目录,假设为/path/to/huggingface/tokenizer。执行如下命令处理数据集:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type HuggingFaceTokenizer \
@@ -106,7 +106,7 @@ MindSpore Transformers提供了数据预处理脚本[preprocess_indexed_dataset.
 以外部tokenizer类[Llama3Tokenizer](https://gitee.com/mindspore/mindformers/blob/master/research/llama3_1/llama3_1_tokenizer.py)为例,确保**本地**mindformers仓库下存在'research/llama3_1/llama3_1_tokenizer.py',执行如下命令处理数据集:
 
 ```shell
-python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
+python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /path/data.json \
  --output-prefix /path/megatron_data \
  --tokenizer-type AutoRegister \
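
As a quick check of the path rename above, the relocated script can be exercised from the root of a local MindSpore Transformers checkout. The sketch below is not part of the patch: it assumes the checkout root as the working directory and shows only the flags visible in the hunks; the trailing tokenizer-specific arguments are elided by the diff context and should be copied from the full dataset.md.

```shell
# Minimal sketch, assuming the current directory is the root of a local
# MindSpore Transformers checkout. First confirm the script exists at the
# relocated path introduced by this patch:
ls toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py

# Then invoke it with the flags visible in the hunks. The command is
# incomplete as written: the remaining tokenizer arguments are elided by
# the diff context and must be taken from dataset.md.
python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
 --input /path/data.json \
 --output-prefix /path/megatron_data \
 --tokenizer-type HuggingFaceTokenizer
```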
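
Since the same command appears in both the English and Chinese pages, a reasonable follow-up check (hypothetical, not part of the patch) is to grep the docs tree for any remaining references to the old script location:

```shell
# Hypothetical follow-up check: list any docs that still point at the
# pre-rename script path after this patch is applied.
grep -rn "mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py" docs/
```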