From 185a96135d4960d275f31add83e119ddb2518561 Mon Sep 17 00:00:00 2001
From: niujunhao
Date: Thu, 3 Jul 2025 14:23:12 +0800
Subject: [PATCH] fix Megatron dataset docs.

---
 .../docs/source_en/feature/dataset.md    | 23 ++++++++++++-----------
 .../docs/source_zh_cn/feature/dataset.md |  1 +
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/docs/mindformers/docs/source_en/feature/dataset.md b/docs/mindformers/docs/source_en/feature/dataset.md
index 86a2c3ca07..d3a6d18ed9 100644
--- a/docs/mindformers/docs/source_en/feature/dataset.md
+++ b/docs/mindformers/docs/source_en/feature/dataset.md
@@ -174,17 +174,18 @@ The following explains how to configure and use Megatron datasets in the configu
 
 Below are the descriptions for each configuration option of the `GPTDataset` in the dataset:
 
- | Parameter Name             | Description |
- |----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- | seed                       | Random seed for dataset sampling. Megatron datasets use this value to randomly sample and concatenate samples. Default: `1234` |
- | seq_length                 | Sequence length of data returned by the dataset. Should be consistent with the sequence length of the training model. |
- | eod_mask_loss              | Whether to compute loss at the end-of-document (EOD) token. Default: `False` |
- | create_attention_mask      | Whether to return an attention_mask. Default: `True` |
- | reset_attention_mask       | Whether to reset the attention_mask at EOD tokens, returning a staircase-shaped attention_mask. Effective only if `create_attention_mask=True`. Default: `False` |
- | create_compressed_eod_mask | Whether to return a compressed attention_mask. Has higher priority than `create_attention_mask`. Default: `False` |
- | eod_pad_length             | Length of the compressed attention_mask. Effective only if `create_compressed_eod_mask=True`. Default: `128` |
- | eod                        | Token ID of the EOD token in the dataset |
- | pad                        | Token ID of the pad token in the dataset |
+ | Parameter Name             | Description |
+ |----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | seed                       | Random seed for dataset sampling. Megatron datasets use this value to randomly sample and concatenate samples. Default: `1234` |
+ | seq_length                 | Sequence length of data returned by the dataset. Should be consistent with the sequence length of the training model. |
+ | eod_mask_loss              | Whether to compute loss at the end-of-document (EOD) token. Default: `False` |
+ | create_attention_mask      | Whether to return an attention_mask. Default: `True` |
+ | reset_attention_mask       | Whether to reset the attention_mask at EOD tokens, returning a staircase-shaped attention_mask. Effective only if `create_attention_mask=True`. Default: `False` |
+ | create_compressed_eod_mask | Whether to return a compressed attention_mask. Has higher priority than `create_attention_mask`. Default: `False` |
+ | eod_pad_length             | Length of the compressed attention_mask. Effective only if `create_compressed_eod_mask=True`. Default: `128` |
+ | eod                        | Token ID of the EOD token in the dataset |
+ | pad                        | Token ID of the pad token in the dataset |
+ | data_path                  | Numbers in `data_path` are sampling weights, and strings are dataset paths; each path must omit the `.bin` file suffix. |
 
 In addition, the Megatron dataset also depends on configurations such as `input_columns`, `construct_args_key`, and `full_batch`. For more details, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html).
diff --git a/docs/mindformers/docs/source_zh_cn/feature/dataset.md b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
index bbce4057cc..b891ae0a58 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/dataset.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
@@ -178,6 +178,7 @@ MindSpore Transformers推荐用户使用Megatron数据集进行模型预训练
 | eod_pad_length | 设置压缩后attention_mask的长度,仅在`create_compressed_eod_mask=True`时生效,默认值为`128` |
 | eod | 数据集中eod的token id |
 | pad | 数据集中pad的token id |
+| data_path | `data_path`中的数字表示采样比例,字符串表示数据路径,应去掉bin文件路径的.bin后缀。 |
 
 此外,Megatron数据集还依赖`input_columns`、`construct_args_key`、`full_batch`等配置,具体可参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html),这里仅说明在不同场景如何配置:
--
Gitee
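The `data_path` convention documented in both hunks (a list alternating sampling weights and `.bin`-suffix-free paths) can be sketched as a `GPTDataset` configuration fragment. The parameter names inside `config` come from the tables above; the surrounding key nesting and the dataset paths are illustrative assumptions, not taken from the patch:

```yaml
# Hypothetical GPTDataset excerpt; only the keys under `config` are
# documented in the tables above, the rest of the layout is a sketch.
train_dataset:
  data_loader:
    config:
      seed: 1234                 # random seed for sampling/concatenation
      seq_length: 4096           # must match the model's sequence length
      eod_mask_loss: False
      create_attention_mask: True
      eod: 0                     # EOD token id (tokenizer-dependent)
      pad: 1                     # pad token id (tokenizer-dependent)
      data_path:                 # alternating weight/path pairs
        - 0.3                                # sampling weight for corpus A
        - "/data/corpus_a_text_document"     # path without the .bin suffix
        - 0.7                                # sampling weight for corpus B
        - "/data/corpus_b_text_document"
```

Here `0.3` and `0.7` control how often samples are drawn from each corpus, and each string points at a Megatron `.bin`/`.idx` pair with the `.bin` suffix stripped.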