From 185a96135d4960d275f31add83e119ddb2518561 Mon Sep 17 00:00:00 2001
From: niujunhao
Date: Thu, 3 Jul 2025 14:23:12 +0800
Subject: [PATCH] fix Megatron dataset docs.

---
 .../docs/source_en/feature/dataset.md    | 23 ++++++++++++-----------
 .../docs/source_zh_cn/feature/dataset.md |  1 +
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/docs/mindformers/docs/source_en/feature/dataset.md b/docs/mindformers/docs/source_en/feature/dataset.md
index 86a2c3ca07..d3a6d18ed9 100644
--- a/docs/mindformers/docs/source_en/feature/dataset.md
+++ b/docs/mindformers/docs/source_en/feature/dataset.md
@@ -174,17 +174,18 @@ The following explains how to configure and use Megatron datasets in the configu
 
 Below are the descriptions for each configuration option of the `GPTDataset` in the dataset:
 
- | Parameter Name             | Description |
- |----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- | seed                       | Random seed for dataset sampling. Megatron datasets use this value to randomly sample and concatenate samples. Default: `1234` |
- | seq_length                 | Sequence length of data returned by the dataset. Should be consistent with the sequence length of the training model. |
- | eod_mask_loss              | Whether to compute loss at the end-of-document (EOD) token. Default: `False` |
- | create_attention_mask      | Whether to return an attention_mask. Default: `True` |
- | reset_attention_mask       | Whether to reset the attention_mask at EOD tokens, returning a staircase-shaped attention_mask. Effective only if `create_attention_mask=True`. Default: `False` |
- | create_compressed_eod_mask | Whether to return a compressed attention_mask. Has higher priority than `create_attention_mask`. Default: `False` |
- | eod_pad_length             | Length of the compressed attention_mask. Effective only if `create_compressed_eod_mask=True`. Default: `128` |
- | eod                        | Token ID of the EOD token in the dataset |
- | pad                        | Token ID of the pad token in the dataset |
+ | Parameter Name             | Description |
+ |----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+ | seed                       | Random seed for dataset sampling. Megatron datasets use this value to randomly sample and concatenate samples. Default: `1234` |
+ | seq_length                 | Sequence length of data returned by the dataset. Should be consistent with the sequence length of the training model. |
+ | eod_mask_loss              | Whether to compute loss at the end-of-document (EOD) token. Default: `False` |
+ | create_attention_mask      | Whether to return an attention_mask. Default: `True` |
+ | reset_attention_mask       | Whether to reset the attention_mask at EOD tokens, returning a staircase-shaped attention_mask. Effective only if `create_attention_mask=True`. Default: `False` |
+ | create_compressed_eod_mask | Whether to return a compressed attention_mask. Has higher priority than `create_attention_mask`. Default: `False` |
+ | eod_pad_length             | Length of the compressed attention_mask. Effective only if `create_compressed_eod_mask=True`. Default: `128` |
+ | eod                        | Token ID of the EOD token in the dataset |
+ | pad                        | Token ID of the pad token in the dataset |
+ | data_path                  | Numbers in `data_path` are sampling weights, and strings are dataset paths; each path must omit the `.bin` file suffix. |
 
 In addition, the Megatron dataset also depends on configurations such as `input_columns`, `construct_args_key`, and `full_batch`. For more details, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html).
diff --git a/docs/mindformers/docs/source_zh_cn/feature/dataset.md b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
index bbce4057cc..b891ae0a58 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/dataset.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/dataset.md
@@ -178,6 +178,7 @@ MindSpore Transformers推荐用户使用Megatron数据集进行模型预训练
 | eod_pad_length | 设置压缩后attention_mask的长度,仅在`create_compressed_eod_mask=True`时生效,默认值为`128` |
 | eod | 数据集中eod的token id |
 | pad | 数据集中pad的token id |
+| data_path | `data_path`中的数字表示采样比例,字符串表示数据路径,应去掉bin文件路径的.bin后缀。 |
 
 此外,Megatron数据集还依赖`input_columns`、`construct_args_key`、`full_batch`等配置,具体可参考[配置文件说明](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/configuration.html),这里仅说明在不同场景如何配置:
--
Gitee
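The `data_path` convention documented in both hunks (a list alternating sampling weights and `.bin`-suffix-free paths) can be sketched as a `GPTDataset` configuration fragment. The parameter names inside `config` come from the tables above; the surrounding key nesting and the dataset paths are illustrative assumptions, not taken from the patch:

```yaml
# Hypothetical GPTDataset excerpt; only the keys under `config` are
# documented in the tables above, the rest of the layout is a sketch.
train_dataset:
  data_loader:
    config:
      seed: 1234                 # random seed for sampling/concatenation
      seq_length: 4096           # must match the model's sequence length
      eod_mask_loss: False
      create_attention_mask: True
      eod: 0                     # EOD token id (tokenizer-dependent)
      pad: 1                     # pad token id (tokenizer-dependent)
      data_path:                 # alternating weight/path pairs
        - 0.3                                # sampling weight for corpus A
        - "/data/corpus_a_text_document"     # path without the .bin suffix
        - 0.7                                # sampling weight for corpus B
        - "/data/corpus_b_text_document"
```

Here `0.3` and `0.7` control how often samples are drawn from each corpus, and each string points at a Megatron `.bin`/`.idx` pair with the `.bin` suffix stripped.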