From 2f15daa7a021c2118b2330009b6ede616fec34c1 Mon Sep 17 00:00:00 2001
From: zhangyihuiben
Date: Tue, 8 Jul 2025 15:30:06 +0800
Subject: [PATCH] =?UTF-8?q?=E4=BF=AE=E5=A4=8D=E5=A4=B1=E6=95=88=E9=93=BE?=
 =?UTF-8?q?=E6=8E=A5?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../advanced_development/dev_migration.md     |   2 +-
 .../docs/source_en/feature/logging.md         |   2 +-
 .../source_en/feature/memory_optimization.md  |   2 +-
 .../docs/source_en/feature/safetensors.md     |   2 +-
 .../feature/training_hyperparameters.md       |   4 +-
 .../docs/source_en/guide/pre_training.md      |   2 +-
 .../advanced_development/dev_migration.md     |   2 +-
 .../deepseek3/pretrain_deepseek3_671b.yaml    | 225 ++++++++++++++++++
 .../docs/source_zh_cn/feature/logging.md      |   2 +-
 .../feature/memory_optimization.md            |   2 +-
 .../docs/source_zh_cn/feature/safetensors.md  |   2 +-
 .../feature/training_hyperparameters.md       |   4 +-
 .../docs/source_zh_cn/guide/pre_training.md   |   2 +-
 13 files changed, 239 insertions(+), 14 deletions(-)
 create mode 100644 docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml

diff --git a/docs/mindformers/docs/source_en/advanced_development/dev_migration.md b/docs/mindformers/docs/source_en/advanced_development/dev_migration.md
index 76331cf273..917b45b1da 100644
--- a/docs/mindformers/docs/source_en/advanced_development/dev_migration.md
+++ b/docs/mindformers/docs/source_en/advanced_development/dev_migration.md
@@ -48,7 +48,7 @@ All tokenizer classes must be inherited from the PretrainedTokenizer or Pretrain
 
 If a PyTorch-based model weight already exists, you can convert the weight to that in the MindSpore format by referring to [Weight Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html#weight-format-conversion).
 
-For details about how to prepare a dataset, see [Dataset](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html) or the model document, for example, [Llama2 Description Document > Dataset Preparation](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87).
+For details about how to prepare a dataset, see [Dataset](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html).
 
 ### Preparing a `YAML` Configuration File
 
diff --git a/docs/mindformers/docs/source_en/feature/logging.md b/docs/mindformers/docs/source_en/feature/logging.md
index 63da65ff41..e5b28cd74b 100644
--- a/docs/mindformers/docs/source_en/feature/logging.md
+++ b/docs/mindformers/docs/source_en/feature/logging.md
@@ -46,7 +46,7 @@ By default, MindSpore Transformers specifies the file output path as `./output`
 
 If you need to re-specify the output log folder, you can modify the configuration in yaml.
 
-Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L2) as an example, the following configuration can be made:
+Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, the following configuration can be made:
 ```yaml
 output_dir: './output' # path to save logs/checkpoint/strategy
 ```
diff --git a/docs/mindformers/docs/source_en/feature/memory_optimization.md b/docs/mindformers/docs/source_en/feature/memory_optimization.md
index b38779b9f1..bdf60739a8 100644
--- a/docs/mindformers/docs/source_en/feature/memory_optimization.md
+++ b/docs/mindformers/docs/source_en/feature/memory_optimization.md
@@ -14,7 +14,7 @@ Recomputation can significantly reduce activation memory usage during training b
 
 Users can enable recomputation by adding a `recompute_config` module to the YAML configuration file used for model training.
 
-Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L113) as an example, it could be configured as follows:
+Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, it could be configured as follows:
 
 ```yaml
 # recompute config
diff --git a/docs/mindformers/docs/source_en/feature/safetensors.md b/docs/mindformers/docs/source_en/feature/safetensors.md
index 70b0eea901..94e531b20e 100644
--- a/docs/mindformers/docs/source_en/feature/safetensors.md
+++ b/docs/mindformers/docs/source_en/feature/safetensors.md
@@ -119,7 +119,7 @@ Users can control the weight saving behavior by modifying the configuration file
 
 Users can modify the fields under `CheckpointMonitor` in the `yaml` configuration file to control the weight saving behavior.
 
-Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L206) as an example, the following configuration can be made:
+Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, the following configuration can be made:
 
 ```yaml
 # callbacks
diff --git a/docs/mindformers/docs/source_en/feature/training_hyperparameters.md b/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
index 8740fdc422..ffdcc69773 100644
--- a/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
+++ b/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
@@ -24,7 +24,7 @@ Setting the learning rate too high can prevent the model from converging, while
 
 Users can utilize the learning rate by adding an `lr_schedule` module to the YAML configuration file used for model training.
 
-Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L31) as an example, it could be configured as follows:
+Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, it could be configured as follows:
 
 ```yaml
 # lr schedule
@@ -91,7 +91,7 @@ Currently, MindSpore Transformers only supports the [AdamW optimizer](https://ww
 
 Users can use the optimizer by adding an `optimizer` module to the YAML configuration file for model training.
 
-Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L24) as an example, it could be configured like this:
+Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, it could be configured like this:
 
 ```yaml
 # optimizer
diff --git a/docs/mindformers/docs/source_en/guide/pre_training.md b/docs/mindformers/docs/source_en/guide/pre_training.md
index a901de3834..3f32d73987 100644
--- a/docs/mindformers/docs/source_en/guide/pre_training.md
+++ b/docs/mindformers/docs/source_en/guide/pre_training.md
@@ -78,7 +78,7 @@ For dataset processing, refer to [Megatron Dataset - Data Preprocessing](https:/
 
 ### Single-Node Training
 
-Specify the configuration file [pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml) and start the [run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py) script in msrun mode to perform 8-device distributed training.
+Specify the configuration file [pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) and start the [run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py) script in msrun mode to perform 8-device distributed training.
 
 The default configuration includes large values for parameters such as the number of layers and hidden dimensions, which are intended for large-scale multi-node distributed training. It cannot be directly used for pretraining on a single machine. You will need to modify the configuration as described in [DeepSeek-V3 - Configuration Modification](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/README.md#%E4%BF%AE%E6%94%B9%E9%85%8D%E7%BD%AE).
 
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
index 48442dd5b4..5c5435cdb7 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
@@ -48,7 +48,7 @@ MindSpore Transformers提供了[PretrainedTokenizer](https://www.mindspore.cn/mi
 
 如已有基于PyTorch的模型权重，可以参考[权重转换文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html#%E6%9D%83%E9%87%8D%E6%A0%BC%E5%BC%8F%E8%BD%AC%E6%8D%A2)将权重转换为MindSpore格式的权重。
 
-数据集的准备可以参考[数据集文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html)，或参考模型文档，如[Llama2说明文档——数据集准备](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87)。
+数据集的准备可以参考[数据集文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html)。
 
 ### 准备`YAML`配置文件
 
diff --git a/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml b/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml
new file mode 100644
index 0000000000..7ea7426096
--- /dev/null
+++ b/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml
@@ -0,0 +1,225 @@
+seed: 0
+output_dir: './output' # path to save checkpoint/strategy
+load_checkpoint: ''
+load_ckpt_format: "safetensors"
+src_strategy_path_or_dir: ''
+auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model
+only_save_strategy: False
+resume_training: False
+use_parallel: True
+run_mode: 'train'
+
+# trainer config
+trainer:
+  type: CausalLanguageModelingTrainer
+  model_name: 'deepseekV3'
+
+# runner config
+runner_config:
+  epochs: 2
+  batch_size: 1
+  sink_mode: True
+  sink_size: 1
+
+# optimizer
+optimizer:
+  type: AdamW
+  betas: [0.9, 0.95]
+  eps: 1.e-8
+
+# lr schedule
+lr_schedule:
+  type: ConstantWarmUpLR
+  learning_rate: 2.2e-4
+  warmup_ratio: 0.02
+  total_steps: -1 # -1 means it will load the total steps of the dataset
+
+# dataset
+train_dataset: &train_dataset
+  data_loader:
+    type: BlendedMegatronDatasetDataLoader
+    datasets_type: "GPTDataset"
+    sizes:
+      - 5000 # train dataset size
+      - 0
+      - 0
+    config:
+      random_seed: 1234
+      seq_length: 4096
+      split: "1, 0, 0"
+      reset_position_ids: False
+      reset_attention_mask: False
+      eod_mask_loss: False
+      num_dataset_builder_threads: 1
+      create_attention_mask: False
+      data_path:
+        - '1'
+        - "./dataset"
+    shuffle: False
+  input_columns: ["input_ids", "labels", "loss_mask", "position_ids"]
+  construct_args_key: ["input_ids", "labels"]
+  num_parallel_workers: 8
+  python_multiprocessing: False
+  drop_remainder: True
+  repeat: 1
+  numa_enable: False
+  prefetch_size: 1
+train_dataset_task:
+  type: CausalLanguageModelDataset
+  dataset_config: *train_dataset
+
+# mindspore context init config
+context:
+  mode: 0 #0--Graph Mode; 1--Pynative Mode
+  device_target: "Ascend"
+  max_call_depth: 10000
+  max_device_memory: "55GB"
+  save_graphs: False
+  save_graphs_path: "./graph"
+  jit_config:
+    jit_level: "O1"
+  ascend_config:
+    parallel_speed_up_json_path: "./parallel_speed_up.json"
+
+# parallel config for device num = 1024
+parallel_config:
+  data_parallel: &dp 16
+  model_parallel: 4
+  pipeline_stage: 16
+  expert_parallel: 8
+  micro_batch_num: &micro_batch_num 32
+  vocab_emb_dp: True
+  use_seq_parallel: True
+  gradient_aggregation_group: 4
+# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
+micro_batch_interleave_num: 1
+
+# parallel context config
+parallel:
+  parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
+  gradients_mean: False
+  enable_alltoall: True
+  full_batch: False
+  dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1]]
+  search_mode: "sharding_propagation"
+  enable_parallel_optimizer: True
+  strategy_ckpt_config:
+    save_file: "./ckpt_strategy.ckpt"
+    only_trainable_params: False
+  parallel_optimizer_config:
+    gradient_accumulation_shard: False
+    parallel_optimizer_threshold: 64
+
+# recompute config
+recompute_config:
+  recompute: [3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 2, 0]
+  select_recompute: False
+  parallel_optimizer_comm_recompute: True
+  mp_comm_recompute: True
+  recompute_slice_activation: True
+
+# model config
+model:
+  model_config:
+    type: DeepseekV3Config
+    auto_register: deepseek3_config.DeepseekV3Config
+    batch_size: 1 # add for increase predict
+    seq_length: 4096
+    hidden_size: 7168
+    num_layers: &num_layers 61
+    num_heads: 128
+    max_position_embeddings: 4096
+    intermediate_size: 18432
+    kv_lora_rank: 512
+    n_kv_heads: 128
+    q_lora_rank: 1536
+    qk_rope_head_dim: 64
+    v_head_dim: 128
+    qk_nope_head_dim: 128
+    vocab_size: 129280
+    multiple_of: 256
+    rms_norm_eps: 1.0e-6
+    bos_token_id: 100000
+    eos_token_id: 100001
+    pad_token_id: 100001
+    ignore_token_id: -100
+    compute_dtype: "bfloat16"
+    layernorm_compute_type: "float32"
+    softmax_compute_type: "float32"
+    rotary_dtype: "float32"
+    router_dense_type: "float32"
+    param_init_type: "float32"
+    use_past: False
+    extend_method: "None"
+    use_flash_attention: True
+    use_fused_swiglu: True
+    use_fused_rope: True
+    input_sliced_sig: True
+    offset: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1]
+    checkpoint_name_or_path: ""
+    theta: 10000.0
+    return_extra_loss: True
+    mtp_depth: &mtp_depth 1
+    mtp_loss_factor: 0.3
+  arch:
+    type: DeepseekV3ForCausalLM
+    auto_register: deepseek3.DeepseekV3ForCausalLM
+
+#moe
+moe_config:
+  expert_num: &expert_num 256
+  expert_group_size: 8
+  capacity_factor: 1.5
+  aux_loss_factor: 0.05
+  num_experts_chosen: 8
+  routing_policy: "TopkRouterV2"
+  balance_via_topk_bias: &balance_via_topk_bias True
+  topk_bias_update_rate: &topk_bias_update_rate 0.001
+  use_fused_ops_topkrouter: True
+  shared_expert_num: 1
+  routed_scaling_factor: 2.5
+  norm_topk_prob: True
+  first_k_dense_replace: 3
+  moe_intermediate_size: 2048
+  aux_loss_factors: [0.0001]
+  aux_loss_types: ["expert"]
+  expert_model_parallel: 1
+  use_gating_sigmoid: True
+  callback_moe_droprate: False
+  use_gmm: True
+  use_fused_ops_permute: True
+  enable_gmm_safe_tokens: True
+
+
+# callbacks
+callbacks:
+  - type: MFLossMonitor
+    per_print_times: 1
+  # balance topk bias with callback
+  - type: TopkBiasBalanceCallback
+    balance_via_topk_bias: *balance_via_topk_bias
+    topk_bias_update_rate: *topk_bias_update_rate
+    num_layers: *num_layers
+    mtp_depth: *mtp_depth
+    expert_num: *expert_num
+    micro_batch_num: *micro_batch_num
+  - type: CheckpointMonitor
+    prefix: "deepseekv3"
+    save_checkpoint_steps: 1000
+    keep_checkpoint_max: 5
+    integrated_save: False
+    async_save: False
+    checkpoint_format: "safetensors"
+
+# wrapper cell config
+runner_wrapper:
+  type: MFTrainOneStepCell
+  scale_sense: 1.0
+  use_clip_grad: True
+
+profile: False
+profile_start_step: 1
+profile_stop_step: 10
+init_start_profile: False
+profile_communication: False
+profile_memory: True
diff --git a/docs/mindformers/docs/source_zh_cn/feature/logging.md b/docs/mindformers/docs/source_zh_cn/feature/logging.md
index bee9ee07b5..77f6226f7a 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/logging.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/logging.md
@@ -46,7 +46,7 @@ MindSpore Transformers 默认会在训练的 yaml 文件中指定文件输出路
 
 如果需要重新指定输出的日志文件夹，可以在 yaml 中修改配置。
 
-以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L2) 为例，可做如下配置：
+以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 ```yaml
 output_dir: './output' # path to save logs/checkpoint/strategy
 ```
diff --git a/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md b/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
index 6455de1cb0..b3f18891ca 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
@@ -14,7 +14,7 @@
 
 用户可通过在模型训练的 yaml 配置文件中新增 `recompute_config` 模块来使用重计算。
 
-以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L113) 为例，可做如下配置：
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # recompute config
diff --git a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
index 0430277b21..64fa2ed18b 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
@@ -120,7 +120,7 @@ output
 
 用户可修改 `yaml` 配置文件中 `CheckpointMonitor` 下的字段来控制权重保存行为。
 
-以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L206) 为例，可做如下配置：
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # callbacks
diff --git a/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md b/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
index 34304e255e..9ab760405a 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
@@ -19,7 +19,7 @@ MindSpore Transformers 提供了如下几类超参数的配置方式。
 #### YAML 参数配置
 
 用户可通过在模型训练的 yaml 配置文件中新增 `lr_schedule` 模块来使用学习率。
-以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L31) 为例，可做如下配置：
+以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # lr schedule
@@ -86,7 +86,7 @@ lr_schedule:
 
 用户可通过在模型训练的 yaml 配置文件中新增 `optimizer` 模块来使用学习率。
 
-以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L24) 为例，可做如下配置：
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # optimizer
diff --git a/docs/mindformers/docs/source_zh_cn/guide/pre_training.md b/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
index 628e6e3b90..8968d2b8bb 100644
--- a/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
@@ -78,7 +78,7 @@ MindSpore Transformers 目前已经支持加载 Megatron 数据集，该数据
 
 ### 单机训练
 
-通过指定模型路径和配置文件[pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml)以msrun的方式启动[run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py)脚本，进行8卡分布式训练。
+通过指定模型路径和配置文件[pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml)以msrun的方式启动[run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py)脚本，进行8卡分布式训练。
 
 默认配置中的模型层数、隐藏维度等参数较大，适用于多机大规模分布式训练，无法直接在单机环境启动预训练，需要参考[DeepSeek-V3-修改配置](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/README.md#%E4%BF%AE%E6%94%B9%E9%85%8D%E7%BD%AE)修改配置文件。
 
-- 
Gitee
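The example `pretrain_deepseek3_671b.yaml` added by this patch is sized for a 1024-device job (see its `# parallel config for device num = 1024` section), and both pre-training guides note that a single-node 8-device run requires scaling the configuration down as described in the DeepSeek-V3 README. As a rough, hedged sketch of what such scaling touches, the field names below come from the YAML above, but the values are illustrative placeholders rather than the README's prescribed settings:

```yaml
# Illustrative single-node (8-device) scaling of the example config above.
# Values are placeholders for demonstration only; follow the DeepSeek-V3
# README's configuration-modification section for the recommended settings.
parallel_config:
  data_parallel: 2
  model_parallel: 4
  pipeline_stage: 1        # 2 * 4 * 1 = 8 devices on a single node
  expert_parallel: 2
  micro_batch_num: 1       # no pipeline stages, so no micro-batch splitting needed
recompute_config:
  recompute: False         # the 16-entry per-stage list above assumes pipeline_stage: 16
model:
  model_config:
    num_layers: 4          # shrink the 61-layer model so it can start on one node
    offset: 0              # the per-stage offset list collapses when there is one stage
```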