From 2f15daa7a021c2118b2330009b6ede616fec34c1 Mon Sep 17 00:00:00 2001
From: zhangyihuiben
Date: Tue, 8 Jul 2025 15:30:06 +0800
Subject: [PATCH] =?UTF-8?q?=E4=BF=AE=E5=A4=8D=E5=A4=B1=E6=95=88=E9=93=BE?=
 =?UTF-8?q?=E6=8E=A5?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../advanced_development/dev_migration.md     |   2 +-
 .../docs/source_en/feature/logging.md         |   2 +-
 .../source_en/feature/memory_optimization.md  |   2 +-
 .../docs/source_en/feature/safetensors.md     |   2 +-
 .../feature/training_hyperparameters.md       |   4 +-
 .../docs/source_en/guide/pre_training.md      |   2 +-
 .../advanced_development/dev_migration.md     |   2 +-
 .../deepseek3/pretrain_deepseek3_671b.yaml    | 225 ++++++++++++++++++
 .../docs/source_zh_cn/feature/logging.md      |   2 +-
 .../feature/memory_optimization.md            |   2 +-
 .../docs/source_zh_cn/feature/safetensors.md  |   2 +-
 .../feature/training_hyperparameters.md       |   4 +-
 .../docs/source_zh_cn/guide/pre_training.md   |   2 +-
 13 files changed, 239 insertions(+), 14 deletions(-)
 create mode 100644 docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml

diff --git a/docs/mindformers/docs/source_en/advanced_development/dev_migration.md b/docs/mindformers/docs/source_en/advanced_development/dev_migration.md
index 76331cf273..917b45b1da 100644
--- a/docs/mindformers/docs/source_en/advanced_development/dev_migration.md
+++ b/docs/mindformers/docs/source_en/advanced_development/dev_migration.md
@@ -48,7 +48,7 @@ All tokenizer classes must be inherited from the PretrainedTokenizer or Pretrain
 
 If a PyTorch-based model weight already exists, you can convert the weight to that in the MindSpore format by referring to [Weight Conversion](https://www.mindspore.cn/mindformers/docs/en/dev/feature/ckpt.html#weight-format-conversion).
 
-For details about how to prepare a dataset, see [Dataset](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html) or the model document, for example, [Llama2 Description Document > Dataset Preparation](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87).
+For details about how to prepare a dataset, see [Dataset](https://www.mindspore.cn/mindformers/docs/en/dev/feature/dataset.html).
 
 ### Preparing a `YAML` Configuration File
 
diff --git a/docs/mindformers/docs/source_en/feature/logging.md b/docs/mindformers/docs/source_en/feature/logging.md
index 63da65ff41..e5b28cd74b 100644
--- a/docs/mindformers/docs/source_en/feature/logging.md
+++ b/docs/mindformers/docs/source_en/feature/logging.md
@@ -46,7 +46,7 @@ By default, MindSpore Transformers specifies the file output path as `./output`
 
 If you need to re-specify the output log folder, you can modify the configuration in yaml.
 
-Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L2) as an example, the following configuration can be made:
+Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, the following configuration can be made:
 ```yaml
 output_dir: './output' # path to save logs/checkpoint/strategy
 ```
diff --git a/docs/mindformers/docs/source_en/feature/memory_optimization.md b/docs/mindformers/docs/source_en/feature/memory_optimization.md
index b38779b9f1..bdf60739a8 100644
--- a/docs/mindformers/docs/source_en/feature/memory_optimization.md
+++ b/docs/mindformers/docs/source_en/feature/memory_optimization.md
@@ -14,7 +14,7 @@ Recomputation can significantly reduce activation memory usage during training b
 
 Users can enable recomputation by adding a `recompute_config` module to the YAML configuration file used for model training.
 
-Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L113) as an example, it could be configured as follows:
+Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, it could be configured as follows:
 
 ```yaml
 # recompute config
diff --git a/docs/mindformers/docs/source_en/feature/safetensors.md b/docs/mindformers/docs/source_en/feature/safetensors.md
index 70b0eea901..94e531b20e 100644
--- a/docs/mindformers/docs/source_en/feature/safetensors.md
+++ b/docs/mindformers/docs/source_en/feature/safetensors.md
@@ -119,7 +119,7 @@ Users can control the weight saving behavior by modifying the configuration file
 
 Users can modify the fields under `CheckpointMonitor` in the `yaml` configuration file to control the weight saving behavior.
 
-Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L206) as an example, the following configuration can be made:
+Taking [`DeepSeek-V3` pre-training yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, the following configuration can be made:
 
 ```yaml
 # callbacks
diff --git a/docs/mindformers/docs/source_en/feature/training_hyperparameters.md b/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
index 8740fdc422..ffdcc69773 100644
--- a/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
+++ b/docs/mindformers/docs/source_en/feature/training_hyperparameters.md
@@ -24,7 +24,7 @@ Setting the learning rate too high can prevent the model from converging, while
 
 Users can utilize the learning rate by adding an `lr_schedule` module to the YAML configuration file used for model training.
 
-Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L31) as an example, it could be configured as follows:
+Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, it could be configured as follows:
 
 ```yaml
 # lr schedule
@@ -91,7 +91,7 @@ Currently, MindSpore Transformers only supports the [AdamW optimizer](https://ww
 
 Users can use the optimizer by adding an `optimizer` module to the YAML configuration file for model training.
 
-Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L24) as an example, it could be configured like this:
+Taking the [DeepSeek-V3 pre-training's YAML file](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) as an example, it could be configured like this:
 
 ```yaml
 # optimizer
diff --git a/docs/mindformers/docs/source_en/guide/pre_training.md b/docs/mindformers/docs/source_en/guide/pre_training.md
index a901de3834..3f32d73987 100644
--- a/docs/mindformers/docs/source_en/guide/pre_training.md
+++ b/docs/mindformers/docs/source_en/guide/pre_training.md
@@ -78,7 +78,7 @@ For dataset processing, refer to [Megatron Dataset - Data Preprocessing](https:/
 
 ### Single-Node Training
 
-Specify the configuration file [pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml) and start the [run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py) script in msrun mode to perform 8-device distributed training.
+Specify the configuration file [pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) and start the [run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py) script in msrun mode to perform 8-device distributed training.
 
 The default configuration includes large values for parameters such as the number of layers and hidden dimensions, which are intended for large-scale multi-node distributed training. It cannot be directly used for pretraining on a single machine. You will need to modify the configuration as described in [DeepSeek-V3 - Configuration Modification](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/README.md#%E4%BF%AE%E6%94%B9%E9%85%8D%E7%BD%AE).
 
diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
index 48442dd5b4..5c5435cdb7 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/dev_migration.md
@@ -48,7 +48,7 @@ MindSpore Transformers提供了[PretrainedTokenizer](https://www.mindspore.cn/mi
 
 如已有基于PyTorch的模型权重，可以参考[权重转换文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/ckpt.html#%E6%9D%83%E9%87%8D%E6%A0%BC%E5%BC%8F%E8%BD%AC%E6%8D%A2)将权重转换为MindSpore格式的权重。
 
-数据集的准备可以参考[数据集文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html)，或参考模型文档，如[Llama2说明文档——数据集准备](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/llama2.md#%E6%95%B0%E6%8D%AE%E5%8F%8A%E6%9D%83%E9%87%8D%E5%87%86%E5%A4%87)。
+数据集的准备可以参考[数据集文档](https://www.mindspore.cn/mindformers/docs/zh-CN/dev/feature/dataset.html)。
 
 ### 准备`YAML`配置文件
 
diff --git a/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml b/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml
new file mode 100644
index 0000000000..7ea7426096
--- /dev/null
+++ b/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml
@@ -0,0 +1,225 @@
+seed: 0
+output_dir: './output' # path to save checkpoint/strategy
+load_checkpoint: ''
+load_ckpt_format: "safetensors"
+src_strategy_path_or_dir: ''
+auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model
+only_save_strategy: False
+resume_training: False
+use_parallel: True
+run_mode: 'train'
+
+# trainer config
+trainer:
+  type: CausalLanguageModelingTrainer
+  model_name: 'deepseekV3'
+
+# runner config
+runner_config:
+  epochs: 2
+  batch_size: 1
+  sink_mode: True
+  sink_size: 1
+
+# optimizer
+optimizer:
+  type: AdamW
+  betas: [0.9, 0.95]
+  eps: 1.e-8
+
+# lr schedule
+lr_schedule:
+  type: ConstantWarmUpLR
+  learning_rate: 2.2e-4
+  warmup_ratio: 0.02
+  total_steps: -1 # -1 means it will load the total steps of the dataset
+
+# dataset
+train_dataset: &train_dataset
+  data_loader:
+    type: BlendedMegatronDatasetDataLoader
+    datasets_type: "GPTDataset"
+    sizes:
+      - 5000 # train dataset size
+      - 0
+      - 0
+    config:
+      random_seed: 1234
+      seq_length: 4096
+      split: "1, 0, 0"
+      reset_position_ids: False
+      reset_attention_mask: False
+      eod_mask_loss: False
+      num_dataset_builder_threads: 1
+      create_attention_mask: False
+      data_path:
+        - '1'
+        - "./dataset"
+    shuffle: False
+  input_columns: ["input_ids", "labels", "loss_mask", "position_ids"]
+  construct_args_key: ["input_ids", "labels"]
+  num_parallel_workers: 8
+  python_multiprocessing: False
+  drop_remainder: True
+  repeat: 1
+  numa_enable: False
+  prefetch_size: 1
+train_dataset_task:
+  type: CausalLanguageModelDataset
+  dataset_config: *train_dataset
+
+# mindspore context init config
+context:
+  mode: 0 #0--Graph Mode; 1--Pynative Mode
+  device_target: "Ascend"
+  max_call_depth: 10000
+  max_device_memory: "55GB"
+  save_graphs: False
+  save_graphs_path: "./graph"
+  jit_config:
+    jit_level: "O1"
+  ascend_config:
+    parallel_speed_up_json_path: "./parallel_speed_up.json"
+
+# parallel config for device num = 1024
+parallel_config:
+  data_parallel: &dp 16
+  model_parallel: 4
+  pipeline_stage: 16
+  expert_parallel: 8
+  micro_batch_num: &micro_batch_num 32
+  vocab_emb_dp: True
+  use_seq_parallel: True
+  gradient_aggregation_group: 4
+# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
+micro_batch_interleave_num: 1
+
+# parallel context config
+parallel:
+  parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel
+  gradients_mean: False
+  enable_alltoall: True
+  full_batch: False
+  dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1]]
+  search_mode: "sharding_propagation"
+  enable_parallel_optimizer: True
+  strategy_ckpt_config:
+    save_file: "./ckpt_strategy.ckpt"
+    only_trainable_params: False
+  parallel_optimizer_config:
+    gradient_accumulation_shard: False
+    parallel_optimizer_threshold: 64
+
+# recompute config
+recompute_config:
+  recompute: [3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 2, 0]
+  select_recompute: False
+  parallel_optimizer_comm_recompute: True
+  mp_comm_recompute: True
+  recompute_slice_activation: True
+
+# model config
+model:
+  model_config:
+    type: DeepseekV3Config
+    auto_register: deepseek3_config.DeepseekV3Config
+    batch_size: 1 # add for increase predict
+    seq_length: 4096
+    hidden_size: 7168
+    num_layers: &num_layers 61
+    num_heads: 128
+    max_position_embeddings: 4096
+    intermediate_size: 18432
+    kv_lora_rank: 512
+    n_kv_heads: 128
+    q_lora_rank: 1536
+    qk_rope_head_dim: 64
+    v_head_dim: 128
+    qk_nope_head_dim: 128
+    vocab_size: 129280
+    multiple_of: 256
+    rms_norm_eps: 1.0e-6
+    bos_token_id: 100000
+    eos_token_id: 100001
+    pad_token_id: 100001
+    ignore_token_id: -100
+    compute_dtype: "bfloat16"
+    layernorm_compute_type: "float32"
+    softmax_compute_type: "float32"
+    rotary_dtype: "float32"
+    router_dense_type: "float32"
+    param_init_type: "float32"
+    use_past: False
+    extend_method: "None"
+    use_flash_attention: True
+    use_fused_swiglu: True
+    use_fused_rope: True
+    input_sliced_sig: True
+    offset: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1]
+    checkpoint_name_or_path: ""
+    theta: 10000.0
+    return_extra_loss: True
+    mtp_depth: &mtp_depth 1
+    mtp_loss_factor: 0.3
+  arch:
+    type: DeepseekV3ForCausalLM
+    auto_register: deepseek3.DeepseekV3ForCausalLM
+
+#moe
+moe_config:
+  expert_num: &expert_num 256
+  expert_group_size: 8
+  capacity_factor: 1.5
+  aux_loss_factor: 0.05
+  num_experts_chosen: 8
+  routing_policy: "TopkRouterV2"
+  balance_via_topk_bias: &balance_via_topk_bias True
+  topk_bias_update_rate: &topk_bias_update_rate 0.001
+  use_fused_ops_topkrouter: True
+  shared_expert_num: 1
+  routed_scaling_factor: 2.5
+  norm_topk_prob: True
+  first_k_dense_replace: 3
+  moe_intermediate_size: 2048
+  aux_loss_factors: [0.0001]
+  aux_loss_types: ["expert"]
+  expert_model_parallel: 1
+  use_gating_sigmoid: True
+  callback_moe_droprate: False
+  use_gmm: True
+  use_fused_ops_permute: True
+  enable_gmm_safe_tokens: True
+
+
+# callbacks
+callbacks:
+  - type: MFLossMonitor
+    per_print_times: 1
+  # balance topk bias with callback
+  - type: TopkBiasBalanceCallback
+    balance_via_topk_bias: *balance_via_topk_bias
+    topk_bias_update_rate: *topk_bias_update_rate
+    num_layers: *num_layers
+    mtp_depth: *mtp_depth
+    expert_num: *expert_num
+    micro_batch_num: *micro_batch_num
+  - type: CheckpointMonitor
+    prefix: "deepseekv3"
+    save_checkpoint_steps: 1000
+    keep_checkpoint_max: 5
+    integrated_save: False
+    async_save: False
+    checkpoint_format: "safetensors"
+
+# wrapper cell config
+runner_wrapper:
+  type: MFTrainOneStepCell
+  scale_sense: 1.0
+  use_clip_grad: True
+
+profile: False
+profile_start_step: 1
+profile_stop_step: 10
+init_start_profile: False
+profile_communication: False
+profile_memory: True
diff --git a/docs/mindformers/docs/source_zh_cn/feature/logging.md b/docs/mindformers/docs/source_zh_cn/feature/logging.md
index bee9ee07b5..77f6226f7a 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/logging.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/logging.md
@@ -46,7 +46,7 @@ MindSpore Transformers 默认会在训练的 yaml 文件中指定文件输出路
 
 如果需要重新指定输出的日志文件夹，可以在 yaml 中修改配置。
 
-以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L2) 为例，可做如下配置：
+以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 ```yaml
 output_dir: './output' # path to save logs/checkpoint/strategy
 ```
diff --git a/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md b/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
index 6455de1cb0..b3f18891ca 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/memory_optimization.md
@@ -14,7 +14,7 @@
 
 用户可通过在模型训练的 yaml 配置文件中新增 `recompute_config` 模块来使用重计算。
 
-以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L113) 为例，可做如下配置：
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # recompute config
diff --git a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
index 0430277b21..64fa2ed18b 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/safetensors.md
@@ -120,7 +120,7 @@ output
 
 用户可修改 `yaml` 配置文件中 `CheckpointMonitor` 下的字段来控制权重保存行为。
 
-以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L206) 为例，可做如下配置：
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # callbacks
diff --git a/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md b/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
index 34304e255e..9ab760405a 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/training_hyperparameters.md
@@ -19,7 +19,7 @@ MindSpore Transformers 提供了如下几类超参数的配置方式。
 #### YAML 参数配置
 
 用户可通过在模型训练的 yaml 配置文件中新增 `lr_schedule` 模块来使用学习率。
-以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L31) 为例，可做如下配置：
+以 [`DeepSeek-V3` 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # lr schedule
@@ -86,7 +86,7 @@ lr_schedule:
 
 用户可通过在模型训练的 yaml 配置文件中新增 `optimizer` 模块来使用学习率。
 
-以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml#L24) 为例，可做如下配置：
+以 [DeepSeek-V3 预训练 yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml) 为例，可做如下配置：
 
 ```yaml
 # optimizer
diff --git a/docs/mindformers/docs/source_zh_cn/guide/pre_training.md b/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
index 628e6e3b90..8968d2b8bb 100644
--- a/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
+++ b/docs/mindformers/docs/source_zh_cn/guide/pre_training.md
@@ -78,7 +78,7 @@ MindSpore Transformers 目前已经支持加载 Megatron 数据集，该数据
 
 ### 单机训练
 
-通过指定模型路径和配置文件[pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/deepseek3_671b/pretrain_deepseek3_671b.yaml)以msrun的方式启动[run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py)脚本，进行8卡分布式训练。
+通过指定模型路径和配置文件[pretrain_deepseek3_671b.yaml](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/example/deepseek3/pretrain_deepseek3_671b.yaml)以msrun的方式启动[run_mindformer.py](https://gitee.com/mindspore/mindformers/blob/dev/run_mindformer.py)脚本，进行8卡分布式训练。
 
 默认配置中的模型层数、隐藏维度等参数较大，适用于多机大规模分布式训练，无法直接在单机环境启动预训练，需要参考[DeepSeek-V3-修改配置](https://gitee.com/mindspore/mindformers/blob/dev/research/deepseek3/README.md#%E4%BF%AE%E6%94%B9%E9%85%8D%E7%BD%AE)修改配置文件。
 
-- 
Gitee
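The example `pretrain_deepseek3_671b.yaml` added by this patch is sized for a 1024-device job (see its `# parallel config for device num = 1024` section), and both pre-training guides note that a single-node 8-device run requires scaling the configuration down as described in the DeepSeek-V3 README. As a rough, hedged sketch of what such scaling touches, the field names below come from the YAML above, but the values are illustrative placeholders rather than the README's prescribed settings:

```yaml
# Illustrative single-node (8-device) scaling of the example config above.
# Values are placeholders for demonstration only; follow the DeepSeek-V3
# README's configuration-modification section for the recommended settings.
parallel_config:
  data_parallel: 2
  model_parallel: 4
  pipeline_stage: 1        # 2 * 4 * 1 = 8 devices on a single node
  expert_parallel: 2
  micro_batch_num: 1       # no pipeline stages, so no micro-batch splitting needed
recompute_config:
  recompute: False         # the 16-entry per-stage list above assumes pipeline_stage: 16
model:
  model_config:
    num_layers: 4          # shrink the 61-layer model so it can start on one node
    offset: 0              # the per-stage offset list collapses when there is one stage
```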