diff --git a/OWNERS b/OWNERS index 0674e8de132ec6b4eba294f269ee1891b0a85af8..9256e24cdf85d1e8cce91e05187e1471974b8650 100644 --- a/OWNERS +++ b/OWNERS @@ -2,7 +2,6 @@ approvers: - fengliangjun66 - wujianping996 - dilililiwhy -- liuyanghan reviewers: - fengliangjun66 - guhangsong diff --git a/README.md b/README.md index 32418ae7ca86a89d1b721cf6d8979f1b774faa02..3285bdb3bcbe647348d5cde1a50586d49dc6e972 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -

+

@@ -13,70 +13,110 @@

-ModelLink旨在为华为 [昇腾芯片](https://www.hiascend.com/) 上提供端到端的大语言模型方案, 包含模型,算法,以及下游任务。 +MindSpeed-LLM,原仓名ModelLink,旨在为华为 [昇腾芯片](https://www.hiascend.com/) 上提供端到端的大语言模型方案, 包含模型,算法,以及下游任务。 --- +## 用户使用指南 +* MindSpeed-LLM[使用指南](./examples/README.md) -## ModelLink大模型方案概览 +* MindSpeed-LLM[预置大模型列表](#预置大模型) + +--- +## MindSpeed-LLM大模型方案 -当前ModelLink支撑大模型使用功能: -* 权重转换:[Huggingface与Megatron-LM权重转换](#jump1) 【昇腾】【OCK】【GTS】 +当前MindSpeed-LLM支撑大模型使用功能: +* 权重转换:[Huggingface与Megatron-LM权重转换](#huggingface与megatron-lm权重转换) 【昇腾】【OCK】【GTS】 * 数据集处理:[预训练数据集/指令微调数据集](./examples/README.md) 【NAIE】【昇腾】 -* 分布式预训练:[加速算法/融合算子/并行策略](#jump2)【昇腾】【计算算法部】【计算研究部】 -* 分布式指令微调:[Prompt模板/动态padding/多轮对话](#jump3)【NAIE】【昇腾】 +* 分布式预训练:[加速算法/融合算子/并行策略](#预训练加速算法与融合算子)【昇腾】【计算算法部】【计算研究部】 +* 分布式指令微调:[Prompt模板/动态padding/多轮对话](#分布式指令微调)【NAIE】【昇腾】 * 分布式推理:[流式推理/人机对话](./examples/README.md) 【NAIE】【昇腾】【GTS】 -* 分布式评估:[MMLU/CEVAL/HumanEval/BBH/BoolQ/AGIEval](#jump4)【NAIE】 -* 昇腾工具链:[Profiling采集](#jump5)/[确定性计算](#jump6)/[高可用](#jump7)【昇腾】【计算研究部】 +* 分布式评估:[MMLU/CEVAL/HumanEval/BBH/BoolQ/AGIEval](#大模型benchmark基线评估)【NAIE】 +* 昇腾工具链:[Profiling采集](#基于昇腾芯片采集Profiling数据)/[确定性计算](#基于昇腾芯片的确定性计算功能)/[高可用](#基于昇腾芯片的高可用特性)【昇腾】【计算研究部】 LoRA、DPO、奖励模型、PPO等特性即将上线 --- -## ModelLink版本维护策略 +## MindSpeed-LLM版本维护策略 -ModelLink版本有以下五个维护阶段: +MindSpeed-LLM版本有以下五个维护阶段: | **状态** | **时间** | **说明** | | ------------------- | -------- |----------------------------------------------------------------------| | 计划 | 1—3 个月 | 计划特性 | | 开发 | 3 个月 | 开发特性 | -| 维护 | 6-12 个月| 合入所有已解决的问题并发布版本,针对不同的ModelLink版本采取不同的维护策略,常规版本和长期支持版本维护周期分别为6个月和12个月 | +| 维护 | 6-12 个月| 合入所有已解决的问题并发布版本,针对不同的MindSpeed-LLM版本采取不同的维护策略,常规版本和长期支持版本维护周期分别为6个月和12个月 | | 无维护 | 0—3 个月 | 合入所有已解决的问题,无专职维护人员,无版本发布 | | 生命周期终止(EOL) | N/A | 分支不再接受任何修改 | -ModelLink已发布版本维护策略: +MindSpeed-LLM已发布版本维护策略: -| **ModelLink版本** | **维护策略** | **当前状态** | **发布时间** | **后续状态** | **EOL日期** | +| **MindSpeed-LLM版本** | **维护策略** | **当前状态** | **发布时间** | **后续状态** | **EOL日期** | |-----------------|-----------|--------|------------|-----------------------|-----------| | bk_origin_23 | Demo | EOL | 2023 | 生命周期终止 | 2024/6/30 | -| 1.0 | 常规版本 | 维护 | 2024/03/30 | 预计2024/9/30起无维护 | | -| 1.1 | 常规版本 | 维护 | 2024/06/30 | 预计2024/12/30起无维护 | | - +| 1.0.RC1 | 常规版本 | 维护 | 2024/03/30 | 预计2024/9/30起无维护 | | +| 1.0.RC2 | 常规版本 | 维护 | 2024/06/30 | 预计2024/12/30起无维护 | | +| 1.0.RC3 | 常规版本 | 维护 | 2024/09/30 | 预计2025/03/30起无维护 | | --- ## 配套版本与支持模型 -【版本配套环境】 +### 【版本配套环境】 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| 依赖软件 | 版本 | 软件安装指南 | 推荐硬件形态 |
|:---:|:---:|:---:|:---:|
| 昇腾NPU驱动 | Ascend HDK 24.1.RC3 | 驱动固件安装指南 | Atlas 900 A2 PODc |
| 昇腾NPU固件 | | | |
| Toolkit(开发套件) | CANN 8.0.RC3 | CANN 软件安装指南 | |
| Kernel(算子包) | | | |
| PyTorch | release v6.0.RC3 | Ascend Extension for PyTorch 配置与安装 | |
| torch_npu插件 | | | |
| apex | | | |
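以上配套环境安装完成后,可参考如下命令做简单自检(仅为示意,安装路径请按实际环境调整):

```shell
# 加载CANN环境变量(请按您的真实安装路径修改)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 查看NPU驱动与设备状态
npu-smi info
# 确认 torch 与 torch_npu 可正常导入(需已安装上表对应版本)
python -c "import torch, torch_npu; print(torch.__version__)"
```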
-| 软件 | [版本](https://www.hiascend.com/zh/) | -| :-----------------------: |:----------------------------------:| -| Python | 3.8 | -| Driver | 在研版本 | -| Firmware | 在研版本 | -| CANN | 在研版本 | -| Torch | 2.1.0、2.2.0 | -| Torch_npu | 在研版本 | +已安装好以上配套环境的镜像的获取和使用方法见[镜像使用指南](./docs/DOCKER_GUIDE.md)。 -【预训练集群性能与线性度】 +### 【预训练集群性能与线性度】 -ModelLink 通过模型并行与数据并行来训练大语言模型,为了演示如何使用多个昇腾芯片和模型大小进行扩展性训练,我们使用 `GPT3-175B` 稠密大模型,从128颗 NPU 扩展到 7968颗 NPU 进行实验,下图是实验数据: +MindSpeed-LLM 通过模型并行与数据并行来训练大语言模型,为了演示如何使用多个昇腾芯片和模型大小进行扩展性训练,我们使用 `GPT3-175B` 稠密大模型,从128颗 NPU 扩展到 7968颗 NPU 进行实验,下图是实验数据:

报告的吞吐量是针对端到端训练进行测量的,涵盖所有操作,包括数据加载、优化器步骤、通信,甚至日志记录。请注意,示例大模型没有训练至收敛。 -图中呈现了对应集群规模下的 `MFU` 值与集群整体的 `线性度`情况. 计算公式已经放到社区,点击链接可进行参考:[MFU计算公式](https://gitee.com/ascend/ModelLink/wikis/%E6%9C%AF%E8%AF%AD%E5%AE%9A%E4%B9%89/%E5%A4%A7%E6%A8%A1%E5%9E%8B%20MFU%20%E8%AE%A1%E7%AE%97%E5%85%AC%E5%BC%8F),[线性度计算公式](https://gitee.com/ascend/ModelLink/wikis/%E6%9C%AF%E8%AF%AD%E5%AE%9A%E4%B9%89/%E7%BA%BF%E6%80%A7%E5%BA%A6%E5%85%AC%E5%BC%8F) +图中呈现了对应集群规模下的 `MFU` 值与集群整体的 `线性度`情况. 计算公式已经放到社区,点击链接可进行参考:[MFU计算公式](https://gitee.com/ascend/MindSpeed-LLM/wikis/%E6%9C%AF%E8%AF%AD%E5%AE%9A%E4%B9%89/%E5%A4%A7%E6%A8%A1%E5%9E%8B%20MFU%20%E8%AE%A1%E7%AE%97%E5%85%AC%E5%BC%8F),[线性度计算公式](https://gitee.com/ascend/MindSpeed-LLM/wikis/%E6%9C%AF%E8%AF%AD%E5%AE%9A%E4%B9%89/%E7%BA%BF%E6%80%A7%E5%BA%A6%E5%85%AC%E5%BC%8F) + +### 【预置大模型】 下述列表中支持的模型,我们在[examples/README.md](./examples/README.md)中提供了相应的使用说明,里面有详细的模型训练、推理、评估流程 @@ -88,7 +128,7 @@ ModelLink 通过模型并行与数据并行来训练大语言模型,为了演 `认证`【Pass】表示经过昇腾官方版本测试的模型,【Test】表示待测试模型 -表中为开启 mc2 特性【内部在研特性】后预训练实测性能,该特性只在24RC2以上版本支持,本仓库代码层面默认关闭,若要使用,请参考[加速算法与融合算子](#jump2)章节 +表中为开启 mc2 特性【内部在研特性】后预训练实测性能,该特性只在24RC2以上版本支持,本仓库代码层面默认关闭,若要使用,请参考[加速算法与融合算子](#预训练加速算法与融合算子)章节 @@ -865,9 +905,9 @@ ModelLink 通过模型并行与数据并行来训练大语言模型,为了演 --- -## Huggingface与Megatron-LM权重转换 +## Huggingface与Megatron-LM权重转换 -ModelLink支持Huggingface、Megatron-Legacy以及Megatron-Core之间的权重格式互转,具体功能列表如下: +MindSpeed-LLM支持Huggingface、Megatron-Legacy以及Megatron-Core之间的权重格式互转,具体功能列表如下:
@@ -917,7 +957,7 @@ ModelLink支持Huggingface、Megatron-Legacy以及Megatron-Core之间的权重 - + @@ -1029,7 +1069,7 @@ ModelLink支持Huggingface、Megatron-Legacy以及Megatron-Core之间的权重 - + @@ -1046,189 +1086,161 @@ ModelLink支持Huggingface、Megatron-Legacy以及Megatron-Core之间的权重 --- -## 预训练加速算法与融合算子 +## 预训练加速算法与融合算子 -ModelLink预训练支持张量并行、流水线并行等多种加速算法和融合算子,下表为各种加速特性对应的使能开关: +MindSpeed-LLM预训练支持张量并行、流水线并行等多种加速算法和融合算子:
专家并行--expert-model-parallel-size--target-expert-model-parallel-size
专家并行--expert-model-parallel-size--target-expert-model-parallel-size
流水并行动态划分
-| 使用场景 | 特性名称 | 具体参数 | Mcore | Legacy |
-|------|------|------|------|------|
-| PTD并行 | 张量并行 | --tensor-model-parallel-size | Yes | Yes |
-| | 流水线并行 | --pipeline-model-parallel-size | Yes | Yes |
-| | 流水线并行动态划分 | --num-layer-list | Yes | Yes |
-| | 虚拟流水并行 | --num-layers-per-virtual-pipeline-stage | Yes | Yes |
-| | 序列并行 | --sequence-parallel | Yes | Yes |
-| | 分布式优化器 | --use-distributed-optimizer | Yes | Yes |
-| 长序列并行 | 长序列并行 | --context-parallel-size | Yes | No |
-| | 多并行方案 | --context-parallel-algo | Yes | No |
-| | Send/recv掩盖加速 | --cp-send-recv-overlap | Yes | No |
-| MOE | MOE专家并行 | --expert-model-parallel-size | Yes | No |
-| | MOE重排通信优化 | --moe-permutation-async-comm | Yes | No |
-| | GEMM | --moe-grouped-gemm | Yes | No |
-| 显存优化 | 参数副本复用 | --reuse-fp32-param | Yes | Yes |
-| | 激活函数重计算 | --recompute-activation-function | Yes | Yes |
-| | Swap Attention | --swap-attention | Yes | Yes |
-| | 重计算程度 | --recompute-granularity | Yes | Yes |
-| | 重计算层数 | --recompute-num-layers | Yes | Yes |
-| | 重计算方法 | --recompute-method | Yes | Yes |
-| | PP-Stage重计算 | --enable-recompute-layers-per-pp-rank | Yes | Yes |
-| 融合算子 | Flash attention | --use-flash-attn | Yes | Yes |
-| | Fused rmsnorm | --use-fused-rmsnorm | Yes | Yes |
-| | Fused swiglu | --use-fused-swiglu | Yes | Yes |
-| | Fused rotary position embedding | --use-fused-rotary-pos-emb | Yes | Yes |
-| | Sliding window attention | --sliding-window | Yes | Yes |
-| 通信 | 梯度reduce通算掩盖 | --overlap-grad-reduce | Yes | Yes |
-| | 权重all-gather通算掩盖 | --overlap-param-gather | Yes | No |
-| | MC2 | --use-mc2 | Yes | Yes |
+| 场景 | 特性名称 | 贡献方 |
+|------|------|------|
+| SPTD并行 | 张量并行 | 【昇腾】 |
+| | 流水线并行 | 【昇腾】 |
+| | 虚拟流水并行 | 【昇腾】 |
+| | 序列并行 | 【昇腾】 |
+| 长序列并行 | Ascend Ring Attention 长序列并行 | 【昇腾】 |
+| | Ulysses 长序列并行 | 【昇腾】 |
+| | 混合长序列并行 | 【昇腾】 |
+| MOE | MOE 专家并行 | 【昇腾】 |
+| | MOE 重排通信优化 | 【计算研究部】 |
+| 显存优化 | 参数副本复用 | 【计算算法部】 |
+| | 分布式优化器 | 【昇腾】 |
+| | Swap Attention | 【计算研究部】 |
+| | 重计算 | 【计算研究部】 |
+| 融合算子 | Flash attention | 【昇腾】 |
+| | Fused rmsnorm | 【昇腾】 |
+| | Fused swiglu | 【昇腾】 |
+| | Fused rotary position embedding | 【昇腾】 |
+| | GMM | 【昇腾】 |
+| 通信掩盖 | 梯度reduce通算掩盖 | 【昇腾】 |
+| | Recompute in advance | 【昇腾】 |
+| | 权重all-gather通算掩盖 | 【昇腾】 |
+| | MC2 | 【昇腾】 |
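作为参考,下面给出一个在预训练启动脚本中叠加部分加速特性开关的示意片段。仅为示意:其中 ACCELERATE_ARGS 为示意用的变量名,并行度取值为假设值,实际可用开关与组合请以上表及各模型 examples 目录下的预置脚本为准:

```shell
# 示意:参照仓库脚本的写法,将加速特性开关叠加到启动参数中(取值为假设值)
ACCELERATE_ARGS="
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 4 \
    --sequence-parallel \
    --use-distributed-optimizer \
    --use-flash-attn \
    --use-fused-rmsnorm \
    --use-fused-swiglu \
    --use-fused-rotary-pos-emb \
    --overlap-grad-reduce \
"
```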
+--- + **注意事项** 1. 具体的预训练方法见[examples/README.md](./examples/README.md) @@ -1240,101 +1252,8 @@ ModelLink预训练支持张量并行、流水线并行等多种加速算法和 --- -## 分布式指令微调 -ModelLink支持指令微调,方案与DeepSpeed统一,在微调效果保持一致的前提下,ModelLink可以表现出优异性能 - -【与DeepSpeed微调Loss对比】 - - - - - - - -
Llama2-7b模型与DeepSpeed微调5个epoch后的loss对比图 - -

-
Qwen-7b模型与DeepSpeed微调4个epoch后的loss对比图 - -

-
- - -【Llama2-7b微调后对话效果对比】 - - - - - - - - - - - - - - - - - - - - - -
QuestionModelLinkDeepSpeed
What is the product of 6 and 2?The product of 6 and 2 is 12.The product of 6 and 2 is 12.
Look up the boiling point of water.The boiling point of water is 100°C (212°F)..The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit).
Generate a poem that expresses joy.I open my arms and accept life's challenge, - -Embracing all of life's uncertainty. - -A future of promise awaits, - -One that I can find in my fate. - -A journey full of hope and discovery, - -A journey that will never end in disappointment. - -For I have found the strength to keep going, - -Researching and learning, ever knowing. - -The future is mine, and mine alone, - -A future that I have to own, - -And I will never be alone, - -For I will have the courage to face all of my fears, - -And the joy of life will be mine forever.I am so happy, so thankful and blessed, - -To have a heart that's full of cheer, - -My soul is on fire, my spirit runs free, - -My love is warm, my courage is high, - -My world is bright and my outlook is clear, - -My heart is light, my spirit runs free. - -I take each day with a positive outlook, - -My energy is so high, I can't be beat, - -My joy is real, my emotions run deep, - -My soul is full and my heart does soep. - -I am thankful for everything I have, - -My life is precious and my values ​​true, - -My hope is high and my spirit runs free, - -My soul is full and my heart does soep.
- -【现版本实测性能、显存(硬件信息:Atlas 900 A2 PODc)】 +## 分布式指令微调 +MindSpeed-LLM支持指令微调,在微调效果保持一致的前提下,MindSpeed-LLM可以表现出优异性能 下述列表中的模型,我们在[examples/README.md](./examples/README.md)中提供了相应的使用说明,里面有详细的模型微调、推理、评估流程. 其中性能的单位是samples/s @@ -1343,7 +1262,7 @@ My soul is full and my heart does soep. 模型 --prompt-type - ModelLink + NPU + MindSpeed-LLM + NPU DeepSpeed + NPU DeepSpeed + 参考 @@ -1358,22 +1277,22 @@ My soul is full and my heart does soep. llama2-7b llama2 - - - - - - - - - - - - + dynamic + 45.7 + dynamic + 40.4 + dynamic + 46.5 - qwen-7b - qwen - - - - - - - - - - - - + llama2-13b + llama2 + dynamic + 28.4 + dynamic + 17.8 + dynamic + 24.9 @@ -1456,11 +1375,11 @@ My soul is full and my heart does soep. --- -## 大模型Benchmark基线评估 +## 大模型Benchmark基线评估 -ModelLink支持大模型在公开基准数据集上进行准确率评估,当前支持的Benchmark如下: +MindSpeed-LLM支持大模型在公开基准数据集上进行准确率评估,当前支持的Benchmark如下: -| Benchmark | 下载链接 | 验证集 | ModelLink | OpenCompass | +| Benchmark | 下载链接 | 验证集 | MindSpeed-LLM | OpenCompass | |-----------|------------------------------------------------------------------------------------------|------|----------------------------------------------------------------------|------------------------------------------------------------------| | MMLU | [GitHub](https://people.eecs.berkeley.edu/~hendrycks/data.tar) | test | [45.73%](./examples/mcore/llama2/evaluate_llama2_7b_mmlu_ptd.sh) | [45.3%](https://hub.opencompass.org.cn/dataset-detail/MMLU) | | CEval | [HuggingFace](https://huggingface.co/datasets/ceval/ceval-exam/blob/main/ceval-exam.zip) | val | [33.87%](./examples/mcore/llama2/evaluate_llama2_7b_ceval_ptd.sh) | [32.5%](https://hub.opencompass.org.cn/dataset-detail/C-Eval) | @@ -1469,9 +1388,9 @@ ModelLink支持大模型在公开基准数据集上进行准确率评估,当 | AGIEval | [GitHub](https://github.com/ruixiangcui/AGIEval/tree/main) | test | [20.6%](./examples/mcore/llama2/evaluate_llama2_7b_agieval_ptd.sh) | [20.6%](https://hub.opencompass.org.cn/dataset-detail/AGIEval) | | HumanEval | [GitHub](https://github.com/openai/human-eval/tree/master/data) | test | [12.8%](./examples/mcore/llama2/evaluate_llama2_7b_humaneval_ptd.sh) | [12.2%](https://hub.opencompass.org.cn/dataset-detail/HumanEval) | -ModelLink已支持模型的评估数据统计如下: +MindSpeed-LLM已支持模型的评估数据统计如下: -| 模型 | 任务 | ModelLink | 社区 | 模型 | 任务 | ModelLink | 社区 | +| 模型 | 任务 | MindSpeed-LLM | 社区 | 模型 | 任务 | MindSpeed-LLM | 社区 | |---------------|--------|-----------|----------------------------------------------------------------------|------------------|--------|-----------|-----------------------------------------------------------------------------------| | Aquila-7B | BoolQ | 77.3% | -- | Aquila2-7B | BoolQ | 77.8% | -- | | Aquila2-34B | BoolQ | 88.0% | -- | Baichuan-7B | BoolQ | 69.0% | [67.0%](https://hub.opencompass.org.cn/dataset-detail/BoolQ) | @@ -1505,8 +1424,8 @@ MiniCPM-2B | MMLU | 51.6% | [53.4%](https://github.com/OpenBMB/MiniCPM? --- -## 基于昇腾芯片采集Profiling数据 -Modellink支持基于昇腾芯片采集profiling数据,以提供对模型运行情况的分析,主要API如下: +## 基于昇腾芯片采集Profiling数据 +MindSpeed-LLM支持基于昇腾芯片采集profiling数据,以提供对模型运行情况的分析,主要API如下: ```bash @@ -1524,8 +1443,8 @@ Modellink支持基于昇腾芯片采集profiling数据,以提供对模型运 --- -## 基于昇腾芯片的确定性计算功能 -昇腾芯片默认采用了不确定计算加速模型训练,有时为了重复实验与对比实验需要确定性的计算结果,ModelLink使能确定性计算的开关如下: +## 基于昇腾芯片的确定性计算功能 +昇腾芯片默认采用了不确定计算加速模型训练,有时为了重复实验与对比实验需要确定性的计算结果,MindSpeed-LLM使能确定性计算的开关如下: - 启动命令中加入开关 ```shell @@ -1539,32 +1458,30 @@ export HCCL_DETERMINISTIC=True --- -## 基于昇腾芯片的高可用特性 -分布式优化器的思想是通过将优化器状态均匀地分布在数据并行组中来节省内存。基于该思想,设计了将数据并行组切分成两个副本数据并行组的方案,副本优化器将优化器状态均匀分布在副本数据并行组,实现优化器状态均有备份。结合华为自研的高可用框架,可实现以下功能: -1. 训练过程中,支持故障场景保存临终checkpoint,训练结果0损失。 -2. 
训练过程中,支持HBM的UCE故障检测,并完成在线修复,达到Step级重计算。 +## 基于昇腾芯片的高可用特性 +分布式优化器的思想是通过将优化器状态均匀地分布在数据并行组中来节省内存。基于该思想,设计了将数据并行组切分成两个副本数据并行组的方案,副本优化器将优化器状态均匀分布在副本数据并行组,实现优化器状态均有备份。结合华为自研的高可用框架,可实现训练过程中,支持故障场景保存临终checkpoint,训练结果0损失。 -开启高可用特性时,副本优化器使用的静态内存有所增加,每个参数的理论字节数为(其中“d”是数据并行大小): + +开启高可用特性时,副本优化器使用的静态内存有所增加,每个参数的理论字节数为(其中“d”是数据并行大小,增长关系仅供参考): | | Non-distributed optim | Distributed optim | Replica optim | |----------------------------------| ------ | ------ |---------------| -| fp16/bf16 param, fp16/bf16 grads | 20 | 4 + 16/d | 4 + 32/d | -| fp16/bf16 param, fp32 grads | 18 | 6 + 12/d | Supporting | -| fp32 param, fp32 grads | 16 | 8 + 8/d | Supporting | +| fp16/bf16 param, fp16/bf16 grads | 20 | 4 + 16/d | 4 + 32/d | +| fp16/bf16 param, fp32 grads | 18 | 6 + 12/d | 6 + 24/d | +| fp32 param, fp32 grads | 16 | 8 + 8/d | 8 + 16/d | -- 启动命令中加入开关,并安装华为自研高可用框架mindio_ttp.whl -- mindio_ttp相关说明:https://www.hiascend.com/document/detail/zh/mindx-dl/60rc1/mindio/mindiottp +- 启动命令中加入开关,并安装华为自研高可用框架 [mindio_ttp.whl](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc3/clusterscheduling/ref/mindiottp/mindiotft009.html) +- mindio_ttp相关说明:[MindIO TTP 官网介绍](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc3/clusterscheduling/ref/mindiottp/mindiotft001.html) ```shell --enable-high-availability #使能高可用特性的总开关 ---enable-optimizer-state-local-copy #使能保存上一步优化器状态,内存会进一步增加,默认可关闭 ``` --- ## 致谢 -ModelLink由华为公司的下列部门联合贡献 : +MindSpeed-LLM由华为公司的下列部门联合贡献 : - 昇腾计算产品部 - 计算算法部 - 计算研究部 @@ -1572,10 +1489,23 @@ ModelLink由华为公司的下列部门联合贡献 : - 公共开发部:NAIE - 全球技术服务部:GTS -感谢来自社区的每一个PR,欢迎贡献 ModelLink +感谢来自社区的每一个PR,欢迎贡献 MindSpeed-LLM --- ## 安全声明 -[ModelLink安全声明](https://gitee.com/ascend/ModelLink/wikis/%E5%AE%89%E5%85%A8%E7%9B%B8%E5%85%B3/%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E) +[MindSpeed-LLM安全声明](https://gitee.com/ascend/MindSpeed-LLM/wikis/%E5%AE%89%E5%85%A8%E7%9B%B8%E5%85%B3/%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E) + +# 免责声明 + +## 致MindSpeed-LLM使用者 +1. MindSpeed-LLM提供的模型仅供您用于非商业目的。 +2. 对于各模型,MindSpeed-LLM平台仅提示性地向您建议可用于训练的数据集,华为不提供任何数据集,如您使用这些数据集进行训练,请您特别注意应遵守对应数据集的License,如您因使用数据集而产生侵权纠纷,华为不承担任何责任。 +3. 如您在使用MindSpeed-LLM模型过程中,发现任何问题(包括但不限于功能问题、合规问题),请在Gitee提交issue,我们将及时审视并解决。 + +## 致数据集所有者 +如果您不希望您的数据集在MindSpeed-LLM中的模型被提及,或希望更新MindSpeed-LLM中的模型关于您的数据集的描述,请在Gitee提交issue,我们将根据您的issue要求删除或更新您的数据集描述。衷心感谢您对MindSpeed-LLM的理解和贡献。 + +## License声明 +Ascend MindSpeed-LLM提供的模型,如模型目录下存在License的,以该License为准。如模型目录下不存在License的,以Apache 2.0许可证许可,对应许可证文本可查阅Ascend MindSpeed-LLM根目录。 \ No newline at end of file diff --git a/SECURITYNOTE.md b/SECURITYNOTE.md index a8476ca67272aa17f0d0ed86f6c61658446d6bd8..84f48a974e7ca525179322efd54b412f7eb29577 100644 --- a/SECURITYNOTE.md +++ b/SECURITYNOTE.md @@ -6,12 +6,12 @@ ``` ## 运行用户建议 -出于安全性及权限最小化角度考虑,不建议使用root等管理员类型账户使用Modellink。 +出于安全性及权限最小化角度考虑,不建议使用root等管理员类型账户使用MindSpeed-LLM。 ## 文件权限控制 1. 建议用户在主机(包括宿主机)及容器中设置运行系统umask值为0027及以上,保障新增文件夹默认最高权限为750,新增文件默认最高权限为640。 -2. 建议用户对个人数据、商业资产、源文件、训练过程中保存的各类文件等敏感内容做好权限管控。涉及场景如Modellink安装目录权限管控、多用户使用共享数据集权限管控,管控权限可参考表1进行设置。 -3. Modellink在数据预处理中会生成训练数据,在训练过程会生成权重文件,文件权限默认640,用户可根据实际需求对生成文件权限进行进阶管控。 +2. 建议用户对个人数据、商业资产、源文件、训练过程中保存的各类文件等敏感内容做好权限管控。涉及场景如MindSpeed-LLM安装目录权限管控、多用户使用共享数据集权限管控,管控权限可参考表1进行设置。 +3. MindSpeed-LLM在数据预处理中会生成训练数据,在训练过程会生成权重文件,文件权限默认640,用户可根据实际需求对生成文件权限进行进阶管控。 **表1 文件(夹)各场景权限管控推荐最大值** | 类型 | linux权限参考最大值 | @@ -38,26 +38,28 @@ ## 数据安全声明 -1. ModelLink会在megatron中的checkpointing模块中保存模型文件,其中部分模型文件使用了风险模块pickle,可能存在数据风险。 +1. MindSpeed-LLM会在megatron中的checkpointing模块中保存模型文件,其中部分模型文件使用了风险模块pickle,可能存在数据风险。 ## 运行安全声明 1. 
建议用户结合运行资源状况编写对应训练脚本。若训练脚本与资源状况不匹配,如数据集加载内存大小超出内存容量限制、训练脚本在本地生成数据超过磁盘空间大小等情况,可能引发错误并导致进程意外退出。 -2. ModelLink内部用到了pytorch,可能会因为版本不匹配导致运行错误,具体可参考pytorch[安全声明](https://gitee.com/ascend/pytorch#%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E)。 +2. MindSpeed-LLM内部用到了pytorch,可能会因为版本不匹配导致运行错误,具体可参考pytorch[安全声明](https://gitee.com/ascend/pytorch#%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E)。 ## 公网地址声明 | 类型 | 开源代码地址 | 文件名 | 公网IP地址/公网URL地址/域名/邮箱地址 | 用途说明 | |--------|----------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|-----------| -| 开源代码引入 | 不涉及 | modellink/model/language_model.py:85 | https://github.com/kingoflolz/mesh-transformer-jax/ | 详情地址 | -| 开源代码引入 | 涉及 | tests/pipeline/common.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/unit/common.py | 源代码地址 | -| 开源代码引入 | 涉及 | tests/pipeline/conftest.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/conftest.py | 源代码地址 | - +| 自研 | 不涉及 | modellink/model/language_model.py:85 | https://github.com/kingoflolz/mesh-transformer-jax/ | 详情地址 | +| 自研 | 涉及 | tests/pipeline/common.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/unit/common.py | 源代码地址 | +| 自研 | 涉及 | tests/pipeline/conftest.py:6 | https://github.com/microsoft/DeepSpeed/blob/master/tests/conftest.py | 源代码地址 | +| 自研 | 不涉及 | examples/mcore/gemma/data_convert_gemma_pretrain.sh:5 | https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered/resolve/main/wikipedia-cn-20230720-filtered.json?download=true | 数据下载地址 | +| 自研 | 涉及 | modellink/tasks/rl/dpo.py:178 | https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py | 源代码地址 | +| 自研 | 不涉及 | modellink/core/transformer/moe/moe_utils.py:135 | https://arxiv.org/abs/2101.03961 | 论文地址 | ## 公开接口声明 -ModelLink 暂时未发布wheel包,无正式对外公开接口,所有功能均通过shell脚本调用。5个入口脚本分别为[pretrain_gpt.py](https://gitee.com/ascend/ModelLink/blob/master/pretrain_gpt.py), [inference.py](https://gitee.com/ascend/ModelLink/blob/master/inference.py), [evaluation.py](https://gitee.com/ascend/ModelLink/blob/master/evaluation.py), [preprocess_data.py](https://gitee.com/ascend/ModelLink/blob/master/preprocess_data.py) 和 [convert_ckpt.py](https://gitee.com/ascend/ModelLink/blob/master/convert_ckpt.py)。 +MindSpeed-LLM 暂时未发布wheel包,无正式对外公开接口,所有功能均通过shell脚本调用。5个入口脚本分别为[pretrain_gpt.py](https://gitee.com/ascend/MindSpeed-LLM/blob/master/pretrain_gpt.py), [inference.py](https://gitee.com/ascend/MindSpeed-LLM/blob/master/inference.py), [evaluation.py](https://gitee.com/ascend/MindSpeed-LLM/blob/master/evaluation.py), [preprocess_data.py](https://gitee.com/ascend/MindSpeed-LLM/blob/master/preprocess_data.py) 和 [convert_ckpt.py](https://gitee.com/ascend/MindSpeed-LLM/blob/master/convert_ckpt.py)。 ## 通信安全加固 diff --git a/Third_Party_Open_Source_Software_Notice b/Third_Party_Open_Source_Software_Notice new file mode 100644 index 0000000000000000000000000000000000000000..c5d32948cfa10bd71bc6829e896bac7be97b221d --- /dev/null +++ b/Third_Party_Open_Source_Software_Notice @@ -0,0 +1,141 @@ +OPEN SOURCE SOFTWARE NOTICE + +Please note we provide an open source software notice along with this product and/or this product firmware (in the following just “this product”). The open source software licenses are granted by the respective right holders. 
And the open source licenses prevail all other license information with regard to the respective open source software contained in the product, including but not limited to End User Software Licensing Agreement. This notice is provided on behalf of Huawei Technologies Co. Ltd. and any of its local subsidiaries which may have provided this product to you in your local country. + +Warranty Disclaimer +THE OPEN SOURCE SOFTWARE IN THIS PRODUCT IS DISTRIBUTED IN THE HOPE THAT IT WILL BE USEFUL, BUT WITHOUT ANY WARRANTY, WITHOUT EVEN THE IMPLIED WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. SEE THE APPLICABLE LICENSES FOR MORE DETAILS. + +Copyright Notice and License Texts +Software: pytorch v2.1.0 +Copyright notice: +Copyright (c) 2016- Facebook, Inc +Copyright (c) 2014- Facebook, Inc +Copyright (c) 2011-2014 Idiap Research Institute +Copyright (c) 2012-2014 Deepmind Technologies +Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu) +Copyright (c) 2011-2013 NYU +Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston) +Copyright (c) 2006 Idiap Research Institute +Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz) +Copyright (c) 2016-present, Facebook Inc. +Copyright (c) 2016 Facebook Inc. +Copyright (c) 2015 Google Inc. +Copyright (c) 2015 Yangqing Jia +Copyright 2019-2020 Kakao Brain +Copyright (c) 2022 Cruise LLC. +Copyright (c) 2013, 2014, 2015, the respective contributors +Copyright (c) 2015, 2016 the respective contributors +Copyright (c) 2014, The Regents of the University of California (Regents) +Copyright (c) 2014, the respective contributors +Copyright (c) 2018, Steven Moshier +Copyright (c) 2001-2002 Enthought, Inc. 2003-2019, SciPy Developers +Copyright (c) 1997-2011 by Secret Labs AB +Copyright (c) 1995-2011 by Fredrik Lundh +Copyright (c) 2010-2022 by Alex Clark and contributors +Copyright (c) 2006 The Android Open Source Project +Copyright (c) Facebook, Inc. and its affiliates +Copyright (c) Meta Platforms, Inc. and affiliates +Copyright 2004-present Facebook +Copyright (c) 2017 by Contributors +Copyright (c) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura +Copyright (c) 2022 Apple Inc. +Copyright (c) 2023 Apple Inc. +Copyright 2005 Robert Kern (robert.kern@gmail.com) +copyright 2019 The TensorFlow Authors +Copyright (c) 2018 MathInf GmbH, Thomas Viehmann +Copyright (c) 2014 Indiana University (c) +Copyright John Maddock 2006 +Copyright (c) 2012 Massachusetts Institute of Technology +Copyright (c) 2012 Giovanni Garberoglio Interdisciplinary Laboratory for Computational Science (LISC) Fondazione Bruno Kessler and University of Trento +Copyright (c) 2018 Marat Dukhan +Copyright (c) 2017-2018 Facebook Inc. +Copyright (c) 2017 Georgia Institute of Technology +Copyright 2015 Google Inc. +Copyright (c) 2011-2021, NVIDIA CORPORATION. +Copyright (c) 2022, Tri Dao +Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. +Copyright (c) 2017 - 2022 NVIDIA CORPORATION & AFFILIATES. +Copyright (c) 2017 The Android Open Source Project +Copyright (c) 2016-present, Facebook, Inc. +Copyright (c) 2005-2020 Rich Felker +Copyright Malte Skarupke 2017 +Copyright 2008 Google Inc. +Copyright (c) 2011 - 2012 Andrzej Krzemienski +Copyright (c) 2001-2019 Free Software Foundation, Inc. +Copyright (c) 1994 Hewlett-Packard Company +Copyright (c) 1996-1998 Silicon Graphics Computer Systems, Inc. 
+Copyright (c) Bjorn Fahller +Copyright Michael Park, 2015-2017 +Copyright (c) 2017-present, Facebook, Inc. +Copyright (c) 2018-present, Facebook, Inc. +Copyright (c) 2008-2015 The Khronos Group Inc. +Copyright 2016 Facebook +Copyright (c) 2016, NVIDIA CORPORATION +Copyright (c) 2008 - 2012 The Khronos Group Inc. +Copyright (c) 2008-2013 The Khronos Group Inc. +Copyright (c) 2008-2012 The Khronos Group Inc. +Copyright (c) 2016-2017, ARM Limited and Contributors +Copyright (c) 2014-2015 The Khronos Group Inc. +Copyright (c) 2015-2017 The Khronos Group Inc. +Copyright (c) Facebook Inc. and Microsoft Corporation +Copyright (c) 2014-2017 The Regents of the University of California (Regents) +Copyright (c) 2014-2017, the respective contributors +Copyright (c) 2017 Microsoft +Copyright 2015 The Gemmlowp Authors +Copyright (c) 2011-2019 Stephan Brumme +Copyright 2006, Google Inc. +Copyright (c) Meta Platforms, Inc. and its affiliates +Copyright (c) 2008 - 2009 NVIDIA Corporation +Copyright (c) 2007-2009 Scientific Computing and Imaging Institute, University of Utah +Copyright (c) 2006, Laurent Montel, montel@kde.org +Copyright 2013 Conrad Steenberg conrad.steenberg@gmail.com +copyright 2022, PyTorch +copyright 2023, PyTorch +Copyright (c) 2005-2022 NVIDIA Corporation Built +copyright PyTorch Contributors +Copyright (c) 2018 Alex Rogozhnikov +Copyright (c) 2016 Microsoft +Copyright (c) 2014, 2015, The Regents of the University of California (Regents) +Copyright (c) 2014, 2015, the respective contributors +Copyright (c) 2005-2017, NumPy Developers (c) Parameter containing Float +Copyright 2005, Google Inc. +Copyright 2019 Kakao Brain +Copyright 2013-2014 RAD Game +Copyright 2010-2014 Rich Geldreich and Tenacious Software LLC +Copyright 2016 Martin Raiber +Copyright (c) 2003-2017 Josef Weidendorfer +Copyright (c) 2000-2017 Julian Seward +Copyright (c) Edward Z. Yang ezyang@mit.edu +Copyright (c) 2005-2010 ActiveState Software Inc. +Copyright (c) 2013 Eddy Petrisor +Copyright (c) 2010 ActiveState Software Inc. +Copyright (c) 2001-2014 Python Software Foundation +Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 Python Software Foundation +Copyright Python Software Foundation +Copyright 2022 Cruise LLC +Copyright (c) 2014 Matthew Rocklin +Copyright (c) 2015 Melissa E. O'Neill +Copyright (c) 2019 NumPy Developers +Copyright (c) 2015-2016 Advanced Micro Devices, Inc. +Copyright 2013 Mark Dickinson + +License: BSD 3-Clause License +Copyright (c) , , +All rights reserved. +Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: +1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. +2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. +3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
+IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +Written Offer +This product contains software whose rights holders license it on the terms of the GNU General Public License, version 2 (GPLv2) and/or other open source software licenses. We will provide you and any third party with the source code of the software licensed under an open source software license if you send us a written request by mail or email to the following addresses: +foss@huawei.com +detailing the name of the product and the firmware version for which you need the source code and indicating how we can contact you. + +Please note you need to make a payment before you obtain the complete Corresponding Source Code from us. For how much you will pay and how we will deliver the complete Corresponding Source Code to you, we will further discuss it by mail or email. +This offer is valid to anyone in receipt of this information. + +THIS OFFER IS VALID FOR THREE YEARS FROM THE MOMENT WE DISTRIBUTED THE PRODUCT OR FIRMWARE. \ No newline at end of file diff --git a/ci/access_control_test.py b/ci/access_control_test.py index 785dabdb9df08800129750eecec41b4becf4be71..dc512a3f5013ccf76a47081d13fc3b8026d549bb 100644 --- a/ci/access_control_test.py +++ b/ci/access_control_test.py @@ -89,7 +89,7 @@ class UTTest: exsit_ut_files = [file for file in full_path if os.path.exists(file) and file.endswith(".py")] self.ut_files = " ".join(exsit_ut_files) - command = f"pytest -x {self.ut_files}" + command = f"pytest -x --log-cli-level=INFO {self.ut_files}" code = acquire_exitcode(command) if code == 0: print("UT test success") diff --git a/docs/DOCKER_GUIDE.md b/docs/DOCKER_GUIDE.md new file mode 100644 index 0000000000000000000000000000000000000000..794dde7d0ac855e4a08653024cfd748bf4b785b5 --- /dev/null +++ b/docs/DOCKER_GUIDE.md @@ -0,0 +1,59 @@ + +## 1.镜像下载(待补充昇腾社区的镜像下载地址) +通过uname -a确认自身系统是ubuntu_x86 或者 openeuler +根据需要下载对应的镜像,如下为下载链接: +https://www.hiascend.com/developer/ascendhub/detail/e26da9266559438b93354792f25b2f4a + +## 2.镜像加载 +```bash +# 挂载镜像,确认挂载是否成功 +docker image list +``` + +## 3.创建镜像容器 +注意当前默认配置驱动和固件安装在/usr/local/Ascend,如有差异请修改指令路径。 +当前容器默认初始化npu驱动和CANN环境信息,如需要安装新的,请自行替换或手动source,详见容器的bashrc +```bash +# 挂载镜像 +docker run -dit --ipc=host --network host --name 'llm_test' --privileged -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware -v /usr/local/sbin/:/usr/local/sbin/ -v /home/:/home/ mindspeed-llm:tag +``` + +## 4.登录镜像并确认环境状态 +```bash +# 登录容器 +docker exec -it llm_test /bin/bash +# 确认npu是否可以正常使用,否则返回3.检查配置 +npu-smi info +``` + +## 5.拉取配套版本 +当前镜像推荐配套版本,用户可根据自己所需的版本配套,进行MindSpeed-LLM和MindSpeed的更新使用。 +rc+序号为对应配套版本,镜像与分支名是配套的。例如: +1. 2024.rc2-arm/2024.rc2-x86 镜像版本匹配 [MindSpeed-LLM的1.0.RC2分支](https://gitee.com/ascend/MindSpeed-LLM/tree/1.0.RC2/) +2. 2024.rc3-arm/2024.rc3-x86 镜像版本匹配 [MindSpeed-LLM的1.0.RC3分支](https://gitee.com/ascend/MindSpeed-LLM/tree/1.0.RC3/) +3. ... 
+ +**注意:master为研发分支,无支持镜像。** + +下面以MindSpeed-LLM的1.0.RC3分支进行配套说明。 +镜像根据系统区分选择2024.rc3-arm/2024.rc3-x86。 +```bash +# 从Gitee克隆MindSpeed-LLM仓库 (git checkout 1.0.RC3) +git clone https://gitee.com/ascend/MindSpeed-LLM.git +cd MindSpeed-LLM +git checkout 1.0.RC3 +# 从Gitee克隆MindSpeed仓库(git checkout 4ea42a23) +git clone https://gitee.com/ascend/MindSpeed.git +cd MindSpeed +git checkout 4ea42a23 +pip install -e . +cd .. +# 拉取megatron并切换对应版本放到MindSpeed-LLM下 +git clone https://github.com/NVIDIA/Megatron-LM.git +cd Megatron-LM +git checkout core_r0.6.0 +cp -r megatron ../ +``` + +## 6.单机以及多机模型的预训练任务运行 +基于拉取的镜像和仓库代码,执行单机和多机的预训练任务,具体可参考[MindSpeed-LLM 使用指南](../examples/README.md) \ No newline at end of file diff --git a/examples/README.md b/examples/README.md index 768d8caab9bef64479e30d7110dd7b63d977384f..96fa8167fb0d752b9413925907a881efef884dd2 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,64 +1,64 @@ -# ModelLink 使用指南 +# MindSpeed-LLM 使用指南

-## 目录 - - -- [环境安装](#jump1) - * [仓库拉取](#jump1.1) - * [环境搭建](#jump1.2) -- [权重下载及转换](#jump2) - * [权重下载](#jump2.1) - * [权重转换](#jump2.2) -- [数据集准备及处理](#jump3) - * [数据集下载](#jump3.1) - * [数据集处理](#jump3.2) - * [数据集合并](#jump3.3) -- [大模型分布式预训练](#jump4) - * [准备工作](#jump4.1) - * [配置参数](#jump4.2) - * [启动预训练](#jump4.3) -- [大模型分布式指令微调](#jump5) - * [准备工作](#jump5.1) - * [配置微调参数](#jump5.2) - * [启动全参微调](#jump5.3) -- [大模型分布式推理](#jump6) - * [Generate:流式推理](#jump6.1) - * [Chat:指令微调后chat对话](#jump6.2) -- [大模型分布式评估](#jump7) - * [基准评估](#jump7.1) - * [指令微调评估](#jump7.2) - * [LoRA权重评估](#jump7.3) -- [社区BUG列表](#jump8) - ---- - ## 环境安装 【模型开发时推荐使用配套的环境版本】 -| 软件 | [版本](https://www.hiascend.com/zh/) | -|:---------:|:----------------------------------:| -| Python | 3.8 | -| Driver | 在研版本 | -| Firmware | 在研版本 | -| CANN | 在研版本 | -| Torch | 2.1.0、2.2.0 | -| Torch_npu | 在研版本 | + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| 依赖软件 | 版本 | 软件安装指南 | 推荐硬件形态 |
|:---:|:---:|:---:|:---:|
| 昇腾NPU驱动 | Ascend HDK 24.1.RC3 | 驱动固件安装指南 | Atlas 900 A2 PODc |
| 昇腾NPU固件 | | | |
| Toolkit(开发套件) | CANN 8.0.RC3 | CANN 软件安装指南 | |
| Kernel(算子包) | | | |
| PyTorch | release v6.0.RC3 | Ascend Extension for PyTorch 配置与安装 | |
| torch_npu插件 | | | |
| apex | | | |
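完成下文环境搭建后,可用类似下面的命令快速验证 torch 与 torch_npu 是否安装成功(示意命令;torch_npu.npu.is_available() 为 torch_npu 提供的设备可用性检查接口,具体以所装版本的官方文档为准):

```shell
# 快速自检:确认NPU设备对torch可见(示意)
python -c "import torch, torch_npu; print(torch_npu.npu.is_available())"
```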
#### 1. 仓库拉取 ```shell - git clone https://gitee.com/ascend/ModelLink.git + git clone https://gitee.com/ascend/MindSpeed-LLM.git git clone https://github.com/NVIDIA/Megatron-LM.git cd Megatron-LM git checkout core_r0.6.0 - cp -r megatron ../ModelLink/ + cp -r megatron ../MindSpeed-LLM/ cd .. - cd ModelLink + cd MindSpeed-LLM + git checkout 1.0.RC3 mkdir logs mkdir model_from_hf mkdir dataset @@ -71,9 +71,13 @@ conda create -n test python=3.8 conda activate test - # 安装 torch 和 torch_npu,注意要选择对应python版本、x86或arm的torch、torch_npu及apex包 + # 安装所需版本的 torch 和 torch_npu,注意要选择对应python版本、x86或arm的torch、torch_npu及apex包 + # 以安装 torch-2.1.0 和 torch_npu-2.1.0为例 pip install torch-2.1.0-cp38-cp38m-manylinux2014_aarch64.whl pip install torch_npu-2.1.0*-cp38-cp38m-linux_aarch64.whl + + # 安装对应版本的torchvision + pip install torchvision==0.16.0 # apex for Ascend 参考 https://gitee.com/ascend/apex pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl @@ -84,8 +88,8 @@ # 安装加速库 git clone https://gitee.com/ascend/MindSpeed.git cd MindSpeed - # checkout commit from MindSpeed core_r0.6.0 - git checkout e6ea2117 + # checkout commit from MindSpeed core_r0.6.0 in 0923 + git checkout 4ea42a23 pip install -r requirements.txt pip3 install -e . cd .. @@ -106,7 +110,7 @@ 更多社区资源可以在`模型`列链接中获取,如`Chat/Instruct`权重等 -权重可以基于网页直接下载,也可以基于命令行下载,保存到ModelLink/model_from_hf目录,比如: +权重可以基于网页直接下载,也可以基于命令行下载,保存到MindSpeed-LLM/model_from_hf目录,比如: ```shell @@ -126,20 +130,18 @@ cd ../../ ``` #### 2. 权重转换 - +在`example`目录下每个模型都已经预置好权重转换脚本,可以根据需要来进行修改 ##### 2.1 Huggingface权重转换到Megatron-LM格式 ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh python convert_ckpt.py \ --model-type GPT \ --load-model-type hf \ --save-model-type mg \ - --target-tensor-parallel-size 2 \ - --target-pipeline-parallel-size 4 \ - --num-layer-list 8,8,8,8 \ + --target-tensor-parallel-size 1 \ + --target-pipeline-parallel-size 2 \ + --num-layer-list 16,16 \ --model-type-hf llama2 \ --load-dir ./model_from_hf/llama-2-7b-hf/ \ --save-dir ./model_weights/llama-2-7b-legacy/ \ @@ -170,7 +172,7 @@ python convert_ckpt.py \ 【--model-type-hf】 -huggingface模型类别,默认为llama2,目前支持的模型见 [model_cfg.json](https://gitee.com/ascend/ModelLink/blob/master/modellink/tasks/checkpoint/model_cfg.json) +huggingface模型类别,默认为llama2,目前支持的模型见 [model_cfg.json](https://gitee.com/ascend/MindSpeed-LLM/blob/master/modellink/tasks/checkpoint/model_cfg.json) 【--tokenizer-model】 @@ -182,7 +184,7 @@ huggingface模型类别,默认为llama2,目前支持的模型见 [model_cfg. 
【启动脚本】 -ModelLink Huggingface到Megatron-Legacy权重转换脚本命名风格及启动方法为: +MindSpeed-LLM Huggingface到Megatron-Legacy权重转换脚本命名风格及启动方法为: ```shell # 命名及启动:bash examples/model_name/ckpt_convert_xxx_hf2legacy.sh # 需要配置并行参数以及权重词表加载保存等路径 @@ -190,19 +192,18 @@ ModelLink Huggingface到Megatron-Legacy权重转换脚本命名风格及启动 bash examples/llama2/ckpt_convert_llama2_hf2legacy.sh ``` -ModelLink Huggingface到Megatron-Mcore权重转换脚本命名风格及启动方法为: +MindSpeed-LLM Huggingface到Megatron-Mcore权重转换脚本命名风格及启动方法为: ```shell -# 命名及启动:bash examples/model_name/ckpt_convert_xxx_hf2mcore.sh +# 命名及启动:bash examples/mcore/model_name/ckpt_convert_xxx_hf2mcore.sh # 需要配置并行参数以及权重词表加载保存等路径 -bash examples/llama2/ckpt_convert_llama2_hf2mcore.sh +bash examples/mcore/llama2/ckpt_convert_llama2_hf2mcore.sh ``` ##### 2.2 Megatron-LM权重转换到Huggingface格式 ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh +# 转换到Huggingface格式时,`target-tensor-parallel-size`与`target-pipeline-parallel-size`均需设为1 python convert_ckpt.py \ --model-type GPT \ @@ -218,7 +219,7 @@ python convert_ckpt.py \ 【启动脚本】 -ModelLink Megatron-Legacy到Huggingface的权重转换脚本命名风格及启动方法为: +MindSpeed-LLM Megatron-Legacy到Huggingface的权重转换脚本命名风格及启动方法为: ```shell # 命名及启动:bash examples/model_name/ckpt_convert_xxx_legacy2hf.sh # 需要配置并行参数以及权重词表加载保存等路径 @@ -226,19 +227,17 @@ ModelLink Megatron-Legacy到Huggingface的权重转换脚本命名风格及启 bash examples/llama2/ckpt_convert_llama2_legacy2hf.sh ``` -ModelLink Megatron-Mcore到Huggingface的权重转换脚本命名风格及启动方法为: +MindSpeed-LLM Megatron-Mcore到Huggingface的权重转换脚本命名风格及启动方法为: ```shell -# 命名及启动:bash examples/model_name/ckpt_convert_xxx_mcore2hf.sh +# 命名及启动:bash examples/mcore/model_name/ckpt_convert_xxx_mcore2hf.sh # 需要配置并行参数以及权重词表加载保存等路径 -bash examples/llama2/ckpt_convert_llama2_mcore2hf.sh +bash examples/mcore/llama2/ckpt_convert_llama2_mcore2hf.sh ``` ##### 2.3 Megatron-LM格式权重互转 ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh # legacy转legacy python convert_ckpt.py \ @@ -295,7 +294,7 @@ mcore转legacy时设置此参数以指定保存权重格式为legacy 其余参数意义参考2.1 -注:上述权重legacy和mcore互转为高阶功能,modellink基于llama2提供基础能力,并进行版本迭代看护,其余模型的支持需要用户自行修改支持 +注:上述权重legacy和mcore互转为高阶功能,MindSpeed-LLM基于llama2提供基础能力,并进行版本迭代看护,其余模型的支持需要用户自行修改支持 ##### 2.4 lora权重与base权重合并 @@ -308,11 +307,22 @@ mcore转legacy时设置此参数以指定保存权重格式为legacy --lora-target-modules query_key_value dense dense_h_to_4h dense_4h_to_h \ ``` +【lora-r】 + +`--lora_r`参数指的是LoRA中的秩(rank),它决定了低秩矩阵的大小。 + +【--lora-alpha】 + +`--lora_alpha`参数定义了LoRA适应的学习率缩放因子。这个参数影响了低秩矩阵的更新速度。 + +【--lora-target-modules】 + +`--lora-target-modules`定义了Lora目标模块,字符串列表,由空格隔开,无默认值。每一个字符串是需要进行LoRA微调的层的名称。 + + 【合并后转换为Megatron-Legacy权重】 ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh python convert_ckpt.py \ --model-type GPT \ @@ -336,8 +346,6 @@ bash examples/llama2/ckpt_convert_llama2_legacy2legacy_lora.sh 【合并后转换为Huggingface权重】 ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh python convert_ckpt.py \ --model-type GPT \ @@ -367,7 +375,7 @@ bash examples/llama2/ckpt_convert_llama2_legacy2hf_lora.sh #### 1. 数据集下载 -从Huggingface等网站下载开源数据集,保存到ModelLink/dataset/ 目录 +从Huggingface等网站下载开源数据集,保存到MindSpeed-LLM/dataset/ 目录 常用的预训练数据集有: - [Enwiki数据集](https://huggingface.co/datasets/lsb/enwiki20230101) @@ -390,12 +398,10 @@ cd .. ``` #### 2. 
数据集处理 - +在`example`目录下每个模型都已经预置好数据集预处理脚本,可以根据需要来进行修改 ##### 2.1 预训练数据集处理方法 ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./dataset python ./preprocess_data.py \ @@ -438,7 +444,7 @@ python ./preprocess_data.py \ 数据预处理并行加速参数。当需要预处理的数据集比较大时,可以通过并行处理进行加速,方法为设置参数`--n-subs`,通过该参数设置并行处理数量。在数据预处理过程会将原始数据集切分为`n_sub`个子集,对子集进行并行处理,然后合并,从而实现加速。建议预处理数据集超过GB级别时加上该参数。 -ModelLink预训练数据集处理脚本命名风格及启动方法为: +MindSpeed-LLM预训练数据集处理脚本命名风格及启动方法为: ```shell # Legacy # 命名及启动:examples/model_name/data_convert_xxx_pretrain.sh @@ -469,8 +475,6 @@ cd .. 在指令监督微调时,instruction 列对应的内容会与 input 列对应的内容拼接后作为人类指令,即人类指令为 instruction\ninput。而 output 列对应的内容为模型回答。如果指定了history,则会将历史对话内容也加入进来。如果指定system 列,则对应的内容将被作为系统提示词。 ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./finetune_dataset python ./preprocess_data.py \ @@ -580,8 +584,6 @@ cd .. ``` Sharegpt格式数据预处理脚本: ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./finetune_dataset python ./preprocess_data.py \ @@ -632,8 +634,6 @@ OpenAI格式示例: OpenAI格式数据预处理脚本: ```shell -# 请按照您的真实环境修改 set_env.sh 路径 -source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./finetune_dataset python ./preprocess_data.py \ @@ -665,7 +665,7 @@ python ./preprocess_data.py \ 则会提取数据集里的`"messages"`列,其中角色格式可以为:`"role": "user"、"role": "assistant"`,内容格式为`"content": "具体内容"` -ModelLink微调数据集处理脚本命名风格及启动方法为: +MindSpeed-LLM微调数据集处理脚本命名风格及启动方法为: ```shell # Legacy # 命名及启动:examples/model_name/data_convert_xxx_instruction.sh @@ -941,7 +941,7 @@ DATA_PATH="./finetune_dataset/alpaca" #数据集路径 #### 1. Generate:流式推理 -ModelLink 流式推理脚本命名风格及启动方法为: +MindSpeed-LLM 流式推理脚本命名风格及启动方法为: ```shell # Legacy # 命名及启动:examples/model_name/generate_xxx.sh @@ -962,7 +962,7 @@ bash examples/llama2/generate_llama2_7b_ptd.sh ``` #### 2. Chat:指令微调后chat对话 -ModelLink 指令微调后chat对话脚本命名风格及启动方法为: +MindSpeed-LLM 指令微调后chat对话脚本命名风格及启动方法为: ```shell # Legacy # 命名及启动:examples/model_name/chat_xxx.sh @@ -999,7 +999,7 @@ bash examples/llama2/chat_llama2_7b_ptd.sh ## 大模型分布式评估 #### 1. 
基准评估 -ModelLink 基准评估脚本命名风格及启动方法为: +MindSpeed-LLM 基准评估脚本命名风格及启动方法为: ```shell # Legacy # 命名及启动:examples/model_name/evaluate_xxx.sh diff --git a/examples/bloom/ckpt_convert_bloom_hf2legacy.sh b/examples/bloom/ckpt_convert_bloom_hf2legacy.sh index 3b50abe7c9228a9b036f1269bd0e02495091c040..4e84654bb9a34373316dd9109db631e905d76e2d 100644 --- a/examples/bloom/ckpt_convert_bloom_hf2legacy.sh +++ b/examples/bloom/ckpt_convert_bloom_hf2legacy.sh @@ -8,8 +8,8 @@ python convert_ckpt.py \ --save-model-type mg \ --target-tensor-parallel-size 8 \ --target-pipeline-parallel-size 1 \ - --load-dir --load-dir ./model_from_hf/Bloom-hf/ \ - --save-dir --save-dir ./model_weights/Bloom-legacy/ \ + --load-dir ./model_from_hf/Bloom-hf/ \ + --save-dir ./model_weights/Bloom-legacy/ \ --tokenizer-model None \ --model-type-hf bloom \ --add-qkv-bias \ diff --git a/examples/chatglm3/data_convert_chatglm3_instruction.sh b/examples/chatglm3/data_convert_chatglm3_instruction.sh index 1dda86447313bbf3aa63148230ab04c142dab2d4..37e9805a30c2e91f18ef7f70ce5c6d497b9d569d 100644 --- a/examples/chatglm3/data_convert_chatglm3_instruction.sh +++ b/examples/chatglm3/data_convert_chatglm3_instruction.sh @@ -1,10 +1,10 @@ -# Alpaca数据集下载链接: https://huggingface.co/datasets/tatsu-lab/alpaca +# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 Alpaca 数据集 # 请按照您的真实环境修改 set_env.sh 路径 source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./finetune_dataset python ./preprocess_data.py \ - --input ./dataset/train-00000-of-00042-d964455e17e96d5a.parquet \ + --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \ --tokenizer-name-or-path ./model_from_hf/Chatglm3-hf/ \ --output-prefix ./finetune_dataset/alpaca \ --workers 4 \ diff --git a/examples/gpt3/data_convert_gpt_pretrain.sh b/examples/gpt3/data_convert_gpt_pretrain.sh index 63ac1635abcb190a2199750665c11b061451892c..06e59d9b04ae4889719f4ec73a8b304551da0192 100644 --- a/examples/gpt3/data_convert_gpt_pretrain.sh +++ b/examples/gpt3/data_convert_gpt_pretrain.sh @@ -1,13 +1,8 @@ +# 请根据 examples/README.md 下 “社区BUG列表” 章节下载 gpt2-vocab.json,gpt2-merges.txt 文件 # 请按照您的真实环境修改 set_env.sh 路径 source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./dataset -# 下载 vocab file 和 merge table -# cd vocab_file -# wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -# wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -# cd .. 
- # 处理成训练数据 python ./preprocess_data.py \ --input ./dataset/ \ diff --git a/examples/llama/pretrain_llama_7b_ptd.sh b/examples/llama/pretrain_llama_7b_ptd.sh index 490963637cdbd8a7e2a7a24dd86589f23643aa62..f549b629ccd49ff9f04f577386377c96eda5839e 100644 --- a/examples/llama/pretrain_llama_7b_ptd.sh +++ b/examples/llama/pretrain_llama_7b_ptd.sh @@ -2,6 +2,7 @@ export CUDA_DEVICE_MAX_CONNECTIONS=1 export NPU_ASD_ENABLE=0 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True GPUS_PER_NODE=8 MASTER_ADDR=localhost diff --git a/examples/llama2/ckpt_convert_llama2_legacy2hf.sh b/examples/llama2/ckpt_convert_llama2_legacy2hf.sh index 553e473a317578febbac0161f86f0ef2b99cacd7..edd6e469e67f3586d5ae60520a7563402e8bd3da 100644 --- a/examples/llama2/ckpt_convert_llama2_legacy2hf.sh +++ b/examples/llama2/ckpt_convert_llama2_legacy2hf.sh @@ -7,7 +7,7 @@ python convert_ckpt.py \ --loader megatron \ --saver megatron \ --save-model-type save_huggingface_llama \ - --load-dir ./model_weights/llama2-legacy/ \ + --load-dir ./model_weights/llama-2-legacy/ \ --target-tensor-parallel-size 1 \ --target-pipeline-parallel-size 1 \ --save-dir ./model_from_hf/llama-2-7b-hf/ # <-- 需要填入原始HF模型路径,新权重会存于./model_from_hf/llama-2-7b-hf/mg2hg/ diff --git a/examples/llama2/data_convert_llama2_instruction.sh b/examples/llama2/data_convert_llama2_instruction.sh index 3902a16712a774c3b73ee620a5f12bb1fc639513..e752bf0e46e668af876a2c4bad9207e2790753d5 100644 --- a/examples/llama2/data_convert_llama2_instruction.sh +++ b/examples/llama2/data_convert_llama2_instruction.sh @@ -1,4 +1,4 @@ -# Sharegpt数据集下载链接: https://huggingface.co/datasets/shibing624/roleplay-zh-sharegpt-gpt4-data +# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 ShareGPT 数据集 # 请按照您的真实环境修改 set_env.sh 路径 source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./finetune_dataset diff --git a/examples/llama2/evaluate_llama2_7B_ptd.sh b/examples/llama2/evaluate_llama2_7B_ptd.sh index 606e7b0687d9a87c08896b851418cacee6b0110c..68785b08f4cf11ff4f0f00db24a3f065b4e0fe50 100644 --- a/examples/llama2/evaluate_llama2_7B_ptd.sh +++ b/examples/llama2/evaluate_llama2_7B_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../mmlu/test/" TASK="mmlu" diff --git a/examples/llama2/evaluate_llama2_7b_full_ptd.sh b/examples/llama2/evaluate_llama2_7b_full_ptd.sh index 4e6704a40d3cca90bfebaf22b8c2b2b156066f4a..0f649304741ce258730ecf1d21a9ce93e1720022 100644 --- a/examples/llama2/evaluate_llama2_7b_full_ptd.sh +++ b/examples/llama2/evaluate_llama2_7b_full_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../mmlu/test/" TASK="mmlu" diff --git a/examples/llama3/data_convert_llama3_instruction.sh b/examples/llama3/data_convert_llama3_instruction.sh index e7cdfeb53bc0349119ba9ef78441ff68a6b2910b..faa10de76d12fc08aabbe334d58aa190c7ab7bf9 100644 --- a/examples/llama3/data_convert_llama3_instruction.sh +++ b/examples/llama3/data_convert_llama3_instruction.sh @@ -1,10 +1,10 @@ -# Alpaca数据集下载链接: https://huggingface.co/datasets/tatsu-lab/alpaca +# 请根据 examples/README.md 下 
“数据集准备及处理” 章节下载 Alpaca 数据集
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./finetune_dataset
python ./preprocess_data.py \
- --input ./dataset/train-00000-of-00042-d964455e17e96d5a.parquet \
+ --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
 --tokenizer-name-or-path ./model_from_hf/Llama3-hf/ \
 --output-prefix ./finetune_dataset/alpaca \
 --workers 4 \
diff --git a/examples/mcore/baichuan2/data_convert_baichuan2_pretrain.sh b/examples/mcore/baichuan2/data_convert_baichuan2_pretrain.sh index 456d68b01a9c6e9499692bdd8b27712e09668c9d..4205d6b422babb52e3b7852e8a848fa3be188c93 100644 --- a/examples/mcore/baichuan2/data_convert_baichuan2_pretrain.sh +++ b/examples/mcore/baichuan2/data_convert_baichuan2_pretrain.sh @@ -1,8 +1,8 @@
+# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 Alpaca 数据集
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset
-# 数据集下载地址 https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
python ./preprocess_data.py \
 --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
 --tokenizer-name-or-path ./model_from_hf/Baichuan-hf/ \
diff --git a/examples/mcore/deepseek2/data_convert_deepseek2_pretrain.sh b/examples/mcore/deepseek2/data_convert_deepseek2_pretrain.sh index 5a8b10aa03603f5cab95ac1d08db8ec826b79729..bdfdf674ee6caa14947c930d5c0d9a28bbf080aa 100644 --- a/examples/mcore/deepseek2/data_convert_deepseek2_pretrain.sh +++ b/examples/mcore/deepseek2/data_convert_deepseek2_pretrain.sh @@ -1,8 +1,8 @@
+# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 Enwiki 数据集(一般取第一条即可)
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset
-# 数据集下载地址 https://huggingface.co/datasets/lsb/enwiki20230101/blob/main/data/train-00000-of-00042-d964455e17e96d5a.parquet
python ./preprocess_data.py \
 --input ./dataset/train-00000-of-00042-d964455e17e96d5a.parquet \
 --tokenizer-name-or-path ./model_from_hf/deepseek2-hf/ \
diff --git a/examples/mcore/deepseek2_coder/data_convert_deepseek2_pretrain.sh b/examples/mcore/deepseek2_coder/data_convert_deepseek2_pretrain.sh index 051054eee5a9504a26e980eec1c7956e967d6c44..fbc5863c40bf8315559c0fb047ee32a6ef954a2d 100644 --- a/examples/mcore/deepseek2_coder/data_convert_deepseek2_pretrain.sh +++ b/examples/mcore/deepseek2_coder/data_convert_deepseek2_pretrain.sh @@ -1,8 +1,8 @@
+# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 Enwiki 数据集(一般取第一条即可)
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./dataset
-# 数据集下载地址 https://huggingface.co/datasets/lsb/enwiki20230101/blob/main/data/train-00000-of-00042-d964455e17e96d5a.parquet
python ./preprocess_data.py \
 --input ./dataset/train-00000-of-00042-d964455e17e96d5a.parquet \
 --tokenizer-name-or-path ./model_from_hf/deepseek2-coder-hf/ \
diff --git a/examples/mcore/llama2/data_convert_llama2_instruction.sh b/examples/mcore/llama2/data_convert_llama2_instruction.sh index 410e2097e5cb890f667b86ac586aad4e3f0f12ad..97f2a3a0e1ceb753f7b24e592f95721cddf1ab78 100644 --- a/examples/mcore/llama2/data_convert_llama2_instruction.sh +++ b/examples/mcore/llama2/data_convert_llama2_instruction.sh @@ -1,10 +1,10 @@
-# Alpaca数据集下载链接: https://huggingface.co/datasets/tatsu-lab/alpaca
+# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 Alpaca 数据集
# 请按照您的真实环境修改 set_env.sh 路径
source /usr/local/Ascend/ascend-toolkit/set_env.sh
mkdir ./finetune_dataset
python ./preprocess_data.py \
- --input
./dataset/train-00000-of-00042-d964455e17e96d5a.parquet \ + --input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \ --tokenizer-name-or-path ./model_from_hf/Llama2-hf/ \ --output-prefix ./finetune_dataset/alpaca \ --workers 4 \ diff --git a/examples/mcore/llama2/data_convert_llama2_instruction_pack.sh b/examples/mcore/llama2/data_convert_llama2_instruction_pack.sh index 5a9b2904ecd6370815a8d478e9d337fe38f27291..3edef8468e59be8c828823b9c3499cc6d6e4b0cb 100644 --- a/examples/mcore/llama2/data_convert_llama2_instruction_pack.sh +++ b/examples/mcore/llama2/data_convert_llama2_instruction_pack.sh @@ -1,4 +1,4 @@ -# Alpaca数据集下载链接: https://huggingface.co/datasets/tatsu-lab/alpaca +# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 Alpaca 数据集 # 请按照您的真实环境修改 set_env.sh 路径 source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./finetune_dataset diff --git a/examples/mcore/llama2/dpo_llama2_7b_ptd.sh b/examples/mcore/llama2/dpo_llama2_7b_ptd.sh index 2073b8c2ad0252b011776795cb48310f4e423617..28d0ac3540b91b92745ad12c1f42c1199789c02d 100644 --- a/examples/mcore/llama2/dpo_llama2_7b_ptd.sh +++ b/examples/mcore/llama2/dpo_llama2_7b_ptd.sh @@ -12,7 +12,7 @@ WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES)) CKPT_SAVE_DIR="your model save ckpt path" DATA_PATH="your data path" -TOKENIZER_MODEL="your tokenizer path" +TOKENIZER_PATH="your tokenizer path" CKPT_LOAD_DIR="your model ckpt path" TP=2 PP=4 @@ -34,8 +34,8 @@ GPT_ARGS=" --hidden-size 4096 \ --ffn-hidden-size 11008 \ --num-attention-heads 32 \ - --tokenizer-type Llama2Tokenizer \ - --tokenizer-model ${TOKENIZER_MODEL} \ + --tokenizer-type PretrainedFromHF \ + --tokenizer-name-or-path ${TOKENIZER_PATH} \ --seq-length 4096 \ --max-position-embeddings 4096 \ --micro-batch-size 1 \ diff --git a/examples/mcore/llama2/evaluate_llama2_7b_agieval_ptd.sh b/examples/mcore/llama2/evaluate_llama2_7b_agieval_ptd.sh index 457fded553422fc1a8197bbeb1ae49c420d9868b..5e6d649bcc004b7526f3d2886807ce8f8e7ddfaa 100644 --- a/examples/mcore/llama2/evaluate_llama2_7b_agieval_ptd.sh +++ b/examples/mcore/llama2/evaluate_llama2_7b_agieval_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../agieval" TASK="agieval" diff --git a/examples/mcore/llama2/evaluate_llama2_7b_bbh_ptd.sh b/examples/mcore/llama2/evaluate_llama2_7b_bbh_ptd.sh index fbb40fcb253e6921751cbd503aa0f72a358bdd92..a9a2e741bbefcfd9275e03cb6d3237db6b98bdda 100644 --- a/examples/mcore/llama2/evaluate_llama2_7b_bbh_ptd.sh +++ b/examples/mcore/llama2/evaluate_llama2_7b_bbh_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../bbh/test" TASK="bbh" diff --git a/examples/mcore/llama2/evaluate_llama2_7b_boolq_ptd.sh b/examples/mcore/llama2/evaluate_llama2_7b_boolq_ptd.sh index 708abcbb56abb27ce9c1d796978c740294847f84..a4ea35b4cfa6455e32c7902a5a0b2411d87f503b 100644 --- a/examples/mcore/llama2/evaluate_llama2_7b_boolq_ptd.sh +++ b/examples/mcore/llama2/evaluate_llama2_7b_boolq_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path 
TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../boolq/dev" TASK="boolq" diff --git a/examples/mcore/llama2/evaluate_llama2_7b_ceval_ptd.sh b/examples/mcore/llama2/evaluate_llama2_7b_ceval_ptd.sh index ed54bd4550ebd45a84a4b100219732fbb4d43001..64ed045620159e0dfde406dce6b1df57b1b112d1 100644 --- a/examples/mcore/llama2/evaluate_llama2_7b_ceval_ptd.sh +++ b/examples/mcore/llama2/evaluate_llama2_7b_ceval_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../ceval/val" TASK="ceval" diff --git a/examples/mcore/llama2/evaluate_llama2_7b_full_mmlu_ptd.sh b/examples/mcore/llama2/evaluate_llama2_7b_full_mmlu_ptd.sh index e1277043a2504db01309b4a7a5dd4d4aae5a8af0..cdb885b00d317a88bed2b2e883d0c5f492f95f61 100644 --- a/examples/mcore/llama2/evaluate_llama2_7b_full_mmlu_ptd.sh +++ b/examples/mcore/llama2/evaluate_llama2_7b_full_mmlu_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../mmlu/test/" TASK="mmlu" diff --git a/examples/mcore/llama2/evaluate_llama2_7b_humaneval_ptd.sh b/examples/mcore/llama2/evaluate_llama2_7b_humaneval_ptd.sh index 7c736a7ad4bc5f0318ae3cf62f71919a3c5f88fd..4e2786a7239db0334c6ff1164a4f3c7738a5ed3f 100644 --- a/examples/mcore/llama2/evaluate_llama2_7b_humaneval_ptd.sh +++ b/examples/mcore/llama2/evaluate_llama2_7b_humaneval_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../human_eval" TASK="human_eval" diff --git a/examples/mcore/llama2/evaluate_llama2_7b_mmlu_ptd.sh b/examples/mcore/llama2/evaluate_llama2_7b_mmlu_ptd.sh index 01c6991d130f10c346bcb46548de53e4d730fbdb..676bf0ae4d567e69dd8186ca94423e0720752d12 100644 --- a/examples/mcore/llama2/evaluate_llama2_7b_mmlu_ptd.sh +++ b/examples/mcore/llama2/evaluate_llama2_7b_mmlu_ptd.sh @@ -9,7 +9,7 @@ source /usr/local/Ascend/ascend-toolkit/set_env.sh # modify script model path and tokenizer path TOKENIZER_PATH=/../llama2-7b-hf/ #tokenizer path -CHECKPOINT=/../llama2-7b-modellink-tp1 #model path +CHECKPOINT=/../llama2-7b-mindspeed-llm-tp1 #model path # configure task and data path DATA_PATH="/../mmlu/test/" TASK="mmlu" diff --git a/examples/mcore/mistral/data_convert_mistral_instruction.sh b/examples/mcore/mistral/data_convert_mistral_instruction.sh index 6787f4c4194f63f5ec00580b52d3f531183f7b1d..6e419ca25cc15f19814238e733f441a74c31401c 100644 --- a/examples/mcore/mistral/data_convert_mistral_instruction.sh +++ b/examples/mcore/mistral/data_convert_mistral_instruction.sh @@ -1,4 +1,4 @@ -# Alpaca数据集下载链接: https://huggingface.co/datasets/tatsu-lab/alpaca +# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 Alpaca 数据集 # 请按照您的真实环境修改 set_env.sh 路径 source /usr/local/Ascend/ascend-toolkit/set_env.sh 
mkdir ./finetune_dataset diff --git a/examples/qwen/convert_ckpt_qwen_hf2legacy.sh b/examples/qwen/convert_ckpt_qwen_hf2legacy.sh index cfcc457b20e5519bf9ac9d8c7b827360f025bb82..de5faebcf273fe125b6222d0ac70b29e8a3d3e9e 100644 --- a/examples/qwen/convert_ckpt_qwen_hf2legacy.sh +++ b/examples/qwen/convert_ckpt_qwen_hf2legacy.sh @@ -1,4 +1,4 @@ -# 修改modellink_qwen.py文件第39行,将: +# 修改modelling_qwen.py文件第39行,将: # SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7 # 修改为: # SUPPORT_FP16 = True diff --git a/examples/qwen/convert_ckpt_qwen_legacy2hf.sh b/examples/qwen/convert_ckpt_qwen_legacy2hf.sh index 0f87a2aa388fbfd31df5832dd3c8f3a2dd1425c8..9543345d994be82108a4a676894ed5e320c1a9dd 100644 --- a/examples/qwen/convert_ckpt_qwen_legacy2hf.sh +++ b/examples/qwen/convert_ckpt_qwen_legacy2hf.sh @@ -1,4 +1,4 @@ -# 修改modellink_qwen.py文件第39行,将: +# 修改modelling_qwen.py文件第39行,将: # SUPPORT_FP16 = SUPPORT_CUDA and torch.cuda.get_device_capability(0)[0] >= 7 # 修改为: # SUPPORT_FP16 = True diff --git a/examples/qwen/data_convert_qwen_instruction.sh b/examples/qwen/data_convert_qwen_instruction.sh index ded77c63bcace5c960bfd4e8c8ba1e2cc23685ad..b156b2bba4de44db12cde602cfb002392a004a6d 100644 --- a/examples/qwen/data_convert_qwen_instruction.sh +++ b/examples/qwen/data_convert_qwen_instruction.sh @@ -1,4 +1,4 @@ -# Sharegpt数据集下载链接: https://huggingface.co/datasets/shibing624/roleplay-zh-sharegpt-gpt4-data +# 请根据 examples/README.md 下 “数据集准备及处理” 章节下载 ShareGPT 数据集 # 请按照您的真实环境修改 set_env.sh 路径 source /usr/local/Ascend/ascend-toolkit/set_env.sh mkdir ./finetune_dataset diff --git a/modellink/arguments.py b/modellink/arguments.py index 1194ba3c0041a740c8ac5313baff8360ac8c6445..6625f2f4ccbeafd74342f902025fbbd2c0063999 100644 --- a/modellink/arguments.py +++ b/modellink/arguments.py @@ -487,6 +487,8 @@ def _add_training_args(parser): group.add_argument('--swap-attention', action='store_true', default=False, help='switch to open swap-attention feature.' 'The default is False.') + group.add_argument('--swap-modules', type=str, default=None, + help='Swap modules for model. Should be used together with "--swap-attention."') return parser @@ -577,7 +579,10 @@ def _validate_recompute_args(args): validate re-computation arguments. 
""" enable_pp_vpp = args.num_layers_per_virtual_pipeline_stage - enable_recomputation = args.recompute_granularity is not None and args.recompute_method == 'block' + enable_vanilla_recomputation = args.recompute_granularity is not None and args.recompute_method == 'block' + enable_swap = args.swap_attention + enable_recompute_activation = args.recompute_activation_function + enable_recomputation = enable_vanilla_recomputation or enable_swap or enable_recompute_activation if args.enable_recompute_layers_per_pp_rank and not (enable_pp_vpp and enable_recomputation): raise AssertionError("enable-recompute-layers-per-pp-rank should be works with pipeline and virtual pipeline, when enabling re-computation.") @@ -587,6 +592,12 @@ def _validate_recompute_args(args): if args.recompute_granularity == "selective": raise AssertionError('--recompute-activation-function is not compatible with selective recomputation.') + if args.swap_attention and args.swap_modules is None: + if args.use_mcore_models: + args.swap_modules = "input_layernorm,self_attention,pre_cross_attn_layernorm" + else: + args.swap_modules = "input_norm,self_attention,post_attention_norm" + def _validate_high_availability(args): if args.enable_optimizer_state_local_copy and not args.enable_high_availability: @@ -792,6 +803,10 @@ def _add_dummy_args(args): args.recompute_in_bubble = False args.recompute_in_advance = False + args.moe_alltoall_overlap_comm = False + args.moe_allgather_overlap_comm = False + args.noop_layers = None + def validate_args_decorator(megatron_validate_args): @wraps(megatron_validate_args) @@ -825,7 +840,7 @@ def validate_args_decorator(megatron_validate_args): _add_dummy_args(args) from modellink.utils import print_args - print_args('ModelLink Arguments', args) + print_args('MindSpeed-LLM Arguments', args) return args return wrapper diff --git a/modellink/core/datasets/gpt_dataset.py b/modellink/core/datasets/gpt_dataset.py index 4a37d53b2a3421a9151c9e11bdf324f761d77343..dfd6118f626d07ca9ae8f0c64cc894fb85d52435 100644 --- a/modellink/core/datasets/gpt_dataset.py +++ b/modellink/core/datasets/gpt_dataset.py @@ -156,7 +156,7 @@ def _build_document_sample_shuffle_indices( ) if any(sample_index[:, 0] < 0): - _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98" + _url = "https://gitee.com/ascend/MindSpeed-LLM/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98" raise GPTDatasetSampleIndexError(f"Bad sample index. Visit {_url} for more information") numpy.save(path_to_sample_index, sample_index, allow_pickle=True) @@ -212,7 +212,7 @@ def _build_document_sample_shuffle_indices( sample_index = numpy.load(path_to_sample_index, allow_pickle=True, mmap_mode='r') if any(sample_index[:, 0] < 0): - _url = "https://gitee.com/ascend/ModelLink/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98" + _url = "https://gitee.com/ascend/MindSpeed-LLM/wikis/megatron%20data%20helpers%E5%8F%AF%E8%83%BD%E5%BC%95%E5%85%A5%E7%9A%84%E9%97%AE%E9%A2%98" raise GPTDatasetSampleIndexError(f"Bad sample index. 
Visit {_url} for more information") t_end = time.time() diff --git a/modellink/core/models/common/embeddings/rotary_pos_embedding.py b/modellink/core/models/common/embeddings/rotary_pos_embedding.py index b94b718c68c5a49c7aa0f97d8eecde43353c7142..5836616cfe578c20303cff0cb93526a46b9832d7 100644 --- a/modellink/core/models/common/embeddings/rotary_pos_embedding.py +++ b/modellink/core/models/common/embeddings/rotary_pos_embedding.py @@ -62,8 +62,12 @@ def apply_yarn_scaling(freqs: torch.Tensor): def rotary_embedding_init_wrapper(fn): @wraps(fn) def wrapper(self, *args, **kwargs): - fn(self, *args, **kwargs) _args = get_args() + if _args.rotary_base: + kwargs["rotary_base"] = _args.rotary_base + + fn(self, *args, **kwargs) + if hasattr(_args, "rope_scaling_type") and _args.rope_scaling_type == "llama3": self.inv_freq = apply_llama3_scaling(self.inv_freq) elif hasattr(_args, "rope_scaling_type") and _args.rope_scaling_type == "yarn": diff --git a/modellink/core/transformer/attention.py b/modellink/core/transformer/attention.py index 62009e3f8e35bf7ae0de49a06a8b66ae8c11d23c..96a98f90de8a8c956e580ab2fe462b5574bc82da 100644 --- a/modellink/core/transformer/attention.py +++ b/modellink/core/transformer/attention.py @@ -5,7 +5,6 @@ from functools import wraps from megatron.training import get_args from megatron.core import mpu -from megatron.core.models.common.embeddings.rotary_pos_embedding import apply_rotary_pos_emb from mindspeed.core.context_parallel.ulysses_context_parallel import UlyssesContextAttention from mindspeed.core.parallel_state import get_context_parallel_group_for_hybrid_ulysses diff --git a/modellink/core/transformer/dot_product_attention.py b/modellink/core/transformer/dot_product_attention.py index 6c760c7440dac9cc70249a37ce5c5a7729fa576a..e61ab50ad5c36d8b20cdb6006b7a3722f4b92b3f 100644 --- a/modellink/core/transformer/dot_product_attention.py +++ b/modellink/core/transformer/dot_product_attention.py @@ -116,6 +116,24 @@ def get_alibi(self, seq_length): self.alibi.alibi = alibi +def ulysses_context_parallel_forward_wrapper(fn): + """ + Do repeat KV to support GQA+Ulysses. This wrapper would be remove if mindspeed-core support ulysses+GQA. 
+ """ + @wraps(fn) + def wrapper(self, query: Tensor, key: Tensor, value: Tensor, *args, **kwargs): + heads_per_gqa_group = self.local_attn.num_attention_heads_per_partition // self.local_attn.num_query_groups_per_partition + global_args = get_args() + should_kv_repeat_before_uly = global_args.use_flash_attn and global_args.kv_head_repeat_before_uly_alltoall + + if heads_per_gqa_group > 1 and should_kv_repeat_before_uly: + key = key.repeat_interleave(heads_per_gqa_group, dim=2) + value = value.repeat_interleave(heads_per_gqa_group, dim=2) + + return fn(self, query, key, value, *args, **kwargs) + return wrapper + + def dot_product_attention_forward_wrapper(fn): @wraps(fn) def wrapper(self, query, key, value, attention_mask, attn_mask_type, packed_seq_params): @@ -135,19 +153,14 @@ def dot_product_attention_forward_wrapper(fn): args = get_args() heads_per_gqa_group = self.num_attention_heads_per_partition // self.num_query_groups_per_partition - if not args.use_flash_attn: if heads_per_gqa_group > 1: key = key.repeat_interleave(heads_per_gqa_group, dim=2) value = value.repeat_interleave(heads_per_gqa_group, dim=2) else: - # Do repeat KV to support GQA+Ulysses and PFA - should_kv_repeat_before_uly = args.context_parallel_size > 1 and \ - args.context_parallel_algo in ['ulysses_cp_algo', 'hybrid_cp_algo'] and \ - args.kv_head_repeat_before_uly_alltoall + # Do repeat KV to support PFA should_kv_repeat_before_pfa = hasattr(args, 'use_kv_cache') and args.use_kv_cache - - if heads_per_gqa_group > 1 and (should_kv_repeat_before_uly or should_kv_repeat_before_pfa): + if heads_per_gqa_group > 1 and should_kv_repeat_before_pfa: key = key.repeat_interleave(heads_per_gqa_group, dim=2) value = value.repeat_interleave(heads_per_gqa_group, dim=2) diff --git a/modellink/core/transformer/moe/moe_utils.py b/modellink/core/transformer/moe/moe_utils.py index d38492eb6b85b2b89791f69d69e3f47dcbdcfda7..c9cfb07e87fc6875cb1b7bb1190d96f115375749 100644 --- a/modellink/core/transformer/moe/moe_utils.py +++ b/modellink/core/transformer/moe/moe_utils.py @@ -60,7 +60,7 @@ def topk_softmax_with_capacity( drop_policy: str = "probs", ): """ - Migrated from megatron r0.7.0,. This would be removed after ModelLink switches to megatron r0.7.0. + Migrated from megatron r0.7.0,. This would be removed after MindSpeed-LLM switches to megatron r0.7.0. Apply capacity and padding to the top-k selection. Args: diff --git a/modellink/core/transformer/moe/token_dispatcher.py b/modellink/core/transformer/moe/token_dispatcher.py index 1371a1154ac8da1277f75285b8a74cbbe4620805..15015646116a51cff9d1f191635878b4d4af8ce7 100644 --- a/modellink/core/transformer/moe/token_dispatcher.py +++ b/modellink/core/transformer/moe/token_dispatcher.py @@ -14,7 +14,7 @@ class MoEAlltoAllTokenDispatcher(MoETokenDispatcher): """ Mainly migrated from megatron r0.7.0. for drop and pad feature, and add few optimizations controlled by args.moe_permutation_async_comm. - This would be removed after ModelLink switches to megatron r0.7.0. + This would be removed after MindSpeed-LLM switches to megatron r0.7.0. AlltoAll Based Token dispatcher. 
""" diff --git a/modellink/data/decoder_packed_mtf_dataset.py b/modellink/data/decoder_packed_mtf_dataset.py index 8b227891670c0e1564d72a87e932cfcf2b13e37e..91937d096c03446332cd5512d120ea7918e0c20c 100644 --- a/modellink/data/decoder_packed_mtf_dataset.py +++ b/modellink/data/decoder_packed_mtf_dataset.py @@ -28,6 +28,8 @@ from modellink.utils import is_rank_0 from modellink.tokenizer import build_tokenizer from modellink.data.mtf_dataset import MTFDataset, get_packed_indexed_dataset from modellink.error_utils import check_equal +from modellink.tasks.preprocess.templates import get_model_template + logger = logging.getLogger(__name__) @@ -164,14 +166,7 @@ class DecoderPackedMTFDataset(torch.utils.data.Dataset): item = self.mtf_dataset[doc_idx] if self.args.is_pairwise_dataset: - res = { - "chosen_input_ids": self._cut_token(item["chosen_input_ids"], np.int64), - "chosen_attention_mask": self._cut_token(item["chosen_attention_mask"], np.int64), - "chosen_labels": self._cut_token(item["chosen_labels"], np.int64), - "rejected_input_ids": self._cut_token(item["rejected_input_ids"], np.int64), - "rejected_attention_mask": self._cut_token(item["rejected_attention_mask"], np.int64), - "rejected_labels": self._cut_token(item["rejected_labels"], np.int64) - } + return self._cut_pairwise_token(item, np.int64) elif self.args.reset_position_ids: position_ids = self._get_reset_position_ids(torch.from_numpy(item['input_ids'])) return { @@ -181,13 +176,74 @@ class DecoderPackedMTFDataset(torch.utils.data.Dataset): "position_ids": self._cut_token(position_ids.numpy(), np.int64) } else: - res = { - "input_ids": self._cut_token(item["input_ids"], np.int64), - "attention_mask": self._cut_token(item["attention_mask"], np.int64), - "labels": self._cut_token(item["labels"], np.int64), + return self._cut_instruction_token(item, np.int64) + + def _cut_instruction_token(self, item, dtype): + IGNORE_INDEX = -100 + token_length = len(item["input_ids"]) + if token_length <= self.seq_length: + return { + "input_ids": item["input_ids"].astype(dtype), + "attention_mask": np.ones_like(item["input_ids"]).astype(dtype), + "labels": item["labels"].astype(dtype) } + template = None + # get model chat template + if hasattr(self.args, "prompt_type") and self.args.prompt_type is not None: + template = get_model_template(self.args.prompt_type) + + prompt_begin_list, prompt_end_list = get_prompt_index(item["labels"], IGNORE_INDEX) + + multi_turns = len(prompt_begin_list) + total_length = 0 + + if template is not None and template.efficient_eos: + total_length = 1 + prompt_end_list = [x - 1 for x in prompt_end_list] + eos_token_id = item["input_ids"][token_length - 1] + item["input_ids"] = item["input_ids"][:token_length] + item["labels"] = item["labels"][:token_length] + + cutoff_len = self.seq_length + input_ids = np.array([], dtype=dtype) + labels = np.array([], dtype=dtype) + + for turn_idx in range(multi_turns): + if total_length >= cutoff_len: + break + source_ids = item["input_ids"][prompt_begin_list[turn_idx]:prompt_end_list[turn_idx]] + mask_ids = item["labels"][prompt_begin_list[turn_idx]:prompt_end_list[turn_idx]] + + label_begin_idx = prompt_end_list[turn_idx] + + if turn_idx != multi_turns - 1: + target_ids = item["labels"][label_begin_idx:prompt_begin_list[turn_idx + 1]] + else: + target_ids = item["labels"][label_begin_idx:] + + source_len, target_len = _infer_seqlen(len(source_ids), len(target_ids), cutoff_len - total_length) + + source_ids = source_ids[:source_len] + target_ids = target_ids[:target_len] + mask_ids 
= mask_ids[:source_len] + + total_length += source_len + target_len + input_ids = np.concatenate((input_ids, source_ids, target_ids), axis=0) + labels = np.concatenate((labels, mask_ids, target_ids), axis=0) + + if template is not None and template.efficient_eos: + input_ids = np.concatenate((input_ids, np.array([eos_token_id], dtype=dtype)), axis=0) + labels = np.concatenate((labels, np.array([eos_token_id], dtype=dtype)), axis=0) + + res = { + "input_ids": input_ids.astype(dtype), + "attention_mask": np.ones_like(input_ids).astype(dtype), + "labels": labels.astype(dtype) + } + return res + def _cut_token(self, token, dtype): token_length = len(token) @@ -196,6 +252,70 @@ class DecoderPackedMTFDataset(torch.utils.data.Dataset): return token.astype(dtype) + def _cut_pairwise_token(self, item, dtype): + """Cut prompt and response proportionally for pairwise datasets.""" + IGNORE_INDEX = -100 + prompt_length = (item["chosen_labels"] != IGNORE_INDEX).nonzero()[0][0] + prompt_ids = item["chosen_input_ids"][:prompt_length] + chosen_ids = item["chosen_input_ids"][prompt_length:] + rejected_ids = item["rejected_input_ids"][prompt_length:] + source_len, target_len = _infer_seqlen( + len(prompt_ids), max(len(chosen_ids), len(rejected_ids)), self.seq_length + ) + prompt_ids = prompt_ids[:source_len] + chosen_ids = chosen_ids[:target_len] + rejected_ids = rejected_ids[:target_len] + + chosen_input_ids = np.append(prompt_ids, chosen_ids) + chosen_labels = np.append(IGNORE_INDEX * np.ones(source_len), chosen_ids) + rejected_input_ids = np.append(prompt_ids, rejected_ids) + rejected_labels = np.append(IGNORE_INDEX * np.ones(source_len), rejected_ids) + + res = { + "chosen_input_ids": chosen_input_ids.astype(dtype), + "chosen_attention_mask": np.ones_like(chosen_input_ids).astype(dtype), + "chosen_labels": chosen_labels.astype(dtype), + "rejected_input_ids": rejected_input_ids.astype(dtype), + "rejected_attention_mask": np.ones_like(rejected_input_ids).astype(dtype), + "rejected_labels": rejected_labels.astype(dtype) + } + + return res + + +def get_prompt_index(labels, ignored_label): + prompt_begin_list = [] + prompt_end_list = [] + in_group = False + for idx, label in enumerate(labels): + if label == ignored_label: + if not in_group: + prompt_begin_list.append(idx) + in_group = True + elif in_group: + prompt_end_list.append(idx) + in_group = False + + return prompt_begin_list, prompt_end_list + + +def _infer_seqlen(source_len: int, target_len: int, cutoff_len: int): + r""" + Computes the real sequence length after truncation by the cutoff_len. 
+ """ + if target_len * 2 < cutoff_len: # truncate source + max_target_len = cutoff_len + elif source_len * 2 < cutoff_len: # truncate target + max_target_len = cutoff_len - source_len + else: # truncate both + max_target_len = int(cutoff_len * (target_len / (source_len + target_len))) + + new_target_len = min(max_target_len, target_len) + max_source_len = max(cutoff_len - new_target_len, 0) + new_source_len = min(max_source_len, source_len) + return new_source_len, new_target_len + + def _build_index_mappings( name, data_prefix, diff --git a/modellink/patchs/megatron_patch.py b/modellink/patchs/megatron_patch.py index 35a676252bb6208b6ddf79d6aa8e4e948b1688e6..0ee13b7e53dfcd2f33cf7440705f93a92efb3db9 100644 --- a/modellink/patchs/megatron_patch.py +++ b/modellink/patchs/megatron_patch.py @@ -61,7 +61,7 @@ def get_modellink_args(): """ global _ARGS if _ARGS is None: - parser = argparse.ArgumentParser(description='ModelLink Arguments', allow_abbrev=False) + parser = argparse.ArgumentParser(description='MindSpeed-LLM Arguments', allow_abbrev=False) _ARGS, _ = process_args(parser).parse_known_args() return _ARGS @@ -128,14 +128,14 @@ def patch_fusions(): def patch_core_models(args): from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec from mindspeed.core.models.common.embeddings.rotary_pos_embedding import get_pos_emb_on_this_cp_rank - from mindspeed.core.fusions.rotary_pos_embedding import rotary_embedding_init_wrapper from ..utils import get_batch_on_this_cp_rank, get_batch_on_this_tp_rank, get_device_wrapper from ..core import rotary_embedding_forward, apply_rotary_pos_emb_bshd from ..core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec_wrapper from ..core.transformer.dot_product_attention import dot_product_attention_init_wrapper, \ - dot_product_attention_forward_wrapper + dot_product_attention_forward_wrapper, ulysses_context_parallel_forward_wrapper from ..core.transformer.attention import attention_init_wrapper from ..core.models.gpt.gpt_model import gpt_model_init_wrapper + from ..core import rotary_embedding_init_wrapper # Embedding PatchManager.register_patch('megatron.core.models.common.embeddings.rotary_pos_embedding.get_pos_emb_on_this_cp_rank', get_pos_emb_on_this_cp_rank) @@ -150,6 +150,8 @@ def patch_core_models(args): PatchManager.register_patch('megatron.core.transformer.dot_product_attention.DotProductAttention.forward', dot_product_attention_forward_wrapper) PatchManager.register_patch('megatron.core.transformer.custom_layers.transformer_engine.TEDotProductAttention.__init__', dot_product_attention_init_wrapper) PatchManager.register_patch('megatron.core.transformer.custom_layers.transformer_engine.TEDotProductAttention.forward', dot_product_attention_forward_wrapper) + # For GQA in ulysses and hybrid + PatchManager.register_patch('mindspeed.core.context_parallel.ulysses_context_parallel.UlyssesContextAttention.forward', ulysses_context_parallel_forward_wrapper) # Layer Definition # For NPU, we use local-mcore-structrue in te layer. 
@@ -169,7 +171,7 @@ def patch_core_models(args): def patch_core_transformers(args): from mindspeed.core.transformer.moe.router import aux_loss_load_balancing - from ..core import (PTNorm, topk_router_forward, topk_router_routing, z_loss_func, rotary_embedding_init_wrapper) + from ..core import (PTNorm, topk_router_forward, topk_router_routing, z_loss_func) from mindspeed.core.transformer.moe.token_dispatcher import allgather_token_permutation, allgather_token_unpermutation from mindspeed.core.transformer.moe.grouped_gemm_util import Ops, grouped_gemm_is_available, get_device_capability @@ -179,8 +181,6 @@ def patch_core_transformers(args): from ..core.transformer.transformer_layer import transformer_layer_init_wrapper PatchManager.register_patch('torch.cuda.get_device_capability', get_device_capability) - PatchManager.register_patch('megatron.core.models.common.embeddings.rotary_pos_embedding.RotaryEmbedding.__init__', - rotary_embedding_init_wrapper) PatchManager.register_patch('megatron.core.transformer.transformer_block.TENorm', PTNorm) PatchManager.register_patch('megatron.core.transformer.moe.router.TopKRouter.routing', topk_router_routing) PatchManager.register_patch('megatron.core.transformer.moe.router.TopKRouter.forward', topk_router_forward) diff --git a/modellink/tasks/checkpoint/loader_hf.py b/modellink/tasks/checkpoint/loader_hf.py index 1a6bde86fa03f91c01a04d6abd4441d6b68d999d..9e47d84bb09f5cf2435b1d71adf5989371deeab5 100644 --- a/modellink/tasks/checkpoint/loader_hf.py +++ b/modellink/tasks/checkpoint/loader_hf.py @@ -203,8 +203,6 @@ def _get_message_layer_mlp(message, model, layer_idx, md=None, tp_size=1, is_moe else: mlp_l0_bias.append(model.get_layers_mlp_linear_fc1_bias(layer_idx=layer_idx, **kwargs)) - if md.linear_bias: - mlp_l0_bias.append(model.get_layers_mlp_linear_fc1_bias(layer_idx=layer_idx, **kwargs)) # Handle gated linear units. if md.swiglu: # Concat all the first halves ('W's) and all the second halves ('V's). 
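The `_infer_seqlen` helper added to `modellink/data/decoder_packed_mtf_dataset.py` above decides how a fixed `cutoff_len` token budget is split between prompt and response before concatenation. A standalone sketch with a worked example (the function body mirrors the patch; the example lengths are illustrative, not from the test data):

```python
def _infer_seqlen(source_len: int, target_len: int, cutoff_len: int):
    """Split a cutoff_len token budget between source (prompt) and target (response)."""
    if target_len * 2 < cutoff_len:    # target is short: the source may take the rest
        max_target_len = cutoff_len
    elif source_len * 2 < cutoff_len:  # source is short: give the target the remainder
        max_target_len = cutoff_len - source_len
    else:                              # both are long: split proportionally
        max_target_len = int(cutoff_len * (target_len / (source_len + target_len)))
    new_target_len = min(max_target_len, target_len)
    new_source_len = min(max(cutoff_len - new_target_len, 0), source_len)
    return new_source_len, new_target_len

print(_infer_seqlen(600, 600, 512))  # -> (256, 256): both sides truncated proportionally
print(_infer_seqlen(100, 50, 512))   # -> (100, 50): already within budget, nothing is cut
```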
diff --git a/modellink/tasks/inference/text_generation/infer_base.py b/modellink/tasks/inference/text_generation/infer_base.py index a9ed2c1e24ec7f0c3224896e43efe952739fda39..14cf365776b118649b7e1cbb91200ecd0523076b 100644 --- a/modellink/tasks/inference/text_generation/infer_base.py +++ b/modellink/tasks/inference/text_generation/infer_base.py @@ -101,7 +101,7 @@ def task_greedy_search(args, model): if dist.get_rank() == 0: logging.info("\n=============== Greedy Search ================") - logging.info("\nYou:\n%s\n\nModelLink:\n%s", instruction, output) + logging.info("\nYou:\n%s\n\nMindSpeed-LLM:\n%s", instruction, output) logging.info("==============================================") logging.info("\nElapsed: %ss", round(time.time() - t, 2)) @@ -124,7 +124,7 @@ def task_do_sample(args, model): if dist.get_rank() == 0: logging.info("\n================ Do Sample =================") - logging.info("\nYou:\n%s\n\nModelLink:\n%s", instruction, output) + logging.info("\nYou:\n%s\n\nMindSpeed-LLM:\n%s", instruction, output) logging.info("============================================") logging.info("\nElapsed: %ss", round(time.time() - t, 2)) @@ -147,7 +147,7 @@ def task_beam_search(args, model): if dist.get_rank() == 0: logging.info("\n=============== Beam Search =================") - logging.info("\nYou:\n%s\n\nModelLink:\n%s", instruction, output) + logging.info("\nYou:\n%s\n\nMindSpeed-LLM:\n%s", instruction, output) logging.info("=============================================") logging.info("\nElapsed: %ss", round(time.time() - t, 2)) @@ -171,7 +171,7 @@ def task_beam_search_with_sampling(args, model): if dist.get_rank() == 0: logging.info("\n======== Beam Search with sampling ==========") - logging.info("\nYou:\n%s\n\nModelLink:\n%s", instruction, output) + logging.info("\nYou:\n%s\n\nMindSpeed-LLM:\n%s", instruction, output) logging.info("=============================================") logging.info("\nElapsed: %ss", round(time.time() - t, 2)) @@ -251,7 +251,7 @@ def chat_get_instruction(args, histories_no_template, histories_template, prompt def chat_print_and_update_histories(args, responses, histories_no_template, histories_template, prompt): - response_template = "\nModelLink:\n" + response_template = "\nMindSpeed-LLM:\n" output = "" if dist.get_rank() == 0: diff --git a/modellink/tasks/preprocess/data_handler.py b/modellink/tasks/preprocess/data_handler.py index f6f19c7d498bd54c809f39a19856ff3eee56c787..af40c07bfa685fd872be9017820705c8a07f06a1 100644 --- a/modellink/tasks/preprocess/data_handler.py +++ b/modellink/tasks/preprocess/data_handler.py @@ -188,6 +188,7 @@ class BaseDatasetHandler(object): for j, sentences in enumerate(batch): for k, sentence in enumerate(sentences): if (j, k) in skip_indices: + skip_num = skip_num + 1 continue total_bytes_processed += len(sentence) * np.int32().itemsize @@ -201,7 +202,7 @@ class BaseDatasetHandler(object): mbs = total_bytes_processed / elapsed / 1024 / 1024 logger.info("Processed %s documents (%s docs/s, %s MB/s).", batch_id, batch_id / elapsed, mbs) - logger.info("Skip %s sample exceeded seq-length(%s)", skip_num, self.args.seq_length) + logger.info("Skip %s sample exceeded seq-length(%s)", skip_num / len(self.args.json_keys), self.args.seq_length) for key in self.args.json_keys: builders[key].finalize(output_idx_files[key]) diff --git a/modellink/tasks/preprocess/utils.py b/modellink/tasks/preprocess/utils.py index 87e315251fb6497d50f913b5c9e17b76ece1a102..173929f7b1ef50075661b9cfc7786ba88e446789 100644 --- 
a/modellink/tasks/preprocess/utils.py +++ b/modellink/tasks/preprocess/utils.py @@ -238,7 +238,7 @@ def convert_alpaca_to_intermediate(sample: Dict[str, List[Any]], dataset_attr: " outputs = {"prompt": [], "response": [], "system": [], "tools": []} prompt = [] - if dataset_attr.history and hasattr(sample, "history") and isinstance(sample[dataset_attr.history], dict): + if dataset_attr.history and (isinstance(sample[dataset_attr.history], list) or isinstance(sample[dataset_attr.history], dict)): for old_prompt, old_response in sample[dataset_attr.history]: prompt.append({"role": Role.USER.value, "content": old_prompt}) prompt.append({"role": Role.ASSISTANT.value, "content": old_response}) diff --git a/requirements.txt b/requirements.txt index 4e4afe71d4303f552f0ff6aad5aea46c04a30dbb..1888f81e7a37e7fe79b0e6d6dde45f705f7e30f1 100644 --- a/requirements.txt +++ b/requirements.txt @@ -10,8 +10,6 @@ datasets pybind11 accelerate six -torch==2.1.0 -torchvision==0.16.0 protobuf peft==0.7.1 tiktoken \ No newline at end of file diff --git a/setup.py b/setup.py index 69bdfc5fe01c851fde54b2e6ae59c020af7eb580..de345fdfa02181bc2c95f68e1e6dd0a223749070 100644 --- a/setup.py +++ b/setup.py @@ -19,12 +19,12 @@ import sys import setuptools if sys.version_info < (3,): - raise Exception("Python 2 is not supported by ModelLink.") + raise Exception("Python 2 is not supported by MindSpeed-LLM.") -__description__ = 'ModelLink for LLMs of Ascend' +__description__ = 'MindSpeed-LLM for LLMs of Ascend' __version__ = '0.0.1' __author__ = 'Ascend' -__long_description__ = 'ModelLink for LLMs of Ascend' +__long_description__ = 'MindSpeed-LLM for LLMs of Ascend' __keywords__ = 'Ascend, langauge, deep learning, NLP' __package_name__ = 'modellink' __contact_names__ = 'Ascend' diff --git a/sources/images/logo.png b/sources/images/logo.png index e68882beafff23e533329c462fa5b8eb975fa269..2d32cf37c4869ffe5f8b58e6f0c2c744f2832f03 100644 Binary files a/sources/images/logo.png and b/sources/images/logo.png differ diff --git a/sources/images/tune_llama2_7b_ModelLink_DeepSpeed_compare.png b/sources/images/tune_llama2_7b_MindSpeed-LLM_DeepSpeed_compare.png similarity index 100% rename from sources/images/tune_llama2_7b_ModelLink_DeepSpeed_compare.png rename to sources/images/tune_llama2_7b_MindSpeed-LLM_DeepSpeed_compare.png diff --git a/sources/images/tune_qwen_7b_ModelLink_DeepSpeed_compare.png b/sources/images/tune_qwen_7b_MindSpeed-LLM_DeepSpeed_compare.png similarity index 100% rename from sources/images/tune_qwen_7b_ModelLink_DeepSpeed_compare.png rename to sources/images/tune_qwen_7b_MindSpeed-LLM_DeepSpeed_compare.png diff --git a/tests/README.md b/tests/README.md index 2fb717ababd8e23e1271e48b2a0074ac3bc52fcf..17258ef91c0456c205f54b79b49983c084b4e7ae 100644 --- a/tests/README.md +++ b/tests/README.md @@ -1,4 +1,4 @@ -## ModelLink 测试用例贡献说明 +## MindSpeed-LLM 测试用例贡献说明 ### 门禁看护列表 @@ -13,8 +13,8 @@ - - + + @@ -24,12 +24,28 @@ - + + + + + + + + + + + + + + + + + @@ -156,12 +172,14 @@ + + diff --git a/tests/__init__.py b/tests/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..9a1307aeca2e4639a1e6807acda4a07f1b9856eb --- /dev/null +++ b/tests/__init__.py @@ -0,0 +1,14 @@ +# coding=utf-8 +# Copyright (c) 2024, HUAWEI CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. \ No newline at end of file diff --git a/tests/pipeline/baichuan2-13B/test_generation.py b/tests/pipeline/baichuan2-13B/test_generation.py index 7c5ff76907aba46edaed1dc57215d55363c9388f..6b8f63982c3fb3eca337cea5069ee16e1283745e 100644 --- a/tests/pipeline/baichuan2-13B/test_generation.py +++ b/tests/pipeline/baichuan2-13B/test_generation.py @@ -12,7 +12,7 @@ from tests.test_tools.utils import build_args, create_testconfig, setup_logger from ut.inference.test_inference import acquire_context -PATTERN = r"ModelLink:\n(.*)" +PATTERN = r"MindSpeed-LLM:\n(.*)" class TestInference(DistributedTest): diff --git a/tests/pipeline/bloom-7B/test_generation.py b/tests/pipeline/bloom-7B/test_generation.py index db4816eb45d665e1de781fe27a8783472ace40a5..1c0c6208497fb685251dd3741c84bc1f18611902 100644 --- a/tests/pipeline/bloom-7B/test_generation.py +++ b/tests/pipeline/bloom-7B/test_generation.py @@ -11,7 +11,7 @@ from tests.test_tools.utils import build_args, create_testconfig, setup_logger from ut.inference.test_inference import acquire_context -PATTERN = r"ModelLink:\n(.*)" +PATTERN = r"MindSpeed-LLM:\n(.*)" class TestInference(DistributedTest): diff --git a/tests/pipeline/chatglm3-6B/test_generation.py b/tests/pipeline/chatglm3-6B/test_generation.py index c54d8359b5b95c4507e1875833d043ca5d882bf1..391dea2d9166f7f9c7bf61ad2ddd66d83489ddf7 100644 --- a/tests/pipeline/chatglm3-6B/test_generation.py +++ b/tests/pipeline/chatglm3-6B/test_generation.py @@ -11,7 +11,7 @@ from tests.test_tools.utils import build_args, create_testconfig, setup_logger from ut.inference.test_inference import acquire_context -PATTERN = r"ModelLink:\n(.*)" +PATTERN = r"MindSpeed-LLM:\n(.*)" class TestInferenceWorldSize2(DistributedTest): diff --git a/tests/pipeline/gemma-7B/test_generation.py b/tests/pipeline/gemma-7B/test_generation.py index c7de50f750ddf8022beb8a0c425976fd7b7f293e..4c3711e1d6d7eae39645643244f54ac0db6f9681 100644 --- a/tests/pipeline/gemma-7B/test_generation.py +++ b/tests/pipeline/gemma-7B/test_generation.py @@ -11,7 +11,7 @@ from tests.test_tools.utils import build_args, create_testconfig, setup_logger from ut.inference.test_inference import acquire_context -PATTERN = r"ModelLink:\n(.*)" +PATTERN = r"MindSpeed-LLM:\n(.*)" class TestInference(DistributedTest): diff --git a/tests/pipeline/qwen15-7B/test_generation.py b/tests/pipeline/qwen15-7B/test_generation.py index e1614e60336c87a10c1742b42c9e65c89ac06280..eb39eb528138739654dee1fb052bba469478a802 100644 --- a/tests/pipeline/qwen15-7B/test_generation.py +++ b/tests/pipeline/qwen15-7B/test_generation.py @@ -11,7 +11,7 @@ from tests.test_tools.utils import build_args, create_testconfig, setup_logger from ut.inference.test_inference import acquire_context -PATTERN = r"ModelLink:\n(.*)" +PATTERN = r"MindSpeed-LLM:\n(.*)" class TestInference(DistributedTest): diff --git a/tests/st/baseline_results/chatglm3_gqa_cp8.json b/tests/st/baseline_results/chatglm3_gqa_cp8.json new file mode 100644 index 0000000000000000000000000000000000000000..200958907f590eb3034188789b764eed9b451f4a --- /dev/null +++ b/tests/st/baseline_results/chatglm3_gqa_cp8.json @@ -0,0 +1,78 
@@ +{ + "lm loss": [ + 9.594278, + 9.577759, + 9.465339, + 9.230533, + 9.159581, + 9.091681, + 8.877193, + 8.776001, + 8.755611, + 8.634124, + 8.602496, + 8.540811, + 8.529873, + 8.529889, + 8.538946 + ], + "throughput": [ + 81.0, + 188.6, + 191.3, + 191.1, + 190.8, + 189.7, + 190.0, + 189.8, + 189.4, + 189.5, + 189.1, + 188.6, + 188.4, + 188.0, + 188.0 + ], + "memo info": [ + { + "rank": 0, + "allocated memory": 5386.2021484375, + "max allocated memory": 29104.03466796875 + }, + { + "rank": 1, + "allocated memory": 5386.1884765625, + "max allocated memory": 29104.03466796875 + }, + { + "rank": 2, + "allocated memory": 5386.1865234375, + "max allocated memory": 29104.03466796875 + }, + { + "rank": 3, + "allocated memory": 5386.1904296875, + "max allocated memory": 29104.03466796875 + }, + { + "rank": 4, + "allocated memory": 5386.1865234375, + "max allocated memory": 29104.03466796875 + }, + { + "rank": 5, + "allocated memory": 5386.1865234375, + "max allocated memory": 29104.03466796875 + }, + { + "rank": 6, + "allocated memory": 5386.1865234375, + "max allocated memory": 29104.03466796875 + }, + { + "rank": 7, + "allocated memory": 5386.1865234375, + "max allocated memory": 29104.03466796875 + } + ] +} diff --git a/tests/st/baseline_results/gemma2_tp8_pp1_ptd.json b/tests/st/baseline_results/gemma2_tp8_pp1_ptd.json index af751dedac79dc177ea5eb37118c983cb0aac8ec..5a6223dfb7b21a277ae6477c40329baf9794db67 100644 --- a/tests/st/baseline_results/gemma2_tp8_pp1_ptd.json +++ b/tests/st/baseline_results/gemma2_tp8_pp1_ptd.json @@ -1,37 +1,37 @@ { "lm loss": [ - 1.387083, - 1.455181, - 1.352648, - 1.309051, - 1.241118, - 1.118676, - 1.113002, - 1.067905, - 1.112425, - 1.110263, - 1.048378, - 1.055717, - 1.055748, - 1.023589, - 1.044405 + 1.385735, + 1.454418, + 1.355794, + 1.314839, + 1.248124, + 1.126178, + 1.113734, + 1.070928, + 1.117932, + 1.112961, + 1.058354, + 1.065326, + 1.060758, + 1.028861, + 1.048179 ], "throughput": [ - 43.8, - 91.9, - 92.8, - 92.9, - 92.9, - 93.1, - 93.0, - 92.6, - 92.9, - 92.9, - 92.9, - 92.9, - 93.1, - 92.8, - 93.1 + 4.5, + 19.2, + 19.3, + 19.7, + 19.5, + 19.7, + 19.9, + 20.1, + 20.4, + 20.6, + 20.1, + 19.7, + 19.7, + 20.5, + 19.9 ], "memo info": [ { diff --git a/tests/st/baseline_results/llama2_tp2_pp4_vpp2_swap.json b/tests/st/baseline_results/llama2_tp2_pp4_vpp2_swap.json new file mode 100644 index 0000000000000000000000000000000000000000..12f7f1c5a57314191431550c7eccbda50e062592 --- /dev/null +++ b/tests/st/baseline_results/llama2_tp2_pp4_vpp2_swap.json @@ -0,0 +1,78 @@ +{ + "lm loss": [ + 1.47164, + 1.46151, + 1.47107, + 1.442299, + 1.427286, + 1.406914, + 1.388342, + 1.362971, + 1.361994, + 1.286056, + 1.288965, + 1.297596, + 1.289588, + 1.286705, + 1.269324 + ], + "throughput": [ + 51.4, + 93.4, + 93.6, + 93.2, + 93.3, + 93.7, + 93.1, + 93.9, + 93.5, + 93.8, + 93.3, + 93.1, + 93.0, + 93.4, + 93.8 + ], + "memo info": [ + { + "rank": 0, + "allocated memory": 13485.05908203125, + "max allocated memory": 13735.0615234375 + }, + { + "rank": 1, + "allocated memory": 13485.05908203125, + "max allocated memory": 13735.0615234375 + }, + { + "rank": 2, + "allocated memory": 12517.05908203125, + "max allocated memory": 12689.0615234375 + }, + { + "rank": 3, + "allocated memory": 12517.05908203125, + "max allocated memory": 12689.0615234375 + }, + { + "rank": 4, + "allocated memory": 12517.05908203125, + "max allocated memory": 12689.0615234375 + }, + { + "rank": 5, + "allocated memory": 12517.05908203125, + "max allocated memory": 12689.0615234375 + }, + { + "rank": 6, 
+ "allocated memory": 13517.12451171875, + "max allocated memory": 13767.1416015625 + }, + { + "rank": 7, + "allocated memory": 13517.12451171875, + "max allocated memory": 13767.1416015625 + } + ] +} \ No newline at end of file diff --git a/tests/st/baseline_results/mixtral_tp1_pp4_ep2_drop_dpp.json b/tests/st/baseline_results/mixtral_tp1_pp4_ep2_drop_dpp.json index 8550835bd501affa7f39ee28c90c907a3376890a..89d393dcdcf4dfce572e729ea2efc50349b2b362 100644 --- a/tests/st/baseline_results/mixtral_tp1_pp4_ep2_drop_dpp.json +++ b/tests/st/baseline_results/mixtral_tp1_pp4_ep2_drop_dpp.json @@ -1,20 +1,20 @@ { "lm loss": [ - 14.14179, - 14.11266, - 13.25863, - 11.26134, - 8.678722, - 8.656157, - 7.504652, - 7.164732, - 6.634584, - 6.487458, - 6.295924, - 5.948122, - 5.95405, - 5.73897, - 5.644468 + 14.14042, + 14.11082, + 13.25441, + 11.26227, + 8.679944, + 8.709602, + 7.502835, + 7.138363, + 6.601542, + 6.470576, + 6.224497, + 5.946918, + 5.930299, + 5.696367, + 5.602035 ], "throughput": [ 10.4, diff --git a/tests/st/shell_scripts/chatglm3_gqa_cp8.sh b/tests/st/shell_scripts/chatglm3_gqa_cp8.sh new file mode 100644 index 0000000000000000000000000000000000000000..23e2bc6c217c6a68d9b0ecc325c7c40dfc799ba8 --- /dev/null +++ b/tests/st/shell_scripts/chatglm3_gqa_cp8.sh @@ -0,0 +1,118 @@ +#!/bin/bash +export CUDA_DEVICE_MAX_CONNECTIONS=1 + +NPUS_PER_NODE=8 +MASTER_ADDR=localhost +MASTER_PORT=6001 +NNODES=1 +NODE_RANK=0 +WORLD_SIZE=$((NPUS_PER_NODE*$NNODES)) + +basepath=$(cd `dirname $0`; cd ../../../; pwd) + +CKPT_SAVE_DIR=/data/ckpt +DATA_PATH=/data/chatglm3-dataset-alpaca/alpaca_text_document +TOKENIZER_PATH=/data/chatglm3-6b-base-hf/ +CKPT_LOAD_DIR=/data/chatglm3-6b-tp1-pp1-cp8/ + +TP=1 +PP=1 +CP=8 +MBS=1 +GBS=8 +SEQ_LEN=65536 +CP_ALGO=hybrid_cp_algo + +DISTRIBUTED_ARGS=" + --nproc_per_node $NPUS_PER_NODE \ + --nnodes $NNODES \ + --node_rank $NODE_RANK \ + --master_addr $MASTER_ADDR \ + --master_port $MASTER_PORT +" + +GPT_ARGS=" + --use-mcore-models \ + --transformer-impl local \ + --tensor-model-parallel-size ${TP} \ + --pipeline-model-parallel-size ${PP} \ + --sequence-parallel \ + --num-layers 2 \ + --hidden-size 4096 \ + --ffn-hidden-size 13696 \ + --num-attention-heads 32 \ + --ulysses-degree-in-cp 4 \ + --seq-length ${SEQ_LEN} \ + --micro-batch-size ${MBS} \ + --global-batch-size ${GBS} \ + --context-parallel-algo ${CP_ALGO} \ + --context-parallel-size ${CP} \ + --max-position-embeddings ${SEQ_LEN} \ + --padded-vocab-size 65024 \ + --make-vocab-size-divisible-by 1 \ + --group-query-attention \ + --num-query-groups 2 \ + --disable-bias-linear \ + --add-qkv-bias \ + --position-embedding-type rope \ + --no-rope-fusion \ + --use-distributed-optimizer \ + --use-glm-rope \ + --rotary-percent 0.5 \ + --use-flash-attn \ + --use-fused-rmsnorm \ + --use-fused-swiglu \ + --normalization RMSNorm \ + --swiglu \ + --no-create-attention-mask-in-dataloader \ + --tokenizer-type PretrainedFromHF \ + --tokenizer-name-or-path ${TOKENIZER_PATH} \ + --lr 1e-6 \ + --train-iters 15 \ + --lr-decay-style cosine \ + --untie-embeddings-and-output-weights \ + --attention-dropout 0.0 \ + --init-method-std 0.01 \ + --hidden-dropout 0.0 \ + --no-masked-softmax-fusion \ + --attention-softmax-in-fp32 \ + --min-lr 1e-8 \ + --weight-decay 1e-1 \ + --lr-warmup-fraction 0.01 \ + --clip-grad 1.0 \ + --adam-beta1 0.9 \ + --initial-loss-scale 512 \ + --adam-beta2 0.95 \ + --no-gradient-accumulation-fusion \ + --fp16 \ + --num-workers 1 \ + --kv-head-repeat-before-uly-alltoall \ + --no-shared-storage \ + --finetune \ + 
--log-throughput \ + --use-cp-send-recv-overlap \ + --overlap-grad-reduce \ + --overlap-param-gather \ +" + +DATA_ARGS=" + --data-path $DATA_PATH \ + --split 949,50,1 +" + +OUTPUT_ARGS=" + --log-interval 1 \ + --save-interval 15 \ + --eval-interval 15 \ + --eval-iters 10 \ + --no-load-optim \ + --no-load-rng \ + --save $CKPT_SAVE_DIR \ + --load $CKPT_LOAD_DIR \ +" + +torchrun $DISTRIBUTED_ARGS $basepath/pretrain_gpt.py \ + $GPT_ARGS \ + $DATA_ARGS \ + $OUTPUT_ARGS \ + --distributed-backend nccl diff --git a/tests/st/shell_scripts/gemma2_tp8_pp1_ptd.sh b/tests/st/shell_scripts/gemma2_tp8_pp1_ptd.sh index c7966c75fe16496afa74ead724bbdcde05105866..42099b519484d24c0e314f9da496f62815e047d0 100644 --- a/tests/st/shell_scripts/gemma2_tp8_pp1_ptd.sh +++ b/tests/st/shell_scripts/gemma2_tp8_pp1_ptd.sh @@ -1,5 +1,6 @@ #!/bin/bash export CUDA_DEVICE_MAX_CONNECTIONS=1 +export HCCL_DETERMINISTIC=true NPUS_PER_NODE=8 MASTER_ADDR=localhost @@ -81,6 +82,7 @@ GPT_ARGS=" --no-load-rng \ --vocab-size 256000 \ --log-throughput \ + --use-deter-comp \ --finetune \ --bf16 " diff --git a/tests/st/shell_scripts/llama2_tp2_pp4_vpp2_swap.sh b/tests/st/shell_scripts/llama2_tp2_pp4_vpp2_swap.sh new file mode 100644 index 0000000000000000000000000000000000000000..a2d725817298701c2ba267512aaae96a389a4e8a --- /dev/null +++ b/tests/st/shell_scripts/llama2_tp2_pp4_vpp2_swap.sh @@ -0,0 +1,127 @@ +#!/bin/bash +# A test case for swap attention, re-compute activation function and reuse fp32 param. + +export CUDA_DEVICE_MAX_CONNECTIONS=1 +export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True + +NPUS_PER_NODE=8 +MASTER_ADDR=localhost +MASTER_PORT=6079 +NNODES=1 +NODE_RANK=0 +WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES)) + +basepath=$(cd `dirname $0`; cd ../../../; pwd) + +CKPT_SAVE_DIR=/data/ckpt +CKPT_LOAD_DIR=/data/ci/llama-2-7b-mg-tp2-pp4-mcore-vpp2-test +DATA_PATH=/data/pretrain_dataset/alpaca_text_document +TOKENIZER_MODEL=/data/llama-2-7b-hf/tokenizer.model +TP=2 +PP=4 +VPP=2 + +DISTRIBUTED_ARGS=( + --nproc_per_node $NPUS_PER_NODE + --nnodes $NNODES + --node_rank $NODE_RANK + --master_addr $MASTER_ADDR + --master_port $MASTER_PORT +) + + +ACCELERATE_ARGS=( + --recompute-activation-function + --recompute-num-layers 1 + --swap-attention + --reuse-fp32-param + --enable-recompute-layers-per-pp-rank +) + + +DIST_ALGO=( + --tensor-model-parallel-size ${TP} + --pipeline-model-parallel-size ${PP} + --num-layers-per-virtual-pipeline-stage ${VPP} + --sequence-parallel +) + + +MODEL_ARGS=( + --use-mcore-models + --transformer-impl local + --num-layers 32 + --hidden-size 4096 + --ffn-hidden-size 11008 + --num-attention-heads 32 + --seq-length 4096 + --max-position-embeddings 4096 +) + +TRAINING_ARGS=( + --tokenizer-type Llama2Tokenizer + --tokenizer-model ${TOKENIZER_MODEL} + --micro-batch-size 1 + --global-batch-size 32 + --make-vocab-size-divisible-by 1 + --lr 1.25e-6 + --train-iters 15 + --lr-decay-style cosine + --untie-embeddings-and-output-weights + --disable-bias-linear + --attention-dropout 0.0 + --init-method-std 0.01 + --hidden-dropout 0.0 + --position-embedding-type rope + --normalization RMSNorm + --use-fused-rmsnorm + --swiglu + --use-flash-attn + --no-masked-softmax-fusion + --attention-softmax-in-fp32 + --min-lr 1.25e-7 + --weight-decay 1e-1 + --lr-warmup-fraction 0.01 + --clip-grad 1.0 + --adam-beta1 0.9 + --initial-loss-scale 65536 + --adam-beta2 0.95 + --no-gradient-accumulation-fusion + --no-load-optim + --no-load-rng + --use-fused-swiglu + --use-fused-rotary-pos-emb + --overlap-grad-reduce + --bf16 + 
--use-distributed-optimizer +) + +DATA_ARGS=( + --data-path $DATA_PATH + --split 949,50,1 +) + +OUTPUT_ARGS=( + --log-interval 1 + --save-interval 10000 + --eval-interval 1000 + --eval-iters 1 + --no-load-optim + --no-load-rng + --no-save-optim + --no-save-rng + --load ${CKPT_LOAD_DIR} + --save ${CKPT_SAVE_DIR} +) + + +torchrun ${DISTRIBUTED_ARGS[@]} $basepath/pretrain_gpt.py \ + ${DIST_ALGO[@]} \ + ${MODEL_ARGS[@]} \ + ${TRAINING_ARGS[@]} \ + ${ACCELERATE_ARGS[@]} \ + ${DATA_ARGS[@]} \ + ${OUTPUT_ARGS[@]} \ + --finetune \ + --log-throughput \ + --distributed-backend nccl diff --git a/tests/test_tools/__init__.py b/tests/test_tools/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..9a1307aeca2e4639a1e6807acda4a07f1b9856eb --- /dev/null +++ b/tests/test_tools/__init__.py @@ -0,0 +1,14 @@ +# coding=utf-8 +# Copyright (c) 2024, HUAWEI CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. \ No newline at end of file diff --git a/tests/ut/dist_algo/context_parallel/test_ulysses_context_parallel.py b/tests/ut/dist_algo/context_parallel/test_ulysses_context_parallel.py index eb022742786929dc99cb30c4d58f316d3a44f19f..ddc9980c562aa06410a63bbfc753d516c51ebcc7 100644 --- a/tests/ut/dist_algo/context_parallel/test_ulysses_context_parallel.py +++ b/tests/ut/dist_algo/context_parallel/test_ulysses_context_parallel.py @@ -34,6 +34,8 @@ class FlashSelfAttention(torch.nn.Module): self.causal = causal self.softmax_scale = softmax_scale self.dropout_p = attention_dropout + self.num_attention_heads_per_partition = 1 + self.num_query_groups_per_partition = 1 def forward(self, q, k, v, attention_mask, head_num): """Implements the multihead softmax attention. diff --git a/tests/ut/evaluation/test_evaluate.py b/tests/ut/evaluation/test_evaluate.py index 8e5399cd88497eb3bce5a1f80855bd27e9e3bbeb..b3b8f662b72be1d80e97525ce4c5245d4b3218e6 100644 --- a/tests/ut/evaluation/test_evaluate.py +++ b/tests/ut/evaluation/test_evaluate.py @@ -86,7 +86,7 @@ class TestEvaluate(DistributedTest): print(log_capture) expected_score = acquire_score(log_capture) - assert math.isclose(expected_score, 0.5333, abs_tol=1e-2), f"score {expected_score}, forward pass has been changed, check it!" + assert math.isclose(expected_score, 0.5666, abs_tol=1e-2), f"score {expected_score}, forward pass has been changed, check it!" 
@pytest.mark.parametrize("params", test_config["test_qwen_prompt_ceval_evaluate"]) def test_qwen_prompt_ceval_evaluate(self, build_args, params): diff --git a/tests/ut/inference/test_inference.py b/tests/ut/inference/test_inference.py index ab633f005c741f9bc4d79234f147609a1fb8cabb..ab0136263da5f036d4a306961ec54dc2655f61fd 100644 --- a/tests/ut/inference/test_inference.py +++ b/tests/ut/inference/test_inference.py @@ -26,13 +26,13 @@ from tests.test_tools.dist_test import DistributedTest from tests.test_tools.utils import build_args, create_testconfig, setup_logger -PATTERN = r"ModelLink:\n(.*)" +PATTERN = r"MindSpeed-LLM:\n(.*)" def acquire_context(log_capture): # Acquire the final score for evaluation tasks, still universal. context_str = log_capture[0] - context_pattern = r"ModelLink:\s*(.*?)(?=\n|$)" + context_pattern = r"MindSpeed-LLM:\s*(.*?)(?=\n|$)" match = re.search(context_pattern, context_str) if match: context = match.group(1) @@ -77,7 +77,7 @@ class TestInference(DistributedTest): print(log_capture) context = acquire_context(log_capture) assert [context] == [ - "I'm doing well. I'm in the middle of a 3-day weekend, so I'm enjoying that." + "I'm doing well. I'm in the middle of a 3-day weekend, so I'm enjoying the extra" ], "forward pass has been changed, check it!" @pytest.mark.parametrize("params", test_config["test_lora_greedy_search"]) @@ -92,5 +92,5 @@ class TestInference(DistributedTest): print(log_capture) context = acquire_context(log_capture) assert [context] == [ - "I'm doing well. I'm in the middle of a 3-day weekend, so I'm enjoying the extra time off." - ], "forward pass has been changed, check it!" \ No newline at end of file + "I'm doing well. I'm in the middle of a 3-day weekend, so I'm enjoying that." + ], "forward pass has been changed, check it!" 
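The `PATTERN`/`context_pattern` updates above track the renamed response marker that `infer_base.py` now logs. A minimal sketch of how `acquire_context` isolates the model reply from the captured log (the log text here is illustrative):

```python
import re

# The non-greedy group stops at the first newline after the renamed marker.
context_pattern = r"MindSpeed-LLM:\s*(.*?)(?=\n|$)"
log = ("You:\nhow are you?\n\n"
       "MindSpeed-LLM:\nI'm doing well. I'm in the middle of a 3-day weekend, "
       "so I'm enjoying that.\n")

match = re.search(context_pattern, log)
print(match.group(1) if match else None)
# -> I'm doing well. I'm in the middle of a 3-day weekend, so I'm enjoying that.
```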
diff --git a/tests/ut/model_module/embeddings/test_rotary_pos_embedding.py b/tests/ut/model_module/embeddings/test_rotary_pos_embedding.py index d801d4a30c99a2d2c6570fcb0a5183c2206ab3f0..4226c5fed6988d85a208da3855d67efbd3132aa2 100644 --- a/tests/ut/model_module/embeddings/test_rotary_pos_embedding.py +++ b/tests/ut/model_module/embeddings/test_rotary_pos_embedding.py @@ -17,7 +17,6 @@ from types import SimpleNamespace from pathlib import Path import pytest import torch -import mindspeed import modellink from tests.test_tools.dist_test import create_testconfig from megatron.core.models.common.embeddings.rotary_pos_embedding import RotaryEmbedding @@ -27,13 +26,16 @@ class TestRotaryPosEmbedding: test_config = create_testconfig(Path(__file__).with_suffix(".json")) @pytest.fixture - def mock_dependency(self, request, monkeypatch): - monkeypatch.setattr(modellink.core.models.common.embeddings.rotary_pos_embedding, "get_args", - lambda : SimpleNamespace(use_glm_rope=request.getfixturevalue("chatglm"), - rope_scaling_type = None, - )) - monkeypatch.setattr(mindspeed.core.fusions.rotary_pos_embedding, "get_args", - lambda : SimpleNamespace(rotary_base = request.getfixturevalue("rotary_base"))) + def mock_dependency(self, request): + # init test name space + def get_test_namespace(): + test_name_space = SimpleNamespace() + test_name_space.use_glm_rope = request.getfixturevalue("chatglm") + test_name_space.rope_scaling_type = None + test_name_space.rotary_base = request.getfixturevalue("rotary_base") + return test_name_space + # set up name space function + setattr(modellink.core.models.common.embeddings.rotary_pos_embedding, "get_args", get_test_namespace) @pytest.mark.parametrize("rotary_param, chatglm, rotary_base, seq, expected", test_config["test_rotary_pos_embedding"]) def test_rotary_pos_embedding(self, mock_dependency, rotary_param, chatglm, rotary_base, seq, expected): diff --git a/tests/ut/process_data/test_process_instruction_data_lf.json b/tests/ut/process_data/test_process_instruction_data_lf.json index 75a8498967e7a5baa858119ae4ce317baea5310a..136b372887581fa5c850d945253e15a82278ee3b 100644 --- a/tests/ut/process_data/test_process_instruction_data_lf.json +++ b/tests/ut/process_data/test_process_instruction_data_lf.json @@ -8,6 +8,7 @@ "output-prefix": "/data/tune_dataset/alpaca/alpaca", "tokenizer-name-or-path": "/data/qwen-7b/", "workers": 4, + "overwrite-cache": null, "log-interval": 1000, "prompt-type": "qwen" } @@ -22,10 +23,26 @@ "output-prefix": "/data/tune_dataset/alpaca_his/alpaca_his", "tokenizer-name-or-path": "/data/qwen-7b/", "workers": 4, + "overwrite-cache": null, "log-interval": 1000, "prompt-type": "qwen", "map-keys": "{\"history\":\"history\"}" } + }, + { + "params": { + "input": "/data/tune_dataset/oaast_sft.json", + "tokenizer-type": "PretrainedFromHF", + "handler-name": "AlpacaStyleInstructionHandler", + "output-prefix": "/data/tune_dataset/alpaca_his/alpaca_his_seq1024", + "tokenizer-name-or-path": "/data/qwen-7b/", + "workers": 4, + "overwrite-cache": null, + "log-interval": 1000, + "seq-length" : 1024, + "prompt-type": "qwen", + "map-keys": "{\"history\":\"history\"}" + } } ], "test_sharegpt_dataset": [ @@ -37,6 +54,7 @@ "output-prefix": "/data/tune_dataset/sharegpt/sharegpt", "tokenizer-name-or-path": "/data/qwen-7b/", "workers": 4, + "overwrite-cache": null, "log-interval": 1000, "prompt-type": "qwen", "map-keys": "{\"system\":\"system_prompt\"}" @@ -52,6 +70,7 @@ "output-prefix": "/data/tune_dataset/sharegpt/sharegpt", "tokenizer-name-or-path": 
"/data/qwen-7b/", "workers": 4, + "overwrite-cache": null, "log-interval": 1000, "prompt-type": "qwen", "map-keys": "{\"messages\":\"messages\", \"tags\": {\"role_tag\": \"role\", \"content_tag\": \"content\", \"user_tag\": \"user\", \"assistant_tag\": \"assistant\", \"system_tag\": \"system\"}}" diff --git a/tests/ut/process_data/test_process_instruction_data_lf.py b/tests/ut/process_data/test_process_instruction_data_lf.py index 11f6bca625c5951bb2f12dd04b4a87304f314a92..acec6fc634ef9faf59af4db0b722bc3dfb45a81d 100644 --- a/tests/ut/process_data/test_process_instruction_data_lf.py +++ b/tests/ut/process_data/test_process_instruction_data_lf.py @@ -1,6 +1,9 @@ import os +import contextlib +import io from pathlib import Path import pytest +import logging import modellink from tests.test_tools.utils import build_args, create_testconfig, compare_file_md5_same from preprocess_data import main @@ -15,7 +18,7 @@ class TestProcessInstructionDataLf: @pytest.mark.parametrize("params, base_path", [ (test_config["test_alpaca_dataset"][0], "/data/tune_dataset/Llamafactoryhandler/alpaca/alpaca"), - (test_config["test_alpaca_history_dataset"][0], "/data/tune_dataset/Llamafactoryhandler/alpaca_history/alpaca_history"), + (test_config["test_alpaca_history_dataset"][0], "/data/tune_dataset/Llamafactoryhandler/alpaca_history/alpaca_history_new"), (test_config["test_sharegpt_dataset"][0], "/data/tune_dataset/Llamafactoryhandler/sharegpt/sharegpt_lf"), (test_config["test_openai_dataset"][0], "/data/tune_dataset/Llamafactoryhandler/openai/sss") ]) @@ -54,3 +57,62 @@ class TestProcessInstructionDataLf: base_file = base_path + end_str test_file = params["output-prefix"] + end_str assert compare_file_md5_same(base_file, test_file) + + + @pytest.mark.parametrize("params, base_path", + [ + (test_config["test_alpaca_history_dataset"][1], "/data/tune_dataset/Llamafactoryhandler/alpaca_history/alpaca_history_seq1024"), + ]) + def test_skip_num(self, build_args, params, base_path): + """ + Tests skip_num in preprocessing and validates output files by comparing MD5 checksums. + + Parameters: + - params: dict + A dictionary containing dataset-specific configurations, such as input files, + output prefix, and tokenizer information. Extracted from `test_config`. + - base_path: str + The base path of the reference dataset files (e.g., Alpaca, Alpaca History, ShareGPT, OpenAI). + Used to locate the ground truth files for comparison with the generated output. 
+ """ + # create output dir if it doesn't exist + out_dir = os.path.dirname(params["output-prefix"]) + if not os.path.isdir(out_dir): + os.makedirs(out_dir) + + # run the main preprocessing function + log_capture_string = io.StringIO() + # run the main preprocessing function + log_handler = logging.StreamHandler(log_capture_string) + log_handler.setLevel(logging.INFO) + + formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s') + log_handler.setFormatter(formatter) + logger = logging.getLogger() + logger.addHandler(log_handler) + main() + output = log_capture_string.getvalue() + assert("Skip " in output and " sample exceeded seq-length" in output) + + index1 = output.find("Skip ") + index2 = output.find(" sample exceeded seq-length") + skip_num = output[index1 + 5: index2] + assert(skip_num == "796.0") + logger.removeHandler(log_handler) + log_capture_string.close() + + # print dataset name for clarity + dataset_name = base_path.split('/')[-1] + print(f"=============== test_{dataset_name}_dataset =============") + + prefix_str = params["output-prefix"].split('/')[-1] + mid_strs = ["_packed_attention_mask_document", "_packed_input_ids_document", "_packed_labels_document"] + end_suffixs = [".bin", ".idx"] + + # loop through mid_strs and end_suffixs, checking file MD5 hashes + for mid_str in mid_strs: + for end_suffix in end_suffixs: + end_str = mid_str + end_suffix + base_file = base_path + end_str + test_file = params["output-prefix"] + end_str + assert compare_file_md5_same(base_file, test_file) \ No newline at end of file
tests/README.md 门禁看护列表(可恢复的行):

| 模式 | 结构 | 特性 | 脚本 / 用例 | Loss | Perf. | Mem. |
|------|------|------|-------------|------|-------|------|
| ST Pretrain | Mcore | TP, PP, VPP, recompute, enable_recompute_layers_per_pp_rank | llama2_tp2_pp4_vpp2.sh | Y | Y | Y |
| ST Pretrain | Mcore | cp_ring, distributed optimizer, reuse_fp32_param, recompute_activation_function, fused_rmsnorm, fused_swiglu, fused_rope, overlap_grad_reduce, overlap_param_gather | llama2_tp2_cp4_mem_recompute.sh | Y | Y | Y |
| ST Pretrain | Mcore | cp_hybrid, gqa | chatglm3_gqa_cp8.sh | Y | Y | Y |
| ST Pretrain | Mcore | swap_attention, recompute_activation_function, enable_recompute_layers_per_pp_rank, reuse_fp32_param | llama2_tp2_pp4_vpp2_swap.sh | Y | Y | Y |
| ST Pretrain | Mcore | glm_rope, rotary_percent | … | | | |
| UT | | ring_attn | test_ringattn_context_parallel.py | Y | | |
| UT | | ulysses | test_ulysses_context_parallel.py | Y | | |
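The `chatglm3_gqa_cp8.sh` case exercises GQA together with hybrid context parallelism, which is what `ulysses_context_parallel_forward_wrapper` enables: KV heads are replicated before the Ulysses all-to-all when `--kv-head-repeat-before-uly-alltoall` is set with flash attention. A minimal sketch of the repeat step (the `[seq, batch, heads, head_dim]` layout and head counts follow the chatglm3 script; the tensors are random placeholders):

```python
import torch

# chatglm3_gqa_cp8.sh: --num-attention-heads 32, --num-query-groups 2
heads_per_gqa_group = 32 // 2

key = torch.randn(8, 1, 2, 128)    # [seq, batch, num_query_groups, head_dim]
value = torch.randn(8, 1, 2, 128)

# Replicate each KV group so every query head has a matching KV head before
# the Ulysses all-to-all redistributes heads across context-parallel ranks.
key = key.repeat_interleave(heads_per_gqa_group, dim=2)
value = value.repeat_interleave(heads_per_gqa_group, dim=2)

print(key.shape)  # torch.Size([8, 1, 32, 128])
```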