diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md
index 2cb8191853368778180eb11533fdaffb5729e8b6..eea984d60307c9cdd4b4dd793bcfc9c932b41609 100644
--- a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md
@@ -1,477 +1,134 @@
-
-

-
-
-
-
-
-[ English | 中文 | 日本語 ]
-
-OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers:
-
-- **Simple and easy to use**: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, and seamlessly compatible with Huggingface models and datasets.
-- **High performance**: RLHF training spends 80% of the time on the sample generation stage. Thanks to the ability to use a large inference batch size with Ray and Packing Samples and vLLM generation acceleration, the performance of OpenRLHF 3~4x+ that of Optimized DeepSpeedChat with Hybrid Engine.
-- **Distributed RLHF**: OpenRLHF distribute the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models with multiple A100 80G GPUs and vLLM and 7B models across multiple 24GB RTX 4090 GPUs.
-- **PPO Implementation Optimization**: We integrated the implementation tricks for PPO to improve the training stability, referencing [Zhihu](https://zhuanlan.zhihu.com/p/622134699) and the [Notion blog](https://hijkzzz.notion.site/rlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361).
-
-More details are in [Slides](https://docs.google.com/presentation/d/1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk/edit?usp=sharing) | [Technical Report](https://arxiv.org/abs/2405.11143) | [Documents](https://openrlhf.readthedocs.io/)
-
-## News
-- [2025/1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https://github.com/hkust-nlp/simpleRL-reason)
-- [2024/12] We "proposed" 😊 the [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS).
-- [2024/12] We analyzed the PPO, REINFORCE++, GRPO and RLOO in the [Notion Blogpost](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05).
-
-
-## Features
-
-- Distributed [PPO](./examples/scripts/train_ppo_llama_ray.sh) and [REINFORCE++/RLOO](./examples/scripts/train_reinforce_llama_ray.sh) implementations based on Ray.
-- Full RLHF fine-tuning support for models with [over 70 billion parameters](./examples/scripts/train_ppo_llama_ray_70b.sh).
-- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`).
-- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`).
-- Implementation of [DPO (Direct Preference Optimization)/IPO/cDPO](./examples/scripts/train_dpo_llama.sh) and [Kahneman-Tversky Optimization (KTO)](./examples/scripts/train_kto_llama.sh).
-- Support for [Iterative DPO](./examples/scripts/train_iterative_dpo_llama.sh) ([GitHub: Online-RLHF](https://github.com/RLHFlow/Online-RLHF)).
-- Support for [Rejection Sampling](./examples/scripts/train_rejection_sampling_llama.sh).
-- Implementation of [Conditional SFT](./examples/scripts/train_conditional_llama.sh) ([arXiv:2308.12050](https://arxiv.org/abs/2308.12050)).
-- Support for [Knowledge Distillation](./examples/scripts/train_knowledge_distillation.sh) ([Microsoft: minillm](https://github.com/microsoft/LMOps/tree/main/minillm)).
-- Integration of [Process Reward Model (PRM)](./examples/scripts/train_prm_mistral.sh).
-- Packing of training samples for SFT, DPO, RM, PRM, and PPO (`--packing_samples`).
-- Implementation of [RingAttention](./examples/scripts/train_dpo_ring_llama.sh) (`--ring_attn_size`, `--ring_head_stride`).
-- Support for [Mixture of Experts (MoE)](./examples/test_scripts/train_sft_mixtral_lora.sh) (`--aux_loss_coef`).
-- Integration of FlashAttention2 (`--flash_attn`).
-- Support for QLoRA (`--load_in_4bit`) and [LoRA](./examples/scripts/train_sft_mixtral_lora.sh) (`--lora_rank`, `--target_modules`).
-- Compatibility with HuggingFace's `tokenizer.apply_chat_template` for datasets (`--apply_chat_template` and `--input_key`).
-- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`).
-- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`).
-- Provided multi-node training scripts, such as [DPO](./examples/scripts/train_llama_slurm.sh) and [Ray PPO](./examples/scripts/train_ppo_llama_ray_slurm.sh).
-
-
-### PPO Support Matrix
-
-| Feature | OpenRLHF | DSChat | CAIChat | TRL |
-| ------------- |:-------------:| :-------------:| :-------------:| :-------------:|
-| 70B+ Full Tuning with 16 A100-80GB | ✅ | ❌ | ❌ | ❌ |
-| 7B Full Tuning with 4 RTX4090 | ✅ | ❌ | ❌ | ❌ |
-| 34B DPO Full Tuning with 8 A100-80GB | ✅ | ❌ | ❌ | ❌ |
-| Inference Engine in PPO | ✅ | ✅ | ❌ | ❌ |
-| PPO Implementation Tricks | ✅ | ❌ | ❌ | ✅ |
-| Support QLoRA | ✅ | ❌ | ❌ | ✅ |
-| Support Mixtral 8*7b | ✅ | ❌ | ❌ | ❌ |
-| Support Unmerged Actor-Critic | ✅ | ✅ | ✅ | ❌ |
-| Support Multiple Reward Models | ✅ | ❌ | ❌ | ❌ |
-| Support Huggingface Models | ✅ | ✅ | ✅ | ✅ |
-| Easy-to-use | ✅ | ❌ (HybridEngine bugs) | ✅ | ✅ |
-
-
-## Quick Start
-
-### Installation
-
-To use OpenRLHF, first launch the docker container (**Recommended**) and `pip install` openrlhf inside the docker container:
-
-```bash
-# Launch the docker container
-docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.07-py3 bash
-sudo pip uninstall xgboost transformer_engine flash_attn -y
-
-# pip install
-pip install openrlhf
-
-# If you want to use vLLM acceleration (Install vLLM 0.6.5)
-pip install openrlhf[vllm]
-# latest vLLM is also supported
-pip install openrlhf[vllm_latest]
-
-# pip install the latest version
-pip install git+https://github.com/OpenRLHF/OpenRLHF.git
-
-# Or git clone
-git clone https://github.com/OpenRLHF/OpenRLHF.git
-cd OpenRLHF
-pip install -e .
-```
-
-> [!NOTE]
->We recommend using vLLM 0.6.4 or higher. Other versions (vLLM >= 0.4.2) may require weight synchronization via Gloo (`--vllm_sync_backend gloo`).
->We also provided the [Dockerfiles for vLLM](./dockerfile/) and [One-Click Installation Script of Nvidia-Docker](./examples/scripts/nvidia_docker_install.sh).
-
-### Prepare Datasets
-OpenRLHF provides multiple data processing methods in our dataset classes.
-Such as in the [Prompt Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/prompts_dataset.py#L6):
-
-```python
-def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str:
- if apply_chat_template:
- chat = data[input_key]
- if isinstance(chat, str):
- chat = [{"role": "user", "content": chat}]
- prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
- else:
- prompt = data[input_key]
- if input_template:
- prompt = input_template.format(prompt)
- return prompt
-```
-
-- We can use `--input_key` to specify the `JSON key name` of the input datasets `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`, and use `--apply_chat_template` to utilize the `chat_template` from the [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main/en/chat_templating).
-- If you don't want to use `--apply_chat_template`, you can use `--input_template` instead, or preprocess the datasets offline in advance.
-- OpenRLHF also support mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`.
-
-How Chat Templating Works:
-
-```python
-dataset = [{"input_key": [
- {"role": "user", "content": "Hello, how are you?"},
- {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
- {"role": "user", "content": "I'd like to show off how chat templating works!"},
-]}]
-
-tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False)
-
-"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]"
-```
-
-How to specify training and test datasets ?
-
-You can specify it using the `data_type@data_dir` format. For example, the dataset can be set as `--dataset json@./data`.
-
-```
-data
-├── test.jsonl
-└── train.jsonl
-```
-
-> [!NOTE]
-> By default, we use `train` and `test` as splits to distinguish training and testing datasets from Huggingface.
-> The ``JSON key`` options depends on the specific datasets. See [Reward Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py#L10) and [SFT Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/sft_dataset.py#L9)
-
-### Supervised Fine-tuning
-
-OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain {name or path}`, `--reward_pretrain {name or path}` and `--critic_pretrain {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https://huggingface.co/OpenRLHF).
-
-Then you can use the startup scripts we provide in the [examples/scripts](./examples/scripts/) directory, or start the training using the following commands.
-
-```bash
-deepspeed --module openrlhf.cli.train_sft \
- --max_len 4096 \
- --dataset Open-Orca/OpenOrca \
- --input_key question \
- --output_key response \
- --input_template $'User: {}\nAssistant: ' \
- --train_batch_size 256 \
- --micro_train_batch_size 2 \
- --max_samples 500000 \
- --pretrain meta-llama/Meta-Llama-3-8B \
- --save_path ./checkpoint/llama3-8b-sft \
- --save_steps -1 \
- --logging_steps 1 \
- --eval_steps -1 \
- --zero_stage 2 \
- --max_epochs 1 \
- --packing_samples \
- --bf16 \
- --flash_attn \
- --learning_rate 5e-6 \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-# Support HF tokenizer.apply_chat_template
-# --apply_chat_template
-# --tokenizer_chat_template {HF Chat Template}
-
-# Support RingAttention
-# pip install ring_flash_attn
-# --ring_attn_size 2 \
-# --ring_head_stride 2 \
-
-# Multi-turn fine-tuning loss
-# --multiturn
-
-# Can also be used for continued pre-training
-# --pretrain_mode
-```
-
-> [!NOTE]
-> OpenRLHF SFT/DPO/RewardModel/PPO trainers support `--packing_samples` [based on `--flash_attn`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing)
-
-
-### Reward Model Training
-```bash
-deepspeed --module openrlhf.cli.train_rm \
- --save_path ./checkpoint/llama3-8b-rm \
- --save_steps -1 \
- --logging_steps 1 \
- --eval_steps -1 \
- --train_batch_size 256 \
- --micro_train_batch_size 1 \
- --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
- --bf16 \
- --max_epochs 1 \
- --max_len 8192 \
- --zero_stage 3 \
- --learning_rate 9e-6 \
- --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
- --apply_chat_template \
- --chosen_key chosen \
- --rejected_key rejected \
- --flash_attn \
- --packing_samples \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-```
-
-It is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`:
-
-```python
-reward_model = AutoModelForSequenceClassification.from_pretrained(
- reward_model_path,
- num_labels=1,
- torch_dtype=torch.bfloat16,
- attn_implementation="flash_attention_2",
- use_cache=False,
- )
-inputs = xxxx (Left Padding Input Tokens)
-reward = reward_model.model(*inputs).last_hidden_state
-reward = reward_model.score(reward)[:, -1]
-```
-
-### PPO without Ray
-
-```bash
-deepspeed --module openrlhf.cli.train_ppo \
- --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
- --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
- --save_path ./checkpoint/llama-3-8b-rlhf \
- --save_steps -1 \
- --logging_steps 1 \
- --eval_steps -1 \
- --micro_train_batch_size 2 \
- --train_batch_size 128 \
- --micro_rollout_batch_size 4 \
- --rollout_batch_size 1024 \
- --max_epochs 1 \
- --prompt_max_len 1024 \
- --generate_max_len 1024 \
- --zero_stage 2 \
- --bf16 \
- --actor_learning_rate 5e-7 \
- --critic_learning_rate 9e-6 \
- --init_kl_coef 0.01 \
- --prompt_data OpenRLHF/prompt-collection-v0.1 \
- --input_key context_messages \
- --apply_chat_template \
- --max_samples 100000 \
- --normalize_reward \
- --adam_offload \
- --flash_attn \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-# Support remote reward model (HTTP)
-# --remote_rm_url http://localhost:5000/get_reward
-```
-
-### PPO/REINFORCE++ with Ray and vLLM
-
-To improve RLHF training speed or support 70B models, we can use the PPO with Ray and vLLM acceleration
-
-```bash
-# launch the master node of ray in container
-ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
-
-# if you want to launch ray on more nodes, use
-ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8
-
-ray job submit --address="http://127.0.0.1:8265" \
- --runtime-env-json='{"working_dir": "/openrlhf"}' \
- -- python3 -m openrlhf.cli.train_ppo_ray \
- --ref_num_nodes 1 \
- --ref_num_gpus_per_node 2 \
- --reward_num_nodes 1 \
- --reward_num_gpus_per_node 2 \
- --critic_num_nodes 1 \
- --critic_num_gpus_per_node 2 \
- --actor_num_nodes 1 \
- --actor_num_gpus_per_node 2 \
- --vllm_num_engines 2 \
- --vllm_tensor_parallel_size 2 \
- --colocate_critic_reward \
- --colocate_actor_ref \
- --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
- --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
- --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \
- --micro_train_batch_size 8 \
- --train_batch_size 128 \
- --micro_rollout_batch_size 16 \
- --rollout_batch_size 1024 \
- --max_samples 100000 \
- --max_epochs 1 \
- --prompt_max_len 1024 \
- --generate_max_len 1024 \
- --zero_stage 3 \
- --bf16 \
- --actor_learning_rate 5e-7 \
- --critic_learning_rate 9e-6 \
- --init_kl_coef 0.01 \
- --prompt_data OpenRLHF/prompt-collection-v0.1 \
- --input_key context_messages \
- --apply_chat_template \
- --normalize_reward \
- --packing_samples \
- --adam_offload \
- --flash_attn \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-# Support REINFORCE++ | RLOO
-# --advantage_estimator reinforce | rloo
-
-# Support remote reward model (HTTP)
-# --remote_rm_url http://localhost:5000/get_reward
-
-# Support N samples
-# --n_samples_per_prompt 4
-```
-> [!NOTE]
-> Do not set `--vllm_num_engines` means not using the vLLM engine.
-> You can also use ``setup_commands`` to let Ray automatically deploy the environment, such as `--runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'`.
-
-> [!NOTE]
-> RLOO in OPENRLHF is a modification based on REINFORCE++, differing from the original version.
-
-> [!NOTE]
-> If you you encounter an error related to index out of range when deepspeed sets up the GPU devices, you can try to set the environment variable [`RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES`](openrlhf/trainer/ray/utils.py) as a workaround.
-> ```bash
-> # For NVIDIA GPUs:
-> export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
-> ```
-
-The launch scripts and documents for supported algorithms are in [example/scripts](./examples/scripts/) and [Documents - Usage](https://openrlhf.readthedocs.io/en/latest/usage.html)
-
-### LoRA
-If you use `LoRA (Low-Rank Adaptation)`, `OpenRLHF` will not save the full weights by default instead of `LoRA Adapter`. To continue in your task normally, you should combine the `Adapter` with weights of your base model
-
-```bash
-python -m openrlhf.cli.lora_combiner \
- --model_path meta-llama/Meta-Llama-3-8B \
- --lora_path ./checkpoint/llama3-8b-rm \
- --output_path ./checkpoint/llama-3-8b-rm-combined \
- --is_rm \
- --bf16
-```
-
-## Performance
-
-We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF:
-
-| **Size** | **NVIDIA A800-80GB GPUs** | **Optimized DSChat (with Hybrid Engine)** | **OpenRLHF** | **Speedup** |
-| :---: | :---: | :---: | :---: | :---: |
-| 7B | 16 | 855.09 | 471.11 | 1.82x |
-| 13B | 32 | 1528.93 | 608.93 | 2.5x |
-| 34B | 32 | 3634.98 | 1526.4 | 2.4x |
-| 70B | 32 | 10407.0 | 4488.53 | 2.3x |
-
-> [!NOTE]
-> The data is outdated; please refer to the performance tuning section for re-testing.
-
-### Performance Tuning Guide
-
-To achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the `--colocate_critic_reward`, `--colocate_actor_ref` options to merge nodes. Finally, you should increase the `rollout_micro_batch_size` (and minimize the TP size of vLLM engine) as much as possible. During the training phase, a larger `--micro_train_batch_size` is better and enable `--packing_samples`. When there are enough GPUs, please disable `--adam_offload` and enable `--overlap_comm`. For multi-nodes RLHF, please use `--vllm_sync_backend nccl` with vLLM 0.6.4+.
-
-## Companies and Organizations using OpenRLHF
-
-- Google
-- ByteDance
-- Tencent
-- Alibaba
-- Baidu
-- China Telecom
-- Vivo
-- Allen AI
-- NexusFlow
-- Jülich Supercomputing Centre (JSC)
-- Berkeley Starling Team
-- M-A-P
-- ...
-
-## Join Us
-
-**How to Join?**
-
-1. Email us at janhu9527@gmail.com or join [GitHub Organization](https://github.com/OpenRLHF). Please include the following details:
- - Your name
- - Your GitHub username
- - Your areas of interest
- - Your skills and experience related to NLP and/or AI
-1. You can also join us through the official GitHub [OpenRLHF ↗](https://github.com/OpenRLHF/OpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you.
-
-**What can you do?**
-
-1. Join the team and participate in the development of the OpenRLHF project.
-1. Contribute to the project by submitting pull requests.
-1. Help improve documentation, fix bugs, or create new features.
-1. Share the project and help us grow the community.
+# OpenRLHF v0.5.7
+
+- [Overview](#overview)
+- [Preparing the Training Environment](#preparing-the-training-environment)
+- [Starting Training](#starting-training)
+- [Training Results](#training-results)
+- [Release Notes](#release-notes)
+
+# Overview
+
+OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed, and HF Transformers. This implementation supports `SFT` and `DPO` training; the currently verified model is `Qwen2-VL-2B-Instruct`.
+
+
+# Preparing the Training Environment
+
+## Preparing the Environment
+
+- It is recommended to refer to the [compatibility documentation](https://www.hiascend.com/developer/download/commercial) and use the latest compatible versions.
+
+ **Table 1** Version compatibility table
+
+ | Software | Version |
+ |:---------:|:------------------------:|
+ | Driver | AscendHDK 25.0.RC1.B115 |
+ | Firmware | AscendHDK 25.0.RC1.B115 |
+ | CANN | CANN 8.2.RC1.B010 |
+ | PyTorch | 2.6.0 |
+ | torch_npu | 2.6.0 |
+
+- Environment setup guide.
-## Sponsor Us
+ Please refer to [PyTorch Framework Training Environment Preparation](https://www.hiascend.com/document/detail/zh/ModelZoo/pytorchframework/ptes).
+
+- Install dependencies.
-Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https://opencollective.com/OpenRLHF).
+ 1. Run the following command in the root directory of the model source package to install the dependencies required for the PyTorch version used by this model.
+ ```shell
+ TARGET_DEVICE=NPU pip install -e .
+ ```
-## Starchart
-[](https://star-history.com/#OpenRLHF/OpenRLHF&Date)
+ 2. Run the following commands in the root directory of the model source package to build and install transformers v4.51.0 from source (a quick verification sketch follows this list).
+ ```shell
+ git clone -b v4.51.0 https://github.com/huggingface/transformers.git
+ cp transformers_need/modeling_qwen2_vl.py transformers/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
+ cd transformers
+ pip install .
+ ```
+
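+After completing both steps, you can optionally sanity-check the setup. The commands below are a minimal sketch, assuming the Ascend driver provides `npu-smi` and that the packages above were installed into the active Python environment; adjust them to your setup.
+
+```shell
+# Check that the NPU driver/firmware are visible and that the patched stack imports cleanly (sketch).
+npu-smi info
+python -c "import torch, torch_npu; print('torch', torch.__version__, '| NPU available:', torch.npu.is_available())"
+python -c "import transformers; print('transformers', transformers.__version__)"
+python -c "from transformers.models.qwen2_vl import modeling_qwen2_vl; print(modeling_qwen2_vl.__file__)"
+```
+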
+## Obtaining the Pre-trained Model
-## Contributors
+1. Download the `Qwen2-VL-2B-Instruct` model yourself and specify its path via the `--pretrain_path` argument (a download sketch is given after the dataset layout below).
-A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue.
+## Preparing the Datasets
-
-
-
+1. Create a `data` folder in the root directory of the model source package and download the `llava-en-zh-300k` and `RLHF-V` datasets yourself, with the following layout:
-## References & Acknowledgements
+ ```
+ data/
+ ├── llava-en-zh-300k
+ └── RLHF-V
+ ```
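+
+ How you obtain the model and datasets is up to you; the commands below are only a sketch using `huggingface-cli`, and the Hugging Face repo IDs (`Qwen/Qwen2-VL-2B-Instruct`, `BUAADreamer/llava-en-zh-300k`, `openbmb/RLHF-V-Dataset`) are assumptions that should be verified on the Hub or replaced with your own mirror. Note that `test/env_npu.sh` sets `HF_DATASETS_OFFLINE=1`, so download the data before training.
+
+ ```shell
+ # Sketch: fetch the model and datasets into the layout shown above (verify repo IDs first).
+ huggingface-cli download Qwen/Qwen2-VL-2B-Instruct --local-dir Qwen2-VL-2B-Instruct
+ huggingface-cli download BUAADreamer/llava-en-zh-300k --repo-type dataset --local-dir data/llava-en-zh-300k
+ huggingface-cli download openbmb/RLHF-V-Dataset --repo-type dataset --local-dir data/RLHF-V
+ ```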
-We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:
+# Starting Training
-- [Hugging Face Transformers ↗](https://github.com/huggingface/transformers)
-- [OpenAI GPT ↗](https://github.com/openai/gpt-3)
-- [LLaMA ↗](https://llama.meta.com/)
-- [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed)
-- [Ray ↗](https://github.com/ray-project/ray)
+1. Go to the root directory of the extracted source package.
-Our project would also like to thank [ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) and [DeepSpeedChat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). In the early stages of the project, we referred to their code design.
+ ```shell
+ cd /${model_folder_name}
+ ```
-(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.
+2. Run the training scripts (see the note on `env_npu.sh` after the command list).
-## Citation
-```
-@article{hu2024openrlhf,
- title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
- author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
- journal={arXiv preprint arXiv:2405.11143},
- year={2024}
-}
-```
+ - 8-card SFT training
+
+ Launch 8-card training:
+
+ ```shell
+ bash test/train_qwen2_vl_sft_full_8p.sh --pretrain_path=xxxx --dataset_path=data/llava-en-zh-300k/zh # 8P accuracy
-______________________________________________________________________
+ bash test/train_qwen2_vl_sft_performance_8p.sh --pretrain_path=xxxx --dataset_path=data/llava-en-zh-300k/zh # 8P performance
+ ```
-*OpenRLHF © 2025 OpenRLHF. All Rights Reserved.*
+ - 8-card DPO training
+
+ Launch 8-card training:
+
+ ```shell
+ bash test/train_qwen2_vl_dpo_full_8p.sh --pretrain_path=xxxx --dataset_path=data/RLHF-V # 8P accuracy
+
+ bash test/train_qwen2_vl_dpo_performance_8p.sh --pretrain_path=xxxx --dataset_path=data/RLHF-V # 8P performance
+ ```
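+
+ If you run the underlying training commands manually instead of through the `test/*.sh` wrappers, you can load the NPU environment first with the helper script shipped in this repo; the wrappers are expected to handle this themselves.
+
+ ```shell
+ # Load the Ascend/CANN environment variables and logging settings into the current shell.
+ source test/env_npu.sh
+ ```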
+
+# Training Results
+
+**Table 2** Training results (performance)
+
+| MODEL | NAME | METHOD | Seconds Per Step (s) |
+|:------------------------|:------------------------|:----------:|:----------------------:|
+| Qwen2-VL-2B-Instruct | 8P Competitor A | SFT | 0.17108 |
+| Qwen2-VL-2B-Instruct | 8P Atlas 200T A2 Box16 | SFT | 0.24064 |
+| Qwen2-VL-2B-Instruct | 8P Competitor A | DPO | 0.40857 |
+| Qwen2-VL-2B-Instruct | 8P Atlas 200T A2 Box16 | DPO | 0.53905 |
+
+
+# Public Network Address Statement
+For the public network addresses referenced in the code, see public_address_statement.md.
+
+# Release Notes
+
+## Changes
+
+2025.5.12: First release.
+
+## FAQ
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md
new file mode 100644
index 0000000000000000000000000000000000000000..2cb8191853368778180eb11533fdaffb5729e8b6
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md
@@ -0,0 +1,477 @@
+
+

+
+
+
+
+
+[ English | 中文 | 日本語 ]
+
+OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers:
+
+- **Simple and easy to use**: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, and it is seamlessly compatible with Huggingface models and datasets.
+- **High performance**: RLHF training spends 80% of its time in the sample generation stage. Thanks to large inference batch sizes with Ray, packed samples, and vLLM generation acceleration, OpenRLHF is 3~4x+ faster than Optimized DeepSpeedChat with Hybrid Engine.
+- **Distributed RLHF**: OpenRLHF distributes the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models on multiple A100 80G GPUs with vLLM, and of 7B models across multiple 24GB RTX 4090 GPUs.
+- **PPO Implementation Optimization**: We integrated the implementation tricks for PPO to improve the training stability, referencing [Zhihu](https://zhuanlan.zhihu.com/p/622134699) and the [Notion blog](https://hijkzzz.notion.site/rlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361).
+
+More details are in [Slides](https://docs.google.com/presentation/d/1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk/edit?usp=sharing) | [Technical Report](https://arxiv.org/abs/2405.11143) | [Documents](https://openrlhf.readthedocs.io/)
+
+## News
+- [2025/1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https://github.com/hkust-nlp/simpleRL-reason)
+- [2024/12] We "proposed" 😊 the [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS).
+- [2024/12] We analyzed the PPO, REINFORCE++, GRPO and RLOO in the [Notion Blogpost](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05).
+
+
+## Features
+
+- Distributed [PPO](./examples/scripts/train_ppo_llama_ray.sh) and [REINFORCE++/RLOO](./examples/scripts/train_reinforce_llama_ray.sh) implementations based on Ray.
+- Full RLHF fine-tuning support for models with [over 70 billion parameters](./examples/scripts/train_ppo_llama_ray_70b.sh).
+- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`).
+- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`).
+- Implementation of [DPO (Direct Preference Optimization)/IPO/cDPO](./examples/scripts/train_dpo_llama.sh) and [Kahneman-Tversky Optimization (KTO)](./examples/scripts/train_kto_llama.sh).
+- Support for [Iterative DPO](./examples/scripts/train_iterative_dpo_llama.sh) ([GitHub: Online-RLHF](https://github.com/RLHFlow/Online-RLHF)).
+- Support for [Rejection Sampling](./examples/scripts/train_rejection_sampling_llama.sh).
+- Implementation of [Conditional SFT](./examples/scripts/train_conditional_llama.sh) ([arXiv:2308.12050](https://arxiv.org/abs/2308.12050)).
+- Support for [Knowledge Distillation](./examples/scripts/train_knowledge_distillation.sh) ([Microsoft: minillm](https://github.com/microsoft/LMOps/tree/main/minillm)).
+- Integration of [Process Reward Model (PRM)](./examples/scripts/train_prm_mistral.sh).
+- Packing of training samples for SFT, DPO, RM, PRM, and PPO (`--packing_samples`).
+- Implementation of [RingAttention](./examples/scripts/train_dpo_ring_llama.sh) (`--ring_attn_size`, `--ring_head_stride`).
+- Support for [Mixture of Experts (MoE)](./examples/test_scripts/train_sft_mixtral_lora.sh) (`--aux_loss_coef`).
+- Integration of FlashAttention2 (`--flash_attn`).
+- Support for QLoRA (`--load_in_4bit`) and [LoRA](./examples/scripts/train_sft_mixtral_lora.sh) (`--lora_rank`, `--target_modules`).
+- Compatibility with HuggingFace's `tokenizer.apply_chat_template` for datasets (`--apply_chat_template` and `--input_key`).
+- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`).
+- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`).
+- Provided multi-node training scripts, such as [DPO](./examples/scripts/train_llama_slurm.sh) and [Ray PPO](./examples/scripts/train_ppo_llama_ray_slurm.sh).
+
+
+### PPO Support Matrix
+
+| Feature | OpenRLHF | DSChat | CAIChat | TRL |
+| ------------- |:-------------:| :-------------:| :-------------:| :-------------:|
+| 70B+ Full Tuning with 16 A100-80GB | ✅ | ❌ | ❌ | ❌ |
+| 7B Full Tuning with 4 RTX4090 | ✅ | ❌ | ❌ | ❌ |
+| 34B DPO Full Tuning with 8 A100-80GB | ✅ | ❌ | ❌ | ❌ |
+| Inference Engine in PPO | ✅ | ✅ | ❌ | ❌ |
+| PPO Implementation Tricks | ✅ | ❌ | ❌ | ✅ |
+| Support QLoRA | ✅ | ❌ | ❌ | ✅ |
+| Support Mixtral 8*7b | ✅ | ❌ | ❌ | ❌ |
+| Support Unmerged Actor-Critic | ✅ | ✅ | ✅ | ❌ |
+| Support Multiple Reward Models | ✅ | ❌ | ❌ | ❌ |
+| Support Huggingface Models | ✅ | ✅ | ✅ | ✅ |
+| Easy-to-use | ✅ | ❌ (HybridEngine bugs) | ✅ | ✅ |
+
+
+## Quick Start
+
+### Installation
+
+To use OpenRLHF, first launch the docker container (**Recommended**) and `pip install` openrlhf inside the docker container:
+
+```bash
+# Launch the docker container
+docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.07-py3 bash
+sudo pip uninstall xgboost transformer_engine flash_attn -y
+
+# pip install
+pip install openrlhf
+
+# If you want to use vLLM acceleration (Install vLLM 0.6.5)
+pip install openrlhf[vllm]
+# latest vLLM is also supported
+pip install openrlhf[vllm_latest]
+
+# pip install the latest version
+pip install git+https://github.com/OpenRLHF/OpenRLHF.git
+
+# Or git clone
+git clone https://github.com/OpenRLHF/OpenRLHF.git
+cd OpenRLHF
+pip install -e .
+```
+
+> [!NOTE]
+>We recommend using vLLM 0.6.4 or higher. Other versions (vLLM >= 0.4.2) may require weight synchronization via Gloo (`--vllm_sync_backend gloo`).
+>We also provided the [Dockerfiles for vLLM](./dockerfile/) and [One-Click Installation Script of Nvidia-Docker](./examples/scripts/nvidia_docker_install.sh).
+
+### Prepare Datasets
+OpenRLHF provides multiple data processing methods in our dataset classes,
+such as in the [Prompt Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/prompts_dataset.py#L6):
+
+```python
+def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str:
+ if apply_chat_template:
+ chat = data[input_key]
+ if isinstance(chat, str):
+ chat = [{"role": "user", "content": chat}]
+ prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+ else:
+ prompt = data[input_key]
+ if input_template:
+ prompt = input_template.format(prompt)
+ return prompt
+```
+
+- We can use `--input_key` to specify the `JSON key name` of the input datasets `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`, and use `--apply_chat_template` to utilize the `chat_template` from the [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main/en/chat_templating).
+- If you don't want to use `--apply_chat_template`, you can use `--input_template` instead, or preprocess the datasets offline in advance.
+- OpenRLHF also supports mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`.
+
+How Chat Templating Works:
+
+```python
+dataset = [{"input_key": [
+ {"role": "user", "content": "Hello, how are you?"},
+ {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
+ {"role": "user", "content": "I'd like to show off how chat templating works!"},
+]}]
+
+tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False)
+
+"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]"
+```
+
+How to specify training and test datasets?
+
+You can specify it using the `data_type@data_dir` format. For example, the dataset can be set as `--dataset json@./data`.
+
+```
+data
+├── test.jsonl
+└── train.jsonl
+```
+
+> [!NOTE]
+> By default, we use `train` and `test` as splits to distinguish training and testing datasets from Huggingface.
+> The ``JSON key`` options depend on the specific dataset. See [Reward Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py#L10) and [SFT Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/sft_dataset.py#L9).
+
+### Supervised Fine-tuning
+
+OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain {name or path}`, `--reward_pretrain {name or path}` and `--critic_pretrain {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https://huggingface.co/OpenRLHF).
+
+Then you can use the startup scripts we provide in the [examples/scripts](./examples/scripts/) directory, or start the training using the following commands.
+
+```bash
+deepspeed --module openrlhf.cli.train_sft \
+ --max_len 4096 \
+ --dataset Open-Orca/OpenOrca \
+ --input_key question \
+ --output_key response \
+ --input_template $'User: {}\nAssistant: ' \
+ --train_batch_size 256 \
+ --micro_train_batch_size 2 \
+ --max_samples 500000 \
+ --pretrain meta-llama/Meta-Llama-3-8B \
+ --save_path ./checkpoint/llama3-8b-sft \
+ --save_steps -1 \
+ --logging_steps 1 \
+ --eval_steps -1 \
+ --zero_stage 2 \
+ --max_epochs 1 \
+ --packing_samples \
+ --bf16 \
+ --flash_attn \
+ --learning_rate 5e-6 \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+# Support HF tokenizer.apply_chat_template
+# --apply_chat_template
+# --tokenizer_chat_template {HF Chat Template}
+
+# Support RingAttention
+# pip install ring_flash_attn
+# --ring_attn_size 2 \
+# --ring_head_stride 2 \
+
+# Multi-turn fine-tuning loss
+# --multiturn
+
+# Can also be used for continued pre-training
+# --pretrain_mode
+```
+
+> [!NOTE]
+> OpenRLHF SFT/DPO/RewardModel/PPO trainers support `--packing_samples` [based on `--flash_attn`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing)
+
+
+### Reward Model Training
+```bash
+deepspeed --module openrlhf.cli.train_rm \
+ --save_path ./checkpoint/llama3-8b-rm \
+ --save_steps -1 \
+ --logging_steps 1 \
+ --eval_steps -1 \
+ --train_batch_size 256 \
+ --micro_train_batch_size 1 \
+ --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+ --bf16 \
+ --max_epochs 1 \
+ --max_len 8192 \
+ --zero_stage 3 \
+ --learning_rate 9e-6 \
+ --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
+ --apply_chat_template \
+ --chosen_key chosen \
+ --rejected_key rejected \
+ --flash_attn \
+ --packing_samples \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+```
+
+It is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`:
+
+```python
+reward_model = AutoModelForSequenceClassification.from_pretrained(
+ reward_model_path,
+ num_labels=1,
+ torch_dtype=torch.bfloat16,
+ attn_implementation="flash_attention_2",
+ use_cache=False,
+ )
+inputs = ...  # left-padded input tokens, e.g. tokenizer(texts, padding=True, return_tensors="pt")
+reward = reward_model.model(**inputs).last_hidden_state
+reward = reward_model.score(reward)[:, -1]
+```
+
+### PPO without Ray
+
+```bash
+deepspeed --module openrlhf.cli.train_ppo \
+ --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+ --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
+ --save_path ./checkpoint/llama-3-8b-rlhf \
+ --save_steps -1 \
+ --logging_steps 1 \
+ --eval_steps -1 \
+ --micro_train_batch_size 2 \
+ --train_batch_size 128 \
+ --micro_rollout_batch_size 4 \
+ --rollout_batch_size 1024 \
+ --max_epochs 1 \
+ --prompt_max_len 1024 \
+ --generate_max_len 1024 \
+ --zero_stage 2 \
+ --bf16 \
+ --actor_learning_rate 5e-7 \
+ --critic_learning_rate 9e-6 \
+ --init_kl_coef 0.01 \
+ --prompt_data OpenRLHF/prompt-collection-v0.1 \
+ --input_key context_messages \
+ --apply_chat_template \
+ --max_samples 100000 \
+ --normalize_reward \
+ --adam_offload \
+ --flash_attn \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+# Support remote reward model (HTTP)
+# --remote_rm_url http://localhost:5000/get_reward
+```
+
+### PPO/REINFORCE++ with Ray and vLLM
+
+To improve RLHF training speed or support 70B models, we can use PPO with Ray and vLLM acceleration:
+
+```bash
+# launch the master node of ray in container
+ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
+
+# if you want to launch ray on more nodes, use
+ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8
+
+ray job submit --address="http://127.0.0.1:8265" \
+ --runtime-env-json='{"working_dir": "/openrlhf"}' \
+ -- python3 -m openrlhf.cli.train_ppo_ray \
+ --ref_num_nodes 1 \
+ --ref_num_gpus_per_node 2 \
+ --reward_num_nodes 1 \
+ --reward_num_gpus_per_node 2 \
+ --critic_num_nodes 1 \
+ --critic_num_gpus_per_node 2 \
+ --actor_num_nodes 1 \
+ --actor_num_gpus_per_node 2 \
+ --vllm_num_engines 2 \
+ --vllm_tensor_parallel_size 2 \
+ --colocate_critic_reward \
+ --colocate_actor_ref \
+ --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+ --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
+ --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \
+ --micro_train_batch_size 8 \
+ --train_batch_size 128 \
+ --micro_rollout_batch_size 16 \
+ --rollout_batch_size 1024 \
+ --max_samples 100000 \
+ --max_epochs 1 \
+ --prompt_max_len 1024 \
+ --generate_max_len 1024 \
+ --zero_stage 3 \
+ --bf16 \
+ --actor_learning_rate 5e-7 \
+ --critic_learning_rate 9e-6 \
+ --init_kl_coef 0.01 \
+ --prompt_data OpenRLHF/prompt-collection-v0.1 \
+ --input_key context_messages \
+ --apply_chat_template \
+ --normalize_reward \
+ --packing_samples \
+ --adam_offload \
+ --flash_attn \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+# Support REINFORCE++ | RLOO
+# --advantage_estimator reinforce | rloo
+
+# Support remote reward model (HTTP)
+# --remote_rm_url http://localhost:5000/get_reward
+
+# Support N samples
+# --n_samples_per_prompt 4
+```
+> [!NOTE]
+> Not setting `--vllm_num_engines` means the vLLM engine is not used.
+> You can also use ``setup_commands`` to let Ray automatically deploy the environment, such as `--runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'`.
+
+> [!NOTE]
+> RLOO in OPENRLHF is a modification based on REINFORCE++, differing from the original version.
+
+> [!NOTE]
+> If you encounter an index-out-of-range error when DeepSpeed sets up the GPU devices, you can try setting the environment variable [`RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES`](openrlhf/trainer/ray/utils.py) as a workaround.
+> ```bash
+> # For NVIDIA GPUs:
+> export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
+> ```
+
+The launch scripts and documentation for supported algorithms are in [examples/scripts](./examples/scripts/) and [Documents - Usage](https://openrlhf.readthedocs.io/en/latest/usage.html).
+
+### LoRA
+If you use `LoRA (Low-Rank Adaptation)`, `OpenRLHF` saves only the `LoRA Adapter` by default rather than the full model weights. To use the resulting model in your task, you should merge the `Adapter` with the weights of your base model:
+
+```bash
+python -m openrlhf.cli.lora_combiner \
+ --model_path meta-llama/Meta-Llama-3-8B \
+ --lora_path ./checkpoint/llama3-8b-rm \
+ --output_path ./checkpoint/llama-3-8b-rm-combined \
+ --is_rm \
+ --bf16
+```
+
+## Performance
+
+We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF:
+
+| **Size** | **NVIDIA A800-80GB GPUs** | **Optimized DSChat (with Hybrid Engine)** | **OpenRLHF** | **Speedup** |
+| :---: | :---: | :---: | :---: | :---: |
+| 7B | 16 | 855.09 | 471.11 | 1.82x |
+| 13B | 32 | 1528.93 | 608.93 | 2.5x |
+| 34B | 32 | 3634.98 | 1526.4 | 2.4x |
+| 70B | 32 | 10407.0 | 4488.53 | 2.3x |
+
+> [!NOTE]
+> The data is outdated; please refer to the performance tuning section for re-testing.
+
+### Performance Tuning Guide
+
+To achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the `--colocate_critic_reward` and `--colocate_actor_ref` options to merge nodes. Finally, you should increase `--micro_rollout_batch_size` (and minimize the TP size of the vLLM engine) as much as possible. During the training phase, a larger `--micro_train_batch_size` is better, and enable `--packing_samples`. When there are enough GPUs, please disable `--adam_offload` and enable `--overlap_comm`. For multi-node RLHF, please use `--vllm_sync_backend nccl` with vLLM 0.6.4+.
+
+## Companies and Organizations using OpenRLHF
+
+- Google
+- ByteDance
+- Tencent
+- Alibaba
+- Baidu
+- China Telecom
+- Vivo
+- Allen AI
+- NexusFlow
+- Jülich Supercomputing Centre (JSC)
+- Berkeley Starling Team
+- M-A-P
+- ...
+
+## Join Us
+
+**How to Join?**
+
+1. Email us at janhu9527@gmail.com or join [GitHub Organization](https://github.com/OpenRLHF). Please include the following details:
+ - Your name
+ - Your GitHub username
+ - Your areas of interest
+ - Your skills and experience related to NLP and/or AI
+1. You can also join us through the official GitHub [OpenRLHF ↗](https://github.com/OpenRLHF/OpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you.
+
+**What can you do?**
+
+1. Join the team and participate in the development of the OpenRLHF project.
+1. Contribute to the project by submitting pull requests.
+1. Help improve documentation, fix bugs, or create new features.
+1. Share the project and help us grow the community.
+
+## Sponsor Us
+
+Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https://opencollective.com/OpenRLHF).
+
+## Starchart
+
+[](https://star-history.com/#OpenRLHF/OpenRLHF&Date)
+
+## Contributors
+
+A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue.
+
+
+
+
+
+## References & Acknowledgements
+
+We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:
+
+- [Hugging Face Transformers ↗](https://github.com/huggingface/transformers)
+- [OpenAI GPT ↗](https://github.com/openai/gpt-3)
+- [LLaMA ↗](https://llama.meta.com/)
+- [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed)
+- [Ray ↗](https://github.com/ray-project/ray)
+
+Our project would also like to thank [ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) and [DeepSpeedChat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). In the early stages of the project, we referred to their code design.
+
+(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.
+
+## Citation
+```
+@article{hu2024openrlhf,
+ title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
+ author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
+ journal={arXiv preprint arXiv:2405.11143},
+ year={2024}
+}
+```
+
+______________________________________________________________________
+
+*OpenRLHF © 2025 OpenRLHF. All Rights Reserved.*
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md
new file mode 100644
index 0000000000000000000000000000000000000000..8c541c924ca0a625e05c23adc1f730515ef62fb0
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md
@@ -0,0 +1,55 @@
+
+| 类型 | 开源代码地址 | 文件名 | 公网IP地址/公网URL地址/域名/邮箱 | 用途说明 |
+|:------:|:-------------------------:|:---------------------------------------------------------------------------------------------:|:--------------------:|:-----------------:|
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README_ja.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ja.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README_zh.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_zh.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_dpo.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_dpo.py | https://arxiv.org/pdf/2310.12036v2.pdf | IPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_dpo.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_dpo.py | https://arxiv.org/pdf/2305.18290.pdf | DPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_ppo_ray.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_ppo_ray.py | http://joschu.net/blog/kl-approx.html | 近似KL散度的链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/utils.py | http://joschu.net/blog/kl-approx.html | 近似KL散度的链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/huggingface/transformers/blob/405b56269812056d9593869e22b7b264d806cb1e/src/transformers/models/llama/modeling_llama.py | transformers三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://huggingface.co/docs/transformers/deepspeed#non-trainer-deepspeed-integration | transformers三方仓源码DeepSpeed解释链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/huggingface/peft/issues/137 | peft三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://huggingface.co/docs/transformers/main_classes/deepspeed#nontrainer-deepspeed-integration | transformers三方仓源码DeepSpeed解释链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/huggingface/transformers/issues/26877 | transformers三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/abs/2204.05862 | Pairwise Loss for Reward Model论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/huggingface/transformers/issues/26877 | transformers三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/pdf/2310.12036v2.pdf | IPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://ericmitchell.ai/cdpo.pdf | cDPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/pdf/2305.18290.pdf | DPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://github.com/ContextualAI/HALOs/blob/ca9b7e3eeea220c0944ad8095d641da33f907a7e/trainers.py | HALOs三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/OpenRLHF/OpenRLHF/issues/217 | OpenRLHF三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/OpenRLHF/OpenRLHF/issues/217 | OpenRLHF三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/OpenRLHF/OpenRLHF/pull/634 | OpenRLHF三方仓pr链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | LMOps三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ppo_utils/kl_controller.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ppo_utils/kl_controller.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | Adaptive KL controller的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ppo_utils/experience_maker.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ppo_utils/experience_maker.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | PPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/nvidia_gpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/amd_gpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/npu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/hpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/neuron.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/tpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/intel_gpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/479d69fad0538f04cb22bf13e76ff91cfeb8a4e5 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/676a99982fe9aabe72fd52a91e08988a653a7359 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/pull/10555 | vllm三方仓pr链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/7206ce4ce112ed117796a59045c968a6d353f691 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/eb6d3c264d0cd8e44dec16bca7947fbe96415ce9#diff-e1ad69e38e033accddfa5480ec808c4740eb39244d1ef51cc3407e20dde8cfd4 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/launcher.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/launcher.py | https://github.com/vllm-project/vllm/commit/eb6d3c264d0cd8e44dec16bca7947fbe96415ce9#diff-e1ad69e38e033accddfa5480ec808c4740eb39244d1ef51cc3407e20dde8cfd4 | custom resources解释链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/ppo_actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/ppo_actor.py | https://github.com/vllm-project/vllm/blob/c6b0a7d3ba03ca414be1174e9bd86a97191b7090/vllm/worker/worker_base.py | vllm三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/logging_utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/logging_utils.py | https://github.com/skypilot-org/skypilot/blob/86dc0f6283a335e4aa37b3c10716f90999f48ab6/sky/sky_logging.py | skypilot三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/logging_utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/utils.py | https://github.com/facebookresearch/llama-recipes/pull/196 | llama-cookbook三方仓pr链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://arxiv.org/abs/2308.12050 | CA论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_sampler.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_sampler.py | https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/torch/utils/data/distributed.py | pytorch三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_sampler.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_sampler.py | https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py | pytorch三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://arxiv.org/abs/2307.09288 | pytorch三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_util.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_util.py | https://github.com/pytorch/pytorch/commit/a0c7029a75628cd5fa8df83c0de0ea98ee7fd844 | pytorch三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://github.com/RLHFlow/Online-RLHF/blob/main/run_loop.sh | OpenRLHF三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/deepspeed/deepspeed.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/deepspeed/deepspeed.py | https://github.com/RLHFlow/Online-RLHF/blob/main/run_loop.sh | DeepSpeed三方仓issue链接 |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | http://www.apache.org/licenses/LICENSE-2.0 | Apache-2.0 License link |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://qwenlm.github.io/blog/qwen2-vl/ | Qwen2-VL blog link explaining rotary position embedding |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://pytorch.org/docs/stable/nn.html#torch.nn.Module | torch.nn.Module documentation link |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://github.com/huggingface/transformers/pull/34852 | Transformers third-party repo PR link |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://github.com/pytorch/pytorch/issues/110213 | PyTorch third-party repo issue link |
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt
index be0bca46f79d16522aa21b3a915fe37d32492b15..e6a68fc37556e324df7a3c490edd1059986d51c0 100644
--- a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt
@@ -14,7 +14,7 @@ peft
pillow
ray[default]==2.42.0
tensorboard
-torch
+torch==2.6.0
torchmetrics
tqdm
transformers_stream_generator
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f57f4bbaf6f7efb057bd325b145633056e289f2d
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
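+# Sourced by the training scripts under test/ (via "source test/env_npu.sh") to locate the
+# CANN toolkit and configure the Ascend NPU runtime environment before training.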
+CANN_INSTALL_PATH_CONF='/etc/Ascend/ascend_cann_install.info'
+
+if [ -f $CANN_INSTALL_PATH_CONF ]; then
+ CANN_INSTALL_PATH=$(cat $CANN_INSTALL_PATH_CONF | grep Install_Path | cut -d "=" -f 2)
+else
+ CANN_INSTALL_PATH="/usr/local/Ascend"
+fi
+
+if [ -d ${CANN_INSTALL_PATH}/ascend-toolkit/latest ]; then
+ source ${CANN_INSTALL_PATH}/ascend-toolkit/set_env.sh
+else
+ source ${CANN_INSTALL_PATH}/nnae/set_env.sh
+fi
+msnpureport -g error -d 0
+msnpureport -g error -d 1
+msnpureport -g error -d 2
+msnpureport -g error -d 3
+msnpureport -g error -d 4
+msnpureport -g error -d 5
+msnpureport -g error -d 6
+msnpureport -g error -d 7
+
+# Controls how Inf/NaN input values are handled during computation.
+# 0: saturation mode - on overflow the result saturates to the floating-point extremes (+/-MAX).
+# 1: INF_NAN mode - Inf/NaN results are produced as defined.
+# Atlas training series devices only support saturation mode; Atlas A2/A3 defaults to 1 and can also be configured to 0 (saturation mode).
+export INF_NAN_MODE_ENABLE=1
+# Print host logs to stdout: 0 - off / 1 - on. 0 disables on-screen printing, so logs use the default output path and are saved to log files.
+export ASCEND_SLOG_PRINT_TO_STDOUT=0
+# Default log level: 0 - debug / 1 - info / 2 - warning / 3 - error. 3 keeps only error-level logs; adjust as needed.
+export ASCEND_GLOBAL_LOG_LEVEL=3
+# Whether application Event logging is enabled: 0 - off / 1 - on (default 1). Set to 0 here to disable Event logs.
+export ASCEND_GLOBAL_EVENT_ENABLE=0
+# Combined flag: 0 - off / 1 - on. 1 enables the optimization for combining two non-contiguous operators.
+export COMBINED_ENABLE=1
+# HCCL whitelist switch: 1 - disabled / 0 - enabled. 1 skips verification of the HCCL communication whitelist.
+export HCCL_WHITELIST_DISABLE=1
+export HCCL_IF_IP=$(hostname -I |awk '{print $1}')
+# Maximum number of operator info entries cached by aclnn
+export ACLNN_CACHE_LIMIT=100000
+# Run the Hugging Face datasets library in offline mode
+export HF_DATASETS_OFFLINE=1
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f11fd5038521307f68314aff721d9a5bcf1fd617
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+echo "-------------------Start DPO Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="dpo"
+
+# Default values
+max_epochs=3
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
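+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_dpo_full_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/dpo_dataset --max_epochs=3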
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the DPO training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End DPO Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d9b3a99323be35ceae21895b5e205b04f24e9928
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+echo "-------------------Start DPO Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="dpo"
+
+# Default values
+max_epochs=1
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
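+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_dpo_performance_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/dpo_dataset --max_epochs=1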
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the DPO training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End DPO Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..78f31c0417e784423c84cf56138d1380d43c01dd
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh
@@ -0,0 +1,104 @@
+#!/bin/bash
+echo "-------------------Start SFT Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="sft"
+
+# Default values
+max_epochs=3
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
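+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_sft_full_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/sft_dataset --max_epochs=3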
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the SFT training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End SFT Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..72e5cd3c6369d47459caa68db12dcca0b50056af
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh
@@ -0,0 +1,104 @@
+#!/bin/bash
+echo "-------------------Start SFT Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="sft"
+
+# Default values
+max_epochs=1
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
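+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_sft_performance_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/sft_dataset --max_epochs=1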
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the SFT training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End SFT Train-------------------"
\ No newline at end of file