diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md
index 2cb8191853368778180eb11533fdaffb5729e8b6..eea984d60307c9cdd4b4dd793bcfc9c932b41609 100644
--- a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md
@@ -1,477 +1,134 @@
-
-

-
-
-
-
-
-[ English | 中文 | 日本語 ]
-
-OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers:
-
-- **Simple and easy to use**: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, and seamlessly compatible with Huggingface models and datasets.
-- **High performance**: RLHF training spends 80% of the time on the sample generation stage. Thanks to the ability to use a large inference batch size with Ray and Packing Samples and vLLM generation acceleration, the performance of OpenRLHF 3~4x+ that of Optimized DeepSpeedChat with Hybrid Engine.
-- **Distributed RLHF**: OpenRLHF distribute the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models with multiple A100 80G GPUs and vLLM and 7B models across multiple 24GB RTX 4090 GPUs.
-- **PPO Implementation Optimization**: We integrated the implementation tricks for PPO to improve the training stability, referencing [Zhihu](https://zhuanlan.zhihu.com/p/622134699) and the [Notion blog](https://hijkzzz.notion.site/rlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361).
-
-More details are in [Slides](https://docs.google.com/presentation/d/1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk/edit?usp=sharing) | [Technical Report](https://arxiv.org/abs/2405.11143) | [Documents](https://openrlhf.readthedocs.io/)
-
-## News
-- [2025/1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https://github.com/hkust-nlp/simpleRL-reason)
-- [2024/12] We "proposed" 😊 the [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS).
-- [2024/12] We analyzed the PPO, REINFORCE++, GRPO and RLOO in the [Notion Blogpost](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05).
-
-
-## Features
-
-- Distributed [PPO](./examples/scripts/train_ppo_llama_ray.sh) and [REINFORCE++/RLOO](./examples/scripts/train_reinforce_llama_ray.sh) implementations based on Ray.
-- Full RLHF fine-tuning support for models with [over 70 billion parameters](./examples/scripts/train_ppo_llama_ray_70b.sh).
-- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`).
-- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`).
-- Implementation of [DPO (Direct Preference Optimization)/IPO/cDPO](./examples/scripts/train_dpo_llama.sh) and [Kahneman-Tversky Optimization (KTO)](./examples/scripts/train_kto_llama.sh).
-- Support for [Iterative DPO](./examples/scripts/train_iterative_dpo_llama.sh) ([GitHub: Online-RLHF](https://github.com/RLHFlow/Online-RLHF)).
-- Support for [Rejection Sampling](./examples/scripts/train_rejection_sampling_llama.sh).
-- Implementation of [Conditional SFT](./examples/scripts/train_conditional_llama.sh) ([arXiv:2308.12050](https://arxiv.org/abs/2308.12050)).
-- Support for [Knowledge Distillation](./examples/scripts/train_knowledge_distillation.sh) ([Microsoft: minillm](https://github.com/microsoft/LMOps/tree/main/minillm)).
-- Integration of [Process Reward Model (PRM)](./examples/scripts/train_prm_mistral.sh).
-- Packing of training samples for SFT, DPO, RM, PRM, and PPO (`--packing_samples`).
-- Implementation of [RingAttention](./examples/scripts/train_dpo_ring_llama.sh) (`--ring_attn_size`, `--ring_head_stride`).
-- Support for [Mixture of Experts (MoE)](./examples/test_scripts/train_sft_mixtral_lora.sh) (`--aux_loss_coef`).
-- Integration of FlashAttention2 (`--flash_attn`).
-- Support for QLoRA (`--load_in_4bit`) and [LoRA](./examples/scripts/train_sft_mixtral_lora.sh) (`--lora_rank`, `--target_modules`).
-- Compatibility with HuggingFace's `tokenizer.apply_chat_template` for datasets (`--apply_chat_template` and `--input_key`).
-- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`).
-- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`).
-- Provided multi-node training scripts, such as [DPO](./examples/scripts/train_llama_slurm.sh) and [Ray PPO](./examples/scripts/train_ppo_llama_ray_slurm.sh).
-
-
-### PPO Support Matrix
-
-| Feature | OpenRLHF | DSChat | CAIChat | TRL |
-| ------------- |:-------------:| :-------------:| :-------------:| :-------------:|
-| 70B+ Full Tuning with 16 A100-80GB | ✅ | ❌ | ❌ | ❌ |
-| 7B Full Tuning with 4 RTX4090 | ✅ | ❌ | ❌ | ❌ |
-| 34B DPO Full Tuning with 8 A100-80GB | ✅ | ❌ | ❌ | ❌ |
-| Inference Engine in PPO | ✅ | ✅ | ❌ | ❌ |
-| PPO Implementation Tricks | ✅ | ❌ | ❌ | ✅ |
-| Support QLoRA | ✅ | ❌ | ❌ | ✅ |
-| Support Mixtral 8*7b | ✅ | ❌ | ❌ | ❌ |
-| Support Unmerged Actor-Critic | ✅ | ✅ | ✅ | ❌ |
-| Support Multiple Reward Models | ✅ | ❌ | ❌ | ❌ |
-| Support Huggingface Models | ✅ | ✅ | ✅ | ✅ |
-| Easy-to-use | ✅ | ❌ (HybridEngine bugs) | ✅ | ✅ |
-
-
-## Quick Start
-
-### Installation
-
-To use OpenRLHF, first launch the docker container (**Recommended**) and `pip install` openrlhf inside the docker container:
-
-```bash
-# Launch the docker container
-docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.07-py3 bash
-sudo pip uninstall xgboost transformer_engine flash_attn -y
-
-# pip install
-pip install openrlhf
-
-# If you want to use vLLM acceleration (Install vLLM 0.6.5)
-pip install openrlhf[vllm]
-# latest vLLM is also supported
-pip install openrlhf[vllm_latest]
-
-# pip install the latest version
-pip install git+https://github.com/OpenRLHF/OpenRLHF.git
-
-# Or git clone
-git clone https://github.com/OpenRLHF/OpenRLHF.git
-cd OpenRLHF
-pip install -e .
-```
-
-> [!NOTE]
->We recommend using vLLM 0.6.4 or higher. Other versions (vLLM >= 0.4.2) may require weight synchronization via Gloo (`--vllm_sync_backend gloo`).
->We also provided the [Dockerfiles for vLLM](./dockerfile/) and [One-Click Installation Script of Nvidia-Docker](./examples/scripts/nvidia_docker_install.sh).
-
-### Prepare Datasets
-OpenRLHF provides multiple data processing methods in our dataset classes.
-Such as in the [Prompt Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/prompts_dataset.py#L6):
-
-```python
-def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str:
- if apply_chat_template:
- chat = data[input_key]
- if isinstance(chat, str):
- chat = [{"role": "user", "content": chat}]
- prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
- else:
- prompt = data[input_key]
- if input_template:
- prompt = input_template.format(prompt)
- return prompt
-```
-
-- We can use `--input_key` to specify the `JSON key name` of the input datasets `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`, and use `--apply_chat_template` to utilize the `chat_template` from the [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main/en/chat_templating).
-- If you don't want to use `--apply_chat_template`, you can use `--input_template` instead, or preprocess the datasets offline in advance.
-- OpenRLHF also support mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`.
-
-How Chat Templating Works:
-
-```python
-dataset = [{"input_key": [
- {"role": "user", "content": "Hello, how are you?"},
- {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
- {"role": "user", "content": "I'd like to show off how chat templating works!"},
-]}]
-
-tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False)
-
-"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]"
-```
-
-How to specify training and test datasets ?
-
-You can specify it using the `data_type@data_dir` format. For example, the dataset can be set as `--dataset json@./data`.
-
-```
-data
-├── test.jsonl
-└── train.jsonl
-```
-
-> [!NOTE]
-> By default, we use `train` and `test` as splits to distinguish training and testing datasets from Huggingface.
-> The ``JSON key`` options depends on the specific datasets. See [Reward Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py#L10) and [SFT Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/sft_dataset.py#L9)
-
-### Supervised Fine-tuning
-
-OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain {name or path}`, `--reward_pretrain {name or path}` and `--critic_pretrain {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https://huggingface.co/OpenRLHF).
-
-Then you can use the startup scripts we provide in the [examples/scripts](./examples/scripts/) directory, or start the training using the following commands.
-
-```bash
-deepspeed --module openrlhf.cli.train_sft \
- --max_len 4096 \
- --dataset Open-Orca/OpenOrca \
- --input_key question \
- --output_key response \
- --input_template $'User: {}\nAssistant: ' \
- --train_batch_size 256 \
- --micro_train_batch_size 2 \
- --max_samples 500000 \
- --pretrain meta-llama/Meta-Llama-3-8B \
- --save_path ./checkpoint/llama3-8b-sft \
- --save_steps -1 \
- --logging_steps 1 \
- --eval_steps -1 \
- --zero_stage 2 \
- --max_epochs 1 \
- --packing_samples \
- --bf16 \
- --flash_attn \
- --learning_rate 5e-6 \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-# Support HF tokenizer.apply_chat_template
-# --apply_chat_template
-# --tokenizer_chat_template {HF Chat Template}
-
-# Support RingAttention
-# pip install ring_flash_attn
-# --ring_attn_size 2 \
-# --ring_head_stride 2 \
-
-# Multi-turn fine-tuning loss
-# --multiturn
-
-# Can also be used for continued pre-training
-# --pretrain_mode
-```
-
-> [!NOTE]
-> OpenRLHF SFT/DPO/RewardModel/PPO trainers support `--packing_samples` [based on `--flash_attn`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing)
-
-
-### Reward Model Training
-```bash
-deepspeed --module openrlhf.cli.train_rm \
- --save_path ./checkpoint/llama3-8b-rm \
- --save_steps -1 \
- --logging_steps 1 \
- --eval_steps -1 \
- --train_batch_size 256 \
- --micro_train_batch_size 1 \
- --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
- --bf16 \
- --max_epochs 1 \
- --max_len 8192 \
- --zero_stage 3 \
- --learning_rate 9e-6 \
- --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
- --apply_chat_template \
- --chosen_key chosen \
- --rejected_key rejected \
- --flash_attn \
- --packing_samples \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-```
-
-It is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`:
-
-```python
-reward_model = AutoModelForSequenceClassification.from_pretrained(
- reward_model_path,
- num_labels=1,
- torch_dtype=torch.bfloat16,
- attn_implementation="flash_attention_2",
- use_cache=False,
- )
-inputs = xxxx (Left Padding Input Tokens)
-reward = reward_model.model(*inputs).last_hidden_state
-reward = reward_model.score(reward)[:, -1]
-```
-
-### PPO without Ray
-
-```bash
-deepspeed --module openrlhf.cli.train_ppo \
- --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
- --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
- --save_path ./checkpoint/llama-3-8b-rlhf \
- --save_steps -1 \
- --logging_steps 1 \
- --eval_steps -1 \
- --micro_train_batch_size 2 \
- --train_batch_size 128 \
- --micro_rollout_batch_size 4 \
- --rollout_batch_size 1024 \
- --max_epochs 1 \
- --prompt_max_len 1024 \
- --generate_max_len 1024 \
- --zero_stage 2 \
- --bf16 \
- --actor_learning_rate 5e-7 \
- --critic_learning_rate 9e-6 \
- --init_kl_coef 0.01 \
- --prompt_data OpenRLHF/prompt-collection-v0.1 \
- --input_key context_messages \
- --apply_chat_template \
- --max_samples 100000 \
- --normalize_reward \
- --adam_offload \
- --flash_attn \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-# Support remote reward model (HTTP)
-# --remote_rm_url http://localhost:5000/get_reward
-```
-
-### PPO/REINFORCE++ with Ray and vLLM
-
-To improve RLHF training speed or support 70B models, we can use the PPO with Ray and vLLM acceleration
-
-```bash
-# launch the master node of ray in container
-ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
-
-# if you want to launch ray on more nodes, use
-ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8
-
-ray job submit --address="http://127.0.0.1:8265" \
- --runtime-env-json='{"working_dir": "/openrlhf"}' \
- -- python3 -m openrlhf.cli.train_ppo_ray \
- --ref_num_nodes 1 \
- --ref_num_gpus_per_node 2 \
- --reward_num_nodes 1 \
- --reward_num_gpus_per_node 2 \
- --critic_num_nodes 1 \
- --critic_num_gpus_per_node 2 \
- --actor_num_nodes 1 \
- --actor_num_gpus_per_node 2 \
- --vllm_num_engines 2 \
- --vllm_tensor_parallel_size 2 \
- --colocate_critic_reward \
- --colocate_actor_ref \
- --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
- --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
- --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \
- --micro_train_batch_size 8 \
- --train_batch_size 128 \
- --micro_rollout_batch_size 16 \
- --rollout_batch_size 1024 \
- --max_samples 100000 \
- --max_epochs 1 \
- --prompt_max_len 1024 \
- --generate_max_len 1024 \
- --zero_stage 3 \
- --bf16 \
- --actor_learning_rate 5e-7 \
- --critic_learning_rate 9e-6 \
- --init_kl_coef 0.01 \
- --prompt_data OpenRLHF/prompt-collection-v0.1 \
- --input_key context_messages \
- --apply_chat_template \
- --normalize_reward \
- --packing_samples \
- --adam_offload \
- --flash_attn \
- --gradient_checkpointing \
- --use_wandb {wandb_token}
-
-# Support REINFORCE++ | RLOO
-# --advantage_estimator reinforce | rloo
-
-# Support remote reward model (HTTP)
-# --remote_rm_url http://localhost:5000/get_reward
-
-# Support N samples
-# --n_samples_per_prompt 4
-```
-> [!NOTE]
-> Do not set `--vllm_num_engines` means not using the vLLM engine.
-> You can also use ``setup_commands`` to let Ray automatically deploy the environment, such as `--runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'`.
-
-> [!NOTE]
-> RLOO in OPENRLHF is a modification based on REINFORCE++, differing from the original version.
-
-> [!NOTE]
-> If you you encounter an error related to index out of range when deepspeed sets up the GPU devices, you can try to set the environment variable [`RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES`](openrlhf/trainer/ray/utils.py) as a workaround.
-> ```bash
-> # For NVIDIA GPUs:
-> export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
-> ```
-
-The launch scripts and documents for supported algorithms are in [example/scripts](./examples/scripts/) and [Documents - Usage](https://openrlhf.readthedocs.io/en/latest/usage.html)
-
-### LoRA
-If you use `LoRA (Low-Rank Adaptation)`, `OpenRLHF` will not save the full weights by default instead of `LoRA Adapter`. To continue in your task normally, you should combine the `Adapter` with weights of your base model
-
-```bash
-python -m openrlhf.cli.lora_combiner \
- --model_path meta-llama/Meta-Llama-3-8B \
- --lora_path ./checkpoint/llama3-8b-rm \
- --output_path ./checkpoint/llama-3-8b-rm-combined \
- --is_rm \
- --bf16
-```
-
-## Performance
-
-We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF:
-
-| **Size** | **NVIDIA A800-80GB GPUs** | **Optimized DSChat (with Hybrid Engine)** | **OpenRLHF** | **Speedup** |
-| :---: | :---: | :---: | :---: | :---: |
-| 7B | 16 | 855.09 | 471.11 | 1.82x |
-| 13B | 32 | 1528.93 | 608.93 | 2.5x |
-| 34B | 32 | 3634.98 | 1526.4 | 2.4x |
-| 70B | 32 | 10407.0 | 4488.53 | 2.3x |
-
-> [!NOTE]
-> The data is outdated; please refer to the performance tuning section for re-testing.
-
-### Performance Tuning Guide
-
-To achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the `--colocate_critic_reward`, `--colocate_actor_ref` options to merge nodes. Finally, you should increase the `rollout_micro_batch_size` (and minimize the TP size of vLLM engine) as much as possible. During the training phase, a larger `--micro_train_batch_size` is better and enable `--packing_samples`. When there are enough GPUs, please disable `--adam_offload` and enable `--overlap_comm`. For multi-nodes RLHF, please use `--vllm_sync_backend nccl` with vLLM 0.6.4+.
-
-## Companies and Organizations using OpenRLHF
-
-- Google
-- ByteDance
-- Tencent
-- Alibaba
-- Baidu
-- China Telecom
-- Vivo
-- Allen AI
-- NexusFlow
-- Jülich Supercomputing Centre (JSC)
-- Berkeley Starling Team
-- M-A-P
-- ...
-
-## Join Us
-
-**How to Join?**
-
-1. Email us at janhu9527@gmail.com or join [GitHub Organization](https://github.com/OpenRLHF). Please include the following details:
- - Your name
- - Your GitHub username
- - Your areas of interest
- - Your skills and experience related to NLP and/or AI
-1. You can also join us through the official GitHub [OpenRLHF ↗](https://github.com/OpenRLHF/OpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you.
-
-**What can you do?**
-
-1. Join the team and participate in the development of the OpenRLHF project.
-1. Contribute to the project by submitting pull requests.
-1. Help improve documentation, fix bugs, or create new features.
-1. Share the project and help us grow the community.
+# OpenRLHF v0.5.7
+
+- [Overview](#overview)
+- [Preparing the Training Environment](#preparing-the-training-environment)
+- [Starting Training](#starting-training)
+- [Training Results](#training-results)
+- [Release Notes](#release-notes)
+
+# Overview
+
+OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed, and HF Transformers. This implementation supports `SFT` and `DPO` training; the currently verified model is `Qwen2-VL-2B-Instruct`.
+
+
+# Preparing the Training Environment
+
+## Preparing the Environment
+
+- It is recommended to refer to the [compatibility documentation](https://www.hiascend.com/developer/download/commercial) and use the latest compatible versions.
+
+ **Table 1** Version compatibility table
+
+ | Software | Version |
+ |:---------:|:------------------------:|
+ | Driver | AscendHDK 25.0.RC1.B115 |
+ | Firmware | AscendHDK 25.0.RC1.B115 |
+ | CANN | CANN 8.2.RC1.B010 |
+ | PyTorch | 2.6.0 |
+ | torch_npu | 2.6.0 |
+
+- Environment setup guide.
-## Sponsor Us
+ Please refer to [PyTorch Framework Training Environment Preparation](https://www.hiascend.com/document/detail/zh/ModelZoo/pytorchframework/ptes).
+
+- Install dependencies.
-Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https://opencollective.com/OpenRLHF).
+ 1. Run the following command in the root directory of the model source package to install the dependencies required for the PyTorch version used by this model.
+ ```shell
+ TARGET_DEVICE=NPU pip install -e .
+ ```
-## Starchart
-[](https://star-history.com/#OpenRLHF/OpenRLHF&Date)
+ 2. Run the following commands in the root directory of the model source package to build and install transformers v4.51.0 from source (a quick verification sketch follows this list).
+ ```shell
+ git clone -b v4.51.0 https://github.com/huggingface/transformers.git
+ cp transformers_need/modeling_qwen2_vl.py transformers/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py
+ cd transformers
+ pip install .
+ ```
+
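+After completing both steps, you can optionally sanity-check the setup. The commands below are a minimal sketch, assuming the Ascend driver provides `npu-smi` and that the packages above were installed into the active Python environment; adjust them to your setup.
+
+```shell
+# Check that the NPU driver/firmware are visible and that the patched stack imports cleanly (sketch).
+npu-smi info
+python -c "import torch, torch_npu; print('torch', torch.__version__, '| NPU available:', torch.npu.is_available())"
+python -c "import transformers; print('transformers', transformers.__version__)"
+python -c "from transformers.models.qwen2_vl import modeling_qwen2_vl; print(modeling_qwen2_vl.__file__)"
+```
+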
+## Obtaining the Pre-trained Model
-## Contributors
+1. Download the `Qwen2-VL-2B-Instruct` model yourself and specify its path via the `--pretrain_path` argument (a download sketch is given after the dataset layout below).
-A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue.
+## Preparing the Datasets
-
-
-
+1. Create a `data` folder in the root directory of the model source package and download the `llava-en-zh-300k` and `RLHF-V` datasets yourself, with the following layout:
-## References & Acknowledgements
+ ```
+ data/
+ ├── llava-en-zh-300k
+ └── RLHF-V
+ ```
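+
+ How you obtain the model and datasets is up to you; the commands below are only a sketch using `huggingface-cli`, and the Hugging Face repo IDs (`Qwen/Qwen2-VL-2B-Instruct`, `BUAADreamer/llava-en-zh-300k`, `openbmb/RLHF-V-Dataset`) are assumptions that should be verified on the Hub or replaced with your own mirror. Note that `test/env_npu.sh` sets `HF_DATASETS_OFFLINE=1`, so download the data before training.
+
+ ```shell
+ # Sketch: fetch the model and datasets into the layout shown above (verify repo IDs first).
+ huggingface-cli download Qwen/Qwen2-VL-2B-Instruct --local-dir Qwen2-VL-2B-Instruct
+ huggingface-cli download BUAADreamer/llava-en-zh-300k --repo-type dataset --local-dir data/llava-en-zh-300k
+ huggingface-cli download openbmb/RLHF-V-Dataset --repo-type dataset --local-dir data/RLHF-V
+ ```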
-We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:
+# Starting Training
-- [Hugging Face Transformers ↗](https://github.com/huggingface/transformers)
-- [OpenAI GPT ↗](https://github.com/openai/gpt-3)
-- [LLaMA ↗](https://llama.meta.com/)
-- [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed)
-- [Ray ↗](https://github.com/ray-project/ray)
+1. Go to the root directory of the extracted source package.
-Our project would also like to thank [ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) and [DeepSpeedChat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). In the early stages of the project, we referred to their code design.
+ ```shell
+ cd /${model_folder_name}
+ ```
-(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.
+2. Run the training scripts (see the note on `env_npu.sh` after the command list).
-## Citation
-```
-@article{hu2024openrlhf,
- title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
- author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
- journal={arXiv preprint arXiv:2405.11143},
- year={2024}
-}
-```
+ - 8-card SFT training
+
+ Launch 8-card training:
+
+ ```shell
+ bash test/train_qwen2_vl_sft_full_8p.sh --pretrain_path=xxxx --dataset_path=data/llava-en-zh-300k/zh # 8P accuracy
-______________________________________________________________________
+ bash test/train_qwen2_vl_sft_performance_8p.sh --pretrain_path=xxxx --dataset_path=data/llava-en-zh-300k/zh # 8P performance
+ ```
-*OpenRLHF © 2025 OpenRLHF. All Rights Reserved.*
+ - 8-card DPO training
+
+ Launch 8-card training:
+
+ ```shell
+ bash test/train_qwen2_vl_dpo_full_8p.sh --pretrain_path=xxxx --dataset_path=data/RLHF-V # 8P accuracy
+
+ bash test/train_qwen2_vl_dpo_performance_8p.sh --pretrain_path=xxxx --dataset_path=data/RLHF-V # 8P performance
+ ```
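+
+ If you run the underlying training commands manually instead of through the `test/*.sh` wrappers, you can load the NPU environment first with the helper script shipped in this repo; the wrappers are expected to handle this themselves.
+
+ ```shell
+ # Load the Ascend/CANN environment variables and logging settings into the current shell.
+ source test/env_npu.sh
+ ```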
+
+# Training Results
+
+**Table 2** Training results (performance)
+
+| MODEL | NAME | METHOD | Seconds Per Step (s) |
+|:------------------------|:------------------------|:----------:|:----------------------:|
+| Qwen2-VL-2B-Instruct | 8P Competitor A | SFT | 0.17108 |
+| Qwen2-VL-2B-Instruct | 8P Atlas 200T A2 Box16 | SFT | 0.24064 |
+| Qwen2-VL-2B-Instruct | 8P Competitor A | DPO | 0.40857 |
+| Qwen2-VL-2B-Instruct | 8P Atlas 200T A2 Box16 | DPO | 0.53905 |
+
+
+# Public Network Address Statement
+For the public network addresses referenced in the code, see public_address_statement.md.
+
+# Release Notes
+
+## Changes
+
+2025.5.12: First release.
+
+## FAQ
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md
new file mode 100644
index 0000000000000000000000000000000000000000..2cb8191853368778180eb11533fdaffb5729e8b6
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md
@@ -0,0 +1,477 @@
+
+

+
+
+
+
+
+[ English | 中文 | 日本語 ]
+
+OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers:
+
+- **Simple and easy to use**: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, and it is seamlessly compatible with Huggingface models and datasets.
+- **High performance**: RLHF training spends 80% of its time in the sample generation stage. Thanks to large inference batch sizes with Ray, packed samples, and vLLM generation acceleration, OpenRLHF is 3~4x+ faster than Optimized DeepSpeedChat with Hybrid Engine.
+- **Distributed RLHF**: OpenRLHF distributes the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models on multiple A100 80G GPUs with vLLM, and of 7B models across multiple 24GB RTX 4090 GPUs.
+- **PPO Implementation Optimization**: We integrated the implementation tricks for PPO to improve the training stability, referencing [Zhihu](https://zhuanlan.zhihu.com/p/622134699) and the [Notion blog](https://hijkzzz.notion.site/rlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361).
+
+More details are in [Slides](https://docs.google.com/presentation/d/1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk/edit?usp=sharing) | [Technical Report](https://arxiv.org/abs/2405.11143) | [Documents](https://openrlhf.readthedocs.io/)
+
+## News
+- [2025/1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https://github.com/hkust-nlp/simpleRL-reason)
+- [2024/12] We "proposed" 😊 the [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS).
+- [2024/12] We analyzed the PPO, REINFORCE++, GRPO and RLOO in the [Notion Blogpost](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05).
+
+
+## Features
+
+- Distributed [PPO](./examples/scripts/train_ppo_llama_ray.sh) and [REINFORCE++/RLOO](./examples/scripts/train_reinforce_llama_ray.sh) implementations based on Ray.
+- Full RLHF fine-tuning support for models with [over 70 billion parameters](./examples/scripts/train_ppo_llama_ray_70b.sh).
+- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`).
+- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`).
+- Implementation of [DPO (Direct Preference Optimization)/IPO/cDPO](./examples/scripts/train_dpo_llama.sh) and [Kahneman-Tversky Optimization (KTO)](./examples/scripts/train_kto_llama.sh).
+- Support for [Iterative DPO](./examples/scripts/train_iterative_dpo_llama.sh) ([GitHub: Online-RLHF](https://github.com/RLHFlow/Online-RLHF)).
+- Support for [Rejection Sampling](./examples/scripts/train_rejection_sampling_llama.sh).
+- Implementation of [Conditional SFT](./examples/scripts/train_conditional_llama.sh) ([arXiv:2308.12050](https://arxiv.org/abs/2308.12050)).
+- Support for [Knowledge Distillation](./examples/scripts/train_knowledge_distillation.sh) ([Microsoft: minillm](https://github.com/microsoft/LMOps/tree/main/minillm)).
+- Integration of [Process Reward Model (PRM)](./examples/scripts/train_prm_mistral.sh).
+- Packing of training samples for SFT, DPO, RM, PRM, and PPO (`--packing_samples`).
+- Implementation of [RingAttention](./examples/scripts/train_dpo_ring_llama.sh) (`--ring_attn_size`, `--ring_head_stride`).
+- Support for [Mixture of Experts (MoE)](./examples/test_scripts/train_sft_mixtral_lora.sh) (`--aux_loss_coef`).
+- Integration of FlashAttention2 (`--flash_attn`).
+- Support for QLoRA (`--load_in_4bit`) and [LoRA](./examples/scripts/train_sft_mixtral_lora.sh) (`--lora_rank`, `--target_modules`).
+- Compatibility with HuggingFace's `tokenizer.apply_chat_template` for datasets (`--apply_chat_template` and `--input_key`).
+- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`).
+- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`).
+- Provided multi-node training scripts, such as [DPO](./examples/scripts/train_llama_slurm.sh) and [Ray PPO](./examples/scripts/train_ppo_llama_ray_slurm.sh).
+
+
+### PPO Support Matrix
+
+| Feature | OpenRLHF | DSChat | CAIChat | TRL |
+| ------------- |:-------------:| :-------------:| :-------------:| :-------------:|
+| 70B+ Full Tuning with 16 A100-80GB | ✅ | ❌ | ❌ | ❌ |
+| 7B Full Tuning with 4 RTX4090 | ✅ | ❌ | ❌ | ❌ |
+| 34B DPO Full Tuning with 8 A100-80GB | ✅ | ❌ | ❌ | ❌ |
+| Inference Engine in PPO | ✅ | ✅ | ❌ | ❌ |
+| PPO Implementation Tricks | ✅ | ❌ | ❌ | ✅ |
+| Support QLoRA | ✅ | ❌ | ❌ | ✅ |
+| Support Mixtral 8*7b | ✅ | ❌ | ❌ | ❌ |
+| Support Unmerged Actor-Critic | ✅ | ✅ | ✅ | ❌ |
+| Support Multiple Reward Models | ✅ | ❌ | ❌ | ❌ |
+| Support Huggingface Models | ✅ | ✅ | ✅ | ✅ |
+| Easy-to-use | ✅ | ❌ (HybridEngine bugs) | ✅ | ✅ |
+
+
+## Quick Start
+
+### Installation
+
+To use OpenRLHF, first launch the docker container (**Recommended**) and `pip install` openrlhf inside the docker container:
+
+```bash
+# Launch the docker container
+docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.07-py3 bash
+sudo pip uninstall xgboost transformer_engine flash_attn -y
+
+# pip install
+pip install openrlhf
+
+# If you want to use vLLM acceleration (Install vLLM 0.6.5)
+pip install openrlhf[vllm]
+# latest vLLM is also supported
+pip install openrlhf[vllm_latest]
+
+# pip install the latest version
+pip install git+https://github.com/OpenRLHF/OpenRLHF.git
+
+# Or git clone
+git clone https://github.com/OpenRLHF/OpenRLHF.git
+cd OpenRLHF
+pip install -e .
+```
+
+> [!NOTE]
+>We recommend using vLLM 0.6.4 or higher. Other versions (vLLM >= 0.4.2) may require weight synchronization via Gloo (`--vllm_sync_backend gloo`).
+>We also provided the [Dockerfiles for vLLM](./dockerfile/) and [One-Click Installation Script of Nvidia-Docker](./examples/scripts/nvidia_docker_install.sh).
+
+### Prepare Datasets
+OpenRLHF provides multiple data processing methods in our dataset classes,
+such as in the [Prompt Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/prompts_dataset.py#L6):
+
+```python
+def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str:
+ if apply_chat_template:
+ chat = data[input_key]
+ if isinstance(chat, str):
+ chat = [{"role": "user", "content": chat}]
+ prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+ else:
+ prompt = data[input_key]
+ if input_template:
+ prompt = input_template.format(prompt)
+ return prompt
+```
+
+- We can use `--input_key` to specify the `JSON key name` of the input datasets `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`, and use `--apply_chat_template` to utilize the `chat_template` from the [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main/en/chat_templating).
+- If you don't want to use `--apply_chat_template`, you can use `--input_template` instead, or preprocess the datasets offline in advance.
+- OpenRLHF also supports mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`.
+
+How Chat Templating Works:
+
+```python
+dataset = [{"input_key": [
+ {"role": "user", "content": "Hello, how are you?"},
+ {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
+ {"role": "user", "content": "I'd like to show off how chat templating works!"},
+]}]
+
+tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False)
+
+"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]"
+```
+
+How to specify training and test datasets?
+
+You can specify it using the `data_type@data_dir` format. For example, the dataset can be set as `--dataset json@./data`.
+
+```
+data
+├── test.jsonl
+└── train.jsonl
+```
+
+> [!NOTE]
+> By default, we use `train` and `test` as splits to distinguish training and testing datasets from Huggingface.
+> The ``JSON key`` options depend on the specific dataset. See [Reward Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py#L10) and [SFT Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/sft_dataset.py#L9).
+
+### Supervised Fine-tuning
+
+OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain {name or path}`, `--reward_pretrain {name or path}` and `--critic_pretrain {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https://huggingface.co/OpenRLHF).
+
+Then you can use the startup scripts we provide in the [examples/scripts](./examples/scripts/) directory, or start the training using the following commands.
+
+```bash
+deepspeed --module openrlhf.cli.train_sft \
+ --max_len 4096 \
+ --dataset Open-Orca/OpenOrca \
+ --input_key question \
+ --output_key response \
+ --input_template $'User: {}\nAssistant: ' \
+ --train_batch_size 256 \
+ --micro_train_batch_size 2 \
+ --max_samples 500000 \
+ --pretrain meta-llama/Meta-Llama-3-8B \
+ --save_path ./checkpoint/llama3-8b-sft \
+ --save_steps -1 \
+ --logging_steps 1 \
+ --eval_steps -1 \
+ --zero_stage 2 \
+ --max_epochs 1 \
+ --packing_samples \
+ --bf16 \
+ --flash_attn \
+ --learning_rate 5e-6 \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+# Support HF tokenizer.apply_chat_template
+# --apply_chat_template
+# --tokenizer_chat_template {HF Chat Template}
+
+# Support RingAttention
+# pip install ring_flash_attn
+# --ring_attn_size 2 \
+# --ring_head_stride 2 \
+
+# Multi-turn fine-tuning loss
+# --multiturn
+
+# Can also be used for continued pre-training
+# --pretrain_mode
+```
+
+> [!NOTE]
+> OpenRLHF SFT/DPO/RewardModel/PPO trainers support `--packing_samples` [based on `--flash_attn`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing)
+
+
+### Reward Model Training
+```bash
+deepspeed --module openrlhf.cli.train_rm \
+ --save_path ./checkpoint/llama3-8b-rm \
+ --save_steps -1 \
+ --logging_steps 1 \
+ --eval_steps -1 \
+ --train_batch_size 256 \
+ --micro_train_batch_size 1 \
+ --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+ --bf16 \
+ --max_epochs 1 \
+ --max_len 8192 \
+ --zero_stage 3 \
+ --learning_rate 9e-6 \
+ --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
+ --apply_chat_template \
+ --chosen_key chosen \
+ --rejected_key rejected \
+ --flash_attn \
+ --packing_samples \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+```
+
+It is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`:
+
+```python
+reward_model = AutoModelForSequenceClassification.from_pretrained(
+ reward_model_path,
+ num_labels=1,
+ torch_dtype=torch.bfloat16,
+ attn_implementation="flash_attention_2",
+ use_cache=False,
+ )
+inputs = ...  # left-padded input tokens, e.g. tokenizer(texts, padding=True, return_tensors="pt")
+reward = reward_model.model(**inputs).last_hidden_state
+reward = reward_model.score(reward)[:, -1]
+```
+
+### PPO without Ray
+
+```bash
+deepspeed --module openrlhf.cli.train_ppo \
+ --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+ --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
+ --save_path ./checkpoint/llama-3-8b-rlhf \
+ --save_steps -1 \
+ --logging_steps 1 \
+ --eval_steps -1 \
+ --micro_train_batch_size 2 \
+ --train_batch_size 128 \
+ --micro_rollout_batch_size 4 \
+ --rollout_batch_size 1024 \
+ --max_epochs 1 \
+ --prompt_max_len 1024 \
+ --generate_max_len 1024 \
+ --zero_stage 2 \
+ --bf16 \
+ --actor_learning_rate 5e-7 \
+ --critic_learning_rate 9e-6 \
+ --init_kl_coef 0.01 \
+ --prompt_data OpenRLHF/prompt-collection-v0.1 \
+ --input_key context_messages \
+ --apply_chat_template \
+ --max_samples 100000 \
+ --normalize_reward \
+ --adam_offload \
+ --flash_attn \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+# Support remote reward model (HTTP)
+# --remote_rm_url http://localhost:5000/get_reward
+```
+
+### PPO/REINFORCE++ with Ray and vLLM
+
+To improve RLHF training speed or support 70B models, we can use PPO with Ray and vLLM acceleration:
+
+```bash
+# launch the master node of ray in container
+ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
+
+# if you want to launch ray on more nodes, use
+ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8
+
+ray job submit --address="http://127.0.0.1:8265" \
+ --runtime-env-json='{"working_dir": "/openrlhf"}' \
+ -- python3 -m openrlhf.cli.train_ppo_ray \
+ --ref_num_nodes 1 \
+ --ref_num_gpus_per_node 2 \
+ --reward_num_nodes 1 \
+ --reward_num_gpus_per_node 2 \
+ --critic_num_nodes 1 \
+ --critic_num_gpus_per_node 2 \
+ --actor_num_nodes 1 \
+ --actor_num_gpus_per_node 2 \
+ --vllm_num_engines 2 \
+ --vllm_tensor_parallel_size 2 \
+ --colocate_critic_reward \
+ --colocate_actor_ref \
+ --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
+ --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
+ --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \
+ --micro_train_batch_size 8 \
+ --train_batch_size 128 \
+ --micro_rollout_batch_size 16 \
+ --rollout_batch_size 1024 \
+ --max_samples 100000 \
+ --max_epochs 1 \
+ --prompt_max_len 1024 \
+ --generate_max_len 1024 \
+ --zero_stage 3 \
+ --bf16 \
+ --actor_learning_rate 5e-7 \
+ --critic_learning_rate 9e-6 \
+ --init_kl_coef 0.01 \
+ --prompt_data OpenRLHF/prompt-collection-v0.1 \
+ --input_key context_messages \
+ --apply_chat_template \
+ --normalize_reward \
+ --packing_samples \
+ --adam_offload \
+ --flash_attn \
+ --gradient_checkpointing \
+ --use_wandb {wandb_token}
+
+# Support REINFORCE++ | RLOO
+# --advantage_estimator reinforce | rloo
+
+# Support remote reward model (HTTP)
+# --remote_rm_url http://localhost:5000/get_reward
+
+# Support N samples
+# --n_samples_per_prompt 4
+```
+> [!NOTE]
+> Not setting `--vllm_num_engines` means the vLLM engine is not used.
+> You can also use ``setup_commands`` to let Ray automatically deploy the environment, such as `--runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'`.
+
+> [!NOTE]
+> RLOO in OPENRLHF is a modification based on REINFORCE++, differing from the original version.
+
+> [!NOTE]
+> If you encounter an index-out-of-range error when DeepSpeed sets up the GPU devices, you can try setting the environment variable [`RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES`](openrlhf/trainer/ray/utils.py) as a workaround.
+> ```bash
+> # For NVIDIA GPUs:
+> export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
+> ```
+
+The launch scripts and documentation for supported algorithms are in [examples/scripts](./examples/scripts/) and [Documents - Usage](https://openrlhf.readthedocs.io/en/latest/usage.html).
+
+### LoRA
+If you use `LoRA (Low-Rank Adaptation)`, `OpenRLHF` saves only the `LoRA Adapter` by default rather than the full model weights. To use the resulting model in your task, you should merge the `Adapter` with the weights of your base model:
+
+```bash
+python -m openrlhf.cli.lora_combiner \
+ --model_path meta-llama/Meta-Llama-3-8B \
+ --lora_path ./checkpoint/llama3-8b-rm \
+ --output_path ./checkpoint/llama-3-8b-rm-combined \
+ --is_rm \
+ --bf16
+```
+
+## Performance
+
+We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF:
+
+| **Size** | **NVIDIA A800-80GB GPUs** | **Optimized DSChat (with Hybrid Engine)** | **OpenRLHF** | **Speedup** |
+| :---: | :---: | :---: | :---: | :---: |
+| 7B | 16 | 855.09 | 471.11 | 1.82x |
+| 13B | 32 | 1528.93 | 608.93 | 2.5x |
+| 34B | 32 | 3634.98 | 1526.4 | 2.4x |
+| 70B | 32 | 10407.0 | 4488.53 | 2.3x |
+
+> [!NOTE]
+> The data is outdated; please refer to the performance tuning section for re-testing.
+
+### Performance Tuning Guide
+
+To achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the `--colocate_critic_reward` and `--colocate_actor_ref` options to merge nodes. Finally, you should increase `--micro_rollout_batch_size` (and minimize the TP size of the vLLM engine) as much as possible. During the training phase, a larger `--micro_train_batch_size` is better, and enable `--packing_samples`. When there are enough GPUs, please disable `--adam_offload` and enable `--overlap_comm`. For multi-node RLHF, please use `--vllm_sync_backend nccl` with vLLM 0.6.4+.
+
+## Companies and Organizations using OpenRLHF
+
+- Google
+- ByteDance
+- Tencent
+- Alibaba
+- Baidu
+- China Telecom
+- Vivo
+- Allen AI
+- NexusFlow
+- Jülich Supercomputing Centre (JSC)
+- Berkeley Starling Team
+- M-A-P
+- ...
+
+## Join Us
+
+**How to Join?**
+
+1. Email us at janhu9527@gmail.com or join [GitHub Organization](https://github.com/OpenRLHF). Please include the following details:
+ - Your name
+ - Your GitHub username
+ - Your areas of interest
+ - Your skills and experience related to NLP and/or AI
+1. You can also join us through the official GitHub [OpenRLHF ↗](https://github.com/OpenRLHF/OpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you.
+
+**What can you do?**
+
+1. Join the team and participate in the development of the OpenRLHF project.
+1. Contribute to the project by submitting pull requests.
+1. Help improve documentation, fix bugs, or create new features.
+1. Share the project and help us grow the community.
+
+## Sponsor Us
+
+Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https://opencollective.com/OpenRLHF).
+
+## Starchart
+
+[](https://star-history.com/#OpenRLHF/OpenRLHF&Date)
+
+## Contributors
+
+A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue.
+
+
+
+
+
+## References & Acknowledgements
+
+We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:
+
+- [Hugging Face Transformers ↗](https://github.com/huggingface/transformers)
+- [OpenAI GPT ↗](https://github.com/openai/gpt-3)
+- [LLaMA ↗](https://llama.meta.com/)
+- [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed)
+- [Ray ↗](https://github.com/ray-project/ray)
+
+Our project would also like to thank [ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) and [DeepSpeedChat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). In the early stages of the project, we referred to their code design.
+
+(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.
+
+## Citation
+```
+@article{hu2024openrlhf,
+ title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
+ author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
+ journal={arXiv preprint arXiv:2405.11143},
+ year={2024}
+}
+```
+
+______________________________________________________________________
+
+*OpenRLHF © 2025 OpenRLHF. All Rights Reserved.*
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md
new file mode 100644
index 0000000000000000000000000000000000000000..8c541c924ca0a625e05c23adc1f730515ef62fb0
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md
@@ -0,0 +1,55 @@
+
+| 类型 | 开源代码地址 | 文件名 | 公网IP地址/公网URL地址/域名/邮箱 | 用途说明 |
+|:------:|:-------------------------:|:---------------------------------------------------------------------------------------------:|:--------------------:|:-----------------:|
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README_ja.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ja.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README_zh.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_zh.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_dpo.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_dpo.py | https://arxiv.org/pdf/2310.12036v2.pdf | IPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_dpo.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_dpo.py | https://arxiv.org/pdf/2305.18290.pdf | DPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_ppo_ray.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_ppo_ray.py | http://joschu.net/blog/kl-approx.html | 近似KL散度的链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/utils.py | http://joschu.net/blog/kl-approx.html | 近似KL散度的链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/huggingface/transformers/blob/405b56269812056d9593869e22b7b264d806cb1e/src/transformers/models/llama/modeling_llama.py | transformers三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://huggingface.co/docs/transformers/deepspeed#non-trainer-deepspeed-integration | transformers三方仓源码DeepSpeed解释链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/huggingface/peft/issues/137 | peft三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://huggingface.co/docs/transformers/main_classes/deepspeed#nontrainer-deepspeed-integration | transformers三方仓源码DeepSpeed解释链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/huggingface/transformers/issues/26877 | transformers三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/abs/2204.05862 | Pairwise Loss for Reward Model论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/huggingface/transformers/issues/26877 | transformers三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/pdf/2310.12036v2.pdf | IPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://ericmitchell.ai/cdpo.pdf | cDPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/pdf/2305.18290.pdf | DPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://github.com/ContextualAI/HALOs/blob/ca9b7e3eeea220c0944ad8095d641da33f907a7e/trainers.py | HALOs三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/OpenRLHF/OpenRLHF/issues/217 | OpenRLHF三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/OpenRLHF/OpenRLHF/issues/217 | OpenRLHF三方仓issue链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/OpenRLHF/OpenRLHF/pull/634 | OpenRLHF三方仓pr链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | LMOps三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ppo_utils/kl_controller.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ppo_utils/kl_controller.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | Adaptive KL controller的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ppo_utils/experience_maker.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ppo_utils/experience_maker.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | PPO的论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/nvidia_gpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/amd_gpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/npu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/hpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/neuron.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/tpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/intel_gpu.py | ray三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/479d69fad0538f04cb22bf13e76ff91cfeb8a4e5 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/676a99982fe9aabe72fd52a91e08988a653a7359 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/pull/10555 | vllm三方仓pr链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/7206ce4ce112ed117796a59045c968a6d353f691 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/eb6d3c264d0cd8e44dec16bca7947fbe96415ce9#diff-e1ad69e38e033accddfa5480ec808c4740eb39244d1ef51cc3407e20dde8cfd4 | vllm三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/launcher.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/launcher.py | https://github.com/vllm-project/vllm/commit/eb6d3c264d0cd8e44dec16bca7947fbe96415ce9#diff-e1ad69e38e033accddfa5480ec808c4740eb39244d1ef51cc3407e20dde8cfd4 | custom resources解释链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/ppo_actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/ppo_actor.py | https://github.com/vllm-project/vllm/blob/c6b0a7d3ba03ca414be1174e9bd86a97191b7090/vllm/worker/worker_base.py | vllm三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/logging_utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/logging_utils.py | https://github.com/skypilot-org/skypilot/blob/86dc0f6283a335e4aa37b3c10716f90999f48ab6/sky/sky_logging.py | skypilot三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/logging_utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/utils.py | https://github.com/facebookresearch/llama-recipes/pull/196 | llama-cookbook三方仓pr链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://arxiv.org/abs/2308.12050 | CA论文链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_sampler.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_sampler.py | https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/torch/utils/data/distributed.py | pytorch三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_sampler.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_sampler.py | https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py | pytorch三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://arxiv.org/abs/2307.09288 | pytorch三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_util.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_util.py | https://github.com/pytorch/pytorch/commit/a0c7029a75628cd5fa8df83c0de0ea98ee7fd844 | pytorch三方仓commit id链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://github.com/RLHFlow/Online-RLHF/blob/main/run_loop.sh | OpenRLHF三方仓源码链接 |
+| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/deepspeed/deepspeed.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/deepspeed/deepspeed.py | https://github.com/RLHFlow/Online-RLHF/blob/main/run_loop.sh | DeepSpeed三方仓issue链接 |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | http://www.apache.org/licenses/LICENSE-2.0 | Apache-2.0 License link |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://qwenlm.github.io/blog/qwen2-vl/ | Qwen2-VL blog link explaining rotary position embedding |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://pytorch.org/docs/stable/nn.html#torch.nn.Module | torch.nn.Module documentation link |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://github.com/huggingface/transformers/pull/34852 | Transformers third-party repo PR link |
+| Open-source code reference | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://github.com/pytorch/pytorch/issues/110213 | PyTorch third-party repo issue link |
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt
index be0bca46f79d16522aa21b3a915fe37d32492b15..e6a68fc37556e324df7a3c490edd1059986d51c0 100644
--- a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt
@@ -14,7 +14,7 @@ peft
pillow
ray[default]==2.42.0
tensorboard
-torch
+torch==2.6.0
torchmetrics
tqdm
transformers_stream_generator
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f57f4bbaf6f7efb057bd325b145633056e289f2d
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
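+# Sourced by the training scripts under test/ (via "source test/env_npu.sh") to locate the
+# CANN toolkit and configure the Ascend NPU runtime environment before training.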
+CANN_INSTALL_PATH_CONF='/etc/Ascend/ascend_cann_install.info'
+
+if [ -f $CANN_INSTALL_PATH_CONF ]; then
+ CANN_INSTALL_PATH=$(cat $CANN_INSTALL_PATH_CONF | grep Install_Path | cut -d "=" -f 2)
+else
+ CANN_INSTALL_PATH="/usr/local/Ascend"
+fi
+
+if [ -d ${CANN_INSTALL_PATH}/ascend-toolkit/latest ]; then
+ source ${CANN_INSTALL_PATH}/ascend-toolkit/set_env.sh
+else
+ source ${CANN_INSTALL_PATH}/nnae/set_env.sh
+fi
+msnpureport -g error -d 0
+msnpureport -g error -d 1
+msnpureport -g error -d 2
+msnpureport -g error -d 3
+msnpureport -g error -d 4
+msnpureport -g error -d 5
+msnpureport -g error -d 6
+msnpureport -g error -d 7
+
+# Controls how Inf/NaN input values are handled during computation.
+# 0: saturation mode - on overflow the result saturates to the floating-point extremes (+/-MAX).
+# 1: INF_NAN mode - Inf/NaN results are produced as defined.
+# Atlas training series devices only support saturation mode; Atlas A2/A3 defaults to 1 and can also be configured to 0 (saturation mode).
+export INF_NAN_MODE_ENABLE=1
+# Print host logs to stdout: 0 - off / 1 - on. 0 disables on-screen printing, so logs use the default output path and are saved to log files.
+export ASCEND_SLOG_PRINT_TO_STDOUT=0
+# Default log level: 0 - debug / 1 - info / 2 - warning / 3 - error. 3 keeps only error-level logs; adjust as needed.
+export ASCEND_GLOBAL_LOG_LEVEL=3
+# Whether application Event logging is enabled: 0 - off / 1 - on (default 1). Set to 0 here to disable Event logs.
+export ASCEND_GLOBAL_EVENT_ENABLE=0
+# Combined flag: 0 - off / 1 - on. 1 enables the optimization for combining two non-contiguous operators.
+export COMBINED_ENABLE=1
+# HCCL whitelist switch: 1 - disabled / 0 - enabled. 1 skips verification of the HCCL communication whitelist.
+export HCCL_WHITELIST_DISABLE=1
+export HCCL_IF_IP=$(hostname -I |awk '{print $1}')
+# Maximum number of operator info entries cached by aclnn
+export ACLNN_CACHE_LIMIT=100000
+# Run the Hugging Face datasets library in offline mode
+export HF_DATASETS_OFFLINE=1
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f11fd5038521307f68314aff721d9a5bcf1fd617
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+echo "-------------------Start DPO Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="dpo"
+
+# Default values
+max_epochs=3
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
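+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_dpo_full_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/dpo_dataset --max_epochs=3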
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the DPO training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End DPO Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d9b3a99323be35ceae21895b5e205b04f24e9928
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+echo "-------------------Start DPO Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="dpo"
+
+# Default values
+max_epochs=1
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
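+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_dpo_performance_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/dpo_dataset --max_epochs=1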
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the DPO training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End DPO Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..78f31c0417e784423c84cf56138d1380d43c01dd
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh
@@ -0,0 +1,104 @@
+#!/bin/bash
+echo "-------------------Start SFT Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="sft"
+
+# Default values
+max_epochs=3
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
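+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_sft_full_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/sft_dataset --max_epochs=3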
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the SFT training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End SFT Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..72e5cd3c6369d47459caa68db12dcca0b50056af
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh
@@ -0,0 +1,104 @@
+#!/bin/bash
+echo "-------------------Start SFT Train-------------------"
+
+# Make all eight NPUs visible
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network (task) name
+Network="sft"
+
+# Default values
+max_epochs=1
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line parameters
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+ max_epochs="${para#*=}"
+ elif [[ $para == --pretrain_path=* ]]; then
+ pretrain_path="${para#*=}"
+ elif [[ $para == --dataset_path=* ]]; then
+ dataset_path="${para#*=}"
+ else
+ echo "Unknown parameter: $para" >&2
+ fi
+done
+
+# Check that both pretrain_path and dataset_path have been provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+ echo "Error: Both pretrain_path and dataset_path are required."
+ exit 1
+fi
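+# Example invocation (paths below are illustrative placeholders):
+#   bash test/train_qwen2_vl_sft_performance_8p.sh --pretrain_path=/path/to/Qwen2-VL --dataset_path=/path/to/sft_dataset --max_epochs=1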
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+ test_path_dir=${cur_path}
+ cd ..
+ cur_path=$(pwd)
+else
+ test_path_dir=${cur_path}/test
+fi
+
+# Create the SFT training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+ rm -rf ${cur_path}/test/output/${Network}
+ mkdir -p ${cur_path}/test/output/${Network}
+else
+ mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat <<EOF
+    ...
+EOF
+)
+
+if [[ ... ]]; then
+    ... > ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Locate the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time (seconds) over the whole run from the 'train/step_time' entries in the log
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+ sed 's/\x1b\[[0-9;]*m//g' |
+ awk -F "'train/step_time': '" '{print $2}' |
+ awk -F "'," '{print $1}' |
+ awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write the key results to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End SFT Train-------------------"
\ No newline at end of file