diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md index 2cb8191853368778180eb11533fdaffb5729e8b6..eea984d60307c9cdd4b4dd793bcfc9c932b41609 100644 --- a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md +++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md @@ -1,477 +1,134 @@ -
- OpenRLHF logo -
-
-

- - GitHub Contributors - - - Issues - - - Issues - - - GitHub pull requests - - GitHub stars - -
- Open-source / Comprehensive / Lightweight / Easy-to-use -

-

-
- -
- -[ English | 中文 | 日本語 ] - -OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers: - -- **Simple and easy to use**: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, and seamlessly compatible with Huggingface models and datasets. -- **High performance**: RLHF training spends 80% of the time on the sample generation stage. Thanks to the ability to use a large inference batch size with Ray and Packing Samples and vLLM generation acceleration, the performance of OpenRLHF 3~4x+ that of Optimized DeepSpeedChat with Hybrid Engine. -- **Distributed RLHF**: OpenRLHF distribute the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models with multiple A100 80G GPUs and vLLM and 7B models across multiple 24GB RTX 4090 GPUs. -- **PPO Implementation Optimization**: We integrated the implementation tricks for PPO to improve the training stability, referencing [Zhihu](https://zhuanlan.zhihu.com/p/622134699) and the [Notion blog](https://hijkzzz.notion.site/rlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361). - -More details are in [Slides](https://docs.google.com/presentation/d/1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk/edit?usp=sharing) | [Technical Report](https://arxiv.org/abs/2405.11143) | [Documents](https://openrlhf.readthedocs.io/) - -## News -- [2025/1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https://github.com/hkust-nlp/simpleRL-reason) -- [2024/12] We "proposed" 😊 the [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS). -- [2024/12] We analyzed the PPO, REINFORCE++, GRPO and RLOO in the [Notion Blogpost](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05). - - -## Features - -- Distributed [PPO](./examples/scripts/train_ppo_llama_ray.sh) and [REINFORCE++/RLOO](./examples/scripts/train_reinforce_llama_ray.sh) implementations based on Ray. -- Full RLHF fine-tuning support for models with [over 70 billion parameters](./examples/scripts/train_ppo_llama_ray_70b.sh). -- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`). -- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`). -- Implementation of [DPO (Direct Preference Optimization)/IPO/cDPO](./examples/scripts/train_dpo_llama.sh) and [Kahneman-Tversky Optimization (KTO)](./examples/scripts/train_kto_llama.sh). -- Support for [Iterative DPO](./examples/scripts/train_iterative_dpo_llama.sh) ([GitHub: Online-RLHF](https://github.com/RLHFlow/Online-RLHF)). -- Support for [Rejection Sampling](./examples/scripts/train_rejection_sampling_llama.sh). -- Implementation of [Conditional SFT](./examples/scripts/train_conditional_llama.sh) ([arXiv:2308.12050](https://arxiv.org/abs/2308.12050)). -- Support for [Knowledge Distillation](./examples/scripts/train_knowledge_distillation.sh) ([Microsoft: minillm](https://github.com/microsoft/LMOps/tree/main/minillm)). -- Integration of [Process Reward Model (PRM)](./examples/scripts/train_prm_mistral.sh). -- Packing of training samples for SFT, DPO, RM, PRM, and PPO (`--packing_samples`). 
-- Implementation of [RingAttention](./examples/scripts/train_dpo_ring_llama.sh) (`--ring_attn_size`, `--ring_head_stride`). -- Support for [Mixture of Experts (MoE)](./examples/test_scripts/train_sft_mixtral_lora.sh) (`--aux_loss_coef`). -- Integration of FlashAttention2 (`--flash_attn`). -- Support for QLoRA (`--load_in_4bit`) and [LoRA](./examples/scripts/train_sft_mixtral_lora.sh) (`--lora_rank`, `--target_modules`). -- Compatibility with HuggingFace's `tokenizer.apply_chat_template` for datasets (`--apply_chat_template` and `--input_key`). -- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`). -- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`). -- Provided multi-node training scripts, such as [DPO](./examples/scripts/train_llama_slurm.sh) and [Ray PPO](./examples/scripts/train_ppo_llama_ray_slurm.sh). - - -### PPO Support Matrix - -| Feature | OpenRLHF | DSChat | CAIChat | TRL | -| ------------- |:-------------:| :-------------:| :-------------:| :-------------:| -| 70B+ Full Tuning with 16 A100-80GB | ✅ | ❌ | ❌ | ❌ | -| 7B Full Tuning with 4 RTX4090 | ✅ | ❌ | ❌ | ❌ | -| 34B DPO Full Tuning with 8 A100-80GB | ✅ | ❌ | ❌ | ❌ | -| Inference Engine in PPO | ✅ | ✅ | ❌ | ❌ | -| PPO Implementation Tricks | ✅ | ❌ | ❌ | ✅ | -| Support QLoRA | ✅ | ❌ | ❌ | ✅ | -| Support Mixtral 8*7b | ✅ | ❌ | ❌ | ❌ | -| Support Unmerged Actor-Critic | ✅ | ✅ | ✅ | ❌ | -| Support Multiple Reward Models | ✅ | ❌ | ❌ | ❌ | -| Support Huggingface Models | ✅ | ✅ | ✅ | ✅ | -| Easy-to-use | ✅ | ❌ (HybridEngine bugs) | ✅ | ✅ | - - -## Quick Start - -### Installation - -To use OpenRLHF, first launch the docker container (**Recommended**) and `pip install` openrlhf inside the docker container: - -```bash -# Launch the docker container -docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.07-py3 bash -sudo pip uninstall xgboost transformer_engine flash_attn -y - -# pip install -pip install openrlhf - -# If you want to use vLLM acceleration (Install vLLM 0.6.5) -pip install openrlhf[vllm] -# latest vLLM is also supported -pip install openrlhf[vllm_latest] - -# pip install the latest version -pip install git+https://github.com/OpenRLHF/OpenRLHF.git - -# Or git clone -git clone https://github.com/OpenRLHF/OpenRLHF.git -cd OpenRLHF -pip install -e . -``` - -> [!NOTE] ->We recommend using vLLM 0.6.4 or higher. Other versions (vLLM >= 0.4.2) may require weight synchronization via Gloo (`--vllm_sync_backend gloo`). ->We also provided the [Dockerfiles for vLLM](./dockerfile/) and [One-Click Installation Script of Nvidia-Docker](./examples/scripts/nvidia_docker_install.sh). - -### Prepare Datasets -OpenRLHF provides multiple data processing methods in our dataset classes. 
-Such as in the [Prompt Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/prompts_dataset.py#L6): - -```python -def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str: - if apply_chat_template: - chat = data[input_key] - if isinstance(chat, str): - chat = [{"role": "user", "content": chat}] - prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True) - else: - prompt = data[input_key] - if input_template: - prompt = input_template.format(prompt) - return prompt -``` - -- We can use `--input_key` to specify the `JSON key name` of the input datasets `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`, and use `--apply_chat_template` to utilize the `chat_template` from the [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main/en/chat_templating). -- If you don't want to use `--apply_chat_template`, you can use `--input_template` instead, or preprocess the datasets offline in advance. -- OpenRLHF also support mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`. - -How Chat Templating Works: - -```python -dataset = [{"input_key": [ - {"role": "user", "content": "Hello, how are you?"}, - {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, - {"role": "user", "content": "I'd like to show off how chat templating works!"}, -]}] - -tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False) - -"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]" -``` - -How to specify training and test datasets ? - -You can specify it using the `data_type@data_dir` format. For example, the dataset can be set as `--dataset json@./data`. - -``` -data -├── test.jsonl -└── train.jsonl -``` - -> [!NOTE] -> By default, we use `train` and `test` as splits to distinguish training and testing datasets from Huggingface. -> The ``JSON key`` options depends on the specific datasets. See [Reward Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py#L10) and [SFT Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/sft_dataset.py#L9) - -### Supervised Fine-tuning - -OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain {name or path}`, `--reward_pretrain {name or path}` and `--critic_pretrain {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https://huggingface.co/OpenRLHF). - -Then you can use the startup scripts we provide in the [examples/scripts](./examples/scripts/) directory, or start the training using the following commands. 
- -```bash -deepspeed --module openrlhf.cli.train_sft \ - --max_len 4096 \ - --dataset Open-Orca/OpenOrca \ - --input_key question \ - --output_key response \ - --input_template $'User: {}\nAssistant: ' \ - --train_batch_size 256 \ - --micro_train_batch_size 2 \ - --max_samples 500000 \ - --pretrain meta-llama/Meta-Llama-3-8B \ - --save_path ./checkpoint/llama3-8b-sft \ - --save_steps -1 \ - --logging_steps 1 \ - --eval_steps -1 \ - --zero_stage 2 \ - --max_epochs 1 \ - --packing_samples \ - --bf16 \ - --flash_attn \ - --learning_rate 5e-6 \ - --gradient_checkpointing \ - --use_wandb {wandb_token} - -# Support HF tokenizer.apply_chat_template -# --apply_chat_template -# --tokenizer_chat_template {HF Chat Template} - -# Support RingAttention -# pip install ring_flash_attn -# --ring_attn_size 2 \ -# --ring_head_stride 2 \ - -# Multi-turn fine-tuning loss -# --multiturn - -# Can also be used for continued pre-training -# --pretrain_mode -``` - -> [!NOTE] -> OpenRLHF SFT/DPO/RewardModel/PPO trainers support `--packing_samples` [based on `--flash_attn`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing) - - -### Reward Model Training -```bash -deepspeed --module openrlhf.cli.train_rm \ - --save_path ./checkpoint/llama3-8b-rm \ - --save_steps -1 \ - --logging_steps 1 \ - --eval_steps -1 \ - --train_batch_size 256 \ - --micro_train_batch_size 1 \ - --pretrain OpenRLHF/Llama-3-8b-sft-mixture \ - --bf16 \ - --max_epochs 1 \ - --max_len 8192 \ - --zero_stage 3 \ - --learning_rate 9e-6 \ - --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \ - --apply_chat_template \ - --chosen_key chosen \ - --rejected_key rejected \ - --flash_attn \ - --packing_samples \ - --gradient_checkpointing \ - --use_wandb {wandb_token} - -``` - -It is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`: - -```python -reward_model = AutoModelForSequenceClassification.from_pretrained( - reward_model_path, - num_labels=1, - torch_dtype=torch.bfloat16, - attn_implementation="flash_attention_2", - use_cache=False, - ) -inputs = xxxx (Left Padding Input Tokens) -reward = reward_model.model(*inputs).last_hidden_state -reward = reward_model.score(reward)[:, -1] -``` - -### PPO without Ray - -```bash -deepspeed --module openrlhf.cli.train_ppo \ - --pretrain OpenRLHF/Llama-3-8b-sft-mixture \ - --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \ - --save_path ./checkpoint/llama-3-8b-rlhf \ - --save_steps -1 \ - --logging_steps 1 \ - --eval_steps -1 \ - --micro_train_batch_size 2 \ - --train_batch_size 128 \ - --micro_rollout_batch_size 4 \ - --rollout_batch_size 1024 \ - --max_epochs 1 \ - --prompt_max_len 1024 \ - --generate_max_len 1024 \ - --zero_stage 2 \ - --bf16 \ - --actor_learning_rate 5e-7 \ - --critic_learning_rate 9e-6 \ - --init_kl_coef 0.01 \ - --prompt_data OpenRLHF/prompt-collection-v0.1 \ - --input_key context_messages \ - --apply_chat_template \ - --max_samples 100000 \ - --normalize_reward \ - --adam_offload \ - --flash_attn \ - --gradient_checkpointing \ - --use_wandb {wandb_token} - -# Support remote reward model (HTTP) -# --remote_rm_url http://localhost:5000/get_reward -``` - -### PPO/REINFORCE++ with Ray and vLLM - -To improve RLHF training speed or support 70B models, we can use the PPO with Ray and vLLM acceleration - -```bash -# launch the master node of ray in container -ray start --head --node-ip-address 0.0.0.0 --num-gpus 8 - -# if you want to launch ray on more nodes, 
use -ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8 - -ray job submit --address="http://127.0.0.1:8265" \ - --runtime-env-json='{"working_dir": "/openrlhf"}' \ - -- python3 -m openrlhf.cli.train_ppo_ray \ - --ref_num_nodes 1 \ - --ref_num_gpus_per_node 2 \ - --reward_num_nodes 1 \ - --reward_num_gpus_per_node 2 \ - --critic_num_nodes 1 \ - --critic_num_gpus_per_node 2 \ - --actor_num_nodes 1 \ - --actor_num_gpus_per_node 2 \ - --vllm_num_engines 2 \ - --vllm_tensor_parallel_size 2 \ - --colocate_critic_reward \ - --colocate_actor_ref \ - --pretrain OpenRLHF/Llama-3-8b-sft-mixture \ - --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \ - --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \ - --micro_train_batch_size 8 \ - --train_batch_size 128 \ - --micro_rollout_batch_size 16 \ - --rollout_batch_size 1024 \ - --max_samples 100000 \ - --max_epochs 1 \ - --prompt_max_len 1024 \ - --generate_max_len 1024 \ - --zero_stage 3 \ - --bf16 \ - --actor_learning_rate 5e-7 \ - --critic_learning_rate 9e-6 \ - --init_kl_coef 0.01 \ - --prompt_data OpenRLHF/prompt-collection-v0.1 \ - --input_key context_messages \ - --apply_chat_template \ - --normalize_reward \ - --packing_samples \ - --adam_offload \ - --flash_attn \ - --gradient_checkpointing \ - --use_wandb {wandb_token} - -# Support REINFORCE++ | RLOO -# --advantage_estimator reinforce | rloo - -# Support remote reward model (HTTP) -# --remote_rm_url http://localhost:5000/get_reward - -# Support N samples -# --n_samples_per_prompt 4 -``` -> [!NOTE] -> Do not set `--vllm_num_engines` means not using the vLLM engine. -> You can also use ``setup_commands`` to let Ray automatically deploy the environment, such as `--runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'`. - -> [!NOTE] -> RLOO in OPENRLHF is a modification based on REINFORCE++, differing from the original version. - -> [!NOTE] -> If you you encounter an error related to index out of range when deepspeed sets up the GPU devices, you can try to set the environment variable [`RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES`](openrlhf/trainer/ray/utils.py) as a workaround. -> ```bash -> # For NVIDIA GPUs: -> export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 -> ``` - -The launch scripts and documents for supported algorithms are in [example/scripts](./examples/scripts/) and [Documents - Usage](https://openrlhf.readthedocs.io/en/latest/usage.html) - -### LoRA -If you use `LoRA (Low-Rank Adaptation)`, `OpenRLHF` will not save the full weights by default instead of `LoRA Adapter`. To continue in your task normally, you should combine the `Adapter` with weights of your base model - -```bash -python -m openrlhf.cli.lora_combiner \ - --model_path meta-llama/Meta-Llama-3-8B \ - --lora_path ./checkpoint/llama3-8b-rm \ - --output_path ./checkpoint/llama-3-8b-rm-combined \ - --is_rm \ - --bf16 -``` - -## Performance - -We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. 
The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF: - -| **Size** | **NVIDIA A800-80GB GPUs** | **Optimized DSChat (with Hybrid Engine)** | **OpenRLHF** | **Speedup** | -| :---: | :---: | :---: | :---: | :---: | -| 7B | 16 | 855.09 | 471.11 | 1.82x | -| 13B | 32 | 1528.93 | 608.93 | 2.5x | -| 34B | 32 | 3634.98 | 1526.4 | 2.4x | -| 70B | 32 | 10407.0 | 4488.53 | 2.3x | - -> [!NOTE] -> The data is outdated; please refer to the performance tuning section for re-testing. - -### Performance Tuning Guide - -To achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the `--colocate_critic_reward`, `--colocate_actor_ref` options to merge nodes. Finally, you should increase the `rollout_micro_batch_size` (and minimize the TP size of vLLM engine) as much as possible. During the training phase, a larger `--micro_train_batch_size` is better and enable `--packing_samples`. When there are enough GPUs, please disable `--adam_offload` and enable `--overlap_comm`. For multi-nodes RLHF, please use `--vllm_sync_backend nccl` with vLLM 0.6.4+. - -## Companies and Organizations using OpenRLHF - -- Google -- ByteDance -- Tencent -- Alibaba -- Baidu -- China Telecom -- Vivo -- Allen AI -- NexusFlow -- Jülich Supercomputing Centre (JSC) -- Berkeley Starling Team -- M-A-P -- ... - -## Join Us - -**How to Join?** - -1. Email us at janhu9527@gmail.com or join [GitHub Organization](https://github.com/OpenRLHF). Please include the following details: - - Your name - - Your GitHub username - - Your areas of interest - - Your skills and experience related to NLP and/or AI -1. You can also join us through the official GitHub [OpenRLHF ↗](https://github.com/OpenRLHF/OpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you. - -**What can you do?** - -1. Join the team and participate in the development of the OpenRLHF project. -1. Contribute to the project by submitting pull requests. -1. Help improve documentation, fix bugs, or create new features. -1. Share the project and help us grow the community. +# OpenRLHF v0.5.7 + +- [概述](#概述) +- [准备训练环境](#准备训练环境) +- [开始训练](#开始训练) +- [训练结果展示](#训练结果展示) +- [版本说明](#版本说明) + +# 概述 + +OpenRLHF是一个基于Ray、DeepSpeed和HF Transformers构建的高性能RLHF框架。该代码实现OpenRLHF支持 `SFT` 和 `DPO` 训练,目前已验证支持模型为:`Qwen2-VL-2B-Instruct`。 + + +# 准备训练环境 + +## 准备环境 + +- 推荐参考[配套资源文档](https://www.hiascend.com/developer/download/commercial)使用最新的配套版本。 + + **表 1** 版本配套表 + + + + + + + + + + + + + + + + + + + + + + + + + + +
+  | 软件      | 版本                     |
+  | --------- | ------------------------ |
+  | Driver    | AscendHDK 25.0.RC1.B115  |
+  | Firmware  | AscendHDK 25.0.RC1.B115  |
+  | CANN      | CANN 8.2.RC1.B010        |
+  | PyTorch   | 2.6.0                    |
+  | torch_npu | 2.6.0                    |
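+
+  下面给出一个最小的环境自检示例(仅供参考,假设 torch 与 torch_npu 已按上表版本安装完成),可用于快速确认 NPU 训练环境是否就绪:
+
+  ```python
+  # 环境自检示例(非正式流程):假设已按版本配套表安装 torch 2.6.0 与对应的 torch_npu
+  import torch
+  import torch_npu  # 导入后会注册 NPU 设备后端
+
+  print("torch 版本:", torch.__version__)
+  print("torch_npu 版本:", torch_npu.__version__)
+  print("NPU 是否可用:", torch_npu.npu.is_available())
+  ```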
+ +- 环境准备指导。 -## Sponsor Us + 请参考《[Pytorch框架训练环境准备](https://www.hiascend.com/document/detail/zh/ModelZoo/pytorchframework/ptes)》。 + +- 安装依赖。 -Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https://opencollective.com/OpenRLHF). + 1. 在模型源码包根目录下执行命令,安装模型对应的PyTorch版本需要的依赖。 + ```shell + TARGET_DEVICE=NPU pip install -e . + ``` -## Starchart -[![Star History Chart](https://api.star-history.com/svg?repos=OpenRLHF/OpenRLHF&type=Date)](https://star-history.com/#OpenRLHF/OpenRLHF&Date) + 2. 在模型源码包根目录下执行命令,源码编译安装 transformers v4.51.0。 + ```shell + git clone -b v4.51.0 https://github.com/huggingface/transformers.git + cp transformers_need/modeling_qwen2_vl.py transformers/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py + cd transformers + pip install . + ``` + +## 获取预训练模型 -## Contributors +1. 用户自行下载 `Qwen2-VL-2B-Instruct`模型,通过参数 `--pretrain_path` 指定模型地址。 -A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue. +## 准备数据集 - - - +1. 模型源码包根目录下创建data文件夹,用户需自行下载 `llava-en-zh-300k` 和 `RLHF-V` 数据集,结构如下: -## References & Acknowledgements + ``` + data/ + ├── llava-en-zh-300k + ├── RLHF-V + ``` -We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP: +# 开始训练 -- [Hugging Face Transformers ↗](https://github.com/huggingface/transformers) -- [OpenAI GPT ↗](https://github.com/openai/gpt-3) -- [LLaMA ↗](https://llama.meta.com/) -- [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed) -- [Ray ↗](https://github.com/ray-project/ray) +1. 进入解压后的源码包根目录。 -Our project would also like to thank [ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) and [DeepSpeedChat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). In the early stages of the project, we referred to their code design. + ```shell + cd /${模型文件夹名称} + ``` -(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF. +2. 运行训练脚本。 -## Citation -``` -@article{hu2024openrlhf, - title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework}, - author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao}, - journal={arXiv preprint arXiv:2405.11143}, - year={2024} -} -``` + - 8卡SFT训练 + + 启动8卡训练 + + ```shell + bash test/train_qwen2_vl_sft_full_8p.sh --pretrain_path=xxxx --dataset_path=data/llava-en-zh-300k/zh # 8p精度 -______________________________________________________________________ + bash test/train_qwen2_vl_sft_performance_8p.sh --pretrain_path=xxxx --dataset_path=data/llava-en-zh-300k/zh # 8p性能 + ``` -*OpenRLHF © 2025 OpenRLHF. 
All Rights Reserved.* + - 8卡DPO训练 + + 启动8卡训练 + + ```shell + bash test/train_qwen2_vl_dpo_full_8p.sh --pretrain_path=xxxx --dataset_path=data/RLHF-V # 8p精度 + + bash test/train_qwen2_vl_dpo_performance_8p.sh --pretrain_path=xxxx --dataset_path=data/RLHF-V # 8p性能 + ``` + +# 训练结果展示 + +**表 2** 训练结果展示表(性能) + +| MODEL | NAME | METHOD | Second Per Step(s) | +|:------------------------|:------------------------|:----------:|:----------------------:| +| Qwen2-VL-2B-Instruct | 8P-竞品A | SFT | 0.17108 | +| Qwen2-VL-2B-Instruct | 8P Atlas 200T A2 Box16 | SFT | 0.24064 | +| Qwen2-VL-2B-Instruct | 8P-竞品A | DPO | 0.40857 | +| Qwen2-VL-2B-Instruct | 8P Atlas 200T A2 Box16 | DPO | 0.53905 | + + +# 公网地址说明 +代码涉及公网地址参考 public_address_statement.md。 + +# 版本说明 + +## 变更 + +2025.5.12:首次发布。 + +## FAQ \ No newline at end of file diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md new file mode 100644 index 0000000000000000000000000000000000000000..2cb8191853368778180eb11533fdaffb5729e8b6 --- /dev/null +++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ori.md @@ -0,0 +1,477 @@ +
+ OpenRLHF logo +
+
+

+ + GitHub Contributors + + + Issues + + + Issues + + + GitHub pull requests + + GitHub stars + +
+ Open-source / Comprehensive / Lightweight / Easy-to-use +

+

+
+ +
+ +[ English | 中文 | 日本語 ] + +OpenRLHF is a high-performance RLHF framework built on Ray, DeepSpeed and HF Transformers: + +- **Simple and easy to use**: OpenRLHF is one of the simplest high-performance RLHF libraries currently available, and seamlessly compatible with Huggingface models and datasets. +- **High performance**: RLHF training spends 80% of the time on the sample generation stage. Thanks to the ability to use a large inference batch size with Ray and Packing Samples and vLLM generation acceleration, the performance of OpenRLHF 3~4x+ that of Optimized DeepSpeedChat with Hybrid Engine. +- **Distributed RLHF**: OpenRLHF distribute the Actor, Reward, Reference, and Critic models onto separate GPUs using Ray, while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 70B+ models with multiple A100 80G GPUs and vLLM and 7B models across multiple 24GB RTX 4090 GPUs. +- **PPO Implementation Optimization**: We integrated the implementation tricks for PPO to improve the training stability, referencing [Zhihu](https://zhuanlan.zhihu.com/p/622134699) and the [Notion blog](https://hijkzzz.notion.site/rlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361). + +More details are in [Slides](https://docs.google.com/presentation/d/1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk/edit?usp=sharing) | [Technical Report](https://arxiv.org/abs/2405.11143) | [Documents](https://openrlhf.readthedocs.io/) + +## News +- [2025/1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https://github.com/hkust-nlp/simpleRL-reason) +- [2024/12] We "proposed" 😊 the [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS). +- [2024/12] We analyzed the PPO, REINFORCE++, GRPO and RLOO in the [Notion Blogpost](https://hijkzzz.notion.site/unraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05). + + +## Features + +- Distributed [PPO](./examples/scripts/train_ppo_llama_ray.sh) and [REINFORCE++/RLOO](./examples/scripts/train_reinforce_llama_ray.sh) implementations based on Ray. +- Full RLHF fine-tuning support for models with [over 70 billion parameters](./examples/scripts/train_ppo_llama_ray_70b.sh). +- Integration with vLLM for accelerated generation in RLHF tasks (`--vllm_num_engines`). +- Support for multiple reward models (`--reward_pretrain model1,model2...`) and remote reward models (`--remote_rm_url`). +- Implementation of [DPO (Direct Preference Optimization)/IPO/cDPO](./examples/scripts/train_dpo_llama.sh) and [Kahneman-Tversky Optimization (KTO)](./examples/scripts/train_kto_llama.sh). +- Support for [Iterative DPO](./examples/scripts/train_iterative_dpo_llama.sh) ([GitHub: Online-RLHF](https://github.com/RLHFlow/Online-RLHF)). +- Support for [Rejection Sampling](./examples/scripts/train_rejection_sampling_llama.sh). +- Implementation of [Conditional SFT](./examples/scripts/train_conditional_llama.sh) ([arXiv:2308.12050](https://arxiv.org/abs/2308.12050)). +- Support for [Knowledge Distillation](./examples/scripts/train_knowledge_distillation.sh) ([Microsoft: minillm](https://github.com/microsoft/LMOps/tree/main/minillm)). +- Integration of [Process Reward Model (PRM)](./examples/scripts/train_prm_mistral.sh). +- Packing of training samples for SFT, DPO, RM, PRM, and PPO (`--packing_samples`). 
+- Implementation of [RingAttention](./examples/scripts/train_dpo_ring_llama.sh) (`--ring_attn_size`, `--ring_head_stride`). +- Support for [Mixture of Experts (MoE)](./examples/test_scripts/train_sft_mixtral_lora.sh) (`--aux_loss_coef`). +- Integration of FlashAttention2 (`--flash_attn`). +- Support for QLoRA (`--load_in_4bit`) and [LoRA](./examples/scripts/train_sft_mixtral_lora.sh) (`--lora_rank`, `--target_modules`). +- Compatibility with HuggingFace's `tokenizer.apply_chat_template` for datasets (`--apply_chat_template` and `--input_key`). +- Logging support with Wandb (`--use_wandb`) and TensorBoard (`--use_tensorboard`). +- Checkpoint recovery functionality (`--load_checkpoint` and `--save_steps`). +- Provided multi-node training scripts, such as [DPO](./examples/scripts/train_llama_slurm.sh) and [Ray PPO](./examples/scripts/train_ppo_llama_ray_slurm.sh). + + +### PPO Support Matrix + +| Feature | OpenRLHF | DSChat | CAIChat | TRL | +| ------------- |:-------------:| :-------------:| :-------------:| :-------------:| +| 70B+ Full Tuning with 16 A100-80GB | ✅ | ❌ | ❌ | ❌ | +| 7B Full Tuning with 4 RTX4090 | ✅ | ❌ | ❌ | ❌ | +| 34B DPO Full Tuning with 8 A100-80GB | ✅ | ❌ | ❌ | ❌ | +| Inference Engine in PPO | ✅ | ✅ | ❌ | ❌ | +| PPO Implementation Tricks | ✅ | ❌ | ❌ | ✅ | +| Support QLoRA | ✅ | ❌ | ❌ | ✅ | +| Support Mixtral 8*7b | ✅ | ❌ | ❌ | ❌ | +| Support Unmerged Actor-Critic | ✅ | ✅ | ✅ | ❌ | +| Support Multiple Reward Models | ✅ | ❌ | ❌ | ❌ | +| Support Huggingface Models | ✅ | ✅ | ✅ | ✅ | +| Easy-to-use | ✅ | ❌ (HybridEngine bugs) | ✅ | ✅ | + + +## Quick Start + +### Installation + +To use OpenRLHF, first launch the docker container (**Recommended**) and `pip install` openrlhf inside the docker container: + +```bash +# Launch the docker container +docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.07-py3 bash +sudo pip uninstall xgboost transformer_engine flash_attn -y + +# pip install +pip install openrlhf + +# If you want to use vLLM acceleration (Install vLLM 0.6.5) +pip install openrlhf[vllm] +# latest vLLM is also supported +pip install openrlhf[vllm_latest] + +# pip install the latest version +pip install git+https://github.com/OpenRLHF/OpenRLHF.git + +# Or git clone +git clone https://github.com/OpenRLHF/OpenRLHF.git +cd OpenRLHF +pip install -e . +``` + +> [!NOTE] +>We recommend using vLLM 0.6.4 or higher. Other versions (vLLM >= 0.4.2) may require weight synchronization via Gloo (`--vllm_sync_backend gloo`). +>We also provided the [Dockerfiles for vLLM](./dockerfile/) and [One-Click Installation Script of Nvidia-Docker](./examples/scripts/nvidia_docker_install.sh). + +### Prepare Datasets +OpenRLHF provides multiple data processing methods in our dataset classes. 
+Such as in the [Prompt Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/prompts_dataset.py#L6): + +```python +def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str: + if apply_chat_template: + chat = data[input_key] + if isinstance(chat, str): + chat = [{"role": "user", "content": chat}] + prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True) + else: + prompt = data[input_key] + if input_template: + prompt = input_template.format(prompt) + return prompt +``` + +- We can use `--input_key` to specify the `JSON key name` of the input datasets `--prompt_data {name or path}` (PPO) or `--dataset {name or path}`, and use `--apply_chat_template` to utilize the `chat_template` from the [Huggingface Tokenizer](https://huggingface.co/docs/transformers/main/en/chat_templating). +- If you don't want to use `--apply_chat_template`, you can use `--input_template` instead, or preprocess the datasets offline in advance. +- OpenRLHF also support mixing multiple datasets using `--prompt_data_probs 0.1,0.4,0.5` (PPO) or `--dataset_probs 0.1,0.4,0.5`. + +How Chat Templating Works: + +```python +dataset = [{"input_key": [ + {"role": "user", "content": "Hello, how are you?"}, + {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, + {"role": "user", "content": "I'd like to show off how chat templating works!"}, +]}] + +tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False) + +"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]" +``` + +How to specify training and test datasets ? + +You can specify it using the `data_type@data_dir` format. For example, the dataset can be set as `--dataset json@./data`. + +``` +data +├── test.jsonl +└── train.jsonl +``` + +> [!NOTE] +> By default, we use `train` and `test` as splits to distinguish training and testing datasets from Huggingface. +> The ``JSON key`` options depends on the specific datasets. See [Reward Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/reward_dataset.py#L10) and [SFT Dataset](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/datasets/sft_dataset.py#L9) + +### Supervised Fine-tuning + +OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--pretrain {name or path}`, `--reward_pretrain {name or path}` and `--critic_pretrain {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https://huggingface.co/OpenRLHF). + +Then you can use the startup scripts we provide in the [examples/scripts](./examples/scripts/) directory, or start the training using the following commands. 
+ +```bash +deepspeed --module openrlhf.cli.train_sft \ + --max_len 4096 \ + --dataset Open-Orca/OpenOrca \ + --input_key question \ + --output_key response \ + --input_template $'User: {}\nAssistant: ' \ + --train_batch_size 256 \ + --micro_train_batch_size 2 \ + --max_samples 500000 \ + --pretrain meta-llama/Meta-Llama-3-8B \ + --save_path ./checkpoint/llama3-8b-sft \ + --save_steps -1 \ + --logging_steps 1 \ + --eval_steps -1 \ + --zero_stage 2 \ + --max_epochs 1 \ + --packing_samples \ + --bf16 \ + --flash_attn \ + --learning_rate 5e-6 \ + --gradient_checkpointing \ + --use_wandb {wandb_token} + +# Support HF tokenizer.apply_chat_template +# --apply_chat_template +# --tokenizer_chat_template {HF Chat Template} + +# Support RingAttention +# pip install ring_flash_attn +# --ring_attn_size 2 \ +# --ring_head_stride 2 \ + +# Multi-turn fine-tuning loss +# --multiturn + +# Can also be used for continued pre-training +# --pretrain_mode +``` + +> [!NOTE] +> OpenRLHF SFT/DPO/RewardModel/PPO trainers support `--packing_samples` [based on `--flash_attn`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing) + + +### Reward Model Training +```bash +deepspeed --module openrlhf.cli.train_rm \ + --save_path ./checkpoint/llama3-8b-rm \ + --save_steps -1 \ + --logging_steps 1 \ + --eval_steps -1 \ + --train_batch_size 256 \ + --micro_train_batch_size 1 \ + --pretrain OpenRLHF/Llama-3-8b-sft-mixture \ + --bf16 \ + --max_epochs 1 \ + --max_len 8192 \ + --zero_stage 3 \ + --learning_rate 9e-6 \ + --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \ + --apply_chat_template \ + --chosen_key chosen \ + --rejected_key rejected \ + --flash_attn \ + --packing_samples \ + --gradient_checkpointing \ + --use_wandb {wandb_token} + +``` + +It is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`: + +```python +reward_model = AutoModelForSequenceClassification.from_pretrained( + reward_model_path, + num_labels=1, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", + use_cache=False, + ) +inputs = xxxx (Left Padding Input Tokens) +reward = reward_model.model(*inputs).last_hidden_state +reward = reward_model.score(reward)[:, -1] +``` + +### PPO without Ray + +```bash +deepspeed --module openrlhf.cli.train_ppo \ + --pretrain OpenRLHF/Llama-3-8b-sft-mixture \ + --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \ + --save_path ./checkpoint/llama-3-8b-rlhf \ + --save_steps -1 \ + --logging_steps 1 \ + --eval_steps -1 \ + --micro_train_batch_size 2 \ + --train_batch_size 128 \ + --micro_rollout_batch_size 4 \ + --rollout_batch_size 1024 \ + --max_epochs 1 \ + --prompt_max_len 1024 \ + --generate_max_len 1024 \ + --zero_stage 2 \ + --bf16 \ + --actor_learning_rate 5e-7 \ + --critic_learning_rate 9e-6 \ + --init_kl_coef 0.01 \ + --prompt_data OpenRLHF/prompt-collection-v0.1 \ + --input_key context_messages \ + --apply_chat_template \ + --max_samples 100000 \ + --normalize_reward \ + --adam_offload \ + --flash_attn \ + --gradient_checkpointing \ + --use_wandb {wandb_token} + +# Support remote reward model (HTTP) +# --remote_rm_url http://localhost:5000/get_reward +``` + +### PPO/REINFORCE++ with Ray and vLLM + +To improve RLHF training speed or support 70B models, we can use the PPO with Ray and vLLM acceleration + +```bash +# launch the master node of ray in container +ray start --head --node-ip-address 0.0.0.0 --num-gpus 8 + +# if you want to launch ray on more nodes, 
use +ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8 + +ray job submit --address="http://127.0.0.1:8265" \ + --runtime-env-json='{"working_dir": "/openrlhf"}' \ + -- python3 -m openrlhf.cli.train_ppo_ray \ + --ref_num_nodes 1 \ + --ref_num_gpus_per_node 2 \ + --reward_num_nodes 1 \ + --reward_num_gpus_per_node 2 \ + --critic_num_nodes 1 \ + --critic_num_gpus_per_node 2 \ + --actor_num_nodes 1 \ + --actor_num_gpus_per_node 2 \ + --vllm_num_engines 2 \ + --vllm_tensor_parallel_size 2 \ + --colocate_critic_reward \ + --colocate_actor_ref \ + --pretrain OpenRLHF/Llama-3-8b-sft-mixture \ + --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \ + --save_path /openrlhf/examples/checkpoint/llama3-8b-rlhf \ + --micro_train_batch_size 8 \ + --train_batch_size 128 \ + --micro_rollout_batch_size 16 \ + --rollout_batch_size 1024 \ + --max_samples 100000 \ + --max_epochs 1 \ + --prompt_max_len 1024 \ + --generate_max_len 1024 \ + --zero_stage 3 \ + --bf16 \ + --actor_learning_rate 5e-7 \ + --critic_learning_rate 9e-6 \ + --init_kl_coef 0.01 \ + --prompt_data OpenRLHF/prompt-collection-v0.1 \ + --input_key context_messages \ + --apply_chat_template \ + --normalize_reward \ + --packing_samples \ + --adam_offload \ + --flash_attn \ + --gradient_checkpointing \ + --use_wandb {wandb_token} + +# Support REINFORCE++ | RLOO +# --advantage_estimator reinforce | rloo + +# Support remote reward model (HTTP) +# --remote_rm_url http://localhost:5000/get_reward + +# Support N samples +# --n_samples_per_prompt 4 +``` +> [!NOTE] +> Do not set `--vllm_num_engines` means not using the vLLM engine. +> You can also use ``setup_commands`` to let Ray automatically deploy the environment, such as `--runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'`. + +> [!NOTE] +> RLOO in OPENRLHF is a modification based on REINFORCE++, differing from the original version. + +> [!NOTE] +> If you you encounter an error related to index out of range when deepspeed sets up the GPU devices, you can try to set the environment variable [`RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES`](openrlhf/trainer/ray/utils.py) as a workaround. +> ```bash +> # For NVIDIA GPUs: +> export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 +> ``` + +The launch scripts and documents for supported algorithms are in [example/scripts](./examples/scripts/) and [Documents - Usage](https://openrlhf.readthedocs.io/en/latest/usage.html) + +### LoRA +If you use `LoRA (Low-Rank Adaptation)`, `OpenRLHF` will not save the full weights by default instead of `LoRA Adapter`. To continue in your task normally, you should combine the `Adapter` with weights of your base model + +```bash +python -m openrlhf.cli.lora_combiner \ + --model_path meta-llama/Meta-Llama-3-8B \ + --lora_path ./checkpoint/llama3-8b-rm \ + --output_path ./checkpoint/llama-3-8b-rm-combined \ + --is_rm \ + --bf16 +``` + +## Performance + +We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. 
The average time (seconds) it took to train 1024 prompts with 1 PPO epoch using the Optimized DSChat and OpenRLHF: + +| **Size** | **NVIDIA A800-80GB GPUs** | **Optimized DSChat (with Hybrid Engine)** | **OpenRLHF** | **Speedup** | +| :---: | :---: | :---: | :---: | :---: | +| 7B | 16 | 855.09 | 471.11 | 1.82x | +| 13B | 32 | 1528.93 | 608.93 | 2.5x | +| 34B | 32 | 3634.98 | 1526.4 | 2.4x | +| 70B | 32 | 10407.0 | 4488.53 | 2.3x | + +> [!NOTE] +> The data is outdated; please refer to the performance tuning section for re-testing. + +### Performance Tuning Guide + +To achieve optimal performance, we recommend allocating more nodes to the vLLM Engine. For example, for a 70B model with 32 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 8 GPUs to the Actor model, and the remaining 8 GPUs to the Critic model. Additionally, enable the `--colocate_critic_reward`, `--colocate_actor_ref` options to merge nodes. Finally, you should increase the `rollout_micro_batch_size` (and minimize the TP size of vLLM engine) as much as possible. During the training phase, a larger `--micro_train_batch_size` is better and enable `--packing_samples`. When there are enough GPUs, please disable `--adam_offload` and enable `--overlap_comm`. For multi-nodes RLHF, please use `--vllm_sync_backend nccl` with vLLM 0.6.4+. + +## Companies and Organizations using OpenRLHF + +- Google +- ByteDance +- Tencent +- Alibaba +- Baidu +- China Telecom +- Vivo +- Allen AI +- NexusFlow +- Jülich Supercomputing Centre (JSC) +- Berkeley Starling Team +- M-A-P +- ... + +## Join Us + +**How to Join?** + +1. Email us at janhu9527@gmail.com or join [GitHub Organization](https://github.com/OpenRLHF). Please include the following details: + - Your name + - Your GitHub username + - Your areas of interest + - Your skills and experience related to NLP and/or AI +1. You can also join us through the official GitHub [OpenRLHF ↗](https://github.com/OpenRLHF/OpenRLHF) project page. Just create an issue about your interest to contribute and we will get back to you. + +**What can you do?** + +1. Join the team and participate in the development of the OpenRLHF project. +1. Contribute to the project by submitting pull requests. +1. Help improve documentation, fix bugs, or create new features. +1. Share the project and help us grow the community. + +## Sponsor Us + +Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on [Open Collective ↗](https://opencollective.com/OpenRLHF). + +## Starchart + +[![Star History Chart](https://api.star-history.com/svg?repos=OpenRLHF/OpenRLHF&type=Date)](https://star-history.com/#OpenRLHF/OpenRLHF&Date) + +## Contributors + +A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue. 
+ + + + + +## References & Acknowledgements + +We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP: + +- [Hugging Face Transformers ↗](https://github.com/huggingface/transformers) +- [OpenAI GPT ↗](https://github.com/openai/gpt-3) +- [LLaMA ↗](https://llama.meta.com/) +- [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed) +- [Ray ↗](https://github.com/ray-project/ray) + +Our project would also like to thank [ColossalChat](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat) and [DeepSpeedChat](https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat). In the early stages of the project, we referred to their code design. + +(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF. + +## Citation +``` +@article{hu2024openrlhf, + title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework}, + author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao}, + journal={arXiv preprint arXiv:2405.11143}, + year={2024} +} +``` + +______________________________________________________________________ + +*OpenRLHF © 2025 OpenRLHF. All Rights Reserved.* diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md new file mode 100644 index 0000000000000000000000000000000000000000..8c541c924ca0a625e05c23adc1f730515ef62fb0 --- /dev/null +++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/public_address_statement.md @@ -0,0 +1,55 @@ + +| 类型 | 开源代码地址 | 文件名 | 公网IP地址/公网URL地址/域名/邮箱 | 用途说明 | +|:------:|:-------------------------:|:---------------------------------------------------------------------------------------------:|:--------------------:|:-----------------:| +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README_ja.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_ja.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/README_zh.md | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/README_zh.md | janhu9527@gmail.com | 参与OpenRLHF贡献联系邮箱 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_dpo.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_dpo.py | https://arxiv.org/pdf/2310.12036v2.pdf | IPO的论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_dpo.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_dpo.py | https://arxiv.org/pdf/2305.18290.pdf | DPO的论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/cli/train_ppo_ray.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/cli/train_ppo_ray.py | http://joschu.net/blog/kl-approx.html | 近似KL散度的链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/utils.py | http://joschu.net/blog/kl-approx.html | 近似KL散度的链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | 
https://github.com/huggingface/transformers/blob/405b56269812056d9593869e22b7b264d806cb1e/src/transformers/models/llama/modeling_llama.py | transformers三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://huggingface.co/docs/transformers/deepspeed#non-trainer-deepspeed-integration | transformers三方仓源码DeepSpeed解释链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/huggingface/peft/issues/137 | peft三方仓issue链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://huggingface.co/docs/transformers/main_classes/deepspeed#nontrainer-deepspeed-integration | transformers三方仓源码DeepSpeed解释链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/huggingface/transformers/issues/26877 | transformers三方仓issue链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/abs/2204.05862 | Pairwise Loss for Reward Model论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/huggingface/transformers/issues/26877 | transformers三方仓issue链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/pdf/2310.12036v2.pdf | IPO的论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://ericmitchell.ai/cdpo.pdf | cDPO的论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://arxiv.org/pdf/2305.18290.pdf | DPO的论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://github.com/ContextualAI/HALOs/blob/ca9b7e3eeea220c0944ad8095d641da33f907a7e/trainers.py | HALOs三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/model.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/model.py | https://github.com/OpenRLHF/OpenRLHF/issues/217 | OpenRLHF三方仓issue链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/OpenRLHF/OpenRLHF/issues/217 | OpenRLHF三方仓issue链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/actor.py | https://github.com/OpenRLHF/OpenRLHF/pull/634 | OpenRLHF三方仓pr链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/models/loss.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/models/loss.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | LMOps三方仓源码链接 | +| 开源代码引入 | 
https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ppo_utils/kl_controller.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ppo_utils/kl_controller.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | Adaptive KL controller的论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ppo_utils/experience_maker.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ppo_utils/experience_maker.py | https://github.com/microsoft/LMOps/blob/main/minillm/finetune.py | PPO的论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/nvidia_gpu.py | ray三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/amd_gpu.py | ray三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/npu.py | ray三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/hpu.py | ray三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/neuron.py | ray三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/tpu.py | ray三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/utils.py | https://github.com/ray-project/ray/blob/161849364a784442cc659fb9780f1a6adee85fce/python/ray/_private/accelerators/intel_gpu.py | ray三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/479d69fad0538f04cb22bf13e76ff91cfeb8a4e5 | vllm三方仓commit id链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/676a99982fe9aabe72fd52a91e08988a653a7359 | vllm三方仓commit id链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | 
https://github.com/vllm-project/vllm/pull/10555 | vllm三方仓pr链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/7206ce4ce112ed117796a59045c968a6d353f691 | vllm三方仓commit id链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/vllm_engine.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/vllm_engine.py | https://github.com/vllm-project/vllm/commit/eb6d3c264d0cd8e44dec16bca7947fbe96415ce9#diff-e1ad69e38e033accddfa5480ec808c4740eb39244d1ef51cc3407e20dde8cfd4 | vllm三方仓commit id链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/launcher.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/launcher.py | https://github.com/vllm-project/vllm/commit/eb6d3c264d0cd8e44dec16bca7947fbe96415ce9#diff-e1ad69e38e033accddfa5480ec808c4740eb39244d1ef51cc3407e20dde8cfd4 | custom resources解释链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/trainer/ray/ppo_actor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/trainer/ray/ppo_actor.py | https://github.com/vllm-project/vllm/blob/c6b0a7d3ba03ca414be1174e9bd86a97191b7090/vllm/worker/worker_base.py | vllm三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/logging_utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/logging_utils.py | https://github.com/skypilot-org/skypilot/blob/86dc0f6283a335e4aa37b3c10716f90999f48ab6/sky/sky_logging.py | skypilot三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/logging_utils.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/utils.py | https://github.com/facebookresearch/llama-recipes/pull/196 | llama-cookbook三方仓pr链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://arxiv.org/abs/2308.12050 | CA论文链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_sampler.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_sampler.py | https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/torch/utils/data/distributed.py | pytorch三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_sampler.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_sampler.py | https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py | pytorch三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://arxiv.org/abs/2307.09288 | pytorch三方仓源码链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/distributed_util.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/distributed_util.py | https://github.com/pytorch/pytorch/commit/a0c7029a75628cd5fa8df83c0de0ea98ee7fd844 | pytorch三方仓commit id链接 | +| 开源代码引入 | https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/processor.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/processor.py | https://github.com/RLHFlow/Online-RLHF/blob/main/run_loop.sh | OpenRLHF三方仓源码链接 | +| 开源代码引入 | 
https://github.com/OpenRLHF/OpenRLHF/blob/v0.5.7/openrlhf/utils/deepspeed/deepspeed.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/openrlhf/utils/deepspeed/deepspeed.py | https://github.com/RLHFlow/Online-RLHF/blob/main/run_loop.sh | DeepSpeed三方仓issue链接 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | http://www.apache.org/licenses/LICENSE-2.0 | Apache-2.0 License链接 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://qwenlm.github.io/blog/qwen2-vl/ | qwen2-vl关于旋转位置编码解释链接 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://pytorch.org/docs/stable/nn.html#torch.nn.Module | torch.nn解释链接 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://github.com/huggingface/transformers/pull/34852 | transformers三方仓pr链接 | +| 开源代码引入 | https://github.com/huggingface/transformers/blob/v4.51.0/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py | PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/transformers_need/modeling_qwen2_vl.py | https://github.com/pytorch/pytorch/issues/110213 | pytorch三方仓issue链接 | \ No newline at end of file diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt index be0bca46f79d16522aa21b3a915fe37d32492b15..e6a68fc37556e324df7a3c490edd1059986d51c0 100644 --- a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt +++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/requirements.txt @@ -14,7 +14,7 @@ peft pillow ray[default]==2.42.0 tensorboard -torch +torch==2.6.0 torchmetrics tqdm transformers_stream_generator diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh new file mode 100644 index 0000000000000000000000000000000000000000..f57f4bbaf6f7efb057bd325b145633056e289f2d --- /dev/null +++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/env_npu.sh @@ -0,0 +1,43 @@ +#!/bin/bash +CANN_INSTALL_PATH_CONF='/etc/Ascend/ascend_cann_install.info' + +if [ -f $CANN_INSTALL_PATH_CONF ]; then + CANN_INSTALL_PATH=$(cat $CANN_INSTALL_PATH_CONF | grep Install_Path | cut -d "=" -f 2) +else + CANN_INSTALL_PATH="/usr/local/Ascend" +fi + +if [ -d ${CANN_INSTALL_PATH}/ascend-toolkit/latest ]; then + source ${CANN_INSTALL_PATH}/ascend-toolkit/set_env.sh +else + source ${CANN_INSTALL_PATH}/nnae/set_env.sh +fi +msnpureport -g error -d 0 +msnpureport -g error -d 1 +msnpureport -g error -d 2 +msnpureport -g error -d 3 +msnpureport -g error -d 4 +msnpureport -g error -d 5 +msnpureport -g error -d 6 +msnpureport -g error -d 7 + +# 控制对输入数据为Inf/NaN的处理能力 +# 0:饱和模式,计算出现溢出时,计算结果会饱和为浮点数极值(+-MAX)。 +# 1:INf_NAN模式,根据定义输出Inf/NaN的计算结果。 +# Atlas训练系列仅支持饱和模式,Atlas A2/A3默认值为1,支持配置为0的饱和模式。 +export INF_NAN_MODE_ENABLE=1 +#将Host日志输出到串口,0-关闭/1-开启。指定0关闭日志打屏,即日志采用默认输出方式,将日志保存在log文件中。 +export ASCEND_SLOG_PRINT_TO_STDOUT=0 
+# Default log level: 0 = debug, 1 = info, 2 = warning, 3 = error. 3 prints only error-level logs; adjust as needed.
+export ASCEND_GLOBAL_LOG_LEVEL=3
+# Whether application-level Event logging is enabled: 0 = off, 1 = on (default 1). Set to 0 here to disable Event logs.
+export ASCEND_GLOBAL_EVENT_ENABLE=0
+# Whether the combined flag is enabled: 0 = off, 1 = on. 1 enables it, optimizing scenarios that combine two non-contiguous operators.
+export COMBINED_ENABLE=1
+# HCCL whitelist switch: 1 = disabled, 0 = enabled. Setting 1 skips verification of the HCCL communication whitelist.
+export HCCL_WHITELIST_DISABLE=1
+export HCCL_IF_IP=$(hostname -I |awk '{print $1}')
+# Number of operator info entries to cache
+export ACLNN_CACHE_LIMIT=100000
+# Run the Hugging Face datasets library in offline mode
+export HF_DATASETS_OFFLINE=1
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f11fd5038521307f68314aff721d9a5bcf1fd617
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_full_8p.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+echo "-------------------Start DPO Train-------------------"
+
+# Use all eight NPUs
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network name
+Network="dpo"
+
+# Default values
+max_epochs=3
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line arguments
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+        max_epochs="${para#*=}"
+    elif [[ $para == --pretrain_path=* ]]; then
+        pretrain_path="${para#*=}"
+    elif [[ $para == --dataset_path=* ]]; then
+        dataset_path="${para#*=}"
+    else
+        echo "Unknown parameter: $para" >&2
+    fi
+done
+
+# Check that pretrain_path and dataset_path were provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+    echo "Error: Both pretrain_path and dataset_path are required."
+    exit 1
+fi
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=$(pwd)
+else
+    test_path_dir=${cur_path}/test
+fi
+
+# Create the DPO training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+    rm -rf ${cur_path}/test/output/${Network}
+    mkdir -p ${cur_path}/test/output/${Network}
+else
+    mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat < ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Get the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time over the whole run
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+    sed 's/\x1b\[[0-9;]*m//g' |
+    awk -F "'train/step_time': '" '{print $2}' |
+    awk -F "'," '{print $1}' |
+    awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End DPO Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d9b3a99323be35ceae21895b5e205b04f24e9928
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_dpo_performance_8p.sh
@@ -0,0 +1,107 @@
+#!/bin/bash
+echo "-------------------Start DPO Train-------------------"
+
+# Use all eight NPUs
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network name
+Network="dpo"
+
+# Default values
+max_epochs=1
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line arguments
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+        max_epochs="${para#*=}"
+    elif [[ $para == --pretrain_path=* ]]; then
+        pretrain_path="${para#*=}"
+    elif [[ $para == --dataset_path=* ]]; then
+        dataset_path="${para#*=}"
+    else
+        echo "Unknown parameter: $para" >&2
+    fi
+done
+
+# Check that pretrain_path and dataset_path were provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+    echo "Error: Both pretrain_path and dataset_path are required."
+    exit 1
+fi
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=$(pwd)
+else
+    test_path_dir=${cur_path}/test
+fi
+
+# Create the DPO training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+    rm -rf ${cur_path}/test/output/${Network}
+    mkdir -p ${cur_path}/test/output/${Network}
+else
+    mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat < ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Get the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time over the whole run
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+    sed 's/\x1b\[[0-9;]*m//g' |
+    awk -F "'train/step_time': '" '{print $2}' |
+    awk -F "'," '{print $1}' |
+    awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End DPO Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..78f31c0417e784423c84cf56138d1380d43c01dd
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_full_8p.sh
@@ -0,0 +1,104 @@
+#!/bin/bash
+echo "-------------------Start SFT Train-------------------"
+
+# Use all eight NPUs
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network name
+Network="sft"
+
+# Default values
+max_epochs=3
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line arguments
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+        max_epochs="${para#*=}"
+    elif [[ $para == --pretrain_path=* ]]; then
+        pretrain_path="${para#*=}"
+    elif [[ $para == --dataset_path=* ]]; then
+        dataset_path="${para#*=}"
+    else
+        echo "Unknown parameter: $para" >&2
+    fi
+done
+
+# Check that pretrain_path and dataset_path were provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+    echo "Error: Both pretrain_path and dataset_path are required."
+    exit 1
+fi
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=$(pwd)
+else
+    test_path_dir=${cur_path}/test
+fi
+
+# Create the SFT training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+    rm -rf ${cur_path}/test/output/${Network}
+    mkdir -p ${cur_path}/test/output/${Network}
+else
+    mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat < ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Get the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time over the whole run
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+    sed 's/\x1b\[[0-9;]*m//g' |
+    awk -F "'train/step_time': '" '{print $2}' |
+    awk -F "'," '{print $1}' |
+    awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End SFT Train-------------------"
\ No newline at end of file
diff --git a/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..72e5cd3c6369d47459caa68db12dcca0b50056af
--- /dev/null
+++ b/PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch/test/train_qwen2_vl_sft_performance_8p.sh
@@ -0,0 +1,104 @@
+#!/bin/bash
+echo "-------------------Start SFT Train-------------------"
+
+# Use all eight NPUs
+export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+# Network name
+Network="sft"
+
+# Default values
+max_epochs=1
+pretrain_path=""
+dataset_path=""
+
+# Parse all command-line arguments
+for para in "$@"; do
+    if [[ $para == --max_epochs=* ]]; then
+        max_epochs="${para#*=}"
+    elif [[ $para == --pretrain_path=* ]]; then
+        pretrain_path="${para#*=}"
+    elif [[ $para == --dataset_path=* ]]; then
+        dataset_path="${para#*=}"
+    else
+        echo "Unknown parameter: $para" >&2
+    fi
+done
+
+# Check that pretrain_path and dataset_path were provided
+if [ -z "$pretrain_path" ] || [ -z "$dataset_path" ]; then
+    echo "Error: Both pretrain_path and dataset_path are required."
+    exit 1
+fi
+
+# cd to the directory at the same level as the test folder for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=$(pwd)
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ]; then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=$(pwd)
+else
+    test_path_dir=${cur_path}/test
+fi
+
+# Create the SFT training output directory; no modification needed
+if [ -d ${cur_path}/test/output/${Network} ]; then
+    rm -rf ${cur_path}/test/output/${Network}
+    mkdir -p ${cur_path}/test/output/${Network}
+else
+    mkdir -p ${cur_path}/test/output/${Network}
+fi
+
+source ${test_path_dir}/env_npu.sh
+
+training_commands=$(cat < ${cur_path}/test/output/${Network}/train_${Network}.log 2>&1 &
+fi
+
+wait
+
+# Training case information; no modification needed
+DeviceType=$(uname -m)
+CaseName=${Network}_info
+
+# Get the training log file
+source_log_file="${cur_path}/test/output/${Network}/train_${Network}.log"
+
+# Compute the average per-step time over the whole run
+tps=$(grep -a "'train/step_time':" "$source_log_file" |
+    sed 's/\x1b\[[0-9;]*m//g' |
+    awk -F "'train/step_time': '" '{print $2}' |
+    awk -F "'," '{print $1}' |
+    awk '{gsub(/s/, ""); sum+=$1; count++} END {if (count>0) print sum/count; else print "No data"}')
+echo "Second Per Step: $tps"
+
+# Write key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "Second Per Step = ${tps}" >> ${cur_path}/test/output/${Network}/${CaseName}.log
+echo "-------------------End SFT Train-------------------"
\ No newline at end of file
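
For reference, a minimal usage sketch for the test scripts added above. The model and dataset paths are placeholders, not values taken from this change; only the --pretrain_path, --dataset_path and --max_epochs options that the scripts actually parse are shown.

    # Run from the model directory so the scripts resolve test/ correctly
    cd PyTorch/built-in/rl/OpenRLHF_v0.5.7_for_PyTorch
    # Full DPO training (defaults to 3 epochs)
    bash test/train_qwen2_vl_dpo_full_8p.sh --pretrain_path=/path/to/Qwen2-VL-model --dataset_path=/path/to/dpo_dataset
    # SFT performance run (defaults to 1 epoch); max_epochs may be overridden explicitly
    bash test/train_qwen2_vl_sft_performance_8p.sh --pretrain_path=/path/to/Qwen2-VL-model --dataset_path=/path/to/sft_dataset --max_epochs=1

Each script writes its training log to test/output/<dpo|sft>/train_<dpo|sft>.log and reports the average per-step time to test/output/<dpo|sft>/<dpo|sft>_info.log when the run finishes.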