
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (2405.11143v1)

Published 20 May 2024 in cs.AI, cs.CL, and cs.LG

Abstract: As LLMs continue to grow by scaling laws, reinforcement learning from human feedback (RLHF) has gained significant attention due to its outstanding performance. However, unlike pretraining or fine-tuning a single model, scaling reinforcement learning from human feedback (RLHF) for training LLMs poses coordination challenges across four models. We present OpenRLHF, an open-source framework enabling efficient RLHF scaling. Unlike existing RLHF frameworks that co-locate four models on the same GPUs, OpenRLHF re-designs scheduling for the models beyond 70B parameters using Ray, vLLM, and DeepSpeed, leveraging improved resource utilization and diverse training approaches. Integrating seamlessly with Hugging Face, OpenRLHF provides an out-of-the-box solution with optimized algorithms and launch scripts, which ensures user-friendliness. OpenRLHF implements RLHF, DPO, rejection sampling, and other alignment techniques. Empowering state-of-the-art LLM development, OpenRLHF's code is available at https://github.com/OpenLLMAI/OpenRLHF.

OpenRLHF: Making Reinforcement Learning from Human Feedback More Accessible

Introduction

Reinforcement learning from human feedback (RLHF) has been getting a lot of buzz in the AI community for its ability to align LLMs with human values and intentions. However, training colossal models through RLHF isn't a walk in the park, especially once parameter counts exceed 70 billion. Unlike pretraining or fine-tuning a single model, RLHF means juggling four models (actor, critic, reward, and reference) along with their resource-intensive computations, which introduces a slew of logistical challenges.

Enter OpenRLHF, an open-source framework engineered specifically for scaling RLHF. Instead of co-locating models on the same GPUs, which quickly becomes inefficient as models grow larger, OpenRLHF leverages advanced scheduling techniques using Ray, vLLM, and DeepSpeed. The framework seamlessly integrates with Hugging Face, making it a user-friendly, out-of-the-box RLHF solution.

Design of OpenRLHF

Scheduling Optimization

Traditional RLHF training frameworks like ColossalChat and DeepSpeed-Chat often fall short when dealing with models above 70 billion parameters. They rely on techniques such as the Zero Redundancy Optimizer (ZeRO) to co-locate the four models (actor, critic, reward, and reference) on the same GPUs. As models grow, this setup becomes increasingly inefficient because GPU memory is limited.

OpenRLHF addresses these challenges by leveraging Ray for scheduling, vLLM for efficient inference, and DeepSpeed for optimized training. Ray's scheduling capabilities enable OpenRLHF to distribute the models across multiple GPUs, avoiding the pitfalls of memory constraints. This approach not only improves resource utilization but also facilitates the integration of multiple reward models, a key component in several alignment strategies.

Here are some notable features:

  • Flexible Model Placement: Models can be freely merged onto shared GPUs or offloaded to save GPU resources (a placement sketch follows this list).
  • Multiple Reward Models: Supports alignment strategies that score criteria such as helpfulness and harmlessness with separate reward models.
  • Optimized Orchestration: Improves training performance by coordinating models and GPUs across nodes.
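
To make the flexible placement concrete, here is a minimal sketch of how Ray can pin each of the four RLHF models to its own GPU-backed worker. The ModelWorker class and its forward method are hypothetical stand-ins, not OpenRLHF's actual classes; the real framework wraps DeepSpeed training engines and vLLM inference engines in Ray actors.

import ray

ray.init()  # assumes a Ray cluster (or local machine) with at least four GPUs

@ray.remote(num_gpus=1)
class ModelWorker:
    """Hypothetical stand-in for one of the four RLHF models, pinned to one GPU."""
    def __init__(self, role: str):
        self.role = role

    def forward(self, batch):
        # A real worker would run a DeepSpeed or vLLM engine here.
        return f"{self.role} processed {len(batch)} samples"

# Ray's scheduler places each actor on a separate GPU, so the four models
# no longer have to share memory on the same device.
workers = {role: ModelWorker.remote(role)
           for role in ("actor", "critic", "reward", "reference")}

batch = list(range(8))
print(ray.get([w.forward.remote(batch) for w in workers.values()]))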

Performance Optimization

One of the main bottlenecks in RLHF is the sample generation stage, which can consume up to 80% of the training time. OpenRLHF tackles this by employing several performance-boosting techniques:

  1. Inference Optimization: Uses vLLM's tensor parallelism and advanced batching to accelerate the sample generation process (a generation sketch follows this list).
  2. Memory Management: Offloads the Adam optimizer states to CPU, freeing GPU memory for larger batch sizes.
  3. Flash Attention 2: Uses FlashAttention-2 kernels to speed up Transformer attention during training.
  4. Padding Removal: Uses PyTorch tensor slicing to strip redundant padding from training samples (a toy example appears after the next paragraph).

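As a rough illustration of the first point, the snippet below drives vLLM directly to generate rollout samples in a batch. The model name, prompts, and sampling settings are placeholders, and OpenRLHF manages its vLLM engines through Ray rather than calling them inline like this.

from vllm import LLM, SamplingParams

# Placeholder model and settings; larger models are sharded across GPUs
# by raising tensor_parallel_size.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=1)
params = SamplingParams(temperature=1.0, top_p=0.9, max_tokens=256)

prompts = [
    "Summarize RLHF in one sentence.",
    "Why is sample generation the bottleneck in PPO training?",
]
outputs = llm.generate(prompts, params)  # batched generation across all prompts
for out in outputs:
    print(out.outputs[0].text)
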
The impact of these optimizations is substantial. For example, in the paper's benchmarks, training a 70B model with OpenRLHF was up to 2.3x faster than a tuned version of DeepSpeed-Chat (DSChat).
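
Returning to the padding-removal point above (item 4), the toy example below shows the underlying idea: slice out only the real tokens of each padded sample and keep the sequence boundaries, so no compute is wasted on pad positions. The tensors are made up for illustration and this is not OpenRLHF's exact code.

import torch

pad_id = 0
batch = torch.tensor([[5, 6, 7, 0, 0],
                      [8, 9, 0, 0, 0],
                      [1, 2, 3, 4, 0]])

mask = batch.ne(pad_id)          # True where tokens are real
packed = batch[mask]             # 1-D tensor containing only real tokens
# Cumulative sequence lengths mark where each sample starts and ends,
# which is what variable-length attention kernels consume.
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                        mask.sum(dim=1).cumsum(dim=0)])
print(packed)      # tensor([5, 6, 7, 8, 9, 1, 2, 3, 4])
print(cu_seqlens)  # tensor([0, 3, 5, 9])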

Training Stability

Reinforcement learning can be notoriously unstable, especially for large models. To mitigate these risks, OpenRLHF incorporates several stability tricks in its PPO implementation. These include fine-tuning algorithmic parameters and employing techniques like gradient clipping and reward normalization. The framework also ensures weight synchronization between the ZeRO and vLLM engines using NVIDIA's NCCL, making the integration fast and reliable.
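
The reward normalization and gradient clipping mentioned here are easy to show in isolation. The snippet below is a toy sketch, not OpenRLHF's PPO trainer: the linear "policy" and the loss are placeholders, and only the normalization and clipping calls reflect the stability tricks described above.

import torch
import torch.nn as nn

def normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Zero-mean, unit-variance normalization applied to rewards/advantages."""
    return (x - x.mean()) / (x.std() + eps)

policy = nn.Linear(16, 4)                          # toy stand-in for the actor
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

rewards = torch.randn(32)                          # raw scores from a reward model
advantages = normalize(rewards)                    # normalized before the PPO loss

logits = policy(torch.randn(32, 16))
log_probs = logits.log_softmax(dim=-1).max(dim=-1).values  # placeholder log-probs
loss = -(advantages * log_probs).mean()                    # placeholder policy loss
loss.backward()

# Clip the global gradient norm so a single noisy batch cannot blow up the update.
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
optimizer.step()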

Ease of Use

Ease of use is one of OpenRLHF's standout features. It comes with one-click training scripts that are fully compatible with Hugging Face models. Users only need to specify the model and dataset paths to start RLHF training. Here is a quick example:

# The reference, reward, critic, and actor models each get one Ray node;
# four vLLM engines with tensor parallelism of 4 handle sample generation.
ray job submit -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --reward_num_nodes 1 \
    --critic_num_nodes 1 \
    --actor_num_nodes 1 \
    --vllm_num_engines 4 \
    --vllm_tensor_parallel_size 4 \
    --pretrain meta-llama/Llama-2-70b-chat-hf \
    --reward_pretrain meta-llama/Llama-2-70b-chat-hf \
    --prompt_data Open-Orca/OpenOrca

This script covers the basic configuration needed to get Llama-2's 70B model running with PPO on OpenRLHF. The framework also supports a variety of algorithms out-of-the-box, including Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO).
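
For readers curious about what the DPO objective looks like, here is a minimal reference implementation of the standard loss from Rafailov et al. It is a generic sketch driven by random log-probabilities, not OpenRLHF's DPO trainer.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected one,
    measured as log-probability margins against a frozen reference model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with random per-sequence log-probabilities for a batch of 8 pairs.
rand = lambda: torch.randn(8)
print(dpo_loss(rand(), rand(), rand(), rand()).item())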

Conclusion

OpenRLHF brings significant advancements to the RLHF landscape by focusing on efficient model scheduling, robust performance optimization, and user-friendly implementation. Its design makes it feasible to effectively train and align LLMs beyond 70 billion parameters. As AI models continue to grow in size and complexity, OpenRLHF's approach could become a crucial tool for aligning them with human values while scaling efficiently and keeping training stable.

Future iterations of OpenRLHF might push these boundaries even further, integrating more advanced algorithms and optimizations. For now, it stands as a robust framework, enabling researchers and practitioners to scale their RLHF endeavors seamlessly.

References (29)
  1. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
  2. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  3. T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  4. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  5. Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf. arXiv preprint arXiv:2310.05344, 2023.
  6. Implementation matters in deep policy gradients: A case study on ppo and trpo. arXiv preprint arXiv:2005.12729, 2020.
  7. E. Mitchell. A note on DPO with noisy preferences and relationship to IPO. https://ericmitchell.ai/cdpo.pdf, 2023. Accessed: November 25, 2023.
  8. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
  9. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  10. Aligning language models with offline reinforcement learning from human feedback. arXiv preprint arXiv:2308.12050, 2023.
  11. The N implementation details of RLHF with PPO. In ICLR Blogposts 2024, 2024. URL https://iclr-blogposts.github.io/2024/blog/the-n-implementation-details-of-rlhf-with-ppo/.
  12. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  13. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
  14. Colossal-ai: A unified deep learning system for large-scale parallel training. In Proceedings of the 52nd International Conference on Parallel Processing, pages 766–775, 2023a.
  15. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
  16. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024.
  17. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.
  18. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  19. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  20. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
  21. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
  22. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  23. Discriminative adversarial search for abstractive summarization. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8555–8564. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/scialom20a.html.
  24. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  25. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  26. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  27. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, Oct. 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  28. Deepspeed-chat: Easy, fast and affordable rlhf training of chatgpt-like models at all scales. arXiv preprint arXiv:2308.01320, 2023.
  29. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
Authors (6)
  1. Jian Hu (40 papers)
  2. Xibin Wu (5 papers)
  3. Weixun Wang (31 papers)
  4. Xianyu (1 paper)
  5. Dehao Zhang (11 papers)
  6. Yu Cao (129 papers)
Citations (22)