
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models (2402.10038v2)

Published 15 Feb 2024 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Reinforcement learning from human feedback (RLHF) has been extensively employed to align LLMs with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requires significant hyperparameter tuning, and is computationally expensive when maximizing the estimated reward during alignment. Recently, direct preference optimization (DPO) has been proposed to address those challenges. However, DPO relies on contrastive responses generated by human annotators and an alternative LLM, rather than the policy model, limiting the effectiveness of RLHF. In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO. Our proposed method, RS-DPO, begins with the development of a supervised fine-tuned policy model (SFT). A varied set of k responses per prompt is sampled directly from the SFT model. RS-DPO identifies pairs of contrastive samples based on their reward distribution. Finally, we apply DPO with the contrastive samples to align the model to human preference. Our experiments indicate that our proposed method effectively fine-tunes LLMs in limited-resource environments, leading to improved alignment with user intent. Furthermore, it outperforms existing methods, including RS, PPO, and DPO.

The paper "RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of LLMs" explores the challenges and advancements in the domain of aligning LLMs with user intent through reinforcement learning from human feedback (RLHF). Commonly, Proximal Policy Optimization (PPO) is utilized for RLHF but it presents issues of instability, need for significant hyperparameter tuning, and high computational cost during the alignment phase.

To address these obstacles, Direct Preference Optimization (DPO) has been proposed as an alternative. However, DPO has its own limitation: it relies on contrastive responses generated by human annotators and alternative LLMs rather than by the policy model itself, which hampers its effectiveness.
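
For context, DPO replaces an explicit reward model and PPO loop with a single classification-style loss over preference pairs. Given a prompt x with a preferred response y_w and a dispreferred response y_l, a policy π_θ, and a frozen reference policy π_ref (typically the SFT model), the standard DPO objective is:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

where σ is the logistic function and β controls how far the policy may drift from the reference model. RS-DPO keeps this objective but changes how the pairs (y_w, y_l) are constructed.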

The authors propose a hybrid approach termed RS-DPO, which integrates Rejection Sampling (RS) with DPO to leverage the strengths of both methods. The process begins with training a supervised fine-tuned (SFT) policy model. From this SFT model, a diverse set of k responses is sampled for each prompt. Rejection sampling is then used to select pairs of contrastive samples based on their reward distribution, and DPO is applied to these selected pairs to align the LLM with human preferences (a rough sketch of the pair-selection step follows below).
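
The pair-selection step can be made concrete with a short sketch. This is an illustrative approximation rather than the authors' exact procedure: `sft_generate`, `reward_score`, and the reward-gap threshold `eta` are assumed placeholder helpers and parameters.

```python
def build_dpo_pairs(prompts, sft_generate, reward_score, k=8, eta=0.5):
    """Rejection-sampling style pair construction for DPO (illustrative sketch).

    prompts:      list of prompt strings
    sft_generate: fn(prompt, n) -> list of n responses sampled from the SFT policy (assumed helper)
    reward_score: fn(prompt, response) -> scalar reward from a reward model (assumed helper)
    k:            number of responses sampled per prompt
    eta:          minimum reward gap required to keep a (chosen, rejected) pair
    """
    pairs = []
    for prompt in prompts:
        # Sample k candidate responses directly from the SFT policy
        responses = sft_generate(prompt, k)
        scored = [(reward_score(prompt, r), r) for r in responses]

        # Consider all ordered candidate pairs; keep those whose reward gap meets the threshold
        for r_hi, y_hi in scored:
            for r_lo, y_lo in scored:
                if r_hi - r_lo >= eta:
                    pairs.append({"prompt": prompt, "chosen": y_hi, "rejected": y_lo})
    return pairs
```

Each resulting {"prompt", "chosen", "rejected"} record matches the format expected by common DPO implementations; raising eta yields fewer but cleaner pairs, while lowering it yields more but noisier ones.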

Experimental results presented in the paper demonstrate that RS-DPO is effective in fine-tuning LLMs, even in resource-constrained environments. It showcases superior performance compared to existing methods including RS, PPO, and DPO. The hybrid method thus offers a more stable and computationally efficient approach for the alignment of LLMs, leading to improved adherence to user intent with reduced resource expenditure.

Authors (5)
  1. Saeed Khaki (15 papers)
  2. Lan Ma (31 papers)
  3. Liu Yang (194 papers)
  4. Prathap Ramachandra (2 papers)
  5. Jinjin Li (17 papers)
Citations (10)