
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment (2405.19332v3)

Published 29 May 2024 in cs.LG and cs.AI

Abstract: Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning LLMs to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring LLMs (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.

An Expert Overview of the SELM Framework for Preference Optimization in LLMs

The paper introduces Self-Exploring LLMs (SELM), a framework for preference optimization that integrates active exploration into Reinforcement Learning from Human Feedback (RLHF). The goal is to produce LLMs that are better aligned with human intentions and more effective on a range of instruction-following benchmarks.

Core Approach and Theoretical Foundations

The SELM framework is built on the premise that online feedback collection, rather than reliance on a fixed dataset, tends to yield more capable reward models and better-aligned LLMs. Traditional RLHF procedures, however, often converge to local optima because the response data they are trained on lack diversity. SELM addresses this by adding an optimism term to the reward-fitting objective, thereby encouraging exploration of out-of-distribution (OOD) responses.
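To illustrate the online, iterative setup described above, the sketch below outlines one round of preference collection and policy updating. It is a minimal skeleton, not the authors' implementation: the callables passed in (generate, judge, update) and the option of an AI judge are assumptions made for exposition.

```python
# Minimal sketch of one online alignment round (illustrative; not the authors' code).
# The callables stand in for components the setup assumes: response sampling,
# human/AI preference feedback, and a preference-based policy update.

from typing import Callable, List, Tuple

PreferencePair = Tuple[str, str, str]  # (prompt, chosen_response, rejected_response)


def online_alignment_round(
    policy,
    prompts: List[str],
    generate: Callable[[object, str], Tuple[str, str]],
    judge: Callable[[str, str, str], Tuple[str, str]],
    update: Callable[[object, List[PreferencePair]], object],
):
    """Collect fresh preference pairs from the current policy, then update it."""
    pairs: List[PreferencePair] = []
    for x in prompts:
        # Sample two candidate responses from the *current* policy so feedback is
        # gathered online, on the model's own distribution, not from a fixed dataset.
        y_a, y_b = generate(policy, x)

        # A human or AI judge labels which response is preferred.
        chosen, rejected = judge(x, y_a, y_b)
        pairs.append((x, chosen, rejected))

    # Update the policy on the newly collected pairs (e.g., with a DPO-style or
    # exploration-aware objective), then repeat in the next iteration.
    return update(policy, pairs)
```

In SELM, the update step uses the optimism-augmented objective described next, so that each round also steers sampling toward unexplored, potentially high-reward responses.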

The paper introduces a bilevel optimization objective that incorporates an optimism term $\alpha \max_y r(x, y)$. This addition biases the reward model toward potentially high-reward responses that have not yet been explored, enabling more effective and dynamic learning. The resulting algorithm, SELM, reparameterizes the reward function to eliminate the need for a separate reward model (RM), which simplifies the objective.
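To make this concrete, the bilevel structure can be written schematically as follows. This rendering is based on the description above together with the standard Bradley-Terry reward-fitting loss and KL-regularized policy objective; the paper's exact formulation (expectations, regularization, and how the inner maximum is handled) may differ in detail.

\[
\pi_{t+1} = \arg\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r_t(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)\big],
\]
\[
\text{where} \quad r_t = \arg\max_{r} \; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_t}\big[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\big] \;+\; \alpha\, \mathbb{E}_{x \sim \mathcal{D}}\Big[\max_{y} r(x, y)\Big].
\]

Substituting the DPO-style reparameterization $r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ into the inner problem removes the separate reward model and leaves a single objective over the policy parameters, which SELM then optimizes iteratively.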

Empirical Validation

Experimental analyses validate the efficacy of SELM across multiple benchmarks. The framework was applied to the Zephyr-7B-SFT and Llama-3-8B-Instruct models and yielded substantial gains on instruction-following tasks such as MT-Bench and AlpacaEval 2.0. Specifically, SELM outperforms the baseline iterative Direct Preference Optimization (DPO) by margins of +16.24% and +11.75% on AlpacaEval 2.0, and by +2.31 and +0.32 on MT-Bench, for the two base models respectively.

Additionally, SELM performed strongly across various standard academic benchmarks, achieving improvements in zero-shot, few-shot, and Chain-of-Thought (CoT) settings. These gains were consistent across training iterations, underscoring the robustness and reliability of the methodology.

Implications and Future Directions

Theoretically, SELM carries notable implications for AI alignment: by actively exploring OOD regions, it reduces the risk of the policy getting stuck in local optima and raises the likelihood of discovering globally optimal responses. Practically, integrating optimism into the RLHF process offers a more efficient pathway for fine-tuning LLMs, which matters for tasks that demand high adaptability and precision.

The SELM framework also highlights the potential for integrating this optimism-based exploration with other contemporary online RLHF methodologies, suggesting that future research could explore the synergistic effects of combining SELM with other sophisticated alignment techniques.

Conclusion

In summary, the SELM framework introduces a novel and effective approach to preference optimization in LLMs. By leveraging active exploration through an optimism-biased objective, SELM significantly enhances the alignment and performance of LLMs across a range of benchmarks. This research points toward further developments in AI alignment that rely on dynamic, exploration-based strategies for preference optimization. The code and models associated with the paper are available at the SELM GitHub repository (https://github.com/shenao-zhang/SELM), providing a valuable resource for further research and application.

Authors (9)
  1. Shenao Zhang
  2. Donghan Yu
  3. Hiteshi Sharma
  4. Ziyi Yang
  5. Shuohang Wang
  6. Hany Hassan
  7. Zhaoran Wang
  8. Han Zhong
  9. Zhihan Liu