
On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization (2405.16455v1)

Published 26 May 2024 in stat.ML, cs.LG, and stat.ME

Abstract: Accurately aligning LLMs with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.

An In-depth Analysis of Algorithmic Bias in RLHF for LLMs

Overview

The paper "On the Algorithmic Bias of Aligning LLMs with RLHF: Preference Collapse and Matching Regularization" explores the issue of algorithmic bias in aligning LLMs with human preferences through reinforcement learning from human feedback (RLHF). The central assertion is that the prevalent RLHF approach, which employs Kullback-Leibler (KL) divergence-based regularization, introduces inherent biases that can lead to what the authors term "preference collapse." This phenomenon results in the near-total disregard of minority preferences. To address this, the authors propose a novel method called preference matching (PM) RLHF, which aims to align LLMs accurately with the distribution of preferences expressed by a reward model.

Key Contributions

The authors identify the primary source of bias in RLHF as its KL divergence-based regularization, which uses a pretrained LLM as a reference model and thereby carries the reference model's biases into the final aligned LLM. The bias can become so severe that minority preferences collapse entirely in favor of the majority.
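
For context on where this bias enters, the standard RLHF objective regularizes the policy toward a reference model through a KL term. The display below is the common textbook form of that objective (the paper's notation may differ superficially); its well-known closed-form maximizer reweights the reference policy by exponentiated rewards, which is exactly how the reference model's preferences, and hence its biases, carry over into the aligned model.

```latex
% Standard KL-regularized RLHF objective for a prompt x
% (common textbook form; the paper's notation may differ slightly):
\[
  \max_{\pi}\;\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  \;-\; \beta \, \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)
\]
% Its closed-form maximizer multiplies the reference policy by exponentiated rewards,
% so whatever the reference model prefers remains baked into the aligned model:
\[
  \pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\bigl( r(x, y) / \beta \bigr).
\]
```

Because $\pi_{\mathrm{ref}}$ appears as a multiplicative factor in the optimum, responses the reference model dislikes stay down-weighted regardless of their reward, which is the mechanism the authors set out to remove.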

The key contributions of the paper include:

  1. Introduction of PM RLHF: The authors propose PM RLHF as a method to eliminate the algorithmic bias inherent in standard RLHF. This technique involves a PM regularizer based on the negative logarithm of the LLM's policy probability distribution over responses.
  2. Theoretical Foundation: The paper establishes a theoretical basis for PM RLHF by solving an ordinary differential equation necessary for the PM property. This framework ensures that the LLM's output distribution matches the human preference distribution given by the reward model.
  3. Conditional Variant: For practical implementation, the authors propose a conditional variant of PM RLHF tailored to natural language generation. This variant penalizes responses with low probabilities according to a reference model, effectively filtering out unnatural or nonsensical outputs.
  4. Empirical Validation: Empirical results show significant improvements in alignment with human preferences. The proposed PM RLHF approach led to a 29% to 41% reduction in preference matching divergence compared to standard RLHF in experiments with the OPT-1.3B and Llama-2-7B models.

Methodological Insight

The PM RLHF method diverges from standard RLHF by directly addressing the distribution of preferences. The regularization term $R(\pi)$, derived from solving a differential equation, ensures that the optimization aligns with the preference distribution modeled by the reward function $r(x, y)$. Specifically, $R(\pi) = -\log(\pi) + C_{1,x} + C_{2,x}/\pi$, where $C_{1,x}$ and $C_{2,x}$ are constants that may depend on the prompt $x$.

This formulation ensures that the LLM not only maximizes the reward but also maintains diverse responses, preventing the exclusive preference of majority opinions.
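
To make the preference-matching property concrete, the sketch below works under two simplifying assumptions: the response set is finite, and the constants $C_{1,x}$ and $C_{2,x}$ are dropped because their contributions to the objective do not depend on the maximizing policy. With $R(\pi(y \mid x)) = -\log \pi(y \mid x)$, the per-prompt objective becomes a reward-plus-entropy problem whose maximizer is the softmax of the rewards, i.e., the Plackett-Luce preference distribution induced by the reward model.

```latex
% Sketch of the preference-matching property (finite response set; the constants
% C_{1,x} and C_{2,x} are omitted since they do not affect the maximizer):
\[
  \max_{\pi}\;\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) - \log \pi(y \mid x) \bigr]
  \quad\Longrightarrow\quad
  \pi^{*}(y \mid x) \;=\; \frac{\exp\bigl( r(x, y) \bigr)}{\sum_{y'} \exp\bigl( r(x, y') \bigr)},
\]
% which is the Plackett--Luce preference distribution of the reward model itself,
% with no reweighting by a reference policy.
```

This contrasts with the KL-regularized optimum above, where $\pi_{\mathrm{ref}}$ enters as a multiplicative factor.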

Addressing Practical Challenges

A challenge noted in applying PM RLHF directly is the naturalness of the generated text. To address this, the authors introduce conditional PM RLHF, a variant in which responses deemed nonsensical or meaningless by a reference model are heavily penalized and thus effectively excluded. This conditional approach balances reward maximization with response naturalness.
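
As an illustration only, the sketch below shows one way such a conditional penalty could be wired into the per-response training signal. The function name, the threshold `tau`, and the penalty value are hypothetical choices for this sketch, not the paper's exact formulation.

```python
def conditional_pm_reward(reward, logprob_policy, logprob_ref,
                          tau=-50.0, penalty=-100.0):
    """Illustrative per-response training reward for conditional PM RLHF.

    reward         -- reward-model score r(x, y)
    logprob_policy -- log pi_theta(y | x) under the policy being trained
    logprob_ref    -- log pi_ref(y | x) under the reference (pretrained/SFT) model
    tau, penalty   -- hypothetical threshold and penalty values; the paper's
                      exact conditional formulation differs in detail
    """
    if logprob_ref < tau:
        # The reference model judges the response unnatural: penalize it heavily
        # so the policy does not drift toward nonsensical text.
        return penalty
    # Otherwise use the PM-shaped reward: the reward-model score plus the negative
    # log policy probability, which encourages response diversity.
    return reward - logprob_policy
```

In a PPO-style training loop, this shaped reward would play the role that the KL-penalized reward plays in standard RLHF.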

Empirical Results

The empirical results were robust, demonstrating that conditional PM RLHF substantially reduces preference matching divergence. In experiments, the divergence metrics for the aligned models showed that the PM RLHF approach significantly outperformed standard RLHF across multiple configurations and values of $\beta$.
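
The precise definition of the preference matching divergence is given in the paper; as a rough illustrative proxy only (an assumption of this sketch, not the paper's metric), one could compare the policy's renormalized distribution over a fixed candidate set with the reward-induced softmax distribution and report their KL divergence:

```python
import math

def pm_divergence(policy_logprobs, rewards):
    """Illustrative proxy for a preference-matching divergence on one prompt.

    policy_logprobs -- log pi(y_i | x) for a fixed candidate set {y_1, ..., y_K}
    rewards         -- reward-model scores r(x, y_i) for the same candidates

    Returns KL(p_reward || p_policy), where p_reward is the softmax of the rewards
    (the Plackett-Luce preference distribution) and p_policy is the policy's
    distribution renormalized over the candidates. Zero means exact matching.
    NOTE: a hypothetical sketch; the paper's metric may be defined differently.
    """
    # Target preference distribution: softmax of the reward-model scores.
    zr = max(rewards)
    exp_r = [math.exp(r - zr) for r in rewards]
    p_reward = [e / sum(exp_r) for e in exp_r]

    # Policy distribution renormalized over the same candidate set.
    zp = max(policy_logprobs)
    exp_p = [math.exp(lp - zp) for lp in policy_logprobs]
    p_policy = [e / sum(exp_p) for e in exp_p]

    # KL divergence between the two distributions.
    return sum(p * math.log(p / q) for p, q in zip(p_reward, p_policy))

# Example with three candidate responses for a single prompt.
print(pm_divergence(policy_logprobs=[-1.2, -0.8, -2.5], rewards=[0.3, 1.1, -0.4]))
```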

Interestingly, there was a trade-off observed between preference alignment and generative performance. While the PM RLHF models excelled in aligning with human preferences, they also exhibited changes in metrics like perplexity, reflecting the nuanced balance between these objectives.

Implications and Future Directions

The findings of this paper have profound implications for both practical and theoretical domains. Practically, improving the alignment of LLMs with diverse human preferences can lead to fairer and more effective decision-making systems in various applications. Theoretically, the introduction of PM RLHF opens new avenues for further research into RLHF methodologies and their inherent biases.

Future research could explore several directions:

  1. Scaling Up: Applying PM RLHF to larger industrial-level LLMs such as GPT-4 or Claude-3 Opus could help to better understand its impact on more complex models.
  2. Diverse Human Preferences: Extending PM RLHF to incorporate multiple reward models could address preference matching more finely when faced with heterogeneous human preferences.
  3. Generalized Models: Investigating generalized preference models beyond the Plackett-Luce (PL) model could yield insights into the adaptability and effectiveness of PM regularization in various contexts.
  4. Direct Preference Optimization (DPO): Developing a DPO counterpart of PM RLHF could benefit scenarios where computational efficiency is critical.
  5. Length Sensitivity: Exploring the impact of response length on preference alignment could further refine PM RLHF to handle biases arising from varied response lengths.

Conclusion

The paper makes a significant contribution to the field of aligning LLMs with human preferences by identifying and addressing the intrinsic algorithmic biases in standard RLHF. The proposed PM RLHF method offers a principled approach to achieving unbiased preference alignment, backed by strong theoretical foundations and empirical validation. This work not only advances the understanding of RLHF methodologies but also paves the way for developing fairer and more effective AI systems.

Authors (7)
  1. Jiancong Xiao
  2. Ziniu Li
  3. Xingyu Xie
  4. Emily Getzen
  5. Cong Fang
  6. Qi Long
  7. Weijie J. Su