
Understanding the Effects of RLHF on LLM Generalisation and Diversity (2310.06452v3)

Published 10 Oct 2023 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT or Anthropic's Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity.

Analysis of RLHF on LLM Generalisation and Diversity

The paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity" provides a comprehensive analysis of the impact of Reinforcement Learning from Human Feedback (RLHF) on LLMs, particularly focusing on their generalisation capabilities and output diversity. The paper assesses these effects across various fine-tuning stages and methodologies, contrasting RLHF with supervised fine-tuning (SFT) and Best-of-N (BoN) sampling, encompassing tasks such as text summarisation and instruction following.

Generalisation and Diversity

The paper dissects the trade-off between generalisation (how well an LLM performs on new, unseen data distributions) and output diversity (the range of distinct outputs the model can generate).

Generalisation:

  • RLHF improves both in-distribution (ID) and out-of-distribution (OOD) performance relative to SFT, with the gap most pronounced on instruction following tasks under larger distribution shifts.
  • On summarisation, RLHF also outperforms SFT across diverse test datasets, while BoN sampling outperforms RLHF, at the cost of substantially higher inference compute (see the sketch below).
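
The BoN comparison above relies on repeated sampling scored by the reward model. A minimal sketch of Best-of-N sampling follows; the `sample` and `score` callables stand in for the policy's generation interface and the learned reward model, and are assumptions of this sketch rather than APIs from the paper.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample: Callable[[str], str],        # draws one completion from the SFT policy (assumed interface)
    score: Callable[[str, str], float],  # reward model score for (prompt, completion) (assumed interface)
    n: int = 16,
) -> str:
    """Best-of-N (BoN) sampling: draw n candidate completions from the
    supervised fine-tuned policy and return the one the reward model ranks
    highest. Every candidate needs a full generation pass, so inference cost
    grows linearly with n."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))
```

The linear growth in generation passes is the inference overhead noted above.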

Diversity:

  • Across a range of measures, RLHF substantially reduces per-input diversity relative to SFT, a significant drawback when varied outputs are required.
  • Across-input diversity is reduced less severely, suggesting that RLHF collapses the variation among samples for a single input while retaining some variation across different inputs. This is consistent with the "mode collapse" behaviour reported for RL-tuned models (the sketch below shows how the two notions of diversity can be measured).
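
To make the per-input versus across-input distinction concrete, the following sketch measures both with a simple lexical statistic (distinct n-grams). The paper uses several diversity measures; this particular implementation and its function names are illustrative assumptions, not the authors' exact metrics.

```python
from typing import List

def distinct_n(texts: List[str], n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams in a collection of texts
    (the classic distinct-n lexical diversity statistic)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

def per_input_diversity(samples_per_input: List[List[str]], n: int = 2) -> float:
    """Per-input diversity: diversity among the K samples generated for each
    single input, averaged over inputs."""
    return sum(distinct_n(samples, n) for samples in samples_per_input) / len(samples_per_input)

def across_input_diversity(samples_per_input: List[List[str]], n: int = 2) -> float:
    """Across-input diversity: diversity of a pool built from one sample per input."""
    pool = [samples[0] for samples in samples_per_input]
    return distinct_n(pool, n)
```

The paper's finding corresponds to RLHF lowering the per-input measure much more than the across-input one.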

Implications and Future Directions

The findings underscore a central tension in current LLM fine-tuning: robust generalisation versus diverse outputs. This tension matters most in applications that need creative or varied output, such as story generation or tasks with multiple valid solution paths.

Practically, the implications suggest:

  • RLHF is preferable when a substantial distribution shift between training and deployment is expected, for example in interactive user-facing applications.
  • SFT is more suitable when output diversity is crucial, at some cost to generalisation.
  • BoN sampling is effective when the reward model generalises well, but its inference cost grows with the number of samples and must be weighed against the quality gains.

These trade-offs call for methods that balance generalisation and diversity without heavily sacrificing either. Future work could explore hybrid approaches or augment RLHF with diversity-promoting objectives, and could investigate and disentangle the underlying causes of reduced diversity under RLHF to inform more refined fine-tuning methods.

Conclusion

By evaluating RLHF alongside SFT and BoN sampling, the paper makes a substantive contribution to our understanding of LLM fine-tuning. In spotlighting the trade-off between generalisation and diversity, it opens avenues for future research on fine-tuning methods that are better matched to their intended use cases.

Authors (7)
  1. Robert Kirk
  2. Ishita Mediratta
  3. Christoforos Nalmpantis
  4. Jelena Luketina
  5. Eric Hambro
  6. Edward Grefenstette
  7. Roberta Raileanu
Citations (81)