
Understanding the Effects of RLHF on LLM Generalisation and Diversity (2310.06452v3)

Published 10 Oct 2023 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT or Anthropic's Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity.

Analysis of RLHF on LLM Generalisation and Diversity

The paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity" provides a comprehensive analysis of the impact of Reinforcement Learning from Human Feedback (RLHF) on LLMs, particularly focusing on their generalisation capabilities and output diversity. The paper assesses these effects across various fine-tuning stages and methodologies, contrasting RLHF with supervised fine-tuning (SFT) and Best-of-N (BoN) sampling, encompassing tasks such as text summarisation and instruction following.

Generalisation and Diversity

The paper dissects the trade-off between generalisation (how well an LLM performs on new, unseen data distributions) and output diversity (the range of distinct outputs the model can generate).

Generalisation:

  • RLHF improves both in-distribution (ID) and out-of-distribution (OOD) performance relative to SFT, with the gap most pronounced on instruction following tasks under larger distribution shifts.
  • On summarisation, RLHF also outperforms SFT across diverse test datasets, while BoN sampling outperforms RLHF, at the cost of substantially higher inference compute (see the sketch below).
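
The BoN comparison above relies on repeated sampling scored by the reward model. A minimal sketch of Best-of-N sampling follows; the `sample` and `score` callables stand in for the policy's generation interface and the learned reward model, and are assumptions of this sketch rather than APIs from the paper.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample: Callable[[str], str],        # draws one completion from the SFT policy (assumed interface)
    score: Callable[[str, str], float],  # reward model score for (prompt, completion) (assumed interface)
    n: int = 16,
) -> str:
    """Best-of-N (BoN) sampling: draw n candidate completions from the
    supervised fine-tuned policy and return the one the reward model ranks
    highest. Every candidate needs a full generation pass, so inference cost
    grows linearly with n."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))
```

The linear growth in generation passes is the inference overhead noted above.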

Diversity:

  • Across a range of measures, RLHF substantially reduces per-input diversity relative to SFT, a significant drawback when varied outputs are required.
  • Across-input diversity is reduced less severely, suggesting that RLHF collapses the variation among samples for a single input while retaining some variation across different inputs. This is consistent with the "mode collapse" behaviour reported for RL-tuned models (the sketch below shows how the two notions of diversity can be measured).
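
To make the per-input versus across-input distinction concrete, the following sketch measures both with a simple lexical statistic (distinct n-grams). The paper uses several diversity measures; this particular implementation and its function names are illustrative assumptions, not the authors' exact metrics.

```python
from typing import List

def distinct_n(texts: List[str], n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams in a collection of texts
    (the classic distinct-n lexical diversity statistic)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

def per_input_diversity(samples_per_input: List[List[str]], n: int = 2) -> float:
    """Per-input diversity: diversity among the K samples generated for each
    single input, averaged over inputs."""
    return sum(distinct_n(samples, n) for samples in samples_per_input) / len(samples_per_input)

def across_input_diversity(samples_per_input: List[List[str]], n: int = 2) -> float:
    """Across-input diversity: diversity of a pool built from one sample per input."""
    pool = [samples[0] for samples in samples_per_input]
    return distinct_n(pool, n)
```

The paper's finding corresponds to RLHF lowering the per-input measure much more than the across-input one.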

Implications and Future Directions

The findings underscore a central tension in current LLM fine-tuning: robust generalisation versus diverse outputs. This tension matters most in applications that need creative or varied output, such as story generation or tasks with multiple valid solution paths.

Practically, the implications suggest:

  • RLHF is preferable when a substantial distribution shift between training and deployment is expected, for example in interactive user-facing applications.
  • SFT is more suitable when output diversity is crucial, at some cost to generalisation.
  • BoN sampling is effective when the reward model generalises well, but its inference cost grows with the number of samples and must be weighed against the quality gains.

These trade-offs call for methods that balance generalisation and diversity without heavily sacrificing either. Future work could explore hybrid approaches or augment RLHF with diversity-promoting objectives, and could investigate and disentangle the underlying causes of reduced diversity under RLHF to inform more refined fine-tuning methods.

Conclusion

By evaluating RLHF alongside SFT and BoN sampling, the paper makes a substantive contribution to our understanding of LLM fine-tuning. In spotlighting the trade-off between generalisation and diversity, it opens avenues for future research on fine-tuning methods that are better matched to their intended use cases.

Authors (7)
  1. Robert Kirk
  2. Ishita Mediratta
  3. Christoforos Nalmpantis
  4. Jelena Luketina
  5. Eric Hambro
  6. Edward Grefenstette
  7. Roberta Raileanu
Citations (81)