Analysis of RLHF on LLM Generalisation and Diversity
The paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity" provides a comprehensive analysis of the impact of Reinforcement Learning from Human Feedback (RLHF) on LLMs, particularly focusing on their generalisation capabilities and output diversity. The paper assesses these effects across various fine-tuning stages and methodologies, contrasting RLHF with supervised fine-tuning (SFT) and Best-of-N (BoN) sampling, encompassing tasks such as text summarisation and instruction following.
Generalisation and Diversity
The paper's core aim is to dissect the trade-off between model generalisation, i.e. how well an LLM performs on new, unseen data distributions, and output diversity, i.e. the range of distinct outputs the model can generate.
Generalisation:
- RLHF improves both in-distribution (ID) and out-of-distribution (OOD) performance relative to SFT, with the advantage most pronounced on instruction-following tasks under larger distribution shifts.
- On summarisation, RLHF likewise remains superior to SFT across the test datasets, while BoN outperforms RLHF, albeit at a significantly higher inference cost (a sketch of BoN follows this list).
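Concretely, BoN draws several candidate completions from a policy (typically the SFT model) and returns the one the reward model scores highest. The sketch below illustrates the idea under stated assumptions: a Hugging Face-style `generate` interface and a hypothetical `reward_model.score(prompt, completion)` helper, neither of which comes from the paper's code.

```python
# Minimal Best-of-N (BoN) sampling sketch: draw n candidates from the policy
# and return the one the reward model scores highest. `policy`, `tokenizer`,
# and `reward_model` are placeholders, not the paper's implementation.
import torch

def best_of_n(prompt: str, policy, tokenizer, reward_model,
              n: int = 16, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Sample n candidate completions from the (typically SFT) policy.
        outputs = policy.generate(
            **inputs,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=n,
            max_new_tokens=max_new_tokens,
        )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [
        tokenizer.decode(out[prompt_len:], skip_special_tokens=True)
        for out in outputs
    ]
    # Score each candidate with the reward model and keep the best one.
    # `score(prompt, completion) -> float` is an assumed interface.
    scores = [reward_model.score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```

The cost trade-off is explicit here: every query requires n full generations plus n reward-model evaluations, which is why BoN's strong summarisation results come with a markedly higher inference bill.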
Diversity:
- RLHF consistently and substantially reduces per-input diversity, a significant drawback for use cases that require varied outputs.
- Across-input diversity is only slightly diminished, suggesting that RLHF collapses the variation among samples for a single prompt while retaining some variety across different prompts; this pattern is consistent with the "mode collapse" often attributed to RL fine-tuning (see the diversity sketch after this list).
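To make the two notions of diversity concrete, the sketch below measures them with a simple distinct-n-gram ratio. The paper evaluates diversity with several metrics, so this particular choice, and the helper names, are illustrative assumptions rather than the paper's exact procedure.

```python
# Rough sketch of per-input vs. across-input diversity using a distinct-n ratio.
# The metric and function names are illustrative, not the paper's implementation.

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams over all n-grams in a set of texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(zip(*(tokens[i:] for i in range(n))))
    return len(set(ngrams)) / max(len(ngrams), 1)

def per_input_diversity(samples_per_prompt: dict[str, list[str]]) -> float:
    """Average diversity among multiple samples drawn for the same prompt."""
    scores = [distinct_n(samples) for samples in samples_per_prompt.values()]
    return sum(scores) / len(scores)

def across_input_diversity(samples_per_prompt: dict[str, list[str]]) -> float:
    """Diversity of one sample per prompt, pooled across different prompts."""
    pooled = [samples[0] for samples in samples_per_prompt.values()]
    return distinct_n(pooled)
```

Under this framing, the RLHF-induced drop shows up mainly in `per_input_diversity`, where the samples for a given prompt become near-duplicates, while `across_input_diversity` declines far less.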
Implications and Future Directions
The findings underscore a critical tension in current LLM fine-tuning: the balance between robust generalisation and diverse outputs. This matters most in applications that call for creative or varied output, such as story generation or problems with multiple valid solution paths.
Practically, the implications suggest:
- RLHF can be preferred in scenarios anticipating substantial distributional shifts, such as interactive user applications requiring adaptability.
- SFT might be more favourable when output diversity is crucial, albeit at the cost of some generalisation prowess.
- BoN emerges as a potent method where reward models exhibit strong generalisation, though it demands careful consideration of computational overhead.
The highlighted trade-offs call for methods that balance generalisation and diversity without heavily compromising either. Future research could explore hybrid approaches or augment RLHF with diversity-focused adjustments (one possible form is sketched below). Examining the underlying sources of diminished diversity in RLHF, and systematically disentangling those effects, could likewise lead to more refined fine-tuning methodologies.
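As a purely illustrative example of such an adjustment, and not something the paper proposes or evaluates, one could add an entropy bonus with weight $\lambda$ to the standard KL-regularised RLHF objective:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\;+\; \lambda\, \mathcal{H}\!\left( \pi_\theta(\cdot \mid x) \right)
$$

The first two terms are the familiar reward-maximisation objective with a KL penalty towards the reference (typically SFT) policy; the $\lambda$-weighted entropy term is the hypothetical diversity-focused addition, and how it trades off against reward-model performance is exactly the kind of question such future work would need to settle.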
Conclusion
Through its careful evaluation of RLHF alongside SFT and BoN sampling, this paper makes a substantive contribution to our understanding of LLM fine-tuning. By spotlighting the inherent trade-offs between generalisation and diversity, it opens avenues for future research on optimising how LLMs are developed and applied, so that models are well suited to their intended use cases.