A Long Way to Go: Investigating Length Correlations in RLHF (2310.03716v2)

Published 5 Oct 2023 in cs.CL and cs.LG

Abstract: Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align LLMs, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF's reported improvements. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify the dominant source of these biases to be reward models, which, by studying training dynamics, we find are non-robust and easily influenced by length biases in preference data.

Citations (107)

Summary

  • The paper demonstrates a strong correlation between response length and RLHF rewards, indicating that longer outputs may inflate performance metrics.
  • The analysis across the WebGPT, Stack, and RLCD datasets reveals that improvements in reward are often driven by output length rather than qualitative enhancements.
  • The study tests interventions in both the reinforcement learning and reward modeling phases, and advocates for more robust reward models so that measured gains reflect genuine utility in AI outputs.

Investigating Length Correlations in Reinforcement Learning from Human Feedback

The paper "A Long Way To Go: Investigating Length Correlations in RLHF" addresses a significant concern in the evaluation of Reinforcement Learning from Human Feedback (RLHF) mechanisms employed in aligning LLMs towards optimal helpfulness. The authors redirect the focus of contemporary efficacy claims, shifting the narrative from the assumed utility of RLHF-optimized systems to a critical analysis of the inherent length bias influencing reward-oriented improvements.

Reinforcement Learning from Human Feedback, a widely used method for refining LLMs and aligning them with human expectations on tasks such as web-based question answering, summarization, and dialogue, tends to drive models toward longer outputs. The authors' empirical investigation shows that the correlation between response length and reward is substantial, suggesting that length disproportionately drives the reward signal.

The paper methodically dissects length bias in RLHF through a three-pronged approach. First, it analyzes the interplay between output length and reward across three helpfulness-oriented datasets: WebGPT, Stack, and RLCD. A strong length-reward correlation is observed in each setting, indicating that much of the reward gain arises from longer outputs rather than qualitative improvements. Second, the paper tests interventions during both the reinforcement learning and reward modeling phases to curb undue elongation, finding that these interventions are unevenly effective across settings.
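As a rough illustration of the kind of analysis involved (not the authors' code), the sketch below computes the Pearson correlation between response lengths and reward-model scores over a batch of outputs; the whitespace tokenization and the example scores are assumptions made for brevity.

```python
import numpy as np
from scipy.stats import pearsonr

def length_reward_correlation(responses, rewards):
    """Pearson correlation between response length and reward score.

    responses: list of generated output strings
    rewards:   list of scalar reward-model scores, one per response
    """
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.asarray(rewards, dtype=float)
    corr, p_value = pearsonr(lengths, scores)
    return corr, p_value

# Illustrative usage with placeholder scores: a correlation near 1.0
# suggests the reward model is largely tracking length, not quality.
responses = [
    "Short answer.",
    "A somewhat longer answer with more supporting detail.",
    "An even longer answer that elaborates on every part of the question.",
]
rewards = [0.12, 0.47, 0.86]
print(length_reward_correlation(responses, rewards))
```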

Third, when the reward is based purely on length, RLHF still retains most of its downstream improvements over the supervised fine-tuned starting model, showing that longer output generation is a critical determinant of perceived success in RLHF schemes. These findings call reward model design into question and challenge researchers to disentangle length from genuine improvements in quality.
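A minimal sketch of what a purely length-based reward might look like (an illustrative stand-in, not the paper's exact formulation; the normalization cap is an assumed parameter):

```python
def length_only_reward(response: str, max_tokens: int = 512) -> float:
    """Toy reward that scores a response purely by its length.

    Scores are normalized to [0, 1] by an assumed cap so they resemble
    the scale of a learned reward model. Plugging such a reward into the
    RL phase tests how much of RLHF's downstream gains length alone can
    reproduce.
    """
    num_tokens = len(response.split())  # crude whitespace tokenization
    return min(num_tokens, max_tokens) / max_tokens
```

Optimizing a policy against a reward like this simply pushes generations toward the length cap, which is what makes the finding that it recovers most downstream gains so striking.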

The authors argue for a reevaluation of reward models given how easily they latch onto simple correlations such as length. This implies refining the RLHF pipeline so that length does not obscure the true measure of utility in model outputs. The findings also prompt a reexamination of human preference datasets themselves, advocating for counterbalancing spurious correlations in the training data to make reward models more robust.
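One data-side intervention in this spirit, sketched below under the assumption of simple whitespace token counts (the paper's actual balancing procedure may differ), is to train the reward model only on preference pairs whose chosen and rejected responses are of comparable length, removing length as an easy shortcut feature:

```python
def filter_length_balanced_pairs(pairs, max_ratio=1.2):
    """Keep preference pairs whose two responses have similar lengths.

    pairs:     iterable of (chosen, rejected) response strings
    max_ratio: maximum allowed ratio between the longer and shorter
               response (1.2 keeps pairs within 20% of each other)
    """
    balanced = []
    for chosen, rejected in pairs:
        len_chosen = len(chosen.split())
        len_rejected = len(rejected.split())
        ratio = max(len_chosen, len_rejected) / max(1, min(len_chosen, len_rejected))
        if ratio <= max_ratio:
            balanced.append((chosen, rejected))
    return balanced
```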

In terms of implications, dissecting the relationship between length and reward surfaces a need to bolster reward model robustness and to improve RLHF's fidelity to human judgment beyond superficial features like length. Pursuing these aims should yield more substantive LLM advancements and improve the real-world applicability of aligned models across diverse contexts.

Future research could extend this investigation to other overt and latent biases in reward models that may similarly drive suboptimal RLHF outcomes. As the paper suggests, more nuanced reward modeling that incorporates broader human-aligned criteria holds promise for aligning models more faithfully with human-centric objectives, underscoring a pivotal direction for reinforcement learning methodologies.

The paper commendably sheds light on a critical avenue in AI research, challenging assumptions about what constitutes RLHF success, and offers a thoughtful critique that could shift how LLM alignment efficacy is perceived, measured, and improved.
