Reevaluating Optimization Techniques for LLMs Learning from Human Feedback
Introduction to REINFORCE and RLHF
In recent explorations of AI alignment, particularly within Reinforcement Learning from Human Feedback (RLHF), there has been an ongoing dialogue about the most effective optimization techniques for aligning LLMs with human preferences. The traditional reliance on Proximal Policy Optimization (PPO) has come under scrutiny because of its computational demands and intricate hyperparameter tuning. This paper argues for simplifying the optimization landscape by revisiting the REINFORCE algorithm, a less resource-intensive approach which, the authors claim, not only rivals but often surpasses PPO and other recent "RL-free" methods in the RLHF setting.
Optimization Challenges and Alternatives
Central to the paper's investigation is REINFORCE's potential as a simpler, and at least as effective, optimization technique compared to PPO. PPO's computational intensity stems from the need to keep multiple models in memory concurrently and from the delicate hyperparameter balancing required to optimize the policy effectively. REINFORCE, by contrast, offers a stark simplification: the authors posit that the high variance PPO is designed to mitigate is less of a concern in RLHF settings, thanks to the strong initial conditioning provided by pre-trained LLMs.
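To make the contrast concrete, below is a minimal sketch of a REINFORCE-style update with a single scalar reward per full completion; a toy policy, random tokens, and a simple mean baseline stand in for the paper's actual models, reward model, and data. The point it illustrates is that the update needs only the sequence log-probability scaled by a centered reward, with no value network or clipped surrogate objective.

```python
# Hedged sketch of a REINFORCE update on whole completions (not the paper's exact code).
import torch
import torch.nn as nn

vocab_size, hidden_dim, seq_len, batch_size = 100, 32, 12, 4

class ToyPolicy(nn.Module):
    """Toy autoregressive policy: embedding -> GRU -> next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

policy = ToyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Pretend these came from sampling the policy and scoring with a reward model.
sampled = torch.randint(0, vocab_size, (batch_size, seq_len))
rewards = torch.randn(batch_size)        # one scalar reward per full completion
baseline = rewards.mean().detach()       # simple baseline to reduce variance

# Sequence-level log-probability: sum of per-token log-probs of the sampled tokens.
logits = policy(sampled[:, :-1])
logp_tokens = torch.log_softmax(logits, dim=-1)
logp_seq = logp_tokens.gather(-1, sampled[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=-1)

# REINFORCE: raise the log-probability of completions that beat the baseline.
loss = -((rewards - baseline) * logp_seq).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```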
Empirical Analysis and Methodological Innovations
The authors conduct an empirical evaluation comparing REINFORCE and its variant, REINFORCE Leave-One-Out (RLOO), against PPO, Direct Preference Optimization (DPO), and other methods across various models and datasets. Their analysis shows that:
- REINFORCE-style algorithms, especially the RLOO variant, consistently outperform PPO by a significant margin across different metrics, reinforcing the argument that simpler policy gradient algorithms can be better suited to RLHF tasks.
- Modeling partial completions, a notable feature of PPO, is largely unnecessary. The paper shows that treating the entire generation as a single action, rather than assigning credit token by token, does not hurt model performance and simplifies the training process.
- RLOO in particular demonstrates remarkable robustness and utility, outperforming not just PPO but also "RL-free" methods such as RAFT and DPO, positioning it as a superior choice for optimizing LLMs in an RLHF context (a sketch of its leave-one-out baseline follows this list).
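As a point of reference, here is a small sketch of the leave-one-out baseline that gives RLOO its name, assuming k sampled completions per prompt, each scored with a single scalar reward; the function and variable names are illustrative rather than taken from the paper's implementation. Each completion's baseline is simply the mean reward of the other k-1 completions for the same prompt, so no learned value model is needed.

```python
# Hedged sketch of the leave-one-out baseline behind RLOO (illustrative names).
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, k) scalar rewards for k completions per prompt.
    Each completion is baselined by the mean reward of the other k-1 completions."""
    k = rewards.shape[1]
    # Sum of the other samples' rewards, divided by (k - 1).
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Example: 2 prompts, 4 completions each.
r = torch.tensor([[1.0, 0.0, 0.5, 0.2],
                  [0.3, 0.9, 0.1, 0.6]])
adv = rloo_advantages(r)  # use in place of (reward - baseline) in the REINFORCE loss
print(adv)
```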
Forward-Looking Implications
The paper's findings challenge prevailing norms in the RLHF optimization field. The demonstrated efficacy of REINFORCE and, by extension, RLOO suggests a shift toward simpler, more computationally conservative methods that do not compromise performance and sometimes enhance it. Such a shift could democratize the fine-tuning of LLMs by making it accessible to groups with limited computational resources, and it could open the door to further exploration of simpler optimization methods in other areas of machine learning.
Future Pathways in AI Alignment
Looking beyond the immediate implications for RLHF, this work opens intriguing pathways for future research. It prompts a reconsideration of not just optimization methods but also the fundamental approach to AI alignment and policy model fine-tuning. It encourages researchers to question assumptions about complexity and computational demands in optimizing AI to align with human preferences. Finally, this paper lays the groundwork for future explorations, potentially leading to more efficient, accessible, and broadly applicable methods to align AI systems with human values.
The research presented stands as a testament to the value of revisiting foundational principles in a rapidly evolving field, advocating for simplicity without compromising performance. It points toward a new wave of optimization techniques that could significantly improve the alignment of AI systems with human preferences, with implications well beyond current methodologies.