Contrastive Preference Learning: Learning from Human Feedback without RL
The paper "Contrastive Preference Learning: Learning from Human Feedback without RL" presents a novel approach to policy optimization using human feedback, without the conventional reliance on Reinforcement Learning (RL). This approach centers around Contrastive Preference Learning (CPL), a method that leverages the concept of regret-based human preferences to streamline the learning process in sequential decision-making tasks.
Summary of Contributions
The authors identify two significant challenges in the traditional Reinforcement Learning from Human Feedback (RLHF) paradigm: the flawed assumption that human preferences are distributed according to cumulative reward (partial return), and the optimization difficulties associated with running RL, such as policy-gradient methods, on the learned reward. To address these, the paper introduces a new family of algorithms built on the more realistic assumption that human preferences are based on regret. Employing the principle of maximum entropy, the authors derive CPL, which optimizes behavior from human feedback without explicitly learning a reward function; instead, it learns optimal policies directly through a contrastive learning objective.
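To make the regret-based assumption concrete, the preference model can be sketched as follows (a sketch in standard maximum-entropy notation, where \(\gamma\) is the discount factor, \(A^{*}\) the optimal advantage function under the unknown reward, and \(\sigma^{+}, \sigma^{-}\) the preferred and rejected segments; the paper's exact presentation may differ in details):

\[
P(\sigma^{+} \succ \sigma^{-}) \;=\; \frac{\exp\sum_{t}\gamma^{t} A^{*}(s^{+}_{t}, a^{+}_{t})}{\exp\sum_{t}\gamma^{t} A^{*}(s^{+}_{t}, a^{+}_{t}) + \exp\sum_{t}\gamma^{t} A^{*}(s^{-}_{t}, a^{-}_{t})}
\]

That is, humans are modeled as preferring the segment with lower cumulative regret (equivalently, higher cumulative optimal advantage), rather than the segment with the higher summed reward.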
Key benefits of CPL over traditional methods include its simplicity, reduced computational cost, and scalability to high-dimensional and sequential RLHF problems such as those encountered in robotics and LLMs. Unlike prior methods, CPL avoids RL's complexity by applying a contrastive objective directly to the policy, whose log-probabilities stand in for the optimal advantage function; a minimal code sketch of such an objective is given below.
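As an illustration of how compact the resulting training step is, the following is a minimal PyTorch-style sketch of a CPL-like contrastive loss. The function name, tensor shapes, and hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a CPL-style contrastive preference loss (illustrative).
# Assumes per-step log-probabilities for equal-length preferred and rejected
# segments have already been computed by the policy network.
import torch
import torch.nn.functional as F

def cpl_loss(logp_preferred: torch.Tensor,
             logp_rejected: torch.Tensor,
             alpha: float = 0.1,
             gamma: float = 0.99) -> torch.Tensor:
    """logp_* have shape (batch, horizon) and hold log pi(a_t | s_t)."""
    horizon = logp_preferred.shape[1]
    discounts = gamma ** torch.arange(horizon, dtype=logp_preferred.dtype)

    # Discounted sums of alpha-scaled log-probabilities play the role of each
    # segment's cumulative optimal advantage under the MaxEnt bijection.
    score_pos = alpha * (discounts * logp_preferred).sum(dim=1)
    score_neg = alpha * (discounts * logp_rejected).sum(dim=1)

    # Bradley-Terry-style cross-entropy: push the preferred segment's score
    # above the rejected one's (-log sigmoid of the score difference).
    return -F.logsigmoid(score_pos - score_neg).mean()
```

Because each update reduces to supervised learning over preference pairs, no reward model, value function, or dynamic-programming step is needed, which is where the claimed efficiency gains over actor-critic pipelines such as P-IQL come from.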
Theoretical Foundations
The theoretical foundation of CPL rests on the bijection between advantage functions and policies in the maximum entropy framework. This is what eliminates the need for RL: because the optimal advantage function determines the optimal policy directly, preference data can be converted into an explicit policy-learning objective. The paper further proves that, given an unbounded set of preferences generated from the regret-based model, CPL converges to the optimal policy.
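A sketch of this bijection, with temperature \(\alpha\) (standard maximum-entropy notation; details as in the paper):

\[
\pi^{*}(a \mid s) = \exp\!\big(A^{*}(s,a)/\alpha\big)
\quad\Longleftrightarrow\quad
A^{*}(s,a) = \alpha \log \pi^{*}(a \mid s),
\]

so substituting \(\alpha \log \pi\) for \(A^{*}\) in the regret-based preference model turns maximum-likelihood preference fitting into a supervised, contrastive objective over the policy's log-probabilities.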
Consistency of the learned advantage function is another notable property of CPL: the advantages implied by the learned policy always correspond to the optimal advantage function for some reward function, so the derived policy remains well-founded even with limited preference data.
Experimental Results
Extensive experiments validate CPL's performance. The authors demonstrate its efficiency on a suite of MetaWorld benchmark tasks, including both state-based and image-based high-dimensional continuous control. CPL consistently outperforms standard baselines such as Supervised Fine-Tuning (SFT) and preference-based Implicit Q-Learning (P-IQL), proving particularly advantageous when operating on dense preference data. The paper reports that CPL is 1.6 times faster and four times as parameter-efficient as P-IQL, underscoring its computational efficiency.
Furthermore, the paper highlights that CPL uses preferences to improve the policy beyond the best behaviors observed in the data, i.e., it enables genuine policy improvement. This effect is most evident on smaller or more densely labeled datasets.
Implications and Future Directions
Practically, CPL presents a compelling alternative to existing RLHF techniques, cutting down on the complexity and inefficiency of traditional RL methods. This reduction in complexity, combined with the fact that CPL can learn from off-policy data, makes it well suited to scaling up to larger datasets and architectures. The approach has significant implications for domains that require scalable learning from human preferences, such as robotics, LLM fine-tuning, and automated systems that must exhibit user-aligned behavior.
The authors anticipate that, when tuned appropriately, CPL could offer substantial performance benefits across a variety of contexts. However, its reliance on the assumption that human preferences actually follow the regret-based model is a potential limitation. Proposed future directions include developing online variants of CPL that would allow continual policy improvement from real-time human feedback, and extending the method to LLM fine-tuning for sequential dialogue interactions.
Overall, the paper advances RLHF by simplifying the learning pipeline and by grounding it in a preference model that better reflects how humans judge behavior, fostering more efficient alignment of models with human intent.