Critique-out-Loud Reward Models
The paper "Critique-out-Loud Reward Models" introduces a novel approach for enhancing reward models utilized in reinforcement learning from human feedback (RLHF). Traditionally, reward models in this domain are trained to predict preference scores directly, relying only on a single forward pass through the model without leveraging the generation capabilities of the underlying LLMs. This intrinsic limitation constrains their ability to explicitly reason about the quality of a response. The paper proposes the Critique-out-Loud (CLoud) reward models, which integrate a critique-generating mechanism to improve the performance of reward models.
Methodology
The core innovation of the CLoud reward models lies in their two-stage process (a rough inference sketch follows the list):
- Critique Generation: The reward model first generates a natural language critique of the assistant's response.
- Reward Prediction: Following the critique, the model predicts a scalar reward that evaluates the quality of the response.
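To make the two-stage flow concrete, here is a minimal, hypothetical inference sketch: a causal LM first writes a critique of the response, and a linear reward head then reads the final hidden state of the full prompt-response-critique sequence. The prompt template, model choice, and the cloud_score helper are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical two-stage CLoud scoring sketch (not the authors' code):
# stage 1 generates a critique with the LM head, stage 2 maps the final
# hidden state of (prompt, response, critique) to a scalar reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
reward_head = torch.nn.Linear(lm.config.hidden_size, 1)  # assumed trained separately

def cloud_score(prompt: str, response: str) -> tuple[str, float]:
    # Stage 1: generate a natural-language critique of the response.
    context = f"Prompt: {prompt}\nResponse: {response}\nCritique:"
    inputs = tok(context, return_tensors="pt")
    out = lm.generate(**inputs, max_new_tokens=256, do_sample=False)
    critique = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)

    # Stage 2: predict a scalar reward conditioned on prompt, response, critique.
    full = tok(context + " " + critique, return_tensors="pt")
    last_hidden = lm(**full).hidden_states[-1]        # (1, seq_len, hidden_size)
    reward = reward_head(last_hidden[:, -1]).item()   # read off the last token
    return critique, reward
```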
Training the CLoud reward models involves several steps:
- Utilizing a dataset comprising user prompts, responses, and critiques.
- Conducting supervised fine-tuning (SFT) of the base model and language modeling head on oracle critiques.
- Reconstructing the original dataset by replacing oracle critiques with self-generated critiques.
- Training the reward head on the modified dataset with a loss function that combines preference-modeling and language-modeling losses (a sketch of this combined objective follows the list).
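A hedged sketch of that combined objective, assuming a Bradley-Terry-style preference term over chosen/rejected rewards plus a next-token loss restricted to the critique tokens; the function name and the mixing weight lm_weight are illustrative, not the paper's hyperparameters.

```python
# Illustrative combined loss: preference modeling + language modeling.
import torch.nn.functional as F

def cloud_loss(r_chosen, r_rejected, critique_logits, critique_labels,
               lm_weight: float = 1.0):
    # Preference term: the chosen response should out-score the rejected one
    # (pairwise logistic / Bradley-Terry loss on the scalar rewards).
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Language modeling term: cross-entropy over critique tokens only;
    # positions outside the critique carry the ignore label -100.
    lm_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,
    )
    return pref_loss + lm_weight * lm_loss
```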
Results
The experimental results fall into two parts: gains in pairwise preference classification accuracy and gains in generation policy quality via Best-of-N (BoN) decoding.
Pairwise Preference Classification:
The CLoud reward models exhibit significant improvements in pairwise preference classification accuracy on RewardBench, achieving an increase of 4.65 percentage points for the Llama-3-8B model and 5.84 percentage points for the Llama-3-70B model compared to classic reward models. The enhancement spans all categories, including Chat, Chat-Hard, Safety, and Reasoning.
Best-of-N Win Rate:
When evaluated on the ArenaHard benchmark, the CLoud reward models demonstrate a Pareto improvement over classic models. The BoN win rate, which serves as a proxy for policy quality, is notably higher when employing CLoud models. Specifically, selecting from sixteen responses results in a 1.84 percentage point improvement for the 8B model and a 0.89 percentage point improvement for the 70B model.
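For reference, BoN decoding itself is simple: sample N candidate responses from the policy and keep the one the reward model scores highest. The generate_response sampler below is a stand-in for whatever policy is used, and cloud_score refers to the earlier sketch.

```python
# Best-of-N selection with a critique-then-score reward model (illustrative).
def best_of_n(prompt, generate_response, cloud_score, n: int = 16) -> str:
    candidates = [generate_response(prompt) for _ in range(n)]
    rewards = [cloud_score(prompt, c)[1] for c in candidates]  # keep the scalar reward
    best_idx = max(range(n), key=lambda i: rewards[i])
    return candidates[best_idx]
```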
Analysis of Training Dynamics
A significant component of this work involves assessing whether on-policy training—training on self-generated critiques—is crucial for achieving these improvements. Results indicate that on-policy training is essential, with off-policy models trained on oracle critiques displaying significantly lower performance. This emphasizes the importance of aligning the distribution of critiques seen during training with those expected during inference.
Self-Consistency Decoding
The paper also explores leveraging additional inference compute through self-consistency decoding, whereby multiple critiques are sampled for the same response and the resulting rewards are averaged to produce a more stable reward estimate. This method shows limited overall benefit but improves preference classification accuracy on reasoning tasks, particularly those with short reasoning horizons.
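A minimal sketch of that idea, assuming a scoring helper that samples a fresh critique on each call (e.g., with temperature sampling); the helper name and the sample count k are assumptions.

```python
# Self-consistency over critiques: sample k critiques for the same response
# and average the rewards they induce (illustrative).
def self_consistency_reward(prompt, response, sample_and_score, k: int = 8) -> float:
    rewards = [sample_and_score(prompt, response) for _ in range(k)]
    return sum(rewards) / len(rewards)
```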
Implications and Future Directions
Practical Applications:
- Enhanced Feedback Mechanisms: By generating detailed critiques, this approach can offer more nuanced feedback in educational tools, helpdesk automation, and content generation systems.
- Policy Improvement: Higher BoN win rates suggest that integrating critique-based evaluation could lead to more robust and reliable decision-making systems.
Theoretical Insights:
- Unified Reward Models and LLM-as-a-Judge Frameworks: CLoud models bridge the gap between traditional reward models and LLM-as-a-Judge frameworks, allowing the model to reason explicitly about a response before assigning it a scalar reward.
- Inference Compute Utilization: The exploration of self-consistency decoding hints at ways to harness additional computing resources for improved model performance, though its utility may vary across tasks.
Future Research:
- Integration with Complex Rewards: Future investigations could explore integrating CLoud models with more sophisticated reward structures, such as multi-objective frameworks or hierarchical models.
- Broader Task Categories: Testing CLoud models across a broader spectrum of tasks could provide further validation and uncover more scenarios where self-consistency decoding proves advantageous.
- Human-in-the-Loop Systems: Incorporating real-time human feedback could refine critique generation capabilities and the associated reward predictions, potentially leading to even more accurate alignment with human preferences.
In conclusion, the introduction of Critique-out-Loud reward models represents a significant advancement in leveraging the generation capabilities of LLMs for improved preference modeling and reward prediction. This paradigm shift promises substantial enhancements in both the theoretical understanding and practical applications of RLHF systems.