Critique-out-Loud Reward Models
The paper "Critique-out-Loud Reward Models" introduces a novel approach for enhancing reward models utilized in reinforcement learning from human feedback (RLHF). Traditionally, reward models in this domain are trained to predict preference scores directly, relying only on a single forward pass through the model without leveraging the generation capabilities of the underlying LLMs. This intrinsic limitation constrains their ability to explicitly reason about the quality of a response. The paper proposes the Critique-out-Loud (CLoud) reward models, which integrate a critique-generating mechanism to improve the performance of reward models.
Methodology
The core innovation of the CLoud reward models lies in their two-stage process (a rough inference sketch follows the list):
- Critique Generation: The reward model first generates a natural language critique of the assistant's response.
- Reward Prediction: Following the critique, the model predicts a scalar reward that evaluates the quality of the response.
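To make the two-stage flow concrete, here is a minimal, hypothetical inference sketch: a causal LM first writes a critique of the response, and a linear reward head then reads the final hidden state of the full prompt-response-critique sequence. The prompt template, model choice, and the cloud_score helper are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical two-stage CLoud scoring sketch (not the authors' code):
# stage 1 generates a critique with the LM head, stage 2 maps the final
# hidden state of (prompt, response, critique) to a scalar reward.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
reward_head = torch.nn.Linear(lm.config.hidden_size, 1)  # assumed trained separately

def cloud_score(prompt: str, response: str) -> tuple[str, float]:
    # Stage 1: generate a natural-language critique of the response.
    context = f"Prompt: {prompt}\nResponse: {response}\nCritique:"
    inputs = tok(context, return_tensors="pt")
    out = lm.generate(**inputs, max_new_tokens=256, do_sample=False)
    critique = tok.decode(out[0, inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)

    # Stage 2: predict a scalar reward conditioned on prompt, response, critique.
    full = tok(context + " " + critique, return_tensors="pt")
    last_hidden = lm(**full).hidden_states[-1]        # (1, seq_len, hidden_size)
    reward = reward_head(last_hidden[:, -1]).item()   # read off the last token
    return critique, reward
```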
Training the CLoud reward models involves several steps:
- Utilizing a dataset comprising user prompts, responses, and critiques.
- Conducting supervised fine-tuning (SFT) of the base model and language modeling head on oracle critiques.
- Reconstructing the original dataset by replacing oracle critiques with self-generated critiques.
- Training the reward head on the modified dataset with a loss function that combines preference-modeling and language-modeling losses (a sketch of this combined objective follows the list).
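A hedged sketch of that combined objective, assuming a Bradley-Terry-style preference term over chosen/rejected rewards plus a next-token loss restricted to the critique tokens; the function name and the mixing weight lm_weight are illustrative, not the paper's hyperparameters.

```python
# Illustrative combined loss: preference modeling + language modeling.
import torch.nn.functional as F

def cloud_loss(r_chosen, r_rejected, critique_logits, critique_labels,
               lm_weight: float = 1.0):
    # Preference term: the chosen response should out-score the rejected one
    # (pairwise logistic / Bradley-Terry loss on the scalar rewards).
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Language modeling term: cross-entropy over critique tokens only;
    # positions outside the critique carry the ignore label -100.
    lm_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,
    )
    return pref_loss + lm_weight * lm_loss
```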
Results
The experimental results fall into two parts: gains in pairwise preference classification accuracy and gains in generation policy quality via Best-of-N (BoN) decoding.
Pairwise Preference Classification:
The CLoud reward models exhibit significant improvements in pairwise preference classification accuracy on RewardBench, achieving an increase of 4.65 percentage points for the Llama-3-8B model and 5.84 percentage points for the Llama-3-70B model compared to classic reward models. The enhancement spans all categories, including Chat, Chat-Hard, Safety, and Reasoning.
Best-of-N Win Rate:
When evaluated on the ArenaHard benchmark, the CLoud reward models demonstrate a Pareto improvement over classic models. The BoN win rate, which serves as a proxy for policy quality, is notably higher when employing CLoud models. Specifically, selecting from sixteen responses results in a 1.84 percentage point improvement for the 8B model and a 0.89 percentage point improvement for the 70B model.
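For reference, BoN decoding itself is simple: sample N candidate responses from the policy and keep the one the reward model scores highest. The generate_response sampler below is a stand-in for whatever policy is used, and cloud_score refers to the earlier sketch.

```python
# Best-of-N selection with a critique-then-score reward model (illustrative).
def best_of_n(prompt, generate_response, cloud_score, n: int = 16) -> str:
    candidates = [generate_response(prompt) for _ in range(n)]
    rewards = [cloud_score(prompt, c)[1] for c in candidates]  # keep the scalar reward
    best_idx = max(range(n), key=lambda i: rewards[i])
    return candidates[best_idx]
```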
Analysis of Training Dynamics
A significant component of this work involves assessing whether on-policy training—training on self-generated critiques—is crucial for achieving these improvements. Results indicate that on-policy training is essential, with off-policy models trained on oracle critiques displaying significantly lower performance. This emphasizes the importance of aligning the distribution of critiques seen during training with those expected during inference.
Self-Consistency Decoding
The paper also explores leveraging additional inference compute through self-consistency decoding, whereby multiple critiques are sampled for the same response and the resulting rewards are averaged to produce a more stable reward estimate. This method shows limited overall benefit but improves preference classification accuracy on reasoning tasks, particularly those with short reasoning horizons.
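A minimal sketch of that idea, assuming a scoring helper that samples a fresh critique on each call (e.g., with temperature sampling); the helper name and the sample count k are assumptions.

```python
# Self-consistency over critiques: sample k critiques for the same response
# and average the rewards they induce (illustrative).
def self_consistency_reward(prompt, response, sample_and_score, k: int = 8) -> float:
    rewards = [sample_and_score(prompt, response) for _ in range(k)]
    return sum(rewards) / len(rewards)
```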
Implications and Future Directions
Practical Applications:
- Enhanced Feedback Mechanisms: By generating detailed critiques, this approach can offer more nuanced feedback in educational tools, helpdesk automation, and content generation systems.
- Policy Improvement: Higher BoN win rates suggest that integrating critique-based evaluation could lead to more robust and reliable decision-making systems.
Theoretical Insights:
- Unified Reward Models and LLM-as-a-Judge Frameworks: CLoud models bridge the gap between traditional reward models and LLM-as-a-Judge frameworks, allowing the model to reason explicitly about a response before assigning it a scalar reward.
- Inference Compute Utilization: The exploration of self-consistency decoding hints at ways to harness additional computing resources for improved model performance, though its utility may vary across tasks.
Future Research:
- Integration with Complex Rewards: Future investigations could explore integrating CLoud models with more sophisticated reward structures, such as multi-objective frameworks or hierarchical models.
- Broader Task Categories: Testing CLoud models across a broader spectrum of tasks could provide further validation and uncover more scenarios where self-consistency decoding proves advantageous.
- Human-in-the-Loop Systems: Incorporating real-time human feedback could refine critique generation capabilities and the associated reward predictions, potentially leading to even more accurate alignment with human preferences.
In conclusion, the introduction of Critique-out-Loud reward models represents a significant advancement in leveraging the generation capabilities of LLMs for improved preference modeling and reward prediction. This paradigm shift promises substantial enhancements in both the theoretical understanding and practical applications of RLHF systems.