Reward Reasoning Model (2505.14674v1)

Published 20 May 2025 in cs.CL

Abstract: Reward models play a critical role in guiding LLMs toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.

Summary

  • The paper introduces Reward Reasoning Models (RRMs) that employ an explicit chain-of-thought phase before final reward assignment to improve LLM judgments.
  • It utilizes a transformer-decoder architecture combined with a novel RL framework (R4L) to train models without needing detailed reasoning trace annotations.
  • Experimental results on benchmarks show that RRMs yield competitive accuracy and demonstrate scalable benefits through adaptive parallel and sequential test-time compute.

The paper "Reward Reasoning Model" (2505.14674) introduces Reward Reasoning Models (RRMs), a novel approach to reward modeling for LLMs designed to enhance performance by incorporating an explicit reasoning phase before generating a final reward. This allows RRMs to adaptively allocate more test-time compute to complex queries where the appropriate reward is not immediately obvious.

Core Concept and Motivation

Traditional reward models often apply uniform computational resources across all inputs, limiting their effectiveness for nuanced or multi-step reasoning tasks. RRMs address this by framing reward modeling as a reasoning task. The model first generates a chain-of-thought style reasoning process analyzing the query and candidate responses, and then outputs the final reward or preference. This contrasts with scalar reward models that directly output a score, or generative reward models that might provide explanations but not necessarily a structured, adaptive reasoning phase prior to judgment.

Methodology

  1. Input Representation and Architecture:
    • RRMs utilize the Qwen2 model architecture with a Transformer-decoder backbone.
    • The input consists of a query and two candidate responses. The task is to determine the preferred response (ties are not allowed).
    • A system prompt adapted from the RewardBench repository guides the RRM to analyze responses based on instruction fidelity, helpfulness, accuracy, harmlessness, and detail, while avoiding common biases.
    • The RRM autoregressively generates an output comprising a thinking process followed by a final judgment (e.g., \boxed{Assistant 1}). The input is limited to two responses to reserve output length for reasoning.
  2. Model Training: Reward Reasoning via Reinforcement Learning (R4L):
    • Since supervised data with explicit reward reasoning traces is scarce, the authors propose an RL framework called "Reward Reasoning via Reinforcement Learning."
    • Base models are Deepseek-R1 distilled Qwen models.
    • The RL agent (the RRM) is trained to self-evolve reasoning capabilities.
    • The reward function for the RL training is simple and rule-based:

      $$\mathcal{R} = \begin{cases} +1, & \text{if the RRM selects the ground-truth preferred response} \\ -1, & \text{otherwise} \end{cases}$$

      This reward evaluates the RRM's final judgment, not the quality of its reasoning trace directly.

    • Group Relative Policy Optimization (GRPO) is used for training, implemented with the verl library.

  3. Multi-Response Rewarding Strategies:

    For scenarios with more than two candidate responses ($n > 2$):

    • ELO Rating System: All candidates are compared pairwise in a round-robin fashion (or a sampled subset for efficiency). The win-loss records are converted to ELO scores. This produces full ratings suitable for RLHF. Complexity is $\mathcal{O}(n^2)$ for full comparison.
    • Knockout Tournament: Candidates are paired randomly in successive rounds, with winners advancing. This requires $n-1$ comparisons with $\mathcal{O}(n)$ complexity and $\mathcal{O}(\log n)$ sequential rounds. Useful for best-of-N sampling.
    • Majority Voting: For any pairwise comparison within these strategies, the RRM can be sampled multiple times, and majority voting determines the pairwise winner, enhancing robustness at the cost of more compute. (A code sketch of the knockout tournament with majority voting follows this list.)
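The following is a minimal sketch of the knockout tournament with optional majority voting, assuming a generic judge callable (e.g., an RRM sampled once per call) that returns True when it prefers the first response; the function names are illustrative and not taken from the paper's released code.

```python
import random
from typing import Callable, List

# judge(query, resp_a, resp_b) -> True when the judge prefers resp_a
Judge = Callable[[str, str, str], bool]

def majority_prefers(query: str, resp_a: str, resp_b: str,
                     judge: Judge, num_votes: int = 1) -> bool:
    """Sample the pairwise judge `num_votes` times and return the majority verdict."""
    votes = sum(judge(query, resp_a, resp_b) for _ in range(num_votes))
    return 2 * votes > num_votes

def knockout_winner(query: str, candidates: List[str],
                    judge: Judge, num_votes: int = 1) -> str:
    """Select the best response via a knockout tournament.

    Candidates are paired at random each round and winners advance, so choosing
    among n responses takes n - 1 pairwise comparisons overall (each multiplied
    by `num_votes` RRM samples when majority voting is enabled) across roughly
    log2(n) sequential rounds.
    """
    pool = list(candidates)
    while len(pool) > 1:
        random.shuffle(pool)
        next_round = []
        if len(pool) % 2 == 1:  # odd pool: the last candidate gets a bye this round
            next_round.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            next_round.append(a if majority_prefers(query, a, b, judge, num_votes) else b)
        pool = next_round
    return pool[0]
```

In a best-of-N setup, `judge` would wrap a single RRM call (prompt construction plus parsing of the boxed verdict), and increasing `num_votes` trades extra test-time compute for more robust pairwise decisions.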

Implementation Details and Training

  • Training Data: Approximately 420K preference pairs were compiled from:
    • Skywork-Reward (80K).
    • Tülu 3 prompt dataset (80K queries, responses generated by Deepseek-R1-Distill-Qwen-1.5B, preferences labeled by GPT-4o).
    • Synthesized data using Tülu 3 prompts (80K).
    • Synthesized data from verifiable QA pairs (WebInstruct-verified, Skywork-OR1, Big-Math-RL, DAPO-Math) using Deepseek-R1 distilled models and rule-based verifiers (180K). Intermediate thinking steps were removed from responses in the training data.
  • RRM Training:
    • RRM-7B and RRM-32B models were trained on AMD Instinct MI300X Accelerators.
    • Weighted dataset mixtures were used. For RRM-32B, the ratio was 5:1:1:1 for Skywork-Reward, Tülu-80K, GPT-4o-labeled pairs, and other synthetic data.
  • Prompt Template: A detailed prompt (see Appendix A of the paper) guides the RRM's comparative analysis, emphasizing criteria like instruction following, helpfulness, accuracy, harmlessness, and detail, while instructing it to avoid biases. The output format is strictly \boxed{Assistant 1} or \boxed{Assistant 2}. A minimal sketch of this pairwise prompting and verdict parsing follows this list.
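Below is a minimal sketch of how the pairwise input and boxed-verdict output format described above can be wired together; the prompt text is an abbreviated paraphrase rather than the paper's exact Appendix A template, and `generate` stands in for any decoding call against an RRM.

```python
import re
from typing import Callable, Optional

# Abbreviated, illustrative pairwise prompt (not the paper's exact template).
PAIRWISE_TEMPLATE = """Please act as an impartial judge and compare the two assistant
responses to the user query below. Evaluate instruction following, helpfulness,
accuracy, harmlessness, and level of detail, and avoid position or length bias.
Think step by step, then state your final verdict as \\boxed{{Assistant 1}} or
\\boxed{{Assistant 2}}.

[User Query]
{query}

[Assistant 1]
{response_a}

[Assistant 2]
{response_b}
"""

def judge_pair(query: str, response_a: str, response_b: str,
               generate: Callable[[str], str]) -> Optional[int]:
    """Return 1 or 2 for the preferred assistant, or None if no verdict is found.

    `generate` maps a prompt to the RRM's full output, i.e. the chain-of-thought
    reasoning followed by a boxed final judgment.
    """
    output = generate(PAIRWISE_TEMPLATE.format(
        query=query, response_a=response_a, response_b=response_b))
    match = re.search(r"\\boxed\{Assistant\s*([12])\}", output)
    return int(match.group(1)) if match else None
```

A wrapper like this is what a `judge` callable in the tournament sketch above would look like in practice; to mitigate position bias, the same pair can also be judged in both orderings.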

Experiments and Results

  1. Agreement with Human Preference:
    • Benchmarks: RewardBench and PandaLM Test.
    • Baselines: Skywork-Reward, GPT-4o, Claude 3.5 Sonnet, JudgeLM, DeepSeek-GRM, RM-R1, and "DirectJudge" (same training data/base models as RRM but trained to judge directly without explicit reasoning).
    • Findings: RRMs achieved competitive performance. RRM-32B achieved 98.6% accuracy in RewardBench's reasoning category. RRMs significantly outperformed DirectJudge models, especially on reasoning tasks, indicating the benefit of the explicit reasoning phase.
  2. Reward-Guided Best-of-N Inference (Preference Proxy Evaluations - PPE):
    • Tasks: MMLU-Pro, MATH, GPQA, selecting the best from 32 candidate responses using the knockout tournament strategy.
    • Findings: RRMs (especially RRM-32B with voting@5) surpassed baselines like Skywork-Reward-Gemma-2 and GPT-4o. For example, RRM-32B (voting@5) achieved 83.0% on MMLU-Pro, 91.8% on MATH, and 64.3% on GPQA.
    • In binary preference classification on PPE (Frick et al. protocol), RRM-32B (voting@5) also achieved SOTA results (e.g., 81.3% MMLU-Pro, 95.4% MATH, 68.4% GPQA).
  3. Post-Training LLMs with RRM Feedback:
    • RL with Unlabeled Data: Deepseek-R1-Distill-Qwen-7B was trained on WebInstruct queries using GRPO. RRM-32B provided rewards by generating pairwise preferences over 8 responses per query and converting them to ELO scores (a sketch of this conversion follows this list). The resulting models showed steady performance improvements on MMLU-Pro and GPQA (see the corresponding figures in the paper).
    • Direct Preference Optimization (DPO): Qwen2.5-7B was fine-tuned using DPO with preference labels on Tulu data annotated by RRM-7B, RRM-32B, and GPT-4o. The model trained with RRM-32B labels achieved the highest Arena-Hard score (55.4).
  4. Scaling Test-Time Compute:
    • Parallel Scaling: On MATH (best-of-8):
      • Increasing pairwise comparisons in the ELO system steadily improved performance.
      • Majority voting (e.g., 8 RRM samples per pair) improved accuracy (e.g., RRM-32B ELO went from 90.3% to 90.5%).
      • ELO slightly outperformed the knockout tournament in accuracy, but knockout was more compute-efficient ($\mathcal{O}(n)$ vs. $\mathcal{O}(n^2)$).
    • Sequential Scaling: On RewardBench, increasing the maximum token limit for the RRM's "thinking phase" consistently improved accuracy for 7B, 14B, and 32B RRMs, showing effective use of longer reasoning horizons. A fixed post-thinking budget of 100 tokens was used.
  5. Scaling RRM Training Compute:
    • Increasing RRM model size (7B, 14B, 32B) led to consistent performance gains on RewardBench.
    • RRM-7B showed steady improvement across training steps on RewardBench domains without signs of overfitting.
  6. Reward Reasoning Pattern Analysis:
    • Keywords (e.g., 'wait', 'alternatively', 'compared to', 'break down') were used to categorize reasoning patterns (transition, reflection, comparison, breakdown).
    • Compared to its base model (Deepseek-R1-Distill-Qwen-32B), RRM-32B exhibited more transition, reflection, and comparison patterns, and fewer direct breakdown patterns.
    • Case studies (see the case-study table in the paper) showed RRM-32B engaging in more iterative, in-depth comparative reasoning and self-reflection, leading to better instruction following and more accurate judgments.
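The conversion from pairwise RRM preferences to per-response ELO scores used as RL rewards above can be sketched as follows; the standard logistic ELO update, K-factor, and initial rating here are illustrative assumptions, since the summary does not specify the paper's exact conversion.

```python
from itertools import combinations
from typing import Callable, Dict, List

def elo_scores(query: str, responses: List[str],
               prefers_first: Callable[[str, str, str], bool],
               k: float = 32.0, initial: float = 1000.0) -> Dict[int, float]:
    """Convert round-robin pairwise preferences into per-response ELO ratings.

    `prefers_first(query, resp_a, resp_b)` is assumed to return True when the
    judge (e.g. an RRM) prefers resp_a. All n*(n-1)/2 pairs are compared once.
    """
    ratings = {i: initial for i in range(len(responses))}
    for i, j in combinations(range(len(responses)), 2):
        # Expected win probability of response i under the logistic ELO model.
        expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
        actual_i = 1.0 if prefers_first(query, responses[i], responses[j]) else 0.0
        ratings[i] += k * (actual_i - expected_i)
        ratings[j] += k * (expected_i - actual_i)
    return ratings
```

The resulting ratings can then be normalized within each group of sampled responses before being used as scalar rewards in a GRPO-style update.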

Conclusions and Contributions

  • RRMs effectively leverage test-time compute through explicit chain-of-thought reasoning before reward assignment, leading to improved judgment accuracy, especially on complex tasks.
  • The "Reward Reasoning via Reinforcement Learning" (R4L) framework successfully enables RRMs to self-evolve reasoning capabilities without needing explicit reasoning traces as training supervision.
  • RRMs demonstrate strong performance on reward modeling benchmarks and practical applications like reward-guided best-of-N inference and LLM post-training (RL and DPO).
  • The models exhibit desirable scaling properties with increased test-time compute (both parallel and sequential) and training compute.
  • Analysis shows RRMs develop distinct, more iterative reasoning patterns.
  • The pretrained RRMs are available at https://huggingface.co/Reward-Reasoning, and the authors plan to open-source their code.

Practical Applications and Implementation Considerations

  • Improved Reward Quality: RRMs can serve as more accurate reward sources for RLHF or DPO, potentially leading to better-aligned and more capable LLMs.
  • Adaptive Compute: The reasoning mechanism allows for adaptive compute allocation, spending more "thinking time" on harder problems.
  • Versatile Rewarding: The ELO and knockout tournament strategies provide flexible ways to apply RRMs in scenarios with multiple candidate responses, suitable for tasks like best-of-N sampling or generating preference scores for RL.
  • Training Efficiency: The R4L framework avoids the costly annotation of detailed reasoning steps, relying on a simple binary signal based on final preference correctness.
  • Deployment: When deploying RRMs, practitioners can choose between strategies like knockout tournaments for efficiency or ELO ratings for more comprehensive scoring, and can further trade off compute for accuracy using majority voting or by adjusting the thinking budget (max token length for reasoning).
  • Interpretability: The generated reasoning traces can offer insights into the reward model's decision-making process, which can be valuable for debugging and understanding model behavior, although the paper focuses more on the performance benefits.

Overall, RRMs present a promising direction for developing more sophisticated and adaptive reward models by integrating explicit reasoning into the evaluation process, with practical training methods and application strategies.