
RM-R1: Reward Modeling as Reasoning (2505.02387v3)

Published 5 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Reward modeling is essential for aligning LLMs with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RMs' interpretability and performance. To this end, we introduce a new class of generative reward models - Reasoning Reward Models (ReasRMs) - which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism - self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough empirical analyses to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.

Summary

  • The paper introduces a reasoning-based approach to reward modeling that produces explicit reasoning traces alongside preference predictions.
  • The method employs a two-stage training pipeline with high-quality reasoning distillation followed by reinforcement learning using verifiable rewards.
  • Experimental results show state-of-the-art performance on reasoning-intensive tasks, improved training stability, and effective scaling with model size.

Reward modeling is a crucial component for aligning LLMs with human preferences, particularly through reinforcement learning from human feedback (RLHF). Existing reward models (RMs) typically fall into two categories: scalar-based RMs, which output a single opaque score, and generative RMs (GenRMs), which produce textual judgments. While GenRMs offer more transparency, their reasoning is often superficial. This paper, "RM-R1: Reward Modeling as Reasoning" (2505.02387), proposes a new approach: framing reward modeling as a reasoning task to enhance both interpretability and performance.

The core idea is that accurate preference judgments require deep thinking and interpretable reasoning, similar to how humans evaluate. The paper introduces Reasoning Reward Models (ReasRMs) and presents RM-R1, a family of generative RMs trained to produce explicit reasoning traces alongside their preference predictions.

The training pipeline for RM-R1 consists of two main stages:

  1. Distillation of High-Quality Reasoning Chains: The process starts with an instruction-tuned model. High-quality reasoning traces are synthesized for a subset of preference data using powerful "oracle" models (like Claude or GPT-4). These traces justify why a preferred response is better than a rejected one. The base model is then fine-tuned on this synthesized data to acquire foundational reasoning capabilities for reward modeling. This supervised distillation step is crucial for bootstrapping the model's ability to generate structured, coherent judgments, especially for chat-based tasks where direct answer correctness isn't the primary evaluation metric.
  2. Reinforcement Learning with Verifiable Rewards (RLVR): The model fine-tuned in the distillation stage is further trained using RL. The objective is to maximize the reward for generating textual judgments that correctly identify the human-preferred response. The paper utilizes a simplified rule-based reward function $\mathcal{R}(x, j \mid y_a, y_b)$ that gives a reward of 1 if the model's predicted preference $\hat{l}$ matches the ground-truth label $l$, and -1 otherwise.

    $$\mathcal{R}(x, j \mid y_a, y_b) = \begin{cases} 1, & \text{if } \hat{l} = l \\ -1, & \text{otherwise.} \end{cases}$$

    A key aspect of the RL stage is the Chain-of-Rubrics (CoR) rollout strategy. During inference (rollout), the model is prompted to first classify the input task as either 'Chat' or 'Reasoning'.

    • For 'Reasoning' tasks (math, code, logic), the model is instructed to first solve the problem itself and then evaluate the candidate responses based on correctness and reasoning quality, referencing its own solution.
    • For 'Chat' tasks (conversation, safety, general helpfulness), the model is instructed to generate a set of evaluation rubrics, justify them, and then compare the responses against these rubrics.

    This task classification and structured generation (solving vs. rubric generation) are guided by a detailed system prompt. Training uses the Group Relative Policy Optimization (GRPO) algorithm, a variant of PPO that uses the average reward of a group of sampled outputs as the baseline for advantage computation (a minimal sketch of the reward and advantage follows this list).
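To make the RLVR objective concrete, here is a minimal Python sketch of the rule-based reward and a group-relative advantage in the spirit of GRPO. The `[[A]]`/`[[B]]` verdict tag and the helper names are illustrative assumptions, not the paper's exact output format or the released training code; the standard-deviation normalization follows common GRPO implementations.

```python
import re
from typing import List

def rule_based_reward(judgment: str, ground_truth: str) -> float:
    """Verifiable reward: +1 if the judgment's final preference matches the
    ground-truth label, -1 otherwise. The verdict-extraction pattern is an
    assumption; RM-R1's actual parsing follows its own prompt format."""
    match = re.search(r"\[\[([AB])\]\]", judgment)  # hypothetical "[[A]]"/"[[B]]" verdict tag
    predicted = match.group(1) if match else None
    return 1.0 if predicted == ground_truth else -1.0

def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: each sampled judgment is scored against the
    mean (and std) of its own rollout group rather than a learned critic."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Example: four rollouts for one (prompt, response pair); ground truth is "A".
group = ["...rubrics... [[A]]", "...rubrics... [[B]]", "[[A]]", "no verdict"]
rewards = [rule_based_reward(j, "A") for j in group]  # [1.0, -1.0, 1.0, -1.0]
print(grpo_advantages(rewards))                       # roughly [+1, -1, +1, -1]
```

Because the reward is a simple, verifiable rule on the final verdict, the RL stage needs no learned reward-on-reward machinery; the reasoning trace is shaped indirectly by whichever judgments lead to correct verdicts.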

Practical Implementation and Application:

  • Model Architecture: RM-R1 utilizes standard LLMs (specifically, Qwen-2.5-Instruct and DeepSeek-Distilled-Qwen models of various sizes: 7B, 14B, 32B) as the base architecture.
  • Training Data: The models are trained on a mix of publicly available preference datasets, including cleaned subsets of Skywork Reward Preference 80K, Code-Preference-Pairs, and Math-DPO-10K. The paper notes the importance of cleaning datasets to remove spurious correlations (e.g., between preference labels and token presence).
  • Training Pipeline: The two-stage approach (Distillation followed by RL) is critical. The distillation phase acts as a warm-start, providing the model with examples of structured, high-quality reasoning, which pure RL from a cold start struggles to discover effectively. The CoR prompting strategy during RL rollouts facilitates the generation of these structured reasoning traces and rubrics.
  • Computational Requirements: Training involves significant computation, using multiple nodes with H100 GPUs. The scaling analysis shows that performance improves with model size and inference compute (allowing for longer reasoning chains), indicating that sufficient computational resources are necessary to fully leverage the capabilities of ReasRMs.
  • Deployment: The trained ReasRM can be deployed as an LLM-as-a-judge model. Given a prompt and candidate responses, the ReasRM generates a detailed textual judgment including the reasoning process (solving the problem or generating rubrics) and the final preference prediction. This output format provides interpretability that scalar RMs lack.
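As a concrete illustration of the deployment mode described in the last bullet, the following sketch scores a pair of candidate responses with an RM-R1 checkpoint via Hugging Face Transformers. The checkpoint name, the CoR-style system prompt wording, and the `[[A]]`/`[[B]]` verdict convention are assumptions for illustration; consult the RM-R1 repository for the released models, exact prompts, and expected output format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; see https://github.com/RM-R1-UIUC/RM-R1 for released models.
model_id = "RM-R1-UIUC/RM-R1-Qwen2.5-Instruct-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative CoR-style instruction: classify the task, then either solve it
# (Reasoning) or draft rubrics (Chat) before issuing a final verdict.
system_prompt = (
    "You are a reward model. First decide whether the task is Chat or Reasoning. "
    "For Reasoning, solve the problem yourself, then judge each response against "
    "your solution. For Chat, write evaluation rubrics, then compare the responses. "
    "End with your verdict as [[A]] or [[B]]."
)
question = "What is 17 * 24?"
response_a = "17 * 24 = 408."
response_b = "17 * 24 = 398."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Question: {question}\n\n"
                                f"Response A: {response_a}\n\nResponse B: {response_b}"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024)
judgment = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(judgment)  # reasoning trace (solution or rubrics) followed by a final verdict
```

The generated judgment is the interpretability payoff: the reasoning trace can be inspected or audited, and the final verdict can be parsed into a preference label for downstream RLHF or best-of-n selection.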

Experimental Results and Analysis:

The paper evaluates RM-R1 models on three benchmarks: RewardBench, RM-Bench, and RMB. Key findings include:

  • RM-R1 models achieve state-of-the-art or near state-of-the-art performance among generative RMs, outperforming much larger open and proprietary models on various tasks, especially reasoning-intensive ones (e.g., significantly improving math and code performance on RM-Bench).
  • The scaling analysis confirms that RM-R1 benefits more from increased model size and inference compute compared to some previous RM types. Larger models not only perform better but also show greater performance gains from the reasoning-oriented training.
  • Ablation studies demonstrate that both the query categorization (distinguishing Chat/Reasoning tasks) and the distillation stage are essential for achieving high performance. Cold-start RL alone is insufficient.
  • Reasoning-based training consistently outperforms purely SFT-based approaches (fine-tuning on final answers only), even when using limited data for distillation.
  • Analysis of training dynamics shows that RL training is more stable and effective when warm-started with distillation.
  • Case studies reveal that RM-R1 generates more relevant and accurate rubrics (e.g., prioritizing accuracy for medical questions) and grounds its judgments more faithfully in the content of the responses compared to cold-start models that may focus on superficial features.

Limitations and Future Work:

While RM-R1 demonstrates significant improvements, future work includes:

  • Developing methods for automatic rubric induction to potentially reduce rollout length and complexity.
  • Integrating active learning to efficiently collect human feedback on samples where the current model's reasoning or rubrics are insufficient.
  • Extending ReasRMs to handle multimodal or agentic scenarios.

In summary, RM-R1 successfully implements reward modeling as a reasoning task through a tailored two-stage training pipeline involving distillation of reasoning traces and RL with structured rollouts and correctness-based rewards. This approach leads to more interpretable and higher-performing reward models that scale effectively with model size and computational budget, offering a practical method for improving LLM alignment.
