Reward Reasoning Model (RRM)

Updated 6 April 2026

Reward Reasoning Model (RRM) is a framework that structures reward evaluation into multiple reasoning passes so that the evaluator concentrates on the quality dimensions that matter most for a given instance.
It employs a two-turn process, selective branching followed by branch-conditioned rethinking, to improve error detection and alignment in LLM outputs.
Empirical results show that BR-RM-14B achieves the highest average accuracy across three reward-modeling benchmarks by mitigating judgment diffusion and concentrating tokens on critical dimensions.

A Reward Reasoning Model (RRM) is a reward model for aligning LLMs that explicitly externalizes an intermediate reasoning process before emitting a final reward or preference decision. Standard reward models typically compress multiple evaluation dimensions, such as factuality, safety, and logical soundness, into a scalar produced in a single forward pass; RRMs instead structure the evaluation as a deliberate reasoning sequence of two or more turns. This approach, exemplified by the Branch-and-Rethink Reasoning Reward Model (BR-RM), mitigates judgment diffusion and enables the model to detect subtle, high-impact errors that traditional “one-shot” reward models often overlook (Jiao et al., 27 Oct 2025).

1. Motivation and Problem Formulation

Traditional reward models in RLHF pipelines for LLMs assign preference scores by compressing multiple quality dimensions into a single scalar in one pass. This process induces judgment diffusion, whereby attention is spread too thinly across multiple criteria, resulting in diluted focus, shallow reasoning, and increased risk of overlooking consequential errors. In contrast, “think-twice” strategies in solver LLMs—where a model first hypothesizes critical issues and then performs a dedicated second pass of scrutiny—have been empirically shown to elicit stronger reasoning and improved performance.

RRMs such as BR-RM transfer this think-twice paradigm into reward modeling, operationalizing evaluation as a modular, structured reasoning trace. The key insight is that exposing a two-pass reasoning workflow enables deeper, more focused analysis of the most salient quality dimensions in any specific instance, significantly strengthening reward sensitivity to subtle and consequential flaws.

2. Formal Structure: Two-Turn Reasoning Process

Let $\mathcal{D} = \{(x_i, y_{i,1}, y_{i,2}, z_i)\}_{i=1}^N$ denote a dataset of human-preference comparisons over LLM outputs, with $x_i$ as the prompt, $y_{i,1}$ and $y_{i,2}$ as competing responses, and $z_i\in\{1,2\}$ as the preferred label.

The BR-RM formalizes reward assignment over two explicit reasoning turns:

Turn 1 (Adaptive Branching): Selects a small subset $C_{\text{sel}} \subset C$ of critical evaluation dimensions (from a rubric $C$ covering, for example, factuality, safety, and logical coherence) and generates brief, dimension-specific hypothesis sketches $\alpha_1$ and $\alpha_2$, one per response. The branching function is:

$b_\theta(x, y_1, y_2) \to (C_{\text{sel}}, \alpha_1, \alpha_2)$

This confines the model’s attention to the most instance-relevant aspects.

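Turn 2 (Branch-Conditioned Rethinking): Executes a conditioned reread over the responses, rigorously testing only those dimensions in $C_{\text{sel}}$, guided by the hypothesis sketches. The rethinking function is:

$r_\theta(x, y_1, y_2, C_{\text{sel}}, \alpha_1, \alpha_2) \to \hat{z}$

where $\hat{z} \in \{1, 2\}$ is the final preference derived from a weighted aggregation over dimension-specific evidence scores.

The composite two-step trace $\tau = (C_{\text{sel}}, \alpha_1, \alpha_2, \hat{z})$ is interpreted as evidence in a reward function $R(\tau)$, which outputs a scalar reward according to strict format checks:

$R(\tau) = \begin{cases} -\lambda_{\text{fmt}} & \text{if } \tau \text{ fails a format check} \\ \mathbb{1}[\hat{z} = z] & \text{otherwise} \end{cases}$

where $\lambda_{\text{fmt}} > 0$ is the format penalty and $z$ is the human-preferred label.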
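The two turns and the format-gated reward described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: in BR-RM both turns are generated by the LLM itself, whereas here the relevance scores, evidence scores, and weights are hand-written stand-ins, and the names `select_dimensions`, `aggregate_preference`, and `trace_reward`, the penalty value, and the 0/1 correctness encoding are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class BranchOutput:
    """Turn 1 result: selected dimensions C_sel and sketches alpha_1, alpha_2."""
    selected: List[str]
    sketch_1: str
    sketch_2: str


def select_dimensions(relevance: Dict[str, float], k: int = 2) -> List[str]:
    """Keep the k rubric dimensions with the highest (assumed) relevance score."""
    ranked = sorted(relevance, key=lambda d: relevance[d], reverse=True)
    return ranked[:k]


def aggregate_preference(
    evidence: Dict[str, Tuple[float, float]],
    weights: Dict[str, float],
    selected: List[str],
) -> int:
    """Turn 2 aggregation: weighted sum of evidence over C_sel only.

    evidence[d] = (score for y_1, score for y_2). Returns 1 or 2.
    Ties go to response 1, an arbitrary choice for this sketch.
    """
    total_1 = sum(weights[d] * evidence[d][0] for d in selected)
    total_2 = sum(weights[d] * evidence[d][1] for d in selected)
    return 1 if total_1 >= total_2 else 2


def trace_reward(format_ok: bool, predicted: int, label: int,
                 format_penalty: float = 1.0) -> float:
    """Negative penalty on a format failure, else binary preference correctness.

    The penalty magnitude and the 0/1 encoding are assumptions of this sketch.
    """
    if not format_ok:
        return -format_penalty
    return 1.0 if predicted == label else 0.0


# Hand-written stand-ins for what the model would generate.
relevance = {"factuality": 0.9, "safety": 0.1, "logical_coherence": 0.7, "style": 0.2}
branch = BranchOutput(
    selected=select_dimensions(relevance, k=2),
    sketch_1="y_1 may misstate a date",
    sketch_2="y_2 reasoning looks complete",
)
evidence = {"factuality": (0.2, 0.9), "logical_coherence": (0.8, 0.7), "style": (1.0, 0.0)}
weights = {"factuality": 0.6, "logical_coherence": 0.4, "style": 0.5}

z_hat = aggregate_preference(evidence, weights, branch.selected)
reward = trace_reward(format_ok=True, predicted=z_hat, label=2)
```

Note that the `style` evidence strongly favors response 1 but never enters the decision, because Turn 1 did not select that dimension.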

3. Optimization, Training Objective, and Implementation

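BR-RM is trained with Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that replaces the learned value baseline with group-normalized rewards. For each batch of sampled traces $\{\tau_k\}_{k=1}^{K}$, the probability ratio is:

$\rho_k(\theta) = \frac{\pi_\theta(\tau_k \mid x, y_1, y_2)}{\pi_{\theta_{\text{old}}}(\tau_k \mid x, y_1, y_2)}$

and the loss is:

$\mathcal{L}(\theta) = -\mathbb{E}_k\left[\min\left(\rho_k(\theta)\,\hat{A}_k,\ \operatorname{clip}\left(\rho_k(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_k\right)\right]$

where $\hat{A}_k$ is the whitened advantage across the K-sample group. The total reward for the trace combines format validity and binary preference correctness, with format errors incurring negative penalties.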
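As a rough illustration of the objective, the sketch below whitens rewards within one K-sample group and evaluates a PPO-style clipped surrogate. It assumes one log-probability per trace (sequence level) and a clip range of 0.2; both are simplifications chosen for the example rather than settings taken from the source, and the function names are hypothetical.

```python
import math
from typing import List


def whiten(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate_loss(
    logp_new: List[float],
    logp_old: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,
) -> float:
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


# Four sampled traces for one comparison: two judged correctly, two not.
group_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = whiten(group_rewards)

# With an unchanged policy every ratio is 1, so the loss is minus the mean advantage.
same_policy_loss = clipped_surrogate_loss([-3.0] * 4, [-3.0] * 4, advantages)

# A large ratio on a positive-advantage trace is clipped at 1 + clip_eps.
clipped_loss = clipped_surrogate_loss([0.5], [0.0], [1.0])
```

Because the advantages are centered within the group, a group in which every trace earns the same reward contributes no gradient signal.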

Architecturally, the backbone LLM is Qwen3-8B or Qwen3-14B. Inputs and outputs follow strict, regex-checked formats, with each step (selected dimension list, hypothesis sketches, critiques, and boxed decisions) labeled and delimited. Training is executed on up to 64 GPUs with maximum context lengths of 16K tokens.
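A regex-based format check of the kind described here might look like the following sketch. The tag names (`<dimensions>`, `<sketch_1>`, `<sketch_2>`, `<critique>`) and the helper `parse_decision` are hypothetical, since the source states only that each step is labeled, delimited, and ends in a boxed decision.

```python
import re
from typing import Optional

# Hypothetical tag names; the source only says each step is labeled and delimited.
TURN1 = re.compile(
    r"\A\s*<dimensions>(?P<dims>.+?)</dimensions>\s*"
    r"<sketch_1>(?P<s1>.+?)</sketch_1>\s*"
    r"<sketch_2>(?P<s2>.+?)</sketch_2>\s*\Z",
    re.DOTALL,
)
TURN2 = re.compile(
    r"\A\s*<critique>(?P<critique>.+?)</critique>\s*"
    r"\\boxed\{(?P<choice>[12])\}\s*\Z",
    re.DOTALL,
)


def parse_decision(turn1_text: str, turn2_text: str) -> Optional[int]:
    """Return the boxed choice (1 or 2) if both turns are well-formed, else None."""
    if TURN1.match(turn1_text) is None:
        return None
    match = TURN2.match(turn2_text)
    if match is None:
        return None
    return int(match.group("choice"))


good_turn1 = (
    "<dimensions>factuality, logical_coherence</dimensions>\n"
    "<sketch_1>Possible wrong date.</sketch_1>\n"
    "<sketch_2>Argument seems complete.</sketch_2>"
)
good_turn2 = "<critique>Response 1 misdates the event.</critique>\n\\boxed{2}"
bad_turn2 = "<critique>Response 1 misdates the event.</critique>\nAnswer: 2"
```

A `None` result would correspond to a format failure and hence to the negative-penalty branch of the reward.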

4. Empirical Results and Evaluation

BR-RM was benchmarked against prior state-of-the-art scalar and structured reward models on RewardBench, RM-Bench, and RMB. Results are summarized as follows:

Model	RewardBench	RM-Bench	RMB	Average
Scalar SOTA	95.1%	70.9%	70.5%	78.8%
RRM-32B	91.2%	84.0%	70.0%	81.7%
BR-RM-8B	91.0%	85.0%	71.8%	82.6%
BR-RM-14B	92.1%	85.9%	74.7%	84.2%

BR-RM-14B attains the highest average accuracy (84.2%) and leads on RM-Bench and RMB, while the scalar baseline remains ahead on RewardBench (95.1% versus 92.1%).
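The Average column is the unweighted mean of the three benchmark scores, which can be checked directly from the table. The snippet below is a small verification sketch; the dictionary layout and variable names are illustrative only.

```python
# Scores copied from the table: (RewardBench, RM-Bench, RMB), in percent.
scores = {
    "Scalar SOTA": (95.1, 70.9, 70.5),
    "RRM-32B": (91.2, 84.0, 70.0),
    "BR-RM-8B": (91.0, 85.0, 71.8),
    "BR-RM-14B": (92.1, 85.9, 74.7),
}

# Unweighted mean over the three benchmarks, rounded to one decimal place.
averages = {model: round(sum(vals) / len(vals), 1) for model, vals in scores.items()}

best_average = max(averages, key=averages.get)
best_rewardbench = max(scores, key=lambda m: scores[m][0])
```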

Ablation studies further show that disabling either branching or branch-conditioned rethinking reduces average accuracy by 1.0–2.6 percentage points, indicating that both structured passes contribute to the result.

5. Analysis of Benefits and Mechanistic Insights

The reported results attribute the following benefits to the two-turn architecture of RRMs:

Mitigation of Judgment Diffusion: By restricting Turn 1 to selecting two or three top-priority dimensions, BR-RM avoids spending analysis on irrelevant criteria. The second pass then checks those prioritized axes in depth.
Detection of Subtle Errors: Branch-conditioned scrutiny in Turn 2 surfaces flaws such as minor factual inaccuracies or logic bugs that would otherwise evade detection under a flat, one-shot scoring regime.
Allocation of Compute: ∼70% of tokens in BR-RM traces are dedicated to critical dimensions, versus ∼30% for single-pass ReasonRMs, yielding 1–3 point gains in accuracy on fine-grained benchmarks.
Robustness and Generalization: The modular, two-turn structure integrates seamlessly with standard RLHF pipelines and is robust across multiple domains, even under strict format and length constraints.
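A token-allocation figure of the kind quoted above could be measured along the following lines. This is a hedged sketch only: it assumes a trace has already been split into segments labeled by dimension, it counts whitespace-separated words as a stand-in for tokenizer tokens, and `critical_token_share` is a hypothetical helper rather than the source's measurement procedure.

```python
from typing import List, Set, Tuple


def critical_token_share(segments: List[Tuple[str, str]], critical: Set[str]) -> float:
    """Fraction of (whitespace) tokens spent on segments whose dimension is critical."""
    total = 0
    focused = 0
    for dimension, text in segments:
        count = len(text.split())  # crude proxy for a tokenizer count
        total += count
        if dimension in critical:
            focused += count
    return focused / total if total else 0.0


# Toy trace: seven words on a selected dimension, three on an unselected one.
trace_segments = [
    ("factuality", "the cited year conflicts with the prompt"),
    ("style", "tone is fine"),
]
share = critical_token_share(trace_segments, critical={"factuality"})
```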

6. Extensions, Limitations, and Future Directions

While the two-turn RRM architecture significantly advances reward model reliability and alignment, several limitations and open challenges remain:

Format Structure Integration: Current reward scaling is binary and format-dependent; more sophisticated multi-dimensional or continuous-valued rewards may be required for alignment with non-binary tasks.
External Verifiers: The branch-and-rethink machinery provides natural interface points for attaching retrieval modules or external knowledge tool calls, which may further enhance robustness in open-domain evaluation.
Adaptation to Task Complexity: The approach can be extended to more complex or multi-turn tasks by dynamically adjusting the number of reasoning passes or branches.
Scalability and Training Efficiency: Although training is practical and scalable up to 16K context on 64 GPUs, further efficiency gains for even larger models or longer contexts remain a research focus.

7. Significance in Alignment and the RLHF Pipeline

The Reward Reasoning Model framework, as instantiated by BR-RM, demonstrates that deliberate, multi-step reasoning in reward assignment not only improves interpretability but also yields measurable accuracy and robustness gains over traditional scalar reward models. By sharply reducing shallow judgment and increasing focus on instance-critical dimensions, RRMs establish a new design paradigm for alignment in LLMs, with broad implications for future RLHF system architectures and evaluation strategies (Jiao et al., 27 Oct 2025).

References (1)

Jiao et al. (2025). Think Twice: Branch-and-Rethink Reasoning Reward Model.
