Generative Reward Models

Updated 2 October 2025
  • Generative Reward Models (GenRM) are a reward modeling paradigm that uses next-token prediction to evaluate outputs, integrating generation and verification within one framework.
  • GenRM employs chain-of-thought reasoning, allowing the model to generate intermediate verification steps and leverage majority voting for improved robustness.
  • Empirical studies show that GenRM outperforms traditional discriminative models on reasoning-intensive tasks, yielding significant accuracy gains on benchmarks like GSM8K.

A Generative Reward Model (GenRM) is a paradigm for reward modeling in which the evaluation or verification of model outputs is framed as a next-token prediction or text generation task, as opposed to traditional scalar-valued, discriminative classification. GenRMs natively leverage the text generation capabilities of LLMs, integrating verification and solution generation into a unified architecture. This allows seamless adoption of instruction tuning, enables the use of chain-of-thought (CoT) reasoning during verification, and naturally supports scaling with both model size and inference-time compute. Empirical studies demonstrate that GenRMs achieve significant performance gains in reasoning-intensive tasks, with advances in mathematical and algorithmic domains, as well as in few-shot and out-of-distribution generalization.

1. Generative Reward Models: Principle and Next-Token Prediction

Traditional reward models (“discriminative RMs”) are typically trained as binary classifiers on preference or correctness labels, producing a score through a sigmoid-applied logit from a designated token. The loss is generally of the form:

$$L_{\mathrm{disc}} = - \mathbb{E}\left[ \log r_\theta(x, s)\ \text{for correct solutions} + \log(1 - r_\theta(x, s))\ \text{for incorrect solutions} \right]$$

In contrast, a GenRM is trained using the next-token prediction objective:

$$L_{\mathrm{SFT}}(\theta) = - \mathbb{E}_{(x, y)} \left[ \sum_t \log p_\theta(y_t \mid x, y_{<t}) \right]$$

where $y$ represents the desired generative output (e.g., “Yes” for a correct solution, “No” for an incorrect one). For CoT-enabled GenRMs, $y$ can comprise an explicit verification chain-of-thought followed by a final indicator token. In the “Direct” GenRM, solution correctness is expressed as the probability of generating “Yes”:

$$r_{\mathrm{Direct}}(x, s) = p_\theta(\text{Yes} \mid x, s, \text{“Is the answer correct (Yes/No)?”})$$

This generative framing allows the reward model to take advantage of pretrained LLMs’ representational and generative capacities, including autoregressive factorization, long-range context modeling, and instruction following.
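
The Direct GenRM score can be read off the verifier's next-token distribution. The following is a minimal sketch, assuming a Hugging Face causal LM; the model name, prompt template, and helper name are illustrative assumptions, not prescribed by the source.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"  # placeholder; any instruction-tuned causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def direct_genrm_score(question: str, solution: str) -> float:
    """Return r_Direct(x, s) = p_theta('Yes' | x, s, verification prompt)."""
    prompt = f"{question}\n{solution}\nIs the answer correct (Yes/No)? "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()
```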

2. Unified Training and Chain-of-Thought Verification

GenRMs are naturally integrated into LLM pipelines via joint instruction tuning. The total loss combines solution generation and verification:

$$L_{\mathrm{total}}(\theta) = L_{\mathrm{SFT}}^{(\mathrm{verify})}(\theta) + \lambda \cdot L_{\mathrm{SFT}}^{(\mathrm{correct})}(\theta)$$

with $\lambda$ modulating the focus between the solution and verification subtasks. This joint optimization ensures that the model is competent at both solving and verifying within a single architecture.
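
A minimal sketch of how this joint objective might be computed per training step, assuming batches of tokenized (prompt, target) pairs where non-target positions carry the conventional -100 mask; the function names and batch layout are assumptions for illustration.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token prediction loss; label -100 marks positions excluded from the loss."""
    logits = model(input_ids=input_ids).logits
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 1..T
        labels[:, 1:].reshape(-1),                    # shifted targets
        ignore_index=-100,
    )

def total_loss(model, verify_batch, correct_batch, lam=1.0):
    """L_total = L_SFT(verify) + lambda * L_SFT(correct), as in the formula above."""
    l_verify = sft_loss(model, verify_batch["input_ids"], verify_batch["labels"])
    l_correct = sft_loss(model, correct_batch["input_ids"], correct_batch["labels"])
    return l_verify + lam * l_correct
```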

Verification in GenRM is enhanced by chain-of-thought reasoning. The model can be prompted to output intermediate reasoning steps (e.g., “Let’s verify step by step.”), with the final verdict emerging at the end of the chain. At inference, majority voting is performed over $K$ independently sampled CoTs:

$$r_{\mathrm{MajV@}K}(x, s) = \frac{1}{K} \sum_{i=1}^K p_\theta(\text{Yes} \mid x, s, \text{prompt}, r_{\mathrm{CoT}}^{(i)})$$

This voting mechanism allows for effective utilization of increased compute at test time and substantially improves robustness against stochasticity and spurious reasoning paths.
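
A minimal sketch of MajV@K at inference time, assuming the same Hugging Face-style model and tokenizer as above; the sampling temperature, prompt wording, and decoding details are illustrative choices rather than fixed by the source.

```python
import torch

def majv_at_k(model, tokenizer, question, solution, k=8, max_new_tokens=256):
    """Average p('Yes') over k independently sampled verification chains-of-thought."""
    prompt = f"{question}\n{solution}\nLet's verify step by step."
    inputs = tokenizer(prompt, return_tensors="pt")
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    scores = []
    for _ in range(k):
        # Sample one verification CoT r_CoT^(i).
        cot_ids = model.generate(
            **inputs, do_sample=True, temperature=0.7, max_new_tokens=max_new_tokens
        )
        # Append the final verdict question and read off the probability of 'Yes'.
        verdict_prompt = tokenizer.decode(cot_ids[0], skip_special_tokens=True)
        verdict_prompt += "\nIs the answer correct (Yes/No)? "
        v_inputs = tokenizer(verdict_prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**v_inputs).logits[0, -1]
        scores.append(torch.softmax(logits, dim=-1)[yes_id].item())
    return sum(scores) / len(scores)  # r_MajV@K(x, s)
```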

3. Empirical Gains and Generalization

Empirical studies establish that GenRMs outperform traditional discriminative verifiers and LLM-as-a-Judge baselines. Performance gains are most evident in reasoning-dense tasks and in out-of-distribution (OOD) settings:

  • On GSM8K math reasoning, Best-of-N accuracy increases from 73% to 93.4% when replacing a discriminative verifier with a GenRM.
  • On algorithmic tasks, accuracy jumps from 5% to 45.3% with GenRM.
  • On MATH (out-of-distribution generalization) and MMLU abstract algebra, accuracy improves from 28% to 44.6% and from 37.9% to 53.5%, respectively.

Generative next-token verification combined with CoT allows GenRM to identify subtle errors, such as unit mismatches and minor arithmetic flaws, that discriminative models fail to capture. These advantages are amplified as model capacity increases and as additional test-time compute is invested in verification majority voting.
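
For concreteness, a minimal Best-of-N selection loop using a GenRM scorer might look like the following; generate_solution and genrm_score are assumed helpers (e.g., the direct or majority-voting scorers sketched above), not names from the source.

```python
def best_of_n(question, generate_solution, genrm_score, n=16):
    """Generate n candidate solutions and return the one the verifier scores highest."""
    candidates = [generate_solution(question) for _ in range(n)]
    scores = [genrm_score(question, s) for s in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```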

4. Synthetic Verification Rationales

The generative structure of GenRM enables straightforward conditioning on (or generation of) verification rationales. The use of synthetic rationales produced by prompting the model to reason step-by-step (“You are a math teacher. Grade the Solution, verifying correctness step by step…”) was shown to be sufficient for effective verifier training. These rationales focus the model on fine-grained errors and enhance label quality even in the absence of expensive human-labeled explanations.
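
A minimal sketch of how such synthetic rationales might be collected and filtered into verifier training data; the grading prompt below only paraphrases the one quoted above, and the helper names and agreement-filtering rule are assumptions for illustration.

```python
GRADER_PROMPT = (
    "You are a math teacher. Grade the Solution, verifying correctness step by step. "
    "End with 'Is the answer correct (Yes/No)?' followed by Yes or No.\n"
    "Question: {question}\nSolution: {solution}\n"
)

def build_verification_example(question, solution, is_correct, sample_rationale):
    """sample_rationale(prompt) -> one step-by-step grading text from a teacher model."""
    rationale = sample_rationale(
        GRADER_PROMPT.format(question=question, solution=solution)
    )
    predicted_yes = rationale.strip().endswith("Yes")
    # Keep only rationales whose final verdict agrees with the known correctness label.
    if predicted_yes != is_correct:
        return None
    return {
        "prompt": f"Question: {question}\nSolution: {solution}\nLet's verify step by step.",
        "target": rationale,  # verification CoT ending in the Yes/No indicator
    }
```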

Synthetic rationales facilitate more reliable reward judgments in settings where obtaining high-quality human rationales is costly. They provide additional supervision that channels model attention toward the nuanced, context-specific signals that underpin robust alignment in complex reasoning tasks.

5. Scalability in Model Capacity and Compute

GenRM exhibits favorable scaling properties. As model size increases (e.g., from Gemma-2B to Gemma-2-9B), both reward-model accuracy and Best-of-N performance improve. Larger generative verifiers yield more coherent, diverse, and informative chains-of-thought during verification. Additionally, GenRM can efficiently convert extra inference-time compute into accuracy gains by generating more verification chains per candidate and averaging their outputs—something not straightforwardly possible with discriminative RMs.
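
One simple way to study this compute-for-accuracy trade is to sample a fixed pool of verification CoTs per candidate and reuse score prefixes to estimate MajV@k at smaller budgets; the sketch below assumes a sample_cot_score helper that returns one p('Yes') estimate per freshly sampled rationale.

```python
def majv_scaling_curve(question, solution, sample_cot_score, k_max=32):
    """Estimate MajV@k at several budgets k by reusing one pool of sampled scores."""
    scores = [sample_cot_score(question, solution) for _ in range(k_max)]
    budgets = [k for k in (1, 2, 4, 8, 16, 32) if k <= k_max]
    # MajV@k is the running mean over the first k sampled rationales.
    return {k: sum(scores[:k]) / k for k in budgets}
```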

This scalability spans both compute (test-time majority voting) and data/model size, supporting strong performance with larger LLM backbones and more demanding evaluation scenarios.

6. Broader Implications and Future Directions

By reframing reward modeling as next-token prediction, GenRM unifies generative solution modeling and verification within a single LLM, inherits the benefits of chain-of-thought reasoning, and supports scalable, high-quality verification via joint generation. The generative paradigm not only outperforms classical discriminative reward models but also displays superior generalization, adaptability to multiple tasks, and efficient utilization of both synthetic rationale data and test-time computation.

The adoption of GenRM has direct implications for self-consistency, Best-of-N solution selection, verifiable reward modeling, and robust RLHF pipelines. Its ability to leverage synthetic rationales and scale with model/inference resources positions it as a central component in state-of-the-art LLM alignment and evaluation systems. Future research may investigate further integration of automated rationale induction, dynamic computation allocation, and principled uncertainty quantification within GenRM-based alignment frameworks.


Summary Table: Distinguishing Features of GenRM versus Traditional Discriminative RMs

| Aspect | Discriminative RM | Generative RM (GenRM) |
| --- | --- | --- |
| Training objective | Classifies correct/incorrect via sigmoid output | Next-token prediction of indicator or rationale |
| CoT reasoning | Absent or ad hoc | Native; enables multi-step reasoning |
| Majority voting / test-time scaling | Not directly applicable | Simple and robust via chain-of-thought sampling |
| Synthetic rationales | Difficult to integrate | Natural part of training and verification |
| OOD performance | Limited | Strong gains, enhanced error detection |
| Scalability | Modest with model size | Improves with both model size and extra compute |

GenRM represents a shift in reward modeling, leveraging native LLM generation and reasoning capabilities for interpretable, robust alignment across challenging tasks (Zhang et al., 27 Aug 2024).
