Generative Reward Models
- Generative Reward Models (GenRM) are a reward modeling paradigm that uses next-token prediction to evaluate outputs, integrating generation and verification within one framework.
- GenRM employs chain-of-thought reasoning, allowing the model to generate intermediate verification steps and leverage majority voting for improved robustness.
- Empirical studies show that GenRM outperforms traditional discriminative models on reasoning-intensive tasks, yielding significant accuracy gains on benchmarks like GSM8K.
A Generative Reward Model (GenRM) is a paradigm for reward modeling in which the evaluation or verification of model outputs is framed as a next-token prediction or text generation task, as opposed to traditional scalar-valued, discriminative classification. GenRMs natively leverage the text generation capabilities of LLMs, integrating verification and solution generation into a unified architecture. This allows seamless adoption of instruction tuning, enables the use of chain-of-thought (CoT) reasoning during verification, and naturally supports scaling with both model size and inference-time compute. Empirical studies demonstrate that GenRMs achieve significant performance gains in reasoning-intensive tasks, with advances in mathematical and algorithmic domains, as well as in few-shot and out-of-distribution generalization.
1. Generative Reward Models: Principle and Next-Token Prediction
Traditional reward models (“discriminative RMs”) are typically trained as binary classifiers on preference or correctness labels, producing a score through a sigmoid-applied logit from a designated token. The loss is generally of the form:

$$\mathcal{L}_{\text{Disc-RM}}(\theta) = -\mathbb{E}_{(x,\,y,\,c)}\big[\,c \log r_\theta(x, y) + (1 - c)\log\big(1 - r_\theta(x, y)\big)\big],$$

where $c \in \{0, 1\}$ is the correctness label and $r_\theta(x, y)$ is the sigmoid of the designated logit. In contrast, a GenRM is trained using the next-token prediction objective:

$$\mathcal{L}_{\text{GenRM}}(\theta) = -\mathbb{E}_{(x,\,y,\,I)}\big[\log p_\theta(I \mid x, y)\big],$$

where $I$ represents the desired generative output (e.g., “Yes” for a correct solution, “No” for an incorrect one). For CoT-enabled GenRMs, $I$ can comprise an explicit verification chain-of-thought followed by a final indicator token. In the “Direct” GenRM, solution correctness is expressed as the probability of generating “Yes”:

$$r_\theta(x, y) = p_\theta(\text{Yes} \mid x, y).$$

This generative framing allows the reward model to take advantage of pretrained LLMs’ representation and generative capacities, including the automatic exploitation of autoregressive factorization, long-range context modeling, and instruction following.
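The “Direct” GenRM score can be read directly off the model’s next-token distribution. Below is a minimal sketch, assuming a Hugging Face causal LM; the model name and prompt template are illustrative assumptions, not taken from the source.

```python
# Minimal sketch of “Direct” GenRM scoring with a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2b"  # assumed backbone; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Assumes " Yes" maps to a single token for this tokenizer.
YES_ID = tokenizer(" Yes", add_special_tokens=False).input_ids[0]

def direct_genrm_score(question: str, solution: str) -> float:
    """Return r_theta(x, y) = p_theta('Yes' | x, y)."""
    prompt = (
        f"Question: {question}\nSolution: {solution}\n"
        "Is the solution correct? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    return torch.softmax(next_token_logits, dim=-1)[YES_ID].item()
```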
2. Unified Training and Chain-of-Thought Verification
GenRMs are naturally integrated into LLM pipelines via joint instruction tuning. The total loss combines solution generation and verification:

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{SFT}}(\theta, D_{\text{verify}}) + \lambda\, \mathcal{L}_{\text{SFT}}(\theta, D_{\text{correct}}),$$

with $\lambda$ modulating the focus between the verification and solution-generation subtasks. This joint optimization ensures that the model is competent at both solving and verifying within a single architecture.
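A minimal sketch of this joint objective, assuming Hugging Face-style batches (each containing `input_ids`, `attention_mask`, and `labels`) in which verification targets and solution-generation targets have been tokenized separately:

```python
import torch

def joint_genrm_loss(lm, verify_batch: dict, gen_batch: dict,
                     lam: float = 1.0) -> torch.Tensor:
    """Unified SFT objective: verification loss plus lambda-weighted
    solution-generation loss, optimized over a single model."""
    # Each batch is assumed to contain labels, so the causal LM returns
    # its next-token cross-entropy as .loss.
    loss_verify = lm(**verify_batch).loss  # e.g. "... Is the answer correct? Yes"
    loss_gen = lm(**gen_batch).loss        # SFT on correct solutions
    return loss_verify + lam * loss_gen
```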
Verification in GenRM is enhanced by chain-of-thought reasoning. The model can be prompted to output intermediate reasoning steps (e.g., “Let’s verify step by step.”), with the final verdict emerging at the end of the chain. At inference, majority voting over $K$ independently sampled CoT verifications is performed:

$$r_\theta(x, y) = \frac{1}{K} \sum_{k=1}^{K} p_\theta\big(\text{Yes} \mid x, y, v^{(k)}\big), \qquad v^{(k)} \sim p_\theta(\cdot \mid x, y),$$

where $v^{(k)}$ denotes the $k$-th sampled verification chain-of-thought. This voting mechanism allows for effective utilization of increased compute at test time and substantially improves robustness against stochasticity and spurious reasoning paths.
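The following sketch illustrates CoT verification with majority voting, reusing the `model`, `tokenizer`, and `YES_ID` from the earlier scoring sketch; the verification prompt and decoding settings are assumptions for illustration.

```python
import torch

def genrm_cot_score(question: str, solution: str, k: int = 8) -> float:
    """Average p('Yes') over K independently sampled verification CoTs."""
    prompt = (
        f"Question: {question}\nSolution: {solution}\n"
        "Let's verify step by step.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    scores = []
    for _ in range(k):
        # Sample one verification chain-of-thought.
        out = model.generate(
            **inputs, do_sample=True, temperature=0.7, max_new_tokens=256
        )
        rationale = tokenizer.decode(out[0], skip_special_tokens=True)
        # Condition on the rationale and read off p('Yes') for the final verdict.
        verdict_inputs = tokenizer(
            rationale + "\nIs the solution correct? Answer:", return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**verdict_inputs).logits[0, -1]
        scores.append(torch.softmax(logits, dim=-1)[YES_ID].item())
    return sum(scores) / len(scores)
```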
3. Empirical Gains and Generalization
Empirical studies establish that GenRMs outperform traditional discriminative verifiers and LLM-as-a-Judge baselines. Performance gains are most evident in reasoning-dense tasks and in out-of-distribution (OOD) settings:
- On GSM8K math reasoning, Best-of-N accuracy increases from 73% to 93.4% when replacing a discriminative verifier with a GenRM.
- On algorithmic reasoning tasks, accuracy jumps from 5% to 45.3% with GenRM.
- On MATH generalization and MMLU abstract algebra, accuracy improves from 28% to 44.6% and from 37.9% to 53.5%, respectively.
Next-token generative verification combined with CoT allows GenRM to identify subtle errors—e.g., unit mismatches or minor arithmetic slips—that discriminative models fail to capture. These advantages are amplified as model capacity increases and as additional test-time computation is invested in majority voting over verification chains.
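As an illustration of how these gains are realized in practice, Best-of-N selection with a generative verifier reduces to scoring each candidate solution and keeping the top one. The sketch below assumes the `genrm_cot_score` function from the earlier example.

```python
def best_of_n(question: str, candidates: list[str]) -> str:
    """Score each candidate solution with the generative verifier and
    return the highest-scoring one."""
    scores = [genrm_cot_score(question, sol) for sol in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]
```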
4. Synthetic Verification Rationales
The generative structure of GenRM enables straightforward conditioning on (or generation of) verification rationales. The use of synthetic rationales produced by prompting the model to reason step-by-step (“You are a math teacher. Grade the Solution, verifying correctness step by step…”) was shown to be sufficient for effective verifier training. These rationales focus the model on fine-grained errors and enhance label quality even in the absence of expensive human-labeled explanations.
Synthetic rationales facilitate more reliable reward judgments in settings where obtaining high-quality human rationales is costly. They provide additional supervision that channels model attention toward the nuanced, context-specific signals that underpin robust alignment in complex reasoning tasks.
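A minimal sketch of synthetic-rationale generation in the spirit of the grader prompt quoted above, reusing the `model` and `tokenizer` from the earlier sketches; the decoding settings and prompt layout are assumptions.

```python
def synthesize_rationale(question: str, solution: str) -> str:
    """Prompt the model to grade a solution step by step; the generated
    rationale can serve as a synthetic training target for the verifier."""
    prompt = (
        "You are a math teacher. Grade the Solution, verifying correctness "
        "step by step.\n"
        f"Question: {question}\nSolution: {solution}\nGrading:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs, do_sample=True, temperature=0.7, max_new_tokens=512
    )
    # Return only the newly generated tokens (the rationale itself).
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```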
5. Scalability in Model Capacity and Compute
GenRM exhibits favorable scaling properties. As model size increases (e.g., from Gemma-2B to Gemma-2-9B), both reward-model accuracy and Best-of-N performance improve. Larger generative verifiers yield more coherent, diverse, and informative chains-of-thought during verification. Additionally, GenRM can efficiently convert extra inference-time compute into accuracy gains by generating more verification chains per candidate and averaging their outputs—something not straightforwardly possible with discriminative RMs.
This scalability spans both compute (test-time majority voting) and data/model size, supporting strong performance with larger LLM backbones and more demanding evaluation scenarios.
6. Broader Implications and Future Directions
By reframing reward modeling as next-token prediction, GenRM unifies generative solution modeling and verification within a single LLM, inherits the benefits of chain-of-thought reasoning, and supports scalable, high-quality verification via joint generation. The generative paradigm not only outperforms classical discriminative reward models but also displays superior generalization, adaptability to multiple tasks, and efficient utilization of both synthetic rationale data and test-time computation.
The adoption of GenRM has direct implications for self-consistency, Best-of-N solution selection, verifiable reward modeling, and robust RLHF pipelines. Its ability to leverage synthetic rationales and scale with model/inference resources positions it as a central component in state-of-the-art LLM alignment and evaluation systems. Future research may investigate further integration of automated rationale induction, dynamic computation allocation, and principled uncertainty quantification within GenRM-based alignment frameworks.
Summary Table: Distinguishing Features of GenRM versus Traditional Discriminative RMs
| Aspect | Discriminative RM | Generative RM (GenRM) |
|---|---|---|
| Training Objective | Classifies correct/incorrect via sigmoid output | Next-token prediction of indicator token or rationale |
| CoT Reasoning | Absent or ad hoc | Native, enables multi-step reasoning |
| Majority Voting / Test-Time Scaling | Not directly applicable | Simple, robust via chain-of-thought sampling |
| Synthetic Rationales | Difficult to integrate | Natural part of training and verification |
| Performance in OOD | Limited | Strong gains, enhanced error detection |
| Scalability | Modest with model size | Improves with both model size and extra compute |
GenRM represents a shift in reward modeling, leveraging native LLM generation and reasoning capabilities for interpretable, robust alignment across challenging tasks (Zhang et al., 27 Aug 2024).