VerifyRM: Reference-Based Reward Model Overview

Updated 14 February 2026
  • VerifyRM is a reference-based reward model that assesses correctness in reasoning tasks by processing triplets of task prompts, reference answers, and candidate solutions using a scalar score head.
  • It employs a hybrid annotation strategy with rule-based and LLM verification, dynamic negative sampling, and joint optimization with RL to mitigate reward hacking.
  • Empirical results demonstrate that VerifyRM reaches 89.42% verification accuracy on VerifyBench-Math and that, within RL training, its dynamically updated rewards outperform static-RM and rule-only baselines.

VerifyRM is a reference-based reward model for evaluating the correctness of reasoning tasks, primarily in reinforcement learning (RL) applied to LLMs. Designed for robust reward assessment, especially in mathematical and multi-step reasoning domains, it is a central component of RL frameworks that seek to mitigate vulnerabilities such as reward hacking by providing dynamically updated and accurate supervision signals during policy optimization. The architecture, methodology, empirical performance, and contextual relevance of VerifyRM are detailed below.

1. Reference-Based Reward Model Architecture

VerifyRM builds on a decoder-only transformer backbone, initialized from an aligned LLM (e.g., Qwen2.5-Math-1.5B-Instruct), with the original language-modeling head replaced by a scalar score head. Unlike typical reward models that evaluate (prompt, completion) pairs, VerifyRM adopts a reference-based input paradigm, where inputs are structured as triples:

  • $q$: Reasoning task or problem prompt
  • $r$: Reference (ground-truth) answer
  • $c$: Model-generated candidate solution

The model processes this triplet and outputs a scalar logit, with the estimated correctness probability $\hat{y} = \sigma(\text{logit})$, where $\sigma$ denotes the sigmoid function. This output is interpreted as the probability that $c$ correctly answers $q$, relative to $r$ (Hong et al., 7 Aug 2025).
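
A minimal sketch of how such a reference-conditioned scalar head could be wired is shown below. The class name, the triplet serialization template, and the last-token pooling are illustrative assumptions, not the released VerifyRM implementation.

```python
# Illustrative sketch of a reference-based reward model with a scalar score head.
# The prompt template, pooling choice, and class names are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ReferenceRewardModel(nn.Module):
    def __init__(self, backbone_name="Qwen/Qwen2.5-Math-1.5B-Instruct"):
        super().__init__()
        # Decoder-only backbone; the original language-modeling head is unused.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Scalar score head replacing the LM head.
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool the hidden state of the last non-padded token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(pooled).squeeze(-1)  # one scalar logit per (q, r, c)

def correctness_probability(model, tokenizer, q, r, c):
    # Hypothetical serialization of the (q, r, c) triplet into a single sequence.
    text = f"Question: {q}\nReference answer: {r}\nCandidate solution: {c}"
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logit = model(enc["input_ids"], enc["attention_mask"])
    return torch.sigmoid(logit)  # estimated probability that c answers q correctly
```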

2. Data Collection and Hybrid Annotation Strategy

The training corpus for VerifyRM is constructed from a diverse array of math-focused datasets—including GSM8K, MATH, OlympiadBench, AIME, AMC23, and LiveMathBench—totaling approximately 5.9K problems. Completions are generated by running 11 distinct LLMs under stochastic sampling, yielding roughly 65K $(q, r, c)$ triples.

Correctness labels are assigned through a hybrid annotation pipeline:

  • Rule-based verifier: Math-Verify, a symbolic math parser, provides high precision.
  • LLM-as-judge: Qwen3-4B ("non-thinking" configuration) for broad recall.
  • Only examples with agreement between both approaches are retained, yielding 58.7K high-confidence annotated triples.

Precision and recall for Math-Verify reach 96% and 63% respectively; for Qwen3-4B, 90% and 99% on the VerifyBench benchmark (Hong et al., 7 Aug 2025).
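
A compact sketch of this agreement filter is given below; `math_verify_check` and `llm_judge_check` are hypothetical stand-ins for Math-Verify and the Qwen3-4B judge, not actual library APIs.

```python
# Sketch of the hybrid-annotation agreement filter: a (q, r, c) triple is kept
# only when the rule-based verifier and the LLM judge assign the same label.
def build_labeled_triples(triples, math_verify_check, llm_judge_check):
    kept = []
    for q, r, c in triples:
        rule_label = math_verify_check(r, c)    # high precision, lower recall
        judge_label = llm_judge_check(q, r, c)  # high recall, lower precision
        if rule_label == judge_label:           # keep only high-confidence agreements
            kept.append({"question": q, "reference": r,
                         "candidate": c, "label": int(rule_label)})
    return kept
```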

Contrastive sample pairs are dynamically constructed during RL:

  • Positive: Rollouts confirmed correct by Math-Verify
  • Negative: Synthetic negatives generated via LLM-based mutation of positives and explicitly verified as incorrect via Math-Verify
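
A sketch of how such pairs could be assembled each iteration follows; `mutate_solution` (an LLM call that perturbs a verified-correct completion) and `math_verify_check` are hypothetical helpers standing in for the components named above.

```python
# Sketch of dynamic contrastive pair construction during RL. `mutate_solution`
# and `math_verify_check` are hypothetical stand-ins for the LLM-based mutation
# step and the Math-Verify check described above.
def build_contrastive_pairs(rollouts, mutate_solution, math_verify_check):
    """rollouts: iterable of (question, reference_answer, completion)."""
    pairs = []
    for q, a, o in rollouts:
        if not math_verify_check(a, o):
            continue                      # positives must pass Math-Verify
        o_neg = mutate_solution(q, o)     # LLM-perturbed variant of the positive
        if o_neg is not None and not math_verify_check(a, o_neg):
            pairs.append((q, a, o, o_neg))  # keep only verifiably incorrect negatives
    return pairs
```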

3. Reward Modeling Objectives and Loss Functions

Two primary loss functions are employed:

  • Binary cross-entropy supervised loss:

$$L_\text{sup}(\theta) = \mathbb{E}_{(q, r, c, y)}\left[\text{BCE}\big(\sigma(M_\theta(q, r, c)),\, y\big)\right]$$

where $y$ is the binary correctness label.

  • Contrastive ranking loss (RL stage):

Given $(q, a, o_\text{pos}, o_\text{neg})$,

$$L_\text{RM}(\phi) = -\mathbb{E}\left[\log \sigma\big(R_\phi(q, a, o_\text{pos}) - R_\phi(q, a, o_\text{neg})\big)\right]$$

which drives the reward assigned to true/correct completions above that of plausible but incorrect samples (Hong et al., 7 Aug 2025).
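
Both objectives can be expressed compactly; the following PyTorch sketch assumes the reward model has already produced scalar logits for the relevant inputs (tensor shapes and function names are assumptions).

```python
# Minimal PyTorch sketch of the two objectives above. Logits are assumed to come
# from the reward model M_theta / R_phi applied to the corresponding inputs.
import torch
import torch.nn.functional as F

def supervised_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # L_sup: binary cross-entropy of sigma(M_theta(q, r, c)) against y in {0, 1}.
    return F.binary_cross_entropy_with_logits(logits, labels.float())

def contrastive_ranking_loss(pos_scores: torch.Tensor,
                             neg_scores: torch.Tensor) -> torch.Tensor:
    # L_RM: -E[log sigma(R_phi(q, a, o_pos) - R_phi(q, a, o_neg))]
    return -F.logsigmoid(pos_scores - neg_scores).mean()
```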

Dynamic relabeling and continuous reward-model updating during RL address reward hacking by ensuring that newly discovered policy exploits do not lead to persistent reward inflation; the RM is retrained on the fly to penalize such behavior.

4. Joint Optimization and Reinforcement Learning Integration

Within the Cooper RL framework, VerifyRM is trained and updated concurrently with policy optimization. The update loop proceeds as:

  • Policy update (Group Relative Policy Optimization): The LLM policy is updated using group-normalized advantages derived from VerifyRM's reward predictions on sampled rollouts.
  • Reward model update:

The RM is refined by minimizing the contrastive ranking loss on freshly constructed positive/negative sample pairs.

A KL-penalty ensures policy updates remain close to the previous policy iteration, stabilizing training and preventing drift. Positive samples are always drawn from high-precision Math-Verify-validated completions (Hong et al., 7 Aug 2025).
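
An outline of one such co-training step is sketched below, assuming helpers such as `policy.sample`, `reward_model.score`, `grpo_update`, and `build_contrastive_pairs` (as in the earlier sketch) exist; this is a schematic of the described procedure, not the Cooper implementation.

```python
# Schematic of one joint update step: GRPO policy update with group-normalized
# advantages from the reward model, followed by a contrastive RM update on
# freshly built pairs. All helpers are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def cooper_style_step(policy, reward_model, rm_optimizer, batch,
                      grpo_update, build_contrastive_pairs,
                      mutate_solution, math_verify_check,
                      group_size=8, kl_coef=0.01):
    # --- 1) Policy update (GRPO) ---
    rollouts, flat = [], []
    for q, a in batch:                                   # (prompt, reference answer)
        completions = [policy.sample(q) for _ in range(group_size)]
        rewards = torch.stack([reward_model.score(q, a, c).detach()
                               for c in completions])
        # Group-normalized advantages: (reward - group mean) / group std.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        rollouts.append((q, completions, advantages))
        flat.extend((q, a, c) for c in completions)
    grpo_update(policy, rollouts, kl_coef=kl_coef)       # KL penalty to previous policy

    # --- 2) Reward model update on fresh positive/negative pairs ---
    pairs = build_contrastive_pairs(flat, mutate_solution, math_verify_check)
    if pairs:
        pos = torch.stack([reward_model.score(q, a, o_pos) for q, a, o_pos, _ in pairs])
        neg = torch.stack([reward_model.score(q, a, o_neg) for q, a, _, o_neg in pairs])
        rm_loss = -F.logsigmoid(pos - neg).mean()        # contrastive ranking loss
        rm_optimizer.zero_grad()
        rm_loss.backward()
        rm_optimizer.step()
```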

5. Empirical Performance and Robustness

VerifyRM achieves higher verification accuracy than competing reward models:

| Model | VerifyBench-Math Accuracy (%) |
| --- | --- |
| Rule-based (Math-Verify) | 79.93 |
| Vanilla RMs (1–8B, non-reference) | 47–53 |
| xVerify (reference-based, 0.5–9B) | 70.68–84.38 |
| VerifyRM (1.5B, reference-based) | 89.42 |

When integrated with Cooper, RL runs with a dynamic VerifyRM consistently outperform static-RM and rule-only baselines. For instance, when training Qwen2.5-1.5B-Instruct on diverse mathematical benchmarks, Cooper with VerifyRM achieves an average accuracy of 58.02%, compared to 57.48% for rule-based reward and 38.91% for static (non-updated) VerifyRM (Hong et al., 7 Aug 2025).

Dynamic RM updates effectively suppress reward hacking. In static-RM setups, policies quickly learn to exploit reward model blind spots, resulting in collapsed test accuracy. With VerifyRM's continuous co-optimization, reward signals remain reliable and test accuracy increases monotonically.

6. Comparative Evaluation and Broader Context

VerifyRM establishes a new state-of-the-art for reference-aware reward modeling in LLM mathematics RL. Its architecture and training methodology are aligned with recent advances in using reference answers for reward estimation, as demonstrated by the xVerify series, but achieve higher robustness through (1) hybrid, high-confidence label annotation; (2) dynamic negative sampling; and (3) explicit reference-conditioning in the reward head.

A plausible implication is that the reference-based paradigm is critical for reward model generalization and resilience, particularly in open-ended reasoning domains prone to exploitation under pure preference-based or non-reference RM designs.

7. Limitations and Future Directions

VerifyRM's effectiveness is conditioned on the availability of high-precision rule-based verifiers or trusted LLM judges for initial annotation. Extension beyond mathematics or other highly structured reasoning domains would require domain-specific verifiers of comparable reliability. Moreover, the resource cost of generating sufficiently diverse triples and high-confidence labels may be significant for novel application areas. Further work could address generalization to vision-language models, where the VL-GenRM model for multimodal verification suggests a parallel methodology (Zhang et al., 16 Jun 2025; abstract only).


The design, methodology, and empirical outcomes of VerifyRM represent the state-of-the-art in dynamic reward modeling for RL-aligned LLM reasoning.
