
VerifyRM: Reference-Based Reward Model

Updated 11 August 2025
  • VerifyRM is a reference-based reward model that uses the question, gold answer, and generated completion to verify mathematical reasoning outputs.
  • It replaces the standard language model scoring head with a binary classifier trained using a hybrid annotation and contrastive loss approach from high-quality triplets.
  • Empirical results on VerifyBench and integration with the Cooper RL framework show enhanced verification accuracy and training stability, effectively mitigating reward hacking.

VerifyRM is a reference-based reward model specifically devised for verifying the correctness of outputs in mathematical reasoning tasks within reinforcement learning (RL) for LLMs. Developed as part of the Cooper RL framework, VerifyRM addresses the limitations of both rule-based and model-based rewards—namely, the lack of robustness and vulnerability to reward hacking, respectively—by incorporating the reference (“gold”) answer as an explicit input. This alignment of reward modeling and policy learning achieves higher verification accuracy, robust RL training, and resistance to exploitative behaviors seen in static reward models.

1. Model Architecture and Design

VerifyRM departs from standard reward modeling paradigms by taking the reference answer (r) as input alongside the question (q) and the model's generated completion (c), forming a triplet. It is built on a pretrained, instruction-aligned LLM whose language modeling head is replaced by a new score head, turning the model into a binary classifier. The output, obtained by applying a sigmoid to the logit, is a scalar in [0,1] that signifies the estimated correctness of c with respect to r for a given q.

The loss function for training is the binary cross-entropy (BCE) over the dataset D:

$$\mathcal{L}(\theta) = \mathbb{E}_{(q, r, c, y) \sim D} \left[ \mathrm{BCE}\left( \sigma\big( M_{\theta}(q, r, c) \big),\, y \right) \right]$$

where $y \in \{0,1\}$ is the correctness label and $M_{\theta}$ is the score head parameterized by $\theta$. This architecture leverages the context afforded by the reference answer, yielding more precise reward signals than reference-free baselines.
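To make the design concrete, the following sketch attaches a binary score head to a pretrained instruction-tuned backbone and computes the BCE objective above on (q, r, c, y) tuples. The backbone name, prompt template, pooling choice, and helper names (`ReferenceBasedVerifier`, `bce_step`) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch, assuming right padding, last-token pooling, and a generic
# instruction-tuned backbone; not the authors' released code.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ReferenceBasedVerifier(nn.Module):
    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-1.5B-Instruct"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # Score head replaces the language modeling head: one logit per sequence.
        self.score_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                  # (B, T, H)
        # Pool the hidden state of the last non-padding token (right padding assumed).
        last_idx = attention_mask.sum(dim=1) - 1             # (B,)
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        pooled = hidden_states[batch_idx, last_idx]          # (B, H)
        return self.score_head(pooled).squeeze(-1)           # raw logit M_theta(q, r, c)

def bce_step(model, tokenizer, batch, device="cpu"):
    """One BCE training step on (question, reference, completion, label) examples."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    texts = [
        f"Question: {q}\nReference answer: {r}\nCompletion: {c}"
        for q, r, c in zip(batch["question"], batch["reference"], batch["completion"])
    ]
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    logits = model(enc["input_ids"], enc["attention_mask"])
    labels = torch.tensor(batch["label"], dtype=torch.float, device=device)
    # BCEWithLogits applies the sigmoid internally, matching sigma(M_theta(...)) in the loss.
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)
```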

2. Training Methodology and Data Construction

The training pipeline employs a hybrid annotation strategy, combining rule-based and LLM-based verification for labeling, and constructs high-quality contrastive learning pairs:

  • Data Collection: Tens of thousands of (q, r, c) triplets from seven math reasoning datasets and 11 LLMs.
  • Label Assignment: Each output is verified by two sources—Math-Verify (rule-based) and Qwen3-4B (LLM judge, non-thinking mode). Only samples where both verifiers unanimously agree on correctness are used for training labels.
  • Positive/Negative Pair Construction:
    • Positive ($o_{pos}$): select a completion $o_i$ for which the rule-based verifier gives $\mathrm{Rule}(a, o_i) = 1$, where $a$ is the reference answer.
    • Negative ($o_{neg}$): take a correct answer and prompt an LLM to minimally transform it into an incorrect answer, verified as incorrect by the rule-based judge.
  • Contrastive Loss: The reward model is further optimized to maximize the margin between scores for positive and negative completions:

$$\mathcal{L}_{\mathrm{RM}} = - \mathbb{E}_{(q, a, o_{pos}, o_{neg}) \sim D} \left[ \log \sigma \left( R_\phi(q, a, o_{pos}) - R_\phi(q, a, o_{neg}) \right) \right]$$

where $R_\phi$ is the reward model.

This hybrid annotation and contrastive structure provides strong supervision for distinguishing fine-grained correctness signals in mathematical reasoning.
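A minimal sketch of the pairwise contrastive objective above is shown below; the toy scores and the helper name `contrastive_rm_loss` are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_rm_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: -log sigmoid(R(q, a, o_pos) - R(q, a, o_neg)).

    `score_pos` and `score_neg` hold raw scores R_phi for the positive and negative
    completions of the same (question, reference answer) pair.
    """
    return -F.logsigmoid(score_pos - score_neg).mean()

# Example: positives verified correct by the rule-based judge, negatives produced
# by minimally corrupting a correct answer (per the hybrid pipeline above).
score_pos = torch.tensor([2.3, 0.7, 1.1])    # R_phi(q, a, o_pos) for a small batch
score_neg = torch.tensor([-0.4, 0.2, -1.5])  # R_phi(q, a, o_neg)
print(float(contrastive_rm_loss(score_pos, score_neg)))  # lower when positives outrank negatives
```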

3. Performance Evaluation and Benchmarking

VerifyRM is benchmarked on VerifyBench, explicitly constructed for reference-based reward model evaluation in math reasoning. The primary metric is accuracy: the percentage of model-assigned labels that match human judgments.

Comparison of model sizes and approaches on VerifyBench:

| Model | Size | Reference | Accuracy (%) |
|---------------------------|----------|-----------------|--------------|
| VerifyRM | 1.5B | Reference-based | 89.42 |
| Rule-based (Math-Verify) | — | Rule only | 79.93 |
| Vanilla RM | 1.5B | Non-reference | 47–52 |
| xVerify | 1.5B–7B | Reference-based | 82–84 |

VerifyRM thus provides a significant improvement over both conventional and reference-based baselines, primarily due to the explicit reference context and high-quality labeled pairs.
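As a small illustration of the metric, the sketch below computes accuracy as agreement between thresholded verifier scores and human labels; the 0.5 threshold and the data layout are assumptions about the evaluation protocol, not VerifyBench's exact specification.

```python
def verification_accuracy(scores, human_labels, threshold=0.5):
    """Fraction of thresholded verifier scores that match human correctness labels."""
    preds = [1 if s >= threshold else 0 for s in scores]
    agree = sum(p == y for p, y in zip(preds, human_labels))
    return agree / len(human_labels)

# e.g. sigmoid scores from a verifier vs. human labels on a handful of samples
print(verification_accuracy([0.91, 0.12, 0.78, 0.40], [1, 0, 1, 1]))  # 0.75
```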

4. Integration with the Cooper RL Framework

Within the Cooper framework, VerifyRM is dynamically updated alongside the policy model (LLM) during RL training:

  • Stage 1: The policy LLM generates multiple completions per prompt. Each completion receives a reward score from the concurrently-updated VerifyRM.
  • Stage 2: The reward model's parameters are co-optimized using contrastive learning from positive–negative pairs (as above), ensuring the reward model remains adaptive to new policy distributions.

Rewards are normalized using group-relative advantage estimation for policy updates, and KL divergence regularization preserves stability. This synchronous policy–reward training prevents the policy from discovering and exploiting fixed vulnerabilities in the reward model (i.e., reward hacking).
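The group-relative advantage step can be sketched as follows, assuming a GRPO-style per-group mean and standard deviation baseline; the exact normalization and KL handling used in Cooper may differ.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each group of completions sampled for the same prompt.

    `rewards` has shape (num_prompts, completions_per_prompt), e.g. VerifyRM scores
    for every sampled completion. Centering and scaling by the group's statistics
    makes each advantage a comparison against sibling completions rather than an
    absolute reward scale.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts x 4 sampled completions, verifier rewards in [0, 1]
rewards = torch.tensor([[0.9, 0.1, 0.8, 0.2],
                        [0.4, 0.5, 0.6, 0.3]])
print(group_relative_advantages(rewards))
```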

5. Empirical Findings and Analysis

Key experimental results using Qwen2.5-1.5B-Instruct as the underlying policy:

  • Static model-based rewards led to catastrophic collapse due to reward hacking, with validation accuracy dropping from 54.93% to 38.91%.
  • Cooper with dynamic VerifyRM maintains stable training curves and an average accuracy of 58.02% across multiple math reasoning benchmarks (GSM8K, SVAMP, MATH500, OlympiadBench-EN, Math Odyssey).

Longitudinal training plots demonstrate that dynamic reward co-optimization averts reward saturation and collapse seen in static models, confirming the effectiveness of such reference-based, continually-updated verifiers.

6. Implications and Future Directions

  • The co-optimization paradigm addresses the primary failure mode of RL with verifiable rewards—reward hacking—by ensuring the reward model is always "one step ahead" of the policy's exploitative strategies.
  • The hybrid reference-based approach and dynamic contrastive learning are broadly applicable: the principles could be extended to other domains (e.g., code, logical reasoning, dialog) where high-precision verification requires contextualized reference comparison.
  • Future research might investigate self-supervised negative sample construction to reduce reliance on external LLMs, study the theoretical stability properties of coupled policy–reward training, and scale the method to partial-observability or non-mathematical evaluation tasks.

7. Summary Table: Key Features and Comparative Metrics

| Feature | VerifyRM | Vanilla RM | Rule-Based |
|------------------------------|---------------------|-------------------|-----------------|
| Input Context | q, r, c (triplet) | q, c (pair) | q, c |
| Label Source | Hybrid (rule + LLM) | Human/LLM/manual | Predefined rule |
| Contrastive Learning | Yes | No | No |
| Dynamic Update in RL | Yes (Cooper) | No | No |
| VerifyBench Accuracy (1.5B) | 89.42% | 47–52% | 79.93% |

In sum, VerifyRM advances RL reward modeling for LLMs in mathematical reasoning tasks by tightly coupling a reference-based, high-precision verifier with dynamic sample construction and robust co-optimization. This method increases reliability and mitigates adversarial exploitation, establishing a foundation for safer, more trustworthy RL optimization in complex language tasks (Hong et al., 7 Aug 2025).
