VerifyBench: Cross-Domain Answer Verification

Updated 11 August 2025
  • VerifyBench is a systematic and cross-domain benchmark suite that rigorously evaluates answer verification and reward modeling systems using curated reference answers.
  • Associated methods combine rule-based verifiers with large language model judges through hybrid annotation and dynamically updated reward models, achieving high verification accuracy in specialized domains.
  • Empirical findings reveal a trade-off between precision and recall, guiding improvements in reinforcement learning systems with verifiable rewards.

VerifyBench is a systematic and cross-domain benchmark suite designed to rigorously evaluate answer verification and reward modeling systems, especially in the context of LLMs trained and evaluated via reinforcement learning (RL) with verifiable rewards. The benchmark and its derivatives assess how well automated verifiers—both specialized and general-purpose—can judge the correctness of model-generated answers against carefully curated reference answers, across multiple scientific and reasoning domains. Through well-controlled annotation and the design of challenging evaluation scenarios, VerifyBench provides a principled ground truth for both verifier development and empirical measurement of reward system reliability, informing core advances in reinforcement learning with verifiable reward (RLVR).

1. Benchmark Foundations and Design

VerifyBench is constructed from thousands of diversified reasoning questions across mathematics, physics, chemistry, and biology, with each entry comprising a question, a reference answer, and multiple model-generated responses. The data originates from 41 different open domains and is categorically balanced among four canonical answer types: numerical (e.g., integers, floats), algebraic expressions, multiple-choice, and free-form string answers. Each question is paired with an authoritative reference answer, which is used as the ground truth for verification judgments (Yan et al., 21 May 2025, Li et al., 14 Jul 2025).

The construction process involves automatic answer type labeling (using Llama-3.3-70B-Instruct) and large-scale response collection from over 20 diverse LLMs. This is followed by multi-annotator human validation to ensure that each question’s “correct” and “incorrect” response tuples are accurate and reflect intended evaluation targets. VerifyBench-Hard, a more challenging variant, is curated by identifying high-disagreement cases through ensemble model voting and further human annotation, leading to a dataset with an imbalanced distribution (e.g., only 291 correct vs. 709 incorrect completions among 1,000 challenging tuples) (Yan et al., 21 May 2025).
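
For concreteness, the sketch below shows one way such an entry could be represented in code; the field names, types, and example contents are illustrative assumptions, not the official data schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LabeledResponse:
    response: str      # full model-generated completion (may include chain of thought)
    is_correct: bool   # human-validated correctness label

@dataclass
class VerifyBenchEntry:
    question: str
    reference_answer: str   # authoritative ground-truth answer
    answer_type: str        # "numerical" | "expression" | "multiple_choice" | "string"
    responses: List[LabeledResponse] = field(default_factory=list)

# Example entry (contents invented for illustration only):
entry = VerifyBenchEntry(
    question="What is the pH of a 0.01 M HCl solution?",
    reference_answer="2",
    answer_type="numerical",
    responses=[
        LabeledResponse("Since [H+] = 0.01 M, pH = -log10(0.01) = 2.", True),
        LabeledResponse("The pH is 1.0.", False),
    ],
)
```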

2. Methodology: Evaluation Protocols and Reward Modeling

The principal evaluation task is reference-based answer verification. A reward model receives a tuple $(q, gt, r)$, where $q$ is the query, $gt$ is the ground-truth reference answer, and $r$ is the candidate model completion. The model returns a scalar or binary “reward,” typically thresholded to label the completion as correct or incorrect:

$$Acc = \frac{1}{|D|} \sum_{(q, gt, r, y) \in D} I\big[E\big(R_{\phi}(q, gt, r)\big) = y\big]$$

with $I$ the indicator function and $E(\cdot)$ a mapping from raw reward to binary output (Yan et al., 21 May 2025).
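
As a minimal sketch of this metric, the function below computes verification accuracy over labeled tuples; `reward_model` and the threshold are hypothetical stand-ins for an actual verifier and the mapping $E(\cdot)$.

```python
from typing import Callable, Iterable, Tuple

def verification_accuracy(
    dataset: Iterable[Tuple[str, str, str, bool]],    # (question, reference, response, gold label y)
    reward_model: Callable[[str, str, str], float],   # hypothetical R_phi(q, gt, r) -> raw reward
    threshold: float = 0.5,                           # thresholding plays the role of E(.)
) -> float:
    """Fraction of tuples where the thresholded reward matches the human label."""
    data = list(dataset)
    hits = sum(
        int((reward_model(q, gt, r) >= threshold) == y)
        for q, gt, r, y in data
    )
    return hits / len(data)
```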

The evaluation protocol crosses two factors: (1) the input given to the verifier (just the final “boxed” answer vs. the full chain-of-thought explanation) and (2) the expected output format (a short binary verdict vs. an extended rationale), yielding four evaluation conditions. This matrix enables comprehensive measurement of both specialized verifiers (e.g., xVerify, R1-Distill-Verifier) and broader LLM-based judges (e.g., Qwen2.5-Instruct, Qwen3) under controlled, real-world conditions (Li et al., 14 Jul 2025).

3. Empirical Findings and Comparative Results

Experiments reveal two distinct verification performance profiles:

  • Specialized verifiers attain leading accuracy and precision (e.g., xVerify-9B-C reaches 96.48% in chemistry), but display lower recall and limited flexibility—failing to accommodate valid yet diverse correct answer forms.
  • General LLM judges accept a broader array of correct expressions (higher recall) but are inconsistent, particularly when supplied with long, unstructured chain-of-thought inputs, leading to unstable precision.

Providing a reference answer yields a decisive performance benefit; omitting it in ablation studies degrades accuracy by 5–18%, and the gap is even more pronounced with smaller models (<3B parameters) (Yan et al., 21 May 2025).

VerifyBench-Hard exposes the limits of current systems: whereas state-of-the-art LLM judges and reward models achieve >90% accuracy on the standard balanced test set, accuracy drops to ~72% on genuinely ambiguous or contentious cases (Yan et al., 21 May 2025).

4. Technical Innovations: Hybrid Annotation and Dynamic Reward Models

One central challenge is generating high-fidelity training and testing labels efficiently. Hybrid annotation strategies combine rule-based verifiers (e.g., Math-Verify with 96% precision and 63% recall) with high-recall LLM judges (e.g., Qwen3-4B). Only samples where both methods agree on the label are retained, guaranteeing minimal noise for downstream reward model training (Hong et al., 7 Aug 2025).
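
The agreement-filtering step can be sketched as follows; `rule_verify` and `llm_judge` are hypothetical callables standing in for a Math-Verify-style checker and an LLM judge, respectively.

```python
from typing import Callable, Iterable, List, Tuple

def hybrid_annotate(
    samples: Iterable[Tuple[str, str, str]],           # (question, reference, completion)
    rule_verify: Callable[[str, str, str], bool],      # high-precision rule-based check
    llm_judge: Callable[[str, str, str], bool],        # high-recall LLM judge
) -> List[Tuple[str, str, str, bool]]:
    """Keep only samples where both annotators agree; the shared verdict becomes the label."""
    labeled = []
    for question, reference, completion in samples:
        rule_label = rule_verify(question, reference, completion)
        llm_label = llm_judge(question, reference, completion)
        if rule_label == llm_label:        # discard disagreements to minimize label noise
            labeled.append((question, reference, completion, rule_label))
    return labeled
```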

Reward models such as VerifyRM are constructed as discriminative classifiers accepting $(q, r, c)$ triples, where $r$ is the reference answer and $c$ is the candidate completion, trained under a binary cross-entropy loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{(q, r, c, y) \sim D}\, \text{BCE}\big(\sigma(M_\theta(q, r, c)), y\big)$$

where $\sigma$ is the sigmoid activation (Hong et al., 7 Aug 2025).
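
A minimal PyTorch sketch of such a discriminative reward model is shown below; the encoder interface and scoring head are illustrative assumptions rather than the published VerifyRM architecture.

```python
import torch
import torch.nn as nn

class DiscriminativeRewardModel(nn.Module):
    """Scores (question, reference, completion) triples with a single logit; illustrative only."""

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                   # any text encoder returning [batch, hidden_size]
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, encoded_triples: torch.Tensor) -> torch.Tensor:
        pooled = self.encoder(encoded_triples)       # [batch, hidden_size]
        return self.score_head(pooled).squeeze(-1)   # raw logit M_theta(q, r, c)

def reward_model_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy on sigmoid(M_theta(q, r, c)) against correctness labels y.
    return nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
```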

To address reward hacking in RL, the Cooper framework employs dynamic co-optimization: the reward model is updated via contrastive learning on positive (rule-verified correct) and negative (LLM-misled incorrect) examples. This prevents policy collapse and achieves superior average RL accuracy compared to both static reward models and rule-based methods alone (Hong et al., 7 Aug 2025).
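
One way to realize such a contrastive update is a pairwise margin objective over rule-verified positives and judge-fooling negatives, sketched below; this is an illustrative approximation, not Cooper's exact training objective.

```python
import torch
import torch.nn.functional as F

def contrastive_reward_update(
    reward_model: torch.nn.Module,   # e.g., the discriminative reward model sketched above
    pos_batch: torch.Tensor,         # encoded (q, r, c+) triples: rule-verified correct completions
    neg_batch: torch.Tensor,         # encoded (q, r, c-) triples: incorrect completions that fooled the judge
    optimizer: torch.optim.Optimizer,
    margin: float = 1.0,
) -> float:
    """One co-optimization step: push positive scores above paired negative scores by `margin`."""
    pos_scores = reward_model(pos_batch)   # batches are assumed paired per question
    neg_scores = reward_model(neg_batch)
    loss = F.relu(margin - (pos_scores - neg_scores)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```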

5. Applications in Reinforcement Learning and Verifiable Reward

VerifyBench is integral to the development and evaluation of reinforcement learning with verifiable reward (RLVR). It enables systematic selection and improvement of reward models for RL finetuning of reasoning LLMs: models with higher measured accuracy on VerifyBench yield better downstream results—for example, in mathematical problem solving on GSM8K and related benchmarks (Yan et al., 21 May 2025).
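
As an illustration of how a reference-based verifier supplies the reward signal in RLVR, the sketch below scores sampled rollouts with a hypothetical `verifier` callable; the names are placeholders rather than an official API.

```python
from typing import Callable, List

def rlvr_rewards(
    question: str,
    reference_answer: str,
    rollouts: List[str],                           # completions sampled from the policy LLM
    verifier: Callable[[str, str, str], bool],     # reference-based verifier (rule-based or learned)
) -> List[float]:
    """Binary verifiable reward per rollout: 1.0 if judged correct against the reference, else 0.0."""
    return [
        1.0 if verifier(question, reference_answer, rollout) else 0.0
        for rollout in rollouts
    ]

# These rewards would then drive a policy-gradient update (e.g., PPO or GRPO) on the reasoning LLM.
```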

The reward model’s inclusion of the reference answer is shown to be especially critical for precise verification in RL training, with empirical results demonstrating that models trained with dynamically co-optimized reward models outperform those tied to static, potentially exploitable verifiers. Cooper’s deployment of VerifyRM as a dynamically updated reward model produced marked improvements (e.g., 89.42% verification accuracy on VerifyBench, ~3–4% absolute RL accuracy improvement) (Hong et al., 7 Aug 2025).

6. Cross-Domain Generalization and Remaining Bottlenecks

VerifyBench systematically uncovers fundamental limitations in verifier generalization:

  • Specialized verifiers are highly accurate within a domain but fail across domains due to inflexibility in handling notation and reasoning conventions.
  • General-purpose LLM judges fluctuate in reliability depending on response length and input format, revealing high sensitivity to both superficial (formatting, structure) and deep (semantic, reasoning) input variations.
  • There is a structural trade-off between precision and recall in verification paradigms, pointing to the need for future hybrid approaches (e.g., initial LLM-based filtering followed by specialized final judgment); a minimal sketch of such a pipeline appears after this list (Li et al., 14 Jul 2025).
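
A minimal sketch of such a two-stage hybrid pipeline, with hypothetical `llm_filter` and `specialized_verify` callables, is shown below.

```python
from typing import Callable

def hybrid_verify(
    question: str,
    reference_answer: str,
    response: str,
    llm_filter: Callable[[str, str, str], bool],          # high-recall general LLM judge (first pass)
    specialized_verify: Callable[[str, str, str], bool],  # high-precision specialized verifier (final pass)
) -> bool:
    """Two-stage verification: the LLM filter cheaply screens out clear mismatches,
    and the specialized verifier delivers the precise final judgment."""
    if not llm_filter(question, reference_answer, response):
        return False
    return specialized_verify(question, reference_answer, response)
```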

A further bottleneck is the dependency on intermediate answer extraction stages; poor extraction can compromise final verification, so a move toward end-to-end, extraction-free judgment pipelines is encouraged.

7. Implications and Resources

VerifyBench’s rigorous evaluation framework now underpins standard experimental protocols in LLM verification and reward modeling, influencing both policy optimization pipelines and the design of more generalizable, robust verifiers. The benchmark and its derivatives (including VerifyBench-Hard) are central to state-of-the-art RLVR systems.

Data, code, and further resources are available via project repositories cited in the primary literature (Yan et al., 21 May 2025, Li et al., 14 Jul 2025, Hong et al., 7 Aug 2025). The principled combination of fine-grained annotation, diverse answer types, challenging ambiguous cases, and domain coverage makes VerifyBench a pivotal resource for advancing the empirical science and engineering of answer verification in AI systems.
